mobilkit.temporal module

Tools and functions to analyze the data in time.

mobilkit.temporal.computeDisplacementFigures(df_disp, minimum_pings_per_night=5)

Given a dataframe returned by mobilkit.temporal.homeLocationWindow, computes a pivoted dataframe with the home area of each user in every time window, plus the arrays of displaced and active people per area and the per-user arrays with the cumulative number of areas where the user slept.

Parameters:
  • df_disp (pandas.dataframe) – A dataframe as returned by mobilkit.temporal.homeLocationWindow.

  • minimum_pings_per_night (int, optional) – The minimum number of pings recorded during a night for a user to be considered.

Returns:

df_pivoted, first_user_area, heaps_arrays, count_users_per_area

  • df_pivoted is a dataframe containing one row per user, with the columns being the sorted time windows of the analysis period. Each cell contains the location where the user (row) slept in night t (column), or NaN if the user was not active that night.

  • first_user_area is a dict giving, for each user, the tile_ID where the user slept for the first time.

  • heaps_arrays is a (n_users x n_windows) array giving the cumulative number of areas where a user slept up to window t.

  • count_users_per_area is a dictionary {tile_ID: {"active": [...], "displaced": [...]}} giving the number of active and displaced people per area in time.

Return type:

pandas.dataframe, dict, array, dict
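The relation between the four returned objects can be sketched in plain Python. This is only an illustration on a hypothetical two-user table, not the library's implementation: the names pivoted, heaps_row and first_area are invented here, and it assumes that a user counts as displaced whenever the sleeping area differs from the first recorded one, with the counts attributed to that first area.

```python
# Hypothetical pivoted table: one row per user, one entry per time window.
# None marks a window in which the user was not active (NaN in the dataframe).
pivoted = {
    "u1": ["A", "A", "B", None, "A"],
    "u2": ["C", None, "C", "D", "E"],
}
n_windows = 5


def heaps_row(nights):
    """Cumulative number of distinct areas where the user slept up to each window."""
    seen, out = set(), []
    for area in nights:
        if area is not None:
            seen.add(area)
        out.append(len(seen))
    return out


def first_area(nights):
    """First non-missing area of the row, or None if the user was never active."""
    return next((a for a in nights if a is not None), None)


heaps_arrays = {uid: heaps_row(nights) for uid, nights in pivoted.items()}
first_user_area = {uid: first_area(nights) for uid, nights in pivoted.items()}

# Assumption: active/displaced users are counted against their first area.
count_users_per_area = {}
for uid, nights in pivoted.items():
    home = first_user_area[uid]
    entry = count_users_per_area.setdefault(
        home, {"active": [0] * n_windows, "displaced": [0] * n_windows}
    )
    for t, area in enumerate(nights):
        if area is None:
            continue  # user not active in this window
        entry["active"][t] += 1
        if area != home:
            entry["displaced"][t] += 1
```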

mobilkit.temporal.computeResiduals(df_activity, signal_column, profile)

Function that computes the average, z-score and residual activity of an area in a given time period and for a given time bin.

Parameters:
  • df_activity (dask.DataFrame) – A dataframe as returned by mobilkit.temporal.computeTemporalProfile, with the volume columns, the periods and the normalization (if needed) already computed.

  • profile (str) – The temporal profile used for normalization in mobilkit.temporal.computeTemporalProfile.

  • signal_column (str) – The column to use as a proxy for volume. Usually one of "users", "pings", "frac_users", "frac_pings".

Returns:

Two dictionaries containing the aggregated results in numpy arrays. results has four keys:

  • raw the raw signal in the area_index,period_index,period_hour_index indexing;

  • mean the mean over the periods of the raw signal in the area_index,period_hour_index shape;

  • zscore the zscore of the area signal (with respect to its average and std) in the area_index,period_hour_index shape;

  • residual the residual activity computed as the difference between the area’s zscore and the global average zscore at a given hour in the area_index,period_hour_index shape.

On the other hand, mappings contains the back and forth mapping between the numpy indexes and the original values of the areas (idx2area and area2idx), of the periods, and of the hours of the period. These are useful later for plotting.

Return type:

results, mappings
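A minimal sketch of how the four arrays relate, using nested lists instead of numpy arrays. The input values are made up, and the z-score step reflects one reading of the description above (each area's mean profile standardized with that area's own average and population standard deviation), which may differ from what the library actually does.

```python
from statistics import fmean, pstdev

# raw[area][period][hour]: hypothetical signal for 2 areas, 2 periods, 3 hours
raw = [
    [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    [[10.0, 10.0, 40.0], [10.0, 10.0, 40.0]],
]
n_hours = len(raw[0][0])

# mean[area][hour]: average of the raw signal over the periods
mean = [
    [fmean(period[h] for period in area) for h in range(n_hours)]
    for area in raw
]


def standardize(profile):
    """Z-score of a profile with respect to its own average and std (assumed)."""
    mu, sigma = fmean(profile), pstdev(profile)
    return [(x - mu) / sigma for x in profile]


# zscore[area][hour]
zscore = [standardize(profile) for profile in mean]

# residual[area][hour]: area z-score minus the across-area average z-score at that hour
global_z = [fmean(z[h] for z in zscore) for h in range(n_hours)]
residual = [[z[h] - global_z[h] for h in range(n_hours)] for z in zscore]
```

By construction the residuals of all areas sum to zero at every hour, so a positive residual marks an area that is more active than the regional pattern at that hour.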

mobilkit.temporal.computeTemporalProfile(df_tot, timeBin, byArea=False, profile='week', weekdays=None, normalization=None, start_date=None, stop_date=None, date_format=None, sliceName=None, selected_areas=None, areasName=None, split_out=10)

Function to compute the normalized profiles of areas. The idea is to have a dataframe with the count of users and pings per time bin (and per area if byArea=True) together with a normalization column (computed, if normalization is not None, over a different time window profile) telling the total number of pings and users seen in that period (and in that area if byArea). If normalization is specified, the fraction of users and pings recorded in an area at that time bin is also given.

Parameters:
  • df_tot (dask.DataFrame) – A dataframe as returned from mobilkit.loader.load_raw_files or imported from scikit-mobility using mobilkit.loader.load_from_skmob. If using byArea the df must contain the tile_ID column as returned by mobilkit.spatial.tessellate.

  • timeBin (str) – The width of the time bin to use to aggregate activity. Currently supported: [“W”, “MS”, “M”, “H”, “D”, “T”] You can implement others found in [pandas time series aliases](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases). For instance:

    • ‘B’ business day frequency

    • ‘D’ calendar day frequency

    • ‘W’ weekly frequency

    • ‘M’ month end frequency

    • ‘MS’ month start frequency

    • ‘SMS’ semi-month start frequency (1st and 15th)

    • ‘BH’ business hour frequency

    • ‘H’ hourly frequency

    • ‘T’, ‘min’ minutely frequency

  • byArea (bool, optional) – Whether or not to compute the activity per area (default False). If False will compute the overall activity.

  • profile (str) – The base of the activity profile: must be "week" to compute the weekly profile, "day" for the daily one, or "month" for a one-month period ("month_end" to use the month end). Each profile of area / week or day (depending on profile) will be computed separately. NOTE that this period should be equal to or longer than timeBin (e.g., "week" or "month" if timeBin="W"), otherwise the normalization will fail.

  • weekdays (set or list, optional) – The weekdays to consider (0 Monday -> 6 Sunday). The default None keeps all of them.

  • normalization (str, optional) – One of None, "area", "total". Normalize nothing (None), on the total period of the area (area) or on the total period of all the selected areas (total).

  • start_date (str, optional) – The date from which to consider data, in the date_format format.

  • stop_date (str, optional) – The date at which to stop considering data. Must have the same format as start_date.

  • date_format (str, optional) – The python date format of the dates, if given.

  • sliceName (str, optional) – The name that will be saved in timeSlice column, if given.

  • selected_areas (set or list, optional) – The set or list of selected areas. If None (default) uses all the areas. Use mobilkit.spatial.selecteAreasFromBounds to select areas from given bounds.

  • areasName (str, optional) – The name that will be saved in areaName column, if given.

  • split_out (int, optional) – The number of partitions to split the results in (for large number of areas and time bins). The default value of 10 should work in most of the cases.

Returns:

df_normalized – A dataframe with these columns:

  • one with the same name as timeBin with the date truncated at the selected width.

  • pings the number of pings recorded in that time bin and area (if byArea=True).

  • users the number of users seen in that time bin and area (if byArea=True).

  • pings_per_user the average number of pings per user in that time bin and area (if byArea=True).

  • tile_ID (if byArea=True) the area where the signal has been recorded.

  • the additional columns timeSlice and areaName if the two names are given,

plus, if normalization is not None:

  • tot_pings/users the total number of pings and users seen in the area (region) in the profile period if normalization is "area" ("total").

  • frac_pings/users the fraction of pings and users seen in that area, at that time bin, with respect to the total volume of the area (region), depending on the normalization.

  • profile_hour the zero-based hour of the typical month, week or day (depending on the value of profile).

Return type:

dask.DataFrame
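To make the profile_hour and frac_pings columns concrete, here is a small stdlib-only sketch. It is an assumption-laden illustration rather than the library's code: the helper profile_hour is hypothetical, it assumes weeks start on Monday and months on day 1, and it normalizes hourly ping counts by the total of the ISO week they fall in (a weekly profile without the per-area split).

```python
from collections import Counter
from datetime import datetime


def profile_hour(dt, profile="week"):
    """Zero-based hour of the typical day, week or month (assumed convention)."""
    if profile == "day":
        return dt.hour
    if profile == "week":
        return dt.weekday() * 24 + dt.hour  # Monday 00:00 -> 0
    if profile == "month":
        return (dt.day - 1) * 24 + dt.hour  # day 1, 00:00 -> 0
    raise ValueError(f"unsupported profile: {profile!r}")


def week_key(dt):
    """(ISO year, ISO week) identifying the weekly profile period."""
    return tuple(dt.isocalendar())[:2]


# Hypothetical ping timestamps
pings = [
    datetime(2021, 3, 1, 8, 5),
    datetime(2021, 3, 1, 8, 40),
    datetime(2021, 3, 2, 9, 0),
    datetime(2021, 3, 8, 8, 0),
]

# Pings per hourly bin (timestamp truncated to the hour)
pings_per_bin = Counter(dt.replace(minute=0, second=0, microsecond=0) for dt in pings)

# Total pings per profile period (here: per week), used as the normalization
tot_pings = Counter(week_key(dt) for dt in pings)

# Fraction of the period's pings that fall in each bin
frac_pings = {b: n / tot_pings[week_key(b)] for b, n in pings_per_bin.items()}
```

This also shows why the profile period must not be shorter than the time bin: the fractions only make sense when each bin belongs to exactly one period.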

mobilkit.temporal.computeTimeBinActivity(df, byArea=False, timeBin='hour', split_out=10)

Basic function to compute, for each time bin and area, the activity profile in terms of users and pings recorded. It also computes the set of users seen in that bin for later aggregations.

Parameters:
  • df (dask.DataFrame) – A dataframe as returned from mobilkit.loader.load_raw_files or imported from scikit-mobility using mobilkit.loader.load_from_skmob. If using byArea the df must contain the tile_ID column as returned by mobilkit.spatial.tessellate.

  • byArea (bool, optional) – Whether or not to compute the activity per area (default False). If False will compute the overall activity.

  • timeBin (str, optional) – The width of the time bin to use to aggregate activity. Must be one of the ones found in [pandas time series aliases](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases). For instance:

    • ‘B’ business day frequency

    • ‘D’ calendar day frequency

    • ‘W’ weekly frequency

    • ‘M’ month end frequency

    • ‘MS’ month start frequency

    • ‘SMS’ semi-month start frequency (1st and 15th)

    • ‘BH’ business hour frequency

    • ‘H’ hourly frequency

    • ‘T’, ‘min’ minutely frequency

  • split_out (int, optional) – The number of dask dataframe partitions after the groupby aggregation.

Returns:

df_activity – A dataframe with these columns:

  • one with the same name as timeBin with the date truncated at the selected width.

  • pings the number of pings recorded in that time bin and area (if byArea=True).

  • users the number of users seen in that time bin and area (if byArea=True).

  • users_set the set of users seen in that time bin and area (if byArea=True). Useful to normalize later analysis.

  • pings_per_user the average number of pings per user in that time bin and area (if byArea=True).

  • tile_ID (if byArea=True) the area where the signal has been recorded.

Return type:

dask.dataframe
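The aggregation described above can be sketched with the standard library on a handful of made-up events. The tuple layout (uid, tile_ID, datetime) and the hourly bin are choices made for this illustration only.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events: (uid, tile_ID, datetime)
events = [
    ("u1", "A", datetime(2021, 3, 1, 8, 5)),
    ("u1", "A", datetime(2021, 3, 1, 8, 50)),
    ("u2", "A", datetime(2021, 3, 1, 8, 30)),
    ("u2", "B", datetime(2021, 3, 1, 9, 10)),
]

# One cell per (time bin, area), as with byArea=True
activity = defaultdict(lambda: {"pings": 0, "users_set": set()})
for uid, tile, dt in events:
    hour_bin = dt.replace(minute=0, second=0, microsecond=0)  # truncate to the hour
    cell = activity[(hour_bin, tile)]
    cell["pings"] += 1
    cell["users_set"].add(uid)

# Derived columns
for cell in activity.values():
    cell["users"] = len(cell["users_set"])
    cell["pings_per_user"] = cell["pings"] / cell["users"]
```

Keeping users_set matters when bins are merged later: ping counts can be summed, but user counts cannot, because the same user may appear in several bins. The number of users of a merged bin is the size of the union of the sets.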

mobilkit.temporal.computeVolumeProfile(df: DataFrame, what: str = 'pings', normalized: bool = True, freq='1d') DataFrame

Computes the volume of pings or users in a given interval given by freq.

Parameters:
  • df (dask.dataframe.DataFrame) – The dataframe containing the pings with at least the mobilkit.dask_schemas.dttColName and the mobilkit.dask_schemas.uidColName columns.

  • what (str) – The volume to count: one of "pings", "users" or "both".

  • normalized (bool) – If True will normalize the curve in the 0-1 range, otherwise returns the raw count.

  • freq (str) – A valid datetime interval up to which the dates will be floored.

Returns:

volume – A dataframe whose index is the time bin and whose value is the observed volume.

Return type:

pd.DataFrame

mobilkit.temporal.filter_daynight_time(df, filter_from_h=21.5, filter_to_h=8.5, previous_day_until_h=4.0, daytime_from_h=9.0, daytime_to_h=21.0)

Prepares a raw event df for the ping-based displacement analysis.

Parameters:
  • df (dask.DataFrame) – A dataframe containing at least the uid,datetime,lat,lng columns as returned by mobilkit.loader.load_raw_files or similar functions.

  • filter_{from,to}_h (float) – The starting and ending float hours to consider. If from_hour < to_hour, only the pings whose float hour h satisfies from_hour <= h < to_hour are kept; otherwise, all the pings with h >= from_hour or h < to_hour are kept. Note that the float hour h for a datetime dt is h = dt.hour + dt.minute/60., so to express 9:45am put 9.75.

  • previous_day_until_h (float) – All the valid events with float hour h < previous_day_until_h will be projected to the previous day. Put 0 or a negative number to keep all events of one day to its date.

  • daytime_{from,to}_h (float) – The starting and ending float hours to consider as daytime (the others will be put in nighttime). All events with from_hour <= float_hour <= to_hour will have a 1 entry in the daytime column, the others 0. from_hour must be smaller than to_hour. Note that the float hour h for a datetime dt is h = dt.hour + dt.minute/60., so to express 9:45am put 9.75.

Returns:

df – The same initial dataframe filtered according to filter_from_h and filter_to_h, with three additional columns:

  • float_hour: the hour of the day expressed as h = dt.hour + dt.minute/60.

  • date: the datetime column floored to the day. All events with float_hour < previous_day_until_h are further moved back by one day.

  • daytime: 1 if the event’s float_hour is between daytime_from_h and daytime_to_h, 0 otherwise.

Return type:

dask.DataFrame
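The rules above translate into a few lines of plain Python for a single timestamp. This is a sketch that follows the parameter descriptions, not the dask implementation; the helper names float_hour, in_window and annotate are hypothetical.

```python
from datetime import datetime, timedelta


def float_hour(dt):
    """Hour of the day as a float, e.g. 9:45 -> 9.75."""
    return dt.hour + dt.minute / 60.0


def in_window(h, from_h, to_h):
    """True if h falls in [from_h, to_h), wrapping around midnight if needed."""
    if from_h < to_h:
        return from_h <= h < to_h
    return h >= from_h or h < to_h


def annotate(dt, filter_from_h=21.5, filter_to_h=8.5, previous_day_until_h=4.0,
             daytime_from_h=9.0, daytime_to_h=21.0):
    """Return the three extra columns for one event, or None if it is filtered out."""
    h = float_hour(dt)
    if not in_window(h, filter_from_h, filter_to_h):
        return None
    day = dt.date()
    if h < previous_day_until_h:
        day -= timedelta(days=1)  # early-morning events belong to the previous night
    daytime = int(daytime_from_h <= h <= daytime_to_h)
    return {"float_hour": h, "date": day, "daytime": daytime}


late_night = annotate(datetime(2021, 3, 2, 2, 30))   # kept, assigned to 2021-03-01
midday = annotate(datetime(2021, 3, 2, 12, 0))       # outside 21:30-08:30, dropped
evening = annotate(datetime(2021, 3, 1, 21, 45))     # kept, stays on 2021-03-01
```

With the default arguments the filter window (21:30 to 08:30) and the daytime window (09:00 to 21:00) do not overlap, so every surviving event has daytime equal to 0.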

mobilkit.temporal.homeLocationWindow(df_hw, initial_days_home=None, home_days_window=3, start_date=None, stop_date=None)

Given a dataframe returned by mobilkit.stats.userHomeWork, computes, for each user, the home area for every window of home_days_window days after the initial date. Note that the points before 12pm (noon) will be assigned to the previous day’s night and the ones after 12pm to the same day’s night.

Parameters:
  • df_hw (dask.dataframe) – A dataframe as returned by mobilkit.stats.userHomeWork with at least the uid, tile_ID, datetime, isHome and isWork columns.

  • initial_days_home (int, optional) – The number of initial days to be used to compute the original home area. If None (default) it will just compute the home for every window since the beginning.

  • home_days_window (int, optional) – The number of days to use to assess the home location of a user (default 3). For each day d in the start_date to stop_date - home_days_window it computes the home location between the [d,d+home_days_window) period.

  • start_date (datetime.datetime, optional) – A python datetime object with no timezone telling the date (included) to start from. The default behavior is to keep all the events.

  • stop_date (datetime.datetime, optional) – A python datetime object with no timezone telling the date (excluded) to stop at. Default is to keep all the events.

Returns:

df_hwindow – The dataframe containing, for each user and each of the user’s active days, the tile_ID of the user’s home and the number of pings recorded there in the time window. The date is saved in window_date and refers to the start of the time window (whose index is saved in timeSlice). For the initial home window the date corresponds to its end.

Return type:

pandas.dataframe
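The sliding-window logic can be sketched as follows for a single user. This is a simplified illustration with invented data: it assumes that the home of a window is the tile with the most home pings in [d, d + home_days_window), that the last window starts at stop_date - home_days_window, and that the pings have already been assigned to their night. The function name home_per_window is hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical home pings of one user: (night date, tile_ID)
home_pings = [
    (date(2021, 3, 1), "A"), (date(2021, 3, 1), "A"), (date(2021, 3, 2), "A"),
    (date(2021, 3, 3), "B"), (date(2021, 3, 4), "B"), (date(2021, 3, 5), "B"),
]


def home_per_window(pings, start, stop, window_days=3):
    """Map each window start d to (home tile, pings there) over [d, d + window_days)."""
    homes = {}
    last_start = stop - timedelta(days=window_days)
    d = start
    while d <= last_start:
        window_end = d + timedelta(days=window_days)
        counts = Counter(tile for night, tile in pings if d <= night < window_end)
        if counts:  # skip windows in which the user was not active
            homes[d] = counts.most_common(1)[0]
        d += timedelta(days=1)
    return homes


# stop is excluded, as for the stop_date parameter
homes = home_per_window(home_pings, date(2021, 3, 1), date(2021, 3, 6))
```

In this example the user's home moves from tile A to tile B as the window slides forward, which is the signal the displacement analysis looks for.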

Note

When determining the home location of a user, please consider that some data providers, like Cuebiq, obfuscate or otherwise alter the coordinates of the points falling near the user’s home location in order to preserve privacy.

This means that you cannot locate the precise home of a user with a spatial resolution higher than the one used to obfuscate these data. If you are interested in the census area (or geohash) of the user’s home alone and you are using a spatial tessellation with a spatial resolution wider than or equal to the one used to obfuscate the data, then this is of no concern.

However, tasks such as stop-detection or POI visit rate computation may be affected by the noise added to data in the user’s home location area. Please check if your data has such noise added and choose the spatial tessellation according to your use case.

mobilkit.temporal.plotDisplacement(count_users_per_area, pivoted, gdf, area_key='tile_ID', epicenter=[18.584, 98.399], bins=5)
Parameters:
  • count_users_per_area (dict) – The dict of active and displaced users per area, as returned (together with the pivot table, the original home location, and the Heaps law of visited areas) by mobilkit.temporal.computeDisplacementFigures.

  • pivoted (pandas.DataFrame) – The pivoted dataframe of the locations visited during the night, as returned (together with the original home location, the Heaps law of visited areas and the count of users per area and date) by mobilkit.temporal.computeDisplacementFigures.

  • gdf (geopandas.GeoDataFrame) – The geodataframe used to tessellate data. Must contain the area_key column.

  • area_key (str) – The column containing the ID of the tessellation areas used to join the displacement data and the GeoDataFrame.

  • epicenter (tuple) – The (lat,lon) coordinates of the center. Areas are split into distance bins (as many as given by bins) based on their distance from this point.

  • bins (int) – The number of linear distance bins to compute from the epicenter.
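How areas could be grouped by distance from the epicenter is sketched below. The source does not say which distance the library uses or how bin edges are placed, so this example assumes the great-circle (haversine) distance and equal-width bins between 0 and the largest distance; both helper functions are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


def linear_bins(distances, n_bins):
    """Index (0 .. n_bins - 1) of the equal-width distance bin of each value."""
    top = max(distances)
    if top == 0:
        return [0] * len(distances)
    width = top / n_bins
    # The farthest point would land on the upper edge, so clamp it to the last bin.
    return [min(int(d // width), n_bins - 1) for d in distances]


# Distances (km) of four hypothetical area centroids from the epicenter
bin_index = linear_bins([0.0, 10.0, 20.0, 50.0], n_bins=5)
```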

mobilkit.temporal.plotMonthlyActivity(df_activity, timeBin, what='users', ax=None, log_y=False, **kwargs)

Basic function to plot the monthly activity of areas or total region.

Parameters:
  • df_activity (dask.DataFrame) – A dataframe as returned from mobilkit.temporal.computeTimeBinActivity.

  • timeBin (str) – The width of the time bin used in mobilkit.temporal.computeTimeBinActivity.

  • what (str, optional) – The quantity to plot. Must be one amongst 'users', 'pings', 'pings_per_user'.

  • ax (axis, optional) – The axis to use. If None will create a new figure.

  • log_y (bool, optional) – Whether or not to plot with y log scale. Default False.

  • **kwargs – Will be passed to seaborn.lineplot function.

Returns:

  • df (pandas.DataFrame) – The aggregated data plotted.

  • ax (axis) – The axis of the figure.