mobilkit.temporal module
Tools and functions to analyze the data in time.
- mobilkit.temporal.computeDisplacementFigures(df_disp, minimum_pings_per_night=5)
Given a dataframe returned by mobilkit.temporal.homeLocationWindow, computes a pivoted dataframe with, for each user, the home area for every time window, plus the arrays of displaced and active people per area and the arrays with the (per user) cumulative number of areas where the user slept.
- Parameters:
df_disp (pandas.dataframe) – A dataframe as returned by mobilkit.temporal.homeLocationWindow.
minimum_pings_per_night (int, optional) – The minimum number of pings recorded during a night for a user to be considered.
- Returns:
df_pivoted, first_user_area, heaps_arrays, count_users_per_area –
df_pivoted is a dataframe with one row per user and one column per time window of the analysis period, sorted in time. Each cell contains the location where the user (row) slept in night t (column), or NaN if the user was not active that night.
first_user_area is a dict giving, for each user, the tile_ID where the user slept for the first time.
heaps_arrays is a (n_users x n_windows) array giving the cumulative number of areas where a user slept up to window t.
count_users_per_area is a dictionary {tile_ID: {"active": [...], "displaced": [...]}} giving the number of active and displaced people per area over time.
- Return type:
pandas.dataframe, dict, array, dict
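The four return values described above can be illustrated with a small pandas sketch. This is not the library's implementation: the input column names (uid, window_date, tile_ID, pings), the helper name displacement_figures_sketch, and the definition of "displaced" as "active in a tile different from the first one" are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def displacement_figures_sketch(df_disp, minimum_pings_per_night=5):
    # Keep only user-windows with enough pings (the "pings" column name is assumed).
    valid = df_disp[df_disp["pings"] >= minimum_pings_per_night]
    # One row per user, one column per window; each cell holds the sleeping tile.
    pivoted = valid.pivot(index="uid", columns="window_date", values="tile_ID")
    pivoted = pivoted.sort_index(axis=1)
    # First tile where each user was seen sleeping.
    first_user_area = {uid: row.dropna().iloc[0] for uid, row in pivoted.iterrows()}
    # Cumulative number of distinct tiles per user up to each window.
    heaps = np.zeros(pivoted.shape, dtype=int)
    for i, (_, row) in enumerate(pivoted.iterrows()):
        seen = set()
        for j, tile in enumerate(row):
            if pd.notna(tile):
                seen.add(tile)
            heaps[i, j] = len(seen)
    # Active and displaced users per original area and window.
    home = pd.Series(first_user_area)
    counts = {}
    for tile in home.unique():
        sub = pivoted.loc[home.index[home == tile]]
        active = sub.notna()
        counts[tile] = {
            "active": active.sum(axis=0).to_numpy(),
            "displaced": (active & sub.ne(tile)).sum(axis=0).to_numpy(),
        }
    return pivoted, first_user_area, heaps, counts


# Toy input: u2 has too few pings in the second window and is dropped there.
df_disp = pd.DataFrame({
    "uid": ["u1", "u1", "u1", "u2", "u2", "u2"],
    "window_date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "tile_ID": ["A", "B", "A", "A", "A", "A"],
    "pings": [10, 10, 10, 10, 2, 10],
})
pivoted, first_user_area, heaps, counts = displacement_figures_sketch(df_disp)
```

In this toy input both users start in tile A, so the only key of counts is "A"; u1 counts as displaced in the second window because it sleeps in tile B.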
- mobilkit.temporal.computeResiduals(df_activity, signal_column, profile)
Function that computes the average, z-score and residual activity of an area in a given time period and for a given time bin.
- Parameters:
df_activity (dask.DataFrame) – As returned by mobilkit.temporal.computeTemporalProfile, a dataframe with the columns, period volumes and normalization (if needed) already computed.
profile (str) – The temporal profile used for normalization in mobilkit.temporal.computeTemporalProfile.
signal_column (str) – The column to use as proxy for volume. Usually one of "users", "pings", "frac_users", "frac_pings".
- Returns:
Two dictionaries containing the aggregated results in numpy arrays.
results has four keys:
raw: the raw signal in the area_index,period_index,period_hour_index indexing;
mean: the mean over the periods of the raw signal in the area_index,period_hour_index shape;
zscore: the z-score of the area signal (with respect to its average and std) in the area_index,period_hour_index shape;
residual: the residual activity computed as the difference between the area’s zscore and the global average zscore at a given hour in the area_index,period_hour_index shape.
On the other hand, mappings contains the back and forth mapping between the numpy indexes and the original values of the areas (idx2area and area2idx), periods, and hour of the period. These will be useful later for plotting.
- Return type:
results, mappings
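As a rough illustration of the four keys of results, the following numpy sketch starts from an already built raw array of shape (n_areas, n_periods, n_hours). It is one plausible reading of the description above, not the library's code: in particular, it assumes that the z-score is taken on the mean profile of each area with respect to that area's mean and standard deviation over the hours, and the helper name residuals_sketch is hypothetical.

```python
import numpy as np


def residuals_sketch(raw):
    # raw has shape (n_areas, n_periods, n_hours).
    mean = raw.mean(axis=1)  # average over periods -> (n_areas, n_hours)
    # Assumption: z-score of each area's mean profile over its own hours.
    mu = mean.mean(axis=1, keepdims=True)
    sigma = mean.std(axis=1, keepdims=True)
    zscore = (mean - mu) / sigma
    # Residual: area z-score minus the average z-score of all areas at that hour.
    residual = zscore - zscore.mean(axis=0, keepdims=True)
    return {"raw": raw, "mean": mean, "zscore": zscore, "residual": residual}


rng = np.random.default_rng(0)
raw = rng.random((3, 4, 24))  # 3 areas, 4 periods, 24 hourly bins
results = residuals_sketch(raw)
```

With this definition each area's z-score averages to zero over the hours, and the residuals average to zero over the areas at every hour.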
- mobilkit.temporal.computeTemporalProfile(df_tot, timeBin, byArea=False, profile='week', weekdays=None, normalization=None, start_date=None, stop_date=None, date_format=None, sliceName=None, selected_areas=None, areasName=None, split_out=10)
Function to compute the normalized profiles of areas. The idea is to have a dataframe with the count of users and pings per time bin (and per area if byArea=True) together with a normalization column (computed if normalization is not None over a different time window profile) telling the total number of pings and users seen in that period (and in that area if byArea). If normalization is specified, the fraction of users and pings recorded in an area at that time bin is also given.
- Parameters:
df_tot (dask.DataFrame) – A dataframe as returned from mobilkit.loader.load_raw_files or imported from scikit-mobility using mobilkit.loader.load_from_skmob. If using byArea the df must contain the tile_ID column as returned by mobilkit.spatial.tessellate.
timeBin (str) – The width of the time bin to use to aggregate activity. Currently supported: [“W”, “MS”, “M”, “H”, “D”, “T”]. You can implement others found in [pandas time series aliases](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases). For instance:
‘B’ business day frequency
‘D’ calendar day frequency
‘W’ weekly frequency
‘M’ month end frequency
‘MS’ month start frequency
‘SMS’ semi-month start frequency (1st and 15th)
‘BH’ business hour frequency
‘H’ hourly frequency
‘T’, ‘min’ minutely frequency
byArea (bool, optional) – Whether or not to compute the activity per area (default False). If False will compute the overall activity.
profile (str) – The base of the activity profile: must be "week" to compute the weekly profile, "day" for the daily one, or "month" for a one-month period (month_end to use month end). Each profile of area / week or day (depending on profile) will be computed separately. NOTE that this period should be equal to or longer than the timeBin (i.e., "week" or "month" if timeBin="W"), otherwise the normalization will fail.
weekdays (set or list, optional) – The weekdays to consider (0 Monday -> 6 Sunday). The default None keeps all of them.
normalization (str, optional) – One of None, "area", "total". Normalize nothing (None), on the total period of the area (area) or on the total period of all the selected areas (total).
start_date (str, optional) – The starting date from which to consider data, in the date_format format.
stop_date (str, optional) – The end date up to which to consider data. Must have the same format as start_date.
date_format (str, optional) – The python date format of the dates, if given.
sliceName (str, optional) – The name that will be saved in timeSlice column, if given.
selected_areas (set or list, optional) – The set or list of selected areas. If None (default) uses all the areas. Use mobilkit.spatial.selecteAreasFromBounds to select areas from given bounds.
areasName (str, optional) – The name that will be saved in areaName column, if given.
split_out (int, optional) – The number of partitions to split the results in (for large number of areas and time bins). The default value of 10 should work in most of the cases.
- Returns:
df_normalized – A dataframe with these columns:
one with the same name as timeBin with the date truncated at the selected width.
pings: the number of pings recorded in that time bin and area (if byArea=True).
users: the number of users seen in that time bin and area (if byArea=True).
pings_per_user: the average number of pings per user in that time bin and area (if byArea=True).
tile_ID: (if byArea=True) the area where the signal has been recorded.
timeSlice and areaName: additional columns, present if the two names are given.
If normalization is not None, these additional columns are also present:
tot_pings/users: the total number of pings and users seen in the area (region) in the profile period if normalization is "area" ("total").
frac_pings/users: the fraction of pings and users seen in that area, at that time bin, with respect to the total volume of the area (region) depending on the normalization.
profile_hour: the zero-based hour of the typical month, week or day (depending on the value of profile).
- Return type:
dask.DataFrame
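To make the difference between the "area" and "total" normalizations concrete, here is a minimal pandas sketch on an already aggregated hourly table, restricted to pings and to a weekly profile. It is an illustration under stated assumptions, not the library's implementation: the helper name normalize_profile_sketch, the time-bin column being called "H", and the computation of profile_hour as weekday * 24 + hour are all assumptions.

```python
import pandas as pd


def normalize_profile_sketch(df, normalization="area"):
    out = df.copy()
    # Weekly profile: each calendar week is normalized separately.
    out["period"] = out["H"].dt.to_period("W")
    # Assumed definition of the zero-based hour of the typical week.
    out["profile_hour"] = out["H"].dt.dayofweek * 24 + out["H"].dt.hour
    # "area": total of each area in the period; "total": total of all areas.
    keys = ["period", "tile_ID"] if normalization == "area" else ["period"]
    out["tot_pings"] = out.groupby(keys)["pings"].transform("sum")
    out["frac_pings"] = out["pings"] / out["tot_pings"]
    return out


activity = pd.DataFrame({
    "H": pd.to_datetime(["2024-01-01 00:00", "2024-01-02 05:00", "2024-01-01 00:00"]),
    "tile_ID": ["A", "A", "B"],
    "pings": [10, 30, 60],
})
by_area = normalize_profile_sketch(activity, normalization="area")
by_total = normalize_profile_sketch(activity, normalization="total")
```

The user totals are left out on purpose: the number of distinct users in a period cannot be obtained by summing per-bin user counts, so it would need the underlying sets of users.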
- mobilkit.temporal.computeTimeBinActivity(df, byArea=False, timeBin='hour', split_out=10)
Basic function to compute, for each time bin and area, the activity profile in terms of users and pings recorded. It also computes the set of users seen in that bin for later aggregations.
- Parameters:
df (dask.DataFrame) – A dataframe as returned from mobilkit.loader.load_raw_files or imported from scikit-mobility using mobilkit.loader.load_from_skmob. If using byArea the df must contain the tile_ID column as returned by mobilkit.spatial.tessellate.
byArea (bool, optional) – Whether or not to compute the activity per area (default False). If False will compute the overall activity.
timeBin (str, optional) – The width of the time bin to use to aggregate activity. Must be one of the ones found in [pandas time series aliases](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases). For instance:
‘B’ business day frequency
‘D’ calendar day frequency
‘W’ weekly frequency
‘M’ month end frequency
‘MS’ month start frequency
‘SMS’ semi-month start frequency (1st and 15th)
‘BH’ business hour frequency
‘H’ hourly frequency
‘T’, ‘min’ minutely frequency
split_out (int, optional) – The number of dask dataframe partitions after the groupby aggregation.
- Returns:
df_activity – A dataframe with these columns:
one with the same name as timeBin with the date truncated at the selected width.
pings: the number of pings recorded in that time bin and area (if byArea=True).
users: the number of users seen in that time bin and area (if byArea=True).
users_set: the set of users seen in that time bin and area (if byArea=True). Useful to normalize later analysis.
pings_per_user: the average number of pings per user in that time bin and area (if byArea=True).
tile_ID: (if byArea=True) the area where the signal has been recorded.
- Return type:
dask.dataframe
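The aggregation described above can be sketched with plain pandas on a small in-memory table. This is only an illustration of the output columns, not the dask-based implementation: the helper name time_bin_activity_sketch and the input column names (uid, datetime, tile_ID) are assumptions.

```python
import pandas as pd


def time_bin_activity_sketch(df, by_area=False, time_bin="D"):
    # Truncate timestamps to the bin width; the column is named after the bin.
    keys = [df["datetime"].dt.floor(time_bin).rename(time_bin)]
    if by_area:
        keys.append(df["tile_ID"])
    out = (
        df.groupby(keys)["uid"]
        .agg(pings="count", users="nunique", users_set=set)
        .reset_index()
    )
    out["pings_per_user"] = out["pings"] / out["users"]
    return out


pings = pd.DataFrame({
    "uid": ["a", "a", "b"],
    "datetime": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 09:30", "2024-01-01 21:00"]),
    "tile_ID": ["A", "A", "B"],
})
overall = time_bin_activity_sketch(pings)
per_area = time_bin_activity_sketch(pings, by_area=True)
```

Keeping users_set alongside the counts makes it possible to merge bins later and still count distinct users correctly.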
- mobilkit.temporal.computeVolumeProfile(df: DataFrame, what: str = 'pings', normalized: bool = True, freq='1d') DataFrame
Computes the volume of pings or users in a given interval given by freq.
- Parameters:
df (dask.dataframe.DataFrame) – The dataframe containing the pings with at least the mobilkit.dask_schemas.dttColName and the mobilkit.dask_schemas.uidColName columns.
what (str) – The volume to count: pings, users or both.
normalized (bool) – If True, normalizes the curve to the 0-1 range; otherwise returns the raw count.
freq (str) – A valid datetime interval to which the dates will be floored.
- Returns:
volume – A dataframe whose index is the time bin and whose value is the observed volume.
- Return type:
pd.DataFrame
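A minimal pandas sketch of the same idea follows. It is not the library's code: the column names datetime and uid stand in for the schema names mentioned above, the helper name volume_profile_sketch is hypothetical, and "normalized" is assumed here to mean division by the maximum of the curve.

```python
import pandas as pd


def volume_profile_sketch(df, what="pings", normalized=True, freq="1D",
                          dt_col="datetime", uid_col="uid"):
    # Floor each timestamp to the requested interval.
    bins = df[dt_col].dt.floor(freq)
    if what == "pings":
        volume = df.groupby(bins)[uid_col].count()
    elif what == "users":
        volume = df.groupby(bins)[uid_col].nunique()
    else:
        raise ValueError("this sketch only handles 'pings' or 'users'")
    volume = volume.rename(what).to_frame()
    if normalized:
        # Assumption: scale by the maximum so that the peak equals 1.
        volume = volume / volume.max()
    return volume


pings_df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 09:00", "2024-01-01 23:00", "2024-01-02 10:00",
    ]),
    "uid": ["a", "a", "b", "a"],
})
pings_norm = volume_profile_sketch(pings_df, what="pings")
users_raw = volume_profile_sketch(pings_df, what="users", normalized=False)
```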
- mobilkit.temporal.filter_daynight_time(df, filter_from_h=21.5, filter_to_h=8.5, previous_day_until_h=4.0, daytime_from_h=9.0, daytime_to_h=21.0)
Prepares a raw event df for the ping-based displacement analysis.
- Parameters:
df (dask.DataFrame) – A dataframe containing at least the uid,datetime,lat,lng columns as returned by mobilkit.loader.load_raw_files or similar functions.
filter_{from,to}_h (float) – The starting and ending float hours to consider. If from_hour < to_hour, only pings whose float hour h satisfies from_hour <= h < to_hour are considered; otherwise all the pings with h >= from_hour or h < to_hour are kept. Note that the float hour h for datetime dt is h = dt.hour + dt.minute/60, so to express 9:45am put 9.75.
previous_day_until_h (float) – All the valid events with float hour h < previous_day_until_h will be projected to the previous day. Put 0 or a negative number to keep all events of one day on its date.
daytime_{from,to}_h (float) – The starting and ending float hours to consider as daytime (the others will be put in nighttime). All events with from_hour <= float_hour <= to_hour will have a 1 entry in the daytime column, others 0. from_hour must be smaller than to_hour. Note that the float hour h for datetime dt is h = dt.hour + dt.minute/60, so to express 9:45am put 9.75.
- Returns:
df – The same initial dataframe filtered according to from_hour,to_hour and with three additional columns:
float_hour: the day-hour expressed as h = dt.hour + dt.minute/60
date: the datetime column floored to the day. All events with float_hour < previous_day_until_h will be further moved back by one day.
daytime: 1 if the event’s float_hour is between daytime_from_h and daytime_to_h, 0 otherwise.
- Return type:
dask.DataFrame
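The filtering rules above, including the window that wraps around midnight, can be sketched in pandas as follows. This is an illustrative reimplementation on a pandas dataframe, not the library's dask code; the helper name filter_daynight_sketch and the dt_col argument are assumptions.

```python
import pandas as pd


def filter_daynight_sketch(df, filter_from_h=21.5, filter_to_h=8.5,
                           previous_day_until_h=4.0,
                           daytime_from_h=9.0, daytime_to_h=21.0,
                           dt_col="datetime"):
    out = df.copy()
    # Float hour: 9:45am becomes 9.75.
    out["float_hour"] = out[dt_col].dt.hour + out[dt_col].dt.minute / 60.0
    h = out["float_hour"]
    if filter_from_h < filter_to_h:
        keep = (h >= filter_from_h) & (h < filter_to_h)
    else:
        # The window wraps around midnight (e.g. 21:30 -> 08:30).
        keep = (h >= filter_from_h) | (h < filter_to_h)
    out = out[keep].copy()
    # Floor to the day, then project early-morning events to the previous day.
    out["date"] = out[dt_col].dt.floor("D")
    early = out["float_hour"] < previous_day_until_h
    out.loc[early, "date"] = out.loc[early, "date"] - pd.Timedelta(days=1)
    # Daytime flag, with both ends included.
    out["daytime"] = out["float_hour"].between(daytime_from_h, daytime_to_h).astype(int)
    return out


events = pd.DataFrame({
    "uid": ["a", "a", "a"],
    "datetime": pd.to_datetime(["2024-01-02 02:30", "2024-01-02 12:00", "2024-01-02 22:15"]),
})
night = filter_daynight_sketch(events)
```

With the default arguments the noon event is dropped, and the 02:30 event is assigned to the night of the previous date.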
- mobilkit.temporal.homeLocationWindow(df_hw, initial_days_home=None, home_days_window=3, start_date=None, stop_date=None)
Given a dataframe returned by mobilkit.stats.userHomeWork, computes, for each user, the home area for every window of home_days_window days after the initial date. Note that the points before 12pm will be assigned to the previous day’s night and the ones after 12pm to the same day’s night.
- Parameters:
df_hw (dask.dataframe) – A dataframe as returned by mobilkit.stats.userHomeWork with at least the uid, tile_ID, datetime and isHome and isWork columns.
initial_days_home (int, optional) – The number of initial days to be used to compute the original home area. If None (default) it will just compute the home for every window since the beginning.
home_days_window (int, optional) – The number of days to use to assess the home location of a user (default 3). For each day d in the start_date to stop_date - home_days_window range, it computes the home location in the [d,d+home_days_window) period.
start_date (datetime.datetime) – A python datetime object with no timezone telling the date (included) to start from. The default behavior is to keep all the events.
stop_date (datetime.datetime, optional) – A python datetime object with no timezone telling the date (excluded) to stop at. Default is to keep all the events.
- Returns:
df_hwindow – The dataframe containing, for each user and active day of the user, the tile_ID of the user’s home and the number of pings recorded there in the time window. The date is saved in window_date and refers to the start of the time window (whose index is saved in timeSlice). For the initial home window the date corresponds to its end.
- Return type:
pandas.dataframe
Note
When determining the home location of a user, please consider that some data providers, like _Cuebiq_, obfuscate/obscure/alter the coordinates of the points falling near the user’s home location in order to preserve privacy.
This means that you cannot locate the precise home of a user with a spatial resolution higher than the one used to obfuscate these data. If you are interested in the census area (or geohash) of the user’s home alone and you are using a spatial tessellation with a spatial resolution wider than or equal to the one used to obfuscate the data, then this is of no concern.
However, tasks such as stop-detection or POI visit rate computation may be affected by the noise added to data in the user’s home location area. Please check if your data has such noise added and choose the spatial tessellation according to your use case.
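The sliding-window logic described for homeLocationWindow can be sketched as follows. This is a simplified illustration, not the library's implementation: it assumes the input already contains only home-time pings with a date column holding the night each ping belongs to, it picks the most-pinged tile in each window as home, and the helper name home_location_windows_sketch is hypothetical.

```python
import pandas as pd


def home_location_windows_sketch(df, start_date, stop_date, home_days_window=3):
    window = pd.Timedelta(days=home_days_window)
    frames = []
    # One window [d, d + home_days_window) per day d, ending by stop_date.
    starts = pd.date_range(start_date, stop_date - window, freq="D")
    for i, d in enumerate(starts):
        in_win = df[(df["date"] >= d) & (df["date"] < d + window)]
        if in_win.empty:
            continue
        pings = in_win.groupby(["uid", "tile_ID"]).size().rename("pings").reset_index()
        # Keep, for each user, the tile with the most pings in the window.
        best = pings.sort_values("pings", ascending=False).drop_duplicates("uid")
        frames.append(best.assign(window_date=d, timeSlice=i))
    if not frames:
        return pd.DataFrame(columns=["uid", "tile_ID", "pings", "window_date", "timeSlice"])
    return pd.concat(frames, ignore_index=True)


night_pings = pd.DataFrame({
    "uid": ["u1"] * 8,
    "date": pd.to_datetime(
        ["2024-01-01"] * 2 + ["2024-01-02"] + ["2024-01-03"] * 3 + ["2024-01-04"] * 2
    ),
    "tile_ID": ["A", "A", "A", "B", "B", "B", "B", "B"],
})
home = home_location_windows_sketch(
    night_pings, pd.Timestamp("2024-01-01"), pd.Timestamp("2024-01-05"), home_days_window=2
)
```

In this toy input the user's home moves from tile A to tile B as soon as the window covers more pings in B than in A.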
- mobilkit.temporal.plotDisplacement(count_users_per_area, pivoted, gdf, area_key='tile_ID', epicenter=[18.584, 98.399], bins=5)
- Parameters:
count_users_per_area (dict) – The dict with the count of users per area and date, as returned (together with the pivot table, the original home location, and the Heaps law of visited areas) by mobilkit.temporal.computeDisplacementFigures.
pivoted (pandas.DataFrame) – The pivoted dataframe of the visited location during the night, as returned (together with the original home location, the Heaps law of visited areas and the count of users per area and date) by mobilkit.temporal.computeDisplacementFigures.
gdf (geopandas.GeoDataFrame) – The geodataframe used to tessellate data. Must contain the area_key column.
area_key (str) – The column containing the ID of the tessellation areas used to join the displacement data and the GeoDataFrame.
epicenter (tuple) – The (lat,lon) coordinates of the center used to split areas into bins based on their distance from this point.
bins (int) – The number of linear distance bins to compute from the epicenter.
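The role of epicenter and bins can be illustrated with a short numpy sketch that assigns points to linear distance bins. How the library measures distance is not stated here, so the haversine great-circle distance, the use of bare (lat, lon) points instead of the areas of gdf, and the helper names are assumptions.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance on a sphere of radius 6371 km.
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


def distance_bins_sketch(lats, lons, epicenter, bins=5):
    d = haversine_km(np.asarray(lats), np.asarray(lons), epicenter[0], epicenter[1])
    # Linear bin edges from the epicenter to the farthest point.
    edges = np.linspace(0.0, d.max(), bins + 1)
    # Bin index per point; the farthest point is clipped into the last bin.
    idx = np.clip(np.digitize(d, edges) - 1, 0, bins - 1)
    return d, edges, idx


dist, edges, idx = distance_bins_sketch(
    [0.0, 0.0, 0.0], [0.0, 0.5, 2.0], epicenter=(0.0, 0.0), bins=2
)
```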
- mobilkit.temporal.plotMonthlyActivity(df_activity, timeBin, what='users', ax=None, log_y=False, **kwargs)
Basic function to plot the monthly activity of areas or total region.
- Parameters:
df_activity (dask.DataFrame) – A dataframe as returned from mobilkit.temporal.computeTimeBinActivity.
timeBin (str) – The width of the time bin used in mobilkit.temporal.computeTimeBinActivity.
what (str, optional) – The quantity to plot. Must be one of 'users', 'pings', 'pings_per_user'.
ax (axis, optional) – The axis to use. If None will create a new figure.
log_y (bool, optional) – Whether or not to plot with y log scale. Default False.
**kwargs – Will be passed to the seaborn.lineplot function.
- Returns:
df (pandas.DataFrame) – The aggregated data plotted.
ax (axis) – The axis of the figure.