mobilkit.stats module

Tools and functions to compute per-user and per-area statistics.

Note

When determining the home location of a user, please consider that some data providers, like _Cuebiq_, obfuscate/obscure/alter the coordinates of the points falling near the user’s home location in order to preserve privacy.

This means that you cannot locate the precise home of a user with a spatial resolution higher than the one used to obfuscate these data. If you are interested in the census area (or geohash) of the user’s home alone and you are using a spatial tessellation with a spatial resolution wider than or equal to the one used to obfuscate the data, then this is of no concern.

However, tasks such as stop-detection or POI visit rate computation may be affected by the noise added to data in the user’s home location area. Please check if your data has such noise added and choose the spatial tessellation according to your use case.

mobilkit.stats.areaStats(df, start_date=None, stop_date=None, hours=(0, 24), weekdays=(1, 2, 3, 4, 5, 6, 7))

Computes the number of pings and unique users seen in each area in a given period.

Parameters:

df (dask.dataframe) – A dataframe as returned by mobilkit.spatial.tessellate with at least the uid, tile_ID and datetime columns.
start_date (datetime.datetime, optional) – A Python datetime object with no timezone giving the date (included) to start from. Default is to keep all the events.
stop_date (datetime.datetime, optional) – A Python datetime object with no timezone giving the date (excluded) to stop at. Default is to keep all the events.
hours (tuple, optional) – The hours when to start (included) and stop (excluded) in float notation (e.g., 09:15 am is 9.25 whereas 10:45 pm is 22.75).
weekdays (tuple or set or list, optional) – The days to be kept, in Python notation: 0 = Monday, 1 = Tuesday, … and 6 = Sunday.

Returns:

df – The tile_ID -> count of pings/users mapping.

Return type:

dask.DataFrame
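
The float-hour notation and the included/excluded bounds can be illustrated with a minimal, dependency-free sketch. It is not the mobilkit implementation: the helper names to_float_hour and area_stats are hypothetical, and plain tuples stand in for the dask dataframe rows.

```python
from collections import defaultdict
from datetime import datetime


def to_float_hour(dt):
    """Convert a datetime to hours in float notation (09:15 -> 9.25)."""
    return dt.hour + dt.minute / 60 + dt.second / 3600


def area_stats(pings, hours=(0, 24)):
    """Count pings and unique users per tile within an hour window.

    `pings` is an iterable of (uid, tile_ID, datetime) tuples; the start
    hour is included and the stop hour is excluded.
    """
    start, stop = hours
    n_pings = defaultdict(int)
    users = defaultdict(set)
    for uid, tile, dt in pings:
        if start <= to_float_hour(dt) < stop:
            n_pings[tile] += 1
            users[tile].add(uid)
    # Map each tile to (number of pings, number of unique users).
    return {tile: (n_pings[tile], len(users[tile])) for tile in n_pings}


pings = [
    ("a", 1, datetime(2020, 1, 6, 9, 15)),   # 9.25, start bound is included
    ("a", 1, datetime(2020, 1, 6, 9, 45)),
    ("b", 1, datetime(2020, 1, 6, 22, 45)),  # 22.75, stop bound is excluded
    ("b", 2, datetime(2020, 1, 6, 10, 0)),
]
stats = area_stats(pings, hours=(9.25, 22.75))
```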

mobilkit.stats.compressLocsStats2hwTable(df)

Transforms a per-location home-work stats table into the per-user home and work stats table.

Parameters:

df (pd.DataFrame) – A dataframe containing all the stats of the locations.

Returns:

hw_stats – The home work locations of the users as if they were returned by mobilkit.stats.userHomeWorkLocation.

Return type:

pd.DataFrame

mobilkit.stats.computeBufferStat(gdf_stat, gdf_grid, column, aggregation, how='inner', lat_name='lat', lon_name='lng', local_EPSG=None, buffer=None)

Computes the statistics contained in a column of a dataframe containing the lat and lon coordinates of points with respect to a gdf_grid tessellation, possibly applying local reprojection and buffer on the points. This is equivalent to a KDE with a flat circular kernel of radius buffer.

Parameters:

gdf_stat, gdf_grid (gpd.GeoDataFrame) – The geo-dataframes containing the statistics in the column column and the tessellation system. They must be in the same reference system and will be projected to local_EPSG, if specified. The gdf_grid will be dissolved on the mobilkit.dask_schemas.zidColName after the spatial join with the (possibly buffered) gdf_stat geometries.
column (str) – The column for which we will compute the statistics.
aggregation (str or callable) – The geopandas string or callable to use on the spatially joined geo-dataframe.
how (str, optional) – The method to perform the spatial join.
lat_name, lon_name (str, optional) – The name of the columns to use as initial coords.
local_EPSG (int, optional) – The code of the local EPSG crs.
buffer (float, optional) – The radius of the buffer to apply to the points, in the map units of local_EPSG.

Returns:

buffered_stats – The geodataframe with the aggregated stat.

Return type:

gpd.GeoDataFrame
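
The idea of a flat circular kernel can be sketched without any geospatial library. The snippet below is a simplification under stated assumptions, not the function's code: it compares each point with tile centroids in already-projected (metric) coordinates, whereas the documented function joins buffered geometries with the tile polygons. The name buffered_sets is hypothetical.

```python
import math


def buffered_sets(points, centroids, buffer):
    """Collect, per tile, the ids of points within `buffer` of its centroid.

    `points` maps id -> (x, y) and `centroids` maps tile_ID -> (x, y), both
    in the same projected (metric) coordinates.
    """
    out = {}
    for tile, (cx, cy) in centroids.items():
        inside = {pid for pid, (x, y) in points.items()
                  if math.hypot(x - cx, y - cy) <= buffer}
        if inside:  # mimic an inner join: tiles with no match are dropped
            out[tile] = inside
    return out


points = {"u1": (0.0, 0.0), "u2": (400.0, 0.0), "u3": (2000.0, 0.0)}
centroids = {10: (100.0, 0.0), 11: (1600.0, 0.0), 12: (5000.0, 0.0)}
result = buffered_sets(points, centroids, buffer=500.0)
```

Here the aggregation is a set of ids, which corresponds to passing aggregation=set in the documented signature.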

mobilkit.stats.computeHomeWorkSurvival(df_stops_stats, min_durations=[0], min_day_counts=[0], min_hour_counts=[0], min_delta_counts=[0], min_delta_durations=[0], limit_hw_locs=False, loc_col='loc_ID')

Given a dataframe of locations (tiles) with home-work stats as returned by mobilkit.stats.stopsToHomeWorkStats, computes the home and work presence at different thresholds of home and work duration, count, etc.

Parameters:

df_stops_stats (pandas.DataFrame) – The locations (or tiles) stats of the users as returned by mobilkit.stats.stopsToHomeWorkStats.
min_durations (iterable, optional) – The minimum duration of home and work stops to keep lines in the group.
min_day_counts, min_hour_counts (iterable, optional) – The minimum count of stops in home and work locations to keep lines in the group.
min_delta_counts, min_delta_durations (iterable, optional) – The minimum fraction of home/work hours during which the area/location is the most visited in terms of duration/count of stops for it to be kept.
limit_hw_locs (bool, optional) – If True, it will limit the home and work candidates to the row(s) featuring isHome or isWork equal True, respectively. If False (default), all the rows are kept as candidates.
loc_col (str, optional) – The column to use to check if the home and work candidates are in the same location.

Returns:

user_flags (pd.DataFrame) – A dataframe indexed by user containing, for each combination of threshold values (in the order minimum duration, minimum days, minimum hours, minimum delta count, minimum delta duration), the flags:

out_flags: whether the user has a home AND work candidate with the threshold;
out_has_home: whether the user has a home candidate with the threshold;
out_has_work: whether the user has a work candidate with the threshold;
out_same_locs: whether the user has a unique home and work candidate falling under the same loc_col ID.

df_cnt (pd.DataFrame) – The dataframe in long format containing the count of valid users for each combination of minimum thresholds. The columns are:

'tot_duration', 'n_days', 'n_hours', 'delta_count', 'delta_duration': the values of the constraints for the current count;
'n_users': how many users have both home and work with the current settings;
'with_home_users': how many users have a home location with the current settings;
'with_work_users': how many users have a work location with the current settings;
'home_work_same_area_users': how many users have home and work locations featuring the same loc_col ID;
'home_work_same_area_users_frac': the fraction of valid users with home and work that have home and work locations featuring the same loc_col ID.
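
A minimal sketch of the threshold scan, restricted to two of the five thresholds, may help to read the output tables. It is an illustration only: the function name survival_counts is hypothetical, rows are plain dicts rather than a dataframe, and the use of >= for the comparisons is an assumption.

```python
from itertools import product


def survival_counts(rows, min_durations=(0,), min_day_counts=(0,)):
    """Count users with home/work candidates for each threshold pair.

    `rows` is a list of dicts, one per (user, location), with the keys
    uid, home_duration, home_day_count, work_duration and work_day_count.
    """
    out = []
    for min_dur, min_days in product(min_durations, min_day_counts):
        has_home, has_work = set(), set()
        for r in rows:
            if r["home_duration"] >= min_dur and r["home_day_count"] >= min_days:
                has_home.add(r["uid"])
            if r["work_duration"] >= min_dur and r["work_day_count"] >= min_days:
                has_work.add(r["uid"])
        out.append({
            "tot_duration": min_dur,
            "n_days": min_days,
            "n_users": len(has_home & has_work),  # both home and work
            "with_home_users": len(has_home),
            "with_work_users": len(has_work),
        })
    return out


rows = [
    {"uid": "a", "home_duration": 7200, "home_day_count": 5,
     "work_duration": 0, "work_day_count": 0},
    {"uid": "a", "home_duration": 0, "home_day_count": 0,
     "work_duration": 3600, "work_day_count": 3},
    {"uid": "b", "home_duration": 3600, "home_day_count": 1,
     "work_duration": 0, "work_day_count": 0},
]
counts = survival_counts(rows, min_durations=(3600, 7200), min_day_counts=(1,))
```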

mobilkit.stats.computeSurvivalFracs(users_stats_df, thresholds=[1, 10, 20, 50, 100])

Function to compute the fraction of users above threshold.

Parameters:

users_stats_df (pandas.DataFrame) – A dataframe with the users stats as returned by mobilkit.stats.userStats and passed to pandas with the toPandas method.
thresholds (list or array of ints, optional) – The values of the threshold to compute. The number of days above the threshold and the fraction of active days above threshold will be saved, for each user, in the days_above_TTT and frac_days_above_TTT columns, where TTT is the threshold value.

Returns:

df – The enriched dataframe.

Return type:

pandas.DataFrame
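
The two per-threshold columns can be sketched for a single user as follows. This is an illustration under assumptions: the input is the user's list of pings per active day (the pingsPerDay column of mobilkit.stats.userStats), "above" is taken to mean greater than or equal to the threshold, and the helper name survival_fracs is hypothetical.

```python
def survival_fracs(pings_per_day, thresholds=(1, 10, 20, 50, 100)):
    """Return days_above_TTT and frac_days_above_TTT for one user."""
    out = {}
    n_active = len(pings_per_day)
    for t in thresholds:
        # Assumption: a day counts when it has at least `t` pings.
        days = sum(1 for n in pings_per_day if n >= t)
        out[f"days_above_{t}"] = days
        out[f"frac_days_above_{t}"] = days / n_active if n_active else 0.0
    return out


row = survival_fracs([12, 22, 5, 60], thresholds=(10, 50))
```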

mobilkit.stats.computeTripTimeStats(df_trip_times, df_hw_locs, gdf_grid, local_EPSG, buffer_m=500)

mobilkit.stats.computeUserHomeWorkTripTimes(df_hw_locs, osrm_url=None, direction='both', what='duration', max_trip_duration_h=4, max_trip_distance_km=150)

TODO: This is quite slow as it is a serial part; it can be parallelized using a pool or by directly mapping in Dask.

Return type:

time in seconds, distance in meters

mobilkit.stats.filterUsers(df, dfStats=None, minPings=1, minDaysSpanned=1, minDaysActive=1, minSuperUserDayFrac=None, superUserPingThreshold=None)

Function to filter the pings and keep only the ones of the users with given statistics.

Parameters:

df (dask.dataframe) – The dataframe containing the pings.
dfStats (dask.dataframe, optional) – The dataframe containing the pre-computed stats of the users as returned by mobilkit.stats.userStats. If None, it will be automatically computed. In either case it is returned together with the result.
minPings (int) – The minimum number of recorded pings for a user to be kept.
minDaysSpanned (float) – The minimum number of days between the first and last ping for a user to be kept.
minDaysActive (int) – The minimum number of active days for a user to be kept.
minSuperUserDayFrac (float) – The minimum fraction of days with at least superUserPingThreshold pings for a user to be kept. Must be between 0 and 1.
superUserPingThreshold (int) – The minimum number of pings for a user-day to be considered as super user.

Returns:

df_out, df_stats, valid_users_set – The dataframe containing the pings of the valid users only, the one containing the stats per user and the set of the valid users.

Return type:

dask.dataframe, dask.dataframe, set
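
The filtering logic can be sketched on plain Python structures. This is a hedged illustration, not the library code: valid_users is a hypothetical helper, the stats dict mimics the columns documented for mobilkit.stats.userStats, and computing the super-user fraction over active days is an assumption.

```python
def valid_users(stats, min_pings=1, min_days_spanned=1, min_days_active=1,
                min_super_frac=None, super_threshold=None):
    """Return the set of user ids passing all the requested thresholds."""
    keep = set()
    for uid, s in stats.items():
        ok = (s["pings"] >= min_pings
              and s["daysSpanned"] >= min_days_spanned
              and s["daysActive"] >= min_days_active)
        if ok and min_super_frac is not None:
            # Assumption: the fraction is computed over the active days.
            super_days = sum(1 for n in s["pingsPerDay"] if n >= super_threshold)
            ok = super_days / s["daysActive"] >= min_super_frac
        if ok:
            keep.add(uid)
    return keep


stats = {
    "a": {"pings": 40, "daysSpanned": 10, "daysActive": 4, "pingsPerDay": [10, 10, 10, 10]},
    "b": {"pings": 40, "daysSpanned": 10, "daysActive": 4, "pingsPerDay": [37, 1, 1, 1]},
    "c": {"pings": 2, "daysSpanned": 1, "daysActive": 1, "pingsPerDay": [2]},
}
pings = [("a", 1), ("b", 1), ("c", 2), ("a", 3)]  # (uid, tile_ID) pairs
keep = valid_users(stats, min_pings=10, min_days_active=2,
                   min_super_frac=0.5, super_threshold=5)
filtered = [p for p in pings if p[0] in keep]
```

The last line corresponds to what mobilkit.stats.filterUsersFromSet does with a pre-computed set of users.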

mobilkit.stats.filterUsersFromSet(df, users_set)

Function to filter the pings and keep only the ones of the users in users_set.

Parameters:

df (dask.dataframe) – The dataframe containing the pings.
users_set (set or list) – The ids of the users to keep.

Returns:

df_out – The filtered dataframe containing the pings of the valid users only.

Return type:

dask.dataframe

mobilkit.stats.homeWorkStats(df_hw)

Given a dataframe returned by mobilkit.stats.userHomeWork computes, for each user and area, the total number of pings recorded in that area (total_pings column), the pings recorded in home hours (home_pings column) and the ones in work hours (work_pings column).

Parameters:

df_hw (dask.dataframe) – A dataframe as returned by mobilkit.stats.userHomeWork with at least the uid, tile_ID and isHome and isWork columns.

Returns:

df_hw_stats –

The dataframe containing, for each user and area id:

total_pings: the total number of pings recorded for that user in that area
home_pings: the pings recorded for that user in home hours in that area
work_pings: the pings recorded for that user in work hours in that area

Return type:

dask.dataframe
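
The three counters can be sketched with a plain dictionary keyed by user and area. This is an illustration only, with a hypothetical helper name and tuples standing in for the dataframe rows.

```python
from collections import defaultdict


def home_work_stats(pings):
    """Aggregate pings per (uid, tile_ID).

    `pings` is an iterable of (uid, tile_ID, isHome, isWork) tuples.
    """
    stats = defaultdict(lambda: {"total_pings": 0, "home_pings": 0, "work_pings": 0})
    for uid, tile, is_home, is_work in pings:
        s = stats[(uid, tile)]
        s["total_pings"] += 1
        s["home_pings"] += int(is_home)
        s["work_pings"] += int(is_work)
    return dict(stats)


pings = [
    ("a", 1, True, False),
    ("a", 1, True, False),
    ("a", 1, False, False),  # neither home nor work time
    ("a", 2, False, True),
]
hw_stats = home_work_stats(pings)
```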

mobilkit.stats.plotSurvivalDays(users_stats_df, min_days=10, ax=None)

Function to plot the survival probability of users by number of days given different pings/day threshold.

Parameters:

users_stats_df (pandas.DataFrame) – A dataframe with the users stats as returned by mobilkit.stats.computeSurvivalFracs.
min_days (int, optional) – The minimum number of active days above threshold to be counted as super user in the plot count.
ax (plt.axes, optional) – The axes to use. If None a new figure will be produced.

Returns:

The axes of the figure.

Return type:

ax

mobilkit.stats.plotSurvivalFrac(users_stats_df, min_frac=0.8, ax=None)

Function to plot the survival probability of users by fraction of active days given different pings/day threshold.

Parameters:

users_stats_df (pandas.DataFrame) – A dataframe with the users stats as returned by mobilkit.stats.computeSurvivalFracs.
min_frac (0 < float < 1, optional) – The minimum fraction of active days above threshold to be counted as super user in the plot count.
ax (plt.axes, optional) – The axes to use. If None a new figure will be produced.

Returns:

The axes of the figure.

Return type:

ax

mobilkit.stats.plotUsersHist(users_stats, min_pings=5, min_days=5, days='active', cmap='YlGnBu', xbins=100, ybins=20)

Function to plot the 2d histogram of the users stats.

Parameters:

users_stats (pandas.DataFrame) – A dataframe with the users stats as returned by mobilkit.stats.userStats and passed to pandas with the toPandas method.
min_pings (int, optional) – The number of pings to be used as threshold in the plot counts.
min_days (int, optional) – The number of active or spanned days (depending on days) to be used as threshold in the plot counts.
days (str, optional) – Whether to use active (active, default) days or spanned days (spanned).
cmap (str, optional) – The colormap to use.
xbins, ybins (int, optional) – The number of bins to use on the x and y axis.

Returns:

The axes of the figure.

Return type:

ax

mobilkit.stats.stopsToHomeWorkStats(df_stops, home_hours=(21, 7), work_hours=(9, 17), work_days=(0, 1, 2, 3, 4), force_different=False, ignore_dynamical=True, min_hw_distance_km=0.0, min_home_delta_count=0, min_home_delta_duration=0, min_work_delta_count=0, min_work_delta_duration=0, min_home_days=0, min_work_days=0, min_home_hours=0, min_work_hours=0)

Computes the home and work time stats for each user and location (tile).

Parameters:

df_stops (dask.DataFrame or pd.DataFrame) – The stops of a user as returned by locations or stops TODO;
home_hours, work_hours (tuple, optional) – TODO
work_days (tuple) – TODO
force_different (bool, optional) – TODO
ignore_dynamical (bool, optional) – TODO
min_hw_distance_km (float, optional) – TODO
min_home_delta_count, min_home_delta_duration, min_work_delta_count, min_work_delta_duration (float, optional) – TODO
min_home_days, min_work_days, min_home_hours, min_work_hours (int, optional) – TODO
latCol, lonCol, locCol (str, optional) – TODO

Returns:

df_stats – A dataframe with the columns:

'uid' the user id;
'loc_id' or 'tile_ID' the location/tile id (0-based);
'lat_medoid', 'lng_medoid' or 'lat', 'lng' the average coordinates of the stops seen within that location/tile;
'{home,work}_{day,hour}_count' the number of unique days (hours) when the user has been seen as active in the location (tile) at home (work) hours;
'{home,work}_per_hour_{count,duration}' the list containing, for each hour in the home (work) hours, the number of visits (duration in seconds) spent at the location/tile;
'{home,work}_{count,duration}' the total number of visits (duration in seconds) spent at this location/tile;
'tot_seen_{home,work}_{hours,days}' the total number of days and hours where the user has been active during home (work) hours during the valid stops;
'tot_seen_{hours,days}' the total number of days and hours where the user has been active during the valid stops, both in the home and work periods;
'tot_stop_count', 'tot_stop_time' the total number and duration (in seconds) of the user's stops;
'frac_{home,work}_{count,duration}' the fraction of stops (duration) spent in this tile/location during home (work) hours;
'{home,work}_delta_{count,duration}' the fraction of hours in the home (work) range at which the given tile/location was the most visited in terms of stops (duration);
'isHome', 'isWork' the flags telling whether the location is home or work (or potentially both, if force_different is False).

Return type:

pd.DataFrame
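
The '{home,work}_delta_count' column can be read as follows: for each hour of the home (work) range, find the location with the most visits, then report for each location the fraction of hours it "won". The sketch below illustrates this reading for one user; the helper name delta_count is hypothetical, and both the tie-breaking rule (first location wins) and the skipping of hours with no visits are assumptions.

```python
def delta_count(per_hour_counts):
    """Fraction of hours in which each location is the most visited.

    `per_hour_counts` maps loc_id -> list of visit counts, one entry per
    hour of the home (or work) range, as in '{home,work}_per_hour_count'.
    """
    n_hours = len(next(iter(per_hour_counts.values())))
    wins = {loc: 0 for loc in per_hour_counts}
    for h in range(n_hours):
        best = max(per_hour_counts, key=lambda loc: per_hour_counts[loc][h])
        if per_hour_counts[best][h] > 0:  # skip hours with no visit at all
            wins[best] += 1
    return {loc: wins[loc] / n_hours for loc in wins}


deltas = delta_count({0: [3, 2, 0, 1], 1: [1, 0, 0, 2]})
```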

mobilkit.stats.userBasedBufferedStat(df_stat, df_user_grid, stat_col, uid_col='uid', tile_col='tile_ID', explode_col=False, how='inner', stats=['min', 'max', 'mean', 'std', 'count'])

Given a dataframe df_stat containing the per-user stat in the stat_col column and a dataframe containing the users per area as returned from mk.stats.computeBufferStat, computes the stats of the stat_col by merging the two dataframes on the tile_col.

Parameters:

df_stat (pd.DataFrame) – The dataframe containing at least the uid_col and stat_col. They can also be in the df’s index as it will be reset.
df_user_grid (pd.DataFrame) – A dataframe containing the users per area (in the uid_col and tile_col) as returned from mk.stats.computeBufferStat when passing the home work locations as gdf_stat, with lat_name='lat_home', lon_name='lng_home', column=uidColName and aggregation=set.
uid_col, tile_col (str, optional) – The columns containing the user id and the tile id in the two input dfs.
explode_col (bool, optional) – Whether we need to explode the stat_col before merging (for list-like observations).
how (str, optional) – The join method to use between the tile ids in the grid and the df of stats.
stats (list or str, optional) – The stats to be computed at the tile level on the stat_col column.

Returns:

stats – The dataframe containing the tile id as index and with the stats in the stat_col_{min/max/mean} format.

Return type:

pd.DataFrame

Examples

>>> # Compute the per area buffered users based on home location (500m buffer):
>>> users_buffered_per_area = mk.stats.computeBufferStat(
...     gdf_stat=df_hw_locs_pd.reset_index()[['lat_home', 'lng_home', uidColName]],
...     gdf_grid=gdf_aoi_grid,
...     column=uidColName,
...     aggregation=set,
...     lat_name='lat_home',
...     lon_name='lng_home',
...     local_EPSG=local_EPSG,
...     buffer=500)
>>> # Compute the per area total daily traveled distance
>>> ttd_daily_user = mk.spatial.totalUserTravelDistance(df_pings, freq='1d')
>>> df_out = mk.stats.userBasedBufferedStat(ttd_daily_user,
...                                         users_buffered_per_area,
...                                         stat_col='ttd')
>>> df_out.head()
tile_ID   |    ttd_min |    ttd_max |   ttd_mean |    ttd_std |  ttd_count |
12345     |      2.345 |     12.345 |      5.345 |      3.345 |        125 |
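
The merge-then-aggregate step can also be sketched on plain dictionaries. This is a simplified illustration, not the library code: tile_stats is a hypothetical helper, each user carries a single scalar value, and the sample standard deviation is used on the assumption that it matches the pandas default.

```python
import statistics


def tile_stats(user_stat, users_per_tile):
    """Aggregate a per-user value over the set of users attached to each tile."""
    out = {}
    for tile, users in users_per_tile.items():
        values = [user_stat[u] for u in users if u in user_stat]
        if not values:  # inner join: tiles without any known user are dropped
            continue
        out[tile] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            # Sample standard deviation, undefined for a single observation.
            "std": statistics.stdev(values) if len(values) > 1 else float("nan"),
            "count": len(values),
        }
    return out


user_stat = {"u1": 2.0, "u2": 4.0, "u3": 9.0}
users_per_tile = {10: {"u1", "u2"}, 11: {"u3", "u9"}, 12: {"u8"}}
out = tile_stats(user_stat, users_per_tile)
```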

mobilkit.stats.userHomeWork(df, homeHours=(19.5, 7.5), workHours=(9.0, 18.5), weHome=False)

Computes, for each row of the dataset, if the ping has been recorded in home or work time. Can be used in combination with mobilkit.stats.homeWorkStats and mobilkit.stats.userHomeWorkLocation to determine the home and work locations of a user.

Parameters:

df (dask.dataframe) – The loaded dataframe with at least uid, datetime and tile_ID columns.
homeHours (tuple, optional) – The starting and end hours of the home period in 24h floating numbers. For example, to set the home period from 08:15 pm to 07:20 am use homeHours=(20.25, 7.33).
workHours (tuple, optional) – The starting and end hours of the work period in 24h floating numbers. For example, to set the work period from 09:15 am to 06:50 pm use workHours=(9.25, 18.8333). Note that work hours are counted only from Monday to Friday.
weHome (bool, optional) – If False (default), weekend pings count as home pings only when they fall within the home hours. If True, all the pings recorded during the weekend (Saturday and Sunday) are counted as home pings.

Returns:

out – The dataframe with two additional columns, isHome and isWork, telling if a given ping has been recorded during home or work time (or neither).

Return type:

dask.dataframe
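
A home window such as (19.5, 7.5) wraps past midnight, which is the part most easily misread. The following sketch shows one way to flag a single ping under the rules described above. It is an illustration rather than the library's code: in_window and flag_ping are hypothetical names, and treating the start bound as included and the end bound as excluded is an assumption.

```python
from datetime import datetime


def in_window(hour, window):
    """Check whether a float hour falls in a window that may wrap past midnight."""
    start, stop = window
    if start <= stop:
        return start <= hour < stop
    return hour >= start or hour < stop  # e.g. (19.5, 7.5)


def flag_ping(dt, home_hours=(19.5, 7.5), work_hours=(9.0, 18.5), we_home=False):
    """Return the (isHome, isWork) flags for a ping recorded at `dt`."""
    hour = dt.hour + dt.minute / 60 + dt.second / 3600
    weekend = dt.weekday() >= 5  # Saturday or Sunday
    is_home = in_window(hour, home_hours) or (weekend and we_home)
    is_work = (not weekend) and in_window(hour, work_hours)
    return is_home, is_work


monday_night = flag_ping(datetime(2020, 1, 6, 23, 0))
monday_morning = flag_ping(datetime(2020, 1, 6, 10, 0))
saturday = flag_ping(datetime(2020, 1, 11, 10, 0))
saturday_we_home = flag_ping(datetime(2020, 1, 11, 10, 0), we_home=True)
```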

mobilkit.stats.userHomeWorkLocation(df_hw: dask.dataframe, force_different: bool = False)

Given a dataframe returned by mobilkit.stats.userHomeWork computes, for each user, the home and work area as well as their location. The home/work area is the one with the most pings recorded, and the location is assigned to the mean point of the pings in that area.

Parameters:

df_hw (dask.dataframe) – A dataframe as returned by mobilkit.stats.userHomeWork with at least the uid, tile_ID and isHome and isWork columns.
force_different (bool, optional) – Whether we want to force the work location to be different from the home location.

Returns:

df_hw_locs –

A dataframe containing, for each uid with at least one ping at home or work:

pings_home: the total number of pings recorded in the home area
pings_work: the total number of pings recorded in the work area
tile_ID_home: the tile id of the home area
tile_ID_work: the tile id of the work area
lng_home: the longitude of the home location
lat_home: the latitude of the home location
lng_work: the longitude of the work location
lat_work: the latitude of the work location

Return type:

dask.dataframe
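
The selection rule (most-pinged area, then mean point of its pings) can be sketched for the home side of a single user. This is a minimal illustration with a hypothetical helper name; ties between areas are resolved by taking the first one encountered, which is an assumption.

```python
from collections import defaultdict


def home_location(pings):
    """Pick the home tile and its mean point.

    `pings` is a list of (tile_ID, lat, lng) tuples holding the home-time
    pings of one user.
    """
    by_tile = defaultdict(list)
    for tile, lat, lng in pings:
        by_tile[tile].append((lat, lng))
    # The home area is the tile with the most home-time pings.
    tile, pts = max(by_tile.items(), key=lambda kv: len(kv[1]))
    return {
        "tile_ID_home": tile,
        "pings_home": len(pts),
        "lat_home": sum(p[0] for p in pts) / len(pts),
        "lng_home": sum(p[1] for p in pts) / len(pts),
    }


home = home_location([(7, 45.0, 9.0), (7, 45.5, 9.5), (8, 46.0, 10.0)])
```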

mobilkit.stats.userRealHomeWorkTimes(df_stops, home_work_locs, direction='both', uid_col='uid', location_col='tile_ID', additional_hw_cols=['home_work_straight_dist', 'home_work_osrm_time', 'work_home_osrm_time', 'home_work_osrm_dist', 'work_home_osrm_dist'], **kwargs)

Computes the real home-work commuting time by looking at the sequence of the user's stops.

Parameters:

df_stops (dd.DataFrame) – A dask dataframe containing the stops (or pings) of the users to be analyzed. It might feature a location id column (as when returned by mobilkit.spatial.computeUsersLocations applied on the output of mobilkit.spatial.findStops) or the tile id column (as in the df returned by mobilkit.spatial.tessellate).
home_work_locs (dd.DataFrame or pd.DataFrame) – The necessarily pre-cleaned dataframe containing the stats on users home and work. This can be either a dataframe as returned by mobilkit.stats.userHomeWorkLocation or the one obtained by chaining the mobilkit.stats.stopsToHomeWorkStats and the mobilkit.stats.
**kwargs – Are used to tune the functioning of mobilkit.stats._per_user_real_home_work_times

mobilkit.stats.userStats(df)

Computes the stats per user:

days spanned (time from first to last ping);
days active (actual number of days being active);
number of pings per user;
number of pings per user per active day;

Parameters:

df (dask.dataframe) – The dataframe extracted or imported. Must contain the uid and datetime columns.

Returns:

df_out –

A dask dataframe with the stats per user in the following columns:

daysActive the number of different days where the user has been active;
daysSpanned the days spanned between the first and last recorded ping;
pings the number of pings recorded for the user;
pingsPerDay the number of pings recorded for the user in every active day;
avg the average number of pings per recorded day for the user.

Return type:

dask.DataFrame

Example

>>> df_out = mk.stats.userStats(df)
>>> df_out.head()
uid   |    min_day |    max_day |  pings |  daysActive |   avg | daysSpanned | pingsPerDay    |
'abc' | 2017-08-12 | 2017-12-22 |   3452 |         124 | 27.83 |         222 | [12,22,...,13] |
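
The same quantities can be sketched for a single user from a list of ping timestamps. This is an illustration, not the library code: user_stats is a hypothetical helper, and counting the spanned days as the plain difference between the last and first day is an assumption.

```python
from collections import Counter
from datetime import datetime


def user_stats(timestamps):
    """Compute the per-user stats from the datetimes of that user's pings."""
    per_day = Counter(ts.date() for ts in timestamps)
    days = sorted(per_day)
    pings = len(timestamps)
    return {
        "min_day": days[0],
        "max_day": days[-1],
        "pings": pings,
        "daysActive": len(days),
        # Assumption: difference in days between the last and first ping day.
        "daysSpanned": (days[-1] - days[0]).days,
        "pingsPerDay": [per_day[d] for d in days],
        "avg": pings / len(days),
    }


stats = user_stats([
    datetime(2020, 1, 1, 8, 0),
    datetime(2020, 1, 1, 21, 30),
    datetime(2020, 1, 3, 12, 0),
    datetime(2020, 1, 5, 9, 45),
])
```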