mobilkit.spatial module

Tools and functions to spatially analyze the data.

Note

When determining the home location of a user, please consider that some data providers, like Cuebiq, obfuscate/obscure/alter the coordinates of the points falling near the user’s home location in order to preserve privacy.

This means that you cannot locate the precise home of a user with a spatial resolution higher than the one used to obfuscate these data. If you are interested in the census area (or geohash) of the user’s home alone and you are using a spatial tessellation with a spatial resolution wider than or equal to the one used to obfuscate the data, then this is of no concern.

However, tasks such as stop-detection or POI visit rate computation may be affected by the noise added to data in the user’s home location area. Please check if your data has such noise added and choose the spatial tessellation according to your use case.

mobilkit.spatial.assignAreasDF(df, zones_gdf)

Returns the geo-dataframe with an additional ZONE_IDX column. Rows that do not overlap any zone are guaranteed to have the value -1 in that column. The order of the original index and columns is preserved.

mobilkit.spatial.box2poly(box)

[min_lon, min_lat, max_lon, max_lat]
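
As a rough illustration of the expected input layout, the sketch below turns such a box into a closed ring of (lon, lat) corners. The helper name box_to_ring and the counter-clockwise corner order are assumptions made for this example only; the object actually returned by box2poly is not documented here.

```python
def box_to_ring(box):
    """Return the closed ring of (lon, lat) corners for a bounding box.

    Hypothetical helper: `box` follows the [min_lon, min_lat, max_lon, max_lat]
    layout documented above; the corner order is an assumption.
    """
    min_lon, min_lat, max_lon, max_lat = box
    return [
        (min_lon, min_lat),
        (max_lon, min_lat),
        (max_lon, max_lat),
        (min_lon, max_lat),
        (min_lon, min_lat),  # repeat the first corner to close the ring
    ]


ring = box_to_ring([-99.2, 19.3, -99.0, 19.5])
```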

mobilkit.spatial.computeUsersLocations(stops_df, method='dbscan', link_dist=150, min_stops_count=2, return_locations=True)

TODO DOC

mobilkit.spatial.compute_ROG(df, which='both', df_hw_locs=None)
mobilkit.spatial.compute_medoid_index(distM)

Returns the row index of the necessarily symmetric matrix that minimizes the sum of distances to all the other points.

Parameters:

distM (np.array) – The distances matrix.

Returns:

index – The row (column) index of the medoid.

Return type:

int
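
The definition above can be sketched in plain Python. This is only an illustration of the medoid criterion on a small list-of-lists matrix; the function name medoid_index is hypothetical and the library itself works on NumPy arrays.

```python
def medoid_index(dist_matrix):
    """Index of the row whose summed distance to all other points is minimal."""
    row_sums = [sum(row) for row in dist_matrix]
    # Pick the row index with the smallest total distance.
    return min(range(len(row_sums)), key=row_sums.__getitem__)


# Three points on a line at positions 0, 1 and 2: the middle one is the medoid.
dist_m = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
idx = medoid_index(dist_m)
```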

mobilkit.spatial.compute_poi_index_dist(g, tree_model=None, lon_col='lng_proj', lat_col='lat_proj')
Parameters:
  • g (pandas.DataFrame) – A dataframe containing at least the lat_col and lon_col columns with the raw points’ coordinates projected to the same projection of the tree_model.

  • tree_model (sklearn.neighbors.KDTree) – A KDTree trained on the POIs projected in the same projection as the lat and lon points. Distances will be computed by the tree.

  • lat_col, lon_col (str) – The names of the columns containing the latitude and longitude of the points in g. By default they match the ones used in mobilkit.spatial.compute_poi_visit.

Returns:

g

The original data frame with two additional columns named:
  • ’poi_distance’ the distance of the closest POI found in the tree, in km;

  • ’_POI_INDEX_’ the 0-based index of the closest tree leaf.

Return type:

pd.DataFrame

mobilkit.spatial.compute_poi_visit(df_pings, df_homes, df_POIs, from_crs='EPSG:4326', to_crs='EPSG:6362', min_home_dist_km=0.2, visit_time_bin='1H', lat_lon_tol_box=0.02)

Computes the set of users and number of users visiting a given POI for each visit_time_bin period of time found in the pings dataframe.

Parameters:
  • df_pings (dask.DataFrame) – A dataframe containing at least the uid, datetime, lat and lng columns with the raw points’ coordinates. The coordinates must be given in the from_crs projection.

  • df_homes (Dataframe) – A pandas or dask Dataframe with the uid, homelat and homelon home coordinates of all the users in the df. The coordinates must be given in the from_crs projection. Note that the three dataframes of pings, homes and POIs must feature the same initial projection equal to from_crs.

  • df_POIs (Dataframe) – A pandas or dask Dataframe with at least the radius, poilat and poilon columns stating the radius to be considered in the POI (in km) and the POI’s coordinates. The coordinates must be given in the from_crs projection.

  • from_crs, to_crs (str) – The codes of the original and target projections to use. They are used to compute planar distances in km with a Euclidean distance, so use the appropriate reference system for your ROI (e.g., use to_crs=’EPSG:6362’ for the Mexico area). Will be passed to mobilkit.spatial.convert_df_crs.

  • min_home_dist_km (float) – The minimum distance (in km) a point must be from the user’s home to be considered valid.

  • visit_time_bin (str) – The frequency of the time bin to use. Each datetime will be floored to this time frequency.

  • lat_lon_tol_box (float) – The pings are first filtered to the box given by the minimum/maximum latitude/longitude of the POIs dataframe in the original projection. This is the margin added around that box to account for pings just outside of the POIs’ boundaries that may still fall within their radius.

Returns:

pings_merged_home_poi, results

  • pings_merged_home_poi is a view on the dask.DataFrame containing, for all the points falling within the POIs radius and far enough from users’ home:
    • the original pings columns plus their projected coords in {lat,lng}_proj;

    • the home and ‘poi’ original and projected (with ‘_proj’ suffix) lat coords;

    • ’poi_distance’, ‘_POI_INDEX_’ the distance (in km) and the unique index of the closest POI;

    • all the df_POIs columns related to this POI (if common names of columns are found they will be inserted with the _FROM_POI_TABLE suffix);

    • ’home_dist’ the distance in km from the user’s home;

    • ’time_bin’ the original datetime floored to visit_time_bin freq.

  • results is a dataframe containing, for each unique _POI_INDEX_ and time_bin as given by visit_time_bin:
    • all the df_POIs columns related to this POI;

    • users,num_users the columns containing the list of the uid-s of the users found in that POI and that time_bin and their number.

Return type:

dask.DataFrame, pd.DataFrame
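
The aggregation behind the results table can be sketched in plain Python. This is a minimal illustration, assuming the default one-hour visit_time_bin and pings that have already passed the POI-radius and home-distance filters; the helper names floor_to_hour and users_per_poi_and_bin are hypothetical and the library performs these steps on dataframes.

```python
from collections import defaultdict
from datetime import datetime


def floor_to_hour(ts):
    """Floor a datetime to the start of its hour (the assumed '1H' bin)."""
    return ts.replace(minute=0, second=0, microsecond=0)


def users_per_poi_and_bin(visits):
    """Group valid pings by (POI index, time bin) and collect unique users.

    `visits` is an iterable of (uid, poi_index, datetime) tuples.
    Returns {(poi_index, time_bin): (sorted list of uids, number of uids)}.
    """
    users = defaultdict(set)
    for uid, poi_index, ts in visits:
        users[(poi_index, floor_to_hour(ts))].add(uid)
    return {key: (sorted(uids), len(uids)) for key, uids in users.items()}


visits = [
    ("a", 0, datetime(2020, 1, 1, 10, 5)),
    ("a", 0, datetime(2020, 1, 1, 10, 40)),  # same user, same bin: counted once
    ("b", 0, datetime(2020, 1, 1, 10, 59)),
    ("b", 1, datetime(2020, 1, 1, 11, 1)),
]
results = users_per_poi_and_bin(visits)
```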

mobilkit.spatial.compute_population_density(df, **kwargs)
Parameters:
  • df (dask.DataFrame) – A dataframe as returned by mobilkit.temporal.filter_daynight_time with at least the date,daytime,uid,lat,lng columns containing the date rounded to day, a bool stating if the point is in daytime or nighttime, the user id and the coordinates of the point.

  • **kwargs – Will be passed to mobilkit.spatial.meanshift.

Returns:

df – A dataframe with a multi index of date,daytime,uid and as columns the lat and lng coordinates of the mean shift location of the user on that part of the day on that date.

Return type:

pandas.DataFrame

mobilkit.spatial.convert_df_crs(df, lat_col='lat', lon_col='lng', from_crs='EPSG:4326', to_crs='EPSG:6362', return_gdf=False)
Parameters:
  • df (pandas.DataFrame) – A dataframe containing the lat_col and lon_col columns at least.

  • lat_col, lon_col (str) – The names of the columns containing the latitude and longitude of the points in df.

  • from_crs, to_crs (str, optional) – The codes of the original and target projections to use. If to_crs is None no reprojection is done.

  • return_gdf (bool, optional) – If True returns the newly created gdf otherwise the original df with two additional columns telling the projected lat and lon.

Returns:

df – If return_gdf the df ported to a geodataframe in the to_crs projection. Otherwise, the original data frame with two additional columns named lat_col + ‘_proj’ and lon_col + ‘_proj’ containing the original coordinates projected to to_crs.

Return type:

pd.DataFrame or gpd.GeoDataFrame

mobilkit.spatial.density_map(latitudes, longitudes, center, bins, radius)
Parameters:
  • latitudes, longitudes (array-like) – The arrays containing the latitude and longitude coordinates of each user’s location.

  • center (tuple-like) – The (lat, lon) of the center where to compute the population density around.

  • bins (int) – The number of bins to use horizontally and vertically in the region around the center.

  • radius (float) – The space to consider above, below, left and right of the center (in the same units as the center).

Returns:

density – The 2d histogram of the population.

Return type:

np.array
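
The binning described above can be sketched without NumPy so that the grid logic is explicit. This is a minimal illustration under stated assumptions: the grid spans center ± radius on both axes, rows follow latitude and columns follow longitude, and points on the upper edge are dropped. The function name density_map_sketch is hypothetical and the library's axis orientation and edge handling may differ.

```python
def density_map_sketch(latitudes, longitudes, center, bins, radius):
    """Count points in a bins x bins grid spanning center +/- radius."""
    lat0, lon0 = center
    cell = 2.0 * radius / bins  # side of one grid cell, same units as center
    grid = [[0] * bins for _ in range(bins)]
    for lat, lon in zip(latitudes, longitudes):
        row = int((lat - (lat0 - radius)) // cell)
        col = int((lon - (lon0 - radius)) // cell)
        # Points outside the square around the center are ignored.
        if 0 <= row < bins and 0 <= col < bins:
            grid[row][col] += 1
    return grid


grid = density_map_sketch(
    latitudes=[0.5, -0.5, -0.5, 5.0],
    longitudes=[0.5, -0.5, 0.5, 5.0],
    center=(0.0, 0.0),
    bins=2,
    radius=1.0,
)
```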

mobilkit.spatial.distanceHomeDF(g, **kwargs)
Parameters:
  • g (pandas.DataFrame) – A dataframe containing at least the lat_col and lon_col columns with the raw points’ coordinates and the home coordinates in homelon and homelat columns of all the users.

  • **kwargs – Such as lat_col, lon_col will be passed to mobilkit.spatial.distanceHomeUser.

Returns:

g – The original data frame with an additional column named ‘home_dist’ containing the haversine distance between each point and the home location of the user of that row in kilometers.

Return type:

pd.DataFrame

mobilkit.spatial.distanceHomeUser(g, lon_col='lng', lat_col='lat', h_lon_col='homelon', h_lat_col='homelat')
Parameters:
  • g (pandas.DataFrame) – A dataframe containing at least the lat_col and lon_col columns with the raw points’ coordinates and the home coordinates in homelon and homelat columns. Must contain all the data of one user only.

  • lat_col, lon_col (str) – The names of the columns containing the latitude and longitude of the points in g.

  • h_lat_col, h_lon_col (str) – The names of the columns containing the latitude and longitude of the home user g.

Returns:

g – The original data frame with an additional column named ‘home_dist’ containing the haversine distance between each point and the home location in kilometers.

Return type:

pd.DataFrame

Note

When determining the home location of a user, please consider that some data providers, like Cuebiq, obfuscate/obscure/alter the coordinates of the points falling near the user’s home location in order to preserve privacy.

This means that you cannot locate the precise home of a user with a spatial resolution higher than the one used to obfuscate these data. If you are interested in the census area (or geohash) of the user’s home alone and you are using a spatial tessellation with a spatial resolution wider than or equal to the one used to obfuscate the data, then this is of no concern.

However, tasks such as stop-detection or POI visit rate computation may be affected by the noise added to data in the user’s home location area. Please check if your data has such noise added and choose the spatial tessellation according to your use case.

mobilkit.spatial.expandStops(df, freq: str = '1h', explode_stop: bool = True)

Given a dataframe containing the single stops of one or more users (as returned by mobilkit.spatial.findStops) it explodes them to be repeated once every freq time bin they traverse (if explode_stop) or it lists them in the mobilkit.dask_schemas.stpColName column.

Parameters:
  • df (dask.dataframe) – A dataframe containing the stops of one or more users (as returned by mobilkit.spatial.findStops).

  • freq (str, optional) – The time bin to use to replicate/explode the stop. Currently all the valid freq arguments to pandas.date_range are accepted.

  • explode_stop (bool, optional) – If True the mobilkit.dask_schemas.stpColName column will contain the exploded list of time bins touched by the stop, else the list itself.

Returns:

exploded_stops_df – The initial dataframe with the additional column mobilkit.dask_schemas.stpColName containing the time bins (or their list, depending on explode_stop) touched by the stop.

Return type:

dask.dataframe
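
To make the notion of "time bins touched by a stop" concrete, the sketch below lists the bins traversed by a single stop. It is only an illustration: the helper name time_bins_touched is hypothetical, bins are assumed to be aligned to midnight of the stop's starting day, and a bin starting exactly at the stop's end is assumed to be included.

```python
from datetime import datetime, timedelta


def time_bins_touched(start, end, freq=timedelta(hours=1)):
    """List the start of every `freq` bin touched between `start` and `end`."""
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    # Floor the stop's start to the beginning of its bin.
    n_bins = (start - midnight) // freq
    current = midnight + n_bins * freq
    bins = []
    while current <= end:
        bins.append(current)
        current += freq
    return bins


# A stop from 09:50 to 11:10 touches the 09:00, 10:00 and 11:00 bins.
stop_bins = time_bins_touched(
    datetime(2020, 1, 1, 9, 50), datetime(2020, 1, 1, 11, 10)
)
```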

mobilkit.spatial.filter_to_box(df, minlon, maxlon, minlat, maxlat, lat_col='lat', lon_col='lng')
Parameters:
  • df (DataFrame) – A dataframe containing at least the lat_col and lon_col columns with the raw points’ coordinates.

  • {min,max}{lat,lon} (float) – The min and max values of lat and lon (will keep all coords >= min and <= max).

  • lat_col, lon_col (str) – The names of the columns containing the latitude and longitude of the points in df.

Returns:

df – The original data frame filtered to the points within the box.

Return type:

pd.DataFrame

mobilkit.spatial.findStops(df, tesselation_shp=None, stay_locations_kwds=None, filterAreas=True)

Computes the stops of a group of users using the scikit-mobility tools. Note that the mobilkit[complete] or [skmob] version should be installed to use this tool.

Parameters:
  • df (dask.dataframe) – A dataframe containing the raw pings with at least the latitude, longitude and datetime columns.

  • tesselation_shp (str, optional) – The path to the shapefile used to tessellate the stops (if None, no tessellation will be performed).

  • stay_locations_kwds (dict, optional) – The custom keywords to be passed to mobilkit.spatial._find_stops. If not told otherwise, the stay location keywords passed to skmob.preprocessing.detection are:

    • minutes_for_a_stop=5.0

    • spatial_radius_km=0.2

    • no_data_for_minutes=60*12

    • leaving_time=True

  • filterAreas (bool, optional) – Whether or not to filter out stops found outside of the tessellation when tessellating.

Returns:

stops_df – The dataframe with the latitude and longitude of each stop together with:
  • its starting time (in the mobilkit.dask_schemas.dttColName column);

  • its ending time (saved into the mobilkit.dask_schemas.ldtColName column);

  • the duration of the stop in seconds in the mobilkit.dask_schemas.durColName column;

  • if a tessellation file is specified, an additional mobilkit.dask_schemas.zidColName column telling in which grid cell the stop is falling.

Return type:

dask.dataframe

mobilkit.spatial.haversine_pairwise(X, Y=None, isRadians=False)
Parameters:
  • X, Y (np.array) – Arrays of shape Nx*2 and Ny*2 of (lat,lon) coordinates. If Y is None it is set to X (i.e. the matrix of distances among the X items is computed).

  • isRadians (bool, optional) – Whether the supplied coordinates are already in radians. If not they will be automatically converted.

Returns:

distances – a Nx*Ny matrix of distances in kilometers

Return type:

np.array
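
For reference, the haversine formula behind such a distance matrix can be sketched in plain Python. This is an illustration only: the Earth radius of 6371 km is an assumed constant (the library's value may differ), and the names haversine_km and haversine_pairwise_sketch are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, p)
    lat2, lon2 = map(radians, q)
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


def haversine_pairwise_sketch(X, Y=None):
    """Nx-by-Ny nested list of distances; Y defaults to X."""
    if Y is None:
        Y = X
    return [[haversine_km(p, q) for q in Y] for p in X]


# Two points one degree of longitude apart on the equator (about 111.19 km).
X = [(0.0, 0.0), (0.0, 1.0)]
D = haversine_pairwise_sketch(X)
```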

mobilkit.spatial.makeVoronoi(gdf)
mobilkit.spatial.meanshift(df, bw=0.01, maxpoints=100, **kwargs)

Given the points of a user finds the home location with MeanShift clustering.

Parameters:
  • df (pandas.DataFrame) – With at least latcol,loncol.

  • bw (float) – Bandwidth to be used in MeanShift.

  • maxpoints (int) – The maximum number of points to be used in meanshift. If more, a fraction of the df to have maxpoints will be sampled.

  • kwargs – Will be passed to sklearn.cluster.MeanShift constructor.

Returns:

clust_center – The center of the cluster found in the (longitude,latitude) format.

Return type:

tuple

Note

When determining the home location of a user, please consider that some data providers, like Cuebiq, obfuscate/obscure/alter the coordinates of the points falling near the user’s home location in order to preserve privacy.

This means that you cannot locate the precise home of a user with a spatial resolution higher than the one used to obfuscate these data. If you are interested in the census area (or geohash) of the user’s home alone and you are using a spatial tessellation with a spatial resolution wider than or equal to the one used to obfuscate the data, then this is of no concern.

However, tasks such as stop-detection or POI visit rate computation may be affected by the noise added to data in the user’s home location area. Please check if your data has such noise added and choose the spatial tessellation according to your use case.

mobilkit.spatial.plotActivityCount(df_act, gdf, what='pings', ax=None, kwargs_map=None)

Plots a colormap of the number of pings (or unique users) observed in a given area in a given period.

Parameters:
  • df_act (pandas.dataframe) – A dataframe as returned by mobilkit.spatial.areaStats with at least the tile_ID and pings and/or users columns and passed to pandas.

  • gdf (geopandas.GeoDataFrame) – A geo-dataframe as returned by mobilkit.spatial.tessellate.

  • what (str) – The pings or users string, telling whether to plot the number of pings recorded in an area or the number of unique users seen there.

  • ax (pyplot.axes, optional) – The axes where to plot. If None (default) creates a new figure.

  • kwargs_map (dict, optional) – Will be passed to the geopandas plot function plotting the boundaries and colormap.

Returns:

ax – The axes of the figure.

Return type:

pyplot.axes, optional

mobilkit.spatial.plotHomeWorkPoints(uid, df_hw, gdf, ax=None, kwargs_bounds=None, kwargs_points=None)

Plots the points in home and work hours for a user on the map.

Parameters:
  • uid ((str or int, depending on the uid type)) – The id of the user to plot.

  • df_hw (dask.dataframe) – A dataframe as returned by mobilkit.stats.userHomeWork with at least the uid, tile_ID and isHome and isWork columns.

  • gdf (geopandas.GeoDataFrame) – A geo-dataframe as returned by mobilkit.spatial.tessellate.

  • ax (pyplot.axes, optional) – The axes where to plot. If None (default) creates a new figure.

  • kwargs_bounds (dict, optional) – Will be passed to the geopandas plot function plotting the boundaries.

  • kwargs_points (dict, optional) – Will be passed to the geopandas plot function plotting the points.

Returns:

ax – The axes of the figure.

Return type:

pyplot.axes, optional

mobilkit.spatial.plotHomeWorkUserCount(df_hw_locs, gdf, what='home', ax=None, kwargs_map=None)

Plots a colormap of the number of people living (or working) in each area.

Parameters:
  • df_hw_locs (pandas.dataframe) – A dataframe as returned by mobilkit.stats.userHomeWorkLocation with at least the uid, home_tile_ID and work_tile_ID columns and passed to pandas.

  • gdf (geopandas.GeoDataFrame) – A geo-dataframe as returned by mobilkit.spatial.tessellate.

  • what (str) – The home or work string, telling whether to plot the number of people living or working in an area.

  • ax (pyplot.axes, optional) – The axes where to plot. If None (default) creates a new figure.

  • kwargs_map (dict, optional) – Will be passed to the geopandas plot function plotting the boundaries and colormap.

Returns:

  • ax (pyplot.axes, optional) – The axes of the figure.

  • gdf (geopandas.GeoDataFrame) – The original geo dataframe with an additional column (n_users_home if counting home or n_users_work if counting work). If the column is already in the df it will be overwritten.

  • df (pandas.DataFrame) – The tile_ID -> count of users mapping.

mobilkit.spatial.points_to_medoid(df, latColName='lat', lonColName='lng')

Returns the pd.Series containing the latitude and longitude of the medoid for the current df.

Parameters:
  • df (dask.DataFrame or pd.DataFrame) – The dataframe with at least the latColName and lonColName.

  • latColName, lonColName (str, optional) – The columns of df containing the latitude and longitude.

Returns:

medoid – A series with columns mobilkit.dask_schemas.medLatColName and mobilkit.dask_schemas.medLonColName containing the latitude and longitude of the medoid.

Return type:

pd.Series

mobilkit.spatial.rad_of_gyr(coords: array, center_of_mass=None) -> array
Parameters:
  • coords (np.array) – a Nx*2 array of (lat,lon) coordinates.

  • center_of_mass (np.array, optional) – A (1,2) np array containing the latitude and longitude of the center of mass to be used to compute the ROG (for instance, the user’s home).

Returns:

radius_of_gyrations – The radius of gyration for the selected coords.

Return type:

float
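
The radius of gyration is commonly defined as the root mean square distance of the points from a center of mass. The sketch below follows that definition with haversine distances. It is a hedged illustration: the default center (the plain mean of latitudes and longitudes), the 6371 km Earth radius and the function names are assumptions of this example, not a description of the library's internals.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, p)
    lat2, lon2 = map(radians, q)
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * radius_km * asin(min(1.0, sqrt(a)))


def radius_of_gyration(coords, center_of_mass=None):
    """Root mean square distance (km) of `coords` from the center of mass."""
    if center_of_mass is None:
        # Assumed default: the arithmetic mean of the coordinates.
        n = len(coords)
        center_of_mass = (
            sum(lat for lat, _ in coords) / n,
            sum(lon for _, lon in coords) / n,
        )
    squared = [haversine_km(p, center_of_mass) ** 2 for p in coords]
    return sqrt(sum(squared) / len(squared))


# Two points two degrees apart on the equator: each is about 111.19 km
# from the midpoint, so the radius of gyration is about 111.19 km.
rog = radius_of_gyration([(0.0, 0.0), (0.0, 2.0)])
```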

mobilkit.spatial.replaceAreaID(df, mapping)

Function that replaces all the tile_ID with a new id given in the mapping.

Parameters:
  • df (dask.DataFrame) – A dataframe with at least the tile_ID column.

  • mapping (dict) – A mapping between the original tile_ID and the new desired one. MUST CONTAIN ALL THE tile_ID s present in df.

Returns:

df_out – A copy of the original dataframe with the tile_ID replaced.

Return type:

dask.DataFrame
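
Because the mapping must cover every tile_ID, it can be worth checking it before calling the function. The sketch below shows such a check on a plain list of ids; the helper name replace_tile_ids is hypothetical, and raising a KeyError on missing ids is a choice of this example, since the library's behavior in that case is not documented here.

```python
def replace_tile_ids(tile_ids, mapping):
    """Map every tile id through `mapping`, failing loudly on missing keys."""
    missing = set(tile_ids) - set(mapping)
    if missing:
        raise KeyError(f"mapping lacks tile_ID values: {sorted(missing)}")
    return [mapping[tile_id] for tile_id in tile_ids]


new_ids = replace_tile_ids([0, 1, 1, 2], {0: 10, 1: 11, 2: 12})
```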

mobilkit.spatial.selectAreasFromBounds(gdf, relation='within', min_lon=-99.15913, max_lon=-99.10032, min_lat=19.41353, max_lat=19.461)

Function to select areas from a geodataframe given the bounds of a selected region.

Parameters:
  • gdf (geopandas.GeoDataFrame) – A geodataframe with at least the tile_ID and geometry columns as returned by mobilkit.spatial.tessellate.

  • relation (str, optional) – The relation between the bounds and the areas. “within” or “intersects”

  • min/max_lon/lat (float, optional) – The minimum and maximum latitude and longitude of the box.

Returns:

areas_ids – The set of the areas within or intersecting the given bounds

Return type:

set
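
The difference between the two relations can be illustrated on axis-aligned rectangles. This is a simplified sketch: real tiles are arbitrary geometries handled by geopandas, while here each tile is reduced to its bounding box, and the helper name select_tiles is hypothetical.

```python
def select_tiles(tiles, bounds, relation="within"):
    """Select tile ids whose box is within, or intersects, `bounds`.

    `tiles` maps tile_ID -> (min_lon, min_lat, max_lon, max_lat) and
    `bounds` uses the same layout.
    """
    b_min_lon, b_min_lat, b_max_lon, b_max_lat = bounds
    selected = set()
    for tile_id, (min_lon, min_lat, max_lon, max_lat) in tiles.items():
        within = (
            min_lon >= b_min_lon and max_lon <= b_max_lon
            and min_lat >= b_min_lat and max_lat <= b_max_lat
        )
        intersects = (
            min_lon <= b_max_lon and max_lon >= b_min_lon
            and min_lat <= b_max_lat and max_lat >= b_min_lat
        )
        if (relation == "within" and within) or (
            relation == "intersects" and intersects
        ):
            selected.add(tile_id)
    return selected


tiles = {
    0: (0.0, 0.0, 1.0, 1.0),  # fully inside the bounds
    1: (0.5, 0.5, 2.0, 2.0),  # straddles the upper-right corner
    2: (3.0, 3.0, 4.0, 4.0),  # fully outside
}
bounds = (0.0, 0.0, 1.5, 1.5)
inside = select_tiles(tiles, bounds, relation="within")
touching = select_tiles(tiles, bounds, relation="intersects")
```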

mobilkit.spatial.stack_density_map(df, dates, center, daytime=True, bins=200, radius=1)
Parameters:
  • df (pd.DataFrame) – A dataframe as returned by mobilkit.spatial.compute_population_density with a multi index of date,daytime,uid and as columns the lat and lng coordinates of the mean shift location of the user on that part of the day on that date.

  • dates (list of datetime) – A list of dates when to compute the density.

  • center (tuple-like) – The (lat, lon) of the center where to compute the population density around.

  • daytime (bool) – Whether to compute the density on the daytime or nighttime part of selected dates.

  • bins (int) – The number of bins to use horizontally and vertically in the region around the center.

  • radius (float) – The space to consider above, below, left and right of the center (in the same units as the center).

Returns:

maps, results – maps is the tensor of shape (len(dates),bins,bins) storing for each date the x-y density map as computed by mobilkit.spatial.density_map. results stores the x and y bins.

Return type:

np.array, tuple

mobilkit.spatial.stats_density_map(df, dates, center, daytime=True, bins=200, radius=1, clip_std=0.0001)
Parameters:
  • df (pd.DataFrame) – A dataframe as returned by mobilkit.spatial.compute_population_density with a multi index of date,daytime,uid and as columns the lat and lng coordinates of the mean shift location of the user on that part of the day on that date.

  • dates (list of datetime) – A list of dates when to compute the density.

  • center (tuple-like) – The (lat, lon) of the center where to compute the population density around.

  • daytime (bool) – Whether to compute the density on the daytime or nighttime part of selected dates.

  • bins (int) – The number of bins to use horizontally and vertically in the region around the center.

  • radius (float) – The space to consider above, below, left and right of the center (in the same units as the center).

  • clip_std (float) – Pixels with a 0 or nan std will be clipped to this value when computing the z-score. The same pixels will be set to -1 on output.

Returns:

results

A dictionary containing the key-values:
  • stack the tensor of shape (len(dates),bins,bins) storing for each date the x-y density map as computed by mobilkit.spatial.density_map.

  • avg, std the average and standard deviation population density with shape (1,bins,bins).

  • x_bins, y_bins the bins of the 2d histogram as produced by mobilkit.spatial.density_map.

Return type:

dict

mobilkit.spatial.tessellate(df, tesselation_shp, filterAreas=False, partitions_number=None, latCol='lat', lonCol='lng')

Function to assign to each point a given area index.

Parameters:
  • df (dask.DataFrame) – A dataframe as returned from mobilkit.loader.load_raw_files or imported from scikit-mobility using mobilkit.loader.load_from_skmob.

  • tesselation_shp (str) – The path (relative or absolute) to the shapefile containing the tessellation of space. If the shapefile does not contain a tile_ID field it will be initialized here and included in the returned geodataframe.

  • filterAreas (bool) – If a tessellation is specified, keeps only the points within the specified shapefile.

  • partitions_number (int, optional) – The batch size of the geopandas sjoin function to be applied. Leave it as is unless you know what you’re doing.

  • latCol, lonCol (str, optional) – The names of the columns containing the latitude and longitude coordinates.

Returns:

  • df_tile (dask.dataframe) – The initial dataframe with the additional tile_ID column telling the int id of the area the point is belonging to (-1 if the point is outside of the shapefile bounds).

  • tessellation_gdf (geopandas.GeoDataFrame) – The geo-dataframe with the possibly missing tile_ID column added.

mobilkit.spatial.totalUserTravelDistance(df_pings, doROG=False, freq='1d')

Computes the total distance traveled (in km, computed on the straight lines between consecutive points) by each user in each freq time bin.

Parameters:
  • df_pings (dask.DataFrame) – The dataframe containing at least the mobilkit.dask_schemas.uidColName, mobilkit.dask_schemas.latColName, mobilkit.dask_schemas.lonColName and mobilkit.dask_schemas.dttColName.

  • doROG (bool, optional) – If True also computes the ROG on the pings of that day.

  • freq (str, optional) – The datetime interval to floor the dttColName to (default one day).

Returns:

ttd – The dataframe containing the user,tBin index and:
  • ttd column with the total traveled distance (in km) for that user on that time bin;

  • nPings the number of pings for that user on that time bin;

  • if doROG a column rog with the daily ROG using as center of mass the mean point of that time bin’s coordinates.

Return type:

dask.DataFrame

mobilkit.spatial.total_distance_traveled(coords)
Parameters:

coords (np.array) – a Nx*2 array of (lat,lon) coordinates.

Returns:

total_distance_traveled – The total distance traveled in the dataframe.

Return type:

float
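
A total traveled distance of this kind is the sum of the distances between consecutive points. The sketch below illustrates the idea with haversine distances in km; the 6371 km Earth radius and the function names are assumptions of this example, and the points are assumed to be already sorted in time.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, p)
    lat2, lon2 = map(radians, q)
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * radius_km * asin(min(1.0, sqrt(a)))


def total_distance_sketch(coords):
    """Sum of the distances (km) between each point and the next one."""
    return sum(haversine_km(p, q) for p, q in zip(coords, coords[1:]))


# Three points along the equator, one degree apart: two legs of ~111.19 km.
total = total_distance_sketch([(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)])
```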

mobilkit.spatial.userHomeWorkDistance(r)

Computes the distance between lat/lng_home and lat/lng_work for the row of a user.

Returns:

dist – None if one of the coords is invalid, the distance in km otherwise.

Return type:

float

mobilkit.spatial.user_dist_cbds(df_hw, cbds_latlon, assign_lat_col='lat_work', assign_lng_col='lng_work', distance_lat_col='lat_home', distance_lng_col='lng_home')

Computes each user’s distance from a CBD: the CBD closest to the (assign_lat_col, assign_lng_col) coordinates is assigned to the user, and the distance is then measured between that CBD and the (distance_lat_col, distance_lng_col) coordinates.
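
The two-step logic (assign a CBD from one pair of coordinates, measure the distance from another) can be sketched for a single user as follows. This is an illustration under stated assumptions: distances are haversine distances in km with a 6371 km Earth radius, cbds is a plain list of (lat, lon) tuples, and the helper names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, p)
    lat2, lon2 = map(radians, q)
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * radius_km * asin(min(1.0, sqrt(a)))


def dist_from_assigned_cbd(home, work, cbds):
    """Distance (km) from `home` to the CBD that is closest to `work`."""
    assigned = min(cbds, key=lambda cbd: haversine_km(work, cbd))
    return haversine_km(home, assigned)


cbds = [(0.0, 0.0), (0.0, 10.0)]
# The work location is closest to the first CBD, so the home distance is
# measured from (0, 0) even though the home is nearer to the second CBD.
d = dist_from_assigned_cbd(home=(0.0, 8.0), work=(0.0, 1.0), cbds=cbds)
```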