Mobility for resilience: population analysis

This notebook shows the preliminary steps performed with mobilkit to load raw HFLB (high-frequency location-based) data, estimate the population of each area, and prepare the data for the displacement and POI-visit-rate analyses.

We start by loading the raw HFLB data using the mobilkit.loader module.

[1]:
%matplotlib inline
%config Completer.use_jedi = False

import os
import sys
from copy import copy, deepcopy
from glob import glob
from collections import Counter
import pytz
from datetime import datetime

import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from mpl_toolkits.axes_grid1 import make_axes_locatable
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

import numpy as np
import pandas as pd
import seaborn as sns
import geopandas as gpd
import contextily as ctx
import pyproj
from scipy import stats
from sklearn import cluster

import dask
from dask.distributed import Client
from dask import dataframe as dd

### import mobility libraries
import skmob
import mobilkit

sns.set_context("notebook", font_scale=1.5)
[10]:
dask.__version__ ### tested using Dask version 2020.12.0
[10]:
'2020.12.0'

Load data

Set up Dask

  • Notes:

    • We use the Dask library for high-speed, out-of-core computation on a single machine

      • https://dask.org/

      • Dask accumulates tasks and runs the actual computation only when “.compute()” is called (see the toy sketch after this list)

    • If cluster computing is available, using PySpark is recommended

  • Click the URL of the Dashboard below to monitor progress
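
A toy illustration of this lazy behavior (hypothetical data, not part of the pipeline):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"uid": [1, 1, 2], "acc": [50., 120., 30.]})
toy = dd.from_pandas(pdf, npartitions=2)

mean_acc = toy.groupby("uid").acc.mean()  # only builds the task graph
print(mean_acc.compute())                 # the actual computation runs here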

[12]:
client = Client(address="127.0.0.1:8786") ### connect to the running Dask scheduler
client
[12]:

Client

Cluster

  • Workers: 2
  • Cores: 2
  • Memory: 28.00 GB
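
If no scheduler is listening at that address, a local cluster can be started in-process instead. A minimal sketch (tune the number of workers and the per-worker memory limit to your machine):

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, threads_per_worker=1,
                       memory_limit="14GB")  # memory limit per worker
client = Client(cluster)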

Load raw data using mobilkit interface

[16]:
datapath = "../../data/"
outpath  = "../../results/"

### define temporal cropping parameters (both dates included)
timezone = "America/Mexico_City"

startdate = "2017-09-04"
enddate   = "2017-10-08"

nightendtime = "09:00:00"
nightstarttime = "18:00:00"

# How to translate the original columns into mobilkit's nomenclature
colnames = {"id": "uid",
            "gaid": "gaid",
            "hw": "hw",
            "lat": "lat",
            "lon": "lng",
            "accuracy": "acc",
            "unixtime": "UTC",
            "noise": "noise"
           }

# Where the raw data are stored
filepath = "/data/DataWB/sample/*.part"
ddf = mobilkit.loader.load_raw_files(filepath,
                                     version="wb",
                                     sep=",",
                                     file_schema=colnames,
                                     start_date=startdate,
                                     stop_date=enddate,
                                     timezone=timezone,
                                     header=True,
                                     minAcc=300.,
                                    )

Quickly compute the spatial and temporal extent

We use the mobilkit and skmob column-name conventions.

[23]:
dmin, dmax, lonmin, lonmax, latmin, latmax = dask.compute(ddf.UTC.min(),
                                                          ddf.UTC.max(),
                                                          ddf.lng.min(), ddf.lng.max(),
                                                          ddf.lat.min(), ddf.lat.max()
                                                         )
[54]:
print(mobilkit.loader.fromunix2fulldate(dmin),
      mobilkit.loader.fromunix2fulldate(dmax))
print(lonmin, lonmax, latmin, latmax)
2017-09-03 10:07:17 2017-10-09 07:13:57
-105 -95 15.5 22.55
[55]:
boundary = (lonmin, latmin, lonmax, latmax)
[55]:
(-105, 15.5, -95, 22.55)
[57]:
mobilkit.viz.visualize_boundarymap(boundary)
../_images/examples_M4R_01_Dataloading_cleaning_validating_11_0.png

Sample the dataset (choose a very small fraction)

[28]:
%%time
ddf_sample = ddf.sample(frac=0.0001).compute()
CPU times: user 11.9 s, sys: 736 ms, total: 12.6 s
Wall time: 5min 11s
[29]:
len(ddf_sample)
[29]:
32750
[30]:
mobilkit.viz.visualize_simpleplot(ddf_sample)
/home/ubi/Sandbox/mobilkit_dask/mobenv/lib/python3.9/site-packages/pyproj/crs/crs.py:53: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
  return _prepare_from_string(" ".join(pjargs))
../_images/examples_M4R_01_Dataloading_cleaning_validating_15_1.png

Clean data

  • Some ideas on data cleaning:

    • geographical boundary: analyze data only within a specific area

    • temporal boundary: analyze data only within a specific timeframe

    • users' data quality: select users with more than X data points, etc.

Geographical boundary

[31]:
### define boundary box: (min long, min lat, max long, max lat)
# ==== Parameters === #
bbox = (-106.3, 15.5, -86.3, 29.1)
[32]:
ddf_sc = ddf.map_partitions(mobilkit.loader.crop_spatial, bbox)
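
For reference, the spatial crop applied to each partition amounts to a boolean mask on the coordinate columns. A sketch of the logic (not mobilkit's actual implementation):

def crop_spatial_sketch(df, bbox):
    """Keep only the rows inside bbox = (min lon, min lat, max lon, max lat)."""
    minlon, minlat, maxlon, maxlat = bbox
    return df[df["lng"].between(minlon, maxlon)
              & df["lat"].between(minlat, maxlat)]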

Temporal boundary

This temporal cropping is now done automatically at load time (via the start_date and stop_date arguments).

We only have to filter night hours.

[34]:
nightendtime = "09:00:00"
nightstarttime = "18:00:00"

ddf_tc2 = ddf_sc.map_partitions(mobilkit.loader.crop_time,
                                nightendtime,
                                nightstarttime,
                                timezone)
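
Here crop_time keeps only the pings recorded between nightstarttime and nightendtime. A sketch of the equivalent per-partition logic, assuming the localized datetime column created at load time:

def crop_time_sketch(df, night_end="09:00:00", night_start="18:00:00"):
    """Keep night-time pings: after night_start or before night_end."""
    t = df["datetime"].dt.time
    return df[(t >= pd.to_datetime(night_start).time())
              | (t < pd.to_datetime(night_end).time())]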

Select users with sufficient data points

  • users_totalXpoints : select users with more than X data points throughout the entire period

  • users_Xdays : select users with observations on more than X days

  • users_Xavgps : select users with more than X observations per day

  • users_Xdays_Xavgps : select users that satisfy both criteria
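
These criteria can all be expressed as boolean masks on the per-user statistics computed below. A sketch (daysActive and avg are the userStats columns used below; the pings column name is an assumption):

def select_users(users_stats, min_pings=None, min_days=None, min_avg=None):
    """Return the set of uids passing all the requested thresholds."""
    mask = pd.Series(True, index=users_stats.index)
    if min_pings is not None:  # total data points over the whole period
        mask &= users_stats["pings"] > min_pings  # hypothetical column name
    if min_days is not None:   # days with at least one observation
        mask &= users_stats["daysActive"] > min_days
    if min_avg is not None:    # average points per active day
        mask &= users_stats["avg"] > min_avg
    return set(users_stats.loc[mask, "uid"])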

[38]:
# ==== Parameters === #
mindays = 3
avgpoints = 1

# columns were already renamed to mobilkit's nomenclature ("uid") at load time
users_stats = mobilkit.stats.userStats(ddf).compute()
valid_users = set(users_stats[
                (users_stats["avg"] > avgpoints)
                & (users_stats["daysActive"] > mindays)
                ]["uid"].values)

ddf_clean = mobilkit.stats.filterUsersFromSet(ddf, valid_users)
# This dataset has no home/work ("hw") labels, so we cannot filter on them:
# ddf_clean_homework = ddf_clean[ddf_clean["hw"]=="HOMEWORK"]
# Instead, keep only the events recorded during the night
ddf_clean_homework = ddf_clean[~ddf_clean["datetime"].dt.hour.between(8, 19)]

Home location estimation

Estimation using Meanshift

  • took ~ 2 hours 15 minutes for the entire dataset (mindays=1, avgpoints=0.1)

We compute each user's home location and later split it into latitude and longitude.

[42]:
id_home = ddf_clean_homework.groupby("uid").apply(mobilkit.spatial.meanshift)\
                                .compute()\
                                .reset_index()\
                                .rename(columns={0:"home"})
print("Number of IDs with estimated homes: ",len(id_home))
<ipython-input-42-2437fa6c87f0>:2: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  id_home = ddf_clean_homework.groupby("uid").apply(mobilkit.spatial.meanshift)\
/home/ubi/Sandbox/mobilkit_dask/mobenv/lib/python3.9/site-packages/distributed/worker.py:3445: UserWarning: Large object of size 19.15 MB detected in task graph:
  (['a81dbcb8f4c35834d6619a45d67f34d95911fab1318710d ... e5ce12ca182'],)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  warnings.warn(
Number of IDs with estimated homes:  279541
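
The meta warning above can be silenced by declaring the output schema up front, as Dask suggests. A hedged sketch, assuming meanshift returns one (lon, lat) pair per user, as the cells below imply:

id_home = ddf_clean_homework.groupby("uid")\
                            .apply(mobilkit.spatial.meanshift,
                                   meta=(0, "object"))\
                            .compute()\
                            .reset_index()\
                            .rename(columns={0: "home"})
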
[44]:
### save to csv file
id_home.to_csv(datapath+"id_home_"+str(mindays)+"_"+str(avgpoints).replace(".","")+".csv")
[46]:
id_home["lon"] = id_home["home"].apply(lambda x : x[0])
id_home["lat"] = id_home["home"].apply(lambda x : x[1])
id_home = id_home.drop(columns=["home"])[["uid","lon","lat"]]
id_home.lon = id_home.lon.astype("float64")
id_home.lat = id_home.lat.astype("float64")
[48]:
# Create a geodataframe for spatial queries
idhome_gdf = gpd.GeoDataFrame(id_home, geometry=gpd.points_from_xy(id_home.lon, id_home.lat))
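
gpd.sjoin expects both frames to share a coordinate reference system. Assuming the shapefiles used below are in EPSG:4326 (plain lat/lon), the CRS can be set explicitly:

idhome_gdf = idhome_gdf.set_crs("EPSG:4326")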

Compute administrative region for each ID

Manzana shape data (urban areas only)

[49]:
### load the shape data for the four states and concatenate them
areas = ["09_Manzanas_INV2016_shp","17_Manzanas_INV2016_shp",
         "21_Manzanas_INV2016_shp","29_Manzanas_INV2016_shp"]
manz_shps = []
for i,a in enumerate(areas):
    manz_f = datapath+"spatial/manzanas_shapefiles/"+a+"/"
    manz_shps.append(gpd.read_file(manz_f))
    print("done",i)
manz_shp = gpd.GeoDataFrame(pd.concat(manz_shps, ignore_index=True))
done 0
done 1
done 2
done 3
[50]:
manz_shp = manz_shp[["geometry","CVEGEO",'ENT','MUN','LOC','AGEB', 'MZA']]
manz_shp.head()
[50]:
geometry CVEGEO ENT MUN LOC AGEB MZA
0 POLYGON ((-99.20644 19.51393, -99.20640 19.513... 0900200010010001 09 002 0001 0010 001
1 POLYGON ((-99.20594 19.51418, -99.20586 19.514... 0900200010010002 09 002 0001 0010 002
2 POLYGON ((-99.20526 19.51279, -99.20526 19.512... 0900200010010003 09 002 0001 0010 003
3 POLYGON ((-99.20563 19.51279, -99.20562 19.512... 0900200010010004 09 002 0001 0010 004
4 POLYGON ((-99.20655 19.51278, -99.20653 19.512... 0900200010010005 09 002 0001 0010 005

Stratification by entidad or municipio

[51]:
adm2_f = datapath+"spatial/boundaries_shapefiles/mex_admbnda_adm2_govmex/"
adm2_shp = gpd.read_file(adm2_f)
adm2_shp.boundary.plot()
[51]:
<AxesSubplot:>
../_images/examples_M4R_01_Dataloading_cleaning_validating_32_1.png
[52]:
adm2_shp = adm2_shp[["ADM2_PCODE","geometry"]]
adm2_shp["ent"] = adm2_shp["ADM2_PCODE"].apply(lambda x : x[2:4])
adm2_shp["entmun"] = adm2_shp["ADM2_PCODE"].apply(lambda x : x[2:])
[53]:
adm2_shp.head()
[53]:
ADM2_PCODE geometry ent entmun
0 MX01001 POLYGON ((-102.09775 22.02325, -102.09857 22.0... 01 01001
1 MX01002 POLYGON ((-101.99941 22.21951, -101.99940 22.2... 01 01002
2 MX01003 POLYGON ((-102.57625 21.96778, -102.57626 21.9... 01 01003
3 MX01004 POLYGON ((-102.25320 22.37449, -102.25239 22.3... 01 01004
4 MX01005 POLYGON ((-102.31034 22.03716, -102.30653 22.0... 01 01005

Spatial join with manzana data

  • compute which geographical boundary each home location falls in

[2]:
id_manz = gpd.sjoin(idhome_gdf, manz_shp, how="inner", op='within')
id_manz["loc_code"]  = id_manz["CVEGEO"].apply(lambda x : x[:9])
id_manz["ageb_code"] = id_manz["CVEGEO"].apply(lambda x : x[:13])
id_manz["mza_code"]  = id_manz["CVEGEO"].apply(lambda x : x[:16])
id_manz = id_manz.drop(columns=["LOC","AGEB","MZA"])

Spatial join with entidad/municipio data

[3]:
id_entmun = gpd.sjoin(idhome_gdf, adm2_shp, how="inner", op='within')

Validation using census population data

Population data for all levels

[56]:
poppath = datapath+"sociodemographic/populationdata/"
df_pops = []
for es in ["09","17","21","29"]:
    pop = poppath+"resultados_ageb_urbana_"+es+"_cpv2010.csv"
    df_pops.append(pd.read_csv(pop)[["entidad","mun","loc","ageb","mza","pobtot"]])
df_pop = pd.concat(df_pops, ignore_index=True)

df_pop["CVEGEO"] = df_pop.apply(lambda row: str(row["entidad"]).zfill(2)+
                                str(row["mun"]).zfill(3)+
                                str(row["loc"]).zfill(4)+
                                str(row["ageb"]).zfill(4)+
                                str(row["mza"]).zfill(3), axis=1)
[57]:
df_pop.head()
[57]:
entidad mun loc ageb mza pobtot CVEGEO
0 9 0 0 0000 0 8851080 0900000000000000
1 9 2 0 0000 0 414711 0900200000000000
2 9 2 1 0000 0 414711 0900200010000000
3 9 2 1 0010 0 3424 0900200010010000
4 9 2 1 0010 1 202 0900200010010001
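
In this table, each aggregation level is identified by zeroed-out lower-level codes: state (entidad) totals have mun == 0, loc == 0, ageb == "0000" and mza == 0; municipality totals have mun != 0 with all lower codes zeroed; and so on down to the individual manzana rows. The filters below select exactly one level at a time.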

Entidad level

[59]:
ent_ids = id_entmun.groupby("ent").uid.count().reset_index()
ent_pop = df_pop[(df_pop["mun"]==0) & (df_pop["loc"]==0)
                 & (df_pop["ageb"]=="0000") & (df_pop["mza"]==0)][["entidad","pobtot"]]
ent_pop["ent"] = ent_pop["entidad"].apply(lambda x : str(x).zfill(2))
ent_ids_pop = ent_pop.merge(ent_ids, on="ent")

Municipio level

[60]:
mun_ids = id_entmun.groupby("entmun").uid.count().reset_index()
mun_pop = df_pop[(df_pop["mun"]!=0) & (df_pop["loc"]==0)
                 & (df_pop["ageb"]=="0000") & (df_pop["mza"]==0)][["CVEGEO","pobtot"]]
mun_pop["entmun"] = mun_pop["CVEGEO"].apply(lambda x : str(x)[:5])
mun_ids_pop = mun_pop.merge(mun_ids, on="entmun")

Localidades level

[61]:
loc_ids = id_manz.groupby("loc_code").uid.count().reset_index()
loc_pop = df_pop[(df_pop["mun"]!=0) & (df_pop["loc"]!=0)
                 & (df_pop["ageb"]=="0000") & (df_pop["mza"]==0)][["CVEGEO","pobtot"]]
loc_pop["loc_code"] = loc_pop["CVEGEO"].apply(lambda x : str(x)[:9])
loc_ids_pop = loc_pop.merge(loc_ids, on="loc_code")

AGEB level

[62]:
ageb_ids = id_manz.groupby("ageb_code").uid.count().reset_index()
ageb_pop = df_pop[(df_pop["mun"]!=0) & (df_pop["loc"]!=0)
                 & (df_pop["ageb"]!="0000") & (df_pop["mza"]==0)][["CVEGEO","pobtot"]]
ageb_pop["ageb_code"] = ageb_pop["CVEGEO"].apply(lambda x : str(x)[:13])
ageb_ids_pop = ageb_pop.merge(ageb_ids, on="ageb_code")

Manzana level

[63]:
mza_ids = id_manz.groupby("mza_code").uid.count().reset_index()
mza_pop = df_pop[(df_pop["mun"]!=0) & (df_pop["loc"]!=0)
                 & (df_pop["ageb"]!="0000") & (df_pop["mza"]!=0)][["CVEGEO","pobtot"]]
mza_pop["mza_code"] = mza_pop["CVEGEO"].apply(lambda x : str(x)[:17])
mza_ids_pop = mza_pop.merge(mza_ids, on="mza_code")

Plot census population vs. mobile phone data

[64]:
def plot_compare(df, ax, title):
    df["logpop"] = np.log10(df["pobtot"])
    df["loguser"] = np.log10(df["uid"])
    # areas with zero population or zero users give -inf: drop them
    df = df.replace([np.inf, -np.inf], np.nan).dropna()
    ax.scatter(df["logpop"].values, df["loguser"].values, color="b", s=15)
    slope, intercept, r_value, p_value, std_err = stats.linregress(
        df["logpop"].values, df["loguser"].values)
    ax.plot([0, np.max(df["logpop"])*1.1],
            [intercept, intercept + np.max(df["logpop"])*1.1*slope],
            linestyle="-", color="gray")
    ax.set_xlim(np.min(df["logpop"]), np.max(df["logpop"])*1.1)
    ax.set_ylim(0, np.max(df["loguser"])*1.1)
    ax.set_xlabel(r"$log_{10}$(Census population)", fontsize=14)
    ax.set_ylabel(r"$log_{10}$(Unique users)", fontsize=14)
    ax.annotate("Slope: %.3f\nr: %.3f" % (slope, r_value),
                xy=(.1, 0.7),
                xycoords='axes fraction', color="k", fontsize=14)
    ax.set_title(title, fontsize=16)
[65]:
fig = plt.figure(figsize=(10,8))
gs=GridSpec(2,2)

ax0 = fig.add_subplot(gs[0,0])
plot_compare(mun_ids_pop, ax0, "Municipio")

ax1 = fig.add_subplot(gs[0,1])
plot_compare(loc_ids_pop, ax1, "Localidades")

ax2 = fig.add_subplot(gs[1,0])
plot_compare(ageb_ids_pop, ax2, "AGEBs")

ax3 = fig.add_subplot(gs[1,1])
plot_compare(mza_ids_pop, ax3, "Manzanas")

plt.tight_layout()
# plt.savefig(outpath+"represent_manzana_eq.png",
#             dpi=300, bbox_inches='tight', pad_inches=0.05)
plt.show()
/home/ubi/Sandbox/mobilkit_dask/mobenv/lib/python3.9/site-packages/pandas/core/arraylike.py:274: RuntimeWarning: divide by zero encountered in log10
  result = getattr(ufunc, method)(*inputs, **kwargs)
/home/ubi/Sandbox/mobilkit_dask/mobenv/lib/python3.9/site-packages/pandas/core/arraylike.py:274: RuntimeWarning: divide by zero encountered in log10
  result = getattr(ufunc, method)(*inputs, **kwargs)
../_images/examples_M4R_01_Dataloading_cleaning_validating_54_1.png

Plot sampling rate on map

[66]:
mun_ids_pop["rate"] = mun_ids_pop["uid"]/mun_ids_pop["pobtot"]
mun_ids_pop["pcode"] = mun_ids_pop["entmun"].apply(lambda x : "MX"+str(x))

mun_ids_pop_shp = adm2_shp.merge(mun_ids_pop, on="entmun", how="right")
[68]:
fig,ax = plt.subplots(figsize=(10,10))
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.1)
mun_ids_pop_shp.plot(ax=ax, column='rate', cmap='OrRd', legend=True,
                    cax=cax, legend_kwds={'label': "Sampling rate"}, alpha=0.65)
mun_ids_pop_shp.boundary.plot(ax=ax, color="k", linewidth=0.5)
[68]:
<AxesSubplot:>
../_images/examples_M4R_01_Dataloading_cleaning_validating_57_1.png