mobilkit.loader module

loader.py contains a set of tools to load and prepare the database from raw files.

mobilkit.loader.compute_datetime_col(df, selected_tz)

mobilkit.loader.crop_date(dff, startdt, enddt, timezone='America/Mexico_City')

mobilkit.loader.crop_spatial(dff, bbox): Filters dff, keeping the points inside the bounding box bbox=[minlon,minlat,maxlon,maxlat].
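As a rough illustration of the bounding-box convention, the sketch below applies the same [minlon,minlat,maxlon,maxlat] test to plain Python records. It is not mobilkit's implementation: the helper name in_bbox, the inclusive boundaries and the sample points are assumptions made for the example.

```python
def in_bbox(lat, lng, bbox):
    """Return True if (lat, lng) falls inside bbox = [minlon, minlat, maxlon, maxlat]."""
    min_lon, min_lat, max_lon, max_lat = bbox
    # Boundaries are treated as inclusive here (an assumption of this sketch).
    return min_lon <= lng <= max_lon and min_lat <= lat <= max_lat


# Hypothetical points: two around Mexico City, one far away.
points = [
    {"uid": "a", "lat": 19.43, "lng": -99.13},
    {"uid": "b", "lat": 19.30, "lng": -99.20},
    {"uid": "c", "lat": 40.71, "lng": -74.00},
]
bbox = [-99.40, 19.00, -98.90, 19.60]  # [minlon, minlat, maxlon, maxlat]

kept = [p for p in points if in_bbox(p["lat"], p["lng"], bbox)]
kept_uids = [p["uid"] for p in kept]
print(kept_uids)
```

Note that the box lists longitude before latitude, which is easy to swap by mistake when the data columns are ordered lat, lng.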

mobilkit.loader.crop_time(dff, nighttime_start, nighttime_end, timezone)

mobilkit.loader.dask_to_skmob(df, **kwargs)

Converts a dask dataframe to scikit-mobility (skmob). Given how skmob is structured, the only supported target is a skmob.TrajDataFrame.

Parameters:

df (dask.dataframe) – A dask dataframe with at least the uid, lat and lng columns.
**kwargs – Will be passed to skmob.TrajDataFrame.

Returns:

df_sp – A skmob.TrajDataFrame containing the input columns.

Return type:

skmob.TrajDataFrame

mobilkit.loader.filterStartStopDates(df, start_date, stop_date, tz)

mobilkit.loader.fromunix2date(x, timezone='America/Mexico_City'): Inherited from D4R.

mobilkit.loader.fromunix2fulldate(x, timezone='America/Mexico_City'): Inherited from D4R.

mobilkit.loader.fromunix2time(x, timezone='America/Mexico_City'): Inherited from D4R.

mobilkit.loader.loadGeolifeData(path, acc_default=1, timezone='Asia/Shanghai')

Loads the Geolife v1.3 trajectories, with files laid out following the

GeolifeTrajectories1.3/Data/000/Trajectory/20090401202331.plt

structure. Each file starts with 6 header rows, which are skipped, followed by rows in the

lat,lng,0,altitude,days,date,time

format.

Parameters:

path (str) – The path to the root of the geolife data, usually called data/GeolifeTrajectories1.3.
acc_default (float) – The default accuracy to give to each point to replicate the mobilkit format.
timezone (str) – The code of the timezone the data has been recorded in.

Returns:

df – The dataframe containing the uid, UTC, datetime, acc, lat, lng columns.

Return type:

pd.DataFrame
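The file layout described for loadGeolifeData can be sketched with the standard library alone. This is only an illustration of the format (6 header rows, then lat,lng,0,altitude,days,date,time), not the function's actual code: the helper parse_plt and the sample content are assumptions, and the timezone handling that produces the UTC and datetime columns is deliberately left out.

```python
import csv
import io

# Illustrative .plt content: six header rows, then lat,lng,0,altitude,days,date,time.
SAMPLE_PLT = """Geolife trajectory
WGS 84
Altitude is in Feet
Reserved 3
0,2,255,My Track,0,0,2,8421376
0
39.984702,116.318417,0,492,39744.1201851852,2008-10-23,02:53:04
39.984683,116.318450,0,492,39744.1202546296,2008-10-23,02:53:10
"""


def parse_plt(handle, uid, acc_default=1):
    """Parse one .plt file into a list of dicts (hypothetical helper)."""
    rows = []
    for i, fields in enumerate(csv.reader(handle)):
        if i < 6 or not fields:  # skip the six header rows and blank lines
            continue
        lat, lng, _zero, _altitude, _days, date, time = fields
        rows.append({
            "uid": uid,
            "lat": float(lat),
            "lng": float(lng),
            "acc": acc_default,  # constant accuracy assigned to every point
            "date": date,
            "time": time,
        })
    return rows


records = parse_plt(io.StringIO(SAMPLE_PLT), uid="000")
print(len(records), records[0]["lat"], records[0]["date"])
```

Here the uid is passed in by hand; in the directory structure above it would naturally correspond to the user folder name (000 in the example path).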

mobilkit.loader.load_from_skmob(df, uid='user', npartitions=10)

Loads a dataframe imported with scikit-mobility and returns a dask dataframe.

Parameters:

df (scikit-mobility.dataframe) – A dataframe as imported from scikit-mobility. May already contain the tile_ID and uid columns. If no uid column is found, one is created and filled with the value of the uid argument.
uid (str, optional) – The user id to assign when the uid column is missing; if the column is present, its values are kept.
npartitions (int, optional) – The number of partitions to split the dataframe into.

Returns:

df_sp – A dask.dataframe containing the input columns plus the accuracy column acc (filled with a dummy value of 1) and the uid column if it was missing.

Return type:

dask.dataframe

mobilkit.loader.load_raw_files(pattern, version='hflb', timezone=None, start_date=None, stop_date=None, minAcc=300, sep='\t', file_schema=None, **kwargs)

Function that loads the files and returns the dataframe.

Parameters:

pattern (str) – The pattern of the raw files with bash syntax. For example: 'sample_data/20*/part-*.csv.gz'.
version (str, optional) – One of hflb, wb or csv, the format in which data are stored.
timezone (str, optional) – The timezone in pytz syntax (e.g., “Europe/Rome” or “America/Mexico_City”) used to localize the Unix timestamps in the raw data. If no timezone is specified (default), UTC is used.
start_date (str, optional) – The first day to consider, in the “yyyy-mm-dd” format. It is localized in timezone if given.
stop_date (str, optional) – The last day to consider, in the “yyyy-mm-dd” format. It is localized in timezone if given. This day is INCLUDED.
minAcc (int, optional) – The minimum accuracy for a point to be kept, expressed as a threshold on the acc value: points whose acc is larger than this are discarded. NOTE that the accuracy column must be called acc.
sep (str, optional) – The delimiter of the fields in the files.
file_schema (list of tuples or dict) – The schema of the file. Defaults to dask_schemas.eventLineRAW. If version='wb', file_schema is instead a dictionary mapping the original columns to the mobilkit nomenclature. NOTE that the accuracy column must be called acc.
**kwargs – Passed to mobilkit.loader.load_raw_files_hflb if version='hflb', or to mobilkit.loader.load_raw_files_wb if version='wb'.

Returns:

df – A representation of the dataframe.

Return type:

dask.dataframe
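The date and accuracy filters documented for load_raw_files can be pictured with the following minimal sketch. It only mirrors the rules stated above (stop day included, points with accuracy above minAcc dropped, UTC when no timezone is given); keep_record and ts are hypothetical helpers, and the way the bounds are localized here works for datetime.timezone objects, not necessarily for pytz ones.

```python
from datetime import datetime, timedelta, timezone


def keep_record(utc_ts, acc, start_date, stop_date, min_acc=300, tz=timezone.utc):
    """Hypothetical filter mirroring the documented parameters."""
    # Localize the "yyyy-mm-dd" bounds in tz; the stop day is included,
    # so the upper bound is midnight of the following day (exclusive).
    start = datetime.strptime(start_date, "%Y-%m-%d").replace(tzinfo=tz)
    stop = datetime.strptime(stop_date, "%Y-%m-%d").replace(tzinfo=tz) + timedelta(days=1)
    when = datetime.fromtimestamp(utc_ts, tz=tz)
    # Points with accuracy larger than min_acc are discarded.
    return start <= when < stop and acc <= min_acc


def ts(year, month, day, hour):
    """Unix timestamp of a UTC wall-clock time (helper for the example)."""
    return datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp()


events = [
    (ts(2020, 1, 1, 12), 50),   # in range, accurate
    (ts(2020, 1, 2, 23), 300),  # last day is included; acc equals the threshold
    (ts(2020, 1, 3, 0), 10),    # first instant after the stop day
    (ts(2020, 1, 1, 8), 900),   # in range but too inaccurate
]
flags = [keep_record(t, a, "2020-01-01", "2020-01-02") for t, a in events]
print(flags)
```

Expressing the inclusive stop day as an exclusive bound on the next midnight avoids having to pick a "last instant" of the day.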

mobilkit.loader.load_raw_files_custom(pattern, timezone=None, start_date=None, stop_date=None, minAcc=300, sep='\t', file_schema=None, partition_size=None, **kwargs)

Function that loads the files and returns the dask dataframe. Note that this function is lazy, meaning that it only constructs the dataframe and does not build it (it will be built the first time a query is performed on it).

Parameters:

pattern (str) – The pattern of the raw files with bash syntax. For example: 'sample_data/20*/part-*.csv.gz'.
timezone (str, optional) – The timezone in pytz syntax (e.g., “Europe/Rome” or “America/Mexico_City”) used to localize the Unix timestamps in the raw data. If no timezone is specified (default), UTC is used.
start_date (str, optional) – The first day to consider, in the “yyyy-mm-dd” format. It is localized in timezone if given.
stop_date (str, optional) – The last day to consider, in the “yyyy-mm-dd” format. It is localized in timezone if given. This day is INCLUDED.
minAcc (int, optional) – The minimum accuracy for a point to be kept, expressed as a threshold on the acc value: points whose acc is larger than this are discarded. NOTE that the accuracy column must be called acc.
sep (str, optional) – The delimiter of the fields in the files.
file_schema (list of tuples) – The schema of the file. Defaults to dask_schemas.eventLineRAW. NOTE that the accuracy column must be called acc.

Returns:

df – A representation of the dataframe.

Return type:

dask.dataframe

mobilkit.loader.load_raw_files_hflb(pattern, timezone=None, start_date=None, stop_date=None, minAcc=300, sep='\t', file_schema=None, partition_size=None)

Function that loads the files and returns the dask dataframe. Note that this function is lazy, meaning that it only constructs the dataframe and does not build it (it will be built the first time a query is performed on it).

Parameters:

pattern (str) – The pattern of the raw files with bash syntax. For example: 'sample_data/20*/part-*.csv.gz'.
timezone (str, optional) – The timezone in pytz syntax (e.g., “Europe/Rome” or “America/Mexico_City”) used to localize the Unix timestamps in the raw data. If no timezone is specified (default), UTC is used.
start_date (str, optional) – The first day to consider, in the “yyyy-mm-dd” format. It is localized in timezone if given.
stop_date (str, optional) – The last day to consider, in the “yyyy-mm-dd” format. It is localized in timezone if given. This day is INCLUDED.
minAcc (int, optional) – The minimum accuracy for a point to be kept, expressed as a threshold on the acc value: points whose acc is larger than this are discarded. NOTE that the accuracy column must be called acc.
sep (str, optional) – The delimiter of the fields in the files.
file_schema (list of tuples) – The schema of the file. Defaults to dask_schemas.eventLineRAW. NOTE that the accuracy column must be called acc.

Returns:

df – A representation of the dataframe.

Return type:

dask.dataframe

mobilkit.loader.load_raw_files_wb(pattern, timezone=None, header=False, start_date=None, stop_date=None, minAcc=300, sep='\t', file_schema=None, **kwargs)

Function that loads the files and returns the dask dataframe. Note that this function is lazy, meaning that it only constructs the dataframe and does not build it (it will be built the first time a query is performed on it).

Parameters:

pattern (str) – The pattern of the raw files with bash syntax. For example: 'sample_data/20*/part-*.csv.gz'.
timezone (str, optional) – The timezone in pytz syntax (e.g., “Europe/Rome” or “America/Mexico_City”) used to localize the Unix timestamps in the raw data. If no timezone is specified (default), UTC is used.
start_date (str, optional) – The first day to consider, in the “yyyy-mm-dd” format. It is localized in timezone if given.
stop_date (str, optional) – The last day to consider, in the “yyyy-mm-dd” format. It is localized in timezone if given. This day is INCLUDED.
minAcc (int, optional) – The minimum accuracy for a point to be kept, expressed as a threshold on the acc value: points whose acc is larger than this are discarded. NOTE that the accuracy column must be called acc.
sep (str, optional) – The delimiter of the fields in the files.
file_schema (dict) – The dictionary used to rename the original columns to the mobilkit ones. NOTE that the accuracy column must be called acc.

Returns:

df – A representation of the dataframe.

Return type:

dask.dataframe
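To make the role of file_schema in load_raw_files_wb more concrete, the sketch below renames the columns of a tab-separated sample with a dictionary. It is a standalone illustration rather than the library's code: the original column names (device_id, timestamp, and so on), the sample row and the helper rename_columns are all hypothetical; only the target names uid, UTC, acc, lat, lng come from the mobilkit nomenclature used on this page.

```python
import csv
import io

# Hypothetical provider column names mapped to the mobilkit ones.
file_schema = {
    "device_id": "uid",
    "timestamp": "UTC",
    "horizontal_accuracy": "acc",  # the accuracy column must end up as "acc"
    "latitude": "lat",
    "longitude": "lng",
}

# Made-up tab-separated sample with a header row.
RAW = (
    "device_id\ttimestamp\thorizontal_accuracy\tlatitude\tlongitude\n"
    "u1\t1577880000\t25\t19.43\t-99.13\n"
)


def rename_columns(row, schema):
    """Rename the keys of one record, leaving unknown columns untouched."""
    return {schema.get(name, name): value for name, value in row.items()}


reader = csv.DictReader(io.StringIO(RAW), delimiter="\t")
renamed = [rename_columns(row, file_schema) for row in reader]
print(renamed[0])
```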

mobilkit.loader.loaddata_takeapeek(dirpath, sep, ext)

mobilkit.loader.localizeDatetimeNaive(date, tz, date_format='%Y-%m-%d')

mobilkit.loader.persistDF(df, path, overwrite=True, header=True, index=False, out_format='csv')

Save a dask dataframe to file.

Parameters:

df (dask.DataFrame) – The dataframe to save.
path (str) – The path where to save the dataframe.
overwrite (bool) – Whether or not to force overwrite.
header (bool) – Whether or not to put the header in the output file.
index (bool) – Whether or not to put the index column in the output file.
out_format (str) – The format to use, one of csv or parquet. If the df has arrays in it, use parquet.

mobilkit.loader.reloadDF(path, header=True, in_format='csv')

Load a dask dataframe from file.

Parameters:

path (str) – The path where to read the dataframe from.
header (bool) – Whether or not to read the header from the input file.
in_format (str) – The format used to persist the df, one of csv or parquet.

Returns:

df – The loaded dataframe.

Return type:

dask.DataFrame

mobilkit.loader.syntheticGeoLifeDay(df_geolife, selected_day)

mobilkit.loader.syntheticGeoLifeWeek(df_geolife, selected_week)