Loading data

`load_raw_files`(pattern[, version, timezone, ...])	Loads the raw files and returns a dask dataframe.
`load_from_skmob`(df[, uid, npartitions])	Loads a dataframe imported with skmobility and returns a dask dataframe.
`dask_to_skmob`(df, **kwargs)	Ports a dataframe from dask to skmob.

loader.py contains a set of tools to load and prepare the database from raw files.

mobilkit.loader.dask_to_skmob(df, **kwargs)

Ports a dataframe from dask to skmob. Because of how skmob is structured, the only supported target is a skmob.TrajDataFrame.

Parameters:

df (dask.dataframe) – A dask dataframe with at least the uid, lat and lng columns.
**kwargs – Will be passed to skmob.TrajDataFrame.

Returns:

df_sp – A skmob.TrajDataFrame containing the input columns.

Return type:

skmob.dataframe

mobilkit.loader.load_from_skmob(df, uid='user', npartitions=10)

Loads a dataframe imported with skmobility and returns a dask dataframe.

Parameters:

df (scikit-mobility.dataframe) – A dataframe as imported from scikit-mobility. It may already contain the tile_ID and uid columns. If no uid column is found, one is created and filled with the value of uid.
uid (str, optional) – The user id assigned to every row when the dataframe has no uid column. If the column is present, its existing values are kept.
npartitions (int, optional) – The number of partitions to split the dataframe into.

Returns:

df_sp – A dask.dataframe containing the input columns plus the accuracy column acc (filled with a dummy value of 1) and the uid column, if it was missing.

Return type:

dask.dataframe
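
A minimal pure-Python sketch of the column handling described above, assuming a table stored as a dict of column lists. The helper name fill_missing_columns and the sample values are hypothetical and not part of mobilkit; the real function works on a scikit-mobility dataframe and returns a dask dataframe. Whether an existing acc column would be overwritten is not stated in the description, so the sketch simply sets it.

```python
def fill_missing_columns(table, uid="user"):
    """Return a copy of `table` that is guaranteed to have `uid` and `acc` columns.

    `table` is a dict mapping column names to equally long lists (an assumption
    made for this sketch only).
    """
    n_rows = len(table["lat"])
    out = {name: list(values) for name, values in table.items()}
    # Existing user ids are kept; a missing column is filled with one default value.
    if "uid" not in out:
        out["uid"] = [uid] * n_rows
    # The accuracy column is a placeholder with a dummy value of 1.
    out["acc"] = [1] * n_rows
    return out


without_uid = {"lat": [45.07, 45.08], "lng": [7.69, 7.70]}
with_uid = {"uid": ["u42", "u43"], "lat": [45.07, 45.08], "lng": [7.69, 7.70]}

filled_default = fill_missing_columns(without_uid)
filled_existing = fill_missing_columns(with_uid)
```

In this sketch the input table is copied rather than modified, so the caller's data stays unchanged.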

mobilkit.loader.load_raw_files(pattern, version='hflb', timezone=None, start_date=None, stop_date=None, minAcc=300, sep='\t', file_schema=None, **kwargs)

Loads the raw files and returns a dask dataframe.

Parameters:

pattern (str) – The pattern of the raw files with bash syntax. For example: 'sample_data/20*/part-*.csv.gz'.
version (str, optional) – One of hflb, wb or csv, the format in which data are stored.
timezone (str, optional) – The timezone in pytz syntax (e.g., “Europe/Rome” or “America/Mexico_City”) used to localize the Unix timestamps in the raw data. If no timezone is specified (the default), UTC is used.
start_date (str, optional) – The first day of data to consider, in “yyyy-mm-dd” format. It is localized in timezone if given.
stop_date (str, optional) – The last day of data to consider, in “yyyy-mm-dd” format. It is localized in timezone if given. This day is INCLUDED.
minAcc (int, optional) – The accuracy threshold for a point to be kept. Despite the name, it acts as an upper bound: points whose accuracy value is larger than this are discarded. NOTE that the accuracy column must be called acc.
sep (str, optional) – The delimiter of the fields in the files.
file_schema (list of tuples) – The schema of the file. Defaults to dask_schemas.eventLineRAW. If version='wb', file_schema is instead a dictionary that maps the original columns to the mobilkit nomenclature. NOTE that the accuracy column must be called acc.
**kwargs – Passed to mobilkit.loader.load_raw_files_hflb if version='hflb', otherwise to mobilkit.loader.load_raw_files_wb if version='wb'.

Returns:

df – A representation of the dataframe.

Return type:

dask.dataframe
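
The interplay of timezone, start_date and stop_date can be sketched with the standard library alone. This is an illustration of the documented semantics (timestamps localized before filtering, stop day included), not mobilkit's implementation. The helper in_date_window is hypothetical, and a fixed UTC+1 offset stands in for a named zone such as “Europe/Rome” in winter.

```python
from datetime import datetime, timedelta, timezone


def in_date_window(unix_ts, tz, start_date, stop_date):
    """Return True if the localized timestamp lies in [start_date, stop_date].

    Both dates use the "yyyy-mm-dd" format and the stop day is included.
    """
    local = datetime.fromtimestamp(unix_ts, tz=tz)
    start = datetime.strptime(start_date, "%Y-%m-%d").replace(tzinfo=tz)
    # Including the whole stop day means the exclusive bound is the next midnight.
    stop = datetime.strptime(stop_date, "%Y-%m-%d").replace(tzinfo=tz) + timedelta(days=1)
    return start <= local < stop


# Fixed-offset stand-in for "Europe/Rome" in winter (assumption for this sketch).
cet = timezone(timedelta(hours=1))

# 2020-01-31 23:30:00 UTC is 2020-02-01 00:30:00 at UTC+1.
ts = int(datetime(2020, 1, 31, 23, 30, tzinfo=timezone.utc).timestamp())

in_rome_window = in_date_window(ts, cet, "2020-02-01", "2020-02-01")
in_utc_window = in_date_window(ts, timezone.utc, "2020-02-01", "2020-02-01")
```

The same point falls inside the one-day window when localized at UTC+1 and outside it in UTC, which is why the timezone should match the region where the data were collected.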

mobilkit.loader.load_raw_files_custom(pattern, timezone=None, start_date=None, stop_date=None, minAcc=300, sep='\t', file_schema=None, partition_size=None, **kwargs)

Loads the files and returns the dask dataframe. Note that this function is lazy: it only defines the dataframe and does not compute it (it is computed the first time a query is performed on it).

Parameters:

pattern (str) – The pattern of the raw files with bash syntax. For example: 'sample_data/20*/part-*.csv.gz'.
timezone (str, optional) – The timezone in pytz syntax (e.g., “Europe/Rome” or “America/Mexico_City”) used to localize the Unix timestamps in the raw data. If no timezone is specified (the default), UTC is used.
start_date (str, optional) – The first day of data to consider, in “yyyy-mm-dd” format. It is localized in timezone if given.
stop_date (str, optional) – The last day of data to consider, in “yyyy-mm-dd” format. It is localized in timezone if given. This day is INCLUDED.
minAcc (int, optional) – The accuracy threshold for a point to be kept. Despite the name, it acts as an upper bound: points whose accuracy value is larger than this are discarded. NOTE that the accuracy column must be called acc.
sep (str, optional) – The delimiter of the fields in the files.
file_schema (list of tuples) – The schema of the file. Defaults to dask_schemas.eventLineRAW. NOTE that the accuracy column must be called acc.

Returns:

df – A representation of the dataframe.

Return type:

dask.dataframe
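
The combination of a glob pattern, lazy evaluation and the minAcc filter can be illustrated without dask. The sketch below is an analogy built on a Python generator, not the library's code: the function iter_points, the four-column layout (uid, lat, lng, acc) and the sample rows are assumptions made for the example, and the real default schema is dask_schemas.eventLineRAW.

```python
import csv
import glob
import gzip
import os
import shutil
import tempfile


def iter_points(pattern, min_acc=300, sep="\t"):
    """Lazily yield rows whose accuracy value is at most `min_acc`."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", newline="") as handle:
            for uid, lat, lng, acc in csv.reader(handle, delimiter=sep):
                # Points with an accuracy value larger than the threshold are dropped.
                if float(acc) <= min_acc:
                    yield {"uid": uid, "lat": float(lat), "lng": float(lng), "acc": float(acc)}


# Build a tiny sample tree that matches 'sample_data/20*/part-*.csv.gz'.
root = tempfile.mkdtemp()
day_dir = os.path.join(root, "sample_data", "2020-01-01")
os.makedirs(day_dir)
with gzip.open(os.path.join(day_dir, "part-0000.csv.gz"), "wt", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["u1", "45.07", "7.69", "25"])    # kept: 25 <= 300
    writer.writerow(["u1", "45.08", "7.70", "1200"])  # dropped: 1200 > 300

pattern = os.path.join(root, "sample_data", "20*", "part-*.csv.gz")
points = iter_points(pattern)  # nothing has been read yet
kept = list(points)            # files are opened and parsed only here

shutil.rmtree(root)  # remove the temporary sample files
```

As with the lazy dask dataframe, building the pipeline is cheap and the cost of reading the files is paid only when the result is requested.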

mobilkit.loader.load_raw_files_wb(pattern, timezone=None, header=False, start_date=None, stop_date=None, minAcc=300, sep='\t', file_schema=None, **kwargs)

Loads the files and returns the dask dataframe. Note that this function is lazy: it only defines the dataframe and does not compute it (it is computed the first time a query is performed on it).

Parameters:

pattern (str) – The pattern of the raw files with bash syntax. For example: 'sample_data/20*/part-*.csv.gz'.
timezone (str, optional) – The timezone in pytz syntax (e.g., “Europe/Rome” or “America/Mexico_City”) used to localize the Unix timestamps in the raw data. If no timezone is specified (the default), UTC is used.
start_date (str, optional) – The first day of data to consider, in “yyyy-mm-dd” format. It is localized in timezone if given.
stop_date (str, optional) – The last day of data to consider, in “yyyy-mm-dd” format. It is localized in timezone if given. This day is INCLUDED.
minAcc (int, optional) – The accuracy threshold for a point to be kept. Despite the name, it acts as an upper bound: points whose accuracy value is larger than this are discarded. NOTE that the accuracy column must be called acc.
sep (str, optional) – The delimiter of the fields in the files.
file_schema (dict) – The dictionary used to rename the original columns to the mobilkit ones. NOTE that the accuracy column must be called acc.

Returns:

df – A representation of the dataframe.

Return type:

dask.dataframe
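
A renaming dictionary of the kind file_schema expects here might look like the one below. This is a standard-library sketch only: the raw column names (device_id, latitude, longitude, horizontal_accuracy) and the helper rename_columns are hypothetical, while the target names uid, lat, lng and acc are the ones mentioned in this page.

```python
import csv
import io


def rename_columns(rows, file_schema):
    """Yield rows with keys renamed according to `file_schema`.

    Columns that are not listed keep their original name (a choice made for
    this sketch, not a statement about mobilkit's behavior).
    """
    for row in rows:
        yield {file_schema.get(name, name): value for name, value in row.items()}


# Hypothetical tab-separated raw file with a header line.
raw = io.StringIO(
    "device_id\tlatitude\tlongitude\thorizontal_accuracy\n"
    "u1\t19.43\t-99.13\t12\n"
)

# Original column name -> mobilkit column name; the accuracy column must become acc.
file_schema = {
    "device_id": "uid",
    "latitude": "lat",
    "longitude": "lng",
    "horizontal_accuracy": "acc",
}

rows = list(rename_columns(csv.DictReader(raw, delimiter="\t"), file_schema))
```

Mapping the accuracy column to acc matters because the minAcc filter looks for a column with exactly that name.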