Loading data
loader.py contains a set of tools to load and prepare the database from raw files.
- mobilkit.loader.dask_to_skmob(df, **kwargs)
Ports a dataframe from dask to skmob. Given the structure of skmob, the only supported target is a skmob.TrajDataFrame.
- Parameters:
df (dask.dataframe) – A dask dataframe with at least the uid, lat and lng columns.
**kwargs – Will be passed to skmob.TrajDataFrame.
- Returns:
df_sp – A skmob.TrajDataFrame containing the input columns.
- Return type:
skmob.dataframe
- mobilkit.loader.load_from_skmob(df, uid='user', npartitions=10)
Loads a dataframe imported with scikit-mobility and returns a dask dataframe.
- Parameters:
df (scikit-mobility.dataframe) – A dataframe as imported from scikit-mobility. It may already contain the tile_ID and uid columns. If no uid column is found, one is created and initialized to the uid value.
uid (str, optional) – The user ID to use when the dataframe has no uid column. If the uid column is present, its values are kept.
npartitions (int, optional) – The number of partitions the dataframe is split into.
- Returns:
df_sp – A dask.dataframe containing the input columns plus the accuracy column acc (with a dummy value of 1) and possibly the uid column, if it was missing.
- Return type:
dask.dataframe
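The column-filling rule described for load_from_skmob can be sketched in plain Python. This is only an illustration of the documented behaviour on a list of records, not mobilkit's implementation; the helper name `fill_missing_columns` and the sample records are hypothetical, and the rule is applied per record rather than per column for simplicity.

```python
def fill_missing_columns(rows, uid="user"):
    """Add the `acc` field and, if absent, the `uid` field to each record.

    Plain-Python illustration of the documented rule, not mobilkit's code.
    """
    filled = []
    for row in rows:
        new_row = dict(row)
        # Existing user IDs are kept; otherwise the `uid` argument is used.
        new_row.setdefault("uid", uid)
        # The accuracy column is filled with a dummy value of 1.
        new_row["acc"] = 1
        filled.append(new_row)
    return filled


# Hypothetical records: the first has no uid, the second already has one.
rows = [
    {"lat": 45.07, "lng": 7.69},
    {"uid": "u42", "lat": 41.90, "lng": 12.50},
]
filled = fill_missing_columns(rows)
```

After the call, the first record carries the default uid and the second keeps its own.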
- mobilkit.loader.load_raw_files(pattern, version='hflb', timezone=None, start_date=None, stop_date=None, minAcc=300, sep='\t', file_schema=None, **kwargs)
Function that loads the files and returns the dataframe.
- Parameters:
pattern (str) – The pattern of the raw files with bash syntax. For example: 'sample_data/20*/part-*.csv.gz'.
version (str, optional) – One of hflb, wb or csv, the format in which the data are stored.
timezone (str, optional) – The timezone in pytz syntax (e.g., “Europe/Rome” or “America/Mexico_City”) used to localize the Unix timestamps in the raw data. If no timezone is specified (default), UTC is used.
start_date (str, optional) – The first day of data to consider, in “yyyy-mm-dd” format. It is localized in timezone if given.
stop_date (str, optional) – The last day of data to consider, in “yyyy-mm-dd” format. It is localized in timezone if given. This day is INCLUDED.
minAcc (int, optional) – The accuracy threshold for a point to be kept. Points whose accuracy value is larger than this are discarded. NOTE that the accuracy column must be called acc.
sep (str, optional) – The delimiter of the fields in the files.
file_schema (list of tuples) – The schema of the file. Defaults to dask_schemas.eventLineRAW. If version='wb', file_schema is instead a dictionary that maps the original columns to the mobilkit nomenclature. NOTE that the accuracy column must be called acc.
**kwargs – Passed to mobilkit.loader.load_raw_files_hflb if version='hflb', or to mobilkit.loader.load_raw_files_wb if version='wb'.
- Returns:
df – A representation of the dataframe.
- Return type:
dask.dataframe
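The interaction between timezone, start_date, stop_date and minAcc is easier to see on a single point. The sketch below restates the documented rules with the standard library only. It is not mobilkit's code: the helper `keep_point` is hypothetical, and a fixed UTC offset stands in for a named pytz timezone so that the example has no external dependencies.

```python
from datetime import datetime, timedelta, timezone


def keep_point(unix_ts, acc, tz=timezone.utc, start_date=None,
               stop_date=None, min_acc=300):
    """Return True if a point passes the documented date and accuracy filters.

    Illustrative only: this restates the documented rules, it is not
    mobilkit's implementation.
    """
    # Localize the Unix timestamp in the requested timezone.
    local_time = datetime.fromtimestamp(unix_ts, tz=tz)

    if start_date is not None:
        start = datetime.strptime(start_date, "%Y-%m-%d").replace(tzinfo=tz)
        if local_time < start:
            return False
    if stop_date is not None:
        # stop_date is INCLUDED, so the cut-off is midnight of the next day.
        stop = datetime.strptime(stop_date, "%Y-%m-%d").replace(tzinfo=tz)
        if local_time >= stop + timedelta(days=1):
            return False
    # Points with an accuracy value larger than the threshold are discarded.
    return acc <= min_acc


# 2020-03-02 03:00 UTC is still 2020-03-01 21:00 at UTC-6.
ts = datetime(2020, 3, 2, 3, 0, tzinfo=timezone.utc).timestamp()
utc_minus_6 = timezone(timedelta(hours=-6))  # stand-in for a named timezone

in_utc = keep_point(ts, 50, tz=timezone.utc, stop_date="2020-03-01")
in_local = keep_point(ts, 50, tz=utc_minus_6, stop_date="2020-03-01")
too_coarse = keep_point(ts, 301, tz=utc_minus_6, stop_date="2020-03-01")
```

The same point falls outside the date window in UTC but inside it at UTC-6, which is why the date bounds should be read in the timezone passed to the loader.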
- mobilkit.loader.load_raw_files_custom(pattern, timezone=None, start_date=None, stop_date=None, minAcc=300, sep='\t', file_schema=None, partition_size=None, **kwargs)
Function that loads the files and returns the dask dataframe. Note that this function is lazy: it only defines the dataframe and does not build it (it will be built the first time a query is performed on it).
- Parameters:
pattern (str) – The pattern of the raw files with bash syntax. For example: 'sample_data/20*/part-*.csv.gz'.
timezone (str, optional) – The timezone in pytz syntax (e.g., “Europe/Rome” or “America/Mexico_City”) used to localize the Unix timestamps in the raw data. If no timezone is specified (default), UTC is used.
start_date (str, optional) – The first day of data to consider, in “yyyy-mm-dd” format. It is localized in timezone if given.
stop_date (str, optional) – The last day of data to consider, in “yyyy-mm-dd” format. It is localized in timezone if given. This day is INCLUDED.
minAcc (int, optional) – The accuracy threshold for a point to be kept. Points whose accuracy value is larger than this are discarded. NOTE that the accuracy column must be called acc.
sep (str, optional) – The delimiter of the fields in the files.
file_schema (list of tuples) – The schema of the file. Defaults to dask_schemas.eventLineRAW. NOTE that the accuracy column must be called acc.
- Returns:
df – A representation of the dataframe.
- Return type:
dask.dataframe
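To check which files a pattern such as 'sample_data/20*/part-*.csv.gz' selects before loading them, the standard glob module can be used. The sketch below assumes that the loader expands the pattern like an ordinary glob; the helper `match_raw_files` and the file layout are hypothetical and created in a temporary directory only for the demonstration.

```python
import glob
import os
import tempfile


def match_raw_files(root, pattern):
    """Return the paths under `root` matching `pattern`, relative to `root`."""
    matches = glob.glob(os.path.join(root, pattern))
    return sorted(
        os.path.relpath(path, root).replace(os.sep, "/") for path in matches
    )


with tempfile.TemporaryDirectory() as root:
    # Hypothetical file layout, created only for this demonstration.
    for rel_path in [
        "sample_data/2020/part-0001.csv.gz",
        "sample_data/2021/part-0002.csv.gz",
        "sample_data/1999/part-0001.csv.gz",  # directory does not start with "20"
        "sample_data/2020/notes.txt",         # file name does not match
    ]:
        full_path = os.path.join(root, *rel_path.split("/"))
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        open(full_path, "w").close()

    matched = match_raw_files(root, "sample_data/20*/part-*.csv.gz")
```

Only the two part files under the 2020 and 2021 directories are selected, since `*` does not cross directory boundaries.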
- mobilkit.loader.load_raw_files_wb(pattern, timezone=None, header=False, start_date=None, stop_date=None, minAcc=300, sep='\t', file_schema=None, **kwargs)
Function that loads the files and returns the dask dataframe. Note that this function is lazy: it only defines the dataframe and does not build it (it will be built the first time a query is performed on it).
- Parameters:
pattern (str) – The pattern of the raw files with bash syntax. For example: 'sample_data/20*/part-*.csv.gz'.
timezone (str, optional) – The timezone in pytz syntax (e.g., “Europe/Rome” or “America/Mexico_City”) used to localize the Unix timestamps in the raw data. If no timezone is specified (default), UTC is used.
start_date (str, optional) – The first day of data to consider, in “yyyy-mm-dd” format. It is localized in timezone if given.
stop_date (str, optional) – The last day of data to consider, in “yyyy-mm-dd” format. It is localized in timezone if given. This day is INCLUDED.
minAcc (int, optional) – The accuracy threshold for a point to be kept. Points whose accuracy value is larger than this are discarded. NOTE that the accuracy column must be called acc.
sep (str, optional) – The delimiter of the fields in the files.
file_schema (dict) – The dict used to rename the original columns to the mobilkit ones. NOTE that the accuracy column must be called acc.
- Returns:
df – A representation of the dataframe.
- Return type:
dask.dataframe
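As an illustration of what a renaming file_schema might look like, the sketch below applies a column map to a small tab-separated sample using only the standard library. It is not mobilkit's code. The original column names (device_id, latitude, longitude, horizontal_accuracy) are hypothetical, as is the assumption that the dict maps original names to mobilkit names; only the target names uid, lat, lng and acc come from this page.

```python
import csv
import io


def rename_columns(lines, sep, column_map):
    """Parse delimited text with a header row and rename the mapped columns."""
    reader = csv.DictReader(lines, delimiter=sep)
    return [
        {column_map.get(name, name): value for name, value in row.items()}
        for row in reader
    ]


# Hypothetical source columns mapped to the mobilkit names.
column_map = {
    "device_id": "uid",
    "latitude": "lat",
    "longitude": "lng",
    "horizontal_accuracy": "acc",  # the accuracy column must be called acc
}

raw = (
    "device_id\tlatitude\tlongitude\thorizontal_accuracy\n"
    "abc\t45.07\t7.69\t12\n"
)
records = rename_columns(io.StringIO(raw), "\t", column_map)
```

Values are left as strings here; columns missing from the map keep their original names.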