mobilkit.loader module
loader.py contains a set of tools to load and prepare the database from raw files.
- mobilkit.loader.compute_datetime_col(df, selected_tz)
- mobilkit.loader.crop_date(dff, startdt, enddt, timezone='America/Mexico_City')
- mobilkit.loader.crop_spatial(dff, bbox)
Filters dff with the bounding box bbox=[minlon,minlat,maxlon,maxlat].
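As an illustration of the bounding-box convention, the following plain-Python sketch keeps only the points that fall inside the box. The `in_bbox` helper and the sample records are hypothetical and do not reproduce mobilkit's implementation, which operates on dataframes; treating the edges as inclusive is also an assumption.

```python
def in_bbox(lng, lat, bbox):
    """Return True if (lng, lat) lies inside bbox = [minlon, minlat, maxlon, maxlat].

    Edges are treated as inclusive (an assumption, not documented behavior).
    """
    minlon, minlat, maxlon, maxlat = bbox
    return minlon <= lng <= maxlon and minlat <= lat <= maxlat


# Hypothetical sample points, made up for this example.
points = [
    {"uid": "a", "lat": 19.43, "lng": -99.13},
    {"uid": "b", "lat": 20.97, "lng": -89.62},
]
bbox = [-99.4, 19.0, -98.9, 19.6]

# Keep only the points whose coordinates fall inside the box.
cropped = [p for p in points if in_bbox(p["lng"], p["lat"], bbox)]
```

Note the order of the box: longitudes come before latitudes, which is easy to swap by mistake.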
- mobilkit.loader.crop_time(dff, nighttime_start, nighttime_end, timezone)
- mobilkit.loader.dask_to_skmob(df, **kwargs)
Converts a dask dataframe to skmob. Given the structure of skmob, the only supported target is a skmob.TrajDataFrame.
- Parameters:
df (dask.dataframe) – A dask dataframe with at least the uid, lat and lng columns.
**kwargs – Will be passed to skmob.TrajDataFrame.
- Returns:
df_sp – A skmob.TrajDataFrame containing the input columns.
- Return type:
skmob.dataframe
- mobilkit.loader.filterStartStopDates(df, start_date, stop_date, tz)
- mobilkit.loader.fromunix2date(x, timezone='America/Mexico_City')
Inherited from D4R.
- mobilkit.loader.fromunix2fulldate(x, timezone='America/Mexico_City')
Inherited from D4R.
- mobilkit.loader.fromunix2time(x, timezone='America/Mexico_City')
Inherited from D4R.
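These three helpers carry no description beyond their names, so the sketch below only illustrates the general idea of turning a Unix timestamp into a localized date, time, or full datetime string. The function name `unix_to_parts`, the output formats, and the fixed UTC-6 offset standing in for 'America/Mexico_City' are all assumptions; none of them is taken from mobilkit.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset used as a stand-in for a named zone (assumption: no DST handling).
UTC_MINUS_6 = timezone(timedelta(hours=-6))


def unix_to_parts(ts, tz=UTC_MINUS_6):
    """Convert a Unix timestamp (seconds) to date, time and full-datetime strings."""
    dt = datetime.fromtimestamp(ts, tz=tz)
    return {
        "date": dt.strftime("%Y-%m-%d"),
        "time": dt.strftime("%H:%M:%S"),
        "fulldate": dt.strftime("%Y-%m-%d %H:%M:%S"),
    }


# 1_600_000_000 seconds after the epoch, expressed in the stand-in zone.
parts = unix_to_parts(1_600_000_000)
```

A real workflow would normally use a timezone database rather than a fixed offset, so that daylight saving transitions are handled.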
- mobilkit.loader.loadGeolifeData(path, acc_default=1, timezone='Asia/Shanghai')
Loads the Geolife v1.3 trajectories with files ordered in the GeolifeTrajectories1.3/Data/000/Trajectory/20090401202331.plt structure, with 6 useless rows at the beginning and the lat,lng,0,altitude,days,date,time format.
- Parameters:
path (str) – The path to the root of the geolife data, usually called data/GeolifeTrajectories1.3.
acc_default (float) – The default accuracy to give to each point to replicate the mobilkit format.
timezone (str) – The code of the timezone the data has been recorded in.
- Returns:
df – The dataframe containing the uid, UTC, datetime, acc, lat, lng columns.
- Return type:
pd.DataFrame
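The file layout described above (6 leading rows, then lat,lng,0,altitude,days,date,time) can be sketched with the standard library alone. The `parse_plt` helper and the sample content are hypothetical and made up for illustration; the sketch handles a single file's text, ignores timezones, and is not mobilkit's implementation.

```python
import csv
from datetime import datetime

# Synthetic .plt content: 6 leading rows, then lat,lng,0,altitude,days,date,time.
# All values are made up for illustration.
SAMPLE_PLT = "\n".join(
    ["header line %d" % i for i in range(1, 7)]
    + [
        "39.90,116.40,0,150,39904.0,2009-04-01,20:23:31",
        "39.91,116.41,0,152,39904.1,2009-04-01,20:23:36",
    ]
)


def parse_plt(text, uid, acc_default=1):
    """Parse one .plt file's text into a list of point records."""
    lines = text.splitlines()[6:]  # skip the 6 leading rows
    records = []
    for lat, lng, _zero, _alt, _days, date, clock in csv.reader(lines):
        records.append(
            {
                "uid": uid,
                "lat": float(lat),
                "lng": float(lng),
                "acc": acc_default,  # same default accuracy for every point
                # Naive datetime; timezone handling is left out of this sketch.
                "datetime": datetime.strptime(date + " " + clock, "%Y-%m-%d %H:%M:%S"),
            }
        )
    return records


rows = parse_plt(SAMPLE_PLT, uid="000")
```

Here the uid is passed in by the caller; taking it from the user folder name (000 in the path above) is one plausible choice, stated as an assumption.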
- mobilkit.loader.load_from_skmob(df, uid='user', npartitions=10)
Loads a dataframe imported with scikit-mobility and returns a dask dataframe.
- Parameters:
df (scikit-mobility.dataframe) – A dataframe as imported from scikit-mobility. May already contain the tile_ID and uid columns. If no uid column is found it will be initialized to the uid value.
uid (str, optional) – The uid to be used, otherwise uses the present ones if the uid column is there.
npartitions (int, optional) – The number of partitions for the dataframe to be split into.
- Returns:
df_sp – A dask.dataframe containing the input columns plus the accuracy acc (with dummy 1 value) and possibly the uid one if it was missing.
- Return type:
dask.dataframe
- mobilkit.loader.load_raw_files(pattern, version='hflb', timezone=None, start_date=None, stop_date=None, minAcc=300, sep='\t', file_schema=None, **kwargs)
Function that loads the files and returns the dataframe.
- Parameters:
pattern (str) – The pattern of the raw files with bash syntax. For example: 'sample_data/20*/part-*.csv.gz'.
version (str, optional) – One of hflb, wb or csv, the format in which data are stored.
timezone (str, optional) – The timezone in pytz syntax (e.g., “Europe/Rome” or “America/Mexico_City”) used to localize the Unix timestamps in the raw data. If no timezone is specified (default), UTC is used.
start_date (str, optional) – The first day from which to consider data, in the “yyyy-mm-dd” format. This will be localized in timezone if given.
stop_date (str, optional) – The last day up to which to consider data, in the “yyyy-mm-dd” format. This will be localized in timezone if given. This day will be INCLUDED.
minAcc (int, optional) – The accuracy threshold for a point to be kept. If a point's accuracy value is larger than this, the point will be discarded. NOTE that the accuracy column must be called acc.
sep (str, optional) – The delimiter of the fields in the files.
file_schema (list of tuples) – The schema of the file. By default will be dask_schemas.eventLineRAW. If version=wb, file_schema is a dictionary telling how to translate the original columns into the mobilkit nomenclature. NOTE that the accuracy column must be called acc.
**kwargs – Will be passed to mobilkit.loader.load_raw_files_hflb if version='hflb', otherwise to mobilkit.loader.load_raw_files_wb if version='wb'.
- Returns:
df – A representation of the dataframe.
- Return type:
dask.dataframe
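The combined effect of start_date, stop_date and minAcc on a single point can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions, not the library's code: the `keep_point` helper is hypothetical, and UTC is used as the timezone to mirror the documented default.

```python
from datetime import datetime, timedelta, timezone


def keep_point(ts, acc, start_date, stop_date, tz=timezone.utc, min_acc=300):
    """Return True if a point passes the date-range and accuracy filters.

    ts is a Unix timestamp in seconds; start_date and stop_date are
    "yyyy-mm-dd" strings localized in tz. stop_date is included.
    """
    start = datetime.strptime(start_date, "%Y-%m-%d").replace(tzinfo=tz)
    # Including the stop day means accepting everything before the next midnight.
    stop = datetime.strptime(stop_date, "%Y-%m-%d").replace(tzinfo=tz) + timedelta(days=1)
    when = datetime.fromtimestamp(ts, tz=tz)
    # Points with an accuracy value larger than min_acc are discarded.
    return start <= when < stop and acc <= min_acc


ts = 1_600_000_000  # 2020-09-13 12:26:40 UTC
inside = keep_point(ts, acc=50, start_date="2020-09-01", stop_date="2020-09-13")
too_coarse = keep_point(ts, acc=500, start_date="2020-09-01", stop_date="2020-09-13")
after_range = keep_point(ts, acc=50, start_date="2020-09-01", stop_date="2020-09-12")
```

The point recorded on the stop day itself is kept, while the same point is dropped when the range ends one day earlier or when its accuracy value exceeds the threshold.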
- mobilkit.loader.load_raw_files_custom(pattern, timezone=None, start_date=None, stop_date=None, minAcc=300, sep='\t', file_schema=None, partition_size=None, **kwargs)
Function that loads the files and returns the dask dataframe. Note that this function is lazy, meaning that it only defines the dataframe and does not build it (it will be built the first time a query is performed on it).
- Parameters:
pattern (str) – The pattern of the raw files with bash syntax. For example: 'sample_data/20*/part-*.csv.gz'.
timezone (str, optional) – The timezone in pytz syntax (e.g., “Europe/Rome” or “America/Mexico_City”) used to localize the Unix timestamps in the raw data. If no timezone is specified (default), UTC is used.
start_date (str, optional) – The first day from which to consider data, in the “yyyy-mm-dd” format. This will be localized in timezone if given.
stop_date (str, optional) – The last day up to which to consider data, in the “yyyy-mm-dd” format. This will be localized in timezone if given. This day will be INCLUDED.
minAcc (int, optional) – The accuracy threshold for a point to be kept. If a point's accuracy value is larger than this, the point will be discarded. NOTE that the accuracy column must be called acc.
sep (str, optional) – The delimiter of the fields in the files.
file_schema (list of tuples) – The schema of the file. By default will be dask_schemas.eventLineRAW. NOTE that the accuracy column must be called acc.
- Returns:
df – A representation of the dataframe.
- Return type:
dask.dataframe
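Lazy evaluation of this kind can be illustrated without dask by a plain Python generator. This is only an analogy: the `lazy_rows` helper is hypothetical, and dask builds a task graph rather than a generator, but the observable behavior is similar in that no work happens until a result is requested.

```python
def lazy_rows(lines, log):
    """Generator that parses lines only when a consumer iterates over it."""
    for line in lines:
        log.append(line)  # record that this line was actually processed
        yield line.split("\t")


log = []
rows = lazy_rows(["u1\t10", "u2\t20"], log)  # nothing is parsed yet
processed_before = len(log)

materialized = list(rows)  # the "query" that triggers the work
processed_after = len(log)
```

Creating `rows` costs nothing; the parsing only runs when `list(rows)` asks for the values.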
- mobilkit.loader.load_raw_files_hflb(pattern, timezone=None, start_date=None, stop_date=None, minAcc=300, sep='\t', file_schema=None, partition_size=None)
Function that loads the files and returns the dask dataframe. Note that this function is lazy, meaning that it only defines the dataframe and does not build it (it will be built the first time a query is performed on it).
- Parameters:
pattern (str) – The pattern of the raw files with bash syntax. For example: 'sample_data/20*/part-*.csv.gz'.
timezone (str, optional) – The timezone in pytz syntax (e.g., “Europe/Rome” or “America/Mexico_City”) used to localize the Unix timestamps in the raw data. If no timezone is specified (default), UTC is used.
start_date (str, optional) – The first day from which to consider data, in the “yyyy-mm-dd” format. This will be localized in timezone if given.
stop_date (str, optional) – The last day up to which to consider data, in the “yyyy-mm-dd” format. This will be localized in timezone if given. This day will be INCLUDED.
minAcc (int, optional) – The accuracy threshold for a point to be kept. If a point's accuracy value is larger than this, the point will be discarded. NOTE that the accuracy column must be called acc.
sep (str, optional) – The delimiter of the fields in the files.
file_schema (list of tuples) – The schema of the file. By default will be dask_schemas.eventLineRAW. NOTE that the accuracy column must be called acc.
- Returns:
df – A representation of the dataframe.
- Return type:
dask.dataframe
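To show which files a pattern such as 'sample_data/20*/part-*.csv.gz' would select, the sketch below applies shell-style wildcards to a list of hypothetical file names. It uses the standard library's `fnmatch` on strings rather than a real directory listing, so it is an approximation of bash globbing, not a description of how mobilkit resolves the pattern.

```python
from fnmatch import fnmatchcase

PATTERN = "sample_data/20*/part-*.csv.gz"

# Hypothetical file names, used only to show what the pattern selects.
candidates = [
    "sample_data/2019/part-0000.csv.gz",
    "sample_data/2020/part-0001.csv.gz",
    "sample_data/2020/_SUCCESS",
    "other_data/2020/part-0000.csv.gz",
]

# fnmatchcase applies shell-style wildcards to strings; unlike a real
# directory glob, its "*" may also match "/" characters.
selected = [path for path in candidates if fnmatchcase(path, PATTERN)]
```

Only the two part files under sample_data are selected; the marker file and the file under another root are left out.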
- mobilkit.loader.load_raw_files_wb(pattern, timezone=None, header=False, start_date=None, stop_date=None, minAcc=300, sep='\t', file_schema=None, **kwargs)
Function that loads the files and returns the dask dataframe. Note that this function is lazy, meaning that it only defines the dataframe and does not build it (it will be built the first time a query is performed on it).
- Parameters:
pattern (str) – The pattern of the raw files with bash syntax. For example: 'sample_data/20*/part-*.csv.gz'.
timezone (str, optional) – The timezone in pytz syntax (e.g., “Europe/Rome” or “America/Mexico_City”) used to localize the Unix timestamps in the raw data. If no timezone is specified (default), UTC is used.
start_date (str, optional) – The first day from which to consider data, in the “yyyy-mm-dd” format. This will be localized in timezone if given.
stop_date (str, optional) – The last day up to which to consider data, in the “yyyy-mm-dd” format. This will be localized in timezone if given. This day will be INCLUDED.
minAcc (int, optional) – The accuracy threshold for a point to be kept. If a point's accuracy value is larger than this, the point will be discarded. NOTE that the accuracy column must be called acc.
sep (str, optional) – The delimiter of the fields in the files.
file_schema (dict) – The dict to rename the original columns to the mobilkit ones. NOTE that the accuracy column must be called acc.
- Returns:
df – A representation of the dataframe.
- Return type:
dask.dataframe
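The role of a renaming dictionary can be sketched on a single record. All raw column names below are hypothetical, as is the `rename_columns` helper; the target names uid, lat, lng, acc and UTC follow the column names mentioned elsewhere in this module, and using UTC for the timestamp column is an assumption.

```python
# Hypothetical mapping from raw column names to the mobilkit nomenclature.
file_schema = {
    "device_id": "uid",
    "latitude": "lat",
    "longitude": "lng",
    "horizontal_accuracy": "acc",  # the accuracy column must end up as "acc"
    "timestamp": "UTC",
}


def rename_columns(record, schema):
    """Return a copy of record with keys renamed according to schema.

    Keys that do not appear in schema are kept unchanged.
    """
    return {schema.get(key, key): value for key, value in record.items()}


# A made-up raw record using the hypothetical column names.
raw = {
    "device_id": "u1",
    "latitude": 19.43,
    "longitude": -99.13,
    "horizontal_accuracy": 25,
    "timestamp": 1_600_000_000,
}
renamed = rename_columns(raw, file_schema)
```

On a dataframe the same mapping would typically be applied to the column labels in one step instead of record by record.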
- mobilkit.loader.loaddata_takeapeek(dirpath, sep, ext)
- mobilkit.loader.localizeDatetimeNaive(date, tz, date_format='%Y-%m-%d')
- mobilkit.loader.persistDF(df, path, overwrite=True, header=True, index=False, out_format='csv')
Save a dask dataframe to file.
- Parameters:
df (dask.DataFrame) – The dataframe to save.
path (str) – The path where to save the dataframe.
overwrite (bool) – Whether or not to force overwrite.
header (bool) – Whether or not to put the header in the output file.
index (bool) – Whether or not to put the index column in the output file.
out_format (str) – One of csv, parquet: the format to use. If the df has arrays in it, use parquet.
- mobilkit.loader.reloadDF(path, header=True, in_format='csv')
Load a dask dataframe from file.
- Parameters:
path (str) – The path where to read the dataframe from.
header (bool) – Whether or not to read the header from the input file.
in_format (str) – One of csv, parquet: the format used to persist the df.
- Returns:
df – The loaded dataframe.
- Return type:
dask.DataFrame
- mobilkit.loader.syntheticGeoLifeDay(df_geolife, selected_day)
- mobilkit.loader.syntheticGeoLifeWeek(df_geolife, selected_week)