Reading and Writing Data (IO)

Megatron can currently read data from the following sources:

  • Pandas DataFrames
  • CSV files
  • SQL database connections

Once outputs have been calculated, they can be stored in a database, keyed by the index of their input observations. Any SQL database connection can be provided.

Datasets (Input)

megatron.io.dataset.CSVData(filepath, exclude_cols=[], nrows=None)

Load fixed data from a CSV file into Megatron Input nodes, one for each column.

Parameters:
  • filepath (str) – path of the CSV file to load from.
  • exclude_cols (list of str (default: [])) – any columns that should not be loaded as Input.
  • nrows (int (default: None)) – number of rows to load. If None, load all rows.
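The effect of CSVData can be sketched without megatron itself: it amounts to reading the CSV with pandas and splitting the remaining columns into one array each. The helper name load_csv_columns below is illustrative, not part of the megatron API:

```python
import tempfile

import pandas as pd


def load_csv_columns(filepath, exclude_cols=(), nrows=None):
    """Read a CSV and return one array per column, as CSVData does conceptually."""
    df = pd.read_csv(filepath, nrows=nrows)
    df = df.drop(columns=list(exclude_cols))
    return {col: df[col].values for col in df.columns}


# Write a tiny CSV and load it back, excluding one column.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("a,b,c\n1,2,3\n4,5,6\n")
    path = f.name

data = load_csv_columns(path, exclude_cols=["c"], nrows=2)
```
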
megatron.io.dataset.PandasData(dataframe, exclude_cols=[], nrows=None)

Load fixed data from a Pandas DataFrame into Megatron Input nodes, one for each column.

Parameters:
  • dataframe (Pandas.DataFrame) – the dataframe to be used.
  • exclude_cols (list of str (default: [])) – any columns that should not be loaded as Input.
  • nrows (int (default: None)) – number of rows to load. If None, load all rows.
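PandasData is the in-memory analogue of CSVData: conceptually it splits the frame into one array per column. A minimal sketch (split_columns is an illustrative helper, not megatron API):

```python
import pandas as pd


def split_columns(dataframe, exclude_cols=(), nrows=None):
    """Return one array per column of the frame, as PandasData does conceptually."""
    df = dataframe if nrows is None else dataframe.head(nrows)
    return {col: df[col].values
            for col in df.columns if col not in exclude_cols}


df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6], "label": [0, 1, 0]})
inputs = split_columns(df, exclude_cols=["label"], nrows=2)
```
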
megatron.io.dataset.SQLData(connection, query)

Load fixed data from a SQL query into Megatron Input nodes, one for each column.

Parameters:
  • connection (Connection) – a database connection to any valid SQL database engine.
  • query (str) – a SQL query, valid for the engine being used, that extracts the data for the Inputs.
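SQLData's behavior can be sketched with the standard library alone: run the query over any DB-API connection and regroup the result set by column. The helper fetch_columns is illustrative, not megatron API:

```python
import sqlite3


def fetch_columns(connection, query):
    """Run a query and return one list of values per result column."""
    cursor = connection.execute(query)
    names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO obs VALUES (?, ?)", [(1, 2), (3, 4)])
cols = fetch_columns(conn, "SELECT a, b FROM obs")
```
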

Data Generators (Input)

class megatron.io.generator.CSVGenerator(filepath, batch_size=32, exclude_cols=[])

Bases: object

A generator of data batches from a CSV file in pipeline Input format.

Parameters:
  • filepath (str) – path of the CSV file to load from.
  • batch_size (int) – number of observations to yield in each iteration.
  • exclude_cols (list of str (default: [])) – any columns that should not be loaded as Input.
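The generator classes differ from the dataset loaders above in that they yield fixed-size batches rather than the whole table at once. For CSVGenerator, roughly equivalent behavior can be sketched with pandas' chunked reader (csv_batches is an illustrative helper, not megatron internals):

```python
import tempfile

import pandas as pd


def csv_batches(filepath, batch_size=32, exclude_cols=()):
    """Yield dicts of column arrays, batch_size rows at a time."""
    for chunk in pd.read_csv(filepath, chunksize=batch_size):
        chunk = chunk.drop(columns=list(exclude_cols))
        yield {col: chunk[col].values for col in chunk.columns}


# Five rows with batch_size=2 yields batches of 2, 2, and 1 rows.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("a,b\n" + "\n".join(f"{i},{i * 10}" for i in range(5)))
    path = f.name

batches = list(csv_batches(path, batch_size=2))
```
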
class megatron.io.generator.PandasGenerator(dataframe, batch_size=32, exclude_cols=[])

Bases: object

A generator of data batches from a Pandas DataFrame in pipeline Input format.

Parameters:
  • dataframe (Pandas.DataFrame) – dataframe to load data from.
  • batch_size (int) – number of observations to yield in each iteration.
  • exclude_cols (list of str (default: [])) – any columns that should not be loaded as Input.
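PandasGenerator can be sketched as simple slicing over the frame (df_batches is an illustrative helper, not megatron internals):

```python
import pandas as pd


def df_batches(dataframe, batch_size=32, exclude_cols=()):
    """Yield dicts of column arrays, batch_size rows at a time."""
    cols = [c for c in dataframe.columns if c not in exclude_cols]
    for start in range(0, len(dataframe), batch_size):
        chunk = dataframe.iloc[start:start + batch_size]
        yield {c: chunk[c].values for c in cols}


# Seven rows with batch_size=3 yields batches of 3, 3, and 1 rows.
df = pd.DataFrame({"x": range(7), "skip": range(7)})
batches = list(df_batches(df, batch_size=3, exclude_cols=["skip"]))
```
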
class megatron.io.generator.SQLGenerator(connection, query, batch_size=32, limit=None)

Bases: object

A generator of data batches from a SQL query in pipeline Input format.

Parameters:
  • connection (Connection) – a database connection to any valid SQL database engine.
  • query (str) – a SQL query, valid for the engine being used, that extracts the data for the Inputs.
  • batch_size (int) – number of observations to yield in each iteration.
  • limit (int (default: None)) – maximum number of observations to take from the query in total. If None, use all results.
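SQLGenerator's batching maps naturally onto the DB-API cursor's fetchmany. The sketch below, including its handling of limit, is illustrative rather than megatron's actual implementation:

```python
import sqlite3


def sql_batches(connection, query, batch_size=32, limit=None):
    """Yield dicts of column lists, batch_size rows at a time, up to limit rows."""
    cursor = connection.execute(query)
    names = [desc[0] for desc in cursor.description]
    remaining = limit
    while True:
        size = batch_size if remaining is None else min(batch_size, remaining)
        if size <= 0:
            break
        rows = cursor.fetchmany(size)
        if not rows:
            break
        if remaining is not None:
            remaining -= len(rows)
        yield {name: [row[i] for row in rows] for i, name in enumerate(names)}


# Ten rows, batch_size=4, limit=6: yields one batch of 4 and one of 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (v INTEGER)")
conn.executemany("INSERT INTO obs VALUES (?)", [(i,) for i in range(10)])
batches = list(sql_batches(conn, "SELECT v FROM obs", batch_size=4, limit=6))
```
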

Storage (Output)

class megatron.io.storage.DataStore(table_name, version, db_conn, overwrite)

Bases: object

SQL table of input data and output features, associated with a single pipeline.

Parameters:
  • table_name (str) – name of pipeline’s cache table in the database.
  • version (str) – version tag for pipeline’s cache table in the database.
  • db_conn (Connection) – database connection to query.
  • overwrite (bool) – whether to overwrite the pipeline's existing cache table, if one exists.
read(cols=None, rows=None)

Retrieve all processed features from the cache, or look up a single observation.

Multi-dimensional features are stored as pickled strings and are unpickled on read.

Parameters:
  • cols (list of int (default: None)) – indices of output columns to retrieve. If None, get all columns.
  • rows (list of any, or any (default: None)) – index value(s) to look up output for; should be the same data type as the index. If None, get all rows.
write(output_data, data_index)

Write set of observations to database.

Multi-dimensional features are pickled to strings before writing.

Parameters:
  • output_data (dict of ndarray) – the features produced by applying the pipeline to the input data.
  • data_index (np.array) – index of observations.
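The pickle-to-string convention that read and write mention can be sketched with sqlite3: multi-dimensional features are serialized to bytes on write and deserialized on read, with the observation index serving as the lookup key. The table layout here is illustrative, not DataStore's actual schema:

```python
import pickle
import sqlite3

import numpy as np

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (idx INTEGER PRIMARY KEY, feature BLOB)")

# write: pickle each multi-dimensional feature, keyed by its observation index.
features = {0: np.arange(6).reshape(2, 3), 1: np.ones((2, 2))}
for idx, arr in features.items():
    conn.execute("INSERT INTO cache VALUES (?, ?)", (idx, pickle.dumps(arr)))

# read: look up one observation and unpickle its feature.
blob = conn.execute("SELECT feature FROM cache WHERE idx = ?", (0,)).fetchone()[0]
restored = pickle.loads(blob)
```
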