Reading and Writing Data (IO)

Megatron can currently read data from the following sources:

  • Pandas DataFrames
  • CSV files
  • SQL database connections

Once outputs have been calculated, they can be stored in a database, keyed by the index of their input observations. Any SQL database connection can be provided.

Datasets (Input)

megatron.io.dataset.CSVData(filepath, exclude_cols=[], nrows=None)

Load fixed data from a CSV file into Megatron Input nodes, one for each column.

Parameters:
  • filepath (str) – path of the CSV file to load from.
  • exclude_cols (list of str (default: [])) – any columns that should not be loaded as Input.
  • nrows (int (default: None)) – number of rows to load. If None, load all rows.
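The effect of CSVData can be sketched without megatron itself: it amounts to reading the CSV with pandas and splitting the remaining columns into one array each. The helper name load_csv_columns below is illustrative, not part of the megatron API:

```python
import tempfile

import pandas as pd


def load_csv_columns(filepath, exclude_cols=(), nrows=None):
    """Read a CSV and return one array per column, as CSVData does conceptually."""
    df = pd.read_csv(filepath, nrows=nrows)
    df = df.drop(columns=list(exclude_cols))
    return {col: df[col].values for col in df.columns}


# Write a tiny CSV and load it back, excluding one column.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("a,b,c\n1,2,3\n4,5,6\n")
    path = f.name

data = load_csv_columns(path, exclude_cols=["c"], nrows=2)
```
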
megatron.io.dataset.PandasData(dataframe, exclude_cols=[], nrows=None)

Load fixed data from a Pandas DataFrame into Megatron Input nodes, one for each column.

Parameters:
  • dataframe (Pandas.DataFrame) – the dataframe to be used.
  • exclude_cols (list of str (default: [])) – any columns that should not be loaded as Input.
  • nrows (int (default: None)) – number of rows to load. If None, load all rows.
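PandasData is the in-memory analogue of CSVData: conceptually it splits the frame into one array per column. A minimal sketch (split_columns is an illustrative helper, not megatron API):

```python
import pandas as pd


def split_columns(dataframe, exclude_cols=(), nrows=None):
    """Return one array per column of the frame, as PandasData does conceptually."""
    df = dataframe if nrows is None else dataframe.head(nrows)
    return {col: df[col].values
            for col in df.columns if col not in exclude_cols}


df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6], "label": [0, 1, 0]})
inputs = split_columns(df, exclude_cols=["label"], nrows=2)
```
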
megatron.io.dataset.SQLData(connection, query)

Load fixed data from a SQL query into Megatron Input nodes, one for each column.

Parameters:
  • connection (Connection) – a database connection to any valid SQL database engine.
  • query (str) – a SQL query, valid for the engine being used, that extracts the data for the Inputs.
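SQLData's behavior can be sketched with the standard library alone: run the query over any DB-API connection and regroup the result set by column. The helper fetch_columns is illustrative, not megatron API:

```python
import sqlite3


def fetch_columns(connection, query):
    """Run a query and return one list of values per result column."""
    cursor = connection.execute(query)
    names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO obs VALUES (?, ?)", [(1, 2), (3, 4)])
cols = fetch_columns(conn, "SELECT a, b FROM obs")
```
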

Data Generators (Input)

class megatron.io.generator.CSVGenerator(filepath, batch_size=32, exclude_cols=[])

Bases: object

A generator of data batches from a CSV file in pipeline Input format.

Parameters:
  • filepath (str) – path of the CSV file to load from.
  • batch_size (int) – number of observations to yield in each iteration.
  • exclude_cols (list of str (default: [])) – any columns that should not be loaded as Input.
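The generator classes differ from the dataset loaders above in that they yield fixed-size batches rather than the whole table at once. For CSVGenerator, roughly equivalent behavior can be sketched with pandas' chunked reader (csv_batches is an illustrative helper, not megatron internals):

```python
import tempfile

import pandas as pd


def csv_batches(filepath, batch_size=32, exclude_cols=()):
    """Yield dicts of column arrays, batch_size rows at a time."""
    for chunk in pd.read_csv(filepath, chunksize=batch_size):
        chunk = chunk.drop(columns=list(exclude_cols))
        yield {col: chunk[col].values for col in chunk.columns}


# Five rows with batch_size=2 yields batches of 2, 2, and 1 rows.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("a,b\n" + "\n".join(f"{i},{i * 10}" for i in range(5)))
    path = f.name

batches = list(csv_batches(path, batch_size=2))
```
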
class megatron.io.generator.PandasGenerator(dataframe, batch_size=32, exclude_cols=[])

Bases: object

A generator of data batches from a Pandas DataFrame in pipeline Input format.

Parameters:
  • dataframe (Pandas.DataFrame) – dataframe to load data from.
  • batch_size (int) – number of observations to yield in each iteration.
  • exclude_cols (list of str (default: [])) – any columns that should not be loaded as Input.
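PandasGenerator can be sketched as simple slicing over the frame (df_batches is an illustrative helper, not megatron internals):

```python
import pandas as pd


def df_batches(dataframe, batch_size=32, exclude_cols=()):
    """Yield dicts of column arrays, batch_size rows at a time."""
    cols = [c for c in dataframe.columns if c not in exclude_cols]
    for start in range(0, len(dataframe), batch_size):
        chunk = dataframe.iloc[start:start + batch_size]
        yield {c: chunk[c].values for c in cols}


# Seven rows with batch_size=3 yields batches of 3, 3, and 1 rows.
df = pd.DataFrame({"x": range(7), "skip": range(7)})
batches = list(df_batches(df, batch_size=3, exclude_cols=["skip"]))
```
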
class megatron.io.generator.SQLGenerator(connection, query, batch_size=32, limit=None)

Bases: object

A generator of data batches from a SQL query in pipeline Input format.

Parameters:
  • connection (Connection) – a database connection to any valid SQL database engine.
  • query (str) – a SQL query, valid for the engine being used, that extracts the data for the Inputs.
  • batch_size (int) – number of observations to yield in each iteration.
  • limit (int (default: None)) – maximum number of observations to take from the query in total. If None, use all results.
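SQLGenerator's batching maps naturally onto the DB-API cursor's fetchmany. The sketch below, including its handling of limit, is illustrative rather than megatron's actual implementation:

```python
import sqlite3


def sql_batches(connection, query, batch_size=32, limit=None):
    """Yield dicts of column lists, batch_size rows at a time, up to limit rows."""
    cursor = connection.execute(query)
    names = [desc[0] for desc in cursor.description]
    remaining = limit
    while True:
        size = batch_size if remaining is None else min(batch_size, remaining)
        if size <= 0:
            break
        rows = cursor.fetchmany(size)
        if not rows:
            break
        if remaining is not None:
            remaining -= len(rows)
        yield {name: [row[i] for row in rows] for i, name in enumerate(names)}


# Ten rows, batch_size=4, limit=6: yields one batch of 4 and one of 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (v INTEGER)")
conn.executemany("INSERT INTO obs VALUES (?)", [(i,) for i in range(10)])
batches = list(sql_batches(conn, "SELECT v FROM obs", batch_size=4, limit=6))
```
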

Storage (Output)

class megatron.io.storage.DataStore(table_name, version, db_conn, overwrite)

Bases: object

SQL table of input data and output features, associated with a single pipeline.

Parameters:
  • table_name (str) – name of pipeline’s cache table in the database.
  • version (str) – version tag for pipeline’s cache table in the database.
  • db_conn (Connection) – database connection to query.
  • overwrite (bool) – whether to overwrite the pipeline's existing cache table, if one exists.
read(cols=None, rows=None)

Retrieve all processed features from the cache, or look up a single observation.

Multi-dimensional features are stored as pickled strings and are unpickled on read.

Parameters:
  • cols (list of int (default: None)) – indices of output columns to retrieve. If None, get all columns.
  • rows (list of any, or any (default: None)) – index value(s) to look up output for; should be the same data type as the index. If None, get all rows.
write(output_data, data_index)

Write set of observations to database.

Multi-dimensional features are pickled to strings before writing.

Parameters:
  • output_data (dict of ndarray) – the features produced by applying the pipeline to the input data.
  • data_index (np.array) – index of observations.
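The pickle-to-string convention that read and write mention can be sketched with sqlite3: multi-dimensional features are serialized to bytes on write and deserialized on read, with the observation index serving as the lookup key. The table layout here is illustrative, not DataStore's actual schema:

```python
import pickle
import sqlite3

import numpy as np

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (idx INTEGER PRIMARY KEY, feature BLOB)")

# write: pickle each multi-dimensional feature, keyed by its observation index.
features = {0: np.arange(6).reshape(2, 3), 1: np.ones((2, 2))}
for idx, arr in features.items():
    conn.execute("INSERT INTO cache VALUES (?, ?)", (idx, pickle.dumps(arr)))

# read: look up one observation and unpickle its feature.
blob = conn.execute("SELECT feature FROM cache WHERE idx = ?", (0,)).fetchone()[0]
restored = pickle.loads(blob)
```
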