Reading and Writing Data (IO)
Megatron can currently read data from the following sources:
- Pandas DataFrames
- CSV files
- SQL database connections
Once outputs have been calculated, they can be stored in a database, keyed by their input observation index. Any SQL database connection can be provided.
Datasets (Input)

megatron.io.dataset.CSVData(filepath, exclude_cols=[], nrows=None)
Load fixed data from a CSV filepath into Megatron Input nodes, one for each column.
Parameters:
- filepath (str) – the CSV filepath to be loaded from.
- exclude_cols (list of str (default: [])) – any columns that should not be loaded as Input.
- nrows (int (default: None)) – number of rows to load. If None, load all rows.
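The documented behavior of CSVData (one Input per column, with exclude_cols dropped and nrows capping the row count) can be sketched in plain Python. Megatron itself is not imported here; `load_csv_columns` is a hypothetical helper that mimics the documented semantics, not the library's implementation.

```python
import csv
import io

def load_csv_columns(fileobj, exclude_cols=(), nrows=None):
    """Mimic CSVData: one list of values per column, minus excluded ones."""
    reader = csv.DictReader(fileobj)
    columns = {name: [] for name in reader.fieldnames if name not in exclude_cols}
    for i, row in enumerate(reader):
        if nrows is not None and i >= nrows:
            break  # nrows caps how many observations are loaded
        for name in columns:
            columns[name].append(row[name])
    return columns

raw = "x,y,z\n1,2,3\n4,5,6\n7,8,9\n"
cols = load_csv_columns(io.StringIO(raw), exclude_cols=["z"], nrows=2)
# cols == {"x": ["1", "4"], "y": ["2", "5"]}
```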
megatron.io.dataset.PandasData(dataframe, exclude_cols=[], nrows=None)
Load fixed data from a Pandas DataFrame into Megatron Input nodes, one for each column.
Parameters:
- dataframe (Pandas.DataFrame) – the dataframe to be used.
- exclude_cols (list of str (default: [])) – any columns that should not be loaded as Input.
- nrows (int (default: None)) – number of rows to load. If None, loads all rows.
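PandasData's column-splitting can be illustrated the same way. This is a sketch under the documented semantics; `load_dataframe_columns` is a hypothetical helper, not megatron's own code.

```python
import pandas as pd

def load_dataframe_columns(dataframe, exclude_cols=(), nrows=None):
    """Mimic PandasData: one array per column, minus excluded ones."""
    kept = dataframe.drop(columns=list(exclude_cols))
    if nrows is not None:
        kept = kept.head(nrows)  # if None, all rows are loaded
    return {name: kept[name].to_numpy() for name in kept.columns}

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "label": [0, 1, 0]})
cols = load_dataframe_columns(df, exclude_cols=["label"], nrows=2)
# cols has keys "a" and "b"; cols["a"] holds the first two rows
```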
megatron.io.dataset.SQLData(connection, query)
Load fixed data from a SQL query into Megatron Input nodes, one for each column.
Parameters:
- connection (Connection) – a database connection to any valid SQL database engine.
- query (str) – a SQL query, valid for the engine being used, that extracts the data for the Inputs.
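What a valid connection/query pair produces can be shown with an in-memory SQLite database. Again, megatron is not imported; `load_query_columns` is a hypothetical sketch of the documented per-column result.

```python
import sqlite3

def load_query_columns(connection, query):
    """Mimic SQLData: run the query, return one list per result column."""
    cursor = connection.execute(query)
    names = [desc[0] for desc in cursor.description]
    columns = {name: [] for name in names}
    for row in cursor.fetchall():
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x INTEGER, y INTEGER)")
conn.executemany("INSERT INTO obs VALUES (?, ?)", [(1, 10), (2, 20)])
cols = load_query_columns(conn, "SELECT x, y FROM obs")
# cols == {"x": [1, 2], "y": [10, 20]}
```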
Data Generators (Input)

class megatron.io.generator.CSVGenerator(filepath, batch_size=32, exclude_cols=[])
Bases: object
A generator of data batches from a CSV file in pipeline Input format.
Parameters:
- filepath (str) – the CSV filepath to be loaded from.
- batch_size (int) – number of observations to yield in each iteration.
- exclude_cols (list of str (default: [])) – any columns that should not be loaded as Input.
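The batching behavior (batch_size observations per iteration, with a possibly smaller final batch) can be sketched with the csv module. `csv_batches` is a hypothetical stand-in for CSVGenerator, written only to illustrate the documented parameters.

```python
import csv
import io

def csv_batches(fileobj, batch_size=32, exclude_cols=()):
    """Yield dicts of column lists, batch_size observations at a time."""
    reader = csv.DictReader(fileobj)
    names = [n for n in reader.fieldnames if n not in exclude_cols]
    batch = {name: [] for name in names}
    for row in reader:
        for name in names:
            batch[name].append(row[name])
        if len(batch[names[0]]) == batch_size:
            yield batch
            batch = {name: [] for name in names}
    if batch[names[0]]:  # final, possibly smaller, batch
        yield batch

raw = "x,y\n" + "".join(f"{i},{i * 2}\n" for i in range(5))
batches = list(csv_batches(io.StringIO(raw), batch_size=2))
# three batches: two of size 2, then a final batch of size 1
```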
class megatron.io.generator.PandasGenerator(dataframe, batch_size=32, exclude_cols=[])
Bases: object
A generator of data batches from a Pandas DataFrame into Megatron Input nodes.
Parameters:
- dataframe (Pandas.DataFrame) – dataframe to load data from.
- batch_size (int) – number of observations to yield in each iteration.
- exclude_cols (list of str (default: [])) – any columns that should not be loaded as Input.
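The same batching pattern over a DataFrame can be sketched with iloc slicing. `dataframe_batches` is a hypothetical illustration of PandasGenerator's documented parameters, not the library's code.

```python
import pandas as pd

def dataframe_batches(dataframe, batch_size=32, exclude_cols=()):
    """Yield batch_size-row slices of the dataframe as column dicts."""
    kept = dataframe.drop(columns=list(exclude_cols))
    for start in range(0, len(kept), batch_size):
        chunk = kept.iloc[start:start + batch_size]
        yield {name: chunk[name].to_numpy() for name in chunk.columns}

df = pd.DataFrame({"a": range(5), "b": range(5), "skip": range(5)})
batches = list(dataframe_batches(df, batch_size=2, exclude_cols=["skip"]))
# batch sizes 2, 2, 1; the excluded column never appears
```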
class megatron.io.generator.SQLGenerator(connection, query, batch_size=32, limit=None)
Bases: object
A generator of data batches from a SQL query in pipeline Input format.
Parameters:
- connection (Connection) – a database connection to any valid SQL database engine.
- query (str) – a SQL query, valid for the engine being used, that extracts the data for the Inputs.
- batch_size (int) – number of observations to yield in each iteration.
- limit (int) – number of observations to use from the query in total.
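How batch_size and limit interact can be sketched with a cursor's fetchmany over SQLite. `sql_batches` is a hypothetical stand-in for SQLGenerator under the documented semantics (limit caps the total number of observations drawn from the query).

```python
import sqlite3

def sql_batches(connection, query, batch_size=32, limit=None):
    """Mimic SQLGenerator: yield query results batch_size rows at a time."""
    cursor = connection.execute(query)
    names = [desc[0] for desc in cursor.description]
    fetched = 0
    while limit is None or fetched < limit:
        size = batch_size if limit is None else min(batch_size, limit - fetched)
        rows = cursor.fetchmany(size)
        if not rows:
            break
        fetched += len(rows)
        yield {name: [row[i] for row in rows] for i, name in enumerate(names)}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x INTEGER)")
conn.executemany("INSERT INTO obs VALUES (?)", [(i,) for i in range(10)])
batches = list(sql_batches(conn, "SELECT x FROM obs", batch_size=4, limit=6))
# two batches of x: [0, 1, 2, 3] and [4, 5]
```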
Storage (Output)

class megatron.io.storage.DataStore(table_name, version, db_conn, overwrite)
Bases: object
SQL table of input data and output features, associated with a single pipeline.
Parameters:
- table_name (str) – name of the pipeline’s cache table in the database.
- version (str) – version tag for the pipeline’s cache table in the database.
- db_conn (Connection) – database connection to query.
read(cols=None, rows=None)
Retrieve all processed features from the cache, or look up a single observation.
For features that are multi-dimensional, pickle is used to read the stored string.
Parameters:
- cols (list of int (default: None)) – indices of output columns to retrieve. If None, get all columns.
- rows (list of any or any (default: None)) – index value to look up output for, in dictionary form. If None, get all rows. Values should be the same data type as the index.
write(output_data, data_index)
Write a set of observations to the database.
For features that are multi-dimensional, pickle is used to compress them to a string.
Parameters:
- output_data (dict of ndarray) – resulting features from applying the pipeline to input_data.
- data_index (np.array) – index of the observations.
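The write/read round trip described above (features keyed by observation index, with multi-dimensional values pickled to a string) can be sketched against SQLite. `MiniDataStore` is a hypothetical, simplified illustration; versioning, the overwrite flag, and column selection are omitted, and this is not megatron's actual storage code.

```python
import pickle
import sqlite3

import numpy as np

class MiniDataStore:
    """Sketch of DataStore: one SQL table mapping observation index to features.

    Multi-dimensional features are pickled to a blob, as the docs describe.
    Table names are interpolated directly here purely for illustration.
    """

    def __init__(self, table_name, db_conn):
        self.table = table_name
        self.conn = db_conn
        self.conn.execute(
            f"CREATE TABLE IF NOT EXISTS {self.table} (idx TEXT PRIMARY KEY, features BLOB)"
        )

    def write(self, output_data, data_index):
        # one stored row per observation, keyed by its index value
        for i, idx in enumerate(data_index):
            features = {name: arr[i] for name, arr in output_data.items()}
            self.conn.execute(
                f"INSERT OR REPLACE INTO {self.table} VALUES (?, ?)",
                (str(idx), pickle.dumps(features)),
            )

    def read(self, rows=None):
        # rows=None returns everything; otherwise look up a single index value
        if rows is None:
            cursor = self.conn.execute(f"SELECT idx, features FROM {self.table}")
        else:
            cursor = self.conn.execute(
                f"SELECT idx, features FROM {self.table} WHERE idx = ?", (str(rows),)
            )
        return {idx: pickle.loads(blob) for idx, blob in cursor.fetchall()}

store = MiniDataStore("pipeline_cache", sqlite3.connect(":memory:"))
store.write({"embedding": np.arange(6).reshape(3, 2)}, data_index=np.array([10, 11, 12]))
row = store.read(rows=10)
# row["10"]["embedding"] is the first stored 2-vector
```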