Pipelines¶
Pipelines are the core of Megatron. Pipelines contain all your transformations and are what you ultimately use to generate outputs.
- class megatron.pipeline.Pipeline(inputs, outputs, metrics=[], explorers=[], name=None, version=None, storage=None, overwrite=False)¶ Bases: object
Holds the core computation graph that maps out Layers and manipulates data.
Parameters: - inputs (list of megatron.Node(s)) – input nodes of the Pipeline, where raw data is fed in.
- outputs (list of megatron.Node(s)) – output nodes of the Pipeline, the processed features.
- name (str) – unique identifying name of the Pipeline.
- version (str) – version tag for Pipeline’s cache table in the database.
- storage (Connection (default: 'sqlite')) – database connection to be used for input and output data storage.
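The idea of inputs, outputs, and the transformations between them can be illustrated with a minimal sketch. This is plain Python, not megatron's actual API; the Node class and run function here are hypothetical stand-ins:

```python
# Hypothetical sketch of a pipeline as a graph of nodes
# (illustrative only; not megatron's implementation).
class Node:
    def __init__(self, fn=None, parents=()):
        self.fn = fn            # transformation to apply; None for input nodes
        self.parents = list(parents)
        self.output = None

def run(inputs, outputs, input_data):
    """Feed raw data into the input nodes, then pull values through the graph."""
    for node, data in zip(inputs, input_data):
        node.output = data
    def value(node):
        if node.output is None:
            node.output = node.fn(*[value(p) for p in node.parents])
        return node.output
    return [value(node) for node in outputs]

# usage: one input node, one transformation node that doubles the data
x = Node()
doubled = Node(fn=lambda a: [v * 2 for v in a], parents=[x])
print(run([x], [doubled], [[1, 2, 3]]))  # [[2, 4, 6]]
```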
- inputs¶ input nodes of the Pipeline, where raw data is fed in.
Type: list of megatron.Node(s)
- outputs¶ output nodes of the Pipeline, the processed features.
Type: list of megatron.Node(s)
- path¶ full topological sort of the Pipeline from inputs to outputs.
Type: list of megatron.Node(s)
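A topological ordering like path can be computed with Kahn's algorithm. The sketch below is generic graph code for illustration, not megatron's own traversal:

```python
from collections import deque

def topological_sort(edges):
    """Return the nodes of a DAG in dependency order (Kahn's algorithm).
    edges maps each node to the list of nodes that depend on it."""
    nodes = set(edges) | {c for children in edges.values() for c in children}
    indegree = {n: 0 for n in nodes}
    for children in edges.values():
        for c in children:
            indegree[c] += 1
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for c in edges.get(n, []):
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    return order

# inputs -> transformation -> outputs
print(topological_sort({'in': ['scale'], 'scale': ['out']}))  # ['in', 'scale', 'out']
```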
- nodes¶ separation of the Pipeline's Nodes by type.
Type: dict of list of megatron.Node(s)
- eager¶ when True, TransformationNode outputs are calculated on creation. This is indicated by passing data to an InputNode as a function call.
Type: bool
- name¶ unique identifying name of the Pipeline.
Type: str
- version¶ version tag for the Pipeline's cache table in the database.
Type: str
- storage¶ storage database for input and output data.
Type: Connection (default: None)
- evaluate(input_data, prune=True)¶ Execute the metric Nodes in the Pipeline and get their results.
Parameters: input_data (dict of Numpy array) – the input data to be passed to InputNodes to begin execution.
- evaluate_generator(input_generator, steps)¶ Execute the metric Nodes in the Pipeline for each batch in a generator.
Parameters: - input_generator (generator of dict of Numpy array) – generator producing input data to be passed to InputNodes.
- steps (int) – number of batches to pull from input_generator before terminating.
- explore_generator(input_generator, steps)¶ Execute the explorer Nodes in the Pipeline for each batch in a generator.
Parameters: - input_generator (generator of dict of Numpy array) – generator producing input data to be passed to InputNodes.
- steps (int) – number of batches to pull from input_generator before terminating.
- fit(input_data, epochs=1)¶ Fit to input data and overwrite the metadata.
Parameters: - input_data (2-tuple of dict of Numpy array, Numpy array) – the input data to be passed to InputNodes to begin execution, and the index.
- epochs (int (default: 1)) – number of passes to perform over the data.
- fit_generator(input_generator, steps_per_epoch, epochs=1)¶ Fit to a generator of input data batches, executing partial_fit on each batch.
Parameters: - input_generator (generator of 2-tuple of dict of Numpy array and Numpy array) – generator that produces features and labels for each batch of data.
- steps_per_epoch (int) – number of batches that are considered one full epoch.
- epochs (int (default: 1)) – number of passes to perform over the data.
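The semantics described above — steps_per_epoch batches drawn per epoch, with an incremental fit applied to each batch — can be sketched as a plain loop. The helper below is hypothetical, not megatron's implementation:

```python
def fit_generator_sketch(partial_fit, input_generator, steps_per_epoch, epochs=1):
    """Draw steps_per_epoch batches per epoch and incrementally fit each one."""
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            features, labels = next(input_generator)
            partial_fit(features, labels)

# usage: record which batches a dummy fit sees over 2 epochs of 3 steps each
seen = []
def batches():
    i = 0
    while True:
        yield {'x': [i]}, [i]
        i += 1

fit_generator_sketch(lambda f, y: seen.append(f['x'][0]), batches(),
                     steps_per_epoch=3, epochs=2)
print(seen)  # [0, 1, 2, 3, 4, 5]
```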
- partial_fit(input_data)¶ Fit to input data incrementally, where possible.
Parameters: input_data (dict of Numpy array) – the input data to be passed to InputNodes to begin execution.
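Incremental fitting means learned statistics are updated batch by batch rather than recomputed from scratch; a running mean is the classic example. This is a generic sketch of the pattern, not megatron code:

```python
class RunningMean:
    """A statistic fitted incrementally, one batch at a time."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def partial_fit(self, batch):
        # update sufficient statistics; never revisits earlier batches
        self.total += sum(batch)
        self.count += len(batch)

    @property
    def mean(self):
        return self.total / self.count

m = RunningMean()
m.partial_fit([1, 2, 3])
m.partial_fit([4, 5])
print(m.mean)  # 3.0
```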
- save(save_dir)¶ Store the Pipeline and its learned metadata, without the outputs, on disk.
The filename will be {name of the pipeline}{version}.pkl.
Parameters: save_dir (str) – the desired location of the stored nodes, without the filename.
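The filename convention above can be sketched with pickle. The helper below is hypothetical; megatron's own serialization may differ in its details:

```python
import os
import pickle
import tempfile

def save_sketch(obj, save_dir, name, version):
    """Pickle obj to save_dir under the {name}{version}.pkl convention."""
    path = os.path.join(save_dir, '{}{}.pkl'.format(name, version))
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
    return path

# usage: save some stand-in metadata to a temporary directory
with tempfile.TemporaryDirectory() as d:
    path = save_sketch({'metadata': [1, 2]}, d, 'my_pipeline', '0.1')
    print(os.path.basename(path))  # my_pipeline0.1.pkl
```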
- transform(input_data, index_field=None, prune=True)¶ Execute the graph with some input data and get the output nodes' data.
Parameters: - input_data (dict of Numpy array) – the input data to be passed to InputNodes to begin execution.
- index_field (str) – name of key from input_data to be used as index for storage and lookup.
- prune (bool (default: True)) – whether to discard data in non-output Nodes after execution; setting this to False keeps intermediate data, which can be useful for debugging.
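Conceptually, pruning after execution keeps only the output nodes' data and drops the intermediates. A minimal sketch of that idea (hypothetical helper, not megatron's implementation):

```python
def prune_intermediate(node_data, output_names):
    """Keep only output nodes' data; drop intermediate nodes' data
    (a sketch of what pruning does conceptually)."""
    return {name: data for name, data in node_data.items()
            if name in output_names}

# usage: after running the graph, discard everything but the output node
all_data = {'in': [1, 2], 'scaled': [2, 4], 'out': [3, 5]}
print(prune_intermediate(all_data, {'out'}))  # {'out': [3, 5]}
```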
- transform_generator(input_generator, steps, index=None)¶ Execute the graph with input data from a generator, producing a generator of outputs.
Parameters: - input_generator (dict of Numpy array) – generator producing input data to be passed to Input nodes.
- steps (int) – number of batches to pull from input_generator before terminating.
- megatron.pipeline.load_pipeline(filepath, storage_db=None)¶ Load a set of nodes from a given file, stored previously with Pipeline.save().
Parameters: - filepath (str) – the file from which to load a Pipeline.
- storage_db (Connection (default: sqlite3.connect('megatron_default.db'))) – database connection object to query for cached data from the Pipeline.
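Loading mirrors saving: the pickled Pipeline is read back from the file written by save(). A pickle round-trip sketch (hypothetical helper, not megatron's own load_pipeline):

```python
import os
import pickle
import tempfile

def load_pipeline_sketch(filepath):
    """Unpickle a previously saved object from filepath (sketch of the idea)."""
    with open(filepath, 'rb') as f:
        return pickle.load(f)

# usage: write a stand-in saved pipeline, then load it back
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'pipe0.1.pkl')
    with open(path, 'wb') as f:
        pickle.dump({'name': 'pipe', 'version': '0.1'}, f)
    print(load_pipeline_sketch(path)['name'])  # pipe
```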