Data pipelines
During the exploration phase, you might want to compare how a model performs on a dataset processed with different transformers, for example on one version balanced with an undersampling strategy and another balanced with an oversampling strategy. For this, atom provides the branching system.
Branches
The branching system helps manage multiple pipelines within the same
atom instance. Every pipeline is stored in a branch, which can be
accessed through the branch attribute. A branch contains a copy of
the dataset, and all transformers and models that are fitted on that
specific dataset. Transformers and models called from atom use the
dataset in the current branch, and data attributes such as atom.dataset
also refer to the current branch. Use the branch's __repr__ to get an
overview of the transformers in the branch. Don't change the data in a
branch after fitting a model; this can cause unexpected model behaviour.
Instead, create a new branch for every unique pipeline.
By default, atom starts with one branch called "master". To start a new
branch, assign a new name to the property, e.g. atom.branch = "undersample".
This creates a new branch from the current one. To create a branch
from any other branch, put "_from_" between the new name and the name of
the branch from which to split, e.g. atom.branch = "oversample_from_master"
creates branch "oversample" from branch "master", even if the current
branch is "undersample". To switch between existing branches, assign the
name of the desired branch, e.g. atom.branch = "master" brings you back
to the master branch. Note that every branch contains a unique copy of the
whole dataset! Creating many branches can cause memory issues for large
datasets.
See the Imbalanced datasets or Feature engineering examples for branching use cases.
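The naming rules for branches can be illustrated with a small stand-alone sketch. This is not ATOM's implementation: ToyBranchManager, its dict-based dataset and the errors it raises are hypothetical, chosen only to mimic the three cases described above (new name, "_from_" name, existing name) and the fact that each branch holds its own copy of the data.

```python
import copy


class ToyBranchManager:
    """Hypothetical stand-in that mimics the branch naming rules (not ATOM code)."""

    def __init__(self, dataset):
        # Start with a single branch called "master" holding its own copy.
        self._branches = {"master": copy.deepcopy(dataset)}
        self._current = "master"

    @property
    def branch(self):
        """Name of the current branch."""
        return self._current

    @branch.setter
    def branch(self, name):
        if name in self._branches:
            # Existing name: switch to that branch.
            self._current = name
            return
        # "<new>_from_<parent>" splits from the named parent branch;
        # a plain new name splits from the current branch.
        new, sep, parent = name.partition("_from_")
        if not sep:
            parent = self._current
        # Assumed error handling: the real library may behave differently.
        if parent not in self._branches:
            raise ValueError(f"Unknown parent branch: {parent!r}")
        if not new or new in self._branches:
            raise ValueError(f"Invalid or duplicate branch name: {new!r}")
        # Every branch gets a full, independent copy of the parent's data.
        self._branches[new] = copy.deepcopy(self._branches[parent])
        self._current = new

    @property
    def dataset(self):
        """Dataset of the current branch."""
        return self._branches[self._current]


toy = ToyBranchManager({"target": [0, 0, 0, 1]})
toy.branch = "undersample"                   # new branch from "master"
toy.dataset["target"] = [0, 1]               # only changes this branch's copy
toy.branch = "oversample_from_master"        # split from "master", not "undersample"
toy.dataset["target"] = [0, 0, 0, 1, 1, 1]
toy.branch = "master"                        # switch back to an existing branch
```

Because every branch is a deep copy, the changes made in "undersample" and "oversample" leave the data in "master" untouched, which is also why many branches on a large dataset become expensive.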
Warning
Always create a new branch if you want to change the dataset after fitting a model! Not doing so can cause unexpected model behaviour.
The branch class has the following methods.
delete | Delete the branch from the atom instance. |
rename | Change the name of the branch. |
status | Get an overview of the pipeline and models in the branch. |
delete: Delete the branch and all the models in it. Same as executing del atom.branch.
rename: Change the name of the branch.
Parameters: |
name: str |
status: Get an overview of the pipeline and models in the branch. This method
prints the same information as the __repr__ and also saves it to the
logger.
Memory considerations
An atom instance stores one copy of the dataframe in each branch. Note
that there are always at least two branches in the instance: master
(or another user defined branch) and one additional branch that stores
the dataframe with which the class was initialized. This internal branch
is called og (original) and is used for the reset method and to create
new branches from the original dataframe, even after having applied
transformations.
Apart from the dataset itself, the model's predictions are also stored
as attributes of the model (e.g. predict_proba_train) and can occupy
considerable memory for large datasets. You can delete these attributes
using the reset_predictions method to free some memory before saving
the class.
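As a rough back-of-the-envelope check, the number of dataframe copies can be turned into a memory estimate. The helper below is a hypothetical sketch, not part of ATOM. It assumes every value takes 8 bytes (as a float64 would) and ignores the index, object columns and stored model predictions, so the result is at best a lower bound.

```python
def estimate_branch_memory(n_rows, n_cols, n_user_branches=1, bytes_per_value=8):
    """Estimate the bytes held by dataset copies across branches.

    Counts one copy per user branch plus one for the internal og branch.
    Assumes a fixed size per value (8 bytes by default).
    """
    per_copy = n_rows * n_cols * bytes_per_value
    return (n_user_branches + 1) * per_copy


# Example: 1 million rows, 50 columns, 3 user branches (e.g. master,
# undersample, oversample) plus the og branch -> 4 copies of 400 MB each.
total_bytes = estimate_branch_memory(1_000_000, 50, n_user_branches=3)
print(f"{total_bytes / 1e9:.1f} GB")  # prints "1.6 GB"
```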
Data transformations
Many datasets need to be transformed before they are ready to be ingested by a model. ATOM provides various classes to apply data cleaning and feature engineering transformations to the data. These classes cover most of the transformations typically needed to get the data ready for modelling. For further fine-tuning, it's also possible to transform the data using custom transformers (see the add method) or through a function (see the apply method). Remember that all transformations are only applied to the dataset in the current branch.
AutoML
Automated machine learning (AutoML) automates the selection, composition
and parameterization of machine learning pipelines. Automating the machine
learning process makes it more user-friendly and often provides faster, more
accurate outputs than hand-coded algorithms. ATOM uses the TPOT
package for AutoML optimization. TPOT uses a genetic algorithm to intelligently
explore thousands of possible pipelines in order to find the best one for your
data. Such an algorithm can be started through the automl method. The
resulting data transformers and final estimator are merged with atom's
pipeline (check the pipeline and models attributes after the method
finishes running).
Warning
AutoML algorithms aren't intended to run for only a few minutes. If left to its default parameters, the method can take a very long time to finish!