Data pipelines

During the exploration phase, you might want to compare how a model performs on a dataset processed with different transformers, for example on one dataset balanced with an undersampling strategy and another balanced with an oversampling strategy. For this, atom provides the branching system.

Branches

The branching system helps manage multiple pipelines within the same atom instance. Every pipeline is stored in a branch, which can be accessed through the branch attribute. A branch contains a copy of the dataset, and all transformers and models that are fitted on that specific dataset. Transformers and models called from atom use the dataset in the current branch, and data attributes such as atom.dataset also refer to the current branch. Use the branch's __repr__ to get an overview of the transformers in the branch. Don't change the data in a branch after fitting a model, since this can cause unexpected model behaviour. Instead, create a new branch for every unique pipeline.

By default, atom starts with one branch called "master". To start a new branch, assign a new name to the branch property, e.g. atom.branch = "undersample". This creates a new branch from the current one. To create a branch from any other branch, type "_from_" between the new name and the name of the branch from which to split, e.g. atom.branch = "oversample_from_master" creates branch "oversample" from branch "master", even if the current branch is "undersample". To switch between existing branches, assign the name of the desired branch, e.g. atom.branch = "master" brings you back to the master branch. Note that every branch contains a unique copy of the whole dataset! Creating many branches can cause memory issues for large datasets.
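The naming rules above can be illustrated with a small, self-contained sketch. The BranchManager class below is a hypothetical toy, not ATOM's implementation. It only mimics how a name assigned to branch is interpreted (switch to an existing branch, split from the current branch, or split from another branch via "_from_") and it keeps a deep copy of the data per branch. Unlike atom.branch, its branch property simply returns the name of the current branch.

```python
import copy


class BranchManager:
    """Toy stand-in for the branching rules described above (not ATOM's code)."""

    def __init__(self, dataset):
        # Start with a single branch called "master".
        self._branches = {"master": {"data": dataset, "transformers": []}}
        self._current = "master"

    @property
    def branch(self):
        # Simplification: return the branch name instead of a branch object.
        return self._current

    @branch.setter
    def branch(self, name):
        if name in self._branches:
            # Existing name: just switch to that branch.
            self._current = name
            return
        # New name: split from the named parent, or from the current branch.
        new, sep, parent = name.partition("_from_")
        parent = parent if sep else self._current
        if parent not in self._branches:
            raise ValueError(f"Unknown parent branch: {parent!r}")
        if new in self._branches:
            raise ValueError(f"Branch {new!r} already exists")
        # Every branch gets its own full copy of the data (hence the memory cost).
        self._branches[new] = copy.deepcopy(self._branches[parent])
        self._current = new

    def data(self):
        """Return the dataset of the current branch."""
        return self._branches[self._current]["data"]


manager = BranchManager([[1, 0], [2, 1], [3, 1]])
manager.branch = "undersample"             # split from the current branch (master)
manager.data().pop()                       # change only this branch's copy
manager.branch = "oversample_from_master"  # split from master, not from undersample
```

Because each branch owns a deep copy in this sketch, changing the data in "undersample" leaves "master" and "oversample" untouched. The same copying is why many branches on a large dataset can become memory-hungry.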

Figure 1. Diagram of a possible branch system to compare an oversampling with an undersampling pipeline.

You can delete a branch either by deleting the attribute, e.g. del atom.branch, or by using the delete method, e.g. atom.branch.delete(). Be aware that deleting a branch also deletes all models that were trained on it! Use atom.branch.status() for an overview of the transformers and models in the branch.

See the Imbalanced datasets or Feature engineering examples for branching use cases.

Warning

Always create a new branch if you want to change the dataset after fitting a model! Not doing so can cause unexpected model behaviour.

Data transformations

Many datasets need to be transformed before they are ready to be ingested by a model. ATOM provides various classes to apply data cleaning and feature engineering transformations to the data. This tooling should cover most of the commonly needed transformations to get the data ready for modelling. For further fine-tuning, it's also possible to transform the data using custom transformers (see the add method) or through a function (see the apply method). Remember that all transformations are only applied to the dataset in the current branch.
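As a rough illustration of the two customization routes mentioned above, the sketch below defines a transformer object and a plain function in pure Python. Both MeanImputer and clip_negative are hypothetical names, and the sketch assumes the common scikit-learn-style fit/transform convention for custom transformers. The exact requirements and signatures of the add and apply methods are not described in this section, so the objects are exercised directly here instead of being passed to atom.

```python
class MeanImputer:
    """Hypothetical custom transformer following the fit/transform convention."""

    def fit(self, X, y=None):
        # Learn the mean of every column, ignoring missing values (None).
        self.means_ = []
        for column in zip(*X):
            present = [value for value in column if value is not None]
            self.means_.append(sum(present) / len(present))
        return self

    def transform(self, X):
        # Replace every missing value with the mean learned for its column.
        return [
            [self.means_[j] if value is None else value for j, value in enumerate(row)]
            for row in X
        ]


def clip_negative(value):
    """Hypothetical element-wise function: negative values become 0.0."""
    return max(value, 0.0)


X = [[1.0, -4.0], [None, 2.0], [3.0, None]]

# Transformer route: fit on the data, then transform it.
X_imputed = MeanImputer().fit(X).transform(X)

# Function route: apply a function to every value.
X_clipped = [[clip_negative(value) for value in row] for row in X_imputed]
```

The transformer learns state from the data in fit and reuses it in transform, while the function is stateless. That difference is the usual reason to prefer one route over the other.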

AutoML

Automated machine learning (AutoML) automates the selection, composition and parameterization of machine learning pipelines. Automating the machine learning process makes it more user-friendly and often provides faster, more accurate outputs than hand-coded algorithms. ATOM uses the TPOT package for AutoML optimization. TPOT uses a genetic algorithm to intelligently explore thousands of possible pipelines in order to find the best one for your data. The search is started through the automl method. The resulting data transformers and final estimator are merged with atom's pipeline (check the pipeline and models attributes after the method finishes running).

Warning

AutoML algorithms aren't intended to run for only a few minutes. If left to its default parameters, the method can take a very long time to finish!
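To make the idea of a genetic search over pipelines more concrete, here is a deliberately tiny sketch. It is not TPOT's algorithm and does not call ATOM. The search space, the score function and all parameter values are made up for illustration, and a real search would score each candidate with a cross-validated metric, which is what makes it slow.

```python
import random

# Hypothetical search space: a "pipeline" is one choice per step.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "selector": ["none", "pca", "univariate"],
    "model": ["tree", "linear", "forest"],
}


def score(pipeline):
    """Made-up fitness value; stands in for an expensive model evaluation."""
    bonus = {"standard": 0.10, "pca": 0.05, "forest": 0.20, "linear": 0.10}
    return 0.5 + sum(bonus.get(choice, 0.0) for choice in pipeline.values())


def random_pipeline(rng):
    return {step: rng.choice(options) for step, options in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    # Take every step from one of the two parents.
    return {step: rng.choice([a[step], b[step]]) for step in SEARCH_SPACE}


def mutate(pipeline, rng, rate=0.2):
    # Occasionally replace a step with a randomly drawn option.
    return {
        step: rng.choice(SEARCH_SPACE[step]) if rng.random() < rate else choice
        for step, choice in pipeline.items()
    }


def evolve(generations=10, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    history = []  # best score at the start of every generation
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        history.append(score(population[0]))
        parents = population[: population_size // 2]  # the fittest half survives
        children = [
            mutate(crossover(*rng.sample(parents, 2), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=score), history


best, history = evolve()
```

Because the fittest half always survives, the best score in history never decreases from one generation to the next. Every generation evaluates a whole population, so the cost grows with both the population size and the number of generations.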