Frequently asked questions


Here we answer some questions that pop up regularly. If you have any other questions, don't hesitate to post them on the Slack channel!



Is this package related to the Atom text editor?

There is, indeed, a text editor with the same name and a similar logo. Is this a shameless copy? No. When I started the project, I didn't know about the text editor, and it doesn't take much imagination to come up with the idea of replacing the letter O in the word atom with the image of an atom.


How does ATOM relate to AutoML?

ATOM is not an AutoML tool since it does not automate the search for an optimal pipeline like well-known AutoML tools such as auto-sklearn or TPOT do. Instead, ATOM helps the user find the optimal pipeline themselves. One of the goals of this package is to help data scientists produce explainable pipelines, and using an AutoML black box would impede that. That said, it is possible to integrate a TPOT pipeline with atom through the automl method, as sketched below.
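
For illustration, a minimal sketch of that integration on a toy dataset (any keyword arguments for TPOT are omitted; check the automl documentation for the exact signature):

    from atom import ATOMClassifier
    from sklearn.datasets import load_breast_cancer

    # Let TPOT search for a pipeline and merge the result into atom.
    X, y = load_breast_cancer(return_X_y=True, as_frame=True)

    atom = ATOMClassifier(X, y)
    atom.automl()  # runs the TPOT search; the resulting pipeline is merged into atom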


Is it possible to run deep learning models?

Yes. Deep learning models can be added as custom models to the pipeline as long as they follow sklearn's API. If the dataset is 2-dimensional, everything should work normally. If the dataset has more than 2 dimensions, as is often the case for images, only a subset of atom's methods will work. For more information, see the deep learning section of the user guide.
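
As an illustration, here is a hedged sketch of adding a Keras network as a custom model, assuming an existing atom instance and that the scikeras package is used to give the network a sklearn-compatible API (the architecture and training settings are placeholders):

    from scikeras.wrappers import KerasClassifier
    from tensorflow import keras

    def build_model():
        # Placeholder architecture for a binary classification task.
        model = keras.Sequential([
            keras.layers.Dense(32, activation="relu"),
            keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")
        return model

    # The wrapper follows sklearn's API, so it can be passed to atom.run
    # like any other custom model (assumption: atom accepts estimator
    # instances as custom models).
    atom.run(models=KerasClassifier(model=build_model, epochs=5, verbose=0))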


Can I run atom's methods on just a subset of the columns?

Yes, all data cleaning and feature engineering methods accept a columns parameter to transform only the selected features. For example, to impute only the numerical columns in the dataset, we could type atom.impute(strat_num="mean", columns=atom.numerical). The parameter accepts column names, column indices, dtypes, or a slice object.
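
A few hedged examples of the accepted formats, assuming an atom instance whose dataset contains columns named age and income (hypothetical names):

    atom.impute(strat_num="mean", columns=atom.numerical)     # by dtype selection
    atom.impute(strat_num="mean", columns=["age", "income"])  # by column name
    atom.impute(strat_num="mean", columns=[0, 1])             # by column index
    atom.impute(strat_num="mean", columns=slice(0, 5))        # by slice object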


How can I compare the same model on different datasets?

On many occasions you might want to test how a model performs on datasets processed with different pipelines. For this, atom has the branch system. Create a new branch for every pipeline you want to test and use the plot methods to compare all models, independently of the branch they were trained on.
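
A hedged sketch of that workflow, assuming X and y are already loaded (the branch name and model tag are hypothetical):

    from atom import ATOMClassifier

    atom = ATOMClassifier(X, y)
    atom.impute(strat_num="mean")
    atom.run(models="LR")    # logistic regression on the first pipeline

    atom.branch = "scaled"   # create a second branch from the current one
    atom.scale()
    atom.run(models="LR2")   # a numeric tag distinguishes the two runs

    atom.plot_roc()          # compares both models, regardless of branch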


Can I train models through atom using a GPU?

ATOM doesn't fit the models itself; the models' underlying package does. Since the majority of predefined models are implemented through sklearn, and sklearn works on CPU only, those models cannot be trained on a GPU. If you are using a custom model whose package allows a GPU implementation (e.g. Keras), and the settings or model parameters are tuned to do so, the model will train on the GPU just as it would without ATOM.


How are numerical and categorical columns differentiated?

The columns are separated using the dataframe's select_dtypes method. Numerical columns are selected with include="number", whereas categorical columns are selected with exclude="number".
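
In pandas terms, that corresponds to:

    import pandas as pd

    df = pd.DataFrame({"age": [23, 45], "city": ["Paris", "Oslo"]})

    numerical = df.select_dtypes(include="number")    # age
    categorical = df.select_dtypes(exclude="number")  # city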


Can I run unsupervised learning pipelines?

No. For now, ATOM only supports supervised machine learning pipelines. However, various unsupervised algorithms can be chosen as the strategy in the Pruner class to detect and remove outliers from the dataset.
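
For example, a minimal sketch using the isolation forest strategy, assuming an existing atom instance (see the Pruner documentation for the available strategies):

    # Detect and drop samples flagged as outliers by an isolation forest.
    atom.prune(strategy="iforest")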


Is there a way to plot multiple models in the same shap plot?

No. Unfortunately, there is no way to plot multiple models in the same shap plot, since the plots are made by the shap package and passed to atom as matplotlib.axes objects. This means it's not within this package's reach to implement such a utility.


Can I merge a sklearn pipeline with atom?

Yes. Like any other transformer, a sklearn pipeline can be added to atom using the add method. Every transformer in the pipeline is merged independently. The pipeline is not allowed to end with a model, since atom manages its own models. If it does, add the pipeline using atom.add(pipeline[:-1]), as sketched below.
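
A minimal sketch of that case, assuming an existing atom instance (the pipeline steps are placeholders):

    from sklearn.pipeline import make_pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    pipeline = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())

    # The pipeline ends with a model, so drop the final step before merging.
    atom.add(pipeline[:-1])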


Is it possible to initialize atom with an existing train and test set?

Yes. If you already have a separate train and test set, you can initialize atom in two ways:

  • atom = ATOMClassifier(train, test)
  • atom = ATOMClassifier((X_train, y_train), (X_test, y_test))

Make sure the train and test set have the same number of columns! If atom is initialized in either of these two ways, the test_size parameter is ignored.
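
A runnable sketch of the second form, using a toy dataset:

    from atom import ATOMClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    atom = ATOMClassifier((X_train, y_train), (X_test, y_test))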


Can I train the models using cross-validation?

It is not possible to train models using cross-validation, but for a good reason. Applying cross-validation would mean transforming every step of the pipeline multiple times, each with different results. This would prevent ATOM from showing the transformation results after every pre-processing step, which means losing the ability to inspect how a transformer changed the dataset. This makes cross-validation an inappropriate technique for exploration purposes.

So why not use cross-validation only to train and evaluate the models, instead of applying it to the whole pipeline? Cross-validating only the models would make no sense here. If we used the complete dataset for that (both the train and test set), we would be evaluating the models on data that was used to fit the transformers. This implies data leakage and can severely bias the results towards specific transformers. On the other hand, using only the training set defeats the point of applying cross-validation in the first place, since we can train the model on the complete training set and evaluate the results on the independent test set. The only remaining reason to do cross-validation would be to get an idea of the robustness of the model, which can also be achieved using bootstrapping. That said, ideally we would cross-validate the entire pipeline using the entire dataset. This can be done using a trainer's cross_validate method but, for the reasons explained above, the method only outputs the final metric results.
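
A hedged sketch of that call, assuming an existing atom instance and that the method accepts sklearn-style parameters such as cv (check the API reference for the exact signature):

    # Cross-validate the entire pipeline on the entire dataset. Only the
    # final metric scores are returned, as explained above.
    scores = atom.cross_validate(cv=5)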


Is there a way to process datetime features?

Yes, the FeatureExtractor class (released in v4.7.0) can automatically extract useful features (day, month, year, etc.) from datetime columns. The extracted features are always encoded to numerical values, so they can be fed directly to a model.
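
A minimal sketch using the class directly on a toy dataframe (the features parameter is an assumption; check the API reference for the exact signature):

    import pandas as pd
    from atom.feature_engineering import FeatureExtractor

    X = pd.DataFrame({"date": pd.to_datetime(["2021-01-15", "2021-06-30"])})

    fe = FeatureExtractor(features=["day", "month", "year"])
    X_transformed = fe.fit_transform(X)  # datetime column replaced by numerical features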
