
Frequently asked questions





How does ATOM relate to the Atom text editor?

There is, indeed, a text editor with the same name and a similar logo as this package. Is this a shameless copy? No. When I started the project, I didn't know about the text editor, and it doesn't take much imagination to come up with the idea of replacing the letter O in the word atom with the image of an atom.


How does ATOM relate to AutoML?

ATOM is not an AutoML tool since it does not automate the search for an optimal pipeline like well-known AutoML tools such as auto-sklearn or TPOT do. Instead, ATOM helps the user find the optimal pipeline themselves. One of the goals of this package is to help data scientists produce explainable pipelines, and using an AutoML black box would impede that. That said, it is possible to integrate a TPOT pipeline with atom through the automl method.
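
A minimal sketch of that integration (the keyword arguments shown are assumptions; as far as I know they are forwarded to the underlying TPOT estimator):

    from atom import ATOMClassifier
    from sklearn.datasets import load_breast_cancer

    # Small example dataset
    X, y = load_breast_cancer(return_X_y=True)

    atom = ATOMClassifier(X, y, verbose=1)

    # Let TPOT search for an optimal pipeline and merge the result with atom.
    # scoring and max_time_mins are assumed to be passed on to TPOT.
    atom.automl(scoring="accuracy", max_time_mins=10)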


Is it possible to run deep learning models?

Yes. Deep learning models can be added as custom models to the pipeline as long as they follow sklearn's API. If the dataset is 2-dimensional, everything should work normally. If the dataset has more than 2 dimensions, as is often the case for images, only a subset of atom's methods will work. For more information, see the deep learning section of the user guide.
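
A rough sketch of what this can look like, assuming a 2-dimensional dataset and Keras' (since deprecated) scikit_learn wrapper to make the network follow sklearn's API:

    from atom import ATOMClassifier
    from sklearn.datasets import load_breast_cancer
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.wrappers.scikit_learn import KerasClassifier

    X, y = load_breast_cancer(return_X_y=True)

    def build_model():
        # Small feed-forward network for binary classification
        model = Sequential()
        model.add(Dense(32, activation="relu", input_dim=X.shape[1]))
        model.add(Dense(1, activation="sigmoid"))
        model.compile(optimizer="adam", loss="binary_crossentropy")
        return model

    atom = ATOMClassifier(X, y, verbose=1)

    # The sklearn-compatible wrapper is passed to atom as a custom model
    atom.run(models=KerasClassifier(build_model, epochs=5, verbose=0))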


Can I run atom's methods on just a subset of the columns?

Yes, all data cleaning and feature engineering methods accept a columns parameter to transform only the selected features. For example, to impute only the numerical columns in the dataset, use atom.impute(strat_num="mean", columns=atom.numerical). The parameter accepts column names, column indices, or a slice object.
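
A short sketch of the accepted formats (the column names below are hypothetical):

    # Impute only the numerical columns
    atom.impute(strat_num="mean", columns=atom.numerical)

    # Column names, column indices and slice objects are also accepted
    atom.encode(columns=["gender", "city"])  # by name (hypothetical columns)
    atom.scale(columns=slice(2, 5))          # columns 2 up to (excluding) 5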


How can I compare the same model on different datasets?

On many occasions you might want to test how a model performs on datasets processed with different pipelines. For this, atom has the branch system. Create a new branch for every pipeline you want to test and use the plot methods to compare all models, independent of the branch they were trained on.
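
A minimal sketch of that workflow (assuming the acronym-with-tag syntax, e.g. "RF_2", to train the same estimator under a second name):

    from atom import ATOMClassifier
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)
    atom = ATOMClassifier(X, y, verbose=1)

    # First pipeline: mean imputation
    atom.impute(strat_num="mean")
    atom.run(models="RF")

    # Second branch: same data, different pipeline (knn imputation)
    atom.branch = "knn_branch"
    atom.impute(strat_num="knn")
    atom.run(models="RF_2")

    # Compare both models, independent of their branch
    atom.plot_results()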


Can I train models through atom using a GPU?

ATOM doesn't fit the models itself; the models' underlying package does. Since the majority of the predefined models are implemented through sklearn, and sklearn works on CPU only, they cannot be trained on a GPU. If you are using a custom model whose package allows a GPU implementation (e.g. Keras), and the settings or model parameters are tuned to do so, the model will train on the GPU just as it would without ATOM.


How are numerical and categorical columns differentiated?

The columns are separated using a dataframe's select_dtypes method. Numerical columns are selected using include="number" whereas categorical columns are selected using exclude="number".
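
In plain pandas terms:

    import pandas as pd

    df = pd.DataFrame({"age": [23, 45, 31], "city": ["Lisbon", "Oslo", "Kyiv"]})

    num_cols = df.select_dtypes(include="number").columns  # Index(['age'])
    cat_cols = df.select_dtypes(exclude="number").columns  # Index(['city'])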


Can I run unsupervised learning pipelines?

No. For now, ATOM only supports supervised machine learning pipelines. However, various unsupervised algorithms can be chosen as strategy in the Pruner class to detect and remove outliers from the dataset.
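
For example (a sketch; "iforest" is, as far as I know, one of the supported strategies):

    # Remove outliers with an unsupervised Isolation Forest
    atom.prune(strategy="iforest")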


Is there a way to plot multiple models in the same shap plot?

No. Unfortunately, there is no way to plot multiple models in the same shap plot, since the plots are made by the SHAP package and passed as matplotlib.axes objects to atom. This means it's not within the scope of this package to implement such a utility.


Can I merge a sklearn pipeline with atom?

Yes. Like any other transformer, it is possible to add a sklearn pipeline to atom using the add method. Every transformer in the pipeline is merged independently. The pipeline is not allowed to end with a model since atom manages its own models. If that is the case, add the pipeline using atom.add(pipeline[:-1]).
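
A minimal sketch:

    from atom import ATOMClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)
    atom = ATOMClassifier(X, y)

    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=5)),
        ("model", LogisticRegression()),
    ])

    # The pipeline ends with a model, so drop the final step before merging
    atom.add(pipeline[:-1])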


Is it possible to initialize atom with an existing train and test set?

Yes. If you already have a separated train and test set you can initialize atom in two ways:

  • atom = ATOMClassifier(train, test)
  • atom = ATOMClassifier((X_train, y_train), (X_test, y_test))

Make sure the train and test set have the same number of columns. When initialized in either of these two ways, the test_size parameter is ignored.
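
For example:

    from atom import ATOMClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    # Both initializations from the list above are equivalent;
    # test_size is ignored here
    atom = ATOMClassifier((X_train, y_train), (X_test, y_test))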


Can I train the models using cross-validation?

It is not possible to train models using cross-validation, but for a good reason. Applying cross-validation would mean transforming every step of the pipeline multiple times, with different results. This would prevent ATOM from being able to show the transformation results after every pre-processing step, which means losing the ability to inspect how a transformer changed the dataset. This makes cross-validation an inappropriate technique for our exploration purposes.

So why not use cross-validation only to train and evaluate the models, instead of applying it to the whole pipeline? Cross-validating only the models would make no sense here. If we used the complete dataset for that, both the train and the test set, we would be evaluating the models on data that was used to fit the transformers. This implies data leakage and can severely bias the results towards specific transformers. On the other hand, using only the training set defeats the purpose of applying cross-validation in the first place, since we can simply train the model on the complete training set and evaluate the results on the independent test set; no cross-validation needed. That said, ideally we would cross-validate the entire pipeline using the entire dataset. This can be done using a trainer's cross_validate method, but for the reason explained above, the method only outputs the final metric results.
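
A minimal sketch (hedged: the exact location of cross_validate, on the trainer or on a fitted model, and its parameters may vary between versions):

    atom.run(models="LR")

    # Cross-validate the whole pipeline on the whole dataset.
    # Only the final metric results are returned, for the reasons above.
    scores = atom.cross_validate()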


Why does encoding fail with a ValueError when transforming?

The exception ValueError: Columns to be encoded can not contain new values can occur when calling the encode method or when transforming through the Encoder class. This exception is not raised by ATOM but by the category_encoders package, which is used internally to perform the encodings. The error is raised when the column to be transformed contains classes that were not encountered during fitting. This is intended behavior, since the transformer wouldn't know what to do with the new classes. In atom's case, it means the test set contains classes that were not present in the training set. This usually happens when the training set is very small or when the column has many classes with few occurrences. To fix this, either increase the number of samples in the training set to make sure it contains all the classes in the column, or increase the frac_to_other parameter.
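
For example, grouping rare classes before encoding (a sketch; the strategy value shown is illustrative):

    # Classes covering less than 4% of the rows are grouped into an "other"
    # category, so rare classes in the test set no longer break the transform
    atom.encode(strategy="LeaveOneOut", frac_to_other=0.04)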
