ATOMRegressor
ATOMRegressor is ATOM's wrapper for regression tasks. Use this class to easily apply all data transformations and model management provided by the package on a given dataset. Note that contrary to sklearn's API, an ATOMRegressor instance already contains the dataset on which we want to perform the analysis. Calling a method will automatically apply it on the dataset it contains.
You can predict, plot and call any model from atom.
Parameters:
*arrays: sequence of indexables Dataset containing features and target. Allowed formats are: X; X, y; train, test; train, test, holdout; or X_train, X_test, y_train, y_test, where X is a feature set with shape=(n_samples, n_features) and y is the target column.
y: int, str or sequence, optional (default=-1) Target column. The default value selects the last column in the dataset. This parameter is ignored if the target column is provided through arrays.
test_size: int or float, optional (default=0.2) Size of the test set. This parameter is ignored if the test set is provided through arrays.
holdout_size: int, float or None, optional (default=None) Size of the holdout set. This parameter is ignored if the holdout set is provided through arrays.
n_rows: int or float, optional (default=1) Random subsample of the provided dataset to use. The default value selects all the rows.
shuffle: bool, optional (default=True) Whether to shuffle the dataset before splitting the train and test set.
n_jobs: int, optional (default=1) Number of cores to use for parallel processing. Beware that using multiple processes on the same machine may cause memory issues for large datasets.
gpu: bool or str, optional (default=False) Train estimators on GPU (instead of CPU). Refer to the documentation to check which estimators are supported.
verbose: int, optional (default=0) Verbosity level of the class. Choose from: 0 to not print anything, 1 to print basic information, 2 to print detailed information.
warnings: bool or str, optional (default=True) Whether to show or suppress encountered warnings. Changing this parameter affects the PYTHONWARNINGS environment variable.
logger: str, Logger or None, optional (default=None) Name of the log file or Logger object. Use "auto" for an automatic name.
experiment: str or None, optional (default=None) Name of the mlflow experiment to use for tracking.
random_state: int or None, optional (default=None) Seed used by the random number generator.
Magic methods
The class contains some magic methods to help you access some of its elements faster. Note that methods that apply on the pipeline can return different results per branch.
- __repr__: Prints an overview of atom's branches, models, metric and errors.
- __len__: Returns the length of the dataset.
- __iter__: Iterate over the pipeline's transformers.
- __contains__: Checks if the provided item is a column in the dataset.
- __getitem__: Access a branch, model, column or subset of the dataset.
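To illustrate, a minimal sketch of these magic methods in use (X, y and the "age" column are hypothetical):

from atom import ATOMRegressor

atom = ATOMRegressor(X, y)  # X and y are a loaded feature set and target

len(atom)      # Length of the dataset
"age" in atom  # Whether the dataset has an "age" column
atom["age"]    # Access the "age" column
for transformer in atom:
    print(transformer)  # Iterate over the pipeline's transformers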
Attributes
Data attributes
The dataset can be accessed at any time through multiple attributes,
e.g. calling atom.train
will return the training set. Updating one
of the data attributes will automatically update the rest as well.
Changing the branch will also change the response from these attributes
accordingly.
Attributes:
pipeline: pd.Series
feature_importance: list
dataset: pd.DataFrame
train: pd.DataFrame
test: pd.DataFrame
X: pd.DataFrame
y: pd.Series
X_train: pd.DataFrame
y_train: pd.Series
X_test: pd.DataFrame
y_test: pd.Series
shape: tuple
columns: pd.Index
n_columns: int
features: pd.Index
n_features: int
target: str
mapping: dict of dicts
scaled: bool or None
duplicates: int
nans: pd.Series or None
n_nans: int or None
numerical: pd.Index
n_numerical: int
categorical: pd.Index
n_categorical: int
outliers: pd.Series or None
n_outliers: int or None
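For example, a short sketch of working with the data attributes (assuming atom is an initialized ATOMRegressor):

atom.train  # Training set as a pd.DataFrame
atom.shape  # Dimensions of the dataset
atom.nans   # Number of missing values per column

# Updating one data attribute automatically updates the rest, e.g.
# reassigning the training set also changes atom.dataset and atom.X_train.
atom.train = atom.train.iloc[:100, :]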
Utility attributes
Attributes:
missing: list
models: list
metric: str or list
errors: dict
winners: list of str
winner: model
results: pd.DataFrame Dataframe of the training results.
Additional: attributes and methods for the dataset
Some attributes and methods can be called from atom but will return the call from the dataset in the current branch, e.g. atom.dtypes shows the types of every column in the dataset. These attributes and methods are: "size", "head", "tail", "loc", "iloc", "describe", "iterrows", "dtypes", "at" and "iat".
Plot attributes
Attributes:
style: str
palette: str
title_fontsize: int
label_fontsize: int
tick_fontsize: int
Utility methods
The class contains a variety of utility methods to handle the data and manage the pipeline.
add | Add a transformer to the current branch. |
apply | Apply a function to the dataset. |
automl | Search for an optimized pipeline in an automated fashion. |
available_models | Give an overview of the available predefined models. |
canvas | Create a figure with multiple plots. |
clear | Clear attributes from all models. |
delete | Delete models from the trainer. |
distribution | Get statistics on column distributions. |
drop | Drop columns from the dataset. |
evaluate | Get all models' scores for the provided metrics. |
export_pipeline | Export the pipeline to a sklearn-like Pipeline object. |
log | Save information to the logger and print to stdout. |
merge | Merge another trainer into this one. |
report | Get an extensive profile analysis of the data. |
reset | Reset the instance to its initial state. |
reset_aesthetics | Reset the plot aesthetics to their default values. |
save | Save the instance to a pickle file. |
save_data | Save data to a csv file. |
shrink | Convert the columns to the smallest possible matching dtype. |
stacking | Add a Stacking instance to the models in the pipeline. |
stats | Print out a list of basic statistics on the dataset. |
status | Get an overview of atom's branches, models and errors. |
voting | Add a Voting instance to the models in the pipeline. |
Add a transformer to the current branch. If the transformer is not fitted, it is fitted on the complete training set. Afterwards, the dataset is transformed and the transformer is added to atom's pipeline. If the transformer is a sklearn Pipeline, every transformer in it is merged independently with atom.
Warning
- The transformer should have fit and/or transform methods with arguments X (accepting an array-like object of shape=(n_samples, n_features)) and/or y (accepting a sequence of shape=(n_samples,)).
- The transform method should return a feature set as an array-like object of shape=(n_samples, n_features) and/or a target column as a sequence of shape=(n_samples,).
Note
If the transform method doesn't return a dataframe:
- The column naming happens as follows. If the transformer has a get_feature_names or get_feature_names_out method, it is used. If not, and it returns the same number of columns, the names are kept equal. If the number of columns changes, old columns keep their name (as long as the column is unchanged) and new columns receive the name x[N-1], where N stands for the n-th feature. This means that a transformer should only transform, add or drop columns, not combinations of these.
- The index remains the same as before the transformation. This means that the transformer should not add, remove or shuffle rows unless it returns a dataframe.
Note
If the transformer has an n_jobs and/or random_state parameter that is left to its default value, it adopts atom's value.
Parameters:
transformer: estimator
columns: int, str, slice, sequence or None, optional (default=None)
train_only: bool, optional (default=False)
**fit_params
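For instance, a minimal sketch of adding an unfitted sklearn transformer to the current branch (the choice of PCA and its settings are illustrative):

from sklearn.decomposition import PCA

# PCA is fitted on the training set, the dataset is transformed
# and the transformer is appended to atom's pipeline.
atom.add(PCA(n_components=5))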
Transform one column in the dataset using a function (can be a lambda). If the provided column is present in the dataset, that same column is transformed. If it's not a column in the dataset, a new column with that name is created. The first parameter of the function is the complete dataset.
Note
This approach is preferred over changing the dataset directly
through the property's @setter
since the transformation
is saved to atom's pipeline.
Parameters:
func: callable
columns: int or str
args: tuple, optional (default=())
**kwargs
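As an illustration, a sketch creating a new column with a lambda (the column names are hypothetical):

import numpy as np

# The function receives the complete dataset; since "log_price" is
# not an existing column, a new column with that name is created.
atom.apply(lambda df: np.log(df["price"]), columns="log_price")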
Uses the TPOT package to perform an automated search of transformers and a final estimator that maximizes a metric on the dataset. The resulting transformations and estimator are merged with atom's pipeline. The tpot instance can be accessed through the tpot attribute. Read more in the user guide.
Parameters:
**kwargs Keyword arguments for TPOTRegressor.
Give an overview of the available predefined models.
Returns:
pd.DataFrame Information about the predefined models available for the current task.
This @contextmanager
allows you to draw many plots in one figure.
The default option is to add two plots side by side. See the
user guide for an example.
Parameters:
nrows: int, optional (default=1)
ncols: int, optional (default=2)
title: str or None, optional (default=None)
figsize: tuple or None, optional (default=None)
filename: str or None, optional (default=None)
display: bool, optional (default=True)
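For example, a sketch drawing two model plots side by side in one figure (assuming models have already been trained):

with atom.canvas(nrows=1, ncols=2, title="Model evaluation"):
    atom.plot_errors()     # First plot in the figure
    atom.plot_residuals()  # Second plot in the figure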
Reset all model attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the class. The cleared attributes per model are the prediction attributes, metric scores and shap values.
Delete models from the trainer. If all models are removed, the metric is reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. Deleted models are not removed from any active mlflow experiment.
Parameters:
models: str or sequence, optional (default=None) Models to delete. If None, delete them all.
Compute the Kolmogorov-Smirnov test for various distributions against columns in the dataset. Only for numerical columns. Missing values are ignored.
Tip
Use the plot_distribution method to plot a column's distribution.
Parameters:
distributions: str, sequence or None, optional (default=None)
columns: int, str, slice, sequence or None, optional (default=None)
Returns:
pd.DataFrame Statistic results, with multiindex levels for the distribution name and the test statistic.
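For example, a sketch testing one column against two distributions (the column name is hypothetical):

# Compare the "age" column to the normal and uniform distributions.
atom.distribution(distributions=["norm", "uniform"], columns="age")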
Drop columns from the dataset.
Note
This approach is preferred over dropping columns from the
dataset directly through the property's @setter
since
the transformation is saved to atom's pipeline.
Parameters:
columns: int, str, slice or sequence Names or indices of the columns to drop.
Get all the models' scores for the provided metrics.
Parameters:
metric: str, func, scorer, sequence or None, optional (default=None)
dataset: str, optional (default="test")
sample_weight: sequence or None, optional (default=None)
Returns:
pd.DataFrame Scores of the models.
Export atom's pipeline to a sklearn-like Pipeline object. Optionally, you can add a model as final estimator. The returned pipeline is already fitted on the training set.
Info
ATOM's Pipeline class behaves the same as a sklearn Pipeline, and additionally:
- Accepts transformers that change the target column.
- Accepts transformers that drop rows.
- Accepts transformers that are only fitted on a subset of the provided dataset.
- Always outputs pandas objects.
- Uses transformers that are only applied on the training set (see the balance or prune methods) to fit the pipeline, not to make predictions on unseen data.
Parameters:
model: str or None, optional (default=None) Name of the model to add as a final estimator to the pipeline. If None, only the transformers are in the pipeline.
memory: bool, str, Memory or None, optional (default=None) Used to cache the fitted transformers of the pipeline.
verbose: int or None, optional (default=None)
Returns:
Pipeline Current branch as a sklearn-like Pipeline object.
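A sketch of exporting the pipeline with a trained model as final estimator (the model name and new data are hypothetical):

# Export the current branch with the random forest as final estimator.
pl = atom.export_pipeline(model="RF")

# The returned object behaves like a fitted sklearn Pipeline.
predictions = pl.predict(X_new)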
Write a message to the logger and print it to stdout.
Parameters:
msg: str
level: int, optional (default=0)
Merge another trainer into this one. Branches, models, metrics and
attributes of the other trainer are merged into this one. If there
are branches and/or models with the same name, they are merged
adding the suffix
parameter to their name. The errors and missing
attributes are extended with those of the other instance. It's only
possible to merge two instances if they are initialized with the same
dataset and trained with the same metric.
Parameters:
other: trainer
suffix: str, optional (default="2")
Create an extensive profile analysis report of the data. The report is rendered in HTML5 and CSS3. Note that this method can be slow for n_rows > 10k.
Parameters:
dataset: str, optional (default="dataset")
n_rows: int or None, optional (default=None)
filename: str or None, optional (default=None)
**kwargs
Returns:
ProfileReport Created profile object.
Reset the instance to its initial state, i.e. it deletes all branches
and models. The dataset is also reset to its form after initialization.
Reset the plot aesthetics to their default values.
Save the instance to a pickle file. Remember that the class contains the complete dataset as attribute, so the file can become large for big datasets! To avoid this, use save_data=False.
Parameters:
filename: str, optional (default="auto")
save_data: bool, optional (default=True)
Save the data in the current branch to a csv file.
Parameters:
filename: str, optional (default="auto")
dataset: str, optional (default="dataset")
Convert the columns to the smallest possible matching dtype, e.g. float64 -> float32 and int64 -> int8. Sparse arrays also transform their non-fill value. Use this method for memory optimization.
Note that applying transformers to the data may alter the types again.
Parameters:
obj2cat: bool, optional (default=True)
int2uint: bool, optional (default=False)
dense2sparse: bool, optional (default=False)
columns: int, str, slice, sequence or None, optional (default=None)
Add a Stacking model to the pipeline.
Parameters:
name: str, optional (default="Stack")
models: sequence or None, optional (default=None)
**kwargs
Print basic information about the dataset.
Get an overview of the branches, models and errors in the instance.
This method prints the same information as atom's __repr__ and also
saves it to the logger.
Add a Voting model to the pipeline.
Parameters:
name: str, optional (default="Vote")
models: sequence or None, optional (default=None)
**kwargs
Data cleaning
The class provides data cleaning methods to scale or transform the features and handle missing values, categorical columns and outliers. Calling one of them will automatically apply the method on the dataset in the pipeline.
Tip
Use the report method to examine the data and help you determine suitable parameters for the data cleaning methods.
scale | Scale the dataset. |
normalize | Transform the data to follow a Normal/Gaussian distribution. |
clean | Apply standard data cleaning steps on the dataset. |
impute | Handle missing values in the dataset. |
discretize | Bin continuous data into intervals. |
encode | Encode categorical features. |
prune | Prune outliers from the training set. |
Applies one of sklearn's scalers. Non-numerical columns are ignored. The
estimator created by the class is attached to atom. See the
Scaler class for a description of the parameters.
Transform the data to follow a Normal/Gaussian distribution. This transformation is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired. Missing values are disregarded in fit and maintained in transform. Categorical columns are ignored. The estimator created by the class is attached to atom. See the Normalizer class for a description of the parameters.
Applies standard data cleaning steps on the dataset. Use the parameters to choose which transformations to perform. The available steps are:
- Drop columns with specific data types.
- Strip categorical features from white spaces.
- Drop categorical columns with maximal cardinality (the number of unique values equals the number of samples).
- Drop columns with minimum cardinality (all values are the same).
- Drop duplicate rows.
- Drop rows with missing values in the target column.
See the Cleaner class for a description of the parameters.
Impute or remove missing values according to the selected strategy.
Also removes rows and columns with too many missing values. The
imputer is fitted only on the training set to avoid data leakage.
Use the missing
attribute to customize what are considered "missing
values". See Imputer for a description
of the parameters. Note that since the Imputer can remove rows from
both the train and test set, the size of the sets may change after
the transformation.
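For example, a sketch using the Imputer's strategy parameters (the parameter values are illustrative):

# Impute numerical columns with their median and categorical
# columns with their most frequent value.
atom.impute(strat_num="median", strat_cat="most_frequent")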
Bin continuous data into intervals. For each feature, the bin edges are computed during fit and, together with the number of bins, they will define the intervals. Ignores categorical columns. See Discretizer for a description of the parameters.
Perform encoding of categorical features. The encoding type depends on the number of unique values in the column:
- If n_unique=2 or ordinal feature, use Label-encoding.
- If 2 < n_unique <= max_onehot, use OneHot-encoding.
- If n_unique > max_onehot, use `strategy`-encoding.
Missing values are propagated to the output column. Unknown classes encountered during transforming are converted to np.NaN. The class is also capable of replacing classes with low occurrences with the value other in order to prevent too high cardinality. See Encoder for a description of the parameters.
Note
This method only encodes the categorical features. It does not encode the target column! Use the clean method for that.
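For example, a sketch of encoding with a mix of strategies (the strategy choice is illustrative):

# One-hot encode columns with up to 10 unique classes and apply
# target encoding to the higher-cardinality columns.
atom.encode(strategy="Target", max_onehot=10)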
Prune outliers from the training set. The definition of outlier depends on the selected strategy and can differ greatly from one strategy to another. Ignores categorical columns. The estimators created by the class are attached to atom. See Pruner for a description of the parameters.
Note
This transformation is only applied to the training set in order to maintain the original distribution of samples in the test set.
NLP
The Natural Language Processing (NLP) transformers help to convert raw text to meaningful numeric values, ready to be ingested by a model.
textclean | Apply standard text cleaning to the corpus. |
tokenize | Convert documents into sequences of words. |
textnormalize | Convert words to a more uniform standard. |
vectorize | Transform the corpus into meaningful vectors of numbers. |
Applies standard text cleaning to the corpus. Transformations include normalizing characters and dropping noise from the text (emails, HTML tags, URLs, etc.). The transformations are applied on the column named corpus, in the same order the parameters are presented. If there is no column with that name, an exception is raised. See the TextCleaner class for a description of the parameters.
Convert documents into sequences of words. Additionally, create n-grams (represented by words united with underscores, e.g. "New_York") based on their frequency in the corpus. The transformations are applied on the column named corpus. If there is no column with that name, an exception is raised. See the Tokenizer class for a description of the parameters.
Convert words to a more uniform standard. The transformations are applied on the column named corpus, in the same order the parameters are presented. If there is no column with that name, an exception is raised. If the provided documents are strings, words are separated by spaces. See the TextNormalizer class for a description of the parameters.
Transform the corpus into meaningful vectors of numbers. The transformation is applied on the column named corpus. If there is no column with that name, an exception is raised. The transformed columns are named after the word they embed (if the column is already present in the provided dataset, _[strategy] is appended to the name). See the Vectorizer class for a description of the parameters.
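Taken together, a sketch of a typical NLP flow on a dataset with a corpus column (the vectorization strategy is illustrative):

atom.textclean()                  # Normalize characters and drop noise
atom.tokenize()                   # Convert documents into sequences of words
atom.textnormalize()              # Convert words to a more uniform standard
atom.vectorize(strategy="tfidf")  # Turn the corpus into TF-IDF vectors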
Feature engineering
To further pre-process the data, it's possible to extract features from datetime columns, create new non-linear features transforming the existing ones or, if the dataset is too large, remove features using one of the provided strategies.
feature_extraction | Extract features from datetime columns. |
feature_generation | Create new features from combinations of existing ones. |
feature_selection | Remove features according to the selected strategy. |
Extract features (hour, day, month, year, etc.) from datetime columns.
Columns of dtype datetime64
are used as is. Categorical columns that
can be successfully converted to a datetime format (less than 30% NaT
values after conversion) are also used. See the FeatureExtractor class for a
description of the parameters.
Create new combinations of existing features to capture the non-linear
relations between the original features. See FeatureGenerator
for a description of the parameters. Attributes created by the class
are attached to atom.
Remove features according to the selected strategy. Ties between features with equal scores are broken in an unspecified way. Additionally, remove multicollinear and low variance features. See FeatureSelector for a description of the parameters. Plotting methods and attributes created by the class are attached to atom.
Note
- When strategy="univariate" and solver=None, f_regression is used as default solver.
- When the strategy requires a model and it's one of ATOM's predefined models, the algorithm automatically selects the regressor (no need to add _reg to the solver).
- When strategy is not one of univariate or pca, and solver=None, atom uses the winning model (if it exists) as solver.
- When strategy is sfs, rfecv or any of the advanced strategies and no scoring is specified, atom uses the metric in the pipeline (if it exists) as scoring parameter.
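For example, a sketch of univariate selection (the parameter values are illustrative):

# Keep the 10 features that score best on univariate statistical
# tests; with solver=None, f_regression is used for this task.
atom.feature_selection(strategy="univariate", n_features=10)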
Training
The training methods are where the models are fitted to the data and their performance is evaluated according to the selected metric. There are three methods to call the three different training approaches in ATOM. All relevant attributes and methods from the training classes are attached to atom for convenience. These include the errors, winner and results attributes, as well as the models, and the prediction and plotting methods.
run | Fit the models to the data in a direct fashion. |
successive_halving | Fit the models to the data in a successive halving fashion. |
train_sizing | Fit the models to the data in a train sizing fashion. |
Fit and evaluate the models. The following steps are applied to every model:
- Hyperparameter tuning is performed using a Bayesian Optimization approach (optional).
- The model is fitted on the training set using the best combination of hyperparameters found.
- The model is evaluated on the test set.
- The model is trained on various bootstrapped samples of the training set and scored again on the test set (optional).
See DirectRegressor for a description of the parameters.
Fit and evaluate the models in a successive halving fashion. The following steps are applied to every model (per iteration):
- Hyperparameter tuning is performed using a Bayesian Optimization approach (optional).
- The model is fitted on the training set using the best combination of hyperparameters found.
- The model is evaluated on the test set.
- The model is trained on various bootstrapped samples of the training set and scored again on the test set (optional).
See SuccessiveHalvingRegressor for a description of the parameters.
Fit and evaluate the models in a train sizing fashion. The following steps are applied to every model (per iteration):
- Hyperparameter tuning is performed using a Bayesian Optimization approach (optional).
- The model is fitted on the training set using the best combination of hyperparameters found.
- The model is evaluated on the test set.
- The model is trained on various bootstrapped samples of the training set and scored again on the test set (optional).
See TrainSizingRegressor for a description of the parameters.
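For example, a sketch of a successive halving run (the model and metric choices are illustrative):

# Start with three models; every successive run keeps the best half
# of them, trained on a larger fraction of the training set.
atom.successive_halving(
    models=["Tree", "RF", "LGB"],
    metric="r2",
    n_bootstrap=5,
)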
Example
from atom import ATOMRegressor
# Initialize atom
atom = ATOMRegressor(X, y, logger="auto", n_jobs=2, verbose=2)
# Apply data cleaning methods
atom.prune(strategy="iforest", include_target=True)
# Fit the models to the data
atom.run(
models=["OLS", "BR", "CatB"],
metric="MSE",
n_calls=25,
n_initial_points=10,
n_bootstrap=4,
)
# Analyze the results
atom.plot_errors(figsize=(9, 6), filename="errors.png")
atom.catb.plot_feature_importance(filename="catboost_feature_importance.png")
# Get the predictions for the best model on new data
pred = atom.predict(X_new)