
ATOMRegressor


class atom.api.ATOMRegressor(*arrays, y=-1, index=False, test_size=0.2, holdout_size=None, shuffle=True, n_rows=1, n_jobs=1, gpu=False, verbose=0, warnings=True, logger=None, experiment=None, random_state=None) [source]

ATOMRegressor is ATOM's wrapper for regression tasks. Use this class to easily apply all data transformations and model management provided by the package on a given dataset. Note that contrary to sklearn's API, an ATOMRegressor instance already contains the dataset on which we want to perform the analysis. Calling a method will automatically apply it on the dataset it contains.

You can predict, plot and call any model from atom.

Parameters: *arrays: sequence of indexables
Dataset containing features and target. Allowed formats are:
  • X
  • X, y
  • train, test
  • train, test, holdout
  • X_train, X_test, y_train, y_test
  • X_train, X_test, X_holdout, y_train, y_test, y_holdout
  • (X_train, y_train), (X_test, y_test)
  • (X_train, y_train), (X_test, y_test), (X_holdout, y_holdout)
X, train, test: dataframe-like

Feature set with shape=(n_samples, n_features).

y: int, str or sequence
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • Else: Target column with shape=(n_samples,).
y: int, str or sequence, optional (default=-1)
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • Else: Target column with shape=(n_samples,).

This parameter is ignored if the target column is provided through arrays.

index: bool, int, str or sequence, optional (default=False)
  • If False: Reset to RangeIndex.
  • If True: Use the current index.
  • If int: Index of the column to use as index.
  • If str: Name of the column to use as index.
  • If sequence: Index column with shape=(n_samples,).
test_size: int or float, optional (default=0.2)
  • If <=1: Fraction of the dataset to include in the test set.
  • If >1: Number of rows to include in the test set.

This parameter is ignored if the test set is provided through arrays.

holdout_size: int, float or None, optional (default=None)
  • If None: No holdout data set is kept apart.
  • If <=1: Fraction of the dataset to include in the holdout set.
  • If >1: Number of rows to include in the holdout set.

This parameter is ignored if the holdout set is provided through arrays.

shuffle: bool, optional (default=True)
Whether to shuffle the dataset before splitting the train and test set. Be aware that not shuffling the dataset can cause an unequal distribution of target classes over the sets.

n_rows: int or float, optional (default=1)
Random subsample of the provided dataset to use. The default value selects all the rows.
  • If <=1: Select this fraction of the dataset.
  • If >1: Select this exact number of rows.

This parameter is only used when the input doesn't already specify the data sets (i.e. when the input is X or X, y).

n_jobs: int, optional (default=1)
Number of cores to use for parallel processing.
  • If >0: Number of cores to use.
  • If -1: Use all available cores.
  • If <-1: Use available_cores - 1 + n_jobs.

Beware that using multiple processes on the same machine may cause memory issues for large datasets.

gpu: bool or str, optional (default=False)
Train estimators on GPU (instead of CPU). Refer to the documentation to check which estimators are supported.
  • If False: Always use CPU implementation.
  • If True: Use GPU implementation if possible.
  • If "force": Force GPU implementation.
verbose: int, optional (default=0)
Verbosity level of the class. Choose from:
  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.
warnings: bool or str, optional (default=True)
  • If True: Default warning action (equal to "default").
  • If False: Suppress all warnings (equal to "ignore").
  • If str: One of the actions in python's warnings environment.

Changing this parameter affects the PYTHONWARNINGS environment.
ATOM can't manage warnings that go directly from C/C++ code to stdout.

logger: str, Logger or None, optional (default=None)
  • If None: Doesn't save a logging file.
  • If str: Name of the log file. Use "auto" for automatic naming.
  • Else: Python logging.Logger instance.

experiment: str or None, optional (default=None)
Name of the mlflow experiment to use for tracking. If None, no mlflow tracking is performed.

random_state: int or None, optional (default=None)
Seed used by the random number generator. If None, the random number generator is the RandomState instance used by np.random.
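
For illustration, a minimal initialization sketch (X, y, train and test are placeholders for your own data):

from atom import ATOMRegressor

# Provide the features and target separately; atom creates the train/test split
atom = ATOMRegressor(X, y, test_size=0.2, n_jobs=2, verbose=1, random_state=1)

# Or provide a ready-made split, optionally keeping a holdout set apart
atom = ATOMRegressor(train, test, holdout_size=0.1, random_state=1)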


Magic methods

The class contains some magic methods to help you access some of its elements faster. Note that methods that apply on the pipeline can return different results per branch.

  • __repr__: Prints an overview of atom's branches, models, metric and errors.
  • __len__: Returns the length of the dataset.
  • __iter__: Iterate over the pipeline's transformers.
  • __contains__: Checks if the provided item is a column in the dataset.
  • __getitem__: Access a branch, model, column or subset of the dataset.
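
A short sketch of these magic methods in use (the column and model names are hypothetical):

len(atom)               # Number of rows in the dataset
"price" in atom         # Whether "price" is a column in the dataset
atom["price"]           # Access the "price" column
atom["OLS"]             # Access a trained model by its acronym
for transformer in atom:
    print(transformer)  # Iterate over the transformers in the pipeline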


Attributes

Data attributes

The dataset can be accessed at any time through multiple attributes, e.g. calling atom.train will return the training set. Updating one of the data attributes will automatically update the rest as well. Changing the branch will also change the response from these attributes accordingly.
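
For example (a hedged sketch of read access):

atom.shape    # (n_rows, n_columns) of the complete dataset
atom.train    # Training set as a pd.DataFrame
atom.X_test   # Test features
atom.y_train  # Training target as a pd.Series
# Assigning to one of these attributes (e.g. atom.train = ...) updates the others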

Attributes:

pipeline: pd.Series
Series containing all transformers fitted on the data in the current branch. Use this attribute only to access the individual instances. To visualize the pipeline, use the status method from the branch or the plot_pipeline method.

feature_importance: list
Features ordered by most to least important. This attribute is created after running the feature_selection, plot_permutation_importance or plot_feature_importance methods.

dataset: pd.DataFrame
Complete dataset in the pipeline.

train: pd.DataFrame
Training set.

test: pd.DataFrame
Test set.

X: pd.DataFrame
Feature set.

y: pd.Series
Target column.

X_train: pd.DataFrame
Training features.

y_train: pd.Series
Training target.

X_test: pd.DataFrame
Test features.

y_test: pd.Series
Test target.

shape: tuple
Dataset's shape: (n_rows, n_columns) or (n_rows, (shape_sample), n_columns) for datasets with more than two dimensions.

columns: pd.Index
Names of the columns in the dataset.

n_columns: int
Number of columns in the dataset.

features: pd.Index
Names of the features in the dataset.

n_features: int
Number of features in the dataset.

target: str
Name of the target column.

mapping: dict of dicts
Encoded values and their respective mapping. The column name is the key to its mapping dictionary. Only for columns mapped to a single column (e.g. Ordinal, Leave-one-out, etc...).

scaled: bool or None
Whether the feature set is scaled. It is considered scaled when it has mean=0 and std=1, or when atom has a scaler in the pipeline. Returns None for multidimensional or sparse datasets.

duplicates: int
Number of duplicate rows in the dataset.

nans: pd.Series or None
Columns with the number of missing values in them. Returns None for multidimensional or sparse datasets.

n_nans: int or None
Number of samples containing missing values. Returns None for multidimensional or sparse datasets.

numerical: pd.Index
Names of the numerical features in the dataset.

n_numerical: int
Number of numerical features in the dataset.

categorical: pd.Index
Names of the categorical features in the dataset.

n_categorical: int
Number of categorical features in the dataset.

outliers: pd.Series or None
Columns in training set with amount of outlier values. Returns None for multidimensional or sparse datasets.

n_outliers: int or None
Number of samples in the training set containing outliers. Returns None for multidimensional or sparse datasets.


Utility attributes

Attributes:

missing: list
List of values that are considered "missing" (used by the clean and impute methods). Default values are: "", "?", "None", "NA", "nan", "NaN" and "inf". Note that None, NaN, +inf and -inf are always considered missing since they are incompatible with sklearn estimators.

models: list
List of models in the pipeline.

metric: str or list
Metric(s) used to fit the models.

errors: dict
Dictionary of the encountered exceptions (if any).

winners: list of str
Model names ordered by performance on the test set (either through the metric_test or mean_bootstrap attribute).

winner: model
Model subclass that performed best on the test set (either through the metric_test or mean_bootstrap attribute).

results: pd.DataFrame
Dataframe of the training results. Columns can include:
  • metric_bo: Best score achieved during the BO.
  • time_bo: Time spent on the BO.
  • metric_train: Metric score on the training set.
  • metric_test: Metric score on the test set.
  • time_fit: Time spent fitting and evaluating.
  • mean_bootstrap: Mean score of the bootstrap results.
  • std_bootstrap: Standard deviation score of the bootstrap results.
  • time_bootstrap: Time spent on the bootstrap algorithm.
  • time: Total time spent on the whole run.

Additional: Attributes and methods for dataset

Some attributes and methods can be called from atom but will return the call from the dataset in the current branch, e.g. atom.dtypes shows the types of every column in the dataset. These attributes and methods are: "size", "head", "tail", "loc", "iloc", "describe", "iterrows", "dtypes", "at", "iat".
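
After training, these attributes can be inspected as follows (a sketch; the model names are hypothetical):

atom.models              # e.g. ["OLS", "BR", "CatB"]
atom.metric              # Metric(s) used to fit the models
atom.winner.metric_test  # Test score of the best performing model
atom.results             # Dataframe with one row per model
atom.head(10)            # Forwarded to the dataset in the current branch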


Plot attributes

Attributes:

style: str
Plotting style. See seaborn's documentation.

palette: str
Color palette. See seaborn's documentation.

title_fontsize: int
Fontsize for the plot's title.

label_fontsize: int
Fontsize for labels and legends.

tick_fontsize: int
Fontsize for the ticks along the plot's axes.



Utility methods

The class contains a variety of utility methods to handle the data and manage the pipeline.

add Add a transformer to the current branch.
apply Apply a function to the dataset.
automl Search for an optimized pipeline in an automated fashion.
available_models Give an overview of the available predefined models.
canvas Create a figure with multiple plots.
clear Clear attributes from all models.
delete Delete models from the trainer.
distribution Get statistics on column distributions.
drop Drop columns from the dataset.
evaluate Get all models' scores for the provided metrics.
export_pipeline Export the pipeline to a sklearn-like Pipeline object.
log Save information to the logger and print to stdout.
merge Merge another trainer into this one.
report Get an extensive profile analysis of the data.
reset Reset the instance to its initial state.
reset_aesthetics Reset the plot aesthetics to their default values.
save Save the instance to a pickle file.
save_data Save data to a csv file.
shrink Convert the columns to the smallest possible matching dtype.
stacking Add a Stacking instance to the models in the pipeline.
stats Print out a list of basic statistics on the dataset.
status Get an overview of atom's branches, models and errors.
voting Add a Voting instance to the models in the pipeline.


method add(transformer, columns=None, train_only=False, **fit_params) [source]

Add a transformer to the current branch. If the transformer is not fitted, it is fitted on the complete training set. Afterwards, the data set is transformed and the transformer is added to atom's pipeline. If the transformer is a sklearn Pipeline, every transformer is merged independently with atom.

Warning

  • The transformer should have fit and/or transform methods with arguments X (accepting an array-like object of shape=(n_samples, n_features)) and/or y (accepting a sequence of shape=(n_samples,)).
  • The transform method should return a feature set as an array-like object of shape=(n_samples, n_features) and/or a target column as a sequence of shape=(n_samples,).

Note

If the transform method doesn't return a dataframe:

  • The column naming happens as follows. If the transformer has a get_feature_names or get_feature_names_out method, it is used. If not, and it returns the same number of columns, the names are kept equal. If the number of columns changes, old columns keep their name (as long as the column is unchanged) and new columns receive the name x[N-1], where N stands for the n-th feature. This means that a transformer should only transform, add or drop columns, not combinations of these.
  • The index remains the same as before the transformation. This means that the transformer should not add, remove or shuffle rows unless it returns a dataframe.

Note

If the transformer has a n_jobs and/or random_state parameter that is left to its default value, it adopts atom's value.

Parameters:

transformer: estimator
Transformer to add to the pipeline. Should implement a transform method.

columns: int, str, slice, sequence or None, optional (default=None)
Names, indices or dtypes of the columns in the dataset to transform. If None, transform all columns. Add ! in front of a name or dtype to exclude that column, e.g. atom.add(Transformer(), columns="!Location") transforms all columns except Location. You can either include or exclude columns, not combinations of these.

train_only: bool, optional (default=False)
Whether to apply the estimator only on the training set or on the complete dataset. Note that if True, the transformation is skipped when making predictions on unseen data.

**fit_params
Additional keyword arguments for the fit method of the transformer.
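
A hedged sketch of adding unfitted sklearn transformers to the current branch (the column name is hypothetical):

from sklearn.preprocessing import PowerTransformer
from sklearn.decomposition import PCA

# Each transformer is fitted on the training set and appended to atom's pipeline
atom.add(PowerTransformer(), columns="!Location")  # all columns except Location
atom.add(PCA(n_components=5))                      # adopts atom's random_state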


method apply(func, columns, args=(), **kwargs) [source]

Transform one column in the dataset using a function (can be a lambda). If the provided column is present in the dataset, that same column is transformed. If it's not a column in the dataset, a new column with that name is created. The first parameter of the function is the complete dataset.

Note

This approach is preferred over changing the dataset directly through the property's @setter since the transformation is saved to atom's pipeline.

Parameters:

func: callable
Function to apply to the dataset.

columns: int or str
Name or index of the column in the dataset to create or transform.

args: tuple, optional (default=())
Positional arguments for the function (after the dataset).

**kwargs
Additional keyword arguments for the function.
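
A small sketch (the column names are hypothetical; note that func receives the complete dataset as its first argument):

import numpy as np

# Transform an existing column in place
atom.apply(lambda df: np.log(df["price"]), columns="price")

# Create a new column from existing ones
atom.apply(lambda df: df["length"] * df["width"], columns="area")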


method automl(**kwargs) [source]

Uses the TPOT package to perform an automated search of transformers and a final estimator that maximizes a metric on the dataset. The resulting transformations and estimator are merged with atom's pipeline. The tpot instance can be accessed through the tpot attribute. Read more in the user guide.

Parameters: **kwargs
Keyword arguments for TPOTRegressor.


method available_models() [source]

Give an overview of the available predefined models.

Returns: pd.DataFrame
Information about the predefined models available for the current task. Columns include:
  • acronym: Model's acronym (used to call the model).
  • fullname: Complete name of the model.
  • estimator: The model's underlying estimator.
  • module: The estimator's module.
  • needs_scaling: Whether the model requires feature scaling.
  • accepts_sparse: Whether the model has native support for sparse matrices.
  • supports_gpu: Whether the model has GPU support.


method canvas(nrows=1, ncols=2, title=None, figsize=None, filename=None, display=True) [source]

This @contextmanager allows you to draw many plots in one figure. The default option is to add two plots side by side. See the user guide for an example.

Parameters:

nrows: int, optional (default=1)
Number of plots in length.

ncols: int, optional (default=2)
Number of plots in width.

title: str or None, optional (default=None)
Plot's title. If None, no title is displayed.

figsize: tuple or None, optional (default=None)
Figure's size, format as (x, y). If None, it adapts the size to the number of plots in the canvas.

filename: str or None, optional (default=None)
Name of the file. Use "auto" for automatic naming. If None, the figure is not saved.

display: bool, optional (default=True)
Whether to render the plot.
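
For example, to draw two plots side by side in one figure (assuming the models are already trained and using two of atom's plotting methods):

with atom.canvas(1, 2, title="Residual analysis", figsize=(12, 5)):
    atom.plot_residuals()
    atom.plot_errors()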


method clear() [source]

Reset all model attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the class.




method delete(models=None) [source]

Delete models from the trainer. If all models are removed, the metric is reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. Deleted models are not removed from any active mlflow experiment.

Parameters: models: str or sequence, optional (default=None)
Models to delete. If None, delete them all.


method distribution(distributions=None, columns=0) [source]

Compute the Kolmogorov-Smirnov test for various distributions against columns in the dataset. Only for numerical columns. Missing values are ignored.

Tip

Use the plot_distribution method to plot a column's distribution.

Parameters:

distributions: str, sequence or None, optional (default=None)
Names of the distributions in scipy.stats to get the statistics on. If None, a selection of the most common ones is used.

columns: int, str, slice, sequence or None, optional (default=None)
Names, indices or dtypes of the columns in the dataset to perform the test on. If None, select all numerical columns.

Returns: pd.DataFrame
Statistic results with multiindex levels:
  • dist: Name of the distribution.
  • stat: Statistic results:
    • score: KS-test score.
    • p_value: Corresponding p-value.
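
For example (the column name is hypothetical):

# Kolmogorov-Smirnov test of two scipy.stats distributions against one column
ks = atom.distribution(distributions=["norm", "lognorm"], columns="price")
print(ks)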


method drop(columns) [source]

Drop columns from the dataset.

Note

This approach is preferred over dropping columns from the dataset directly through the property's @setter since the transformation is saved to atom's pipeline.

Parameters: columns: int, str, slice or sequence
Names or indices of the columns to drop.


method evaluate(metric=None, dataset="test", sample_weight=None) [source]

Get all the models' scores for the provided metrics.

Parameters:

metric: str, func, scorer, sequence or None, optional (default=None)
Metrics to calculate. If None, a selection of the most common metrics per task is used.

dataset: str, optional (default="test")
Data set on which to calculate the metric. Choose from: "train", "test" or "holdout".

sample_weight: sequence or None, optional (default=None)
Sample weights corresponding to y in dataset.

Returns: pd.DataFrame
Scores of the models.
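
For example (metric names follow the same convention as in the run method):

# Compare all trained models on the test set
scores = atom.evaluate(metric=["mae", "r2"], dataset="test")
print(scores)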


method export_pipeline(model=None, memory=None, verbose=None) [source]

Export atom's pipeline to a sklearn-like Pipeline object. Optionally, you can add a model as final estimator. The returned pipeline is already fitted on the training set.

Info

ATOM's Pipeline class behaves the same as a sklearn Pipeline, and additionally:

  • Accepts transformers that change the target column.
  • Accepts transformers that drop rows.
  • Accepts transformers that are only fitted on a subset of the provided dataset.
  • Always outputs pandas objects.
  • Uses transformers that are only applied on the training set (see the balance or prune methods) to fit the pipeline, not to make predictions on unseen data.

Parameters:

model: str or None, optional (default=None)
Name of the model to add as a final estimator to the pipeline. If the model used automated feature scaling, the scaler is added to the pipeline. If None, only the transformers are added.

memory: bool, str, Memory or None, optional (default=None)
Used to cache the fitted transformers of the pipeline.
  • If None or False: No caching is performed.
  • If True: A default temp directory is used.
  • If str: Path to the caching directory.
  • If Memory: Object with the joblib.Memory interface.

verbose: int or None, optional (default=None)
Verbosity level of the transformers in the pipeline. If None, it leaves them to their original verbosity. Note that this is not the pipeline's own verbose parameter. To change that, use the set_params method.

Returns: Pipeline
Current branch as a sklearn-like Pipeline object.
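
For example (the model acronym and X_new are placeholders):

# Export the transformers plus a trained model as a sklearn-like Pipeline
pl = atom.export_pipeline(model="OLS", memory=True, verbose=0)
pred = pl.predict(X_new)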


method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.


method merge(other, suffix="2") [source]

Merge another trainer into this one. Branches, models, metrics and attributes of the other trainer are merged into this one. If there are branches and/or models with the same name, they are merged adding the suffix parameter to their name. The errors and missing attributes are extended with those of the other instance. It's only possible to merge two instances if they are initialized with the same dataset and trained with the same metric.

Parameters:

other: trainer
Trainer instance with which to merge.

suffix: str, optional (default="2")
Conflicting branches and models are merged adding suffix to the end of their names.


method report(dataset="dataset", n_rows=None, filename=None, **kwargs) [source]

Create an extensive profile analysis report of the data. The report is rendered in HTML5 and CSS3. Note that this method can be slow for n_rows > 10k.

Parameters:

dataset: str, optional (default="dataset")
Data set to get the report from.

n_rows: int or None, optional (default=None)
Number of (randomly picked) rows to process. None to use all rows.

filename: str or None, optional (default=None)
Name to save the file with (as .html). None to not save anything.

**kwargs
Additional keyword arguments for the ProfileReport instance.

Returns: ProfileReport
Created profile object.


method reset() [source]

Reset the instance to its initial state, i.e. it deletes all branches and models. The dataset is also reset to its form after initialization.


method reset_aesthetics() [source]

Reset the plot aesthetics to their default values.


method save(filename="auto", save_data=True) [source]

Save the instance to a pickle file. Remember that the class contains the complete dataset as attribute, so the file can become large for big datasets! To avoid this, use save_data=False.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.

save_data: bool, optional (default=True)
Whether to save the data as an attribute of the instance. If False, remember to add the data to ATOMLoader when loading the file.


method save_data(filename="auto", dataset="dataset") [source]

Save the data in the current branch to a csv file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.

dataset: str, optional (default="dataset")
Data set to save.


method shrink(obj2cat=True, int2uint=False, dense2sparse=False, columns=None) [source]

Converts the columns to the smallest possible matching dtype. Examples are: float64 -> float32, int64 -> int8, etc... Sparse arrays also transform their non-fill value. Use this method for memory optimization. Note that applying transformers to the data may alter the types again.

Parameters:

obj2cat: bool, optional (default=True)
Whether to convert object to category. Only if the number of categories would be less than 30% of the length of the column.

int2uint: bool, optional (default=False)
Whether to convert int to uint (unsigned integer). Only if the values in the column are strictly positive.

dense2sparse: bool, optional (default=False)
Whether to convert all features to sparse format. The value that is compressed is the most frequent value in the column.

columns: int, str, slice, sequence or None, optional (default=None)
Names, indices or dtypes of the columns in the dataset to shrink. If None, transform all columns.
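
For example:

atom.shrink(obj2cat=True, int2uint=True)
print(atom.dtypes)  # Inspect the reduced dtypes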


method stacking(name="Stack", models=None, **kwargs) [source]

Add a Stacking model to the pipeline.

Parameters:

name: str, optional (default="Stack")
Name of the model. The name is always preceded by the model's acronym: Stack.

models: sequence or None, optional (default=None)
Models that feed the stacking estimator. If None, it selects all non-ensemble models trained on the current branch.

**kwargs
Additional keyword arguments for sklearn's StackingRegressor instance. The predefined model's acronyms can be used for the final_estimator parameter.
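
For example (a sketch; the model acronyms are hypothetical and should already be trained):

# Combine two trained models; the final estimator can be a predefined acronym
atom.stacking(models=["OLS", "BR"], final_estimator="RF")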


method stats() [source]

Print basic information about the dataset.


method status() [source]

Get an overview of the branches, models and errors in the instance. This method prints the same information as atom's __repr__ and also saves it to the logger.


method voting(name="Vote", models=None, **kwargs) [source]

Add a Voting model to the pipeline.

Parameters:

name: str, optional (default="Vote")
Name of the model. The name is always preceded by the model's acronym: Vote.

models: sequence or None, optional (default=None)
Models that feed the voting estimator. If None, it selects all non-ensemble models trained on the current branch.

**kwargs
Additional keyword arguments for sklearn's VotingRegressor instance.



Data cleaning

The class provides data cleaning methods to scale or transform the features and handle missing values, categorical columns and outliers. Calling one of them automatically applies the method on the dataset in the pipeline.

Tip

Use the report method to examine the data and help you determine suitable parameters for the data cleaning methods.

scale Scale the dataset.
normalize Transform the data to follow a Normal/Gaussian distribution.
clean Apply standard data cleaning steps on the dataset.
impute Handle missing values in the dataset.
discretize Bin continuous data into intervals.
encode Encode categorical features.
prune Prune outliers from the training set.


method scale(strategy="standard", **kwargs) [source]

Applies one of sklearn's scalers. Non-numerical columns are ignored. The estimator created by the class is attached to atom. See the Scaler class for a description of the parameters.


method normalize(strategy="yeojohnson", **kwargs) [source]

Transform the data to follow a Normal/Gaussian distribution. This transformation is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired. Missing values are disregarded in fit and maintained in transform. Categorical columns are ignored. The estimator created by the class is attached to atom. See the Normalizer class for a description of the parameters.


method clean(drop_types=None, strip_categorical=True, drop_max_cardinality=True, drop_min_cardinality=True, drop_duplicates=False, drop_missing_target=True) [source]

Applies standard data cleaning steps on the dataset. Use the parameters to choose which transformations to perform. The available steps are:

  • Drop columns with specific data types.
  • Strip categorical features from white spaces.
  • Drop categorical columns with maximal cardinality.
  • Drop columns with minimum cardinality.
  • Drop duplicate rows.
  • Drop rows with missing values in the target column.

See the Cleaner class for a description of the parameters.


method impute(strat_num="drop", strat_cat="drop", max_nan_rows=None, max_nan_cols=None, missing=None) [source]

Impute or remove missing values according to the selected strategy. Also removes rows and columns with too many missing values. The imputer is fitted only on the training set to avoid data leakage. Use the missing attribute to customize what are considered "missing values". See Imputer for a description of the parameters. Note that since the Imputer can remove rows from both the train and test set, the size of the sets may change after the transformation.
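
For example (a sketch; see Imputer for all accepted strategies):

atom.missing += ["unknown"]  # Also treat the string "unknown" as a missing value
atom.impute(strat_num="median", strat_cat="most_frequent", max_nan_cols=0.8)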


method discretize(strategy="quantile", bins=5, labels=None) [source]

Bin continuous data into intervals. For each feature, the bin edges are computed during fit and, together with the number of bins, they will define the intervals. Ignores categorical columns. See Discretizer for a description of the parameters.


method encode(strategy="LeaveOneOut", max_onehot=10, ordinal=None, frac_to_other=None) [source]

Perform encoding of categorical features. The encoding type depends on the number of unique values in the column:

  • If n_unique=2 or ordinal feature, use Label-encoding.
  • If 2 < n_unique <= max_onehot, use OneHot-encoding.
  • If n_unique > max_onehot, use `strategy`-encoding.

Missing values are propagated to the output column. Unknown classes encountered during transforming are converted to np.NaN. The class is also capable of replacing classes with low occurrences with the value other in order to prevent too high cardinality. See Encoder for a description of the parameters.

Note

This method only encodes the categorical features. It does not encode the target column! Use the clean method for that.




method prune(strategy="zscore", method="drop", max_sigma=3, include_target=False, **kwargs) [source]

Prune outliers from the training set. The definition of an outlier depends on the selected strategy and can greatly differ from one strategy to another. Ignores categorical columns. The estimators created by the class are attached to atom. See Pruner for a description of the parameters.

Note

This transformation is only applied to the training set in order to maintain the original distribution of samples in the test set.




NLP

The Natural Language Processing (NLP) transformers help to convert raw text to meaningful numeric values, ready to be ingested by a model.

textclean Apply standard text cleaning to the corpus.
tokenize Convert documents into sequences of words.
textnormalize Convert words to a more uniform standard.
vectorize Transform the corpus into meaningful vectors of numbers.


method textclean(decode=True, lower_case=True, drop_emails=True, regex_emails=None, drop_url=True, regex_url=None, drop_html=True, regex_html=None, drop_emojis=True, regex_emojis=None, drop_numbers=True, regex_numbers=None, drop_punctuation=True) [source]

Applies standard text cleaning to the corpus. Transformations include normalizing characters and dropping noise from the text (emails, HTML tags, URLs, etc...). The transformations are applied on the column named corpus, in the same order the parameters are presented. If there is no column with that name, an exception is raised. See the TextCleaner class for a description of the parameters.


method tokenize(bigram_freq=None, trigram_freq=None, quadgram_freq=None) [source]

Convert documents into sequences of words. Additionally, create n-grams (represented by words united with underscores, e.g. "New_York") based on their frequency in the corpus. The transformations are applied on the column named corpus. If there is no column with that name, an exception is raised. See the Tokenizer class for a description of the parameters.


method textnormalize(stopwords=True, custom_stopwords=None, stem=False, lemmatize=True) [source]

Convert words to a more uniform standard. The transformations are applied on the column named corpus, in the same order the parameters are presented. If there is no column with that name, an exception is raised. If the provided documents are strings, words are separated by spaces. See the TextNormalizer class for a description of the parameters.


method vectorize(strategy="bow", return_sparse=True, **kwargs) [source]

Transform the corpus into meaningful vectors of numbers. The transformation is applied on the column named corpus. If there is no column with that name, an exception is raised. The transformed columns are named after the word they are embedding (if the column is already present in the provided dataset, _[strategy] is added behind the name). See the Vectorizer class for a description of the parameters.


Feature engineering

To further pre-process the data, it's possible to extract features from datetime columns, create new non-linear features transforming the existing ones or, if the dataset is too large, remove features using one of the provided strategies.

feature_extraction Extract features from datetime columns.
feature_generation Create new features from combinations of existing ones.
feature_selection Remove features according to the selected strategy.


method feature_extraction(features=["day", "month", "year"], fmt=None, encoding_type="ordinal", drop_columns=True) [source]

Extract features (hour, day, month, year, etc.) from datetime columns. Columns of dtype datetime64 are used as is. Categorical columns that can be successfully converted to a datetime format (less than 30% NaT values after conversion) are also used. See the FeatureExtractor class for a description of the parameters.


method feature_generation(strategy="dfs", n_features=None, operators=None, **kwargs) [source]

Create new combinations of existing features to capture the non-linear relations between the original features. See FeatureGenerator for a description of the parameters. Attributes created by the class are attached to atom.


method feature_selection(strategy=None, solver=None, n_features=None, max_frac_repeated=1., max_correlation=1., **kwargs) [source]

Remove features according to the selected strategy. Ties between features with equal scores are broken in an unspecified way. Additionally, remove multicollinear and low variance features. See FeatureSelector for a description of the parameters. Plotting methods and attributes created by the class are attached to atom.

Note

  • When strategy="univariate" and solver=None, f_regression is used as default solver.
  • When the strategy requires a model and it's one of ATOM's predefined models, the algorithm automatically selects the regressor (no need to add _reg to the solver).
  • When strategy is not one of univariate or pca, and solver=None, atom uses the winning model (if it exists) as solver.
  • When strategy is sfs, rfecv or any of the advanced strategies and no scoring is specified, atom uses the metric in the pipeline (if it exists) as scoring parameter.
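
For example (a sketch; with strategy="univariate" and solver=None, f_regression is used as noted above):

# Keep the 10 features most related to the target and drop highly correlated ones
atom.feature_selection(strategy="univariate", n_features=10, max_correlation=0.95)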



Training

The training methods are where the models are fitted to the data and their performance is evaluated according to the selected metric. There are three methods to call the three different training approaches in ATOM. All relevant attributes and methods from the training classes are attached to atom for convenience. These include the errors, winner and results attributes, as well as the models, and the prediction and plotting methods.

run Fit the models to the data in a direct fashion.
successive_halving Fit the models to the data in a successive halving fashion.
train_sizing Fit the models to the data in a train sizing fashion.


method run(models=None, metric=None, greater_is_better=True, needs_proba=False, needs_threshold=False, n_calls=10, n_initial_points=5, est_params=None, bo_params=None, n_bootstrap=0) [source]

Fit and evaluate the models. The following steps are applied to every model:

  1. Hyperparameter tuning is performed using a Bayesian Optimization approach (optional).
  2. The model is fitted on the training set using the best combination of hyperparameters found.
  3. The model is evaluated on the test set.
  4. The model is trained on various bootstrapped samples of the training set and scored again on the test set (optional).

See DirectRegressor for a description of the parameters.


method successive_halving(models=None, metric=None, greater_is_better=True, needs_proba=False, needs_threshold=False, skip_runs=0, n_calls=0, n_initial_points=5, est_params=None, bo_params=None, n_bootstrap=0) [source]

Fit and evaluate the models in a successive halving fashion. The following steps are applied to every model (per iteration):

  1. Hyperparameter tuning is performed using a Bayesian Optimization approach (optional).
  2. The model is fitted on the training set using the best combination of hyperparameters found.
  3. The model is evaluated on the test set.
  4. The model is trained on various bootstrapped samples of the training set and scored again on the test set (optional).

See SuccessiveHalvingRegressor for a description of the parameters.


method train_sizing(models=None, metric=None, greater_is_better=True, needs_proba=False, needs_threshold=False, train_sizes=5, n_calls=0, n_initial_points=5, est_params=None, bo_params=None, n_bootstrap=0) [source]

Fit and evaluate the models in a train sizing fashion. The following steps are applied to every model (per iteration):

  1. Hyperparameter tuning is performed using a Bayesian Optimization approach (optional).
  2. The model is fitted on the training set using the best combination of hyperparameters found.
  3. The model is evaluated on the test set.
  4. The model is trained on various bootstrapped samples of the training set and scored again on the test set (optional).

See TrainSizingRegressor for a description of the parameters.


Example

from atom import ATOMRegressor

# Initialize atom
atom = ATOMRegressor(X, y, logger="auto", n_jobs=2, verbose=2)

# Apply data cleaning methods
atom.prune(strategy="iforest", include_target=True)

# Fit the models to the data
atom.run(
    models=["OLS", "BR", "CatB"],
    metric="MSE",
    n_calls=25,
    n_initial_points=10,
    n_bootstrap=4,
)

# Analyze the results
atom.plot_errors(figsize=(9, 6), filename="errors.png")  
atom.catb.plot_feature_importance(filename="catboost_feature_importance.png")

# Get the predictions for the best model on new data
pred = atom.predict(X_new)