ATOMClassifier

class atom.api.ATOMClassifier(*arrays, y=-1, index=False, metadata=None, ignore=None, shuffle=True, stratify=-1, n_rows=1, test_size=0.2, holdout_size=None, n_jobs=1, device="cpu", engine=None, backend="loky", memory=False, verbose=0, warnings=False, logger=None, experiment=None, random_state=None)[source]

Main class for classification tasks.

Apply all data transformations and model management provided by the package on a given dataset. Note that, contrary to sklearn's API, the instance contains the dataset on which to perform the analysis. Calling a method will automatically apply it on the dataset it contains.

All data cleaning, feature engineering, model training and plotting functionality can be accessed from an instance of this class.

Parameters

*arrays: sequence of indexables

Dataset containing features and target. Allowed formats are:

X
X, y
train, test
train, test, holdout
X_train, X_test, y_train, y_test
X_train, X_test, X_holdout, y_train, y_test, y_holdout
(X_train, y_train), (X_test, y_test)
(X_train, y_train), (X_test, y_test), (X_holdout, y_holdout)

X, train, test: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or dataframe-like
Target column(s) corresponding to X.

If int: Position of the target column in X.
If str: Name of the target column in X.
If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
If dataframe-like: Target columns for multioutput tasks.

y: int, str, sequence or dataframe-like, default=-1

Target column(s) corresponding to X.

If int: Position of the target column in X.
If str: Name of the target column in X.
If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
If dataframe-like: Target columns for multioutput tasks.

This parameter is ignored if the target column is provided through arrays.

index: bool, int, str or sequence, default=False

Handle the index in the resulting dataframe.

If False: Reset to RangeIndex.
If True: Use the provided index.
If int: Position of the column to use as index.
If str: Name of the column to use as index.
If sequence: Array with shape=(n_samples,) to use as index.

metadata: dict or None, default=None

Metadata to route to estimators, scorers, and CV splitters. If None, no metadata is used. If dict, the available keys are:

groups: sequence of shape=(n_samples,)
Group labels for the samples used while splitting the dataset into train and test sets.

sample_weight: sequence of shape=(n_samples,)
Individual weights for each sample.

ignore: int, str, sequence or None, default=None

Features in X to ignore during data transformations and model training. The features are still used in the remaining methods.

test_size: int or float, default=0.2

If <=1: Fraction of the dataset to include in the test set.
If >1: Number of rows to include in the test set.

This parameter is ignored if the test set is provided through arrays.

If 'groups' is provided in the metadata parameter, test_size represents the proportion of groups to include in the test split or the absolute number of test groups.
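As a rough illustration of the fraction-versus-count rule above, the sketch below converts a test_size value into a number of rows. The helper name resolve_size is hypothetical, and truncating fractional results is an assumption rather than documented behavior. It is at least consistent with the example further down, where 569 rows with test_size=0.2 give a test set of 113 rows.

```python
def resolve_size(size: float, n_rows: int) -> int:
    """Translate a test_size-like value into a number of rows.

    Hypothetical helper: values <=1 are read as a fraction of the
    dataset, values >1 as an absolute number of rows.
    """
    if size <= 0 or size >= n_rows:
        raise ValueError(f"size must lie in (0, {n_rows}), got {size}.")
    if size <= 1:
        # Assumption: fractional results are truncated, not rounded.
        return int(size * n_rows)
    return int(size)


print(resolve_size(0.2, 569))  # fraction of the dataset
print(resolve_size(100, 569))  # absolute number of rows
```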

holdout_size: int, float or None, default=None

If None: No holdout data set is kept apart.
If <=1: Fraction of the dataset to include in the holdout set.
If >1: Number of rows to include in the holdout set.

This parameter is ignored if the holdout set is provided through arrays.

shuffle: bool, default=True

Whether to shuffle the dataset before splitting it into the separate data sets.

stratify: int, str or None, default=-1

Handle stratification of the target classes over the data sets.

If None: No stratification is applied.
If int: Position of the column to use for stratification.
If str: Name of the column to use for stratification.

The stratification column can't contain NaN values.

This parameter is ignored if shuffle=False or if the test set is provided through arrays.

n_rows: int or float, default=1

Random subsample of the dataset to use. The default value selects all rows.

If <=1: Fraction of the dataset to select.
If >1: Exact number of rows to select. Only if arrays is X or X, y.

n_jobs: int, default=1

Number of cores to use for parallel processing.

If >0: Number of cores to use.
If -1: Use all available cores.
If <-1: Use number of cores + 1 + n_jobs (e.g., -2 uses all cores but one).

device: str, default="cpu"

Device on which to run the estimators. Use any string that follows the SYCL_DEVICE_FILTER filter selector, e.g. device="gpu" to use the GPU. Read more in the user guide.

engine: str, dict or None, default=None

Execution engine to use for data and estimators. The value should be one of the possible values to change one of the two engines, or a dictionary with keys data and estimator, with their corresponding choice as values to change both engines. If None, the default values are used. Choose from:

"data":
- "numpy"
- "pandas" (default)
- "pandas-pyarrow"
- "polars"
- "polars-lazy"
- "pyarrow"
- "modin"
- "dask"
- "pyspark"
- "pyspark-pandas"
"estimator":
- "sklearn" (default)
- "sklearnex"
- "cuml"
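The two accepted forms of the engine parameter (a single string or a dictionary) can be pictured as being normalized to one dictionary. The sketch below is only an illustration of that idea: normalize_engine is a hypothetical helper, not part of the package, and the choices are copied from the list above.

```python
DEFAULTS = {"data": "pandas", "estimator": "sklearn"}
CHOICES = {
    "data": {
        "numpy", "pandas", "pandas-pyarrow", "polars", "polars-lazy",
        "pyarrow", "modin", "dask", "pyspark", "pyspark-pandas",
    },
    "estimator": {"sklearn", "sklearnex", "cuml"},
}


def normalize_engine(engine):
    """Return a dict with both engines filled in (hypothetical helper)."""
    resolved = dict(DEFAULTS)
    if engine is None:
        return resolved
    if isinstance(engine, str):
        # A single string changes whichever engine lists that value.
        for kind, options in CHOICES.items():
            if engine in options:
                resolved[kind] = engine
                return resolved
        raise ValueError(f"Unknown engine: {engine!r}")
    for kind, choice in engine.items():
        if kind not in CHOICES or choice not in CHOICES[kind]:
            raise ValueError(f"Invalid engine entry: {kind!r}={choice!r}")
        resolved[kind] = choice
    return resolved


print(normalize_engine("polars"))
print(normalize_engine({"data": "pyarrow", "estimator": "sklearnex"}))
```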

backend: str, default="loky"

Parallelization backend. Read more in the user guide. Choose from:

"loky": Single-node, process-based parallelism.
"multiprocessing": Legacy single-node, process-based parallelism. Less robust than loky.
"threading": Single-node, thread-based parallelism.
"ray": Multi-node, process-based parallelism.
"dask": Multi-node, process-based parallelism.

memory: bool, str, Path or Memory, default=False

Enables caching for memory optimization. Read more in the user guide.

If False: No caching is performed.
If True: A default temp directory is used.
If str: Path to the caching directory.
If Path: A pathlib.Path to the caching directory.
If Memory: Object with the joblib.Memory interface.

verbose: int, default=0

Verbosity level of the class. Choose from:

0 to not print anything.
1 to print basic information.
2 to print detailed information.

warnings: bool or str, default=False

If True: Default warning action (equal to "once").
If False: Suppress all warnings (equal to "ignore").
If str: One of python's warnings filters.

Changing this parameter affects the PYTHONWARNINGS environment variable. ATOM can't manage warnings that go from C/C++ code to stdout.
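A minimal sketch of how the three options could map onto Python's standard warnings machinery is shown below. The helper names are hypothetical and this is not the package's own code. It only uses the standard library's warnings.simplefilter and the PYTHONWARNINGS environment variable, and it validates against the six long-standing filter actions.

```python
import os
import warnings

VALID_ACTIONS = {"default", "error", "ignore", "always", "module", "once"}


def resolve_warnings(value):
    """Map the documented bool/str options to a warnings filter action."""
    if value is True:
        return "once"
    if value is False:
        return "ignore"
    if value not in VALID_ACTIONS:
        raise ValueError(f"Unknown warnings action: {value!r}")
    return value


def apply_warnings(value):
    """Hypothetical helper: set the filter here and for child processes."""
    action = resolve_warnings(value)
    warnings.simplefilter(action)
    # Child processes read this variable when they start.
    os.environ["PYTHONWARNINGS"] = action
    return action


print(apply_warnings(False))
```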

logger: str, Logger or None, default=None

If None: Logging isn't used.
If str: Name of the log file. Use "auto" for automatic name.
If Path: A pathlib.Path to the log file.
Else: Python logging.Logger instance.

experiment: str or None, default=None

Name of the mlflow experiment to use for tracking. If None, no mlflow tracking is performed.

random_state: int or None, default=None

Seed used by the random number generator. If None, the random number generator is the RandomState used by np.random.

Example

>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)

>>> # Initialize atom
>>> atom = ATOMClassifier(X, y, verbose=2)

<< ================== ATOM ================== >>

Configuration ==================== >>
Algorithm task: Binary classification.

Dataset stats ==================== >>
Shape: (569, 31)
Train set size: 456
Test set size: 113
-------------------------------------
Memory: 138.97 kB
Scaled: False
Outlier values: 185 (1.3%)


>>> # Apply data cleaning and feature engineering methods
>>> atom.balance(strategy="smote")

Oversampling with SMOTE...
 --> Adding 116 samples to class 0.
>>> atom.feature_selection(strategy="rfe", solver="lr", n_features=22)

Fitting FeatureSelector...
Performing feature selection ...
 --> rfe selected 22 features from the dataset.
   --> Dropping feature mean area (rank 4).
   --> Dropping feature mean compactness (rank 3).
   --> Dropping feature mean fractal dimension (rank 7).
   --> Dropping feature smoothness error (rank 9).
   --> Dropping feature concave points error (rank 5).
   --> Dropping feature symmetry error (rank 2).
   --> Dropping feature fractal dimension error (rank 8).
   --> Dropping feature worst area (rank 6).

>>> # Train models
>>> atom.run(models=["LR", "RF", "XGB"])


Training ========================= >>
Models: LR, RF, XGB
Metric: f1


Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.9861
Test evaluation --> f1: 0.971
Time elapsed: 0.188s
-------------------------------------------------
Time: 0.188s


Results for RandomForest:
Fit ---------------------------------------------
Train evaluation --> f1: 1.0
Test evaluation --> f1: 0.971
Time elapsed: 0.180s
-------------------------------------------------
Time: 0.180s


Results for XGBoost:
Fit ---------------------------------------------
Train evaluation --> f1: 1.0
Test evaluation --> f1: 0.971
Time elapsed: 0.530s
-------------------------------------------------
Time: 0.530s


Final results ==================== >>
Total time: 0.908s
-------------------------------------
LogisticRegression --> f1: 0.971 !
RandomForest       --> f1: 0.971 !
XGBoost            --> f1: 0.971 !

>>> # Analyze the results
>>> atom.results

	f1_train	f1_test	time_fit	time
LR	0.986100	0.971000	0.188171	0.188171
RF	1.000000	0.971000	0.180164	0.180164
XGB	1.000000	0.971000	0.530483	0.530483

Magic methods

The class contains some magic methods to help you access some of its elements faster. Note that methods that apply on the pipeline can return different results per branch.

__repr__: Prints an overview of atom's branches, models, and metrics.
__len__: Returns the length of the dataset.
__iter__: Iterate over the pipeline's transformers.
__contains__: Checks if the provided item is a column in the dataset.
__getitem__: Access a branch, model, column or subset of the dataset.

Attributes

Data attributes

The data attributes are used to access the dataset and its properties. Updating the dataset will automatically update the response of these attributes accordingly.

Attributes

pipeline: Pipeline

Pipeline of transformers.

Tip

Use the plot_pipeline method to visualize the pipeline.

mapping: dict[str, dict[str, int | float]]

Encoded values and their respective mapped values.

The column name is the key to its mapping dictionary. Only for columns mapped to a single column (e.g., Ordinal, Leave-one-out).

dataset: pd.DataFrame

Complete data set.

train: pd.DataFrame

Training set.

test: pd.DataFrame

Test set.

X: pd.DataFrame

Feature set.

y: pd.Series | pd.DataFrame

Target column(s).

holdout: pd.DataFrame | None

Holdout set.

This data set is untransformed by the pipeline. Read more in the user guide.

X_train: pd.DataFrame

Features of the training set.

y_train: pd.Series | pd.DataFrame

Target column(s) of the training set.

X_test: pd.DataFrame

Features of the test set.

y_test: pd.Series | pd.DataFrame

Target column(s) of the test set.

shape: tuple[int, int]

Shape of the dataset (n_rows, n_columns).

columns: pd.Index

Name of all the columns.

n_columns: int

Number of columns.

features: pd.Index

Name of the features.

n_features: int

Number of features.

target: str | list[str]

Name of the target column(s).

scaled: bool

Whether the feature set is scaled.

A data set is considered scaled when it has mean~0 and std~1, or when there is a scaler in the pipeline. Categorical and binary columns (only zeros and ones) are excluded from the calculation. Sparse datasets always return False.
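The check described above can be sketched with the standard library alone. The tolerances and the choice to average the per-column statistics are assumptions made for illustration, the function name looks_scaled is hypothetical, and the scaler-in-pipeline and sparse cases are left out.

```python
from statistics import fmean, stdev


def looks_scaled(columns, tol_mean=0.05, tol_std=0.15):
    """Rough mean~0 / std~1 check over numerical columns.

    Hypothetical helper. `columns` maps names to lists of numbers.
    The tolerances are illustrative assumptions.
    """
    # Binary columns (only zeros and ones) are excluded.
    usable = [v for v in columns.values() if set(v) - {0, 1}]
    if not usable:
        return False
    mean = fmean([fmean(v) for v in usable])
    std = fmean([stdev(v) for v in usable])
    return abs(mean) < tol_mean and abs(std - 1) < tol_std


raw = {"size": [1.0, 2.0, 3.0, 4.0, 5.0], "flag": [0, 1, 0, 1, 1]}
mu, sigma = fmean(raw["size"]), stdev(raw["size"])
scaled = {"size": [(x - mu) / sigma for x in raw["size"]], "flag": raw["flag"]}

print(looks_scaled(raw), looks_scaled(scaled))
```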

duplicates: int

Number of duplicate rows in the dataset.

nans: pd.Series

Columns with the number of missing values in them.

This property is unavailable for sparse datasets.

n_nans: int

Number of rows containing missing values.

This property is unavailable for sparse datasets.

numerical: pd.Index

Names of the numerical features in the dataset.

n_numerical: int

Number of numerical features in the dataset.

categorical: pd.Index

Names of the categorical features in the dataset.

n_categorical: int

Number of categorical features in the dataset.

outliers: pd.Series

Columns in training set with number of outlier values.

This property is unavailable for sparse datasets.

n_outliers: int

Number of samples in the training set containing outliers.

This property is unavailable for sparse datasets.

classes: pd.DataFrame

Distribution of target classes per data set.

This property is only available for classification tasks.

n_classes: int | pd.Series

Number of classes in the target column(s).

This property is only available for classification tasks.

Utility attributes

The utility attributes are used to access information about the models in the instance after training.

Attributes

pos_label: bool | int | float | str

Positive label for binary/multilabel classification tasks.

metadata: Bunch

Metadata of the dataset.

Tracking attributes

The tracking attributes are used to customize what elements of the experiment are tracked. Read more in the user guide.

Attributes

log_ht: bool

Whether to track every trial of the hyperparameter tuning.

log_plots: bool

Whether to save plots as artifacts.

log_data: bool

Whether to save the train and test sets.

log_pipeline: bool

Whether to save the model's pipeline.

Plot attributes

The plot attributes are used to customize the plot's aesthetics. Read more in the user guide.

Attributes

palette: str | Sequence[str]

Color palette.

Specify one of plotly's built-in palettes or create a custom one, e.g., atom.palette = ["red", "green", "blue"].

title_fontsize: int | float

Fontsize for the plot's title.

label_fontsize: int | float

Fontsize for the labels, legend and hover information.

tick_fontsize: int | float

Fontsize for the ticks along the plot's axes.

line_width: int | float

Width of the line plots.

marker_size: int | float

Size of the markers.

Utility methods

Next to the plotting methods, the class contains a variety of utility methods to handle the data and manage the pipeline.

add	Add a transformer to the pipeline.
apply	Apply a function to the dataset.
available_models	Give an overview of the available predefined models.
canvas	Create a figure with multiple plots.
clear	Reset attributes and clear cache from all models.
delete	Delete models.
distributions	Get statistics on column distributions.
eda	Create an Exploratory Data Analysis report.
evaluate	Get all models' scores for the provided metrics.
export_pipeline	Export the internal pipeline.
get_class_weight	Return class weights for a balanced data set.
get_sample_weight	Return sample weights for a balanced data set.
inverse_transform	Inversely transform new data through the pipeline.
load	Load an atom instance from a pickle file.
merge	Merge another instance of the same class into this one.
update_layout	Update the properties of the plot's layout.
update_traces	Update the properties of the plot's traces.
reset	Reset the instance to its initial state.
reset_aesthetics	Reset the plot aesthetics to their default values.
save	Save the instance to a pickle file.
save_data	Save the data in the current branch to a .csv file.
shrink	Convert the columns to the smallest possible matching dtype.
stacking	Add a Stacking model to the pipeline.
stats	Display basic information about the dataset.
status	Get an overview of the branches and models.
transform	Transform new data through the pipeline.
voting	Add a Voting model to the pipeline.

method add(transformer, columns=None, train_only=False, feature_names_out=None, **fit_params)[source]

Add a transformer to the pipeline.

If the transformer is not fitted, it is fitted on the complete training set. Afterward, the data set is transformed and the estimator is added to atom's pipeline. If the estimator is a sklearn Pipeline, every estimator is merged independently with atom.

Warning

The transformer should have fit and/or transform methods with arguments X (accepting a dataframe-like object of shape=(n_samples, n_features)) and/or y (accepting a sequence of shape=(n_samples,)).
The transform method should return a feature set as a dataframe-like object of shape=(n_samples, n_features) and/or a target column as a sequence of shape=(n_samples,).

Note

If the transform method doesn't return a dataframe:

The column naming happens as follows. If the transformer has a get_feature_names_out method, it is used. If not, and it returns the same number of columns, the names are kept equal. If the number of columns changes, old columns will keep their name (as long as the column is unchanged) and new columns will receive the name x[N-1], where N stands for the n-th feature. This means that a transformer should only transform, add or drop columns, not combinations of these.
The index remains the same as before the transformation. This means that the transformer should not add, remove or shuffle rows unless it returns a dataframe.

Parameters

transformer: Transformer

Estimator to add to the pipeline. Should implement a transform method. If a class is provided (instead of an instance), and it has the n_jobs and/or random_state parameters, it adopts atom's values.

columns: int, str, segment, sequence, dataframe or None, default=None

Selection of columns to transform. Only select features or the target column, not both at the same time (if that happens, the target column is ignored). If None, transform all columns.

train_only: bool, default=False

Whether to apply the estimator only on the training set or on the complete dataset. Note that if True, the transformation is skipped when making predictions on new data.

feature_names_out: "one-to-one", callable or None, default=None

Determines the list of feature names that will be returned by the get_feature_names_out method.

If None: The get_feature_names_out method is not defined.
If "one-to-one": The output feature names will be equal to the input feature names.
If callable: Function that takes positional arguments self and a sequence of input feature names. It must return a sequence of output feature names.
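For the callable option, a function of the following shape would satisfy the description. The name log_feature_names and the "log_" prefix are illustrative only. The first positional argument receives the transformer itself and is unused here, so the demonstration passes None in its place.

```python
def log_feature_names(self, input_features):
    """Hypothetical callable: prefix every input feature name with 'log_'."""
    return [f"log_{name}" for name in input_features]


# `self` would be the transformer; None stands in for it in this demo.
names = log_feature_names(None, ["mean radius", "mean texture"])
print(names)
```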

**fit_params

Additional keyword arguments for the transformer's fit method.

method apply(func, inverse_func=None, feature_names_out=None, kw_args=None, inv_kw_args=None, **kwargs)[source]

Apply a function to the dataset.

This method is useful for stateless transformations such as taking the log, doing custom scaling, etc...

Note

This approach is preferred over changing the dataset directly through the property's @setter since the transformation is stored in the pipeline.

Tip

Use atom.apply(lambda df: df.drop("column_name", axis=1)) to store the removal of columns in the pipeline.

Parameters

func: callable

Function to apply with signature func(dataframe, **kw_args) -> dataframe-like.

inverse_func: callable or None, default=None

Inverse function of func. If None, the inverse_transform method returns the input unchanged.

feature_names_out: "one-to-one", callable or None, default=None

Determines the list of feature names that will be returned by the get_feature_names_out method.

If None: The get_feature_names_out method is not defined.
If "one-to-one": The output feature names will be equal to the input feature names.
If callable: Function that takes positional arguments self and a sequence of input feature names. It must return a sequence of output feature names.

kw_args: dict or None, default=None

Additional keyword arguments for the function.

inv_kw_args: dict or None, default=None

Additional keyword arguments for the inverse function.

method available_models(**kwargs)[source]

Give an overview of the available predefined models.

Parameters

**kwargs

Filter the returned models by providing any of the columns as keyword arguments, where the value is the desired filter, e.g., accepts_sparse=True to get all models that accept sparse input, or supports_engines="cuml" to get all models that support the cuML engine.

Returns

pd.DataFrame

Tags of the available predefined models. The columns depend on the task, but can include:

acronym: Model's acronym (used to call the model).
fullname: Name of the model's class.
estimator: Name of the model's underlying estimator.
module: The estimator's module.
handles_missing: Whether the model can handle missing values without preprocessing. If False, consider using the Imputer class before training the models.
needs_scaling: Whether the model requires feature scaling. If True, automated feature scaling is applied.
accepts_sparse: Whether the model accepts sparse input.
uses_exogenous: Whether the model uses exogenous variables.
multiple_seasonality: Whether the model can handle more than one seasonality period.
native_multilabel: Whether the model has native support for multilabel tasks.
native_multioutput: Whether the model has native support for multioutput tasks.
validation: Whether the model has in-training validation.
supports_engines: Engines supported by the model.

method canvas(rows=1, cols=2, sharex=False, sharey=False, hspace=0.05, vspace=0.07, title=None, legend="out", figsize=None, filename=None, display=True)[source]

Create a figure with multiple plots.

This @contextmanager allows you to draw many plots in one figure. The default option is to add two plots side by side. See the user guide for an example.

Parameters

rows: int, default=1

Number of plots in length.

cols: int, default=2

Number of plots in width.

sharex: bool, default=False

If True, hide the label and ticks from non-border subplots on the x-axis.

sharey: bool, default=False

If True, hide the label and ticks from non-border subplots on the y-axis.

hspace: float, default=0.05

Space between subplot rows in normalized plot coordinates. The spacing is relative to the figure's size.

vspace: float, default=0.07

Space between subplot cols in normalized plot coordinates. The spacing is relative to the figure's size.

title: str, dict or None, default=None

Title for the plot.

If None, no title is shown.
If str, text for the title.
If dict, title configuration.

legend: bool, str or dict, default="out"

Legend for the plot. See the user guide for an extended description of the choices.

If None: No legend is shown.
If str: Position to display the legend.
If dict: Legend configuration.

figsize: tuple or None, default=None

Figure's size in pixels, format as (x, y). If None, it adapts the size to the number of plots in the canvas.

filename: str, Path or None, default=None

Save the plot using this name. Use "auto" for automatic naming. The type of the file depends on the provided name (.html, .png, .pdf, etc...). If filename has no file type, the plot is saved as html. If None, the plot is not saved.

display: bool, default=True

Whether to render the plot.

Yields

go.Figure

Plot object.

method clear()[source]

Reset attributes and clear cache from all models.

Reset certain model attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the instance. The affected attributes are:

method delete(models=None)[source]

Delete models.

If all models are removed, the metric is reset. Use this method to drop unwanted models or to free some memory before saving. Deleted models are not removed from any active mlflow experiment.

Parameters

models: int, str, Model, segment, sequence or None, default=None

Models to delete. If None, all models are deleted.

method distributions(distributions=None, columns=None)[source]

Get statistics on column distributions.

Compute the Kolmogorov-Smirnov test for various distributions against columns in the dataset. Only for numerical columns. Missing values are ignored.

Tip

Use the plot_distribution method to plot a column's distribution.

Parameters

distributions: str, sequence or None, default=None

Names of the distributions in scipy.stats to get the statistics on. If None, a selection of the most common ones is used.

columns: int, str, segment, sequence, dataframe or None, default=None

Selection of columns on which to perform the test. If None, select all numerical columns.

Returns

pd.DataFrame

Statistic results with multiindex levels:

dist: Name of the distribution.
stat: Statistic results:
- score: KS-test score.
- p_value: Corresponding p-value.
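To make the KS-test score more concrete, the sketch below computes the one-sample Kolmogorov-Smirnov statistic of a column against a normal distribution using only the standard library. It is an illustration, not the method's implementation, which works with the distributions in scipy.stats. Estimating the distribution's parameters from the sample is an assumption, the function name is hypothetical, and no p-value is computed.

```python
from statistics import NormalDist, fmean, stdev


def ks_statistic_normal(values):
    """KS statistic of `values` against a normal fitted to the sample."""
    # Missing values (None or NaN) are ignored.
    clean = sorted(v for v in values if v is not None and v == v)
    n = len(clean)
    # Assumption: the normal's parameters are estimated from the data.
    dist = NormalDist(fmean(clean), stdev(clean))
    d = 0.0
    for i, x in enumerate(clean, start=1):
        cdf = dist.cdf(x)
        # Largest gap between the empirical and the theoretical CDF.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


near_normal = [NormalDist().inv_cdf((i - 0.5) / 50) for i in range(1, 51)]
skewed = [2.0**k for k in range(20)]

print(ks_statistic_normal(near_normal))
print(ks_statistic_normal(skewed))
```

A lower score means the column follows the candidate distribution more closely, so the near-normal sample scores well below the heavily skewed one.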

method eda(rows="dataset", target=0, filename=None)[source]

Create an Exploratory Data Analysis report.

ATOM uses the sweetviz package for EDA. The report is rendered directly in the notebook. It can also be accessed through the report attribute. It can either report one dataset or compare two datasets against each other.

Warning

This method can be slow for large datasets.

Parameters

rows: str, sequence or dict, default="dataset"

Selection of rows on which to create the report.

If str: Name of the data set to report.
If sequence: Names of two data sets to compare.
If dict: Names of up to two data sets with corresponding selection of rows to report.

target: int or str, default=0

Target column to look at. Only for multilabel tasks. Only bool and numerical features can be used as target.

filename: str, Path or None, default=None

Filename or pathlib.Path of the (html) file to save. If None, don't save anything.

method evaluate(metric=None, rows="test")[source]

Get all models' scores for the provided metrics.

Tip

This method returns a pandas' Styler object. Convert the result back to a regular dataframe using its data attribute.

Parameters

metric: str, func, scorer, sequence or None, default=None

Metric to calculate. If None, it returns an overview of the most common metrics per task.

rows: hashable, segment, sequence or dataframe, default="test"

Selection of rows to calculate metric on.

Returns

Styler

Scores of the models.

method export_pipeline(model=None)[source]

Export the internal pipeline.

This method returns a deepcopy of the branch's pipeline. Optionally, you can add a model as final estimator. The returned pipeline is already fitted on the training set.

Parameters

model: str, Model or None, default=None

Model for which to export the pipeline. If the model used automated feature scaling, the Scaler is added to the pipeline. If None, the pipeline in the current branch is exported (without any model).

Returns

Pipeline

Current branch as a sklearn-like Pipeline object.

method get_class_weight(rows="train")[source]

Return class weights for a balanced data set.

Statistically, the class weights re-balance the data set so that the sampled data set represents the target population as closely as possible. The returned weights are inversely proportional to the class frequencies in the selected rows.

Parameters

rows: hashable, segment, sequence or dataframe, default="train"

Selection of rows for which to get the weights.

Returns

dict

Classes with the corresponding weights. A dict of dicts is returned for multioutput tasks.
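The inverse-proportionality can be illustrated with a few lines of plain Python. The normalization used here (the most frequent class gets weight 1) and the rounding to three decimals are assumptions, and class_weights is a hypothetical stand-in for the method, not its implementation.

```python
from collections import Counter


def class_weights(y):
    """Weights inversely proportional to the class frequencies in `y`."""
    counts = Counter(y)
    # Assumption: scale so that the most frequent class has weight 1.
    largest = max(counts.values())
    return {label: round(largest / n, 3) for label, n in counts.items()}


y_train = [0] * 3 + [1] * 6
weights = class_weights(y_train)
print(weights)  # {0: 2.0, 1: 1.0}
```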

method get_sample_weight(rows="train")[source]

Return sample weights for a balanced data set.

The returned weights are inversely proportional to the class frequencies in the selected data set. For multioutput tasks, the weights of each column of y will be multiplied.

Parameters

rows: hashable, segment, sequence or dataframe, default="train"

Selection of rows for which to get the weights.

Returns

pd.Series

Sequence of weights with shape=(n_samples,).
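The multiplication of weights across target columns can be sketched as follows. As before, this is an illustration under assumptions: the per-column normalization (most frequent class gets weight 1) is not taken from the source, and sample_weights is a hypothetical helper that takes one sequence per target column.

```python
from collections import Counter
from math import prod


def sample_weights(*targets):
    """Per-sample weights, multiplied over all target columns."""
    per_column = []
    for y in targets:
        counts = Counter(y)
        # Assumption: the most frequent class in a column gets weight 1.
        largest = max(counts.values())
        per_column.append([largest / counts[label] for label in y])
    # One product per sample, taken across the target columns.
    return [prod(row) for row in zip(*per_column)]


y1 = [0, 0, 0, 1]
y2 = ["a", "b", "b", "b"]
print(sample_weights(y1))      # single target column
print(sample_weights(y1, y2))  # two target columns
```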

method inverse_transform(X=None, y=None, verbose=None)[source]

Inversely transform new data through the pipeline.

Transformers that are only applied on the training set are skipped. The rest should all implement an inverse_transform method. If only X or only y is provided, it ignores transformers that require the other parameter. This can be used to transform only the target column.

Parameters

X: dataframe-like or None, default=None

Transformed feature set with shape=(n_samples, n_features).

If None, X is ignored in the transformers.

y: int, str, sequence, dataframe-like or None, default=None

Transformed target column corresponding to X.

If None: y is ignored.
If int: Position of the target column in X.
If str: Name of the target column in X.
If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
If dataframe-like: Target columns for multioutput tasks.

verbose: int or None, default=None

Verbosity level for the transformers in the pipeline. If None, it uses the pipeline's verbosity.

Returns

dataframe

Original feature set. Only returned if provided.

series or dataframe

Original target column. Only returned if provided.

classmethod atom.atom.load(filename, data=None)[source]

Load an atom instance from a pickle file.

If the instance was saved using save_data=False, it's possible to load new data into it and apply all data transformations.

Info

The loaded instance's current branch is the same branch as it was when saved.

Parameters

filename: str or Path

Filename or pathlib.Path of the pickle file.

data: tuple of indexables or None, default=None

Original dataset as it was provided to the instance's constructor. Only use this parameter if the loaded file was saved using save_data=False. Allowed formats are:

X
X, y
train, test
train, test, holdout
X_train, X_test, y_train, y_test
X_train, X_test, X_holdout, y_train, y_test, y_holdout
(X_train, y_train), (X_test, y_test)
(X_train, y_train), (X_test, y_test), (X_holdout, y_holdout)

X, train, test: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or dataframe
Target column(s) corresponding to X.

If int: Position of the target column in X.
If str: Name of the target column in X.
If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
If dataframe: Target columns for multioutput tasks.

Returns

atom

Unpickled atom instance.

method merge(other, suffix="2")[source]

Merge another instance of the same class into this one.

Branches, models, metrics and attributes of the other instance are merged into this one. If there are branches and/or models with the same name, they are merged adding the suffix parameter to their name. The errors and missing attributes are extended with those of the other instance. It's only possible to merge two instances if they are initialized with the same dataset and trained with the same metric.

Parameters

other: Runner

Instance with which to merge. Should be of the same class as self.

suffix: str, default="2"

Branches and models with conflicting names are merged adding suffix to the end of their names.
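The effect of suffix on conflicting names can be pictured with plain lists of model names. This sketch assumes that the incoming (other) entry is the one that gets renamed. The source does not state which side receives the suffix, and merge_names is a hypothetical helper.

```python
def merge_names(own, other, suffix="2"):
    """Combine two lists of names, suffixing incoming duplicates."""
    merged = list(own)
    for name in other:
        # Assumption: the entry from `other` is renamed on a conflict.
        merged.append(f"{name}{suffix}" if name in own else name)
    return merged


print(merge_names(["LR", "RF"], ["RF", "XGB"]))  # ['LR', 'RF', 'RF2', 'XGB']
```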

classmethod atom.plots.baseplot.update_layout(**kwargs)[source]

Update the properties of the plot's layout.

Recursively update the structure of the original layout with the values in the arguments.

Parameters

**kwargs

Keyword arguments for the figure's update_layout method.

classmethod atom.plots.baseplot.update_traces(**kwargs)[source]

Update the properties of the plot's traces.

Recursively update the structure of the original traces with the values in the arguments.

Parameters

**kwargs

Keyword arguments for the figure's update_traces method.

method reset(hard=False)[source]

Reset the instance to its initial state.

Deletes all branches and models. The dataset is also reset to its form after initialization.

Parameters

hard: bool, default=False

If True, completely flushes the cache.

classmethod atom.plots.baseplot.reset_aesthetics()[source]

Reset the plot aesthetics to their default values.

method save(filename="auto", save_data=True)[source]

Save the instance to a pickle file.

Parameters

filename: str or Path, default="auto"

Filename or pathlib.Path of the file to save. Use "auto" for automatic naming.

save_data: bool, default=True

Whether to save the dataset with the instance. This parameter is ignored if the method is not called from atom. If False, add the data to the load method to reload the instance.

method save_data(filename="auto", rows="dataset", **kwargs)[source]

Save the data in the current branch to a .csv file.

Parameters

filename: str or Path, default="auto"

Filename or pathlib.Path of the file to save. Use "auto" for automatic naming.

rows: hashable, segment, sequence or dataframe, default="dataset"

Selection of rows to save.

**kwargs

Additional keyword arguments for pandas' to_csv method.

method shrink(int2bool=False, int2uint=False, str2cat=False, dense2sparse=False, columns=None)[source]

Convert the columns to the smallest possible matching dtype.

Examples are: float64 -> float32, int64 -> int8, etc... Sparse arrays also transform their non-fill value. Use this method for memory optimization before saving the dataset. Note that applying transformers to the data may alter the types again.

Parameters

int2bool: bool, default=False

Whether to convert int columns to bool type. Only if the values in the column are strictly in (0, 1) or (-1, 1).

int2uint: bool, default=False

Whether to convert int to uint (unsigned integer). Only if the values in the column are strictly positive.

str2cat: bool, default=False

Whether to convert string to category. Only if the number of categories is less than 30% of the column's length.

dense2sparse: bool, default=False

Whether to convert all features to sparse format. The value that is compressed is the most frequent value in the column.

columns: int, str, segment, sequence, dataframe or None, default=None

Selection of columns to shrink. If None, transform all columns.
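
The rules above can be sketched in plain Python. This is an illustrative approximation of the decision logic only, not atom's implementation. The helper names are hypothetical, and the exact reading of "strictly positive" (here: all values > 0) is an assumption taken from the wording above.

```python
# Value ranges of the signed and unsigned integer dtypes, smallest first.
SIGNED = [("int8", -2**7, 2**7 - 1), ("int16", -2**15, 2**15 - 1),
          ("int32", -2**31, 2**31 - 1), ("int64", -2**63, 2**63 - 1)]
UNSIGNED = [("uint8", 0, 2**8 - 1), ("uint16", 0, 2**16 - 1),
            ("uint32", 0, 2**32 - 1), ("uint64", 0, 2**64 - 1)]


def smallest_int_dtype(values, int2bool=False, int2uint=False):
    """Name the smallest integer dtype that can hold every value in the column."""
    lo, hi = min(values), max(values)
    # int2bool: only when the column draws its values from (0, 1) or (-1, 1).
    if int2bool and (set(values) <= {0, 1} or set(values) <= {-1, 1}):
        return "bool"
    # int2uint: assumed to require that all values are greater than zero.
    table = UNSIGNED if int2uint and lo > 0 else SIGNED
    for name, low, high in table:
        if low <= lo and hi <= high:
            return name
    raise OverflowError("values do not fit in a 64-bit integer")


def should_be_category(values, max_ratio=0.3):
    """str2cat rule: the number of categories is below 30% of the column's length."""
    return len(set(values)) < max_ratio * len(values)


print(smallest_int_dtype([1, 50, 200]), smallest_int_dtype([1, 50, 200], int2uint=True))
```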

method stacking(models=None, name="Stack", train_on_test=False, **kwargs)[source]

Add a Stacking model to the pipeline.

Warning

Combining models trained on different branches into one ensemble is not allowed and will raise an exception.

Parameters

models: segment, sequence or None, default=None

Models that feed the stacking estimator. The models must have been fitted on the current branch.

name: str, default="Stack"

Name of the model. The name is always preceded by the model's acronym: Stack.

train_on_test: bool, default=False

Whether to train the final estimator of the stacking model on the test set instead of the training set. Note that training it on the training set (default option) means there is a high risk of overfitting. It's recommended to use this option if you have another, independent set for testing (holdout set).

**kwargs

Additional keyword arguments for one of these estimators.

For classification tasks: StackingClassifier.
For regression tasks: StackingRegressor.
For forecast tasks: StackingForecaster.

Tip

The model's acronyms can be used for the final_estimator parameter, e.g., atom.stacking(final_estimator="LR").

method stats()[source]

Display basic information about the dataset.

method status()[source]

Get an overview of the branches and models.

This method prints the same information as the __repr__ and also saves it to the logger.

method transform(X=None, y=None, verbose=None)[source]

Transform new data through the pipeline.

Transformers that are only applied on the training set are skipped. If only X or only y is provided, it ignores transformers that require the other parameter. This can be of use to, for example, transform only the target column.

Parameters

X: dataframe-like or None, default=None

Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: int, str, sequence, dataframe-like or None, default=None

Target column(s) corresponding to X.

If None: y is ignored.
If int: Position of the target column in X.
If str: Name of the target column in X.
If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
If dataframe-like: Target columns for multioutput tasks.

verbose: int or None, default=None

Verbosity level for the transformers in the pipeline. If None, it uses the pipeline's verbosity.

Returns

dataframe

Transformed feature set. Only returned if provided.

series or dataframe

Transformed target column. Only returned if provided.

method voting(models=None, name="Vote", **kwargs)[source]

Add a Voting model to the pipeline.

Warning

Combining models trained on different branches into one ensemble is not allowed and will raise an exception.

Parameters

models: segment, sequence or None, default=None

Models that feed the voting estimator. The models must have been fitted on the current branch.

name: str, default="Vote"

Name of the model. The name is always preceded by the model's acronym: Vote.

**kwargs

Additional keyword arguments for one of these estimators.

For classification tasks: VotingClassifier.
For regression tasks: VotingRegressor.
For forecast tasks: EnsembleForecaster.

Data cleaning

The data cleaning methods can help you scale the data, handle missing values, categorical columns, outliers and unbalanced datasets. All attributes of the data cleaning classes are attached to atom after running. Read more in the user guide.

Tip

Use the eda method to examine the data and help you determine suitable parameters for the data cleaning methods.

balance	Balance the number of rows per class in the target column.
clean	Apply standard data cleaning steps on the dataset.
discretize	Bin continuous data into intervals.
encode	Perform encoding of categorical features.
impute	Handle missing values in the dataset.
normalize	Transform the data to follow a Normal/Gaussian distribution.
prune	Prune outliers from the training set.
scale	Scale the data.

method balance(strategy="adasyn", **kwargs)[source]

Balance the number of rows per class in the target column.

When oversampling, the newly created samples have an increasing integer index for numerical indices, and an index of the form [estimator]_N for non-numerical indices, where N stands for the N-th sample in the data set.

See the Balancer class for a description of the parameters.

Warning

The balance method does not support multioutput tasks.
The balance method does not support sample_weights passed through metadata routing.
This transformation is only applied to the training set to maintain the original distribution of target classes in the test set.

Tip

Use atom's classes attribute for an overview of the target class distribution per data set.

method clean(convert_dtypes=True, drop_dtypes=None, drop_chars=None, strip_categorical=True, drop_duplicates=False, drop_missing_target=True, encode_target=True, **kwargs)[source]

Apply standard data cleaning steps on the dataset.

Use the parameters to choose which transformations to perform. The available steps are:

Convert dtypes to the best possible types.
Drop columns with specific data types.
Remove characters from column names.
Strip spaces from categorical features.
Drop duplicate rows.
Drop rows with missing values in the target column.
Encode the target column (only for classification tasks).

See the Cleaner class for a description of the parameters.

method discretize(strategy="quantile", bins=5, labels=None, **kwargs)[source]

Bin continuous data into intervals.

For each feature, the bin edges are computed during fit and, together with the number of bins, they define the intervals. Ignores categorical columns.

See the Discretizer class for a description of the parameters.
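
A minimal sketch of the idea behind strategy="quantile" with bins=5, using only the standard library: the edges are learned from the training values and then reused to assign any value to an interval. The helper names are hypothetical and the library's own edge computation may differ.

```python
from bisect import bisect_right
from statistics import quantiles


def fit_quantile_edges(values, bins=5):
    """Compute the inner bin edges so that every bin holds a similar share of values."""
    return quantiles(values, n=bins, method="inclusive")


def assign_bins(values, edges):
    """Map each value to the index of the interval it falls in."""
    return [bisect_right(edges, v) for v in values]


train = list(range(1, 101))
edges = fit_quantile_edges(train, bins=5)  # 4 inner edges define 5 intervals
binned = assign_bins(train, edges)
print(edges)
```

Values outside the range seen during fit simply land in the first or last interval.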

Tip

Use the plot_distribution method to visualize a column's distribution and decide on the bins.

method encode(strategy="Target", max_onehot=10, ordinal=None, infrequent_to_value=None, value="infrequent", **kwargs)[source]

Perform encoding of categorical features.

The encoding type depends on the number of classes in the column:

If n_classes=2 or ordinal feature, use Ordinal-encoding.
If 2 < n_classes <= max_onehot, use OneHot-encoding.
If n_classes > max_onehot, use strategy-encoding.
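
The three rules can be written as a small decision function. This is only a sketch of the selection logic described above. The function choose_encoding and its is_ordinal flag are hypothetical names, not part of the Encoder class.

```python
def choose_encoding(n_classes, max_onehot=10, is_ordinal=False, strategy="Target"):
    """Pick the encoding type for one categorical column."""
    if n_classes == 2 or is_ordinal:
        return "Ordinal"
    if 2 < n_classes <= max_onehot:
        return "OneHot"
    if n_classes > max_onehot:
        return strategy
    # Columns with fewer than two classes are not covered by the rules above.
    raise ValueError("n_classes must be at least 2")


for n in (2, 5, 10, 11):
    print(n, choose_encoding(n))
```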

Missing values are propagated to the output column. Unknown classes encountered during transforming are imputed according to the selected strategy. Rare classes can be replaced with a value in order to prevent too high cardinality.

See the Encoder class for a description of the parameters.

Note

This method only encodes the categorical features. It does not encode the target column! Use the clean method for that.

Tip

Use the categorical attribute for a list of the categorical features in the dataset.

method impute(strat_num="mean", strat_cat="most_frequent", max_nan_rows=None, max_nan_cols=None, **kwargs)[source]

Handle missing values in the dataset.

Impute or remove missing values according to the selected strategy. Also removes rows and columns with too many missing values.

See the Imputer class for a description of the parameters.
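
As an illustration of the two default strategies, the sketch below fills missing entries (represented here by None, an assumption made for simplicity) with the mean of a numerical column or the most frequent value of a categorical one. The helper impute_column is hypothetical.

```python
from statistics import mean, mode


def impute_column(values, strategy):
    """Fill None entries using a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "most_frequent":
        fill = mode(observed)
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return [fill if v is None else v for v in values]


ages = impute_column([20.0, None, 40.0], strategy="mean")
colors = impute_column(["red", None, "red", "blue"], strategy="most_frequent")
print(ages, colors)
```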

Tip

Use the nans attribute to check the amount of missing values per column.
Use the missing attribute to customize what are considered "missing values".

method normalize(strategy="yeojohnson", **kwargs)[source]

Transform the data to follow a Normal/Gaussian distribution.

This transformation is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired. Missing values are disregarded in fit and maintained in transform. Ignores categorical columns.

See the Normalizer class for a description of the parameters.
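
For reference, the default strategy name points to the Yeo-Johnson power transform. The sketch below applies that transform to a single value for a fixed lambda. In practice the parameter is estimated from the data, which is not shown here, and the function name is hypothetical.

```python
import math


def yeo_johnson(x, lmbda):
    """Apply the Yeo-Johnson power transform to one value for a fixed lambda."""
    if x >= 0:
        if lmbda == 0:
            return math.log1p(x)
        return ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-x)
    return -((-x + 1) ** (2 - lmbda) - 1) / (2 - lmbda)


# With lambda=1 the transform is the identity; other values bend the distribution.
print([yeo_johnson(v, 1) for v in (-3.0, 0.0, 3.0)])
print([round(yeo_johnson(v, 0), 3) for v in (0.0, 1.0, 10.0)])
```

Unlike a plain logarithm or Box-Cox transform, this one is defined for negative values as well.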

Tip

Use the plot_distribution method to examine a column's distribution.

method prune(strategy="zscore", method="drop", max_sigma=3, include_target=False, **kwargs)[source]

Prune outliers from the training set.

Replace or remove outliers. The definition of an outlier depends on the selected strategy and can differ greatly between strategies. Ignores categorical columns.

See the Pruner class for a description of the parameters.
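
A minimal sketch of the default combination (strategy="zscore", method="drop", max_sigma=3) on a single column: values whose absolute z-score exceeds max_sigma are removed. The helper is hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def prune_zscore(values, max_sigma=3):
    """Drop the values whose absolute z-score exceeds max_sigma."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(values)  # a constant column has no outliers
    return [v for v in values if abs((v - mu) / sigma) <= max_sigma]


train_column = [1.0] * 19 + [100.0]
pruned = prune_zscore(train_column)
print(len(train_column), len(pruned))
```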

Note

This transformation is only applied to the training set in order to maintain the original distribution of samples in the test set.

Tip

Use the outliers attribute to check the number of outliers per column.

method scale(strategy="standard", include_binary=False, **kwargs)[source]

Scale the data.

Apply one of sklearn's scaling strategies. Categorical columns are ignored.

See the Scaler class for a description of the parameters.

Tip

Use the scaled attribute to check whether the dataset is scaled.

NLP

The Natural Language Processing (NLP) transformers help to convert raw text to meaningful numeric values, ready to be ingested by a model. All transformations are applied only on the column in the dataset called corpus. Read more in the user guide.

textclean	Apply standard text cleaning to the corpus.
textnormalize	Normalize the corpus.
tokenize	Tokenize the corpus.
vectorize	Vectorize the corpus.

method textclean(decode=True, lower_case=True, drop_email=True, regex_email=None, drop_url=True, regex_url=None, drop_html=True, regex_html=None, drop_emoji=True, regex_emoji=None, drop_number=True, regex_number=None, drop_punctuation=True, **kwargs)[source]

Apply standard text cleaning to the corpus.

Transformations include normalizing characters and dropping noise from the text (emails, HTML tags, URLs, etc.). The transformations are applied on the column named corpus, in the same order the parameters are presented. If there is no column with that name, an exception is raised.

See the TextCleaner class for a description of the parameters.
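
The sketch below shows why the order of the steps matters: emails and URLs have to be removed before punctuation is stripped, otherwise their patterns no longer match. The regular expressions are simplified, hypothetical patterns and not the library's defaults. Character decoding and emoji removal are left out.

```python
import re
import string

# Simplified, illustrative patterns (assumptions, not the TextCleaner defaults).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL = re.compile(r"https?://\S+|www\.\S+")
HTML = re.compile(r"<.*?>")
NUMBER = re.compile(r"\b\d+\b")


def clean_text(doc):
    """Apply the cleaning steps in the same order as the method's parameters."""
    doc = doc.lower()
    doc = EMAIL.sub("", doc)
    doc = URL.sub("", doc)
    doc = HTML.sub("", doc)
    doc = NUMBER.sub("", doc)
    doc = doc.translate(str.maketrans("", "", string.punctuation))
    return " ".join(doc.split())  # collapse the whitespace left behind


raw = "<p>Mail ME at john.doe@example.com or visit https://example.com, 24 hours!</p>"
print(clean_text(raw))
```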

method textnormalize(stopwords=True, custom_stopwords=None, stem=False, lemmatize=True, **kwargs)[source]

Normalize the corpus.

Convert words to a more uniform standard. The transformations are applied on the column named corpus, in the same order the parameters are presented. If there is no column with that name, an exception is raised. If the provided documents are strings, words are separated by spaces.

See the TextNormalizer class for a description of the parameters.

method tokenize(bigram_freq=None, trigram_freq=None, quadgram_freq=None, **kwargs)[source]

Tokenize the corpus.

Convert documents into sequences of words. Additionally, create n-grams (represented by words united with underscores, e.g., "New_York") based on their frequency in the corpus. The transformations are applied on the column named corpus. If there is no column with that name, an exception is raised.

See the Tokenizer class for a description of the parameters.
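
To illustrate frequency-based n-grams, the sketch below splits documents on whitespace and joins every pair of adjacent words that occurs at least min_freq times in the corpus. Treating the threshold as an absolute count is an assumption, the helper names are hypothetical, and the library's own collocation logic may differ.

```python
from collections import Counter


def tokenize(docs):
    """Split every document into a list of words."""
    return [doc.split() for doc in docs]


def merge_bigrams(tokenized, min_freq):
    """Join adjacent word pairs that occur at least min_freq times in the corpus."""
    counts = Counter(pair for doc in tokenized for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, count in counts.items() if count >= min_freq}
    merged = []
    for doc in tokenized:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                out.append(f"{doc[i]}_{doc[i + 1]}")
                i += 2  # both words are consumed by the bigram
            else:
                out.append(doc[i])
                i += 1
        merged.append(out)
    return merged


docs = ["I love New York", "New York is big", "He is old"]
tokens = merge_bigrams(tokenize(docs), min_freq=2)
print(tokens)
```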

method vectorize(strategy="bow", return_sparse=True, **kwargs)[source]

Vectorize the corpus.

Transform the corpus into meaningful vectors of numbers. The transformation is applied on the column named corpus. If there is no column with that name, an exception is raised.

If strategy="bow" or "tfidf", the transformed columns are named after the word they are embedding with the prefix corpus_. If strategy="hashing", the columns are named hash[N], where N stands for the n-th hashed column.

See the Vectorizer class for a description of the parameters.
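
A bag-of-words vectorization and the resulting column names can be sketched as follows. The function is a hypothetical, dense stand-in. The actual method relies on a vectorizer estimator and can return sparse output.

```python
from collections import Counter


def bag_of_words(corpus):
    """Return one row per document with word counts in columns named corpus_<word>."""
    tokenized = [doc.split() for doc in corpus]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append({f"corpus_{word}": counts[word] for word in vocabulary})
    return rows


rows = bag_of_words(["good movie", "bad movie movie"])
print(rows)
```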

Feature engineering

To further pre-process the data, it's possible to extract features from datetime columns, create new non-linear features by transforming the existing ones, group similar features or, if the dataset is too large, remove features. Read more in the user guide.

feature_extraction	Extract features from datetime columns.
feature_generation	Generate new features.
feature_grouping	Extract statistics from similar features.
feature_selection	Reduce the number of features in the data.

method feature_extraction(features=('day', 'month', 'year'), fmt=None, encoding_type="ordinal", drop_columns=True, from_index=False, **kwargs)[source]

Extract features from datetime columns.

Create new features extracting datetime elements (day, month, year, etc...) from the provided columns. Columns of dtype datetime64 are used as is. Categorical columns that can be successfully converted to a datetime format (less than 30% NaT values after conversion) are also used.

See the FeatureExtractor class for a description of the parameters.
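
The conversion rule can be sketched with the standard library: parse every value, count the failures (the role NaT plays above), and only extract features when fewer than 30% of the values failed. The helper and the naming scheme column_feature for the new columns are assumptions made for illustration.

```python
from datetime import datetime


def extract_datetime_features(name, values, fmt, features=("day", "month", "year")):
    """Extract datetime elements from a column of strings, or return None if unusable."""
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except (TypeError, ValueError):
            parsed.append(None)  # plays the role of NaT
    # Skip the column when 30% or more of the values failed to convert.
    if parsed.count(None) >= 0.3 * len(parsed):
        return None
    return {
        f"{name}_{feat}": [getattr(dt, feat) if dt else None for dt in parsed]
        for feat in features
    }


dates = ["2024-01-15", "2023-12-31", "2022-07-04", "oops"]
extracted = extract_datetime_features("date", dates, fmt="%Y-%m-%d")
print(extracted)
```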

method feature_generation(strategy="dfs", n_features=None, operators=None, **kwargs)[source]

Generate new features.

Create new combinations of existing features to capture the non-linear relations between the original features.

See the FeatureGenerator class for a description of the parameters.

method feature_grouping(groups, operators=None, drop_columns=True, **kwargs)[source]

Extract statistics from similar features.

Replace groups of features with related characteristics with new features that summarize statistical properties of the group. The statistical operators are calculated over every row of the group. The group names and features can be accessed through the groups method.

See the FeatureGrouper class for a description of the parameters.

Tip

Use a regex pattern with the groups parameter to select groups more easily, e.g., atom.feature_grouping({"group1": "var_.+"}) to select all features that start with var_.
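
The row-wise statistics and the regex selection can be sketched on a list of dictionaries. Everything here is a hypothetical illustration: the helper name, the choice of operators and the operator(group) naming of the new features are assumptions, not the FeatureGrouper's documented behavior.

```python
import re
from statistics import mean


def group_features(rows, groups, operators=("mean", "min", "max")):
    """Replace the columns matching each group's pattern with row-wise statistics."""
    funcs = {"mean": mean, "min": min, "max": max}
    result = []
    for row in rows:
        new_row = dict(row)
        for group, pattern in groups.items():
            members = [col for col in row if re.fullmatch(pattern, col)]
            for op in operators:
                # The statistic is computed over the group's values in this row.
                new_row[f"{op}({group})"] = funcs[op]([row[col] for col in members])
            for col in members:
                del new_row[col]  # mirrors drop_columns=True
        result.append(new_row)
    return result


rows = [{"var_1": 1.0, "var_2": 3.0, "age": 30}]
grouped = group_features(rows, {"group1": "var_.+"})
print(grouped)
```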

method feature_selection(strategy=None, solver=None, n_features=None, min_repeated=2, max_repeated=1.0, max_correlation=1.0, **kwargs)[source]

Reduce the number of features in the data.

Apply feature selection or dimensionality reduction, either to improve the estimators' accuracy or to boost their performance on very high-dimensional datasets. Additionally, remove multicollinear and low-variance features.

See the FeatureSelector class for a description of the parameters.

Note

When strategy="univariate" and solver=None, f_classif or f_regression is used as default solver.
When strategy is "sfs", "rfecv" or any of the advanced strategies and no scoring is specified, atom's metric (if it exists) is used as scoring.

Training

The training methods are where the models are fitted to the data and their performance is evaluated against a selected metric. There are three methods to call the three different training approaches. Read more in the user guide.

run	Train and evaluate the models in a direct fashion.
successive_halving	Fit the models in a successive halving fashion.
train_sizing	Train and evaluate the models in a train sizing fashion.

method run(models=None, metric=None, est_params=None, n_trials=0, ht_params=None, n_bootstrap=0, parallel=False, errors="skip", **kwargs)[source]

Train and evaluate the models in a direct fashion.

Contrary to successive_halving and train_sizing, the direct approach only iterates once over the models, using the full dataset.

The following steps are applied to every model:

Apply hyperparameter tuning (optional).
Fit the model on the training set using the best combination of hyperparameters found.
Evaluate the model on the test set.
Train the estimator on various bootstrapped samples of the training set and evaluate again on the test set (optional).

See the DirectClassifier or DirectRegressor class for a description of the parameters.
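
The fit, evaluate and bootstrap steps above can be sketched with a toy model that always predicts the majority class. Hyperparameter tuning is left out, and none of the names below belong to atom's API. The sketch only shows how bootstrapped resamples of the training set give a spread of test scores.

```python
import random
from collections import Counter


def fit_majority(y_train):
    """Toy 'model': always predict the most common class in the training set."""
    return Counter(y_train).most_common(1)[0][0]


def accuracy(prediction, y_test):
    """Fraction of test labels equal to the constant prediction."""
    return sum(prediction == y for y in y_test) / len(y_test)


def run(y_train, y_test, n_bootstrap=0, seed=1):
    rng = random.Random(seed)
    # Fit on the complete training set and evaluate on the test set.
    score = accuracy(fit_majority(y_train), y_test)
    # Optionally refit on resamples (with replacement) of the training set.
    bootstrap_scores = []
    for _ in range(n_bootstrap):
        sample = rng.choices(y_train, k=len(y_train))
        bootstrap_scores.append(accuracy(fit_majority(sample), y_test))
    return score, bootstrap_scores


score, bootstrap_scores = run([0] * 8 + [1] * 2, [0, 0, 1, 0], n_bootstrap=5)
print(score, bootstrap_scores)
```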

method successive_halving(models=None, metric=None, skip_runs=0, est_params=None, n_trials=0, ht_params=None, n_bootstrap=0, parallel=False, errors="skip", **kwargs)[source]

Fit the models in a successive halving fashion.

The successive halving technique is a bandit-based algorithm that fits N models to 1/N of the data. The best half are selected to go to the next iteration where the process is repeated. This continues until only one model remains, which is fitted on the complete dataset. Beware that a model's performance can depend greatly on the amount of data on which it is trained. For this reason, it is recommended to only use this technique with similar models, e.g., only using tree-based models.

The following steps are applied to every model (per iteration):

Apply hyperparameter tuning (optional).
Fit the model on the training set using the best combination of hyperparameters found.
Evaluate the model on the test set.
Train the estimator on various bootstrapped samples of the training set and evaluate again on the test set (optional).

See the SuccessiveHalvingClassifier or SuccessiveHalvingRegressor class for a description of the parameters.
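
The schedule described above (N models on 1/N of the data, then the best half on a larger share) can be written down as a short helper. It is a hypothetical sketch: rounding down when halving an odd number of models is an assumption, and skip_runs is ignored.

```python
def halving_schedule(n_models):
    """List (number of models, fraction of the data) for every iteration."""
    schedule = []
    while n_models >= 1:
        schedule.append((n_models, 1 / n_models))
        n_models //= 2  # only the best half continues
    return schedule


for n, fraction in halving_schedule(8):
    print(f"{n} models on {fraction:.3f} of the data")
```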

method train_sizing(models=None, metric=None, train_sizes=5, est_params=None, n_trials=0, ht_params=None, n_bootstrap=0, parallel=False, errors="skip", **kwargs)[source]

Train and evaluate the models in a train sizing fashion.

When training models, there is usually a trade-off between model performance and computation time, which is regulated by the number of samples in the training set. This method can be used to gain insight into this trade-off and help determine the optimal size of the training set. The models are fitted multiple times, each time on a larger number of training samples.

The following steps are applied to every model (per iteration):

Apply hyperparameter tuning (optional).
Fit the model on the training set using the best combination of hyperparameters found.
Evaluate the model on the test set.
Train the estimator on various bootstrapped samples of the training set and evaluate again on the test set (optional).

See the TrainSizingClassifier or TrainSizingRegressor class for a description of the parameters.
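
As a sketch of what train_sizes=5 could translate to, the helper below assumes equally spaced fractions of the training set (1/5, 2/5, ..., 1) and returns the number of rows used in each iteration. Both the helper name and the equal spacing are assumptions; see the class documentation for the exact behavior.

```python
def train_size_rows(n_train, train_sizes=5):
    """Number of training rows per iteration for equally spaced fractions."""
    # Integer arithmetic avoids floating point rounding in the row counts.
    return [n_train * i // train_sizes for i in range(1, train_sizes + 1)]


print(train_size_rows(1000))
```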