
ATOMRegressor


class atom.api.ATOMRegressor(*arrays, y=-1, index=False, shuffle=True, n_rows=1, test_size=0.2, holdout_size=None, n_jobs=1, device="cpu", engine="sklearn", verbose=0, warnings=False, logger=None, experiment=None, random_state=None)[source]
Main class for regression tasks.

Apply all data transformations and model management provided by the package on a given dataset. Note that, contrary to sklearn's API, the instance contains the dataset on which to perform the analysis. Calling a method will automatically apply it on the dataset it contains.

All data cleaning, feature engineering, model training and plotting functionality can be accessed from an instance of this class.

Parameters

*arrays: sequence of indexables
Dataset containing features and target. Allowed formats are:

  • X
  • X, y
  • train, test
  • train, test, holdout
  • X_train, X_test, y_train, y_test
  • X_train, X_test, X_holdout, y_train, y_test, y_holdout
  • (X_train, y_train), (X_test, y_test)
  • (X_train, y_train), (X_test, y_test), (X_holdout, y_holdout)

X, train, test: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str or sequence
Target column corresponding to X.

  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • Else: Array with shape=(n_samples,) to use as target.

y: int, str or sequence, default=-1
Target column corresponding to X.

  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • Else: Array with shape=(n_samples,) to use as target.

This parameter is ignored if the target column is provided through arrays.

index: bool, int, str or sequence, default=False
Handle the index in the resulting dataframe.

  • If False: Reset to RangeIndex.
  • If True: Use the provided index.
  • If int: Position of the column to use as index.
  • If str: Name of the column to use as index.
  • If sequence: Array with shape=(n_samples,) to use as index.

test_size: int or float, default=0.2

  • If <=1: Fraction of the dataset to include in the test set.
  • If >1: Number of rows to include in the test set.

This parameter is ignored if the test set is provided through arrays.

holdout_size: int, float or None, default=None

  • If None: No holdout data set is kept apart.
  • If <=1: Fraction of the dataset to include in the holdout set.
  • If >1: Number of rows to include in the holdout set.

This parameter is ignored if the holdout set is provided through arrays.
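As a rough illustration of the `<=1` versus `>1` rule shared by test_size and holdout_size, the sketch below converts either form into a row count. The helper `resolve_size` is hypothetical and not part of ATOM. Truncating fractional row counts and taking every fraction from the full dataset are assumptions made for the example.

```python
def resolve_size(size, n_rows):
    """Translate a size parameter into a number of rows (hypothetical helper)."""
    if size is None:
        return 0  # no set is kept apart (only meaningful for holdout_size)
    if size <= 0:
        raise ValueError("size must be positive")
    if size <= 1:
        return int(size * n_rows)  # interpreted as a fraction of the dataset
    return int(size)  # interpreted as an absolute number of rows


n_rows = 1000
n_holdout = resolve_size(0.1, n_rows)  # fraction of the dataset
n_test = resolve_size(0.2, n_rows)     # fraction of the dataset
n_train = n_rows - n_test - n_holdout  # everything that is left
n_test_abs = resolve_size(150, n_rows) # absolute number of rows
```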

shuffle: bool, default=True
Whether to shuffle the dataset before splitting the train and test set. Be aware that not shuffling the dataset can cause an unequal distribution of target classes over the sets.

n_rows: int or float, default=1
Random subsample of the dataset to use. The default value selects all rows.

  • If <=1: Fraction of the dataset to select.
  • If >1: Exact number of rows to select. Only if arrays is X or X, y.

n_jobs: int, default=1
Number of cores to use for parallel processing.

  • If >0: Number of cores to use.
  • If -1: Use all available cores.
  • If <-1: Use number of cores + 1 + n_jobs, e.g. n_jobs=-2 uses all cores but one.

device: str, default="cpu"
Device on which to train the estimators. Use any string that follows the SYCL_DEVICE_FILTER filter selector, e.g. device="gpu" to use the GPU. Read more in the user guide.

engine: str, default="sklearn"
Execution engine to use for the estimators. Refer to the user guide for an explanation regarding every choice. Choose from:

  • "sklearn" (only if device="cpu")
  • "sklearnex"
  • "cuml" (only if device="gpu")

verbose: int, default=0
Verbosity level of the class. Choose from:

  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.

warnings: bool or str, default=False

  • If True: Default warning action (equal to "default").
  • If False: Suppress all warnings (equal to "ignore").
  • If str: One of python's warnings filters.

Changing this parameter affects the PYTHONWARNINGS environment. ATOM can't manage warnings that go from C/C++ code to stdout.
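The mapping described above can be pictured with Python's standard warnings module. This is a minimal sketch, not ATOM's implementation; the helper name `to_filter_action` is hypothetical.

```python
import warnings

# Actions accepted by Python's warnings filter
VALID_ACTIONS = {"default", "error", "ignore", "always", "module", "once"}


def to_filter_action(value):
    """Map the bool/str parameter onto a warnings filter action (hypothetical)."""
    if value is True:
        return "default"
    if value is False:
        return "ignore"
    if value not in VALID_ACTIONS:
        raise ValueError(f"Unknown warnings action: {value!r}")
    return value


# With warnings=False, nothing is recorded
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter(to_filter_action(False))
    warnings.warn("this message is suppressed")
    n_suppressed = len(caught)

# With warnings="always", the warning is recorded
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter(to_filter_action("always"))
    warnings.warn("this message is shown")
    n_shown = len(caught)
```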

logger: str, Logger or None, default=None

  • If None: Doesn't save a logging file.
  • If str: Name of the log file. Use "auto" for automatic name.
  • Else: Python logging.Logger instance.

experiment: str or None, default=None
Name of the mlflow experiment to use for tracking. If None, no mlflow tracking is performed.

random_state: int or None, default=None
Seed used by the random number generator. If None, the random number generator is the RandomState used by np.random.


See Also

ATOMClassifier

Main class for binary and multiclass classification tasks.


Example

>>> from atom import ATOMRegressor
>>> from sklearn.datasets import load_diabetes

>>> X, y = load_diabetes(return_X_y=True, as_frame=True)

>>> # Initialize atom
>>> atom = ATOMRegressor(X, y, logger="auto", n_jobs=2, verbose=2)

<< ================== ATOM ================== >>
Algorithm task: regression.
Parallel processing with 2 cores.

Dataset stats ==================== >>
Shape: (442, 11)
Memory: 39.02 kB
Scaled: False
Outlier values: 10 (0.3%)
-------------------------------------
Train set size: 310
Test set size: 132

>>> # Apply data cleaning and feature engineering methods
>>> atom.scale()

Fitting Scaler...
Scaling features...

>>> atom.feature_selection(strategy="rfecv", solver="xgb", n_features=12)

Fitting FeatureSelector...
Performing feature selection ...
 --> rfecv selected 6 features from the dataset.
   --> Dropping feature age (rank 5).
   --> Dropping feature sex (rank 4).
   --> Dropping feature s1 (rank 2).
   --> Dropping feature s3 (rank 3).

>>> # Train models
>>> atom.run(
...    models=["OLS", "BR", "RF"],
...    metric="r2",
... )

Training ========================= >>
Models: OLS, BR, RF
Metric: r2

Results for Ordinary Least Squares:
Fit ---------------------------------------------
Train evaluation --> r2: 0.5223
Test evaluation --> r2: 0.4012
Time elapsed: 0.010s
-------------------------------------------------
Total time: 0.010s

Results for Bayesian Ridge:
Fit ---------------------------------------------
Train evaluation --> r2: 0.522
Test evaluation --> r2: 0.4037
Time elapsed: 0.010s
-------------------------------------------------
Total time: 0.010s

Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> r2: 0.9271
Test evaluation --> r2: 0.259
Time elapsed: 0.175s
-------------------------------------------------
Total time: 0.175s


Final results ==================== >>
Total time: 0.195s
-------------------------------------
Ordinary Least Squares --> r2: 0.4012 ~
Bayesian Ridge         --> r2: 0.4037 ~ !
Random Forest          --> r2: 0.259 ~

>>> # Analyze the results
>>> atom.evaluate()

     neg_mean_absolute_error  ...  neg_root_mean_squared_error
OLS               -43.756992  ...                   -54.984345
BR                -43.734975  ...                   -54.869543
RF                -48.327879  ...                   -61.167760

[3 rows x 7 columns]


Magic methods

The class contains some magic methods to help you access some of its elements faster. Note that methods that apply on the pipeline can return different results per branch.

  • __repr__: Prints an overview of atom's branches, models, metric and errors.
  • __len__: Returns the length of the dataset.
  • __iter__: Iterate over the pipeline's transformers.
  • __contains__: Checks if the provided item is a column in the dataset.
  • __getitem__: Access a branch, model, column or subset of the dataset.


Attributes

Data attributes

The data attributes are used to access the dataset and its properties. Updating the dataset will automatically update the response of these attributes accordingly.

Attributes

pipeline: pd.Series
Transformers fitted on the data.

Use this attribute only to access the individual instances. To visualize the pipeline, use the plot_pipeline method.

mapping: dict
Encoded values and their respective mapped values.

The column name is the key to its mapping dictionary. Only for columns mapped to a single column (e.g. Ordinal, Leave-one-out, etc...).

dataset: pd.DataFrame
Complete data set.
train: pd.DataFrame
Training set.
test: pd.DataFrame
Test set.
X: pd.DataFrame
Feature set.
y: pd.Series
Target column.
X_train: pd.DataFrame
Features of the training set.
y_train: pd.Series
Target column of the training set.
X_test: pd.DataFrame
Features of the test set.
y_test: pd.Series
Target column of the test set.
shape: tuple
Shape of the dataset (n_rows, n_cols).
columns: pd.Series
Name of all the columns.
n_columns: int
Number of columns.
features: pd.Series
Name of the features.
n_features: int
Number of features.
target: str
Name of the target column.
scaled: bool
Whether the feature set is scaled.

A data set is considered scaled when it has mean=0 and std=1, or when atom has a scaler in the pipeline. Returns None for sparse datasets.
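To make the mean=0 and std=1 criterion concrete, here is a minimal standard-library sketch. The function `is_scaled`, the per-column check and the tolerance of 0.05 are assumptions for illustration; ATOM's actual check may differ.

```python
import statistics


def is_scaled(columns, tol=0.05):
    """Return True if every column has mean ~0 and std ~1 (hypothetical check)."""
    for values in columns:
        if abs(statistics.fmean(values)) > tol:
            return False
        if abs(statistics.pstdev(values) - 1) > tol:
            return False
    return True


raw = [1.0, 2.0, 3.0, 4.0, 5.0]

# Standardize the column: subtract the mean, divide by the standard deviation
mean, std = statistics.fmean(raw), statistics.pstdev(raw)
standardized = [(v - mean) / std for v in raw]

raw_is_scaled = is_scaled([raw])
standardized_is_scaled = is_scaled([standardized])
```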

duplicates: pd.Series
Number of duplicate rows in the dataset.
missing: list
Values that are considered "missing".

These values are used by the clean and impute methods. Default values are: None, NaN, +inf, -inf, "", "?", "None", "NA", "nan", "NaN" and "inf". Note that None, NaN, +inf and -inf are always considered missing since they are incompatible with sklearn estimators.
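The default list above can be expressed as a small membership test. The sketch below uses only the standard library; `is_missing` is a hypothetical helper, not an ATOM function.

```python
import math

# String markers from the default list of missing values
MISSING_STRINGS = {"", "?", "None", "NA", "nan", "NaN", "inf"}


def is_missing(value):
    """Check a single value against the default missing markers."""
    if value is None:
        return True
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        return True
    return isinstance(value, str) and value in MISSING_STRINGS


column = [1.0, None, float("nan"), "?", "ok", float("-inf"), "NA", 3]
n_missing = sum(is_missing(v) for v in column)
```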

nans: pd.Series
Columns with the number of missing values in them.
n_nans: int
Number of samples containing missing values.
numerical: pd.Series
Names of the numerical features in the dataset.
n_numerical: int
Number of numerical features in the dataset.
categorical: pd.Series
Names of the categorical features in the dataset.
n_categorical: int
Number of categorical features in the dataset.
outliers: pd.Series
Columns in training set with amount of outlier values.
n_outliers: int
Number of samples in the training set containing outliers.


Utility attributes

The utility attributes are used to access information about the models in the instance after training.

Attributes

branch: Branch
Current active branch.

Use the property's @setter to change from current branch or to create a new one. If the value is the name of an existing branch, switch to that one. Else, create a new branch using that name. The new branch is split from the current branch. Use __from__ to split the new branch from any other existing branch. Read more in the user guide.

models: str or list
Name of the model(s).
metric: str or list
Name of the metric(s).
errors: dict
Errors encountered during model training.

The key is the model's name and the value is the exception object that was raised. Use the __traceback__ attribute to investigate the error.

winners: list
Models ordered by performance.

Performance is measured as the highest score on the model's score_bootstrap or score_test attributes, checked in that order. For multi-metric runs, only the main metric is compared.

winner: model
Best performing model.

Performance is measured as the highest score on the model's score_bootstrap or score_test attributes, checked in that order. For multi-metric runs, only the main metric is compared.
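The ordering rule for winners and winner can be sketched with plain dictionaries. The scores are taken from the example run above; representing a missing bootstrap score as None is an assumption made for this illustration.

```python
# One entry per model; no bootstrapping was performed in the example run
results = [
    {"model": "OLS", "score_test": 0.4012, "score_bootstrap": None},
    {"model": "BR", "score_test": 0.4037, "score_bootstrap": None},
    {"model": "RF", "score_test": 0.259, "score_bootstrap": None},
]


def ranking_score(row):
    """Prefer score_bootstrap; fall back to score_test when it is unavailable."""
    bootstrap = row.get("score_bootstrap")
    return bootstrap if bootstrap is not None else row["score_test"]


# Highest score first
winners = [row["model"] for row in sorted(results, key=ranking_score, reverse=True)]
winner = winners[0]
```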

results: pd.DataFrame
Overview of the training results.

All durations are in seconds. Columns include:

  • score_ht: Score obtained by the hyperparameter tuning.
  • time_ht: Duration of the hyperparameter tuning.
  • score_train: Metric score on the train set.
  • score_test: Metric score on the test set.
  • time_fit: Duration of the model fitting on the train set.
  • score_bootstrap: Mean score on the bootstrapped samples.
  • time_bootstrap: Duration of the bootstrapping.
  • time: Total duration of the model run.


Tracking attributes

The tracking attributes are used to customize what elements of the experiment are tracked. Read more in the user guide.

Attributes

log_ht: bool
Whether to track every trial of the hyperparameter tuning.
log_model: bool
Whether to save the model's estimator after fitting.
log_plots: bool
Whether to save plots as artifacts.
log_data: bool
Whether to save the train and test sets.
log_pipeline: bool
Whether to save the model's pipeline.


Plot attributes

The plot attributes are used to customize the plot's aesthetics. Read more in the user guide.

Attributes

palette: str or sequence
Color palette.

Specify one of plotly's built-in palettes or create a custom one, e.g. atom.palette = ["red", "green", "blue"].

title_fontsize: int
Fontsize for the plot's title.
label_fontsize: int
Fontsize for the labels, legend and hover information.
tick_fontsize: int
Fontsize for the ticks along the plot's axes.


Utility methods

Next to the plotting methods, the class contains a variety of utility methods to handle the data and manage the pipeline.

  • add: Add a transformer to the pipeline.
  • apply: Apply a function to the dataset.
  • automl: Search for an optimized pipeline in an automated fashion.
  • available_models: Give an overview of the available predefined models.
  • canvas: Create a figure with multiple plots.
  • clear: Clear attributes from all models.
  • delete: Delete models.
  • distribution: Get statistics on column distributions.
  • evaluate: Get all models' scores for the provided metrics.
  • export_pipeline: Export the pipeline to a sklearn-like object.
  • get_class_weight: Return class weights for a balanced dataset.
  • inverse_transform: Inversely transform new data through the pipeline.
  • log: Print message and save to log file.
  • merge: Merge another instance of the same class into this one.
  • update_layout: Update the properties of the plot's layout.
  • report: Create an extensive profile analysis report of the data.
  • reset: Reset the instance to its initial state.
  • reset_aesthetics: Reset the plot aesthetics to their default values.
  • save: Save the instance to a pickle file.
  • save_data: Save the data in the current branch to a .csv file.
  • shrink: Converts the columns to the smallest possible matching dtype.
  • stacking: Add a Stacking model to the pipeline.
  • stats: Print basic information about the dataset.
  • status: Get an overview of the branches and models.
  • transform: Transform new data through the pipeline.
  • voting: Add a Voting model to the pipeline.


method add(transformer, columns=None, train_only=False, **fit_params)[source]
Add a transformer to the pipeline.

If the transformer is not fitted, it is fitted on the complete training set. Afterwards, the data set is transformed and the estimator is added to atom's pipeline. If the estimator is a sklearn Pipeline, every estimator is merged independently with atom.

Warning

  • The transformer should have fit and/or transform methods with arguments X (accepting a dataframe-like object of shape=(n_samples, n_features)) and/or y (accepting a sequence of shape=(n_samples,)).
  • The transform method should return a feature set as a dataframe-like object of shape=(n_samples, n_features) and/or a target column as a sequence of shape=(n_samples,).

Note

If the transform method doesn't return a dataframe:

  • The column naming happens as follows. If the transformer has a get_feature_names or get_feature_names_out method, it is used. If not, and it returns the same number of columns, the names are kept equal. If the number of columns change, old columns will keep their name (as long as the column is unchanged) and new columns will receive the name x[N-1], where N stands for the n-th feature. This means that a transformer should only transform, add or drop columns, not combinations of these.
  • The index remains the same as before the transformation. This means that the transformer should not add, remove or shuffle rows unless it returns a dataframe.

Note

If the transformer has a n_jobs and/or random_state parameter that is left to its default value, it adopts atom's value.

Parameters

transformer: Transformer
Estimator to add to the pipeline. Should implement a transform method.

columns: int, str, slice, sequence or None, default=None
Names, indices or dtypes of the columns in the dataset to transform. If None, transform all columns. Add ! in front of a name or dtype to exclude that column, e.g. atom.add(Transformer(), columns="!Location") transforms all columns except Location. You can either include or exclude columns, not combinations of these. The target column is always included if required by the transformer.

train_only: bool, default=False
Whether to apply the estimator only on the training set or on the complete dataset. Note that if True, the transformation is skipped when making predictions on new data.

**fit_params
Additional keyword arguments for the transformer's fit method.
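The include/exclude behaviour of the columns parameter can be sketched as follows. This simplified, hypothetical helper handles column names only (no positions, slices or dtypes) and is not ATOM's implementation; the column names are made up for the example.

```python
def select_columns(all_columns, columns=None):
    """Resolve a name-based column selection with optional "!" exclusion."""
    if columns is None:
        return list(all_columns)  # transform all columns

    names = [columns] if isinstance(columns, str) else list(columns)
    excluded = [name[1:] for name in names if name.startswith("!")]
    included = [name for name in names if not name.startswith("!")]

    # Either include or exclude columns, not a combination of both
    if excluded and included:
        raise ValueError("Include or exclude columns, not both.")

    unknown = set(excluded + included) - set(all_columns)
    if unknown:
        raise ValueError(f"Unknown columns: {sorted(unknown)}")

    if excluded:
        return [col for col in all_columns if col not in excluded]
    return [col for col in all_columns if col in included]


dataset_columns = ["Location", "MinTemp", "MaxTemp", "Rainfall"]
all_but_location = select_columns(dataset_columns, "!Location")
only_temps = select_columns(dataset_columns, ["MinTemp", "MaxTemp"])
```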



method apply(func, inverse_func=None, kw_args=None, inv_kw_args=None, **kwargs)[source]
Apply a function to the dataset.

The function should have signature func(dataset, **kw_args) -> dataset. This method is useful for stateless transformations such as taking the log, doing custom scaling, etc...

Note

This approach is preferred over changing the dataset directly through the property's @setter since the transformation is stored in the pipeline.

Tip

Use atom.apply(lambda df: df.drop("column_name", axis=1)) to store the removal of columns in the pipeline.

Parameters

func: callable
Function to apply.

inverse_func: callable or None, default=None
Inverse function of func. If None, the inverse_transform method returns the input unchanged.

kw_args: dict or None, default=None
Additional keyword arguments for the function.

inv_kw_args: dict or None, default=None
Additional keyword arguments for the inverse function.
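To show how func, inverse_func and their keyword arguments fit together, the sketch below pairs a log transform with its inverse. A plain dict of column lists stands in for the dataframe so the example runs without dependencies; the function names and the "bmi" column are hypothetical.

```python
import math


def log_transform(dataset, column, base=math.e):
    """Stateless function with signature func(dataset, **kw_args) -> dataset."""
    out = {name: list(values) for name, values in dataset.items()}
    out[column] = [math.log(v, base) for v in out[column]]
    return out


def exp_transform(dataset, column, base=math.e):
    """Inverse of log_transform, taking the same keyword arguments."""
    out = {name: list(values) for name, values in dataset.items()}
    out[column] = [base ** v for v in out[column]]
    return out


data = {"bmi": [1.0, 10.0, 100.0], "age": [30, 40, 50]}
kw_args = {"column": "bmi", "base": 10}

transformed = log_transform(data, **kw_args)      # what func would do
restored = exp_transform(transformed, **kw_args)  # what inverse_func would undo
```

Both functions return a new object instead of modifying their input, which keeps the transformation stateless.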



method automl(**kwargs)[source]
Search for an optimized pipeline in an automated fashion.

Automated machine learning (AutoML) automates the selection, composition and parameterization of machine learning pipelines. Automating the machine learning process often provides faster, more accurate outputs than hand-coded algorithms. ATOM uses the evalML package for AutoML optimization. The resulting transformers and final estimator are merged with atom's pipeline (check the pipeline and models attributes after the method finishes running). The created AutoMLSearch instance can be accessed through the evalml attribute.

Warning

AutoML algorithms aren't intended to run for only a few minutes. The method may need a very long time to achieve optimal results.

Parameters

**kwargs
Additional keyword arguments for the AutoMLSearch instance.



method available_models()[source]
Give an overview of the available predefined models.

Returns

pd.DataFrame
Information about the available predefined models. Columns include:

  • acronym: Model's acronym (used to call the model).
  • model: Name of the model's class.
  • estimator: The model's underlying estimator.
  • module: The estimator's module.
  • needs_scaling: Whether the model requires feature scaling.
  • accepts_sparse: Whether the model accepts sparse matrices.
  • has_validation: Whether the model has in-training validation.
  • supports_engines: List of engines supported by the model.



method canvas(rows=1, cols=2, horizontal_spacing=0.05, vertical_spacing=0.07, title=None, legend="out", figsize=None, filename=None, display=True)[source]
Create a figure with multiple plots.

This @contextmanager allows you to draw many plots in one figure. The default option is to add two plots side by side. See the user guide for an example.

Parameters

rows: int, default=1
Number of plots in length.

cols: int, default=2
Number of plots in width.

horizontal_spacing: float, default=0.05
Space between subplot columns in normalized plot coordinates. The spacing is relative to the figure's size.

vertical_spacing: float, default=0.07
Space between subplot rows in normalized plot coordinates. The spacing is relative to the figure's size.

title: str, dict or None, default=None
Title for the plot.

legend: bool, str or dict, default="out"
Legend for the plot. See the user guide for an extended description of the choices.

  • If None: No legend is shown.
  • If str: Location where to show the legend.
  • If dict: Legend configuration.

figsize: tuple or None, default=None
Figure's size in pixels, format as (x, y). If None, it adapts the size to the number of plots in the canvas.

filename: str or None, default=None
Save the plot using this name. Use "auto" for automatic naming. The type of the file depends on the provided name (.html, .png, .pdf, etc...). If filename has no file type, the plot is saved as html. If None, the plot is not saved.

display: bool, default=True
Whether to render the plot.

Yields

go.Figure
Plot object.



method clear()[source]
Clear attributes from all models.

Reset all model attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the instance. The cleared attributes per model are:



method delete(models=None)[source]
Delete models.

If all models are removed, the metric is reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. Deleted models are not removed from any active mlflow experiment.

Parameters

models: int, str, slice, Model, sequence or None, default=None
Models to delete. If None, all models are deleted.



method distribution(distributions=None, columns=None)[source]
Get statistics on column distributions.

Compute the Kolmogorov-Smirnov test for various distributions against columns in the dataset. Only for numerical columns. Missing values are ignored.

Tip

Use the plot_distribution method to plot a column's distribution.

Parameters

distributions: str, sequence or None, default=None
Names of the distributions in scipy.stats to get the statistics on. If None, a selection of the most common ones is used.

columns: int, str, slice, sequence or None, default=None
Names, positions or dtypes of the columns in the dataset to perform the test on. If None, select all numerical columns.

Returns

pd.DataFrame
Statistic results with multiindex levels:

  • dist: Name of the distribution.
  • stat: Statistic results:
    • score: KS-test score.
    • p_value: Corresponding p-value.
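For readers unfamiliar with the test, the one-sample Kolmogorov-Smirnov score is the largest gap between the empirical distribution of a column and a reference distribution. The sketch below computes it with the standard library for a normal distribution; fitting the mean and standard deviation on the sample is an assumption, the sample values are made up, and the p-value is left out.

```python
import math
import statistics


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


def ks_statistic(sample, cdf):
    """Largest distance between the empirical CDF and the reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


sample = [2.1, 2.9, 3.2, 3.8, 4.0, 4.4, 5.1, 5.3, 6.0, 7.2]
mu, sigma = statistics.fmean(sample), statistics.stdev(sample)
score = ks_statistic(sample, lambda x: normal_cdf(x, mu, sigma))
```

A lower score means the column follows the reference distribution more closely.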



method evaluate(metric=None, dataset="test", threshold=0.5, sample_weight=None)[source]
Get all models' scores for the provided metrics.

Parameters

metric: str, func, scorer, sequence or None, default=None
Metric to calculate. If None, it returns an overview of the most common metrics per task.

dataset: str, default="test"
Data set on which to calculate the metric. Choose from: "train", "test" or "holdout".

threshold: float, default=0.5
Threshold between 0 and 1 to convert predicted probabilities to class labels. Only used when:

  • The task is binary classification.
  • The model has a predict_proba method.
  • The metric evaluates predicted target values.

sample_weight: sequence or None, default=None
Sample weights corresponding to y in dataset.

Returns

pd.DataFrame
Scores of the models.



method export_pipeline(model=None, memory=None, verbose=None)[source]
Export the pipeline to a sklearn-like object.

Optionally, you can add a model as final estimator. The returned pipeline is already fitted on the training set.

Info

The returned pipeline behaves similarly to sklearn's Pipeline, and additionally:

  • Accepts transformers that change the target column.
  • Accepts transformers that drop rows.
  • Accepts transformers that only are fitted on a subset of the provided dataset.
  • Always returns pandas objects.
  • Uses transformers that are only applied on the training set to fit the pipeline, not to make predictions.

Parameters

model: str, Model or None, default=None
Model for which to export the pipeline. If the model used automated feature scaling, the Scaler is added to the pipeline. If None, the pipeline in the current branch is exported.

memory: bool, str, Memory or None, default=None
Used to cache the fitted transformers of the pipeline.

  • If None or False: No caching is performed.
  • If True: A default temp directory is used.
  • If str: Path to the caching directory.
  • If Memory: Object with the joblib.Memory interface.

verbose: int or None, default=None
Verbosity level of the transformers in the pipeline. If None, it leaves them to their original verbosity. Note that this is not the pipeline's own verbose parameter. To change that, use the set_params method.

Returns

Pipeline
Sklearn-like Pipeline object with all transformers in the current branch.



method get_class_weight(dataset="train")[source]
Return class weights for a balanced dataset.

Statistically, the class weights re-balance the data set so that the sampled data set represents the target population as closely as possible. The returned weights are inversely proportional to the class frequencies in the selected data set.

Parameters

dataset: str, default="train"
Data set from which to get the weights. Choose from: "train", "test" or "dataset".

Returns

dict
Classes with the corresponding weights.
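Weights that are inversely proportional to the class frequencies can be computed as shown below. The normalization used here, n_samples / (n_classes * count), is one common convention and is an assumption; the exact scaling ATOM applies may differ.

```python
from collections import Counter


def balanced_class_weight(y):
    """Weights inversely proportional to class frequencies (assumed convention)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Six samples of class 0, three of class 1 and one of class 2
y = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weight(y)
```

With these weights, every class contributes the same total weight (weight times count), so rare classes count more per sample.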



method inverse_transform(X=None, y=None, verbose=None)[source]
Inversely transform new data through the pipeline.

Transformers that are only applied on the training set are skipped. The rest should all implement an inverse_transform method. If only X or only y is provided, it ignores transformers that require the other parameter. This can be used to transform only the target column.

Parameters

X: dataframe-like or None, default=None
Transformed feature set with shape=(n_samples, n_features). If None, X is ignored in the transformers.

y: int, str, dict, sequence or None, default=None
Target column corresponding to X.

  • If None: y is ignored in the transformers.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • Else: Array with shape=(n_samples,) to use as target.

verbose: int or None, default=None
Verbosity level for the transformers. If None, it uses the transformer's own verbosity.

Returns

pd.DataFrame
Original feature set. Only returned if provided.

pd.Series
Original target column. Only returned if provided.



method log(msg, level=0, severity="info")[source]
Print message and save to log file.

Parameters

msg: int, float or str
Message to save to the logger and print to stdout.

level: int, default=0
Minimum verbosity level to print the message.

severity: str, default="info"
Severity level of the message. Choose from: debug, info, warning, error, critical.



method merge(other, suffix="2")[source]
Merge another instance of the same class into this one.

Branches, models, metrics and attributes of the other instance are merged into this one. If there are branches and/or models with the same name, they are merged adding the suffix parameter to their name. The errors and missing attributes are extended with those of the other instance. It's only possible to merge two instances if they are initialized with the same dataset and trained with the same metric.

Parameters

other: Runner
Instance with which to merge. Should be of the same class as self.

suffix: str, default="2"
Conflicting branches and models are merged adding suffix to the end of their names.



method update_layout(dict1=None, overwrite=False, **kwargs)[source]
Update the properties of the plot's layout.

This recursively updates the structure of the original layout with the values in the input dict / keyword arguments.

Parameters

dict1: dict or None, default=None
Dictionary of properties to be updated.

overwrite: bool, default=False
If True, overwrite existing properties. If False, apply updates to existing properties recursively, preserving existing properties that are not specified in the update operation.

**kwargs
Keyword/value pair of properties to be updated.
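The difference between a recursive update and overwrite=True can be pictured with nested dictionaries. The function `merge_layout` below is a hypothetical stand-in written for this illustration, not the library's method, and the layout keys are examples only.

```python
import copy


def merge_layout(layout, dict1=None, overwrite=False, **kwargs):
    """Return a copy of layout updated with dict1 and keyword arguments."""
    updates = {**(dict1 or {}), **kwargs}
    result = copy.deepcopy(layout)
    for key, value in updates.items():
        nested = isinstance(value, dict) and isinstance(result.get(key), dict)
        if nested and not overwrite:
            # Recurse so that properties not named in the update are preserved
            result[key] = merge_layout(result[key], value)
        else:
            # Replace the existing property as a whole
            result[key] = copy.deepcopy(value)
    return result


layout = {"title": {"text": "Plot", "font": {"size": 24}}, "showlegend": True}
update = {"title": {"font": {"size": 30}}}

merged = merge_layout(layout, update)                       # keeps title.text
overwritten = merge_layout(layout, update, overwrite=True)  # drops title.text
```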



method report(dataset="dataset", n_rows=None, filename=None, **kwargs)[source]
Create an extensive profile analysis report of the data.

ATOM uses the pandas-profiling package for the analysis. The report is rendered directly in the notebook. The created ProfileReport instance can be accessed through the profile attribute.

Warning

This method can be slow for large datasets.

Parameters

dataset: str, default="dataset"
Data set to get the report from.

n_rows: int or None, default=None
Number of (randomly picked) rows to process. None to use all rows.

filename: str or None, default=None
Name to save the file with (as .html). None to not save anything.

**kwargs
Additional keyword arguments for the ProfileReport instance.



method reset()[source]
Reset the instance to its initial state.

Deletes all branches and models. The dataset is also reset to its form after initialization.



method reset_aesthetics()[source]
Reset the plot aesthetics to their default values.



method save(filename="auto", save_data=True)[source]
Save the instance to a pickle file.

Parameters

filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.

save_data: bool, default=True
Whether to save the dataset with the instance. This parameter is ignored if the method is not called from atom. If False, remember to add the data to ATOMLoader when loading the file.



method save_data(filename="auto", dataset="dataset")[source]
Save the data in the current branch to a .csv file.

Parameters

filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.

dataset: str, default="dataset"
Data set to save.



method shrink(obj2cat=True, int2uint=False, dense2sparse=False, columns=None)[source]
Converts the columns to the smallest possible matching dtype.

Parameters

obj2cat: bool, default=True
Whether to convert object to category. Only if the number of categories would be less than 30% of the length of the column.

int2uint: bool, default=False
Whether to convert int to uint (unsigned integer). Only if the values in the column are strictly positive.

dense2sparse: bool, default=False
Whether to convert all features to sparse format. The value that is compressed is the most frequent value in the column.

columns: int, str, slice, sequence or None, default=None
Names, positions or dtypes of the columns in the dataset to shrink. If None, transform all columns.
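The two conversion conditions stated for obj2cat and int2uint can be written as simple predicates. Both helpers are hypothetical and only restate the rules from the parameter descriptions; they do not perform any dtype conversion.

```python
def qualifies_for_obj2cat(values, max_ratio=0.3):
    """True if the number of categories is below 30% of the column's length."""
    return len(set(values)) < max_ratio * len(values)


def qualifies_for_int2uint(values):
    """True if all values are strictly positive integers."""
    return all(isinstance(v, int) and v > 0 for v in values)


few_categories = ["a", "b"] * 10          # 2 categories in 20 rows
many_categories = [str(i) for i in range(10)]  # 10 categories in 10 rows

obj_ok = qualifies_for_obj2cat(few_categories)
obj_rejected = qualifies_for_obj2cat(many_categories)
int_ok = qualifies_for_int2uint([1, 2, 3])
int_rejected = qualifies_for_int2uint([-1, 2, 3])
```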



method stacking(name="Stack", models=None, **kwargs)[source]
Add a Stacking model to the pipeline.

Parameters

name: str, default="Stack"
Name of the model. The name is always preceded by the model's acronym: stack.

models: slice, sequence or None, default=None
Models that feed the stacking estimator. If None, it selects all non-ensemble models trained on the current branch.

**kwargs
Additional keyword arguments for sklearn's stacking instance. The model's acronyms can be used for the final_estimator parameter.



method stats(_vb=-2)[source]
Print basic information about the dataset.

Tip

For classification tasks, the count and balance of classes is shown, followed by the ratio (between parentheses) of the class with respect to the rest of the classes in the same data set, i.e. the class with the fewest samples is followed by (1.0). This information can be used to quickly assess if the data set is unbalanced.

Parameters

_vb: int, default=-2
Internal parameter to always print if called by user.



method status()[source]
Get an overview of the branches and models.

This method prints the same information as the __repr__ and also saves it to the logger.



method transform(X=None, y=None, verbose=None)[source]
Transform new data through the pipeline.

Transformers that are only applied on the training set are skipped. If only X or only y is provided, it ignores transformers that require the other parameter. This can be of use to, for example, transform only the target column.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored in the transformers.

y: int, str, dict, sequence or None, default=None
Target column corresponding to X.

  • If None: y is ignored in the transformers.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • Else: Array with shape=(n_samples,) to use as target.

verbose: int or None, default=None
Verbosity level for the transformers. If None, it uses the transformer's own verbosity.

Returns
pd.DataFrame
Transformed feature set. Only returned if X is provided.

pd.Series
Transformed target column. Only returned if y is provided.



method voting(name="Vote", models=None, **kwargs)[source]
Add a Voting model to the pipeline.

Parameters
name: str, default="Vote"
Name of the model. The name is always prefixed with the model's acronym: vote.

models: slice, sequence or None, default=None
Models that feed the voting estimator. If None, it selects all non-ensemble models trained on the current branch.

**kwargs
Additional keyword arguments for sklearn's voting instance.



Data cleaning

The data cleaning methods can help you scale the data, handle missing values, categorical columns and outliers. All attributes of the data cleaning classes are attached to atom after running. Read more in the user guide.

Tip

Use the report method to examine the data and help you determine suitable parameters for the data cleaning methods.

clean: Applies standard data cleaning steps on the dataset.
discretize: Bin continuous data into intervals.
encode: Perform encoding of categorical features.
impute: Handle missing values in the dataset.
normalize: Transform the data to follow a Normal/Gaussian distribution.
prune: Prune outliers from the training set.
scale: Scale the data.


method clean(drop_types=None, strip_categorical=True, drop_duplicates=False, drop_missing_target=True, encode_target=True, **kwargs)[source]
Applies standard data cleaning steps on the dataset.

Use the parameters to choose which transformations to perform. The available steps are:

  • Drop columns with specific data types.
  • Strip white spaces from categorical features.
  • Drop duplicate rows.
  • Drop rows with missing values in the target column.
  • Encode the target column (can't be True for regression tasks).

See the Cleaner class for a description of the parameters.



method discretize(strategy="quantile", bins=5, labels=None, **kwargs)[source]
Bin continuous data into intervals.

For each feature, the bin edges are computed during fit and, together with the number of bins, they define the intervals. Ignores categorical columns.
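As a rough illustration of the quantile strategy, the sketch below computes the inner bin edges with the standard library and assigns each value to an interval. It is not ATOM's implementation; the helper name `quantile_bins` and the sample values are assumptions made for this example.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_bins(values, bins=5):
    """Compute the inner bin edges from quantiles and assign each value a bin index."""
    # For `bins` intervals there are `bins - 1` inner cut points.
    edges = quantiles(values, n=bins, method="inclusive")
    assigned = [bisect_right(edges, v) for v in values]
    return edges, assigned


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
edges, assigned = quantile_bins(values, bins=5)
print(edges)
print(assigned)
```

With quantile-based edges, every interval receives roughly the same number of samples, which is the main reason to prefer this strategy for skewed columns.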

See the Discretizer class for a description of the parameters.

Tip

Use the plot_distribution method to visualize a column's distribution and decide on the bins.



method encode(strategy="LeaveOneOut", max_onehot=10, ordinal=None, rare_to_value=None, value="rare", **kwargs)[source]
Perform encoding of categorical features.

The encoding type depends on the number of classes in the column:

  • If n_classes=2 or ordinal feature, use Ordinal-encoding.
  • If 2 < n_classes <= max_onehot, use OneHot-encoding.
  • If n_classes > max_onehot, use strategy-encoding.

Missing values are propagated to the output column. Unknown classes encountered during transforming are imputed according to the selected strategy. Rare classes can be replaced with a value to prevent excessively high cardinality.
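The selection rule listed above can be written down as a small decision function. This is a sketch of the documented rule only, not ATOM's code: the function name `choose_encoding` and the `is_ordinal` flag are hypothetical, and columns with fewer than two classes are not covered by the rule, so the sketch returns None for them.

```python
def choose_encoding(n_classes, max_onehot=10, is_ordinal=False, strategy="LeaveOneOut"):
    """Return the encoding type for a column, following the documented rule."""
    if is_ordinal or n_classes == 2:
        return "Ordinal"
    if 2 < n_classes <= max_onehot:
        return "OneHot"
    if n_classes > max_onehot:
        return strategy
    return None  # fewer than two classes: not covered by the rule


for n in (2, 5, 11):
    print(n, choose_encoding(n))
```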

See the Encoder class for a description of the parameters.

Note

This method only encodes the categorical features. It does not encode the target column! Use the clean method for that.

Tip

Use the categorical attribute for a list of the categorical features in the dataset.



method impute(strat_num="drop", strat_cat="drop", max_nan_rows=None, max_nan_cols=None, **kwargs)[source]
Handle missing values in the dataset.

Impute or remove missing values according to the selected strategy. Also removes rows and columns with too many missing values. Use the missing attribute to customize what are considered "missing values".

See the Imputer class for a description of the parameters.

Tip

Use the nans attribute to check the number of missing values per column.



method normalize(strategy="yeojohnson", **kwargs)[source]
Transform the data to follow a Normal/Gaussian distribution.

This transformation is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired. Missing values are disregarded in fit and maintained in transform. Ignores categorical columns.
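For reference, the Yeo-Johnson transform behind the default strategy can be evaluated with a short function for a fixed λ. In practice λ is typically estimated per column during fit; the sketch below only evaluates the formula, and the function name and tolerance are chosen for this example.

```python
import math


def yeo_johnson(x, lmbda, tol=1e-12):
    """Evaluate the Yeo-Johnson transform of a single value for a given lambda."""
    if x >= 0:
        if abs(lmbda) < tol:
            return math.log1p(x)
        return ((x + 1) ** lmbda - 1) / lmbda
    if abs(lmbda - 2) < tol:
        return -math.log1p(-x)
    return -(((-x + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)


# With lambda = 1 the transform is the identity for positive and negative values.
print(yeo_johnson(3.0, 1.0), yeo_johnson(-2.0, 1.0))
```

Unlike Box-Cox, this transform is defined for zero and negative values, so it can be applied without shifting the data first.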

See the Normalizer class for a description of the parameters.

Tip

Use the plot_distribution method to examine a column's distribution.



method prune(strategy="zscore", method="drop", max_sigma=3, include_target=False, **kwargs)[source]
Prune outliers from the training set.

Replace or remove outliers. The definition of an outlier depends on the selected strategy and can differ greatly from one strategy to another. Ignores categorical columns.
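To make the default settings concrete, here is a minimal sketch of z-score pruning with method="drop": a row is removed when any of its values lies more than max_sigma standard deviations from the column mean. The helper `prune_zscore` and the toy rows are assumptions for illustration and do not reproduce ATOM's Pruner.

```python
from statistics import mean, pstdev


def prune_zscore(rows, max_sigma=3):
    """Drop rows in which any column value has |z-score| > max_sigma."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    kept = []
    for row in rows:
        # Constant columns (zero standard deviation) cannot contain outliers.
        if all(sd == 0 or abs(v - mu) / sd <= max_sigma for v, (mu, sd) in zip(row, stats)):
            kept.append(row)
    return kept


# Toy training set: 19 ordinary rows and one extreme value in the first column.
train = [(float(i % 3), 5.0) for i in range(19)] + [(100.0, 5.0)]
kept = prune_zscore(train)
print(len(train), "->", len(kept))
```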

See the Pruner class for a description of the parameters.

Note

This transformation is only applied to the training set in order to maintain the original distribution of samples in the test set.

Tip

Use the outliers attribute to check the number of outliers per column.



method scale(strategy="standard", include_binary=False, **kwargs)[source]
Scale the data.

Apply one of sklearn's scalers. Categorical columns are ignored.

See the Scaler class for a description of the parameters.

Tip

Use the scaled attribute to check whether the dataset is scaled.



NLP

The Natural Language Processing (NLP) transformers help to convert raw text to meaningful numeric values, ready to be ingested by a model. All transformations are applied only on the column in the dataset called corpus. Read more in the user guide.

textclean: Applies standard text cleaning to the corpus.
textnormalize: Normalize the corpus.
tokenize: Tokenize the corpus.
vectorize: Vectorize the corpus.


method textclean(decode=True, lower_case=True, drop_email=True, regex_email=None, drop_url=True, regex_url=None, drop_html=True, regex_html=None, drop_emoji=True, regex_emoji=None, drop_number=True, regex_number=None, drop_punctuation=True, **kwargs)[source]
Applies standard text cleaning to the corpus.

Transformations include normalizing characters and dropping noise from the text (emails, HTML tags, URLs, etc.). The transformations are applied on the column named corpus, in the same order the parameters are presented. If there is no column with that name, an exception is raised.

See the TextCleaner class for a description of the parameters.



method textnormalize(stopwords=True, custom_stopwords=None, stem=False, lemmatize=True, **kwargs)[source]
Normalize the corpus.

Convert words to a more uniform standard. The transformations are applied on the column named corpus, in the same order the parameters are presented. If there is no column with that name, an exception is raised. If the provided documents are strings, words are separated by spaces.

See the TextNormalizer class for a description of the parameters.



method tokenize(bigram_freq=None, trigram_freq=None, quadgram_freq=None, **kwargs)[source]
Tokenize the corpus.

Convert documents into sequences of words. Additionally, create n-grams (represented by words united with underscores, e.g. "New_York") based on their frequency in the corpus. The transformations are applied on the column named corpus. If there is no column with that name, an exception is raised.
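The n-gram step can be pictured with a small sketch that joins frequent neighbouring words with an underscore. It assumes the frequency threshold is a minimum number of occurrences in the corpus, which may differ from how the Tokenizer class interprets bigram_freq; the function name and sample sentences are invented for this example.

```python
from collections import Counter


def merge_bigrams(docs, bigram_freq=2):
    """Join word pairs that occur at least `bigram_freq` times in the corpus."""
    counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in counts.items() if n >= bigram_freq}

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(f"{doc[i]}_{doc[i + 1]}")
                i += 2  # both words are consumed by the bigram
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


corpus = ["I love New York", "New York is big", "I love pizza"]
tokens = merge_bigrams([doc.split() for doc in corpus])
print(tokens)
```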

See the Tokenizer class for a description of the parameters.



method vectorize(strategy="bow", return_sparse=True, **kwargs)[source]
Vectorize the corpus.

Transform the corpus into meaningful vectors of numbers. The transformation is applied on the column named corpus. If there is no column with that name, an exception is raised.

If strategy="bow" or "tfidf", the transformed columns are named after the word they are embedding with the prefix corpus_. If strategy="hashing", the columns are named hash[N], where N stands for the n-th hashed column.
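The column naming for the bag-of-words case can be sketched in plain Python. The example below counts words per document and names each output column after its word with the corpus_ prefix; it is a simplified stand-in for illustration (whitespace tokenization, dense output), not the Vectorizer class itself.

```python
from collections import Counter


def bag_of_words(corpus):
    """Return (column_names, rows) with one count column per word in the corpus."""
    vocabulary = sorted({word for doc in corpus for word in doc.split()})
    columns = [f"corpus_{word}" for word in vocabulary]
    rows = []
    for doc in corpus:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocabulary])
    return columns, rows


columns, rows = bag_of_words(["red apple", "green apple apple"])
print(columns)
print(rows)
```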

See the Vectorizer class for a description of the parameters.



Feature engineering

To further pre-process the data, it's possible to extract features from datetime columns, create new non-linear features by transforming the existing ones, group similar features or, if the dataset is too large, remove features. Read more in the user guide.

feature_extraction: Extract features from datetime columns.
feature_generation: Generate new features.
feature_grouping: Extract statistics from similar features.
feature_selection: Reduce the number of features in the data.


method feature_extraction(features=['day', 'month', 'year'], fmt=None, encoding_type="ordinal", drop_columns=True, **kwargs)[source]
Extract features from datetime columns.

Create new features by extracting datetime elements (day, month, year, etc.) from the provided columns. Columns of dtype datetime64 are used as is. Categorical columns that can be successfully converted to a datetime format (less than 30% NaT values after conversion) are also used.
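The conversion rule for categorical columns can be sketched with the standard library. The helper below parses strings with a given format, gives up when 30% or more of the values fail to parse, and otherwise extracts the requested elements. The function name, the explicit format string and the use of None in place of NaT are assumptions for this example.

```python
from datetime import datetime


def extract_datetime_features(values, fmt="%Y-%m-%d", features=("day", "month", "year")):
    """Extract datetime elements, or return None if too many values fail to parse."""
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except (TypeError, ValueError):
            parsed.append(None)  # plays the role of NaT in this sketch

    # Only use the column if less than 30% of the values could not be converted.
    if parsed.count(None) / len(parsed) >= 0.3:
        return None
    return [None if d is None else {f: getattr(d, f) for f in features} for d in parsed]


dates = ["2023-01-15", "2024-12-31", "not a date", "2022-06-01"]
extracted = extract_datetime_features(dates)
print(extracted)
```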

See the FeatureExtractor class for a description of the parameters.



method feature_generation(strategy="dfs", n_features=None, operators=None, **kwargs)[source]
Generate new features.

Create new combinations of existing features to capture the non-linear relations between the original features.

See the FeatureGenerator class for a description of the parameters.



method feature_grouping(group, name=None, operators=None, drop_columns=True, **kwargs)[source]
Extract statistics from similar features.

Replace groups of features with related characteristics with new features that summarize statistical properties of the group. The statistical operators are calculated over every row of the group. The group names and features can be accessed through the groups method.
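The row-wise computation can be sketched as follows. Each row is a plain dict, the operators are applied to the values of the grouped columns within that row, and the grouped columns are dropped afterwards. The output column names of the form "operator(name)", the operator set and the helper name are assumptions made for this illustration.

```python
from statistics import mean, median

OPERATORS = {"min": min, "max": max, "mean": mean, "median": median}


def group_features(rows, group, name, operators=("min", "max", "mean"), drop_columns=True):
    """Replace the columns in `group` with row-wise statistics of that group."""
    result = []
    for row in rows:
        values = [row[col] for col in group]
        new_row = {k: v for k, v in row.items() if not (drop_columns and k in group)}
        for op in operators:
            new_row[f"{op}({name})"] = OPERATORS[op](values)
        result.append(new_row)
    return result


data = [{"a": 1, "b": 3, "c": 10}, {"a": 4, "b": 8, "c": 20}]
grouped = group_features(data, group=["a", "b"], name="g")
print(grouped)
```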

See the FeatureGrouper class for a description of the parameters.



method feature_selection(strategy=None, solver=None, n_features=None, min_repeated=2, max_repeated=1.0, max_correlation=1.0, **kwargs)[source]
Reduce the number of features in the data.

Apply feature selection or dimensionality reduction, either to improve the estimators' accuracy or to boost their performance on very high-dimensional datasets. Additionally, remove multicollinear and low variance features.

See the FeatureSelector class for a description of the parameters.

Note

  • When strategy="univariate" and solver=None, f_classif or f_regression is used as default solver.
  • When strategy is "sfs", "rfecv" or any of the advanced strategies and no scoring is specified, atom's metric (if it exists) is used as scoring.



Training

The training methods are where the models are fitted to the data and their performance is evaluated against a selected metric. There are three methods to call the three different training approaches. Read more in the user guide.

run: Train and evaluate the models in a direct fashion.
successive_halving: Fit the models in a successive halving fashion.
train_sizing: Train and evaluate the models in a train sizing fashion.


method run(models=None, metric=None, est_params=None, n_trials=0, ht_params=None, n_bootstrap=0, **kwargs)[source]
Train and evaluate the models in a direct fashion.

Contrary to successive_halving and train_sizing, the direct approach only iterates once over the models, using the full dataset.

The following steps are applied to every model:

  1. Apply hyperparameter tuning (optional).
  2. Fit the model on the training set using the best combination of hyperparameters found.
  3. Evaluate the model on the test set.
  4. Train the estimator on various bootstrapped samples of the training set and evaluate again on the test set (optional).

See the DirectClassifier or DirectRegressor class for a description of the parameters.
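The optional bootstrapping step (step 4) can be pictured with a toy model. The sketch below resamples the training targets with replacement, "fits" a constant mean predictor on each sample and scores it on the same test set. The predictor, the mean absolute error metric and the function name are assumptions made for illustration and do not reflect ATOM's internals.

```python
import random
from statistics import mean


def bootstrap_scores(y_train, y_test, n_bootstrap=5, seed=1):
    """Score a mean predictor fitted on bootstrapped samples of the training set."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_bootstrap):
        # Sample with replacement, keeping the original training set size.
        sample = rng.choices(y_train, k=len(y_train))
        prediction = mean(sample)
        scores.append(mean([abs(y - prediction) for y in y_test]))
    return scores


scores = bootstrap_scores([1.0, 2.0, 3.0, 4.0, 5.0], [2.5, 3.5])
print(scores)
```

The spread of the resulting scores gives an indication of how sensitive the model is to the composition of the training set.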



method successive_halving(models, metric=None, skip_runs=0, est_params=None, n_trials=0, ht_params=None, n_bootstrap=0, **kwargs)[source]
Fit the models in a successive halving fashion.

The successive halving technique is a bandit-based algorithm that fits N models to 1/N of the data. The best half are selected to go to the next iteration where the process is repeated. This continues until only one model remains, which is fitted on the complete dataset. Beware that a model's performance can depend greatly on the amount of data on which it is trained. For this reason, it is recommended to only use this technique with similar models, e.g. only using tree-based models.
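The iteration scheme described above can be written out as a schedule of (number of models, fraction of the data) pairs. This sketch assumes the number of models is halved with floor division when it is odd, and it ignores the skip_runs parameter; the function name is hypothetical.

```python
def halving_schedule(n_models):
    """Return [(models_in_run, fraction_of_data), ...] for successive halving."""
    schedule = []
    while n_models >= 1:
        # N models are each fitted on 1/N of the data.
        schedule.append((n_models, 1 / n_models))
        if n_models == 1:
            break
        n_models //= 2  # only the best half continues to the next run
    return schedule


for models, fraction in halving_schedule(8):
    print(f"{models} models on {fraction:.3f} of the data")
```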

The following steps are applied to every model (per iteration):

  1. Apply hyperparameter tuning (optional).
  2. Fit the model on the training set using the best combination of hyperparameters found.
  3. Evaluate the model on the test set.
  4. Train the estimator on various bootstrapped samples of the training set and evaluate again on the test set (optional).

See the SuccessiveHalvingClassifier or SuccessiveHalvingRegressor class for a description of the parameters.



method train_sizing(models, metric=None, train_sizes=5, est_params=None, n_trials=0, ht_params=None, n_bootstrap=0, **kwargs)[source]
Train and evaluate the models in a train sizing fashion.

When training models, there is usually a trade-off between model performance and computation time, which is regulated by the number of samples in the training set. This method can be used to gain insight into this trade-off and help determine the optimal size of the training set. The models are fitted multiple times, each time on a larger number of training samples.
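As an illustration of the increasing training set sizes, the sketch below turns an integer train_sizes into a list of sample counts. It assumes that an integer N stands for N equally spaced fractions from 1/N up to the full training set; see the TrainSizingRegressor class for the exact interpretation. The function name is made up for this example.

```python
def train_size_counts(n_train, train_sizes=5):
    """Return the number of training samples used in each run."""
    # Integer arithmetic avoids floating point rounding of the fractions.
    return [n_train * (i + 1) // train_sizes for i in range(train_sizes)]


counts = train_size_counts(1000, train_sizes=5)
print(counts)
```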

The following steps are applied to every model (per iteration):

  1. Apply hyperparameter tuning (optional).
  2. Fit the model on the training set using the best combination of hyperparameters found.
  3. Evaluate the model on the test set.
  4. Train the estimator on various bootstrapped samples of the training set and evaluate again on the test set (optional).

See the TrainSizingClassifier or TrainSizingRegressor class for a description of the parameters.