TrainSizingRegressor

class atom.training.TrainSizingRegressor(models=None, metric=None, train_sizes=5, est_params=None, n_trials=0, ht_params=None, n_bootstrap=0, parallel=False, errors="skip", n_jobs=1, device="cpu", engine="sklearn", backend="loky", verbose=0, warnings=False, logger=None, experiment=None, random_state=None)[source]

Train and evaluate the models in a train sizing fashion.

The following steps are applied to every model (per iteration):

Apply hyperparameter tuning (optional).
Fit the model on the training set using the best combination of hyperparameters found.
Evaluate the model on the test set.
Train the estimator on various bootstrapped samples of the training set and evaluate again on the test set (optional).

Parameters

models: str, estimator or sequence, default=None

Models to fit to the data. Allowed inputs are: an acronym from any of the predefined models, an ATOMModel or a custom predictor as class or instance. If None, all the predefined models are used.

metric: str, func, scorer, sequence or None, default=None

Metric on which to fit the models. Choose from any of sklearn's scorers, a function with signature

function(y_true, y_pred)
-> score

, a scorer object or a sequence of these. If None, a default metric is selected for every task:

"f1" for binary classification
"f1_weighted" for multiclass(-multioutput) classification
"average_precision" for multilabel classification
"r2" for regression or multioutput regression

train_sizes: int or sequence, default=5

Sequence of training set sizes used to run the trainings.

If int: Number of equally distributed splits, i.e. for a value N, it's equal to np.linspace(1.0/N, 1.0, N).
If sequence: Fraction of the training set when <=1, else total number of samples.

n_trials: int or sequence, default=0

Maximum number of iterations for the hyperparameter tuning. If 0, skip the tuning and fit the model on its default parameters. If sequence, the n-th value applies to the n-th model.

est_params: dict or None, default=None

Additional parameters for the models. See their corresponding documentation for the available options. For multiple models, use the acronyms as key (or 'all' for all models) and a dict of the parameters as value. Add _fit to the parameter's name to pass it to the estimator's fit method instead of the constructor.

ht_params: dict or None, default=None

Additional parameters for the hyperparameter tuning. If None, it uses the same parameters as the first run. Can include:

cv: int, dict or sequence, default=1
Number of folds for the cross-validation. If 1, the training set is randomly split in a subtrain and validation set.
plot: bool, dict or sequence, default=False
Whether to plot the optimization's progress as it runs. Creates a canvas with two plots: the first plot shows the score of every trial and the second shows the distance between the last consecutive steps. See the plot_trials method.
distributions: dict, sequence or None, default=None
Custom hyperparameter distributions. If None, it uses the model's predefined distributions. Read more in the user guide.
tags: dict, sequence or None, default=None
Custom tags for the model's trial and mlflow run.
**kwargs
Additional Keyword arguments for the constructor of the study class or the optimize method.

n_bootstrap: int or sequence, default=0

Number of data sets to use for bootstrapping. If 0, no bootstrapping is performed. If sequence, the n-th value applies to the n-th model.

parallel: bool, default=False

Whether to train the models in a parallel or sequential fashion. Using parallel=True turns off the verbosity of the models during training. Note that many models also have build-in parallelizations (often when the estimator has the n_jobs parameter).

errors: str, default="skip"

How to handle exceptions encountered during model training. Choose from:

"raise": Raise any encountered exception.
"skip": Skip a failed model. This model is not accessible after training.
"keep": Keep the model in its state at failure. Note that this model can break down many other methods after training. This option is useful to be able to rerun hyperparameter optimization after failure without losing previous succesfull trials.

n_jobs: int, default=1

Number of cores to use for parallel processing.

If >0: Number of cores to use.
If -1: Use all available cores.
If <-1: Use number of cores - 1 + n_jobs.

device: str, default="cpu"

Device on which to train the estimators. Use any string that follows the SYCL_DEVICE_FILTER filter selector, e.g. device="gpu" to use the GPU. Read more in the user guide.

engine: str, default="sklearn"

Execution engine to use for the estimators. Refer to the user guide for an explanation regarding every choice. Choose from:

"sklearn" (only if device="cpu")
"sklearnex"
"cuml" (only if device="gpu")

verbose: int, default=0

Verbosity level of the class. Choose from:

0 to not print anything.
1 to print basic information.
2 to print detailed information.

warnings: bool or str, default=False

If True: Default warning action (equal to "default").
If False: Suppress all warnings (equal to "ignore").
If str: One of python's warnings filters.

Changing this parameter affects the PYTHONWARNINGS environment. ATOM can't manage warnings that go from C/C++ code to stdout.

logger: str, Logger or None, default=None

If None: Doesn't save a logging file.
If str: Name of the log file. Use "auto" for automatic name.
Else: Python logging.Logger instance.

experiment: str or None, default=None

Name of the mlflow experiment to use for tracking. If None, no mlflow tracking is performed.

random_state: int or None, default=None

Seed used by the random number generator. If None, the random number generator is the RandomState used by np.random.

Example

>>> from atom.training import TrainSizingRegressor
>>> from sklearn.datasets import load_digits

>>> X, y = load_digits(return_X_y=True, as_frame=True)
>>> train, test = train_test_split(
...     X.merge(y.to_frame(), left_index=True, right_index=True),
...     test_size=0.3,
... )

>>> runner = TrainSizingRegressor(models="OLS", metric="r2", verbose=2)
>>> runner.run(train, test)

Training ========================= >>
Metric: r2


Run: 0 =========================== >>
Models: OLS02
Size of training set: 79 (20%)
Size of test set: 171


Results for OrdinaryLeastSquares:
Fit ---------------------------------------------
Train evaluation --> r2: 0.8554
Test evaluation --> r2: 0.4273
Time elapsed: 0.008s
-------------------------------------------------
Total time: 0.008s


Final results ==================== >>
Total time: 0.107s
-------------------------------------
OrdinaryLeastSquares --> r2: 0.4273 ~


Run: 1 =========================== >>
Models: OLS04
Size of training set: 159 (40%)
Size of test set: 171


Results for OrdinaryLeastSquares:
Fit ---------------------------------------------
Train evaluation --> r2: 0.7987
Test evaluation --> r2: 0.653
Time elapsed: 0.008s
-------------------------------------------------
Total time: 0.008s


Final results ==================== >>
Total time: 0.129s
-------------------------------------
OrdinaryLeastSquares --> r2: 0.653


Run: 2 =========================== >>
Models: OLS06
Size of training set: 238 (60%)
Size of test set: 171


Results for OrdinaryLeastSquares:
Fit ---------------------------------------------
Train evaluation --> r2: 0.7828
Test evaluation --> r2: 0.7161
Time elapsed: 0.008s
-------------------------------------------------
Total time: 0.008s


Final results ==================== >>
Total time: 0.156s
-------------------------------------
OrdinaryLeastSquares --> r2: 0.7161


Run: 3 =========================== >>
Models: OLS08
Size of training set: 318 (80%)
Size of test set: 171


Results for OrdinaryLeastSquares:
Fit ---------------------------------------------
Train evaluation --> r2: 0.7866
Test evaluation --> r2: 0.7306
Time elapsed: 0.009s
-------------------------------------------------
Total time: 0.009s


Final results ==================== >>
Total time: 0.187s
-------------------------------------
OrdinaryLeastSquares --> r2: 0.7306


Run: 4 =========================== >>
Models: OLS10
Size of training set: 398 (100%)
Size of test set: 171


Results for OrdinaryLeastSquares:
Fit ---------------------------------------------
Train evaluation --> r2: 0.7798
Test evaluation --> r2: 0.7394
Time elapsed: 0.009s
-------------------------------------------------
Total time: 0.009s


Final results ==================== >>
Total time: 0.226s
-------------------------------------
OrdinaryLeastSquares --> r2: 0.7394

>>> # Analyze the results
>>> runner.evaluate()

       neg_mean_absolute_error  ...  neg_root_mean_squared_error
OLS02                  -0.2766  ...                      -0.3650
OLS04                  -0.2053  ...                      -0.2841
OLS06                  -0.1957  ...                      -0.2570
OLS08                  -0.1928  ...                      -0.2504
OLS10                  -0.1933  ...                      -0.2463

[5 rows x 6 columns]

Attributes

Data attributes

The data attributes are used to access the dataset and its properties. Updating the dataset will automatically update the response of these attributes accordingly.

Attributes

dataset: dataframe

Complete data set.

train: dataframe

Training set.

test: dataframe

Test set.

X: dataframe

Feature set.

y: series | dataframe

Target column(s).

X_train: dataframe

Features of the training set.

y_train: series | dataframe

Target column(s) of the training set.

X_test: dataframe

Features of the test set.

y_test: series | dataframe

Target column(s) of the test set.

shape: tuple[int, int]

Shape of the dataset (n_rows, n_columns).

columns: series

Name of all the columns.

n_columns: int

Number of columns.

features: series

Name of the features.

n_features: int

Number of features.

target: str | list[str]

Name of the target column(s).

Utility attributes

The utility attributes are used to access information about the models in the instance after training.

Attributes

models: str | list[str] | None

Name of the model(s).

metric: str | list[str] | None

Name of the metric(s).

winners: list[model]

Models ordered by performance.

Performance is measured as the highest score on the model's score_bootstrap or score_test attributes, checked in that order. For multi-metric runs, only the main metric is compared.

winner: model

Best performing model.

Performance is measured as the highest score on the model's score_bootstrap or score_test attributes, checked in that order. For multi-metric runs, only the main metric is compared.

results: pd.DataFrame

Overview of the training results.

All durations are in seconds. Columns include:

score_ht: Score obtained by the hyperparameter tuning.
time_ht: Duration of the hyperparameter tuning.
score_train: Metric score on the train set.
score_test: Metric score on the test set.
time_fit: Duration of the model fitting on the train set.
score_bootstrap: Mean score on the bootstrapped samples.
time_bootstrap: Duration of the bootstrapping.
time: Total duration of the model run.

Tracking attributes

The tracking attributes are used to customize what elements of the experiment are tracked. Read more in the user guide.

Attributes

log_ht: bool

Whether to track every trial of the hyperparameter tuning.

log_model: bool

Whether to save the model's estimator after fitting.

log_plots: bool

Whether to save plots as artifacts.

log_data: bool

Whether to save the train and test sets.

log_pipeline: bool

Whether to save the model's pipeline.

Plot attributes

The plot attributes are used to customize the plot's aesthetics. Read more in the user guide.

Attributes

palette: str | SEQUENCE

Color palette.

Specify one of plotly's built-in palettes or create a custom one, e.g. atom.palette = ["red", "green", "blue"].

title_fontsize: int

Fontsize for the plot's title.

label_fontsize: int

Fontsize for the labels, legend and hover information.

tick_fontsize: int

Fontsize for the ticks along the plot's axes.

line_width: int

Width of the line plots.

marker_size: int

Size of the markers.

Methods

Next to the plotting methods, the class contains a variety of methods to handle the data, run the training, and manage the pipeline.

available_models	Give an overview of the available predefined models.
canvas	Create a figure with multiple plots.
clear	Reset attributes and clear cache from all models.
delete	Delete models.
evaluate	Get all models' scores for the provided metrics.
export_pipeline	Export the pipeline to a sklearn-like object.
get_class_weight	Return class weights for a balanced data set.
get_params	Get parameters for this estimator.
log	Print message and save to log file.
merge	Merge another instance of the same class into this one.
update_layout	Update the properties of the plot's layout.
reset_aesthetics	Reset the plot aesthetics to their default values.
run	Train and evaluate the models.
save	Save the instance to a pickle file.
set_params	Set the parameters of this estimator.
stacking	Add a Stacking model to the pipeline.
voting	Add a Voting model to the pipeline.

method available_models()[source]

Give an overview of the available predefined models.

Returns

pd.DataFrame

Information about the available predefined models. Columns include:

acronym: Model's acronym (used to call the model).
model: Name of the model's class.
estimator: The model's underlying estimator.
module: The estimator's module.
needs_scaling: Whether the model requires feature scaling.
accepts_sparse: Whether the model accepts sparse matrices.
native_multioutput: Whether the model has native support for multioutput tasks.
has_validation: Whether the model has in-training validation.
supports_engines: List of engines supported by the model.

method canvas(rows=1, cols=2, horizontal_spacing=0.05, vertical_spacing=0.07, title=None, legend="out", figsize=None, filename=None, display=True)[source]

Create a figure with multiple plots.

This @contextmanager allows you to draw many plots in one figure. The default option is to add two plots side by side. See the user guide for an example.

Parameters

rows: int, default=1

Number of plots in length.

cols: int, default=2

Number of plots in width.

horizontal_spacing: float, default=0.05

Space between subplot rows in normalized plot coordinates. The spacing is relative to the figure's size.

vertical_spacing: float, default=0.07

Space between subplot cols in normalized plot coordinates. The spacing is relative to the figure's size.

title: str, dict or None, default=None

Title for the plot.

If None, no title is shown.
If str, text for the title.
If dict, title configuration.

legend: bool, str or dict, default="out"

Legend for the plot. See the user guide for an extended description of the choices.

If None: No legend is shown.
If str: Location where to show the legend.
If dict: Legend configuration.

figsize: tuple or None, default=None

Figure's size in pixels, format as (x, y). If None, it adapts the size to the number of plots in the canvas.

filename: str or None, default=None

Save the plot using this name. Use "auto" for automatic naming. The type of the file depends on the provided name (.html, .png, .pdf, etc...). If filename has no file type, the plot is saved as html. If None, the plot is not saved.

display: bool, default=True

Whether to render the plot.

Yields

go.Figure

Plot object.

method clear()[source]

Reset attributes and clear cache from all models.

Reset certain model attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the instance. The affected attributes are:

method delete(models=None)[source]

Delete models.

If all models are removed, the metric is reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. Deleted models are not removed from any active mlflow experiment.

Parameters

models: int, str, slice, Model, sequence or None, default=None

Models to delete. If None, all models are deleted.

method evaluate(metric=None, dataset="test", threshold=0.5, sample_weight=None)[source]

Get all models' scores for the provided metrics.

Parameters

metric: str, func, scorer, sequence or None, default=None

Metric to calculate. If None, it returns an overview of the most common metrics per task.

dataset: str, default="test"

Data set on which to calculate the metric. Choose from: "train", "test" or "holdout".

threshold: float or sequence, default=0.5

Threshold between 0 and 1 to convert predicted probabilities to class labels. Only used when:

The task is binary or multilabel classification.
The model has a predict_proba method.
The metric evaluates predicted probabilities.

For multilabel classification tasks, it's possible to provide a sequence of thresholds (one per target column). The same threshold per target column is applied to all models.

sample_weight: sequence or None, default=None

Sample weights corresponding to y in dataset.

Returns

pd.DataFrame

Scores of the models.

method export_pipeline(model=None, memory=None, verbose=None)[source]

Export the pipeline to a sklearn-like object.

Optionally, you can add a model as final estimator. The returned pipeline is already fitted on the training set.

Info

The returned pipeline behaves similarly to sklearn's Pipeline, and additionally:

Accepts transformers that change the target column.
Accepts transformers that drop rows.
Accepts transformers that only are fitted on a subset of the provided dataset.
Always returns pandas objects.
Uses transformers that are only applied on the training set to fit the pipeline, not to make predictions.

Parameters

model: str, Model or None, default=None

Model for which to export the pipeline. If the model used automated feature scaling, the Scaler is added to the pipeline. If None, the pipeline in the current branch is exported.

memory: bool, str, Memory or None, default=None

Used to cache the fitted transformers of the pipeline. - If None or False: No caching is performed. - If True: A default temp directory is used. - If str: Path to the caching directory. - If Memory: Object with the joblib.Memory interface.

verbose: int or None, default=None

Verbosity level of the transformers in the pipeline. If None, it leaves them to their original verbosity. Note that this is not the pipeline's own verbose parameter. To change that, use the set_params method.

Returns

Pipeline

Current branch as a sklearn-like Pipeline object.

method get_class_weight(dataset="train")[source]

Return class weights for a balanced data set.

Statistically, the class weights re-balance the data set so that the sampled data set represents the target population as closely as possible. The returned weights are inversely proportional to the class frequencies in the selected data set.

Parameters	dataset: str, default="train" Data set from which to get the weights. Choose from: "train", "test", "dataset".
Returns	dict Classes with the corresponding weights. A dict of dicts is returned for multioutput tasks.

method get_params(deep=True)[source]

Get parameters for this estimator.

Parameters	deep : bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns	params : dict Parameter names mapped to their values.

method log(msg, level=0, severity="info")[source]

Print message and save to log file.

Parameters

msg: int, float or str

Message to save to the logger and print to stdout.

level: int, default=0

Minimum verbosity level to print the message.

severity: str, default="info"

Severity level of the message. Choose from: debug, info, warning, error, critical.

method merge(other, suffix="2")[source]

Merge another instance of the same class into this one.

Branches, models, metrics and attributes of the other instance are merged into this one. If there are branches and/or models with the same name, they are merged adding the suffix parameter to their name. The errors and missing attributes are extended with those of the other instance. It's only possible to merge two instances if they are initialized with the same dataset and trained with the same metric.

Parameters

other: Runner

Instance with which to merge. Should be of the same class as self.

suffix: str, default="2"

Conflicting branches and models are merged adding suffix to the end of their names.

method update_layout(dict1=None, overwrite=False, **kwargs)[source]

Update the properties of the plot's layout.

This recursively updates the structure of the original layout with the values in the input dict / keyword arguments.

Parameters

dict1: dict or None, default=None

Dictionary of properties to be updated.

overwrite: bool, default=False

If True, overwrite existing properties. If False, apply updates to existing properties recursively, preserving existing properties that are not specified in the update operation.

**kwargs

Keyword/value pair of properties to be updated.

method reset_aesthetics()[source]

Reset the plot aesthetics to their default values.

method run(*arrays)[source]

Train and evaluate the models.

Parameters	**params : dict Estimator parameters.
Returns	self : estimator instance Estimator instance.