
SuccessiveHalvingRegressor


class atom.training.SuccessiveHalvingRegressor(models=None, metric=None, skip_runs=0, est_params=None, n_trials=0, ht_params=None, n_bootstrap=0, parallel=False, errors="skip", n_jobs=1, device="cpu", engine=None, backend="loky", memory=False, verbose=0, warnings=False, logger=None, experiment=None, random_state=None)[source]

Train and evaluate the models in a successive halving fashion.

The following steps are applied to every model (per iteration):

  1. Apply hyperparameter tuning (optional).
  2. Fit the model on the training set using the best combination of hyperparameters found.
  3. Evaluate the model on the test set.
  4. Train the estimator on various bootstrapped samples of the training set and evaluate again on the test set (optional).
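
As an illustration of steps 2 to 4, the following is a minimal, library-free sketch. It assumes a one-feature least-squares model on synthetic data; the helper names `fit_line` and `r2_score` are hypothetical and not part of ATOM, and step 1 (hyperparameter tuning) is skipped.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r2_score(ys, preds):
    """Coefficient of determination of the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Synthetic data (illustrative only): y = 2 + 3x + noise
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(300)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]
x_train, y_train = xs[:200], ys[:200]
x_test, y_test = xs[200:], ys[200:]

# Step 2: fit the model on the training set
a, b = fit_line(x_train, y_train)

# Step 3: evaluate the model on the test set
test_score = r2_score(y_test, [a + b * x for x in x_test])

# Step 4: refit on bootstrapped samples of the training set
# (drawn with replacement) and evaluate each fit on the same test set
n_bootstrap = 5
bootstrap_scores = []
for _ in range(n_bootstrap):
    idx = [rng.randrange(len(x_train)) for _ in range(len(x_train))]
    a_i, b_i = fit_line([x_train[i] for i in idx], [y_train[i] for i in idx])
    bootstrap_scores.append(r2_score(y_test, [a_i + b_i * x for x in x_test]))

bootstrap_mean = sum(bootstrap_scores) / n_bootstrap
```

The mean over the bootstrapped fits gives a less sample-dependent estimate than the single test score.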

Parameters models: str, estimator or sequence, default=None
Models to fit to the data. Allowed inputs are: an acronym from any of the predefined models, an ATOMModel or a custom predictor as class or instance. If None, all the predefined models are used.

metric: str, func, scorer, sequence or None, default=None
Metric on which to fit the models. Choose from any of sklearn's scorers, a function with signature function(y_true, y_pred, **kwargs) -> score, a scorer object or a sequence of these. If None, the default metric r2 is selected.

skip_runs: int, default=0
Skip the last skip_runs runs of the successive halving.
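
The effect of skip_runs on the halving schedule can be sketched as follows. This is an assumption-based illustration, not ATOM's implementation: it assumes each run trains every remaining model on 1/n of the training set (n being the number of remaining models) and then keeps the best half, rounded down, which matches the two-model example on this page. The function name `halving_schedule` is hypothetical.

```python
def halving_schedule(n_models, skip_runs=0):
    """Return a list of (models_in_run, training_fraction) per run."""
    schedule = []
    remaining = n_models
    # Stop early enough that the last `skip_runs` runs are never executed
    while remaining > 2**skip_runs - 1:
        schedule.append((remaining, 1 / remaining))
        remaining //= 2  # keep the best half for the next run
    return schedule


full = halving_schedule(4)                  # three runs: 4, 2 and 1 model(s)
shortened = halving_schedule(4, skip_runs=1)  # the final 1-model run is skipped
```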

n_trials: int, dict or sequence, default=0
Maximum number of iterations for the hyperparameter tuning. If 0, skip the tuning and fit the model on its default parameters. If sequence, the n-th value applies to the n-th model.

est_params: dict or None, default=None
Additional parameters for the models. See their corresponding documentation for the available options. For multiple models, use the acronyms as key (or 'all' for all models) and a dict of the parameters as value. Add _fit to the parameter's name to pass it to the estimator's fit method instead of the constructor.
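
A sketch of how such a dictionary could be resolved for a single model is shown below. The helper `params_for_model` is hypothetical, and the precedence between 'all' and a model-specific key (model-specific wins here) is an assumption rather than documented behavior.

```python
def params_for_model(est_params, acronym):
    """Split an est_params-style dict into constructor and fit parameters."""
    # Assumption: when any value is a dict, the keys are acronyms or "all"
    if any(isinstance(v, dict) for v in est_params.values()):
        selected = {**est_params.get("all", {}), **est_params.get(acronym, {})}
    else:
        selected = dict(est_params)

    suffix = "_fit"
    init_params = {k: v for k, v in selected.items() if not k.endswith(suffix)}
    # Strip the suffix so the name matches the fit method's argument
    fit_params = {
        k[: -len(suffix)]: v for k, v in selected.items() if k.endswith(suffix)
    }
    return init_params, fit_params


est_params = {
    "all": {"n_jobs": 1},
    "RF": {"n_estimators": 50, "sample_weight_fit": [1, 2]},
}
rf_init, rf_fit = params_for_model(est_params, "RF")
ols_init, ols_fit = params_for_model(est_params, "OLS")
```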

ht_params: dict or None, default=None
Additional parameters for the hyperparameter tuning. If None, it uses the same parameters as the first run. Can include:

  • cv: int, cv-generator, dict or sequence, default=1
    Cross-validation object or number of splits. If 1, the data is randomly split into a subtrain and validation set.
  • plot: bool, dict or sequence, default=False
    Whether to plot the optimization's progress as it runs. Creates a canvas with two plots: the first plot shows the score of every trial and the second shows the distance between the last consecutive steps. See the plot_trials method.
  • distributions: dict, sequence or None, default=None
    Custom hyperparameter distributions. If None, it uses the model's predefined distributions. Read more in the user guide.
  • tags: dict, sequence or None, default=None
    Custom tags for the model's trial and mlflow run.
  • **kwargs
    Additional keyword arguments for the constructor of the study class or the optimize method.

n_bootstrap: int or sequence, default=0
Number of data sets to use for bootstrapping. If 0, no bootstrapping is performed. If sequence, the n-th value applies to the n-th model.

parallel: bool, default=False
Whether to train the models in a parallel or sequential fashion. Using parallel=True turns off the verbosity of the models during training. Note that many models also have built-in parallelization (often when the estimator has the n_jobs parameter).

errors: str, default="skip"
How to handle exceptions encountered during model training. Choose from:

  • "raise": Raise any encountered exception.
  • "skip": Skip a failed model. This model is not accessible after training.
  • "keep": Keep the model in its state at failure. Note that this model can break down many other methods after training. This option is useful to be able to rerun hyperparameter optimization after failure without losing previous successful trials.
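
The three options can be pictured with the simplified training loop below. It is a sketch under stated assumptions: `train_all` and the zero-argument training callables are hypothetical, and a failed model kept under "keep" is represented by a `None` placeholder instead of a partially trained object.

```python
def train_all(models, errors="skip"):
    """Train every model; `models` maps a name to a training callable."""
    trained, failed = {}, {}
    for name, train in models.items():
        try:
            trained[name] = train()
        except Exception as exc:
            if errors == "raise":
                raise  # propagate the first exception
            failed[name] = exc
            if errors == "keep":
                # Keep the model accessible despite the failure
                trained[name] = None
            # With "skip", the model is simply left out of `trained`
    return trained, failed


def ok():
    return "fitted"


def broken():
    raise ValueError("training failed")


skipped, skip_failures = train_all({"OLS": ok, "RF": broken}, errors="skip")
kept, keep_failures = train_all({"OLS": ok, "RF": broken}, errors="keep")
```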

n_jobs: int, default=1
Number of cores to use for parallel processing.

  • If >0: Number of cores to use.
  • If -1: Use all available cores.
  • If <-1: Use number of cores + 1 + n_jobs (e.g., -2 uses all cores but one).

device: str, default="cpu"
Device on which to run the estimators. Use any string that follows the SYCL_DEVICE_FILTER filter selector, e.g. device="gpu" to use the GPU. Read more in the user guide.

engine: str, dict or None, default=None
Execution engine to use for data and estimators. The value should be one of the possible values to change one of the two engines, or a dictionary with keys data and estimator, with their corresponding choice as values to change both engines. If None, the default values are used. Choose from:

  • "data":

    • "pandas" (default)
    • "pyarrow"
    • "modin"
  • "estimator":

    • "sklearn" (default)
    • "sklearnex"
    • "cuml"

backend: str, default="loky"
Parallelization backend. Read more in the user guide. Choose from:

  • "loky": Single-node, process-based parallelism.
  • "multiprocessing": Legacy single-node, process-based parallelism. Less robust than loky.
  • "threading": Single-node, thread-based parallelism.
  • "ray": Multi-node, process-based parallelism.
  • "dask": Multi-node, process-based parallelism.

memory: bool, str, Path or Memory, default=False
Enables caching for memory optimization. Read more in the user guide.

  • If False: No caching is performed.
  • If True: A default temp directory is used.
  • If str: Path to the caching directory.
  • If Path: A pathlib.Path to the caching directory.
  • If Memory: Object with the joblib.Memory interface.

verbose: int, default=0
Verbosity level of the class. Choose from:

  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.

warnings: bool or str, default=False

  • If True: Default warning action (equal to "once").
  • If False: Suppress all warnings (equal to "ignore").
  • If str: One of python's warnings filters.

Changing this parameter affects the PYTHONWARNINGS environment variable. ATOM can't manage warnings that are written from C/C++ code directly to stdout.
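
The mapping from the boolean options to Python's warning filter actions can be sketched with the standard warnings module. The helper names `to_filter_action` and `count_emitted` are hypothetical and only illustrate the documented equivalences; they are not ATOM functions.

```python
import warnings


def to_filter_action(value):
    """Map the documented bool/str options to a warnings filter action."""
    if value is True:
        return "once"
    if value is False:
        return "ignore"
    return value  # already a filter action, e.g. "always" or "error"


def count_emitted(value):
    """Emit one warning under the given setting and count what gets through."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(to_filter_action(value))
        warnings.warn("example warning", UserWarning)
    return len(caught)


suppressed = count_emitted(False)  # filter action "ignore"
shown = count_emitted("always")    # filter action "always"
```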

logger: str, Logger or None, default=None

  • If None: Logging isn't used.
  • If str: Name of the log file. Use "auto" for automatic name.
  • If Path: A pathlib.Path to the log file.
  • Else: Python logging.Logger instance.

experiment: str or None, default=None
Name of the mlflow experiment to use for tracking. If None, no mlflow tracking is performed.

random_state: int or None, default=None
Seed used by the random number generator. If None, the random number generator is the RandomState used by np.random.


See Also

ATOMRegressor

Main class for regression tasks.

DirectRegressor

Train and evaluate the models in a direct fashion.

TrainSizingRegressor

Train and evaluate the models in a train sizing fashion.


Example

>>> from atom.training import SuccessiveHalvingRegressor
>>> from sklearn.datasets import load_digits
>>> from sklearn.model_selection import train_test_split

>>> X, y = load_digits(return_X_y=True, as_frame=True)

>>> train, test = train_test_split(
...     X.merge(y.to_frame(), left_index=True, right_index=True),
...     test_size=0.3,
... )

>>> runner = SuccessiveHalvingRegressor(["OLS", "RF"], verbose=2)
>>> runner.run(train, test)


Training ========================= >>
Metric: r2


Run: 0 =========================== >>
Models: OLS2, RF2
Size of training set: 1257 (50%)
Size of test set: 540


Results for OrdinaryLeastSquares:
Fit ---------------------------------------------
Train evaluation --> r2: 0.634
Test evaluation --> r2: 0.5649
Time elapsed: 0.422s
-------------------------------------------------
Time: 0.422s


Results for RandomForest:
Fit ---------------------------------------------
Train evaluation --> r2: 0.9687
Test evaluation --> r2: 0.7986
Time elapsed: 0.536s
-------------------------------------------------
Time: 0.536s


Final results ==================== >>
Total time: 0.961s
-------------------------------------
OrdinaryLeastSquares --> r2: 0.5649
RandomForest         --> r2: 0.7986 !


Run: 1 =========================== >>
Models: RF1
Size of training set: 1257 (100%)
Size of test set: 540


Results for RandomForest:
Fit ---------------------------------------------
Train evaluation --> r2: 0.979
Test evaluation --> r2: 0.8561
Time elapsed: 0.962s
-------------------------------------------------
Time: 0.962s


Final results ==================== >>
Total time: 0.963s
-------------------------------------
RandomForest --> r2: 0.8561

>>> # Analyze the results
>>> runner.results
                r2_train   r2_test  time_fit      time
frac     model
0.500000 OLS2   0.634000  0.564900  0.422387  0.422387
         RF2    0.968700  0.798600  0.536488  0.536488
1.000000 RF1    0.979000  0.856100  0.961875  0.961875


Attributes

Data attributes

The data attributes are used to access the dataset and its properties. Updating the dataset will automatically update the response of these attributes accordingly.

Attributes dataset: pd.DataFrame
Complete data set.
train: pd.DataFrame
Training set.
test: pd.DataFrame
Test set.
X: pd.DataFrame
Feature set.
y: pd.Series | pd.DataFrame
Target column(s).
holdout: pd.DataFrame | None
Holdout set.

This data set is untransformed by the pipeline. Read more in the user guide.

X_train: pd.DataFrame
Features of the training set.
y_train: pd.Series | pd.DataFrame
Target column(s) of the training set.
X_test: pd.DataFrame
Features of the test set.
y_test: pd.Series | pd.DataFrame
Target column(s) of the test set.
shape: tuple[int, int]
Shape of the dataset (n_rows, n_columns).
columns: pd.Index
Name of all the columns.
n_columns: int
Number of columns.
features: pd.Index
Name of the features.
n_features: int
Number of features.
target: str | list[str]
Name of the target column(s).


Utility attributes

The utility attributes are used to access information about the models in the instance after training.

Attributes models: str | list[str] | None
Name of the model(s).
metric: str | list[str] | None
Name of the metric(s).
winners: list[model] | None
Models ordered by performance.

Performance is measured as the highest score on the model's [main_metric]_bootstrap or [main_metric]_test, checked in that order. Ties are resolved looking at the lowest time_fit.

winner: model | None
Best performing model.

Performance is measured as the highest score on the model's [main_metric]_bootstrap or [main_metric]_test, checked in that order. Ties are resolved looking at the lowest time_fit.

results: Styler
Overview of the training results.

All durations are in seconds. Possible values include:

  • [metric]_ht: Score obtained by the hyperparameter tuning.
  • time_ht: Duration of the hyperparameter tuning.
  • [metric]_train: Metric score on the train set.
  • [metric]_test: Metric score on the test set.
  • time_fit: Duration of the model fitting on the train set.
  • [metric]_bootstrap: Mean score on the bootstrapped samples.
  • time_bootstrap: Duration of the bootstrapping.
  • time: Total duration of the run.

Tip

This attribute returns a pandas' Styler object. Convert the result back to a regular dataframe using its data attribute.
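
The ranking rule described for winners and winner (bootstrap score if available, else test score, with ties broken by the lowest time_fit) can be sketched as a sort key. The function `rank_models` and the plain-dict layout are hypothetical; the numbers are taken from run 0 of the example above.

```python
def rank_models(results, main_metric="r2"):
    """Order model names from best to worst."""

    def score(row):
        # Prefer the bootstrap score, fall back to the test score
        bootstrap = row.get(f"{main_metric}_bootstrap")
        return bootstrap if bootstrap is not None else row[f"{main_metric}_test"]

    # Highest score first; on equal scores, the lowest time_fit first
    return sorted(
        results,
        key=lambda name: (-score(results[name]), results[name]["time_fit"]),
    )


results = {
    "OLS2": {"r2_test": 0.5649, "time_fit": 0.422387},
    "RF2": {"r2_test": 0.7986, "time_fit": 0.536488},
}
ranking = rank_models(results)
```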


Tracking attributes

The tracking attributes are used to customize what elements of the experiment are tracked. Read more in the user guide.

Attributes log_ht: bool
Whether to track every trial of the hyperparameter tuning.
log_plots: bool
Whether to save plots as artifacts.
log_data: bool
Whether to save the train and test sets.
log_pipeline: bool
Whether to save the model's pipeline.


Plot attributes

The plot attributes are used to customize the plot's aesthetics. Read more in the user guide.

Attributes palette: str | Sequence[str]
Color palette.

Specify one of plotly's built-in palettes or create a custom one, e.g., atom.palette = ["red", "green", "blue"].

title_fontsize: int | float
Fontsize for the plot's title.
label_fontsize: int | float
Fontsize for the labels, legend and hover information.
tick_fontsize: int | float
Fontsize for the ticks along the plot's axes.
line_width: int | float
Width of the line plots.
marker_size: int | float
Size of the markers.


Methods

Next to the plotting methods, the class contains a variety of methods to handle the data, run the training, and manage the pipeline.

available_models    Give an overview of the available predefined models.
canvas              Create a figure with multiple plots.
clear               Reset attributes and clear cache from all models.
delete              Delete models.
evaluate            Get all models' scores for the provided metrics.
export_pipeline     Export the internal pipeline.
get_class_weight    Return class weights for a balanced data set.
get_params          Get parameters for this estimator.
merge               Merge another instance of the same class into this one.
update_layout       Update the properties of the plot's layout.
update_traces       Update the properties of the plot's traces.
reset_aesthetics    Reset the plot aesthetics to their default values.
run                 Train and evaluate the models.
save                Save the instance to a pickle file.
set_params          Set the parameters of this estimator.
stacking            Add a Stacking model to the pipeline.
voting              Add a Voting model to the pipeline.


method available_models(**kwargs)[source]

Give an overview of the available predefined models.

Parameters **kwargs
Filter the returned models by providing any of the columns as keyword arguments, where the value is the desired filter, e.g., accepts_sparse=True to get all models that accept sparse input, or supports_engines="cuml" to get all models that support the cuML engine.

Returns pd.DataFrame
Tags of the available predefined models. The columns depend on the task, but can include:

  • acronym: Model's acronym (used to call the model).
  • fullname: Name of the model's class.
  • estimator: Name of the model's underlying estimator.
  • module: The estimator's module.
  • handles_missing: Whether the model can handle missing values without preprocessing. If False, consider using the Imputer class before training the models.
  • needs_scaling: Whether the model requires feature scaling. If True, automated feature scaling is applied.
  • accepts_sparse: Whether the model accepts sparse input.
  • uses_exogenous: Whether the model uses exogenous variables.
  • multiple_seasonality: Whether the model can handle more than one seasonality period.
  • native_multilabel: Whether the model has native support for multilabel tasks.
  • native_multioutput: Whether the model has native support for multioutput tasks.
  • validation: Whether the model has in-training validation.
  • supports_engines: Engines supported by the model.



method canvas(rows=1, cols=2, sharex=False, sharey=False, hspace=0.05, vspace=0.07, title=None, legend="out", figsize=None, filename=None, display=True)[source]

Create a figure with multiple plots.

This @contextmanager allows you to draw many plots in one figure. The default option is to add two plots side by side. See the user guide for an example.

Parameters rows: int, default=1
Number of plots in length.

cols: int, default=2
Number of plots in width.

sharex: bool, default=False
If True, hide the label and ticks from non-border subplots on the x-axis.

sharey: bool, default=False
If True, hide the label and ticks from non-border subplots on the y-axis.

hspace: float, default=0.05
Space between subplot rows in normalized plot coordinates. The spacing is relative to the figure's size.

vspace: float, default=0.07
Space between subplot cols in normalized plot coordinates. The spacing is relative to the figure's size.

title: str, dict or None, default=None
Title for the plot.

legend: bool, str or dict, default="out"
Legend for the plot. See the user guide for an extended description of the choices.

  • If None: No legend is shown.
  • If str: Position to display the legend.
  • If dict: Legend configuration.

figsize: tuple or None, default=None
Figure's size in pixels, format as (x, y). If None, it adapts the size to the number of plots in the canvas.

filename: str, Path or None, default=None
Save the plot using this name. Use "auto" for automatic naming. The type of the file depends on the provided name (.html, .png, .pdf, etc.). If filename has no file type, the plot is saved as html. If None, the plot is not saved.

display: bool, default=True
Whether to render the plot.

Yields go.Figure
Plot object.
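
The context-manager pattern behind canvas can be sketched with contextlib. This is a hypothetical stand-in (`canvas_sketch` and its dict-based figure are not ATOM objects): plots made inside the with block are collected on one figure, and rendering happens once when the block exits without an error.

```python
from contextlib import contextmanager


@contextmanager
def canvas_sketch(rows=1, cols=2):
    """Collect plots on one figure and render it when the block ends."""
    figure = {"grid": (rows, cols), "plots": [], "rendered": False}
    yield figure  # the body of the with block runs here
    # Only reached if the body finished without raising
    figure["rendered"] = True


with canvas_sketch(rows=1, cols=2) as fig:
    fig["plots"].append("first plot")
    fig["plots"].append("second plot")
```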



method clear()[source]

Reset attributes and clear cache from all models.

Reset certain model attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the instance. The affected attributes are:



method delete(models=None)[source]

Delete models.

If all models are removed, the metric is reset. Use this method to drop unwanted models or to free some memory before saving. Deleted models are not removed from any active mlflow experiment.

Parameters models: int, str, Model, segment, sequence or None, default=None
Models to delete. If None, all models are deleted.



method evaluate(metric=None, rows="test")[source]

Get all models' scores for the provided metrics.

Tip

This method returns a pandas' Styler object. Convert the result back to a regular dataframe using its data attribute.

Parameters metric: str, func, scorer, sequence or None, default=None
Metric to calculate. If None, it returns an overview of the most common metrics per task.

rows: hashable, segment, sequence or dataframe, default="test"
Selection of rows to calculate metric on.

Returns Styler
Scores of the models.



method export_pipeline(model=None)[source]

Export the internal pipeline.

This method returns a deepcopy of the branch's pipeline. Optionally, you can add a model as final estimator. The returned pipeline is already fitted on the training set.

Parameters model: str, Model or None, default=None
Model for which to export the pipeline. If the model used automated feature scaling, the Scaler is added to the pipeline. If None, the pipeline in the current branch is exported (without any model).

Returns Pipeline
Current branch as a sklearn-like Pipeline object.



method get_class_weight(rows="train")[source]

Return class weights for a balanced data set.

Statistically, the class weights re-balance the data set so that the sampled data set represents the target population as closely as possible. The returned weights are inversely proportional to the class frequencies in the selected rows.

Parameters rows: hashable, segment, sequence or dataframe, default="train"
Selection of rows for which to get the weights.

Returns dict
Classes with the corresponding weights. A dict of dicts is returned for multioutput tasks.
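
Weights that are inversely proportional to class frequencies can be computed as in the sketch below. It uses the common "balanced" heuristic, n_samples / (n_classes * class_count); whether ATOM applies exactly this normalization is an assumption, and `balanced_class_weight` is a hypothetical helper.

```python
from collections import Counter


def balanced_class_weight(y):
    """Return {class: weight} with weights inversely proportional to frequency."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Class 0 occurs three times as often as class 1,
# so class 1 receives three times the weight
weights = balanced_class_weight([0, 0, 0, 1])
```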



method get_params(deep=True)[source]

Get parameters for this estimator.

Parameters deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns params : dict
Parameter names mapped to their values.



method merge(other, suffix="2")[source]

Merge another instance of the same class into this one.

Branches, models, metrics and attributes of the other instance are merged into this one. If there are branches and/or models with the same name, they are merged adding the suffix parameter to their name. The errors and missing attributes are extended with those of the other instance. It's only possible to merge two instances if they are initialized with the same dataset and trained with the same metric.

Parameters other: Runner
Instance with which to merge. Should be of the same class as self.

suffix: str, default="2"
Branches and models with conflicting names are merged adding suffix to the end of their names.



classmethod atom.plots.baseplot.update_layout(**kwargs)[source]

Update the properties of the plot's layout.

Recursively update the structure of the original layout with the values in the arguments.

Parameters **kwargs
Keyword arguments for the figure's update_layout method.



classmethod atom.plots.baseplot.update_traces(**kwargs)[source]

Update the properties of the plot's traces.

Recursively update the structure of the original traces with the values in the arguments.

Parameters **kwargs
Keyword arguments for the figure's update_traces method.



classmethod atom.plots.baseplot.reset_aesthetics()[source]

Reset the plot aesthetics to their default values.



method run(*arrays)[source]

Train and evaluate the models.

Read more in the user guide.

Parameters *arrays: sequence of indexables
Training set and test set. Allowed formats are:

  • train, test
  • X_train, X_test, y_train, y_test
  • (X_train, y_train), (X_test, y_test)
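
The three accepted call styles can be normalized to a single layout as sketched below. The helper `normalize_arrays` is hypothetical and works on plain lists; for the train, test style it assumes the target is the last column, as in the example on this page where y is merged onto X.

```python
def normalize_arrays(*arrays):
    """Return (X_train, y_train, X_test, y_test) for any accepted format."""
    if len(arrays) == 4:
        # X_train, X_test, y_train, y_test
        X_train, X_test, y_train, y_test = arrays
    elif len(arrays) == 2 and all(isinstance(a, tuple) for a in arrays):
        # (X_train, y_train), (X_test, y_test)
        (X_train, y_train), (X_test, y_test) = arrays
    elif len(arrays) == 2:
        # train, test: assume the last column holds the target
        train, test = arrays
        X_train = [row[:-1] for row in train]
        y_train = [row[-1] for row in train]
        X_test = [row[:-1] for row in test]
        y_test = [row[-1] for row in test]
    else:
        raise ValueError("Expected 2 or 4 arrays.")
    return X_train, y_train, X_test, y_test


train = [[1.0, 2.0, 10.0], [3.0, 4.0, 20.0]]
test = [[5.0, 6.0, 30.0]]
from_sets = normalize_arrays(train, test)
from_pairs = normalize_arrays(
    ([[1.0, 2.0], [3.0, 4.0]], [10.0, 20.0]),
    ([[5.0, 6.0]], [30.0]),
)
```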



method save(filename="auto", save_data=True)[source]

Save the instance to a pickle file.

Parameters filename: str or Path, default="auto"
Filename or pathlib.Path of the file to save. Use "auto" for automatic naming.

save_data: bool, default=True
Whether to save the dataset with the instance. This parameter is ignored if the method is not called from atom. If False, add the data to the load method to reload the instance.



method set_params(**params)[source]

Set the parameters of this estimator.

Parameters **params : dict
Estimator parameters.

Returns self : estimator instance
Estimator instance.



method stacking(models=None, name="Stack", train_on_test=False, **kwargs)[source]

Add a Stacking model to the pipeline.

Warning

Combining models trained on different branches into one ensemble is not allowed and will raise an exception.

Parameters models: segment, sequence or None, default=None
Models that feed the stacking estimator. The models must have been fitted on the current branch.

name: str, default="Stack"
Name of the model. The name is always preceded by the model's acronym: Stack.

train_on_test: bool, default=False
Whether to train the final estimator of the stacking model on the test set instead of the training set. Note that training it on the training set (default option) means there is a high risk of overfitting. It's recommended to use this option if you have another, independent set for testing (holdout set).

**kwargs
Additional keyword arguments for one of these estimators.

Tip

The model's acronyms can be used for the final_estimator parameter, e.g., atom.stacking(final_estimator="LR").



method voting(models=None, name="Vote", **kwargs)[source]

Add a Voting model to the pipeline.

Warning

Combining models trained on different branches into one ensemble is not allowed and will raise an exception.

Parameters models: segment, sequence or None, default=None
Models that feed the voting estimator. The models must have been fitted on the current branch.

name: str, default="Vote"
Name of the model. The name is always preceded by the model's acronym: Vote.

**kwargs
Additional keyword arguments for one of these estimators.