SuccessiveHalvingRegressor
class atom.training.SuccessiveHalvingRegressor(models=None,
metric=None, greater_is_better=True, needs_proba=False, needs_threshold=False,
skip_runs=0, n_calls=0, n_initial_points=5, est_params=None, bo_params=None,
n_bootstrap=0, n_jobs=1, gpu=False, verbose=0, warnings=True, logger=None,
experiment=None, random_state=None)
[source]
Fit and evaluate the models in a successive halving
fashion. The pipeline applies the following steps per iteration:
- The optimal hyperparameters for the model are selected using a Bayesian
optimization (BO) algorithm (optional).
- The model is fitted on the training set using the best combination
of hyperparameters found. After that, the model is evaluated on the test set.
- Various scores are calculated on the test set using a bootstrap algorithm (optional).
You can predict, plot and call any model from the instance.
Read more in the user guide.
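The halving scheme itself can be sketched in a few lines. This is a minimal illustration, assuming the common variant in which every run trains the N remaining models on 1/N of the training set and keeps the best half for the next run. halving_schedule is a hypothetical helper, not part of ATOM's API.

```python
def halving_schedule(n_models, skip_runs=0):
    """Return (n_models, data_fraction) for every run of a halving scheme."""
    schedule = []
    while n_models >= 1:
        # Assumption: N models are each trained on 1/N of the training set.
        schedule.append((n_models, 1 / n_models))
        n_models //= 2  # only the best half moves on to the next run
    # Drop the last `skip_runs` runs, mirroring the skip_runs parameter.
    return schedule[: len(schedule) - skip_runs]


for n, fraction in halving_schedule(4):
    print(f"{n} model(s) on {fraction:.0%} of the training set")
```

With skip_runs=1 the final run (one model on the complete training set) is left out.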
Parameters: |
models: str, estimator or sequence, optional (default=None)
Models to fit to the data. Allowed inputs are: an acronym from any of
ATOM's predefined models, an ATOMModel or a custom estimator as class
or instance. If None, all the predefined models are used. Use the
available_models method for an overview of the predefined models.
metric: str, func, scorer, sequence or None, optional (default=None)
Metric on which to fit the models. Choose from any of sklearn's
SCORERS, a function with signature metric(y_true, y_pred), a scorer
object or a sequence of these. If multiple metrics are selected, only
the first is used to optimize the BO. If None, a default metric is
selected:
- "f1" for binary classification
- "f1_weighted" for multiclass classification
- "r2" for regression
greater_is_better: bool or sequence, optional (default=True)
Whether the metric is a score function or a loss function,
i.e. if True, a higher score is better and if False, lower is
better. This parameter is ignored if the metric is a string or
a scorer. If sequence, the n-th value applies to the n-th
metric.
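A custom metric only needs the metric(y_true, y_pred) signature described above. The sketch below defines a hypothetical loss function, mean_abs_error, in plain Python. Because lower values are better for a loss, it would be combined with greater_is_better=False.

```python
def mean_abs_error(y_true, y_pred):
    """Hypothetical loss function with the metric(y_true, y_pred) signature."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# A loss function: lower is better, so pair it with greater_is_better=False.
print(mean_abs_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))
```

It could then be passed as SuccessiveHalvingRegressor(models, metric=mean_abs_error, greater_is_better=False).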
needs_proba: bool or sequence, optional (default=False)
Whether the metric function requires probability estimates out
of a classifier. If True, make sure that every selected model has
a predict_proba method. This parameter is ignored
if the metric is a string or a scorer. If sequence, the n-th
value applies to the n-th metric.
needs_threshold: bool or sequence, optional (default=False)
Whether the metric function takes a continuous decision certainty.
This only works for binary classification using estimators that
have either a decision_function or predict_proba
method. This parameter is ignored if the metric is a string or a
scorer. If sequence, the n-th value applies to the n-th metric.
skip_runs: int, optional (default=0)
Skip the last skip_runs runs of the successive halving.
n_calls: int or sequence, optional (default=0)
Maximum number of iterations of the BO. It includes the random
points of n_initial_points. If 0, skip the BO and fit the model
on its default parameters. If sequence, the n-th value applies to
the n-th model.
n_initial_points: int or sequence, optional (default=5)
Initial number of random tests of the BO before fitting the
surrogate function. If equal to n_calls, the optimizer will
technically be performing a random search. If sequence, the n-th
value applies to the n-th model.
est_params: dict, optional (default=None)
Additional parameters for the estimators. See the corresponding
documentation for the available options. For multiple models,
use the acronyms as key (or 'all' for all models) and a dict
of the parameters as value. Add _fit to the parameter's name
to pass it to the fit method instead of the initializer.
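As an illustration of the est_params layout for multiple models, the sketch below builds such a dictionary and separates initializer parameters from those carrying the _fit suffix. The parameter values are made up, and split_params is a hypothetical helper that only mimics the naming rule described above; it is not ATOM's internal code.

```python
# Acronyms as keys, a dict of estimator parameters as values.
est_params = {
    "RF": {"n_estimators": 200},
    "ET": {"max_depth": 5, "sample_weight_fit": [1.0, 2.0]},
}


def split_params(params):
    """Separate initializer parameters from those ending in _fit."""
    init = {k: v for k, v in params.items() if not k.endswith("_fit")}
    # Strip the suffix so the key matches the fit method's argument name.
    fit = {k[: -len("_fit")]: v for k, v in params.items() if k.endswith("_fit")}
    return init, fit


init_params, fit_params = split_params(est_params["ET"])
print(init_params, fit_params)
```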
bo_params: dict, optional (default=None)
Additional parameters for the BO. These can include:
- base_estimator: str, optional (default="GP")
Base estimator to use in the BO. Choose from:
- "GP" for Gaussian Process
- "RF" for Random Forest
- "ET" for Extra-Trees
- "GBRT" for Gradient Boosted Regression Trees
- max_time: int, optional (default=np.inf)
Stop the optimization after max_time seconds.
- delta_x: int or float, optional (default=0)
Stop the optimization when |x1 - x2| < delta_x.
- delta_y: int or float, optional (default=0)
Stop the optimization if the 5 best minima are within delta_y (the function is always minimized).
- cv: int, optional (default=1)
Number of folds for the cross-validation. If 1, the training set is randomly
split in a subtrain and validation set.
- early_stopping: int, float or None, optional (default=None)
Training will stop if the model didn't improve in the last early_stopping rounds.
If <1, fraction of rounds from the total. If None, no early stopping is performed.
Only available for models that allow in-training evaluation.
- callback: callable or list of callables, optional (default=None)
Callbacks for the BO.
- dimensions: dict, list or None, optional (default=None)
Custom hyperparameter space for the Bayesian optimization. Can be a list to share
dimensions across models or a dict with the model's name as key (or 'all' for all
models). If None, ATOM's predefined dimensions are used.
- plot: bool, optional (default=False)
Whether to plot the BO's progress as it runs.
Creates a canvas with two plots: the first plot shows the score of every trial
and the second shows the distance between the last consecutive steps.
- Additional keyword arguments for skopt's optimizer.
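A bo_params dictionary could look like the sketch below, which uses only keys from the list above with made-up values. The helper early_stopping_rounds is hypothetical: it illustrates one way a fractional early_stopping value might be translated into a number of rounds, assuming truncation to an integer.

```python
bo_params = {
    "base_estimator": "RF",   # Random Forest surrogate instead of the default "GP"
    "max_time": 600,          # stop the optimization after 10 minutes
    "cv": 5,                  # 5-fold cross-validation per BO call
    "early_stopping": 0.25,   # <1, so interpreted as a fraction of the rounds
}


def early_stopping_rounds(early_stopping, total_rounds):
    """Translate an early_stopping value into a number of rounds (sketch)."""
    if early_stopping is None:
        return None  # no early stopping is performed
    if early_stopping < 1:
        # Assumption: the fraction is truncated to a whole number of rounds.
        return int(early_stopping * total_rounds)
    return int(early_stopping)


print(early_stopping_rounds(bo_params["early_stopping"], total_rounds=400))
```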
n_bootstrap: int or sequence, optional (default=0)
Number of data sets (bootstrapped from the training set) to use in
the bootstrap algorithm. If 0, no bootstrap is performed.
If sequence, the n-th value will apply to the n-th model.
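The idea behind the bootstrap step can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not ATOM's implementation: the "model" is a plain mean predictor, the score is the mean absolute error, and bootstrap_scores is a hypothetical helper.

```python
import random
import statistics


def bootstrap_scores(train, test, n_bootstrap, seed=1):
    """Score a mean predictor on `test` for n_bootstrap resampled training sets."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_bootstrap):
        # Sample the training set with replacement (same size as the original).
        sample = rng.choices(train, k=len(train))
        prediction = statistics.mean(sample)  # "fit" the stand-in model
        errors = [abs(y - prediction) for y in test]
        scores.append(statistics.mean(errors))
    return scores


scores = bootstrap_scores([1.0, 2.0, 3.0, 4.0, 5.0], [2.5, 3.5], n_bootstrap=10)
# The mean and standard deviation summarize the spread over the bootstrapped sets.
print(statistics.mean(scores), statistics.stdev(scores))
```

The two summary values correspond in spirit to the mean_bootstrap and std_bootstrap columns of the results attribute.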
n_jobs: int, optional (default=1)
Number of cores to use for parallel processing.
- If >0: Number of cores to use.
- If -1: Use all available cores.
- If <-1: Use available_cores + 1 + n_jobs.
Beware that using multiple processes on the same machine may cause
memory issues for large datasets.
gpu: bool or str, optional (default=False)
Train models on GPU (instead of CPU). Refer to the documentation
to check which estimators are supported.
- If False: Always use CPU implementation.
- If True: Use GPU implementation if possible.
- If "force": Force GPU implementation.
verbose: int, optional (default=0)
Verbosity level of the class. Choose from:
- 0 to not print anything.
- 1 to print basic information.
- 2 to print detailed information.
warnings: bool or str, optional (default=True)
- If True: Default warning action (equal to "default").
- If False: Suppress all warnings (equal to "ignore").
- If str: One of the actions in python's warnings environment.
Changing this parameter affects the PYTHONWARNINGS environment.
ATOM can't manage warnings that go directly from C/C++ code to stdout.
logger: str, Logger or None, optional (default=None)
- If None: Doesn't save a logging file.
- If str: Name of the log file. Use "auto" for automatic naming.
- Else: Python logging.Logger instance.
experiment: str or None, optional (default=None)
Name of the mlflow experiment to use for tracking. If None,
no mlflow tracking is performed.
random_state: int or None, optional (default=None)
Seed used by the random number generator. If None, the random number
generator is the RandomState instance used by np.random.
|
Magic methods
The class contains some magic methods to help you access some of its
elements faster.
- __len__: Returns the length of the dataset.
- __contains__: Checks if the provided item is a column in the dataset.
- __getitem__: Access a model, column or subset of the dataset.
Attributes
Data attributes
The dataset can be accessed at any time through multiple attributes,
e.g. calling trainer.train will return the training set. Updating
one of the data attributes will automatically update the rest as well.
Changing the branch will also change the response from these attributes
accordingly.
Attributes: |
dataset: pd.DataFrame
Complete dataset in the pipeline.
train: pd.DataFrame
Training set.
test: pd.DataFrame
Test set.
X: pd.DataFrame
Feature set.
y: pd.Series
Target column.
X_train: pd.DataFrame
Training features.
y_train: pd.Series
Training target.
X_test: pd.DataFrame
Test features.
y_test: pd.Series
Test target.
shape: tuple
Dataset's shape: (n_rows x n_columns) or (n_rows, (shape_sample), n_cols)
for datasets with more than two dimensions.
columns: pd.Index
Names of the columns in the dataset.
n_columns: int
Number of columns in the dataset.
features: pd.Index
Names of the features in the dataset.
n_features: int
Number of features in the dataset.
target: str
Name of the target column.
|
Utility attributes
Attributes: |
models: list
List of models in the pipeline.
metric: str or list
Metric(s) used to fit the models.
errors: dict
Dictionary of the encountered exceptions (if any).
winners: list of str
Model names ordered by performance on the test set (either through the
metric_test or mean_bootstrap attribute).
winner: model
Model subclass that performed best on the test set (either through the
metric_test or mean_bootstrap attribute).
results: pd.DataFrame
Dataframe of the training results. Columns can include:
- metric_bo: Best score achieved during the BO.
- time_bo: Time spent on the BO.
- metric_train: Metric score on the training set.
- metric_test: Metric score on the test set.
- time_fit: Time spent fitting and evaluating.
- mean_bootstrap: Mean score of the bootstrap results.
- std_bootstrap: Standard deviation score of the bootstrap results.
- time_bootstrap: Time spent on the bootstrap algorithm.
- time: Total time spent on the whole run.
|
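How the winners ordering relates to the metric_test and mean_bootstrap columns can be sketched as follows. The scores are made up, and the sketch assumes that the bootstrap mean takes precedence whenever it is available and that higher scores are better.

```python
# Hypothetical results: Tree was run without bootstrap.
results = {
    "Tree": {"metric_test": 0.70, "mean_bootstrap": None},
    "RF": {"metric_test": 0.80, "mean_bootstrap": 0.78},
    "ET": {"metric_test": 0.82, "mean_bootstrap": 0.75},
}


def best_score(row):
    """Use mean_bootstrap when present, else fall back to metric_test."""
    if row["mean_bootstrap"] is not None:
        return row["mean_bootstrap"]
    return row["metric_test"]


# Order the model names from best to worst.
winners = sorted(results, key=lambda name: best_score(results[name]), reverse=True)
winner = winners[0]
print(winners, winner)
```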
Plot attributes
Attributes: |
style: str
Plotting style. See seaborn's documentation.
palette: str
Color palette. See seaborn's documentation.
title_fontsize: int
Fontsize for the plot's title.
label_fontsize: int
Fontsize for labels and legends.
tick_fontsize: int
Fontsize for the ticks along the plot's axes.
|
Methods
available_models |
Give an overview of the available predefined models. |
canvas |
Create a figure with multiple plots. |
clear |
Clear attributes from all models. |
delete |
Delete models from the trainer. |
evaluate |
Get all models' scores for the provided metrics. |
get_params |
Get parameters for this estimator. |
log |
Save information to the logger and print to stdout. |
merge |
Merge another trainer into this one. |
reset_aesthetics |
Reset the plot aesthetics to their default values. |
run |
Fit and evaluate the models. |
save |
Save the instance to a pickle file. |
set_params |
Set the parameters of this estimator. |
stacking |
Add a Stacking instance to the models in the pipeline. |
voting |
Add a Voting instance to the models in the pipeline. |
method available_models()
Give an overview of the available predefined models.
Returns: |
pd.DataFrame
Information about the predefined models available for the current task.
Columns include:
- acronym: Model's acronym (used to call the model).
- fullname: Complete name of the model.
- estimator: The model's underlying estimator.
- module: The estimator's module.
- needs_scaling: Whether the model requires feature scaling.
- accepts_sparse: Whether the model has native support for sparse matrices.
- supports_gpu: Whether the model has GPU support.
|
method canvas(nrows=1,
ncols=2, title=None, figsize=None, filename=None, display=True)
[source]
This @contextmanager
allows you to draw many plots in one figure.
The default option is to add two plots side by side. See the
user guide for an example.
Parameters: |
nrows: int, optional (default=1)
Number of plots in length.
ncols: int, optional (default=2)
Number of plots in width.
title: str or None, optional (default=None)
Plot's title. If None, no title is displayed.
figsize: tuple or None, optional (default=None)
Figure's size, format as (x, y). If None, it adapts the size to the
number of plots in the canvas.
filename: str or None, optional (default=None)
Name of the file. Use "auto" for automatic naming.
If None, the figure is not saved.
display: bool, optional (default=True)
Whether to render the plot.
|
method clear()
Reset all model attributes to their initial state, deleting potentially
large data arrays. Use this method to free some memory before saving
the class. The cleared attributes per model are:
method delete(models=None)
Delete models from the trainer. If all models are removed, the metric
is reset. Use this method to drop unwanted models from the pipeline
or to free some memory before saving. Deleted models are not removed
from any active mlflow experiment.
Parameters: |
models: str or sequence, optional (default=None)
Models to delete. If None, delete them all.
|
method evaluate(metric=None,
dataset="test")
[source]
Get all the models' scores for the provided metrics.
Parameters: |
metric: str, func, scorer, sequence or None, optional (default=None)
Metrics to calculate. If None, a selection of the most common
metrics per task is used.
dataset: str, optional (default="test")
Data set on which to calculate the metric. Choose from: "train",
"test" or "holdout".
|
Returns: |
pd.DataFrame
Scores of the models.
|
method get_class_weights(dataset="train")
[source]
Return class weights for a balanced data set. Statistically, the class
weights re-balance the data set so that the sampled data set represents
the target population as closely as possible. The returned weights are
inversely proportional to the class frequencies in the selected data set.
Parameters: |
dataset: str, optional (default="train")
Data set from which to get the weights. Choose from: "train", "test" or "dataset".
|
Returns: |
dict
Classes with the corresponding weights.
|
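Weights that are inversely proportional to the class frequencies can be computed as in the sketch below. It uses the "balanced" heuristic n_samples / (n_classes * class_count), which is an assumption; the exact normalization used by get_class_weights may differ. The function name class_weights is hypothetical.

```python
from collections import Counter


def class_weights(y):
    """Return weights inversely proportional to the class frequencies."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    # Rare classes get a larger weight, frequent classes a smaller one.
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


weights = class_weights([0, 0, 0, 1])
print(weights)
```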
method get_params(deep=True)
Get parameters for this estimator.
Parameters: |
deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained
subobjects that are estimators.
|
Returns: |
dict
Parameter names mapped to their values.
|
method log(msg, level=0)
Write a message to the logger and print it to stdout.
Parameters: |
msg: str
Message to write to the logger and print to stdout.
level: int, optional (default=0)
Minimum verbosity level to print the message.
|
method merge(other, suffix="2")
[source]
Merge another trainer into this one. Branches, models, metrics and
attributes of the other trainer are merged into this one. If there
are branches and/or models with the same name, they are merged
adding the suffix parameter to their name. The errors and missing
attributes are extended with those of the other instance. It's only
possible to merge two instances if they are initialized with the same
dataset and trained with the same metric.
Parameters: |
other: trainer
Trainer instance with which to merge.
suffix: str, optional (default="2")
Conflicting branches and models are merged adding suffix
to the end of their names.
|
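The naming rule for conflicting branches and models can be illustrated with a short sketch. merge_names is a hypothetical helper that only shows how the suffix resolves name clashes; it does not merge any trainer state.

```python
def merge_names(own, other, suffix="2"):
    """Combine two lists of names, appending suffix to conflicting ones."""
    merged = list(own)
    for name in other:
        # A name that already exists in this trainer gets the suffix appended.
        merged.append(name + suffix if name in own else name)
    return merged


print(merge_names(["RF", "Tree"], ["RF", "ET"]))
```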
method reset_aesthetics()
Reset the plot aesthetics to their default values.
method run(*arrays)
Fit and evaluate the models.
Parameters: |
*arrays: sequence of indexables
Training and test set (and optionally a holdout set). Allowed formats are:
- train, test
- train, test, holdout
- X_train, X_test, y_train, y_test
- X_train, X_test, X_holdout, y_train, y_test, y_holdout
- (X_train, y_train), (X_test, y_test)
- (X_train, y_train), (X_test, y_test), (X_holdout, y_holdout)
|
method save(filename="auto", save_data=True)
[source]
Save the instance to a pickle file. Remember that the class contains
the complete dataset as attribute, so the file can become large for
big datasets! To avoid this, use save_data=False.
Parameters: |
filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.
save_data: bool, optional (default=True)
Whether to save the data as an attribute of the instance. If False,
remember to add the data to ATOMLoader when loading the file.
|
method set_params(**params)
Set the parameters of this estimator.
Parameters: |
**params: dict
Estimator parameters.
|
Returns: |
SuccessiveHalvingRegressor
Estimator instance.
|
method stacking(name="Stack",
models=None, **kwargs)
[source]
Add a Stacking model to the pipeline.
Parameters: |
name: str, optional (default="Stack")
Name of the model. The name is always preceded by the
model's acronym: Stack.
models: sequence or None, optional (default=None)
Models that feed the stacking estimator. If None, it selects
all non-ensemble models trained on the current branch.
**kwargs
Additional keyword arguments for sklearn's StackingRegressor instance.
The predefined models' acronyms can be used for the final_estimator parameter.
|
method voting(name="Vote",
models=None, **kwargs)
[source]
Add a Voting model to the pipeline.
Parameters: |
name: str, optional (default="Vote")
Name of the model. The name is always preceded by the
model's acronym: Vote.
models: sequence or None, optional (default=None)
Models that feed the voting estimator. If None, it selects
all non-ensemble models trained on the current branch.
**kwargs
Additional keyword arguments for sklearn's VotingRegressor instance.
|
Example
from atom.training import SuccessiveHalvingRegressor
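# Run the pipeline (train and test are the training and test sets, prepared beforehand)
# "r2" is a regression metric; "f1" only applies to classification tasks
trainer = SuccessiveHalvingRegressor(["Tree", "Bag", "RF", "ET"], metric="r2")
trainer.run(train, test)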
# Analyze the results
trainer.plot_successive_halving()
print(trainer.results)