TrainSizingRegressor

class atom.training.TrainSizingRegressor(models=None, metric=None, greater_is_better=True, needs_proba=False, needs_threshold=False, train_sizes=5, n_calls=0, n_initial_points=5, est_params=None, bo_params=None, n_bootstrap=0, n_jobs=1, gpu=False, verbose=0, warnings=True, logger=None, experiment=None, random_state=None) [source]

Fit and evaluate the models in a train sizing fashion. The following steps are applied to every model (per iteration):

Hyperparameter tuning is performed using a Bayesian Optimization approach (optional).
The model is fitted on the training set using the best combination of hyperparameters found.
The model is evaluated on the test set.
The model is trained on various bootstrapped samples of the training set and scored again on the test set (optional).
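
The per-iteration steps above can be sketched without ATOM. The following is a minimal illustration under stated assumptions, not ATOM's implementation: fit_line, r2 and train_sizing are hypothetical helpers, the model is a plain least-squares line, and the tuning step is skipped (comparable to n_calls=0).

```python
import random
import statistics


def fit_line(points):
    """Least-squares fit of y = a*x + b on (x, y) pairs (hypothetical helper)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in points) / sxx if sxx else 0.0
    return a, my - a * mx


def r2(model, points):
    """Coefficient of determination of the fitted line on (x, y) pairs."""
    a, b = model
    my = statistics.fmean(y for _, y in points)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in points)
    ss_tot = sum((y - my) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


def train_sizing(train, test, fractions, n_bootstrap=0, seed=1):
    """Fit on growing parts of the training set, always scoring on the test set."""
    rng = random.Random(seed)
    rows = []
    for frac in fractions:
        subset = train[: max(2, round(frac * len(train)))]
        # Hyperparameter tuning is skipped: this toy model has nothing to tune
        model = fit_line(subset)
        row = {"n_samples": len(subset), "metric_test": r2(model, test)}
        if n_bootstrap:
            # Refit on samples drawn with replacement, score on the same test set
            scores = [
                r2(fit_line(rng.choices(subset, k=len(subset))), test)
                for _ in range(n_bootstrap)
            ]
            row["mean_bootstrap"] = statistics.fmean(scores)
            row["std_bootstrap"] = statistics.stdev(scores) if n_bootstrap > 1 else 0.0
        rows.append(row)
    return rows


# Synthetic data: y = 2x + 1 plus Gaussian noise
rng = random.Random(0)
data = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in (rng.uniform(0, 10) for _ in range(200))]
train, test = data[:150], data[150:]
results = train_sizing(train, test, [0.2, 0.4, 0.6, 0.8, 1.0], n_bootstrap=5)
for row in results:
    print(row)
```

Every row corresponds to one training set size, which is the information a learning curve plots.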

You can predict, plot and call any model from the instance. Read more in the user guide.

Parameters:

models: str, estimator or sequence, optional (default=None)
Models to fit to the data. Allowed inputs are: an acronym from any of ATOM's predefined models, an ATOMModel or a custom estimator as class or instance. If None, all the predefined models are used. Available predefined models are:

"GP" for Gaussian Process
"OLS" for Ordinary Least Squares
"Ridge" for Ridge Regression
"Lasso" for Lasso Regression
"EN" for ElasticNet
"BR" for Bayesian Ridge
"ARD" for Automated Relevance Determination
"KNN" for K-Nearest Neighbors
"RNN" for Radius Nearest Neighbors
"Tree" for a single Decision Tree
"Bag" for Bagging
"ET" for Extra-Trees
"RF" for Random Forest
"AdaB" for AdaBoost
"GBM" for Gradient Boosting Machine
"XGB" for XGBoost (only available if package is installed)
"LGB" for LightGBM (only available if package is installed)
"CatB" for CatBoost (only available if package is installed)
"lSVM" for Linear SVM
"kSVM" for Kernel SVM
"PA" for Passive Aggressive
"SGD" for Stochastic Gradient Descent
"MLP" for Multi-layer Perceptron

metric: str, func, scorer, sequence or None, optional (default=None)
Metric on which to fit the models. Choose from any of sklearn's SCORERS, a function with signature metric(y_true, y_pred), a scorer object or a sequence of these. If multiple metrics are selected, only the first is used to optimize the BO. If None, a default metric is selected:

"f1" for binary classification
"f1_weighted" for multiclass classification
"r2" for regression

greater_is_better: bool or sequence, optional (default=True)
Whether the metric is a score function or a loss function, i.e. if True, a higher score is better and if False, lower is better. This parameter is ignored if the metric is a string or a scorer. If sequence, the n-th value applies to the n-th metric.
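
As an illustration of a custom metric, the sketch below defines a loss function with the signature metric(y_true, y_pred). The function itself is an example written for this page, not part of ATOM. Because lower values are better, it would be combined with greater_is_better=False.

```python
def mean_absolute_error(y_true, y_pred):
    """Loss function with the signature metric(y_true, y_pred)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# A loss is minimized, so it would be passed together with greater_is_better=False:
# TrainSizingRegressor("OLS", metric=mean_absolute_error, greater_is_better=False)
score = mean_absolute_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0])
print(score)
```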

needs_proba: bool or sequence, optional (default=False)
Whether the metric function requires probability estimates out of a classifier. If True, make sure that every selected model has a predict_proba method. This parameter is ignored if the metric is a string or a scorer. If sequence, the n-th value applies to the n-th metric.

needs_threshold: bool or sequence, optional (default=False)
Whether the metric function takes a continuous decision certainty. This only works for binary classification using estimators that have either a decision_function or predict_proba method. This parameter is ignored if the metric is a string or a scorer. If sequence, the n-th value applies to the n-th metric.

train_sizes: int or sequence, optional (default=5)
Sequence of training set sizes used to run the trainings.

If int: Number of equally distributed splits, i.e. for a value N it's equal to np.linspace(1.0/N, 1.0, N).
If sequence: Fraction of the training set when <=1, else total number of samples.
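
The two input types can be read as follows. This is a minimal sketch of the rules stated above; resolve_train_sizes is a hypothetical helper, and rounding fractions to the nearest whole sample is an assumption.

```python
def resolve_train_sizes(train_sizes, n_train):
    """Translate train_sizes into a list of sample counts (hypothetical helper)."""
    if isinstance(train_sizes, int):
        # N equally distributed fractions: 1/N, 2/N, ..., 1.0
        train_sizes = [i / train_sizes for i in range(1, train_sizes + 1)]
    # Values <= 1 are fractions of the training set, larger values are sample counts
    return [round(s * n_train) if s <= 1 else int(s) for s in train_sizes]


print(resolve_train_sizes(5, 1000))                # [200, 400, 600, 800, 1000]
print(resolve_train_sizes([0.1, 0.5, 300], 1000))  # [100, 500, 300]
```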

n_calls: int or sequence, optional (default=0)
Maximum number of iterations of the BO. It includes the random points of n_initial_points. If 0, skip the BO and fit the model on its default parameters. If sequence, the n-th value applies to the n-th model.

n_initial_points: int or sequence, optional (default=5)
Initial number of random tests of the BO before fitting the surrogate function. If equal to n_calls, the optimizer will technically be performing a random search. If sequence, the n-th value applies to the n-th model.

est_params: dict, optional (default=None)
Additional parameters for the estimators. See the corresponding documentation for the available options. For multiple models, use the acronyms as key (or 'all' for all models) and a dict of the parameters as value. Add _fit to the parameter's name to pass it to the fit method instead of the initializer.
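
A possible layout of est_params for multiple models is shown below. The helper split_est_params is hypothetical and only illustrates how the _fit suffix separates fit arguments from initializer arguments. The parameter values are examples.

```python
def split_est_params(params):
    """Split parameters into initializer and fit kwargs (hypothetical helper)."""
    init_kwargs, fit_kwargs = {}, {}
    for name, value in params.items():
        if name.endswith("_fit"):
            fit_kwargs[name[: -len("_fit")]] = value  # strip the suffix
        else:
            init_kwargs[name] = value
    return init_kwargs, fit_kwargs


# One dict per model acronym; the key "all" would apply to every model
est_params = {
    "RF": {"n_estimators": 200},
    "XGB": {"max_depth": 4, "verbose_fit": False},
}
init_kwargs, fit_kwargs = split_est_params(est_params["XGB"])
print(init_kwargs, fit_kwargs)
```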

bo_params: dict, optional (default=None)
Additional parameters for the BO. These can include:

base_estimator: str, optional (default="GP")
Base estimator to use in the BO. Choose from:
- "GP" for Gaussian Process
- "RF" for Random Forest
- "ET" for Extra-Trees
- "GBRT" for Gradient Boosted Regression Trees
max_time: int, optional (default=np.inf)
Stop the optimization after max_time seconds.
delta_x: int or float, optional (default=0)
Stop the optimization when |x1 - x2| < delta_x.
delta_y: int or float, optional (default=0)
Stop the optimization if the 5 best minima are within delta_y (the function is always minimized).
cv: int, optional (default=1)
Number of folds for the cross-validation. If 1, the training set is randomly split in a subtrain and validation set.
early_stopping: int, float or None, optional (default=None)
Training stops if the model didn't improve in the last early_stopping rounds. If <1, fraction of rounds from the total. If None, no early stopping is performed. Only available for models that allow in-training evaluation.
callback: callable or list of callables, optional (default=None)
Callbacks for the BO.
dimensions: dict, list or None, optional (default=None)
Custom hyperparameter space for the Bayesian optimization. Can be a list to share dimensions across models or a dict with the model's name as key (or 'all' for all models). If None, ATOM's predefined dimensions are used.
plot: bool, optional (default=False)
Whether to plot the BO's progress as it runs. Creates a canvas with two plots: the first plot shows the score of every trial and the second shows the distance between the last consecutive steps.
**kwargs
Additional keyword arguments for skopt's optimizer.
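
A bo_params dictionary built from the keys above could look like the sketch below. The values are illustrative only. The helper early_stopping_rounds is hypothetical and shows one way to read the early_stopping value; rounding the fraction down is an assumption.

```python
import math

# Keys as documented above; the values are examples, not recommendations
bo_params = {
    "base_estimator": "RF",  # surrogate model of the BO
    "max_time": 600,         # stop the optimization after 10 minutes
    "delta_y": 0.01,
    "cv": 3,                 # 3-fold cross-validation for every call
    "early_stopping": 0.1,   # <1, so a fraction of the total rounds
}


def early_stopping_rounds(early_stopping, total_rounds):
    """Translate early_stopping into a number of rounds (hypothetical helper)."""
    if early_stopping is None:
        return None  # no early stopping
    if early_stopping < 1:
        return max(1, math.floor(early_stopping * total_rounds))
    return int(early_stopping)


print(early_stopping_rounds(bo_params["early_stopping"], 500))
```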

n_bootstrap: int or sequence, optional (default=0)
Number of data sets (bootstrapped from the training set) to use in the bootstrap algorithm. If 0, no bootstrap is performed. If sequence, the n-th value applies to the n-th model.

n_jobs: int, optional (default=1)
Number of cores to use for parallel processing.

If >0: Number of cores to use.
If -1: Use all available cores.
If <-1: Use available_cores + 1 + n_jobs, e.g. -2 uses all cores but one.

Beware that using multiple processes on the same machine may cause memory issues for large datasets.

gpu: bool or str, optional (default=False)
Train models on GPU (instead of CPU). Refer to the documentation to check which estimators are supported.

If False: Always use CPU implementation.
If True: Use GPU implementation if possible.
If "force": Force GPU implementation.

verbose: int, optional (default=0)
Verbosity level of the class. Choose from:

0 to not print anything.
1 to print basic information.
2 to print detailed information.

warnings: bool or str, optional (default=True)

If True: Default warning action (equal to "default").
If False: Suppress all warnings (equal to "ignore").
If str: One of the actions in python's warnings environment.

Changing this parameter affects the PYTHONWARNINGS environment.
ATOM can't manage warnings that go directly from C/C++ code to stdout.
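
The mapping between the parameter's value and Python's warning actions can be sketched as follows. resolve_warnings is a hypothetical helper, and how ATOM applies the action internally is an assumption; the calls to warnings.simplefilter and os.environ are standard library.

```python
import os
import warnings


def resolve_warnings(value):
    """Map the warnings parameter to a Python warnings action (hypothetical helper)."""
    if value is True:
        return "default"
    if value is False:
        return "ignore"
    valid = {"default", "error", "ignore", "always", "module", "once"}
    if value not in valid:
        raise ValueError(f"Unknown warnings action: {value!r}")
    return value


action = resolve_warnings(False)
warnings.simplefilter(action)          # filter for the current process
os.environ["PYTHONWARNINGS"] = action  # inherited by child processes
```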

logger: str, Logger or None, optional (default=None)

If None: Doesn't save a logging file.
If str: Name of the log file. Use "auto" for automatic naming.
Else: Python logging.Logger instance.

experiment: str or None, optional (default=None)
Name of the mlflow experiment to use for tracking. If None, no mlflow tracking is performed.

random_state: int or None, optional (default=None)
Seed used by the random number generator. If None, the random number generator is the RandomState instance used by np.random.

Magic methods

The class contains some magic methods to help you access some of its elements faster.

__len__: Returns the length of the dataset.
__contains__: Checks if the provided item is a column in the dataset.
__getitem__: Access a model, column or subset of the dataset.

Attributes

Data attributes

The dataset can be accessed at any time through multiple attributes, e.g. calling trainer.train will return the training set. Updating one of the data attributes will automatically update the rest as well. Changing the branch will also change the response from these attributes accordingly.

Attributes:

dataset: pd.DataFrame
Complete dataset in the pipeline.

train: pd.DataFrame
Training set.

test: pd.DataFrame
Test set.

X: pd.DataFrame
Feature set.

y: pd.Series
Target column.

X_train: pd.DataFrame
Training features.

y_train: pd.Series
Training target.

X_test: pd.DataFrame
Test features.

y_test: pd.Series
Test target.

shape: tuple
Dataset's shape: (n_rows, n_columns) or (n_rows, (shape_sample), n_columns) for datasets with more than two dimensions.

columns: pd.Index
Names of the columns in the dataset.

n_columns: int
Number of columns in the dataset.

features: pd.Index
Names of the features in the dataset.

n_features: int
Number of features in the dataset.

target: str
Name of the target column.

Utility attributes

Attributes:

models: list
List of models in the pipeline.

metric: str or list
Metric(s) used to fit the models.

errors: dict
Dictionary of the encountered exceptions (if any).

winners: list of str
Model names ordered by performance on the test set (either through the metric_test or mean_bootstrap attribute).

winner: model
Model subclass that performed best on the test set (either through the metric_test or mean_bootstrap attribute).
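
The ordering behind winners and winner can be pictured with the sketch below. rank_models is a hypothetical helper that assumes a higher score is better and prefers mean_bootstrap when it is present. The scores are made up for illustration.

```python
def rank_models(results):
    """Order model names from best to worst (hypothetical helper)."""

    def score(name):
        row = results[name]
        # Prefer the bootstrap mean when the bootstrap was run
        return row.get("mean_bootstrap", row["metric_test"])

    return sorted(results, key=score, reverse=True)


# Made-up scores for three models
results = {
    "OLS": {"metric_test": 0.71},
    "RF": {"metric_test": 0.84},
    "Tree": {"metric_test": 0.66},
}
winners = rank_models(results)
winner = winners[0]
print(winners, winner)
```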

results: pd.DataFrame
Dataframe of the training results. Columns can include:

metric_bo: Best score achieved during the BO.
time_bo: Time spent on the BO.
metric_train: Metric score on the training set.
metric_test: Metric score on the test set.
time_fit: Time spent fitting and evaluating.
mean_bootstrap: Mean score of the bootstrap results.
std_bootstrap: Standard deviation of the bootstrap results.
time_bootstrap: Time spent on the bootstrap algorithm.
time: Total time spent on the whole run.

Plot attributes

Attributes:

style: str
Plotting style. See seaborn's documentation.

palette: str
Color palette. See seaborn's documentation.

title_fontsize: int
Fontsize for the plot's title.

label_fontsize: int
Fontsize for labels and legends.

tick_fontsize: int
Fontsize for the ticks along the plot's axes.

Methods

available_models	Give an overview of the available predefined models.
canvas	Create a figure with multiple plots.
clear	Clear attributes from all models.
delete	Delete models from the trainer.
evaluate	Get all models' scores for the provided metrics.
get_params	Get parameters for this estimator.
log	Save information to the logger and print to stdout.
merge	Merge another trainer into this one.
reset_aesthetics	Reset the plot aesthetics to their default values.
run	Fit and evaluate the models.
save	Save the instance to a pickle file.
set_params	Set the parameters of this estimator.
stacking	Add a Stacking instance to the models in the pipeline.
voting	Add a Voting instance to the models in the pipeline.

method available_models() [source]

Give an overview of the available predefined models.

Returns:

pd.DataFrame
Information about the predefined models available for the current task. Columns include:

acronym: Model's acronym (used to call the model).
fullname: Complete name of the model.
estimator: The model's underlying estimator.
module: The estimator's module.
needs_scaling: Whether the model requires feature scaling.
accepts_sparse: Whether the model has native support for sparse matrices.
supports_gpu: Whether the model has GPU support.

method canvas(nrows=1, ncols=2, title=None, figsize=None, filename=None, display=True) [source]

This @contextmanager allows you to draw many plots in one figure. The default option is to add two plots side by side. See the user guide for an example.

Parameters:

nrows: int, optional (default=1)
Number of rows of plots in the canvas.

ncols: int, optional (default=2)
Number of columns of plots in the canvas.

title: str or None, optional (default=None)
Plot's title. If None, no title is displayed.

figsize: tuple or None, optional (default=None)
Figure's size, format as (x, y). If None, it adapts the size to the number of plots in the canvas.

filename: str or None, optional (default=None)
Name of the file. Use "auto" for automatic naming. If None, the figure is not saved.

display: bool, optional (default=True)
Whether to render the plot.

method clear() [source]

Reset all model attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the class. The cleared attributes per model are:

method delete(models=None) [source]

Delete models from the trainer. If all models are removed, the metric is reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. Deleted models are not removed from any active mlflow experiment.

Parameters:

models: str or sequence, optional (default=None)
Models to delete. If None, delete them all.

method evaluate(metric=None, dataset="test") [source]

Get all the models' scores for the provided metrics.

Parameters:

metric: str, func, scorer, sequence or None, optional (default=None)
Metrics to calculate. If None, a selection of the most common metrics per task is used.

dataset: str, optional (default="test")
Data set on which to calculate the metric. Choose from: "train", "test" or "holdout".

Returns:

pd.DataFrame
Scores of the models.

method get_class_weights(dataset="train") [source]

Return class weights for a balanced data set. Statistically, the class weights re-balance the data set so that the sampled data set represents the target population as closely as possible. The returned weights are inversely proportional to the class frequencies in the selected data set.

Parameters:

dataset: str, optional (default="train")
Data set from which to get the weights. Choose from: "train", "test" or "dataset".

Returns:

dict
Classes with the corresponding weights.

method get_params(deep=True) [source]

Get parameters for this estimator.

Parameters:

deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

dict
Parameter names mapped to their values.

method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.

method merge(other, suffix="2") [source]

Merge another trainer into this one. Branches, models, metrics and attributes of the other trainer are merged into this one. If there are branches and/or models with the same name, they are merged adding the suffix parameter to their name. The errors and missing attributes are extended with those of the other instance. It's only possible to merge two instances if they are initialized with the same dataset and trained with the same metric.

Parameters:

other: trainer
Trainer instance with which to merge.

suffix: str, optional (default="2")
Conflicting branches and models are merged adding suffix to the end of their names.
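
The effect of suffix on conflicting names can be sketched with plain dictionaries. merge_models is a hypothetical helper that only mimics the naming rule described above; it is not ATOM's merge logic.

```python
def merge_models(own, other, suffix="2"):
    """Merge two name -> model mappings, renaming conflicts (hypothetical helper)."""
    merged = dict(own)
    for name, model in other.items():
        # A name that already exists gets the suffix appended
        merged[name + suffix if name in merged else name] = model
    return merged


merged = merge_models({"RF": "rf_a", "OLS": "ols_a"}, {"RF": "rf_b", "GBM": "gbm_b"})
print(list(merged))  # ['RF', 'OLS', 'RF2', 'GBM']
```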

method reset_aesthetics() [source]

Reset the plot aesthetics to their default values.

method run(*arrays) [source]

Fit and evaluate the models.

Parameters:

*arrays: sequence of indexables
Training and test set (and optionally a holdout set). Allowed formats are:

train, test
train, test, holdout
X_train, X_test, y_train, y_test
X_train, X_test, X_holdout, y_train, y_test, y_holdout
(X_train, y_train), (X_test, y_test)
(X_train, y_train), (X_test, y_test), (X_holdout, y_holdout)
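
The accepted layouts differ in the number of arrays and in whether they are (X, y) tuples. The sketch below names the layout for a given input. describe_arrays is a hypothetical helper, and it assumes that a data set itself is never passed as a tuple of length two.

```python
def describe_arrays(*arrays):
    """Name the layout of the inputs given to run (hypothetical helper)."""
    if all(isinstance(a, tuple) and len(a) == 2 for a in arrays):
        layouts = {
            2: "(X_train, y_train), (X_test, y_test)",
            3: "(X_train, y_train), (X_test, y_test), (X_holdout, y_holdout)",
        }
    else:
        layouts = {
            2: "train, test",
            3: "train, test, holdout",
            4: "X_train, X_test, y_train, y_test",
            6: "X_train, X_test, X_holdout, y_train, y_test, y_holdout",
        }
    if len(arrays) not in layouts:
        raise ValueError(f"Unsupported number of arrays: {len(arrays)}")
    return layouts[len(arrays)]


# Nested lists stand in for data frames
print(describe_arrays([[1, 2]], [[3, 4]]))
print(describe_arrays(([[1]], [2]), ([[3]], [4])))
```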

method save(filename="auto", save_data=True) [source]

Save the instance to a pickle file. Remember that the class contains the complete dataset as attribute, so the file can become large for big datasets! To avoid this, use save_data=False.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.

save_data: bool, optional (default=True)
Whether to save the data as an attribute of the instance. If False, remember to add the data to ATOMLoader when loading the file.

method set_params(**params) [source]

Set the parameters of this estimator.

Parameters:

**params: dict
Estimator parameters.

Returns:

TrainSizingRegressor
Estimator instance.

method stacking(name="Stack", models=None, **kwargs) [source]

Add a Stacking model to the pipeline.

Parameters:

name: str, optional (default="Stack")
Name of the model. The name is always preceded by the model's acronym: Stack.

models: sequence or None, optional (default=None)
Models that feed the stacking estimator. If None, it selects all non-ensemble models trained on the current branch.

**kwargs
Additional keyword arguments for sklearn's StackingRegressor instance. The predefined model's acronyms can be used for the final_estimator parameter.

method voting(name="Vote", models=None, **kwargs) [source]

Add a Voting model to the pipeline.

Parameters:

name: str, optional (default="Vote")
Name of the model. The name is always preceded by the model's acronym: Vote.

models: sequence or None, optional (default=None)
Models that feed the voting estimator. If None, it selects all non-ensemble models trained on the current branch.

**kwargs
Additional keyword arguments for sklearn's VotingRegressor instance.

Example

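from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from atom.training import TrainSizingRegressor

# Load a regression dataset and split it into a training and test set
X, y = load_diabetes(return_X_y=True, as_frame=True)
train, test = train_test_split(X.join(y), test_size=0.3, random_state=1)

# Run the pipeline
trainer = TrainSizingRegressor("RF", n_calls=5, n_initial_points=3)
trainer.run(train, test)

# Analyze the results
trainer.plot_learning_curve()
print(trainer.results)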