TrainSizingClassifier

class atom.training.TrainSizingClassifier(models=None, metric=None, greater_is_better=True, needs_proba=False, needs_threshold=False, train_sizes=5, n_calls=0, n_initial_points=5, est_params=None, bo_params=None, n_bootstrap=0, n_jobs=1, verbose=0, warnings=True, logger=None, experiment=None, random_state=None) [source]

Fit and evaluate the models in a train sizing fashion, i.e. on increasingly large subsets of the training set. The following steps are applied to every model (per iteration):

Hyperparameter tuning is performed using a Bayesian Optimization approach (optional).
The model is fitted on the training set using the best combination of hyperparameters found.
The model is evaluated on the test set.
The model is trained on various bootstrapped samples of the training set and scored again on the test set (optional).

You can predict, plot and call any model from the instance. Read more in the user guide.

Parameters:

models: str, estimator or sequence, optional (default=None)
Models to fit to the data. Allowed inputs are: an acronym from any of ATOM's predefined models, an ATOMModel or a custom estimator as class or instance. If None, all the predefined models are used. Available predefined models are:

"GP" for Gaussian Process
"GNB" for Gaussian Naive Bayes
"MNB" for Multinomial Naive Bayes
"BNB" for Bernoulli Naive Bayes
"CatNB" for Categorical Naive Bayes
"CNB" for Complement Naive Bayes
"Ridge" for Ridge Classification
"LR" for Logistic Regression
"LDA" for Linear Discriminant Analysis
"QDA" for Quadratic Discriminant Analysis
"KNN" for K-Nearest Neighbors
"RNN" for Radius Nearest Neighbors
"Tree" for a single Decision Tree
"Bag" for Bagging
"ET" for Extra-Trees
"RF" for Random Forest
"AdaB" for AdaBoost
"GBM" for Gradient Boosting Machine
"XGB" for XGBoost (only available if package is installed)
"LGB" for LightGBM (only available if package is installed)
"CatB" for CatBoost (only available if package is installed)
"lSVM" for Linear-SVM
"kSVM" for Kernel-SVM
"PA" for Passive Aggressive
"SGD" for Stochastic Gradient Descent
"MLP" for Multi-layer Perceptron

metric: str, func, scorer, sequence or None, optional (default=None)
Metric on which to fit the models. Choose from any of sklearn's SCORERS, a function with signature metric(y_true, y_pred), a scorer object or a sequence of these. If multiple metrics are selected, only the first is used to optimize the BO. If None, a default metric is selected:

"f1" for binary classification
"f1_weighted" for multiclass classification
"r2" for regression

greater_is_better: bool or sequence, optional (default=True)
Whether the metric is a score function or a loss function, i.e. if True, a higher score is better and if False, lower is better. This parameter is ignored if the metric is a string or a scorer. If sequence, the n-th value applies to the n-th metric.
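As a minimal sketch of a custom metric, assume a hypothetical loss function error_rate that follows the metric(y_true, y_pred) signature described above. Since lower values are better for a loss, it would be passed together with greater_is_better=False. The constructor call is shown only as a comment and is based on the signature documented on this page.

```python
def error_rate(y_true, y_pred):
    """Fraction of misclassified samples (a loss: lower is better)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length.")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Hypothetical usage, following the signature documented above:
# trainer = TrainSizingClassifier("LR", metric=error_rate, greater_is_better=False)

print(error_rate([1, 0, 1, 1], [1, 1, 1, 0]))  # 2 of 4 predictions are wrong
```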

needs_proba: bool or sequence, optional (default=False)
Whether the metric function requires probability estimates out of a classifier. If True, make sure that every selected model has a predict_proba method. This parameter is ignored if the metric is a string or a scorer. If sequence, the n-th value applies to the n-th metric.

needs_threshold: bool or sequence, optional (default=False)
Whether the metric function takes a continuous decision certainty. This only works for binary classification using estimators that have either a decision_function or predict_proba method. This parameter is ignored if the metric is a string or a scorer. If sequence, the n-th value applies to the n-th metric.

train_sizes: int or sequence, optional (default=5)
Sequence of training set sizes used to run the trainings.

If int: Number of equally distributed splits, i.e. for a value N it's equal to np.linspace(1.0/N, 1.0, N).
If sequence: Fraction of the training set when <=1, else total number of samples.
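The two input forms can be illustrated with a small sketch. The helper resolve_train_sizes is hypothetical (it is not part of ATOM), and rounding fractions to the nearest whole sample is an assumption; the library may round differently.

```python
def resolve_train_sizes(train_sizes, n_train):
    """Translate `train_sizes` into absolute numbers of training samples."""
    if isinstance(train_sizes, int):
        # Same values as np.linspace(1.0 / N, 1.0, N), without needing numpy.
        n = train_sizes
        train_sizes = [(i + 1) / n for i in range(n)]

    sizes = []
    for size in train_sizes:
        if size <= 1:
            # Fraction of the training set (rounding is an assumption).
            sizes.append(round(size * n_train))
        else:
            # Already an absolute number of samples.
            sizes.append(int(size))
    return sizes


print(resolve_train_sizes(5, 1000))
print(resolve_train_sizes([0.1, 0.5, 300], 1000))
```

With the default train_sizes=5, every model is therefore trained five times, on 20%, 40%, 60%, 80% and 100% of the training set.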

n_calls: int or sequence, optional (default=0)
Maximum number of iterations of the BO. It includes the random points of n_initial_points. If 0, skip the BO and fit the model on its default parameters. If sequence, the n-th value applies to the n-th model.

n_initial_points: int or sequence, optional (default=5)
Initial number of random tests of the BO before fitting the surrogate function. If equal to n_calls, the optimizer will technically be performing a random search. If sequence, the n-th value applies to the n-th model.

est_params: dict, optional (default=None)
Additional parameters for the estimators. See the corresponding documentation for the available options. For multiple models, use the acronyms as key and a dictionary of the parameters as value. Add _fit to the parameter's name to pass it to the fit method instead of the initializer.
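The naming convention can be sketched as follows. The helper split_est_params is hypothetical and only illustrates how a _fit suffix could separate fit parameters from initializer parameters; it is not ATOM's implementation. The parameter names in the dictionary are examples only.

```python
def split_est_params(est_params):
    """Separate initializer parameters from those ending in `_fit`."""
    init_params, fit_params = {}, {}
    for name, value in est_params.items():
        if name.endswith("_fit"):
            # Strip the suffix: the remainder is the name used by fit().
            fit_params[name[: -len("_fit")]] = value
        else:
            init_params[name] = value
    return init_params, fit_params


# For multiple models, the acronyms are the keys.
est_params = {
    "RF": {"n_estimators": 200, "max_depth": 5},
    "XGB": {"n_estimators": 200, "verbose_fit": False},
}

for acronym, params in est_params.items():
    print(acronym, split_est_params(params))
```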

bo_params: dict, optional (default=None)
Additional parameters for the BO. These can include:

base_estimator: str, optional (default="GP")
Base estimator to use in the BO. Choose from:
- "GP" for Gaussian Process
- "RF" for Random Forest
- "ET" for Extra-Trees
- "GBRT" for Gradient Boosted Regression Trees
max_time: int, optional (default=np.inf)
Stop the optimization after max_time seconds.
delta_x: int or float, optional (default=0)
Stop the optimization when |x1 - x2| < delta_x.
delta_y: int or float, optional (default=0)
Stop the optimization if the 5 best minima are within delta_y of each other (the function is always minimized).
cv: int, optional (default=5)
Number of folds for the cross-validation. If 1, the training set is randomly split in a subtrain and validation set.
early_stopping: int, float or None, optional (default=None)
Training will stop if the model didn't improve in the last early_stopping rounds. If <1, fraction of rounds from the total. If None, no early stopping is performed. Only available for models that allow in-training evaluation.
callback: callable or list of callables, optional (default=None)
Callbacks for the BO.
dimensions: dict, array or None, optional (default=None)
Custom hyperparameter space for the Bayesian optimization. Can be an array to share dimensions across models or a dictionary with the model's name as key. If None, ATOM's predefined dimensions are used.
plot: bool, optional (default=False)
Whether to plot the BO's progress as it runs. Creates a canvas with two plots: the first plot shows the score of every trial and the second shows the distance between the last consecutive steps.
**kwargs
Additional keyword arguments for skopt's optimizer.
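The delta_y stopping rule can be illustrated with a short sketch. The function should_stop_delta_y is hypothetical and is modelled on the idea of comparing the best evaluations found so far; it is not the code used by ATOM or skopt. Because the function is always minimized, func_vals holds the minimized objective values, where lower is better.

```python
def should_stop_delta_y(func_vals, delta_y, n_best=5):
    """Return True when the n_best lowest values lie within delta_y."""
    if len(func_vals) < n_best:
        return False  # not enough evaluations yet
    best = sorted(func_vals)[:n_best]
    # Gap between the worst and the best of the n_best lowest values.
    return best[-1] - best[0] < delta_y


history = [0.50, 0.31, 0.30, 0.305, 0.302, 0.301]
print(should_stop_delta_y(history, delta_y=0.02))
print(should_stop_delta_y(history, delta_y=0.005))
```

With the default delta_y=0, the gap can never be smaller than the threshold, so this criterion never triggers.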

n_bootstrap: int or sequence, optional (default=0)
Number of data sets (bootstrapped from the training set) to use in the bootstrap algorithm. If 0, no bootstrap is performed. If sequence, the n-th value applies to the n-th model.
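The bootstrap step can be sketched in plain Python. Everything named here (bootstrap_scores, fit_and_score, the toy data) is hypothetical and only illustrates the idea of refitting on samples drawn with replacement from the training set and scoring each fit on the same test set. Drawing samples of the same size as the training set and using the sample standard deviation are assumptions.

```python
import random
import statistics


def bootstrap_scores(train, n_bootstrap, fit_and_score, seed=None):
    """Score a model on n_bootstrap resamples of the training set."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_bootstrap):
        # Sample with replacement, same size as the training set.
        sample = rng.choices(train, k=len(train))
        scores.append(fit_and_score(sample))
    return scores


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 3.5]


def fit_and_score(sample):
    # Toy "model": predict the sample mean for every test point.
    # The score is the negative mean absolute error on the test set.
    prediction = statistics.mean(sample)
    return -statistics.mean(abs(prediction - y) for y in test)


scores = bootstrap_scores(train, n_bootstrap=10, fit_and_score=fit_and_score, seed=1)
print(statistics.mean(scores), statistics.stdev(scores))
```

The mean and standard deviation computed at the end correspond to the mean_bootstrap and std_bootstrap columns of the results attribute.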

n_jobs: int, optional (default=1)
Number of cores to use for parallel processing.

If >0: Number of cores to use.
If -1: Use all available cores.
If <-1: Use available_cores + 1 + n_jobs, e.g. -2 uses all cores but one.

verbose: int, optional (default=0)
Verbosity level of the class. Possible values are:

0 to not print anything.
1 to print basic information.
2 to print detailed information.

warnings: bool or str, optional (default=True)

If True: Default warning action (equal to "default").
If False: Suppress all warnings (equal to "ignore").
If str: One of the actions in python's warnings environment.

Changing this parameter affects the PYTHONWARNINGS environment.
ATOM can't manage warnings that go directly from C/C++ code to stdout.

logger: str, Logger or None, optional (default=None)

If None: Doesn't save a logging file.
If str: Name of the log file. Use "auto" for automatic naming.
Else: Python logging.Logger instance.

experiment: str or None, optional (default=None)
Name of the mlflow experiment to use for tracking. If None, no mlflow tracking is performed.

random_state: int or None, optional (default=None)
Seed used by the random number generator. If None, the random number generator is the RandomState instance used by numpy.random.

Attributes

Data attributes

The dataset can be accessed at any time through multiple attributes, e.g. calling trainer.train will return the training set. Updating one of the data attributes will automatically update the rest as well. Changing the branch will also change the response from these attributes accordingly.

Attributes:

dataset: pd.DataFrame
Complete dataset in the pipeline.

train: pd.DataFrame
Training set.

test: pd.DataFrame
Test set.

X: pd.DataFrame
Feature set.

y: pd.Series
Target column.

X_train: pd.DataFrame
Training features.

y_train: pd.Series
Training target.

X_test: pd.DataFrame
Test features.

y_test: pd.Series
Test target.

shape: tuple
Dataset's shape: (n_rows x n_columns) or (n_rows, (shape_sample), n_cols) for datasets with more than two dimensions.

columns: list
Names of the columns in the dataset.

n_columns: int
Number of columns in the dataset.

features: list
Names of the features in the dataset.

n_features: int
Number of features in the dataset.

target: str
Name of the target column.

Utility attributes

Attributes:

models: list
List of models in the pipeline.

metric: str or list
Metric(s) used to fit the models.

errors: dict
Dictionary of the encountered exceptions (if any).

winner: model
Model subclass that performed best on the test set.

results: pd.DataFrame
Dataframe of the training results. Columns can include:

metric_bo: Best score achieved during the BO.
time_bo: Time spent on the BO.
metric_train: Metric score on the training set.
metric_test: Metric score on the test set.
time_fit: Time spent fitting and evaluating.
mean_bootstrap: Mean score of the bootstrap results.
std_bootstrap: Standard deviation score of the bootstrap results.
time_bootstrap: Time spent on the bootstrap algorithm.
time: Total time spent on the whole run.

Plot attributes

Attributes:

style: str
Plotting style. See seaborn's documentation.

palette: str
Color palette. See seaborn's documentation.

title_fontsize: int
Fontsize for the plot's title.

label_fontsize: int
Fontsize for labels and legends.

tick_fontsize: int
Fontsize for the ticks along the plot's axes.

Methods

calibrate	Calibrate the winning model.
canvas	Create a figure with multiple plots.
cross_validate	Evaluate the winning model using cross-validation.
delete	Remove a model from the pipeline.
get_class_weight	Return class weights for a balanced dataset.
get_params	Get parameters for this estimator.
log	Save information to the logger and print to stdout.
reset_aesthetics	Reset the plot aesthetics to their default values.
reset_predictions	Clear the prediction attributes from all models.
run	Fit and evaluate the models.
save	Save the instance to a pickle file.
evaluate	Get all models' scores for the provided metrics.
set_params	Set the parameters of this estimator.
stacking	Add a Stacking instance to the models in the pipeline.
voting	Add a Voting instance to the models in the pipeline.

method calibrate(**kwargs) [source]

Applies probability calibration on the winning model. The estimator is trained via cross-validation on a subset of the training data, using the rest to fit the calibrator. The new classifier will replace the estimator attribute and is logged to any active mlflow experiment. Since the estimator changed, all the model's prediction attributes are reset.

Tip

Use the plot_calibration method to visualize a model's calibration.

Parameters:

**kwargs
Additional keyword arguments for sklearn's CalibratedClassifierCV. Using cv="prefit" will use the trained model and fit the calibrator on the test set. Use this only if you have another, independent set for testing.

method canvas(nrows=1, ncols=2, title=None, figsize=None, filename=None, display=True) [source]

This @contextmanager allows you to draw many plots in one figure. The default option is to add two plots side by side. See the user guide for an example.

Parameters:

nrows: int, optional (default=1)
Number of plots in length (rows of the canvas).

ncols: int, optional (default=2)
Number of plots in width (columns of the canvas).

title: str or None, optional (default=None)
Plot's title. If None, no title is displayed.

figsize: tuple or None, optional (default=None)
Figure's size, format as (x, y). If None, it adapts the size to the number of plots in the canvas.

filename: str or None, optional (default=None)
Name of the file. Use "auto" for automatic naming. If None, the figure is not saved.

display: bool, optional (default=True)
Whether to render the plot.

method cross_validate(**kwargs) [source]

Evaluate the winning model using cross-validation. This method cross-validates the whole pipeline on the complete dataset. Use it to assess the robustness of the model's performance.

Parameters:

**kwargs
Additional keyword arguments for sklearn's cross_validate function. If the scoring method is not specified, it uses the trainer's metric.

Returns:

scores: dict
Return of sklearn's cross_validate function.

method delete(models=None) [source]

Delete a model from the trainer. If the winning model is removed, the next best model (through metric_test or mean_bootstrap) is selected as winner. If all models are removed, the metric and training approach are reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. Deleted models are not removed from any active mlflow experiment.

Parameters:

models: str or sequence, optional (default=None)
Models to delete. If None, delete them all.

method get_class_weight(dataset="train") [source]

Return class weights for a balanced data set. Statistically, the class weights re-balance the data set so that the sampled data set represents the target population as closely as possible. The returned weights are inversely proportional to the class frequencies in the selected data set.

Parameters:

dataset: str, optional (default="train")
Data set from which to get the weights. Choose between "train", "test" or "dataset".

Returns:

class_weights: dict
Classes with the corresponding weights.
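To make "inversely proportional to the class frequencies" concrete, the sketch below uses scikit-learn's "balanced" heuristic, n_samples / (n_classes * count). The function balanced_class_weights is hypothetical, and the exact scaling ATOM applies may differ; only the inverse proportionality is taken from the description above.

```python
from collections import Counter


def balanced_class_weights(y):
    """Map every class to a weight inversely proportional to its frequency."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Three samples of class 0 and one of class 1:
# the minority class receives the larger weight.
print(balanced_class_weights([0, 0, 0, 1]))
```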

method get_params(deep=True) [source]

Get parameters for this estimator.

Parameters:

deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params: dict
Dictionary of the parameter names mapped to their values.

method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.

method reset_aesthetics() [source]

Reset the plot aesthetics to their default values.

method reset_predictions() [source]

Clear the prediction attributes from all models. Use this method to free some memory before saving the trainer.

method run(*arrays) [source]

Fit and evaluate the models.

Parameters:

*arrays: sequence of indexables
Training set and test set. Allowed input formats are:

train, test
X_train, X_test, y_train, y_test
(X_train, y_train), (X_test, y_test)

method save(filename="auto", save_data=True) [source]

Save the instance to a pickle file. Remember that the class contains the complete dataset as attribute, so the file can become large for big datasets! To avoid this, use save_data=False.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.

save_data: bool, optional (default=True)
Whether to save the data as an attribute of the instance. If False, remember to add the data to ATOMLoader when loading the file.

method evaluate(metric=None, dataset="test") [source]

Get all the models' scores for the provided metrics.

Parameters:

metric: str, func, scorer, sequence or None, optional (default=None)
Metrics to calculate. If None, a selection of the most common metrics per task are used.

dataset: str, optional (default="test")
Data set on which to calculate the metric. Options are "train" or "test".

Returns:

scores: pd.DataFrame
Scores of the models.

method set_params(**params) [source]

Set the parameters of this estimator.

Parameters:

**params: dict
Estimator parameters.

Returns:

self: TrainSizingClassifier
Estimator instance.

method stacking(models=None, estimator=None, stack_method="auto", passthrough=False) [source]

Add a Stacking instance to the models in the pipeline.

Parameters:

models: sequence or None, optional (default=None)
Models that feed the stacking. If None, it selects all models depending on the current branch.

estimator: str, callable or None, optional (default=None)
The final estimator, which is used to combine the base estimators. If str, choose from ATOM's predefined models. If None, Logistic Regression is selected.

stack_method: str, optional (default="auto")
Methods called for each base estimator. If "auto", it will try to invoke predict_proba, decision_function or predict in that order.

passthrough: bool, optional (default=False)
When False, only the predictions of estimators are used as training data for the final estimator. When True, the estimator is trained on the predictions as well as the original training data. The passed dataset is scaled if any of the models require scaled features and they are not already.

method voting(models=None, weights=None) [source]

Add a Voting instance to the models in the pipeline.

Parameters:

models: sequence or None, optional (default=None)
Models that feed the voting. If None, it selects all models depending on the current branch.

weights: sequence or None, optional (default=None)
Sequence of weights (int or float) to weight the occurrences of predicted class labels (hard voting) or class probabilities before averaging (soft voting). Uses uniform weights if None.
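The effect of weights in soft voting can be sketched as a weighted average of the class probabilities predicted by each model. The function soft_vote is hypothetical and illustrates the averaging for a single sample; it is not the Voting implementation itself.

```python
def soft_vote(probas, weights=None):
    """Weighted average of class probabilities from several models."""
    if weights is None:
        weights = [1] * len(probas)  # uniform weights
    total = sum(weights)
    n_classes = len(probas[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(n_classes)
    ]


# Two models disagree on one sample; the second model counts three times as much.
probas = [[0.9, 0.1], [0.4, 0.6]]
print(soft_vote(probas))                  # uniform weights favor class 0
print(soft_vote(probas, weights=[1, 3]))  # class 0 still wins, but narrowly
```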

Example

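from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from atom.training import TrainSizingClassifier

# Load an example dataset and split it into a train and test set
data = load_breast_cancer(as_frame=True).frame
train, test = train_test_split(data, test_size=0.3, random_state=1)

# Run the pipeline
trainer = TrainSizingClassifier("RF", n_calls=5, n_initial_points=3)
trainer.run(train, test)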

# Analyze the results
trainer.plot_learning_curve()