TrainSizingClassifier
Fit and evaluate the models on training sets of various sizes (train sizing). The following steps are applied to every model (per iteration):
- Hyperparameter tuning is performed using a Bayesian Optimization approach (optional).
- The model is fitted on the training set using the best combination of hyperparameters found.
- The model is evaluated on the test set.
- The model is trained on various bootstrapped samples of the training set and scored again on the test set (optional).
You can predict, plot and call any model from the instance. Read more in the user guide.
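The per-iteration steps listed above can be illustrated with a minimal, library-free sketch. It is not ATOM's implementation: the toy nearest-centroid classifier, the synthetic data and the helper names (`fit_centroids`, `accuracy`, `train_sizing`) are all assumptions made for illustration, and hyperparameter tuning is left out. Only the overall loop (fit on a subset, score on the test set, optionally rescore on bootstrapped samples) mirrors the description.

```python
import random
from statistics import mean


def fit_centroids(rows):
    """Fit a toy nearest-centroid classifier on (feature, label) pairs."""
    by_class = {}
    for x, label in rows:
        by_class.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_class.items()}


def accuracy(centroids, rows):
    """Fraction of rows whose nearest centroid matches the true label."""
    hits = sum(
        min(centroids, key=lambda c: abs(x - centroids[c])) == label
        for x, label in rows
    )
    return hits / len(rows)


def train_sizing(train, test, fractions, n_bootstrap=0, seed=1):
    """Fit on subsets of `train` of various sizes and score each fit on `test`."""
    rng = random.Random(seed)
    results = []
    for frac in fractions:
        subset = train[: max(2, round(frac * len(train)))]
        score = accuracy(fit_centroids(subset), test)
        # Optional step: refit on bootstrapped samples (drawn with replacement)
        boot = [
            accuracy(fit_centroids(rng.choices(subset, k=len(subset))), test)
            for _ in range(n_bootstrap)
        ]
        results.append({
            "n_train": len(subset),
            "score": score,
            "mean_bootstrap": mean(boot) if boot else None,
        })
    return results


# Synthetic two-class data: class 0 centred at 0.0, class 1 at 3.0
rng = random.Random(0)
data = [(rng.gauss(3.0 * label, 1.0), label) for label in [0, 1] * 100]
rng.shuffle(data)
train, test = data[:150], data[150:]

results = train_sizing(
    train, test, fractions=[0.2, 0.4, 0.6, 0.8, 1.0], n_bootstrap=5
)
for row in results:
    print(row)
```

Each row holds the size of the training subset, the test score of the single fit and the mean test score over the bootstrapped fits, which is the kind of information a learning curve is drawn from.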
Parameters: |
models: str, estimator or sequence, optional (default=None) Models to fit to the data. Allowed inputs are: an acronym from any of ATOM's predefined models, an ATOMModel or a custom estimator as class or instance. If None, all the predefined models are used. Available predefined models are:
metric: Metric on which to fit the models. Choose from any of sklearn's SCORERS, a function with signature metric(y_true, y_pred), a scorer object or a sequence of these. If multiple metrics are selected, only the first is used to optimize the BO. If None, a default metric is selected:
greater_is_better: bool or sequence, optional (default=True)
needs_proba: bool or sequence, optional (default=False)
needs_threshold: bool or sequence, optional (default=False)
train_sizes: Sequence of training set sizes used to run the trainings.
n_calls: int or sequence, optional (default=0)
n_initial_points: int or sequence, optional (default=5)
est_params: dict, optional (default=None)
bo_params: Additional parameters for the BO. These can include:
bootstrap: int or sequence, optional (default=0)
n_jobs: Number of cores to use for parallel processing.
verbose: Verbosity level of the class. Possible values are:
Changing this parameter affects the
experiment: str or None, optional (default=None)
random_state: int or None, optional (default=None) |
Attributes
Data attributes
The dataset can be accessed at any time through multiple attributes, e.g. calling trainer.train will return the training set. Updating one of the data attributes will automatically update the rest as well. Changing the branch will also change the response from these attributes accordingly.
Attributes: |
dataset: pd.DataFrame
train: pd.DataFrame
test: pd.DataFrame
X: pd.DataFrame
y: pd.Series
X_train: pd.DataFrame
y_train: pd.Series
X_test: pd.DataFrame
y_test: pd.Series
shape: tuple
columns: list
n_columns: int
features: list
n_features: int
target: str |
Utility attributes
Attributes: |
models: list
metric: str or list
errors: dict
winner: model
results: DataFrame of the training results. Columns can include:
|
Plot attributes
Attributes: |
style: str
palette: str
title_fontsize: int
label_fontsize: int
tick_fontsize: int |
Methods
calibrate | Calibrate the winning model. |
canvas | Create a figure with multiple plots. |
cross_validate | Evaluate the winning model using cross-validation. |
delete | Remove a model from the pipeline. |
get_class_weight | Return class weights for a balanced dataset. |
get_params | Get parameters for this estimator. |
log | Save information to the logger and print to stdout. |
reset_aesthetics | Reset the plot aesthetics to their default values. |
reset_predictions | Clear the prediction attributes from all models. |
run | Fit and evaluate the models. |
save | Save the instance to a pickle file. |
scoring | Get the scores of all models for the provided metrics. |
set_params | Set the parameters of this estimator. |
stacking | Add a Stacking instance to the models in the pipeline. |
voting | Add a Voting instance to the models in the pipeline. |
Applies probability calibration on the winning model. The estimator is trained via cross-validation on a subset of the training data, using the rest to fit the calibrator. The new classifier will replace the estimator attribute and is logged to any active mlflow experiment. Since the estimator changed, all the model's prediction attributes are reset.
Tip
Use the plot_calibration method to visualize a model's calibration.
Parameters: |
**kwargs Additional keyword arguments for sklearn's CalibratedClassifierCV. Using cv="prefit" will use the trained model and fit the calibrator on the test set. Note that doing this will result in data leakage in the test set. Use this only if you have another, independent set for testing. |
This @contextmanager allows you to draw many plots in one figure. The default option is to add two plots side by side. See the user guide for an example.
Parameters: |
nrows: int, optional (default=1)
ncols: int, optional (default=2)
title: str or None, optional (default=None)
figsize: tuple or None, optional (default=None)
filename: str or None, optional (default=None)
display: bool, optional (default=True) |
Evaluate the winning model using cross-validation. This method cross-validates the whole pipeline on the complete dataset. Use it to assess the robustness of the model's performance.
Parameters: |
**kwargs Additional keyword arguments for sklearn's cross_validate function. If the scoring method is not specified, it uses the trainer's metric. |
Returns: |
scores: dict Return of sklearn's cross_validate function. |
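To make the idea of cross-validating on the complete dataset concrete, here is a minimal k-fold sketch in plain Python. It is an illustration only and does not reproduce sklearn's cross_validate: the helpers `k_fold_indices` and `cross_validate_sketch`, the majority-class "model" and the toy data are assumptions, and only the `test_score` key of the returned dictionary mimics sklearn's output.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [
        n_samples // n_splits + (i < n_samples % n_splits)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if not start <= i < start + size]
        yield train_idx, test_idx
        start += size


def cross_validate_sketch(fit, score, X, y, n_splits=5):
    """Refit on each training fold and score on the held-out fold."""
    test_scores = []
    for train_idx, test_idx in k_fold_indices(len(X), n_splits):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        test_scores.append(
            score(model, [X[i] for i in test_idx], [y[i] for i in test_idx])
        )
    return {"test_score": test_scores}


def fit_majority(X, y):
    """Toy 'model': remember the majority class seen during fitting."""
    return max(set(y), key=y.count)


def accuracy(model, X, y):
    """Share of held-out labels equal to the remembered majority class."""
    return sum(label == model for label in y) / len(y)


X = list(range(10))
y = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
scores = cross_validate_sketch(fit_majority, accuracy, X, y, n_splits=5)
print(scores)
```

A large spread between the per-fold scores suggests that the model's performance depends strongly on which rows end up in the training set.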
Delete a model from the trainer. If the winning model is removed, the next best model (through metric_test or mean_bootstrap) is selected as winner. If all models are removed, the metric and training approach are reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. Deleted models are not removed from any active mlflow experiment.
Parameters: |
models: str or sequence, optional (default=None) Models to delete. If None, delete them all. |
Return class weights for a balanced data set. Statistically, the class weights re-balance the data set so that the sampled data set represents the target population as closely as possible. The returned weights are inversely proportional to the class frequencies in the selected data set.
Parameters: |
dataset: str, optional (default="train") Data set from which to get the weights. Choose between "train", "test" or "dataset". |
Returns: |
class_weights: dict Classes with the corresponding weights. |
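As an illustration of weights that are inversely proportional to class frequencies, the sketch below uses one common convention, n_samples / (n_classes * class_count), which is the formula behind sklearn's "balanced" class weights. Whether get_class_weight scales its weights in exactly this way is an assumption; the function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in sorted(counts.items())}


# Imbalanced toy target: six samples of class 0, two of class 1
y_train = [0, 0, 0, 0, 0, 0, 1, 1]
class_weights = balanced_class_weights(y_train)
print(class_weights)
```

The minority class receives the larger weight, so every class contributes the same total weight to a weighted loss.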
Get parameters for this estimator.
Parameters: |
deep: bool, optional (default=True) |
Returns: |
params: dict Dictionary of the parameter names mapped to their values. |
Write a message to the logger and print it to stdout.
Parameters: |
msg: str
level: int, optional (default=0) |
Reset the plot aesthetics to their default values.
Clear the prediction attributes from all models.
Use this method to free some memory before saving the trainer.
Fit and evaluate the models.
Parameters: |
*arrays: sequence of indexables Training set and test set. Allowed input formats are:
|
Save the instance to a pickle file. Remember that the class contains the complete dataset as attribute, so the file can become large for big datasets! To avoid this, use save_data=False.
Parameters: |
filename: str, optional (default="auto")
save_data: bool, optional (default=True) |
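The effect of leaving the dataset out of the pickle can be shown with a small stand-in class. `TinyTrainer` is hypothetical and much simpler than the real trainer; the sketch only assumes that dropping the data before pickling is what makes the file smaller.

```python
import os
import pickle
import tempfile


class TinyTrainer:
    """Hypothetical stand-in for a trainer that stores its dataset."""

    def __init__(self, data):
        self.data = data
        self.models = ["RF"]

    def save(self, filename, save_data=True):
        state = dict(self.__dict__)
        if not save_data:
            state["data"] = None  # drop the bulky dataset before pickling
        with open(filename, "wb") as f:
            pickle.dump(state, f)


trainer = TinyTrainer(data=list(range(100_000)))

with tempfile.TemporaryDirectory() as tmp:
    full = os.path.join(tmp, "full.pkl")
    light = os.path.join(tmp, "light.pkl")
    trainer.save(full)
    trainer.save(light, save_data=False)
    full_size, light_size = os.path.getsize(full), os.path.getsize(light)
    with open(light, "rb") as f:
        restored = pickle.load(f)

print(full_size, light_size)
```

The lighter file still restores everything except the data, so the dataset has to be supplied again before it can be used.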
Get the scores of all models for the provided metrics.
Parameters: |
metric: str, func, scorer, sequence or None, optional (default=None)
dataset: str, optional (default="test") |
Returns: |
score: pd.DataFrame Scoring of the models. |
Set the parameters of this estimator.
Parameters: |
**params: dict Estimator parameters. |
Returns: |
self: TrainSizingClassifier Estimator instance. |
Add a Stacking instance to the models in the pipeline.
Parameters: |
models: sequence or None, optional (default=None)
estimator: str, callable or None, optional (default=None)
stack_method: str, optional (default="auto")
passthrough: bool, optional (default=False) When False, only the predictions of estimators are used as training data for the final estimator. When True, the estimator is trained on the predictions as well as the original training data. The passed dataset is scaled if any of the models require scaled features and they are not already. |
Add a Voting instance to the models in the pipeline.
Parameters: |
models: sequence or None, optional (default=None)
weights: sequence or None, optional (default=None) |
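One way to picture what the weights parameter does is weighted soft voting, where the class probabilities of the models are averaged with the given weights. The sketch below is an assumption about the general technique, not a description of how the Voting instance is implemented; `soft_vote` and the probabilities are made up for illustration.

```python
def soft_vote(probas, weights=None):
    """Combine per-model class probabilities into one weighted prediction."""
    weights = weights or [1.0] * len(probas)
    total = sum(weights)
    n_classes = len(probas[0])
    averaged = [
        sum(w * p[k] for w, p in zip(weights, probas)) / total
        for k in range(n_classes)
    ]
    # Return the index of the most probable class and the averaged probabilities
    return averaged.index(max(averaged)), averaged


# Probabilities for classes (0, 1) from three hypothetical models, one sample
probas = [[0.6, 0.4], [0.3, 0.7], [0.45, 0.55]]

label_equal, avg_equal = soft_vote(probas)
label_weighted, avg_weighted = soft_vote(probas, weights=[5, 1, 1])
print(label_equal, label_weighted)
```

With equal weights the two models favouring class 1 win; giving the first model five times the weight flips the prediction to class 0.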
Example
from atom.training import TrainSizingClassifier
# Run the pipeline (train and test are assumed to be existing training and test sets)
trainer = TrainSizingClassifier("RF", n_calls=5, n_initial_points=3)
trainer.run(train, test)
# Analyze the results
trainer.plot_learning_curve()