DirectRegressor
class atom.training.DirectRegressor(models=None,
metric=None, greater_is_better=True, needs_proba=False, needs_threshold=False,
n_calls=0, n_initial_points=5, est_params=None, bo_params=None, n_bootstrap=0,
n_jobs=1, verbose=0, warnings=True, logger=None, experiment=None, random_state=None)
Fit and evaluate the models. The following steps are applied to every model:
- Hyperparameter tuning is performed using a Bayesian Optimization
approach (optional).
- The model is fitted on the training set using the best combination
of hyperparameters found.
- The model is evaluated on the test set.
- The model is trained on various bootstrapped samples of the training
set and scored again on the test set (optional).
You can predict, plot and call any model from the instance.
Read more in the user guide.
Parameters: |
models: str, estimator or sequence, optional (default=None)
Models to fit to the data. Allowed inputs are: an acronym from any of
ATOM's predefined models, an ATOMModel
or a custom estimator as class or instance. If None, all the predefined
models are used. Available predefined models are:
metric: str, func, scorer, sequence or None, optional (default=None)
Metric on which to fit the models. Choose from any of sklearn's
SCORERS,
a function with signature metric(y_true, y_pred),
a scorer object or a sequence of these. If multiple metrics are
selected, only the first is used to optimize the BO. If None, a
default metric is selected:
- "f1" for binary classification
- "f1_weighted" for multiclass classification
- "r2" for regression
greater_is_better: bool or sequence, optional (default=True)
Whether the metric is a score function or a loss function,
i.e. if True, a higher score is better and if False, lower is
better. This parameter is ignored if the metric is a string or
a scorer. If sequence, the n-th value applies to the n-th
metric.
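As a minimal sketch of the function form of metric, the block below defines a loss with the signature metric(y_true, y_pred). The function name and the sample values are illustrative assumptions. Because a lower value is better for this metric, it would be paired with greater_is_better=False.

```python
def mean_absolute_error(y_true, y_pred):
    """Loss function with the signature metric(y_true, y_pred)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only.
loss = mean_absolute_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0])

# Hypothetical usage, following the constructor signature documented above:
# trainer = DirectRegressor("OLS", metric=mean_absolute_error, greater_is_better=False)
```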
needs_proba: bool or sequence, optional (default=False)
Whether the metric function requires probability estimates out
of a classifier. If True, make sure that every selected model has
a predict_proba method. This parameter is ignored
if the metric is a string or a scorer. If sequence, the n-th
value applies to the n-th metric.
needs_threshold: bool or sequence, optional (default=False)
Whether the metric function takes a continuous decision certainty.
This only works for binary classification using estimators that
have either a decision_function or predict_proba
method. This parameter is ignored if the metric is a string or a
scorer. If sequence, the n-th value applies to the n-th metric.
n_calls: int or sequence, optional (default=0)
Maximum number of iterations of the BO. It includes the random
points of n_initial_points. If 0, skip the BO and
fit the model on its default parameters. If sequence, the n-th
value applies to the n-th model.
n_initial_points: int or sequence, optional (default=5)
Initial number of random tests of the BO before fitting the
surrogate function. If equal to n_calls, the optimizer will
technically be performing a random search. If sequence, the n-th
value applies to the n-th model.
est_params: dict, optional (default=None)
Additional parameters for the estimators. See the corresponding
documentation for the available options. For multiple models, use
the acronyms as key and a dictionary of the parameters as value.
Add _fit to the parameter's name to pass it to the fit method instead
of the initializer.
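The _fit naming convention can be pictured with the sketch below. It is not ATOM's implementation: split_est_params is a hypothetical helper that separates a flat dictionary into initializer parameters and fit parameters based on the suffix, and the parameter names are examples.

```python
def split_est_params(est_params):
    """Separate initializer parameters from those ending in "_fit"."""
    init_params, fit_params = {}, {}
    for name, value in est_params.items():
        if name.endswith("_fit"):
            # Strip the suffix before handing the parameter to fit().
            fit_params[name[: -len("_fit")]] = value
        else:
            init_params[name] = value
    return init_params, fit_params


init_params, fit_params = split_est_params(
    {"n_estimators": 200, "sample_weight_fit": [1, 2, 1]}
)
```

For multiple models, the same kind of dictionary would sit under each model's acronym, e.g. {"RF": {"n_estimators": 200}}.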
bo_params: dict, optional (default=None)
Additional parameters for the BO. These can include:
- base_estimator: str, optional (default="GP")
Base estimator to use in the BO.
Choose from:
- "GP" for Gaussian Process
- "RF" for Random Forest
- "ET" for Extra-Trees
- "GBRT" for Gradient Boosted Regression Trees
- max_time: int, optional (default=np.inf)
Stop the optimization after max_time seconds.
- delta_x: int or float, optional (default=0)
Stop the optimization when |x1 - x2| < delta_x.
- delta_y: int or float, optional (default=0)
Stop the optimization if the 5 best minima are within delta_y of each other (the function is always minimized).
- cv: int, optional (default=5)
Number of folds for the cross-validation. If 1, the
training set is randomly split in a subtrain and validation set.
- early_stopping: int, float or None, optional (default=None)
Training will stop if the model didn't improve in the last early_stopping rounds. If <1,
it is interpreted as a fraction of the total number of rounds. If None, no early stopping is performed. Only
available for models that allow in-training evaluation.
- callback: callable or list of callables, optional (default=None)
Callbacks for the BO.
- dimensions: dict, array or None, optional (default=None)
Custom hyperparameter space for the Bayesian optimization. Can be an array to share dimensions across
models or a dictionary with the model's name as key. If None, ATOM's predefined dimensions are used.
- plot: bool, optional (default=False)
Whether to plot the BO's progress as it runs.
Creates a canvas with two plots: the first plot shows the score of every trial
and the second shows the distance between the last consecutive steps.
- Additional keyword arguments for skopt's optimizer.
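The two stopping rules described for delta_y and early_stopping can be sketched as follows. Both helpers are hypothetical and written only from the descriptions above. The rounding of a fractional early_stopping value and the use of an absolute difference between minima are assumptions, not confirmed library behavior.

```python
def delta_y_reached(func_vals, delta_y, n_best=5):
    """True when the n_best lowest objective values lie within delta_y."""
    if len(func_vals) < n_best:
        return False
    best = sorted(func_vals)[:n_best]
    return best[-1] - best[0] < delta_y


def resolve_early_stopping(early_stopping, total_rounds):
    """Turn early_stopping into a number of rounds (None disables it)."""
    if early_stopping is None:
        return None
    if early_stopping < 1:
        # Fraction of the total number of rounds (rounding is an assumption).
        return max(1, round(early_stopping * total_rounds))
    return int(early_stopping)


# Hypothetical bo_params dictionary using keys documented above.
bo_params = {"base_estimator": "RF", "cv": 3, "delta_y": 0.02, "early_stopping": 0.25}
```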
n_bootstrap: int or sequence, optional (default=0)
Number of data sets (bootstrapped from the training set) to use in
the bootstrap algorithm. If 0, no bootstrap is performed.
If sequence, the n-th value will apply to the n-th model.
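The bootstrap step can be illustrated with a small standalone sketch: the training set is resampled with replacement, a model is refit on each sample, and every refit is scored on the same test set. The one-feature least-squares line, the synthetic data and all helper names are assumptions chosen to keep the example self-contained. They are not ATOM's internals.

```python
import random
from statistics import fmean, stdev


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def bootstrap_scores(train, test, n_bootstrap, seed=0):
    """Refit on n_bootstrap resampled training sets, score each on the test set."""
    rng = random.Random(seed)
    x_test, y_test = zip(*test)
    scores = []
    for _ in range(n_bootstrap):
        sample = rng.choices(train, k=len(train))  # sampling with replacement
        slope, intercept = fit_line(*zip(*sample))
        scores.append(r2(y_test, [slope * x + intercept for x in x_test]))
    return scores


# Synthetic data: y = 2x + 1 with small alternating noise on the training set.
train = [(x, 2 * x + 1 + (0.5 if x % 2 else -0.5)) for x in range(20)]
test = [(x, 2 * x + 1) for x in (0.5, 4.5, 8.5, 12.5, 16.5)]
scores = bootstrap_scores(train, test, n_bootstrap=5)
mean_bootstrap, std_bootstrap = fmean(scores), stdev(scores)
```

The mean and standard deviation of the scores correspond in spirit to the mean_bootstrap and std_bootstrap columns of the results attribute.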
n_jobs: int, optional (default=1)
Number of cores to use for parallel processing.
- If >0: Number of cores to use.
- If -1: Use all available cores.
- If <-1: Use available_cores + 1 + n_jobs.
verbose: int, optional (default=0)
Verbosity level of the class. Possible values are:
- 0 to not print anything.
- 1 to print basic information.
- 2 to print detailed information.
warnings: bool or str, optional (default=True)
- If True: Default warning action (equal to "default").
- If False: Suppress all warnings (equal to "ignore").
- If str: One of the actions in python's warnings environment.
Changing this parameter affects the PYTHONWARNINGS environment.
ATOM can't manage warnings that go directly from C/C++ code to stdout.
logger: str, Logger or None, optional (default=None)
- If None: Doesn't save a logging file.
- If str: Name of the log file. Use "auto" for automatic naming.
- Else: Python logging.Logger instance.
experiment: str or None, optional (default=None)
Name of the mlflow experiment to use for tracking. If None,
no mlflow tracking is performed.
random_state: int or None, optional (default=None)
Seed used by the random number generator. If None, the random number
generator is the RandomState instance used by numpy.random.
|
Attributes
Data attributes
The dataset can be accessed at any time through multiple attributes,
e.g. calling trainer.train will return the training set. Updating
one of the data attributes will automatically update the rest as well.
Changing the branch will also change the response from these attributes
accordingly.
Attributes: |
dataset: pd.DataFrame
Complete dataset in the pipeline.
train: pd.DataFrame
Training set.
test: pd.DataFrame
Test set.
X: pd.DataFrame
Feature set.
y: pd.Series
Target column.
X_train: pd.DataFrame
Training features.
y_train: pd.Series
Training target.
X_test: pd.DataFrame
Test features.
y_test: pd.Series
Test target.
shape: tuple
Dataset's shape: (n_rows x n_columns) or (n_rows, (shape_sample), n_cols)
for datasets with more than two dimensions.
columns: list
Names of the columns in the dataset.
n_columns: int
Number of columns in the dataset.
features: list
Names of the features in the dataset.
n_features: int
Number of features in the dataset.
target: str
Name of the target column.
|
Utility attributes
Attributes: |
models: list
List of models in the pipeline.
metric: str or list
Metric(s) used to fit the models.
errors: dict
Dictionary of the encountered exceptions (if any).
winner: model
Model subclass that performed best on the test set.
results: pd.DataFrame
Dataframe of the training results. Columns can include:
- metric_bo: Best score achieved during the BO.
- time_bo: Time spent on the BO.
- metric_train: Metric score on the training set.
- metric_test: Metric score on the test set.
- time_fit: Time spent fitting and evaluating.
- mean_bootstrap: Mean score of the bootstrap results.
- std_bootstrap: Standard deviation of the bootstrap results.
- time_bootstrap: Time spent on the bootstrap algorithm.
- time: Total time spent on the whole run.
|
Plot attributes
Attributes: |
style: str
Plotting style. See seaborn's documentation.
palette: str
Color palette. See seaborn's documentation.
title_fontsize: int
Fontsize for the plot's title.
label_fontsize: int
Fontsize for labels and legends.
tick_fontsize: int
Fontsize for the ticks along the plot's axes.
|
Methods
canvas |
Create a figure with multiple plots. |
cross_validate |
Evaluate the winning model using cross-validation. |
delete |
Remove a model from the pipeline. |
get_params |
Get parameters for this estimator. |
log |
Save information to the logger and print to stdout. |
reset_aesthetics |
Reset the plot aesthetics to their default values. |
reset_predictions |
Clear the prediction attributes from all models. |
run |
Fit and evaluate the models. |
save |
Save the instance to a pickle file. |
evaluate |
Get all models' scores for the provided metrics. |
set_params |
Set the parameters of this estimator. |
stacking |
Add a Stacking instance to the models in the pipeline. |
voting |
Add a Voting instance to the models in the pipeline. |
method canvas(nrows=1,
ncols=2, title=None, figsize=None, filename=None, display=True)
This @contextmanager
allows you to draw many plots in one figure.
The default option is to add two plots side by side. See the
user guide for an example.
Parameters: |
nrows: int, optional (default=1)
Number of plots in length.
ncols: int, optional (default=2)
Number of plots in width.
title: str or None, optional (default=None)
Plot's title. If None, no title is displayed.
figsize: tuple or None, optional (default=None)
Figure's size, format as (x, y). If None, it adapts the size to the
number of plots in the canvas.
filename: str or None, optional (default=None)
Name of the file. Use "auto" for automatic naming.
If None, the figure is not saved.
display: bool, optional (default=True)
Whether to render the plot.
|
method cross_validate(**kwargs)
Evaluate the winning model using cross-validation. This method cross-validates
the whole pipeline on the complete dataset. Use it to assess the robustness of
the model's performance.
Parameters: |
**kwargs
Additional keyword arguments for sklearn's cross_validate
function. If the scoring method is not specified, it uses
the trainer's metric.
|
Returns: |
scores: dict
Return of sklearn's cross_validate
function.
|
method delete(models=None)
Delete a model from the trainer. If the winning model is
removed, the next best model (through metric_test or mean_bootstrap)
is selected as winner. If all models are
removed, the metric and training approach are reset. Use
this method to drop unwanted models from the pipeline
or to free some memory before saving. Deleted models are
not removed from any active mlflow experiment.
Parameters: |
models: str or sequence, optional (default=None)
Models to delete. If None, delete them all.
|
method get_class_weights(dataset="train")
Return class weights for a balanced data set. Statistically, the class
weights re-balance the data set so that the sampled data set represents
the target population as closely as possible. The returned weights are
inversely proportional to the class frequencies in the selected data set.
Parameters: |
dataset: str, optional (default="train")
Data set from which to get the weights. Choose between "train", "test" or "dataset".
|
Returns: |
class_weights: dict
Classes with the corresponding weights.
|
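A minimal sketch of weights that are inversely proportional to the class frequencies is shown below. The normalization n_samples / (n_classes * count) is an assumption borrowed from the common "balanced" heuristic. The value returned by get_class_weights may be scaled differently.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights inversely proportional to each class's frequency in y."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Three samples of class 0 and one of class 1: the rare class gets the larger weight.
weights = balanced_class_weights([0, 0, 0, 1])
```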
method get_params(deep=True)
Get parameters for this estimator.
Parameters: |
deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.
|
Returns: |
params: dict
Dictionary of the parameter names mapped to their values.
|
method log(msg, level=0)
Write a message to the logger and print it to stdout.
Parameters: |
msg: str
Message to write to the logger and print to stdout.
level: int, optional (default=0)
Minimum verbosity level to print the message.
|
method reset_aesthetics()
Reset the plot aesthetics to their default values.
method reset_predictions()
Clear the prediction attributes from all models.
Use this method to free some memory before saving the trainer.
method run(*arrays)
Fit and evaluate the models.
Parameters: |
*arrays: sequence of indexables
Training set and test set. Allowed input formats are:
- train, test
- X_train, X_test, y_train, y_test
- (X_train, y_train), (X_test, y_test)
|
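The three accepted input formats can be told apart by the number and type of the arguments. The sketch below is a hypothetical normalizer, not the library's parser. It assumes plain lists of rows, with the target stored in the last column of train and test.

```python
def split_arrays(*arrays):
    """Normalize the three input formats to (X_train, y_train, X_test, y_test)."""
    if len(arrays) == 4:
        # X_train, X_test, y_train, y_test
        X_train, X_test, y_train, y_test = arrays
    elif len(arrays) == 2 and all(isinstance(a, tuple) for a in arrays):
        # (X_train, y_train), (X_test, y_test)
        (X_train, y_train), (X_test, y_test) = arrays
    elif len(arrays) == 2:
        # train, test: the target is assumed to be the last column of each row.
        train, test = arrays
        X_train, y_train = [r[:-1] for r in train], [r[-1] for r in train]
        X_test, y_test = [r[:-1] for r in test], [r[-1] for r in test]
    else:
        raise ValueError("Expected 2 or 4 arrays.")
    return X_train, y_train, X_test, y_test


train = [[1.0, 2.0, 10.0], [3.0, 4.0, 20.0]]
test = [[5.0, 6.0, 30.0]]
from_sets = split_arrays(train, test)
from_four = split_arrays([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]], [10.0, 20.0], [30.0])
from_pairs = split_arrays(([[1.0, 2.0], [3.0, 4.0]], [10.0, 20.0]), ([[5.0, 6.0]], [30.0]))
```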
method save(filename="auto", save_data=True)
Save the instance to a pickle file. Remember that the class contains
the complete dataset as attribute, so the file can become large for
big datasets! To avoid this, use save_data=False.
Parameters: |
filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.
save_data: bool, optional (default=True)
Whether to save the data as an attribute of the instance. If False,
remember to add the data to ATOMLoader
when loading the file.
|
method evaluate(metric=None, dataset="test")
Get all the models' scores for the provided metrics.
Parameters: |
metric: str, func, scorer, sequence or None, optional (default=None)
Metrics to calculate. If None, a selection of the most common
metrics per task are used.
dataset: str, optional (default="test")
Data set on which to calculate the metric. Options are "train" or "test".
|
Returns: |
scores: pd.DataFrame
Scores of the models.
|
method set_params(**params)
Set the parameters of this estimator.
Parameters: |
**params: dict
Estimator parameters.
|
Returns: |
self: DirectRegressor
Estimator instance.
|
method stacking(models=None,
estimator=None, stack_method="auto", passthrough=False)
Add a Stacking instance to the models in the pipeline.
Parameters: |
models: sequence or None, optional (default=None)
Models that feed the stacking. If None, it selects all models
depending on the current branch.
estimator: str, callable or None, optional (default=None)
The final estimator, which is used to combine the base
estimators. If str, choose from ATOM's predefined models.
If None, Ridge is selected.
stack_method: str, optional (default="auto")
Methods called for each base estimator. If "auto", it will try to
invoke predict_proba, decision_function
or predict in that order.
passthrough: bool, optional (default=False)
When False, only the predictions of estimators are used
as training data for the final estimator. When True, the
estimator is trained on the predictions as well as the
original training data. The passed dataset is scaled
if any of the models require scaled features and they are
not already.
|
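The effect of passthrough on the final estimator's training data can be sketched as below. stack_features is a hypothetical helper working on plain lists: with passthrough=False each row holds only the base models' predictions, and with passthrough=True the original features are appended. Feature scaling is left out of this sketch.

```python
def stack_features(X, base_predictions, passthrough=False):
    """Build the final estimator's input from the base models' predictions."""
    rows = []
    for i, x_row in enumerate(X):
        preds = [model_preds[i] for model_preds in base_predictions]
        rows.append(preds + list(x_row) if passthrough else preds)
    return rows


X = [[0.1, 0.2], [0.3, 0.4]]
base_predictions = [[1.0, 2.0], [1.5, 2.5]]  # one list of predictions per base model
only_preds = stack_features(X, base_predictions)
with_features = stack_features(X, base_predictions, passthrough=True)
```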
method voting(models=None, weights=None)
Add a Voting instance to the models in the pipeline.
Parameters: |
models: sequence or None, optional (default=None)
Models that feed the voting. If None, it selects all models
depending on the current branch.
weights: sequence or None, optional (default=None)
Sequence of weights (int or float) to weight the
occurrences of predicted class labels (hard voting)
or class probabilities before averaging (soft voting).
Uses uniform weights if None.
|
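For a regression task, a voting ensemble is commonly a weighted average of the base models' predictions. The sketch below assumes that behavior. weighted_vote is a hypothetical helper, and uniform weights are used when weights is None, as described above.

```python
def weighted_vote(predictions, weights=None):
    """Weighted average of per-model predictions, one value per sample."""
    if weights is None:
        weights = [1] * len(predictions)  # uniform weights
    total = sum(weights)
    n_samples = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]


predictions = [[1.0, 2.0], [3.0, 4.0]]  # one list of predictions per model
uniform = weighted_vote(predictions)
weighted = weighted_vote(predictions, weights=[3, 1])
```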
Example
from atom.training import DirectRegressor
trainer = DirectRegressor(["OLS", "BR"], n_calls=5, n_initial_points=3, n_bootstrap=5)
trainer.run(train, test)
# Analyze the results
trainer.plot_results()
trainer.evaluate()