Gradient Boosting Machine (GBM)

Accepts sparse input.

A Gradient Boosting Machine builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage n_classes_ regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced.
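
The forward stage-wise idea can be illustrated with a minimal sketch in plain Python. This is not the scikit-learn implementation: it assumes a single feature, the squared-error loss (whose negative gradient is simply the residual) and depth-1 stumps in place of full regression trees. All function names are illustrative.

```python
from statistics import fmean


def fit_stump(xs, residuals):
    """Return the (threshold, left_value, right_value) split with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest value: the right side would be empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = fmean(left), fmean(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gbm(xs, ys, n_estimators=100, learning_rate=0.1):
    """Forward stage-wise boosting for the squared-error loss."""
    init = fmean(ys)  # stage 0: a constant model
    predictions = [init] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For L = 0.5 * (y - F)^2 the negative gradient is the residual y - F.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add the new stage, shrunk by the learning rate, to the running model.
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return init, learning_rate, stumps


def predict_gbm(model, x):
    init, learning_rate, stumps = model
    return init + learning_rate * sum(predict_stump(s, x) for s in stumps)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
model = fit_gbm(xs, ys)
mse = fmean((y - predict_gbm(model, x)) ** 2 for x, y in zip(xs, ys))
baseline_mse = fmean((y - fmean(ys)) ** 2 for y in ys)  # error of the constant model
```

Each stage only corrects what the previous stages left unexplained, which is why a smaller learning_rate usually calls for a larger n_estimators.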

Corresponding estimators are:

GradientBoostingClassifier for classification tasks.
GradientBoostingRegressor for regression tasks.

Hyperparameters

By default, the estimator adopts the default parameters provided by its package. See the user guide on how to customize them.
For multiclass classification tasks, the loss parameter is always set to "deviance".
The alpha parameter is only used when loss="huber" or "quantile".
The random_state parameter is set equal to that of the trainer.

Dimensions:

loss: str

binary classifier: default="deviance"
Categorical(["deviance", "exponential"], name="loss")
regressor: default="squared_error"
Categorical(["squared_error", "absolute_error", "huber", "quantile"], name="loss")

learning_rate: float, default=0.1
Real(0.01, 1.0, "log-uniform", name="learning_rate")

n_estimators: int, default=100
Integer(10, 500, name="n_estimators")

subsample: float, default=1.0
Categorical(np.linspace(0.5, 1.0, 6), name="subsample")

criterion: str, default="friedman_mse"
Categorical(["friedman_mse", "mse"], name="criterion")

min_samples_split: int, default=2
Integer(2, 20, name="min_samples_split")

min_samples_leaf: int, default=1
Integer(1, 20, name="min_samples_leaf")

max_depth: int, default=3
Integer(1, 21, name="max_depth")

max_features: str, float or None, default="auto"
Categorical(["auto", "sqrt", "log2", *np.linspace(0.5, 0.9, 5), None], name="max_features")

ccp_alpha: float, default=0.0
Real(0, 0.035, name="ccp_alpha")
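
The dimension notation above appears to follow the style of scikit-optimize's space objects. As a rough illustration of what those ranges mean, the sketch below draws random candidates from a subset of the dimensions using only the standard library. It is plain random sampling, not Bayesian optimization, and the helper names are hypothetical.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_gbm_params(rng):
    """Draw one candidate from a subset of the dimensions listed above."""
    # Same grid points as np.linspace(0.5, 1.0, 6).
    subsample_grid = [round(0.5 + 0.1 * i, 1) for i in range(6)]
    return {
        "learning_rate": sample_log_uniform(rng, 0.01, 1.0),
        "n_estimators": rng.randint(10, 500),  # randint includes both bounds
        "subsample": rng.choice(subsample_grid),
        "max_depth": rng.randint(1, 21),
        "ccp_alpha": rng.uniform(0, 0.035),
    }


rng = random.Random(0)
candidates = [sample_gbm_params(rng) for _ in range(1000)]

# Log-uniform sampling puts roughly half of the draws below the geometric midpoint 0.1.
share_below = sum(c["learning_rate"] < 0.1 for c in candidates) / len(candidates)
```

A log-uniform prior spends as many draws between 0.01 and 0.1 as between 0.1 and 1.0, which suits parameters such as learning_rate whose effect is multiplicative.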

Attributes

Data attributes

Attributes:

dataset: pd.DataFrame
Complete dataset in the pipeline.

train: pd.DataFrame
Training set.

test: pd.DataFrame
Test set.

X: pd.DataFrame
Feature set.

y: pd.Series
Target column.

X_train: pd.DataFrame
Training features.

y_train: pd.Series
Training target.

X_test: pd.DataFrame
Test features.

y_test: pd.Series
Test target.

shape: tuple
Dataset's shape: (n_rows, n_columns) or (n_rows, (shape_sample), n_columns) for datasets with more than two dimensions.

columns: pd.Index
Names of the columns in the dataset.

n_columns: int
Number of columns in the dataset.

features: pd.Index
Names of the features in the dataset.

n_features: int
Number of features in the dataset.

target: str
Name of the target column.

Utility attributes

Attributes:

bo: pd.DataFrame
Information on every step taken by the Bayesian optimization (BO). Columns include:

call: Name of the call.
params: Parameters used in the model.
estimator: Estimator used for this iteration (fitted on last cross-validation).
score: Score of the chosen metric. List of scores for multi-metric runs.
time: Time spent on this iteration.
total_time: Total time spent since the start of the BO.

best_call: str
Name of the best call in the BO.

best_params: dict
Dictionary of the best combination of hyperparameters found by the BO.

estimator: class
Estimator instance with the best combination of hyperparameters fitted on the complete training set.

time_bo: str
Time it took to run the Bayesian optimization algorithm.

metric_bo: float or list
Best metric score(s) on the BO.

time_fit: str
Time it took to train the model on the complete training set and calculate the metric(s) on the test set.

metric_train: float or list
Metric score(s) on the training set.

metric_test: float or list
Metric score(s) on the test set.

metric_bootstrap: np.array
Bootstrap results with shape=(n_bootstrap,) for single-metric runs and shape=(metric, n_bootstrap) for multi-metric runs.

mean_bootstrap: float or list
Mean of the bootstrap results. List of values for multi-metric runs.

std_bootstrap: float or list
Standard deviation of the bootstrap results. List of values for multi-metric runs.
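
To make the shapes of the three bootstrap attributes concrete, here is a small sketch with made-up scores for a multi-metric run. The numbers are hypothetical, and the use of the population standard deviation is an assumption rather than something the text above specifies.

```python
from statistics import fmean, pstdev

# Hypothetical scores: 2 metrics x 5 bootstrap samples, i.e. shape=(metric, n_bootstrap).
metric_bootstrap = [
    [0.91, 0.89, 0.93, 0.90, 0.92],  # first metric
    [0.85, 0.83, 0.88, 0.84, 0.85],  # second metric
]

# One mean and one standard deviation per metric (a list for multi-metric runs).
mean_bootstrap = [fmean(scores) for scores in metric_bootstrap]
# Assumption: population standard deviation (divides by n, not n - 1).
std_bootstrap = [pstdev(scores) for scores in metric_bootstrap]
```

For a single-metric run the same computation collapses to one float for the mean and one for the standard deviation.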

results: pd.Series
Training results. Entries include:

metric_bo: Best score achieved during the BO.
time_bo: Time spent on the BO.
metric_train: Metric score on the training set.
metric_test: Metric score on the test set.
time_fit: Time spent fitting and evaluating.
mean_bootstrap: Mean score of the bootstrap results.
std_bootstrap: Standard deviation score of the bootstrap results.
time_bootstrap: Time spent on the bootstrap algorithm.
time: Total time spent on the whole run.

Prediction attributes

The prediction attributes are not calculated until the attribute is called for the first time. This mechanism avoids having to calculate attributes that are never used, saving time and memory.
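
One common way to get this kind of lazy behaviour in Python is functools.cached_property. The sketch below is only an illustration of the mechanism, not the library's actual implementation; LazyModel and its arguments are hypothetical.

```python
from functools import cached_property


class LazyModel:
    """Hypothetical stand-in for a model with lazily computed predictions."""

    def __init__(self, predict, X_test):
        self._predict = predict
        self._X_test = X_test
        self.n_computations = 0  # counts how often the predictions are actually computed

    @cached_property
    def predict_test(self):
        # Runs on first access only; the result is then stored on the instance.
        self.n_computations += 1
        return [self._predict(x) for x in self._X_test]


model = LazyModel(predict=lambda x: 2 * x, X_test=[1, 2, 3])
computed_before_access = model.n_computations  # nothing has been computed yet
first = model.predict_test  # triggers the computation
second = model.predict_test  # served from the cache
```

Deleting the cached value (del model.predict_test) forces a recomputation on the next access, which is one way a reset of such attributes could be realized.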

Prediction attributes:

predict_train: np.array
Predictions of the model on the training set.

predict_test: np.array
Predictions of the model on the test set.

predict_proba_train: np.array
Predicted probabilities of the model on the training set (only if classifier).

predict_proba_test: np.array
Predicted probabilities of the model on the test set (only if classifier).

predict_log_proba_train: np.array
Predicted log probabilities of the model on the training set (only if classifier).

predict_log_proba_test: np.array
Predicted log probabilities of the model on the test set (only if classifier).

decision_function_train: np.array
Decision function scores on the training set (only if classifier).

decision_function_test: np.array
Decision function scores on the test set (only if classifier).

score_train: np.float64
Model's score on the training set.

score_test: np.float64
Model's score on the test set.

Methods

The majority of the plots and prediction methods can be called directly from the models, e.g. atom.gbm.plot_permutation_importance() or atom.gbm.predict(X). The remaining utility methods can be found hereunder.

calibrate	Calibrate the model.
clear	Clear attributes from the model.
cross_validate	Evaluate the model using cross-validation.
delete	Delete the model from the trainer.
dashboard	Create an interactive dashboard to analyze the model.
evaluate	Get the model's scores for the provided metrics.
export_pipeline	Export the model's pipeline to a sklearn-like Pipeline object.
full_train	Train the estimator on the complete dataset.
rename	Change the model's tag.
save_estimator	Save the estimator to a pickle file.
transform	Transform new data through the model's branch.

method calibrate(**kwargs) [source]

Applies probability calibration to the model. The estimator is trained via cross-validation on a subset of the training data, using the rest to fit the calibrator. The new classifier replaces the estimator attribute. If there is an active mlflow experiment, a new run is started using the name [model_name]_calibrate. Since the estimator changed, the model is cleared. Only available for classifiers.

Parameters:

**kwargs
Additional keyword arguments for sklearn's CalibratedClassifierCV. Using cv="prefit" will use the trained model and fit the calibrator on the test set. Use this only if you have another, independent set for testing.

method clear() [source]

Reset attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the class. The cleared attributes per model are:

method cross_validate(**kwargs) [source]

Evaluate the model using cross-validation. This method cross-validates the whole pipeline on the complete dataset. Use it to assess the robustness of the solution's performance. The return of sklearn's cross_validate function is stored under the cv attribute.

Parameters:

**kwargs
Additional keyword arguments for sklearn's cross_validate function. If the scoring method is not specified, it uses the trainer's metric.

Returns:

pd.DataFrame
Overview of the results.

method delete() [source]

Delete the model from the trainer. If it's the last model in the trainer, the metric is reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. The model is not removed from any active mlflow experiment.

method dashboard(dataset="test", filename=None, **kwargs) [source]

Create an interactive dashboard to analyze the model. The dashboard allows you to investigate SHAP values, permutation importances, interaction effects, partial dependence plots, all kinds of performance plots, and even individual decision trees. By default, the dashboard opens in an external dash app.

Parameters:

dataset: str, optional (default="test")
Data set to get the report from. Choose from: "train", "test", "both" (train and test) or "holdout".

filename: str or None, optional (default=None)
Name to save the file with (as .html). None to not save anything.

**kwargs
Additional keyword arguments for the ExplainerDashboard instance.

Returns:

ExplainerDashboard
Created dashboard object.

method evaluate(metric=None, dataset="test", threshold=0.5, sample_weight=None) [source]

Get the model's scores for the provided metrics.

Parameters:

metric: str, func, scorer, sequence or None, optional (default=None)
Metrics to calculate. If None, a selection of the most common metrics per task are used.

dataset: str, optional (default="test")
Data set on which to calculate the metric. Choose from: "train", "test" or "holdout".

threshold: float, optional (default=0.5)
Threshold between 0 and 1 to convert predicted probabilities to class labels. Only used when:

The task is binary classification.
The model has a predict_proba method.
The metric evaluates predicted target values.

sample_weight: sequence or None, optional (default=None)
Sample weights corresponding to y in dataset.

Returns: pd.Series
Scores of the model.
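
The effect of the threshold parameter on a binary task can be sketched as follows. The helper is hypothetical, and treating a probability exactly equal to the threshold as positive is an assumption.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels."""
    if not 0 < threshold < 1:
        raise ValueError("threshold should lie between 0 and 1.")
    # Assumption: a probability equal to the threshold counts as positive.
    return [int(p >= threshold) for p in probabilities]


# Hypothetical positive-class probabilities, as a predict_proba method might return.
proba = [0.10, 0.35, 0.50, 0.80]

default_labels = apply_threshold(proba)  # threshold=0.5
strict_labels = apply_threshold(proba, threshold=0.7)  # fewer positives
lenient_labels = apply_threshold(proba, threshold=0.3)  # more positives
```

Raising the threshold trades recall for precision, so metrics that work on predicted labels change with it, while metrics computed directly on probabilities do not.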

method export_pipeline(memory=None, verbose=None) [source]

Export the model's pipeline to a sklearn-like Pipeline object. If the model used automated feature scaling, the scaler is added to the pipeline. The returned pipeline is already fitted on the training set.

Info

ATOM's Pipeline class behaves the same as a sklearn Pipeline, and additionally:

Accepts transformers that change the target column.
Accepts transformers that drop rows.
Accepts transformers that are only fitted on a subset of the provided dataset.
Always outputs pandas objects.
Uses transformers that are only applied on the training set (see the balance or prune methods) to fit the pipeline, not to make predictions on unseen data.
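
The last point, train-only transformers, can be pictured with a toy pipeline. This is a hypothetical sketch of the behaviour described above, not ATOM's Pipeline class; Step, SketchPipeline and the two transformer functions are invented for illustration.

```python
class Step:
    """Hypothetical transformer wrapper with a train_only flag."""

    def __init__(self, func, train_only=False):
        self.func = func
        self.train_only = train_only


class SketchPipeline:
    def __init__(self, steps):
        self.steps = steps
        self.fitted_on = None

    def fit(self, rows):
        for step in self.steps:  # every step runs while fitting
            rows = step.func(rows)
        self.fitted_on = rows  # the data a final estimator would be trained on
        return self

    def transform(self, rows):
        for step in self.steps:
            if not step.train_only:  # train-only steps are skipped on unseen data
                rows = step.func(rows)
        return rows


def drop_outliers(rows):
    """Row-dropping step, comparable in spirit to pruning outliers."""
    return [x for x in rows if x < 50]


def scale(rows):
    return [x / 10 for x in rows]


pipeline = SketchPipeline([Step(drop_outliers, train_only=True), Step(scale)])
pipeline.fit([10, 20, 1000])
train_view = pipeline.fitted_on  # the outlier was dropped before scaling
new_view = pipeline.transform([30, 2000])  # nothing is dropped, only scaled
```

Skipping row-dropping steps on new data guarantees that every incoming sample receives a prediction.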

Parameters:

memory: bool, str, Memory or None, optional (default=None)
Used to cache the fitted transformers of the pipeline.

If None or False: No caching is performed.
If True: A default temp directory is used.
If str: Path to the caching directory.
If Memory: Object with the joblib.Memory interface.

verbose: int or None, optional (default=None)
Verbosity level of the transformers in the pipeline. If None, it leaves them to their original verbosity. Note that this is not the pipeline's own verbose parameter. To change that, use the set_params method.

Returns: Pipeline
Current branch as a sklearn-like Pipeline object.

method full_train(include_holdout=False) [source]

In some cases it might be desirable to use all available data to train a final model. Note that doing this means that the estimator can no longer be evaluated on the test set. The newly retrained estimator will replace the estimator attribute. If there is an active mlflow experiment, a new run is started with the name [model_name]_full_train. Since the estimator changed, the model is cleared.

Warning

Although the model is trained on the complete dataset, the pipeline is not! To also get the fully trained pipeline, use: pipeline = atom.export_pipeline().fit(X, y).

Parameters:

include_holdout: bool, optional (default=False)
Whether to include the holdout data set (if available) in the training of the estimator. Note that if True, it means the model can't be evaluated.

method rename(name=None) [source]

Change the model's tag. The acronym always stays at the beginning of the model's name. If the model is being tracked by mlflow, the name of the corresponding run is also changed.

Parameters:

name: str or None, optional (default=None)
New tag for the model. If None, the tag is removed.

method save_estimator(filename="auto") [source]

Save the estimator to a pickle file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.

method transform(X, y=None, verbose=None) [source]

Transform new data through the model's branch. Transformers that are only applied on the training set are skipped. If the model used feature scaling, the data is also scaled.

Parameters:

X: dataframe-like
Features to transform, with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)

If None: y is ignored in the transformers.
If int: Position of the target column in X.
If str: Name of the target column in X.
Else: Target column with shape=(n_samples,).

verbose: int or None, optional (default=None)
Verbosity level of the output. If None, it uses the transformer's own verbosity.

Returns:

pd.DataFrame
Transformed feature set.

pd.Series
Transformed target column. Only returned if provided.
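
The four accepted forms of y can be resolved with a small dispatcher. The sketch below is hypothetical and uses a plain dict of columns in place of a dataframe; it only illustrates the rules listed under the y parameter.

```python
def split_target(X, y=None):
    """Resolve y against X, where X is a dict mapping column names to lists."""
    if y is None:
        return X, None  # y is ignored
    if isinstance(y, int):
        y = list(X)[y]  # position of the target column in X
    if isinstance(y, str):
        # Name of the target column in X: separate it from the features.
        features = {name: col for name, col in X.items() if name != y}
        return features, X[y]
    return X, list(y)  # y is already a sequence of target values


data = {"a": [1, 2], "b": [3, 4], "target": [0, 1]}

by_position = split_target(data, -1)
by_name = split_target(data, "target")
by_sequence = split_target({"a": [1, 2]}, [1, 0])
without_target = split_target(data)
```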

Example

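from atom import ATOMRegressor
from sklearn.datasets import load_diabetes

# Any regression dataset works; load_diabetes is used here for illustration.
X, y = load_diabetes(return_X_y=True, as_frame=True)

atom = ATOMRegressor(X, y)
atom.run(models="GBM")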