Gradient Boosting Machine (GBM)
A Gradient Boosting Machine builds an additive model in a forward
stage-wise fashion, which allows for the optimization of arbitrary
differentiable loss functions. For classification, n_classes_ regression
trees are fit in each stage on the negative gradient of the binomial or
multinomial deviance loss function. Binary classification is a special
case where only a single regression tree is induced per stage.
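The stage-wise procedure described above can be sketched in a few lines for the binary case. This is a simplified illustration, not sklearn's implementation: it uses depth-1 trees (stumps) on a single feature, and the helper names fit_stump and fit_gbm are hypothetical. Production implementations also refine the leaf values after each fit, which this sketch omits.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(x, residuals):
    """Fit a depth-1 regression tree on one feature by least squares."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def fit_gbm(x, y, n_estimators=50, learning_rate=0.1):
    """Binary gradient boosting: one regression tree per stage."""
    # Initial raw score: log-odds of the positive class
    p = sum(y) / len(y)
    f0 = math.log(p / (1 - p))
    scores = [f0] * len(x)
    stumps = []
    for _ in range(n_estimators):
        # Negative gradient of the binomial deviance w.r.t. the raw score
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Additive update, shrunk by the learning rate
        scores = [s + learning_rate * stump(xi) for s, xi in zip(scores, x)]

    def predict_proba(xi):
        return sigmoid(f0 + learning_rate * sum(st(xi) for st in stumps))

    return predict_proba


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 0, 1, 1, 1]
predict_proba = fit_gbm(x, y)
```

Each stage only corrects what the previous stages got wrong, which is why a small learning_rate usually needs a larger n_estimators to reach the same fit.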
Corresponding estimators are:
- GradientBoostingClassifier for classification tasks.
- GradientBoostingRegressor for regression tasks.
Read more in sklearn's documentation.
Hyperparameters
- By default, the estimator adopts the default parameters provided by its package. See the user guide on how to customize them.
- For multiclass classification tasks, the loss parameter is always set to "deviance".
- The alpha parameter is only used when loss="huber" or "quantile".
- The random_state parameter is set equal to that of the trainer.
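To see why alpha only matters for the "huber" and "quantile" losses, the sketch below writes both losses out for a single sample. It is an illustration under assumptions, not sklearn's code: the function names are hypothetical, and huber_delta uses a simple nearest-rank quantile to pick the threshold from the absolute residuals.

```python
def quantile_loss(y_true, y_pred, alpha=0.9):
    """Pinball loss: under-predictions cost alpha, over-predictions 1 - alpha."""
    diff = y_true - y_pred
    return alpha * diff if diff >= 0 else (alpha - 1) * diff


def huber_loss(y_true, y_pred, delta):
    """Quadratic for small residuals, linear beyond the threshold delta."""
    r = abs(y_true - y_pred)
    return 0.5 * r ** 2 if r <= delta else delta * (r - 0.5 * delta)


def huber_delta(residuals, alpha=0.9):
    """Hypothetical helper: delta as the alpha-quantile of |residuals|."""
    ordered = sorted(abs(r) for r in residuals)
    index = min(len(ordered) - 1, int(alpha * len(ordered)))
    return ordered[index]


under = quantile_loss(10.0, 8.0, alpha=0.9)   # prediction too low
over = quantile_loss(10.0, 12.0, alpha=0.9)   # prediction too high
```

With alpha=0.9 an under-prediction is penalized nine times as heavily as an over-prediction of the same size, which pushes the model toward the 90th percentile of the target. Other losses have no such quantile, so alpha is ignored for them.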
Dimensions: |
loss: str
learning_rate: float, default=0.1
n_estimators: int, default=100
subsample: float, default=1.0
criterion: str, default="friedman_mse"
min_samples_split: int, default=2
min_samples_leaf: int, default=1
max_depth: int, default=3
max_features: str, float or None, default="auto"
ccp_alpha: float, default=0.0 |
Attributes
Data attributes
Attributes: |
dataset: pd.DataFrame
train: pd.DataFrame
test: pd.DataFrame
X: pd.DataFrame
y: pd.Series
X_train: pd.DataFrame
y_train: pd.Series
X_test: pd.DataFrame
y_test: pd.Series
shape: tuple
columns: pd.Index
n_columns: int
features: pd.Index
n_features: int
target: str |
Utility attributes
Attributes: |
bo: pd.DataFrame Information of every step taken by the BO. Columns include:
best_call: str
best_params: dict
estimator: class
time_bo: str
metric_bo: float or list
time_fit: str
metric_train: float or list
metric_test: float or list
metric_bootstrap: np.array
mean_bootstrap: float or list
std_bootstrap: float or list
Training results. Columns include:
|
Prediction attributes
The prediction attributes are not calculated until the attribute is called for the first time. This mechanism avoids having to calculate attributes that are never used, saving time and memory.
Prediction attributes: |
predict_train: np.array
predict_test: np.array
predict_proba_train: np.array
predict_proba_test: np.array
predict_log_proba_train: np.array
predict_log_proba_test: np.array
decision_function_train: np.array
decision_function_test: np.array
score_train: np.float64
score_test: np.float64 |
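The lazy evaluation of prediction attributes described above can be illustrated with Python's functools.cached_property. This is a sketch of the general mechanism only; LazyModel is a hypothetical stand-in and says nothing about how ATOM implements it internally.

```python
from functools import cached_property


class LazyModel:
    """Hypothetical stand-in for a fitted model."""

    def __init__(self, predict_fn, X_test):
        self._predict_fn = predict_fn
        self._X_test = X_test
        self.n_calls = 0  # counts how often predictions are computed

    @cached_property
    def predict_test(self):
        # Runs only on first access; the result is then stored on the instance
        self.n_calls += 1
        return [self._predict_fn(x) for x in self._X_test]


model = LazyModel(lambda x: int(x > 0.5), [0.2, 0.7, 0.9])
first = model.predict_test   # computed here
second = model.predict_test  # served from the cache
```

Nothing is computed when the model is created, and the second access reuses the stored result instead of predicting again.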
Methods
The majority of the plots and prediction methods can be called directly
from the models, e.g. atom.gbm.plot_permutation_importance() or
atom.gbm.predict(X). The remaining utility methods can be found hereunder.
calibrate | Calibrate the model. |
clear | Clear attributes from the model. |
cross_validate | Evaluate the model using cross-validation. |
delete | Delete the model from the trainer. |
dashboard | Create an interactive dashboard to analyze the model. |
evaluate | Get the model's scores for the provided metrics. |
export_pipeline | Export the model's pipeline to a sklearn-like Pipeline object. |
full_train | Train the estimator on the complete dataset. |
rename | Change the model's tag. |
save_estimator | Save the estimator to a pickle file. |
transform | Transform new data through the model's branch. |
Applies probability calibration on the model. The estimator
is trained via cross-validation on a subset of the training
data, using the rest to fit the calibrator. The new classifier
will replace the estimator attribute. If there is an active
mlflow experiment, a new run is started using the name
[model_name]_calibrate. Since the estimator changed, the
model is cleared. Only available for classifiers.
Parameters: |
**kwargs Additional keyword arguments for sklearn's CalibratedClassifierCV. Using cv="prefit" will use the trained model and fit the calibrator on the test set. Use this only if you have another, independent set for testing. |
Reset attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the class. The cleared attributes per model are:
Evaluate the model using cross-validation. This method cross-validates the
whole pipeline on the complete dataset. Use it to assess the robustness of
the solution's performance. The return of sklearn's cross_validate
function is stored under the cv attribute.
Parameters: |
**kwargs Additional keyword arguments for sklearn's cross_validate function. If the scoring method is not specified, it uses the trainer's metric. |
Returns: |
pd.DataFrame Overview of the results. |
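As a rough picture of what cross-validating a whole pipeline involves, the sketch below refits a toy model on every training split and scores it on the held-out fold. It is a dependency-free illustration, not sklearn's cross_validate; all names in it (k_fold_indices, cross_validate_sketch, fit_mean, neg_mae) are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate_sketch(fit, score, X, y, k=5):
    """Refit on k-1 folds and score on the held-out fold, for every fold."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(
            score(model, [X[i] for i in test_idx], [y[i] for i in test_idx])
        )
    return scores


def fit_mean(X_train, y_train):
    """Toy "pipeline": always predict the mean of the training targets."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean


def neg_mae(model, X_test, y_test):
    """Negated mean absolute error, so that higher is better."""
    return -sum(abs(model(x) - t) for x, t in zip(X_test, y_test)) / len(y_test)


X = list(range(10))
y = [2.0 * x for x in X]
scores = cross_validate_sketch(fit_mean, neg_mae, X, y, k=5)
```

A large spread between the per-fold scores suggests that performance depends strongly on which rows end up in the training set.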
Delete the model from the trainer. If it's the last model in the
trainer, the metric is reset. Use this method to drop unwanted
models from the pipeline or to free some memory before saving.
The model is not removed from any active mlflow experiment.
Create an interactive dashboard to analyze the model. The dashboard allows you to investigate SHAP values, permutation importances, interaction effects, partial dependence plots, all kinds of performance plots, and even individual decision trees. By default, the dashboard opens in an external dash app.
Parameters: |
dataset: str, optional (default="test")
filename: str or None, optional (default=None)
**kwargs |
Returns: |
ExplainerDashboard Created dashboard object. |
Get the model's scores for the provided metrics.
Parameters: |
metric: str, func, scorer, sequence or None, optional (default=None)
dataset: str, optional (default="test")
Threshold between 0 and 1 to convert predicted probabilities to class labels. Only used when:
sample_weight: sequence or None, optional (default=None) |
Returns: |
pd.Series Scores of the model. |
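The effect of a probability threshold on the predicted class labels can be shown with a small sketch. The helper to_labels is hypothetical, and both the 0.5 default and the inclusive comparison are assumptions made for illustration.

```python
def to_labels(probabilities, threshold=0.5):
    """Label a sample positive when its probability reaches the threshold."""
    return [int(p >= threshold) for p in probabilities]


proba = [0.15, 0.45, 0.55, 0.90]
default_labels = to_labels(proba)
strict_labels = to_labels(proba, threshold=0.8)
```

Raising the threshold trades recall for precision, so metrics computed from class labels change with it while metrics computed from the probabilities themselves do not.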
Export the model's pipeline to a sklearn-like Pipeline object. If the
model used automated feature scaling, the scaler is added to the
pipeline. The returned pipeline is already fitted on the training set.
Info
ATOM's Pipeline class behaves the same as a sklearn Pipeline, and additionally:
- Accepts transformers that change the target column.
- Accepts transformers that drop rows.
- Accepts transformers that are only fitted on a subset of the provided dataset.
- Always outputs pandas objects.
- Uses transformers that are only applied on the training set (see the balance or prune methods) to fit the pipeline, not to make predictions on unseen data.
Parameters: |
memory: bool, str, Memory or None, optional (default=None) Used to cache the fitted transformers of the pipeline.
verbose: int or None, optional (default=None) |
Returns: |
Pipeline Current branch as a sklearn-like Pipeline object. |
In some cases it might be desirable to use all available data to train
a final model. Note that doing this means that the estimator can no
longer be evaluated on the test set. The newly retrained estimator will
replace the estimator attribute. If there is an active mlflow
experiment, a new run is started with the name [model_name]_full_train.
Since the estimator changed, the model is cleared.
Warning
Although the model is trained on the complete dataset, the pipeline
is not! To also get the fully trained pipeline, use:
pipeline = atom.export_pipeline().fit(X, y).
Parameters: |
include_holdout: bool, optional (default=False) Whether to include the holdout data set (if available) in the training of the estimator. Note that if True, it means the model can't be evaluated. |
Change the model's tag. The acronym always stays at the beginning of the model's name. If the model is being tracked by mlflow, the name of the corresponding run is also changed.
Parameters: |
name: str or None, optional (default=None) New tag for the model. If None, the tag is removed. |
Save the estimator to a pickle file.
Parameters: |
filename: str, optional (default="auto") Name of the file. Use "auto" for automatic naming. |
Transform new data through the model's branch. Transformers that are only applied on the training set are skipped. If the model used feature scaling, the data is also scaled.
Parameters: |
X: dataframe-like
verbose: int or None, optional (default=None) |
Returns: |
pd.DataFrame
pd.Series |
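The rule that training-only transformers are skipped when transforming new data can be sketched as follows. Step and transform are hypothetical names used for illustration and do not mirror ATOM's internals. The "prune" step stands in for any transformer that drops rows during training.

```python
class Step:
    """Hypothetical pipeline step; train_only marks steps such as pruning."""

    def __init__(self, name, func, train_only=False):
        self.name = name
        self.func = func
        self.train_only = train_only


def transform(steps, rows, training=False):
    """Apply every step in order, skipping train-only steps on unseen data."""
    for step in steps:
        if step.train_only and not training:
            continue
        rows = step.func(rows)
    return rows


steps = [
    Step("scale", lambda rows: [r / 10 for r in rows]),
    Step("prune", lambda rows: [r for r in rows if r <= 1.0], train_only=True),
]

new_data = [5, 10, 50]
unseen = transform(steps, new_data)                     # prune is skipped
during_fit = transform(steps, new_data, training=True)  # prune is applied
```

Skipping such steps on unseen data means every incoming row still receives a prediction, even one that would have been removed as an outlier during training.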
Example
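from atom import ATOMRegressor
from sklearn.datasets import load_diabetes
# Any regression dataset works; load_diabetes is used here only as an illustration
X, y = load_diabetes(return_X_y=True, as_frame=True)
atom = ATOMRegressor(X, y)
atom.run(models="GBM")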