XGBoost (XGB)
XGBoost is an optimized distributed gradient boosting model designed to be highly efficient, flexible and portable. XGBoost provides parallel tree boosting that solves many data science problems quickly and accurately.
Corresponding estimators are:
- XGBClassifier for classification tasks.
- XGBRegressor for regression tasks.
Read more in XGBoost's documentation.
Info
XGBoost allows early stopping to stop the training of unpromising models prematurely!
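To illustrate the idea, the sketch below stops a training run once the validation loss has not improved for a fixed number of rounds. It is plain Python, not XGBoost's or ATOM's implementation; the function name early_stopping_round, the patience parameter and the loss values are hypothetical.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, last_round) for a sequence of validation losses.

    Training stops as soon as `patience` rounds have passed without
    improving on the best loss seen so far (hypothetical helper).
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best score: remember it and keep training.
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_round, round_idx
    # The run finished without triggering early stopping.
    return best_round, len(val_losses) - 1


# Made-up validation losses, one per boosting round.
losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.60, 0.62, 0.65]
best_round, last_round = early_stopping_round(losses, patience=3)
print(best_round, last_round)
```

With these values, training stops at round 6 and the best score was reached at round 3, so the last two rounds are never computed.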
Hyperparameters
- By default, the estimator adopts the default parameters provided by its package. See the user guide on how to customize them.
- The n_jobs and random_state parameters are set equal to those of the trainer.
Dimensions: |
n_estimators: int, default=100
learning_rate: float, default=0.1
max_depth: int, default=6
gamma: float, default=0.0
min_child_weight: int, default=1
subsample: float, default=1.0
colsample_bytree: float, default=1.0
reg_alpha: float, default=0.0
reg_lambda: float, default=1.0 |
Attributes
Data attributes
Attributes: |
dataset: pd.DataFrame
train: pd.DataFrame
test: pd.DataFrame
X: pd.DataFrame
y: pd.Series
X_train: pd.DataFrame
y_train: pd.Series
X_test: pd.DataFrame
y_test: pd.Series
shape: tuple
columns: list
n_columns: int
features: list
n_features: int
target: str |
Utility attributes
Attributes: |
bo: pd.DataFrame Information of every step taken by the BO. Columns include:
best_call: str
best_params: dict
estimator: class
time_bo: str
metric_bo: float or list
time_fit: str
metric_train: float or list
metric_test: float or list
metric_bootstrap: np.array
mean_bootstrap: float or list
std_bootstrap: float or list Training results. Columns include:
|
Prediction attributes
A prediction attribute is not calculated until it is accessed for the first time. This lazy evaluation avoids computing attributes that are never used, saving time and memory.
Prediction attributes: |
predict_train: np.array
predict_test: np.array
predict_proba_train: np.array
predict_proba_test: np.array
predict_log_proba_train: np.array
predict_log_proba_test: np.array
score_train: np.float64
score_test: np.float64 |
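The calculate-on-first-access behaviour described above can be sketched with Python's functools.cached_property. This is only an illustration of the mechanism, not ATOM's actual implementation; the LazyModel class, its predict_fn argument and the n_computations counter are hypothetical.

```python
from functools import cached_property


class LazyModel:
    """Hypothetical model whose predictions are computed on first access."""

    def __init__(self, predict_fn, X_test):
        self._predict_fn = predict_fn
        self._X_test = X_test
        self.n_computations = 0  # Counts how often predictions are computed

    @cached_property
    def predict_test(self):
        # Runs only on the first access; the result is then stored
        # on the instance and returned directly on later accesses.
        self.n_computations += 1
        return [self._predict_fn(row) for row in self._X_test]


# Use the built-in sum as a stand-in prediction function.
model = LazyModel(predict_fn=sum, X_test=[[1, 2], [3, 4]])
calls_before = model.n_computations  # Nothing has been computed yet

first = model.predict_test   # Triggers the computation
second = model.predict_test  # Served from the cache
```

Creating the model costs nothing up front, and repeated access to predict_test does not repeat the computation.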
Methods
The majority of the plots and prediction methods can be called directly from the models, e.g. atom.xgb.plot_permutation_importance() or atom.xgb.predict(X).
The remaining utility methods are listed below.
calibrate | Calibrate the model. |
clear | Clear attributes from the model. |
cross_validate | Evaluate the model using cross-validation. |
delete | Delete the model from the trainer. |
dashboard | Create an interactive dashboard to analyze the model. |
evaluate | Get the model's scores for the provided metrics. |
export_pipeline | Export the model's pipeline to a sklearn-like Pipeline object. |
full_train | Train the estimator on the complete dataset. |
rename | Change the model's tag. |
save_estimator | Save the estimator to a pickle file. |
transform | Transform new data through the model's branch. |
calibrate
Applies probability calibration on the model. The estimator is trained via cross-validation on a subset of the training data, using the rest to fit the calibrator. The new classifier will replace the estimator attribute. If there is an active mlflow experiment, a new run is started using the name [model_name]_calibrate. Since the estimator changed, the model is cleared. Only available for classification tasks.
Parameters: |
**kwargs Additional keyword arguments for sklearn's CalibratedClassifierCV. Using cv="prefit" will use the trained model and fit the calibrator on the test set. Use this only if you have another, independent set for testing. |
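A minimal sketch of the underlying sklearn call, assuming a synthetic dataset and sklearn's GradientBoostingClassifier as a stand-in for the XGBoost estimator. It is not ATOM's internal code; it only shows how CalibratedClassifierCV trains the estimator on part of the training data and fits the calibrator on the rest.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stand-in gradient boosting estimator (not XGBoost itself).
base = GradientBoostingClassifier(n_estimators=20, random_state=0)

# With cv=3, each fold trains a copy of the estimator on two thirds of
# the training data and fits the calibrator on the remaining third.
calibrated = CalibratedClassifierCV(base, cv=3)
calibrated.fit(X_train, y_train)

# Calibrated class probabilities for the test set.
proba = calibrated.predict_proba(X_test)
```

The fitted object holds one calibrated classifier per fold and averages their probabilities at prediction time.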
clear
Reset attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the class. The cleared attributes per model are:
cross_validate
Evaluate the model using cross-validation. This method cross-validates the whole pipeline on the complete dataset. Use it to assess the robustness of the solution's performance.
Parameters: |
**kwargs Additional keyword arguments for sklearn's cross_validate function. If the scoring method is not specified, it uses the trainer's metric. |
Returns: |
scores: dict Return of sklearn's cross_validate function. |
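The shape of the returned dictionary can be seen by calling sklearn's cross_validate directly. The sketch below assumes synthetic data and a small sklearn pipeline (a scaler plus GradientBoostingRegressor) as a stand-in for the model's pipeline and the XGBoost estimator; the r2 scoring choice is also an assumption.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption for this sketch).
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Stand-in pipeline: the transformer and estimator are refitted together
# on every fold, so the whole pipeline is cross-validated.
pipeline = make_pipeline(
    StandardScaler(),
    GradientBoostingRegressor(n_estimators=20, random_state=0),
)

scores = cross_validate(pipeline, X, y, cv=5, scoring="r2")

# One entry per fold for the timings and for the test score.
for key, values in scores.items():
    print(key, values.round(3))
```

A large spread in test_score across folds suggests that the model's performance depends strongly on how the data is split.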
delete
Delete the model from the trainer. If it's the last model in the trainer, the metric is reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. The model is not removed from any active mlflow experiment.
dashboard
Create an interactive dashboard to analyze the model. The dashboard allows you to investigate SHAP values, permutation importances, interaction effects, partial dependence plots, all kinds of performance plots, and even individual decision trees. By default, the dashboard opens in an external dash app.
Parameters: |
dataset: str, optional (default="test")
filename: str or None, optional (default=None)
**kwargs |
Returns: |
dashboard: ExplainerDashboard Created dashboard object. |
evaluate
Get the model's scores for the provided metrics.
Parameters: |
metric: str, func, scorer, sequence or None, optional (default=None)
dataset: str, optional (default="test")
threshold: float Threshold between 0 and 1 to convert predicted probabilities to class labels. Only used when:
sample_weight: sequence or None, optional (default=None) |
Returns: |
score: pd.Series Scores of the model. |
export_pipeline
Export the model's pipeline to a sklearn-like Pipeline object. If the model used automated feature scaling, the scaler is added to the pipeline. The returned pipeline is already fitted on the training set.
Info
ATOM's Pipeline class behaves the same as a sklearn Pipeline, and additionally:
- Accepts transformers that change the target column.
- Accepts transformers that drop rows.
- Accepts transformers that are only fitted on a subset of the provided dataset.
- Always outputs pandas objects.
- Uses transformers that are only applied on the training set (see the balance or prune methods) to fit the pipeline, not to make predictions on unseen data.
Parameters: |
memory: bool, str, Memory or None, optional (default=None) Used to cache the fitted transformers of the pipeline.
verbose: int or None, optional (default=None) |
Returns: |
Pipeline Current branch as a sklearn-like Pipeline object. |
full_train
In some cases it might be desirable to use all available data to train a final model. Note that doing this means that the estimator can no longer be evaluated on the test set. The newly retrained estimator will replace the estimator attribute. If there is an active mlflow experiment, a new run is started with the name [model_name]_full_train. Since the estimator changed, the model is cleared.
Parameters: |
include_holdout: bool, optional (default=False) Whether to include the holdout data set (if available) in the training of the estimator. Note that if True, it means the model can't be evaluated. |
rename
Change the model's tag. The acronym always stays at the beginning of the model's name. If the model is being tracked by mlflow, the name of the corresponding run is also changed.
Parameters: |
name: str or None, optional (default=None) New tag for the model. If None, the tag is removed. |
save_estimator
Save the estimator to a pickle file.
Parameters: |
filename: str, optional (default="auto") Name of the file. Use "auto" for automatic naming. |
transform
Transform new data through the model's branch. Transformers that are only applied on the training set are skipped. If the model used feature scaling, the data is also scaled.
Parameters: |
X: dataframe-like
verbose: int or None, optional (default=None) |
Returns: |
pd.DataFrame
pd.Series |
Example
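from atom import ATOMRegressor
from sklearn.datasets import load_diabetes
# Example regression data; any feature set X and target y can be used instead
X, y = load_diabetes(return_X_y=True, as_frame=True)
atom = ATOMRegressor(X, y)
atom.run(models="XGB", metric="me", n_calls=25, bo_params={"cv": 1})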