CatBoost (CatB)

CatBoost is a machine learning method based on gradient boosting over decision trees. Main advantages of CatBoost:

Superior quality when compared with other GBDT models on many datasets.
Best-in-class prediction speed.

Corresponding estimators are:

CatBoostClassifier for classification tasks.
CatBoostRegressor for regression tasks.

Hyperparameters

By default, the estimator adopts the default parameters provided by its package. See the user guide on how to customize them.
The bootstrap_type parameter is set to "Bernoulli" to allow for the subsample parameter.
The num_leaves and min_child_samples parameters are not available for the CPU implementation.
The n_jobs and random_state parameters are set equal to those of the trainer.

Dimensions:

n_estimators: int, default=100
Integer(20, 500, name="n_estimators")

learning_rate: float, default=0.1
Real(0.01, 1.0, "log-uniform", name="learning_rate")

max_depth: int or None, default=None
Categorical([None, *list(range(1, 10))], name="max_depth")

subsample: float, default=1.0
Categorical(np.linspace(0.5, 1.0, 6), name="subsample")

colsample_by_level: float, default=1.0
Categorical(np.linspace(0.3, 1.0, 8), name="colsample_by_level")

reg_lambda: int or float, default=0
Categorical([0, 0.01, 0.1, 1, 10, 100], name="reg_lambda")
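To make the search space above more concrete, the sketch below draws random candidates from it using only the standard library. It is not ATOM's implementation: the helper name sample_params is hypothetical, and the two np.linspace grids are written out by hand. The point it illustrates is the "log-uniform" prior on learning_rate, which samples uniformly in log space so that small values are explored as often as large ones.

```python
import math
import random


def sample_params(rng):
    """Draw one random candidate from the CatB dimensions (illustrative only)."""
    return {
        # Integer(20, 500): any integer in the range, both bounds included here.
        "n_estimators": rng.randint(20, 500),
        # Real(0.01, 1.0, "log-uniform"): uniform in log space, so the ranges
        # 0.01-0.1 and 0.1-1.0 are equally likely to be sampled.
        "learning_rate": math.exp(rng.uniform(math.log(0.01), math.log(1.0))),
        # Categorical dimensions: pick one of the listed values.
        "max_depth": rng.choice([None, *range(1, 10)]),
        # Hand-written equivalent of np.linspace(0.5, 1.0, 6).
        "subsample": rng.choice([0.5, 0.6, 0.7, 0.8, 0.9, 1.0]),
        # Hand-written equivalent of np.linspace(0.3, 1.0, 8).
        "colsample_by_level": rng.choice([0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]),
        "reg_lambda": rng.choice([0, 0.01, 0.1, 1, 10, 100]),
    }


rng = random.Random(0)  # fixed seed so the draws are reproducible
candidates = [sample_params(rng) for _ in range(5)]
for params in candidates:
    print(params)
```

A Bayesian optimizer would not sample blindly like this; it uses the scores of earlier candidates to decide where to look next. The bounds and priors of each dimension are the same in both cases.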

Attributes

Data attributes

Attributes:

dataset: pd.DataFrame
Complete dataset in the pipeline.

train: pd.DataFrame
Training set.

test: pd.DataFrame
Test set.

X: pd.DataFrame
Feature set.

y: pd.Series
Target column.

X_train: pd.DataFrame
Training features.

y_train: pd.Series
Training target.

X_test: pd.DataFrame
Test features.

y_test: pd.Series
Test target.

shape: tuple
Dataset's shape: (n_rows, n_columns), or (n_rows, (shape_sample), n_columns) for datasets with more than two dimensions.

columns: list
Names of the columns in the dataset.

n_columns: int
Number of columns in the dataset.

features: list
Names of the features in the dataset.

n_features: int
Number of features in the dataset.

target: str
Name of the target column.

Utility attributes

Attributes:

bo: pd.DataFrame
Information about every step taken by the Bayesian optimization (BO). Columns include:

params: Parameters used in the model.
estimator: Estimator used for this iteration (fitted on last cross-validation).
score: Score of the chosen metric. List of scores for multi-metric.
time_iteration: Time spent on this iteration.
time: Total time spent since the start of the BO.

best_params: dict
Dictionary of the best combination of hyperparameters found by the BO.

estimator: class
Estimator instance with the best combination of hyperparameters fitted on the complete training set.

time_bo: str
Time it took to run the Bayesian optimization algorithm.

metric_bo: float or list
Best metric score(s) on the BO.

time_fit: str
Time it took to train the model on the complete training set and calculate the metric(s) on the test set.

metric_train: float or list
Metric score(s) on the training set.

metric_test: float or list
Metric score(s) on the test set.

metric_bootstrap: list
Bootstrap results with shape=(n_bootstrap,) for single-metric runs and shape=(metric, n_bootstrap) for multi-metric runs.

mean_bootstrap: float or list
Mean of the bootstrap results. List of values for multi-metric runs.

std_bootstrap: float or list
Standard deviation of the bootstrap results. List of values for multi-metric runs.

results: pd.Series
Training results. Entries include:

metric_bo: Best score achieved during the BO.
time_bo: Time spent on the BO.
metric_train: Metric score on the training set.
metric_test: Metric score on the test set.
time_fit: Time spent fitting and evaluating.
mean_bootstrap: Mean score of the bootstrap results.
std_bootstrap: Standard deviation of the bootstrap results.
time_bootstrap: Time spent on the bootstrap algorithm.
time: Total time spent on the whole run.
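The relation between metric_bootstrap, mean_bootstrap and std_bootstrap can be sketched in plain Python. This section does not describe how the bootstrap is carried out, so the sketch assumes that each round refits the model on a resample (with replacement) of the training set and scores it on the test set. The toy mean predictor, the neg_mae metric and the data are all hypothetical stand-ins.

```python
import random
import statistics


def neg_mae(y_true, y_pred):
    """Negative mean absolute error, so that higher is better."""
    return -statistics.fmean([abs(t - p) for t, p in zip(y_true, y_pred)])


y_train = [3.0, 5.0, 4.0, 6.0, 2.0, 7.0]
y_test = [4.0, 5.0, 3.0]
n_bootstrap = 4
rng = random.Random(1)

metric_bootstrap = []
for _ in range(n_bootstrap):
    # Resample the training target with replacement (same size as the original).
    sample = rng.choices(y_train, k=len(y_train))
    # "Refit" the toy model: it always predicts the mean of its training target.
    prediction = statistics.fmean(sample)
    metric_bootstrap.append(neg_mae(y_test, [prediction] * len(y_test)))

# Summary statistics over the n_bootstrap scores.
mean_bootstrap = statistics.fmean(metric_bootstrap)
# Population standard deviation; whether ATOM uses this variant is an assumption.
std_bootstrap = statistics.pstdev(metric_bootstrap)

print(metric_bootstrap, mean_bootstrap, std_bootstrap)
```

For a single-metric run, metric_bootstrap then has one entry per bootstrap round, which matches the shape=(n_bootstrap,) stated above.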

Prediction attributes

Each prediction attribute is computed lazily, the first time it is accessed. This avoids computing attributes that are never used, saving time and memory.
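One common way to get this kind of on-demand behaviour in Python is functools.cached_property. The sketch below is a minimal illustration of the idea, not ATOM's actual code: the LazyModel class and its placeholder prediction logic are hypothetical.

```python
from functools import cached_property


class LazyModel:
    """Hypothetical stand-in for a fitted model with a lazy prediction attribute."""

    def __init__(self, X_test):
        self.X_test = X_test
        self.n_computations = 0  # counts how often predictions are computed

    @cached_property
    def predict_test(self):
        # Runs only on first access; the result is then stored on the instance.
        self.n_computations += 1
        return [2 * x for x in self.X_test]  # placeholder for estimator.predict

    def reset_predictions(self):
        # Dropping the cached value forces a recomputation on the next access.
        self.__dict__.pop("predict_test", None)


model = LazyModel([1, 2, 3])
first = model.predict_test   # computed here
second = model.predict_test  # served from the cache
model.reset_predictions()
third = model.predict_test   # computed again after the reset
print(model.n_computations)
```

The reset_predictions method in the sketch shows why clearing cached predictions frees memory: the stored arrays are dropped and only rebuilt if they are requested again.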

Prediction attributes:

predict_train: np.ndarray
Predictions of the model on the training set.

predict_test: np.ndarray
Predictions of the model on the test set.

predict_proba_train: np.ndarray
Predicted probabilities of the model on the training set (only if classifier).

predict_proba_test: np.ndarray
Predicted probabilities of the model on the test set (only if classifier).

predict_log_proba_train: np.ndarray
Predicted log probabilities of the model on the training set (only if classifier).

predict_log_proba_test: np.ndarray
Predicted log probabilities of the model on the test set (only if classifier).

score_train: np.float64
Model's score on the training set.

score_test: np.float64
Model's score on the test set.

Methods

Most plot and prediction methods can be called directly from the model, e.g. atom.catb.plot_permutation_importance() or atom.catb.predict(X). The remaining utility methods are listed below.

calibrate	Calibrate the model.
cross_validate	Evaluate the model using cross-validation.
delete	Delete the model from the trainer.
export_pipeline	Export the model's pipeline to a sklearn-like Pipeline object.
full_train	Get the estimator trained on the complete dataset.
rename	Change the model's tag.
reset_predictions	Clear all the prediction attributes.
evaluate	Get the score for a specific metric.
save_estimator	Save the estimator to a pickle file.

method calibrate(**kwargs)

Applies probability calibration on the estimator. The estimator is trained via cross-validation on a subset of the training data, using the rest to fit the calibrator. The new classifier replaces the estimator attribute and is logged to any active mlflow experiment. Since the estimator changes, all the model's prediction attributes are reset. Only available for classifiers.

Parameters:

**kwargs
Additional keyword arguments for sklearn's CalibratedClassifierCV. Using cv="prefit" will use the trained model and fit the calibrator on the test set. Use this only if you have another, independent set for testing.

method cross_validate(**kwargs)

Evaluate the model using cross-validation. This method cross-validates the whole pipeline on the complete dataset. Use it to assess the robustness of the solution's performance.

Parameters:

**kwargs
Additional keyword arguments for sklearn's cross_validate function. If the scoring method is not specified, it uses the trainer's metric.

Returns:

scores: dict
Return of sklearn's cross_validate function.

method delete()

Delete the model from the trainer. If it's the winning model, the next best model (through metric_test or mean_bootstrap) is selected as winner. If it's the last model in the trainer, the metric and training approach are reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. The model is not removed from any active mlflow experiment.
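The winner re-selection described above can be sketched as follows. This is an illustration under stated assumptions, not ATOM's code: it assumes that higher scores are better and that mean_bootstrap takes precedence over metric_test when bootstrapping was run. The pick_winner helper, the other model names and all scores are hypothetical.

```python
def pick_winner(results):
    """Return the name of the best remaining model, or None if there are none."""
    if not results:
        return None

    def score(name):
        row = results[name]
        # Assumption: prefer mean_bootstrap when bootstrapping was run.
        if row["mean_bootstrap"] is not None:
            return row["mean_bootstrap"]
        return row["metric_test"]

    return max(results, key=score)


# Hypothetical scores for three trained models (no bootstrapping).
results = {
    "CatB": {"metric_test": 0.91, "mean_bootstrap": None},
    "LGB": {"metric_test": 0.88, "mean_bootstrap": None},
    "RF": {"metric_test": 0.85, "mean_bootstrap": None},
}

winner_before = pick_winner(results)
del results["CatB"]  # mimics deleting the winning model
winner_after = pick_winner(results)
print(winner_before, "->", winner_after)
```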

method export_pipeline(pipeline=None, verbose=None)

Export the model's pipeline to a sklearn-like object. If the model used feature scaling, the Scaler is added before the model. The returned pipeline is already fitted on the training set.

Note

ATOM's Pipeline class behaves exactly the same as a sklearn Pipeline, and additionally, it's compatible with transformers that drop samples and transformers that change the target column.

Warning

Due to incompatibilities with sklearn's API, the exported pipeline always fits/transforms on the entire dataset provided. Beware that this can cause errors if the transformers were fitted on a subset of the data.

Parameters:

pipeline: bool, sequence or None, optional (default=None)
Transformers to use on the data before predicting.

If None: Only transformers that are applied on the whole dataset are used.
If False: Don't use any transformers.
If True: Use all transformers in the pipeline.
If sequence: Transformers to use, selected by their index in the pipeline.

verbose: int or None, optional (default=None)
Verbosity level of the transformers in the pipeline. If None, it leaves them to their original verbosity.

Returns:

pipeline: Pipeline
Current branch as a sklearn-like Pipeline object.
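The four options of the pipeline parameter amount to a small selection rule, sketched below in plain Python. The select_transformers helper and the train_only flag (marking transformers that are not applied on the whole dataset) are hypothetical names used for illustration; ATOM's internals may differ.

```python
def select_transformers(transformers, pipeline=None):
    """Pick the transformers to export, following the four documented options."""
    if pipeline is None:
        # Only transformers that are applied on the whole dataset.
        return [t for t in transformers if not t["train_only"]]
    if pipeline is False:
        return []
    if pipeline is True:
        return list(transformers)
    # Otherwise, a sequence of indices into the pipeline.
    return [transformers[i] for i in pipeline]


# Hypothetical pipeline: the balancer is assumed to act on the training set only.
transformers = [
    {"name": "Imputer", "train_only": False},
    {"name": "Balancer", "train_only": True},
    {"name": "Scaler", "train_only": False},
]


def names(selection):
    return [t["name"] for t in selection]


default_names = names(select_transformers(transformers))
none_names = names(select_transformers(transformers, False))
all_names = names(select_transformers(transformers, True))
indexed_names = names(select_transformers(transformers, [1, 2]))
print(default_names, none_names, all_names, indexed_names)
```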

method full_train()

Get the estimator trained on the complete dataset. In some cases it might be desirable to use all the available data to train a final model after the right hyperparameters are found. Note that the resulting model cannot be evaluated, since no held-out data remains.

Returns:

est: estimator
Model estimator trained on the full dataset.

method rename(name=None)

Change the model's tag. The acronym always stays at the beginning of the model's name. If the model is being tracked by mlflow, the name of the corresponding run is also changed.

Parameters:

name: str or None, optional (default=None)
New tag for the model. If None, the tag is removed.

method reset_predictions()

Clear the prediction attributes from all models. Use this method to free some memory before saving the trainer.

method evaluate(metric=None, dataset="test")

Get the model's score for the provided metrics.

Parameters:

metric: str, func, scorer, sequence or None, optional (default=None)
Metrics to calculate. If None, a selection of the most common metrics per task are used.

dataset: str, optional (default="test")
Data set on which to calculate the metric. Options are "train" or "test".

Returns:

score: pd.Series
Scores of the model.

method save_estimator(filename="auto")

Save the estimator to a pickle file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.
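Saving to a pickle file and loading the result back can be sketched with the standard library alone. The code below is a stand-alone illustration, not ATOM's implementation: the save_estimator function here is a hypothetical free function, a plain dict stands in for the fitted estimator, and the rule that "auto" resolves to the estimator's class name is an assumption.

```python
import pickle
import tempfile
from pathlib import Path


def save_estimator(estimator, name, filename="auto", directory="."):
    """Pickle the estimator and return the path of the file that was written."""
    if filename == "auto":
        filename = name  # assumed naming rule: use the estimator's name
    path = Path(directory) / filename
    with open(path, "wb") as f:
        pickle.dump(estimator, f)
    return path


# Stand-in for a fitted estimator; any picklable object works the same way.
estimator = {"coef": [0.5, -1.2], "intercept": 0.3}

with tempfile.TemporaryDirectory() as tmp:
    path = save_estimator(estimator, "CatBoostRegressor", directory=tmp)
    saved_name = path.name
    # Load the estimator back, as one would before predicting elsewhere.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(saved_name, restored == estimator)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.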

Example

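from atom import ATOMRegressor
from sklearn.datasets import load_diabetes

# Example data; any feature set X and target y works.
X, y = load_diabetes(return_X_y=True)

atom = ATOMRegressor(X, y)
atom.run(models="CatB", n_calls=50, bo_params={"early_stopping": 0.1})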