
Bagging (Bag)


accept sparse

Bagging uses an ensemble meta-estimator that fits base classifiers/regressors, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator is typically used to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then making an ensemble out of it.

Corresponding estimators are:

  • BaggingClassifier for classification tasks.
  • BaggingRegressor for regression tasks.

Read more in sklearn's documentation.
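
The procedure described above can be sketched from scratch. This is a minimal illustration and not sklearn's implementation: the names fit_stump, bagging_fit and bagging_predict are hypothetical, the base estimator is a one-split regression stump on a single feature, and predictions are aggregated by averaging (a classifier would vote instead).

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump on a single feature (hypothetical base estimator)."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so that both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # all x values identical: predict a constant
        constant = mean(ys)
        return lambda x: constant
    _, t, lo, hi = best
    return lambda x: lo if x <= t else hi


def bagging_fit(xs, ys, n_estimators=10, max_samples=1.0, seed=0):
    """Fit n_estimators stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    k = max(1, int(max_samples * n))  # size of each random subset
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(k)]  # sampling with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Aggregate the individual predictions by averaging."""
    return mean(m(x) for m in models)


xs = list(range(10))
ys = [0.0] * 5 + [10.0] * 5  # a simple step function
models = bagging_fit(xs, ys, n_estimators=25)
low, high = bagging_predict(models, 1), bagging_predict(models, 8)
```

Each stump sees a slightly different sample, so individual stumps may place the split differently; averaging smooths out those differences, which is the variance reduction the text refers to.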



Hyperparameters

  • By default, the estimator adopts the default parameters provided by its package. See the user guide on how to customize them.
  • The n_jobs and random_state parameters are set equal to those of the trainer.
Dimensions:

n_estimators: int, default=10
Integer(10, 500, name="n_estimators")

max_samples: float, default=1.0
Categorical(np.linspace(0.5, 1.0, 6), name="max_samples")

max_features: float, default=1.0
Categorical(np.linspace(0.5, 1.0, 6), name="max_features")

bootstrap: bool, default=True
Categorical([True, False], name="bootstrap")

bootstrap_features: bool, default=False
Categorical([True, False], name="bootstrap_features")
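
To make the dimensions above concrete, the sketch below spells out the values they span and draws one random candidate from them. It is an illustration only: SEARCH_SPACE and sample_params are hypothetical names, the linspace helper mimics np.linspace with the standard library, and random sampling stands in for the optimizer, which chooses candidates in a more informed way.

```python
import random


def linspace(start, stop, num):
    """Evenly spaced values including both endpoints (like np.linspace, for num >= 2)."""
    step = (stop - start) / (num - 1)
    return [round(start + i * step, 10) for i in range(num)]


# Hypothetical plain-Python mirror of the dimensions listed above.
SEARCH_SPACE = {
    "n_estimators": ("int", 10, 500),                # integer range, both bounds included
    "max_samples": ("cat", linspace(0.5, 1.0, 6)),   # 0.5, 0.6, ..., 1.0
    "max_features": ("cat", linspace(0.5, 1.0, 6)),
    "bootstrap": ("cat", [True, False]),
    "bootstrap_features": ("cat", [True, False]),
}


def sample_params(rng):
    """Draw one random hyperparameter combination from SEARCH_SPACE."""
    params = {}
    for name, (kind, *spec) in SEARCH_SPACE.items():
        if kind == "int":
            low, high = spec
            params[name] = rng.randint(low, high)  # randint includes both ends
        else:
            params[name] = rng.choice(spec[0])
    return params


candidate = sample_params(random.Random(1))
```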



Attributes

Data attributes

Attributes:

dataset: pd.DataFrame
Complete dataset in the pipeline.

train: pd.DataFrame
Training set.

test: pd.DataFrame
Test set.

X: pd.DataFrame
Feature set.

y: pd.Series
Target column.

X_train: pd.DataFrame
Training features.

y_train: pd.Series
Training target.

X_test: pd.DataFrame
Test features.

y_test: pd.Series
Test target.

shape: tuple
Dataset's shape: (n_rows x n_columns) or (n_rows, (shape_sample), n_cols) for datasets with more than two dimensions.

columns: pd.Index
Names of the columns in the dataset.

n_columns: int
Number of columns in the dataset.

features: pd.Index
Names of the features in the dataset.

n_features: int
Number of features in the dataset.

target: str
Name of the target column.


Utility attributes

Attributes:

bo: pd.DataFrame
Information on every step taken by the Bayesian optimization (BO). Columns include:
  • call: Name of the call.
  • params: Parameters used in the model.
  • estimator: Estimator used for this iteration (fitted on last cross-validation).
  • score: Score of the chosen metric. List of scores for multi-metric runs.
  • time: Time spent on this iteration.
  • total_time: Total time spent since the start of the BO.

best_call: str
Name of the best call in the BO.

best_params: dict
Dictionary of the best combination of hyperparameters found by the BO.

estimator: class
Estimator instance with the best combination of hyperparameters fitted on the complete training set.

time_bo: str
Time it took to run the Bayesian optimization algorithm.

metric_bo: float or list
Best metric score(s) on the BO.

time_fit: str
Time it took to train the model on the complete training set and calculate the metric(s) on the test set.

metric_train: float or list
Metric score(s) on the training set.

metric_test: float or list
Metric score(s) on the test set.

metric_bootstrap: np.array
Bootstrap results with shape=(n_bootstrap,) for single-metric runs and shape=(metric, n_bootstrap) for multi-metric runs.

mean_bootstrap: float or list
Mean of the bootstrap results. List of values for multi-metric runs.

std_bootstrap: float or list
Standard deviation of the bootstrap results. List of values for multi-metric runs.

results: pd.Series
Overview of the training results. Entries include:
  • metric_bo: Best score achieved during the BO.
  • time_bo: Time spent on the BO.
  • metric_train: Metric score on the training set.
  • metric_test: Metric score on the test set.
  • time_fit: Time spent fitting and evaluating.
  • mean_bootstrap: Mean score of the bootstrap results.
  • std_bootstrap: Standard deviation score of the bootstrap results.
  • time_bootstrap: Time spent on the bootstrap algorithm.
  • time: Total time spent on the whole run.
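
The relation between metric_bootstrap, mean_bootstrap and std_bootstrap described above can be sketched as follows. The helper summarize_bootstrap is hypothetical, plain lists stand in for the np.array, and the population standard deviation is an assumption (it matches the default of np.std, but the library's exact choice is not stated here).

```python
from statistics import fmean, pstdev


def summarize_bootstrap(metric_bootstrap):
    """Return (mean, std) of bootstrap scores.

    Accepts a flat sequence of shape (n_bootstrap,) for single-metric runs
    or nested sequences of shape (metric, n_bootstrap) for multi-metric runs,
    in which case lists with one value per metric are returned.
    """
    if metric_bootstrap and isinstance(metric_bootstrap[0], (list, tuple)):
        means = [fmean(row) for row in metric_bootstrap]
        stds = [pstdev(row) for row in metric_bootstrap]
        return means, stds
    return fmean(metric_bootstrap), pstdev(metric_bootstrap)


# Single-metric run: three bootstrap scores.
mean_single, std_single = summarize_bootstrap([0.8, 0.9, 1.0])

# Multi-metric run: one row of bootstrap scores per metric.
mean_multi, std_multi = summarize_bootstrap([[0.8, 0.9, 1.0], [0.5, 0.5, 0.5]])
```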


Prediction attributes

The prediction attributes are not calculated until they are accessed for the first time. This mechanism avoids computing attributes that are never used, saving time and memory.

Prediction attributes:

predict_train: np.array
Predictions of the model on the training set.

predict_test: np.array
Predictions of the model on the test set.

predict_proba_train: np.array
Predicted probabilities of the model on the training set (only if classifier).

predict_proba_test: np.array
Predicted probabilities of the model on the test set (only if classifier).

predict_log_proba_train: np.array
Predicted log probabilities of the model on the training set (only if classifier).

predict_log_proba_test: np.array
Predicted log probabilities of the model on the test set (only if classifier).

score_train: np.float64
Model's score on the training set.

score_test: np.float64
Model's score on the test set.
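
The compute-on-first-access behavior of the prediction attributes can be illustrated with functools.cached_property. This is a sketch of the general pattern, not ATOM's actual code: LazyModel, its constructor arguments and the n_computations counter are hypothetical.

```python
from functools import cached_property


class LazyModel:
    """Hypothetical stand-in for a fitted model with a lazy prediction attribute."""

    def __init__(self, predict_fn, X_test):
        self._predict_fn = predict_fn
        self._X_test = X_test
        self.n_computations = 0  # counts how often predictions are computed

    @cached_property
    def predict_test(self):
        # Runs only on first access; the result is then stored on the instance.
        self.n_computations += 1
        return [self._predict_fn(x) for x in self._X_test]

    def clear(self):
        # Drop the cached array to free memory; it is recomputed on next access.
        self.__dict__.pop("predict_test", None)


model = LazyModel(lambda x: 2 * x, [1, 2, 3])
before = model.n_computations      # nothing has been computed yet
first = model.predict_test         # triggers the computation
second = model.predict_test        # served from the cache
after = model.n_computations
```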



Methods

Most of the plot and prediction methods can be called directly from the model, e.g. atom.bag.plot_permutation_importance() or atom.bag.predict(X). The remaining utility methods are listed below.

calibrate Calibrate the model.
clear Clear attributes from the model.
cross_validate Evaluate the model using cross-validation.
delete Delete the model from the trainer.
dashboard Create an interactive dashboard to analyze the model.
evaluate Get the model's scores for the provided metrics.
export_pipeline Export the model's pipeline to a sklearn-like Pipeline object.
full_train Train the estimator on the complete dataset.
rename Change the model's tag.
save_estimator Save the estimator to a pickle file.
transform Transform new data through the model's branch.


method calibrate(**kwargs) [source]

Apply probability calibration to the model. Only available for classifiers. The estimator is trained via cross-validation on a subset of the training data, using the rest to fit the calibrator. The new classifier replaces the estimator attribute. If there is an active mlflow experiment, a new run is started with the name [model_name]_calibrate. Since the estimator changes, the model is cleared.

Parameters:

**kwargs
Additional keyword arguments for sklearn's CalibratedClassifierCV. Using cv="prefit" will use the trained model and fit the calibrator on the test set. Use this only if you have another, independent set for testing.


method clear() [source]

Reset attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the class. The cleared attributes per model are:




method cross_validate(**kwargs) [source]

Evaluate the model using cross-validation. This method cross-validates the whole pipeline on the complete dataset. Use it to assess the robustness of the solution's performance.

Parameters:

**kwargs
Additional keyword arguments for sklearn's cross_validate function. If the scoring method is not specified, it uses the trainer's metric.

Returns:

scores: dict
Return of sklearn's cross_validate function.


method delete() [source]

Delete the model from the trainer. If it's the last model in the trainer, the metric is reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. The model is not removed from any active mlflow experiment.


method dashboard(dataset="test", filename=None, **kwargs) [source]

Create an interactive dashboard to analyze the model. The dashboard allows you to investigate SHAP values, permutation importances, interaction effects, partial dependence plots, all kinds of performance plots, and even individual decision trees. By default, the dashboard opens in an external dash app.

Parameters:

dataset: str, optional (default="test")
Data set to get the report from. Choose from: "train", "test", "both" (train and test) or "holdout".

filename: str or None, optional (default=None)
Name to save the file with (as .html). None to not save anything.

**kwargs
Additional keyword arguments for the ExplainerDashboard instance.

Returns:

dashboard: ExplainerDashboard
Created dashboard object.


method evaluate(metric=None, dataset="test", threshold=0.5, sample_weight=None) [source]

Get the model's scores for the provided metrics.

Parameters:

metric: str, func, scorer, sequence or None, optional (default=None)
Metrics to calculate. If None, a selection of the most common metrics per task is used.

dataset: str, optional (default="test")
Data set on which to calculate the metric. Choose from: "train", "test" or "holdout".

threshold: float, optional (default=0.5)
Threshold between 0 and 1 to convert predicted probabilities to class labels. Only used when:
  • The task is binary classification.
  • The model has a predict_proba method.
  • The metric evaluates predicted target values.

sample_weight: sequence or None, optional (default=None)
Sample weights corresponding to y in dataset.

Returns:

score: pd.Series
Scores of the model.
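
The effect of the threshold parameter in evaluate can be sketched as follows. The helpers apply_threshold and accuracy are hypothetical, and treating a probability equal to the threshold as the positive class is an assumption rather than documented behavior.

```python
def apply_threshold(proba_positive, threshold=0.5):
    """Convert positive-class probabilities to 0/1 labels (binary tasks only)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie between 0 and 1")
    # Assumption: ties (p == threshold) are assigned to the positive class.
    return [int(p >= threshold) for p in proba_positive]


def accuracy(y_true, y_pred):
    """Fraction of labels that match, a metric that evaluates predicted target values."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0, 0, 1, 1]
proba = [0.1, 0.6, 0.65, 0.9]  # predicted probability of class 1

labels_default = apply_threshold(proba)        # [0, 1, 1, 1]
labels_custom = apply_threshold(proba, 0.62)   # [0, 0, 1, 1]

score_default = accuracy(y_true, labels_default)  # 0.75
score_custom = accuracy(y_true, labels_custom)    # 1.0
```

Metrics that work on probabilities directly are unaffected by the threshold, which is why the parameter only applies when the metric evaluates predicted target values.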


method export_pipeline(memory=None, verbose=None) [source]

Export the model's pipeline to a sklearn-like Pipeline object. If the model used automated feature scaling, the scaler is added to the pipeline. The returned pipeline is already fitted on the training set.

Info

ATOM's Pipeline class behaves the same as a sklearn Pipeline, and additionally:

  • Accepts transformers that change the target column.
  • Accepts transformers that drop rows.
  • Accepts transformers that are only fitted on a subset of the provided dataset.
  • Always outputs pandas objects.
  • Uses transformers that are only applied on the training set (see the balance or prune methods) to fit the pipeline, not to make predictions on unseen data.

Parameters:

memory: bool, str, Memory or None, optional (default=None)
Used to cache the fitted transformers of the pipeline.
  • If None or False: No caching is performed.
  • If True: A default temp directory is used.
  • If str: Path to the caching directory.
  • If Memory: Object with the joblib.Memory interface.

verbose: int or None, optional (default=None)
Verbosity level of the transformers in the pipeline. If None, it leaves them to their original verbosity. Note that this is not the pipeline's own verbose parameter. To change that, use the set_params method.

Returns:

Pipeline
Current branch as a sklearn-like Pipeline object.


method full_train(include_holdout=False) [source]

In some cases it might be desirable to use all available data to train a final model. Note that doing this means that the estimator can no longer be evaluated on the test set. The newly retrained estimator will replace the estimator attribute. If there is an active mlflow experiment, a new run is started with the name [model_name]_full_train. Since the estimator changed, the model is cleared.

Parameters:

include_holdout: bool, optional (default=False)
Whether to include the holdout data set (if available) in the training of the estimator. Note that if True, it means the model can't be evaluated.


method rename(name=None) [source]

Change the model's tag. The acronym always stays at the beginning of the model's name. If the model is being tracked by mlflow, the name of the corresponding run is also changed.

Parameters:

name: str or None, optional (default=None)
New tag for the model. If None, the tag is removed.


method save_estimator(filename="auto") [source]

Save the estimator to a pickle file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.
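
Saving to and loading from a pickle file can be sketched with the standard library. The helpers dump_estimator and load_estimator are hypothetical, and a plain dict stands in for a fitted estimator; how the automatic file name is built is not covered here.

```python
import pickle
import tempfile
from pathlib import Path


def dump_estimator(estimator, filename):
    """Serialize any picklable object to filename (hypothetical helper)."""
    with open(filename, "wb") as f:
        pickle.dump(estimator, f)


def load_estimator(filename):
    """Load a previously saved object. Only unpickle files from a trusted source."""
    with open(filename, "rb") as f:
        return pickle.load(f)


# Stand-in for a fitted estimator; a fitted sklearn estimator is passed the same way.
estimator = {"model": "Bag", "n_estimators": 10}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bag_estimator.pkl"
    dump_estimator(estimator, path)
    restored = load_estimator(path)
```

A pickled estimator should be loaded with the same library versions that were used to save it, since pickle stores references to classes rather than their code.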


method transform(X, y=None, verbose=None) [source]

Transform new data through the model's branch. Transformers that are only applied on the training set are skipped. If the model used feature scaling, the data is also scaled.

Parameters:

X: dataframe-like
Features to transform, with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
  • If None: y is ignored in the transformers.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • Else: Target column with shape=(n_samples,).

verbose: int or None, optional (default=None)
Verbosity level of the output. If None, it uses the transformer's own verbosity.

Returns:

pd.DataFrame
Transformed feature set.

pd.Series
Transformed target column. Only returned if provided.


Example

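from atom import ATOMRegressor
from sklearn.datasets import load_diabetes

# Any regression dataset works; diabetes is only used for illustration.
X, y = load_diabetes(return_X_y=True, as_frame=True)

atom = ATOMRegressor(X, y)
atom.run(models="Bag")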