
ExtraTrees


ET | accept sparse

Extra-Trees use a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

Corresponding estimators are:

  • ExtraTreesClassifier for classification tasks.
  • ExtraTreesRegressor for regression tasks.

Read more in sklearn's documentation.
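
As a rough illustration of what the underlying estimator does outside ATOM, the sketch below fits sklearn's ExtraTreesClassifier directly on the breast cancer dataset. The train/test split, the number of trees and the random seeds are arbitrary choices made for this sketch, not ATOM's defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Every tree draws its split thresholds at random; the ensemble averages
# the predictions of all trees.
model = ExtraTreesClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

f1_test = f1_score(y_test, model.predict(X_test))
print(f"f1 on the test set: {f1_test:.3f}")
```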


See Also

DecisionTree

Single Decision Tree.

ExtraTree

Extremely Randomized Tree.

RandomForest

Random Forest.


Example

>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)

>>> atom = ATOMClassifier(X, y)
>>> atom.run(models="ET", metric="f1", verbose=2)

Training ========================= >>
Models: ET
Metric: f1


Results for ExtraTrees:
Fit ---------------------------------------------
Train evaluation --> f1: 1.0
Test evaluation --> f1: 0.993
Time elapsed: 0.095s
-------------------------------------------------
Total time: 0.095s


Final results ==================== >>
Total time: 0.095s
-------------------------------------
ExtraTrees --> f1: 0.993



Hyperparameters

Parameters (classification tasks)
n_estimators
IntDistribution(high=500, log=False, low=10, step=10)
criterion
CategoricalDistribution(choices=('gini', 'entropy'))
max_depth
CategoricalDistribution(choices=(None, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16))
min_samples_split
IntDistribution(high=20, log=False, low=2, step=1)
min_samples_leaf
IntDistribution(high=20, log=False, low=1, step=1)
max_features
CategoricalDistribution(choices=(None, 'sqrt', 'log2', 0.5, 0.6, 0.7, 0.8, 0.9))
bootstrap
CategoricalDistribution(choices=(True, False))
max_samples
CategoricalDistribution(choices=(None, 0.5, 0.6, 0.7, 0.8, 0.9))
ccp_alpha
FloatDistribution(high=0.035, log=False, low=0.0, step=0.005)

Parameters (regression tasks)
n_estimators
IntDistribution(high=500, log=False, low=10, step=10)
criterion
CategoricalDistribution(choices=('squared_error', 'absolute_error'))
max_depth
CategoricalDistribution(choices=(None, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16))
min_samples_split
IntDistribution(high=20, log=False, low=2, step=1)
min_samples_leaf
IntDistribution(high=20, log=False, low=1, step=1)
max_features
CategoricalDistribution(choices=(None, 'sqrt', 'log2', 0.5, 0.6, 0.7, 0.8, 0.9))
bootstrap
CategoricalDistribution(choices=(True, False))
max_samples
CategoricalDistribution(choices=(None, 0.5, 0.6, 0.7, 0.8, 0.9))
ccp_alpha
FloatDistribution(high=0.035, log=False, low=0.0, step=0.005)
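
To get a feel for the size of this search space, the sketch below enumerates the candidate values of a few stepped distributions. It assumes the usual optuna convention that a stepped distribution yields low, low + step, ..., up to high; the helper name grid is made up for this example.

```python
def grid(low, high, step):
    """Return low, low + step, ..., high (inclusive), rounded to hide float noise."""
    n_steps = int(round((high - low) / step))
    return [round(low + i * step, 10) for i in range(n_steps + 1)]


n_estimators = grid(10, 500, 10)        # IntDistribution(low=10, high=500, step=10)
min_samples_split = grid(2, 20, 1)      # IntDistribution(low=2, high=20, step=1)
ccp_alpha = grid(0.0, 0.035, 0.005)     # FloatDistribution(low=0.0, high=0.035, step=0.005)

print(len(n_estimators), len(min_samples_split), len(ccp_alpha))
```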





Attributes

Data attributes

Attributes
pipeline: (self)
Transformers fitted on the data.

Models that used automated feature scaling have the scaler added. Use this attribute only to access the individual instances. To visualize the pipeline, use the plot_pipeline method.

mapping: dict
Encoded values and their respective mapped values.

The column name is the key to its mapping dictionary. Only for columns mapped to a single column (e.g., Ordinal, Leave-one-out).

dataset: pd.DataFrame
Complete data set.
train: pd.DataFrame
Training set.
test: pd.DataFrame
Test set.
X: pd.DataFrame
Feature set.
y: pd.Series
Target column.
X_train: pd.DataFrame
Features of the training set.
y_train: pd.Series
Target column of the training set.
X_test: pd.DataFrame
Features of the test set.
y_test: pd.Series
Target column of the test set.
shape: tuple
Shape of the dataset (n_rows, n_cols).
columns: pd.Series
Name of all the columns.
n_columns: int
Number of columns.
features: pd.Series
Name of the features.
n_features: int
Number of features.
target: str
Name of the target column.


Utility attributes

Attributes
name: str
Name of the model.

Use the property's @setter to change the model's name. The acronym always stays at the beginning of the model's name. If the model is being tracked by mlflow, the name of the corresponding run also changes.

study: Study or None
Optuna study used for hyperparameter tuning.
trials: pd.DataFrame or None
Overview of the trials' results.

All durations are in seconds. Columns include:

  • params: Parameters used for this trial.
  • estimator: Estimator used for this trial.
  • score: Objective score(s) of the trial.
  • time_trial: Duration of the trial.
  • time_ht: Duration of the hyperparameter tuning.
  • state: Trial's state (COMPLETE, PRUNED, FAIL).
best_trial: Trial or None
Trial that returned the highest score.

For multi-metric runs, the best trial is the one that performed best on the main metric. Use the property's @setter to change the best trial.

best_params: dict
Hyperparameters used by the best trial.
score_ht: Union[float, numpy.floating, List[Union[float, numpy.floating]], NoneType]
Metric score obtained by the best trial.
time_ht: int or None
Duration of the hyperparameter tuning (in seconds).
estimator: Predictor
Estimator fitted on the training set.
score_train: Union[float, numpy.floating, List[Union[float, numpy.floating]]]
Metric score on the training set.
score_test: Union[float, numpy.floating, List[Union[float, numpy.floating]]]
Metric score on the test set.
score_holdout: Union[float, numpy.floating, List[Union[float, numpy.floating]]]
Metric score on the holdout set.
time_fit: int
Duration of the model fitting on the train set (in seconds).
bootstrap: pd.DataFrame or None
Overview of the bootstrapping scores.

The dataframe has shape=(n_bootstrap, metric) and shows the score obtained by every bootstrapped sample for every metric. Using atom.bootstrap.mean() yields the same values as score_bootstrap.

score_bootstrap: Union[float, numpy.floating, List[Union[float, numpy.floating]], NoneType]
Mean metric score on the bootstrapped samples.
time_bootstrap: int or None
Duration of the bootstrapping (in seconds).
time: int
Total duration of the run (in seconds).
feature_importance: pd.Series or None
Normalized feature importance scores.

The sum of importances for all features is 1. The scores are extracted from the estimator's scores_, coef_ or feature_importances_ attribute, checked in that order. Returns None for estimators without any of those attributes.
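
The normalization can be pictured as follows. This is only a sketch with made-up scores; taking absolute values first (so that negative coefficients count by magnitude) is an assumption of this example, and normalize_importances is a hypothetical helper.

```python
def normalize_importances(raw_scores):
    """Scale the absolute scores so that they sum to 1."""
    magnitudes = {name: abs(value) for name, value in raw_scores.items()}
    total = sum(magnitudes.values())
    return {name: value / total for name, value in magnitudes.items()}


# Made-up raw scores, e.g. as they could come from coef_.
raw = {"mean radius": 2.0, "mean texture": -1.0, "mean area": 1.0}
importance = normalize_importances(raw)
print(importance)
```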

results: pd.Series
Overview of the training results.

All durations are in seconds. Values include:

  • score_ht: Score obtained by the hyperparameter tuning.
  • time_ht: Duration of the hyperparameter tuning.
  • score_train: Metric score on the train set.
  • score_test: Metric score on the test set.
  • time_fit: Duration of the model fitting on the train set.
  • score_bootstrap: Mean score on the bootstrapped samples.
  • time_bootstrap: Duration of the bootstrapping.
  • time: Total duration of the run.


Prediction attributes

The prediction attributes are not calculated until the attribute is called for the first time. This mechanism avoids having to calculate attributes that are never used, saving time and memory.

Attributes
predict_train: pd.Series
Class predictions on the training set.
predict_test: pd.Series
Class predictions on the test set.
predict_holdout: pd.Series or None
Class predictions on the holdout set.
predict_log_proba_train: pd.DataFrame
Class log-probability predictions on the training set.
predict_log_proba_test: pd.DataFrame
Class log-probability predictions on the test set.
predict_log_proba_holdout: pd.DataFrame or None
Class log-probability predictions on the holdout set.
predict_proba_train: pd.DataFrame
Class probability predictions on the training set.
predict_proba_test: pd.DataFrame
Class probability predictions on the test set.
predict_proba_holdout: pd.DataFrame or None
Class probability predictions on the holdout set.
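
The lazy calculation of the prediction attributes can be mimicked with functools.cached_property. The class below is a hypothetical stand-in written for this sketch; it does not claim to show how ATOM implements the mechanism.

```python
from functools import cached_property


class LazyModel:
    """Hypothetical model whose predictions are computed on first access only."""

    def __init__(self, X_test):
        self.X_test = X_test
        self.n_computations = 0  # counts how often the prediction is calculated

    @cached_property
    def predict_test(self):
        self.n_computations += 1
        return [int(x > 0) for x in self.X_test]


model = LazyModel([-1.5, 0.3, 2.0])
first = model.predict_test    # calculated here, on the first call
second = model.predict_test   # served from the cache, nothing is recalculated
print(first, model.n_computations)
```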



Methods

The plots and prediction methods can be called directly from the model. The remaining utility methods are listed below.

  • bootstrapping: Apply a bootstrap algorithm.
  • calibrate: Calibrate the model.
  • clear: Clear attributes from the model.
  • create_app: Create an interactive app to test model predictions.
  • create_dashboard: Create an interactive dashboard to analyze the model.
  • cross_validate: Evaluate the model using cross-validation.
  • delete: Delete the model.
  • evaluate: Get the model's scores for the provided metrics.
  • export_pipeline: Export the model's pipeline to a sklearn-like object.
  • fit: Fit and validate the model.
  • full_train: Train the estimator on the complete dataset.
  • hyperparameter_tuning: Run the hyperparameter tuning algorithm.
  • inverse_transform: Inversely transform new data through the pipeline.
  • save_estimator: Save the estimator to a pickle file.
  • transform: Transform new data through the pipeline.


method bootstrapping(n_bootstrap, reset=False)
Apply a bootstrap algorithm.

Take bootstrapped samples from the training set and test them on the test set to get a distribution of the model's results.

Parameters
n_bootstrap: int
Number of bootstrapped samples to fit on.

reset: bool, default=False
Whether to start a new run or continue the existing one.
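
The idea behind the bootstrap algorithm can be sketched in plain Python. The toy nearest-centroid "estimator", the synthetic data and all helper names below are assumptions made for this illustration; only the overall procedure (resample the training set with replacement, fit, score on the fixed test set) follows the description above.

```python
import random
from statistics import mean


def fit_centroids(X, y):
    """Toy 1-D 'estimator': remember the mean feature value of each class."""
    return {label: mean(x for x, t in zip(X, y) if t == label) for label in set(y)}


def accuracy(centroids, X, y):
    """Predict the class with the nearest centroid and compare with y."""
    preds = [min(centroids, key=lambda c: abs(x - centroids[c])) for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


rng = random.Random(1)
X_train = [rng.gauss(0, 1) for _ in range(50)] + [rng.gauss(3, 1) for _ in range(50)]
y_train = [0] * 50 + [1] * 50
X_test = [rng.gauss(0, 1) for _ in range(20)] + [rng.gauss(3, 1) for _ in range(20)]
y_test = [0] * 20 + [1] * 20

n_bootstrap = 5
scores = []
for _ in range(n_bootstrap):
    # Sample the training set with replacement, keeping the original size.
    idx = [rng.randrange(len(X_train)) for _ in range(len(X_train))]
    centroids = fit_centroids([X_train[i] for i in idx], [y_train[i] for i in idx])
    # Every bootstrapped fit is scored on the same, untouched test set.
    scores.append(accuracy(centroids, X_test, y_test))

score_bootstrap = mean(scores)  # mean score over the bootstrapped samples
print(scores, score_bootstrap)
```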



method calibrate(**kwargs)
Calibrate the model.

Applies probability calibration on the model. The estimator is trained via cross-validation on a subset of the training data, using the rest to fit the calibrator. The new classifier will replace the estimator attribute. If there is an active mlflow experiment, a new run is started using the name [model_name]_calibrate. Since the estimator changed, the model is cleared. Only for classifiers.

Parameters
**kwargs
Additional keyword arguments for sklearn's CalibratedClassifierCV. Using cv="prefit" uses the trained model and fits the calibrator on the test set. Use this only if you have another, independent set for testing.
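
For reference, this is roughly what cross-validated probability calibration looks like when sklearn's CalibratedClassifierCV is used directly. The dataset, the number of folds and the estimator settings are arbitrary choices for this sketch and say nothing about the values ATOM passes internally.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The estimator is refitted on the training part of every fold; the
# held-out part of the fold is used to fit the calibrator.
calibrated = CalibratedClassifierCV(
    ExtraTreesClassifier(n_estimators=50, random_state=1), cv=3
)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)
print(proba[:3])
```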



method clear()
Clear attributes from the model.

Reset the model attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the instance. The cleared attributes are:



method create_app(**kwargs)
Create an interactive app to test model predictions.

Demo your machine learning model with a friendly web interface. This app launches directly in the notebook or on an external browser page. The created Interface instance can be accessed through the app attribute.

Parameters
**kwargs
Additional keyword arguments for the Interface instance or the Interface.launch method.



method create_dashboard(dataset="test", filename=None, **kwargs)
Create an interactive dashboard to analyze the model.

ATOM uses the explainerdashboard package to provide a quick and easy way to analyze and explain the predictions and workings of the model. The dashboard allows you to investigate SHAP values, permutation importances, interaction effects, partial dependence plots, all kinds of performance plots, and even individual decision trees.

By default, the dashboard renders in a new tab in your default browser, but if preferable, you can render it inside the notebook using the mode="inline" parameter. The created ExplainerDashboard instance can be accessed through the dashboard attribute.

Note

Plots displayed by the dashboard are not created by ATOM and can differ from those retrieved through this package.

Parameters
dataset: str, default="test"
Data set to get the report from. Choose from: "train", "test", "both" (train and test) or "holdout".

filename: str or None, default=None
Name to save the file with (as .html). None to not save anything.

**kwargs
Additional keyword arguments for the ExplainerDashboard instance.



method cross_validate(**kwargs)
Evaluate the model using cross-validation.

This method cross-validates the whole pipeline on the complete dataset. Use it to assess the robustness of the solution's performance.

Parameters
**kwargs
Additional keyword arguments for sklearn's cross_validate function. If the scoring method is not specified, it uses atom's metric.

Returns
pd.DataFrame
Overview of the results.
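
Cross-validating a whole pipeline, rather than only the estimator, can be sketched with sklearn's cross_validate function. The pipeline below (a StandardScaler followed by an ExtraTreesClassifier) and the scoring choice are assumptions for this example, not the pipeline ATOM builds.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Transformer and estimator are refitted together inside every fold, so the
# validation part of a fold never influences the preprocessing.
pipeline = make_pipeline(
    StandardScaler(), ExtraTreesClassifier(n_estimators=50, random_state=1)
)
cv_results = cross_validate(pipeline, X, y, cv=5, scoring="f1")

print(cv_results["test_score"].mean())
```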



method delete()
Delete the model.

If it's the last model in atom, the metric is reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. The model is not removed from any active mlflow experiment.



method evaluate(metric=None, dataset="test", threshold=0.5, sample_weight=None)
Get the model's scores for the provided metrics.

Parameters
metric: str, func, scorer, sequence or None, default=None
Metrics to calculate. If None, a selection of the most common metrics per task is used.

dataset: str, default="test"
Data set on which to calculate the metric. Choose from: "train", "test" or "holdout".

threshold: float, default=0.5
Threshold between 0 and 1 to convert predicted probabilities to class labels. Only used when:

  • The task is binary classification.
  • The model has a predict_proba method.
  • The metric evaluates predicted target values.

sample_weight: sequence or None, default=None
Sample weights corresponding to y in dataset.

Returns
pd.Series
Scores of the model.
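
The effect of the threshold parameter on a metric that evaluates predicted labels can be sketched as follows. The probabilities are made up, and labelling a sample as positive when its probability is greater than or equal to the threshold is an assumption of this sketch; to_labels and f1 are hypothetical helpers.

```python
def to_labels(proba_positive, threshold=0.5):
    """Label a sample 1 when its positive-class probability reaches the threshold."""
    return [int(p >= threshold) for p in proba_positive]


def f1(y_true, y_pred):
    """F1 score for binary labels, computed from the confusion counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn)


y_true = [0, 1, 1, 1]
proba = [0.10, 0.45, 0.62, 0.91]  # made-up positive-class probabilities

f1_default = f1(y_true, to_labels(proba))       # labels [0, 0, 1, 1]
f1_strict = f1(y_true, to_labels(proba, 0.7))   # labels [0, 0, 0, 1]
print(f1_default, f1_strict)
```

A higher threshold turns fewer samples into positives, which here lowers recall and therefore the f1 score.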



method export_pipeline(memory=None, verbose=None)
Export the model's pipeline to a sklearn-like object.

The returned pipeline is already fitted on the training set. Note that, if the model used automated feature scaling, the Scaler is added to the pipeline.

Info

The returned pipeline behaves similarly to sklearn's Pipeline, and additionally:

  • Accepts transformers that change the target column.
  • Accepts transformers that drop rows.
  • Accepts transformers that only are fitted on a subset of the provided dataset.
  • Always returns pandas objects.
  • Uses transformers that are only applied on the training set to fit the pipeline, not to make predictions.

Parameters
memory: bool, str, Memory or None, default=None
Used to cache the fitted transformers of the pipeline.

  • If None or False: No caching is performed.
  • If True: A default temp directory is used.
  • If str: Path to the caching directory.
  • If Memory: Object with the joblib.Memory interface.

verbose: int or None, default=None
Verbosity level of the transformers in the pipeline. If None, it leaves them to their original verbosity. Note that this is not the pipeline's own verbose parameter. To change that, use the set_params method.

Returns
Pipeline
Current branch as a sklearn-like Pipeline object.



method fit(X=None, y=None)
Fit and validate the model.

The estimator is fitted using the best hyperparameters found during hyperparameter tuning. Afterwards, the estimator is evaluated on the test set. Only use this method to re-fit the model after having continued the study.

Parameters
X: pd.DataFrame or None
Feature set with shape=(n_samples, n_features). If None, self.X_train is used.

y: pd.Series or None
Target column corresponding to X. If None, self.y_train is used.



method full_train(include_holdout=False)
Train the estimator on the complete dataset.

In some cases it might be desirable to use all available data to train a final model. Note that doing this means that the estimator can no longer be evaluated on the test set. The newly retrained estimator will replace the estimator attribute. If there is an active mlflow experiment, a new run is started with the name [model_name]_full_train. Since the estimator changed, the model is cleared.

Warning

Although the model is trained on the complete dataset, the pipeline is not. To get a fully trained pipeline, use: pipeline = atom.export_pipeline().fit(atom.X, atom.y).

Parameters
include_holdout: bool, default=False
Whether to include the holdout set (if available) in the training of the estimator. It's discouraged to use this option since it means the model can no longer be evaluated on any set.



method hyperparameter_tuning(n_trials, reset=False)
Run the hyperparameter tuning algorithm.

Search for the best combination of hyperparameters. The function to optimize is evaluated either with a K-fold cross-validation on the training set or using a random train and validation split every trial. Use this method to continue the optimization.

Parameters
n_trials: int
Number of trials for the hyperparameter tuning.

reset: bool, default=False
Whether to start a new study or continue the existing one.
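
The difference between continuing a study and resetting it can be sketched with a toy example. Note that this sketch uses plain random search on a made-up objective instead of the optuna study the model uses; ToyStudy and its methods are hypothetical and are neither ATOM's nor optuna's API.

```python
import random


class ToyStudy:
    """Hypothetical stand-in for a tuning study, using plain random search."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)
        self.trials = []  # list of (params, score) tuples

    def objective(self, params):
        # Made-up score that peaks at max_depth=8 and min_samples_leaf=1.
        return -abs(params["max_depth"] - 8) - 0.1 * (params["min_samples_leaf"] - 1)

    def tune(self, n_trials, reset=False):
        if reset:
            self.trials = []  # start a new study instead of continuing
        for _ in range(n_trials):
            params = {
                "max_depth": self.rng.randint(1, 16),
                "min_samples_leaf": self.rng.randint(1, 20),
            }
            self.trials.append((params, self.objective(params)))

    @property
    def best_trial(self):
        # The best trial is the one with the highest score.
        return max(self.trials, key=lambda trial: trial[1])


study = ToyStudy()
study.tune(n_trials=5)
study.tune(n_trials=3)               # continues: 8 trials in total
n_after_continue = len(study.trials)
best_params, best_score = study.best_trial

study.tune(n_trials=4, reset=True)   # restarts: only the 4 new trials remain
n_after_reset = len(study.trials)
print(n_after_continue, n_after_reset, best_params)
```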



method inverse_transform(X=None, y=None, verbose=None)
Inversely transform new data through the pipeline.

Transformers that are only applied on the training set are skipped. The rest should all implement an inverse_transform method. If only X or only y is provided, it ignores transformers that require the other parameter. This can be of use to, for example, inversely transform only the target column. If called from a model that used automated feature scaling, the scaling is inverted as well.

Parameters
X: dataframe-like or None, default=None
Transformed feature set with shape=(n_samples, n_features). If None, X is ignored in the transformers.

y: int, str, dict, sequence or None, default=None

  • If None: y is ignored in the transformers.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • Else: Array with shape=(n_samples,) to use as target.

verbose: int or None, default=None
Verbosity level for the transformers. If None, it uses the transformer's own verbosity.

Returns
pd.DataFrame
Original feature set. Only returned if provided.

y: pd.Series
Original target column. Only returned if provided.
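
The relation between transform and inverse_transform can be illustrated with a single scaling step. ToyScaler is a hypothetical transformer written for this sketch; a real pipeline chains several transformers and inverts them in reverse order.

```python
from statistics import mean, pstdev


class ToyScaler:
    """Hypothetical standardizing transformer for one numeric column."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

    def inverse_transform(self, values):
        # Undo the scaling: multiply by the std and add the mean back.
        return [v * self.std_ + self.mean_ for v in values]


original = [2.0, 4.0, 6.0, 8.0]
scaler = ToyScaler().fit(original)
scaled = scaler.transform(original)
restored = scaler.inverse_transform(scaled)
print(scaled, restored)
```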



method save_estimator(filename="auto")
Save the estimator to a pickle file.

Parameters
filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.
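
Saving to and loading from a pickle file works along these lines. The dictionary is only a stand-in for a fitted estimator, and the file name is a hypothetical choice for this sketch.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted estimator; any picklable object works the same way.
estimator = {"name": "ExtraTrees", "n_estimators": 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "ExtraTrees.pkl"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(estimator, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.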



method transform(X=None, y=None, verbose=None)
Transform new data through the pipeline.

Transformers that are only applied on the training set are skipped. If only X or only y is provided, it ignores transformers that require the other parameter. This can be of use to, for example, transform only the target column. If called from a model that used automated feature scaling, the data is scaled as well.

Parameters
X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored in the transformers.

y: int, str, dict, sequence or None, default=None

  • If None: y is ignored in the transformers.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • Else: Array with shape=(n_samples,) to use as target.

verbose: int or None, default=None
Verbosity level for the transformers. If None, it uses the transformer's own verbosity.

Returns
pd.DataFrame
Transformed feature set. Only returned if provided.

y: pd.Series
Transformed target column. Only returned if provided.