
DecisionTree


Tree | accept sparse | native multilabel | native multioutput

A single decision tree classifier/regressor.

Corresponding estimators are:

  • DecisionTreeClassifier for classification tasks.
  • DecisionTreeRegressor for regression tasks.

Read more in sklearn's documentation.


See Also

ExtraTree

Extremely Randomized Tree.

ExtraTrees

Extremely Randomized Trees.

RandomForest

Random Forest.


Example

>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)

>>> atom = ATOMClassifier(X, y, random_state=1)
>>> atom.run(models="Tree", metric="f1", verbose=2)


Training ========================= >>
Models: Tree
Metric: f1


Results for DecisionTree:
Fit ---------------------------------------------
Train evaluation --> f1: 1.0
Test evaluation --> f1: 0.9589
Time elapsed: 0.029s
-------------------------------------------------
Time: 0.029s


Final results ==================== >>
Total time: 0.032s
-------------------------------------
DecisionTree --> f1: 0.9589



Hyperparameters

Parameters (classification)

criterion
CategoricalDistribution(choices=('gini', 'entropy'))
splitter
CategoricalDistribution(choices=('best', 'random'))
max_depth
CategoricalDistribution(choices=(None, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16))
min_samples_split
IntDistribution(high=20, log=False, low=2, step=1)
min_samples_leaf
IntDistribution(high=20, log=False, low=1, step=1)
max_features
CategoricalDistribution(choices=(None, 'sqrt', 'log2', 0.5, 0.6, 0.7, 0.8, 0.9))
ccp_alpha
FloatDistribution(high=0.035, log=False, low=0.0, step=0.005)

Parameters (regression)

criterion
CategoricalDistribution(choices=('squared_error', 'absolute_error', 'friedman_mse', 'poisson'))
splitter
CategoricalDistribution(choices=('best', 'random'))
max_depth
CategoricalDistribution(choices=(None, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16))
min_samples_split
IntDistribution(high=20, log=False, low=2, step=1)
min_samples_leaf
IntDistribution(high=20, log=False, low=1, step=1)
max_features
CategoricalDistribution(choices=(None, 'sqrt', 'log2', 0.5, 0.6, 0.7, 0.8, 0.9))
ccp_alpha
FloatDistribution(high=0.035, log=False, low=0.0, step=0.005)
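The blocks above describe the tuning spaces for the classification and regression estimators as Optuna distribution objects. As an illustration only, an equivalent classifier search space can be explored with plain sklearn's RandomizedSearchCV (this is a stand-in sketch; ATOM's own tuner uses Optuna, not RandomizedSearchCV):

```python
# Sketch of an equivalent search space for the classifier, explored with
# sklearn's RandomizedSearchCV. Illustrative only: ATOM tunes via Optuna.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Mirrors the distributions listed above.
param_distributions = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [None, *range(1, 17)],
    "min_samples_split": list(range(2, 21)),
    "min_samples_leaf": list(range(1, 21)),
    "max_features": [None, "sqrt", "log2", 0.5, 0.6, 0.7, 0.8, 0.9],
    "ccp_alpha": [i * 0.005 for i in range(8)],  # 0.0 to 0.035, step 0.005
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions,
    n_iter=10,
    scoring="f1",
    random_state=1,
)
search.fit(X, y)
print(search.best_params_)
```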





Attributes

Data attributes

Attributes

pipeline: Pipeline
Pipeline of transformers.

Models that used automated feature scaling have the scaler added.

Tip

Use the plot_pipeline method to visualize the pipeline.

mapping: dict[str, dict[str, int | float]]
Encoded values and their respective mapped values.

The column name is the key to its mapping dictionary. Only for columns mapped to a single column (e.g., Ordinal, Leave-one-out, etc...).

dataset: pd.DataFrame
Complete data set.
train: pd.DataFrame
Training set.
test: pd.DataFrame
Test set.
X: pd.DataFrame
Feature set.
y: pd.Series | pd.DataFrame
Target column(s).
X_train: pd.DataFrame
Features of the training set.
y_train: pd.Series | pd.DataFrame
Target column of the training set.
X_test: pd.DataFrame
Features of the test set.
y_test: pd.Series | pd.DataFrame
Target column(s) of the test set.
X_holdout: pd.DataFrame | None
Features of the holdout set.
y_holdout: pd.Series | pd.DataFrame | None
Target column of the holdout set.
shape: tuple[Int, Int]
Shape of the dataset (n_rows, n_columns).
columns: pd.Index
Name of all the columns.
n_columns: int
Number of columns.
features: pd.Index
Name of the features.
n_features: int
Number of features.
target: str | list[str]
Name of the target column(s).


Utility attributes

Attributes

name: str
Name of the model.

Use the property's @setter to change the model's name. The acronym always stays at the beginning of the model's name. If the model is being tracked by mlflow, the name of the corresponding run also changes.

run: Run
Mlflow run corresponding to this model.

This property is only available for models with mlflow tracking enabled.

study: Study
Optuna study used for hyperparameter tuning.

This property is only available for models that ran hyperparameter tuning.

trials: pd.DataFrame
Overview of the trials' results.

This property is only available for models that ran hyperparameter tuning. All durations are in seconds. Columns include:

  • [param_name]: Parameter value used in this trial.
  • estimator: Estimator used in this trial.
  • [metric_name]: Metric score of the trial.
  • [best_metric_name]: Best score so far in this study.
  • time_trial: Duration of the trial.
  • time_ht: Duration of the hyperparameter tuning.
  • state: Trial's state (COMPLETE, PRUNED, FAIL).
best_trial: FrozenTrial
Trial that returned the highest score.

For multi-metric runs, the best trial is the trial that performed best on the main metric. Use the property's @setter to change the best trial. See here for an example. This property is only available for models that ran hyperparameter tuning.

best_params: dict[str, Any]
Estimator's parameters in the best trial.

This property is only available for models that ran hyperparameter tuning.

estimator: Predictor
Estimator fitted on the training set.
bootstrap: pd.DataFrame
Overview of the bootstrapping scores.

The dataframe has shape=(n_bootstrap, metric) and shows the score obtained by every bootstrapped sample for every metric. Using atom.bootstrap.mean() yields the same values as [metric]_bootstrap. This property is only available for models that ran bootstrapping.

results: pd.Series
Overview of the model results.

All durations are in seconds. Possible values include:

  • [metric]_ht: Score obtained by the hyperparameter tuning.
  • time_ht: Duration of the hyperparameter tuning.
  • [metric]_train: Metric score on the train set.
  • [metric]_test: Metric score on the test set.
  • time_fit: Duration of the model fitting on the train set.
  • [metric]_bootstrap: Mean score on the bootstrapped samples.
  • time_bootstrap: Duration of the bootstrapping.
  • time: Total duration of the run.
feature_importance: pd.Series
Normalized feature importance scores.

The sum of importances for all features is 1. The scores are extracted from the estimator's scores_, coef_ or feature_importances_ attribute, checked in that order. This property is only available for estimators with at least one of those attributes.
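The normalization described above can be sketched with plain sklearn: fit a tree, read one of the listed attributes (here feature_importances_), and divide by the sum so the scores add up to 1. A minimal sketch, not ATOM's exact implementation:

```python
# Minimal sketch of normalized feature importance scores: divide the raw
# attribute by its sum so the scores add up to 1 (for tree estimators the
# impurity-based importances are already normalized; the division is the
# generic step that also covers attributes like coef_).
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=1).fit(X, y)

importances = tree.feature_importances_ / tree.feature_importances_.sum()
print(importances.sum())
```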



Methods

The plots can be called directly from the model. The remaining utility methods can be found hereunder.

bootstrapping: Apply a bootstrap algorithm.
calibrate: Calibrate and retrain the model.
canvas: Create a figure with multiple plots.
clear: Reset attributes and clear cache from the model.
create_app: Create an interactive app to test model predictions.
create_dashboard: Create an interactive dashboard to analyze the model.
cross_validate: Evaluate the model using cross-validation.
decision_function: Get confidence scores on new data or existing rows.
evaluate: Get the model's scores for the provided metrics.
export_pipeline: Export the transformer pipeline with final estimator.
fit: Fit and validate the model.
full_train: Train the estimator on the complete dataset.
get_best_threshold: Get the threshold that maximizes a metric.
get_tags: Get the model's tags.
hyperparameter_tuning: Run the hyperparameter tuning algorithm.
inverse_transform: Inversely transform new data through the pipeline.
predict: Get predictions on new data or existing rows.
predict_log_proba: Get class log-probabilities on new data or existing rows.
predict_proba: Get class probabilities on new data or existing rows.
register: Register the model in mlflow's model registry.
reset_aesthetics: Reset the plot aesthetics to their default values.
save_estimator: Save the estimator to a pickle file.
score: Get a metric score on new data.
serve: Serve the model as a REST API endpoint for inference.
set_threshold: Set the binary threshold of the estimator.
transform: Transform new data through the pipeline.
update_layout: Update the properties of the plot's layout.
update_traces: Update the properties of the plot's traces.


method bootstrapping(n_bootstrap, reset=False)[source]

Apply a bootstrap algorithm.

Take bootstrapped samples from the training set and test them on the test set to get a distribution of the model's results.

Parameters n_bootstrap: int
Number of bootstrapped samples to fit on.

reset: bool, default=False
Whether to start a new run or continue the existing one.
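The procedure described above can be sketched with plain sklearn: refit the estimator on bootstrapped samples of the training set and score every fit on the untouched test set. An illustrative sketch under those assumptions, not ATOM's implementation:

```python
# Illustrative bootstrapping sketch: refit on resampled training data,
# score each refit on the same held-out test set to get a score distribution.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

scores = []
for i in range(5):  # n_bootstrap=5
    X_boot, y_boot = resample(X_train, y_train, random_state=i)
    model = DecisionTreeClassifier(random_state=1).fit(X_boot, y_boot)
    scores.append(f1_score(y_test, model.predict(X_test)))

print(scores)  # distribution of test-set scores
```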



method calibrate(method="sigmoid", train_on_test=False)[source]

Calibrate and retrain the model.

Uses sklearn's CalibratedClassifierCV to apply probability calibration on the model. The new classifier replaces the estimator attribute. If there is an active mlflow experiment, a new run is started using the name [model_name]_calibrate. Since the estimator changed, the model is cleared. Only for classifiers.

Note

By default, the calibration is optimized using the training set (which is already used for the initial training). This approach is subject to undesired overfitting. It's preferred to use train_on_test=True, which uses the test set for calibration, but only if there is another, independent set for testing (holdout set).

Parameters method: str, default="sigmoid"
The method to use for calibration. Choose from:

  • "sigmoid": Corresponds to Platt's method (i.e., a logistic regression model).
  • "isotonic": Non-parametric approach. It's not advised to use this calibration method with too few samples (<1000) since it tends to overfit.

train_on_test: bool, default=False
Whether to train the calibrator on the test set.
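Since the description above names sklearn's CalibratedClassifierCV as the underlying mechanism, the core of the operation can be sketched directly with sklearn (a minimal sketch, not ATOM's exact call, which also handles mlflow runs and clearing):

```python
# Core of probability calibration per the description above: wrap the
# classifier in sklearn's CalibratedClassifierCV with method="sigmoid".
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

calibrated = CalibratedClassifierCV(
    DecisionTreeClassifier(random_state=1), method="sigmoid"
).fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)  # calibrated class probabilities
```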



method canvas(rows=1, cols=2, sharex=False, sharey=False, hspace=0.05, vspace=0.07, title=None, legend="out", figsize=None, filename=None, display=True)[source]

Create a figure with multiple plots.

This @contextmanager allows you to draw many plots in one figure. The default option is to add two plots side by side. See the user guide for an example.

Parameters rows: int, default=1
Number of plots in length.

cols: int, default=2
Number of plots in width.

sharex: bool, default=False
If True, hide the label and ticks from non-border subplots on the x-axis.

sharey: bool, default=False
If True, hide the label and ticks from non-border subplots on the y-axis.

hspace: float, default=0.05
Space between subplot rows in normalized plot coordinates. The spacing is relative to the figure's size.

vspace: float, default=0.07
Space between subplot cols in normalized plot coordinates. The spacing is relative to the figure's size.

title: str, dict or None, default=None
Title for the plot.

legend: bool, str or dict, default="out"
Legend for the plot. See the user guide for an extended description of the choices.

  • If None: No legend is shown.
  • If str: Position to display the legend.
  • If dict: Legend configuration.

figsize: tuple or None, default=None
Figure's size in pixels, format as (x, y). If None, it adapts the size to the number of plots in the canvas.

filename: str, Path or None, default=None
Save the plot using this name. Use "auto" for automatic naming. The type of the file depends on the provided name (.html, .png, .pdf, etc...). If filename has no file type, the plot is saved as html. If None, the plot is not saved.

display: bool, default=True
Whether to render the plot.

Yields go.Figure
Plot object.



method clear()[source]

Reset attributes and clear cache from the model.

Reset certain model attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the instance. The affected attributes are:



method create_app(**kwargs)[source]

Create an interactive app to test model predictions.

Demo your machine learning model with a friendly web interface. This app launches directly in the notebook or on an external browser page. The created Interface instance can be accessed through the app attribute.

Parameters **kwargs
Additional keyword arguments for the Interface instance or the Interface.launch method.



method create_dashboard(rows="test", filename=None, **kwargs)[source]

Create an interactive dashboard to analyze the model.

ATOM uses the explainerdashboard package to provide a quick and easy way to analyze and explain the predictions and workings of the model. The dashboard allows you to investigate SHAP values, permutation importances, interaction effects, partial dependence plots, all kinds of performance plots, and even individual decision trees.

By default, the dashboard renders in a new tab in your default browser, but if preferable, you can render it inside the notebook using the mode="inline" parameter. The created ExplainerDashboard instance can be accessed through the dashboard attribute. This method is not available for multioutput tasks.

Note

Plots displayed by the dashboard are not created by ATOM and can differ from those retrieved through this package.

Parameters rows: hashable, segment, sequence or dataframe, default="test"
Selection of rows to get the report from.

filename: str, Path or None, default=None
Filename or pathlib.Path of the file to save. None to not save anything.

**kwargs
Additional keyword arguments for the ExplainerDashboard instance.



method cross_validate(include_holdout=False, **kwargs)[source]

Evaluate the model using cross-validation.

This method cross-validates the whole pipeline on the complete dataset. Use it to assess the robustness of the model's performance. If the scoring method is not specified in kwargs, it uses atom's metric. The results of the cross-validation are stored in the model's cv attribute.

Tip

This method returns a pandas' Styler object. Convert the result back to a regular dataframe using its data attribute.

Parameters include_holdout: bool, default=False
Whether to include the holdout set (if available) in the cross-validation.

**kwargs
Additional keyword arguments for one of these functions.

Returns Styler
Overview of the results.
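The idea of cross-validating the whole pipeline rather than the bare estimator can be sketched with plain sklearn (ATOM's version additionally reuses atom's metric and stores the results in the model's cv attribute):

```python
# Sketch of cross-validating a full pipeline (scaler + estimator) on the
# complete dataset, as the description above prescribes.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))

cv = cross_validate(pipe, X, y, cv=5, scoring="f1")
print(cv["test_score"].mean())
```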



method decision_function(X, verbose=None)[source]

Get confidence scores on new data or existing rows.

New data is first transformed through the model's pipeline. Transformers that are only applied on the training set are skipped. The estimator must have a decision_function method.

Read more in the user guide.

Parameters X: hashable, segment, sequence or dataframe-like
Selection of rows or feature set with shape=(n_samples, n_features) to make predictions on.

verbose: int or None, default=None
Verbosity level for the transformers in the pipeline. If None, it uses the pipeline's verbosity.

Returns series or dataframe
Predicted confidence scores with shape=(n_samples,) for binary classification tasks (log likelihood ratio of the positive class) or shape=(n_samples, n_classes) for multiclass classification tasks.



method evaluate(metric=None, rows="test")[source]

Get the model's scores for the provided metrics.

Tip

Use the get_best_threshold or plot_threshold method to determine a suitable threshold for a binary classifier.

Parameters metric: str, func, scorer, sequence or None, default=None
Metrics to calculate. If None, a selection of the most common metrics per task is used.

rows: hashable, segment, sequence or dataframe, default="test"
Selection of rows to calculate metric on.

Returns pd.Series
Scores of the model.



method export_pipeline()[source]

Export the transformer pipeline with final estimator.

The returned pipeline is already fitted on the training set. Note that if the model used automated feature scaling, the Scaler is added to the pipeline.

Returns Pipeline
Current branch as a sklearn-like Pipeline object.



method fit(X=None, y=None, prefit=False)[source]

Fit and validate the model.

The estimator is fitted using the best hyperparameters found during hyperparameter tuning. Afterwards, the estimator is evaluated on the test set. Only use this method to re-fit the model after having continued the study.

Parameters X: pd.DataFrame or None
Feature set with shape=(n_samples, n_features). If None, self.X_train is used.

y: pd.Series, pd.DataFrame or None
Target column(s) corresponding to X. If None, self.y_train is used.

prefit: bool, default=False
Whether the estimator is already fitted. If True, only evaluate the model.



method full_train(include_holdout=False)[source]

Train the estimator on the complete dataset.

In some cases, it might be desirable to use all available data to train a final model. Note that doing this means that the estimator can no longer be evaluated on the test set. The newly retrained estimator will replace the estimator attribute. If there is an active mlflow experiment, a new run is started with the name [model_name]_full_train. Since the estimator changed, the model is cleared.

Warning

Although the model is trained on the complete dataset, the pipeline is not. To get a fully trained pipeline, use: pipeline = atom.export_pipeline().fit(atom.X, atom.y).

Parameters include_holdout: bool, default=False
Whether to include the holdout set (if available) in the training of the estimator. It's discouraged to use this option since it means the model can no longer be evaluated on any set.



method get_best_threshold(metric=None, train_on_test=False)[source]

Get the threshold that maximizes a metric.

Uses sklearn's TunedThresholdClassifierCV to post-tune the decision threshold (cut-off point) that is used for converting posterior probability estimates (i.e., output of predict_proba) or decision scores (i.e., output of decision_function) into a class label. The tuning is done by optimizing one of atom's metrics. The tuning estimator is stored under the tuned_threshold attribute. Only available for binary classifiers.

Note

By default, the threshold is optimized using the training set (which is already used for the initial training). This approach is subject to undesired overfitting. It's preferred to use train_on_test=True, which uses the test set for tuning, but only if there is another, independent set for testing (holdout set).

Tip

Use the plot_threshold method to visualize the effect of different thresholds on a metric.

Parameters metric: int, str or None, default=None
Metric to optimize on. If None, the main metric is used.

train_on_test: bool, default=False
Whether to tune the threshold on the test set.

Returns float
Optimized threshold value.



method get_tags()[source]

Get the model's tags.

Return class parameters that provide general information about the model's characteristics.

Returns dict
Model's tags.



method hyperparameter_tuning(n_trials, reset=False)[source]

Run the hyperparameter tuning algorithm.

Search for the best combination of hyperparameters. The function to optimize is evaluated either with a K-fold cross-validation on the training set or using a random train and validation split every trial. Use this method to continue the optimization.

Parameters n_trials: int
Number of trials for the hyperparameter tuning.

reset: bool, default=False
Whether to start a new study or continue the existing one.



method inverse_transform(X=None, y=None, verbose=None)[source]

Inversely transform new data through the pipeline.

Transformers that are only applied on the training set are skipped. The rest should all implement an inverse_transform method. If only X or only y is provided, it ignores transformers that require the other parameter. This can be of use to, for example, inversely transform only the target column. If called from a model that used automated feature scaling, the scaling is inverted as well.

Parameters X: dataframe-like or None, default=None
Transformed feature set with shape=(n_samples, n_features). If None, X is ignored in the transformers.

y: int, str, sequence, dataframe-like or None, default=None
Target column(s) corresponding to X.

  • If None: y is ignored.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
  • If dataframe-like: Target columns for multioutput tasks.

verbose: int or None, default=None
Verbosity level for the transformers in the pipeline. If None, it uses the pipeline's verbosity.

Returns dataframe
Original feature set. Only returned if provided.

series or dataframe
Original target column. Only returned if provided.



method predict(X, inverse=True, verbose=None)[source]

Get predictions on new data or existing rows.

New data is first transformed through the model's pipeline. Transformers that are only applied on the training set are skipped. The estimator must have a predict method.

Read more in the user guide.

Parameters X: hashable, segment, sequence or dataframe-like
Selection of rows or feature set with shape=(n_samples, n_features) to make predictions on.

inverse: bool, default=True
Whether to inversely transform the output through the pipeline. This doesn't affect the predictions if there are no transformers in the pipeline or if the transformers have no inverse_transform method or don't apply to y.

verbose: int or None, default=None
Verbosity level for the transformers in the pipeline. If None, it uses the pipeline's verbosity.

Returns series or dataframe
Predictions with shape=(n_samples,) or shape=(n_samples, n_targets) for multioutput tasks.



method predict_log_proba(X, verbose=None)[source]

Get class log-probabilities on new data or existing rows.

New data is first transformed through the model's pipeline. Transformers that are only applied on the training set are skipped. The estimator must have a predict_log_proba method.

Read more in the user guide.

Parameters X: hashable, segment, sequence or dataframe-like
Selection of rows or feature set with shape=(n_samples, n_features) to make predictions on.

verbose: int or None, default=None
Verbosity level for the transformers in the pipeline. If None, it uses the pipeline's verbosity.

Returns dataframe
Predicted class log-probabilities with shape=(n_samples, n_classes) or shape=(n_samples * n_classes, n_targets) with a multiindex format for multioutput tasks.



method predict_proba(X, verbose=None)[source]

Get class probabilities on new data or existing rows.

New data is first transformed through the model's pipeline. Transformers that are only applied on the training set are skipped. The estimator must have a predict_proba method.

Read more in the user guide.

Parameters X: hashable, segment, sequence or dataframe-like
Selection of rows or feature set with shape=(n_samples, n_features) to make predictions on.

verbose: int or None, default=None
Verbosity level for the transformers in the pipeline. If None, it uses the pipeline's verbosity.

Returns dataframe
Predicted class probabilities with shape=(n_samples, n_classes) or shape=(n_samples * n_classes, n_targets) with a multiindex format for multioutput tasks.



method register(name=None, stage="None", archive_existing_versions=False)[source]

Register the model in mlflow's model registry.

This method is only available when model tracking is enabled using one of the following URI schemes: databricks, http, https, postgresql, mysql, sqlite, mssql.

Parameters name: str or None, default=None
Name for the registered model. If None, the model's full name is used. If the name of the model already exists, a new model version is created.

stage: str, default="None"
New desired stage for the model.

archive_existing_versions: bool, default=False
Whether to move all existing model versions in that stage to the "Archived" stage. Only valid when stage is "Staging" or "Production", otherwise an error is raised.



classmethod atom.plots.baseplot.reset_aesthetics()[source]

Reset the plot aesthetics to their default values.



method save_estimator(filename="auto")[source]

Save the estimator to a pickle file.

Parameters filename: str or Path, default="auto"
Filename or pathlib.Path of the file to save. Use "auto" for automatic naming.



method score(X, y=None, metric=None, sample_weight=None, verbose=None)[source]

Get a metric score on new data.

New data is first transformed through the model's pipeline. Transformers that are only applied on the training set are skipped.

Read more in the user guide.

Info

If the metric parameter is left to its default value, the method returns atom's metric score, not the metric returned by sklearn's score method for estimators.

Parameters X: hashable, segment, sequence or dataframe-like
Selection of rows or feature set with shape=(n_samples, n_features) to make predictions on.

y: int, str, sequence, dataframe-like or None, default=None
Target column(s) corresponding to X.

  • If None: X must be a selection of rows in the dataset.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
  • If dataframe: Target columns for multioutput tasks.

metric: str, func, scorer or None, default=None
Metric to calculate. Choose from any of sklearn's scorers, a function with signature metric(y_true, y_pred) -> score or a scorer object. If None, it uses atom's metric (the main metric for multi-metric runs).

sample_weight: sequence or None, default=None
Sample weights corresponding to y.

verbose: int or None, default=None
Verbosity level for the transformers in the pipeline. If None, it uses the pipeline's verbosity.

Returns float
Metric score of X with respect to y.



method serve(method="predict")[source]

Serve the model as a REST API endpoint for inference.

The complete pipeline is served with the model. The inference data must be supplied as json to the HTTP request, e.g. requests.get("http://127.0.0.1:8000/", json=X.to_json()). The deployment is done on a ray cluster. The default host and port parameters deploy to localhost.

Tip

Use import ray; ray.serve.shutdown() to close the endpoint after finishing.

Parameters method: str, default="predict"
Estimator's method to do inference on.



method set_threshold(threshold)[source]

Set the binary threshold of the estimator.

A new classifier using the new threshold replaces the estimator attribute. If there is an active mlflow experiment, a new run is started using the name [model_name]_threshold_X. Since the estimator changed, the model is cleared. Only for binary classifiers.

Tip

Use the get_best_threshold method to find the optimal threshold for a specific metric.

Parameters threshold: float
Binary threshold to classify the positive class.



method transform(X=None, y=None, verbose=None)[source]

Transform new data through the pipeline.

Transformers that are only applied on the training set are skipped. If only X or only y is provided, it ignores transformers that require the other parameter. This can be of use to, for example, transform only the target column. If called from a model that used automated feature scaling, the data is scaled as well.

Parameters X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored in the transformers.

y: int, str, sequence, dataframe-like or None, default=None
Target column(s) corresponding to X.

  • If None: y is ignored.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
  • If dataframe-like: Target columns for multioutput tasks.

verbose: int or None, default=None
Verbosity level for the transformers in the pipeline. If None, it uses the pipeline's verbosity.

Returns dataframe
Transformed feature set. Only returned if provided.

series or dataframe
Transformed target column. Only returned if provided.



classmethod atom.plots.baseplot.update_layout(**kwargs)[source]

Update the properties of the plot's layout.

Recursively update the structure of the original layout with the values in the arguments.

Parameters **kwargs
Keyword arguments for the figure's update_layout method.



classmethod atom.plots.baseplot.update_traces(**kwargs)[source]

Update the properties of the plot's traces.

Recursively update the structure of the original traces with the values in the arguments.

Parameters **kwargs
Keyword arguments for the figure's update_traces method.