Dummy
When doing supervised learning, a simple sanity check consists of comparing one's estimator against simple rules of thumb. The prediction methods completely ignore the input data. Do not use this model for real problems. Use it only as a simple baseline to compare with other models.
Corresponding estimators are:
- DummyClassifier for classification tasks.
- DummyRegressor for regression tasks.
Read more in sklearn's documentation.
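As a minimal sketch of what "ignoring the input data" means, the snippet below uses sklearn's DummyClassifier and DummyRegressor directly, outside of ATOM. The toy arrays and the chosen strategies ("most_frequent" and "mean") are assumptions made for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Toy data (assumed for illustration): one feature, five samples.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_class = np.array([0, 1, 1, 1, 0])
y_reg = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Both estimators only learn a summary of the target, never of X.
clf = DummyClassifier(strategy="most_frequent").fit(X, y_class)
reg = DummyRegressor(strategy="mean").fit(X, y_reg)

# Wildly different inputs still give the same prediction.
X_new = np.array([[-100.0], [0.5], [1e6]])
class_pred = clf.predict(X_new)
reg_pred = reg.predict(X_new)
```

Any real model should score clearly better than these constant predictions; if it does not, the features or the training setup deserve a second look.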
See Also
Example
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> atom = ATOMClassifier(X, y)
>>> atom.run(models="Dummy", metric="f1", verbose=2)
Training ========================= >>
Models: Dummy
Metric: f1
Results for Dummy:
Fit ---------------------------------------------
Train evaluation --> f1: 0.7709
Test evaluation --> f1: 0.7717
Time elapsed: 0.006s
-------------------------------------------------
Total time: 0.006s
Final results ==================== >>
Total time: 0.007s
-------------------------------------
Dummy --> f1: 0.7717
Hyperparameters
Attributes
Data attributes
Attributes | pipeline: (self)
Transformers fitted on the data.
Models that used automated feature scaling have the scaler added. Use this attribute only to access the individual instances. To visualize the pipeline, use the plot_pipeline method.
mapping: dict
Encoded values and their respective mapped values.
The column name is the key to its mapping dictionary. Only for columns mapped to a single column (e.g. Ordinal, Leave-one-out, etc.).
dataset: pd.DataFrame
Complete data set.
train: pd.DataFrame
Training set.
test: pd.DataFrame
Test set.
X: pd.DataFrame
Feature set.
y: pd.Series
Target column.
X_train: pd.DataFrame
Features of the training set.
y_train: pd.Series
Target column of the training set.
X_test: pd.DataFrame
Features of the test set.
y_test: pd.Series
Target column of the test set.
shape: tuple
Shape of the dataset (n_rows, n_cols).
columns: pd.Series
Name of all the columns.
n_columns: int
Number of columns.
features: pd.Series
Name of the features.
n_features: int
Number of features.
target: str
Name of the target column.
|
Utility attributes
Attributes | name: str
Name of the model.
Use the property's
study: Study or None
Optuna study used for hyperparameter tuning.
trials: pd.DataFrame or None
Overview of the trials' results.
All durations are in seconds. Columns include:
best_trial: Trial or None
Trial that returned the highest score.
For multi-metric runs, the best trial is the trial that performed best on the main metric. Use the property's
best_params: dict
Hyperparameters used by the best trial.
score_ht: Union[float, numpy.floating, List[Union[float, numpy.floating]], NoneType]
Metric score obtained by the best trial.
time_ht: int or None
Duration of the hyperparameter tuning (in seconds).
estimator: Predictor
Estimator fitted on the training set.
score_train: Union[float, numpy.floating, List[Union[float, numpy.floating]]]
Metric score on the training set.
score_test: Union[float, numpy.floating, List[Union[float, numpy.floating]]]
Metric score on the test set.
score_holdout: Union[float, numpy.floating, List[Union[float, numpy.floating]]]
Metric score on the holdout set.
time_fit: int
Duration of the model fitting on the train set (in seconds).
bootstrap: pd.DataFrame or None
Overview of the bootstrapping scores.
The dataframe has shape=(n_bootstrap, metric) and shows the score obtained by every bootstrapped sample for every metric. Using
score_bootstrap: Union[float, numpy.floating, List[Union[float, numpy.floating]], NoneType]
Mean metric score on the bootstrapped samples.
time_bootstrap: int or None
Duration of the bootstrapping (in seconds).
time: int
Total duration of the run (in seconds).
feature_importance: pd.Series or None
Normalized feature importance scores.
The sum of importances for all features is 1. The scores are extracted from the estimator's
results: pd.Series
Overview of the training results.
All durations are in seconds. Values include:
|
Prediction attributes
The prediction attributes are not calculated until the attribute is called for the first time. This mechanism avoids having to calculate attributes that are never used, saving time and memory.
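The lazy behaviour described above can be sketched with the standard library's cached_property. The LazyModel class, its predict_test attribute and the n_calls counter are hypothetical names used only for illustration; this is not ATOM's actual implementation.

```python
from functools import cached_property


class LazyModel:
    """Hypothetical stand-in for a model with lazily computed predictions."""

    def __init__(self, predict_fn, X_test):
        self._predict_fn = predict_fn
        self._X_test = X_test
        self.n_calls = 0  # counts how often predictions are really computed

    @cached_property
    def predict_test(self):
        # Runs only on first access; the result is then stored on the instance.
        self.n_calls += 1
        return [self._predict_fn(x) for x in self._X_test]


model = LazyModel(lambda x: x * 2, [1, 2, 3])
calls_before = model.n_calls  # nothing has been computed yet
first = model.predict_test    # triggers the computation
second = model.predict_test   # served from the cache
```

The trade-off is that the cached result stays in memory until it is explicitly removed, which is why a method that clears such attributes is useful before saving.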
Methods
The plots and prediction methods can be called directly from the model. The remaining utility methods can be found hereunder.
bootstrapping | Apply a bootstrap algorithm. |
calibrate | Calibrate the model. |
clear | Clear attributes from the model. |
create_app | Create an interactive app to test model predictions. |
create_dashboard | Create an interactive dashboard to analyze the model. |
cross_validate | Evaluate the model using cross-validation. |
delete | Delete the model. |
evaluate | Get the model's scores for the provided metrics. |
export_pipeline | Export the model's pipeline to a sklearn-like object. |
fit | Fit and validate the model. |
full_train | Train the estimator on the complete dataset. |
hyperparameter_tuning | Run the hyperparameter tuning algorithm. |
inverse_transform | Inversely transform new data through the pipeline. |
save_estimator | Save the estimator to a pickle file. |
transform | Transform new data through the pipeline. |
Take bootstrapped samples from the training set and test them on the test set to get a distribution of the model's results.
Parameters | n_bootstrap: int
Number of bootstrapped samples to fit on.
reset: bool, default=False
Whether to start a new run or continue the existing one.
|
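The bootstrapping idea can be sketched with plain sklearn, under the assumption that each bootstrapped sample is drawn with replacement from the training set and scored on the untouched test set. The split, the number of samples and the f1 metric are illustrative choices, not ATOM's internals.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

n_bootstrap = 5
scores = []
for i in range(n_bootstrap):
    # Draw a sample of the training set with replacement.
    X_boot, y_boot = resample(X_train, y_train, replace=True, random_state=i)
    model = DummyClassifier(strategy="most_frequent").fit(X_boot, y_boot)
    # Every bootstrapped model is scored on the same test set.
    scores.append(f1_score(y_test, model.predict(X_test)))

mean_score = float(np.mean(scores))
```

The spread of the collected scores gives a rough idea of how sensitive the result is to the composition of the training set.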
Applies probability calibration on the model. The estimator
is trained via cross-validation on a subset of the training
data, using the rest to fit the calibrator. The new classifier
will replace the estimator attribute. If there is an active
mlflow experiment, a new run is started using the name
[model_name]_calibrate. Since the estimator changed, the
model is cleared. Only for classifiers.
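sklearn's CalibratedClassifierCV follows the scheme described above: in every fold, the estimator is fitted on one part of the training data and the calibrator on the rest. The sketch below is a standalone illustration; whether ATOM wraps this exact class is an assumption, and GaussianNB is only a stand-in classifier.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Per fold: fit the classifier on the training part, the calibrator on the rest.
calibrated = CalibratedClassifierCV(GaussianNB(), cv=3)
calibrated.fit(X_train, y_train)

# The calibrated classifier is used in place of the original estimator.
proba = calibrated.predict_proba(X_test)
```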
Reset the model attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the instance. The cleared attributes are:
- In-training validation scores
- Metric scores
- Prediction attributes
- Shap values
- App instance
- Dashboard instance
Demo your machine learning model with a friendly web interface.
This app launches directly in the notebook or on an external
browser page. The created Interface instance can be accessed
through the app attribute.
Parameters | **kwargs
Additional keyword arguments for the Interface instance
or the Interface.launch method.
|
ATOM uses the explainerdashboard package to provide a quick and easy way to analyze and explain the predictions and workings of the model. The dashboard allows you to investigate SHAP values, permutation importances, interaction effects, partial dependence plots, all kinds of performance plots, and even individual decision trees.
By default, the dashboard renders in a new tab in your default
browser, but if preferable, you can render it inside the notebook
using the mode="inline" parameter. The created
ExplainerDashboard instance can be accessed through the
dashboard attribute.
Note
Plots displayed by the dashboard are not created by ATOM and can differ from those retrieved through this package.
Parameters | dataset: str, default="test"
Data set to get the report from. Choose from: "train", "test",
"both" (train and test) or "holdout".
filename: str or None, default=None
Name to save the file with (as .html). None to not save
anything.
**kwargs
Additional keyword arguments for the ExplainerDashboard
instance.
|
This method cross-validates the whole pipeline on the complete dataset. Use it to assess the robustness of the solution's performance.
Parameters | **kwargs
Additional keyword arguments for sklearn's cross_validate
function. If the scoring method is not specified, it uses
atom's metric.
|
Returns | pd.DataFrame
Overview of the results.
|
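Since the keyword arguments are forwarded to sklearn's cross_validate function, a standalone call to that function shows the kind of result to expect. The pipeline below (a scaler followed by a DummyClassifier) and the cv and scoring values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The whole pipeline is refitted in every fold, not only the final estimator.
pipeline = make_pipeline(StandardScaler(), DummyClassifier(strategy="most_frequent"))

cv_results = cross_validate(pipeline, X, y, cv=5, scoring="f1")
mean_test_score = float(cv_results["test_score"].mean())
```

The returned dictionary holds one entry per fold for the fit time, the score time and the test score.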
If it's the last model in atom, the metric is reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. The model is not removed from any active mlflow experiment.
The returned pipeline is already fitted on the training set. Note that, if the model used automated feature scaling, the Scaler is added to the pipeline.
Info
The returned pipeline behaves similarly to sklearn's Pipeline, and additionally:
- Accepts transformers that change the target column.
- Accepts transformers that drop rows.
- Accepts transformers that are only fitted on a subset of the provided dataset.
- Always returns pandas objects.
- Uses transformers that are only applied on the training set to fit the pipeline, not to make predictions.
The estimator is fitted using the best hyperparameters found during hyperparameter tuning. Afterwards, the estimator is evaluated on the test set. Only use this method to re-fit the model after having continued the study.
Parameters | X: pd.DataFrame or None
Feature set with shape=(n_samples, n_features). If None,
self.X_train is used.
y: pd.Series or None
Target column corresponding to X. If None, self.y_train
is used.
|
In some cases it might be desirable to use all available data
to train a final model. Note that doing this means that the
estimator can no longer be evaluated on the test set. The newly
retrained estimator will replace the estimator attribute. If
there is an active mlflow experiment, a new run is started
with the name [model_name]_full_train. Since the estimator
changed, the model is cleared.
Warning
Although the model is trained on the complete dataset, the
pipeline is not. To get a fully trained pipeline, use:
pipeline = atom.export_pipeline().fit(atom.X, atom.y)
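Retraining on the complete dataset can be sketched with sklearn alone. The example assumes a DummyClassifier with the "prior" strategy, whose learned class priors make the effect of the extra data directly visible; the split parameters are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Estimator fitted on the training set only.
estimator = DummyClassifier(strategy="prior").fit(X_train, y_train)

# Unfitted copy with the same hyperparameters, refitted on all available data.
full_estimator = clone(estimator).fit(X, y)

full_priors = np.bincount(y) / len(y)
```

After this step the test set has been seen during training, so any score computed on it no longer estimates performance on unseen data.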
Search for the best combination of hyperparameters. The function to optimize is evaluated either with a K-fold cross-validation on the training set or using a random train and validation split every trial. Use this method to continue the optimization.
Parameters | n_trials: int
Number of trials for the hyperparameter tuning.
reset: bool, default=False
Whether to start a new study or continue the existing one.
|
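The effect of n_trials and reset can be illustrated with a small search loop. This sketch swaps the Optuna study for plain random search, and the run_trials helper, the STRATEGIES search space and the trial dictionaries are hypothetical names, not part of ATOM.

```python
import random

from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Hypothetical search space: strategies accepted by DummyClassifier.
STRATEGIES = ["most_frequent", "prior", "stratified", "uniform"]


def run_trials(trials, n_trials, reset=False, seed=0):
    """Add n_trials random-search trials to trials, or start over if reset."""
    if reset:
        trials = []
    rng = random.Random(seed + len(trials))
    for _ in range(n_trials):
        params = {"strategy": rng.choice(STRATEGIES)}
        model = DummyClassifier(random_state=1, **params)
        # Each trial is scored with 3-fold cross-validation on the training set.
        score = cross_val_score(model, X_train, y_train, cv=3, scoring="f1").mean()
        trials.append({"params": params, "score": float(score)})
    return trials


trials = run_trials([], n_trials=4)      # first study
trials = run_trials(trials, n_trials=3)  # continue the existing study
best_trial = max(trials, key=lambda t: t["score"])
```

Passing reset=True instead would discard the earlier trials and start from an empty list.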
Transformers that are only applied on the training set are
skipped. The rest should all implement an inverse_transform
method. If only X or only y is provided, it ignores
transformers that require the other parameter. This can be
of use to, for example, inversely transform only the target
column. If called from a model that used automated feature
scaling, the scaling is inverted as well.
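The scaling round trip can be sketched with sklearn's StandardScaler: data is scaled with the statistics learned on the training set, and inverse_transform maps it back to the original units. The toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Scaler fitted on a toy training set (assumed data).
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler = StandardScaler().fit(X_train)

# New data goes through the fitted scaler ...
X_new = np.array([[2.0, 40.0]])
X_scaled = scaler.transform(X_new)

# ... and inverse_transform undoes the scaling.
X_restored = scaler.inverse_transform(X_scaled)
```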
Parameters | filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.
|
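Saving an estimator to a pickle file and loading it again can be sketched with the standard library. The file name and the temporary directory are illustrative assumptions; the automatic naming applied by filename="auto" is not reproduced here.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.dummy import DummyClassifier

X = np.array([[0], [1], [2], [3]])
y = np.array([0, 1, 1, 1])
estimator = DummyClassifier(strategy="most_frequent").fit(X, y)

# Hypothetical file location inside a temporary directory.
filename = os.path.join(tempfile.mkdtemp(), "DummyClassifier.pkl")

with open(filename, "wb") as f:
    pickle.dump(estimator, f)

with open(filename, "rb") as f:
    loaded = pickle.load(f)

same_predictions = bool((loaded.predict(X) == estimator.predict(X)).all())
```

Pickle files should only be loaded from trusted sources, and preferably with the same library versions that were used to create them.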
Transformers that are only applied on the training set are
skipped. If only X or only y is provided, it ignores
transformers that require the other parameter. This can be
of use to, for example, transform only the target column. If
called from a model that used automated feature scaling, the
data is scaled as well.