GaussianNB
Gaussian Naive Bayes implements the Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian.
Corresponding estimators are:
- GaussianNB for classification tasks.
Read more in sklearn's documentation.
Example
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> atom = ATOMClassifier(X, y)
>>> atom.run(models="GNB", metric="f1", verbose=2)
Training ========================= >>
Models: GNB
Metric: f1
Results for GaussianNB:
Fit ---------------------------------------------
Train evaluation --> f1: 0.9555
Test evaluation --> f1: 0.965
Time elapsed: 0.009s
-------------------------------------------------
Total time: 0.009s
Final results ==================== >>
Total time: 0.010s
-------------------------------------
GaussianNB --> f1: 0.965
Attributes
Data attributes
pipeline: pd.Series
Transformers fitted on the data. Models that used automated feature scaling have the scaler added. Use this attribute only to access the individual instances. To visualize the pipeline, use the plot_pipeline method.
mapping: dict
Encoded values and their respective mapped values. The column name is the key to its mapping dictionary. Only for columns mapped to a single column (e.g. Ordinal, Leave-one-out, etc...).
dataset: dataframe
Complete data set.
train: dataframe
Training set.
test: dataframe
Test set.
X: dataframe
Feature set.
y: series | dataframe
Target column(s).
X_train: dataframe
Features of the training set.
y_train: series | dataframe
Target column(s) of the training set.
X_test: dataframe
Features of the test set.
y_test: series | dataframe
Target column(s) of the test set.
shape: tuple[int, int]
Shape of the dataset (n_rows, n_columns).
columns: series
Name of all the columns.
n_columns: int
Number of columns.
features: series
Name of the features.
n_features: int
Number of features.
target: str | list[str]
Name of the target column(s).
Utility attributes
name: str
Name of the model.
study: Study | None
Optuna study used for hyperparameter tuning.
trials: pd.DataFrame | None
Overview of the trials' results. All durations are in seconds.
best_trial: Trial | None
Trial that returned the highest score. For multi-metric runs, the best trial is the trial that performed best on the main metric.
best_params: dict
Hyperparameters used by the best trial.
score_ht: float | list[float] | None
Metric score obtained by the best trial.
time_ht: int | None
Duration of the hyperparameter tuning (in seconds).
estimator: Predictor
Estimator fitted on the training set.
score_train: float | list[float]
Metric score on the training set.
score_test: float | list[float]
Metric score on the test set.
score_holdout: float | list[float]
Metric score on the holdout set.
time_fit: int
Duration of the model fitting on the train set (in seconds).
bootstrap: pd.DataFrame | None
Overview of the bootstrapping scores. The dataframe has shape=(n_bootstrap, metric) and shows the score obtained by every bootstrapped sample for every metric.
score_bootstrap: float | list[float] | None
Mean metric score on the bootstrapped samples.
time_bootstrap: int | None
Duration of the bootstrapping (in seconds).
time: int
Total duration of the run (in seconds).
feature_importance: pd.Series | None
Normalized feature importance scores. The sum of importances for all features is 1.
results: pd.Series
Overview of the training results. All durations are in seconds.
Prediction attributes
The prediction attributes are not calculated until the attribute is called for the first time. This mechanism avoids having to calculate attributes that are never used, saving time and memory. See the snippet after the table below for an example.
predict_train: series | dataframe
Class predictions on the training set.
predict_test: series | dataframe
Class predictions on the test set.
predict_holdout: series | dataframe | None
Class predictions on the holdout set.
predict_log_proba_train: dataframe
Class log-probability predictions on the training set.
predict_log_proba_test: dataframe
Class log-probability predictions on the test set.
predict_log_proba_holdout: dataframe | None
Class log-probability predictions on the holdout set.
predict_proba_train: dataframe
Class probability predictions on the training set.
predict_proba_test: dataframe
Class probability predictions on the test set.
predict_proba_holdout: dataframe | None
Class probability predictions on the holdout set.
For every attribute, the shape of the output depends on the task.
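A minimal sketch of this lazy behavior, continuing the example above and assuming the fitted model is reachable as atom.gnb (models are exposed under their acronym):

>>> preds = atom.gnb.predict_test        # computed and cached on first access
>>> probs = atom.gnb.predict_proba_test  # also cached; the cache is freed by clear()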
Methods
The plots and prediction methods can be called directly from the model. The remaining utility methods can be found hereunder.
bootstrapping | Apply a bootstrap algorithm. |
calibrate | Calibrate the model. |
clear | Reset attributes and clear cache from the model. |
create_app | Create an interactive app to test model predictions. |
create_dashboard | Create an interactive dashboard to analyze the model. |
cross_validate | Evaluate the model using cross-validation. |
evaluate | Get the model's scores for the provided metrics. |
export_pipeline | Export the model's pipeline to a sklearn-like object. |
fit | Fit and validate the model. |
full_train | Train the estimator on the complete dataset. |
get_best_threshold | Get the threshold that maximizes the ROC curve. |
hyperparameter_tuning | Run the hyperparameter tuning algorithm. |
inverse_transform | Inversely transform new data through the pipeline. |
save_estimator | Save the estimator to a pickle file. |
serve | Serve the model as a REST API endpoint for inference. |
register | Register the model in mlflow's model registry. |
transform | Transform new data through the pipeline. |
bootstrapping
Take bootstrapped samples from the training set and test them on the test set to get a distribution of the model's results.
Parameters
n_bootstrap: int
Number of bootstrapped samples to fit on.
reset: bool, default=False
Whether to start a new run or continue the existing one.
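A minimal usage sketch, continuing the example above (the n_bootstrap value is illustrative):

>>> atom.gnb.bootstrapping(n_bootstrap=5)  # fit on 5 bootstrapped samples
>>> atom.gnb.bootstrap                     # one row of scores per sample
>>> atom.gnb.score_bootstrap               # mean f1 over the samples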
calibrate
Applies probability calibration on the model. The estimator is trained via cross-validation on a subset of the training data, using the rest to fit the calibrator. The new classifier will replace the estimator attribute. If there is an active mlflow experiment, a new run is started using the name [model_name]_calibrate. Since the estimator changed, the model is cleared. Only for classifiers.
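A sketch; it assumes keyword arguments are forwarded to the underlying sklearn calibrator (method and cv below are such assumed pass-through options):

>>> atom.gnb.calibrate(method="isotonic", cv=5)  # assumed calibrator kwargs
>>> atom.gnb.estimator                           # now the calibrated classifier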
clear
Reset certain model attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the instance. The affected attributes are:
- In-training validation scores
- Shap values
- App instance
- Dashboard instance
- Cached prediction attributes
- Cached metric scores
- Cached holdout data sets
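For example (a minimal sketch continuing the example above):

>>> atom.gnb.clear()  # drop cached predictions, shap values, and app/dashboard instances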
create_app
Demo your machine learning model with a friendly web interface. This app launches directly in the notebook or on an external browser page. The created Interface instance can be accessed through the app attribute.
Parameters
**kwargs
Additional keyword arguments for the Interface instance or the Interface.launch method.
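For example (a minimal sketch, launching with default settings):

>>> atom.gnb.create_app()  # launches the interface in the notebook or browser
>>> atom.gnb.app           # the created Interface instance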
create_dashboard
ATOM uses the explainerdashboard package to provide a quick and easy way to analyze and explain the predictions and workings of the model. The dashboard allows you to investigate SHAP values, permutation importances, interaction effects, partial dependence plots, all kinds of performance plots, and even individual decision trees.
By default, the dashboard renders in a new tab in your default browser, but if preferable, you can render it inside the notebook using the mode="inline" parameter. The created ExplainerDashboard instance can be accessed through the dashboard attribute. This method is not available for multioutput tasks.
Note
Plots displayed by the dashboard are not created by ATOM and can differ from those retrieved through this package.
Parameters
dataset: str, default="test"
Data set to get the report from. Choose from: "train", "test", "both" (train and test) or "holdout".
filename: str or None, default=None
Name to save the file with (as .html). None to not save anything.
**kwargs
Additional keyword arguments for the ExplainerDashboard instance.
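For example (a sketch; the filename is illustrative, and mode="inline" is the pass-through option mentioned above):

>>> atom.gnb.create_dashboard(dataset="both", filename="gnb_report", mode="inline")
>>> atom.gnb.dashboard  # the created ExplainerDashboard instance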
cross_validate
This method cross-validates the whole pipeline on the complete dataset. Use it to assess the robustness of the solution's performance.
Parameters
**kwargs
Additional keyword arguments for sklearn's cross_validate function. If the scoring method is not specified, it uses atom's metric.
Returns
pd.DataFrame
Overview of the results.
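Since the kwargs go straight to sklearn's cross_validate, its options apply. For example:

>>> atom.gnb.cross_validate(cv=5)  # 5-fold CV, scored with atom's f1 metric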
evaluate
Tip
Use the get_best_threshold or plot_threshold method to determine a suitable value for the threshold parameter.
Parameters
metric: str, func, scorer, sequence or None, default=None
Metrics to calculate. If None, a selection of the most common metrics per task is used.
dataset: str, default="test"
Data set on which to calculate the metric. Choose from: "train", "test" or "holdout".
threshold: float or sequence, default=0.5
Threshold between 0 and 1 to convert predicted probabilities to class labels. For multilabel classification tasks, it's possible to provide a sequence of thresholds (one per target column, as returned by the get_best_threshold method). If float, the same threshold is applied to all target columns.
sample_weight: sequence or None, default=None
Sample weights corresponding to y in dataset.
Returns
pd.Series
Scores of the model.
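For example (a minimal sketch continuing the example above):

>>> atom.gnb.evaluate()  # most common metrics for the task, on the test set
>>> atom.gnb.evaluate(metric="f1", dataset="train", threshold=0.4)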
export_pipeline
The returned pipeline is already fitted on the training set. Note that, if the model used automated feature scaling, the Scaler is added to the pipeline.
Info
The returned pipeline behaves similarly to sklearn's Pipeline, and additionally:
- Accepts transformers that change the target column.
- Accepts transformers that drop rows.
- Accepts transformers that are only fitted on a subset of the provided dataset.
- Always returns pandas objects.
- Uses transformers that are only applied on the training set to fit the pipeline, not to make predictions.
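For example (a minimal sketch; X is the feature set from the example above):

>>> pl = atom.gnb.export_pipeline()
>>> pl.predict(X)  # sklearn-like: transforms the raw data, then predicts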
fit
The estimator is fitted using the best hyperparameters found during hyperparameter tuning. Afterwards, the estimator is evaluated on the test set. Only use this method to re-fit the model after having continued the study.
Parameters
X: dataframe or None
Feature set with shape=(n_samples, n_features). If None, self.X_train is used.
y: series or None
Target column corresponding to X. If None, self.y_train is used.
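For example (a minimal sketch; with no arguments the stored training set is reused):

>>> atom.gnb.fit()  # re-fit on self.X_train and self.y_train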
full_train
In some cases it might be desirable to use all available data to train a final model. Note that doing this means that the estimator can no longer be evaluated on the test set. The newly retrained estimator will replace the estimator attribute. If there is an active mlflow experiment, a new run is started with the name [model_name]_full_train. Since the estimator changed, the model is cleared.
Warning
Although the model is trained on the complete dataset, the pipeline is not. To get a fully trained pipeline, use: pipeline = atom.export_pipeline().fit(atom.X, atom.y).
get_best_threshold
Get the threshold that maximizes the ROC curve. Only available for models with a predict_proba method in a binary or multilabel classification task.
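For example, combined with evaluate (a minimal sketch):

>>> t = atom.gnb.get_best_threshold()
>>> atom.gnb.evaluate(threshold=t)  # score the test set at the optimized threshold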
hyperparameter_tuning
Search for the best combination of hyperparameters. The function to optimize is evaluated either with a K-fold cross-validation on the training set or using a random train and validation split every trial. Use this method to continue the optimization.
Parameters
n_trials: int
Number of trials for the hyperparameter tuning.
reset: bool, default=False
Whether to start a new study or continue the existing one.
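For example (a sketch; it assumes an earlier run performed tuning, e.g. atom.run("GNB", n_trials=10), so there is a study to continue):

>>> atom.gnb.hyperparameter_tuning(n_trials=10)  # add 10 trials to the study
>>> atom.gnb.trials                              # overview of all trials so far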
inverse_transform
Transformers that are only applied on the training set are skipped. The rest should all implement an inverse_transform method. If only X or only y is provided, it ignores transformers that require the other parameter. This can be of use to, for example, inversely transform only the target column. If called from a model that used automated feature scaling, the scaling is inverted as well.
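For example, to map predictions back to the original target labels (a sketch; it assumes the method accepts X and y keyword arguments, matching the description above):

>>> atom.gnb.inverse_transform(y=atom.gnb.predict_test)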
save_estimator
Parameters
filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.
serve
The complete pipeline is served with the model. The inference data must be supplied as json to the HTTP request, e.g. requests.get("http://127.0.0.1:8000/", json=X.to_json()). The deployment is done on a ray cluster. The default host and port parameters deploy to localhost.
Tip
Use import ray; ray.serve.shutdown() to close the endpoint after finishing.
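Putting the pieces from the text together (a sketch; it assumes the default localhost deployment):

>>> import requests
>>> atom.gnb.serve(port=8000)                                 # deploy on the ray cluster
>>> requests.get("http://127.0.0.1:8000/", json=X.to_json())  # inference request
>>> import ray; ray.serve.shutdown()                          # close the endpoint when done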
register
This method is only available when model tracking is enabled using one of the following URI schemes: databricks, http, https, postgresql, mysql, sqlite, mssql.
Parameters
name: str or None, default=None
Name for the registered model. If None, the model's full name is used.
stage: str, default="Staging"
New desired stage for the model.
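For example (a minimal sketch; it requires an active mlflow experiment with a supported tracking URI):

>>> atom.gnb.register(stage="Production")  # registered under the model's full name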
transform
Transformers that are only applied on the training set are skipped. If only X or only y is provided, it ignores transformers that require the other parameter. This can be of use to, for example, transform only the target column. If called from a model that used automated feature scaling, the data is scaled as well.
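For example, to push raw data through the fitted transformers (a sketch; X stands in for unseen data with the same schema as the example above):

>>> atom.gnb.transform(X)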