Complement Naive Bayes (CNB)

The Complement Naive Bayes classifier was designed to correct the “severe assumptions” made by the standard Multinomial Naive Bayes classifier. It is particularly suited for imbalanced data sets.

Corresponding estimators are:

ComplementNB for classification tasks.
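As a minimal sketch of the underlying estimator, the snippet below fits scikit-learn's ComplementNB directly on a small imbalanced dataset. The data is synthetic and purely illustrative (the Poisson rates and class sizes are assumptions); the features are non-negative counts, which the estimator requires.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB

rng = np.random.default_rng(0)

# Hypothetical imbalanced count data: 90 samples of class 0, 10 of class 1.
# Features must be non-negative (e.g. word counts).
X_major = rng.poisson(lam=[5, 1, 1], size=(90, 3))
X_minor = rng.poisson(lam=[1, 1, 5], size=(10, 3))
X = np.vstack([X_major, X_minor])
y = np.array([0] * 90 + [1] * 10)

# Explicitly pass the estimator's default hyperparameters
clf = ComplementNB(alpha=1.0, fit_prior=True, norm=False)
clf.fit(X, y)

pred = clf.predict(X)
print(pred[:5], pred[-5:])
```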

Hyperparameters

By default, the estimator adopts the default parameters provided by its package. See the user guide on how to customize them.

Dimensions:

alpha: float, default=1.0
Real(0.01, 10, "log-uniform", name="alpha")

fit_prior: bool, default=True
Categorical([True, False], name="fit_prior")

norm: bool, default=False
Categorical([True, False], name="norm")
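The "log-uniform" prior on alpha means that candidate values are spread evenly across orders of magnitude instead of evenly across the raw interval. The sketch below imitates that sampling with the standard library only; `sample_log_uniform` is a hypothetical helper, not part of ATOM or of the optimizer it uses.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
alphas = [sample_log_uniform(0.01, 10, rng) for _ in range(1000)]

# Each decade ([0.01, 0.1), [0.1, 1), [1, 10]) receives roughly a third of the draws
share_first_decade = sum(a < 0.1 for a in alphas) / len(alphas)
print(round(share_first_decade, 2))
```

With a plain uniform prior, fewer than 1% of the draws would land below 0.1, so small smoothing values would rarely be tried.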

Attributes

Data attributes

Attributes:

dataset: pd.DataFrame
Complete dataset in the pipeline.

train: pd.DataFrame
Training set.

test: pd.DataFrame
Test set.

X: pd.DataFrame
Feature set.

y: pd.Series
Target column.

X_train: pd.DataFrame
Training features.

y_train: pd.Series
Training target.

X_test: pd.DataFrame
Test features.

y_test: pd.Series
Test target.

shape: tuple
Dataset's shape: (n_rows x n_columns) or (n_rows, (shape_sample), n_cols) for datasets with more than two dimensions.

columns: list
Names of the columns in the dataset.

n_columns: int
Number of columns in the dataset.

features: list
Names of the features in the dataset.

n_features: int
Number of features in the dataset.

target: str
Name of the target column.

Utility attributes

Attributes:

bo: pd.DataFrame
Information on every step taken by the Bayesian optimization (BO). Columns include:

call: Name of the call.
params: Parameters used in the model.
estimator: Estimator used for this iteration (fitted on last cross-validation).
score: Score of the chosen metric. List of scores for multi-metric.
time: Time spent on this iteration.
total_time: Total time spent since the start of the BO.

best_call: str
Name of the best call in the BO.

best_params: dict
Dictionary of the best combination of hyperparameters found by the BO.

estimator: class
Estimator instance with the best combination of hyperparameters fitted on the complete training set.

time_bo: str
Time it took to run the Bayesian optimization algorithm.

metric_bo: float or list
Best metric score(s) on the BO.

time_fit: str
Time it took to train the model on the complete training set and calculate the metric(s) on the test set.

metric_train: float or list
Metric score(s) on the training set.

metric_test: float or list
Metric score(s) on the test set.

metric_bootstrap: np.array
Bootstrap results with shape=(n_bootstrap,) for single-metric runs and shape=(metric, n_bootstrap) for multi-metric runs.

mean_bootstrap: float or list
Mean of the bootstrap results. List of values for multi-metric runs.

std_bootstrap: float or list
Standard deviation of the bootstrap results. List of values for multi-metric runs.
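The three bootstrap attributes above can be pictured with the following sketch. It assumes that each bootstrap round refits the estimator on a sample of the training set drawn with replacement and scores it on the test set; the data, the metric (F1) and `n_bootstrap = 5` are illustrative choices, not values taken from ATOM.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.naive_bayes import ComplementNB
from sklearn.utils import resample

rng = np.random.default_rng(1)

# Hypothetical non-negative count data with a binary target
X_train = rng.poisson(3, size=(200, 4))
y_train = (X_train[:, 0] > X_train[:, 1]).astype(int)
X_test = rng.poisson(3, size=(50, 4))
y_test = (X_test[:, 0] > X_test[:, 1]).astype(int)

n_bootstrap = 5
scores = []
for i in range(n_bootstrap):
    # Draw a sample of the training set with replacement
    X_bs, y_bs = resample(X_train, y_train, replace=True, random_state=i)
    model = ComplementNB().fit(X_bs, y_bs)
    scores.append(f1_score(y_test, model.predict(X_test)))

metric_bootstrap = np.array(scores)  # shape=(n_bootstrap,) for a single metric
mean_bootstrap = metric_bootstrap.mean()
std_bootstrap = metric_bootstrap.std()
```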

results: pd.Series
Training results. Entries include:

metric_bo: Best score achieved during the BO.
time_bo: Time spent on the BO.
metric_train: Metric score on the training set.
metric_test: Metric score on the test set.
time_fit: Time spent fitting and evaluating.
mean_bootstrap: Mean score of the bootstrap results.
std_bootstrap: Standard deviation of the bootstrap results.
time_bootstrap: Time spent on the bootstrap algorithm.
time: Total time spent on the whole run.

Prediction attributes

Prediction attributes are computed lazily: each one is calculated the first time it is accessed. This avoids computing attributes that are never used, saving time and memory.
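This compute-on-first-access behaviour can be illustrated with `functools.cached_property`. The `LazyModel` class below is a hypothetical stand-in written for this sketch; it is not how ATOM implements its prediction attributes.

```python
from functools import cached_property


class LazyModel:
    """Hypothetical stand-in for a model; not ATOM's actual implementation."""

    def __init__(self, predict_fn, X_test):
        self._predict_fn = predict_fn
        self._X_test = X_test
        self.n_calls = 0  # counts how often predictions are computed

    @cached_property
    def predict_test(self):
        # Runs only on first access; the result is then stored on the instance
        self.n_calls += 1
        return [self._predict_fn(x) for x in self._X_test]

    def reset_predictions(self):
        # Drop the cached value so the next access recomputes it
        self.__dict__.pop("predict_test", None)


model = LazyModel(lambda x: int(x > 0), [-1, 2, 3])
first = model.predict_test   # computed here
second = model.predict_test  # served from the cache
print(model.n_calls)
```

A reset method like the one sketched here mirrors the idea that prediction attributes have to be recalculated once the underlying estimator changes.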

Prediction attributes:

predict_train: np.array
Predictions of the model on the training set.

predict_test: np.array
Predictions of the model on the test set.

predict_proba_train: np.array
Predicted probabilities of the model on the training set.

predict_proba_test: np.array
Predicted probabilities of the model on the test set.

predict_log_proba_train: np.array
Predicted log probabilities of the model on the training set.

predict_log_proba_test: np.array
Predicted log probabilities of the model on the test set.

score_train: np.float64
Model's score on the training set.

score_test: np.float64
Model's score on the test set.

Methods

Most plots and prediction methods can be called directly from the model, e.g. atom.cnb.plot_permutation_importance() or atom.cnb.predict(X). The remaining utility methods are listed below.

calibrate	Calibrate the model.
clear	Clear attributes from the model.
cross_validate	Evaluate the model using cross-validation.
delete	Delete the model from the trainer.
dashboard	Create an interactive dashboard to analyze the model.
evaluate	Get the model's scores for the provided metrics.
export_pipeline	Export the model's pipeline to a sklearn-like Pipeline object.
full_train	Train the estimator on the complete dataset.
rename	Change the model's tag.
save_estimator	Save the estimator to a pickle file.
transform	Transform new data through the model's branch.

method calibrate(**kwargs)

Applies probability calibration on the estimator. The estimator is trained via cross-validation on a subset of the training data, using the rest to fit the calibrator. The new classifier will replace the estimator attribute and is logged to any active mlflow experiment. Since the estimator changed, all the model's prediction attributes are reset.

Parameters:

**kwargs
Additional keyword arguments for sklearn's CalibratedClassifierCV. Using cv="prefit" will use the trained model and fit the calibrator on the test set. Use this only if you have another, independent set for testing.
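For orientation, the sketch below applies scikit-learn's CalibratedClassifierCV to a ComplementNB estimator outside of ATOM. The data is synthetic and `cv=3` is an arbitrary choice; the estimator is passed positionally because the keyword name differs between scikit-learn versions.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import ComplementNB

rng = np.random.default_rng(2)

# Hypothetical non-negative count data with a binary target
X = rng.poisson(3, size=(120, 4))
y = (X[:, 0] > X[:, 1]).astype(int)

# With cv=3, each of the three fits trains ComplementNB on two folds
# and fits the calibrator on the remaining fold
calibrated = CalibratedClassifierCV(ComplementNB(), cv=3)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X)
print(proba[:3])
```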

method clear()

Reset attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the class. The cleared attributes per model are:

method cross_validate(**kwargs)

Evaluate the model using cross-validation. This method cross-validates the whole pipeline on the complete dataset. Use it to assess the robustness of the solution's performance.

Parameters:

**kwargs
Additional keyword arguments for sklearn's cross_validate function. If the scoring method is not specified, it uses the trainer's metric.

Returns:

scores: dict
Return of sklearn's cross_validate function.
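To show what the returned dictionary looks like, the sketch below calls sklearn's cross_validate function on a bare ComplementNB estimator. This is a simplification: the method described here cross-validates the whole pipeline, and the data, `cv=5` and the F1 scorer are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import ComplementNB

rng = np.random.default_rng(3)

# Hypothetical non-negative count data with a binary target
X = rng.poisson(3, size=(100, 4))
y = (X[:, 0] > X[:, 1]).astype(int)

scores = cross_validate(ComplementNB(), X, y, cv=5, scoring="f1")

# One entry per fold for every key in the dictionary
for key, values in scores.items():
    print(key, values.round(3))
```

With a single scorer and default settings, the keys are fit_time, score_time and test_score.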

method delete()

Delete the model from the trainer. If it's the last model in the trainer, the metric is reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. The model is not removed from any active mlflow experiment.

method dashboard(dataset="test", filename=None, **kwargs)

Create an interactive dashboard to analyze the model. The dashboard allows you to investigate SHAP values, permutation importances, interaction effects, partial dependence plots, all kinds of performance plots, and even individual decision trees. By default, the dashboard opens in an external dash app.

Parameters:

dataset: str, optional (default="test")
Data set to get the report from. Choose from: "train", "test", "both" (train and test) or "holdout".

filename: str or None, optional (default=None)
Name to save the file with (as .html). None to not save anything.

**kwargs
Additional keyword arguments for the ExplainerDashboard instance.

Returns:

dashboard: ExplainerDashboard
Created dashboard object.

method evaluate(metric=None, dataset="test", threshold=0.5)

Get the model's scores for the provided metrics.

Parameters:

metric: str, func, scorer, sequence or None, optional (default=None)
Metrics to calculate. If None, a selection of the most common metrics per task is used.

dataset: str, optional (default="test")
Data set on which to calculate the metric. Choose from: "train", "test" or "holdout".

threshold: float, optional (default=0.5)
Threshold between 0 and 1 to convert predicted probabilities to class labels. Only used when:

The task is binary classification.
The model has a predict_proba method.
The metric evaluates predicted target values.

Returns:

score: pd.Series
Scores of the model.
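The effect of the threshold parameter can be sketched as follows. `to_labels` is a hypothetical helper, and treating a probability equal to the threshold as positive is an assumption of this sketch, not documented behaviour.

```python
def to_labels(proba_positive, threshold=0.5):
    """Label a sample as positive (1) when its probability reaches the threshold."""
    return [int(p >= threshold) for p in proba_positive]


# Hypothetical predicted probabilities of the positive class for five samples
proba_positive = [0.10, 0.40, 0.50, 0.70, 0.95]

default_labels = to_labels(proba_positive)
strict_labels = to_labels(proba_positive, threshold=0.8)
print(default_labels, strict_labels)
```

Raising the threshold trades recall for precision, which only matters for metrics computed on predicted labels.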

method export_pipeline(memory=None, verbose=None)

Export the model's pipeline to a sklearn-like Pipeline object. If the model used automated feature scaling, the scaler is added to the pipeline. The returned pipeline is already fitted on the training set.

Info

ATOM's Pipeline class behaves the same as a sklearn Pipeline, and additionally:

Accepts transformers that change the target column.
Accepts transformers that drop rows.
Accepts transformers that are only fitted on a subset of the provided dataset.
Always outputs pandas objects.
Uses transformers that are only applied on the training set (see the balance or prune methods) to fit the pipeline, not to make predictions on unseen data.

Parameters:

memory: bool, str, Memory or None, optional (default=None)
Used to cache the fitted transformers of the pipeline.

If None or False: No caching is performed.
If True: A default temp directory is used.
If str: Path to the caching directory.
If Memory: Object with the joblib.Memory interface.

verbose: int or None, optional (default=None)
Verbosity level of the transformers in the pipeline. If None, it leaves them to their original verbosity. Note that this is not the pipeline's own verbose parameter. To change that, use the set_params method.

Returns:

Pipeline
Current branch as a sklearn-like Pipeline object.
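As a rough picture of what an exported pipeline with a scaler looks like, the sketch below builds a plain scikit-learn Pipeline with a cache directory. It does not use ATOM's Pipeline class; the step names, the MinMaxScaler and the temporary cache directory are assumptions made for illustration.

```python
import tempfile

import numpy as np
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(4)

# Hypothetical non-negative count data with a binary target
X = rng.poisson(3, size=(80, 4)).astype(float)
y = (X[:, 0] > X[:, 1]).astype(int)

with tempfile.TemporaryDirectory() as cache_dir:
    # memory=<path> caches the fitted transformers in that directory
    pipe = Pipeline(
        steps=[("scaler", MinMaxScaler()), ("cnb", ComplementNB())],
        memory=cache_dir,
    )
    pipe.fit(X, y)
    pred = pipe.predict(X)

print(pred[:10])
```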

method rename(name=None)

Change the model's tag. The acronym always stays at the beginning of the model's name. If the model is being tracked by mlflow, the name of the corresponding run is also changed.

Parameters:

name: str or None, optional (default=None)
New tag for the model. If None, the tag is removed.

method evaluate(metric=None, dataset="test")

Get the model's score for the provided metrics.

Parameters:

metric: str or None, optional (default=None)
Name of the metric to calculate. If None, returns the model's final results (ignoring the dataset parameter). Choose from any of sklearn's classification SCORERS or one of the following custom metrics:

"cm" for the confusion matrix.
"tn" for true negatives.
"fp" for false positives.
"fn" for false negatives.
"tp" for true positives.
"fpr" for the false positive rate.
"tpr" for the true positive rate.
"fnr" for the false negative rate.
"tnr" for the true negative rate.
"sup" for the support metric.
"lift" for the lift metric.

dataset: str, optional (default="test")
Data set on which to calculate the metric. Choose from: "train", "test" or "holdout".

Returns:

score: float or np.array
Model's score for the selected metric.
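The confusion-matrix counts and the four rate metrics in the list of custom metrics follow standard definitions, sketched below for a binary target. `confusion_counts` is a hypothetical helper and the labels are made up for the example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    return tn, fp, fn, tp


y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)

fpr = fp / (fp + tn)  # false positive rate
tpr = tp / (tp + fn)  # true positive rate (recall)
fnr = fn / (fn + tp)  # false negative rate
tnr = tn / (tn + fp)  # true negative rate (specificity)
print(tn, fp, fn, tp, fpr, tpr, fnr, tnr)
```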

method save_estimator(filename="auto")

Save the estimator to a pickle file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.
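A saved estimator can be read back with Python's pickle module. The round trip below is a minimal sketch on a standalone ComplementNB; the file name and the temporary directory are hypothetical and do not reflect the name chosen by filename="auto".

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import ComplementNB

# Tiny illustrative dataset with non-negative counts
X = np.array([[2, 0, 1], [0, 3, 1], [3, 1, 0], [0, 2, 2]])
y = np.array([0, 1, 0, 1])
estimator = ComplementNB().fit(X, y)

# Hypothetical file location used only for this example
path = os.path.join(tempfile.mkdtemp(), "ComplementNB.pkl")
with open(path, "wb") as f:
    pickle.dump(estimator, f)

# Load the estimator again, e.g. in another session
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict(X))
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.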

method transform(X, y=None, verbose=None)

Transform new data through the model's branch. Transformers that are only applied on the training set are skipped. If the model used feature scaling, the data is also scaled.

Parameters:

X: dataframe-like
Features to transform, with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)

If None: y is ignored in the transformers.
If int: Position of the target column in X.
If str: Name of the target column in X.
Else: Target column with shape=(n_samples,).

verbose: int or None, optional (default=None)
Verbosity level of the output. If None, it uses the transformer's own verbosity.

Returns:

pd.DataFrame
Transformed feature set.

pd.Series
Transformed target column. Only returned if provided.

Example

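from atom import ATOMClassifier
from sklearn.datasets import load_breast_cancer

# Example data with non-negative features, as required by ComplementNB
X, y = load_breast_cancer(return_X_y=True)

atom = ATOMClassifier(X, y)
atom.run(models="CNB")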