
Complement Naive Bayes (CNB)


The Complement Naive Bayes classifier was designed to correct the “severe assumptions” made by the standard Multinomial Naive Bayes classifier. It estimates each class's weights from the samples of all other classes (the complement), which makes it particularly suited for imbalanced data sets.

Corresponding estimators are:

  • ComplementNB for classification tasks.

Read more in sklearn's documentation.



Hyperparameters

  • By default, the estimator adopts the default parameters provided by its package. See the user guide on how to customize them.
Dimensions:

alpha: float, default=1.0
Real(1e-3, 10, "log-uniform", name="alpha")

fit_prior: bool, default=True
Categorical([True, False], name="fit_prior")

norm: bool, default=False
Categorical([True, False], name="norm")
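
As a rough illustration of how one point from the search space above maps onto the estimator, the sketch below draws alpha log-uniformly from [1e-3, 10] with NumPy, picks the two booleans at random, and passes the result to sklearn's ComplementNB. The sample_params helper and the toy data are assumptions made for this example; the optimizer itself works with the dimension objects listed above, not with this code.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB

rng = np.random.default_rng(0)


def sample_params(rng):
    """Draw one hypothetical point from the CNB search space."""
    # Log-uniform on [1e-3, 10]: sample the exponent uniformly, then exponentiate.
    alpha = float(10 ** rng.uniform(-3, 1))
    return {
        "alpha": alpha,
        "fit_prior": bool(rng.integers(0, 2)),
        "norm": bool(rng.integers(0, 2)),
    }


# Toy count data (CNB expects non-negative features) with a 9:1 class imbalance.
X = rng.poisson(lam=3.0, size=(200, 5))
y = np.array([0] * 180 + [1] * 20)

params = sample_params(rng)
model = ComplementNB(**params).fit(X, y)
print(params, model.score(X, y))
```

Sampling the exponent rather than alpha itself gives small smoothing values the same weight as large ones, which is the point of a log-uniform prior.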



Attributes

Data attributes

Attributes:

dataset: pd.DataFrame
Complete dataset in the pipeline.

train: pd.DataFrame
Training set.

test: pd.DataFrame
Test set.

X: pd.DataFrame
Feature set.

y: pd.Series
Target column.

X_train: pd.DataFrame
Training features.

y_train: pd.Series
Training target.

X_test: pd.DataFrame
Test features.

y_test: pd.Series
Test target.

shape: tuple
Dataset's shape: (n_rows x n_columns) or (n_rows, (shape_sample), n_cols) for datasets with more than two dimensions.

columns: list
Names of the columns in the dataset.

n_columns: int
Number of columns in the dataset.

features: list
Names of the features in the dataset.

n_features: int
Number of features in the dataset.

target: str
Name of the target column.


Utility attributes

Attributes:

bo: pd.DataFrame
Information on every step taken by the Bayesian optimization (BO). Columns include:
  • params: Parameters used in the model.
  • estimator: Estimator used for this iteration (fitted on last cross-validation).
  • score: Score of the chosen metric. List of scores for multi-metric.
  • time_iteration: Time spent on this iteration.
  • time: Total time spent since the start of the BO.

best_params: dict
Dictionary of the best combination of hyperparameters found by the BO.

estimator: class
Estimator instance with the best combination of hyperparameters fitted on the complete training set.

time_bo: str
Time it took to run the Bayesian optimization algorithm.

metric_bo: float or list
Best metric score(s) on the BO.

time_fit: str
Time it took to train the model on the complete training set and calculate the metric(s) on the test set.

metric_train: float or list
Metric score(s) on the training set.

metric_test: float or list
Metric score(s) on the test set.

metric_bootstrap: list
Bootstrap results with shape=(n_bootstrap,) for single-metric runs and shape=(metric, n_bootstrap) for multi-metric runs.

mean_bootstrap: float or list
Mean of the bootstrap results. List of values for multi-metric runs.

std_bootstrap: float or list
Standard deviation of the bootstrap results. List of values for multi-metric runs.

results: pd.Series
Training results. Entries include:
  • metric_bo: Best score achieved during the BO.
  • time_bo: Time spent on the BO.
  • metric_train: Metric score on the training set.
  • metric_test: Metric score on the test set.
  • time_fit: Time spent fitting and evaluating.
  • mean_bootstrap: Mean score of the bootstrap results.
  • std_bootstrap: Standard deviation of the bootstrap results.
  • time_bootstrap: Time spent on the bootstrap algorithm.
  • time: Total time spent on the whole run.
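
To make the bootstrap attributes more concrete, here is a minimal sketch. It assumes that the bootstrap refits the estimator on samples drawn with replacement from the training set and scores every fit on the test set. That procedure, the synthetic data, the f1 metric, and the value of n_bootstrap are all assumptions for illustration, not the trainer's actual code.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.utils import resample

rng = np.random.default_rng(1)
X = rng.poisson(lam=3.0, size=(300, 6))
y = (X[:, 0] + rng.poisson(lam=1.0, size=300) > 4).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

n_bootstrap = 5  # assumed number of bootstrapped samples
scores = []
for i in range(n_bootstrap):
    # Sample the training set with replacement (same size as the original).
    X_boot, y_boot = resample(X_train, y_train, random_state=i)
    model = ComplementNB().fit(X_boot, y_boot)
    scores.append(f1_score(y_test, model.predict(X_test)))

metric_bootstrap = np.array(scores)     # shape=(n_bootstrap,)
mean_bootstrap = metric_bootstrap.mean()
std_bootstrap = metric_bootstrap.std()
print(metric_bootstrap, mean_bootstrap, std_bootstrap)
```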


Prediction attributes

The prediction attributes are not calculated until the attribute is called for the first time. This mechanism avoids having to calculate attributes that are never used, saving time and memory.

Prediction attributes:

predict_train: np.ndarray
Predictions of the model on the training set.

predict_test: np.ndarray
Predictions of the model on the test set.

predict_proba_train: np.ndarray
Predicted probabilities of the model on the training set.

predict_proba_test: np.ndarray
Predicted probabilities of the model on the test set.

predict_log_proba_train: np.ndarray
Predicted log probabilities of the model on the training set.

predict_log_proba_test: np.ndarray
Predicted log probabilities of the model on the test set.

score_train: np.float64
Model's score on the training set.

score_test: np.float64
Model's score on the test set.
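
The compute-on-first-access behaviour described above can be sketched with functools.cached_property. The LazyPredictions class, its constructor arguments, and the n_calls counter are hypothetical names for illustration; this is not the library's own implementation.

```python
from functools import cached_property


class LazyPredictions:
    """Hypothetical holder that computes predictions on first access only."""

    def __init__(self, predict, X_test):
        self._predict = predict
        self._X_test = X_test
        self.n_calls = 0  # counts how often the prediction is actually computed

    @cached_property
    def predict_test(self):
        self.n_calls += 1
        return self._predict(self._X_test)

    def reset_predictions(self):
        # Drop the cached value so the next access recomputes it.
        self.__dict__.pop("predict_test", None)


holder = LazyPredictions(lambda X: [x > 0 for x in X], [-1, 2, 3])
first = holder.predict_test   # computed here
second = holder.predict_test  # served from the cache, no recomputation
```

Until predict_test is read, nothing is computed or stored, which is where the time and memory savings come from.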



Methods

The majority of the plots and prediction methods can be called directly from the model, e.g. atom.cnb.plot_permutation_importance() or atom.cnb.predict(X). The remaining utility methods are described below.

calibrate Calibrate the model.
cross_validate Evaluate the model using cross-validation.
delete Delete the model from the trainer.
export_pipeline Export the model's pipeline to a sklearn-like Pipeline object.
full_train Get the estimator trained on the complete dataset.
rename Change the model's tag.
reset_predictions Clear all the prediction attributes.
scoring Get the score for a specific metric.
save_estimator Save the estimator to a pickle file.


method calibrate(**kwargs) [source]

Applies probability calibration on the estimator. The estimator is trained via cross-validation on a subset of the training data, using the rest to fit the calibrator. The new classifier will replace the estimator attribute and is logged to any active mlflow experiment. Since the estimator changed, all the model's prediction attributes are reset.

Parameters: **kwargs
Additional keyword arguments for sklearn's CalibratedClassifierCV. Using cv="prefit" will use the trained model and fit the calibrator on the test set. Use this only if you have another, independent set for testing.
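
For orientation, the sketch below shows the underlying sklearn call on its own, outside the trainer. The toy data and the choice of method="sigmoid" with cv=3 are assumptions for the example; how the method forwards its keyword arguments is not shown here.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import ComplementNB

rng = np.random.default_rng(2)
X_train = rng.poisson(lam=3.0, size=(200, 5))
y_train = (X_train[:, 0] > 2).astype(int)

# cv=3: each fold trains a ComplementNB clone on two thirds of the data
# and fits the sigmoid calibrator on the remaining third.
calibrated = CalibratedClassifierCV(ComplementNB(), method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_train[:5])
print(proba)
```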


method cross_validate(**kwargs) [source]

Evaluate the model using cross-validation. This method cross-validates the whole pipeline on the complete dataset. Use it to assess the robustness of the solution's performance.

Parameters: **kwargs
Additional keyword arguments for sklearn's cross_validate function. If the scoring method is not specified, it uses the trainer's metric.
Returns: scores: dict
Return of sklearn's cross_validate function.
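
The returned dictionary has the same layout as in plain sklearn. A minimal sketch, assuming a two-step pipeline (a scaler followed by ComplementNB) and synthetic data that are not part of the original page:

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(3)
X = rng.poisson(lam=3.0, size=(200, 5)).astype(float)
y = (X[:, 0] > 2).astype(int)

# Cross-validating the pipeline refits the scaler inside every fold,
# so no information from the validation fold leaks into the transformer.
pipeline = make_pipeline(MinMaxScaler(), ComplementNB())
scores = cross_validate(pipeline, X, y, cv=5, scoring="f1")

print(sorted(scores))        # keys of the returned dict
print(scores["test_score"])  # one score per fold
```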


method delete() [source]

Delete the model from the trainer. If it's the winning model, the next best model (through metric_test or mean_bootstrap) is selected as winner. If it's the last model in the trainer, the metric and training approach are reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. The model is not removed from any active mlflow experiment.


method export_pipeline(pipeline=None, verbose=None) [source]

Export the model's pipeline to a sklearn-like object. If the model used feature scaling, the Scaler is added before the model. The returned pipeline is already fitted on the training set.

Note

ATOM's Pipeline class behaves exactly the same as a sklearn Pipeline, and additionally, it's compatible with transformers that drop samples and transformers that change the target column.

Warning

Due to incompatibilities with sklearn's API, the exported pipeline always fits/transforms on the entire dataset provided. Beware that this can cause errors if the transformers were fitted on a subset of the data.

Parameters: pipeline: bool, sequence or None, optional (default=None)
Transformers to use on the data before predicting.
  • If None: Only transformers that are applied on the whole dataset are used.
  • If False: Don't use any transformers.
  • If True: Use all transformers in the pipeline.
  • If sequence: Transformers to use, selected by their index in the pipeline.

verbose: int or None, optional (default=None)
Verbosity level of the transformers in the pipeline. If None, it leaves them to their original verbosity.

Returns: pipeline: Pipeline
Current branch as a sklearn-like Pipeline object.


method rename(name=None) [source]

Change the model's tag. The acronym always stays at the beginning of the model's name. If the model is being tracked by mlflow, the name of the corresponding run is also changed.

Parameters: name: str or None, optional (default=None)
New tag for the model. If None, the tag is removed.


method reset_predictions() [source]

Clear all the model's prediction attributes. Use this method to free some memory before saving the trainer.


method scoring(metric=None, dataset="test") [source]

Get the score for a specific metric.

Parameters: metric: str or None, optional (default=None)
Name of the metric to calculate. If None, returns the model's final results (ignoring the dataset parameter). Choose from any of sklearn's classification SCORERS or one of the following custom metrics:
  • "cm" for the confusion matrix.
  • "tn" for true negatives.
  • "fp" for false positives.
  • "fn" for false negatives.
  • "tp" for true positives.
  • "fpr" for the false positive rate.
  • "tpr" for the true positive rate.
  • "fnr" for the false negative rate.
  • "tnr" for the true negative rate.
  • "sup" for the support metric.
  • "lift" for the lift metric.

dataset: str, optional (default="test")
Data set on which to calculate the metric. Options are "train" or "test".

Returns: score: float or np.ndarray
Model's score for the selected metric.
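
The count-based and rate-based custom metrics follow directly from the binary confusion matrix. The sketch below uses the standard textbook definitions; the helper names are hypothetical. "sup" and "lift" are left out because their exact definitions are not given on this page.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    return tn, fp, fn, tp


def rates(y_true, y_pred):
    """Return the four rates derived from the confusion counts."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    return {
        "fpr": fp / (fp + tn),  # false positives among actual negatives
        "tpr": tp / (tp + fn),  # true positives among actual positives
        "fnr": fn / (fn + tp),  # false negatives among actual positives
        "tnr": tn / (tn + fp),  # true negatives among actual negatives
    }


y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

print(confusion_counts(y_true, y_pred))  # (3, 1, 2, 2)
print(rates(y_true, y_pred))
```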


method save_estimator(filename="auto") [source]

Save the estimator to a pickle file.

Parameters: filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.
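
A saved estimator is an ordinary pickle, so it can be read back with the standard library. The round trip below is a sketch with a hypothetical file name and toy data; the "auto" naming scheme is not reproduced.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.naive_bayes import ComplementNB

rng = np.random.default_rng(4)
X = rng.poisson(lam=3.0, size=(100, 4))
y = (X[:, 0] > 2).astype(int)
estimator = ComplementNB().fit(X, y)

# Hypothetical file name inside a temporary directory.
path = Path(tempfile.mkdtemp()) / "ComplementNB.pkl"
with open(path, "wb") as f:
    pickle.dump(estimator, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(np.array_equal(estimator.predict(X), restored.predict(X)))
```

Only unpickle files from sources you trust, and load them with the same sklearn version that wrote them where possible.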


Example

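from atom import ATOMClassifier
from sklearn.datasets import load_breast_cancer

# Load a binary classification dataset with non-negative features
X, y = load_breast_cancer(return_X_y=True)

atom = ATOMClassifier(X, y)
atom.run(models="CNB")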