Example: Imbalanced datasets
This example shows how ATOM can help you handle imbalanced datasets. We will evaluate the performance of three Random Forest models: one trained directly on the imbalanced dataset, one trained on an oversampled dataset, and one trained on an undersampled dataset.
Load the data
In [1]:
# Import packages
from atom import ATOMClassifier
from sklearn.datasets import make_classification
In [2]:
# Create a mock imbalanced dataset
X, y = make_classification(
n_samples=5000,
n_features=30,
n_informative=20,
weights=(0.95,),
random_state=1,
)
Run the pipeline
In [3]:
# Initialize atom
atom = ATOMClassifier(X, y, test_size=0.2, verbose=2, random_state=1)
<< ================== ATOM ================== >>

Configuration ==================== >>
Algorithm task: Binary classification.

Dataset stats ==================== >>
Shape: (5000, 31)
Train set size: 4000
Test set size: 1000
-------------------------------------
Memory: 1.24 MB
Scaled: False
Outlier values: 570 (0.5%)
In [4]:
# Let's have a look at the data. Note that, since the input wasn't
# a dataframe, atom has given default names to the columns.
atom.head()
Out[4]:
| | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | ... | x21 | x22 | x23 | x24 | x25 | x26 | x27 | x28 | x29 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.535760 | -2.426045 | 1.256836 | 0.374501 | -3.241958 | -1.239468 | -0.208750 | -6.015995 | 3.698669 | 0.112512 | ... | 0.044302 | -1.935727 | 10.870353 | 0.286755 | -2.416507 | 0.556990 | -1.522635 | 3.719201 | 1.449135 | 0 |
| 1 | -3.311935 | -3.149920 | -0.801252 | -2.644414 | -0.704889 | -3.312256 | 0.714515 | 2.992345 | 5.056910 | 3.036775 | ... | 2.224359 | 0.451273 | -1.822108 | -1.435801 | 0.036132 | -1.364583 | 1.215663 | 5.232161 | 1.408798 | 0 |
| 2 | 3.821199 | 1.328129 | -1.000720 | -13.151697 | 0.254253 | 1.263636 | -1.088451 | 4.924264 | -1.225646 | -6.974824 | ... | 3.541222 | 1.686667 | -13.763703 | -1.321256 | 1.677687 | 0.774966 | -5.067689 | 4.663386 | -1.714186 | 0 |
| 3 | 5.931126 | 3.338830 | 0.545906 | 2.296355 | -3.941088 | 3.527252 | -0.158770 | 3.138381 | -0.927460 | -1.642079 | ... | -3.634442 | 7.853176 | -8.457598 | 0.000490 | -2.612756 | -1.138206 | 0.497150 | 4.351289 | -0.321748 | 0 |
| 4 | -2.829472 | -1.227185 | -0.751892 | 3.056106 | -1.988920 | -2.219184 | -0.075882 | 5.790102 | -2.786671 | 2.023458 | ... | 4.057954 | 1.178564 | -15.028187 | 1.627140 | -1.093587 | -0.422655 | 1.777011 | 6.660638 | -2.553723 | 0 |

5 rows × 31 columns
In [5]:
# Let's start reducing the number of features
atom.feature_selection("rfe", solver="rf", n_features=12)
Fitting FeatureSelector...
Performing feature selection ...
--> rfe selected 12 features from the dataset.
   --> Dropping feature x1 (rank 8).
   --> Dropping feature x2 (rank 11).
   --> Dropping feature x4 (rank 3).
   --> Dropping feature x6 (rank 16).
   --> Dropping feature x7 (rank 14).
   --> Dropping feature x10 (rank 19).
   --> Dropping feature x12 (rank 13).
   --> Dropping feature x13 (rank 12).
   --> Dropping feature x14 (rank 9).
   --> Dropping feature x16 (rank 10).
   --> Dropping feature x18 (rank 17).
   --> Dropping feature x19 (rank 2).
   --> Dropping feature x20 (rank 4).
   --> Dropping feature x22 (rank 7).
   --> Dropping feature x23 (rank 5).
   --> Dropping feature x24 (rank 18).
   --> Dropping feature x25 (rank 6).
   --> Dropping feature x26 (rank 15).
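The `"rfe"` strategy wraps scikit-learn's recursive feature elimination: features are ranked by the solver's importances and the weakest are dropped one round at a time. A minimal standalone sketch of the same idea (smaller data and forest for speed; atom does all of this internally):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(
    n_samples=500, n_features=30, n_informative=20,
    weights=(0.95,), random_state=1,
)

# Recursively drop the weakest features until 12 remain,
# ranked by random forest feature importances.
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=1),
    n_features_to_select=12,
)
rfe.fit(X, y)
print(rfe.support_.sum())  # number of selected features
print(rfe.ranking_.min())  # selected features all have rank 1
```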
In [6]:
# Fit a model directly on the imbalanced data
atom.run("RF", metric="ba")
Training ========================= >>
Models: RF
Metric: ba

Results for RandomForest:
Fit ---------------------------------------------
Train evaluation --> ba: 1.0
Test evaluation --> ba: 0.5556
Time elapsed: 1.148s
-------------------------------------------------
Time: 1.148s

Final results ==================== >>
Total time: 1.150s
-------------------------------------
RandomForest --> ba: 0.5556 ~
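The poor test score is expected: with 95% of samples in class 0, plain accuracy rewards always predicting the majority class, which is exactly what balanced accuracy (`ba`) penalizes. A small illustration with scikit-learn:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# A degenerate classifier that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))           # 0.95: looks great
print(balanced_accuracy_score(y_true, y_pred))  # 0.5: no better than chance
```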
In [7]:
# The transformer and the models have been added to the branch
atom.branch
Out[7]:
Branch(main)
Oversampling
In [8]:
# Create a new branch for oversampling
atom.branch = "oversample"
Successfully created new branch: oversample.
In [9]:
# Perform oversampling of the minority class
atom.balance(strategy="smote")
Oversampling with SMOTE... --> Adding 3570 samples to class 1.
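SMOTE creates synthetic minority samples by interpolating between a minority point and one of its k nearest minority neighbours. A simplified, hypothetical sketch of the interpolation idea (using random minority pairs instead of nearest neighbours, so this is not the exact algorithm atom runs):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy minority class with 20 samples and 3 features.
minority = rng.normal(loc=5.0, scale=1.0, size=(20, 3))

def smote_like(X, n_new, rng):
    """Generate n_new synthetic samples by linear interpolation
    between random pairs of existing minority samples."""
    i = rng.integers(0, len(X), size=n_new)
    j = rng.integers(0, len(X), size=n_new)
    gap = rng.random((n_new, 1))  # position along each segment
    return X[i] + gap * (X[j] - X[i])

synthetic = smote_like(minority, n_new=50, rng=rng)
print(synthetic.shape)  # (50, 3)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the minority class's convex hull.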
In [10]:
atom.classes # Check the balanced training set!
Out[10]:
| | dataset | train | test |
|---|---|---|---|
| 0 | 4731 | 3785 | 946 |
| 1 | 3839 | 3785 | 54 |
In [11]:
# Train another model on the new branch. Add a tag after
# the model's acronym to distinguish it from the first model
atom.run("rf_os") # os for oversample
Training ========================= >>
Models: RF_os
Metric: ba

Results for RandomForest:
Fit ---------------------------------------------
Train evaluation --> ba: 1.0
Test evaluation --> ba: 0.7672
Time elapsed: 2.089s
-------------------------------------------------
Time: 2.089s

Final results ==================== >>
Total time: 2.091s
-------------------------------------
RandomForest --> ba: 0.7672 ~
Undersampling
In [12]:
# Create the undersampling branch
# Split from main so we don't adopt the oversampling transformer
atom.branch = "undersample_from_main"
Successfully created new branch: undersample.
In [13]:
atom.classes # In this branch, the data is still imbalanced
Out[13]:
| | dataset | train | test |
|---|---|---|---|
| 0 | 4731 | 3785 | 946 |
| 1 | 269 | 215 | 54 |
In [14]:
# Perform undersampling of the majority class
atom.balance(strategy="NearMiss")
Undersampling with NearMiss... --> Removing 3570 samples from class 0.
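NearMiss selects which majority samples to keep based on their distances to the minority class (NearMiss-1 keeps those closest to their nearest minority neighbours). A rough illustrative sketch of the distance-based idea with NumPy, not imbalanced-learn's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
majority = rng.normal(0.0, 1.0, size=(100, 2))  # class 0
minority = rng.normal(3.0, 1.0, size=(10, 2))   # class 1

# Pairwise distances from every majority sample to every minority sample.
dists = np.linalg.norm(majority[:, None, :] - minority[None, :, :], axis=-1)

# Keep the majority samples with the smallest mean distance to the
# minority class, until both classes have the same size.
keep = np.argsort(dists.mean(axis=1))[: len(minority)]
balanced_majority = majority[keep]
print(balanced_majority.shape)  # (10, 2)
```

Keeping only the majority points near the decision boundary is also why NearMiss can hurt precision, as the results below show.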
In [15]:
atom.run("rf_us")
Training ========================= >>
Models: RF_us
Metric: ba

Results for RandomForest:
Fit ---------------------------------------------
Train evaluation --> ba: 1.0
Test evaluation --> ba: 0.6706
Time elapsed: 0.207s
-------------------------------------------------
Time: 0.207s

Final results ==================== >>
Total time: 0.209s
-------------------------------------
RandomForest --> ba: 0.6706 ~
In [16]:
# Check that the branch only contains the desired transformers
atom.branch
Out[16]:
Branch(undersample)
In [17]:
# Visualize the complete pipeline
atom.plot_pipeline()
Analyze the results
In [18]:
atom.evaluate()
Out[18]:
| | accuracy | ap | ba | f1 | jaccard | mcc | precision | recall | auc |
|---|---|---|---|---|---|---|---|---|---|
| RF | 0.952000 | 0.656200 | 0.555600 | 0.200000 | 0.111100 | 0.325200 | 1.000000 | 0.111100 | 0.910700 |
| RF_os | 0.956000 | 0.621500 | 0.767200 | 0.576900 | 0.405400 | 0.554200 | 0.600000 | 0.555600 | 0.925100 |
| RF_us | 0.509000 | 0.368700 | 0.670600 | 0.157800 | 0.085700 | 0.154500 | 0.087000 | 0.851900 | 0.825800 |
In [19]:
atom.plot_prc()
In [20]:
atom.plot_roc()