Imbalanced datasets¶
This example shows how ATOM can help you handle imbalanced datasets. We evaluate the performance of three Random Forest models: one trained directly on the imbalanced dataset, one trained on an oversampled dataset, and one trained on an undersampled dataset.
Load the data¶
In [20]:
# Import packages
from atom import ATOMClassifier
from sklearn.datasets import make_classification
In [21]:
# Create a mock imbalanced dataset
X, y = make_classification(
n_samples=5000,
n_features=30,
n_informative=20,
weights=(0.95,),
random_state=1,
)
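The weights=(0.95,) argument assigns roughly 95% of the samples to class 0, leaving the minority class with only ~5%. A quick sanity check of the distribution (plain numpy, not part of the original example):

import numpy as np

# Count samples per class; expect roughly a 95/5 split
# (label noise makes the exact counts deviate slightly)
print(np.bincount(y))  # e.g. [4731  269]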
Run the pipeline¶
In [22]:
# Initialize atom
atom = ATOMClassifier(X, y, test_size=0.2, verbose=2, random_state=1)
<< ================== ATOM ================== >>
Algorithm task: binary classification.

Dataset stats ==================== >>
Shape: (5000, 31)
Scaled: False
Outlier values: 582 (0.5%)
-------------------------------------
Train set size: 4000
Test set size: 1000
-------------------------------------

|   |     dataset |       train |        test |
|---|-------------|-------------|-------------|
| 0 | 4731 (17.6) | 3777 (16.9) |  954 (20.7) |
| 1 |   269 (1.0) |   223 (1.0) |    46 (1.0) |
In [23]:
# Let's have a look at the data. Note that, since the input wasn't
# a dataframe, atom has given default names to the columns.
atom.head()
Out[23]:
|   | feature 1 | feature 2 | feature 3 | feature 4 | feature 5 | feature 6 | feature 7 | feature 8 | feature 9 | feature 10 | ... | feature 22 | feature 23 | feature 24 | feature 25 | feature 26 | feature 27 | feature 28 | feature 29 | feature 30 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.487515 | 3.854686 | -0.540999 | 14.601416 | 2.656621 | 0.890481 | -1.369840 | -4.366444 | 3.466322 | -0.029618 | ... | 0.404926 | 1.743085 | 7.313850 | -0.922566 | 0.994747 | 0.000604 | -1.441929 | -0.741606 | -1.021307 | 0 |
| 1 | 0.493415 | -5.068825 | 0.727330 | 5.470114 | 1.631227 | -1.267924 | 0.550369 | -2.055318 | -0.105617 | 3.511867 | ... | 2.888285 | 0.309113 | 3.817221 | 1.118588 | 1.502536 | -0.143976 | -1.250787 | 6.095970 | 0.490575 | 0 |
| 2 | 0.010148 | -1.926452 | -2.079807 | -3.113497 | 4.962937 | 1.586091 | -0.201393 | -3.469079 | 5.401403 | 5.751620 | ... | 5.692868 | -2.097011 | -7.173968 | -0.493898 | -3.530642 | -1.061223 | 5.956756 | 3.937707 | 2.394163 | 0 |
| 3 | 1.134321 | 4.081801 | 0.012246 | 7.176998 | -2.520901 | -3.982497 | -1.034924 | 0.885346 | 2.221092 | -1.082554 | ... | -4.747836 | -0.642487 | 6.538375 | -0.724994 | -3.191186 | -0.220013 | 1.133145 | -0.385503 | 1.351542 | 0 |
| 4 | 1.324943 | 1.428477 | 0.935775 | 1.270190 | -2.449393 | 1.137865 | 1.652580 | 1.746711 | -3.283392 | -1.369744 | ... | -0.548266 | 3.693259 | 4.914830 | -0.275366 | -0.526327 | 0.528170 | -2.031690 | 0.889746 | -0.605291 | 0 |
5 rows × 31 columns
In [24]:
# Let's start reducing the number of features
atom.feature_selection("RFE", solver="RF", n_features=12)
Fitting FeatureSelector...
Performing feature selection ...
 --> RFE selected 12 features from the dataset.
   >>> Dropping feature feature 2 (rank 3).
   >>> Dropping feature feature 3 (rank 8).
   >>> Dropping feature feature 5 (rank 10).
   >>> Dropping feature feature 7 (rank 17).
   >>> Dropping feature feature 8 (rank 12).
   >>> Dropping feature feature 11 (rank 19).
   >>> Dropping feature feature 13 (rank 13).
   >>> Dropping feature feature 14 (rank 11).
   >>> Dropping feature feature 15 (rank 15).
   >>> Dropping feature feature 17 (rank 4).
   >>> Dropping feature feature 19 (rank 16).
   >>> Dropping feature feature 20 (rank 2).
   >>> Dropping feature feature 21 (rank 6).
   >>> Dropping feature feature 23 (rank 5).
   >>> Dropping feature feature 24 (rank 9).
   >>> Dropping feature feature 25 (rank 18).
   >>> Dropping feature feature 26 (rank 7).
   >>> Dropping feature feature 27 (rank 14).
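The RFE strategy corresponds to scikit-learn's recursive feature elimination. A rough standalone sketch of the same step (assuming a random forest as the ranking estimator, mirroring the RF solver; this is not how atom calls it internally):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Recursively refit the estimator, discarding the weakest
# features until only 12 remain
rfe = RFE(RandomForestClassifier(random_state=1), n_features_to_select=12)
X_reduced = rfe.fit_transform(X, y)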
In [25]:
# Fit a model directly on the imbalanced data
atom.run("RF", metric="ba")
Training ========================= >>
Models: RF
Metric: balanced_accuracy

Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.5326
Time elapsed: 0.769s
-------------------------------------------------
Total time: 0.769s

Final results ==================== >>
Duration: 0.771s
-------------------------------------
Random Forest --> balanced_accuracy: 0.5326 ~
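The "ba" shorthand selects balanced accuracy, the unweighted mean of the per-class recall. That choice matters here: on a 95/5 split, a model that always predicts the majority class already scores 0.95 on plain accuracy, but only 0.5 on balanced accuracy. A small illustration with scikit-learn's metrics:

from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predict the majority class

print(accuracy_score(y_true, y_pred))           # 0.95
print(balanced_accuracy_score(y_true, y_pred))  # 0.5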
In [26]:
# The transformer and the models have been added to the branch
atom.branch
Out[26]:
Branch: master
 --> Pipeline:
   >>> FeatureSelector
     --> strategy: RFE
     --> solver: RF_class
     --> n_features: 12
     --> max_frac_repeated: 1.0
     --> max_correlation: 1.0
     --> kwargs: {}
 --> Models: RF
Oversampling¶
In [27]:
# Create a new branch for oversampling
atom.branch = "oversample"
New branch oversample successfully created.
In [28]:
# Perform oversampling of the minority class
atom.balance(strategy="smote")
Oversampling with SMOTE...
 --> Adding 3554 samples to class: 1.
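Only the training set is resampled; the test set keeps its original distribution so the evaluation stays honest. The balancing itself is delegated to the imbalanced-learn package. A standalone sketch of the same operation (X_train and y_train are placeholders for atom's training data):

from imblearn.over_sampling import SMOTE

# SMOTE synthesizes new minority samples by interpolating
# between a minority sample and its nearest minority neighbors
X_res, y_res = SMOTE(random_state=1).fit_resample(X_train, y_train)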
In [29]:
atom.classes # Check the balanced training set!
Out[29]:
|   | dataset | train | test |
|---|---------|-------|------|
| 0 |    4731 |  3777 |  954 |
| 1 |    3823 |  3777 |   46 |
In [30]:
# Train another model on the new branch. Add a tag after
# the model's acronym to distinguish it from the first model
atom.run("rf_os") # os for oversample
Training ========================= >>
Models: RF_os
Metric: balanced_accuracy

Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.7737
Time elapsed: 1.361s
-------------------------------------------------
Total time: 1.361s

Final results ==================== >>
Duration: 1.362s
-------------------------------------
Random Forest --> balanced_accuracy: 0.7737 ~
Undersampling¶
In [31]:
# Create the undersampling branch
# Split from master so the oversampling transformer isn't adopted;
# the "<new>_from_<parent>" syntax creates branch "undersample" from "master"
atom.branch = "undersample_from_master"
New branch undersample successfully created.
In [32]:
atom.classes # In this branch, the data is still imbalanced
Out[32]:
|   | dataset | train | test |
|---|---------|-------|------|
| 0 |    4731 |  3777 |  954 |
| 1 |     269 |   223 |   46 |
In [33]:
# Perform undersampling of the majority class
atom.balance(strategy="NearMiss")
Undersampling with NearMiss...
 --> Removing 3554 samples from class: 0.
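NearMiss is a distance-based undersampler from imbalanced-learn: it keeps the majority-class samples that lie closest to the minority class and discards the rest. A standalone sketch (again with X_train and y_train as placeholders for the training data):

from imblearn.under_sampling import NearMiss

# Keep only the majority samples nearest to the minority class;
# everything else is dropped from the training set
X_res, y_res = NearMiss().fit_resample(X_train, y_train)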
In [34]:
atom.run("rf_us")
atom.run("rf_us")
Training ========================= >>
Models: RF_us
Metric: balanced_accuracy

Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.6733
Time elapsed: 0.172s
-------------------------------------------------
Total time: 0.172s

Final results ==================== >>
Duration: 0.172s
-------------------------------------
Random Forest --> balanced_accuracy: 0.6733 ~
In [35]:
# Check that the branch only contains the desired transformers
atom.branch
Out[35]:
Branch: undersample
 --> Pipeline:
   >>> FeatureSelector
     --> strategy: RFE
     --> solver: RF_class
     --> n_features: 12
     --> max_frac_repeated: 1.0
     --> max_correlation: 1.0
     --> kwargs: {}
   >>> Balancer
     --> strategy: NearMiss
     --> kwargs: {}
 --> Models: RF_us
Analyze results¶
In [36]:
atom.evaluate()
Out[36]:
|       | accuracy | average_precision | balanced_accuracy | f1       | jaccard  | matthews_corrcoef | precision | recall   | roc_auc  |
|-------|----------|-------------------|-------------------|----------|----------|-------------------|-----------|----------|----------|
| RF    | 0.957    | 0.513725          | 0.532609          | 0.122449 | 0.065217 | 0.249809          | 1.000000  | 0.065217 | 0.921156 |
| RF_os | 0.963    | 0.567880          | 0.773699          | 0.584270 | 0.412698 | 0.565283          | 0.604651  | 0.565217 | 0.934942 |
| RF_us | 0.495    | 0.335867          | 0.673252          | 0.136752 | 0.073394 | 0.145619          | 0.074212  | 0.869565 | 0.805875 |

The comparison makes the trade-offs explicit: the model trained on the raw data almost never predicts the minority class (perfect precision, but a recall of only 0.07), the undersampled model swings to the other extreme (recall of 0.87 at the cost of precision and overall accuracy), while the oversampled model strikes the best balance, scoring highest on balanced accuracy, f1 and roc_auc.
In [37]:
# Plot the precision-recall curve of each model
atom.plot_prc()
In [38]:
# Plot the ROC curve of each model
atom.plot_roc()