Imbalanced datasets¶
This example shows how ATOM can help you handle imbalanced datasets. We evaluate the performance of three Random Forest models: one trained directly on the imbalanced dataset, one trained on an oversampled dataset, and one trained on an undersampled dataset.
Load the data¶
In [20]:
# Import packages
from atom import ATOMClassifier
from sklearn.datasets import make_classification
In [21]:
# Create a mock imbalanced dataset
X, y = make_classification(
n_samples=5000,
n_features=30,
n_informative=20,
weights=(0.95,),
random_state=1,
)
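The weights=(0.95,) argument assigns roughly 95% of the samples to class 0, leaving the minority class with only ~5%. A quick sanity check of the distribution (plain numpy, not part of the original example):

import numpy as np

# Count samples per class; expect roughly a 95/5 split
# (label noise makes the exact counts deviate slightly)
print(np.bincount(y))  # e.g. [4731  269]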
Run the pipeline¶
In [22]:
# Initialize atom
atom = ATOMClassifier(X, y, test_size=0.2, verbose=2, random_state=1)
<< ================== ATOM ================== >>
Algorithm task: binary classification.

Dataset stats ==================== >>
Shape: (5000, 31)
Scaled: False
Outlier values: 582 (0.5%)
-------------------------------------
Train set size: 4000
Test set size: 1000
-------------------------------------

|   |     dataset |       train |        test |
|---|-------------|-------------|-------------|
| 0 | 4731 (17.6) | 3777 (16.9) |  954 (20.7) |
| 1 |   269 (1.0) |   223 (1.0) |    46 (1.0) |
In [23]:
# Let's have a look at the data. Note that, since the input wasn't
# a dataframe, atom has given default names to the columns.
atom.head()
Out[23]:
|   | feature 1 | feature 2 | feature 3 | feature 4 | feature 5 | feature 6 | feature 7 | feature 8 | feature 9 | feature 10 | ... | feature 22 | feature 23 | feature 24 | feature 25 | feature 26 | feature 27 | feature 28 | feature 29 | feature 30 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.487515 | 3.854686 | -0.540999 | 14.601416 | 2.656621 | 0.890481 | -1.369840 | -4.366444 | 3.466322 | -0.029618 | ... | 0.404926 | 1.743085 | 7.313850 | -0.922566 | 0.994747 | 0.000604 | -1.441929 | -0.741606 | -1.021307 | 0 |
| 1 | 0.493415 | -5.068825 | 0.727330 | 5.470114 | 1.631227 | -1.267924 | 0.550369 | -2.055318 | -0.105617 | 3.511867 | ... | 2.888285 | 0.309113 | 3.817221 | 1.118588 | 1.502536 | -0.143976 | -1.250787 | 6.095970 | 0.490575 | 0 |
| 2 | 0.010148 | -1.926452 | -2.079807 | -3.113497 | 4.962937 | 1.586091 | -0.201393 | -3.469079 | 5.401403 | 5.751620 | ... | 5.692868 | -2.097011 | -7.173968 | -0.493898 | -3.530642 | -1.061223 | 5.956756 | 3.937707 | 2.394163 | 0 |
| 3 | 1.134321 | 4.081801 | 0.012246 | 7.176998 | -2.520901 | -3.982497 | -1.034924 | 0.885346 | 2.221092 | -1.082554 | ... | -4.747836 | -0.642487 | 6.538375 | -0.724994 | -3.191186 | -0.220013 | 1.133145 | -0.385503 | 1.351542 | 0 |
| 4 | 1.324943 | 1.428477 | 0.935775 | 1.270190 | -2.449393 | 1.137865 | 1.652580 | 1.746711 | -3.283392 | -1.369744 | ... | -0.548266 | 3.693259 | 4.914830 | -0.275366 | -0.526327 | 0.528170 | -2.031690 | 0.889746 | -0.605291 | 0 |
5 rows × 31 columns
In [24]:
# Let's start reducing the number of features
atom.feature_selection("RFE", solver="RF", n_features=12)
Fitting FeatureSelector...
Performing feature selection ...
 --> RFE selected 12 features from the dataset.
   >>> Dropping feature feature 2 (rank 3).
   >>> Dropping feature feature 3 (rank 8).
   >>> Dropping feature feature 5 (rank 10).
   >>> Dropping feature feature 7 (rank 17).
   >>> Dropping feature feature 8 (rank 12).
   >>> Dropping feature feature 11 (rank 19).
   >>> Dropping feature feature 13 (rank 13).
   >>> Dropping feature feature 14 (rank 11).
   >>> Dropping feature feature 15 (rank 15).
   >>> Dropping feature feature 17 (rank 4).
   >>> Dropping feature feature 19 (rank 16).
   >>> Dropping feature feature 20 (rank 2).
   >>> Dropping feature feature 21 (rank 6).
   >>> Dropping feature feature 23 (rank 5).
   >>> Dropping feature feature 24 (rank 9).
   >>> Dropping feature feature 25 (rank 18).
   >>> Dropping feature feature 26 (rank 7).
   >>> Dropping feature feature 27 (rank 14).
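The RFE strategy corresponds to scikit-learn's recursive feature elimination. A rough standalone sketch of the same step (assuming a random forest as the ranking estimator, mirroring the RF solver; this is not how atom calls it internally):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Recursively refit the estimator, discarding the weakest
# features until only 12 remain
rfe = RFE(RandomForestClassifier(random_state=1), n_features_to_select=12)
X_reduced = rfe.fit_transform(X, y)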
In [25]:
# Fit a model directly on the imbalanced data
atom.run("RF", metric="ba")
Training ========================= >>
Models: RF
Metric: balanced_accuracy

Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.5326
Time elapsed: 0.769s
-------------------------------------------------
Total time: 0.769s

Final results ==================== >>
Duration: 0.771s
-------------------------------------
Random Forest --> balanced_accuracy: 0.5326 ~
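The "ba" shorthand selects balanced accuracy, the unweighted mean of the per-class recall. That choice matters here: on a 95/5 split, a model that always predicts the majority class already scores 0.95 on plain accuracy, but only 0.5 on balanced accuracy. A small illustration with scikit-learn's metrics:

from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predict the majority class

print(accuracy_score(y_true, y_pred))           # 0.95
print(balanced_accuracy_score(y_true, y_pred))  # 0.5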
In [26]:
# The transformer and the models have been added to the branch
atom.branch
Out[26]:
Branch: master
 --> Pipeline:
   >>> FeatureSelector
     --> strategy: RFE
     --> solver: RF_class
     --> n_features: 12
     --> max_frac_repeated: 1.0
     --> max_correlation: 1.0
     --> kwargs: {}
 --> Models: RF
Oversampling¶
In [27]:
# Create a new branch for oversampling
atom.branch = "oversample"
New branch oversample successfully created.
In [28]:
# Perform oversampling of the minority class
atom.balance(strategy="smote")
Oversampling with SMOTE...
 --> Adding 3554 samples to class: 1.
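Only the training set is resampled; the test set keeps its original distribution so the evaluation stays honest. The balancing itself is delegated to the imbalanced-learn package. A standalone sketch of the same operation (X_train and y_train are placeholders for atom's training data):

from imblearn.over_sampling import SMOTE

# SMOTE synthesizes new minority samples by interpolating
# between a minority sample and its nearest minority neighbors
X_res, y_res = SMOTE(random_state=1).fit_resample(X_train, y_train)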
In [29]:
atom.classes # Check the balanced training set!
Out[29]:
|   | dataset | train | test |
|---|---------|-------|------|
| 0 |    4731 |  3777 |  954 |
| 1 |    3823 |  3777 |   46 |
In [30]:
# Train another model on the new branch. Add a tag after
# the model's acronym to distinguish it from the first model
atom.run("rf_os") # os for oversample
Training ========================= >>
Models: RF_os
Metric: balanced_accuracy

Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.7737
Time elapsed: 1.361s
-------------------------------------------------
Total time: 1.361s

Final results ==================== >>
Duration: 1.362s
-------------------------------------
Random Forest --> balanced_accuracy: 0.7737 ~
Undersampling¶
In [31]:
# Create the undersampling branch
# Split from master so the oversampling transformer isn't adopted;
# the "<new>_from_<parent>" syntax creates branch "undersample" from "master"
atom.branch = "undersample_from_master"
New branch undersample successfully created.
In [32]:
atom.classes # In this branch, the data is still imbalanced
Out[32]:
|   | dataset | train | test |
|---|---------|-------|------|
| 0 |    4731 |  3777 |  954 |
| 1 |     269 |   223 |   46 |
In [33]:
# Perform undersampling of the majority class
atom.balance(strategy="NearMiss")
Undersampling with NearMiss...
 --> Removing 3554 samples from class: 0.
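NearMiss is a distance-based undersampler from imbalanced-learn: it keeps the majority-class samples that lie closest to the minority class and discards the rest. A standalone sketch (again with X_train and y_train as placeholders for the training data):

from imblearn.under_sampling import NearMiss

# Keep only the majority samples nearest to the minority class;
# everything else is dropped from the training set
X_res, y_res = NearMiss().fit_resample(X_train, y_train)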
In [34]:
atom.run("rf_us")
atom.run("rf_us")
Training ========================= >>
Models: RF_us
Metric: balanced_accuracy

Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.6733
Time elapsed: 0.172s
-------------------------------------------------
Total time: 0.172s

Final results ==================== >>
Duration: 0.172s
-------------------------------------
Random Forest --> balanced_accuracy: 0.6733 ~
In [35]:
# Check that the branch only contains the desired transformers
atom.branch
Out[35]:
Branch: undersample
 --> Pipeline:
   >>> FeatureSelector
     --> strategy: RFE
     --> solver: RF_class
     --> n_features: 12
     --> max_frac_repeated: 1.0
     --> max_correlation: 1.0
     --> kwargs: {}
   >>> Balancer
     --> strategy: NearMiss
     --> kwargs: {}
 --> Models: RF_us
Analyze results¶
In [36]:
atom.evaluate()
Out[36]:
|       | accuracy | average_precision | balanced_accuracy | f1       | jaccard  | matthews_corrcoef | precision | recall   | roc_auc  |
|-------|----------|-------------------|-------------------|----------|----------|-------------------|-----------|----------|----------|
| RF    | 0.957    | 0.513725          | 0.532609          | 0.122449 | 0.065217 | 0.249809          | 1.000000  | 0.065217 | 0.921156 |
| RF_os | 0.963    | 0.567880          | 0.773699          | 0.584270 | 0.412698 | 0.565283          | 0.604651  | 0.565217 | 0.934942 |
| RF_us | 0.495    | 0.335867          | 0.673252          | 0.136752 | 0.073394 | 0.145619          | 0.074212  | 0.869565 | 0.805875 |

The comparison makes the trade-offs explicit: the model trained on the raw data almost never predicts the minority class (perfect precision, but a recall of only 0.07), the undersampled model swings to the other extreme (recall of 0.87 at the cost of precision and overall accuracy), while the oversampled model strikes the best balance, scoring highest on balanced accuracy, f1 and roc_auc.
In [37]:
# Plot the precision-recall curve of each model
atom.plot_prc()
In [38]:
# Plot the ROC curve of each model
atom.plot_roc()