Imbalanced datasets
This example shows how ATOM can help you handle imbalanced datasets. We will evaluate the performance of three different Random Forest models: one trained directly on the imbalanced dataset, one trained on an oversampled dataset, and one trained on an undersampled dataset.
Load the data
In [1]:
# Import packages
from atom import ATOMClassifier
from sklearn.datasets import make_classification
In [2]:
# Create a mock imbalanced dataset
X, y = make_classification(
    n_samples=5000,
    n_features=30,
    n_informative=20,
    weights=(0.95,),
    random_state=1,
)
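Before handing the data to atom, it can help to verify the imbalance that `weights=(0.95,)` produces. A minimal standalone check (not part of the original pipeline):

```python
from collections import Counter

# count samples per class; weights=(0.95,) should yield roughly a 95/5 split
counts = Counter(y)
for label, n in sorted(counts.items()):
    print(f"class {label}: {n} samples ({n / len(y):.1%})")
```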
Run the pipeline
In [3]:
# Initialize atom
atom = ATOMClassifier(X, y, test_size=0.2, verbose=2, random_state=1)
<< ================== ATOM ================== >>
Algorithm task: binary classification.

Dataset stats ====================== >>
Shape: (5000, 31)
Scaled: False
Outlier values: 582 (0.5%)
---------------------------------------
Train set size: 4000
Test set size: 1000
---------------------------------------
|    | dataset     | train       | test       |
|---:|:------------|:------------|:-----------|
|  0 | 4731 (17.6) | 3777 (16.9) | 954 (20.7) |
|  1 | 269 (1.0)   | 223 (1.0)   | 46 (1.0)   |
In [4]:
# Let's have a look at the data. Note that, since the input wasn't
# a dataframe, atom has given default names to the columns.
atom.head()
Out[4]:
|   | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 | Feature 6 | Feature 7 | Feature 8 | Feature 9 | Feature 10 | ... | Feature 22 | Feature 23 | Feature 24 | Feature 25 | Feature 26 | Feature 27 | Feature 28 | Feature 29 | Feature 30 | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.487515 | 3.854686 | -0.540999 | 14.601416 | 2.656621 | 0.890481 | -1.369840 | -4.366444 | 3.466322 | -0.029618 | ... | 0.404926 | 1.743085 | 7.313850 | -0.922566 | 0.994747 | 0.000604 | -1.441929 | -0.741606 | -1.021307 | 0 |
| 1 | 0.493415 | -5.068825 | 0.727330 | 5.470114 | 1.631227 | -1.267924 | 0.550369 | -2.055318 | -0.105617 | 3.511867 | ... | 2.888285 | 0.309113 | 3.817221 | 1.118588 | 1.502536 | -0.143976 | -1.250787 | 6.095970 | 0.490575 | 0 |
| 2 | 0.010148 | -1.926452 | -2.079807 | -3.113497 | 4.962937 | 1.586091 | -0.201393 | -3.469079 | 5.401403 | 5.751620 | ... | 5.692868 | -2.097011 | -7.173968 | -0.493898 | -3.530642 | -1.061223 | 5.956756 | 3.937707 | 2.394163 | 0 |
| 3 | 1.134321 | 4.081801 | 0.012246 | 7.176998 | -2.520901 | -3.982497 | -1.034924 | 0.885346 | 2.221092 | -1.082554 | ... | -4.747836 | -0.642487 | 6.538375 | -0.724994 | -3.191186 | -0.220013 | 1.133145 | -0.385503 | 1.351542 | 0 |
| 4 | 1.324943 | 1.428477 | 0.935775 | 1.270190 | -2.449393 | 1.137865 | 1.652580 | 1.746711 | -3.283392 | -1.369744 | ... | -0.548266 | 3.693259 | 4.914830 | -0.275366 | -0.526327 | 0.528170 | -2.031690 | 0.889746 | -0.605291 | 0 |
5 rows × 31 columns
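If you prefer descriptive names over the defaults, one option is to wrap the arrays in a dataframe before initializing atom. A minimal sketch; the `sensor_` prefix and the `atom_named` instance are made up for illustration:

```python
import pandas as pd

# hypothetical: assign explicit column names so atom picks them up
X_named = pd.DataFrame(X, columns=[f"sensor_{i}" for i in range(X.shape[1])])
atom_named = ATOMClassifier(X_named, y, test_size=0.2, random_state=1)
```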
In [5]:
# Let's start reducing the number of features
atom.feature_selection("RFE", solver="RF", n_features=12)
Fitting FeatureSelector...
Performing feature selection ...
 --> The RFE selected 12 features from the dataset.
   >>> Dropping feature Feature 2 (rank 3).
   >>> Dropping feature Feature 3 (rank 8).
   >>> Dropping feature Feature 5 (rank 10).
   >>> Dropping feature Feature 7 (rank 17).
   >>> Dropping feature Feature 8 (rank 12).
   >>> Dropping feature Feature 11 (rank 19).
   >>> Dropping feature Feature 13 (rank 13).
   >>> Dropping feature Feature 14 (rank 11).
   >>> Dropping feature Feature 15 (rank 15).
   >>> Dropping feature Feature 17 (rank 4).
   >>> Dropping feature Feature 19 (rank 16).
   >>> Dropping feature Feature 20 (rank 2).
   >>> Dropping feature Feature 21 (rank 6).
   >>> Dropping feature Feature 23 (rank 5).
   >>> Dropping feature Feature 24 (rank 9).
   >>> Dropping feature Feature 25 (rank 18).
   >>> Dropping feature Feature 26 (rank 7).
   >>> Dropping feature Feature 27 (rank 14).
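RFE recursively fits the solver and eliminates the weakest features until only `n_features` remain. A rough standalone equivalent with scikit-learn (a sketch, not atom's exact internals; atom fits on the training set only, while this uses the full arrays for brevity):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# fit RFE with a random forest as the ranking estimator
rfe = RFE(RandomForestClassifier(random_state=1), n_features_to_select=12)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the 12 selected features
print(rfe.ranking_)   # rank 1 means selected; higher ranks were dropped earlier
```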
In [6]:
# Fit a model directly on the imbalanced data
atom.run("RF", metric="ba")
Training ===================================== >>
Models: RF
Metric: balanced_accuracy

Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.5326
Time elapsed: 0.798s
-------------------------------------------------
Total time: 0.798s

Final results ========================= >>
Duration: 0.798s
------------------------------------------
Random Forest --> balanced_accuracy: 0.5326 ~
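Balanced accuracy ("ba") is the unweighted mean of the per-class recalls, which is why a model that mostly predicts the majority class scores close to 0.5 here even though its plain accuracy is around 95%. A quick illustration with scikit-learn:

```python
from sklearn.metrics import balanced_accuracy_score

# a classifier that always predicts the majority class: recall is 1.0
# for class 0 and 0.0 for class 1, so balanced accuracy is (1 + 0) / 2
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0] * 10
print(balanced_accuracy_score(y_true, y_pred))  # 0.5
```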
In [7]:
# The transformer and the models have been added to the branch
atom.branch
Out[7]:
Branch: master
 --> Pipeline:
   >>> FeatureSelector
     --> strategy: RFE
     --> solver: RF_class
     --> n_features: 12
     --> max_frac_repeated: 1.0
     --> max_correlation: 1.0
     --> kwargs: {}
 --> Models: RF
Oversampling
In [8]:
# Create a new branch for oversampling
atom.branch = "oversample"
New branch oversample successfully created!
In [9]:
# Perform oversampling of the minority class
atom.balance(strategy="smote")
Oversampling with SMOTE...
 --> Adding 3554 samples to class: 1.
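SMOTE synthesizes new minority samples by interpolating between a minority point and its nearest minority neighbors, rather than duplicating existing rows. A self-contained sketch with imbalanced-learn, the library atom's balancer builds on (toy data for brevity):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# toy imbalanced data, then oversample the minority class up to parity
X_toy, y_toy = make_classification(n_samples=200, weights=(0.9,), random_state=1)
X_res, y_res = SMOTE(random_state=1).fit_resample(X_toy, y_toy)
print(Counter(y_toy))  # imbalanced, roughly 9:1
print(Counter(y_res))  # both classes at the majority count
```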
In [10]:
atom.classes # Check the balanced training set!
Out[10]:
|   | dataset | train | test |
|--:|--------:|------:|-----:|
| 0 | 4731    | 3777  | 954  |
| 1 | 3823    | 3777  | 46   |
In [11]:
# Train another model on the new branch. Add a tag after
# the model's acronym to distinguish it from the first model
atom.run("rf_os") # os for oversample
Training ===================================== >>
Models: RF_os
Metric: balanced_accuracy

Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.7546
Time elapsed: 1.400s
-------------------------------------------------
Total time: 1.400s

Final results ========================= >>
Duration: 1.400s
------------------------------------------
Random Forest --> balanced_accuracy: 0.7546 ~
Undersampling
In [12]:
# Create the undersampling branch
# Split from master to not adopt the oversampling transformer
atom.branch = "undersample_from_master"
New branch undersample successfully created!
In [13]:
atom.classes # In this branch, the data is still imbalanced
Out[13]:
|   | dataset | train | test |
|--:|--------:|------:|-----:|
| 0 | 4731    | 3777  | 954  |
| 1 | 269     | 223   | 46   |
In [14]:
# Perform undersampling of the majority class
atom.balance(strategy="NearMiss")
Undersampling with NearMiss...
 --> Removing 3554 samples from class: 0.
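NearMiss takes the opposite route: it discards majority samples, keeping those whose average distance to the nearest minority samples is smallest (version 1, the default). A standalone sketch with imbalanced-learn:

```python
from collections import Counter

from imblearn.under_sampling import NearMiss
from sklearn.datasets import make_classification

# toy imbalanced data, then drop majority samples down to parity
X_toy, y_toy = make_classification(n_samples=200, weights=(0.9,), random_state=1)
X_res, y_res = NearMiss().fit_resample(X_toy, y_toy)
print(Counter(y_res))  # both classes at the minority count
```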
In [15]:
atom.run("rf_us")
atom.run("rf_us")
Training ===================================== >>
Models: RF_us
Metric: balanced_accuracy

Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.6837
Time elapsed: 0.165s
-------------------------------------------------
Total time: 0.165s

Final results ========================= >>
Duration: 0.166s
------------------------------------------
Random Forest --> balanced_accuracy: 0.6837 ~
In [16]:
# Check that the branch only contains the desired transformers
atom.branch
Out[16]:
Branch: undersample
 --> Pipeline:
   >>> FeatureSelector
     --> strategy: RFE
     --> solver: RF_class
     --> n_features: 12
     --> max_frac_repeated: 1.0
     --> max_correlation: 1.0
     --> kwargs: {}
   >>> Balancer
     --> strategy: NearMiss
     --> kwargs: {}
 --> Models: RF_us
Analyze results
In [17]:
atom.evaluate()
Out[17]:
|       | accuracy | average_precision | balanced_accuracy | f1       | jaccard  | matthews_corrcoef | precision | recall   | roc_auc  |
|:------|---------:|------------------:|------------------:|---------:|---------:|------------------:|----------:|---------:|---------:|
| RF    | 0.957    | 0.528979          | 0.532609          | 0.122449 | 0.065217 | 0.249809          | 1.000000  | 0.065217 | 0.917236 |
| RF_os | 0.966    | 0.557137          | 0.754580          | 0.585366 | 0.413793 | 0.572556          | 0.666667  | 0.521739 | 0.943476 |
| RF_us | 0.515    | 0.223479          | 0.683734          | 0.141593 | 0.076190 | 0.154070          | 0.077071  | 0.869565 | 0.808939 |
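The table makes the precision/recall trade-off explicit: the plain RF almost never predicts class 1 (recall 0.065), while the undersampled model floods it with false positives (precision 0.077). The f1 column is simply the harmonic mean of the two, which you can verify by hand:

```python
# sanity check on the RF_os row: f1 is the harmonic mean of precision and recall
p, r = 0.666667, 0.521739
print(2 * p * r / (p + r))  # ≈ 0.585366, matching the f1 column
```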
In [18]:
atom.plot_prc()
In [19]:
atom.plot_roc()
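The curves should be consistent with the roc_auc column above, where the oversampled model comes out on top. For reference, roughly the same ROC plot can be produced for a single fitted model with scikit-learn directly (a sketch; `clf`, `X_test` and `y_test` are placeholders for a fitted estimator and a held-out set):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# draw the ROC curve for one fitted classifier on held-out data
RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.show()
```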