Imbalanced datasets
This example shows how ATOM can help you handle imbalanced datasets. We evaluate the performance of three Random Forest models: one trained directly on the imbalanced dataset, one trained on an oversampled dataset, and one trained on an undersampled dataset.
Load the data
In [1]:
# Import packages
from atom import ATOMClassifier
from sklearn.datasets import make_classification
In [2]:
# Create a mock imbalanced dataset
X, y = make_classification(
    n_samples=5000,
    n_features=30,
    n_informative=20,
    weights=(0.95,),
    random_state=1,
)
Run the pipeline
In [3]:
# Initialize atom
atom = ATOMClassifier(X, y, test_size=0.2, verbose=2, random_state=1)
<< ================== ATOM ================== >>
Algorithm task: binary classification.

Dataset stats ==================== >>
Shape: (5000, 31)
Memory: 1.22 MB
Scaled: False
Outlier values: 565 (0.5%)
-------------------------------------
Train set size: 4000
Test set size: 1000
-------------------------------------
|   |     dataset |       train |       test |
| - | ----------- | ----------- | ---------- |
| 0 | 4731 (17.6) | 3785 (17.6) | 946 (17.5) |
| 1 |   269 (1.0) |   215 (1.0) |   54 (1.0) |
In [4]:
# Let's have a look at the data. Note that, since the input wasn't
# a dataframe, atom has given default names to the columns.
atom.head()
Out[4]:
|   | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | ... | x21 | x22 | x23 | x24 | x25 | x26 | x27 | x28 | x29 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.778228 | -0.812052 | -0.896615 | -3.499848 | -1.198172 | 0.670656 | -0.740861 | 0.723931 | 0.987058 | -1.280431 | ... | -1.372013 | -1.582733 | -5.205504 | -0.132154 | -2.046509 | 1.171858 | -0.937969 | -1.000315 | -5.039237 | 0 |
| 1 | -0.376098 | -2.040344 | -1.187582 | -0.543320 | 2.283740 | -0.199718 | 1.371512 | 2.533223 | -1.065436 | 1.181449 | ... | 0.785491 | 2.542636 | -8.172737 | -0.014625 | -0.476868 | 1.121809 | -4.180679 | 4.526489 | -3.113989 | 0 |
| 2 | 0.098476 | 1.255913 | 0.040136 | -7.349154 | 0.911161 | -2.400060 | -0.995364 | 3.451334 | 3.276193 | 1.191030 | ... | 3.323461 | -0.211275 | -0.646718 | -0.804356 | 3.738427 | 0.608230 | -0.404319 | 1.287946 | 0.236896 | 0 |
| 3 | -2.733745 | -2.329786 | 0.883725 | 9.381290 | 1.975243 | -2.876693 | 0.169162 | -2.638822 | 0.432828 | -1.093796 | ... | 0.044253 | -1.316711 | 4.813267 | 0.087488 | 3.813586 | -1.438706 | -0.044852 | 3.106058 | -4.622981 | 0 |
| 4 | -1.194097 | 2.582772 | 0.964323 | 10.242885 | 0.636551 | -1.848875 | -0.981837 | 0.903811 | 0.747750 | 1.526570 | ... | -3.227021 | 1.712595 | -7.194171 | -0.309420 | -8.076342 | 0.742821 | -0.876746 | 5.963986 | -1.404056 | 0 |

5 rows × 31 columns
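If you want meaningful column names instead, you can wrap the arrays in a pandas DataFrame before initializing atom, since dataframe inputs keep their column names. A minimal sketch (the names below are purely illustrative):

import pandas as pd

# Wrap the arrays with custom column names before passing them to atom
X_named = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
atom = ATOMClassifier(X_named, y, test_size=0.2, random_state=1)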
In [5]:
# Let's start reducing the number of features
atom.feature_selection("RFE", solver="RF", n_features=12)
Fitting FeatureSelector...
Performing feature selection ...
 --> RFE selected 12 features from the dataset.
   >>> Dropping feature x1 (rank 12).
   >>> Dropping feature x2 (rank 8).
   >>> Dropping feature x4 (rank 2).
   >>> Dropping feature x6 (rank 17).
   >>> Dropping feature x7 (rank 14).
   >>> Dropping feature x10 (rank 19).
   >>> Dropping feature x11 (rank 3).
   >>> Dropping feature x12 (rank 11).
   >>> Dropping feature x13 (rank 9).
   >>> Dropping feature x14 (rank 13).
   >>> Dropping feature x16 (rank 5).
   >>> Dropping feature x18 (rank 16).
   >>> Dropping feature x19 (rank 4).
   >>> Dropping feature x22 (rank 7).
   >>> Dropping feature x23 (rank 10).
   >>> Dropping feature x24 (rank 18).
   >>> Dropping feature x25 (rank 6).
   >>> Dropping feature x26 (rank 15).
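To double-check which columns survived the selection, you can inspect the remaining feature names, assuming this ATOM version exposes the features data attribute:

atom.features  # names of the columns kept by RFE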
In [6]:
# Fit a model directly on the imbalanced data
atom.run("RF", metric="ba")
Training ========================= >>
Models: RF
Metric: balanced_accuracy

Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.6111
Time elapsed: 0.703s
-------------------------------------------------
Total time: 0.703s

Final results ==================== >>
Duration: 0.703s
-------------------------------------
Random Forest --> balanced_accuracy: 0.6111 ~
In [7]:
# The transformer and the models have been added to the branch
atom.branch
Out[7]:
Branch: master
 --> Pipeline:
   >>> FeatureSelector
     --> strategy: RFE
     --> solver: RF_class
     --> n_features: 12
     --> max_frac_repeated: 1.0
     --> max_correlation: 1.0
     --> kwargs: {}
 --> Models: RF
Oversampling
In [8]:
# Create a new branch for oversampling
atom.branch = "oversample"
New branch oversample successfully created.
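Assigning an existing branch name switches the active branch, so you can hop back and forth at any point. An illustrative sketch, not run in this example:

atom.branch = "master"      # switch back to the original branch
atom.branch = "oversample"  # return to the new branch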
In [9]:
# Perform oversampling of the minority class
atom.balance(strategy="smote")
Oversampling with SMOTE...
 --> Adding 3570 samples to class 1.
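The balance method wraps the corresponding imbalanced-learn sampler, and the empty kwargs: {} shown in the branch output indicates that extra keyword arguments are forwarded to it. For example, SMOTE's sampling_strategy parameter controls how far to resample (a hedged sketch, not run here):

atom.balance(strategy="smote", sampling_strategy=0.5)  # resample minority class to 50% of the majority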
In [10]:
atom.classes  # Check the balanced training set! Only the train set is resampled; the test set keeps its original distribution.
Out[10]:
|   | dataset | train | test |
|---|---------|-------|------|
| 0 | 4731    | 3785  | 946  |
| 1 | 3839    | 3785  | 54   |
In [11]:
# Train another model on the new branch. Add a tag after
# the model's acronym to distinguish it from the first model
atom.run("rf_os") # os for oversample
Training ========================= >>
Models: RF_os
Metric: balanced_accuracy

Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.7214
Time elapsed: 1.406s
-------------------------------------------------
Total time: 1.406s

Final results ==================== >>
Duration: 1.406s
-------------------------------------
Random Forest --> balanced_accuracy: 0.7214 ~
Undersampling
In [12]:
# Create the undersampling branch
# Split from master to avoid adopting the oversampling transformer
atom.branch = "undersample_from_master"
New branch undersample successfully created.
In [13]:
atom.classes # In this branch, the data is still imbalanced
Out[13]:
|   | dataset | train | test |
|---|---------|-------|------|
| 0 | 4731    | 3785  | 946  |
| 1 | 269     | 215   | 54   |
In [14]:
# Perform undersampling of the majority class
atom.balance(strategy="NearMiss")
Undersampling with NearMiss...
 --> Removing 3570 samples from class 0.
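NearMiss keeps only the majority-class samples that lie closest to the minority class. imbalanced-learn implements three selection heuristics, which you could choose through the forwarded keyword arguments (illustrative, not run here):

atom.balance(strategy="NearMiss", version=2)  # version 2 selects by distance to the farthest minority samples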
In [15]:
atom.run("rf_us")
atom.run("rf_us")
Training ========================= >>
Models: RF_us
Metric: balanced_accuracy

Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.7225
Time elapsed: 0.156s
-------------------------------------------------
Total time: 0.156s

Final results ==================== >>
Duration: 0.156s
-------------------------------------
Random Forest --> balanced_accuracy: 0.7225 ~
In [16]:
# Check that the branch only contains the desired transformers
atom.branch
Out[16]:
Branch: undersample
 --> Pipeline:
   >>> FeatureSelector
     --> strategy: RFE
     --> solver: RF_class
     --> n_features: 12
     --> max_frac_repeated: 1.0
     --> max_correlation: 1.0
     --> kwargs: {}
   >>> Balancer
     --> strategy: NearMiss
     --> kwargs: {}
 --> Models: RF_us
In [17]:
# Visualize the complete pipeline
atom.plot_pipeline()
Analyze results
In [18]:
atom.evaluate()
Out[18]:
|       | accuracy | average_precision | balanced_accuracy | f1 | jaccard | matthews_corrcoef | precision | recall | roc_auc |
|-------|----------|-------------------|-------------------|----|---------|-------------------|-----------|--------|---------|
| RF    | 0.958 | 0.711949 | 0.611111 | 0.363636 | 0.222222 | 0.461276 | 1.000000 | 0.222222 | 0.944405 |
| RF_os | 0.952 | 0.585779 | 0.721439 | 0.510204 | 0.342466 | 0.488058 | 0.568182 | 0.462963 | 0.940431 |
| RF_us | 0.508 | 0.471905 | 0.722496 | 0.174497 | 0.095588 | 0.201866 | 0.095941 | 0.962963 | 0.870762 |
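Both resampling strategies lift balanced accuracy well above the baseline, but with very different trade-offs: oversampling improves recall while keeping precision and overall accuracy reasonable, whereas undersampling reaches near-perfect recall at a steep cost in precision and accuracy. Since evaluate returns these scores as a pandas DataFrame (as the table above suggests), you can sort the comparison by the metric you care about:

atom.evaluate().sort_values("balanced_accuracy", ascending=False)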
In [19]:
atom.plot_prc()
In [20]:
atom.plot_roc()