Feature engineering¶
This example shows how to use automated feature generation to improve a model's performance.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow training a binary classifier on target RainTomorrow.
Load the data¶
In [1]:
                Copied!
                
                
            # Import packages
import pandas as pd
from atom import ATOMClassifier
# Import packages
import pandas as pd
from atom import ATOMClassifier
    
        In [2]:
                Copied!
                
                
            # Load data
X = pd.read_csv("./datasets/weatherAUS.csv")
# Let's have a look
X.head()
# Load data
X = pd.read_csv("./datasets/weatherAUS.csv")
# Let's have a look
X.head()
    
        Out[2]:
| Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 | 
| 1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 | 
| 2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 | 
| 3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 | 
| 4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 | 
5 rows × 22 columns
Run the pipeline¶
In [3]:
                Copied!
                
                
            # Initialize atom and apply data cleaning
atom = ATOMClassifier(X, n_rows=1e4, test_size=0.2, verbose=0, random_state=1)
atom.clean()
atom.impute(strat_num="knn", strat_cat="remove", min_frac_rows=0.8)
atom.encode(max_onehot=10, frac_to_other=0.04)
# Initialize atom and apply data cleaning
atom = ATOMClassifier(X, n_rows=1e4, test_size=0.2, verbose=0, random_state=1)
atom.clean()
atom.impute(strat_num="knn", strat_cat="remove", min_frac_rows=0.8)
atom.encode(max_onehot=10, frac_to_other=0.04)
    
        is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
In [4]:
                Copied!
                
                
            atom.verbose = 2  # Increase verbosity to see the output
# Let's see how a LightGBM model performs
atom.run('LGB', metric='auc')
atom.verbose = 2  # Increase verbosity to see the output
# Let's see how a LightGBM model performs
atom.run('LGB', metric='auc')
    
        Training ===================================== >> Models: LGB Metric: roc_auc Results for LightGBM: Fit --------------------------------------------- Train evaluation --> roc_auc: 0.9854 Test evaluation --> roc_auc: 0.8788 Time elapsed: 0.303s ------------------------------------------------- Total time: 0.303s Final results ========================= >> Duration: 0.304s ------------------------------------------ LightGBM --> roc_auc: 0.8788
Deep Feature Synthesis¶
In [5]:
                Copied!
                
                
            # Since we are going to compare different datasets,
# we need to create separate branches
atom.branch = "dfs"
# Since we are going to compare different datasets,
# we need to create separate branches
atom.branch = "dfs"
    
        New branch dfs successfully created!
In [6]:
                Copied!
                
                
            # Create 50 new features using DFS
atom.feature_generation("dfs", n_features=50, operators=["add", "sub", "log"])
# Create 50 new features using DFS
atom.feature_generation("dfs", n_features=50, operators=["add", "sub", "log"])
    
        Fitting FeatureGenerator... Creating new features... --> 50 new features were added to the dataset.
divide by zero encountered in log
In [7]:
                Copied!
                
                
            # The warnings warn us that some operators created missing values!
# We can see the columns with missing values using the nans attribute
atom.nans
# The warnings warn us that some operators created missing values!
# We can see the columns with missing values using the nans attribute
atom.nans
    
        Out[7]:
LOG(Sunshine) 156 LOG(WindSpeed3pm) 34 dtype: int64
In [8]:
                Copied!
                
                
            # Turn off warnings in the future
atom.warnings = False
# Impute the data again to get rid of the missing values
atom.impute(strat_num="knn", strat_cat="remove", min_frac_rows=0.8)
# Turn off warnings in the future
atom.warnings = False
# Impute the data again to get rid of the missing values
atom.impute(strat_num="knn", strat_cat="remove", min_frac_rows=0.8)
    
        Fitting Imputer... Imputing missing values... --> Imputing 156 missing values using the KNN imputer in feature LOG(Sunshine). --> Imputing 34 missing values using the KNN imputer in feature LOG(WindSpeed3pm).
In [9]:
                Copied!
                
                
            # 50 new features may be to much...
# Let's check for multicollinearity and use RFECV to reduce the number
atom.feature_selection(
    strategy="RFECV",
    solver="LGB",
    n_features=30,
    scoring="auc",
    max_correlation=0.98,
)
# 50 new features may be to much...
# Let's check for multicollinearity and use RFECV to reduce the number
atom.feature_selection(
    strategy="RFECV",
    solver="LGB",
    n_features=30,
    scoring="auc",
    max_correlation=0.98,
)
    
        Fitting FeatureSelector... Performing feature selection ... --> Feature Location was removed due to low variance. Value 0.2234864447253828 repeated in 100% of the rows. --> Feature Cloud3pm + Humidity3pm was removed due to collinearity with another feature. --> Feature Cloud3pm + RainToday_No was removed due to collinearity with another feature. --> Feature Cloud3pm - Location was removed due to collinearity with another feature. --> Feature Cloud3pm - RainToday_No was removed due to collinearity with another feature. --> Feature Evaporation + WindGustDir was removed due to collinearity with another feature. --> Feature Evaporation - WindDir3pm was removed due to collinearity with another feature. --> Feature Humidity9am - WindDir3pm was removed due to collinearity with another feature. --> Feature Location + MinTemp was removed due to collinearity with another feature. --> Feature Location + RainToday_No was removed due to collinearity with another feature. --> Feature Location + WindDir3pm was removed due to collinearity with another feature. --> Feature Location + WindGustDir was removed due to collinearity with another feature. --> Feature Location + WindSpeed3pm was removed due to collinearity with another feature. --> Feature MaxTemp + RainToday_Yes was removed due to collinearity with another feature. --> Feature RainToday_No - WindDir9am was removed due to collinearity with another feature. --> Feature RainToday_Yes + WindDir9am was removed due to collinearity with another feature. --> Feature RainToday_Yes + WindSpeed3pm was removed due to collinearity with another feature. --> Feature RainToday_other - Temp9am was removed due to collinearity with another feature. --> Feature Sunshine + WindDir9am was removed due to collinearity with another feature. --> Feature Temp3pm - WindGustDir was removed due to collinearity with another feature. --> Feature WindDir9am + WindSpeed3pm was removed due to collinearity with another feature. --> Feature WindDir9am + WindSpeed9am was removed due to collinearity with another feature. --> Feature WindGustDir + WindGustSpeed was removed due to collinearity with another feature. --> Feature WindGustDir + WindSpeed9am was removed due to collinearity with another feature. --> The RFECV selected 44 features from the dataset. >>> Dropping feature WindSpeed3pm (rank 2). >>> Dropping feature RainToday_Yes (rank 5). >>> Dropping feature RainToday_No (rank 6). >>> Dropping feature LOG(WindSpeed3pm) (rank 4). >>> Dropping feature RainToday_other - WindSpeed9am (rank 3).
In [10]:
                Copied!
                
                
            # The collinear attribute shows what features were removed due to multicollinearity
atom.collinear
# The collinear attribute shows what features were removed due to multicollinearity
atom.collinear
    
        Out[10]:
| drop_feature | correlated_feature | correlation_value | |
|---|---|---|---|
| 0 | Cloud3pm + Humidity3pm | Humidity3pm | 0.99558 | 
| 1 | Cloud3pm + RainToday_No | Cloud3pm | 0.98138 | 
| 2 | Cloud3pm - Location | Cloud3pm, Cloud3pm + RainToday_No | 1.0, 0.98138 | 
| 3 | Cloud3pm - RainToday_No | Cloud3pm, Cloud3pm - Location | 0.98407, 0.98407 | 
| 4 | Evaporation + WindGustDir | Evaporation | 0.99989 | 
| 5 | Evaporation - WindDir3pm | Evaporation, Evaporation + WindGustDir | 0.99991, 0.9997 | 
| 6 | Humidity9am - WindDir3pm | Humidity9am | 1.0 | 
| 7 | Location + MinTemp | MinTemp | 1.0 | 
| 8 | Location + RainToday_No | RainToday_Yes, RainToday_No | -0.9836, 1.0 | 
| 9 | Location + WindDir3pm | WindDir3pm | 1.0 | 
| 10 | Location + WindGustDir | WindGustDir | 1.0 | 
| 11 | Location + WindSpeed3pm | WindSpeed3pm | 1.0 | 
| 12 | MaxTemp + RainToday_Yes | MaxTemp | 0.99827 | 
| 13 | RainToday_No - WindDir9am | RainToday_No, Location + RainToday_No | 0.9905, 0.9905 | 
| 14 | RainToday_Yes + WindDir9am | RainToday_Yes, RainToday_No - WindDir9am | 0.99031, -0.98433 | 
| 15 | RainToday_Yes + WindSpeed3pm | WindSpeed3pm, Location + WindSpeed3pm | 0.99886, 0.99886 | 
| 16 | RainToday_other - Temp9am | Temp9am, RainToday_No - Temp9am | -0.99993, 0.9977 | 
| 17 | Sunshine + WindDir9am | Sunshine | 0.99978 | 
| 18 | Temp3pm - WindGustDir | Temp3pm | 0.99997 | 
| 19 | WindDir9am + WindSpeed3pm | WindSpeed3pm, Location + WindSpeed3pm, RainTod... | 0.99998, 0.99998, 0.99886, -0.99995 | 
| 20 | WindDir9am + WindSpeed9am | WindSpeed9am, RainToday_other - WindSpeed9am | 0.99998, -0.99994 | 
| 21 | WindGustDir + WindGustSpeed | WindGustSpeed | 0.99999 | 
| 22 | WindGustDir + WindSpeed9am | WindSpeed9am, RainToday_other - WindSpeed9am, ... | 0.99998, -0.99995, 0.99998 | 
In [11]:
                Copied!
                
                
            # After applying RFECV, we can plot the score per number of features
atom.plot_rfecv()
# After applying RFECV, we can plot the score per number of features
atom.plot_rfecv()
    
        In [12]:
                Copied!
                
                
            # Let's see how the model performs now
# Add a tag to the model's acronym to not overwrite previous LGB
atom.run("LGB_dfs")
# Let's see how the model performs now
# Add a tag to the model's acronym to not overwrite previous LGB
atom.run("LGB_dfs")
    
        Training ===================================== >> Models: LGB_dfs Metric: roc_auc Results for LightGBM: Fit --------------------------------------------- Train evaluation --> roc_auc: 0.9917 Test evaluation --> roc_auc: 0.8691 Time elapsed: 0.493s ------------------------------------------------- Total time: 0.493s Final results ========================= >> Duration: 0.493s ------------------------------------------ LightGBM --> roc_auc: 0.8691
Genetic Feature Generation¶
In [13]:
                Copied!
                
                
            # Create another branch for the genetic features
# Split form master to avoid the dfs features
atom.branch = "gfg_from_master"
# Create another branch for the genetic features
# Split form master to avoid the dfs features
atom.branch = "gfg_from_master"
    
        New branch gfg successfully created!
In [14]:
                Copied!
                
                
            # Create new features using Genetic Programming
atom.feature_generation(
    strategy='GFG',
    n_features=20,
    generations=10,
    population=2000,
)
# Create new features using Genetic Programming
atom.feature_generation(
    strategy='GFG',
    n_features=20,
    generations=10,
    population=2000,
)
    
        Fitting FeatureGenerator...
    |   Population Average    |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0     3.17         0.126131        3          0.50226              N/A     10.12s
   1     3.07         0.340705        5         0.514677              N/A      9.69s
   2     3.38         0.442159        9         0.520907              N/A      8.84s
   3     3.98         0.454125       13         0.527897              N/A      6.96s
   4     5.77         0.472497        9         0.535088              N/A      5.89s
   5     7.31         0.467921       15         0.541857              N/A      4.86s
   6     8.70         0.459723       17         0.544147              N/A      3.62s
   7     9.91         0.452777       19          0.54458              N/A      2.64s
   8    11.41         0.458764       21         0.546345              N/A      1.19s
   9    11.67         0.461799       29         0.546264              N/A      0.00s
Creating new features...
 --> 16 new features were added to the dataset.
In [15]:
                Copied!
                
                
            # We can see the feature's fitness and description through the genetic_features attribute
atom.genetic_features
# We can see the feature's fitness and description through the genetic_features attribute
atom.genetic_features
    
        Out[15]:
| name | description | fitness | |
|---|---|---|---|
| 0 | Feature 24 | add(Sunshine, add(Sunshine, sub(Pressure3pm, s... | 0.517264 | 
| 1 | Feature 25 | add(Sunshine, sub(Pressure3pm, sub(Humidity3pm... | 0.518441 | 
| 2 | Feature 26 | add(Sunshine, add(Sunshine, add(add(Sunshine, ... | 0.526070 | 
| 3 | Feature 27 | add(Sunshine, sub(Pressure3pm, sub(Humidity3pm... | 0.526070 | 
| 4 | Feature 28 | add(Sunshine, sub(Pressure3pm, sub(Humidity3pm... | 0.526070 | 
| 5 | Feature 29 | add(Sunshine, add(add(Sunshine, add(Sunshine, ... | 0.526070 | 
| 6 | Feature 30 | add(Sunshine, sub(Pressure3pm, sub(Humidity3pm... | 0.526070 | 
| 7 | Feature 31 | add(Sunshine, add(Sunshine, add(Sunshine, add(... | 0.523850 | 
| 8 | Feature 32 | add(Sunshine, sub(Pressure3pm, sub(Humidity3pm... | 0.517636 | 
| 9 | Feature 33 | add(Sunshine, sub(Pressure3pm, sub(add(Cloud3p... | 0.527415 | 
| 10 | Feature 34 | add(Sunshine, add(Sunshine, sub(Pressure3pm, s... | 0.527257 | 
| 11 | Feature 35 | add(Sunshine, sub(Pressure3pm, sub(Humidity3pm... | 0.527147 | 
| 12 | Feature 36 | add(Sunshine, add(Sunshine, sub(Pressure3pm, s... | 0.519093 | 
| 13 | Feature 37 | add(Sunshine, add(Sunshine, add(add(Sunshine, ... | 0.525060 | 
| 14 | Feature 38 | add(Sunshine, add(Sunshine, add(Sunshine, sub(... | 0.528303 | 
| 15 | Feature 39 | add(Sunshine, sub(Pressure3pm, sub(Humidity3pm... | 0.530190 | 
In [16]:
                Copied!
                
                
            # Fit the model again
atom.run("LGB_gfg", metric="auc")
# Fit the model again
atom.run("LGB_gfg", metric="auc")
    
        Training ===================================== >> Models: LGB_gfg Metric: roc_auc Results for LightGBM: Fit --------------------------------------------- Train evaluation --> roc_auc: 0.9901 Test evaluation --> roc_auc: 0.8734 Time elapsed: 0.402s ------------------------------------------------- Total time: 0.402s Final results ========================= >> Duration: 0.402s ------------------------------------------ LightGBM --> roc_auc: 0.8734
Analyze results¶
In [17]:
                Copied!
                
                
            # Use atom's plots to compare the three models
atom.palette = "Paired"
atom.plot_roc(dataset="both")
atom.reset_aesthetics()
# Use atom's plots to compare the three models
atom.palette = "Paired"
atom.plot_roc(dataset="both")
atom.reset_aesthetics()
    
        In [18]:
                Copied!
                
                
            # For busy plots it might be useful to use a canvas
with atom.canvas(1, 3, figsize=(20, 8)):
    atom.lgb.plot_feature_importance(show=10, title="LGB")
    atom.lgb_dfs.plot_feature_importance(show=10, title="LGB + DFS")
    atom.lgb_gfg.plot_feature_importance(show=10, title="LGB + GFG")
# For busy plots it might be useful to use a canvas
with atom.canvas(1, 3, figsize=(20, 8)):
    atom.lgb.plot_feature_importance(show=10, title="LGB")
    atom.lgb_dfs.plot_feature_importance(show=10, title="LGB + DFS")
    atom.lgb_gfg.plot_feature_importance(show=10, title="LGB + GFG")
    
        In [19]:
                Copied!
                
                
            # We can check the feature importance with other plots as well
atom.plot_permutation_importance(models=["LGB_DFS", "LGB_GFG"], show=10)
# We can check the feature importance with other plots as well
atom.plot_permutation_importance(models=["LGB_DFS", "LGB_GFG"], show=10)
    
        In [20]:
                Copied!
                
                
            atom.LGB_gfg.decision_plot(index=(-20, -1), show=15)
atom.LGB_gfg.decision_plot(index=(-20, -1), show=15)