Feature engineering
This example shows how to use automated feature generation to improve a model's performance.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow, training a binary classifier on the target column RainTomorrow.
Load the data
In [1]:
# Import packages
import pandas as pd
from atom import ATOMClassifier
In [2]:
# Load data
X = pd.read_csv("./datasets/weatherAUS.csv")

# Let's have a look
X.head()
Out[2]:
| | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
| 1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
| 2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
| 3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
| 4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |

5 rows × 22 columns
Run the pipeline
In [3]:
# Initialize atom and apply data cleaning
atom = ATOMClassifier(X, n_rows=1e4, test_size=0.2, warnings=False, verbose=0)
atom.impute(strat_num="knn", strat_cat="remove", max_nan_rows=0.8)
atom.encode(max_onehot=10, frac_to_other=0.04)
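For context, max_onehot=10 one-hot encodes every categorical column with at most 10 classes, while frac_to_other=0.04 groups classes occurring in less than 4% of the rows into a shared "other" class. A rough pandas sketch of just the one-hot part (an illustration under those assumptions, not atom's actual encoder):

# Hypothetical sketch of the one-hot step: encode only low-cardinality categoricals
low_card = [c for c in X.select_dtypes("object") if X[c].nunique() <= 10]
X_enc = pd.get_dummies(X, columns=low_card)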
In [4]:
atom.verbose = 2  # Increase verbosity to see the output

# Let's see how a LightGBM model performs
atom.run("LGB", metric="auc")
Training ========================= >>
Models: LGB
Metric: roc_auc

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.9829
Test evaluation --> roc_auc: 0.8684
Time elapsed: 0.285s
-------------------------------------------------
Total time: 0.285s

Final results ==================== >>
Duration: 0.285s
-------------------------------------
LightGBM --> roc_auc: 0.8684
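For reference, this run is roughly equivalent to fitting lightgbm on atom's training split and scoring the held-out split by hand. A minimal standalone sketch (default hyperparameters; not atom's internals):

# Hedged sketch of what atom.run("LGB", metric="auc") evaluates
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

model = LGBMClassifier().fit(atom.X_train, atom.y_train)
print(roc_auc_score(atom.y_test, model.predict_proba(atom.X_test)[:, 1]))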
Deep Feature Synthesis
In [5]:
# Since we are going to compare different datasets,
# we need to create separate branches
atom.branch = "dfs"
New branch dfs successfully created.
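Branches keep differently transformed copies of the dataset side by side. Assigning an unused name (as above) creates a new branch from the current one; assigning an existing name should switch back to it, so you can hop between datasets at any point (left commented out here to keep the dfs branch active):

# atom.branch = "master"  # switch back to the original data
# atom.branch = "dfs"     # return to this branch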
In [6]:
# Create 50 new features using DFS
atom.feature_generation("dfs", n_features=50, operators=["add", "sub", "log"])
Fitting FeatureGenerator...
Creating new features...
--> 50 new features were added.
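Deep Feature Synthesis builds new columns by applying the chosen operators to combinations of existing features: pairwise sums and differences plus elementwise logarithms in this case. A minimal pandas sketch of the add/sub idea (an illustration only; atom's DFS strategy is, to my understanding, built on the featuretools package):

# Illustrative sketch of the "add" and "sub" operators (not atom's code)
from itertools import combinations

num = atom.X_train.select_dtypes("number")
pairs = {}
for a, b in combinations(num.columns, 2):
    pairs[f"{a} + {b}"] = num[a] + num[b]
    pairs[f"{a} - {b}"] = num[a] - num[b]
dfs_like = pd.DataFrame(pairs)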
In [7]:
# The warnings tell us that some operators created missing values!
# We can see the columns with missing values using the nans attribute
atom.nans
Out[7]:
LOG(Sunshine)    148
dtype: int64
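A plausible explanation (an inference from the data, not a statement about atom's internals): the log operator is undefined for non-positive inputs, and Sunshine contains zeros, so those rows end up flagged as missing. A quick numpy check:

import numpy as np

# log is undefined at 0 and for negatives (numpy emits RuntimeWarnings)
np.log(np.array([5.0, 0.0, -1.0]))  # array([1.6094..., -inf, nan])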
In [8]:
# Turn off warnings in the future
atom.warnings = False

# Impute the data again to get rid of the missing values
atom.impute(strat_num="knn", strat_cat="remove", max_nan_rows=0.8)
Fitting Imputer...
Imputing missing values...
--> Imputing 148 missing values using the KNN imputer in feature LOG(Sunshine).
In [9]:
# 50 new features may be too much...
# Let's check for multicollinearity and use RFECV to reduce the number
atom.feature_selection(
    strategy="RFECV",
    solver="LGB",
    n_features=30,
    scoring="auc",
    max_correlation=0.98,
)
Fitting FeatureSelector...
Performing feature selection ...
--> Feature Location was removed due to low variance. Value 0.12987012987012986 repeated in 506798.0% of the rows.
--> Feature Cloud3pm + RainToday_Yes was removed due to collinearity with another feature.
--> Feature Cloud3pm - RainToday_No was removed due to collinearity with another feature.
--> Feature Cloud3pm - RainToday_other was removed due to collinearity with another feature.
--> Feature Cloud9am + Humidity3pm was removed due to collinearity with another feature.
--> Feature Cloud9am + WindGustDir was removed due to collinearity with another feature.
--> Feature Cloud9am - RainToday_other was removed due to collinearity with another feature.
--> Feature Cloud9am - WindDir3pm was removed due to collinearity with another feature.
--> Feature Evaporation + WindDir9am was removed due to collinearity with another feature.
--> Feature Humidity3pm + RainToday_No was removed due to collinearity with another feature.
--> Feature Humidity9am + RainToday_No was removed due to collinearity with another feature.
--> Feature LOG(Pressure9am) was removed due to collinearity with another feature.
--> Feature Location - RainToday_Yes was removed due to collinearity with another feature.
--> Feature MaxTemp + Temp3pm was removed due to collinearity with another feature.
--> Feature MaxTemp + WindDir3pm was removed due to collinearity with another feature.
--> Feature MaxTemp + WindDir9am was removed due to collinearity with another feature.
--> Feature MinTemp + WindDir9am was removed due to collinearity with another feature.
--> Feature Pressure3pm - WindDir9am was removed due to collinearity with another feature.
--> Feature Pressure3pm - WindGustDir was removed due to collinearity with another feature.
--> Feature Pressure9am - RainToday_No was removed due to collinearity with another feature.
--> Feature Pressure9am - Temp3pm was removed due to collinearity with another feature.
--> Feature RainToday_No + Temp3pm was removed due to collinearity with another feature.
--> Feature RainToday_No - WindDir3pm was removed due to collinearity with another feature.
--> Feature RainToday_Yes + WindGustSpeed was removed due to collinearity with another feature.
--> Feature RainToday_Yes - WindDir9am was removed due to collinearity with another feature.
--> Feature Sunshine + WindDir3pm was removed due to collinearity with another feature.
--> RFECV selected 40 features from the dataset.
   >>> Dropping feature Cloud9am (rank 3).
   >>> Dropping feature RainToday_No (rank 7).
   >>> Dropping feature RainToday_Yes (rank 6).
   >>> Dropping feature RainToday_other (rank 2).
   >>> Dropping feature LOG(Humidity3pm) (rank 8).
   >>> Dropping feature LOG(Sunshine) (rank 4).
   >>> Dropping feature Location - WindDir9am (rank 5).
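RFECV (recursive feature elimination with cross-validation) repeatedly fits the solver, prunes the weakest features, and keeps the subset with the best cross-validated score. For context, a sketch of the underlying scikit-learn API (atom wires this up internally; the mapping of n_features to min_features_to_select is my assumption):

# Standalone sketch of the RFECV strategy with scikit-learn
from lightgbm import LGBMClassifier
from sklearn.feature_selection import RFECV

rfecv = RFECV(
    estimator=LGBMClassifier(),
    min_features_to_select=30,
    scoring="roc_auc",
    cv=5,
)
rfecv.fit(atom.X_train, atom.y_train)
print(rfecv.n_features_)  # number of features deemed optimal by the CV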
In [10]:
# The collinear attribute shows what features were removed due to multicollinearity
atom.collinear
Out[10]:
| | drop_feature | correlated_feature | correlation_value |
|---|---|---|---|
| 0 | Cloud3pm + RainToday_Yes | Cloud3pm | 0.98395 |
| 1 | Cloud3pm - RainToday_No | Cloud3pm, Cloud3pm + RainToday_Yes | 0.9837, 0.99946 |
| 2 | Cloud3pm - RainToday_other | Cloud3pm, Cloud3pm + RainToday_Yes, Cloud3pm -... | 0.99939, 0.98364, 0.98222 |
| 3 | Cloud9am + Humidity3pm | Humidity3pm | 0.99517 |
| 4 | Cloud9am + WindGustDir | Cloud9am | 0.99978 |
| 5 | Cloud9am - RainToday_other | Cloud9am, Cloud9am + WindGustDir | 0.99948, 0.99926 |
| 6 | Cloud9am - WindDir3pm | Cloud9am, Cloud9am + WindGustDir, Cloud9am - R... | 0.99979, 0.99938, 0.99927 |
| 7 | Evaporation + WindDir9am | Evaporation | 0.99988 |
| 8 | Humidity3pm + RainToday_No | Humidity3pm, Cloud9am + Humidity3pm | 0.99982, 0.99477 |
| 9 | Humidity9am + RainToday_No | Humidity9am | 0.99979 |
| 10 | LOG(Pressure9am) | Pressure9am | 0.99999 |
| 11 | Location - RainToday_Yes | RainToday_No, RainToday_Yes | 0.98392, -1.0 |
| 12 | MaxTemp + Temp3pm | MaxTemp, Temp3pm | 0.99441, 0.99407 |
| 13 | MaxTemp + WindDir3pm | MaxTemp, MaxTemp + Temp3pm | 0.99998, 0.9944 |
| 14 | MaxTemp + WindDir9am | MaxTemp, MaxTemp + Temp3pm, MaxTemp + WindDir3pm | 0.99997, 0.99431, 0.99997 |
| 15 | MinTemp + WindDir9am | MinTemp | 0.99997 |
| 16 | Pressure3pm - WindDir9am | Pressure3pm | 0.99997 |
| 17 | Pressure3pm - WindGustDir | Pressure3pm, Pressure3pm - WindDir9am | 0.99998, 0.99997 |
| 18 | Pressure9am - RainToday_No | Pressure9am, LOG(Pressure9am) | 0.99825, 0.99821 |
| 19 | Pressure9am - Temp3pm | Pressure3pm - Temp3pm | 0.9858 |
| 20 | RainToday_No + Temp3pm | Temp3pm, MaxTemp + Temp3pm | 0.99826, 0.99222 |
| 21 | RainToday_No - WindDir3pm | RainToday_No | 0.99362 |
| 22 | RainToday_Yes + WindGustSpeed | WindGustSpeed | 0.99952 |
| 23 | RainToday_Yes - WindDir9am | RainToday_Yes, Location - RainToday_Yes | 0.99205, -0.99205 |
| 24 | Sunshine + WindDir3pm | Sunshine | 0.99986 |
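Any of these pairs can be verified by hand. For instance, recomputing the correlation behind row 12 (MaxTemp + Temp3pm vs. MaxTemp) with pandas, using columns that survived the selection:

# Recompute one reported correlation manually
df = atom.dataset
print((df["MaxTemp"] + df["Temp3pm"]).corr(df["MaxTemp"]))  # ~0.994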
In [11]:
# After applying RFECV, we can plot the score per number of features
atom.plot_rfecv()
In [12]:
# Let's see how the model performs now
# Add a tag to the model acronym so it doesn't overwrite the previous LGB
atom.run("LGB_dfs")
Training ========================= >>
Models: LGB_dfs
Metric: roc_auc

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.9959
Test evaluation --> roc_auc: 0.8761
Time elapsed: 0.503s
-------------------------------------------------
Total time: 0.503s

Final results ==================== >>
Duration: 0.503s
-------------------------------------
LightGBM --> roc_auc: 0.8761
Genetic Feature Generation
In [13]:
# Create another branch for the genetic features
# Split from master to avoid the dfs features
atom.branch = "gfg_from_master"
New branch gfg successfully created.
In [14]:
# Create new features using Genetic Programming
atom.feature_generation(strategy="GFG", n_features=20)
Fitting FeatureGenerator...

        |   Population Average   |            Best Individual             |
 Gen   Length    Fitness   Length    Fitness   OOB Fitness   Time Left
   0     3.10   0.131523        4   0.508444           N/A      18.68s
   1     3.01   0.343716        5   0.517489           N/A      19.12s
   2     3.55   0.425475        8   0.528016           N/A      17.39s
   3     4.31   0.448246        8   0.535803           N/A      17.79s
   4     6.32   0.426331       20   0.557325           N/A      15.30s
   5     8.77   0.442598       14   0.561713           N/A      14.63s
   6    10.73   0.455358       21   0.565327           N/A      13.84s
   7    13.14   0.469143       25   0.567983           N/A      13.02s
   8    15.02   0.50451        25   0.567983           N/A      11.98s
   9    13.37   0.507931       20   0.567567           N/A      10.75s
  10    12.41   0.508695       19   0.569125           N/A       9.69s
  11    13.05   0.514911       19   0.569125           N/A       8.56s
  12    14.21   0.526793       19   0.569125           N/A       7.50s
  13    14.85   0.529131       19   0.569125           N/A       7.31s
  14    15.00   0.529909       19   0.569125           N/A       5.36s
  15    14.70   0.529649       19   0.569125           N/A       4.29s
  16    14.89   0.528901       19   0.569125           N/A       3.24s
  17    15.08   0.531022       19   0.569125           N/A       2.15s
  18    15.08   0.528407       19   0.569125           N/A       1.09s
  19    14.69   0.529187       19   0.568676           N/A       0.00s

Creating new features...
--> Dropping 14 features due to repetition.
--> 6 new features were added.
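Genetic feature generation evolves candidate formulas: a population of random expressions is scored on its relationship with the target, and the fittest ones are mutated and recombined over the generations shown above. As far as I know, atom's GFG strategy wraps gplearn's SymbolicTransformer; a standalone sketch of that API (parameters here are illustrative, not atom's settings):

# Standalone sketch of genetic feature generation with gplearn
from gplearn.genetic import SymbolicTransformer

gfg = SymbolicTransformer(
    generations=20,
    population_size=1000,
    n_components=20,  # candidate features to return
    function_set=["add", "sub", "mul", "div"],
)
new_features = gfg.fit_transform(atom.X_train, atom.y_train)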
In [15]:
# We can see each feature's fitness and description through the genetic_features attribute
atom.genetic_features
Out[15]:
| | name | description | fitness |
|---|---|---|---|
| 0 | feature 24 | mul(mul(add(WindGustSpeed, Humidity3pm), sub(C... | 0.549676 |
| 1 | feature 25 | mul(mul(add(WindGustSpeed, sub(Cloud3pm, sub(S... | 0.549463 |
| 2 | feature 26 | mul(mul(sub(Sunshine, add(WindGustSpeed, Humid... | 0.550944 |
| 3 | feature 27 | mul(mul(add(WindGustSpeed, Humidity3pm), sub(C... | 0.551330 |
| 4 | feature 28 | mul(add(WindGustSpeed, Humidity3pm), mul(sub(C... | 0.551330 |
| 5 | feature 29 | mul(mul(sub(Sunshine, add(WindGustSpeed, Humid... | 0.551330 |
In [16]:
# Fit the model again
atom.run("LGB_gfg", metric="auc")
Training ========================= >>
Models: LGB_gfg
Metric: roc_auc

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.9849
Test evaluation --> roc_auc: 0.8694
Time elapsed: 0.336s
-------------------------------------------------
Total time: 0.336s

Final results ==================== >>
Duration: 0.338s
-------------------------------------
LightGBM --> roc_auc: 0.8694
Analyze results
In [17]:
# Use atom's plots to compare the three models
atom.palette = "Paired"
atom.plot_roc(dataset="both")
atom.reset_aesthetics()
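Besides the plots, the results attribute gives a tabular overview of every trained model, which is handy for a quick side-by-side of the three runs:

# Compare all trained models in one overview
atom.results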
In [18]:
# For busy plots, it can be useful to draw them side by side on a canvas
with atom.canvas(1, 3, figsize=(20, 8)):
    atom.lgb.plot_feature_importance(show=10, title="LGB")
    atom.lgb_dfs.plot_feature_importance(show=10, title="LGB + DFS")
    atom.lgb_gfg.plot_feature_importance(show=10, title="LGB + GFG")
In [19]:
# We can check the feature importance with other plots as well
atom.plot_permutation_importance(models=["LGB_DFS", "LGB_GFG"], show=10)
In [20]:
# SHAP decision plot for rows 0 to 10, showing the 15 most important features
atom.LGB_gfg.decision_plot(index=(0, 10), show=15)