Example: Feature engineering
This example shows how to use automated feature generation to improve a model's performance.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.
Load the data
In [1]:
# Import packages
import pandas as pd
from atom import ATOMClassifier
In [2]:
# Load data
X = pd.read_csv("./datasets/weatherAUS.csv")

# Let's have a look
X.head()
Out[2]:
| | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
| 1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
| 2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
| 3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
| 4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |
5 rows × 22 columns
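Before running the pipeline, it can also help to glance at the class balance of the target. A quick extra check with plain pandas (the target is already encoded as 0/1 in this dataset):

# Fraction of rainy vs dry tomorrows in the raw data
X["RainTomorrow"].value_counts(normalize=True)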
Run the pipeline
In [3]:
# Initialize atom and apply data cleaning
atom = ATOMClassifier(X, n_rows=1e4, test_size=0.2, verbose=0)
atom.impute(strat_num="knn", strat_cat="remove", max_nan_rows=0.8)
atom.encode(max_onehot=10, rare_to_value=0.04)
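Since verbosity is off, nothing is printed, but we can verify what the cleaning steps did. A minimal sanity check, assuming atom's dataset attribute (the merged train and test set as a DataFrame):

# Shape after row filtering and encoding; the NaN count should be 0 after imputation
print(atom.dataset.shape)
print(atom.dataset.isna().sum().sum())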
In [4]:
atom.verbose = 2  # Increase verbosity to see the output

# Let's see how a LightGBM model performs
atom.run("LGB", metric="auc")
Training ========================= >>
Models: LGB
Metric: roc_auc


Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.9818
Test evaluation --> roc_auc: 0.8625
Time elapsed: 0.696s
-------------------------------------------------
Total time: 0.696s


Final results ==================== >>
Total time: 0.697s
-------------------------------------
LightGBM --> roc_auc: 0.8625
Deep Feature Synthesis
In [5]:
# Since we are going to compare different datasets,
# we need to create separate branches
atom.branch = "dfs"
New branch dfs successfully created.
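As a side note on branch semantics (a sketch, assuming existing branch names can be reassigned to switch between them): the same attribute hops back and forth between the original and the engineered datasets.

# Switching branches by reassignment
atom.branch = "master"  # back to the branch with the original features
atom.branch = "dfs"     # and return to the new branch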
In [6]:
# Create 50 new features using dfs
atom.feature_generation("dfs", n_features=50, operators=["add", "sub", "log"])
Fitting FeatureGenerator...
Generating new features...
 --> 50 new features were added.
In [7]:
Copied!
# The warnings warn us that some operators created missing values!
# We can see the columns with missing values using the nans attribute
atom.nans
# The warnings warn us that some operators created missing values!
# We can see the columns with missing values using the nans attribute
atom.nans
Out[7]:
NATURAL_LOGARITHM(Cloud3pm)        332
NATURAL_LOGARITHM(Evaporation)      19
NATURAL_LOGARITHM(Rainfall)       6351
dtype: int64
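The missing values come from the log operator: Rainfall, Evaporation and Cloud3pm contain zeros, for which the logarithm is not finite. A minimal numpy illustration of the effect (hypothetical values, not taken from the dataset):

import numpy as np

# log is only defined for positive inputs; zeros and negatives
# yield -inf and NaN (with RuntimeWarnings)
print(np.log(np.array([1.0, 0.0, -1.0])))  # [0. -inf nan]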
In [8]:
# Turn off warnings in the future
atom.warnings = False

# Impute the data again to get rid of the missing values
atom.impute(strat_num="knn", strat_cat="remove", max_nan_rows=0.8)
Fitting Imputer...
Imputing missing values...
 --> Imputing 332 missing values using the KNN imputer in feature NATURAL_LOGARITHM(Cloud3pm).
 --> Imputing 19 missing values using the KNN imputer in feature NATURAL_LOGARITHM(Evaporation).
 --> Imputing 6351 missing values using the KNN imputer in feature NATURAL_LOGARITHM(Rainfall).
In [9]:
# 50 new features may be too much...
# Let's check for multicollinearity and use rfecv to reduce the number
atom.feature_selection(
    strategy="rfecv",
    solver="LGB",
    n_features=30,
    scoring="auc",
    max_correlation=0.98,
)
Fitting FeatureSelector...
Performing feature selection ...
 --> Feature MinTemp was removed due to collinearity with another feature.
 --> Feature Location + MaxTemp was removed due to collinearity with another feature.
 --> Feature Location + Rainfall was removed due to collinearity with another feature.
 --> Feature Sunshine was removed due to collinearity with another feature.
 --> Feature Sunshine + WindGustDir was removed due to collinearity with another feature.
 --> Feature RainToday_No + WindGustSpeed was removed due to collinearity with another feature.
 --> Feature WindSpeed9am was removed due to collinearity with another feature.
 --> Feature WindSpeed3pm was removed due to collinearity with another feature.
 --> Feature Humidity3pm was removed due to collinearity with another feature.
 --> Feature Humidity3pm + RainToday_rare was removed due to collinearity with another feature.
 --> Feature Humidity3pm - WindDir9am was removed due to collinearity with another feature.
 --> Feature NATURAL_LOGARITHM(Pressure9am) was removed due to collinearity with another feature.
 --> Feature Pressure9am - WindGustDir was removed due to collinearity with another feature.
 --> Feature Pressure3pm - WindDir3pm was removed due to collinearity with another feature.
 --> Feature Cloud9am - WindDir3pm was removed due to collinearity with another feature.
 --> Feature RainToday_No - WindDir9am was removed due to collinearity with another feature.
 --> Feature RainToday_Yes was removed due to collinearity with another feature.
 --> Feature Location + RainToday_Yes was removed due to collinearity with another feature.
 --> Feature RainToday_Yes - WindDir3pm was removed due to collinearity with another feature.
 --> Feature Humidity3pm - MinTemp was removed due to collinearity with another feature.
 --> Feature RainToday_rare - WindSpeed3pm was removed due to collinearity with another feature.
 --> rfecv selected 44 features from the dataset.
 --> Dropping feature Location (rank 9).
 --> Dropping feature RainToday_No (rank 8).
 --> Dropping feature RainToday_rare (rank 6).
 --> Dropping feature Location - RainToday_rare (rank 3).
 --> Dropping feature Location - WindGustDir (rank 4).
 --> Dropping feature NATURAL_LOGARITHM(Cloud3pm) (rank 7).
 --> Dropping feature NATURAL_LOGARITHM(Evaporation) (rank 5).
 --> Dropping feature NATURAL_LOGARITHM(Rainfall) (rank 2).
In [10]:
# The collinear attribute shows which features were removed due to multicollinearity
atom.collinear
Out[10]:
| | drop | corr_feature | corr_value |
|---|---|---|---|
| 0 | MinTemp | MinTemp + RainToday_rare | 0.9999 |
| 1 | Location + MaxTemp | MaxTemp | 1.0 |
| 2 | Location + Rainfall | Rainfall | 1.0 |
| 3 | Sunshine | RainToday_rare + Sunshine, Sunshine + WindGustDir | 0.9995, 0.9999 |
| 4 | Sunshine + WindGustDir | Sunshine, RainToday_rare + Sunshine | 0.9999, 0.9993 |
| 5 | RainToday_No + WindGustSpeed | WindGustSpeed | 0.9995 |
| 6 | WindSpeed9am | RainToday_rare + WindSpeed9am | 1.0 |
| 7 | WindSpeed3pm | WindDir9am + WindSpeed3pm | 1.0 |
| 8 | Humidity3pm | Cloud3pm + Humidity3pm, Humidity3pm + RainToda... | 0.996, 1.0, 1.0 |
| 9 | Humidity3pm + RainToday_rare | Humidity3pm, Cloud3pm + Humidity3pm, Humidity3... | 1.0, 0.996, 1.0 |
| 10 | Humidity3pm - WindDir9am | Humidity3pm, Cloud3pm + Humidity3pm, Humidity3... | 1.0, 0.996, 1.0 |
| 11 | NATURAL_LOGARITHM(Pressure9am) | Pressure9am, Pressure9am - WindGustDir | 1.0, 1.0 |
| 12 | Pressure9am - WindGustDir | Pressure9am, NATURAL_LOGARITHM(Pressure9am) | 1.0, 1.0 |
| 13 | Pressure3pm - WindDir3pm | Pressure3pm | 1.0 |
| 14 | Cloud9am - WindDir3pm | Cloud9am | 0.9998 |
| 15 | RainToday_No - WindDir9am | RainToday_No | 0.9906 |
| 16 | RainToday_Yes | Location + RainToday_Yes, RainToday_Yes + Wind... | 1.0, 0.9946, 0.9941 |
| 17 | Location + RainToday_Yes | RainToday_Yes, RainToday_Yes + WindGustDir, Ra... | 1.0, 0.9946, 0.9941 |
| 18 | RainToday_Yes - WindDir3pm | RainToday_Yes, Location + RainToday_Yes, RainT... | 0.9941, 0.9941, 0.9824 |
| 19 | Humidity3pm - MinTemp | Humidity3pm - MaxTemp | 0.9877 |
| 20 | RainToday_rare - WindSpeed3pm | RainToday_Yes - WindSpeed3pm | 0.9988 |
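Why do so many generated features correlate almost perfectly with their parents? Adding a low-variance column (such as an encoded categorical) to a high-variance one barely changes it. A small self-contained numpy sketch of the effect (stand-in data, not the weather set):

import numpy as np

rng = np.random.default_rng(0)
temp = rng.normal(20, 7, size=1_000)   # stand-in for MaxTemp
loc = rng.integers(0, 3, size=1_000)   # stand-in for an encoded Location
print(np.corrcoef(temp, temp + loc)[0, 1])  # ~0.99: the sum adds almost nothing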
In [11]:
# After applying rfecv, we can plot the score per number of features
atom.plot_rfecv()
In [12]:
# Let's see how the model performs now
# Add a tag to the model's acronym so it doesn't overwrite the previous LGB
atom.run("LGB_dfs")
Training ========================= >>
Models: LGB_dfs
Metric: roc_auc


Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.9906
Test evaluation --> roc_auc: 0.8616
Time elapsed: 1.485s
-------------------------------------------------
Total time: 1.485s


Final results ==================== >>
Total time: 1.485s
-------------------------------------
LightGBM --> roc_auc: 0.8616
Genetic Feature Generation
In [13]:
# Create another branch for the genetic features
# Split from master to avoid the dfs features
atom.branch = "gfg_from_master"
New branch gfg successfully created.
In [14]:
# Create new features using Genetic Programming
atom.feature_generation(strategy="gfg", n_features=20)
Fitting FeatureGenerator...
    |   Population Average    |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0     3.08         0.129852        3         0.485297              N/A     21.97s
   1     3.06         0.329673        3         0.492484              N/A     26.06s
   2     3.27         0.419512        5         0.510269              N/A     22.29s
   3     3.90          0.44394        6         0.512441              N/A     22.04s
   4     5.31         0.471916        9         0.516333              N/A     20.13s
   5     5.36         0.457234       10          0.51887              N/A     22.43s
   6     6.05         0.454503       16          0.51986              N/A     18.27s
   7     8.56         0.480404       16          0.51986              N/A     13.40s
   8     9.73         0.482795       16          0.51986              N/A     12.82s
   9     9.79         0.483111       16          0.51986              N/A     11.88s
  10     9.95         0.482671       16          0.51986              N/A     11.42s
  11     9.96         0.477608       16          0.51986              N/A     12.71s
  12    10.10         0.480786       16          0.51986              N/A     10.02s
  13    10.03         0.480488       16          0.51986              N/A      7.30s
  14    10.03         0.484053       16          0.51986              N/A      6.06s
  15     9.95         0.478832       10         0.520691              N/A      5.55s
  16    10.10         0.482892       16         0.520868              N/A      3.67s
  17     9.99         0.482447       12         0.521197              N/A      2.46s
  18    10.02          0.47667       17         0.521669              N/A      1.23s
  19    10.05         0.482259       16         0.521743              N/A      0.00s
Generating new features...
--> 20 new features were added.
In [15]:
# We can see each feature's fitness and description through the genetic_features attribute
atom.genetic_features
Out[15]:
| | name | description | fitness |
|---|---|---|---|
| 0 | x23 | log(mul(add(add(WindGustSpeed, sub(Humidity3pm... | 0.510691 |
| 1 | x24 | log(mul(add(sub(add(WindGustSpeed, Humidity3pm... | 0.510691 |
| 2 | x25 | log(log(mul(add(sub(add(WindGustSpeed, Humidit... | 0.510057 |
| 3 | x26 | log(log(mul(add(add(WindGustSpeed, sub(Humidit... | 0.510057 |
| 4 | x27 | log(mul(mul(add(add(WindGustSpeed, sub(Humidit... | 0.509417 |
| 5 | x28 | log(mul(mul(add(sub(add(WindGustSpeed, Humidit... | 0.509417 |
| 6 | x29 | log(log(log(mul(add(add(WindGustSpeed, sub(Hum... | 0.509197 |
| 7 | x30 | log(log(mul(mul(add(sub(add(WindGustSpeed, Hum... | 0.508659 |
| 8 | x31 | log(log(log(log(mul(add(add(WindGustSpeed, sub... | 0.508341 |
| 9 | x32 | log(log(log(mul(mul(add(add(WindGustSpeed, sub... | 0.507739 |
| 10 | x33 | log(mul(add(mul(add(add(WindGustSpeed, sub(Hum... | 0.507421 |
| 11 | x34 | log(mul(add(mul(add(sub(add(WindGustSpeed, Hum... | 0.507421 |
| 12 | x35 | log(log(log(mul(add(mul(add(add(WindGustSpeed,... | 0.505743 |
| 13 | x36 | log(log(mul(add(add(WindGustSpeed, add(add(Win... | 0.505716 |
| 14 | x37 | log(mul(add(add(WindGustSpeed, mul(add(add(Win... | 0.505426 |
| 15 | x38 | log(mul(add(add(WindGustSpeed, mul(add(sub(add... | 0.505426 |
| 16 | x39 | log(mul(add(sub(mul(add(add(WindGustSpeed, sub... | 0.505410 |
| 17 | x40 | log(mul(add(add(WindGustSpeed, sub(add(add(Win... | 0.504868 |
| 18 | x41 | log(mul(add(add(WindGustSpeed, sub(Humidity3pm... | 0.504868 |
| 19 | x42 | log(mul(add(add(WindGustSpeed, sub(add(sub(add... | 0.504868 |
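The description column is truncated by pandas' display settings. To read a complete formula, one could temporarily widen the column (standard pandas options, nothing atom-specific):

# Show the untruncated formulas of the first generated features
with pd.option_context("display.max_colwidth", None):
    print(atom.genetic_features["description"].head(3))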
In [16]:
# Fit the model again
atom.run("LGB_gfg", metric="auc")
Training ========================= >>
Models: LGB_gfg
Metric: roc_auc


Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.9832
Test evaluation --> roc_auc: 0.8571
Time elapsed: 0.942s
-------------------------------------------------
Total time: 0.942s


Final results ==================== >>
Total time: 0.943s
-------------------------------------
LightGBM --> roc_auc: 0.8571
In [17]:
# Visualize the whole pipeline
atom.plot_pipeline()
Analyze the results
In [18]:
# Use atom's plots to compare the three models
atom.plot_roc(dataset="test+train")
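Plots aside, the three runs can also be compared numerically. A sketch, assuming this ATOM version exposes an overview of the trained models through the results attribute and per-metric test scores through evaluate():

atom.results     # one row per trained model, with train/test scores and fit times
atom.evaluate()  # test-set scores for a battery of common metrics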
In [19]:
# To compare other plots it might be useful to use a canvas
with atom.canvas(1, 2, horizontal_spacing=0.08, figsize=(1800, 800)):
    atom.lgb_dfs.plot_feature_importance(show=10, title="LGB + dfs")
    atom.lgb_gfg.plot_feature_importance(show=10, title="LGB + gfg")
In [20]:
# We can check the feature importance with other plots as well
atom.plot_permutation_importance(models=["LGB_dfs", "LGB_gfg"], show=12)
In [21]:
atom.LGB_gfg.plot_shap_decision(index=(0, 10), show=15)