Feature engineering¶
This example shows how to use automated feature generation to improve a model's performance.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.
Load the data¶
In [51]:
# Import packages
import pandas as pd
from atom import ATOMClassifier
In [52]:
# Load data
X = pd.read_csv("./datasets/weatherAUS.csv")

# Let's have a look
X.head()
Out[52]:
 | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |
5 rows × 22 columns
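As a quick sanity check before modeling, we can look at the class balance of the target with plain pandas (a minimal sketch; RainTomorrow is the target column shown above):

# Fraction of rainy vs. dry tomorrows in the raw data
X["RainTomorrow"].value_counts(normalize=True)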
Run the pipeline¶
In [53]:
# Initialize atom and apply data cleaning
atom = ATOMClassifier(X, n_rows=1e4, test_size=0.2, warnings=False, verbose=0)
atom.clean()
atom.impute(strat_num="knn", strat_cat="remove", max_nan_rows=0.8)
atom.encode(max_onehot=10, frac_to_other=0.04)
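The frac_to_other=0.04 argument groups categories that occur in less than 4% of the rows into a shared "other" class (which is where the RainToday_other column seen later comes from). Roughly what that does for a single column, sketched with plain pandas as an illustration only:

# Rough equivalent of frac_to_other=0.04 for one column (illustration, not atom's code)
freq = X["Location"].value_counts(normalize=True)
rare = freq[freq < 0.04].index
X["Location"] = X["Location"].where(~X["Location"].isin(rare), "other")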
In [54]:
atom.verbose = 2  # Increase verbosity to see the output

# Let's see how a LightGBM model performs
atom.run('LGB', metric='auc')
Training ========================= >>
Models: LGB
Metric: roc_auc

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.9698
Test evaluation --> roc_auc: 0.8684
Time elapsed: 0.281s
-------------------------------------------------
Total time: 0.281s

Final results ==================== >>
Duration: 0.281s
-------------------------------------
LightGBM --> roc_auc: 0.8684
Deep Feature Synthesis¶
In [55]:
# Since we are going to compare different datasets,
# we need to create separate branches
atom.branch = "dfs"
New branch dfs successfully created.
In [56]:
# Create 50 new features using DFS
atom.feature_generation("dfs", n_features=50, operators=["add", "sub", "log"])
Fitting FeatureGenerator...
Creating new features...
 --> 73 new features were added to the dataset.
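Under the hood, the dfs strategy builds on the featuretools package, stacking the requested operators on top of the existing columns (producing features like LOG(Sunshine) and Humidity3pm + WindDir3pm). A rough standalone sketch of the same idea, assuming a recent featuretools version:

import featuretools as ft

# Build a single-table EntitySet and let DFS combine the columns
es = ft.EntitySet(id="weather")
es = es.add_dataframe(
    dataframe_name="data",
    dataframe=X.drop(columns="RainTomorrow"),  # features only
    index="index",
    make_index=True,
)
matrix, defs = ft.dfs(
    entityset=es,
    target_dataframe_name="data",
    trans_primitives=["add_numeric", "subtract_numeric", "natural_logarithm"],
)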
In [57]:
# The warnings tell us that some operators created missing values!
# We can see the columns with missing values using the nans attribute
atom.nans
Out[57]:
LOG(RainToday_other)    8837
LOG(Sunshine)            141
dtype: int64
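The log operator is the culprit here: the logarithm is undefined for zero and negative inputs, so rows where RainToday_other or Sunshine equal 0 end up as missing values once the infinities are cleaned up. A two-line numpy demonstration:

import numpy as np

np.log(0.0)   # -inf (with a RuntimeWarning)
np.log(-1.0)  # nan (with a RuntimeWarning)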
In [58]:
# Turn off warnings in the future
atom.warnings = False

# Impute the data again to get rid of the missing values
atom.impute(strat_num="knn", strat_cat="remove", max_nan_rows=0.8)
Fitting Imputer...
Imputing missing values...
 --> Imputing 8837 missing values using the KNN imputer in feature LOG(RainToday_other).
 --> Imputing 141 missing values using the KNN imputer in feature LOG(Sunshine).
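A quick check that the imputation worked, using plain pandas on the data in the active branch (assuming atom.dataset behaves as in recent ATOM releases):

# Should print 0 now that the LOG columns have been imputed
print(atom.dataset.isna().sum().sum())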
In [59]:
# 50 new features may be too many...
# Let's check for multicollinearity and use RFECV to reduce the number
atom.feature_selection(
    strategy="RFECV",
    solver="LGB",
    n_features=30,
    scoring="auc",
    max_correlation=0.98,
)
Fitting FeatureSelector...
Performing feature selection ...
 --> Feature Location was removed due to low variance. Value 0.07228915662650602 repeated in 505378.0% of the rows.
 --> Feature LOG(RainToday_other) was removed due to low variance. Value 0.0 repeated in 505378.0% of the rows.
 --> Feature Cloud3pm + RainToday_other was removed due to collinearity with another feature.
 --> Feature Cloud3pm - WindDir3pm was removed due to collinearity with another feature.
 --> Feature Evaporation + RainToday_other was removed due to collinearity with another feature.
 --> Feature Humidity3pm + WindDir3pm was removed due to collinearity with another feature.
 --> Feature Humidity3pm - RainToday_other was removed due to collinearity with another feature.
 --> Feature Humidity9am + RainToday_No was removed due to collinearity with another feature.
 --> Feature Humidity9am + WindGustDir was removed due to collinearity with another feature.
 --> Feature Humidity9am - WindDir3pm was removed due to collinearity with another feature.
 --> Feature Humidity9am - WindGustDir was removed due to collinearity with another feature.
 --> Feature Location + Pressure3pm was removed due to collinearity with another feature.
 --> Feature Location + RainToday_Yes was removed due to collinearity with another feature.
 --> Feature Location + WindGustSpeed was removed due to collinearity with another feature.
 --> Feature MaxTemp + RainToday_No was removed due to collinearity with another feature.
 --> Feature MaxTemp + WindGustDir was removed due to collinearity with another feature.
 --> Feature Pressure3pm - WindDir9am was removed due to collinearity with another feature.
 --> Feature Pressure9am - WindGustDir was removed due to collinearity with another feature.
 --> Feature RainToday_No + RainToday_other was removed due to collinearity with another feature.
 --> Feature RainToday_No + Temp9am was removed due to collinearity with another feature.
 --> Feature RainToday_Yes + WindSpeed3pm was removed due to collinearity with another feature.
 --> Feature RainToday_Yes - WindGustDir was removed due to collinearity with another feature.
 --> Feature Temp3pm + WindDir9am was removed due to collinearity with another feature.
 --> Feature Temp9am - WindDir9am was removed due to collinearity with another feature.
 --> RFECV selected 41 features from the dataset.
   >>> Dropping feature WindSpeed9am (rank 5).
   >>> Dropping feature WindSpeed3pm (rank 4).
   >>> Dropping feature Cloud9am (rank 2).
   >>> Dropping feature RainToday_No (rank 9).
   >>> Dropping feature RainToday_Yes (rank 8).
   >>> Dropping feature RainToday_other (rank 3).
   >>> Dropping feature Location - WindGustDir (rank 6).
   >>> Dropping feature Location - WindSpeed3pm (rank 7).
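To confirm how much was pruned by the variance, collinearity and RFECV filters combined, the shape of the feature matrix in the current branch can be inspected directly (assuming atom.X returns the feature set, as in recent ATOM versions):

# Number of rows and remaining feature columns after feature selection
print(atom.X.shape)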
In [60]:
# The collinear attribute shows what features were removed due to multicollinearity
atom.collinear
Out[60]:
 | drop_feature | correlated_feature | correlation_value
---|---|---|---
0 | Cloud3pm + RainToday_other | Cloud3pm | 0.99948 |
1 | Cloud3pm - WindDir3pm | Cloud3pm, Cloud3pm + RainToday_other | 0.99972, 0.99918 |
2 | Evaporation + RainToday_other | Evaporation | 0.99975 |
3 | Humidity3pm + WindDir3pm | Humidity3pm | 1.0 |
4 | Humidity3pm - RainToday_other | Humidity3pm, Humidity3pm + WindDir3pm | 0.99999, 0.99999 |
5 | Humidity9am + RainToday_No | Humidity9am | 0.99978 |
6 | Humidity9am + WindGustDir | Humidity9am, Humidity9am + RainToday_No | 1.0, 0.99977 |
7 | Humidity9am - WindDir3pm | Humidity9am, Humidity9am + RainToday_No, Humid... | 1.0, 0.99978, 0.99999 |
8 | Humidity9am - WindGustDir | Humidity9am, Humidity9am + RainToday_No, Humid... | 1.0, 0.99978, 0.99999, 1.0 |
9 | Location + Pressure3pm | Pressure3pm | 1.0 |
10 | Location + RainToday_Yes | RainToday_No, RainToday_Yes | -0.98582, 1.0 |
11 | Location + WindGustSpeed | WindGustSpeed | 1.0 |
12 | MaxTemp + RainToday_No | MaxTemp | 0.99833 |
13 | MaxTemp + WindGustDir | MaxTemp, MaxTemp + RainToday_No | 0.99998, 0.99829 |
14 | Pressure3pm - WindDir9am | Pressure3pm, Location + Pressure3pm | 0.99997, 0.99997 |
15 | Pressure9am - WindGustDir | Pressure9am | 0.99998 |
16 | RainToday_No + RainToday_other | RainToday_No, RainToday_Yes, Location + RainTo... | 0.98582, -1.0, -1.0 |
17 | RainToday_No + Temp9am | Temp9am | 0.99788 |
18 | RainToday_Yes + WindSpeed3pm | WindSpeed3pm, Location - WindSpeed3pm | 0.99887, -0.99887 |
19 | RainToday_Yes - WindGustDir | RainToday_Yes, Location + RainToday_Yes, RainT... | 0.9937, 0.9937, -0.9937 |
20 | Temp3pm + WindDir9am | Temp3pm, RainToday_other - Temp3pm | 0.99996, -0.99991 |
21 | Temp9am - WindDir9am | Temp9am, RainToday_No + Temp9am | 0.99996, 0.99789 |
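The correlation values in this table are plain pairwise correlations (Pearson, the pandas default); any pair above max_correlation=0.98 triggers a drop. A tiny synthetic illustration of why a near-duplicate column scores close to 1.0:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = pd.Series(rng.normal(size=1_000))        # stand-in for an original feature
b = a + rng.normal(scale=1e-3, size=1_000)   # near-duplicate, e.g. feature + near-constant term
print(a.corr(b))                             # ~1.0 -> dropped at max_correlation=0.98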
In [61]:
# After applying RFECV, we can plot the score per number of features
atom.plot_rfecv()
In [62]:
# Let's see how the model performs now
# Add a tag to the model's acronym so it doesn't overwrite the previous LGB
atom.run("LGB_dfs")
Training ========================= >>
Models: LGB_dfs
Metric: roc_auc

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.9932
Test evaluation --> roc_auc: 0.867
Time elapsed: 0.547s
-------------------------------------------------
Total time: 0.547s

Final results ==================== >>
Duration: 0.547s
-------------------------------------
LightGBM --> roc_auc: 0.867
Genetic Feature Generation¶
In [63]:
# Create another branch for the genetic features
# Split from master to avoid the dfs features
atom.branch = "gfg_from_master"
New branch gfg successfully created.
In [64]:
# Create new features using Genetic Programming
atom.feature_generation(
    strategy='GFG',
    n_features=20,
    generations=10,
    population=2000,
)
Fitting FeatureGenerator...
    |   Population Average    |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    3.13         0.130778        3          0.49782              N/A      9.57s
   1    3.31         0.324661        6         0.515497              N/A     10.25s
   2    3.79         0.412231       10         0.526287              N/A      8.13s
   3    4.53         0.462306       13         0.530089              N/A      6.70s
   4    5.96         0.486278       15         0.530094              N/A      6.25s
   5    5.36         0.489311       11         0.531058              N/A      4.79s
   6    6.63         0.488193       15         0.534876              N/A      3.52s
   7    8.58         0.495064        9         0.537901              N/A      2.28s
   8    7.74         0.502927        9         0.537901              N/A      1.20s
   9    7.03         0.500436        9         0.537901              N/A      0.00s
Creating new features...
 --> 3 new features were added to the dataset.
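The GFG strategy builds on gplearn's SymbolicTransformer, which evolves small programs of arithmetic operators and keeps the ones most correlated with the target. A rough standalone sketch under that assumption (atom.X_train and atom.y_train hold the training split):

from gplearn.genetic import SymbolicTransformer

# Evolve candidate feature programs on the numeric training data
gfg = SymbolicTransformer(
    generations=10,
    population_size=2000,
    hall_of_fame=100,   # best programs kept for the final selection
    n_components=20,    # number of new features to return
    random_state=1,
)
new_features = gfg.fit_transform(atom.X_train.select_dtypes("number"), atom.y_train)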
In [65]:
# We can see each feature's fitness and description through the genetic_features attribute
atom.genetic_features
Out[65]:
 | name | description | fitness
---|---|---|---
0 | feature 24 | mul(mul(add(RainToday_Yes, Cloud3pm), mul(Wind... | 0.528901 |
1 | feature 25 | mul(Humidity3pm, mul(add(RainToday_Yes, Cloud3... | 0.528901 |
2 | feature 26 | mul(mul(Cloud3pm, mul(WindGustSpeed, Humidity3... | 0.524057 |
In [66]:
# Fit the model again
atom.run("LGB_gfg", metric="auc")
Training ========================= >>
Models: LGB_gfg
Metric: roc_auc

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.9683
Test evaluation --> roc_auc: 0.8676
Time elapsed: 0.290s
-------------------------------------------------
Total time: 0.290s

Final results ==================== >>
Duration: 0.290s
-------------------------------------
LightGBM --> roc_auc: 0.8676
Analyze results¶
In [67]:
# Use atom's plots to compare the three models
atom.palette = "Paired"
atom.plot_roc(dataset="both")
atom.reset_aesthetics()
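Besides the plots, the scores of all three models can be compared in tabular form; this assumes the results attribute and evaluate method behave as in recent ATOM releases:

atom.results     # one row per trained model with its train/test scores
atom.evaluate()  # scores on a selection of common classification metrics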
In [68]:
# For busy plots it might be useful to use a canvas
with atom.canvas(1, 3, figsize=(20, 8)):
    atom.lgb.plot_feature_importance(show=10, title="LGB")
    atom.lgb_dfs.plot_feature_importance(show=10, title="LGB + DFS")
    atom.lgb_gfg.plot_feature_importance(show=10, title="LGB + GFG")
In [69]:
# We can check the feature importance with other plots as well
atom.plot_permutation_importance(models=["LGB_DFS", "LGB_GFG"], show=10)
In [70]:
# SHAP decision plot for a slice of rows, showing the 15 most important features
atom.LGB_gfg.decision_plot(index=(0, 10), show=15)