Feature engineering
This example shows how to use automated feature generation to improve a model's performance.

The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.
Load the data
In [21]:
# Import packages
import pandas as pd
from atom import ATOMClassifier
In [22]:
# Load data
X = pd.read_csv("./datasets/weatherAUS.csv")

# Let's have a look
X.head()
Out[22]:
| | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
| 1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
| 2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
| 3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
| 4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |

5 rows × 22 columns
Run the pipeline
In [23]:
# Initialize atom and apply data cleaning
atom = ATOMClassifier(X, n_rows=1e4, test_size=0.2, verbose=0)
atom.impute(strat_num="knn", strat_cat="remove", max_nan_rows=0.8)
atom.encode(max_onehot=10, frac_to_other=0.04)
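A quick note on the encoding parameters: max_onehot=10 one-hot encodes categorical columns with at most ten classes, and frac_to_other=0.04 groups classes that appear in less than 4% of the rows into a shared "other" class. A minimal sketch of that grouping idea on a single toy column (illustrative only, not atom's internal implementation):

```python
import pandas as pd

# Toy categorical column (hypothetical wind directions)
col = pd.Series(["SSE", "SSE", "SSE", "W", "W", "NNW"])

# Relative frequency of each class
freq = col.value_counts(normalize=True)

# Classes below the frac_to_other threshold get merged into "other"
rare = freq[freq < 0.04].index
col = col.where(~col.isin(rare), "other")
```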
In [24]:
atom.verbose = 2  # Increase verbosity to see the output

# Let's see how a LightGBM model performs
atom.run('LGB', metric='auc')
sklearn.metrics.SCORERS is deprecated and will be removed in v1.3. Please use sklearn.metrics.get_scorer_names to get a list of available scorers and sklearn.metrics.get_scorer to get a scorer.

Training ========================= >>
Models: LGB
Metric: roc_auc

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.98
Test evaluation --> roc_auc: 0.879
Time elapsed: 0.250s
-------------------------------------------------
Total time: 0.250s

Final results ==================== >>
Duration: 0.250s
-------------------------------------
LightGBM --> roc_auc: 0.879
Deep Feature Synthesis
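Deep feature synthesis (dfs) builds candidate features by applying simple operators to existing columns and combinations of columns. With the add, sub and log operators used below, the generated features look like the ones in this hand-rolled sketch (illustrative only; atom automates this over all column combinations):

```python
import numpy as np

# Manual equivalents of three dfs features
X["MinTemp + MaxTemp"] = X["MinTemp"] + X["MaxTemp"]  # add
X["MinTemp - MaxTemp"] = X["MinTemp"] - X["MaxTemp"]  # sub
X["LOG(Rainfall)"] = np.log(X["Rainfall"])            # log (undefined for values <= 0)
```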
In [25]:
# Since we are going to compare different datasets,
# we need to create separate branches
atom.branch = "dfs"
New branch dfs successfully created.
In [26]:
# Create 50 new features using dfs
atom.feature_generation("dfs", n_features=50, operators=["add", "sub", "log"])
Fitting FeatureGenerator...
Creating new features...
 --> 50 new features were added.
In [27]:
# The warnings tell us that some operators created missing values!
# We can see the columns with missing values using the nans attribute
atom.nans
Out[27]:
Series([], dtype: int64)
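As an aside, it's easy to see why the log operator can create missing values: the logarithm is undefined for non-positive inputs, so a generated log(...) feature ends up with nan or -inf wherever the source column is zero or negative:

```python
import numpy as np

# log of a non-positive number is not a finite float
with np.errstate(divide="ignore", invalid="ignore"):
    print(np.log([2.0, 0.0, -1.0]))  # [0.6931... -inf nan]
```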
In [28]:
# Turn off warnings in the future
atom.warnings = False

# Impute the data again to get rid of the missing values
atom.impute(strat_num="knn", strat_cat="remove", max_nan_rows=0.8)
Fitting Imputer...
Imputing missing values...
In [29]:
# 50 new features may be too many...
# Let's check for multicollinearity and use rfecv to reduce the number
atom.feature_selection(
    strategy="rfecv",
    solver="LGB",
    n_features=30,
    scoring="auc",
    max_correlation=0.98,
)
Fitting FeatureSelector...
Performing feature selection ...
 --> Feature Location was removed due to low variance. Value 0.09090909090909091 repeated in 100.0% of the rows.
 --> Feature MinTemp + RainToday_No was removed due to collinearity with another feature.
 --> Feature Location + MaxTemp was removed due to collinearity with another feature.
 --> Feature Rainfall - WindDir3pm was removed due to collinearity with another feature.
 --> Feature RainToday_No + WindGustSpeed was removed due to collinearity with another feature.
 --> Feature WindSpeed3pm was removed due to collinearity with another feature.
 --> Feature Humidity9am was removed due to collinearity with another feature.
 --> Feature Humidity9am - RainToday_other was removed due to collinearity with another feature.
 --> Feature Humidity3pm was removed due to collinearity with another feature.
 --> Feature Humidity3pm + RainToday_Yes was removed due to collinearity with another feature.
 --> Feature Humidity3pm + Sunshine was removed due to collinearity with another feature.
 --> Feature Humidity3pm - RainToday_Yes was removed due to collinearity with another feature.
 --> Feature Pressure9am was removed due to collinearity with another feature.
 --> Feature Pressure9am + RainToday_No was removed due to collinearity with another feature.
 --> Feature Pressure3pm was removed due to collinearity with another feature.
 --> Feature Cloud3pm was removed due to collinearity with another feature.
 --> Feature Cloud3pm + RainToday_No was removed due to collinearity with another feature.
 --> Feature Temp9am was removed due to collinearity with another feature.
 --> Feature Temp3pm - WindDir9am was removed due to collinearity with another feature.
 --> Feature Location + RainToday_other was removed due to collinearity with another feature.
 --> Feature Evaporation + Temp3pm was removed due to collinearity with another feature.
 --> Feature RainToday_No - WindGustSpeed was removed due to collinearity with another feature.
 --> RFECV selected 45 features from the dataset.
   >>> Dropping feature WindSpeed9am (rank 5).
   >>> Dropping feature RainToday_No (rank 7).
   >>> Dropping feature RainToday_Yes (rank 6).
   >>> Dropping feature RainToday_other (rank 3).
   >>> Dropping feature Location - WindGustSpeed (rank 4).
   >>> Dropping feature RainToday_other + WindSpeed3pm (rank 2).
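For context, atom's rfecv strategy wraps scikit-learn's recursive feature elimination with cross-validation. A bare sklearn sketch of roughly the same step (the estimator settings here are assumptions, and X_train/y_train stand in for atom's cleaned training set):

```python
from lightgbm import LGBMClassifier
from sklearn.feature_selection import RFECV

# Recursively drop the weakest features, keeping at least 30,
# and pick the feature count that maximizes cross-validated AUC
rfecv = RFECV(
    estimator=LGBMClassifier(),
    min_features_to_select=30,
    scoring="roc_auc",
)
# rfecv.fit(X_train, y_train)
# X_selected = rfecv.transform(X_train)
```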
In [30]:
# The collinear attribute shows which features were removed due to multicollinearity
atom.collinear
Out[30]:
| | drop | corr_feature | corr_value |
|---|---|---|---|
| 0 | MinTemp + RainToday_No | MinTemp | 0.9978 |
| 1 | Location + MaxTemp | MaxTemp | 1.0 |
| 2 | Rainfall - WindDir3pm | Rainfall | 1.0 |
| 3 | RainToday_No + WindGustSpeed | WindGustSpeed | 0.9995 |
| 4 | WindSpeed3pm | RainToday_other + WindSpeed3pm | 1.0 |
| 5 | Humidity9am | Humidity9am + RainToday_other, Humidity9am - R... | 1.0, 1.0 |
| 6 | Humidity9am - RainToday_other | Humidity9am, Humidity9am + RainToday_other | 1.0, 1.0 |
| 7 | Humidity3pm | Humidity3pm + RainToday_Yes, Humidity3pm + Sun... | 0.9998, 0.9916, 0.9998, 0.9998 |
| 8 | Humidity3pm + RainToday_Yes | Humidity3pm, Humidity3pm + Sunshine, Humidity3... | 0.9998, 0.9912, 1.0, 0.9993 |
| 9 | Humidity3pm + Sunshine | Humidity3pm, Humidity3pm + RainToday_Yes, Humi... | 0.9916, 0.9912, 0.9912, 0.9917 |
| 10 | Humidity3pm - RainToday_Yes | Humidity3pm, Humidity3pm + RainToday_Yes, Humi... | 0.9998, 0.9993, 0.9917, 0.9993 |
| 11 | Pressure9am | Pressure9am + RainToday_No, Pressure9am + Wind... | 0.9983, 1.0 |
| 12 | Pressure9am + RainToday_No | Pressure9am, Pressure9am + WindGustDir | 0.9983, 0.9983 |
| 13 | Pressure3pm | Pressure3pm + RainToday_Yes | 0.9983 |
| 14 | Cloud3pm | Cloud3pm + RainToday_No, Cloud3pm + RainToday_... | 0.9813, 0.9995 |
| 15 | Cloud3pm + RainToday_No | Cloud3pm | 0.9813 |
| 16 | Temp9am | Temp9am + WindGustDir | 1.0 |
| 17 | Temp3pm - WindDir9am | Temp3pm | 1.0 |
| 18 | Location + RainToday_other | RainToday_other | 1.0 |
| 19 | Evaporation + Temp3pm | Evaporation + MaxTemp | 0.9859 |
| 20 | RainToday_No - WindGustSpeed | Location - WindGustSpeed | 0.9995 |
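The collinearity filter itself is conceptually simple: compute the absolute pairwise correlations and, for every pair above max_correlation, drop one of the two features. A minimal pandas sketch of the idea (not atom's exact implementation; it assumes atom.X exposes the feature set):

```python
import numpy as np

# Absolute pairwise correlations between the features
corr = atom.X.corr().abs()

# Keep only the upper triangle so every pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Columns correlated above the threshold with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.98).any()]
```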
In [31]:
# After applying rfecv, we can plot the score per number of features
atom.plot_rfecv()
In [32]:
# Let's see how the model performs now
# Add a tag to the model's acronym so we don't overwrite the previous LGB
atom.run("LGB_dfs")
Training ========================= >>
Models: LGB_dfs
Metric: roc_auc

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.9949
Test evaluation --> roc_auc: 0.8884
Time elapsed: 0.516s
-------------------------------------------------
Total time: 0.516s

Final results ==================== >>
Duration: 0.516s
-------------------------------------
LightGBM --> roc_auc: 0.8884
Genetic Feature Generation
In [33]:
# Create another branch for the genetic features
# Split from master to avoid including the dfs features
atom.branch = "gfg_from_master"
New branch gfg successfully created.
In [34]:
# Create new features using Genetic Programming
atom.feature_generation(strategy='gfg', n_features=20)
Fitting FeatureGenerator...

    |   Population Average    |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    3.04         0.136973        4         0.513834              N/A     18.71s
   1    3.05         0.336392        5         0.520524              N/A     21.77s
   2    3.54         0.433237        5         0.530545              N/A     19.56s
   3    4.09         0.476537        9         0.543896              N/A     21.25s
   4    5.88         0.499316       17         0.559524              N/A     15.48s
   5    8.91          0.52323       19         0.561831              N/A     14.64s
   6    9.54         0.530888       17         0.561145              N/A     17.16s
   7   10.33         0.524547       17         0.563917              N/A     13.02s
   8   12.12         0.508475       17         0.563917              N/A     11.93s
   9   13.56         0.483745       19         0.565534              N/A     10.84s
  10   14.40         0.484833       19         0.566354              N/A      9.49s
  11   15.06         0.476303       19         0.568294              N/A      8.77s
  12   14.96         0.466753       19         0.568294              N/A      7.58s
  13   15.22         0.465683       19         0.568294              N/A      6.33s
  14   15.90          0.46962       21         0.569199              N/A      5.43s
  15   16.76         0.481579       21         0.568591              N/A      4.35s
  16   16.17          0.47784       19         0.568294              N/A      3.34s
  17   16.44         0.484682       19         0.568294              N/A      2.58s
  18   16.54         0.483549       19         0.568294              N/A      1.10s
  19   16.97         0.488088       19         0.568294              N/A      0.00s

Creating new features...
 --> 20 new features were added.
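atom's gfg strategy builds on gplearn's SymbolicTransformer, which evolves small programs (nested combinations of functions such as add, sub and mul) whose output correlates with the target. A stripped-down direct use of gplearn that roughly mirrors the call above (the parameter values are assumptions, not necessarily atom's defaults):

```python
from gplearn.genetic import SymbolicTransformer

gfg = SymbolicTransformer(
    generations=20,                             # matches the 20 generations logged above
    hall_of_fame=100,                           # candidates kept for the final selection
    n_components=20,                            # number of new features, like n_features=20
    function_set=("add", "sub", "mul", "div"),
    random_state=1,
)
# new_features = gfg.fit_transform(X_train, y_train)
```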
In [35]:
# We can see each feature's fitness and description through the genetic_features attribute
atom.genetic_features
Out[35]:
| | name | description | fitness |
|---|---|---|---|
| 0 | x23 | mul(mul(sub(add(Sunshine, add(WindSpeed3pm, Su... | 0.549822 |
| 1 | x24 | mul(mul(sub(add(WindSpeed3pm, add(Sunshine, Su... | 0.549822 |
| 2 | x25 | mul(mul(sub(add(add(Sunshine, WindSpeed3pm), S... | 0.549822 |
| 3 | x26 | mul(mul(sub(add(add(WindSpeed3pm, Sunshine), S... | 0.549822 |
| 4 | x27 | mul(mul(add(WindGustSpeed, Humidity3pm), add(H... | 0.549822 |
| 5 | x28 | mul(mul(sub(add(add(Sunshine, Sunshine), WindS... | 0.549822 |
| 6 | x29 | mul(mul(add(WindGustSpeed, Humidity3pm), add(H... | 0.549822 |
| 7 | x30 | mul(mul(sub(add(WindSpeed3pm, add(Sunshine, Su... | 0.549659 |
| 8 | x31 | mul(mul(sub(add(Sunshine, add(WindSpeed3pm, Su... | 0.549659 |
| 9 | x32 | mul(sub(add(Sunshine, add(WindSpeed3pm, Sunshi... | 0.549659 |
| 10 | x33 | mul(sub(add(add(WindSpeed3pm, Sunshine), Sunsh... | 0.549659 |
| 11 | x34 | mul(mul(sub(add(add(WindSpeed3pm, add(Sunshine... | 0.549294 |
| 12 | x35 | mul(mul(sub(add(Sunshine, add(WindSpeed3pm, ad... | 0.549294 |
| 13 | x36 | mul(mul(sub(add(add(Sunshine, Sunshine), add(W... | 0.549294 |
| 14 | x37 | mul(mul(add(WindGustSpeed, Humidity3pm), add(H... | 0.549294 |
| 15 | x38 | mul(mul(sub(add(WindSpeed3pm, add(Sunshine, ad... | 0.549294 |
| 16 | x39 | mul(mul(sub(add(WindSpeed3pm, add(Sunshine, Su... | 0.549124 |
| 17 | x40 | mul(mul(sub(add(Sunshine, add(WindSpeed3pm, Su... | 0.549124 |
| 18 | x41 | mul(mul(sub(add(add(WindSpeed3pm, Sunshine), S... | 0.548367 |
| 19 | x42 | mul(mul(sub(add(Sunshine, add(WindSpeed3pm, Su... | 0.548302 |
In [36]:
# Fit the model again
atom.run("LGB_gfg", metric="auc")
Training ========================= >>
Models: LGB_gfg
Metric: roc_auc

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.9818
Test evaluation --> roc_auc: 0.8856
Time elapsed: 0.422s
-------------------------------------------------
Total time: 0.422s

Final results ==================== >>
Duration: 0.422s
-------------------------------------
LightGBM --> roc_auc: 0.8856
In [37]:
# Visualize the whole pipeline
atom.plot_pipeline()
Analyze results
In [38]:
# Use atom's plots to compare the three models
atom.palette = "Paired"
atom.plot_roc(dataset="both")
atom.reset_aesthetics()
In [39]:
# For busy plots it might be useful to use a canvas
with atom.canvas(1, 3, figsize=(20, 8)):
    atom.lgb.plot_feature_importance(show=10, title="LGB")
    atom.lgb_dfs.plot_feature_importance(show=10, title="LGB + dfs")
    atom.lgb_gfg.plot_feature_importance(show=10, title="LGB + gfg")
In [40]:
# We can check the feature importance with other plots as well
atom.plot_permutation_importance(models=["LGB_dfs", "LGB_gfg"], show=10)
In [41]:
atom.LGB_gfg.decision_plot(index=(0, 10), show=15)