Example: Train sizing¶
This example shows how to assess a model's performance based on the size of the training set.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow, training a binary classifier on the target column RainTomorrow.
Load the data¶
In [1]:
# Import packages
import pandas as pd
from atom import ATOMClassifier
In [2]:
# Load the data
X = pd.read_csv("./datasets/weatherAUS.csv")
# Let's have a look
X.head()
Out[2]:
| | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |
5 rows × 22 columns
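Several columns (e.g. Evaporation, Sunshine, Cloud9am) contain gaps that the pipeline will have to impute. As a quick sanity check before preprocessing, you can measure the fraction of missing values per column with plain pandas; a minimal sketch on a small stand-in frame with the same kind of gaps (the real dataset would be used in place of `df`):

```python
import numpy as np
import pandas as pd

# Small stand-in frame mimicking the gaps in weatherAUS.csv
df = pd.DataFrame({
    "MinTemp": [18.0, 17.2, np.nan, 13.6],
    "Sunshine": [8.9, np.nan, 6.1, np.nan],
    "RainToday": ["Yes", "No", "Yes", None],
})

# Fraction of missing values per column, worst first
missing = df.isna().mean().sort_values(ascending=False)
print(missing)
```

Columns with a high missing fraction are the ones where the choice of imputation strategy matters most.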
Run the pipeline¶
In [3]:
# Initialize atom and prepare the data
atom = ATOMClassifier(X, verbose=2, random_state=1)
atom.clean()
atom.impute(strat_num="median", strat_cat="most_frequent", max_nan_rows=0.8)
atom.encode()
<< ================== ATOM ================== >>
Algorithm task: binary classification.

Dataset stats ==================== >>
Shape: (142193, 22)
Memory: 61.69 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicate samples: 45 (0.0%)
-------------------------------------
Train set size: 113755
Test set size: 28438
-------------------------------------
|   |      dataset |       train |        test |
| - | ------------ | ----------- | ----------- |
| 0 | 110316 (3.5) | 88253 (3.5) | 22063 (3.5) |
| 1 |  31877 (1.0) | 25502 (1.0) |  6375 (1.0) |

Fitting Cleaner...
Cleaning the data...
 --> Label-encoding the target column.
Fitting Imputer...
Imputing missing values...
 --> Dropping 161 samples for containing more than 16 missing values.
 --> Imputing 481 missing values with median (12.0) in feature MinTemp.
 --> Imputing 265 missing values with median (22.6) in feature MaxTemp.
 --> Imputing 1354 missing values with median (0.0) in feature Rainfall.
 --> Imputing 60682 missing values with median (4.8) in feature Evaporation.
 --> Imputing 67659 missing values with median (8.4) in feature Sunshine.
 --> Imputing 9187 missing values with most_frequent (W) in feature WindGustDir.
 --> Imputing 9127 missing values with median (39.0) in feature WindGustSpeed.
 --> Imputing 9852 missing values with most_frequent (N) in feature WindDir9am.
 --> Imputing 3617 missing values with most_frequent (SE) in feature WindDir3pm.
 --> Imputing 1187 missing values with median (13.0) in feature WindSpeed9am.
 --> Imputing 2469 missing values with median (19.0) in feature WindSpeed3pm.
 --> Imputing 1613 missing values with median (70.0) in feature Humidity9am.
 --> Imputing 3449 missing values with median (52.0) in feature Humidity3pm.
 --> Imputing 13863 missing values with median (1017.6) in feature Pressure9am.
 --> Imputing 13830 missing values with median (1015.2) in feature Pressure3pm.
 --> Imputing 53496 missing values with median (5.0) in feature Cloud9am.
 --> Imputing 56933 missing values with median (5.0) in feature Cloud3pm.
 --> Imputing 743 missing values with median (16.7) in feature Temp9am.
 --> Imputing 2565 missing values with median (21.1) in feature Temp3pm.
 --> Imputing 1354 missing values with most_frequent (No) in feature RainToday.
Fitting Encoder...
Encoding categorical columns...
 --> LeaveOneOut-encoding feature Location. Contains 49 classes.
 --> LeaveOneOut-encoding feature WindGustDir. Contains 16 classes.
 --> LeaveOneOut-encoding feature WindDir9am. Contains 16 classes.
 --> LeaveOneOut-encoding feature WindDir3pm. Contains 16 classes.
 --> Ordinal-encoding feature RainToday. Contains 2 classes.
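The impute step fills numeric gaps with the column median and categorical gaps with the most frequent value, as the log shows (e.g. median 12.0 for MinTemp). The same idea can be sketched in plain pandas on a toy column:

```python
import numpy as np
import pandas as pd

# Toy numeric column with missing values
s = pd.Series([12.0, np.nan, 22.6, np.nan, 4.8])

# Median imputation: the median ignores NaNs and is robust to outliers
filled = s.fillna(s.median())
print(filled)
```

Note that `max_nan_rows=0.8` in the call above first drops any row missing more than 80% of its values, which is why the log reports 161 dropped samples before imputation starts.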
In [4]:
# Analyze the impact of the training set's size on a LightGBM model
atom.train_sizing("LGB", train_sizes=10, n_bootstrap=5)
Training ========================= >>
Metric: f1

Run: 0 =========================== >>
Models: LGB01
Size of training set: 11362 (10%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.795
Test evaluation --> f1: 0.6169
Time elapsed: 2.726s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6025 ± 0.0021
Time elapsed: 2.361s
-------------------------------------------------
Total time: 5.088s

Final results ==================== >>
Total time: 5.089s
-------------------------------------
LightGBM --> f1: 0.6025 ± 0.0021 ~

Run: 1 =========================== >>
Models: LGB02
Size of training set: 22724 (20%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.711
Test evaluation --> f1: 0.6172
Time elapsed: 3.588s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.606 ± 0.0021
Time elapsed: 3.214s
-------------------------------------------------
Total time: 6.802s

Final results ==================== >>
Total time: 6.803s
-------------------------------------
LightGBM --> f1: 0.606 ± 0.0021

Run: 2 =========================== >>
Models: LGB03
Size of training set: 34087 (30%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6844
Test evaluation --> f1: 0.6205
Time elapsed: 4.145s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6136 ± 0.0021
Time elapsed: 3.725s
-------------------------------------------------
Total time: 7.870s

Final results ==================== >>
Total time: 7.872s
-------------------------------------
LightGBM --> f1: 0.6136 ± 0.0021

Run: 3 =========================== >>
Models: LGB04
Size of training set: 45449 (40%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6788
Test evaluation --> f1: 0.6246
Time elapsed: 4.740s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6209 ± 0.0012
Time elapsed: 4.361s
-------------------------------------------------
Total time: 9.101s

Final results ==================== >>
Total time: 9.105s
-------------------------------------
LightGBM --> f1: 0.6209 ± 0.0012

Run: 4 =========================== >>
Models: LGB05
Size of training set: 56812 (50%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6694
Test evaluation --> f1: 0.6256
Time elapsed: 5.560s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6231 ± 0.0025
Time elapsed: 5.129s
-------------------------------------------------
Total time: 10.689s

Final results ==================== >>
Total time: 10.693s
-------------------------------------
LightGBM --> f1: 0.6231 ± 0.0025

Run: 5 =========================== >>
Models: LGB06
Size of training set: 68174 (60%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6623
Test evaluation --> f1: 0.627
Time elapsed: 6.235s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6223 ± 0.0043
Time elapsed: 5.758s
-------------------------------------------------
Total time: 11.993s

Final results ==================== >>
Total time: 11.998s
-------------------------------------
LightGBM --> f1: 0.6223 ± 0.0043

Run: 6 =========================== >>
Models: LGB07
Size of training set: 79536 (70%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6609
Test evaluation --> f1: 0.6307
Time elapsed: 6.979s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6254 ± 0.0029
Time elapsed: 6.485s
-------------------------------------------------
Total time: 13.465s

Final results ==================== >>
Total time: 13.469s
-------------------------------------
LightGBM --> f1: 0.6254 ± 0.0029

Run: 7 =========================== >>
Models: LGB08
Size of training set: 90899 (80%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6588
Test evaluation --> f1: 0.6316
Time elapsed: 7.869s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6255 ± 0.002
Time elapsed: 7.227s
-------------------------------------------------
Total time: 15.095s

Final results ==================== >>
Total time: 15.101s
-------------------------------------
LightGBM --> f1: 0.6255 ± 0.002

Run: 8 =========================== >>
Models: LGB09
Size of training set: 102261 (90%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6601
Test evaluation --> f1: 0.6318
Time elapsed: 8.578s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6253 ± 0.0022
Time elapsed: 8.169s
-------------------------------------------------
Total time: 16.747s

Final results ==================== >>
Total time: 16.752s
-------------------------------------
LightGBM --> f1: 0.6253 ± 0.0022

Run: 9 =========================== >>
Models: LGB10
Size of training set: 113624 (100%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6558
Test evaluation --> f1: 0.631
Time elapsed: 9.401s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6258 ± 0.0034
Time elapsed: 8.782s
-------------------------------------------------
Total time: 18.183s

Final results ==================== >>
Total time: 18.190s
-------------------------------------
LightGBM --> f1: 0.6258 ± 0.0034
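With `train_sizes=10`, atom trains the model on ten evenly spaced fractions of the training set (10%, 20%, ..., 100%) while evaluating on the same fixed test set. The per-run training sizes reported in the log can be reproduced with a few lines; a sketch, using the full training-set size of 113624 from the last run:

```python
# Full training-set size, as reported for run 9 in the log
n_train = 113624

# train_sizes=10 -> ten evenly spaced fractions of the training set
fractions = [round((i + 1) / 10, 1) for i in range(10)]

# Truncating to an integer sample count reproduces the logged sizes
sizes = [int(n_train * f) for f in fractions]
print(list(zip(fractions, sizes)))
```

Keeping the test set fixed across runs is what makes the scores comparable: only the amount of training data changes between models LGB01 through LGB10.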
Analyze the results¶
In [5]:
# The results are now multi-index, where frac is the fraction
# of the training set used to fit the model. The model names
# end with the fraction as well (without the dot)
atom.results
Out[5]:
| frac | model | score_train | score_test | time_fit | score_bootstrap | time_bootstrap | time |
|---|---|---|---|---|---|---|---|
0.1 | LGB01 | 0.7950 | 0.6169 | 2.726477 | 0.602473 | 2.361145 | 5.087622 |
0.2 | LGB02 | 0.7110 | 0.6172 | 3.587786 | 0.605984 | 3.214102 | 6.801888 |
0.3 | LGB03 | 0.6844 | 0.6205 | 4.144765 | 0.613633 | 3.724748 | 7.869513 |
0.4 | LGB04 | 0.6788 | 0.6246 | 4.740403 | 0.620894 | 4.360960 | 9.101363 |
0.5 | LGB05 | 0.6694 | 0.6256 | 5.559976 | 0.623075 | 5.128658 | 10.688634 |
0.6 | LGB06 | 0.6623 | 0.6270 | 6.234684 | 0.622287 | 5.758230 | 11.992914 |
0.7 | LGB07 | 0.6609 | 0.6307 | 6.979477 | 0.625412 | 6.485406 | 13.464883 |
0.8 | LGB08 | 0.6588 | 0.6316 | 7.868586 | 0.625519 | 7.226822 | 15.095408 |
0.9 | LGB09 | 0.6601 | 0.6318 | 8.578300 | 0.625334 | 8.168814 | 16.747114 |
1.0 | LGB10 | 0.6558 | 0.6310 | 9.401000 | 0.625840 | 8.782370 | 18.183370 |
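Because `atom.results` now carries a (frac, model) MultiIndex, standard pandas index selection applies. A minimal sketch on a toy frame with the same shape (values copied from the table above) showing how to pull out a single run by its fraction:

```python
import pandas as pd

# Toy frame mimicking atom.results' (frac, model) MultiIndex
results = pd.DataFrame(
    {"score_test": [0.6169, 0.6256, 0.6310]},
    index=pd.MultiIndex.from_tuples(
        [(0.1, "LGB01"), (0.5, "LGB05"), (1.0, "LGB10")],
        names=["frac", "model"],
    ),
)

# Select the half-size run via a cross-section on the frac level
half = results.xs(0.5, level="frac")
print(half)
```

The same `.xs` call on the real `atom.results` would return all metric columns for the chosen fraction.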
In [7]:
# Every model can be accessed through its name
atom.lgb05.plot_shap_waterfall(show=6)
In [8]:
# Plot the train sizing's results
atom.plot_learning_curve()
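The learning curve flattens out: most of the test-score gain comes from the first 40% or so of the data. That can be quantified directly from the results table by looking at the marginal gain of each extra 10% of training data (test f1 values copied from the table above):

```python
# Test f1 per training-set fraction, from the results table
fracs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
f1 = [0.6169, 0.6172, 0.6205, 0.6246, 0.6256,
      0.6270, 0.6307, 0.6316, 0.6318, 0.6310]

# Marginal f1 gain from each extra 10% of training data
gains = [round(b - a, 4) for a, b in zip(f1, f1[1:])]
print(gains)
```

The last step is even slightly negative, within the bootstrap noise of ±0.003, which suggests the model has roughly saturated on this dataset well before the full training set is used.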