Example: Train sizing
This example shows how to assess a model's performance based on the size of the training set.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow, training a binary classifier on the target column RainTomorrow.
Load the data
In [1]:
# Import packages
import pandas as pd
from atom import ATOMClassifier
In [2]:
# Load the data
X = pd.read_csv("docs_source/examples/datasets/weatherAUS.csv")
# Let's have a look
X.head()
Out[2]:
Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |
5 rows × 22 columns
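The dataset has a substantial amount of missing values, which is why the pipeline below calls atom.impute. As a quick sanity check, you can count the missing values per column with plain pandas. The mini-frame here is a hypothetical stand-in for the weather data, just to illustrate the pattern:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the weather data
df = pd.DataFrame(
    {
        "MinTemp": [18.0, 17.2, None],
        "Evaporation": [7.0, None, None],
        "RainTomorrow": [0, 0, 1],
    }
)

# Count missing values per column (atom reports the same totals later)
missing = df.isna().sum()
print(missing["Evaporation"])  # 2
```

On the real data, `X.isna().sum()` gives the per-column counts that the Imputer's log echoes back.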
Run the pipeline
In [3]:
# Initialize atom and prepare the data
atom = ATOMClassifier(X, verbose=2, random_state=1)
atom.clean()
atom.impute(strat_num="median", strat_cat="most_frequent", max_nan_rows=0.8)
atom.encode()
<< ================== ATOM ================== >>

Configuration ==================== >>
Algorithm task: Binary classification.

Dataset stats ==================== >>
Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 25.03 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)

Fitting Cleaner...
Cleaning the data...
Fitting Imputer...
Imputing missing values...
--> Dropping 161 samples for containing more than 16 missing values.
--> Imputing 481 missing values with median (12.0) in column MinTemp.
--> Imputing 265 missing values with median (22.6) in column MaxTemp.
--> Imputing 1354 missing values with median (0.0) in column Rainfall.
--> Imputing 60682 missing values with median (4.8) in column Evaporation.
--> Imputing 67659 missing values with median (8.4) in column Sunshine.
--> Imputing 9187 missing values with most_frequent (W) in column WindGustDir.
--> Imputing 9127 missing values with median (39.0) in column WindGustSpeed.
--> Imputing 9852 missing values with most_frequent (N) in column WindDir9am.
--> Imputing 3617 missing values with most_frequent (SE) in column WindDir3pm.
--> Imputing 1187 missing values with median (13.0) in column WindSpeed9am.
--> Imputing 2469 missing values with median (19.0) in column WindSpeed3pm.
--> Imputing 1613 missing values with median (70.0) in column Humidity9am.
--> Imputing 3449 missing values with median (52.0) in column Humidity3pm.
--> Imputing 13863 missing values with median (1017.6) in column Pressure9am.
--> Imputing 13830 missing values with median (1015.2) in column Pressure3pm.
--> Imputing 53496 missing values with median (5.0) in column Cloud9am.
--> Imputing 56933 missing values with median (5.0) in column Cloud3pm.
--> Imputing 743 missing values with median (16.7) in column Temp9am.
--> Imputing 2565 missing values with median (21.1) in column Temp3pm.
--> Imputing 1354 missing values with most_frequent (No) in column RainToday.
Fitting Encoder...
Encoding categorical columns...
--> Target-encoding feature Location. Contains 49 classes.
--> Target-encoding feature WindGustDir. Contains 16 classes.
--> Target-encoding feature WindDir9am. Contains 16 classes.
--> Target-encoding feature WindDir3pm. Contains 16 classes.
--> Ordinal-encoding feature RainToday. Contains 2 classes.
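The log shows that high-cardinality features like Location are target-encoded: each category is replaced by a statistic of the target within that category. ATOM's Encoder applies a smoothed variant; the toy sketch below (with made-up data) shows the unsmoothed core idea using plain pandas:

```python
import pandas as pd

# Toy frame: two locations, binary target (hypothetical values)
df = pd.DataFrame(
    {
        "Location": ["A", "A", "B", "B"],
        "RainTomorrow": [1, 0, 1, 1],
    }
)

# Target encoding replaces each category with the target mean
# observed for that category
means = df.groupby("Location")["RainTomorrow"].mean()
df["Location_enc"] = df["Location"].map(means)
print(df["Location_enc"].tolist())  # [0.5, 0.5, 1.0, 1.0]
```

In practice the encoding is fitted on the training set only, to avoid leaking the test target into the features.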
In [4]:
# Analyze the impact of the training set's size on a LR model
atom.train_sizing("LR", train_sizes=10, n_bootstrap=5)
Training ========================= >>
Metric: f1

Run: 0 =========================== >>
Models: LR01
Size of training set: 11362 (10%)
Size of test set: 28408

Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.563
Test evaluation --> f1: 0.5854
Time elapsed: 1.181s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5849 ± 0.002
Time elapsed: 0.910s
-------------------------------------------------
Time: 2.091s

Final results ==================== >>
Total time: 2.109s
-------------------------------------
LogisticRegression --> f1: 0.5849 ± 0.002

Run: 1 =========================== >>
Models: LR02
Size of training set: 22724 (20%)
Size of test set: 28408

Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.582
Test evaluation --> f1: 0.5873
Time elapsed: 1.455s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5852 ± 0.0021
Time elapsed: 1.120s
-------------------------------------------------
Time: 2.575s

Final results ==================== >>
Total time: 2.598s
-------------------------------------
LogisticRegression --> f1: 0.5852 ± 0.0021

Run: 2 =========================== >>
Models: LR03
Size of training set: 34087 (30%)
Size of test set: 28408

Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.581
Test evaluation --> f1: 0.5851
Time elapsed: 1.702s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5861 ± 0.0009
Time elapsed: 1.355s
-------------------------------------------------
Time: 3.057s

Final results ==================== >>
Total time: 3.082s
-------------------------------------
LogisticRegression --> f1: 0.5861 ± 0.0009

Run: 3 =========================== >>
Models: LR04
Size of training set: 45449 (40%)
Size of test set: 28408

Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5827
Test evaluation --> f1: 0.5869
Time elapsed: 2.250s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5863 ± 0.0017
Time elapsed: 1.599s
-------------------------------------------------
Time: 3.850s

Final results ==================== >>
Total time: 3.881s
-------------------------------------
LogisticRegression --> f1: 0.5863 ± 0.0017

Run: 4 =========================== >>
Models: LR05
Size of training set: 56812 (50%)
Size of test set: 28408

Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5819
Test evaluation --> f1: 0.585
Time elapsed: 2.163s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5854 ± 0.0017
Time elapsed: 1.878s
-------------------------------------------------
Time: 4.041s

Final results ==================== >>
Total time: 4.077s
-------------------------------------
LogisticRegression --> f1: 0.5854 ± 0.0017

Run: 5 =========================== >>
Models: LR06
Size of training set: 68174 (60%)
Size of test set: 28408

Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5832
Test evaluation --> f1: 0.5848
Time elapsed: 2.338s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5849 ± 0.0018
Time elapsed: 1.899s
-------------------------------------------------
Time: 4.237s

Final results ==================== >>
Total time: 4.279s
-------------------------------------
LogisticRegression --> f1: 0.5849 ± 0.0018

Run: 6 =========================== >>
Models: LR07
Size of training set: 79536 (70%)
Size of test set: 28408

Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5873
Test evaluation --> f1: 0.5849
Time elapsed: 2.427s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5852 ± 0.0012
Time elapsed: 2.060s
-------------------------------------------------
Time: 4.486s

Final results ==================== >>
Total time: 4.531s
-------------------------------------
LogisticRegression --> f1: 0.5852 ± 0.0012

Run: 7 =========================== >>
Models: LR08
Size of training set: 90899 (80%)
Size of test set: 28408

Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.589
Test evaluation --> f1: 0.5837
Time elapsed: 2.631s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5853 ± 0.0026
Time elapsed: 2.173s
-------------------------------------------------
Time: 4.804s

Final results ==================== >>
Total time: 4.853s
-------------------------------------
LogisticRegression --> f1: 0.5853 ± 0.0026

Run: 8 =========================== >>
Models: LR09
Size of training set: 102261 (90%)
Size of test set: 28408

Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5871
Test evaluation --> f1: 0.5845
Time elapsed: 2.837s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5846 ± 0.002
Time elapsed: 2.550s
-------------------------------------------------
Time: 5.387s

Final results ==================== >>
Total time: 5.443s
-------------------------------------
LogisticRegression --> f1: 0.5846 ± 0.002

Run: 9 =========================== >>
Models: LR10
Size of training set: 113624 (100%)
Size of test set: 28408

Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5858
Test evaluation --> f1: 0.5848
Time elapsed: 4.211s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5848 ± 0.0007
Time elapsed: 2.967s
-------------------------------------------------
Time: 7.178s

Final results ==================== >>
Total time: 7.243s
-------------------------------------
LogisticRegression --> f1: 0.5848 ± 0.0007
Analyze the results
In [5]:
# The results are now multi-index, where frac is the fraction
# of the training set used to fit the model. The model names
# end with the fraction as well (without the dot)
atom.results
Out[5]:
frac | model | f1_train | f1_test | time_fit | f1_bootstrap | time_bootstrap | time
---|---|---|---|---|---|---|---
0.1 | LR01 | 0.5621 | 0.5848 | 1.181076 | 0.584922 | 0.909830 | 2.090906
0.2 | LR02 | 0.5832 | 0.5846 | 1.455324 | 0.585234 | 1.120021 | 2.575345
0.3 | LR03 | 0.5800 | 0.5852 | 1.702020 | 0.586118 | 1.354517 | 3.056537
0.4 | LR04 | 0.5845 | 0.5857 | 2.250048 | 0.586348 | 1.599457 | 3.849505
0.5 | LR05 | 0.5833 | 0.5865 | 2.163214 | 0.585384 | 1.877947 | 4.041161
0.6 | LR06 | 0.5831 | 0.5832 | 2.338079 | 0.584891 | 1.898731 | 4.236810
0.7 | LR07 | 0.5878 | 0.5858 | 2.426779 | 0.585235 | 2.059590 | 4.486369
0.8 | LR08 | 0.5916 | 0.5886 | 2.630608 | 0.585269 | 2.172981 | 4.803589
0.9 | LR09 | 0.5856 | 0.5833 | 2.836993 | 0.584633 | 2.550147 | 5.387140
1.0 | LR10 | 0.5858 | 0.5848 | 4.211031 | 0.584836 | 2.966612 | 7.177643
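Because atom.results is an ordinary pandas DataFrame with a (frac, model) MultiIndex, you can query it directly, e.g. to find which training fraction scored best on the test set. The frame below is a hypothetical mock with only three rows; on the real object you would call `atom.results["f1_test"].idxmax()`:

```python
import pandas as pd

# Mock of the multi-index results frame (values are made up;
# the real ones come from atom.results)
results = pd.DataFrame(
    {"f1_test": [0.5848, 0.5865, 0.5848]},
    index=pd.MultiIndex.from_tuples(
        [(0.1, "LR01"), (0.5, "LR05"), (1.0, "LR10")],
        names=["frac", "model"],
    ),
)

# idxmax on a MultiIndexed series returns the full index tuple
best_frac, best_model = results["f1_test"].idxmax()
print(best_frac, best_model)  # 0.5 LR05
```

Note how flat the f1_test column is across fractions: for this model, more data barely helps, which is exactly what a train sizing run is meant to reveal.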
In [6]:
# Every model can be accessed through its name
atom.lr05.plot_shap_waterfall(show=6)
In [7]:
# Plot the train sizing's results
atom.plot_learning_curve()