Example: Train sizing¶
This example shows how to assess a model's performance as a function of the training set size.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow, training a binary classifier on the target column RainTomorrow.
Load the data¶
In [1]:
# Import packages
import pandas as pd
from atom import ATOMClassifier
UserWarning: The pandas version installed (1.5.3) does not match the supported pandas version in Modin (1.5.2). This may cause undesired side effects!
In [2]:
# Load the data
X = pd.read_csv("./datasets/weatherAUS.csv")

# Let's have a look
X.head()
Out[2]:
  | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |
5 rows × 22 columns
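Before running the pipeline, it's worth checking how much of the data is actually missing. A minimal pandas sketch (the exact counts depend on the dataset version you downloaded):

# Count missing values per column, sorted from most to least affected
X.isna().sum().sort_values(ascending=False).head(10)

# Fraction of rows that contain at least one missing value
X.isna().any(axis=1).mean()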
Run the pipeline¶
In [3]:
# Initialize atom and prepare the data
atom = ATOMClassifier(X, verbose=2, random_state=1)
atom.clean()
atom.impute(strat_num="median", strat_cat="most_frequent", max_nan_rows=0.8)
atom.encode()
<< ================== ATOM ================== >>
Algorithm task: binary classification.

Dataset stats ==================== >>
Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 61.69 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicate samples: 45 (0.0%)

Fitting Cleaner...
Cleaning the data...

Fitting Imputer...
Imputing missing values...
--> Dropping 161 samples for containing more than 16 missing values.
--> Imputing 481 missing values with median (12.0) in feature MinTemp.
--> Imputing 265 missing values with median (22.6) in feature MaxTemp.
--> Imputing 1354 missing values with median (0.0) in feature Rainfall.
--> Imputing 60682 missing values with median (4.8) in feature Evaporation.
--> Imputing 67659 missing values with median (8.4) in feature Sunshine.
--> Imputing 9187 missing values with most_frequent (W) in feature WindGustDir.
--> Imputing 9127 missing values with median (39.0) in feature WindGustSpeed.
--> Imputing 9852 missing values with most_frequent (N) in feature WindDir9am.
--> Imputing 3617 missing values with most_frequent (SE) in feature WindDir3pm.
--> Imputing 1187 missing values with median (13.0) in feature WindSpeed9am.
--> Imputing 2469 missing values with median (19.0) in feature WindSpeed3pm.
--> Imputing 1613 missing values with median (70.0) in feature Humidity9am.
--> Imputing 3449 missing values with median (52.0) in feature Humidity3pm.
--> Imputing 13863 missing values with median (1017.6) in feature Pressure9am.
--> Imputing 13830 missing values with median (1015.2) in feature Pressure3pm.
--> Imputing 53496 missing values with median (5.0) in feature Cloud9am.
--> Imputing 56933 missing values with median (5.0) in feature Cloud3pm.
--> Imputing 743 missing values with median (16.7) in feature Temp9am.
--> Imputing 2565 missing values with median (21.1) in feature Temp3pm.
--> Imputing 1354 missing values with most_frequent (No) in feature RainToday.

Fitting Encoder...
Encoding categorical columns...
--> LeaveOneOut-encoding feature Location. Contains 49 classes.
--> LeaveOneOut-encoding feature WindGustDir. Contains 16 classes.
--> LeaveOneOut-encoding feature WindDir9am. Contains 16 classes.
--> LeaveOneOut-encoding feature WindDir3pm. Contains 16 classes.
--> Ordinal-encoding feature RainToday. Contains 2 classes.
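If you want to double-check the transformations, atom's dataset attribute holds the (already transformed) train and test sets. A quick sanity check, assuming the standard atom API:

# No missing values should remain after imputation
assert atom.dataset.isna().sum().sum() == 0

# After encoding, every feature should be numeric
atom.dataset.dtypes.value_counts()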
In [4]:
# Analyze the impact of the training set's size on a LightGBM model
atom.train_sizing("LGB", train_sizes=10, n_bootstrap=5)
Training ========================= >>
Metric: f1

Run: 0 =========================== >>
Models: LGB01
Size of training set: 11362 (10%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.795
Test evaluation --> f1: 0.6169
Time elapsed: 2.702s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6025 ± 0.0021
Time elapsed: 2.367s
-------------------------------------------------
Total time: 5.069s

Final results ==================== >>
Total time: 5.072s
-------------------------------------
LightGBM --> f1: 0.6025 ± 0.0021 ~

Run: 1 =========================== >>
Models: LGB02
Size of training set: 22724 (20%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.711
Test evaluation --> f1: 0.6172
Time elapsed: 3.361s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.606 ± 0.0021
Time elapsed: 2.924s
-------------------------------------------------
Total time: 6.285s

Final results ==================== >>
Total time: 6.288s
-------------------------------------
LightGBM --> f1: 0.606 ± 0.0021

Run: 2 =========================== >>
Models: LGB03
Size of training set: 34087 (30%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6844
Test evaluation --> f1: 0.6205
Time elapsed: 4.115s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6136 ± 0.0021
Time elapsed: 3.574s
-------------------------------------------------
Total time: 7.689s

Final results ==================== >>
Total time: 7.692s
-------------------------------------
LightGBM --> f1: 0.6136 ± 0.0021

Run: 3 =========================== >>
Models: LGB04
Size of training set: 45449 (40%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6788
Test evaluation --> f1: 0.6246
Time elapsed: 4.704s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6209 ± 0.0012
Time elapsed: 4.312s
-------------------------------------------------
Total time: 9.017s

Final results ==================== >>
Total time: 9.019s
-------------------------------------
LightGBM --> f1: 0.6209 ± 0.0012

Run: 4 =========================== >>
Models: LGB05
Size of training set: 56812 (50%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6694
Test evaluation --> f1: 0.6256
Time elapsed: 5.333s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6231 ± 0.0025
Time elapsed: 4.956s
-------------------------------------------------
Total time: 10.289s

Final results ==================== >>
Total time: 10.295s
-------------------------------------
LightGBM --> f1: 0.6231 ± 0.0025

Run: 5 =========================== >>
Models: LGB06
Size of training set: 68174 (60%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6623
Test evaluation --> f1: 0.627
Time elapsed: 6.177s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6223 ± 0.0043
Time elapsed: 5.432s
-------------------------------------------------
Total time: 11.609s

Final results ==================== >>
Total time: 11.615s
-------------------------------------
LightGBM --> f1: 0.6223 ± 0.0043

Run: 6 =========================== >>
Models: LGB07
Size of training set: 79536 (70%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6609
Test evaluation --> f1: 0.6307
Time elapsed: 6.787s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6254 ± 0.0029
Time elapsed: 6.138s
-------------------------------------------------
Total time: 12.925s

Final results ==================== >>
Total time: 12.930s
-------------------------------------
LightGBM --> f1: 0.6254 ± 0.0029

Run: 7 =========================== >>
Models: LGB08
Size of training set: 90899 (80%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6588
Test evaluation --> f1: 0.6316
Time elapsed: 7.660s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6255 ± 0.002
Time elapsed: 7.141s
-------------------------------------------------
Total time: 14.802s

Final results ==================== >>
Total time: 14.808s
-------------------------------------
LightGBM --> f1: 0.6255 ± 0.002

Run: 8 =========================== >>
Models: LGB09
Size of training set: 102261 (90%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6601
Test evaluation --> f1: 0.6318
Time elapsed: 8.433s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6253 ± 0.0022
Time elapsed: 7.353s
-------------------------------------------------
Total time: 15.786s

Final results ==================== >>
Total time: 15.792s
-------------------------------------
LightGBM --> f1: 0.6253 ± 0.0022

Run: 9 =========================== >>
Models: LGB10
Size of training set: 113624 (100%)
Size of test set: 28408

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6558
Test evaluation --> f1: 0.631
Time elapsed: 8.937s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6258 ± 0.0034
Time elapsed: 8.158s
-------------------------------------------------
Total time: 17.095s

Final results ==================== >>
Total time: 17.100s
-------------------------------------
LightGBM --> f1: 0.6258 ± 0.0034
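Passing an integer to train_sizes creates that many equidistant fractions, but the parameter also accepts a sequence of fractions (or absolute set sizes), which is useful when the interesting part of the curve sits at the low end. A hypothetical variation:

# Sample the small-data regime more densely
atom.train_sizing("LGB", train_sizes=[0.05, 0.1, 0.2, 0.5, 1.0], n_bootstrap=5)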
Analyze the results¶
In [5]:
# The results are now multi-index, where frac is the fraction
# of the training set used to fit the model. The model names
# end with the fraction as well (without the dot)
atom.results
Out[5]:
frac | model | score_train | score_test | time_fit | score_bootstrap | time_bootstrap | time
---|---|---|---|---|---|---|---
0.1 | LGB01 | 0.7950 | 0.6169 | 2.701927 | 0.602473 | 2.366629 | 5.068556 |
0.2 | LGB02 | 0.7110 | 0.6172 | 3.361056 | 0.605984 | 2.923961 | 6.285017 |
0.3 | LGB03 | 0.6844 | 0.6205 | 4.114851 | 0.613633 | 3.573816 | 7.688667 |
0.4 | LGB04 | 0.6788 | 0.6246 | 4.704423 | 0.620894 | 4.312111 | 9.016534 |
0.5 | LGB05 | 0.6694 | 0.6256 | 5.332624 | 0.623075 | 4.956064 | 10.288688 |
0.6 | LGB06 | 0.6623 | 0.6270 | 6.176526 | 0.622287 | 5.432179 | 11.608705 |
0.7 | LGB07 | 0.6609 | 0.6307 | 6.786634 | 0.625412 | 6.138183 | 12.924817 |
0.8 | LGB08 | 0.6588 | 0.6316 | 7.660243 | 0.625519 | 7.141488 | 14.801731 |
0.9 | LGB09 | 0.6601 | 0.6318 | 8.433411 | 0.625334 | 7.352633 | 15.786044 |
1.0 | LGB10 | 0.6558 | 0.6310 | 8.937261 | 0.625840 | 8.158222 | 17.095483 |
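Since the index is a regular pandas multi-index on (frac, model), the usual selection tools apply. For example:

# Test score per training fraction, dropping the model level of the index
atom.results["score_test"].droplevel("model")

# All results for the run on 50% of the training set
atom.results.loc[0.5]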
In [6]:
# Every model can be accessed through its name
atom.lgb05.plot_shap_waterfall(show=6)
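The same per-model access works for any other method, not just plots. For example, scoring the 50%-of-data model on a set of common metrics (assuming atom's usual evaluate method):

# Score the model trained on half the data on several metrics at once
atom.lgb05.evaluate()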
In [7]:
# Plot the train sizing's results
atom.plot_learning_curve()
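The learning curve shows the test score flattening out around 70-80% of the training data. To retrieve the overall best run programmatically, atom exposes a winner attribute (a minimal sketch, assuming that attribute):

# The model with the best score across all runs
print(atom.winner.name)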