Train sizing

This example shows how to assess a model's performance as a function of the size of the training set.

The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal is to predict whether it will rain tomorrow by training a binary classifier on the target column RainTomorrow.

Load the data

In [1]:

            
# Import packages
import pandas as pd
from atom import ATOMClassifier

In [2]:

            
# Load the data
X = pd.read_csv("./datasets/weatherAUS.csv")

# Let's have a look
X.head()

Out[2]:

	Location	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	WindDir3pm	...	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday	RainTomorrow
0	MelbourneAirport	18.0	26.9	21.4	7.0	8.9	SSE	41.0	W	SSE	...	95.0	54.0	1019.5	1017.0	8.0	5.0	18.5	26.0	Yes	0
1	Adelaide	17.2	23.4	0.0	NaN	NaN	S	41.0	S	WSW	...	59.0	36.0	1015.7	1015.7	NaN	NaN	17.7	21.9	No	0
2	Cairns	18.6	24.6	7.4	3.0	6.1	SSE	54.0	SSE	SE	...	78.0	57.0	1018.7	1016.6	3.0	3.0	20.8	24.1	Yes	0
3	Portland	13.6	16.8	4.2	1.2	0.0	ESE	39.0	ESE	ESE	...	76.0	74.0	1021.4	1020.5	7.0	8.0	15.6	16.0	Yes	1
4	Walpole	16.4	19.9	0.0	NaN	NaN	SE	44.0	SE	SE	...	78.0	70.0	1019.4	1018.9	NaN	NaN	17.4	18.1	No	0

5 rows × 22 columns

Run the pipeline

In [3]:

            
# Initialize atom and prepare the data
atom = ATOMClassifier(X, verbose=2, random_state=1)
atom.clean()
atom.impute(strat_num="median", strat_cat="most_frequent", max_nan_rows=0.8)
atom.encode()

<< ================== ATOM ================== >>
Algorithm task: binary classification.

Dataset stats ==================== >>
Shape: (142193, 22)
Memory: 61.69 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicate samples: 45 (0.0%)
-------------------------------------
Train set size: 113755
Test set size: 28438
-------------------------------------
|   |        dataset |          train |           test |
| - | -------------- | -------------- | -------------- |
| 0 |   110316 (3.5) |    88253 (3.5) |    22063 (3.5) |
| 1 |    31877 (1.0) |    25502 (1.0) |     6375 (1.0) |

Fitting Cleaner...
Cleaning the data...
 --> Label-encoding the target column.
Fitting Imputer...
Imputing missing values...
 --> Dropping 15182 samples for containing more than 16 missing values.
 --> Imputing 100 missing values with median (12.2) in feature MinTemp.
 --> Imputing 57 missing values with median (22.8) in feature MaxTemp.
 --> Imputing 640 missing values with median (0.0) in feature Rainfall.
 --> Imputing 46535 missing values with median (4.8) in feature Evaporation.
 --> Imputing 53034 missing values with median (8.5) in feature Sunshine.
 --> Imputing 4381 missing values with most_frequent (W) in feature WindGustDir.
 --> Imputing 4359 missing values with median (39.0) in feature WindGustSpeed.
 --> Imputing 6624 missing values with most_frequent (N) in feature WindDir9am.
 --> Imputing 612 missing values with most_frequent (SE) in feature WindDir3pm.
 --> Imputing 80 missing values with median (13.0) in feature WindSpeed9am.
 --> Imputing 49 missing values with median (19.0) in feature WindSpeed3pm.
 --> Imputing 532 missing values with median (69.0) in feature Humidity9am.
 --> Imputing 1168 missing values with median (52.0) in feature Humidity3pm.
 --> Imputing 1028 missing values with median (1017.6) in feature Pressure9am.
 --> Imputing 972 missing values with median (1015.2) in feature Pressure3pm.
 --> Imputing 42172 missing values with median (5.0) in feature Cloud9am.
 --> Imputing 44251 missing values with median (5.0) in feature Cloud3pm.
 --> Imputing 98 missing values with median (16.8) in feature Temp9am.
 --> Imputing 702 missing values with median (21.3) in feature Temp3pm.
 --> Imputing 640 missing values with most_frequent (No) in feature RainToday.
Fitting Encoder...
Encoding categorical columns...
 --> LeaveOneOut-encoding feature Location. Contains 45 classes.
 --> LeaveOneOut-encoding feature WindGustDir. Contains 16 classes.
 --> LeaveOneOut-encoding feature WindDir9am. Contains 16 classes.
 --> LeaveOneOut-encoding feature WindDir3pm. Contains 16 classes.
 --> Ordinal-encoding feature RainToday. Contains 2 classes.
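The imputation log above describes three steps: dropping rows with too many missing values, filling numerical features with the median, and filling categorical features with the most frequent value. The following is a minimal standard-library sketch of that idea, not ATOM's implementation. The toy rows, the `max_missing` threshold and the helper names `drop_sparse_rows` and `impute` are assumptions made for illustration.

```python
from collections import Counter
from statistics import median

# Toy rows; None marks a missing value. Column names are borrowed from
# the dataset, the values are made up for illustration.
rows = [
    {"MinTemp": 18.0, "WindGustDir": "SSE", "RainToday": "Yes"},
    {"MinTemp": None, "WindGustDir": "S", "RainToday": "No"},
    {"MinTemp": 13.6, "WindGustDir": None, "RainToday": "No"},
    {"MinTemp": None, "WindGustDir": None, "RainToday": None},
    {"MinTemp": 16.4, "WindGustDir": "S", "RainToday": "No"},
]


def drop_sparse_rows(rows, max_missing):
    """Keep only the rows with at most `max_missing` missing values."""
    return [r for r in rows if sum(v is None for v in r.values()) <= max_missing]


def impute(rows, numeric, categorical):
    """Fill numeric columns with the median and categorical ones with the mode."""
    fill = {}
    for col in numeric:
        fill[col] = median(r[col] for r in rows if r[col] is not None)
    for col in categorical:
        counts = Counter(r[col] for r in rows if r[col] is not None)
        fill[col] = counts.most_common(1)[0][0]
    return [{c: fill[c] if v is None else v for c, v in r.items()} for r in rows]


kept = drop_sparse_rows(rows, max_missing=2)
clean = impute(kept, numeric=["MinTemp"], categorical=["WindGustDir", "RainToday"])
for row in clean:
    print(row)
```

In this sketch the fill values are computed on all remaining rows. In a real pipeline they would normally be learned on the training set only and then applied to the test set.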

In [4]:

            
# Analyze the impact of the training set's size on a LightGBM model
atom.train_sizing("LGB", train_sizes=10, n_bootstrap=5)


Run: 0 ================================ >>
Size of training set: 10146 (10%)
Size of test set: 25548

Training ========================= >>
Models: LGB01
Metric: f1


Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.8195
Test evaluation --> f1: 0.6268
Time elapsed: 0.828s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6186 ± 0.0048
Time elapsed: 1.844s
-------------------------------------------------
Total time: 2.672s


Final results ==================== >>
Duration: 2.672s
-------------------------------------
LightGBM --> f1: 0.6186 ± 0.0048 ~


Run: 1 ================================ >>
Size of training set: 20292 (20%)
Size of test set: 25548

Training ========================= >>
Models: LGB02
Metric: f1


Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.7403
Test evaluation --> f1: 0.6378
Time elapsed: 0.859s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6307 ± 0.002
Time elapsed: 2.360s
-------------------------------------------------
Total time: 3.219s


Final results ==================== >>
Duration: 3.219s
-------------------------------------
LightGBM --> f1: 0.6307 ± 0.002


Run: 2 ================================ >>
Size of training set: 30438 (30%)
Size of test set: 25548

Training ========================= >>
Models: LGB03
Metric: f1


Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.7119
Test evaluation --> f1: 0.6452
Time elapsed: 1.031s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6342 ± 0.0017
Time elapsed: 2.853s
-------------------------------------------------
Total time: 3.884s


Final results ==================== >>
Duration: 3.884s
-------------------------------------
LightGBM --> f1: 0.6342 ± 0.0017


Run: 3 ================================ >>
Size of training set: 40585 (40%)
Size of test set: 25548

Training ========================= >>
Models: LGB04
Metric: f1


Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6949
Test evaluation --> f1: 0.6451
Time elapsed: 1.313s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6372 ± 0.005
Time elapsed: 3.438s
-------------------------------------------------
Total time: 4.766s


Final results ==================== >>
Duration: 4.766s
-------------------------------------
LightGBM --> f1: 0.6372 ± 0.005


Run: 4 ================================ >>
Size of training set: 50731 (50%)
Size of test set: 25548

Training ========================= >>
Models: LGB05
Metric: f1


Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6844
Test evaluation --> f1: 0.6459
Time elapsed: 1.453s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.64 ± 0.0009
Time elapsed: 3.954s
-------------------------------------------------
Total time: 5.407s


Final results ==================== >>
Duration: 5.407s
-------------------------------------
LightGBM --> f1: 0.64 ± 0.0009


Run: 5 ================================ >>
Size of training set: 60877 (60%)
Size of test set: 25548

Training ========================= >>
Models: LGB06
Metric: f1


Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6743
Test evaluation --> f1: 0.648
Time elapsed: 1.688s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6394 ± 0.0017
Time elapsed: 4.498s
-------------------------------------------------
Total time: 6.186s


Final results ==================== >>
Duration: 6.186s
-------------------------------------
LightGBM --> f1: 0.6394 ± 0.0017


Run: 6 ================================ >>
Size of training set: 71024 (70%)
Size of test set: 25548

Training ========================= >>
Models: LGB07
Metric: f1


Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6723
Test evaluation --> f1: 0.6456
Time elapsed: 1.844s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6413 ± 0.0031
Time elapsed: 5.183s
-------------------------------------------------
Total time: 7.027s


Final results ==================== >>
Duration: 7.027s
-------------------------------------
LightGBM --> f1: 0.6413 ± 0.0031


Run: 7 ================================ >>
Size of training set: 81170 (80%)
Size of test set: 25548

Training ========================= >>
Models: LGB08
Metric: f1


Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6677
Test evaluation --> f1: 0.6466
Time elapsed: 2.125s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6435 ± 0.0005
Time elapsed: 5.872s
-------------------------------------------------
Total time: 7.997s


Final results ==================== >>
Duration: 7.997s
-------------------------------------
LightGBM --> f1: 0.6435 ± 0.0005


Run: 8 ================================ >>
Size of training set: 91316 (90%)
Size of test set: 25548

Training ========================= >>
Models: LGB09
Metric: f1


Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6669
Test evaluation --> f1: 0.6462
Time elapsed: 2.348s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6427 ± 0.0012
Time elapsed: 6.260s
-------------------------------------------------
Total time: 8.608s


Final results ==================== >>
Duration: 8.608s
-------------------------------------
LightGBM --> f1: 0.6427 ± 0.0012


Run: 9 ================================ >>
Size of training set: 101463 (100%)
Size of test set: 25548

Training ========================= >>
Models: LGB10
Metric: f1


Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6658
Test evaluation --> f1: 0.6466
Time elapsed: 2.547s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6444 ± 0.0042
Time elapsed: 6.848s
-------------------------------------------------
Total time: 9.395s


Final results ==================== >>
Duration: 9.395s
-------------------------------------
LightGBM --> f1: 0.6444 ± 0.0042
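The training set sizes reported in the ten runs above are consistent with taking ten equally spaced fractions (10%, 20%, ..., 100%) of the 101463 training samples and truncating to whole samples. The sketch below reproduces those sizes under that assumption. The helper `train_subset_sizes` is hypothetical and not part of ATOM.

```python
def train_subset_sizes(n_train, n_steps):
    """Return the subset sizes for equally spaced fractions of the training set.

    Integer arithmetic is used so that each size is truncated, not rounded.
    """
    return [n_train * i // n_steps for i in range(1, n_steps + 1)]


sizes = train_subset_sizes(n_train=101463, n_steps=10)
print(sizes)
# [10146, 20292, 30438, 40585, 50731, 60877, 71024, 81170, 91316, 101463]
```

The test set keeps the same 25548 samples in every run, so the scores of the different runs can be compared directly.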

Analyze the results

In [5]:

            
# The results are now multi-index, where frac is the fraction
# of the training set used to fit the model. The model names
# end with the fraction as well (without the dot)
atom.results

Out[5]:

frac	model	metric_train	metric_test	time_fit	mean_bootstrap	std_bootstrap	time_bootstrap	time
0.1	LGB01	0.665781	0.646603	0.828s	0.618559	0.004824	1.844s	2.672s
0.2	LGB02	0.665781	0.646603	0.859s	0.630685	0.002021	2.360s	3.219s
0.3	LGB03	0.665781	0.646603	1.031s	0.634181	0.001720	2.853s	3.884s
0.4	LGB04	0.665781	0.646603	1.313s	0.637212	0.004984	3.438s	4.766s
0.5	LGB05	0.665781	0.646603	1.453s	0.639984	0.000927	3.954s	5.407s
0.6	LGB06	0.665781	0.646603	1.688s	0.639425	0.001749	4.498s	6.186s
0.7	LGB07	0.665781	0.646603	1.844s	0.641320	0.003079	5.183s	7.027s
0.8	LGB08	0.665781	0.646603	2.125s	0.643463	0.000544	5.872s	7.997s
0.9	LGB09	0.665781	0.646603	2.348s	0.642688	0.001249	6.260s	8.608s
1.0	LGB10	0.665781	0.646603	2.547s	0.644359	0.004202	6.848s	9.395s
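One way to read these results is to look at the gap between the train and test scores at each fraction. The sketch below uses the f1 values printed on the "Train evaluation" and "Test evaluation" lines of the run logs above. The variable names are chosen for illustration only.

```python
# f1 scores copied from the "Train evaluation" and "Test evaluation"
# lines of runs 0 to 9 in the training output.
fracs = [i / 10 for i in range(1, 11)]
train_f1 = [0.8195, 0.7403, 0.7119, 0.6949, 0.6844,
            0.6743, 0.6723, 0.6677, 0.6669, 0.6658]
test_f1 = [0.6268, 0.6378, 0.6452, 0.6451, 0.6459,
           0.648, 0.6456, 0.6466, 0.6462, 0.6466]

# Difference between train and test score per training fraction
gaps = [round(tr - te, 4) for tr, te in zip(train_f1, test_f1)]

for frac, gap in zip(fracs, gaps):
    print(f"frac={frac:.1f}  train-test gap={gap:.4f}")
```

The gap shrinks from about 0.19 at 10% of the training set to about 0.02 at 100%, while the test score changes little beyond 30%. This suggests that the model overfits the smaller subsets and that adding more data mainly reduces overfitting rather than raising the test score.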

In [6]:

            
# Every model can be accessed through its name
atom.lgb05.waterfall_plot(show=6)

In [7]:

            
# Plot the train sizing's results
atom.plot_learning_curve()
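The ± values reported for each run come from the `n_bootstrap=5` argument. As a rough illustration of the general bootstrap idea, and not of ATOM's internals, the sketch below resamples a training set with replacement, refits a model on each resample, and summarizes the scores on a fixed test set. The synthetic data, the toy threshold model and the use of accuracy instead of f1 are all assumptions made to keep the example self-contained.

```python
import random
from statistics import mean, stdev

rng = random.Random(1)


def make_samples(n):
    """Synthetic (feature, label) pairs; positives are shifted by +1.5."""
    samples = []
    for _ in range(n):
        y = 1 if rng.random() < 0.25 else 0
        samples.append((rng.gauss(1.5 * y, 1.0), y))
    return samples


def fit_threshold(samples):
    """Toy model: the midpoint between the feature means of both classes."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    return (mean(pos) + mean(neg)) / 2


def accuracy(threshold, samples):
    """Fraction of samples where `x > threshold` matches the label."""
    return sum((x > threshold) == bool(y) for x, y in samples) / len(samples)


train, test = make_samples(400), make_samples(200)

scores = []
for _ in range(5):  # five bootstrap rounds, mirroring n_bootstrap=5
    resample = rng.choices(train, k=len(train))  # sampling with replacement
    scores.append(accuracy(fit_threshold(resample), test))

print(f"score: {mean(scores):.4f} ± {stdev(scores):.4f}")
```

The spread across resamples gives an impression of how sensitive the score is to the particular training samples, which is the kind of uncertainty a learning curve can display around each point.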