Train sizing

This example shows how to assess a model's performance as a function of the size of the training set.

The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.

Load the data

In [1]:

            
# Import packages
import pandas as pd
from atom import ATOMClassifier

In [2]:

            
# Load the data
X = pd.read_csv("./datasets/weatherAUS.csv")

# Let's have a look
X.head()

Out[2]:

	Location	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	WindDir3pm	...	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday	RainTomorrow
0	MelbourneAirport	18.0	26.9	21.4	7.0	8.9	SSE	41.0	W	SSE	...	95.0	54.0	1019.5	1017.0	8.0	5.0	18.5	26.0	Yes	0
1	Adelaide	17.2	23.4	0.0	NaN	NaN	S	41.0	S	WSW	...	59.0	36.0	1015.7	1015.7	NaN	NaN	17.7	21.9	No	0
2	Cairns	18.6	24.6	7.4	3.0	6.1	SSE	54.0	SSE	SE	...	78.0	57.0	1018.7	1016.6	3.0	3.0	20.8	24.1	Yes	0
3	Portland	13.6	16.8	4.2	1.2	0.0	ESE	39.0	ESE	ESE	...	76.0	74.0	1021.4	1020.5	7.0	8.0	15.6	16.0	Yes	1
4	Walpole	16.4	19.9	0.0	NaN	NaN	SE	44.0	SE	SE	...	78.0	70.0	1019.4	1018.9	NaN	NaN	17.4	18.1	No	0

5 rows × 22 columns

Run the pipeline

In [3]:

            
# Initialize atom and prepare the data
atom = ATOMClassifier(X, verbose=2, warnings=False, random_state=1)
atom.clean()
atom.impute(strat_num="median", strat_cat="most_frequent", max_nan_rows=0.8)
atom.encode()

<< ================== ATOM ================== >>
Algorithm task: binary classification.

Dataset stats ==================== >>
Shape: (142193, 22)
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicate samples: 45 (0.0%)
-------------------------------------
Train set size: 113755
Test set size: 28438
-------------------------------------
|    |        dataset |          train |           test |
| -- | -------------- | -------------- | -------------- |
| 0  |   110316 (3.5) |    88412 (3.5) |    21904 (3.4) |
| 1  |    31877 (1.0) |    25343 (1.0) |     6534 (1.0) |

Applying data cleaning...
Fitting Imputer...
Imputing missing values...
 --> Dropping 15182 samples for containing more than 16 missing values.
 --> Imputing 100 missing values with median (12.2) in feature MinTemp.
 --> Imputing 57 missing values with median (22.8) in feature MaxTemp.
 --> Imputing 640 missing values with median (0.0) in feature Rainfall.
 --> Imputing 46535 missing values with median (4.8) in feature Evaporation.
 --> Imputing 53034 missing values with median (8.5) in feature Sunshine.
 --> Imputing 4381 missing values with most_frequent (W) in feature WindGustDir.
 --> Imputing 4359 missing values with median (39.0) in feature WindGustSpeed.
 --> Imputing 6624 missing values with most_frequent (N) in feature WindDir9am.
 --> Imputing 612 missing values with most_frequent (SE) in feature WindDir3pm.
 --> Imputing 80 missing values with median (13.0) in feature WindSpeed9am.
 --> Imputing 49 missing values with median (19.0) in feature WindSpeed3pm.
 --> Imputing 532 missing values with median (69.0) in feature Humidity9am.
 --> Imputing 1168 missing values with median (52.0) in feature Humidity3pm.
 --> Imputing 1028 missing values with median (1017.6) in feature Pressure9am.
 --> Imputing 972 missing values with median (1015.2) in feature Pressure3pm.
 --> Imputing 42172 missing values with median (5.0) in feature Cloud9am.
 --> Imputing 44251 missing values with median (5.0) in feature Cloud3pm.
 --> Imputing 98 missing values with median (16.8) in feature Temp9am.
 --> Imputing 702 missing values with median (21.3) in feature Temp3pm.
 --> Imputing 640 missing values with most_frequent (No) in feature RainToday.
Fitting Encoder...
Encoding categorical columns...
 --> LeaveOneOut-encoding feature Location. Contains 45 classes.
 --> LeaveOneOut-encoding feature WindGustDir. Contains 16 classes.
 --> LeaveOneOut-encoding feature WindDir9am. Contains 16 classes.
 --> LeaveOneOut-encoding feature WindDir3pm. Contains 16 classes.
 --> Ordinal-encoding feature RainToday. Contains 2 classes.
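The imputation log above reflects two steps: rows with too many missing values are dropped, and the remaining gaps are filled per column with the median (numerical) or the most frequent value (categorical). The following is a minimal sketch of that idea in plain Python, not ATOM's implementation. The helper names `drop_sparse_rows` and `impute_column` are hypothetical, the toy values are made up, and the rule that turns `max_nan_rows=0.8` into a count by truncation is an assumption that happens to agree with the "more than 16 missing values" line for 21 features.

```python
from collections import Counter
from statistics import median


def drop_sparse_rows(rows, max_nan_rows):
    """Drop rows with more missing (None) values than the allowed limit."""
    n_features = len(rows[0])
    # Assumption: a fractional limit is turned into a count by truncation
    limit = int(max_nan_rows * n_features)
    kept = [row for row in rows if sum(v is None for v in row) <= limit]
    return kept, limit


def impute_column(values, numeric):
    """Fill None with the median (numeric) or the most frequent value."""
    present = [v for v in values if v is not None]
    fill = median(present) if numeric else Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Toy rows: (MinTemp, WindGustDir, Rainfall); the values are made up
rows = [
    (12.0, "W", None),
    (None, None, None),  # every value is missing, so this row is dropped
    (14.0, None, 3.0),
    (None, "W", 5.0),
    (10.0, "N", 1.0),
]

kept, limit = drop_sparse_rows(rows, max_nan_rows=0.8)
columns = list(zip(*kept))
min_temp = impute_column(columns[0], numeric=True)
wind_dir = impute_column(columns[1], numeric=False)
rainfall = impute_column(columns[2], numeric=True)

# With 21 features, the same rule gives the limit of 16 seen in the log
log_limit = int(0.8 * 21)
print(limit, min_temp, wind_dir, rainfall, log_limit)
```

In this sketch the fill values are computed on all kept rows at once; a real pipeline would normally learn them from the training set only.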

In [4]:

            
# Analyze the impact of the training set's size on a LightGBM model
atom.train_sizing("LGB", train_sizes=10, n_bootstrap=5)


Run: 0 ================================ >>
Size of training set: 10165 (10%)
Size of test set: 25359

Training ========================= >>
Models: LGB01
Metric: f1

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.8093
Test evaluation --> f1: 0.61
Time elapsed: 0.677s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6054 ± 0.0034
Time elapsed: 1.862s
-------------------------------------------------
Total time: 2.539s


Final results ==================== >>
Duration: 2.539s
-------------------------------------
LightGBM --> f1: 0.6054 ± 0.0034 ~


Run: 1 ================================ >>
Size of training set: 20330 (20%)
Size of test set: 25359

Training ========================= >>
Models: LGB02
Metric: f1

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.7328
Test evaluation --> f1: 0.6218
Time elapsed: 0.900s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6169 ± 0.0039
Time elapsed: 2.390s
-------------------------------------------------
Total time: 3.291s


Final results ==================== >>
Duration: 3.292s
-------------------------------------
LightGBM --> f1: 0.6169 ± 0.0039


Run: 2 ================================ >>
Size of training set: 30495 (30%)
Size of test set: 25359

Training ========================= >>
Models: LGB03
Metric: f1

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.7075
Test evaluation --> f1: 0.6252
Time elapsed: 1.051s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6189 ± 0.0044
Time elapsed: 2.922s
-------------------------------------------------
Total time: 3.974s


Final results ==================== >>
Duration: 3.974s
-------------------------------------
LightGBM --> f1: 0.6189 ± 0.0044


Run: 3 ================================ >>
Size of training set: 40660 (40%)
Size of test set: 25359

Training ========================= >>
Models: LGB04
Metric: f1

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6939
Test evaluation --> f1: 0.6275
Time elapsed: 1.556s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6222 ± 0.0036
Time elapsed: 3.832s
-------------------------------------------------
Total time: 5.389s


Final results ==================== >>
Duration: 5.390s
-------------------------------------
LightGBM --> f1: 0.6222 ± 0.0036


Run: 4 ================================ >>
Size of training set: 50826 (50%)
Size of test set: 25359

Training ========================= >>
Models: LGB05
Metric: f1

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6814
Test evaluation --> f1: 0.6291
Time elapsed: 1.467s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6249 ± 0.0018
Time elapsed: 4.023s
-------------------------------------------------
Total time: 5.492s


Final results ==================== >>
Duration: 5.492s
-------------------------------------
LightGBM --> f1: 0.6249 ± 0.0018


Run: 5 ================================ >>
Size of training set: 60991 (60%)
Size of test set: 25359

Training ========================= >>
Models: LGB06
Metric: f1

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6766
Test evaluation --> f1: 0.6356
Time elapsed: 1.665s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6285 ± 0.0036
Time elapsed: 4.606s
-------------------------------------------------
Total time: 6.273s


Final results ==================== >>
Duration: 6.273s
-------------------------------------
LightGBM --> f1: 0.6285 ± 0.0036


Run: 6 ================================ >>
Size of training set: 71156 (70%)
Size of test set: 25359

Training ========================= >>
Models: LGB07
Metric: f1

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6742
Test evaluation --> f1: 0.6289
Time elapsed: 1.858s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6297 ± 0.0025
Time elapsed: 5.227s
-------------------------------------------------
Total time: 7.087s


Final results ==================== >>
Duration: 7.087s
-------------------------------------
LightGBM --> f1: 0.6297 ± 0.0025


Run: 7 ================================ >>
Size of training set: 81321 (80%)
Size of test set: 25359

Training ========================= >>
Models: LGB08
Metric: f1

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.672
Test evaluation --> f1: 0.6322
Time elapsed: 2.105s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.63 ± 0.0029
Time elapsed: 5.790s
-------------------------------------------------
Total time: 7.896s


Final results ==================== >>
Duration: 7.897s
-------------------------------------
LightGBM --> f1: 0.63 ± 0.0029


Run: 8 ================================ >>
Size of training set: 91486 (90%)
Size of test set: 25359

Training ========================= >>
Models: LGB09
Metric: f1

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6674
Test evaluation --> f1: 0.6354
Time elapsed: 2.347s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6317 ± 0.0024
Time elapsed: 6.337s
-------------------------------------------------
Total time: 8.685s


Final results ==================== >>
Duration: 8.685s
-------------------------------------
LightGBM --> f1: 0.6317 ± 0.0024


Run: 9 ================================ >>
Size of training set: 101652 (100%)
Size of test set: 25359

Training ========================= >>
Models: LGB10
Metric: f1

Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 0.665
Test evaluation --> f1: 0.6356
Time elapsed: 2.527s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.6314 ± 0.0015
Time elapsed: 7.056s
-------------------------------------------------
Total time: 9.585s


Final results ==================== >>
Duration: 9.586s
-------------------------------------
LightGBM --> f1: 0.6314 ± 0.0015
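The training set sizes reported in the ten runs above are consistent with taking ten equally spaced fractions of the 101652 available training rows and truncating to whole rows. The sketch below reproduces those numbers; the function name `train_sizes_for` is hypothetical and the truncation rule is inferred from the log, so treat it as an illustration and not as ATOM's actual code.

```python
def train_sizes_for(n_train, n_sizes):
    """Return (fraction, n_rows) pairs for n_sizes equally spaced fractions."""
    # Integer arithmetic avoids floating-point surprises; rows are truncated
    return [(i / n_sizes, n_train * i // n_sizes) for i in range(1, n_sizes + 1)]


# 101652 is the full training set size reported in the last run
sizes = train_sizes_for(101652, 10)
for frac, n_rows in sizes:
    print(f"{frac:.0%} of the training set -> {n_rows} rows")
```

The test set keeps the same 25359 rows in every run, which is what makes the scores comparable across training set sizes.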

Analyze the results

In [5]:

            
# The results dataframe now has a multi-index, where frac is the
# fraction of the training set used to fit the model. The model
# names end with that fraction as well (without the dot)
atom.results

Out[5]:

frac	model	metric_train	metric_test	time_fit	mean_bootstrap	std_bootstrap	time_bootstrap	time
0.1	LGB01	0.665031	0.635616	0.677s	0.605427	0.003379	1.862s	2.539s
0.2	LGB02	0.665031	0.635616	0.900s	0.616879	0.003871	2.390s	3.291s
0.3	LGB03	0.665031	0.635616	1.051s	0.618866	0.004363	2.922s	3.974s
0.4	LGB04	0.665031	0.635616	1.556s	0.622234	0.003582	3.832s	5.389s
0.5	LGB05	0.665031	0.635616	1.467s	0.624927	0.001810	4.023s	5.492s
0.6	LGB06	0.665031	0.635616	1.665s	0.628501	0.003582	4.606s	6.273s
0.7	LGB07	0.665031	0.635616	1.858s	0.629707	0.002465	5.227s	7.087s
0.8	LGB08	0.665031	0.635616	2.105s	0.630006	0.002926	5.790s	7.896s
0.9	LGB09	0.665031	0.635616	2.347s	0.631703	0.002354	6.337s	8.685s
1.0	LGB10	0.665031	0.635616	2.527s	0.631401	0.001479	7.056s	9.585s
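One way to read this table is to look at how much the bootstrapped score improves with each extra 10% of training data. The sketch below copies the mean_bootstrap column and the last std_bootstrap value from the table and uses plain Python lists instead of the dataframe. The plateau rule (first fraction whose score lies within one standard deviation of the final score) is an illustrative choice, not something ATOM computes.

```python
fracs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
# mean_bootstrap column, copied from the results table
mean_bootstrap = [
    0.605427, 0.616879, 0.618866, 0.622234, 0.624927,
    0.628501, 0.629707, 0.630006, 0.631703, 0.631401,
]
std_last = 0.001479  # std_bootstrap of the run on the full training set

# Gain in f1 for every additional 10% of training data
gains = [round(b - a, 6) for a, b in zip(mean_bootstrap, mean_bootstrap[1:])]
for frac, gain in zip(fracs[1:], gains):
    print(f"{frac:.1f}: {gain:+.6f}")

# Illustrative plateau rule: first fraction whose score is within one
# standard deviation of the score obtained with all the data
plateau = next(
    f for f, m in zip(fracs, mean_bootstrap) if m >= mean_bootstrap[-1] - std_last
)
print("plateau reached at fraction", plateau)
```

The first step from 10% to 20% brings by far the largest gain, and the last step is slightly negative, which suggests that more data of the same kind would add little for this model.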

In [6]:

            
# Every model can be accessed through its name
atom.lgb05.waterfall_plot(show=6)

In [7]:

            
# Plot the train sizing's results
atom.plot_learning_curve()
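A learning curve is often read together with the gap between train and test scores, since a shrinking gap indicates less overfitting as more data is used. As a rough sketch, the snippet below computes that gap from the "Train evaluation" and "Test evaluation" f1 values printed in the run logs above. It uses plain Python lists and does not call any ATOM API.

```python
fracs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
# f1 scores copied from the "Fit" section of each run in the log
train_f1 = [0.8093, 0.7328, 0.7075, 0.6939, 0.6814,
            0.6766, 0.6742, 0.672, 0.6674, 0.665]
test_f1 = [0.61, 0.6218, 0.6252, 0.6275, 0.6291,
           0.6356, 0.6289, 0.6322, 0.6354, 0.6356]

# Difference between train and test score per training set fraction
gaps = [round(tr - te, 4) for tr, te in zip(train_f1, test_f1)]
for frac, gap in zip(fracs, gaps):
    print(f"{frac:.1f}: train - test = {gap:.4f}")
```

The gap falls from roughly 0.20 with 10% of the data to roughly 0.03 with all of it, although not strictly monotonically: the train score drops steadily while the test score rises only slightly.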