Example: Train sizing

This example shows how to assess a model's performance as a function of the size of the training set.

The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.
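
For readers new to the idea, train sizing fits the same model on progressively larger fractions of the training set and compares the resulting scores. The sketch below illustrates the concept with scikit-learn's learning_curve on synthetic data. It is not ATOM's implementation: X_demo, y_demo, the 5-fold cross-validation and the sample count are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (hypothetical, not the weather dataset)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=1)

# Fit the model on 10 evenly spaced fractions (10% to 100%) of each training fold
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="f1",
)

# Average the scores over the cross-validation folds for each training size
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)
for size, train, test in zip(train_sizes, mean_train, mean_test):
    print(f"{size:>5} samples  train f1: {train:.3f}  test f1: {test:.3f}")
```

If the test score stops improving as the number of samples grows, collecting more data is unlikely to help that model.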

Load the data

In [1]:

# Import packages
import pandas as pd
from atom import ATOMClassifier

In [2]:

# Load the data
X = pd.read_csv("docs_source/examples/datasets/weatherAUS.csv")

# Let's have a look
X.head()

Out[2]:

	Location	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	WindDir3pm	...	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday	RainTomorrow
0	MelbourneAirport	18.0	26.9	21.4	7.0	8.9	SSE	41.0	W	SSE	...	95.0	54.0	1019.5	1017.0	8.0	5.0	18.5	26.0	Yes	0
1	Adelaide	17.2	23.4	0.0	NaN	NaN	S	41.0	S	WSW	...	59.0	36.0	1015.7	1015.7	NaN	NaN	17.7	21.9	No	0
2	Cairns	18.6	24.6	7.4	3.0	6.1	SSE	54.0	SSE	SE	...	78.0	57.0	1018.7	1016.6	3.0	3.0	20.8	24.1	Yes	0
3	Portland	13.6	16.8	4.2	1.2	0.0	ESE	39.0	ESE	ESE	...	76.0	74.0	1021.4	1020.5	7.0	8.0	15.6	16.0	Yes	1
4	Walpole	16.4	19.9	0.0	NaN	NaN	SE	44.0	SE	SE	...	78.0	70.0	1019.4	1018.9	NaN	NaN	17.4	18.1	No	0

5 rows × 22 columns

Run the pipeline

In [3]:

# Initialize atom and prepare the data
atom = ATOMClassifier(X, verbose=2, random_state=1)
atom.clean()
atom.impute(strat_num="median", strat_cat="most_frequent", max_nan_rows=0.8)
atom.encode()

<< ================== ATOM ================== >>

Configuration ==================== >>
Algorithm task: Binary classification.

Dataset stats ==================== >>
Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 25.03 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)

Fitting Cleaner...
Cleaning the data...
Fitting Imputer...
Imputing missing values...
 --> Dropping 161 samples for containing more than 16 missing values.
 --> Imputing 481 missing values with median (12.0) in column MinTemp.
 --> Imputing 265 missing values with median (22.6) in column MaxTemp.
 --> Imputing 1354 missing values with median (0.0) in column Rainfall.
 --> Imputing 60682 missing values with median (4.8) in column Evaporation.
 --> Imputing 67659 missing values with median (8.4) in column Sunshine.
 --> Imputing 9187 missing values with most_frequent (W) in column WindGustDir.
 --> Imputing 9127 missing values with median (39.0) in column WindGustSpeed.
 --> Imputing 9852 missing values with most_frequent (N) in column WindDir9am.
 --> Imputing 3617 missing values with most_frequent (SE) in column WindDir3pm.
 --> Imputing 1187 missing values with median (13.0) in column WindSpeed9am.
 --> Imputing 2469 missing values with median (19.0) in column WindSpeed3pm.
 --> Imputing 1613 missing values with median (70.0) in column Humidity9am.
 --> Imputing 3449 missing values with median (52.0) in column Humidity3pm.
 --> Imputing 13863 missing values with median (1017.6) in column Pressure9am.
 --> Imputing 13830 missing values with median (1015.2) in column Pressure3pm.
 --> Imputing 53496 missing values with median (5.0) in column Cloud9am.
 --> Imputing 56933 missing values with median (5.0) in column Cloud3pm.
 --> Imputing 743 missing values with median (16.7) in column Temp9am.
 --> Imputing 2565 missing values with median (21.1) in column Temp3pm.
 --> Imputing 1354 missing values with most_frequent (No) in column RainToday.
Fitting Encoder...
Encoding categorical columns...
 --> Target-encoding feature Location. Contains 49 classes.
 --> Target-encoding feature WindGustDir. Contains 16 classes.
 --> Target-encoding feature WindDir9am. Contains 16 classes.
 --> Target-encoding feature WindDir3pm. Contains 16 classes.
 --> Ordinal-encoding feature RainToday. Contains 2 classes.
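
The two imputation strategies reported in the log (median for numerical columns, most frequent value for categorical ones) can be pictured with plain pandas. This is a minimal sketch on a made-up frame, not ATOM's Imputer: the column names mirror the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame; values are invented, not taken from weatherAUS.csv
df = pd.DataFrame(
    {
        "MinTemp": [12.0, np.nan, 9.5, 14.0],
        "WindGustDir": ["W", "W", None, "SE"],
        "RainToday": ["No", None, "No", "Yes"],
    }
)

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        fill = df[col].median()   # numerical column: fill with the median
    else:
        fill = df[col].mode()[0]  # categorical column: fill with the most frequent value
    df[col] = df[col].fillna(fill)

print(df)
```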

In [4]:

# Analyze the impact of the training set's size on an LR model
atom.train_sizing("LR", train_sizes=10, n_bootstrap=5)

Training ========================= >>
Metric: f1


Run: 0 =========================== >>
Models: LR01
Size of training set: 11362 (10%)
Size of test set: 28408


Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.563
Test evaluation --> f1: 0.5854
Time elapsed: 1.181s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5849 ± 0.002
Time elapsed: 0.910s
-------------------------------------------------
Time: 2.091s


Final results ==================== >>
Total time: 2.109s
-------------------------------------
LogisticRegression --> f1: 0.5849 ± 0.002


Run: 1 =========================== >>
Models: LR02
Size of training set: 22724 (20%)
Size of test set: 28408


Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.582
Test evaluation --> f1: 0.5873
Time elapsed: 1.455s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5852 ± 0.0021
Time elapsed: 1.120s
-------------------------------------------------
Time: 2.575s


Final results ==================== >>
Total time: 2.598s
-------------------------------------
LogisticRegression --> f1: 0.5852 ± 0.0021


Run: 2 =========================== >>
Models: LR03
Size of training set: 34087 (30%)
Size of test set: 28408


Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.581
Test evaluation --> f1: 0.5851
Time elapsed: 1.702s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5861 ± 0.0009
Time elapsed: 1.355s
-------------------------------------------------
Time: 3.057s


Final results ==================== >>
Total time: 3.082s
-------------------------------------
LogisticRegression --> f1: 0.5861 ± 0.0009


Run: 3 =========================== >>
Models: LR04
Size of training set: 45449 (40%)
Size of test set: 28408


Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5827
Test evaluation --> f1: 0.5869
Time elapsed: 2.250s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5863 ± 0.0017
Time elapsed: 1.599s
-------------------------------------------------
Time: 3.850s


Final results ==================== >>
Total time: 3.881s
-------------------------------------
LogisticRegression --> f1: 0.5863 ± 0.0017


Run: 4 =========================== >>
Models: LR05
Size of training set: 56812 (50%)
Size of test set: 28408


Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5819
Test evaluation --> f1: 0.585
Time elapsed: 2.163s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5854 ± 0.0017
Time elapsed: 1.878s
-------------------------------------------------
Time: 4.041s


Final results ==================== >>
Total time: 4.077s
-------------------------------------
LogisticRegression --> f1: 0.5854 ± 0.0017


Run: 5 =========================== >>
Models: LR06
Size of training set: 68174 (60%)
Size of test set: 28408


Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5832
Test evaluation --> f1: 0.5848
Time elapsed: 2.338s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5849 ± 0.0018
Time elapsed: 1.899s
-------------------------------------------------
Time: 4.237s


Final results ==================== >>
Total time: 4.279s
-------------------------------------
LogisticRegression --> f1: 0.5849 ± 0.0018


Run: 6 =========================== >>
Models: LR07
Size of training set: 79536 (70%)
Size of test set: 28408


Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5873
Test evaluation --> f1: 0.5849
Time elapsed: 2.427s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5852 ± 0.0012
Time elapsed: 2.060s
-------------------------------------------------
Time: 4.486s


Final results ==================== >>
Total time: 4.531s
-------------------------------------
LogisticRegression --> f1: 0.5852 ± 0.0012


Run: 7 =========================== >>
Models: LR08
Size of training set: 90899 (80%)
Size of test set: 28408


Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.589
Test evaluation --> f1: 0.5837
Time elapsed: 2.631s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5853 ± 0.0026
Time elapsed: 2.173s
-------------------------------------------------
Time: 4.804s


Final results ==================== >>
Total time: 4.853s
-------------------------------------
LogisticRegression --> f1: 0.5853 ± 0.0026


Run: 8 =========================== >>
Models: LR09
Size of training set: 102261 (90%)
Size of test set: 28408


Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5871
Test evaluation --> f1: 0.5845
Time elapsed: 2.837s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5846 ± 0.002
Time elapsed: 2.550s
-------------------------------------------------
Time: 5.387s


Final results ==================== >>
Total time: 5.443s
-------------------------------------
LogisticRegression --> f1: 0.5846 ± 0.002


Run: 9 =========================== >>
Models: LR10
Size of training set: 113624 (100%)
Size of test set: 28408


Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5858
Test evaluation --> f1: 0.5848
Time elapsed: 4.211s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5848 ± 0.0007
Time elapsed: 2.967s
-------------------------------------------------
Time: 7.178s


Final results ==================== >>
Total time: 7.243s
-------------------------------------
LogisticRegression --> f1: 0.5848 ± 0.0007
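
Each run above reports a bootstrap score in the form "value ± spread" for n_bootstrap=5. The sketch below shows one common way such a figure can be produced: refit the model on samples drawn with replacement from the training set and score each refit on the same test set. Whether ATOM computes it exactly this way is an assumption, and the synthetic data and variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in data (hypothetical, not the weather dataset)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

scores = []
for i in range(5):  # mirrors n_bootstrap=5
    # Draw a sample of the training set with replacement
    X_boot, y_boot = resample(X_train, y_train, replace=True, random_state=i)
    model = LogisticRegression(max_iter=1000).fit(X_boot, y_boot)
    # Always evaluate on the same, untouched test set
    scores.append(f1_score(y_test, model.predict(X_test)))

print(f"f1: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
```

A small spread, as in the runs above, suggests the score is stable with respect to which training samples the model happens to see.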

Analyze the results

In [5]:

# The results are now multi-index, where frac is the fraction
# of the training set used to fit the model. The model names
# end with the fraction as well (without the dot)
atom.results

Out[5]:

		f1_train	f1_test	time_fit	f1_bootstrap	time_bootstrap	time
frac	model
0.100000	LR01	0.562100	0.584800	1.181076	0.584922	0.909830	2.090906
0.200000	LR02	0.583200	0.584600	1.455324	0.585234	1.120021	2.575345
0.300000	LR03	0.580000	0.585200	1.702020	0.586118	1.354517	3.056537
0.400000	LR04	0.584500	0.585700	2.250048	0.586348	1.599457	3.849505
0.500000	LR05	0.583300	0.586500	2.163214	0.585384	1.877947	4.041161
0.600000	LR06	0.583100	0.583200	2.338079	0.584891	1.898731	4.236810
0.700000	LR07	0.587800	0.585800	2.426779	0.585235	2.059590	4.486369
0.800000	LR08	0.591600	0.588600	2.630608	0.585269	2.172981	4.803589
0.900000	LR09	0.585600	0.583300	2.836993	0.584633	2.550147	5.387140
1.000000	LR10	0.585800	0.584800	4.211031	0.584836	2.966612	7.177643
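
Selecting rows from a multi-index frame like the one above can be done with standard pandas indexing. The sketch below rebuilds the first three rows by hand (two columns only) so that it runs without ATOM. The variable name results is hypothetical, and the index level names frac and model are taken from the table header.

```python
import pandas as pd

# First three rows of the table above, f1_test and f1_bootstrap only
results = pd.DataFrame(
    {
        "f1_test": [0.5848, 0.5846, 0.5852],
        "f1_bootstrap": [0.584922, 0.585234, 0.586118],
    },
    index=pd.MultiIndex.from_tuples(
        [(0.1, "LR01"), (0.2, "LR02"), (0.3, "LR03")],
        names=["frac", "model"],
    ),
)

# All models trained on 20% of the training set
print(results.loc[0.2])

# One metric for one (frac, model) pair
print(results.loc[(0.3, "LR03"), "f1_bootstrap"])

# Drop the frac level to index by model name only
by_model = results.droplevel("frac")
print(by_model.loc["LR02", "f1_test"])
```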

In [6]:

# Every model can be accessed through its name
atom.lr05.plot_shap_waterfall(show=6)

[Figure: SHAP waterfall plot for model LR05]

In [7]:

# Plot the train sizing's results
atom.plot_learning_curve()
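
Since the rendered figure is not reproduced here, the same information can be drawn by hand from the numbers already shown. The sketch below uses matplotlib as a stand-in and is not what plot_learning_curve produces: the training set sizes are copied from the run logs and the f1_bootstrap values from the results table. The output filename is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs headless
import matplotlib.pyplot as plt

# Training set sizes from the run logs, bootstrap scores from the results table
sizes = [11362, 22724, 34087, 45449, 56812, 68174, 79536, 90899, 102261, 113624]
f1_bootstrap = [0.584922, 0.585234, 0.586118, 0.586348, 0.585384,
                0.584891, 0.585235, 0.585269, 0.584633, 0.584836]

fig, ax = plt.subplots()
ax.plot(sizes, f1_bootstrap, marker="o", label="LR")
ax.set_xlabel("Number of training samples")
ax.set_ylabel("f1 (bootstrap mean)")
ax.legend()
fig.savefig("learning_curve_sketch.png")  # hypothetical output file
```

The curve is essentially flat: the bootstrap score varies by less than 0.002 between the smallest and the largest training set, so for this model, adding more training data brings no measurable improvement.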