Binary classification
This example shows how to use ATOM to solve a binary classification problem. Additionally, we'll perform a variety of data cleaning steps to prepare the data for modelling.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.
Load the data
In [1]:
# Import packages
import pandas as pd
from atom import ATOMClassifier
In [2]:
# Load data
X = pd.read_csv("./datasets/weatherAUS.csv")
# Let's have a look
X.head()
Out[2]:
|   | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
| 1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
| 2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
| 3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
| 4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |
5 rows × 22 columns
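The table already hints at the two issues we'll handle next: missing values and categorical columns. Before running the pipeline, you can quantify both with plain pandas (a quick, optional check; this is not part of ATOM's output):

# Columns with the most missing values and the target's class balance
print(X.isna().sum().sort_values(ascending=False).head())
print(X["RainTomorrow"].value_counts(normalize=True))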
Run the pipeline
In [3]:
# Call atom using only 5% of the complete dataset (for explanatory purposes)
atom = ATOMClassifier(X, "RainTomorrow", n_rows=0.05, n_jobs=8, warnings=False, verbose=2)
<< ================== ATOM ================== >>
Algorithm task: binary classification.
Parallel processing with 8 cores.

Dataset stats ==================== >>
Shape: (7109, 22)
Memory: 3.08 MB
Scaled: False
Missing values: 15896 (10.2%)
Categorical features: 5 (23.8%)
Duplicate samples: 2 (0.0%)
-------------------------------------
Train set size: 5688
Test set size: 1421
-------------------------------------
|   |      dataset |        train |         test |
| - | ------------ | ------------ | ------------ |
| 0 |   5614 (3.8) |   4492 (3.8) |   1122 (3.8) |
| 1 |   1495 (1.0) |   1196 (1.0) |    299 (1.0) |
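atom now holds the (subsampled) dataset and its train/test split, exposed as attributes. A small sketch of how you could inspect them (attribute names follow ATOM's documented API; treat this as an illustration rather than part of the example's output):

# Inspect the data that atom works with
print(atom.train.shape, atom.test.shape)
atom.dataset.head()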
In [4]:
# Impute missing values
atom.impute(strat_num="median", strat_cat="drop", max_nan_rows=0.8)
Fitting Imputer...
Imputing missing values...
--> Dropping 774 samples for containing more than 16 missing values.
--> Imputing 7 missing values with median (12.1) in feature MinTemp.
--> Imputing 5 missing values with median (22.9) in feature MaxTemp.
--> Imputing 33 missing values with median (0.0) in feature Rainfall.
--> Imputing 2315 missing values with median (4.6) in feature Evaporation.
--> Imputing 2648 missing values with median (8.45) in feature Sunshine.
--> Dropping 202 samples due to missing values in feature WindGustDir.
--> Imputing 200 missing values with median (39.0) in feature WindGustSpeed.
--> Dropping 365 samples due to missing values in feature WindDir9am.
--> Dropping 24 samples due to missing values in feature WindDir3pm.
--> Imputing 4 missing values with median (13.0) in feature WindSpeed9am.
--> Imputing 3 missing values with median (19.0) in feature WindSpeed3pm.
--> Imputing 23 missing values with median (69.0) in feature Humidity9am.
--> Imputing 57 missing values with median (52.0) in feature Humidity3pm.
--> Imputing 42 missing values with median (1017.6) in feature Pressure9am.
--> Imputing 40 missing values with median (1015.2) in feature Pressure3pm.
--> Imputing 2112 missing values with median (5.0) in feature Cloud9am.
--> Imputing 2200 missing values with median (5.0) in feature Cloud3pm.
--> Imputing 5 missing values with median (16.9) in feature Temp9am.
--> Imputing 34 missing values with median (21.3) in feature Temp3pm.
--> Dropping 33 samples due to missing values in feature RainToday.
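Since atom.dataset is a plain pandas DataFrame, it's easy to verify that no missing values remain after imputation:

# Should print 0 after the imputer has run
print(atom.dataset.isna().sum().sum())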
In [5]:
# Encode the categorical features
atom.encode(strategy="Target", max_onehot=10, frac_to_other=0.04)
Fitting Encoder...
Encoding categorical columns...
--> Target-encoding feature Location. Contains 44 classes.
--> Target-encoding feature WindGustDir. Contains 16 classes.
--> Target-encoding feature WindDir9am. Contains 16 classes.
--> Target-encoding feature WindDir3pm. Contains 16 classes.
--> Ordinal-encoding feature RainToday. Contains 2 classes.
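After encoding, all features should be numeric, which you can confirm through the same DataFrame:

# Every column should now have a numeric dtype
print(atom.dataset.dtypes.value_counts())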
In [6]:
# Train an Extra-Trees and a Random Forest model
atom.run(models=["ET", "RF"], metric="f1", n_bootstrap=5)
Training ========================= >>
Models: ET, RF
Metric: f1


Results for Extra-Trees:
Fit ---------------------------------------------
Train evaluation --> f1: 1.0
Test evaluation --> f1: 0.5517
Time elapsed: 0.214s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5135 ± 0.0058
Time elapsed: 0.894s
-------------------------------------------------
Total time: 1.109s


Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> f1: 1.0
Test evaluation --> f1: 0.57
Time elapsed: 0.285s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5492 ± 0.0059
Time elapsed: 1.107s
-------------------------------------------------
Total time: 1.394s


Final results ==================== >>
Duration: 2.504s
-------------------------------------
Extra-Trees   --> f1: 0.5135 ± 0.0058 ~
Random Forest --> f1: 0.5492 ± 0.0059 ~ !
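The bootstrap scores reported above come from refitting each model on resampled copies of the training set and evaluating on the untouched test set. Conceptually, the procedure looks roughly like this standalone scikit-learn sketch (illustrative only, not ATOM's internal code; names are hypothetical):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.utils import resample

def bootstrap_f1(X_train, y_train, X_test, y_test, n_rounds=5, seed=1):
    """Mean and std of f1 over models fit on bootstrap resamples."""
    scores = []
    for i in range(n_rounds):
        # Draw a resample (with replacement) of the training data
        X_res, y_res = resample(X_train, y_train, random_state=seed + i)
        model = RandomForestClassifier(random_state=seed).fit(X_res, y_res)
        scores.append(f1_score(y_test, model.predict(X_test)))
    return np.mean(scores), np.std(scores)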
Analyze the results
In [7]:
# Let's have a look at the final results
atom.results
Out[7]:
|    | metric_train | metric_test | time_fit | mean_bootstrap | std_bootstrap | time_bootstrap | time |
|---|---|---|---|---|---|---|---|
| ET | 1.0 | 0.551724 | 0.214s | 0.513471 | 0.005761 | 0.894s | 1.109s |
| RF | 1.0 | 0.569975 | 0.285s | 0.549239 | 0.005908 | 1.107s | 1.394s |
In [8]:
# Visualize the bootstrap results
atom.plot_results(title="RF vs ET performance")
In [9]:
# Print the results of some common metrics
atom.evaluate()
Out[9]:
|    | accuracy | average_precision | balanced_accuracy | f1 | jaccard | matthews_corrcoef | precision | recall | roc_auc |
|---|---|---|---|---|---|---|---|---|---|
| ET | 0.849644 | 0.675446 | 0.697648 | 0.551724 | 0.380952 | 0.497465 | 0.764706 | 0.431535 | 0.874154 |
| RF | 0.849644 | 0.679633 | 0.709715 | 0.569975 | 0.398577 | 0.503377 | 0.736842 | 0.464730 | 0.877577 |
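These scores are computed on the test set, so they can be reproduced with scikit-learn directly. For example, the f1 column for the Random Forest (a sketch, assuming the fitted estimator is reachable through the model's estimator attribute):

from sklearn.metrics import f1_score

# Recompute the f1 test score from the fitted estimator
y_pred = atom.rf.estimator.predict(atom.X_test)
print(f1_score(atom.y_test, y_pred))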
In [10]:
# The winner attribute returns the best performing model (atom.winner == atom.rf)
print(f"The winner is the {atom.winner.fullname} model!!")
The winner is the Random Forest model!!
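As a sanity check, the winner should match the comparison in atom.results (the Random Forest had the higher bootstrap f1):

# Should print True
print(atom.winner == atom.rf)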
In [11]:
# Visualize the distribution of predicted probabilities
atom.winner.plot_probabilities()
In [12]:
# Compare how different metrics perform for different thresholds
atom.winner.plot_threshold(metric=["f1", "accuracy", "average_precision"], steps=50)
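The plot varies the decision threshold over the predicted probabilities. Applying a custom threshold manually would look like this (a sketch; 0.4 is an arbitrary example value, and the estimator is accessed as in the earlier snippet):

# Classify as rain when P(RainTomorrow=1) exceeds a custom threshold
proba = atom.winner.estimator.predict_proba(atom.X_test)[:, 1]
y_pred_custom = (proba >= 0.4).astype(int)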