# Example: Binary classification
--------------------------------

This example shows how to use ATOM to solve a binary classification problem. Additionally, we'll perform a variety of data cleaning steps to prepare the data for modelling.

The data used is a variation on the [Australian weather dataset](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package) from Kaggle. You can download it from [here](https://github.com/tvdboom/ATOM/blob/master/examples/datasets/weatherAUS.csv). The goal is to predict whether or not it will rain tomorrow by training a binary classifier on the target column `RainTomorrow`.

## Load the data

In [1]:
# Import packages
import pandas as pd
from atom import ATOMClassifier

In [2]:
# Load data
X = pd.read_csv("./datasets/weatherAUS.csv")

# Let's have a look
X.head()

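| | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
| 1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
| 2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
| 3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
| 4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |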


## Run the pipeline

In [3]:
# Call atom using only 5% of the complete dataset (for explanatory purposes)
atom = ATOMClassifier(X, "RainTomorrow", n_rows=0.05, n_jobs=8, verbose=2)

Algorithm task: binary classification.
Parallel processing with 8 cores.
Parallelization backend: loky

Shape: (7109, 22)
Train set size: 5688
Test set size: 1421
-------------------------------------
Memory: 3.08 MB
Scaled: False
Missing values: 15681 (10.0%)
Categorical features: 5 (23.8%)
Duplicate samples: 2 (0.0%)
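
The percentages in this summary can be checked by hand. The sketch below recomputes them from the printed counts. It assumes that the missing-value percentage is taken over every cell of the dataset (target column included) and that the categorical percentage is taken over the 21 feature columns; both assumptions are inferred from the numbers above, not from ATOM's source code.

```python
# Figures copied from the summary printed above.
n_rows, n_cols = 7109, 22        # shape of the sampled dataset (target included)
n_train, n_test = 5688, 1421
n_missing = 15681
n_categorical = 5

n_features = n_cols - 1          # every column except the target

test_fraction = n_test / n_rows
# Assumption: the percentage is computed over all cells, target included.
missing_pct = 100 * n_missing / (n_rows * n_cols)
# Assumption: the percentage is computed over the feature columns only.
categorical_pct = 100 * n_categorical / n_features

print(f"Test fraction: {test_fraction:.3f}")
print(f"Missing values: {missing_pct:.1f}%")
print(f"Categorical features: {categorical_pct:.1f}%")
```

The train and test sizes add up to the full sample, and the test set holds roughly one fifth of the rows.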



In [4]:
# Impute missing values
atom.impute(strat_num="median", strat_cat="drop", max_nan_rows=0.8)

Fitting Imputer...
Imputing missing values...
 --> Dropping 11 samples for containing more than 16 missing values.
 --> Imputing 23 missing values with median (11.9) in feature MinTemp.
 --> Imputing 23 missing values with median (22.4) in feature MaxTemp.
 --> Imputing 69 missing values with median (0.0) in feature Rainfall.
 --> Imputing 2986 missing values with median (4.8) in feature Evaporation.
 --> Imputing 3358 missing values with median (8.4) in feature Sunshine.
 --> Dropping 474 samples due to missing values in feature WindGustDir.
 --> Imputing 471 missing values with median (39.0) in feature WindGustSpeed.
 --> Dropping 490 samples due to missing values in feature WindDir9am.
 --> Dropping 179 samples due to missing values in feature WindDir3pm.
 --> Imputing 50 missing values with median (13.0) in feature WindSpeed9am.
 --> Imputing 121 missing values with median (19.0) in feature WindSpeed3pm.
 --> Imputing 73 missing values with median (69.0) in feature Humidity9am.
 --
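
To make the three imputation rules in the log concrete, here is a minimal plain-Python sketch of the same idea: drop rows that are mostly empty, fill numerical gaps with the column median, and drop rows that miss a categorical value. It is not ATOM's implementation. The function `impute`, the toy rows, and the way the row threshold is derived (`int(max_nan_rows * n_columns)`, which happens to give 16 for 0.8 × 21 feature columns) are all assumptions made for illustration.

```python
from statistics import median

# Toy rows; None marks a missing value. Column names mirror the dataset,
# but the values are made up for illustration.
rows = [
    {"MinTemp": 11.0, "Rainfall": 0.0, "WindGustDir": "SSE"},
    {"MinTemp": None, "Rainfall": 2.4, "WindGustDir": "W"},
    {"MinTemp": 14.5, "Rainfall": None, "WindGustDir": None},
    {"MinTemp": 9.0, "Rainfall": 0.0, "WindGustDir": "S"},
    {"MinTemp": None, "Rainfall": None, "WindGustDir": None},
]
numerical = ["MinTemp", "Rainfall"]
categorical = ["WindGustDir"]


def impute(rows, numerical, categorical, max_nan_rows):
    """Hypothetical helper: drop sparse rows, median-fill, drop missing categories."""
    n_cols = len(numerical) + len(categorical)
    # Assumption: a fraction is converted to a count by truncation.
    max_missing = int(max_nan_rows * n_cols)

    # 1. Drop rows with more than `max_missing` missing values.
    kept = [r for r in rows if sum(v is None for v in r.values()) <= max_missing]

    # 2. Compute each numerical median from the observed values only.
    medians = {
        col: median(r[col] for r in kept if r[col] is not None)
        for col in numerical
    }

    # 3. Fill numerical gaps, then drop rows still missing a categorical value.
    filled = [
        {k: medians[k] if v is None and k in medians else v for k, v in r.items()}
        for r in kept
    ]
    clean = [r for r in filled if all(r[c] is not None for c in categorical)]
    return clean, medians


clean, medians = impute(rows, numerical, categorical, max_nan_rows=0.8)
print(medians)
print(len(clean), "rows remain")
```

In this toy run the last row is removed for being entirely empty, and the third row is removed afterwards because its wind direction is missing.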

In [5]:
# Encode the categorical features
atom.encode(strategy="Target", max_onehot=10, infrequent_to_value=0.04)

Fitting Encoder...
Encoding categorical columns...
 --> Target-encoding feature Location. Contains 47 classes.
 --> Target-encoding feature WindGustDir. Contains 16 classes.
 --> Target-encoding feature WindDir9am. Contains 16 classes.
 --> Target-encoding feature WindDir3pm. Contains 16 classes.
 --> Ordinal-encoding feature RainToday. Contains 2 classes.
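
In the log above, the two-class feature is ordinal-encoded, while the features with more classes than `max_onehot` fall back to the chosen `Target` strategy. The core idea of target encoding can be sketched in a few lines: every category is replaced by the mean of the target observed for that category. The helper `target_encode` and the toy data are hypothetical, and the sketch deliberately leaves out the smoothing that production encoders usually apply to rare categories.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Hypothetical helper: map each category to its mean target value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    mapping = {cat: sums[cat] / counts[cat] for cat in sums}
    return [mapping[cat] for cat in categories], mapping


# Made-up sample: location of the measurement and whether it rained the next day.
locations = ["Cairns", "Adelaide", "Cairns", "Portland", "Adelaide", "Cairns"]
rain_tomorrow = [1, 0, 0, 1, 0, 1]

encoded, mapping = target_encode(locations, rain_tomorrow)
print(mapping)
```

In practice the mapping should be learned on the training set only and then applied to the test set, otherwise information about the test targets leaks into the features.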


In [6]:
# Train an Extra-Trees and a Random Forest model
atom.run(models=["ET", "RF"], metric="f1", n_bootstrap=5)


Models: ET, RF
Metric: f1


Results for ExtraTrees:
Fit ---------------------------------------------
Train evaluation --> f1: 1.0
Test evaluation --> f1: 0.5688
Time elapsed: 9.395s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.5463 ± 0.0135
Time elapsed: 4.742s
-------------------------------------------------
Total time: 14.138s


Results for RandomForest:
Fit ---------------------------------------------
Train evaluation --> f1: 1.0
Test evaluation --> f1: 0.5969
Time elapsed: 1.368s
Bootstrap ---------------------------------------
Evaluation --> f1: 0.576 ± 0.0117
Time elapsed: 5.341s
-------------------------------------------------
Total time: 6.709s


Total time: 20.861s
-------------------------------------
ExtraTrees   --> f1: 0.5463 ± 0.0135 ~
RandomForest --> f1: 0.576 ± 0.0117 ~ !
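
The `Bootstrap` lines report a mean and a spread over `n_bootstrap=5` repetitions. The general technique can be sketched as follows: resample the training set with replacement, refit the model, score it on the untouched test set, and summarize the scores. Everything in this block is an illustrative assumption, not ATOM's code: the one-feature threshold "model", the synthetic data, the helper names, and the use of the population standard deviation for the spread.

```python
import random
from statistics import mean, pstdev


def f1_score(y_true, y_pred):
    """F1 for binary labels, computed from the confusion counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def fit_threshold(samples):
    """Toy model: split halfway between the two class means.

    Assumes both classes are present in `samples`.
    """
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    return (mean(pos) + mean(neg)) / 2


def bootstrap_scores(train, test, n_bootstrap, seed=1):
    """Refit on resampled training sets and score each fit on the test set."""
    rng = random.Random(seed)
    y_test = [y for _, y in test]
    scores = []
    for _ in range(n_bootstrap):
        # Sample with replacement, keeping the original training set size.
        sample = rng.choices(train, k=len(train))
        threshold = fit_threshold(sample)
        y_pred = [int(x > threshold) for x, _ in test]
        scores.append(f1_score(y_test, y_pred))
    return scores


# Synthetic data: one noisy feature, roughly one third positives.
data_rng = random.Random(0)


def make_data(n):
    labels = [int(i % 3 == 0) for i in range(n)]
    return [(data_rng.gauss(1.5 * y, 1.0), y) for y in labels]


train, test = make_data(300), make_data(150)
scores = bootstrap_scores(train, test, n_bootstrap=5)
print(f"f1: {mean(scores):.4f} ± {pstdev(scores):.4f}")
```

Because every repetition sees a slightly different training set, the spread gives a rough impression of how sensitive the score is to the particular training sample.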


## Analyze the results

In [7]:
# Let's have a look at the final results
atom.results

| | score_train | score_test | time_fit | score_bootstrap | time_bootstrap | time |
|---|---|---|---|---|---|---|
| ET | 1.0 | 0.5688 | 9.395485 | 0.546307 | 4.742292 | 14.137777 |
| RF | 1.0 | 0.5969 | 1.36776 | 0.575995 | 5.341101 | 6.708861 |


In [8]:
# Visualize the bootstrap results
atom.plot_results(title="RF vs ET performance")

In [9]:
# Print the results of some common metrics
atom.evaluate()

| | accuracy | average_precision | balanced_accuracy | f1 | jaccard | matthews_corrcoef | precision | recall | roc_auc |
|---|---|---|---|---|---|---|---|---|---|
| ET | 0.8427 | 0.6813 | 0.6911 | 0.5403 | 0.3701 | 0.4828 | 0.755 | 0.4207 | 0.8613 |
| RF | 0.8516 | 0.6871 | 0.718 | 0.5869 | 0.4153 | 0.5212 | 0.7558 | 0.4797 | 0.8652 |
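
The metrics in this table are not independent of each other. As a small worked example, F1 is the harmonic mean of precision and recall, so the `f1` column can be recomputed from the `precision` and `recall` columns. The helper name `f1_from` is made up for this sketch, and because the inputs are already rounded to four decimals, small differences in the last digit are possible.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the table above (already rounded to 4 decimals).
reported = {
    "ET": {"precision": 0.755, "recall": 0.4207, "f1": 0.5403},
    "RF": {"precision": 0.7558, "recall": 0.4797, "f1": 0.5869},
}

recomputed = {}
for name, m in reported.items():
    recomputed[name] = f1_from(m["precision"], m["recall"])
    print(f"{name}: recomputed f1 = {recomputed[name]:.4f}, reported f1 = {m['f1']}")
```

Both models combine a fairly high precision with a recall below 0.5, which is what pulls their F1 scores down.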


In [10]:
# The winner attribute returns the best model (atom.winner == atom.rf)
print(f"The winner is the {atom.winner.name} model!!")

The winner is the RF model!!


In [11]:
# Visualize the distribution of predicted probabilities
atom.winner.plot_probabilities()

In [12]:
# Compare how different metrics perform for different thresholds
atom.winner.plot_threshold(metric=["f1", "accuracy", "ap"], steps=50)
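
The idea behind a threshold plot can be reproduced without any plotting: turn the predicted probabilities into class labels at a series of cut-off values and score each labelling. The sketch below does this for accuracy and F1 on made-up probabilities. The helper `sweep_thresholds`, the evenly spaced grid over [0, 1], and the `>=` comparison are assumptions for illustration and may differ from what the plotting method does internally.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def sweep_thresholds(y_true, proba, metric, steps):
    """Hypothetical helper: score `metric` at `steps` evenly spaced thresholds."""
    results = []
    for i in range(steps):
        threshold = i / (steps - 1)  # requires steps >= 2
        # Predict the positive class when the probability reaches the threshold.
        y_pred = [int(p >= threshold) for p in proba]
        results.append((threshold, metric(y_true, y_pred)))
    return results


# Made-up labels and predicted probabilities for the positive class.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
proba = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

acc_curve = sweep_thresholds(y_true, proba, accuracy, steps=50)
f1_curve = sweep_thresholds(y_true, proba, f1, steps=50)

best_threshold, best_f1 = max(f1_curve, key=lambda pair: pair[1])
print(f"Best f1 {best_f1:.3f} at threshold {best_threshold:.2f}")
```

A sweep like this is useful when recall matters more than the default 0.5 cut-off suggests, since lowering the threshold trades precision for recall.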
