Example: Binary classification
This example shows how to use ATOM to solve a binary classification problem. Additionally, we'll perform a variety of data cleaning steps to prepare the data for modeling.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.
Load the data
In [13]:
# Import packages
import pandas as pd
from atom import ATOMClassifier
In [14]:
# Load data
X = pd.read_csv("docs_source/examples/datasets/weatherAUS.csv")
# Let's have a look
X.head()
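Before cleaning, it's worth a quick look at how much data is actually missing and how the target classes are distributed. The snippet below is an optional sketch using plain pandas; it is not part of the ATOM pipeline itself.
# Optional sketch: inspect missing values and target balance with pandas
print(X.isna().sum().sort_values(ascending=False).head(10))  # columns with the most missing values
print(X["RainTomorrow"].value_counts(normalize=True))  # class balance of the target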
Run the pipeline
In [15]:
# Call atom using only 5% of the complete dataset (for explanatory purposes)
atom = ATOMClassifier(X, y="RainTomorrow", n_rows=0.05, n_jobs=8, verbose=2)
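At this point atom holds the 5% sample, already split into a training and a test set. Assuming ATOM's standard data attributes (atom.train and atom.test, returning DataFrames), you could verify the split sizes like this:
# Optional sketch: check the size of the train and test sets
# (atom.train / atom.test are assumed to be ATOM's standard data attributes)
print(atom.train.shape, atom.test.shape)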
In [16]:
# Impute missing values
atom.impute(strat_num="median", strat_cat="drop", max_nan_rows=0.8)
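To confirm the imputer did its job, you can count the remaining missing values. This sketch assumes atom.dataset returns the current (transformed) data as a DataFrame:
# Optional sketch: verify that no missing values remain after imputation
# (atom.dataset is assumed to expose the transformed data as a DataFrame)
print(atom.dataset.isna().sum().sum())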
In [17]:
# Encode the categorical features
atom.encode(strategy="Target", max_onehot=10, infrequent_to_value=0.04)
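Similarly, you can check that the encoder left no non-numeric features behind. This sketch assumes atom.X returns the transformed feature set as a DataFrame:
# Optional sketch: all features should be numeric after encoding
# (atom.X is assumed to return the transformed feature set)
print(atom.X.select_dtypes(exclude="number").columns.tolist())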
In [18]:
# Train an Extra-Trees and a Random Forest model
atom.run(models=["ET", "RF"], metric="f1", n_bootstrap=5)
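After training, each model is also reachable as an attribute on atom (the winner check later in this example uses atom.rf). As an optional sketch, you could peek at the fitted estimator behind the random forest; the .estimator attribute is assumed here to hold the underlying scikit-learn object.
# Optional sketch: inspect the fitted estimator behind the RF model
# (.estimator is assumed to hold the underlying scikit-learn object)
print(atom.rf.estimator)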
Analyze the results
In [19]:
# Let's have a look at the final results
atom.results
In [20]:
# Visualize the bootstrap results
atom.plot_results(title="RF vs ET performance")
In [21]:
# Print the results of some common metrics
atom.evaluate()
In [22]:
# The winner attribute returns the best performing model (here, atom.winner == atom.rf)
print(f"The winner is the {atom.winner.name} model!!")
In [23]:
# Visualize the distribution of predicted probabilities
atom.winner.plot_probabilities()
In [24]:
# Compare how different metrics perform for different thresholds
atom.winner.plot_threshold(metric=["f1", "accuracy", "ap"], steps=50)
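Finally, the winning model can be used for inference on new data. The sketch below reuses the first rows of X as stand-in "new" data and assumes that predict / predict_proba run the rows through the fitted pipeline before scoring:
# Optional sketch: predict on a few unseen rows with the winning model
# (predict / predict_proba are assumed to apply the fitted pipeline first)
X_new = X.drop(columns="RainTomorrow").iloc[:5]
print(atom.winner.predict(X_new))
print(atom.winner.predict_proba(X_new))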