Example: Binary classification
This example shows how to use ATOM to solve a binary classification problem. Additionally, we'll perform a variety of data cleaning steps to prepare the data for modeling.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.
Load the data
In [13]:
# Import packages
import pandas as pd
from atom import ATOMClassifier
In [14]:
# Load data
X = pd.read_csv("docs_source/examples/datasets/weatherAUS.csv")
# Let's have a look
X.head()
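Before cleaning, it's worth a quick look at how much data is actually missing and how the target classes are distributed. The snippet below is an optional sketch using plain pandas; it is not part of the ATOM pipeline itself.
# Optional sketch: inspect missing values and target balance with pandas
print(X.isna().sum().sort_values(ascending=False).head(10))  # columns with the most missing values
print(X["RainTomorrow"].value_counts(normalize=True))  # class balance of the target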
Run the pipeline
In [15]:
# Call atom using only 5% of the complete dataset (for explanatory purposes)
atom = ATOMClassifier(X, y="RainTomorrow", n_rows=0.05, n_jobs=8, verbose=2)
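At this point atom holds the 5% sample, already split into a training and a test set. Assuming ATOM's standard data attributes (atom.train and atom.test, returning DataFrames), you could verify the split sizes like this:
# Optional sketch: check the size of the train and test sets
# (atom.train / atom.test are assumed to be ATOM's standard data attributes)
print(atom.train.shape, atom.test.shape)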
In [16]:
# Impute missing values
atom.impute(strat_num="median", strat_cat="drop", max_nan_rows=0.8)
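To confirm the imputer did its job, you can count the remaining missing values. This sketch assumes atom.dataset returns the current (transformed) data as a DataFrame:
# Optional sketch: verify that no missing values remain after imputation
# (atom.dataset is assumed to expose the transformed data as a DataFrame)
print(atom.dataset.isna().sum().sum())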
In [17]:
# Encode the categorical features
atom.encode(strategy="Target", max_onehot=10, infrequent_to_value=0.04)
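Similarly, you can check that the encoder left no non-numeric features behind. This sketch assumes atom.X returns the transformed feature set as a DataFrame:
# Optional sketch: all features should be numeric after encoding
# (atom.X is assumed to return the transformed feature set)
print(atom.X.select_dtypes(exclude="number").columns.tolist())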
In [18]:
# Train an Extra-Trees and a Random Forest model
atom.run(models=["ET", "RF"], metric="f1", n_bootstrap=5)
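After training, each model is also reachable as an attribute on atom (the winner check later in this example uses atom.rf). As an optional sketch, you could peek at the fitted estimator behind the random forest; the .estimator attribute is assumed here to hold the underlying scikit-learn object.
# Optional sketch: inspect the fitted estimator behind the RF model
# (.estimator is assumed to hold the underlying scikit-learn object)
print(atom.rf.estimator)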
Analyze the results
In [19]:
# Let's have a look at the final results
atom.results
In [20]:
# Visualize the bootstrap results
atom.plot_results(title="RF vs ET performance")
In [21]:
# Print the results of some common metrics
atom.evaluate()
In [22]:
# The winner attribute returns the best performing model (here, atom.winner == atom.rf)
print(f"The winner is the {atom.winner.name} model!!")
In [23]:
# Visualize the distribution of predicted probabilities
atom.winner.plot_probabilities()
In [24]:
# Compare how different metrics perform for different thresholds
atom.winner.plot_threshold(metric=["f1", "accuracy", "ap"], steps=50)
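Finally, the winning model can be used for inference on new data. The sketch below reuses the first rows of X as stand-in "new" data and assumes that predict / predict_proba run the rows through the fitted pipeline before scoring:
# Optional sketch: predict on a few unseen rows with the winning model
# (predict / predict_proba are assumed to apply the fitted pipeline first)
X_new = X.drop(columns="RainTomorrow").iloc[:5]
print(atom.winner.predict(X_new))
print(atom.winner.predict_proba(X_new))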