Example: Holdout set¶
This example shows when and how to use ATOM's holdout set in an exploration pipeline.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.
Load the data¶
# Import packages
import pandas as pd
from atom import ATOMClassifier
# Load data
X = pd.read_csv("docs_source/examples/datasets/weatherAUS.csv")
# Let's have a look
X.head()
Run the pipeline¶
# Initialize atom specifying a fraction of the dataset for holdout
atom = ATOMClassifier(X, n_rows=0.5, holdout_size=0.2, verbose=2)
# The test and holdout fractions are split after subsampling the dataset
# Also note that the holdout set is not part of atom's dataset
print("Length loaded data:", len(X))
print("Length dataset + holdout:", len(atom.dataset) + len(atom.holdout))
atom.impute()
atom.encode()
# Unlike the train and test sets, the holdout set is not transformed until it's used for predictions
atom.holdout
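A quick way to see this (a minimal sketch) is to compare the column dtypes: after encoding, atom's dataset is fully numerical, while the raw holdout set still contains the original object columns.
# Sketch: the transformed dataset is numerical, the raw holdout is not
print(atom.dataset.dtypes.value_counts())
print(atom.holdout.dtypes.value_counts())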
atom.run(models=["GNB", "LR", "RF"])
atom.plot_prc()
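To complement the curves with numbers, atom's evaluate method scores all trained models on the test set (a minimal sketch).
# Sketch: numeric comparison of all trained models on the test set
atom.evaluate()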
# Based on the results on the test set, we select the best model for further tuning
atom.run("lr_tuned", n_trials=10)
Analyze the results¶
We already used the test set to choose the best model for further tuning, so this set is no longer truly independent. Although it may not be directly visible in the results, using the test set now to evaluate the tuned LR model would be a mistake, since its score carries a bias. For this reason, we set apart an extra, independent set to validate the final model: the holdout set. And since we're no longer going to use the test set for validation, we might as well add it to the training data to optimize the use of the available data. Use the full_train method for this.
# Re-train the model on the full dataset (train + test)
atom.lr_tuned.full_train()
# Evaluate on the holdout set
atom.lr_tuned.evaluate(rows="holdout")
atom.lr_tuned.plot_prc(rows="holdout", legend="upper right")
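With the final model validated on the holdout set, it's ready for inference. A minimal sketch, reusing the first rows of X as stand-in new data; the model's predict method pipes the data through the fitted transformers first.
# Sketch: predict on new, unseen rows; the pipeline transformations
# (imputation, encoding) are applied automatically before predicting
atom.lr_tuned.predict(X.iloc[:5])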