Example: Getting started
This example shows how to get started with the atom-ml library.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.
In [6]:
import pandas as pd
from atom import ATOMClassifier
# Load the Australian Weather dataset
X = pd.read_csv("https://raw.githubusercontent.com/tvdboom/ATOM/master/examples/datasets/weatherAUS.csv")
In [7]:
atom = ATOMClassifier(X, y="RainTomorrow", n_rows=1000, verbose=2)
<< ================== ATOM ================== >>

Configuration ==================== >>
Algorithm task: Binary classification.

Dataset stats ==================== >>
Shape: (1000, 22)
Train set size: 800
Test set size: 200
-------------------------------------
Memory: 176.13 kB
Scaled: False
Missing values: 2243 (10.2%)
Categorical features: 5 (23.8%)
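The 800/200 split above comes from ATOMClassifier sampling `n_rows` rows and holding out a test fraction (20% by default). A rough pandas sketch of that bookkeeping on toy data — an illustration of the idea, not ATOM's actual implementation:

```python
import pandas as pd

# Toy stand-in for the weather data (the real dataset has 22 columns).
df = pd.DataFrame({"MinTemp": range(50), "RainTomorrow": ["No", "Yes"] * 25})

# Sample the requested number of rows, then hold out the last 20% as test set.
sampled = df.sample(n=10, random_state=1).reset_index(drop=True)
cut = int(len(sampled) * 0.8)
train, test = sampled.iloc[:cut], sampled.iloc[cut:]
print(len(train), len(test))  # 8 2
```

With `n_rows=1000` and the default split, this yields the 800 training and 200 test rows reported in the output.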
In [8]:
atom.impute(strat_num="median", strat_cat="most_frequent")
atom.encode(strategy="Target", max_onehot=8)
Fitting Imputer...
Imputing missing values...
--> Imputing 5 missing values with median (11.8) in column MinTemp.
--> Imputing 1 missing values with median (22.2) in column MaxTemp.
--> Imputing 11 missing values with median (0.0) in column Rainfall.
--> Imputing 429 missing values with median (4.8) in column Evaporation.
--> Imputing 477 missing values with median (8.6) in column Sunshine.
--> Imputing 67 missing values with most_frequent (SE) in column WindGustDir.
--> Imputing 66 missing values with median (39.0) in column WindGustSpeed.
--> Imputing 78 missing values with most_frequent (N) in column WindDir9am.
--> Imputing 24 missing values with most_frequent (SSE) in column WindDir3pm.
--> Imputing 7 missing values with median (13.0) in column WindSpeed9am.
--> Imputing 20 missing values with median (19.0) in column WindSpeed3pm.
--> Imputing 15 missing values with median (70.0) in column Humidity9am.
--> Imputing 24 missing values with median (53.0) in column Humidity3pm.
--> Imputing 106 missing values with median (1017.9) in column Pressure9am.
--> Imputing 104 missing values with median (1015.7) in column Pressure3pm.
--> Imputing 371 missing values with median (5.0) in column Cloud9am.
--> Imputing 402 missing values with median (5.0) in column Cloud3pm.
--> Imputing 7 missing values with median (16.4) in column Temp9am.
--> Imputing 18 missing values with median (20.8) in column Temp3pm.
--> Imputing 11 missing values with most_frequent (No) in column RainToday.
Fitting Encoder...
Encoding categorical columns...
--> Target-encoding feature Location. Contains 49 classes.
--> Target-encoding feature WindGustDir. Contains 16 classes.
--> Target-encoding feature WindDir9am. Contains 16 classes.
--> Target-encoding feature WindDir3pm. Contains 16 classes.
--> Ordinal-encoding feature RainToday. Contains 2 classes.
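A minimal pandas sketch of what these two steps compute, on toy data. ATOM's Imputer and Encoder apply this per column with extra bookkeeping, and the target encoder uses a smoothed estimate rather than the raw per-category means shown here:

```python
import pandas as pd

df = pd.DataFrame({
    "Rainfall": [0.0, 2.4, None, 1.2],       # numerical column with one gap
    "WindGustDir": ["SE", None, "SE", "N"],  # categorical column with one gap
    "RainTomorrow": [0, 1, 0, 1],            # binary target
})

# strat_num="median": fill numerical gaps with the column median.
df["Rainfall"] = df["Rainfall"].fillna(df["Rainfall"].median())

# strat_cat="most_frequent": fill categorical gaps with the column mode.
df["WindGustDir"] = df["WindGustDir"].fillna(df["WindGustDir"].mode()[0])

# strategy="Target": replace each category with the mean target it co-occurs with.
means = df.groupby("WindGustDir")["RainTomorrow"].mean()
df["WindGustDir"] = df["WindGustDir"].map(means)
print(df)
```

Columns with at most `max_onehot` classes would be one-hot encoded instead; `RainToday`, with only two classes, gets a simple ordinal encoding.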
In [9]:
atom.run(models=["LDA", "AdaB"], metric="auc", n_trials=10)
Training ========================= >>
Models: LDA, AdaB
Metric: auc

Running hyperparameter tuning for LinearDiscriminantAnalysis...

| trial | solver | shrinkage | auc | best_auc | time_trial | time_ht | state |
| ----- | ------- | --------- | ------- | -------- | ---------- | ------- | -------- |
| 0 | lsqr | auto | 0.6291 | 0.6291 | 0.127s | 0.127s | COMPLETE |
| 1 | svd | None | 0.7018 | 0.7018 | 0.122s | 0.250s | COMPLETE |
| 2 | svd | None | 0.7018 | 0.7018 | 0.001s | 0.251s | COMPLETE |
| 3 | svd | None | 0.7018 | 0.7018 | 0.000s | 0.251s | COMPLETE |
| 4 | svd | None | 0.7018 | 0.7018 | 0.000s | 0.251s | COMPLETE |
| 5 | eigen | auto | 0.6675 | 0.7018 | 0.129s | 0.380s | COMPLETE |
| 6 | lsqr | 0.9 | 0.7511 | 0.7511 | 0.124s | 0.504s | COMPLETE |
| 7 | svd | None | 0.7018 | 0.7511 | 0.000s | 0.504s | COMPLETE |
| 8 | lsqr | 0.8 | 0.7035 | 0.7511 | 0.121s | 0.625s | COMPLETE |
| 9 | eigen | None | 0.6638 | 0.7511 | 0.120s | 0.745s | COMPLETE |

Hyperparameter tuning ---------------------------
Best trial --> 6
Best parameters:
--> solver: lsqr
--> shrinkage: 0.9
Best evaluation --> auc: 0.7511
Time elapsed: 0.745s
Fit ---------------------------------------------
Train evaluation --> auc: 0.8034
Test evaluation --> auc: 0.8655
Time elapsed: 0.035s
-------------------------------------------------
Time: 0.779s

Running hyperparameter tuning for AdaBoost...
| trial | n_estimators | learning_rate | auc | best_auc | time_trial | time_ht | state |
| ----- | ------------ | ------------- | ------- | -------- | ---------- | ------- | -------- |
| 0 | 220 | 0.0145 | 0.5558 | 0.5558 | 0.559s | 0.559s | COMPLETE |
| 1 | 340 | 0.0149 | 0.6245 | 0.6245 | 0.797s | 1.356s | COMPLETE |
| 2 | 310 | 0.3206 | 0.6427 | 0.6427 | 0.745s | 2.101s | COMPLETE |
| 3 | 120 | 8.1247 | 0.5 | 0.6427 | 0.368s | 2.469s | COMPLETE |
| 4 | 70 | 0.065 | 0.5728 | 0.6427 | 0.262s | 2.731s | COMPLETE |
| 5 | 280 | 3.2722 | 0.5 | 0.6427 | 0.684s | 3.415s | COMPLETE |
| 6 | 330 | 0.0341 | 0.6167 | 0.6427 | 0.790s | 4.205s | COMPLETE |
| 7 | 90 | 0.0442 | 0.5856 | 0.6427 | 0.308s | 4.513s | COMPLETE |
| 8 | 290 | 5.6564 | 0.5 | 0.6427 | 0.723s | 5.236s | COMPLETE |
| 9 | 450 | 5.7754 | 0.5128 | 0.6427 | 1.046s | 6.282s | COMPLETE |

Hyperparameter tuning ---------------------------
Best trial --> 2
Best parameters:
--> n_estimators: 310
--> learning_rate: 0.3206
Best evaluation --> auc: 0.6427
Time elapsed: 6.282s
Fit ---------------------------------------------
Train evaluation --> auc: 0.952
Test evaluation --> auc: 0.8025
Time elapsed: 0.790s
-------------------------------------------------
Time: 7.072s

Final results ==================== >>
Total time: 9.717s
-------------------------------------
LinearDiscriminantAnalysis --> auc: 0.8655 !
AdaBoost --> auc: 0.8025
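Every trial above is scored with `metric="auc"`, the area under the ROC curve. It equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative one, which gives a compact pairwise formulation (a self-contained sketch of the metric, not how ATOM computes it — ATOM delegates to scikit-learn scorers):

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This also explains the 0.5 scores in the AdaBoost table: with too large a learning rate the model's rankings are no better than chance.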
In [10]:
atom.evaluate()
Out[10]:
| | accuracy | ap | ba | f1 | jaccard | mcc | precision | recall | auc |
|---|---|---|---|---|---|---|---|---|---|
| LDA | 0.830000 | 0.778200 | 0.804700 | 0.685200 | 0.521100 | 0.574700 | 0.627100 | 0.755100 | 0.865500 |
| AdaB | 0.825000 | 0.671200 | 0.711800 | 0.578300 | 0.406800 | 0.485000 | 0.705900 | 0.489800 | 0.802500 |
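Most columns in this table derive from the confusion matrix of the thresholded test-set predictions. A pure-Python sketch of how accuracy, precision, recall, and f1 relate to it (illustrative only; ATOM's `evaluate` uses scikit-learn's scorers):

```python
def binary_scores(y_true, y_pred):
    """Confusion-matrix-based metrics for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)        # of predicted rain, how much was rain
    recall = tp / (tp + fn)           # of actual rain, how much was caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

print(binary_scores([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```

Note how AdaB's high accuracy coexists with a low recall: on an imbalanced target like RainTomorrow, accuracy alone is a poor guide, which is why `evaluate` reports the full battery of metrics.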