Example: Getting started¶
This example shows how to get started with the atom-ml library.
The data used is a variation of the Australian weather dataset from Kaggle. You can download it from here. The goal is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.
In [1]:
import pandas as pd
from atom import ATOMClassifier

# Load the Australian Weather dataset
X = pd.read_csv("https://raw.githubusercontent.com/tvdboom/ATOM/master/examples/datasets/weatherAUS.csv")
In [2]:
atom = ATOMClassifier(X, y="RainTomorrow", n_rows=1000, verbose=2)
<< ================== ATOM ================== >>

Algorithm task: binary classification.

Dataset stats ==================== >>
Shape: (1000, 22)
Memory: 434.20 kB
Scaled: False
Missing values: 2210 (10.0%)
Categorical features: 5 (23.8%)
Outlier values: 1 (0.0%)
-------------------------------------
Train set size: 800
Test set size: 200
-------------------------------------

|   | dataset   | train     | test      |
| - | --------- | --------- | --------- |
| 0 | 777 (3.5) | 622 (3.5) | 155 (3.4) |
| 1 | 223 (1.0) | 178 (1.0) | 45 (1.0)  |
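The summary shows an 80/20 train/test split and a roughly 3.5:1 class imbalance (RainTomorrow is mostly "No"). A minimal pandas sketch of where those numbers come from, using toy data with the same counts (illustrative only; a real split would also shuffle and stratify):

```python
import pandas as pd

# Toy target mimicking the class counts reported above (777 "No", 223 "Yes")
y = pd.Series(["No"] * 777 + ["Yes"] * 223, name="RainTomorrow")

counts = y.value_counts()
ratio = counts["No"] / counts["Yes"]  # class imbalance, roughly 3.5:1

# An 80/20 holdout by size; this only illustrates the set sizes
n_train = int(len(y) * 0.8)
train, test = y.iloc[:n_train], y.iloc[n_train:]
print(len(train), len(test), round(ratio, 1))
```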
In [3]:
atom.impute(strat_num="median", strat_cat="most_frequent")
atom.encode(strategy="LeaveOneOut", max_onehot=8)
Fitting Imputer...
Imputing missing values...
 --> Imputing 1 missing values with median (12.7) in feature MinTemp.
 --> Imputing 12 missing values with median (0.0) in feature Rainfall.
 --> Imputing 430 missing values with median (4.6) in feature Evaporation.
 --> Imputing 476 missing values with median (8.8) in feature Sunshine.
 --> Imputing 54 missing values with most_frequent (N) in feature WindGustDir.
 --> Imputing 54 missing values with median (37.0) in feature WindGustSpeed.
 --> Imputing 70 missing values with most_frequent (N) in feature WindDir9am.
 --> Imputing 26 missing values with most_frequent (S) in feature WindDir3pm.
 --> Imputing 5 missing values with median (13.0) in feature WindSpeed9am.
 --> Imputing 16 missing values with median (17.0) in feature WindSpeed3pm.
 --> Imputing 9 missing values with median (69.0) in feature Humidity9am.
 --> Imputing 22 missing values with median (51.0) in feature Humidity3pm.
 --> Imputing 105 missing values with median (1018.2) in feature Pressure9am.
 --> Imputing 109 missing values with median (1015.6) in feature Pressure3pm.
 --> Imputing 393 missing values with median (5.0) in feature Cloud9am.
 --> Imputing 397 missing values with median (4.0) in feature Cloud3pm.
 --> Imputing 4 missing values with median (17.0) in feature Temp9am.
 --> Imputing 15 missing values with median (21.5) in feature Temp3pm.
 --> Imputing 12 missing values with most_frequent (No) in feature RainToday.
Fitting Encoder...
Encoding categorical columns...
 --> LeaveOneOut-encoding feature Location. Contains 49 classes.
 --> LeaveOneOut-encoding feature WindGustDir. Contains 16 classes.
 --> LeaveOneOut-encoding feature WindDir9am. Contains 16 classes.
 --> LeaveOneOut-encoding feature WindDir3pm. Contains 16 classes.
 --> Ordinal-encoding feature RainToday. Contains 2 classes.
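The imputation itself is straightforward: numerical columns get their median, categorical columns their most frequent value. A minimal pandas sketch of the same idea on toy data (illustrative only, not ATOM's internal code):

```python
import pandas as pd

df = pd.DataFrame({
    "MinTemp": [12.0, None, 14.5, 11.3],    # numerical column with a gap
    "WindGustDir": ["N", "SE", None, "N"],  # categorical column with a gap
})

# Numerical: fill with the column median
df["MinTemp"] = df["MinTemp"].fillna(df["MinTemp"].median())

# Categorical: fill with the most frequent value (the mode)
df["WindGustDir"] = df["WindGustDir"].fillna(df["WindGustDir"].mode()[0])

print(df.isna().sum().sum())  # 0 -> no missing values left
```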
In [4]:
atom.run(models=["LDA", "AdaB"], metric="auc", n_trials=10)
Training ========================= >>
Models: LDA, AdaB
Metric: roc_auc

Running hyperparameter tuning for LinearDiscriminantAnalysis...

| trial | solver | shrinkage | roc_auc | best_roc_auc | time_trial | time_ht | state    |
| ----- | ------ | --------- | ------- | ------------ | ---------- | ------- | -------- |
| 0     | eigen  | 0.8       | 0.8412  | 0.8412       | 0.218s     | 0.218s  | COMPLETE |
| 1     | lsqr   | 0.6       | 0.8192  | 0.8412       | 0.253s     | 0.471s  | COMPLETE |
| 2     | lsqr   | auto      | 0.7988  | 0.8412       | 0.218s     | 0.689s  | COMPLETE |
| 3     | eigen  | 0.9       | 0.89    | 0.89         | 0.216s     | 0.905s  | COMPLETE |
| 4     | svd    | ---       | 0.8542  | 0.89         | 0.249s     | 1.154s  | COMPLETE |
| 5     | lsqr   | 0.6       | 0.8192  | 0.89         | 0.004s     | 1.158s  | COMPLETE |
| 6     | lsqr   | auto      | 0.7988  | 0.89         | 0.003s     | 1.161s  | COMPLETE |
| 7     | lsqr   | 0.8       | 0.8129  | 0.89         | 0.244s     | 1.405s  | COMPLETE |
| 8     | lsqr   | None      | 0.7948  | 0.89         | 0.208s     | 1.613s  | COMPLETE |
| 9     | eigen  | 0.8       | 0.8412  | 0.89         | 0.002s     | 1.616s  | COMPLETE |

Hyperparameter tuning ---------------------------
Best trial --> 3
Best parameters:
 --> solver: eigen
 --> shrinkage: 0.9
Best evaluation --> roc_auc: 0.89
Time elapsed: 1.616s
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.8146
Test evaluation --> roc_auc: 0.8305
Time elapsed: 0.040s
-------------------------------------------------
Total time: 1.656s

Running hyperparameter tuning for AdaBoost...

| trial | n_estimators | learning_rate | algorithm | roc_auc | best_roc_auc | time_trial | time_ht | state    |
| ----- | ------------ | ------------- | --------- | ------- | ------------ | ---------- | ------- | -------- |
| 0     | 190          | 1.6916        | SAMME.R   | 0.6808  | 0.6808       | 0.690s     | 0.690s  | COMPLETE |
| 1     | 380          | 0.1227        | SAMME     | 0.7919  | 0.7919       | 0.999s     | 1.689s  | COMPLETE |
| 2     | 330          | 4.937         | SAMME     | 0.642   | 0.7919       | 0.235s     | 1.923s  | COMPLETE |
| 3     | 80           | 1.0146        | SAMME.R   | 0.7708  | 0.7919       | 0.391s     | 2.314s  | COMPLETE |
| 4     | 470          | 3.6762        | SAMME     | 0.642   | 0.7919       | 0.215s     | 2.529s  | COMPLETE |
| 5     | 200          | 6.1828        | SAMME     | 0.2231  | 0.7919       | 0.209s     | 2.737s  | COMPLETE |
| 6     | 240          | 0.5726        | SAMME.R   | 0.7708  | 0.7919       | 0.748s     | 3.485s  | COMPLETE |
| 7     | 390          | 3.0825        | SAMME.R   | 0.3262  | 0.7919       | 1.093s     | 4.577s  | COMPLETE |
| 8     | 290          | 0.7957        | SAMME     | 0.7699  | 0.7919       | 0.862s     | 5.439s  | COMPLETE |
| 9     | 190          | 0.0256        | SAMME     | 0.8011  | 0.8011       | 0.596s     | 6.036s  | COMPLETE |

Hyperparameter tuning ---------------------------
Best trial --> 9
Best parameters:
 --> n_estimators: 190
 --> learning_rate: 0.0256
 --> algorithm: SAMME
Best evaluation --> roc_auc: 0.8011
Time elapsed: 6.036s
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.844
Test evaluation --> roc_auc: 0.8366
Time elapsed: 0.449s
-------------------------------------------------
Total time: 6.485s

Final results ==================== >>
Total time: 8.352s
-------------------------------------
LinearDiscriminantAnalysis --> roc_auc: 0.8305
AdaBoost --> roc_auc: 0.8366 !
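Both models are compared on ROC AUC, which measures how well the classifier ranks positive examples above negative ones, independent of any decision threshold. A tiny self-contained reminder of what the score means, using scikit-learn on toy labels (unrelated to the run above):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class

# Fraction of (positive, negative) pairs ranked correctly:
# only the pair (0.4, 0.35) is mis-ranked, so 3 of 4 pairs -> 0.75
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```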
In [5]:
atom.evaluate()
Out[5]:
|      | accuracy | average_precision | balanced_accuracy | f1     | jaccard | matthews_corrcoef | precision | recall | roc_auc |
| ---- | -------- | ----------------- | ----------------- | ------ | ------- | ----------------- | --------- | ------ | ------- |
| LDA  | 0.805    | 0.6325            | 0.7559            | 0.6061 | 0.4348  | 0.4814            | 0.5556    | 0.6667 | 0.8305  |
| AdaB | 0.815    | 0.6505            | 0.6204            | 0.3934 | 0.2449  | 0.3707            | 0.7500    | 0.2667 | 0.8366  |
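The threshold-based metrics in this table all derive from the confusion matrix, which is why AdaBoost can have the higher AUC yet a much lower recall and f1 than LDA. A short sketch of those relationships, using toy confusion-matrix counts chosen to be consistent with the LDA row above (tp/fp/fn are illustrative, not taken from ATOM's output):

```python
# Toy confusion-matrix counts for a binary classifier
tp, fp, fn = 30, 24, 15

precision = tp / (tp + fp)  # of predicted positives, how many were right
recall = tp / (tp + fn)     # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.5556 0.6667 0.6061
```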