Example: Getting started
This example shows how to get started with the atom-ml library.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.
In [1]:
import pandas as pd
from atom import ATOMClassifier

# Load the Australian Weather dataset
X = pd.read_csv("https://raw.githubusercontent.com/tvdboom/ATOM/master/examples/datasets/weatherAUS.csv")
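Before handing the data to ATOM, it can help to take a quick look at what was loaded. The snippet below is a minimal sketch using plain pandas on the X dataframe created above; it assumes nothing ATOM-specific.

# Quick sanity check on the raw data (plain pandas)
print(X.shape)         # number of rows and columns
print(X.dtypes)        # column types, to spot the categorical features
print(X.isna().sum())  # missing values per column
X.head()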
In [2]:
atom = ATOMClassifier(X, y="RainTomorrow", n_rows=1000, verbose=2)
<< ================== ATOM ================== >>
Algorithm task: binary classification.

Dataset stats ==================== >>
Shape: (1000, 22)
Train set size: 800
Test set size: 200
-------------------------------------
Memory: 433.78 kB
Scaled: False
Missing values: 2231 (10.1%)
Categorical features: 5 (23.8%)
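The initializer subsampled 1000 rows and split them into a train and test set. If you want to inspect those splits yourself, the sketch below assumes that atom.dataset, atom.train and atom.test are exposed as regular pandas dataframes.

# Inspect the data splits held by the trainer
# (assumes atom.dataset, atom.train and atom.test behave like pandas dataframes)
print(atom.dataset.shape)  # full (subsampled) dataset: 1000 rows
print(atom.train.shape)    # training set: 800 rows
print(atom.test.shape)     # test set: 200 rows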
In [3]:
atom.impute(strat_num="median", strat_cat="most_frequent")
atom.encode(strategy="Target", max_onehot=8)
Fitting Imputer...
Imputing missing values...
 --> Imputing 5 missing values with median (11.7) in feature MinTemp.
 --> Imputing 2 missing values with median (22.25) in feature MaxTemp.
 --> Imputing 12 missing values with median (0.0) in feature Rainfall.
 --> Imputing 417 missing values with median (4.2) in feature Evaporation.
 --> Imputing 469 missing values with median (8.4) in feature Sunshine.
 --> Imputing 68 missing values with most_frequent (W) in feature WindGustDir.
 --> Imputing 68 missing values with median (37.0) in feature WindGustSpeed.
 --> Imputing 64 missing values with most_frequent (N) in feature WindDir9am.
 --> Imputing 32 missing values with most_frequent (SE) in feature WindDir3pm.
 --> Imputing 13 missing values with median (13.0) in feature WindSpeed9am.
 --> Imputing 23 missing values with median (19.0) in feature WindSpeed3pm.
 --> Imputing 17 missing values with median (69.0) in feature Humidity9am.
 --> Imputing 28 missing values with median (52.0) in feature Humidity3pm.
 --> Imputing 100 missing values with median (1017.6) in feature Pressure9am.
 --> Imputing 98 missing values with median (1015.3) in feature Pressure3pm.
 --> Imputing 379 missing values with median (5.0) in feature Cloud9am.
 --> Imputing 399 missing values with median (5.0) in feature Cloud3pm.
 --> Imputing 7 missing values with median (16.5) in feature Temp9am.
 --> Imputing 18 missing values with median (21.1) in feature Temp3pm.
 --> Imputing 12 missing values with most_frequent (No) in feature RainToday.
Fitting Encoder...
Encoding categorical columns...
 --> Target-encoding feature Location. Contains 49 classes.
 --> Target-encoding feature WindGustDir. Contains 16 classes.
 --> Target-encoding feature WindDir9am. Contains 16 classes.
 --> Target-encoding feature WindDir3pm. Contains 16 classes.
 --> Ordinal-encoding feature RainToday. Contains 2 classes.
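After imputing and encoding, the feature set should contain no missing values and no non-numeric columns. A minimal check, assuming atom.X returns the transformed features as a pandas dataframe:

# Verify the cleaning steps
# (assumes atom.X is the transformed feature set as a pandas dataframe)
assert atom.X.isna().sum().sum() == 0                    # no missing values left
print(atom.X.select_dtypes(include="object").shape[1])   # 0 categorical columns remain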
In [4]:
atom.run(models=["LDA", "AdaB"], metric="auc", n_trials=10)
Training ========================= >>
Models: LDA, AdaB
Metric: roc_auc

Running hyperparameter tuning for LinearDiscriminantAnalysis...
| trial | solver | shrinkage | roc_auc | best_roc_auc | time_trial | time_ht | state    |
| ----- | ------ | --------- | ------- | ------------ | ---------- | ------- | -------- |
| 0     | eigen  | auto      | 0.8686  | 0.8686       | 1.863s     | 1.863s  | COMPLETE |
| 1     | lsqr   | 0.8       | 0.8607  | 0.8686       | 1.575s     | 3.439s  | COMPLETE |
| 2     | eigen  | auto      | 0.8686  | 0.8686       | 0.003s     | 3.442s  | COMPLETE |
| 3     | svd    | ---       | 0.8428  | 0.8686       | 1.494s     | 4.936s  | COMPLETE |
| 4     | lsqr   | 0.5       | 0.7998  | 0.8686       | 1.475s     | 6.411s  | COMPLETE |
| 5     | svd    | ---       | 0.8428  | 0.8686       | 0.000s     | 6.411s  | COMPLETE |
| 6     | lsqr   | auto      | 0.8147  | 0.8686       | 1.528s     | 7.939s  | COMPLETE |
| 7     | svd    | ---       | 0.8428  | 0.8686       | 0.000s     | 7.939s  | COMPLETE |
| 8     | svd    | ---       | 0.8428  | 0.8686       | 0.008s     | 7.947s  | COMPLETE |
| 9     | lsqr   | 1.0       | 0.8214  | 0.8686       | 1.518s     | 9.465s  | COMPLETE |
Hyperparameter tuning ---------------------------
Best trial --> 0
Best parameters:
 --> solver: eigen
 --> shrinkage: auto
Best evaluation --> roc_auc: 0.8686
Time elapsed: 9.465s
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.8867
Test evaluation --> roc_auc: 0.9118
Time elapsed: 0.112s
-------------------------------------------------
Total time: 9.577s

Running hyperparameter tuning for AdaBoost...
| trial | n_estimators | learning_rate | algorithm | roc_auc | best_roc_auc | time_trial | time_ht | state    |
| ----- | ------------ | ------------- | --------- | ------- | ------------ | ---------- | ------- | -------- |
| 0     | 240          | 0.1504        | SAMME     | 0.8736  | 0.8736       | 2.840s     | 2.840s  | COMPLETE |
| 1     | 360          | 0.0183        | SAMME.R   | 0.8063  | 0.8736       | 3.375s     | 6.215s  | COMPLETE |
| 2     | 490          | 0.6449        | SAMME.R   | 0.8147  | 0.8736       | 4.000s     | 10.215s | COMPLETE |
| 3     | 100          | 0.6744        | SAMME     | 0.7949  | 0.8736       | 1.304s     | 11.519s | COMPLETE |
| 4     | 270          | 0.0344        | SAMME.R   | 0.7845  | 0.8736       | 2.595s     | 14.115s | COMPLETE |
| 5     | 360          | 1.5914        | SAMME.R   | 0.7563  | 0.8736       | 3.225s     | 17.340s | COMPLETE |
| 6     | 400          | 0.0799        | SAMME.R   | 0.7664  | 0.8736       | 3.543s     | 20.883s | COMPLETE |
| 7     | 450          | 0.0244        | SAMME.R   | 0.8151  | 0.8736       | 4.176s     | 25.059s | COMPLETE |
| 8     | 310          | 0.2973        | SAMME.R   | 0.7478  | 0.8736       | 2.866s     | 27.926s | COMPLETE |
| 9     | 290          | 1.6933        | SAMME     | 0.8609  | 0.8736       | 2.736s     | 30.662s | COMPLETE |
Hyperparameter tuning ---------------------------
Best trial --> 0
Best parameters:
 --> n_estimators: 240
 --> learning_rate: 0.1504
 --> algorithm: SAMME
Best evaluation --> roc_auc: 0.8736
Time elapsed: 30.662s
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.896
Test evaluation --> roc_auc: 0.8754
Time elapsed: 1.918s
-------------------------------------------------
Total time: 32.580s

Final results ==================== >>
Total time: 43.206s
-------------------------------------
LinearDiscriminantAnalysis --> roc_auc: 0.9118 !
AdaBoost                   --> roc_auc: 0.8754
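The fitted models can be compared and accessed directly on the trainer. The sketch below assumes the atom.results dataframe and the atom.winner shortcut (the best model on the test set) are available after run, which is how ATOM's trainer-level attributes are documented.

# Compare the fitted models
# (assumes atom.results is a dataframe and atom.winner points to the best model)
print(atom.results)           # overview of scores and training times
print(atom.winner.name)       # acronym of the best performing model
print(atom.winner.estimator)  # the underlying scikit-learn estimator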
In [5]:
atom.evaluate()
Out[5]:
|      | accuracy | average_precision | balanced_accuracy | f1     | jaccard | matthews_corrcoef | precision | recall | roc_auc |
|------|----------|-------------------|-------------------|--------|---------|-------------------|-----------|--------|---------|
| LDA  | 0.855    | 0.761             | 0.7431            | 0.6329 | 0.4630  | 0.5623            | 0.7812    | 0.5319 | 0.9118  |
| AdaB | 0.850    | 0.731             | 0.6956            | 0.5588 | 0.3878  | 0.5411            | 0.9048    | 0.4043 | 0.8754  |
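From here the winning model can be used for inference. A minimal sketch, assuming the winner exposes the usual scikit-learn prediction methods and that ATOM runs new data through the fitted imputer and encoder before predicting; the rows sampled from X below are only a stand-in for genuinely new data.

# Predict on unseen rows with the best model
# (assumes atom.winner.predict accepts a dataframe with the original raw columns
#  and applies the fitted pipeline first)
X_new = X.sample(5, random_state=1).drop(columns="RainTomorrow")  # hypothetical new data
print(atom.winner.predict(X_new))
print(atom.winner.predict_proba(X_new))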