Example: Getting started
This example shows how to get started with the atom-ml library.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.
In [6]:
import pandas as pd
from atom import ATOMClassifier
# Load the Australian Weather dataset
X = pd.read_csv("https://raw.githubusercontent.com/tvdboom/ATOM/master/examples/datasets/weatherAUS.csv")
In [7]:
atom = ATOMClassifier(X, y="RainTomorrow", n_rows=1000, verbose=2)
<< ================== ATOM ================== >>

Configuration ==================== >>
Algorithm task: Binary classification.

Dataset stats ==================== >>
Shape: (1000, 22)
Train set size: 800
Test set size: 200
-------------------------------------
Memory: 176.13 kB
Scaled: False
Missing values: 2243 (10.2%)
Categorical features: 5 (23.8%)
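The 800/200 split above comes from ATOMClassifier sampling `n_rows` rows and holding out a test fraction (20% by default). A rough pandas sketch of that bookkeeping on toy data — an illustration of the idea, not ATOM's actual implementation:

```python
import pandas as pd

# Toy stand-in for the weather data (the real dataset has 22 columns).
df = pd.DataFrame({"MinTemp": range(50), "RainTomorrow": ["No", "Yes"] * 25})

# Sample the requested number of rows, then hold out the last 20% as test set.
sampled = df.sample(n=10, random_state=1).reset_index(drop=True)
cut = int(len(sampled) * 0.8)
train, test = sampled.iloc[:cut], sampled.iloc[cut:]
print(len(train), len(test))  # 8 2
```

With `n_rows=1000` and the default split, this yields the 800 training and 200 test rows reported in the output.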
In [8]:
atom.impute(strat_num="median", strat_cat="most_frequent")
atom.encode(strategy="Target", max_onehot=8)
Fitting Imputer...
Imputing missing values...
--> Imputing 5 missing values with median (11.8) in column MinTemp.
--> Imputing 1 missing values with median (22.2) in column MaxTemp.
--> Imputing 11 missing values with median (0.0) in column Rainfall.
--> Imputing 429 missing values with median (4.8) in column Evaporation.
--> Imputing 477 missing values with median (8.6) in column Sunshine.
--> Imputing 67 missing values with most_frequent (SE) in column WindGustDir.
--> Imputing 66 missing values with median (39.0) in column WindGustSpeed.
--> Imputing 78 missing values with most_frequent (N) in column WindDir9am.
--> Imputing 24 missing values with most_frequent (SSE) in column WindDir3pm.
--> Imputing 7 missing values with median (13.0) in column WindSpeed9am.
--> Imputing 20 missing values with median (19.0) in column WindSpeed3pm.
--> Imputing 15 missing values with median (70.0) in column Humidity9am.
--> Imputing 24 missing values with median (53.0) in column Humidity3pm.
--> Imputing 106 missing values with median (1017.9) in column Pressure9am.
--> Imputing 104 missing values with median (1015.7) in column Pressure3pm.
--> Imputing 371 missing values with median (5.0) in column Cloud9am.
--> Imputing 402 missing values with median (5.0) in column Cloud3pm.
--> Imputing 7 missing values with median (16.4) in column Temp9am.
--> Imputing 18 missing values with median (20.8) in column Temp3pm.
--> Imputing 11 missing values with most_frequent (No) in column RainToday.
Fitting Encoder...
Encoding categorical columns...
--> Target-encoding feature Location. Contains 49 classes.
--> Target-encoding feature WindGustDir. Contains 16 classes.
--> Target-encoding feature WindDir9am. Contains 16 classes.
--> Target-encoding feature WindDir3pm. Contains 16 classes.
--> Ordinal-encoding feature RainToday. Contains 2 classes.
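A minimal pandas sketch of what these two steps compute, on toy data. ATOM's Imputer and Encoder apply this per column with extra bookkeeping, and the target encoder uses a smoothed estimate rather than the raw per-category means shown here:

```python
import pandas as pd

df = pd.DataFrame({
    "Rainfall": [0.0, 2.4, None, 1.2],       # numerical column with one gap
    "WindGustDir": ["SE", None, "SE", "N"],  # categorical column with one gap
    "RainTomorrow": [0, 1, 0, 1],            # binary target
})

# strat_num="median": fill numerical gaps with the column median.
df["Rainfall"] = df["Rainfall"].fillna(df["Rainfall"].median())

# strat_cat="most_frequent": fill categorical gaps with the column mode.
df["WindGustDir"] = df["WindGustDir"].fillna(df["WindGustDir"].mode()[0])

# strategy="Target": replace each category with the mean target it co-occurs with.
means = df.groupby("WindGustDir")["RainTomorrow"].mean()
df["WindGustDir"] = df["WindGustDir"].map(means)
print(df)
```

Columns with at most `max_onehot` classes would be one-hot encoded instead; `RainToday`, with only two classes, gets a simple ordinal encoding.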
In [9]:
atom.run(models=["LDA", "AdaB"], metric="auc", n_trials=10)
Training ========================= >>
Models: LDA, AdaB
Metric: auc

Running hyperparameter tuning for LinearDiscriminantAnalysis...

| trial | solver | shrinkage | auc | best_auc | time_trial | time_ht | state |
| ----- | ------- | --------- | ------- | -------- | ---------- | ------- | -------- |
| 0 | lsqr | auto | 0.6291 | 0.6291 | 0.127s | 0.127s | COMPLETE |
| 1 | svd | None | 0.7018 | 0.7018 | 0.122s | 0.250s | COMPLETE |
| 2 | svd | None | 0.7018 | 0.7018 | 0.001s | 0.251s | COMPLETE |
| 3 | svd | None | 0.7018 | 0.7018 | 0.000s | 0.251s | COMPLETE |
| 4 | svd | None | 0.7018 | 0.7018 | 0.000s | 0.251s | COMPLETE |
| 5 | eigen | auto | 0.6675 | 0.7018 | 0.129s | 0.380s | COMPLETE |
| 6 | lsqr | 0.9 | 0.7511 | 0.7511 | 0.124s | 0.504s | COMPLETE |
| 7 | svd | None | 0.7018 | 0.7511 | 0.000s | 0.504s | COMPLETE |
| 8 | lsqr | 0.8 | 0.7035 | 0.7511 | 0.121s | 0.625s | COMPLETE |
| 9 | eigen | None | 0.6638 | 0.7511 | 0.120s | 0.745s | COMPLETE |

Hyperparameter tuning ---------------------------
Best trial --> 6
Best parameters:
--> solver: lsqr
--> shrinkage: 0.9
Best evaluation --> auc: 0.7511
Time elapsed: 0.745s
Fit ---------------------------------------------
Train evaluation --> auc: 0.8034
Test evaluation --> auc: 0.8655
Time elapsed: 0.035s
-------------------------------------------------
Time: 0.779s

Running hyperparameter tuning for AdaBoost...
| trial | n_estimators | learning_rate | auc | best_auc | time_trial | time_ht | state |
| ----- | ------------ | ------------- | ------- | -------- | ---------- | ------- | -------- |
| 0 | 220 | 0.0145 | 0.5558 | 0.5558 | 0.559s | 0.559s | COMPLETE |
| 1 | 340 | 0.0149 | 0.6245 | 0.6245 | 0.797s | 1.356s | COMPLETE |
| 2 | 310 | 0.3206 | 0.6427 | 0.6427 | 0.745s | 2.101s | COMPLETE |
| 3 | 120 | 8.1247 | 0.5 | 0.6427 | 0.368s | 2.469s | COMPLETE |
| 4 | 70 | 0.065 | 0.5728 | 0.6427 | 0.262s | 2.731s | COMPLETE |
| 5 | 280 | 3.2722 | 0.5 | 0.6427 | 0.684s | 3.415s | COMPLETE |
| 6 | 330 | 0.0341 | 0.6167 | 0.6427 | 0.790s | 4.205s | COMPLETE |
| 7 | 90 | 0.0442 | 0.5856 | 0.6427 | 0.308s | 4.513s | COMPLETE |
| 8 | 290 | 5.6564 | 0.5 | 0.6427 | 0.723s | 5.236s | COMPLETE |
| 9 | 450 | 5.7754 | 0.5128 | 0.6427 | 1.046s | 6.282s | COMPLETE |

Hyperparameter tuning ---------------------------
Best trial --> 2
Best parameters:
--> n_estimators: 310
--> learning_rate: 0.3206
Best evaluation --> auc: 0.6427
Time elapsed: 6.282s
Fit ---------------------------------------------
Train evaluation --> auc: 0.952
Test evaluation --> auc: 0.8025
Time elapsed: 0.790s
-------------------------------------------------
Time: 7.072s

Final results ==================== >>
Total time: 9.717s
-------------------------------------
LinearDiscriminantAnalysis --> auc: 0.8655 !
AdaBoost --> auc: 0.8025
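Every trial above is scored with `metric="auc"`, the area under the ROC curve. It equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative one, which gives a compact pairwise formulation (a self-contained sketch of the metric, not how ATOM computes it — ATOM delegates to scikit-learn scorers):

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This also explains the 0.5 scores in the AdaBoost table: with too large a learning rate the model's rankings are no better than chance.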
In [10]:
atom.evaluate()
Out[10]:
| | accuracy | ap | ba | f1 | jaccard | mcc | precision | recall | auc |
|---|---|---|---|---|---|---|---|---|---|
| LDA | 0.830000 | 0.778200 | 0.804700 | 0.685200 | 0.521100 | 0.574700 | 0.627100 | 0.755100 | 0.865500 |
| AdaB | 0.825000 | 0.671200 | 0.711800 | 0.578300 | 0.406800 | 0.485000 | 0.705900 | 0.489800 | 0.802500 |
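Most columns in this table derive from the confusion matrix of the thresholded test-set predictions. A pure-Python sketch of how accuracy, precision, recall, and f1 relate to it (illustrative only; ATOM's `evaluate` uses scikit-learn's scorers):

```python
def binary_scores(y_true, y_pred):
    """Confusion-matrix-based metrics for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)        # of predicted rain, how much was rain
    recall = tp / (tp + fn)           # of actual rain, how much was caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

print(binary_scores([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```

Note how AdaB's high accuracy coexists with a low recall: on an imbalanced target like RainTomorrow, accuracy alone is a poor guide, which is why `evaluate` reports the full battery of metrics.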