Example: Getting started¶
This example shows how to get started with the atom-ml library.
The data used is a variation of the Australian weather dataset from Kaggle. You can download it from here. The goal is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.
In [1]:
import pandas as pd
from atom import ATOMClassifier

# Load the Australian Weather dataset
X = pd.read_csv("https://raw.githubusercontent.com/tvdboom/ATOM/master/examples/datasets/weatherAUS.csv")
In [2]:
atom = ATOMClassifier(X, y="RainTomorrow", n_rows=1000, verbose=2)
<< ================== ATOM ================== >>

Algorithm task: binary classification.

Dataset stats ==================== >>
Shape: (1000, 22)
Memory: 434.20 kB
Scaled: False
Missing values: 2210 (10.0%)
Categorical features: 5 (23.8%)
Outlier values: 1 (0.0%)
-------------------------------------
Train set size: 800
Test set size: 200
-------------------------------------

|   | dataset   | train     | test      |
| - | --------- | --------- | --------- |
| 0 | 777 (3.5) | 622 (3.5) | 155 (3.4) |
| 1 | 223 (1.0) | 178 (1.0) | 45 (1.0)  |
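The summary shows an 80/20 train/test split and a roughly 3.5:1 class imbalance (RainTomorrow is mostly "No"). A minimal pandas sketch of where those numbers come from, using toy data with the same counts (illustrative only; a real split would also shuffle and stratify):

```python
import pandas as pd

# Toy target mimicking the class counts reported above (777 "No", 223 "Yes")
y = pd.Series(["No"] * 777 + ["Yes"] * 223, name="RainTomorrow")

counts = y.value_counts()
ratio = counts["No"] / counts["Yes"]  # class imbalance, roughly 3.5:1

# An 80/20 holdout by size; this only illustrates the set sizes
n_train = int(len(y) * 0.8)
train, test = y.iloc[:n_train], y.iloc[n_train:]
print(len(train), len(test), round(ratio, 1))
```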
In [3]:
atom.impute(strat_num="median", strat_cat="most_frequent")
atom.encode(strategy="LeaveOneOut", max_onehot=8)
Fitting Imputer...
Imputing missing values...
 --> Imputing 1 missing values with median (12.7) in feature MinTemp.
 --> Imputing 12 missing values with median (0.0) in feature Rainfall.
 --> Imputing 430 missing values with median (4.6) in feature Evaporation.
 --> Imputing 476 missing values with median (8.8) in feature Sunshine.
 --> Imputing 54 missing values with most_frequent (N) in feature WindGustDir.
 --> Imputing 54 missing values with median (37.0) in feature WindGustSpeed.
 --> Imputing 70 missing values with most_frequent (N) in feature WindDir9am.
 --> Imputing 26 missing values with most_frequent (S) in feature WindDir3pm.
 --> Imputing 5 missing values with median (13.0) in feature WindSpeed9am.
 --> Imputing 16 missing values with median (17.0) in feature WindSpeed3pm.
 --> Imputing 9 missing values with median (69.0) in feature Humidity9am.
 --> Imputing 22 missing values with median (51.0) in feature Humidity3pm.
 --> Imputing 105 missing values with median (1018.2) in feature Pressure9am.
 --> Imputing 109 missing values with median (1015.6) in feature Pressure3pm.
 --> Imputing 393 missing values with median (5.0) in feature Cloud9am.
 --> Imputing 397 missing values with median (4.0) in feature Cloud3pm.
 --> Imputing 4 missing values with median (17.0) in feature Temp9am.
 --> Imputing 15 missing values with median (21.5) in feature Temp3pm.
 --> Imputing 12 missing values with most_frequent (No) in feature RainToday.
Fitting Encoder...
Encoding categorical columns...
 --> LeaveOneOut-encoding feature Location. Contains 49 classes.
 --> LeaveOneOut-encoding feature WindGustDir. Contains 16 classes.
 --> LeaveOneOut-encoding feature WindDir9am. Contains 16 classes.
 --> LeaveOneOut-encoding feature WindDir3pm. Contains 16 classes.
 --> Ordinal-encoding feature RainToday. Contains 2 classes.
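The imputation itself is straightforward: numerical columns get their median, categorical columns their most frequent value. A minimal pandas sketch of the same idea on toy data (illustrative only, not ATOM's internal code):

```python
import pandas as pd

df = pd.DataFrame({
    "MinTemp": [12.0, None, 14.5, 11.3],    # numerical column with a gap
    "WindGustDir": ["N", "SE", None, "N"],  # categorical column with a gap
})

# Numerical: fill with the column median
df["MinTemp"] = df["MinTemp"].fillna(df["MinTemp"].median())

# Categorical: fill with the most frequent value (the mode)
df["WindGustDir"] = df["WindGustDir"].fillna(df["WindGustDir"].mode()[0])

print(df.isna().sum().sum())  # 0 -> no missing values left
```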
In [4]:
atom.run(models=["LDA", "AdaB"], metric="auc", n_trials=10)
Training ========================= >>
Models: LDA, AdaB
Metric: roc_auc

Running hyperparameter tuning for LinearDiscriminantAnalysis...

| trial | solver | shrinkage | roc_auc | best_roc_auc | time_trial | time_ht | state    |
| ----- | ------ | --------- | ------- | ------------ | ---------- | ------- | -------- |
| 0     | eigen  | 0.8       | 0.8412  | 0.8412       | 0.218s     | 0.218s  | COMPLETE |
| 1     | lsqr   | 0.6       | 0.8192  | 0.8412       | 0.253s     | 0.471s  | COMPLETE |
| 2     | lsqr   | auto      | 0.7988  | 0.8412       | 0.218s     | 0.689s  | COMPLETE |
| 3     | eigen  | 0.9       | 0.89    | 0.89         | 0.216s     | 0.905s  | COMPLETE |
| 4     | svd    | ---       | 0.8542  | 0.89         | 0.249s     | 1.154s  | COMPLETE |
| 5     | lsqr   | 0.6       | 0.8192  | 0.89         | 0.004s     | 1.158s  | COMPLETE |
| 6     | lsqr   | auto      | 0.7988  | 0.89         | 0.003s     | 1.161s  | COMPLETE |
| 7     | lsqr   | 0.8       | 0.8129  | 0.89         | 0.244s     | 1.405s  | COMPLETE |
| 8     | lsqr   | None      | 0.7948  | 0.89         | 0.208s     | 1.613s  | COMPLETE |
| 9     | eigen  | 0.8       | 0.8412  | 0.89         | 0.002s     | 1.616s  | COMPLETE |

Hyperparameter tuning ---------------------------
Best trial --> 3
Best parameters:
 --> solver: eigen
 --> shrinkage: 0.9
Best evaluation --> roc_auc: 0.89
Time elapsed: 1.616s
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.8146
Test evaluation --> roc_auc: 0.8305
Time elapsed: 0.040s
-------------------------------------------------
Total time: 1.656s

Running hyperparameter tuning for AdaBoost...

| trial | n_estimators | learning_rate | algorithm | roc_auc | best_roc_auc | time_trial | time_ht | state    |
| ----- | ------------ | ------------- | --------- | ------- | ------------ | ---------- | ------- | -------- |
| 0     | 190          | 1.6916        | SAMME.R   | 0.6808  | 0.6808       | 0.690s     | 0.690s  | COMPLETE |
| 1     | 380          | 0.1227        | SAMME     | 0.7919  | 0.7919       | 0.999s     | 1.689s  | COMPLETE |
| 2     | 330          | 4.937         | SAMME     | 0.642   | 0.7919       | 0.235s     | 1.923s  | COMPLETE |
| 3     | 80           | 1.0146        | SAMME.R   | 0.7708  | 0.7919       | 0.391s     | 2.314s  | COMPLETE |
| 4     | 470          | 3.6762        | SAMME     | 0.642   | 0.7919       | 0.215s     | 2.529s  | COMPLETE |
| 5     | 200          | 6.1828        | SAMME     | 0.2231  | 0.7919       | 0.209s     | 2.737s  | COMPLETE |
| 6     | 240          | 0.5726        | SAMME.R   | 0.7708  | 0.7919       | 0.748s     | 3.485s  | COMPLETE |
| 7     | 390          | 3.0825        | SAMME.R   | 0.3262  | 0.7919       | 1.093s     | 4.577s  | COMPLETE |
| 8     | 290          | 0.7957        | SAMME     | 0.7699  | 0.7919       | 0.862s     | 5.439s  | COMPLETE |
| 9     | 190          | 0.0256        | SAMME     | 0.8011  | 0.8011       | 0.596s     | 6.036s  | COMPLETE |

Hyperparameter tuning ---------------------------
Best trial --> 9
Best parameters:
 --> n_estimators: 190
 --> learning_rate: 0.0256
 --> algorithm: SAMME
Best evaluation --> roc_auc: 0.8011
Time elapsed: 6.036s
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.844
Test evaluation --> roc_auc: 0.8366
Time elapsed: 0.449s
-------------------------------------------------
Total time: 6.485s

Final results ==================== >>
Total time: 8.352s
-------------------------------------
LinearDiscriminantAnalysis --> roc_auc: 0.8305
AdaBoost --> roc_auc: 0.8366 !
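Both models are compared on ROC AUC, which measures how well the classifier ranks positive examples above negative ones, independent of any decision threshold. A tiny self-contained reminder of what the score means, using scikit-learn on toy labels (unrelated to the run above):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class

# Fraction of (positive, negative) pairs ranked correctly:
# only the pair (0.4, 0.35) is mis-ranked, so 3 of 4 pairs -> 0.75
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```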
In [5]:
atom.evaluate()
Out[5]:
|      | accuracy | average_precision | balanced_accuracy | f1     | jaccard | matthews_corrcoef | precision | recall | roc_auc |
| ---- | -------- | ----------------- | ----------------- | ------ | ------- | ----------------- | --------- | ------ | ------- |
| LDA  | 0.805    | 0.6325            | 0.7559            | 0.6061 | 0.4348  | 0.4814            | 0.5556    | 0.6667 | 0.8305  |
| AdaB | 0.815    | 0.6505            | 0.6204            | 0.3934 | 0.2449  | 0.3707            | 0.7500    | 0.2667 | 0.8366  |
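The threshold-based metrics in this table all derive from the confusion matrix, which is why AdaBoost can have the higher AUC yet a much lower recall and f1 than LDA. A short sketch of those relationships, using toy confusion-matrix counts chosen to be consistent with the LDA row above (tp/fp/fn are illustrative, not taken from ATOM's output):

```python
# Toy confusion-matrix counts for a binary classifier
tp, fp, fn = 30, 24, 15

precision = tp / (tp + fp)  # of predicted positives, how many were right
recall = tp / (tp + fn)     # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.5556 0.6667 0.6061
```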