Example: Getting started
This example shows how to get started with the atom-ml library.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.
In [1]:
import pandas as pd
from atom import ATOMClassifier

# Load the Australian Weather dataset
X = pd.read_csv("https://raw.githubusercontent.com/tvdboom/ATOM/master/examples/datasets/weatherAUS.csv")
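Before handing the data to ATOM, it can help to take a quick look at what was loaded. The snippet below is a minimal sketch using plain pandas on the X dataframe created above; it assumes nothing ATOM-specific.

# Quick sanity check on the raw data (plain pandas)
print(X.shape)         # number of rows and columns
print(X.dtypes)        # column types, to spot the categorical features
print(X.isna().sum())  # missing values per column
X.head()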
In [2]:
atom = ATOMClassifier(X, y="RainTomorrow", n_rows=1000, verbose=2)
<< ================== ATOM ================== >>
Algorithm task: binary classification.

Dataset stats ==================== >>
Shape: (1000, 22)
Train set size: 800
Test set size: 200
-------------------------------------
Memory: 433.78 kB
Scaled: False
Missing values: 2231 (10.1%)
Categorical features: 5 (23.8%)
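The initializer subsampled 1000 rows and split them into a train and test set. If you want to inspect those splits yourself, the sketch below assumes that atom.dataset, atom.train and atom.test are exposed as regular pandas dataframes.

# Inspect the data splits held by the trainer
# (assumes atom.dataset, atom.train and atom.test behave like pandas dataframes)
print(atom.dataset.shape)  # full (subsampled) dataset: 1000 rows
print(atom.train.shape)    # training set: 800 rows
print(atom.test.shape)     # test set: 200 rows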
In [3]:
atom.impute(strat_num="median", strat_cat="most_frequent")
atom.encode(strategy="Target", max_onehot=8)
Fitting Imputer...
Imputing missing values...
 --> Imputing 5 missing values with median (11.7) in feature MinTemp.
 --> Imputing 2 missing values with median (22.25) in feature MaxTemp.
 --> Imputing 12 missing values with median (0.0) in feature Rainfall.
 --> Imputing 417 missing values with median (4.2) in feature Evaporation.
 --> Imputing 469 missing values with median (8.4) in feature Sunshine.
 --> Imputing 68 missing values with most_frequent (W) in feature WindGustDir.
 --> Imputing 68 missing values with median (37.0) in feature WindGustSpeed.
 --> Imputing 64 missing values with most_frequent (N) in feature WindDir9am.
 --> Imputing 32 missing values with most_frequent (SE) in feature WindDir3pm.
 --> Imputing 13 missing values with median (13.0) in feature WindSpeed9am.
 --> Imputing 23 missing values with median (19.0) in feature WindSpeed3pm.
 --> Imputing 17 missing values with median (69.0) in feature Humidity9am.
 --> Imputing 28 missing values with median (52.0) in feature Humidity3pm.
 --> Imputing 100 missing values with median (1017.6) in feature Pressure9am.
 --> Imputing 98 missing values with median (1015.3) in feature Pressure3pm.
 --> Imputing 379 missing values with median (5.0) in feature Cloud9am.
 --> Imputing 399 missing values with median (5.0) in feature Cloud3pm.
 --> Imputing 7 missing values with median (16.5) in feature Temp9am.
 --> Imputing 18 missing values with median (21.1) in feature Temp3pm.
 --> Imputing 12 missing values with most_frequent (No) in feature RainToday.
Fitting Encoder...
Encoding categorical columns...
 --> Target-encoding feature Location. Contains 49 classes.
 --> Target-encoding feature WindGustDir. Contains 16 classes.
 --> Target-encoding feature WindDir9am. Contains 16 classes.
 --> Target-encoding feature WindDir3pm. Contains 16 classes.
 --> Ordinal-encoding feature RainToday. Contains 2 classes.
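After imputing and encoding, the feature set should contain no missing values and no non-numeric columns. A minimal check, assuming atom.X returns the transformed features as a pandas dataframe:

# Verify the cleaning steps
# (assumes atom.X is the transformed feature set as a pandas dataframe)
assert atom.X.isna().sum().sum() == 0                    # no missing values left
print(atom.X.select_dtypes(include="object").shape[1])   # 0 categorical columns remain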
In [4]:
atom.run(models=["LDA", "AdaB"], metric="auc", n_trials=10)
Training ========================= >>
Models: LDA, AdaB
Metric: roc_auc

Running hyperparameter tuning for LinearDiscriminantAnalysis...
| trial | solver | shrinkage | roc_auc | best_roc_auc | time_trial | time_ht | state    |
| ----- | ------ | --------- | ------- | ------------ | ---------- | ------- | -------- |
| 0     | eigen  | auto      | 0.8686  | 0.8686       | 1.863s     | 1.863s  | COMPLETE |
| 1     | lsqr   | 0.8       | 0.8607  | 0.8686       | 1.575s     | 3.439s  | COMPLETE |
| 2     | eigen  | auto      | 0.8686  | 0.8686       | 0.003s     | 3.442s  | COMPLETE |
| 3     | svd    | ---       | 0.8428  | 0.8686       | 1.494s     | 4.936s  | COMPLETE |
| 4     | lsqr   | 0.5       | 0.7998  | 0.8686       | 1.475s     | 6.411s  | COMPLETE |
| 5     | svd    | ---       | 0.8428  | 0.8686       | 0.000s     | 6.411s  | COMPLETE |
| 6     | lsqr   | auto      | 0.8147  | 0.8686       | 1.528s     | 7.939s  | COMPLETE |
| 7     | svd    | ---       | 0.8428  | 0.8686       | 0.000s     | 7.939s  | COMPLETE |
| 8     | svd    | ---       | 0.8428  | 0.8686       | 0.008s     | 7.947s  | COMPLETE |
| 9     | lsqr   | 1.0       | 0.8214  | 0.8686       | 1.518s     | 9.465s  | COMPLETE |
Hyperparameter tuning ---------------------------
Best trial --> 0
Best parameters:
 --> solver: eigen
 --> shrinkage: auto
Best evaluation --> roc_auc: 0.8686
Time elapsed: 9.465s
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.8867
Test evaluation --> roc_auc: 0.9118
Time elapsed: 0.112s
-------------------------------------------------
Total time: 9.577s

Running hyperparameter tuning for AdaBoost...
| trial | n_estimators | learning_rate | algorithm | roc_auc | best_roc_auc | time_trial | time_ht | state    |
| ----- | ------------ | ------------- | --------- | ------- | ------------ | ---------- | ------- | -------- |
| 0     | 240          | 0.1504        | SAMME     | 0.8736  | 0.8736       | 2.840s     | 2.840s  | COMPLETE |
| 1     | 360          | 0.0183        | SAMME.R   | 0.8063  | 0.8736       | 3.375s     | 6.215s  | COMPLETE |
| 2     | 490          | 0.6449        | SAMME.R   | 0.8147  | 0.8736       | 4.000s     | 10.215s | COMPLETE |
| 3     | 100          | 0.6744        | SAMME     | 0.7949  | 0.8736       | 1.304s     | 11.519s | COMPLETE |
| 4     | 270          | 0.0344        | SAMME.R   | 0.7845  | 0.8736       | 2.595s     | 14.115s | COMPLETE |
| 5     | 360          | 1.5914        | SAMME.R   | 0.7563  | 0.8736       | 3.225s     | 17.340s | COMPLETE |
| 6     | 400          | 0.0799        | SAMME.R   | 0.7664  | 0.8736       | 3.543s     | 20.883s | COMPLETE |
| 7     | 450          | 0.0244        | SAMME.R   | 0.8151  | 0.8736       | 4.176s     | 25.059s | COMPLETE |
| 8     | 310          | 0.2973        | SAMME.R   | 0.7478  | 0.8736       | 2.866s     | 27.926s | COMPLETE |
| 9     | 290          | 1.6933        | SAMME     | 0.8609  | 0.8736       | 2.736s     | 30.662s | COMPLETE |
Hyperparameter tuning ---------------------------
Best trial --> 0
Best parameters:
 --> n_estimators: 240
 --> learning_rate: 0.1504
 --> algorithm: SAMME
Best evaluation --> roc_auc: 0.8736
Time elapsed: 30.662s
Fit ---------------------------------------------
Train evaluation --> roc_auc: 0.896
Test evaluation --> roc_auc: 0.8754
Time elapsed: 1.918s
-------------------------------------------------
Total time: 32.580s

Final results ==================== >>
Total time: 43.206s
-------------------------------------
LinearDiscriminantAnalysis --> roc_auc: 0.9118 !
AdaBoost                   --> roc_auc: 0.8754
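The fitted models can be compared and accessed directly on the trainer. The sketch below assumes the atom.results dataframe and the atom.winner shortcut (the best model on the test set) are available after run, which is how ATOM's trainer-level attributes are documented.

# Compare the fitted models
# (assumes atom.results is a dataframe and atom.winner points to the best model)
print(atom.results)           # overview of scores and training times
print(atom.winner.name)       # acronym of the best performing model
print(atom.winner.estimator)  # the underlying scikit-learn estimator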
In [5]:
atom.evaluate()
Out[5]:
|      | accuracy | average_precision | balanced_accuracy | f1     | jaccard | matthews_corrcoef | precision | recall | roc_auc |
|------|----------|-------------------|-------------------|--------|---------|-------------------|-----------|--------|---------|
| LDA  | 0.855    | 0.761             | 0.7431            | 0.6329 | 0.4630  | 0.5623            | 0.7812    | 0.5319 | 0.9118  |
| AdaB | 0.850    | 0.731             | 0.6956            | 0.5588 | 0.3878  | 0.5411            | 0.9048    | 0.4043 | 0.8754  |
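From here the winning model can be used for inference. A minimal sketch, assuming the winner exposes the usual scikit-learn prediction methods and that ATOM runs new data through the fitted imputer and encoder before predicting; the rows sampled from X below are only a stand-in for genuinely new data.

# Predict on unseen rows with the best model
# (assumes atom.winner.predict accepts a dataframe with the original raw columns
#  and applies the fitted pipeline first)
X_new = X.sample(5, random_state=1).drop(columns="RainTomorrow")  # hypothetical new data
print(atom.winner.predict(X_new))
print(atom.winner.predict_proba(X_new))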