# Example: Getting started
--------------------------

This example shows how to get started with the atom-ml library.

The data used is a variation on the [Australian weather dataset](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package) from Kaggle. You can download it from [here](https://github.com/tvdboom/ATOM/blob/master/examples/datasets/weatherAUS.csv). The goal is to predict whether it will rain tomorrow by training a binary classifier on the target column `RainTomorrow`.

In [1]:
import pandas as pd
from atom import ATOMClassifier

# Load the Australian Weather dataset
X = pd.read_csv("https://raw.githubusercontent.com/tvdboom/ATOM/master/examples/datasets/weatherAUS.csv")

In [2]:
atom = ATOMClassifier(X, y="RainTomorrow", n_rows=1000, verbose=2)

Algorithm task: binary classification.

Shape: (1000, 22)
Memory: 434.20 kB
Scaled: False
Missing values: 2210 (10.0%)
Categorical features: 5 (23.8%)
Outlier values: 1 (0.0%)
-------------------------------------
Train set size: 800
Test set size: 200
-------------------------------------
|   |     dataset |       train |        test |
| - | ----------- | ----------- | ----------- |
| 0 |   777 (3.5) |   622 (3.5) |   155 (3.4) |
| 1 |   223 (1.0) |   178 (1.0) |    45 (1.0) |
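
The numbers in parentheses in the class table are consistent with the ratio of each class count to the minority class count in the same column. The sketch below reproduces them with plain pandas from the counts printed above; it is an illustration of that reading, not ATOM's own code.

```python
import pandas as pd

# Class counts copied from the output table above (rows are classes 0 and 1).
counts = pd.DataFrame(
    {"dataset": [777, 223], "train": [622, 178], "test": [155, 45]},
    index=[0, 1],
)

# Divide every count by the smallest count in its column (the minority class).
ratios = (counts / counts.min()).round(1)
print(ratios)
```

Roughly 3.5 dry days for every rainy day means the target is moderately imbalanced, which is worth keeping in mind when reading accuracy figures later on.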



In [3]:
atom.impute(strat_num="median", strat_cat="most_frequent")
atom.encode(strategy="LeaveOneOut", max_onehot=8)

Fitting Imputer...
Imputing missing values...
 --> Imputing 1 missing values with median (12.7) in feature MinTemp.
 --> Imputing 12 missing values with median (0.0) in feature Rainfall.
 --> Imputing 430 missing values with median (4.6) in feature Evaporation.
 --> Imputing 476 missing values with median (8.8) in feature Sunshine.
 --> Imputing 54 missing values with most_frequent (N) in feature WindGustDir.
 --> Imputing 54 missing values with median (37.0) in feature WindGustSpeed.
 --> Imputing 70 missing values with most_frequent (N) in feature WindDir9am.
 --> Imputing 26 missing values with most_frequent (S) in feature WindDir3pm.
 --> Imputing 5 missing values with median (13.0) in feature WindSpeed9am.
 --> Imputing 16 missing values with median (17.0) in feature WindSpeed3pm.
 --> Imputing 9 missing values with median (69.0) in feature Humidity9am.
 --> Imputing 22 missing values with median (51.0) in feature Humidity3pm.
 --> Imputing 105 missing values with median (1018.2) 
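
As the log shows, numerical features are filled with the column median and categorical features with the most frequent value. The following is a minimal sketch of that idea in plain pandas, assuming a small made-up frame (the values in `toy` are hypothetical, and this is not how ATOM implements its imputer).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numerical and one categorical feature.
toy = pd.DataFrame(
    {
        "MinTemp": [10.0, np.nan, 14.0, 12.0],
        "WindGustDir": ["N", "S", None, "N"],
    }
)

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.columns.difference(num_cols)

imputed = toy.copy()
# Numerical columns: replace missing values with the column median.
imputed[num_cols] = toy[num_cols].fillna(toy[num_cols].median())
# Categorical columns: replace missing values with the most frequent value.
for col in cat_cols:
    imputed[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(imputed)
```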

In [4]:
atom.run(models=["LDA", "AdaB"], metric="auc", n_trials=10)


Models: LDA, AdaB
Metric: roc_auc


Running hyperparameter tuning for LinearDiscriminantAnalysis...
| trial |  solver | shrinkage | roc_auc | best_roc_auc | time_trial | time_ht |    state |
| ----- | ------- | --------- | ------- | ------------ | ---------- | ------- | -------- |
| 0     |   eigen |       0.8 |  0.8412 |       0.8412 |     0.218s |  0.218s | COMPLETE |
| 1     |    lsqr |       0.6 |  0.8192 |       0.8412 |     0.253s |  0.471s | COMPLETE |
| 2     |    lsqr |      auto |  0.7988 |       0.8412 |     0.218s |  0.689s | COMPLETE |
| 3     |   eigen |       0.9 |    0.89 |         0.89 |     0.216s |  0.905s | COMPLETE |
| 4     |     svd |       --- |  0.8542 |         0.89 |     0.249s |  1.154s | COMPLETE |
| 5     |    lsqr |       0.6 |  0.8192 |         0.89 |     0.004s |  1.158s | COMPLETE |
| 6     |    lsqr |      auto |  0.7988 |         0.89 |     0.003s |  1.161s | COMPLETE |
| 7     |    lsqr |       0.8 |  0.8129 |         0.89 |     0.244s |  1.405s | 
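
In the trial table, `best_roc_auc` behaves like a running maximum of `roc_auc`, and `time_ht` like a running total of `time_trial`. The sketch below checks that reading against the values visible in the rows above, using only the standard library.

```python
from itertools import accumulate

# Values copied from the LDA trial rows shown above (trials 0 to 7).
roc_auc = [0.8412, 0.8192, 0.7988, 0.89, 0.8542, 0.8192, 0.7988, 0.8129]
time_trial = [0.218, 0.253, 0.218, 0.216, 0.249, 0.004, 0.003, 0.244]

# Best score seen so far after each trial.
best_roc_auc = list(accumulate(roc_auc, max))
# Cumulative tuning time after each trial, rounded to milliseconds.
time_ht = [round(t, 3) for t in accumulate(time_trial)]

print(best_roc_auc)
print(time_ht)
```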

In [5]:
atom.evaluate()

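|      | accuracy | average_precision | balanced_accuracy |     f1 | jaccard | matthews_corrcoef | precision | recall | roc_auc |
| ---- | -------- | ----------------- | ----------------- | ------ | ------- | ----------------- | --------- | ------ | ------- |
| LDA  |    0.805 |            0.6325 |            0.7559 | 0.6061 |  0.4348 |            0.4814 |    0.5556 | 0.6667 |  0.8305 |
| AdaB |    0.815 |            0.6505 |            0.6204 | 0.3934 |  0.2449 |            0.3707 |      0.75 | 0.2667 |  0.8366 |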
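
To see how the threshold-based metrics relate to one another, the sketch below recomputes the LDA row from a confusion matrix. The four counts are not printed by ATOM; they are inferred here from the reported precision and recall together with the test set class sizes (155 negatives, 45 positives), so treat them as an assumption.

```python
from math import sqrt

# Confusion counts inferred from the LDA metrics and test class sizes (assumption).
tp, fn = 30, 15    # the 45 positive test samples
fp, tn = 24, 131   # the 155 negative test samples

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)
jaccard = tp / (tp + fp + fn)
matthews_corrcoef = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [
    ("accuracy", accuracy),
    ("balanced_accuracy", balanced_accuracy),
    ("f1", f1),
    ("jaccard", jaccard),
    ("matthews_corrcoef", matthews_corrcoef),
    ("precision", precision),
    ("recall", recall),
]:
    print(f"{name}: {value:.4f}")
```

Note that AdaB scores slightly higher on accuracy but much lower on recall and balanced accuracy than LDA, which is the kind of gap that plain accuracy hides on an imbalanced target.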
