# Example: Getting started
--------------------------

This example shows how to get started with the atom-ml library.

The data used is a variation on the [Australian weather dataset](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package) from Kaggle. You can download it from [here](https://github.com/tvdboom/ATOM/blob/master/examples/datasets/weatherAUS.csv). The goal is to predict whether it will rain tomorrow by training a binary classifier on the target column `RainTomorrow`.

In [1]:
import pandas as pd
from atom import ATOMClassifier

# Load the Australian Weather dataset
X = pd.read_csv("https://raw.githubusercontent.com/tvdboom/ATOM/master/examples/datasets/weatherAUS.csv")

In [2]:
# Initialize atom on a 1000-row subsample, with RainTomorrow as the target column
atom = ATOMClassifier(X, y="RainTomorrow", n_rows=1000, verbose=2)

Algorithm task: binary classification.

Shape: (1000, 22)
Train set size: 800
Test set size: 200
-------------------------------------
Memory: 433.78 kB
Scaled: False
Missing values: 2231 (10.1%)
Categorical features: 5 (23.8%)
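
The percentages in the summary above can be reproduced by hand. The sketch below is only a consistency check on the printed figures, not ATOM's own code; the denominators it uses (all cells for missing values, all columns except the target for categorical features) are assumptions that happen to match the output.

```python
# Figures copied from the summary printed above.
n_rows, n_cols = 1000, 22
n_missing = 2231
n_categorical = 5

# Assumption: missing values are counted over every cell in the dataset.
missing_pct = 100 * n_missing / (n_rows * n_cols)

# Assumption: the categorical share is taken over the feature columns only,
# i.e. all columns except the target RainTomorrow.
n_features = n_cols - 1
categorical_pct = 100 * n_categorical / n_features

# The 800/200 split corresponds to holding out this fraction of the rows.
test_fraction = 200 / n_rows

print(f"Missing values: {missing_pct:.1f}%")
print(f"Categorical features: {categorical_pct:.1f}%")
print(f"Test fraction: {test_fraction:.0%}")
```

Under these assumptions the numbers round to the 10.1% and 23.8% shown in the summary, with a 20% test set.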



In [3]:
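# Impute missing values: median for numerical features, most frequent value for categorical ones
atom.impute(strat_num="median", strat_cat="most_frequent")
# Encode categorical features: one-hot for features with up to 8 unique values, target encoding otherwise
atom.encode(strategy="Target", max_onehot=8)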

Fitting Imputer...
Imputing missing values...
 --> Imputing 5 missing values with median (11.7) in feature MinTemp.
 --> Imputing 2 missing values with median (22.25) in feature MaxTemp.
 --> Imputing 12 missing values with median (0.0) in feature Rainfall.
 --> Imputing 417 missing values with median (4.2) in feature Evaporation.
 --> Imputing 469 missing values with median (8.4) in feature Sunshine.
 --> Imputing 68 missing values with most_frequent (W) in feature WindGustDir.
 --> Imputing 68 missing values with median (37.0) in feature WindGustSpeed.
 --> Imputing 64 missing values with most_frequent (N) in feature WindDir9am.
 --> Imputing 32 missing values with most_frequent (SE) in feature WindDir3pm.
 --> Imputing 13 missing values with median (13.0) in feature WindSpeed9am.
 --> Imputing 23 missing values with median (19.0) in feature WindSpeed3pm.
 --> Imputing 17 missing values with median (69.0) in feature Humidity9am.
 --> Imputing 28 missing values with median (52.0) in f
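
To make the two imputation strategies in the log concrete, here is a minimal pure-Python sketch of what "median" and "most_frequent" filling mean. It is an illustration only and not ATOM's implementation; the helper names `impute_median` and `impute_most_frequent` and the toy values are hypothetical.

```python
from collections import Counter
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def impute_most_frequent(values):
    """Replace None entries with the most common observed value."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Toy columns (made-up values, not taken from the weather dataset).
min_temp = [9.5, None, 11.7, 14.2, None]
wind_dir = ["W", "N", None, "W", "SE"]

filled_temp = impute_median(min_temp)
filled_dir = impute_most_frequent(wind_dir)
print(filled_temp)
print(filled_dir)
```

In a real pipeline the fill values are usually learned from the training set only and then reused on the test set, which this toy version does not model.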

In [4]:
# Tune (10 trials each) and train LDA and AdaBoost, optimizing ROC AUC
atom.run(models=["LDA", "AdaB"], metric="auc", n_trials=10)


Models: LDA, AdaB
Metric: roc_auc


Running hyperparameter tuning for LinearDiscriminantAnalysis...
| trial |  solver | shrinkage | roc_auc | best_roc_auc | time_trial | time_ht |    state |
| ----- | ------- | --------- | ------- | ------------ | ---------- | ------- | -------- |
| 0     |   eigen |      auto |  0.8686 |       0.8686 |     1.863s |  1.863s | COMPLETE |
| 1     |    lsqr |       0.8 |  0.8607 |       0.8686 |     1.575s |  3.439s | COMPLETE |
| 2     |   eigen |      auto |  0.8686 |       0.8686 |     0.003s |  3.442s | COMPLETE |
| 3     |     svd |       --- |  0.8428 |       0.8686 |     1.494s |  4.936s | COMPLETE |
| 4     |    lsqr |       0.5 |  0.7998 |       0.8686 |     1.475s |  6.411s | COMPLETE |
| 5     |     svd |       --- |  0.8428 |       0.8686 |     0.000s |  6.411s | COMPLETE |
| 6     |    lsqr |      auto |  0.8147 |       0.8686 |     1.528s |  7.939s | COMPLETE |
| 7     |     svd |       --- |  0.8428 |       0.8686 |     0.000s |  7.939s | 
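
The `roc_auc` column in the trial table is the area under the ROC curve. One way to read that number is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way on toy data; it is a didactic version with a hypothetical helper name, not the routine ATOM or scikit-learn uses internally.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (made up for illustration).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, y_score)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes. The pairwise loop is quadratic in the number of samples, so it is only suitable for small examples.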

In [5]:
atom.evaluate()

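|      | accuracy | average_precision | balanced_accuracy |     f1 | jaccard | matthews_corrcoef | precision | recall | roc_auc |
| ---- | -------- | ----------------- | ----------------- | ------ | ------- | ----------------- | --------- | ------ | ------- |
| LDA  |    0.855 |             0.761 |            0.7431 | 0.6329 |   0.463 |            0.5623 |    0.7812 | 0.5319 |  0.9118 |
| AdaB |     0.85 |             0.731 |            0.6956 | 0.5588 |  0.3878 |            0.5411 |    0.9048 | 0.4043 |  0.8754 |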
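
Several of the reported metrics are tied together by simple identities, which gives a quick sanity check on the results. For a binary problem, F1 is the harmonic mean of precision and recall, and the Jaccard index follows from F1 as J = F1 / (2 - F1). The sketch below recomputes both from the precision and recall values reported above; the helper names are hypothetical, and small differences are expected because the inputs are already rounded to four decimals.

```python
def f1_from_pr(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def jaccard_from_f1(f1):
    """Jaccard index of the positive class, derived from its F1 score."""
    return f1 / (2 - f1)


# Values copied from the evaluation results above.
reported = {
    "LDA": {"precision": 0.7812, "recall": 0.5319, "f1": 0.6329, "jaccard": 0.463},
    "AdaB": {"precision": 0.9048, "recall": 0.4043, "f1": 0.5588, "jaccard": 0.3878},
}

recomputed = {}
for name, m in reported.items():
    f1 = f1_from_pr(m["precision"], m["recall"])
    jac = jaccard_from_f1(f1)
    recomputed[name] = {"f1": f1, "jaccard": jac}
    print(f"{name}: f1={f1:.4f} (reported {m['f1']}), "
          f"jaccard={jac:.4f} (reported {m['jaccard']})")
```

The recomputed values agree with the reported ones to within rounding. They also show why AdaB scores lower on F1 despite its higher precision: its recall is noticeably lower than that of LDA.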
