Getting started

Installation

Install ATOM's newest release easily via pip:

pip install -U atom-ml

or via conda:

conda install -c conda-forge atom-ml

Note

Since atom was already taken, download the package under the name atom-ml!

Warning

ATOM makes use of many other ML libraries, making its dependency list quite long. Because of that, the installation may take longer than you are accustomed to. Be patient!

Optional dependencies

Some specific models, utility methods or plots require the installation of additional libraries. To install the optional dependencies, add [full] after the package's name.

pip install -U atom-ml[full]

Latest source

Sometimes, new features and bug fixes are already implemented in the development branch, but waiting for the next release to be made available. If you can't wait for that, it's possible to install the package directly from git.

pip install git+https://github.com/tvdboom/ATOM.git@development#egg=atom-ml

Don't forget to include #egg=atom-ml to explicitly name the project, this way pip can track metadata for it without having to have run the setup.py script.

Contributing

If you are planning to contribute to the project, you'll need the development dependencies. Install them adding [dev] after the package's name.

pip install -U atom-ml[dev]

Click here for a complete list of package files for all versions published on PyPI.

Usage

ATOM contains a variety of classes and functions to perform data cleaning, feature engineering, model training, plotting and much more. The easiest way to use everything ATOM has to offer is through one of the main classes:

ATOMClassifier for classification tasks.
ATOMForecaster for forecasting tasks.
ATOMRegressor for regression tasks.

Let's walk you through an example. Click on the SageMaker Studio Lab badge on top of this section to run this example yourself.

Make the necessary imports and load the data.

>>> import pandas as pd
>>> from atom import ATOMClassifier

>>> # Load the Australian Weather dataset
>>> X = pd.read_csv("./examples/datasets/weatherAUS.csv", nrows=100)
>>> print(X.head())

           Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine WindGustDir  WindGustSpeed WindDir9am WindDir3pm  WindSpeed9am  WindSpeed3pm  Humidity9am  Humidity3pm  Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm RainToday  RainTomorrow
0  MelbourneAirport     18.0     26.9      21.4          7.0       8.9         SSE           41.0          W        SSE           9.0          20.0         95.0         54.0       1019.5       1017.0       8.0       5.0     18.5     26.0       Yes             0
1          Adelaide     17.2     23.4       0.0          NaN       NaN           S           41.0          S        WSW          13.0          19.0         59.0         36.0       1015.7       1015.7       NaN       NaN     17.7     21.9        No             0
2            Cairns     18.6     24.6       7.4          3.0       6.1         SSE           54.0        SSE         SE          26.0          35.0         78.0         57.0       1018.7       1016.6       3.0       3.0     20.8     24.1       Yes             0
3          Portland     13.6     16.8       4.2          1.2       0.0         ESE           39.0        ESE        ESE          17.0          15.0         76.0         74.0       1021.4       1020.5       7.0       8.0     15.6     16.0       Yes             1
4           Walpole     16.4     19.9       0.0          NaN       NaN          SE           44.0         SE         SE          19.0          30.0         78.0         70.0       1019.4       1018.9       NaN       NaN     17.4     18.1        No             0

Initialize the ATOMClassifier or ATOMRegressor class. These two classes are convenient wrappers for the whole machine learning pipeline. Contrary to sklearn's API, they are initialized providing the data you want to manipulate.

>>> atom = ATOMClassifier(X, y="RainTomorrow", verbose=2)

<< ================== ATOM ================== >>

Configuration ==================== >>
Algorithm task: Binary classification.

Dataset stats ==================== >>
Shape: (100, 22)
Train set size: 80
Test set size: 20
-------------------------------------
Memory: 17.73 kB
Scaled: False
Missing values: 193 (8.8%)
Categorical features: 5 (23.8%)
Outlier values: 1 (0.1%)

Data transformations are applied through atom's methods. For example, calling the impute method will initialize an Imputer instance, fit it on the training set and transform the whole dataset. The transformations are applied immediately after calling the method (no fit and transform commands necessary).

>>> atom.impute(strat_num="median", strat_cat="most_frequent")

Fitting Imputer...
Imputing missing values...
 --> Imputing 1 missing values with median (0.0) in column Rainfall.
 --> Imputing 36 missing values with median (4.8) in column Evaporation.
 --> Imputing 38 missing values with median (8.5) in column Sunshine.
 --> Imputing 8 missing values with most_frequent (SSE) in column WindGustDir.
 --> Imputing 8 missing values with median (41.0) in column WindGustSpeed.
 --> Imputing 7 missing values with most_frequent (ESE) in column WindDir9am.
 --> Imputing 2 missing values with median (13.0) in column WindSpeed9am.
 --> Imputing 1 missing values with median (74.0) in column Humidity9am.
 --> Imputing 6 missing values with median (1017.05) in column Pressure9am.
 --> Imputing 6 missing values with median (1015.25) in column Pressure3pm.
 --> Imputing 38 missing values with median (5.0) in column Cloud9am.
 --> Imputing 40 missing values with median (5.0) in column Cloud3pm.
 --> Imputing 1 missing values with median (17.4) in column Temp9am.
 --> Imputing 1 missing values with most_frequent (No) in column RainToday.
>>> atom.encode(strategy="Target", max_onehot=8)

Fitting Encoder...
Encoding categorical columns...
 --> Target-encoding feature Location. Contains 42 classes.
   --> Handling 2 unknown classes.
 --> Target-encoding feature WindGustDir. Contains 16 classes.
 --> Target-encoding feature WindDir9am. Contains 16 classes.
 --> Target-encoding feature WindDir3pm. Contains 16 classes.
 --> Ordinal-encoding feature RainToday. Contains 2 classes.

Similarly, models are trained and evaluated using the run method. Here, we fit both a LogisticRegression and LinearDiscriminantAnalysis model, and apply hyperparameter tuning.

>>> atom.run(models=["LR", "LDA"], metric="auc", n_trials=6)


Training ========================= >>
Models: LR, LDA
Metric: auc


Running hyperparameter tuning for LogisticRegression...
| trial | penalty |       C |  solver | max_iter | l1_ratio |     auc | best_auc | time_trial | time_ht |    state |
| ----- | ------- | ------- | ------- | -------- | -------- | ------- | -------- | ---------- | ------- | -------- |
| 0     |      l2 |  0.0764 |    saga |      630 |      0.0 |  0.4792 |   0.4792 |     0.109s |  0.109s | COMPLETE |
| 1     |      l2 |  9.4455 |    saga |      480 |      0.5 |   0.375 |   0.4792 |     0.113s |  0.222s | COMPLETE |
| 2     |      l2 |  0.4188 |     sag |      950 |      0.9 |  0.7708 |   0.7708 |     0.107s |  0.329s | COMPLETE |
| 3     |    None |  0.0091 |    saga |      400 |      0.5 |  0.6875 |   0.7708 |     0.113s |  0.442s | COMPLETE |
| 4     |      l2 |  0.4612 |   lbfgs |      830 |      0.0 |     0.5 |   0.7708 |     0.108s |  0.551s | COMPLETE |
| 5     |      l2 | 10.3363 | newto.. |      980 |      0.7 |  0.6042 |   0.7708 |     0.110s |  0.661s | COMPLETE |
Hyperparameter tuning ---------------------------
Best trial --> 2
Best parameters:
 --> penalty: l2
 --> C: 0.4188
 --> solver: sag
 --> max_iter: 950
 --> l1_ratio: 0.9
Best evaluation --> auc: 0.7708
Time elapsed: 0.661s
Fit ---------------------------------------------
Train evaluation --> auc: 0.9957
Test evaluation --> auc: 0.6267
Time elapsed: 0.165s
-------------------------------------------------
Time: 0.826s


Running hyperparameter tuning for LinearDiscriminantAnalysis...
| trial |  solver | shrinkage |     auc | best_auc | time_trial | time_ht |    state |
| ----- | ------- | --------- | ------- | -------- | ---------- | ------- | -------- |
| 0     |   eigen |       0.5 |  0.8125 |   0.8125 |     0.090s |  0.090s | COMPLETE |
| 1     |   eigen |      auto |  0.2708 |   0.8125 |     0.088s |  0.178s | COMPLETE |
| 2     |    lsqr |       0.7 |  0.8333 |   0.8333 |     0.083s |  0.261s | COMPLETE |
| 3     |   eigen |      None |  0.5625 |   0.8333 |     0.083s |  0.344s | COMPLETE |
| 4     |     svd |      None |  0.7917 |   0.8333 |     0.083s |  0.427s | COMPLETE |
| 5     |    lsqr |       0.9 |  0.7083 |   0.8333 |     0.084s |  0.511s | COMPLETE |
Hyperparameter tuning ---------------------------
Best trial --> 2
Best parameters:
 --> solver: lsqr
 --> shrinkage: 0.7
Best evaluation --> auc: 0.8333
Time elapsed: 0.511s
Fit ---------------------------------------------
Train evaluation --> auc: 0.8654
Test evaluation --> auc: 0.76
Time elapsed: 0.020s
-------------------------------------------------
Time: 0.531s


Final results ==================== >>
Total time: 1.391s
-------------------------------------
LogisticRegression         --> auc: 0.6267 ~
LinearDiscriminantAnalysis --> auc: 0.76 !

And lastly, analyze the results.

>>> atom.results

	auc_ht	time_ht	auc_train	auc_test	time_fit	time
LR	0.895833	0.654592	0.963800	0.520000	0.166152	0.820744
LDA	0.958333	0.423385	0.850700	0.800000	0.020018	0.443403

>>> atom.plot_roc()