Example: Utilities

This example shows various useful utilities that can be used to improve atom's pipelines.

The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.

Load the data

In [1]:

# Import packages
import os
import tempfile
import pandas as pd
from sklearn.metrics import fbeta_score
from atom import ATOMClassifier

In [2]:

# Load data
X = pd.read_csv("docs_source/examples/datasets/weatherAUS.csv")

# Let's have a look
X.head()

Out[2]:

	Location	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	WindDir3pm	...	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday	RainTomorrow
0	MelbourneAirport	18.0	26.9	21.4	7.0	8.9	SSE	41.0	W	SSE	...	95.0	54.0	1019.5	1017.0	8.0	5.0	18.5	26.0	Yes	0
1	Adelaide	17.2	23.4	0.0	NaN	NaN	S	41.0	S	WSW	...	59.0	36.0	1015.7	1015.7	NaN	NaN	17.7	21.9	No	0
2	Cairns	18.6	24.6	7.4	3.0	6.1	SSE	54.0	SSE	SE	...	78.0	57.0	1018.7	1016.6	3.0	3.0	20.8	24.1	Yes	0
3	Portland	13.6	16.8	4.2	1.2	0.0	ESE	39.0	ESE	ESE	...	76.0	74.0	1021.4	1020.5	7.0	8.0	15.6	16.0	Yes	1
4	Walpole	16.4	19.9	0.0	NaN	NaN	SE	44.0	SE	SE	...	78.0	70.0	1019.4	1018.9	NaN	NaN	17.4	18.1	No	0

5 rows × 22 columns

Use the utility attributes

In [3]:

atom = ATOMClassifier(X, random_state=1)
atom.clean()

# Quickly check what columns have missing values
print(f"Columns with missing values:\n{atom.nans}")

# Or what columns are categorical
print(f"\nCategorical columns: {atom.categorical}")

# Or if the dataset is scaled
print(f"\nIs the dataset scaled? {atom.scaled}")

Columns with missing values:
Location             0
MinTemp            637
MaxTemp            322
Rainfall          1406
Evaporation      60843
Sunshine         67816
WindGustDir       9330
WindGustSpeed     9270
WindDir9am       10013
WindDir3pm        3778
WindSpeed9am      1348
WindSpeed3pm      2630
Humidity9am       1774
Humidity3pm       3610
Pressure9am      14014
Pressure3pm      13981
Cloud9am         53657
Cloud3pm         57094
Temp9am            904
Temp3pm           2726
RainToday         1406
RainTomorrow         0
dtype: int64

Categorical columns: Index(['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday'], dtype='object')

Is the dataset scaled? False
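
For readers without atom at hand, roughly the same checks can be sketched in plain pandas. This is only an illustration on a made-up three-row frame, not atom's implementation; the `toy` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics a few of the weather columns
toy = pd.DataFrame(
    {
        "Location": ["Cairns", "Adelaide", "Portland"],
        "MinTemp": [18.6, np.nan, 13.6],
        "RainToday": ["Yes", "No", None],
    }
)

# Number of missing values per column (comparable to the atom.nans output)
nans = toy.isna().sum()

# Non-numeric columns (comparable to atom.categorical)
categorical = toy.select_dtypes(exclude="number").columns

print(nans)
print(list(categorical))
```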

Use the stats method to assess changes in the dataset

In [4]:

# Note the number of missing values and categorical columns
atom.stats()

Dataset stats ==================== >>
Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 27.44 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)
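
The percentages in this report can be reproduced by hand. The sketch below assumes (this is inferred from the numbers, not stated in the source) that missing values are expressed relative to all cells, duplicates relative to all rows, and categorical features relative to the 21 non-target columns.

```python
# Figures copied from the stats report above
n_rows, n_cols = 142193, 22
train_size, test_size = 113755, 28438

# Assumed denominators: all cells, all feature columns, all rows
missing_pct = 100 * 316559 / (n_rows * n_cols)
categorical_pct = 100 * 5 / (n_cols - 1)
duplicates_pct = 100 * 45 / n_rows

print(round(missing_pct, 1), round(categorical_pct, 1), round(duplicates_pct, 1))
print(train_size + test_size == n_rows, round(test_size / n_rows, 2))
```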

In [5]:

# Now, let's impute and encode the dataset...
atom.impute()
atom.encode()

# ... and the missing values and categorical columns are gone
atom.stats()

Dataset stats ==================== >>
Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 25.74 MB
Scaled: False
Outlier values: 8731 (0.3%)
Duplicates: 45 (0.0%)

Inspect feature distributions

In [6]:

# Compare the relationship of multiple columns with a scatter matrix
atom.plot_relationships(columns=slice(0, 3))

In [7]:

# Check which distribution fits a column best
atom.distributions(columns=slice(0, 3))

Out[7]:

		Location	MinTemp	MaxTemp
dist	stat
beta	score	0.0711	0.0165	0.0276
beta	p_value	0.0	0.0	0.0
expon	score	0.3377	0.354	0.4194
expon	p_value	0.0	0.0	0.0
gamma	score	0.0784	0.02	0.0272
gamma	p_value	0.0	0.0	0.0
invgauss	score	0.0952	0.0409	0.2407
invgauss	p_value	0.0	0.0	0.0
lognorm	score	0.0724	0.0201	0.0272
lognorm	p_value	0.0	0.0	0.0
norm	score	0.0724	0.0198	0.0406
norm	p_value	0.0	0.0	0.0
pearson3	score	0.0647	0.02	0.0272
pearson3	p_value	0.0	0.0	0.0
triang	score	0.0888	0.0841	0.2514
triang	p_value	0.0	0.0	0.0
uniform	score	0.1933	0.2127	0.2859
uniform	p_value	0.0	0.0	0.0
weibull_min	score	0.0695	0.0197	0.0437
weibull_min	p_value	0.0	0.0	0.0
weibull_max	score	0.0753	0.0146	0.0245
weibull_max	p_value	0.0	0.0	0.0
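
To get a feel for how such a table could be produced, here is a minimal SciPy sketch. It assumes that the score is a Kolmogorov-Smirnov statistic (lower means a closer fit), which is an assumption about the method rather than something stated here. The synthetic `sample` and the `ks_fit` helper are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a numeric column (hypothetical data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=12.0, scale=6.0, size=2000)


def ks_fit(data, dist_name):
    """Fit a SciPy distribution and return the KS statistic and p-value."""
    dist = getattr(stats, dist_name)
    params = dist.fit(data)  # maximum-likelihood estimate of the parameters
    result = stats.kstest(data, dist_name, args=params)
    return result.statistic, result.pvalue


scores = {name: ks_fit(sample, name) for name in ("norm", "expon", "uniform")}
for name, (stat, p_value) in scores.items():
    print(f"{name}: score={stat:.4f}, p_value={p_value:.4f}")
```

On normally distributed data, the `norm` fit should receive the lowest score of the three candidates.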

In [8]:

# Investigate a column's distribution
atom.plot_distribution(columns="MinTemp", distributions="beta")
atom.plot_qq(columns="MinTemp", distributions="beta")

Change the data mid-pipeline

There are two ways to quickly transform the dataset mid-pipeline. The first is through the property's @setter. The downside of this approach is that the transformation is not stored in atom's pipeline, so it is not applied to new data. Therefore, we recommend the second approach, the apply method, which stores the function as a step in the pipeline.

In [9]:

# Note that we can only replace a dataframe with a new dataframe!
atom.X = atom.X.assign(AvgTemp=(atom.X["MaxTemp"] + atom.X["MinTemp"])/2)

# This will automatically update all other data attributes
assert "AvgTemp" in atom

# But it's not saved to atom's pipeline
atom.pipeline

Out[9]:

Pipeline(memory=Memory(location=None),
         steps=[('cleaner',
                 Cleaner(engine={'data': 'pandas', 'estimator': 'sklearn'})),
                ('imputer',
                 Imputer(engine={'data': 'pandas', 'estimator': 'sklearn'}, random_state=1)),
                ('encoder', Encoder())],
         verbose=False)

In [10]:

# Same transformation, different approach (AvgTemp is overwritten)
def transform(df):
    df["AvgTemp"] = (df.MaxTemp + df.MinTemp) / 2
    return df

atom.apply(transform)

assert "AvgTemp" in atom

In [11]:

# Now the function appears in the pipeline
atom.pipeline

Out[11]:

Pipeline(memory=Memory(location=None),
         steps=[('cleaner',
                 Cleaner(engine={'data': 'pandas', 'estimator': 'sklearn'})),
                ('imputer',
                 Imputer(engine={'data': 'pandas', 'estimator': 'sklearn'}, random_state=1)),
                ('encoder', Encoder()),
                ('functiontransformer',
                 FunctionTransformer(func=<function transform at 0x00000225D33CC4A0>))],
         verbose=False)
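
The pipeline above shows the function wrapped in a FunctionTransformer step. The practical benefit of storing it there can be sketched with scikit-learn alone: once the function is a pipeline step, it is replayed on any new data. The frames and the `add_avg_temp` helper below are hypothetical, and this is not atom's own code path.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_avg_temp(df):
    """Return a copy of df with an AvgTemp column added."""
    df = df.copy()
    df["AvgTemp"] = (df["MaxTemp"] + df["MinTemp"]) / 2
    return df


# Hypothetical training data and unseen data
train = pd.DataFrame({"MinTemp": [10.0, 12.0], "MaxTemp": [20.0, 18.0]})
new = pd.DataFrame({"MinTemp": [5.0], "MaxTemp": [15.0]})

pipe = Pipeline([("avg_temp", FunctionTransformer(add_avg_temp))])
pipe.fit(train)

# The stored step is applied to data the pipeline has never seen
new_out = pipe.transform(new)
print(new_out)
```

A column assigned directly on a dataframe, by contrast, exists only in that dataframe and has to be recreated manually for every new dataset.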

Get an overview of the available models

In [12]:

atom.available_models()

Out[12]:

	acronym	fullname	estimator	module	handles_missing	needs_scaling	accepts_sparse	native_multilabel	native_multioutput	validation	supports_engines
0	AdaB	AdaBoost	AdaBoostClassifier	sklearn.ensemble._weight_boosting	False	False	True	False	False	None	sklearn
1	Bag	Bagging	BaggingClassifier	sklearn.ensemble._bagging	True	False	True	False	False	None	sklearn
2	BNB	BernoulliNB	BernoulliNB	sklearn.naive_bayes	False	False	True	False	False	None	sklearn, cuml
3	CatB	CatBoost	CatBoostClassifier	catboost.core	True	True	True	False	False	n_estimators	catboost
4	CatNB	CategoricalNB	CategoricalNB	sklearn.naive_bayes	False	False	True	False	False	None	sklearn, cuml
5	CNB	ComplementNB	ComplementNB	sklearn.naive_bayes	False	False	True	False	False	None	sklearn, cuml
6	Tree	DecisionTree	DecisionTreeClassifier	sklearn.tree._classes	True	False	True	True	True	None	sklearn
7	Dummy	Dummy	DummyClassifier	sklearn.dummy	False	False	False	False	False	None	sklearn
8	ETree	ExtraTree	ExtraTreeClassifier	sklearn.tree._classes	False	False	True	True	True	None	sklearn
9	ET	ExtraTrees	ExtraTreesClassifier	sklearn.ensemble._forest	False	False	True	True	True	None	sklearn
10	GNB	GaussianNB	GaussianNB	sklearn.naive_bayes	False	False	False	False	False	None	sklearn, cuml
11	GP	GaussianProcess	GaussianProcessClassifier	sklearn.gaussian_process._gpc	False	False	False	False	False	None	sklearn
12	GBM	GradientBoostingMachine	GradientBoostingClassifier	sklearn.ensemble._gb	False	False	True	False	False	None	sklearn
13	hGBM	HistGradientBoosting	HistGradientBoostingClassifier	sklearn.ensemble._hist_gradient_boosting.gradi...	True	False	False	False	False	None	sklearn
14	KNN	KNearestNeighbors	KNeighborsClassifier	sklearn.neighbors._classification	False	True	True	True	True	None	sklearn, sklearnex, cuml
15	LGB	LightGBM	LGBMClassifier	lightgbm.sklearn	True	True	True	False	False	n_estimators	lightgbm
16	LDA	LinearDiscriminantAnalysis	LinearDiscriminantAnalysis	sklearn.discriminant_analysis	False	False	False	False	False	None	sklearn
17	lSVM	LinearSVM	LinearSVC	sklearn.svm._classes	False	True	True	False	False	None	sklearn, cuml
18	LR	LogisticRegression	LogisticRegression	sklearn.linear_model._logistic	False	True	True	False	False	None	sklearn, sklearnex, cuml
19	MLP	MultiLayerPerceptron	MLPClassifier	sklearn.neural_network._multilayer_perceptron	False	True	True	True	False	max_iter	sklearn
20	MNB	MultinomialNB	MultinomialNB	sklearn.naive_bayes	False	False	True	False	False	None	sklearn, cuml
21	PA	PassiveAggressive	PassiveAggressiveClassifier	sklearn.linear_model._passive_aggressive	False	True	True	False	False	max_iter	sklearn
22	Perc	Perceptron	Perceptron	sklearn.linear_model._perceptron	False	True	False	False	False	max_iter	sklearn
23	QDA	QuadraticDiscriminantAnalysis	QuadraticDiscriminantAnalysis	sklearn.discriminant_analysis	False	False	False	False	False	None	sklearn
24	RNN	RadiusNearestNeighbors	RadiusNeighborsClassifier	sklearn.neighbors._classification	False	True	True	True	True	None	sklearn
25	RF	RandomForest	RandomForestClassifier	sklearn.ensemble._forest	False	False	True	True	True	None	sklearn, sklearnex, cuml
26	Ridge	Ridge	RidgeClassifier	sklearn.linear_model._ridge	False	True	True	True	False	None	sklearn, sklearnex, cuml
27	SGD	StochasticGradientDescent	SGDClassifier	sklearn.linear_model._stochastic_gradient	False	True	True	False	False	max_iter	sklearn
28	SVM	SupportVectorMachine	SVC	sklearn.svm._classes	False	True	True	False	False	None	sklearn, sklearnex, cuml
29	XGB	XGBoost	XGBClassifier	xgboost.sklearn	True	True	True	False	False	n_estimators	xgboost

Use a custom metric

In [13]:

atom.verbose = 1

# Define a custom metric
def f2(y_true, y_pred):
    return fbeta_score(y_true, y_pred, beta=2)

# Use the greater_is_better, needs_proba and needs_threshold parameters if necessary
atom.run(models="LR", metric=f2)

Training ========================= >>
Models: LR
Metric: f2


Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f2: 0.5277
Test evaluation --> f2: 0.5266
Time elapsed: 4.639s
-------------------------------------------------
Time: 4.639s


Final results ==================== >>
Total time: 4.749s
-------------------------------------
LogisticRegression --> f2: 0.5266
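
A plain metric function such as `f2` can be turned into a scorer with scikit-learn's `make_scorer`. The sketch below shows this on toy data; whether atom wraps the function in exactly this way is an assumption, and the toy arrays and the constant classifier are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import fbeta_score, make_scorer


def f2(y_true, y_pred):
    return fbeta_score(y_true, y_pred, beta=2)


# greater_is_better=True marks the metric as a score rather than a loss
f2_scorer = make_scorer(f2, greater_is_better=True)

# Hypothetical toy data and a classifier that always predicts class 1
X_toy = np.zeros((5, 1))
y_toy = np.array([1, 1, 1, 0, 0])
clf = DummyClassifier(strategy="constant", constant=1).fit(X_toy, y_toy)

# Precision is 3/5 and recall is 1, so F2 = 5 * 0.6 / (4 * 0.6 + 1)
score = f2_scorer(clf, X_toy, y_toy)
print(round(score, 4))
```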

Customize the estimator's parameters

In [14]:

# You can use the est_params parameter to customize the estimator
# Let's run AdaBoost using LR instead of a decision tree as base estimator
atom.run("AdaB", est_params={"estimator": atom.lr.estimator})

Training ========================= >>
Models: AdaB
Metric: f2


Results for AdaBoost:
Fit ---------------------------------------------
Train evaluation --> f2: 0.5129
Test evaluation --> f2: 0.5172
Time elapsed: 01m:21s
-------------------------------------------------
Time: 01m:21s


Final results ==================== >>
Total time: 01m:22s
-------------------------------------
AdaBoost --> f2: 0.5172

In [15]:

atom.adab.estimator

Out[15]:

AdaBoostClassifier(estimator=LogisticRegression(n_jobs=1, random_state=1),
                   random_state=1)

In [16]:

# Note that parameters specified by est_params are not optimized during hyperparameter tuning
atom.run(
    models="Tree",
    n_trials=10,
    est_params={
        "criterion": "gini",
        "splitter": "best",
        "min_samples_leaf": 1,
        "ccp_alpha": 0.035,
    },
    verbose=2,
)

Training ========================= >>
Models: Tree
Metric: f2


Running hyperparameter tuning for DecisionTree...
| trial | max_depth | min_samples_split | max_features |      f2 | best_f2 | time_trial | time_ht |    state |
| ----- | --------- | ----------------- | ------------ | ------- | ------- | ---------- | ------- | -------- |
| 0     |        13 |                12 |          0.5 |  0.4488 |  0.4488 |     9.644s |  9.644s | COMPLETE |
| 1     |        14 |                16 |         log2 |  0.4762 |  0.4762 |     5.440s | 15.084s | COMPLETE |
| 2     |        16 |                13 |          0.8 |  0.4798 |  0.4798 |     6.380s | 21.464s | COMPLETE |
| 3     |         9 |                 6 |         None |  0.5182 |  0.5182 |     6.861s | 28.325s | COMPLETE |
| 4     |         5 |                 2 |         log2 |  0.4859 |  0.5182 |     5.553s | 33.878s | COMPLETE |
| 5     |         1 |                15 |          0.5 |  0.4918 |  0.5182 |     5.312s | 39.190s | COMPLETE |
| 6     |        15 |                 9 |         sqrt |  0.4647 |  0.5182 |     5.292s | 44.482s | COMPLETE |
| 7     |        13 |                20 |         None |   0.496 |  0.5182 |     6.539s | 51.022s | COMPLETE |
| 8     |         3 |                19 |          0.5 |  0.5068 |  0.5182 |     5.410s | 56.432s | COMPLETE |
| 9     |        15 |                20 |         sqrt |  0.4734 |  0.5182 |     6.054s | 01m:02s | COMPLETE |
Hyperparameter tuning ---------------------------
Best trial --> 3
Best parameters:
 --> max_depth: 9
 --> min_samples_split: 6
 --> max_features: None
Best evaluation --> f2: 0.5182
Time elapsed: 01m:02s
Fit ---------------------------------------------
Train evaluation --> f2: 0.4803
Test evaluation --> f2: 0.4851
Time elapsed: 2.813s
-------------------------------------------------
Time: 01m:05s


Final results ==================== >>
Total time: 01m:05s
-------------------------------------
DecisionTree --> f2: 0.4851
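
The trial table above only varies max_depth, min_samples_split and max_features, because the other parameters were pinned through est_params. The idea of fixing some parameters while searching over the rest can be sketched with scikit-learn. Note that this sketch uses a plain random search rather than atom's tuner, and the search ranges and synthetic data are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import ParameterSampler, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=1)

# Held constant in every trial, like est_params
fixed = {
    "criterion": "gini",
    "splitter": "best",
    "min_samples_leaf": 1,
    "ccp_alpha": 0.035,
}

# Only these parameters are searched (hypothetical ranges)
space = {
    "max_depth": list(range(1, 17)),
    "min_samples_split": list(range(2, 21)),
    "max_features": [None, "sqrt", "log2", 0.5, 0.8],
}

f2_scorer = make_scorer(fbeta_score, beta=2)

trials = []
for sampled in ParameterSampler(space, n_iter=10, random_state=1):
    params = {**sampled, **fixed}  # fixed values are never overwritten
    tree = DecisionTreeClassifier(random_state=1, **params)
    score = cross_val_score(tree, X_toy, y_toy, cv=3, scoring=f2_scorer).mean()
    trials.append((score, params))

best_score, best_params = max(trials, key=lambda trial: trial[0])
print(best_params)
```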

Save & load

Note that to merge two instances, both need to be initialized with the same data and use the same metric for model training.

In [17]:

tempdir = tempfile.gettempdir()

In [18]:

# Save the atom instance as a pickle
# Use save_data=False to save the instance without the data
atom.save(os.path.join(tempdir, "atom"), save_data=False)

ATOMClassifier successfully saved.

In [19]:

# No need to store the transformed data, providing the original dataset to
# the loader automatically transforms it through all the steps in the pipeline
atom_2 = ATOMClassifier.load(os.path.join(tempdir, "atom"), data=(X,))

ATOMClassifier successfully loaded.
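
The reason the transformed data does not have to be stored can be illustrated with a small scikit-learn pipeline and the standard pickle module: a fitted pipeline that is saved without data reproduces the same transformation when the raw data is supplied again. The pipeline steps, the `raw` frame and the file name below are hypothetical and do not reflect atom's internals.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with missing values
raw = pd.DataFrame(
    {"MinTemp": [18.0, np.nan, 13.6], "MaxTemp": [26.9, 23.4, np.nan]}
)

pipe = Pipeline(
    [("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)
expected = pipe.fit_transform(raw)

# Save only the fitted pipeline, not the transformed data
path = os.path.join(tempfile.gettempdir(), "toy_pipeline.pkl")
with open(path, "wb") as f:
    pickle.dump(pipe, f)

# Load it back and push the raw data through the stored steps
with open(path, "rb") as f:
    restored = pickle.load(f)
recomputed = restored.transform(raw)

os.remove(path)
print(np.allclose(expected, recomputed))
```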

In [20]:

# Create a separate instance with its own branch and model
atom_3 = ATOMClassifier(X, verbose=0, random_state=1)
atom_3.branch.name = "lightgbm"
atom_3.impute()
atom_3.encode()
atom_3.run("LGB", metric=f2)

In [21]:

# Merge the instances
atom_2.merge(atom_3)

Merging instances...
 --> Merging branch lightgbm.
 --> Merging model LGB.
 --> Merging attributes.

In [22]:

# Note that it now contains both branches and all models
atom_2

Out[22]:

ATOMClassifier
 --> Branches:
   --> main !
   --> lightgbm
 --> Models: LR, AdaB, Tree, LGB
 --> Metric: f2

In [23]:

atom_2.evaluate()

Out[23]:

	accuracy	ap	ba	f1	jaccard	mcc	precision	recall	auc
LR	0.842100	0.681200	0.718400	0.583900	0.412300	0.502900	0.713200	0.494300	0.863200
AdaB	0.841600	0.675200	0.714300	0.577700	0.406200	0.499100	0.717700	0.483500	0.859200
Tree	0.814500	0.404300	0.688700	0.526800	0.357600	0.421100	0.615100	0.460700	0.688700
LGB	0.858400	0.743000	0.747500	0.633800	0.463900	0.559600	0.753800	0.546700	0.888300
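
As a sanity check, the f2 test scores printed during training can be recovered from the precision and recall columns of this table with the F-beta formula, (1 + β²)·P·R / (β²·P + R). The sketch assumes the table is computed on the test set, which is consistent with the numbers. LGB is left out because its run was silent (verbose=0), so no f2 value was printed for it.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score computed from precision and recall."""
    b2 = beta**2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) from the table above, test f2 from the training logs
reported = {
    "LR": (0.7132, 0.4943, 0.5266),
    "AdaB": (0.7177, 0.4835, 0.5172),
    "Tree": (0.6151, 0.4607, 0.4851),
}

recomputed = {name: f_beta(p, r) for name, (p, r, _) in reported.items()}
for name, value in recomputed.items():
    print(f"{name}: recomputed={value:.4f}, reported={reported[name][2]}")
```

Small differences in the last digit come from rounding the precision and recall to four decimals.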