Utilities¶
This example shows various useful utilities that can be used to improve atom's pipelines.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow, training a binary classifier on the target RainTomorrow.
Load the data¶
# Import packages
import pandas as pd
from sklearn.metrics import fbeta_score
from atom import ATOMClassifier, ATOMLoader
# Load data
X = pd.read_csv("./datasets/weatherAUS.csv")
# Let's have a look
X.head()
| | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |
5 rows × 22 columns
Use the utility attributes¶
atom = ATOMClassifier(X, warnings=False, random_state=1)
atom.clean()
# Quickly check what columns have missing values
print(f"Columns with missing values:\n{atom.nans}")
# Or what columns are categorical
print(f"\nCategorical columns: {atom.categorical}")
# Or if the dataset is scaled
print(f"\nIs the dataset scaled? {atom.scaled}")
Columns with missing values:
MinTemp            637
MaxTemp            322
Rainfall          1406
Evaporation      60843
Sunshine         67816
WindGustDir       9330
WindGustSpeed     9270
WindDir9am       10013
WindDir3pm        3778
WindSpeed9am      1348
WindSpeed3pm      2630
Humidity9am       1774
Humidity3pm       3610
Pressure9am      14014
Pressure3pm      13981
Cloud9am         53657
Cloud3pm         57094
Temp9am            904
Temp3pm           2726
RainToday         1406
dtype: int64

Categorical columns: ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']

Is the dataset scaled? False
Use the stats method to assess changes in the dataset¶
# Note the number of missing values and categorical columns
atom.stats()
Dataset stats ==================== >>
Shape: (142193, 22)
Memory: 61.69 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicate samples: 45 (0.0%)
-------------------------------------
Train set size: 113755
Test set size: 28438
-------------------------------------
| | dataset | train | test |
| - | --------- | --------- | --------- |
| 0 | 0 (0.0) | 0 (0.0) | 0 (0.0) |
| 1 | 0 (0.0) | 0 (0.0) | 0 (0.0) |
# Now, let's impute and encode the dataset...
atom.impute()
atom.encode()
# ... and the missing values and categorical columns are gone
atom.stats()
Dataset stats ==================== >>
Shape: (56420, 22)
Memory: 9.93 MB
Scaled: False
Outlier values: 3210 (0.3%)
-------------------------------------
Train set size: 45039
Test set size: 11381
-------------------------------------
| | dataset | train | test |
| - | --------- | --------- | --------- |
| 0 | 0 (0.0) | 0 (0.0) | 0 (0.0) |
| 1 | 0 (0.0) | 0 (0.0) | 0 (0.0) |
Inspect feature distributions¶
# Compare the relationship of multiple columns with a scatter matrix
atom.plot_scatter_matrix(columns=slice(0, 5))
# Check which distribution fits a column best
atom.distribution(columns="Rainfall")
| dist | stat | Rainfall |
|---|---|---|
| beta | score | 0.6506 |
| | p_value | 0.0 |
| expon | score | 0.6506 |
| | p_value | 0.0 |
| gamma | score | 0.6465 |
| | p_value | 0.0 |
| invgauss | score | 0.4717 |
| | p_value | 0.0 |
| lognorm | score | 0.6485 |
| | p_value | 0.0 |
| norm | score | 0.3807 |
| | p_value | 0.0 |
| pearson3 | score | 0.6506 |
| | p_value | 0.0 |
| triang | score | 0.7191 |
| | p_value | 0.0 |
| uniform | score | 0.8914 |
| | p_value | 0.0 |
| weibull_min | score | 0.6506 |
| | p_value | 0.0 |
| weibull_max | score | 0.8896 |
| | p_value | 0.0 |
# Investigate a column's distribution
atom.plot_distribution(columns="MinTemp", distributions="beta")
atom.plot_qq(columns="MinTemp", distributions="beta")
Change the data mid-pipeline¶
There are two ways to quickly transform the dataset mid-pipeline. The first is through the property's @setter. The downside of this approach is that the transformation is not stored in atom's pipeline, so it is not applied to new data. Therefore, we recommend the second approach: adding the transformation to the pipeline, e.g. through the apply method used below.
# Note that we can only replace a dataframe with a new dataframe!
atom.X = atom.X.assign(AvgTemp=(atom.X["MaxTemp"] + atom.X["MinTemp"])/2)
# This will automatically update all other data attributes
assert "AvgTemp" in atom
# But it's not saved to atom's pipeline
atom.pipeline
0    Cleaner()
1    Imputer()
2    Encoder()
Name: master, dtype: object
# Same transformation, different approach (AvgTemp is overwritten)
atom.apply(lambda df: (df.MaxTemp + df.MinTemp)/2, columns="AvgTemp")
assert "AvgTemp" in atom
# Now the function appears in the pipeline
atom.pipeline
0                                          Cleaner()
1                                          Imputer()
2                                          Encoder()
3    FuncTransformer(func=<lambda>, columns=AvgTemp)
Name: master, dtype: object
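Because the function is now stored in the pipeline, it is also applied to new, unseen data. A minimal sketch, assuming atom's transform method accepts a dataframe with the same raw feature columns (X_new is just a hypothetical sample of the original data):
# Hypothetical sketch: pass new raw data through the fitted pipeline
# X_new stands in for unseen data with the same columns as X (minus the target)
X_new = X.drop(columns="RainTomorrow").sample(5, random_state=1)
X_new = atom.transform(X_new)
# The FuncTransformer in the pipeline creates AvgTemp for the new data as well
assert "AvgTemp" in X_new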
Get an overview of the available models¶
atom.available_models()
| | acronym | fullname | estimator | module | needs_scaling | accepts_sparse |
|---|---|---|---|---|---|---|
0 | Dummy | Dummy Estimator | DummyClassifier | sklearn.dummy | False | False |
1 | GP | Gaussian Process | GaussianProcessClassifier | sklearn.gaussian_process._gpc | False | False |
2 | GNB | Gaussian Naive Bayes | GaussianNB | sklearn.naive_bayes | False | False |
3 | MNB | Multinomial Naive Bayes | MultinomialNB | sklearn.naive_bayes | False | True |
4 | BNB | Bernoulli Naive Bayes | BernoulliNB | sklearn.naive_bayes | False | True |
5 | CatNB | Categorical Naive Bayes | CategoricalNB | sklearn.naive_bayes | False | True |
6 | CNB | Complement Naive Bayes | ComplementNB | sklearn.naive_bayes | False | True |
7 | Ridge | Ridge Estimator | RidgeClassifier | sklearn.linear_model._ridge | True | True |
8 | Perc | Perceptron | Perceptron | sklearn.linear_model._perceptron | True | False |
9 | LR | Logistic Regression | LogisticRegression | sklearn.linear_model._logistic | True | True |
10 | LDA | Linear Discriminant Analysis | LinearDiscriminantAnalysis | sklearn.discriminant_analysis | False | False |
11 | QDA | Quadratic Discriminant Analysis | QuadraticDiscriminantAnalysis | sklearn.discriminant_analysis | False | False |
12 | KNN | K-Nearest Neighbors | KNeighborsClassifier | sklearn.neighbors._classification | True | True |
13 | RNN | Radius Nearest Neighbors | RadiusNeighborsClassifier | sklearn.neighbors._classification | True | True |
14 | Tree | Decision Tree | DecisionTreeClassifier | sklearn.tree._classes | False | True |
15 | Bag | Bagging | BaggingClassifier | sklearn.ensemble._bagging | False | True |
16 | ET | Extra-Trees | ExtraTreesClassifier | sklearn.ensemble._forest | False | True |
17 | RF | Random Forest | RandomForestClassifier | sklearn.ensemble._forest | False | True |
18 | AdaB | AdaBoost | AdaBoostClassifier | sklearn.ensemble._weight_boosting | False | True |
19 | GBM | Gradient Boosting Machine | GradientBoostingClassifier | sklearn.ensemble._gb | False | True |
20 | hGBM | HistGBM | HistGradientBoostingClassifier | sklearn.ensemble._hist_gradient_boosting.gradi... | False | False |
21 | XGB | XGBoost | XGBClassifier | xgboost.sklearn | True | True |
22 | LGB | LightGBM | LGBMClassifier | lightgbm.sklearn | True | True |
23 | CatB | CatBoost | CatBoostClassifier | catboost.core | True | True |
24 | lSVM | Linear-SVM | LinearSVC | sklearn.svm._classes | True | True |
25 | kSVM | Kernel-SVM | SVC | sklearn.svm._classes | True | True |
26 | PA | Passive Aggressive | PassiveAggressiveClassifier | sklearn.linear_model._passive_aggressive | True | True |
27 | SGD | Stochastic Gradient Descent | SGDClassifier | sklearn.linear_model._stochastic_gradient | True | True |
28 | MLP | Multi-layer Perceptron | MLPClassifier | sklearn.neural_network._multilayer_perceptron | True | True |
Use a custom metric¶
atom.verbose = 1
# Define a custom metric
def f2(y_true, y_pred):
return fbeta_score(y_true, y_pred, beta=2)
# Use the greater_is_better, needs_proba and needs_threshold parameters if necessary
atom.run(models="LR", metric=f2)
Training ========================= >>
Models: LR
Metric: f2


Results for Logistic Regression:
Fit ---------------------------------------------
Train evaluation --> f2: 0.5685
Test evaluation --> f2: 0.5743
Time elapsed: 0.285s
-------------------------------------------------
Total time: 0.285s


Final results ==================== >>
Duration: 0.286s
-------------------------------------
Logistic Regression --> f2: 0.5743
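The greater_is_better, needs_proba and needs_threshold parameters mirror those of sklearn's make_scorer. As an alternative sketch (assuming run also accepts scorer objects as metric), the scorer could be built explicitly:
# Sketch: build the scorer with sklearn's make_scorer instead of a plain function
from sklearn.metrics import make_scorer

f2_scorer = make_scorer(fbeta_score, beta=2, greater_is_better=True)
# atom.run(models="LR", metric=f2_scorer)  # would be equivalent to the call above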
Customize the estimator's parameters¶
# You can use the est_params parameter to customize the estimator
# Let's run AdaBoost using LR instead of a decision tree as base estimator
atom.run("AdaB", est_params={"base_estimator": atom.lr.estimator})
Training ========================= >>
Models: AdaB
Metric: f2


Results for AdaBoost:
Fit ---------------------------------------------
Train evaluation --> f2: 0.5556
Test evaluation --> f2: 0.5642
Time elapsed: 2.468s
-------------------------------------------------
Total time: 2.468s


Final results ==================== >>
Duration: 2.468s
-------------------------------------
AdaBoost --> f2: 0.5642
atom.adab.estimator
AdaBoostClassifier(base_estimator=LogisticRegression(n_jobs=1, random_state=1), random_state=1)
# Note that parameters specified by est_params are not optimized in the BO
atom.run(
models="Tree",
n_calls=10,
n_initial_points=3,
est_params={
"criterion": "gini",
"splitter": "best",
"min_samples_leaf": 1,
"ccp_alpha": 0.035,
},
verbose=2,
)
Training ========================= >>
Models: Tree
Metric: f2


Running BO for Decision Tree...
| call             | max_depth | min_samples_split | max_features | f2     | best_f2 | time   | total_time |
| ---------------- | --------- | ----------------- | ------------ | ------ | ------- | ------ | ---------- |
| Initial point 1  | 9         | 19                | sqrt         | 0.491  | 0.491   | 0.700s | 0.710s     |
| Initial point 2  | 9         | 6                 | 0.5          | 0.4987 | 0.4987  | 0.718s | 1.628s     |
| Initial point 3  | 3         | 14                | 0.9          | 0.5029 | 0.5029  | 0.762s | 2.459s     |
| Iteration 4      | 3         | 14                | 0.9          | 0.5029 | 0.5029  | 0.001s | 2.678s     |
| Iteration 5      | 3         | 14                | None         | 0.4695 | 0.5029  | 0.723s | 3.603s     |
| Iteration 6      | 9         | 12                | 0.9          | 0.4938 | 0.5029  | 0.763s | 4.626s     |
| Iteration 7      | None      | 17                | 0.9          | 0.5102 | 0.5102  | 0.792s | 5.717s     |
| Iteration 8      | None      | 19                | 0.8          | 0.5491 | 0.5491  | 0.776s | 6.878s     |
| Iteration 9      | None      | 20                | 0.7          | 0.4822 | 0.5491  | 0.763s | 7.973s     |
| Iteration 10     | None      | 18                | 0.8          | 0.5054 | 0.5491  | 0.798s | 9.070s     |
Bayesian Optimization ---------------------------
Best call --> Iteration 8
Best parameters --> {'max_depth': None, 'min_samples_split': 19, 'max_features': 0.8}
Best evaluation --> f2: 0.5491
Time elapsed: 9.338s
Fit ---------------------------------------------
Train evaluation --> f2: 0.4908
Test evaluation --> f2: 0.4992
Time elapsed: 0.481s
-------------------------------------------------
Total time: 9.820s


Final results ==================== >>
Duration: 9.822s
-------------------------------------
Decision Tree --> f2: 0.4992
Save & load¶
# Save the atom instance as a pickle
# Use save_data=False to save the instance without the data
atom.save("atom", save_data=False)
ATOMClassifier successfully saved.
# Load the instance again with ATOMLoader
# No need to store the transformed data: providing the original dataset to
# the loader automatically transforms it through all the steps in the pipeline
atom_2 = ATOMLoader("atom", data=(X,), verbose=2)
Transforming data for branch master:
Applying data cleaning...
Imputing missing values...
 --> Dropping 637 samples due to missing values in feature MinTemp.
 --> Dropping 322 samples due to missing values in feature MaxTemp.
 --> Dropping 1406 samples due to missing values in feature Rainfall.
 --> Dropping 60843 samples due to missing values in feature Evaporation.
 --> Dropping 67816 samples due to missing values in feature Sunshine.
 --> Dropping 9330 samples due to missing values in feature WindGustDir.
 --> Dropping 9270 samples due to missing values in feature WindGustSpeed.
 --> Dropping 10013 samples due to missing values in feature WindDir9am.
 --> Dropping 3778 samples due to missing values in feature WindDir3pm.
 --> Dropping 1348 samples due to missing values in feature WindSpeed9am.
 --> Dropping 2630 samples due to missing values in feature WindSpeed3pm.
 --> Dropping 1774 samples due to missing values in feature Humidity9am.
 --> Dropping 3610 samples due to missing values in feature Humidity3pm.
 --> Dropping 14014 samples due to missing values in feature Pressure9am.
 --> Dropping 13981 samples due to missing values in feature Pressure3pm.
 --> Dropping 53657 samples due to missing values in feature Cloud9am.
 --> Dropping 57094 samples due to missing values in feature Cloud3pm.
 --> Dropping 904 samples due to missing values in feature Temp9am.
 --> Dropping 2726 samples due to missing values in feature Temp3pm.
 --> Dropping 1406 samples due to missing values in feature RainToday.
Encoding categorical columns...
 --> LeaveOneOut-encoding feature Location. Contains 26 classes.
 --> LeaveOneOut-encoding feature WindGustDir. Contains 16 classes.
 --> LeaveOneOut-encoding feature WindDir9am. Contains 16 classes.
 --> LeaveOneOut-encoding feature WindDir3pm. Contains 16 classes.
 --> Ordinal-encoding feature RainToday. Contains 2 classes.
Applying function <lambda> to the dataset...
ATOMClassifier successfully loaded.
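If the instance had instead been saved with its data (the default, save_data=True), the dataset would presumably not need to be provided when loading. A hypothetical sketch, with an assumed filename:
# Hypothetical sketch: save with the data included (save_data=True is the default),
# then load without passing a dataset
atom.save("atom_with_data")
atom_full = ATOMLoader("atom_with_data")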
Customize the plot aesthetics¶
# Use the plotting attributes to further customize your plots!
atom_2.palette = "Blues"
atom_2.style = "white"
atom_2.plot_roc()
# Reset the aesthetics to their original values
atom_2.reset_aesthetics()
# Draw multiple plots in one figure using the canvas method
with atom_2.canvas(2, 2):
atom_2.plot_roc(dataset="train", title="ROC train")
atom_2.plot_roc(dataset="test", title="ROC test")
atom_2.plot_prc(dataset="train", title="PRC train")
atom_2.plot_prc(dataset="test", title="PRC test")
Merge atom instances¶
Note that, to be able to merge, both instances need to be initialized with the same dataset and their models trained on the same metric.
# Create a separate instance with its own branch and model
atom_3 = ATOMClassifier(X, verbose=0, random_state=1)
atom_3.branch.rename("lightgbm")
atom_3.impute()
atom_3.encode()
atom_3.run("LGB", metric=f2)
# Merge the instances
atom_2.merge(atom_3)
Merging instances...
 --> Merging branch lightgbm.
 --> Merging model LGB.
 --> Merging attributes.
# Note that it now contains both branches and all models
atom_2
ATOMClassifier
 --> Branches:
   >>> master !
   >>> lightgbm
 --> Models: LR, AdaB, Tree, LGB
 --> Metric: f2
 --> Errors: 0
atom_2.results
| | metric_bo | time_bo | metric_train | metric_test | time_fit | time |
|---|---|---|---|---|---|---|
LR | NaN | None | 0.568508 | 0.574347 | 0.285s | 0.285s |
AdaB | NaN | None | 0.555627 | 0.564229 | 2.468s | 2.468s |
Tree | 0.549118 | 9.338s | 0.490803 | 0.499191 | 0.481s | 9.820s |
LGB | NaN | None | 0.650216 | 0.601199 | 1.227s | 1.227s |
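With all models merged into a single instance, they can also be compared side by side, for example with the evaluate method (a sketch, not part of the original output):
# Sketch: score all models (including the merged LGB) on the test set
atom_2.evaluate()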