Utilities¶
This example shows various useful utilities that can be used to improve atom's pipelines.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow, training a binary classifier on the target RainTomorrow.
Load the data¶
# Import packages
import pandas as pd
from sklearn.metrics import fbeta_score
from atom import ATOMClassifier, ATOMLoader
# Load data
X = pd.read_csv("./datasets/weatherAUS.csv")
# Let's have a look
X.head()
| | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |
5 rows × 22 columns
Use the utility attributes¶
atom = ATOMClassifier(X, warnings=False, random_state=1)
atom.clean()
# Quickly check what columns have missing values
print(f"Columns with missing values:\n{atom.nans}")
# Or what columns are categorical
print(f"\nCategorical columns: {atom.categorical}")
# Or if the dataset is scaled
print(f"\nIs the dataset scaled? {atom.scaled}")
Columns with missing values:
MinTemp            637
MaxTemp            322
Rainfall          1406
Evaporation      60843
Sunshine         67816
WindGustDir       9330
WindGustSpeed     9270
WindDir9am       10013
WindDir3pm        3778
WindSpeed9am      1348
WindSpeed3pm      2630
Humidity9am       1774
Humidity3pm       3610
Pressure9am      14014
Pressure3pm      13981
Cloud9am         53657
Cloud3pm         57094
Temp9am            904
Temp3pm           2726
RainToday         1406
dtype: int64

Categorical columns: ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']

Is the dataset scaled? False
Use the stats method to assess changes in the dataset¶
# Note the number of missing values and categorical columns
atom.stats()
Dataset stats ==================== >>
Shape: (142193, 22)
Memory: 61.69 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicate samples: 45 (0.0%)
-------------------------------------
Train set size: 113755
Test set size: 28438
-------------------------------------
| | dataset | train | test |
| - | --------- | --------- | --------- |
| 0 | 0 (0.0) | 0 (0.0) | 0 (0.0) |
| 1 | 0 (0.0) | 0 (0.0) | 0 (0.0) |
# Now, let's impute and encode the dataset...
atom.impute()
atom.encode()
# ... and the missing values and categorical columns are gone
atom.stats()
Dataset stats ==================== >>
Shape: (56420, 22)
Memory: 9.93 MB
Scaled: False
Outlier values: 3210 (0.3%)
-------------------------------------
Train set size: 45039
Test set size: 11381
-------------------------------------
| | dataset | train | test |
| - | --------- | --------- | --------- |
| 0 | 0 (0.0) | 0 (0.0) | 0 (0.0) |
| 1 | 0 (0.0) | 0 (0.0) | 0 (0.0) |
Inspect feature distributions¶
# Compare the relationship of multiple columns with a scatter matrix
atom.plot_scatter_matrix(columns=slice(0, 5))
# Check which distribution fits a column best
atom.distribution(columns="Rainfall")
| dist | stat | Rainfall |
|---|---|---|
| beta | score | 0.6506 |
| | p_value | 0.0 |
| expon | score | 0.6506 |
| | p_value | 0.0 |
| gamma | score | 0.6465 |
| | p_value | 0.0 |
| invgauss | score | 0.4717 |
| | p_value | 0.0 |
| lognorm | score | 0.6485 |
| | p_value | 0.0 |
| norm | score | 0.3807 |
| | p_value | 0.0 |
| pearson3 | score | 0.6506 |
| | p_value | 0.0 |
| triang | score | 0.7191 |
| | p_value | 0.0 |
| uniform | score | 0.8914 |
| | p_value | 0.0 |
| weibull_min | score | 0.6506 |
| | p_value | 0.0 |
| weibull_max | score | 0.8896 |
| | p_value | 0.0 |
# Investigate a column's distribution
atom.plot_distribution(columns="MinTemp", distributions="beta")
atom.plot_qq(columns="MinTemp", distributions="beta")
Change the data mid-pipeline¶
There are two ways to quickly transform the dataset mid-pipeline. The first is through the property's @setter. The downside of this approach is that the transformation is not stored in atom's pipeline, so it is not applied to new data. Therefore, we recommend the second approach: adding the transformation to the pipeline, e.g. through the apply method used below.
# Note that we can only replace a dataframe with a new dataframe!
atom.X = atom.X.assign(AvgTemp=(atom.X["MaxTemp"] + atom.X["MinTemp"])/2)
# This will automatically update all other data attributes
assert "AvgTemp" in atom
# But it's not saved to atom's pipeline
atom.pipeline
0    Cleaner()
1    Imputer()
2    Encoder()
Name: master, dtype: object
# Same transformation, different approach (AvgTemp is overwritten)
atom.apply(lambda df: (df.MaxTemp + df.MinTemp)/2, columns="AvgTemp")
assert "AvgTemp" in atom
# Now the function appears in the pipeline
atom.pipeline
0                                          Cleaner()
1                                          Imputer()
2                                          Encoder()
3    FuncTransformer(func=<lambda>, columns=AvgTemp)
Name: master, dtype: object
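Because the function is now stored in the pipeline, it is also applied to new, unseen data. A minimal sketch, assuming atom's transform method accepts a dataframe with the same raw feature columns (X_new is just a hypothetical sample of the original data):
# Hypothetical sketch: pass new raw data through the fitted pipeline
# X_new stands in for unseen data with the same columns as X (minus the target)
X_new = X.drop(columns="RainTomorrow").sample(5, random_state=1)
X_new = atom.transform(X_new)
# The FuncTransformer in the pipeline creates AvgTemp for the new data as well
assert "AvgTemp" in X_new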
Get an overview of the available models¶
atom.available_models()
| | acronym | fullname | estimator | module | needs_scaling | accepts_sparse |
|---|---|---|---|---|---|---|
0 | Dummy | Dummy Estimator | DummyClassifier | sklearn.dummy | False | False |
1 | GP | Gaussian Process | GaussianProcessClassifier | sklearn.gaussian_process._gpc | False | False |
2 | GNB | Gaussian Naive Bayes | GaussianNB | sklearn.naive_bayes | False | False |
3 | MNB | Multinomial Naive Bayes | MultinomialNB | sklearn.naive_bayes | False | True |
4 | BNB | Bernoulli Naive Bayes | BernoulliNB | sklearn.naive_bayes | False | True |
5 | CatNB | Categorical Naive Bayes | CategoricalNB | sklearn.naive_bayes | False | True |
6 | CNB | Complement Naive Bayes | ComplementNB | sklearn.naive_bayes | False | True |
7 | Ridge | Ridge Estimator | RidgeClassifier | sklearn.linear_model._ridge | True | True |
8 | Perc | Perceptron | Perceptron | sklearn.linear_model._perceptron | True | False |
9 | LR | Logistic Regression | LogisticRegression | sklearn.linear_model._logistic | True | True |
10 | LDA | Linear Discriminant Analysis | LinearDiscriminantAnalysis | sklearn.discriminant_analysis | False | False |
11 | QDA | Quadratic Discriminant Analysis | QuadraticDiscriminantAnalysis | sklearn.discriminant_analysis | False | False |
12 | KNN | K-Nearest Neighbors | KNeighborsClassifier | sklearn.neighbors._classification | True | True |
13 | RNN | Radius Nearest Neighbors | RadiusNeighborsClassifier | sklearn.neighbors._classification | True | True |
14 | Tree | Decision Tree | DecisionTreeClassifier | sklearn.tree._classes | False | True |
15 | Bag | Bagging | BaggingClassifier | sklearn.ensemble._bagging | False | True |
16 | ET | Extra-Trees | ExtraTreesClassifier | sklearn.ensemble._forest | False | True |
17 | RF | Random Forest | RandomForestClassifier | sklearn.ensemble._forest | False | True |
18 | AdaB | AdaBoost | AdaBoostClassifier | sklearn.ensemble._weight_boosting | False | True |
19 | GBM | Gradient Boosting Machine | GradientBoostingClassifier | sklearn.ensemble._gb | False | True |
20 | hGBM | HistGBM | HistGradientBoostingClassifier | sklearn.ensemble._hist_gradient_boosting.gradi... | False | False |
21 | XGB | XGBoost | XGBClassifier | xgboost.sklearn | True | True |
22 | LGB | LightGBM | LGBMClassifier | lightgbm.sklearn | True | True |
23 | CatB | CatBoost | CatBoostClassifier | catboost.core | True | True |
24 | lSVM | Linear-SVM | LinearSVC | sklearn.svm._classes | True | True |
25 | kSVM | Kernel-SVM | SVC | sklearn.svm._classes | True | True |
26 | PA | Passive Aggressive | PassiveAggressiveClassifier | sklearn.linear_model._passive_aggressive | True | True |
27 | SGD | Stochastic Gradient Descent | SGDClassifier | sklearn.linear_model._stochastic_gradient | True | True |
28 | MLP | Multi-layer Perceptron | MLPClassifier | sklearn.neural_network._multilayer_perceptron | True | True |
Use a custom metric¶
atom.verbose = 1
# Define a custom metric
def f2(y_true, y_pred):
return fbeta_score(y_true, y_pred, beta=2)
# Use the greater_is_better, needs_proba and needs_threshold parameters if necessary
atom.run(models="LR", metric=f2)
Training ========================= >>
Models: LR
Metric: f2


Results for Logistic Regression:
Fit ---------------------------------------------
Train evaluation --> f2: 0.5685
Test evaluation --> f2: 0.5743
Time elapsed: 0.285s
-------------------------------------------------
Total time: 0.285s


Final results ==================== >>
Duration: 0.286s
-------------------------------------
Logistic Regression --> f2: 0.5743
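The greater_is_better, needs_proba and needs_threshold parameters mirror those of sklearn's make_scorer. As an alternative sketch (assuming run also accepts scorer objects as metric), the scorer could be built explicitly:
# Sketch: build the scorer with sklearn's make_scorer instead of a plain function
from sklearn.metrics import make_scorer

f2_scorer = make_scorer(fbeta_score, beta=2, greater_is_better=True)
# atom.run(models="LR", metric=f2_scorer)  # would be equivalent to the call above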
Customize the estimator's parameters¶
# You can use the est_params parameter to customize the estimator
# Let's run AdaBoost using LR instead of a decision tree as base estimator
atom.run("AdaB", est_params={"base_estimator": atom.lr.estimator})
Training ========================= >>
Models: AdaB
Metric: f2


Results for AdaBoost:
Fit ---------------------------------------------
Train evaluation --> f2: 0.5556
Test evaluation --> f2: 0.5642
Time elapsed: 2.468s
-------------------------------------------------
Total time: 2.468s


Final results ==================== >>
Duration: 2.468s
-------------------------------------
AdaBoost --> f2: 0.5642
atom.adab.estimator
AdaBoostClassifier(base_estimator=LogisticRegression(n_jobs=1, random_state=1), random_state=1)
# Note that parameters specified by est_params are not optimized in the BO
atom.run(
models="Tree",
n_calls=10,
n_initial_points=3,
est_params={
"criterion": "gini",
"splitter": "best",
"min_samples_leaf": 1,
"ccp_alpha": 0.035,
},
verbose=2,
)
Training ========================= >>
Models: Tree
Metric: f2


Running BO for Decision Tree...
| call             | max_depth | min_samples_split | max_features | f2     | best_f2 | time   | total_time |
| ---------------- | --------- | ----------------- | ------------ | ------ | ------- | ------ | ---------- |
| Initial point 1  | 9         | 19                | sqrt         | 0.491  | 0.491   | 0.700s | 0.710s     |
| Initial point 2  | 9         | 6                 | 0.5          | 0.4987 | 0.4987  | 0.718s | 1.628s     |
| Initial point 3  | 3         | 14                | 0.9          | 0.5029 | 0.5029  | 0.762s | 2.459s     |
| Iteration 4      | 3         | 14                | 0.9          | 0.5029 | 0.5029  | 0.001s | 2.678s     |
| Iteration 5      | 3         | 14                | None         | 0.4695 | 0.5029  | 0.723s | 3.603s     |
| Iteration 6      | 9         | 12                | 0.9          | 0.4938 | 0.5029  | 0.763s | 4.626s     |
| Iteration 7      | None      | 17                | 0.9          | 0.5102 | 0.5102  | 0.792s | 5.717s     |
| Iteration 8      | None      | 19                | 0.8          | 0.5491 | 0.5491  | 0.776s | 6.878s     |
| Iteration 9      | None      | 20                | 0.7          | 0.4822 | 0.5491  | 0.763s | 7.973s     |
| Iteration 10     | None      | 18                | 0.8          | 0.5054 | 0.5491  | 0.798s | 9.070s     |
Bayesian Optimization ---------------------------
Best call --> Iteration 8
Best parameters --> {'max_depth': None, 'min_samples_split': 19, 'max_features': 0.8}
Best evaluation --> f2: 0.5491
Time elapsed: 9.338s
Fit ---------------------------------------------
Train evaluation --> f2: 0.4908
Test evaluation --> f2: 0.4992
Time elapsed: 0.481s
-------------------------------------------------
Total time: 9.820s


Final results ==================== >>
Duration: 9.822s
-------------------------------------
Decision Tree --> f2: 0.4992
Save & load¶
# Save the atom instance as a pickle
# Use save_data=False to save the instance without the data
atom.save("atom", save_data=False)
ATOMClassifier successfully saved.
# Load the instance again with ATOMLoader
# No need to store the transformed data: providing the original dataset to
# the loader automatically transforms it through all the steps in the pipeline
atom_2 = ATOMLoader("atom", data=(X,), verbose=2)
Transforming data for branch master:
Applying data cleaning...
Imputing missing values...
 --> Dropping 637 samples due to missing values in feature MinTemp.
 --> Dropping 322 samples due to missing values in feature MaxTemp.
 --> Dropping 1406 samples due to missing values in feature Rainfall.
 --> Dropping 60843 samples due to missing values in feature Evaporation.
 --> Dropping 67816 samples due to missing values in feature Sunshine.
 --> Dropping 9330 samples due to missing values in feature WindGustDir.
 --> Dropping 9270 samples due to missing values in feature WindGustSpeed.
 --> Dropping 10013 samples due to missing values in feature WindDir9am.
 --> Dropping 3778 samples due to missing values in feature WindDir3pm.
 --> Dropping 1348 samples due to missing values in feature WindSpeed9am.
 --> Dropping 2630 samples due to missing values in feature WindSpeed3pm.
 --> Dropping 1774 samples due to missing values in feature Humidity9am.
 --> Dropping 3610 samples due to missing values in feature Humidity3pm.
 --> Dropping 14014 samples due to missing values in feature Pressure9am.
 --> Dropping 13981 samples due to missing values in feature Pressure3pm.
 --> Dropping 53657 samples due to missing values in feature Cloud9am.
 --> Dropping 57094 samples due to missing values in feature Cloud3pm.
 --> Dropping 904 samples due to missing values in feature Temp9am.
 --> Dropping 2726 samples due to missing values in feature Temp3pm.
 --> Dropping 1406 samples due to missing values in feature RainToday.
Encoding categorical columns...
 --> LeaveOneOut-encoding feature Location. Contains 26 classes.
 --> LeaveOneOut-encoding feature WindGustDir. Contains 16 classes.
 --> LeaveOneOut-encoding feature WindDir9am. Contains 16 classes.
 --> LeaveOneOut-encoding feature WindDir3pm. Contains 16 classes.
 --> Ordinal-encoding feature RainToday. Contains 2 classes.
Applying function <lambda> to the dataset...
ATOMClassifier successfully loaded.
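If the instance had instead been saved with its data (the default, save_data=True), the dataset would presumably not need to be provided when loading. A hypothetical sketch, with an assumed filename:
# Hypothetical sketch: save with the data included (save_data=True is the default),
# then load without passing a dataset
atom.save("atom_with_data")
atom_full = ATOMLoader("atom_with_data")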
Customize the plot aesthetics¶
# Use the plotting attributes to further customize your plots!
atom_2.palette = "Blues"
atom_2.style = "white"
atom_2.plot_roc()
# Reset the aesthetics to their original values
atom_2.reset_aesthetics()
# Draw multiple plots in one figure using the canvas method
with atom_2.canvas(2, 2):
atom_2.plot_roc(dataset="train", title="ROC train")
atom_2.plot_roc(dataset="test", title="ROC test")
atom_2.plot_prc(dataset="train", title="PRC train")
atom_2.plot_prc(dataset="test", title="PRC test")
Merge atom instances¶
Note that, to be able to merge, both instances need to be initialized with the same dataset and their models trained on the same metric.
# Create a separate instance with its own branch and model
atom_3 = ATOMClassifier(X, verbose=0, random_state=1)
atom_3.branch.rename("lightgbm")
atom_3.impute()
atom_3.encode()
atom_3.run("LGB", metric=f2)
# Merge the instances
atom_2.merge(atom_3)
Merging instances...
 --> Merging branch lightgbm.
 --> Merging model LGB.
 --> Merging attributes.
# Note that it now contains both branches and all models
atom_2
ATOMClassifier
 --> Branches:
   >>> master !
   >>> lightgbm
 --> Models: LR, AdaB, Tree, LGB
 --> Metric: f2
 --> Errors: 0
atom_2.results
| | metric_bo | time_bo | metric_train | metric_test | time_fit | time |
|---|---|---|---|---|---|---|
LR | NaN | None | 0.568508 | 0.574347 | 0.285s | 0.285s |
AdaB | NaN | None | 0.555627 | 0.564229 | 2.468s | 2.468s |
Tree | 0.549118 | 9.338s | 0.490803 | 0.499191 | 0.481s | 9.820s |
LGB | NaN | None | 0.650216 | 0.601199 | 1.227s | 1.227s |
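With all models merged into a single instance, they can also be compared side by side, for example with the evaluate method (a sketch, not part of the original output):
# Sketch: score all models (including the merged LGB) on the test set
atom_2.evaluate()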