Example: Accelerating pipelines¶
This example shows how to accelerate your models on CPU using the sklearnex engine.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether it will rain tomorrow, training a binary classifier on the target column RainTomorrow.
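Under the hood, the `engine="sklearnex"` option relies on Intel's scikit-learn-intelex package. The same acceleration can be enabled outside ATOM by patching scikit-learn directly. A minimal sketch (not part of the original notebook; assumes scikit-learn-intelex is installed, and falls back to stock scikit-learn when it is not):

```python
# Minimal sketch: enabling sklearnex acceleration without ATOM.
# Assumes the scikit-learn-intelex package is installed; if it is not,
# we fall back to stock scikit-learn.
try:
    from sklearnex import patch_sklearn, unpatch_sklearn

    patch_sklearn()  # estimators imported after this call use accelerated code
    from sklearn.neighbors import KNeighborsClassifier

    module_name = KNeighborsClassifier().__module__  # a sklearnex.* module
    unpatch_sklearn()  # restore stock scikit-learn
except ImportError:
    from sklearn.neighbors import KNeighborsClassifier

    module_name = KNeighborsClassifier().__module__  # sklearn.neighbors._classification

print(module_name)
```

ATOM does this patching for you per model when you pass `engine="sklearnex"` to `run`, as shown later in this example.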
Load the data¶
In [1]:
# Import packages
import pandas as pd
from atom import ATOMClassifier
In [2]:
# Load data
X = pd.read_csv("docs_source/examples/datasets/weatherAUS.csv")
# Let's have a look
X.head()
Out[2]:
| | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
| 1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
| 2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
| 3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
| 4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |
5 rows × 22 columns
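The dataset stats below report over 300,000 missing values, which is why the pipeline starts with imputation. Before running it, the gaps can be quantified per column with plain pandas. A sketch using a small stand-in frame (on the real data you would call the same method on `X`):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in with the same kind of gaps as weatherAUS
df = pd.DataFrame(
    {
        "MinTemp": [18.0, 17.2, np.nan],
        "Sunshine": [8.9, np.nan, np.nan],
        "RainToday": ["Yes", "No", None],
    }
)

# Count missing values per column, as atom.impute() will see them
missing = df.isna().sum()
print(missing)
```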
Run the pipeline¶
In [3]:
atom = ATOMClassifier(X, "RainTomorrow", verbose=2)
<< ================== ATOM ================== >>

Configuration ==================== >>
Algorithm task: Binary classification.

Dataset stats ==================== >>
Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 25.03 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)
In [4]:
# Impute missing values and encode categorical columns
atom.impute()
atom.encode()
Fitting Imputer...
Imputing missing values...
--> Imputing 637 missing values with mean (12.18) in column MinTemp.
--> Imputing 322 missing values with mean (23.22) in column MaxTemp.
--> Imputing 1406 missing values with mean (2.37) in column Rainfall.
--> Imputing 60843 missing values with mean (5.46) in column Evaporation.
--> Imputing 67816 missing values with mean (7.62) in column Sunshine.
--> Imputing 9330 missing values with most_frequent (W) in column WindGustDir.
--> Imputing 9270 missing values with mean (39.96) in column WindGustSpeed.
--> Imputing 10013 missing values with most_frequent (N) in column WindDir9am.
--> Imputing 3778 missing values with most_frequent (SE) in column WindDir3pm.
--> Imputing 1348 missing values with mean (13.99) in column WindSpeed9am.
--> Imputing 2630 missing values with mean (18.62) in column WindSpeed3pm.
--> Imputing 1774 missing values with mean (68.86) in column Humidity9am.
--> Imputing 3610 missing values with mean (51.48) in column Humidity3pm.
--> Imputing 14014 missing values with mean (1017.64) in column Pressure9am.
--> Imputing 13981 missing values with mean (1015.24) in column Pressure3pm.
--> Imputing 53657 missing values with mean (4.44) in column Cloud9am.
--> Imputing 57094 missing values with mean (4.5) in column Cloud3pm.
--> Imputing 904 missing values with mean (16.98) in column Temp9am.
--> Imputing 2726 missing values with mean (21.68) in column Temp3pm.
--> Imputing 1406 missing values with most_frequent (No) in column RainToday.
Fitting Encoder...
Encoding categorical columns...
--> Target-encoding feature Location. Contains 49 classes.
--> Target-encoding feature WindGustDir. Contains 16 classes.
--> Target-encoding feature WindDir9am. Contains 16 classes.
--> Target-encoding feature WindDir3pm. Contains 16 classes.
--> Ordinal-encoding feature RainToday. Contains 2 classes.
In [5]:
# Train a K-Nearest Neighbors model (using default sklearn)
atom.run(models="KNN", metric="f1")
Training ========================= >>
Models: KNN
Metric: f1

Results for KNearestNeighbors:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6962
Test evaluation --> f1: 0.5818
Time elapsed: 16.214s
-------------------------------------------------
Time: 16.214s

Final results ==================== >>
Total time: 16.249s
-------------------------------------
KNearestNeighbors --> f1: 0.5818
In [6]:
# Now, we train an accelerated KNN using engine="sklearnex"
# Note the difference in training speed!
atom.run(models="KNN_acc", metric="f1", engine="sklearnex")
Training ========================= >>
Models: KNN_acc
Metric: f1

Results for KNearestNeighbors:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6962
Test evaluation --> f1: 0.5818
Time elapsed: 5.443s
-------------------------------------------------
Time: 5.443s

Final results ==================== >>
Total time: 5.477s
-------------------------------------
KNearestNeighbors --> f1: 0.5818
Analyze the results¶
In [7]:
atom.results
Out[7]:
| | f1_train | f1_test | time_fit | time |
|---|---|---|---|---|
| KNN | 0.6962 | 0.5818 | 16.213961 | 16.213961 |
| KNN_acc | 0.6962 | 0.5818 | 5.442855 | 5.442855 |
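The table makes the trade-off concrete: both engines reach an identical test f1, but fit time differs sharply. From the times reported above:

```python
# Speedup of the sklearnex engine over stock sklearn,
# using the fit times from the results table above
sklearn_fit = 16.213961
sklearnex_fit = 5.442855

speedup = sklearn_fit / sklearnex_fit
print(f"sklearnex fit roughly {speedup:.1f}x faster")  # roughly 3x
```

Exact speedups depend on the hardware, the dataset size, and the estimator; KNN is one of the algorithms where sklearnex tends to help most.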
In [8]:
# Note how the underlying estimators might look the same...
print(atom.knn.estimator)
print(atom.knn_acc.estimator)
# ... but are using different implementations
print(atom.knn.estimator.__module__)
print(atom.knn_acc.estimator.__module__)
KNeighborsClassifier(n_jobs=1)
KNeighborsClassifier(n_jobs=1)
sklearn.neighbors._classification
sklearnex.neighbors.knn_classification
In [9]:
with atom.canvas(1, 2, title="Timing engines: sklearn vs sklearnex"):
atom.plot_results(metric="time_fit", title="Training")
atom.plot_results(metric="time", title="Total")