Example: Memory considerations¶
This example shows how to use the memory parameter to make efficient use of the available memory.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow, training a binary classifier on the target column RainTomorrow.
Load the data¶
# Import packages
import os
import tempfile
import pandas as pd
from atom import ATOMClassifier
# Load data
X = pd.read_csv("docs_source/examples/datasets/weatherAUS.csv")
# Let's have a look
X.head()
 | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |
5 rows × 22 columns
# Define a temp directory to store the files in this example
tempdir = tempfile.gettempdir()
def get_size(filepath):
    """Return the size of the file in MB."""
    return f"{os.path.getsize(filepath + '.pkl') / 1e6:.2f}MB"
Run the pipeline¶
atom = ATOMClassifier(X, y="RainTomorrow", verbose=2)
<< ================== ATOM ================== >>

Configuration ==================== >>
Algorithm task: Binary classification.

Dataset stats ==================== >>
Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 25.03 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)
Note that the dataset takes ~25MB. We can reduce its size using the shrink method, which converts every column to the smallest dtype that can hold its values.
atom.dtypes
Location          object
MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustDir       object
WindGustSpeed    float64
WindDir9am        object
WindDir3pm        object
WindSpeed9am     float64
WindSpeed3pm     float64
Humidity9am      float64
Humidity3pm      float64
Pressure9am      float64
Pressure3pm      float64
Cloud9am         float64
Cloud3pm         float64
Temp9am          float64
Temp3pm          float64
RainToday         object
RainTomorrow       int64
dtype: object
atom.shrink(str2cat=True)
The column dtypes are successfully converted.
atom.dtypes
Location         category
MinTemp           Float32
MaxTemp           Float32
Rainfall          Float32
Evaporation       Float32
Sunshine          Float32
WindGustDir      category
WindGustSpeed       Int16
WindDir9am       category
WindDir3pm       category
WindSpeed9am        Int16
WindSpeed3pm         Int8
Humidity9am          Int8
Humidity3pm          Int8
Pressure9am       Float32
Pressure3pm       Float32
Cloud9am             Int8
Cloud3pm             Int8
Temp9am           Float32
Temp3pm           Float32
RainToday        category
RainTomorrow         Int8
dtype: object
# Let's check the memory usage again...
# Notice the huge drop!
atom.stats()
Dataset stats ==================== >>
Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 9.67 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)
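The drop comes from downcasting: integer-like columns no longer reserve 8 bytes per value, and string columns become categoricals. The same principle can be shown with plain pandas on a made-up frame (the column names below mimic the weather data but the values are random, so the exact numbers are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up frame with default (wide) dtypes
n = 100_000
df = pd.DataFrame({
    "Cloud9am": np.random.randint(0, 9, n),               # int64 by default
    "Pressure9am": np.random.uniform(990, 1040, n),       # float64 by default
    "WindDir9am": np.random.choice(["N", "SE", "W"], n),  # object by default
})

before = df.memory_usage(deep=True).sum()

# Downcast each column to the smallest dtype that still holds its values
df = df.astype({"Cloud9am": "int8", "Pressure9am": "float32", "WindDir9am": "category"})

after = df.memory_usage(deep=True).sum()
print(f"{before / 1e6:.2f} MB -> {after / 1e6:.2f} MB")
```

The object column benefits the most: storing one int8 code per row is far cheaper than one Python string per row.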
# Now, we create some new branches to train models with different transformers
atom.impute()
atom.encode()
atom.run("LDA")
atom.branch = "b2"
atom.scale()
atom.run("LDA_scaled")
atom.branch = "b3_from_main"
atom.normalize()
atom.run("LDA_norm")
Fitting Imputer...
Imputing missing values...
 --> Imputing 637 missing values with mean (12.19) in column MinTemp.
 --> Imputing 322 missing values with mean (23.23) in column MaxTemp.
 --> Imputing 1406 missing values with mean (2.37) in column Rainfall.
 --> Imputing 60843 missing values with mean (5.48) in column Evaporation.
 --> Imputing 67816 missing values with mean (7.63) in column Sunshine.
 --> Imputing 9330 missing values with most_frequent (W) in column WindGustDir.
 --> Imputing 9270 missing values with mean (40.0) in column WindGustSpeed.
 --> Imputing 10013 missing values with most_frequent (N) in column WindDir9am.
 --> Imputing 3778 missing values with most_frequent (SE) in column WindDir3pm.
 --> Imputing 1348 missing values with mean (14.02) in column WindSpeed9am.
 --> Imputing 2630 missing values with mean (18.64) in column WindSpeed3pm.
 --> Imputing 1774 missing values with mean (68.82) in column Humidity9am.
 --> Imputing 3610 missing values with mean (51.45) in column Humidity3pm.
 --> Imputing 14014 missing values with mean (1017.64) in column Pressure9am.
 --> Imputing 13981 missing values with mean (1015.25) in column Pressure3pm.
 --> Imputing 53657 missing values with mean (4.44) in column Cloud9am.
 --> Imputing 57094 missing values with mean (4.5) in column Cloud3pm.
 --> Imputing 904 missing values with mean (16.99) in column Temp9am.
 --> Imputing 2726 missing values with mean (21.69) in column Temp3pm.
 --> Imputing 1406 missing values with most_frequent (No) in column RainToday.
Fitting Encoder...
Encoding categorical columns...
 --> Target-encoding feature Location. Contains 49 classes.
 --> Target-encoding feature WindGustDir. Contains 16 classes.
 --> Target-encoding feature WindDir9am. Contains 16 classes.
 --> Target-encoding feature WindDir3pm. Contains 16 classes.
 --> Ordinal-encoding feature RainToday. Contains 2 classes.
Training ========================= >>
Models: LDA
Metric: f1

Results for LinearDiscriminantAnalysis:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5906
Test evaluation --> f1: 0.5904
Time elapsed: 0.942s
-------------------------------------------------
Time: 0.942s

Final results ==================== >>
Total time: 1.005s
-------------------------------------
LinearDiscriminantAnalysis --> f1: 0.5904

Successfully created new branch: b2.
Fitting Scaler...
Scaling features...

Training ========================= >>
Models: LDA_scaled
Metric: f1

Results for LinearDiscriminantAnalysis:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5906
Test evaluation --> f1: 0.5904
Time elapsed: 0.956s
-------------------------------------------------
Time: 0.956s

Final results ==================== >>
Total time: 1.017s
-------------------------------------
LinearDiscriminantAnalysis --> f1: 0.5904

Successfully created new branch: b3.
Fitting Normalizer...
Normalizing features...

Training ========================= >>
Models: LDA_norm
Metric: f1

Results for LinearDiscriminantAnalysis:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5955
Test evaluation --> f1: 0.594
Time elapsed: 0.929s
-------------------------------------------------
Time: 0.929s

Final results ==================== >>
Total time: 0.991s
-------------------------------------
LinearDiscriminantAnalysis --> f1: 0.594
# If we save atom now, notice the size
# This is because atom keeps a copy of every branch in memory
filename = os.path.join(tempdir, "atom1")
atom.save(filename)
get_size(filename)
ATOMClassifier successfully saved.
'83.93MB'
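The file is large because every branch pickles its own copy of the transformed dataset. A rough standalone illustration of that effect (the branch names and frame are made up, not ATOM's internal layout):

```python
import pickle

import numpy as np
import pandas as pd

# A stand-in for the dataset held by a branch
df = pd.DataFrame(np.random.rand(10_000, 10))

# One branch vs. three branches, each with its own copy of the data
one_branch = pickle.dumps({"main": df})
three_branches = pickle.dumps({"main": df, "b2": df.copy(), "b3": df.copy()})

print(f"{len(one_branch) / 1e6:.2f} MB vs {len(three_branches) / 1e6:.2f} MB")
```

Three independent copies pickle to roughly three times the size, which is why the saved file above grows with every branch created.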
To avoid large memory usage, set the memory parameter.
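The cache storage path reported in the output below points at a joblib directory, so the caching appears to follow joblib's Memory semantics: a transformer's result is written to disk keyed on its inputs, and identical calls are read back instead of recomputed. A minimal sketch of that mechanism with a hypothetical transform function (not ATOM's API):

```python
import tempfile

from joblib import Memory

# Fresh cache directory, so the first call is guaranteed to compute
memory = Memory(tempfile.mkdtemp(), verbose=0)

executions = []

@memory.cache
def transform(n):
    """Hypothetical expensive transformer step."""
    executions.append(n)  # record actual (non-cached) executions
    return n * 2

transform(21)  # computed and written to the cache
transform(21)  # identical input: the result is read from disk
print(len(executions))  # the function body ran only once
```

This is why, in the runs below, ATOM can skip refitting the Imputer, Encoder, Scaler, and Normalizer entirely.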
atom = ATOMClassifier(X, y="RainTomorrow", memory=tempdir, verbose=1, random_state=1)
atom.shrink(str2cat=True)
atom.impute()
atom.encode()
atom.run("LDA")
atom.branch = "b2"
atom.scale()
atom.run("LDA_scaled")
atom.branch = "b3_from_main"
atom.normalize()
atom.run("LDA_norm")
<< ================== ATOM ================== >>

Configuration ==================== >>
Algorithm task: Binary classification.
Cache storage: C:\Users\Mavs\AppData\Local\Temp\joblib

Dataset stats ==================== >>
Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 25.03 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)

The column dtypes are successfully converted.
Loading cached results for Imputer...
Loading cached results for Encoder...

Training ========================= >>
Models: LDA
Metric: f1

Results for LinearDiscriminantAnalysis:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5914
Test evaluation --> f1: 0.5892
Time elapsed: 0.953s
-------------------------------------------------
Time: 0.953s

Final results ==================== >>
Total time: 1.015s
-------------------------------------
LinearDiscriminantAnalysis --> f1: 0.5892

Successfully created new branch: b2.
Loading cached results for Scaler...

Training ========================= >>
Models: LDA_scaled
Metric: f1

Results for LinearDiscriminantAnalysis:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5914
Test evaluation --> f1: 0.5892
Time elapsed: 0.971s
-------------------------------------------------
Time: 0.971s

Final results ==================== >>
Total time: 1.028s
-------------------------------------
LinearDiscriminantAnalysis --> f1: 0.5892

Successfully created new branch: b3.
Loading cached results for Normalizer...
Training ========================= >>
Models: LDA_norm
Metric: f1

Results for LinearDiscriminantAnalysis:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5957
Test evaluation --> f1: 0.5935
Time elapsed: 0.924s
-------------------------------------------------
Time: 0.924s

Final results ==================== >>
Total time: 0.985s
-------------------------------------
LinearDiscriminantAnalysis --> f1: 0.5935
# And now, it only takes a fraction of the previous size
# This is because the data of inactive branches is now stored locally
filename = os.path.join(tempdir, "atom2")
atom.save(filename)
get_size(filename)
ATOMClassifier successfully saved.
'24.78MB'
Additionally, repeated calls to the same transformers with the same data will use the cached results. Don't forget to specify the random_state parameter to ensure the data remains exactly the same.
atom = ATOMClassifier(X, y="RainTomorrow", memory=tempdir, verbose=1, random_state=1)
atom.shrink(str2cat=True)
<< ================== ATOM ================== >>

Configuration ==================== >>
Algorithm task: Binary classification.
Cache storage: C:\Users\Mavs\AppData\Local\Temp\joblib

Dataset stats ==================== >>
Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 25.03 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)

The column dtypes are successfully converted.
# Note the transformers are no longer fitted,
# instead the results are immediately read from cache
atom.impute()
atom.encode()
Loading cached results for Imputer...
Loading cached results for Encoder...
atom.dataset
 | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.070767 | 13.0 | 30.500000 | 0.000000 | 6.80000 | 10.000000 | 0.272677 | 59.0 | 0.254995 | 0.282496 | ... | 19.000000 | 8.00000 | 1013.599976 | 1008.000000 | 0.000000 | 2.00000 | 19.600000 | 29.900000 | 0.0 | 0 |
1 | 0.130163 | 8.8 | 25.200001 | 0.000000 | 5.00000 | 7.614201 | 0.285167 | 50.0 | 0.26967 | 0.278696 | ... | 68.842218 | 51.50239 | 1011.200012 | 1006.500000 | 4.446657 | 3.00000 | 15.900000 | 23.700001 | 0.0 | 1 |
2 | 0.262043 | 19.9 | 26.600000 | 8.000000 | 5.46491 | 7.614201 | 0.26658 | 57.0 | 0.254995 | 0.250291 | ... | 81.000000 | 81.00000 | 1013.099976 | 1008.599976 | 4.446657 | 4.50922 | 24.500000 | 24.700001 | 1.0 | 1 |
3 | 0.183912 | 19.6 | 31.900000 | 2.600000 | 5.46491 | 7.614201 | 0.26658 | 59.0 | 0.269775 | 0.220975 | ... | 70.000000 | 42.00000 | 1001.200012 | 1002.400024 | 2.000000 | 8.00000 | 25.799999 | 22.000000 | 1.0 | 0 |
4 | 0.258569 | 15.3 | 22.400000 | 16.000000 | 4.20000 | 3.300000 | 0.194464 | 39.0 | 0.245824 | 0.189182 | ... | 83.000000 | 63.00000 | 1025.500000 | 1023.599976 | 6.000000 | 6.00000 | 16.900000 | 21.100000 | 1.0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
142188 | 0.278746 | 9.0 | 21.799999 | 0.000000 | 5.46491 | 7.614201 | 0.158276 | 33.0 | 0.203597 | 0.277443 | ... | 44.000000 | 38.00000 | 1017.660981 | 1015.270396 | 4.446657 | 4.50922 | 16.600000 | 21.100000 | 0.0 | 1 |
142189 | 0.307562 | 11.5 | 19.200001 | 0.800000 | 2.00000 | 7.000000 | 0.158276 | 22.0 | 0.143946 | 0.187433 | ... | 73.000000 | 52.00000 | 1021.299988 | 1018.799988 | 3.000000 | 4.00000 | 17.100000 | 18.400000 | 0.0 | 0 |
142190 | 0.197839 | 17.5 | 29.100000 | 35.599998 | 5.46491 | 7.614201 | 0.158276 | 33.0 | 0.203597 | 0.180537 | ... | 77.000000 | 46.00000 | 1015.200012 | 1013.700012 | 4.446657 | 4.50922 | 21.000000 | 28.799999 | 1.0 | 0 |
142191 | 0.371853 | 5.9 | 18.000000 | 0.400000 | 0.80000 | 6.700000 | 0.285167 | 26.0 | 0.254995 | 0.278696 | ... | 92.000000 | 65.00000 | 1028.000000 | 1025.300049 | 3.000000 | 2.00000 | 9.400000 | 16.600000 | 0.0 | 0 |
142192 | 0.297818 | 10.2 | 18.100000 | 0.200000 | 5.46491 | 7.614201 | 0.205887 | 24.0 | 0.150067 | 0.221562 | ... | 84.000000 | 94.00000 | 1018.099976 | 1016.000000 | 4.446657 | 4.50922 | 15.300000 | 16.000000 | 0.0 | 0 |
142193 rows × 22 columns