Example: Memory considerations¶
This example shows how to use the memory parameter to make efficient use of the available memory.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal of this dataset is to predict whether or not it will rain tomorrow, training a binary classifier on the target column RainTomorrow.
Load the data¶
# Import packages
import os
import tempfile
import pandas as pd
from atom import ATOMClassifier
# Load data
X = pd.read_csv("docs_source/examples/datasets/weatherAUS.csv")
# Let's have a look
X.head()
 | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |
5 rows × 22 columns
# Define a temp directory to store the files in this example
tempdir = tempfile.gettempdir()
def get_size(filepath):
    """Return the size of the file in MB."""
    return f"{os.path.getsize(filepath + '.pkl') / 1e6:.2f}MB"
Run the pipeline¶
atom = ATOMClassifier(X, y="RainTomorrow", verbose=2)
<< ================== ATOM ================== >>

Configuration ==================== >>
Algorithm task: Binary classification.

Dataset stats ==================== >>
Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 25.03 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)
Note that the dataset takes ~25MB. We can reduce its size using the shrink method, which converts every column to the smallest dtype that can hold its values.
atom.dtypes
Location          object
MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustDir       object
WindGustSpeed    float64
WindDir9am        object
WindDir3pm        object
WindSpeed9am     float64
WindSpeed3pm     float64
Humidity9am      float64
Humidity3pm      float64
Pressure9am      float64
Pressure3pm      float64
Cloud9am         float64
Cloud3pm         float64
Temp9am          float64
Temp3pm          float64
RainToday         object
RainTomorrow       int64
dtype: object
atom.shrink(str2cat=True)
The column dtypes are successfully converted.
atom.dtypes
Location         category
MinTemp           Float32
MaxTemp           Float32
Rainfall          Float32
Evaporation       Float32
Sunshine          Float32
WindGustDir      category
WindGustSpeed       Int16
WindDir9am       category
WindDir3pm       category
WindSpeed9am        Int16
WindSpeed3pm         Int8
Humidity9am          Int8
Humidity3pm          Int8
Pressure9am       Float32
Pressure3pm       Float32
Cloud9am             Int8
Cloud3pm             Int8
Temp9am           Float32
Temp3pm           Float32
RainToday        category
RainTomorrow         Int8
dtype: object
# Let's check the memory usage again...
# Notice the huge drop!
atom.stats()
Dataset stats ==================== >>
Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 9.67 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)
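The drop comes from downcasting: integer-like columns no longer reserve 8 bytes per value, and string columns become categoricals. The same principle can be shown with plain pandas on a made-up frame (the column names below mimic the weather data but the values are random, so the exact numbers are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up frame with default (wide) dtypes
n = 100_000
df = pd.DataFrame({
    "Cloud9am": np.random.randint(0, 9, n),               # int64 by default
    "Pressure9am": np.random.uniform(990, 1040, n),       # float64 by default
    "WindDir9am": np.random.choice(["N", "SE", "W"], n),  # object by default
})

before = df.memory_usage(deep=True).sum()

# Downcast each column to the smallest dtype that still holds its values
df = df.astype({"Cloud9am": "int8", "Pressure9am": "float32", "WindDir9am": "category"})

after = df.memory_usage(deep=True).sum()
print(f"{before / 1e6:.2f} MB -> {after / 1e6:.2f} MB")
```

The object column benefits the most: storing one int8 code per row is far cheaper than one Python string per row.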
# Now, we create some new branches to train models with different transformers
atom.impute()
atom.encode()
atom.run("LDA")
atom.branch = "b2"
atom.scale()
atom.run("LDA_scaled")
atom.branch = "b3_from_main"
atom.normalize()
atom.run("LDA_norm")
Fitting Imputer...
Imputing missing values...
 --> Imputing 637 missing values with mean (12.19) in column MinTemp.
 --> Imputing 322 missing values with mean (23.23) in column MaxTemp.
 --> Imputing 1406 missing values with mean (2.37) in column Rainfall.
 --> Imputing 60843 missing values with mean (5.48) in column Evaporation.
 --> Imputing 67816 missing values with mean (7.63) in column Sunshine.
 --> Imputing 9330 missing values with most_frequent (W) in column WindGustDir.
 --> Imputing 9270 missing values with mean (40.0) in column WindGustSpeed.
 --> Imputing 10013 missing values with most_frequent (N) in column WindDir9am.
 --> Imputing 3778 missing values with most_frequent (SE) in column WindDir3pm.
 --> Imputing 1348 missing values with mean (14.02) in column WindSpeed9am.
 --> Imputing 2630 missing values with mean (18.64) in column WindSpeed3pm.
 --> Imputing 1774 missing values with mean (68.82) in column Humidity9am.
 --> Imputing 3610 missing values with mean (51.45) in column Humidity3pm.
 --> Imputing 14014 missing values with mean (1017.64) in column Pressure9am.
 --> Imputing 13981 missing values with mean (1015.25) in column Pressure3pm.
 --> Imputing 53657 missing values with mean (4.44) in column Cloud9am.
 --> Imputing 57094 missing values with mean (4.5) in column Cloud3pm.
 --> Imputing 904 missing values with mean (16.99) in column Temp9am.
 --> Imputing 2726 missing values with mean (21.69) in column Temp3pm.
 --> Imputing 1406 missing values with most_frequent (No) in column RainToday.
Fitting Encoder...
Encoding categorical columns...
 --> Target-encoding feature Location. Contains 49 classes.
 --> Target-encoding feature WindGustDir. Contains 16 classes.
 --> Target-encoding feature WindDir9am. Contains 16 classes.
 --> Target-encoding feature WindDir3pm. Contains 16 classes.
 --> Ordinal-encoding feature RainToday. Contains 2 classes.
Training ========================= >>
Models: LDA
Metric: f1

Results for LinearDiscriminantAnalysis:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5906
Test evaluation --> f1: 0.5904
Time elapsed: 0.942s
-------------------------------------------------
Time: 0.942s

Final results ==================== >>
Total time: 1.005s
-------------------------------------
LinearDiscriminantAnalysis --> f1: 0.5904

Successfully created new branch: b2.
Fitting Scaler...
Scaling features...

Training ========================= >>
Models: LDA_scaled
Metric: f1

Results for LinearDiscriminantAnalysis:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5906
Test evaluation --> f1: 0.5904
Time elapsed: 0.956s
-------------------------------------------------
Time: 0.956s

Final results ==================== >>
Total time: 1.017s
-------------------------------------
LinearDiscriminantAnalysis --> f1: 0.5904

Successfully created new branch: b3.
Fitting Normalizer...
Normalizing features...

Training ========================= >>
Models: LDA_norm
Metric: f1

Results for LinearDiscriminantAnalysis:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5955
Test evaluation --> f1: 0.594
Time elapsed: 0.929s
-------------------------------------------------
Time: 0.929s

Final results ==================== >>
Total time: 0.991s
-------------------------------------
LinearDiscriminantAnalysis --> f1: 0.594
# If we save atom now, notice the size
# This is because atom keeps a copy of every branch in memory
filename = os.path.join(tempdir, "atom1")
atom.save(filename)
get_size(filename)
ATOMClassifier successfully saved.
'83.93MB'
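The file is large because every branch pickles its own copy of the transformed dataset. A rough standalone illustration of that effect (the branch names and frame are made up, not ATOM's internal layout):

```python
import pickle

import numpy as np
import pandas as pd

# A stand-in for the dataset held by a branch
df = pd.DataFrame(np.random.rand(10_000, 10))

# One branch vs. three branches, each with its own copy of the data
one_branch = pickle.dumps({"main": df})
three_branches = pickle.dumps({"main": df, "b2": df.copy(), "b3": df.copy()})

print(f"{len(one_branch) / 1e6:.2f} MB vs {len(three_branches) / 1e6:.2f} MB")
```

Three independent copies pickle to roughly three times the size, which is why the saved file above grows with every branch created.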
To avoid large memory usage, set the memory parameter.
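The cache storage path reported in the output below points at a joblib directory, so the caching appears to follow joblib's Memory semantics: a transformer's result is written to disk keyed on its inputs, and identical calls are read back instead of recomputed. A minimal sketch of that mechanism with a hypothetical transform function (not ATOM's API):

```python
import tempfile

from joblib import Memory

# Fresh cache directory, so the first call is guaranteed to compute
memory = Memory(tempfile.mkdtemp(), verbose=0)

executions = []

@memory.cache
def transform(n):
    """Hypothetical expensive transformer step."""
    executions.append(n)  # record actual (non-cached) executions
    return n * 2

transform(21)  # computed and written to the cache
transform(21)  # identical input: the result is read from disk
print(len(executions))  # the function body ran only once
```

This is why, in the runs below, ATOM can skip refitting the Imputer, Encoder, Scaler, and Normalizer entirely.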
atom = ATOMClassifier(X, y="RainTomorrow", memory=tempdir, verbose=1, random_state=1)
atom.shrink(str2cat=True)
atom.impute()
atom.encode()
atom.run("LDA")
atom.branch = "b2"
atom.scale()
atom.run("LDA_scaled")
atom.branch = "b3_from_main"
atom.normalize()
atom.run("LDA_norm")
<< ================== ATOM ================== >>

Configuration ==================== >>
Algorithm task: Binary classification.
Cache storage: C:\Users\Mavs\AppData\Local\Temp\joblib

Dataset stats ==================== >>
Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 25.03 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)

The column dtypes are successfully converted.
Loading cached results for Imputer...
Loading cached results for Encoder...

Training ========================= >>
Models: LDA
Metric: f1

Results for LinearDiscriminantAnalysis:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5914
Test evaluation --> f1: 0.5892
Time elapsed: 0.953s
-------------------------------------------------
Time: 0.953s

Final results ==================== >>
Total time: 1.015s
-------------------------------------
LinearDiscriminantAnalysis --> f1: 0.5892

Successfully created new branch: b2.
Loading cached results for Scaler...

Training ========================= >>
Models: LDA_scaled
Metric: f1

Results for LinearDiscriminantAnalysis:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5914
Test evaluation --> f1: 0.5892
Time elapsed: 0.971s
-------------------------------------------------
Time: 0.971s

Final results ==================== >>
Total time: 1.028s
-------------------------------------
LinearDiscriminantAnalysis --> f1: 0.5892

Successfully created new branch: b3.
Loading cached results for Normalizer...
Training ========================= >>
Models: LDA_norm
Metric: f1

Results for LinearDiscriminantAnalysis:
Fit ---------------------------------------------
Train evaluation --> f1: 0.5957
Test evaluation --> f1: 0.5935
Time elapsed: 0.924s
-------------------------------------------------
Time: 0.924s

Final results ==================== >>
Total time: 0.985s
-------------------------------------
LinearDiscriminantAnalysis --> f1: 0.5935
# And now, it only takes a fraction of the previous size
# This is because the data of inactive branches is now stored locally
filename = os.path.join(tempdir, "atom2")
atom.save(filename)
get_size(filename)
ATOMClassifier successfully saved.
'24.78MB'
Additionally, repeated calls to the same transformers with the same data will use the cached results. Don't forget to specify the random_state parameter to ensure the data remains exactly the same.
atom = ATOMClassifier(X, y="RainTomorrow", memory=tempdir, verbose=1, random_state=1)
atom.shrink(str2cat=True)
<< ================== ATOM ================== >>

Configuration ==================== >>
Algorithm task: Binary classification.
Cache storage: C:\Users\Mavs\AppData\Local\Temp\joblib

Dataset stats ==================== >>
Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 25.03 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)

The column dtypes are successfully converted.
# Note the transformers are no longer fitted,
# instead the results are immediately read from cache
atom.impute()
atom.encode()
Loading cached results for Imputer...
Loading cached results for Encoder...
atom.dataset
 | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.070767 | 13.0 | 30.500000 | 0.000000 | 6.80000 | 10.000000 | 0.272677 | 59.0 | 0.254995 | 0.282496 | ... | 19.000000 | 8.00000 | 1013.599976 | 1008.000000 | 0.000000 | 2.00000 | 19.600000 | 29.900000 | 0.0 | 0 |
1 | 0.130163 | 8.8 | 25.200001 | 0.000000 | 5.00000 | 7.614201 | 0.285167 | 50.0 | 0.26967 | 0.278696 | ... | 68.842218 | 51.50239 | 1011.200012 | 1006.500000 | 4.446657 | 3.00000 | 15.900000 | 23.700001 | 0.0 | 1 |
2 | 0.262043 | 19.9 | 26.600000 | 8.000000 | 5.46491 | 7.614201 | 0.26658 | 57.0 | 0.254995 | 0.250291 | ... | 81.000000 | 81.00000 | 1013.099976 | 1008.599976 | 4.446657 | 4.50922 | 24.500000 | 24.700001 | 1.0 | 1 |
3 | 0.183912 | 19.6 | 31.900000 | 2.600000 | 5.46491 | 7.614201 | 0.26658 | 59.0 | 0.269775 | 0.220975 | ... | 70.000000 | 42.00000 | 1001.200012 | 1002.400024 | 2.000000 | 8.00000 | 25.799999 | 22.000000 | 1.0 | 0 |
4 | 0.258569 | 15.3 | 22.400000 | 16.000000 | 4.20000 | 3.300000 | 0.194464 | 39.0 | 0.245824 | 0.189182 | ... | 83.000000 | 63.00000 | 1025.500000 | 1023.599976 | 6.000000 | 6.00000 | 16.900000 | 21.100000 | 1.0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
142188 | 0.278746 | 9.0 | 21.799999 | 0.000000 | 5.46491 | 7.614201 | 0.158276 | 33.0 | 0.203597 | 0.277443 | ... | 44.000000 | 38.00000 | 1017.660981 | 1015.270396 | 4.446657 | 4.50922 | 16.600000 | 21.100000 | 0.0 | 1 |
142189 | 0.307562 | 11.5 | 19.200001 | 0.800000 | 2.00000 | 7.000000 | 0.158276 | 22.0 | 0.143946 | 0.187433 | ... | 73.000000 | 52.00000 | 1021.299988 | 1018.799988 | 3.000000 | 4.00000 | 17.100000 | 18.400000 | 0.0 | 0 |
142190 | 0.197839 | 17.5 | 29.100000 | 35.599998 | 5.46491 | 7.614201 | 0.158276 | 33.0 | 0.203597 | 0.180537 | ... | 77.000000 | 46.00000 | 1015.200012 | 1013.700012 | 4.446657 | 4.50922 | 21.000000 | 28.799999 | 1.0 | 0 |
142191 | 0.371853 | 5.9 | 18.000000 | 0.400000 | 0.80000 | 6.700000 | 0.285167 | 26.0 | 0.254995 | 0.278696 | ... | 92.000000 | 65.00000 | 1028.000000 | 1025.300049 | 3.000000 | 2.00000 | 9.400000 | 16.600000 | 0.0 | 0 |
142192 | 0.297818 | 10.2 | 18.100000 | 0.200000 | 5.46491 | 7.614201 | 0.205887 | 24.0 | 0.150067 | 0.221562 | ... | 84.000000 | 94.00000 | 1018.099976 | 1016.000000 | 4.446657 | 4.50922 | 15.300000 | 16.000000 | 0.0 | 0 |
142193 rows × 22 columns