Regression¶
This example shows how to use ATOM to apply PCA to the data and run a regression pipeline.
Download the abalone dataset from https://archive.ics.uci.edu/ml/datasets/Abalone. The goal is to predict the number of rings (a proxy for the shell's age) of abalone shells from physical measurements.
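If you don't have the file yet, a minimal sketch for fetching it from the UCI repository could look as follows (the raw file has no header row; the column names below come from the dataset description, and the local path is an assumption chosen to match the read_csv call further down):

# Fetch the raw abalone data from the UCI repository and save it locally.
# Column names follow the UCI dataset description; the target path is an
# assumption to match the rest of this example.
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
columns = [
    "Sex", "Length", "Diameter", "Height", "Whole weight",
    "Shucked weight", "Viscera weight", "Shell weight", "Rings",
]
pd.read_csv(url, header=None, names=columns).to_csv("./datasets/abalone.csv", index=False)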
Load the data¶
In [1]:
# Import packages
import pandas as pd
from atom import ATOMRegressor
In [2]:
# Load the data
X = pd.read_csv("./datasets/abalone.csv")
# Let's have a look
X.head()
Out[2]:
| | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.150 | 15 |
| 1 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 |
| 2 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 |
| 3 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 | 10 |
| 4 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 |
In [3]:
# Initialize atom for regression tasks
atom = ATOMRegressor(X, "Rings", verbose=2, warnings=False, random_state=42)
<< ================== ATOM ================== >>
Algorithm task: regression.

Dataset stats ==================== >>
Shape: (4177, 9)
Memory: 509.72 kB
Scaled: False
Categorical features: 1 (12.5%)
Outlier values: 196 (0.7%)
-------------------------------------
Train set size: 3342
Test set size: 835
In [4]:
# Encode the categorical features
atom.encode()
Fitting Encoder...
Encoding categorical columns...
 --> OneHot-encoding feature Sex. Contains 3 classes.
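To verify the result, the transformed feature set can be inspected directly; atom.X holds the features after all transformations applied so far:

# The Sex column should now be replaced by one column per class
atom.X.columns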
In [5]:
# Plot the dataset's correlation matrix
atom.plot_correlation()
In [6]:
# Apply PCA for dimensionality reduction
atom.feature_selection(strategy="pca", n_features=6)
Fitting FeatureSelector...
Performing feature selection ...
 --> Applying Principal Component Analysis...
   >>> Scaling features...
   >>> Explained variance ratio: 0.977
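The overall ratio is printed above, but the per-component breakdown can be checked on the fitted estimator as well. A minimal sketch, assuming the fitted PCA is exposed on the instance as atom.pca:

# Explained variance ratio per principal component (assumes the
# fitted transformer is accessible as atom.pca)
atom.pca.explained_variance_ratio_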
In [7]:
# Note that the features are automatically renamed to component_1, component_2, etc.
atom.columns
Out[7]:
['component_1', 'component_2', 'component_3', 'component_4', 'component_5', 'component_6', 'Rings']
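The transformed data itself stays available through atom's data attributes:

# First rows of the dataset after encoding and PCA
atom.dataset.head()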
In [8]:
# Use the plotting methods to see the retained variance ratio
atom.plot_pca()
In [9]:
atom.plot_components(figsize=(8, 6))
Run the pipeline¶
In [10]:
atom.run(
models=["Tree", "Bag", "ET"],
metric="MSE",
n_calls=5,
n_initial_points=2,
bo_params={"base_estimator": "GBRT"},
n_bootstrap=5,
)
Training ========================= >>
Models: Tree, Bag, ET
Metric: neg_mean_squared_error
Running BO for Decision Tree...
| call | criterion | splitter | max_depth | min_samples_split | min_samples_leaf | max_features | ccp_alpha | neg_mean_squared_error | best_neg_mean_squared_error | time | total_time |
| ---------------- | ----------- | -------- | --------- | ----------------- | ---------------- | ------------ | --------- | ---------------------- | --------------------------- | ------- | ---------- |
| Initial point 1 | absolute_.. | random | 12 | 8 | 19 | auto | 0.0161 | -8.0987 | -8.0987 | 0.093s | 0.100s |
| Initial point 2 | absolute_.. | best | 11 | 3 | 12 | None | 0.0 | -6.7018 | -6.7018 | 0.257s | 0.598s |
| Iteration 3 | absolute_.. | best | 11 | 4 | 1 | None | 0.0086 | -6.8759 | -6.7018 | 0.294s | 1.072s |
| Iteration 4 | absolute_.. | random | 12 | 3 | 18 | 0.9 | 0.0036 | -6.9111 | -6.7018 | 0.086s | 1.336s |
| Iteration 5 | friedman_.. | random | 3 | 12 | 19 | auto | 0.001 | -6.4336 | -6.4336 | 0.051s | 1.662s |
Bayesian Optimization ---------------------------
Best call --> Iteration 5
Best parameters --> {'criterion': 'friedman_mse', 'splitter': 'random', 'max_depth': 3, 'min_samples_split': 12, 'min_samples_leaf': 19, 'max_features': 'auto', 'ccp_alpha': 0.001}
Best evaluation --> neg_mean_squared_error: -6.4336
Time elapsed: 1.844s
Fit ---------------------------------------------
Train evaluation --> neg_mean_squared_error: -7.4877
Test evaluation --> neg_mean_squared_error: -7.5847
Time elapsed: 0.008s
Bootstrap ---------------------------------------
Evaluation --> neg_mean_squared_error: -7.4566 ± 0.1197
Time elapsed: 0.032s
-------------------------------------------------
Total time: 1.884s
Running BO for Bagging...
| call | n_estimators | max_samples | max_features | bootstrap | bootstrap_features | neg_mean_squared_error | best_neg_mean_squared_error | time | total_time |
| ---------------- | ------------ | ----------- | ------------ | --------- | ------------------ | ---------------------- | --------------------------- | ------- | ---------- |
| Initial point 1 | 112 | 0.9 | 0.6 | False | False | -6.5592 | -6.5592 | 0.908s | 0.912s |
| Initial point 2 | 131 | 0.5 | 0.5 | False | False | -5.4837 | -5.4837 | 0.608s | 1.594s |
| Iteration 3 | 302 | 0.5 | 0.5 | True | True | -6.0919 | -5.4837 | 1.208s | 2.971s |
| Iteration 4 | 191 | 0.5 | 0.5 | False | False | -5.3972 | -5.3972 | 0.929s | 4.106s |
| Iteration 5 | 217 | 0.5 | 0.5 | False | False | -4.9339 | -4.9339 | 1.008s | 5.299s |
Bayesian Optimization ---------------------------
Best call --> Iteration 5
Best parameters --> {'n_estimators': 217, 'max_samples': 0.5, 'max_features': 0.5, 'bootstrap': False, 'bootstrap_features': False}
Best evaluation --> neg_mean_squared_error: -4.9339
Time elapsed: 5.489s
Fit ---------------------------------------------
Train evaluation --> neg_mean_squared_error: -1.3974
Test evaluation --> neg_mean_squared_error: -5.7349
Time elapsed: 1.327s
Bootstrap ---------------------------------------
Evaluation --> neg_mean_squared_error: -5.9024 ± 0.058
Time elapsed: 5.496s
-------------------------------------------------
Total time: 12.314s
Running BO for Extra-Trees...
| call | n_estimators | criterion | max_depth | min_samples_split | min_samples_leaf | max_features | bootstrap | max_samples | ccp_alpha | neg_mean_squared_error | best_neg_mean_squared_error | time | total_time |
| ---------------- | ------------ | ------------- | --------- | ----------------- | ---------------- | ------------ | --------- | ----------- | --------- | ---------------------- | --------------------------- | ------- | ---------- |
| Initial point 1 | 112 | absolute_er.. | 3 | 9 | 7 | 0.6 | True | 0.6 | 0.0117 | -8.95 | -8.95 | 0.699s | 0.705s |
| Initial point 2 | 369 | absolute_er.. | None | 3 | 12 | None | True | 0.9 | 0.0216 | -6.6286 | -6.6286 | 6.961s | 7.746s |
| Iteration 3 | 126 | absolute_er.. | None | 5 | 20 | None | True | 0.9 | 0.0336 | -7.6668 | -6.6286 | 2.091s | 10.038s |
| Iteration 4 | 471 | absolute_er.. | 3 | 2 | 14 | 0.6 | True | 0.7 | 0.0019 | -7.853 | -6.6286 | 2.776s | 13.018s |
| Iteration 5 | 106 | absolute_er.. | None | 4 | 5 | None | True | 0.5 | 0.0263 | -6.2819 | -6.2819 | 1.326s | 14.562s |
Bayesian Optimization ---------------------------
Best call --> Iteration 5
Best parameters --> {'n_estimators': 106, 'criterion': 'absolute_error', 'max_depth': None, 'min_samples_split': 4, 'min_samples_leaf': 5, 'max_features': None, 'bootstrap': True, 'max_samples': 0.5, 'ccp_alpha': 0.0263}
Best evaluation --> neg_mean_squared_error: -6.2819
Time elapsed: 14.773s
Fit ---------------------------------------------
Train evaluation --> neg_mean_squared_error: -7.0748
Test evaluation --> neg_mean_squared_error: -7.2788
Time elapsed: 1.929s
Bootstrap ---------------------------------------
Evaluation --> neg_mean_squared_error: -7.2423 ± 0.0229
Time elapsed: 9.449s
-------------------------------------------------
Total time: 26.152s
Final results ==================== >>
Duration: 40.351s
-------------------------------------
Decision Tree --> neg_mean_squared_error: -7.4566 ± 0.1197 ~
Bagging --> neg_mean_squared_error: -5.9024 ± 0.058 ~ !
Extra-Trees --> neg_mean_squared_error: -7.2423 ± 0.0229 ~
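These scores are also stored on the instance, which makes it easy to compare models programmatically. A short sketch using atom's results table and winner shortcut:

# Overview of all trained models and their scores as a dataframe
atom.results

# Shortcut to the model that performed best on the test set
atom.winner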
Analyze the results¶
In [11]:
# Use the errors or residuals plots to check the model performances
atom.plot_residuals()
In [12]:
atom.plot_errors()
In [13]:
# Analyze the relation between the target response and the features
atom.n_jobs = 8 # The method can be slow...
atom.ET.plot_partial_dependence(columns=(0, (2, 3)), figsize=(12, 5))
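Once a model is chosen, it can be used for inference. A minimal sketch, where X_new is a hypothetical dataframe of unseen samples with the same raw columns as the training data (new data passes through the fitted pipeline before reaching the model):

# Predict the number of rings for unseen samples
# (X_new is a placeholder, not defined in this example)
predictions = atom.ET.predict(X_new)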