Regression

This example shows how to use ATOM to apply PCA to the data and run a regression pipeline.

Download the abalone dataset from https://archive.ics.uci.edu/ml/datasets/Abalone. The goal is to predict the number of rings (an indicator of age) of abalone shells from physical measurements.

Load the data

In [1]:

# Import packages
import pandas as pd
from atom import ATOMRegressor

In [2]:

# Load the data
X = pd.read_csv("./datasets/abalone.csv")

# Let's have a look
X.head()

Out[2]:

	Sex	Length	Diameter	Height	Whole weight	Shucked weight	Viscera weight	Shell weight	Rings
0	M	0.455	0.365	0.095	0.5140	0.2245	0.1010	0.150	15
1	M	0.350	0.265	0.090	0.2255	0.0995	0.0485	0.070	7
2	F	0.530	0.420	0.135	0.6770	0.2565	0.1415	0.210	9
3	M	0.440	0.365	0.125	0.5160	0.2155	0.1140	0.155	10
4	I	0.330	0.255	0.080	0.2050	0.0895	0.0395	0.055	7

In [3]:

# Initialize atom for regression tasks
atom = ATOMRegressor(X, "Rings", verbose=2, warnings=False, random_state=42)

<< ================== ATOM ================== >>
Algorithm task: regression.

Dataset stats ====================== >>
Shape: (4177, 9)
Scaled: False
Categorical features: 1 (12.5%)
Outlier values: 192 (0.6%)
---------------------------------------
Train set size: 3342
Test set size: 835
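The sizes in the summary above are consistent with a 20% hold-out set. The sketch below reproduces that arithmetic. The test fraction of 0.2 is an assumption inferred from the reported numbers; it is not set explicitly in the call shown in this example.

```python
# Reproduce the split sizes reported in the dataset summary.
# Assumption: a test fraction of 0.2, inferred from the reported sizes.
n_rows, n_cols = 4177, 9   # shape reported by ATOM
test_fraction = 0.2        # assumed, not set explicitly in the example

test_size = int(test_fraction * n_rows)
train_size = n_rows - test_size

n_features = n_cols - 1                    # every column except the target "Rings"
categorical_share = 1 / n_features * 100   # only "Sex" is categorical

print(train_size, test_size)         # 3342 835
print(f"{categorical_share:.1f}%")   # 12.5%
```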

In [4]:

# Encode the categorical features
atom.encode()

Fitting Encoder...
Encoding categorical columns...
 --> OneHot-encoding feature Sex. Contains 3 classes.
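To illustrate what one-hot encoding does to a three-class column such as Sex, here is a minimal plain-Python sketch using the five values visible in the X.head() output. The output column names (Sex_F, Sex_I, Sex_M) are hypothetical and may not match the names ATOM generates.

```python
# Minimal one-hot encoding of the "Sex" column, using the five rows
# shown in the X.head() output. Output column names are hypothetical.
sex = ["M", "M", "F", "M", "I"]

classes = sorted(set(sex))  # ['F', 'I', 'M']


def one_hot(values, classes):
    """Return one dict per row with a 0/1 indicator for every class."""
    return [{f"Sex_{c}": int(v == c) for c in classes} for v in values]


encoded = one_hot(sex, classes)
for row in encoded:
    print(row)
```

Each row ends up with exactly one indicator set to 1, so the single categorical column becomes three numeric ones.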

In [5]:

# Plot the dataset's correlation matrix
atom.plot_correlation()

In [6]:

# Apply PCA for dimensionality reduction
atom.feature_selection(strategy="pca", n_features=6)

Fitting FeatureSelector...
Performing feature selection ...
 --> Applying Principal Component Analysis...
   >>> Scaling features...
   >>> Total explained variance: 0.976
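The "Total explained variance" line reports the share of the data's variance retained by the components that were kept. A minimal sketch of that arithmetic follows. The eigenvalues are made up for illustration and are not the values from this dataset.

```python
# Explained variance ratio from PCA eigenvalues.
# The eigenvalues below are hypothetical, chosen only for illustration.
eigenvalues = [4.0, 2.0, 1.5, 1.0, 0.8, 0.4, 0.2, 0.1]
n_components = 6

total = sum(eigenvalues)
ratios = [ev / total for ev in eigenvalues]

# Variance retained when keeping the first n_components components
retained = sum(ratios[:n_components])
print(f"Total explained variance: {retained:.3f}")
```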

In [7]:

# Note that the features are automatically renamed to Component 1, 2, etc.
atom.columns

Out[7]:

['Component 1',
 'Component 2',
 'Component 3',
 'Component 4',
 'Component 5',
 'Component 6',
 'Rings']

In [8]:

# Use the plotting methods to see the retained variance ratio
atom.plot_pca()

In [9]:

atom.plot_components(figsize=(8, 6))

Run the pipeline

In [10]:

atom.run(
    models=["Tree", "Bag", "ET"],
    metric="MSE",
    n_calls=5,
    n_initial_points=2,
    bo_params={"base_estimator": "GBRT", "cv": 1},
    n_bootstrap=5,
)

Training ===================================== >>
Models: Tree, Bag, ET
Metric: neg_mean_squared_error


Running BO for Decision Tree...
Initial point 1 ---------------------------------
Parameters --> {'criterion': 'mae', 'splitter': 'random', 'max_depth': 7, 'min_samples_split': 8, 'min_samples_leaf': 19, 'max_features': None, 'ccp_alpha': 0.016}
Evaluation --> neg_mean_squared_error: -8.3677  Best neg_mean_squared_error: -8.3677
Time iteration: 0.049s   Total time: 0.049s
Initial point 2 ---------------------------------
Parameters --> {'criterion': 'mae', 'splitter': 'best', 'max_depth': 6, 'min_samples_split': 3, 'min_samples_leaf': 12, 'max_features': 0.9, 'ccp_alpha': 0.0}
Evaluation --> neg_mean_squared_error: -8.2055  Best neg_mean_squared_error: -8.2055
Time iteration: 0.163s   Total time: 0.344s
Iteration 3 -------------------------------------
Parameters --> {'criterion': 'mae', 'splitter': 'best', 'max_depth': 6, 'min_samples_split': 14, 'min_samples_leaf': 9, 'max_features': 0.9, 'ccp_alpha': 0.005}
Evaluation --> neg_mean_squared_error: -6.1540  Best neg_mean_squared_error: -6.1540
Time iteration: 0.171s   Total time: 0.659s
Iteration 4 -------------------------------------
Parameters --> {'criterion': 'mae', 'splitter': 'random', 'max_depth': 7, 'min_samples_split': 15, 'min_samples_leaf': 4, 'max_features': 0.7, 'ccp_alpha': 0.018}
Evaluation --> neg_mean_squared_error: -7.9567  Best neg_mean_squared_error: -6.1540
Time iteration: 0.063s   Total time: 0.868s
Iteration 5 -------------------------------------
Parameters --> {'criterion': 'mae', 'splitter': 'best', 'max_depth': 6, 'min_samples_split': 14, 'min_samples_leaf': 5, 'max_features': 0.9, 'ccp_alpha': 0.009}
Evaluation --> neg_mean_squared_error: -7.1330  Best neg_mean_squared_error: -6.1540
Time iteration: 0.165s   Total time: 1.186s

Results for Decision Tree:         
Bayesian Optimization ---------------------------
Best parameters --> {'criterion': 'mae', 'splitter': 'best', 'max_depth': 6, 'min_samples_split': 14, 'min_samples_leaf': 9, 'max_features': 0.9, 'ccp_alpha': 0.005}
Best evaluation --> neg_mean_squared_error: -6.154
Time elapsed: 1.377s
Fit ---------------------------------------------
Train evaluation --> neg_mean_squared_error: -6.3073
Test evaluation --> neg_mean_squared_error: -5.5317
Time elapsed: 0.260s
Bootstrap ---------------------------------------
Evaluation --> neg_mean_squared_error: -5.678 ± 0.2464
Time elapsed: 0.991s
-------------------------------------------------
Total time: 2.628s


Running BO for Bagging Regressor...
Initial point 1 ---------------------------------
Parameters --> {'n_estimators': 112, 'max_samples': 0.9, 'max_features': 0.6, 'bootstrap': False, 'bootstrap_features': False}
Evaluation --> neg_mean_squared_error: -5.7680  Best neg_mean_squared_error: -5.7680
Time iteration: 0.938s   Total time: 0.954s
Initial point 2 ---------------------------------
Parameters --> {'n_estimators': 131, 'max_samples': 0.5, 'max_features': 0.5, 'bootstrap': False, 'bootstrap_features': False}
Evaluation --> neg_mean_squared_error: -6.8254  Best neg_mean_squared_error: -5.7680
Time iteration: 0.575s   Total time: 1.560s
Iteration 3 -------------------------------------
Parameters --> {'n_estimators': 50, 'max_samples': 0.9, 'max_features': 0.6, 'bootstrap': False, 'bootstrap_features': True}
Evaluation --> neg_mean_squared_error: -5.4895  Best neg_mean_squared_error: -5.4895
Time iteration: 0.557s   Total time: 2.258s
Iteration 4 -------------------------------------
Parameters --> {'n_estimators': 74, 'max_samples': 0.5, 'max_features': 0.5, 'bootstrap': False, 'bootstrap_features': True}
Evaluation --> neg_mean_squared_error: -6.0363  Best neg_mean_squared_error: -5.4895
Time iteration: 0.352s   Total time: 2.802s
Iteration 5 -------------------------------------
Parameters --> {'n_estimators': 36, 'max_samples': 0.9, 'max_features': 0.6, 'bootstrap': True, 'bootstrap_features': False}
Evaluation --> neg_mean_squared_error: -6.0037  Best neg_mean_squared_error: -5.4895
Time iteration: 0.295s   Total time: 3.253s

Results for Bagging Regressor:         
Bayesian Optimization ---------------------------
Best parameters --> {'n_estimators': 50, 'max_samples': 0.9, 'max_features': 0.6, 'bootstrap': False, 'bootstrap_features': True}
Best evaluation --> neg_mean_squared_error: -5.4895
Time elapsed: 3.427s
Fit ---------------------------------------------
Train evaluation --> neg_mean_squared_error: -0.0867
Test evaluation --> neg_mean_squared_error: -4.9533
Time elapsed: 0.519s
Bootstrap ---------------------------------------
Evaluation --> neg_mean_squared_error: -5.2363 ± 0.1099
Time elapsed: 2.077s
-------------------------------------------------
Total time: 6.023s


Running BO for Extra-Trees...
Initial point 1 ---------------------------------
Parameters --> {'n_estimators': 112, 'criterion': 'mae', 'max_depth': 1, 'min_samples_split': 9, 'min_samples_leaf': 7, 'max_features': 0.6, 'bootstrap': True, 'ccp_alpha': 0.016, 'max_samples': 0.6}
Evaluation --> neg_mean_squared_error: -10.2607  Best neg_mean_squared_error: -10.2607
Time iteration: 0.363s   Total time: 0.363s
Initial point 2 ---------------------------------
Parameters --> {'n_estimators': 369, 'criterion': 'mae', 'max_depth': None, 'min_samples_split': 3, 'min_samples_leaf': 12, 'max_features': 0.9, 'bootstrap': True, 'ccp_alpha': 0.035, 'max_samples': 0.8}
Evaluation --> neg_mean_squared_error: -9.4727  Best neg_mean_squared_error: -9.4727
Time iteration: 4.587s   Total time: 4.986s
Iteration 3 -------------------------------------
Parameters --> {'n_estimators': 385, 'criterion': 'mse', 'max_depth': None, 'min_samples_split': 6, 'min_samples_leaf': 18, 'max_features': 0.9, 'bootstrap': False, 'ccp_alpha': 0.02}
Evaluation --> neg_mean_squared_error: -5.5174  Best neg_mean_squared_error: -5.5174
Time iteration: 0.493s   Total time: 5.635s
Iteration 4 -------------------------------------
Parameters --> {'n_estimators': 425, 'criterion': 'mse', 'max_depth': 1, 'min_samples_split': 20, 'min_samples_leaf': 19, 'max_features': 0.7, 'bootstrap': False, 'ccp_alpha': 0.016}
Evaluation --> neg_mean_squared_error: -9.1980  Best neg_mean_squared_error: -5.5174
Time iteration: 0.297s   Total time: 6.240s
Iteration 5 -------------------------------------
Parameters --> {'n_estimators': 445, 'criterion': 'mse', 'max_depth': None, 'min_samples_split': 7, 'min_samples_leaf': 20, 'max_features': 0.6, 'bootstrap': False, 'ccp_alpha': 0.004}
Evaluation --> neg_mean_squared_error: -6.9959  Best neg_mean_squared_error: -5.5174
Time iteration: 0.481s   Total time: 6.908s

Results for Extra-Trees:         
Bayesian Optimization ---------------------------
Best parameters --> {'n_estimators': 385, 'criterion': 'mse', 'max_depth': None, 'min_samples_split': 6, 'min_samples_leaf': 18, 'max_features': 0.9, 'bootstrap': False, 'ccp_alpha': 0.02}
Best evaluation --> neg_mean_squared_error: -5.5174
Time elapsed: 7.087s
Fit ---------------------------------------------
Train evaluation --> neg_mean_squared_error: -6.1021
Test evaluation --> neg_mean_squared_error: -5.0002
Time elapsed: 0.643s
Bootstrap ---------------------------------------
Evaluation --> neg_mean_squared_error: -4.9204 ± 0.0591
Time elapsed: 2.950s
-------------------------------------------------
Total time: 10.680s


Final results ========================= >>
Duration: 19.330s
------------------------------------------
Decision Tree     --> neg_mean_squared_error: -5.678 ± 0.2464 ~
Bagging Regressor --> neg_mean_squared_error: -5.2363 ± 0.1099 ~
Extra-Trees       --> neg_mean_squared_error: -4.9204 ± 0.0591 ~ !
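Two details of the log above are worth spelling out. The metric is reported as neg_mean_squared_error, that is, the MSE with its sign flipped so that larger values are better. The bootstrap lines report a mean ± standard deviation over n_bootstrap=5 evaluations. The sketch below shows one common way to obtain such numbers: resample the training set with replacement, refit, and score on a fixed test set. This procedure is an assumption and ATOM's exact implementation may differ. The toy mean-predictor and the target values are hypothetical.

```python
import random
import statistics


def neg_mse(y_true, y_pred):
    """Negated mean squared error, so that larger values are better."""
    return -sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_scores(y_train, y_test, n_bootstrap=5, seed=42):
    """Score a toy mean-predictor on n_bootstrap resamples of the training set."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_bootstrap):
        sample = rng.choices(y_train, k=len(y_train))  # draw with replacement
        prediction = statistics.mean(sample)           # "fit" the toy model
        scores.append(neg_mse(y_test, [prediction] * len(y_test)))
    return scores


# Toy target values (hypothetical, not the actual train/test split)
y_train = [15, 7, 9, 10, 7, 8, 20, 16, 9, 19]
y_test = [14, 10, 11, 10, 7]

scores = bootstrap_scores(y_train, y_test)
# Sample standard deviation is used here; a library may use the population form.
print(f"{statistics.mean(scores):.4f} ± {statistics.stdev(scores):.4f}")
```

Because the scores are negated errors, the model with the value closest to zero (Extra-Trees in the final results) has the lowest MSE.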

Analyze the results

In [11]:

# Use the errors or residuals plots to check model performance
atom.plot_residuals()
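As a reminder of what a residuals plot visualizes, the sketch below computes residuals by hand. It assumes the usual convention of true value minus prediction; the sign convention used by the plot may differ. The true values are the Rings entries from the X.head() output and the predictions are hypothetical.

```python
def residuals(y_true, y_pred):
    """Residual per sample, using the convention true minus predicted."""
    return [t - p for t, p in zip(y_true, y_pred)]


y_true = [15, 7, 9, 10, 7]            # Rings values from X.head()
y_pred = [12.0, 8.0, 9.5, 10.0, 6.5]  # hypothetical predictions

res = residuals(y_true, y_pred)
mean_residual = sum(res) / len(res)

print(res)            # [3.0, -1.0, -0.5, 0.0, 0.5]
print(mean_residual)  # a value far from 0 would hint at systematic bias
```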

In [12]:

atom.plot_errors()

In [13]:

# Analyze the relation between the target response and the features
atom.n_jobs = 8  # The method can be slow...
atom.ET.plot_partial_dependence(features=(0, (2, 3)), figsize=(12, 5))
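Partial dependence shows how the average prediction changes when one feature (or a pair of features) is forced to a given value while the other features keep their observed values. In the call above, features=(0, (2, 3)) presumably requests a one-way plot for the feature at index 0 and a two-way plot for the pair at indices 2 and 3. The sketch below computes a one-way curve by brute force. The toy model and data are hypothetical stand-ins for a fitted estimator and the PCA components.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        # Overwrite the chosen feature in every row, keep the rest as observed
        modified = [row[:feature] + [value] + row[feature + 1:] for row in rows]
        predictions = [predict(row) for row in modified]
        averages.append(sum(predictions) / len(predictions))
    return averages


def toy_model(row):
    """Hypothetical stand-in for a fitted model's predict function."""
    return 2.0 * row[0] + row[1] * row[2]


rows = [[0.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 2.0, 2.0]]
curve = partial_dependence(toy_model, rows, feature=0, grid=[0.0, 1.0, 2.0])
print(curve)  # [2.0, 4.0, 6.0]
```

Every grid value requires a prediction for every row, which is why the method can be slow on larger datasets and benefits from parallel jobs.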