Ordinary Least Squares (OLS)

needs scaling, accepts sparse, supports GPU

Ordinary Least Squares is linear regression without regularization. It fits a linear model with coefficients w = (w1, …, wp) that minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation.
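To make the objective concrete, here is a minimal sketch for the single-feature case, y ≈ w0 + w1 * x, using the closed-form least-squares solution in plain Python. The function names fit_ols_1d and rss and the toy data are illustrative assumptions. The actual estimator handles any number of features.

```python
def fit_ols_1d(x, y):
    """Closed-form OLS fit of y ≈ w0 + w1 * x for a single feature."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    w1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    w0 = mean_y - w1 * mean_x
    return w0, w1


def rss(x, y, w0, w1):
    """Residual sum of squares of the line w0 + w1 * x."""
    return sum((yi - (w0 + w1 * xi)) ** 2 for xi, yi in zip(x, y))


# Toy data (assumption, for illustration only).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]

w0, w1 = fit_ols_1d(x, y)
best = rss(x, y, w0, w1)

# Moving either coefficient away from the solution increases the RSS.
shifted_intercept = rss(x, y, w0 + 0.1, w1)
shifted_slope = rss(x, y, w0, w1 - 0.1)
print(w0, w1, best, shifted_intercept, shifted_slope)
```

Because the residual sum of squares is a convex quadratic in the coefficients, any perturbation of the fitted values gives a larger (or equal) error, which is what the last lines check.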

Corresponding estimators are:

LinearRegression for regression tasks.

Hyperparameters

By default, the estimator adopts the default parameters provided by its package. See the user guide on how to customize them.
The n_jobs parameter is set equal to that of the trainer.
OLS has no hyperparameters to tune with the Bayesian optimization (BO).

Attributes

Data attributes

Attributes:

dataset: pd.DataFrame
Complete dataset in the pipeline.

train: pd.DataFrame
Training set.

test: pd.DataFrame
Test set.

X: pd.DataFrame
Feature set.

y: pd.Series
Target column.

X_train: pd.DataFrame
Training features.

y_train: pd.Series
Training target.

X_test: pd.DataFrame
Test features.

y_test: pd.Series
Test target.

shape: tuple
Shape of the dataset: (n_rows, n_columns), or (n_rows, (shape_sample), n_cols) for datasets with more than two dimensions.

columns: pd.Index
Names of the columns in the dataset.

n_columns: int
Number of columns in the dataset.

features: pd.Index
Names of the features in the dataset.

n_features: int
Number of features in the dataset.

target: str
Name of the target column.

Utility attributes

Attributes:

estimator: class
Estimator instance with the best combination of hyperparameters fitted on the complete training set.

time_fit: str
Time it took to train the model on the complete training set and calculate the metric(s) on the test set.

metric_train: float or list
Metric score(s) on the training set.

metric_test: float or list
Metric score(s) on the test set.

metric_bootstrap: np.array
Bootstrap results with shape=(n_bootstrap,) for single-metric runs and shape=(metric, n_bootstrap) for multi-metric runs.

mean_bootstrap: float or list
Mean of the bootstrap results. List of values for multi-metric runs.

std_bootstrap: float or list
Standard deviation of the bootstrap results. List of values for multi-metric runs.

results: pd.Series
Training results. Entries include:

metric_bo: Best score achieved during the BO.
time_bo: Time spent on the BO.
metric_train: Metric score on the training set.
metric_test: Metric score on the test set.
time_fit: Time spent fitting and evaluating.
mean_bootstrap: Mean score of the bootstrap results.
std_bootstrap: Standard deviation of the bootstrap results.
time_bootstrap: Time spent on the bootstrap algorithm.
time: Total time spent on the whole run.
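As an illustration of how metric_bootstrap, mean_bootstrap and std_bootstrap relate to each other, the sketch below runs a small bootstrap by hand. It assumes that each round refits the model on a resample (with replacement) of the training set and scores it on the test set, and that the standard deviation is the population one. Both are assumptions made for this example, as are the helper names and the synthetic data.

```python
import random
import statistics


def fit_line(x, y):
    """Closed-form least-squares fit of y ≈ w0 + w1 * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    w1 = sxy / sxx
    return mean_y - w1 * mean_x, w1


def neg_mse(x, y, w0, w1):
    """Negative mean squared error, so that higher is better."""
    errors = [(yi - (w0 + w1 * xi)) ** 2 for xi, yi in zip(x, y)]
    return -sum(errors) / len(errors)


rng = random.Random(1)  # fixed seed for reproducibility

# Synthetic train/test data around the line y = 2x + 1 (assumption).
x_train = [float(i) for i in range(20)]
y_train = [2.0 * xi + 1.0 + rng.gauss(0, 1) for xi in x_train]
x_test = [0.5, 5.5, 10.5, 15.5]
y_test = [2.0 * xi + 1.0 + rng.gauss(0, 1) for xi in x_test]

n_bootstrap = 5
metric_bootstrap = []
for _ in range(n_bootstrap):
    # Resample the training set with replacement.
    idx = rng.choices(range(len(x_train)), k=len(x_train))
    w0, w1 = fit_line([x_train[i] for i in idx], [y_train[i] for i in idx])
    metric_bootstrap.append(neg_mse(x_test, y_test, w0, w1))

# One score per bootstrap round, summarized by mean and standard deviation.
mean_bootstrap = statistics.fmean(metric_bootstrap)
std_bootstrap = statistics.pstdev(metric_bootstrap)
print(metric_bootstrap, mean_bootstrap, std_bootstrap)
```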

Prediction attributes

The prediction attributes are not calculated until the attribute is called for the first time. This mechanism avoids having to calculate attributes that are never used, saving time and memory.

Prediction attributes:

predict_train: np.array
Predictions of the model on the training set.

predict_test: np.array
Predictions of the model on the test set.

score_train: np.float64
Model's score on the training set.

score_test: np.float64
Model's score on the test set.
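The lazy evaluation described above is a common pattern. The sketch below shows one generic way to get that behavior with functools.cached_property from the standard library. The LazyModel class and its n_computations counter are hypothetical and are not taken from ATOM's source code.

```python
from functools import cached_property


class LazyModel:
    """Hypothetical model wrapper, used only to illustrate lazy attributes."""

    def __init__(self, intercept, coef, X_test):
        self.intercept = intercept
        self.coef = coef
        self.X_test = X_test
        self.n_computations = 0  # counts how often predictions are computed

    @cached_property
    def predict_test(self):
        # Runs only on first access; the result is then stored on the instance.
        self.n_computations += 1
        return [
            self.intercept + sum(w * v for w, v in zip(self.coef, row))
            for row in self.X_test
        ]


model = LazyModel(intercept=1.0, coef=[2.0, -1.0], X_test=[[1.0, 1.0], [3.0, 0.5]])
first = model.predict_test   # triggers the computation
second = model.predict_test  # served from the cache, nothing is recomputed
print(first, model.n_computations)
```

Until predict_test is accessed, no predictions exist in memory, and repeated access does not repeat the work.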

Methods

The majority of the plots and prediction methods can be called directly from the model, e.g. atom.ols.plot_permutation_importance() or atom.ols.predict(X). The remaining utility methods can be found hereunder.

clear	Clear attributes from the model.
cross_validate	Evaluate the model using cross-validation.
delete	Delete the model from the trainer.
dashboard	Create an interactive dashboard to analyze the model.
evaluate	Get the model's scores for the provided metrics.
export_pipeline	Export the model's pipeline to a sklearn-like Pipeline object.
full_train	Train the estimator on the complete dataset.
rename	Change the model's tag.
save_estimator	Save the estimator to a pickle file.
transform	Transform new data through the model's branch.

method clear()

Reset attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the class. The cleared attributes per model are:

method cross_validate(**kwargs)

Evaluate the model using cross-validation. This method cross-validates the whole pipeline on the complete dataset. Use it to assess the robustness of the solution's performance. The return of sklearn's cross_validate function is stored under the cv attribute.

Parameters:

**kwargs
Additional keyword arguments for sklearn's cross_validate function. If the scoring method is not specified, it uses the trainer's metric.

Returns:

pd.DataFrame
Overview of the results.
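The method itself delegates to sklearn's cross_validate function. Purely to illustrate the underlying idea, the sketch below implements a plain k-fold loop by hand: the model is refit on each training fold and scored on the held-out fold. The helper names, the unshuffled contiguous folds, the negative-MSE score and the toy data are all assumptions made for this example.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for contiguous folds without shuffling."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if not start <= i < stop]
        yield train_idx, test_idx
        start = stop


def fit_line(x, y):
    """Closed-form least-squares fit of y ≈ w0 + w1 * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    w1 = sxy / sxx
    return mean_y - w1 * mean_x, w1


def cross_validate_line(x, y, n_splits=5):
    """Refit on every training fold and score (negative MSE) on the held-out fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(x), n_splits):
        w0, w1 = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        errors = [(y[i] - (w0 + w1 * x[i])) ** 2 for i in test_idx]
        scores.append(-sum(errors) / len(errors))
    return scores


# Toy data: a line with a small alternating offset (assumption).
x = [float(i) for i in range(10)]
y = [3.0 * xi - 2.0 + (0.5 if i % 2 else -0.5) for i, xi in enumerate(x)]

scores = cross_validate_line(x, y, n_splits=5)
print(scores)
```

A large spread between the fold scores suggests that the model's performance depends strongly on which rows it was trained on.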

method delete()

Delete the model from the trainer. If it's the last model in the trainer, the metric is reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. The model is not removed from any active mlflow experiment.

method dashboard(dataset="test", filename=None, **kwargs)

Create an interactive dashboard to analyze the model. The dashboard allows you to investigate SHAP values, permutation importances, interaction effects, partial dependence plots, all kinds of performance plots, and even individual decision trees. By default, the dashboard opens in an external dash app.

Parameters:

dataset: str, optional (default="test")
Data set to get the report from. Choose from: "train", "test", "both" (train and test) or "holdout".

filename: str or None, optional (default=None)
Name to save the file with (as .html). None to not save anything.

**kwargs
Additional keyword arguments for the ExplainerDashboard instance.

Returns:

ExplainerDashboard
Created dashboard object.

method evaluate(metric=None, dataset="test", sample_weight=None)

Get the model's scores for the provided metrics.

Parameters:

metric: str, func, scorer, sequence or None, optional (default=None)
Metrics to calculate. If None, a selection of the most common metrics per task is used.

dataset: str, optional (default="test")
Data set on which to calculate the metric. Choose from: "train", "test" or "holdout".

sample_weight: sequence or None, optional (default=None)
Sample weights corresponding to y in dataset.

Returns: pd.Series
Scores of the model.

method export_pipeline(memory=None, verbose=None)

Export the model's pipeline to a sklearn-like Pipeline object. If the model used automated feature scaling, the scaler is added to the pipeline. The returned pipeline is already fitted on the training set.

Info

ATOM's Pipeline class behaves the same as a sklearn Pipeline, and additionally:

Accepts transformers that change the target column.
Accepts transformers that drop rows.
Accepts transformers that are only fitted on a subset of the provided dataset.
Always outputs pandas objects.
Uses transformers that are only applied on the training set (see the balance or prune methods) to fit the pipeline, not to make predictions on unseen data.
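The last point in the list above can be sketched as follows. This is a simplified, hypothetical pipeline (SketchPipeline, DropOutliers and Center are invented names, and the train_only flag is an assumption), not ATOM's Pipeline class. It only shows the idea that a train-only step takes part in fitting but is skipped when transforming unseen data.

```python
class DropOutliers:
    """Hypothetical train-only step: drops rows whose feature exceeds a limit."""

    train_only = True

    def __init__(self, limit):
        self.limit = limit

    def fit(self, X, y):
        return self

    def transform(self, X, y):
        keep = [i for i, value in enumerate(X) if abs(value) <= self.limit]
        return [X[i] for i in keep], [y[i] for i in keep]


class Center:
    """Hypothetical step applied to all data: subtracts the training mean."""

    train_only = False

    def fit(self, X, y):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X, y):
        return [value - self.mean_ for value in X], y


class SketchPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, X, y):
        # During fitting, every step (train-only or not) is fitted and applied.
        for step in self.steps:
            X, y = step.fit(X, y).transform(X, y)
        return X, y

    def transform(self, X, y=None):
        # On unseen data, train-only steps are skipped.
        for step in self.steps:
            if step.train_only:
                continue
            X, y = step.transform(X, y)
        return X, y


pipe = SketchPipeline([DropOutliers(limit=10), Center()])
X_fit, y_fit = pipe.fit_transform([1.0, 2.0, 3.0, 100.0], [1, 2, 3, 4])
X_new, _ = pipe.transform([100.0])
print(X_fit, y_fit, X_new)
```

The outlier row is removed while fitting, so the centering step learns its mean from the remaining rows, but the same extreme value is kept (and only centered) when it appears in new data.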

Parameters:

memory: bool, str, Memory or None, optional (default=None)
Used to cache the fitted transformers of the pipeline.

If None or False: No caching is performed.
If True: A default temp directory is used.
If str: Path to the caching directory.
If Memory: Object with the joblib.Memory interface.

verbose: int or None, optional (default=None)
Verbosity level of the transformers in the pipeline. If None, it leaves them to their original verbosity. Note that this is not the pipeline's own verbose parameter. To change that, use the set_params method.

Returns: Pipeline
Current branch as a sklearn-like Pipeline object.

method full_train(include_holdout=False)

In some cases it might be desirable to use all available data to train a final model. Note that doing this means that the estimator can no longer be evaluated on the test set. The newly retrained estimator will replace the estimator attribute. If there is an active mlflow experiment, a new run is started with the name [model_name]_full_train. Since the estimator changed, the model is cleared.

Warning

Although the model is trained on the complete dataset, the pipeline is not! To also get the fully trained pipeline, use: pipeline = atom.export_pipeline().fit(X, y).

Parameters:

include_holdout: bool, optional (default=False)
Whether to include the holdout data set (if available) in the training of the estimator. If True, the model can no longer be evaluated.

method rename(name=None)

Change the model's tag. The acronym always stays at the beginning of the model's name. If the model is being tracked by mlflow, the name of the corresponding run is also changed.

Parameters:

name: str or None, optional (default=None)
New tag for the model. If None, the tag is removed.

method save_estimator(filename="auto")

Save the estimator to a pickle file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.
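A file written in pickle format can be read back with Python's pickle module. The sketch below shows the round trip with a plain dictionary standing in for a fitted estimator. The stand-in object and the file name are assumptions made for this example, and the name produced by "auto" is not shown here.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted estimator (assumption: any picklable object
# is saved and loaded the same way).
estimator = {"coef_": [1.99], "intercept_": 1.04}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ols_estimator.pkl")  # hypothetical file name

    # Write the object to disk in pickle format.
    with open(path, "wb") as f:
        pickle.dump(estimator, f)

    # Load it back, for example in a later session.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.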

method transform(X, y=None, verbose=None)

Transform new data through the model's branch. Transformers that are only applied on the training set are skipped. If the model used feature scaling, the data is also scaled.

Parameters:

X: dataframe-like
Features to transform, with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)

If None: y is ignored in the transformers.
If int: Position of the target column in X.
If str: Name of the target column in X.
Else: Target column with shape=(n_samples,).

verbose: int or None, optional (default=None)
Verbosity level of the output. If None, it uses the transformer's own verbosity.

Returns:

pd.DataFrame
Transformed feature set.

pd.Series
Transformed target column. Only returned if provided.
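The four accepted forms of y can be pictured with a small dispatch function. The sketch below uses a dictionary of columns as a stand-in for a dataframe. The function name split_target is hypothetical, and removing the target column from X when y is given as a position or name is an assumption about the behavior, not something stated above.

```python
def split_target(X, y=None):
    """Resolve y into a target column.

    X is a dict mapping column names to lists of values (a stand-in
    for a dataframe). Returns the feature columns and the target.
    """
    if y is None:
        # No target provided: y is ignored.
        return X, None
    if isinstance(y, int):
        # Position of the target column in X, converted to its name.
        y = list(X)[y]
    if isinstance(y, str):
        # Name of the target column in X.
        features = {name: col for name, col in X.items() if name != y}
        return features, X[y]
    # Otherwise: y is already a sequence with one value per sample.
    return X, list(y)


data = {"age": [31, 45, 27], "bmi": [22.1, 27.4, 24.0], "target": [151, 75, 141]}

by_name = split_target(data, "target")
by_position = split_target(data, -1)
by_sequence = split_target({"age": [31, 45, 27]}, (151, 75, 141))
print(by_name, by_position, by_sequence)
```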

Example

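from atom import ATOMRegressor
from sklearn.datasets import load_diabetes

# Example regression dataset (any feature set X and target y works)
X, y = load_diabetes(return_X_y=True, as_frame=True)

atom = ATOMRegressor(X, y)
atom.run(models="OLS")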