DirectRegressor
Train and evaluate the models in a direct fashion.
The following steps are applied to every model:
- Apply hyperparameter tuning (optional).
- Fit the model on the training set using the best combination of hyperparameters found.
- Evaluate the model on the test set.
- Train the estimator on various bootstrapped samples of the training set and evaluate again on the test set (optional).
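The steps above can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the one-feature least-squares model, the helper names (`fit_line`, `predict`, `r2`) and the synthetic data are all assumptions made for the example, and the optional tuning step is skipped.

```python
import random


def fit_line(xs, ys):
    """Closed-form least squares for y = a * x + b (single feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def predict(model, xs):
    a, b = model
    return [a * x + b for x in xs]


def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Synthetic data: y = 2x + 1 plus Gaussian noise.
rng = random.Random(1)
xs = [i / 10 for i in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

# Hold out every fourth row as the test set.
rows = list(zip(xs, ys))
train = [row for i, row in enumerate(rows) if i % 4]
test = [row for i, row in enumerate(rows) if i % 4 == 0]
x_test, y_test = zip(*test)

# Fit on the training set, then evaluate on the test set.
model = fit_line(*zip(*train))
test_score = r2(y_test, predict(model, x_test))

# Bootstrapping: refit on samples drawn with replacement from the
# training set and evaluate each fit on the same test set.
n_bootstrap = 5
bootstrap_scores = []
for _ in range(n_bootstrap):
    sample = rng.choices(train, k=len(train))
    boot_model = fit_line(*zip(*sample))
    bootstrap_scores.append(r2(y_test, predict(boot_model, x_test)))
```

The spread of `bootstrap_scores` gives a rough idea of how sensitive the test score is to the composition of the training set.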
Parameters |
models: str, estimator or sequence, default=None
Models to fit to the data. Allowed inputs are: an acronym from
any of the predefined models, an ATOMModel or a custom
predictor as class or instance. If None, all the predefined
models are used.
metric: str, func, scorer, sequence or None, default=None
Metric on which to fit the models. Choose from any of sklearn's
scorers, a function with signature function(y_true, y_pred,
**kwargs) -> score, a scorer object or a sequence of these. If
None, the default metric r2 is selected.
n_trials: int, dict or sequence, default=0
Maximum number of iterations for the hyperparameter tuning.
If 0, skip the tuning and fit the model on its default
parameters. If sequence, the n-th value applies to the n-th
model.
est_params: dict or None, default=None
Additional parameters for the models. See their corresponding
documentation for the available options. For multiple models,
use the acronyms as key (or 'all' for all models) and a dict
of the parameters as value. Add _fit to the parameter's name
to pass it to the estimator's fit method instead of the
constructor.
ht_params: dict or None, default=None
Additional parameters for the hyperparameter tuning. If None,
it uses the same parameters as the first run. Can include:
n_bootstrap: int or sequence, default=0
Number of data sets to use for bootstrapping. If 0, no
bootstrapping is performed. If sequence, the n-th value applies
to the n-th model.
parallel: bool, default=False
Whether to train the models in a parallel or sequential
fashion. Using parallel=True turns off the verbosity of the
models during training. Note that many models also have
built-in parallelizations (often when the estimator has the
n_jobs parameter).
errors: str, default="skip"
How to handle exceptions encountered during model training.
Choose from:
n_jobs: int, default=1
Number of cores to use for parallel processing.
device: str, default="cpu"
Device on which to run the estimators. Use any string that
follows the SYCL_DEVICE_FILTER filter selector, e.g.
device="gpu" to use the GPU. Read more in the
user guide.
engine: str, dict or None, default=None
Execution engine to use for data and
estimators. The value should be
one of the possible values to change one of the two engines,
or a dictionary with keys data and estimator, with their
corresponding choice as values to change both engines. If
None, the default values are used. Choose from:
backend: str, default="loky"
Parallelization backend. Read more in the
user guide. Choose from:
memory: bool, str, Path or Memory, default=False
Enables caching for memory optimization. Read more in the
user guide.
verbose: int, default=0
Verbosity level of the class. Choose from:
warnings: bool or str, default=False
Changing this parameter affects the
experiment: str or None, default=None
Name of the mlflow experiment to use for tracking.
If None, no mlflow tracking is performed.
random_state: int or None, default=None
Seed used by the random number generator. If None, the random
number generator is the RandomState used by np.random .
|
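Two of the parameter conventions above can be illustrated with a short sketch: a custom metric that follows the documented `function(y_true, y_pred, **kwargs) -> score` signature, and the `_fit` suffix in `est_params`. The helper `split_est_params` is hypothetical and only shows one way such a suffix could be interpreted; how the library handles it internally is not described here.

```python
def mean_absolute_error(y_true, y_pred, **kwargs):
    """Custom metric with the signature function(y_true, y_pred, **kwargs) -> score."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def split_est_params(est_params):
    """Hypothetical helper: separate constructor and fit parameters.

    Names ending in `_fit` are assumed to go to the estimator's fit
    method (with the suffix removed); all others go to the constructor.
    """
    init_params, fit_params = {}, {}
    for name, value in est_params.items():
        if name.endswith("_fit"):
            fit_params[name[: -len("_fit")]] = value
        else:
            init_params[name] = value
    return init_params, fit_params


score = mean_absolute_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0])

init_params, fit_params = split_est_params(
    {"n_estimators": 200, "sample_weight_fit": [1, 1, 2]}
)
```

Here `n_estimators` would end up in the constructor and `sample_weight` in the call to fit.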
See Also
Example
>>> from atom.training import DirectRegressor
>>> from sklearn.datasets import load_digits
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_digits(return_X_y=True, as_frame=True)
>>> train, test = train_test_split(
... X.merge(y.to_frame(), left_index=True, right_index=True),
... test_size=0.3,
... )
>>> runner = DirectRegressor(models=["OLS", "RF"], verbose=2)
>>> runner.run(train, test)
Training ========================= >>
Models: OLS, RF
Metric: r2
Results for OrdinaryLeastSquares:
Fit ---------------------------------------------
Train evaluation --> r2: 0.5993
Test evaluation --> r2: 0.5645
Time elapsed: 0.423s
-------------------------------------------------
Time: 0.423s
Results for RandomForest:
Fit ---------------------------------------------
Train evaluation --> r2: 0.9782
Test evaluation --> r2: 0.8468
Time elapsed: 0.955s
-------------------------------------------------
Time: 0.955s
Final results ==================== >>
Total time: 1.380s
-------------------------------------
OrdinaryLeastSquares --> r2: 0.5645
RandomForest --> r2: 0.8468 !
>>> # Analyze the results
>>> runner.results
| | r2_train | r2_test | time_fit | time |
|---|---|---|---|---|
| OLS | 0.599300 | 0.564500 | 0.423366 | 0.423366 |
| RF | 0.978200 | 0.846800 | 0.954871 | 0.954871 |
Attributes
Data attributes
The data attributes are used to access the dataset and its properties. Updating the dataset will automatically update the response of these attributes accordingly.
Attributes |
dataset: pd.DataFrame Complete data set.
train: pd.DataFrame Training set.
test: pd.DataFrame Test set.
X: pd.DataFrame Feature set.
y: pd.Series | pd.DataFrame Target column(s).
holdout: pd.DataFrame | None Holdout set. This data set is untransformed by the pipeline. Read more in the user guide.
X_train: pd.DataFrame Features of the training set.
y_train: pd.Series | pd.DataFrame Target column(s) of the training set.
X_test: pd.DataFrame Features of the test set.
y_test: pd.Series | pd.DataFrame Target column(s) of the test set.
shape: tuple[int, int] Shape of the dataset (n_rows, n_columns).
columns: pd.Index Name of all the columns.
n_columns: int Number of columns.
features: pd.Index Name of the features.
n_features: int Number of features.
target: str | list[str] Name of the target column(s).
|
Utility attributes
The utility attributes are used to access information about the models in the instance after training.
Attributes |
models: str | list[str] | None Name of the model(s).
metric: str | list[str] | None Name of the metric(s).
winners: list[model] | None Models ordered by performance. Performance is measured as the highest score on the model's
winner: model | None Best performing model. Performance is measured as the highest score on the model's
results: Styler Overview of the training results. All durations are in seconds. Possible values include:
Tip This attribute returns a pandas' Styler object. Convert
the result back to a regular dataframe using its data attribute.
|
Tracking attributes
The tracking attributes are used to customize what elements of the experiment are tracked. Read more in the user guide.
Plot attributes
The plot attributes are used to customize the plot's aesthetics. Read more in the user guide.
Attributes |
palette: str | Sequence[str] Color palette. Specify one of plotly's built-in palettes or create a custom one.
title_fontsize: int | float Fontsize for the plot's title.
label_fontsize: int | float Fontsize for the labels, legend and hover information.
tick_fontsize: int | float Fontsize for the ticks along the plot's axes.
line_width: int | float Width of the line plots.
marker_size: int | float Size of the markers.
|
Methods
Next to the plotting methods, the class contains a variety of methods to handle the data, run the training, and manage the pipeline.
available_models | Give an overview of the available predefined models. |
canvas | Create a figure with multiple plots. |
clear | Reset attributes and clear cache from all models. |
delete | Delete models. |
evaluate | Get all models' scores for the provided metrics. |
export_pipeline | Export the internal pipeline. |
get_class_weight | Return class weights for a balanced data set. |
get_params | Get parameters for this estimator. |
merge | Merge another instance of the same class into this one. |
update_layout | Update the properties of the plot's layout. |
update_traces | Update the properties of the plot's traces. |
reset_aesthetics | Reset the plot aesthetics to their default values. |
run | Train and evaluate the models. |
save | Save the instance to a pickle file. |
set_params | Set the parameters of this estimator. |
stacking | Add a Stacking model to the pipeline. |
voting | Add a Voting model to the pipeline. |
Give an overview of the available predefined models.
Parameters |
**kwargs
Filter the returned models by providing any of the columns as
keyword arguments, where the value is the desired filter,
e.g., accepts_sparse=True , to get all models that accept
sparse input or supports_engines="cuml" to get all models
that support the cuML engine.
|
Returns |
pd.DataFrame
Tags of the available predefined models. The columns
depend on the task, but can include:
|
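The keyword filtering described above can be pictured with a small sketch over a list of tag rows. Everything here is illustrative: the rows, the model names (`ModelA`, `ModelB`, `ModelC`) and the helper `filter_models` are hypothetical, and the rule that a list-valued tag matches when it contains the requested value is an assumption.

```python
# Hypothetical tag table; the real columns and values come from the library.
MODEL_TAGS = [
    {"acronym": "ModelA", "accepts_sparse": True, "supports_engines": ["sklearn", "cuml"]},
    {"acronym": "ModelB", "accepts_sparse": False, "supports_engines": ["sklearn"]},
    {"acronym": "ModelC", "accepts_sparse": True, "supports_engines": ["sklearn"]},
]


def matches(value, wanted):
    """A list-like tag matches if it contains `wanted`; a scalar tag must equal it."""
    if isinstance(value, (list, tuple, set)):
        return wanted in value
    return value == wanted


def filter_models(tags, **kwargs):
    """Keep the rows for which every keyword filter matches."""
    return [
        row
        for row in tags
        if all(matches(row[column], wanted) for column, wanted in kwargs.items())
    ]


sparse = [row["acronym"] for row in filter_models(MODEL_TAGS, accepts_sparse=True)]
cuml = [row["acronym"] for row in filter_models(MODEL_TAGS, supports_engines="cuml")]
```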
Create a figure with multiple plots.
This @contextmanager allows you to draw many plots in one
figure. The default option is to add two plots side by side.
See the user guide for an example.
Parameters |
rows: int, default=1
Number of plots in length.
cols: int, default=2
Number of plots in width.
sharex: bool, default=False
If True, hide the label and ticks from non-border subplots
on the x-axis.
sharey: bool, default=False
If True, hide the label and ticks from non-border subplots
on the y-axis.
hspace: float, default=0.05
Space between subplot rows in normalized plot coordinates.
The spacing is relative to the figure's size.
vspace: float, default=0.07
Space between subplot cols in normalized plot coordinates.
The spacing is relative to the figure's size.
title: str, dict or None, default=None
Title for the plot.
legend: bool, str or dict, default="out"
Legend for the plot. See the user guide for
an extended description of the choices.
figsize: tuple or None, default=None
Figure's size in pixels, format as (x, y). If None, it
adapts the size to the number of plots in the canvas.
filename: str, Path or None, default=None
Save the plot using this name. Use "auto" for automatic
naming. The type of the file depends on the provided name
(.html, .png, .pdf, etc...). If filename has no file type,
the plot is saved as html. If None, the plot is not saved.
display: bool, default=True
Whether to render the plot.
|
Yields |
go.Figure
Plot object.
|
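To show the general pattern behind a canvas built on `@contextmanager`, here is a toy sketch. It is not the library's code: the `Figure` class, its `add_plot` method and the `rendered` list are stand-ins invented for the example, and the row-by-row filling of the grid is an assumption.

```python
from contextlib import contextmanager


class Figure:
    """Toy stand-in for a figure with a rows x cols grid of subplots."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.plots = []

    def add_plot(self, name):
        if len(self.plots) >= self.rows * self.cols:
            raise ValueError("The canvas is full.")
        # Fill the grid row by row; positions are 1-indexed (row, col).
        row, col = divmod(len(self.plots), self.cols)
        self.plots.append((name, row + 1, col + 1))


rendered = []  # collects figures that would be shown to the user


@contextmanager
def canvas(rows=1, cols=2, display=True):
    fig = Figure(rows, cols)
    yield fig  # plots are added inside the `with` block
    # Reached only when the block exits without an exception.
    if display:
        rendered.append(fig)


with canvas(rows=1, cols=2) as fig:
    fig.add_plot("residuals")
    fig.add_plot("errors")
```

The point of the pattern is that rendering (or saving) happens once, after all plots have been added, rather than once per plot.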
Reset attributes and clear cache from all models.
Reset certain model attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the instance. The affected attributes are:
- In-training validation scores
- Shap values
- App instance
- Dashboard instance
- Calculated holdout data sets
Delete models.
If all models are removed, the metric is reset. Use this method to drop unwanted models or to free some memory before saving. Deleted models are not removed from any active mlflow experiment.
Parameters |
models: int, str, Model, segment, sequence or None, default=None
Models to delete. If None, all models are deleted.
|
Get all models' scores for the provided metrics.
Tip
This method returns a pandas' Styler object. Convert
the result back to a regular dataframe using its data
attribute.
Parameters |
metric: str, func, scorer, sequence or None, default=None
Metric to calculate. If None, it returns an overview of
the most common metrics per task.
rows: hashable, segment, sequence or dataframe, default="test"
Selection of rows to calculate the
metric on.
|
Returns |
Styler
Scores of the models.
|
Export the internal pipeline.
This method returns a deepcopy of the branch's pipeline. Optionally, you can add a model as final estimator. The returned pipeline is already fitted on the training set.
Parameters |
model: str, Model or None, default=None
Model for which to export the pipeline. If the model used
automated feature scaling, the Scaler is added to
the pipeline. If None, the pipeline in the current branch
is exported (without any model).
|
Returns |
Pipeline
Current branch as a sklearn-like Pipeline object.
|
Return class weights for a balanced data set.
Statistically, the class weights re-balance the data set so that the sampled data set represents the target population as closely as possible. The returned weights are inversely proportional to the class frequencies in the selected rows.
Parameters |
rows: hashable, segment, sequence or dataframe, default="train"
Selection of rows for which to
get the weights.
|
Returns |
dict
Classes with the corresponding weights. A dict of dicts is
returned for multioutput tasks.
|
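As an illustration of weights that are inversely proportional to class frequencies, the sketch below uses the common "balanced" heuristic `n_samples / (n_classes * count)`. The exact scaling used by this method is not stated above, so the formula and the helper name `balanced_class_weights` are assumptions.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights inversely proportional to class frequency.

    Uses n_samples / (n_classes * count), a common heuristic; the
    scaling applied by the documented method may differ.
    """
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


weights = balanced_class_weights(["a"] * 6 + ["b"] * 3 + ["c"] * 1)
```

With this scaling, every class contributes the same total weight (count times weight), so rare classes are not drowned out by frequent ones.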
Get parameters for this estimator.
Parameters |
deep : bool, default=True
If True, will return the parameters for this estimator and
contained subobjects that are estimators.
|
Returns |
params : dict
Parameter names mapped to their values.
|
Merge another instance of the same class into this one.
Branches, models, metrics and attributes of the other instance
are merged into this one. If there are branches and/or models
with the same name, they are merged adding the suffix
parameter to their name. The errors and missing attributes are
extended with those of the other instance. It's only possible
to merge two instances if they are initialized with the same
dataset and trained with the same metric.
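The renaming on name clashes can be sketched with plain dictionaries. This is only a picture of the idea: the helper `merge_models`, the default suffix `"2"` and the choice to rename the incoming model (rather than the existing one) are assumptions, and the string values stand in for fitted models.

```python
def merge_models(own, other, suffix="2"):
    """Merge two name -> model mappings, renaming incoming duplicates."""
    merged = dict(own)
    for name, model in other.items():
        # Assumption: on a name clash, the incoming model gets the suffix.
        key = f"{name}{suffix}" if name in merged else name
        merged[key] = model
    return merged


merged = merge_models(
    {"OLS": "ols_a", "RF": "rf_a"},
    {"RF": "rf_b", "Tree": "tree_b"},
)
```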
Update the properties of the plot's layout.
Recursively update the structure of the original layout with the values in the arguments.
Parameters |
**kwargs
Keyword arguments for the figure's update_layout method.
|
Update the properties of the plot's traces.
Recursively update the structure of the original traces with the values in the arguments.
Parameters |
**kwargs
Keyword arguments for the figure's update_traces method.
|
Reset the plot aesthetics to their default values.
Train and evaluate the models.
Read more in the user guide.
Parameters |
*arrays: sequence of indexables
Training set and test set. Allowed formats are:
|
Save the instance to a pickle file.
Parameters |
filename: str or Path, default="auto"
Filename or pathlib.Path of the file to save. Use
"auto" for automatic naming.
save_data: bool, default=True
Whether to save the dataset with the instance. This
parameter is ignored if the method is not called from atom.
If False, add the data to the load
method to reload the instance.
|
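The effect of `save_data=False` can be sketched with the standard `pickle` module. A plain dict stands in for the instance's state, and the `save` and `load` helpers are hypothetical; they only illustrate leaving the data out of the file and supplying it again when loading.

```python
import pickle
import tempfile
from pathlib import Path


def save(state, filename, save_data=True):
    """Pickle `state`, optionally leaving the dataset out of the file."""
    if not save_data:
        state = {key: value for key, value in state.items() if key != "dataset"}
    with open(filename, "wb") as f:
        pickle.dump(state, f)


def load(filename, data=None):
    """Unpickle the state and re-attach the data if it was not stored."""
    with open(filename, "rb") as f:
        state = pickle.load(f)
    if "dataset" not in state:
        state["dataset"] = data
    return state


# Toy state: made-up scores and a tiny dataset.
runner = {"results": {"model_1": 0.5, "model_2": 0.8}, "dataset": [[1.0, 2.0], [3.0, 4.0]]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "runner.pkl"
    save(runner, path, save_data=False)
    restored = load(path, data=runner["dataset"])
```

Leaving the data out keeps the file small, at the cost of having to provide the same data again when reloading.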
Set the parameters of this estimator.
Parameters |
**params : dict
Estimator parameters.
|
Returns |
self : estimator instance
Estimator instance.
|
Add a Stacking model to the pipeline.
Warning
Combining models trained on different branches into one ensemble is not allowed and will raise an exception.
Parameters |
models: segment, sequence or None, default=None
Models that feed the stacking estimator. The models must
have been fitted on the current branch.
name: str, default="Stack"
Name of the model. The name is always preceded by the
model's acronym: Stack.
train_on_test: bool, default=False
Whether to train the final estimator of the stacking model
on the test set instead of the training set. Note that
training it on the training set (default option) means there
is a high risk of overfitting. It's recommended to use this
option if you have another, independent set for testing
(holdout set).
**kwargs
Additional keyword arguments for one of these estimators.
Tip The model's acronyms can be used for the |
Add a Voting model to the pipeline.
Warning
Combining models trained on different branches into one ensemble is not allowed and will raise an exception.
Parameters |
models: segment, sequence or None, default=None
Models that feed the voting estimator. The models must have
been fitted on the current branch.
name: str, default="Vote"
Name of the model. The name is always preceded by the
model's acronym: Vote.
**kwargs
Additional keyword arguments for one of these estimators.
|
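For regression, a voting ensemble typically averages the predictions of its member models, optionally with weights (this is how sklearn's VotingRegressor combines its estimators). The sketch below shows that averaging in plain Python; the helper `voting_predict` and the lambda "models" are illustrative stand-ins for fitted estimators.

```python
def voting_predict(models, x, weights=None):
    """Return the (optionally weighted) average of the members' predictions."""
    predictions = [model(x) for model in models]
    if weights is None:
        weights = [1.0] * len(predictions)
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)


# Three toy "fitted models", each mapping an input to a prediction.
models = [lambda x: 2 * x, lambda x: 2 * x + 3, lambda x: x * x]

plain = voting_predict(models, 3)
weighted = voting_predict(models, 3, weights=[2, 1, 1])
```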