ATOMClassifier
Apply all data transformations and model management provided by the package on a given dataset. Note that, contrary to sklearn's API, the instance contains the dataset on which to perform the analysis. Calling a method automatically applies it to the dataset the instance contains.
All data cleaning, feature engineering, model training and plotting functionality can be accessed from an instance of this class.
Parameters | *arrays: sequence of indexables
Dataset containing features and target. Allowed formats are: X; X, y;
train, test; train, test, holdout; X_train, X_test, y_train, y_test;
or X_train, X_test, X_holdout, y_train, y_test, y_holdout, where:
X, train, test: dataframe-like
Feature set with shape=(n_samples, n_features).
y: int, str or sequence
Target column corresponding to X.
y: int, str, dict, sequence or dataframe, default=-1
Target column corresponding to X. This parameter is ignored if
the target column is provided through arrays.
index: bool, int, str or sequence, default=False
Handle the index in the resulting dataframe.
test_size: int or float, default=0.2
Fraction (<=1) or number (>1) of rows to include in the test set.
This parameter is ignored if the test set is provided through arrays.
holdout_size: int, float or None, default=None
Fraction (<=1) or number (>1) of rows to include in the holdout set.
This parameter is ignored if the holdout set is provided through arrays.
shuffle: bool, default=True
Whether to shuffle the dataset before splitting the train and
test set. Be aware that not shuffling the dataset can cause
an unequal distribution of target classes over the sets.
stratify: bool, int, str or sequence, default=True
Handle stratification of the target classes over the data sets.
For multioutput tasks, stratification is applied to the joint
target columns.
n_rows: int or float, default=1
Random subsample of the dataset to use. The default value selects
all rows.
n_jobs: int, default=1
Number of cores to use for parallel processing.
device: str, default="cpu"
Device on which to train the estimators. Use any string
that follows the SYCL_DEVICE_FILTER filter selector,
e.g. device="gpu" to use the GPU. Read more in the
user guide.
engine: str, default="sklearn"
Execution engine to use for the estimators. Refer to the
user guide for an explanation regarding every choice. Choose
from "sklearn", "sklearnex" or "cuml".
backend: str, default="loky"
Parallelization backend. Choose from "loky", "multiprocessing",
"threading" or "ray". Selecting the ray backend also parallelizes
the data using modin, a multi-threading, drop-in replacement for
pandas, that uses Ray as backend.
verbose: int, default=0
Verbosity level of the class. Choose from 0 to not print anything,
1 to print basic information, or 2 to print detailed information.
warnings: bool or str, default=False
Whether to show or suppress encountered warnings. Changing this
parameter affects the PYTHONWARNINGS environment variable.
experiment: str or None, default=None
Name of the mlflow experiment to use for tracking.
If None, no mlflow tracking is performed.
random_state: int or None, default=None
Seed used by the random number generator. If None, the random
number generator is the RandomState instance used by np.random.
|
Example
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> # Initialize atom
>>> atom = ATOMClassifier(X, y, logger="auto", n_jobs=2, verbose=2)
<< ================== ATOM ================== >>
Algorithm task: binary classification.
Parallel processing with 2 cores.
Dataset stats ==================== >>
Shape: (569, 31)
Memory: 138.96 kB
Scaled: False
Outlier values: 160 (1.1%)
-------------------------------------
Train set size: 456
Test set size: 113
-------------------------------------
| | dataset | train | test |
| - | ----------- | ----------- | ----------- |
| 0 | 212 (1.0) | 170 (1.0) | 42 (1.0) |
| 1 | 357 (1.7) | 286 (1.7) | 71 (1.7) |
>>> # Apply data cleaning and feature engineering methods
>>> atom.balance(strategy="smote")
Oversampling with SMOTE...
--> Adding 116 samples to class 0.
>>> atom.feature_selection(strategy="rfecv", solver="xgb", n_features=22)
Fitting FeatureSelector...
Performing feature selection ...
--> RFECV selected 26 features from the dataset.
--> Dropping feature mean perimeter (rank 4).
--> Dropping feature mean symmetry (rank 3).
--> Dropping feature perimeter error (rank 2).
--> Dropping feature worst compactness (rank 5).
>>> # Train models
>>> atom.run(
... models=["LR", "RF", "XGB"],
... metric="precision",
... n_bootstrap=4,
... )
Training ========================= >>
Models: LR, RF, XGB
Metric: precision
Results for Logistic Regression:
Fit ---------------------------------------------
Train evaluation --> precision: 0.9895
Test evaluation --> precision: 0.9467
Time elapsed: 0.028s
-------------------------------------------------
Total time: 0.028s
Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> precision: 1.0
Test evaluation --> precision: 0.9221
Time elapsed: 0.181s
-------------------------------------------------
Total time: 0.181s
Results for XGBoost:
Fit ---------------------------------------------
Train evaluation --> precision: 1.0
Test evaluation --> precision: 0.9091
Time elapsed: 0.124s
-------------------------------------------------
Total time: 0.124s
Final results ==================== >>
Total time: 0.333s
-------------------------------------
Logistic Regression --> precision: 0.9467 !
Random Forest --> precision: 0.9221
XGBoost --> precision: 0.9091
>>> # Analyze the results
>>> atom.evaluate()
accuracy average_precision ... recall roc_auc
LR 0.970588 0.995739 ... 0.981308 0.993324
RF 0.958824 0.982602 ... 0.962617 0.983459
XGB 0.964706 0.996047 ... 0.971963 0.993473
[3 rows x 9 columns]
Magic methods
The class contains some magic methods to help you access some of its elements faster. Note that methods that apply on the pipeline can return different results per branch.
- __repr__: Prints an overview of atom's branches, models and metric.
- __len__: Returns the length of the dataset.
- __iter__: Iterates over the pipeline's transformers.
- __contains__: Checks if the provided item is a column in the dataset.
- __getitem__: Accesses a branch, model, column or subset of the dataset.
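A minimal sketch of these magic methods in action, continuing from the atom instance created in the example above (the column name assumes the breast-cancer dataset):
>>> print(atom)            # __repr__: overview of branches, models and metric
>>> n_rows = len(atom)     # __len__: number of rows in the dataset
>>> "mean radius" in atom  # __contains__: True if it's a column in the dataset
>>> atom["mean radius"]    # __getitem__: access a column of the dataset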
Attributes
Data attributes
The data attributes are used to access the dataset and its properties. Updating the dataset will automatically update the response of these attributes accordingly.
Attributes | pipeline: pd.Series
Transformers fitted on the data. Use this attribute only to access the individual instances. To visualize the pipeline, use the plot_pipeline method.
mapping: dict
Encoded values and their respective mapped values. The column name is the key to its mapping dictionary. Only for columns mapped to a single column (e.g. Ordinal, Leave-one-out, etc...).
dataset: dataframe
Complete data set.
train: dataframe
Training set.
test: dataframe
Test set.
X: dataframe
Feature set.
y: series | dataframe
Target column(s).
X_train: dataframe
Features of the training set.
y_train: series | dataframe
Target column(s) of the training set.
X_test: dataframe
Features of the test set.
y_test: series | dataframe
Target column(s) of the test set.
shape: tuple[int, int]
Shape of the dataset (n_rows, n_columns).
columns: series
Name of all the columns.
n_columns: int
Number of columns.
features: series
Name of the features.
n_features: int
Number of features.
target: str | list[str]
Name of the target column(s).
scaled: bool
Whether the feature set is scaled. A data set is considered scaled when it has mean=0 and std=1, or when there is a scaler in the pipeline. Binary columns (only 0s and 1s) are excluded from the calculation.
duplicates: series
Number of duplicate rows in the dataset.
missing: list
Values that are considered "missing". These values are used by the clean and impute methods. Default values are: None, NaN, +inf, -inf, "", "?", "None", "NA", "nan", "NaN" and "inf". Note that None, NaN, +inf and -inf are always considered missing since they are incompatible with sklearn estimators.
nans: series | None
Columns with the number of missing values in them.
n_nans: int | None
Number of samples containing missing values.
numerical: series
Names of the numerical features in the dataset.
n_numerical: int
Number of numerical features in the dataset.
categorical: series
Names of the categorical features in the dataset.
n_categorical: int
Number of categorical features in the dataset.
outliers: pd.Series | None
Columns in the training set with the amount of outlier values.
n_outliers: int | None
Number of samples in the training set containing outliers.
classes: pd.DataFrame | None
Distribution of target classes per data set.
n_classes: int | series | None
Number of classes in the target column(s).
|
Utility attributes
The utility attributes are used to access information about the models in the instance after training.
Attributes | branch: Branch
Current active branch. Use the property's @setter to change the branch.
models: str | list[str] | None
Name of the model(s).
metric: str | list[str] | None
Name of the metric(s).
winners: list[model]
Models ordered by performance. Performance is measured as the highest score on the model's score_bootstrap or score_test attributes, checked in that order.
winner: model
Best performing model. Performance is measured as the highest score on the model's score_bootstrap or score_test attributes, checked in that order.
results: pd.DataFrame
Overview of the training results. All durations are in seconds. Columns include the train, test and bootstrap scores, and the durations of hyperparameter tuning, fitting and bootstrapping.
|
Tracking attributes
The tracking attributes are used to customize what elements of the experiment are tracked. Read more in the user guide.
Plot attributes
The plot attributes are used to customize the plot's aesthetics. Read more in the user guide.
Attributes | palette: str | SEQUENCE
Color palette. Specify one of plotly's built-in palettes or create a custom one, e.g. atom.palette = ["red", "green", "blue"].
title_fontsize: int
Fontsize for the plot's title.
label_fontsize: int
Fontsize for the labels, legend and hover information.
tick_fontsize: int
Fontsize for the ticks along the plot's axes.
line_width: int
Width of the line plots.
marker_size: int
Size of the markers.
|
Utility methods
Next to the plotting methods, the class contains a variety of utility methods to handle the data and manage the pipeline.
add | Add a transformer to the pipeline. |
apply | Apply a function to the dataset. |
automl | Search for an optimized pipeline in an automated fashion. |
available_models | Give an overview of the available predefined models. |
canvas | Create a figure with multiple plots. |
clear | Reset attributes and clear cache from all models. |
delete | Delete models. |
distribution | Get statistics on column distributions. |
eda | Create an Exploratory Data Analysis report. |
evaluate | Get all models' scores for the provided metrics. |
export_pipeline | Export the pipeline to a sklearn-like object. |
get_class_weight | Return class weights for a balanced data set. |
get_sample_weight | Return sample weights for a balanced data set. |
inverse_transform | Inversely transform new data through the pipeline. |
load | Load an atom instance from a pickle file. |
log | Print message and save to log file. |
merge | Merge another instance of the same class into this one. |
update_layout | Update the properties of the plot's layout. |
reset | Reset the instance to its initial state. |
reset_aesthetics | Reset the plot aesthetics to their default values. |
save | Save the instance to a pickle file. |
save_data | Save the data in the current branch to a .csv file. |
shrink | Convert the columns to the smallest possible matching dtype. |
stacking | Add a Stacking model to the pipeline. |
stats | Print basic information about the dataset. |
status | Get an overview of the branches and models. |
transform | Transform new data through the pipeline. |
voting | Add a Voting model to the pipeline. |
If the transformer is not fitted, it is fitted on the complete training set. Afterwards, the data set is transformed and the estimator is added to atom's pipeline. If the transformer is a sklearn Pipeline, every estimator in it is merged with atom independently.
Warning
- The transformer should have fit and/or transform methods with arguments X (accepting a dataframe-like object of shape=(n_samples, n_features)) and/or y (accepting a sequence of shape=(n_samples,)).
- The transform method should return a feature set as a dataframe-like object of shape=(n_samples, n_features) and/or a target column as a sequence of shape=(n_samples,).
Note
If the transform method doesn't return a dataframe:
- The column naming happens as follows. If the transformer has a get_feature_names or get_feature_names_out method, it is used. If not, and it returns the same number of columns, the names are kept equal. If the number of columns changes, old columns keep their name (as long as the column is unchanged) and new columns receive the name x[N-1], where N stands for the n-th feature. This means that a transformer should only transform, add or drop columns, not combinations of these.
- The index remains the same as before the transformation. This means that the transformer should not add, remove or shuffle rows unless it returns a dataframe.
Note
If the transformer has an n_jobs and/or random_state parameter that is left to its default value, it adopts atom's value.
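As a minimal sketch, adding sklearn's PCA fits it on the training set, transforms the data and appends it to atom's pipeline; per the note above, its untouched random_state default adopts atom's value:
>>> from sklearn.decomposition import PCA
>>> # Fitted on the training set, then added to the pipeline
>>> atom.add(PCA(n_components=10))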
The function should have signature func(dataset, **kw_args) -> dataset. This method is useful for stateless transformations such as taking the log, doing custom scaling, etc...
Note
This approach is preferred over changing the dataset directly through the property's @setter since the transformation is stored in the pipeline.
Tip
Use atom.apply(lambda df: df.drop("column_name", axis=1)) to store the removal of columns in the pipeline.
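For instance, a hedged sketch of a stateless log transform on the example's "mean area" column, stored in the pipeline:
>>> import numpy as np
>>> def log_mean_area(df):
...     # receives the complete dataset and must return it
...     df["mean area"] = np.log(df["mean area"])
...     return df
>>> atom.apply(log_mean_area)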
Automated machine learning (AutoML) automates the selection, composition and parameterization of machine learning pipelines. Automating machine learning often provides faster and more accurate outputs than hand-coded algorithms. ATOM uses the evalml package for AutoML optimization. The resulting transformers and final estimator are merged with atom's pipeline (check the pipeline and models attributes after the method finishes running). The created AutoMLSearch instance can be accessed through the evalml attribute.
Warning
AutoML algorithms aren't intended to run for only a few minutes. The method may need a very long time to achieve optimal results.
Parameters | **kwargs
Additional keyword arguments for the AutoMLSearch instance.
|
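A sketch of a call, assuming evalml's AutoMLSearch accepts the objective and max_iterations arguments shown here (check the evalml docs for the full list):
>>> atom.automl(objective="f1", max_iterations=10)
>>> atom.evalml.rankings  # inspect the created AutoMLSearch instance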
Returns | pd.DataFrame
Information about the available predefined models, including each model's acronym, the name of its class and its underlying estimator.
|
This @contextmanager allows you to draw many plots in one figure. The default option is to add two plots side by side. See the user guide for an example.
Parameters | rows: int, default=1
Number of plots in length.
cols: int, default=2
Number of plots in width.
horizontal_spacing: float, default=0.05
Space between subplot columns in normalized plot coordinates.
The spacing is relative to the figure's size.
vertical_spacing: float, default=0.07
Space between subplot rows in normalized plot coordinates.
The spacing is relative to the figure's size.
title: str, dict or None, default=None
Title for the plot.
legend: bool, str or dict, default="out"
Legend for the plot. See the user guide for
an extended description of the choices.
figsize: tuple or None, default=None
Figure's size in pixels, format as (x, y). If None, it
adapts the size to the number of plots in the canvas.
filename: str or None, default=None
Save the plot using this name. Use "auto" for automatic
naming. The type of the file depends on the provided name
(.html, .png, .pdf, etc...). If filename has no file type,
the plot is saved as html. If None, the plot is not saved.
display: bool, default=True
Whether to render the plot.
|
Yields | go.Figure
Plot object.
|
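A short sketch of the context manager, drawing two of atom's plots side by side (plot_roc and plot_prc are two of the available plot methods):
>>> with atom.canvas(rows=1, cols=2, title="Model performance"):
...     atom.plot_roc()
...     atom.plot_prc()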
Reset certain model attributes to their initial state, deleting potentially large data arrays. Use this method to free some memory before saving the instance. The affected attributes are:
- In-training validation scores
- Shap values
- App instance
- Dashboard instance
- Cached prediction attributes
- Cached metric scores
- Cached holdout data sets
If all models are removed, the metric is reset. Use this method to drop unwanted models from the pipeline or to free some memory before saving. Deleted models are not removed from any active mlflow experiment.
Parameters | models: int, str, slice, Model, sequence or None, default=None
Models to delete. If None, all models are deleted.
|
Compute the Kolmogorov-Smirnov test for various distributions against columns in the dataset. Only for numerical columns. Missing values are ignored.
Tip
Use the plot_distribution method to plot a column's distribution.
ATOM uses the ydata-profiling package for the EDA.
The report is rendered directly in the notebook. The created
ProfileReport instance can be accessed through the report
attribute.
Warning
This method can be slow for large datasets.
Parameters | dataset: str, default="dataset"
Data set to get the report from.
n_rows: int or None, default=None
Number of (randomly picked) rows to process. None to use
all rows.
filename: str or None, default=None
Name to save the file with (as .html). None to not save
anything.
**kwargs
Additional keyword arguments for the ProfileReport
instance.
|
Parameters | metric: str, func, scorer, sequence or None, default=None
Metric to calculate. If None, it returns an overview of
the most common metrics per task.
dataset: str, default="test"
Data set on which to calculate the metric. Choose from:
"train", "test" or "holdout".
threshold: float or sequence, default=0.5
Threshold between 0 and 1 to convert predicted probabilities
to class labels. Only used when:
- The task is binary or multilabel classification.
- The model has a predict_proba method.
For multilabel classification tasks, it's possible to provide a sequence of thresholds (one per target column). The same threshold per target column is applied to all models.
sample_weight: sequence or None, default=None
Sample weights corresponding to y in dataset.
|
Returns | pd.DataFrame
Scores of the models.
|
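For example, a sketch scoring the trained models on the training set with a stricter probability threshold:
>>> atom.evaluate(metric="recall", dataset="train", threshold=0.7)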
Optionally, you can add a model as final estimator. The returned pipeline is already fitted on the training set.
Info
The returned pipeline behaves similarly to sklearn's Pipeline, and additionally:
- Accepts transformers that change the target column.
- Accepts transformers that drop rows.
- Accepts transformers that are only fitted on a subset of the provided dataset.
- Always returns pandas objects.
- Uses transformers that are only applied on the training set to fit the pipeline, not to make predictions.
Parameters | model: str, Model or None, default=None
Model for which to export the pipeline. If the model used
automated feature scaling, the Scaler is added to
the pipeline. If None, the pipeline in the current branch
is exported.
memory: bool, str, Memory or None, default=None
Used to cache the fitted transformers of the pipeline.
- If None or False: No caching is performed.
- If True: A default temp directory is used.
- If str: Path to the caching directory.
- If Memory: Object with the joblib.Memory interface.
verbose: int or None, default=None
Verbosity level of the transformers in the pipeline. If
None, it leaves them to their original verbosity. Note
that this is not the pipeline's own verbose parameter.
To change that, use the set_params method.
|
Returns | Pipeline
Current branch as a sklearn-like Pipeline object.
|
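A sketch of a typical use: export the pipeline of the LR model from the example and predict on unseen data (X_new is a hypothetical dataframe of new samples):
>>> pl = atom.export_pipeline(model="LR", memory=True)
>>> predictions = pl.predict(X_new)  # X_new: placeholder for unseen data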
Statistically, the class weights re-balance the data set so that the sampled data set represents the target population as closely as possible. The returned weights are inversely proportional to the class frequencies in the selected data set.
Parameters | dataset: str, default="train"
Data set from which to get the weights. Choose from:
"train", "test", "dataset".
|
Returns | dict
Classes with the corresponding weights. A dict of dicts is
returned for multioutput tasks.
|
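A sketch passing the weights to an estimator through run's est_params argument (est_params forwards parameters to the underlying estimator):
>>> weights = atom.get_class_weight(dataset="train")
>>> atom.run(models="LR", est_params={"class_weight": weights})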
The returned weights are inversely proportional to the class frequencies in the selected data set. For multioutput tasks, the weights of each column of y will be multiplied.
Parameters | dataset: str, default="train"
Data set from which to get the weights. Choose from:
"train", "test", "dataset".
|
Returns | series
Sequence of weights with shape=(n_samples,).
|
Transformers that are only applied on the training set are skipped. The rest should all implement an inverse_transform method. If only X or only y is provided, transformers that require the other parameter are ignored. This can be used to transform only the target column.
If the instance was saved using save_data=False, it's possible to load new data into it and apply all data transformations.
Note
The loaded instance's current branch is the same branch as it was when saved.
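A hedged sketch of the round trip, assuming load is called on the class and accepts the data through a data argument (as the save_data parameter of save suggests):
>>> atom.save("atom", save_data=False)  # store the instance without the dataset
>>> atom2 = ATOMClassifier.load("atom", data=(X, y))  # re-attach the data on load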
Branches, models, metrics and attributes of the other instance are merged into this one. If there are branches and/or models with the same name, they are merged by adding the suffix parameter to their names. The errors and missing attributes are extended with those of the other instance. It's only possible to merge two instances if they are initialized with the same dataset and trained with the same metric.
Parameters | other: Runner
Instance with which to merge. Should be of the same class
as self.
suffix: str, default="2"
Conflicting branches and models are merged adding suffix
to the end of their names.
|
This recursively updates the structure of the original layout with the values in the input dict / keyword arguments.
Deletes all branches and models. The dataset is also reset to its form after initialization.
Parameters | filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.
save_data: bool, default=True
Whether to save the dataset with the instance. This parameter
is ignored if the method is not called from atom. If False,
add the data to the load method.
|
Save the data in the current branch to a .csv file.
Parameters | filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.
dataset: str, default="dataset"
Data set to save.
**kwargs
Additional keyword arguments for pandas' to_csv method.
|
Warning
Combining models trained on different branches into one ensemble is not allowed and will raise an exception.
Parameters | _vb: int, default=-2
Internal parameter to always print if called by user.
|
This method prints the same information as the __repr__ and also saves it to the logger.
Transformers that are only applied on the training set are skipped. If only X or only y is provided, transformers that require the other parameter are ignored. This can be of use to, for example, transform only the target column.
Warning
Combining models trained on different branches into one ensemble is not allowed and will raise an exception.
Data cleaning
The data cleaning methods can help you scale the data, handle missing values, categorical columns, outliers and unbalanced datasets. All attributes of the data cleaning classes are attached to atom after running. Read more in the user guide.
Tip
Use the eda method to examine the data and help you determine suitable parameters for the data cleaning methods.
balance | Balance the number of rows per class in the target column. |
clean | Apply standard data cleaning steps on the dataset. |
discretize | Bin continuous data into intervals. |
encode | Perform encoding of categorical features. |
impute | Handle missing values in the dataset. |
normalize | Transform the data to follow a Normal/Gaussian distribution. |
prune | Prune outliers from the training set. |
scale | Scale the data. |
When oversampling, the newly created samples have an increasing integer index for numerical indices, and an index of the form [estimator]_N for non-numerical indices, where N stands for the N-th sample in the data set.
See the Balancer class for a description of the parameters.
Note
- The balance method does not support multioutput tasks.
- This transformation is only applied to the training set in order to maintain the original distribution of target classes in the test set.
Tip
Use atom's classes attribute for an overview of the target class distribution per data set.
Use the parameters to choose which transformations to perform. The available steps are:
- Drop columns with specific data types.
- Remove characters from column names.
- Strip categorical features from white spaces.
- Drop duplicate rows.
- Drop rows with missing values in the target column.
- Encode the target column (ignored for regression tasks).
See the Cleaner class for a description of the parameters.
For each feature, the bin edges are computed during fit and, together with the number of bins, they will define the intervals. Ignores categorical columns.
See the Discretizer class for a description of the parameters.
Tip
Use the plot_distribution method to visualize a column's distribution and decide on the bins.
The encoding type depends on the number of classes in the column:
- If n_classes=2 or ordinal feature, use Ordinal-encoding.
- If 2 < n_classes <= max_onehot, use OneHot-encoding.
- If n_classes > max_onehot, use strategy-encoding.
Missing values are propagated to the output column. Unknown classes encountered during transforming are imputed according to the selected strategy. Rare classes can be replaced with a value in order to prevent too high cardinality.
See the Encoder class for a description of the parameters.
Note
This method only encodes the categorical features. It does not encode the target column! Use the clean method for that.
Tip
Use the categorical attribute for a list of the categorical features in the dataset.
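A sketch tying the rules above to a call (Target is one of the strategies provided through category-encoders):
>>> # 2 classes -> Ordinal, 3-10 classes -> OneHot, >10 classes -> Target-encoding
>>> atom.encode(strategy="Target", max_onehot=10)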
Impute or remove missing values according to the selected strategy. Also removes rows and columns with too many missing values. Use the missing attribute to customize what values are considered "missing".
See the Imputer class for a description of the parameters.
Tip
Use the nans attribute to check the amount of missing values per column.
This transformation is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired. Missing values are disregarded in fit and maintained in transform. Ignores categorical columns.
See the Normalizer class for a description of the parameters.
Tip
Use the plot_distribution method to examine a column's distribution.
Replace or remove outliers. The definition of an outlier depends on the selected strategy and can differ greatly from one strategy to another. Ignores categorical columns.
See the Pruner class for a description of the parameters.
Note
This transformation is only applied to the training set in order to maintain the original distribution of samples in the test set.
Tip
Use the outliers attribute to check the number of outliers per column.
Apply one of sklearn's scalers. Categorical columns are ignored.
See the Scaler class for a description of the parameters.
Tip
Use the scaled attribute to check whether the dataset is scaled.
NLP
The Natural Language Processing (NLP) transformers help to convert raw text to meaningful numeric values, ready to be ingested by a model. All transformations are applied only on the column in the dataset called corpus. Read more in the user guide.
textclean | Apply standard text cleaning to the corpus. |
textnormalize | Normalize the corpus. |
tokenize | Tokenize the corpus. |
vectorize | Vectorize the corpus. |
Transformations include normalizing characters and dropping noise from the text (emails, HTML tags, URLs, etc...). The transformations are applied on the column named corpus, in the same order the parameters are presented. If there is no column with that name, an exception is raised.
See the TextCleaner class for a description of the parameters.
Convert words to a more uniform standard. The transformations are applied on the column named corpus, in the same order the parameters are presented. If there is no column with that name, an exception is raised. If the provided documents are strings, words are separated by spaces.
See the TextNormalizer class for a description of the parameters.
Convert documents into sequences of words. Additionally, create n-grams (represented by words united with underscores, e.g. "New_York") based on their frequency in the corpus. The transformations are applied on the column named corpus. If there is no column with that name, an exception is raised.
See the Tokenizer class for a description of the parameters.
Transform the corpus into meaningful vectors of numbers. The transformation is applied on the column named corpus. If there is no column with that name, an exception is raised. If strategy="bow" or "tfidf", the transformed columns are named after the word they are embedding with the prefix corpus_. If strategy="hashing", the columns are named hash[N], where N stands for the n-th hashed column.
See the Vectorizer class for a description of the parameters.
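A hedged sketch of the corpus workflow described above (X_text and y_text are placeholders for a text dataset whose feature column is named corpus):
>>> atom_nlp = ATOMClassifier(X_text, y_text)
>>> atom_nlp.textclean()                  # drop noise: emails, HTML tags, URLs, ...
>>> atom_nlp.tokenize()                   # split documents into sequences of words
>>> atom_nlp.vectorize(strategy="tfidf")  # new columns are named corpus_[word]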
Feature engineering
To further pre-process the data, it's possible to extract features from datetime columns, create new non-linear features transforming the existing ones, group similar features or, if the dataset is too large, remove features. Read more in the user guide.
feature_extraction | Extract features from datetime columns. |
feature_generation | Generate new features. |
feature_grouping | Extract statistics from similar features. |
feature_selection | Reduce the number of features in the data. |
Create new features extracting datetime elements (day, month, year, etc...) from the provided columns. Columns of dtype datetime64 are used as is. Categorical columns that can be successfully converted to a datetime format (less than 30% NaT values after conversion) are also used.
See the FeatureExtractor class for a description of the parameters.
Create new combinations of existing features to capture the non-linear relations between the original features.
See the FeatureGenerator class for a description of the parameters.
Replace groups of features with related characteristics with new features that summarize statistical properties of the group. The statistical operators are calculated over every row of the group. The group names and features can be accessed through the groups method.
See the FeatureGrouper class for a description of the parameters.
Apply feature selection or dimensionality reduction, either to improve the estimators' accuracy or to boost their performance on very high-dimensional datasets. Additionally, remove multicollinear and low variance features.
See the FeatureSelector class for a description of the parameters.
Note
- When strategy="univariate" and solver=None, f_classif or f_regression is used as default solver.
- When strategy is "sfs", "rfecv" or any of the advanced strategies and no scoring is specified, atom's metric (if it exists) is used as scoring.
Training
The training methods are where the models are fitted to the data and their performance is evaluated against a selected metric. There are three methods to call the three different training approaches. Read more in the user guide.
run | Train and evaluate the models in a direct fashion. |
successive_halving | Fit the models in a successive halving fashion. |
train_sizing | Train and evaluate the models in a train sizing fashion. |
Contrary to successive_halving and train_sizing, the direct approach only iterates once over the models, using the full dataset.
The following steps are applied to every model:
- Apply hyperparameter tuning (optional).
- Fit the model on the training set using the best combination of hyperparameters found.
- Evaluate the model on the test set.
- Train the estimator on various bootstrapped samples of the training set and evaluate again on the test set (optional).
See the DirectClassifier or DirectRegressor class for a description of the parameters.
The successive halving technique is a bandit-based algorithm that fits N models to 1/N of the data. The best half are selected to go to the next iteration where the process is repeated. This continues until only one model remains, which is fitted on the complete dataset. Beware that a model's performance can depend greatly on the amount of data on which it is trained. For this reason, it is recommended to only use this technique with similar models, e.g. only using tree-based models.
The following steps are applied to every model (per iteration):
- Apply hyperparameter tuning (optional).
- Fit the model on the training set using the best combination of hyperparameters found.
- Evaluate the model on the test set.
- Train the estimator on various bootstrapped samples of the training set and evaluate again on the test set (optional).
See the SuccessiveHalvingClassifier or SuccessiveHalvingRegressor class for a description of the parameters.
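A sketch following the recommendation above to compare similar models, here four tree-based ones (the acronyms assume ATOM's predefined model names):
>>> atom.successive_halving(
...     models=["Tree", "Bag", "RF", "LGB"],
...     metric="precision",
... )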
When training models, there is usually a trade-off between model performance and computation time that is regulated by the number of samples in the training set. This method can be used to create insights into this trade-off and help determine the optimal size of the training set. The models are fitted multiple times, with an ever-increasing number of samples in the training set.
The following steps are applied to every model (per iteration):
- Apply hyperparameter tuning (optional).
- Fit the model on the training set using the best combination of hyperparameters found.
- Evaluate the model on the test set.
- Train the estimator on various bootstrapped samples of the training set and evaluate again on the test set (optional).
See the TrainSizingClassifier or TrainSizingRegressor class for a description of the parameters.
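A sketch that fits one model on increasingly large fractions of the training set (train_sizes follows the TrainSizing trainers' parameter):
>>> atom.train_sizing(models="RF", metric="precision", train_sizes=5)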