Pipeline
Sequentially apply a list of transforms and a final estimator.
Intermediate steps of the pipeline must be transformers, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using the memory parameter.
A step's estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting it to passthrough or None.
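As a sketch of this behavior, the following uses a plain scikit-learn Pipeline (from which this class inherits its step-replacement semantics); the step names are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])

# Remove a transformer by setting the parameter with its name to "passthrough"
pipe.set_params(scaler="passthrough")
pipe.fit(X, y)

# Replace the final estimator entirely by setting its name to another estimator
pipe.set_params(clf=DecisionTreeClassifier(random_state=0))
```

After set_params, the pipeline must be refitted before making new predictions.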
Read more in sklearn's user guide.
Info
This class behaves similarly to sklearn's pipeline, and additionally:
- Can initialize with an empty pipeline.
- Always returns 'pandas' objects.
- Accepts transformers that drop rows.
- Accepts transformers that are only fitted on a subset of the provided dataset.
- Accepts transformers that apply only on the target column.
- Uses transformers that are only applied on the training set to fit the pipeline, not to make predictions on new data.
- The instance is considered fitted at initialization if all the underlying transformers/estimator in the pipeline are.
- Returns attributes of the final estimator if they are not attributes of the Pipeline itself.
- The last estimator is also cached.
- Supports time series models following sktime's API.
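The train-only behavior in the list above can be illustrated with a minimal conceptual sketch (not ATOM's actual implementation; the TrainOnly marker and toy transformers are hypothetical): steps marked as training-only participate in fitting but are skipped when transforming new data.

```python
class TrainOnly:
    """Hypothetical marker for steps applied only on the training set."""
    def __init__(self, transformer):
        self.transformer = transformer

class DropFirstRow:
    """Toy transformer that drops a row, like a balancer or outlier pruner."""
    def fit_transform(self, X):
        return X[1:]
    def transform(self, X):
        return X[1:]

class AddOne:
    """Toy transformer applied both at fit time and on new data."""
    def fit_transform(self, X):
        return [x + 1 for x in X]
    def transform(self, X):
        return [x + 1 for x in X]

def fit_data(steps, X):
    # During fitting, every step is applied, including train-only ones
    for step in steps:
        t = step.transformer if isinstance(step, TrainOnly) else step
        X = t.fit_transform(X)
    return X

def transform_new(steps, X):
    # For new data, train-only steps are skipped entirely
    for step in steps:
        if isinstance(step, TrainOnly):
            continue
        X = step.transform(X)
    return X

steps = [AddOne(), TrainOnly(DropFirstRow())]
print(fit_data(steps, [1, 2, 3]))      # rows may be dropped during fit
print(transform_new(steps, [1, 2, 3])) # row count preserved for new data
```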
Warning
This Pipeline only works with estimators whose parameters for fit, transform, predict, etc. are named X and/or y.
Parameters | steps: list of tuple
List of (name, transform) tuples (implementing fit/transform) that are chained in sequential order.
memory: str, Memory or None, default=None
Used to cache the fitted transformers of the pipeline. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time-consuming.
verbose: int or None, default=0
Verbosity level of the transformers in the pipeline. If None, it leaves them to their original verbosity. If >0, the time elapsed while fitting each step is printed. Note this is not the same as sklearn's verbose parameter. Use the pipeline's verbose attribute to modify that one (defaults to False).
|
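The memory parameter can be sketched with a plain scikit-learn Pipeline (the caching semantics described above come from sklearn); the cache directory here is temporary and illustrative:

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)

cachedir = mkdtemp()  # temporary location for cached transformers
pipe = Pipeline(
    steps=[("scaler", StandardScaler()), ("clf", LogisticRegression())],
    memory=cachedir,  # transformers are cloned before fitting and cached here
)
pipe.fit(X, y)

# With caching enabled, inspect the fitted clone via named_steps,
# not the original transformer instance
fitted_scaler = pipe.named_steps["scaler"]

rmtree(cachedir)  # clean up the cache when done
```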
Attributes | named_steps: Bunch
Dictionary-like object with the following attributes. Read-only attribute to access any step parameter by user-given name. Keys are step names and values are step parameters.
classes_: np.ndarray of shape (n_classes,)
The class labels. Only exists if the last step of the pipeline is a classifier.
feature_names_in_: np.ndarray
Names of features seen during the first step's fit method.
n_features_in_: int
Number of features seen during the first step's fit method.
|
Example
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> # Initialize atom
>>> atom = ATOMClassifier(X, y, verbose=2)
<< ================== ATOM ================== >>
Configuration ==================== >>
Algorithm task: Binary classification.
Dataset stats ==================== >>
Shape: (569, 31)
Train set size: 456
Test set size: 113
-------------------------------------
Memory: 138.97 kB
Scaled: False
Outlier values: 174 (1.2%)
>>> # Apply data cleaning and feature engineering methods
>>> atom.scale()
Fitting Scaler...
Scaling features...
>>> atom.balance(strategy="smote")
Oversampling with SMOTE...
--> Adding 116 samples to class 0.
>>> atom.feature_selection(strategy="rfe", solver="lr", n_features=22)
Fitting FeatureSelector...
Performing feature selection ...
--> rfe selected 22 features from the dataset.
--> Dropping feature mean texture (rank 5).
--> Dropping feature mean smoothness (rank 4).
--> Dropping feature mean symmetry (rank 2).
--> Dropping feature mean fractal dimension (rank 3).
--> Dropping feature smoothness error (rank 9).
--> Dropping feature concavity error (rank 7).
--> Dropping feature symmetry error (rank 6).
--> Dropping feature worst compactness (rank 8).
>>> # Train models
>>> atom.run(models="LR")
Training ========================= >>
Models: LR
Metric: f1
Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.9808
Test evaluation --> f1: 0.9929
Time elapsed: 0.031s
-------------------------------------------------
Time: 0.031s
Final results ==================== >>
Total time: 0.034s
-------------------------------------
LogisticRegression --> f1: 0.9929
>>> # Get the pipeline object
>>> pipeline = atom.lr.export_pipeline()
>>> print(pipeline)
Pipeline(memory=Memory(location=None),
steps=[('scaler',
Scaler(engine={'data': 'pandas', 'estimator': 'sklearn'}, verbose=2)),
('balancer', Balancer(strategy='smote', verbose=2)),
('featureselector',
FeatureSelector(engine={'data': 'pandas', 'estimator': 'sklearn'}, n_features=22, solver='lr_class', strategy='rfe', verbose=2)),
('LogisticRegression', LogisticRegression(n_jobs=1))],
verbose=False)
Methods
decision_function | Transform, then decision_function of the final estimator. |
fit | Fit the pipeline. |
fit_transform | Fit the pipeline and transform the data. |
get_feature_names_out | Get output feature names for transformation. |
get_metadata_routing | Get metadata routing of this object. |
get_params | Get parameters for this estimator. |
inverse_transform | Inverse transform for each step in a reverse order. |
predict | Transform, then predict of the final estimator. |
predict_interval | Transform, then predict_interval of the final estimator. |
predict_log_proba | Transform, then predict_log_proba of the final estimator. |
predict_proba | Transform, then predict_proba of the final estimator. |
predict_quantiles | Transform, then predict_quantiles of the final estimator. |
predict_residuals | Transform, then predict_residuals of the final estimator. |
predict_var | Transform, then predict_var of the final estimator. |
score | Transform, then score of the final estimator. |
set_output | Set output container. |
set_params | Set the parameters of this estimator. |
transform | Transform the data. |
Fit all the transformers one after the other and sequentially transform the data. Finally, fit the transformed data using the final estimator.
Call fit followed by transform on each transformer in the pipeline. The transformed data are finally passed to the final estimator, which calls the transform method. Only valid if the final estimator implements transform. This also works when the final estimator is None, in which case all prior transformations are applied.
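The fit_transform flow described above can be sketched in a few lines of plain Python (conceptual, not the actual implementation; the Scale transformer is a toy example):

```python
def pipeline_fit_transform(transformers, final_estimator, X, y=None):
    """Conceptual sketch: fit then transform each step, then the final estimator."""
    for t in transformers:
        X = t.fit(X, y).transform(X)  # fit followed by transform, per step
    if final_estimator is None:
        return X  # all prior transformations are applied
    return final_estimator.fit(X, y).transform(X)

class Scale:
    """Toy transformer that multiplies every value by a factor."""
    def __init__(self, factor):
        self.factor = factor
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [x * self.factor for x in X]

result = pipeline_fit_transform([Scale(2)], Scale(10), [1, 2])
result_none = pipeline_fit_transform([Scale(2)], None, [1, 2])
```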
Parameters | input_features : array-like of str or None, default=None
Input features.
|
Returns | feature_names_out : ndarray of str objects
Transformed feature names.
|
Check sklearn's documentation on how the routing mechanism works.
Returns | MetadataRouter
A MetadataRouter encapsulating routing information.
|
All estimators in the pipeline must implement the inverse_transform method.
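The reverse-order logic can be sketched with two scikit-learn scalers chained by hand (illustrative; the helper function is not the actual implementation):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

# Fit two transformers in sequence, as a pipeline would during fit
s1, s2 = StandardScaler(), MinMaxScaler()
Xt = s2.fit_transform(s1.fit_transform(X))

def pipeline_inverse_transform(steps, X):
    # Each step's inverse_transform is applied in reverse order
    for est in reversed(steps):
        X = est.inverse_transform(X)
    return X

X_back = pipeline_inverse_transform([s1, s2], Xt)  # recovers the original X
```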
Parameters | X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). Can only be None for forecast tasks.
fh: int, sequence, ForecastingHorizon or None, default=None
The forecasting horizon encoding the time stamps to forecast at. Only for forecast tasks.
**params
Parameters requested and accepted by steps. Each step must
have requested certain metadata for these parameters to be
forwarded to them. Note that while this may be used to
return uncertainties from some models with return_std or
return_cov , uncertainties that are generated by the
transformations in the pipeline are not propagated to the
final estimator.
|
Returns | np.ndarray, series or dataframe
Predictions with shape=(n_samples,) or shape=(n_samples,
n_targets) for multioutput tasks.
|
Parameters | fh: int, sequence or ForecastingHorizon
The forecasting horizon encoding the time stamps to
forecast at.
X: dataframe-like or None, default=None
Exogenous time series corresponding to fh.
coverage: float or sequence, default=0.9
Nominal coverage(s) of predictive interval(s).
|
Returns | dataframe
Computed interval forecasts.
|
Parameters | X: dataframe-like
Feature set with shape=(n_samples, n_features).
**params
Parameters requested and accepted by steps. Each step must
have requested certain metadata for these parameters to be
forwarded to them.
|
Returns | list or np.ndarray
Predicted class log-probabilities with shape=(n_samples,
n_classes) or a list of arrays for multioutput tasks.
|
Parameters | X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). Can only be None for forecast tasks.
fh: int, sequence, ForecastingHorizon or None, default=None
The forecasting horizon encoding the time stamps to forecast at. Only for forecast tasks.
marginal: bool, default=True
Whether returned distribution is marginal by time index.
Only for forecast tasks.
**params
Parameters requested and accepted by steps. Each step must
have requested certain metadata for these parameters to be
forwarded to them.
|
Returns | list, np.ndarray or sktime.proba.Normal
|
Parameters | fh: int, sequence or ForecastingHorizon
The forecasting horizon encoding the time stamps to
forecast at.
X: dataframe-like or None, default=None
Exogenous time series corresponding to fh.
alpha: float or sequence, default=(0.05, 0.95)
A probability or list of probabilities at which quantile forecasts are computed.
|
Returns | dataframe
Computed quantile forecasts.
|
Parameters | y: sequence or dataframe
Ground truth observations.
X: dataframe-like or None, default=None
Exogenous time series corresponding to y .
|
Returns | series or dataframe
Residuals with shape=(n_samples,) or shape=(n_samples,
n_targets) for multivariate tasks.
|
Parameters | fh: int, sequence or ForecastingHorizon
The forecasting horizon encoding the time stamps to
forecast at.
X: dataframe-like or None, default=None
Exogenous time series corresponding to fh.
cov: bool, default=False
Whether to compute covariance matrix forecasts or marginal variance forecasts.
|
Returns | dataframe
Computed variance forecasts.
|
Parameters | X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). Can only be None for forecast tasks.
y: sequence, dataframe-like or None, default=None
Target values corresponding to X.
fh: int, sequence, ForecastingHorizon or None, default=None
The forecasting horizon encoding the time stamps to score.
sample_weight: sequence or None, default=None
Sample weights corresponding to y, passed to the score method of the final estimator. If None, no sample weights are used. Only for non-forecast tasks.
|
Returns | float
Mean accuracy, r2 or mape of self.predict(X) with respect
to y (depending on the task).
|
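For a classification task, the score method can be sketched with plain scikit-learn on a held-out test set (variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

# score transforms X_test through the pipeline, then calls the final
# estimator's score method; for a classifier this is mean accuracy
acc = pipe.score(X_test, y_test)
```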
See sklearn's user guide on how to use the set_output API, and here for a description of the choices.
Call transform on each transformer in the pipeline. The transformed data are finally passed to the final estimator, which calls the transform method. Only valid if the final estimator implements transform. This also works when the final estimator is None, in which case all prior transformations are applied.