Apply feature selection or dimensionality reduction, either to improve the estimators' accuracy or to boost their performance on very high-dimensional datasets. Additionally, remove multicollinear and low variance features.
This class can be accessed from atom through the feature_selection method. Read more in the user guide.
- Ties between features with equal scores are broken in an unspecified way.
- For strategy="rfecv", the n_features parameter is the minimum number of features to select, not the actual number of features that the transformer returns. It may well return more.
- The "sklearnex" and "cuml" engines are only supported for strategy="pca" with dense datasets.
- If strategy="pca" and the data is dense and unscaled, it's scaled to mean=0 and std=1 before fitting the PCA transformer.
- If strategy="pca" and the provided data is sparse, the used estimator is TruncatedSVD, which works more efficiently with sparse matrices.
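As a rough illustration of the scaling step mentioned in the notes above, the following sketch standardizes a single column to mean 0 and standard deviation 1 using only the standard library. The `standardize` helper and the sample values are hypothetical, and the use of the population standard deviation is an assumption; the transformer's internal scaler may differ in detail.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Scale a numeric column to mean 0 and standard deviation 1."""
    mean = fmean(column)
    std = pstdev(column)  # population std (an assumption of this sketch)
    if std == 0:
        # A constant column has no spread to rescale; center it only.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


# Made-up feature values on an arbitrary scale.
mean_radius = [17.99, 20.57, 19.69, 11.42, 20.29]
scaled = standardize(mean_radius)

# Both values should be (numerically) 0 and 1 respectively.
print(round(fmean(scaled), 6), round(pstdev(scaled), 6))
```

PCA is sensitive to the scale of the inputs, which is presumably why unscaled dense data is standardized before the components are fitted.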
Use the plot_feature_importance method to examine how much a specific feature contributes to the final predictions. If the model doesn't have a feature_importances_ attribute, use plot_permutation_importance instead.
Parameters | strategy: str or None, default=None
Feature selection strategy to use. Choose from:
solver: str, estimator or None, default=None
Solver/estimator to use for the feature selection strategy. See
the corresponding documentation for an extended description of
the choices. If None, the default value is used (only if
strategy="pca"). Choose from:
n_features: int, float or None, default=None
Number of features to select.
If strategy="sfm" and the threshold parameter is not specified,
the threshold is set automatically. If strategy="rfecv", n_features
is the minimum number of features to select. This parameter is
ignored if any of the following strategies is selected: pso, hho,
gwo, dfo, go.
min_repeated: int, float or None, default=2
Remove categorical features if there isn't any repeated value
in at least min_repeated rows. The default is to keep all
features with non-maximum variance, i.e. remove the features
whose number of unique values is equal to the number of rows
(usually the case for names, IDs, etc.).
max_repeated: int, float or None, default=1.0
Remove categorical features with the same value in at least
max_repeated rows. The default is to keep all features with
non-zero variance, i.e. remove the features that have the same
value in all samples.
max_correlation: float or None, default=1.0
Minimum absolute Pearson correlation to identify
correlated features. For each group, it removes all except the
feature with the highest correlation to y (if provided, else
it removes all but the first). The default value removes equal
columns. If None, skip this step.
n_jobs: int, default=1
Number of cores to use for parallel processing.
device: str, default="cpu"
Device on which to train the estimators. Use any string
that follows the SYCL_DEVICE_FILTER filter selector,
e.g. device="gpu" to use the GPU. Read more in the
user guide.
engine: str, default="sklearn"
Execution engine to use for the estimators. Refer to the
user guide for an explanation
regarding every choice. Choose from:
verbose: int, default=0
Verbosity level of the class. Choose from:
logger: str, Logger or None, default=None
random_state: int or None, default=None
Seed used by the random number generator. If None, the random
number generator is the RandomState used by np.random.
**kwargs
Any extra keyword argument for the strategy estimator. See the
corresponding documentation for the available options.
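The min_repeated and max_repeated rules can be pictured with a small standard-library sketch. The `repeated_value_filter` helper, the column names and the data are hypothetical, and reading a float max_repeated as a fraction of the rows is an assumption of this sketch, not a statement about the transformer's internals.

```python
from collections import Counter


def repeated_value_filter(columns, min_repeated=2, max_repeated=1.0):
    """Return {column name: reason} for categorical columns to drop.

    `columns` maps a column name to its list of values. A float
    `max_repeated` is read as a fraction of the rows (assumption).
    """
    dropped = {}
    for name, values in columns.items():
        n_rows = len(values)
        # Number of rows taken up by the most frequent value.
        top_count = Counter(values).most_common(1)[0][1]
        if isinstance(max_repeated, float):
            limit = max_repeated * n_rows
        else:
            limit = max_repeated
        if top_count < min_repeated:
            dropped[name] = "no repeated values"
        elif top_count >= limit:
            dropped[name] = "single value dominates"
    return dropped


data = {
    "patient_id": ["a1", "b2", "c3", "d4"],  # every value is unique
    "country": ["NL", "NL", "NL", "NL"],     # same value in all rows
    "blood_type": ["A", "B", "A", "O"],      # informative, kept
}
result = repeated_value_filter(data)
print(result)
```

With the default thresholds, only the identifier-like column and the constant column are flagged, which matches the behavior described for the two parameters.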
Attributes | collinear: pd.DataFrame
Information on the removed collinear features. Columns include:
feature_importance: pd.Series
Normalized importance scores calculated by the solver for the
features kept by the transformer. The scores are extracted from
the coef_ or feature_importances_ attribute, checked in that
order. Only if strategy is one of univariate, sfm, rfe or rfecv.
[strategy]: sklearn transformer
Object used to transform the data, e.g. fs.pca for the pca
strategy.
feature_names_in_: np.array
Names of features seen during fit.
n_features_in_: int
Number of features seen during fit.
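To make the collinearity step behind the collinear attribute more concrete, here is a minimal standard-library sketch of the rule described for max_correlation: group features whose absolute Pearson correlation reaches the threshold, then keep the one most correlated with y, or the first one when y is not given. The `pearson` and `drop_collinear` helpers, the greedy grouping, the 0.98 threshold and the data are all assumptions made for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def drop_collinear(features, y=None, max_correlation=0.98):
    """Return the names of the features kept after the collinearity check."""
    names = list(features)
    removed = set()
    for i, name in enumerate(names):
        if name in removed:
            continue
        # Collect the later features that are collinear with this one.
        group = [name] + [
            other
            for other in names[i + 1:]
            if other not in removed
            and abs(pearson(features[name], features[other])) >= max_correlation
        ]
        if len(group) == 1:
            continue
        if y is None:
            keep = group[0]  # no target: keep the first feature
        else:
            keep = max(group, key=lambda col: abs(pearson(features[col], y)))
        removed.update(col for col in group if col != keep)
    return [name for name in names if name not in removed]


features = {
    "radius": [1.0, 2.0, 3.0, 4.0, 5.0],
    "perimeter": [1.0, 2.0, 3.0, 4.0, 6.0],  # nearly collinear with radius
    "texture": [3.0, 1.0, 4.0, 1.0, 5.0],
}
target = [1.0, 2.0, 3.0, 4.0, 6.0]  # made-up target values

print(drop_collinear(features))            # ['radius', 'texture']
print(drop_collinear(features, y=target))  # ['perimeter', 'texture']
```

The second call keeps perimeter instead of radius because, in this made-up data, perimeter is more strongly correlated with the target.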
See Also
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> atom = ATOMClassifier(X, y)
>>> atom.feature_selection(strategy="pca", n_features=12, verbose=2)
Fitting FeatureSelector...
Performing feature selection ...
--> Applying Principal Component Analysis...
--> Scaling features...
--> Keeping 12 components.
--> Explained variance ratio: 0.97
>>> # Note that the column names changed
>>> print(atom.dataset)
pca0 pca1 pca2 ... pca10 pca11 target
0 -2.493723 3.082653 1.318595 ... -0.182142 -0.591784 1
1 4.596102 -0.876940 -0.380685 ... 0.224170 1.155544 0
2 0.955979 -2.141057 -1.677736 ... 0.306153 0.099138 0
3 3.221488 4.209911 -2.818757 ... 0.808883 -0.531868 0
4 1.038000 2.451758 -1.753683 ... -0.312883 0.862319 1
.. ... ... ... ... ... ... ...
564 3.414827 -3.757253 -1.012369 ... 0.387175 0.283633 0
565 -1.191561 -1.276069 -0.871712 ... 0.106362 -0.449361 1
566 -2.757000 0.411997 -1.321697 ... 0.185550 -0.025368 1
567 -3.252533 0.074827 0.549622 ... 0.693073 -0.058251 1
568 1.607258 -2.076465 -1.025986 ... -0.385542 0.103603 0
[569 rows x 13 columns]
>>> atom.plot_pca()
>>> from atom.feature_engineering import FeatureSelector
>>> from sklearn.datasets import load_breast_cancer
>>> X, _ = load_breast_cancer(return_X_y=True, as_frame=True)
>>> fs = FeatureSelector(strategy="pca", n_features=12, verbose=2)
>>> X = fs.fit_transform(X)
Fitting FeatureSelector...
Performing feature selection ...
--> Applying Principal Component Analysis...
--> Scaling features...
--> Keeping 12 components.
--> Explained variance ratio: 0.97
>>> # Note that the column names changed
>>> print(X)
pca0 pca1 pca2 ... pca9 pca10 pca11
0 9.192837 1.948583 -1.123166 ... -0.877402 0.262955 -0.859014
1 2.387802 -3.768172 -0.529293 ... 1.106995 0.813120 0.157923
2 5.733896 -1.075174 -0.551748 ... 0.454275 -0.605604 0.124387
3 7.122953 10.275589 -3.232790 ... -1.116975 -1.151514 1.011316
4 3.935302 -1.948072 1.389767 ... 0.377704 0.651360 -0.110515
.. ... ... ... ... ... ... ...
564 6.439315 -3.576817 2.459487 ... 0.256989 -0.062651 0.123342
565 3.793382 -3.584048 2.088476 ... -0.108632 0.244804 0.222753
566 1.256179 -1.902297 0.562731 ... 0.520877 -0.840512 0.096473
567 10.374794 1.672010 -1.877029 ... -0.089296 -0.178628 -0.697461
568 -5.475243 -0.670637 1.490443 ... -0.047726 -0.144094 -0.179496
[569 rows x 12 columns]
The univariate, sfm (when model is not fitted), sfs, rfe and rfecv strategies need a target column. Leaving it None raises an exception.
Parameters | deep : bool, default=True
If True, will return the parameters for this estimator and
contained subobjects that are estimators.
Returns | params : dict
Parameter names mapped to their values.
Kept components are colored and discarded components are transparent. This plot is available only when feature selection was applied with strategy="pca".
Parameters | show: int or None, default=None
Number of components to show. None to show all.
title: str, dict or None, default=None
Title for the plot.
legend: str, dict or None, default="lower right"
Legend for the plot. See the user guide for
an extended description of the choices.
figsize: tuple or None, default=None
Figure's size in pixels, format as (x, y). If None, it
adapts the size to the number of components shown.
filename: str or None, default=None
Save the plot using this name. Use "auto" for automatic
naming. The type of the file depends on the provided name
(.html, .png, .pdf, etc.). If filename has no file type,
the plot is saved as html. If None, the plot is not saved.
display: bool or None, default=True
Whether to render the plot. If None, it returns the figure.
Returns | go.Figure or None
Plot object. Only returned if display=None.
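The filename rules shared by the plot methods can be summarized in a short sketch. `resolve_plot_filename` is a hypothetical helper, not part of the library, and the assumption that "auto" resolves to the name of the plot method is made here only for illustration.

```python
from pathlib import Path


def resolve_plot_filename(filename, plot_name):
    """Hypothetical helper mirroring the documented filename rules."""
    if filename is None:
        return None  # the plot is not saved
    if filename == "auto":
        # Assumption: automatic naming uses the name of the plot method.
        filename = plot_name
    path = Path(filename)
    if path.suffix == "":
        # No file type given: fall back to html.
        path = path.with_suffix(".html")
    return path


print(resolve_plot_filename("components", "plot_components"))  # components.html
print(resolve_plot_filename("figure.png", "plot_components"))  # figure.png
print(resolve_plot_filename(None, "plot_components"))          # None
```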
If the underlying estimator is PCA (for dense datasets), all possible components are plotted. If the underlying estimator is TruncatedSVD (for sparse datasets), it only shows the selected components. The star marks the number of components selected by the user. This plot is available only when feature selection was applied with strategy="pca".
Parameters | title: str, dict or None, default=None
Title for the plot.
legend: str, dict or None, default=None
Does nothing. Implemented for continuity of the API.
figsize: tuple, default=(900, 600)
Figure's size in pixels, format as (x, y).
filename: str or None, default=None
Save the plot using this name. Use "auto" for automatic
naming. The type of the file depends on the provided name
(.html, .png, .pdf, etc.). If filename has no file type,
the plot is saved as html. If None, the plot is not saved.
display: bool or None, default=True
Whether to render the plot. If None, it returns the figure.
Returns | go.Figure or None
Plot object. Only returned if display=None.
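The "Explained variance ratio" reported in the example output is, in the usual PCA sense, the share of the total variance captured by the kept components. The sketch below shows that arithmetic with the standard library only. The `explained_variance_summary` helper and the component variances are made up, and the function is not how the transformer computes the figure internally.

```python
from itertools import accumulate


def explained_variance_summary(component_variances, n_kept):
    """Return per-component ratios, cumulative ratios and the kept share."""
    total = sum(component_variances)
    ratios = [variance / total for variance in component_variances]
    cumulative = list(accumulate(ratios))
    # Share of the total variance explained by the first n_kept components.
    return ratios, cumulative, cumulative[n_kept - 1]


# Made-up component variances, sorted from largest to smallest.
variances = [5.0, 2.5, 1.5, 0.7, 0.3]
ratios, cumulative, kept_ratio = explained_variance_summary(variances, n_kept=3)

print([round(value, 2) for value in cumulative])
print(round(kept_ratio, 2))
```

Reading the cumulative list from left to right gives the kind of curve a variance-versus-components plot displays, with the selected number of components as one point on it.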
Plot the scores obtained by the estimator fitted on every subset of the dataset. Only available when feature selection was applied with strategy="rfecv".
Parameters | title: str, dict or None, default=None
Title for the plot.
legend: str, dict or None, default=None
Legend for the plot. See the user guide for
an extended description of the choices.
figsize: tuple, default=(900, 600)
Figure's size in pixels, format as (x, y).
filename: str or None, default=None
Save the plot using this name. Use "auto" for automatic
naming. The type of the file depends on the provided name
(.html, .png, .pdf, etc.). If filename has no file type,
the plot is saved as html. If None, the plot is not saved.
display: bool or None, default=True
Whether to render the plot. If None, it returns the figure.
Returns | go.Figure or None
Plot object. Only returned if display=None.
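The scores in this plot also explain why strategy="rfecv" can return more features than the requested minimum: the number of features is chosen by score, and the minimum only bounds the search from below. The sketch below is a simplified stand-in for that selection, with a hypothetical `best_subset_size` helper and made-up cross-validation scores; the real estimator's tie-breaking and scoring details may differ.

```python
def best_subset_size(cv_scores, min_features):
    """Pick the number of features with the best mean CV score.

    `cv_scores` maps a number of features to its mean score. Sizes
    below `min_features` are never considered.
    """
    candidates = {
        n_features: score
        for n_features, score in cv_scores.items()
        if n_features >= min_features
    }
    return max(candidates, key=candidates.get)


# Made-up mean cross-validation scores per number of features.
scores = {1: 0.89, 2: 0.93, 3: 0.95, 4: 0.97, 5: 0.96}

print(best_subset_size(scores, min_features=2))  # 4
print(best_subset_size(scores, min_features=5))  # 5
```

With a minimum of 2, the sketch still settles on 4 features because that subset scores best.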
Parameters | filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.
save_data: bool, default=True
Whether to save the dataset with the instance. This
parameter is ignored if the method is not called from
atom. If False, remember to add the data to ATOMLoader
when loading the file.
Parameters | **params : dict
Estimator parameters.
Returns | self : estimator instance
Estimator instance.
This recursively updates the structure of the original layout with the values in the input dict / keyword arguments.