FeatureSelector


class atom.feature_engineering.FeatureSelector(strategy=None, solver=None, n_features=None, min_repeated=2, max_repeated=1.0, max_correlation=1.0, n_jobs=1, device="cpu", engine=None, verbose=0, random_state=None, **kwargs)[source]

Reduce the number of features in the data.

Apply feature selection or dimensionality reduction, either to improve the estimators' accuracy or to boost their performance on very high-dimensional datasets. Additionally, remove multicollinear and low-variance features.

This class can be accessed from atom through the feature_selection method. Read more in the user guide.

Warning

  • Ties between features with equal scores are broken in an unspecified way.
  • For strategy="rfecv", the n_features parameter is the minimum number of features to select, not the actual number of features the transformer returns; it may return more.

Info

  • The "sklearnex" and "cuml" engines are only supported for strategy="pca" with dense datasets.
  • If strategy="pca" and the data is dense and unscaled, it's scaled to mean=0 and std=1 before fitting the PCA transformer.
  • If strategy="pca" and the provided data is sparse, the used estimator is TruncatedSVD, which works more efficiently with sparse matrices.
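The scaling and sparse-data behavior described above can be reproduced with plain sklearn components. A minimal sketch, independent of this class, showing the two code paths (the exact internals of FeatureSelector may differ):

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Dense, unscaled data: scale to mean=0, std=1 before fitting PCA,
# mirroring what the class does for strategy="pca" on dense input.
X_dense = rng.normal(loc=5, scale=3, size=(100, 10))
X_pca = PCA(n_components=5).fit_transform(
    StandardScaler().fit_transform(X_dense)
)

# Sparse data: TruncatedSVD decomposes the matrix without densifying it.
X_sparse = sparse.random(100, 10, density=0.2, random_state=0, format="csr")
X_svd = TruncatedSVD(n_components=5).fit_transform(X_sparse)

print(X_pca.shape, X_svd.shape)  # (100, 5) (100, 5)
```

TruncatedSVD is preferred for sparse input because PCA centers the data, which would destroy sparsity.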

Parameters

strategy: str or None, default=None
Feature selection strategy to use. Choose from:

  • None: Do not perform any feature selection strategy.
  • "univariate": Univariate statistical F-test.
  • "pca": Principal Component Analysis.
  • "sfm": Select best features according to a model.
  • "sfs": Sequential Feature Selection.
  • "rfe": Recursive Feature Elimination.
  • "rfecv": RFE with cross-validated selection.
  • "pso": Particle Swarm Optimization.
  • "hho": Harris Hawks Optimization.
  • "gwo": Grey Wolf Optimization.
  • "dfo": Dragonfly Optimization.
  • "go": Genetic Optimization.

solver: str, func, predictor or None, default=None
Solver/estimator to use for the feature selection strategy. See the corresponding documentation for an extended description of the choices. If None, the default value is used (only if strategy="pca"). Choose from:

  • If strategy="univariate":

  • If strategy="pca":

    • If data is dense:

      • If engine="sklearn":

        • "auto" (default)
        • "full"
        • "covariance_eigh"
        • "arpack"
        • "randomized"
      • If engine="sklearnex":

        • "full" (default)
      • If engine="cuml":

        • "full" (default)
        • "jacobi"
    • If data is sparse:

      • "randomized" (default)
      • "covariance_eigh"
      • "arpack"
  • For the remaining strategies:
    The base estimator. For sfm, rfe and rfecv, it should have either a feature_importances_ or coef_ attribute after fitting. You can use one of the predefined models. Add _class or _reg after the model's name to specify a classification or regression task, e.g., solver="LGB_reg" (not necessary if called from atom). No default option.

n_features: int, float or None, default=None
Number of features to select.

  • If None: Select all features.
  • If <1: Fraction of the total features to select.
  • If >=1: Number of features to select.

If strategy="sfm" and the threshold parameter is not specified, the threshold is automatically set to -inf to select exactly n_features features.

If strategy="rfecv", n_features is the minimum number of features to select.

This parameter is ignored if any of the following strategies is selected: pso, hho, gwo, dfo, go.
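The mapping from n_features to an actual feature count can be sketched with a hypothetical helper (resolve_n_features is not part of the API; it only mirrors the documented semantics for the numeric cases):

```python
def resolve_n_features(n_features, n_total):
    """Mirror the documented n_features semantics: None selects all
    features, values below 1 are fractions of the total, and values of
    1 or more are absolute counts."""
    if n_features is None:
        return n_total
    if n_features < 1:
        return int(n_features * n_total)
    return int(n_features)


print(resolve_n_features(None, 30))  # 30
print(resolve_n_features(0.4, 30))   # 12
print(resolve_n_features(12, 30))    # 12
```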

min_repeated: int, float or None, default=2
Remove categorical features if there isn't any repeated value in at least min_repeated rows. The default is to keep all features with non-maximum variance, i.e., remove the features whose number of unique values equals the number of rows (usually the case for names, IDs, etc.).

  • If None: No check for minimum repetition.
  • If >1: Minimum repetition number.
  • If <=1: Minimum repetition fraction.

max_repeated: int, float or None, default=1.0
Remove categorical features with the same value in at least max_repeated rows. The default is to keep all features with non-zero variance, i.e., remove the features that have the same value in all samples.

  • If None: No check for maximum repetition.
  • If >1: Maximum number of repeated occurrences.
  • If <=1: Maximum fraction of repeated occurrences.
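The two repetition checks above can be illustrated with a pandas sketch. Here repeated_value_filter is a hypothetical helper, not part of the API, and it only covers the numeric cases (None handling omitted):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(6)],        # every value unique
    "constant": ["x"] * 6,                          # one value in all rows
    "city": ["NY", "NY", "LA", "LA", "SF", "SF"],   # repeated but varied
})


def repeated_value_filter(df, min_repeated=2, max_repeated=1.0):
    """Sketch of the documented checks: values <=1 are read as fractions
    of the row count, values >1 as absolute counts."""
    n = len(df)
    min_rep = min_repeated if min_repeated > 1 else min_repeated * n
    max_rep = max_repeated if max_repeated > 1 else max_repeated * n
    keep = []
    for col in df.columns:
        top = df[col].value_counts().iloc[0]  # count of most frequent value
        if top >= min_rep and top < max_rep:
            keep.append(col)
    return keep


print(repeated_value_filter(df))  # ['city']
```

With the defaults, "user_id" is dropped because no value repeats, and "constant" is dropped because one value fills every row.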

max_correlation: float or None, default=1.0
Minimum absolute Pearson correlation to identify correlated features. For each group, it removes all except the feature with the highest correlation to y (if provided, else it removes all but the first). The default value removes equal columns. If None, skip this step.
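A rough pandas sketch of this collinearity filter, assuming a simple pairwise comparison (the real implementation may group features differently):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),  # near-duplicate of "a"
    "c": rng.normal(size=200),                  # independent feature
})
y = pd.Series(a + rng.normal(scale=0.1, size=200))

max_correlation = 0.98
corr = df.corr().abs()
to_drop = set()
for i, f1 in enumerate(corr.columns):
    for f2 in corr.columns[i + 1:]:
        if corr.loc[f1, f2] >= max_correlation:
            # Drop the feature that correlates less with the target.
            to_drop.add(min((f1, f2), key=lambda c: abs(df[c].corr(y))))

print(sorted(df.columns.difference(to_drop)))
```

One of the near-duplicate pair ("a", "b") is removed, while the independent feature "c" always survives.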

n_jobs: int, default=1
Number of cores to use for parallel processing.

  • If >0: Number of cores to use.
  • If -1: Use all available cores.
  • If <-1: Use number of cores - 1 + n_jobs.

device: str, default="cpu"
Device on which to run the estimators. Use any string that follows the SYCL_DEVICE_FILTER filter selector, e.g. device="gpu" to use the GPU. Read more in the user guide.

engine: str or None, default=None
Execution engine to use for estimators. If None, the default value is used. Choose from:

  • "sklearn" (default)
  • "sklearnex"
  • "cuml"

verbose: int, default=0
Verbosity level of the class. Choose from:

  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.

random_state: int or None, default=None
Seed used by the random number generator. If None, the random number generator is the RandomState used by np.random.

**kwargs
Any extra keyword argument for the strategy estimator. See the corresponding documentation for the available options.

Attributes

collinear_: pd.DataFrame
Information on the removed collinear features. Columns include:

  • drop: Name of the dropped feature.
  • corr_feature: Names of the correlated features.
  • corr_value: Corresponding correlation coefficients.

[strategy]: sklearn transformer
Object used to transform the data, e.g., fs.pca for the pca strategy.

feature_names_in_: np.ndarray
Names of features seen during fit.

n_features_in_: int
Number of features seen during fit.


See Also

FeatureExtractor

Extract features from datetime columns.

FeatureGenerator

Generate new features.

FeatureGrouper

Extract statistics from similar features.


Example

>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)

>>> atom = ATOMClassifier(X, y)
>>> atom.feature_selection(strategy="pca", n_features=12, verbose=2)

Fitting FeatureSelector...
Performing feature selection ...
 --> Applying Principal Component Analysis...
   --> Scaling features...
   --> Keeping 12 components.
   --> Explained variance ratio: 0.972

>>> # Note that the column names changed
>>> print(atom.dataset)

         pca0      pca1      pca2      pca3      pca4      pca5      pca6      pca7      pca8      pca9     pca10     pca11  target
0   -4.550570 -3.137509 -0.092090 -1.609549  1.058535  0.642739  0.641216  0.149573  0.553681  0.012809  0.250259 -0.076306       1
1   -1.146323 -1.318056 -0.879402 -1.748581  0.615686  0.708593  0.125549 -0.481384  0.347614 -0.568856 -0.044347 -0.700876       1
2    4.022389  2.985922 -3.007969  2.528286 -0.074799  0.610755  0.573571 -0.489323 -0.433615  0.372529 -0.055120  0.235854       0
3    2.354108  4.800756 -1.036847 -0.338138 -2.778482  4.910456 -1.033646  0.196407  0.075153 -0.625668  0.095093 -0.761412       0
4   -2.907053  0.157141  2.076819 -1.780199  1.288917  2.018880  0.551727 -0.428490 -0.246376 -0.579773 -0.238851 -0.130475       1
..        ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...     ...
564  0.985404  0.982949 -2.306420  0.314275 -0.470162  0.260210  0.474216 -0.067590 -0.292023 -0.021339 -0.199168 -0.129227       0
565 -0.952207  0.929797 -0.911746 -1.115508 -0.608486  0.914564 -0.580888 -1.027668  0.331691 -0.142763  0.850646 -0.454396       1
566  4.679295 -0.876128 -0.128211  0.984446 -0.177509  0.453696  0.081011 -0.789588  0.241801 -0.043042  0.593988  0.289491       0
567  4.511927 -0.693781 -0.407906 -0.963769  0.110021 -0.956139 -2.317544  0.046630  0.829408 -0.114397  0.283519 -0.946620       0
568 -0.591290  1.764559  1.380277  0.167142 -1.660975  0.438834 -0.076031 -0.002265 -0.196759 -0.271184  0.003263 -0.383594       1

[569 rows x 13 columns]
>>> from atom.feature_engineering import FeatureSelector
>>> from sklearn.datasets import load_breast_cancer

>>> X, _ = load_breast_cancer(return_X_y=True, as_frame=True)

>>> fs = FeatureSelector(strategy="pca", n_features=12, verbose=2)
>>> X = fs.fit_transform(X)

Fitting FeatureSelector...
Performing feature selection ...
 --> Applying Principal Component Analysis...
   --> Scaling features...
   --> Keeping 12 components.
   --> Explained variance ratio: 0.97

>>> # Note that the column names changed
>>> print(X)

          pca0       pca1      pca2      pca3      pca4      pca5      pca6      pca7      pca8      pca9     pca10     pca11
0     9.192837   1.948583 -1.123166 -3.633731  1.195110  1.411424  2.159370 -0.398407 -0.157118 -0.877402  0.262955  0.859014
1     2.387802  -3.768172 -0.529293 -1.118264 -0.621775  0.028656  0.013358  0.240988 -0.711905  1.106995  0.813120 -0.157923
2     5.733896  -1.075174 -0.551748 -0.912083  0.177086  0.541452 -0.668166  0.097374  0.024066  0.454275 -0.605604 -0.124387
3     7.122953  10.275589 -3.232790 -0.152547  2.960878  3.053422  1.429911  1.059565 -1.405440 -1.116975 -1.151514 -1.011316
4     3.935302  -1.948072  1.389767 -2.940639 -0.546747 -1.226495 -0.936213  0.636376 -0.263805  0.377704  0.651360  0.110515
..         ...        ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
564   6.439315  -3.576817  2.459487 -1.177314  0.074824 -2.375193 -0.596130 -0.035471  0.987929  0.256989 -0.062651 -0.123342
565   3.793382  -3.584048  2.088476  2.506028  0.510723 -0.246710 -0.716326 -1.113360 -0.105207 -0.108632  0.244804 -0.222753
566   1.256179  -1.902297  0.562731  2.089227 -1.809991 -0.534447 -0.192758  0.341887  0.393917  0.520877 -0.840512 -0.096473
567  10.374794   1.672010 -1.877029  2.356031  0.033742  0.567936  0.223082 -0.280239 -0.542035 -0.089296 -0.178628  0.697461
568  -5.475243  -0.670637  1.490443  2.299157  0.184703  1.617837  1.698952  1.046354  0.374101 -0.047726 -0.144094  0.179496

[569 rows x 12 columns]


Methods

fit: Fit the feature selector to the data.
fit_transform: Fit to data, then transform it.
get_feature_names_out: Get output feature names for transformation.
get_metadata_routing: Get metadata routing of this object.
get_params: Get parameters for this estimator.
inverse_transform: Do nothing.
set_output: Set output container.
set_params: Set the parameters of this estimator.
transform: Transform the data.


method fit(X, y=None)[source]

Fit the feature selector to the data.

The univariate, sfm (when model is not fitted), sfs, rfe and rfecv strategies need a target column. Leaving it None raises an exception.

Parameters

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X.

Returns

self
Estimator instance.



method fit_transform(X=None, y=None, **fit_params)[source]

Fit to data, then transform it.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X. If None, y is ignored.

**fit_params
Additional keyword arguments for the fit method.

Returns

dataframe
Transformed feature set. Only returned if provided.

series or dataframe
Transformed target column. Only returned if provided.



method get_feature_names_out(input_features=None)[source]

Get output feature names for transformation.

Parameters

input_features: sequence or None, default=None
Only used to validate feature names with the names seen in fit.

Returns

np.ndarray
Transformed feature names.



method get_metadata_routing()[source]

Get metadata routing of this object.

Returns

routing: MetadataRequest
A sklearn.utils.metadata_routing.MetadataRequest encapsulating routing information.



method get_params(deep=True)[source]

Get parameters for this estimator.

Parameters

deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict
Parameter names mapped to their values.



method inverse_transform(X=None, y=None, **fit_params)[source]

Do nothing.

Returns the input unchanged. Implemented for continuity of the API.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X. If None, y is ignored.

Returns

dataframe
Feature set. Only returned if provided.

series or dataframe
Target column(s). Only returned if provided.



method set_output(transform=None)[source]

Set output container.

See sklearn's user guide on how to use the set_output API.

Parameters

transform: str or None, default=None
Configure the output of the transform, fit_transform, and inverse_transform methods. If None, the configuration is not changed. Choose from:

  • "numpy"
  • "pandas" (default)
  • "pandas-pyarrow"
  • "polars"
  • "polars-lazy"
  • "pyarrow"
  • "modin"
  • "dask"
  • "pyspark"
  • "pyspark-pandas"

Returns

Self
Estimator instance.



method set_params(**params)[source]

Set the parameters of this estimator.

Parameters

**params: dict
Estimator parameters.

Returns

self
Estimator instance.



method transform(X, y=None)[source]

Transform the data.

Parameters

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: sequence, dataframe-like or None, default=None
Do nothing. Implemented for continuity of the API.

Returns

dataframe
Transformed feature set.