
FeatureSelector


class atom.feature_engineering.FeatureSelector(strategy=None, solver=None, n_features=None, min_repeated=2, max_repeated=1.0, max_correlation=1.0, n_jobs=1, device="cpu", engine=None, verbose=0, random_state=None, **kwargs)[source]
Reduce the number of features in the data.

Apply feature selection or dimensionality reduction, either to improve the estimators' accuracy or to boost their performance on very high-dimensional datasets. Additionally, remove multicollinear and low-variance features.

This class can be accessed from atom through the feature_selection method. Read more in the user guide.

Warning

  • Ties between features with equal scores are broken in an unspecified way.
  • For strategy="rfecv", the n_features parameter sets the minimum number of features to select, not the exact number the transformer returns; it may well return more.

Info

  • The "sklearnex" and "cuml" engines are only supported for strategy="pca" with dense datasets.
  • If strategy="pca" and the data is dense and unscaled, it's scaled to mean=0 and std=1 before fitting the PCA transformer.
  • If strategy="pca" and the provided data is sparse, the used estimator is TruncatedSVD, which works more efficiently with sparse matrices.
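
The dense/sparse dispatch described above can be sketched with plain scikit-learn. This is illustrative only, not ATOM's actual code, and the scaling of dense data happens in a separate step:

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import PCA, TruncatedSVD

def make_pca_estimator(X, n_components):
    # Sparse input: TruncatedSVD decomposes without densifying the matrix.
    if sparse.issparse(X):
        return TruncatedSVD(n_components=n_components, algorithm="randomized")
    # Dense input: ATOM additionally scales to mean=0, std=1 before fitting.
    return PCA(n_components=n_components)

dense = np.random.default_rng(0).normal(size=(20, 5))
sp = sparse.random(20, 5, density=0.2, random_state=0, format="csr")

print(type(make_pca_estimator(dense, 3)).__name__)  # PCA
print(type(make_pca_estimator(sp, 3)).__name__)     # TruncatedSVD
```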

Parameters

strategy: str or None, default=None
Feature selection strategy to use. Choose from:

  • None: Do not perform any feature selection strategy.
  • "univariate": Univariate statistical F-test.
  • "pca": Principal Component Analysis.
  • "sfm": Select best features according to a model.
  • "sfs": Sequential Feature Selection.
  • "rfe": Recursive Feature Elimination.
  • "rfecv": RFE with cross-validated selection.
  • "pso": Particle Swarm Optimization.
  • "hho": Harris Hawks Optimization.
  • "gwo": Grey Wolf Optimization.
  • "dfo": Dragonfly Optimization.
  • "go": Genetic Optimization.

solver: str, func, predictor or None, default=None
Solver/estimator to use for the feature selection strategy. See the corresponding documentation for an extended description of the choices. If None, the default value is used (only if strategy="pca"). Choose from:

  • If strategy="univariate":

  • If strategy="pca":

    • If data is dense:

      • If engine="sklearn":

        • "auto" (default)
        • "full"
        • "arpack"
        • "randomized"
      • If engine="sklearnex":

        • "full" (default)
      • If engine="cuml":

        • "full" (default)
        • "jacobi"
    • If data is sparse:

      • "randomized" (default)
      • "arpack"
  • For the remaining strategies:
    The base estimator. For sfm, rfe and rfecv, it should have either a feature_importances_ or coef_ attribute after fitting. You can use one of the predefined models. Add _class or _reg after the model's name to specify a classification or regression task, e.g., solver="LGB_reg" (not necessary if called from atom). No default option.
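
The coef_/feature_importances_ requirement mirrors the underlying scikit-learn selectors. A minimal sketch with plain sklearn, using LogisticRegression directly instead of ATOM's model aliases:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# LogisticRegression exposes coef_ after fitting, so it is a valid base
# estimator for the rfe strategy.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
print(rfe.transform(X).shape)  # (100, 4)
```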

n_features: int, float or None, default=None
Number of features to select.

  • If None: Select all features.
  • If <1: Fraction of the total features to select.
  • If >=1: Number of features to select.

If strategy="sfm" and the threshold parameter is not specified, the threshold is automatically set to -inf to select n_features number of features.

If strategy="rfecv", n_features is the minimum number of features to select.

This parameter is ignored if any of the following strategies is selected: pso, hho, gwo, dfo, go.
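
In other words, n_features resolves to an absolute count roughly as follows (a hypothetical helper for illustration, not part of ATOM's API):

```python
def resolve_n_features(n_features, n_total):
    # Map the n_features parameter to an actual number of features.
    if n_features is None:
        return n_total                     # select all features
    if n_features < 1:
        return int(n_features * n_total)   # fraction of the total
    return int(n_features)                 # absolute count

print(resolve_n_features(None, 30))  # 30
print(resolve_n_features(0.5, 30))   # 15
print(resolve_n_features(12, 30))    # 12
```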

min_repeated: int, float or None, default=2
Remove categorical features if no value is repeated in at least min_repeated rows. The default keeps all features with non-maximum variance, i.e., it removes features whose number of unique values equals the number of rows (usually the case for names, IDs, etc.).

  • If None: No check for minimum repetition.
  • If >1: Minimum repetition number.
  • If <=1: Minimum repetition fraction.

max_repeated: int, float or None, default=1.0
Remove categorical features with the same value in at least max_repeated rows. The default is to keep all features with non-zero variance, i.e., remove the features that have the same value in all samples.

  • If None: No check for maximum repetition.
  • If >1: Maximum number of repeated occurrences.
  • If <=1: Maximum fraction of repeated occurrences.
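
Both repetition checks compare the count of a column's most frequent value against a threshold. A rough pandas sketch (a hypothetical helper, not ATOM's implementation):

```python
import pandas as pd

def filter_repetition(df, min_repeated=2, max_repeated=1.0):
    # Sketch of the min_repeated/max_repeated checks described above.
    n = len(df)
    keep = []
    for col in df.columns:
        top = df[col].value_counts().iloc[0]  # rows holding the modal value
        if min_repeated is not None:
            lo = min_repeated if min_repeated > 1 else min_repeated * n
            if top < lo:       # (nearly) all values unique, e.g. an ID column
                continue
        if max_repeated is not None:
            hi = max_repeated if max_repeated > 1 else max_repeated * n
            if top >= hi:      # (almost) constant column
                continue
        keep.append(col)
    return df[keep]

df = pd.DataFrame({"id": range(6),            # unique per row -> dropped
                   "const": ["a"] * 6,        # constant -> dropped
                   "color": list("rrggbb")})  # repeated but varied -> kept
print(list(filter_repetition(df).columns))  # ['color']
```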

max_correlation: float or None, default=1.0
Minimum absolute Pearson correlation to identify correlated features. For each group, it removes all except the feature with the highest correlation to y (if provided, else it removes all but the first). The default value removes equal columns. If None, skip this step.
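
The collinearity filter can be sketched as a greedy pass over the pairwise correlation matrix. This is a hypothetical helper for illustration; ATOM's grouping logic may differ in detail:

```python
import pandas as pd

def drop_correlated(X, y=None, max_correlation=1.0):
    # Greedy sketch: for each pair at or above the threshold, keep the
    # feature more correlated with y (or the first seen, if y is None).
    corr = X.corr().abs()
    dropped = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] >= max_correlation:
                if y is not None:
                    keep_a = abs(X[a].corr(y)) >= abs(X[b].corr(y))
                else:
                    keep_a = True
                dropped.add(b if keep_a else a)
    return X.drop(columns=list(dropped))

X = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                  "b": [2, 4, 6, 8, 10],  # exact multiple of "a"
                  "c": [5, 1, 4, 2, 3]})
print(list(drop_correlated(X, max_correlation=0.95).columns))  # ['a', 'c']
```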

n_jobs: int, default=1
Number of cores to use for parallel processing.

  • If >0: Number of cores to use.
  • If -1: Use all available cores.
  • If <-1: Use number of cores - 1 + n_jobs.

device: str, default="cpu"
Device on which to run the estimators. Use any string that follows the SYCL_DEVICE_FILTER selector, e.g., device="gpu" to use the GPU. Read more in the user guide.

engine: str or None, default=None
Execution engine to use for estimators. If None, the default value is used. Choose from:

  • "sklearn" (default)
  • "sklearnex"
  • "cuml"

verbose: int, default=0
Verbosity level of the class. Choose from:

  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.

random_state: int or None, default=None
Seed used by the random number generator. If None, the random number generator is the RandomState used by np.random.

**kwargs
Any extra keyword argument for the strategy estimator. See the corresponding documentation for the available options.

Attributes

collinear_: pd.DataFrame
Information on the removed collinear features. Columns include:

  • drop: Name of the dropped feature.
  • corr_feature: Names of the correlated features.
  • corr_value: Corresponding correlation coefficients.

[strategy]_: sklearn transformer
Object used to transform the data, e.g., fs.pca for the pca strategy.

feature_names_in_: np.ndarray
Names of features seen during fit.

n_features_in_: int
Number of features seen during fit.


See Also

FeatureExtractor

Extract features from datetime columns.

FeatureGenerator

Generate new features.

FeatureGrouper

Extract statistics from similar features.


Example

>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)

>>> atom = ATOMClassifier(X, y)
>>> atom.feature_selection(strategy="pca", n_features=12, verbose=2)

Fitting FeatureSelector...
Performing feature selection ...
 --> Applying Principal Component Analysis...
   --> Scaling features...
   --> Keeping 12 components.
   --> Explained variance ratio: 0.971


>>> # Note that the column names changed
>>> print(atom.dataset)

         pca0      pca1      pca2      pca3      pca4      pca5      pca6      pca7      pca8      pca9     pca10     pca11  target
0    4.948620 -3.211108  2.813983 -0.261623  0.662504  0.936950 -1.313074 -1.081034  0.183677 -0.642439 -0.958280 -0.694207       0
1   -2.502149  0.932102 -0.895617 -1.858217  0.880730 -0.382548  1.100790  0.968928 -0.732717  0.286001  0.244840  0.416763       1
2   -2.100882  1.231940  0.369006  1.182547  0.391909  1.736365  0.241540  1.028575  0.405305 -0.192192 -0.242827  0.557610       1
3    0.578704  0.584978 -0.309733  0.976049  0.234536 -0.909241 -0.646571  0.550037 -0.113273  0.082274  0.216308 -0.754310       0
4   -3.276672 -1.035308 -1.652544  0.431873  0.465152  0.398455  0.067745  0.154959  0.279904 -0.290049  0.077331  0.217252       1
..        ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...     ...
564 -2.260973  1.561488 -0.020353  0.305708  0.625720 -1.525210 -0.817414 -0.238004  0.062477 -0.267512 -0.077557 -0.328817       1
565 -1.405854  0.363970  2.123404  0.810231  2.482462  0.637570  0.796672  0.323181 -0.377214 -0.044153  0.197700 -0.263923       1
566 -4.144479 -1.360866  0.549383 -0.769523  0.146979  1.059392  0.735636 -0.428059  0.130815  0.439454 -0.255789  0.127913       1
567 -4.570773 -0.772144 -0.585513  0.652783  0.201551  0.125336  0.565221  0.726564  0.173535  0.099568  0.059806  0.483699       1
568  3.328853  2.664665  6.591582 -0.123192  3.077119 -4.408566 -3.108804 -4.132658 -1.858832  3.522249 -1.427187 -0.948496       0

[569 rows x 13 columns]
>>> from atom.feature_engineering import FeatureSelector
>>> from sklearn.datasets import load_breast_cancer

>>> X, _ = load_breast_cancer(return_X_y=True, as_frame=True)

>>> fs = FeatureSelector(strategy="pca", n_features=12, verbose=2)
>>> X = fs.fit_transform(X)

Fitting FeatureSelector...
Performing feature selection ...
 --> Applying Principal Component Analysis...
   --> Scaling features...
   --> Keeping 12 components.
   --> Explained variance ratio: 0.97


>>> # Note that the column names changed
>>> print(X)

          pca0       pca1      pca2      pca3      pca4      pca5      pca6      pca7      pca8      pca9     pca10     pca11
0     9.192837   1.948583 -1.123166  3.633731 -1.195110  1.411424  2.159370 -0.398407 -0.157118 -0.877402  0.262955 -0.859014
1     2.387802  -3.768172 -0.529293  1.118264  0.621775  0.028656  0.013358  0.240988 -0.711905  1.106995  0.813120  0.157923
2     5.733896  -1.075174 -0.551748  0.912083 -0.177086  0.541452 -0.668166  0.097374  0.024066  0.454275 -0.605604  0.124387
3     7.122953  10.275589 -3.232790  0.152547 -2.960878  3.053422  1.429911  1.059565 -1.405440 -1.116975 -1.151514  1.011316
4     3.935302  -1.948072  1.389767  2.940639  0.546747 -1.226495 -0.936213  0.636376 -0.263805  0.377704  0.651360 -0.110515
..         ...        ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
564   6.439315  -3.576817  2.459487  1.177314 -0.074824 -2.375193 -0.596130 -0.035471  0.987929  0.256989 -0.062651  0.123342
565   3.793382  -3.584048  2.088476 -2.506028 -0.510723 -0.246710 -0.716326 -1.113360 -0.105207 -0.108632  0.244804  0.222753
566   1.256179  -1.902297  0.562731 -2.089227  1.809991 -0.534447 -0.192758  0.341887  0.393917  0.520877 -0.840512  0.096473
567  10.374794   1.672010 -1.877029 -2.356031 -0.033742  0.567936  0.223082 -0.280239 -0.542035 -0.089296 -0.178628 -0.697461
568  -5.475243  -0.670637  1.490443 -2.299157 -0.184703  1.617837  1.698952  1.046354  0.374101 -0.047726 -0.144094 -0.179496

[569 rows x 12 columns]


Methods

fit: Fit the feature selector to the data.
fit_transform: Fit to data, then transform it.
get_feature_names_out: Get output feature names for transformation.
get_metadata_routing: Get metadata routing of this object.
get_params: Get parameters for this estimator.
inverse_transform: Do nothing.
set_output: Set output container.
set_params: Set the parameters of this estimator.
transform: Transform the data.


method fit(X, y=None)[source]
Fit the feature selector to the data.

The univariate, sfm (when the model is not fitted), sfs, rfe and rfecv strategies need a target column. Leaving it None raises an exception.

Parameters

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X.

Returns

self
Estimator instance.



method fit_transform(X=None, y=None, **fit_params)[source]
Fit to data, then transform it.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X. If None, y is ignored.

**fit_params
Additional keyword arguments for the fit method.

Returns

dataframe
Transformed feature set. Only returned if provided.

series or dataframe
Transformed target column. Only returned if provided.



method get_feature_names_out(input_features=None)[source]
Get output feature names for transformation.

Parameters

input_features: sequence or None, default=None
Only used to validate feature names with the names seen in fit.

Returns

np.ndarray
Transformed feature names.



method get_metadata_routing()[source]
Get metadata routing of this object.

Returns

routing: MetadataRequest
A MetadataRequest encapsulating routing information.



method get_params(deep=True)[source]
Get parameters for this estimator.

Parameters

deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict
Parameter names mapped to their values.



method inverse_transform(X=None, y=None, **fit_params)[source]
Do nothing.

Returns the input unchanged. Implemented for continuity of the API.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X. If None, y is ignored.

Returns

dataframe
Feature set. Only returned if provided.

series or dataframe
Target column(s). Only returned if provided.



method set_output(transform=None)[source]
Set output container.

See sklearn's user guide on how to use the set_output API. The available choices are described below.

Parameters

transform: str or None, default=None
Configure the output of the transform, fit_transform, and inverse_transform methods. If None, the configuration is not changed. Choose from:

  • "numpy"
  • "pandas" (default)
  • "pandas-pyarrow"
  • "polars"
  • "polars-lazy"
  • "pyarrow"
  • "modin"
  • "dask"
  • "pyspark"
  • "pyspark-pandas"

Returns

Self
Estimator instance.



method set_params(**params)[source]
Set the parameters of this estimator.

Parameters

**params: dict
Estimator parameters.

Returns

self: estimator instance
Estimator instance.



method transform(X, y=None)[source]
Transform the data.

Parameters

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: sequence, dataframe-like or None, default=None
Do nothing. Implemented for continuity of the API.

Returns

dataframe
Transformed feature set.