FeatureSelector


class atom.feature_engineering.FeatureSelector(strategy=None, solver=None, n_features=None, max_frac_repeated=1., max_correlation=1., n_jobs=1, gpu=False, verbose=0, logger=None, random_state=None, **kwargs) [source]

Remove features according to the selected strategy. Ties between features with equal scores are broken in an unspecified way. Additionally, remove multicollinear and low variance features. This class can be accessed from atom through the feature_selection method. Read more in the user guide.

Parameters: strategy: str or None, optional (default=None)
Feature selection strategy to use. Choose from:
  • None: Do not perform any feature selection strategy.
  • "univariate": Univariate statistical F-test.
  • "pca": Principal Component Analysis.
  • "sfm": Select best features according to a model.
  • "sfs": Sequential Feature Selection.
  • "rfe": Recursive Feature Elimination.
  • "rfecv": RFE with cross-validated selection.
  • "pso": Particle Swarm Optimization.
  • "hho": Harris Hawks Optimization.
  • "gwo": Grey Wolf Optimization.
  • "dfo": Dragonfly Optimization.
  • "genetic": Genetic Optimization.
solver: str, estimator or None, optional (default=None)
Solver/model to use for the feature selection strategy. See the corresponding documentation for an extended description of the choices. If None, use the strategy's default solver (only available for pca; the other strategies require an explicit solver).
  • for "univariate", choose from:
    • "f_classif"
    • "f_regression"
    • "mutual_info_classif"
    • "mutual_info_regression"
    • "chi2"
    • Any function taking two arrays (X, y), and returning arrays (scores, p-values). See the sklearn documentation.
  • for "pca", choose from:
    • if dense data:
      • "auto" (default)
      • "full"
      • "arpack"
      • "randomized"
    • if sparse data:
      • "randomized" (default)
      • "arpack"
    • if gpu implementation:
      • "full" (default)
      • "jacobi"
      • "auto"
  • for the remaining strategies:

    The base estimator. For sfm, rfe and rfecv, it should have either a feature_importances_ or a coef_ attribute after fitting. You can use one of ATOM's predefined models. Add _class or _reg after the model's name to specify a classification or regression task, e.g. solver="LGB_reg" (not necessary if called from an atom instance). No default option.
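
ATOM's sfm strategy is built on scikit-learn's SelectFromModel. A minimal sketch with a random-forest solver (toy data; this illustrates the requirement above, not ATOM's internal call):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Toy dataset: 10 features, only a few informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# The solver must expose feature_importances_ (or coef_) after fitting.
solver = RandomForestClassifier(n_estimators=50, random_state=0)

# threshold=-np.inf disables the score cutoff, so exactly
# max_features features are kept.
sfm = SelectFromModel(solver, threshold=-np.inf, max_features=3)
X_new = sfm.fit_transform(X, y)
print(X_new.shape)  # (200, 3)
```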

n_features: int, float or None, optional (default=None)
Number of features to select. Choose from:
  • if None: Select all features.
  • if < 1: Fraction of the total features to select.
  • if >= 1: Number of features to select.

If strategy="sfm" and the threshold parameter is not specified, the threshold is set to -np.inf to select n_features number of features.
If strategy="rfecv", n_features is the minimum number of features to select.
This parameter is ignored if any of the following strategies is selected: pso, hho, gwo, dfo, genetic.
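
The fraction-vs-count convention can be sketched as a small helper (an illustration of the rule above; `resolve_n_features` is a hypothetical name, not ATOM's code):

```python
def resolve_n_features(n_features, n_total):
    """Translate the n_features parameter into an absolute count."""
    if n_features is None:
        return n_total                    # select all features
    if n_features < 1:
        return int(n_features * n_total)  # fraction of the total
    return int(n_features)                # absolute number


print(resolve_n_features(None, 20))  # 20
print(resolve_n_features(0.5, 20))   # 10
print(resolve_n_features(8, 20))     # 8
```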

max_frac_repeated: float or None, optional (default=1.)
Remove features with the same value in at least this fraction of the total rows. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples. If None, skip this step.
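
A pandas sketch of the same idea (hypothetical helper, not ATOM's implementation): drop every column whose most frequent value covers at least max_frac_repeated of the rows.

```python
import pandas as pd


def drop_low_variance(df, max_frac_repeated=1.0):
    """Drop columns whose most frequent value covers at least
    max_frac_repeated of the rows."""
    keep = [
        col for col in df.columns
        if df[col].value_counts(normalize=True).max() < max_frac_repeated
    ]
    return df[keep]


df = pd.DataFrame({"constant": [1, 1, 1, 1], "varied": [1, 2, 3, 4]})
print(drop_low_variance(df).columns.tolist())  # ['varied']
```

With the default of 1.0, only columns with a single repeated value in all rows are removed, matching the "non-zero variance" behaviour described above.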

max_correlation: float or None, optional (default=1.)
Minimum absolute Pearson correlation to identify correlated features. For each group, it removes all except the feature with the highest correlation to `y` (if provided, else it removes all but the first). The default value removes equal columns. If None, skip this step.
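
A simplified pandas sketch of correlation-based filtering (hypothetical helper; it keeps the first feature of each correlated group and skips the y-aware tie-break described above):

```python
import numpy as np
import pandas as pd


def drop_correlated(df, max_correlation=1.0):
    """Drop all but the first feature of each group whose absolute
    Pearson correlation reaches max_correlation."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [
        col for col in upper.columns
        if (upper[col] >= max_correlation).any()
    ]
    return df.drop(columns=to_drop)


# "b" is an exact multiple of "a", so it is dropped at the default 1.0.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 3, 2, 5]})
print(drop_correlated(df).columns.tolist())  # ['a', 'c']
```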

n_jobs: int, optional (default=1)
Number of cores to use for parallel processing.
  • If >0: Number of cores to use.
  • If -1: Use all available cores.
  • If <-1: Use available_cores - 1 + n_jobs.
gpu: bool or str, optional (default=False)
Train strategy on GPU (instead of CPU). Only for strategy="pca".
  • If False: Always use CPU implementation.
  • If True: Use GPU implementation if possible.
  • If "force": Force GPU implementation.
verbose: int, optional (default=0)
Verbosity level of the class. Choose from:
  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.
logger: str, Logger or None, optional (default=None)
  • If None: Doesn't save a logging file.
  • If str: Name of the log file. Use "auto" for automatic naming.
  • Else: Python logging.Logger instance.
random_state: int or None, optional (default=None)
Seed used by the random number generator. If None, the random number generator is the RandomState instance used by np.random.

**kwargs
Any extra keyword argument for the strategy estimator. See the corresponding documentation for the available options.

Info

If strategy="pca" and the provided data is dense, it's scaled to mean=0 and std=1 before fitting the transformer (if it wasn't already).
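
The scaling step is equivalent to chaining StandardScaler before PCA in scikit-learn. A minimal sketch with toy data (not ATOM's pipeline):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))  # dense, unscaled data

# Scale to mean=0, std=1, then fit PCA on the scaled data.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
X_pca = pipe.fit_transform(X)
print(X_pca.shape)  # (100, 2)
```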

Tip

Use the plot_feature_importance method to examine how much a specific feature contributes to the final predictions. If the model doesn't have a feature_importances_ attribute, use plot_permutation_importance instead.

Note

Be aware that, for strategy="rfecv", the n_features parameter is the minimum number of features to select, not the actual number of features that the transformer returns. It may very well be that it returns more!
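
ATOM's rfecv strategy is built on scikit-learn's RFECV, where n_features maps to min_features_to_select. A sketch showing that the returned count is only bounded below by it:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=150, n_features=8, n_informative=5, random_state=0
)

# min_features_to_select is a lower bound, not the exact output size.
rfecv = RFECV(LogisticRegression(max_iter=1000), min_features_to_select=2, cv=3)
rfecv.fit(X, y)
print(rfecv.n_features_ >= 2)  # True
```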


Attributes

Utility attributes

Attributes: collinear: pd.DataFrame
Information on the removed collinear features. Columns include:
  • drop: Name of the dropped feature.
  • corr_feature: Name of the correlated feature(s).
  • corr_value: Corresponding correlation coefficient(s).

feature_importance: list
Remaining features ordered by importance. Only if strategy in ("univariate", "sfm", "rfe", "rfecv"). For rfe and rfecv, the importance is extracted from the external estimator fitted on the reduced set.

<strategy>: sklearn transformer
Object used to transform the data, e.g. feature_selector.pca for the pca strategy.

feature_names_in_: np.array
Names of features seen during fit.

n_features_in_: int
Number of features seen during fit.


Plot attributes

Attributes:

style: str
Plotting style. See seaborn's documentation.

palette: str
Color palette. See seaborn's documentation.

title_fontsize: int
Fontsize for the plot's title.

label_fontsize: int
Fontsize for labels and legends.

tick_fontsize: int
Fontsize for the ticks along the plot's axes.




Methods

fit Fit to data.
fit_transform Fit to data, then transform it.
get_params Get parameters for this estimator.
log Write information to the logger and print to stdout.
plot_pca Plot the explained variance ratio vs the number of components.
plot_components Plot the explained variance ratio per component.
plot_rfecv Plot the scores obtained by the estimator on the rfecv.
reset_aesthetics Reset the plot aesthetics to their default values.
save Save the instance to a pickle file.
set_params Set the parameters of this estimator.
transform Transform the data.


method fit(X, y=None) [source]

Fit to data. Note that the univariate, sfm (when the model is not fitted), sfs, rfe and rfecv strategies all need a target column. Leaving it at None raises an exception.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
  • If None: y is ignored.
  • If int: Index of the target column in X.
  • If str: Name of the target column in X.
  • Else: Target column with shape=(n_samples,).
Returns: FeatureSelector
Fitted instance of self.


method fit_transform(X, y=None) [source]

Fit to data, then transform it. Note that the univariate, sfm (when the model is not fitted), sfs, rfe and rfecv strategies need a target column. Leaving it at None raises an exception.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
  • If None: y is ignored.
  • If int: Index of the target column in X.
  • If str: Name of the target column in X.
  • Else: Target column with shape=(n_samples,).
Returns: pd.DataFrame
Transformed feature set.


method get_params(deep=True) [source]

Get parameters for this estimator.

Parameters:

deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns: dict
Parameter names mapped to their values.


method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.


method plot_pca(title=None, figsize=(10, 6), filename=None, display=True) [source]

Plot the explained variance ratio vs the number of components. See plot_pca for a description of the parameters.


method plot_components(show=None, title=None, figsize=None, filename=None, display=True) [source]

Plot the explained variance ratio per component. See plot_components for a description of the parameters.


method plot_rfecv(title=None, figsize=(10, 6), filename=None, display=True) [source]

Plot the scores obtained by the estimator fitted on every subset of the data. See plot_rfecv for a description of the parameters.


method reset_aesthetics() [source]

Reset the plot aesthetics to their default values.


method save(filename="auto") [source]

Save the instance to a pickle file.

Parameters: filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.


method set_params(**params) [source]

Set the parameters of this estimator.

Parameters: **params: dict
Estimator parameters.
Returns: FeatureSelector
Estimator instance.


method transform(X, y=None) [source]

Transform the data.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
Does nothing. Implemented for continuity of the API.

Returns: pd.DataFrame
Transformed feature set.


Example

# Usage through atom:
from atom import ATOMClassifier

atom = ATOMClassifier(X, y)
atom.feature_selection(strategy="pca", n_features=12, whiten=True)

atom.plot_pca(filename="pca", figsize=(8, 5))

# Stand-alone usage:
from atom.feature_engineering import FeatureSelector

feature_selector = FeatureSelector(strategy="pca", n_features=12, whiten=True)
feature_selector.fit(X_train, y_train)
X = feature_selector.transform(X, y)

feature_selector.plot_pca(filename="pca", figsize=(8, 5))