FeatureSelector

class atom.feature_engineering.FeatureSelector(strategy=None, solver=None, n_features=None, max_frac_repeated=1., max_correlation=1., n_jobs=1, gpu=False, verbose=0, logger=None, random_state=None, **kwargs) [source]

Remove features according to the selected strategy. Ties between features with equal scores are broken in an unspecified way. Additionally, remove multicollinear and low variance features. This class can be accessed from atom through the feature_selection method. Read more in the user guide.

Parameters:

strategy: str or None, optional (default=None)
Feature selection strategy to use. Choose from:

None: Do not perform any feature selection strategy.
"univariate": Univariate statistical F-test.
"pca": Principal Component Analysis.
"sfm": Select best features according to a model.
"sfs": Sequential Feature Selection.
"rfe": Recursive Feature Elimination.
"rfecv": RFE with cross-validated selection.
"pso": Particle Swarm Optimization.
"hho": Harris Hawks Optimization.
"gwo": Grey Wolf Optimization.
"dfo": Dragonfly Optimization.
"genetic": Genetic Optimization.

solver: str, estimator or None, optional (default=None)
Solver/model to use for the feature selection strategy. See the corresponding documentation for an extended description of the choices. If None, use the estimator's default value (only pca).

for "univariate", choose from:
- "f_classif"
- "f_regression"
- "mutual_info_classif"
- "mutual_info_regression"
- "chi2"
- Any function taking two arrays (X, y), and returning arrays (scores, p-values). See the sklearn documentation.
for "pca", choose from:
- if dense data:
  - "auto" (default)
  - "full"
  - "arpack"
  - "randomized"
- if sparse data:
  - "randomized" (default)
  - "arpack"
- if gpu implementation:
  - "full" (default)
  - "jacobi"
  - "auto"
for the remaining strategies:

The base estimator. For sfm, rfe and rfecv, it should have either a feature_importances_ or coef_ attribute after fitting. You can use one of ATOM's predefined models. Add _class or _reg after the model's name to specify a classification or regression task, e.g. solver="LGB_reg" (not necessary if called from an atom instance). No default option.

n_features: int, float or None, optional (default=None)
Number of features to select. Choose from:

if None: Select all features.
if < 1: Fraction of the total features to select.
if >= 1: Number of features to select.
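The three cases above can be sketched as a small helper. This is an illustration only: the function name resolve_n_features is hypothetical, and truncating the fractional result to an integer is an assumption of this sketch, not confirmed behavior of the class.

```python
from typing import Optional, Union


def resolve_n_features(n_features: Optional[Union[int, float]], n_total: int) -> int:
    """Translate the n_features parameter into a number of features to keep."""
    if n_features is None:
        return n_total  # None: select all features
    if n_features <= 0:
        raise ValueError("n_features must be positive.")
    if n_features < 1:
        # Fraction of the total features; truncation is an assumption here
        return int(n_features * n_total)
    return int(n_features)  # >= 1: absolute number of features


all_features = resolve_n_features(None, 20)
half = resolve_n_features(0.5, 20)
twelve = resolve_n_features(12, 20)
```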

If strategy="sfm" and the threshold parameter is not specified, the threshold is set to -np.inf so that n_features features are selected.
If strategy="rfecv", n_features is the minimum number of features to select.
This parameter is ignored if any of the following strategies is selected: pso, hho, gwo, dfo, genetic.
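To see why an infinitely low threshold makes the sfm strategy select by count alone, consider the sketch below. It mimics the behavior scikit-learn documents for SelectFromModel when threshold=-np.inf is combined with max_features. The helper name select_from_importances and the importance values are made up for illustration, and returning the names in ranked order is a choice of this sketch.

```python
import math
from typing import Dict, List


def select_from_importances(
    importances: Dict[str, float],
    max_features: int,
    threshold: float = -math.inf,
) -> List[str]:
    """Keep at most max_features features whose importance reaches the threshold."""
    # Rank the feature names from most to least important
    ranked = sorted(importances, key=importances.get, reverse=True)
    # With threshold=-inf every feature passes, so only the cap matters
    passing = [name for name in ranked if importances[name] >= threshold]
    return passing[:max_features]


scores = {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.05}
by_count = select_from_importances(scores, max_features=3)
by_threshold = select_from_importances(scores, max_features=3, threshold=0.2)
```

With the default threshold, three features are returned. With threshold=0.2 only two features pass, even though max_features allows three.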

max_frac_repeated: float or None, optional (default=1.)
Remove features with the same value in at least this fraction of the total rows. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples. If None, skip this step.
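A minimal sketch of this check, assuming the data is available as a plain mapping of column names to values (a stand-in for a dataframe). The function name repeated_features is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def repeated_features(
    columns: Dict[str, Sequence],
    max_frac_repeated: float = 1.0,
) -> List[str]:
    """Return the columns whose most common value fills at least the given fraction of rows."""
    to_drop = []
    for name, values in columns.items():
        # Number of rows taken by the single most frequent value
        top_count = Counter(values).most_common(1)[0][1]
        if top_count / len(values) >= max_frac_repeated:
            to_drop.append(name)
    return to_drop


data = {
    "constant": [1, 1, 1, 1],
    "mostly_zero": [0, 0, 0, 5],
    "varied": [1, 2, 3, 4],
}
default_drop = repeated_features(data)       # default 1.0: only constant columns
strict_drop = repeated_features(data, 0.75)  # also columns that are 75% one value
```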

max_correlation: float or None, optional (default=1.)
Minimum Pearson correlation coefficient to identify correlated features. For each pair above the specified limit (in terms of absolute value), it removes one of the two. The default is to drop one of two equal columns. If None, skip this step.
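The pairwise check can be sketched as follows. This is not the class's implementation: the names pearson and collinear_features are hypothetical, and dropping the later column of each pair is an assumption of this sketch. It also assumes no column is constant, since the coefficient is undefined for zero variance.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def collinear_features(
    columns: Dict[str, Sequence[float]],
    max_correlation: float = 1.0,
) -> List[Tuple[str, str, float]]:
    """List (dropped, kept, r) for every pair whose |r| reaches the limit."""
    names = list(columns)
    dropped: List[Tuple[str, str, float]] = []
    for i, kept in enumerate(names):
        if any(kept == d[0] for d in dropped):
            continue  # already removed, no need to compare it again
        for other in names[i + 1:]:
            if any(other == d[0] for d in dropped):
                continue
            r = pearson(columns[kept], columns[other])
            if abs(r) >= max_correlation:  # the sign of r is ignored
                dropped.append((other, kept, r))
    return dropped


data = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "a_copy": [1.0, 2.0, 3.0, 4.0],
    "a_neg": [-2.0, -4.0, -6.0, -8.0],
    "noise": [3.0, 1.0, 4.0, 2.0],
}
removed = collinear_features(data)
```

Here both the exact copy and the negatively scaled copy of "a" are flagged, because the comparison uses the absolute value of the coefficient.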

n_jobs: int, optional (default=1)
Number of cores to use for parallel processing.

If >0: Number of cores to use.
If -1: Use all available cores.
If <-1: Use available_cores + 1 + n_jobs.

gpu: bool or str, optional (default=False)
Train strategy on GPU (instead of CPU). Only for strategy="pca".

If False: Always use CPU implementation.
If True: Use GPU implementation if possible.
If "force": Force GPU implementation.

verbose: int, optional (default=0)
Verbosity level of the class. Possible values are:

0 to not print anything.
1 to print basic information.
2 to print detailed information.

logger: str, Logger or None, optional (default=None)

If None: Doesn't save a logging file.
If str: Name of the log file. Use "auto" for automatic naming.
Else: Python logging.Logger instance.

random_state: int or None, optional (default=None)
Seed used by the random number generator. If None, the random number generator is the RandomState instance used by np.random.

**kwargs
Any extra keyword argument for the strategy estimator. See the corresponding documentation for the available options.

Info

If strategy="pca" and the provided data is dense, it's scaled to mean=0 and std=1 before fitting the transformer (if it wasn't already).
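The scaling step described above amounts to standardizing every column. A rough sketch for a single column, assuming the population standard deviation (ddof=0) is used and the column is not constant. The helper names is_standardized and standardize are hypothetical, and the tolerance is an arbitrary choice.

```python
import math
import statistics
from typing import List, Sequence


def is_standardized(values: Sequence[float], tol: float = 1e-9) -> bool:
    """Check whether a column already has mean 0 and standard deviation 1."""
    return (
        math.isclose(statistics.fmean(values), 0.0, abs_tol=tol)
        and math.isclose(statistics.pstdev(values), 1.0, abs_tol=tol)
    )


def standardize(values: Sequence[float]) -> List[float]:
    """Scale a column to mean 0 and std 1, unless it is already scaled."""
    if is_standardized(values):
        return list(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (ddof=0): an assumption
    return [(v - mean) / std for v in values]


column = [2.0, 4.0, 6.0, 8.0]
scaled = standardize(column)
```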

Tip

Use the plot_feature_importance method to examine how much a specific feature contributes to the final predictions. If the model doesn't have a feature_importances_ attribute, use plot_permutation_importance instead.

Attributes

Utility attributes

Attributes:

collinear: pd.DataFrame
Information on the removed collinear features. Columns include:

drop_feature: Name of the feature dropped by the method.
correlated_feature: Name of the correlated feature(s).
correlation_value: Pearson correlation coefficients of the feature pairs.

feature_importance: list
Remaining features ordered by importance. Only if strategy in ("univariate", "sfm", "rfe", "rfecv"). For rfe and rfecv, the importance is extracted from the external estimator fitted on the reduced set.

<strategy>: sklearn transformer
Object used to transform the data, e.g. feature_selector.pca for the pca strategy.

Plot attributes

Attributes:

style: str
Plotting style. See seaborn's documentation.

palette: str
Color palette. See seaborn's documentation.

title_fontsize: int
Fontsize for the plot's title.

label_fontsize: int
Fontsize for labels and legends.

tick_fontsize: int
Fontsize for the ticks along the plot's axes.

Methods

fit	Fit to data.
fit_transform	Fit to data, then transform it.
get_params	Get parameters for this estimator.
log	Write information to the logger and print to stdout.
plot_pca	Plot the explained variance ratio vs the number of components.
plot_components	Plot the explained variance ratio per component.
plot_rfecv	Plot the scores obtained by the estimator on the rfecv.
reset_aesthetics	Reset the plot aesthetics to their default values.
save	Save the instance to a pickle file.
set_params	Set the parameters of this estimator.
transform	Transform the data.

method fit(X, y=None) [source]

Fit to data. Note that the univariate, sfm (when the model is not fitted), sfs, rfe and rfecv strategies all need a target column. Leaving it None will raise an exception.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)

If None: y is ignored.
If int: Index of the target column in X.
If str: Name of the target column in X.
Else: Target column with shape=(n_samples,).

Returns:

FeatureSelector
Fitted instance of self.
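The accepted forms of y can be illustrated with a small helper. This sketch uses a plain dict of columns as a stand-in for a dataframe. The name split_target is hypothetical, and removing the target column from X when y is given as an index or name is an assumption of this sketch.

```python
from typing import Dict, List, Optional, Sequence, Tuple, Union

Table = Dict[str, List]  # column name -> values, stand-in for a dataframe


def split_target(
    X: Table,
    y: Union[int, str, Sequence, None] = None,
) -> Tuple[Table, Optional[List]]:
    """Resolve y into a target column, following the rules listed above."""
    if y is None:
        return X, None  # y is ignored
    if isinstance(y, int):
        y = list(X)[y]  # position of the target column -> its name
    if isinstance(y, str):
        # Assumption: the target column is separated from the features
        features = {name: col for name, col in X.items() if name != y}
        return features, X[y]
    return X, list(y)  # y is already a sequence of target values


data = {"age": [31, 45, 27], "income": [40, 85, 32], "label": [0, 1, 0]}
by_name = split_target(data, "label")
by_index = split_target(data, -1)
by_sequence = split_target({"age": [31, 45, 27]}, [0, 1, 0])
no_target = split_target(data)
```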

method fit_transform(X, y=None) [source]

Fit to data, then transform it. Note that the univariate, sfm (when the model is not fitted), sfs, rfe and rfecv strategies all need a target column. Leaving it None will raise an exception.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)

If None: y is ignored.
If int: Index of the target column in X.
If str: Name of the target column in X.
Else: Target column with shape=(n_samples,).

Returns:

pd.DataFrame
Transformed feature set.

method get_params(deep=True) [source]

Get parameters for this estimator.

Parameters:	deep: bool, optional (default=True) If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:	dict Parameter names mapped to their values.

method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.
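How verbose, logger and level might interact can be sketched as below. The class VerboseLogger is hypothetical and not part of the package. Writing every message to the logger regardless of level is an assumption of this sketch. An in-memory handler replaces the log file so that nothing is written to disk.

```python
import io
import logging
from typing import Optional


class VerboseLogger:
    """Hypothetical stand-in for a class with verbose and logger parameters."""

    def __init__(self, verbose: int = 0, logger: Optional[logging.Logger] = None):
        self.verbose = verbose
        self.logger = logger

    def log(self, msg: str, level: int = 0) -> None:
        # Print only if the verbosity reaches the message's minimum level
        if self.verbose >= level:
            print(msg)
        # Assumption: the logger receives every message
        if self.logger is not None:
            self.logger.info(msg)


# In-memory handler used instead of a log file
stream = io.StringIO()
logger = logging.getLogger("feature_selector_sketch")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False

selector = VerboseLogger(verbose=1, logger=logger)
selector.log("Fitting FeatureSelector...", level=1)  # printed and logged
selector.log("Detailed report", level=2)  # logged only, verbose is too low
```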

method plot_pca(title=None, figsize=(10, 6), filename=None, display=True) [source]

Plot the explained variance ratio vs the number of components. See plot_pca for a description of the parameters.

method plot_components(show=None, title=None, figsize=None, filename=None, display=True) [source]

Plot the explained variance ratio per component. See plot_components for a description of the parameters.

method plot_rfecv(title=None, figsize=(10, 6), filename=None, display=True) [source]

Plot the scores obtained by the estimator fitted on every subset of the data. See plot_rfecv for a description of the parameters.

method reset_aesthetics() [source]

Reset the plot aesthetics to their default values.

method save(filename="auto") [source]

Save the instance to a pickle file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.

method set_params(**params) [source]

Set the parameters of this estimator.

Parameters:	**params: dict Estimator parameters.
Returns:	FeatureSelector Estimator instance.

method transform(X, y=None) [source]

Transform the data.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
Does nothing. Implemented for continuity of the API.

Returns:

pd.DataFrame
Transformed feature set.

Example

atom:

from atom import ATOMClassifier

atom = ATOMClassifier(X, y)
atom.feature_selection(strategy="pca", n_features=12, whiten=True)

atom.plot_pca(filename="pca", figsize=(8, 5))

stand-alone:

from atom.feature_engineering import FeatureSelector

feature_selector = FeatureSelector(strategy="pca", n_features=12, whiten=True)
feature_selector.fit(X_train, y_train)
X = feature_selector.transform(X, y)

feature_selector.plot_pca(filename="pca", figsize=(8, 5))