
FeatureSelector


class atom.feature_engineering.FeatureSelector(strategy=None, solver=None, n_features=None, max_frac_repeated=1., max_correlation=1., n_jobs=1, verbose=0, logger=None, random_state=None, **kwargs) [source]

Remove features according to the selected strategy. Ties between features with equal scores are broken in an unspecified way. Additionally, it removes features with low variance and finds pairs of collinear features based on the Pearson correlation coefficient. For each pair above the specified limit (in terms of absolute value), it removes one of the two. This class can be accessed from atom through the feature_selection method. Read more in the user guide.

Parameters: strategy: string or None, optional (default=None)
Feature selection strategy to use. Choose from:
  • None: Do not perform any feature selection algorithm.
  • "univariate": Select best features according to a univariate F-test.
  • "PCA": Perform principal component analysis.
  • "SFM": Select best features according to a model.
  • "RFE": Perform recursive feature elimination.
  • "RFECV": Perform RFE with cross-validated selection.
  • "SFS": Perform Sequential Feature Selection.
solver: string, estimator or None, optional (default=None)
Solver or model to use for the feature selection strategy. See sklearn's documentation for an extended description of the choices. Select None for the default option per strategy (only for univariate and PCA).
  • for "univariate", choose from:
    • "f_classif"
    • "f_regression"
    • "mutual_info_classif"
    • "mutual_info_regression"
    • "chi2"
    • Any function taking two arrays (X, y), and returning arrays (scores, p-values). See the sklearn documentation.
  • for "PCA", choose from:
    • "auto" (default)
    • "full"
    • "arpack"
    • "randomized"
  • for "SFM", "RFE", "RFECV" and "SFS":

    The base estimator. For SFM, RFE and RFECV, it should have either a feature_importances_ or coef_ attribute after fitting. You can use one of ATOM's predefined models. Add _class or _reg after the model's name to specify a classification or regression task, e.g. solver="LGB_reg" (not necessary if called from an atom instance). No default option. See the sketch after this parameter list for an example.

n_features: int, float or None, optional (default=None)
Number of features to select. Choose from:
  • if None: Select all features.
  • if < 1: Fraction of the total features to select.
  • if >= 1: Number of features to select.

If strategy="SFM" and the threshold parameter is not specified, the threshold is set to -np.inf to select the n_features features. If strategy="RFECV", it's the minimum number of features to select.

max_frac_repeated: float or None, optional (default=1.)
Remove features with the same value in at least this fraction of the total rows. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples. If None, skip this step.

max_correlation: float or None, optional (default=1.)
Minimum value of the Pearson correlation coefficient to identify correlated features. A value of 1 removes one of 2 equal columns. A dataframe of the removed features and their correlation values can be accessed through the collinear attribute. If None, skip this step.

n_jobs: int, optional (default=1)
Number of cores to use for parallel processing.
  • If >0: Number of cores to use.
  • If -1: Use all available cores.
  • If <-1: Use available_cores - 1 + n_jobs.
verbose: int, optional (default=0)
Verbosity level of the class. Possible values are:
  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.
logger: str, Logger or None, optional (default=None)
  • If None: Doesn't save a logging file.
  • If str: Name of the log file. Use "auto" for automatic naming.
  • Else: Python logging.Logger instance.
random_state: int or None, optional (default=None)
Seed used by the random number generator. If None, the random number generator is the RandomState instance used by numpy.random.

**kwargs
Any extra keyword argument for the PCA, SFM, RFE, RFECV and SFS estimators. See the corresponding sklearn documentation for the available options.
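
As a rough sketch of how these parameters combine, assuming ATOM's Random Forest acronym ("RF") with the task suffix described under solver; the chosen values are illustrative, not defaults:

from atom.feature_engineering import FeatureSelector

# Keep 80% of the features according to a random forest's feature_importances_,
# drop constant columns, and drop one feature of every pair correlated above 0.98
selector = FeatureSelector(
    strategy="SFM",
    solver="RF_class",      # predefined model acronym + task suffix (assumed)
    n_features=0.8,         # < 1, so a fraction of the total number of features
    max_frac_repeated=1.0,  # only remove features that are constant in all rows
    max_correlation=0.98,   # collinearity threshold on the Pearson coefficient
    verbose=2,
    random_state=1,
)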

Info

If strategy="PCA", the data is scaled to mean=0 and std=1 before fitting the transformer (if it wasn't already).

Tip

Use the plot_feature_importance method to examine how much a specific feature contributes to the final predictions. If the model doesn't have a feature_importances_ attribute, use plot_permutation_importance instead.
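
For instance, a hedged sketch of such a workflow from an atom instance; X and y are assumed to be an existing feature set and target, and "RF" is ATOM's Random Forest acronym:

from atom import ATOMClassifier

atom = ATOMClassifier(X, y)
atom.feature_selection(strategy="SFM", solver="RF", n_features=10)
atom.run("RF")  # train a model on the reduced feature set

# Random forests expose feature_importances_, so this plot is available
atom.plot_feature_importance()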

Warning

The RFE, RFECV and SFS strategies don't work when the solver is a CatBoost model due to incompatibility of the APIs.


Attributes

Utility attributes

Attributes: collinear: pd.DataFrame
Dataframe of the removed collinear features. Columns include:
  • drop_feature: Name of the feature dropped by the method.
  • correlated_feature: Name of the correlated feature(s).
  • correlation_value: Pearson correlation coefficients of the feature pairs.

feature_importance: list
Remaining features ordered by importance. Only if strategy in ("univariate", "SFM", "RFE", "RFECV"). For RFE and RFECV, the importance is extracted from the external estimator fitted on the reduced set.

<strategy>: sklearn estimator
Estimator instance (lowercase strategy) used to transform the data, e.g. feature_selector.pca for the PCA strategy.
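
A short sketch of how these attributes can be inspected after fitting; X_train is assumed to be an existing feature set:

from atom.feature_engineering import FeatureSelector

feature_selector = FeatureSelector(strategy="PCA", n_features=5, max_correlation=0.98)
feature_selector.fit(X_train)  # PCA needs no target column

# Features dropped because of collinearity, with their correlation values
print(feature_selector.collinear)

# The fitted estimator is available under the lowercase strategy name
print(feature_selector.pca.explained_variance_ratio_)  # standard sklearn PCA attribute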


Plot attributes

Attributes:

style: str
Plotting style. See seaborn's documentation.

palette: str
Color palette. See seaborn's documentation.

title_fontsize: int
Fontsize for the plot's title.

label_fontsize: int
Fontsize for labels and legends.

tick_fontsize: int
Fontsize for the ticks along the plot's axes.




Methods

fit Fit to data.
fit_transform Fit to data, then transform it.
get_params Get parameters for this estimator.
log Write information to the logger and print to stdout.
plot_pca Plot the explained variance ratio vs the number of components.
plot_components Plot the explained variance ratio per component.
plot_rfecv Plot the scores obtained by the estimator on the RFECV.
reset_aesthetics Reset the plot aesthetics to their default values.
save Save the instance to a pickle file.
set_params Set the parameters of this estimator.
transform Transform the data.


method fit(X, y=None) [source]

Fit to data. Note that the univariate, SFM (when model is not fitted), RFE and RFECV strategies all need a target column. Leaving it None will raise an exception.

Parameters:

X: dict, list, tuple, np.ndarray or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
  • If None: y is ignored.
  • If int: Index of the target column in X.
  • If str: Name of the target column in X.
  • Else: Target column with shape=(n_samples,).
Returns: self: FeatureSelector
Fitted instance of self.
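
A minimal sketch of the target requirement for a strategy that needs y; the variable and column names are assumptions:

from atom.feature_engineering import FeatureSelector

selector = FeatureSelector(strategy="univariate", solver="f_classif", n_features=10)

# Pass the target as a separate sequence...
selector.fit(X_train, y_train)

# ...or by name (or column index) when the target is a column of X
# selector.fit(train_df, "target")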


method fit_transform(X, y=None) [source]

Fit to data, then transform it. Note that the univariate, SFM (when model is not fitted), RFE and RFECV strategies need a target column. Leaving it None will raise an exception.

Parameters:

X: dict, list, tuple, np.ndarray or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
  • If None: y is ignored.
  • If int: Index of the target column in X.
  • If str: Name of the target column in X.
  • Else: Target column with shape=(n_samples,).
Returns: X: pd.DataFrame
Transformed feature set.


method get_params(deep=True) [source]

Get parameters for this estimator.

Parameters:

deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns: params: dict
Dictionary of the parameter names mapped to their values.


method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.


method plot_pca(title=None, figsize=(10, 6), filename=None, display=True) [source]

Plot the explained variance ratio vs the number of components. See plot_pca for a description of the parameters.


method plot_components(show=None, title=None, figsize=None, filename=None, display=True) [source]

Plot the explained variance ratio per component. See plot_components for a description of the parameters.


method plot_rfecv(title=None, figsize=(10, 6), filename=None, display=True) [source]

Plot the scores obtained by the estimator fitted on every subset of the data. See plot_rfecv for a description of the parameters.


method reset_aesthetics() [source]

Reset the plot aesthetics to their default values.


method save(filename="auto") [source]

Save the instance to a pickle file.

Parameters: filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.


method set_params(**params) [source]

Set the parameters of this estimator.

Parameters: **params: dict
Estimator parameters.
Returns: self: FeatureSelector
Estimator instance.


method transform(X, y=None) [source]

Transform the data.

Parameters:

X: dict, list, tuple, np.ndarray or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
Does nothing. Implemented for continuity of the API.

Returns: X: pd.DataFrame
Transformed feature set.


Example

from atom import ATOMClassifier

atom = ATOMClassifier(X, y)
atom.feature_selection(strategy="PCA", n_features=12, whiten=True)

atom.plot_pca(filename="pca", figsize=(8, 5))
or
from atom.feature_engineering import FeatureSelector

feature_selector = FeatureSelector(strategy="PCA", n_features=12, whiten=True)
feature_selector.fit(X_train, y_train)
X = feature_selector.transform(X, y)

feature_selector.plot_pca(filename="pca", figsize=(8, 5))
