FeatureSelector
Reduce the number of features in the data.
Apply feature selection or dimensionality reduction, either to improve the estimators' accuracy or to boost their performance on very high-dimensional datasets. Additionally, remove multicollinear and low-variance features.
This class can be accessed from atom through the feature_selection method. Read more in the user guide.
Warning
- Ties between features with equal scores are broken in an unspecified way.
- For strategy="rfecv", the n_features parameter is the minimum number of features to select, not the actual number of features the transformer returns. It may very well return more! See the sketch below.
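To see the behavior this warning describes, here's a minimal sketch using scikit-learn's RFECV directly; the dataset and base estimator are placeholders:

>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.feature_selection import RFECV
>>> from sklearn.tree import DecisionTreeClassifier

>>> X, y = load_breast_cancer(return_X_y=True)

>>> # min_features_to_select is a floor: RFECV keeps the feature count that
>>> # maximizes the cross-validated score, which is often above the floor
>>> rfecv = RFECV(DecisionTreeClassifier(random_state=0), min_features_to_select=5)
>>> rfecv.fit(X, y)
>>> rfecv.n_features_  # can be (much) larger than 5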
Info
- The "sklearnex" and "cuml" engines are only supported for strategy="pca" with dense datasets.
- If strategy="pca" and the data is dense and unscaled, it's scaled to mean=0 and std=1 before fitting the PCA transformer.
- If strategy="pca" and the provided data is sparse, the used estimator is TruncatedSVD, which works more efficiently with sparse matrices.
Tip
- Use the plot_pca and plot_components methods to examine the results after using strategy="pca" (see the sketch below the tips).
- Use the plot_rfecv method to examine the results after using strategy="rfecv".
- Use the plot_feature_importance method to examine how much a specific feature contributes to the final predictions. If the model doesn't have a feature_importances_ attribute, use plot_permutation_importance instead.
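For example, after running the pca strategy through atom (a sketch; default plot arguments are assumed to suffice):

>>> atom.feature_selection(strategy="pca", n_features=12)
>>> atom.plot_pca()  # explained variance ratio vs. number of components
>>> atom.plot_components()  # explained variance ratio per component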
Parameters

strategy: str or None, default=None
Feature selection strategy to use. Choose from: None, "univariate", "pca", "sfm", "sfs", "rfe", "rfecv", "pso", "hho", "gwo", "dfo", "go".

solver: str, func, predictor or None, default=None
Solver/estimator to use for the feature selection strategy. See the corresponding documentation for an extended description of the choices. If None, the default value is used (only if strategy="pca").

n_features: int, float or None, default=None
Number of features to select. If strategy="sfm" and the threshold parameter is not specified, the threshold is automatically set to -inf to select n_features number of features. If strategy="rfecv", n_features is the minimum number of features to select (see the warning above). This parameter is ignored if any of the following strategies is selected: pso, hho, gwo, dfo, go.

min_repeated: int, float or None, default=2
Remove categorical features if there isn't any repeated value in at least min_repeated rows. The default is to keep all features with non-maximum variance, i.e., remove the features whose number of unique values is equal to the number of rows (usually the case for names, IDs, etc.).

max_repeated: int, float or None, default=1.0
Remove categorical features with the same value in at least max_repeated rows. The default is to keep all features with non-zero variance, i.e., remove the features that have the same value in all samples.

max_correlation: float or None, default=1.0
Minimum absolute Pearson correlation to identify correlated features. For each group, it removes all except the feature with the highest correlation to y (if provided, else it removes all but the first). The default value removes equal columns. If None, skip this step.

n_jobs: int, default=1
Number of cores to use for parallel processing.

device: str, default="cpu"
Device on which to run the estimators. Use any string that follows the SYCL_DEVICE_FILTER filter selector, e.g., device="gpu" to use the GPU. Read more in the user guide.

engine: str or None, default=None
Execution engine to use for estimators. If None, the default value is used. Choose from: "sklearn", "sklearnex", "cuml".

verbose: int, default=0
Verbosity level of the class. Choose from: 0 (no output), 1 (basic information), 2 (detailed information).

random_state: int or None, default=None
Seed used by the random number generator. If None, the random number generator is the RandomState used by np.random.

**kwargs
Any extra keyword argument for the strategy estimator. See the corresponding documentation for the available options.
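For instance, a sketch combining several of these parameters (assuming the univariate strategy accepts sklearn's f_classif score function by name):

>>> from atom.feature_engineering import FeatureSelector

>>> # n_features < 1 is read as a fraction of the total number of features
>>> fs = FeatureSelector(
...     strategy="univariate",
...     solver="f_classif",
...     n_features=0.5,
...     max_correlation=0.9,
...     random_state=1,
... )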
Attributes

collinear_: pd.DataFrame
Information on the removed collinear features. Columns include:
- drop: name of the dropped feature.
- corr_feature: names of the correlated features.
- corr_value: corresponding correlation coefficients.
[strategy]: sklearn transformer
Object used to transform the data, e.g., fs.pca for the pca strategy.

feature_names_in_: np.ndarray
Names of features seen during fit.

n_features_in_: int
Number of features seen during fit.
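A quick sketch of reading these attributes from a fitted stand-alone transformer (X as in the example below):

>>> from atom.feature_engineering import FeatureSelector

>>> fs = FeatureSelector(strategy="pca", n_features=12)
>>> fs.fit(X)

>>> fs.pca  # the fitted PCA estimator (the [strategy] attribute)
>>> fs.feature_names_in_  # the 30 original column names
>>> fs.n_features_in_  # 30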
Example
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> atom = ATOMClassifier(X, y)
>>> atom.feature_selection(strategy="pca", n_features=12, verbose=2)
Fitting FeatureSelector...
Performing feature selection ...
--> Applying Principal Component Analysis...
--> Scaling features...
--> Keeping 12 components.
--> Explained variance ratio: 0.972
>>> # Note that the column names changed
>>> print(atom.dataset)
pca0 pca1 pca2 pca3 pca4 pca5 pca6 pca7 pca8 pca9 pca10 pca11 target
0 -4.550570 -3.137509 -0.092090 -1.609549 1.058535 0.642739 0.641216 0.149573 0.553681 0.012809 0.250259 -0.076306 1
1 -1.146323 -1.318056 -0.879402 -1.748581 0.615686 0.708593 0.125549 -0.481384 0.347614 -0.568856 -0.044347 -0.700876 1
2 4.022389 2.985922 -3.007969 2.528286 -0.074799 0.610755 0.573571 -0.489323 -0.433615 0.372529 -0.055120 0.235854 0
3 2.354108 4.800756 -1.036847 -0.338138 -2.778482 4.910456 -1.033646 0.196407 0.075153 -0.625668 0.095093 -0.761412 0
4 -2.907053 0.157141 2.076819 -1.780199 1.288917 2.018880 0.551727 -0.428490 -0.246376 -0.579773 -0.238851 -0.130475 1
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
564 0.985404 0.982949 -2.306420 0.314275 -0.470162 0.260210 0.474216 -0.067590 -0.292023 -0.021339 -0.199168 -0.129227 0
565 -0.952207 0.929797 -0.911746 -1.115508 -0.608486 0.914564 -0.580888 -1.027668 0.331691 -0.142763 0.850646 -0.454396 1
566 4.679295 -0.876128 -0.128211 0.984446 -0.177509 0.453696 0.081011 -0.789588 0.241801 -0.043042 0.593988 0.289491 0
567 4.511927 -0.693781 -0.407906 -0.963769 0.110021 -0.956139 -2.317544 0.046630 0.829408 -0.114397 0.283519 -0.946620 0
568 -0.591290 1.764559 1.380277 0.167142 -1.660975 0.438834 -0.076031 -0.002265 -0.196759 -0.271184 0.003263 -0.383594 1
[569 rows x 13 columns]
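The same transformer can also be used stand-alone, outside atom: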
>>> from atom.feature_engineering import FeatureSelector
>>> from sklearn.datasets import load_breast_cancer
>>> X, _ = load_breast_cancer(return_X_y=True, as_frame=True)
>>> fs = FeatureSelector(strategy="pca", n_features=12, verbose=2)
>>> X = fs.fit_transform(X)
Fitting FeatureSelector...
Performing feature selection ...
--> Applying Principal Component Analysis...
--> Scaling features...
--> Keeping 12 components.
--> Explained variance ratio: 0.97
>>> # Note that the column names changed
>>> print(X)
pca0 pca1 pca2 pca3 pca4 pca5 pca6 pca7 pca8 pca9 pca10 pca11
0 9.192837 1.948583 -1.123166 -3.633731 1.195110 1.411424 2.159370 -0.398407 -0.157118 -0.877402 0.262955 0.859014
1 2.387802 -3.768172 -0.529293 -1.118264 -0.621775 0.028656 0.013358 0.240988 -0.711905 1.106995 0.813120 -0.157923
2 5.733896 -1.075174 -0.551748 -0.912083 0.177086 0.541452 -0.668166 0.097374 0.024066 0.454275 -0.605604 -0.124387
3 7.122953 10.275589 -3.232790 -0.152547 2.960878 3.053422 1.429911 1.059565 -1.405440 -1.116975 -1.151514 -1.011316
4 3.935302 -1.948072 1.389767 -2.940639 -0.546747 -1.226495 -0.936213 0.636376 -0.263805 0.377704 0.651360 0.110515
.. ... ... ... ... ... ... ... ... ... ... ... ...
564 6.439315 -3.576817 2.459487 -1.177314 0.074824 -2.375193 -0.596130 -0.035471 0.987929 0.256989 -0.062651 -0.123342
565 3.793382 -3.584048 2.088476 2.506028 0.510723 -0.246710 -0.716326 -1.113360 -0.105207 -0.108632 0.244804 -0.222753
566 1.256179 -1.902297 0.562731 2.089227 -1.809991 -0.534447 -0.192758 0.341887 0.393917 0.520877 -0.840512 -0.096473
567 10.374794 1.672010 -1.877029 2.356031 0.033742 0.567936 0.223082 -0.280239 -0.542035 -0.089296 -0.178628 0.697461
568 -5.475243 -0.670637 1.490443 2.299157 0.184703 1.617837 1.698952 1.046354 0.374101 -0.047726 -0.144094 0.179496
[569 rows x 12 columns]
Methods
fit | Fit the feature selector to the data. |
fit_transform | Fit to data, then transform it. |
get_feature_names_out | Get output feature names for transformation. |
get_metadata_routing | Get metadata routing of this object. |
get_params | Get parameters for this estimator. |
inverse_transform | Do nothing. |
set_output | Set output container. |
set_params | Set the parameters of this estimator. |
transform | Transform the data. |
Fit the feature selector to the data.
The univariate, sfm (when model is not fitted), sfs, rfe and rfecv strategies need a target column. Leaving it None raises an exception.
Parameters

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X.

Returns

self
Estimator instance.
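A minimal sketch of fitting a strategy that requires a target (the f_classif solver name is an assumption, as above):

>>> fs = FeatureSelector(strategy="univariate", solver="f_classif", n_features=10)
>>> fs.fit(X, y)  # leaving y=None here would raise an exception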
Fit to data, then transform it.
Get output feature names for transformation.
Parameters

input_features: sequence or None, default=None
Only used to validate feature names with the names seen in fit.

Returns

np.ndarray
Transformed feature names.
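Continuing the stand-alone pca example, the output names follow the pca<N> pattern shown earlier:

>>> fs.get_feature_names_out()  # array(['pca0', 'pca1', ..., 'pca11'], dtype=object)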
Get metadata routing of this object.
Returns

routing: MetadataRequest
A sklearn.utils.metadata_routing.MetadataRequest encapsulating routing information.
Get parameters for this estimator.
Parameters

deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict
Parameter names mapped to their values.
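This follows the standard sklearn API, e.g.:

>>> fs.get_params()["n_features"]  # 12 for the transformer fitted above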
Do nothing.
Returns the input unchanged. Implemented for continuity of the API.
Set output container.
See sklearn's user guide on how to use the set_output API. See here for a description of the choices.
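A sketch using the standard sklearn set_output API:

>>> fs.set_output(transform="pandas")  # transform() now returns a pd.DataFrame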
Set the parameters of this estimator.
Parameters

**params: dict
Estimator parameters.

Returns

self: estimator instance
Estimator instance.
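For example:

>>> fs.set_params(n_features=5)  # returns the estimator itself, so calls can be chained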
Transform the data.
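A minimal sketch, reusing the transformer fitted in the stand-alone example:

>>> X_new = fs.transform(X)
>>> X_new.shape  # (569, 12) for the breast cancer data above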