FeatureGenerator
class atom.feature_engineering.FeatureGenerator(strategy="dfs", n_features=None, operators=None, n_jobs=1, verbose=0, random_state=None, **kwargs)[source]
Generate new features.
Create new combinations of existing features to capture the non-linear relations between the original features.
This class can be accessed from atom through the feature_generation method. Read more in the user guide.
Warning
- Using the div, log or sqrt operators can return new features with inf or NaN values. Check the warnings that may pop up or use atom's nans attribute.
- When using dfs with n_jobs>1, make sure to protect your code with if __name__ == "__main__" (see the sketch after this warning). Featuretools uses dask, which uses python multiprocessing for parallelization. The spawn method on multiprocessing starts a new python process, which requires it to import the __main__ module before it can do its task.
- gfg can be slow for very large populations.
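When running dfs with more than one core, a minimal sketch of the required entry-point guard (the dataset and parameter values are illustrative only):

from atom.feature_engineering import FeatureGenerator
from sklearn.datasets import load_breast_cancer

def main():
    # Keep the actual work inside a function so spawned worker
    # processes can import __main__ without re-executing it.
    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    fg = FeatureGenerator(strategy="dfs", n_features=5, n_jobs=2)
    X_new = fg.fit_transform(X, y)
    print(X_new.shape)

if __name__ == "__main__":
    main()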
Tip
dfs can create many new features and not all of them will be useful. Use the FeatureSelector class to reduce the number of features.
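For example, a minimal sketch that generates features and then prunes them (the pca strategy and the feature counts are illustrative choices, not recommendations):
>>> atom.feature_generation(strategy="dfs", n_features=20)
>>> atom.feature_selection(strategy="pca", n_features=10)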
Parameters | strategy: str, default="dfs"
Strategy to create new features. Choose from:
- dfs: Deep Feature Synthesis.
- gfg: Genetic Feature Generation.
n_features: int or None, default=None
Maximum number of newly generated features to add to the
dataset. If None, select all created features.
operators: str, sequence or None, default=None
Mathematical operators to apply on the features. None to use all. Choose from: add, sub, mul, div, abs, sqrt, log, inv, sin, cos, tan.
n_jobs: int, default=1
Number of cores to use for parallel processing.
verbose: int, default=0
Verbosity level of the class. Choose from:
- 0 to not print anything.
- 1 to print basic information.
- 2 to print detailed information.
random_state: int or None, default=None
Seed used by the random number generator. If None, the random number generator is the RandomState used by np.random.
**kwargs
Additional keyword arguments for the SymbolicTransformer instance. Only for the gfg strategy.
|
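As a sketch of the operators and **kwargs options (parameter values are illustrative; population_size and generations are standard arguments of gplearn's SymbolicTransformer):
>>> from atom.feature_engineering import FeatureGenerator
>>> FeatureGenerator(strategy="dfs", operators=["add", "mul", "log"], n_features=10)
>>> FeatureGenerator(strategy="gfg", n_features=10, population_size=1000, generations=10)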
Attributes | gfg_: SymbolicTransformer
Object used to calculate the genetic features. Only available
when strategy="gfg".
genetic_features_: pd.DataFrame
Information on the newly created non-linear features. Only available when strategy="gfg". Columns include:
- name: Name of the feature.
- description: Operators used to create this feature.
- fitness: Fitness score.
feature_names_in_: np.ndarray
Names of features seen during fit.
n_features_in_: int
Number of features seen during fit.
|
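A minimal sketch of inspecting these attributes after fitting with the genetic strategy (assumes X and y as loaded in the example below):
>>> fg = FeatureGenerator(strategy="gfg", n_features=5)
>>> fg.fit(X, y)
>>> fg.genetic_features_  # dataframe describing the created non-linear features
>>> fg.gfg_  # the fitted SymbolicTransformer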
See Also
- FeatureExtractor: Extract features from datetime columns.
- FeatureGrouper: Extract statistics from similar features.
- FeatureSelector: Reduce the number of features in the data.
Example
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> atom = ATOMClassifier(X, y)
>>> atom.feature_generation(strategy="dfs", n_features=5, verbose=2)
Fitting FeatureGenerator...
Generating new features...
--> 5 new features were added.
>>> # Note the mean radius / smoothness error column
>>> print(atom.dataset)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity ... worst fractal dimension TANGENT(area error) fractal dimension error - mean radius mean concave points + worst texture mean radius / smoothness error mean symmetry * radius error target
0 11.94 18.24 75.71 437.6 0.08261 0.04751 0.01972 ... 0.07408 -5.165118 -11.937365 21.343490 1656.033287 0.042460 1
1 13.28 13.72 85.79 541.8 0.08363 0.08575 0.05077 ... 0.07320 -0.480546 -13.277387 17.398640 3109.342074 0.029640 1
2 11.42 20.38 77.58 386.1 0.14250 0.28390 0.24140 ... 0.17300 -1.720653 -11.410792 26.605200 1253.567508 0.128707 0
3 13.86 16.93 90.96 578.9 0.10260 0.15170 0.09901 ... 0.10590 0.840327 -13.855440 26.986020 2325.503356 0.053977 0
4 20.09 23.86 134.70 1247.0 0.10800 0.18380 0.22830 ... 0.09469 -2.215996 -20.084072 29.558000 2522.601708 0.241093 0
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
564 13.54 14.36 87.46 566.3 0.09779 0.08129 0.06664 ... 0.07259 514.164096 -13.537700 19.307810 1600.094540 0.050876 1
565 14.42 16.54 94.15 641.2 0.09751 0.11390 0.08007 ... 0.08764 0.884301 -14.416627 21.552230 3150.535285 0.066748 1
566 16.35 23.29 109.00 840.4 0.09742 0.14970 0.18110 ... 0.09614 18.817004 -16.345915 31.117730 2901.508429 0.093786 0
567 17.68 20.74 117.40 963.7 0.11150 0.16650 0.18550 ... 0.07738 -0.351241 -17.675032 25.215400 1956.401461 0.159907 0
568 11.32 27.08 71.76 395.7 0.06883 0.03813 0.01633 ... 0.07087 -1.071237 -11.317948 33.753125 3098.822885 0.022615 1
[569 rows x 36 columns]
>>> from atom.feature_engineering import FeatureGenerator
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> fg = FeatureGenerator(strategy="dfs", n_features=5, verbose=2)
>>> X = fg.fit_transform(X, y)
Fitting FeatureGenerator...
Generating new features...
--> 5 new features were added.
>>> # Note the worst concavity / mean concavity column
>>> print(X)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness ... worst fractal dimension mean concave points - worst concavity mean concavity + radius error mean smoothness / mean perimeter worst compactness / mean perimeter worst concavity / mean concavity
index ...
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 ... 0.11890 -0.56480 1.39510 0.000964 0.005420 2.372209
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 ... 0.08902 -0.17143 0.63040 0.000638 0.001404 2.780207
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 ... 0.08758 -0.32250 0.94300 0.000843 0.003265 2.281662
3 11.42 20.38 77.58 386.1 0.14250 0.28390 ... 0.17300 -0.58170 0.73700 0.001837 0.011167 2.845485
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 ... 0.07678 -0.29570 0.95520 0.000742 0.001517 2.020202
... ... ... ... ... ... ... ... ... ... ... ... ... ...
564 21.56 22.39 142.00 1479.0 0.11100 0.11590 ... 0.07115 -0.27180 1.41990 0.000782 0.001488 1.683887
565 20.13 28.25 131.20 1261.0 0.09780 0.10340 ... 0.06637 -0.22359 0.90950 0.000745 0.001465 2.232639
566 16.60 28.08 108.30 858.1 0.08455 0.10230 ... 0.07820 -0.28728 0.54891 0.000781 0.002857 3.678521
567 20.60 29.33 140.10 1265.0 0.11780 0.27700 ... 0.12400 -0.78670 1.07740 0.000841 0.006196 2.671315
568 7.76 24.54 47.92 181.0 0.05263 0.04362 ... 0.07039 0.00000 0.38570 0.001098 0.001345 NaN
[569 rows x 35 columns]
Methods
fit | Fit to data. |
fit_transform | Fit to data, then transform it. |
get_params | Get parameters for this estimator. |
inverse_transform | Do nothing. |
set_output | Set output container. |
set_params | Set the parameters of this estimator. |
transform | Generate new features. |
method fit(X, y=None)[source]
Fit to data.
Parameters | X: dataframe-like
Feature set with shape=(n_samples, n_features).
y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X .
|
Returns | self
Estimator instance.
|
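A minimal sketch of fitting and transforming separately (X_train, y_train and X_test are hypothetical, pre-split dataframes):
>>> fg = FeatureGenerator(strategy="dfs", n_features=5)
>>> fg.fit(X_train, y_train)
>>> X_test_new = fg.transform(X_test)  # adds the same features learned on the training set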
method fit_transform(X=None, y=None, **fit_params)[source]
Fit to data, then transform it.
method get_params(deep=True)[source]
Get parameters for this estimator.
Parameters | deep : bool, default=True
If True, will return the parameters for this estimator and
contained subobjects that are estimators.
|
Returns | params : dict
Parameter names mapped to their values.
|
method inverse_transform(X=None, y=None, **fit_params)[source]
Do nothing.
Returns the input unchanged. Implemented for continuity of the API.
method set_output(transform=None)[source]
Set output container.
See sklearn's user guide on how to use the
set_output
API. See here a description
of the choices.
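For instance, a sketch requesting pandas output through sklearn's set_output API:
>>> fg = FeatureGenerator(strategy="dfs", n_features=5).set_output(transform="pandas")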
method set_params(**params)[source]
Set the parameters of this estimator.
Parameters | **params : dict
Estimator parameters.
|
Returns | self : estimator instance
Estimator instance.
|
method transform(X, y=None)[source]
Generate new features.