Encoder
Perform encoding of categorical features.
The encoding type depends on the number of classes in the column:
- If n_classes=2 or the feature is ordinal, use Ordinal-encoding.
- If 2 < n_classes <= max_onehot, use OneHot-encoding.
- If n_classes > max_onehot, use strategy-encoding.
Missing values are propagated to the output column. Unknown classes encountered during transforming are imputed according to the selected strategy. Infrequent classes can be replaced with a single value to prevent overly high cardinality.
This class can be accessed from atom through the encode method. Read more in the user guide.
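As a sketch, the selection rule above can be written out in plain Python. Note that `choose_encoding` is a hypothetical helper for illustration only, not part of atom's API:

```python
def choose_encoding(n_classes: int, max_onehot: int, strategy: str = "target") -> str:
    """Sketch of the encoding-type selection rule described above."""
    if n_classes == 2:
        return "ordinal"      # binary (or ordinal) features
    elif n_classes <= max_onehot:
        return "onehot"       # low-cardinality features
    else:
        return strategy       # high-cardinality features, e.g. "target"

# Mirrors the example below: 2, 3 and 20 classes with max_onehot=10
print([choose_encoding(n, max_onehot=10) for n in (2, 3, 20)])
# → ['ordinal', 'onehot', 'target']
```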
Warning
Three category-encoders estimators are unavailable:
- OneHotEncoder: Use the max_onehot parameter.
- HashingEncoder: Incompatibility of APIs.
- LeaveOneOutEncoder: Incompatibility of APIs.
Example
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> from numpy.random import randint
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> X["cat_feature_1"] = [f"x{i}" for i in randint(0, 2, len(X))]
>>> X["cat_feature_2"] = [f"x{i}" for i in randint(0, 3, len(X))]
>>> X["cat_feature_3"] = [f"x{i}" for i in randint(0, 20, len(X))]
>>> atom = ATOMClassifier(X, y, random_state=1)
>>> print(atom.X)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry ... worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension cat_feature_1 cat_feature_2 cat_feature_3
0 13.48 20.82 88.40 559.2 0.10160 0.12550 0.10630 0.05439 0.1720 ... 0.1610 0.42250 0.5030 0.22580 0.2807 0.10710 x1 x2 x10
1 18.31 20.58 120.80 1052.0 0.10680 0.12480 0.15690 0.09451 0.1860 ... 0.1492 0.25360 0.3759 0.15100 0.3074 0.07863 x0 x1 x6
2 17.93 24.48 115.20 998.9 0.08855 0.07027 0.05699 0.04744 0.1538 ... 0.1315 0.18060 0.2080 0.11360 0.2504 0.07948 x0 x2 x9
3 15.13 29.81 96.71 719.5 0.08320 0.04605 0.04686 0.02739 0.1852 ... 0.1148 0.09866 0.1547 0.06575 0.3233 0.06165 x0 x2 x11
4 8.95 15.76 58.74 245.2 0.09462 0.12430 0.09263 0.02308 0.1305 ... 0.1179 0.18790 0.1544 0.03846 0.1652 0.07722 x0 x0 x3
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
564 14.34 13.47 92.51 641.2 0.09906 0.07624 0.05724 0.04603 0.2075 ... 0.1297 0.15250 0.1632 0.10870 0.3062 0.06072 x1 x2 x11
565 13.17 21.81 85.42 531.5 0.09714 0.10470 0.08259 0.05252 0.1746 ... 0.1503 0.39040 0.3728 0.16070 0.3693 0.09618 x0 x1 x4
566 17.30 17.08 113.00 928.2 0.10080 0.10410 0.12660 0.08353 0.1813 ... 0.1416 0.24050 0.3378 0.18570 0.3138 0.08113 x0 x1 x10
567 17.68 20.74 117.40 963.7 0.11150 0.16650 0.18550 0.10540 0.1971 ... 0.1418 0.34980 0.3583 0.15150 0.2463 0.07738 x0 x1 x16
568 14.80 17.66 95.88 674.8 0.09179 0.08890 0.04069 0.02260 0.1893 ... 0.1226 0.18810 0.2060 0.08308 0.3600 0.07285 x1 x1 x2
[569 rows x 33 columns]
>>> atom.encode(strategy="target", max_onehot=10, verbose=2)
Fitting Encoder...
Encoding categorical columns...
--> Ordinal-encoding feature cat_feature_1. Contains 2 classes.
--> OneHot-encoding feature cat_feature_2. Contains 3 classes.
--> Target-encoding feature cat_feature_3. Contains 20 classes.
>>> # Note the one-hot encoded columns with name [feature]_[class]
>>> print(atom.X)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry ... worst concavity worst concave points worst symmetry worst fractal dimension cat_feature_1 cat_feature_2_x2 cat_feature_2_x1 cat_feature_2_x0 cat_feature_3
0 13.48 20.82 88.40 559.2 0.10160 0.12550 0.10630 0.05439 0.1720 ... 0.5030 0.22580 0.2807 0.10710 1.0 1.0 0.0 0.0 0.604275
1 18.31 20.58 120.80 1052.0 0.10680 0.12480 0.15690 0.09451 0.1860 ... 0.3759 0.15100 0.3074 0.07863 0.0 0.0 1.0 0.0 0.495404
2 17.93 24.48 115.20 998.9 0.08855 0.07027 0.05699 0.04744 0.1538 ... 0.2080 0.11360 0.2504 0.07948 0.0 1.0 0.0 0.0 0.604073
3 15.13 29.81 96.71 719.5 0.08320 0.04605 0.04686 0.02739 0.1852 ... 0.1547 0.06575 0.3233 0.06165 0.0 1.0 0.0 0.0 0.657228
4 8.95 15.76 58.74 245.2 0.09462 0.12430 0.09263 0.02308 0.1305 ... 0.1544 0.03846 0.1652 0.07722 0.0 0.0 0.0 1.0 0.660063
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
564 14.34 13.47 92.51 641.2 0.09906 0.07624 0.05724 0.04603 0.2075 ... 0.1632 0.10870 0.3062 0.06072 1.0 1.0 0.0 0.0 0.657228
565 13.17 21.81 85.42 531.5 0.09714 0.10470 0.08259 0.05252 0.1746 ... 0.3728 0.16070 0.3693 0.09618 0.0 0.0 1.0 0.0 0.616927
566 17.30 17.08 113.00 928.2 0.10080 0.10410 0.12660 0.08353 0.1813 ... 0.3378 0.18570 0.3138 0.08113 0.0 0.0 1.0 0.0 0.604275
567 17.68 20.74 117.40 963.7 0.11150 0.16650 0.18550 0.10540 0.1971 ... 0.3583 0.15150 0.2463 0.07738 0.0 0.0 1.0 0.0 0.675771
568 14.80 17.66 95.88 674.8 0.09179 0.08890 0.04069 0.02260 0.1893 ... 0.2060 0.08308 0.3600 0.07285 1.0 0.0 1.0 0.0 0.591592
[569 rows x 35 columns]
>>> from atom.data_cleaning import Encoder
>>> from sklearn.datasets import load_breast_cancer
>>> from numpy.random import randint
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> X["cat_feature_1"] = [f"x{i}" for i in randint(0, 2, len(X))]
>>> X["cat_feature_2"] = [f"x{i}" for i in randint(0, 3, len(X))]
>>> X["cat_feature_3"] = [f"x{i}" for i in randint(0, 20, len(X))]
>>> print(X)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry ... worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension cat_feature_1 cat_feature_2 cat_feature_3
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.30010 0.14710 0.2419 ... 0.16220 0.66560 0.7119 0.2654 0.4601 0.11890 x1 x2 x5
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.08690 0.07017 0.1812 ... 0.12380 0.18660 0.2416 0.1860 0.2750 0.08902 x1 x2 x13
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.19740 0.12790 0.2069 ... 0.14440 0.42450 0.4504 0.2430 0.3613 0.08758 x0 x0 x15
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.24140 0.10520 0.2597 ... 0.20980 0.86630 0.6869 0.2575 0.6638 0.17300 x0 x2 x10
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.19800 0.10430 0.1809 ... 0.13740 0.20500 0.4000 0.1625 0.2364 0.07678 x1 x1 x17
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
564 21.56 22.39 142.00 1479.0 0.11100 0.11590 0.24390 0.13890 0.1726 ... 0.14100 0.21130 0.4107 0.2216 0.2060 0.07115 x1 x1 x12
565 20.13 28.25 131.20 1261.0 0.09780 0.10340 0.14400 0.09791 0.1752 ... 0.11660 0.19220 0.3215 0.1628 0.2572 0.06637 x0 x2 x14
566 16.60 28.08 108.30 858.1 0.08455 0.10230 0.09251 0.05302 0.1590 ... 0.11390 0.30940 0.3403 0.1418 0.2218 0.07820 x0 x1 x3
567 20.60 29.33 140.10 1265.0 0.11780 0.27700 0.35140 0.15200 0.2397 ... 0.16500 0.86810 0.9387 0.2650 0.4087 0.12400 x1 x0 x2
568 7.76 24.54 47.92 181.0 0.05263 0.04362 0.00000 0.00000 0.1587 ... 0.08996 0.06444 0.0000 0.0000 0.2871 0.07039 x1 x1 x11
[569 rows x 33 columns]
>>> encoder = Encoder(strategy="target", max_onehot=10, verbose=2)
>>> X = encoder.fit_transform(X, y)
Fitting Encoder...
Encoding categorical columns...
--> Ordinal-encoding feature cat_feature_1. Contains 2 classes.
--> OneHot-encoding feature cat_feature_2. Contains 3 classes.
--> Target-encoding feature cat_feature_3. Contains 20 classes.
>>> # Note the one-hot encoded columns with name [feature]_[class]
>>> print(X)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry ... worst concavity worst concave points worst symmetry worst fractal dimension cat_feature_1 cat_feature_2_x2 cat_feature_2_x0 cat_feature_2_x1 cat_feature_3
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.30010 0.14710 0.2419 ... 0.7119 0.2654 0.4601 0.11890 1.0 1.0 0.0 0.0 0.645086
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.08690 0.07017 0.1812 ... 0.2416 0.1860 0.2750 0.08902 1.0 1.0 0.0 0.0 0.604148
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.19740 0.12790 0.2069 ... 0.4504 0.2430 0.3613 0.08758 0.0 0.0 1.0 0.0 0.675079
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.24140 0.10520 0.2597 ... 0.6869 0.2575 0.6638 0.17300 0.0 1.0 0.0 0.0 0.706297
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.19800 0.10430 0.1809 ... 0.4000 0.1625 0.2364 0.07678 1.0 0.0 0.0 1.0 0.716566
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
564 21.56 22.39 142.00 1479.0 0.11100 0.11590 0.24390 0.13890 0.1726 ... 0.4107 0.2216 0.2060 0.07115 1.0 0.0 0.0 1.0 0.598024
565 20.13 28.25 131.20 1261.0 0.09780 0.10340 0.14400 0.09791 0.1752 ... 0.3215 0.1628 0.2572 0.06637 0.0 1.0 0.0 0.0 0.683185
566 16.60 28.08 108.30 858.1 0.08455 0.10230 0.09251 0.05302 0.1590 ... 0.3403 0.1418 0.2218 0.07820 0.0 0.0 0.0 1.0 0.472908
567 20.60 29.33 140.10 1265.0 0.11780 0.27700 0.35140 0.15200 0.2397 ... 0.9387 0.2650 0.4087 0.12400 1.0 0.0 1.0 0.0 0.585452
568 7.76 24.54 47.92 181.0 0.05263 0.04362 0.00000 0.00000 0.1587 ... 0.0000 0.0000 0.2871 0.07039 1.0 0.0 0.0 1.0 0.516759
[569 rows x 35 columns]
Methods
fit | Fit to data.
fit_transform | Fit to data, then transform it.
get_feature_names_out | Get output feature names for transformation.
get_params | Get parameters for this estimator.
inverse_transform | Do nothing.
set_output | Set output container.
set_params | Set the parameters of this estimator.
transform | Encode the data.
Fit to data.
Note that leaving y=None can lead to errors if the strategy
encoder requires target values. For multioutput tasks, only
the first target column is used to fit the encoder.
Parameters
X: dataframe-like
    Feature set with shape=(n_samples, n_features).
y: sequence or dataframe-like
    Target column(s) corresponding to X.

Returns
Self
    Estimator instance.
Fit to data, then transform it.
Get output feature names for transformation.
Parameters
input_features: sequence or None, default=None
    Only used to validate feature names with the names seen in fit.

Returns
np.ndarray
    Transformed feature names.
Get parameters for this estimator.
Parameters
deep: bool, default=True
    If True, will return the parameters for this estimator and
    contained subobjects that are estimators.

Returns
params: dict
    Parameter names mapped to their values.
Do nothing.
Returns the input unchanged. Implemented for continuity of the API.
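Since the method simply hands its input back, its behavior can be illustrated with a minimal stand-in. `NoOpInverse` is a hypothetical class for illustration, not atom's implementation:

```python
class NoOpInverse:
    """Sketch of a no-op inverse_transform kept only for API continuity."""

    def inverse_transform(self, X):
        return X  # nothing to undo; the encoding is not reversed

X = [[1.0, 0.0], [0.0, 1.0]]
assert NoOpInverse().inverse_transform(X) is X  # the very same object comes back
```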
Set output container.
See sklearn's user guide on how to use the set_output API.
Set the parameters of this estimator.
Parameters
**params: dict
    Estimator parameters.

Returns
self: estimator instance
    Estimator instance.
Encode the data.