Encoder
class atom.data_cleaning.Encoder(strategy="LeaveOneOut", max_onehot=10, ordinal=None, rare_to_value=None, value="rare", verbose=0, logger=None, **kwargs)[source]
Perform encoding of categorical features.
The encoding type depends on the number of classes in the column:
- If n_classes=2 or ordinal feature, use Ordinal-encoding.
- If 2 < n_classes <= max_onehot, use OneHot-encoding.
- If n_classes > max_onehot, use strategy-encoding.
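The selection rule in the list above can be sketched as a small function. This is an illustration only, not the library's implementation: the name choose_encoding and the is_ordinal flag are hypothetical, and the list does not say what happens for fewer than two classes, so the sketch raises an error in that case.

```python
def choose_encoding(n_classes, max_onehot=10, is_ordinal=False, strategy="LeaveOneOut"):
    """Return the name of the encoding that the listed rules would select.

    Hypothetical helper for illustration; not part of the library's API.
    """
    if n_classes == 2 or is_ordinal:
        return "Ordinal"
    if 2 < n_classes <= max_onehot:
        return "OneHot"
    if n_classes > max_onehot:
        return strategy
    # The documented rules do not cover this case (assumption: treat as an error).
    raise ValueError(f"no rule for n_classes={n_classes}")


for n in (2, 3, 20):
    print(n, "->", choose_encoding(n))
```

With the defaults, 2, 3 and 20 classes map to Ordinal, OneHot and LeaveOneOut, which matches the log lines printed in the example further down.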
Missing values are propagated to the output column. Unknown classes encountered during transforming are imputed according to the selected strategy. Rare classes can be replaced with a single value to keep the cardinality from growing too high.
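As a rough sketch of the rare-class replacement, the function below collapses every class that occurs fewer than a given number of times into one placeholder. The threshold semantics are an assumption: this page does not describe how rare_to_value is interpreted, so min_count is a hypothetical parameter, and replace_rare is not a library function. Missing values (None here) are left untouched, in line with the statement that they are propagated.

```python
from collections import Counter


def replace_rare(values, min_count, value="rare"):
    """Replace classes seen fewer than `min_count` times with `value`.

    Hypothetical helper; `min_count` is an assumed interpretation of a
    rarity threshold, not the documented meaning of rare_to_value.
    """
    # Count occurrences of each class, ignoring missing entries.
    counts = Counter(v for v in values if v is not None)
    return [
        value if v is not None and counts[v] < min_count else v
        for v in values
    ]


column = ["a", "a", "b", None, "c", "a"]
print(replace_rare(column, min_count=2))
```

Here "b" and "c" each occur once and are merged into "rare", which reduces three classes to two.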
This class can be accessed from atom through the encode method. Read more in the user guide.
Warning
Two category-encoders estimators are unavailable:
- OneHotEncoder: Use the max_onehot parameter.
- HashingEncoder: Incompatibility of APIs.
See Also
Applies standard data cleaning steps on a dataset.
Handle missing values in the data.
Prune outliers from the data.
Example
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> from numpy.random import randint
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> X["cat_feature_1"] = [f"x{i}" for i in randint(0, 2, len(X))]
>>> X["cat_feature_2"] = [f"x{i}" for i in randint(0, 3, len(X))]
>>> X["cat_feature_3"] = [f"x{i}" for i in randint(0, 20, len(X))]
>>> atom = ATOMClassifier(X, y)
>>> print(atom.X)
mean radius mean texture ... cat_feature_2 cat_feature_3
0 13.62 23.23 ... x0 x0
1 14.86 16.94 ... x0 x5
2 16.74 21.59 ... x2 x15
3 13.37 16.39 ... x1 x18
4 11.37 18.89 ... x0 x13
.. ... ... ... ... ...
564 14.06 17.18 ... x2 x1
565 11.29 13.04 ... x0 x10
566 14.26 19.65 ... x0 x5
567 12.05 14.63 ... x2 x14
568 18.81 19.98 ... x1 x13
[569 rows x 33 columns]
>>> atom.encode(strategy="leaveoneout", max_onehot=10, verbose=2)
Fitting Encoder...
Encoding categorical columns...
--> Ordinal-encoding feature cat_feature_1. Contains 2 classes.
--> OneHot-encoding feature cat_feature_2. Contains 3 classes.
--> LeaveOneOut-encoding feature cat_feature_3. Contains 20 classes.
>>> # Note the one-hot encoded column with name [feature]_[class]
>>> print(atom.X)
mean radius mean texture ... cat_feature_2_x2 cat_feature_3
0 13.62 23.23 ... 0.0 0.714286
1 14.86 16.94 ... 0.0 0.555556
2 16.74 21.59 ... 1.0 0.681818
3 13.37 16.39 ... 0.0 0.739130
4 11.37 18.89 ... 0.0 0.521739
.. ... ... ... ... ...
564 14.06 17.18 ... 1.0 0.772727
565 11.29 13.04 ... 0.0 0.766667
566 14.26 19.65 ... 0.0 0.555556
567 12.05 14.63 ... 1.0 0.411765
568 18.81 19.98 ... 0.0 0.521739
[569 rows x 35 columns]
>>> from atom.data_cleaning import Encoder
>>> from sklearn.datasets import load_breast_cancer
>>> from numpy.random import randint
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> X["cat_feature_1"] = [f"x{i}" for i in randint(0, 2, len(X))]
>>> X["cat_feature_2"] = [f"x{i}" for i in randint(0, 3, len(X))]
>>> X["cat_feature_3"] = [f"x{i}" for i in randint(0, 20, len(X))]
>>> print(X)
mean radius mean texture ... cat_feature_2 cat_feature_3
0 13.62 23.23 ... x0 x0
1 14.86 16.94 ... x0 x5
2 16.74 21.59 ... x2 x15
3 13.37 16.39 ... x1 x18
4 11.37 18.89 ... x0 x13
.. ... ... ... ... ...
564 14.06 17.18 ... x2 x1
565 11.29 13.04 ... x0 x10
566 14.26 19.65 ... x0 x5
567 12.05 14.63 ... x2 x14
568 18.81 19.98 ... x1 x13
[569 rows x 33 columns]
>>> encoder = Encoder(strategy="leaveoneout", max_onehot=10, verbose=2)
>>> X = encoder.fit_transform(X, y)
Fitting Encoder...
Encoding categorical columns...
--> Ordinal-encoding feature cat_feature_1. Contains 2 classes.
--> OneHot-encoding feature cat_feature_2. Contains 3 classes.
--> LeaveOneOut-encoding feature cat_feature_3. Contains 20 classes.
>>> # Note the one-hot encoded column with name [feature]_[class]
>>> print(X)
mean radius mean texture ... cat_feature_2_x2 cat_feature_3
0 17.99 10.38 ... 1.0 0.379310
1 20.57 17.77 ... 1.0 0.714286
2 19.69 21.25 ... 0.0 0.586207
3 11.42 20.38 ... 0.0 0.678571
4 20.29 14.34 ... 0.0 0.714286
.. ... ... ... ... ...
564 21.56 22.39 ... 0.0 0.580645
565 20.13 28.25 ... 0.0 0.518519
566 16.60 28.08 ... 1.0 0.600000
567 20.60 29.33 ... 1.0 0.586207
568 7.76 24.54 ... 1.0 0.678571
[569 rows x 35 columns]
Methods
fit | Fit to data.
fit_transform | Fit to data, then transform it.
get_params | Get parameters for this estimator.
inverse_transform | Does nothing.
log | Print message and save to log file.
save | Save the instance to a pickle file.
set_params | Set the parameters of this estimator.
transform | Encode the data.
method fit(X, y=None)[source]
Fit to data.
Note that leaving y=None can lead to errors if the strategy
encoder requires target values.
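To see why a target-based strategy needs y, here is a minimal sketch of the general leave-one-out idea: each row is encoded as the mean target of the other rows that share its class. It is a plain-Python illustration, not the encoder that the library uses. The function name is hypothetical, and falling back to the global target mean for a class with a single row is an assumption of this sketch.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Encode each row as the mean target of the other rows in its class.

    Hypothetical helper illustrating the technique, not the library's code.
    """
    if targets is None:
        # Without target values there is nothing to average.
        raise ValueError("a target-based encoding requires y")

    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1

    global_mean = sum(targets) / len(targets)

    encoded = []
    for cat, target in zip(categories, targets):
        if counts[cat] > 1:
            # Exclude the row's own target from its class mean.
            encoded.append((sums[cat] - target) / (counts[cat] - 1))
        else:
            # Assumption: a class with one row falls back to the global mean.
            encoded.append(global_mean)
    return encoded


print(leave_one_out_encode(["a", "a", "a", "b"], [1, 0, 1, 1]))
```

Calling the function with targets=None raises immediately, which is the kind of failure the note above warns about.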
method fit_transform(X=None, y=None, **fit_params)[source]
Fit to data, then transform it.
method get_params(deep=True)[source]
Get parameters for this estimator.
Parameters
deep : bool, default=True
    If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns
params : dict
    Parameter names mapped to their values.
method inverse_transform(X=None, y=None)[source]
Does nothing.
method log(msg, level=0, severity="info")[source]
Print message and save to log file.
method save(filename="auto", save_data=True)[source]
Save the instance to a pickle file.
Parameters
filename : str, default="auto"
    Name of the file. Use "auto" for automatic naming.
save_data : bool, default=True
    Whether to save the dataset with the instance. This parameter is ignored if the method is not called from atom. If False, remember to add the data to ATOMLoader when loading the file.
method set_params(**params)[source]
Set the parameters of this estimator.
Parameters
**params : dict
    Estimator parameters.
Returns
self : estimator instance
    Estimator instance.
method transform(X, y=None)[source]
Encode the data.