
Encoder


class atom.data_cleaning.Encoder(strategy="Target", max_onehot=10, ordinal=None, infrequent_to_value=None, value="infrequent", n_jobs=1, verbose=0, **kwargs)[source]

Perform encoding of categorical features.

The encoding type depends on the number of classes in the column:

  • If n_classes=2 or the feature is ordinal, Ordinal-encoding is used.
  • If 2 < n_classes <= max_onehot, OneHot-encoding is used.
  • If n_classes > max_onehot, strategy-encoding is used.
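These rules can be sketched as a small dispatch function (a hypothetical illustration of the selection logic above, not Encoder's actual implementation):

```python
def choose_encoding(n_classes: int, is_ordinal: bool, max_onehot: int = 10) -> str:
    """Pick the encoding type following the rules described above."""
    if n_classes == 2 or is_ordinal:
        return "Ordinal"
    if n_classes <= max_onehot:
        return "OneHot"
    return "strategy"  # e.g. Target-encoding, the default strategy
```

Note that the ordinal check takes precedence: a feature declared ordinal is Ordinal-encoded regardless of its number of classes.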

Missing values are propagated to the output column. Unknown classes encountered during transformation are imputed according to the selected strategy. Infrequent classes can be replaced with a given value to prevent excessively high cardinality.

This class can be accessed from atom through the encode method. Read more in the user guide.

Warning

Three category-encoders estimators are unavailable:

Parameters

strategy: str or transformer, default="Target"
Type of encoding to use for high cardinality features. Choose from any of the estimators in the category-encoders package or provide a custom one.
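For intuition, a bare-bones version of target encoding (the default strategy) replaces each class with the mean of the target over the rows of that class. The actual category-encoders TargetEncoder additionally smooths these means toward the global target mean; this unsmoothed sketch only illustrates the core idea:

```python
def target_encode(categories, targets):
    """Map each class to the mean target value of its rows (no smoothing)."""
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}
```

This is why the target-encoded column in the example below contains values between 0 and 1: each entry is (roughly) the fraction of positive targets among rows sharing that class.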

max_onehot: int or None, default=10
Maximum number of unique values in a feature to perform one-hot encoding. If None, strategy-encoding is always used for columns with more than two classes.

ordinal: dict or None, default=None
Order of ordinal features, where the dict key is the feature's name and the value is the class order, e.g., {"salary": ["low", "medium", "high"]}.
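Ordinal encoding with an explicit order simply maps each class to its position in the supplied sequence. A minimal sketch of that mapping (an illustration, not Encoder's internal code):

```python
def ordinal_encode(values, order):
    """Encode classes by their position in the user-supplied class order."""
    mapping = {cls: i for i, cls in enumerate(order)}
    return [mapping[v] for v in values]
```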

infrequent_to_value: int, float or None, default=None
Replaces infrequent class occurrences in categorical columns with the string in parameter value. This transformation is done before the encoding of the column.

  • If None: Skip this step.
  • If int: Minimum number of occurrences required to keep a class.
  • If float: Minimum fraction of occurrences required to keep a class.
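A sketch of this pre-encoding step, assuming the int/float threshold semantics described above (hypothetical helper, not the library's implementation):

```python
from collections import Counter

def replace_infrequent(values, infrequent_to_value, value="infrequent"):
    """Replace classes occurring below the threshold with the given value."""
    counts = Counter(values)
    if isinstance(infrequent_to_value, float):
        threshold = infrequent_to_value * len(values)  # fraction -> absolute count
    else:
        threshold = infrequent_to_value
    return [value if counts[v] < threshold else v for v in values]
```

Because the replacement happens before encoding, all rare classes collapse into a single category and receive one shared encoded value.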

value: str, default="infrequent"
Value with which to replace rare classes. This parameter is ignored if infrequent_to_value=None.

n_jobs: int, default=1
Number of cores to use for parallel processing.

  • If >0: Number of cores to use.
  • If -1: Use all available cores.
  • If <-1: Use number of cores - 1 - value.

verbose: int, default=0
Verbosity level of the class. Choose from:

  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.

**kwargs
Additional keyword arguments for the strategy estimator.

Attributes

mapping_: dict of dicts
Encoded values and their respective mapping. The column name is the key to its mapping dictionary. Only for ordinal encoding.

feature_names_in_: np.ndarray
Names of features seen during fit.

n_features_in_: int
Number of features seen during fit.


See Also

Cleaner

Applies standard data cleaning steps on a dataset.

Imputer

Handles missing values in the data.

Pruner

Prunes outliers from the data.


Example

>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> from numpy.random import randint

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> X["cat_feature_1"] = [f"x{i}" for i in randint(0, 2, len(X))]
>>> X["cat_feature_2"] = [f"x{i}" for i in randint(0, 3, len(X))]
>>> X["cat_feature_3"] = [f"x{i}" for i in randint(0, 20, len(X))]

>>> atom = ATOMClassifier(X, y, random_state=1)
>>> print(atom.X)

     mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  ...  worst smoothness  worst compactness  worst concavity  worst concave points  worst symmetry  worst fractal dimension  cat_feature_1  cat_feature_2  cat_feature_3
0          13.48         20.82           88.40      559.2          0.10160           0.12550         0.10630              0.05439         0.1720  ...            0.1610            0.42250           0.5030               0.22580          0.2807                  0.10710             x1             x2            x10
1          18.31         20.58          120.80     1052.0          0.10680           0.12480         0.15690              0.09451         0.1860  ...            0.1492            0.25360           0.3759               0.15100          0.3074                  0.07863             x0             x1             x6
2          17.93         24.48          115.20      998.9          0.08855           0.07027         0.05699              0.04744         0.1538  ...            0.1315            0.18060           0.2080               0.11360          0.2504                  0.07948             x0             x2             x9
3          15.13         29.81           96.71      719.5          0.08320           0.04605         0.04686              0.02739         0.1852  ...            0.1148            0.09866           0.1547               0.06575          0.3233                  0.06165             x0             x2            x11
4           8.95         15.76           58.74      245.2          0.09462           0.12430         0.09263              0.02308         0.1305  ...            0.1179            0.18790           0.1544               0.03846          0.1652                  0.07722             x0             x0             x3
..           ...           ...             ...        ...              ...               ...             ...                  ...            ...  ...               ...                ...              ...                   ...             ...                      ...            ...            ...            ...
564        14.34         13.47           92.51      641.2          0.09906           0.07624         0.05724              0.04603         0.2075  ...            0.1297            0.15250           0.1632               0.10870          0.3062                  0.06072             x1             x2            x11
565        13.17         21.81           85.42      531.5          0.09714           0.10470         0.08259              0.05252         0.1746  ...            0.1503            0.39040           0.3728               0.16070          0.3693                  0.09618             x0             x1             x4
566        17.30         17.08          113.00      928.2          0.10080           0.10410         0.12660              0.08353         0.1813  ...            0.1416            0.24050           0.3378               0.18570          0.3138                  0.08113             x0             x1            x10
567        17.68         20.74          117.40      963.7          0.11150           0.16650         0.18550              0.10540         0.1971  ...            0.1418            0.34980           0.3583               0.15150          0.2463                  0.07738             x0             x1            x16
568        14.80         17.66           95.88      674.8          0.09179           0.08890         0.04069              0.02260         0.1893  ...            0.1226            0.18810           0.2060               0.08308          0.3600                  0.07285             x1             x1             x2

[569 rows x 33 columns]

>>> atom.encode(strategy="target", max_onehot=10, verbose=2)

Fitting Encoder...
Encoding categorical columns...
 --> Ordinal-encoding feature cat_feature_1. Contains 2 classes.
 --> OneHot-encoding feature cat_feature_2. Contains 3 classes.
 --> Target-encoding feature cat_feature_3. Contains 20 classes.

>>> # Note the one-hot encoded columns, named [feature]_[class]
>>> print(atom.X)

     mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  ...  worst concavity  worst concave points  worst symmetry  worst fractal dimension  cat_feature_1  cat_feature_2_x2  cat_feature_2_x1  cat_feature_2_x0  cat_feature_3
0          13.48         20.82           88.40      559.2          0.10160           0.12550         0.10630              0.05439         0.1720  ...           0.5030               0.22580          0.2807                  0.10710            1.0               1.0               0.0               0.0       0.604275
1          18.31         20.58          120.80     1052.0          0.10680           0.12480         0.15690              0.09451         0.1860  ...           0.3759               0.15100          0.3074                  0.07863            0.0               0.0               1.0               0.0       0.495404
2          17.93         24.48          115.20      998.9          0.08855           0.07027         0.05699              0.04744         0.1538  ...           0.2080               0.11360          0.2504                  0.07948            0.0               1.0               0.0               0.0       0.604073
3          15.13         29.81           96.71      719.5          0.08320           0.04605         0.04686              0.02739         0.1852  ...           0.1547               0.06575          0.3233                  0.06165            0.0               1.0               0.0               0.0       0.657228
4           8.95         15.76           58.74      245.2          0.09462           0.12430         0.09263              0.02308         0.1305  ...           0.1544               0.03846          0.1652                  0.07722            0.0               0.0               0.0               1.0       0.660063
..           ...           ...             ...        ...              ...               ...             ...                  ...            ...  ...              ...                   ...             ...                      ...            ...               ...               ...               ...            ...
564        14.34         13.47           92.51      641.2          0.09906           0.07624         0.05724              0.04603         0.2075  ...           0.1632               0.10870          0.3062                  0.06072            1.0               1.0               0.0               0.0       0.657228
565        13.17         21.81           85.42      531.5          0.09714           0.10470         0.08259              0.05252         0.1746  ...           0.3728               0.16070          0.3693                  0.09618            0.0               0.0               1.0               0.0       0.616927
566        17.30         17.08          113.00      928.2          0.10080           0.10410         0.12660              0.08353         0.1813  ...           0.3378               0.18570          0.3138                  0.08113            0.0               0.0               1.0               0.0       0.604275
567        17.68         20.74          117.40      963.7          0.11150           0.16650         0.18550              0.10540         0.1971  ...           0.3583               0.15150          0.2463                  0.07738            0.0               0.0               1.0               0.0       0.675771
568        14.80         17.66           95.88      674.8          0.09179           0.08890         0.04069              0.02260         0.1893  ...           0.2060               0.08308          0.3600                  0.07285            1.0               0.0               1.0               0.0       0.591592

[569 rows x 35 columns]
>>> from atom.data_cleaning import Encoder
>>> from sklearn.datasets import load_breast_cancer
>>> from numpy.random import randint

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> X["cat_feature_1"] = [f"x{i}" for i in randint(0, 2, len(X))]
>>> X["cat_feature_2"] = [f"x{i}" for i in randint(0, 3, len(X))]
>>> X["cat_feature_3"] = [f"x{i}" for i in randint(0, 20, len(X))]
>>> print(X)

     mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  ...  worst smoothness  worst compactness  worst concavity  worst concave points  worst symmetry  worst fractal dimension  cat_feature_1  cat_feature_2  cat_feature_3
0          17.99         10.38          122.80     1001.0          0.11840           0.27760         0.30010              0.14710         0.2419  ...           0.16220            0.66560           0.7119                0.2654          0.4601                  0.11890             x1             x2             x5
1          20.57         17.77          132.90     1326.0          0.08474           0.07864         0.08690              0.07017         0.1812  ...           0.12380            0.18660           0.2416                0.1860          0.2750                  0.08902             x1             x2            x13
2          19.69         21.25          130.00     1203.0          0.10960           0.15990         0.19740              0.12790         0.2069  ...           0.14440            0.42450           0.4504                0.2430          0.3613                  0.08758             x0             x0            x15
3          11.42         20.38           77.58      386.1          0.14250           0.28390         0.24140              0.10520         0.2597  ...           0.20980            0.86630           0.6869                0.2575          0.6638                  0.17300             x0             x2            x10
4          20.29         14.34          135.10     1297.0          0.10030           0.13280         0.19800              0.10430         0.1809  ...           0.13740            0.20500           0.4000                0.1625          0.2364                  0.07678             x1             x1            x17
..           ...           ...             ...        ...              ...               ...             ...                  ...            ...  ...               ...                ...              ...                   ...             ...                      ...            ...            ...            ...
564        21.56         22.39          142.00     1479.0          0.11100           0.11590         0.24390              0.13890         0.1726  ...           0.14100            0.21130           0.4107                0.2216          0.2060                  0.07115             x1             x1            x12
565        20.13         28.25          131.20     1261.0          0.09780           0.10340         0.14400              0.09791         0.1752  ...           0.11660            0.19220           0.3215                0.1628          0.2572                  0.06637             x0             x2            x14
566        16.60         28.08          108.30      858.1          0.08455           0.10230         0.09251              0.05302         0.1590  ...           0.11390            0.30940           0.3403                0.1418          0.2218                  0.07820             x0             x1             x3
567        20.60         29.33          140.10     1265.0          0.11780           0.27700         0.35140              0.15200         0.2397  ...           0.16500            0.86810           0.9387                0.2650          0.4087                  0.12400             x1             x0             x2
568         7.76         24.54           47.92      181.0          0.05263           0.04362         0.00000              0.00000         0.1587  ...           0.08996            0.06444           0.0000                0.0000          0.2871                  0.07039             x1             x1            x11

[569 rows x 33 columns]

>>> encoder = Encoder(strategy="target", max_onehot=10, verbose=2)
>>> X = encoder.fit_transform(X, y)

Fitting Encoder...
Encoding categorical columns...
 --> Ordinal-encoding feature cat_feature_1. Contains 2 classes.
 --> OneHot-encoding feature cat_feature_2. Contains 3 classes.
 --> Target-encoding feature cat_feature_3. Contains 20 classes.

>>> # Note the one-hot encoded columns, named [feature]_[class]
>>> print(X)

     mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  ...  worst concavity  worst concave points  worst symmetry  worst fractal dimension  cat_feature_1  cat_feature_2_x2  cat_feature_2_x0  cat_feature_2_x1  cat_feature_3
0          17.99         10.38          122.80     1001.0          0.11840           0.27760         0.30010              0.14710         0.2419  ...           0.7119                0.2654          0.4601                  0.11890            1.0               1.0               0.0               0.0       0.645086
1          20.57         17.77          132.90     1326.0          0.08474           0.07864         0.08690              0.07017         0.1812  ...           0.2416                0.1860          0.2750                  0.08902            1.0               1.0               0.0               0.0       0.604148
2          19.69         21.25          130.00     1203.0          0.10960           0.15990         0.19740              0.12790         0.2069  ...           0.4504                0.2430          0.3613                  0.08758            0.0               0.0               1.0               0.0       0.675079
3          11.42         20.38           77.58      386.1          0.14250           0.28390         0.24140              0.10520         0.2597  ...           0.6869                0.2575          0.6638                  0.17300            0.0               1.0               0.0               0.0       0.706297
4          20.29         14.34          135.10     1297.0          0.10030           0.13280         0.19800              0.10430         0.1809  ...           0.4000                0.1625          0.2364                  0.07678            1.0               0.0               0.0               1.0       0.716566
..           ...           ...             ...        ...              ...               ...             ...                  ...            ...  ...              ...                   ...             ...                      ...            ...               ...               ...               ...            ...
564        21.56         22.39          142.00     1479.0          0.11100           0.11590         0.24390              0.13890         0.1726  ...           0.4107                0.2216          0.2060                  0.07115            1.0               0.0               0.0               1.0       0.598024
565        20.13         28.25          131.20     1261.0          0.09780           0.10340         0.14400              0.09791         0.1752  ...           0.3215                0.1628          0.2572                  0.06637            0.0               1.0               0.0               0.0       0.683185
566        16.60         28.08          108.30      858.1          0.08455           0.10230         0.09251              0.05302         0.1590  ...           0.3403                0.1418          0.2218                  0.07820            0.0               0.0               0.0               1.0       0.472908
567        20.60         29.33          140.10     1265.0          0.11780           0.27700         0.35140              0.15200         0.2397  ...           0.9387                0.2650          0.4087                  0.12400            1.0               0.0               1.0               0.0       0.585452
568         7.76         24.54           47.92      181.0          0.05263           0.04362         0.00000              0.00000         0.1587  ...           0.0000                0.0000          0.2871                  0.07039            1.0               0.0               0.0               1.0       0.516759

[569 rows x 35 columns]


Methods

fit                       Fit to data.
fit_transform             Fit to data, then transform it.
get_feature_names_out     Get output feature names for transformation.
get_params                Get parameters for this estimator.
inverse_transform         Do nothing.
set_output                Set output container.
set_params                Set the parameters of this estimator.
transform                 Encode the data.


method fit(X, y=None)[source]

Fit to data.

Note that leaving y=None can lead to errors if the strategy encoder requires target values. For multioutput tasks, only the first target column is used to fit the encoder.

Parameters

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: sequence or dataframe-like
Target column(s) corresponding to X.

Returns

Self
Estimator instance.



method fit_transform(X=None, y=None, **fit_params)[source]

Fit to data, then transform it.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X. If None, y is ignored.

**fit_params
Additional keyword arguments for the fit method.

Returns

dataframe
Transformed feature set. Only returned if provided.

series or dataframe
Transformed target column. Only returned if provided.



method get_feature_names_out(input_features=None)[source]

Get output feature names for transformation.

Parameters

input_features: sequence or None, default=None
Only used to validate feature names with the names seen in fit.

Returns

np.ndarray
Transformed feature names.
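One-hot encoded columns follow the [feature]_[class] pattern shown in the example above. A sketch of how such output names are built (hypothetical helper for illustration):

```python
def onehot_feature_names(feature, classes):
    """Build output names following the [feature]_[class] pattern."""
    return [f"{feature}_{c}" for c in classes]
```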



method get_params(deep=True)[source]

Get parameters for this estimator.

Parameters

deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict
Parameter names mapped to their values.



method inverse_transform(X=None, y=None, **fit_params)[source]

Do nothing.

Returns the input unchanged. Implemented for continuity of the API.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X. If None, y is ignored.

Returns

dataframe
Feature set. Only returned if provided.

series or dataframe
Target column(s). Only returned if provided.



method set_output(transform=None)[source]

Set output container.

See sklearn's user guide on how to use the set_output API. The available output containers are listed below.

Parameters

transform: str or None, default=None
Configure the output of the transform, fit_transform, and inverse_transform methods. If None, the configuration is not changed. Choose from:

  • "numpy"
  • "pandas" (default)
  • "pandas-pyarrow"
  • "polars"
  • "polars-lazy"
  • "pyarrow"
  • "modin"
  • "dask"
  • "pyspark"
  • "pyspark-pandas"

Returns

Self
Estimator instance.
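The method returns the estimator itself, so calls can be chained. A minimal sketch of this pattern (a generic illustration, not ATOM's actual mixin):

```python
class ContainerMixin:
    """Sketch of a set_output-style API: store the choice, return self."""

    _output = "pandas"  # default output container

    def set_output(self, transform=None):
        if transform is not None:
            self._output = transform
        return self  # returning self enables chaining, e.g. est.set_output(...).fit(X)
```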



method set_params(**params)[source]

Set the parameters of this estimator.

Parameters

**params: dict
Estimator parameters.

Returns

self: estimator instance
Estimator instance.



method transform(X, y=None)[source]

Encode the data.

Parameters

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: sequence, dataframe-like or None, default=None
Do nothing. Implemented for continuity of the API.

Returns

dataframe
Encoded dataframe.