Encoder


class atom.data_cleaning.Encoder(strategy="Target", max_onehot=10, ordinal=None, infrequent_to_value=None, value="infrequent", verbose=0, logger=None, **kwargs)[source]
Perform encoding of categorical features.

The encoding type depends on the number of classes in the column:

  • If n_classes=2 or the feature is ordinal, use Ordinal-encoding.
  • If 2 < n_classes <= max_onehot, use OneHot-encoding.
  • If n_classes > max_onehot, use strategy-encoding.
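The selection rules above can be sketched as a small function. This is an illustration only, not the library's implementation: the name choose_encoding and its is_ordinal argument are hypothetical, and the sketch assumes every column has at least two classes.

```python
def choose_encoding(n_classes, max_onehot=10, is_ordinal=False, strategy="Target"):
    """Return the encoding type for one categorical column.

    Hypothetical helper that mirrors the three documented rules.
    Assumes n_classes >= 2.
    """
    # Rule 1: binary or ordinal features are ordinal-encoded.
    if is_ordinal or n_classes == 2:
        return "Ordinal"
    # Rule 2: low-cardinality features are one-hot encoded.
    # With max_onehot=None this branch is skipped entirely.
    if max_onehot is not None and 2 < n_classes <= max_onehot:
        return "OneHot"
    # Rule 3: everything else falls back to the chosen strategy.
    return strategy


print(choose_encoding(2))                    # Ordinal
print(choose_encoding(3))                    # OneHot
print(choose_encoding(20))                   # Target
print(choose_encoding(5, max_onehot=None))   # Target
```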

Missing values are propagated to the output column. Unknown classes encountered during transforming are imputed according to the selected strategy. Infrequent classes can be replaced with a single value to keep the cardinality from becoming too high.

This class can be accessed from atom through the encode method. Read more in the user guide.

Warning

Three category-encoders estimators are unavailable:

Parameters

strategy: str or estimator, default="Target"
Type of encoding to use for high cardinality features. Choose from any of the estimators in the category-encoders package or provide a custom one.

max_onehot: int or None, default=10
Maximum number of unique values in a feature to perform one-hot encoding. If None, strategy-encoding is always used for columns with more than two classes.

ordinal: dict or None, default=None
Order of ordinal features, where the dict key is the feature's name and the value is the class order, e.g. {"salary": ["low", "medium", "high"]}.
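As a rough illustration of what a class order implies, the snippet below maps each class to its position in the list. The zero-based integers are an assumption made for this sketch; the values the Encoder actually assigns may differ.

```python
# Same structure as the `ordinal` parameter: feature name -> class order.
ordinal = {"salary": ["low", "medium", "high"]}

# Assumption: each class is mapped to its zero-based position in the order.
salary_mapping = {cls: idx for idx, cls in enumerate(ordinal["salary"])}

column = ["medium", "low", "high", "low"]
encoded = [salary_mapping[v] for v in column]

print(salary_mapping)  # {'low': 0, 'medium': 1, 'high': 2}
print(encoded)         # [1, 0, 2, 0]
```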

infrequent_to_value: int, float or None, default=None
Replaces infrequent class occurrences in categorical columns with the string in parameter value. This transformation is done before the encoding of the column.

  • If None: Skip this step.
  • If int: Minimum number of occurrences in a class.
  • If float: Minimum fraction of occurrences in a class.

value: str, default="infrequent"
Value with which to replace rare classes. This parameter is ignored if infrequent_to_value=None.
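The replacement of infrequent classes can be sketched in plain Python as follows. The helper replace_infrequent is hypothetical, and the sketch assumes that a class occurring exactly the minimum number of times is kept; the library's boundary handling may differ.

```python
from collections import Counter


def replace_infrequent(column, infrequent_to_value=None, value="infrequent"):
    """Replace classes that occur too rarely with `value` (illustrative only)."""
    if infrequent_to_value is None:
        return list(column)  # Step is skipped.

    counts = Counter(column)
    if isinstance(infrequent_to_value, float):
        # Float: minimum fraction of the rows a class must cover.
        min_count = infrequent_to_value * len(column)
    else:
        # Int: minimum number of occurrences of a class.
        min_count = infrequent_to_value

    # Assumption: classes with count >= min_count are kept.
    return [v if counts[v] >= min_count else value for v in column]


colors = ["red"] * 5 + ["blue"] * 4 + ["green"]
print(Counter(replace_infrequent(colors, infrequent_to_value=2)))
print(Counter(replace_infrequent(colors, infrequent_to_value=0.45)))
```

With the integer threshold only "green" is replaced; with the fraction 0.45 a class needs at least 4.5 of the 10 rows, so "blue" is replaced as well.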

verbose: int, default=0
Verbosity level of the class. Choose from:

  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.

logger: str, Logger or None, default=None

  • If None: Doesn't save a logging file.
  • If str: Name of the log file. Use "auto" for automatic naming.
  • Else: Python logging.Logger instance.

**kwargs
Additional keyword arguments for the strategy estimator.

Attributes

mapping: dict of dicts
Encoded values and their respective mapping. The column name is the key to its mapping dictionary. Only for columns mapped to a single column (e.g. Ordinal, Leave-one-out, etc.).

feature_names_in_: np.array
Names of features seen during fit.

n_features_in_: int
Number of features seen during fit.


See Also

Cleaner

Applies standard data cleaning steps on a dataset.

Imputer

Handle missing values in the data.

Pruner

Prune outliers from the data.


Example

>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> from numpy.random import randint

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> X["cat_feature_1"] = [f"x{i}" for i in randint(0, 2, len(X))]
>>> X["cat_feature_2"] = [f"x{i}" for i in randint(0, 3, len(X))]
>>> X["cat_feature_3"] = [f"x{i}" for i in randint(0, 20, len(X))]

>>> atom = ATOMClassifier(X, y)
>>> print(atom.X)

     mean radius  mean texture  ...  cat_feature_2  cat_feature_3
0          13.62         23.23  ...             x0             x0
1          14.86         16.94  ...             x0             x5
2          16.74         21.59  ...             x2            x15
3          13.37         16.39  ...             x1            x18
4          11.37         18.89  ...             x0            x13
..           ...           ...  ...            ...            ...
564        14.06         17.18  ...             x2             x1
565        11.29         13.04  ...             x0            x10
566        14.26         19.65  ...             x0             x5
567        12.05         14.63  ...             x2            x14
568        18.81         19.98  ...             x1            x13

[569 rows x 33 columns]

>>> atom.encode(strategy="target", max_onehot=10, verbose=2)

Fitting Encoder...
Encoding categorical columns...
 --> Ordinal-encoding feature cat_feature_1. Contains 2 classes.
 --> OneHot-encoding feature cat_feature_2. Contains 3 classes.
 --> Target-encoding feature cat_feature_3. Contains 20 classes.

>>> # Note the one-hot encoded column with name [feature]_[class]
>>> print(atom.X)

     mean radius  mean texture  ...  cat_feature_2_x2  cat_feature_3
0          13.62         23.23  ...               0.0       0.714286
1          14.86         16.94  ...               0.0       0.555556
2          16.74         21.59  ...               1.0       0.681818
3          13.37         16.39  ...               0.0       0.739130
4          11.37         18.89  ...               0.0       0.521739
..           ...           ...  ...               ...            ...
564        14.06         17.18  ...               1.0       0.772727
565        11.29         13.04  ...               0.0       0.766667
566        14.26         19.65  ...               0.0       0.555556
567        12.05         14.63  ...               1.0       0.411765
568        18.81         19.98  ...               0.0       0.521739

[569 rows x 35 columns]

>>> from atom.data_cleaning import Encoder
>>> from sklearn.datasets import load_breast_cancer
>>> from numpy.random import randint

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> X["cat_feature_1"] = [f"x{i}" for i in randint(0, 2, len(X))]
>>> X["cat_feature_2"] = [f"x{i}" for i in randint(0, 3, len(X))]
>>> X["cat_feature_3"] = [f"x{i}" for i in randint(0, 20, len(X))]
>>> print(X)

     mean radius  mean texture  ...  cat_feature_2  cat_feature_3
0          13.62         23.23  ...             x0             x0
1          14.86         16.94  ...             x0             x5
2          16.74         21.59  ...             x2            x15
3          13.37         16.39  ...             x1            x18
4          11.37         18.89  ...             x0            x13
..           ...           ...  ...            ...            ...
564        14.06         17.18  ...             x2             x1
565        11.29         13.04  ...             x0            x10
566        14.26         19.65  ...             x0             x5
567        12.05         14.63  ...             x2            x14
568        18.81         19.98  ...             x1            x13

[569 rows x 33 columns]

>>> encoder = Encoder(strategy="target", max_onehot=10, verbose=2)
>>> X = encoder.fit_transform(X, y)

Fitting Encoder...
Encoding categorical columns...
 --> Ordinal-encoding feature cat_feature_1. Contains 2 classes.
 --> OneHot-encoding feature cat_feature_2. Contains 3 classes.
 --> Target-encoding feature cat_feature_3. Contains 20 classes.

>>> # Note the one-hot encoded column with name [feature]_[class]
>>> print(X)

     mean radius  mean texture  ...  cat_feature_2_x2  cat_feature_3
0          17.99         10.38  ...               1.0       0.379310
1          20.57         17.77  ...               1.0       0.714286
2          19.69         21.25  ...               0.0       0.586207
3          11.42         20.38  ...               0.0       0.678571
4          20.29         14.34  ...               0.0       0.714286
..           ...           ...  ...               ...            ...
564        21.56         22.39  ...               0.0       0.580645
565        20.13         28.25  ...               0.0       0.518519
566        16.60         28.08  ...               1.0       0.600000
567        20.60         29.33  ...               1.0       0.586207
568         7.76         24.54  ...               1.0       0.678571

[569 rows x 35 columns]


Methods

fit                 Fit to data.
fit_transform       Fit to data, then transform it.
get_params          Get parameters for this estimator.
inverse_transform   Does nothing.
log                 Print message and save to log file.
save                Save the instance to a pickle file.
set_params          Set the parameters of this estimator.
transform           Encode the data.


method fit(X, y=None)[source]
Fit to data.

Note that leaving y=None can lead to errors if the strategy encoder requires target values. For multioutput tasks, only the first target column is used to fit the encoder.
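To see why some strategies need y, consider plain mean target encoding, where every class is replaced by the average target value of its rows. The function below is a simplified sketch with a hypothetical name; real target encoders typically add smoothing, which is left out here.

```python
from collections import defaultdict


def target_mean_encode(column, y):
    """Encode each class as the mean of its target values (no smoothing)."""
    if y is None:
        # Without target values there is nothing to average.
        raise ValueError("Target encoding requires target values.")

    sums = defaultdict(float)
    counts = defaultdict(int)
    for cls, target in zip(column, y):
        sums[cls] += target
        counts[cls] += 1

    means = {cls: sums[cls] / counts[cls] for cls in sums}
    return [means[cls] for cls in column], means


encoded, means = target_mean_encode(["a", "b", "a", "b"], [1, 0, 0, 0])
print(means)    # {'a': 0.5, 'b': 0.0}
print(encoded)  # [0.5, 0.0, 0.5, 0.0]
```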

Parameters

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, dict, sequence, dataframe-like or None, default=None
Target column corresponding to X.

  • If None: y is ignored.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • If sequence: Target array with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
  • If dataframe: Target columns for multioutput tasks.

Returns

Encoder
Estimator instance.



method fit_transform(X=None, y=None, **fit_params)[source]
Fit to data, then transform it.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: int, str, sequence, dataframe-like or None, default=None
Target column corresponding to X.

  • If None: y is ignored.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
  • If dataframe-like: Target columns with shape=(n_samples, n_targets) for multioutput tasks.

**fit_params
Additional keyword arguments for the fit method.

Returns

dataframe
Transformed feature set. Only returned if provided.

series
Transformed target column. Only returned if provided.



method get_params(deep=True)[source]
Get parameters for this estimator.

Parameters

deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict
Parameter names mapped to their values.



method inverse_transform(X=None, y=None)[source]
Does nothing.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: int, str, sequence, dataframe-like or None, default=None
Target column corresponding to X.

  • If None: y is ignored.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
  • If dataframe-like: Target columns with shape=(n_samples, n_targets) for multioutput tasks.

Returns

dataframe
Transformed feature set. Only returned if provided.

series
Transformed target column. Only returned if provided.



method log(msg, level=0, severity="info")[source]
Print message and save to log file.

Parameters

msg: int, float or str
Message to save to the logger and print to stdout.

level: int, default=0
Minimum verbosity level to print the message.

severity: str, default="info"
Severity level of the message. Choose from: debug, info, warning, error, critical.
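The interplay between level and the instance's verbosity can be sketched with the standard logging module. This is a stand-alone approximation, not the library's code: here verbose and logger are passed as arguments rather than read from the instance, and the message is assumed to be logged regardless of verbosity.

```python
import logging


def log(msg, level=0, severity="info", verbose=0, logger=None):
    """Print `msg` if verbosity allows it and forward it to the logger."""
    # Print only when the verbosity is at least the message's level.
    if verbose >= level:
        print(msg)
    # Assumption: the logger receives the message regardless of verbosity.
    if logger is not None:
        # severity is one of: debug, info, warning, error, critical.
        getattr(logger, severity)(str(msg))


logger = logging.getLogger("encoder_sketch")
log("Fitting Encoder...", level=1, verbose=2, logger=logger)        # printed
log("Detailed information", level=2, verbose=1, logger=logger)      # not printed
```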



method save(filename="auto", save_data=True)[source]
Save the instance to a pickle file.

Parameters

filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.

save_data: bool, default=True
Whether to save the dataset with the instance. This parameter is ignored if the method is not called from atom. If False, the data must be provided to the load method when the instance is loaded again.
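For readers unfamiliar with pickle files, the round trip looks roughly like this with the standard library. The dictionary below is a hypothetical stand-in for a fitted instance, and the file name is chosen for the example only; it does not reflect the naming used by filename="auto".

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the state of a fitted encoder.
state = {
    "strategy": "Target",
    "max_onehot": 10,
    "mapping": {"cat_feature_1": {"x0": 0, "x1": 1}},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "encoder_sketch.pkl"

    # Save: serialize the object to a binary file.
    with open(path, "wb") as f:
        pickle.dump(state, f)

    # Load: read the same object back from disk.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == state)  # True
```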



method set_params(**params)[source]
Set the parameters of this estimator.

Parameters

**params: dict
Estimator parameters.

Returns

self: estimator instance
Estimator instance.



method transform(X, y=None)[source]
Encode the data.

Parameters

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence, dataframe-like or None, default=None
Does nothing. Implemented for continuity of the API.

Returns

dataframe
Encoded dataframe.