Encoder

class atom.data_cleaning.Encoder(strategy="LeaveOneOut", max_onehot=10, ordinal=None, rare_to_value=None, value="rare", verbose=0, logger=None, **kwargs)[source]

Perform encoding of categorical features.

The encoding type depends on the number of classes in the column:

If n_classes=2 or ordinal feature, use Ordinal-encoding.
If 2 < n_classes <= max_onehot, use OneHot-encoding.
If n_classes > max_onehot, use strategy-encoding.
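
The selection rule above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name choose_encoding and its is_ordinal flag are hypothetical, and the sketch assumes the column has at least two classes.

```python
def choose_encoding(n_classes, max_onehot=10, is_ordinal=False, strategy="LeaveOneOut"):
    """Pick the encoding type for one column (hypothetical helper, illustration only)."""
    # Binary columns and columns listed in `ordinal` get ordinal-encoding.
    if n_classes == 2 or is_ordinal:
        return "Ordinal"
    # One-hot is used only up to max_onehot classes; None disables it.
    if max_onehot is not None and 2 < n_classes <= max_onehot:
        return "OneHot"
    # Everything else falls back to the selected strategy.
    return strategy


for n in (2, 3, 20):
    print(n, choose_encoding(n))
```

With the defaults, the three columns in the example further down (2, 3 and 20 classes) map to Ordinal, OneHot and LeaveOneOut respectively.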

Missing values are propagated to the output column. Unknown classes encountered during transform are imputed according to the selected strategy. Rare classes can be replaced with a single value to prevent excessively high cardinality.

This class can be accessed from atom through the encode method. Read more in the user guide.

Warning

Two category-encoders estimators are unavailable:

OneHotEncoder: Use the max_onehot parameter.
HashingEncoder: Incompatibility of APIs.

Parameters

strategy: str or estimator, default="LeaveOneOut"

Type of encoding to use for high cardinality features. Choose from any of the estimators in the category-encoders package or provide a custom one.

max_onehot: int or None, default=10

Maximum number of unique values in a feature to perform one-hot encoding. If None, strategy-encoding is always used for columns with more than two classes.

ordinal: dict or None, default=None

Order of ordinal features, where the dict key is the feature's name and the value is the class order, e.g. {"salary": ["low", "medium", "high"]}.
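
As a rough illustration of what a class order implies, the sketch below maps each class to its position in the list. The exact integers the Encoder assigns are an assumption here, not something this page states.

```python
ordinal = {"salary": ["low", "medium", "high"]}

# Assumed for illustration: each class maps to its index in the given order.
salary_codes = {cls: i for i, cls in enumerate(ordinal["salary"])}

column = ["medium", "low", "high", "low"]
encoded = [salary_codes[v] for v in column]
print(encoded)  # [1, 0, 2, 0]
```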

rare_to_value: int, float or None, default=None

Replace rare class occurrences in categorical columns with the string given by the value parameter. This transformation is applied before the column is encoded.

If None: Skip this step.
If int: Minimum number of occurrences in a class.
If float: Minimum fraction of occurrences in a class.
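
The three cases can be sketched without any dependencies. The helper replace_rare is hypothetical, and treating the threshold as inclusive (a class with exactly the minimum count is kept) is an assumption of this sketch.

```python
from collections import Counter


def replace_rare(values, rare_to_value=None, value="rare"):
    """Replace rare classes in a list of labels (hypothetical helper)."""
    if rare_to_value is None:
        return list(values)  # step is skipped
    counts = Counter(values)
    # A float is read as a fraction of the rows, an int as an absolute count.
    if isinstance(rare_to_value, float):
        min_count = rare_to_value * len(values)
    else:
        min_count = rare_to_value
    # Assumption: classes with at least min_count occurrences are kept.
    return [v if counts[v] >= min_count else value for v in values]


column = ["a"] * 6 + ["b"] * 3 + ["c"]
print(replace_rare(column, rare_to_value=2))    # only "c" is replaced
print(replace_rare(column, rare_to_value=0.5))  # "b" and "c" are replaced
```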

value: str, default="rare"

Value with which to replace rare classes. This parameter is ignored if rare_to_value=None.

verbose: int, default=0

Verbosity level of the class. Choose from:

0 to not print anything.
1 to print basic information.
2 to print detailed information.

logger: str, Logger or None, default=None

If None: Doesn't save a logging file.
If str: Name of the log file. Use "auto" for automatic naming.
Else: Python logging.Logger instance.

**kwargs

Additional keyword arguments for the strategy estimator.

Attributes

mapping: dict of dicts

Encoded values and their respective mapping. The column name is the key to its mapping dictionary. Only available for columns that are mapped to a single column (e.g. Ordinal, Leave-one-out, etc.).

feature_names_in_: np.array

Names of features seen during fit.

n_features_in_: int

Number of features seen during fit.

Example

The first example runs the encoder through atom; the second uses the class stand-alone.

>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> from numpy.random import randint

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> X["cat_feature_1"] = [f"x{i}" for i in randint(0, 2, len(X))]
>>> X["cat_feature_2"] = [f"x{i}" for i in randint(0, 3, len(X))]
>>> X["cat_feature_3"] = [f"x{i}" for i in randint(0, 20, len(X))]

>>> atom = ATOMClassifier(X, y)
>>> print(atom.X)

     mean radius  mean texture  ...  cat_feature_2  cat_feature_3
0          13.62         23.23  ...             x0             x0
1          14.86         16.94  ...             x0             x5
2          16.74         21.59  ...             x2            x15
3          13.37         16.39  ...             x1            x18
4          11.37         18.89  ...             x0            x13
..           ...           ...  ...            ...            ...
564        14.06         17.18  ...             x2             x1
565        11.29         13.04  ...             x0            x10
566        14.26         19.65  ...             x0             x5
567        12.05         14.63  ...             x2            x14
568        18.81         19.98  ...             x1            x13

[569 rows x 33 columns]

>>> atom.encode(strategy="leaveoneout", max_onehot=10, verbose=2)

Fitting Encoder...
Encoding categorical columns...
 --> Ordinal-encoding feature cat_feature_1. Contains 2 classes.
 --> OneHot-encoding feature cat_feature_2. Contains 3 classes.
 --> LeaveOneOut-encoding feature cat_feature_3. Contains 20 classes.

>>> # Note the one-hot encoded column with name [feature]_[class]
>>> print(atom.X)

     mean radius  mean texture  ...  cat_feature_2_x2  cat_feature_3
0          13.62         23.23  ...               0.0       0.714286
1          14.86         16.94  ...               0.0       0.555556
2          16.74         21.59  ...               1.0       0.681818
3          13.37         16.39  ...               0.0       0.739130
4          11.37         18.89  ...               0.0       0.521739
..           ...           ...  ...               ...            ...
564        14.06         17.18  ...               1.0       0.772727
565        11.29         13.04  ...               0.0       0.766667
566        14.26         19.65  ...               0.0       0.555556
567        12.05         14.63  ...               1.0       0.411765
568        18.81         19.98  ...               0.0       0.521739

[569 rows x 35 columns]

>>> from atom.data_cleaning import Encoder
>>> from sklearn.datasets import load_breast_cancer
>>> from numpy.random import randint

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> X["cat_feature_1"] = [f"x{i}" for i in randint(0, 2, len(X))]
>>> X["cat_feature_2"] = [f"x{i}" for i in randint(0, 3, len(X))]
>>> X["cat_feature_3"] = [f"x{i}" for i in randint(0, 20, len(X))]
>>> print(X)

     mean radius  mean texture  ...  cat_feature_2  cat_feature_3
0          13.62         23.23  ...             x0             x0
1          14.86         16.94  ...             x0             x5
2          16.74         21.59  ...             x2            x15
3          13.37         16.39  ...             x1            x18
4          11.37         18.89  ...             x0            x13
..           ...           ...  ...            ...            ...
564        14.06         17.18  ...             x2             x1
565        11.29         13.04  ...             x0            x10
566        14.26         19.65  ...             x0             x5
567        12.05         14.63  ...             x2            x14
568        18.81         19.98  ...             x1            x13

[569 rows x 33 columns]

>>> encoder = Encoder(strategy="leaveoneout", max_onehot=10, verbose=2)
>>> X = encoder.fit_transform(X, y)

Fitting Encoder...
Encoding categorical columns...
 --> Ordinal-encoding feature cat_feature_1. Contains 2 classes.
 --> OneHot-encoding feature cat_feature_2. Contains 3 classes.
 --> LeaveOneOut-encoding feature cat_feature_3. Contains 20 classes.

>>> # Note the one-hot encoded column with name [feature]_[class]
>>> print(X)

     mean radius  mean texture  ...  cat_feature_2_x2  cat_feature_3
0          17.99         10.38  ...               1.0       0.379310
1          20.57         17.77  ...               1.0       0.714286
2          19.69         21.25  ...               0.0       0.586207
3          11.42         20.38  ...               0.0       0.678571
4          20.29         14.34  ...               0.0       0.714286
..           ...           ...  ...               ...            ...
564        21.56         22.39  ...               0.0       0.580645
565        20.13         28.25  ...               0.0       0.518519
566        16.60         28.08  ...               1.0       0.600000
567        20.60         29.33  ...               1.0       0.586207
568         7.76         24.54  ...               1.0       0.678571

[569 rows x 35 columns]
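
For readers unfamiliar with the default strategy, the following is a minimal sketch of the general leave-one-out idea on the fitting data: each row is encoded with the mean target of the other rows in the same class. It is not the category-encoders implementation, which may differ in details such as regularization and the treatment of unseen data. The helper name leave_one_out and the global-mean fallback for classes with a single row are assumptions.

```python
from collections import defaultdict


def leave_one_out(categories, targets):
    """Leave-one-out target encoding for the fitting data (illustrative sketch)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1

    global_mean = sum(targets) / len(targets)
    encoded = []
    for c, t in zip(categories, targets):
        if counts[c] > 1:
            # Mean target of the class, excluding the current row.
            encoded.append((sums[c] - t) / (counts[c] - 1))
        else:
            # Assumed fallback when the class has no other rows.
            encoded.append(global_mean)
    return encoded


cats = ["x0", "x0", "x0", "x1", "x1"]
y = [1, 0, 1, 1, 0]
print(leave_one_out(cats, y))  # [0.5, 1.0, 0.5, 0.0, 1.0]
```

Because the encoding depends on the target, this kind of strategy needs y when fitting.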

Methods

fit	Fit to data.
fit_transform	Fit to data, then transform it.
get_params	Get parameters for this estimator.
inverse_transform	Does nothing.
log	Print message and save to log file.
save	Save the instance to a pickle file.
set_params	Set the parameters of this estimator.
transform	Encode the data.

method fit(X, y=None)[source]

Fit to data.

Note that leaving y=None can lead to errors if the strategy encoder requires target values.

Parameters

X: dataframe-like

Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, default=None

Target column corresponding to X.

If None: y is ignored.
If int: Position of the target column in X.
If str: Name of the target column in X.
Else: Array with shape=(n_samples,) to use as target.

Returns

Encoder

Estimator instance.

method fit_transform(X=None, y=None, **fit_params)[source]

Fit to data, then transform it.

Parameters

X: dataframe-like or None, default=None

Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: int, str, dict, sequence or None, default=None

Target column corresponding to X.

If None: y is ignored.
If int: Position of the target column in X.
If str: Name of the target column in X.
Else: Array with shape=(n_samples,) to use as target.

**fit_params

Additional keyword arguments for the fit method.

Returns

pd.DataFrame

Transformed feature set. Only returned if X is provided.

pd.Series

Transformed target column. Only returned if y is provided.

method get_params(deep=True)[source]

Get parameters for this estimator.

Parameters

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : dict

Parameter names mapped to their values.

method inverse_transform(X=None, y=None)[source]

Does nothing.

Parameters

X: dataframe-like or None, default=None

Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: int, str, dict, sequence or None, default=None

Target column corresponding to X.

If None: y is ignored.
If int: Position of the target column in X.
If str: Name of the target column in X.
Else: Array with shape=(n_samples,) to use as target.

Returns

pd.DataFrame

Transformed feature set. Only returned if X is provided.

pd.Series

Transformed target column. Only returned if y is provided.

method log(msg, level=0, severity="info")[source]

Print message and save to log file.

Parameters

msg: int, float or str

Message to save to the logger and print to stdout.

level: int, default=0

Minimum verbosity level to print the message.

severity: str, default="info"

Severity level of the message. Choose from: debug, info, warning, error, critical.

method save(filename="auto", save_data=True)[source]

Save the instance to a pickle file.

Parameters

filename: str, default="auto"

Name of the file. Use "auto" for automatic naming.

save_data: bool, default=True

Whether to save the dataset with the instance. This parameter is ignored if the method is not called from atom. If False, remember to add the data to ATOMLoader when loading the file.

method set_params(**params)[source]

Set the parameters of this estimator.

Parameters

**params : dict

Estimator parameters.

Returns

self : estimator instance

Estimator instance.

method transform(X, y=None)[source]

Encode the data.

Parameters

X: dataframe-like

Feature set with shape=(n_samples, n_features).

y: int, str, dict, sequence or None, default=None

Does nothing. Implemented for continuity of the API.

Returns

pd.DataFrame

Encoded dataframe.