Encoder

class atom.data_cleaning.Encoder(strategy="LeaveOneOut", max_onehot=10, ordinal=None, frac_to_other=None, verbose=0, logger=None, **kwargs) [source]

Perform encoding of categorical features. The encoding type depends on the number of classes in the column:

If n_classes=2 or the feature is ordinal, use Ordinal-encoding.
If 2 < n_classes <= max_onehot, use OneHot-encoding.
If n_classes > max_onehot, use strategy-encoding.
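
The selection rule listed above can be sketched as a small stand-alone function. This is an illustration only, not atom's implementation: choose_encoding and is_ordinal are hypothetical names, and rejecting columns with fewer than two classes is an assumption, since the rules above do not cover that case.

```python
def choose_encoding(n_classes, max_onehot=10, is_ordinal=False):
    """Return the encoding type for one column (hypothetical helper)."""
    if n_classes < 2:
        # Assumption: this case is not covered by the documented rules.
        raise ValueError("Expected a column with at least two classes.")
    if n_classes == 2 or is_ordinal:
        return "ordinal"
    if max_onehot is not None and n_classes <= max_onehot:
        return "onehot"
    # More classes than max_onehot, or max_onehot=None with more
    # than two classes: use the encoder selected with `strategy`.
    return "strategy"


for n in (2, 5, 10, 11):
    print(n, choose_encoding(n))
```

With the default max_onehot=10, a column with exactly 10 classes is still one-hot encoded, while a column with 11 classes is passed to the strategy encoder.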

Missing values are propagated to the output column. Unknown classes encountered during transforming are imputed according to the selected strategy. Classes with few occurrences can be replaced with the value other to prevent excessively high cardinality. This class can be accessed from atom through the encode method. Read more in the user guide.

Parameters:

strategy: str or estimator, optional (default="LeaveOneOut")
Type of encoding to use for high cardinality features. Choose from any of the estimators in the category-encoders package or provide a custom one.
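
To give an idea of what the default strategy does, the following is a minimal pure-Python sketch of the general leave-one-out technique: every row is encoded with the mean target of the other rows that share its category. It is not the category-encoders implementation. The function name is hypothetical, and falling back to the overall target mean for a category that occurs only once is an assumption.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Encode each row with the mean target of the other rows in its category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1

    overall_mean = sum(targets) / len(targets)

    encoded = []
    for cat, target in zip(categories, targets):
        if counts[cat] > 1:
            # Exclude the row's own target from its category mean.
            encoded.append((sums[cat] - target) / (counts[cat] - 1))
        else:
            # Assumption: a category seen once gets the overall mean.
            encoded.append(overall_mean)
    return encoded


city = ["a", "a", "a", "b", "b", "c"]
y = [1, 0, 1, 0, 0, 1]
print(leave_one_out_encode(city, y))
```

Because the encoded values are computed from the target, encoders of this kind need y during fitting.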

max_onehot: int or None, optional (default=10)
Maximum number of unique values in a feature to perform one-hot encoding. If None, strategy-encoding is always used for columns with more than two classes.

ordinal: dict or None, optional (default=None)
Order of ordinal features, where the dict key is the feature's name and the value is the class order, e.g. {"salary": ["low", "medium", "high"]}.
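
As a rough illustration of how a class order translates to encoded values, the sketch below maps each class to its position in the provided list and propagates missing values. The helper name is hypothetical, and starting the codes at 0 is an assumption; the values atom assigns may differ. Unknown classes are not handled here.

```python
import math


def ordinal_encode(values, order):
    """Map each class to its position in `order` (hypothetical helper)."""
    mapping = {cls: idx for idx, cls in enumerate(order)}
    encoded = []
    for value in values:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            # Missing values are propagated to the output.
            encoded.append(float("nan"))
        else:
            encoded.append(mapping[value])
    return encoded


ordinal = {"salary": ["low", "medium", "high"]}
salary = ["medium", "low", None, "high"]
encoded = ordinal_encode(salary, ordinal["salary"])
print(encoded)
```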

frac_to_other: int, float or None, optional (default=None)
Replace rare classes in categorical columns with the string other. This transformation is done before the column is encoded.

If None: Skip this step.
If int: Maximum number of occurrences for a category to be replaced.
If float: Maximum fraction of occurrences for a category to be replaced.
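
The three options above could look roughly like the following pure-Python sketch. It is not atom's code: replace_rare is a hypothetical helper, and treating the threshold as inclusive (a category with exactly the maximum number of occurrences is replaced) is an assumption based on the wording of the parameter description.

```python
from collections import Counter


def replace_rare(values, frac_to_other=None):
    """Replace rare categories with the string "other" (hypothetical helper)."""
    if frac_to_other is None:
        return list(values)  # Skip this step.

    counts = Counter(values)
    if isinstance(frac_to_other, float):
        # Float: interpret as a fraction of the number of rows.
        threshold = frac_to_other * len(values)
    else:
        # Int: interpret as an absolute number of occurrences.
        threshold = frac_to_other

    # Assumption: the threshold is inclusive.
    return [v if counts[v] > threshold else "other" for v in values]


colors = ["red"] * 6 + ["blue"] * 3 + ["green"]
print(replace_rare(colors, 3))    # absolute threshold
print(replace_rare(colors, 0.2))  # fraction of the 10 rows
```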

verbose: int, optional (default=0)
Verbosity level of the class. Choose from:

0 to not print anything.
1 to print basic information.
2 to print detailed information.

logger: str, Logger or None, optional (default=None)

If None: Doesn't save a logging file.
If str: Name of the log file. Use "auto" for automatic naming.
Else: Python logging.Logger instance.

**kwargs
Additional keyword arguments for the strategy estimator.

Tip

Use atom's categorical attribute for a list of the categorical columns in the dataset.

Warning

Two category-encoders estimators are unavailable:

OneHotEncoder: Use the max_onehot parameter.
HashingEncoder: Incompatibility of APIs.

Attributes

Attributes:

mapping: dict of dicts
Encoded values and their respective mapping. The column name is the key to its mapping dictionary. Only available for columns that are encoded to a single column (e.g. Ordinal, Leave-one-out, etc.).

feature_names_in_: np.array
Names of features seen during fit.

n_features_in_: int
Number of features seen during fit.

Methods

fit	Fit to data.
fit_transform	Fit to data, then transform it.
get_params	Get parameters for this estimator.
log	Write information to the logger and print to stdout.
save	Save the instance to a pickle file.
set_params	Set the parameters of this estimator.
transform	Transform the data.

method fit(X, y=None) [source]

Fit to data. Note that leaving y=None can lead to errors if the strategy encoder requires target values.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)

If None: y is ignored.
If int: Index of the target column in X.
If str: Name of the target column in X.
Else: Target column with shape=(n_samples,).

Returns:

Encoder
Fitted instance of self.

method fit_transform(X, y=None) [source]

Fit to data, then transform it. Note that leaving y=None can lead to errors if the strategy encoder requires target values.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)

If None: y is ignored.
If int: Index of the target column in X.
If str: Name of the target column in X.
Else: Target column with shape=(n_samples,).

Returns:

pd.DataFrame
Transformed feature set.

method get_params(deep=True) [source]

Get parameters for this estimator.

Parameters:

deep: bool, optional (default=True)
If True, return the parameters for this estimator and for contained subobjects that are estimators.

Returns:

dict
Parameter names mapped to their values.

method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.

method save(filename="auto") [source]

Save the instance to a pickle file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.

method set_params(**params) [source]

Set the parameters of this estimator.

Parameters:

**params: dict
Estimator parameters.

Returns:

Encoder
Estimator instance.

method transform(X, y=None) [source]

Encode the data.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
Does nothing. Implemented for continuity of the API.

Returns:

pd.DataFrame
Transformed feature set.

Example

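atom

from atom import ATOMClassifier

# X is the feature set and y the target column.
atom = ATOMClassifier(X, y)
atom.encode(strategy="CatBoost", max_onehot=5)

stand-alone

from atom.data_cleaning import Encoder

# Fit on the training set (the CatBoost strategy uses the target values),
# then apply the fitted encoder to the feature set.
encoder = Encoder(strategy="CatBoost", max_onehot=5)
encoder.fit(X_train, y_train)
X = encoder.transform(X)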