Encoder

class atom.data_cleaning.Encoder(strategy="LeaveOneOut", max_onehot=10, ordinal=None, frac_to_other=None, verbose=0, logger=None, **kwargs) [source]

Perform encoding of categorical features. The encoding type depends on the number of classes in the column:

If n_classes=2 or the feature is ordinal, use Ordinal-encoding.
If 2 < n_classes <= max_onehot, use OneHot-encoding.
If n_classes > max_onehot, use strategy-encoding.
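
The three rules above can be sketched as a small decision function. This is only an illustration of the documented rule, not the library's implementation: the function name choose_encoding and the is_ordinal flag are hypothetical, and the sketch assumes every column has at least two classes.

```python
from typing import Optional


def choose_encoding(n_classes: int, max_onehot: Optional[int] = 10,
                    is_ordinal: bool = False) -> str:
    """Return the encoding type for one categorical column (illustrative only)."""
    # Binary columns and user-declared ordinal columns get ordinal encoding.
    if n_classes == 2 or is_ordinal:
        return "ordinal"
    # max_onehot=None disables one-hot encoding for columns with more than two classes.
    if max_onehot is not None and 2 < n_classes <= max_onehot:
        return "onehot"
    # Remaining columns (n_classes > max_onehot) use the strategy encoder.
    return "strategy"


for n in (2, 5, 10, 11):
    print(n, choose_encoding(n))
```

With the default max_onehot=10, a column with 10 classes is still one-hot encoded, while a column with 11 classes is handed to the strategy encoder.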

Missing values are propagated to the output column. Unknown classes encountered during transforming are imputed according to the selected strategy. Classes with few occurrences can be replaced with the value "other" to keep the cardinality from becoming too high. The Encoder class can be accessed from atom through the encode method. Read more in the user guide.
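
As a rough illustration of the rare-class replacement, the sketch below applies the threshold documented for the frac_to_other parameter (n_rows * frac_to_other) to a plain list. The helper replace_rare_classes is hypothetical and ignores missing values; it is not the library's code.

```python
from collections import Counter


def replace_rare_classes(values, frac_to_other, other="other"):
    """Replace classes that occur fewer than n_rows * frac_to_other times."""
    threshold = len(values) * frac_to_other
    counts = Counter(values)
    # Keep frequent classes as-is; collapse the rare ones into a single class.
    return [v if counts[v] >= threshold else other for v in values]


colors = ["red"] * 6 + ["blue"] * 3 + ["green"]
# 10 rows with frac_to_other=0.2 gives a threshold of 2 occurrences,
# so only "green" (1 occurrence) is replaced.
print(replace_rare_classes(colors, frac_to_other=0.2))
```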

Parameters:

strategy: str or estimator, optional (default="LeaveOneOut")
Type of encoding to use for high cardinality features. Choose from any of the estimators in the category-encoders package or provide a custom one.

max_onehot: int or None, optional (default=10)
Maximum number of unique values in a feature to perform one-hot encoding. If None, strategy-encoding is always used for columns with more than two classes.

ordinal: dict or None, optional (default=None)
Order of ordinal features, where the dict key is the feature's name and the value is the class order, e.g. {"salary": ["low", "medium", "high"]}.

frac_to_other: int, float or None, optional (default=None)
Classes with fewer occurrences than n_rows * frac_to_other are replaced with the string "other". This transformation is done before the encoding of the column. If None, skip this step.

verbose: int, optional (default=0)
Verbosity level of the class. Possible values are:

0 to not print anything.
1 to print basic information.
2 to print detailed information.

logger: str, Logger or None, optional (default=None)

If None: Doesn't save a logging file.
If str: Name of the log file. Use "auto" for automatic naming.
Else: Python logging.Logger instance.

**kwargs
Additional keyword arguments passed to the strategy estimator.
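
To make the ordinal parameter more concrete, here is a minimal sketch of how a class order could be turned into integer codes. The helper ordinal_encode is hypothetical, and the choice of zero-based codes is an assumption; the exact values produced by the library may differ. Missing values are passed through, in line with the description above.

```python
def ordinal_encode(values, order):
    """Map each class to its position in the user-supplied order."""
    mapping = {cls: idx for idx, cls in enumerate(order)}
    # Missing values (None) are propagated unchanged.
    return [mapping[v] if v is not None else None for v in values]


ordinal = {"salary": ["low", "medium", "high"]}
salary = ["medium", "low", None, "high"]
print(ordinal_encode(salary, ordinal["salary"]))
```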

Tip

Use atom's categorical attribute for a list of the categorical columns in the dataset.

Warning

Two category-encoders estimators are unavailable:

OneHotEncoder: Use the max_onehot parameter.
HashingEncoder: Incompatibility of APIs.

Methods

fit	Fit to data.
fit_transform	Fit to data, then transform it.
get_params	Get parameters for this estimator.
log	Write information to the logger and print to stdout.
save	Save the instance to a pickle file.
set_params	Set the parameters of this estimator.
transform	Transform the data.

method fit(X, y=None) [source]

Fit to data. Note that leaving y=None can lead to errors if the strategy encoder requires target values.

Parameters:

X: dict, list, tuple, np.ndarray or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)

If None: y is ignored.
If int: Index of the target column in X.
If str: Name of the target column in X.
Else: Target column with shape=(n_samples,).

Returns:

self: Encoder
Fitted instance of self.
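
The default strategy, LeaveOneOut, is a target-based encoder, which is why fitting without y can fail. The simplified sketch below shows the idea on plain lists: each row is encoded with the mean target of the other rows in the same class. The function leave_one_out_encode is hypothetical, and the fallback to the global mean for classes seen only once is an assumption of this sketch rather than a description of the category-encoders implementation.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Encode each row as the mean target of the other rows in its class."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    global_mean = sum(targets) / len(targets)

    encoded = []
    for cat, y in zip(categories, targets):
        if counts[cat] > 1:
            # Exclude the current row's own target from its class mean.
            encoded.append((sums[cat] - y) / (counts[cat] - 1))
        else:
            # A class seen only once has no other rows to average over.
            encoded.append(global_mean)
    return encoded


cities = ["a", "a", "a", "b", "b", "c"]
target = [1, 0, 1, 0, 0, 1]
print(leave_one_out_encode(cities, target))
```

Because the encoded value depends on the targets, there is nothing to compute when y is missing.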

method fit_transform(X, y=None) [source]

Fit to data, then transform it. Note that leaving y=None can lead to errors if the strategy encoder requires target values.

Parameters:

X: dict, list, tuple, np.ndarray or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)

If None: y is ignored.
If int: Index of the target column in X.
If str: Name of the target column in X.
Else: Target column with shape=(n_samples,).

Returns:

X: pd.DataFrame
Transformed feature set.

method get_params(deep=True) [source]

Get parameters for this estimator.

Parameters:

deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params: dict
Dictionary of the parameter names mapped to their values.

method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.

method save(filename="auto") [source]

Save the instance to a pickle file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.

method set_params(**params) [source]

Set the parameters of this estimator.

Parameters:

**params: dict
Estimator parameters.

Returns:

self: Encoder
Estimator instance.

method transform(X, y=None) [source]

Encode the data.

Parameters:

X: dict, list, tuple, np.ndarray or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
Does nothing. Implemented for continuity of the API.

Returns:

X: pd.DataFrame
Transformed feature set.

Example

from atom import ATOMClassifier

atom = ATOMClassifier(X, y)
atom.encode(strategy="CatBoost", max_onehot=5)

or

from atom.data_cleaning import Encoder

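encoder = Encoder(strategy="CatBoost", max_onehot=5)
encoder.fit(X_train, y_train)  # CatBoost encoding is target-based, so y is required
X_test = encoder.transform(X_test)  # apply the fitted encoding to unseen data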