Discretizer

class atom.data_cleaning.Discretizer(strategy="quantile", bins=5, labels=None, gpu=False, verbose=0, logger=None) [source]

Bin continuous data into intervals. For each feature, the bin edges are computed during fit and, together with the number of bins, define the intervals. Ignores categorical columns. It can be accessed from atom through the discretize method. Read more in the user guide.

Parameters:

strategy: str, optional (default="quantile")
Strategy used to define the widths of the bins. Choose from:

"uniform": All bins have identical widths.
"quantile": All bins have the same number of points.
"kmeans": Values in each bin have the same nearest center of a 1D k-means cluster.
"custom": Use custom bin edges provided through bins.
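The difference between the "uniform" and "quantile" strategies is easiest to see on skewed data. The sketch below computes both sets of edges with NumPy and counts the points per bin with pandas. It illustrates the concepts only and is not taken from the Discretizer implementation; the data, `n_bins` and the `bin_counts` helper are made up for this example, and "kmeans" is left out.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed feature; the values are made up for illustration.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], name="feature")
n_bins = 3

# "uniform": equal-width bins between the minimum and the maximum.
uniform_edges = np.linspace(values.min(), values.max(), n_bins + 1)

# "quantile": edges at evenly spaced quantiles, so bins hold equal counts.
quantile_edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))


def bin_counts(series, edges):
    """Hypothetical helper: count the values that fall in each bin."""
    codes = pd.cut(series, bins=edges, labels=False, include_lowest=True)
    return np.bincount(codes.to_numpy(dtype=int), minlength=len(edges) - 1).tolist()


uniform_counts = bin_counts(values, uniform_edges)
quantile_counts = bin_counts(values, quantile_edges)

print(uniform_counts)   # [8, 0, 1]
print(quantile_counts)  # [3, 3, 3]
```

With equal-width bins the single outlier leaves the middle bin empty, while the quantile edges spread the points evenly.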

bins: int, sequence or dict, optional (default=5)

If int: Number of bins to produce for all columns. Not allowed if strategy="custom".
If sequence: Number of bins per column, where the n-th value corresponds to the n-th column that is transformed. If strategy="custom", the sequence holds the bin edges instead and has length=n_bins + 1.
If dict: One of the aforementioned options per column, where the key is the column's name.
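As a rough illustration of the dict form with custom edges, the sketch below bins two columns with different edges and checks that n_bins + 1 edges yield n_bins intervals. It uses pandas' `pd.cut` as a stand-in, which is an assumption and not necessarily what Discretizer does internally; the column names, data and edges are hypothetical.

```python
import pandas as pd

# Hypothetical data and edges, chosen only for illustration.
X = pd.DataFrame({"age": [5, 30, 70], "income": [10_000, 45_000, 120_000]})
custom_bins = {
    "age": [0, 18, 65, 120],           # 4 edges -> 3 bins
    "income": [0, 30_000, 1_000_000],  # 3 edges -> 2 bins
}

# Bin every column named in the dict with its own edges.
binned = X.copy()
for column, edges in custom_bins.items():
    binned[column] = pd.cut(X[column], bins=edges)

n_bins = {col: len(binned[col].cat.categories) for col in custom_bins}
print(n_bins)  # {'age': 3, 'income': 2}
```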

labels: sequence, dict or None, optional (default=None)
Label names with which to replace the binned intervals.

If None: Use default labels of the form [min_edge]-[max_edge].
If sequence: Labels to use for all columns.
If dict: Labels per column, where the key is the column's name.
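The two label styles can be sketched with pandas as follows. The default-style labels are built by hand from consecutive edges to mimic the [min_edge]-[max_edge] form described above. `pd.cut` is used for illustration only, and its handling of values that sit exactly on an edge may differ from Discretizer's; the data is hypothetical.

```python
import pandas as pd

# Hypothetical ages; none of them sits exactly on a bin edge.
ages = pd.Series([5, 17, 30, 90], name="age")
edges = [0, 18, 120]

# Labels of the form "min_edge-max_edge", one per pair of consecutive edges.
default_labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]
with_default = pd.cut(ages, bins=edges, labels=default_labels)

# Custom labels: one name per bin, in the order of the bins.
with_custom = pd.cut(ages, bins=edges, labels=["child", "adult"])

print(with_default.tolist())  # ['0-18', '0-18', '18-120', '18-120']
print(with_custom.tolist())   # ['child', 'child', 'adult', 'adult']
```

Note that the number of labels must equal the number of bins, which is one less than the number of edges.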

gpu: bool or str, optional (default=False)
Train estimator on GPU (instead of CPU). Not available for strategy="custom".

If False: Always use CPU implementation.
If True: Use GPU implementation if possible.
If "force": Force GPU implementation.

verbose: int, optional (default=0)
Verbosity level of the class. Choose from:

0 to not print anything.
1 to print basic information.
2 to print detailed information.

logger: str, Logger or None, optional (default=None)

If None: Doesn't save a logging file.
If str: Name of the log file. Use "auto" for automatic naming.
Else: Python logging.Logger instance.

Tip

The transformation returns categorical columns. Use the Encoder class to convert them back to numerical types.
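For readers who only need a quick numerical view of the binned output, an ordinal mapping can be sketched with plain pandas. This is a stand-in for illustration and does not reproduce what the Encoder class does; the column content and category order are assumptions.

```python
import pandas as pd

# Hypothetical binned column, as it might look after discretization.
binned = pd.Series(["child", "adult", "adult", "child"], name="age")

# Declare the bin order explicitly so the integer codes follow it.
ordered = pd.Categorical(binned, categories=["child", "adult"], ordered=True)
codes = pd.Series(ordered.codes, name="age")

print(codes.tolist())  # [0, 1, 1, 0]
```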

Warning

If strategy="custom", the columns returned can contain NaN if the values lie outside the specified bin edges.
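The effect described in this warning can be reproduced with pandas, which also returns missing values for points outside the outermost edges. The sketch below is illustrative only: `pd.cut` is an assumed stand-in for the custom strategy, the data is hypothetical, and this reference does not state whether Discretizer accepts infinite edges like the ones used in the second call.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, two of which lie outside the edges [0, 120].
ages = pd.Series([-3, 10, 40, 150], name="age")

binned = pd.cut(ages, bins=[0, 18, 120], labels=["child", "adult"])
print(binned.isna().tolist())  # [True, False, False, True]

# Open-ended outer edges cover every finite value, so nothing is lost.
open_ended = pd.cut(ages, bins=[-np.inf, 18, np.inf], labels=["child", "adult"])
print(open_ended.tolist())  # ['child', 'child', 'adult', 'adult']
```

Checking the minimum and maximum of each column against the chosen edges before fitting is a cheap way to catch this early.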

Attributes

Attributes:

feature_names_in_: np.array
Names of features seen during fit.

n_features_in_: int
Number of features seen during fit.

Methods

fit	Fit to data.
fit_transform	Fit to data, then transform it.
get_params	Get parameters for this estimator.
log	Write information to the logger and print to stdout.
save	Save the instance to a pickle file.
set_params	Set the parameters of this estimator.
transform	Transform the data.

method fit(X, y=None) [source]

Fit to data.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
Does nothing. Implemented for continuity of the API.

Returns:

Discretizer
Fitted instance of self.

method fit_transform(X, y=None) [source]

Fit to data, then transform it.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
Does nothing. Implemented for continuity of the API.

Returns:

pd.DataFrame
Transformed feature set.

method get_params(deep=True) [source]

Get parameters for this estimator.

Parameters:

deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

dict
Parameter names mapped to their values.

method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.

method save(filename="auto") [source]

Save the instance to a pickle file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.

method set_params(**params) [source]

Set the parameters of this estimator.

Parameters:

**params: dict
Estimator parameters.

Returns:

Discretizer
Estimator instance.

method transform(X, y=None) [source]

Bin the data into intervals.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
Does nothing. Implemented for continuity of the API.

Returns:

pd.DataFrame
Transformed feature set.

Example

atom

from atom import ATOMClassifier

atom = ATOMClassifier(X, y)
atom.discretize(strategy="custom", bins=[0, 18, 120], labels=["child", "adult"])

stand-alone

from atom.data_cleaning import Discretizer

discretizer = Discretizer(
    strategy="custom",
    bins=[0, 18, 120],
    labels=["child", "adult"],
)
discretizer.fit(X_train)
X = discretizer.transform(X)