
Cleaner


class atom.data_cleaning.Cleaner(drop_types=None, drop_chars=None, strip_categorical=True, drop_duplicates=False, drop_missing_target=True, encode_target=True, device="cpu", engine="sklearn", verbose=0, logger=None)
Applies standard data cleaning steps on a dataset.

Use the parameters to choose which transformations to perform. The available steps are:

  • Drop columns with specific data types.
  • Remove characters from column names.
  • Strip whitespace from categorical features.
  • Drop duplicate rows.
  • Drop rows with missing values in the target column.
  • Encode the target column.

This class can be accessed from atom through the clean method. Read more in the user guide.

Parameters

drop_types: str, sequence or None, default=None
Columns with these data types are dropped from the dataset.

drop_chars: str or None, default=None
Remove the specified regex pattern from column names, e.g. [^A-Za-z0-9]+ to remove all non-alphanumerical characters.
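As an illustration of what the pattern in the example does, here is a minimal sketch using only Python's re module. The column names are hypothetical, and the sketch assumes every match of the pattern is simply deleted from the name; it is not the library's own implementation.

```python
import re

# Hypothetical column names; only the regex itself comes from the text above.
columns = ["mean radius (mm)", "worst-area", "target"]
pattern = r"[^A-Za-z0-9]+"

# Delete every match of the pattern from each column name.
cleaned = [re.sub(pattern, "", name) for name in columns]
print(cleaned)  # ['meanradiusmm', 'worstarea', 'target']
```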

strip_categorical: bool, default=True
Whether to strip spaces from categorical columns.

drop_duplicates: bool, default=False
Whether to drop duplicate rows. Only the first occurrence of every duplicated row is kept.
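The "first occurrence is kept" rule can be sketched in plain Python as follows. The rows and the helper name drop_duplicate_rows are hypothetical; with pandas, DataFrame.drop_duplicates(keep="first") expresses the same rule.

```python
# Hypothetical rows, represented as tuples so they can be stored in a set.
rows = [("a", 1), ("b", 2), ("a", 1), ("c", 3), ("b", 2)]


def drop_duplicate_rows(rows):
    """Keep only the first occurrence of every row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


deduplicated = drop_duplicate_rows(rows)
print(deduplicated)  # [('a', 1), ('b', 2), ('c', 3)]
```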

drop_missing_target: bool, default=True
Whether to drop rows with missing values in the target column. This transformation is ignored if y is not provided.

encode_target: bool, default=True
Whether to encode the target column(s). This includes converting categorical columns to numerical, and binarizing multilabel columns. This transformation is ignored if y is not provided.

device: str, default="cpu"
Device on which to train the estimators. Use any string that follows the SYCL_DEVICE_FILTER filter selector, e.g. device="gpu" to use the GPU. Read more in the user guide.

engine: str, default="sklearn"
Execution engine to use for the estimators. Refer to the user guide for an explanation regarding every choice. Choose from:

  • "sklearn" (only if device="cpu")
  • "cuml" (only if device="gpu")

verbose: int, default=0
Verbosity level of the class. Choose from:

  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.

logger: str, Logger or None, default=None

  • If None: Doesn't save a logging file.
  • If str: Name of the log file. Use "auto" for automatic naming.
  • Else: Python logging.Logger instance.
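For the last option, a logging.Logger instance can be built with the standard library, as in the sketch below. The logger name and the message format are arbitrary choices for illustration; a logging.FileHandler could replace the stream handler to write to disk.

```python
import logging

# Hypothetical logger name; any name works.
logger = logging.getLogger("cleaner_example")
logger.setLevel(logging.INFO)

# Send records to the console with a simple format.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s: %(message)s"))
logger.addHandler(handler)

logger.info("Logger ready")
# The instance could then be passed to the class as Cleaner(logger=logger).
```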

Attributes

missing: list
Values that are considered "missing". Default values are: "", "?", "NA", "nan", "NaN", "none", "None", "inf", "-inf". Note that None, NaN, +inf and -inf are always considered missing since they are incompatible with sklearn estimators.
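The rule above can be sketched as a small predicate. The helper name is_missing is hypothetical and the check is an approximation of the described behavior, not the library's code: None, NaN and infinite floats always count as missing, and everything else is compared against the list.

```python
import math

# Default values listed in the attribute description.
DEFAULT_MISSING = ["", "?", "NA", "nan", "NaN", "none", "None", "inf", "-inf"]


def is_missing(value, missing=DEFAULT_MISSING):
    """Return True if value should be treated as missing."""
    if value is None:
        return True
    # NaN and +/-inf are always missing, regardless of the list.
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        return True
    return value in missing


print([is_missing(v) for v in ["?", "ok", None, float("nan"), 0.5]])
# [True, False, True, True, False]
```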

mapping: dict
Target values mapped to their respective encoded integer. Only available if encode_target=True.
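A minimal sketch of such a mapping, assuming the classes are numbered in sorted order (the convention sklearn's LabelEncoder uses, and consistent with the example further down where "a" becomes 0 and "b" becomes 1). The target values are made up.

```python
# Hypothetical target column.
y = ["b", "b", "a", "b", "a"]

# Assumption: classes are numbered in sorted order.
mapping = {label: code for code, label in enumerate(sorted(set(y)))}
encoded = [mapping[label] for label in y]

print(mapping)  # {'a': 0, 'b': 1}
print(encoded)  # [1, 1, 0, 1, 0]
```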


See Also

Encoder

Perform encoding of categorical features.

Discretizer

Bin continuous data into intervals.

Scaler

Scale the data.


Example

>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> y = ["a" if i else "b" for i in y]

>>> atom = ATOMClassifier(X, y)
>>> print(atom.y)

0      b
1      b
2      b
3      b
4      a
      ..
995    b
996    a
997    a
998    b
999    b

Name: target, Length: 1000, dtype: object

>>> atom.clean(verbose=2)

Fitting Cleaner...
Cleaning the data...
 --> Label-encoding the target column.

>>> print(atom.y)

0      1
1      1
2      1
3      1
4      0
      ..
995    1
996    0
997    0
998    1
999    1

Name: target, Length: 1000, dtype: int32

>>> import numpy as np
>>> from atom.data_cleaning import Cleaner

>>> y = ["a" if i else "b" for i in np.random.randint(2, size=100)]

>>> cleaner = Cleaner(verbose=2)
>>> y = cleaner.fit_transform(y=y)

Fitting Cleaner...
Cleaning the data...
 --> Label-encoding the target column.

>>> print(y)

0     0
1     0
2     1
3     0
4     0
     ..
95    1
96    1
97    0
98    0
99    0

Name: target, Length: 100, dtype: int32


Methods

  • fit: Fit to data.
  • fit_transform: Fit to data, then transform it.
  • get_params: Get parameters for this estimator.
  • inverse_transform: Inversely transform the label encoding.
  • log: Print message and save to log file.
  • save: Save the instance to a pickle file.
  • set_params: Set the parameters of this estimator.
  • transform: Apply the data cleaning steps to the data.


method fit(X=None, y=None)
Fit to data.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: int, str, dict, sequence, dataframe-like or None, default=None
Target column corresponding to X.

  • If None: y is ignored.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • If sequence: Target array with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
  • If dataframe: Target columns for multioutput tasks.
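The int and str forms of y listed above can be sketched as a lookup against the columns of X. The helper resolve_target and the column names are hypothetical, and the sketch covers only these two forms plus None; it illustrates the rule rather than reproducing the library's logic.

```python
def resolve_target(columns, y):
    """Return the name of the target column selected by y, or None."""
    if y is None:
        return None  # y is ignored
    if isinstance(y, int):
        return columns[y]  # position of the target column in X
    if isinstance(y, str):
        if y not in columns:
            raise ValueError(f"Column {y!r} not found in X.")
        return y  # name of the target column in X
    raise TypeError("This sketch only handles None, int and str.")


# Hypothetical column names of X.
columns = ["age", "income", "label"]
print(resolve_target(columns, 2))        # label
print(resolve_target(columns, "label"))  # label
```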

Returns

Cleaner
Estimator instance.



method fit_transform(X=None, y=None, **fit_params)
Fit to data, then transform it.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: int, str, sequence, dataframe-like or None, default=None
Target column corresponding to X.

  • If None: y is ignored.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
  • If dataframe-like: Target columns with shape=(n_samples, n_targets) for multioutput tasks.

**fit_params
Additional keyword arguments for the fit method.

Returns

dataframe
Transformed feature set. Only returned if X is provided.

series
Transformed target column. Only returned if y is provided.



method get_params(deep=True)
Get parameters for this estimator.

Parameters

deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict
Parameter names mapped to their values.



method inverse_transform(X=None, y=None)
Inversely transform the label encoding.

This method only inversely transforms the target encoding. The rest of the transformations can't be inverted. If encode_target=False, the data is returned as is.
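Inverting a label encoding amounts to reversing the mapping from target values to integers. The sketch below assumes a mapping of the form stored in the mapping attribute; the values are made up.

```python
# Hypothetical mapping of target values to encoded integers.
mapping = {"a": 0, "b": 1}

# Reverse it: encoded integer -> original target value.
inverse = {code: label for label, code in mapping.items()}

encoded = [1, 1, 0, 1]
decoded = [inverse[code] for code in encoded]
print(decoded)  # ['b', 'b', 'a', 'b']
```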

Parameters

X: dataframe-like or None, default=None
Does nothing. Implemented for continuity of the API.

y: int, str, dict, sequence, dataframe-like or None, default=None
Target column corresponding to X.

  • If None: y is ignored.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • If sequence: Target array with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
  • If dataframe: Target columns for multioutput tasks.

Returns

dataframe
Unchanged feature set. Only returned if X is provided.

series
Original target column. Only returned if y is provided.



method log(msg, level=0, severity="info")
Print message and save to log file.

Parameters

msg: int, float or str
Message to save to the logger and print to stdout.

level: int, default=0
Minimum verbosity level to print the message.

severity: str, default="info"
Severity level of the message. Choose from: debug, info, warning, error, critical.
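How the three parameters could interact is sketched below. This is an assumption-laden illustration, not the method's source: it supposes the message is printed when the instance's verbosity is at least level, and that severity selects the matching method on a standard logging.Logger. The verbose and logger arguments stand in for the instance's attributes.

```python
import logging

SEVERITIES = ("debug", "info", "warning", "error", "critical")


def log(msg, level=0, severity="info", verbose=0, logger=None):
    """Print msg if verbose >= level and forward it to the logger, if any."""
    if severity not in SEVERITIES:
        raise ValueError(f"severity must be one of {SEVERITIES}")
    printed = verbose >= level  # assumption: level is a minimum verbosity
    if printed:
        print(msg)
    if logger is not None:
        # e.g. severity="warning" calls logger.warning(...)
        getattr(logger, severity)(str(msg))
    return printed


log("Cleaning the data...", level=1, verbose=2)  # printed
log("Detailed message", level=2, verbose=1)      # not printed
```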



method save(filename="auto", save_data=True)
Save the instance to a pickle file.

Parameters

filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.

save_data: bool, default=True
Whether to save the dataset with the instance. This parameter is ignored if the method is not called from atom. If False, the data must be passed to the load method when the instance is loaded.



method set_params(**params)
Set the parameters of this estimator.

Parameters

**params: dict
Estimator parameters.

Returns

self: estimator instance
Estimator instance.



method transform(X=None, y=None)
Apply the data cleaning steps to the data.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: int, str, dict, sequence, dataframe-like or None, default=None
Target column corresponding to X.

  • If None: y is ignored.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • If sequence: Target array with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
  • If dataframe: Target columns for multioutput tasks.

Returns

dataframe
Transformed feature set. Only returned if X is provided.

series
Transformed target column. Only returned if y is provided.