
Cleaner


class atom.data_cleaning.Cleaner(drop_types=None, strip_categorical=True, drop_max_cardinality=True, drop_min_cardinality=True, drop_duplicates=False, drop_missing_target=True, encode_target=True, verbose=0, logger=None) [source]

Performs standard data cleaning steps on a dataset. Use the parameters to choose which transformations to perform. The available steps are:

  • Drop columns with specific data types.
  • Strip whitespace from categorical features.
  • Drop categorical columns with maximum cardinality.
  • Drop columns with minimum cardinality.
  • Drop duplicate rows.
  • Drop rows with missing values in the target column.
  • Encode the target column.

This class can be accessed from atom through the clean method. Read more in the user guide.
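
The feature-side steps listed above can be approximated with plain pandas. The sketch below is not ATOM's implementation: the helper name `clean_features` and the sample data are made up for illustration, and it assumes that "categorical" means a string-valued column.

```python
import pandas as pd


def clean_features(df, drop_types=None, drop_duplicates=False):
    """Hypothetical helper that mimics the feature-side cleaning steps."""
    # Drop columns with specific data types.
    if drop_types is not None:
        df = df.select_dtypes(exclude=drop_types)
    df = df.copy()  # work on a copy so the caller's frame is untouched

    # Assumption: a column is "categorical" when it holds strings.
    categorical = [c for c in df.columns if pd.api.types.is_string_dtype(df[c])]

    # Strip surrounding whitespace from the categorical values.
    for col in categorical:
        df[col] = df[col].str.strip()

    # Maximum cardinality: categorical columns with one unique value per row.
    df = df.drop(columns=[c for c in categorical if df[c].nunique() == len(df)])

    # Minimum cardinality: any column that holds a single value.
    df = df.drop(columns=[c for c in df.columns if df[c].nunique() == 1])

    # Keep only the first occurrence of every duplicated row.
    if drop_duplicates:
        df = df.drop_duplicates(keep="first")

    return df


raw = pd.DataFrame(
    {
        "id": ["a1", "b2", "c3", "d4"],        # maximum cardinality
        "country": ["NL", "NL", "NL", "NL"],   # minimum cardinality
        "color": [" red", "blue ", " red", "blue "],
        "age": [31, 45, 31, 45],
        "seen": pd.to_datetime(["2021-01-01"] * 4),
    }
)

cleaned = clean_features(raw, drop_types=["datetime64"], drop_duplicates=True)
print(cleaned)
```

Note that the numeric `age` column is never tested for maximum cardinality; only categorical columns are, which matches the description of `drop_max_cardinality`.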

Parameters:

drop_types: str, sequence or None, optional (default=None)
Columns with these types are dropped from the dataset.

strip_categorical: bool, optional (default=True)
Whether to strip the spaces from the categorical columns.

drop_max_cardinality: bool, optional (default=True)
Whether to drop categorical columns with maximum cardinality, i.e. the number of unique values is equal to the number of samples. This is usually the case for names, IDs, etc.

drop_min_cardinality: bool, optional (default=True)
Whether to drop columns with minimum cardinality, i.e. all values in the column are the same.

drop_duplicates: bool, optional (default=False)
Whether to drop duplicate rows. Only the first occurrence of every duplicated row is kept.

drop_missing_target: bool, optional (default=True)
Whether to drop rows with missing values in the target column. This parameter is ignored if y is not provided.

encode_target: bool, optional (default=True)
Whether to label-encode the target column. This parameter is ignored if y is not provided.

verbose: int, optional (default=0)
Verbosity level of the class. Possible values are:
  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.

logger: str, Logger or None, optional (default=None)
  • If None: Doesn't save a logging file.
  • If str: Name of the log file. Use "auto" for automatic naming.
  • Else: Python logging.Logger instance.
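
To make the two target-related parameters, `drop_missing_target` and `encode_target`, more concrete, here is a minimal pandas sketch. It is an illustration under assumptions, not the library's code: the sample data is invented, and the classes are assumed to be numbered in sorted order, as scikit-learn's `LabelEncoder` does.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"age": [31, 45, 27, 52]})
y = pd.Series(["yes", np.nan, "no", "yes"], name="target")

# drop_missing_target: keep only the rows where the target is known.
keep = y.notna()
X, y = X[keep], y[keep]

# encode_target: map every class to an integer.
# Assumption: classes are numbered in sorted order.
mapping = {label: code for code, label in enumerate(sorted(y.unique()))}
y = y.map(mapping)

print(mapping)
print(y.tolist())
```

The `mapping` dictionary built here plays the same role as the `mapping` attribute of the class: it records which integer each original target value was encoded to.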


Attributes

Attributes:

missing: list
List of values that are considered "missing". Default values are: "", "?", "None", "NA", "nan", "NaN" and "inf". Note that None, NaN, +inf and -inf are always considered missing since they are incompatible with sklearn estimators.

mapping: dict
Dictionary of the target values mapped to their respective encoded integer. Only available if encode_target=True.
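
One way to treat the values in `missing` uniformly is to convert them to NaN before any further processing. The sketch below shows that idea with pandas on invented data; whether Cleaner performs the conversion in exactly this way is an assumption.

```python
import numpy as np
import pandas as pd

# Default markers taken from the description of the `missing` attribute.
missing = ["", "?", "None", "NA", "nan", "NaN", "inf"]

df = pd.DataFrame(
    {
        "city": ["Paris", "?", "NA", "Rome"],
        "temp": [21.5, np.inf, 18.0, -np.inf],
    }
)

# Replace the string markers first, then the infinities, with NaN.
cleaned = df.replace(missing, np.nan).replace([np.inf, -np.inf], np.nan)

print(cleaned.isna())
```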


Methods

fit_transform   Same as transform.
get_params      Get parameters for this estimator.
log             Write information to the logger and print to stdout.
save            Save the instance to a pickle file.
set_params      Set the parameters of this estimator.
transform       Transform the data.


method fit_transform(X, y=None) [source]

Apply the data cleaning steps to the data.

Parameters:

X: dict, list, tuple, np.array, sps.matrix or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
  • If None: y is ignored.
  • If int: Index of the target column in X.
  • If str: Name of the target column in X.
  • Else: Target column with shape=(n_samples,).

Returns:

X: pd.DataFrame
Transformed feature set.

y: pd.Series
Transformed target column. Only returned if provided.
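
The four accepted forms of `y` can be illustrated with a small helper. This is a sketch only: `split_target` is a hypothetical function, not part of ATOM, and the fallback series name `"target"` is an arbitrary choice.

```python
import pandas as pd


def split_target(X, y=None):
    """Hypothetical helper that resolves y the way the parameter list describes."""
    X = pd.DataFrame(X)

    # None: there is no target, only features.
    if y is None:
        return X, None

    # int or str: y points at a column inside X.
    if isinstance(y, (int, str)):
        name = X.columns[y] if isinstance(y, int) else y
        return X.drop(columns=name), X[name]

    # Anything else: y is the target column itself.
    return X, pd.Series(list(y), index=X.index, name="target")


data = {"a": [1, 2], "b": [3, 4], "label": [0, 1]}

features, target = split_target(data, "label")   # by name
features, target = split_target(data, -1)        # by position
features, target = split_target(data, [5, 6])    # as a separate sequence
```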


method get_params(deep=True) [source]

Get parameters for this estimator.

Parameters:

deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params: dict
Dictionary of the parameter names mapped to their values.


method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.
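
The interplay between `level` and the instance's `verbose` setting can be sketched as follows. The function is a hypothetical stand-in for the method, with `verbose` and `logger` passed in explicitly; that the logger receives every message regardless of verbosity is an assumption.

```python
import logging


def log(msg, level=0, verbose=0, logger=None):
    """Hypothetical stand-in for Cleaner.log."""
    # Print only when the verbosity is at least the level the message requires.
    if verbose >= level:
        print(msg)

    # Assumption: the logger receives every message, whatever the verbosity.
    if logger is not None:
        logger.info(msg)


log("Dropping column 'id'", level=2, verbose=1)  # not printed: 1 < 2
log("Cleaning finished", level=1, verbose=1)     # printed: 1 >= 1
```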


method save(filename="auto") [source]

Save the instance to a pickle file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.
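
Saving to a pickle file and loading the instance back can be sketched with the standard library. The `save` function below is a hypothetical stand-in, a `SimpleNamespace` takes the place of a real Cleaner, and the rule that `"auto"` resolves to the class name is an assumption.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace


def save(instance, filename="auto"):
    """Hypothetical stand-in for Cleaner.save."""
    # Assumption: "auto" resolves to the class name of the instance.
    if filename == "auto":
        filename = type(instance).__name__
    with open(filename, "wb") as f:
        pickle.dump(instance, f)
    return filename


# A simple picklable object stands in for a Cleaner instance.
cleaner = SimpleNamespace(mapping={"no": 0, "yes": 1})

with tempfile.TemporaryDirectory() as tmp:
    path = save(cleaner, os.path.join(tmp, "cleaner"))
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored.mapping)
```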


method set_params(**params) [source]

Set the parameters of this estimator.

Parameters:

**params: dict
Estimator parameters.

Returns:

self: Cleaner
Estimator instance.
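
The descriptions of `get_params` and `set_params` match scikit-learn's estimator API. Assuming Cleaner follows that convention, the behavior can be tried on a toy estimator; `ToyCleaner` is a hypothetical class that borrows two of Cleaner's parameter names.

```python
from sklearn.base import BaseEstimator


class ToyCleaner(BaseEstimator):
    """Hypothetical estimator with two of Cleaner's parameters."""

    def __init__(self, drop_duplicates=False, encode_target=True):
        self.drop_duplicates = drop_duplicates
        self.encode_target = encode_target


toy = ToyCleaner()

# get_params returns the constructor arguments as a dictionary.
print(toy.get_params())

# set_params returns the instance itself, so calls can be chained.
same = toy.set_params(drop_duplicates=True)
print(same is toy, toy.drop_duplicates)
```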


method transform(X, y=None) [source]

Apply the data cleaning steps to the data.

Parameters:

X: dict, list, tuple, np.array, sps.matrix or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
  • If None: y is ignored.
  • If int: Index of the target column in X.
  • If str: Name of the target column in X.
  • Else: Target column with shape=(n_samples,).

Returns:

X: pd.DataFrame
Transformed feature set.

y: pd.Series
Transformed target column. Only returned if provided.


Example

from atom import ATOMClassifier

# X and y are your own feature set and target column
atom = ATOMClassifier(X, y)
atom.clean(drop_max_cardinality=False)

or

from atom.data_cleaning import Cleaner

cleaner = Cleaner(drop_max_cardinality=False)
X, y = cleaner.transform(X, y)
