Cleaner

class atom.data_cleaning.Cleaner(drop_types=None, strip_categorical=True, drop_max_cardinality=True, drop_min_cardinality=True, drop_duplicates=False, drop_missing_target=True, encode_target=True, verbose=0, logger=None) [source]

Performs standard data cleaning steps on a dataset. Use the parameters to choose which transformations to perform. The available steps are:

Drop columns with specific data types.
Strip whitespace from categorical features.
Drop categorical columns with maximum cardinality.
Drop columns with minimum cardinality.
Drop duplicate rows.
Drop rows with missing values in the target column.
Encode the target column.

This class can be accessed from atom through the clean method. Read more in the user guide.

Parameters:

drop_types: str, sequence or None, optional (default=None)
Columns with these types are dropped from the dataset.
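
A minimal sketch of this step, assuming the types are matched against each column's pandas dtype name (the page does not say how types are compared). The helper drop_columns_by_type and the sample data are hypothetical.

```python
import pandas as pd


def drop_columns_by_type(X, drop_types=None):
    """Hypothetical helper: drop columns whose dtype name is in drop_types."""
    if drop_types is None:
        return X
    if isinstance(drop_types, str):
        drop_types = [drop_types]  # accept a single type name
    keep = [col for col in X.columns if X[col].dtype.name not in drop_types]
    return X[keep]


X = pd.DataFrame({"age": [31, 45], "is_member": [True, False]})
cleaned = drop_columns_by_type(X, "bool")  # only the "age" column remains
```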

strip_categorical: bool, optional (default=True)
Whether to strip leading and trailing whitespace from the values in categorical columns.
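
As an illustration only, the stripping step could look like the sketch below. It assumes that every non-numeric column counts as categorical and that only string values are touched; neither detail is confirmed by this page.

```python
import pandas as pd

X = pd.DataFrame({"city": [" Paris", "Rome  ", "Oslo"], "age": [31, 45, 28]})

for col in X.columns:
    # Assumption: non-numeric columns are the categorical ones.
    if not pd.api.types.is_numeric_dtype(X[col]):
        # Strip strings only, leaving any other value untouched.
        X[col] = X[col].map(lambda v: v.strip() if isinstance(v, str) else v)
```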

drop_max_cardinality: bool, optional (default=True)
Whether to drop categorical columns with maximum cardinality, i.e. the number of unique values is equal to the number of samples. This is usually the case for names, IDs, etc.

drop_min_cardinality: bool, optional (default=True)
Whether to drop columns with minimum cardinality, i.e. all values in the column are the same.
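
The two cardinality rules can be sketched as follows. This is not the library's implementation: it assumes non-numeric columns are the categorical ones and that missing values are not counted as a unique value.

```python
import pandas as pd

X = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],        # one value per row: maximum cardinality
    "country": ["NL", "NL", "NL", "NL"],  # a single value: minimum cardinality
    "color": ["red", "blue", "red", "blue"],
    "age": [31, 45, 28, 52],              # numeric, so exempt from the maximum rule
})

to_drop = []
for col in X.columns:
    n_unique = X[col].nunique(dropna=True)
    is_categorical = not pd.api.types.is_numeric_dtype(X[col])
    if n_unique == 1:
        to_drop.append(col)  # minimum cardinality: constant column
    elif is_categorical and n_unique == len(X):
        to_drop.append(col)  # maximum cardinality: one value per sample

X = X.drop(columns=to_drop)
```

Note that the numeric age column also has one unique value per row but is kept, because the maximum-cardinality rule applies to categorical columns only.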

drop_duplicates: bool, optional (default=False)
Whether to drop duplicate rows. Only the first occurrence of every duplicated row is kept.
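
In pandas terms, keeping the first occurrence corresponds to the call below. Whether the class resets the index afterwards is not stated here, so this sketch leaves the index untouched.

```python
import pandas as pd

X = pd.DataFrame({"a": [1, 1, 2, 1], "b": ["x", "x", "y", "x"]})

# keep="first" retains the first occurrence and drops the later copies.
deduplicated = X.drop_duplicates(keep="first")
```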

drop_missing_target: bool, optional (default=True)
Whether to drop rows with missing values in the target column. This parameter is ignored if y is not provided.

encode_target: bool, optional (default=True)
Whether to label-encode the target column. This parameter is ignored if y is not provided.
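
The two target-related steps can be sketched together. The example assumes classes are numbered in sorted order, as scikit-learn's LabelEncoder does; the mapping dictionary built here plays the role of the mapping attribute described further down.

```python
import pandas as pd

X = pd.DataFrame({"age": [31, 45, 28, 52]})
y = pd.Series(["yes", "no", None, "yes"], name="target")

# Drop rows where the target is missing, keeping X and y aligned.
mask = y.notna()
X, y = X[mask], y[mask]

# Label-encode: map each class to an integer, in sorted class order (assumed).
mapping = {label: idx for idx, label in enumerate(sorted(y.unique()))}
y = y.map(mapping)
```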

verbose: int, optional (default=0)
Verbosity level of the class. Possible values are:

0 to not print anything.
1 to print basic information.
2 to print detailed information.

logger: str, Logger or None, optional (default=None)

If None: Doesn't save a logging file.
If str: Name of the log file. Use "auto" for automatic naming.
Else: Python logging.Logger instance.
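
The three accepted forms of logger could be resolved along these lines. The helper resolve_logger is hypothetical, and the automatic file name it builds is an assumed scheme, not necessarily the one the library uses.

```python
import logging
from datetime import datetime


def resolve_logger(logger, class_name="Cleaner"):
    """Hypothetical helper mirroring the three accepted forms of `logger`."""
    if logger is None:
        return None  # no log file is written
    if isinstance(logger, logging.Logger):
        return logger  # use the caller's logger as-is
    if logger == "auto":
        # Assumed naming scheme; the real automatic name may differ.
        logger = f"{class_name}_{datetime.now():%Y%m%d_%H%M%S}"
    filename = logger if logger.endswith(".log") else f"{logger}.log"
    log = logging.getLogger(filename)
    log.setLevel(logging.DEBUG)
    # delay=True postpones creating the file until the first message.
    log.addHandler(logging.FileHandler(filename, delay=True))
    return log
```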

Attributes

missing: list
List of values that are considered "missing". Default values are: "", "?", "None", "NA", "nan", "NaN" and "inf". Note that None, NaN, +inf and -inf are always considered missing since they are incompatible with sklearn estimators.
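
To show what treating these values as missing amounts to, the sketch below converts them to NaN with pandas. The sample data are made up, and the actual class may handle the conversion differently.

```python
import numpy as np
import pandas as pd

missing = ["", "?", "None", "NA", "nan", "NaN", "inf"]

X = pd.DataFrame({
    "city": ["Paris", "?", "NA"],
    "income": [52000.0, np.inf, 48000.0],
})

# Give every missing value a single representation (NaN):
X = X.replace(missing, np.nan)             # the listed placeholder strings
X = X.replace([np.inf, -np.inf], np.nan)   # infinities are always missing
```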

mapping: dict
Dictionary of the target values mapped to their respective encoded integer. Only available if encode_target=True.

Methods

fit_transform	Same as transform.
get_params	Get parameters for this estimator.
log	Write information to the logger and print to stdout.
save	Save the instance to a pickle file.
set_params	Set the parameters of this estimator.
transform	Transform the data.

method fit_transform(X, y=None) [source]

Apply the data cleaning steps to the data.

Parameters:

X: dict, list, tuple, np.ndarray or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)

If None: y is ignored.
If int: Index of the target column in X.
If str: Name of the target column in X.
Else: Target column with shape=(n_samples,).

Returns:

X: pd.DataFrame
Transformed feature set.

y: pd.Series
Transformed target column. Only returned if provided.
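
The four accepted forms of y can be illustrated with a small helper. split_target is hypothetical and only shows how each form could be resolved into a feature set and a target column; the name given to a sequence target is also an assumption.

```python
import pandas as pd


def split_target(X, y=None):
    """Hypothetical helper: resolve `y` into a target column."""
    X = pd.DataFrame(X)
    if y is None:
        return X, None  # no target: only the features are processed
    if isinstance(y, (int, str)):
        # An int is a positional index, a str is a column name.
        name = X.columns[y] if isinstance(y, int) else y
        return X.drop(columns=name), X[name]
    # Otherwise y is a sequence of target values, one per sample.
    return X, pd.Series(list(y), index=X.index, name="target")


data = {"age": [31, 45], "bought": [1, 0]}
X1, y1 = split_target(data, "bought")               # by column name
X2, y2 = split_target(data, -1)                     # by column index
X3, y3 = split_target({"age": [31, 45]}, [1, 0])    # as a separate sequence
```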

method get_params(deep=True) [source]

Get parameters for this estimator.

Parameters:

deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params: dict
Dictionary of the parameter names mapped to their values.

method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.
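
How level interacts with the verbose parameter can be sketched with a stand-in class. VerboseLogger is hypothetical; in particular, writing every message to the logger regardless of level is an assumption.

```python
class VerboseLogger:
    """Hypothetical stand-in showing how `verbose` and `level` interact."""

    def __init__(self, verbose=0, logger=None):
        self.verbose = verbose
        self.logger = logger  # a logging.Logger instance or None

    def log(self, msg, level=0):
        if self.verbose >= level:
            print(msg)  # printed only when the verbosity is high enough
        if self.logger is not None:
            self.logger.info(msg)  # assumed: the log file gets every message


demo = VerboseLogger(verbose=1)
demo.log("Dropping 2 columns.", level=1)         # printed
demo.log("Details for each column.", level=2)    # suppressed
```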

method save(filename="auto") [source]

Save the instance to a pickle file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.
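
Saving to a pickle file and loading it back follows the usual standard-library pattern. The helpers below are hypothetical, and using the class name as the automatic file name is an assumption.

```python
import pickle


def save(instance, filename="auto"):
    """Hypothetical helper: pickle `instance` and return the file name used."""
    if filename == "auto":
        filename = type(instance).__name__  # assumed naming: the class name
    with open(filename, "wb") as file:
        pickle.dump(instance, file)
    return filename


def load(filename):
    """Load a pickled instance. Only unpickle files from a trusted source."""
    with open(filename, "rb") as file:
        return pickle.load(file)
```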

method set_params(**params) [source]

Set the parameters of this estimator.

Parameters:

**params: dict
Estimator parameters.

Returns:

self: Cleaner
Estimator instance.
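
get_params and set_params follow the scikit-learn estimator convention, in which the parameters are the constructor's arguments. The sketch below reproduces that convention with hypothetical classes (ParamsMixin, MiniCleaner) rather than the library's own code.

```python
import inspect


class ParamsMixin:
    """Hypothetical mixin following the scikit-learn parameter convention."""

    def get_params(self, deep=True):
        # `deep` is accepted for API parity but unused in this sketch.
        names = list(inspect.signature(type(self).__init__).parameters)[1:]
        return {name: getattr(self, name) for name in names}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}.")
            setattr(self, name, value)
        return self  # returning self allows chaining


class MiniCleaner(ParamsMixin):
    def __init__(self, strip_categorical=True, drop_duplicates=False):
        self.strip_categorical = strip_categorical
        self.drop_duplicates = drop_duplicates


cleaner = MiniCleaner().set_params(drop_duplicates=True)
params = cleaner.get_params()
```

Because set_params returns the instance itself, it can be chained directly after the constructor, as in the last lines of the sketch.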

method transform(X, y=None) [source]

Apply the data cleaning steps to the data.

Parameters:

X: dict, list, tuple, np.ndarray or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)

If None: y is ignored.
If int: Index of the target column in X.
If str: Name of the target column in X.
Else: Target column with shape=(n_samples,).

Returns:

X: pd.DataFrame
Transformed feature set.

y: pd.Series
Transformed target column. Only returned if provided.

Example

from atom import ATOMClassifier

# X and y are the feature set and target column, loaded beforehand.
atom = ATOMClassifier(X, y)
atom.clean(drop_max_cardinality=False)

or

from atom.data_cleaning import Cleaner

cleaner = Cleaner(drop_max_cardinality=False)
X, y = cleaner.transform(X, y)