Cleaner
class atom.data_cleaning.Cleaner(drop_types=None, strip_categorical=True,
drop_max_cardinality=True, drop_min_cardinality=True, drop_duplicates=False,
drop_missing_target=True, encode_target=True, verbose=0, logger=None)
 
Performs standard data cleaning steps on a dataset. Use the parameters
to choose which transformations to perform. The available steps are:
- Drop columns with specific data types.
 
- Strip whitespace from categorical features.
 
- Drop categorical columns with maximum cardinality.
 
- Drop columns with minimum cardinality.
 
- Drop duplicate rows.
 
- Drop rows with missing values in the target column.
 
- Encode the target column.
 
This class can be accessed from atom through the clean
method. Read more in the user guide.
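To make the feature-side steps above concrete, here is a minimal sketch written with plain pandas. It is not ATOM's implementation: the function name `clean_features` and the sample data are made up for illustration, and the sketch assumes that "categorical" means a column holding only strings.

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def clean_features(X: pd.DataFrame, drop_types=None) -> pd.DataFrame:
    """Feature-side cleaning steps with plain pandas (illustrative only)."""
    X = X.copy()

    # Drop columns with specific data types, e.g. drop_types="datetime64".
    if drop_types is not None:
        X = X.drop(columns=X.select_dtypes(include=drop_types).columns)

    for col in list(X.columns):
        # Assumption: a categorical column is one that holds only strings.
        categorical = is_string_dtype(X[col])
        if categorical:
            # Strip surrounding whitespace from the string values.
            X[col] = X[col].str.strip()

        n_unique = X[col].nunique()
        if categorical and n_unique == len(X):
            # Maximum cardinality: every value is unique (names, IDs, ...).
            X = X.drop(columns=col)
        elif n_unique == 1:
            # Minimum cardinality: all values in the column are the same.
            X = X.drop(columns=col)

    # Keep only the first occurrence of every duplicated row.
    return X.drop_duplicates(keep="first")


X = pd.DataFrame(
    {
        "name": ["ann", "bob", "cat", "dan"],
        "city": [" Paris", "Paris ", "Rome", "Rome"],
        "const": [1, 1, 1, 1],
        "age": [30, 30, 41, 52],
    }
)
cleaned = clean_features(X)
```

In this sample, `name` is dropped for maximum cardinality, `const` for minimum cardinality, and the second row is removed as a duplicate once the city names have been stripped.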
| Parameters: | 
 
drop_types: str, sequence or None, optional (default=None) 
Columns with these types are dropped from the dataset.
 
strip_categorical: bool, optional (default=True) 
Whether to strip the spaces from the categorical columns.
 
drop_max_cardinality: bool, optional (default=True) 
Whether to drop categorical columns with maximum cardinality,
i.e. the number of unique values is equal to the number of
samples. This is usually the case for names, IDs, etc.
 
drop_min_cardinality: bool, optional (default=True) 
Whether to drop columns with minimum cardinality, i.e. all values in the
column are the same.
 
drop_duplicates: bool, optional (default=False) 
Whether to drop duplicate rows. Only the first occurrence of
every duplicated row is kept.
 
drop_missing_target: bool, optional (default=True) 
Whether to drop rows with missing values in the target column.
This parameter is ignored if y is not provided.
 
encode_target: bool, optional (default=True) 
Whether to label-encode the target column. This parameter is ignored
if y is not provided.
 
verbose: int, optional (default=0) 
Verbosity level of the class. Possible values are:
- 0 to not print anything.
 
- 1 to print basic information.
 
- 2 to print detailed information.
 
 
logger: str, Logger or None, optional (default=None) 
- If None: Doesn't save a logging file.
 
- If str: Name of the log file. Use "auto" for automatic naming.
 
- Else: Python logging.Logger instance.
 
 | 
Attributes
| Attributes: | 
 
missing: list 
Values that are considered "missing". Default values are: "", "?",
"None", "NA", "nan", "NaN" and "inf". Note that None,
NaN, +inf and -inf are always
considered missing since they are incompatible with sklearn estimators.
 
mapping: dict 
Target values mapped to their respective encoded integer. Only
available if encode_target=True.
 
 | 
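The sketch below illustrates how the two attributes relate to the target-side steps: rows whose target is considered missing are dropped, and the remaining labels are mapped to integers. It is a plain pandas approximation, not ATOM's code. The names `MISSING` and `clean_target` are hypothetical, and the sorted class order is an assumption borrowed from how scikit-learn's LabelEncoder orders classes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the default values of the `missing` attribute.
MISSING = ["", "?", "None", "NA", "nan", "NaN", "inf"]


def clean_target(X: pd.DataFrame, y: pd.Series, missing=MISSING):
    """Drop rows with a missing target, then label-encode the target."""
    # None and NaN are caught by isna(); the listed strings and +/-inf by isin().
    is_missing = y.isna() | y.isin(list(missing) + [np.inf, -np.inf])
    X, y = X[~is_missing], y[~is_missing]

    # Assumption: classes are encoded in sorted order, starting at 0.
    mapping = {label: code for code, label in enumerate(sorted(y.unique()))}
    return X, y.map(mapping), mapping


X = pd.DataFrame({"age": [30, 41, 52, 63, 74]})
y = pd.Series(["yes", "no", "?", None, "yes"])

X_clean, y_clean, mapping = clean_target(X, y)
```

Here the rows with "?" and None as target are dropped, and `mapping` plays the role of the mapping attribute described above.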
Methods
| fit_transform | Same as transform. |
| get_params | Get parameters for this estimator. |
| log | Write information to the logger and print to stdout. |
| save | Save the instance to a pickle file. |
| set_params | Set the parameters of this estimator. |
| transform | Transform the data. |
method fit_transform(X, y=None)
 
Apply the data cleaning steps to the data.
| Parameters: | 
 
X: dataframe-like 
Feature set with shape=(n_samples, n_features).
 
y: int, str, sequence or None, optional (default=None) 
- If None: y is ignored.
 
- If int: Index of the target column in X.
 
- If str: Name of the target column in X.
 
- Else: Target column with shape=(n_samples,).
 
 
 | 
| Returns: | 
 
pd.DataFrame 
Transformed feature set.
 
pd.Series 
Transformed target column. Only returned if y is provided.
 
 | 
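The accepted forms of y can be illustrated with a small helper that separates the target from the feature set. This is only a sketch of the documented behavior, not the library's code: `split_target` and the fallback name "target" are hypothetical, and for simplicity the helper returns `None` as the second element when y is not given, whereas the method itself returns only the feature set in that case.

```python
import pandas as pd


def split_target(X: pd.DataFrame, y=None):
    """Resolve y (None, int, str or sequence) into a features/target pair."""
    if y is None:
        # y is ignored: nothing to split off.
        return X, None

    if isinstance(y, (int, str)):
        # int: position of the target column in X; str: its name.
        name = X.columns[y] if isinstance(y, int) else y
        return X.drop(columns=name), X[name]

    # Anything else is treated as the target column itself.
    return X, pd.Series(list(y), index=X.index, name="target")


X = pd.DataFrame({"a": [1, 2, 3], "b": [0, 1, 0]})

features_by_name, target_by_name = split_target(X, "b")
features_by_index, target_by_index = split_target(X, -1)
features_same, target_given = split_target(X, [1, 1, 0])
```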
method get_params(deep=True)
 
Get parameters for this estimator.
| Parameters: | 
 
deep: bool, optional (default=True) 
If True, will return the parameters for this estimator and contained
subobjects that are estimators.
 
 | 
| Returns: | 
dict 
Parameter names mapped to their values.
 | 
method log(msg, level=0)
 
Write a message to the logger and print it to stdout.
| Parameters: | 
 
msg: str 
Message to write to the logger and print to stdout.
 
level: int, optional (default=0) 
Minimum verbosity level to print the message.
 
 | 
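The interplay between the verbose parameter and the level argument can be sketched with the standard library. The function below is a hypothetical stand-alone version, not the method itself: it takes verbose and logger as arguments instead of reading them from the instance, and it assumes that the logger receives every message regardless of verbosity.

```python
import io
import logging
from contextlib import redirect_stdout


def log(msg, level=0, *, verbose=0, logger=None):
    """Print msg if verbose >= level and forward it to the logger, if any."""
    if verbose >= level:
        print(msg)
    if logger is not None:
        # Assumption: the log file records messages of every level.
        logger.info(msg)


# Logger writing to an in-memory stream instead of a file.
log_stream = io.StringIO()
logger = logging.getLogger("cleaner_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(log_stream))

stdout = io.StringIO()
with redirect_stdout(stdout):
    # With verbose=1, only the level-1 message reaches stdout.
    log("Dropping 2 duplicate rows.", level=1, verbose=1, logger=logger)
    log("Row indices: [4, 7]", level=2, verbose=1, logger=logger)

printed = stdout.getvalue()
logged = log_stream.getvalue()
```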
method save(filename="auto")
 
Save the instance to a pickle file.
| Parameters: | 
filename: str, optional (default="auto") 
Name of the file. Use "auto" for automatic naming.
 | 
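Saving to a pickle file and loading it back can be sketched as follows. This is an illustration with the standard library only: the stand-alone `save` function, its `directory` argument, and the choice of the class name as the automatic filename are all assumptions, since the naming scheme behind "auto" is not specified here. A `SimpleNamespace` stands in for a Cleaner instance.

```python
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save(instance, filename="auto", directory="."):
    """Pickle the instance and return the path it was written to."""
    if filename == "auto":
        # Assumed naming scheme: use the name of the instance's class.
        filename = type(instance).__name__
    path = Path(directory) / filename
    with open(path, "wb") as file:
        pickle.dump(instance, file)
    return path


# Stand-in object; a real Cleaner instance would be saved the same way.
cleaner = SimpleNamespace(drop_duplicates=True, verbose=1)

with tempfile.TemporaryDirectory() as tmp:
    path = save(cleaner, directory=tmp)
    with open(path, "rb") as file:
        restored = pickle.load(file)
```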
method set_params(**params)
 
Set the parameters of this estimator.
| Parameters: | 
**params: dict 
Estimator parameters.
 | 
| Returns: | 
Cleaner 
Estimator instance.
 | 
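get_params and set_params follow the usual scikit-learn estimator convention: parameters are read as a dict, and set_params returns the estimator itself so that calls can be chained. The class below is a hypothetical stand-in that mimics this convention with three of the parameters listed above; it does not reproduce the real class.

```python
class CleanerStandIn:
    """Toy estimator that mimics the get_params/set_params convention."""

    _param_names = ("strip_categorical", "drop_duplicates", "verbose")

    def __init__(self, strip_categorical=True, drop_duplicates=False, verbose=0):
        self.strip_categorical = strip_categorical
        self.drop_duplicates = drop_duplicates
        self.verbose = verbose

    def get_params(self, deep=True):
        # deep is accepted for compatibility; there are no sub-estimators here.
        return {name: getattr(self, name) for name in self._param_names}

    def set_params(self, **params):
        for name, value in params.items():
            if name not in self._param_names:
                raise ValueError(f"Invalid parameter: {name!r}")
            setattr(self, name, value)
        return self  # returning the instance allows chaining


cleaner = CleanerStandIn()
before = cleaner.get_params()
returned = cleaner.set_params(drop_duplicates=True, verbose=2)
after = cleaner.get_params()
```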
method transform(X, y=None)
 
Apply the data cleaning steps to the data.
| Parameters: | 
 
X: dataframe-like 
Feature set with shape=(n_samples, n_features).
 
y: int, str, sequence or None, optional (default=None) 
- If None: y is ignored.
 
- If int: Index of the target column in X.
 
- If str: Name of the target column in X.
 
- Else: Target column with shape=(n_samples,).
 
 
 | 
| Returns: | 
 
pd.DataFrame 
Transformed feature set.
 
pd.Series 
Transformed target column. Only returned if y is provided.
 
 | 
Example
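# X and y are an existing feature set and target column.
from atom import ATOMClassifier
atom = ATOMClassifier(X, y)
# Keep categorical columns with maximum cardinality.
atom.clean(drop_max_cardinality=False)
 
or
# Use the class directly, outside of atom.
from atom.data_cleaning import Cleaner
cleaner = Cleaner(drop_max_cardinality=False)
X, y = cleaner.transform(X, y)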