Cleaner
class atom.data_cleaning.Cleaner(drop_types=None, drop_chars=None, strip_categorical=True, drop_duplicates=False, drop_missing_target=True, encode_target=True, device="cpu", engine="sklearn", verbose=0, logger=None)[source]
Applies standard data cleaning steps to a dataset.
Use the parameters to choose which transformations to perform. The available steps are:
- Drop columns with specific data types.
- Remove characters from column names.
- Strip white spaces from categorical features.
- Drop duplicate rows.
- Drop rows with missing values in the target column.
- Encode the target column.
This class can be accessed from atom through the clean method. Read more in the user guide.
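As a rough illustration of the steps listed above, the following sketch reproduces each one with plain pandas. It is not ATOM's implementation: the toy data, the column names, and the use of `pd.factorize` for the encoding step are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data; names and values are made up for this sketch.
X = pd.DataFrame(
    {
        "age (years)": [25, 32, 32, 41],
        "city": [" Paris", "Rome ", "Rome ", "Oslo"],
        "joined": pd.to_datetime(
            ["2020-01-01", "2021-06-15", "2021-06-15", "2022-03-10"]
        ),
    }
)
y = pd.Series(["yes", "no", "no", None], name="target")

# 1. Drop columns with specific data types (here: datetime columns).
X = X.select_dtypes(exclude=["datetime64"]).copy()

# 2. Remove characters from column names using a regex pattern.
X.columns = X.columns.str.replace(r"[^A-Za-z0-9]+", "", regex=True)

# 3. Strip white spaces from the categorical (string) column.
X["city"] = X["city"].str.strip()

# 4. Drop duplicate rows, keeping only the first occurrence.
X = X.drop_duplicates(keep="first")
y = y.loc[X.index]

# 5. Drop rows with missing values in the target column.
keep = y.notna()
X, y = X.loc[keep], y.loc[keep]

# 6. Encode the target column as integers (sorted label order, an assumption).
codes, classes = pd.factorize(y, sort=True)
mapping = {label: code for code, label in enumerate(classes)}
y = pd.Series(codes, index=y.index, name="target")

print(X)
print(y)
print(mapping)
```

The order of the steps in this sketch follows the list above; whether the class applies them in exactly this order is not stated here.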
Parameters | drop_types: str, sequence or None, default=None
Columns with these data types are dropped from the dataset.
drop_chars: str or None, default=None
Remove the specified regex pattern from column names, e.g.
[^A-Za-z0-9]+ to remove all non-alphanumerical characters.
strip_categorical: bool, default=True
Whether to strip spaces from categorical columns.
drop_duplicates: bool, default=False
Whether to drop duplicate rows. Only the first occurrence of
every duplicated row is kept.
drop_missing_target: bool, default=True
Whether to drop rows with missing values in the target column.
This transformation is ignored if y is not provided.
encode_target: bool, default=True
Whether to encode the target column(s). This includes
converting categorical columns to numerical, and binarizing
multilabel columns. This transformation is ignored if y
is not provided.
device: str, default="cpu"
Device on which to train the estimators. Use any string
that follows the SYCL_DEVICE_FILTER filter selector,
e.g. device="gpu" to use the GPU. Read more in the
user guide.
engine: str, default="sklearn"
Execution engine to use for the estimators. Refer to the
user guide for an explanation
regarding every choice. Choose from:
verbose: int, default=0
Verbosity level of the class. Choose from:
logger: str, Logger or None, default=None
|
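To make the drop_chars example pattern concrete, here is a small sketch using Python's standard `re` module. It only shows what the regex `[^A-Za-z0-9]+` removes from a name; the column names are hypothetical, and the class itself may apply the pattern differently internally.

```python
import re

# Pattern quoted in the drop_chars description: one or more characters
# that are not ASCII letters or digits.
pattern = r"[^A-Za-z0-9]+"

# Hypothetical column names, made up for this illustration.
columns = ["mean radius", "worst-area (mm^2)", "id#", "target"]

# Remove every match of the pattern from each name.
cleaned = [re.sub(pattern, "", name) for name in columns]
print(cleaned)
```

Stripping characters can make two distinct names collide (for example `a-b` and `a b` both become `ab`), so it is worth checking for duplicate column names afterwards.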
Attributes | missing: list
Values that are considered "missing". Default values are: "",
"?", "NA", "nan", "NaN", "none", "None", "inf", "-inf". Note
that None, NaN, +inf and -inf are always considered
missing since they are incompatible with sklearn estimators.
mapping: dict
Target values mapped to their respective encoded integer. Only
available if encode_target=True.
|
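The two attributes can be pictured with a pure-Python sketch. The helper `is_missing` and the sample target are hypothetical, and assigning integers in sorted label order is an assumption that happens to match the example further down (where "a" becomes 0 and "b" becomes 1); this is not the class's own code.

```python
import math

# Default sentinel values listed for the `missing` attribute.
MISSING = ["", "?", "NA", "nan", "NaN", "none", "None", "inf", "-inf"]


def is_missing(value, missing=MISSING):
    """Return True if the value should be treated as missing (hypothetical helper)."""
    if value is None:
        return True
    # NaN, +inf and -inf are always considered missing.
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        return True
    return value in missing


# Hypothetical target column containing several kinds of missing values.
target = ["b", "?", "a", None, "b", float("nan")]

# Keep only the rows with a usable target value.
kept = [v for v in target if not is_missing(v)]

# Build a mapping from each label to an integer (sorted order is an assumption).
mapping = {label: code for code, label in enumerate(sorted(set(kept)))}
encoded = [mapping[v] for v in kept]

print(kept)
print(mapping)
print(encoded)
```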
See Also
Perform encoding of categorical features.
Bin continuous data into intervals.
Scale the data.
Example
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> y = ["a" if i else "b" for i in y]
>>> atom = ATOMClassifier(X, y)
>>> print(atom.y)
0 b
1 b
2 b
3 b
4 a
..
995 b
996 a
997 a
998 b
999 b
Name: target, Length: 1000, dtype: object
>>> atom.clean(verbose=2)
Fitting Cleaner...
Cleaning the data...
--> Label-encoding the target column.
>>> print(atom.y)
0 1
1 1
2 1
3 1
4 0
..
995 1
996 0
997 0
998 1
999 1
Name: target, Length: 1000, dtype: int32
>>> import numpy as np
>>> from atom.data_cleaning import Cleaner
>>> y = ["a" if i else "b" for i in np.random.randint(2, size=100)]
>>> cleaner = Cleaner(verbose=2)
>>> y = cleaner.fit_transform(y=y)
Fitting Cleaner...
Cleaning the data...
--> Label-encoding the target column.
>>> print(y)
0 0
1 0
2 1
3 0
4 0
..
95 1
96 1
97 0
98 0
99 0
Name: target, Length: 100, dtype: int32
Methods
fit | Fit to data. |
fit_transform | Fit to data, then transform it. |
get_params | Get parameters for this estimator. |
inverse_transform | Inversely transform the label encoding. |
log | Print message and save to log file. |
save | Save the instance to a pickle file. |
set_params | Set the parameters of this estimator. |
transform | Apply the data cleaning steps to the data. |
method fit(X=None, y=None)[source]
Fit to data.
method fit_transform(X=None, y=None, **fit_params)[source]
Fit to data, then transform it.
method get_params(deep=True)[source]
Get parameters for this estimator.
Parameters | deep : bool, default=True
If True, will return the parameters for this estimator and
contained subobjects that are estimators.
|
Returns | params : dict
Parameter names mapped to their values.
|
method inverse_transform(X=None, y=None)[source]
Inversely transform the label encoding.
This method only inversely transforms the target encoding.
The rest of the transformations can't be inverted. If
encode_target=False, the data is returned as is.
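Conceptually, inverting a label encoding amounts to reversing the label-to-integer mapping. The sketch below shows that idea with a plain dictionary; the mapping values are taken from the example above ("a" to 0, "b" to 1), and the variable names are hypothetical. It does not call the class itself.

```python
# Mapping of target labels to encoded integers, as in the example above.
mapping = {"a": 0, "b": 1}

# Reverse the mapping: encoded integer -> original label.
inverse = {code: label for label, code in mapping.items()}

# Hypothetical encoded target values.
encoded = [1, 1, 0, 1]

# Look up the original label for every encoded value.
decoded = [inverse[code] for code in encoded]
print(decoded)
```

Steps such as dropping duplicate rows discard information, which is why only the encoding can be undone.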
method log(msg, level=0, severity="info")[source]
Print message and save to log file.
method save(filename="auto", save_data=True)[source]
Save the instance to a pickle file.
Parameters | filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.
save_data: bool, default=True
Whether to save the dataset with the instance. This parameter
is ignored if the method is not called from atom. If False,
the data must be provided to the load method when the instance is loaded.
|
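Since save writes a pickle file, the general round trip looks like the sketch below, which uses only Python's standard `pickle` module. The `state` dictionary stands in for a fitted instance and the file name is hypothetical; how the class names the file with "auto" is not shown here.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the state of a fitted instance.
state = {"encode_target": True, "mapping": {"a": 0, "b": 1}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "Cleaner.pkl"  # hypothetical file name

    # Serialize the object to disk.
    with open(path, "wb") as f:
        pickle.dump(state, f)

    # Read it back into a new, equal object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.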
method set_params(**params)[source]
Set the parameters of this estimator.
Parameters | **params : dict
Estimator parameters.
|
Returns | self : estimator instance
Estimator instance.
|
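The get_params and set_params methods follow the scikit-learn estimator convention. The sketch below mimics that convention with a hypothetical `MiniCleaner` class built on the standard library only; it is a simplified stand-in, not the actual class, and it ignores nested estimators.

```python
import inspect


class MiniCleaner:
    """Hypothetical stand-in that mimics the scikit-learn parameter convention."""

    def __init__(self, drop_duplicates=False, encode_target=True, verbose=0):
        self.drop_duplicates = drop_duplicates
        self.encode_target = encode_target
        self.verbose = verbose

    def get_params(self, deep=True):
        # Parameter names come from the constructor signature. `deep` is
        # accepted for compatibility but unused: there are no sub-estimators.
        names = list(inspect.signature(self.__init__).parameters)
        return {name: getattr(self, name) for name in names}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self  # returning the instance allows call chaining


cleaner = MiniCleaner().set_params(drop_duplicates=True, verbose=2)
params = cleaner.get_params()
print(params)
```

Because set_params returns the instance, it can be chained directly after construction, as in the last lines of the sketch.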
method transform(X=None, y=None)[source]
Apply the data cleaning steps to the data.