Cleaner
class atom.data_cleaning.Cleaner(drop_types=None, strip_categorical=True, drop_duplicates=False, drop_missing_target=True, encode_target=True, device="cpu", engine="sklearn", verbose=0, logger=None)[source]
Applies standard data cleaning steps to a dataset.
Use the parameters to choose which transformations to perform. The available steps are:
- Drop columns with specific data types.
- Strip white spaces from categorical features.
- Drop duplicate rows.
- Drop rows with missing values in the target column.
- Encode the target column.
This class can be accessed from atom through the clean method. Read more in the user guide.
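The steps listed above can be approximated with plain pandas. The sketch below is not ATOM's implementation: the helper `basic_clean` is a hypothetical name chosen for illustration, the order of the steps is an assumption, and the target encoding assumes classes are numbered in sorted order.

```python
import pandas as pd


def basic_clean(X, y, drop_types=None, strip_categorical=True,
                drop_duplicates=False, drop_missing_target=True,
                encode_target=True):
    """Hypothetical stand-in that mimics the five steps with pandas."""
    X = X.copy()
    y = pd.Series(y, index=X.index, name="target")

    # 1. Drop columns with specific data types.
    if drop_types is not None:
        X = X.select_dtypes(exclude=drop_types)

    # 2. Strip leading and trailing white spaces from string values.
    if strip_categorical:
        for col in X.columns:
            if not pd.api.types.is_numeric_dtype(X[col]):
                X[col] = X[col].map(
                    lambda v: v.strip() if isinstance(v, str) else v
                )

    # 3. Drop duplicate rows, keeping only the first occurrence.
    if drop_duplicates:
        X = X.drop_duplicates(keep="first")
        y = y.loc[X.index]

    # 4. Drop rows with missing values in the target column.
    if drop_missing_target:
        y = y.dropna()
        X = X.loc[y.index]

    # 5. Encode the target as integers (assumed: sorted class order).
    mapping = None
    if encode_target:
        mapping = {label: i for i, label in enumerate(sorted(y.unique()))}
        y = y.map(mapping)

    return X, y, mapping
```

Calling `basic_clean(X, y, drop_duplicates=True)` on a small frame returns the cleaned features, the encoded target and the label-to-integer dictionary used for the encoding.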
Parameters | drop_types: str, sequence or None, default=None
Columns with these data types are dropped from the dataset.
strip_categorical: bool, default=True
Whether to strip spaces from the categorical columns.
drop_duplicates: bool, default=False
Whether to drop duplicate rows. Only the first occurrence of
every duplicated row is kept.
drop_missing_target: bool, default=True
Whether to drop rows with missing values in the target column.
This transformation is ignored if y is not provided.
encode_target: bool, default=True
Whether to Label-encode the target column. This transformation
is ignored if y is not provided.
device: str, default="cpu"
Device on which to train the estimators. Use any string
that follows the SYCL_DEVICE_FILTER filter selector,
e.g. device="gpu" to use the GPU. Read more in the
user guide.
engine: str, default="sklearn"
Execution engine to use for the estimators. Refer to the
user guide for an explanation
regarding every choice. Choose from:
verbose: int, default=0
Verbosity level of the class. Choose from:
logger: str, Logger or None, default=None
|
Attributes | missing: list
Values that are considered "missing". Default values are: "",
"?", "None", "NA", "nan", "NaN" and "inf". Note that None,
NaN, +inf and -inf are always considered missing since
they are incompatible with sklearn estimators.
mapping: dict
Target values mapped to their respective encoded integer. Only
available if encode_target=True.
|
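To illustrate how a list like `missing` could be applied, the sketch below replaces the listed default values with NaN using pandas. It is an assumption about the mechanism, not a copy of ATOM's code; `MISSING` and `mark_missing` are hypothetical names.

```python
import math

import numpy as np
import pandas as pd

# Default values listed for the `missing` attribute.
MISSING = ["", "?", "None", "NA", "nan", "NaN", "inf"]


def mark_missing(values, missing=MISSING):
    """Return a Series where every value considered missing is NaN."""
    series = pd.Series(values, dtype="object")
    # String placeholders from the list become NaN.
    series = series.where(~series.isin(missing), np.nan)
    # +inf and -inf are always treated as missing; None and NaN
    # already count as missing in pandas.
    is_inf = series.map(lambda v: isinstance(v, float) and math.isinf(v))
    return series.where(~is_inf, np.nan)


target = mark_missing(["a", "?", "b", None, float("inf"), "NA"])
is_missing = target.isna().tolist()
```

Here `is_missing` flags every entry except `"a"` and `"b"`.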
See Also
Perform encoding of categorical features.
Bin continuous data into intervals.
Scale the data.
Example
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> y = ["a" if i else "b" for i in y]
>>> atom = ATOMClassifier(X, y)
>>> print(atom.y)
0 b
1 b
2 b
3 b
4 a
..
995 b
996 a
997 a
998 b
999 b
Name: target, Length: 1000, dtype: object
>>> atom.clean(verbose=2)
Fitting Cleaner...
Cleaning the data...
--> Label-encoding the target column.
>>> print(atom.y)
0 1
1 1
2 1
3 1
4 0
..
995 1
996 0
997 0
998 1
999 1
Name: target, Length: 1000, dtype: int32
>>> import numpy as np
>>> from atom.data_cleaning import Cleaner
>>> y = ["a" if i else "b" for i in np.random.randint(2, size=100)]
>>> cleaner = Cleaner(verbose=2)
>>> y = cleaner.fit_transform(y=y)
Fitting Cleaner...
Cleaning the data...
--> Label-encoding the target column.
>>> print(y)
0 0
1 0
2 1
3 0
4 0
..
95 1
96 1
97 0
98 0
99 0
Name: target, Length: 100, dtype: int32
Methods
fit | Fit to data. |
fit_transform | Fit to data, then transform it. |
get_params | Get parameters for this estimator. |
inverse_transform | Inversely transform the label encoding. |
log | Print message and save to log file. |
save | Save the instance to a pickle file. |
set_params | Set the parameters of this estimator. |
transform | Apply the data cleaning steps to the data. |
method fit(X=None, y=None)[source]
Fit to data.
method fit_transform(X=None, y=None, **fit_params)[source]
Fit to data, then transform it.
method get_params(deep=True)[source]
Get parameters for this estimator.
Parameters | deep : bool, default=True
If True, will return the parameters for this estimator and
contained subobjects that are estimators.
|
Returns | params : dict
Parameter names mapped to their values.
|
method inverse_transform(X=None, y=None)[source]
Inversely transform the label encoding.
This method only inversely transforms the label encoding.
The rest of the transformations can't be inverted. If
encode_target=False, the data is returned as is.
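Because label encoding is a one-to-one mapping, it can be inverted by swapping keys and values. The following sketch shows the idea with a plain dictionary shaped like the `mapping` attribute. It does not call ATOM, and the values are made up for the example.

```python
# A dictionary shaped like the `mapping` attribute: label -> encoded integer.
mapping = {"a": 0, "b": 1}

# Invert it so that encoded integers point back to the original labels.
inverse = {code: label for label, code in mapping.items()}

encoded = [1, 1, 0, 1]
decoded = [inverse[code] for code in encoded]
print(decoded)  # ['b', 'b', 'a', 'b']
```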
method log(msg, level=0, severity="info")[source]
Print message and save to log file.
method save(filename="auto", save_data=True)[source]
Save the instance to a pickle file.
Parameters | filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.
save_data: bool, default=True
Whether to save the dataset with the instance. This
parameter is ignored if the method is not called from
atom. If False, remember to add the data to ATOMLoader
when loading the file.
|
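Since the instance is stored as a pickle file, the round trip can be pictured with Python's standard `pickle` module. The sketch below uses a `SimpleNamespace` as a hypothetical stand-in for a fitted instance and writes to a temporary directory. The file name is an assumption, and ATOM's own `save` may store more than this.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace

# Hypothetical stand-in for a fitted transformer instance.
cleaner = SimpleNamespace(encode_target=True, mapping={"a": 0, "b": 1})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Cleaner")
    # Write the instance to disk.
    with open(path, "wb") as f:
        pickle.dump(cleaner, f)
    # Read it back into a new object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored.mapping)
```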
method set_params(**params)[source]
Set the parameters of this estimator.
Parameters | **params : dict
Estimator parameters.
|
Returns | self : estimator instance
Estimator instance.
|
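The `get_params` and `set_params` descriptions match the scikit-learn estimator convention. As a sketch of that convention, the example below uses a hypothetical `TinyCleaner` class built on scikit-learn's `BaseEstimator` rather than ATOM's `Cleaner`, so it runs without ATOM installed.

```python
from sklearn.base import BaseEstimator


class TinyCleaner(BaseEstimator):
    """Hypothetical estimator used only to demonstrate the parameter API."""

    def __init__(self, drop_duplicates=False, encode_target=True):
        self.drop_duplicates = drop_duplicates
        self.encode_target = encode_target


cleaner = TinyCleaner()

# get_params returns the constructor parameters as a dictionary.
params = cleaner.get_params()

# set_params returns the estimator itself, so calls can be chained.
same = cleaner.set_params(drop_duplicates=True)
```

After the call, `cleaner.drop_duplicates` is `True`, while `params` still holds the values captured before the update.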
method transform(X=None, y=None)[source]
Apply the data cleaning steps to the data.