Cleaner
Applies standard data cleaning steps on a dataset.
Use the parameters to choose which transformations to perform. The available steps are:
- Convert dtypes to the best possible types.
- Drop columns with specific data types.
- Remove characters from column names.
- Strip spaces from categorical features.
- Drop duplicate rows.
- Drop rows with missing values in the target column.
- Encode the target column.
This class can be accessed from atom through the clean method. Read more in the user guide.
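To make two of the steps above concrete, here is a minimal plain-Python sketch of stripping spaces from categorical values and dropping duplicate rows while keeping the first occurrence. It is not ATOM's implementation: the helper names `strip_categorical` and `drop_duplicate_rows` and the sample rows are hypothetical, chosen only for illustration.

```python
# Illustrative stand-ins for two cleaning steps; not ATOM's own code.
# Rows are plain dicts so the example needs no third-party packages.

def strip_categorical(rows):
    """Strip leading/trailing spaces from every string value."""
    return [
        {col: val.strip() if isinstance(val, str) else val for col, val in row.items()}
        for row in rows
    ]


def drop_duplicate_rows(rows):
    """Keep only the first occurrence of every duplicated row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.items())  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical data: the first two rows differ only by stray spaces.
rows = [
    {"city": " Paris", "size": 1},
    {"city": "Paris ", "size": 1},
    {"city": "Rome", "size": 2},
]
cleaned = drop_duplicate_rows(strip_categorical(rows))
print(cleaned)
```

Stripping before deduplicating matters in this sketch: the two "Paris" rows only become identical once the surrounding spaces are removed.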
Parameters |
convert_dtypes: bool, default=True
Convert the column's data types to the best possible types
that support pd.NA.
drop_dtypes: str, sequence or None, default=None
Columns with these data types are dropped from the dataset.
drop_chars: str or None, default=None
Remove the specified regex pattern from column names, e.g.
[^A-Za-z0-9]+ to remove all non-alphanumerical characters.
strip_categorical: bool, default=True
Whether to strip spaces from categorical columns.
drop_duplicates: bool, default=False
Whether to drop duplicate rows. Only the first occurrence of
every duplicated row is kept.
drop_missing_target: bool, default=True
Whether to drop rows with missing values in the target column.
This transformation is ignored if y is not provided.
encode_target: bool, default=True
Whether to encode the target column(s). This includes
converting categorical columns to numerical, and binarizing
multilabel columns. This transformation is ignored if y
is not provided.
device: str, default="cpu"
Device on which to run the estimators. Use any string that
follows the SYCL_DEVICE_FILTER filter selector, e.g.
device="gpu" to use the GPU. Read more in the
user guide.
engine: str or None, default=None
Execution engine to use for estimators.
If None, the default value is used. Choose from:
verbose: int, default=0
Verbosity level of the class. Choose from:
|
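The effect of the regex given for drop_chars can be previewed with Python's standard `re` module. This is only a sketch of what the pattern matches, not a description of how Cleaner applies it internally; the column names are made up for the example.

```python
import re

# The pattern from the drop_chars description: one or more characters
# that are not ASCII letters or digits.
pattern = r"[^A-Za-z0-9]+"

# Hypothetical column names, used only for illustration.
columns = ["mean radius", "worst-area (mm)", "id#"]

# Remove every match of the pattern from each name.
renamed = [re.sub(pattern, "", name) for name in columns]
print(renamed)  # ['meanradius', 'worstareamm', 'id']
```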
Attributes |
missing_: list
Values that are considered "missing". Default values are: None,
NaN, NA, NaT, +inf, -inf, "", "?", "NA", "nan", "NaN", "NaT",
"none", "None", "inf", "-inf". Note that None, NaN, NA, +inf and
-inf are always considered missing since they are incompatible
with sklearn estimators.
mapping_: dict
Target values mapped to their respective encoded integers. Only
available if encode_target=True.
feature_names_in_: np.ndarray
Names of features seen during fit.
target_names_in_: np.ndarray
Names of the target column(s) seen during fit.
n_features_in_: int
Number of features seen during fit.
|
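As a rough illustration of what the missing_ list describes, the sketch below checks values against the default markers and drops the rows whose target is missing. It assumes that this list decides which target values count as missing, which the text above does not state explicitly. The names `MISSING_STRINGS`, `is_missing` and `drop_missing_target` are hypothetical, and the pandas objects NA and NaT are left out to keep the example free of third-party imports.

```python
import math

# String markers copied from the default list above.
MISSING_STRINGS = {"", "?", "NA", "nan", "NaN", "NaT", "none", "None", "inf", "-inf"}


def is_missing(value):
    """Return True for None, NaN, +/-inf and the default string markers."""
    if value is None:
        return True
    if isinstance(value, float):
        return math.isnan(value) or math.isinf(value)
    return isinstance(value, str) and value in MISSING_STRINGS


def drop_missing_target(X, y):
    """Drop every (row, target) pair whose target is missing."""
    kept = [(row, target) for row, target in zip(X, y) if not is_missing(target)]
    return [row for row, _ in kept], [target for _, target in kept]


# Hypothetical data: the second and third targets are missing.
X = [[1.0], [2.0], [3.0], [4.0]]
y = ["a", "?", None, "b"]
X_clean, y_clean = drop_missing_target(X, y)
print(X_clean, y_clean)  # [[1.0], [4.0]] ['a', 'b']
```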
Example
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> y = ["a" if i else "b" for i in y]
>>> atom = ATOMClassifier(X, y, random_state=1)
>>> print(atom.y)
0 a
1 a
2 a
3 a
4 a
..
564 a
565 a
566 a
567 a
568 b
Name: target, Length: 569, dtype: object
>>> atom.clean(verbose=2)
Fitting Cleaner...
Cleaning the data...
--> Label-encoding column target.
>>> print(atom.y)
0 0
1 0
2 0
3 0
4 0
..
564 0
565 0
566 0
567 0
568 1
Name: target, Length: 569, dtype: Int64
>>> from atom.data_cleaning import Cleaner
>>> from numpy.random import randint
>>> y = ["a" if i else "b" for i in range(randint(100))]
>>> cleaner = Cleaner(verbose=2)
>>> y = cleaner.fit_transform(y=y)
Fitting Cleaner...
Cleaning the data...
--> Label-encoding column target.
>>> print(y)
0 1
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
32 0
33 0
34 0
35 0
36 0
Name: target, dtype: Int64
Methods
fit | Fit to data. |
fit_transform | Fit to data, then transform it. |
get_feature_names_out | Get output feature names for transformation. |
get_params | Get parameters for this estimator. |
inverse_transform | Inversely transform the label encoding. |
set_output | Set output container. |
set_params | Set the parameters of this estimator. |
transform | Apply the data cleaning steps to the data. |
Fit to data.
Fit to data, then transform it.
Get output feature names for transformation.
Parameters |
input_features: sequence or None, default=None
Only used to validate feature names with the names seen in fit.
|
Returns |
np.ndarray
Transformed feature names.
|
Get parameters for this estimator.
Parameters |
deep : bool, default=True
If True, will return the parameters for this estimator and
contained subobjects that are estimators.
|
Returns |
params : dict
Parameter names mapped to their values.
|
Inversely transform the label encoding.
This method only inversely transforms the target encoding.
The rest of the transformations can't be inverted. If
encode_target=False, the data is returned as is.
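The round trip between encoding and inverse_transform can be sketched in plain Python. This is a simplified illustration rather than Cleaner's code: the helper names are hypothetical, and assigning integer codes in sorted label order is an assumption, although it matches the examples above where "a" becomes 0 and "b" becomes 1.

```python
# Hypothetical helpers illustrating a label-encoding round trip.

def fit_mapping(y):
    """Map each distinct label to an integer (sorted order is assumed)."""
    return {label: code for code, label in enumerate(sorted(set(y)))}


def encode(y, mapping):
    """Replace every label with its integer code."""
    return [mapping[label] for label in y]


def decode(codes, mapping):
    """Invert the encoding by swapping keys and values of the mapping."""
    inverse = {code: label for label, code in mapping.items()}
    return [inverse[code] for code in codes]


y = ["b", "a", "a", "b"]
mapping = fit_mapping(y)          # plays the role of the mapping_ attribute
encoded = encode(y, mapping)
restored = decode(encoded, mapping)
print(mapping, encoded, restored)
```

Only the target encoding has such an inverse here; steps like dropping rows or duplicates discard information and cannot be undone, which is consistent with the note above.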
Set output container.
See sklearn's user guide on how to use the set_output
API. See here a description of the choices.
Set the parameters of this estimator.
Parameters |
**params : dict
Estimator parameters.
|
Returns |
self : estimator instance
Estimator instance.
|
Apply the data cleaning steps to the data.