Cleaner
class atom.data_cleaning.Cleaner(convert_dtypes=True, drop_dtypes=None, drop_chars=None, strip_categorical=True, drop_duplicates=False, drop_missing_target=True, encode_target=True, device="cpu", engine=None, verbose=0)[source]
Applies standard data cleaning steps on a dataset.
Use the parameters to choose which transformations to perform. The available steps are:
- Convert dtypes to the best possible types.
- Drop columns with specific data types.
- Remove characters from column names.
- Strip spaces from categorical features.
- Drop duplicate rows.
- Drop rows with missing values in the target column.
- Encode the target column.
 
This class can be accessed from atom through the clean method. Read more in the user guide.
| Parameters | convert_dtypes: bool, default=True 
Convert the column's data types to the best possible types
that support pd.NA.
 drop_dtypes: str, sequence or None, default=None
Columns with these data types are dropped from the dataset.
 drop_chars: str or None, default=None
Remove the specified regex pattern from column names, e.g.
[^A-Za-z0-9]+ to remove all non-alphanumeric characters.
 strip_categorical: bool, default=True
Whether to strip spaces from categorical columns.
 drop_duplicates: bool, default=False
Whether to drop duplicate rows. Only the first occurrence of
every duplicated row is kept.
 drop_missing_target: bool, default=True
Whether to drop rows with missing values in the target column.
This transformation is ignored if y is not provided.
 encode_target: bool, default=True
Whether to encode the target column(s). This includes
converting categorical columns to numerical, and binarizing
multilabel columns. This transformation is ignored if y
is not provided.
 device: str, default="cpu"
Device on which to run the estimators. Use any string that
follows the SYCL_DEVICE_FILTER filter selector, e.g.
device="gpu" to use the GPU. Read more in the
user guide.
 engine: str or None, default=None
Execution engine to use for estimators.
If None, the default value is used. Choose from:
 
 verbose: int, default=0
Verbosity level of the class. Choose from:
 
  | 
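As a rough illustration of three of the parameters above, the following plain-Python sketch mimics what drop_chars, strip_categorical and drop_duplicates describe, using a small list of dicts instead of a dataframe. It is not the library's implementation: the helper `clean_rows` and the sample rows are hypothetical, and the sketch assumes that "stripping spaces" means removing leading and trailing spaces and that duplicates are detected after stripping.

```python
import re


def clean_rows(rows, drop_chars=None, strip_categorical=True, drop_duplicates=False):
    """Hypothetical stand-in that mimics three of the documented steps."""
    cleaned, seen = [], set()
    for row in rows:
        new_row = {}
        for name, value in row.items():
            # drop_chars: remove every match of the regex from the column name.
            if drop_chars is not None:
                name = re.sub(drop_chars, "", name)
            # strip_categorical: strip surrounding spaces from string values
            # (assumption: only leading and trailing spaces are removed).
            if strip_categorical and isinstance(value, str):
                value = value.strip()
            new_row[name] = value

        # drop_duplicates: keep only the first occurrence of each row.
        key = tuple(new_row.items())
        if drop_duplicates and key in seen:
            continue
        seen.add(key)
        cleaned.append(new_row)
    return cleaned


rows = [
    {"city (name)": " Paris ", "size_m2": 40},
    {"city (name)": "Paris", "size_m2": 40},
    {"city (name)": "Rome", "size_m2": 55},
]

result = clean_rows(rows, drop_chars=r"[^A-Za-z0-9]+", drop_duplicates=True)
print(result)
```

With the pattern [^A-Za-z0-9]+, the underscore in size_m2 is removed as well, since it is not alphanumeric.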
| Attributes | missing_: list 
Values that are considered "missing". Default values are: None,
NaN, NA, NaT, +inf, -inf, "", "?", "NA", "nan", "NaN", "NaT",
"none", "None", "inf", "-inf". Note that None, NaN, NA, +inf and
-inf are always considered missing since they are incompatible
with sklearn estimators.
 mapping_: dict
Target values mapped to their respective encoded integers. Only
available if encode_target=True.
 feature_names_in_: np.ndarray
Names of features seen during fit.
 target_names_in_: np.ndarray
Names of the target column(s) seen during fit.
 n_features_in_: int
Number of features seen during fit.
 | 
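To make the role of missing_ more concrete, here is a minimal sketch of how a list of "missing" markers could drive the drop_missing_target step. It uses only the standard library, so the pandas-specific markers NA and NaT from the list above are left out. The helpers `is_missing` and `drop_missing_target` are hypothetical names for illustration, not part of the class.

```python
import math

# String markers taken from the default values listed under missing_.
MISSING_STRINGS = {"", "?", "NA", "nan", "NaN", "NaT", "none", "None", "inf", "-inf"}


def is_missing(value):
    """Return True if the value counts as missing in this sketch."""
    if value is None:
        return True
    # NaN, +inf and -inf are treated as missing.
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        return True
    return isinstance(value, str) and value in MISSING_STRINGS


def drop_missing_target(X, y):
    """Keep only the rows whose target value is not missing."""
    kept = [(row, target) for row, target in zip(X, y) if not is_missing(target)]
    return [row for row, _ in kept], [target for _, target in kept]


X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = ["a", None, "?", float("nan"), "b"]

X_clean, y_clean = drop_missing_target(X, y)
print(X_clean, y_clean)
```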
See Also
Perform encoding of categorical features.
Bin continuous data into intervals.
Scale the data.
Example
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> y = ["a" if i else "b" for i in y]
>>> atom = ATOMClassifier(X, y, random_state=1)
>>> print(atom.y)
0      a
1      a
2      a
3      a
4      a
      ..
564    a
565    a
566    a
567    a
568    b
Name: target, Length: 569, dtype: object
>>> atom.clean(verbose=2)
Fitting Cleaner...
Cleaning the data...
 --> Label-encoding column target.
>>> print(atom.y)
0      0
1      0
2      0
3      0
4      0
      ..
564    0
565    0
566    0
567    0
568    1
Name: target, Length: 569, dtype: Int64
>>> from atom.data_cleaning import Cleaner
>>> from numpy.random import randint
>>> y = ["a" if i else "b" for i in range(randint(100))]
>>> cleaner = Cleaner(verbose=2)
>>> y = cleaner.fit_transform(y=y)
Fitting Cleaner...
Cleaning the data...
 --> Label-encoding column target.
>>> print(y)
0     1
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
30    0
31    0
32    0
33    0
34    0
35    0
36    0
Name: target, dtype: Int64
Methods
| fit | Fit to data. | 
| fit_transform | Fit to data, then transform it. | 
| get_feature_names_out | Get output feature names for transformation. | 
| get_params | Get parameters for this estimator. | 
| inverse_transform | Inversely transform the label encoding. | 
| set_output | Set output container. | 
| set_params | Set the parameters of this estimator. | 
| transform | Apply the data cleaning steps to the data. | 
method fit(X=None, y=None)[source]
Fit to data.
method fit_transform(X=None, y=None, **fit_params)[source]
Fit to data, then transform it.
method get_feature_names_out(input_features=None)[source]
Get output feature names for transformation.
| Parameters | input_features: sequence or None, default=None 
Only used to validate feature names against the names seen in
 fit.
 | 
| Returns | np.ndarray 
Transformed feature names.
  | 
method get_params(deep=True)[source]
Get parameters for this estimator.
| Parameters | deep : bool, default=True 
If True, will return the parameters for this estimator and
contained subobjects that are estimators.
  | 
| Returns | params : dict 
Parameter names mapped to their values.
  | 
method inverse_transform(X=None, y=None)[source]
Inversely transform the label encoding.
This method only inversely transforms the target encoding.
The rest of the transformations can't be inverted. If
encode_target=False, the data is returned as is.
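The round trip between encoding and inverse transformation can be sketched in plain Python. This is an illustration under assumptions, not the library's code: the helpers `fit_mapping`, `encode` and `decode` are hypothetical, and the sketch assumes labels are numbered in sorted order, which matches the a -> 0, b -> 1 encoding shown in the example above.

```python
def fit_mapping(y):
    """Map each distinct label to an integer, in sorted order (assumption)."""
    return {label: code for code, label in enumerate(sorted(set(y)))}


def encode(y, mapping):
    """Replace every label by its integer code."""
    return [mapping[label] for label in y]


def decode(y_encoded, mapping):
    """Invert the mapping to recover the original labels."""
    inverse = {code: label for label, code in mapping.items()}
    return [inverse[code] for code in y_encoded]


y = ["b", "a", "a", "b", "a"]
mapping = fit_mapping(y)         # plays the role of mapping_
y_encoded = encode(y, mapping)   # what the target encoding produces
y_restored = decode(y_encoded, mapping)  # what inverse_transform undoes

print(mapping, y_encoded, y_restored)
```

Only the label encoding is a one-to-one mapping, which is why it is the only step that can be inverted. Dropped rows or columns cannot be recovered from the cleaned data.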
method set_output(transform=None)[source]
Set output container.
See sklearn's user guide on how to use the
set_output API. See here for a description
of the choices.
method set_params(**params)[source]
Set the parameters of this estimator.
| Parameters | **params : dict 
Estimator parameters.
  | 
| Returns | self : estimator instance 
Estimator instance.
  | 
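The get_params and set_params methods follow the usual scikit-learn estimator convention. The sketch below imitates that convention with a hypothetical stand-in class, `MiniCleaner`, that only stores three of the parameters documented above. It is not the real class and performs no cleaning. It shows that get_params returns the constructor arguments as a dict and that set_params returns the instance itself.

```python
import inspect


class MiniCleaner:
    """Hypothetical stand-in, not the real Cleaner class."""

    def __init__(self, drop_duplicates=False, encode_target=True, verbose=0):
        self.drop_duplicates = drop_duplicates
        self.encode_target = encode_target
        self.verbose = verbose

    def get_params(self, deep=True):
        # Read the parameter names from the constructor signature.
        # `deep` is accepted for symmetry; this class has no sub-estimators.
        names = list(inspect.signature(self.__init__).parameters)
        return {name: getattr(self, name) for name in names}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}.")
            setattr(self, name, value)
        return self  # returning self allows chained calls


cleaner = MiniCleaner().set_params(drop_duplicates=True, verbose=2)
params = cleaner.get_params()
print(params)
```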
method transform(X=None, y=None)[source]
Apply the data cleaning steps to the data.