
Cleaner


class atom.data_cleaning.Cleaner(convert_dtypes=True, drop_dtypes=None, drop_chars=None, strip_categorical=True, drop_duplicates=False, drop_missing_target=True, encode_target=True, device="cpu", engine=None, verbose=0)[source]

Applies standard data cleaning steps on a dataset.

Use the parameters to choose which transformations to perform. The available steps are:

  • Convert dtypes to the best possible types.
  • Drop columns with specific data types.
  • Remove characters from column names.
  • Strip spaces from categorical features.
  • Drop duplicate rows.
  • Drop rows with missing values in the target column.
  • Encode the target column.

This class can be accessed from atom through the clean method. Read more in the user guide.
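Two of the steps listed above, stripping spaces and dropping duplicate rows, can be pictured with a plain-Python sketch. This is only an illustration on tuples, not the class's actual implementation; the helper name `strip_and_deduplicate`, the sample rows and the order of the two steps are assumptions.

```python
def strip_and_deduplicate(rows):
    """Strip spaces from string cells, then keep the first copy of each row."""
    seen = set()
    cleaned = []
    for row in rows:
        # Strip surrounding spaces from text cells only; numbers are untouched.
        stripped = tuple(v.strip() if isinstance(v, str) else v for v in row)
        if stripped not in seen:  # later duplicates are skipped
            seen.add(stripped)
            cleaned.append(stripped)
    return cleaned


# Hypothetical rows: (categorical feature, numerical feature).
rows = [(" red", 1.0), ("red ", 1.0), ("blue", 2.0), ("blue", 2.0)]
print(strip_and_deduplicate(rows))
# [('red', 1.0), ('blue', 2.0)]
```

In this sketch, rows that differ only by surrounding spaces become duplicates once stripped, which is why the order of the steps matters.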

Parameters

convert_dtypes: bool, default=True
Convert the columns' data types to the best possible types that support pd.NA.

drop_dtypes: str, sequence or None, default=None
Columns with these data types are dropped from the dataset.

drop_chars: str or None, default=None
Remove the specified regex pattern from column names, e.g. [^A-Za-z0-9]+ to remove all non-alphanumerical characters.

strip_categorical: bool, default=True
Whether to strip spaces from categorical columns.

drop_duplicates: bool, default=False
Whether to drop duplicate rows. Only the first occurrence of every duplicated row is kept.

drop_missing_target: bool, default=True
Whether to drop rows with missing values in the target column. This transformation is ignored if y is not provided.

encode_target: bool, default=True
Whether to encode the target column(s). This includes converting categorical columns to numerical, and binarizing multilabel columns. This transformation is ignored if y is not provided.

device: str, default="cpu"
Device on which to run the estimators. Use any string that follows the SYCL_DEVICE_FILTER filter selector, e.g. device="gpu" to use the GPU. Read more in the user guide.

engine: str or None, default=None
Execution engine to use for estimators. If None, the default value is used. Choose from:

  • "sklearn" (default)
  • "cuml"

verbose: int, default=0
Verbosity level of the class. Choose from:

  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.
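As a rough illustration of the drop_chars parameter, the regex from its description can be applied to a few column names with the standard `re` module. The column names are made up for the example, and how the class applies the pattern internally is not shown here.

```python
import re

# Pattern taken from the drop_chars description: matches every run of
# characters that is not a letter or a digit.
pattern = r"[^A-Za-z0-9]+"

columns = ["mean radius", "worst-area (mm^2)", "id#"]  # hypothetical names
cleaned = [re.sub(pattern, "", name) for name in columns]
print(cleaned)
# ['meanradius', 'worstareamm2', 'id']
```

Note that removing characters can make two column names collide, so it may be worth checking the result for duplicates.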

Attributes

missing_: list
Values that are considered "missing". Default values are: None, NaN, NA, NaT, +inf, -inf, "", "?", "NA", "nan", "NaN", "NaT", "none", "None", "inf", "-inf". Note that None, NaN, NA, +inf and -inf are always considered missing since they are incompatible with sklearn estimators.

mapping_: dict
Target values mapped to their respective encoded integers. Only available if encode_target=True.

feature_names_in_: np.ndarray
Names of features seen during fit.

target_names_in_: np.ndarray
Names of the target column(s) seen during fit.

n_features_in_: int
Number of features seen during fit.
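The idea behind missing_ can be sketched in plain Python: a value counts as missing if it is None, NaN, infinite, or one of the listed marker strings. The helper `is_missing` is hypothetical and uses only part of the default list; it is not the class's own check.

```python
import math

# A few of the default marker strings listed for missing_ (not the full list).
MISSING_STRINGS = {"", "?", "NA", "nan", "NaN", "none", "None", "inf", "-inf"}


def is_missing(value):
    """Return True if value would be treated as missing in this sketch."""
    if value is None:
        return True
    if isinstance(value, float):
        # NaN and +/-inf are always treated as missing.
        return math.isnan(value) or math.isinf(value)
    return isinstance(value, str) and value in MISSING_STRINGS


# Hypothetical target column with some missing entries.
target = ["a", None, "b", "?", float("nan"), "a"]
kept = [v for v in target if not is_missing(v)]
print(kept)
# ['a', 'b', 'a']
```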


See Also

Encoder

Perform encoding of categorical features.

Discretizer

Bin continuous data into intervals.

Scaler

Scale the data.


Example

>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> y = ["a" if i else "b" for i in y]

>>> atom = ATOMClassifier(X, y, random_state=1)
>>> print(atom.y)

0      a
1      a
2      a
3      a
4      a
      ..
564    a
565    a
566    a
567    a
568    b
Name: target, Length: 569, dtype: object

>>> atom.clean(verbose=2)

Fitting Cleaner...
Cleaning the data...
 --> Label-encoding column target.

>>> print(atom.y)

0      0
1      0
2      0
3      0
4      0
      ..
564    0
565    0
566    0
567    0
568    1
Name: target, Length: 569, dtype: Int64

>>> from atom.data_cleaning import Cleaner
>>> from numpy.random import randint

>>> y = ["a" if i else "b" for i in range(randint(100))]

>>> cleaner = Cleaner(verbose=2)
>>> y = cleaner.fit_transform(y=y)

Fitting Cleaner...
Cleaning the data...
 --> Label-encoding column target.

>>> print(y)

0     1
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
30    0
31    0
32    0
33    0
34    0
35    0
36    0
Name: target, dtype: Int64


Methods

fit                    Fit to data.
fit_transform          Fit to data, then transform it.
get_feature_names_out  Get output feature names for transformation.
get_params             Get parameters for this estimator.
inverse_transform      Inversely transform the label encoding.
set_output             Set output container.
set_params             Set the parameters of this estimator.
transform              Apply the data cleaning steps to the data.


method fit(X=None, y=None)[source]

Fit to data.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X.

Returns

Self
Estimator instance.



method fit_transform(X=None, y=None, **fit_params)[source]

Fit to data, then transform it.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X. If None, y is ignored.

**fit_params
Additional keyword arguments for the fit method.

Returns

dataframe
Transformed feature set. Only returned if X is provided.

series or dataframe
Transformed target column. Only returned if y is provided.



method get_feature_names_out(input_features=None)[source]

Get output feature names for transformation.

Parameters

input_features: sequence or None, default=None
Only used to validate feature names with the names seen in fit.

Returns

np.ndarray
Transformed feature names.



method get_params(deep=True)[source]

Get parameters for this estimator.

Parameters

deep: bool, default=True
If True, return the parameters of this estimator and of any contained subobjects that are estimators.

Returns

params: dict
Parameter names mapped to their values.



method inverse_transform(X=None, y=None)[source]

Inversely transform the label encoding.

This method only inversely transforms the target encoding. The rest of the transformations can't be inverted. If encode_target=False, the data is returned as is.

Parameters

X: dataframe-like or None, default=None
Do nothing. Implemented for continuity of the API.

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X.

Returns

dataframe
Unchanged feature set. Only returned if X is provided.

series or dataframe
Original target column. Only returned if y is provided.
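The round trip between the target encoding and inverse_transform can be sketched with a plain dictionary. This assumes the classes are sorted alphabetically and numbered from 0, which matches the outputs in the Example section but is not confirmed as the class's exact procedure; the variable names are illustrative.

```python
y = ["b", "a", "a", "b", "a"]  # hypothetical target column

# Assumed encoding: classes sorted alphabetically, then numbered from 0.
mapping = {label: code for code, label in enumerate(sorted(set(y)))}
encoded = [mapping[label] for label in y]

# Inverting the dictionary recovers the original labels.
inverse = {code: label for label, code in mapping.items()}
decoded = [inverse[code] for code in encoded]

print(mapping)  # {'a': 0, 'b': 1}
print(encoded)  # [1, 0, 0, 1, 0]
print(decoded == y)  # True
```

The dictionary here plays the role that the mapping_ attribute has for a fitted Cleaner.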



method set_output(transform=None)[source]

Set output container.

See sklearn's user guide on how to use the set_output API. The available choices are listed under the transform parameter.

Parameters

transform: str or None, default=None
Configure the output of the transform, fit_transform, and inverse_transform methods. If None, the configuration is not changed. Choose from:

  • "numpy"
  • "pandas" (default)
  • "pandas-pyarrow"
  • "polars"
  • "polars-lazy"
  • "pyarrow"
  • "modin"
  • "dask"
  • "pyspark"
  • "pyspark-pandas"

Returns

Self
Estimator instance.



method set_params(**params)[source]

Set the parameters of this estimator.

Parameters

**params: dict
Estimator parameters.

Returns

self: estimator instance
Estimator instance.



method transform(X=None, y=None)[source]

Apply the data cleaning steps to the data.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X.

Returns

dataframe
Transformed feature set. Only returned if X is provided.

series or dataframe
Transformed target column. Only returned if y is provided.