Cleaner

class atom.data_cleaning.Cleaner(convert_dtypes=True, drop_dtypes=None, drop_chars=None, strip_categorical=True, drop_duplicates=False, drop_missing_target=True, encode_target=True, device="cpu", engine=None, verbose=0)

Applies standard data cleaning steps to a dataset.

Use the parameters to choose which transformations to perform. The available steps are:

Convert dtypes to the best possible types.
Drop columns with specific data types.
Remove characters from column names.
Strip spaces from categorical features.
Drop duplicate rows.
Drop rows with missing values in the target column.
Encode the target column.

This class can be accessed from atom through the clean method. Read more in the user guide.

Parameters

convert_dtypes: bool, default=True

Convert the columns' data types to the best possible types that support pd.NA.

drop_dtypes: str, sequence or None, default=None

Columns with these data types are dropped from the dataset.

drop_chars: str or None, default=None

Remove the specified regex pattern from column names, e.g. [^A-Za-z0-9]+ to remove all non-alphanumerical characters.
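As a rough illustration of what such a pattern does to column names, here is a plain-Python sketch. The helper name `drop_chars_from_names` and the sample column names are hypothetical, and this is not ATOM's actual implementation; it only shows the effect of removing every regex match from each name.

```python
import re


def drop_chars_from_names(columns, pattern):
    """Remove every match of a regex pattern from each column name."""
    return [re.sub(pattern, "", name) for name in columns]


# Hypothetical column names containing spaces and punctuation.
columns = ["mean radius", "worst-area (mm2)", "id#"]

# The pattern from the description above: drop all non-alphanumerical characters.
cleaned = drop_chars_from_names(columns, r"[^A-Za-z0-9]+")
print(cleaned)
```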

strip_categorical: bool, default=True

Whether to strip spaces from categorical columns.
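The effect can be sketched in plain Python as follows. This assumes that "strip" means removing leading and trailing whitespace from string values only; the helper name `strip_categorical_values` is hypothetical and the snippet is not taken from ATOM's source.

```python
def strip_categorical_values(values):
    """Strip surrounding whitespace from strings; leave other values untouched."""
    return [v.strip() if isinstance(v, str) else v for v in values]


# Hypothetical categorical column with stray spaces and non-string entries.
column = [" red", "blue ", "  green  ", None, 3]
stripped = strip_categorical_values(column)
print(stripped)
```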

drop_duplicates: bool, default=False

Whether to drop duplicate rows. Only the first occurrence of every duplicated row is kept.
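The "keep the first occurrence" rule can be sketched in plain Python like this. The helper `drop_duplicate_rows` and the sample rows are hypothetical; the class itself operates on dataframes, so treat this only as an illustration of the rule.

```python
def drop_duplicate_rows(rows):
    """Keep only the first occurrence of every row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, so compare rows as tuples
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [[1, "a"], [2, "b"], [1, "a"], [3, "c"], [2, "b"]]
deduplicated = drop_duplicate_rows(rows)
print(deduplicated)
```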

drop_missing_target: bool, default=True

Whether to drop rows with missing values in the target column. This transformation is ignored if y is not provided.
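A minimal plain-Python sketch of this step is shown below. It assumes, for brevity, that only None and NaN count as missing (the class uses a longer list, see the missing_ attribute), and the helper names are hypothetical rather than part of ATOM's API. The point is that the feature rows and the target stay aligned after filtering.

```python
import math


def is_missing_target(value):
    """Simplified check: only None and NaN are treated as missing here."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_rows_with_missing_target(X, y):
    """Drop every (row, target) pair whose target value is missing."""
    kept = [(row, target) for row, target in zip(X, y) if not is_missing_target(target)]
    X_kept = [row for row, _ in kept]
    y_kept = [target for _, target in kept]
    return X_kept, y_kept


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = ["a", None, "b"]
X_clean, y_clean = drop_rows_with_missing_target(X, y)
print(X_clean, y_clean)
```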

encode_target: bool, default=True

Whether to encode the target column(s). This includes converting categorical columns to numerical, and binarizing multilabel columns. This transformation is ignored if y is not provided.
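The categorical-to-numerical part can be sketched in plain Python as follows. It assumes classes are numbered in sorted order, which matches the example output on this page ("a" becomes 0, "b" becomes 1) but is not confirmed here as ATOM's rule. The helper names are hypothetical, and multilabel binarization is not covered.

```python
def fit_label_mapping(y):
    """Map each distinct target value to an integer, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(y)))}


def encode_labels(y, mapping):
    """Replace every target value by its integer code."""
    return [mapping[label] for label in y]


y = ["b", "a", "a", "b", "a"]
mapping = fit_label_mapping(y)
encoded = encode_labels(y, mapping)
print(mapping)
print(encoded)
```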

device: str, default="cpu"

Device on which to run the estimators. Use any string that follows the SYCL_DEVICE_FILTER filter selector, e.g. device="gpu" to use the GPU. Read more in the user guide.

engine: str or None, default=None

Execution engine to use for estimators. If None, the default value is used. Choose from:

"sklearn" (default)
"cuml"

verbose: int, default=0

Verbosity level of the class. Choose from:

0 to not print anything.
1 to print basic information.
2 to print detailed information.

Attributes

missing_: list

Values that are considered "missing". Default values are: None, NaN, NA, NaT, +inf, -inf, "", "?", "NA", "nan", "NaN", "NaT", "none", "None", "inf", "-inf". Note that None, NaN, NA, +inf and -inf are always considered missing since they are incompatible with sklearn estimators.
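A check against these defaults could look roughly like the plain-Python sketch below. The function `is_missing` and the constant `MISSING_STRINGS` are hypothetical names; the pandas-specific values NA and NaT are left out because the sketch uses only the standard library, and string entries are assumed to be compared verbatim.

```python
import math

# String defaults from the list above, compared verbatim (an assumption).
MISSING_STRINGS = {"", "?", "NA", "nan", "NaN", "NaT", "none", "None", "inf", "-inf"}


def is_missing(value):
    """Return True if the value matches one of the default missing markers."""
    if value is None:
        return True
    if isinstance(value, float):
        return math.isnan(value) or math.isinf(value)
    if isinstance(value, str):
        return value in MISSING_STRINGS
    return False


values = [1.5, None, float("nan"), float("-inf"), "?", "NA", "ok", 0, ""]
flags = [is_missing(v) for v in values]
print(flags)
```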

mapping_: dict

Target values mapped to their respective encoded integers. Only available if encode_target=True.

feature_names_in_: np.ndarray

Names of features seen during fit.

target_names_in_: np.ndarray

Names of the target column(s) seen during fit.

n_features_in_: int

Number of features seen during fit.

Example

The first example uses the class through atom; the second uses it stand-alone.

>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> y = ["a" if i else "b" for i in y]

>>> atom = ATOMClassifier(X, y, random_state=1)
>>> print(atom.y)

0      a
1      a
2      a
3      a
4      a
      ..
564    a
565    a
566    a
567    a
568    b
Name: target, Length: 569, dtype: object


>>> atom.clean(verbose=2)

Fitting Cleaner...
Cleaning the data...
 --> Label-encoding column target.


>>> print(atom.y)

0      0
1      0
2      0
3      0
4      0
      ..
564    0
565    0
566    0
567    0
568    1
Name: target, Length: 569, dtype: Int64

>>> from atom.data_cleaning import Cleaner
>>> from numpy.random import randint

>>> y = ["a" if i else "b" for i in range(randint(100))]

>>> cleaner = Cleaner(verbose=2)
>>> y = cleaner.fit_transform(y=y)

Fitting Cleaner...
Cleaning the data...
 --> Label-encoding column target.


>>> print(y)

0     1
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
30    0
31    0
32    0
33    0
34    0
35    0
36    0
Name: target, dtype: Int64

Methods

fit	Fit to data.
fit_transform	Fit to data, then transform it.
get_feature_names_out	Get output feature names for transformation.
get_params	Get parameters for this estimator.
inverse_transform	Inversely transform the label encoding.
set_output	Set output container.
set_params	Set the parameters of this estimator.
transform	Apply the data cleaning steps to the data.

method fit(X=None, y=None)

Fit to data.

Parameters

X: dataframe-like or None, default=None

Feature set with shape=(n_samples, n_features). If None, `X` is ignored.

y: sequence, dataframe-like or None, default=None

Target column(s) corresponding to `X`.

Returns

Self

Estimator instance.

method fit_transform(X=None, y=None, **fit_params)

Fit to data, then transform it.

Parameters

X: dataframe-like or None, default=None

Feature set with shape=(n_samples, n_features). If None, `X` is ignored.

y: sequence, dataframe-like or None, default=None

Target column(s) corresponding to `X`. If None, `y` is ignored.

**fit_params

Additional keyword arguments for the fit method.

Returns

dataframe

Transformed feature set. Only returned if `X` is provided.

series or dataframe

Transformed target column. Only returned if `y` is provided.

method get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters

input_features: sequence or None, default=None

Only used to validate feature names with the names seen in `fit`.

Returns

np.ndarray

Transformed feature names.

method get_params(deep=True)

Get parameters for this estimator.

Parameters

deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict

Parameter names mapped to their values.

method inverse_transform(X=None, y=None)

Inversely transform the label encoding.

This method only inverts the target encoding. The other transformations can't be inverted. If encode_target=False, the data is returned as is.

Parameters

X: dataframe-like or None, default=None

Does nothing. Implemented for continuity of the API.

y: sequence, dataframe-like or None, default=None

Target column(s) corresponding to `X`.

Returns

dataframe

Unchanged feature set. Only returned if `X` is provided.

series or dataframe

Original target column. Only returned if `y` is provided.
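Inverting a label encoding amounts to reversing the value-to-integer mapping. The plain-Python sketch below assumes a mapping shaped like the mapping_ attribute (target values mapped to integers); the helper name `invert_label_encoding` and the sample mapping are hypothetical, and this is not ATOM's implementation.

```python
def invert_label_encoding(encoded, mapping):
    """Translate integer codes back to the original target values."""
    inverse = {code: label for label, code in mapping.items()}
    return [inverse[code] for code in encoded]


# Hypothetical mapping in the shape described for mapping_.
mapping = {"a": 0, "b": 1}
encoded = [1, 0, 0, 1]
restored = invert_label_encoding(encoded, mapping)
print(restored)
```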

method set_output(transform=None)

Set output container.

See sklearn's user guide on how to use the set_output API. The available choices are listed under the transform parameter.

Parameters

transform: str or None, default=None

Configure the output of the `transform`, `fit_transform`, and `inverse_transform` methods. If None, the configuration is not changed. Choose from:

"numpy"
"pandas" (default)
"pandas-pyarrow"
"polars"
"polars-lazy"
"pyarrow"
"modin"
"dask"
"pyspark"
"pyspark-pandas"

Returns

Self

Estimator instance.

method set_params(**params)

Set the parameters of this estimator.

Parameters

**params: dict

Estimator parameters.

Returns

self: estimator instance

Estimator instance.

method transform(X=None, y=None)

Apply the data cleaning steps to the data.

Parameters

X: dataframe-like or None, default=None

Feature set with shape=(n_samples, n_features). If None, `X` is ignored.

y: sequence, dataframe-like or None, default=None

Target column(s) corresponding to `X`.

Returns

dataframe

Transformed feature set. Only returned if `X` is provided.

series or dataframe

Transformed target column. Only returned if `y` is provided.