Cleaner
class atom.data_cleaning.Cleaner(drop_types=None, strip_categorical=True, drop_max_cardinality=True, drop_min_cardinality=True, drop_duplicates=False, drop_missing_target=True, encode_target=True, gpu=False, verbose=0, logger=None)
Performs standard data cleaning steps on a dataset. Use the parameters
to choose which transformations to perform. The available steps are:
- Drop columns with specific data types.
- Strip whitespace from categorical features.
- Drop categorical columns with maximum cardinality.
- Drop columns with minimum cardinality.
- Drop duplicate rows.
- Drop rows with missing values in the target column.
- Encode the target column.
This class can be accessed from atom through the clean
method. Read more in the user guide.
Parameters: |
drop_types: str, sequence or None, optional (default=None)
Columns with these types are dropped from the dataset.
strip_categorical: bool, optional (default=True)
Whether to strip the spaces from the categorical columns.
drop_max_cardinality: bool, optional (default=True)
Whether to drop categorical columns with maximum cardinality,
i.e. the number of unique values is equal to the number of
samples. This is usually the case for names, IDs, etc.
drop_min_cardinality: bool, optional (default=True)
Whether to drop columns with minimum cardinality, i.e. all values in the
column are the same.
drop_duplicates: bool, optional (default=False)
Whether to drop duplicate rows. Only the first occurrence of
every duplicated row is kept.
drop_missing_target: bool, optional (default=True)
Whether to drop rows with missing values in the target column.
This parameter is ignored if y is not provided.
encode_target: bool, optional (default=True)
Whether to label-encode the target column. This parameter is ignored
if y is not provided.
gpu: bool or str, optional (default=False)
Train the LabelEncoder on GPU instead of CPU. Only used if encode_target=True.
- If False: Always use CPU implementation.
- If True: Use GPU implementation if possible.
- If "force": Force GPU implementation.
verbose: int, optional (default=0)
Verbosity level of the class. Choose from:
- 0 to not print anything.
- 1 to print basic information.
- 2 to print detailed information.
logger: str, Logger or None, optional (default=None)
- If None: Doesn't save a logging file.
- If str: Name of the log file. Use "auto" for automatic naming.
- Else: Python logging.Logger instance.
|
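The two cardinality rules and the duplicate-row rule described above can be sketched in plain Python. This is only an illustration of the stated definitions, not the library's implementation: the helper names (`has_max_cardinality`, `has_min_cardinality`, `is_categorical`, `clean_columns`, `drop_duplicate_rows`) are hypothetical, a dict of lists stands in for a dataframe, and treating a column as categorical when it holds only strings is an assumption.

```python
def has_max_cardinality(values):
    """True if every value is unique, e.g. names or IDs."""
    return len(set(values)) == len(values)


def has_min_cardinality(values):
    """True if all values in the column are the same."""
    return len(set(values)) == 1


def is_categorical(values):
    # Assumption for this sketch: a column is categorical if it holds strings.
    return all(isinstance(v, str) for v in values)


def clean_columns(columns):
    """Drop columns by the two cardinality rules.

    `columns` maps column names to equally long lists of values.
    """
    kept = {}
    for name, values in columns.items():
        if has_min_cardinality(values):
            continue  # constant column
        if is_categorical(values) and has_max_cardinality(values):
            continue  # one unique value per sample, e.g. an ID
        kept[name] = values
    return kept


def drop_duplicate_rows(rows):
    """Keep only the first occurrence of every duplicated row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


columns = {
    "id": ["a1", "b2", "c3", "d4"],                 # categorical, all unique
    "country": ["NL", "NL", "NL", "NL"],            # constant
    "city": ["Delft", "Delft", "Leiden", "Delft"],  # kept
    "age": [31, 45, 27, 52],                        # numeric, all unique: kept
}
print(list(clean_columns(columns)))  # ['city', 'age']
print(drop_duplicate_rows([(1, "x"), (2, "y"), (1, "x")]))  # [(1, 'x'), (2, 'y')]
```

Note that the maximum-cardinality rule only applies to categorical columns, which is why the numeric `age` column survives even though all of its values are unique.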
Attributes
Attributes: |
missing: list
Values that are considered "missing". Default values are: "", "?",
"None", "NA", "nan", "NaN" and "inf". Note that None, NaN, +inf and
-inf are always considered missing since they are incompatible with
sklearn estimators.
mapping: dict
Target values mapped to their respective encoded integer. Only
available if encode_target=True.
feature_names_in_: np.array
Names of features seen during fit.
n_features_in_: int
Number of features seen during fit.
|
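To make the `missing` and `mapping` attributes more concrete, here is a minimal sketch of dropping rows with a missing target and then label-encoding what remains. It is not the library's code: `is_missing`, `drop_missing_target` and `encode_target` are hypothetical helper functions named after the parameters, plain lists stand in for pandas objects, and numbering the classes in sorted order follows the convention of sklearn's LabelEncoder. Whether the library's `mapping` has exactly this form is an assumption.

```python
import math

# Default markers listed for the `missing` attribute.
MISSING = ["", "?", "None", "NA", "nan", "NaN", "inf"]


def is_missing(value, missing=MISSING):
    """True for None, NaN, +inf, -inf and any marker in `missing`."""
    if value is None:
        return True
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        return True
    return value in missing


def drop_missing_target(rows, target):
    """Keep only the (row, label) pairs whose label is not missing."""
    kept = [(r, t) for r, t in zip(rows, target) if not is_missing(t)]
    return [r for r, _ in kept], [t for _, t in kept]


def encode_target(target):
    """Label-encode the target; classes are numbered in sorted order."""
    mapping = {label: i for i, label in enumerate(sorted(set(target)))}
    return [mapping[t] for t in target], mapping


rows = [[1], [2], [3], [4]]
target = ["yes", "?", "no", None]

X, y = drop_missing_target(rows, target)
encoded, mapping = encode_target(y)
print(X)        # [[1], [3]]
print(encoded)  # [1, 0]
print(mapping)  # {'no': 0, 'yes': 1}
```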
Methods
fit | Fit to data.
fit_transform | Same as transform.
get_params | Get parameters for this estimator.
log | Write information to the logger and print to stdout.
save | Save the instance to a pickle file.
set_params | Set the parameters of this estimator.
transform | Transform the data.
method fit(X=None, y=None)
Fit to data.
Parameters: |
X: dataframe-like or None, optional (default=None)
Feature set with shape=(n_samples, n_features).
y: int, str, sequence or None, optional (default=None)
- If None: y is ignored.
- If int: Index of the target column in X.
- If str: Name of the target column in X.
- Else: Target column with shape=(n_samples,).
|
Returns: |
Cleaner
Fitted instance of self.
|
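The four accepted forms of `y` (None, int, str, or a sequence) can be illustrated with a small resolver. This is a sketch under stated assumptions rather than the library's logic: `split_target` is a hypothetical name, a dict of lists stands in for a dataframe, and removing the target column from the features when `y` is an index or name is assumed.

```python
def split_target(X, y=None):
    """Return (features, target) following the four cases for `y`.

    `X` is a dict mapping column names to lists (a stand-in for a dataframe).
    """
    if y is None:
        return X, None  # y is ignored
    if isinstance(y, int):
        y = list(X)[y]  # positional index -> column name
    if isinstance(y, str):
        # Assumption: the target column is removed from the feature set.
        features = {k: v for k, v in X.items() if k != y}
        return features, X[y]
    # Otherwise y is the target column itself.
    n_samples = len(next(iter(X.values()), []))
    target = list(y)
    if len(target) != n_samples:
        raise ValueError("y must have shape=(n_samples,)")
    return X, target


X = {"age": [31, 45], "city": ["Delft", "Leiden"], "label": [0, 1]}

print(split_target(X, "label"))  # by name
print(split_target(X, -1))       # by index, same result as above
print(split_target(X, [1, 0]))   # separate sequence, X is left unchanged
```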
method fit_transform(X=None, y=None)
Apply the data cleaning steps to the data.
Parameters: |
X: dataframe-like or None, optional (default=None)
Feature set with shape=(n_samples, n_features).
y: int, str, sequence or None, optional (default=None)
- If None: y is ignored.
- If int: Index of the target column in X.
- If str: Name of the target column in X.
- Else: Target column with shape=(n_samples,).
|
Returns: |
pd.DataFrame
Transformed feature set.
pd.Series
Transformed target column. Only returned if y is provided.
|
method get_params(deep=True)
Get parameters for this estimator.
Parameters: |
deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained
subobjects that are estimators.
|
Returns: |
dict
Parameter names mapped to their values.
|
method log(msg, level=0)
Write a message to the logger and print it to stdout.
Parameters: |
msg: str
Message to write to the logger and print to stdout.
level: int, optional (default=0)
Minimum verbosity level to print the message.
|
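The interaction between the `level` argument and the class's `verbose` setting can be sketched as follows. `VerboseLogger` is a hypothetical stand-in, not the library's class, and the choice to pass every message to the logger regardless of its level is an assumption.

```python
import logging


class VerboseLogger:
    """Hypothetical stand-in that mimics the documented `log` behaviour."""

    def __init__(self, verbose=0, logger=None):
        self.verbose = verbose
        self.logger = logger  # a logging.Logger instance or None

    def log(self, msg, level=0):
        # Print only when the verbosity is at least the message's level.
        if self.verbose >= level:
            print(msg)
        # Assumption: the logger receives every message regardless of level.
        if self.logger is not None:
            self.logger.info(msg)


basic = VerboseLogger(verbose=1)
basic.log("Dropping 3 duplicate rows.", level=1)  # printed
basic.log("Row indices: 4, 7, 9.", level=2)       # not printed at verbose=1
```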
method save(filename="auto")
Save the instance to a pickle file.
Parameters: |
filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.
|
method set_params(**params)
Set the parameters of this estimator.
Parameters: |
**params: dict
Estimator parameters.
|
Returns: |
Cleaner
Estimator instance.
|
method transform(X=None, y=None)
Apply the data cleaning steps to the data.
Parameters: |
X: dataframe-like or None, optional (default=None)
Feature set with shape=(n_samples, n_features).
y: int, str, sequence or None, optional (default=None)
- If None: y is ignored.
- If int: Index of the target column in X.
- If str: Name of the target column in X.
- Else: Target column with shape=(n_samples,).
|
Returns: |
pd.DataFrame
Transformed feature set.
pd.Series
Transformed target column. Only returned if y is provided.
|
Example