Imputer

class atom.data_cleaning.Imputer(strat_num="drop", strat_cat="drop", max_nan_rows=None, max_nan_cols=None, device="cpu", engine="sklearn", verbose=0, logger=None)[source]

Handle missing values in the data.

Impute or remove missing values according to the selected strategy. Rows and columns with too many missing values are also removed. Use the missing attribute to customize which values are considered missing.

This class can be accessed from atom through the impute method. Read more in the user guide.

Parameters

strat_num: str, int or float, default="drop"

Imputing strategy for numerical columns. Choose from:

"drop": Drop rows containing missing values.
"mean": Impute with mean of column.
"median": Impute with median of column.
"knn": Impute using a K-Nearest Neighbors approach.
"most_frequent": Impute with most frequent value.
int or float: Impute with provided numerical value.

strat_cat: str, default="drop"

Imputing strategy for categorical columns. Choose from:

"drop": Drop rows containing missing values.
"most_frequent": Impute with most frequent value.
str: Impute with provided string.
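The strategies listed for strat_num and strat_cat correspond to familiar pandas operations. The following is a minimal sketch of what "median", "most_frequent", a fixed fill value and "drop" amount to conceptually. It uses plain pandas rather than Imputer, the toy dataframe is hypothetical, and it should not be read as a description of ATOM's internals.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical and one categorical column.
df = pd.DataFrame(
    {
        "age": [20.0, np.nan, 40.0, 60.0, np.nan],
        "city": ["Paris", None, "Paris", "Rome", None],
    }
)

# Like strat_num="median": fill missing numbers with the column median.
median_filled = df["age"].fillna(df["age"].median())

# Like strat_cat="most_frequent": fill missing categories with the mode.
mode_filled = df["city"].fillna(df["city"].mode().iloc[0])

# Like strat_cat="unknown": fill missing categories with a fixed string.
constant_filled = df["city"].fillna("unknown")

# Like "drop": remove every row that contains a missing value.
dropped = df.dropna()
```

Note that "drop" changes the number of rows, while the other strategies keep the shape of the data intact.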

max_nan_rows: int, float or None, default=None

Maximum number or fraction of missing values in a row (if more, the row is removed). If None, ignore this step.

max_nan_cols: int, float or None, default=None

Maximum number or fraction of missing values in a column (if more, the column is removed). If None, ignore this step.
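To make the two thresholds concrete, here is a small sketch in plain pandas. It assumes that a float is read as a fraction of the row (or column) length and converted to an absolute count. The helper name to_count and its truncation rule are assumptions made for illustration, not confirmed ATOM behaviour.

```python
import numpy as np
import pandas as pd


def to_count(limit, size):
    """Turn an int or a fraction into an absolute count (hypothetical helper)."""
    # Assumption: values below 1 are fractions of `size`, truncated to an int.
    return int(limit * size) if limit < 1 else int(limit)


df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": [1.0, np.nan, 3.0, 4.0],
        "c": [np.nan, np.nan, 3.0, 4.0],
        "d": [1.0, 2.0, 3.0, 4.0],
    }
)

# Rows: keep those with at most `max_rows` missing values.
max_rows = to_count(0.5, df.shape[1])  # 0.5 * 4 columns -> 2
df_rows = df[df.isna().sum(axis=1) <= max_rows]

# Columns: keep those with at most `max_cols` missing values.
max_cols = to_count(1, df.shape[0])  # an int is used as-is -> 1
df_cols = df.loc[:, df.isna().sum(axis=0) <= max_cols]
```

Under the same assumption, a fraction of 0.1 on a dataset with 30 columns gives a limit of 3 missing values per row.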

device: str, default="cpu"

Device on which to train the estimators. Use any string that follows the SYCL_DEVICE_FILTER filter selector, e.g. device="gpu" to use the GPU. Read more in the user guide.

engine: str, default="sklearn"

Execution engine to use for the estimators. Refer to the user guide for an explanation regarding every choice. Choose from:

"sklearn" (only if device="cpu")
"cuml" (only if device="gpu")

verbose: int, default=0

Verbosity level of the class. Choose from:

0 to not print anything.
1 to print basic information.
2 to print detailed information.

logger: str, Logger or None, default=None

If None: No log file is saved.
If str: Name of the log file. Use "auto" for automatic naming.
Else: Python logging.Logger instance.
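For the last option, a standard logging.Logger can be prepared up front. The sketch below builds one with the standard library only. The logger name is a hypothetical choice, the handler writes to an in-memory stream to keep the example self-contained, and handing the logger to the class is shown only as a comment.

```python
import io
import logging

# In-memory stream so the sketch leaves no file behind; swap in
# logging.FileHandler("imputer.log") to write to disk instead.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))

logger = logging.getLogger("imputer_demo")  # hypothetical name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # avoid duplicate output via the root logger

logger.info("Fitting Imputer...")

# The instance could then be passed on, e.g. Imputer(logger=logger).
```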

Attributes

missing: list

Values that are considered "missing". Default values are: "", "?", "NA", "nan", "NaN", "none", "None", "inf", "-inf". Note that None, NaN, +inf and -inf are always considered missing since they are incompatible with sklearn estimators.
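Conceptually, the missing list acts as a set of placeholder values that are treated like NaN before imputation. The sketch below shows that idea with plain pandas. The list is copied from the defaults above with one hypothetical extra marker ("unknown") appended, and the replacement step is an illustration rather than ATOM's actual code.

```python
import numpy as np
import pandas as pd

# Default markers listed above, plus a hypothetical custom one.
missing = ["", "?", "NA", "nan", "NaN", "none", "None", "inf", "-inf"]
missing.append("unknown")

df = pd.DataFrame(
    {
        "income": [1200.0, np.inf, 900.0, -np.inf],
        "job": ["nurse", "?", "unknown", "chef"],
    }
)

# Turn the string markers into NaN, then do the same for infinities.
cleaned = df.replace(missing, np.nan).replace([np.inf, -np.inf], np.nan)

# Count the missing values per column after the conversion.
n_missing = cleaned.isna().sum()
```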

feature_names_in_: np.array

Names of features seen during fit.

n_features_in_: int

Number of features seen during fit.

Example

atom

>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> import numpy as np
>>> from numpy.random import randint

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)

>>> # Add some random missing values to the data
>>> for i, j in zip(randint(0, X.shape[0], 600), randint(0, 4, 600)):
...     X.iat[i, j] = np.nan

>>> atom = ATOMClassifier(X, y)
>>> print(atom.nans)

mean radius       118
mean texture      134
mean perimeter    135
mean area         140

dtype: int64

>>> atom.impute(strat_num="median", max_nan_rows=0.1, verbose=2)

Fitting Imputer...
Imputing missing values...
 --> Dropping 3 samples for containing more than 3 missing values.
 --> Imputing 115 missing values with median (13.3) in feature mean radius.
 --> Imputing 131 missing values with median (18.8) in feature mean texture.
 --> Imputing 132 missing values with median (85.86) in feature mean perimeter.
 --> Imputing 137 missing values with median (561.3) in feature mean area.

>>> print(atom.n_nans)

0

stand-alone

>>> from atom.data_cleaning import Imputer
>>> from sklearn.datasets import load_breast_cancer
>>> import numpy as np
>>> from numpy.random import randint

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)

>>> # Add some random missing values to the data
>>> for i, j in zip(randint(0, X.shape[0], 600), randint(0, 4, 600)):
...     X.iloc[i, j] = np.nan

>>> print(X)

     mean radius  mean texture  ...  worst symmetry  worst fractal dimension
0          17.99           NaN  ...          0.4601                  0.11890
1          20.57         17.77  ...          0.2750                  0.08902
2          19.69         21.25  ...          0.3613                  0.08758
3            NaN         20.38  ...          0.6638                  0.17300
4            NaN         14.34  ...          0.2364                  0.07678
..           ...           ...  ...             ...                      ...
564          NaN         22.39  ...          0.2060                  0.07115
565        20.13         28.25  ...          0.2572                  0.06637
566          NaN           NaN  ...          0.2218                  0.07820
567          NaN         29.33  ...          0.4087                  0.12400
568          NaN         24.54  ...          0.2871                  0.07039

[569 rows x 30 columns]

>>> imputer = Imputer(strat_num="median", max_nan_rows=0.1, verbose=2)
>>> X, y = imputer.fit_transform(X, y)

Fitting Imputer...
Imputing missing values...
 --> Imputing 135 missing values with median (13.42) in feature mean radius.
 --> Imputing 133 missing values with median (18.81) in feature mean texture.
 --> Imputing 129 missing values with median (86.14) in feature mean perimeter.
 --> Imputing 120 missing values with median (537.9) in feature mean area.

>>> print(X)

     mean radius  mean texture  ...  worst symmetry  worst fractal dimension
0         17.990         10.38  ...          0.4601                  0.11890
1         13.415         17.77  ...          0.2750                  0.08902
2         19.690         21.25  ...          0.3613                  0.08758
3         11.420         20.38  ...          0.6638                  0.17300
4         20.290         14.34  ...          0.2364                  0.07678
..           ...           ...  ...             ...                      ...
564       21.560         22.39  ...          0.2060                  0.07115
565       20.130         28.25  ...          0.2572                  0.06637
566       13.415         28.08  ...          0.2218                  0.07820
567       13.415         18.81  ...          0.4087                  0.12400
568        7.760         24.54  ...          0.2871                  0.07039

[569 rows x 30 columns]

Methods

fit	Fit to data.
fit_transform	Fit to data, then transform it.
get_params	Get parameters for this estimator.
inverse_transform	Does nothing.
log	Print message and save to log file.
save	Save the instance to a pickle file.
set_params	Set the parameters of this estimator.
transform	Impute the missing values.

method fit(X, y=None)[source]

Fit to data.

Parameters

X: dataframe-like

Feature set with shape=(n_samples, n_features).

y: int, str, sequence, dataframe-like or None, default=None

Does nothing. Implemented for continuity of the API.

Returns

Imputer

Estimator instance.

method fit_transform(X=None, y=None, **fit_params)[source]

Fit to data, then transform it.

Parameters

X: dataframe-like or None, default=None

Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: int, str, sequence, dataframe-like or None, default=None

Target column corresponding to X.

If None: y is ignored.
If int: Position of the target column in X.
If str: Name of the target column in X.
If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
If dataframe-like: Target columns with shape=(n_samples, n_targets) for multioutput tasks.
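The int and str forms of y both point at a column inside X. A minimal sketch of how such a reference could be resolved with pandas follows. split_target is a hypothetical helper written for illustration and is not part of ATOM.

```python
import pandas as pd


def split_target(X, y):
    """Resolve y given as a position or a name (hypothetical helper)."""
    if isinstance(y, int):
        y = X.columns[y]  # position -> column name
    if isinstance(y, str):
        return X.drop(columns=y), X[y]
    return X, y  # None or an explicit target: leave untouched


df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "target": [0, 1]})

# Both calls select the same column, once by name and once by position.
X_by_name, y_by_name = split_target(df, "target")
X_by_pos, y_by_pos = split_target(df, -1)
```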

**fit_params

Additional keyword arguments for the fit method.

Returns

dataframe

Transformed feature set. Only returned if provided.

series

Transformed target column. Only returned if provided.

method get_params(deep=True)[source]

Get parameters for this estimator.

Parameters

deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict

Parameter names mapped to their values.

method inverse_transform(X=None, y=None)[source]

Does nothing.

Parameters

X: dataframe-like or None, default=None

Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: int, str, sequence, dataframe-like or None, default=None

Target column corresponding to X.

If None: y is ignored.
If int: Position of the target column in X.
If str: Name of the target column in X.
If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
If dataframe-like: Target columns with shape=(n_samples, n_targets) for multioutput tasks.

Returns

dataframe

Transformed feature set. Only returned if provided.

series

Transformed target column. Only returned if provided.

method log(msg, level=0, severity="info")[source]

Print message and save to log file.

Parameters

msg: int, float or str

Message to save to the logger and print to stdout.

level: int, default=0

Minimum verbosity level to print the message.

severity: str, default="info"

Severity level of the message. Choose from: debug, info, warning, error, critical.

method save(filename="auto", save_data=True)[source]

Save the instance to a pickle file.

Parameters

filename: str, default="auto"

Name of the file. Use "auto" for automatic naming.

save_data: bool, default=True

Whether to save the dataset with the instance. This parameter is ignored if the method is not called from atom. If False, pass the data to the load method when loading the instance.

method set_params(**params)[source]

Set the parameters of this estimator.

Parameters

**params: dict

Estimator parameters.

Returns

self: estimator instance

Estimator instance.

method transform(X, y=None)[source]

Impute the missing values.

Note that leaving y=None can lead to inconsistencies in data length between X and y if rows are dropped during the transformation.
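The length mismatch described in this note is easiest to see on a toy example. The sketch below uses plain pandas instead of Imputer: dropping rows from X alone leaves y with stale entries, whereas selecting y on the surviving index keeps both in sync. The data is hypothetical.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})
y = pd.Series([0, 1, 0], name="target")

# Dropping rows on X only: y still has all three entries.
X_dropped = X.dropna()
mismatch = len(X_dropped) != len(y)

# Keeping y aligned: select the rows of y that survived in X.
y_aligned = y.loc[X_dropped.index]
```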

Parameters

X: dataframe-like

Feature set with shape=(n_samples, n_features).

y: int, str, dict, sequence, dataframe-like or None, default=None

Target column corresponding to X.

If None: y is ignored.
If int: Position of the target column in X.
If str: Name of the target column in X.
If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
If dataframe-like: Target columns with shape=(n_samples, n_targets) for multioutput tasks.

Returns

dataframe

Imputed dataframe.

series

Transformed target column. Only returned if provided.