
Imputer


class atom.data_cleaning.Imputer(strat_num="drop", strat_cat="drop", max_nan_rows=None, max_nan_cols=None, device="cpu", engine="sklearn", verbose=0, logger=None)[source]
Handle missing values in the data.

Impute or remove missing values according to the selected strategy. Also removes rows and columns with too many missing values. Use the missing attribute to customize which values are considered "missing".

This class can be accessed from atom through the impute method. Read more in the user guide.

Parameters

strat_num: str, int or float, default="drop"
Imputing strategy for numerical columns. Choose from:

  • "drop": Drop rows containing missing values.
  • "mean": Impute with mean of column.
  • "median": Impute with median of column.
  • "knn": Impute using a K-Nearest Neighbors approach.
  • "most_frequent": Impute with most frequent value.
  • int or float: Impute with provided numerical value.
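As a rough illustration of what the numerical strategies compute, the sketch below fills the gaps in a single column using only the standard library. The helper `impute_column` is hypothetical and not part of ATOM, and the "knn" strategy is left out because it depends on the other columns.

```python
import math
from collections import Counter
from statistics import mean, median


def impute_column(values, strategy):
    """Hypothetical helper (not ATOM's API): handle NaN entries of one column."""
    present = [v for v in values if not math.isnan(v)]
    if strategy == "drop":
        # In a full table the whole row would be removed; here only the entry.
        return present
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "most_frequent":
        fill = Counter(present).most_common(1)[0][0]
    elif isinstance(strategy, (int, float)):
        # A number is used directly as the fill value.
        fill = strategy
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return [fill if math.isnan(v) else v for v in values]


column = [1.0, 2.0, float("nan"), 2.0, 7.0]
print(impute_column(column, "mean"))           # gap filled with 3.0
print(impute_column(column, "median"))         # gap filled with 2.0
print(impute_column(column, "most_frequent"))  # gap filled with 2.0
print(impute_column(column, -1))               # gap filled with -1
```

The fill value is computed from the observed entries only, so the missing entries themselves never influence the mean or median.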

strat_cat: str, default="drop"
Imputing strategy for categorical columns. Choose from:

  • "drop": Drop rows containing missing values.
  • "most_frequent": Impute with most frequent value.
  • str: Impute with provided string.

max_nan_rows: int, float or None, default=None
Maximum number or fraction of missing values in a row (if more, the row is removed). If None, ignore this step.

max_nan_cols: int, float or None, default=None
Maximum number or fraction of missing values in a column (if more, the column is removed). If None, ignore this step.
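The difference between an integer and a fractional limit can be sketched as follows. This is a minimal illustration, assuming that a float below 1 is read as a fraction of the row length and truncated to an integer count; that reading is consistent with the example on this page, where max_nan_rows=0.1 on 30 columns yields a limit of 3. The helpers `nan_limit` and `drop_sparse_rows` are hypothetical names, not ATOM functions.

```python
import math


def nan_limit(max_nan, n_total):
    """Hypothetical helper: convert an int or fractional limit to a count."""
    if max_nan is None:
        return None  # the step is skipped
    if isinstance(max_nan, float) and max_nan < 1:
        # Assumption: fractions are relative to n_total and truncated.
        return int(max_nan * n_total)
    return int(max_nan)


def drop_sparse_rows(rows, max_nan_rows):
    """Keep only rows whose NaN count does not exceed the limit."""
    if not rows:
        return rows
    limit = nan_limit(max_nan_rows, len(rows[0]))
    if limit is None:
        return rows
    return [r for r in rows if sum(math.isnan(v) for v in r) <= limit]


nan = float("nan")
rows = [
    [1.0, 2.0, 3.0, 4.0],
    [nan, 2.0, 3.0, 4.0],
    [nan, nan, nan, 4.0],
]
print(nan_limit(0.1, 30))                 # 3
print(len(drop_sparse_rows(rows, 0.5)))   # 2: the last row has 3 NaN, limit is 2
print(len(drop_sparse_rows(rows, None)))  # 3: step skipped
```

The same logic applies to max_nan_cols with the roles of rows and columns swapped, so a fraction there would be relative to the number of rows.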

device: str, default="cpu"
Device on which to train the estimators. Use any string that follows the SYCL_DEVICE_FILTER filter selector, e.g. device="gpu" to use the GPU. Read more in the user guide.

engine: str, default="sklearn"
Execution engine to use for the estimators. Refer to the user guide for an explanation regarding every choice. Choose from:

  • "sklearn" (only if device="cpu")
  • "cuml" (only if device="gpu")

verbose: int, default=0
Verbosity level of the class. Choose from:

  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.

logger: str, Logger or None, default=None

  • If None: Doesn't save a logging file.
  • If str: Name of the log file. Use "auto" for automatic naming.
  • Else: Python logging.Logger instance.

Attributes

missing: list
Values that are considered "missing". Default values are: "", "?", "NA", "nan", "NaN", "none", "None", "inf", "-inf". Note that None, NaN, +inf and -inf are always considered missing since they are incompatible with sklearn estimators.
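To make the role of the missing list concrete, here is a small standard-library sketch of the check it implies. The function `is_missing` is hypothetical and only mirrors the description above: None, NaN and the infinities always count as missing, while other values count only if they appear in the list.

```python
import math

# Default markers as listed in the attribute description.
DEFAULT_MISSING = ["", "?", "NA", "nan", "NaN", "none", "None", "inf", "-inf"]


def is_missing(value, missing=DEFAULT_MISSING):
    """Hypothetical check: True if `value` should be treated as missing."""
    if value is None:
        return True
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        return True
    return value in missing


# Extending the list makes an extra marker count as missing.
custom = DEFAULT_MISSING + ["unknown"]

print(is_missing("?"))                # True
print(is_missing("unknown"))          # False with the defaults
print(is_missing("unknown", custom))  # True with the extended list
print(is_missing(float("-inf")))      # True regardless of the list
```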

feature_names_in_: np.array
Names of features seen during fit.

n_features_in_: int
Number of features seen during fit.


See Also

Balancer

Balance the number of samples per class in the target column.

Discretizer

Bin continuous data into intervals.

Encoder

Perform encoding of categorical features.


Example

>>> import numpy as np
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> from numpy.random import randint

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)

>>> # Add some random missing values to the data
>>> for i, j in zip(randint(0, X.shape[0], 600), randint(0, 4, 600)):
...     X.iat[i, j] = np.nan

>>> atom = ATOMClassifier(X, y)
>>> print(atom.nans)

mean radius       118
mean texture      134
mean perimeter    135
mean area         140

dtype: int64

>>> atom.impute(strat_num="median", max_nan_rows=0.1, verbose=2)

Fitting Imputer...
Imputing missing values...
 --> Dropping 3 samples for containing more than 3 missing values.
 --> Imputing 115 missing values with median (13.3) in feature mean radius.
 --> Imputing 131 missing values with median (18.8) in feature mean texture.
 --> Imputing 132 missing values with median (85.86) in feature mean perimeter.
 --> Imputing 137 missing values with median (561.3) in feature mean area.

>>> print(atom.n_nans)

0

>>> import numpy as np
>>> from atom.data_cleaning import Imputer
>>> from sklearn.datasets import load_breast_cancer
>>> from numpy.random import randint

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)

>>> # Add some random missing values to the data
>>> for i, j in zip(randint(0, X.shape[0], 600), randint(0, 4, 600)):
...     X.iloc[i, j] = np.nan

>>> print(X)

     mean radius  mean texture  ...  worst symmetry  worst fractal dimension
0          17.99           NaN  ...          0.4601                  0.11890
1          20.57         17.77  ...          0.2750                  0.08902
2          19.69         21.25  ...          0.3613                  0.08758
3            NaN         20.38  ...          0.6638                  0.17300
4            NaN         14.34  ...          0.2364                  0.07678
..           ...           ...  ...             ...                      ...
564          NaN         22.39  ...          0.2060                  0.07115
565        20.13         28.25  ...          0.2572                  0.06637
566          NaN           NaN  ...          0.2218                  0.07820
567          NaN         29.33  ...          0.4087                  0.12400
568          NaN         24.54  ...          0.2871                  0.07039

[569 rows x 30 columns]

>>> imputer = Imputer(strat_num="median", max_nan_rows=0.1, verbose=2)
>>> X, y = imputer.fit_transform(X, y)

Fitting Imputer...
Imputing missing values...
 --> Imputing 135 missing values with median (13.42) in feature mean radius.
 --> Imputing 133 missing values with median (18.81) in feature mean texture.
 --> Imputing 129 missing values with median (86.14) in feature mean perimeter.
 --> Imputing 120 missing values with median (537.9) in feature mean area.

>>> print(X)

     mean radius  mean texture  ...  worst symmetry  worst fractal dimension
0         17.990         10.38  ...          0.4601                  0.11890
1         13.415         17.77  ...          0.2750                  0.08902
2         19.690         21.25  ...          0.3613                  0.08758
3         11.420         20.38  ...          0.6638                  0.17300
4         20.290         14.34  ...          0.2364                  0.07678
..           ...           ...  ...             ...                      ...
564       21.560         22.39  ...          0.2060                  0.07115
565       20.130         28.25  ...          0.2572                  0.06637
566       13.415         28.08  ...          0.2218                  0.07820
567       13.415         18.81  ...          0.4087                  0.12400
568        7.760         24.54  ...          0.2871                  0.07039

[569 rows x 30 columns]


Methods

fit                Fit to data.
fit_transform      Fit to data, then transform it.
get_params         Get parameters for this estimator.
inverse_transform  Does nothing.
log                Print message and save to log file.
save               Save the instance to a pickle file.
set_params         Set the parameters of this estimator.
transform          Impute the missing values.


method fit(X, y=None)[source]
Fit to data.

Parameters

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence, dataframe-like or None, default=None
Does nothing. Implemented for continuity of the API.

Returns

Imputer
Estimator instance.



method fit_transform(X=None, y=None, **fit_params)[source]
Fit to data, then transform it.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: int, str, sequence, dataframe-like or None, default=None
Target column corresponding to X.

  • If None: y is ignored.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
  • If dataframe-like: Target columns with shape=(n_samples, n_targets) for multioutput tasks.

**fit_params
Additional keyword arguments for the fit method.

Returns

dataframe
Transformed feature set. Only returned if provided.

series
Transformed target column. Only returned if provided.



method get_params(deep=True)[source]
Get parameters for this estimator.

Parameters

deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict
Parameter names mapped to their values.



method inverse_transform(X=None, y=None)[source]
Does nothing.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: int, str, sequence, dataframe-like or None, default=None
Target column corresponding to X.

  • If None: y is ignored.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • If sequence: Target column with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
  • If dataframe-like: Target columns with shape=(n_samples, n_targets) for multioutput tasks.

Returns

dataframe
Transformed feature set. Only returned if provided.

series
Transformed target column. Only returned if provided.



method log(msg, level=0, severity="info")[source]
Print message and save to log file.

Parameters

msg: int, float or str
Message to save to the logger and print to stdout.

level: int, default=0
Minimum verbosity level to print the message.

severity: str, default="info"
Severity level of the message. Choose from: debug, info, warning, error, critical.



method save(filename="auto", save_data=True)[source]
Save the instance to a pickle file.

Parameters

filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.

save_data: bool, default=True
Whether to save the dataset with the instance. This parameter is ignored if the method is not called from atom. If False, the data must be passed to the load method when the instance is loaded again.



method set_params(**params)[source]
Set the parameters of this estimator.

Parameters

**params: dict
Estimator parameters.

Returns

self: estimator instance
Estimator instance.



method transform(X, y=None)[source]
Impute the missing values.

Note that leaving y=None can lead to inconsistencies in data length between X and y if rows are dropped during the transformation.

Parameters

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, dict, sequence, dataframe-like or None, default=None
Target column corresponding to X.

  • If None: y is ignored.
  • If int: Position of the target column in X.
  • If str: Name of the target column in X.
  • If sequence: Target array with shape=(n_samples,) or sequence of column names or positions for multioutput tasks.
  • If dataframe: Target columns for multioutput tasks.

Returns

dataframe
Imputed dataframe.

series
Transformed target column. Only returned if provided.
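The note on leaving y=None in transform can be illustrated with a small standard-library sketch. It is not ATOM's implementation; `drop_nan_rows` is a hypothetical stand-in for a transformer using the "drop" strategy, and it shows why the target has to be passed along whenever rows may be removed.

```python
import math


def drop_nan_rows(X, y=None):
    """Hypothetical stand-in: drop rows with NaN, keeping y aligned if given."""
    keep = [i for i, row in enumerate(X) if not any(math.isnan(v) for v in row)]
    X_out = [X[i] for i in keep]
    if y is None:
        return X_out
    return X_out, [y[i] for i in keep]


X = [[1.0, 2.0], [float("nan"), 5.0], [3.0, 4.0]]
y = [0, 1, 1]

# Without y, the caller's target keeps its original length.
X_only = drop_nan_rows(X)
print(len(X_only), len(y))    # 2 3

# With y, the same rows are removed from both.
X_new, y_new = drop_nan_rows(X, y)
print(len(X_new), len(y_new))  # 2 2
```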