Imputer


class atom.data_cleaning.Imputer(strat_num="mean", strat_cat="most_frequent", max_nan_rows=None, max_nan_cols=None, n_jobs=1, device="cpu", engine=None, verbose=0, random_state=None)[source]

Handle missing values in the data.

Impute or remove missing values according to the selected strategy. Also removes rows and columns with too many missing values. Use the missing_ attribute to customize which values are considered missing.

This class can be accessed from atom through the impute method. Read more in the user guide.

Parameters strat_num: int, float, str or callable, default="mean"
Imputing strategy for numerical columns. Choose from:

  • "drop": Drop rows containing missing values.
  • "mean": Impute with mean of column.
  • "median": Impute with median of column.
  • "most_frequent": Impute with the most frequent value.
  • "knn": Impute using a K-Nearest Neighbors approach.
  • "iterative": Impute using a multivariate imputer.
  • "drift": Impute values using a PolynomialTrend model.
  • "linear": Impute using linear interpolation.
  • "nearest": Impute with nearest value.
  • "bfill": Impute by using the next valid observation to fill the gap.
  • "ffill": Impute by propagating the last valid observation forward to the next valid one.
  • "random": Impute with random values between the min and max of column.
  • int or float: Impute with provided numerical value.
  • callable: Replace missing values using the scalar statistic returned by running the callable over a dense 1d array containing non-missing values of each column.
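
As a rough illustration of what several of these strategies compute, the sketch below applies their plain pandas equivalents to a toy column. It is not ATOM's implementation: the series `s` and the dictionary `imputed` are hypothetical names, and the model-based strategies ("knn", "iterative", "drift") as well as "nearest" and "random" are left out.

```python
import numpy as np
import pandas as pd

# Toy numerical column with two gaps (hypothetical data).
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0, 5.0])

# Callable strategy: a function that maps the dense 1d array of
# non-missing values to a single scalar (np.max is just an example).
stat = np.max(s.dropna().to_numpy())

imputed = {
    "drop": s.dropna(),
    "mean": s.fillna(s.mean()),
    "median": s.fillna(s.median()),
    "most_frequent": s.fillna(s.mode().iloc[0]),
    "linear": s.interpolate(method="linear"),
    "bfill": s.bfill(),
    "ffill": s.ffill(),
    "constant": s.fillna(0.0),  # int or float strategy
    "callable": s.fillna(stat),
}

for name, col in imputed.items():
    print(f"{name:>14}: {col.tolist()}")
```

Note that "bfill" and "ffill" depend on row order, so they are mainly meaningful for ordered data such as time series.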

strat_cat: str, default="most_frequent"
Imputing strategy for categorical columns. Choose from:

  • "drop": Drop rows containing missing values.
  • "most_frequent": Impute with the most frequent value.
  • str: Impute with provided string.
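
The categorical strategies can be pictured with the pandas sketch below. It only mirrors the described behavior and is not ATOM's implementation; the column `c` and the fill string "missing" are hypothetical.

```python
import pandas as pd

# Toy categorical column with two gaps (hypothetical data).
c = pd.Series(["a", None, "b", "a", None], dtype="object")

dropped = c.dropna()                          # "drop"
most_frequent = c.fillna(c.mode().iloc[0])    # "most_frequent"
constant = c.fillna("missing")                # any other string

print(most_frequent.tolist())
print(constant.tolist())
```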

max_nan_rows: int, float or None, default=None
Maximum number or fraction of missing values in a row (if more, the row is removed). If None, ignore this step.

max_nan_cols: int, float or None, default=None
Maximum number or fraction of missing values in a column (if more, the column is removed). If None, ignore this step.
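
The sketch below shows one way such thresholds could be resolved and applied with pandas. The helper `resolve_threshold` is hypothetical, and its rule (floats are fractions, ints are absolute counts, fractions are truncated to an integer) is an assumption rather than ATOM's documented behavior. It is consistent with the example further down, where max_nan_rows=0.1 on 30 columns drops rows with more than 3 missing values.

```python
import numpy as np
import pandas as pd


def resolve_threshold(max_nan, size):
    """Turn a count or fraction into an absolute count (assumed rule)."""
    if max_nan is None:
        return None  # step is skipped
    if isinstance(max_nan, float):
        return int(max_nan * size)  # fraction of the row/column length
    return int(max_nan)


df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": [1.0, np.nan, 3.0, 4.0],
        "c": [1.0, np.nan, 3.0, 4.0],
        "d": [1.0, 2.0, np.nan, 4.0],
    }
)

# Rows: a fraction of the number of columns (0.5 * 4 columns -> 2).
row_limit = resolve_threshold(0.5, df.shape[1])
kept_rows = df[df.isna().sum(axis=1) <= row_limit]

# Columns: an absolute count of missing values.
col_limit = resolve_threshold(1, df.shape[0])
kept_cols = df.loc[:, df.isna().sum(axis=0) <= col_limit]

print(kept_rows.index.tolist())
print(kept_cols.columns.tolist())
```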

n_jobs: int, default=1
Number of cores to use for parallel processing.

  • If >0: Number of cores to use.
  • If -1: Use all available cores.
  • If <-1: Use number of cores + 1 + value, e.g. -2 uses all cores but one.

device: str, default="cpu"
Device on which to run the estimators. Use any string that follows the SYCL_DEVICE_FILTER filter selector, e.g. device="gpu" to use the GPU. Read more in the user guide.

engine: str or None, default=None
Execution engine to use for estimators. If None, the default value is used. Choose from:

  • "sklearn" (default)
  • "cuml"

verbose: int, default=0
Verbosity level of the class. Choose from:

  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.

random_state: int or None, default=None
Seed used by the random number generator. If None, the random number generator is the RandomState used by np.random. Only used when strat_num="iterative".

Attributes missing_: list
Values that are considered "missing". Default values are: None, NaN, NA, NaT, +inf, -inf, "", "?", "NA", "nan", "NaN", "NaT", "none", "None", "inf", "-inf". Note that None, NaN, NA, +inf and -inf are always considered missing since they are incompatible with sklearn estimators.
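
To make the idea of "missing" markers concrete, the pandas sketch below converts a few of the default markers to NaN so that a later imputation step can treat them uniformly. This is an illustration only, not ATOM's internal logic; the frame `df` and the shortened marker lists are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "age": [25.0, np.inf, 31.0, -np.inf],
        "city": ["Paris", "?", "", "Rome"],
    }
)

numeric_markers = [np.inf, -np.inf]
string_markers = ["", "?", "NA", "nan", "NaN", "none", "None"]

# Replace every marker with NaN, first the numeric ones, then the strings.
cleaned = df.replace(numeric_markers, np.nan).replace(string_markers, np.nan)

print(cleaned.isna().sum())
```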

feature_names_in_: np.ndarray
Names of features seen during fit.

n_features_in_: int
Number of features seen during fit.


See Also

Balancer

Balance the number of samples per class in the target column.

Discretizer

Bin continuous data into intervals.

Encoder

Perform encoding of categorical features.


Example

>>> import numpy as np
>>> from atom import ATOMClassifier
>>> from numpy.random import randint
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)

>>> # Add some random missing values to the data
>>> for i, j in zip(randint(0, X.shape[0], 600), randint(0, 4, 600)):
...     X.iloc[i, j] = np.nan

>>> atom = ATOMClassifier(X, y, random_state=1)
>>> print(atom.nans)

mean radius                130
mean texture               141
mean perimeter             124
mean area                  136
mean smoothness              0
mean compactness             0
mean concavity               0
mean concave points          0
mean symmetry                0
mean fractal dimension       0
radius error                 0
texture error                0
perimeter error              0
area error                   0
smoothness error             0
compactness error            0
concavity error              0
concave points error         0
symmetry error               0
fractal dimension error      0
worst radius                 0
worst texture                0
worst perimeter              0
worst area                   0
worst smoothness             0
worst compactness            0
worst concavity              0
worst concave points         0
worst symmetry               0
worst fractal dimension      0
target                       0
dtype: int64

>>> atom.impute(strat_num="median", max_nan_rows=0.1, verbose=2)

Fitting Imputer...
Imputing missing values...
 --> Imputing 130 missing values with median (13.27) in column mean radius.
 --> Imputing 141 missing values with median (18.87) in column mean texture.
 --> Imputing 124 missing values with median (85.66) in column mean perimeter.
 --> Imputing 136 missing values with median (555.1) in column mean area.

>>> print(atom.n_nans)

0

The class can also be used stand-alone:

>>> import numpy as np
>>> from atom.data_cleaning import Imputer
>>> from numpy.random import randint
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)

>>> # Add some random missing values to the data
>>> for i, j in zip(randint(0, X.shape[0], 600), randint(0, 4, 600)):
...     X.iloc[i, j] = np.nan

>>> imputer = Imputer(strat_num="median", max_nan_rows=0.1, verbose=2)
>>> X, y = imputer.fit_transform(X, y)

Fitting Imputer...
Imputing missing values...
 --> Dropping 2 samples for containing more than 3 missing values.
 --> Imputing 124 missing values with median (13.38) in column mean radius.
 --> Imputing 127 missing values with median (18.87) in column mean texture.
 --> Imputing 137 missing values with median (86.54) in column mean perimeter.
 --> Imputing 134 missing values with median (561.3) in column mean area.

>>> print(X)

     mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  ...  worst texture  worst perimeter  worst area  worst smoothness  worst compactness  worst concavity  worst concave points  worst symmetry  worst fractal dimension
0          13.38        10.380         122.800     1001.0          0.11840           0.27760         0.30010              0.14710         0.2419  ...          17.33           184.60      2019.0           0.16220            0.66560           0.7119                0.2654          0.4601                  0.11890
1          20.57        17.770          86.545      561.3          0.08474           0.07864         0.08690              0.07017         0.1812  ...          23.41           158.80      1956.0           0.12380            0.18660           0.2416                0.1860          0.2750                  0.08902
2          19.69        21.250         130.000     1203.0          0.10960           0.15990         0.19740              0.12790         0.2069  ...          25.53           152.50      1709.0           0.14440            0.42450           0.4504                0.2430          0.3613                  0.08758
3          11.42        20.380          77.580      386.1          0.14250           0.28390         0.24140              0.10520         0.2597  ...          26.50            98.87       567.7           0.20980            0.86630           0.6869                0.2575          0.6638                  0.17300
4          13.38        14.340         135.100     1297.0          0.10030           0.13280         0.19800              0.10430         0.1809  ...          16.67           152.20      1575.0           0.13740            0.20500           0.4000                0.1625          0.2364                  0.07678
..           ...           ...             ...        ...              ...               ...             ...                  ...            ...  ...            ...              ...         ...               ...                ...              ...                   ...             ...                      ...
564        21.56        22.390          86.545      561.3          0.11100           0.11590         0.24390              0.13890         0.1726  ...          26.40           166.10      2027.0           0.14100            0.21130           0.4107                0.2216          0.2060                  0.07115
565        20.13        18.865         131.200     1261.0          0.09780           0.10340         0.14400              0.09791         0.1752  ...          38.25           155.00      1731.0           0.11660            0.19220           0.3215                0.1628          0.2572                  0.06637
566        13.38        28.080          86.545      561.3          0.08455           0.10230         0.09251              0.05302         0.1590  ...          34.12           126.70      1124.0           0.11390            0.30940           0.3403                0.1418          0.2218                  0.07820
567        20.60        29.330         140.100     1265.0          0.11780           0.27700         0.35140              0.15200         0.2397  ...          39.42           184.60      1821.0           0.16500            0.86810           0.9387                0.2650          0.4087                  0.12400
568        13.38        24.540          47.920      181.0          0.05263           0.04362         0.00000              0.00000         0.1587  ...          30.37            59.16       268.6           0.08996            0.06444           0.0000                0.0000          0.2871                  0.07039

[567 rows x 30 columns]


Methods

fit                      Fit to data.
fit_transform            Fit to data, then transform it.
get_feature_names_out    Get output feature names for transformation.
get_params               Get parameters for this estimator.
inverse_transform        Do nothing.
set_output               Set output container.
set_params               Set the parameters of this estimator.
transform                Impute the missing values.


method fit(X, y=None)[source]

Fit to data.

Parameters X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: sequence, dataframe-like or None, default=None
Do nothing. Implemented for continuity of the API.

Returns Self
Estimator instance.



method fit_transform(X=None, y=None, **fit_params)[source]

Fit to data, then transform it.

Parameters X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X. If None, y is ignored.

**fit_params
Additional keyword arguments for the fit method.

Returns dataframe
Transformed feature set. Only returned if provided.

series or dataframe
Transformed target column. Only returned if provided.



method get_feature_names_out(input_features=None)[source]

Get output feature names for transformation.

Parameters input_features: sequence or None, default=None
Only used to validate feature names with the names seen in fit.

Returns np.ndarray
Transformed feature names.



method get_params(deep=True)[source]

Get parameters for this estimator.

Parameters deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns params: dict
Parameter names mapped to their values.



method inverse_transform(X=None, y=None, **fit_params)[source]

Do nothing.

Returns the input unchanged. Implemented for continuity of the API.

Parameters X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X. If None, y is ignored.

Returns dataframe
Feature set. Only returned if provided.

series or dataframe
Target column(s). Only returned if provided.



method set_output(transform=None)[source]

Set output container.

See sklearn's user guide on how to use the set_output API. The available choices are listed under the transform parameter.

Parameters transform: str or None, default=None
Configure the output of the transform, fit_transform, and inverse_transform methods. If None, the configuration is not changed. Choose from:

  • "numpy"
  • "pandas" (default)
  • "pandas-pyarrow"
  • "polars"
  • "polars-lazy"
  • "pyarrow"
  • "modin"
  • "dask"
  • "pyspark"
  • "pyspark-pandas"

Returns Self
Estimator instance.



method set_params(**params)[source]

Set the parameters of this estimator.

Parameters **params: dict
Estimator parameters.

Returns self: estimator instance
Estimator instance.
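
get_params and set_params follow the standard sklearn estimator interface. The sketch below shows that interface on sklearn's SimpleImputer, used purely as a stand-in. For this class the keys would be its own constructor parameters, such as strat_num.

```python
from sklearn.impute import SimpleImputer

est = SimpleImputer(strategy="median")

# get_params returns a dict of constructor parameters.
params = est.get_params()
print(params["strategy"])

# set_params updates parameters in place and returns the estimator.
same = est.set_params(strategy="most_frequent")
print(est.get_params()["strategy"])
```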



method transform(X, y=None)[source]

Impute the missing values.

Note that leaving y=None can lead to inconsistencies in data length between X and y if rows are dropped during the transformation.

Parameters X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X.

Returns dataframe
Imputed dataframe.

series or dataframe
Transformed target column. Only returned if provided.
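
The note about y=None can be illustrated with plain pandas. In the sketch below, `dropna` stands in for a transformation that removes rows; it is not ATOM's code, and the data is hypothetical.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, 4.0], "b": [1.0, np.nan, 3.0, np.nan]}
)
y = pd.Series([0, 1, 0, 1], name="target")

# Stand-in for a transform that drops rows with missing values.
X_out = X.dropna()

# If y is not passed along, it keeps its original length.
print(len(X_out), len(y))

# Passing y lets the same rows be removed from the target.
y_out = y.loc[X_out.index]
print(len(X_out), len(y_out))
```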