Imputer
class atom.data_cleaning.Imputer(strat_num="drop", strat_cat="drop", max_nan_rows=None, max_nan_cols=None, device="cpu", engine="sklearn", verbose=0, logger=None)[source]
Handle missing values in the data.
Impute or remove missing values according to the selected strategy.
Also removes rows and columns with too many missing values. Use
the missing attribute to customize what are considered "missing
values".
This class can be accessed from atom through the impute method. Read more in the user guide.
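To make the difference between removing and imputing concrete, here is a minimal pure-Python sketch of two of the strategies named on this page ("drop" and "median") applied to a single numerical column. The helper `impute_column` is hypothetical and written only for illustration; it is not atom's implementation, and it treats only None and NaN as missing.

```python
import math
from statistics import median


def _is_nan(value):
    """True for None and float NaN, the only markers used in this sketch."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_column(values, strategy="drop"):
    """Apply one strategy to a single numerical column.

    Hypothetical helper for illustration; it is not part of atom.
    Returns the new column and the indices of the rows that were kept.
    """
    missing = [_is_nan(v) for v in values]
    if strategy == "drop":
        # Remove every row that has a missing value in this column.
        kept = [i for i, m in enumerate(missing) if not m]
        return [values[i] for i in kept], kept
    if strategy == "median":
        # Replace missing values with the median of the observed ones.
        fill = median(v for v, m in zip(values, missing) if not m)
        filled = [fill if m else v for v, m in zip(values, missing)]
        return filled, list(range(len(values)))
    raise ValueError(f"Unknown strategy: {strategy!r}")


column = [17.99, float("nan"), 19.69, None, 20.29]
dropped, kept_rows = impute_column(column, "drop")
filled, _ = impute_column(column, "median")
```

With "drop" the column shrinks to the three observed values, while "median" keeps all five rows and fills the two gaps with 19.69.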
Parameters | strat_num: str, int or float, default="drop"
Imputing strategy for numerical columns. Choose from:
strat_cat: str, default="drop"
Imputing strategy for categorical columns. Choose from:
max_nan_rows: int, float or None, default=None
Maximum number or fraction of missing values in a row
(if more, the row is removed). If None, ignore this step.
max_nan_cols: int, float or None, default=None
Maximum number or fraction of missing values in a column
(if more, the column is removed). If None, ignore this step.
device: str, default="cpu"
Device on which to train the estimators. Use any string
that follows the SYCL_DEVICE_FILTER filter selector,
e.g. device="gpu" to use the GPU. Read more in the
user guide.
engine: str, default="sklearn"
Execution engine to use for the estimators. Refer to the
user guide for an explanation
regarding every choice. Choose from:
verbose: int, default=0
Verbosity level of the class. Choose from:
logger: str, Logger or None, default=None
|
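The max_nan_rows and max_nan_cols parameters accept either a count or a fraction. The sketch below shows one plausible way to resolve such a limit, assuming that a float is read as a fraction of the row length (for max_nan_rows) and that a row is removed only when it holds more missing values than the limit. This assumption matches the example on this page, where max_nan_rows=0.1 on 30 columns drops rows with more than 3 missing values, but the helpers `resolve_limit` and `drop_sparse_rows` are hypothetical and not part of atom. None stands in for a missing value.

```python
def resolve_limit(limit, size):
    """Turn an int (absolute count) or float (fraction of `size`) into a count.

    Assumption for this sketch: floats are fractions, ints are counts.
    """
    if limit is None:
        return None
    if isinstance(limit, float):
        return int(limit * size)
    return limit


def drop_sparse_rows(rows, max_nan_rows=None):
    """Keep the rows whose number of missing values does not exceed the limit."""
    if not rows:
        return rows
    limit = resolve_limit(max_nan_rows, len(rows[0]))
    if limit is None:
        # None means the step is skipped entirely.
        return rows
    return [row for row in rows if sum(v is None for v in row) <= limit]


rows = [
    [1.0, 2.0, 3.0, 4.0],
    [None, None, None, 4.0],
    [1.0, None, None, 4.0],
    [None, 2.0, 3.0, 4.0],
]
by_fraction = drop_sparse_rows(rows, 0.5)  # limit = 2 missing values per row
by_count = drop_sparse_rows(rows, 1)       # limit = 1 missing value per row
untouched = drop_sparse_rows(rows, None)   # step skipped
```

The same logic would apply to max_nan_cols with the roles of rows and columns swapped.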
Attributes | missing: list
Values that are considered "missing". Default values are: "",
"?", "None", "NA", "nan", "NaN" and "inf". Note that None,
NaN, +inf and -inf are always considered missing since
they are incompatible with sklearn estimators.
feature_names_in_: np.array
Names of features seen during fit.
n_features_in_: int
Number of features seen during fit.
|
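The effect of the missing attribute can be pictured as a normalization pass that maps every configured marker to a single "missing" value before imputation. The following sketch is an illustration under that assumption, not atom's code: `is_missing` and `normalize` are hypothetical helpers, the default markers are copied from the list above, and None is used as the unified missing value.

```python
import math

# Default markers as listed in the attribute description above.
DEFAULT_MISSING = ("", "?", "None", "NA", "nan", "NaN", "inf")


def is_missing(value, missing=DEFAULT_MISSING):
    """Check a single value against the always-missing values and the markers."""
    if value is None:
        return True
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        # NaN, +inf and -inf are always treated as missing.
        return True
    return value in missing


def normalize(column, missing=DEFAULT_MISSING):
    """Replace every missing marker in a column with None."""
    return [None if is_missing(v, missing) else v for v in column]


raw = ["1.5", "?", "N/A", float("inf"), None, 2.0]
with_defaults = normalize(raw)
with_custom = normalize(raw, missing=DEFAULT_MISSING + ("N/A",))
```

With the defaults, "N/A" survives as a regular value; adding it to the marker list makes it count as missing, which is the kind of customization the attribute is meant for.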
See Also
Balance the number of samples per class in the target column.
Bin continuous data into intervals.
Perform encoding of categorical features.
Example
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> import numpy as np
>>> from numpy.random import randint
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> # Add some random missing values to the data
>>> for i, j in zip(randint(0, X.shape[0], 600), randint(0, 4, 600)):
...     X.iat[i, j] = np.nan
>>> atom = ATOMClassifier(X, y)
>>> print(atom.nans)
mean radius 118
mean texture 134
mean perimeter 135
mean area 140
dtype: int64
>>> atom.impute(strat_num="median", max_nan_rows=0.1, verbose=2)
Fitting Imputer...
Imputing missing values...
--> Dropping 3 samples for containing more than 3 missing values.
--> Imputing 115 missing values with median (13.3) in feature mean radius.
--> Imputing 131 missing values with median (18.8) in feature mean texture.
--> Imputing 132 missing values with median (85.86) in feature mean perimeter.
--> Imputing 137 missing values with median (561.3) in feature mean area.
>>> print(atom.n_nans)
0
>>> from atom.data_cleaning import Imputer
>>> from sklearn.datasets import load_breast_cancer
>>> import numpy as np
>>> from numpy.random import randint
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> # Add some random missing values to the data
>>> for i, j in zip(randint(0, X.shape[0], 600), randint(0, 4, 600)):
...     X.iloc[i, j] = np.nan
>>> print(X)
mean radius mean texture ... worst symmetry worst fractal dimension
0 17.99 NaN ... 0.4601 0.11890
1 20.57 17.77 ... 0.2750 0.08902
2 19.69 21.25 ... 0.3613 0.08758
3 NaN 20.38 ... 0.6638 0.17300
4 NaN 14.34 ... 0.2364 0.07678
.. ... ... ... ... ...
564 NaN 22.39 ... 0.2060 0.07115
565 20.13 28.25 ... 0.2572 0.06637
566 NaN NaN ... 0.2218 0.07820
567 NaN 29.33 ... 0.4087 0.12400
568 NaN 24.54 ... 0.2871 0.07039
[569 rows x 30 columns]
>>> imputer = Imputer(strat_num="median", max_nan_rows=0.1, verbose=2)
>>> X, y = imputer.fit_transform(X, y)
Fitting Imputer...
Imputing missing values...
--> Imputing 135 missing values with median (13.42) in feature mean radius.
--> Imputing 133 missing values with median (18.81) in feature mean texture.
--> Imputing 129 missing values with median (86.14) in feature mean perimeter.
--> Imputing 120 missing values with median (537.9) in feature mean area.
>>> print(X)
mean radius mean texture ... worst symmetry worst fractal dimension
0 17.990 10.38 ... 0.4601 0.11890
1 13.415 17.77 ... 0.2750 0.08902
2 19.690 21.25 ... 0.3613 0.08758
3 11.420 20.38 ... 0.6638 0.17300
4 20.290 14.34 ... 0.2364 0.07678
.. ... ... ... ... ...
564 21.560 22.39 ... 0.2060 0.07115
565 20.130 28.25 ... 0.2572 0.06637
566 13.415 28.08 ... 0.2218 0.07820
567 13.415 18.81 ... 0.4087 0.12400
568 7.760 24.54 ... 0.2871 0.07039
[569 rows x 30 columns]
Methods
fit | Fit to data. |
fit_transform | Fit to data, then transform it. |
get_params | Get parameters for this estimator. |
inverse_transform | Does nothing. |
log | Print message and save to log file. |
save | Save the instance to a pickle file. |
set_params | Set the parameters of this estimator. |
transform | Impute the missing values. |
method fit(X, y=None)[source]
Fit to data.
method fit_transform(X=None, y=None, **fit_params)[source]
Fit to data, then transform it.
method get_params(deep=True)[source]
Get parameters for this estimator.
Parameters | deep : bool, default=True
If True, will return the parameters for this estimator and
contained subobjects that are estimators.
|
Returns | params : dict
Parameter names mapped to their values.
|
method inverse_transform(X=None, y=None)[source]
Does nothing.
method log(msg, level=0, severity="info")[source]
Print message and save to log file.
method save(filename="auto", save_data=True)[source]
Save the instance to a pickle file.
Parameters | filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.
save_data: bool, default=True
Whether to save the dataset with the instance. This
parameter is ignored if the method is not called from
atom. If False, remember to add the data to ATOMLoader
when loading the file.
|
method set_params(**params)[source]
Set the parameters of this estimator.
Parameters | **params : dict
Estimator parameters.
|
Returns | self : estimator instance
Estimator instance.
|
method transform(X, y=None)[source]
Impute the missing values.
Note that leaving y=None can lead to inconsistencies in data length between X and y if rows are dropped during the transformation.
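The length mismatch mentioned in the note can be reproduced with a small pure-Python sketch. The function `drop_rows_with_missing` is a hypothetical stand-in for a transformer that removes rows; it is not atom's transform method, and None stands in for a missing value.

```python
def drop_rows_with_missing(X, y=None):
    """Drop every row of X that contains a missing value (None).

    If y is given, the matching targets are dropped as well, so both
    outputs stay aligned. If y is omitted, only X can be filtered.
    """
    keep = [i for i, row in enumerate(X) if all(v is not None for v in row)]
    X_new = [X[i] for i in keep]
    if y is None:
        return X_new
    return X_new, [y[i] for i in keep]


X = [[1, 2], [None, 3], [4, 5]]
y = [0, 1, 0]

# Without y: X loses a row, but the caller's y still has three entries.
X_only = drop_rows_with_missing(X)
mismatch = len(X_only) != len(y)

# With y: both are filtered together and stay the same length.
X_new, y_new = drop_rows_with_missing(X, y)
```

Passing y alongside X is therefore the safer choice whenever a strategy or threshold may drop rows.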