Imputer
Handle missing values in the data.
Impute or remove missing values according to the selected strategy.
Also removes rows and columns with too many missing values. Use
the missing_ attribute to customize what are considered "missing
values".
This class can be accessed from atom through the impute method. Read more in the user guide.
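For example, a minimal sketch of customizing the missing-value indicators (it assumes missing_ can be assigned before fitting; the extra sentinel strings are hypothetical):
>>> from atom.data_cleaning import Imputer
>>> imputer = Imputer(strat_num="median")
>>> # Extend the default list with hypothetical project-specific placeholders
>>> imputer.missing_ = imputer.missing_ + ["N/A", "not available"]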
Parameters
strat_num: int, float, str or callable, default="mean"
Imputing strategy for numerical columns. Choose from:
strat_cat: str, default="most_frequent"
Imputing strategy for categorical columns. Choose from:
max_nan_rows: int, float or None, default=None
Maximum number or fraction of missing values in a row
(if more, the row is removed). If None, ignore this step. A
fraction is interpreted as shown in the sketch after this
parameter list.
max_nan_cols: int, float or None, default=None
Maximum number or fraction of missing values in a column
(if more, the column is removed). If None, ignore this step.
n_jobs: int, default=1
Number of cores to use for parallel processing.
device: str, default="cpu"
Device on which to run the estimators. Use any string that
follows the SYCL_DEVICE_FILTER filter selector, e.g.
device="gpu" to use the GPU. Read more in the user guide.
engine: str or None, default=None
Execution engine to use for estimators.
If None, the default value is used. Choose from:
verbose: int, default=0
Verbosity level of the class. Choose from:
random_state: int or None, default=None
Seed used by the random number generator. If None, the random
number generator is the RandomState instance used by np.random. Only
used when strat_num="iterative".
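When max_nan_rows or max_nan_cols is given as a fraction, it is relative to the number of columns (for rows) or rows (for columns). A minimal sketch of the arithmetic, not the library's internal code; the numbers match the breast cancer example below, where max_nan_rows=0.1 on 30 feature columns drops rows with more than 3 missing values:
>>> n_cols = 30              # number of feature columns in the example below
>>> max_nan_rows = 0.1       # fraction of missing values allowed per row
>>> threshold = int(max_nan_rows * n_cols)
>>> threshold                # rows with more missing values than this are dropped
3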
Attributes
missing_: list
Values that are considered "missing". Default values are: None,
NaN, NA, NaT, +inf, -inf, "", "?", "NA", "nan", "NaN", "NaT",
"none", "None", "inf", "-inf". Note that None, NaN, NA, +inf and
-inf are always considered missing since they are incompatible
with sklearn estimators.
feature_names_in_: np.ndarray
Names of features seen during fit.
n_features_in_: int
Number of features seen during fit.
See Also
Example
>>> import numpy as np
>>> from atom import ATOMClassifier
>>> from numpy.random import randint
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> # Add some random missing values to the data
>>> for i, j in zip(randint(0, X.shape[0], 600), randint(0, 4, 600)):
... X.iloc[i, j] = np.nan
>>> atom = ATOMClassifier(X, y, random_state=1)
>>> print(atom.nans)
mean radius 130
mean texture 141
mean perimeter 124
mean area 136
mean smoothness 0
mean compactness 0
mean concavity 0
mean concave points 0
mean symmetry 0
mean fractal dimension 0
radius error 0
texture error 0
perimeter error 0
area error 0
smoothness error 0
compactness error 0
concavity error 0
concave points error 0
symmetry error 0
fractal dimension error 0
worst radius 0
worst texture 0
worst perimeter 0
worst area 0
worst smoothness 0
worst compactness 0
worst concavity 0
worst concave points 0
worst symmetry 0
worst fractal dimension 0
target 0
dtype: int64
>>> atom.impute(strat_num="median", max_nan_rows=0.1, verbose=2)
Fitting Imputer...
Imputing missing values...
--> Imputing 130 missing values with median (13.27) in column mean radius.
--> Imputing 141 missing values with median (18.87) in column mean texture.
--> Imputing 124 missing values with median (85.66) in column mean perimeter.
--> Imputing 136 missing values with median (555.1) in column mean area.
>>> print(atom.n_nans)
0
>>> import numpy as np
>>> from atom.data_cleaning import Imputer
>>> from numpy.random import randint
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> # Add some random missing values to the data
>>> for i, j in zip(randint(0, X.shape[0], 600), randint(0, 4, 600)):
... X.iloc[i, j] = np.nan
>>> imputer = Imputer(strat_num="median", max_nan_rows=0.1, verbose=2)
>>> X, y = imputer.fit_transform(X, y)
Fitting Imputer...
Imputing missing values...
--> Dropping 2 samples for containing more than 3 missing values.
--> Imputing 124 missing values with median (13.38) in column mean radius.
--> Imputing 127 missing values with median (18.87) in column mean texture.
--> Imputing 137 missing values with median (86.54) in column mean perimeter.
--> Imputing 134 missing values with median (561.3) in column mean area.
>>> print(X)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 13.38 10.380 122.800 1001.0 0.11840 0.27760 0.30010 0.14710 0.2419 ... 17.33 184.60 2019.0 0.16220 0.66560 0.7119 0.2654 0.4601 0.11890
1 20.57 17.770 86.545 561.3 0.08474 0.07864 0.08690 0.07017 0.1812 ... 23.41 158.80 1956.0 0.12380 0.18660 0.2416 0.1860 0.2750 0.08902
2 19.69 21.250 130.000 1203.0 0.10960 0.15990 0.19740 0.12790 0.2069 ... 25.53 152.50 1709.0 0.14440 0.42450 0.4504 0.2430 0.3613 0.08758
3 11.42 20.380 77.580 386.1 0.14250 0.28390 0.24140 0.10520 0.2597 ... 26.50 98.87 567.7 0.20980 0.86630 0.6869 0.2575 0.6638 0.17300
4 13.38 14.340 135.100 1297.0 0.10030 0.13280 0.19800 0.10430 0.1809 ... 16.67 152.20 1575.0 0.13740 0.20500 0.4000 0.1625 0.2364 0.07678
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
564 21.56 22.390 86.545 561.3 0.11100 0.11590 0.24390 0.13890 0.1726 ... 26.40 166.10 2027.0 0.14100 0.21130 0.4107 0.2216 0.2060 0.07115
565 20.13 18.865 131.200 1261.0 0.09780 0.10340 0.14400 0.09791 0.1752 ... 38.25 155.00 1731.0 0.11660 0.19220 0.3215 0.1628 0.2572 0.06637
566 13.38 28.080 86.545 561.3 0.08455 0.10230 0.09251 0.05302 0.1590 ... 34.12 126.70 1124.0 0.11390 0.30940 0.3403 0.1418 0.2218 0.07820
567 20.60 29.330 140.100 1265.0 0.11780 0.27700 0.35140 0.15200 0.2397 ... 39.42 184.60 1821.0 0.16500 0.86810 0.9387 0.2650 0.4087 0.12400
568 13.38 24.540 47.920 181.0 0.05263 0.04362 0.00000 0.00000 0.1587 ... 30.37 59.16 268.6 0.08996 0.06444 0.0000 0.0000 0.2871 0.07039
[567 rows x 30 columns]
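As a follow-up sketch, the fitted imputer can be reused on unseen data so the same medians and drop thresholds are applied; the "new" rows below are hypothetical, taken from the transformed data for illustration:
>>> X_new = X.head(5).copy()      # hypothetical new data with the same columns
>>> X_new.iloc[0, 0] = np.nan     # introduce a missing value
>>> X_new = imputer.transform(X_new)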
Methods
fit | Fit to data. |
fit_transform | Fit to data, then transform it. |
get_feature_names_out | Get output feature names for transformation. |
get_params | Get parameters for this estimator. |
inverse_transform | Do nothing. |
set_output | Set output container. |
set_params | Set the parameters of this estimator. |
transform | Impute the missing values. |
fit
Fit to data.
fit_transform
Fit to data, then transform it.
get_feature_names_out
Get output feature names for transformation.
Parameters
input_features: sequence or None, default=None
Only used to validate feature names with the names seen in fit.
Returns
np.ndarray
Transformed feature names.
get_params
Get parameters for this estimator.
Parameters
deep: bool, default=True
If True, will return the parameters for this estimator and
contained subobjects that are estimators.
Returns
params: dict
Parameter names mapped to their values.
inverse_transform
Do nothing.
Returns the input unchanged. Implemented for continuity of the API.
set_output
Set output container.
See sklearn's user guide on how to use the set_output API. See here a description of the choices.
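For example, a small sketch using the set_output API ("pandas" is one of the standard sklearn output containers):
>>> imputer = Imputer(strat_num="median")
>>> imputer.set_output(transform="pandas")  # request pandas DataFrames from transform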
set_params
Set the parameters of this estimator.
Parameters
**params: dict
Estimator parameters.
Returns
self: estimator instance
Estimator instance.
transform
Impute the missing values.
Note that leaving y=None can lead to inconsistencies in data length between X and y if rows are dropped during the transformation.
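A short sketch of that pitfall (it assumes fit_transform returns only X when y is not given): passing y along keeps the rows of X and y aligned when samples are dropped.
>>> # Without y: rows dropped from X leave y at its original length
>>> Xt = imputer.fit_transform(X)
>>> # With y: the same rows are removed from both, keeping them aligned
>>> Xt, yt = imputer.fit_transform(X, y)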