Imputer

class atom.data_cleaning.Imputer(strat_num="drop", strat_cat="drop", max_nan_rows=None, max_nan_cols=None, verbose=0, logger=None) [source]

Impute or remove missing values according to the selected strategy. Also removes rows and columns with too many missing values. Use the missing attribute to customize which values are considered missing. This class can be accessed from atom through the impute method. Read more in the user guide.

Parameters:

strat_num: str, int or float, optional (default="drop")
Imputing strategy for numerical columns. Choose from:

"drop": Drop rows containing missing values.
"mean": Impute with mean of column.
"median": Impute with median of column.
"knn": Impute using a K-Nearest Neighbors approach.
"most_frequent": Impute with most frequent value.
int or float: Impute with provided numerical value.
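
To make the difference between the statistical strategies concrete, here is a minimal pure-Python sketch of what "mean", "median", "most_frequent" and a constant value do to a single column. The helper names is_nan and impute_column are hypothetical and not part of atom; the "drop" and "knn" strategies are deliberately left out, and the real class may handle edge cases differently.

```python
import math
import statistics


def is_nan(value):
    """Return True for None or a float NaN (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_column(values, strategy):
    """Fill the missing entries of one numerical column.

    Illustrative only: this is not atom's implementation.
    """
    observed = [v for v in values if not is_nan(v)]
    if isinstance(strategy, (int, float)):
        fill = strategy  # constant provided by the user
    elif strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unsupported strategy: {strategy!r}")
    return [fill if is_nan(v) else v for v in values]


column = [1.0, 2.0, None, 2.0, 7.0]
print(impute_column(column, "mean"))           # [1.0, 2.0, 3.0, 2.0, 7.0]
print(impute_column(column, "median"))         # [1.0, 2.0, 2.0, 2.0, 7.0]
print(impute_column(column, "most_frequent"))  # [1.0, 2.0, 2.0, 2.0, 7.0]
print(impute_column(column, 0))                # [1.0, 2.0, 0, 2.0, 7.0]
```

The mean is pulled up by the outlier 7.0, while the median and the most frequent value are not, which is often the deciding factor when choosing between them.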

strat_cat: str, optional (default="drop")
Imputing strategy for categorical columns. Choose from:

"drop": Drop rows containing missing values.
"most_frequent": Impute with most frequent value.
str: Impute with provided string.

max_nan_rows: int, float or None, optional (default=None)
Maximum number or fraction of missing values in a row (if more, the row is removed). If None, ignore this step.

max_nan_cols: int, float or None, optional (default=None)
Maximum number or fraction of missing values in a column (if more, the column is removed). If None, ignore this step.
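
The sketch below illustrates how a "number or fraction" limit could be applied to rows. It assumes that a float is read as a fraction of the row length and an int as an absolute count, which is one plausible reading of the description above rather than a statement of how atom decides. The helpers max_allowed and keep_rows are hypothetical.

```python
def max_allowed(limit, size):
    """Translate a limit into an absolute count of missing values.

    Assumption: a float is a fraction of `size`, an int is a count.
    """
    if limit is None:
        return None  # step is skipped
    if isinstance(limit, float):
        return limit * size
    return limit


def keep_rows(rows, max_nan_rows):
    """Keep the rows whose number of missing values is within the limit."""
    allowed = max_allowed(max_nan_rows, len(rows[0]) if rows else 0)
    if allowed is None:
        return list(rows)
    return [r for r in rows if sum(v is None for v in r) <= allowed]


rows = [
    [1, 2, 3, 4],           # 0 missing
    [1, None, 3, 4],        # 1 missing
    [None, None, None, 4],  # 3 missing
]

print(len(keep_rows(rows, 1)))     # 2: the row with 3 missing values is removed
print(len(keep_rows(rows, 0.5)))   # 2: 0.5 * 4 columns allows at most 2 missing
print(len(keep_rows(rows, None)))  # 3: step is skipped
```

The same logic applies to max_nan_cols with rows and columns swapped.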

verbose: int, optional (default=0)
Verbosity level of the class. Possible values are:

0 to not print anything.
1 to print basic information.
2 to print detailed information.

logger: str, Logger or None, optional (default=None)

If None: Doesn't save a logging file.
If str: Name of the log file. Use "auto" for automatic naming.
Else: Python logging.Logger instance.

Tip

Use atom's nans attribute for an overview of the number of missing values per column.

Attributes

missing: list
List of values that are considered "missing". Default values are: "", "?", "None", "NA", "nan", "NaN" and "inf". Note that None, NaN, +inf and -inf are always considered missing since they are incompatible with sklearn estimators.
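
As a rough illustration of the rule described above, the following sketch checks whether a single value counts as missing. The default list is copied from the description; the function is_missing is hypothetical and only mirrors the documented behavior, it is not atom's code.

```python
import math

# Default values taken from the description of the missing attribute
DEFAULT_MISSING = ["", "?", "None", "NA", "nan", "NaN", "inf"]


def is_missing(value, missing=DEFAULT_MISSING):
    """Return True if `value` should be treated as missing."""
    # None, NaN, +inf and -inf are always missing, whatever the list says
    if value is None:
        return True
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        return True
    return value in missing


print(is_missing("?"))            # True
print(is_missing(float("-inf")))  # True
print(is_missing("unknown"))      # False

# Extending the list, as one might do through the missing attribute
custom = DEFAULT_MISSING + ["unknown", -999]
print(is_missing("unknown", custom))  # True
print(is_missing(-999, custom))       # True
```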

Methods

fit	Fit to data.
fit_transform	Fit to data, then transform it.
get_params	Get parameters for this estimator.
log	Write information to the logger and print to stdout.
save	Save the instance to a pickle file.
set_params	Set the parameters of this estimator.
transform	Transform the data.

method fit(X, y=None) [source]

Fit to data.

Parameters:

X: dict, list, tuple, np.ndarray or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
Does nothing. Implemented for continuity of the API.

Returns:

self: Imputer
Fitted instance of self.

method fit_transform(X, y=None) [source]

Fit to data, then impute the missing values. Note that leaving y=None can lead to inconsistencies in data length between X and y if rows are dropped during the transformation.

Parameters:

X: dict, list, tuple, np.ndarray or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)

If None: y is ignored.
If int: Index of the target column in X.
If str: Name of the target column in X.
Else: Target column with shape=(n_samples,).

Returns:

X: pd.DataFrame
Transformed feature set.

y: pd.Series
Transformed target column. Only returned if provided.
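
The warning about leaving y=None can be illustrated with a small pure-Python sketch. It assumes a "drop" strategy that removes every row containing a missing value; drop_missing_rows is a hypothetical stand-in for the transformation, not atom's implementation.

```python
def drop_missing_rows(X, y=None):
    """Drop the rows of X that contain None, and the matching entries of y."""
    keep = [i for i, row in enumerate(X) if all(v is not None for v in row)]
    X_out = [X[i] for i in keep]
    if y is None:
        return X_out  # y is only returned if provided
    return X_out, [y[i] for i in keep]


X = [[1, 2], [None, 4], [5, 6]]
y = [0, 1, 0]

# Without y: X shrinks but the caller's y does not
X_only = drop_missing_rows(X)
print(len(X_only), len(y))  # 2 3

# With y: both are filtered with the same row indices
X_new, y_new = drop_missing_rows(X, y)
print(len(X_new), len(y_new))  # 2 2
```

Passing y keeps the two objects aligned, which is why it is safer to provide it whenever rows may be dropped.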

method get_params(deep=True) [source]

Get parameters for this estimator.

Parameters:

deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params: dict
Dictionary of the parameter names mapped to their values.

method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.

method save(filename="auto") [source]

Save the instance to a pickle file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.

method set_params(**params) [source]

Set the parameters of this estimator.

Parameters:

**params: dict
Estimator parameters.

Returns:

self: Imputer
Estimator instance.

method transform(X, y=None) [source]

Impute the missing values. Note that leaving y=None can lead to inconsistencies in data length between X and y if rows are dropped during the transformation.

Parameters:

X: dict, list, tuple, np.ndarray or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)

If None: y is ignored.
If int: Index of the target column in X.
If str: Name of the target column in X.
Else: Target column with shape=(n_samples,).

Returns:

X: pd.DataFrame
Transformed feature set.

y: pd.Series
Transformed target column. Only returned if provided.

Example

from atom import ATOMClassifier

atom = ATOMClassifier(X, y)
atom.impute(strat_num="knn", strat_cat="drop", max_nan_cols=0.8)

or

from atom.data_cleaning import Imputer

imputer = Imputer(strat_num="knn", strat_cat="drop", max_nan_cols=0.8)
imputer.fit(X_train, y_train)
# Pass y too so X and y stay aligned if rows are dropped
X, y = imputer.transform(X, y)