Pruner

class atom.data_cleaning.Pruner(strategy="z-score", method="drop", max_sigma=3, include_target=False, verbose=0, logger=None, **kwargs) [source]

Replace or remove outliers. The definition of an outlier depends on the selected strategy, and the strategies can differ greatly from one another. Categorical columns are ignored. This class can be accessed from atom through the prune method. Read more in the user guide.

Parameters:

strategy: str or sequence, optional (default="z-score")
Strategy with which to select the outliers. If a sequence of strategies is provided, only samples marked as outliers by all chosen strategies are dropped. Choose from:

"z-score": Uses the z-score of each data value.
"iForest": Uses an Isolation Forest.
"EE": Uses an Elliptic Envelope.
"LOF": Uses a Local Outlier Factor.
"SVM": Uses a One-class SVM.
"DBSCAN": Uses DBSCAN clustering.
"OPTICS": Uses OPTICS clustering.
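
The rule for a sequence of strategies (a sample is dropped only if every strategy marks it) can be sketched as a logical AND over per-strategy flags. This is a minimal illustration in plain Python, not the class's implementation; the helper combine_flags and the flag values are hypothetical.

```python
def combine_flags(*flags_per_strategy):
    """Mark a sample as an outlier only if every strategy flags it."""
    return [all(flags) for flags in zip(*flags_per_strategy)]


# Hypothetical per-sample outlier flags from two strategies
zscore_flags = [False, True, True, False]
iforest_flags = [False, True, False, True]

combined = combine_flags(zscore_flags, iforest_flags)

# Indices of the samples that would be kept
kept = [i for i, flag in enumerate(combined) if not flag]
print(combined, kept)
```

Only the second sample is flagged by both strategies, so it is the only one removed.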

method: int, float or str, optional (default="drop")
Method to apply on the outliers. Only the z-score strategy accepts a method other than "drop". Choose from:

"drop": Drop any sample with outlier values.
"min_max": Replace the outlier with the min or max of the column.
Any numerical value with which to replace the outliers.
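
The three methods can be sketched for a single column as follows. This is an illustration only: the helper apply_method is hypothetical, it works on one list of values rather than a dataframe (the real "drop" removes whole samples), and it assumes that "min_max" takes the min and max over the non-outlier values, which this page does not specify.

```python
def apply_method(values, is_outlier, method="drop"):
    """Handle the flagged values of one column according to method."""
    pairs = list(zip(values, is_outlier))
    if method == "drop":
        # Remove every flagged value
        return [v for v, flag in pairs if not flag]
    if method == "min_max":
        # Assumption: bounds are computed from the non-outlier values
        inliers = [v for v, flag in pairs if not flag]
        lo, hi = min(inliers), max(inliers)
        # Clamp flagged values to the inlier range
        return [min(max(v, lo), hi) if flag else v for v, flag in pairs]
    if isinstance(method, (int, float)):
        # Replace every flagged value with the given number
        return [method if flag else v for v, flag in pairs]
    raise ValueError(f"Unknown method: {method!r}")


values = [1, 2, 3, 50, -40]
is_outlier = [False, False, False, True, True]

dropped = apply_method(values, is_outlier, "drop")
clamped = apply_method(values, is_outlier, "min_max")
replaced = apply_method(values, is_outlier, 0)
```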

max_sigma: int or float, optional (default=3)
Maximum allowed number of standard deviations from the mean of the column. Values further from the mean are considered outliers. Only if strategy="z-score".
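
As a rough illustration of the max_sigma rule, the sketch below flags values whose absolute z-score exceeds the threshold. It uses plain Python and the population standard deviation; the helper zscore_outliers and the sample data are hypothetical, and the class's actual computation may differ.

```python
from statistics import mean, pstdev


def zscore_outliers(values, max_sigma=3):
    """Flag values more than max_sigma standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # A constant column has no outliers under this rule
        return [False] * len(values)
    return [abs((v - mu) / sigma) > max_sigma for v in values]


column = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 100]
flags = zscore_outliers(column, max_sigma=3)  # only the last value (100) is flagged
```

Lowering max_sigma makes the rule stricter, so more values are treated as outliers.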

include_target: bool, optional (default=False)
Whether to include the target column in the search for outliers. This can be useful for regression tasks. Only if strategy="z-score".

verbose: int, optional (default=0)
Verbosity level of the class. Possible values are:

0 to not print anything.
1 to print basic information.
2 to print detailed information.

logger: str, Logger or None, optional (default=None)

If None: Doesn't save a logging file.
If str: Name of the log file. Use "auto" for automatic naming.
Else: Python logging.Logger instance.

**kwargs
Additional keyword arguments for the strategy estimator. If a sequence of strategies is provided, the parameters for each strategy should be given in a dict with the strategy's name as key.

Tip

Use atom's outliers attribute for an overview of the number of outlier values per column.

Attributes

<strategy>: sklearn estimator
Estimator used to prune the data, stored under the lowercase name of the strategy, e.g. pruner.iforest for the isolation forest strategy.

Methods

fit_transform	Same as transform.
get_params	Get parameters for this estimator.
log	Write information to the logger and print to stdout.
save	Save the instance to a pickle file.
set_params	Set the parameters of this estimator.
transform	Transform the data.

method fit_transform(X, y=None) [source]

Apply the outlier strategy to the data.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)

If None: y is ignored.
If int: Index of the target column in X.
If str: Name of the target column in X.
Else: Target column with shape=(n_samples,).

Returns:

X: pd.DataFrame
Transformed feature set.

y: pd.Series
Transformed target column. Only returned if provided.
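
The four accepted forms of y can be illustrated with a small stand-in for a dataframe. This is a sketch under assumptions, not the library's code: the helper split_target is hypothetical, X is modelled as a dict of column name to values, and it assumes that a target taken from X is removed from the feature set.

```python
def split_target(X, y=None):
    """Split a dict of columns into features and target, by the type of y."""
    if y is None:
        # No target: y is ignored
        return X, None
    if isinstance(y, int):
        # Column position -> column name
        y = list(X)[y]
    if isinstance(y, str):
        # Assumption: the target column is taken out of the features
        features = {name: col for name, col in X.items() if name != y}
        return features, X[y]
    # Anything else is treated as the target column itself
    return X, list(y)


X = {"a": [1, 2], "b": [3, 4], "t": [0, 1]}

by_name = split_target(X, "t")
by_index = split_target(X, -1)
by_values = split_target(X, [5, 6])
```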

method get_params(deep=True) [source]

Get parameters for this estimator.

Parameters:

deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params: dict
Parameter names mapped to their values.

method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.

method save(filename="auto") [source]

Save the instance to a pickle file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.

method set_params(**params) [source]

Set the parameters of this estimator.

Parameters:

**params: dict
Estimator parameters.

Returns:

self: Pruner
Estimator instance.

method transform(X, y=None) [source]

Apply the outlier strategy to the data.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)

If None: y is ignored.
If int: Index of the target column in X.
If str: Name of the target column in X.
Else: Target column with shape=(n_samples,).

Returns:

X: pd.DataFrame
Transformed feature set.

y: pd.Series
Transformed target column. Only returned if provided.

Example

from atom import ATOMRegressor

atom = ATOMRegressor(X, y)
atom.prune(strategy="z-score", max_sigma=2, include_target=True)

or

from atom.data_cleaning import Pruner

pruner = Pruner(strategy="z-score", max_sigma=2, include_target=True)
X_train, y_train = pruner.transform(X_train, y_train)