
Pruner


class atom.data_cleaning.Pruner(strategy="z-score", method="drop", max_sigma=3, include_target=False, verbose=0, logger=None, **kwargs) [source]

Replace or remove outliers. The definition of an outlier depends on the selected strategy and can differ greatly between strategies. Categorical columns are ignored. This class can be accessed from atom through the prune method. Read more in the user guide.

Parameters:

strategy: str or sequence, optional (default="z-score")
Strategy with which to select the outliers. If sequence of strategies, only samples marked as outliers by all chosen strategies are dropped. Choose from:

method: int, float or str, optional (default="drop")
Method to apply to the outliers. Only the z-score strategy accepts a method other than "drop". Choose from:
  • "drop": Drop any sample with outlier values.
  • "min_max": Replace the outlier with the min or max of the column.
  • Any numerical value with which to replace the outliers.
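
The three methods above can be illustrated on a single column. The following is a minimal sketch in plain Python, not the library's implementation: `apply_method` is a hypothetical helper, the outlier flags are assumed to be computed elsewhere, and the "min_max" branch assumes that low outliers are replaced by the minimum and high outliers by the maximum of the remaining values. Note that the real "drop" method removes the whole sample (row), whereas this sketch only handles one column.

```python
def apply_method(values, flags, method="drop"):
    """Handle the flagged entries of one column according to `method`.

    Hypothetical helper for illustration only, not part of atom.
    """
    kept = [v for v, bad in zip(values, flags) if not bad]
    if method == "drop":
        # Remove every flagged entry.
        return kept
    if method == "min_max":
        if not kept:
            raise ValueError("Every value is flagged, no min or max available.")
        # Assumption: min, max and center are taken over the non-outlier values.
        low, high = min(kept), max(kept)
        center = sum(kept) / len(kept)
        return [
            (low if v < center else high) if bad else v
            for v, bad in zip(values, flags)
        ]
    if isinstance(method, (int, float)):
        # Replace every flagged entry with the given numerical value.
        return [method if bad else v for v, bad in zip(values, flags)]
    raise ValueError(f"Unknown method: {method!r}")


column = [1.0, 2.0, 3.0, 100.0, -50.0]
flags = [False, False, False, True, True]

dropped = apply_method(column, flags, "drop")
clipped = apply_method(column, flags, "min_max")
replaced = apply_method(column, flags, 0)
```

Here `dropped` keeps only the three unflagged values, `clipped` maps 100.0 to the remaining maximum and -50.0 to the remaining minimum, and `replaced` sets both flagged entries to 0.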

max_sigma: int or float, optional (default=3)
Maximum allowed standard deviations from the mean of the column. If more, it is considered an outlier. Only if strategy="z-score".
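
As an illustration of the max_sigma rule, the sketch below flags the values of one column whose distance from the column mean exceeds max_sigma standard deviations. It is a simplified stand-in, not the library's code: `zscore_outliers` is a hypothetical name, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def zscore_outliers(values, max_sigma=3.0):
    """Return one boolean per value: True if it counts as an outlier.

    Hypothetical helper for illustration only, not part of atom.
    """
    mu = mean(values)
    sigma = pstdev(values)  # Assumption: population standard deviation.
    if sigma == 0:
        # A constant column has no outliers.
        return [False] * len(values)
    return [abs(v - mu) / sigma > max_sigma for v in values]


column = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 50.0]

strict = zscore_outliers(column, max_sigma=2)
default = zscore_outliers(column, max_sigma=3)
```

With max_sigma=2 only the last value (50.0) is flagged. With the default of 3 nothing is flagged, because in a column this short a single extreme value also inflates the standard deviation it is measured against.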

include_target: bool, optional (default=False)
Whether to include the target column in the search for outliers. This can be useful for regression tasks. Only if strategy="z-score".

verbose: int, optional (default=0)
Verbosity level of the class. Possible values are:
  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.
logger: str, Logger or None, optional (default=None)
  • If None: Doesn't save a logging file.
  • If str: Name of the log file. Use "auto" for automatic naming.
  • Else: Python logging.Logger instance.
**kwargs
Additional keyword arguments passed to the strategy estimator. If a sequence of strategies is given, provide the parameters in a dict with the strategy's name as key.
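
The combination rule for multiple strategies (a sample is dropped only if every chosen strategy marks it as an outlier) and the per-strategy parameter dicts can be sketched as follows. This is an illustrative assumption about the mechanics, not atom's code: the detectors `above` and `far_from_median`, the `DETECTORS` registry and `flag_by_all` are all hypothetical names standing in for the real strategy estimators.

```python
def above(values, limit):
    """Toy detector: flag values greater than `limit`."""
    return [v > limit for v in values]


def far_from_median(values, distance):
    """Toy detector: flag values further than `distance` from the median."""
    ordered = sorted(values)
    median = ordered[len(ordered) // 2]  # Upper median for even lengths.
    return [abs(v - median) > distance for v in values]


# Hypothetical registry mapping strategy names to detectors.
DETECTORS = {"above": above, "far_from_median": far_from_median}


def flag_by_all(values, strategies, **kwargs):
    """Flag a value only if every listed strategy flags it.

    Parameters for each strategy are read from kwargs[<strategy name>].
    """
    per_strategy = [
        DETECTORS[name](values, **kwargs.get(name, {})) for name in strategies
    ]
    return [all(votes) for votes in zip(*per_strategy)]


values = [1.0, 2.0, 3.0, 40.0, 9.0]
flags = flag_by_all(
    values,
    ["above", "far_from_median"],
    above={"limit": 8.0},
    far_from_median={"distance": 10.0},
)
```

The value 9.0 is flagged by `above` but not by `far_from_median`, so it is kept. Only 40.0 is flagged by both detectors.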

Tip

Use atom's outliers attribute for an overview of the number of outlier values per column.


Attributes

Attributes: <strategy>: sklearn estimator
Estimator instance (lowercase strategy) used to prune the data, e.g. pruner.iforest for the isolation forest strategy.


Methods

fit_transform Same as transform.
get_params Get parameters for this estimator.
log Write information to the logger and print to stdout.
save Save the instance to a pickle file.
set_params Set the parameters of this estimator.
transform Transform the data.


method fit_transform(X, y=None) [source]

Apply the outlier strategy to the data.

Parameters:

X: dict, list, tuple, np.ndarray or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
  • If None: y is ignored.
  • If int: Index of the target column in X.
  • If str: Name of the target column in X.
  • Else: Target column with shape=(n_samples,).
Returns:

X: pd.DataFrame
Transformed feature set.

y: pd.Series
Transformed target column. Only returned if provided.


method get_params(deep=True) [source]

Get parameters for this estimator.

Parameters:

deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns: params: dict
Dictionary of the parameter names mapped to their values.


method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.


method save(filename="auto") [source]

Save the instance to a pickle file.

Parameters: filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.


method set_params(**params) [source]

Set the parameters of this estimator.

Parameters: **params: dict
Estimator parameters.
Returns: self: Pruner
Estimator instance.


method transform(X, y=None) [source]

Apply the outlier strategy to the data.

Parameters:

X: dict, list, tuple, np.ndarray or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
  • If None: y is ignored.
  • If int: Index of the target column in X.
  • If str: Name of the target column in X.
  • Else: Target column with shape=(n_samples,).
Returns:

X: pd.DataFrame
Transformed feature set.

y: pd.Series
Transformed target column. Only returned if provided.


Example

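from atom import ATOMRegressor

# X and y are assumed to be an already loaded feature set and target column.
atom = ATOMRegressor(X, y)
atom.prune(strategy="z-score", max_sigma=2, include_target=True)
or
from atom.data_cleaning import Pruner

# X_train and y_train are assumed to be an existing training split.
pruner = Pruner(strategy="z-score", max_sigma=2, include_target=True)
X_train, y_train = pruner.transform(X_train, y_train)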
