Pruner
Prune outliers from the data.
Replace or remove outliers. What counts as an outlier depends on the selected strategy, and the strategies can differ greatly from one another. Categorical columns are ignored.
This class can be accessed from atom through the prune method. Read more in the user guide.
Info
The "sklearnex" and "cuml" engines are only supported for strategy="dbscan".
Parameters
strategy: str or sequence, default="zscore"
Strategy with which to select the outliers. If sequence of
strategies, only samples marked as outliers by all chosen
strategies are dropped. Choose from:
method: int, float or str, default="drop"
Method to apply to the outliers. Only the zscore strategy
accepts a method other than "drop". Choose from:
max_sigma: int or float, default=3
Maximum allowed standard deviations from the mean of the
column. Values that deviate more are considered outliers. Only
if strategy="zscore".
include_target: bool, default=False
Whether to include the target column in the search for
outliers. This can be useful for regression tasks. Only
if strategy="zscore".
device: str, default="cpu"
Device on which to run the estimators. Use any string that
follows the SYCL_DEVICE_FILTER filter selector, e.g.
device="gpu" to use the GPU. Read more in the
user guide.
engine: str or None, default=None
Execution engine to use for estimators.
If None, the default value is used. Choose from:
verbose: int, default=0
Verbosity level of the class. Choose from:
**kwargs
Additional keyword arguments for the strategy estimator. If
sequence of strategies, the params should be provided in a dict
with the strategy's name as key.
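As a hedged sketch of how these parameters combine (it assumes the per-strategy dicts are passed as keyword arguments named after each strategy, as the description above suggests; n_estimators and eps are ordinary scikit-learn estimator parameters used only as examples):

>>> from atom.data_cleaning import Pruner

>>> # Single strategy: kwargs go straight to the underlying estimator
>>> pruner = Pruner(strategy="iforest", n_estimators=200)

>>> # zscore with a stricter threshold that also checks the target column
>>> pruner = Pruner(strategy="zscore", max_sigma=2, include_target=True)

>>> # Sequence of strategies: per-estimator params keyed by strategy name
>>> pruner = Pruner(
...     strategy=["iforest", "dbscan"],
...     iforest={"n_estimators": 200},
...     dbscan={"eps": 0.8},
... )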
Attributes
[strategy]: sklearn estimator
Object used to prune the data, e.g., pruner.iforest for the
isolation forest strategy. Not available for strategy="zscore".
feature_names_in_: np.ndarray
Names of features seen during fit.
n_features_in_: int
Number of features seen during fit.
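A short, illustrative sketch of inspecting these attributes after fitting (the iforest attribute name follows the chosen strategy, as noted above):

>>> from atom.data_cleaning import Pruner
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)

>>> pruner = Pruner(strategy="iforest")
>>> X = pruner.fit_transform(X)

>>> # Fitted estimator named after the strategy, plus the standard sklearn attributes
>>> print(pruner.iforest)
>>> print(pruner.n_features_in_, pruner.feature_names_in_[:3])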
See Also
Example
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> atom = ATOMClassifier(X, y, random_state=1)
>>> print(atom.dataset)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry ... worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
0 13.48 20.82 88.40 559.2 0.10160 0.12550 0.10630 0.05439 0.1720 ... 107.30 740.4 0.1610 0.42250 0.5030 0.22580 0.2807 0.10710 0
1 18.31 20.58 120.80 1052.0 0.10680 0.12480 0.15690 0.09451 0.1860 ... 142.20 1493.0 0.1492 0.25360 0.3759 0.15100 0.3074 0.07863 0
2 17.93 24.48 115.20 998.9 0.08855 0.07027 0.05699 0.04744 0.1538 ... 135.10 1320.0 0.1315 0.18060 0.2080 0.11360 0.2504 0.07948 0
3 15.13 29.81 96.71 719.5 0.08320 0.04605 0.04686 0.02739 0.1852 ... 110.10 931.4 0.1148 0.09866 0.1547 0.06575 0.3233 0.06165 0
4 8.95 15.76 58.74 245.2 0.09462 0.12430 0.09263 0.02308 0.1305 ... 63.34 270.0 0.1179 0.18790 0.1544 0.03846 0.1652 0.07722 1
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
564 14.34 13.47 92.51 641.2 0.09906 0.07624 0.05724 0.04603 0.2075 ... 110.40 873.2 0.1297 0.15250 0.1632 0.10870 0.3062 0.06072 1
565 13.17 21.81 85.42 531.5 0.09714 0.10470 0.08259 0.05252 0.1746 ... 105.50 740.7 0.1503 0.39040 0.3728 0.16070 0.3693 0.09618 0
566 17.30 17.08 113.00 928.2 0.10080 0.10410 0.12660 0.08353 0.1813 ... 130.90 1222.0 0.1416 0.24050 0.3378 0.18570 0.3138 0.08113 0
567 17.68 20.74 117.40 963.7 0.11150 0.16650 0.18550 0.10540 0.1971 ... 132.90 1302.0 0.1418 0.34980 0.3583 0.15150 0.2463 0.07738 0
568 14.80 17.66 95.88 674.8 0.09179 0.08890 0.04069 0.02260 0.1893 ... 105.90 829.5 0.1226 0.18810 0.2060 0.08308 0.3600 0.07285 1
[569 rows x 31 columns]
>>> atom.prune(strategy="iforest", verbose=2)
Fitting Pruner...
Pruning outliers...
--> Dropping 63 outliers.
>>> # Note the reduced number of rows
>>> print(atom.dataset)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry ... worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
0 13.48 20.82 88.40 559.2 0.10160 0.12550 0.10630 0.05439 0.1720 ... 107.30 740.4 0.1610 0.42250 0.5030 0.22580 0.2807 0.10710 0
1 18.31 20.58 120.80 1052.0 0.10680 0.12480 0.15690 0.09451 0.1860 ... 142.20 1493.0 0.1492 0.25360 0.3759 0.15100 0.3074 0.07863 0
2 17.93 24.48 115.20 998.9 0.08855 0.07027 0.05699 0.04744 0.1538 ... 135.10 1320.0 0.1315 0.18060 0.2080 0.11360 0.2504 0.07948 0
3 15.13 29.81 96.71 719.5 0.08320 0.04605 0.04686 0.02739 0.1852 ... 110.10 931.4 0.1148 0.09866 0.1547 0.06575 0.3233 0.06165 0
4 10.26 16.58 65.85 320.8 0.08877 0.08066 0.04358 0.02438 0.1669 ... 71.08 357.4 0.1461 0.22460 0.1783 0.08333 0.2691 0.09479 1
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 14.34 13.47 92.51 641.2 0.09906 0.07624 0.05724 0.04603 0.2075 ... 110.40 873.2 0.1297 0.15250 0.1632 0.10870 0.3062 0.06072 1
502 13.17 21.81 85.42 531.5 0.09714 0.10470 0.08259 0.05252 0.1746 ... 105.50 740.7 0.1503 0.39040 0.3728 0.16070 0.3693 0.09618 0
503 17.30 17.08 113.00 928.2 0.10080 0.10410 0.12660 0.08353 0.1813 ... 130.90 1222.0 0.1416 0.24050 0.3378 0.18570 0.3138 0.08113 0
504 17.68 20.74 117.40 963.7 0.11150 0.16650 0.18550 0.10540 0.1971 ... 132.90 1302.0 0.1418 0.34980 0.3583 0.15150 0.2463 0.07738 0
505 14.80 17.66 95.88 674.8 0.09179 0.08890 0.04069 0.02260 0.1893 ... 105.90 829.5 0.1226 0.18810 0.2060 0.08308 0.3600 0.07285 1
[506 rows x 31 columns]
>>> atom.plot_distribution(columns=0)
>>> from atom.data_cleaning import Normalizer
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> normalizer = Normalizer(verbose=2)
>>> X = normalizer.fit_transform(X)
Fitting Normalizer...
Normalizing features...
>>> # Note the normalized feature values
>>> print(X)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 1.134881 -2.678666 1.259822 1.126421 1.504114 2.165938 1.862988 1.848558 1.953067 ... -1.488367 1.810506 1.652210 1.282792 1.942737 1.730182 1.935654 2.197206 1.723624
1 1.619346 -0.264377 1.528723 1.633946 -0.820227 -0.384102 0.291976 0.820609 0.102291 ... -0.288382 1.430616 1.610022 -0.325080 -0.296580 0.070746 1.101594 -0.121997 0.537179
2 1.464796 0.547806 1.454664 1.461645 0.963977 1.163977 1.403673 1.683104 0.985668 ... 0.071406 1.321941 1.425307 0.580301 1.209701 1.005512 1.722744 1.218181 0.453955
3 -0.759262 0.357721 -0.514886 -0.836238 2.781494 2.197843 1.642391 1.423004 2.360528 ... 0.228089 -0.039480 -0.436860 2.857821 2.282276 1.675087 1.862378 3.250202 2.517606
4 1.571260 -1.233520 1.583340 1.595120 0.343932 0.762392 1.407479 1.410929 0.090964 ... -1.637882 1.316582 1.309486 0.284367 -0.131829 0.817474 0.807077 -0.943554 -0.279402
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
564 1.781795 0.785604 1.746492 1.823030 1.052829 0.460810 1.653784 1.783067 -0.232645 ... 0.212151 1.547961 1.657442 0.438013 -0.077871 0.859079 1.503734 -1.721528 -0.751459
565 1.543335 1.845150 1.485601 1.545430 0.168014 0.207602 0.984746 1.320730 -0.129120 ... 1.832201 1.365939 1.443167 -0.667317 -0.245277 0.480804 0.810995 -0.480093 -1.210527
566 0.828589 1.817618 0.811329 0.835270 -0.835509 0.183969 0.375105 0.396882 -0.808189 ... 1.320625 0.786129 0.796192 -0.799337 0.626487 0.566826 0.526136 -1.301164 -0.170872
567 1.624440 2.016299 1.702747 1.551036 1.468642 2.162820 1.994466 1.884414 1.899087 ... 1.968949 1.810506 1.513198 1.387135 2.284642 2.136932 1.931990 1.744693 1.850944
568 -2.699432 1.203224 -2.827766 -2.703256 -3.834325 -1.481409 -1.658319 -1.845392 -0.821560 ... 0.810681 -2.231436 -2.149403 -2.064647 -1.731936 -1.819966 -2.131070 0.103122 -0.820663
[569 rows x 30 columns]
Methods
fit | Do nothing.
fit_transform | Fit to data, then transform it.
get_feature_names_out | Get output feature names for transformation.
get_params | Get parameters for this estimator.
inverse_transform | Do nothing.
set_output | Set output container.
set_params | Set the parameters of this estimator.
transform | Apply the outlier strategy on the data.
fit
Do nothing.
Implemented for continuity of the API.

fit_transform
Fit to data, then transform it.

get_feature_names_out
Get output feature names for transformation.
get_params
Get parameters for this estimator.

Parameters
deep: bool, default=True
If True, will return the parameters for this estimator and
contained subobjects that are estimators.

Returns
params: dict
Parameter names mapped to their values.
inverse_transform
Do nothing.
Returns the input unchanged. Implemented for continuity of the API.

set_output
Set output container.
See sklearn's user guide on how to use the set_output API. See
here for a description of the choices.
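For instance, a sketch assuming Pruner follows the scikit-learn set_output convention referenced above:

>>> from atom.data_cleaning import Pruner

>>> # Request pandas DataFrames as transform output (sklearn-style set_output API)
>>> pruner = Pruner(strategy="iforest").set_output(transform="pandas")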
set_params
Set the parameters of this estimator.

Parameters
**params: dict
Estimator parameters.

Returns
self: estimator instance
Estimator instance.

transform
Apply the outlier strategy on the data.