Balancer
Balance the number of samples per class in the target column.
When oversampling, the newly created samples get an increasing integer index if the index is numerical, and an index of the form [estimator]_N otherwise, where N stands for the N-th sample in the data set. Use this class only for classification tasks.
This class can be accessed from atom through the balance method. Read more in the user guide.
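The index rule for oversampled rows can be illustrated with a small sketch. The helper `extend_index` below is hypothetical and not part of ATOM; it assumes that N is the 1-based position of the new sample in the resulting data set and that the estimator name is written in lowercase, neither of which is confirmed here.

```python
def extend_index(labels, n_new, estimator_name="smote"):
    """Hypothetical helper: return the index labels after adding n_new rows.

    Assumption: for non-numerical indices, N is the 1-based position of
    the new sample in the resulting data set.
    """
    labels = list(labels)
    if all(isinstance(label, int) for label in labels):
        # Numerical index: continue counting from the largest existing label.
        start = max(labels, default=-1) + 1
        new_labels = list(range(start, start + n_new))
    else:
        # Non-numerical index: build labels of the form [estimator]_N.
        start = len(labels) + 1
        new_labels = [f"{estimator_name}_{n}" for n in range(start, start + n_new)]
    return labels + new_labels


print(extend_index([0, 1, 2], 2))   # [0, 1, 2, 3, 4]
print(extend_index(["a", "b"], 2))  # ['a', 'b', 'smote_3', 'smote_4']
```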
Warning
- The clustercentroids estimator is unavailable because of API incompatibilities.
- The Balancer class does not support multioutput tasks.
Example
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> atom = ATOMClassifier(X, y, random_state=1)
>>> print(atom.train)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry ... worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
0 13.48 20.82 88.40 559.2 0.10160 0.12550 0.10630 0.054390 0.1720 ... 107.30 740.4 0.1610 0.42250 0.50300 0.22580 0.2807 0.10710 0
1 18.31 20.58 120.80 1052.0 0.10680 0.12480 0.15690 0.094510 0.1860 ... 142.20 1493.0 0.1492 0.25360 0.37590 0.15100 0.3074 0.07863 0
2 17.93 24.48 115.20 998.9 0.08855 0.07027 0.05699 0.047440 0.1538 ... 135.10 1320.0 0.1315 0.18060 0.20800 0.11360 0.2504 0.07948 0
3 15.13 29.81 96.71 719.5 0.08320 0.04605 0.04686 0.027390 0.1852 ... 110.10 931.4 0.1148 0.09866 0.15470 0.06575 0.3233 0.06165 0
4 8.95 15.76 58.74 245.2 0.09462 0.12430 0.09263 0.023080 0.1305 ... 63.34 270.0 0.1179 0.18790 0.15440 0.03846 0.1652 0.07722 1
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
451 19.73 19.82 130.70 1206.0 0.10620 0.18490 0.24170 0.097400 0.1733 ... 159.80 1933.0 0.1710 0.59550 0.84890 0.25070 0.2749 0.12970 0
452 12.72 13.78 81.78 492.1 0.09667 0.08393 0.01288 0.019240 0.1638 ... 88.54 553.7 0.1298 0.14720 0.05233 0.06343 0.2369 0.06922 1
453 11.51 23.93 74.52 403.5 0.09261 0.10210 0.11120 0.041050 0.1388 ... 82.28 474.2 0.1298 0.25170 0.36300 0.09653 0.2112 0.08732 1
454 10.75 14.97 68.26 355.3 0.07793 0.05139 0.02251 0.007875 0.1399 ... 77.79 441.2 0.1076 0.12230 0.09755 0.03413 0.2300 0.06769 1
455 25.22 24.91 171.50 1878.0 0.10630 0.26650 0.33390 0.184500 0.1829 ... 211.70 2562.0 0.1573 0.60760 0.64760 0.28670 0.2355 0.10510 0
[456 rows x 31 columns]
>>> atom.balance(strategy="smote", verbose=2)
Oversampling with SMOTE...
--> Adding 116 samples to class 0.
>>> # Note that the number of rows has increased
>>> print(atom.train)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry ... worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
0 13.480000 20.820000 88.400000 559.200000 0.101600 0.125500 0.106300 0.054390 0.172000 ... 107.300000 740.400000 0.161000 0.422500 0.503000 0.225800 0.280700 0.107100 0
1 18.310000 20.580000 120.800000 1052.000000 0.106800 0.124800 0.156900 0.094510 0.186000 ... 142.200000 1493.000000 0.149200 0.253600 0.375900 0.151000 0.307400 0.078630 0
2 17.930000 24.480000 115.200000 998.900000 0.088550 0.070270 0.056990 0.047440 0.153800 ... 135.100000 1320.000000 0.131500 0.180600 0.208000 0.113600 0.250400 0.079480 0
3 15.130000 29.810000 96.710000 719.500000 0.083200 0.046050 0.046860 0.027390 0.185200 ... 110.100000 931.400000 0.114800 0.098660 0.154700 0.065750 0.323300 0.061650 0
4 8.950000 15.760000 58.740000 245.200000 0.094620 0.124300 0.092630 0.023080 0.130500 ... 63.340000 270.000000 0.117900 0.187900 0.154400 0.038460 0.165200 0.077220 1
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
567 15.182945 22.486774 98.949465 711.386079 0.092513 0.102732 0.113923 0.069481 0.179224 ... 107.689157 826.276172 0.126730 0.199259 0.295172 0.142325 0.265352 0.068318 0
568 19.990378 20.622944 130.491182 1253.735467 0.091583 0.117753 0.117236 0.082771 0.202428 ... 167.456689 1995.896044 0.132457 0.289652 0.332006 0.182989 0.299088 0.084150 0
569 18.158121 18.928220 119.907435 1027.331092 0.113149 0.147089 0.171862 0.103942 0.209306 ... 135.286302 1319.270051 0.127029 0.233493 0.260138 0.133851 0.302406 0.079535 0
570 23.733233 26.433751 158.185672 1724.145541 0.098008 0.193789 0.231158 0.139527 0.188817 ... 207.483796 2844.559632 0.150495 0.463361 0.599077 0.266433 0.290828 0.091542 0
571 17.669575 16.375717 115.468589 968.552411 0.093636 0.109983 0.101005 0.075283 0.174505 ... 133.767576 1227.195245 0.118221 0.264624 0.249798 0.135098 0.268044 0.076533 0
[572 rows x 31 columns]
>>> from atom.data_cleaning import Balancer
>>> from sklearn.datasets import load_breast_cancer
>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> print(X)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.30010 0.14710 0.2419 ... 17.33 184.60 2019.0 0.16220 0.66560 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.08690 0.07017 0.1812 ... 23.41 158.80 1956.0 0.12380 0.18660 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.19740 0.12790 0.2069 ... 25.53 152.50 1709.0 0.14440 0.42450 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.24140 0.10520 0.2597 ... 26.50 98.87 567.7 0.20980 0.86630 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.19800 0.10430 0.1809 ... 16.67 152.20 1575.0 0.13740 0.20500 0.4000 0.1625 0.2364 0.07678
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
564 21.56 22.39 142.00 1479.0 0.11100 0.11590 0.24390 0.13890 0.1726 ... 26.40 166.10 2027.0 0.14100 0.21130 0.4107 0.2216 0.2060 0.07115
565 20.13 28.25 131.20 1261.0 0.09780 0.10340 0.14400 0.09791 0.1752 ... 38.25 155.00 1731.0 0.11660 0.19220 0.3215 0.1628 0.2572 0.06637
566 16.60 28.08 108.30 858.1 0.08455 0.10230 0.09251 0.05302 0.1590 ... 34.12 126.70 1124.0 0.11390 0.30940 0.3403 0.1418 0.2218 0.07820
567 20.60 29.33 140.10 1265.0 0.11780 0.27700 0.35140 0.15200 0.2397 ... 39.42 184.60 1821.0 0.16500 0.86810 0.9387 0.2650 0.4087 0.12400
568 7.76 24.54 47.92 181.0 0.05263 0.04362 0.00000 0.00000 0.1587 ... 30.37 59.16 268.6 0.08996 0.06444 0.0000 0.0000 0.2871 0.07039
[569 rows x 30 columns]
>>> balancer = Balancer(strategy="smote", verbose=2)
>>> X, y = balancer.fit_transform(X, y)
Oversampling with SMOTE...
--> Adding 145 samples to class 0.
>>> # Note that the number of rows has increased
>>> print(X)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.990000 10.380000 122.800000 1001.000000 0.118400 0.277600 0.300100 0.147100 0.241900 ... 17.330000 184.600000 2019.000000 0.162200 0.665600 0.711900 0.265400 0.460100 0.118900
1 20.570000 17.770000 132.900000 1326.000000 0.084740 0.078640 0.086900 0.070170 0.181200 ... 23.410000 158.800000 1956.000000 0.123800 0.186600 0.241600 0.186000 0.275000 0.089020
2 19.690000 21.250000 130.000000 1203.000000 0.109600 0.159900 0.197400 0.127900 0.206900 ... 25.530000 152.500000 1709.000000 0.144400 0.424500 0.450400 0.243000 0.361300 0.087580
3 11.420000 20.380000 77.580000 386.100000 0.142500 0.283900 0.241400 0.105200 0.259700 ... 26.500000 98.870000 567.700000 0.209800 0.866300 0.686900 0.257500 0.663800 0.173000
4 20.290000 14.340000 135.100000 1297.000000 0.100300 0.132800 0.198000 0.104300 0.180900 ... 16.670000 152.200000 1575.000000 0.137400 0.205000 0.400000 0.162500 0.236400 0.076780
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
709 13.329262 20.382389 86.097972 551.351999 0.095639 0.086063 0.068628 0.042084 0.192509 ... 29.306676 102.623042 776.210985 0.137782 0.291373 0.328302 0.151230 0.336559 0.090507
710 20.185334 18.623259 131.879504 1278.482149 0.084117 0.094830 0.131577 0.082529 0.190481 ... 21.790588 150.589752 1641.020163 0.111636 0.163697 0.287766 0.146398 0.292034 0.062731
711 18.012313 17.046581 117.494946 988.670929 0.090583 0.125388 0.112866 0.064706 0.173268 ... 22.176153 133.279786 1292.505350 0.127083 0.270805 0.425427 0.153399 0.282285 0.082004
712 19.391707 19.145407 126.257114 1164.870937 0.103937 0.120719 0.136540 0.086889 0.181025 ... 25.437103 164.800203 2014.019293 0.152785 0.324601 0.421062 0.202135 0.341738 0.087907
713 15.154245 29.379464 97.230905 720.366950 0.085779 0.056406 0.058332 0.031637 0.184869 ... 36.867387 110.658374 929.783653 0.119091 0.127904 0.186762 0.076811 0.321684 0.064960
[714 rows x 30 columns]
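The row counts printed above can be checked by hand. The sketch below assumes that the oversampler brings every class up to the size of the majority class, which is consistent with the logged output. The breast cancer data set has 212 samples of class 0 and 357 of class 1, so 145 samples are added and the result has 714 rows. The function name `samples_to_add` is illustrative only.

```python
from collections import Counter


def samples_to_add(y):
    """Return how many samples each minority class needs to match the majority.

    Assumption: the strategy equalizes all classes to the majority count.
    """
    counts = Counter(y)
    majority = max(counts.values())
    return {label: majority - n for label, n in counts.items() if n < majority}


# Class distribution of the full breast cancer data set.
y = [0] * 212 + [1] * 357

to_add = samples_to_add(y)
print(to_add)                         # {0: 145}
print(len(y) + sum(to_add.values()))  # 714
```

The same arithmetic applies to the atom example, where the training set grows from 456 to 572 rows after 116 samples are added.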
Methods
fit | Fit to data.
fit_transform | Fit to data, then transform it.
get_feature_names_out | Get output feature names for transformation.
get_params | Get parameters for this estimator.
inverse_transform | Do nothing.
set_output | Set output container.
set_params | Set the parameters of this estimator.
transform | Balance the data.
fit
Fit to data.
Parameters
X: dataframe-like
Feature set with shape=(n_samples, n_features).
y: sequence
Target column corresponding to X.
Returns
Self
Estimator instance.
fit_transform
Fit to data, then transform it.
get_feature_names_out
Get output feature names for transformation.
get_params
Get parameters for this estimator.
Parameters
deep: bool, default=True
If True, return the parameters for this estimator and contained subobjects that are estimators.
Returns
params: dict
Parameter names mapped to their values.
inverse_transform
Do nothing.
Returns the input unchanged. Implemented for continuity of the API.
set_output
Set output container.
See sklearn's user guide on how to use the set_output API and for a description of the choices.
set_params
Set the parameters of this estimator.
Parameters
**params: dict
Estimator parameters.
Returns
self: estimator instance
Estimator instance.
transform
Balance the data.
Parameters
X: dataframe-like
Feature set with shape=(n_samples, n_features).
y: sequence
Target column corresponding to X.
Returns
dataframe
Balanced dataframe.
series
Transformed target column.