
Balancer


class atom.data_cleaning.Balancer(strategy="ADASYN", n_jobs=1, verbose=0, random_state=None, **kwargs)[source]
Balance the number of samples per class in the target column.

When oversampling, the newly created samples have an increasing integer index for numerical indices, and an index of the form [estimator]_N for non-numerical indices, where N stands for the N-th sample in the data set. Use only for classification tasks.

This class can be accessed from atom through the balance method. Read more in the user guide.

Warning

Parameters

strategy: str or transformer, default="ADASYN"
Type of algorithm with which to balance the dataset. Choose from the name of any estimator in the imbalanced-learn package or provide a custom instance of such an estimator (see the sketch after this parameter list).

n_jobs: int, default=1
Number of cores to use for parallel processing.

  • If >0: Number of cores to use.
  • If -1: Use all available cores.
  • If <-1: Use number of cores - 1 + value.

verbose: int, default=0
Verbosity level of the class. Choose from:

  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.

random_state: int or None, default=None
Seed used by the random number generator. If None, the random number generator is the RandomState used by np.random.

**kwargs
Additional keyword arguments for the strategy estimator.
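
As a quick sketch of the two ways to set the strategy (BorderlineSMOTE and k_neighbors below are illustrative choices, not part of the original example; extra keyword arguments are forwarded to the strategy estimator as described above):

>>> from atom.data_cleaning import Balancer
>>> from imblearn.over_sampling import BorderlineSMOTE

>>> # Select the strategy by name; k_neighbors is passed on to SMOTE.
>>> balancer = Balancer(strategy="smote", k_neighbors=3, random_state=1)

>>> # Or provide a pre-configured imbalanced-learn instance directly.
>>> balancer = Balancer(strategy=BorderlineSMOTE(k_neighbors=3), random_state=1)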

Attributes

[strategy]_: imblearn estimator
Object (lowercase strategy) used to balance the data, e.g., balancer.adasyn_ for the default strategy.

mapping_: dict
Target values mapped to their respective encoded integers.

feature_names_in_: np.ndarray
Names of features seen during fit.

target_names_in_: np.ndarray
Names of the target column seen during fit.

n_features_in_: int
Number of features seen during fit.
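
A short sketch of inspecting these attributes after fitting (the fitted-estimator attribute takes the lowercase strategy name, here smote_):

>>> from atom.data_cleaning import Balancer
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> balancer = Balancer(strategy="smote", random_state=1)
>>> X_bal, y_bal = balancer.fit_transform(X, y)

>>> balancer.smote_             # fitted imblearn SMOTE instance
>>> balancer.mapping_           # target values mapped to encoded integers
>>> balancer.feature_names_in_  # feature names seen during fit
>>> balancer.n_features_in_     # number of features seen during fit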


See Also

Encoder

Perform encoding of categorical features.

Imputer

Handle missing values in the data.

Pruner

Prune outliers from the data.


Example

>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)

>>> atom = ATOMClassifier(X, y, random_state=1)
>>> print(atom.train)

     mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  ...  worst perimeter  worst area  worst smoothness  worst compactness  worst concavity  worst concave points  worst symmetry  worst fractal dimension  target
0          13.48         20.82           88.40      559.2          0.10160           0.12550         0.10630             0.054390         0.1720  ...           107.30       740.4            0.1610            0.42250          0.50300               0.22580          0.2807                  0.10710       0
1          18.31         20.58          120.80     1052.0          0.10680           0.12480         0.15690             0.094510         0.1860  ...           142.20      1493.0            0.1492            0.25360          0.37590               0.15100          0.3074                  0.07863       0
2          17.93         24.48          115.20      998.9          0.08855           0.07027         0.05699             0.047440         0.1538  ...           135.10      1320.0            0.1315            0.18060          0.20800               0.11360          0.2504                  0.07948       0
3          15.13         29.81           96.71      719.5          0.08320           0.04605         0.04686             0.027390         0.1852  ...           110.10       931.4            0.1148            0.09866          0.15470               0.06575          0.3233                  0.06165       0
4           8.95         15.76           58.74      245.2          0.09462           0.12430         0.09263             0.023080         0.1305  ...            63.34       270.0            0.1179            0.18790          0.15440               0.03846          0.1652                  0.07722       1
..           ...           ...             ...        ...              ...               ...             ...                  ...            ...  ...              ...         ...               ...                ...              ...                   ...             ...                      ...     ...
451        19.73         19.82          130.70     1206.0          0.10620           0.18490         0.24170             0.097400         0.1733  ...           159.80      1933.0            0.1710            0.59550          0.84890               0.25070          0.2749                  0.12970       0
452        12.72         13.78           81.78      492.1          0.09667           0.08393         0.01288             0.019240         0.1638  ...            88.54       553.7            0.1298            0.14720          0.05233               0.06343          0.2369                  0.06922       1
453        11.51         23.93           74.52      403.5          0.09261           0.10210         0.11120             0.041050         0.1388  ...            82.28       474.2            0.1298            0.25170          0.36300               0.09653          0.2112                  0.08732       1
454        10.75         14.97           68.26      355.3          0.07793           0.05139         0.02251             0.007875         0.1399  ...            77.79       441.2            0.1076            0.12230          0.09755               0.03413          0.2300                  0.06769       1
455        25.22         24.91          171.50     1878.0          0.10630           0.26650         0.33390             0.184500         0.1829  ...           211.70      2562.0            0.1573            0.60760          0.64760               0.28670          0.2355                  0.10510       0

[456 rows x 31 columns]


>>> atom.balance(strategy="smote", verbose=2)

Oversampling with SMOTE...
 --> Adding 116 samples to class 0.


>>> # Note that the number of rows has increased
>>> print(atom.train)

     mean radius  mean texture  mean perimeter    mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  ...  worst perimeter   worst area  worst smoothness  worst compactness  worst concavity  worst concave points  worst symmetry  worst fractal dimension  target
0      13.480000     20.820000       88.400000   559.200000         0.101600          0.125500        0.106300             0.054390       0.172000  ...       107.300000   740.400000          0.161000           0.422500         0.503000              0.225800        0.280700                 0.107100       0
1      18.310000     20.580000      120.800000  1052.000000         0.106800          0.124800        0.156900             0.094510       0.186000  ...       142.200000  1493.000000          0.149200           0.253600         0.375900              0.151000        0.307400                 0.078630       0
2      17.930000     24.480000      115.200000   998.900000         0.088550          0.070270        0.056990             0.047440       0.153800  ...       135.100000  1320.000000          0.131500           0.180600         0.208000              0.113600        0.250400                 0.079480       0
3      15.130000     29.810000       96.710000   719.500000         0.083200          0.046050        0.046860             0.027390       0.185200  ...       110.100000   931.400000          0.114800           0.098660         0.154700              0.065750        0.323300                 0.061650       0
4       8.950000     15.760000       58.740000   245.200000         0.094620          0.124300        0.092630             0.023080       0.130500  ...        63.340000   270.000000          0.117900           0.187900         0.154400              0.038460        0.165200                 0.077220       1
..           ...           ...             ...          ...              ...               ...             ...                  ...            ...  ...              ...          ...               ...                ...              ...                   ...             ...                      ...     ...
567    15.182945     22.486774       98.949465   711.386079         0.092513          0.102732        0.113923             0.069481       0.179224  ...       107.689157   826.276172          0.126730           0.199259         0.295172              0.142325        0.265352                 0.068318       0
568    19.990378     20.622944      130.491182  1253.735467         0.091583          0.117753        0.117236             0.082771       0.202428  ...       167.456689  1995.896044          0.132457           0.289652         0.332006              0.182989        0.299088                 0.084150       0
569    18.158121     18.928220      119.907435  1027.331092         0.113149          0.147089        0.171862             0.103942       0.209306  ...       135.286302  1319.270051          0.127029           0.233493         0.260138              0.133851        0.302406                 0.079535       0
570    23.733233     26.433751      158.185672  1724.145541         0.098008          0.193789        0.231158             0.139527       0.188817  ...       207.483796  2844.559632          0.150495           0.463361         0.599077              0.266433        0.290828                 0.091542       0
571    17.669575     16.375717      115.468589   968.552411         0.093636          0.109983        0.101005             0.075283       0.174505  ...       133.767576  1227.195245          0.118221           0.264624         0.249798              0.135098        0.268044                 0.076533       0

[572 rows x 31 columns]

The same balancing can be run with the Balancer class directly, outside atom:

>>> from atom.data_cleaning import Balancer
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> print(X)

     mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  ...  worst texture  worst perimeter  worst area  worst smoothness  worst compactness  worst concavity  worst concave points  worst symmetry  worst fractal dimension
0          17.99         10.38          122.80     1001.0          0.11840           0.27760         0.30010              0.14710         0.2419  ...          17.33           184.60      2019.0           0.16220            0.66560           0.7119                0.2654          0.4601                  0.11890
1          20.57         17.77          132.90     1326.0          0.08474           0.07864         0.08690              0.07017         0.1812  ...          23.41           158.80      1956.0           0.12380            0.18660           0.2416                0.1860          0.2750                  0.08902
2          19.69         21.25          130.00     1203.0          0.10960           0.15990         0.19740              0.12790         0.2069  ...          25.53           152.50      1709.0           0.14440            0.42450           0.4504                0.2430          0.3613                  0.08758
3          11.42         20.38           77.58      386.1          0.14250           0.28390         0.24140              0.10520         0.2597  ...          26.50            98.87       567.7           0.20980            0.86630           0.6869                0.2575          0.6638                  0.17300
4          20.29         14.34          135.10     1297.0          0.10030           0.13280         0.19800              0.10430         0.1809  ...          16.67           152.20      1575.0           0.13740            0.20500           0.4000                0.1625          0.2364                  0.07678
..           ...           ...             ...        ...              ...               ...             ...                  ...            ...  ...            ...              ...         ...               ...                ...              ...                   ...             ...                      ...
564        21.56         22.39          142.00     1479.0          0.11100           0.11590         0.24390              0.13890         0.1726  ...          26.40           166.10      2027.0           0.14100            0.21130           0.4107                0.2216          0.2060                  0.07115
565        20.13         28.25          131.20     1261.0          0.09780           0.10340         0.14400              0.09791         0.1752  ...          38.25           155.00      1731.0           0.11660            0.19220           0.3215                0.1628          0.2572                  0.06637
566        16.60         28.08          108.30      858.1          0.08455           0.10230         0.09251              0.05302         0.1590  ...          34.12           126.70      1124.0           0.11390            0.30940           0.3403                0.1418          0.2218                  0.07820
567        20.60         29.33          140.10     1265.0          0.11780           0.27700         0.35140              0.15200         0.2397  ...          39.42           184.60      1821.0           0.16500            0.86810           0.9387                0.2650          0.4087                  0.12400
568         7.76         24.54           47.92      181.0          0.05263           0.04362         0.00000              0.00000         0.1587  ...          30.37            59.16       268.6           0.08996            0.06444           0.0000                0.0000          0.2871                  0.07039

[569 rows x 30 columns]


>>> balancer = Balancer(strategy="smote", verbose=2)
>>> X, y = balancer.fit_transform(X, y)

Oversampling with SMOTE...
 --> Adding 145 samples to class 0.


>>> # Note that the number of rows has increased
>>> print(X)

     mean radius  mean texture  mean perimeter    mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  ...  worst texture  worst perimeter   worst area  worst smoothness  worst compactness  worst concavity  worst concave points  worst symmetry  worst fractal dimension
0      17.990000     10.380000      122.800000  1001.000000         0.118400          0.277600        0.300100             0.147100       0.241900  ...      17.330000       184.600000  2019.000000          0.162200           0.665600         0.711900              0.265400        0.460100                 0.118900
1      20.570000     17.770000      132.900000  1326.000000         0.084740          0.078640        0.086900             0.070170       0.181200  ...      23.410000       158.800000  1956.000000          0.123800           0.186600         0.241600              0.186000        0.275000                 0.089020
2      19.690000     21.250000      130.000000  1203.000000         0.109600          0.159900        0.197400             0.127900       0.206900  ...      25.530000       152.500000  1709.000000          0.144400           0.424500         0.450400              0.243000        0.361300                 0.087580
3      11.420000     20.380000       77.580000   386.100000         0.142500          0.283900        0.241400             0.105200       0.259700  ...      26.500000        98.870000   567.700000          0.209800           0.866300         0.686900              0.257500        0.663800                 0.173000
4      20.290000     14.340000      135.100000  1297.000000         0.100300          0.132800        0.198000             0.104300       0.180900  ...      16.670000       152.200000  1575.000000          0.137400           0.205000         0.400000              0.162500        0.236400                 0.076780
..           ...           ...             ...          ...              ...               ...             ...                  ...            ...  ...            ...              ...          ...               ...                ...              ...                   ...             ...                      ...
709    18.182301     24.944043      121.442258  1048.093046         0.105821          0.177739        0.225725             0.116642       0.193406  ...      33.361608       158.493074  1666.140527          0.144662           0.409220         0.565327              0.179898        0.294844                 0.095960
710    11.851902     18.713059       78.007613   441.464565         0.110844          0.151534        0.121607             0.051901       0.229830  ...      28.119008       119.347928   888.852911          0.163612           0.575805         0.692879              0.154555        0.475255                 0.139890
711    15.422292     24.668732      103.240756   745.189688         0.110060          0.171101        0.179765             0.089365       0.196931  ...      35.127849       144.459281  1269.764324          0.169552           0.580809         0.659512              0.202655        0.383515                 0.105496
712    15.550268     20.580991      103.085807   752.576384         0.115896          0.160149        0.194129             0.091891       0.196371  ...      29.636524       139.647722  1340.919647          0.171716           0.412331         0.592770              0.210292        0.320508                 0.106742
713    12.526830     23.731139       84.203000   480.830934         0.116830          0.228817        0.216860             0.082392       0.204680  ...      38.778068        99.031745   710.424650          0.183927           0.963409         1.018880              0.216792        0.419279                 0.192472

[714 rows x 30 columns]


Methods

fit                      Fit to data.
fit_transform            Fit to data, then transform it.
get_feature_names_out    Get output feature names for transformation.
get_params               Get parameters for this estimator.
inverse_transform        Do nothing.
set_output               Set output container.
set_params               Set the parameters of this estimator.
transform                Balance the data.


method fit(X, y)[source]
Fit to data.

Parameters

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: sequence
Target column corresponding to X.

Returns

Self
Estimator instance.
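
A minimal sketch of calling fit as a separate step (the train/test split is an assumption for illustration; fit returns the estimator itself, so it can be chained):

>>> from atom.data_cleaning import Balancer
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.model_selection import train_test_split

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

>>> balancer = Balancer(strategy="smote", random_state=1).fit(X_train, y_train)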



method fit_transform(X=None, y=None, **fit_params)[source]
Fit to data, then transform it.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X. If None, y is ignored.

**fit_params
Additional keyword arguments for the fit method.

Returns

dataframe
Transformed feature set. Only returned if provided.

series or dataframe
Transformed target column. Only returned if provided.



method get_feature_names_out(input_features=None)[source]
Get output feature names for transformation.

Parameters

input_features: array-like of str or None, default=None
Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: ["x0", "x1", ..., "x(n_features_in_ - 1)"].
  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_out: ndarray of str objects
Same as input features.
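
A brief sketch of retrieving the output names from a fitted instance (they match the input feature names):

>>> from atom.data_cleaning import Balancer
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> balancer = Balancer(strategy="smote").fit(X, y)
>>> balancer.get_feature_names_out()  # same names as the input features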



method get_params(deep=True)[source]
Get parameters for this estimator.

Parameters

deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict
Parameter names mapped to their values.
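
A small sketch; the returned dict holds the constructor parameters:

>>> from atom.data_cleaning import Balancer

>>> balancer = Balancer(strategy="smote", random_state=1)
>>> balancer.get_params()  # e.g. contains 'strategy', 'n_jobs', 'verbose', 'random_state'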



method inverse_transform(X=None, y=None, **fit_params)[source]
Do nothing.

Returns the input unchanged. Implemented for continuity of the API.

Parameters

X: dataframe-like or None, default=None
Feature set with shape=(n_samples, n_features). If None, X is ignored.

y: sequence, dataframe-like or None, default=None
Target column(s) corresponding to X. If None, y is ignored.

Returns

dataframe
Feature set. Only returned if provided.

series or dataframe
Target column(s). Only returned if provided.



method set_output(transform=None)[source]
Set output container.

See sklearn's user guide on how to use the set_output API. The available choices are listed below.

Parameters

transform: str or None, default=None
Configure the output of the transform, fit_transform, and inverse_transform methods. If None, the configuration is not changed. Choose from:

  • "numpy"
  • "pandas" (default)
  • "pandas-pyarrow"
  • "polars"
  • "polars-lazy"
  • "pyarrow"
  • "modin"
  • "dask"
  • "pyspark"
  • "pyspark-pandas"

Returns

Self
Estimator instance.
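
A brief sketch of requesting a different output container (assuming the chosen backend, here polars, is installed):

>>> from atom.data_cleaning import Balancer
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> balancer = Balancer(strategy="smote", random_state=1).set_output(transform="polars")
>>> X_bal, y_bal = balancer.fit_transform(X, y)  # X_bal is now a polars dataframe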



method set_params(**params)[source]
Set the parameters of this estimator.

Parameters

**params: dict
Estimator parameters.

Returns

self: estimator instance
Estimator instance.
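
A one-line sketch of updating parameters on an existing instance (set_params returns the instance itself):

>>> from atom.data_cleaning import Balancer

>>> balancer = Balancer(strategy="smote")
>>> balancer = balancer.set_params(strategy="adasyn", random_state=1)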



method transform(X, y)[source]
Balance the data.

Parameters

X: dataframe-like
Feature set with shape=(n_samples, n_features).

y: sequence
Target column corresponding to X.

Returns

dataframe
Balanced dataframe.

series
Transformed target column.
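
A closing sketch of fitting and transforming in separate steps (as in the example above, SMOTE adds samples to the minority class until the classes are balanced):

>>> from atom.data_cleaning import Balancer
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> balancer = Balancer(strategy="smote", random_state=1)
>>> balancer.fit(X, y)
>>> X_bal, y_bal = balancer.transform(X, y)
>>> y_bal.value_counts()  # both classes now hold the same number of samples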