LightGBM

LGB needs scaling accept sparse allows validation supports acceleration

LightGBM is a gradient boosting model that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

Faster training speed and higher efficiency.
Lower memory usage.
Better accuracy.
Capable of handling large-scale data.

Corresponding estimators are:

LGBMClassifier for classification tasks.
LGBMRegressor for regression tasks.

Example

>>> from atom import ATOMClassifier
>>> from sklearn.datasets import load_breast_cancer

>>> X, y = load_breast_cancer(return_X_y=True, as_frame=True)

>>> atom = ATOMClassifier(X, y)
>>> atom.run(models="LGB", metric="f1", verbose=2)

Training ========================= >>
Models: LGB
Metric: f1


Results for LightGBM:
Fit ---------------------------------------------
Train evaluation --> f1: 1.0
Test evaluation --> f1: 0.979
Time elapsed: 0.465s
-------------------------------------------------
Total time: 0.465s
Final results ==================== >>


Total time: 0.466s
-------------------------------------
LightGBM --> f1: 0.979

Hyperparameters

classificationregression

Parameters

n_estimators

IntDistribution(high=500, log=False, low=20, step=10)

learning_rate

FloatDistribution(high=1.0, log=True, low=0.01, step=None)

max_depth

IntDistribution(high=17, log=False, low=-1, step=2)

num_leaves

IntDistribution(high=40, log=False, low=20, step=1)

min_child_weight

FloatDistribution(high=100.0, log=True, low=0.0001, step=None)

min_child_samples

IntDistribution(high=30, log=False, low=1, step=1)

subsample

FloatDistribution(high=1.0, log=False, low=0.5, step=0.1)

colsample_bytree

FloatDistribution(high=1.0, log=False, low=0.4, step=0.1)

reg_alpha

FloatDistribution(high=100.0, log=True, low=0.0001, step=None)

reg_lambda

FloatDistribution(high=100.0, log=True, low=0.0001, step=None)

Parameters

n_estimators

IntDistribution(high=500, log=False, low=20, step=10)

learning_rate

FloatDistribution(high=1.0, log=True, low=0.01, step=None)

max_depth

IntDistribution(high=17, log=False, low=-1, step=2)

num_leaves

IntDistribution(high=40, log=False, low=20, step=1)

min_child_weight

FloatDistribution(high=100.0, log=True, low=0.0001, step=None)

min_child_samples

IntDistribution(high=30, log=False, low=1, step=1)

subsample

FloatDistribution(high=1.0, log=False, low=0.5, step=0.1)

colsample_bytree

FloatDistribution(high=1.0, log=False, low=0.4, step=0.1)

reg_alpha

FloatDistribution(high=100.0, log=True, low=0.0001, step=None)

reg_lambda

FloatDistribution(high=100.0, log=True, low=0.0001, step=None)

Attributes

Data attributes

Attributes

pipeline: pd.Series

Transformers fitted on the data.

Models that used automated feature scaling have the scaler added. Use this attribute only to access the individual instances. To visualize the pipeline, use the plot_pipeline method.

mapping: dict

Encoded values and their respective mapped values.

The column name is the key to its mapping dictionary. Only for columns mapped to a single column (e.g. Ordinal, Leave-one-out, etc...).

dataset: dataframe

Complete data set.

train: dataframe

Training set.

test: dataframe

Test set.

X: dataframe

Feature set.

y: series | dataframe

Target column(s).

X_train: dataframe

Features of the training set.

y_train: series | dataframe

Target column(s) of the training set.

X_test: dataframe

Features of the test set.

y_test: series | dataframe

Target column(s) of the test set.

shape: tuple[int, int]

Shape of the dataset (n_rows, n_columns).

columns: series

Name of all the columns.

n_columns: int

Number of columns.

features: series

Name of the features.

n_features: int

Number of features.

target: str | list[str]

Name of the target column(s).

Utility attributes

Attributes

name: str

Name of the model.

Use the property's @setter to change the model's name. The acronym always stays at the beginning of the model's name. If the model is being tracked by mlflow, the name of the corresponding run also changes.

study: Study | None

Optuna study used for hyperparameter tuning.

trials: pd.DataFrame | None

Overview of the trials' results.

All durations are in seconds. Columns include:

params: Parameters used for this trial.
estimator: Estimator used for this trial.
score: Objective score(s) of the trial.
time_trial: Duration of the trial.
time_ht: Duration of the hyperparameter tuning.
state: Trial's state (COMPLETE, PRUNED, FAIL).

best_trial: Trial | None

Trial that returned the highest score.

For multi-metric runs, the best trial is the trial that performed best on the main metric. Use the property's @setter to change the best trial. See here an example.

best_params: dict

Hyperparameters used by the best trial.

score_ht: float | list[float] | None

Metric score obtained by the best trial.

time_ht: int | None

Duration of the hyperparameter tuning (in seconds).

estimator: Predictor

Estimator fitted on the training set.

evals: dict

Scores obtained per iteration of the training.

Only the scores of the main metric are tracked. Included keys are: train and test. Read more in the user guide.

score_train: float | list[float]

Metric score on the training set.

score_test: float | list[float]

Metric score on the test set.

score_holdout: float | list[float]

Metric score on the holdout set.

time_fit: int

Duration of the model fitting on the train set (in seconds).

bootstrap: pd.DataFrame | None

Overview of the bootstrapping scores.

The dataframe has shape=(n_bootstrap, metric) and shows the score obtained by every bootstrapped sample for every metric. Using atom.bootstrap.mean() yields the same values as score_bootstrap.

score_bootstrap: float | list[float] | None

Mean metric score on the bootstrapped samples.

time_bootstrap: int | None

Duration of the bootstrapping (in seconds).

time: int

Total duration of the run (in seconds).

feature_importance: pd.Series | None

Normalized feature importance scores.

The sum of importances for all features is 1. The scores are extracted from the estimator's scores_, coef_ or feature_importances_ attribute, checked in that order. Returns None for estimators without any of those attributes.

results: pd.Series

Overview of the training results.

All durations are in seconds. Values include:

score_ht: Score obtained by the hyperparameter tuning.
time_ht: Duration of the hyperparameter tuning.
score_train: Metric score on the train set.
score_test: Metric score on the test set.
time_fit: Duration of the model fitting on the train set.
score_bootstrap: Mean score on the bootstrapped samples.
time_bootstrap: Duration of the bootstrapping.
time: Total duration of the run.

Prediction attributes

The prediction attributes are not calculated until the attribute is called for the first time. This mechanism avoids having to calculate attributes that are never used, saving time and memory.

Attributes

decision_function_train: series | dataframe

Predicted confidence scores on the training set.