Machine learning models have become essential for solving challenging real-world problems in a variety of industries, including finance, healthcare, and marketing. Among the many available algorithms, gradient boosting techniques are especially popular because of their strong predictive performance. LightGBM (Light Gradient Boosting Machine) is one such technique that many data scientists and machine learning practitioners now reach for first, thanks to its speed and efficiency.
In this post we will examine LightGBM with an emphasis on cross-validation, hyperparameter tuning, and the deployment of a LightGBM-based application, using code examples throughout to clarify the ideas covered.
Understanding LightGBM
LightGBM is a gradient-boosting framework developed by Microsoft that uses a tree-based learning algorithm. It is specifically designed to be efficient and can handle large datasets with millions of records and features. Some of its key advantages include:
- Speed: LightGBM is incredibly fast and efficient, making it suitable for both training and prediction tasks.
- High Performance: It often outperforms other gradient-boosting algorithms in terms of predictive accuracy.
- Memory Efficiency: LightGBM uses a histogram-based approach for splitting nodes in trees, which reduces memory consumption.
- Parallel and GPU Support: It can take advantage of multi-core processors and GPUs for even faster training.
- Built-in Regularization: It includes built-in L1 and L2 regularization to prevent overfitting.
- Wide Range of Applications: LightGBM can be used for both classification and regression tasks.
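To illustrate the basic workflow, and the built-in regularization mentioned above, here is a minimal sketch using LightGBM's scikit-learn interface. The dataset and all parameter values are arbitrary choices for demonstration, not recommended defaults.
Python3
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load a small binary-classification dataset (illustrative choice)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# LGBMClassifier exposes the scikit-learn API; reg_alpha / reg_lambda map to
# LightGBM's built-in L1 / L2 regularization (values chosen arbitrarily)
model = lgb.LGBMClassifier(
    num_leaves=31,
    learning_rate=0.05,
    n_estimators=100,
    reg_alpha=0.1,   # L1 regularization
    reg_lambda=0.1,  # L2 regularization
)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
print('Test accuracy:', accuracy_score(y_test, model.predict(X_test)))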
Cross-Validation
Cross-validation is a technique for estimating a model's performance and making sure it does not depend unduly on one particular training-test split of the data. The dataset is divided into several subsets, the model is trained and tested on different combinations of these subsets, and the results are averaged to obtain a more reliable approximation of the model's performance.
There are several cross-validation methods. The most popular ones are:
K-Fold Cross-Validation
K-Fold Cross-Validation is an essential method for assessing and tuning model performance in machine learning. It helps diagnose overfitting and underfitting by splitting a dataset into K subsets, known as folds. In each iteration, one fold serves as the validation set while the remaining K-1 folds are used as training data, so over K iterations every fold is used exactly once as the test set. The per-fold results are then averaged to give a reliable estimate of the model's performance.
K-Fold Cross-Validation has several benefits. It makes the most of the available data for both training and validation, which yields more accurate performance estimates, and because it evaluates the model on many data subsets it helps reveal problems such as overfitting. It can be computationally demanding, especially with large datasets or high values of K, but despite this it remains one of the most widely used ways to check that a machine learning model generalizes well to new data.
Implementation of K-Fold Cross-Validation
Python3
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
# Load the Iris dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Number of folds
n_splits = 5
# Create a KFold cross-validator
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
# Initialize a list to store model performance metrics
metrics = []
# Define LightGBM hyperparameters
params = {
    'objective': 'multiclass',
    'num_class': 3,  # Number of classes in the Iris dataset
    'metric': 'multi_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
}
# Perform K-Fold Cross-Validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Create LightGBM datasets for training and testing
    train_data = lgb.Dataset(X_train, label=y_train)
    test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

    # Train a LightGBM model
    num_round = 100
    bst = lgb.train(params, train_data, num_round)

    # Make predictions on the test set
    y_pred = bst.predict(X_test)

    # Get the class with the highest predicted probability as the predicted label
    y_pred_labels = np.argmax(y_pred, axis=1)

    # Calculate accuracy and store it in the metrics list
    accuracy = accuracy_score(y_test, y_pred_labels)
    metrics.append(accuracy)
# Calculate the average accuracy across all folds
average_accuracy = np.mean(metrics)
print(f'Average Accuracy: {average_accuracy:.4f}')
Output:
Average Accuracy: 0.9600
The provided code evaluates a multiclass classification model on the Iris dataset using the LightGBM framework and k-fold cross-validation. The dataset is first loaded and split into feature variables (X) and target labels (y). The KFold cross-validator is applied with a fixed number of folds and shuffling enabled for a robust evaluation. For each fold the model is trained on the training subset and makes predictions on the test subset, and the class with the highest predicted probability is chosen as the predicted label. The accuracy is computed for every fold, and the average accuracy across all folds provides an overall indicator of the model's classification performance on the Iris dataset.
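LightGBM also ships a built-in lgb.cv helper that performs k-fold cross-validation in a single call. The following minimal sketch reuses the Iris data and multiclass parameters from the example above; the number of folds and seed are arbitrary choices, and the exact metric key names in the returned dictionary depend on the LightGBM version.
Python3
import lightgbm as lgb
from sklearn.datasets import load_iris

# Reuse the Iris data and multiclass parameters from the example above
data = load_iris()
train_set = lgb.Dataset(data.data, label=data.target)
params = {
    'objective': 'multiclass',
    'num_class': 3,
    'metric': 'multi_logloss',
    'verbosity': -1,
}

# lgb.cv runs k-fold cross-validation internally and returns the
# per-iteration mean/std of the chosen metric across folds
cv_results = lgb.cv(params, train_set, num_boost_round=100, nfold=5,
                    stratified=True, shuffle=True, seed=42)

# Print the final-iteration value for each returned metric
# (key names such as 'multi_logloss-mean' vary by LightGBM version)
print({key: values[-1] for key, values in cv_results.items()})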
Stratified K-Fold Cross-Validation
Stratified K-Fold Cross-Validation is a variant of K-Fold Cross-Validation designed for classification problems. To reduce evaluation bias, it ensures that each fold has a class-label distribution close to that of the overall dataset.
Because every fold preserves roughly the same class proportions as the full dataset, stratification is especially helpful in classification tasks where imbalanced class distributions would otherwise bias the model assessment. Maintaining class proportions in every fold yields a more accurate estimate of a model's performance, which makes it a reliable method both for selecting hyperparameters and for evaluating generalization. This strategy is frequently used when the target variable has an uneven class distribution, so that each fold remains representative of the data as a whole.
Implementation of Stratified K-Fold Cross-Validation
Python3
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
# Define hyperparameters for LightGBM
params = {
    'objective': 'multiclass',   # For multi-class classification
    'metric': 'multi_logloss',   # Logarithmic loss for multiclass
    'boosting_type': 'gbdt',
    'num_class': 3,              # Number of classes in Iris dataset
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}
# Number of folds for stratified cross-validation
num_folds = 5
# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)
# Initialize an empty list to store cross-validation scores
cv_scores = []
# Perform stratified k-fold cross-validation
for train_index, val_index in skf.split(X, y):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    train_data = lgb.Dataset(X_train, label=y_train)
    val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

    # Train LightGBM model with early stopping on the validation fold
    # (stopping_rounds=50 is an illustrative choice)
    model = lgb.train(
        params,
        train_data,
        num_boost_round=1000,
        valid_sets=[val_data],
        callbacks=[lgb.early_stopping(stopping_rounds=50)]
    )

    # Make predictions on the validation set
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)

    # Convert predicted probabilities to class predictions
    val_pred_classes = np.argmax(val_pred, axis=1)

    # Calculate accuracy and store it in the list
    accuracy = accuracy_score(y_val, val_pred_classes)
    cv_scores.append(accuracy)
# Calculate the mean and standard deviation of accuracy across folds
mean_accuracy = np.mean(cv_scores)
std_accuracy = np.std(cv_scores)
print(f'Mean Accuracy: {mean_accuracy:.4f}')
print(f'Std Accuracy: {std_accuracy:.4f}')
Output:
Mean Accuracy: 0.9667
Std Accuracy: 0.0298
This code illustrates Stratified K-Fold Cross-Validation, a popular machine learning technique, using the gradient boosting framework LightGBM. First, the widely used Iris benchmark dataset for classification is loaded, and the model's hyperparameters for multiclass classification are defined. The StratifiedKFold splitter divides the data into five subsets while preserving the class distribution in each one. Inside the cross-validation loop, a LightGBM model is trained on the training folds with early stopping to avoid overfitting, and accuracy is calculated from predictions on the validation fold. Finally, the code gathers the accuracy scores across the folds and reports their mean and standard deviation, giving a thorough assessment of the model's performance and generalization on the Iris dataset.
Leave-One-Out Cross-Validation (LOOCV)
Leave-One-Out Cross-Validation (LOOCV) is a resampling technique used to evaluate how well machine learning models perform. It takes an exhaustive approach: in each round, a single data point is held out as the validation set and the remaining data is used for training, and the procedure is repeated until every data point has been held out once, so there are as many rounds as there are data points. Because all available data is used for both training and validation, LOOCV gives a thorough assessment of a model's generalization, but it can be computationally expensive, particularly for large datasets. The final performance score is typically the average of the individual validation results, giving an accurate measure of the model's predictive capacity and robustness.
Implementation of Leave-One-Out Cross-Validation (LOOCV)
Here's how to use Python to implement LOOCV with LightGBM:
Python3
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut
# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()
# Initialize an empty list to store cross-validation scores
cv_scores = []
# Perform Leave-One-Out Cross-Validation
for train_index, val_index in loo.split(X):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    # Create and configure a LightGBM dataset for training
    train_data = lgb.Dataset(X_train, label=y_train)

    # Define hyperparameters for LightGBM
    params = {
        'objective': 'multiclass',
        'num_class': 3,
        'boosting_type': 'gbdt',
        'num_leaves': 5,
        'learning_rate': 0.05,
    }

    # Train LightGBM model
    model = lgb.train(params, train_data, num_boost_round=100)

    # Make predictions on the validation set
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)

    # Get the predicted class (index of the highest probability)
    val_pred_class = np.argmax(val_pred, axis=1)

    # Calculate accuracy and store it in the list
    accuracy = accuracy_score(y_val, val_pred_class)
    cv_scores.append(accuracy)
# Calculate the mean and standard deviation of accuracy across folds
mean_accuracy = np.mean(cv_scores)
std_accuracy = np.std(cv_scores)
print(f'Mean Accuracy: {mean_accuracy:.4f}')
print(f'Std Accuracy: {std_accuracy:.4f}')
Output:
Mean Accuracy: 0.9533
Std Accuracy: 0.2109
This code sample illustrates Leave-One-Out Cross-Validation (LOOCV) with LightGBM on the Iris dataset. LOOCV iterates through the data, using each data point in turn as the validation set and training the model on the remaining points, with hyperparameters set for LightGBM's multiclass objective. Accuracy is determined for each iteration, and the mean and standard deviation are computed across all iterations. Because the model is tested on every data point separately, each observation contributes to the evaluation, providing a thorough assessment of performance. The final mean accuracy and standard deviation give an idea of the model's predictive ability and consistency in classifying samples from the Iris dataset.
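As a more compact alternative (a sketch under the same setup, not part of the original walkthrough), LightGBM's scikit-learn wrapper can be combined with cross_val_score and LeaveOneOut to obtain the same style of evaluation in a few lines; the hyperparameter values below mirror the ones used above.
Python3
import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# LGBMClassifier follows the scikit-learn estimator API, so it plugs directly
# into cross_val_score; LOOCV trains one model per sample, which is slow but exhaustive
clf = lgb.LGBMClassifier(num_leaves=5, learning_rate=0.05, n_estimators=100)
scores = cross_val_score(clf, X, y, cv=LeaveOneOut(), scoring='accuracy')

print(f'Mean Accuracy: {scores.mean():.4f}')
print(f'Std Accuracy: {scores.std():.4f}')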
LightGBM's Hyperparameter Tuning
Optimizing the performance of a LightGBM model requires careful consideration of its hyperparameters. LightGBM exposes many hyperparameters that can be fine-tuned for a given dataset, and the goal of hyperparameter tuning is to find the settings that yield the best model performance, usually measured with evaluation metrics such as accuracy, AUC, or log loss.
Two popular methods for hyperparameter tuning are grid search and random search. In grid search you specify a predetermined set of hyperparameter values in advance, and the model's performance is evaluated for every possible combination. Random search instead samples hyperparameters randomly from predetermined ranges, which is more effective when the search space is large.
Implementing Hyperparameter Tuning with LightGBM
Let's see how to perform hyperparameter tuning with LightGBM.
Import the required libraries
Python3
import lightgbm as lgb
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
This code uses LightGBM, a gradient boosting framework, together with GridSearchCV from scikit-learn for hyperparameter tuning. It loads the Iris dataset, splits the data into training and test sets, and then uses grid search to find the optimal hyperparameters. The accuracy metric is used to assess the model's performance.
Loading Dataset and Splitting Data
Python3
# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
With 80% of the data utilized for training and 20% put aside for testing, this code snippet loads the Iris dataset and divides it into training and testing sets. In order to guarantee reproducibility, the random seed is fixed using the random_state parameter.
Defining Parameters
Python3
# Define a range of values for the hyperparameters to search through
param_grid = {
    'num_leaves': [5, 20, 31],
    'learning_rate': [0.05, 0.1, 0.2],
    'n_estimators': [50, 100, 150]
}
This code defines a grid of candidate values for the hyperparameters to be tuned. By specifying several values for 'num_leaves', 'learning_rate', and 'n_estimators', it allows an exhaustive search across all their combinations during hyperparameter optimization.
Model Development
Python3
# Initialize an empty dictionary to store the best hyperparameters and their values
best_hyperparameters = {}
best_values = {}
# Initialize the LightGBM classifier
lgb_classifier = lgb.LGBMClassifier(objective='multiclass', num_class=3, boosting_type='gbdt')
# Initialize GridSearchCV for hyperparameters
grid_search = GridSearchCV(estimator=lgb_classifier, param_grid=param_grid,
                           scoring='accuracy', cv=5)
# Fit the model to the training data to search for the best hyperparameters
grid_search.fit(X_train, y_train)
# Get the best hyperparameters and their values
best_params = grid_search.best_params_
best_hyperparameters = list(best_params.keys())
best_values = list(best_params.values())
This code performs hyperparameter tuning with GridSearchCV and a LightGBM classifier. It first initializes empty containers, best_hyperparameters and best_values, to hold the optimal hyperparameters and their values, then initializes the LightGBM classifier and GridSearchCV, providing the estimator, the parameter grid to search over (param_grid), the scoring metric ('accuracy'), and the number of cross-validation folds (cv=5). The grid search then fits the model to the training set to find the optimal hyperparameters.
After the search is finished, the best hyperparameters and their matching values are stored in best_params, a dictionary whose keys are the hyperparameter names and whose values are the best value found for each. The best_hyperparameters and best_values lists then hold the names of the optimal hyperparameters and their corresponding values.
Training the model
Python3
# Train a LightGBM model with the best hyperparameters
best_model = lgb.LGBMClassifier(**best_params)
best_model.fit(X_train, y_train)
This code trains a LightGBM model with the optimal hyperparameters found during tuning. A new LightGBM classifier, best_model, is initialized by unpacking best_params (the **best_params syntax) and is then fitted to the training set (X_train and y_train), producing a model trained with the optimal parameter settings.
Prediction and Evaluation
Python3
# Make predictions on the test set using the best model
y_pred = best_model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Best hyperparameters:', best_hyperparameters)
print('Best values:', best_values)
print(f'Accuracy with best hyperparameters: {accuracy:.4f}')
Output:
Best hyperparameters: ['learning_rate', 'n_estimators', 'num_leaves']
Best values: [0.05, 50, 5]
Accuracy with best hyperparameters: 1.0000
This code uses the best model, trained with the optimized hyperparameters, to make predictions on the test set (X_test). The accuracy of these predictions is computed and reported, along with the hyperparameter names (best_hyperparameters) and values (best_values) that produced the highest cross-validated accuracy.
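Random search, mentioned earlier as the more efficient option for large search spaces, can be implemented in much the same way. The following sketch uses scikit-learn's RandomizedSearchCV with the X_train and y_train split from above; the parameter ranges and n_iter value are arbitrary choices for illustration.
Python3
import lightgbm as lgb
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# Distributions to sample from (illustrative ranges, not tuned recommendations)
param_distributions = {
    'num_leaves': randint(5, 64),
    'learning_rate': uniform(0.01, 0.19),  # samples from [0.01, 0.2)
    'n_estimators': randint(50, 300),
}

random_search = RandomizedSearchCV(
    estimator=lgb.LGBMClassifier(objective='multiclass', num_class=3),
    param_distributions=param_distributions,
    n_iter=20,          # number of random parameter settings sampled
    scoring='accuracy',
    cv=5,
    random_state=42,
)
random_search.fit(X_train, y_train)

print('Best parameters:', random_search.best_params_)
print('Best CV accuracy:', random_search.best_score_)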
Conclusion
LightGBM is a powerful gradient boosting framework that combines high predictive accuracy with efficiency and speed. Cross-validation is crucial for evaluating model performance, and hyperparameter tuning helps determine the ideal model configuration. Finally, deploying a LightGBM-based application requires serializing the trained model and exposing it through a web API for prediction. With the explanations and code examples in this tutorial, you are ready to use LightGBM in your machine learning projects from development through deployment.
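As a brief illustration of the serialization step mentioned above (a sketch, not a full deployment recipe), a trained scikit-learn-style model such as best_model can be saved to disk and reloaded later, for example inside a web API process; the file name here is an arbitrary choice.
Python3
import lightgbm as lgb

# Save the underlying booster of the fitted classifier to a text model file
best_model.booster_.save_model('lightgbm_model.txt')

# Later (e.g. in the serving process), reload the model and predict;
# for a multiclass model, predict returns per-class probabilities
booster = lgb.Booster(model_file='lightgbm_model.txt')
probabilities = booster.predict(X_test)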