
Confidence interval for xgboost regression in R

Last Updated : 30 Jul, 2024

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework, and its parallel tree boosting (also known as GBDT or GBM) solves many data science problems quickly and accurately.

What are Confidence Intervals?

Confidence intervals are crucial in regression analysis because they provide a range of values that is likely to contain the true value of the response variable. This is important for assessing the reliability and stability of predictions made by the model. While XGBoost does not natively provide confidence intervals, we can use bootstrapping to estimate them.

Now we will walk through, step by step, how to compute confidence intervals for XGBoost regression in R Programming Language.

Step 1: Setting Up the Environment

First, we need to install and load the necessary packages:

R
# Install the packages if not already installed
install.packages("xgboost")
install.packages("boot")
install.packages("dplyr")

# Load the libraries
library(xgboost)
library(boot)
library(dplyr)

Step 2: Data Preparation

For this example, we will use the built-in mtcars dataset, which contains various attributes of cars and their fuel consumption. We will use mpg (miles per gallon) as the response variable.

R
# Load the dataset
data("mtcars")

# Prepare the predictor and response variables
x <- as.matrix(mtcars[, -1])  # All columns except the first (the predictors)
y <- mtcars$mpg               # The first column, mpg, is the response variable
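
As a quick check (an illustrative addition, not part of the original workflow), we can confirm the shapes of the prepared objects before training:

R
# Verify the dimensions of the predictor matrix and response vector
dim(x)     # 32 observations, 10 predictors
length(y)  # 32 response values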

Step 3: Training the XGBoost Model

We will train an XGBoost regression model using the xgboost function:

R
# Set up the parameters
params <- list(
  objective = "reg:squarederror",  # Objective function for regression
  max_depth = 3,                   # Maximum depth of the trees
  eta = 0.1,                       # Learning rate
  nthread = 2                      # Number of threads to use
)

# Convert the data into DMatrix format
dtrain <- xgb.DMatrix(data = x, label = y)

# Train the model
set.seed(123)  # For reproducibility
model <- xgboost(data = dtrain, params = params, nrounds = 100, verbose = 0)
summary(model)

Output:

               Length Class              Mode       
handle              1 xgb.Booster.handle externalptr
raw            103376 -none-             raw        
niter               1 -none-             numeric    
evaluation_log      2 data.table         list       
call               13 -none-             call       
params              5 -none-             list       
callbacks           1 -none-             list       
feature_names      10 -none-             character  
nfeatures           1 -none-             numeric    
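
As an optional aside (an addition, not part of the original tutorial), the fitted booster can also report which predictors contribute most to the model, using xgboost's xgb.importance helper:

R
# Gain-based feature importance of the fitted booster
importance <- xgb.importance(feature_names = colnames(x), model = model)
print(importance)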

Step 4: Calculating Predictions

We can now use the trained model to make predictions on the training data:

R
# Make predictions
predictions <- predict(model, x)
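
As a quick sanity check (an addition, not from the original article), we can measure how closely these fitted values track the observed responses; keep in mind that error measured on the training data is optimistic:

R
# Root mean squared error on the training data (an optimistic estimate)
rmse <- sqrt(mean((y - predictions)^2))
print(rmse)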

Step 5: Bootstrapping for Confidence Intervals

Bootstrapping is a statistical method that involves resampling with replacement to estimate the distribution of a statistic. We will use bootstrapping to estimate the confidence intervals for our predictions.
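
Before applying this to XGBoost, it can help to see the percentile method in isolation. The following is a minimal sketch (an illustrative addition, not from the original article) that bootstraps the sample mean of mpg:

R
# Percentile bootstrap of the mean of mpg, for intuition only
set.seed(123)
boot_means <- replicate(1000, mean(sample(mtcars$mpg, replace = TRUE)))
quantile(boot_means, probs = c(0.025, 0.975))  # 95% interval for the mean

The XGBoost version below follows the same pattern, except that the resampled statistic is the full vector of model predictions rather than a single mean.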

R
# Define a function that refits the model on a bootstrap sample
# and predicts on the full original data
xgboost_predict <- function(data, indices) {
  # Resample the data (column 1 is the response, the rest are predictors)
  dtrain <- xgb.DMatrix(data = data[indices, -1], label = data[indices, 1])
  model <- xgboost(data = dtrain, params = params, nrounds = 100, verbose = 0)
  # Predict on the full, unresampled predictor matrix
  pred <- predict(model, as.matrix(mtcars[, -1]))
  return(pred)
}

# Combine the response (first column) and predictors into one matrix
data_combined <- cbind(y, x)

# Apply bootstrapping (this refits the model R times, so it can take a while)
set.seed(123)  # For reproducibility
bootstrap_results <- boot(data_combined, statistic = xgboost_predict, R = 1000)

# Calculate the confidence intervals
confidence_intervals <- apply(bootstrap_results$t, 2, function(x) quantile(x, 
                                                         probs = c(0.025, 0.975)))

# Print the confidence intervals for the first few predictions
print(confidence_intervals[, 1:5])

Output:

          [,1]     [,2]     [,3]     [,4]     [,5]
2.5% 18.90042 19.27092 21.54931 17.26411 14.50560
97.5% 21.21082 21.69735 27.51927 21.41044 19.15212

Each column of bootstrap_results$t holds the 1,000 bootstrapped predictions for one observation, and the 2.5% and 97.5% quantiles of each column form its 95% confidence interval. For example, the 95% interval for the first car's predicted mpg spans roughly 18.9 to 21.2.
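
To see how wide the intervals are relative to the point predictions, one option (a sketch, not part of the original article) is a base R plot with error bars:

R
# Plot each point prediction with its 95% bootstrap interval
n <- ncol(confidence_intervals)
plot(1:n, predictions, pch = 19,
     ylim = range(confidence_intervals, predictions),
     xlab = "Observation", ylab = "Predicted mpg",
     main = "XGBoost predictions with 95% bootstrap intervals")
arrows(1:n, confidence_intervals[1, ], 1:n, confidence_intervals[2, ],
       angle = 90, code = 3, length = 0.03)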

Conclusion

Confidence intervals are essential for understanding the reliability of predictions made by regression models. While XGBoost does not natively provide confidence intervals, bootstrapping offers a powerful and flexible approach to estimate them. By resampling the data and recalculating predictions multiple times, we can build a distribution of predictions and derive confidence intervals.

In this article, we demonstrated how to train an XGBoost regression model, make predictions, and calculate confidence intervals using bootstrapping in R. This approach can be applied to other datasets and models, providing valuable insights into the uncertainty and variability of model predictions.
