
LOOCV (Leave One Out Cross-Validation) in R Programming

Last Updated : 17 Apr, 2025

LOOCV (Leave-One-Out Cross-Validation) is a cross-validation technique in which each individual observation in the dataset is used once as the validation set while the remaining observations form the training set. The process is repeated until every observation has served as the validation set exactly once. LOOCV is therefore a special case of K-fold cross-validation in which the number of folds equals the number of observations (K = N). Because each training set contains almost the entire dataset, LOOCV gives a nearly unbiased estimate of the test mean squared error (MSE), and because there is no random splitting, the estimate is exactly reproducible across runs. This makes it a valuable model-validation tool, especially for small datasets, although the n fitted models are highly correlated, so the estimate can have high variance.
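The procedure described above can be sketched directly in base R. This is an illustrative example only, using the built-in mtcars dataset and an arbitrary model of mpg on wt and hp (not the dataset used later in the article):

```r
# Minimal LOOCV sketch in base R on the built-in mtcars dataset.
n <- nrow(mtcars)
errors <- numeric(n)

for (i in 1:n) {
  # Train on all observations except the i-th one
  fit <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  # Validate on the single held-out observation
  pred <- predict(fit, newdata = mtcars[i, ])
  errors[i] <- (mtcars$mpg[i] - pred)^2
}

# Average the n squared validation errors to get the LOOCV MSE
loocv_mse <- mean(errors)
loocv_mse
```

Each of the n fits leaves out exactly one row, so the loop runs n times and the final estimate is the mean of the n held-out squared errors.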

Mathematical Expression 

In Leave-One-Out Cross-Validation (LOOCV), each individual observation serves once as the validation set, while the remaining n-1 observations are used for training. Instead of refitting the model n times, LOOCV for linear models can be computed efficiently using the following formula:

\text{LOOCV MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_{ii}} \right)^2

Where:

  • y_i = actual value of the i-th observation
  • \hat{y}_i = predicted value of the i-th observation from the model fitted on the full data
  • h_{ii} = leverage of the i-th observation, i.e. the i-th diagonal element of the hat matrix H = X(X^TX)^{-1}X^T
  • n = number of observations
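The shortcut formula can be verified numerically against an explicit refit loop. The sketch below uses hatvalues() to obtain the leverages h_ii from a single full fit; the model and the mtcars dataset are chosen purely for illustration:

```r
# Fit once on the full data
fit <- lm(mpg ~ wt + hp, data = mtcars)

resid_i <- residuals(fit)   # y_i - yhat_i from the full fit
h_ii    <- hatvalues(fit)   # leverage of each observation

# LOOCV MSE via the shortcut formula (no refitting needed)
loocv_shortcut <- mean((resid_i / (1 - h_ii))^2)

# Explicit LOOCV for comparison: refit with each row left out
n <- nrow(mtcars)
loocv_explicit <- mean(sapply(1:n, function(i) {
  f <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  (mtcars$mpg[i] - predict(f, newdata = mtcars[i, ]))^2
}))

all.equal(loocv_shortcut, loocv_explicit)  # TRUE
```

The two estimates agree to machine precision, which is why the shortcut makes LOOCV essentially free for linear models.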

Implementation in R

1. Importing the Dataset

Hedonic is a dataset of housing prices for census tracts in Boston. It contains 15 variables, including the crime rate, the proportion of residential land zoned for lots over 25,000 square feet, the average number of rooms and the proportion of owner-occupied units built before 1940. It is included in the Ecdat package in R.

R
install.packages("Ecdat")  # install the package that ships the Hedonic data (run once)
library(Ecdat)             # load the package

str(Hedonic)               # inspect the structure of the dataset

Output:

[Output of str(Hedonic) showing the structure of the Hedonic dataset]

2. Performing Leave One Out Cross Validation(LOOCV) on the Dataset

We now perform Leave-One-Out Cross-Validation (LOOCV) on the dataset by fitting a model that predicts age from the other variables. In this example:

  • age.glm <- glm() fits a linear regression model to predict age from variables such as mv, crim, zn and indus in the Hedonic dataset.
  • cv.glm(Hedonic, age.glm) performs Leave-One-Out Cross-Validation (LOOCV) on the age.glm model.
  • cv.mse$delta extracts the estimated LOOCV mean squared error (MSE) from the result.
  • rep(0,5) initializes a vector to store LOOCV errors for 5 different models.
  • for (i in 1:5) creates multiple models by increasing the degree of polynomial terms applied to crim and tax.
  • poly(crim, i) and poly(tax, i) introduce polynomial features to capture non-linear relationships.
  • cv.glm(...)$delta[1] computes and stores the LOOCV error for each model.
  • cv.mse at the end displays the list of LOOCV errors to compare model performance across polynomial degrees.
R
install.packages("boot")  # provides cv.glm() (run once)
library(boot)

# Linear model (gaussian glm) predicting age from the other variables
age.glm <- glm(age ~ mv + crim + zn + indus + chas + nox + rm + tax + dis + rad + ptratio + blacks + lstat,
               data = Hedonic)

# LOOCV for the baseline model; delta holds the raw and
# bias-adjusted estimates of the LOOCV MSE
cv.mse <- cv.glm(Hedonic, age.glm)
cv.mse$delta

# Reuse cv.mse as a vector of LOOCV errors for polynomial
# degrees 1 to 5 applied to crim and tax
cv.mse <- rep(0, 5)
for (i in 1:5) {
  age.loocv <- glm(age ~ mv + poly(crim, i) + zn + indus + chas + nox + rm + poly(tax, i) + dis +
                     rad + ptratio + blacks + lstat, data = Hedonic)
  cv.mse[i] <- cv.glm(Hedonic, age.loocv)$delta[1]
}

cv.mse

Output:

[Summary output of the fitted age.glm model]

The fitted age.glm model reports a null deviance of 400100 on 505 degrees of freedom, a residual deviance of 120200 and an AIC of 4234.

[LOOCV mean squared errors for polynomial degrees 1 to 5]

The LOOCV error increases steadily with the polynomial degree, which indicates that higher-order polynomial terms do not improve this model.

Advantages of LOOCV:

  • There is no randomness in the split between training and validation sets: every observation is held out exactly once, so repeated runs always produce the same error estimate, unlike the validation-set approach.
  • It has lower bias than the validation-set approach because each training set contains n-1 observations, almost the entire dataset. As a result, the test error is over-estimated far less than with a single validation split.

Disadvantages of LOOCV:

  • The model must be fitted n times, which is computationally expensive for large datasets unless a shortcut such as the leverage-based formula is available.
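When the n refits become too expensive, K-fold cross-validation with a small K is the usual cheaper approximation. The sketch below shows this with cv.glm()'s K argument (whose default, K = n, is exactly LOOCV); the boot package and the mtcars example model are illustrative choices, not part of the article's Hedonic analysis:

```r
library(boot)  # provides cv.glm()

# Arbitrary illustrative model on the built-in mtcars dataset
fit <- glm(mpg ~ wt + hp, data = mtcars)

set.seed(1)  # K-fold splits are random, so fix the seed for reproducibility
cv_10fold <- cv.glm(mtcars, fit, K = 10)$delta[1]  # 10 refits
cv_loocv  <- cv.glm(mtcars, fit)$delta[1]          # default K = n: LOOCV, n refits

c(loocv = cv_loocv, k10 = cv_10fold)
```

With K = 10 the model is refitted only 10 times rather than n times, at the cost of some extra randomness and bias in the error estimate.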
