
knn Impute Using Categorical Variables with caret Package

Last Updated : 23 Jul, 2025

In data science and machine learning, missing data is a common issue that can significantly impact the performance of predictive models. One effective way to handle missing values is imputation, which replaces missing data with substituted values. The caret package in R provides several methods for imputation, one of which is K-Nearest Neighbors (KNN) imputation. This article focuses on KNN imputation with categorical variables in R, using caret for pre-processing support and the VIM package's kNN() function for the imputation itself.

What is KNN Imputation?

K-Nearest Neighbors (KNN) imputation is a method that replaces a missing value with the mean (for numeric data) or the most frequent value (for categorical data) among its 'k' nearest neighbors. The nearest neighbors are determined by a distance metric, typically Euclidean distance for numerical variables and Hamming distance for categorical variables.
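To make the idea concrete, here is a minimal hand-rolled sketch of the logic for a single numeric column. The toy data frame and variable names are made up for illustration only; this is not how VIM or caret implement KNN imputation internally.

R
# Toy illustration of KNN imputation for one numeric column (k = 2)
toy <- data.frame(x = c(1, 2, 3, 10), y = c(1.1, 1.9, NA, 9.5))
k <- 2

target    <- which(is.na(toy$y))                  # row with the missing value
dist_x    <- abs(toy$x - toy$x[target])           # distance on the observed variable
neighbors <- setdiff(order(dist_x), target)[1:k]  # k closest complete rows

toy$y[target] <- mean(toy$y[neighbors])           # numeric: neighbors' mean
# For a categorical column you would instead take the most frequent level,
# e.g. names(which.max(table(values_from_neighbors))).
toy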

Why Use KNN Imputation?

KNN imputation is advantageous because it considers the relationships between observations, leading to more accurate imputations than simpler methods like mean or mode imputation. This method is particularly useful when dealing with mixed-type data (both numerical and categorical variables).

Prerequisites

Before diving into KNN imputation, ensure you have the following:

  • R and RStudio installed
  • Basic understanding of R programming
  • The caret package installed

You can install the required packages (caret, dplyr, and VIM) using the following command:

install.packages(c("caret", "dplyr", "VIM"))

Now we will walk through the steps to perform KNN imputation with categorical variables in R.

Step 1. Load Required Libraries

First, we load the required libraries: caret for modeling and pre-processing utilities, dplyr for data manipulation, and VIM for KNN imputation and missing-data visualization.

R
library(caret)   # modeling and pre-processing utilities
library(dplyr)   # data manipulation
library(VIM)     # kNN() imputation and missing-data visualization

Step 2. Load and Explore the Data

For demonstration, we'll use a sample dataset. You can replace this with your dataset.

R
# Sample dataset with missing values in both numeric and categorical columns
data <- data.frame(
  Age    = c(25, 30, NA, 35, 40, 45, NA, 55),
  Gender = as.factor(c("Male", "Female", "Female", "Male", NA,
                       "Male", "Female", "Female")),
  Income = c(50000, 60000, 55000, NA, 80000, 75000, 70000, NA)
)

# Display the data
print(data)

Output:

  Age Gender Income
1  25   Male  50000
2  30 Female  60000
3  NA Female  55000
4  35   Male     NA
5  40   <NA>  80000
6  45   Male  75000
7  NA Female  70000
8  55 Female     NA
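As a quick numeric check before plotting, you can also count the missing values in each column, which here shows 2 missing values in Age, 1 in Gender, and 2 in Income:

R
# Count missing values per column
colSums(is.na(data))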

Step 3. Visualize Missing Data

Before imputation, it's helpful to visualize the missing data pattern.

R
# Visualize the amount and pattern of missing values with VIM's aggr()
aggr(data, col = c('navyblue', 'red'), numbers = TRUE, sortVars = TRUE,
     labels = names(data), cex.axis = .7, gap = 3,
     ylab = c("Missing data", "Pattern"))

Output:

The aggr() plot shows the proportion of missing values in each variable alongside the combinations in which values are missing together.
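If you prefer a text summary over the plot, the same aggr() call can be run without plotting. This is a small optional sketch; it assumes the plot argument and the summary method behave as in current VIM releases.

R
# Text-based summary of the missingness pattern instead of a plot
missing_summary <- aggr(data, plot = FALSE)
summary(missing_summary)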

Step 4. Perform KNN imputation using the VIM package

The kNN() function from the VIM package returns the imputed data along with additional indicator columns that flag which values were imputed.

R
# Perform KNN imputation using the VIM package (k = 3 nearest neighbors)
imputedData <- kNN(data, k = 3)

# kNN() appends logical indicator columns flagging which values were imputed.
# Keep only the original columns to inspect the imputed values.
imputedData <- imputedData[, colnames(data)]

# Display the imputed data
print(imputedData)

Output:

  Age Gender Income
1  25   Male  50000
2  30 Female  60000
3  45 Female  55000
4  35   Male  75000
5  40   Male  80000
6  45   Male  75000
7  30 Female  70000
8  55 Female  60000
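If you skip the column subsetting, the extra logical columns that kNN() appends (named with an "_imp" suffix in VIM's default convention, e.g. Age_imp) show exactly which cells were filled in. A brief optional sketch:

R
# Keep the indicator columns to see which cells were imputed
flagged <- kNN(data, k = 3)
flagged[, grepl("_imp$", names(flagged))]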

Step 5. Verify the Imputation

It's crucial to check if the missing values have been correctly imputed.

R
# Count remaining missing values; 0 confirms the imputation is complete
sum(is.na(imputedData))

Output:

[1] 0
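Beyond counting NAs, you can also look directly at the rows that originally contained missing values and compare them with their imputed counterparts (a small optional check):

R
# Side-by-side view of the rows that originally had missing values
rows_with_na <- which(rowSums(is.na(data)) > 0)
cbind(original = data[rows_with_na, ], imputed = imputedData[rows_with_na, ])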

By using the kNN function from the VIM package, we can successfully impute missing values for both numeric and factor variables, ensuring the dataset is complete and ready for further analysis.
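Although this walkthrough used VIM's kNN(), caret also offers KNN imputation through preProcess(method = "knnImpute"). That method works on numeric columns only, so categorical variables must first be expanded into dummy variables with dummyVars(); it also centers and scales the data as part of the imputation. The sketch below illustrates the idea under those assumptions (caret's knnImpute additionally relies on the RANN package being installed):

R
# KNN imputation with caret: one-hot encode factors, then impute the numeric matrix
dummies  <- dummyVars(~ ., data = data)
num_data <- as.data.frame(predict(dummies, newdata = data))

pp      <- preProcess(num_data, method = "knnImpute")   # k = 5 by default
imputed <- predict(pp, num_data)

# Note: the result is centered and scaled, so values are on the standardized
# scale rather than in the original units.
head(imputed)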

Conclusion

KNN imputation is a powerful method for handling missing data, especially when a dataset contains both numerical and categorical variables. The VIM package's kNN() function, together with caret's pre-processing tools, makes this process accessible even for those with basic R programming skills. By carefully pre-processing the data and choosing appropriate methods, you can significantly improve the quality of your datasets, leading to more accurate and reliable predictive models.

