How to Calculate Correlation in R with Missing Values
Last Updated: 08 May, 2025
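When calculating correlation in R, missing values are not dropped automatically. The cor() function defaults to use = "everything", so the result is NA for any pair of variables that contains a missing value. To get a numeric result, you have to tell R how to treat incomplete observations, for example by dropping incomplete rows (listwise deletion) or by using every row available for each pair of variables (pairwise deletion).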
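The default behaviour is easy to check directly. The following is a minimal sketch with two made-up vectors (the names x, y, default_result and complete_result are chosen here for illustration only): it calls cor() once without a use argument and once with use = "complete.obs".

```r
# Two illustrative vectors, each with one missing value
x <- c(1, 2, 3, NA, 5)
y <- c(4, NA, 6, 7, 8)

# No `use` argument: the default is use = "everything",
# so a single NA in either vector makes the result NA
default_result <- cor(x, y)
print(default_result)   # NA

# Dropping incomplete pairs yields a numeric correlation
complete_result <- cor(x, y, use = "complete.obs")
print(complete_result)
```

If a correlation unexpectedly comes back as NA, a missing value combined with the default use setting is a likely cause.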
There are several ways to calculate correlation in R when the data contains missing values:
- Using cor() with "complete.obs" to exclude any rows with missing data.
- Using cor() with "pairwise.complete.obs" to calculate correlation for each variable pair using all available data.
- Manually handling missing values using data cleaning techniques.
- Applying imputation and then using cov() or cor() to calculate correlation on the cleaned data.
1. Using cor() with complete.obs
In this example, we use the cor() function to calculate the correlation matrix for the columns A, B and C. By specifying use = "complete.obs", it calculates the correlations using only complete observations, that is, rows in which every column has a value. The resulting correlation matrix is then printed to the console.
R
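# Each column has exactly one missing value, in a different row
data <- data.frame(
  A = c(1, 2, 3, NA, 5),
  B = c(5, NA, 7, 8, 9),
  C = c(10, 11, 12, 13, NA)
)
# Keep only rows with no NA in any column, then correlate
correlation_matrix <- cor(data, use = "complete.obs")
print(correlation_matrix)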
Output:
  A B C
A 1 1 1
B 1 1 1
C 1 1 1
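A matrix made up entirely of 1s deserves a second look. The sketch below, which uses complete.cases() from base R and the illustrative names complete_rows and n_complete, counts how many rows survive listwise deletion in this data frame.

```r
data <- data.frame(
  A = c(1, 2, 3, NA, 5),
  B = c(5, NA, 7, 8, 9),
  C = c(10, 11, 12, 13, NA)
)

# TRUE for rows that have no missing value in any column
complete_rows <- complete.cases(data)

# Number of rows that use = "complete.obs" actually works with
n_complete <- sum(complete_rows)
print(n_complete)

# The surviving rows themselves
print(data[complete_rows, ])
```

Only rows 1 and 3 are complete here. A correlation computed from two distinct points is always exactly 1 or -1, so the output above reflects the tiny sample left after deletion and not a genuinely perfect relationship. Checking the number of complete rows before interpreting the result is a sensible habit.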
2. Using cor() with pairwise.complete.obs
In this example, we use the cor() function again. By specifying use = 'pairwise.complete.obs', it calculates the correlation matrix based on pairwise complete observations: for each pair of variables, every row in which both variables are observed is used. The resulting correlation matrix is then printed to the console.
R
df <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(4, NA, 6, 7, 8)
)
# For each pair of columns, use all rows where both values are present
correlation_matrix <- cor(df, use = 'pairwise.complete.obs')
print(correlation_matrix)
Output:
  x y
x 1 1
y 1 1
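With only two columns there is a single pair, so pairwise and listwise deletion use the same rows. The difference shows up with three or more columns. As a rough sketch, the code below reuses the three-column data frame from the first example and counts how many rows each pair of variables shares (the names observed and pair_counts are illustrative).

```r
data <- data.frame(
  A = c(1, 2, 3, NA, 5),
  B = c(5, NA, 7, 8, 9),
  C = c(10, 11, 12, 13, NA)
)

# 1 where a value is present, 0 where it is missing
observed <- (!is.na(data)) * 1

# Entry [i, j] counts the rows in which both column i and column j are observed
pair_counts <- crossprod(observed)
print(pair_counts)

# Pairwise deletion then correlates each pair on its own shared rows
pairwise_matrix <- cor(data, use = "pairwise.complete.obs")
print(pairwise_matrix)
```

Each pair shares three rows, whereas listwise deletion leaves only two rows for the whole matrix. Because every entry of a pairwise matrix can be based on a different subset of rows, it is worth reporting these counts alongside the correlations.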
3. Handling Missing Values Manually
In this approach, missing values are handled manually: na.omit() removes every row that contains a missing value before the correlation matrix is calculated. This ensures that only complete data is used in the correlation calculation.
R
data <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(3, NA, 4, 5, 6)
)
# Drop every row that contains at least one NA
complete_data <- na.omit(data)
# No NAs remain, so cor() needs no `use` argument
correlation_matrix <- cor(complete_data)
correlation_matrix
Output:
          x         y
x 1.0000000 0.9819805
y 0.9819805 1.0000000
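Removing incomplete rows with na.omit() and then calling cor() should give the same matrix as cor() with use = "complete.obs", since both keep only fully observed rows. A small check, using the illustrative names manual, builtin and same_result:

```r
data <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(3, NA, 4, 5, 6)
)

# Manual route: drop incomplete rows first, then correlate
manual <- cor(na.omit(data))

# Built-in route: let cor() drop incomplete rows itself
builtin <- cor(data, use = "complete.obs")

# Compare the two matrices, allowing for floating-point tolerance
same_result <- isTRUE(all.equal(manual, builtin))
print(same_result)
```

The manual route is mainly useful when the cleaned data frame is needed for further steps, such as plotting or modelling, and not just for the correlation.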
4. Using the cov() and cor() Functions with Imputation
In this method, we replace each missing value with the mean of its column and then calculate the covariance and correlation matrices on the imputed data, so that every row contributes to the result.
R
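data <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(3, NA, 4, 5, 6)
)
# Replace each NA with the mean of the observed values in its column
imputed_data <- apply(data, 2, function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
# With no NAs left, cov() and cor() need no `use` argument
covariance_matrix <- cov(imputed_data)
correlation_matrix <- cor(imputed_data)
correlation_matrix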
Output:
          x         y
x 1.0000000 0.8882165
y 0.8882165 1.0000000
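The imputed correlation (about 0.888) is lower than the one from the complete rows in the previous method (about 0.982). The sketch below puts the two side by side on the same data; impute_mean is a hypothetical helper written for this comparison, not a base R function.

```r
data <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(3, NA, 4, 5, 6)
)

# Hypothetical helper: fill NAs in a vector with the mean of its observed values
impute_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

# Impute column by column, keeping the result as a data frame
imputed <- as.data.frame(lapply(data, impute_mean))

# Correlation from complete rows only versus from mean-imputed data
r_complete <- cor(data$x, data$y, use = "complete.obs")
r_imputed <- cor(imputed$x, imputed$y)

print(c(complete = r_complete, imputed = r_imputed))
```

Mean imputation places the filled-in points at the centre of their column regardless of the other variable, which tends to pull the correlation toward zero. It keeps the sample size intact, but the resulting coefficient should be read with that bias in mind.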
In this article, we explored different ways to handle missing values when computing correlation in R.