How to Calculate Correlation in R with Missing Values

Last Updated : 08 May, 2025

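When calculating correlation in R, missing values are not dropped automatically. By default, cor() uses use = "everything", so any pair of variables that contains an NA produces NA in the result. To get a numeric answer, you must either tell cor() how to treat incomplete observations through its use argument or clean the data first.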

There are several ways to calculate correlation in R when the data contains missing values:

  1. Using cor() with use = "complete.obs" to drop every row that has a missing value in any column (listwise deletion).
  2. Using cor() with use = "pairwise.complete.obs" to calculate the correlation for each variable pair from all rows where both variables are present (pairwise deletion).
  3. Manually removing rows with missing values, for example with na.omit(), before calling cor().
  4. Imputing the missing values and then using cov() or cor() on the filled-in data.
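
Before comparing these options, it can help to see what cor() returns when no use argument is given. The sketch below uses a small made-up data frame (the names demo, default_cor and pairwise_cor are illustrative only) and relies on the documented default use = "everything", under which a pair involving an NA yields NA.

```r
# Illustrative data: one NA in each column
demo <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(4, NA, 6, 7, 8)
)

# Default behaviour (use = "everything"): the x-y correlation comes back as NA
default_cor <- cor(demo)
print(default_cor)

# Requesting pairwise deletion explicitly gives a numeric answer instead
pairwise_cor <- cor(demo, use = "pairwise.complete.obs")
print(pairwise_cor)
```

The off-diagonal entry of default_cor is NA, while pairwise_cor holds a number because only the rows where both x and y are observed are used.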

1. Using cor() with complete.obs

In this example, we use the cor() function to calculate the correlation matrix for columns A, B and C. Specifying use = "complete.obs" makes cor() use only the rows that have no missing value in any column. Here only rows 1 and 3 are complete, so every coefficient is computed from just two points and is therefore exactly 1. The resulting correlation matrix is then printed to the console.

R
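# Sample data with one missing value in each column
data <- data.frame(
  A = c(1, 2, 3, NA, 5),
  B = c(5, NA, 7, 8, 9),
  C = c(10, 11, 12, 13, NA)
)

# Keep only rows that are complete across all columns (listwise deletion)
correlation_matrix <- cor(data, use = "complete.obs")

print(correlation_matrix)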

Output:

  A B C
A 1 1 1
B 1 1 1
C 1 1 1
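
A matrix of all 1s is a hint that very few rows survived the deletion. As a quick check, the sketch below counts the complete rows with complete.cases(); the variable names complete_rows and n_complete are illustrative, not part of the original example.

```r
data <- data.frame(
  A = c(1, 2, 3, NA, 5),
  B = c(5, NA, 7, 8, 9),
  C = c(10, 11, 12, 13, NA)
)

# TRUE for rows with no NA in any column
complete_rows <- complete.cases(data)

# Number of rows that use = "complete.obs" will actually work with
n_complete <- sum(complete_rows)
print(n_complete)

# The surviving rows themselves
print(data[complete_rows, ])
```

With only two complete rows, any two variables lie on a straight line, so it is worth checking this count before interpreting the coefficients.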

2. Using cor() with pairwise.complete.obs

In this example, we use the cor() function again, this time specifying use = "pairwise.complete.obs". For each pair of variables, cor() uses every row in which both variables are present, so different pairs may be based on different rows. The resulting correlation matrix is then printed to the console.

R
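# Sample data with one missing value in each column
df <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(4, NA, 6, 7, 8)
)

# For each pair, use all rows where both variables are observed
correlation_matrix <- cor(df, use = "pairwise.complete.obs")
print(correlation_matrix)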

Output:

  x y
x 1 1
y 1 1
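
Because pairwise deletion can use a different set of rows for each pair, it is often useful to know how many observations each coefficient is based on. One possible way to count them is sketched below; the names present and pairwise_n are illustrative.

```r
df <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(4, NA, 6, 7, 8)
)

# 1 where a value is present, 0 where it is NA
present <- 1 - is.na(as.matrix(df))

# Entry [i, j] counts the rows where both column i and column j are observed
pairwise_n <- crossprod(present)
print(pairwise_n)
```

Here the x-y correlation rests on three rows, while each variable on its own has four observed values.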

3. Handling Missing Values Manually

In this approach, missing values are handled manually by removing every row that contains a missing value before calculating the correlation matrix. This ensures that only complete rows are used in the calculation.

R
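# Sample data with one missing value in each column
data <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(3, NA, 4, 5, 6)
)

# Drop every row that contains at least one NA
complete_data <- na.omit(data)

# No use argument is needed because the data is now complete
correlation_matrix <- cor(complete_data)

correlation_matrix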

Output:

          x         y
x 1.0000000 0.9819805
y 0.9819805 1.0000000
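
Removing incomplete rows with na.omit() and then calling cor() should give the same matrix as cor() with use = "complete.obs". The short check below illustrates this on the same data; manual_cor, builtin_cor and same_result are illustrative names.

```r
data <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(3, NA, 4, 5, 6)
)

# Manual route: drop incomplete rows first, then correlate
manual_cor <- cor(na.omit(data))

# Built-in route: let cor() drop incomplete rows itself
builtin_cor <- cor(data, use = "complete.obs")

# Compare the two matrices with a numeric tolerance
same_result <- isTRUE(all.equal(manual_cor, builtin_cor))
print(same_result)
```

The manual route is mainly useful when the cleaned data frame is needed for other steps as well.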

4. Using the cov() and cor() Functions with Imputation

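In this method, we replace each missing value with the mean of its column and then calculate the correlation on the filled-in data, so all five rows are used. Keep in mind that mean imputation tends to pull correlations toward zero, which is why the coefficient below (0.888) is lower than the one obtained after removing incomplete rows (0.982).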

R
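# Sample data with one missing value in each column
data <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(3, NA, 4, 5, 6)
)

# Replace each NA with its column mean; apply() returns a numeric matrix
imputed_data <- apply(data, 2, function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

# Covariance matrix of the imputed data (shown for reference; cor() does not need it)
covariance_matrix <- cov(imputed_data)

# Correlation matrix of the imputed data
correlation_matrix <- cor(imputed_data)

correlation_matrix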

Output:

          x         y
x 1.0000000 0.8882165
y 0.8882165 1.0000000
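
To see how much the choice of method matters, the two results can be put side by side. The sketch below is one possible way to do this; impute_mean is a hypothetical helper written for this illustration, and lapply() is used instead of apply() so that the result stays a data frame.

```r
data <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c(3, NA, 4, 5, 6)
)

# Hypothetical helper: replace NAs in a numeric vector with the vector's mean
impute_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

# Impute column by column, keeping the result as a data frame
imputed_df <- as.data.frame(lapply(data, impute_mean))

# x-y correlation after row deletion versus after mean imputation
r_deleted <- cor(data, use = "complete.obs")["x", "y"]
r_imputed <- cor(imputed_df)["x", "y"]

print(c(deleted = r_deleted, imputed = r_imputed))
```

On this small data set the imputed coefficient comes out lower than the one based on complete rows, which is consistent with mean imputation weakening the apparent relationship between variables.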

In this article, we explored different ways to handle missing values when computing correlation in R.

