
How to Use Reference Variables by Character String in a Formula in R?

Last Updated : 20 Aug, 2024

In R, statistical models and many calculations are specified with formulas, in which variables are normally referenced directly by name. Sometimes, however, the variable names are only available as character strings. This is common in programmatic workflows, where the variables are not known in advance and are passed as function arguments or generated by code.

Understanding the Problem

Suppose you have a dataset and want to fit a linear model using variables whose names are stored as character strings.

R
# Sample data
data <- data.frame(
  height = c(150, 160, 170, 180, 190),
  weight = c(50, 60, 70, 80, 90)
)

# Variable names as character strings
response_var <- "weight"
predictor_var <- "height"
data 

Output:

  height weight
1    150     50
2    160     60
3    170     70
4    180     80
5    190     90

The goal is to fit a linear model using weight as the response variable and height as the predictor variable, but you need to use the character strings stored in response_var and predictor_var.
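To see why the strings cannot simply be dropped into a formula, the short sketch below writes the formula with the variables that hold the names. The name literal_formula is only illustrative. A formula is not evaluated when it is created, so R records the names response_var and predictor_var themselves rather than the strings they contain.

```r
# Variable names as character strings
response_var <- "weight"
predictor_var <- "height"

# Writing the holder variables directly into a formula does not
# substitute their values: the formula keeps the literal names.
literal_formula <- response_var ~ predictor_var

# The formula refers to "response_var" and "predictor_var",
# not to "weight" and "height"
all.vars(literal_formula)
```

The methods in the following sections all work around this by building the formula from the strings before it is handed to lm().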

1: Creating a Formula Dynamically Using as.formula() and paste() in Base R

One of the simplest ways to achieve this is by constructing the formula dynamically using the paste() function and then converting the result to a formula object using as.formula().

R
# Sample data
data <- data.frame(
  height = c(150, 160, 170, 180, 190),
  weight = c(50, 60, 70, 80, 90)
)

# Variable names as character strings
response_var <- "weight"
predictor_var <- "height"

# Create the formula using paste() and as.formula()
formula <- as.formula(paste(response_var, "~", predictor_var))

# Print the formula
print(formula)

# Fit the linear model
model <- lm(formula, data = data)

# Print the model summary
summary(model)

Output:

weight ~ height


Call:
lm(formula = formula, data = data)

Residuals:
         1          2          3          4          5
 1.270e-14 -1.296e-14 -6.454e-15  9.733e-16  5.736e-15

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)
(Intercept) -1.000e+02  6.265e-14 -1.596e+15   <2e-16 ***
height       1.000e+00  3.673e-16  2.723e+15   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.161e-14 on 3 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 7.413e+30 on 1 and 3 DF, p-value: < 2.2e-16

  • paste(response_var, "~", predictor_var) constructs the formula as a character string, e.g., "weight ~ height".
  • as.formula() converts this character string into a formula object that can be used in modeling functions like lm().

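The resulting model uses the variables named by the character strings. Because the sample data are exactly linear (weight = height - 100), the fit is perfect: the residuals are floating-point noise, and R will typically warn that the summary of an essentially perfect fit may be unreliable.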
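The same idea extends to several predictors by collapsing a character vector into the right-hand side of the formula. The sketch below is illustrative only: the age column, the adjusted weight values, and the names predictor_vars, rhs, formula_string, and model_formula are assumptions added for the example, not part of the original dataset.

```r
# Hypothetical data: `age` and the weight values are made up for illustration
data <- data.frame(
  height = c(150, 160, 170, 180, 190),
  age    = c(25, 32, 41, 29, 36),
  weight = c(52, 59, 71, 78, 91)
)

# Variable names as character strings
response_var <- "weight"
predictor_vars <- c("height", "age")

# Join the predictor names with " + " to form the right-hand side
rhs <- paste(predictor_vars, collapse = " + ")

# Assemble the full formula string, then convert it to a formula object
formula_string <- paste(response_var, "~", rhs)
print(formula_string)  # "weight ~ height + age"

model_formula <- as.formula(formula_string)

# Fit the model and inspect the estimated coefficients
model <- lm(model_formula, data = data)
coef(model)
```

Note that this string-based approach assumes the column names are syntactically valid R names; names containing spaces or other special characters would need to be wrapped in backticks before pasting.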

2: Creating a Formula using the reformulate() Function

The reformulate() function in base R is specifically designed to create formula objects from character vectors. It is particularly useful when you need to handle multiple predictors or want a more explicit method than paste().

R
# Sample data
data <- data.frame(
  height = c(150, 160, 170, 180, 190),
  weight = c(50, 60, 70, 80, 90)
)

# Variable names as character strings
response_var <- "weight"
predictor_var <- "height"

# Create the formula using reformulate()
formula <- reformulate(termlabels = predictor_var, response = response_var)

# Print the formula
print(formula)

# Fit the linear model
model <- lm(formula, data = data)

# Print the model summary
summary(model)

Output:

weight ~ height

Call:
lm(formula = formula, data = data)

Residuals:
         1          2          3          4          5
 1.270e-14 -1.296e-14 -6.454e-15  9.733e-16  5.736e-15

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)
(Intercept) -1.000e+02  6.265e-14 -1.596e+15   <2e-16 ***
height       1.000e+00  3.673e-16  2.723e+15   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.161e-14 on 3 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 7.413e+30 on 1 and 3 DF, p-value: < 2.2e-16

  • reformulate(termlabels = predictor_var, response = response_var) creates a formula with response_var on the left-hand side (LHS) and predictor_var on the right-hand side (RHS).
  • Because reformulate() joins the term labels with + itself, there is no manual string assembly, which makes it less error-prone than paste() when there are multiple predictors.
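As a sketch of the multiple-predictor case, reformulate() accepts a character vector of term labels, and its intercept argument can drop the intercept. The age column, the adjusted weight values, and the helper fit_by_name() are hypothetical names introduced only for this illustration.

```r
# Hypothetical data: `age` and the weight values are made up for illustration
data <- data.frame(
  height = c(150, 160, 170, 180, 190),
  age    = c(25, 32, 41, 29, 36),
  weight = c(52, 59, 71, 78, 91)
)

predictor_vars <- c("height", "age")

# A character vector of term labels is joined with "+" by reformulate()
f <- reformulate(termlabels = predictor_vars, response = "weight")
print(f)

# intercept = FALSE removes the intercept from the formula
f_no_int <- reformulate(predictor_vars, response = "weight", intercept = FALSE)
print(f_no_int)

# Hypothetical helper: fit a linear model from column names given as strings
fit_by_name <- function(data, response, predictors) {
  lm(reformulate(predictors, response = response), data = data)
}

model <- fit_by_name(data, "weight", predictor_vars)
coef(model)
```

Wrapping the call in a small function like this is one way to reuse the pattern when the column names arrive as arguments.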

3: Using the rlang Package for Tidy Evaluation

The rlang package, which underlies much of the tidyverse, provides tools for tidy evaluation, a form of non-standard evaluation (NSE). These tools let you build and manipulate expressions, symbols, and formulas programmatically.

R
# Load the rlang package
library(rlang)

# Sample data
data <- data.frame(
  height = c(150, 160, 170, 180, 190),
  weight = c(50, 60, 70, 80, 90)
)

# Variable names as character strings
response_var <- "weight"
predictor_var <- "height"

# Create the formula using rlang::expr()
formula <- expr(!!sym(response_var) ~ !!sym(predictor_var))

# Print the formula
print(formula)

# Fit the linear model
model <- lm(formula, data = data)

# Print the model summary
summary(model)

Output:

weight ~ height

Call:
lm(formula = formula, data = data)

Residuals:
         1          2          3          4          5
 1.270e-14 -1.296e-14 -6.454e-15  9.733e-16  5.736e-15

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)
(Intercept) -1.000e+02  6.265e-14 -1.596e+15   <2e-16 ***
height       1.000e+00  3.673e-16  2.723e+15   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.161e-14 on 3 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 7.413e+30 on 1 and 3 DF, p-value: < 2.2e-16

  • sym() converts a string into a symbol (a variable name), and !! (unquoting) injects that symbol into the expression.
  • expr(!!sym(response_var) ~ !!sym(predictor_var)) builds the expression weight ~ height programmatically. Strictly, the result is an unevaluated call rather than a formula object; lm() accepts it because it coerces its first argument to a formula.
  • This method is most useful for more complex expressions or when the surrounding code already relies on tidyverse tooling.
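If an actual formula object is needed rather than an unevaluated call, one option is rlang's new_formula(), or evaluating the expression built by expr(). The following is a minimal sketch that assumes rlang is installed; the names quoted, f, and f_from_eval are illustrative.

```r
library(rlang)

# Sample data
data <- data.frame(
  height = c(150, 160, 170, 180, 190),
  weight = c(50, 60, 70, 80, 90)
)

# Variable names as character strings
response_var <- "weight"
predictor_var <- "height"

# expr() returns an unevaluated call, not an object of class "formula"
quoted <- expr(!!sym(response_var) ~ !!sym(predictor_var))
inherits(quoted, "formula")  # FALSE

# new_formula() builds a formula object from the two symbols
f <- new_formula(sym(response_var), sym(predictor_var))
inherits(f, "formula")  # TRUE

# Evaluating the quoted call also yields a formula object
f_from_eval <- eval(quoted)

# Fit the linear model with the formula object
model <- lm(f, data = data)
coef(model)
```

Holding a real formula object can matter for functions that check the class of their argument instead of coercing it as lm() does.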

Conclusion

Referencing variables by character string in a formula is a useful technique for dynamic, programmatic data analysis in R. In base R, as.formula() with paste() and reformulate() both build a formula from strings, and the rlang package offers sym() and !! for constructing the expression. Any of these methods lets you write flexible, reusable modeling code when the variable names are not known in advance.

