
How to make iterative lm() formulas in R

Last Updated : 18 Jun, 2024

Creating iterative lm() formulas in R involves generating and fitting multiple linear regression models programmatically. This can be particularly useful when you want to systematically explore different combinations of predictors or when you have a large number of potential predictors and need to automate the model-fitting process.

Here’s a step-by-step guide to creating and fitting iterative lm() formulas in the R programming language.

Steps to Create Iterative Linear Models

Here are the main steps to make iterative lm() formulas in R.

  • Prepare the Data: Ensure your dataset is ready for modeling.
  • Define the Variables: Create a list of different sets of predictors for each model.
  • Create Iterative Formulas: Fit models iteratively using the different sets of predictors.
  • Evaluate and Compare Models: Collect and summarize the results from each model.

Step 1: Define the Dataset

First, ensure you have a dataset to work with. For demonstration, let’s create a simple dataset.

R
# Sample dataset
set.seed(123)
data <- data.frame(
  y = rnorm(100),
  x1 = rnorm(100),
  x2 = rnorm(100),
  x3 = rnorm(100),
  x4 = rnorm(100)
)

head(data)

Output:

            y          x1         x2         x3          x4
1 -0.56047565 -0.71040656  2.1988103 -0.7152422 -0.07355602
2 -0.23017749  0.25688371  1.3124130 -0.7526890 -1.16865142
3  1.55870831 -0.24669188 -0.2651451 -0.9385387 -0.63474826
4  0.07050839 -0.34754260  0.5431941 -1.0525133 -0.02884155
5  0.12928774 -0.95161857 -0.4143399 -0.4371595  0.67069597
6  1.71506499 -0.04502772 -0.4762469  0.3311792 -1.65054654

Step 2: Define the Variables

Identify the dependent variable and the list of independent variables.

R
dependent_var <- "y"
independent_vars <- c("x1", "x2", "x3", "x4")
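
Before looping, it can help to confirm that a formula built from these strings behaves like a hand-written one. A quick check, using the same variable names just defined:

```r
dependent_var <- "y"
independent_vars <- c("x1", "x2", "x3", "x4")

# Paste the pieces into a string, then convert it to a formula object
f <- as.formula(paste(dependent_var, "~",
                      paste(independent_vars, collapse = " + ")))

class(f)  # "formula"
f         # y ~ x1 + x2 + x3 + x4
```

lm() accepts this object exactly as it would a literal formula such as y ~ x1 + x2 + x3 + x4.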

Step 3: Create Iterative Formulas

You can use a loop or apply functions to generate and fit multiple models. Here’s an example using a loop to create formulas with increasing numbers of predictors:

R
# Initialize a list to store the models
models <- list()

# Loop over the number of predictors
for (i in seq_along(independent_vars)) {
  # Select the first i predictors
  predictors <- independent_vars[1:i]
  
  # Create the formula
  formula <- as.formula(paste(dependent_var, "~", paste(predictors, collapse = " + ")))
  
  # Fit the model
  model <- lm(formula, data = data)
  
  # Store the model in the list
  models[[i]] <- model
}

# Display the summaries of the models
lapply(models, summary)

Output:

[[1]]

Call:
lm(formula = formula, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.39149 -0.59570 -0.04306  0.59224  2.13004 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.08538    0.09220   0.926    0.357
x1          -0.04676    0.09524  -0.491    0.625

Residual standard error: 0.9163 on 98 degrees of freedom
Multiple R-squared:  0.002453,    Adjusted R-squared:  -0.007726 
F-statistic: 0.241 on 1 and 98 DF,  p-value: 0.6246


[[2]]

Call:
lm(formula = formula, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.28873 -0.61476 -0.08408  0.56236  2.34188 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.10057    0.09269   1.085    0.281
x1          -0.04307    0.09499  -0.453    0.651
x2          -0.12280    0.09670  -1.270    0.207

Residual standard error: 0.9135 on 97 degrees of freedom
Multiple R-squared:  0.01877,    Adjusted R-squared:  -0.001466 
F-statistic: 0.9276 on 2 and 97 DF,  p-value: 0.399


[[3]]

Call:
lm(formula = formula, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.35541 -0.58837 -0.08408  0.55592  2.31302 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.09952    0.09309   1.069    0.288
x1          -0.04102    0.09547  -0.430    0.668
x2          -0.12493    0.09719  -1.285    0.202
x3          -0.04219    0.08892  -0.474    0.636

Residual standard error: 0.9172 on 96 degrees of freedom
Multiple R-squared:  0.02106,    Adjusted R-squared:  -0.00953 
F-statistic: 0.6885 on 3 and 96 DF,  p-value: 0.5613


[[4]]

Call:
lm(formula = formula, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4254 -0.5887 -0.0335  0.5494  2.3508 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.11737    0.09197   1.276   0.2050  
x1          -0.06619    0.09469  -0.699   0.4863  
x2          -0.12920    0.09562  -1.351   0.1798  
x3          -0.04482    0.08747  -0.512   0.6095  
x4          -0.19024    0.09246  -2.058   0.0424 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9021 on 95 degrees of freedom
Multiple R-squared:  0.06282,    Adjusted R-squared:  0.02336 
F-statistic: 1.592 on 4 and 95 DF,  p-value: 0.1827
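
As an alternative to pasting strings, base R's reformulate() assembles a formula directly from a character vector of predictors and a response. A minimal sketch that reproduces the four nested models above (the dataset and variable names are repeated here so the snippet runs on its own):

```r
set.seed(123)
data <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
                   x3 = rnorm(100), x4 = rnorm(100))
dependent_var <- "y"
independent_vars <- c("x1", "x2", "x3", "x4")

# reformulate() builds y ~ x1 + ... + xi without manual paste()/as.formula()
models <- lapply(seq_along(independent_vars), function(i) {
  lm(reformulate(independent_vars[1:i], response = dependent_var),
     data = data)
})

formula(models[[4]])  # y ~ x1 + x2 + x3 + x4
```

reformulate() avoids string-quoting mistakes and is the usual idiom when predictor names are held in a character vector.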

Step 4: Using Combinations of Predictors

If you want to explore all possible combinations of predictors, you can use the combn function:

R
# Function to fit and store models for each combination of predictors
fit_models <- function(dep_var, indep_vars, data) {
  models <- list()
  index <- 1
  for (i in seq_along(indep_vars)) {
    combs <- combn(indep_vars, i)
    for (j in 1:ncol(combs)) {
      predictors <- combs[, j]
      formula <- as.formula(paste(dep_var, "~", paste(predictors, collapse = " + ")))
      model <- lm(formula, data = data)
      models[[index]] <- model
      index <- index + 1
    }
  }
  return(models)
}

# Fit models for all combinations of predictors
all_models <- fit_models(dependent_var, independent_vars, data)

# Display the summaries of the models
lapply(all_models, summary)

Output:

[[1]]

Call:
lm(formula = formula, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.39149 -0.59570 -0.04306  0.59224  2.13004 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.08538    0.09220   0.926    0.357
x1          -0.04676    0.09524  -0.491    0.625

Residual standard error: 0.9163 on 98 degrees of freedom
Multiple R-squared:  0.002453,    Adjusted R-squared:  -0.007726 
F-statistic: 0.241 on 1 and 98 DF,  p-value: 0.6246


[[2]]

Call:
lm(formula = formula, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.29504 -0.60758 -0.09748  0.56634  2.31372 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.10536    0.09171   1.149    0.253
x2          -0.12414    0.09626  -1.290    0.200

Residual standard error: 0.9098 on 98 degrees of freedom
Multiple R-squared:  0.01669,    Adjusted R-squared:  0.006653 
F-statistic: 1.663 on 1 and 98 DF,  p-value: 0.2002


[[3]]

Call:
lm(formula = formula, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.46212 -0.56903 -0.06664  0.55342  2.08212 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.08900    0.09171   0.970    0.334
x3          -0.03873    0.08868  -0.437    0.663

Residual standard error: 0.9166 on 98 degrees of freedom
Multiple R-squared:  0.001943,    Adjusted R-squared:  -0.008241 
F-statistic: 0.1908 on 1 and 98 DF,  p-value: 0.6632

(Output truncated; the summaries of the remaining 12 models follow the same pattern.)
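
With four predictors there are choose(4, 1) + choose(4, 2) + choose(4, 3) + choose(4, 4) = 15 non-empty subsets, so fit_models() returns 15 models. Naming the list entries by predictor combination makes individual fits easier to retrieve later. A sketch that rebuilds the same set using combn() with simplify = FALSE:

```r
set.seed(123)
data <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
                   x3 = rnorm(100), x4 = rnorm(100))
indep <- c("x1", "x2", "x3", "x4")

# Enumerate every non-empty subset of predictors as a list of vectors
combos <- unlist(lapply(seq_along(indep),
                        function(i) combn(indep, i, simplify = FALSE)),
                 recursive = FALSE)

# Fit one model per subset, naming each entry by its predictors
all_models <- lapply(combos, function(pred) {
  lm(reformulate(pred, response = "y"), data = data)
})
names(all_models) <- sapply(combos, paste, collapse = " + ")

length(all_models)      # 15
names(all_models)[1:5]  # "x1" "x2" "x3" "x4" "x1 + x2"
```

With named entries, a specific fit is one lookup away, e.g. summary(all_models[["x2 + x4"]]).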

Step 5: Evaluating and Comparing Models

To systematically compare models, you can extract and organize relevant statistics (e.g., AIC, BIC, R-squared):

R
# Extract model statistics for comparison
model_stats <- lapply(all_models, function(model) {
  summary_model <- summary(model)
  data.frame(
    # deparse() returns the formula as a single string (e.g. "y ~ x1");
    # as.character() would split it into three pieces ("~", "y", "x1")
    # and produce three rows per model
    formula = deparse(formula(model)),
    AIC = AIC(model),
    BIC = BIC(model),
    R_squared = summary_model$r.squared,
    Adjusted_R_squared = summary_model$adj.r.squared
  )
})

# Combine into one data frame for easy viewing
model_stats_df <- do.call(rbind, model_stats)
print(model_stats_df)

Output:

                 formula      AIC      BIC   R_squared Adjusted_R_squared
1                 y ~ x1 270.2928 278.1083 0.002453434       -0.007725613
2                 y ~ x2 268.8557 276.6712 0.016686441        0.006652629
3                 y ~ x3 270.3440 278.1595 0.001942958       -0.008241297
4                 y ~ x4 266.7540 274.5695 0.037137567        0.027312440
5            y ~ x1 + x2 270.6440 281.0647 0.018766111       -0.001465516
6            y ~ x1 + x3 272.1163 282.5369 0.004213084       -0.016318605
7            y ~ x1 + x4 268.1626 278.5833 0.042815201        0.023079432
8            y ~ x2 + x3 270.6019 281.0226 0.019178931       -0.001044183
9            y ~ x2 + x4 266.8714 277.2920 0.055095160        0.035612586
10           y ~ x3 + x4 268.5164 278.9371 0.039422355        0.019616630
11      y ~ x1 + x2 + x3 272.4098 285.4357 0.021061425       -0.009530405
12      y ~ x1 + x2 + x4 268.3261 281.3519 0.060233429        0.030865724
13      y ~ x1 + x3 + x4 269.9536 282.9795 0.044813074        0.014963483
14      y ~ x2 + x3 + x4 268.5630 281.5889 0.058004079        0.028566706
15 y ~ x1 + x2 + x3 + x4 270.0500 285.6810 0.062824200        0.023364166
  • formula: Shows the exact combination of predictors used in each model. This helps you understand which predictors were included in each iteration.
  • AIC: Helps in model selection, with lower values indicating models that better balance fit and complexity. In the example, the model y ~ x4 has the lowest AIC (266.75), suggesting it's the best fit among the listed models by this criterion.
  • BIC: Similar to AIC but with a stronger penalty for complexity. Again, lower values are better. It often leads to simpler models compared to AIC.
  • R_squared: Indicates how well the model explains the variance in the dependent variable. Higher values are better, but R-squared alone doesn't account for the number of predictors.
  • Adjusted_R_squared: Adjusts R-squared for the number of predictors, providing a more accurate measure for model comparison when multiple predictors are involved. Higher values indicate a better fit that is not solely due to adding more predictors.
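
To select a single winner programmatically, which.min() over a vector of AIC values picks the best-scoring candidate. A minimal sketch over three of the models from the comparison table (where, per the table above, y ~ x4 has the lowest AIC of the three):

```r
set.seed(123)
data <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
                   x3 = rnorm(100), x4 = rnorm(100))

candidates <- list(
  lm(y ~ x1, data = data),
  lm(y ~ x4, data = data),
  lm(y ~ x1 + x2 + x3 + x4, data = data)
)

# sapply() collects one AIC value per model; the lowest AIC wins
aics <- sapply(candidates, AIC)
best <- candidates[[which.min(aics)]]
formula(best)  # y ~ x4
```

Swapping AIC for BIC in the sapply() call applies the same selection logic with a heavier complexity penalty.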

Conclusion

Iterative modeling in R allows you to systematically explore different combinations of predictors and evaluate their impact on the response variable. By following the steps outlined in this guide, you can automate the process of fitting multiple linear models and extracting their summaries, making your analysis more efficient and comprehensive.

