How to make iterative lm() formulas in R
Last Updated: 18 Jun, 2024
Creating iterative lm() formulas in R means generating and fitting multiple linear regression models programmatically. This is particularly useful when you want to systematically explore different combinations of predictors, or when you have many candidate predictors and need to automate the model-fitting process. Here is a step-by-step guide to creating and fitting iterative lm() formulas in the R Programming Language.
Steps to Create Iterative Linear Models
Here are the main steps to make iterative lm() formulas in R.
- Prepare the Data: Ensure your dataset is ready for modeling.
- Define the Variables: Specify the dependent variable and the candidate predictors.
- Create Iterative Formulas: Build a formula programmatically for each set of predictors and fit the corresponding model.
- Evaluate and Compare Models: Collect and summarize the results from each model.
Step 1: Define the Dataset
First, ensure you have a dataset to work with. For demonstration, let’s create a simple dataset.
R
# Sample dataset: one response (y) and four candidate predictors
set.seed(123)
data <- data.frame(
  y  = rnorm(100),
  x1 = rnorm(100),
  x2 = rnorm(100),
  x3 = rnorm(100),
  x4 = rnorm(100)
)
head(data)
Output:
y x1 x2 x3 x4
1 -0.56047565 -0.71040656 2.1988103 -0.7152422 -0.07355602
2 -0.23017749 0.25688371 1.3124130 -0.7526890 -1.16865142
3 1.55870831 -0.24669188 -0.2651451 -0.9385387 -0.63474826
4 0.07050839 -0.34754260 0.5431941 -1.0525133 -0.02884155
5 0.12928774 -0.95161857 -0.4143399 -0.4371595 0.67069597
6 1.71506499 -0.04502772 -0.4762469 0.3311792 -1.65054654
Step 2: Define the Variables
Identify the dependent variable and the list of independent variables.
R
dependent_var <- "y"
independent_vars <- c("x1", "x2", "x3", "x4")
Step 3: Create Iterative Formulas
You can use a loop or the apply family of functions to generate and fit multiple models. Here's an example that uses a loop to build formulas with an increasing number of predictors (nested models):
R
# Initialize a list to store the models
models <- list()

# Loop over the number of predictors
for (i in seq_along(independent_vars)) {
  # Select the first i predictors
  predictors <- independent_vars[1:i]
  # Build the formula, e.g. y ~ x1 + x2
  formula <- as.formula(paste(dependent_var, "~", paste(predictors, collapse = " + ")))
  # Fit the model
  model <- lm(formula, data = data)
  # Store the model in the list
  models[[i]] <- model
}

# Display the summaries of the models
lapply(models, summary)
Output:
[[1]]
Call:
lm(formula = formula, data = data)
Residuals:
Min 1Q Median 3Q Max
-2.39149 -0.59570 -0.04306 0.59224 2.13004
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08538 0.09220 0.926 0.357
x1 -0.04676 0.09524 -0.491 0.625
Residual standard error: 0.9163 on 98 degrees of freedom
Multiple R-squared: 0.002453, Adjusted R-squared: -0.007726
F-statistic: 0.241 on 1 and 98 DF, p-value: 0.6246
[[2]]
Call:
lm(formula = formula, data = data)
Residuals:
Min 1Q Median 3Q Max
-2.28873 -0.61476 -0.08408 0.56236 2.34188
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.10057 0.09269 1.085 0.281
x1 -0.04307 0.09499 -0.453 0.651
x2 -0.12280 0.09670 -1.270 0.207
Residual standard error: 0.9135 on 97 degrees of freedom
Multiple R-squared: 0.01877, Adjusted R-squared: -0.001466
F-statistic: 0.9276 on 2 and 97 DF, p-value: 0.399
[[3]]
Call:
lm(formula = formula, data = data)
Residuals:
Min 1Q Median 3Q Max
-2.35541 -0.58837 -0.08408 0.55592 2.31302
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.09952 0.09309 1.069 0.288
x1 -0.04102 0.09547 -0.430 0.668
x2 -0.12493 0.09719 -1.285 0.202
x3 -0.04219 0.08892 -0.474 0.636
Residual standard error: 0.9172 on 96 degrees of freedom
Multiple R-squared: 0.02106, Adjusted R-squared: -0.00953
F-statistic: 0.6885 on 3 and 96 DF, p-value: 0.5613
[[4]]
Call:
lm(formula = formula, data = data)
Residuals:
Min 1Q Median 3Q Max
-2.4254 -0.5887 -0.0335 0.5494 2.3508
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.11737 0.09197 1.276 0.2050
x1 -0.06619 0.09469 -0.699 0.4863
x2 -0.12920 0.09562 -1.351 0.1798
x3 -0.04482 0.08747 -0.512 0.6095
x4 -0.19024 0.09246 -2.058 0.0424 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9021 on 95 degrees of freedom
Multiple R-squared: 0.06282, Adjusted R-squared: 0.02336
F-statistic: 1.592 on 4 and 95 DF, p-value: 0.1827
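As an aside, base R's reformulate() builds the same formula objects without manual paste() calls. A minimal, self-contained sketch of the same nested-model loop using it (the data and variable names mirror those above):

```r
# Rebuild the sample data so this snippet runs on its own
set.seed(123)
data <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
                   x3 = rnorm(100), x4 = rnorm(100))
independent_vars <- c("x1", "x2", "x3", "x4")

# reformulate(termlabels, response) assembles e.g. y ~ x1 + x2 directly
models <- lapply(seq_along(independent_vars), function(i) {
  f <- reformulate(independent_vars[1:i], response = "y")
  lm(f, data = data)
})

# Each fit has an intercept plus i slopes
sapply(models, function(m) length(coef(m)))  # 2 3 4 5
```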
Step 4: Using Combinations of Predictors
If you want to explore all possible combinations of predictors, you can use the combn() function:
R
# Function to fit and store models for each combination of predictors
fit_models <- function(dep_var, indep_vars, data) {
  models <- list()
  index <- 1
  for (i in seq_along(indep_vars)) {
    # All combinations of i predictors, one combination per column
    combs <- combn(indep_vars, i)
    for (j in seq_len(ncol(combs))) {
      predictors <- combs[, j]
      formula <- as.formula(paste(dep_var, "~", paste(predictors, collapse = " + ")))
      model <- lm(formula, data = data)
      models[[index]] <- model
      index <- index + 1
    }
  }
  return(models)
}

# Fit models for all combinations of predictors
all_models <- fit_models(dependent_var, independent_vars, data)

# Display the summaries of the models
lapply(all_models, summary)
# Fit models for all combinations of predictors
all_models <- fit_models(dependent_var, independent_vars, data)
# Display the summaries of the models
lapply(all_models, summary)
Output:
[[1]]
Call:
lm(formula = formula, data = data)
Residuals:
Min 1Q Median 3Q Max
-2.39149 -0.59570 -0.04306 0.59224 2.13004
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08538 0.09220 0.926 0.357
x1 -0.04676 0.09524 -0.491 0.625
Residual standard error: 0.9163 on 98 degrees of freedom
Multiple R-squared: 0.002453, Adjusted R-squared: -0.007726
F-statistic: 0.241 on 1 and 98 DF, p-value: 0.6246
[[2]]
Call:
lm(formula = formula, data = data)
Residuals:
Min 1Q Median 3Q Max
-2.29504 -0.60758 -0.09748 0.56634 2.31372
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.10536 0.09171 1.149 0.253
x2 -0.12414 0.09626 -1.290 0.200
Residual standard error: 0.9098 on 98 degrees of freedom
Multiple R-squared: 0.01669, Adjusted R-squared: 0.006653
F-statistic: 1.663 on 1 and 98 DF, p-value: 0.2002
[[3]]
Call:
lm(formula = formula, data = data)
Residuals:
Min 1Q Median 3Q Max
-2.46212 -0.56903 -0.06664 0.55342 2.08212
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08900 0.09171 0.970 0.334
x3 -0.03873 0.08868 -0.437 0.663
Residual standard error: 0.9166 on 98 degrees of freedom
Multiple R-squared: 0.001943, Adjusted R-squared: -0.008241
F-statistic: 0.1908 on 1 and 98 DF, p-value: 0.6632
(remaining model summaries omitted for brevity)
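Keep in mind how quickly this grows: with k candidate predictors there are 2^k − 1 non-empty subsets, so the all-combinations loop fits 15 models here. A quick sanity check of that count:

```r
# Number of models fitted by the all-combinations loop for k predictors:
# the sum of choose(k, i) over i = 1..k, which equals 2^k - 1
k <- 4
n_models <- sum(choose(k, 1:k))
n_models   # 15
2^k - 1    # also 15
```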
Step 5: Evaluating and Comparing Models
To systematically compare models, you can extract and organize relevant statistics (e.g., AIC, BIC, R-squared):
R
# Extract model statistics for comparison
model_stats <- lapply(all_models, function(model) {
  summary_model <- summary(model)
  data.frame(
    # deparse() keeps the formula as one string like "y ~ x1 + x2";
    # as.character(formula(model)) would split it into "~", "y", "x1 + x2"
    formula = deparse(formula(model)),
    AIC = AIC(model),
    BIC = BIC(model),
    R_squared = summary_model$r.squared,
    Adjusted_R_squared = summary_model$adj.r.squared
  )
})

# Combine into a single data frame for easy viewing
model_stats_df <- do.call(rbind, model_stats)
print(model_stats_df)
Output:
                 formula      AIC      BIC   R_squared Adjusted_R_squared
1                 y ~ x1 270.2928 278.1083 0.002453434       -0.007725613
2                 y ~ x2 268.8557 276.6712 0.016686441        0.006652629
3                 y ~ x3 270.3440 278.1595 0.001942958       -0.008241297
4                 y ~ x4 266.7540 274.5695 0.037137567        0.027312440
5            y ~ x1 + x2 270.6440 281.0647 0.018766111       -0.001465516
6            y ~ x1 + x3 272.1163 282.5369 0.004213084       -0.016318605
7            y ~ x1 + x4 268.1626 278.5833 0.042815201        0.023079432
8            y ~ x2 + x3 270.6019 281.0226 0.019178931       -0.001044183
9            y ~ x2 + x4 266.8714 277.2920 0.055095160        0.035612586
10           y ~ x3 + x4 268.5164 278.9371 0.039422355        0.019616630
11      y ~ x1 + x2 + x3 272.4098 285.4357 0.021061425       -0.009530405
12      y ~ x1 + x2 + x4 268.3261 281.3519 0.060233429        0.030865724
13      y ~ x1 + x3 + x4 269.9536 282.9795 0.044813074        0.014963483
14      y ~ x2 + x3 + x4 268.5630 281.5889 0.058004079        0.028566706
15 y ~ x1 + x2 + x3 + x4 270.0500 285.6810 0.062824200        0.023364166
- formula: Shows the exact combination of predictors used in each model. This helps you understand which predictors were included in each iteration.
- AIC: Helps in model selection, with lower values indicating models that balance fit and complexity better. In the example, the model y ~ x4 has the lowest AIC (266.75), suggesting it offers the best trade-off among the listed models.
- BIC: Similar to AIC but with a stronger penalty for complexity. Again, lower values are better; it often favors simpler models than AIC does.
- R_squared: Indicates how well the model explains the variance in the dependent variable. Higher values are better, but R-squared alone doesn't account for the number of predictors.
- Adjusted_R_squared: Adjusts R-squared for the number of predictors, providing a more accurate measure for model comparison when multiple predictors are involved. Higher values indicate a better fit that is not solely due to adding more predictors.
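With the statistics collected in one data frame (as in Step 5), selecting a model by a criterion is a single subsetting step. A minimal sketch using which.min(), run here on a hand-entered subset of the AIC values from the table above:

```r
# Pick the model with the lowest AIC from a comparison table.
# model_stats_df is assumed to have one row per model, as built in Step 5;
# for a self-contained example we enter a small subset by hand.
model_stats_df <- data.frame(
  formula = c("y ~ x1", "y ~ x4", "y ~ x1 + x2 + x3 + x4"),
  AIC     = c(270.2928, 266.7540, 270.0500)
)
best <- model_stats_df[which.min(model_stats_df$AIC), ]
best$formula   # "y ~ x4", the lowest-AIC model in this subset
```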
Conclusion
Iterative modeling in R allows you to systematically explore different combinations of predictors and evaluate their impact on the response variable. By following the steps outlined in this guide, you can automate the process of fitting multiple linear models and extracting their summaries, making your analysis more efficient and comprehensive.