The goal of stepwise regression in the R programming language is to find the simplest model that effectively explains the relationship between the predictor variables and the response variable.
Stepwise Regression in R
Stepwise regression is a systematic method for adding or removing predictor variables from a multiple regression model. It is an iterative process that begins with an initial model and then explores potential improvements by adding or removing variables according to a chosen criterion, such as statistical significance or an information criterion like the AIC.
Stepwise regression is used in statistical modeling for several reasons:
- Variable selection: It helps identify the most relevant predictor variables that have a significant impact on the response variable while excluding irrelevant or redundant variables. This can improve model interpretability and reduce overfitting.
- Model simplification: By removing insignificant variables, stepwise regression can simplify the model, which can improve its generalization performance on new data.
- Exploratory analysis: Stepwise regression can be used as an exploratory tool to gain insights into the relationships between variables and to generate hypotheses for further investigation.
- Computational efficiency: In situations where there are many potential predictor variables, stepwise regression can be computationally cheaper than evaluating all possible combinations of variables (an exhaustive all-subsets search is sketched just after this list for comparison).
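To make that comparison concrete, here is what the exhaustive alternative looks like: a minimal all-subsets sketch, assuming the leaps package (not used elsewhere in this article) is installed. The built-in mtcars dataset is used purely for illustration:
R
# Exhaustive search over all subsets of predictors for mpg in mtcars,
# via the leaps package (install.packages("leaps") if needed)
library(leaps)
all_subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
summary(all_subsets)$which   # best subset found at each model size
With 10 candidate predictors this already means 2^10 = 1024 candidate models; stepwise search visits far fewer.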
Overview of Stepwise Regression Methods
There are three main types of stepwise regression methods:
- Forward Selection: This method starts with an empty model and sequentially adds variables based on their statistical significance.
- Backward Elimination: This method starts with a full model containing all predictor variables and sequentially removes variables that are insignificant.
- Stepwise Selection: This method combines forward selection and backward elimination, allowing variables to be added or removed at each step (all three calls are sketched just after this list).
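All three variants are available through the built-in step() function by changing its direction argument. A minimal sketch of the three calls, again using the built-in mtcars dataset purely for illustration:
R
# Boundary models on the built-in mtcars data
null_model <- lm(mpg ~ 1, data = mtcars)   # intercept only
full_model <- lm(mpg ~ ., data = mtcars)   # all available predictors

# Forward selection: start empty and consider only additions
fwd <- step(null_model, scope = formula(full_model), direction = "forward")

# Backward elimination: start full and consider only deletions
bwd <- step(full_model, direction = "backward")

# Stepwise selection: consider additions and deletions at every step
both <- step(full_model,
             scope = list(lower = ~ 1, upper = formula(full_model)),
             direction = "both")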
Forward Selection
In forward selection, we start with a null model (a model with no predictor variables) and iteratively add variables to the model based on their statistical significance. Here's an example in R:
R
# Load the required dataset
data(longley, package = "datasets")
# Fit the initial null model
null_model <- lm(Employed ~ 1, data = longley)
# Perform forward selection
forward_model <- step(null_model,
                      scope = list(lower = ~ 1,
                                   upper = ~ GNP.deflator + GNP + Unemployed +
                                     Armed.Forces + Population + Year),
                      direction = "forward")
# Print the summary of the selected model
summary(forward_model)
Output (truncated: the start of the step() trace, followed by a summary of the intercept-only starting model):
Start:  AIC=41.17
Employed ~ 1

Call:
lm(formula = Employed ~ 1, data = longley)

Residuals:
   Min     1Q Median     3Q    Max
-5.146 -2.604  0.187  2.974  5.234

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   65.317      0.878   74.39   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.512 on 15 degrees of freedom
In the code above, we use the longley dataset, a built-in R dataset containing several macroeconomic variables. We first fit a null model with lm() using only the intercept, and then call step() with the direction = "forward" argument to perform forward selection.
- The scope argument specifies the range of models to be considered: lower = ~ 1 represents the null model, and upper lists all six candidate predictors of Employed.
- At each step, step() adds the variable that most improves the selection criterion (AIC by default; a different penalty such as k = log(n) for BIC can be supplied) and stops when no further addition improves it.
The trace starts from the intercept-only model, whose single coefficient (65.317) is simply the mean of Employed across all 16 observations. From there, step() prints a similar table at each step as predictors are added, until adding another variable no longer lowers the AIC.
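The returned object also records the path the search took in its anova component, which is handy for reporting. A minimal sketch, reusing null_model from above; the k = log(n) variant is shown on the assumption that a BIC-style penalty is wanted:
R
# The per-step record that step() attaches to the returned model
forward_model$anova

# Re-run the search with a BIC-style penalty (k = log(n)) instead of the
# default AIC penalty (k = 2)
n <- nrow(longley)
forward_bic <- step(null_model,
                    scope = list(lower = ~ 1,
                                 upper = ~ GNP.deflator + GNP + Unemployed +
                                   Armed.Forces + Population + Year),
                    direction = "forward", k = log(n))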
Backward Elimination
In backward elimination, we start with a full model containing all predictor variables and iteratively remove variables that are insignificant. Here's an example in R:
R
# Load the required dataset
data(longley, package = "datasets")
# Fit the initial full model
initial_model <- lm(Employed ~ ., data = longley)
# Perform backward elimination
backward_model <- step(initial_model, direction = "backward")
# Print the summary of the selected model
summary(backward_model)
Output:
Start:  AIC=-33.22
Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population +
    Year

               Df Sum of Sq     RSS     AIC
- GNP.deflator  1   0.00292 0.83935 -35.163
- Population    1   0.00475 0.84117 -35.129
- GNP           1   0.10631 0.94273 -33.305
<none>                      0.83642 -33.219
- Year          1   1.49881 2.33524 -18.792
- Unemployed    1   1.59014 2.42656 -18.178
- Armed.Forces  1   2.16091 2.99733 -14.798

Step:  AIC=-35.16
Employed ~ GNP + Unemployed + Armed.Forces + Population + Year

               Df Sum of Sq    RSS     AIC
- Population    1   0.01933 0.8587 -36.799
<none>                      0.8393 -35.163
- GNP           1   0.14637 0.9857 -34.592
- Year          1   1.52725 2.3666 -20.578
- Unemployed    1   2.18989 3.0292 -16.628
- Armed.Forces  1   2.39752 3.2369 -15.568

Step:  AIC=-36.8
Employed ~ GNP + Unemployed + Armed.Forces + Year

               Df Sum of Sq    RSS     AIC
<none>                      0.8587 -36.799
- GNP           1    0.4647 1.3234 -31.879
- Year          1    1.8980 2.7567 -20.137
- Armed.Forces  1    2.3806 3.2393 -17.556
- Unemployed    1    4.0491 4.9077 -10.908

Call:
lm(formula = Employed ~ GNP + Unemployed + Armed.Forces + Year,
    data = longley)

Residuals:
     Min       1Q   Median       3Q      Max
-0.42165 -0.12457 -0.02416  0.08369  0.45268

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.599e+03  7.406e+02  -4.859 0.000503 ***
GNP          -4.019e-02  1.647e-02  -2.440 0.032833 *
Unemployed   -2.088e-02  2.900e-03  -7.202 1.75e-05 ***
Armed.Forces -1.015e-02  1.837e-03  -5.522 0.000180 ***
Year          1.887e+00  3.828e-01   4.931 0.000449 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2794 on 11 degrees of freedom
Multiple R-squared:  0.9954,    Adjusted R-squared:  0.9937
F-statistic: 589.8 on 4 and 11 DF,  p-value: 9.5e-13
The final model, selected through backward elimination, shows an excellent fit to the data, explaining almost all of the variability in the employment variable. Each of the predictors (GNP, Unemployed, Armed.Forces, and Year) significantly affects employment, with Year having a positive effect while the others have negative effects. The high significance of these predictors (all p-values < 0.05) and of the overall model suggests that these variables are strong determinants of employment in the context of the dataset used.
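Because both the full and the reduced models are ordinary lm fits, we can double-check the selection directly. A short sketch using the objects defined above:
R
# AIC comparison of the full six-predictor model and the reduced model
AIC(initial_model, backward_model)

# Partial F-test of the two dropped terms (GNP.deflator and Population)
anova(backward_model, initial_model)
A lower AIC for backward_model and a non-significant F-test would both support dropping the two terms.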
Stepwise Selection
Stepwise selection is a combination of forward selection and backward elimination. Variables can be added or removed at each step based on their statistical significance. Here's an example in R:
R
# Load the required dataset
data(longley, package = "datasets")
# Fit the initial full model
initial_model <- lm(Employed ~ ., data = longley)
# Perform stepwise selection
stepwise_model <- step(initial_model, direction = "both")
# Print the summary of the selected model
summary(stepwise_model)
Output:
Start:  AIC=-33.22
Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population +
    Year

               Df Sum of Sq     RSS     AIC
- GNP.deflator  1   0.00292 0.83935 -35.163
- Population    1   0.00475 0.84117 -35.129
- GNP           1   0.10631 0.94273 -33.305
<none>                      0.83642 -33.219
- Year          1   1.49881 2.33524 -18.792
- Unemployed    1   1.59014 2.42656 -18.178
- Armed.Forces  1   2.16091 2.99733 -14.798

Step:  AIC=-35.16
Employed ~ GNP + Unemployed + Armed.Forces + Population + Year

               Df Sum of Sq    RSS     AIC
- Population    1   0.01933 0.8587 -36.799
<none>                      0.8393 -35.163
- GNP           1   0.14637 0.9857 -34.592
+ GNP.deflator  1   0.00292 0.8364 -33.219
- Year          1   1.52725 2.3666 -20.578
- Unemployed    1   2.18989 3.0292 -16.628
- Armed.Forces  1   2.39752 3.2369 -15.568

Step:  AIC=-36.8
Employed ~ GNP + Unemployed + Armed.Forces + Year

               Df Sum of Sq    RSS     AIC
<none>                      0.8587 -36.799
+ Population    1    0.0193 0.8393 -35.163
+ GNP.deflator  1    0.0175 0.8412 -35.129
- GNP           1    0.4647 1.3234 -31.879
- Year          1    1.8980 2.7567 -20.137
- Armed.Forces  1    2.3806 3.2393 -17.556
- Unemployed    1    4.0491 4.9077 -10.908

Call:
lm(formula = Employed ~ GNP + Unemployed + Armed.Forces + Year,
    data = longley)

Residuals:
     Min       1Q   Median       3Q      Max
-0.42165 -0.12457 -0.02416  0.08369  0.45268

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.599e+03  7.406e+02  -4.859 0.000503 ***
GNP          -4.019e-02  1.647e-02  -2.440 0.032833 *
Unemployed   -2.088e-02  2.900e-03  -7.202 1.75e-05 ***
Armed.Forces -1.015e-02  1.837e-03  -5.522 0.000180 ***
Year          1.887e+00  3.828e-01   4.931 0.000449 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2794 on 11 degrees of freedom
Multiple R-squared:  0.9954,    Adjusted R-squared:  0.9937
F-statistic: 589.8 on 4 and 11 DF,  p-value: 9.5e-13
The selected linear model provides a highly accurate prediction of employment levels ("Employed") using the predictors "GNP", "Unemployed", "Armed Forces", and "Year". Each predictor significantly affects employment, with "Year" showing a positive influence while "GNP", "Unemployed", and "Armed Forces" have negative influences. The model's high R-squared and adjusted R-squared values indicate it explains nearly all the variability in employment data, making it a robust and reliable model for understanding the factors influencing employment in the context of the "longley" dataset. Future analyses could focus on exploring potential interactions between predictors or including additional relevant variables to further refine the model.
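To illustrate that last suggestion, the upper scope can be widened so that step() also considers two-way interactions. With only 16 observations this stretches the data thin, so treat the following purely as a sketch:
R
# Let step() consider all two-way interactions among the six predictors;
# the formula ~ .^2 expands relative to the terms of initial_model
interaction_model <- step(initial_model,
                          scope = list(lower = ~ 1, upper = ~ .^2),
                          direction = "both")
summary(interaction_model)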
Conclusion
Stepwise regression is a widely used and powerful method for choosing variables and building models in multiple regression analysis. It helps systematically find the most significant predictor variables and create simpler, more efficient models. However, to ensure that the results are reliable and easy to interpret, it's important to be aware of the assumptions, limitations, and potential issues associated with stepwise regression.