Hierarchical linear regression using R
Last Updated: 24 Apr, 2025
A linear regression model establishes a relationship between two or more variables, where one variable is dependent and the others are independent. In R Programming Language, linear regression is used to make predictions from given data about a particular topic, helping us gain valuable insights and draw conclusions that support future decisions. Hierarchical linear regression is an extension of standard linear regression that allows for the analysis of hierarchical data (grouped data, or data at different levels).
Understanding Hierarchical Linear Regression
Hierarchical Linear Regression is used when researchers want to study the relationship between variables at different levels. Each level has its own coefficients, which can vary across groups. In R Programming Language, this approach is used to understand the trend of the data both overall and at each individual level.
A hierarchical model is generally used when the data is divided into different groups at different levels, and it helps in understanding the hierarchy across those levels. Fields like education, psychology, and sociology use this model because their data is collected in a multilevel structure.
To implement hierarchical linear regression in R, we can use various packages such as:
- lme4: The lme4 package stands for "Linear Mixed-Effects Models". It uses efficient linear algebra to fit models to hierarchical data and handles large datasets with complex random-effects structures, where observations are not independent and are influenced by multiple sources of variation. It also provides model-fitting and model-diagnostic functions that help users check the adequacy of a model, along with a formula interface for specifying models in R.
- nlme: It stands for "Nonlinear Mixed-Effects Models". This is one of the fundamental R packages for fitting linear and nonlinear mixed-effects models, including data where the relationship between variables is nonlinear. It supports the inclusion of both fixed and random effects, can handle missing values using different techniques, and provides tools for checking model accuracy and for visualization.
- tidyverse: This is a collection of packages providing a set of tools for data manipulation, visualization, and analysis. It includes packages such as ggplot2, dplyr, and readr.
- ggplot2: The ggplot2 library implements the grammar of graphics and is popular for its declarative syntax, used to visualize and plot our data as graphs for better understanding.
Syntax to install and load these packages:
R
#installing important packages
install.packages("lme4")
install.packages("nlme")
install.packages("tidyverse")
install.packages("ggplot2")
#load necessary library
library(lme4)
library(nlme)
library(tidyverse)
library(ggplot2)
Implementing Hierarchical Linear Regression in R
In this article, we will use a real-world example to understand our topic in detail, using different packages for analysis and visualization. We can perform this analysis in the following steps:
- Step 1: Loading the Dataset
- Step 2: Model Specification
- Step 3: Running an Analysis of Variance (ANOVA) and Assessing Model Fit
- Step 4: Making Predictions
- Step 5: Interpretation and Visualization
Hierarchical linear regression using mtcars dataset in R
In this example, we will use the built-in R dataset "mtcars". This dataset has information such as weight, horsepower, quarter-mile time, and miles per gallon (mpg) for different types of cars. We will try to understand how these attributes affect each other at different levels.
Step 1: Loading the Dataset
R
# Loading the Dataset
data(mtcars)
We can also view our data using functions such as head(), tail(), or summary(). The head() function returns the first 6 rows of our dataset; similarly, tail() returns the last 6 rows.
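As a quick sketch, these inspection functions can be applied to mtcars directly:

```r
# View the first six rows of the dataset
head(mtcars)

# View the last six rows
tail(mtcars)

# Summary statistics (min, quartiles, mean, max) for every column
summary(mtcars)

# Dimensions of the dataset: 32 cars and 11 variables
dim(mtcars)
```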
Step 2: Model Specification
In this step, we'll specify the regression models. We'll consider two levels of the analysis: general car features, and make-related characteristics that tend to differ across manufacturers. We will try to understand the influence of each set of attributes on mpg.
R
# Car features level
model <- lm(mpg ~ wt + hp + qsec, data = mtcars)
# Car make level
model_by_make <- lm(mpg ~ disp + drat + wt, data = mtcars)
# summary of the model
summary(model)
summary(model_by_make)
Output:
Call:
lm(formula = mpg ~ wt + hp + qsec, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.8591 -1.6418 -0.4636 1.1940 5.6092
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.61053 8.41993 3.279 0.00278 **
wt -4.35880 0.75270 -5.791 3.22e-06 ***
hp -0.01782 0.01498 -1.190 0.24418
qsec 0.51083 0.43922 1.163 0.25463
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.578 on 28 degrees of freedom
Multiple R-squared: 0.8348, Adjusted R-squared: 0.8171
F-statistic: 47.15 on 3 and 28 DF, p-value: 4.506e-11
summary(model_by_make)
Call:
lm(formula = mpg ~ disp + drat + wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.2342 -2.3719 -0.3148 1.6315 6.2820
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.043257 7.099792 4.372 0.000154 ***
disp -0.016389 0.009578 -1.711 0.098127 .
drat 0.843965 1.455051 0.580 0.566537
wt -3.172482 1.217157 -2.606 0.014495 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.951 on 28 degrees of freedom
Multiple R-squared: 0.7835, Adjusted R-squared: 0.7603
F-statistic: 33.78 on 3 and 28 DF, p-value: 1.92e-09
In the above model specification, one model, "model", examines how general car features (weight, horsepower, and quarter-mile time) influence the miles per gallon (mpg) of the cars. The other model, "model_by_make", focuses on how make-related characteristics (displacement, rear axle ratio, and weight) affect mpg. Fitting the two models side by side gives us a two-level view of the data.
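Note that both fits above are ordinary single-level lm() models, since mtcars does not record car make as a grouping variable. As a sketch of a genuinely multilevel (mixed-effects) fit, the nlme package loaded earlier can add a random intercept per group; here the number of cylinders serves as an assumed stand-in for a grouping level:

```r
library(nlme)

# Use cylinder count as an assumed level-2 grouping variable
mtcars$cyl_group <- factor(mtcars$cyl)

# Fixed effects for wt and hp, plus a random intercept for each group
mixed_model <- lme(mpg ~ wt + hp, random = ~ 1 | cyl_group, data = mtcars)

# Fixed-effect estimates and variance components
summary(mixed_model)
VarCorr(mixed_model)
```

The random-intercept term lets the baseline mpg shift by group while the wt and hp slopes are shared across groups, which is the defining feature of a hierarchical model.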
Step 3: Running an Analysis of Variance (ANOVA) and Assessing Model Fit
This step is important for understanding the significance of the model fit.
R
# Conducting Analysis of Variance (ANOVA) for both models
anova_model <- anova(model)
anova_model_by_make <- anova(model_by_make)
# Printing ANOVA results for both models
cat("ANOVA for model:\n")
print(anova_model)
cat("\nANOVA for model by make:\n")
print(anova_model_by_make)
Output:
ANOVA for model:
Analysis of Variance Table
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
wt 1 847.73 847.73 127.5739 6.131e-12 ***
hp 1 83.27 83.27 12.5319 0.00142 **
qsec 1 8.99 8.99 1.3527 0.25463
Residuals 28 186.06 6.64
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANOVA for model by make:
Analysis of Variance Table
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
disp 1 808.89 808.89 92.9171 2.152e-10 ***
drat 1 14.26 14.26 1.6383 0.2111
wt 1 59.14 59.14 6.7937 0.0145 *
Residuals 28 243.75 8.71
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The incremental R-squared of 0.05123624 indicates that "model" (weight, horsepower, and quarter-mile time) explains about 5.12 percentage points more of the variance in mpg than "model_by_make" (displacement, rear axle ratio, and weight).
Overall, the ANOVA tables show that in both models most of the explained variance comes from the first predictor entered (weight and displacement, respectively), while the later predictors contribute comparatively little.
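Since "model" and "model_by_make" share only wt, they are not nested, and anova() cannot test one against the other directly. A hedged sketch of a classical hierarchical (blockwise) comparison instead enters predictors in stages into nested models:

```r
# Block 1: weight only
m1 <- lm(mpg ~ wt, data = mtcars)

# Block 2: add horsepower and quarter-mile time
m2 <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# F-test for the joint contribution of the second block
anova(m1, m2)

# Incremental R-squared of the second block over the first
summary(m2)$r.squared - summary(m1)$r.squared
```

Because m1's predictors are a subset of m2's, the anova() F-test here is a valid significance test of the added block.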
Assessing Sums of Squares for both models
R
# Extracting sequential (Type I) sums of squares from the ANOVA tables
# (note: despite the names, these are the per-predictor rows of the table,
# not the classical total/regression/error decomposition)
SST <- anova_model$`Sum Sq`[1]                        # wt row
SSR <- anova_model$`Sum Sq`[2]                        # hp row
SSE <- anova_model$`Sum Sq`[3]                        # qsec row
SST_model_by_make <- anova_model_by_make$`Sum Sq`[1]  # disp row
SSR_model_by_make <- anova_model_by_make$`Sum Sq`[2]  # drat row
SSE_model_by_make <- anova_model_by_make$`Sum Sq`[3]  # wt row
cat("\nSums of Squares:\n")
cat("SST:", SST, "\n")
cat("SSR:", SSR, "\n")
cat("SSE:", SSE, "\n")
Output:
Sums of Squares:
SST: 847.7252
SSR: 83.27418
SSE: 8.988458
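The three numbers printed above are the sequential (Type I) sums of squares for wt, hp, and qsec individually, not the classical decomposition the variable names suggest. As a sketch, the total, regression, and error sums of squares can be computed directly from the fitted model:

```r
model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Total sum of squares: variability of mpg around its mean
SS_total <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# Error sum of squares: variability left in the residuals
SS_error <- sum(residuals(model)^2)

# Regression sum of squares: variability explained by the model
SS_regression <- SS_total - SS_error

# Their ratio recovers the model's R-squared
SS_regression / SS_total
```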
Calculating Difference in Sum of Squares for both models
R
# Calculating Difference in Sum of Squares for both models
SSR_diff <- SST - SSE
SSR_diff_model_by_make <- SST_model_by_make - SSE_model_by_make
# Deriving F-statistics and p-values for both models (first predictor row)
F_statistic <- anova_model$`F value`[1]
p_value <- anova_model$`Pr(>F)`[1]
F_statistic_model_by_make <- anova_model_by_make$`F value`[1]
p_value_model_by_make <- anova_model_by_make$`Pr(>F)`[1]
cat("\nDifference in Sum of Squares for model by make (SSR_diff_model_by_make):",
SSR_diff_model_by_make, "\n")
cat("\nF-Statistic for model by make:", F_statistic_model_by_make, "\n")
cat("P-Value for model by make:", p_value_model_by_make, "\n")
Output:
Difference in Sum of Squares for model by make (SSR_diff_model_by_make): 749.7461
F-Statistic for model by make: 92.91705
P-Value for model by make: 2.151799e-10
Computing Incremental R-Squared
R
# Computing Incremental R-Squared
incremental_R_squared <- summary(model)$r.squared - summary(model_by_make)$r.squared
cat("\nIncremental R-Squared:", incremental_R_squared, "\n")
Output:
Incremental R-Squared: 0.05123624
Step 4: Making Predictions
We'll predict the mpg for a new, hypothetical car. Here we assign values for the car's displacement, rear axle ratio, and weight (in thousands of pounds).
R
new_data <- data.frame(disp = 120, drat = 3.9, wt = 2.8)
predicted_mpg <- predict(model_by_make, newdata = new_data)
cat("Predicted MPG:", predicted_mpg, "\n")
Output:
Predicted MPG: 23.48507
The value 23.48507 is the estimated miles per gallon (mpg) for a car with these features.
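predict() can also return interval estimates around the point prediction; a sketch for the same hypothetical car:

```r
model_by_make <- lm(mpg ~ disp + drat + wt, data = mtcars)
new_data <- data.frame(disp = 120, drat = 3.9, wt = 2.8)

# 95% prediction interval for a single new car with these features
predict(model_by_make, newdata = new_data, interval = "prediction", level = 0.95)

# 95% confidence interval for the mean mpg of such cars
predict(model_by_make, newdata = new_data, interval = "confidence", level = 0.95)
```

The prediction interval is wider than the confidence interval because it accounts for the residual variability of an individual car, not just the uncertainty in the fitted mean.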
Step 5: Interpretation and Visualization
R
library(ggplot2)
# Adding a new variable 'make' derived from row names of mtcars dataset
mtcars$make <- rownames(mtcars)
#'make' on the x-axis and 'mpg' on the y-axis
ggplot(mtcars, aes(x = make, y = mpg)) +
geom_bar(stat = "identity", fill = "Red") +
labs(title = "MPG by Car Make", x = "Car Make", y = "MPG") +
# Adjust the x-axis text angle for better readability
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Output:
The barplot shows the mpg (miles per gallon) of each car make (brand). This graph provides a clear visualization of the mpg performance across various car makes, allowing for easy comparison.
In this example, we used the built-in R dataset "mtcars" to perform hierarchical regression. We specified two models for different levels of our dataset and predicted mpg values with the predict() function, which draws on the patterns and trends in the data. We also plotted mpg by car make (brand) for better understanding.
Conclusion:
This article focused on a real-life example. With its help, we understood how hierarchical linear regression works, how it uses multilevel data to study the relationship between dependent and independent variables, and how those relationships influence prediction. We also saw the role that the underlying patterns and trends of historical data play in prediction.