
R Code for Linear Regression Analysis and its Diagnostics
By Yisihak M.

# To fit a simple regression model y on x


x <- c(0,0.75,1.5,2.25,3,3.75,4.5,5.25,6,6.75,7.5,8.25,9,9.75,10.5,11.25,12,12.75)
y <- c(54.3,50.8,58,54.6,45.3,47,51.7,43.3,44.7,38.5,42.1,40,32,34.6,32.8,33.4,28.7,26.9)
z <- cbind(y,x)
z
plot(x,y)
cor(x,y)
cor.test(x,y)

hist(y)
boxplot(y)
M1<- lm(y~x)
ss <- summary(M1) # To print summary of results of fitted line
ss

#ANOVA Table
Aov <- aov(y~x)
summary(Aov)
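# Equivalently (not in the original code), the same ANOVA table can be obtained
# directly from the fitted lm object:
anova(M1) # sequential ANOVA table for the simple regression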

# For regression diagnostics


resi<-residuals(Aov) # To find residual values
y.hat<- fitted.values(Aov) # To find fitted values of y

# Checking for independence of observations i.e. autocorrelation


library(MASS)
sr <- studres(Aov) # studentized residuals ('str' would mask the base str() function)
k <- 1:18 # as the length of x (and y) is 18
plot(k, sr) # residuals against observation order to check for autocorrelation

# Checking for Constant variance assumption


plot(y.hat,resi)
abline(h=0)

# Checking for normality of residuals


qqnorm(resi)
qqline(resi)
hist(resi)

# To merge the above 3 assumptions in one graph


library(e1071)

library(lattice)
library(PASWR)
checking.plots(Aov)

# Test for autocorrelation


library(lmtest)
dwtest(M1) # Durbin-Watson test

# Breusch-Pagan Test
#H0: Homoscedasticity. The errors have constant variance about the true model.
#H1: Heteroscedasticity. The errors have non-constant variance about the true model.
#install.packages("lmtest") # run once if lmtest is not already installed
library(lmtest)
bptest(M1)

# Test for normality of residuals


shapiro.test(resi) # use when the number of observations is < 50; otherwise use the Kolmogorov-Smirnov test below
ks.test(resi, "pnorm", mean = mean(resi), sd = sd(resi))

# To fit a Multiple Linear Regression Model


x <- c(0,0.75,1.5,2.25,3,3.75,4.5,5.25,6,6.75,7.5,8.25,9,9.75,10.5,11.25,12,12.75)
y <- c(54.3,50.8,58,54.6,45.3,47,51.7,43.3,44.7,38.5,42.1,40,32,34.6,32.8,33.4,28.7,26.9)
z <- c(18,16,9,11,9,7,4,3,8,5,1,4,3,4.6,8,4,7,9)
z1 <- cbind(z,x,y)
z1

hist(y)
boxplot(y)
M2<-lm(y~x+z) #multiple linear regression
ss<-summary(M2) # To print summary of results of fitted line
ss

# For regression diagnostics


resi <- residuals(M2) # To find residual values of the multiple regression model
y.hat <- fitted.values(M2) # To find fitted values of y

Aov <- aov(y~x+z)


# Checking for independence of errors i.e. Autocorrelation
library(MASS)
sr <- studres(Aov) # studentized residuals
k<-1:18
plot(k, sr)

One-way ANOVA
We’re going to use a data set called InsectSprays. 6 different insect sprays (1 Independent
Variable with 6 levels) were tested to see if there was a difference in the number of insects
found in the field after each spraying (Dependent Variable).
> data(InsectSprays)
> attach(InsectSprays)
> str(InsectSprays)
'data.frame': 72 obs. of 2 variables:
$ count: num 10 7 20 14 14 12 10 23 17 20 ...
$ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

1. Descriptive statistics
a. Mean, variance, number of elements in each cell
b. Visualise the data – boxplot; look at distribution, look for outliers
We’ll use the tapply() function, which is a helpful shortcut for processing data: you specify a
response variable, a factor (or factors) and a function, and that function is applied to each
subset of the response variable defined by each level of the factor. That is, instead of doing:
> mean(count[spray=="A"])# and the same for B, C, D etc.
We use tapply(response,factor,function-name) as follows
Let’s look at the means:
> tapply(count, spray, mean)
A B C D E F
14.500000 15.333333 2.083333 4.916667 3.500000 16.666667
The variances:
> tapply(count, spray, var)
A B C D E F
22.272727 18.242424 3.901515 6.265152 3.000000 38.606061

And sample sizes


> tapply(count, spray, length)
A B C D E F
12 12 12 12 12 12

And a boxplot:
> boxplot(count ~ spray)

How does the data look?

A couple of asides
The default order of factor levels is alphabetical. R needs, for example, the control condition
to be first for treatment contrasts to be easily interpreted.
If the levels are not automatically in the correct order – i.e. they are ordered categories but
come out alphabetically (e.g. "Very.short", "Short", "Long", "Very.long", or "A", "B",
"Control") – re-order the levels of the ordered IV:
To change to, for example, F < B < C < D < E < A, use:

> Photoperiod<-ordered(spray,levels=c("F","B","C","D","E","A"))
Check it:
> tapply(count,Photoperiod,mean)
F B C D E A
16.666667 15.333333 2.083333 4.916667 3.500000 14.500000

To check whether a variable is a factor (especially useful when the factor levels are
numbers), use is.factor():
> is.factor(spray)
[1] TRUE
2. Run 1-way ANOVA
a. oneway.test( )
Use, for example:
> oneway.test(count~spray)
One-way analysis of means (not assuming equal variances)
data: count and spray
F = 36.0654, num df = 5.000, denom df = 30.043, p-value = 7.999e-12
The default is that equal variances (i.e. homogeneity of variance) are not assumed, i.e.
Welch’s correction is applied (this is why the denominator df, which is normally k*(n-1), is
not a whole number in the output).
o To change this, set the "var.equal=" option to TRUE, as shown below.
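For example, a minimal sketch of the call with equal variances assumed (not shown in the
original handout):
> oneway.test(count ~ spray, var.equal = TRUE)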

oneway.test( ) corrects for non-homogeneity, but doesn’t give much information – just F, the
p-value and the numerator and denominator dfs; no mean squares etc.

b. Run an ANOVA using aov( )

Use this function and store output and use extraction functions to extract what you need.
> aov.out = aov(count ~ spray, data=InsectSprays)
> summary(aov.out)
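Besides summary( ), other extraction functions can be applied to the stored object; for
instance (an illustrative sketch, not part of the original handout):
> coef(aov.out) # estimated coefficients (treatment contrasts)
> model.tables(aov.out, "means") # tables of group means
> residuals(aov.out) # residuals, useful for the diagnostics below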

F(5,66) = 34.7, p < .001
3. Post Hoc tests

Tukey HSD (Honestly Significant Difference) is the default in R


> TukeyHSD(aov.out)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = count ~ spray, data = InsectSprays)
$spray

diff lwr upr p adj


B-A 0.8333333 -3.866075 5.532742 0.9951810
C-A -12.4166667 -17.116075 -7.717258 0.0000000
D-A -9.5833333 -14.282742 -4.883925 0.0000014
E-A -11.0000000 -15.699409 -6.300591 0.0000000
F-A 2.1666667 -2.532742 6.866075 0.7542147
C-B -13.2500000 -17.949409 -8.550591 0.0000000
D-B -10.4166667 -15.116075 -5.717258 0.0000002
E-B -11.8333333 -16.532742 -7.133925 0.0000000
F-B 1.3333333 -3.366075 6.032742 0.9603075
D-C 2.8333333 -1.866075 7.532742 0.4920707
E-C 1.4166667 -3.282742 6.116075 0.9488669
F-C 14.5833333 9.883925 19.282742 0.0000000
E-D -1.4166667 -6.116075 3.282742 0.9488669
F-D 11.7500000 7.050591 16.449409 0.0000000
F-E 13.1666667 8.467258 17.866075 0.0000000
>
4. Contrasts
NB: ANOVA and linear regression are the same thing – more on that tomorrow. For the
moment, the main point to note is that you can look at the results from aov() in terms of the
linear regression that was carried out, i.e. you can see the parameters that were estimated.
> summary.lm(aov.out)

Implicitly this can be understood as a set of (non-orthogonal) contrasts of the first group
against each of the other groups. R uses these so-called ‘Treatment’ contrasts as the default,
but you can request alternative contrasts (see later).
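For instance, a minimal sketch of requesting sum-to-zero (‘deviation’) contrasts instead of
the default treatment contrasts (an illustration, not from the original handout):
> options(contrasts = c("contr.sum", "contr.poly")) # switch the default contrasts
> aov.sum <- aov(count ~ spray, data = InsectSprays)
> summary.lm(aov.sum) # coefficients are now deviations from the grand mean
> options(contrasts = c("contr.treatment", "contr.poly")) # restore the default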

Interpreting a Treatment Contrasts Output

5. Test assumptions
a. Homogeneity of variance
> bartlett.test(count ~ spray, data=InsectSprays)
Bartlett test of homogeneity of variances
data: count by spray
Bartlett's K-squared = 25.9598, df = 5, p-value = 9.085e-05
Significant result, therefore the variances cannot be assumed to be equal.
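Given the unequal variances, one option already covered above is the Welch-corrected test,
which does not assume homogeneity of variance:
> oneway.test(count ~ spray) # Welch correction; see section 2a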
b. Model checking plots
> plot(aov.out) # plotting the stored aov object produces four model-checking plots

The first plot (residuals vs. fitted values) shows whether there is a pattern in the residuals,
and ideally should show similar scatter for each condition. Here there is a worrying effect of
larger residuals for larger fitted values. This is called ‘heteroscedasticity’, meaning that not
only is the variance in the response not equal across groups, but that the variance has some
specific relationship with the size of the response. In fact you could see this in the original
boxplots. It contradicts assumptions made when doing an ANOVA.
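The four plots can also be requested one at a time via the which argument of plot() (a small
sketch, not in the original code):
> plot(aov.out, which = 1) # residuals vs fitted values
> plot(aov.out, which = 2) # normal Q-Q plot of the residuals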

The normal Q-Q plot checks for normality of the residuals; if they are not normal, the
assumptions of ANOVA are potentially violated.
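The visual check can be backed up with the Shapiro-Wilk test used earlier for the regression
residuals (a sketch):
> shapiro.test(residuals(aov.out)) # formal test of normality of the residuals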

The scale-location plot is like the first plot, but specifically tests whether the residuals
increase with the fitted values, which they do.

The final plot (residuals vs. factor levels) gives an idea of which levels of the factor are best fitted.

