
Week 10: ENV 445 and ENV 645: Tree Regression Lab

Correlation and Simple Linear Regression

# clear the workspace


rm(list=ls())

In this homework, you will work through the examples from class to make sure you understand them, and you
will do a regression analysis of a dataset that we'll collect on Burnaby Mountain.
# example starts on slide 9 from class
# enter the data
x=c(1,2,2,3)
y=c(1,2,3,6)

1. Use the correlation function in R, ‘cor.test()’ to determine the sample’s correlation coefficient r.
2. Is ‘cor.test(x,y)’ equal to ‘cor.test(y,x)’?
3. Assuming that y is dependent on the value of x, what percent of the variance of y can be explained by
x?
4. Calculate the covariance of x, y.
5. How is the covariance related to the correlation coefficient between 2 variables?
6. Create a correlation and covariance matrix for x and y. (Slide 13).
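As one possible sketch (not the official solution), questions 1–6 can be checked in R roughly like this, using the x and y vectors entered above:

```r
# enter the data from the slides
x <- c(1, 2, 2, 3)
y <- c(1, 2, 3, 6)

# 1-2. cor.test() gives the sample correlation r and a p-value;
#      correlation is symmetric, so the order of arguments doesn't matter
ct_xy <- cor.test(x, y)
ct_yx <- cor.test(y, x)
ct_xy$estimate   # the sample correlation coefficient r

# 3. the proportion of variance of y explained by x is r^2
r2 <- unname(ct_xy$estimate)^2

# 4-5. covariance, and its relation to r: r = cov(x, y) / (sd(x) * sd(y))
cov(x, y)
cov(x, y) / (sd(x) * sd(y))   # equals cor(x, y)

# 6. correlation and covariance matrices (slide 13)
cor(cbind(x, y))
cov(cbind(x, y))
```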

Another example of correlation: Relating time spent reading to time spent watching tv.

# Read in the data (this is an example from your text)


read = c(1,2,2,2,3,4,4,5,5,6)
tv = c(90,95,85,80,75,70,75,60,65,50)

Suppose x is the predictor variable, and y is the response variable. You are interested in testing if the amount
of reading a person does is related to the amount of tv they watch.
7. What is the parameter of interest?
8. What are HO and HA ?
9. Using the R function ‘cor.test’ calculate the correlation coefficient between read and tv. Is it significant
at α = 0.05?
10. On slide 24, I've pasted the output from R of a regression that includes the t-stat ('t value'), which is the
'Estimate'/'Std. Error'. Calculate the test statistic for this hypothesis test on the regression model (lm(y ∼ x))
using the formula from class (bottom of slide 24). Note that I used the negative of the 'Estimate'
as it's easier to use the left side of the distribution.
11. Why did we multiply by 2 to get the same p-value as R has in the regression outputs?
12. What is the degrees of freedom you used?
13. Is this test statistic in the rejection region? What do you conclude from this statistical test?
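One way to sketch questions 9–13 (a check, not the official solution) is to compare the hand-computed t statistic and two-sided p-value against what `cor.test` reports:

```r
# the reading vs tv data from above
read <- c(1, 2, 2, 2, 3, 4, 4, 5, 5, 6)
tv   <- c(90, 95, 85, 80, 75, 70, 75, 60, 65, 50)

ct <- cor.test(read, tv)
r  <- unname(ct$estimate)
n  <- length(read)

# test statistic by hand: t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# two-sided p-value: multiply the one-tail probability by 2
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)

c(t_manual, unname(ct$statistic))   # these should match
c(p_manual, ct$p.value)             # these should match
```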

Simple Linear Regression

Generate some random data (junk food eaten vs immune cell marker data)

Let’s assume x is the number of units of junk food eaten, and y is some marker of immune cell response that
indicates cancer. Here, we are interested in whether there is a significant relationship between units of junk food
eaten and a marker for cancer, as the news suggests. (Keep in mind this isn't a real study – we used R to
randomly generate the data!).
a=1 # intercept
b=1.2 # slope
x=0:20
set.seed(0); y=a+b*x + rnorm(length(x),0,4)
head(data.frame(x,y))

## x y
## 1 0 6.0518171
## 2 1 0.8950666
## 3 2 8.7191971
## 4 3 9.6897173
## 5 4 7.4585657
## 6 5 0.8401998
# Let's plot the data (x,y) we generated
par(mfrow=c(1,1), oma=c(4,4,1,1), mar=c(0,0,0,0))
plot(x,y, col='orangered', pch=16, axes=F)
#lines(x, a+b*x, col='blue')
axis(1)
axis(2, las=2)
box()
mtext(side=1, 'x', line=3)
mtext(side=2, 'y', line=3)

[Figure: scatterplot of the generated (x, y) data.]

14. Fit a linear regression using the function ‘lm()’ and summary(lm()) where x is the covariate, and y is
the response variable. Plot the data and add the predicted least squares line to the plot of the data.
15. What are the estimated parameters β̂0 (intercept) and β̂1 (slope) of the model?
16. Calculate the fitted value, ŷ when x=12 using β̂0 and β̂1 from the last question.
# fit regression model
model=lm(y~x)
# add the regression line to the plot
plot(x,y, col='orangered', pch=16, axes=F)
axis(1)
axis(2, las=2)
box()
abline(model)
[Figure: scatterplot of (x, y) with the fitted least-squares line added.]
17. Calculate the ‘residual’ of point (x=12, y=10.8).
18. Plot the residuals of this linear model. Do you see any heteroskedasticity?
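As a sketch for questions 16–17 (assuming the same seeded data and fitted model from above), the fitted value at x = 12 and the residual of the point (x = 12, y = 10.8) can be computed directly from the estimated coefficients:

```r
# regenerate the seeded data and refit the model, so this chunk is self-contained
a <- 1; b <- 1.2
x <- 0:20
set.seed(0); y <- a + b * x + rnorm(length(x), 0, 4)
model <- lm(y ~ x)

# fitted value at x = 12: y_hat = b0_hat + b1_hat * 12
y_hat_12 <- unname(coef(model)[1] + coef(model)[2] * 12)
# the same value via predict()
predict(model, newdata = data.frame(x = 12))

# residual of the point (x = 12, y = 10.8): observed minus fitted
resid_12 <- 10.8 - y_hat_12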
# what are the residuals (use R function)
resids.model = residuals(model)
# what are the residuals (write your own code)
resids.manual = y-(coef(model)[1] + coef(model)[2]*x )
# check
head(data.frame(resids.model,resids.manual))

## resids.model resids.manual
## 1 2.833055 2.833055
## 2 -3.296870 -3.296870
## 3 3.554086 3.554086
## 4 3.551432 3.551432
## 5 0.347106 0.347106
## 6 -7.244434 -7.244434

# are the residuals homoscedastic?
plot(resids.model)
abline(h=0, lty='dotted')
[Figure: residuals plotted against index, with a dotted horizontal line at zero.]
19. You can also use R to extract the confidence intervals for the two parameters of interest β0 and β1 . Use
the built-in function 'confint' to calculate the confidence intervals of the intercept and slope coefficients.
20. Use 'summary(model)' to extract the standard errors for these coefficients, and use the correct
t critical value to verify that R's function 'confint' is correct.
## 2.5 % 97.5 %
## (Intercept) -0.1562490 6.593773
## x 0.6844764 1.261873
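A sketch of question 20 (again a check, not the official solution): rebuild the intervals from the 'Estimate' and 'Std. Error' columns of `summary(model)` and the appropriate t critical value, then compare against `confint`:

```r
# regenerate the seeded data and refit the model, so this chunk is self-contained
a <- 1; b <- 1.2
x <- 0:20
set.seed(0); y <- a + b * x + rnorm(length(x), 0, 4)
model <- lm(y ~ x)

ci_r  <- confint(model)                       # R's built-in intervals
s     <- summary(model)$coefficients          # Estimate and Std. Error columns
tcrit <- qt(0.975, df = df.residual(model))   # df = n - 2 = 19 here

# estimate +/- t_crit * SE, for both the intercept and the slope
ci_manual <- cbind(s[, "Estimate"] - tcrit * s[, "Std. Error"],
                   s[, "Estimate"] + tcrit * s[, "Std. Error"])
all.equal(unname(ci_r), unname(ci_manual))    # TRUE
```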
# compute the fitted values and their standard errors for the confidence bands
ci.model = predict(model, se.fit=TRUE)
plot(x,y, col='orangered', pch=16, axes=F)
# add the regression line to the plot
abline(model)
# upper 95% CI (qt(.975, df=19) is the positive t critical value; df = n - 2)
lines(x, ci.model$fit + qt(.975, df=19) * ci.model$se.fit, col='blue', lty='dotted')
# lower 95% CI
lines(x, ci.model$fit - qt(.975, df=19) * ci.model$se.fit, col='blue', lty='dotted')
axis(1)
axis(2, las=2)
box()
[Figure: scatterplot of (x, y) with the fitted line and dotted 95% confidence bands.]

Class dataset from Burnaby Mountain Forest

21. What are the assumptions of a regression analysis?


22. What are any assumptions that are specific to the class dataset?
23. State the population parameter of interest, define in context.
24. State Hypotheses HO , HA
25. Calculate any relevant test statistics
26. Determine the p-values.
27. Interpret the regression statistics.
• Is the intercept different from zero?
• Is there a positive or negative relationship between the explanatory variable and the response
variable?
28. Do you accept or reject your null hypotheses?
29. What can you conclude about the 'truth' of the population?

Due Next Thursday March 28th at 2:30 pm

knitr::knit_exit()
