
Inference in Regression: Brian Caffo, Jeff Leek and Roger Peng Johns Hopkins Bloomberg School of Public Health

This document discusses inference and prediction in simple linear regression models. It reviews how to calculate standard errors and test statistics for regression coefficients that follow a t-distribution. Confidence intervals can be constructed using these statistics. Prediction intervals are also discussed, which account for variability in predicting new outcomes based on the regression model. Examples using diamond price data are provided to demonstrate calculating and plotting confidence and prediction intervals.


Inference in regression

Brian Caffo, Jeff Leek and Roger Peng


Johns Hopkins Bloomberg School of Public Health

Recall our model and fitted values


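Consider the model

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

$\epsilon_i \sim N(0, \sigma^2)$.
We assume that the true model is known.
We assume that you've seen confidence intervals and hypothesis tests before.
$\hat \beta_0 = \bar Y - \hat \beta_1 \bar X$
$\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)}$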
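As a quick check of the closed-form estimates, the sketch below compares them with `lm()`. The data are simulated purely for illustration (the sample size, coefficients and noise level are assumptions, not values from these slides).

```r
# Simulated data: an assumption for illustration, not the diamond data
set.seed(1)
n <- 50
x <- runif(n, 0, 1)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)

# Closed-form least squares estimates
beta1_hat <- cor(y, x) * sd(y) / sd(x)
beta0_hat <- mean(y) - beta1_hat * mean(x)

# The same estimates from lm()
fit <- lm(y ~ x)
rbind(manual = c(beta0_hat, beta1_hat), lm = unname(coef(fit)))
```

The two rows should agree up to floating point error.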


Review

Statistics like $\frac{\hat \theta - \theta}{\hat \sigma_{\hat \theta}}$ often have the following properties.

1. Is normally distributed and has a finite sample Student's T distribution if the variance is replaced with a sample estimate (under normality assumptions).
2. Can be used to test $H_0 : \theta = \theta_0$ versus $H_a : \theta >, <, \neq \theta_0$.
3. Can be used to create a confidence interval for $\theta$ via $\hat \theta \pm Q_{1-\alpha/2} \hat \sigma_{\hat \theta}$ where $Q_{1-\alpha/2}$ is the relevant quantile from either a normal or T distribution.
In the case of regression with iid sampling assumptions and normal errors, our inferences will follow very similarly to what you saw in your inference class.
We won't cover asymptotics for regression analysis, but suffice it to say that under assumptions on the ways in which the $X$ values are collected, the iid sampling model, and mean model, the normal results hold to create confidence intervals.
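To make the three properties concrete in the simplest setting, here is a minimal sketch for a sample mean, where $\hat \theta$ is the mean and its standard error is $s / \sqrt{n}$. The sample is simulated and the null value `theta0` is an arbitrary choice for illustration.

```r
# Simulated sample: an assumption for illustration
set.seed(2)
z <- rnorm(20, mean = 5, sd = 2)

theta_hat <- mean(z)
se_hat <- sd(z) / sqrt(length(z))

# Test statistic for H0: theta = theta0
theta0 <- 5
t_stat <- (theta_hat - theta0) / se_hat

# 95% interval: estimate +/- t quantile * standard error
alpha <- 0.05
q <- qt(1 - alpha / 2, df = length(z) - 1)
ci <- theta_hat + c(-1, 1) * q * se_hat

# t.test() reports the same statistic and interval
tt <- t.test(z, mu = theta0)
c(manual = t_stat, t.test = unname(tt$statistic))
rbind(manual = ci, t.test = as.numeric(tt$conf.int))
```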


Standard errors (conditioned on X)


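$$
\begin{align}
Var(\hat \beta_1) & = Var\left(\frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\
& = \frac{Var\left(\sum_{i=1}^n Y_i (X_i - \bar X)\right)}{\left(\sum_{i=1}^n (X_i - \bar X)^2\right)^2} \\
& = \frac{\sum_{i=1}^n \sigma^2 (X_i - \bar X)^2}{\left(\sum_{i=1}^n (X_i - \bar X)^2\right)^2} \\
& = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2}
\end{align}
$$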
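The variance formula can be checked by simulation. The sketch below holds the $X$ values fixed (the formula is conditional on $X$), redraws the errors many times, and compares the empirical variance of the slope with $\sigma^2 / \sum_{i=1}^n (X_i - \bar X)^2$. All numeric settings are assumptions chosen for illustration.

```r
set.seed(3)
n <- 30
sigma <- 2
x <- runif(n, 0, 10)            # X values held fixed across replications
ssx <- sum((x - mean(x))^2)

# Slope estimate for one freshly simulated response vector
slope <- function() {
  y <- 1 + 0.5 * x + rnorm(n, sd = sigma)
  sum((y - mean(y)) * (x - mean(x))) / ssx
}
sims <- replicate(20000, slope())

theory <- sigma^2 / ssx
c(simulated = var(sims), theoretical = theory)
```

The two numbers should be close but not identical, since the first is a Monte Carlo estimate.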


Results
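$\sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) = \sigma^2 / \sum_{i=1}^n (X_i - \bar X)^2$
$\sigma_{\hat \beta_0}^2 = Var(\hat \beta_0) = \left(\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \sigma^2$
In practice, $\sigma$ is replaced by its estimate.
It's probably not surprising that under iid Gaussian errors

$$\frac{\hat \beta_j - \beta_j}{\hat \sigma_{\hat \beta_j}}$$

follows a $t$ distribution with $n - 2$ degrees of freedom and a normal distribution for large $n$.
This can be used to create confidence intervals and perform hypothesis tests.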
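One way to see the $t$ result at work is to check interval coverage by simulation: if the standardized slope follows a $t$ distribution with $n - 2$ degrees of freedom, a 95% interval built from it should contain the true slope about 95% of the time. This is a sketch with made-up parameter values, not part of the original example.

```r
set.seed(4)
n <- 15; beta0 <- 1; beta1 <- 2; sigma <- 1   # assumed true values
x <- seq(0, 1, length.out = n)
ssx <- sum((x - mean(x))^2)

# For each replication, does the 95% t interval contain the true slope?
covers <- replicate(5000, {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  b1 <- sum((y - mean(y)) * (x - mean(x))) / ssx
  b0 <- mean(y) - b1 * mean(x)
  s <- sqrt(sum((y - b0 - b1 * x)^2) / (n - 2))   # estimate of sigma
  se_b1 <- s / sqrt(ssx)
  abs(b1 - beta1) / se_b1 <= qt(0.975, df = n - 2)
})
mean(covers)   # empirical coverage, expected to be near 0.95
```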


Example diamond data set


library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
# Least squares estimates from the closed-form expressions
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
# Residuals and the estimate of sigma (n - 2 degrees of freedom)
e <- y - beta0 - beta1 * x
sigma <- sqrt(sum(e^2) / (n-2))
ssx <- sum((x - mean(x))^2)
# Standard errors of the intercept and slope
seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma
seBeta1 <- sigma / sqrt(ssx)
# t statistics and two-sided p-values for H0: beta_j = 0
tBeta0 <- beta0 / seBeta0; tBeta1 <- beta1 / seBeta1
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
# Assemble a table laid out like summary(lm(...))$coefficients
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "x")


Example continued
coefTable

            Estimate Std. Error t value   P(>|t|)
(Intercept)   -259.6      17.32  -14.99 2.523e-19
x             3721.0      81.79   45.50 6.751e-40

fit <- lm(y ~ x)
summary(fit)$coefficients

            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   -259.6      17.32  -14.99 2.523e-19
x             3721.0      81.79   45.50 6.751e-40


Getting a confidence interval


sumCoef <- summary(fit)$coefficients
sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit$df.residual) * sumCoef[1, 2]

[1] -294.5 -224.8

sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df.residual) * sumCoef[2, 2]

[1] 3556 3886

With 95% confidence, we estimate that a 0.1 carat increase in diamond size results in a 355.6 to 388.6 increase in price in (Singapore) dollars.
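Base R's `confint()` computes the same $t$-based interval directly. The sketch below uses simulated data so that it runs without the `UsingR` package; the data-generating values are assumptions loosely resembling the diamond example, so the numbers will not match the output above.

```r
# Simulated stand-in data: an assumption, not the diamond data set
set.seed(5)
n <- 40
x <- runif(n, 0.1, 0.4)
y <- -250 + 3700 * x + rnorm(n, sd = 30)
fit <- lm(y ~ x)

# Manual interval for the slope, as computed on this slide
sumCoef <- summary(fit)$coefficients
manual <- sumCoef[2, 1] + c(-1, 1) * qt(0.975, df = fit$df.residual) * sumCoef[2, 2]

# The same interval from confint()
builtin <- confint(fit, "x", level = 0.95)
rbind(manual = manual, confint = as.numeric(builtin))

# Interval for the change in y per 0.1 unit increase in x
manual / 10
```

Dividing the endpoints by 10 is how the per-0.1-carat statement is obtained from the per-carat slope interval.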


Prediction of outcomes
Consider predicting $Y$ at a value of $X$
Predicting the price of a diamond given the carat
Predicting the height of a child given the height of the parents
The obvious estimate for prediction at point $x_0$ is

$$\hat \beta_0 + \hat \beta_1 x_0$$

A standard error is needed to create a prediction interval.
There's a distinction between intervals for the regression line at point $x_0$ and the prediction of what a $y$ would be at point $x_0$.
Line at $x_0$ se, $\hat \sigma \sqrt{\frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$
Prediction interval se at $x_0$, $\hat \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$
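The two standard errors can be computed at a single point and compared with what `predict()` reports when called with `se.fit = TRUE`. This is a sketch on simulated data (an assumption for illustration); `x0 = 0.25` is an arbitrary point.

```r
# Simulated stand-in data: an assumption, not the diamond data set
set.seed(6)
n <- 40
x <- runif(n, 0.1, 0.4)
y <- -250 + 3700 * x + rnorm(n, sd = 30)
fit <- lm(y ~ x)

x0 <- 0.25
sigma_hat <- summary(fit)$sigma
ssx <- sum((x - mean(x))^2)

# Standard error of the fitted line at x0
se_line <- sigma_hat * sqrt(1 / n + (x0 - mean(x))^2 / ssx)
# Standard error for predicting a new y at x0
se_pred <- sigma_hat * sqrt(1 + 1 / n + (x0 - mean(x))^2 / ssx)

# predict() returns the line se and the residual scale separately
p <- predict(fit, data.frame(x = x0), se.fit = TRUE)
c(se_line = se_line, from_predict = unname(p$se.fit))
c(se_pred = se_pred, from_predict = unname(sqrt(p$se.fit^2 + p$residual.scale^2)))
```

The prediction standard error combines the uncertainty in the line with the residual variance, which is where the extra 1 under the square root comes from.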


Plotting the prediction intervals


plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
abline(fit, lwd = 2)
xVals <- seq(min(x), max(x), by = .01)
yVals <- beta0 + beta1 * xVals
se1 <- sigma * sqrt(1 / n + (xVals - mean(x))^2/ssx)
se2 <- sigma * sqrt(1 + 1 / n + (xVals - mean(x))^2/ssx)
lines(xVals, yVals + 2 * se1)
lines(xVals, yVals - 2 * se1)
lines(xVals, yVals + 2 * se2)
lines(xVals, yVals - 2 * se2)


Discussion
Both intervals have varying widths.
Least width at the mean of the Xs.
We are quite confident in the regression line, so that interval is very narrow.
If we knew $\beta_0$ and $\beta_1$ this interval would have zero width.
The prediction interval must incorporate the variability in the data around the line.
Even if we knew $\beta_0$ and $\beta_1$ this interval would still have width.
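These points can be checked numerically. The sketch below, on simulated data (an assumption for illustration), finds where each interval is narrowest and compares the prediction interval width against the floor $2 \, t_{.975, n-2} \, \hat \sigma$ that it would have if the line were known exactly.

```r
# Simulated stand-in data: an assumption, not the diamond data set
set.seed(7)
n <- 40
x <- runif(n, 0.1, 0.4)
y <- -250 + 3700 * x + rnorm(n, sd = 30)
fit <- lm(y ~ x)

grid <- data.frame(x = seq(min(x), max(x), length.out = 201))
conf <- predict(fit, grid, interval = "confidence")
pred <- predict(fit, grid, interval = "prediction")
conf_width <- conf[, "upr"] - conf[, "lwr"]
pred_width <- pred[, "upr"] - pred[, "lwr"]

# Both intervals are narrowest at the grid point closest to mean(x)
c(mean_x = mean(x),
  narrowest_conf = grid$x[which.min(conf_width)],
  narrowest_pred = grid$x[which.min(pred_width)])

# The prediction interval never shrinks below this width
floor_width <- 2 * qt(0.975, df = fit$df.residual) * summary(fit)$sigma
range(pred_width) / floor_width
```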


In R
newdata <- data.frame(x = xVals)
p1 <- predict(fit, newdata, interval = ("confidence"))
p2 <- predict(fit, newdata, interval = ("prediction"))
plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
abline(fit, lwd = 2)
lines(xVals, p1[,2]); lines(xVals, p1[,3])
lines(xVals, p2[,2]); lines(xVals, p2[,3])
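The hand-drawn bands earlier used a multiplier of 2, which approximates the $t$ quantile. With the exact quantile `qt(.975, n - 2)`, the manual bands should coincide with the `predict()` output. A sketch on simulated data (an assumption; it avoids the `UsingR` dependency):

```r
# Simulated stand-in data: an assumption, not the diamond data set
set.seed(8)
n <- 40
x <- runif(n, 0.1, 0.4)
y <- -250 + 3700 * x + rnorm(n, sd = 30)
fit <- lm(y ~ x)

xVals <- seq(min(x), max(x), by = 0.01)
newdata <- data.frame(x = xVals)

# Manual bands using the exact t quantile instead of 2
sigma <- summary(fit)$sigma
ssx <- sum((x - mean(x))^2)
yVals <- unname(coef(fit)[1] + coef(fit)[2] * xVals)
se1 <- sigma * sqrt(1 / n + (xVals - mean(x))^2 / ssx)
se2 <- sigma * sqrt(1 + 1 / n + (xVals - mean(x))^2 / ssx)
q <- qt(0.975, df = fit$df.residual)

p1 <- predict(fit, newdata, interval = "confidence")
p2 <- predict(fit, newdata, interval = "prediction")

# Largest discrepancy between manual and predict() bands
diff_conf <- max(abs(p1[, "upr"] - (yVals + q * se1)))
diff_pred <- max(abs(p2[, "lwr"] - (yVals - q * se2)))
c(confidence = diff_conf, prediction = diff_pred)
```

Both discrepancies should be at the level of floating point error.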
