2 Descriptive Simple Linear Regression
Sections 5.1-5.3
Introduction
Often, we measure two (or more) numerical variables on the same individual. In statistics,
finding relationships between variables is important. Simple linear regression is just one
approach and only models the linear relationship between two numerical variables. Linear
regression, in general, is a very powerful statistical technique and goes far beyond what we can
discuss in this course.1
Typically, we call one variable the response (or outcome) variable that measures the outcome
of the study. The response variable is denoted by Y; specific values of Y are denoted y. The
other variable is the explanatory (or predictor) variable whose values are denoted by x. The
explanatory variable is the variable that is thought to explain the changes we see in the
response variable.2
Warning! Choosing one variable as the explanatory variable does not necessarily mean that a
change in that variable will produce a change in the other variable. Association is not causation.
1 Consider Stats 401 or Stats 413 for your follow-up courses to Stats 250.
2 Some people call the Y variable the dependent variable and the X variable the independent variable. We stay away from those terms because they have specific meanings in the study of probability.
3 One way to assess the strength of a linear relationship between two quantitative variables is with correlation (which we will discuss shortly).
Figure 5.10 from the textbook allows us to see how the correlation coefficient is related to the
scatter of the points:
4 That means we will not ask you to calculate it by hand.
5 The scatterplots here display a famous collection of four data sets known as Anscombe’s Quartet.
Let’s return to our ames50 data. Below are a histogram and summary statistics for the sales
prices of the 50 single-family homes in our data set.
Question: What is our best estimate for the average sales price for all single-family homes sold
between 2006 and 2010 in Ames?6
Note: You might have thought that we should use the sample median because the distribution
of sales prices is right skewed. However, we want to estimate the population mean (not the
population median), so we need to use the sample mean as our estimate.
The scatterplot below is the same one we examined on page 2 with the least-squares regression
line added. Notice that, on average, the sale price increases as living area increases.
In general, we write the (simple) linear regression model for the population as
μ_{Y|X=x} = β_0 + β_1 x, where
μ_{Y|X=x} is the mean of our response variable Y when the value of the explanatory variable is X = x.
β_0 is a population parameter denoting the y-intercept; the mean of Y when x = 0.
β_1 is a population parameter denoting the slope; the change in the mean of Y per unit change in X.
When we use sample data to estimate β_0 and β_1, we estimate the β’s with our data and write the estimated regression line as
ŷ = b_0 + b_1 x.7
7 Alternatively, we could use “hat” notation and write ŷ = β̂_0 + β̂_1 x.
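As a rough sketch of how such a line can be fit in software (not shown in the original notes), here is R code assuming the ames50 data frame has columns named salePrice and livingArea; the object name fit_ames and the exact column names are assumptions.

    # Sketch only: assumes ames50 has columns salePrice and livingArea
    fit_ames <- lm(salePrice ~ livingArea, data = ames50)
    coef(fit_ames)   # b_0 (intercept) and b_1 (slope) of the least-squares line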
b. Consider the slope of the red, dashed line. How do we interpret it in the context of the variables represented?8
c. One of the homes in the ames50 data set has 2475 square feet of living area. Estimate the
sales price of this home using both of the lines.
d. The actual sales price for the home with 2475 square feet of living area was $355,000. How far off from the observed sales price are the estimates from part (c)?
8 When we interpret the slope, we need to be careful to talk about association (we do not want to imply causation).
Residuals are the leftover variation in the response variable that the model cannot account for:
Data = Fit + Residual
Each observation will have a residual. Observations above the regression line have positive
residuals, while observations below the line have negative residuals. Our goal in picking the
right linear model is for these residuals to be as small as possible.
Definition: The residual of the i-th observation (x_i, y_i) is the difference between the observed response y_i and the response we would predict based on the model fit, ŷ_i:
e_i = y_i − ŷ_i
We typically obtain ŷ_i by plugging x_i into the model.
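As a small illustration (again assuming the hypothetical fit_ames object from the earlier sketch), residuals can be computed in R either by hand or with resid():

    # Residuals: observed response minus fitted (predicted) value
    head(ames50$salePrice - fitted(fit_ames))  # computed by hand
    head(resid(fit_ames))                      # built-in shortcut, same values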
We want a line that has small residuals. Think about it: Why do you think small residuals are good?
For the least-squares line, some residuals are positive and others are negative (and some may even be 0), and the average of the residuals, (1/n) ∑ e_i, is zero, so that’s not helpful. It turns out that the line that fits the data “best” is the one that minimizes the sum of the squared residuals (that is, the line that minimizes ∑ e_i²).
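To make the “minimizes the sum of squared residuals” idea concrete, here is a hedged R sketch (assuming the hypothetical fit_ames object): the residuals of the least-squares line sum to essentially zero, and nudging the coefficients away from the least-squares values only increases the sum of squared residuals.

    # Residuals of the least-squares line sum to (essentially) zero
    sum(resid(fit_ames))
    # Sum of squared residuals for any candidate line b0 + b1 * livingArea
    sse <- function(b0, b1) {
      sum((ames50$salePrice - (b0 + b1 * ames50$livingArea))^2)
    }
    sse(coef(fit_ames)[1], coef(fit_ames)[2])       # least-squares line
    sse(coef(fit_ames)[1], coef(fit_ames)[2] + 10)  # a nearby line: larger SSE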
9 Unless there are only two data points, and you don’t need statistics for that!
10 Conditions: For the line to be the best linear regression for a set of data, certain assumptions about the data must be made. We’ll get to these later on when we come back to inference for regression. For now, it’s enough to know that most of these assumptions have to do with the residuals we’ve looked at over the past few pages.
Let’s use these properties to calculate the least-squares regression line for predicting sales price when living area is known.
Variable      mean      sd          r
livingArea    1747      539.88      0.8641
salePrice     229,033   86,060.69
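As a sketch of the calculation, using the standard least-squares formulas b_1 = r (s_y / s_x) and b_0 = ȳ − b_1 x̄ (the rounded numbers below are our own arithmetic from the table above, not values quoted in the notes, so software output may differ slightly):

b_1 = r (s_y / s_x) = 0.8641 × (86,060.69 / 539.88) ≈ 137.74
b_0 = ȳ − b_1 x̄ ≈ 229,033 − 137.74 × 1747 ≈ −11,599

so the estimated line is roughly ŷ ≈ −11,599 + 137.74 x (sale price in dollars, living area in square feet).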
Use the scatterplot to describe the relationship between the price of a LEGO set and the number
of pieces in the set.
The y-intercept describes the average outcome of y when x = 0 AND the linear model is valid all the way to x = 0, which in many applications is not the case. For this example, we would predict the cost of a LEGO set with 0 pieces to be $9.28. This is obviously nonsensical.
Extrapolation is Treacherous11
This quote from The Colbert Report says it all:
When those blizzards hit the East Coast this winter, it proved to my satisfaction that
global warming was a fraud. That snow was freezing cold. But in an alarming trend,
temperatures this spring have risen. Consider this: On February 6th it was 10 degrees.
Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the
climate debate rages on. (http://www.cc.com/shows/the-colbert-report)
Should we use our model to predict the cost of the new Colosseum set that has 9036 pieces?
The Frankenstein BrickHeadz set with 108 pieces?
11 ISRS, page 232
12 There are a lot of things you shouldn’t let your friends do. Remember that “friends don’t let friends extrapolate.”
The R² of a linear model describes the amount of variation in the response variable that is explained by the least-squares regression line.
For example, consider the LEGO data, shown with the regression line in the scatterplot below.
Also included below is R output for the variance for price and variance for the residuals.
The variance of the response variable, price, is s²_price = 5722.299. However, if we apply our least-squares line, then this model reduces our uncertainty in predicting price using the number of pieces in the set. The variability in the residuals describes how much variation remains after using the model: s²_resid = 690.5406.
We saw a reduction of
(s²_price − s²_resid) / s²_price = (5722.299 − 690.5406) / 5722.299 = 5031.758 / 5722.299 = 0.879,
or about 87.9%, in the data’s variation by using information about the linear relationship between price and number of pieces.
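As a hedged R sketch of the same reduction-in-variation calculation (assuming a data frame named lego with columns price and pieces; those names are assumptions, not taken from the notes):

    # R^2 as the proportional reduction in variance, and as reported by summary()
    fit_lego <- lm(price ~ pieces, data = lego)
    1 - var(resid(fit_lego)) / var(lego$price)  # same as (s2_price - s2_resid) / s2_price
    summary(fit_lego)$r.squared                 # R^2 reported directly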
Bottom line: The R² value is a measure of how good the linear model is. The closer R² is to 100%, the better.13
13 When trying to predict measures of human behavior (e.g., in psychology), it is not unusual for an R² of around 10% or 20% to indicate that the explanatory variable is a helpful predictor for the response variable.
It is tempting to remove outliers. Don’t do this without a very good reason. Models that ignore
exceptional (and interesting) cases often perform poorly.
Examining Outliers
Consider the following scatterplot for a dataset of simulated points. Notice how the regression line fits the data quite well. The equation for the line is ŷ = 1.0927 + 2.9096x.
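As a purely illustrative R sketch (the data below are made up for demonstration and are not the simulated dataset from the notes), here is one way to see how a single extreme point can pull the least-squares line around:

    # Made-up illustration: fit a line, then add one extreme point and refit
    set.seed(1)
    x <- runif(30, 0, 10)
    y <- 1 + 3 * x + rnorm(30, sd = 2)   # pattern similar in spirit to the plot above
    coef(lm(y ~ x))                      # slope near 3, intercept near 1
    x2 <- c(x, 25); y2 <- c(y, 10)       # one point far from the rest
    coef(lm(y2 ~ x2))                    # the slope and intercept change noticeably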