Notes 1
2. Prediction: predict new outcomes for a new set of inputs using a built model.
Examples:
(1) drug overdose and life expectancy
(2) high school grade point average (GPA) and college entrance test score
(3) age, gender, physical activity, etc. and body mass index (BMI)
For a start, are you aware of...
▶ Scatterplots
▶ Linear models
[Figures: histogram of Body Height; boxplot of Height vs Gender (Women, Men); plot of a relationship with Celsius on the x-axis]
The Celsius relationship is perfect (deterministic); we are not interested in that.
Statistical relationships:
Heights (h) and weights (w) of 10 students. Which line do you think
best summarizes the trend between height and weight?
w = -266.5 + 6.1h
w = -331.2 + 7.1h
[Figure: scatterplot of Weight vs Height for the 10 students, with both candidate lines drawn]
Let’s continue with the previous example of 10 students. In order to
examine which of the two lines is a better fit, we first need to
introduce some common notation:
Ŷi = b0 + b1 xi
1. In general, when we use Ŷi = b0 + b1 xi to predict the actual
response Yi , we make a prediction error (or residual error) of
size:
ei = Yi − Ŷi
ei is called the prediction error for data point i.
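As a minimal sketch of this notation, the following computes the prediction error ei = Yi − Ŷi for each point. The heights and weights here are hypothetical (the notes' raw data are not shown); the intercept and slope are taken from the first candidate line above.

```python
# Hypothetical (height, weight) data -- NOT the notes' 10 students.
heights = [63, 67, 70, 72]
weights = [127, 145, 158, 170]

# Fitted line Yhat_i = b0 + b1 * x_i (first candidate line from the notes)
b0, b1 = -266.5, 6.1

# Prediction error (residual) for each data point: e_i = Y_i - Yhat_i
for h, w in zip(heights, weights):
    w_hat = b0 + b1 * h   # predicted weight
    e = w - w_hat         # prediction error
    print(f"height={h}: predicted={w_hat:.1f}, actual={w}, residual={e:.1f}")
```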
2. A line that fits the data “best” will be one for which the n
prediction errors are as small as possible in some overall sense.
Continuing with the example of 10 students:
[Figure: scatterplot of Weight vs Height with the line w = -266.5 + 6.1h; the residuals e8 and e10 are marked as vertical distances from the line]
Least Squares Regression
w = -266.5 + 6.1h
w = -331.2 + 7.1h
[Figure: scatterplot of Weight vs Height with both candidate lines drawn]
For the dashed line:
Σi=1..n ei² = Σi=1..n (Yi − Ŷi)² = 118.81 + . . . + 44.89 = 766.5
For the solid line:
Σi=1..n ei² = Σi=1..n (Yi − Ŷi)² = 47.076 + . . . + 201.924 = 599.8
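This comparison can be sketched in code. The data below are hypothetical (the notes' 10 students are not listed), so the sums of squared errors will not match the 766.5 and 599.8 above; the point is only that whichever candidate line yields the smaller sum fits better in the least squares sense.

```python
# Hypothetical (height, weight) pairs -- NOT the notes' 10 students.
data = [(64, 121), (66, 140), (68, 150), (70, 156), (72, 175), (74, 190)]

def sse(b0, b1, pts):
    """Sum of squared prediction errors for the line yhat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in pts)

# Compare the two candidate lines from the notes:
print("w = -266.5 + 6.1h :", sse(-266.5, 6.1, data))
print("w = -331.2 + 7.1h :", sse(-331.2, 7.1, data))
```

The "best" line by the least squares criterion is the one minimizing this sum over all possible (b0, b1).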
What Do b0 and b1 Estimate?
Suppose that we are interested in the relationship between high school gpa (x) and
college entrance test score (Y ) in a population of 200 students.
If we know the information of every student, we can get the following “population
regression line” by connecting the mean college entrance test score at each gpa level.
[Figure: population regression line connecting the mean college entrance test score at each GPA level]
E(Yi ) = β0 + β1 xi ,
Yi = β0 + β1 xi + εi , i = 1, . . . , n,
where εi is a random error term with E(εi) = 0.
The “sample regression line” is Ŷi = b0 + b1 xi and the prediction error ei = Yi − Ŷi is a
surrogate of εi .
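To see how the sample line estimates the population line, we can simulate data from the model Yi = β0 + β1 xi + εi and compute the least squares estimates b0 and b1 from their closed-form formulas. The population values β0 = 2 and β1 = 5 here are arbitrary choices for illustration.

```python
import random

random.seed(1)

# Population model: Y_i = beta0 + beta1 * x_i + eps_i, eps_i ~ N(0, 1)
beta0, beta1 = 2.0, 5.0                      # assumed population parameters
x = [i / 10 for i in range(1, 41)]           # predictor values (like GPA)
y = [beta0 + beta1 * xi + random.gauss(0, 1) for xi in x]

# Closed-form least squares estimates:
#   b1 = sum((x_i - xbar)(y_i - ybar)) / sum((x_i - xbar)^2)
#   b0 = ybar - b1 * xbar
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
print(b0, b1)  # b0, b1 should land close to beta0 = 2 and beta1 = 5
```

With more data (larger n), b0 and b1 concentrate ever more tightly around β0 and β1.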