Notes 1

The document provides an overview of regression analysis, focusing on its purpose to model relationships between independent and dependent variables for estimation and prediction. It introduces concepts such as simple linear regression, the best fitting line, and the least squares criterion for minimizing prediction errors. Additionally, it discusses the difference between population and sample regression lines in the context of predicting outcomes based on observed data.

Uploaded by

promptmba24
Copyright
© All Rights Reserved

Basics of Regression

Regression analysis helps us answer many practical questions. For example:
▶ What is reducing life expectancy in the U.S.?
▶ How much of the cancer mortality rate is related to where a person lives?
▶ How well does college percentage predict CAT score?
▶ Which factors (for example, age, gender, exercise, etc.) are relevant to changes in Body Mass Index (BMI)?
The goal of regression analysis is to model the relationship between independent
variables (predictors) and a dependent variable (response). It serves two purposes:
1 Estimation: model the relationship between one or more predictors and a response using an observed data set.
2 Prediction: predict new outcomes for a new set of inputs using the fitted model.
Examples:
(1) drug overdose and life expectancy

(2) high school grade point average (gpa) and college entrance test score

(3) latitude and cancer mortality rate

(4) age, gender, physical activity, etc., and body mass index (BMI)
To start, are you familiar with...
▶ Scatterplots
▶ Linear models
Histogram of Body Height

For both genders, height follows a “normal distribution.”


Normal Distribution
If we connect the mean heights of the two genders, we get a “line.”

[Figure: Height vs Gender. x-axis: Gender (Women, Men); y-axis: Height (60 to 85).]

We will learn linear regression in this course.


Deterministic (or functional) relationships, e.g., the relationship
between degrees Fahrenheit and degrees Celsius is known to be:

\[ \text{Fahr} = \frac{9}{5}\,\text{Cels} + 32 \]

[Figure: Fahrenheit vs Celsius. x-axis: Celsius (0 to 50); y-axis: Fahrenheit (40 to 120).]

The relationship is perfect. We are not interested in that.
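
As a small illustration of what "perfect" means here, the sketch below evaluates the conversion formula at a few Celsius values and checks that the slope between consecutive points never changes, so there is no scatter at all. The function name `celsius_to_fahrenheit` is an illustrative choice, not something from the notes.

```python
def celsius_to_fahrenheit(cels):
    """Deterministic relationship: Fahr = (9/5) * Cels + 32."""
    return 9 / 5 * cels + 32


celsius = [0, 10, 20, 30, 40, 50]
fahrenheit = [celsius_to_fahrenheit(c) for c in celsius]

# Slope between consecutive points; it is constant for a deterministic
# linear relationship because every point lies exactly on the line.
slopes = [
    (fahrenheit[i + 1] - fahrenheit[i]) / (celsius[i + 1] - celsius[i])
    for i in range(len(celsius) - 1)
]

print(fahrenheit)
print(slopes)
```

Knowing the Celsius value fixes the Fahrenheit value exactly, which is why no statistical modeling is needed for this kind of relationship.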
Statistical relationships:

[Figure: Skin Cancer Mortality versus Latitude. x-axis: Latitude (at center of state), 30 to 45; y-axis: Mortality (deaths per 10 million), 100 to 220.]



The relationship is not perfect. Indeed, the plot exhibits some
“trend,” but it also exhibits some “scatter.”
Simple Linear Regression

Simple linear regression is a statistical method that allows us to
summarize and study relationships between two variables:
1 One variable, denoted x, is regarded as the predictor,
explanatory, or independent variable, which can be of any
type.
2 The other variable, denoted Y, is regarded as the response,
outcome, or dependent variable, which is continuous.
Simple linear regression is “simple”, because it concerns the study
of only one predictor variable.
“Best Fitting Line”

Heights (h) and weights (w) of 10 students. Which line do you think
best summarizes the trend between height and weight?
[Figure: Weight vs Height for the 10 students, with two candidate lines: w = -266.5 + 6.1h and w = -331.2 + 7.1h. x-axis: Height (64 to 74); y-axis: Weight (120 to 200).]
Let’s continue with the previous example of 10 students. In order to
examine which of the two lines is a better fit, we first need to
introduce some common notation:

1 An experimental unit is an object or person on which the
measurement is made.
2 Yi denotes the observed response for experimental unit i.
3 xi denotes the predictor value for experimental unit i.
4 Ŷi is the predicted response (or fitted value) for experimental
unit i.

Then, the equation for the best fitting line is:

Ŷi = b0 + b1 xi
1 In general, when we use Ŷi = b0 + b1 xi to predict the actual
response Yi , we make a prediction error (or residual error) of
size:
ei = Yi − Ŷi
ei is called the prediction error for data point i.

2 A line that fits the data “best” will be one for which the n
prediction errors are as small as possible in some overall sense.
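
The definition ei = Yi − Ŷi can be sketched in a few lines of code. The five (height, weight) pairs below are hypothetical and are not the 10 students from the notes; only the coefficients -266.5 and 6.1 are taken from one of the candidate lines shown earlier. The helper names `predict` and `residuals` are illustrative.

```python
# Hypothetical (height, weight) pairs; NOT the 10 students from the notes.
heights = [63, 66, 69, 72, 75]
weights = [125, 140, 160, 170, 200]


def predict(b0, b1, x):
    """Fitted value: Y-hat_i = b0 + b1 * x_i."""
    return b0 + b1 * x


def residuals(b0, b1, xs, ys):
    """Prediction errors e_i = Y_i - Y-hat_i, one per data point."""
    return [y - predict(b0, b1, x) for x, y in zip(xs, ys)]


# Coefficients of one candidate line from the notes: w = -266.5 + 6.1h.
errors = residuals(-266.5, 6.1, heights, weights)
for h, w, e in zip(heights, weights, errors):
    print(f"h={h}, w={w}, e={e:.1f}")
```

A positive error means the line under-predicts that unit's response; a negative error means it over-predicts.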
Continuing with the example of 10 students:

[Figure: Weight vs Height with the line w = -266.5 + 6.1h; the prediction errors e8 and e10 are marked. x-axis: Height (64 to 74); y-axis: Weight (120 to 200).]
Least Squares Regression

One way to achieve this goal is to invoke the “least squares
criterion,” which says to “minimize the sum of the squared prediction
errors.”
That is, we need to find the values b0 and b1 that minimize:

\[ Q = \sum_{i=1}^{n} \left[ Y_i - (b_0 + b_1 x_i) \right]^2 = \sum_{i=1}^{n} e_i^2 \]
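
For simple linear regression, the minimizer of Q has a standard closed form that these notes do not derive: b1 = Σ(xi − x̄)(Yi − Ȳ) / Σ(xi − x̄)² and b0 = Ȳ − b1 x̄. The sketch below applies those formulas to a small hypothetical data set (not the 10 students from the notes); the function names `least_squares` and `q` are illustrative.

```python
def least_squares(xs, ys):
    """Return (b0, b1) minimizing Q = sum of (y - (b0 + b1 * x))^2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


def q(b0, b1, xs, ys):
    """Sum of squared prediction errors for the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical (height, weight) pairs; NOT the 10 students from the notes.
heights = [63, 66, 69, 72, 75]
weights = [125, 140, 160, 170, 200]

b0, b1 = least_squares(heights, weights)
print(f"b0 = {b0:.1f}, b1 = {b1:.1f}, Q = {q(b0, b1, heights, weights):.1f}")
```

Nudging either coefficient away from the returned values can only increase Q, which is a quick sanity check that the line really is the least squares line for this data.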
In light of the least squares criterion, which line do you now think is
the best fitting line?
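[Figure: Weight vs Height for the 10 students, showing the two candidate lines w = -266.5 + 6.1h and w = -331.2 + 7.1h, drawn as a solid line and a dashed line. x-axis: Height (64 to 74); y-axis: Weight (120 to 200).]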
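For the dashed line:

\[ \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = 118.81 + \dots + 44.89 = 766.5 \]

For the solid line:

\[ \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = 47.076 + \dots + 201.924 = 599.8 \]

The solid line has the smaller sum of squared prediction errors, so it is the better fit under the least squares criterion.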
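
The same comparison can be sketched in code. The coefficients of the two candidate lines come from the notes, but the data below are hypothetical, so the totals will not match 766.5 and 599.8. The function name `sse` is an illustrative choice.

```python
def sse(b0, b1, xs, ys):
    """Sum of squared prediction errors for the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical (height, weight) pairs; NOT the 10 students from the notes.
heights = [63, 66, 69, 72, 75]
weights = [125, 140, 160, 170, 200]

# The two candidate lines from the notes.
sse_a = sse(-266.5, 6.1, heights, weights)  # w = -266.5 + 6.1h
sse_b = sse(-331.2, 7.1, heights, weights)  # w = -331.2 + 7.1h

better = "w = -266.5 + 6.1h" if sse_a < sse_b else "w = -331.2 + 7.1h"
print(f"SSE A = {sse_a:.2f}, SSE B = {sse_b:.2f}, better fit: {better}")
```

Whichever line yields the smaller total is preferred under the least squares criterion; neither candidate needs to be the overall minimizer for the comparison to be meaningful.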
What Do b0 and b1 Estimate?
Suppose that we are interested in the relationship between high school gpa (x) and
college entrance test score (Y) in a population of 200 students.
If we knew both values for every student, we could obtain the following “population
regression line” by connecting the mean college entrance test score at each gpa level.
[Figure: College entrance test score vs High school gpa for the population. x-axis: High school gpa (1.0 to 4.0); y-axis: College entrance test score (5 to 20).]


Connecting the average test score of each gpa group, we get the
solid line, which we summarize by

E(Yi) = β0 + β1 xi,

which is called the “population regression line.”


Simple Linear Regression Model

The simple linear regression model is

Yi = β0 + β1 xi + εi, i = 1, ..., n,

where

1 εi are independent random errors with E(εi) = 0 and Var(εi) = σ².
2 (xi, Yi) are observed in the data.
3 β0, β1, and σ² are unknown parameters.
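
A quick way to get a feel for the model is to simulate from it. The sketch below assumes normally distributed errors (the model above only requires mean 0 and variance σ²) and uses hypothetical parameter values β0 = 2, β1 = 4, σ = 1.5 that are not from the notes. The function name `simulate_responses` is illustrative.

```python
import random


def simulate_responses(xs, beta0, beta1, sigma, seed=0):
    """Draw Y_i = beta0 + beta1 * x_i + eps_i with independent eps_i.

    Assumption: eps_i ~ Normal(0, sigma^2). The model itself only
    requires E(eps_i) = 0 and Var(eps_i) = sigma^2.
    """
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]


# Hypothetical gpa levels 1.0, 1.5, ..., 4.0 cycled across 200 units.
xs = [1.0 + 0.5 * (i % 7) for i in range(200)]
ys = simulate_responses(xs, beta0=2.0, beta1=4.0, sigma=1.5)

# The true errors are recoverable here only because we chose the parameters.
errors = [y - (2.0 + 4.0 * x) for x, y in zip(xs, ys)]
mean_error = sum(errors) / len(errors)
var_error = sum((e - mean_error) ** 2 for e in errors) / (len(errors) - 1)
print(f"mean error = {mean_error:.3f}, error variance = {var_error:.3f}")
```

With real data the parameters are unknown, so the errors εi cannot be computed this way; that is the gap the sample regression line is meant to fill.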


However, if we only know the information of a sample of students, say, 20 students, then
we can estimate the “population regression line” by a “sample regression line”.

[Figure: College entrance test score vs High school gpa, showing the population regression line and the sample regression line. x-axis: High school gpa (1.0 to 4.0); y-axis: College entrance test score (5 to 20).]

The “sample regression line” is Ŷi = b0 + b1 xi, where b0 and b1 estimate β0 and β1, and the prediction error ei = Yi − Ŷi serves as a
surrogate for εi.
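
The population-versus-sample idea can be sketched by simulation. Everything numeric below is an assumption made for illustration: the parameter values β0 = 2, β1 = 4, σ = 1.5, the normal errors, and the gpa levels are hypothetical, and the closed-form least squares formulas in `least_squares` are the standard ones rather than something derived in these notes.

```python
import random


def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical population parameters (unknown in practice).
BETA0, BETA1, SIGMA = 2.0, 4.0, 1.5
rng = random.Random(1)

# Population of 200 students with gpa levels 1.0, 1.5, ..., 4.0.
population = []
for i in range(200):
    x = 1.0 + 0.5 * (i % 7)
    y = BETA0 + BETA1 * x + rng.gauss(0.0, SIGMA)
    population.append((x, y))

# Observe only a random sample of 20 students and fit the sample line.
sample = rng.sample(population, 20)
xs = [x for x, _ in sample]
ys = [y for _, y in sample]
b0, b1 = least_squares(xs, ys)

# Residuals e_i stand in for the unobservable errors eps_i.
residuals = [y - (b0 + b1 * x) for x, y in sample]
print(f"b0 = {b0:.2f} (beta0 = {BETA0}), b1 = {b1:.2f} (beta1 = {BETA1})")
```

A different sample of 20 students would give slightly different b0 and b1, which is why they are treated as estimates of β0 and β1 rather than as the parameters themselves.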
