Lesson 6 02 Regression 2
LESSON OUTLINE:
1. Motivation / Introduction
2. Preliminary Lesson : Simple Linear Regression Line
3. Main Lesson : Obtaining the Simple Linear Regression Line and Explaining the
Regression Coefficients
4. Enrichment: Sampling Distribution of Regression Coefficients
DEVELOPMENT OF THE LESSON
(A) Introduction
Inform students that when examining the relationship between two variables x and y, we can
consider one variable as some kind of input variable within an input-output framework. We
plot this variable along the horizontal (also called the x) axis in the scatterplot. The output
variable is the variable along the vertical (also called the y) axis. The input or x variable is
typically called an independent variable; it is also called a covariate or an exogenous,
explanatory, regressor, or control variable. The output or y variable is called the dependent
variable; it is also called the regressand or the endogenous, explained, or response variable.
In Lesson 7-01, Karl Pearson's data on the heights of fathers and of their respective first-born
sons were presented. While taller-than-average fathers tend to have taller-than-average sons,
the sons are not quite as tall as the fathers. There is a regression toward
the average height, thus the term regression analysis. Likewise, shorter-than-average
fathers tend to have shorter-than-average sons, but the sons are not quite as short as the
fathers.
(B) Preliminary Lesson : Simple Linear Regression Line
When we visualize the points in a scatterplot generally clustering about a line, we may be
interested to obtain an estimate of such a line in order to help us estimate the expected level
of a variable Y for a known specific value x of the variable X (say, daily allowance). For
instance, for the worked example in the previous lesson, we may want to determine how
many text messages a student would usually send if his/her daily allowance is 150 pesos. In
Lesson 7-01, it was mentioned that we could consider the line that passes through the point of
averages and whose slope is the ratio of the standard deviations as one possible line. Inform
students that this line ignores information about the magnitude of the association between the
two variables. If the correlation coefficient is zero, then we should not expect any increase in
one variable to accompany an increase or decrease in the other.
An alternative to this SD line that incorporates information provided by the correlation
coefficient, the means and standard deviations is the regression line:
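$$y - \bar{y} = r \frac{\sigma_y}{\sigma_x} (x - \bar{x})$$
or, equivalently,
$$y = r \frac{\sigma_y}{\sigma_x} x + \left( \bar{y} - r \frac{\sigma_y}{\sigma_x} \bar{x} \right)$$
where $\bar{x}$ and $\bar{y}$ are the means, $\sigma_x$ and $\sigma_y$ are the standard deviations, and $r$ is the correlation coefficient.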
The term in parentheses in this expression is the y-intercept of the regression line. It can be
interpreted as what we expect y to be when the value of x is zero.
Explain to students that the regression line relates how much change in the y-value is
associated with a unit increase in the x-value. It estimates the expected value for the Y
variable corresponding to a particular level x of the variable X. On average, it associates with
each increase of one standard deviation in the x-units, r standard deviations in the y-units
(where r is the correlation coefficient).
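The "r standard deviations" statement can be checked with a short substitution. The sketch below writes $\bar{x}$, $\bar{y}$ for the means and $\sigma_x$, $\sigma_y$ for the standard deviations, and plugs in a value of x that is one standard deviation above its mean:

```latex
\begin{aligned}
y - \bar{y} &= r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x}) \\
x = \bar{x} + \sigma_x \;\Rightarrow\; y - \bar{y} &= r\,\frac{\sigma_y}{\sigma_x}\,\sigma_x = r\,\sigma_y
\end{aligned}
```

So the expected y is r standard deviations (in y-units) above the mean of y. Since r is at most 1 in absolute value, the expected y is never farther from its mean, in standard units, than x is from its own.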
Note that when we consider the notion of regression, we assume a functional dependence of
Y on X. Thus, we consider Y as a dependent, response, or output variable, while X is an
independent, explanatory or input variable. The magnitude of the output variable Y is
dependent on the magnitude of the input variable X. A person's blood pressure, for instance,
functionally depends on the person's age. This does not, however, suggest that age is the only
factor that is responsible for blood pressure, but that it is one possible determinant for blood
pressure.
On the other hand, arm length and leg length are correlated but not functionally dependent.
Increasing arm length would not have an effect on leg length although these variables are
correlated. In such instances, correlation can be calculated but obtaining a regression line
may not be of practical utility.
(C) Main Lesson : Obtaining the Simple Linear Regression Line and Explaining the Regression
Coefficients
Consider the worked example in Lesson 7-01 pertaining to information from the database
generated in Lesson 1-01. Students were asked in Lesson 7-01 to generate a random sample
of 30 students from the database.
Worked Example: We have generated the following summary measures in the worked
example for students with complete information on their daily allowance and the usual
number of text messages they send in a day:
Summary Measure                   Daily Allowance    Usual Number of Text
                                  in School          Messages Sent in a Day
Mean                              90.37037           33.2963
(Population) Standard Deviation   120.9984           43.11124
Correlation                       0.780283
The regression line of Usual Number of Text Messages Sent in a Day on Daily
Allowance in School is then estimated as:
(Expected Usual Number of Text Messages Sent in a Day - 33.2963) =
(0.780283) × (43.11124 / 120.9984) × (Daily Allowance in School - 90.37037)
or simply
Expected Usual Number of Text Messages Sent in a Day =
0.278011805 × Daily Allowance in School + 8.172270273
The earlier representation of the estimated regression line clearly indicates that
students with an average daily allowance are also expected to have an average number of
text messages. That is, the point of averages is a point on the estimated regression line.
The latter representation of the regression line is shown in the typical slope-intercept
form of an equation. In particular, the slope is interpreted as follows: for each increase
of 1 peso in total daily allowance, we expect a corresponding increase of 0.28 text
messages sent in a day, or equivalently, every 4-peso increase in allowance is expected
to have a corresponding increase of about 1 text message sent by a student in a day.
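The arithmetic behind the slope and intercept can be reproduced with a few lines of Python. This is a minimal sketch that assumes only the summary measures reported in the worked example. The function name `regression_coefficients` and the variable names are illustrative choices, not part of the lesson.

```python
# Summary measures copied from the worked example (rounded as printed there).
mean_allowance = 90.37037   # mean daily allowance in school (pesos)
mean_texts = 33.2963        # mean usual number of text messages per day
sd_allowance = 120.9984     # population SD of daily allowance
sd_texts = 43.11124         # population SD of text messages
r = 0.780283                # correlation coefficient


def regression_coefficients(r, mean_x, sd_x, mean_y, sd_y):
    """Return (slope, intercept) of the regression line of y on x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = regression_coefficients(
    r, mean_allowance, sd_allowance, mean_texts, sd_texts
)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")

# The point of averages lies on the line: plugging in the mean allowance
# returns the mean number of text messages.
fitted_at_mean = slope * mean_allowance + intercept
print(abs(fitted_at_mean - mean_texts) < 1e-9)
```

Because the inputs are rounded summary values, the computed coefficients agree with the ones quoted in the text only to about four decimal places.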
Explaining the Regression Coefficients
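Since the slope of a line is the rise over run, the slope of the regression line represents the rise
in Y over the run in X, i.e., the correlation coefficient times the ratio of the standard deviation of Y
to the standard deviation of X. When the correlation coefficient is zero, the slope is zero and the
expected value of Y is simply the mean of Y
whatever X will be, i.e., the fit is a horizontal line. In the next lesson, we consider how to
make valid statistical inferences about the slope of the regression line.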
Remind students that in an equation of a line, the y-intercept is the value of Y when X is
zero. For the worked example, the intercept may be interpreted as the usual number of text
messages sent daily by a student who has zero daily allowance. Students may have zero daily
allowance when the family of the student decides not to give an allowance to the student
because the family is poor, or because the student is deemed not to need an allowance since
everything is being provided for the student. However, in other situations, such an
interpretation may not be valid, as we may be unnecessarily extending the regression
line far outside the usual range of X values. Consider, for
instance, relating the monetary value of a house (Y) to the area of the dwelling in square
meters (X). Here, a house must always have nonzero area, and thus the data on area do not
include X=0.
Using the Regression Line for Predictions
The utility of the estimated regression line is not merely for explaining relationships between
X and Y but also for making predictions about Y given a certain value of X. Suppose we
wish to randomly pick one of the students who gave information for Lesson 1-01, and we
wish to guess his or her usual number of text messages per day. In the absence of any
information, the best guess would naturally be the average usual number of text messages
sent by the students per day. However, we may be given some specific level of daily
allowance of the student that can be utilized to improve the prediction.
Suppose that for the worked example, we are provided information about the level of daily
allowance of a student, say 150 pesos. According to our estimated regression line,
Expected Usual Number of Text Messages Sent in a Day =
0.278011805 × Daily Allowance in School + 8.172270273
a student with a daily allowance of 150 pesos is expected to usually have the following total
number of text messages sent per day
Expected Usual Number of Text Messages Sent in a Day =
0.278011805 × (150) + 8.172270273
= 49.87404 ≈ 50
which is more than the average usual number of text messages sent by students per day.
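The same prediction can be written as a small function. This is an illustrative sketch that uses the coefficients and the mean reported in the worked example. The names `predict_texts`, `SLOPE`, `INTERCEPT` and `MEAN_TEXTS` are assumptions made for the example.

```python
# Coefficients of the estimated regression line, as reported in the worked example.
SLOPE = 0.278011805
INTERCEPT = 8.172270273
MEAN_TEXTS = 33.2963  # average usual number of text messages per day


def predict_texts(daily_allowance):
    """Expected usual number of text messages for a given daily allowance (pesos)."""
    return SLOPE * daily_allowance + INTERCEPT


prediction = predict_texts(150)
print(round(prediction, 5))     # 49.87404
print(prediction > MEAN_TEXTS)  # True: above the no-information guess
```

With no information about allowance, the guess would be the mean of about 33 text messages. Knowing that the allowance is 150 pesos moves the guess to about 50.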
In many cases, obtaining a regression fit gives a sensible way of estimating the y-value. If,
however, there are nonlinearities in the relationship between the variables, one may have to
transform the variables first, say, by taking the square root or logarithm of the X and/or Y
variables, and then fit a regression line to the transformed variables. In this case, tell
students that one will eventually have to re-express the resulting analyses in terms of the
original units rather than the transformed data.
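The idea of fitting on a transformed scale and then returning to the original units can be sketched as follows. The data here are made up for illustration (an exact exponential curve, y = 2·exp(0.5x)) and do not come from the lesson. The helper names `fit_line` and `predict_original_units` are likewise hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data that grow exponentially, so a straight line fits poorly
# on the original scale but exactly on the log scale.
xs = [1, 2, 3, 4, 5]
ys = [2 * math.exp(0.5 * x) for x in xs]

# Step 1: transform Y, then fit the regression line to (x, log y).
log_ys = [math.log(y) for y in ys]
slope, intercept = fit_line(xs, log_ys)


# Step 2: re-express the prediction in the original units of Y.
def predict_original_units(x):
    return math.exp(slope * x + intercept)


print(round(slope, 6), round(math.exp(intercept), 6))
print(round(predict_original_units(6), 4), round(2 * math.exp(3), 4))
```

The last line compares the back-transformed prediction at x = 6 with the value from the curve that generated the data. With real, noisy data the two would not coincide exactly.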
The regression model suggests that for every increase of one unit in an independent
variable x, we expect a change of $r \frac{\sigma_y}{\sigma_x}$ units in a dependent variable y, where $\sigma_y$ is
the standard deviation of the y-values (with the data treated as a population), $\sigma_x$ is the
standard deviation of the x-values (with the data treated as a population), and $r$ is the
correlation coefficient.
The regression line may be used to make predictions. Given the value x for an independent
variable X, we expect or predict Y to take the value
$$y = r \frac{\sigma_y}{\sigma_x} x + \left( \bar{y} - r \frac{\sigma_y}{\sigma_x} \bar{x} \right)$$
where $\bar{x}$ and $\bar{y}$ are the means of the x-values and y-values, respectively.
REFERENCES
Much of the material here is adapted from:
Text Messaging is Time Consuming! What Gives? by Jeanie Gibson, Mary McNelis, and Anna
Bargagliotti, STatistics Education Web (STEW). Available on the Internet at
https://www.amstat.org/education/stew/pdfs/TextMessagingisTimeConsumingWhatGives.doc
See also:
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo
Patungan, Nelia Marquez). Rex Bookstore.
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.
Freedman, D., Pisani, R., and Purves, R. (2007). Statistics. Fourth Edition. W. W. Norton &
Company, New York.
Workbooks in Statistics 1: 11th Edition, Institute of Statistics, UP Los Banos, College, Laguna
4031
Y ___________ X _____________
2. Provide an interpretation for the estimated slope of the sample regression line.
4. Illustrate how to use the sample regression line you generated to predict Y for a given
level of X. (Make sure to agree with group mates what X is)
5. Collect the regression coefficients and predictions found by each person in the class into
a table:
Student   Slope   Intercept   Prediction for Y given X = ___
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
6. Create a dot plot for the regression coefficients (slope and intercept) and for the
prediction for Y given X= ____ (Note that three dot plots will be created).
7. Look at the dot plot for the slope. This dot plot represents an approximation to the
sampling distribution of the estimated slopes. What do you notice about the dot plot?
What is the range of the estimated slopes? What seems to be the most common slope? If
you had to guess what the slope of the regression line was for the entire population, what
would you guess? Explain why.
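The class activity above can be imitated on a computer. The sketch below is only an illustration under stated assumptions: the "population" of 500 students is simulated (a line with slope 0.28 plus random noise), not the Lesson 1-01 database, and each of 20 hypothetical students draws a random sample of 30 and computes a slope. The helper `fit_line` and all variable names are invented for this example.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


random.seed(1)  # fixed seed so the run is repeatable

# Simulated population (an assumption, not real data):
# texts = 8 + 0.28 * allowance + random noise.
population = []
for _ in range(500):
    allowance = random.uniform(0, 300)
    texts = 8 + 0.28 * allowance + random.gauss(0, 20)
    population.append((allowance, texts))

# Each of 20 "students" draws a sample of 30 and estimates the slope.
slopes = []
for _student in range(20):
    sample = random.sample(population, 30)
    xs = [a for a, _ in sample]
    ys = [t for _, t in sample]
    slope, _intercept = fit_line(xs, ys)
    slopes.append(slope)

# Crude text dot plot: one dot per estimated slope, rounded to 2 decimals.
counts = {}
for s in slopes:
    key = round(s, 2)
    counts[key] = counts.get(key, 0) + 1
for key in sorted(counts):
    print(f"{key:5.2f} " + "." * counts[key])
print("range:", round(min(slopes), 3), "to", round(max(slopes), 3))
```

The estimated slopes differ from sample to sample but cluster around the slope used to build the simulated population, which is the pattern students are asked to notice in their own dot plot.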
ASSESSMENT 6-02
1. In a regression line, the Y-intercept represents the
a) predicted value of Y when X = 0.
b) change in estimated average Y per unit change in X.
c) predicted value of Y.
d) variation around the sample regression line.
ANSWER: a
2.
ANSWER: b
Case 1 (For items 3 - 5 ) : A candy bar manufacturer is interested in trying to estimate how sales are
influenced by the price of their product. To do this, the company randomly chooses 6 cities and offers
the candy bar at different prices. Using candy bar sales as the dependent variable, the company will
conduct a simple linear regression on the data below:
City             Price (PHP)   Sales
Los Banos        39            100
Legazpi          48            90
Cagayan de Oro   54            90
Davao            60            40
Cebu             72            38
Makati           87            32
3. Referring to Case 1, what is the estimated average change in the sales of the candy bar if price
goes up by 1 peso?
a) 161.386
b) 0.784
c) 3.810
d) -1.606426
ANSWER: d
5. Referring to Case 1, if the price of the candy bar is set at 60 pesos, the estimated average sales
will be
a) 30
b) 65
c) 90
d) 100
ANSWER: b
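For checking the answers to the Case 1 items, the least-squares coefficients can be computed directly from the six data points. This is a minimal sketch using the sums-of-products form of the slope. The variable names are illustrative.

```python
# Case 1 data: price in pesos and sales for the six cities.
prices = [39, 48, 54, 60, 72, 87]
sales = [100, 90, 90, 40, 38, 32]

n = len(prices)
mean_price = sum(prices) / n  # 60.0
mean_sales = sum(sales) / n   # 65.0

# Sum of cross-products and sum of squared deviations.
sxy = sum((x - mean_price) * (y - mean_sales) for x, y in zip(prices, sales))
sxx = sum((x - mean_price) ** 2 for x in prices)

slope = sxy / sxx
intercept = mean_sales - slope * mean_price

print(round(slope, 6))                   # -1.606426
print(round(intercept, 3))               # 161.386
print(round(slope * 60 + intercept, 6))  # 65.0
```

A price of 60 pesos happens to equal the mean price, so the predicted sales equal the mean sales of 65. This is the point-of-averages property of the regression line.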
II. A study was done to investigate the relationship between the amount of protix (a new
protein-vitamin-mineral supplement) in fortified-vitamin rice, known as FVR, and the gain in weight of
children. Ten randomly chosen sections of grade one pupils were fed with FVR containing
protix; different amounts X of protix were used for the 10 sections. The increase in the weight of
each child was measured after a given period. The average gain Y in weight for each section
with a prescribed protix level X is as follows:
Section   Protix   Gain
1         50       92.6
2         60       97.5
3         70       96.5
4         80       102.3
5         90       105.8
6         100      106.2
7         110      108.9
8         120      108.4
9         130      110.2
10        140      110.8
a. Obtain the sample regression line to predict the average gain in weight given the protix
level.
ANSWER: Estimated Average weight gain = 0.2014546 (Protix) + 84.78182
b. How would you predict the average gain in weight at a protix level of 125?
ANSWER: Using the regression line at Protix = 125, the estimated Average weight gain is
0.2014546 (125) + 84.78182 = 109.96, or about 110
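The protix answers can be checked with the lesson's own formula (slope = r times the ratio of the standard deviations, with the data treated as a population). The sketch below assumes the pairing of protix levels and gains by section, read in order from the data table. Variable names are illustrative.

```python
from statistics import mean, pstdev

# Protix level and average weight gain for sections 1 to 10
# (pairing by section is assumed from the data table).
protix = [50, 60, 70, 80, 90, 100, 110, 120, 130, 140]
gain = [92.6, 97.5, 96.5, 102.3, 105.8, 106.2, 108.9, 108.4, 110.2, 110.8]

mean_x, mean_y = mean(protix), mean(gain)
sd_x, sd_y = pstdev(protix), pstdev(gain)  # population standard deviations

# Correlation coefficient from the population covariance.
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(protix, gain)) / len(protix)
r = cov / (sd_x * sd_y)

# Regression coefficients from r, the means and the standard deviations.
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x
prediction = slope * 125 + intercept

print(f"r = {r:.4f}")
print(f"slope = {slope:.7f}, intercept = {intercept:.5f}")
print(f"predicted gain at protix 125 = {prediction:.2f}")
```

Running the block prints the coefficients and the prediction, which can then be compared with the answer key.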
III. At a large local high school, the principal wanted to ensure that her students would perform
well on this year's standardized tests. As such, the principal came up with a list of factors that
may negatively or positively impact test scores and aimed to prove it to the students while giving
a practice test out of 100 points. A month before the practice test, the principal asked students to
fill out a survey asking them how many hours per week they hung out with their friends and how
many hours per week they spent in study hall. Because the high school was very large, the
principal only surveyed a sample of the students. The following two scatterplots show
the results of the survey versus the students' scores on the practice exam.
[Figure: Scatter Plot (Collection 1), practice exam results versus Hours_With_Friends]
[Figure: Scatter Plot (Collection 1), practice exam results versus Hours_in_Study_Hall]
[Figure: Dot Plot of Slope, horizontal axis from about 1.5 to 4.5]
[Figure: Dot Plot of Slope, horizontal axis from about -4.0 to -1.5]