Simple Linear Regression
Linear Regression
Two techniques are used to study the relationship between the two variables:
The process involves developing a linear model, which can be used to predict the
values of Y (dependent variable) from the values of X (independent variable).
Parameters, symbols
$x$-values: the observed values of the independent variable
$y$-values: the observed values of the dependent variable which correspond to the respective $x$-values
Note: $x$ and $y$ values are always given as "ordered pairs" $(x, y)$ and the order is important: the $x$-value comes first.
$\sum x$: the sum of all the $x$-values
$\sum y$: the sum of all the $y$-values
$\sum x^2$: the sum of all the $x^2$-values. That is, square all values and then add together.
$\sum y^2$: the sum of all the $y^2$-values. That is, square all values and then add together.
$\sum xy$: the sum of all the $xy$-values. That is, multiply each pair of $x$ and $y$ values and then add together.
$b_1$: the slope of the fitted line
$b_0$: the $y$-intercept
$\bar{x}$: the mean of all the $x$-values
$\bar{y}$: the mean of all the $y$-values
$SS_{xy}$: sum of squares for the cross product $xy$
$SS_{xx}$ ($SS_x$) and $SS_y$: sums of squares for $x$ and $y$ respectively
Formulae
$\hat{y} = b_0 + b_1 x$
The equation of the fitted line from the sample data.
$b_1 = \dfrac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \dfrac{SS_{xy}}{SS_x}$
The formula to calculate the slope of the line. This tells you the rate at which $y$ is changing as $x$ increases, that is, the gradient.
$b_0 = \bar{y} - b_1 \bar{x}$
The formula to calculate the $y$-intercept, that is, the value of $y$ when $x$ is zero.
In regression analysis, the notation for a simple linear regression line is as follows:
$\hat{y} = b_0 + b_1 x$, where $b_1$ is the gradient and $b_0$ is the intercept on the $y$ axis.
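As a rough illustration of the slope and intercept formulae, the sketch below computes $b_0$ and $b_1$ from raw $(x, y)$ pairs in Python. The function name `fit_line` and the tiny data set in the usage line are assumptions made for this example only; they are not part of the original notes.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1 * x.

    Uses the sums-of-squares formulae:
        SS_xy = sum(xy) - sum(x) * sum(y) / n
        SS_x  = sum(x^2) - (sum(x))^2 / n
        b1    = SS_xy / SS_x
        b0    = y-bar - b1 * x-bar
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    ss_xy = sum_xy - sum_x * sum_y / n
    ss_x = sum_x2 - sum_x ** 2 / n
    if ss_x == 0:
        raise ValueError("all x-values are identical; the slope is undefined")

    b1 = ss_xy / ss_x
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Hypothetical data lying exactly on y = 1 + 2x.
b0, b1 = fit_line([0, 1, 2], [1, 3, 5])
print(b0, b1)
```

Because the three illustrative points lie exactly on a straight line, the fitted intercept and slope reproduce that line.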
1. Plot the scatter diagram:
[Scatter diagram: time (minutes) on the vertical axis against mls on the horizontal axis]
x        y        xy       x^2      y^2
55       3        165      3025     9
30       2        60       900      4
85       5        425      7225     25
140      7        980      19600    49
115      8        920      13225    64
Totals: $\sum x = 425$, $\sum y = 25$, $\sum xy = 2550$, $\sum x^2 = 43975$, $\sum y^2 = 151$
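Use the totals above appropriately in the formula to calculate the slope:
$b_1 = \dfrac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \dfrac{2550 - \frac{425 \times 25}{5}}{43975 - \frac{425^2}{5}} = 0.054$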
That is, the slope (gradient, or rate of change) is 0.054, which means that for each additional
ml of alcohol consumed, the predicted time taken to complete the task increases by 0.054
minutes.
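Calculate the constant:
$b_0 = \bar{y} - b_1 \bar{x}$, where $\bar{y} = \frac{\sum y}{n} = \frac{25}{5} = 5$ and $\bar{x} = \frac{\sum x}{n} = \frac{425}{5} = 85$.
Hence: $b_0 = 5 - 0.054 \times 85 = 0.398$ (using the unrounded slope, 0.05414). That is, the predicted time to complete the
task when no alcohol is consumed is 0.4 minutes (i.e. when $X = 0$).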
Write the equation: $\hat{y} = b_0 + b_1 x = 0.4 + 0.054x$
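The arithmetic in this worked example can be checked with a short Python sketch. The variable names are illustrative choices; the data are the five $(x, y)$ pairs from the table (x in mls, y in minutes).

```python
# The five observed (x, y) pairs: x = mls, y = time (minutes).
xs = [55, 30, 85, 140, 115]
ys = [3, 2, 5, 7, 8]
n = len(xs)

# Column totals, matching the "Totals" row of the table.
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

# Sums of squares used in the slope formula.
ss_xy = sum_xy - sum_x * sum_y / n   # 2550 - 2125 = 425
ss_x = sum_x2 - sum_x ** 2 / n       # 43975 - 36125 = 7850

b1 = ss_xy / ss_x                    # slope
b0 = sum_y / n - b1 * sum_x / n      # intercept, y-bar - b1 * x-bar

print(f"b1 = {b1:.3f}, b0 = {b0:.3f}")  # prints "b1 = 0.054, b0 = 0.398"
```

Note that the intercept of 0.398 comes from carrying the unrounded slope through the calculation; rounding the slope to 0.054 first would give 0.41 instead.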
Plot the line
First plot any two points. We then use a ruler to draw the line (through the two
points) from the Y axis to the largest value of X.
When $x = 0$, $\hat{Y}_i = 0.4 + 0.054(0) = 0.4$
Note
1 The population predictor equation is formally written as: $Y = \beta_0 + \beta_1 X$
2 It is wise to make predictions only within the range of the X values observed.
As a rule, you do not extrapolate your values outside this range.
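One way to respect the no-extrapolation rule in practice is to have the prediction step refuse any x outside the observed range. The sketch below is only an illustration: the function name `predict_within_range` is hypothetical, and the coefficients (0.4 and 0.054) and range (30 to 140 mls) are taken from the worked example above.

```python
def predict_within_range(x_new, b0, b1, x_min, x_max):
    """Return b0 + b1 * x_new, but only if x_new lies in [x_min, x_max].

    x_min and x_max are the smallest and largest observed x-values;
    anything outside that interval would be an extrapolation.
    """
    if not (x_min <= x_new <= x_max):
        raise ValueError(
            f"x = {x_new} is outside the observed range [{x_min}, {x_max}]"
        )
    return b0 + b1 * x_new


# Fitted line from the worked example; observed x-values ran from 30 to 140.
print(round(predict_within_range(100, 0.4, 0.054, 30, 140), 2))  # prints 5.8

try:
    predict_within_range(200, 0.4, 0.054, 30, 140)
except ValueError as err:
    print("refused:", err)
```

Raising an error rather than returning a value is a design choice for this sketch; a warning would serve equally well where an out-of-range estimate is still wanted.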
Practice Questions
Multi-choice Questions
Use the following TABLE and the information given to answer the next 5
questions. Tim’s Hardware store manager believes that there is a linear relationship
between the income from sales of goods (Y, in thousands of dollars) and the amount
spent on advertising (X, in thousands of dollars).
$SS_x = \sum x^2 - \dfrac{(\sum x)^2}{n} = 195.34 - \dfrac{(43.2)^2}{10} = 8.716$;
$SS_{xy} = \sum xy - \dfrac{\sum x \sum y}{n} = 835 - \dfrac{43.2 \times 185}{10} = 35.8$
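The two sums of squares above can be reproduced from the summary totals with a few lines of Python. This is a sketch for checking the arithmetic only; the variable names are illustrative, and the totals are the ones quoted in the calculation above.

```python
# Summary totals quoted for Tim's Hardware (n = 10 observations).
n = 10
sum_x = 43.2      # total advertising spend (thousands of dollars)
sum_y = 185       # total income from sales (thousands of dollars)
sum_x2 = 195.34   # sum of squared x-values
sum_xy = 835      # sum of the xy products

ss_x = sum_x2 - sum_x ** 2 / n      # sum of squares for x
ss_xy = sum_xy - sum_x * sum_y / n  # sum of squares for the cross product

# By the slope formula, b1 is the ratio of these two quantities.
b1 = ss_xy / ss_x

print(f"SS_x = {ss_x:.3f}, SS_xy = {ss_xy:.1f}")
```

The slope then follows directly as SS_xy divided by SS_x, which is how these quantities are meant to be used in the questions that follow.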
6. The slope of the regression is:
8. Predicted income from the sales of goods when $4,400 is spent on advertising
is
11. Assuming a linear relationship, the slope b1 of the regression model is:
13. The death rate (per 100 million miles) for a country whose speed limit is 65 miles
per hour is
a) 5.0 b) 5.5 c) 6.1 d) 4.62
More questions
The director of cooperative education at a state college wants to examine the effect
of cooperative education job experience on marketability in the work place. She takes
a random sample of 4 students. For these 4, she finds out how many times each had
a cooperative education job and how many job offers they received upon graduation.
These data are presented in the table below.
a) Co Operative Jobs
b) Job Offers
c) Marketability in the workplace
d) None of the Above
a) 2 b) 2.50 c) 5 d) 0.4
Use the following information to answer the next 2 questions.
It is believed that the average number of hours spent studying per day (HOURS)
during undergraduate education should have a positive linear relationship with the
starting salary (SALARY, measured in thousands of dollars per month) after
graduation. Given below is the Excel output from regressing starting salary on
number of hours spent studying per day for a sample of 51 students.
3. Referring to the table, the estimated average change in salary (in thousands of
dollars) as a result of spending an extra hour per day studying is
4. The 90% confidence interval for the average change in SALARY (in thousands
of dollars) as a result of spending an extra hour per day studying is
7. Estimate the average labour hours needed to move 200 cubic metres.
8. During the 1950s, radioactive material leaked from a storage area near Hanford,
Washington, into the Columbia River nearby. For nine counties downstream in
Oregon, an index of exposure X was calculated (based on the distance from
Hanford). Also the cancer mortality Y was calculated (deaths per 100,000 person-
years, 1959 – 1964). This data is summarised as follows:
b) Estimate the cancer mortality (Y) associated with a radioactive exposure index
value of 8.0. (2 marks)
Solutions
Multi-choice Questions
4 a 5 c 6 b 7 d 8 c
9 b 10 b 11 a 12 b 13 d
More Questions
1 a 2 b 3 a 4 c
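5. $b_1 = \dfrac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \dfrac{240833 - \frac{6864 \times 1042.5}{36}}{1564761 - \frac{6864^2}{36}} = 0.1643$
$b_0 = \bar{y} - b_1 \bar{x} = \dfrac{1042.5}{36} - 0.1643 \times \dfrac{6864}{36} = -2.37$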
6. For each additional cubic metre moved, about 0.164 additional labour hours are required.
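8.
a) $b_1 = \dfrac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \dfrac{7500 - \frac{41.4 \times 1440}{9}}{287.42 - \frac{41.4^2}{9}} = 9.033$
$b_0 = \bar{y} - b_1 \bar{x} = \dfrac{1440}{9} - 9.033 \times \dfrac{41.4}{9} = 118.45$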
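The coefficients in solutions 5 and 8 can be re-derived from their summary totals with the sketch below. The function name `coefficients_from_totals` is a hypothetical helper introduced for this check; the totals are those shown in the two solutions.

```python
def coefficients_from_totals(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (b0, b1) computed from summary totals rather than raw data."""
    ss_xy = sum_xy - sum_x * sum_y / n
    ss_x = sum_x2 - sum_x ** 2 / n
    b1 = ss_xy / ss_x
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Solution 5: labour hours (y) against cubic metres moved (x), n = 36.
b0_5, b1_5 = coefficients_from_totals(36, 6864, 1042.5, 240833, 1564761)
print(f"Q5: b1 = {b1_5:.4f}, b0 = {b0_5:.2f}")

# Solution 8 a): cancer mortality (y) against exposure index (x), n = 9.
b0_8, b1_8 = coefficients_from_totals(9, 41.4, 1440, 7500, 287.42)
print(f"Q8: b1 = {b1_8:.3f}, b0 = {b0_8:.2f}")

# Part b) of question 8 asks for the estimate at an exposure index of 8.0;
# under the fitted line that is simply b0 + b1 * 8.0.
estimate_8b = b0_8 + b1_8 * 8.0
print(f"Q8 b): estimated mortality at X = 8.0 is {estimate_8b:.1f}")
```

The helper carries the unrounded slope through to the intercept, so its results can differ in the last decimal place from hand calculations that round the slope first.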