Topic 6
Contents
6.1 Introduction
6.2 Correlation
    6.2.1 Definition
    6.2.2 Calculation of r
6.3 Least Squares Regression Line
    6.3.1 General Equation of Regression Line
    6.3.2 Theoretical Derivation
    6.3.3 Calculation of the Regression Line
6.4 Confidence Intervals
6.5 Hypothesis Tests
    6.5.1 Hypothesis Test for b
    6.5.2 Hypothesis Test for r
6.6 Multiple Linear Regression
6.7 Summary and Assessment
Learning Objectives
calculate the equation of a Least Squares Regression Line through a set of points
solve a multiple linear regression problem with the aid of a computer package
2 TOPIC 6. REGRESSION AND CORRELATION
6.1 Introduction
One of the most important things that should have been learnt in the last chapter is
how to investigate whether there is a significant difference between two samples. It is
often also interesting to know what type of relationship there is between two sets of
variables. This can be a very useful discovery as the statistics obtained can then be
used in decision making. The goal is to use statistical data to derive a procedure that
will allow a prediction, with a determined level of confidence, of further observations or
measurements in the future. A typical example to be discussed below is the following.
Data is collected on the housing market. It is a good guess that there will be a
dependency between the size of house and the price. The latter is simply a number
and the former can be measured through number of floors, number of rooms etc. There
are also categorical parameters that determine the price, like "good location" or "close
to grocery facilities". The collected data can be plotted on a scatterplot (x-y Cartesian
diagram) and some relationship can be looked for.
Mathematical tools can then be developed that will allow a function to be calculated
(expressing price in terms of number of rooms etc.) that will represent this dependency.
Moreover, it will be desirable to have numerical values representing the confidence that
should be placed in accepting this function. This topic will mainly be concerned with
linear relationships although others will be briefly mentioned.
The starting point in looking for a relationship between two sets of results is a
visual inspection of the given data. Scatterplots are graphs that show the relationship
between two quantitative variables.
Examples are
Example
Problem:
Consider the following average high and low temperatures in Montreal (in degrees
Celsius) and the measured gas consumption for heating (in litres) for a house.
In general, when interpreting a scatterplot, first look for an overall pattern and describe
its form, direction and strength. Two variables are positively associated if above-average
values of one variable tend to accompany above-average values of the other. Two variables
are negatively associated if above-average values of one variable tend to accompany
below-average values of the other. In the example above the data are negatively associated.
6.2 Correlation
6.2.1 Definition
Correlation is a number that measures strength and direction of the linear relationship
between two data sets.
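If random variables x and y are given, each consisting of n individual data values, then the
correlation, r, between x and y is defined as

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$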
© Heriot-Watt University 2004
where $\bar{x}$ and $\bar{y}$ are the sample means and $s_x$ and $s_y$ are the standard
deviations of the samples.
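As a minimal sketch of this definition, the function below standardises each value with the sample mean and sample standard deviation, multiplies the standardised pairs, and divides the total by n - 1. It assumes the usual sample definition of r with divisor n - 1; the function name `correlation` and the two small data sets are illustrative and do not come from the text.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r as the averaged product of standardised values."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with n >= 2")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divisor n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Illustrative data (not from the text): an exactly increasing line ...
r_up = correlation([1, 2, 3, 4], [2, 4, 6, 8])
# ... and an exactly decreasing line.
r_down = correlation([1, 2, 3, 4], [8, 6, 4, 2])
print(round(r_up, 6), round(r_down, 6))
```

Points lying exactly on a rising line give r = 1 and points lying exactly on a falling line give r = -1, up to floating-point rounding.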
It can be shown that correlation has the following properties:
Note
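The equation given above is not usually the most efficient way to calculate r. Although
it may look more complicated, a better one for computational purposes is:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$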
6.2.2 Calculation of r
Example
Problem:
Find the correlation coefficient for the data set:
x 1 2 3 4
y 3 6 7 3
Solution:
For the equation

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

to be used, it is best to set up a table.
x     y     x²    y²    xy
1     3     1     9     3
2     6     4     36    12
3     7     9     49    21
4     3     16    9     12
Sum = 10   Sum = 19   Sum = 30   Sum = 103   Sum = 48
Also, n = 4.
Now,

$$r = \frac{4 \times 48 - 10 \times 19}{\sqrt{\left[4 \times 30 - 10^2\right]\left[4 \times 103 - 19^2\right]}} = \frac{2}{\sqrt{20 \times 51}} \approx 0.0626$$
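The table method can be sketched in a few lines of Python. The function below builds the five column sums and substitutes them into the computational formula for r; the name `correlation_from_sums` is illustrative, and the data are those of the example (x = 1, 2, 3, 4 and y = 3, 6, 7, 3).

```python
from math import sqrt


def correlation_from_sums(xs, ys):
    """Correlation r from the column sums of the x, y, x^2, y^2, xy table."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation_from_sums([1, 2, 3, 4], [3, 6, 7, 3])
print(round(r, 4))  # 0.0626
```

A value this close to zero indicates almost no linear relationship in this small data set.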
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

$$a = \frac{\sum y - b\sum x}{n}$$
Note that n is the number of points and that b has to be calculated first as a depends on
it.
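A sketch of where these two formulas come from, assuming the line is written y = a + bx and that the quantity being minimised is the sum of squared vertical errors. The symbol S for that sum is a label introduced here for illustration.

```latex
\begin{align*}
S &= \sum (y - a - bx)^2 \\
\frac{\partial S}{\partial a} &= -2\sum (y - a - bx) = 0
  \;\Rightarrow\; \sum y = na + b\sum x \\
\frac{\partial S}{\partial b} &= -2\sum x\,(y - a - bx) = 0
  \;\Rightarrow\; \sum xy = a\sum x + b\sum x^2 \\
\text{From the first equation: } a &= \frac{\sum y - b\sum x}{n} \\
\text{Substituting into the second: } n\sum xy &= \sum x \sum y - b\Bigl(\sum x\Bigr)^2 + nb\sum x^2 \\
\Rightarrow\; b &= \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}
\end{align*}
```

The expression for a contains b, which is why b has to be evaluated first.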
Regression lines are used, among other things, for prediction. Of course, a regression line is
only useful if a linear relationship between the data sets is expected.
Notes
In least squares regression analysis the roles of x and y are distinct. Changing
them round gives different regression lines (what is happening is that rather than
minimising errors vertically, they are being minimised horizontally).
A change of one standard deviation in x corresponds to a change of r standard
deviations in y.
It is always the case that
6
6.3.3 Calculation of the Regression Line
As in the correlation section, there is a convenient way of calculating the regression line
through a set of points by simply setting up a table.
Examples
1.
Problem:
Find the regression line for the data set:
x 0 1 2 3 4
y 0 5 11 13 19
It is assumed that x is the explanatory variable and y the response.
Solution:
x     y     x²    xy
0     0     0     0
1     5     1     5
2     11    4     22
3     13    9     39
4     19    16    76
Sum = 10   Sum = 48   Sum = 30   Sum = 142
Note that n = 5.
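$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \qquad a = \frac{\sum y - b\sum x}{n}$$

Now,

$$b = \frac{5 \times 142 - 10 \times 48}{5 \times 30 - 10^2} = \frac{230}{50} = 4.6$$

$$a = \frac{48 - 4.6 \times 10}{5} = 0.4$$

so the regression line is y = 4.6x + 0.4.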
Notice that here, since the example was purely theoretical, it did not matter which values
were used for x and which for y; they were simply used "as they came". However,
sometimes the order is important. If, for example, you were examining how
cholesterol levels in humans depend on age, it would be necessary to make cholesterol
the response variable (y) and age the explanatory variable (x). It would be nonsensical
to have an equation that calculates a person's age given their cholesterol level.
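The table-based calculation for the data set x = 0, 1, 2, 3, 4 and y = 0, 5, 11, 13, 19 can be sketched as follows. The function name `least_squares_line` is illustrative; it computes the slope b first and then the intercept a, using the column sums of x, y, x² and xy.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # The slope must be computed first because the intercept depends on it.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


xs = [0, 1, 2, 3, 4]   # explanatory variable
ys = [0, 5, 11, 13, 19]  # response variable
a, b = least_squares_line(xs, ys)
print(f"y = {b:.1f}x + {a:.1f}")  # y = 4.6x + 0.4

# Swapping the roles of the two variables gives a different line.
a_swapped, b_swapped = least_squares_line(ys, xs)
```

Calling the function with the arguments swapped fits x in terms of y, which is not simply the first line rearranged. This is why the choice of response variable matters.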
r b a
High -0.9940 -23.9072 743.8478
Low -0.9912 -24.6422 508.6054
Solution:
The two equations for the regression lines are thus
Gas Used = 743.8478 - 23.9072 × High Temperature
Gas Used = 508.6054 - 24.6422 × Low Temperature
Again, it is much more sensible to find equations for "gas used" in terms of "temperature"
rather than the other way round, so "gas used" is taken as the response variable and
"temperature" the explanatory one.
The lines can now be drawn on the earlier graphs as follows:
3.
Problem:
The following four data sets are from Frank J. Anscombe, Graphs in statistical analysis.
x1 10 8 13 9 11 14 6 4 12 7 5
y1 8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68
x2 10 8 13 9 11 14 6 4 12 7 5
y2 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.10 9.13 7.26 4.74
x3 10 8 13 9 11 14 6 4 12 7 5
y3 7.46 6.77 12.74 7.11 7.81 8.84 6.08 5.39 8.15 6.42 5.73
x4 8 8 8 8 8 8 8 8 8 8 19
y4 6.58 5.76 7.71 8.84 8.47 7.04 5.25 5.56 7.91 6.89 12.50
Solution:
The correlation and least squares lines for the four data sets are shown in the following
list:
On inspection of the plots it is apparent that only the third and fourth represent strong
linear relationships (both with one influential outlier). The first data set represents a
moderate linear relationship, while the second represents a curved relationship. This
shows that correlation and the least squares regression line can suggest a linear
relationship even if there is none, so visual inspection of the data is vital.
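The point can be checked numerically with a short sketch that computes r and the least squares line for each of the four data sets exactly as tabulated above. The helper name `correlation_and_line` and the dictionary layout are illustrative choices.

```python
from math import sqrt


def correlation_and_line(xs, ys):
    """Return (r, a, b) where y = a + b*x is the least squares line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = n * sum(x * x for x in xs) - sum_x ** 2
    s_yy = n * sum(y * y for y in ys) - sum_y ** 2
    s_xy = n * sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y
    r = s_xy / sqrt(s_xx * s_yy)
    b = s_xy / s_xx
    a = (sum_y - b * sum_x) / n
    return r, a, b


x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets 1 to 3
datasets = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 19],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]),
}

results = {k: correlation_and_line(xs, ys) for k, (xs, ys) in datasets.items()}
for k, (r, a, b) in results.items():
    print(f"set {k}: r = {r:.3f}, line y = {b:.3f}x + {a:.3f}")
```

All four data sets give r close to 0.816 and a line close to y = 0.5x + 3.0, even though their scatterplots look completely different. The numbers alone cannot distinguish them.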
EC Data
The following data represent the number of members in the EC Council of Ministers of
(1) current EC members and of (2) potential EC members, and the populations of the
member states:
(1) Current members
The standard error of the slope b is

$$SE_b = \frac{b}{r}\sqrt{\frac{1-r^2}{n-2}}$$
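Thus the end-points of a 95% confidence interval for the slope b are

$$b \pm t_{0.025,\,n-2} \times SE_b$$

where $SE_b$ is the standard error of b and $t_{0.025,\,n-2}$ is the cut-off point of the t
distribution with n - 2 degrees of freedom.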
Example
Problem:
Consider the data below:
x 1 2 3 4 5 6
y 1 2 2 4 4 6
Find a 95% confidence interval for the slope of the regression line.
Solution:
It is easily confirmed that a = -0.1333, b = 0.9429 and r = 0.9613.
The standard error is equal to

$$SE_b = \frac{b}{r}\sqrt{\frac{1-r^2}{n-2}} = \frac{0.9429}{0.9613}\sqrt{\frac{1-0.9613^2}{6-2}} = 0.1351$$

The "cut-off" points for the t distribution with 4 degrees of freedom and a 95% confidence
interval are ±2.776.
Therefore, the 95% confidence interval is 0.9429 ± 2.776 × 0.1351.
This gives the interval [0.5679, 1.3179].
The two extreme regression lines are y = 0.5679x - 0.1333 and y = 1.3179x - 0.1333.
All three equations are drawn on the graph below.
Thus, with 95% confidence it can be said that the true linear regression line is within
the cone displayed in the data plot, the boundaries of the cone being the two regression
lines with slope 0.5679 and 1.3179 respectively. The cone seems to be very large, but
keep in mind that there is only a small number of data.
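The interval calculation for the six-point example can be sketched as below. The standard error is written as (b / r) × sqrt((1 - r²) / (n - 2)), which is an assumption about the form of the formula, chosen because it reproduces the standard error of 0.1351 quoted above. The cut-off 2.776 is taken from the text rather than computed, and the function name is illustrative.

```python
from math import sqrt

T_CUTOFF_4DF_95 = 2.776  # t cut-off for 4 degrees of freedom, 95%, from the text


def slope_confidence_interval(xs, ys, t_cutoff):
    """Return (b, se_b, (lower, upper)) for the slope of the regression line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = n * sum(x * x for x in xs) - sum_x ** 2
    s_yy = n * sum(y * y for y in ys) - sum_y ** 2
    s_xy = n * sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y
    b = s_xy / s_xx
    r = s_xy / sqrt(s_xx * s_yy)
    # Assumed form of the standard error of the slope (see lead-in).
    se_b = (b / r) * sqrt((1 - r * r) / (n - 2))
    margin = t_cutoff * se_b
    return b, se_b, (b - margin, b + margin)


xs = [1, 2, 3, 4, 5, 6]
ys = [1, 2, 2, 4, 4, 6]
b, se_b, (lower, upper) = slope_confidence_interval(xs, ys, T_CUTOFF_4DF_95)
print(f"b = {b:.4f}, SE = {se_b:.4f}, interval = [{lower:.4f}, {upper:.4f}]")
```

Because the code keeps full precision throughout, its results can differ in the fourth decimal place from the hand calculation, which rounds b and r before combining them.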
x 3.4 1.8 4.6 2.3 3.1 5.2 0.6 2.9 2.7 4.0 2.3 1.0 6.3 4.5 3.5
y 2.5 1.8 3.0 2.6 2.9 3.9 1.6 2.3 2.0 3.4 2.5 1.8 2.6 2.8 2.8
Calculate the least squares regression line and draw it on a scatterplot together with two
more lines that determine a cone formed by the 95% confidence interval values for b.
x 1 2 3 4 5 6
y 1 2 2 4 4 6
b is calculated as 0.9429 and the standard error as 0.1351.
Set up the hypotheses:
H0 : b = 0
H1 : b ≠ 0
The test statistic is

$$t = \frac{b - 0}{SE_b} = \frac{b}{SE_b}$$

This calculates as t = 0.9429 / 0.1351 = 6.98.
Drawing a t distribution curve with 4 degrees of freedom and a significance level of 5%
results in the following diagram.
Since the test statistic is in the shaded area, the alternative hypothesis (H1) is accepted.
There is evidence at the 5% level that the slope of the regression line is not equal to
zero.
Note that for convenience b is used as the notation in the hypotheses even though, in
the hypothesis test, it actually refers to the population as opposed to the sample.
Examples
1.
Problem:
x 1 3 4 5 3 8
y 2 6 5 8 1 3
The scatterplot is drawn below
Solution:
The plot does not look very linear and this is borne out by calculation of r, which is
0.2241.
A hypothesis test is now carried out.
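H0 : r = 0
H1 : r ≠ 0
The test statistic is

$$t = r\sqrt{\frac{n-2}{1-r^2}} = 0.2241\sqrt{\frac{6-2}{1-0.2241^2}} \approx 0.46$$

which is compared with a t distribution with 4 degrees of freedom at the 5% significance
level (cut-off points ±2.776).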
Since the test statistic is not in the shaded region the null hypothesis is accepted. There
is no evidence, at the 5% level, of a significant relationship between the two variables.
One-tailed tests can also be used.
2.
Problem:
Recall that the Montreal heating data in section 6.1 produced the following results:
r b a
High -0.9940 -23.9072 743.8478
Low -0.9912 -24.6422 508.6054
The two equations for the regression lines were
Gas Used = 743.8478 - 23.9072 × High Temperature
Gas Used = 508.6054 - 24.6422 × Low Temperature
There were 12 data points considered.
Solution:
Consider the first value of r, namely -0.9940. A hypothesis test can be used to confirm
that this is significantly less than 0.
H0 : r ≥ 0
H1 : r < 0
The test statistic,

$$t = r\sqrt{\frac{n-2}{1-r^2}} = -0.9940\sqrt{\frac{10}{1-0.9940^2}}$$

calculates as -28.7.
The t curve is drawn below (10 degrees of freedom) with a significance level of 0.1%
shaded.
Since the test statistic is in the shaded area, the null hypothesis is rejected. There is
evidence using a significance level of 0.001 that the value of r is less than 0. It can be
said, then, that there is a very highly significant negative correlation between the two
variables.
This can similarly be repeated for the other value of r in the example.
Once again in this section the "r" in the hypothesis test actually refers to the population
rather than the sample.
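Both test statistics in this section can be reproduced with a short sketch, assuming the statistic takes the form t = r × sqrt((n - 2) / (1 - r²)) with n - 2 degrees of freedom. That form gives the value -28.7 quoted for the Montreal data. The function name is illustrative, and the cut-off 2.776 is the 4 degrees of freedom, 5% two-tailed value used earlier in this topic.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t statistic for testing a correlation r computed from n data points."""
    return r * sqrt((n - 2) / (1 - r * r))


# Example 1: six points with r = 0.2241, two-tailed test at the 5% level.
t1 = correlation_t_statistic(0.2241, 6)
cutoff_4df = 2.776
example1_significant = abs(t1) > cutoff_4df
print(round(t1, 2), example1_significant)  # 0.46 False

# Example 2: Montreal high temperatures, 12 points with r = -0.9940.
t2 = correlation_t_statistic(-0.9940, 12)
print(round(t2, 1))  # -28.7
```

The first statistic falls well inside the cut-off points, so the null hypothesis stands. The second is far out in the lower tail.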
Number of Apartments (x)   Number of Floors (y)   Price (in $million) (z)
60                         10                     78.2
40                         5                      45.4
80                         10                     100.0
30                         6                      35.7
60                         3                      80.5
40                         6                      42.8
90                         12                     120.4
80                         7                      90.5
Using the method of least squares, it is desired to find the equation of a plane that will
predict the price of an apartment block (z) in terms of the number of apartments (x)
and the number of floors (y).
The equation of the plane is z = a + bx + cy where a, b and c are parameters to be
estimated.
For an observed data point z, the error is given by:

$$e = z - \hat{z} = z - a - bx - cy$$

For least squares, it is required to minimise $\sum e^2$.
Therefore the problem becomes:
Minimise $S = \sum (z - a - bx - cy)^2$.
This equation can be partially differentiated with respect to a, b and c to obtain:

$$\frac{\partial S}{\partial a} = 2\sum (z - a - bx - cy)(-1)$$

$$\frac{\partial S}{\partial b} = 2\sum (z - a - bx - cy)(-x)$$

$$\frac{\partial S}{\partial c} = 2\sum (z - a - bx - cy)(-y)$$
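To minimise, equate all 3 to 0 (the "2's" cancel):

$$\sum z = na + b\sum x + c\sum y$$

$$\sum xz = a\sum x + b\sum x^2 + c\sum xy$$

$$\sum yz = a\sum y + b\sum xy + c\sum y^2$$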
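Now for the data above a series of values such as $\sum x^2$ can be calculated and substituted
into these equations. It can be shown that:

$$\sum x = 480, \quad \sum y = 59, \quad \sum z = 593.5, \quad \sum x^2 = 32200, \quad \sum xy = 3840,$$

$$\sum y^2 = 499, \quad \sum xz = 40197, \quad \sum yz = 4799.8$$

Also note that $\sum 1 = n$, the number of points (in this case 8).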
The equations become:
8a + 480b + 59c = 593.5
480a + 32200b + 3840c = 40197
59a + 3840b + 499c = 4799.8
The method of solution of simultaneous equations can be used to solve for a, b and c
(matrix methods can also be employed).
The solution is a = -7.76106, b = 1.30665 and c = 0.48129
The equation of the least squares plane through the data is
z = -7.76106 + 1.30665x + 0.48129y
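The whole calculation can be sketched in plain Python: form the sums, assemble the three simultaneous equations, and solve them by Gaussian elimination. The helper names `dot` and `solve_linear_system` are illustrative. The sixth price is entered as 42.8, the value consistent with the total of 593.5 on the right-hand side of the first equation.

```python
def dot(u, v):
    """Sum of element-wise products, e.g. dot(x, z) is the sum of x*z."""
    return sum(p * q for p, q in zip(u, v))


def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Bring the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


x = [60, 40, 80, 30, 60, 40, 90, 80]  # number of apartments
y = [10, 5, 10, 6, 3, 6, 12, 7]       # number of floors
z = [78.2, 45.4, 100.0, 35.7, 80.5, 42.8, 120.4, 90.5]  # price in $million
n = len(z)

normal_matrix = [
    [n,      sum(x),    sum(y)],
    [sum(x), dot(x, x), dot(x, y)],
    [sum(y), dot(x, y), dot(y, y)],
]
normal_rhs = [sum(z), dot(x, z), dot(y, z)]

a, b, c = solve_linear_system(normal_matrix, normal_rhs)
print(f"z = {a:.5f} + {b:.5f}x + {c:.5f}y")
```

The same three equations could equally be handed to a matrix routine in a numerical package; elimination is written out here only to keep the sketch dependency-free.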
Glossary
Correlation
is a number that measures strength and direction of the linear relationship between
two data sets.
Answers: Topic 6
For the second table, r calculates as 0.9700 and the regression equation is y = 0.6452x
+ 3.3924. The graph is shown below.
The first scatter plot does not necessarily suggest a linear relationship between number
of inhabitants and seats, but the second does look linear. The combined data set does,
in fact, suggest a logarithmic relationship instead of a linear one.
If you have Microsoft Excel, the regression equation and correlation coefficient can be
obtained very quickly. The command "Correl" (from the function menu) will produce the
Product Moment Correlation Coefficient, r, whilst the commands "slope" and "intercept"
will give you b and a respectively.
This package is used on the response times for 999 calls data and the values obtained
are:
r = 0.746049
b = 0.29909
a = 1.60559
The regression equation is therefore y = 0.29909x + 1.60559
Now, confidence intervals for b can be calculated. The standard error of b is given by

$$SE_b = \frac{b}{r}\sqrt{\frac{1-r^2}{n-2}} = \frac{0.29909}{0.746049}\sqrt{\frac{1-0.746049^2}{15-2}} = 0.074040$$
The value of t0.025, 13 is found from tables to be 2.160. (If you are using Excel, you
might like to try using the "Tinv" function to also obtain this number, but note that it
automatically does a two-tailed test so 0.05 has to be entered as opposed to 0.025).
So the 95% confidence interval for b is 0.29909 ± 2.160 × 0.074040
This gives an interval of [0.1392, 0.4590].
Therefore, the two lines to produce the cone are given by
y = 0.1392x + 1.60559
y = 0.4590x + 1.60559
The lines are drawn on the scatterplot as follows