Correlation and Regression
CORRELATION
Correlation is an important statistical concept that refers to the interrelationship or association between variables.
The purpose of studying correlation is to establish whether a relationship exists, so that one can plan and control the
inputs (independent variables) and the outputs (dependent variables).
In business, one may want to establish whether a relationship exists between:
i. The amount of fertilizer applied on a given farm and the resulting harvest
ii. The amount of experience one has and the corresponding performance
iii. The amount of money spent on advertisement and the expected income from the sale of the
goods/services
There are two measures of the degree of correlation between two variables. These are denoted by
r and R.
(a) The coefficient of correlation, denoted by r, measures the strength of association
between two variables, one the dependent variable and the other the independent variable. r ranges
between +1 and -1, for perfect positive correlation and perfect negative correlation respectively,
with zero indicating no linear relationship. For perfect positive correlation, y increases linearly as x
increases.
(b) The rank correlation coefficient, denoted by R, measures the association between two sets of ranked or
ordered data. R also varies from +1 (perfect positive rank correlation) to -1 (perfect negative
rank correlation), with 0 or any number near zero representing no correlation.
SCATTER GRAPHS
- A scatter graph is a graph consisting of points which have been plotted but are not joined
by line segments
- The pattern of the points usually reveals the type of relationship existing between the variables
- The following sketch graphs will assist in the interpretation of scatter graphs.
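Perfect positive correlation
[Figure: scatter graph with the dependent variable on the y axis and the independent variable on the x axis; the points lie on a straight line rising from left to right]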
NB: The above pattern is referred to as perfect because the points can be represented by a
single straight line, e.g. when measuring the relationship between volume of sales and profits in a company,
the more the company sells, the higher the profits.
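Perfect negative correlation
[Figure: scatter graph of quantity sold (y axis, dependent variable) against price (x axis, independent variable); the points lie on a straight line falling from left to right]
This example considers the volume of sales in relation to the price: the cheaper the goods, the bigger the sales.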
High positive correlation
[Figure: scatter graph of quantity sold (y axis) against price (x axis); the points cluster closely around a rising straight line]
No correlation
[Figure: scatter graph with y values from 0 to 600 and x values from 10 to 50; the points are scattered with no discernible pattern]
Spurious Correlations
- Occasionally, plotting the data for x and y shows a pattern of either positive or negative
correlation even though, in real life, there is no convincing evidence that the two variables
are related. The relationship then exists only in the figures, and it is referred to as
spurious or nonsense correlation, e.g. when high pass rates of students show a strong
relationship with an increased number of accidents.
Correlation coefficient
- Correlation coefficients are numerical measures of the correlation existing between the dependent and the
independent variables
- They are better measures of correlation than scatter graphs
- Correlation coefficients range between +1 and -1. A correlation coefficient of
+1 implies perfect positive correlation, a value of -1 implies perfect
negative correlation, and a value of 0 implies no correlation at all
- The following chart will be found useful in interpreting correlation coefficients
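Two types of correlation coefficient are normally used, namely the product moment correlation coefficient (r) and the rank correlation coefficient (R).
The product moment correlation coefficient is given by
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}} $$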
Note that this formula can be rearranged to have different outlooks but the resultant is always the same.
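One common rearrangement is sketched below as an illustration. It writes r in terms of deviations from the means, where the symbols \bar{x} and \bar{y} (the means of the x and y values) are introduced here for the sketch. It is algebraically identical to the form based on the sums Σx, Σy, Σxy, Σx² and Σy².

```latex
\begin{align*}
r &= \frac{n\sum xy - \sum x \sum y}
          {\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}} \\[4pt]
  &= \frac{\sum (x - \bar{x})(y - \bar{y})}
          {\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}
\end{align*}
```

The second form follows from dividing the numerator and the denominator of the first by n.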
Example
The following data was observed and it is required to establish if there exists a relationship between the
two.
X 15 24 25 30 35 40 45 65 70 75
Y 60 45 50 35 42 46 28 20 22 15
Required
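Calculate the product moment correlation coefficient and briefly comment on the value obtained.
The product moment correlation coefficient
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}} $$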
Workings:
= = 402
r= = 0.89
Comment: The value obtained, 0.89, suggests that the correlation between annual income and annual
expenditure is high and positive. This implies that the more one earns, the more one spends.
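The arithmetic can be checked mechanically. The sketch below is a minimal illustration rather than the original workings: the helper name `pearson_r` is hypothetical, and it applies the sums-based product moment formula to the ten (X, Y) pairs tabulated above. Because Y falls as X rises in that table, the coefficient for those particular pairs comes out negative (about -0.93). The 0.89 quoted in the comment relates to annual income and expenditure figures, which are not tabulated here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product moment correlation coefficient computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# The ten (X, Y) pairs from the table in the example.
X = [15, 24, 25, 30, 35, 40, 45, 65, 70, 75]
Y = [60, 45, 50, 35, 42, 46, 28, 20, 22, 15]

r = pearson_r(X, Y)
print(f"r = {r:.2f}")
```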
The rank correlation coefficient is given by
$$ R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$
Where d = difference between the pairs of ranked values
n = number of pairs of rankings
Example
A group of 8 accountancy students are tested in Quantitative Techniques and Law II. Their rankings in
the two tests were:
Student   Q.T. ranking   Law II ranking    d     d²
A         2              3                 -1    1
B         7              6                  1    1
C         6              4                  2    4
D         1              2                 -1    1
E         4              5                 -1    1
F         3              1                  2    4
G         5              8                 -3    9
H         8              7                  1    1
                                          Σd² = 22
$$ R = 1 - \frac{6 \times 22}{8(8^2 - 1)} = 1 - \frac{132}{504} = 0.74 $$
Thus we conclude that there is reasonable agreement between the students' performances in the two
tests.
NOTE: If the actual marks were given in this example, we would calculate r instead. R varies between +1 and
-1.
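The rank correlation calculation for the eight students can be reproduced in a few lines. This is a minimal sketch; the function name `rank_correlation` is hypothetical, and it assumes there are no tied ranks.

```python
def rank_correlation(ranks_a, ranks_b):
    """Rank correlation R = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no tied ranks."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Rankings of students A to H in the two tests.
qt_ranks = [2, 7, 6, 1, 4, 3, 5, 8]
law_ranks = [3, 6, 4, 2, 5, 1, 8, 7]

R = rank_correlation(qt_ranks, law_ranks)
print(f"R = {R:.2f}")
```

Here the sum of the squared differences is 22, so the function evaluates 1 - 132/504.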
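Tied Rankings
A slight adjustment to the formula is made if some students tie and share the same ranking. The adjusted formula is
$$ R = 1 - \frac{6\left(\sum d^2 + \frac{t^3 - t}{12}\right)}{n(n^2 - 1)} $$
Where t = number of tied rankings
Solution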
Student   Q.T. ranking   Law II ranking    d      d²
A         2              3                 -1     1
B         7              6                  1     1
C         6              4                  2     4
D         1              2                 -1     1
E         3½             5                 -1½    2¼
F         3½             1                  2½    6¼
G         5              8                 -3     9
H         8              7                  1     1
                                          Σd² = 25½
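With t = 2 tied rankings, the adjustment adds (2³ - 2)/12 = ½ to Σd²:
$$ R = 1 - \frac{6\left(25\tfrac{1}{2} + \frac{2^3 - 2}{12}\right)}{8(8^2 - 1)} = 1 - \frac{6 \times 26}{504} = 0.69 $$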
NOTE: It is conventional to show the shared rankings as above, i.e. E and F take up the 3rd and 4th ranks,
which are shared between the two as 3½ each.
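Shared rankings and the tie adjustment can be sketched as follows. Several things here are assumptions made for illustration: the function names are hypothetical, the Q.T. marks are invented purely so that E and F tie for 3rd and 4th place, and the adjustment used is the commonly quoted one of adding (t³ - t)/12 to Σd² for each group of t tied ranks. Under that assumption the coefficient for the table above evaluates to roughly 0.69.

```python
from collections import Counter


def average_ranks(marks):
    """Rank marks from highest (rank 1) downwards; tied marks share the mean rank."""
    positions = {}
    for position, mark in enumerate(sorted(marks, reverse=True), start=1):
        positions.setdefault(mark, []).append(position)
    return [sum(positions[m]) / len(positions[m]) for m in marks]


def rank_correlation_with_ties(ranks_a, ranks_b):
    """Rank correlation with (t^3 - t)/12 added to sum(d^2) per group of t ties."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    correction = 0.0
    for ranks in (ranks_a, ranks_b):
        for t in Counter(ranks).values():
            if t > 1:
                correction += (t ** 3 - t) / 12
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))


# Hypothetical Q.T. marks for students A to H, chosen so that E and F tie.
qt_marks = [70, 40, 45, 80, 60, 60, 50, 30]
qt_ranks = average_ranks(qt_marks)      # E and F each receive rank 3.5
law_ranks = [3, 6, 4, 2, 5, 1, 8, 7]    # Law II rankings from the table

R_tied = rank_correlation_with_ties(qt_ranks, law_ranks)
print(qt_ranks)
print(f"R = {R_tied:.2f}")
```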
REQUIRED
Calculate the rank correlation coefficient and hence comment briefly on the value obtained
REGRESSION
- Regression is a concept which refers to the changes that occur in the dependent variable as a result of
changes in the independent variable.
- Knowledge of regression is particularly useful in business statistics, where it is necessary to
consider the corresponding changes in dependent variables whenever independent variables
change.
- Most business activities involve a dependent variable and one or
more independent variables. Knowledge of regression therefore enables a business statistician
to predict or estimate the expected value of a dependent variable when given a value of an independent
variable, e.g. consider the above example of annual incomes and annual expenditures. Using
regression techniques, one can determine the estimated expenditure of a given family
if the annual income is known, and vice versa.
- The general equation used in simple regression analysis is as follows
y = a + bx
Where y = Dependent variable
a = Intercept on the y axis (constant)
b = Slope (gradient) of the line
x = Independent variable
i. The regression equation given above is normally determined by
using a technique known as "the method of least squares".
Regression equation of y on x i.e. y = a + bx
The following pair of equations, known as the normal equations, is used to determine the equation of
the above regression line when given a set of data.
Σy = an + bΣx
Σxy = aΣx + bΣx²
Where Σy = sum of the y values
Σxy = sum of the products of x and y
Σx = sum of the x values
Σx² = sum of the squares of the x values
n = number of pairs of values
a = the intercept on the y axis
b = slope (gradient) of the line of y on x
NB: The above regression line is normally used in one way only i.e. it is used to estimate the y values
when the x values are given.
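Solving the two normal equations simultaneously gives closed-form expressions for the slope and the intercept. The working below is a short sketch of that algebra, using the same symbols as the normal equations.

```latex
\begin{align*}
\text{From } \sum y = an + b\sum x: \quad
  a &= \frac{\sum y - b\sum x}{n} \\[4pt]
\text{Substituting into } \sum xy = a\sum x + b\sum x^2: \quad
  \sum xy &= \frac{\sum y - b\sum x}{n}\sum x + b\sum x^2 \\[4pt]
\text{Multiplying by } n \text{ and collecting terms in } b: \quad
  b &= \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}
\end{align*}
```

In practice b is calculated first and then substituted into the expression for a.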
Regression line of x on y i.e. x = a + by
- The fact that a regression line can only be used in one direction leads to what is known as the regression
paradox
- This means that regression lines are not ordinary mathematical line graphs which may be used
to estimate x from y as well as y from x
- Therefore one has to be careful when using regression lines: to estimate x from a given value of y, it is necessary to first develop the
regression equation of x on y.
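The difference between the two regression lines can be seen numerically. The sketch below is an illustration only: the helper name `least_squares` is hypothetical, and it reuses the ten (X, Y) pairs from the earlier correlation example. It fits y on x and x on y separately, then compares the x estimated for y = 40 from the x-on-y line with the x obtained by simply rearranging the y-on-x line.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line ys = a + b * xs."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


X = [15, 24, 25, 30, 35, 40, 45, 65, 70, 75]
Y = [60, 45, 50, 35, 42, 46, 28, 20, 22, 15]

a_yx, b_yx = least_squares(X, Y)  # regression line of y on x
a_xy, b_xy = least_squares(Y, X)  # regression line of x on y

y0 = 40
x_from_x_on_y = a_xy + b_xy * y0       # the appropriate estimate of x
x_from_inverted = (y0 - a_yx) / b_yx   # rearranging the y-on-x line instead

print(f"y on x: y = {a_yx:.2f} + ({b_yx:.3f})x")
print(f"x on y: x = {a_xy:.2f} + ({b_xy:.3f})y")
print(f"x estimated for y = {y0}: {x_from_x_on_y:.2f} vs {x_from_inverted:.2f}")
```

The two estimates differ unless the correlation is perfect, which is why each line is used in one direction only. The product of the two slopes equals r².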
The following example will illustrate how regression lines are used
Example
An investment company advertised the sale of pieces of land at different prices. The following table
shows the pieces of land, their acreage and their costs.
Required
Determine the regression equations of
i. y on x and hence estimate the cost of a piece of land with 4.5 hectares
ii. x on y and hence estimate the expected acreage of a piece of land that costs £900,000
Σy = an + bΣx
Σxy = aΣx + bΣx²
intercept a =
Slope b =
Example
The calculations for our sample size n = 10 are given below. The linear regression model is y = a + bx
Table
The slope of the regression line is the estimated number of minutes per mile needed for a delivery. The
intercept is the estimated time to prepare for the journey and to deliver the goods, that is the time needed
for each journey other than the actual traveling time.
PREDICTION WITHIN THE RANGE OF SAMPLE DATA
We can use the linear regression model to predict the mean of the dependent variable for any given value of
the independent variable.
For example, if the sample model is given by
Time (min) = 5.91 + 2.66 (distance in miles)
then if the distance is 4.0 miles, our estimated mean time is
ŷ = 5.91 + 2.66 × 4.0 = 16.6 minutes
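The prediction step amounts to evaluating the fitted line at the chosen distance. A minimal sketch follows; the function and constant names are hypothetical, while the coefficients are those of the sample model above.

```python
INTERCEPT_MIN = 5.91        # estimated fixed time per journey, in minutes
SLOPE_MIN_PER_MILE = 2.66   # estimated travelling time per mile, in minutes


def predict_time(distance_miles):
    """Estimated mean delivery time in minutes for a given distance in miles."""
    return INTERCEPT_MIN + SLOPE_MIN_PER_MILE * distance_miles


estimate = predict_time(4.0)
print(f"Estimated mean time for 4.0 miles: {estimate:.2f} minutes")
```

The estimate is only as reliable as the sample behind the model, so it is best applied to distances that lie within the range of the sample data.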
Example
Odino Chemicals Limited is aware that its power costs are a semi-variable cost, and over the last six months
these costs have shown the following relationship with a standard measure of output.
Required
i. Using the method of least squares, determine an appropriate linear relationship between total
power costs and output
ii. If total power costs are related to both output and time (as measured by the number of the
month), the following least squares regression equation is obtained
Power costs = 4.42 + (0.82) output + (0.10) month
Where the regression coefficients (i.e. 0.82 and 0.10) have t values of 2.64 and 0.60 respectively
and the coefficient of multiple correlation amounts to 0.976
Compare the relative merits of this fitted relationship with the one you determine in (i). Explain
(without doing any further analysis) how you might use the data to forecast total power costs
in month seven.
Solution
a)
Output (x)   Power costs (y)   x²       y²        xy
12           6.2               144      38.44     74.40
18           8.0               324      64.00     144.00
19           8.6               361      73.96     163.40
20           10.4              400      108.16    208.00
24           10.2              576      104.04    244.80
30           12.4              900      153.76    372.00
Σx = 123     Σy = 55.8         Σx² = 2,705   Σy² = 542.36   Σxy = 1,206.60
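$$ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{6 \times 1206.60 - 123 \times 55.8}{6 \times 2705 - 123^2} = \frac{376.2}{1101} = 0.342 $$
$$ a = \frac{\sum y - b\sum x}{n} = \frac{55.8 - 0.342 \times 123}{6} = 2.29 $$
The fitted relationship is therefore Power costs = 2.29 + 0.34 × output.
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}} = \frac{376.2}{\sqrt{1101 \times 140.52}} = 0.96 $$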
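The slope, intercept and correlation coefficient for the power cost data can be verified directly from the six observations. The sketch below is a check on the arithmetic using the standard least-squares expressions; the variable names are chosen here for illustration.

```python
from math import sqrt

# Six months of observations from the solution table.
output = [12, 18, 19, 20, 24, 30]
power_cost = [6.2, 8.0, 8.6, 10.4, 10.2, 12.4]

n = len(output)
sum_x = sum(output)
sum_y = sum(power_cost)
sum_xy = sum(x * y for x, y in zip(output, power_cost))
sum_x2 = sum(x * x for x in output)
sum_y2 = sum(y * y for y in power_cost)

# Slope and intercept of the regression line of power cost on output.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Product moment correlation coefficient.
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(f"b = {b:.3f}, a = {a:.2f}, r = {r:.2f}")
```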
This shows a strong correlation between power costs and output. The coefficient of multiple correlation when both output
and time are considered together is 0.976.
We observe that there is very little increase in the correlation coefficient, which means that including the time variable does
not improve the correlation significantly.
The t value for the time variable is only 0.60, which is insignificant compared with a t value of 2.64 for the
output variable.
In fact, if we work out the correlation between output and time, it will be high. Hence there
is no need to include both variables. Including time does improve the correlation coefficient, but
only by a very small amount.
If we use the linear regression analysis and attempt to find the linear relationship between output and time
i.e.
Month Output
1 12
2 18
3 19
4 20
5 24
6 30
The values of b and a turn out to be 3.11 and 9.6, i.e. the relationship is of the form
Output = 9.6 + 3.11 × month
For this equation the forecast for the 7th month is
Output = 9.6 + 3.11 × 7
= 9.6 + 21.77
= 31.37 units
Using the equation Power costs = 2.29 + 0.34 × output
= 2.29 + 0.34 × 31.37
= 2.29 + 10.67
= 12.96 i.e. £12,960
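The two-stage forecast (month to output, then output to power costs) can be sketched as below. The helper name `least_squares` is hypothetical. Because the code keeps the unrounded coefficients, it gives about 31.4 units and a power cost of about 13.0, slightly different from the 31.37 and 12.96 obtained above with coefficients rounded to two decimal places.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line ys = a + b * xs."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


month = [1, 2, 3, 4, 5, 6]
output = [12, 18, 19, 20, 24, 30]
power_cost = [6.2, 8.0, 8.6, 10.4, 10.2, 12.4]

# Stage 1: trend of output over time, used to forecast output in month 7.
a_out, b_out = least_squares(month, output)
output_month7 = a_out + b_out * 7

# Stage 2: power costs as a function of output, applied to the forecast output.
a_cost, b_cost = least_squares(output, power_cost)
cost_month7 = a_cost + b_cost * output_month7

print(f"Forecast output for month 7: {output_month7:.2f} units")
print(f"Forecast power cost for month 7: {cost_month7:.2f}")
```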
Coefficient of Determination
This refers to the ratio of the explained variation to the total variation and is used to measure the strength
of the linear relationship (it measures the proportion of the variation in y that is explained by the variation in x). In simple
linear regression it is equal to r², the square of the correlation coefficient. The
stronger the linear relationship, the closer the ratio will be to one.
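As an illustration, the ratio can be computed for the power cost data from the Odino Chemicals example. The sketch below uses a hypothetical function name, fits the least-squares line, and divides the explained variation (fitted values about the mean of y) by the total variation (observed values about the mean of y).

```python
def coefficient_of_determination(xs, ys):
    """Explained variation divided by total variation for the line of ys on xs."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n

    mean_y = sum_y / n
    fitted = [a + b * x for x in xs]
    explained = sum((f - mean_y) ** 2 for f in fitted)
    total = sum((y - mean_y) ** 2 for y in ys)
    return explained / total


output = [12, 18, 19, 20, 24, 30]
power_cost = [6.2, 8.0, 8.6, 10.4, 10.2, 12.4]

r_squared = coefficient_of_determination(output, power_cost)
print(f"Coefficient of determination = {r_squared:.2f}")
```

For these data the ratio is about 0.91, i.e. roughly 91% of the variation in power costs is explained by the variation in output, which agrees with squaring the correlation coefficient of about 0.96.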