Regression: Simple Linear Regression Model
Regression
The term regression was introduced by the English biometrician Sir Francis Galton (1822–1911). Galton described a phenomenon he observed while analysing the heights of children and their
parents: tall parents tend to have tall children and short parents tend to have short children, but the
average height of the children tends to step back, or regress, toward the average height of all men.
This tendency toward the average height of all men was called regression by Galton.
OR.
The interdependency between the dependent variable and one or more independent variables is
called regression.
Regression provides an equation to be used for estimating or predicting the average value
of the dependent variable from the known values of the independent variable.
We assume that the linear relationship between the dependent variable yi and the independent
variable Xi is
yi = α + βXi + εi (regression model)
Where
yi = dependent variable,
Xi = independent variable,
α & β = parameters,
εi = residual / error term.
Furthermore:
i. E(εi) = 0
ii. Var(εi) = E(εi²) = σ², for all i
iii. E(εi εj) = 0, for all i ≠ j
iv. E(Xi, εj) = 0; X and ε are also independent of each other
v. εi is normally distributed with a mean of zero and a constant variance σ²
The least squares estimates a and b of the parameters are unbiased, i.e. E(a) = α and E(b) = β.
Example:
Compute the least squares regression equation of Y on X for the following data. What is the
regression co-efficient and what does it mean?
X:  5   6   8  10  12  13  15  16  17
Y: 16  19  23  28  36  41  44  45  50
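
A minimal Python sketch of the required computation, assuming the usual least squares formulas b = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²) and a = Ȳ − bX̄:

    # Least squares fit of Y on X for the data above
    X = [5, 6, 8, 10, 12, 13, 15, 16, 17]
    Y = [16, 19, 23, 28, 36, 41, 44, 45, 50]

    n = len(X)
    sum_x, sum_y = sum(X), sum(Y)
    sum_xy = sum(x * y for x, y in zip(X, Y))
    sum_x2 = sum(x * x for x in X)

    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # regression coefficient
    a = sum_y / n - b * sum_x / n                                  # intercept

    print(f"Y_hat = {a:.2f} + {b:.2f} X")  # roughly Y_hat = 1.47 + 2.83 X

The regression coefficient b ≈ 2.83 means that, on average, Y increases by about 2.83 units for every unit increase in X.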
Example 10.2: In an experiment to measure the stiffness of a spring, the length of the spring under
different loads was measured as follows:
J = Load (lb):    3   5   6   9  10  12  15  20  22  28
F = Length (in): 10  12  15  18  20  22  27  30  32  34
Fitting the line Ĵ = a + bF by least squares gives b ≈ 0.94 and a ≈ −7.68.
Hence Ĵ = 0.94F − 7.68 is the estimated regression equation appropriate for predicting the load (J),
given the length (F).
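
As a quick check on the arithmetic, a numpy sketch of the same fit (regressing the load J on the length F):

    import numpy as np

    F = np.array([10, 12, 15, 18, 20, 22, 27, 30, 32, 34])  # length (in)
    J = np.array([3, 5, 6, 9, 10, 12, 15, 20, 22, 28])      # load (lb)

    # np.polyfit with degree 1 returns the slope and intercept of the least squares line
    b, a = np.polyfit(F, J, 1)
    print(f"J_hat = {b:.2f} F {a:+.2f}")  # about J_hat = 0.94 F - 7.67 (-7.68 if b is rounded to 0.94 first)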
The population correlation co-efficient for a bivariate distribution, denoted by ρ, has already
been defined as
ρ = Cov(X, Y) / (σX σY)
Scatter Diagrams:
Scatter diagrams (or scatter graphs) provide a useful means of deciding whether or not
there is association between variables. A scatter diagram is constructed by drawing a graph so
that the scale for one variable (the independent variable, if this can be determined) lies along the
horizontal axis and the other variable (the dependent variable) lies along the vertical axis.
Each pair of values is then plotted as a single point on the graph.
Figure (a): indicates positive correlation, so that as the variable X increases, Y also increases.
Figure (b): indicates negative correlation, so that as the variable X increases, Y decreases.
Figure (c): indicates perfect positive correlation between the two variables, so that they both
increase in the same proportion.
Figure (d): indicates perfect negative correlation between the two variables, so that as one
increases the other decreases in the same proportion.
Figure (e): indicates that there is no correlation between the two variables; there are a
number of lines of best fit which could be drawn with equal validity.
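
A minimal matplotlib sketch of how such a scatter diagram is drawn, using the X, Y data from the regression example above:

    import matplotlib.pyplot as plt

    X = [5, 6, 8, 10, 12, 13, 15, 16, 17]     # independent variable -> horizontal axis
    Y = [16, 19, 23, 28, 36, 41, 44, 45, 50]  # dependent variable -> vertical axis

    plt.scatter(X, Y)  # each (x, y) pair is plotted as a single point
    plt.xlabel("X (independent variable)")
    plt.ylabel("Y (dependent variable)")
    plt.title("Scatter diagram")
    plt.show()

For these data the points rise from left to right, the pattern described in figure (a).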
Association gives the relationship between attributes, while correlation gives the
relationship between variables.
Regression vs. Correlation:
Regression and correlation have a fundamental difference that is worth mentioning.
In regression analysis there is an asymmetry in the way the dependent and explanatory variables
are treated: the dependent variable is assumed to be statistical, random or stochastic, that is, to
have a probability distribution, while the explanatory variables are assumed to have fixed values.
In correlation analysis, on the other hand, we treat any (two) variables symmetrically, i.e. there is
no distinction between the dependent and explanatory variables.
The following are properties of the correlation coefficient r:
i. The value of “r” does not depend on the unit of measurement for either variable.
ii. The value of “r” is symmetrical with respect to “x” and “y” i.e
rxy = ryx
iii. The value of “r” is between -1 and +1 i.e
-1 ≤ r ≤ +1
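
These properties can be checked numerically; a small sketch using numpy's corrcoef (which computes the Pearson r):

    import numpy as np

    x = np.array([5, 6, 8, 10, 12, 13, 15, 16, 17], dtype=float)
    y = np.array([16, 19, 23, 28, 36, 41, 44, 45, 50], dtype=float)

    r_xy = np.corrcoef(x, y)[0, 1]
    r_yx = np.corrcoef(y, x)[0, 1]          # property ii: symmetric in x and y
    r_cm = np.corrcoef(2.54 * x, y)[0, 1]   # property i: unchanged by a change of units

    print(round(r_xy, 4), round(r_yx, 4), round(r_cm, 4))  # all three are equal and lie within [-1, +1]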
Simple Correlation:
It is defined as the degree of relationship existing between two variables, for example the
relationship between smoking and lung cancer, or between scores on statistics and mathematics
examinations. Its limits are
-1 ≤ r ≤ +1
Curvilinear Correlation:
Correlation is curvilinear when all the points (x, y) on a scatter diagram seem to lie
near a curve.
Linear Correlation:
Correlation may be linear, when all points (x,y) on a scatter diagram seem to cluster near
a straight line.
Multiple Correlation:
It is the degree of relationship between one variable and a group of two or more other variables
taken together. Its coefficient is denoted by Rx.yz etc. and its limits are
0 ≤ Rx.yz ≤ 1
Or
The association or interdependence between a variable and a group of other variables is called
multiple correlation.
Partial Correlation:
It is the interdependence between two variables after removing the effect of the other variables.
Its coefficient is denoted by rxy.z etc. and its limits are
-1 ≤ rxy.z ≤ +1
Or
Partial correlation measures the degree of association between two variables after eliminating the
effect of a set of other controlling variables.
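
A small sketch of the usual first-order formula for rxy.z in terms of the simple correlations (the numerical values below are hypothetical, for illustration only):

    from math import sqrt

    def partial_corr(r_xy, r_xz, r_yz):
        """First-order partial correlation of X and Y, controlling for Z."""
        return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

    # hypothetical simple correlations
    print(round(partial_corr(0.8, 0.5, 0.6), 3))  # 0.722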
Positive Correlation:
When the movements of the variables are in the same direction, the correlation is called positive,
e.g. when price increases the quantity supplied increases, and when price decreases the quantity
supplied also decreases.
OR
When both variables move in the same direction, the correlation is positive.
Negative Correlation:
When the movements of the variables are in opposite directions, the correlation is called negative,
e.g. when price increases the demand for a commodity decreases, and when price falls demand
increases.
Zero Correlation:
When two variables are independent there is no correlation between them, e.g. there is no
correlation between the weights of students and the colour of their hair.
Co-efficient of Correlation:
The sample correlation coefficient is defined as
r = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² Σ(Y − Ȳ)²]
"r" thus defined is a measure of linear association between two variables and its limits are
-1 ≤ r ≤ +1, with -1 and +1 indicating perfect negative and positive association respectively.
Co-efficient of Association:
Yule's co-efficient of association for two attributes A and B is
Q = [(AB)(αβ) − (Aβ)(αB)] / [(AB)(αβ) + (Aβ)(αB)]
and its limits are -1 ≤ Q ≤ +1.
Association:
If the two attributes A and B are not independent then they are said to be associated, i.e.
(AB) ≠ (A)(B)/N.
Attribute:
A characteristic which varies only in quality from individual to individual and cannot be
measured in quantity is called an attribute. Examples of attributes are:
the marital status of a man, the colour of a car, education level, richness, etc.
Attributes cannot be expressed numerically; only their presence or absence can be
described.
Co-efficient of determination:
The coefficient of determination is defined as
R² = 1 − Σ(Y − Ŷ)² / Σ(Y − Ȳ)²
R² is the most commonly used measure of the goodness of fit of a regression line. It is a
nonnegative quantity and its limits are
0 ≤ R² ≤ 1
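
A short sketch computing R² for the line fitted in the earlier example (Ŷ ≈ 1.47 + 2.83X), using the definition above:

    X = [5, 6, 8, 10, 12, 13, 15, 16, 17]
    Y = [16, 19, 23, 28, 36, 41, 44, 45, 50]

    n = len(X)
    x_bar, y_bar = sum(X) / n, sum(Y) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
    a = y_bar - b * x_bar

    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))  # residual sum of squares
    sst = sum((y - y_bar) ** 2 for y in Y)                   # total sum of squares
    print(round(1 - sse / sst, 4))  # about 0.99: the line fits these data very well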
Dichotomy:
The process of dividing the objects into two mutually exclusive classes is called dichotomy,
e.g. the division of a population according to sex into the two classes males and females.
The capital Latin letters A, B, C, … are usually used to denote the attributes, and are assigned
to the individuals possessing the attributes A, B, C, …, while the Greek letters α, β, γ, … are
assigned to the individuals not possessing the attributes A, B, C, …. The attributes denoted by
A, B, C, … are called positive attributes, while the attributes denoted by α, β, γ, … are called
negative attributes.
Consistence:
The class frequencies observed in the same population are said to be consistent if they
conform with one another. For consistent data, no class frequency can be negative. If any
class frequency is negative then the data are inconsistent.
Independence:
Independence means that there is no relationship between the attributes A and B. Two
attributes A and B are said to be independent if
(AB) = (A)(B)/N
where (A) and (B) are the frequencies of A and B and N is the total number of observations.
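
A tiny numerical sketch (with made-up class frequencies) of this independence condition for a 2×2 classification:

    # hypothetical class frequencies for attributes A and B
    AB, Ab = 30, 20   # (AB): both A and B;  (A-beta): A but not B
    aB, ab = 45, 30   # (alpha-B): B but not A;  (alpha-beta): neither

    N = AB + Ab + aB + ab   # total observations
    A = AB + Ab             # frequency of A
    B = AB + aB             # frequency of B

    print(AB, A * B / N)    # 30 and 30.0: (AB) = (A)(B)/N, so A and B are independent here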
Contingency Table:
When each of the two attributes is divided into more than two classes, the two-way table showing
the observed frequencies for every combination of the classes is called a contingency table.
Rank Correlation:
The correlation between two sets of rankings for the variables X and Y is called rank
correlation. Sometimes accurate measurement is not possible, and the objects are instead arranged
in order according to some characteristic of interest; the order given to an individual is called its
rank.
Suppose that we have n pairs of observations from a bivariate population as (a1, b1), (a2, b2),
…, (an, bn). Let the values of the set ai be ranked as X1, X2, …, Xn and the values of the set bi be
ranked as Y1, Y2, …, Yn. We assume that the same rank is not given to two or more objects
(no ties). Then the coefficient of rank correlation is
rs = 1 − 6Σd² / [n(n² − 1)]
where d = X − Y, the difference between the two ranks of each pair.
Or
Sometimes it is possible to arrange the various items of a series in serial order with respect to
some characteristic even though numerical measurement of that characteristic is difficult, e.g. a
teacher can arrange the students in his class in ascending or descending order of intelligence, or a
sales manager can arrange a group of salesmen in ascending or descending order of efficiency,
whereas quantitative measurement of these characteristics (intelligence and efficiency) is not
possible directly. In such cases the product moment method is inappropriate, because the data do
not come from a measuring device, and we use the rank correlation procedure to handle such
problems.
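
A minimal sketch of the rank correlation formula above, applied to a hypothetical pair of rankings with no ties:

    # Spearman's coefficient of rank correlation: r_s = 1 - 6*sum(d^2) / (n*(n^2 - 1))
    x_ranks = [1, 2, 3, 4, 5, 6, 7, 8]  # ranks given by judge 1 (hypothetical)
    y_ranks = [2, 1, 4, 3, 6, 5, 8, 7]  # ranks given by judge 2 (hypothetical)

    n = len(x_ranks)
    d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    print(1 - 6 * d2 / (n * (n ** 2 - 1)))  # about 0.905: strong agreement between the rankings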
Permutations:
A permutation is an arrangement of all or part of a set of objects in a definite order, e.g. the
possible arrangements of the three letters A, B, C are
ABC, ACB, BAC, BCA, CAB, CBA. These are the different permutations.
Fundamental Principles:
a. If one operation can be performed in m ways and a second operation can be performed in n
ways, then the number of ways of performing the two operations together is m × n.
b. The number of permutations of n distinct objects is n!.
c. The number of permutations of n distinct objects arranged in a circle is (n − 1)!.
d. The number of permutations of n objects of which n1 are of one kind, n2 are of another kind,
…, nk are of a kth kind is
n! / (n1! n2! ⋯ nk!)
where n1 + n2 + ⋯ + nk = n.
Example:
How many two-digit numbers with different digits can be formed from the figures 1, 2, 3, 4, 5, 6?
Solution:
For the first digit there are 6 choices, because any figure can be chosen. For the second digit there
are only 5 remaining choices, so the total number of ways = 6 × 5 = 30.
2nd method: 6P2 = 6! / (6 − 2)! = 720 / 24 = 30.
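
The count can also be confirmed by direct enumeration; a quick Python sketch:

    from itertools import permutations
    from math import perm

    figures = [1, 2, 3, 4, 5, 6]
    two_digit = list(permutations(figures, 2))  # ordered pairs of distinct figures

    print(len(two_digit))  # 30, agreeing with 6 x 5
    print(perm(6, 2))      # 30 again, via 6P2 = 6!/(6-2)!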
Example: There are 10 baby taxis running between Jhelum and Mangla. In how many ways can a
man go from Jhelum to Mangla and return by a different baby taxi?
Solution:
There are 10 ways of making the first passage; since the return must be by a different taxi, there
are 9 choices for the return, so the total number of ways = 10 × 9 = 90.
2nd method: 10P2 = 10! / (10 − 2)! = 90.
Combination:
A group or selection made by taking all or part of a set of objects, without regard to the order of
selection, is called a combination, e.g. the combinations of the three letters A, B, C taken two at a
time are AB, AC, BC.
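
A short sketch contrasting combinations with permutations for the three letters A, B, C taken two at a time:

    from itertools import combinations, permutations
    from math import comb, perm

    letters = "ABC"
    print(list(combinations(letters, 2)))  # [('A','B'), ('A','C'), ('B','C')] -- order is ignored
    print(list(permutations(letters, 2)))  # 6 ordered arrangements -- order matters
    print(comb(3, 2), perm(3, 2))          # 3 combinations, 6 permutations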