Regression and Analysis
EIAR
Topics covered in this section
General overview of regression analysis
Simple Linear regression
Simple Linear correlation
Partial linear correlation
Multiple linear regression
Variable selection procedures
Forward selection
Backward elimination
Stepwise regression
Non-linear models
Logit
Probit
What is Regression analysis?
Regression analysis is a statistical technique
for investigating and modeling the
relationship between variables.
For example
way
Treatments may not be assigned randomly to the
experimental units
Order of processing experimental units through
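[Figure: flowchart of the regression model-building process, showing the steps "Tentative model selection" and "Parameter estimation", with return paths labeled "If the tentative model is not appropriate" and "If the fit is unsatisfactory".]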
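The model
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
is the simple linear regression model (SLRM). Written for the observed data,
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n,$$
it is called the sample simple linear regression model.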
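Thus the least squares criterion is
$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
The least squares estimators of $\beta_0$ and $\beta_1$, say $\hat{\beta}_0$ and $\hat{\beta}_1$, must satisfy
$$\left.\frac{\partial S}{\partial \beta_0}\right|_{\hat{\beta}_0, \hat{\beta}_1} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$
$$\left.\frac{\partial S}{\partial \beta_1}\right|_{\hat{\beta}_0, \hat{\beta}_1} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\, x_i = 0$$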
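Simplifying these two equations gives the
following normal equations
$$n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$
$$\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i$$
The solution to the above normal
equations is
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i - \dfrac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}$$
Thus the fitted linear regression model is
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$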
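The slope estimator is often written more compactly with corrected sums of squares and cross-products. The symbols $S_{xx}$ and $S_{xy}$ below are the usual textbook shorthand, and $S_{xy}$ in particular is introduced here for illustration. $S_{xx}$ is the same quantity that appears in the test statistics of this section.

```latex
\begin{align*}
S_{xx} &= \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}
        = \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
S_{xy} &= \sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}
        = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
\hat{\beta}_1 &= \frac{S_{xy}}{S_{xx}}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\end{align*}
```

The numerator and denominator of the slope formula are exactly $S_{xy}$ and $S_{xx}$, so the two forms give the same estimate.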
After obtaining the least squares fit, the
following interesting questions should
come to mind:
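$$H_0: \beta_1 = \beta_{10}$$
$$H_1: \beta_1 \neq \beta_{10}$$
Since the errors $\varepsilon_i$ are NID$(0, \sigma^2)$, the
observations $y_i$ are NID$(\beta_0 + \beta_1 x_i, \sigma^2)$. Since $\hat{\beta}_1$ is a linear combination of the observations, it is normally distributed with mean $\beta_1$ and variance $\sigma^2 / S_{xx}$, where $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$. When $\sigma^2$ is known, the statistic
$$Z_0 = \frac{\hat{\beta}_1 - \beta_{10}}{\sqrt{\sigma^2 / S_{xx}}}$$
is standard normal under $H_0$. If the absolute value of the
calculated statistic is more than the standard
normal table value for a given level of
significance, we conclude that $H_0$ is not
true and accept $H_1$.
When $\sigma^2$ is unknown, it is replaced by the residual mean square MSE, which gives
$$t_0 = \frac{\hat{\beta}_1 - \beta_{10}}{\sqrt{\mathrm{MSE} / S_{xx}}} \sim t(n-2)$$
If $|t_0| > t_{tab}(\alpha/2,\, n-2)$, then we conclude that
there is strong evidence to reject $H_0$.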
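The quantity MSE in the t statistic is the residual mean square of the fitted line. As a reminder of how it is obtained, the sketch below uses SSE as an assumed symbol for the residual sum of squares (the source does not name it) and writes the denominator of $t_0$ as a standard error.

```latex
\begin{align*}
\mathrm{SSE} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
             = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \\
\mathrm{MSE} &= \frac{\mathrm{SSE}}{n-2} \\
\mathrm{se}(\hat{\beta}_1) &= \sqrt{\frac{\mathrm{MSE}}{S_{xx}}}, \qquad
t_0 = \frac{\hat{\beta}_1 - \beta_{10}}{\mathrm{se}(\hat{\beta}_1)}
\end{align*}
```

The divisor $n-2$ reflects the two parameters estimated from the data, which is also why the reference distribution has $n-2$ degrees of freedom.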
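A similar procedure can be used to test
hypotheses about the intercept:
$$H_0: \beta_0 = \beta_{00}$$
$$H_1: \beta_0 \neq \beta_{00}$$
We would use the test statistic
$$t_0 = \frac{\hat{\beta}_0 - \beta_{00}}{\sqrt{\mathrm{MSE}\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}\right)}} \sim t(n-2)$$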
Example-1: The data shown below present the average
number of surviving bacteria in a canned food product
and the minutes of exposure to 300°F heat.
Obs bacteria exposure
1 175 1
2 108 2
3 95 3
4 82 4
5 71 5
6 50 6
7 49 7
8 31 8
9 28 9
10 17 10
11 16 11
12 11 12
A SAS procedure such as PROC REG will fit a
simple linear regression and carry out a significance
test on the effect of the regressor on the
response.
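A minimal sketch of such a run is given here, assuming the Example-1 listing is typed in as inline data. The data set name example11 and the variable names number_of_bacteria and minutes_exposure are borrowed from the PROC REG call that appears later in this section; whether the original analysis used exactly this step is an assumption.

```sas
/* Example-1 data entered inline; names follow the later PROC REG call */
data example11;
   input number_of_bacteria minutes_exposure @@;   /* @@ reads several pairs per line */
   datalines;
175 1   108 2   95 3   82 4   71 5   50 6
49 7    31 8    28 9   17 10  16 11  11 12
;
run;

/* Simple linear regression of bacteria count on exposure time */
proc reg data=example11;
   model number_of_bacteria = minutes_exposure;
run;
quit;
```

In the Parameter Estimates table that PROC REG prints, the t value and Pr > |t| on the minutes_exposure row correspond to the test of $H_0: \beta_1 = 0$.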
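The simple linear correlation coefficient is
$$r = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \, \sum_{i=1}^{n} y_i^2}}$$
where $x_i$ and $y_i$ here denote deviations of the observations from their sample means.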
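As an illustration, the correlation between the two Example-1 variables could be obtained with PROC CORR. This is a sketch under the same naming assumption as the later PROC REG call (data set example11, variables number_of_bacteria and minutes_exposure); the data step is repeated only so the block runs on its own.

```sas
/* Example-1 data entered inline (names follow the later PROC REG call) */
data example11;
   input number_of_bacteria minutes_exposure @@;
   datalines;
175 1   108 2   95 3   82 4   71 5   50 6
49 7    31 8    28 9   17 10  16 11  11 12
;
run;

/* Pearson (simple linear) correlation with its significance test */
proc corr data=example11 pearson;
   var number_of_bacteria minutes_exposure;
run;
```

For partial correlations, PROC CORR also accepts a PARTIAL statement naming the variables to be held fixed.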
Multiple linear regression
A multiple linear regression model with p
regressor variables can be written as
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon$$
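The same least squares argument used for the simple model carries over to p regressors. The matrix notation below ($\mathbf{y}$, $\mathbf{X}$, $\boldsymbol{\beta}$) is not used in the source and is introduced here only as the standard compact form, assuming $\mathbf{X}'\mathbf{X}$ is nonsingular.

```latex
\begin{align*}
\mathbf{y} &= \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},
\qquad \mathbf{X} \text{ is } n \times (p+1) \text{ with a leading column of ones} \\
S(\boldsymbol{\beta}) &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\
\mathbf{X}'\mathbf{X}\,\hat{\boldsymbol{\beta}} &= \mathbf{X}'\mathbf{y}
\qquad \text{(normal equations)} \\
\hat{\boldsymbol{\beta}} &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}
\end{align*}
```

With p = 1 these normal equations reduce to the two normal equations of the simple linear regression model.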
Example-2:
The data set below contains data from 19
livestock auction markets.
The objective is to relate the annual cost
(in thousands of dollars) of operating a
livestock market (cost) to the numbers (in
thousands) of livestock in various classes
(cattle, calves, hogs, and sheep) that were
sold in each market.
This is done with multiple regression
models as follows.
Obs mkt cattle calves hogs sheep cost volume type
1 1 3.437 5.791 3.268 10.649 27.698 23.145 o
2 2 12.801 4.558 5.751 14.375 57.634 37.485 o
3 3 6.136 6.223 15.175 2.811 47.172 30.345 o
4 4 11.685 3.212 0.639 0.694 49.295 16.230 b
5 5 5.733 3.220 0.534 2.052 24.115 11.539 b
6 6 3.021 4.348 0.839 2.356 33.612 10.564 b
7 7 1.689 0.634 0.318 2.209 9.512 4.850 o
8 8 2.339 1.895 0.610 0.605 14.755 5.449 b
9 9 1.025 0.834 0.734 2.825 10.570 5.418 o
10 10 2.936 1.419 0.331 0.231 15.394 4.917 b
11 11 5.049 4.195 1.589 1.957 27.843 12.790 b
12 12 1.693 3.602 0.837 1.582 17.717 7.714 b
13 13 1.187 2.679 0.459 18.837 20.253 23.162 o
14 14 9.730 3.951 3.780 0.524 37.465 17.985 b
15 15 14.325 4.300 10.781 36.863 101.334 66.269 o
16 16 7.737 9.043 1.394 1.524 47.427 19.698 b
17 17 7.538 4.538 2.565 5.109 35.944 19.750 b
18 18 10.211 4.994 3.081 3.681 45.945 21.967 b
19 19 8.697 3.005 1.378 3.338 46.890 16.418 b
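The PROC REG call for this example reads a data set named example2. One way the listing above could be entered is sketched here; reading type as a character variable and dropping the Obs column are assumptions about how the data were stored.

```sas
/* Example-2 livestock market data; variable names follow the listing header */
data example2;
   input mkt cattle calves hogs sheep cost volume type $;
   datalines;
1 3.437 5.791 3.268 10.649 27.698 23.145 o
2 12.801 4.558 5.751 14.375 57.634 37.485 o
3 6.136 6.223 15.175 2.811 47.172 30.345 o
4 11.685 3.212 0.639 0.694 49.295 16.230 b
5 5.733 3.220 0.534 2.052 24.115 11.539 b
6 3.021 4.348 0.839 2.356 33.612 10.564 b
7 1.689 0.634 0.318 2.209 9.512 4.850 o
8 2.339 1.895 0.610 0.605 14.755 5.449 b
9 1.025 0.834 0.734 2.825 10.570 5.418 o
10 2.936 1.419 0.331 0.231 15.394 4.917 b
11 5.049 4.195 1.589 1.957 27.843 12.790 b
12 1.693 3.602 0.837 1.582 17.717 7.714 b
13 1.187 2.679 0.459 18.837 20.253 23.162 o
14 9.730 3.951 3.780 0.524 37.465 17.985 b
15 14.325 4.300 10.781 36.863 101.334 66.269 o
16 7.737 9.043 1.394 1.524 47.427 19.698 b
17 7.538 4.538 2.565 5.109 35.944 19.750 b
18 10.211 4.994 3.081 3.681 45.945 21.967 b
19 8.697 3.005 1.378 3.338 46.890 16.418 b
;
run;

/* Quick check that all 19 markets were read */
proc print data=example2;
run;
```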
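SAS code to run multiple regression models
proc reg data=example2;
   /* r: residual analysis; vif, collin: collinearity diagnostics;
      ss1, ss2: Type I and Type II sums of squares */
   model cost = cattle calves hogs sheep / r vif collin ss1 ss2;
   /* save residuals, predicted values, and influence statistics */
   output out=out1 r=resi p=pred
          rstudent=sresi cookd=CO
          covratio=COV dffits=Dfit h=h;
run;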
SAS output for the above SAS code
The SAS System 10:01 Thursday, September 7, 2000 9
The REG Procedure
Model: MODEL1
Dependent Variable: cost
Analysis of Variance
                            Sum of         Mean
Source            DF       Squares       Square   F Value   Pr > F
Model              4    7936.73649   1984.18412     52.31   <.0001
Error             14     531.03865     37.93133
Corrected Total   18    8467.77514

Root MSE           6.15884    R-Square   0.9373
Dependent Mean    35.29342    Adj R-Sq   0.9194
Coeff Var         17.45040

Parameter Estimates
                   Parameter    Standard
Variable     DF     Estimate       Error   t Value   Pr > |t|
Intercept     1      2.28842     3.38737      0.68     0.5103
cattle        1      3.21552     0.42215      7.62     <.0001
calves        1      1.61315     0.85168      1.89     0.0791
hogs          1      0.81485     0.47074      1.73     0.1054
sheep         1      0.80258     0.18982      4.23     0.0008
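The summary statistics in this output can be reproduced from the sums of squares it reports, which is a useful check on how the pieces fit together. The arithmetic below uses only numbers printed above, with n = 19 markets and p = 4 regressors; small differences in the last digit come from rounding.

```latex
\begin{align*}
F &= \frac{\text{Model Mean Square}}{\text{Error Mean Square}}
   = \frac{1984.18412}{37.93133} \approx 52.31 \\
R^2 &= \frac{7936.73649}{8467.77514} \approx 0.9373 \\
R^2_{adj} &= 1 - \frac{531.03865 / 14}{8467.77514 / 18}
           = 1 - \frac{37.93133}{470.43195} \approx 0.9194 \\
\text{Root MSE} &= \sqrt{37.93133} \approx 6.15884 \\
t_{\text{cattle}} &= \frac{3.21552}{0.42215} \approx 7.62
\end{align*}
```

Each t value in the Parameter Estimates table is the estimate divided by its standard error, referred to a t distribution with 14 error degrees of freedom.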
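$$E(y \mid x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} \qquad (**)$$
$$p^{*} = \beta_0 + \beta_1 x$$
$$\hat{p} = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 x)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x)}$$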
Example-3: The data presented below show family income along with
Obs HHN income HOS hos1
1 1 8300 0 1
2 2 21200 1 0
3 3 9100 0 1
4 4 13400 1 0
5 5 17700 0 1
6 6 23000 0 1
7 7 11500 1 0
8 8 10800 0 1
9 9 15400 1 0
10 10 22400 1 0
11 11 18700 1 0
12 12 10100 0 1
13 13 19500 1 0
14 14 8000 0 1
15 15 12000 1 0
16 16 24000 1 0
17 17 21700 1 0
18 18 9400 0 1
19 19 10900 0 1
20 20 22800 1 0
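The PROC PROBIT call for this example reads a data set named ex3. A sketch of how the listing above could be entered follows; the variable names are taken from the listing header, and dropping the Obs column is an assumption.

```sas
/* Example-3 data; hos1 is read as listed (it equals 1 - HOS in every row) */
data ex3;
   input HHN income HOS hos1;
   datalines;
1 8300 0 1
2 21200 1 0
3 9100 0 1
4 13400 1 0
5 17700 0 1
6 23000 0 1
7 11500 1 0
8 10800 0 1
9 15400 1 0
10 22400 1 0
11 18700 1 0
12 10100 0 1
13 19500 1 0
14 8000 0 1
15 12000 1 0
16 24000 1 0
17 21700 1 0
18 9400 0 1
19 10900 0 1
20 22800 1 0
;
run;

/* Cross-tabulate the two codings to confirm they are complements */
proc freq data=ex3;
   tables HOS*hos1;
run;
```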
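proc probit data=ex3;
   title 'logit model analysis';
   class hos1;                         /* binary response declared as a class variable */
   model hos1 = income / d=logistic;   /* logistic distribution, i.e. a logit model */
   output out=status1 p=pred;          /* save the predicted probabilities */
run;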
PROC PROBIT is modeling the probabilities of levels of hos1 having LOWER Ordered Values in the response profile table.
Algorithm converged.

Type III Analysis of Effects
                     Wald
Effect     DF   Chi-Square   Pr > ChiSq
income      1       5.2981       0.0213

Analysis of Parameter Estimates
                            Standard    95% Confidence       Chi-
Parameter   DF   Estimate      Error        Limits         Square   Pr > ChiSq
Intercept    1    -3.6994     1.7121   -7.0551   -0.3438     4.67       0.0307
income       1     0.0003     0.0001    0.0000    0.0005     5.30       0.0213
Interpretation of results:
Both the intercept and income are found to be
significant in the logistic model, with p-values
of 0.0307 and 0.0213 respectively, so each is
significant at any level of significance above its
p-value (for example, 0.05). Thus the fitted logistic regression
model is
Logit (Pi) = -3.6994 + 0.0003(Income)
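To see how the fitted logit translates into a probability, the estimate can be substituted into the expression for $\hat{p}$. The income value of 15000 below is a hypothetical illustration within the range of the data, and because the income coefficient is printed to only four decimals (0.0003), the result is a rough figure rather than an exact fitted value. The probability refers to the response level that PROC PROBIT reports it is modeling.

```latex
\begin{align*}
\text{Logit}(\hat{P}) &= -3.6994 + 0.0003 \times 15000 = 0.8006 \\
\hat{P} &= \frac{\exp(0.8006)}{1 + \exp(0.8006)}
         \approx \frac{2.227}{3.227} \approx 0.69
\end{align*}
```

The predicted probabilities saved in the output data set (variable pred) are computed from the unrounded coefficients and should be preferred for any actual reporting.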
(homoscedasticity)
errors are uncorrelated
Definition of residuals
$$e_i = y_i - \hat{y}_i, \quad i = 1, 2, \ldots, n$$
where $\Phi$ denotes the standard normal
cumulative distribution.
This follows from the fact that
$$E(e_{[i]}) \approx \Phi^{-1}\left[\frac{i - \frac{1}{2}}{n}\right]$$
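The approximation above can be turned into a normal probability plot by pairing the ordered residuals with the corresponding normal quantiles. The sketch below assumes a data set out1 containing a residual variable resi, as produced by the OUTPUT statement of the PROC REG steps in this section; the names sorted, nscores, and z are hypothetical.

```sas
/* Order the residuals from smallest to largest */
proc sort data=out1 out=sorted;
   by resi;
run;

/* Attach the approximate expected normal order statistic to each residual */
data nscores;
   set sorted nobs=n;               /* n = number of observations in sorted */
   z = probit((_n_ - 0.5) / n);     /* inverse standard normal CDF of (i - 1/2)/n */
run;

/* Ordered residuals against normal quantiles; a straight line suggests normality */
proc plot data=nscores;
   plot resi*z;
run;
quit;
```

The sketch presumes no residual is missing; missing values would sort first and distort the ranks.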
The different features on normal
probability plots
Idealized normal probability plot: the scatter
points lie approximately along a straight line
(figure a).
Heavy-tailed distribution: a sharp downward and
upward curve at both extremes, indicating that the tails
of the distribution are too heavy for it to be considered
normal (figure b).
Thinner-tailed distribution: flattening at the
extremes, which is a pattern typical of samples
from a distribution with thinner tails than the
normal (figure c).
Positively and negatively skewed distributions
(figures d and e respectively).
Plot of residuals versus $\hat{y}_i$
A plot of the residuals $e_i$ (or the scaled
residuals $d_i$ or $r_i$) versus the corresponding
fitted values $\hat{y}_i$ is useful in detecting several
types of model inadequacies.
Plots of residuals versus the
predicted values
If this plot resembles figure a, which
indicates that the residuals can be
contained in a horizontal band, then there
are no obvious model defects.
Figures b, c and d are symptomatic of
model deficiencies.
The patterns in figures b and c indicate
that the variance of the errors is not
constant.
The outward-opening funnel pattern in figure b implies
that the variance is an increasing function of y (an
inward-opening funnel is also possible, indicating that the
variance is a decreasing function of y).
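A plot of this kind can be produced from the data set that PROC REG writes. The sketch below assumes out1 contains the variables sresi (studentized residual) and pred (predicted value), as named in the OUTPUT statements of this section.

```sas
/* Studentized residuals against predicted values, with a reference line at zero */
proc plot data=out1;
   plot sresi*pred / vref=0;
run;
quit;
```

A roughly horizontal band around the reference line corresponds to the pattern of figure a, while a funnel shape points to non-constant variance.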
[Figure: scatter plot of number of bacteria (vertical axis, 50 to 200) against minutes exposure (horizontal axis, 0 to 12).]
The above scatter plot suggests that the
relationship between the number of bacteria that
survive exposure to 300°F heat and the minutes of
exposure may be linear.
[Figure: normal probability plot (vertical axis from -1.5 to 7.5, normal quantiles from -2 to +2 on the horizontal axis); one point lies near 7.5, well above the others.]
[Figure: studentized residual without current observation (vertical axis, -2 to 8) against predicted value of number of bacteria (horizontal axis, -7.55 to 129.72).]
[Figure: studentized residual without current observation (vertical axis, -2 to 8) against minutes exposure (horizontal axis, 1 to 12).]
[Figure: F3 (vertical axis, 2.0 to 5.5) against minutes exposure (horizontal axis, 1 to 12).]
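data example11;
   set example11;
   F3 = log(number_of_bacteria);   /* natural log; functions are not allowed in MODEL */
run;
proc reg data=example11;
   model F3 = minutes_exposure;
   output out=out1 r=resi p=pred;
run;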
The REG Procedure
Model: MODEL1
Dependent Variable: F3 F3
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 1 7.97614 7.97614 550.33 <.0001
Error 10 0.14493 0.01449
Corrected Total 11 8.12107
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 5.33878 0.07409 72.05 <.0001
Minutes_exposure 1 -0.23617 0.01007 -23.46 <.0001
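The estimates above are on the log scale. Assuming F3 is the natural logarithm of the number of bacteria, as the ln() in the model statement suggests, the fitted line can be back-transformed to the original scale, and the slope test and $R^2$ can be checked from the printed values (small differences come from rounding).

```latex
\begin{align*}
\widehat{\ln y} &= 5.33878 - 0.23617\,x \\
\hat{y} &= e^{5.33878}\, e^{-0.23617\,x} \approx 208 \times (0.79)^{x} \\
t_0 &= \frac{-0.23617}{0.01007} \approx -23.45 \\
R^2 &= \frac{7.97614}{8.12107} \approx 0.982
\end{align*}
```

On this reading, each additional minute of exposure multiplies the expected number of surviving bacteria by about 0.79, a reduction of roughly 21 percent.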
Tests for Location: Mu0=0
Test -Statistic- -----p Value------
Student's t t -0.06861 Pr > |t| 0.9465
Sign M 0 Pr >= |M| 1.0000
Signed Rank S 0 Pr >= |S| 1.0000
Tests for Normality
Test --Statistic--- -----p Value------
Shapiro-Wilk W 0.989946 Pr < W 0.9997
Kolmogorov-Smirnov D 0.129374 Pr > D >0.1500
Cramer-von Mises W-Sq 0.018432 Pr > W-Sq >0.2500
Anderson-Darling A-Sq 0.124732 Pr > A-Sq >0.2500
[Figure: Normal Probability Plot (vertical axis from -2.25 to 2.25, normal quantiles from -2 to +2 on the horizontal axis); the points follow the reference line closely.]
[Figure: studentized residual without current observation (vertical axis, -2.0 to 2.5) against predicted value (horizontal axis, 2.5047 to 5.1026).]
[Figure: studentized residual without current observation (vertical axis, -2.0 to 2.5) against a horizontal axis running from 1 to 12.]