Regression and Analysis

This document discusses various topics related to regression analysis including: simple linear regression, multiple linear regression, variable selection procedures, and non-linear models such as logit and probit. It provides an overview of regression analysis and the assumptions and processes involved in fitting linear regression models to data, including estimating parameters using least squares and testing hypotheses about slope and intercept values.

Minilik Tsega

EIAR
Topics to be dealt in this section
 General overview of regression analysis
 Simple Linear regression
 Simple Linear correlation
 Partial linear correlation
 Multiple linear regression
 Variable selection procedures
 Forward selection
 Backward elimination
 Stepwise regression
Non-linear models
 Logit
 Probit
What is Regression analysis?
 Regression analysis is a statistical technique
for investigating and modeling the
relationship between variables.
 For example,

Y = \beta_0 + \beta_1 X + \varepsilon

is a statistical model which relates the
variable Y to X, called the simple linear
regression model;
where
 the relationship between Y and X is linear,
 β0 and β1 are parameters of the model: β0
is the value of Y when X = 0 and β1
is the rate of change in Y for a unit change in
X,
 X and Y are called the independent and dependent
variables respectively; that is, the value
of Y depends on the value of X,
 ε is called the statistical error, a random
variable that accounts for the failure of the model to fit
the data exactly. ε may be the cumulative effect of
the following factors:
 Effect of other variables not in the model
 Errors in measurement
 Errors in the measuring instruments
 Inherent variation among experimental units
 Failure to conduct the experiment in a uniform
way
 Treatments may not have been assigned randomly to the
experimental units
 The order of processing experimental units throughout
the course of the study may not have been randomized;
where order influences the results, this may produce
systematic error
 etc.
 An important objective of the regression analysis
is to estimate the unknown model parameters in
the regression model. This process is called
fitting the model to the data.

 The next step of regression analysis after model
fitting is checking model adequacy, in which the
appropriateness of the model is studied and the quality
of the fit is ascertained.
 Through such analysis the usefulness of the
regression model may be determined.
 The outcome of adequacy checking may indicate
either that the model is reasonable or that the
original fit must be modified.
 Thus regression analysis is an iterative
procedure, in which data lead to a model and a fit
of the model to the data is produced.
 The quality of the fit is then investigated,
leading either to modification of the model or
the fit, or to adoption of the model.
 A regression equation is only an approximation to
the relationship between variables.
 Generally, regression equations are valid only
over the range of regressor variables contained
in the observed data.
 Regression analysis may not be the primary
objective of the study; it is usually more
important to gain insight and understanding
concerning the system generating the data.
Data collection
 Data collection is an essential aspect of
regression analysis, since conclusions from
regression analysis are conditioned on the data.
 Data used in regression analysis should be
representative of the system studied and correctly
measured and recorded.
 Preliminary data editing, coding, verification and
cleaning must be done before the regression
analysis.
Main uses of regression
 Data description: using regression equations to
summarize and describe a set of data

 Parameter estimation

 Prediction and estimation: predict or estimate


future values of dependent variables in the
regression equation.

 Control of the system which generates the data


The iterative model-building process:
Data → Tentative model selection → Fitting the tentatively
selected model to the data → Model adequacy check →
Adopting the tentatively selected model as a final model.
If the tentative model is not appropriate, return to model
selection; if the fit is unsatisfactory, return to model fitting.
Simple linear regression (SLRM)
 A linear regression with only one independent
variable is called simple linear regression.

 Y = B0 + B1X + ε is the SLRM

 Where B0 is the intercept of the line and B1 is the


slope, i.e. the amount of change in Y for a unit
change in X

 B0 and B1 are unknown model parameters to be


estimated using OLS
Least square estimation of
parameters
 The parameters B0 and B1 are unknown and must
be estimated using sample data.

 Suppose that we have n pairs of data, say, (y1, x1),
(y2, x2), …, (yn, xn).
 These data may result from a controlled
experiment designed specifically to collect the
data, from surveys, or from existing historical data.
 The method of least squares is used to
estimate B0 and B1.
 That is, we will estimate B0 and B1 so that
the sum of the squared differences between
the observations yi and the straight line is
a minimum.

y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n

is called the sample simple linear regression model.
 Thus the least squares criterion is

S(\beta_0, \beta_1) = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2

 The least squares estimators of B0 and B1,
say \hat{\beta}_0 and \hat{\beta}_1, must satisfy

\left.\frac{\partial S}{\partial \beta_0}\right|_{\hat{\beta}_0, \hat{\beta}_1} = -2\sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0

\left.\frac{\partial S}{\partial \beta_1}\right|_{\hat{\beta}_0, \hat{\beta}_1} = -2\sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) x_i = 0
 Simplifying these two equations gives the
following normal equations:

n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i

\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i

 The solutions to these normal equations are

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i - \dfrac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}
 Thus the fitted linear regression model is

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x
After obtaining the least squares fit, the
following interesting questions should
come to mind:
 How well does this equation fit the data?
 Is the model likely to be useful as a predictor?
 Are any of the basic assumptions violated
(such as constant variance and uncorrelated
errors) and, if so, how serious is it?
 All these points should be investigated before
the model is adopted for final use
Hypothesis testing in slope and
intercept
Testing hypotheses and constructing confidence intervals
require the model errors to be normally and
independently distributed with mean zero and
variance σ², abbreviated as NID(0, σ²).

 Suppose we wish to test the hypothesis
that the slope equals a constant, say β10. The
appropriate hypotheses are formulated as follows:

H_0: \beta_1 = \beta_{10}
H_1: \beta_1 \neq \beta_{10}
 Since the errors \varepsilon_i are NID(0, \sigma^2), the
observations y_i are NID(\beta_0 + \beta_1 x_i, \sigma^2). Since \hat{\beta}_1
is a linear combination of the observations,
it is normally distributed.
Therefore the statistic

Z_0 = \frac{\hat{\beta}_1 - \beta_{10}}{\sqrt{\sigma^2 / S_{xx}}} \sim N(0, 1) \quad \text{if the null hypothesis } H_0: \beta_1 = \beta_{10} \text{ is true}
 If the calculated value exceeds the standard
normal table value for a given level of
significance, we conclude that H0 is not
true and accept H1.
 Since most of the time the value of σ² is
not known, we estimate it by the residual
mean square (MSE), which is its unbiased
estimator, and use a t-test.
 If H_0: \beta_1 = \beta_{10} is true, then

t_0 = \frac{\hat{\beta}_1 - \beta_{10}}{\sqrt{MSE / S_{xx}}} \sim t(n-2)
 If |t0| > t(α/2, n-2), then we conclude that
there is strong evidence to reject H0.
 A similar procedure can be used to test
hypotheses about the intercept:

H_0: \beta_0 = \beta_{00}
H_1: \beta_0 \neq \beta_{00}

 We would use the test statistic

t_0 = \frac{\hat{\beta}_0 - \beta_{00}}{\sqrt{MSE\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}\right)}} \sim t(n-2)
 Example 1: The data shown below present the average
number of surviving bacteria in a canned food product
exposed to 300°F heat and the minutes of exposure.
Obs bacteria exposure
 1 175 1
 2 108 2
 3 95 3
 4 82 4
 5 71 5
 6 50 6
 7 49 7
 8 31 8
 9 28 9
 10 17 10
 11 16 11
 12 11 12
 The following SAS procedure will fit a
simple linear regression and test the
significance of the effect of the regressor on the
response.

 Proc reg data=example1;


 Model bacteria=exposure;
 Run;
 The REG Procedure
 Model: MODEL1
 Dependent Variable: Number of bacteria Analysis of Variance
 Sum of Mean
 Source DF Squares Square F Value Pr > F
 Model 1 22269 22269 66.51 <.0001
 Error 10 3348.10373 334.81037
 Corrected Total 11 25617

 Root MSE 18.29782 R-Square 0.8693


 Dependent Mean 61.08333 Adj R-Sq 0.8562
 Coeff Var 29.95551
 Parameter Standard
 Variable Label DF Estimate Error t Value Pr > |t|
 Intercept Intercept 1 142.19697 11.26153 12.63 <.0001
 exposure exposure 1 -12.47902 1.53014 -8.16 <.0001

 The model, the intercept and the regressor are
significant (p < 0.0001), and thus the fitted
model can be written as

Number_of_bacteria = 142.19697 − 12.47902 (Minutes of exposure)

 Don’t forget that the adequacy of the model


should be assessed before adopting this model
for use.
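As a quick arithmetic check on the output above, the t value printed for
exposure is simply the estimate divided by its standard error,

t_0 = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)} = \frac{-12.47902}{1.53014} \approx -8.16,

which agrees with the reported value and is compared against t(\alpha/2, 10),
since n - 2 = 10 for the 12 observations.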
 Extrapolation is making a prediction
outside the range of values of the
predictor in the sample used to generate
the model.
 The more removed the prediction is from
the range of values used to fit the model,
the riskier the prediction becomes because
there is no way to check that the
relationship continues to be linear.
Simple Linear Correlation
Analysis
 Simple linear correlation analysis deals with the
estimation and test of the significance of the
simple linear correlation coefficient r, which is
the measure of the degree of linear association
between two variables.

 The value of r is between -1 and +1 with


extreme values indicating perfect linear
association, and mid-value of zero indicating no
linear association; there may be other types of
relationships
 The negative and positive signs indicate
the type of association: a positive sign means
that the two variables increase or decrease
together, while a negative sign means that as
one variable increases the other decreases.

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
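A minimal SAS sketch for estimating and testing r, using the bacteria data of
Example 1 above (PROC CORR prints the Pearson correlation together with the
p-value for the test of zero correlation):

proc corr data=example1;
  var bacteria exposure;   * simple linear correlation between the two variables;
run;

For partial linear correlation, a PARTIAL statement listing the variables to
adjust for can be added to the same procedure.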
Multiple linear regression
 A multiple linear regression model with p
regressor variables can be written as

Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon
Example-2:
 The data set below contains data from 19
livestock auction market.
 The objective is to relate the annual cost
(in thousands of dollars) of operating a
livestock market (cost) to the numbers (in
thousands) of livestock in various classes
(cattle, calves, hogs, and sheep) that were
sold in each market.
 This is done with multiple regression
models as follows.
 Obs mkt cattle calves hogs sheep cost volume type
 1 1 3.437 5.791 3.268 10.649 27.698 23.145 o
 2 2 12.801 4.558 5.751 14.375 57.634 37.485 o
 3 3 6.136 6.223 15.175 2.811 47.172 30.345 o
 4 4 11.685 3.212 0.639 0.694 49.295 16.230 b
 5 5 5.733 3.220 0.534 2.052 24.115 11.539 b
 6 6 3.021 4.348 0.839 2.356 33.612 10.564 b
 7 7 1.689 0.634 0.318 2.209 9.512 4.850 o
 8 8 2.339 1.895 0.610 0.605 14.755 5.449 b
 9 9 1.025 0.834 0.734 2.825 10.570 5.418 o
 10 10 2.936 1.419 0.331 0.231 15.394 4.917 b
 11 11 5.049 4.195 1.589 1.957 27.843 12.790 b
 12 12 1.693 3.602 0.837 1.582 17.717 7.714 b
 13 13 1.187 2.679 0.459 18.837 20.253 23.162 o
 14 14 9.730 3.951 3.780 0.524 37.465 17.985 b
 15 15 14.325 4.300 10.781 36.863 101.334 66.269 o
 16 16 7.737 9.043 1.394 1.524 47.427 19.698 b
 17 17 7.538 4.538 2.565 5.109 35.944 19.750 b
 18 18 10.211 4.994 3.081 3.681 45.945 21.967 b
 19 19 8.697 3.005 1.378 3.338 46.890 16.418 b
SAS code to run multiple
regression models
 proc reg data=example2;
 model cost=cattle calves hogs sheep /r vif
collin ss1 ss2;
 output out=out1 r=resi p=pred
rstudent=sresi COOKD=CO
COVRATIO=COV DFFITS=Dfit h=h;
 run;
 SAS output for the above SAS code

 The REG Procedure
 Model: MODEL1
 Dependent Variable: cost
 Analysis of Variance
                                 Sum of          Mean
 Source              DF         Squares        Square    F Value    Pr > F
 Model                4      7936.73649    1984.18412      52.31    <.0001
 Error               14       531.03865      37.93133
 Corrected Total     18      8467.77514

 Root MSE             6.15884    R-Square    0.9373
 Dependent Mean      35.29342    Adj R-Sq    0.9194
 Coeff Var           17.45040

 Parameter Estimates
                      Parameter      Standard
 Variable     DF       Estimate         Error    t Value    Pr > |t|
 Intercept     1        2.28842       3.38737       0.68      0.5103
 cattle        1        3.21552       0.42215       7.62      <.0001
 calves        1        1.61315       0.85168       1.89      0.0791
 hogs          1        0.81485       0.47074       1.73      0.1054
 sheep         1        0.80258       0.18982       4.23      0.0008
 proc glm data=example2;
 model cost=cattle calves hogs sheep / ss1 ss2;
 run;
 GLM procedure output
 Source DF Type I SS Mean Square F Value Pr > F
 cattle 1 6582.091806 6582.091806 173.53 <.0001
 calves 1 186.671101 186.671101 4.92 0.0436
 hogs 1 489.863790 489.863790 12.91 0.0029
 sheep 1 678.109792 678.109792 17.88 0.0008
 Source DF Type II SS Mean Square F Value Pr > F
 cattle 1 2200.712494 2200.712494 58.02 <.0001
 calves 1 136.081196 136.081196 3.59 0.0791
 hogs 1 113.656260 113.656260 3.00 0.1054
 sheep 1 678.109792 678.109792 17.88 0.0008
 Standard
 Parameter Estimate Error t Value Pr > |t|
 Intercept 2.288424577 3.38737222 0.68 0.5103
 Cattle 3.215524803 0.42215239 7.62 <.0001
 Calves 1.613147614 0.85167539 1.89 0.0791
 Hogs 0.814849491 0.47073855 1.73 0.1054
 Sheep 0.802578622 0.18981766 4.23 0.0008
 Interpretation: the F test employing Type I sums
of squares shows that the sequential inclusion of each
variable has brought a significant
effect on the dependent variable.
 But the F test using Type II sums of squares has
shown that inclusion of calves over the model
which contains the rest has not brought a
significant contribution to the model; hogs
performs even worse than calves.
 Now we can conclude that the contribution of hogs
and calves to the cost of operating a livestock
market is insignificant.
Regression Model Building
Properties of a well-constructed model
 The variables in the model are influential
 The functional form of the model is correct
 The underlying assumptions are not violated
Selection of regressors to be used
in the model
 Theoretical considerations
 Prior experience
 Have a pool of candidate regressors
(Should include all influential factor) and
then selecting actual subset of regressors
to be in the model
 Variable selection problem: Finding an
appropriate subset of regressors for the
model
Variable selection procedures
 Forward selection
 Backward elimination
 Stepwise regression
Forward selection
 Start with the assumption there are no regressors other
than the intercept
 An effort will be to find an optimal subset by inserting
regressors in to the model one at a time
 The first regressor selected for entry in to the model is
the one that has the largest simple correlation with the
response variable.
 This regressor will be in the model if the F-statistics
exceeds a pre-selected F values, say F-to-enter
Forward (continued)
 the second regressor chosen for entry is the one
that now has the largest correlation with the
response after adjusting for the effect of the first
regressor which is already in the model

 This procedure terminates either when the


partial F-statistics at a particular step does not
exceed F-to-enter or when the last candidate
regressor is added to the model
Backward Elimination
 It begins with all the k candidate regressors already in
the model
 The partial F-statistic is computed for each regressor as
if it were the last variable to enter the model
 The smallest of the partial F-statistics is compared
with the pre-selected F-to-out value, and if it is smaller the
associated regressor is dropped from the model.
This step continues with the remaining k-1 regressors
 The backward elimination algorithm terminates when
the smallest partial F value is not less than the pre-
selected F-to-out value.
Stepwise regression
 Stepwise regression is a modification of forward
selection in which, at each step, all the regressors
previously entered into the model are reassessed
via their partial F-statistics.
 A regressor added at an earlier step may now be
redundant because of the relationship between
it and the regressors now in the equation

 If the partial F-statistics for a variable is less


than F-to-out, the variable will be dropped from
the model
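A minimal SAS sketch of the three procedures applied to the livestock data of
Example 2 (the SLENTRY and SLSTAY significance levels are illustrative
choices, not values taken from this material):

proc reg data=example2;
  forward:  model cost = cattle calves hogs sheep / selection=forward  slentry=0.10;
  backward: model cost = cattle calves hogs sheep / selection=backward slstay=0.10;
  stepwise: model cost = cattle calves hogs sheep / selection=stepwise slentry=0.10 slstay=0.10;
run;

Each MODEL statement prints the variables entered or removed at every step
together with the corresponding partial F statistics.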
Non-linear models
 In some problems with an indicator
response variable, the relationship
between y and x is nonlinear.
 Very frequently we find that the
response function is S-shaped.
 One method involves modeling the S
shaped response function with the
normal cumulative distribution function.
This approach is called probit analysis.
 A second method of analysis is to model
the response using logistic function.
Fitting a logistic function is usually called
logit analysis

E(y|x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} \qquad (**)

 The logistic function above has the
characteristic S shape. It has
asymptotes at 0 and 1, guaranteeing that the
estimated response function lies between
0 and 1.
 Both probit and logit analysis models arise
from consideration of threshold values.
 For example, in determining the
contribution of fertilizer usage to the
productivity of a given plot of land, holding
the other factors constant, the plot
has a threshold fertilizer amount s such that
an effect is produced only when the fertilizer
amount applied is greater than or equal to s.
 Note that P(s ≤ x) is the cumulative distribution
of the threshold fertilizer amount over a
population of plots.
 Thus if this cumulative distribution is
normal probit analysis should be used,
while if the cumulative distribution is
logistic, then the logit analysis approach is
appropriate.
p^{*} = \ln\!\left[\frac{E(y|x)}{1 - E(y|x)}\right] = \ln\!\left[\frac{p}{1 - p}\right]

p^{*} = \beta_0 + \beta_1 x

\hat{p} = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 x)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x)}
 Example 3: the data presented below show family income along with house-ownership status (HOS)
 Obs HHN income HOS hos1
 1 1 8300 0 1
 2 2 21200 1 0
 3 3 9100 0 1
 4 4 13400 1 0
 5 5 17700 0 1
 6 6 23000 0 1
 7 7 11500 1 0
 8 8 10800 0 1
 9 9 15400 1 0
 10 10 22400 1 0
 11 11 18700 1 0
 12 12 10100 0 1
 13 13 19500 1 0
 14 14 8000 0 1
 15 15 12000 1 0
 16 16 24000 1 0
 17 17 21700 1 0
 18 18 9400 0 1
 19 19 10900 0 1
 20 20 22800 1 0
 proc probit data=ex3;
 title 'logit model analysis';
 class hos1;
 model hos1=income/d=logistic;
 output out=status1 p=pred ;
 run;
 PROC PROBIT is modeling the probabilities of levels of hos1 having LOWER Ordered
Values in the
 response profile table.
 Algorithm converged.
 Type III Analysis of Effects
 Wald
 Effect DF Chi-Square Pr > ChiSq
 income 1 5.2981 0.0213
 Analysis of Parameter Estimates
 Standard 95% Confidence Chi-
 Parameter DF Estimate Error Limits Square Pr > ChiSq
 Intercept 1 -3.6994 1.7121 -7.0551 -0.3438 4.67 0.0307
 income 1 0.0003 0.0001 0.0000 0.0005 5.30 0.0213
Interpretation of results:
 Both the intercept and income in the logistic
model are found to be significant
(p = 0.0307 and p = 0.0213
respectively). Thus the fitted logistic regression
model will be
 Logit(Pi) = -3.6994 + 0.0003(Income)

\hat{p} = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 x)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x)} = \frac{\exp(-3.6994 + 0.0003x)}{1 + \exp(-3.6994 + 0.0003x)}
Probit model analysis
 The following SAS code will do probit analysis on
the above data
 proc probit data=ex3;
 title 'probit model analysis';
 class hos1;
 housestatus2:model hos1=income /d=normal;
 output out=status2 p=pred2 xbeta=z2;
 run;
 PROC PROBIT is modeling the probabilities of levels of hos1 having LOWER Ordered
Values in the
 response profile table.
 Algorithm converged.
 Type III Analysis of Effects
 Wald
 Effect DF Chi-Square Pr > ChiSq
 income 1 6.2623 0.0123
 Analysis of Parameter Estimates
 Standard 95% Confidence Chi-
 Parameter DF Estimate Error Limits Square Pr > ChiSq
 Intercept 1 -2.2177 0.9824 -4.1432 -0.2922 5.10 0.0240
 income 1 0.0002 0.0001 0.0000 0.0003
Interpretation of results:
 Both the intercept and income in the probit
response model are
found to be significant (p = 0.0240 and
p = 0.0123 respectively).
 Thus the fitted probit regression model
will be
 Probit(Pi) = -2.2177 + 0.0002(Income)

 Predicted probabilities of owning a house from the logit model (pred1, z1) and the probit model (pred2, z2)
 Obs HHN income HOS hos1 pred1 _LEVEL_ z1 pred2 z2
 1 1 8300 0 1 0.17736 0 -1.53432 0.17275 -0.94337
 2 2 21200 1 0 0.86185 0 1.83076 0.85018 1.03722
 3 3 9100 0 1 0.20988 0 -1.32563 0.20595 -0.82054
 4 4 13400 1 0 0.44919 0 -0.20394 0.43630 -0.16035
 5 5 17700 0 1 0.71458 0 0.91776 0.69141 0.49985
 6 6 23000 0 1 0.90890 0 2.30031 0.90551 1.31358
 7 7 11500 1 0 0.33191 0 -0.69957 0.32561 -0.45206
 8 8 10800 0 1 0.29273 0 -0.88217 0.28790 -0.55954
 9 9 15400 1 0 0.57878 0 0.31778 0.55832 0.14672
 10 10 22400 1 0 0.89509 0 2.14379 0.88904 1.22146
 11 11 18700 1 0 0.76470 0 1.17862 0.74324 0.65338
 12 12 10100 0 1 0.25640 0 -1.06477 0.25238 -0.66701
 13 13 19500 1 0 0.80016 0 1.38730 0.78119 0.77621
 14 14 8000 0 1 0.16623 0 -1.61257 0.16123 -0.98943
 15 15 12000 1 0 0.36144 0 -0.56914 0.35372 -0.37530
 16 16 24000 1 0 0.92832 0 2.56117 0.92883 1.46711
 17 17 21700 1 0 0.87666 0 1.96119 0.86736 1.11398
 18 18 9400 0 1 0.22316 0 -1.24737 0.21932 -0.77448
 19 19 10900 0 1 0.29816 0 -0.85608 0.29316 -0.54418
 20 20 22800 1 0 0.90449 0 2.24814 0.90023 1.28287
Measures of model adequacy
Major assumptions for regression models are:
 The relation between y and x is linear, or at least
well approximated by a straight line
 The error term ε has zero mean
 The error term ε has constant variance
(homoscedasticity)
 The errors are uncorrelated
 The errors are normally distributed
 The explanatory variables in the model are
uncorrelated (no multicollinearity)


 We should always consider the validity of
these assumptions to be doubtful and
conduct analysis to examine the adequacy
of the model we have tentatively
entertained

 Gross violation of the assumptions may


yield an unstable model in the sense that
a different sample could lead to a totally
different model with opposite conclusion.
 We usually cannot detect departures from
the underlying assumptions by
examination of the standard summary
statistics, such as the t- or F-statistics or R².
 These are global model properties, and as
such they do not ensure model adequacy.

Definition of residuals
 Residuals are defined as

e_i = y_i - \hat{y}_i, \quad i = 1, 2, \ldots, n

where y_i is an observation and \hat{y}_i is the corresponding fitted value.
 Since residuals are the deviations between the
data and the fit, they are a measure of the
variation not explained by the model.
 It is also convenient to think of residuals as
realizations of the errors.
 Residuals have zero mean, and their
approximate average variance is estimated by the
residual mean square, MS_{Res} = \sum_{i=1}^{n} e_i^2 / (n - p), where p is
the number of model parameters.
 We will now present several residual plots
that are useful for detecting model
inadequacies.
 These methods are simple and effective,
and it is recommended that they be
incorporated in every regression analysis
test for the normality
assumption (Normal
probability plot (PP-PLOT))

 Although small departures from normality
do not affect the model greatly, gross non-
normality is potentially more serious, as the t-
and F-statistics and the confidence and
prediction intervals depend on the
normality assumption.
 The straight line is usually determined
visually, with emphasis on the central
values (e.g. the 0.33 and 0.67 cumulative
probability points) rather than the
extremes.
 Substantial departure from a straight line
indicates that the distribution is not
normal.
 The ranked residuals e_{(i)} are usually
plotted against the "expected normal value"

\Phi^{-1}\!\left[\frac{i - \tfrac{1}{2}}{n}\right]

where \Phi denotes the standard normal
cumulative distribution function.
 This follows from the fact that the expected value of the
i-th ranked standardized residual is approximately

\Phi^{-1}\!\left[\frac{i - \tfrac{1}{2}}{n}\right]
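A minimal SAS sketch of this construction (assuming the residuals are stored
in a data set OUT1 under the name RESI; the data set and variable names are
illustrative):

proc sort data=out1;
  by resi;                              * rank the residuals;
run;
data npp;
  set out1 nobs=n;
  expected = probit((_n_ - 0.5)/n);     * expected normal value, inverse normal CDF;
run;
proc plot data=npp;
  plot resi*expected;                   * points close to a straight line suggest normality;
run;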
The different features on normal
probability plots
 Idealized normal probability plot: the scattered
points lie approximately along a straight line
(figure a).
 Heavy-tailed distribution: a sharp downward and
upward curve at both extremes, indicating that the tails
of the distribution are too heavy to be considered
normal (figure b).
 Thinner-tailed distribution: flattening at the
extremes, a pattern typical of samples
from a distribution with thinner tails than the
normal (figure c).
 Positively and negatively skewed distributions
(figures d and e respectively)
Plot of residuals versus fitted values
 A plot of the residuals ei (or the scaled
residuals di or ri) versus the corresponding
fitted values is useful in detecting several
types of model inadequacies.
Plots of residual versus the
predicted values
 If this plot resembles figure a, indicating
that the residuals can be
contained in a horizontal band, then there
are no obvious model defects.
 Figures b, c and d are symptomatic of
model deficiencies
 The pattern in figure b and c indicates
that the variance of the error is not
constant.
 the outward-opening funnel pattern in figure b implies
that the variance is an increasing function of y (an
inward-opening funnel is also possible, indicating
variance is a decreasing function of y).

 the double-bow pattern in figure c often occurs when y


is a proportion between zero and 1.The variance of a
binomial proportion near 0.5 is greater than the one
near zero or 1.

 Thus the usual approach for dealing with inequality of
variance is to apply a suitable transformation to either
the regressor or the response variable, or to use the method
of weighted least squares.

 In practice, transformations on the response are


generally employed to stabilize variance,
 A curved plot such us in Figure d indicates
nonlinearity.
 This could mean that other regressor
variables are needed in the model. For
example, a squared term may be required.
 Transformation or the regressor and/or
the response variable may also be
required

 Residuals should be plotted against the fitted
values \hat{y}_i rather than against the observed y_i, since
e_i and \hat{y}_i are uncorrelated while e_i and y_i are correlated.
 The plot of residuals against \hat{y}_i may also
reveal one or more unusually large
residuals.
 These points are, of course, potential
outliers.
 Large residuals that occur at the extreme
values could also indicate that either the
variance is not constant or the relationship
between y and x is not linear.
 These possibilities should be investigated
before the points are considered outliers.
Plots of Residuals against xi
 Plotting residuals against the corresponding
values of the regressor variable is also
helpful. These plots often exhibit patterns
such as those in Figure 2, except that the
horizontal scale is x_i rather than \hat{y}_i.
 Once again impression of a horizontal
band containing residuals is desirable

 The funnel and double-bow patterns in
Figures 2b-c indicate non-constant variance
 The curved band in Figure 2d implies that
possibly other regressors should be
included or that a transformation is
necessary.
Other Residual Plots
 If the time sequence in which the data was
collected is known, it may be instructive to
plot the residuals against time order

 If such a plot resembles the pattern in


figure 2b-d, this may indicate that the
variance is changing with time or linear or
quadratic terms in time should be added to
the model
 The time sequence plot of residuals may indicate that
the errors at one time period are correlated with those
errors at different time periods.

 The correlation between errors at different time periods


is called autocorrelation.

 A display such as figure 3a indicates positive


autocorrelation while figure 3b is typical of negative
autocorrelation.

 The presence of autocorrelation in the errors is serious


violation of the basic regression assumptions
Detection and Treatment of
Outliers
 Outliers are extreme observations

 Outliers are data points that are not typical of the


rest of the data

 Residuals that are considerably larger in absolute


value than the others, say three or more standard
deviation from the mean, are potential outliers
 Outliers should be examined carefully to see if a
reason for their unusual behavior can be found.

 Sometimes outliers are "bad" values, occurring as


a result of unusual but explainable events.

 Examples include faulty measurement or analysis,
incorrect recording of data and failure of the
measuring instrument.

 If this is the case, then the outlier should be


corrected (if possible) or deleted from the data
set.
 Clearly discarding bad values is desirable
because least squares pulls the fitted
equation towards the outliers as it
minimizes the residual sum of squares.

 However there should be strong non


statistical evidence that the outlier is a bad
value before it is discarded.

 Effect of outliers on the regression model


may easily be checked by dropping these
points and refitting the regression model
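A minimal sketch of this check for Example 1 (the WHERE condition excludes
the single observation flagged by its very large studentized residual in the
output shown later; this is for illustration only, not a recommendation to
discard the point):

proc reg data=example1;
  where exposure > 1;          * refit without the first observation (bacteria = 175);
  model bacteria = exposure;
run;

Comparing the coefficients, R-square and residual mean square with the
original fit shows how sensitive the model is to that observation.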
 We may find that the values of the
regression coefficients or summary
statistics such as the t- or F-statistics, R² and
the residual mean square are sensitive to
the outliers
 Cook's distance measures the effect of
deleting a given observation.
 Observations with larger D values larger than
the rest of the data are those which have
unusual leverage.
 Fox (1991: 34) suggests as a cut-off for
detecting influential cases, values of D greater
than 4/(n - k - 1), where n is the number of
cases and k is the number of independents.
 Others suggest D > 1 as the criterion to
constitute a strong indication of an outlier
problem, with D > 4/n the criterion to indicate a
possible problem.
 The leverage statistic, h, also called the hat-
value, is available to identify cases which
influence the regression model more than others.
 The leverage statistic varies from 0 (no influence
on the model) to 1 (completely determines the
model).
 A rule of thumb is that cases with leverage under
.2 are not a problem,
 but if a case has leverage over .5, the case has
undue leverage and should be examined for the
possibility of measurement error or the need to
model such cases separately.
 Mahalanobis distance is leverage times
(n - 1), where n is sample size.
 As a rule of thumb, the maximum
Mahalanobis distance should not exceed
the critical chi-squared value with degrees
of freedom equal to number of predictors
and alpha =.001, or else outliers may be a
problem in the data.
 DfBeta is another statistic for assessing
the influence of a case.
 If dfbeta > 0, the case increases the
slope; if <0, the case decreases the slope.
 The case may be considered an influential
outlier if |dfbeta| > 2.
 An alternative rule of thumb, a case may
be an outlier if |dfbeta|> 2/SQRT (n).
 DfFit: DfFit measures how much the
estimate changes as a result of a
particular observation being dropped from
analysis.
 Covratio: it tells us the influence of an
observation on the precision of the
estimates.
 If covratio_i > 1, inclusion of the ith
observation increases the precision of
estimation, while if covratio_i < 1, inclusion
of the ith observation decreases precision
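A minimal sketch that flags potentially influential observations in the
diagnostics data set OUT1 created for Example 2 above (n = 19 observations and
k = 4 regressors; the variable names CO and H come from the earlier OUTPUT
statement, and the cut-offs follow the rules of thumb quoted in this section):

data influence_flags;
  set out1;
  n = 19; k = 4;                        * sample size and number of regressors;
  flag_cook  = (co > 4/(n - k - 1));    * Cook's distance cut-off suggested by Fox;
  flag_lever = (h > 0.5);               * undue leverage;
  mahal      = h*(n - 1);               * Mahalanobis distance computed from the leverage;
run;
proc print data=influence_flags;
  var mkt cost co h mahal flag_cook flag_lever;
run;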
 Multicollinearity implies near-linear
dependence among the regressors.
 the preferred method of assessing
multicollinearity is to regress each
independent on all the other independent
variables in the equation
 Tolerance is 1 - R2 for the regression of that
independent variable on all the other
independents, ignoring the dependent.

 There will be as many tolerance coefficients as


there are independents. The higher the inter-
correlation of the independents, the more the
tolerance will approach zero.

 As a rule of thumb, if tolerance is less than .20,


a problem with multicollinearity is indicated
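As an illustration of this definition, a minimal sketch for one regressor of
Example 2 (the TOL and VIF options on the original MODEL statement report the
same quantities directly):

proc reg data=example2;
  model cattle = calves hogs sheep;   * tolerance for CATTLE is 1 - R-square of this fit;
run;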
 When tolerance is close to 0 there is high
multicollinearity of that variable with other
independents and the b and beta coefficients will
be unstable

 The more the multicollinearity, the lower the
tolerance and the larger the standard errors of the
regression coefficients

 Tolerance is part of the denominator in the


formula for calculating the confidence limits on
the b (partial regression) coefficient.
 Variance-inflation factor, VIF is the variance
inflation factor, which is simply the reciprocal of
tolerance.

 Therefore, when VIF is high there is high


multicollinearity and instability of the b and beta
coefficients

 The table below shows the inflationary impact


on the standard error of the regression
coefficient (b) of the jth independent variable for
various levels of multiple correlations (Rj),
tolerance, and VIF (adapted from Fox, 1991: 12)
Rj      Tolerance    VIF      Impact on SEb
0       1            1        1.0
.4      .84          1.19     1.09
.6      .64          1.56     1.25
.75     .44          2.25     1.5
.8      .36          2.78     1.67
.87     .25          4.0      2.0
.9      .19          5.26     2.29
 Standard error is doubled when VIF is 4.0 and
tolerance is .25, corresponding to Rj = .87.
 Therefore VIF >= 4 is an arbitrary but common
cut-off criterion for deciding when a given
independent variable displays "too much"
multicollinearity:
 values above 4 suggest a multicollinearity
problem. Some researchers use the more lenient
cutoff of 5.0 or even 10.0 to signal when
multicollinearity is a problem.
 The researcher may wish to drop the variable
with the highest VIF if multicollinearity is
indicated and theory warrants.
Transformations to a Straight Line

 The assumption of a straight-line relationship
between x and y is the starting point in
regression analysis.
 Occasionally we find that a straight-line fit is
inappropriate.
 Non-linearity may be detected via a lack-of-fit test or
from scatter plots and residual plots.
 Sometimes previous experience or theoretical
considerations suggest that the relationship
between x and y is not linear
 Sometimes a non-linear relationship may be
transformed to a linear one by using a suitable
transformation.
 Such non-linear models are called
intrinsically linear.
 This approach requires that the
transformed error terms are normally and
independently distributed with mean zero
and constant variance.
 We should look at the residuals from the
transformed model to see if these
assumptions are valid.
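A standard illustration (not taken from this material, but consistent with the
log transformation used in Example 1 below): the exponential model is
intrinsically linear, since taking natural logarithms gives a simple linear
regression in x:

y = \beta_0 e^{\beta_1 x} \varepsilon
\quad\Longrightarrow\quad
\ln y = \ln\beta_0 + \beta_1 x + \ln\varepsilon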
Variance Stabilizing Transformation

 The assumption of constant variance is


the basic requirement for regression
analysis.

 A common reason for the violation of this


assumption is for the response variable y
to follow a probability distribution in which
the variance is related to mean.
 For example, if y is a Poisson random
variable, then the variance of y is equal to
the mean.

 Since the mean of y is related to the


regressor variable x, the variance of y will
be proportional to x.

 variance stabilizing transformations are


useful in this case
 Thus if the distribution of y is Poisson we could
regress √y against x, since the variance of the
square root of a Poisson random variable is
independent of the mean.
 If y is binomially distributed and the plot of
residuals versus ŷ has a double-bow pattern, then
the arcsine (of the square root) transformation is appropriate.
 Several commonly used variance-stabilizing transformations
are summarized below.
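A minimal SAS sketch of these transformations (the data set and variable
names are placeholders; which transformation to use depends on how the
variance is related to the mean):

data transformed;
  set mydata;                  * hypothetical input data set;
  y_sqrt   = sqrt(y);          * Poisson-type counts: variance proportional to the mean;
  p_arcsin = arsin(sqrt(p));   * binomial proportions p between 0 and 1;
  y_log    = log(y);           * variance proportional to the square of the mean;
run;

Predictions made on the transformed scale can be converted back to the
original units afterwards (for example with exp() after a log transformation).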
 It is important to detect and correct a non
constant error variance.

 If this problem is not eliminated, the least
squares estimators will still be unbiased, but
they may no longer have the minimum
variance property.
 This means that the regression
coefficients will have larger standard errors
than necessary.
 When the response variable has been
transformed and predicted, the predicted
values are in the transformed scale.
 It is often necessary to convert the
predicted values back to the original units
examples
 Example 1: The data shown below present the average number of surviving
bacteria in a canned food product exposed to 300°F heat and the minutes of
exposure.
 Obs bacteria exposure
 1 175 1
 2 108 2
 3 95 3
 4 82 4
 5 71 5
 6 50 6
 7 49 7
 8 31 8
 9 28 9
 10 17 10
 11 16 11
 12 11 12
Plot 1: Scatter plot of the number of bacteria surviving exposure to 300°F heat (vertical axis, 0-200) versus minutes of exposure (horizontal axis, 0-12).
 The above scatter plot suggests that the
relationship between the number of bacteria that
survive exposure to 300°F heat and the minutes of
exposure may be linear.
 Thus, let us fit a linear model with intercept and
assess the adequacy of the fit. The following
SAS code will do the regression.

 Proc reg data=example1;


 Model number_of_bacteria=minutes_exposure;
 Output out=out1 r=residual p=prediction;
 Run;
 Output
 The REG Procedure
 Model: MODEL1
 Dependent Variable: Number of bacteria Analysis of
Variance
 Sum of Mean
 Source DF Squares Square F Value Pr > F
 Model 1 22269 22269 66.51
<.0001
 Error 10 3348.10373 334.81037
 Corrected Total 11 25617

 Root MSE 18.29782 R-Square 0.8693


 Depd. Mean 61.08333 Adj R-Sq 0.8562
 Coeff Var 29.95551
 Parameter Estimates
 Parameter Standard
 Variable DF Estimate Error t Value Pr > |t|
 Intercept 1 142.19697 11.26153 12.63 <.0001
 Minutes exposure 1 -12.47902 1.53014 -8.16 <.0001
       Number_of_   Minutes_
 Obs     bacteria   exposure       pred       resi      sresi
 1             11         12     -7.551    18.5513    1.23930
 2             16         11      4.928    11.0723    0.66804
 3             17         10     17.407    -0.4068   -0.02314
 4             28          9     29.886    -1.8858   -0.10471
 5             31          8     42.365   -11.3648   -0.63451
 6             49          7     54.844    -5.8438   -0.31854
 7             50          6     67.323   -17.3228   -0.98864
 8             71          5     79.802    -8.8019   -0.48708
 9             82          4     92.281   -10.2809   -0.58110
 10            95          3    104.760    -9.7599   -0.56485
 11           108          2    117.239    -9.2389   -0.55327
 12           175          1    129.718    45.2821    7.71
Assessing the validity of the assumption that
residuals are from a normal distribution with zero mean
 The following SAS code will produce formal
tests for normality and the zero-mean
assumption, along with a normal probability
plot.
 Proc univariate plot normal data=out1;
 Var residual;
 Run;
 Tests for Location: Mu0=0

 Test ------Statistic---- -----p Value------


 Student's t t 0.653417 Pr > |t| 0.5269
 Sign M -3 Pr >= |M| 0.1460
 Signed Rank S -7 Pr >= |S| 0.6221

 Tests for Normality


 Test -------- Statistic---- -------p Value------
 Shapiro-Wilk W 0.547003 Pr < W <0.0001
 Kolmogorov-Smirnov D 0.32863 Pr > D <0.0100
 Cramer-von Mises W-Sq 0.414614 Pr > W-Sq <0.0050
 Anderson-Darling A-Sq 2.244854 Pr > A-Sq <0.0050
 Plot 2: Normal probability plot of the residuals. Most points fall near a straight line, but one extreme point (studentized residual about 7.5) departs sharply from it.
 Plot: Studentized residuals versus the predicted value of number of bacteria. All but one residual lie in a narrow horizontal band near zero; a single observation has a studentized residual near 8.
 Plot: Studentized residuals versus minutes of exposure (1-12). The same pattern appears: one very large residual, with the rest in a narrow band near zero.
 Plot: Natural log of the number of bacteria (the variable labeled F3) versus minutes of exposure (1-12). The points fall close to a straight line, suggesting a log transformation of the response.
 data example11;
  set example1;
  ln_bacteria = log(number_of_bacteria);   * natural log transformation of the response;
 run;
 Proc reg data=example11;
  Model ln_bacteria = minutes_exposure;
  Output out=out1 r=resi p=pred;
 run;

 The REG Procedure
 Model: MODEL1
 Dependent Variable: F3 F3
 Analysis of Variance
 Sum of Mean
 Source DF Squares Square F Value Pr > F
 Model 1 7.97614 7.97614 550.33 <.0001
 Error 10 0.14493 0.01449
 Corrected Total 11 8.12107

 Root MSE 0.12039 R-Square 0.9822


 Dependent Mean 3.80366 Adj R-Sq 0.9804
 Coeff Var 3.16507

 Parameter Estimates
 Parameter Standard
 Variable DF Estimate Error t Value Pr > |t|
 Intercept 1 5.33878 0.07409 72.05 <.0001
 Minutes_exposure 1 -0.23617 0.01007 -23.46 <.0001
 Tests for Location: Mu0=0
 Test -Statistic- -----p Value------
 Student's t t -0.06861 Pr > |t| 0.9465
 Sign M 0 Pr >= |M| 1.0000
 Signed Rank S 0 Pr >= |S| 1.0000
Tests for Normality
Test --Statistic--- -----p Value------
Shapiro-Wilk W 0.989946 Pr < W 0.9997
Kolmogorov-Smirnov D 0.129374 Pr > D >0.1500
Cramer-von Mises W-Sq 0.018432 Pr > W-Sq >0.2500
Anderson-Darling A-Sq 0.124732 Pr > A-Sq >0.2500
 Normal probability plot of the residuals from the log-transformed model: the points now fall closely along a straight line, consistent with the formal normality tests above.
 Plot: Studentized residuals versus the predicted values from the log-transformed model. The residuals form a horizontal band between about -2 and +2 with no obvious pattern.
 Plot: Studentized residuals versus minutes of exposure (1-12) for the log-transformed model, again showing a horizontal band with no systematic pattern.
 Example-2: data
 Obs mkt cattle calves hogs sheep cost volume type
 1 1 3.437 5.791 3.268 10.649 27.698 23.145 o
 2 2 12.801 4.558 5.751 14.375 57.634 37.485 o
 3 3 6.136 6.223 15.175 2.811 47.172 30.345 o
 4 4 11.685 3.212 0.639 0.694 49.295 16.230 b
 5 5 5.733 3.220 0.534 2.052 24.115 11.539 b
 6 6 3.021 4.348 0.839 2.356 33.612 10.564 b
 7 7 1.689 0.634 0.318 2.209 9.512 4.850 o
 8 8 2.339 1.895 0.610 0.605 14.755 5.449 b
 9 9 1.025 0.834 0.734 2.825 10.570 5.418 o
 10 10 2.936 1.419 0.331 0.231 15.394 4.917 b
 11 11 5.049 4.195 1.589 1.957 27.843 12.790 b
 12 12 1.693 3.602 0.837 1.582 17.717 7.714 b
 13 13 1.187 2.679 0.459 18.837 20.253 23.162 o
 14 14 9.730 3.951 3.780 0.524 37.465 17.985 b
 15 15 14.325 4.300 10.781 36.863 101.334 66.269 o
 16 16 7.737 9.043 1.394 1.524 47.427 19.698 b
 17 17 7.538 4.538 2.565 5.109 35.944 19.750 b
 18 18 10.211 4.994 3.081 3.681 45.945 21.967 b
 19 19 8.697 3.005 1.378 3.338 46.890 16.418 b
 SAS code to run multiple regression
models
 proc reg data=example2;
 model cost=cattle calves hogs sheep /r vif
collin ss1 ss2;
 output out=out1 r=resi p=pred
rstudent=sresi COOKD=CO
COVRATIO=COV DFFITS=Dfit h=h;
 run;
