Chapter 13: Variable Selection and Model Building
The complete regression analysis depends on the explanatory variables present in the model. It is understood
in the regression analysis that only correct and important explanatory variables appear in the model. In
practice, after ensuring the correct functional form of the model, the analyst usually has a pool of
explanatory variables which possibly influence the process or experiment. Generally, all such candidate
variables are not used in the regression modelling, but a subset of explanatory variables is chosen from this
pool. How to determine such an appropriate subset of explanatory variables to be used in regression is called
the problem of variable selection.
While choosing a subset of explanatory variables, there are two possible options:
1. In order to make the model as realistic as possible, the analyst may include as many explanatory variables as possible.
2. In order to make the model as simple as possible, one may include only a few explanatory variables.
Both approaches have their consequences. In fact, model building and subset selection have contradictory objectives. When a large number of variables are included in the model, the variance of the prediction of the study variable $y$ increases. On the other hand, when a small number of variables are included, the predictive variance of $\hat y$ decreases. Also, when observations on more variables are to be collected, more cost, time, labour etc. are involved. A compromise between these consequences is struck to select the "best regression equation".
The problem of variable selection is addressed assuming that the functional form of each explanatory variable, e.g., $x^2$, $1/x$, $\log x$ etc., is known and no outliers or influential observations are present in the data. Various
statistical tools like residual analysis, identification of influential or high leverage observations, model
adequacy etc. are linked to variable selection. In fact, all these issues should ideally be addressed simultaneously.
Usually, these steps are iteratively employed. In the first step, a strategy for variable selection is opted, and
the model is fitted with selected variables. The fitted model is then checked for the functional form, outliers,
influential observations etc. Based on the outcome, the model is re-examined, and the selection of variables is reviewed again. Several iterations may be required before the final adequate model is decided.
There can be two types of incorrect model specifications.
1. Omission/exclusion of relevant variables.
2. Inclusion of irrelevant variables.
Let there be $k$ candidate explanatory variables, out of which suppose $r$ variables are to be included and $(k-r)$ variables are to be deleted from the model. So partition $X$ and $\beta$ as
$$X = [X_1 \;\; X_2], \qquad \beta = \begin{bmatrix}\beta_1 \\ \beta_2\end{bmatrix},$$
where $X$ is $n\times k$, $X_1$ is $n\times r$, $X_2$ is $n\times(k-r)$, $\beta_1$ is $r\times 1$ and $\beta_2$ is $(k-r)\times 1$.
The model $y = X\beta + \varepsilon$, $E(\varepsilon) = 0$, $V(\varepsilon) = \sigma^2 I$ can then be expressed as
$$y = X_1\beta_1 + X_2\beta_2 + \varepsilon.$$
After dropping the $(k-r)$ explanatory variables in $X_2$ from the model, the new model is
$$y = X_1\beta_1 + \varepsilon.$$
The least-squares estimator of $\beta_1$ in the subset (false) model is $b_{1F} = (X_1'X_1)^{-1}X_1'y$. Its estimation error is obtained as follows:
$$b_{1F} = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + \varepsilon) = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'\varepsilon$$
$$b_{1F} - \beta_1 = (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'\varepsilon.$$
Thus
$$E(b_{1F}) - \beta_1 = (X_1'X_1)^{-1}X_1'X_2\beta_2,$$
which is a linear function of $\beta_2$, i.e., of the coefficients of the excluded variables. So $b_{1F}$ is biased in general, unless $X_1'X_2 = 0$ or $\beta_2 = 0$, and its mean squared error is the sum of this squared-bias term and the conventional variance term $\sigma^2(X_1'X_1)^{-1}$, so the quality of estimation generally suffers.
The residual vector from the subset model is
$$y - \hat y_1 = H_1y = H_1(X_1\beta_1 + X_2\beta_2 + \varepsilon) = H_1(X_2\beta_2 + \varepsilon),$$
where $H_1 = I - X_1(X_1'X_1)^{-1}X_1'$, so that $H_1X_1 = 0$. Hence
$$(y - \hat y_1)'(y - \hat y_1) = \beta_2'X_2'H_1X_2\beta_2 + 2\beta_2'X_2'H_1\varepsilon + \varepsilon'H_1\varepsilon.$$
With $s^2 = (y - \hat y_1)'(y - \hat y_1)/(n-r)$,
$$E(s^2) = \frac{1}{n-r}\left[\beta_2'X_2'H_1X_2\beta_2 + 0 + E(\varepsilon'H_1\varepsilon)\right] = \frac{1}{n-r}\left[\beta_2'X_2'H_1X_2\beta_2 + (n-r)\sigma^2\right]$$
$$= \sigma^2 + \frac{1}{n-r}\,\beta_2'X_2'H_1X_2\beta_2.$$
Thus $s^2$ is a biased estimator of $\sigma^2$, and $s^2$ provides an overestimate of $\sigma^2$. Note that even if $X_1'X_2 = 0$, $s^2$ still gives an overestimate of $\sigma^2$. So the statistical inferences based on it will be faulty; the $t$-test and confidence regions will be invalid in this case.
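As a numerical illustration of these results, the following minimal sketch on simulated (hypothetical) data compares the subset-model estimates with the truth: omitting a relevant, correlated regressor biases $b_{1F}$ by $(X_1'X_1)^{-1}X_1'X_2\beta_2$ and inflates the estimate of $\sigma^2$. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 1.0
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)           # x2 is correlated with x1 and relevant
X1 = np.column_stack([np.ones(n), x1])        # retained columns (intercept, x1)
X2 = x2.reshape(-1, 1)                        # omitted column
beta1, beta2 = np.array([1.0, 2.0]), np.array([1.5])

reps, b1F, s2 = 5000, [], []
for _ in range(reps):
    eps = rng.normal(scale=sigma, size=n)
    y = X1 @ beta1 + X2 @ beta2 + eps
    b = np.linalg.lstsq(X1, y, rcond=None)[0]       # subset-model OLSE b_1F
    resid = y - X1 @ b
    b1F.append(b)
    s2.append(resid @ resid / (n - X1.shape[1]))    # s^2 with (n - r) d.f.

bias_theory = np.linalg.solve(X1.T @ X1, X1.T @ X2) @ beta2  # (X1'X1)^{-1} X1'X2 beta2
print("empirical bias of b_1F:", np.mean(b1F, axis=0) - beta1)
print("theoretical bias      :", bias_theory)
print("E(s^2) estimate       :", np.mean(s2), "(true sigma^2 =", sigma**2, ")")
```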
If the response is to be predicted at $x' = (x_1', x_2')$, then using the full model the predicted value is $\hat y = x_1'b_1 + x_2'b_2$, while the subset model gives $\hat y_1 = x_1'b_{1F}$. Thus $\hat y_1$ is a biased predictor of $y$; it is unbiased when $X_1'X_2 = 0$. The mean squared error of the subset predictor is $MSE(\hat y_1) = E(\hat y_1 - x'\beta)^2$. Also,
$$Var(\hat y) \ge MSE(\hat y_1)$$
provided $V(b_2) - \beta_2\beta_2'$ is positive semidefinite, so dropping variables may improve the precision of prediction at the cost of some bias.
2. Inclusion of irrelevant variables
Sometimes, out of enthusiasm and to make the model more realistic, the analyst may include explanatory variables that are not very relevant to the model. Such variables may contribute very little to the explanatory power of the model. This tends to reduce the degrees of freedom $(n-k)$, and consequently the validity of the inferences drawn may be questionable. For example, the value of the coefficient of determination will increase, indicating that the model is getting better, which may not really be true.
Suppose the correct model is $y = X\beta + \varepsilon$, where $X$ ($n\times k$) contains the relevant explanatory variables, but the analyst fits the false model
$$y = X\beta + Z\gamma + \varepsilon,$$
where $Z$ ($n\times r$) contains $r$ irrelevant explanatory variables. Let $b_F$ and $C_F$ denote the OLSEs of $\beta$ and $\gamma$ in the false model. The least-squares normal equations are
$$X'Xb_F + X'ZC_F = X'y \qquad (1)$$
$$Z'Xb_F + Z'ZC_F = Z'y. \qquad (2)$$
Premultiplying equation (2) by $X'Z(Z'Z)^{-1}$ gives
$$X'Z(Z'Z)^{-1}Z'Xb_F + X'Z(Z'Z)^{-1}Z'ZC_F = X'Z(Z'Z)^{-1}Z'y. \qquad (3)$$
Subtracting equation (3) from (1), we get
$$\left[X'X - X'Z(Z'Z)^{-1}Z'X\right]b_F = X'y - X'Z(Z'Z)^{-1}Z'y$$
$$X'\left[I - Z(Z'Z)^{-1}Z'\right]Xb_F = X'\left[I - Z(Z'Z)^{-1}Z'\right]y$$
$$b_F = (X'H_ZX)^{-1}X'H_Zy,$$
where $H_Z = I - Z(Z'Z)^{-1}Z'$. Substituting $y = X\beta + \varepsilon$ from the true model,
$$b_F = (X'H_ZX)^{-1}X'H_Z(X\beta + \varepsilon) = \beta + (X'H_ZX)^{-1}X'H_Z\varepsilon.$$
Thus
$$E(b_F) = \beta + (X'H_ZX)^{-1}X'H_ZE(\varepsilon) = \beta,$$
so $b_F$ is unbiased even when some irrelevant variables are added to the model.
The covariance matrix of $b_F$ is
$$V(b_F) = E\left[(b_F - \beta)(b_F - \beta)'\right] = \sigma^2(X'H_ZX)^{-1}X'H_ZIH_ZX(X'H_ZX)^{-1} = \sigma^2(X'H_ZX)^{-1},$$
using the fact that $H_Z$ is symmetric and idempotent. Under the true model, the OLSE is $b_T = (X'X)^{-1}X'y$ with $E(b_T) = \beta$ and $V(b_T) = \sigma^2(X'X)^{-1}$, and the two estimators can be compared as follows.
Result: If $A$ and $B$ are two positive definite matrices, then $A - B$ is at least positive semi-definite if $B^{-1} - A^{-1}$ is at least positive semi-definite.
Let
$$A = (X'H_ZX)^{-1}, \qquad B = (X'X)^{-1}.$$
Then
$$B^{-1} - A^{-1} = X'X - X'H_ZX = X'X - \left[X'X - X'Z(Z'Z)^{-1}Z'X\right] = X'Z(Z'Z)^{-1}Z'X,$$
which is at least positive semidefinite. This implies that $A - B$, i.e., $\left[V(b_F) - V(b_T)\right]/\sigma^2$, is at least positive semidefinite, so the efficiency declines unless $X'Z = 0$. If $X'Z = 0$, i.e., $X$ and $Z$ are orthogonal, then both estimators are equally efficient.
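A minimal numerical sketch of this variance comparison, on hypothetical simulated data: the variance factors $(X'H_ZX)^{-1}$ and $(X'X)^{-1}$ are computed directly, and the difference is confirmed to be positive semidefinite when the irrelevant column $Z$ is correlated with $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # relevant part of the design
Z = (0.7 * X[:, 1] + rng.normal(size=n))[:, None]        # one irrelevant variable, correlated with X

HZ = np.eye(n) - Z @ np.linalg.solve(Z.T @ Z, Z.T)        # H_Z = I - Z(Z'Z)^{-1}Z'
V_bT = np.linalg.inv(X.T @ X)          # Var(b_T)/sigma^2 under the true model
V_bF = np.linalg.inv(X.T @ HZ @ X)     # Var(b_F)/sigma^2 when Z is also fitted

print("diag Var(b_T)/sigma^2:", np.diag(V_bT))
print("diag Var(b_F)/sigma^2:", np.diag(V_bF))
# The difference Var(b_F) - Var(b_T) is positive semidefinite (non-negative eigenvalues):
print("eigenvalues of the difference:", np.linalg.eigvalsh(V_bF - V_bT))
```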
The residual sum of squares under the false model is
$$SS_{res} = e_F'e_F,$$
where
$$e_F = y - Xb_F - ZC_F,$$
$$b_F = (X'H_ZX)^{-1}X'H_Zy,$$
$$C_F = (Z'Z)^{-1}Z'y - (Z'Z)^{-1}Z'Xb_F = (Z'Z)^{-1}Z'(y - Xb_F) = (Z'Z)^{-1}Z'\left[I - X(X'H_ZX)^{-1}X'H_Z\right]y = (Z'Z)^{-1}Z'H_{ZX}y,$$
with
$$H_Z = I - Z(Z'Z)^{-1}Z', \qquad H_{ZX} = I - X(X'H_ZX)^{-1}X'H_Z, \qquad H_{ZX}^2 = H_{ZX} \;\text{(idempotent)}.$$
So
$$e_F = y - X(X'H_ZX)^{-1}X'H_Zy - Z(Z'Z)^{-1}Z'H_{ZX}y = \left[H_{ZX} - (I - H_Z)H_{ZX}\right]y = H_ZH_{ZX}y = H_{ZX}^*y,$$
where $H_{ZX}^* = H_ZH_{ZX}$.
Thus
$$SS_{res} = e_F'e_F = y'H_{ZX}'H_Z'H_ZH_{ZX}y = y'H_ZH_{ZX}y = y'H_{ZX}^*y,$$
since $H_{ZX}^* = H_ZH_{ZX}$ is symmetric and idempotent.
Its expected value is
$$E(SS_{res}) = \sigma^2\,\mathrm{tr}(H_{ZX}^*) = \sigma^2(n - k - r),$$
so that
$$E\left(\frac{SS_{res}}{n - k - r}\right) = \sigma^2.$$
So $\dfrac{SS_{res}}{n - k - r}$ is an unbiased estimator of $\sigma^2$.
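A quick Monte Carlo check of this unbiasedness result on hypothetical simulated data: the false model containing the irrelevant columns $Z$ is fitted repeatedly, and $SS_{res}/(n-k-r)$ is averaged over the replications.

```python
import numpy as np

rng = np.random.default_rng(10)
n, k, r, sigma = 60, 3, 2, 1.3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # k relevant columns
Z = rng.normal(size=(n, r))                                      # r irrelevant columns
beta = np.array([1.0, 2.0, -1.0])

est = []
for _ in range(4000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    W = np.hstack([X, Z])                                        # false-model design matrix
    e = y - W @ np.linalg.lstsq(W, y, rcond=None)[0]
    est.append(e @ e / (n - k - r))                              # SS_res/(n-k-r)
print("mean of SS_res/(n-k-r):", np.mean(est), " true sigma^2:", sigma**2)
```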
1. Coefficient of determination
The coefficient of determination is the square of the multiple correlation coefficient between the study variable $y$ and the set of explanatory variables $X_1, X_2,\ldots,X_p$, denoted as $R_p^2$. Note that $X_{i1} = 1$ for all $i = 1,2,\ldots,n$, which simply indicates the need for an intercept term in the model, without which the coefficient of determination cannot be used. So essentially, there will be a subset of $(p-1)$ explanatory variables and one intercept term.
Since there are $k$ explanatory variables available and we select only $(p-1)$ of them, there are $\binom{k}{p-1}$ possible choices of subsets. Each such choice produces one subset model. Moreover, the coefficient of determination has a tendency to increase with an increase in $p$.
So proceed as follows:
Choose an appropriate value of $p$, fit the model and obtain $R_p^2$.
Add one variable, fit the model and again obtain $R_{p+1}^2$.
Obviously $R_{p+1}^2 \ge R_p^2$. If $R_{p+1}^2 - R_p^2$ is small, then stop and choose this value of $p$ for the subset regression.
If $R_{p+1}^2 - R_p^2$ is large, then keep on adding variables up to the point where an additional variable does not produce a large change in the value of $R_p^2$, i.e., the increment in $R_p^2$ becomes small.
To find such a value of $p$, create a plot of $R_p^2$ versus $p$; for example, the curve will look like the one in the following figure.
Choose the value of $p$ corresponding to the value of $R_p^2$ where the "knee" of the curve is clearly seen. Such a choice of $p$ may not be unique among different analysts; some experience and judgment of the analyst will be helpful in finding an appropriate and satisfactory value of $p$.
To choose a satisfactory value analytically, one solution is a test which can identify a model whose $R^2$ does not differ significantly from the $R^2$ based on all the explanatory variables.
Let
$$R_0^2 = 1 - (1 - R_{k+1}^2)(1 + d_{\alpha,n,k}), \qquad d_{\alpha,n,k} = \frac{kF_\alpha(k,\, n-k-1)}{n-k-1},$$
where $R_{k+1}^2$ is the value of $R^2$ based on all $(k+1)$ terms (the $k$ explanatory variables and the intercept). A subset with $R^2 > R_0^2$ is called an $R^2$-adequate($\alpha$) subset.
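A minimal all-subsets sketch of the $R_p^2$ criterion together with this $R^2$-adequate($\alpha$) cut-off; the data and variable names are hypothetical, and scipy's F quantile supplies $F_\alpha(k, n-k-1)$.

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(2)
n, k = 60, 4
X = rng.normal(size=(n, k))                       # k candidate explanatory variables
y = 1 + 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

def r_squared(cols):
    """R^2 of the model with an intercept and the columns in `cols`."""
    Xp = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    res = y - Xp @ np.linalg.lstsq(Xp, y, rcond=None)[0]
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - res @ res / sst

R2_full = r_squared(range(k))                     # R^2 with all (k+1) terms
alpha = 0.05
d = k * stats.f.ppf(1 - alpha, k, n - k - 1) / (n - k - 1)
R2_0 = 1 - (1 - R2_full) * (1 + d)                # R^2-adequate(alpha) threshold

for size in range(1, k + 1):
    for cols in combinations(range(k), size):
        R2 = r_squared(cols)
        print(cols, round(R2, 3), "adequate" if R2 > R2_0 else "")
print("threshold R0^2 =", round(R2_0, 3))
```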
2. Adjusted $R^2$
The adjusted coefficient of determination is
$$R_{adj}^2(p) = 1 - \frac{n-1}{n-p}\left(1 - R_p^2\right).$$
An advantage of $R_{adj}^2(p)$ is that it does not necessarily increase as $p$ increases.
If $r$ more explanatory variables are added to a $p$-term model, then
$$R_{adj}^2(p+r) > R_{adj}^2(p)$$
if and only if the partial $F$-statistic for testing the significance of the $r$ additional explanatory variables exceeds 1. So subset selection based on $R_{adj}^2(p)$ can be made on the same lines as with $R_p^2$. In general, the value of $p$ corresponding to the maximum value of $R_{adj}^2(p)$ is chosen for the subset model.
3. Residual mean square
The residual mean square for a subset model with $p$ terms is
$$MS_{res}(p) = \frac{SS_{res}(p)}{n-p}.$$
Since $SS_{res}(p)$ decreases with an increase in $p$, as $p$ increases $MS_{res}(p)$ initially decreases, then stabilizes, and finally may increase if the reduction in $SS_{res}(p)$ from adding a variable is not sufficient to compensate for the loss of one degree of freedom in the factor $(n-p)$. When $MS_{res}(p)$ is plotted versus $p$, the curve looks like the one in the following figure.
So:
plot $MS_{res}(p)$ versus $p$;
choose $p$ for which $MS_{res}(p)$ is approximately equal to $MS_{res}$ based on the full model; or
choose $p$ near the point where the smallest value of $MS_{res}(p)$ turns upward.
Such a minimum value of $MS_{res}(p)$ will produce an $R_{adj}^2(p)$ with maximum value, since
$$R_{adj}^2(p) = 1 - \frac{n-1}{n-p}\left(1 - R_p^2\right) = 1 - \frac{n-1}{n-p}\cdot\frac{SS_{res}(p)}{SS_T} = 1 - \frac{MS_{res}(p)}{SS_T/(n-1)}.$$
Thus the two criteria, viz. minimum $MS_{res}(p)$ and maximum $R_{adj}^2(p)$, are equivalent.
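To see the equivalence numerically, a small sketch on hypothetical data computes $MS_{res}(p)$ and $R_{adj}^2(p)$ over all subsets: the subset minimizing $MS_{res}(p)$ is the one maximizing $R_{adj}^2(p)$.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, k = 80, 5
X = rng.normal(size=(n, k))
y = 0.5 + 1.2 * X[:, 0] + 0.8 * X[:, 2] + rng.normal(size=n)
sst = np.sum((y - y.mean()) ** 2)

results = []
for cols in (c for size in range(1, k + 1) for c in combinations(range(k), size)):
    Xp = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    p = Xp.shape[1]                                # number of terms incl. intercept
    e = y - Xp @ np.linalg.lstsq(Xp, y, rcond=None)[0]
    ms_res = e @ e / (n - p)
    r2_adj = 1 - ms_res / (sst / (n - 1))          # 1 - MS_res(p) / [SS_T/(n-1)]
    results.append((ms_res, r2_adj, cols))

print("min MS_res(p) subset :", min(results, key=lambda t: t[0])[2])
print("max R2_adj(p) subset :", max(results, key=lambda t: t[1])[2])   # same subset
```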
4. Mallow's $C_p$ statistic
Mallow's $C_p$ criterion is based on the mean squared error of a fitted value. Partition the $X$ matrix as before, so that the full model is
$$y = X_1\beta_1 + X_2\beta_2 + \varepsilon = X\beta + \varepsilon, \qquad E(\varepsilon) = 0, \; V(\varepsilon) = \sigma^2I,$$
and the subset model with $p$ terms is $y = X_1\beta_1 + \varepsilon$. The prediction of $y$ can also be seen as the estimation of $E(y) = X\beta$, so the expected weighted squared error loss of $\hat y_1 = X_1\hat\beta_1$ is given by
$$\Delta_p = E\left[(X_1\hat\beta_1 - X\beta)'(X_1\hat\beta_1 - X\beta)\right].$$
So the subset model can be considered an appropriate model if $\Delta_p$ is small.
Since
$$E(y'H_1y) = E\left[(X\beta + \varepsilon)'H_1(X\beta + \varepsilon)\right] = \sigma^2\,\mathrm{tr}\,H_1 + \beta'X'H_1X\beta = \sigma^2(n-p) + \beta'X'H_1X\beta,$$
where $H_1 = I - X_1(X_1'X_1)^{-1}X_1'$, we have
$$\beta'X'H_1X\beta = E(y'H_1y) - \sigma^2(n-p).$$
Thus
$$\Delta_p = \sigma^2(2p - n) + E(y'H_1y).$$
Note that $\Delta_p$ depends on $\beta$ and $\sigma^2$, which are unknown, so $\Delta_p$ cannot be used in practice. A solution is to replace the unknown quantities by their estimates, giving
$$\hat\Delta_p = \hat\sigma^2(2p - n) + SS_{res}(p),$$
where $SS_{res}(p) = y'H_1y$ is the residual sum of squares based on the subset model.
A rescaled version of $\hat\Delta_p$ is
$$C_p = \frac{SS_{res}(p)}{\hat\sigma^2} + (2p - n),$$
which is Mallow's $C_p$ statistic for the subset model $y = X_1\beta_1 + \varepsilon$. Usually
$$b = (X'X)^{-1}X'y, \qquad \hat\sigma^2 = \frac{(y - Xb)'(y - Xb)}{n - p - q}$$
are used to estimate $\beta$ and $\sigma^2$, respectively; these are based on the full model, which contains all $p + q$ terms (with $q$ denoting the number of candidate explanatory variables excluded from the subset).
When different subset models are compared, the models with the smallest $C_p$ are considered to be better. If the subset model has negligible bias (the bias is zero when $\beta_2 = 0$), then
$$E\left[SS_{res}(p)\right] = (n-p)\sigma^2$$
and
$$E\left[C_p\,|\,\text{Bias} = 0\right] = \frac{(n-p)\sigma^2}{\sigma^2} + 2p - n = p.$$
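A minimal sketch of computing $C_p$ for every subset of hypothetical simulated data, with $\hat\sigma^2$ taken as the residual mean square of the full model as described above; subsets with small $C_p$ (ideally close to $p$) are preferred.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
n, k = 70, 4
X = rng.normal(size=(n, k))
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)

def ss_res(cols):
    Xp = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    e = y - Xp @ np.linalg.lstsq(Xp, y, rcond=None)[0]
    return e @ e, Xp.shape[1]                     # residual SS and number of terms p

ss_full, p_full = ss_res(range(k))
sigma2_hat = ss_full / (n - p_full)               # full-model estimate of sigma^2

for size in range(1, k + 1):
    for cols in combinations(range(k), size):
        ss, p = ss_res(cols)
        Cp = ss / sigma2_hat + 2 * p - n          # C_p = SS_res(p)/sigma^2_hat + 2p - n
        print(cols, "p =", p, "Cp =", round(Cp, 2))
```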
For each regression equation, $C_p$ can be plotted against $p$, together with the reference line $C_p = p$, a straight line passing through the origin with unit slope.
Those points which have small bias will lie near the line, and those points with substantial bias will lie above the line. For example, the point $A$ has little bias, so it is close to the line, whereas the points $B$ and $C$ have substantial bias, so they lie above the line. Moreover, the point $C$ lies above the point $A$, which therefore represents a model with a lower total error. It may be preferable to accept some bias in the regression equation in order to reduce the average prediction error.
Note that an unbiased estimator of $\sigma^2$ is used in $C_p$, which rests on the assumption that the full model has negligible bias. If the full model contains non-significant explanatory variables with zero regression coefficients, then the same estimator of $\sigma^2$ will overestimate $\sigma^2$, and then $C_p$ will take smaller values. Note also that $SS_{res}(p)$ is based on the subset model $y = X_1\beta_1 + \varepsilon$ derived from the full model $y = X_1\beta_1 + X_2\beta_2 + \varepsilon = X\beta + \varepsilon$.
In the linear regression model with $\varepsilon \sim N(0, \sigma^2I)$, the likelihood function is
$$L(y; \beta, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}\right]$$
and the log-likelihood of $L(y; \beta, \sigma^2)$ is
$$\ln L(y; \beta, \sigma^2) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}.$$
So, evaluated at the maximum likelihood estimates, Akaike's information criterion for a model with $p$ terms is
$$AIC = -2\ln L(y; \hat\beta, \hat\sigma^2) + 2p = n\ln\frac{SS_{res}}{n} + 2p + n\left[\ln(2\pi) + 1\right].$$
The term $n[\ln(2\pi) + 1]$ remains the same for all the models under comparison when the same observations $y$ are used, so it can be ignored when comparing models.
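A minimal sketch of this AIC formula on hypothetical data; the constant $n[\ln(2\pi)+1]$ is dropped, since it is common to all the models being compared.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n, k = 90, 4
X = rng.normal(size=(n, k))
y = 2 + 1.5 * X[:, 0] + 0.7 * X[:, 3] + rng.normal(size=n)

def aic(cols):
    Xp = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    e = y - Xp @ np.linalg.lstsq(Xp, y, rcond=None)[0]
    p = Xp.shape[1]                       # number of estimated regression terms
    return n * np.log(e @ e / n) + 2 * p  # AIC up to the constant n[ln(2*pi)+1]

models = [c for size in range(1, k + 1) for c in combinations(range(k), size)]
best = min(models, key=aic)
print("AIC-preferred subset:", best, "AIC =", round(aic(best), 2))
```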
7. PRESS statistic
Since the residuals and the residual sum of squares act as criteria for subset model selection, the PRESS residuals and the prediction sum of squares can similarly be used as a basis for subset model selection. The usual residuals and the PRESS residuals have their own characteristics, which are used in regression modelling.
The PRESS statistic based on a subset model with $p$ explanatory variables is given by
$$PRESS(p) = \sum_{i=1}^{n}\left[y_i - \hat y_{(i)}\right]^2 = \sum_{i=1}^{n}\left(\frac{e_i}{1 - h_{ii}}\right)^2,$$
where $h_{ii}$ is the $i$th diagonal element of $H = X(X'X)^{-1}X'$. This criterion is used on similar lines as in the case of $SS_{res}(p)$: a subset regression model with a smaller value of $PRESS(p)$ is preferable.
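A minimal sketch of computing $PRESS(p)$ via the leverages $h_{ii}$ on hypothetical data; the identity $y_i - \hat y_{(i)} = e_i/(1 - h_{ii})$ avoids refitting the model $n$ times, and the explicit leave-one-out loop is included only as a check.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])            # subset model: intercept + x
H = X @ np.linalg.solve(X.T @ X, X.T)           # hat matrix H = X(X'X)^{-1}X'
e = y - H @ y                                   # ordinary residuals
h = np.diag(H)                                  # leverages h_ii
press = np.sum((e / (1 - h)) ** 2)              # PRESS(p)

# Check against explicit leave-one-out refitting
press_loo = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    press_loo += (y[i] - X[i] @ b) ** 2
print(press, press_loo)                         # the two values agree
```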
Partial F-statistic
The partial $F$-statistic is used to test a hypothesis about a subvector of the regression coefficients. Consider the model
$$y = X\beta + \varepsilon, \qquad y: n\times 1,\; X: n\times p,\; \beta: p\times 1,\; \varepsilon: n\times 1,$$
where $p = k + 1$, so the model includes an intercept term and $k$ explanatory variables. Suppose we wish to examine whether a subset of $r \le k$ explanatory variables contributes significantly to the regression model. So partition
$$X = [X_1 \;\; X_2], \qquad \beta = \begin{bmatrix}\beta_1 \\ \beta_2\end{bmatrix},$$
where $X_1$ and $X_2$ are matrices of order $n\times(p-r)$ and $n\times r$, respectively, and $\beta_1$ and $\beta_2$ are the corresponding vectors of regression coefficients. The OLSE of $\beta$ based on the full model is
$$b = (X'X)^{-1}X'y.$$
The corresponding sum of squares due to regression with $p$ degrees of freedom is
$$SS_{reg}(\beta) = b'X'y.$$
The contribution of the explanatory variables in $X_2$ (i.e., of $\beta_2$) to the regression can be found by considering the model
$$y = X_1\beta_1 + \varepsilon, \qquad E(\varepsilon) = 0, \; Var(\varepsilon) = \sigma^2I,$$
which is the reduced model. Application of least squares to the reduced model yields the OLSE of $\beta_1$ as
$$b_1 = (X_1'X_1)^{-1}X_1'y,$$
and the corresponding sum of squares due to regression with $(p-r)$ degrees of freedom is
$$SS_{reg}(\beta_1) = b_1'X_1'y.$$
The sum of squares due to regression of $\beta_2$, given that $\beta_1$ is already in the model, can be found by
$$SS_{reg}(\beta_2\,|\,\beta_1) = SS_{reg}(\beta) - SS_{reg}(\beta_1),$$
where $SS_{reg}(\beta)$ and $SS_{reg}(\beta_1)$ are the sums of squares due to regression with all the explanatory variables (those corresponding to $\beta$) in the model and with only the explanatory variables corresponding to $\beta_1$ in the model, respectively.
The term $SS_{reg}(\beta_2\,|\,\beta_1)$ is called the extra sum of squares due to $\beta_2$ and has $p - (p - r) = r$ degrees of freedom. It is independent of $MS_{res}$ and is a measure of the regression sum of squares that results from adding the explanatory variables $X_{k-r+1},\ldots,X_k$ to a model that already contains $X_1, X_2,\ldots,X_{k-r}$.
The null hypothesis $H_0: \beta_2 = 0$ can be tested using the statistic
$$F_0 = \frac{SS_{reg}(\beta_2\,|\,\beta_1)/r}{MS_{res}},$$
which follows the $F$-distribution with $r$ and $(n-p)$ degrees of freedom under $H_0$. The decision rule is to reject $H_0$ whenever
$$F_0 > F_\alpha(r, n-p).$$
This statistic measures the contribution of the explanatory variables in $X_2$ given that the other explanatory variables in $X_1$ are already in the model.
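A minimal sketch of this extra-sum-of-squares partial F-test for $H_0: \beta_2 = 0$ on hypothetical data; scipy supplies the critical value $F_\alpha(r, n-p)$, and all names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])     # intercept + variable kept in the model
X2 = rng.normal(size=(n, 2))                               # r = 2 variables under test
y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([0.8, 0.0]) + rng.normal(size=n)

def reg_ss(M):
    b = np.linalg.lstsq(M, y, rcond=None)[0]
    return b @ M.T @ y, y - M @ b                 # SS_reg = b'X'y and residual vector

X = np.hstack([X1, X2])
p, r = X.shape[1], X2.shape[1]
ss_full, e_full = reg_ss(X)                       # SS_reg(beta), full model
ss_red, _ = reg_ss(X1)                            # SS_reg(beta_1), reduced model
ms_res = e_full @ e_full / (n - p)                # MS_res from the full model

F0 = (ss_full - ss_red) / r / ms_res              # partial F for beta_2 given beta_1
F_crit = stats.f.ppf(0.95, r, n - p)
print("F0 =", round(F0, 2), " F_0.05(r, n-p) =", round(F_crit, 2),
      " reject H0:", F0 > F_crit)
```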
Choose a suitable criterion for model selection and evaluate each of the fitted regression equations with the selection criterion.
The total number of models to be fitted rises sharply with an increase in $k$, so such models can be evaluated using a model selection criterion with the help of an efficient computational algorithm on a computer.
2. Stepwise regression techniques
This methodology is based on choosing the explanatory variables in the subset model in steps, either adding one variable at a time or deleting one variable at a time. Based on this, there are three procedures:
- forward selection,
- backward elimination and
- stepwise regression.
These procedures are computer-intensive and are executed using software.
Forward selection begins with no explanatory variable in the model. Suppose $x_1$ is the variable which has the highest correlation with $y$. Since the $F$-statistic for testing the significance of a regression is
$$F_0 = \frac{n-k}{k-1}\cdot\frac{R^2}{1-R^2},$$
$x_1$ will produce the largest value of $F_0$. Next, adjust for the effect of $x_1$ on $y$ and re-compute the correlations of the remaining $x_i$'s with $y$ after this adjustment, e.g., by regressing each of them on $x_1$:
$$\hat x_j = \hat\alpha_{0j} + \hat\alpha_{1j}x_1, \qquad j = 2,3,\ldots,k.$$
Choose as the second variable the $x_i$ with the highest partial correlation with $y$ after adjusting for $x_1$, i.e., the variable with the highest value of
$$F = \frac{SS_{reg}(x_2\,|\,x_1)}{MS_{res}(x_1, x_2)}.$$
These steps are repeated. At each step, the partial correlations are computed, and the explanatory variable corresponding to the highest partial correlation with $y$ is chosen to be added to the model. Equivalently, the partial $F$-statistics are calculated, and the largest $F$-statistic, given the other explanatory variables already in the model, is chosen; the corresponding explanatory variable is added to the model if its partial $F$-statistic exceeds $F_{IN}$.
Continue with such selection as long as either, at a particular step, the partial $F$-statistic does not exceed $F_{IN}$, or the last explanatory variable has been added to the model.
Note: The SAS software chooses $F_{IN}$ by specifying a type I error rate $\alpha$, so that the explanatory variable with the highest partial correlation coefficient with $y$ is added to the model if its partial $F$-statistic exceeds $F_\alpha(1, n-p)$.
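A minimal forward-selection sketch using partial F-statistics with a fixed cut-off $F_{IN}$; the data and the cut-off value are hypothetical, and at each step the candidate with the largest partial F is considered for entry.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 120, 5
X = rng.normal(size=(n, k))
y = 1 + 2 * X[:, 1] + 1.5 * X[:, 3] + rng.normal(size=n)
F_IN = 4.0                                       # hypothetical F-to-enter cut-off

def fit_ss_res(cols):
    M = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    e = y - M @ np.linalg.lstsq(M, y, rcond=None)[0]
    return e @ e, M.shape[1]

selected = []
while True:
    ss_cur, _ = fit_ss_res(selected)
    best_F, best_j = -np.inf, None
    for j in set(range(k)) - set(selected):
        ss_new, p_new = fit_ss_res(selected + [j])
        F_j = (ss_cur - ss_new) / (ss_new / (n - p_new))   # partial F for adding x_j
        if F_j > best_F:
            best_F, best_j = F_j, j
    if best_j is None or best_F <= F_IN:
        break
    selected.append(best_j)
    print("enter x%d with partial F = %.2f" % (best_j, best_F))
print("selected subset:", selected)
```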
The backward elimination methodology begins with all the explanatory variables in the model and keeps on deleting one variable at a time until a suitable model is obtained.
Choose a preselected value $F_{OUT}$ ($F$-to-remove).
Compare the smallest of the partial $F$-statistics with $F_{OUT}$. If it is less than $F_{OUT}$, then remove the corresponding explanatory variable from the model.
Stepwise regression combines these two ideas. Consider all the explanatory variables entered into the model at the previous step and re-assess them via their partial $F$-statistics before adding a new variable. An explanatory variable that was added at an earlier step may now have become insignificant due to its relationship with the explanatory variables currently present in the model. If the partial $F$-statistic for an explanatory variable is smaller than $F_{OUT}$, then this variable is deleted from the model. Both cut-off values $F_{IN}$ and $F_{OUT}$ therefore have to be considered. The choice $F_{IN} > F_{OUT}$ makes it relatively more difficult to add an explanatory variable than to delete one.
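A matching backward-elimination sketch with a fixed $F_{OUT}$; the data and cut-off are again hypothetical. Starting from the full model, the variable with the smallest partial F is dropped as long as that statistic stays below $F_{OUT}$.

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 120, 5
X = rng.normal(size=(n, k))
y = 1 + 2 * X[:, 0] - X[:, 4] + rng.normal(size=n)
F_OUT = 4.0                                      # hypothetical F-to-remove cut-off

def ss_res(cols):
    M = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    e = y - M @ np.linalg.lstsq(M, y, rcond=None)[0]
    return e @ e, M.shape[1]

kept = list(range(k))
while kept:
    ss_cur, p_cur = ss_res(kept)
    ms_res = ss_cur / (n - p_cur)
    # partial F of each variable currently in the model
    Fs = {j: (ss_res([i for i in kept if i != j])[0] - ss_cur) / ms_res for j in kept}
    j_min = min(Fs, key=Fs.get)
    if Fs[j_min] >= F_OUT:
        break
    kept.remove(j_min)
    print("remove x%d with partial F = %.2f" % (j_min, Fs[j_min]))
print("retained subset:", kept)
```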
General comments:
1. None of the methods among forward selection, backward elimination or stepwise regression guarantees the best subset model.
2. The order in which the explanatory variables enter or leave the model does not indicate the order of importance of the explanatory variables.
3. In forward selection, no explanatory variable can be removed once it has entered the model. Similarly, in backward elimination, no explanatory variable can be added back once it has been removed from the model.
4. The procedures may lead to different models.
5. Different model selection criteria may give different subset models.
Regarding the choice of $F_{IN}$ and $F_{OUT}$: some computer software allows the analyst to specify these values directly.
Some algorithms require type I error rates to generate $F_{IN}$ and/or $F_{OUT}$. Sometimes, taking $\alpha$ as the level of significance can be misleading, because several correlated partial $F$-variables are considered at each step and the maximum among them is examined.
Some analysts prefer small values of $F_{IN}$ and $F_{OUT}$, whereas some prefer more extreme values. A popular choice is $F_{IN} = F_{OUT} = 4$, which corresponds roughly to the upper five per cent point of the $F$ distribution.