Chapter 13: Variable Selection and Model Building
The complete regression analysis depends on the explanatory variables present in the model. It is understood
in the regression analysis that only correct and important explanatory variables appear in the model. In
practice, after ensuring the correct functional form of the model, the analyst usually has a pool of
explanatory variables which possibly influence the process or experiment. Generally, all such candidate
variables are not used in the regression modelling, but a subset of explanatory variables is chosen from this
pool. How to determine such an appropriate subset of explanatory variables to be used in regression is called
the problem of variable selection.
While choosing a subset of explanatory variables, there are two possible options:
1. In order to make the model as realistic as possible, the analyst may include as many explanatory variables as possible.
2. In order to make the model as simple as possible, one may include only a few explanatory variables.
Both approaches have their consequences. In fact, model building and subset selection have contradictory objectives. When a large number of variables are included in the model, the variance of the prediction of the study variable $y$ increases. On the other hand, when a small number of variables are included, the predictive variance of $\hat y$ decreases. Also, when observations on more variables are to be collected, more cost, time, labour etc. are involved. A compromise between these consequences is struck to select the "best regression equation".
The problem of variable selection is addressed assuming that the functional form of each explanatory variable, e.g., $x^2$, $1/x$, $\log x$ etc., is known and no outliers or influential observations are present in the data. Various
statistical tools like residual analysis, identification of influential or high leverage observations, model
adequacy etc. are linked to variable selection. In fact, all these issues should ideally be addressed simultaneously.
Usually, these steps are iteratively employed. In the first step, a strategy for variable selection is opted, and
the model is fitted with selected variables. The fitted model is then checked for the functional form, outliers,
influential observations etc. Based on the outcome, the model is re-examined, and the selection of variables is reviewed again. Several iterations may be required before the final adequate model is decided.
There can be two types of incorrect model specifications.
1. Omission/exclusion of relevant variables.
2. Inclusion of irrelevant variables.
Let there be $k$ candidate explanatory variables, out of which suppose $r$ variables are to be included and $(k-r)$ variables are to be deleted from the model. So partition $X$ and $\beta$ as
$$X = [X_1 \;\; X_2], \qquad \beta = \begin{bmatrix}\beta_1 \\ \beta_2\end{bmatrix},$$
where $X$ is $n\times k$, $X_1$ is $n\times r$, $X_2$ is $n\times(k-r)$, $\beta_1$ is $r\times 1$ and $\beta_2$ is $(k-r)\times 1$.
The model $y = X\beta + \varepsilon$, $E(\varepsilon) = 0$, $V(\varepsilon) = \sigma^2 I$ can then be expressed as
$$y = X_1\beta_1 + X_2\beta_2 + \varepsilon.$$
After dropping the $(k-r)$ explanatory variables in $X_2$ from the model, the new model is
$$y = X_1\beta_1 + \varepsilon.$$
The least-squares estimator of $\beta_1$ in the subset (false) model is $b_{1F} = (X_1'X_1)^{-1}X_1'y$. Its estimation error is obtained as follows:
$$b_{1F} = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + \varepsilon) = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'\varepsilon$$
$$b_{1F} - \beta_1 = (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'\varepsilon.$$
Thus
$$E(b_{1F}) - \beta_1 = (X_1'X_1)^{-1}X_1'X_2\beta_2,$$
which is a linear function of $\beta_2$, i.e., of the coefficients of the excluded variables. So $b_{1F}$ is biased in general, unless $X_1'X_2 = 0$ or $\beta_2 = 0$, and its mean squared error is the sum of this squared-bias term and the conventional variance term $\sigma^2(X_1'X_1)^{-1}$, so the quality of estimation generally suffers.
The residual vector from the subset model is
$$y - \hat y_1 = H_1y = H_1(X_1\beta_1 + X_2\beta_2 + \varepsilon) = H_1(X_2\beta_2 + \varepsilon),$$
where $H_1 = I - X_1(X_1'X_1)^{-1}X_1'$, so that $H_1X_1 = 0$. Hence
$$(y - \hat y_1)'(y - \hat y_1) = \beta_2'X_2'H_1X_2\beta_2 + 2\beta_2'X_2'H_1\varepsilon + \varepsilon'H_1\varepsilon.$$
With $s^2 = (y - \hat y_1)'(y - \hat y_1)/(n-r)$,
$$E(s^2) = \frac{1}{n-r}\left[\beta_2'X_2'H_1X_2\beta_2 + 0 + E(\varepsilon'H_1\varepsilon)\right] = \frac{1}{n-r}\left[\beta_2'X_2'H_1X_2\beta_2 + (n-r)\sigma^2\right]$$
$$= \sigma^2 + \frac{1}{n-r}\,\beta_2'X_2'H_1X_2\beta_2.$$
Thus $s^2$ is a biased estimator of $\sigma^2$, and $s^2$ provides an overestimate of $\sigma^2$. Note that even if $X_1'X_2 = 0$, $s^2$ still gives an overestimate of $\sigma^2$. So the statistical inferences based on it will be faulty; the $t$-test and confidence regions will be invalid in this case.
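As a numerical illustration of these results, the following minimal sketch on simulated (hypothetical) data compares the subset-model estimates with the truth: omitting a relevant, correlated regressor biases $b_{1F}$ by $(X_1'X_1)^{-1}X_1'X_2\beta_2$ and inflates the estimate of $\sigma^2$. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 1.0
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)           # x2 is correlated with x1 and relevant
X1 = np.column_stack([np.ones(n), x1])        # retained columns (intercept, x1)
X2 = x2.reshape(-1, 1)                        # omitted column
beta1, beta2 = np.array([1.0, 2.0]), np.array([1.5])

reps, b1F, s2 = 5000, [], []
for _ in range(reps):
    eps = rng.normal(scale=sigma, size=n)
    y = X1 @ beta1 + X2 @ beta2 + eps
    b = np.linalg.lstsq(X1, y, rcond=None)[0]       # subset-model OLSE b_1F
    resid = y - X1 @ b
    b1F.append(b)
    s2.append(resid @ resid / (n - X1.shape[1]))    # s^2 with (n - r) d.f.

bias_theory = np.linalg.solve(X1.T @ X1, X1.T @ X2) @ beta2  # (X1'X1)^{-1} X1'X2 beta2
print("empirical bias of b_1F:", np.mean(b1F, axis=0) - beta1)
print("theoretical bias      :", bias_theory)
print("E(s^2) estimate       :", np.mean(s2), "(true sigma^2 =", sigma**2, ")")
```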
If the response is to be predicted at $x' = (x_1', x_2')$, then using the full model the predicted value is $\hat y = x_1'b_1 + x_2'b_2$, while the subset model gives $\hat y_1 = x_1'b_{1F}$. Thus $\hat y_1$ is a biased predictor of $y$; it is unbiased when $X_1'X_2 = 0$. The mean squared error of the subset predictor is $MSE(\hat y_1) = E(\hat y_1 - x'\beta)^2$. Also,
$$Var(\hat y) \ge MSE(\hat y_1)$$
provided $V(b_2) - \beta_2\beta_2'$ is positive semidefinite, so dropping variables may improve the precision of prediction at the cost of some bias.
2. Inclusion of irrelevant variables
Sometimes, out of enthusiasm and to make the model more realistic, the analyst may include explanatory variables that are not very relevant to the model. Such variables may contribute very little to the explanatory power of the model. This tends to reduce the degrees of freedom $(n-k)$, and consequently the validity of the inferences drawn may be questionable. For example, the value of the coefficient of determination will increase, indicating that the model is getting better, which may not really be true.
Suppose the correct model is $y = X\beta + \varepsilon$, where $X$ ($n\times k$) contains the relevant explanatory variables, but the analyst fits the false model
$$y = X\beta + Z\gamma + \varepsilon,$$
where $Z$ ($n\times r$) contains $r$ irrelevant explanatory variables. Let $b_F$ and $C_F$ denote the OLSEs of $\beta$ and $\gamma$ in the false model. The least-squares normal equations are
$$X'Xb_F + X'ZC_F = X'y \qquad (1)$$
$$Z'Xb_F + Z'ZC_F = Z'y. \qquad (2)$$
Premultiplying equation (2) by $X'Z(Z'Z)^{-1}$ gives
$$X'Z(Z'Z)^{-1}Z'Xb_F + X'Z(Z'Z)^{-1}Z'ZC_F = X'Z(Z'Z)^{-1}Z'y. \qquad (3)$$
Subtracting equation (3) from (1), we get
$$\left[X'X - X'Z(Z'Z)^{-1}Z'X\right]b_F = X'y - X'Z(Z'Z)^{-1}Z'y$$
$$X'\left[I - Z(Z'Z)^{-1}Z'\right]Xb_F = X'\left[I - Z(Z'Z)^{-1}Z'\right]y$$
$$b_F = (X'H_ZX)^{-1}X'H_Zy,$$
where $H_Z = I - Z(Z'Z)^{-1}Z'$. Substituting $y = X\beta + \varepsilon$ from the true model,
$$b_F = (X'H_ZX)^{-1}X'H_Z(X\beta + \varepsilon) = \beta + (X'H_ZX)^{-1}X'H_Z\varepsilon.$$
Thus
$$E(b_F) = \beta + (X'H_ZX)^{-1}X'H_ZE(\varepsilon) = \beta,$$
so $b_F$ is unbiased even when some irrelevant variables are added to the model.
The covariance matrix of $b_F$ is
$$V(b_F) = E\left[(b_F - \beta)(b_F - \beta)'\right] = \sigma^2(X'H_ZX)^{-1}X'H_ZIH_ZX(X'H_ZX)^{-1} = \sigma^2(X'H_ZX)^{-1},$$
using the fact that $H_Z$ is symmetric and idempotent. Under the true model, the OLSE is $b_T = (X'X)^{-1}X'y$ with $E(b_T) = \beta$ and $V(b_T) = \sigma^2(X'X)^{-1}$, and the two estimators can be compared as follows.
Result: If $A$ and $B$ are two positive definite matrices, then $A - B$ is at least positive semi-definite if $B^{-1} - A^{-1}$ is at least positive semi-definite.
Let
$$A = (X'H_ZX)^{-1}, \qquad B = (X'X)^{-1}.$$
Then
$$B^{-1} - A^{-1} = X'X - X'H_ZX = X'X - \left[X'X - X'Z(Z'Z)^{-1}Z'X\right] = X'Z(Z'Z)^{-1}Z'X,$$
which is at least positive semidefinite. This implies that $A - B$, i.e., $\left[V(b_F) - V(b_T)\right]/\sigma^2$, is at least positive semidefinite, so the efficiency declines unless $X'Z = 0$. If $X'Z = 0$, i.e., $X$ and $Z$ are orthogonal, then both estimators are equally efficient.
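A minimal numerical sketch of this variance comparison, on hypothetical simulated data: the variance factors $(X'H_ZX)^{-1}$ and $(X'X)^{-1}$ are computed directly, and the difference is confirmed to be positive semidefinite when the irrelevant column $Z$ is correlated with $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # relevant part of the design
Z = (0.7 * X[:, 1] + rng.normal(size=n))[:, None]        # one irrelevant variable, correlated with X

HZ = np.eye(n) - Z @ np.linalg.solve(Z.T @ Z, Z.T)        # H_Z = I - Z(Z'Z)^{-1}Z'
V_bT = np.linalg.inv(X.T @ X)          # Var(b_T)/sigma^2 under the true model
V_bF = np.linalg.inv(X.T @ HZ @ X)     # Var(b_F)/sigma^2 when Z is also fitted

print("diag Var(b_T)/sigma^2:", np.diag(V_bT))
print("diag Var(b_F)/sigma^2:", np.diag(V_bF))
# The difference Var(b_F) - Var(b_T) is positive semidefinite (non-negative eigenvalues):
print("eigenvalues of the difference:", np.linalg.eigvalsh(V_bF - V_bT))
```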
The residual sum of squares under the false model is
$$SS_{res} = e_F'e_F,$$
where
$$e_F = y - Xb_F - ZC_F,$$
$$b_F = (X'H_ZX)^{-1}X'H_Zy,$$
$$C_F = (Z'Z)^{-1}Z'y - (Z'Z)^{-1}Z'Xb_F = (Z'Z)^{-1}Z'(y - Xb_F) = (Z'Z)^{-1}Z'\left[I - X(X'H_ZX)^{-1}X'H_Z\right]y = (Z'Z)^{-1}Z'H_{ZX}y,$$
with
$$H_Z = I - Z(Z'Z)^{-1}Z', \qquad H_{ZX} = I - X(X'H_ZX)^{-1}X'H_Z, \qquad H_{ZX}^2 = H_{ZX} \;\text{(idempotent)}.$$
So
$$e_F = y - X(X'H_ZX)^{-1}X'H_Zy - Z(Z'Z)^{-1}Z'H_{ZX}y = \left[H_{ZX} - (I - H_Z)H_{ZX}\right]y = H_ZH_{ZX}y = H_{ZX}^*y,$$
where $H_{ZX}^* = H_ZH_{ZX}$.
Thus
$$SS_{res} = e_F'e_F = y'H_{ZX}'H_Z'H_ZH_{ZX}y = y'H_ZH_{ZX}y = y'H_{ZX}^*y,$$
since $H_{ZX}^* = H_ZH_{ZX}$ is symmetric and idempotent.
Its expected value is
$$E(SS_{res}) = \sigma^2\,\mathrm{tr}(H_{ZX}^*) = \sigma^2(n - k - r),$$
so that
$$E\left(\frac{SS_{res}}{n - k - r}\right) = \sigma^2.$$
So $\dfrac{SS_{res}}{n - k - r}$ is an unbiased estimator of $\sigma^2$.
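A quick Monte Carlo check of this unbiasedness result on hypothetical simulated data: the false model containing the irrelevant columns $Z$ is fitted repeatedly, and $SS_{res}/(n-k-r)$ is averaged over the replications.

```python
import numpy as np

rng = np.random.default_rng(10)
n, k, r, sigma = 60, 3, 2, 1.3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # k relevant columns
Z = rng.normal(size=(n, r))                                      # r irrelevant columns
beta = np.array([1.0, 2.0, -1.0])

est = []
for _ in range(4000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    W = np.hstack([X, Z])                                        # false-model design matrix
    e = y - W @ np.linalg.lstsq(W, y, rcond=None)[0]
    est.append(e @ e / (n - k - r))                              # SS_res/(n-k-r)
print("mean of SS_res/(n-k-r):", np.mean(est), " true sigma^2:", sigma**2)
```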
1. Coefficient of determination
The coefficient of determination is the square of the multiple correlation coefficient between the study variable $y$ and the set of explanatory variables $X_1, X_2,\ldots,X_p$, denoted as $R_p^2$. Note that $X_{i1} = 1$ for all $i = 1,2,\ldots,n$, which simply indicates the need for an intercept term in the model, without which the coefficient of determination cannot be used. So essentially, there will be a subset of $(p-1)$ explanatory variables and one intercept term.
Since there are $k$ explanatory variables available and we select only $(p-1)$ of them, there are $\binom{k}{p-1}$ possible choices of subsets. Each such choice produces one subset model. Moreover, the coefficient of determination has a tendency to increase with an increase in $p$.
So proceed as follows:
Choose an appropriate value of $p$, fit the model and obtain $R_p^2$.
Add one variable, fit the model and again obtain $R_{p+1}^2$.
Obviously $R_{p+1}^2 \ge R_p^2$. If $R_{p+1}^2 - R_p^2$ is small, then stop and choose this value of $p$ for the subset regression.
If $R_{p+1}^2 - R_p^2$ is large, then keep on adding variables up to the point where an additional variable does not produce a large change in the value of $R_p^2$, i.e., the increment in $R_p^2$ becomes small.
To find such a value of $p$, create a plot of $R_p^2$ versus $p$; for example, the curve will look like the one in the following figure.
Choose the value of $p$ corresponding to the value of $R_p^2$ where the "knee" of the curve is clearly seen. Such a choice of $p$ may not be unique among different analysts; some experience and judgment of the analyst will be helpful in finding an appropriate and satisfactory value of $p$.
To choose a satisfactory value analytically, one solution is a test which can identify a model whose $R^2$ does not differ significantly from the $R^2$ based on all the explanatory variables.
Let
$$R_0^2 = 1 - (1 - R_{k+1}^2)(1 + d_{\alpha,n,k}), \qquad d_{\alpha,n,k} = \frac{kF_\alpha(k,\, n-k-1)}{n-k-1},$$
where $R_{k+1}^2$ is the value of $R^2$ based on all $(k+1)$ terms (the $k$ explanatory variables and the intercept). A subset with $R^2 > R_0^2$ is called an $R^2$-adequate($\alpha$) subset.
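A minimal all-subsets sketch of the $R_p^2$ criterion together with this $R^2$-adequate($\alpha$) cut-off; the data and variable names are hypothetical, and scipy's F quantile supplies $F_\alpha(k, n-k-1)$.

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(2)
n, k = 60, 4
X = rng.normal(size=(n, k))                       # k candidate explanatory variables
y = 1 + 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

def r_squared(cols):
    """R^2 of the model with an intercept and the columns in `cols`."""
    Xp = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    res = y - Xp @ np.linalg.lstsq(Xp, y, rcond=None)[0]
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - res @ res / sst

R2_full = r_squared(range(k))                     # R^2 with all (k+1) terms
alpha = 0.05
d = k * stats.f.ppf(1 - alpha, k, n - k - 1) / (n - k - 1)
R2_0 = 1 - (1 - R2_full) * (1 + d)                # R^2-adequate(alpha) threshold

for size in range(1, k + 1):
    for cols in combinations(range(k), size):
        R2 = r_squared(cols)
        print(cols, round(R2, 3), "adequate" if R2 > R2_0 else "")
print("threshold R0^2 =", round(R2_0, 3))
```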
2. Adjusted $R^2$
The adjusted coefficient of determination is
$$R_{adj}^2(p) = 1 - \frac{n-1}{n-p}\left(1 - R_p^2\right).$$
An advantage of $R_{adj}^2(p)$ is that it does not necessarily increase as $p$ increases.
If $r$ more explanatory variables are added to a $p$-term model, then
$$R_{adj}^2(p+r) > R_{adj}^2(p)$$
if and only if the partial $F$-statistic for testing the significance of the $r$ additional explanatory variables exceeds 1. So subset selection based on $R_{adj}^2(p)$ can be made on the same lines as with $R_p^2$. In general, the value of $p$ corresponding to the maximum value of $R_{adj}^2(p)$ is chosen for the subset model.
3. Residual mean square
The residual mean square for a subset model with $p$ terms is
$$MS_{res}(p) = \frac{SS_{res}(p)}{n-p}.$$
Since $SS_{res}(p)$ decreases with an increase in $p$, as $p$ increases $MS_{res}(p)$ initially decreases, then stabilizes, and finally may increase if the reduction in $SS_{res}(p)$ from adding a variable is not sufficient to compensate for the loss of one degree of freedom in the factor $(n-p)$. When $MS_{res}(p)$ is plotted versus $p$, the curve looks like the one in the following figure.
So:
plot $MS_{res}(p)$ versus $p$;
choose $p$ for which $MS_{res}(p)$ is approximately equal to $MS_{res}$ based on the full model; or
choose $p$ near the point where the smallest value of $MS_{res}(p)$ turns upward.
Such a minimum value of $MS_{res}(p)$ will produce an $R_{adj}^2(p)$ with maximum value, since
$$R_{adj}^2(p) = 1 - \frac{n-1}{n-p}\left(1 - R_p^2\right) = 1 - \frac{n-1}{n-p}\cdot\frac{SS_{res}(p)}{SS_T} = 1 - \frac{MS_{res}(p)}{SS_T/(n-1)}.$$
Thus the two criteria, viz. minimum $MS_{res}(p)$ and maximum $R_{adj}^2(p)$, are equivalent.
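To see the equivalence numerically, a small sketch on hypothetical data computes $MS_{res}(p)$ and $R_{adj}^2(p)$ over all subsets: the subset minimizing $MS_{res}(p)$ is the one maximizing $R_{adj}^2(p)$.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, k = 80, 5
X = rng.normal(size=(n, k))
y = 0.5 + 1.2 * X[:, 0] + 0.8 * X[:, 2] + rng.normal(size=n)
sst = np.sum((y - y.mean()) ** 2)

results = []
for cols in (c for size in range(1, k + 1) for c in combinations(range(k), size)):
    Xp = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    p = Xp.shape[1]                                # number of terms incl. intercept
    e = y - Xp @ np.linalg.lstsq(Xp, y, rcond=None)[0]
    ms_res = e @ e / (n - p)
    r2_adj = 1 - ms_res / (sst / (n - 1))          # 1 - MS_res(p) / [SS_T/(n-1)]
    results.append((ms_res, r2_adj, cols))

print("min MS_res(p) subset :", min(results, key=lambda t: t[0])[2])
print("max R2_adj(p) subset :", max(results, key=lambda t: t[1])[2])   # same subset
```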
4. Mallow's $C_p$ statistic
Mallow's $C_p$ criterion is based on the mean squared error of a fitted value. Partition the $X$ matrix as before, so that the full model is
$$y = X_1\beta_1 + X_2\beta_2 + \varepsilon = X\beta + \varepsilon, \qquad E(\varepsilon) = 0, \; V(\varepsilon) = \sigma^2I,$$
and the subset model with $p$ terms is $y = X_1\beta_1 + \varepsilon$. The prediction of $y$ can also be seen as the estimation of $E(y) = X\beta$, so the expected weighted squared error loss of $\hat y_1 = X_1\hat\beta_1$ is given by
$$\Delta_p = E\left[(X_1\hat\beta_1 - X\beta)'(X_1\hat\beta_1 - X\beta)\right].$$
So the subset model can be considered an appropriate model if $\Delta_p$ is small.
Since
$$E(y'H_1y) = E\left[(X\beta + \varepsilon)'H_1(X\beta + \varepsilon)\right] = \sigma^2\,\mathrm{tr}\,H_1 + \beta'X'H_1X\beta = \sigma^2(n-p) + \beta'X'H_1X\beta,$$
where $H_1 = I - X_1(X_1'X_1)^{-1}X_1'$, we have
$$\beta'X'H_1X\beta = E(y'H_1y) - \sigma^2(n-p).$$
Thus
$$\Delta_p = \sigma^2(2p - n) + E(y'H_1y).$$
Note that $\Delta_p$ depends on $\beta$ and $\sigma^2$, which are unknown, so $\Delta_p$ cannot be used in practice. A solution is to replace the unknown quantities by their estimates, giving
$$\hat\Delta_p = \hat\sigma^2(2p - n) + SS_{res}(p),$$
where $SS_{res}(p) = y'H_1y$ is the residual sum of squares based on the subset model.
A rescaled version of $\hat\Delta_p$ is
$$C_p = \frac{SS_{res}(p)}{\hat\sigma^2} + (2p - n),$$
which is Mallow's $C_p$ statistic for the subset model $y = X_1\beta_1 + \varepsilon$. Usually
$$b = (X'X)^{-1}X'y, \qquad \hat\sigma^2 = \frac{(y - Xb)'(y - Xb)}{n - p - q}$$
are used to estimate $\beta$ and $\sigma^2$, respectively; these are based on the full model, which contains all $p + q$ terms (with $q$ denoting the number of candidate explanatory variables excluded from the subset).
When different subset models are compared, the models with the smallest $C_p$ are considered to be better. If the subset model has negligible bias (the bias is zero when $\beta_2 = 0$), then
$$E\left[SS_{res}(p)\right] = (n-p)\sigma^2$$
and
$$E\left[C_p\,|\,\text{Bias} = 0\right] = \frac{(n-p)\sigma^2}{\sigma^2} + 2p - n = p.$$
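A minimal sketch of computing $C_p$ for every subset of hypothetical simulated data, with $\hat\sigma^2$ taken as the residual mean square of the full model as described above; subsets with small $C_p$ (ideally close to $p$) are preferred.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
n, k = 70, 4
X = rng.normal(size=(n, k))
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)

def ss_res(cols):
    Xp = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    e = y - Xp @ np.linalg.lstsq(Xp, y, rcond=None)[0]
    return e @ e, Xp.shape[1]                     # residual SS and number of terms p

ss_full, p_full = ss_res(range(k))
sigma2_hat = ss_full / (n - p_full)               # full-model estimate of sigma^2

for size in range(1, k + 1):
    for cols in combinations(range(k), size):
        ss, p = ss_res(cols)
        Cp = ss / sigma2_hat + 2 * p - n          # C_p = SS_res(p)/sigma^2_hat + 2p - n
        print(cols, "p =", p, "Cp =", round(Cp, 2))
```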
For each regression equation, $C_p$ can be plotted against $p$, together with the reference line $C_p = p$, a straight line passing through the origin with unit slope.
Those points which have small bias will lie near the line, and those points with substantial bias will lie above the line. For example, the point $A$ has little bias, so it is close to the line, whereas the points $B$ and $C$ have substantial bias, so they lie above the line. Moreover, the point $C$ lies above the point $A$, which therefore represents a model with a lower total error. It may be preferable to accept some bias in the regression equation in order to reduce the average prediction error.
Note that an unbiased estimator of $\sigma^2$ is used in $C_p$, which rests on the assumption that the full model has negligible bias. If the full model contains non-significant explanatory variables with zero regression coefficients, then the same estimator of $\sigma^2$ will overestimate $\sigma^2$, and then $C_p$ will take smaller values. Note also that $SS_{res}(p)$ is based on the subset model $y = X_1\beta_1 + \varepsilon$ derived from the full model $y = X_1\beta_1 + X_2\beta_2 + \varepsilon = X\beta + \varepsilon$.
In the linear regression model with $\varepsilon \sim N(0, \sigma^2I)$, the likelihood function is
$$L(y; \beta, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}\right]$$
and the log-likelihood of $L(y; \beta, \sigma^2)$ is
$$\ln L(y; \beta, \sigma^2) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}.$$
So, evaluated at the maximum likelihood estimates, Akaike's information criterion for a model with $p$ terms is
$$AIC = -2\ln L(y; \hat\beta, \hat\sigma^2) + 2p = n\ln\frac{SS_{res}}{n} + 2p + n\left[\ln(2\pi) + 1\right].$$
The term $n[\ln(2\pi) + 1]$ remains the same for all the models under comparison when the same observations $y$ are used, so it can be ignored when comparing models.
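A minimal sketch of this AIC formula on hypothetical data; the constant $n[\ln(2\pi)+1]$ is dropped, since it is common to all the models being compared.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n, k = 90, 4
X = rng.normal(size=(n, k))
y = 2 + 1.5 * X[:, 0] + 0.7 * X[:, 3] + rng.normal(size=n)

def aic(cols):
    Xp = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    e = y - Xp @ np.linalg.lstsq(Xp, y, rcond=None)[0]
    p = Xp.shape[1]                       # number of estimated regression terms
    return n * np.log(e @ e / n) + 2 * p  # AIC up to the constant n[ln(2*pi)+1]

models = [c for size in range(1, k + 1) for c in combinations(range(k), size)]
best = min(models, key=aic)
print("AIC-preferred subset:", best, "AIC =", round(aic(best), 2))
```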
7. PRESS statistic
Since the residuals and the residual sum of squares act as criteria for subset model selection, the PRESS residuals and the prediction sum of squares can similarly be used as a basis for subset model selection. The usual residuals and the PRESS residuals have their own characteristics, which are used in regression modelling.
The PRESS statistic based on a subset model with $p$ explanatory variables is given by
$$PRESS(p) = \sum_{i=1}^{n}\left[y_i - \hat y_{(i)}\right]^2 = \sum_{i=1}^{n}\left(\frac{e_i}{1 - h_{ii}}\right)^2,$$
where $h_{ii}$ is the $i$th diagonal element of $H = X(X'X)^{-1}X'$. This criterion is used on similar lines as in the case of $SS_{res}(p)$: a subset regression model with a smaller value of $PRESS(p)$ is preferable.
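A minimal sketch of computing $PRESS(p)$ via the leverages $h_{ii}$ on hypothetical data; the identity $y_i - \hat y_{(i)} = e_i/(1 - h_{ii})$ avoids refitting the model $n$ times, and the explicit leave-one-out loop is included only as a check.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])            # subset model: intercept + x
H = X @ np.linalg.solve(X.T @ X, X.T)           # hat matrix H = X(X'X)^{-1}X'
e = y - H @ y                                   # ordinary residuals
h = np.diag(H)                                  # leverages h_ii
press = np.sum((e / (1 - h)) ** 2)              # PRESS(p)

# Check against explicit leave-one-out refitting
press_loo = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    press_loo += (y[i] - X[i] @ b) ** 2
print(press, press_loo)                         # the two values agree
```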
Partial F-statistic
The partial $F$-statistic is used to test a hypothesis about a subvector of the regression coefficients. Consider the model
$$y = X\beta + \varepsilon, \qquad y: n\times 1,\; X: n\times p,\; \beta: p\times 1,\; \varepsilon: n\times 1,$$
where $p = k + 1$, so the model includes an intercept term and $k$ explanatory variables. Suppose we wish to examine whether a subset of $r \le k$ explanatory variables contributes significantly to the regression model. So partition
$$X = [X_1 \;\; X_2], \qquad \beta = \begin{bmatrix}\beta_1 \\ \beta_2\end{bmatrix},$$
where $X_1$ and $X_2$ are matrices of order $n\times(p-r)$ and $n\times r$, respectively, and $\beta_1$ and $\beta_2$ are the corresponding vectors of regression coefficients. The OLSE of $\beta$ based on the full model is
$$b = (X'X)^{-1}X'y.$$
The corresponding sum of squares due to regression with $p$ degrees of freedom is
$$SS_{reg}(\beta) = b'X'y.$$
The contribution of the explanatory variables in $X_2$ (i.e., of $\beta_2$) to the regression can be found by considering the model
$$y = X_1\beta_1 + \varepsilon, \qquad E(\varepsilon) = 0, \; Var(\varepsilon) = \sigma^2I,$$
which is the reduced model. Application of least squares to the reduced model yields the OLSE of $\beta_1$ as
$$b_1 = (X_1'X_1)^{-1}X_1'y,$$
and the corresponding sum of squares due to regression with $(p-r)$ degrees of freedom is
$$SS_{reg}(\beta_1) = b_1'X_1'y.$$
The sum of squares due to regression of $\beta_2$, given that $\beta_1$ is already in the model, can be found by
$$SS_{reg}(\beta_2\,|\,\beta_1) = SS_{reg}(\beta) - SS_{reg}(\beta_1),$$
where $SS_{reg}(\beta)$ and $SS_{reg}(\beta_1)$ are the sums of squares due to regression with all the explanatory variables (those corresponding to $\beta$) in the model and with only the explanatory variables corresponding to $\beta_1$ in the model, respectively.
The term $SS_{reg}(\beta_2\,|\,\beta_1)$ is called the extra sum of squares due to $\beta_2$ and has $p - (p - r) = r$ degrees of freedom. It is independent of $MS_{res}$ and is a measure of the regression sum of squares that results from adding the explanatory variables $X_{k-r+1},\ldots,X_k$ to a model that already contains $X_1, X_2,\ldots,X_{k-r}$.
The null hypothesis $H_0: \beta_2 = 0$ can be tested using the statistic
$$F_0 = \frac{SS_{reg}(\beta_2\,|\,\beta_1)/r}{MS_{res}},$$
which follows the $F$-distribution with $r$ and $(n-p)$ degrees of freedom under $H_0$. The decision rule is to reject $H_0$ whenever
$$F_0 > F_\alpha(r, n-p).$$
This statistic measures the contribution of the explanatory variables in $X_2$ given that the other explanatory variables in $X_1$ are already in the model.
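A minimal sketch of this extra-sum-of-squares partial F-test for $H_0: \beta_2 = 0$ on hypothetical data; scipy supplies the critical value $F_\alpha(r, n-p)$, and all names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])     # intercept + variable kept in the model
X2 = rng.normal(size=(n, 2))                               # r = 2 variables under test
y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([0.8, 0.0]) + rng.normal(size=n)

def reg_ss(M):
    b = np.linalg.lstsq(M, y, rcond=None)[0]
    return b @ M.T @ y, y - M @ b                 # SS_reg = b'X'y and residual vector

X = np.hstack([X1, X2])
p, r = X.shape[1], X2.shape[1]
ss_full, e_full = reg_ss(X)                       # SS_reg(beta), full model
ss_red, _ = reg_ss(X1)                            # SS_reg(beta_1), reduced model
ms_res = e_full @ e_full / (n - p)                # MS_res from the full model

F0 = (ss_full - ss_red) / r / ms_res              # partial F for beta_2 given beta_1
F_crit = stats.f.ppf(0.95, r, n - p)
print("F0 =", round(F0, 2), " F_0.05(r, n-p) =", round(F_crit, 2),
      " reject H0:", F0 > F_crit)
```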
Choose a suitable criterion for model selection and evaluate each of the fitted regression equations with the selection criterion.
The total number of models to be fitted rises sharply with an increase in $k$, so such models can be evaluated using a model selection criterion with the help of an efficient computational algorithm on a computer.
2. Stepwise regression techniques
This methodology is based on choosing the explanatory variables in the subset model in steps, either adding one variable at a time or deleting one variable at a time. Based on this, there are three procedures:
- forward selection,
- backward elimination and
- stepwise regression.
These procedures are computer-intensive and are executed using software.
Forward selection begins with no explanatory variable in the model. Suppose $x_1$ is the variable which has the highest correlation with $y$. Since the $F$-statistic for testing the significance of a regression is
$$F_0 = \frac{n-k}{k-1}\cdot\frac{R^2}{1-R^2},$$
$x_1$ will produce the largest value of $F_0$. Next, adjust for the effect of $x_1$ on $y$ and re-compute the correlations of the remaining $x_i$'s with $y$ after this adjustment, e.g., by regressing each of them on $x_1$:
$$\hat x_j = \hat\alpha_{0j} + \hat\alpha_{1j}x_1, \qquad j = 2,3,\ldots,k.$$
Choose as the second variable the $x_i$ with the highest partial correlation with $y$ after adjusting for $x_1$, i.e., the variable with the highest value of
$$F = \frac{SS_{reg}(x_2\,|\,x_1)}{MS_{res}(x_1, x_2)}.$$
These steps are repeated. At each step, the partial correlations are computed, and the explanatory variable corresponding to the highest partial correlation with $y$ is chosen to be added to the model. Equivalently, the partial $F$-statistics are calculated, and the largest $F$-statistic, given the other explanatory variables already in the model, is chosen; the corresponding explanatory variable is added to the model if its partial $F$-statistic exceeds $F_{IN}$.
Continue with such selection as long as either, at a particular step, the partial $F$-statistic does not exceed $F_{IN}$, or the last explanatory variable has been added to the model.
Note: The SAS software chooses $F_{IN}$ by specifying a type I error rate $\alpha$, so that the explanatory variable with the highest partial correlation coefficient with $y$ is added to the model if its partial $F$-statistic exceeds $F_\alpha(1, n-p)$.
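A minimal forward-selection sketch using partial F-statistics with a fixed cut-off $F_{IN}$; the data and the cut-off value are hypothetical, and at each step the candidate with the largest partial F is considered for entry.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 120, 5
X = rng.normal(size=(n, k))
y = 1 + 2 * X[:, 1] + 1.5 * X[:, 3] + rng.normal(size=n)
F_IN = 4.0                                       # hypothetical F-to-enter cut-off

def fit_ss_res(cols):
    M = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    e = y - M @ np.linalg.lstsq(M, y, rcond=None)[0]
    return e @ e, M.shape[1]

selected = []
while True:
    ss_cur, _ = fit_ss_res(selected)
    best_F, best_j = -np.inf, None
    for j in set(range(k)) - set(selected):
        ss_new, p_new = fit_ss_res(selected + [j])
        F_j = (ss_cur - ss_new) / (ss_new / (n - p_new))   # partial F for adding x_j
        if F_j > best_F:
            best_F, best_j = F_j, j
    if best_j is None or best_F <= F_IN:
        break
    selected.append(best_j)
    print("enter x%d with partial F = %.2f" % (best_j, best_F))
print("selected subset:", selected)
```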
The backward elimination methodology begins with all the explanatory variables in the model and keeps on deleting one variable at a time until a suitable model is obtained.
Choose a preselected value $F_{OUT}$ ($F$-to-remove).
Compare the smallest of the partial $F$-statistics with $F_{OUT}$. If it is less than $F_{OUT}$, then remove the corresponding explanatory variable from the model.
Stepwise regression combines these two ideas. Consider all the explanatory variables entered into the model at the previous step and re-assess them via their partial $F$-statistics before adding a new variable. An explanatory variable that was added at an earlier step may now have become insignificant due to its relationship with the explanatory variables currently present in the model. If the partial $F$-statistic for an explanatory variable is smaller than $F_{OUT}$, then this variable is deleted from the model. Both cut-off values $F_{IN}$ and $F_{OUT}$ therefore have to be considered. The choice $F_{IN} > F_{OUT}$ makes it relatively more difficult to add an explanatory variable than to delete one.
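A matching backward-elimination sketch with a fixed $F_{OUT}$; the data and cut-off are again hypothetical. Starting from the full model, the variable with the smallest partial F is dropped as long as that statistic stays below $F_{OUT}$.

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 120, 5
X = rng.normal(size=(n, k))
y = 1 + 2 * X[:, 0] - X[:, 4] + rng.normal(size=n)
F_OUT = 4.0                                      # hypothetical F-to-remove cut-off

def ss_res(cols):
    M = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    e = y - M @ np.linalg.lstsq(M, y, rcond=None)[0]
    return e @ e, M.shape[1]

kept = list(range(k))
while kept:
    ss_cur, p_cur = ss_res(kept)
    ms_res = ss_cur / (n - p_cur)
    # partial F of each variable currently in the model
    Fs = {j: (ss_res([i for i in kept if i != j])[0] - ss_cur) / ms_res for j in kept}
    j_min = min(Fs, key=Fs.get)
    if Fs[j_min] >= F_OUT:
        break
    kept.remove(j_min)
    print("remove x%d with partial F = %.2f" % (j_min, Fs[j_min]))
print("retained subset:", kept)
```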
General comments:
1. None of the methods among forward selection, backward elimination or stepwise regression guarantees the best subset model.
2. The order in which the explanatory variables enter or leave the model does not indicate the order of importance of the explanatory variables.
3. In forward selection, no explanatory variable can be removed once it has entered the model. Similarly, in backward elimination, no explanatory variable can be added back once it has been removed from the model.
4. The procedures may lead to different models.
5. Different model selection criteria may give different subset models.
Regarding the choice of $F_{IN}$ and $F_{OUT}$: some computer software allows the analyst to specify these values directly.
Some algorithms require type I error rates to generate $F_{IN}$ and/or $F_{OUT}$. Sometimes, taking $\alpha$ as the level of significance can be misleading, because several correlated partial $F$-variables are considered at each step and the maximum among them is examined.
Some analysts prefer small values of $F_{IN}$ and $F_{OUT}$, whereas some prefer more extreme values. A popular choice is $F_{IN} = F_{OUT} = 4$, which corresponds roughly to the upper five per cent point of the $F$ distribution.