Specification: Choosing Independent Variables

Specification errors that we will deal with: wrong independent variables; wrong
functional form. This lecture deals with wrong independent variables, which may
be due to i) omitted variables, or ii) redundant (irrelevant) variables.

Use the following example under both types:


lnWi = β0 + β1Si + β2OJTi + εi

where Wi = Wage rate of worker i.


Si = Years of formal education of worker i.
OJTi = Effective years of On-the-Job Training of worker i.

The idea is that we have 2 forms of human capital: general human capital
obtained through formal education and specific human capital obtained through
vocational education, apprenticeship programmes, etc. Both may increase wages
(i.e., β1 > 0 and β2 > 0), but not at the same rate (i.e., β1 ≠ β2).

I. Omitting a Relevant Variable.

One of the most common problems in regression analysis. It could stem from the
researcher's oversight (i.e., the variable is available but not used). More likely,
the data are unavailable (e.g., Household Economic Survey).

Estimate the following model instead:

lnWi = β0 + β1Si + εi*

So that the true error in the above regression is

εi* = β2OJTi + εi

so that Assumption 2 does not hold: E(εi*) = β2OJTi ≠ 0. More importantly, in
the case where OJT and S are correlated, it looks like Assumption 3 does not
hold either, because Cov(εi*, Si) ≠ 0. As a result, the Gauss-Markov theorem
does not apply. In general, the OLS estimate of the regression coefficient is
biased, i.e.,

E(β̂1*) ≠ β1

And the bias is

bias(β̂1*) = E(β̂1*) − β1 = β2b12

where:

b12 = Cov(Si, OJTi) / Var(Si)
Suppose that b12 > 0; then (since β2 > 0)

E(β̂1*) > β1

and the estimated coefficient is biased upward.

The bias is zero when the coefficient of the omitted variable is zero or when the
included and omitted variables are uncorrelated.
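The bias formula above can be checked with a short simulation. The sketch below is not from the lecture; all parameter values are made up for illustration. It fits the short regression of lnW on S alone and compares the estimated slope with β1 + β2b12:

```python
import numpy as np

# Illustrative simulation with made-up parameter values: omit OJT, which is
# correlated with S, and watch the OLS slope on S pick up beta2 * b12.
rng = np.random.default_rng(0)
n = 100_000
S = rng.normal(12, 2, n)                     # years of schooling
OJT = 0.5 * S + rng.normal(0, 1, n)          # OJT correlated with S
beta0, beta1, beta2 = 1.0, 0.08, 0.05
lnW = beta0 + beta1 * S + beta2 * OJT + rng.normal(0, 0.1, n)

# Short (misspecified) regression: lnW on a constant and S only
X_short = np.column_stack([np.ones(n), S])
b_short = np.linalg.lstsq(X_short, lnW, rcond=None)[0]

# b12 = Cov(S, OJT) / Var(S), the auxiliary regression slope
b12 = np.cov(S, OJT)[0, 1] / np.var(S, ddof=1)

print(b_short[1])              # biased upward, close to beta1 + beta2*b12
print(beta1 + beta2 * b12)     # ~0.105, versus the true beta1 = 0.08
```

With β2 > 0 and b12 > 0, the short-regression slope overshoots β1 by almost exactly β2b12, as the formula predicts.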

In addition, the standard errors on these estimated coefficients will be biased. In
the misspecified model:

Var(β̂1*) = σ² / Σsi²
But the variance of the 'true' estimator is:

Var(β̂1) = σ² / [Σsi²(1 − r12²)]

where si = Si − S̄ and r12 is the correlation coefficient between S and OJT. This
means that:

If r12² > 0, then Var(β̂1*) < Var(β̂1)

The variance of the estimated coefficient is also biased. We're placing 'too much'
confidence in our coefficient estimates. The result is that the t test will be
misleading (this is true even if r12 = 0, because our estimate of σ² will also be
biased).

The remedial measure is easy IF we know which variable has been omitted and
this omitted variable is available: include it in the model. If the omitted variable
is not available, we might try to find a proxy variable that is closely related to the
missing variable (e.g., use information on the average OJT of people in a
particular industry and occupation). Or at least sign the direction of the bias and
estimate its potential magnitude.

The above remedy works in theory. In practice, sometimes it is difficult to know
if a variable has been omitted. To detect the existence of the problem of omitting
a relevant variable, one common practice is to examine the sign of estimated
coefficients and see if they meet our expectation or economic theory. If not, it is
very likely that relevant variables have been omitted. The next step is to use the
direction of the bias to look for relevant variables.

II. Including an Irrelevant Variable.

Suppose the true model doesn't contain OJTi. This is consistent with some
theoretical models that predict that this form of human capital will not affect
wages, since employers are more likely to pay for it. Thus, the correct regression
model is:

lnWi = β0 + β1Si + εi

but we estimate:

lnWi = β0* + β1*Si + β2*OJTi + εi**
The problems here are less severe compared to omitting a relevant variable. The
true error in the above regression is

εi** = εi − β2OJTi

If OJT is irrelevant, β2 should be zero and hence Assumption 2 holds.
Assumption 3 holds too. What are the properties of the OLS estimates?

(i) Estimated coefficients are unbiased and consistent:

E(β̂1*) = β1

(ii) t test is valid if the correct standard error is used.

(iii) The only problem is that the estimated coefficients are inefficient.

Under the 'false' model:

Var(β̂1*) = σ² / [Σsi²(1 − r12²)]

Under the 'true' model:

Var(β̂1) = σ² / Σsi²

Since if r12² > 0 then Var(β̂1) < Var(β̂1*), we're placing 'too little' confidence in our
coefficient estimates (i.e., the standard error on the estimated coefficient is larger
than it should be). This makes the t-ratio smaller than it should be, and makes it
more likely that we won’t be able to reject the null when we should.
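The two variance formulas can be compared directly. A minimal sketch (illustrative numbers, not from the lecture) evaluates both on simulated data where OJT is irrelevant but correlated with S:

```python
import numpy as np

# Illustrative sketch: evaluate the two variance formulas from the notes on
# simulated data, where OJT is irrelevant but correlated with S.
rng = np.random.default_rng(1)
n = 500
S = rng.normal(12, 2, n)
OJT = 0.5 * S + rng.normal(0, 1, n)          # irrelevant but correlated regressor
sigma2 = 0.04                                # assumed error variance

sum_s2 = np.sum((S - S.mean()) ** 2)         # sum of si^2, with si = Si - S-bar
r12 = np.corrcoef(S, OJT)[0, 1]

var_true = sigma2 / sum_s2                          # 'true' model: S only
var_false = sigma2 / (sum_s2 * (1 - r12 ** 2))      # 'false' model: S and OJT

print(var_false / var_true)   # = 1/(1 - r12^2) > 1: the extra variable inflates Var
```

The stronger the correlation r12, the larger the inflation factor 1/(1 − r12²), which is why including a correlated but irrelevant regressor is costly.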

This is an easy one to solve in theory. If the variable shouldn’t be in the
regression, eliminate it from the outset. But in practice, this isn’t so easy. The
theory in this example says that both specifications might be right. If an
independent variable may be relevant, include it.

III. How to Decide Whether to Include Variable or Not?

1. Graphic method to detect the problem of omitting a relevant variable

Plot the residuals and look for 'distinct pattern'. Take the earlier example on
functional form of the regression. We estimate:

lnWi = β0 + β1Si + εi*

but the 'true' model is:

lnWi = β0 + β1Si + β2Si² + ui

so that

εi* = β2Si² + ui

A plot of the residuals against Si would produce a 'detectable' pattern (i.e., curved
downward).
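This residual check is easy to reproduce. The sketch below (made-up coefficients, not from the lecture) fits the misspecified linear model to data generated from the quadratic model and confirms that the residuals, while orthogonal to Si by construction, still track the omitted curvature:

```python
import numpy as np

# Illustrative sketch: the true model is quadratic in S, but we fit a line.
rng = np.random.default_rng(2)
n = 400
S = rng.uniform(8, 16, n)
lnW = 0.5 + 0.30 * S - 0.01 * S ** 2 + rng.normal(0, 0.02, n)  # beta2 < 0

X = np.column_stack([np.ones(n), S])
coef, *_ = np.linalg.lstsq(X, lnW, rcond=None)
resid = lnW - X @ coef

# OLS residuals are exactly orthogonal to S, but the omitted quadratic term
# leaves a clear curved pattern: negative correlation with (S - S-bar)^2.
print(np.corrcoef(resid, S)[0, 1])                     # ~0 by construction
print(np.corrcoef(resid, (S - S.mean()) ** 2)[0, 1])   # large and negative
```

Plotting `resid` against `S` would show the same thing visually: a hump curving downward at both ends, the 'detectable pattern' described above.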
2. Four criteria

• Economic theory: is there any sound theory?
• Student t statistic: is it significant in the correct direction?
• Has R̄² improved?
• Do other coefficients change sign when the variable is included?

Include the variable if the answers are positive. Don’t necessarily drop
insignificant variables. An insignificant finding can be an important result.

Example:

Coffee-hat = 9.1 + 7.8Pbc + 2.4Pt + 0.0035Yd
                  (15.6)   (1.2)  (0.001)
t =                0.5      2.0    3.5

n = 25, R̄² = 0.60

where Coffee = demand for Brazilian coffee in the US
      Pbc = price of Brazilian coffee
      Pt = price of tea
      Yd = disposable income in the US

What happens if you drop Pbc?

Coffee-hat = 9.3 + 2.6Pt + 0.0036Yd
                  (1.0)   (0.0009)
t =                2.6     4.0

n = 25, R̄² = 0.61

What happens if you add another variable, the price of Colombian coffee, Pcc?

Coffee-hat = 10 + 8.0Pcc − 5.6Pbc + 2.6Pt + 0.0030Yd
                 (4.0)    (2.0)    (1.3)  (0.001)
t =               2.0     −2.8      2.0    3.0

n = 25, R̄² = 0.65
3. Three incorrect techniques for choosing variables

1) Data mining: simultaneously try a whole series of possible regression
formulations and then choose the equation that conforms the most to what the
researcher wants the results to look like. Doing econometrics = making
sausages.

2) Stepwise regression technique: a systematic way of variable selection based
on R̄². The computer program is given a “shopping list” of possible independent
variables and then builds the equation in steps: it always adds to the regression
model the variable that increases R̄² the most. Problem: the independent
variables included could be correlated with each other.

3) Sequential specification search: add and drop variables sequentially (i.e.,
estimate an undisclosed number of regressions) but present only a final choice as
if it were the only specification estimated. Every time you test a model, you run
a risk of a type I error. If you estimate and test too many models, type I errors
will accumulate.
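The accumulation of type I errors is easy to demonstrate by simulation. The sketch below uses purely illustrative sizes (100 observations, a "shopping list" of 20 candidate regressors, all pure noise by construction):

```python
import numpy as np

# Illustrative simulation: screen many irrelevant regressors at the 5% level
# and count how many look "significant" purely by chance.
rng = np.random.default_rng(3)
n, k, trials = 100, 20, 200
false_hits = 0
for _ in range(trials):
    y = rng.normal(size=n)
    for _ in range(k):
        x = rng.normal(size=n)                        # pure-noise regressor
        r = np.corrcoef(x, y)[0, 1]
        t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))  # t for a simple regression
        if abs(t_stat) > 1.98:                        # ~5% two-sided critical value
            false_hits += 1

rate = false_hits / (k * trials)
print(rate)  # about 0.05: a 20-variable search yields roughly one spurious find
```

Even though no regressor is related to y, about one in twenty passes the 5% test, so a search over many specifications almost guarantees a "significant" but spurious variable.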

IV. Lagged Independent Variables

Consider the following regressions:

Yt = β0 + β1X1t + β2X2t + εt          (1)

Yt = β0 + β1X1,t−1 + β2X2t + εt       (2)

where t = 1, …, n. That is, we have a sample of n time-series observations. Note
the change of notation from i to t to emphasize time-series data.

In equation (1), the effect of X1 on Y is instantaneous. In equation (2), the effect
is felt one period later. As long as X1 is exogenous (not influenced by Y), the
lagged structure of the equation poses no problem. Of course, the interpretation
of the slope coefficient is different.
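In practice, the lagged regressor in equation (2) is built by shifting the series one period, at the cost of the first observation. A minimal sketch with made-up numbers:

```python
import numpy as np

# Sketch: build X1_{t-1} for equation (2); the sample then runs over t = 2..n.
x1 = np.array([5.0, 6.0, 7.5, 7.0, 8.0])
x2 = np.array([1.0, 1.2, 1.1, 1.3, 1.4])
y = np.array([10.0, 11.0, 12.5, 12.0, 13.5])

x1_lag = x1[:-1]                                            # X1 lagged one period
X = np.column_stack([np.ones(len(y) - 1), x1_lag, x2[1:]])  # align on t = 2..n
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
print(X.shape)  # (4, 3): one observation is lost to the lag
```

The alignment matters: Yt and X2t are taken at time t while X1 is taken at t−1, so every row of the design matrix mixes the two dates as equation (2) requires.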
V. Akaike’s Information Criterion and Schwarz Criterion

In general, the more variables included in the regression, the smaller the RSS
will be. But if a variable contributes only marginally to the reduction of the RSS,
it should not be included. The AIC and SC (also known as the BIC) measure the
RSS with a penalty for additional parameters. They are defined for regression
models as:

AIC = ln(RSS/n) + 2(K+1)/n

SC = ln(RSS/n) + ln(n)(K+1)/n

where K is the number of independent variables (K+1 also counts the intercept).

You may select models that minimize the AIC or SC. These are called model
selection criteria. Note that R̄² is also a model selection criterion: you choose
the model that maximizes R̄².
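The two criteria are straightforward to compute from a regression's RSS. The sketch below (simulated data with made-up coefficients) evaluates them for a model with and without an irrelevant regressor:

```python
import numpy as np

# Illustrative sketch: AIC and SC as defined above, for two nested models;
# x2 is irrelevant by construction.
rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                     # irrelevant regressor
y = 1.0 + 0.5 * x1 + rng.normal(size=n)

def rss(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ coef
    return e @ e

def aic(RSS, n, K):        # K = number of independent variables
    return np.log(RSS / n) + 2 * (K + 1) / n

def sc(RSS, n, K):
    return np.log(RSS / n) + np.log(n) * (K + 1) / n

ones = np.ones(n)
rss1 = rss(np.column_stack([ones, x1]), y)       # K = 1
rss2 = rss(np.column_stack([ones, x1, x2]), y)   # K = 2

print(rss1, rss2)  # RSS falls (weakly) when a variable is added
# SC charges ln(n)/n per extra parameter versus AIC's 2/n, so for n > e^2
# the SC penalizes the irrelevant x2 more heavily than the AIC does.
print(aic(rss1, n, 1), aic(rss2, n, 2))
print(sc(rss1, n, 1), sc(rss2, n, 2))
```

Because ln(n) > 2 whenever n > e² ≈ 7.4, the SC's per-parameter penalty exceeds the AIC's in virtually any real sample, which is why the SC is the stricter criterion.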

Compared with the SC, the AIC or R̄² tends to select a model with irrelevant
variables.

VI. Questions for Discussion: Q6.3, Q6.9

VII. Computing Exercise: Q6.5 (Johnson, Ch 6), Q6.15, Johnson Ch 6: AIC
