Specification Choosing Independent Variables
Specification errors that we will deal with: wrong independent variable; wrong
functional form. This lecture deals with wrong independent variables, which may
be due to i) omitted variables, ii) redundant variables (irrelevant variables).
The idea is that we have two forms of human capital: general human capital
obtained through formal education ($S$) and specific human capital obtained through
vocational education, apprenticeship programmes, etc. ($OJT$). Both may increase wages
(i.e., $\beta_1 > 0$ and $\beta_2 > 0$), but not at the same rate (i.e., $\beta_1 \neq \beta_2$). The true model is:
\[ \ln W_i = \beta_0 + \beta_1 S_i + \beta_2 OJT_i + \varepsilon_i \]
Omitting a relevant variable is one of the most common problems in regression
analysis. It could stem from the ignorance of the researcher (i.e., the variable
is available, but not used). More likely, the data are unavailable (e.g.,
Household Economic Survey).
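With OJT omitted, we estimate:
\[ \ln W_i = \beta_0 + \beta_1 S_i + \varepsilon_i^* \]
where
\[ \varepsilon_i^* = \beta_2 OJT_i + \varepsilon_i \]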
In the case where OJT and S are correlated, Assumption 3 does not hold because
$Cov(\varepsilon_i^*, S_i) \neq 0$. As a result, the Gauss-Markov theorem does not apply.
In general, the OLS estimate of the regression coefficient is biased, i.e.,
\[ E(\hat{\beta}_1^*) \neq \beta_1 \]
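In the wage example, with $\beta_2 > 0$ and $S$ and $OJT$ positively correlated, the bias is upward:
\[ E(\hat{\beta}_1^*) > \beta_1 \]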
Bias is zero when the coefficient of omitted variable is zero or the included and
omitted variables are uncorrelated.
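The bias can be checked numerically. The sketch below simulates the wage example with hypothetical coefficients ($\beta_1 = 0.08$, $\beta_2 = 0.05$) and an assumed positive link between OJT and S; none of these numbers come from real data. It relies on the standard omitted-variable result $E(\hat{\beta}_1^*) = \beta_1 + \beta_2 \delta$, where $\delta$ is the slope from regressing OJT on S. The helper name `ols_simple` is illustrative only.

```python
import random

def ols_simple(x, y):
    """Return (intercept, slope) from an OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

random.seed(0)
n = 5000
beta0, beta1, beta2 = 1.0, 0.08, 0.05  # hypothetical true coefficients

S = [random.gauss(12, 2) for _ in range(n)]
# Assumption for illustration: OJT rises with schooling, so Cov(S, OJT) > 0
OJT = [0.5 * s + random.gauss(0, 1) for s in S]
lnW = [beta0 + beta1 * s + beta2 * o + random.gauss(0, 0.1)
       for s, o in zip(S, OJT)]

_, b1_short = ols_simple(S, lnW)   # regression that omits OJT
_, delta = ols_simple(S, OJT)      # slope of the omitted on the included variable
expected = beta1 + beta2 * delta   # what the short regression should recover

print("short-regression slope:", b1_short)
print("beta1 + beta2 * delta: ", expected)
```

The short-regression slope lands close to `expected` and above the true `beta1`. Setting `beta2 = 0` or making OJT independent of S removes the gap, matching the two zero-bias cases stated above.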
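The variance of the estimator from the misspecified model is:
\[ Var(\hat{\beta}_1^*) = \frac{\sigma^2}{\sum s_i^2} \]
But the variance of the estimator from the 'true' model is:
\[ Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum s_i^2 (1 - r_{12}^2)} \]
where $s_i = S_i - \bar{S}$ is the deviation of schooling from its sample mean and
$r_{12}$ is the correlation coefficient between S and OJT. This means that:
if $r_{12}^2 > 0$, then $Var(\hat{\beta}_1^*) < Var(\hat{\beta}_1)$.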
The variance of the estimated coefficient is also biased. We're placing 'too much'
confidence in our coefficient estimates. The result is that the t test will be
misleading (this is true even if $r_{12} = 0$, because our estimate of $\sigma^2$ will also be
biased).
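A worked example of the variance comparison, assuming a hypothetical correlation $r_{12} = 0.6$ (chosen only for illustration) and treating $\sigma^2$ as known:

```latex
\[
\frac{Var(\hat{\beta}_1)}{Var(\hat{\beta}_1^*)}
  = \frac{\sigma^2 / \left[ \sum s_i^2 \, (1 - r_{12}^2) \right]}{\sigma^2 / \sum s_i^2}
  = \frac{1}{1 - r_{12}^2}
  = \frac{1}{1 - 0.36}
  = 1.5625
\]
```

Under this assumption the correct variance is about 56% larger than the misspecified formula reports, and the correct standard error is 25% larger (since $\sqrt{1.5625} = 1.25$), so t-ratios from the misspecified model are overstated by the same factor.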
The remedial measure is easy IF we know which variable has been omitted and
this omitted variable is available: include it in the model. If the omitted variable
is not available, we might try to find a proxy variable that is closely related to the
missing variable (e.g., use information on the average OJT of people in a
particular industry and occupation). Or at least sign the direction of the bias, and
estimate its potential magnitude.
Suppose the true model doesn't contain $OJT_i$. This is consistent with some
theoretical models that predict that this form of human capital will not affect wages,
since employers are more likely to pay for it. Thus, the correct regression model is:
\[ \ln W_i = \beta_0 + \beta_1 S_i + \varepsilon_i \]
but we estimate:
\[ \ln W_i = \beta_0 + \beta_1 S_i + \beta_2 OJT_i + \varepsilon_i^{**} \]
The problems here are less severe than those from omitting a relevant variable. The
true error in the above regression is
\[ \varepsilon_i^{**} = \varepsilon_i - \beta_2 OJT_i \]
The OLS estimator of $\beta_1$ remains unbiased, $E(\hat{\beta}_1^*) = \beta_1$.
(iii) The only problem is that the estimated coefficients are inefficient.
\[ Var(\hat{\beta}_1^*) = \frac{\sigma^2}{\sum s_i^2 (1 - r_{12}^2)} \]
Under the 'true' model:
\[ Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum s_i^2} \]
Since $Var(\hat{\beta}_1) < Var(\hat{\beta}_1^*)$ whenever $r_{12}^2 > 0$, we're placing 'too little' confidence in our
coefficient estimates (i.e., the standard error on the estimated coefficient is larger
than it should be). This makes the t-ratio smaller than it should be, and makes it
more likely that we won't be able to reject the null when we should.
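A small Monte Carlo sketch can illustrate both points: the coefficient on S stays unbiased when an irrelevant OJT is included, but its sampling variance rises. All numbers are hypothetical, the regressors are held fixed across replications, and the function names `slope_short` and `slope_long` are illustrative.

```python
import random

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

def slope_short(s, y):
    """OLS slope of y on s (intercept included)."""
    ms, my = mean(s), mean(y)
    return (sum((a - ms) * (c - my) for a, c in zip(s, y))
            / sum((a - ms) ** 2 for a in s))

def slope_long(s, o, y):
    """OLS coefficient on s when y is regressed on s and o (intercept included)."""
    ms, mo, my = mean(s), mean(o), mean(y)
    s11 = sum((a - ms) ** 2 for a in s)
    s22 = sum((b - mo) ** 2 for b in o)
    s12 = sum((a - ms) * (b - mo) for a, b in zip(s, o))
    s1y = sum((a - ms) * (c - my) for a, c in zip(s, y))
    s2y = sum((b - mo) * (c - my) for b, c in zip(o, y))
    return (s1y * s22 - s2y * s12) / (s11 * s22 - s12 ** 2)

random.seed(1)
n, reps = 50, 2000
beta0, beta1 = 1.0, 0.08  # hypothetical; the true coefficient on OJT is zero

S = [random.gauss(12, 2) for _ in range(n)]
OJT = [0.5 * s + random.gauss(0, 1) for s in S]  # correlated with S, but irrelevant

short, long_ = [], []
for _ in range(reps):
    lnW = [beta0 + beta1 * s + random.gauss(0, 0.1) for s in S]
    short.append(slope_short(S, lnW))      # correct model
    long_.append(slope_long(S, OJT, lnW))  # model with the irrelevant variable

print("mean estimate (correct, with OJT):", mean(short), mean(long_))
print("variance      (correct, with OJT):", variance(short), variance(long_))
```

Both means sit near `beta1`, while the variance from the regression that includes OJT is the larger of the two, in line with the $1/(1 - r_{12}^2)$ factor.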
Plot the residuals and look for a 'distinct pattern'. Take the earlier example on
the functional form of the regression. We estimate:
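\[ \ln W_i = \beta_0 + \beta_1 S_i + \varepsilon_i^* \]
but the 'true' model is:
\[ \ln W_i = \beta_0 + \beta_1 S_i + \beta_2 S_i^2 + u_i \]
so that
\[ \varepsilon_i^* = \beta_2 S_i^2 + u_i \]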
A plot of the residuals against Si would produce a 'detectable' pattern (i.e., curved
downward).
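The residual pattern can be checked without a plot by averaging residuals over the low, middle and high thirds of S. The sketch below assumes hypothetical coefficients with $\beta_2 < 0$ (a concave wage profile) and fits the misspecified linear model; the data are simulated, not taken from any survey.

```python
import random

random.seed(2)
n = 300
beta0, beta1, beta2 = 1.0, 0.3, -0.01  # hypothetical; beta2 < 0 makes the profile concave

S = [random.uniform(0, 20) for _ in range(n)]
lnW = [beta0 + beta1 * s + beta2 * s ** 2 + random.gauss(0, 0.05) for s in S]

# Fit the misspecified linear model lnW = b0 + b1 * S by OLS
ms, my = sum(S) / n, sum(lnW) / n
b1 = (sum((s - ms) * (y - my) for s, y in zip(S, lnW))
      / sum((s - ms) ** 2 for s in S))
b0 = my - b1 * ms

# Residuals ordered by S, then averaged within each third of the sample
resid = [y - (b0 + b1 * s) for s, y in sorted(zip(S, lnW))]
third = n // 3
low, mid, high = (sum(resid[i * third:(i + 1) * third]) / third for i in range(3))

print("mean residual by third of S (low, middle, high):", low, mid, high)
```

The mean residual is negative in the low and high thirds and positive in the middle third, which is the downward-curving shape a residual plot against $S_i$ would show.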
2. Four criteria
Example:
n = 25, $\bar{R}^2 = 0.60$
(1.0) (0.0009)
t = 2.6 4.0
n = 25, $\bar{R}^2 = 0.61$
What happens if you add another variable, the price of Colombian coffee, Pcc?
n = 25, $\bar{R}^2 = 0.65$
3. Three incorrect techniques for choosing variables
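\[ Y_t = \beta_0 + \beta_1 X_{1t} + \beta_2 X_{2t} + \varepsilon_t \qquad (1) \]
\[ Y_t = \beta_0 + \beta_1 X_{1,t-1} + \beta_2 X_{2t} + \varepsilon_t \qquad (2) \]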
In general, the more variables included in the regression, the smaller the RSS
will be. But if a variable contributes only marginally to the reduction of the RSS, it
should not be included. AIC and SC (also known as BIC) measure the RSS with
a penalty for additional parameters. They are defined in regression models as:
AIC = ln(RSS/n) + 2(K+1)/n
SC = ln(RSS/n) + ln(n)(K+1)/n
where n is the sample size and K is the number of independent variables (so K+1
counts the intercept).
You may select the model that minimizes the AIC or SC. These are called model
selection criteria. Note that $\bar{R}^2$ is also a model selection criterion: you choose
the model that maximizes $\bar{R}^2$.
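The criteria are simple to compute from a regression's RSS. The sketch below uses the SC formula given above and assumes the matching AIC form, ln(RSS/n) + 2(K+1)/n, and the usual adjusted R-squared. The RSS and TSS values are made up for illustration, and the function names are not from any particular library.

```python
import math

def aic(rss, n, k):
    """AIC = ln(RSS/n) + 2(K+1)/n (assumed form, matching the SC definition)."""
    return math.log(rss / n) + 2 * (k + 1) / n

def sc(rss, n, k):
    """SC = ln(RSS/n) + ln(n)(K+1)/n, as defined in the text."""
    return math.log(rss / n) + math.log(n) * (k + 1) / n

def adj_r2(rss, tss, n, k):
    """Adjusted R^2 = 1 - [RSS/(n-K-1)] / [TSS/(n-1)]."""
    return 1 - (rss / (n - k - 1)) / (tss / (n - 1))

# Hypothetical case: a third regressor lowers RSS only from 100 to 98
n, tss = 25, 250.0
models = {"small (K=2)": (100.0, 2), "large (K=3)": (98.0, 3)}

for name, (rss, k) in models.items():
    print(name,
          "AIC:", aic(rss, n, k),
          "SC:", sc(rss, n, k),
          "adj R2:", adj_r2(rss, tss, n, k))
```

In this made-up case the extra variable reduces the RSS too little to offset the penalty, so the smaller model has the lower AIC and SC and the higher adjusted R-squared.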