Advanced Econometrics
Dr. Uma Kollamparambil
Today's Agenda
- A quick round-up of basic econometrics
- Econometric modelling
Regression analysis
- Theory specifies the functional relationship
- Measurement of the relationship uses regression analysis to arrive at values of a and b:
Y = a + bX + e
- Components: dependent & independent variables, intercept (a), coefficient (b), error term (e)
- Regression may be simple or multivariate according to the number of independent variables
Requirements
- Model specification: the relationship between dependent and independent variables; a scatter plot helps specify the function that best fits the scatter
- Sufficient data for estimation: cross-sectional, time series, or panel
Some Important terminology
- Least squares regression: Y = a + bX + e
- Estimation: point estimates, interval estimates
- Inference: t-statistic, R-square (coefficient of determination), F-statistic
Estimation -- OLS
[Scatter plot: sample data points to which a line is to be fitted]
Ordinary Least Squares (OLS)
We have a set of data points and want to fit a line to them. The most efficient estimator can be shown to be OLS, which minimizes the squared distance between the line and the actual data points.
How do we estimate a and b in the linear equation? The OLS estimator solves:
min over a, b:  Σi (Yi - a - bXi)²
This minimization problem can be solved using calculus. The result is the OLS estimators of a and b.
Regression Analysis -- OLS
The basic equation:

Yi = a + bXi + εi

OLS estimator of b:

b̂ = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)²

OLS estimator of a:

â = Ȳ - b̂X̄
Here a hat denotes an estimator, and a bar a sample mean.
Regression Analysis -- OLS
Production period   Demand (Y)   Price (X)
1                   410          1
2                   370          5
3                   240          8
4                   230          10
5                   160          15
6                   150          23
7                   100          25
            Coefficients
Intercept   384.98
Q (X)       -11.89
These are the estimated coefficients for the data above.
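The estimates can be reproduced with a short calculation; here is a minimal sketch in Python (numpy assumed), implementing the deviation-form formulas from the previous slide on the table above:

```python
# Reproduce the slide's OLS estimates for the demand/price table above.
import numpy as np

X = np.array([1, 5, 8, 10, 15, 23, 25], dtype=float)            # Price (X)
Y = np.array([410, 370, 240, 230, 160, 150, 100], dtype=float)  # Demand (Y)

x_bar, y_bar = X.mean(), Y.mean()
b_hat = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
a_hat = y_bar - b_hat * x_bar

# Prints roughly 384.99 and -11.90; the slide reports 384.98 and -11.89
# (the small differences are rounding).
print(a_hat, b_hat)
```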
Regression Analysis -- Inference
R² = Σ(Ŷi - Ȳ)² / Σ(Yi - Ȳ)²

Sb = sqrt[ Σ(Yi - Ŷi)² / ((n - k) Σ(Xi - X̄)²) ]
Here, R² is a measure of the goodness of fit of our model, while the standard error of b, Sb, gives us a measure of confidence for our estimate of b.
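A minimal sketch (numpy assumed) implementing these two formulas for the demand data above, with k = 2 estimated parameters (a and b):

```python
import numpy as np

X = np.array([1, 5, 8, 10, 15, 23, 25], dtype=float)
Y = np.array([410, 370, 240, 230, 160, 150, 100], dtype=float)

# OLS fit as on the earlier slide
x_bar, y_bar = X.mean(), Y.mean()
b_hat = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
a_hat = y_bar - b_hat * x_bar
Y_hat = a_hat + b_hat * X

n, k = len(Y), 2  # k counts estimated parameters (a and b)
r_squared = np.sum((Y_hat - y_bar) ** 2) / np.sum((Y - y_bar) ** 2)
s_b = np.sqrt(np.sum((Y - Y_hat) ** 2) / ((n - k) * np.sum((X - x_bar) ** 2)))
print(r_squared, s_b)
```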
Regression Analysis -- Confidence
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.976786811
R Square              0.954112475
Adjusted R Square     0.94493497
Standard Error        27.08645377
Observations          7

ANOVA
             df    SS            MS         F          Significance F
Regression   1     76274.47725   76274.48   103.9621   0.000155729
Residual     5     3668.379888   733.676
Total        6     79942.85714

             Coefficients   Standard Error   t Stat     P-value
Intercept    87.10614525    17.92601129      4.859204   0.004636
Q (X)        12.2122905     1.19773201       10.19618   0.000156
These are the goodness-of-fit measures reported by Excel for our example data.
Hypothesis testing
- Hypothesis formulation
- Test, by either of two methods:
  - Confidence interval method: construct an interval around the estimated b at the desired level of confidence using the SE of b, and check whether the hypothesized b falls within it. If it does, accept the null hypothesis.
  - Test of significance method: estimate the t-value of b and compare it with the table t-value. If the former is less than the latter, accept the null hypothesis.
Hypothesis testing
(Excel summary output repeated from the previous slide.)

             Coefficients   Standard Error   t Stat     P-value
Intercept    87.10614525    17.92601129      4.859204   0.004636
Q (X)        12.2122905     1.19773201       10.19618   0.000156
t = b̂ / Sb

is the t-ratio. Combined with critical values from a Student-t distribution, this ratio tells us how confident we are that the coefficient is significantly different from zero.
Analysis of Variance: F ratio
The F ratio tests the overall significance of the regression:
F = [Explained variation / (k - 1)] / [Unexplained variation / (n - k)]
  = [R² / (k - 1)] / [(1 - R²) / (n - k)]
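As a worked check against the Excel output shown earlier (R² = 0.954112, n = 7, k = 2):

F = (0.954112 / 1) / (0.045888 / 5) = 0.954112 / 0.0091776 ≈ 103.96

which matches the F statistic in the ANOVA table.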
The F test also assesses the marginal contribution of a new variable:
F = [(ESSnew - ESSold) / no. of new X's] / [RSSnew / (n - no. of X's in the new model)]
It also tests for structural change in the data (Chow test):
F = [(RSS_R - RSS_UR) / k] / [RSS_UR / (n1 + n2 - 2k)]
Multivariate regression
y = Xβ + u

where

y = (Y1, Y2, …, Yn)′ is the n×1 vector of observations on the dependent variable,
X is the n×k matrix whose i-th row is (1, X2i, …, Xki),
β = (β1, β2, …, βk)′ is the k×1 coefficient vector, and
u = (u1, u2, …, un)′ is the n×1 error vector.

The OLS estimator is β̂ = (X′X)⁻¹X′y.
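A minimal sketch (numpy assumed, simulated data) of the matrix formula β̂ = (X′X)⁻¹X′y:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1, x2 = rng.normal(size=(2, n))
y = 2.0 + 0.5 * x1 - 1.5 * x2 + rng.normal(size=n)  # known true coefficients

X = np.column_stack([np.ones(n), x1, x2])      # n x k design matrix, column of ones for the intercept
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # evaluates (X'X)^(-1) X'y
print(beta_hat)                                # close to [2.0, 0.5, -1.5]
```

Using np.linalg.solve rather than an explicit matrix inverse is the standard numerically stable way to evaluate (X′X)⁻¹X′y.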
Assumptions of OLS regression
- The model is correctly specified and is linear in parameters
- X values are fixed in repeated sampling; Y values are continuous and stochastic
- Each ui is normally distributed with mean E(ui) = 0
- Equal variance of ui (homoscedasticity)
- No autocorrelation: no correlation between ui and uj (i ≠ j)
- Zero covariance between Xi and ui
- No multicollinearity, Cov(Xi, Xj) = 0, in multivariate regression
- Under the assumptions of the CNLRM, the estimates are BLUE
Regression Analysis : Some problems
Autocorrelation: covariance between error terms
- Identification: DW d test, ranging 0-4 (near 2 indicates no autocorrelation)
- Consequences: R² is overestimated; t and F tests are misleading
- Remedies: correctly specify any missed variable; consider an AR scheme
Heteroscedasticity: non-constant variance
- Detection: scatter plot of error terms, Park test, Goldfeld-Quandt test, White test, etc.
Regression Analysis : Some problems
- Consequences: t and F tests are misleading
- Remedial measures include transformation of variables through WLS
Multicollinearity: covariance between the various X variables
- Detection: high R² but insignificant t tests; high pair-wise correlation between explanatory variables
- Consequences: t and F tests are misleading
- Remedies: remove model over-specification, use pooled data, transform variables
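A minimal sketch, assuming statsmodels and simulated data, of the detection tools named above (DW d statistic, White's test, and variance inflation factors for multicollinearity):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_white
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)   # deliberately correlated regressors
y = 1.0 + x1 + x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

print("DW d:", durbin_watson(res.resid))        # near 2 suggests no autocorrelation
lm_stat, lm_pval, f_stat, f_pval = het_white(res.resid, X)
print("White test p-value:", lm_pval)           # a small p-value suggests heteroscedasticity
print("VIFs:", [variance_inflation_factor(X, i) for i in range(1, X.shape[1])])
```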
Model Specification
Sources of misspecification
- Omission of a relevant variable
- Inclusion of unnecessary variables
- Wrong functional form
- Errors of measurement
- Incorrect specification of the stochastic error term
Model Specification Errors: Omitting Relevant Variables and Including Irrelevant Variables
- To properly estimate a regression model, we need to have specified the correct model
- A typical specification error occurs when the estimated model does not include the correct set of explanatory variables
- This specification error takes two forms:
  - Omitting one or more relevant explanatory variables
  - Including one or more irrelevant explanatory variables
- Either form of specification error results in problems with OLS estimates
Model Specification Errors: Omitting Relevant Variables
Example: a two-factor model of stock returns. Suppose that the true model explaining a particular stock's returns is a two-factor model with GDP growth and the inflation rate as factors:
rt = β0 + β1 GDPt + β2 INFt + εt
Suppose instead that we estimated the following model:

rt = β0 + β1 GDPt + εt*

Thus, the error term of this model is actually equal to εt* = β2 INFt + εt. If there is any correlation between the omitted variable (INF) and the explanatory variable (GDP), then there is a violation of the classical assumption Cov(ui, Xi) = 0.
Model Specification Errors: Omitting Relevant Variables
This means that the explanatory variable and the error term are correlated. If that is the case, the OLS estimate of β1 (the coefficient of GDP) will be biased. As in the above example, it is highly likely that there will be some correlation between two financial (or economic) variables. If, however, the correlation is low or the true coefficient of the omitted variable is zero, then the specification error is very small.
- When Cov(X1, X2) ≠ 0, estimates of both the constant and the slope are biased, and the bias persists even in larger samples
- When Cov(X1, X2) = 0, the constant is biased but the slope is unbiased
- The variance of the error is incorrectly estimated; consequently, the variance of the slope estimator is biased
- This leads to misleading conclusions from confidence intervals and hypothesis tests about the statistical significance of the estimated parameters
- Forecasts based on the mis-specified model will therefore be unreliable
Model Specification Errors: Omitting Relevant Variables
To avoid omitted variable bias, a simple solution is to add the omitted variable back to the model; the difficulty is detecting which variable has been omitted. Omitted variable bias is hard to detect, but there can be some obvious indications of this specification error. The best way to detect omitted variable specification bias is to rely on the theoretical arguments behind the model:
- Which variables does the theory suggest should be included?
Model Specification Errors: Omitting Relevant Variables
- What are the expected signs of the coefficients?
- Have we omitted a variable that most other similar studies include in the estimated model?
Note, though, that a significant coefficient with an unexpected sign can also occur due to a small sample size. However, most of the data sets used in empirical finance are large enough that this is most likely not the cause of the specification bias.
Model Specification Errors: Including Irrelevant Variables
Example: going back to the two-factor model, suppose that we include a third explanatory variable in the model, for example, the degree of wage inequality (INEQ). So we estimate the following model:
rt = β0 + β1 GDPt + β2 INFt + β3 INEQt + εt
- The estimated coefficients (both constant and slopes) are unbiased
- The variance of the error term is estimated accurately
Model Specification Errors: Including Irrelevant Variables
However, the coefficient estimators are inefficient. The inclusion of an irrelevant variable (INEQ) in the model increases the standard errors of the estimated coefficients and thus decreases the t-statistics. This implies that it will be more difficult to reject a null hypothesis that the coefficient of one of the explanatory variables is equal to zero.
Model Specification Errors: Including Irrelevant Variables
Also, the inclusion of an irrelevant variable will usually decrease the adjusted R² (but not the R²). An overspecified model is considered a lesser evil compared to an underspecified model, but it brings other problems, such as multicollinearity and loss of degrees of freedom.
Model Specification Criteria
To decide whether an explanatory variable belongs in a regression model, we can test whether most of the following conditions hold:
- The importance of theory: is the decision to include the explanatory variable theoretically sound?
- t-test: is the variable statistically significant, and does it have the expected coefficient sign?
- Adjusted R²: does the overall fit of the model improve when we add the explanatory variable?
- Bias: do the coefficients of the other variables change significantly (sign or statistical significance) when we add the variable to the model?
Problems with Specification Searches
In an attempt to find the "right" or desired model, a researcher may estimate numerous models until one with the desired properties is obtained. The wrong approach to model specification is data mining: estimating every possible model and reporting only those that produce the desired results. The researcher should try to minimize the number of estimated models and guide the selection of variables mainly by theory, not purely by statistical fit.
Sequential Model Specification Searches
In an effort to find the appropriate regression model, it is common to begin with a benchmark (or base) specification and then sequentially add or drop variables. The base specification can rely on theory; variables are then added or dropped based on adjusted R² and t-statistics. In this effort, it is important to follow the principle of parsimony: try to find the simplest model that best fits the data. Make use of the F test for the incremental contribution of variables.
F test for incremental contribution of variables
A very useful test in deciding whether a new variable should be retained in the model. E.g., the return on a stock is a function of GDP growth and inflation in the country; the question is whether to include inflation in the model. Estimate a model without inflation and get R²(old).
F test for incremental contribution of variables
Re-estimate including inflation and get R²(new). Then:
F = [(R²new - R²old) / no. of new parameters] / [(1 - R²new) / (n - knew)]
H0: the addition of the new variable does not improve the model. H1: the addition of the new variable improves the model. If the estimated F is higher than the critical F table value, reject the null hypothesis; in the example above, this means inflation should be included.
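A minimal sketch, assuming statsmodels/scipy and simulated data (the GDP/inflation names are illustrative), of the incremental-contribution F test:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
n = 120
gdp, infl = rng.normal(size=(2, n))
r = 0.3 + 0.8 * gdp - 0.5 * infl + rng.normal(size=n)   # simulated stock returns

r2_old = sm.OLS(r, sm.add_constant(gdp)).fit().rsquared # model without inflation
X_new = sm.add_constant(np.column_stack([gdp, infl]))
r2_new = sm.OLS(r, X_new).fit().rsquared                # model with inflation

m = 1                    # number of new parameters (inflation)
k_new = X_new.shape[1]   # parameters in the new model
F = ((r2_new - r2_old) / m) / ((1 - r2_new) / (n - k_new))
p = stats.f.sf(F, m, n - k_new)
print(F, p)              # small p-value: keep inflation in the model
```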
Nominal vs. True level of Significance
A model derived from data mining should not be assessed at conventional levels of significance (α) such as 1%, 5%, or 10%. If there were c candidate regressors of which k are selected after data mining, the true level of significance (α*) is related to the nominal significance level approximately as α* ≈ (c/k)α, or exactly as:

α* = 1 - (1 - α)^(c/k)

If c = 2, k = 1 and α = 5%, then α* ≈ 10%.
Model Specification: Choosing the Functional Form
One of the assumptions needed to derive the nice properties of OLS estimates is that the estimated model is linear. What if the relationship between two variables is not linear? OLS keeps its nice properties of unbiasedness and minimum variance if we transform the non-linear relationship into a model that is linear in the coefficients. An interesting case: the double-log (log-log) form.
Model Specification: Choosing the Functional Form
Example: a well-known model of nominal exchange rate determination is the Purchasing Power Parity (PPP) model, s = P/P*, where s = nominal exchange rate (e.g. rand/$), P = price level in SA, and P* = price level in the US.
Taking natural logs, we can estimate the following model:

ln(st) = β0 + β1 ln(Pt) + β2 ln(Pt*) + εt
Model Specification: Choosing the Functional Form
Property of the double-log model: the estimated coefficients are elasticities between the dependent and explanatory variables. Example: a 1% change in P will result in a β1% change in the nominal exchange rate (s).
How do we know if we've chosen the right functional form for our model? Check expected coefficient signs, R², t-statistics, and the DW d-statistic.
Model Specification: Choosing the Functional Form
If these are not satisfactory:
- Examine the error terms
- Use economic theory to guide you
We've seen that a linear regression can really fit non-linear relationships:
- Can use logs on the RHS, LHS, or both
- Can use quadratic forms of the x's
- Can use interactions of the x's
How to choose Functional Form
Think about the interpretation. Does it make more sense for x to affect y in percentage terms (use logs) or absolute terms? Does it make more sense for the derivative with respect to x1 to vary with x1 (quadratic), to vary with x2 (interactions), or to be fixed? A sketch of these forms follows.
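A minimal sketch of these alternatives, assuming statsmodels' formula API and simulated data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({"x1": rng.uniform(1, 10, n), "x2": rng.uniform(1, 10, n)})
# Simulated Cobb-Douglas-style data with elasticities 0.7 and 0.3
df["y"] = df.x1 ** 0.7 * df.x2 ** 0.3 * np.exp(rng.normal(scale=0.1, size=n))

loglog = smf.ols("np.log(y) ~ np.log(x1) + np.log(x2)", df).fit()  # slopes are elasticities
quad   = smf.ols("y ~ x1 + I(x1**2)", df).fit()                    # derivative varies with x1
inter  = smf.ols("y ~ x1 * x2", df).fit()                          # derivative varies with x2
print(loglog.params)   # log-log slopes come out close to 0.7 and 0.3
```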
How to choose Functional Form (cont'd)
We already know how to test joint exclusion restrictions to see whether higher-order terms or interactions belong in the model. It can be tedious to add and test extra terms, and we may find that a squared term matters when using logs would really be better. A general test of functional form is Ramsey's regression specification error test (RESET).
DW test for model misspecification
Suppose you suspect that a relevant variable Z (perhaps a polynomial of an existing X) was omitted from the assumed model:
- From the assumed model, obtain the OLS residuals
- Order the residuals according to increasing values of Z
- Compute the d statistic from the residuals thus ordered
- If autocorrelation is detected, the model is misspecified
Ramsey's RESET
Regression Specification Error Test: estimate the assumed model and derive the fitted values ŷ. Then estimate

y = β0 + β1x1 + … + βkxk + δ1ŷ² + δ2ŷ³ + error

and test H0: δ1 = 0, δ2 = 0 using
F = [(R²new - R²old) / no. of new parameters] / [(1 - R²new) / (n - knew)]
If H0 is rejected, the model is mis-specified. Advantage: in RESET you don't have to specify the correct alternative model. Disadvantage: it doesn't help in attaining the right model.
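A minimal sketch of RESET by hand (statsmodels assumed; data are simulated with a true quadratic term, so the test should reject). Recent statsmodels versions also provide statsmodels.stats.diagnostic.linear_reset as a ready-made alternative:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0, 5, size=n)
y = 1 + 2 * x + 0.5 * x ** 2 + rng.normal(size=n)   # true relationship is quadratic

res = sm.OLS(y, sm.add_constant(x)).fit()           # assumed (linear) model
yhat = res.fittedvalues

# Augment with powers of the fitted values and F-test their joint significance
X_aug = sm.add_constant(np.column_stack([x, yhat ** 2, yhat ** 3]))
res_aug = sm.OLS(y, X_aug).fit()
print(res_aug.f_test("x2 = 0, x3 = 0"))             # H0 rejected => misspecification
```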
Lagrange Multiplier Test for Adding a Variable
Model 1 (restricted):      Y = b0 + b1X1
Model 2 (unrestricted):    Y = b0 + b1X1 + b2X2 + b3X3
Obtain the residuals from Model 1 and regress them on all the X's in Model 2, including those in Model 1:

ûi = a0 + a1X1 + a2X2 + a3X3 + vi
Then nR² ~ χ² with degrees of freedom equal to the number of restrictions.
If the estimated chi-square exceeds the critical chi-square, reject the restricted regression.
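A minimal sketch of the LM test (statsmodels/scipy assumed, simulated data; there are two restrictions here because X2 and X3 are excluded from the restricted model):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(5)
n = 150
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 0.5 * x1 + 0.7 * x2 + rng.normal(size=n)

resid = sm.OLS(y, sm.add_constant(x1)).fit().resid   # restricted model: y on X1 only
aux = sm.OLS(resid, sm.add_constant(np.column_stack([x1, x2, x3]))).fit()

lm_stat = n * aux.rsquared
crit = stats.chi2.ppf(0.95, df=2)                    # 2 restrictions
print(lm_stat, crit)    # lm_stat > crit: reject the restricted regression
```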
Nested vs. Non-nested Models
Nested:
Y = a + b1X1 + b2X2 + b3X3 + b4X4
Y = a + b1X1 + b2X2
Specification tests and the restricted F test can be used to test for model specification errors.
Non-nested:
Y = a + b1X1 + b2X2
Y = c0 + c1Z1 + c2Z2
Tests for Non-nested Models
1) Discrimination approach: simply select the better model based on goodness of fit:
R², adjusted R², AIC, SIC (SBC):

R² = ESS / TSS
Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k)
AIC = e^(2k/n) · RSS/n
SIC = n^(k/n) · RSS/n
2) Discerning approach: make use of information provided by other models as well, along with the initial model.
Non-nested Discerning Tests
If the models have the same dependent variable but non-nested X's, we can form a composite model with the X's from both and test the joint exclusion restrictions that lead to one model or the other:

Y = a + b1X1 + b2X2
Y = c0 + c1Z1 + c2Z2
Y = a + b1X1 + b2X2 + c1Z1 + c2Z2

Use the F test with each equation as the reference model in turn:

F = [(R²new - R²old) / no. of new parameters] / [(1 - R²new) / (n - knew)]
Davidson-MacKinnon J test
An alternative, the Davidson-MacKinnon J test, uses the fitted values from one model as a regressor in the second model and tests for their significance:

Model A: Y = a + b1X1 + b2X2
Model B: Y = c0 + c1Z1 + c2Z2

Estimate B, obtain Ŷ_B, and then estimate

Y = a + b1X1 + b2X2 + b3Ŷ_B
Davidson-MacKinnon J test
- Use a t test: if H0: b3 = 0 is not rejected, we accept model A
- Reverse the models and repeat the steps
- This is more difficult if one model uses y and the other uses ln(y); one can follow the same basic logic and transform the predicted ln(y) for the second step
- In any case, the Davidson-MacKinnon test may reject neither or both models rather than clearly preferring one specification
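A minimal sketch of the J test in one direction (statsmodels assumed, simulated data in which model A is the true one; in practice the test is run in both directions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 150
x1, x2, z1, z2 = rng.normal(size=(4, n))
y = 1 + 0.6 * x1 + 0.4 * x2 + rng.normal(size=n)    # model A is true here

# Step 1: fit model B and keep its fitted values
yhat_B = sm.OLS(y, sm.add_constant(np.column_stack([z1, z2]))).fit().fittedvalues

# Step 2: add Y_hat_B to model A and t-test its coefficient (b3)
res = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, yhat_B]))).fit()
print(res.tvalues[-1], res.pvalues[-1])             # insignificant b3: accept model A
```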
Measurement Error
Sometimes we have the variable we want, but we think it is measured with error. Examples: a survey asks "how many hours did you work over the last year?" or "how many weeks did you use child care when your child was young?". The consequences of measurement error in y differ from those of measurement error in x.
Measurement Error: Dependent Variable
y* is not directly measurable; it is measured with error as y = y* + e. Thus we are really estimating

y = β0 + β1x1 + … + βkxk + (u + e)

When will OLS produce unbiased results? Only if E(e) = E(u) = 0 and e is uncorrelated with the xj and with u is β̂ unbiased, but it has larger variances than with no measurement error.
Measurement Error: Explanatory Variable
x1* is not directly measurable; it is measured with error as x1 = x1* + e1, so the measurement error is e1 = x1 - x1*. The true model y = β0 + β1x1* + u becomes y = β0 + β1(x1 - e1) + u, so we are really estimating

y = β0 + β1x1 + (u - β1e1)
Measurement Error: Explanatory Variable
Assume E(e1) = 0, Cov(ei, ej) = 0, and Cov(ei, ui) = 0.
The effect of measurement error on OLS estimates depends on our assumption about the correlation between e1 and x1
If Cov(x1, e1) ≠ 0, OLS estimates are biased and variances are larger. Use proxy or IV variables.
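A minimal simulation sketch (numpy assumed) of classical measurement error in the regressor, showing the resulting attenuation of the OLS slope:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
x_star = rng.normal(size=n)                       # true, unobserved regressor
y = 2.0 + 1.0 * x_star + rng.normal(scale=0.5, size=n)
x_obs = x_star + rng.normal(size=n)               # measured with error e1, var(e1) = 1

slope = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)
# Biased toward zero: roughly var(x*)/(var(x*) + var(e1)) = 0.5 instead of 1.0
print(slope)
```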
Proxy Variables
What if the model is mis-specified because no data are available on an important x variable? It may be possible to avoid omitted variable bias by using a proxy variable. A proxy variable must be related to the unobservable variable, but must be uncorrelated with the error term (Sargan test).
Lagged Dependent Variables
What if there are unobserved variables and you can't find reasonable proxy variables? It may be possible to include a lagged dependent variable to account for omitted variables that contribute to both past and current levels of y. Obviously, you must think past and current y are related for this to make sense.
Missing Data: Is it a Problem?
If any observation is missing data on one of the variables in the model, it can't be used. If data are missing at random, using a sample restricted to observations with no missing values will be fine. A problem can arise if the data are missing systematically, say if high-income individuals refuse to provide income data.
Non-random Samples
If the sample is chosen on the basis of an x variable, then estimates are unbiased. If the sample is chosen on the basis of the y variable, then we have sample selection bias. Sample selection can be more subtle: say we look at wages for workers; since people choose to work, this isn't the same as wage offers.
Outliers
Sometimes an individual observation can be very different from the others and can have a large effect on the outcome. Sometimes this outlier will simply be due to errors in data entry, which is one reason why looking at summary statistics is important. Sometimes the observation will just truly be very different from the others.
Outliers (cont'd)
It is not unreasonable to fix observations where it's clear there was just an extra zero entered or left off, etc. It is also not unreasonable to drop observations that appear to be extreme outliers, although readers may prefer to see estimates with and without the outliers. Stata can be used to investigate outliers.
Model Selection Criteria
A good model should:
- Be data admissible: predictions must be realistic
- Be consistent with theory
- Have weakly exogenous regressors
- Exhibit parameter constancy: values and signs must be consistent
- Exhibit data coherency: white-noise residuals
- Be encompassing
Matrix Approach to OLS
y = Xβ + u

where y is n×1, X is n×(k+1), β is (k+1)×1, and u is n×1. The OLS estimator is

β̂ = (X′X)⁻¹X′Y
Assumptions
- E(u) = 0, where u and 0 are n×1 column vectors, 0 being a null vector
- E(uu′) = σ²I, where I is an n×n identity matrix (homoscedasticity and no autocorrelation)
- The n×k matrix X is non-stochastic
Assumptions
- The rank of X is ρ(X) = k, where k is the number of columns in X and k is less than the number of observations n (no multicollinearity): λx = 0, where λ is a 1×k row vector and x is a k×1 column vector, only when λ = 0
- The u vector has a multivariate normal distribution, i.e. u ~ N(0, σ²I)
R² = (β̂′X′y - nȲ²) / (y′y - nȲ²)

var-cov(β̂) = σ̂² (X′X)⁻¹

σ̂² = û′û / (n - k) = Σûi² / (n - k)
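A minimal sketch (numpy assumed, simulated data) computing these matrix-form quantities:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)
k = X.shape[1]

b = np.linalg.solve(X.T @ X, X.T @ y)              # (X'X)^(-1) X'y
u_hat = y - X @ b                                  # residual vector
sigma2_hat = (u_hat @ u_hat) / (n - k)             # u'u / (n - k)
var_cov = sigma2_hat * np.linalg.inv(X.T @ X)      # var-cov(beta_hat)
r2 = (b @ X.T @ y - n * y.mean() ** 2) / (y @ y - n * y.mean() ** 2)
print(sigma2_hat, np.sqrt(np.diag(var_cov)), r2)
```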