Econometrics II Notes

Chapter One

Regression Analysis with Qualitative Information


• In this chapter, we consider models that may involve not only ratio scale variables (for which
operations such as X1/X2, X1 − X2, and X1 > X2 are meaningful) but also nominal scale variables.

1.1. Describing Qualitative Information


• In regression analysis the dependent variable can be influenced by variables that are essentially
qualitative in nature, such as sex, race, color, religion, nationality, geographical region, political
upheavals, and party affiliation.
• One way we could “quantify” such attributes is by constructing artificial variables that take on values
of 1 or 0, with 1 indicating the presence (or possession) of that attribute and 0 indicating the absence of
that attribute.
• Variables that assume such 0 and 1 values are called dummy/ indicator/ binary/ categorical/
dichotomous variables. Such variables are essentially a device to classify data into mutually
exclusive categories.
• Dummy variables are a data-classifying device in that they divide a sample into various subgroups
based on qualities or attributes and implicitly allow one to run individual regressions for each
subgroup.
• The category that is assigned the value of zero is called the base/ reference/ benchmark group, and
all comparisons are made in relation to the benchmark category.
• The choice of omitted category does not affect the substance of the regression results.
• Dummy variables can be incorporated in regression models just as easily as quantitative variables.

1.2. Dummy as Independent Variables


E.g. 1 (The case of single qualitative variable)
Suppose we want to find out if the average annual salary of public school teachers differs among the
three geographical regions (West, North and South) of a country.
To do this, we can set up the following model:
Yi = β1 + β2D2i + β3D3i + ui
where Yi = (average) salary of public school teacher in state i ($)
D2 = 1 for states in the North; 0 otherwise
D3 = 1 for states in the South; 0 otherwise
Mean salary of public school teachers in the North: E (Yi | D2i = 1, D3i = 0) = β1 + β2
Mean salary of public school teachers in the South: E (Yi | D2i = 0, D3i = 1) = β1 + β3
Mean salary of public school teachers in the West: E (Yi | D2i = 0, D3i = 0) = β1
• The benchmark category is the Western region.
• The intercept value (β1) represents the mean value of the benchmark category.
• The coefficients attached to the dummy variables (standing alone) are known as the differential
intercept coefficients because they tell by how much the intercept of a non-base group
differs from the intercept of the benchmark category. (A Stata sketch of this regression follows.)
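As a concrete illustration, the regression above can be run in Stata. This is a minimal sketch assuming a hypothetical data set with a teacher salary variable salary and a region variable coded 1 = West, 2 = North, 3 = South; the variable names are illustrative, not from the text.

* Sketch: salary on region dummies, with the West (region == 1) as benchmark
generate D2 = (region == 2)      // 1 for states in the North, 0 otherwise
generate D3 = (region == 3)      // 1 for states in the South, 0 otherwise
regress salary D2 D3             // the intercept is the mean salary in the West
* Equivalent factor-variable syntax, declaring level 1 (West) as the base:
regress salary ib1.region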

• But how do we know if these differences are statistically significant?
Ŷi = 26,158.62 − 1734.473D2i − 3264.615D3i
se = (1128.523) (1435.953) (1499.615)
t = (23.1759) (− 1.2078) (− 2.1776)
pval= (0.0000) (0.2330) (0.0349) R2 = 0.0901
By checking if each of the “slope” coefficients is statistically significant.
The p Value, or Exact Level of Significance. Instead of preselecting α at arbitrary levels, such as 1,
5, or 10 percent, one can obtain the p (probability) value, or exact level of significance of a test
statistic.
(The p value is defined as the lowest significance level at which a null hypothesis can be rejected.)
• Therefore, the overall conclusion is that statistically the mean salaries of public school teachers in the
West and the North are about the same but the mean salary of teachers in the South is statistically
significantly lower by about $3265.
• The dummy variables will simply point out the differences, if they exist, but they do not suggest the
reasons for the differences.
Note:
• First, if the regression contains a constant term, the number of dummy variables must be one less
than the number of classes of each qualitative variable. If all categories of a qualitative variable are
incorporated with intercept, there will be perfect (multi) collinearity and regression will be
impossible. This is called dummy variable trap.
➢ There is a way to avoid this trap by introducing as many dummy variables as the number of
categories of that variable, provided we do not introduce the intercept in such a model (see the
sketch below).
Yi = β1D1i + β2D2i + β3D3i + ui
➢ β’s now represent mean salary of teachers in the respective regions.
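A minimal sketch of this option in Stata, using the same hypothetical salary/region data set as above: the ibn. operator keeps every category, and noconstant drops the intercept, so each coefficient is a group mean.

regress salary ibn.region, noconstant   // all three dummies, no intercept: no trap
* With an intercept, regress salary i.region automatically omits one category,
* which is how Stata itself avoids the dummy variable trap.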
• Second, if there is a base group in the model, the coefficients attached to the dummy variables must
always be interpreted in relation to the base, or reference, group. The base chosen will depend on the
purpose of the research at hand.
• Finally, if a model has several qualitative variables with several classes, introduction of dummy
variables can consume a large number of degrees of freedom.
E.g. 2 (The case of multiple qualitative variables)
From a certain sample, the following regression results were obtained for hourly wages in relation to marital
status and region of residence:
Yi = 8.8148 + 1.0997D2i − 1.6729D3i
se = (0.4015) (0.4642) (0.4854)
t = (21.9528) (2.3688) (− 3.4462)
pval = (0.0000) (0.0182) (0.0006) R2 = 0.0322
where Y = hourly wage ($)
D2 = marital status; 1 = married, 0 = otherwise

D3 = region of residence; 1 = South, 0 = otherwise
Implicit in this model is the assumption that the differential effect of the marital status dummy D2 is constant
across the levels of region of residence and, likewise, that the differential effect of the region dummy D3 is
constant across marital status.
i. What is the benchmark category for this model? unmarried, non-South residence
ii.What is the mean hourly wage of the benchmark category? $8.81
iii. Interpret the coefficients. Mean hourly wage of the married = 8.8148 + 1.0997 ≈ $9.91; of South
residents = 8.8148 − 1.6729 ≈ $7.14.
iv.What are the actual hourly wages for the married and in the South? (8.8148 + 1.0997 − 1.6729 =
$8.2416)
v. Are the average hourly wages statistically different compared to the base category? Yes, they are.
E.g. 3 (Regression with a mixture of quantitative and qualitative regressors)
Let’s introduce a quantitative variable for the above regression in example 1.
Yi = β1 + β2D2i + β3D3i + β4Xi + ui
where Xi = spending on public school per pupil ($)
Teacher’s salary in relation to region and spending on public school per pupil
Ŷi = 13,269.11 − 1673.514D2i − 1144.157D3i + 3.2889Xi
se = (1395.056) (801.1703) (861.1182) (0.3176)
t = (9.5115)* (− 2.0889)* (− 1.3286)** (10.3539)* R2 = 0.7266
The constant term in this model is the salary of public school teachers in the West for zero spending
on public school per pupil.
Ceteris paribus, as public expenditure per pupil goes up by one dollar, public school teachers’
salary, on average, goes up by about $3.29.
Consider the simplest example: we have data on individuals’ wages, years of education, and their
gender. We could create two gender dummies, male and female, but we will only need one in the
analysis: say, female.

wage = β0 + β1educ + β2female + u

The constant term in this model now becomes the wage for a male with zero years of education.
Male wages are predicted as β0 + β1educ, while female wages are predicted as β0 + β1educ + β2.
The gender differential is thus β2. How would we test for the existence of “statistical
discrimination” – that, say, females with the same qualifications are paid a lower wage? This is a
one-sided test of H0: β2 = 0 against H1: β2 < 0, and the t-statistic for β2 provides the hypothesis test.
What is this model saying about wage structure? Wages are a linear function of the years of
education. If β2 is significantly different from zero, then there are two “wage profiles” – parallel lines
in {educ, wage} space, each with a slope of β1, with their intercepts differing by β2. (A sketch of the
test follows.)
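A minimal Stata sketch of this test, assuming variables wage, educ, and female as in the equation above (names are illustrative):

regress wage educ female
* One-sided test of H1: beta_female < 0 -- when the estimate is negative,
* halve the reported two-sided p-value on female.
test female                     // two-sided Wald test that beta_female = 0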

What if we wanted to expand this model to consider the possibility that wages differ by both gender
and race? Say that each worker is classified as race=="white" or race=="black". Then we could gen black =
(race=="black") to create the dummy variable, and add it to (3).
What, now, is the constant term? The wage for a white male with zero years of education. Is there a
significant race differential in wages? If so, the coefficient b3, which measures the difference
between white and black wages, ceteris paribus, will be significantly different from zero. In {educ,
wage} space, the model can be represented as four parallel lines, with each intercept labeled by a
combination of gender and race.
What if our racial data classified each worker as white, Black or Asian? Then we would run the
regression:
wage = β0+β1educ+β2female+β3Black+β4Asian+u (4)
or, with factor variables,
regress wage educ female i.race
where the constant term still refers to a white male. In this model, b3 measures the difference
between black and white wages, ceteris paribus, while b4 measures the difference between Asian and
white wages. Each can be examined for significance. But how can we determine whether the
qualitative factor, race, affects wages? That is a joint test, that both β3 = 0 and β4 = 0, and should be
conducted as such. If factor variables were used, we could do this with
testparm i.race
No matter how the equation is estimated, we should not make judgments based on the individual
dummies’ coefficients, but should rather include both race variables if the null is rejected, or remove
them both if it is not. When we examine a qualitative factor, which may give rise to a number of
dummy variables, they should be treated as a group. For instance, we might want to modify (3) to
consider the effect of state of residence:
wage = β0 + β1educ + β2female + Σ(j=2 to 6) γj stj + u (5)
where we include any 5 of the 6 st variables designating the New England states. The test that wage
levels differ significantly due to state of residence is the joint test that γj = 0, j = 2, ..., 6 (or, if factor
variables are used, testparm i.state). A judgment concerning the relevance of state of residence
should be made on the basis of this joint test (an F-test with 5 numerator degrees of freedom).
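A sketch of this subset-F strategy for (5), assuming hypothetical dummies st2–st6 for five of the six states:

regress wage educ female st2 st3 st4 st5 st6
test st2 st3 st4 st5 st6        // joint H0: all five state effects are zero
* With factor variables instead:
* regress wage educ female i.state
* testparm i.state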
Note that if the dependent variable is measured in log form, the coefficients on dummies are
interpreted as (approximate) percentage changes; if (5) were re-specified to place log(wage) as the dependent
variable, the coefficient b1 would measure the proportional return to education (by how many percent
the wage changes for each additional year of education), while the coefficient b2 would measure
the (approximate) percentage difference in wage levels between females and males, ceteris paribus;
the exact difference is 100[exp(b2) − 1] percent. The state dummies would, likewise, measure the
percentage difference in wage levels between that state and the excluded state (state 1).
We must be careful when working with variables that have an ordinal interpretation, and are thus
coded in numeric form, to treat them as ordinal.

• Dummy variables can be used in testing for differences in regression functions across groups.
➢ Dummy variables can tell us whether the difference in the two regressions was because of
differences in the intercept terms or the slope coefficients or both.
➢ When we compare regressions from the two groups, we see that there are four possibilities:
i. Coincident regressions: Both the intercept and the slope coefficients are the same in
the two regressions.
ii. Parallel regressions: Only the intercepts in the two regressions are different but the
slopes are the same.
iii. Concurrent regressions: The intercepts in the two regressions are the same, but the
slopes are different.
iv. Dissimilar regressions: Both the intercepts and slopes in the two regressions are
different.
E.g. 4 The relationship between savings and income in the United States over the period 1970-1995.
Yt = α1 + α2Dt + β1Xt + β2 (Dt Xt) + ut
where Y = savings
X = income
t = time
D = 1 for observations in 1982–1995; 0 otherwise
Mean savings function for 1970–1981: E (Yt | Dt = 0, Xt) = α1 + β1 Xt
Mean savings function for 1982–1995: E (Yt | Dt = 1, Xt) = (α1 + α2) + (β1 + β2) Xt
α2 is the differential intercept and β2 is the differential slope
• Notice how the introduction of the dummy variable D in the interactive, or multiplicative, form (D
multiplied by X) enables us to differentiate between slope coefficients of the two periods, just as the
introduction of the dummy variable in the additive form enabled us to distinguish between the
intercepts of the two periods.
Ŷt = 1.0161 + 152.4786Dt + 0.0803Xt − 0.0655(Dt Xt)
se = (20.1648) (33.0824) (0.0144) (0.0159)
t = (0.0504)** (4.6090)* (5.5413)* ( − 4.0963)* R2 = 0.8819
➢ Both the differential intercept and slope coefficients are statistically significant, strongly
suggesting that the savings–income regressions for the two time periods are different.
Savings–income regression, 1970–1981: Yt = 1.0161 + 0.0803Xt
Savings–income regression, 1982–1995: Yt = (1.0161 + 152.4786) + (0.0803 − 0.0655)Xt
= 153.4947 + 0.0148Xt
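The same break test can be reproduced in Stata as a dummy-variable (Chow-type) test. A sketch, assuming annual variables savings, income, and year (the names are illustrative):

generate D  = (year >= 1982)          // period dummy: 1 for 1982-1995
generate DX = D * income              // interaction: differential slope
regress savings D income DX
test D DX                             // joint H0: no break in intercept or slope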
• Interaction effects involving dummy variables
Just as continuous variables may be interacted in regression equations, so can dummy variables.
E.g. Average hourly earnings in relation to education, gender, and race

Ŷi = − 0.2610 − 2.3606D2i − 1.7327D3i + 0.8028Xi
t = (− 0.2357)** (− 5.4873)* (− 2.1803)* (9.9094)* R2 = 0.2032
where Y = hourly wage in dollars
X = education (years of schooling)
D2 = 1 if female, 0 otherwise
D3 = 1 if nonwhite and non-Hispanic, 0 otherwise
Yi = − 0.26100 − 2.3606D2i − 1.7327D3i + 2.1289D2iD3i + 0.8028Xi
t = (− 0.2357)** (− 5.4873)* (− 2.1803)* (1.7420)** (9.9095)** R2 = 0.2032
(* p value less than 5%, ** p value less than 10%)
Holding the level of education constant, if you add the three dummy coefficients you will obtain
− 1.9644 ( = − 2.3606 − 1.7327 + 2.1289), which means that the mean hourly wage of nonwhite/non-
Hispanic female workers is lower by about $1.96; this lies between − 2.3606 (gender
difference alone) and − 1.7327 (race difference alone).
We might, for instance, have one set of dummies indicating the gender of respondents (female) and
another set indicating their marital status (married). We could regress lwage on these two dummies:
lwage = b0 + b1female + b2married + u
We assume that the two effects, gender and marital status, have independent effects on the dependent
variable. Why? Because this joint distribution is modelled as the product of the marginals. What is
the difference between male and female wages? b1, irrespective of marital status. What is the
difference between unmarried and married wages? b2, irrespective of gender.
If we were to relax the assumption that gender and marital status had independent effects on wages,
we would want to consider their interaction. Since there are only two categories of each variable, we
only need one interaction term, fm, to capture the possible effects. As above, that term could be
generated as a Boolean (noting that & is Stata’s AND operator): gen fm=(female==1) &
(married==1), or we could generate it algebraically, as gen fm=female*married. In either case, it
represents the intersection of the sets. We then add a term, b3fm, to the equation, which then appears
as an additive constant in the lower right cell of the table. Now, if the coefficient on fm is
significantly nonzero, the effect of being female on the wage differs, depending on marital status, and
vice versa. Are the interaction effects important–that is, does the joint distribution differ from the
product of the marginals?
That is easily discerned, since if that is so b3 will be significantly nonzero.
regress wage female married fm
or, with factor variables, we can make use of the factorial interaction operator:
regress wage female married i.female#i.married
or, in an even simpler form,
regress wage i.female##i.married
where the double hash mark indicates the full factorial interaction, including both the main effects of
each factor and their interaction.

Two extensions of this framework come to mind. Sticking with two-way ANOVA (considering two
factors’ effects), imagine that instead of marital status we consider race ={white, Black, Asian}. To
run the model without interactions, we would include two of these dummies in the regression–say,
Black and Asian; the constant term would be the mean wage of a white male (the excluded class).
What if we wanted to include interactions? Then we would define fBlack (= female × Black) and
fAsian (= female × Asian), and include those two regressors as well. The test for the significance of
interactions is now a joint test that these two coefficients are jointly zero.
With factor variables, we can just say
regress wage i.female##i.race
where the factorial interaction includes all race categories, both in levels and interacted with the
female dummy.
A second extension of the interaction concept is far more important: what if we want to consider a
regular regression, on quantitative variables, but want to allow for different slopes for different
categories of observations? Then we create interaction effects between the dummies that define those
categories and the measured variables. For instance,
lwage = b0 +b1female+b2educ+b3 (female × educ)+u
Here, we are in essence estimating two separate regressions in one: a regression for males, with an
intercept of b0 and a slope of b2, and a regression for females, with an intercept of
(b0+ b1) and a slope of (b2+ b3) . Why would we want to do this? We could clearly estimate the two
separate regressions, but if we did that, we could not conduct any tests (e.g. do males and females
have the same intercept? The same slope?). If we use interacted dummies, we can run one regression,
and test all of the special cases of this model which are nested within: that the slopes are the same,
that the intercepts are the same, and the “pooled” case in which we need not distinguish between
males and females. Since each of these special cases merely involves restrictions on this general
form, we can run this equation and then just conduct the appropriate tests.
This can be done with factor variables as
regress wage i.female##c.educ
where we must use the c. operator to tell Stata that educ is to be treated as a continuous variable,
rather than considering all possible levels of that variable in the dataset.
If we extended this logic to include race, as defined above, as an additional factor, we would include
two of the race dummies (say, Black and Asian) and interact each with educ. This would be a model
without interactions, where the effects of gender and race are considered to be independent, but it
would allow us to estimate different regression lines for each combination of gender and race, and
test for the importance of each factor. These interaction methods are often used to test hypotheses
about the importance of a qualitative factor–for instance, in a sample of companies from which we
are estimating their profitability, we may want to distinguish between companies in different
industries, or companies that underwent a significant merger, or companies that were formed within
the last decade, and evaluate whether their expenditures on R&D or advertising have the same effects
across those categories.

All of the necessary tests involving dummy variables and interacted dummy variables may be easily
specified and computed, since models without interacted dummies (or without certain dummies in
any form) are merely restricted forms of more general models in which they appear. Thus, the
standard “subset F” testing strategy that we have discussed for the testing of joint hypotheses on the
coefficient vector may be readily applied in this context.
➢ The use of dummy variables for seasonal adjustment
• Seasonality is regular oscillatory movements in the data usually within a year.
• E.g. Sales and demand for money during holiday times, prices of cash crops right after harvest
• Often it is desirable to remove the seasonal factor, or component, from a time series so that one can
concentrate on the other components. The process of removing the seasonal component from a time
series is known as deseasonalization or seasonal adjustment.
• To deseasonalize a quarterly time series data on variable Y
i. Set up Yt = α1D1t + α2D2t + α3D3t + α4D4t + ut, where the D’s are the dummies, taking a value of 1
in the relevant quarter and 0 otherwise. We are regressing Y effectively on an intercept, except
that we allow for a different intercept for each quarter.
ii. If there is any seasonal effect in a given quarter, that will be indicated by a statistically significant
t value of the dummy coefficient for that quarter. This method of assigning a dummy to each
quarter assumes that the seasonal factor, if present, is deterministic and not stochastic.
iii. The deseasonalized time series is the residual, Yt − Ŷt. The residuals represent the
remaining components of the time series, namely the trend, cycle, and random
components (see the sketch below).
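A sketch of steps (i)–(iii) in Stata, assuming a quarterly series sales with a tsset quarterly date variable qdate (hypothetical names):

generate q = quarter(dofq(qdate))     // quarter number, 1-4
regress sales ibn.q, noconstant       // one dummy per quarter, no intercept
predict sales_sa, residuals           // deseasonalized series: Yt - Yhat_t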
E.g. Seasonality in refrigerator sales
Ŷt = 1222.125D1t + 1467.5D2t + 1569.75D3t + 1160.0D4t
t = (20.3720) (24.4622) (26.1666) (19.3364) R2 = 0.5317
se = 59.99 for all coefficients because each dummy takes only the values 1 and 0 (and each quarter has the same number of observations).
The estimated coefficients represent average sales of refrigerators in each season.
Or
Ŷt = 1222.125 + 245.375D2t + 347.625D3t − 62.125D4t
t = (20.3720)* (2.8922)* (4.0974)* (− 0.7322)** R2 = 0.5318
➢ The deseasonalized time series of refrigerator sales=Yt -Ŷt
Will the picture change if we bring in a quantitative regressor in the model? If a quantitative
variable X (durable goods expenditure) is added to the model
Ŷt = 456.24 + 242.49D2t + 325.26D3t − 86.08D4t + 2.77Xt
t = (2.5593)* (3.6951)* (4.9421)* (− 1.3073)** (4.4496)* R2 = 0.7298
The interesting thing about this equation is that the dummy variables in that model not only
remove the seasonality in Y but also the seasonality, if any, in X.
The Interpretation of Dummy Variables in Semi-logarithmic Regressions
• As noted above, in a semilog regression the coefficient on a dummy is an approximate percentage
difference; the exact percentage difference is 100[exp(β) − 1].
➢ Policy evaluation using dummy variables
• In the simplest case, there are two groups of subjects: control and experimental. The control group
does not participate in the program; the experimental group, or treatment group, does take part in the
program (e.g., receiving a new fertilizer or land certification).

E.g. The dependent variable is hours of training per employee, at the firm level. The variable grant is
a dummy variable equal to one if the firm received a job training grant for 1988 and zero otherwise.
The variables sales and employ represent annual sales and number of employees, respectively.
hrsemp̂ = 46.67 + 26.25 grant + 0.98 log (sales) - 6.07 log (employ)
se = (43.41) (5.59) (3.54) (3.88) R2 =0.237
Controlling for sales and employment, firms that received a grant trained each worker, on average,
26.25 hours more.

1.3. Dummy as Dependent Variable


• In this section we consider models in which the regressand itself is qualitative in nature.
• A limited dependent variable is defined as a dependent variable whose range of values is
substantively restricted.
Labor force participation = f (unemployment rate, average wage rate, education, family income)….
Yes/No
Vote = f (rates of GDP, unemployment, inflation)….Dem/Rep/Lab
• In a model where Y is quantitative, our objective is to estimate its expected, or mean, value given the
values of the regressors.
• In models where Y is qualitative, our objective is to find the probability of something happening.
Hence, qualitative response regression models are often known as probability models.
• The binary response model is a type of limited dependent variable where the qualitative variable
takes either 1 or 0.
• Binary outcome examples
▪ Consumer economics: whether a consumer makes a purchase or not.
▪ Labor economics: whether an individual participates in the labor market or not.
▪ Agricultural economics: whether or not a farmer adopts or uses organic practices,
marketing/production contracts,
▪ Politics: whether an individual votes for one party or the other, etc.
• Binary outcome dependent variable
▪ The decision/choice is whether or not to have, do, use, or adopt.
▪ The dependent variable is a binary response.
▪ It takes on two values: Y = 1 if yes, Y = 0 if no.
• Binary outcome models
▪ Binary outcome models are among the most used in applied economics.
▪ Look at the OLS model: Y = x′β + e.
▪ Binary outcome models estimate the probability that y=1 as a function of the independent
variables.
p = Pr[Y = 1 | x] = F(x′β)
There are three approaches to developing a probability model for a binary (dichotomous) response
variable, depending on the functional form of F(x′β): the linear probability model, the logit model, and
the probit model.

1.3.1 The Linear Probability Model (LPM)
• In the linear probability model, F(x′β) = x′β.
• Assume X = family income and Y = 1 if the family owns a house and 0 if it does not own a house
and consider the following regression:
Yi = β1 + β2Xi + ui
• This model is called a linear probability model (LPM) because
i. the dependent variable is binary
ii. the response probability is linear in the parameters βj
iii. the conditional expectation of Yi given Xi, E (Yi | Xi), can be interpreted as the conditional
probability that the event will occur given Xi, that is, Pr (Yi = 1 | Xi)
• Justification
Assume E(ui) = 0 to obtain unbiased estimators.
E(Yi |Xi ) = β1 + β2Xi
Now, if P = probability that Y = 1 (that is, the event occurs), and (1– P) = probability that Y = 0 (that
is, the event does not occur), the variable Yi has the following (probability) distribution.
Yi Probability
0 1–P
1 P
Total 1
That is, Yi follows the Bernoulli probability distribution. Now, by the definition of mathematical
expectation, we obtain:
E(Yi) = ∑YiPi = 0·(1 − P) + 1·(P) = P, which can be equated with
E(Yi | Xi) = β1 + β2Xi = P
• In most of applications, the primary goal is to explain the effects of Xj on the response probability Pr
(Y = 1).
• In the LPM, βj measures the change in the response probability when Xj increases by one unit:
∂Pr(Yi = 1)/∂Xj = βj. For the OLS regression model, the marginal effects are the coefficients and they do
not depend on x.
• Problems of LPM
➢ Non-normality of ui
The disturbances ui also take only two values (1 − β1 − β2Xi when Yi = 1 and −β1 − β2Xi when Yi = 0),
so they follow the Bernoulli, not the normal, distribution:
ui = Yi − β1 − β2Xi
E(ui) = Pi (1 − β1 − β2Xi) + (1 − Pi)(−β1 − β2Xi) = Pi − β1 − β2Xi, which is zero exactly when
Pi = β1 + β2Xi.
➢ Heteroscedasticity of ui
Var (ui) = E(ui²) = Var (Yi)
Var (Yi) = E(Yi − μ)² = ∑(Yi − μ)² Pi = ∑Yi² Pi − μ²
= 1²·(P) + 0²·(1 − P) − P²
= P − P² = P(1 − P) → heteroscedastic, because P = β1 + β2Xi varies with Xi

Because Var(ui) = Pi (1 − Pi) varies with Xi, the usual OLS standard errors are invalid unless corrected.
➢ Possibility of Ŷi lying outside the 0–1 range
There is no guarantee that Ŷi, the estimator of E(Yi | Xi), will necessarily fulfill this restriction, and
this is the real problem with the OLS estimation of the LPM. There are two ways of solving this
problem: (i) set Ŷi = 0 when Ŷi < 0 and Ŷi = 1 when Ŷi > 1, or (ii) devise techniques that guarantee the
restriction.
➢ Constant marginal effects
The LPM assumes that Pi increases linearly with X. Given the linearity of the model,
∂Pr(Yi = 1)/∂Xj = βj, that is, the marginal effect of X remains constant throughout.
➢ Questionable values of R2 as a measure of goodness of fit
All the Y values will either lie along the X axis or along the line corresponding to 1. Therefore,
generally no LPM is expected to fit such a scatter well.
The LPM estimated by OLS for house ownership:
Ŷi = − 0.9457 + 0.1021Xi
se = (0.1228) (0.0082)
t = (− 7.698) (12.515) R2 = 0.8048
The intercept of − 0.9457 gives the “probability’’ that a family with zero income will own a house.
The slope value of 0.1021 means that for a unit change in income, on average, the probability of
owning a house increases by 0.1021, or about 10 percentage points. (A sketch of this estimation follows.)
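A sketch of the LPM in Stata, assuming a 0/1 home-ownership variable own and family income inc (hypothetical names); robust standard errors guard against the heteroscedasticity built into the LPM:

regress own inc, vce(robust)          // LPM with heteroscedasticity-robust SEs
predict phat                          // fitted "probabilities"
count if phat < 0 | phat > 1          // fitted values outside the [0,1] range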
• If x1 is a binary explanatory variable, β1 is just the difference in the probability of success when x1=
1 and x1 = 0, holding the other xj fixed. In other words, for dummy independent variables, the
marginal effect is expressed in comparison to the base category (x1=0).
e.g., Ŷi = 0.5 + 0.2Xi gives Pr(Yi = 1 | X = 1) = 0.7 and Pr(Yi = 1 | X = 0) = 0.5.
• The limitations of the LPM can be overcome by using more sophisticated response models:
Pr(Yi = 1) = G(x′β), where G(·) is a function taking on values between zero and one: 0 < G(z) < 1 for any
real z.
The two common functional forms are logit model and probit model.

1.3.2 The Logit Model


• For the logit model, F(x′β) is the cdf of the logistic distribution:
F(x′β) = Λ(x′β) = exp(x′β) / [1 + exp(x′β)]
• Let Zi = β1 + β2Xi; then Pi = E(Y = 1 | Xi) = e^Zi / (1 + e^Zi) = 1 / (1 + e^−Zi).
• The predicted probabilities are limited between 0 (as Zi → −∞) and 1 (as Zi → +∞).
• An increase in X increases/decreases the likelihood that Y=1. In other words, an increase in X
makes the outcome of 1 more or less likely.
• We interpret the sign of the coefficient but not the magnitude. The magnitude cannot be interpreted
using the coefficient because different models have different scales of coefficients.

• A standard logistic distribution has a mean of 0 and a variance of π2 /3.
Estimating marginal effects
• The marginal effects reflect the change in the probability of Y = 1 given a one-unit change in an
independent variable Xj. A one-unit increase in Xj leads to an increase of βj Pr(Yi = 1)(1 − Pr(Yi = 1))
in the response probability.
• Marginal effects at the mean: the marginal effects are estimated for the average person in the sample, x̄:
∂Pr(Yi = 1)/∂Xj = βj F′(x̄′β) = βj exp(x̄′β) / [1 + exp(x̄′β)]² = βj Pr(Yi = 1)(1 − Pr(Yi = 1))

• Most papers report marginal effects at the mean. A problem is that there may not be such a person in
the sample.
• Average marginal effects: the marginal effects are estimated as the average of the individual
marginal effects: ∂Pr(Yi = 1)/∂Xj = βj ∑i F′(xi′β) / n

• This is a better approach to estimating marginal effects, and in practice the two ways of estimating
marginal effects produce almost identical results most of the time (see the sketch below).
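Both kinds of marginal effects are available after logit via margins. A sketch using Stata's bundled nlsw88 data, with union membership as the binary outcome; the choice of outcome and regressors is purely illustrative:

sysuse nlsw88, clear
logit union wage grade
margins, dydx(*) atmeans              // marginal effects at the mean
margins, dydx(*)                      // average marginal effects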
• This shows that the rate of change in probability with respect to X involves not only β2 but also the
level of probability from which the change is measured.
• Because Pi is nonlinear not only in X but also in the β’s, this creates an estimation problem, i.e., we
cannot use the familiar OLS procedure to estimate the parameters.
• Probit and logit models are estimated using the maximum likelihood method.
• Odds ratios are estimated with the logistic model.
• Reporting marginal effects instead of odds ratios is more popular in economics.
• The odds ratio or relative risk in a binary response model is defined as 𝑃𝑟 (𝑌𝑖 = 1)/[1 − 𝑃𝑟 (𝑌𝑖 = 1)]
and measures the probability that Y=1 relative to the probability that Y=0.
• If Pi is the probability of owning a house, then (1 − Pi), the probability of not owning a house, is
1 − Pi = 1 / (1 + e^Zi)
so that
Pi / (1 − Pi) = (1 + e^Zi) / (1 + e^−Zi) = e^Zi
• Now Pi / (1− Pi) is simply the odds ratio in favor of owning a house - the ratio of the probability that
a family will own a house to the probability that it will not own a house.
• If this ratio is equal to 1, then both outcomes have equal probability. If this ratio is equal to 2, then
the outcome Yi = 1 is twice as likely as the outcome Yi = 0.
• The odds ratio is always non-negative.
Li = ln[Pr(Yi = 1) / (1 − Pr(Yi = 1))] = β0 + β1X1i + β2X2i

• The log of the odds ratio is not only linear in X but also linear in the parameters; L is called the
logit, and the model above is the logit model.
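In Stata, odds ratios rather than coefficients can be requested directly; a sketch continuing the illustrative union example above:

logit union wage grade, or            // reports exp(b), the odds ratios
* logistic union wage grade           // equivalent command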
Features of the Logit Model
• As P goes from 0 to 1, the logit L goes from - ∞ to + ∞. That is, although the probabilities lie
between 0 and 1, the logits are not so bounded.
• Although L is linear in X, the probabilities themselves are not.

• The interpretation of the logit model is as follows: βi, the slope, measures the marginal effect of Xi
on the log odds ratio in favor of Y = 1. The intercept β0 is the value of the log odds when the Xi’s are zero.
• Given a certain level of Xi’s, if we actually want to estimate not the log odds but the probability that
Y=1 itself, this can be done directly once the estimates of the 𝛽𝑖 ’s are available.
• The LPM assumes that Pi is linearly related to Xi; the logit model assumes that the log of the odds
ratio is linearly related to Xi.
Estimation of the Logit Model
• Probit and logit models are estimated using the maximum likelihood method.
• We distinguish two types of data: (1) individual or micro data and (2) grouped or replicated data.
• For micro data we use the maximum likelihood method to estimate the parameters, with z tests for
individual coefficients and likelihood-ratio (LR) tests for joint hypotheses.
➢ Let Y = 1 if a student’s final grade in an intermediate microeconomics course was A and Y =
0 if the final grade was a B or a C. The grade predictors are grade point average (GPA),
TUCE (score on an examination given at the beginning of the term to test entering
knowledge of macroeconomics), and Personalized System of Instruction (PSI).
Li = β1 + β2 GPAi + β3TUCEi + β4 PSIi + ui

➢ Each slope coefficient in this equation is a partial slope coefficient and measures the change
in the estimated logit for a unit change in the value of the given regressor.
➢ If you take the antilog of the PSI coefficient of 2.3786 you will get 10.7897 (≈ e^2.3786). This
suggests that the odds of getting an A are more than 10 times higher for students who are exposed to
the new method of teaching than for students who are not, other things remaining the same.
➢ To test goodness of fit, use McFadden R² or Count R² = (number of correct predictions) /
(total number of observations). Since the regressand in the logit model takes a value of 1 or
0, if the predicted probability is greater than 0.5 we classify that as 1; if it is less than 0.5,
we classify it as 0.
➢ In binary regressand models what matters is the expected signs of the regression coefficients
and their statistical and/or practical significance.
• For grouped data, for example, assume that corresponding to each income level Xi there are Ni
families, ni among whom are home owners (ni ≤ Ni). Therefore, if we compute P̂i = ni / Ni, that is, the
relative frequency, we can use it as an estimate of the true Pi corresponding to each Xi.

L̂i = ln[P̂i / (1 − P̂i)] = β̂1 + β̂2Xi
➢ It can be shown that if Ni is fairly large and if each observation in a given income class Xi is
distributed independently as a binomial variable, then ui ~ N(0, 1/[Ni Pi (1 − Pi)]). The disturbance
term in the logit model is heteroscedastic. Thus, instead of using OLS we will have to use
weighted least squares (WLS): multiply the OLS equation through by √wi, where wi = Ni Pi (1 − Pi).
➢ E.g. L̂i* = −1.59474 √wi + 0.07862 Xi*, where Xi* = √wi Xi (a regression through the origin, RTO)

se = (0.11046) (0.00539)

t = (−14.43619) (14.56675) R2 = 0.9642

➢ Log odds interpretation: The estimated slope coefficient suggests that for a unit increase in
weighted income, the weighted log of the odds in favor of owning a house goes up by about 0.08
units; or
➢ Odds interpretation: Taking antilog of the estimated logit gives odds ratio.
P̂i / (1 − P̂i) = e^(−1.59474 √wi + 0.07862 Xi*)
= e^(−1.59474 √wi) · e^(0.07862 Xi*)
➢ In general, if you take the antilog of the jth slope coefficient (in case there is more than one
regressor in the model), subtract 1 from it, and multiply the result by 100, you will get the
percent change in the odds for a unit increase in the jth regressor.
➢ e^0.07862 = 1.0817. This means that for a unit increase in weighted income, the (weighted) odds
in favor of owning a house increase by a factor of 1.0817, or by about 8.17%.
➢ OLS or unweighted regression:
𝐿̂∗𝑖 = −1.6587 + 0.0792 𝑋̂𝑖∗
se = (0.0958) (0.0041)
t = (−17.32) (19.11) R2 = 0.9786

Merits of Logit Model

• Logit analysis produces statistically sound results. By allowing for the transformation of a
dichotomous dependent variable to a continuous variable ranging from - ∞ to + ∞, the problem of out
of range estimates is avoided.
• The logit analysis provides results which can be easily interpreted and the method is simple to
analyze.

• It gives parameter estimates which are asymptotically consistent, efficient and normal, so that the
analogue of the regression t-test can be applied.
Demerits of Logit Model
• As in the case of LPM, the disturbance term in logit model is heteroscedastic and therefore, we
should go for WLS.
• Ni has to be fairly large for all Xi; hence, in small samples the estimated results should be
interpreted carefully.
• As in any other regression, there may be problem of multicollinearity if the explanatory variables are
related among themselves.
• As in LPM, the conventionally measured R2 is of limited value to judge the goodness of fit.
Application of Logit Model Analysis
• It can be used to identify factors that affect the adoption of a particular technology say, use of new
seeds, fertilizers, pesticides, etc. on a farm.
• In the field of marketing, it can be used to test the brand preference and brand loyalty for any
product.
• Gender studies can use logit analysis to find out the factors which affect the decision making status
of men/women in a family.

1.3.3 The Probit (Normit) Model


• The term probit is short for “probability unit”.
• For the probit model, F(x′β) is the cdf of the normal distribution:
F(x′β) = Φ(x′β) = ∫₋∞^(x′β) φ(z) dz, where φ(z) = [1/√(2πσ²)] e^(−(z−μ)²/(2σ²)) is the density of a
variable that follows the normal distribution with mean μ and variance σ² (in practice the standard
normal, μ = 0 and σ² = 1, is used, since μ and σ are not separately identified).
• The predicted probabilities are limited between 0 and 1.
• Marginal effect: ∂Pr(Yi = 1)/∂Xj = βj F′(x′β). We interpret both the sign and the magnitude of the
marginal effects.
• Coefficients and marginal effects have the same signs because F′(x′β) > 0.
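A probit counterpart to the earlier logit sketch, again with the illustrative union outcome from nlsw88; the marginal effects, unlike the raw coefficients, are directly comparable across the two models:

probit union wage grade
margins, dydx(*)                      // compare with the logit marginal effects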
Relationship between Logit and Probit Models

• The normal CDF is relatively steeper than the logistic CDF, i.e., the probit curve approaches the
axes more quickly than the logistic curve.

• The probit and logit models produce almost identical marginal effects. Therefore, there is no
compelling reason to choose one over the other. But in practice many researchers choose the logit
model because of its comparative mathematical simplicity.
• In both models, the relative effect of any two continuous independent variables, X1 and X2, is
[∂Pr(Yi = 1)/∂X1] / [∂Pr(Yi = 1)/∂X2] = β1/β2.

Comparison of coefficients
• Coefficients differ among models because of the functional form of the F function.
𝛽𝑙𝑜𝑔𝑖𝑡 ≅ 4𝛽𝑂𝐿𝑆
𝛽𝑝𝑟𝑜𝑏𝑖𝑡 ≅ 2.5𝛽𝑂𝐿𝑆
𝛽𝑙𝑜𝑔𝑖𝑡 ≅ 1.6𝛽𝑝𝑟𝑜𝑏𝑖𝑡

• We should not compare the magnitude of the coefficients among different models. Hence,
comparisons of coefficients across nested models can be misleading because the dependent variable
is scaled differently in each model.
Predicted probabilities
• Predicted probabilities give the model’s predicted Pr(Y = 1) at substantively meaningful values of xk.


• Maximum Likelihood Estimation

Estimation and Hypothesis Testing
[Slides covering estimation of the LPM, the logit, and the probit models, and the associated
hypothesis tests, appeared here.]
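As a compact illustration of what those slides covered, a side-by-side sketch of the three estimators on the same illustrative binary outcome (nlsw88 data; the specification is hypothetical):

sysuse nlsw88, clear
regress union wage grade, vce(robust)   // LPM by OLS
logit   union wage grade                // logit by maximum likelihood
probit  union wage grade                // probit by maximum likelihood
* Individual coefficients: z tests; joint hypotheses: LR tests via lrtest
* after storing the full and restricted models with estimates store.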
Chapter Two

Introduction to Basic Regression Analysis with Time Series Data

2.1 The Nature of Time Series Data

• A time series data set consists of observations on a variable or several variables over time.
• One objective of analysing economic data is to predict or forecast the future values of economic
variables. Because past events can influence future events and lags in behaviour are prevalent in the
social sciences, time is an important dimension in a time series data set.
• The chronological ordering of observations in a time series conveys potentially important
information.
• Economic time series data can rarely be assumed to be independent across time.
• Economic data may be collected on daily, weekly, monthly, quarterly or annual basis.
• We assume that the observations are equally spaced in time.

2.2 Stationary and Non-Stationary Stochastic Processes

• A random variable is a variable whose value is unknown until it is observed; its value is determined
by the outcome of a chance experiment. A random variable is:
➢ discrete if it can take only a finite (or countably infinite) number of values, which can be
counted using the positive integers.
➢ continuous if it can take any real value in an interval on the real number line.
• The theory of random processes was developed in order to explain such fluctuations over time.
• A random process is a collection of random variables defined on a given probability space. A collection
of random variables ordered in time is called a stochastic or random process.
• A random process is described using its expectation (mean), variance, covariance, and correlation
functions.
• Just as we use sample data to draw inferences about a population in cross-sectional data, in time series
we use the realization to draw inferences about the underlying stochastic process.
• A given series can be either stationary or non-stationary. The main difference between these series is the
degree of persistence of shocks.

2.2.1 Stationary Stochastic Processes

• A stochastic process is said to be stationary if its mean and variance are constant over time and the value
of the covariance between the two time periods depends only on the distance or gap or lag between the
two time periods and not the actual time at which the covariance is computed. Such a stochastic process
is known as a weakly stationary, or covariance stationary. Such a time series will tend to return to its
mean (called mean reversion) and fluctuations around this mean (measured by its variance) will have
broadly constant amplitude.
• A time series is strictly stationary if all the moments of its probability distribution are invariant over
time. If, however, the stationary process is normal, the weakly stationary stochastic process is also
strictly stationary, for the normal stochastic process is fully specified by its two moments, the mean and
the variance.
• To explain weak stationarity, let Yt be a stochastic time series with these properties:
Mean: E(Yt) = µ
Variance: var (Yt) = E(Yt − µ)2 = σ2
Covariance: γk = Cov (Yt, Yt-k) = Cov (Yt, Yt+k) = E[(Yt − µ) (Yt+k − µ)]
• As the covariance (autocovariances) are not independent of the units in which the variables are
measured, it is common to standardize it by defining autocorrelations ρk as
ρk = Corr(Yt, Yt−k) = Cov(Yt, Yt−k) / Var(Yt)
• Note that ρ0 = 1, while − 1 ≤ ρk ≤ 1.
• The correlation of a series with its own lagged values is called autocorrelation or serial correlation.
• The autocorrelations considered as a function of k are referred to as the autocorrelation function (ACF).
• From the ACF we can infer the extent to which one value of the process is correlated with previous
values and thus the length and strength of the memory of the process. It indicates how long (and how
strongly) a shock in the process (εt) affects the values of Yt.
• A shock in an MA(p) process affects Yt in p+1 periods only, while a shock in the AR(p) process affects
all future observations with a decreasing effect.
• Autocorrelation and Partial Autocorrelation
• The coefficient of correlation between two values in a time series is called the autocorrelation
function (ACF). For example, the ACF for a time series yt is given by Corr(yt, yt−k).

• This value of k is the time gap being considered and is called the lag. A lag 1 autocorrelation (i.e., k = 1
in the above) is the correlation between values that are one time period apart. More generally, a lag k
autocorrelation is the correlation between values that are k time periods apart.

• The ACF is a way to measure the linear relationship between an observation at time t and the
observations at previous times. If we assume an AR(k) model, then we may wish to only measure the
association between yt and yt−k and filter out the linear influence of the random variables that lie
in between (i.e., yt−1, yt−2, …, yt−(k−1)), which requires a transformation on the
time series. Then by calculating the correlation of the transformed time series we obtain the partial
autocorrelation function (PACF).
• The PACF is most useful for identifying the order of an autoregressive model. Specifically, sample
partial autocorrelations that are significantly different from 0 indicate lagged terms of y that are useful
predictors of yt. To help differentiate between ACF and PACF, think of them as analogues to R²
and partial R² values as discussed previously.
• Graphical approaches to assessing the lag of an autoregressive model include looking at the ACF and
PACF values versus the lag. In a plot of ACF versus the lag, if you see large ACF values and a non-
random pattern, then likely the values are serially correlated. In a plot of PACF versus the lag, the
pattern will usually appear random, but large PACF values at a given lag indicate this value as a possible
choice for the order of an autoregressive model. It is important that the choice of the order makes sense.
For example, suppose you have blood pressure readings for every day over the past two years. You may
find that an AR(1) or AR(2) model is appropriate for modeling blood pressure. However, the PACF may
indicate a large partial autocorrelation value at a lag of 17, but such a large order for an autoregressive
model likely does not make much sense.
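In Stata, the ACF and PACF of a tsset series can be inspected as follows; a sketch, with y and t as placeholder names:

tsset t
corrgram y, lags(20)                  // table of ACF, PACF, and Q statistics
ac y                                  // plot of the autocorrelation function
pac y                                 // plot of the partial autocorrelations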
• Why are stationary time series so important? Because if a time series is non-stationary, we can study its
behaviour only for the time period under consideration. Each set of time series data will therefore be for
a particular episode. As a consequence, it is not possible to generalize it to other time periods. Therefore,
for the purpose of forecasting, such (nonstationary) time series may be of little practical value. Besides,
the classical t tests, F tests, etc. are based on the assumption of stationarity.

2.2.2 Non-Stationary Stochastic Processes

• A non-stationary time series will have a time-varying mean or a time-varying variance or both.
• We call a stochastic process purely random (white noise) if it has zero mean, constant variance, and is
serially uncorrelated.

• A classic example of a non-stationary stochastic process is the random walk model (RWM). Stock
prices or exchange rates often follow a random walk: today’s stock price is equal to yesterday’s stock
price plus a random shock.
• We distinguish two types of random walks: (1) random walk without drift (i.e., no constant term) and (2)
random walk with drift.
➢ Random Walk without Drift: Suppose ut is a white noise error term. Then the series Yt is
said to be a random walk if Yt = Yt−1 + ut (an AR(1) process with ρ = 1).
In general, if the process started at some time 0 with a value of Y0, we have
Yt = Y0 +∑ut. Therefore, E(Yt) = E(Y0 +∑ut) = Y0 and var (Yt) = tσ2.
➢ RWM is characterized by persistence of random shocks and that’s why it is said to have an
infinite memory.
➢ Random Walk with Drift: Yt = δ + Yt-1 + ut where δ is known as the drift parameter.
E(Yt) = Y0 + tδ and var (Yt) = tσ2.
➢ RWM, with or without drift, is a non-stationary stochastic process.
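A short simulation sketch contrasting the two random walks; the seed, drift value, and sample size are arbitrary choices for illustration:

clear
set obs 200
set seed 12345
generate t = _n
tsset t
generate u = rnormal()
generate y = u in 1
replace  y = L.y + u in 2/l           // RWM without drift: Yt = Yt-1 + ut
generate yd = 0.5 + u in 1
replace  yd = 0.5 + L.yd + u in 2/l   // RWM with drift (delta = 0.5)
tsline y yd                           // the drift series trends; both are I(1)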
• Regression of one time series variable on one or more time series variables often can give
nonsensical results. This phenomenon is known as spurious/ meaningless regression. When Yt and
Xt are uncorrelated I(1) processes, the R2 from the regression of Y on X should tend to zero. Yule
showed that (spurious) correlation could persist in nonstationary time series even if the sample is
very large.
• The spurious regression can be easily seen from regressing the first differences of Yt (= ∆Yt) on the
first differences of Xt (= ∆Xt) where R2 is practically zero. One way to guard against it is to find out
if the time series are cointegrated.
• The usual statistical results do not hold for spurious regression when all the regressors are I(1) and
not cointegrated.

2.3 Trend Stationary and Difference Stationary

• Based on the nature of trend, an economic time series can be trend stationary or difference stationary.
A trend stationary time series has a deterministic trend, whereas a difference stationary time series
has a variable, or stochastic, trend. The common practice of including the time or trend variable in a
regression model to detrend the data is justifiable only for trend stationary time series.
• If the trend in a time series is completely predictable and not variable, we call it a deterministic
trend, whereas if it is not predictable, we call it a stochastic trend.
• Consider the following model of the time series Yt

Yt = β1 + β2t + β3Yt-1 + ut , where ut is a white noise error term and where t is time
measured chronologically.
➢ Deterministic trend: If β1 ≠ 0, β2 ≠ 0, β3 = 0, we obtain Yt = β1 + β2t + ut which is called a trend
stationary process. Although the mean of Yt is β1 + β2t, which is not constant, its variance (= σ2) is. If
we subtract the mean of Yt from Yt, the resulting series will be stationary, hence the name trend
stationary. This procedure of removing the (deterministic) trend is called detrending.
➢ Random walk with drift and deterministic trend: If β1 ≠ 0, β2 ≠ 0, β3 = 1, we obtain: Yt = β1 + β2t
+ Yt-1 + ut , which can be seen if we write this equation as ∆Yt = β1 + β2t + ut which means that Yt is
non-stationary.
➢ Deterministic trend with stationary AR (1) component: If β1 ≠ 0, β2 ≠ 0, β3 < 1, then we get Yt =
β1 + β2t + β3Yt-1 + ut which is stationary around the deterministic trend.
➢ Pure random walk: If β1 = 0, β2 = 0, β3 = 1, we get Yt = Yt-1 + ut which is nothing but a RWM
without drift and is therefore non-stationary. If we write this equation as ∆Yt = (Yt – Yt-1) = ut it
becomes stationary. Hence, a RWM without drift is a difference stationary process and we call the
RWM without drift integrated of order 1.
➢ Random walk with drift: If β1 ≠ 0, β2 = 0, β3 = 1, we get Yt = β1 + Yt-1 + ut which is a random walk
with drift and is therefore non-stationary. If we write it as (Yt – Yt-1) = ∆Yt = β1 + ut, this means Yt
will exhibit a positive (β1 > 0) or negative (β1 < 0) trend. Such a trend is called a stochastic trend.
Yt here is a difference stationary process, because the non-stationarity in Yt can be
eliminated by taking first differences of the time series.
➢ If a non-stationary time series has to be differenced d times to make it stationary, that time series is
said to be integrated of order d. A time series Yt integrated of order d is denoted as Yt ∼ I(d).
➢ If a time series Yt is stationary to begin with (i.e., it does not require any differencing), it is said to be
integrated of order zero, denoted by Yt ∼ I(0). Most economic time series are generally I(1).
➢ An I(0) series fluctuates around its mean with a finite variance that does not depend on time, while an
I(1) series wanders widely.

2.4 Tests of Stationarity: The Unit Root Test

• The random walk model is an example of what is known in the literature as a unit root process.
• How do we find out if a given time series is stationary?
• There are several tests of stationarity: graphical analysis, the correlogram test and the unit root test.
But we focus on the last one.
• One can allow for nonzero means by adding an intercept term to the model.
• Let us write the RW without drift as: Yt = ρYt−1 + ut, −1 ≤ ρ ≤ 1.
➢ If ρ is 1, we face what is known as the unit root problem, that is, a situation of non-
stationarity. The name unit root is due to the fact that ρ = 1. Thus the terms non-stationarity,
random walk, and unit root can be treated as synonymous.
➢ If |ρ| < 1, then it can be shown that the time series Yt is stationary.
➢ The above equation can be rewritten as:
Yt – Yt-1 = ρYt-1 – Yt-1 + ut
= (ρ − 1) Yt-1 + ut
∆Yt = δYt-1 + ut where δ = (ρ − 1).
➢ The null hypothesis now becomes δ = 0. If δ = 0, then ρ = 1, that is we have a unit root.
➢ It may be noted that if δ = 0, ∆Yt = (Yt – Yt-1) = ut and since ut is a white noise error term, it is
stationary.
➢ If δ is zero, we conclude that Yt is nonstationary. But if it is negative, we conclude that Yt is
stationary.
• Which test should we use to find out whether the estimated coefficient of Yt−1 is zero or not?
➢ Under the null hypothesis that δ = 0, the t value of the estimated coefficient of Yt-1 does not
follow the t distribution even in large samples; that is, it does not have an asymptotic normal
distribution. Hence, t test can’t be used.
➢ Dickey and Fuller have shown that under the null hypothesis that δ = 0, the estimated t value
of the coefficient of Yt-1 follows the τ (tau) statistic. These authors have computed the critical
values of the tau statistic on the basis of Monte Carlo simulations.

Dickey–Fuller (DF) test

• The DF test is estimated in three different forms.


Yt is a random walk: ∆Yt = δYt-1 + ut
Yt is a random walk with drift: ∆Yt = β1 + δYt-1 + ut

Yt is a random walk with drift around a deterministic trend: ∆Yt = β1 + β2t + δYt-1 + ut where t is the
time or trend variable.
• In each case, the null hypothesis is that δ = 0; that is, there is a unit root.
• Estimate the above models by OLS; divide the estimated coefficient of Yt−1 in each case by its standard
error to compute the (τ) tau statistic.
• If the computed absolute value of the tau statistic (|τ|) exceeds the absolute value of the DF or MacKinnon
critical tau values, we reject the hypothesis that δ = 0, in which case the time series is stationary.
• Note that the critical values of the tau test of the hypothesis that δ = 0 are different for each of the
preceding three specifications of the DF test.
• Before we examine the results, we have to decide which of the three models may be appropriate. We
should rule out the first model because the coefficient of GDPt−1, that is δ, is positive, implying that
ρ > 1.
• E.g. The U.S. GDP time series
∆GDP̂t = 0.00576 GDPt−1
τ = (5.7980)
This can be ruled out because in this case the GDP time series would be explosive, δ > 0 → ρ > 1.
∆GDP̂t = 28.2054 − 0.00136 GDPt−1
τ = (1.1576) (−0.2191) , ρ= 0.9986
Our conclusion is that the GDP time series is not stationary.
∆GDP̂t = 190.3857 + 1.4776t − 0.0603 GDPt−1
τ = (1.8389) (1.6109) (−1.6252) , ρ= 0.9397
Our conclusion is that the GDP time series is not stationary.
Critical values of τ:
                      1%        5%        10%
No constant        −2.5897   −1.9439   −1.6177
With constant      −3.5064   −2.8947   −2.5842
Constant and trend −4.0661   −3.4614   −3.1567
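In Stata, the three DF regressions correspond to options of dfuller (a sketch, with gdp as a placeholder tsset series):

dfuller gdp, noconstant               // pure random walk form
dfuller gdp                           // random walk with drift
dfuller gdp, trend                    // drift plus deterministic trend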

The Augmented Dickey–Fuller (ADF) Test


• The DF test assumed that the error term ut was uncorrelated. For the case in which the ut are
correlated, Dickey and Fuller have developed a test known as the augmented Dickey–Fuller (ADF)
test. This test is conducted by “augmenting” the preceding three equations with lagged values of the
differenced dependent variable, ∆Yt−i.

∆Yt = β1 + β2t + δYt−1 + Σ(i=1 to m) αi ∆Yt−i + εt

where εt is a pure white noise error term. The number of lagged difference terms included must be
enough to make the error term serially uncorrelated.

• In ADF we still test whether δ = 0 and the ADF test follows the same asymptotic distribution as the
DF statistic, so the same critical values can be used.
∆GDP̂t = 234.9729 + 1.8921t − 0.0786 GDPt−1 + 0.3557 ∆GDPt−1
τ = (2.3833) (2.1522) (−2.2152) (3.4647)
• The GDP series is still non-stationary.
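The ADF version adds lagged differences through the lags() option; one lag reproduces the specification above (gdp again a placeholder series):

dfuller gdp, trend lags(1)            // ADF with one lagged difference term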
• In an econometric modelling, the relationship between the dependent variable and the explanatory
variables has been defined either in a form of a static relationship, or in a dynamic relationship.
• A static relationship defines the dependent variable as a function of a set of explanatory variables at
the same point in time. This form of relation is also called “the long run” relationship.
• A dynamic relation involves the non-contemporaneous relationship between the variables. This
relationship defines “the short run” relationship.
• Time series modelling techniques can be classified into three:
➢ Box-Jenkins ARIMA models
➢ Box-Jenkins Multivariate Models
➢ Holt-Winters Exponential Smoothing (single, double, triple)

2.5 Autocorrelation

When error terms are correlated (not independent), the following problems occur when using ordinary least squares (OLS) estimates:

• Regression coefficients are unbiased, but not minimum variance.
• MSE underestimates σ².
• Standard errors of regression coefficients based on OLS underestimate the true standard errors.
• Inflated t and F statistics and artificially narrow confidence intervals.

First-Order Autoregressive Model (AR(1)):

Simple regression: Yt = β0 + β1Xt + εt, t = 1, …, n, with εt = ρεt−1 + ut
ρ = autoregression parameter, with |ρ| < 1
ut ~ N(0, σ²) and independent

Generalizes to multiple regression:

Yt = β0 + β1Xt1 + … + βp−1Xt,p−1 + εt, t = 1, …, n, with εt = ρεt−1 + ut

Properties of errors (assumption regarding ε1 for model consistency):

ε1 ~ N(0, σ²/(1 − ρ²))

ε2 = ρε1 + u2 ⟹ E(ε2) = ρE(ε1) + E(u2) = 0 and σ²(ε2) = ρ²σ²(ε1) + σ² = ρ²[σ²/(1 − ρ²)] + σ² = σ²/(1 − ρ²)

Covariance: σ(ε2, ε1) = σ(ρε1 + u2, ε1) = ρσ²(ε1) = ρσ²/(1 − ρ²)

Correlation: ρ(ε2, ε1) = σ(ε2, ε1)/[σ(ε1)σ(ε2)] = [ρσ²/(1 − ρ²)]/[σ²/(1 − ρ²)] = ρ
Here we present some formal tests and remedial measures for dealing with error autocorrelation.

Durbin-Watson Test
We usually assume that the error terms are independent unless there is a specific reason to think that this is not
the case. Usually violation of this assumption occurs because there is a known temporal component for how the
observations were drawn. The easiest way to assess if there is dependency is by producing a scatterplot of the
residuals versus the time measurement for that observation (assuming you have the data arranged according to a
time sequence order). If the data are independent, then the residuals should look randomly scattered about 0.
However, if a noticeable pattern emerges (particularly one that is cyclical) then dependency is likely an issue.

Recall that if we have a first-order autocorrelation with the errors, then the errors are modeled as:

ϵt = ρϵt−1 + ωt,

where |ρ| < 1 and the ωt ~ iid N(0, σ²). If we suspect first-order autocorrelation with the errors, then a formal test does
exist regarding the parameter ρ. In particular, the Durbin-Watson test is constructed as:

H0: ρ=0

H1: ρ≠0

So the null hypothesis of ρ=0 means that ϵt=ωt, or that the error term in one period is not correlated with the error
term in the previous period, while the alternative hypothesis of ρ≠0 means the error term in one period is either
positively or negatively correlated with the error term in the previous period. Oftentimes, a researcher will
already have an indication of whether the errors are positively or negatively correlated. For example, a regression
of oil prices (in dollars per barrel) versus the gas price index will surely have positively correlated errors. When
the researcher has an indication of the direction of the correlation, then the Durbin-Watson test also
accommodates the one-sided alternatives HA: ρ<0 for negative correlations or HA: ρ>0 for positive correlations
(as in the oil example).
The test statistic for the Durbin-Watson test on a data set of size n is given by:
D = [∑ⁿt=2 (et − et−1)²] / [∑ⁿt=1 et²]

where et = yt − ŷt are the residuals from the ordinary least squares fit. Exact critical values are difficult to obtain,
but tables (for certain significance values) can be used to make a decision. The tables provide a lower and upper
bound, called dL and dU, respectively. In testing for positive autocorrelation, if D<dL then reject H0, if D>dU then
fail to reject H0, or if dL≤D≤dU, then the test is inconclusive. While the prospect of having an inconclusive test
result is less than desirable, there are some programs which use exact and approximate procedures for calculating
a p-value. These procedures require certain assumptions on the data which we will not discuss. One "exact"
method is based on the beta distribution for obtaining p-values.
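A small sketch of the statistic itself, on simulated data with AR(1) errors (ρ = 0.7 is an illustrative assumption). It helps to remember that D ≈ 2(1 − ρ̂), so D near 2 indicates no first-order autocorrelation, while D well below 2 indicates positive autocorrelation.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):                       # build AR(1) errors with rho = 0.7
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + e

res = sm.OLS(y, sm.add_constant(x)).fit()
d = durbin_watson(res.resid)                # = sum((e_t - e_{t-1})^2) / sum(e_t^2)
print(f"D = {d:.3f}")                       # well below 2 here -> positive autocorrelation
```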

Remedial Measures
• Determine whether a missing predictor variable can explain the autocorrelation in the errors.
• Include a linear (trend) term if the residuals show a consistent increasing or decreasing pattern.
• Include seasonal dummy variables if data are quarterly or monthly and residuals show cyclic behavior.
• Use transformed variables that remove the (estimated) autocorrelation parameter (Cochrane-Orcutt and Hildreth-Lu procedures).
• Use first differences.
• Use estimated generalized least squares.

When autocorrelated error terms are found to be present, then one of the first remedial measures should be to
investigate the omission of a key predictor variable. If such a predictor does not aid in reducing/eliminating
autocorrelation of the error terms, then certain transformations on the variables can be performed. We discuss
three transformations which are designed for AR(1) errors. Methods for dealing with errors from an AR(k)
process do exist in the literature, but are much more technical in nature.

Cochrane-Orcutt Procedure

The first of the three transformation methods we discuss is called the Cochrane-Orcutt procedure, which
involves an iterative process (after identifying the need for an AR(1) process):
Estimate ρ for

ϵt=ρϵt−1+ωt

by performing a regression through the origin. Call this estimate r.

Suppose ρ is known: Yt = β0 + β1Xt + εt, with εt = ρεt−1 + ut.

Let Yt′ = Yt − ρYt−1 = (β0 + β1Xt + εt) − ρ(β0 + β1Xt−1 + εt−1)
        = β0(1 − ρ) + β1(Xt − ρXt−1) + (εt − ρεt−1) = β0(1 − ρ) + β1(Xt − ρXt−1) + ut

⟹ Yt′ = β0′ + β1′Xt′ + ut   (a standard simple linear regression with independent errors)

where: Yt′ = Yt − ρYt−1, Xt′ = Xt − ρXt−1, β0′ = β0(1 − ρ), β1′ = β1.

In practice, we need to estimate ρ with a sample-based value r:

Yt′ = Yt − rYt−1, Xt′ = Xt − rXt−1

Fit Ŷ′ = b0′ + b1′X′ and, if the errors are uncorrelated, back-transform to Ŷ = b0 + b1X, where:

b0 = b0′/(1 − r), s(b0) = s(b0′)/(1 − r), b1 = b1′, s(b1) = s(b1′)
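A minimal sketch of the iteration on the AR(1)-error setup used above: each pass re-estimates r from the current residuals by a regression through the origin, transforms the data, and back-transforms the intercept. (statsmodels' GLSAR with iterative_fit automates essentially this loop.)

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal()    # AR(1) errors, rho = 0.7 (assumed setup)
y = 1.0 + 2.0 * x + e

b0, b1 = sm.OLS(y, sm.add_constant(x)).fit().params
for _ in range(10):                          # iterate until r settles
    resid = y - b0 - b1 * x
    r = (resid[1:] @ resid[:-1]) / (resid[:-1] @ resid[:-1])  # regression through origin
    yp, xp = y[1:] - r * y[:-1], x[1:] - r * x[:-1]           # transformed variables
    fit = sm.OLS(yp, sm.add_constant(xp)).fit()
    b0, b1 = fit.params[0] / (1 - r), fit.params[1]           # back-transform b0'
print(f"r = {r:.3f}, b0 = {b0:.3f}, b1 = {b1:.3f}")
```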

2.6 ARIMA Modelling of Time Series Data

• For forecasting,
➢ R2 matters (a lot!)
➢ Omitted variable bias isn’t a problem!

➢ We will not worry about interpreting coefficients in forecasting models
➢ External validity is paramount: the model estimated using historical data must hold into the
(near) future
• A natural starting point for a forecasting model is to use past values of Y (that is, Yt–1 , Yt–2 ,…) to
forecast Yt .
• For psychological, technological, and institutional reasons, a regressand may respond to a
regressor(s) with a lapse of time. Regression models that take into account time lags are known as
dynamic or lagged regression models.
• There are two types of lagged models: distributed-lag and autoregressive. In the former, the current
and lagged values of regressors are explanatory variables. In the latter, the lagged value(s) of the
regressand appear as explanatory variables.
• An autoregression is a regression model in which Yt is regressed against its own lagged values.
➢ The number of lags used as regressors is called the order of the autoregression.
➢ In a first order autoregression, Yt is regressed against Yt–1
➢ In a pth order autoregression, Yt is regressed against Yt–1 ,Yt–2 ,…,Yt–p .
• The ARIMA methodology emphasizes analysing the probabilistic, or stochastic, properties of economic time series on their own, under the philosophy of letting the data speak for themselves.
Box–Jenkins Strategy
i. First examine the series for stationarity. This step can be done by computing the
autocorrelation function (ACF) and the partial autocorrelation function (PACF) or by a
formal unit root analysis.
➢ The joint distribution of all values of Yt is characterized by the so-called autocovariances, the
covariances between Yt and one of its lags, Yt-k. The covariance between Yt and Yt-k depends on k
only, not on time. This reflects the stationarity of the process.
➢ Partial autocorrelation is the correlation between Yt and Yt-k after removing the effect of the
intermediate Y’s.
ii. If the time series is not stationary, difference it one or more times to achieve stationarity.
iii. The ACF and PACF of the stationary time series are then computed to find out if the series is
purely autoregressive (AR) or purely of the moving average (MA) type or a mixture of the
two. At this stage the chosen ARMA(p, q) model is tentative.
• An autoregressive process of order p, an AR(p) process, is given by 𝑦𝑡 = 𝜃1 𝑦𝑡−1 + 𝜃2 𝑦𝑡−2 + ⋯ +
𝜃𝑝 𝑦𝑡−𝑝 + 𝜀𝑡 , where εt is a white noise process and yt = Yt− µ.

• If there is only one lag, this model says that the forecast value of Y at time t is simply some proportion (= θ1) of its value at time (t − 1) plus a random shock at time t; again the Y values are expressed around their mean values.
• A moving average process of order q is defined as 𝑦𝑡 = 𝜀𝑡 + 𝛼1𝜀𝑡−1 + ⋯ + 𝛼𝑞𝜀𝑡−𝑞. In short, a moving average process is simply a linear combination of white noise error terms.
• It is quite likely that Y has characteristics of both AR and MA and is therefore ARMA. Obviously, it
is possible to combine the autoregressive and moving average specification into an ARMA( p, q)
model, which consists of an AR part of order p and an MA part of order q ,
𝑦𝑡 = 𝜃1𝑦𝑡−1 + 𝜃2𝑦𝑡−2 + ⋯ + 𝜃𝑝𝑦𝑡−𝑝 + 𝜀𝑡 + 𝛼1𝜀𝑡−1 + ⋯ + 𝛼𝑞𝜀𝑡−𝑞.
• There are no fundamental differences between autoregressive and moving average processes. The
choice is simply a matter of parsimony.
• If we have to difference a time series d times to make it stationary and then apply the ARMA (p, q)
model to it, we say that the original time series is ARIMA (p, d, q), that is, it is an autoregressive
integrated moving average time series, where p denotes the number of autoregressive terms, d the
number of times the series has to be differenced before it becomes stationary, and q the number of
moving average terms.
iv. The tentative model is then estimated.
v. The residuals from this tentative model are examined to find out if they are white noise. If
they are, the tentative model is probably a good approximation to the underlying stochastic
process. If they are not, the process is started all over again. Therefore, the Box–Jenkins
method is iterative.
vi. The model finally selected can be used for forecasting.
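A compact sketch of steps (i)–(v) on a simulated ARIMA(1,1,0) series; the orders, seed, and thresholds here are illustrative assumptions rather than a fixed recipe.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(4)
dy = np.zeros(300)
for t in range(1, 300):                     # first difference follows an AR(1)
    dy[t] = 0.6 * dy[t - 1] + rng.normal()
y = np.cumsum(dy)                           # so y itself is ARIMA(1,1,0)

d = 0 if adfuller(y)[1] < 0.05 else 1       # (i)-(ii) unit root test; difference if needed
z = np.diff(y, n=d) if d else y

print(pacf(z, nlags=5).round(2))            # (iii) PACF cutting off at lag 1 -> AR(1)

fit = ARIMA(y, order=(1, d, 0)).fit()       # (iv) estimate the tentative model
print(acf(fit.resid, nlags=5).round(2))     # (v) residual ACF near zero -> white noise
```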

2.7 Multivariate Time Series Analysis

VAR

• According to Sims, if there is true simultaneity among a set of variables, they should all be treated on
an equal footing; there should not be any a priori distinction between endogenous and exogenous
variables. It is in this spirit that Sims developed his VAR model.
• It is a truly simultaneous system in that all variables are regarded as endogenous.
• The term autoregressive is due to the appearance of the lagged value of the dependent variable on the
right-hand side and the term vector is due to the fact that we are dealing with a vector of two (or
more) variables.

• In VAR modeling the value of a variable is expressed as a linear function of the past, or lagged,
values of that variable and all other variables included in the model.
• If each equation contains the same number of lagged variables in the system, it can be estimated by
OLS.
Yt = δ0 + ∑ αi Yt−i + ∑ βi Xt−i + ut

Xt = δ1 + ∑ θi Xt−i + ∑ γi Yt−i + vt   (all sums over i = 1, …, k)

where ut and vt are uncorrelated stochastic error terms.


• Before we estimate the above model, we have to decide on the maximum lag length, k.
• One way of deciding this question is to use a criterion like the Akaike or Schwarz information
criteria and choose that model that gives the lowest values of these criteria (prediction errors). There
is no question that some trial and error is inevitable.
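A small sketch of lag selection and estimation with statsmodels' VAR, on two simulated interdependent series (the coefficients in the data-generating process are arbitrary assumptions):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(5)
n = 200
y, x = np.zeros(n), np.zeros(n)
for t in range(1, n):                       # two jointly determined stationary series
    y[t] = 0.5 * y[t - 1] + 0.2 * x[t - 1] + rng.normal()
    x[t] = 0.3 * x[t - 1] + 0.1 * y[t - 1] + rng.normal()
data = pd.DataFrame({"Y": y, "X": x})

model = VAR(data)
print(model.select_order(maxlags=8).summary())  # AIC/SC values for each candidate k
res = model.fit(maxlags=8, ic="aic")            # choose the k with the lowest AIC
print(res.summary())
```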

2.8 Cointegration Analysis

• Cointegration means that despite being individually non-stationary, a linear combination of two or
more time series can be stationary.
• Cointegration of two (or more) time series suggests that there is a long-run, or equilibrium,
relationship between them.

Engle-Granger Test

• Note that the EG test runs a static regression.


• Be aware that the issue of efficient estimation of parameters in cointegrating relationships is quite a
different issue from the issue of testing for cointegration.
• Assume personal consumption expenditure (PCE) and personal disposable income (PDI) are
individually I(1) variables and we regress PCE on PDI.
PCE𝑡 = β1 + β2 PDIt + ut
a) Estimate the error term ut
ut = PCE𝑡 − β1 − β2 PDIt
b) Perform unit root test for the error term
𝑢𝑡 = 𝜌𝑢𝑡−1 + 𝜀𝑡

The null hypothesis in the Engle-Granger procedure is no-cointegration and the alternative is
cointegration. If the test shows that ut is stationary [or I(0)], it means that the linear combination of PCE
and PDI is stationary. If you take consumption and income as two I(1) variables, savings defined as
(income − consumption) could be I(0) and the initial equation is meaningful. In this case we say that the
two variables are cointegrated. If PCE and PDI are not cointegrated, any linear combination of them will
be non-stationary and, therefore, the ut will also be non-stationary.

E.g. PCÊt = −171.4412 + 0.9672 PDIt
      τ = (−7.4808) (119.8712)

Since PCE and PDI are individually non-stationary, there is the possibility that this regression is
spurious.

∆ût = −0.2753 ût−1
  τ = (−3.7791)

The Engle–Granger 1% critical τ value is −2.5899, and so the residuals from the regression of PCE on PDI are I(0). Thus, this regression is not spurious; we call it the static or long-run consumption function, and 0.9672 represents the long-run, or equilibrium, marginal propensity to consume (MPC).
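A sketch of the two-step procedure on simulated cointegrated series (the long-run coefficients below are assumptions of the simulation, not the text's data); statsmodels' coint wraps the same test with the proper Engle–Granger critical values.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(6)
pdi = np.cumsum(rng.normal(size=200)) + 100        # I(1) income series
pce = -5 + 0.9 * pdi + rng.normal(size=200)        # cointegrated by construction

u_hat = sm.OLS(pce, sm.add_constant(pdi)).fit().resid   # step (a): static regression

# step (b): DF regression on the residuals (no constant; residuals have mean zero)
tau = sm.OLS(np.diff(u_hat), u_hat[:-1]).fit().tvalues[0]
print(f"tau on residuals = {tau:.3f}")             # compare with EG critical values

stat, pvalue, _ = coint(pce, pdi)                  # the same two-step test, packaged
print(f"EG statistic = {stat:.3f}, p-value = {pvalue:.4f}")
```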

• We just showed that PCE and PDI are cointegrated; that is, there is a long-term relationship between
the two. Of course, in the short run there may be disequilibrium.
• The Granger representation theorem, states that if two variables Y and X are cointegrated, then the
relationship between the two can be expressed as ECM.
∆𝑃𝐶𝐸𝑡 = 𝛼0 + 𝛼1∆𝑃𝐷𝐼𝑡 + 𝛼2𝑢𝑡−1 + 𝜀𝑡, where ut−1 = PCE𝑡−1 − β1 − β2PDIt−1
This ECM equation states that ∆PCE depends on ∆PDI and also on the equilibrium error term. If the latter is nonzero, then the model is out of equilibrium. If ∆PDI is zero and ut−1 is negative (i.e., PCE is below its equilibrium value), α2ut−1 will be positive (as α2 is expected to be negative), which will cause ∆PCEt to be positive, leading PCEt to rise in period t.
∆PCÊt = 11.6918 + 0.2906 ∆PDIt − 0.0867 ût−1
     t = (5.3249) (4.1717) (−1.6003)


Statistically, the equilibrium error term is zero, suggesting that PCE adjusts to changes in PDI in the
same time period (automatically). One can interpret 0.2906 as the short-run marginal propensity to
consume (MPC).

• The error correction mechanism (ECM) developed by Engle and Granger is a means of reconciling
the short-run behavior of an economic variable with its long-run behavior. The ECM links the long-
run equilibrium relationship implied by cointegration with the short-run dynamic adjustment
mechanism that describes how the variables react when they move out of long-run equilibrium.
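A minimal sketch of estimating the ECM once the long-run residuals are in hand (continuing the simulated PCE/PDI series from the previous sketch):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
pdi = np.cumsum(rng.normal(size=200)) + 100
pce = -5 + 0.9 * pdi + rng.normal(size=200)
u_hat = sm.OLS(pce, sm.add_constant(pdi)).fit().resid   # equilibrium errors

d_pce, d_pdi = np.diff(pce), np.diff(pdi)
X = sm.add_constant(np.column_stack([d_pdi, u_hat[:-1]]))  # ∆PDI_t and u_{t-1}
ecm = sm.OLS(d_pce, X).fit()
print(ecm.params)    # [α0, short-run MPC, α2]; α2 < 0 pulls PCE back to equilibrium
```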

2.9 Granger Causality Test

• Assume the following two-equation VAR model:

Yt = δ0 + ∑ αi Yt−i + ∑ βi Xt−i + ut

Xt = δ1 + ∑ θi Xt−i + ∑ γi Yt−i + vt   (sums over i = 1, …, k)

Given this we can test for the following null hypotheses:

a) Unidirectional causality: if 𝛽𝑖 = 0 and 𝛾𝑖 ≠ 0 or if 𝛽𝑖 ≠ 0 and 𝛾𝑖 = 0


b) Bidirectional causality: if 𝛽𝑖 ≠ 0 and 𝛾𝑖 ≠ 0
c) No causality: if 𝛽𝑖 = 0 and 𝛾𝑖 = 0
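These zero restrictions are joint F tests on the lagged cross terms, which statsmodels packages directly; a sketch where the simulation makes X cause Y but not the reverse:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(8)
n = 200
x, y = np.zeros(n), np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.normal()
    y[t] = 0.4 * y[t - 1] + 0.5 * x[t - 1] + rng.normal()   # lagged X drives Y

data = pd.DataFrame({"Y": y, "X": x})
# H0: the second column does NOT Granger-cause the first, tested for lags 1..maxlag
grangercausalitytests(data[["Y", "X"]], maxlag=2)   # expect rejection: X -> Y
grangercausalitytests(data[["X", "Y"]], maxlag=2)   # expect no rejection: Y does not cause X
```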

Critics of VAR

➢ The VAR model is a-theoretic because it uses little prior information.


➢ Because of its emphasis on forecasting, VAR models are less suited for policy analysis.
➢ Challenge in choosing the appropriate lag length.
➢ In an m-variable VAR model, all the m variables should be (jointly) stationary.

2.10 Diagnostic Tests

Lucas Critique: the estimated parameters are not invariant in the presence of policy changes; that is, the parameters estimated from an econometric model depend on the policy prevailing at the time the model was estimated and will change if the policy changes.

Chapter Three

Introduction to Simultaneous Equation Models

• A simultaneous equation system is one of four important types of equation systems used to specify statistical models in economics. The others are the seemingly unrelated equations system, the recursive equations system, and the block recursive equations system.
• In single equation models an implicit assumption is that the cause-and-effect relationship, if any,
between Y and the X’s is unidirectional: the explanatory variables are the cause and the dependent
variable is the effect.
• However, there are situations where there is a two-way flow of influence among economic variables;
that is, one economic variable affects another economic variable and is, in turn, affected by it. E.g.
Price and quantity in market equilibrium analysis
• The two-way, or simultaneous, relationship between Y and (some of) the X's makes the distinction between dependent and explanatory variables doubtful. Under such circumstances, we need to consider more than one regression equation, one for each interdependent variable, to understand the multi-directional flow of influence among the variables. This is precisely what is done in simultaneous equation models.
• A system describing the joint dependence of variables is called a system of simultaneous equation or
simultaneous equations model. The number of equations in such models is equal to the number of
jointly dependent or endogenous variables involved in the phenomenon under analysis.
• The 3 most important sources that produce a correlation between the error term and an explanatory
variable are: 1) Omission of an important explanatory variable 2) Measurement error in an
explanatory variable 3) Simultaneity (or reverse causation). We will focus on reverse causation.
• And unlike the single-equation models, in the simultaneous-equation models one may not usually
estimate the parameters of a single equation without taking into account information provided by
other equations in the system.

3.1 The Nature of Simultaneous Equation Models

• Economic systems are usually described in terms of the behavior of various economic agents, and the
equilibrium results when these behaviours are reconciled.

• Endogeneity occurs when a theoretical relationship does not fit into the framework of y-on-X
regression, in which we can assume that the y variable is determined by (but does not jointly
determine) X.
• The classic meaning of endogeneity refers to the simultaneity problem where the flow of causality is
not purely from the RHS variables to the LHS variable. In other words, if we think that changes in
the LHS variable may cause changes in a RHS variable or that the LHS variable and a RHS variable
are being jointly determined, then there is simultaneity and we would not expect the error term to be
uncorrelated with the RHS variables.
• One important form of endogeneity is simultaneity. This arises when one or more of the explanatory
variables are jointly determined with the dependent variable, usually through an equilibrium
mechanism.
• Simultaneous Equations Models (SEMs) differ from those considered previously because in each
model there are two or more dependent variables rather than just one.
• When using SEMs each equation in the model should have a ceteris paribus, causal interpretation
and should stand on its own.
• Thus, in the following hypothetical system of equations
𝑌1𝑖 = 𝛽10 + 𝛽12𝑌2𝑖 + 𝛾11 𝑋1𝑖 + 𝑢1𝑖 ………………… (1)
𝑌2𝑖 = 𝛽20 + 𝛽21 𝑌1𝑖 + 𝛾21 𝑋1𝑖 + 𝑢2𝑖 ………………… (2)
where Y1 and Y2 are mutually dependent, or endogenous, variables and X1 is an exogenous variable
and where u1 and u2 are the stochastic disturbance terms, the variables Y1 and Y2 are both stochastic.
• Therefore, unless it can be shown that the stochastic explanatory variable Y2 in (1) is distributed
independently of u1 and the stochastic explanatory variable Y1 in (2) is distributed independently of
u2, application of the classical OLS to these equations individually will lead to inconsistent estimates.
• In this case, the true data generation process is not described by the classical linear regression model;
rather, it is described by a simultaneous equations regression model.
Examples of simultaneous-equation models
a) Demand-and-Supply Model
As is well known, the price P of a commodity and the quantity Q sold are determined by the
intersection of the demand and supply curves for that commodity. Thus, assuming for simplicity that
the demand-and-supply curves are linear and adding the stochastic disturbance terms u1 and u2, we
may write the empirical demand-and-supply functions as
Demand function: 𝑄𝑡𝑑 = 𝛼0 + 𝛼1 𝑃𝑡 + 𝑢1𝑡 , 𝛼1 < 0……. (1)
Supply function: 𝑄𝑡𝑠 = 𝛽0 + 𝛽1 𝑃𝑡 + 𝑢2𝑡 , 𝛽1 > 0………. (2)

Equilibrium condition: 𝑄𝑡𝑑 = 𝑄𝑡𝑠 ……………………….. (3)
• Now it is not too difficult to see that P and Q are jointly dependent variables. If, for example, u1t in
(1) changes because of changes in other variables affecting Qd (such as income, wealth, and tastes),
the demand curve will shift upward if u1t is positive and downward if u1t is negative.
• Similarly, a change in u2t (because of strikes, weather, import or export restrictions, etc.) will shift
the supply curve, again affecting both P and Q. Because of this simultaneous dependence between Q
and P, u1t and Pt in (1) and u2t and Pt in (2) cannot be independent. Therefore, a regression of Q on P
as in (1) would violate an important assumption of the classical linear regression model, namely, the
assumption of no correlation between the explanatory variable(s) and the disturbance term.
b) Consider the following simple Keynesian model of income determination comprised of two
equations: a consumption function, and equilibrium condition
𝐶𝑡 = 𝛽0 + 𝛽1𝑌𝑡 + 𝑢𝑡 …….. (i)
𝑌𝑡 = 𝐶𝑡 + 𝐼𝑡 ………..… (ii)

where C is aggregate consumption; Y is aggregate income; I is exogenous investment; 𝛽0 and 𝛽1 are


parameters; and ut is an error term that summarizes all factors other than Y that influence C (e.g.,
wealth, interest rate). Now, suppose that u increases. This will directly increase C in the
consumption function. However, the equilibrium condition tells us that the increase in C will
increase Y. Therefore, u and Y are positively correlated.

3.2 Simultaneity Bias in OLS

• The crucial assumption for OLS that we make is that the explanatory variables are independent
of the error term. That is, the explanatory variable X is either non-stochastic or, if stochastic
(random), are distributed independently of the stochastic disturbance term: E[XiUi] = 0. In the
SEM, this assumption does not hold.
• What happens if the parameters of each equation are estimated by applying, say, the method of
OLS, disregarding other equations in the system? The least-squares estimators are not only
biased but also inconsistent.
• The bias arising from the application of OLS independently for each equations of the
simultaneous equations model is known as simultaneity bias or simultaneous equation bias.
• A unique feature of simultaneous-equation models is that the endogenous variable (i.e.,
regressand) in one equation may appear as an explanatory variable (i.e., regressor) in another
equation of the system. As a consequence, such an endogenous explanatory variable becomes

stochastic and is usually correlated with the disturbance term of the equation in which it appears
as an explanatory variable.
• To show this, let us revert to the simple Keynesian model of income determination given in
example above.
• Suppose that we want to estimate the parameters of the consumption function (i), assuming that E(ut) = 0, E(ut²) = σ², E(ut ut+j) = 0 (for j ≠ 0), and cov(It, ut) = 0, which are the assumptions of the classical linear regression model. We first show that Yt and ut in (i) are correlated, then prove that β̂1 is a biased estimator of β1, and finally prove that β̂1 is an inconsistent estimator of β1.
Ct = β0 + β1Yt + ut …….. (i)
Yt = Ct + It ………..… (ii)

Substituting (i) into (ii):

Yt = β0 + β1Yt + ut + It
Yt = β0/(1 − β1) + It/(1 − β1) + ut/(1 − β1)
E(Yt) = β0/(1 − β1) + It/(1 − β1)

where use is made of the fact that E(ut) = 0 and that It is exogenous. Hence,

Yt − E(Yt) = ut/(1 − β1)

cov(Yt, ut) = E[Yt − E(Yt)][ut − E(ut)] = E(ut²)/(1 − β1) = σ²/(1 − β1), with 0 < β1 < 1

• As a result, Yt and ut in (i) are expected to be correlated, which violates the assumption of the
classical linear regression model that the disturbances are independent or at least uncorrelated with
the explanatory variables.
• To show that the OLS estimator β̂1 is an inconsistent estimator of β1 because of the correlation between Yt and ut, we proceed as follows:

β̂1 = ∑(Ct − C̄)(Yt − Ȳ)/∑(Yt − Ȳ)² = ∑ct yt/∑yt² = ∑(Ct − C̄)yt/∑yt² = ∑Ct yt/∑yt² − C̄∑yt/∑yt²

β̂1 = ∑Ct yt/∑yt², since ∑yt = 0

Substituting Ct from (i):

β̂1 = ∑(β0 + β1Yt + ut)yt/∑yt² = β0∑yt/∑yt² + β1∑Yt yt/∑yt² + ∑yt ut/∑yt²

β̂1 = β1 + ∑yt ut/∑yt², since ∑yt = 0 and ∑Yt yt/∑yt² = 1

E(β̂1) = β1 + E(∑yt ut/∑yt²) ……….. (iii)

[Note: E(A/B) ≠ E(A)/E(B).] Therefore E(∑yt ut/∑yt²) ≠ 0, so E(β̂1) ≠ β1; that is, β̂1 is biased by an amount equal to E(∑yt ut/∑yt²).

• Now an estimator is said to be consistent if its probability limit, or plim for short, is equal to its true (population) value. Therefore, to show that β̂1 of (iii) is inconsistent, we must show that its plim is not equal to the true β1.

plim(β̂1) = plim(β1) + plim(∑yt ut/∑yt²) = plim(β1) + plim[(∑yt ut/n)/(∑yt²/n)]

• The plim of a constant is the same constant, and plim(A/B) = plim(A)/plim(B). Hence,

plim(β̂1) = β1 + plim(∑yt ut/n)/plim(∑yt²/n)

• This states that the probability limit of β̂1 is equal to the true β1 plus the ratio of the plim of the sample covariance between Y and u to the plim of the sample variance of Y. Now as the sample size n increases indefinitely, one would expect the sample covariance between Y and u to approximate the true population covariance, σ²/(1 − β1), and the sample variance of Y to approximate its population variance, σ²Y. Hence,

plim(β̂1) = β1 + [σ²/(1 − β1)]/σ²Y = β1 + [1/(1 − β1)](σ²/σ²Y)
• Since 0 < β1 < 1 and the variances are positive, β̂1 will overestimate the true β1, and the bias will not disappear no matter how large the sample size.
• To avoid this bias we will use other methods of estimation, such as Indirect Least Squares (ILS) and Two-Stage Least Squares (2SLS).
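The overestimation is easy to see in a small Monte Carlo sketch of the Keynesian model (β0 = 10, β1 = 0.8 and the error variances are illustrative assumptions); with these values the plim formula above gives a bias of [1/(1 − β1)](σ²/σ²Y) = 0.1.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
beta0, beta1 = 10.0, 0.8                   # assumed true structural parameters
n, reps, est = 500, 1000, []

for _ in range(reps):
    inv = rng.normal(50, 10, n)            # exogenous investment I_t
    u = rng.normal(0, 10, n)
    y = (beta0 + inv + u) / (1 - beta1)    # reduced form: Y = (β0 + I + u)/(1 - β1)
    c = beta0 + beta1 * y + u              # consumption function
    est.append(sm.OLS(c, sm.add_constant(y)).fit().params[1])

print(np.mean(est))   # ≈ 0.90 > 0.8: OLS overestimates β1, as the plim formula predicts
```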

Definitions and Basic Concepts

• In the context of simultaneous-equation models, the jointly dependent variables are called endogenous variables, and the variables that are truly non-stochastic (or can be so regarded) are called the exogenous, or predetermined, variables. The values of endogenous variables are determined within the system, while those of exogenous variables are determined outside the system; the values of the endogenous variables are the solution of the equation system. If an explanatory variable is

uncorrelated with the error term it is called an exogenous variable. If an explanatory variable is
correlated with the error term it is called an endogenous variable.
• The predetermined variables are divided into two categories: exogenous, current as well as lagged,
and lagged endogenous. It is up to the model builder to specify which variables are endogenous and
which are predetermined.
• A simultaneous equation regression model has two alternative specifications: structural form and
reduced form.
• Structural equation is an equation that has one or more endogenous right-hand side variables. The
structural, or behavioral, equations portray the structure of an economy or the behavior of an
economic agent (e.g., consumer or producer). Structural parameters are the parameters of a structural
equation.
• A set of assumptions defines the specification of the structural form of a simultaneous equations
regression model. The key assumption is that the error term is correlated with one or more
explanatory variables. We will assume that the error term has constant variance, and the errors are
not correlated within equations. However, we will allow errors to be contemporaneously correlated
across equations.
• From the structural equations one can solve for the endogenous variables and derive the reduced-
form equations and the associated reduced-form coefficients.
• Reduced form equation is an equation for which all right-hand side variables are exogenous.
Reduced form parameters are the parameters of a reduced form equation.
• To illustrate, consider the Keynesian model of income determination
Consumption function: 𝐶𝑡 = 𝛽0 + 𝛽1 𝑌𝑡 + 𝑢𝑡 …….. (i)
Income identity: 𝑌𝑡 = 𝐶𝑡 + 𝐼𝑡 ………..… (ii)

In this model, C (consumption) and Y (income) are the endogenous variables and I
(investment expenditure) is treated as an exogenous variable. Both these equations are
structural equations.

To find the reduced form of the above structural model, we have to express C and Y in terms
of I. If (i) is substituted into (ii), we obtain,

𝑌𝑡 = 𝐶𝑡 + 𝐼𝑡 = 𝛽0 + 𝛽1 𝑌𝑡 + 𝑢𝑡 + 𝐼𝑡

𝑌𝑡 (1 − 𝛽1 ) = 𝛽0 + 𝐼𝑡 + 𝑢𝑡

𝛽0 𝐼𝑡 𝑢𝑡
𝑌𝑡 = + +
1 − 𝛽1 1 − 𝛽1 1 − 𝛽1

Yt = π0 + π1It + wt, where π0 = β0/(1 − β1), π1 = 1/(1 − β1), wt = ut/(1 − β1)

π0 and π1 are the associated reduced-form coefficients. Notice that these reduced-form
coefficients are nonlinear combinations of the structural coefficient(s).

If in the preceding Keynesian model the investment expenditure is increased by, say, $1 and
if the MPC is assumed to be 0.8, then we obtain π1 = 5. This result means that increasing the
investment by $1 will immediately (i.e., in the current time period) lead to an increase in
income of $5, that is, a fivefold increase.

Ct = β0 + β1Yt + ut

Ct = β0 + β1[β0/(1 − β1) + It/(1 − β1) + ut/(1 − β1)] + ut

Ct = β0/(1 − β1) + β1It/(1 − β1) + ut/(1 − β1)

Ct = π2 + π3It + wt, where π2 = β0/(1 − β1), π3 = β1/(1 − β1), wt = ut/(1 − β1)

The reduced-form coefficients are π2 and π3.

Notice an interesting feature of the reduced-form equations. Since only the predetermined
variables and stochastic disturbances appear on the right sides of these equations, and since
the predetermined variables are assumed to be uncorrelated with the disturbance terms, the
OLS method can be applied to estimate the coefficients of the reduced-form equations (the
π’s).

3.3 Problems of Simultaneous Equation Models

• Simultaneous equation models create three distinct problems. These are:

1. Mathematical completeness of the model: any model is said to be (mathematically) complete only
when it possesses as many independent equations as endogenous variables.

2. Identification of each equation of the model: Many times a given set of values of the disturbance terms and exogenous variables yields the same values of the different endogenous variables included in the model, because the equations are observationally indistinguishable. What is needed is that the parameters of each equation in the system be uniquely determined. Hence, certain tests are required to examine the identification of each equation before its estimation.

3. Statistical estimation of each equation of the model: Since application of OLS yield biased and
inconsistent estimates, different statistical techniques are to be developed to estimate the structural
parameters.

3.4 The Identification Problem

• Suppose that we have time series data on Q and P only and no additional information (such as
income of the consumer, price prevailing in the previous period, and weather condition). The
identification problem then consists in seeking an answer to this question: Given only the data on P
and Q, how do we know whether we are estimating the demand function or the supply function?
Alternatively, if we think we are fitting a demand function, how do we guarantee that it is, in fact,
the demand function that we are estimating and not something else?
• The problem of identification is a problem of model formulation that precedes the problem of
estimation.
• The identification problem arises because the same set of data may be compatible with different sets
of structural coefficients, that is, different models. Thus, in the regression of price on quantity only, it
is difficult to tell whether one is estimating the supply function or the demand function,
because price and quantity enter both equations.
• The estimation of the model depends up on the empirical data and the form of the model. If the
model is not in the proper statistical form, it may turn out that the parameters may not be uniquely
estimated even though adequate and relevant data are available. In a language of econometrics, a
model is said to be identified only when it is in unique statistical form to enable us to obtain unique
estimates of its parameters from the sample data.
• To assess the identifiability of a structural equation, one may apply the technique of reduced-form
equations, which expresses an endogenous variable solely as a function of predetermined variables.
• The identification problem asks whether one can obtain unique numerical estimates of the structural
coefficients from the estimated reduced-form coefficients.
• If this can be done, an equation in a system of simultaneous equations is identified. If this cannot be
done, that equation is un- or under-identified.
• An identified equation can be just (or exactly) identified or over identified. In the former case,
unique values of structural coefficients can be obtained; in the latter, there may be more than one

value for one or more structural parameters. The model as a whole is identified if each equation in it
is identified.
Classifying Structural Equations
• Every structural equation can be placed in one of the following three categories.
• Unidentified equation – The parameters of an unidentified equation have no interpretation, because
you do not have enough information to obtain meaningful estimates, and therefore it will not provide
any useful information.
• Consider once again the demand-and-supply model, together with the market-clearing, or
equilibrium.
Demand function: 𝑄𝑡𝑑 = 𝛼0 + 𝛼1 𝑃𝑡 + 𝑢1𝑡 , 𝛼1 < 0……. (1)
Supply function: 𝑄𝑡𝑠 = 𝛽0 + 𝛽1 𝑃𝑡 + 𝑢2𝑡 , 𝛽1 > 0………. (2)
Equilibrium condition: 𝑄𝑡𝑑 = 𝑄𝑡𝑠 ……………………….. (3)
• By the equilibrium condition, we obtain
𝛼0 + 𝛼1 𝑃𝑡 + 𝑢1𝑡 = 𝛽0 + 𝛽1 𝑃𝑡 + 𝑢2𝑡
Pt = (β0 − α0)/(α1 − β1) + (u2t − u1t)/(α1 − β1) = π0 + vt

Qt = (α1β0 − α0β1)/(α1 − β1) + (α1u2t − β1u1t)/(α1 − β1) = π1 + wt

• These reduced-form coefficients contain all four structural parameters, but there is no way in which
the four structural unknowns can be estimated from only two reduced-form coefficients. In general,
to estimate k unknowns we must have k (independent) equations.
• Incidentally, if we run the reduced-form regression, we will see that there are no explanatory
variables, only the constants, and these constants will simply give the mean values of P and Q.
• What all this means is that, given time series data on P (price) and Q (quantity) and no other
information, there is no way the researcher can guarantee whether he or she is estimating the demand
function or the supply function. That is, a given Pt and Qt represent simply the point of intersection
of the appropriate demand-and-supply curves because of the equilibrium condition that demand is
equal to supply.
• There is an alternative and perhaps more illuminating way of looking at the identification problem.
This method tells you whether the structural equation you are checking for identification can be
distinguished from a linear combination of all structural equations in the simultaneous equation
system. Suppose we multiply (1) by λ (0 ≤ λ ≤ 1) and (2) by 1 − λ to obtain the following equations:
λQt = λα0 + λα1Pt + λu1t
(1 − λ)Qt = (1 − λ)β0 + (1 − λ)β1Pt + (1 − λ)u2t
• Adding these two equations gives the following linear combination of the original demand-and-supply equations:
Qt = [λα0 + (1 − λ)β0] + [λα1 + (1 − λ)β1]Pt + [λu1t + (1 − λ)u2t]

𝑄𝑡 = 𝛾0 + 𝛾1 𝑃𝑡 + 𝑤𝑡 ………….. (4)
• The "bogus," or "mongrel," equation (4) is observationally indistinguishable from either (1) or (2) because all three involve a regression of Q on P. Therefore, if we have time series data on P and Q
only, any of (1), (2), or (4) may be compatible with the same data. In other words, the same data
may be compatible with the “hypothesis” (1), (2), or (4), and there is no way we can tell which one
of these hypotheses we are testing.
• For an equation to be identified, that is, for its parameters to be estimated, it must be shown that the
given set of data will not produce a structural equation that looks similar in appearance to the one in
which we are interested. If we set out to estimate the demand function, we must show that the given
data are not consistent with the supply function or some mongrel equation.
• The reason we could not identify this demand function or the supply function was that the same
variables P and Q are present in both functions and there is no additional information.
• Exactly identified equation – The parameters of an exactly identified equation have an
interpretation, because you have just enough information to obtain meaningful estimates.
• An equation belonging to a system of simultaneous equations is identified if it has a unique statistical form, i.e., if no other equation in the system, nor any equation formed by algebraic manipulation of the other equations of the system, contains the same variables as the equation in question.
• In that case there is a unique solution for the structural parameters in terms of the reduced-form parameters.
• Suppose we consider the following demand-and-supply model:
Demand function: 𝑄𝑡𝑑 = 𝛼0 + 𝛼1 𝑃𝑡 + 𝛼2 𝐼𝑡 + 𝑢1𝑡 , 𝛼1 < 0, 𝛼2 > 0……. (1)
Supply function: 𝑄𝑡𝑠 = 𝛽0 + 𝛽1 𝑃𝑡 + 𝑢2𝑡 , 𝛽1 > 0………………………. (2)
where I = income of the consumer, an exogenous variable

Using the market-clearing mechanism, we have

𝛼0 + 𝛼1 𝑃𝑡 + 𝛼2 𝐼𝑡 + 𝑢1𝑡 = 𝛽0 + 𝛽1 𝑃𝑡 + 𝑢2𝑡
Pt = (β0 − α0)/(α1 − β1) + [−α2/(α1 − β1)]It + (u2t − u1t)/(α1 − β1) = π0 + π1It + vt ………… (3)

Qt = (α1β0 − α0β1)/(α1 − β1) + [−α2β1/(α1 − β1)]It + (α1u2t − β1u1t)/(α1 − β1) = π2 + π3It + wt ………… (4)

Since (3) and (4) are both reduced-form equations, the OLS method can be applied to estimate their
parameters. There are only four equations (reduced-form coefficients) to estimate five structural
coefficients. Hence, unique solution of all the structural coefficients is not possible.

The parameters of the supply function can be identified (estimated) because β0 = π2 − β1π0 and β1 = π3/π1. But there is no unique way of estimating the parameters of the demand function; therefore, the demand function remains under-identified.

To verify that the demand function (1) cannot be identified (estimated), let us multiply it by λ (0 ≤ λ ≤ 1) and (2) by 1 − λ and add them up to obtain the following "mongrel" equation:

Qt = [λα0 + (1 − λ)β0] + [λα1 + (1 − λ)β1]Pt + λα2It + [λu1t + (1 − λ)u2t]

𝑄𝑡 = 𝛾0 + 𝛾1 𝑃𝑡 + 𝛾2 𝐼𝑡 + 𝑤𝑡 ………….. (5)
Equation (5) is observationally indistinguishable from the demand function (1) although it is
distinguishable from the supply function (2), which does not contain the variable I as an explanatory
variable. Hence, the demand function remains unidentified.
• Note that it is the presence of an additional variable in the demand function that enables us to
identify the supply function. Why? The inclusion of the income variable in the demand equation
provides us some additional information about the variability of the function. Very often the
identifiability of an equation depends on whether it excludes one or more variables that are included
in other equations in the model.
• Suppose
Demand function: 𝑄𝑡𝑑 = 𝛼0 + 𝛼1 𝑃𝑡 + 𝛼2 𝐼𝑡 + 𝑢1𝑡 , 𝛼1 < 0, 𝛼2 > 0…………. (1)
Supply function: 𝑄𝑡𝑠 = 𝛽0 + 𝛽1 𝑃𝑡 + 𝛽2 𝑃𝑡−1 + 𝑢2𝑡 , 𝛽1 > 0…………………. (2)
Thus, we have six equations (reduced-form coefficients) in six unknowns (structural coefficients),
and normally we should be able to obtain unique estimates. Therefore, the parameters of both the
demand-and-supply equations can be identified, and the system as a whole can be identified.
• Over identified equation – The parameters of an over identified equation have an interpretation,
because you have more than enough information to obtain meaningful estimates.
• Suppose
Demand function: 𝑄𝑡𝑑 = 𝛼0 + 𝛼1 𝑃𝑡 + 𝛼2 𝐼𝑡 + 𝛼3 𝑅𝑡 + 𝑢1𝑡 , 𝛼1 < 0, 𝛼2 > 0.
Supply function: 𝑄𝑡𝑠 = 𝛽0 + 𝛽1 𝑃𝑡 + 𝛽2 𝑃𝑡−1 + 𝑢2𝑡 , 𝛽1 > 0
where in addition to the variables already defined, R represents wealth.

Equating demand to supply, we obtain the following equilibrium price and quantity:

Pt = (β0 − α0)/(α1 − β1) + [−α2/(α1 − β1)]It + [−α3/(α1 − β1)]Rt + [β2/(α1 − β1)]Pt−1 + (u2t − u1t)/(α1 − β1)

Pt = π0 + π1It + π2Rt + π3Pt−1 + vt

Qt = (α1β0 − α0β1)/(α1 − β1) + [−α2β1/(α1 − β1)]It + [−α3β1/(α1 − β1)]Rt + [α1β2/(α1 − β1)]Pt−1 + (α1u2t − β1u1t)/(α1 − β1)

Qt = π4 + π5It + π6Rt + π7Pt−1 + wt

• The number of equations (eight reduced-form coefficients) is greater than the number of unknowns (seven structural coefficients). As a result, unique estimation of all the parameters of our model is not possible: β1 = π6/π2 or β1 = π5/π1; that is, there are two estimates of the price coefficient in the supply function, and there is no guarantee that these two values or solutions will be identical.
• Compared with the previous model here we have “too much,” or an oversufficiency of information,
to identify the supply curve. In other words, in this model we put “too many” restrictions on the
supply function by requiring it to exclude more variables than necessary to identify it.
• In general, an equation (or system) is said to be just identified if NRFP = NSP, over-identified if NRFP > NSP, and under-identified if NRFP < NSP, where NRFP is the number of reduced-form parameters and NSP is the number of structural parameters.
• Using the above procedure, we can check identification problems easily if we have two or three
equations in a given simultaneous equation model. However, for ‘n’ equations simultaneous
equation model, such a procedure is very cumbersome. In general for any number of equations in a
given simultaneous equation, we have two conditions that need to be satisfied to say that the model is
in general identified or not.
• Identification may be established either by the examination of the specification of the structural
model, or by the examination of the reduced form of the model. The structural form approach is
simpler and more useful.
• In applying the identification rules we can ignore the constant term without affecting the
identification result.
• There are two conditions which must be fulfilled for an equation to be identified: order and rank
conditions.

Order Condition

• The order condition is a simple counting rule of the variables included and excluded from the
particular equation. It is a necessary but not sufficient condition for the identification of an equation,

that is, it may be fulfilled in any particular equation and yet the relation may not be identified. (The
term order refers to the order of a matrix.)
• Define the following:
K = total number of variables (endogenous and exogenous) excluded from the equation being
checked for identification, i.e., total number of variables in the model minus total number of
variables included in the equation.
G = total number of endogenous variables in the model (i.e., in all equations that comprise the
model).
• The order condition is as follows:
If K < G – 1, the equation is unidentified.

If K = G – 1, the equation is exactly identified.

If K > G – 1, the equation is over identified.

This is known as the order condition for identification. It is a necessary but not sufficient condition
for the identification status of an equation.
E.g. 1

𝑄𝑡𝑑 = 𝛼0 + 𝛼1 𝑃𝑡 + 𝑢1𝑡
𝑄𝑡𝑠 = 𝛽0 + 𝛽1 𝑃𝑡 + 𝑢2𝑡
• Neither equation is identified: the order condition for identifying each equation requires that at least one exogenous variable be excluded from it, and here none is (K = 0 < G − 1 = 1).

E.g. 2

𝑄𝑡𝑑 = 𝛼0 + 𝛼1 𝑃𝑡 + 𝛼2 𝐼𝑡 + 𝑢1𝑡
𝑄𝑡𝑠 = 𝛽0 + 𝛽1 𝑃𝑡 + 𝑢2𝑡
• The demand function is unidentified. On the other hand, the supply function is just identified.

E.g. 3

𝑄𝑡𝑑 = 𝛼0 + 𝛼1 𝑃𝑡 + 𝛼2 𝐼𝑡 + 𝑢1𝑡
𝑄𝑡𝑠 = 𝛽0 + 𝛽1 𝑃𝑡 + 𝛽2 𝑃𝑡−1 + 𝑢2𝑡
• Each equation is just identified.

E.g. 4

𝑄𝑡𝑑 = 𝛼0 + 𝛼1 𝑃𝑡 + 𝛼2 𝐼𝑡 + 𝛼3 𝑅𝑡 + 𝑢1𝑡
𝑄𝑡𝑠 = 𝛽0 + 𝛽1 𝑃𝑡 + 𝛽2 𝑃𝑡−1 + 𝑢2𝑡
• The demand function is exactly identified. But the supply function is over identified.
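The counting rule is mechanical enough to code; a tiny sketch (the variable names are hypothetical labels) that reproduces E.g. 4:

```python
def order_condition(eq_vars, all_vars, G):
    """K = variables in the model but excluded from this equation; G - 1 rule."""
    K = len(set(all_vars) - set(eq_vars))
    if K < G - 1:
        return "unidentified"
    return "exactly identified" if K == G - 1 else "over identified"

# E.g. 4: model variables {Q, P, I, R, P(-1)}; G = 2 endogenous variables (Q, P)
all_vars = {"Q", "P", "I", "R", "P_lag"}
print(order_condition({"Q", "P", "I", "R"}, all_vars, G=2))    # demand: K=1 -> exactly
print(order_condition({"Q", "P", "P_lag"}, all_vars, G=2))     # supply: K=2 -> over
```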
• Identification of an equation in a model of simultaneous equations is possible if that equation
excludes one or more variables that are present elsewhere in the model. This situation is known as
the exclusion (of variables) criterion, or zero restrictions criterion (the coefficients of variables
not appearing in an equation are assumed to have zero values). The zero restrictions criterion is
based on a priori or theoretical expectations that certain variables do not appear in a given equation.

Rank Condition

• The rank condition is both a necessary and sufficient condition for identification.
• The term rank refers to the rank of a matrix and is given by the largest-order square matrix
(contained in the given matrix) whose determinant is nonzero.
• The rank condition states that: in a system of G equations any particular equation is identified if and
only if it is possible to construct at least one non-zero determinant of order (G-1) from the
coefficients of the variables excluded from that particular equation but contained in the other
equations of the model.
• The procedure is as follows:
1. Construct a matrix for which each row represents one equation and each column represents
one variable in the simultaneous equations model.
2. If a variable occurs in an equation, mark it with its coefficient. If a variable does not occur in
an equation, mark it with a zero.
3. Delete the row for the equation you are checking for identification. Strike out the columns in
which a non-zero coefficient of the equation being examined appears.
4. The entries left in the table will then give only the coefficients of the variables included in the
system but not in the equation under consideration. Form a new matrix from the columns that
correspond to the elements that have zeros in the row that you deleted.
5. For this new matrix, if you can find at least one non-zero determinant of order (G − 1), the equation is identified; if you cannot, the equation is unidentified.
• The rank condition tells us whether the equation under consideration is identified or not, whereas the order condition tells us whether it is exactly identified or over-identified.
• For example let a structural model be:
𝑦1 = 3𝑦2 − 2𝑥1 + 𝑥2 + 𝑢1
𝑦2 = 𝑦3 + 𝑥3 + 𝑢2

𝑦3 = 𝑦1 − 𝑦2 − 2𝑥3 + 𝑢3
where the y’s are the endogenous variables and the x’s are the predetermined variables. This
model may be rewritten in the form
−𝑦1 + 3𝑦2 + 0𝑦3 − 2𝑥1 + 𝑥2 + 0𝑥3 + 𝑢1 = 0
0𝑦1 − 𝑦2 + 𝑦3 + 0𝑥1 + 0𝑥2 + 𝑥3 + 𝑢2 = 0
𝑦1 − 𝑦2 − 𝑦3 + 0𝑥1 + 0𝑥2 − 2𝑥3 + 𝑢3 = 0
Ignoring the random disturbance, the table of the parameters of the model is as follows:

            y1   y2   y3   x1   x2   x3
1st eq.     −1    3    0   −2    1    0
2nd eq.      0   −1    1    0    0    1
3rd eq.      1   −1   −1    0    0   −2
Assume we want to examine the identifiability of the second equation of the model.

Table of structural parameters:

            y1   y2   y3   x1   x2   x3
1st eq.     −1    3    0   −2    1    0
2nd eq. →    0   −1    1    0    0    1
3rd eq.      1   −1   −1    0    0   −2

Table of parameters of the variables excluded from the 2nd equation:

            y1   x1   x2
1st eq.     −1   −2    1
3rd eq.      1    0    0
If at least one of determinants of order (G-1) is non-zero, the equation is identified. If all the
determinants of order (G-1) are zero, the equation is under identified.
A = |−2 1; 0 0| = 0,  B = |−1 1; 1 0| = −1 ≠ 0,  C = |−1 −2; 1 0| = 2 ≠ 0
(rows of each 2×2 determinant separated by semicolons)
Hence, the second equation of our system is identified. To see whether the equation is exactly identified or over-identified we use the order condition: K (= 3) > G − 1 (= 2). Therefore, the second equation of the model is over-identified.
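The same check in NumPy: pull out the coefficients of the excluded variables and look for a non-zero minor of order G − 1 (equivalently, rank G − 1). A minimal sketch for the model above:

```python
import numpy as np
from itertools import combinations

# rows: equations; columns: y1, y2, y3, x1, x2, x3
A = np.array([[-1,  3,  0, -2,  1,  0],
              [ 0, -1,  1,  0,  0,  1],
              [ 1, -1, -1,  0,  0, -2]], dtype=float)

eq, G = 1, 3                                    # check the 2nd equation
excluded = np.where(A[eq] == 0)[0]              # columns y1, x1, x2
B = np.delete(A, eq, axis=0)[:, excluded]       # their coefficients elsewhere

dets = [np.linalg.det(B[:, list(c)])            # all (G-1)x(G-1) minors
        for c in combinations(range(B.shape[1]), G - 1)]
print(np.round(dets, 2))                        # two non-zero minors, as found above
print("identified:", np.linalg.matrix_rank(B) == G - 1)   # True
```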

Example:
Consider the following simplified model of the economy (using β's for the consumption function, α's for the investment function, and λ's for the money-market equation):

Ct = β0 + β1Yt + β2Ct−1 + u1t   (consumption)
It = α0 + α1rt + α2It−1 + u2t   (investment)
rt = λ0 + λ1Yt + λ2Mt + u3t    (money market)
Yt = Ct + It + Gt              (income identity)

This model may be re-written as:

−Ct + β0 + β1Yt + β2Ct−1 + 0It + 0rt + 0It−1 + 0Mt + 0Gt + u1t = 0
0Ct + α0 + 0Yt + 0Ct−1 − It + α1rt + α2It−1 + 0Mt + 0Gt + u2t = 0
0Ct + λ0 + λ1Yt + 0Ct−1 + 0It − rt + 0It−1 + λ2Mt + 0Gt + u3t = 0
Ct + 0 − Yt + 0Ct−1 + It + 0rt + 0It−1 + 0Mt + Gt + 0 = 0
Note that the coefficient of a variable excluded from an equation is equal to zero. Ignoring the random disturbances and the constants, the table of the parameters of the model is as follows:

                  Ct   Yt   Ct−1   It   rt   It−1   Mt   Gt
consumption       −1   β1    β2     0    0     0     0    0
investment         0    0     0    −1   α1    α2     0    0
money market       0   λ1     0     0   −1     0    λ2    0
income identity    1   −1     0     1    0     0     0    1
Now suppose we want to check the identification status of the consumption function.
a) We eliminate the row corresponding to the consumption function.
b) We eliminate the columns in which the consumption function has non-zero coefficients.
The two steps are shown below:

Note that by doing steps (a) and (b) above, we are left with the coefficients of variables not included in the
consumption function, but contained in the other equations of the system.

After eliminating the relevant row and columns, we get the following table (matrix) of parameters:

        It   rt   It−1   Mt   Gt
       −1   α1    α2     0    0   …………… (*)
        0   −1     0    λ2    0
        1    0     0     0    1

Since the system has G = 4 equations, form the determinants of order (G-1) = 3 and examine their value.
• If at least one of these determinants is non-zero, then the consumption equation is (exactly or over)
identified.
• If all determinants of order 3 are zero, then the consumption equation is under identified.
For example, taking the columns for It, rt and It−1:

Δ1 = det[−1 α1 α2; 0 −1 0; 1 0 0] = α2 ≠ 0   (expanding along the first row; the first two cofactors vanish)

or, taking the columns for It−1, Mt and Gt:

Δ2 = det[α2 0 0; 0 λ2 0; 0 0 1] = α2λ2 ≠ 0

Thus, we can form at least one non-zero determinant of order 3, and hence, the consumption equation is exactly or over identified.

To see whether the consumption equation is exactly or over identified, we can use the order condition. Since we have four endogenous variables (Ct, It, Yt, rt), G = 4. As can be seen from table (*) above, the variables It, rt, It−1, Mt, Gt are missing from the consumption equation, meaning K = 5. Thus, the equation is over identified since K = 5 > G − 1 = 3.

Simultaneity Test

• A test of simultaneity is essentially a test of whether (an endogenous) regressor is correlated with the
error term. If it is, the simultaneity problem exists, in which case alternatives to OLS must be found;
if it is not, we can use OLS. To find out which is the case in a concrete situation, we can use
Hausman’s specification error test.
• If there is no simultaneity problem, the OLS estimators are consistent and efficient. On the other hand, if there is simultaneity, OLS estimators are not even consistent. In the presence of simultaneity, the method of two-stage least squares (2SLS) gives estimators that are consistent and efficient. However, if we apply 2SLS when there is in fact no simultaneity, it yields estimators that are consistent but not efficient.
• Although in practice deciding whether a variable is endogenous or exogenous is a matter of
judgment, one can use the Hausman specification test to determine whether a variable or group of
variables is endogenous or exogenous.
Hausman Specification Test
• Consider the following two-equation model:
𝑄𝑡𝑑 = 𝛼0 + 𝛼1 𝑃𝑡 + 𝛼2 𝐼𝑡 + 𝛼3 𝑅𝑡 + 𝑢1𝑡
𝑄𝑡𝑠 = 𝛽0 + 𝛽1 𝑃𝑡 + 𝑢2𝑡
Assume that I and R are exogenous.
H0= no simultaneity problem.
Now consider the supply function. If there is no simultaneity problem (i.e., P and Q are mutually
independent), Pt and u2t should be uncorrelated. On the other hand, if there is simultaneity, Pt and u2t
will be correlated. To find out which is the case, the Hausman test proceeds as follows:
Step 1. Regress (reduced-form) Pt on It and Rt to obtain v̂t using OLS.
𝑃𝑡 = 𝜋0 + 𝜋1 𝐼𝑡 + 𝜋2 𝑅𝑡 + 𝑣𝑡
𝑃𝑡 = 𝑃̂𝑡 + 𝑣̂𝑡
Step 2. Regress Qt on P̂t and v̂t
𝑄𝑡 = 𝛽0 + 𝛽1 𝑃𝑡 + 𝑢2𝑡

𝑄𝑡 = 𝛽0 + 𝛽1 (𝑃̂𝑡 + 𝑣̂𝑡 ) + 𝑢2𝑡
𝑄𝑡 = 𝛽0 + 𝛽1 𝑃̂𝑡 + 𝛽1 𝑣̂𝑡 + 𝑢2𝑡
Step 3. Perform a t test on the coefficient of v̂t. If the coefficient of v̂t turns out to be statistically insignificant (zero), we conclude that there is no simultaneity problem (do not reject H0).
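A sketch of the three steps on simulated data, using the equivalent control-function form (regress Qt on Pt and v̂t; the t test on v̂t is then the test of H0). The structural coefficients in the simulation are arbitrary assumptions used only to generate data.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 300
inc, wealth = rng.normal(size=n), rng.normal(size=n)   # exogenous I and R
u1, u2 = rng.normal(size=n), rng.normal(size=n)

# demand: Q = 10 - P + 0.5 I + 0.3 R + u1 ;  supply: Q = 1 + 0.5 P + u2
p = (9 + 0.5 * inc + 0.3 * wealth + u1 - u2) / 1.5     # equilibrium price
q = 1 + 0.5 * p + u2                                   # equilibrium quantity

# Step 1: reduced-form regression of P on I and R; keep the residuals v_hat
v_hat = sm.OLS(p, sm.add_constant(np.column_stack([inc, wealth]))).fit().resid

# Steps 2-3: add v_hat to the supply regression and t-test its coefficient
step2 = sm.OLS(q, sm.add_constant(np.column_stack([p, v_hat]))).fit()
print(step2.tvalues[2])    # significant -> simultaneity; plain OLS on the supply fn invalid
```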

Exogeneity Test

• The above supply function cannot be estimated by OLS if Qt and Pt are truly endogenous.
• To test exogeneity of price in the supply function:
H0: Pt is exogenous → λ=0
Step 1. Obtain the reduced-form equations for Pt.
Step 2. Calculate predicted price, P̂t
Step 3. Estimate the original supply function by including P̂t as an additional explanatory variable.
𝑄𝑡 = 𝛽0 + 𝛽1 𝑃𝑡 + 𝜆𝑃̂𝑡 + 𝑢2𝑡

Step 4. Apply the t test or F test, and if λ is statistically significant, then reject H0 and Pt is deemed
endogenous.

3.5 Approaches to Estimation

• Two alternative approaches can be used to estimate a simultaneous equation regression model: single
equation (limited information) estimation and system (full information) estimation.
• Single equation estimation involves estimating each equation in the model separately.
• System estimation involves estimating two or more equations in the model jointly.
• The major advantage of system estimation is that it results in more precise parameter estimates. The
major disadvantages are that it requires more data and is sensitive to model specification errors.
• For reasons of economy, specification errors, etc. the single-equation methods are by far the most
popular. A unique feature of these methods is that one can estimate a single-equation in a multi-
equation model without worrying too much about other equations in the system. (Note: For
identification purposes, however, the other equations in the system count.)
• The commonly used single-equation methods are OLS, ILS and 2SLS.

Recursive Models and Ordinary Least Squares

• A model is called recursive if its structural equations can be ordered in such a way that the first
equation includes only the predetermined variables in the right hand side; the second equation
contains predetermined variables and the first endogenous variable (of the first equation) in the right

53
hand side and so on. The special feature of recursive model is that its equations may be estimated,
one at a time, by OLS without simultaneous equations bias.
• To see the nature of these models, consider the following three-equation system:

Y1t = β10 + γ11X1t + γ12X2t + u1t
Y2t = β20 + β21Y1t + γ21X1t + γ22X2t + u2t
Y3t = β30 + β31Y1t + β32Y2t + γ31X1t + γ32X2t + u3t

where, as usual, the Y's and the X's are, respectively, the endogenous and exogenous variables. The disturbances are such that cov(u1t, u2t) = cov(u1t, u3t) = cov(u2t, u3t) = 0; that is, the same-period disturbances in different equations are uncorrelated (zero contemporaneous correlation).
• Each equation exhibits a unilateral causal dependence, hence the name causal models.
• Although recursive models have proved to be useful, most simultaneous-equation models do not
exhibit such a unilateral cause-and-effect relationship. Therefore, OLS, in general, is inappropriate to
estimate a single equation in the context of a simultaneous-equation model.

The Method of Indirect Least Squares (ILS)

• The method of ILS is suited for just or exactly identified equations. In this method OLS is applied to
the reduced-form equation, and it is from the reduced-form coefficients that one estimates the
original structural coefficients.
• ILS derives from the fact that structural coefficients are obtained indirectly from the OLS estimates
of the reduced form coefficients.
• ILS involves the following three steps:
• Step 1. We first obtain the reduced-form equations.
• Step 2. We apply OLS to the reduced-form equations individually. This operation is permissible
since the explanatory variables in these equations are predetermined and hence uncorrelated with the
stochastic disturbances. The estimates thus obtained are consistent.
• Step 3. We obtain estimates of the original structural coefficients from the estimated reduced-form
coefficients obtained in Step 2. As noted before, if an equation is exactly identified, there is a one-to-
one correspondence between the structural and reduced-form coefficients; that is, one can derive
unique estimates of the former from the latter.
• The standard errors of the estimated structural coefficients may not be provided because, these
coefficients are generally nonlinear functions of the reduced-form coefficients and there is no simple
method of estimating their standard errors from the standard errors of the reduced-form coefficients.

Example

Consider the following model for demand and supply of pork:

Qt = a1 + b1Pt + c1Yt + u1t   (demand function) ............. (8a)

Qt = a2 + b2Pt + c2Zt + u2t   (supply function) ............. (8b)

where Qt is consumption of pork (pounds per capita), Pt is real price of pork (cents per pound), Yt is disposable personal income (dollars per capita) and Zt is "predetermined elements in pork production".

Here P and Q are endogenous variables while Y and Z are predetermined variables. It can easily be shown
that both equations are exactly identified. Thus, we can apply ILS to estimate the parameters. We first
express P and Q in terms of the predetermined variables and disturbances as:

Qt = (b2a1 − b1a2)/(b2 − b1) + [c1b2/(b2 − b1)]Yt − [b1c2/(b2 − b1)]Zt + (b2u1t − b1u2t)/(b2 − b1) ……. (9a)

Pt = (a1 − a2)/(b2 − b1) + [c1/(b2 − b1)]Yt − [c2/(b2 − b1)]Zt + (u1t − u2t)/(b2 − b1) ……. (9b)

We can re-write equations (9a) and (9b) as:

Qt = π1 + π2Yt + π3Zt + v1t ……. (10a)

Pt = π4 + π5Yt + π6Zt + v2t ……. (10b)

By applying OLS to equations (10), we get estimates of the π’s: π̂1, π̂2, π̂3, π̂4, π̂5, π̂6. Relating equations

(9) and (10) we see that:

2 c b /(b  b1 ) ˆ 2
 1 2 2  b2  b̂2 
5 c1 /(b2  b1 ) ˆ 5

3 b c /(b  b1 ) ˆ 3
 1 2 2  b1  b̂1 
6 c2 /(b2  b1 ) ˆ 6

c1
5   c1  5 (b2  b1 )  cˆ1  ˆ 5 (bˆ 2  bˆ 1 )
b 2  b1

Similarly, it can be shown that cˆ 2   ˆ 6 (bˆ 2  bˆ 1 ) , aˆ 1  ˆ 1  bˆ 1ˆ 4 and aˆ 2  ˆ 1  bˆ 2 ˆ 4 .
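To make the three ILS steps concrete, here is a minimal Stata sketch for the pork model, assuming the data are already in memory under the hypothetical variable names q, p, y and z:

    * Steps 1-2: estimate the reduced-form equations (10a) and (10b) by OLS
    regress q y z
    scalar pi1 = _b[_cons]
    scalar pi2 = _b[y]
    scalar pi3 = _b[z]
    regress p y z
    scalar pi4 = _b[_cons]
    scalar pi5 = _b[y]
    scalar pi6 = _b[z]
    * Step 3: recover the structural coefficients indirectly
    scalar b2 = pi2/pi5               // supply price slope
    scalar b1 = pi3/pi6               // demand price slope
    scalar c1 = pi5*(b2 - b1)
    scalar c2 = -pi6*(b2 - b1)
    scalar a1 = pi1 - b1*pi4
    scalar a2 = pi1 - b2*pi4
    display "b1 = " b1 "   b2 = " b2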

Instrumental Variable (IV) Method


• In cross-sectional analysis, when faced with omitted variable bias, we have two options: (i) ignore
the problem, yielding biased and inconsistent estimators, or (ii) use a proxy for the unobserved variable.

• Another approach is to permit the unobservable to remain in the error term, and instead of using
OLS, we use another technique that recognizes the unobservable variable captured in the error term,
which is the Method of Instrumental Variables.

Suppose we have the model (in deviation form):

yi = βxi + εi

where xi is correlated with εi. We cannot estimate the model by OLS as it will yield an inconsistent estimator of β.
What we do is search for an instrumental variable zi that is uncorrelated with εi but correlated with xi; that
is, cov(zi, εi) = 0 and cov(zi, xi) ≠ 0. The sample counterpart of cov(zi, εi) = 0 is:

(1/n) Σ zi(yi − β̂xi) = 0
⟹ (1/n) Σ ziyi = β̂ (1/n) Σ zixi
⟹ β̂ = Σ ziyi / Σ zixi

β̂ can be expressed as:

β̂ = Σ ziyi / Σ zixi = Σ zi(βxi + εi) / Σ zixi = β + Σ ziεi / Σ zixi

Now we have:

plim(Σ ziεi / n) = cov(zi, εi) = 0
plim(Σ zixi / n) = cov(zi, xi) ≠ 0

Thus,

plim(β̂) = β + plim(Σ ziεi / n) / plim(Σ zixi / n) = β + 0/cov(zi, xi) = β

that is, the IV estimator β̂ is a consistent estimator of β.
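As a sketch of the formula β̂ = Σziyi / Σzixi, the following Stata lines compute the IV slope by hand and compare it with the built-in command (the variable names y, x and z are hypothetical):

    * Demean the variables (the model is in deviation form)
    foreach v in y x z {
        quietly summarize `v'
        generate `v'd = `v' - r(mean)
    }
    * IV slope: sum(z*y)/sum(z*x)
    generate zy = zd*yd
    generate zx = zd*xd
    quietly summarize zy
    scalar num = r(sum)
    quietly summarize zx
    scalar den = r(sum)
    display "IV slope = " num/den
    * The same slope from Stata's built-in IV/2SLS command
    ivregress 2sls y (x = z)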

Consider the following simultaneous equations model:


y1 = a1 + b1y2 + c1z1 + c2z2 + u1
y2 = a2 + b2y1 + c3z3 + u2

where y1 and y2 are endogenous while z1, z2 and z3 are predetermined.

Consider the estimation of the first equation:


• Since z1 and z2 are predetermined, they are not correlated with u1; that is, cov(z1, u1) = 0 and
cov(z2, u1) = 0.
• y2 is not independent of u1; that is, cov(y2, u1) ≠ 0.
Thus, the OLS method of estimation cannot be applied. To find consistent estimators, we look for a variable that
is correlated with y2 but not correlated with u1. Fortunately, we have z3, which satisfies these two conditions;
that is, cov(y2, z3) ≠ 0 and cov(z3, u1) = 0. Thus, z3 can serve as an IV for y2.

The procedure for estimation of the first equation is as follows:
a) Regress y2 on z1, z2 and z3; that is, using OLS estimate the model:
   y2 = a10 + a11z1 + a12z2 + a13z3 + v1
b) Obtain ŷ2, where ŷ2 = â10 + â11z1 + â12z2 + â13z3.
c) Regress y1 on ŷ2, z1 and z2; that is, estimate the model:
   y1 = a1 + b1ŷ2 + c1z1 + c2z2 + u1
Note that since z1, z2 and z3 are predetermined variables, and hence not correlated with u1, we have:
   cov(ŷ2, u1) = cov(â10 + â11z1 + â12z2 + â13z3, u1) = 0
Thus, the OLS estimation using the above procedure yields consistent estimators.
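A minimal Stata sketch of steps (a)–(c), with hypothetical variable names y1, y2, z1, z2 and z3. (The standard errors from the manual second stage are not the correct IV standard errors; the ivregress command discussed below computes them properly.)

    * Steps (a)-(b): first stage and fitted values
    regress y2 z1 z2 z3
    predict y2hat, xb
    * Step (c): second stage with the fitted values replacing y2
    regress y1 y2hat z1 z2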

Consider the second equation.


• Since z3 is predetermined, it is not correlated with u2; that is, cov(z3, u2) = 0.
• y1 is not independent of u2; that is, cov(y1, u2) ≠ 0.
Again OLS cannot be applied to estimate the parameters. To find consistent estimators, we look for a
variable that is correlated with y1 but not correlated with u2. Here we have two choices, namely z1 and
z2, that can serve as instruments.

Note: We have more than enough instrumental variables since the second equation is over-identified.

In order to estimate the second equation:


a) Regress y1 on z1 and z3 (if z1 is considered as an IV for y1) or regress y1 on z2 and z3 (if z2 is
considered as an IV for y1) using OLS and obtain ŷ1.
b) Regress y2 on ŷ1 and z3; that is, estimate the model:
   y2 = a2 + b2ŷ1 + c3z3 + u2
Note that the solution is not unique; that is, depending on whether z1 or z2 is considered as an IV
for y1, we may get different results.

The Method of Two-Stage Least Squares (2SLS)

• The method of 2SLS is especially designed for over-identified equations, although it can also be
applied to exactly identified equations, in which case the results of 2SLS and ILS are identical.
• The basic idea behind 2SLS is to replace the (stochastic) endogenous explanatory variable by a linear
combination of the predetermined variables in the model and use this combination as the
explanatory variable in lieu of the original endogenous variable. The 2SLS method thus resembles
the instrumental variable method of estimation in that the linear combination of the predetermined
variables serves as an instrument, or proxy, for the endogenous regressor. Both methods yield the
same result if the equation under consideration is exactly identified.
• Unlike ILS, which provides multiple estimates of parameters in the overidentified equations, 2SLS
provides only one estimate per parameter.

• The 2SLS involves two successive applications of the OLS estimator, and is given by the following
two stage procedure.
a) Regress each right-hand side endogenous variable in the equation to be estimated on all
exogenous variables in the simultaneous equation model using the OLS estimator. Calculate the
fitted values for each of these endogenous variables.
b) In the original equation to be estimated, replace each endogenous right-hand side variable by its
fitted value variable. Estimate the equation using the OLS estimator.
• A noteworthy feature of both ILS and 2SLS is that the estimates obtained are consistent, that is, as
the sample size increases indefinitely, the estimates converge to their true population values. The
estimates may not satisfy small-sample properties, such as unbiasedness and minimum variance.
Therefore, the results obtained by applying these methods to small samples and the inferences drawn
from them should be interpreted with due caution.

Consider the above simultaneous equations model:


y1 = a1 + b1y2 + c1z1 + c2z2 + u1 ………….. (a)
y2 = a2 + b2y1 + c3z3 + u2 ………………….. (b)

where y1 and y2 are endogenous while z1, z2 and z3 are predetermined.

Since cov(y2, u1) ≠ 0 and cov(y1, u2) ≠ 0, we cannot apply OLS. Since equation (a) is exactly
identified, the 2SLS procedure is the same as the IV method. The 2SLS procedure for estimating
equation (b), which is over-identified, is:

➢ We first estimate the reduced-form equation by OLS; that is, we regress y1 on z1, z2 and z3
and obtain ŷ1.
➢ We then replace y1 by ŷ1 and estimate equation (b) by OLS; that is, we apply OLS to:
   y2 = a2 + b2ŷ1 + c3z3 + u2
• In Stata, you use the ivregress command to perform either IV or 2SLS estimation.
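For instance, for the two equations above (with the same hypothetical variable names), the endogenous regressor and its excluded instruments go in parentheses, and the included exogenous variables are listed outside:

    * Equation (a), exactly identified: z3 is the excluded instrument for y2
    ivregress 2sls y1 z1 z2 (y2 = z3)
    * Equation (b), over-identified: z1 and z2 are the excluded instruments for y1
    ivregress 2sls y2 z3 (y1 = z1 z2)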

Example: Consider the following data on some characteristics of the wine industry in Australia.

year    Q      S      PW     PB    A      Y
1955    0.91   85.4   77.5   35.7  89.1   1056
1956    1.05   88.4   80.2   37.4  83.3   1037
1957    1.18   89.1   79.5   37.7  84.4   1006
1958    1.27   90.5   84.9   37.1  90.1   1047
1959    1.27   93.1   94.9   36.2  89.4   1091
1960    1.37   97.2   92.7   35.0  89.3   1093
1961    1.46   100.3  92.5   37.6  89.8   1102
1962    1.59   100.3  92.7   40.1  96.7   1154
1963    1.86   101.5  97.1   39.7  99.9   1234
1964    1.96   104.8  93.9   38.3  103.2  1254
1965    2.32   107.5  102.7  37.0  102.2  1241
1966    2.86   111.8  100.0  36.1  100.0  1299
1967    3.50   114.9  119.5  35.4  103.0  1287
1968    3.96   117.9  119.7  35.1  104.2  1369
1969    4.21   122.3  125.2  34.5  113.0  1443
1970    4.54   128.2  134.1  34.5  132.5  1517
1971    4.93   134.1  124.3  34.3  143.6  1562
1972    5.40   145.1  119.0  34.3  176.2  1678
1973    6.13   174.9  108.5  31.9  159.9  1769
1974    6.29   237.2  107.9  31.0  182.1  1847

It is assumed that a reasonable demand-supply model for the industry would be (where all variables are in
logs):
Qt = a0 + a1PWt + a2PBt + a3Yt + a4At + ut (demand)
Qt = b0 + b1PWt + b2St + vt (supply)

where Q is real per capita consumption of wine, PW is the price of wine relative to CPI, PB is the price of
beer relative to CPI, Y is real per capita disposable income, A is real per capita advertising expenditure, and
S is index of storage costs. Here Q and PW are the two endogenous variables while the rest are exogenous
variables.

I. Estimation using instrumental variables

For the estimation of the demand function we have only one instrumental variable (IV) S. But for the
estimation of the supply we have available three IVs: PB, Y and A.

a) Estimation of the supply function


The supply equation is over-identified. Thus, we have three possible IVs for the price of wine (PW): the
price of beer (PB), advertising expense (A), and income (Y).

i) Regression: price of beer is the IV


First we regress the price of wine (PW) on the price of beer (PB) and storage cost (S), and obtain the predicted
values of PW (IV pb for prcwine). We then estimate the supply function by regressing consumption (Q)
on S and (IV pb for prcwine).

The results are as follows:

Coefficients (dependent variable: consumption)

Model                B         Std. Error   Beta     t        Sig.
(Constant)           -4.673    19.179                -.244    .810
IV pb for prcwine      .336    16.604       .054      .020    .984
storage cost          2.131     6.876       .834      .310    .760

R² = 0.79, F = 31.682 (p-value < 0.001)

ii) Regression: advertising expense is the IV


First we regress the price of wine (PW) on advertising expense (A) and storage cost (S), and obtain the
predicted values of PW (IV ad for prcwine). We then estimate the supply function by regressing
consumption (Q) on S and (IV ad for prcwine).

The results are as follows:


Coefficients (dependent variable: consumption)

Model                B         Std. Error   Beta     t        Sig.
(Constant)           -7.665     2.005                -3.822   .001
IV ad for prcwine     2.928     1.673       .507      1.751   .098
storage cost          1.058      .740       .414      1.430   .171

R² = 0.82, F = 38.924 (p-value < 0.001)

iii) Regression: income is the IV


First we regress price of wine (PW) on income (Y) and storage cost (S), get the predicted values of PW (IV
inc for prcwine), and then estimate the supply function by regressing consumption (Q) on S and (IV inc for
prcwine).

Coefficients (dependent variable: consumption)

Model                B         Std. Error   Beta     t         Sig.
(Constant)           -7.374      .583                -12.642   .000
IV inc for prcwine    2.676      .422       .580       6.335   .000
storage cost          1.163      .234       .455       4.969   .000

R² = 0.94, F = 126.546 (p-value < 0.001)

By comparing the estimated models using the three IVs, income appears to be the best instrument, as the
resulting estimated model has the highest coefficient of determination (R² = 0.94). Since all variables are in
logs, the coefficients are elasticities. Thus, quantity supplied is responsive to both price and storage costs
(both p-values < 0.001). In particular, the price elasticity of supply for wine is about 2.7.
b) Estimation of the demand function
The demand equation is exactly-identified. Thus, we just have one available IV: storage costs. First we
regress price of wine (PW) on price of beer (PB), income (Y), advertising expense (A) and storage cost (S),
get the predicted values of PW (Predicted price of wine), and then estimate the demand function by
regressing consumption (Q) on (Predicted price of wine), PB, Y and A.

The results are as follows:

Model                     B         Std. Error   Beta      t        Sig.   VIF
(Constant)               -11.375     2.863                 -3.973   .001
Predicted price of wine     .644      .838        .144       .768   .454   11.715
price of beer              -.140      .878       -.014      -.160   .875    2.551
income                     4.082     1.594       1.185      2.561   .022   70.889
advertising expense        -.985      .835       -.372     -1.180   .256   32.916

R² = 0.955, F = 79.026 (p-value < 0.001)

We observe that the coefficient of determination is 95.5% and the F-statistic is significant. However, most of
the regression coefficients are insignificant. Furthermore, all the coefficients except that of income (Y) have
the wrong signs. This is probably due to multicollinearity (MC). As can be seen from the above table, the
variance inflation factors (VIF) for income and advertising expense are large (far greater than 10). Thus, we
have to drop one of them. From a practical point of view, it seems wise to drop advertising expense and re-
estimate the model. The results are:

Model                     B         Std. Error   Beta      t        Sig.   VIF
(Constant)                -9.167     2.194                 -4.178   .001
Predicted price of wine    1.380      .566        .310      2.440   .027   5.213
price of beer              -.253      .883       -.025      -.286   .778   2.520
income                     2.308      .537        .670      4.299   .001   7.847

R² = 0.95, F = 102.396 (p-value < 0.001)

The problem of MC is now mitigated, as the VIFs are greatly reduced (all less than 10). However, the
coefficients of both the price of wine and the price of beer have the wrong signs. In particular, the coefficient
of the price of wine not only has the wrong sign but is also significant. This is difficult to interpret. For the other
variables, the conclusion we arrive at is that the demand for wine is not responsive to the price of beer, but is
responsive to income. The income elasticity of demand for wine is about 2.3.

II. Estimation using two-stage least squares (2-SLS)

In this method, we first find the reduced form equations by regressing each endogenous variable on all
exogenous variables. Then we replace all the endogenous variables in each equation by their predicted
values from the reduced forms and estimate each equation by OLS. Note that the IV estimator and the 2-SLS
estimator are the same if the equation under consideration is exactly identified. In our case we have seen that
the demand equation is exactly identified. Thus, the IV and 2-SLS estimators of the parameters are the same.

To estimate the supply function: We first regress the price of wine (PW) on all exogenous variables PB, Y,
A and S, and get the predicted values (PW2sls). We then estimate the supply function by regressing
consumption (Q) on S and PW2sls.

The results are shown below:


Coefficients (dependent variable: consumption)

Model                B         Std. Error   Beta     t         Sig.
(Constant)           -7.305      .473                -15.445   .000
PW2sls                2.616      .334       .587       7.827   .000
storage cost          1.188      .192       .465       6.191   .000

R² = 0.95, F = 176.46 (p-value < 0.001)

We can see from the above table that the coefficients of both price of wine and storage cost are significant.
The price elasticity of supply is about 2.6.
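The estimates above were produced in SPSS; as a sketch, the same 2SLS results could be obtained in Stata along the following lines (hypothetical variable names q, pw, pb, y, a and s for the logged series):

    * Supply: PW instrumented by all excluded exogenous variables (PB, Y, A)
    ivregress 2sls q s (pw = pb y a)
    * Demand (exactly identified): S is the single excluded instrument for PW
    ivregress 2sls q pb y a (pw = s)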

Exercise

Example: Wage-price model


Wt = α0 + α1Ut + α2Pt + u1t (wage equation) ..............(3a)
Pt = β0 + β1Wt + β2Rt + β3Mt + u2t (price equation) ............. (3b)

where W is rate of change in money wage, U is unemployment rate (in percentage), P is rate of
change in prices, R is rate of change in cost of capital, and M is money supply. Here the price
variable P enters into the wage equation (3a) and the wage variable W enters into the price equation
(3b). Thus, these two variables are jointly dependent to each other, and estimation of the two
equations individually by OLS yields biased and inconsistent estimators.

Chapter Four

Introduction to Panel Data Regression Models

4.1 Introduction

• In time series data we observe the values of one or more variables over a period of time. In cross-
section data, values of one or more variables are collected for several sample units, or entities, at the
same point in time.
• In panel data the same cross-sectional unit (say a family or a firm or a state) is surveyed over time. In
other words, panel data have space as well as time dimensions. And we will call regression models
based on such data panel data regression models.
• Panel data example models:
▪ effect of education on income, with data across time and individuals
▪ effects of income on savings, with data across years and countries
• Multiple regression works well when all relevant variables are observed. However, if relevant
variables are missing, the model suffers from omitted variable bias. Panel data allow methods for
controlling for some types of omitted variables without even observing them!
• It is assumed that there are a maximum of N cross-sectional units or observations and a maximum of
T time periods. If each cross-sectional unit has the same number of time series observations, then
such a panel (data) is called a balanced panel. If the number of observations differs among panel
members, we call such a panel an unbalanced panel. In this chapter we will largely be concerned
with a balanced panel.
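In Stata, for example, one declares the panel structure and checks whether the panel is balanced as follows (id and year are hypothetical identifier names):

    xtset id year    // declare the cross-section and time identifiers
    xtdescribe       // reports participation patterns, i.e. whether the panel is balanced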
• What are the advantages of panel data over cross-section or time series data?
▪ First, they increase the sample size considerably.
▪ Second, by studying repeated cross-section observations, panel data are better suited to study
the dynamics of change. Spells of unemployment, job turnover, and labor mobility are better
studied with panel data.
▪ Third, panel data enable us to study more complicated behavioral models. E.g. economies of
scale and technological change
• Despite their substantial advantages, panel data pose several estimation and inference problems.
Since such data involve both cross-section and time dimensions, problems that plague cross-sectional
data (e.g., heteroscedasticity) and time series data (e.g., autocorrelation) need to be addressed. There
are some additional problems, such as cross-correlation in individual units at the same point in time.

• There are several estimation techniques to address one or more of these problems. The two most
prominent are (1) the fixed effects model (FEM) and (2) the random effects model (REM) or error
components model (ECM).
Panel Data Example
• Assume there are four cross-sectional units (GE, GM, US, and WEST) and 20 time periods (1935-
1954). In all, therefore, we have 80 observations. A priori, Y (real gross investment) is expected to
be positively related to X2 (real value of the firm) and X3 (real capital stock). The investment
function is

Yit = β1 + β2X2it + β3X3it + uit … (1)

where i stands for the ith cross-sectional unit and t for the tth time period. The error term is supposed
to follow the classical assumptions, namely, uit ~ N(0, σ²).

4.2 The Fixed Effects Approach

• If the individual-specific effects are correlated with the regressors, we have a fixed effects model.
• If you use fixed effects on a random sample, you cannot make inferences outside your data set.
• Estimation of the above equation depends on the assumptions we make about the intercept, the
slope coefficients, and the error term, uit . There are several possibilities:
a) The intercept and slope coefficients are constant across time and space and the error term
captures differences over time and individuals.
b) The slope coefficients are constant but the intercept varies over individuals.
c) The slope coefficients are constant but the intercept varies over individuals and time.
d) All coefficients (the intercept as well as slope coefficients) vary over individuals.
e) The intercept as well as slope coefficients vary over individuals and time.
• In what follows, we will cover some of the main features of the various possibilities, especially the
first four. Our discussion is nontechnical.
1. All Coefficients Constant across Time and Individuals

The simplest, and possibly naive, approach is to disregard the space and time dimensions of the
pooled data and just estimate the usual OLS regression. That is, stack the 20 observations for each
company one on top of the other, thus giving in all 80 observations for each of the variables
in the model.

𝑌̂ = −63.3041 + 0.1101𝑋2 + 0.3034𝑋3 …. (2)

𝑠𝑒 = (29.6124) (0.0137) (0.0493)

All the coefficients are individually statistically significant; the slope coefficients have the expected
positive signs. The estimated model assumes that the intercept value of GE, GM, US, and
Westinghouse are the same. It also assumes that the slope coefficients of the two X variables are all
identical for all the four firms. Obviously, these are highly restricted assumptions. The above pooled
regression may distort the true picture of the relationship between Y and the X’s across the four
companies.
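As a sketch, this pooled regression amounts to a single OLS command on the stacked data (hypothetical variable names invest for Y, value for X2 and capital for X3):

    * Pooled OLS: ignores both the cross-section and time dimensions
    regress invest value capital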

2. Slope Coefficients Constant but the Intercept Varies across Individuals: The Fixed Effects or
Least-Squares Dummy Variable (LSDV) Regression or Covariance Model

One way to take into account the “individuality” of each company or each cross-sectional unit is to
let the intercept vary for each company but still assume that the slope coefficients are constant across
firms.

In the literature this model is known as the fixed effects regression model (FEM). The term “fixed
effects” is due to the fact that, although the intercept may differ across individuals (here the four
companies), each individual’s intercept does not vary over time; that is, it is time invariant.

How do we actually allow for the (fixed effect) intercept to vary between companies? We can easily
do that by the dummy variable technique, particularly the differential intercept dummies:

Yit = α1 + α2D2i + α3D3i + α4D4i + β2X2it + β3X3it + uit

where D2i = 1 if the observation belongs to GM, 0 otherwise; D3i = 1 if the observation belongs to
US, 0 otherwise; and D4i = 1 if the observation belongs to WEST, 0 otherwise. Since we use no
dummy for GE, it serves as the benchmark company, with intercept α1.

Since we are using dummies to estimate the fixed effects, in the literature the model is also known as
the least-squares dummy variable (LSDV) model.

In this model all the estimated coefficients are individually highly significant. The intercept values of
the four companies are statistically different: −245.7924 for GE, −84.2202 (= −245.7924 + 161.5722)
for GM, 93.8404 (= −245.7924 + 339.6328) for US, and −59.2258 (= −245.7924 + 186.5666) for
WEST. These differences in the intercepts may be due to unique features of each company, such as
differences in management style or managerial talent.
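A minimal Stata sketch of the LSDV regression, assuming the same hypothetical variable names plus a numeric company identifier firm (the first company is the base category):

    * LSDV: company dummies via factor variables
    regress invest value capital i.firm
    * Equivalent within (fixed effects) estimates of the slopes
    xtset firm year
    xtreg invest value capital, fe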

Just as we used the dummy variables to account for individual (company) effect, we can allow for
time effect in the sense that the Grunfeld investment function shifts over time because of factors such
as technological changes, changes in government regulatory and/or tax policies, and external effects
such as wars or other conflicts. Such time effects can be easily accounted for if we introduce time
dummies, one for each year. Since we have data for 20 years, from 1935 to 1954, we can introduce
19 time dummies (why?), and write the model as:

Yit = λ0 + λ1Dum35 + λ2Dum36 + ··· + λ19Dum53 + β2X2it + β3X3it + uit

where Dum35 takes a value of 1 for an observation in year 1935 and 0 otherwise, etc. We are treating
the year 1954 as the base year, whose intercept value is given by λ0.

We are not presenting the regression results, for none of the individual time dummies were
individually statistically significant.
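As a sketch, such a regression and a joint test of the time dummies can be run in Stata as follows (same hypothetical variable names):

    * Company and time dummies together
    regress invest value capital i.firm i.year
    * Joint F-test that all time-dummy coefficients are zero
    testparm i.year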

We have already seen that the individual company effects were statistically significant, but the
individual year effects were not. Could it be that our model is mis-specified in that we have not taken
into account both individual and time effects together? Let us consider this possibility.

Disadvantages of LSDV:

➢ Loss of degrees of freedom
➢ Possibility of multicollinearity
➢ Difficulty in identifying the impact of time-invariant variables (e.g. sex, ethnicity, color)
➢ Problems with the nature (variance, correlation) of the error term
3. Slope Coefficients Constant but the Intercept Varies over Individuals As Well As Time

When we run this regression, we find the company dummies as well as the coefficients of the X are
individually statistically significant, but none of the time dummies are.

The overall conclusion that emerges is that perhaps there is pronounced individual company effect
but no time effect. In other words, the investment functions for the four companies are the same
except for their intercepts. In all the cases we have considered, the X variables had a strong impact
on Y.

4. All Coefficients Vary across Individuals

Here we assume that the intercepts and the slope coefficients are different for all individual, or cross-
section, units. This is to say that the investment functions of GE, GM, US, and WEST are all
different.

You will notice that the γ’s are the differential slope coefficients, just as α2, α3, and α4 are the
differential intercepts. If one or more of the γ coefficients are statistically significant, it will tell us
that one or more slope coefficients are different from the base group.

If all the differential intercept and all the differential slope coefficients are statistically significant, we
can conclude that the investment functions of General Motors, United States Steel, and
Westinghouse are different from that of General Electric.

As these results reveal, Y is significantly related to X2 and X3. However, several differential slope
coefficients are statistically significant. For instance, the slope coefficient of X2 is 0.0902 for GE, but
0.1828 (= 0.0902 + 0.0926) for GM. Interestingly, none of the differential intercepts are statistically
significant.

All in all, it seems that the investment functions of the four companies are different. This might
suggest that the data of the four companies are not “poolable,” in which case one can estimate
the investment functions for each company separately. This is a reminder that panel data regression
models may not be appropriate in each situation, despite the availability of both time series and
cross-sectional data.

4.3 The Random Effects Approach

• If the individual-specific effects are uncorrelated with the regressors, we have a random effects
model. In a random effects model, the unobserved variables are assumed to be uncorrelated with all
the observed variables.
• Random effects assume a normal distribution, so you can make inferences to a larger population.
• REM can estimate coefficients for explanatory variables that are constant over time.
• Although straightforward to apply, fixed effects, or LSDV, modeling can be expensive in terms of
degrees of freedom if we have several cross-sectional units. An obvious question in connection with
the LSDV model is whether the inclusion of the dummy variables-and the consequent loss of the
number of degrees of freedom-is really necessary.
• If the dummy variables do in fact represent a lack of knowledge about the (true) model, why not
express this ignorance through the disturbance term uit? This is precisely the approach suggested by
the proponents of the so-called error components model (ECM) or random effects model (REM).
• The basic idea is to start with

Yit = β1i + β2X2it + β3X3it + uit

• Instead of treating β1i as fixed, we assume that it is a random variable with a mean value of β1 (no
subscript i here). The intercept value for an individual company can then be expressed as

β1i = β1 + εi,  i = 1, 2, …, N

where εi is a random error term with a mean value of zero and variance of σ²ε.
• In ECM it is assumed that the intercept of an individual unit is a random drawing from a much
larger population with a constant mean value. The individual intercept is then expressed as a
deviation from this constant mean value.
• Substituting β1i = β1 + εi into the model, we obtain

Yit = β1 + β2X2it + β3X3it + εi + uit = β1 + β2X2it + β3X3it + wit

where wit = εi + uit … (3)
• One advantage of ECM over FEM is that it is economical in degrees of freedom, as we do not have
to estimate N cross-sectional intercepts. ECM is appropriate in situations where the (random)
intercept of each cross-sectional unit is uncorrelated with the regressors.
• The composite error term wit consists of two components, εi , which is the cross-section, or
individual-specific, error component, and uit, which is the combined time series and cross-section
error component. The term error components model derives its name because the composite error
term wit consists of two (or more) error components.
• The usual assumptions made by ECM are that

εi ~ N(0, σ²ε), uit ~ N(0, σ²u), E(εiuit) = 0, E(εiεj) = 0 (i ≠ j),
E(uituis) = E(uitujt) = E(uitujs) = 0 (i ≠ j; t ≠ s);

that is, the individual error components are not correlated with each other and are not autocorrelated
across both cross-section and time series units.
• Notice carefully the difference between FEM and ECM. In FEM each cross-sectional unit has its
own (fixed) intercept value, in all N such values for N cross-sectional units. In ECM, on the other
hand, the intercept β1 represents the mean value of all the (cross-sectional) intercepts and the error
component εi represents the (random) deviation of individual intercept from this mean value.
However, keep in mind that εi is not directly observable; it is what is known as an unobservable, or
latent, variable.
• As a result of the assumptions stated above, it follows that

E(wit) = 0 and var(wit) = σ²ε + σ²u
• Now if σ2ε = 0, there is no difference between pooled regression and REM, in which case we can
simply pool all the (cross-sectional and time series) observations and just run the pooled regression.
• The error term wit is homoscedastic. However, it can be shown that wit and wis (t ≠ s) are correlated;
that is, the error terms of a given cross-sectional unit at two different points in time are correlated:

corr(wit, wis) = σ²ε / (σ²ε + σ²u),  t ≠ s
• Notice two special features of the preceding correlation coefficient. First, for any given cross-
sectional unit, the value of the correlation between error terms at two different times remains the
same no matter how far apart the two time periods are. Second, the correlation structure remains the
same for all cross-sectional units; that is, it is identical for all individuals.
• If we do not take this correlation structure into account, and estimate REM by OLS, the resulting
estimators will be inefficient. The most appropriate method here is the method of generalized least
squares (GLS).
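In Stata, xtreg with the re option carries out this (feasible) GLS estimation of the random effects model; a minimal sketch with the hypothetical variable names used earlier:

    xtset firm year
    xtreg invest value capital, re   // GLS estimation of the error components model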
• We do not reproduce the full ECM estimation results of the Grunfeld investment function here, but
several aspects of the regression should be noted. First, if you sum the random effect values given
for the four companies, it will be zero, as it should (why?). Second, the mean value of the random
variable β1i is the common intercept value of −73.0353. The random effect value of −169.9282 for
GE tells us by how much the random error component of GE differs from the common intercept
value. A similar interpretation applies to the other three random effect values. Third, the R² value is
obtained from the transformed GLS regression.

• If you compare the results of the ECM model with those obtained from FEM, you will see that
generally the coefficient values of the two X variables do not seem to differ much, except in the
model where we allowed the slope coefficients of the two variables to differ across cross-sectional
units.

4.4 Fixed Effects (LSDV) Versus Random Effects Model

• The challenge facing a researcher is: which model is better, FEM or ECM?
• Several considerations will affect the choice between a fixed effects and a random effects model.

1. What is the nature of the variables that have been omitted from the model?

a. If you think there are no omitted variables – or if you believe that the omitted variables are
uncorrelated with the explanatory variables that are in the model – then a random effects model is
probably best. It will produce unbiased estimates of the coefficients, use all the data available, and
produce the smallest standard errors. More likely, however, is that omitted variables will produce at
least some bias in the estimates.

b. If there are omitted variables, and these variables are correlated with the variables in the model,
then fixed effects models may provide a means for controlling for omitted variable bias. In a fixed-
effects model, subjects serve as their own controls. The idea/hope is that whatever effects the omitted
variables have on the subject at one time, they will also have the same effect at a later time; hence
their effects will be constant, or “fixed.” HOWEVER, in order for this to be true, the omitted
variables must have time-invariant values with time-invariant effects.

i. By time-invariant values, we mean that the value of the variable does not change across time.
Gender and race are obvious examples, but this can also include things like the Educational Level of
the Respondent’s Father.

ii. By time-invariant effects, we mean the variable has the same effect across time, e.g. the effect of
gender on the outcome at time 1 is the same as the effect of gender at time 2.

iii. If either of these assumptions is violated, we need to have explicit measurements of the variables
in question and include them in our models.

In the case of time-varying effects, we can include things like the interaction of gender with time.
We also need explicit measurements of time-invariant variables if they are thought to interact with
other variables in the model, e.g. we think the effect of SES differs by race.

2. How much variability is there within subjects?

a. If subjects change little, or not at all, across time, a fixed effects model may not work very well or
even at all. There needs to be within-subject variability in the variables if we are to use subjects as
their own controls. If there is little variability within subjects then the standard errors from fixed
effects models may be too large to tolerate.

b. Conversely, random effects models will often have smaller standard errors. But, the trade-off is
that their coefficients are more likely to be biased.

3. Do we wish to estimate the effects of variables whose values do not change across time, or do we
merely wish to control for them?

a. With fixed effects models, we do not estimate the effects of variables whose values do not change
across time. Rather, we control for them or “partial them out.” This is similar to an experiment with
random assignment. We may not measure variables like SES, but whatever effects those variables
have are (subject to sampling variability) assumed to be more or less the same across groups because
of random assignment.

b. Random effects models will estimate the effects of time-invariant variables, but the estimates may
be biased because we are not controlling for omitted variables.

• Fixed effects models control for, or partial out, the effects of time-invariant variables with time-
invariant effects. This is true whether the variable is explicitly measured or not.
• In a random effects model, the unobserved variables are assumed to be uncorrelated with all the
observed variables.
• Despite its increasing popularity in applied research, and despite increasing availability of such data,
panel data regressions may not be appropriate in every situation. One has to use some practical
judgment in each case.

• The Hausman test can be used to decide between FEM and ECM.
▪ The null hypothesis is that the random effects model is appropriate.
▪ If the P-value is less than 5%, we reject the null hypothesis.
• Cross-sectional dependence in the residuals (contemporaneous correlation across panel units) can be
checked by using the Pesaran CD test.
▪ The null hypothesis is that there is no cross-sectional dependence.
▪ If the p-value is less than 5%, we reject the null hypothesis.
• Stata commands
▪ tsset panelvariable timevariable (or, equivalently, xtset panelvariable timevariable)
➢ Since we are working with panel data, we must first declare the cross-section and time
identifiers of our dataset to Stata.
▪ Fixed effects model: xtreg dvar indvar, fe
▪ To store: estimates store Fixed
▪ Random effects model: xtreg dvar indvar, re
▪ Add the option ‘robust’ to control for heteroskedasticity
▪ To store: estimates store Random
▪ Hausman test : hausman Fixed Random
▪ Pesaran CD test: xtcsd, pesaran abs
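Putting these commands together, a minimal end-to-end sketch (hypothetical variable names; xtcsd is a user-written command, installed once with ssc install xtcsd):

    xtset firm year
    xtreg invest value capital, fe
    estimates store Fixed
    xtreg invest value capital, re
    estimates store Random
    hausman Fixed Random      // p-value < 0.05: choose the fixed effects model
    xtreg invest value capital, fe
    xtcsd, pesaran abs        // test for cross-sectional dependence in the residuals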
