Econometrics II Chapter Two
5
05/02/202
By Habtamu Legese Feyisa
Regression on Dummy Variables
HABTAMU LEGESE
1.1 The nature of dummy variables
In regression analysis the dependent variable is
frequently influenced not only by variables that can be
readily quantified on some well-defined scale, but also
by variables that are essentially qualitative in nature
(e.g., sex, race, colour, religion, nationality, wars,
earthquakes, strikes, political upheavals, and
changes in government economic policy).
For example, holding all other factors constant, female
daily wage workers are found to earn less than their
male counterparts, and nonwhites are found to earn
less than whites.
This pattern may result from sex or racial discrimination,
but whatever the reason, qualitative variables such as sex
and race do influence the dependent variable and clearly
should be included among the explanatory variables.
Qualitative variables usually indicate the presence or
absence of a “quality” or an attribute, such as male or
female, black or white, or Christian or Muslim.
One method of “quantifying” such attributes is by
constructing artificial variables that take on values of
1 or 0, 0 indicating the absence of an attribute and 1
indicating the presence (or possession) of that attribute.
For example, 1 may indicate that a person is a male, and 0
may designate a female; or 1 may indicate that a person is a
college graduate, and 0 that the person is not, and so on.
Variables that assume such 0 and 1 values are called dummy
variables.
Alternative names are indicator variables, binary variables,
categorical variables, and dichotomous variables.
Dummy variables can be used in regression models just as
easily as quantitative variables. As a matter of fact, a
regression model may contain explanatory variables that are
exclusively dummy, or qualitative, in nature.
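Consider, for example, the model
$Y_i = \alpha + \beta D_i + u_i$ (1.01)
where $Y_i$ is the annual salary of a college professor and
$D_i = 1$ if the professor is male, 0 if female.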
Model (1.01) may enable us to find out whether sex makes any
difference in a college professor’s salary, assuming, of course,
that all other variables such as age, degree attained, and years
of experience are held constant.
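Assuming that the disturbance satisfies the usual
assumptions of the classical linear regression model, we obtain
from (1.01):
Mean salary of female college professor: $E(Y_i \mid D_i = 0) = \alpha$
Mean salary of male college professor: $E(Y_i \mid D_i = 1) = \alpha + \beta$ (1.02)
where $\alpha$ is the intercept of (1.01) and $\beta$ is the coefficient on $D_i$.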
For reference, the usual assumptions of the classical linear regression model are:
1. The regression model is linear in the coefficients and the error
term.
2. The error term has a zero population mean.
3. All explanatory variables are uncorrelated with the error term.
4. Observations of the error term are uncorrelated with each other (no serial
correlation).
5. The error term has a constant variance (no heteroskedasticity).
6. No perfect multicollinearity.
7. The error term is normally distributed (optional; needed for exact hypothesis tests, not for unbiasedness).
That is, the intercept term $\alpha$ gives the mean salary of female college professors and the slope
coefficient $\beta$ tells by how much the mean salary of a male college professor differs from the
mean salary of his female counterpart, with $\alpha + \beta$ reflecting the mean salary of the male college
professor.
A test of the null hypothesis that there is no sex discrimination ($H_0: \beta = 0$, where $\beta$ is the
coefficient on $D_i$) can be easily made by running regression (1.01) in the usual manner and finding
out whether, on the basis of the t test, the estimated $\hat{\beta}$ is statistically significant.
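The interpretation above can be checked numerically. The following is a minimal sketch, assuming a small made-up salary sample (the numbers and variable names are illustrative, not from the lecture) and using NumPy's least-squares solver: the OLS intercept reproduces the female group mean, the coefficient on the dummy reproduces the male-minus-female difference in means, and the usual t ratio is computed for that coefficient.

```python
import numpy as np

# Hypothetical salaries (in thousands); D = 1 for male, 0 for female.
salary = np.array([22.0, 19.0, 18.0, 21.7, 18.5, 21.0, 20.5, 17.0, 17.5, 21.2])
D = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])

# Design matrix: a column of ones (intercept) and the dummy.
X = np.column_stack([np.ones_like(salary), D])

# OLS estimates of the intercept and the dummy coefficient.
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
alpha_hat, beta_hat = coef

female_mean = salary[D == 0].mean()
male_mean = salary[D == 1].mean()

# Conventional t ratio for H0: beta = 0.
resid = salary - X @ coef
n, k = X.shape
s2 = resid @ resid / (n - k)
cov = s2 * np.linalg.inv(X.T @ X)
t_beta = beta_hat / np.sqrt(cov[1, 1])

print("intercept:", alpha_hat, "female mean:", female_mean)
print("dummy coefficient:", beta_hat, "difference in means:", male_mean - female_mean)
print("t ratio for the dummy coefficient:", t_beta)
```

The t ratio would then be compared with the critical value of the t distribution with n - 2 degrees of freedom.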
A. Dummy Independent Variable Models
1.2 Regression on one quantitative variable and one qualitative
variable with two classes, or categories
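$Y_i = \alpha_1 + \alpha_2 D_i + \beta X_i + u_i$ (1.03)
where $Y_i$ = annual salary of a college professor
$X_i$ = years of teaching experience
$D_i$ = 1 if male
= 0 otherwise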
Model (1.03) contains one quantitative variable (years of
teaching experience) and one qualitative variable (sex)
that has two classes (or levels, classifications, or
categories), namely, male and female.
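The two group-specific regressions that the discussion refers to as (1.04) and (1.05) did not survive in the slides. A plausible reconstruction, assuming (1.03) is written as $Y_i = \alpha_1 + \alpha_2 D_i + \beta X_i + u_i$ with $D_i = 1$ for a male professor and $E(u_i) = 0$, is obtained by taking conditional expectations for each value of the dummy:

```latex
\begin{align}
E(Y_i \mid X_i, D_i = 0) &= \alpha_1 + \beta X_i \tag{1.04} \\
E(Y_i \mid X_i, D_i = 1) &= (\alpha_1 + \alpha_2) + \beta X_i \tag{1.05}
\end{align}
```

Under this reading, (1.04) is the mean salary function of female professors and (1.05) that of male professors; the two lines share the slope $\beta$ and differ only by the intercept shift $\alpha_2$.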
Geometrically, we have the situation shown in fig. 1.1 (for
illustration, it is assumed that $\alpha_2 > 0$, where $\alpha_2$ is the
coefficient on $D_i$ in (1.03)). In words, model (1.03)
postulates that the male and female college professors’ salary
functions in relation to the years of teaching experience have
the same slope but different intercepts.
In other words, it is assumed that the level of the male
professor’s mean salary is different from that of the female
professor’s mean salary (by $\alpha_2$), but the rate of change in the
mean annual salary by years of experience is the same for both
sexes.
If the assumption of common slopes is valid, a test of the
hypothesis that the two regressions (1.04) and (1.05) have the
same intercept (i.e., there is no sex discrimination) can be
made easily by running the regression (1.03) and noting the
statistical significance of the estimated $\alpha_2$, the coefficient
on $D_i$, on the basis of the traditional t test.
If the t test shows that $\hat{\alpha}_2$ is statistically significant, we reject
the null hypothesis that the male and female college professors’
levels of mean annual salary are the same.
Before proceeding further, note the following features of the
dummy variable regression model considered previously:
1. To distinguish the two categories, male and female, we have
introduced only one dummy variable, $D_i$. For if $D_i = 1$ always
denotes a male, when $D_i = 0$ we know that it is a female since
there are only two possible outcomes.
Hence, one dummy variable suffices to distinguish two
categories. The general rule is this: If a qualitative variable
has ‘m’ categories, introduce only ‘m-1’ dummy variables.
In our example, sex has two categories, and hence we
introduced only a single dummy variable. If this rule is not
followed, we shall fall into what might be called the dummy
variable trap, that is, the situation of perfect
multicollinearity.
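The dummy variable trap can be seen directly in the design matrix. The sketch below, which uses a hypothetical six-person sample, includes an intercept together with both a male and a female dummy; because the two dummies add up to the column of ones, the matrix loses a rank and OLS has no unique solution. Dropping one dummy restores full column rank.

```python
import numpy as np

# Hypothetical sample: 1 = male, 0 = female.
male = np.array([1, 0, 0, 1, 0, 1])
female = 1 - male
const = np.ones_like(male)

# Intercept plus BOTH dummies: male + female equals the constant column.
X_trap = np.column_stack([const, male, female])

# Intercept plus ONE dummy (m - 1 = 1 dummy for m = 2 categories).
X_ok = np.column_stack([const, male])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print("columns:", X_trap.shape[1], "rank:", rank_trap)  # rank is below the column count
print("columns:", X_ok.shape[1], "rank:", rank_ok)      # full column rank
```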
2. The category that is assigned the
value of 0 is often referred to as the base, benchmark, control,
comparison, reference, or omitted category. It is the base in
the sense that comparisons are made with that category.
In statistics and econometrics, particularly in regression
analysis, a dummy variable is one that takes only the value 0 or
1 to indicate the absence or presence of some categorical effect
that may be expected to shift the outcome.
What is the purpose of dummy variables?
Dummy variables are useful because they enable us to use a
single regression equation to represent multiple groups. This
means that we don't need to write out separate equation
models for each subgroup.
How do you determine the number of dummy
variables?
The first step in this process is to decide the number of dummy
variables.
This is easy; it's simply k-1, where k is the number of levels of
the original variable.
You could also create dummy variables for all levels in the
original variable, and simply drop one from each analysis.
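As an illustration of the k-1 rule, the sketch below builds dummies for a hypothetical three-level education variable (the sample and the choice of base level are assumptions made for the example). One column is created for every level except the base, so k = 3 levels give 2 dummy columns.

```python
import numpy as np

# Hypothetical observations of a three-level qualitative variable.
education = ["less than high school", "college", "high school",
             "college", "less than high school"]

# All levels; the first one is (arbitrarily) treated as the base category.
levels = ["less than high school", "high school", "college"]
base, others = levels[0], levels[1:]

# One 0/1 column per non-base level: k - 1 columns in total.
dummies = np.array([[1 if obs == level else 0 for level in others]
                    for obs in education])

print("base category:", base)
print("dummy columns:", others)
print(dummies)
```

A row of zeros identifies an observation in the base category, which is why no separate column is needed for it.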
Is 0 male or female?
In the case of gender, there is typically no natural reason to code
the variable female = 0, male = 1, versus male = 0, female = 1.
However, convention may suggest one coding is more familiar
to a reader; or choosing a coding that makes the regression
coefficient positive may ease interpretation.
Can dummy variables be 1 and 2?
Technically, dummy variables are dichotomous,
quantitative variables.
Their range of values is small; they can take on only two
quantitative values.
As a practical matter, regression results are easiest to interpret
when dummy variables are limited to two specific
values, 1 or 0.
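Why do we drop one dummy variable?
With an intercept in the model, the dummies for all categories of a
variable always sum to 1, so they are perfectly collinear with the
constant term. Dropping one category avoids this dummy variable trap.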
1.3 Regression on one quantitative variable and
one qualitative variable with more than two classes
Suppose that, on the basis of the cross-sectional data, we
want to regress the annual expenditure on health care by
an individual on the income and education of the
individual.
Since the variable education is qualitative in nature,
suppose we consider three mutually exclusive levels of
education: less than high school, high school, and
college.
Now, unlike the previous case, we have more than two categories
of the qualitative variable education.
Therefore, following the rule that the number of dummies be one
less than the number of categories of the variable, we should
introduce two dummies to take care of the three levels of
education.
Assuming that the three educational groups have a
common slope but different intercepts in the regression of
annual expenditure on health care on annual income, we can
use the following model:
$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \beta X_i + u_i$ (1.06)
where $Y_i$ = annual expenditure on health care
$X_i$ = annual income
$D_2$ = 1 if high school education
= 0 otherwise
$D_3$ = 1 if college education
= 0 otherwise
Note that in the preceding assignment of the dummy variables
we are arbitrarily treating the “less than high school
education” category as the base category. Therefore, the
intercept will reflect the intercept for this category.
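Assuming $E(u_i) = 0$, we obtain from (1.06), with $\alpha_2$ and $\alpha_3$ the
coefficients on $D_2$ and $D_3$:
$E(Y_i \mid D_2 = 0, D_3 = 0, X_i) = \alpha_1 + \beta X_i$
$E(Y_i \mid D_2 = 1, D_3 = 0, X_i) = (\alpha_1 + \alpha_2) + \beta X_i$
$E(Y_i \mid D_2 = 0, D_3 = 1, X_i) = (\alpha_1 + \alpha_3) + \beta X_i$
which are, respectively, the mean health care expenditure functions for
the three levels of education (for illustrative
purposes it is assumed that $\alpha_3 > \alpha_2$).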
1.4 Regression on one quantitative variable and two
qualitative variables
The technique of dummy variables can be easily extended to
handle more than one qualitative variable.
Let us revert to the college professors’ salary regression (1.03),
but now assume that in addition to years of teaching
experience and sex the skin colour of the teacher is also an
important determinant of salary.
For simplicity, assume that colour has two categories: black
and white.
We can now write (1.03) as:
$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \beta X_i + u_i$ (1.07)
where $Y_i$ = annual salary
$X_i$ = years of teaching experience
$D_2$ = 1 if male
= 0 otherwise
$D_3$ = 1 if white
= 0 otherwise
Notice that each of the two qualitative variables, sex and colour,
has two categories and hence needs one dummy variable for
each. Note also that the omitted, or base, category now is
“black female professor”.
Assuming $E(u_i) = 0$, we can obtain the following regressions from (1.07):
Mean salary for black female professor:
$E(Y_i \mid D_2 = 0, D_3 = 0, X_i) = \alpha_1 + \beta X_i$
Mean salary for black male professor:
$E(Y_i \mid D_2 = 1, D_3 = 0, X_i) = (\alpha_1 + \alpha_2) + \beta X_i$
Mean salary for white female professor:
$E(Y_i \mid D_2 = 0, D_3 = 1, X_i) = (\alpha_1 + \alpha_3) + \beta X_i$
Mean salary for white male professor:
$E(Y_i \mid D_2 = 1, D_3 = 1, X_i) = (\alpha_1 + \alpha_2 + \alpha_3) + \beta X_i$
Once again, it is assumed that the preceding regressions differ
only in the intercept coefficient but not in the slope coefficient.
An OLS estimation of (1.07) will enable us to test a variety of
hypotheses. Thus, if $\alpha_3$ (the coefficient on the colour dummy $D_3$)
is statistically significant, it will mean
that colour does affect a professor’s salary.
Similarly, if $\alpha_2$ (the coefficient on the sex dummy $D_2$) is
statistically significant, it will mean that sex
also affects a professor’s salary. If both these differential
intercepts are statistically significant, it would mean that sex as well
as colour is an important determinant of professors’ salaries.
From the preceding discussion it follows that we can extend
our model to include more than one quantitative variable and
more than two qualitative variables.
The only precaution to be taken is that the number of dummies
for each qualitative variable should be one less than the
number of categories of that variable.
1.5 Interaction effects
Consider the following model:
$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \beta X_i + u_i$ (1.08)
where $Y_i$ = annual expenditure on clothing
$X_i$ = income
$D_2$ = 1 if female
= 0 if male
$D_3$ = 1 if college graduate
= 0 otherwise
The implicit assumption in this model is that the differential
effect of the sex dummy is constant across the two levels of
education and the differential effect of the education dummy
is also constant across the two sexes.
That is, if, say, the mean expenditure on clothing is higher for
females than males this is so whether they are college
graduates or not. Likewise, if, say, college graduates on the
average spend more on clothing than non-college graduates,
this is so whether they are females or males.
In many applications, such an assumption may be untenable. A
female college graduate may spend more on clothing than a
male graduate.
In other words, there may be interaction between the two
qualitative variables and therefore their effect on mean Y
may not be simply additive as in (1.08) but multiplicative as
well, as in the following model:
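$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \alpha_4 (D_{2i} D_{3i}) + \beta X_i + u_i$ (1.09)
From (1.09) we obtain
$E(Y_i \mid D_2 = 1, D_3 = 1, X_i) = (\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4) + \beta X_i$ (1.10)
which is the mean clothing expenditure of graduate females.
Notice that
$\alpha_2$ = differential effect of being a female
$\alpha_3$ = differential effect of being a college graduate
$\alpha_4$ = differential effect of being a female graduate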
If the coefficients on the sex dummy, the education dummy, and their
interaction ($\alpha_2$, $\alpha_3$, and $\alpha_4$) are all positive, the average clothing
expenditure of females is higher than the base category (which
here is male non-graduate), but it is much more so if the females
also happen to be graduates.
This shows how the interaction dummy modifies the effect of the
two attributes considered individually.
Whether the coefficient of the interaction dummy is statistically
significant can be tested by the usual t test. Omitting a significant
interaction term will lead to a specification bias.
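The role of the interaction dummy can be sketched numerically. The example below assumes made-up coefficient values and a noise-free sample (both are illustrative assumptions, chosen so that OLS recovers the coefficients exactly). The interaction regressor is simply the product of the two dummies, and the total intercept shift for a female graduate relative to the base category is the sum of the three differential coefficients.

```python
import numpy as np

# Hypothetical "true" coefficients, for illustration only.
a1, a2, a3, a4, b = 10.0, 2.0, 3.0, 1.5, 0.05

D2 = np.array([0, 0, 1, 1, 0, 0, 1, 1])   # 1 = female
D3 = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # 1 = college graduate
X = np.array([100.0, 120.0, 90.0, 150.0, 200.0, 80.0, 130.0, 110.0])  # income

# Noise-free clothing expenditure generated from the interaction model.
Y = a1 + a2 * D2 + a3 * D3 + a4 * D2 * D3 + b * X

# Regressors: intercept, both dummies, their product, and income.
Z = np.column_stack([np.ones_like(X), D2, D3, D2 * D3, X])
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)

# Intercept shift of a female graduate relative to a male non-graduate.
gap_female_graduate = coef[1] + coef[2] + coef[3]

print("estimated coefficients:", coef)
print("female graduate vs base:", gap_female_graduate)
```

With real data the coefficients would be estimated with error, and the t test on the interaction coefficient would indicate whether the additive model is adequate.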
Some Important Uses of Dummy Variables
Then the function would be
Use of Dummy Variables in Seasonal Analysis
B. Dummy Dependent Variable Models
The dependent variable can also take the form of a dummy
variable, where the variable consists of 1 and 0.
If it takes the value of 1, it can be interpreted as a success.
Examples might include home ownership or mortgage
approvals, where the dummy variable takes the value of 1 if
someone owns a home and 0 if they do not.
1. Linear Probability Model
In the case of a dummy dependent variable model, we have:
$y_i = \beta_1 + \beta_2 x_i + u_i$
where $y_i = 0$ or 1 and $E(u_i) = 0$.
A regression model in the situation where the dependent
variable takes on the two values 0 or 1 is called a linear
probability model. To see its properties, note the following.
a) Since the mean error is zero, we know that
$E(y_i) = \beta_1 + \beta_2 x_i$.
b) Now, if we define $p_i = P(y_i = 1)$ and $1 - p_i = P(y_i = 0)$,
then $E(y_i) = 1 \cdot p_i + 0 \cdot (1 - p_i) = p_i$.
Therefore, our model is $p_i = \beta_1 + \beta_2 x_i$ and the estimated slope
coefficients would tell us the impact of a unit change in that
explanatory variable on the probability that $y_i = 1$.
c) The predicted values from the regression model,
$\hat{y}_i = \hat{\beta}_1 + \hat{\beta}_2 x_i$, would provide predictions, based on
some chosen values for the explanatory variables, for the
probability that $y_i = 1$.
There is, however, nothing in the estimation strategy
that would constrain the resulting predictions from being
negative or larger than 1, clearly an unfortunate
characteristic of the approach.
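The out-of-range problem is easy to reproduce. The sketch below fits a linear probability model by OLS to a small hypothetical sample (ten made-up observations of income and a 0/1 purchase indicator) and inspects the fitted "probabilities" at the lowest and highest income levels.

```python
import numpy as np

# Hypothetical data: income in arbitrary units, y = 1 if the item is bought.
income = np.arange(10.0)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# Linear probability model estimated by OLS.
X = np.column_stack([np.ones_like(income), income])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef

print("intercept, slope:", coef)
print("smallest fitted value:", fitted.min())  # falls below 0
print("largest fitted value:", fitted.max())   # rises above 1
```

Nothing in least squares knows that the fitted values are meant to be probabilities, so the fitted line simply continues past 0 and 1 at the ends of the income range.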
d) Since $E(u_i) = 0$ and $u_i$ is uncorrelated with the explanatory variables
(by assumption), it is easy to show that the OLS estimators are
unbiased.
The errors, however, are heteroscedastic. A simple way to see this is
to consider an example. Suppose that the dependent variable takes the
value 1 if the individual buys a Rolex watch and 0 otherwise.
Also, suppose the explanatory variable is income. For low levels of
income, it is likely that all of the observations are zeros. In this
case, there would be no scatter around the line. For higher levels of
income there would be some zeros and some ones. That is, there
would be some scatter around the line.
Thus, the errors would be heteroscedastic. This suggests two
empirical strategies.
First, we know that the OLS estimators are unbiased but
would yield incorrect standard errors. We might simply
use OLS and then use the White correction to produce correct
standard errors.
In short, there are a number of problems with the above
approach, usually called the Linear Probability Model
(LPM), which is estimated in the usual way using OLS.
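The White correction mentioned above can be sketched in a few lines. The example assumes a small made-up sample and implements the basic HC0 form of the heteroscedasticity-consistent covariance matrix, $(X'X)^{-1} \left( \sum_i e_i^2 x_i x_i' \right) (X'X)^{-1}$, alongside the conventional OLS formula for comparison. Other small-sample variants exist; HC0 is used here only because it is the simplest.

```python
import numpy as np

# Hypothetical data: income and a 0/1 purchase indicator.
income = np.arange(1.0, 11.0)
buys = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1], dtype=float)

X = np.column_stack([np.ones_like(income), income])
XtX_inv = np.linalg.inv(X.T @ X)

# OLS coefficients and residuals of the linear probability model.
b = XtX_inv @ X.T @ buys
resid = buys - X @ b
n, k = X.shape

# Conventional standard errors: s^2 (X'X)^-1, valid under homoscedasticity.
s2 = resid @ resid / (n - k)
se_ols = np.sqrt(np.diag(s2 * XtX_inv))

# White (HC0) standard errors: the "meat" is the sum of e_i^2 * x_i x_i'.
meat = X.T @ (X * resid[:, None] ** 2)
se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("coefficients:", b)
print("conventional s.e.:", se_ols)
print("White (HC0) s.e.:", se_white)
```

The coefficient estimates are unchanged; only the standard errors, and hence the t ratios, differ between the two formulas.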
The regression line is not a good fit of the data, so the usual
measures of fit, such as the $R^2$ statistic, are not reliable.
There are other problems with this approach:
1) There will be heteroscedasticity in any model estimated
using the LPM approach.
2) It is possible that the LPM will produce estimates that are greater
than 1 or less than 0, which are difficult to interpret because the
estimates are probabilities and a probability outside the range
0 to 1 does not exist.
3) The error term in such a model is likely to be non-normal.
4) The largest problem is that the relationship between the
variables in this model is likely to be non-linear. This suggests we
need a different type of regression line that will fit the data more
accurately, such as an S-shaped curve.
2. Logit and Probit Models
One potential criticism of the linear probability model is that
the model assumes that the probability that $y_i = 1$ is linearly
related to the explanatory variable(s).
We might, however, expect the relation to be nonlinear. For
example, increasing the income of the very poor or the very
rich will probably have little effect on whether they buy an
automobile. It could, however, have a non-zero effect on other
income groups.
Two models that are non-linear, yet provide predicted
probabilities between 0 and 1, are the logit and probit models.
The difference between the linear probability model and the
nonlinear logit and probit models can be explained using an
example.
To motivate these models, suppose that our underlying dummy
dependent variable depends on an unobserved (“latent”) utility
index $y^*$.
For example, if the variable y is discrete, taking on the value 1
if someone buys a car and 0 otherwise, then we can imagine a continuous
variable $y^*$ that reflects a person’s desire to buy the car.
It seems reasonable that $y^*$ would vary continuously with
some explanatory variable like income.
More formally, suppose
$y_i^* = \beta_1 + \beta_2 x_i + u_i$
and
$y_i = 1$ if $y_i^* > 0$, $y_i = 0$ otherwise.
Then $P(y_i = 1) = P(u_i > -(\beta_1 + \beta_2 x_i)) = F(\beta_1 + \beta_2 x_i)$,
where $F$ is the cumulative distribution function for the error term,
assumed symmetric about zero.
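To estimate the model by maximum likelihood, note that $y_i = 1$ with
probability $p_i$ and $y_i = 0$ with probability $1 - p_i$. Then the likelihood
function is:
$L = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$
and
$\ln L = \sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right]$
which, given $p_i = F(\beta_1 + \beta_2 x_i)$, becomes
$\ln L = \sum_{i=1}^{n} \left[ y_i \ln F(\beta_1 + \beta_2 x_i) + (1 - y_i) \ln(1 - F(\beta_1 + \beta_2 x_i)) \right]$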
Analytically, the next step would be to take the partial
derivatives of the log-likelihood function with respect to the $\beta$’s,
set them equal to zero, and solve for the MLEs.
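In practice the first-order conditions have no closed-form solution and are solved numerically. The sketch below is one way to do this for the logit case, assuming a small made-up sample and using Newton-Raphson iterations. For the logistic $F$, the gradient of $\ln L$ is $X'(y - p)$ and the Hessian is $-X'WX$ with $W = \mathrm{diag}(p_i(1 - p_i))$. Econometric software uses the same idea with more safeguards.

```python
import numpy as np

# Hypothetical data: x could be income, y = 1 if the person buys a car.
x = np.arange(10.0)
y = np.array([0, 0, 1, 0, 0, 1, 0, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])

def logistic_cdf(z):
    """F(z) for the logit model."""
    return 1.0 / (1.0 + np.exp(-z))

beta = np.zeros(2)  # starting values
for _ in range(25):
    p = logistic_cdf(X @ beta)
    score = X.T @ (y - p)                           # gradient of ln L
    hessian = -(X * (p * (1 - p))[:, None]).T @ X   # second derivatives of ln L
    step = np.linalg.solve(hessian, score)
    beta = beta - step                              # Newton-Raphson update
    if np.max(np.abs(step)) < 1e-10:
        break

# At the MLE the first-order conditions hold: the score is (numerically) zero.
p = logistic_cdf(X @ beta)
score = X.T @ (y - p)
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print("MLE of (beta_1, beta_2):", beta)
print("score at the MLE:", score)
print("maximized log-likelihood:", log_lik)
```

The sample is deliberately not perfectly separated by x; with perfect separation the likelihood has no finite maximum and the iterations would not settle.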
What is the Difference Between Logit and
Probit Models?
Logit and probit models are appropriate when attempting to
model a dichotomous dependent variable, e.g. yes/no,
agree/disagree, like/dislike, etc.
Logit and probit differ in how they define F. The logit model
uses the cumulative distribution function of
the logistic distribution. The probit model uses the
cumulative distribution function of the standard
normal distribution to define F.
Both functions will take any number and rescale it to fall
between 0 and 1.
Hence, whatever α + βx equals, it can be transformed by the
function to yield a predicted probability.
Any function that would return a value between zero and one
would do the trick, but there is a deeper theoretical model
underpinning logit and probit that requires the function to be
based on a probability distribution.
Is logit better than probit, or vice versa?
Both methods will yield similar (though not identical) inferences.
Because the logistic distribution has slightly heavier tails than the
normal, the derivatives from the two models differ noticeably only if
there are enough observations in the tails of the distribution;
otherwise the derivatives are usually similar.
A simple approximation suggests that multiplying the logit
estimates by 0.625 makes the logit estimates comparable to the
probit estimates.
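The two choices of F, and the 0.625 rule of thumb, can be compared directly. The sketch below uses only the Python standard library; the grid of index values from -6 to 6 is an arbitrary choice for illustration. If probit coefficients are roughly 0.625 times the logit ones, then the logistic CDF at z should be close to the standard normal CDF at 0.625z.

```python
import math

def logistic_cdf(z):
    """F used by the logit model."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """F used by the probit model (standard normal CDF)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Index values alpha + beta*x on an arbitrary grid from -6 to 6.
grid = [i / 10.0 for i in range(-60, 61)]

# Largest gap between the logit probability and the rescaled probit one.
max_gap = max(abs(logistic_cdf(z) - normal_cdf(0.625 * z)) for z in grid)

for z in (-2.0, 0.0, 2.0):
    print(z, logistic_cdf(z), normal_cdf(0.625 * z))
print("largest gap on the grid:", max_gap)
```

Both functions map any real index into the open interval between 0 and 1, and after rescaling the predicted probabilities differ by only a few percentage points at most on this grid.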
Logit – also known as logistic regression – is more popular in
health sciences like epidemiology partly because coefficients can
be interpreted in terms of odds ratios.
Probit models can be generalized to account for non-constant
error variances in more advanced econometric
settings (known as heteroskedastic probit models) and
hence are used in some contexts by economists and
political scientists.
If these more advanced applications are not of
relevance, then it does not matter which method you
choose to go with.
Thank You