Dummy Variable Regression Models
9.0 OBJECTIVES
After reading this unit, you will be able to:
define a qualitative or dummy variable;
discuss the ANOVA model with a single dummy as exogenous variable;
specify an ANCOVA model with one quantitative and one dummy
variable;
interpret the results of dummy variable regression models;
differentiate between ‘differential intercept coefficient’ and ‘differential
slope coefficient’;
describe the concepts of ‘concurrent, dissimilar and parallel’ regression
models that you encounter while considering ‘differential slope dummies’;
and
explain how more than two dummies and interactive dummies can be
formulated into a regression model.
9.1 INTRODUCTION
In real life situations, some variables are qualitative. Examples are gender,
choices, nationality, etc. Such variables may be dichotomous or binary, i.e., with
responses limited to two such as in ‘yes’ or ‘no’ situations. Or they may have
more than two categorical responses. We need methods to include such variables
in the regression model. In this unit, we consider some such cases. We limit this
unit to regressions in which the dependent variable is quantitative. You
may note in passing that when the dependent variable itself is a dummy variable,
we have to use models such as Probit or Logit. In such models, the
OLS method of estimation does not apply. In this unit, we will not consider such
cases. You will study about them in the course ‘BECE 142: Applied
Econometrics’.
Dr. Pooja Sharma, Assistant Professor, Daulat Ram College, University of Delhi and Prof. B S
Prakash, Indira Gandhi National Open University, New Delhi
In this unit, we consider only cases in which an independent variable is a
dummy variable. Qualitative variables are not directly measured in numbers. By
treating them as dummy variables, we can give them numerical (categorical) codes.
For instance, consider variables such as male or female, employed or
unemployed, etc. These are quantifiable in the sense that we can assign the value 1 if
‘female’ and 0 if ‘male’. Similar examples could be 1 if yes and 0 if no; 1 if
employed and 0 if unemployed, etc. In each case, we have converted a qualitative
response into quantitative form. Thus, the qualitative variable is now quantified.
Such a regression could be a simple regression, i.e., one with only one independent
variable, which is qualitative and treated as a dummy variable. Or there could be
two independent variables, one of which is treated as a dummy and the other
is its covariate, i.e., a quantitative variable closely related to the variable treated as
a dummy. For instance, pre-tax income of persons can be classified as above or below a
threshold level and treated as a dummy variable, with the response taken as 1 or 0.
The post-tax income, which is a
covariate of pre-tax income, can then be considered by its actual quantified value.
Similar extensions cover situations where you have to consider
multiple dummies and cases where you have to consider interactive dummies.
The nature of such regressions, particularly their inference and interpretation,
is what we consider in the present unit.
Table 9.2: Food Expenditure in Relation to Income and Gender

Observation   Food Expenditure ($)   Income ($)   Gender
1             1983                   11557        1
2             2987                   29387        1
3             2993                   31463        1
4             3156                   29554        1
5             2706                   25137        1
6             2217                   14952        1
7             2230                   11589        0
8             3757                   33328        0
9             3821                   36151        0
10            3291                   35448        0
11            3429                   32988        0
12            2533                   20437        0
Ŷi = 2673.667 + 503.1667 Di
se = (233.0446) (329.5749)
t = (11.4728) (1.5267) R2 = 0.1890
Thus, we notice that the mean food consumption expenditures of the two genders
have remained the same. The R2 value is also the same. The absolute value of the
dummy variable coefficient and their standard errors are also the same. The only
change is in the numerical value of the intercept term and its t value.
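The effect of reassigning the base category can be checked numerically. The following is a minimal sketch in plain Python, assuming that the Gender column of Table 9.2 is the original coding and that the reassigned dummy is simply 1 minus that column. The helper name ols_one_dummy is ours and is not part of any library.

```python
from math import sqrt

# Table 9.2: food expenditure ($) and the gender dummy as coded in the table.
food = [1983, 2987, 2993, 3156, 2706, 2217, 2230, 3757, 3821, 3291, 3429, 2533]
gender = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]


def ols_one_dummy(y, d):
    """OLS for y = b1 + b2*d + u, using the simple-regression formulas."""
    n = len(y)
    y_bar = sum(y) / n
    d_bar = sum(d) / n
    s_dd = sum((di - d_bar) ** 2 for di in d)
    s_dy = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
    b2 = s_dy / s_dd                 # differential intercept coefficient
    b1 = y_bar - b2 * d_bar          # mean of the base (d = 0) category
    rss = sum((yi - b1 - b2 * di) ** 2 for yi, di in zip(y, d))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    sigma2 = rss / (n - 2)           # residual variance with n - 2 d.f.
    se_b2 = sqrt(sigma2 / s_dd)
    se_b1 = sqrt(sigma2 * sum(di ** 2 for di in d) / (n * s_dd))
    return {
        "b1": b1, "b2": b2,
        "se_b1": se_b1, "se_b2": se_b2,
        "t_b1": b1 / se_b1, "t_b2": b2 / se_b2,
        "r2": 1 - rss / tss,
    }


fit_a = ols_one_dummy(food, gender)                    # coding as in Table 9.2
fit_b = ols_one_dummy(food, [1 - d for d in gender])   # base category reassigned

for name, fit in (("table coding", fit_a), ("reassigned", fit_b)):
    print(name, {k: round(v, 4) for k, v in fit.items()})
```

In both fits the two group means (b1 and b1 + b2), the standard errors and R2 agree; only the intercept, its t ratio and the sign of the dummy coefficient (and hence of its t ratio) differ.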
Another question that we may get is: since we have two categories, male and
female, can we assign two dummies to them? This means we consider the model
as:
Yi = β1 + β2D2i + β3D3i + ui … (9.4)
where Y is expenditure on food, D2 = 1 for female and 0 for male and D3 = 1 for
male and 0 for female. Essentially, we are trying to see whether we can assign
two dummies for male and female separately? The answer is ‘no’. To know the
reason for this, consider the data for a sample of two females and three males, for
which the data matrix is as in Table 9.3. We see that D2 = 1 – D3 or D3 = 1 – D2.
This is a situation of perfect collinearity. Hence, we must always use only one
dummy variable if a qualitative variable has two categories, such as the gender
here.
Table 9.3: Data Matrix for Equation (9.4)

Gender   Y    Intercept   D2   D3
Male     Y1   1           0    1
Male     Y2   1           0    1
Female   Y3   1           1    0
Male     Y4   1           0    1
Female   Y5   1           1    0
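The perfect collinearity in this data matrix can be verified directly. The sketch below, in plain Python, builds the cross-product matrix X'X from the intercept, D2 and D3 columns and shows that its determinant is zero, so the normal equations have no unique solution. The helper names cross_product and det are ours, chosen for illustration.

```python
# Intercept, D2 (female = 1) and D3 (male = 1) columns of the data matrix.
rows = [
    (1, 0, 1),  # Male
    (1, 0, 1),  # Male
    (1, 1, 0),  # Female
    (1, 0, 1),  # Male
    (1, 1, 0),  # Female
]


def cross_product(x):
    """Return X'X for a list of equal-length rows."""
    k = len(x[0])
    return [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]


def det(m):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total


# Every row satisfies D2 + D3 = intercept column, i.e. D2 = 1 - D3.
print(all(r[1] + r[2] == r[0] for r in rows))

xtx_trap = cross_product(rows)                  # intercept + both dummies
xtx_ok = cross_product([r[:2] for r in rows])   # intercept + D2 only

print(det(xtx_trap))  # zero: X'X cannot be inverted
print(det(xtx_ok))    # non-zero: the model with one dummy can be estimated
```

Dropping either dummy (or, alternatively, dropping the common intercept) removes the exact linear dependence among the columns.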
A more general rule is: if a model has a common intercept β1, and the
qualitative variable has m categories, then we must introduce only (m – 1)
dummy variables. If we do not do this, we run into an estimation problem called
the ‘dummy variable trap’. Finally, note that a simple
regression model with only one dummy variable, as considered here, is
also called an ANOVA model. This is because there is no
quantitative explanatory variable (covariate) whose impact on
the dependent variable we are seeking to know. When such a variable is included,
we get what is called an ANCOVA
model. We take up such a case in the next section.
9.3 ANALYSIS OF COVARIANCE (ANCOVA) MODEL
[Figure: Food Expenditure plotted against After-tax expenditure, with series labelled Male and Female]
3) What happens if the base value is reassigned for the dummy variable, say
gender, in a simple regression model as in equation (9.1)?
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
7) Specify the general form of an ANCOVA model with one qualitative and one
quantitative variable. What does the slope coefficient for the quantitative
variable considered indicate in general?
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
9.4 COMPARISON BETWEEN TWO REGRESSION
MODELS
In the example considered above, i.e., for both the ANOVA and the ANCOVA
models, we saw that the slope coefficients were the same but the intercepts were
different. This raises the question of whether the slopes too could be different.
How do we formulate the model if our interest is to test for a difference in the
slope coefficients too? In order to capture this, we introduce a ‘slope drifter’. For
the example of consumption expenditure for male or female considered above, let
us now proceed to compare the difference in the consumption expenditure by
gender by specifying the model with dummies as follows:
Yi = β1 + β2Di + β3Xi + β4(DiXi) + ui … (9.6)
Taking conditional expectations of (9.6) for the two values of the dummy, we get:
E (Yi │ Di = 0, Xi) = β1 + β3Xi {since Di = 0}
E (Yi │ Di = 1, Xi) = (β1 + β2) + (β3 + β4)Xi {since Di = 1}
In the second of these equations, (β1 + β2) gives the mean value of Y for the category that
receives the dummy value of 1 when X is zero. And, (β3 + β4) gives the slope
coefficient of the income variable for the category that receives the dummy value of
1. Note that the introduction of the dummy variable in the ‘additive form’ enables
us to distinguish between the intercept terms of the two groups. Likewise, the
introduction of the dummy variable in the interactive (or multiplicative) form
(i.e., DiXi) enables us to differentiate between the slope coefficients of
the two groups. Depending on the statistical significance of the differential
intercept coefficient, β2, and the differential slope coefficient, β4, we can infer
whether the female and male food expenditure functions differ in their intercept
values, or their slope values, or both. There can be four possibilities as shown in
Fig. 9.2. Fig. 9.2 (a) shows the case where there is no difference in the intercept or the slope
coefficient of the two food expenditure regressions. Such regression equations
are called ‘Coincident Regressions’.
[Fig. 9.2: Four possible pairs of regression lines of Y on X: (a) Coincident Regressions, (b) Parallel Regressions, (c) Concurrent Regressions, (d) Dissimilar Regressions]
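The point estimates of a model of the type in equation (9.6) can be obtained from two separate group-wise regressions, since a model with both an additive and an interactive dummy reproduces the intercept and slope of each group. The sketch below illustrates this in plain Python on the food expenditure data of Table 9.2, assuming the Gender column there is the dummy D. The helper names simple_ols and group are ours. The sketch gives point estimates only; deciding which of the four cases applies still requires t tests on β2 and β4, which are not computed here.

```python
# Data of Table 9.2.
food = [1983, 2987, 2993, 3156, 2706, 2217, 2230, 3757, 3821, 3291, 3429, 2533]
income = [11557, 29387, 31463, 29554, 25137, 14952,
          11589, 33328, 36151, 35448, 32988, 20437]
gender = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]


def simple_ols(y, x):
    """Return (intercept, slope) of the OLS line of y on x."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


def group(d_value):
    """Food expenditure and income for observations with D equal to d_value."""
    y = [f for f, g in zip(food, gender) if g == d_value]
    x = [m for m, g in zip(income, gender) if g == d_value]
    return y, x


a0, s0 = simple_ols(*group(0))   # D = 0: intercept b1, slope b3
a1, s1 = simple_ols(*group(1))   # D = 1: intercept b1 + b2, slope b3 + b4

beta1, beta3 = a0, s0
beta2 = a1 - a0                  # differential intercept coefficient
beta4 = s1 - s0                  # differential slope coefficient

# Residuals of the pooled model with the additive and interactive dummy.
residuals = [f - (beta1 + beta2 * g + beta3 * m + beta4 * g * m)
             for f, g, m in zip(food, gender, income)]

print("beta1..beta4:", beta1, beta2, beta3, beta4)
```

The residuals are orthogonal to the intercept, D, X and DX, which is the condition that identifies these values as the OLS estimates of the pooled model.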
where Y is income, X is education measured in number of years of schooling, D2
is gender (0 if male, 1 if female), D3 indicates the reserved segment or group (e.g.
SC/ST/OBC) taking the value 0 if ‘not in reserved segment’, i.e., in general
segment and 1 if ‘in reserved segment’. Here, gender (D2) and reservation (D3)
are qualitative variables and X is quantitative variable. In this formulation (for
example, equation 9.7) we have made an implicit assumption that the differential
effect of gender is constant across the two segments of reservation. We have
likewise assumed that the differential effect of reservation is constant across the
two genders. This means if the average income is higher for males than for
females, it is so whether the person is in the general segment or in the reservation
segment. Likewise, it is assumed here that if the average income is different
between the two reservation segments, it is so irrespective of gender. However, in
many cases, such assumptions may not be tenable. This means, there could be
interaction between gender and reservation dummies. In other words, their effect
on average income may not be simply additive as in (9.7) but could be
multiplicative. If we wish to account for this interactive effect, we must specify
the model as follows:
Yi = β1 + β2D2i + β3D3i + β4(D2i D3i) + β5Xi + ui … (9.8)
In equation (9.8), the dummy variable D2iD3i is called an ‘interactive’ or
‘interaction’ dummy. It represents the joint or simultaneous effect of two
qualitative variables. Taking expectation on both sides of equation (9.8), i.e., by
considering the average effect on income across gender and reservation, we get:
E (Yi │ D2i =1, D3i = 1, Xi) = β1 + β2 + β3 + β4 + β5Xi … (9.9)
Equation (9.9) is the average income function for female reserved category
workers where β2 is the differential effect of being female, β3 is the differential
effect of being in the reserved segment and β4 is the interactive effect of being
both a female and in reserved segment. Depending on the statistical significance
of various dummies, we need to make relevant inferences. The specification can
easily be generalized for more than one quantitative variable and more than two
qualitative variables.
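For reference, the conditional expectations for all four combinations of the two dummies in equation (9.8) can be written out as below. This is a worked restatement under the usual assumption that the error term has zero conditional mean; only the last line corresponds to equation (9.9).

```latex
\begin{align*}
E(Y_i \mid D_{2i}=0,\ D_{3i}=0,\ X_i) &= \beta_1 + \beta_5 X_i \\
E(Y_i \mid D_{2i}=1,\ D_{3i}=0,\ X_i) &= (\beta_1 + \beta_2) + \beta_5 X_i \\
E(Y_i \mid D_{2i}=0,\ D_{3i}=1,\ X_i) &= (\beta_1 + \beta_3) + \beta_5 X_i \\
E(Y_i \mid D_{2i}=1,\ D_{3i}=1,\ X_i) &= (\beta_1 + \beta_2 + \beta_3 + \beta_4) + \beta_5 X_i
\end{align*}
```

Comparing the lines shows what the interaction term adds: the differential effect of being female is β2 in the general segment but (β2 + β4) in the reserved segment. If β4 = 0, the two effects are simply additive.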
Check Your Progress 2 [answer questions within the given space in about 90-
100 words]
1) What is meant by a ‘slope drifter’? When is it introduced and for what use?
Specify a general model with such a ‘slope drifter’ and comment on the
additional variable introduced.
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
2) Differentiate between the four types of regressions that we might get when
considering a model of the type in equation (9.6), with the two drifters β2
(intercept) and β4 (slope) as therein.
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
.............................................................................................................................
variable and a case where we might be interested in examining the interactive
effect of the two qualitative variables. For this, we considered models such as Yi
= β1 + β2D2i + β3D3i + β4(D2i D3i) + β5Xi + ui.
In other words, regression models in which some independent variables are
qualitative and some others are quantitative are called ANCOVA models.
6) The advantage is that ANCOVA models provide a method of statistically
controlling for the effects of covariates. The consequence of excluding a relevant
covariate from the model is that the model suffers from
‘specification error’. The consequence of committing a specification error is
that the assumptions required for the OLS estimators to have their ideal properties are
violated. Consequently, the estimators lose their efficiency properties and, if the
omitted covariate is correlated with an included regressor, they are also biased.