Chapter 1

Uploaded by

yonasketema26

Course outline

• Chapter 1: Regression Analysis with Qualitative Data: Binary (or Dummy) Variables
• Chapter 2: Time Series Analysis
• Chapter 3: Simultaneous Equations
• Chapter 4: Introduction to Panel Data Regression


CHAPTER 1. Regression Analysis with Qualitative Data: Binary (or Dummy) Variables
1. THE NATURE OF DUMMY VARIABLES
• Regression analysis is a powerful statistical method that allows you
to examine the relationship between two or more variables of
interest. While there are many types of regression analysis, at their
core they all examine the influence of one or more independent
variables on a dependent variable. These independent or dependent
variables can be quantitative or qualitative.

• In regression analysis the dependent variable, or regressand, is frequently influenced not only by ratio scale variables (e.g., income, output, prices, costs, height, temperature) but also by variables that are essentially qualitative, or nominal scale, in nature, such as sex, race, color, religion, nationality, geographical region, political upheavals, and party affiliation.
• Since such variables usually indicate the presence or absence of a “quality” or an attribute, such as male or female, black or white, Catholic or non-Catholic, Democrat or Republican, they are essentially nominal scale variables.
• For example, holding all other factors constant, female workers are found to earn less than their male counterparts, or nonwhite workers are found to earn less than whites.
How can we quantify qualitative variables?

• One way we could “quantify” such attributes is by constructing artificial variables that take on values of 1 or 0:
  - 1 indicating the presence (or possession) of that attribute, and
  - 0 indicating the absence of that attribute.
• Variables that assume such 0 and 1 values are
called dummy variables.
• Such variables are thus essentially a device to
classify data into mutually exclusive categories
such as male or female.
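
The coding rule above can be sketched in a few lines of Python. This is only an illustration: the five observations and the helper name make_dummy are hypothetical, not part of the course material.

```python
# Hypothetical sample: the sex recorded for five workers.
sex = ["female", "male", "male", "female", "male"]

def make_dummy(values, category):
    """Return 1 where the observation has the attribute, 0 otherwise."""
    return [1 if v == category else 0 for v in values]

d_male = make_dummy(sex, "male")
d_female = make_dummy(sex, "female")

print(d_male)    # [0, 1, 1, 0, 1]
print(d_female)  # [1, 0, 0, 1, 0]

# The categories are mutually exclusive: exactly one dummy equals 1
# for every observation.
print(all(m + f == 1 for m, f in zip(d_male, d_female)))  # True
```

In a regression with an intercept only one of the two dummies would be used; the other category then serves as the reference group.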

• It is not absolutely essential that dummy variables take the values of 0 and 1.
• The pair (0, 1) can be transformed into any other pair by a linear function such as Z = a + bD, where a and b are constants and where D = 1 or 0.
• When D = 1, we have Z = a + b, and when D = 0, we have Z = a. Thus the pair (0, 1) becomes (a, a + b).
• For example, if a = 1 and b = 2, the dummy variable takes the values (1, 3).
• Qualitative, or dummy, variables do not have a
natural scale of measurement. That is why they
are described as nominal scale variables.
• Dummy variables can be incorporated in regression
models just as easily as quantitative variables.

• As a matter of fact, a regression model may contain regressors (explanatory variables) that are all exclusively dummy, or qualitative, in nature. Such models are called Analysis of Variance (ANOVA) models.
Caution in the Use of Dummy Variables

1. If a qualitative variable has m categories, introduce only (m − 1) dummy variables. With an intercept in the model, m dummies would add up to the intercept column and cause perfect collinearity (the dummy variable trap).
2. The category for which no dummy variable is assigned is known as the base, benchmark, control, comparison, reference, or omitted category. All comparisons are made in relation to the benchmark category.
3. The intercept value (β1) represents the mean value of the benchmark category.
4. The coefficients attached to the dummy variables are known as the differential intercept coefficients because they tell by how much the intercept of the category that receives the value of 1 differs from the intercept coefficient of the benchmark category.
5. If a qualitative variable has more than one category, the choice of the benchmark category is strictly up to the researcher. Sometimes the choice of the benchmark is dictated by the particular problem at hand.
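
The first caution can be checked numerically. The sketch below assumes a hypothetical qualitative variable with m = 3 categories; the labels A, B, C and the helper dummy are made up for illustration. It compares the sum of all m dummies with the intercept column, and then does the same after dropping one category.

```python
# Hypothetical qualitative variable with m = 3 categories.
region = ["A", "B", "C", "A", "C", "B"]
categories = ["A", "B", "C"]

def dummy(values, category):
    """1 if the observation belongs to `category`, 0 otherwise."""
    return [1 if v == category else 0 for v in values]

intercept = [1] * len(region)

# m dummies: for every observation they add up to the intercept column,
# so the regressors are perfectly collinear (the dummy variable trap).
full = [dummy(region, c) for c in categories]
full_sum = [sum(col[i] for col in full) for i in range(len(region))]

# m - 1 dummies: category "A" is omitted and becomes the benchmark.
reduced = [dummy(region, c) for c in categories[1:]]
reduced_sum = [sum(col[i] for col in reduced) for i in range(len(region))]

print(full_sum == intercept)  # True
print(reduced_sum)            # [0, 1, 1, 0, 1, 1]
```

With only m − 1 dummies the sum is no longer identical to the intercept column, so the parameters can be estimated.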
EXAMPLE. Consider the following model:
• Yi = β1 + β2D2i + β3D3i + ui ---------- (1)
• where Yi = (average) salary of public school teachers in region i
• D2i = 1 if the region is in the Amara or Somalia
      = 0 otherwise (i.e., in other regions of the country)
• D3i = 1 if the region is in the South
      = 0 otherwise (i.e., in other regions of the country)
• The estimated regression is
  Ŷi = 26,158.62 − 1734.473D2i − 3264.615D3i ---------- (2)

For example, the value of about −1734 tells us that the mean salary of teachers in the Amara or Somalia is smaller by about $1734 than the mean salary of about $26,159 for the benchmark category, the Oromo.
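
As a quick arithmetic check, the estimated mean salary of each category can be recovered from the reported coefficients. The sketch below uses only the numbers in the estimated equation; the function name mean_salary is an illustrative choice.

```python
# Coefficient estimates as reported in the estimated equation (2).
b1, b2, b3 = 26158.62, -1734.473, -3264.615

def mean_salary(d2, d3):
    """Estimated mean salary for a region with dummy values d2 and d3."""
    return b1 + b2 * d2 + b3 * d3

benchmark = mean_salary(0, 0)  # benchmark category (D2 = 0, D3 = 0)
group_d2 = mean_salary(1, 0)   # category with D2 = 1
group_d3 = mean_salary(0, 1)   # category with D3 = 1

for label, value in [("benchmark", benchmark),
                     ("D2 = 1", group_d2),
                     ("D3 = 1", group_d3)]:
    print(f"{label:10s} {value:10.1f}")
```

The three values are roughly 26,159, 24,424 and 22,894: the intercept is the benchmark mean, and each dummy coefficient is the difference from that benchmark.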
Regression on one quantitative variable and one qualitative variable with two classes, or categories

Consider the following model:
• Yi = β1 + β2D2i + β3Xi + ui ---------- (3)
• where Yi = annual salary of a college professor
  Xi = years of teaching experience
  D2i = 1 if male
      = 0 otherwise
What is the meaning of this equation? Assuming, as usual, that E(ui) = 0, we see that
• Mean salary of a female college professor:
  E(Yi | Xi, D2i = 0) = β1 + β3Xi ---------- (4)
• Mean salary of a male college professor:
  E(Yi | Xi, D2i = 1) = (β1 + β2) + β3Xi ---------- (5)
• Geometrically, we have the situation shown in fig. 5.1 (for illustration, it is assumed that β2 > 0).
• In words, model (3) postulates that the male and female college professors’ salary functions in relation to the years of teaching experience have the same slope but different intercepts.
• In other words, it is assumed that the level of the male professor’s mean salary is different from that of the female professor’s mean salary (by β2), but the rate of change in the mean annual salary by years of experience is the same for both sexes.
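
To see the "same slope, different intercepts" structure in numbers, the sketch below fits this kind of model by OLS on synthetic, noise-free data. Everything in it is an assumption made for illustration: the data, the true values (20, 5, 1.5) and the small helpers solve and ols, which solve the normal equations (X'X)b = X'y with the standard library only.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic data (hypothetical): salary = 20 + 5*male + 1.5*experience.
experience = [1, 3, 5, 7, 2, 4, 6, 8]
male = [0, 0, 0, 0, 1, 1, 1, 1]
salary = [20 + 5 * d + 1.5 * x for d, x in zip(male, experience)]

rows = [[1, d, x] for d, x in zip(male, experience)]  # columns: 1, D2, X
b1, b2, b3 = ols(rows, salary)

print(f"female line: {b1:.2f} + {b3:.2f} * X")
print(f"male line:   {b1 + b2:.2f} + {b3:.2f} * X")
```

Because the synthetic data contain no disturbance term, the estimates reproduce the assumed values: the two lines share the slope on X and differ only by the coefficient on the dummy.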
Regression on one quantitative variable and one qualitative variable with more than two classes

• Suppose that, on the basis of cross-sectional data, we want to regress the annual expenditure on health care by an individual on the income and education of the individual.
• Since the variable education is qualitative in nature, suppose we consider three mutually exclusive levels of education: less than high school, high school, and college.
• Now, unlike the previous case, we have more than two categories of the qualitative variable education. Therefore, following the rule that the number of dummies be one less than the number of categories of the variable, we introduce two dummies to take care of the three levels of education.
• Assuming that the three educational groups have a common slope but different intercepts in the regression of annual expenditure on health care on annual income, we can use the following model:
Yi = α1 + α2D2i + α3D3i + βXi + ui ---------- (6)
where
Yi = annual expenditure on health care
Xi = annual income
D2i = 1 if high school education
    = 0 otherwise
D3i = 1 if college education
    = 0 otherwise
• Note that in the preceding assignment of the dummy variables we are arbitrarily treating the “less than high school education” category as the base category.
• Therefore, the intercept α1 will reflect the intercept for this category.
• The differential intercepts α2 and α3 tell by how much the intercepts of the other two categories differ from the intercept of the base category.
Assuming E(ui) = 0, we obtain from (6)
E(Yi | D2i = 0, D3i = 0, Xi) = α1 + βXi
E(Yi | D2i = 1, D3i = 0, Xi) = (α1 + α2) + βXi
E(Yi | D2i = 0, D3i = 1, Xi) = (α1 + α3) + βXi
• which are, respectively, the mean health care expenditure functions for the three levels of education, namely, less than high school, high school, and college.
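
A small numerical sketch of the three expenditure functions follows. The coefficient values are hypothetical, chosen only to make the arithmetic easy to follow, and the function name expected_spending is an illustrative choice.

```python
# Hypothetical coefficient values (not estimates from real data).
a1, a2, a3, beta = 100.0, 40.0, 90.0, 0.05

def expected_spending(education, income):
    """Mean health care expenditure for an education level and income."""
    d2 = 1 if education == "high school" else 0
    d3 = 1 if education == "college" else 0
    return a1 + a2 * d2 + a3 * d3 + beta * income

income = 2000
base = expected_spending("less than high school", income)
high_school = expected_spending("high school", income)
college = expected_spending("college", income)

print(high_school - base)     # differential intercept of high school
print(college - base)         # differential intercept of college
print(college - high_school)  # difference between the two non-base groups
```

The first two differences equal the two differential intercepts, whatever the income level. Two non-base categories can also be compared directly: the college versus high school gap is the difference between their differential intercepts.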
Regression on one quantitative variable and two qualitative variables

• The technique of dummy variables can be easily extended to handle more than one qualitative variable:
• Yi = α1 + α2D2i + α3D3i + βXi + ui ---------- (7)
• where
  Yi = annual salary
  Xi = years of teaching experience
  D2i = 1 if male
      = 0 otherwise
  D3i = 1 if white
      = 0 otherwise
• What is its base category?
• Notice that each of the two qualitative variables, sex and color, has two categories and hence needs one dummy variable for each.
• Note also that the omitted, or base, category now is “black female professor.”
• Assuming E(ui) = 0, we can obtain the following regressions from (7):
• Mean salary for black female professor:
  E(Yi | D2i = 0, D3i = 0, Xi) = α1 + βXi
• Mean salary for black male professor:
  E(Yi | D2i = 1, D3i = 0, Xi) = (α1 + α2) + βXi
• Mean salary for white female professor:
  E(Yi | D2i = 0, D3i = 1, Xi) = (α1 + α3) + βXi
• Mean salary for white male professor:
  E(Yi | D2i = 1, D3i = 1, Xi) = (α1 + α2 + α3) + βXi

What can you conclude?

• Once again, it is assumed that the preceding regressions differ only in the intercept coefficient but not in the slope coefficient.
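
The four group intercepts implied by a model with two dummies can be enumerated mechanically. The sketch below uses hypothetical coefficient values (30, 4, 6 and a slope of 1.2) purely for illustration.

```python
from itertools import product

# Hypothetical coefficients: base intercept, male differential,
# white differential, and the common slope on experience.
a1, a2, a3, beta = 30.0, 4.0, 6.0, 1.2

labels = {
    (0, 0): "black female",
    (1, 0): "black male",
    (0, 1): "white female",
    (1, 1): "white male",
}

intercepts = {}
for d3, d2 in product([0, 1], repeat=2):  # d2 = male dummy, d3 = white dummy
    intercepts[labels[(d2, d3)]] = a1 + a2 * d2 + a3 * d3

for group, value in intercepts.items():
    print(f"{group:13s} mean salary = {value:.1f} + {beta} * X")
```

Every group has the same slope. The white male intercept is simply the base intercept plus both differentials, because the model contains no interaction between the two dummies.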
THE NATURE OF QUALITATIVE RESPONSE MODELS
• We now consider models in which the regressand is qualitative in nature:
• If the response variable, or regressand, takes two values, it is called a binary or dichotomous variable.
• If it takes three values, it is trichotomous.
• If it takes more than three values, it is polychotomous (or multiple-category).
• Suppose we want to study the labor force participation
(LFP) decision of adult males. Since an adult is either in the
labor force or not, LFP is a yes or no decision.
• Hence, the response variable, or regressand, can take only
two values, say, 1 if the person is in the labor force and 0
if he or she is not. In other words, the regressand is a
binary, or dichotomous, variable.
• Labor economics research suggests that the LFP decision is
a function of the unemployment rate, average wage rate,
education, family income, etc.
• It is important to note a fundamental difference between a
regression model where the regressand Y is quantitative
and a model where it is qualitative.
• In a model where Y is quantitative, our objective is to
estimate its expected, or mean, value given the values of
the regressors.
• In models where Y is qualitative, our objective is to find
the probability of something happening, such as voting
for a Democratic candidate, or owning a house, or
belonging to a union, or participating in a sport etc.
• Hence, qualitative response regression models are often
known as probability models.
There are three approaches to developing a probability
model for a binary response variable:

1. The linear probability model (LPM)


2. The logit model
3. The probit model
1. THE LINEAR PROBABILITY MODEL (LPM)

• To fix ideas, consider the following regression model:
• Yi = β1 + β2Xi + ui ---------- (5.11)
• where
• X = family income
• Y = 1 if the family owns a house
• Y = 0 if it does not own a house

• Model (5.11) looks like a typical linear regression
model but because the regressand is binary, or
dichotomous, it is called a linear probability model
(LPM).
• This is because the conditional expectation of Yi given
Xi , E(Yi | Xi ), can be interpreted as the conditional
probability that the event will occur given Xi , that is,
Pr (Yi = 1 | Xi).
• Thus, in our example, E(Yi | Xi) gives the probability that a family whose income is the given amount Xi owns a house.
• The justification of the name LPM for models like (5.11) can be seen as follows: Assuming E(ui) = 0, as usual (to obtain unbiased estimators), we obtain
  E(Yi | Xi) = β1 + β2Xi ---------- (5.12)
• Now, if Pi = probability that Yi = 1 (that is, the event occurs) and (1 − Pi) = probability that Yi = 0 (that is, the event does not occur), then Yi follows the Bernoulli probability distribution, so that
  E(Yi) = 0(1 − Pi) + 1(Pi) = Pi ---------- (5.13)
• Comparing (5.12) with (5.13), we can equate
  E(Yi | Xi) = β1 + β2Xi = Pi ---------- (5.14)
• For n independent trials, the binomial distribution has mean np and variance np(1 − p). With n = 1 (the Bernoulli case), the mean of Yi is Pi and its variance is Pi(1 − Pi).
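The LPM poses several problems
• Non-normality of the disturbances ui: the assumption of normality for ui is not tenable for the LPM because, like Yi, the disturbances ui also take only two values; that is, they also follow the Bernoulli distribution. This can be seen clearly if we write (5.11) as
  ui = Yi − β1 − β2Xi
  so that ui = 1 − β1 − β2Xi when Yi = 1 and ui = −β1 − β2Xi when Yi = 0.
• Heteroscedastic variances of the disturbances: var(ui) = Pi(1 − Pi), which changes with Xi because Pi depends on Xi.
• Nonfulfillment of 0 ≤ E(Yi | Xi) ≤ 1: the fitted values are not guaranteed to lie between 0 and 1.
• Low R² values: R² is of limited use as a measure of goodness of fit when the regressand is binary.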
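
The third problem is easy to reproduce. The sketch below fits an LPM by OLS to a small, made-up home ownership sample (the ten income and ownership values are hypothetical) and flags the fitted probabilities that fall outside the unit interval.

```python
# Hypothetical data: family income (X) and home ownership (Y = 1 if owns).
income = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
owns = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

n = len(income)
x_bar = sum(income) / n
y_bar = sum(owns) / n

# OLS estimates for the two-variable model Yi = b1 + b2*Xi + ui.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(income, owns))
s_xx = sum((x - x_bar) ** 2 for x in income)
b2 = s_xy / s_xx
b1 = y_bar - b2 * x_bar

fitted = [b1 + b2 * x for x in income]
for x, p in zip(income, fitted):
    flag = "" if 0 <= p <= 1 else "  <- outside [0, 1]"
    print(f"X = {x:2d}  fitted P = {p:6.3f}{flag}")

# Implied disturbance variance P(1 - P) for the fitted values inside
# [0, 1]; it differs across observations (heteroscedasticity).
variances = [p * (1 - p) for p in fitted if 0 <= p <= 1]
print(min(variances) < max(variances))  # True
```

In this sample the lowest income gets a negative fitted "probability" and the highest income gets one above 1, which is what motivates the logit and probit alternatives.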
2. THE LOGIT MODEL

• Logistic regression is used to investigate the relationship between a binary or ordinal response probability and explanatory variables.

• We will continue with our home ownership example to explain the basic ideas underlying the logit model.
• Recall that in explaining home ownership in relation to income, the LPM was
  Pi = E(Y = 1 | Xi) = β1 + β2Xi ---------- (15.5.1)
• where X is income and Y = 1 means the family owns a house.
• But now consider the following representation of home ownership:
  Pi = E(Y = 1 | Xi) = 1 / (1 + e^(−(β1 + β2Xi))) ---------- (15.5.2)
• For ease of exposition, we write this as
  Pi = 1 / (1 + e^(−Zi)) = e^(Zi) / (1 + e^(Zi)) ---------- (15.5.3)
  where Zi = β1 + β2Xi.
• Equation (15.5.3) represents what is known as the (cumulative) logistic distribution function.
• It is easy to verify that as Zi ranges from −∞ to +∞, Pi ranges between 0 and 1 and that Pi is nonlinearly related to Zi (i.e., Xi).
• Pi is nonlinear not only in Xi but also in the βs; thus we cannot apply the familiar OLS procedure directly to estimate the parameters.
• If Pi, the probability of owning a house, is given by (15.5.3), then (1 − Pi), the probability of not owning a house, is
  1 − Pi = 1 / (1 + e^(Zi)) ---------- (15.5.4)
• Therefore, we can write
  Pi / (1 − Pi) = (1 + e^(Zi)) / (1 + e^(−Zi)) = e^(Zi) ---------- (15.5.5)
• Now Pi/(1 − Pi) is simply the odds ratio in favor of owning a house, that is, the ratio of the probability that a family will own a house to the probability that it will not own a house.
• Thus, if Pi = 0.8, it means that odds are 4 to 1 in favor of the family owning a house.
• Now if we take the natural log of (15.5.5), we obtain a very interesting result, namely,
  Li = ln(Pi / (1 − Pi)) = Zi = β1 + β2Xi ---------- (15.5.6)
• that is, L, the log of the odds ratio, is not only linear in X, but also (from the estimation viewpoint) linear in the parameters.
• L is called the logit, and hence the name logit model for models like (15.5.6).
Notice these features of the logit model
• As P goes from 0 to 1 (i.e., as Z varies from −∞ to +∞), the logit L goes from −∞ to +∞.
• That is, although the probabilities (of necessity) lie between 0 and 1, the logits are not so bounded.
• The logit becomes negative and increasingly large in magnitude as the odds ratio decreases from 1 to 0 and becomes increasingly large and positive as the odds ratio increases from 1 to infinity.
• More formally, the interpretation of the logit model given in (15.5.6) is as follows:
  - β2, the slope, measures the change in L for a unit change in X, that is, it tells how the log-odds in favor of owning a house change as income changes by a unit, say, $1000.
  - The intercept β1 is the value of the log-odds in favor of owning a house if income is zero.
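
The relations between probability, odds and logit can be verified with a few lines of Python. The coefficient values β1 = −4 and β2 = 0.5 below are hypothetical, as are the function names; only the formulas come from the logistic model.

```python
import math

def logistic(z):
    """Cumulative logistic function: P = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds in favor of the event: P / (1 - P)."""
    return p / (1.0 - p)

def logit(p):
    """Log of the odds."""
    return math.log(odds(p))

# P = 0.8 corresponds to odds of about 4 to 1.
print(round(odds(0.8), 6))

# Hypothetical coefficients for Z = b1 + b2 * X.
b1, b2 = -4.0, 0.5

def prob_owning(x):
    return logistic(b1 + b2 * x)

# The log-odds rise by b2 for each extra unit of X, although the
# probability itself does not change by a constant amount.
for x in (2, 8, 14):
    step_logit = logit(prob_owning(x + 1)) - logit(prob_owning(x))
    step_prob = prob_owning(x + 1) - prob_owning(x)
    print(f"X = {x:2d}: change in logit = {step_logit:.3f}, "
          f"change in P = {step_prob:.3f}")

# At X = 0 the log-odds equal the intercept b1.
print(round(logit(prob_owning(0)), 6))
```

The logit changes by the same amount at every income level, while the change in probability is largest in the middle of the range and small in the tails.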
3. THE PROBIT MODEL
• The probit model uses the cumulative normal distribution function, in place of the logistic function, to link the regressors to the probability that Y = 1.
• Probit and logit models are estimated using the maximum likelihood method.
• Interpretation of coefficients
  - A positive coefficient means that an increase in x makes the outcome y = 1 more likely; a negative coefficient means that it makes the outcome less likely.
  - We interpret the sign of the coefficient but not the magnitude. The magnitude cannot be read directly from the coefficient because different models have different scales of coefficients.
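
One way to see why only the sign is interpreted directly is to compute the change in probability per unit of x, which is β2·P(1 − P) in the logit model and β2·φ(Z) in the probit model, where φ is the standard normal density. The sketch below uses the same hypothetical coefficients (β1 = −2, β2 = 0.4) in both models purely for comparison; they are not estimates.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# Hypothetical coefficients, used in both models for illustration only.
b1, b2 = -2.0, 0.4

def logit_effect(x):
    """dP/dx in the logit model: b2 * P * (1 - P)."""
    p = logistic_cdf(b1 + b2 * x)
    return b2 * p * (1.0 - p)

def probit_effect(x):
    """dP/dx in the probit model: b2 * normal density at b1 + b2*x."""
    return b2 * normal_pdf(b1 + b2 * x)

for x in (0, 5, 10):
    z = b1 + b2 * x
    print(f"x = {x:2d}: logit P = {logistic_cdf(z):.3f}, "
          f"effect = {logit_effect(x):.3f} | "
          f"probit P = {normal_cdf(z):.3f}, "
          f"effect = {probit_effect(x):.3f}")
```

The effect has the same sign as the coefficient everywhere, but its size depends on the value of x and on which distribution function is used, so the raw coefficients of a logit and a probit fit are not directly comparable.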
Comparison of coefficients
