What Is A Math/Stats Model?

1. Often describes a relationship between variables.
2. Types:
- Deterministic models (no randomness)
- Probabilistic models, including regression models; a regression model is simple (1 explanatory variable) or multiple (several explanatory variables), and either linear or non-linear.
[Figure: a straight line Y = mX + b plotted against X, where the slope m = (change in Y)/(change in X) and b = the Y-intercept.]
Yi = β0 + β1Xi + εi

where Yi = the dependent (response) variable (e.g., CD4+ count) and Xi = the independent (explanatory) variable (e.g., years since seroconversion).
Population & Sample Regression Models

Population: there is an unknown relationship

Yi = β0 + β1Xi + εi
Population Linear Regression Model

[Figure: observed values of Y plotted against X, scattered about the population regression line E(Y) = β0 + β1Xi; the vertical deviation of each observed value from the line is the random error εi, so that Yi = β0 + β1Xi + εi.]
Sample Linear Regression Model

[Figure: sampled observations scattered about the fitted sample regression line Ŷi = β̂0 + β̂1Xi; the deviation of each observed value from the fitted line is the estimated random error (residual) ε̂i, and unsampled observations also lie about the line.]
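To make the population/sample distinction concrete, here is a minimal sketch in Python (not from the original slides; the parameter values and simulated data are invented for illustration) that draws a sample from a known population line and fits the sample regression line by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population (unknown in practice) parameters of Yi = beta0 + beta1*Xi + eps_i
beta0, beta1 = 2.0, 0.5

# One sample: fixed X values, random errors with E(eps | X) = 0
x = np.linspace(0.0, 10.0, 30)
eps = rng.normal(loc=0.0, scale=1.0, size=x.size)
y = beta0 + beta1 * x + eps

# Least-squares estimates of the sample regression line
x_dev = x - x.mean()
b1_hat = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev**2)
b0_hat = y.mean() - b1_hat * x.mean()

print(f"population: beta0={beta0}, beta1={beta1}")
print(f"sample fit: b0_hat={b0_hat:.3f}, b1_hat={b1_hat:.3f}")
```

Rerunning with a different seed gives different estimates β̂0 and β̂1 scattered around the fixed population values, which is exactly the population-versus-sample point.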
Keep in mind that the regressand Y and the regressor X themselves may be
nonlinear.
Look at Table 2.1. Keeping the value of income X fixed, say, at $80, we draw a family at random and observe its weekly family consumption expenditure Y as, say, $60. Still keeping X at $80, we draw another family at random and observe its Y value as $75. In each of these drawings (i.e., repeated sampling), the value of X is fixed at $80. We can repeat this process for all the X values shown in Table 2.1.
This means that our regression analysis is conditional regression analysis, that is, conditional on the given values of the regressor(s) X.

As shown in Figure 3.3, each Y population corresponding to a given X is distributed around its mean value, with some Y values above the mean and some below it. The mean value of these deviations corresponding to any given X should be zero:

E(ui | Xi) = 0 (3.2.1)

Note that this assumption implies that E(Yi | Xi) = β1 + β2Xi.
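As a quick numerical check of this implication (an illustrative sketch; the parameter values are invented), simulate Y at a few fixed X values and compare each conditional sample mean with β1 + β2X:

```python
import numpy as np

rng = np.random.default_rng(1)
beta1, beta2 = 17.0, 0.6   # invented "true" parameters

for x in (80.0, 100.0, 120.0):                 # fixed X values, repeated sampling at each
    u = rng.normal(0.0, 5.0, size=100_000)     # disturbances with E(u | X) = 0
    y = beta1 + beta2 * x + u
    print(f"X={x:5.0f}:  mean(Y | X) = {y.mean():7.2f}   beta1 + beta2*X = {beta1 + beta2 * x:7.2f}")
```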
The next assumption is homoscedasticity:

var(ui | Xi) = σ² (3.2.2)

Technically, (3.2.2) represents the assumption of homoscedasticity, or equal (homo) spread (scedasticity), that is, equal variance. Stated differently, (3.2.2) means that the Y populations corresponding to various X values have the same variance. Put simply, the variation around the regression line (which is the line of average relationship between Y and X) is the same across the X values; it neither increases nor decreases as X varies.
Figure 3.5, by contrast, shows the case where the conditional variance of the Y population varies with X. This situation is known as heteroscedasticity, or unequal spread or variance. Symbolically, in this situation (3.2.2) can be written as

var(ui | Xi) = σi² (3.2.3)
Figure 3.5 shows that var(u | X1) < var(u | X2) < . . . < var(u | Xi).
Therefore, the likelihood is that the Y observations coming from the
population with X = X1 would be closer to the PRF than those coming
from populations corresponding to X = X2, X = X3, and so on. In short,
not all Y values corresponding to the various Xs will be equally
reliable, reliability being judged by how closely or distantly the Y
values are distributed around their means, that is, the points on the
PRF.
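The homoscedastic/heteroscedastic contrast is easy to simulate (a sketch with invented numbers):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.repeat(np.arange(1.0, 6.0), 2000)     # five X values, many draws each

u_homo = rng.normal(0.0, 1.0, size=x.size)   # constant spread: var(u | X) = sigma^2
u_hetero = rng.normal(0.0, 0.5 * x)          # spread grows with X: var(u | X) = sigma_i^2

for xv in np.unique(x):
    m = x == xv
    print(f"X={xv:.0f}:  var(u|X) homoscedastic={u_homo[m].var():.2f}"
          f"   heteroscedastic={u_hetero[m].var():.2f}")
```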
The disturbances ui and uj are uncorrelated, i.e., no serial correlation. This
means that, given Xi , the deviations of any two Y values from their mean
value do not exhibit patterns. In Figure 3.6a, the us are positively correlated,
a positive u followed by a positive u or a negative u followed by a negative u.
In Figure 3.6b, the us are negatively correlated, a positive u followed by a
negative u and vice versa. If the disturbances follow systematic patterns, as in
Figure 3.6a and b, there is auto- or serial correlation. Figure 3.6c shows that
there is no systematic pattern to the us, thus indicating zero correlation.
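To see what serially correlated disturbances look like numerically, here is a small sketch (the AR(1) process and its parameter are invented for illustration, not taken from the text) comparing independent us with positively autocorrelated us:

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho = 5000, 0.8                    # rho > 0 gives positive serial correlation

e = rng.normal(0.0, 1.0, size=n)
u_iid = e                             # uncorrelated disturbances (as in Figure 3.6c)
u_ar = np.empty(n)                    # AR(1) disturbances: u_t = rho*u_{t-1} + e_t
u_ar[0] = e[0]
for t in range(1, n):
    u_ar[t] = rho * u_ar[t - 1] + e[t]

def lag1_corr(u):
    return np.corrcoef(u[:-1], u[1:])[0, 1]

print(f"lag-1 correlation, independent us: {lag1_corr(u_iid):+.3f}")   # near 0
print(f"lag-1 correlation, AR(1) us:       {lag1_corr(u_ar):+.3f}")    # near rho
```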
The disturbance u and explanatory variable X are uncorrelated. The PRF
assumes that X and u (which may represent the influence of all the omitted
variables) have separate (and additive) influence on Y. But if X and u are
correlated, it is not possible to assess their individual effects on Y. Thus, if X
and u are positively correlated, X increases when u increases and it decreases
when u decreases. Similarly, if X and u are negatively correlated, X increases
when u decreases and it decreases when u increases. In either case, it is
difficult to isolate the influence of X and u on Y.
In the hypothetical example of Table 3.1, imagine that we had only the first pair of observations on Y and X (4 and 1). From this single observation there is no way to estimate the two unknowns, β1 and β2. We need at least two pairs of observations to estimate the two unknowns.
This assumption too is not as innocuous as it looks. Look at Eq. (3.1.6). If all the X values are identical, then Xi = X̄ and the denominator of that equation will be zero, making it impossible to estimate β2 and therefore β1. Looking at our family consumption expenditure example in Chapter 2, if there is very little variation in family income, we will not be able to explain much of the variation in the consumption expenditure.
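A sketch of why variability in X matters (invented numbers): the OLS slope divides by Σ(Xi − X̄)², which is zero when all the Xi are identical.

```python
import numpy as np

def ols_slope(x, y):
    x_dev = x - x.mean()
    sxx = np.sum(x_dev**2)   # denominator: sum of squared deviations of X
    if sxx == 0.0:
        raise ValueError("no variation in X: the slope cannot be estimated")
    return np.sum(x_dev * (y - y.mean())) / sxx

y = np.array([70.0, 65.0, 90.0, 95.0])
print(ols_slope(np.array([80.0, 100.0, 120.0, 140.0]), y))   # works
print(ols_slope(np.array([100.0, 100.0, 100.0, 100.0]), y))  # raises: all Xi identical
```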
An econometric investigation begins with the specification of the
econometric model underlying the phenomenon of interest. Some important
questions that arise in the specification of the model include the following:
(1) What variables should be included in the model?
(2) What is the functional form of the model? Is it linear in the parameters,
the variables, or both?
(3) What are the probabilistic assumptions made about the Yi , the Xi, and
the ui entering the model?
Suppose we choose the following two models to depict the underlying relationship between the rate of change of money wages and the unemployment rate:

Yi = β1 + β2Xi + ui (3.2.7)

Yi = β1 + β2(1/Xi) + ui (3.2.8)

where Yi = the rate of change of money wages and Xi = the unemployment rate. The regression model (3.2.7) is linear both in the parameters and the variables, whereas (3.2.8) is linear in the parameters (hence a linear regression model by our definition) but nonlinear in the variable X. Now consider Figure 3.7.
If model (3.2.8) is the correct or the true model, fitting the model
(3.2.7) to the scatterpoints shown in Figure 3.7 will give us wrong
predictions.
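Since both models are linear in the parameters, each can be fitted by OLS after transforming the regressor. The following sketch (invented data, not the wage/unemployment series behind Figure 3.7) generates data from the reciprocal model (3.2.8) and shows that the misspecified linear model (3.2.7) fits worse:

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented data generated from the reciprocal model (3.2.8)
x = rng.uniform(2.0, 12.0, size=200)                  # "unemployment rate"
y = -1.0 + 20.0 / x + rng.normal(0.0, 0.5, size=200)  # "rate of change of money wages"

def ols_rss(regressor, y):
    X = np.column_stack([np.ones_like(regressor), regressor])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ coefs) ** 2)

print(f"RSS, linear model (3.2.7):     {ols_rss(x, y):8.2f}")
print(f"RSS, reciprocal model (3.2.8): {ols_rss(1.0 / x, y):8.2f}")  # correct form fits far better
```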
Unfortunately, in practice one rarely knows the correct variables to include in the model, the correct functional form of the model, or the correct probabilistic assumptions about the variables entering the model, for the theory underlying the particular investigation may not be strong or robust enough to answer all these questions.
We will discuss this assumption in Chapter 7, where we discuss multiple
regression models.
PRECISION OR STANDARD ERRORS OF LEAST-SQUARES ESTIMATES

The least-squares estimates are a function of the sample data. But since the data change from sample to sample, the estimates will change. Therefore, what is needed is some measure of reliability or precision of the estimators β̂1 and β̂2. In statistics the precision of an estimate is measured by its standard error (se), which can be obtained as follows:

var(β̂2) = σ² / Σxi² (3.3.1)

se(β̂2) = σ / √(Σxi²) (3.3.2)

var(β̂1) = (ΣXi² / (n Σxi²)) σ² (3.3.3)

se(β̂1) = √(ΣXi² / (n Σxi²)) σ (3.3.4)

where xi = Xi − X̄ is in deviation form and σ² is the constant or homoscedastic variance of ui of Assumption 4.

σ² itself is estimated by the following formula:

σ̂² = Σûi² / (n − 2) (3.3.5)

where σ̂² is the OLS estimator of the true but unknown σ², the expression n − 2 is known as the number of degrees of freedom (df), and Σûi² is the residual sum of squares (RSS). Once Σûi² is known, σ̂² can be easily computed:

Σûi² = Σyi² − β̂2² Σxi² (3.3.6)

Compared with Eq. (3.1.2), Eq. (3.3.6) is easy to use, for it does not require computing ûi for each observation.
Since var(β̂2) is always positive, as is the variance of any variable, the nature of the covariance between β̂1 and β̂2 depends on the sign of X̄: cov(β̂1, β̂2) = −X̄ var(β̂2). If X̄ is positive, then as the formula shows, the covariance will be negative. Thus, if the slope coefficient β2 is overestimated (i.e., the slope is too steep), the intercept coefficient β1 will be underestimated (i.e., the intercept will be too small).
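The formulas above are straightforward to compute; a compact sketch (with invented data) follows:

```python
import numpy as np

# Invented sample (not from the text)
x = np.array([80.0, 100.0, 120.0, 140.0, 160.0, 180.0])
y = np.array([70.0, 65.0, 90.0, 95.0, 110.0, 115.0])
n = x.size

x_dev = x - x.mean()                        # xi = Xi - Xbar
b2 = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev**2)
b1 = y.mean() - b2 * x.mean()

resid = y - (b1 + b2 * x)                   # residuals u_hat_i
sigma2_hat = np.sum(resid**2) / (n - 2)     # (3.3.5): RSS / degrees of freedom

se_b2 = np.sqrt(sigma2_hat / np.sum(x_dev**2))                        # (3.3.2)
se_b1 = np.sqrt(sigma2_hat * np.sum(x**2) / (n * np.sum(x_dev**2)))   # (3.3.4)
cov_b1_b2 = -x.mean() * sigma2_hat / np.sum(x_dev**2)                 # -Xbar * var(b2)

print(f"b1={b1:.3f} (se {se_b1:.3f})   b2={b2:.3f} (se {se_b2:.3f})"
      f"   cov(b1,b2)={cov_b1_b2:.4f}")
```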
PROPERTIES OF LEAST-SQUARES ESTIMATORS: THE GAUSS-MARKOV THEOREM

Given the assumptions of the classical linear regression model, the least-squares estimators, in the class of unbiased linear estimators, have minimum variance; that is, they are BLUE (best linear unbiased estimators).

THE COEFFICIENT OF DETERMINATION r2: A MEASURE OF GOODNESS OF FIT

We now consider the goodness of fit of the fitted regression line to a set of data; that is, we shall find out how well the sample regression line fits the data. The coefficient of determination r2 (two-variable case) or R2 (multiple regression) is a summary measure that tells how well the sample regression line fits the data.
Consider a heuristic explanation of r2 in terms of a graphical device, known
as the Venn diagram shown in Figure 3.9.
In this figure the circle Y represents variation in the dependent variable Y and
the circle X represents variation in the explanatory variable X. The overlap of
the two circles indicates the extent to which the variation in Y is explained
by the variation in X.
To compute this r2, we proceed as follows. Recall that

Yi = Ŷi + ûi (2.6.3)

or in the deviation form

yi = ŷi + ûi (3.5.1)

where use is made of (3.1.13) and (3.1.14). Squaring (3.5.1) on both sides and summing over the sample, we obtain

Σyi² = Σŷi² + Σûi² (since Σŷiûi = 0)

that is, the total sum of squares (TSS) equals the explained sum of squares (ESS) plus the residual sum of squares (RSS).
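As a numerical sketch of this decomposition (invented data), TSS = ESS + RSS and r2 = ESS/TSS:

```python
import numpy as np

# Invented sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.0, 5.0, 7.0, 12.0, 11.0])

x_dev, y_dev = x - x.mean(), y - y.mean()
b2 = np.sum(x_dev * y_dev) / np.sum(x_dev**2)
b1 = y.mean() - b2 * x.mean()
y_hat = b1 + b2 * x

tss = np.sum(y_dev**2)                # total sum of squares, sum of yi^2 (deviations)
ess = np.sum((y_hat - y.mean())**2)   # explained sum of squares, sum of yhat_i^2
rss = np.sum((y - y_hat)**2)          # residual sum of squares, sum of uhat_i^2

print(f"TSS = {tss:.2f}  =  ESS {ess:.2f} + RSS {rss:.2f}")
print(f"r2 = ESS/TSS = {ess / tss:.4f}")
```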