Eviews: Example #3: Education and U.S. Mortality Rates
Table of Contents
I. Data files
1. Import Excel data into a workfile
2. Save workfile
3. Open an existing workfile
III. Functions
1. Observation and Date Functions
2. Mathematical Functions
3. Time Series Functions
I. Data files
All EVIEWS files are called workfiles, and contain imported Excel data, regression equations (if you create
any), hypothesis test results, graphs (if you create any), etc. Thus, after working in EVIEWS, if you want to
save everything when you are done (data, an equation specification, OLS output, test results, graphs), you will
save it in a workfile.
All sessions with EVIEWS begin with either opening an existing workfile or creating a new workfile.
1.
1.1
1.2
A pop-up box appears, giving you choices for Frequency (time-period denominations: daily, weekly, quarterly,
etc.) and Workfile Range (first period and last period in the data sample).
yearly data:
quarterly data:
monthly data:
weekly data:
daily data:
Example
Suppose you have aggregate dividends and profits data with quarterly increments between the first
quarter of 1970 and the last quarter of 1991. In the Workfile Range box, beneath Start Date, type
1970:1
and beneath End Date, type
1991:4
1.3
1.4
When the workfile is created, EVIEWS creates a workfile box with a list of the current variables. By default,
EVIEWS stores the variable c, which is used to create the intercept in OLS regressions, and resid, which is
where EVIEWS stores regression residuals.
2.
2.1
2.2
Next, in the Files of Type box, scroll down and click-on EXCEL.
2.3
Next, in the Look In box, scroll down to the drive where your Excel data is stored (most likely on a disk, so scroll
down to drive A), and then navigate until you find the file: double-click on the file name.
2.4
A pop-up box appears labeled Excel Spreadsheet Import. All the datasets for the class will be Excel data files
with data starting in cell A2. The first row will always contain the variable name.
2.4.1
2.4.2
Beneath Names for Series or Number if Name in File, type the number of variables that the Excel file
contains. You will always know this beforehand. Finally, click OK.
2.5
Once the Excel data is imported, it will be listed in the workfile box, along with c and resid.
3.
Save workfile
3.1
3.2
Once the workfile is saved, EVIEWS automatically labels the workfile box with the name.
4.
4.1
II.
1.
1.1
1.2
A preferable method is the following: in the programmable white-area below the main EVIEWS tool-bar, type
series t = @trend
Once the above command is typed in, hit the ENTER key, and EVIEWS will perform the command.
1.3
EVIEWS will create a variable called t, and will present the variable name t in the workfile list of variables.
2.
2.1
Assume we have already created a time trend, described above. In the workfile tool-bar,
PROCS, GENERATE SERIES
Then, type
Variable name = t^2
Variable name = @exp(t)
2.2
A preferable method is the following: in the programmable white-area below the main EVIEWS tool-bar, type
series t = @trend
series t2 = t^2
series exp_t = @exp(t)
then ENTER. Note that the variable names used here, t, t2, and exp_t, are arbitrary.
3.
3.1
Example
If AGE exists as a variable, age squared can be generated as, for example,
AGE_2 = AGE^2
If GDP exists, ln(GDP) can be generated as
LN_GDP = log(GDP)
If GDP and POP (population) exist, then per-capita GDP can be generated as
GDP_PC = GDP/POP
3.2
Alternatively, we can use the white-area below the main EVIEWS tool-bar. Use the command SERIES.
Example
series AGE_2 = AGE^2
series LN_GDP = log(GDP)
series GDP_PC = GDP/POP
4.
Viewing Data
4.1
4.2
EVIEWS creates a box called SERIES:Variable Name in which the chosen variable is
displayed in spreadsheet format.
4.1.2
In order to view several variables at once, while holding down the CTRL key, click on each variable
name; then, click on SHOW in the workfile tool-bar, then OK.
4.1.3
Once the several variables are selected and shown, EVIEWS creates a GROUP: UNTITLED box
with the variable data in spread-sheet format on display.
2
3
4.2.1
4.2.2
4.2.3
VIEW, CORRELATIONS, COMMON SAMPLE derives correlations for all pairs of chosen
variables.
See Topic III on EVIEWS functions and their associated command forms.
Lower case and upper case are equivalent: gdp = GDP.
4.2.4
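$$\hat{\rho}_{16} = \widehat{\text{corr}}(y_t, y_{t-16}) = \frac{\frac{1}{T}\sum_{t=17}^{T}(y_t - \bar{y})(y_{t-16} - \bar{y})}{\frac{1}{T}\sum_{t=1}^{T}(y_t - \bar{y})^2}$$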
You can adjust the lag-length by editing the lag number in the lag box. The result for the
monthly S&P500 follows below:
Date: 02/26/02 Time: 14:02
Sample: 1999:01 2001:10
Included observations: 34
Autocorrelation
Partial Correlation
. |*******|
. |****** |
. |****** |
. |***** |
. |**** |
. |**** |
. |*** |
. |**. |
. |**. |
. |* . |
. |* . |
. | . |
. | . |
. *| . |
. *| . |
.**| . |
. |*******|
. | . |
. | . |
. | . |
. | . |
. | . |
. | . |
. | . |
. | . |
. | . |
. | . |
. | . |
. | . |
. | . |
. | . |
. | . |
4.2.4.a The length of the lines below Autocorrelation and Partial Autocorrelation visually
denote the actual levels of autocorrelation and partial autocorrelation. The numerical value of
the estimated autocorrelations is displayed below AC.
4.2.4.b EVIEWS automatically performs a Ljung-Box Q-test for each hypothesis that there does
not exist any autocorrelation up to some order k. Thus, EVIEWS tests successively:
$$H_0: \rho_1 = 0$$
$$H_0: \rho_1 = 0,\ \rho_2 = 0$$
....
$$H_0: \rho_1 = 0, \ldots, \rho_{16} = 0$$
The test Q-statistic has a chi-squared distribution with k degrees of freedom, where k denotes
the number of parameters tested under the null hypothesis (e.g. k = 1 for the first-order
autocorrelation test).
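The autocorrelation estimate and the Q-test described above can be sketched in a few lines of Python. This is a minimal illustration, not EVIEWS code: the function names and the toy series are assumptions made for the example, and the Q-statistic uses the standard Ljung-Box form Q = T(T+2) Σ ρ_k² / (T−k).

```python
def autocorrelation(y, lag):
    """Sample autocorrelation of y at the given lag (full-sample mean and variance)."""
    T = len(y)
    y_bar = sum(y) / T
    variance = sum((v - y_bar) ** 2 for v in y) / T
    covariance = sum((y[t] - y_bar) * (y[t - lag] - y_bar) for t in range(lag, T)) / T
    return covariance / variance


def ljung_box_q(y, max_lag):
    """Ljung-Box Q for the null of no autocorrelation at lags 1..max_lag."""
    T = len(y)
    return T * (T + 2) * sum(
        autocorrelation(y, k) ** 2 / (T - k) for k in range(1, max_lag + 1)
    )


# Hypothetical toy series, not the S&P500 data used in the text.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
rho_1 = autocorrelation(y, 1)   # first-order autocorrelation
q_1 = ljung_box_q(y, 1)         # compare with a chi-squared(1) critical value
```

A large Q relative to the chi-squared critical value with k degrees of freedom leads to rejection of the null of no autocorrelation up to lag k.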
An autocorrelation is the correlation between a variable and itself (hence, auto) in the past. Thus, for example, the estimated
correlation between GDP this month and one month ago is the first-order autocorrelation: $\hat{\rho}_1 = \widehat{\text{corr}}(gdp_t, gdp_{t-1})$. See Diebold,
chapter 6.
A partial autocorrelation at lag k, PAC(k), is the OLS estimated slope on y_{t-k} in the regression of y_t on y_{t-1}, ..., y_{t-k} with an intercept included.
Thus, estimate $y_t = \beta_0 + \beta_1 y_{t-1} + \ldots + \beta_k y_{t-k} + \varepsilon_t$, and define PAC(k) = $\hat{\beta}_k$. See Diebold, chapter 6.
5.
5.1
Highlight the variable name in the workfile box. In the workfile tool-bar,
DELETE
5.2
III.
Functions
1. Observation and Date Functions
@day
@elem(x,d)
@month
@quarter
@year
2. Mathematical Functions
@abs(x), abs(x)     absolute value
@fact(x)            factorial
@exp(x), exp(x)     exponential
@inv(x)             inverse of x = 1/x
@log(x), log(x)     natural log
@round(x)           round to the nearest integer
@sqrt(x)            square root
3. Time Series Functions
d(x)                first difference
d(x,n)              nth-order difference
dlog(x)             first difference of the log
dlog(x,n)           nth-order difference of the log
@pch(x)             one period percentage change (decimal)
@pchy(x)            one-year percentage change
@seas(n)            seasonal dummy: returns 1 when the quarter or month equals n and 0 otherwise
@trend              generates a trend series, normalized to 0 at the first period/obs in the workfile
@trend(n)           generates a trend series, normalized to 0 at the nth period/obs in the workfile
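To make the time series functions concrete, here is a small Python sketch of what d(x), d(x,n), dlog(x), @pch(x) and @trend compute. It is an illustration only: the Python function names are hypothetical stand-ins, missing values are represented by None, and the `base` argument of `trend` is assumed to be a 0-based observation index.

```python
import math


def d(x, n=1):
    """n-th order difference: apply the first difference n times."""
    for _ in range(n):
        x = [None] + [
            (b - a) if a is not None and b is not None else None
            for a, b in zip(x, x[1:])
        ]
    return x


def dlog(x):
    """First difference of the natural log."""
    return d([math.log(v) for v in x])


def pch(x):
    """One-period percentage change, in decimal form."""
    return [None] + [(b - a) / a for a, b in zip(x, x[1:])]


def trend(n_obs, base=0):
    """Linear trend equal to 0 at observation index `base` (0-based, an assumption)."""
    return [t - base for t in range(n_obs)]


# Hypothetical series growing 10% per period.
x = [100.0, 110.0, 121.0]
first_diff = d(x)        # first observation is missing
second_diff = d(x, 2)    # first two observations are missing
growth = pch(x)
t = trend(len(x))
```

Note that each difference costs one observation at the start of the sample, which is why regressions on differenced data use a shorter sample.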
IV.
1.
1.1
1.1.2
In the Equation Specification space, type in the equation without =, and with c if an intercept is to
be included.
1.1.3
Beneath Estimation Settings and next to Method, scroll to LS (Least Squares). Usually, LS is the
default setting. Finally, OK.
1.1.4
Beneath Estimation Settings and next to Sample, alter the sample date-range if you want to use only a
portion of the sample.
1.1.5
An equation box appears with the OLS results: you can expand it or maximize it.
1.1.5.a The equation box output can be named: see the topic NAME EQUATION below in topic
V.3.
1.1.5.b The equation box output can be frozen for editing, and copying into Word and Excel: see
the topic FREEZE in topic V.2, below.
1.2
Define and Estimate a New Regression Equation: Program Space (the white area)
Rather than pulling-up an estimation pop-up box, we can tell EVIEWS to estimate an equation directly in the
programmable white-area beneath the main EVIEWS tool-bar.
The command LS tells EVIEWS to perform least squares estimation. After the command, type the equation
without =, and with c if you want an intercept. After everything is typed in, hit ENTER.
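For readers who want to see what a command such as LS y c x computes, the following Python sketch fits a one-regressor model with an intercept by least squares. It is a minimal illustration under stated assumptions: `ols_simple` is a hypothetical helper name, the data are made up, and EVIEWS of course handles any number of regressors.

```python
def ols_simple(y, x):
    """Least squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept (the role of c in the EVIEWS equation)
    return b0, b1


# Hypothetical data lying exactly on y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
b0, b1 = ols_simple(y, x)
```

The intercept estimate corresponds to the coefficient EVIEWS reports for C, and the slope to the coefficient on the regressor.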
1.3
2.
1.3.1
In the equation box tool-bar, ESTIMATE: EVIEWS will show you the current regression equation.
1.3.2
Type in the new equation specification. Change the Method and Sample range if desired.
occurs no matter what (!), we should include a constant term. Since the effect education has on mortality may
be nonlinear (decreasing, but at a decreasing rate) we will create and include squared ed_coll.
The model is
$$\text{mort}_i = \beta_0 + \beta_1\,\text{ed\_coll}_i + \beta_2\,\text{ed\_coll}_i^2 + \beta_3\,\text{phys}_i + \beta_4\,\text{health\_exp}_i + \varepsilon_i$$
Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  796.9879     251.7288      3.166058   0.0027
ED_COLL           -776.6662     2636.246     -0.294611   0.7696
ED_COLL_2         -10265.26     7978.920     -1.286548   0.2047
PHYS               1.819145     0.390880      4.653974   0.0000
HEALTH_EXP         0.074272     0.055304      1.342974   0.1859

R-squared             0.704472    Mean dependent var      855.0059
Adjusted R-squared    0.678774    S.D. dependent var      137.9660
S.E. of regression    78.19466    Akaike info criterion   11.64917
Sum squared resid     281262.6    Schwarz criterion       11.83857
Log likelihood       -292.0539    F-statistic             27.41344
Durbin-Watson stat    1.854259    Prob(F-statistic)       0.000000
The fitted values are (in the equation box: View, Actual/Fitted/Residual: see Part V)
[Figure: Actual, Fitted, and Residual series for the mortality regression, plotted by observation.]
3.
3.1
3.2
3.3
In the Method box, scroll to LS (Least Squares). Alter the Sample range if required.
3.4
3.5
A space next to Weight appears: type in the variable name that serves as the weighting instrument, then OK.
Example:
We want to estimate an aggregate state-wide health care expenditure model
(1)
$$\text{health}_t = \beta_1 + \beta_2\,\text{senior}_t + \beta_3\,\text{income}_t + \varepsilon_t$$
where health_t denotes state-wide health care expenditure, senior_t denotes the percent of the t-th state's
population that is over the age of 65, and income_t denotes the t-th state's aggregate disposable income. We
have evidence that the variance of health care expenditure is non-constant, and is proportional to the
state's population size squared:
(2)
$$\sigma_t^2 = \sigma^2\,\text{pop}_t^2$$
In this case, OLS is inefficient and standard hypothesis tests are invalid. Employing Feasible
Generalized Least Squares [FGLS], in this case, is equivalent to Weighted Least Squares, with each
observation weighted by the inverse of the population size. We want to estimate the transformed model
(1')
$$\frac{\text{health}_t}{\text{pop}_t} = \beta_1\frac{1}{\text{pop}_t} + \beta_2\frac{\text{senior}_t}{\text{pop}_t} + \beta_3\frac{\text{income}_t}{\text{pop}_t} + \frac{\varepsilon_t}{\text{pop}_t}$$
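The transformation behind Weighted Least Squares can be sketched in Python as follows. This is an illustration under simplifying assumptions: it uses a single regressor x rather than both senior and income, the function name and data are hypothetical, and the small 2x2 system is solved directly rather than with a matrix library.

```python
def wls_by_transformation(y, x, pop):
    """WLS for y = b1 + b2*x + e when Var(e) is proportional to pop**2."""
    # Divide every term by pop_t so the transformed error has constant variance.
    y_s = [yi / p for yi, p in zip(y, pop)]
    c_s = [1.0 / p for p in pop]              # transformed intercept column
    x_s = [xi / p for xi, p in zip(x, pop)]   # transformed regressor
    # OLS without an intercept on the two transformed columns (2x2 normal equations).
    s_cc = sum(c * c for c in c_s)
    s_cx = sum(c * v for c, v in zip(c_s, x_s))
    s_xx = sum(v * v for v in x_s)
    s_cy = sum(c * v for c, v in zip(c_s, y_s))
    s_xy = sum(a * b for a, b in zip(x_s, y_s))
    det = s_cc * s_xx - s_cx ** 2
    b1 = (s_cy * s_xx - s_cx * s_xy) / det
    b2 = (s_cc * s_xy - s_cx * s_cy) / det
    return b1, b2


# Hypothetical data lying exactly on y = 2 + 3x, with made-up population sizes.
x = [1.0, 2.0, 3.0, 4.0]
pop = [1.0, 2.0, 4.0, 5.0]
y = [2.0 + 3.0 * xi for xi in x]
b1, b2 = wls_by_transformation(y, x, pop)
```

Note that the transformed regression has no separate intercept: the original constant term becomes the coefficient on 1/pop.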
4.
i
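$$\text{mort}_i = \beta_0 + \beta_1\,\text{ed\_coll}_i + \beta_2\,\text{ed\_coll}_i^2 + \beta_3\,\text{phys}_i + \beta_4\,\text{health\_exp}_i + \varepsilon_i$$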
Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                 -6761.235     6567.843     -1.029445   0.3085
ED_COLL            57955.33     37372.16      1.550762   0.1277
PHYS              -64.83620     30.06773     -2.156339   0.0362
HEALTH_EXP         9.289032     4.374160      2.123615   0.0390

R-squared             0.104180    Mean dependent var      6716.724
Adjusted R-squared    0.047000    S.D. dependent var      6417.788
S.E. of regression    6265.156    Akaike info criterion   20.39858
Sum squared resid     1.84E+09    Schwarz criterion       20.55010
Log likelihood       -516.1638    F-statistic             1.821959
Durbin-Watson stat    2.231099    Prob(F-statistic)       0.156061
College education and health care expenditure are associated with greater state-to-state differences in mortality
rates, on average, while more physicians render mortality rates more homogeneous. In fact, the F-test is a test of
heteroscedasticity. The p-value for F is about 16%, so the joint evidence is weak, but the significant slopes on
phys and health_exp do suggest the homoscedasticity assumption is invalid.
We can use the positively associated factor ed_coll in WLS by assuming
$$\sigma_i^2 = \sigma^2\,\text{ed\_coll}_i^2$$
Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  987.7339     345.4736      2.859072   0.0064
ED_COLL           -2557.626     3264.201     -0.783538   0.4373
ED_COLL_2         -6204.347     8826.086     -0.702956   0.4856
PHYS               2.075763     0.336755      6.164025   0.0000
HEALTH_EXP         0.040821     0.056756      0.719240   0.4756

Weighted Statistics
R-squared             0.958101    Mean dependent var      837.3803
Adjusted R-squared    0.954458    S.D. dependent var      400.7874
S.E. of regression    85.53036    Akaike info criterion   11.82851
Sum squared resid     336510.4    Schwarz criterion       12.01791
Log likelihood       -296.6271    F-statistic             55.81226
Durbin-Watson stat    1.756078    Prob(F-statistic)       0.000000
R-squared and the F-statistic have improved, although the standard error of the regression and the information
criteria have not. Note the dependent variable has changed so caution about such comparisons is advised.
5.
5.1
5.1.2
5.2
Beneath Coefficient Restrictions, type the restrictions on the coefficients c(i), separated by commas.
These are the second and third parameters, hence beneath Coefficient Restrictions type
c(2) = 0, c(3) = 0
The results indicate overwhelming rejection of the null in favor of the alternative.
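As a quick arithmetic check on the Wald output for this test: for linear restrictions, the chi-square statistic equals the F-statistic multiplied by the number of restrictions. The Python sketch below verifies this using the two statistics reported in the output; the variable names are illustrative, and the p-value line relies on the fact that a chi-squared variable with 2 degrees of freedom has upper-tail probability exp(-x/2).

```python
import math

f_stat = 49.83546   # F-statistic reported in the Wald output
q = 2               # number of restrictions: c(2) = 0 and c(3) = 0

# Chi-square form of the Wald statistic for q linear restrictions.
chi_square = q * f_stat

# Upper-tail probability of a chi-squared(2) variable evaluated at chi_square.
p_value = math.exp(-chi_square / 2.0)
```

The product reproduces the reported chi-square value of 99.67092, and the p-value is far below any conventional significance level, which is why the output shows 0.0000.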
Wald Test:
Equation: Untitled
Test Statistic      Value       df        Probability
F-statistic         49.83546    (2, 46)   0.0000
Chi-square          99.67092    2         0.0000
5.2
If the data is cross-sectional, the data needs to be sorted according to the binary quality which
is being tested (e.g. female vs. male). Once the data is sorted, the observation number where the binary
variable value changes to 1 needs to be obtained.
If the data is a time series and we want to test for a structural change at some point in time,
we need to obtain the precise date.
5.2.2
See IV.4, below, for instructions on using lagged values. We estimate the model by least
squares, QUICK, ESTIMATE EQUATION, and type in the equation
6.
6.1
In the workfile box tool-bar, GENR, then type in the functional statement using existing function
commands.
For example, if AGE exists as a variable, age squared can be generated as, for example,
AGE_2 = AGE^2
If GDP exists, log(GDP) can be generated as
LN_GDP = log(GDP)
If GDP and POP (population) exist, then per-capita GDP can be generated as
GDP_PC = GDP/POP
6.1.2
For a better method, we can use the programmable white-area beneath the main EVIEWS tool-bar.
Use the command SERIES:
series age_2 = age^2
series ln_gdp = log(gdp)
series gdp_pc = gdp/pop
Or, for a trend variable,
series t = @trend
After each line is typed, be sure to hit the ENTER key: EVIEWS will perform the command only after
the ENTER key is hit.
6.2
However, we only have data on profit and GDP. Then, in the programmable white-area, type
ls log(profits) c log(gdp)
and hit ENTER.
6.3
Trend Variables
Any time trend variable and any function of time trend variable can be added to a regression model to account
for deterministic (non-random) trend in a time series.
6.3.1
See topics II.1-II.3 for instructions on how to create linear, quadratic and exponential trend variables.
6.3.2
In order to include a time trend variable or function of such a variable, simply add it to the regression
Equation Specification.
[Figure: Aggregate Quarterly Dividends, 1970:1 - 1991:4. Vertical axis: Billions $; horizontal axis: Year.]
In order to account for a likely linear time trend, we may specify a simple linear trend model
$$\text{div}_t = \beta_1 + \beta_2 t + \varepsilon_t$$
[Figure: Residual, Actual, and Fitted series for the linear trend model, 1970-1990.]
The linear time-trend regression fit is very poor based on residual tests of autocorrelation and the SIC.
Consider, instead, a quadratic trend model:
$$\text{div}_t = \beta_1 + \beta_2 t + \beta_3 t^2 + u_t$$
[Figure: Residual, Actual, and Fitted series for the quadratic trend model, 1970-1990.]
The residuals appear to be more random and noisy, although, in fact, they are not: there are clear
signs of cycles in the residuals suggesting severe autocorrelation and omitted variables (i.e. there exists
some neglected dividends structure that we need to model using techniques in Topic VI).
7.
Lagged Variables
Any existing variable can be lagged for use as a regressor. For example, if DIV, PROFITS and GDP are
the existing variables, we can generate related lagged variables by employing, for example, GDP(-1) or GDP(-2), etc.
Example:
Suppose DIV, PROFITS and GDP are the existing variables. We are interested in modeling corporate
dividend payouts as a function of national income and profits. However, current dividends are paid
from past profits:
Example:
We want to estimate an AR(3) model of corporate dividends:
$$\text{DIV}_t = \beta_1 + \beta_2\,\text{DIV}_{t-1} + \beta_3\,\text{DIV}_{t-2} + \beta_4\,\text{DIV}_{t-3} + \varepsilon_t$$
A lagged variable is a past value of a variable. Thus, for GDP_t, in the t-th month, the one-month lagged value of GDP is GDP_{t-1}. The
12-month (one-year) lagged value of GDP is GDP_{t-12}.
AR(3) denotes autoregressive of order 3: a model which regresses a variable on itself (hence, auto, Greek for self) 3
periods into the past (hence, order 3). See Topic VI, and Diebold, chapters 6-9.
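To illustrate what lagging does to the sample, the Python sketch below builds lagged series and the usable data rows for an AR(p) regression. The helper names and the dividend figures are hypothetical; in EVIEWS the same effect is obtained simply by typing DIV(-1), DIV(-2), DIV(-3) in the equation.

```python
def lag(x, k):
    """Series lagged k periods: element t holds x[t-k]; the first k entries are missing."""
    return [None] * k + x[: len(x) - k]


def ar_rows(x, p):
    """Rows (x_t, x_{t-1}, ..., x_{t-p}) for every t at which all p lags are available."""
    lags = [lag(x, k) for k in range(1, p + 1)]
    return [(x[t],) + tuple(series[t] for series in lags) for t in range(p, len(x))]


# Hypothetical dividend series with six observations.
div = [10.0, 11.0, 13.0, 12.0, 15.0, 16.0]
div_lag1 = lag(div, 1)
rows = ar_rows(div, 3)   # usable observations for an AR(3) regression
```

With six observations and three lags only three rows remain, which mirrors how EVIEWS automatically shortens the estimation sample when lagged regressors are used.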
V.
1.
1.1
VIEW
Located in the equation box tool-bar: Navigates through the equation representation, OLS output and
hypothesis tests.
1.1.1
REPRESENTATIONS
The specified model based on the typed equation, and the actual mathematical representation.
1.1.2
1.1.3
ESTIMATION OUTPUT
Default: displays the actual OLS output.
1.1.4
COEFFICIENT TESTS
Allows us to perform tests of compound hypotheses on the estimated parameters, including omitted
variables, redundant variables and standard Wald tests of linear coefficient restrictions.
See Topic IV.3 on hypothesis tests.
1.1.5
RESIDUALS TESTS
Allows us to perform tests for autocorrelated errors, errors with non-constant variance (i.e.
heteroscedastic errors), and a combination of the two in the form of ARCH errors (i.e. correlated
variances).
1.2
FREEZE
Located in the equation box tool-bar.
1.2.1
FREEZE stores the regression output in an Excel spread-sheet format called a table. Once the
regression output is frozen, we can directly edit the results, add titles, and copy-paste the results into
Excel or Word. Every EVIEWS graph and table was pasted directly into this document.
1.3
NAME
Located in the equation box tool-bar, assigns a name to any equation box of results.
1.3.1
Click-on NAME in the equation box tool-bar. Beneath Name to identify object, type the name of your
preference.
1.3.2
EVIEWS will place the equation name in the list of variables in the workfile box. If you name the
equation, say, eq01, then EVIEWS creates the label =eq01 in the workfile box.
1.3.3
Once the workfile is saved, all named objects will be saved too, including equations and tables of
output.
1.3.4
Equations need to be named in order for multiple regressions to be performed. If the Equation is left
unnamed as Untitled, EVIEWS will attempt to delete the regression results when another equation is
estimated.
1.3.5
Once the equation is named, you can click-on the cross in the upper-right corner of the equation box in
order to remove the equation results from view. EVIEWS, however, stores the equation information:
click on the equation name icon in the workfile box in order to display the equation box once again.
2.
2.1
$$y_t = \beta_1 + \beta_2 t + \beta_3 t^2 + \beta_4 t^3 + \varepsilon_t$$
[Figure: Residual, Actual, and Fitted series, 1982-1998.]
2.2
Graph Title
2.2.1.a In the graph box
ADD TEXT.
Beneath Justification, click-on Center.
[Figure: Residual, Actual, and Fitted series, 1982-1998.]
2.3
2.3.2
Go to Word or Excel. In Word or Excel, simply go to the main tool-bar, EDIT, PASTE. Because the
EVIEWS graph was saved in the clip-board, and EVIEWS is Windows based, Word will simply paste
the graph itself. Alternatively, hold the Control key and press V (CTRL+V). Once the graph, etc., has
been pasted, it will be very large: click-on the object to highlight the corners, then click on the corners,
hold and drag to re-shape the object.
VI.
Eviews allows the analyst to perform aspects of Generalized Least Squares including heteroscedasticity and
autocorrelation robust standard errors. It allows for a wide array of estimation techniques for systems of equations, in
particular when regressors are endogenous. These include Instrumental Variables (IV), Seemingly Unrelated Regression
(SUR), Two Stages Least Squares (2SLS) as a two-step IV estimator, and Three Stages Least Squares (3SLS) as a
combination of 2SLS with heteroscedasticity or autocorrelation robustification.
1.
1.1
1.2
After the equation is typed in the white area click OPTIONS, HETEROSC. CONSISTENT COVARIANCE,
WHITE, then OK.
1.3
The resulting t-statistics will be robust to any form of heteroscedasticity that is related to the included regressors.
Example: U.S. Mortality Rates
Recall we want to estimate
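$$\text{mort}_i = \beta_0 + \beta_1\,\text{ed\_coll}_i + \beta_2\,\text{ed\_coll}_i^2 + \beta_3\,\text{phys}_i + \beta_4\,\text{health\_exp}_i + \varepsilon_i$$
We found evidence the regression error variance may depend on the included regressors. If we use
White's robust t-test we find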
Dependent Variable: MORT
Method: Least Squares
Sample: 1 51
Included observations: 51
White Heteroskedasticity-Consistent Standard Errors & Covariance
                                                                   Non-robust
Variable        Coefficient   Std. Error   t-Statistic   Prob.     t-Statistic   Prob.
C                  796.9879     184.6299      4.316678   0.0001       3.166058   0.0027
ED_COLL           -776.6662     1901.757     -0.408394   0.6849      -0.294611   0.7696
ED_COLL_2         -10265.26     6320.866     -1.624028   0.1112      -1.286548   0.2047
PHYS               1.819145     0.466715      3.897769   0.0003       4.653974   0.0000
HEALTH_EXP         0.074272     0.057531      1.290986   0.2032       1.342974   0.1859

R-squared             0.704472    Mean dependent var      855.0059
Adjusted R-squared    0.678774    S.D. dependent var      137.9660
S.E. of regression    78.19466    Akaike info criterion   11.64917
Sum squared resid     281262.6    Schwarz criterion       11.83857
Log likelihood       -292.0539    F-statistic             27.41344
Durbin-Watson stat    1.854259    Prob(F-statistic)       0.000000
For comparison's sake we include the non-robust t-statistics and p-values in the last two columns. ed_coll
remains insignificant, ed_coll_2 becomes significant at the 15% level, and phys loses some significance but
remains significant at the 1% level.
White's (1982) test of heteroscedasticity in fact is little more than a test that the robust standard errors
and non-robust standard errors are identical for large samples.
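Because White's correction changes only the standard errors and leaves the coefficients untouched, each robust t-statistic is simply the OLS coefficient divided by its robust standard error. The Python sketch below recomputes the robust t-statistics from the reported coefficients and White standard errors; the variable names are illustrative and the numbers are copied from the output above.

```python
names = ["C", "ED_COLL", "ED_COLL_2", "PHYS", "HEALTH_EXP"]
coefficients = [796.9879, -776.6662, -10265.26, 1.819145, 0.074272]
white_std_errors = [184.6299, 1901.757, 6320.866, 0.466715, 0.057531]

# Robust t-statistic = coefficient / heteroscedasticity-consistent standard error.
robust_t = {
    name: coef / se
    for name, coef, se in zip(names, coefficients, white_std_errors)
}

# Same coefficient, non-robust standard error, for the constant term only.
non_robust_t_constant = 796.9879 / 251.7288
```

The recomputed values agree with the reported t-statistics up to rounding of the printed coefficients and standard errors.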
2.
2.1
Example: SUR
The U.S. state-wide mortality model is
$$\text{mort}_i = \beta_0 + \beta_1\,\text{ed\_coll}_i + \beta_2\,\text{ed\_coll}_i^2 + \beta_3\,\text{phys}_i + \beta_4\,\text{health\_exp}_i + \varepsilon_i$$
We also have information on tobacco expenditure per capita (tob), percent of adult population with a high
school education (ed_hs), per capita income (inc) and the percent of the population above the age of 65
(aged). We conjecture that tobacco use is related to income level, high school education and youth:
$$\text{tob}_i = \alpha_0 + \alpha_1\,\text{ed\_hs}_i + \alpha_2\,\text{inc}_i + \alpha_3\,\text{aged}_i + u_i$$
Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  182.6473     37.23020      4.905890   0.0000
ED_HS             -143.1016     45.27598     -3.160652   0.0028
INC                0.003046     0.001511      2.015812   0.0496
AGED              -49.88707     141.2627     -0.353151   0.7256

R-squared             0.185482    Mean dependent var      120.5275
Adjusted R-squared    0.133492    S.D. dependent var      22.13050
S.E. of regression    20.60049    Akaike info criterion   8.963691
Sum squared resid     19945.86    Schwarz criterion       9.115207
Log likelihood       -224.5741    F-statistic             3.567621
Durbin-Watson stat    1.966861    Prob(F-statistic)       0.020888
Tobacco appears to be a normal good, negatively related to having a high school education.
The mortality and tobacco use regressions are seemingly unrelated, but clearly related. The error terms ε and u
capture unobservable characteristics of state residents, including cultural traits (diet, risk taking) and
sociological traits (social networks, religion). Indeed, a state with a high mortality rate may be a state with high
tobacco use (see footnote!), hence the errors undoubtedly are related.
It is perfectly fine to estimate each equation alone, but we are neglecting possibly important information associated
with the error term correlation. This implies a potentially more efficient set of estimates may exist if we
estimate the equations at the same time while allowing the errors to be correlated. This is
Seemingly Unrelated Regression.
2.2
In the main toolbar click OBJECTS, NEW OBJECTS, SYSTEM. Before you click SYSTEM, name
the object (e.g. mort_sur).
2.2.2
In the pop-up box type the system of equations using c(1) for the constant, and so on.
Example:
The mortality system is typed
mort = c(1) +c(2)*ed2_coll +c(3)*ed_coll_2 +c(4)*phys +c(5)*health_pc
tob_pc = c(6) + c(7)*ed_hs + c(8)*inc_pc + c(9)*aged
2.3
2.2.3
Once the system is typed, click ESTIMATE from the pop-up box toolbar.
2.2.4
A new box appears with a list of choices. Click SEEMINGLY UNRELATED REGRESSION.
2.2.5
There are choices for handling how the correlation between the errors is estimated and these are used to
estimate the system of equations. Unfortunately this choice may have a profound impact on the subsequent
results.
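        Coefficient   Std. Error   t-Statistic   Prob.
C(1)       786.3284     234.4604      3.353780   0.0012
C(2)      -816.1597     2448.351     -0.333351   0.7396
C(3)      -9834.407     7402.848     -1.328463   0.1873
C(4)       1.795136     0.363124      4.943593   0.0000
C(5)       0.079825     0.051222      1.558407   0.1225
C(6)       208.9991     35.13037      5.949242   0.0000
C(7)      -159.2617     42.70680     -3.729188   0.0003
C(8)       0.003211     0.001430      2.245032   0.0271
C(9)      -198.1056     132.9795     -1.489745   0.1397

Determinant residual covariance    1955256.

Equation: MORT
R-squared             0.703673    Mean dependent var      855.0059
Adjusted R-squared    0.677906    S.D. dependent var      137.9660
S.E. of regression    78.30028    Sum squared resid       282022.9
Durbin-Watson stat    1.890212

Equation: TOB_PC
Mean dependent var    120.5275
S.D. dependent var    22.13050
Sum squared resid     20419.29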
Compare the mortality regression results in bold with the OLS results from Part V.2. There is essentially no
difference in the percent of mortality rate variation explained by the regression model, and all coefficient
estimates are qualitatively similar. There may, indeed, not be a SUR effect (estimation of the system offers no
boost in efficiency over single equation estimation).
3.
3.1
Endogenous Regressors
3.1.1
3.1.2
Validity is determined by i. the set z_i is correlated with x_i; and ii. z_i is uncorrelated with ε_i.
3.1.3
Straight substitution of z_i for x_i is Instrumental Variables. But this raises the question: if many valid
instruments exist, which do we choose?
Example:
Reconsider U.S. mortality rates:
$$\text{mort}_i = \beta_0 + \beta_1\,\text{ed\_coll}_i + \beta_2\,\text{ed\_coll}_i^2 + \beta_3\,\text{phys}_i + \beta_4\,\text{health\_exp}_i + \varepsilon_i$$
We can easily argue that the unobservable characteristics of each state, which affect mortality rates
(e.g. state resident risk taking behavior, cultural information associated with marketable skills) also
affect the desire and/or ability to obtain a college education, to seek medical help (e.g. health care
expenditure), and to demand medical care (e.g. physician count per 100,000 residents).
3.2
3.3
The IV approach is to use a direct variable-by-variable substitute for the endogenous regressors. If a set
of instruments exists then there is an optimal method for combining them to form a best set of IVs:
simply generate predicted values of the endogenous x_i by regressing them, one by one, on the IVs z_i.
3.2.2
3.2.3
Creating this best set is stage one, and using them as IVs is stage two of Two Stages Least Squares.
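The two stages can be sketched in Python for the simplest case of one endogenous regressor and one instrument. This is an illustration under stated assumptions: the helper names and the data are hypothetical, and the data are constructed so that the regression error is correlated with x but not with z, with a true slope of 2.

```python
def ols_simple(y, x):
    """Least squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    return y_bar - b1 * x_bar, b1


def two_stage_least_squares(y, x, z):
    """2SLS with one endogenous regressor x and one instrument z."""
    # Stage one: regress x on z and form the predicted values.
    a0, a1 = ols_simple(x, z)
    x_hat = [a0 + a1 * zi for zi in z]
    # Stage two: regress y on the predicted values.
    return ols_simple(y, x_hat)


# Hypothetical data: u is uncorrelated with z but enters both x and the error of y.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
u = [1.0, -2.0, 1.0, 1.0, -2.0, 1.0]
x = [zi + ui for zi, ui in zip(z, u)]
y = [1.0 + 2.0 * xi + 3.0 * ui for xi, ui in zip(x, u)]

b0_ols, b1_ols = ols_simple(y, x)                     # biased by the endogeneity
b0_2sls, b1_2sls = two_stage_least_squares(y, x, z)   # recovers intercept 1, slope 2
```

In this constructed example OLS overstates the slope because x and the error move together, while the second-stage regression on the fitted values does not.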
3.2.4
EVIEWS's Two Stages Least Squares routine requires at least as many IVs as variables in the regression
model.
3.3.2
Since we believe health_exp is endogenous, we include all other regressors and the IV inc as the
instruments. Type in the instrument box
ed2_coll ed_coll_2 phys inc_pc
Then ok.
Dependent Variable: MORT
Method: Two-Stage Least Squares
Sample: 1 51
Included observations: 51
MORT = C(1) +C(2)*ED2_COLL +C(3)*ED_COLL_2 +C(4)*PHYS +C(5)*HEALTH_PC
Instrument list: ED2_COLL ED_COLL_2 PHYS INC_PC

        Coefficient   Std. Error   t-Statistic   Prob.
C(1)       850.9452     326.6824      2.604809   0.0123
C(2)      -1081.035     2891.747     -0.373835   0.7102
C(3)      -9559.261     8452.073     -1.130996   0.2639
C(4)       1.976187     0.719358      2.747154   0.0086
C(5)       0.043648     0.130032      0.335667   0.7386

R-squared             0.702502    Mean dependent var      855.0059
Adjusted R-squared    0.676633    S.D. dependent var      137.9660
S.E. of regression    78.45485    Sum squared resid       283137.5
Durbin-Watson stat    1.846369
3.4
The Hausman (1978) test allows us to compare two estimators for one regression model, where one
estimator is consistent whether or not the null hypothesis holds, and the other is efficient under the null
but inconsistent otherwise.
3.4.2
In the 2SLS case, if the suspected endogenous regressor x_i is NOT endogenous, then OLS and 2SLS
should be approximately identical. Otherwise, in the presence of endogenous regressors OLS is not
consistent so OLS and 2SLS must produce significantly different estimates.
3.4.3
EVIEWS allows us to perform the Hausman test by a sequence of regressions (Davidson and MacKinnon 1989,
1993):
i.
Regress the suspected endogenous variable (e.g. health_exp) on all exogenous variables and
available instruments z_i. Collect the residuals, say w_i.
ii.
In the case of health_exp, the regression residuals w_i represent health_exp after controlling for
association with the other variables.
iii.
Now regress y_i on x_i as usual, but also include w_i from the first auxiliary regression. If the
suspected endogenous variable is truly endogenous then the slope on w_i will be significant.
4.
4.1
We suspect health_exp is endogenous. Regress health_exp on all other explanatory variables plus the income
instrument inc. Save the residuals
ls health_exp c ed_coll ed_coll_2 phys inc
series w = resid
4.2
Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  850.9452     328.9525      2.586833   0.0130
ED_COLL           -1081.035     2911.842     -0.371255   0.7122
ED_COLL_2         -9559.261     8510.806     -1.123191   0.2673
PHYS               1.976187     0.724357      2.728195   0.0091
HEALTH_EXP         0.043648     0.130936      0.333351   0.7404
W                  0.037442     0.144780      0.258615   0.7971

R-squared             0.704911    Mean dependent var      855.0059
Adjusted R-squared    0.672123    S.D. dependent var      137.9660
S.E. of regression    79.00003    Akaike info criterion   11.68690
Sum squared resid     280845.2    Schwarz criterion       11.91418
Log likelihood       -292.0161    F-statistic             21.49926
Durbin-Watson stat    1.843317    Prob(F-statistic)       0.000000
The results support our finding that 2SLS did not generate estimates very different from OLS. Here, the
coefficient on w is not significant at any level, so we fail to reject the null that health_exp is exogenous.
5.
Two Stage Least Squares in a SUR System: Three Stages Least Squares
Three Stages Least Squares is 2SLS applied to a Seemingly Unrelated System. The three steps concern i.
controlling for correlation between the different equation error terms; ii. controlling for endogenous regressors;
and iii. estimating the robustified system.
5.1
Follow the SUR instructions: OBJECTS, NEW OBJECT, SYSTEM (name the system, say mort_3sls).
5.2
In the white pop-up box type the equations as before. Below the last equation type the instrument set. I include
all exogenous variables included in the regression and all instruments that were left out:
@inst [exogenous regressors] [instruments]
There is no = and there are no commas.
5.3
6.
6.1
6.2
Type
mort = c(1) +c(2)*ed2_coll +c(3)*ed_coll_2 +c(4)*phys +c(5)*health_pc
tob_pc = c(6) + c(7)*ed_hs + c(8)*inc_pc + c(9)*aged
@inst inc_pc ed2_coll ed_coll_2 phys health_pc ed_hs
Click ESTIMATE, THREE STAGE LEAST SQUARES. The results are
System: MORT_3SLS
Estimation Method: Three-Stage Least Squares
Sample: 1 51
Included observations: 51
Total system (balanced) observations 102
Linear estimation after one-step weighting matrix
        Coefficient   Std. Error   t-Statistic   Prob.
C(1)       769.0567     235.4522      3.266297   0.0015
C(2)      -643.7900     2458.311     -0.261883   0.7940
C(3)      -10497.68     7442.833     -1.410442   0.1617
C(4)       1.817089     0.364334      4.987429   0.0000
C(5)       0.081796     0.051370      1.592275   0.1147
C(6)       182.3184     43.50059      4.191172   0.0001
C(7)      -146.6055     44.42703     -3.299916   0.0014
C(8)       0.003231     0.001433      2.255129   0.0265
C(9)      -47.82365     195.9449     -0.244067   0.8077

Determinant residual covariance    2032528.

Equation: MORT
Mean dependent var    855.0059
S.D. dependent var    137.9660
Sum squared resid     281962.6

Equation: TOB_PC
Mean dependent var    120.5275
S.D. dependent var    22.13050
Sum squared resid     19952.75
VII.
1.
Binary Response
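In this case y_i = 0 or 1. Typically the approach is to assume y_i depends on observable x_i and unobservable
traits ε_i:
(*)
$$y_i = 1 \ \text{if}\ \varepsilon_i \le \sum_{j=1}^{k}\beta_j x_{i,j} \quad\text{and}\quad y_i = 0 \ \text{if}\ \varepsilon_i > \sum_{j=1}^{k}\beta_j x_{i,j}$$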
1.1.2
The Binary Likelihood Function L(Y|β) is the joint probability of a sample of binary responses Y = [y_1,
..., y_n].
In order to represent the Likelihood Function it helps to re-order the sample as a thought experiment.
WE DO NOT NEED TO RE-ORDER THE SAMPLE WHEN WE USE EVIEWS.
This is merely for representing the concept of Binary Maximum Likelihood. We can arbitrarily order
the observations so that y_i = 0 occur first in the sample and all y_i = 1 occur last: Y = [0,0,...,0,1,1,...,1].
There are n_0 observations with response 0 and n_1 observations with response 1. Note:
$$n_0 + n_1 = n$$
Under independence and using (*) the natural log of the Likelihood Function is
$$\ln L(y \mid \beta) = \sum_{i=1}^{n_0}\ln\left[1 - F\left(\sum_{j=1}^{k}\beta_j x_{i,j}\right)\right] + \sum_{i=n_0+1}^{n}\ln F\left(\sum_{j=1}^{k}\beta_j x_{i,j}\right)$$
where F denotes the cdf of ε_i.
1.2
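$$\frac{\partial P(y_i = 1)}{\partial x_{i,j}} = \frac{\partial F\left(\sum_{j=1}^{k}\beta_j x_{i,j}\right)}{\partial x_{i,j}} = f\left(\sum_{j=1}^{k}\beta_j x_{i,j}\right)\beta_j$$
So, β_j, scaled by the density, represents the marginal impact of x_{i,j} on the likelihood of response y_i = 1.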
i.
Perhaps most importantly, notice the marginal impact IS NOT A CONSTANT. It depends on each
individual's observable information x_{i,j}.
ii.
Since it is individual specific, typically we plot out the marginal effects, or analyze the descriptive
statistics, including its mean:
$$\text{MEAN}\left[\frac{\partial P(y_i = 1)}{\partial x_{i,j}}\right] = \frac{1}{n}\sum_{i=1}^{n} f\left(\sum_{j=1}^{k}\beta_j x_{i,j}\right)\beta_j$$
Alternatively, we can compute the marginal effect for the average individual:
$$\frac{\partial P(\text{mean}\{y_i\} = 1)}{\partial x_{i,j}} = f\left(\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{k}\beta_j x_{i,j}\right)\beta_j$$
1.3
Estimation in EVIEWS
We can now estimate β_j using EVIEWS. We simply denote what the cdf F is. The most popular choices in practice are the
standard normal and the logistic.
If we assume F is the standard normal then the estimation method is called Probit Maximum Likelihood
(i.e. Binary ML with standard normal cdf).
If we assume F is the logistic then the estimation method is called Logit Maximum Likelihood (i.e. Binary ML
with logistic cdf).
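To see how close the two choices of F are, the sketch below tabulates the standard normal cdf (Probit) and the logistic cdf (Logit) at a few index values. The function names are hypothetical, and the grid of z values is arbitrary.

```python
import math

def probit_cdf(z):
    """Standard normal cdf, the F behind Probit ML."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit_cdf(z):
    """Logistic cdf, the F behind Logit ML."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z={z:+.1f}  probit F={probit_cdf(z):.4f}  logit F={logit_cdf(z):.4f}")
```

Both distributions are symmetric about zero, so 1 - F(-z) = F(z) for either choice; the logistic has heavier tails, which is why Probit and Logit coefficients are not directly comparable in size.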
1.4
In the main toolbar QUICK, ESTIMATE EQUATION, scroll through the options for BINARY
CHOICE, choose PROBIT or LOGIT.
1.4.2
In the white area, type the equation, using the 0/1 variable on the left:
y = c x1 x2 x3
1.4.3
There are two options: we can select ways to robustify against the fact that we may have chosen the wrong
F; and we may choose the numerical estimation method used for estimating this highly nonlinear model
(ln(L) is itself very nonlinear).
The true cdf may be neither the standard normal (PROBIT) nor the logistic (LOGIT). After all, we are merely guessing.
i.
Huber/White
Under OPTIONS, click ROBUST COVARIANCE MATRIX, and then HUBER/WHITE
in order to generate standard errors, and therefore t-statistics, that are robust to the fact that we
may have chosen the wrong cdf F.
This should be done whenever possible.
ii.
2.
Example #7: Binary Choice, Labor Force Participation and Probit
We have a sample of women who are in the labor force (lfp_i = 1) or not (lfp_i = 0). Available regressors are age,
the husband's age age_h, and the number of children under the age of 6 child_6.
2.1
We will estimate the model by Probit ML, using Huber-White robust t-tests.
2.1.1
In the main toolbar QUICK, ESTIMATE EQUATION, scroll through the options for BINARY
CHOICE, PROBIT.
2.1.2
2.1.3
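Variable        Coefficient    Std. Error    z-Statistic    Prob.
C                  3.337181      0.533849       6.251172    0.0000
AGE               -0.036233      0.020628      -1.756550    0.0790
AGE_H             -0.026326      0.020776      -1.267107    0.2051
CHILD_6           -1.355352      0.195827      -6.921162    0.0000

Mean dependent var        0.568393    S.D. dependent var       0.495630
S.E. of regression        0.474904    Akaike info criterion    1.289399
Sum squared resid         168.9249    Schwarz criterion        1.313962
Log likelihood           -481.4587    Hannan-Quinn criter.     1.298862
Restr. log likelihood    -514.8732    Avg. log likelihood     -0.639387
LR statistic (3 df)       66.82902    McFadden R-squared       0.064899
Probability(LR stat)      2.03E-14
Obs with Dep=0                 325    Total obs                     753
Obs with Dep=1                 428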
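Several of the summary statistics in the Probit output can be checked by hand. The sketch below assumes that -481.4587 is the log likelihood, that -514.8732 is the restricted (intercept-only) log likelihood, and that 753 is the total number of observations; the formulas are the standard textbook definitions, and the variable names are hypothetical.

```python
# Values copied from the Probit output (labels are an assumption, see lead-in).
log_lik = -481.4587
restricted_log_lik = -514.8732
n_obs = 753

# Likelihood ratio test that all slope coefficients are zero.
lr_statistic = 2.0 * (log_lik - restricted_log_lik)

# McFadden R-squared: 1 minus the ratio of the two log likelihoods.
mcfadden_r2 = 1.0 - log_lik / restricted_log_lik

# Average log likelihood per observation.
avg_log_lik = log_lik / n_obs

print(f"LR statistic:       {lr_statistic:.4f}")
print(f"McFadden R-squared: {mcfadden_r2:.6f}")
print(f"Avg. log lik:       {avg_log_lik:.6f}")
```

These agree with the reported 66.82902, 0.064899 and -0.639387 up to the rounding of the printed log likelihoods.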
2.2
Marginal Effects
In order to interpret the estimated coefficients, we want to generate the series
$$\frac{\partial}{\partial x_{i,j}} P(y_i = 1) = f\Big(-\sum_{j=1}^{k}\beta_j x_{i,j}\Big)\beta_j$$
EVIEWS does not provide this in a simple way, so we will compute, in order,
$$\sum_{j=1}^{k}\beta_j x_{i,j}, \qquad f\Big(-\sum_{j=1}^{k}\beta_j x_{i,j}\Big), \qquad f\Big(-\sum_{j=1}^{k}\beta_j x_{i,j}\Big)\beta_j$$
2.2.1
We obtain
$$\sum_{j=1}^{k}\beta_j x_{i,j}$$
by clicking FORECAST within the equation popup box and selecting INDEX - WHERE PROB = 1-F(-INDEX). Then click OK.
Since the 0/1 dependent variable is called lfp, the forecast series will be given the automatic name lfpf, unless you
change the name.
2.2.2
$$\text{MEAN}\Big\{\frac{\partial}{\partial x_{i,j}} P(y_i = 1)\Big\} = .325381 \times \beta_j$$
We can now inspect the marginal impact of each explanatory variable on the likelihood of entering the
labor force.
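As a worked illustration, the mean marginal effects can be obtained by multiplying each estimated Probit coefficient by the value .325381 shown above. This reads .325381 as the sample mean of the density term f, which is an assumption about the garbled original; the coefficients are copied from the Probit output, and the dictionary names are hypothetical.

```python
# Assumed sample mean of the density term (the .325381 reported in the text).
mean_density = 0.325381

# Probit slope coefficients from the estimation output.
coefficients = {
    "AGE": -0.036233,
    "AGE_H": -0.026326,
    "CHILD_6": -1.355352,
}

# Mean marginal effect of each regressor on P(lfp = 1).
mean_marginal_effects = {
    name: mean_density * beta for name, beta in coefficients.items()
}

for name, effect in mean_marginal_effects.items():
    print(f"{name:8s} {effect:+.6f}")
```

Under this reading, one additional child under 6 lowers the probability of labor force participation by roughly 0.44 on average, while the age effects are close to -0.01 per year.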
3.
3.1
Observed:
$$y_i = y_i^* \ \text{if}\ y_i^* > 0, \qquad y_i = 0 \ \text{if}\ y_i^* \le 0$$
Since we do not observe y* when it is not positive (e.g. work hours h* < 0!), we must, of course, use the observed y (e.g. h = 0):
$$y_i = \sum_{j=1}^{k}\beta_j x_{i,j} + \varepsilon_i$$
But it can be shown that OLS estimates will be biased because there is a missing variable accounting for the
truncation (y* < y).
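The bias claim can be illustrated with a small simulation. The sketch below is not part of the original example: it draws a hypothetical latent variable y* = -1 + 1.0 x + e with standard normal errors, censors it at zero, and compares the OLS slope on the latent and on the observed series. All parameter values, the seed and the function name `ols_slope` are made up for illustration.

```python
import random

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on a constant and x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx

random.seed(0)
true_slope = 1.0

# Latent variable y*: observed only when positive.
x = [random.uniform(0.0, 2.0) for _ in range(5000)]
y_star = [-1.0 + true_slope * xi + random.gauss(0.0, 1.0) for xi in x]

# Observed variable y: censored at zero.
y = [max(0.0, v) for v in y_star]

slope_latent = ols_slope(x, y_star)   # close to the true slope
slope_censored = ols_slope(x, y)      # pulled toward zero by the censoring
print(slope_latent, slope_censored)
```

In this design roughly half of the observations are censored, and the OLS slope on the observed series falls well below the true value of 1.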
3.2
Tobit Model
OLS on the observed y is biased because it omits a variable accounting for the truncation. If the errors ε_i are
iid normally distributed N(0, σ²), then the correct model is
$$y_i = \sum_{j=1}^{k}\beta_j x_{i,j} + \sigma\,\frac{\phi\Big(\sum_{j=1}^{k}\beta_j x_{i,j}/\sigma\Big)}{\Phi\Big(\sum_{j=1}^{k}\beta_j x_{i,j}/\sigma\Big)} + u_i,$$
where u_i is the remaining error term, φ(z) is the standard normal density and Φ(z) the standard normal cdf. This is
called the Tobit Regression Model, after Tobin (1958).
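The missing variable in the Tobit model is a ratio of the normal density to the normal cdf. The sketch below assumes it takes the usual form σ φ(index/σ) / Φ(index/σ), the inverse Mills ratio scaled by σ, and shows how it behaves; the function names and the example index values are hypothetical.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_correction(index, sigma):
    """sigma * phi(index / sigma) / Phi(index / sigma)."""
    z = index / sigma
    return sigma * normal_pdf(z) / normal_cdf(z)

def expected_y_given_positive(index, sigma):
    """Index plus the correction term, i.e. the mean of y among uncensored cases."""
    return index + tobit_correction(index, sigma)

for idx in (-1.0, 0.0, 1.0, 2.0):
    print(idx, tobit_correction(idx, 1.0), expected_y_given_positive(idx, 1.0))
```

The correction is always positive and shrinks as the index grows, so it matters most for individuals who are close to the censoring point.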
3.3
4.
where hours_h is the female's husband's work hours, etc., and child_6 is the number of children under the age of 6 in
the family.
OLS results follow:
Dependent Variable: HOURS
Method: Least Squares
Sample: 1 753
Included observations: 753
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable        Coefficient    Std. Error    t-Statistic    Prob.
C                  1612.222      259.0128       6.224486    0.0000
WAGE               106.1719      19.48620       5.448570    0.0000
AGE               -5.532008      6.992067      -0.791184    0.4291
HOURS_H           -0.101367      0.048597      -2.085887    0.0373
WAGE_H            -26.57645      5.395773      -4.925421    0.0000
AGE_H             -8.230411      6.727591      -1.223382    0.2216
CHILD_6           -371.8635      62.04534      -5.993416    0.0000

R-squared                 0.237422    Mean dependent var       740.5764
Adjusted R-squared        0.231289    S.D. dependent var       871.3142
S.E. of regression        763.9350    Akaike info criterion    16.12410
Sum squared resid         4.35E+08    Schwarz criterion        16.16708
Log likelihood           -6063.722    F-statistic              38.71011
Durbin-Watson stat        1.606736    Prob(F-statistic)        0.000000
Tobit results follow:
Variable        Coefficient    Std. Error    z-Statistic    Prob.
C                  2157.747      413.1214       5.223035    0.0000
WAGE               204.3639      30.26070       6.753443    0.0000
AGE               -12.21144      12.22641      -0.998775    0.3179
HOURS_H           -0.220945      0.080955      -2.729226    0.0063
WAGE_H            -59.52449      12.00630      -4.957772    0.0000
AGE_H             -15.81633      11.65812      -1.356680    0.1749
CHILD_6           -825.4865      130.1804      -6.341098    0.0000

Error Distribution
SCALE:C(8)         1146.301      51.03048       22.46306    0.0000

R-squared                 0.106779    Mean dependent var       740.5764
Adjusted R-squared        0.098387    S.D. dependent var       871.3142
S.E. of regression        827.3419    Akaike info criterion    10.13556
Sum squared resid         5.10E+08    Schwarz criterion        10.18468
Log likelihood           -3808.037    Hannan-Quinn criter.     10.15448
Avg. log likelihood      -5.057154
Left censored obs              325    Right censored obs              0
Uncensored obs                 428    Total obs                     753
Since no one works all hours on all days, right censoring is irrelevant: we can leave RIGHT blank and receive the
same results.
Notice the stark differences in the coefficient estimates. By not accounting for censoring, all marginal effects are underestimated in magnitude. By not controlling for the numerous observations with hours = wages = 0, least squares underestimates the marginal effect
a one dollar wage differential has on annual work hours by nearly a factor of two (106.17 versus 204.36)! Similarly, the presence of young children is
overwhelmingly associated with dampened work hours, but that effect is far stronger once truncation is controlled for.
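The size of the gap can be made concrete by dividing each Tobit slope by the corresponding OLS slope. The sketch below copies the coefficients from the two output tables and assumes the Tobit coefficients are listed in the same variable order as the OLS ones; the dictionary names are hypothetical.

```python
# Slope coefficients from the OLS output.
ols = {
    "WAGE": 106.1719, "AGE": -5.532008, "HOURS_H": -0.101367,
    "WAGE_H": -26.57645, "AGE_H": -8.230411, "CHILD_6": -371.8635,
}

# Slope coefficients from the Tobit output (variable order assumed to match OLS).
tobit = {
    "WAGE": 204.3639, "AGE": -12.21144, "HOURS_H": -0.220945,
    "WAGE_H": -59.52449, "AGE_H": -15.81633, "CHILD_6": -825.4865,
}

# Ratio above 1 means OLS understates the magnitude of the effect.
ratios = {name: tobit[name] / ols[name] for name in ols}

for name, ratio in ratios.items():
    print(f"{name:8s} Tobit/OLS = {ratio:.2f}")
```

Every ratio lies close to 2, with the wage effect at about 1.9 and the young-children effect at about 2.2.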