Computer Application in Economics
This distance material introduces the basic steps of computer-based analysis. At the same time, it provides a sound foundation in basic statistics and econometrics. The course motivates students by presenting the techniques and methods of computer-based data analysis.
This distance material presupposes that students taking the course Computer Application in Economics have a basic background in MS Windows programs, notably Excel. The material attempts to make the reader well versed in various ways of computer-based data analysis through different examples, explanations, activities and self-assessment questions. The distance material contains a total of seven units, whose contents are briefly discussed hereunder.
Unit One provides a basic introduction. In this unit, students are made aware of the importance of computer-based analysis and identify the major economic software packages.
Unit Two deals with data management. In this unit, students learn how to operate the different software packages, enter data from various sources, identify various ways of transforming data, and perform tests before estimation is made.
Unit Three deals with statistical estimation and graphical analysis. In this unit students learn several statistics-related concepts, compute various statistics and analyze them, and draw graphs and interpret the results.
Unit Four deals with econometric estimation and analysis. This unit emphasizes the various ways of performing estimation using the EViews and Stata programs. Moreover, it shows how to interpret the results obtained from the estimation.
Unit Five deals with diagnostic tests. In this unit students are made aware of the basic concepts and approaches of checking the results of a regression model, that is, the concept of diagnostic checking. This includes explaining the sources of each problem, the detection mechanisms, and the appropriate solutions. Students will learn how to test for the presence of these problems using EViews and Stata.
Unit Six briefly introduces the SPSS software, while Unit Seven briefly introduces the PCGIVE and LIMDEP programs.
TABLE OF CONTENTS

UNIT ONE: Introduction
Unit One: Introduction
1.0 Objective
1.1 Introduction
1.2 The Need for Computer Application in Economic Analysis
1.3 Major Economic Software
1.4 The Nature, Type and Sources of Data
1.5 Summary
1.6 Answers to Check Your Progress
1.7 Model Examination
1.0 Objective
The aim of this unit is to introduce the student to basic concepts related to computer applications in economic analysis. After completing this unit, the student will be able to:
Understand the importance of computer-based analysis;
Identify major economic software;
Understand the different types and nature of data.
1.1 Introduction
Whenever we talk about economics, what comes to mind are the three basic problems that we face every day. These are: (i) What goods and services should be produced, and in what amounts? (ii) How should those goods and services be produced? (iii) For whom should they be produced?
Note that these questions are universal problems because human wants are practically
unlimited, but all societies have only limited quantities of resources that can be used to
produce goods or services.
In this connection, the two main branches of economic analysis are microeconomics and
macroeconomics. Microeconomics is concerned with the behavior of individual firms,
industries and consumers (or households). Note that microeconomics deals with the
problems of resource allocation, considers problems of income distribution, and is chiefly
interested in the determination of the relative prices of goods and services.
On the other hand, macroeconomics concerns itself with large aggregates, particularly for
the economy as a whole. It deals with the factors, which determine national output and
employment, the general price level, total spending and saving in the economy, total
imports and exports, and the demand for and supply of money and other financial assets.
Statistical and econometric software systems are used to understand existing concepts and to discover new properties. On the other hand, new developments in decision making under uncertainty often motivate new approaches and revisions of existing software systems. Statistical and econometric software systems rely on the cooperation of statisticians, econometricians and software developers. Note that without a computer one cannot perform any realistic analysis of a large data set.
Some software packages are widely used and are powerful tools for data analysis with excellent data management capabilities. Overall, there are over 400 statistical packages; however, a working familiarity with some of the major statistical systems will carry over easily to other environments. The basic packages discussed below are professional statistical and econometric packages in widespread use internationally.
EVIEWS: EViews provides sophisticated data analysis, regression, and forecasting tools
on Windows-based computers. With EViews, we can quickly develop a statistical
relation from our data and then use the relation to forecast future values of the data. Areas
where EViews can be useful include scientific data analysis and evaluation, financial
analysis, macroeconomic forecasting, simulation, sales forecasting, and cost analysis.
EViews was developed by economists and most of its uses are in economics. However,
there is nothing in its design that limits its usefulness only to economic time series. Even
quite large cross-section projects can be handled in EViews.
STATA: Stata is a modern and very powerful program especially designed for data
management, statistical and econometric analysis as well as graphics. It is mainly a
command driven program providing sufficient flexibility to meet different users' needs.
SPSS: This program provides a comprehensive and flexible statistical analysis and data management system. It can generate tabulated reports, charts, and plots of distributions and trends, produce descriptive statistics, and conduct complex statistical analyses. Note that SPSS for Windows provides a user interface that makes statistical analysis more intuitive for all levels of users. Simple menus and dialog box selections make it possible to perform complex analyses without typing a single line of command syntax. The built-in SPSS Data Editor offers a simple and efficient spreadsheet-like utility for entering data and browsing the working data file.
Quantitative data sets consist of measures that take numerical values for which descriptions such as means and standard deviations are meaningful. They can be put into an order and further divided into two groups: discrete data and continuous data. Discrete data are countable, for example, the number of defective items produced during a day's production. Continuous data are measurable and are expressed on a continuous scale, for example, the height of a person.
Measurement or counting theory is concerned with the connection between data and
reality. A set of data is a representation (i.e., a model) of the reality based on numerical
and measurable scales. Data are called "primary type" data if the analyst has been
involved in collecting the data relevant to his/her investigation. Otherwise, it is called
"secondary type" data.
Data come in the forms of nominal, ordinal, interval and ratio scales. Moreover, data can be either continuous or discrete. The various forms of measurement are described below.
Note that both the zero point and the unit of measurement are arbitrary in the interval scale. While the unit of measurement is arbitrary in the ratio scale, its zero point is a natural attribute. Categorical variables are measured on an ordinal or nominal scale.
Note that data collected for the analysis and estimation of a model may be time series, pooled or cross-sectional data.
Time series data give information about the numerical values of variables from period to period; that is, they are a set of observations on the values that a variable takes at different times, collected over a period. For example, data on the sales of a company, on GNP, on unemployment, or on the money supply over the period 1990-1999 form time series data. Such data may be collected at regular intervals: daily, weekly, monthly, quarterly or annually. These data may be quantitative in nature (for example, income and price) or qualitative (for example, sex and religion). The qualitative variables are called categorical or dummy variables.
Consider the following table containing macro data of Ethiopia for the period 1984 to 1995 E.C., measured in millions of dollars.
Year GDP Saving Export Import
1984 20792.00 625.2000 937.5000 2223.400
1985 26671.40 1494.100 2222.500 4520.500
1986 28328.90 1426.200 3223.000 6090.500
1987 33885.00 2517.100 4898.100 7950.000
1988 37937.60 2652.600 4969.700 8721.500
1989 41465.10 3195.000 6730.600 10584.70
1990 44840.30 3466.300 7116.900 11341.20
1991 48803.20 1044.600 6878.000 14101.50
1992 53189.70 480.1000 8017.600 15969.30
1993 54210.70 1433.900 7981.500 16193.60
1994 51760.60 931.4000 8027.400 17709.50
1995 54585.90 -1145.300 8319.300 21557.60
Table 1.1 Time Series Data
Notice that the table above describes the values of the variables across time; thus, it represents time series data.
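As a hedged sketch (assuming the series above have been entered into Stata, one of the packages introduced earlier, under the names gdp, saving, export and import, together with a numeric year variable), such yearly data can be declared as a time series with the standard tsset command:

tsset year

This tells Stata that year is the time variable, so that lags, leads and other time-series operators can then be applied to gdp, saving, export and import.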
Cross-sectional data are data that give information on one or more variables concerning individual consumers or producers at a given point in time. For example, a census of the population and surveys of consumer expenditure are cross-sectional data. This is because the data give information on the variables concerning individual agents (consumers or producers) at a given point in time. As an example, consider a table that shows the total population, the median age, and the numbers of deaths, marriages and divorces of some selected states of the USA for a particular year. Such a table represents cross-sectional data since it shows the values of several variables at a particular point in time.
Pooled data have elements of both time series and cross-sectional data. For example, suppose we collect data on GDP and saving for 20 countries over 10 years. In this case, the GDP and saving of one country over the 10-year period represent time series data, while the GDP and saving of the 20 countries for a particular year represent cross-sectional data. Thus pooled data have both characteristics.
Note that panel data are a special type of pooled data in which the same cross-sectional unit is surveyed over time, for example, a census of housing at periodic intervals in which the same household is interviewed to find out whether there has been any change in that household since the last survey. Panel data are repeated surveys of a single sample in different periods of time; they record the behavior of the same individuals over time. The panel data that result from repeated interviewing of the same households at periodic intervals provide very useful information on the dynamics of household behavior.
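As a similar hedged sketch for panel data in Stata (the variable names country and year are assumptions, and country must be a numeric identifier; xtset is the standard panel declaration command in recent versions of Stata):

xtset country year

This declares country as the cross-sectional unit and year as the time dimension, so that Stata knows the same unit is being tracked over time.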
Note that the success of any economics-related study depends on the quality and quantity of data. Unlike the natural sciences, most data collected in the social sciences, like GNP, money supply, etc., are non-experimental. This means that the data collecting agency may not have any direct control over the data.
Note that the researcher may collect data himself through interviews or questionnaires. In the social sciences, the data one generally obtains are non-experimental in nature; in other words, they are not subject to the control of the researcher. For example, data on investment, unemployment, etc. are not directly under the control of the investigator, unlike natural science data. This often creates special problems for the researcher in articulating the exact cause or causes affecting a particular situation. Moreover, although there is plenty of data available for economic research, the quality of the data is often not that good. The reasons for this include:
the possibility of observational errors, since most social science data are not experimental in nature;
errors of measurement arising from approximations and round-offs;
the problem of non-response in questionnaire-type surveys;
respondents who may not answer all the questions correctly;
the sampling methods used in obtaining the data;
the aggregation problem: economic data are generally available only at a highly aggregate level (for example, most macro data like GNP, unemployment and inflation are available for the economy as a whole); and
confidentiality: certain data can be published only in highly aggregate form (for example, data on individual tax, production, employment, etc. at the firm level are usually available only in aggregate form).
Because of these and many other problems, the researcher should always keep in mind that the results of research are only as good as the quality of the data. Therefore, unsatisfactory research results may be due to the poor quality of the available data rather than to a wrong model.
1.5 Summary
Computer-based data analysis has several advantages. It helps to attain efficient results and facilitates simple and comprehensive data analysis within a short period. Moreover, it produces results that are accurate and easily readable. Currently there are a number of software packages that are applied in data analysis activities. These include EViews, Stata, SPSS and some others. Note that without a computer one cannot perform any realistic analysis of a large data set.
Statistical and econometric software systems are used to understand existing concepts and to discover new properties. On the other hand, new developments in decision making under uncertainty often motivate new approaches and revisions of the existing software systems.
Over the last 30 years, a number of software packages for data analysis have been produced and widely applied. These packages have shown an astonishing development in both depth and simplicity, which has contributed a lot to the development and efficiency of economic analysis. Statistical and econometric software systems rely on the cooperation of statisticians, econometricians and software developers.
To perform analysis of any kind, we need data, the source of information, which may be qualitative or quantitative. Qualitative data are not computable by arithmetic relations; they are labels that indicate the category or class into which an individual, object, or process falls, and such variables are called categorical variables. Data come in the forms of nominal, ordinal, interval and ratio scales. Moreover, data can be either continuous or discrete. In addition, data collected for the analysis and estimation of a model may be time series, pooled or cross-sectional data.
Note that relevant data can be obtained from a number of sources: a governmental agency, an international agency, a private organization or an individual may collect the data used in empirical analysis. In Ethiopia, governmental institutions like MoFED (Ministry of Finance and Economic Development), CSA (Central Statistical Authority), and NBE (National Bank of Ethiopia) are the major sources of published data. International agencies that produce a wide range of data for analysis purposes include the International Monetary Fund (IMF) and the World Bank (WB).
Unit Two: Data Management
2.0 Objective
2.1 Introduction
2.2 Data Management
2.2.1 Data Entry and Operations
2.2.1.1 Using Excel
2.2.1.2 Using EViews
2.2.1.3 Using Stata
2.2.2 Data Transformation and Pre-Estimation Tests
2.2.2.1 Using EViews
2.2.2.2 Using Stata
2.3 Summary
2.4 Answers to Check Your Progress
2.5 Model Examination
2.6 Reference
2.0 Objective
The objective of this unit is to explain the various aspects of data management. After completing this unit, a student will be able to:
Operate the different software packages and enter data from various sources;
Transform data and perform tests before estimation is made.
2.1 Introduction
Before any type of meaning full data analysis is performed, the available data has to be
stored in the appropriate spreadsheet. Moreover, the data has to be transformed to the
required form so that it can be used easily to perform the appropriate estimation. In this
unit a discussion is made regarding data management. First a brief explanation is given
using Microsoft excel program. The purpose of this is to shed light to the basics of excel
program in relation to data analysis. Latter on a detailed discussion is provided using
EViews and Stata programs. The explanation starts from briefing the basic windows of
the two programs and extends to explaining the alternative ways of data entry,
transformation and pre estimation tests. Whenever appropriate and necessary, the
discussion is supported by the use of diagrams of various EViews and Stata windows
2.2.1 Data Entry and Operations
This sub section presents the methods used to enter data using Excel, EViews and Stata
programs. Moreover, it shows various operations using the above stated programs.
As an introduction we start our discussion with the various ways of entering, editing and
operation using excel. This discussion will shed light to the basics of excel in relation to
data analysis.
Note that when you open Excel, a new workbook is opened by default. If we want to open another new workbook, we click on the File menu, click New, and then select Blank Workbook on the New Workbook task pane.
In an Excel spreadsheet we can enter numbers, text, a date or a time. To do so, we first click the cell where we want to enter data, then type the data and press ENTER or TAB. Note that we can enter the data row by row or column by column. In either case, the following approach is usually used.
1. Enter data in a cell in the first column, and then press TAB to move to the
next cell.
2. At the end of the row, press ENTER to move to the beginning of the next
row.
3. If the cell at the beginning of the next row doesn't become active, click
Options on the Tools menu, and then click the Edit tab. Under Settings, select the
Move selection after Enter check box, and then click Down in the Direction box.
The following discussion explains what we should do when our objective is to enter data with a fixed number of decimal places or trailing zeros.
1. On the Tools menu, click Options, and then click the Edit tab. Then,
2. Select the Fixed decimal check box. In the Places box, enter a positive number of
digits to the right of the decimal point or a negative number for digits to the left of the
decimal point.
For example, if you enter 3 in the Places box and then type 1874 in the cell, the value will
be 1.874. If you enter -3 in the Places box and then type 183, the value will be 183000.
Note that any data you entered before selecting the Fixed decimal option is not affected.
To enter the same data into several cells at once, we follow these instructions.
1. Select the cells where you want to enter data. Note that the cells do not have to be
adjacent. Then,
2. Type the data and press CTRL+ENTER.
This automatically enters the same data into all the selected cells.
Note that, if the first few characters you type in a cell match an existing entry in that
column, Microsoft Excel fills in the remaining characters for you. Excel completes only
those entries that contain text or a combination of text and numbers. Entries that contain
only numbers, dates, or times are not completed. In this regard, note the following.
To accept the proposed entry, press ENTER. The completed entry exactly
matches the pattern of uppercase and lowercase letters of the existing entries.
To replace the automatically entered characters, continue typing.
To delete the automatically entered characters, press BACKSPACE.
To select from a list of entries already in the column, right-click the cell, and then click Pick from List on the shortcut menu.
Note that the prime objective of the above brief discussion is to show how data is entered in the Excel program. This knowledge is important because (i) Excel can conduct different types of analysis, and (ii) more often than not, data to be used in other specialized software (programs) may be stored in Excel format.
1. Using the discussion made earlier, enter the following data into an Excel spreadsheet.
Year 1991 1992 1993 1994 1995 1996 1997 1998 1999
Y 50 100 150 200 250 300 350 400 450
X1 15.67 156.7 1.567 1567 15670 0.1567 0.01567 15.67 156.7
X2 1 2 3 4 5 6 7 8 9
X3 90 80 70 60 50 40 30 20 10
I. Introduction to EViews
EViews provides sophisticated data analysis, regression, and forecasting tools on
Windows-based computers. With EViews you can quickly develop a statistical relation
from your data and then use the relation to forecast future values of the data. Areas where
EViews can be useful include: scientific data analysis and evaluation, financial analysis,
macroeconomic forecasting, simulation, sales forecasting, and cost analysis.
EViews was developed by economists and most of its uses are in economics. However,
there is nothing in its design that limits its usefulness only to economic time series. Even
quite large cross-section projects can be handled in EViews.
EViews takes advantage of the visual features of modern Windows software. You can use your mouse to guide operations with standard Windows menus and dialogs. Results appear in windows and can be manipulated with standard Windows techniques. Alternatively, we may use EViews' powerful command and batch processing language: we can enter and edit commands in the command window, and we can create and store commands in programs that document our research project for later execution.
Before discussing anything further, it is important to know the installation procedure of the EViews program, discussed hereunder. Note the following.
a) Once Windows is running, it is strongly recommended that you close all other Windows applications before beginning the installation procedure, because other applications may interfere with the installation program.
b) Insert the CD containing the EViews program into the CD drive; the program will then be ready to be installed. Click Next to pass from one page to another in the installation process.
c) Once the installation procedure is complete, the installer will inform you that EViews has been successfully installed.
As stated in the general introduction, it is assumed that you are familiar with the basics of Windows. However, we provide a brief discussion of some useful techniques, concepts, and conventions that we will use in this distance material.
A. The Mouse
EViews supports the use of both buttons (i.e. left and right) of the standard Windows mouse. Unless otherwise specified, clicking on an item means a single click of the left mouse button, while double-clicking means clicking the left mouse button twice in rapid succession. Moreover, dragging with the mouse means that you should click and hold the button down while moving the mouse.
B. Window Control
As we work, we may wish to change the size of a window or temporarily move a window
out of the way. Alternatively, a window may not be large enough to display all of the
output, so that we want to move within the window in order to see relevant items.
Windows provides us with methods for performing each of the tasks stated above, including the following.
1. Changing the active window
When working in EViews or other Windows programs, you may find that you have a
number of open windows. The currently active (top-most) window is easily identified
since its title bar will generally differ (in color and/or intensity) from the inactive
windows. You can make a window active by clicking anywhere in the window, or by
clicking on the word Window in the main menu, and selecting the window by clicking on
its name.
2. Scrolling
Windows provides both horizontal and vertical scroll bars so that you can view the
contents of windows that contain information, which does not fit inside the window.
When the information does fit, the scroll bars will be hidden.
The scroll box indicates the overall relative position of the window and the data. In the
example above, the vertical scroll box is near the bottom, indicating that the window is
showing the lower portion of our data. If the box is in the middle of the scroll bar, then
the window displays the halfway point of the information. The size of the box also
changes to show you the relative sizes of the amount of data in the window and the
amount of data that is offscreen. Here, the current display covers roughly half of the
horizontal contents of the window.
The up, down, left, and right scroll arrows on the scroll bar will scroll one line in that
direction. Clicking on the scroll bar on either side of a scroll box moves the information
one screen in that direction.
If you hold down the mouse button while you click on or next to a scroll arrow, you will
scroll continuously in the desired direction. To move quickly to any position in the
window, drag the scroll box to the desired position.
3. Minimize/Maximize/Restore/Close
There may be times that you wish to move EViews out of the way while you work in
another Windows program. Or you may wish to make the EViews window as large as
possible by using the entire display area.
In the upper right-hand corner of each window, you will see a set of buttons which
control the window display. By clicking on the middle button, you can toggle between
using your entire display area for the window, and using the original window size. The
button maximize uses your entire monitor display for the application window; whereas
the button restore returns the window to its original size, allowing you to view multiple
windows. If you are already using the entire display area for your window, the middle
button will display the icon for restoring the window; otherwise, it will display the icon
for using the full screen area.
You can minimize your window by clicking on the minimize button in the upper right-hand corner of the window. To restore a program that has been minimized, click on its icon in the taskbar.
Lastly, the close button provides you with a convenient method for closing the window.
To close all of your open EViews windows, you may also select Window in the main
menu, and either Close All, or Close All Objects.
To select a single item, you should place the pointer over the item and single click. The
item will now be highlighted. If you change your mind, you can change your selection by
clicking on a different item, or you can cancel your selection by clicking on an area of the
window where there are no items. Double clicking on an item will usually open the item.
If you have multiple items selected, you can double click anywhere in the highlighted
area.
There are several methods for starting the EViews program:
Click on the Start button in the taskbar, select Programs, navigate to the EViews3 program group, and then select the EViews3.1 program icon. (Note: if we have installed EViews 5, similar steps are used to open it.)
Navigate to the EViews directory using Windows Explorer, or via the My Computer icon on your desktop, and double-click on the EViews3.1 (or EViews 5, if installed) program icon.
Double-click on an EViews workfile or database icon.
The next subsection presents the various components displayed when EViews is opened. If the program is correctly installed, you should see the EViews window when you launch the program. The following are the main areas in the EViews window.
In EViews, the title bar is located at the very top of the main window. When EViews is the active program in Windows, the title bar has a color and intensity that differs from the other windows (generally, it is darker). When another program is active, the EViews title bar will be lighter. If another program is active, EViews may be made active by clicking anywhere in the EViews window.
Just below the title bar is the main menu. If you move the cursor to an entry in the main menu and click the left mouse button, a drop-down menu will appear. Clicking on an entry in the drop-down menu selects the highlighted item. Some of the items in the drop-down menu may be listed in black and others in gray; black items may be executed, while gray items are not currently available.
The area in the middle of the window is the work area where EViews will display the
various object windows that it creates. Think of these windows as similar to the sheets of
paper you might place on your desk as you work. The windows will overlap each other
with the foremost window being in focus or active. Only the active window has a
darkened title bar.
When a window is partly covered, you can bring it to the top by clicking on its title bar or
on a visible portion of the window. You can also cycle through the displayed windows by
pressing the F6 or CTRL-TAB keys. Alternatively, you may directly select a window by clicking on the Window menu item and selecting the desired name. In any case, once we have opened the EViews window, we then create a workfile.
Check Your Progress 2.2
1. State the various ways of starting EViews program
________________________________________________________________________
________________________________________________________________________
______________________________________________________________________
2. State the main areas in the EViews Window
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
To enter data into the EViews spreadsheet, we first need to open a workfile after opening the EViews window. To create a workfile, we click File, select New, and choose Workfile. This process is shown in the diagram below.
This will display the Workfile Create box. In the box, we have to specify the frequency as well as the start and end date. That is, we need to provide the appropriate frequency and enter the information for the workfile range. Note that the Start date is the earliest date or observation you plan to use in the project and the End date is the latest date or observation. Frequencies include annual, semi-annual, quarterly and the like.
The rules for describing the workfile frequency are quite simple:
Annual: This refers to the year, for example, 1995, 2000 or 2005. Thus, if the data to be entered and used are yearly data, we select Annual.
Quarterly: This refers to the year, followed by a colon or period, and the quarter number. Examples: 1994:2 (representing the 2nd quarter of 1994), 2001:1, 2006:3 and so on. If the data to be used are quarterly data, we select Quarterly.
Monthly: This refers to the year, followed by a colon or period, and the month number. Examples: 1984:1 (representing the first month of 1984), 1993:12, and so on. Therefore, if the data to be used are monthly data, we select Monthly.
Weekly and Daily: By default, you should specify these dates as month number, followed by a colon, followed by the day number, followed by a colon, followed by the year. Make sure that the date is entered in the order month/day/year (i.e. mm/dd/yyyy). However, using the Options/Dates-Frequency… menu item, you can reverse the order of the day and month by switching to European notation.
After identifying the workfile frequency as per the data to be entered into EViews, we should specify the corresponding start and end date in the space provided. The diagram below displays the workfile window created for a specific frequency, start and end date.
After we have finished supplying the information about the type of workfile, we click OK. This will create the workfile window. Note that the workfile is UNTITLED since we have not yet saved it; if we save the workfile under a certain file name, that name will appear instead of the word UNTITLED.
Notice from diagram 2.3 below that there are two icons in this newly created workfile. These icons represent the objects that are contained in every workfile: a vector of coefficients represented by C, and a series of residuals designated by RESID.
As can be seen from the above diagram, just below the title bar there exists a toolbar
made up of a number of buttons. These buttons provide you with easy access to a number
of useful workfile operations.
Below the toolbar are two lines of status information. That is, EViews displays the range
of the workfile, the current sample of the workfile (the range of observations that are to
be used in calculations and statistical operations), the display filter (rule used in choosing
a subset of objects to display in the workfile window), and the default equation (the last
equation estimated or operated on). You may change the range, sample, and filter by
double clicking on these labels and entering the relevant information in the dialog boxes.
Double clicking on the equation label opens the equation.
Saving Workfiles
Usually we want to name and save our workfile for future use. To do so, we push (or click) the Save button on the workfile toolbar to save a copy of the workfile on disk. You can also save the file using the File/Save As… or File/Save… choices from the main menu. When we make use of one of these three alternative saving methods, a standard Windows file dialog box will open. In the dialog box we can specify the target directory in the upper file menu labeled Save in. Note that we can navigate between directories in the standard Windows fashion: click once on the down arrow to access a directory tree, or double-click on a directory name in the display area to get a list of all the files and subdirectories in that directory. Once you have worked your way to the right directory, type the name you want to give the workfile in the File name box and push the Save button. This will save the workfile under the name we chose.
Once the workfile is named and saved, we can save subsequent updates or changes using File/Save… from the main menu or the Save button on the toolbar. EViews will use the existing workfile name and, if the workfile has changed, will ask whether you want to update the version on disk. Just as in other Windows software, File/Save As… can be used to save the file under a new name.
Note that when you select File/Open/Workfile… you will see a standard Windows file
dialog. Simply navigate to the appropriate directory and double click on the name of the
workfile. Then the workfile window will open and all of the objects in the workfile will
immediately be available.
For your convenience, EViews keeps a record of the ten most recently used workfiles and
programs at the bottom of the File menu. Select an entry and it will be opened in EViews.
Resizing Workfiles
Sometimes we may decide to add data, or we may want to use observations beyond the ending date or before the starting date of our workfile. Alternatively, we may wish to remove extra observations from the start or end of the workfile.
To change the size of our workfile, we select Procs and click Change workfile Range…
and enter the required beginning and ending observation of the workfile in the dialog. If
we enter dates that encompass the original workfile range, EViews will expand the
workfile without additional comment. However, if we enter a workfile range that does
not encompass the original workfile range, EViews will warn us that data will be lost,
and ask us to confirm the operation.
Sorting Workfiles
Basic data in workfiles are held in objects called series. If you click on Procs/Sort
Series… in the workfile toolbar, you can sort all of the series in the workfile on the basis
of the values of one or more of the series. A dialog box will open where you can provide
the details about the sort.
If you list two or more series, EViews uses the values of the second series to resolve ties
from the first series, and values of the third series to resolve ties from the second, and so
forth. If you wish to sort in descending order, select the appropriate option in the dialog.
Note that if you are using a dated workfile, sorting the workfile will generally break the
link between an observation and the corresponding date.
The following names are reserved and should not be used for naming a variable while
working with EViews: These are ABS, ACOS, AR, ASIN, C, CON, CNORM, COEF,
COS, D, DLOG, DNORM, ELSE, ENDIF, EXP, LOG, LOGIT, LPT1, LPT2, MA, NA,
NRND, PDL, RESID, RND, SAR, SIN, SMA, SQR, and THEN.
V. Data Entry
Once the workfile is created, the next step is data entry. In EViews there are three
different ways of data entry. These are
Entering data from the keyboard
Copying data from other sources
Spreadsheet import.
A. Entering Data from the Keyboard
This approach is preferred for small data sets in printed form, where we may wish to enter the data by typing at the keyboard. The steps required to enter data via the keyboard are explained as follows. (Note that it is only after we have created a workfile that we can enter data into the EViews spreadsheet.)
I. Our first step is to open a temporary spreadsheet window in which we will enter the data. To do this, we choose Quick and select Empty Group (Edit Series) from the main menu to open an untitled group window. Note that we now have an empty EViews spreadsheet.
II. The next step is to create and name the series (or variables). In this regard, we first click once on the up arrow to display the second obs label in the left-hand column. The row of cells next to the second obs label is where you will enter and edit series names. Click once in the cell next to the second obs label, then type your first variable name in the command window and press ENTER. (Note that the name in the cell changes as you type the required name in the command window.)
III. Repeat this procedure in subsequent columns for each additional series.
If we want to rename one of our series, we simply select the cell containing the series name and edit the name. After making the necessary change, we press ENTER. EViews will then prompt us to confirm the series rename, and we confirm the rename by clicking OK.
IV. To enter the data into the EViews spreadsheet, we click on the appropriate cell and
type the number. Pressing ENTER after entering a number will move you to the next cell.
If you prefer, you can use the cursor keys to navigate the spreadsheet while entering the
data.
V. When you are finished entering data, close the group window. Now, in the workfile: UNTITLED box, you will have the names of the variables that were typed earlier. If you wish, you can first name the untitled group by clicking on the Name button. If you do not wish to keep the group, choose to delete it; EViews will ask you to confirm the deletion, and you select Yes.
B. Copying Data from Other Sources
By now you have learned that the Windows clipboard is a handy way to move data within EViews and between EViews and other software applications. The following discussion involves an example using an Excel spreadsheet, but the basic principles apply to other Windows applications.
Suppose we have GDP and Investment data of Ethiopia from 1953 up to 1995 E.C. in an Excel spreadsheet, and suppose that we would like to bring these variables into EViews. The following discussion illustrates the steps to be followed in this regard.
I. First, copy the data range from the Excel spreadsheet. Then start EViews and create a new annual workfile containing the dates in the Excel spreadsheet (in our example, 1953 through 1995). Then we select Quick and choose Empty Group (Edit Series). Note that the spreadsheet opens in edit mode, so there is no need to click the Edit +/– button. If we have created an annual workfile with a range from 1953 to 1995, the first row of the EViews spreadsheet is labeled 1953. Since we are pasting in the series names as well, we should click on the up arrow in the scroll bar to make room for them.
II. Place the cursor in the upper-left cell, just to the right of the second obs label. Then
select Edit and then choose Paste from the main menu (not Edit +/– in the toolbar). This
will paste the data into the group spreadsheet.
Note that you may now close the group window; the untitled group will be deleted without losing the two series. Notice then that in the workfile: UNTITLED box you will have the names of the two variables that were entered earlier.
Note that we can bring data from the clipboard into an existing EViews series or group spreadsheet by using the Edit/Paste approach, as we did earlier. There are only a few additional issues to consider.
1. To paste several series, first open a group window containing the existing series. The easiest way to do this is to click on Show and then type the series names in the order they appear on the clipboard. Alternatively, you can create an untitled group by selecting the first series, then CTRL-clicking on each subsequent series (in order), and then double-clicking to open the group.
2. Next, make certain that the group window is in edit mode. If not, press the Edit +/–
button to toggle between edit mode and protected mode. Place the cursor in the target
cell, and select Edit and then Paste.
C. Spreadsheet Import
This approach means that we can also read data directly from files created by other programs. Data may be in Excel (i.e. XLS) spreadsheet format or in other compatible formats. In importing data from other sources, the following steps must be considered.
I. First, make certain that you have an open EViews workfile ready to receive the contents of the data import. For our case, consider the following Excel data containing information about the return from a certain investment for 12 rich countries of the world.
Figure 2.4 Excel data of 12 rich countries
Note that to import the above Excel spreadsheet into EViews format, we first have to create a workfile whose range and frequency are in line with the data to be imported.
II. Next, in the workfile (UNTITLED) box we click on Procs, select Import, and then Read Text-Lotus-Excel... The diagram below shows this step.
III. After this, we will see a standard File dialog box asking us to specify the type and name of the file. Select a file type, navigate to the directory containing the file, and double-click on the name. Alternatively, type in the name of the file that you wish to read (with full path information, if appropriate); in this case, EViews will automatically set the file type, otherwise it will treat the file as an ASCII file. Then we click Open. This will display the Excel spreadsheet import box shown in the diagram below.
IV. To read from a spreadsheet file, we fill in the dialog as follows:
A) First, we need to tell EViews whether the data are ordered by observation or by series. By observation means that all of the data for the first observation are followed by all of the data for the second observation, etc. By series means that all of the data for the first variable are followed by all of the data for the second variable, etc. Another interpretation of "by observation" is that the variables are arranged in columns, while "by series" implies that all of the observations for a variable are in a single row. For example, in the above diagram the data are ordered by observation.
B) Next, we have to tell EViews the location of the beginning cell (upper left-hand corner) of the actual data, without including any label or date information. In the diagram above, the upper left data cell is A4. This implies that we are importing data starting from the first column (represented by column A in the Excel spreadsheet) and the fourth row (as the values displayed in Figure 2.6 begin from the fourth row of the Excel spreadsheet).
C) Then we enter, into the edit box, the names of the series that we wish to read. In the above case, we write the names of the 12 countries using the same names as written in the Excel spreadsheet, such as fin, swe, nor and so on.
Alternatively, if the names that you wish to use for your series are contained in the file,
you can simply provide the number of series to be read. The names must be adjacent to
your data. If the data are organized by row and the starting cell is B2, then the names
must be in column A, beginning at cell A2. If the data are organized by column beginning
in B2, then the names must be in row 1, starting in cell B1. If, in the course of reading the
data, EViews encounters an invalid cell name, it will automatically assign the next
unused name with the prefix SER, followed by a number (e.g., SER01, SER02, etc.).
D) Lastly, we should tell EViews the sample of data that we wish to import. Notice that
EViews begins with the first observation in the file and assigns it to the first date in the
sample for each variable. Each successive observation in the file is associated with
successive observations in the sample. Note also that if we are reading data from an
Excel 5 workbook file, there will be an additional edit box where we can enter the name
of the sheet containing our data. If we do not enter a name, EViews will read the topmost
sheet in the Excel workbook. When the dialog is completely filled out, simply click OK
and EViews will read your file, creating series and assigning values as requested.
Figure 2.7 Work file with the imported variable names
The workfile: UNTITLED box above shows that we have successfully imported the Excel data containing the figures for the 12 countries. This is confirmed by the appearance of the names of the variables (or countries) in addition to C and RESID. To see the data that we have imported into the EViews spreadsheet, we can use one of the following steps:
Select the variable names in the workfile window and either click Show from the same menu, or right-click and select Open as Group;
Click Show in the workfile menu and write the names of the variables in the Show box.
2. State the three ways of data entry into EViews spreadsheet
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Using the three alternative data entry approaches, read the above data into EViews
spreadsheet
I) Introduction
Stata is a modern and very powerful program especially designed for data management, statistical and econometric analysis, as well as graphics. It is mainly a command-driven program providing sufficient flexibility to meet different users' needs. The main aim of this section is to familiarize you with the basic operations of Stata and teach you how to use and combine some basic commands.
To open Stata, just double-click on the Stata icon (or WSTATA icon) found among the program files of your computer. Once Stata is opened, your computer's screen will display the following window.
Figure 2.8 The Stata Window.
The discussion below explains the different components of the Stata window opened above.
Note from the figure above that the top of the Stata window contains the menu bar and the tool bar. These bars and the Stata windows are explained as follows.
A. Menu Bar: The menu bar has lists of commands that can be opened by clicking on them. Below we provide a brief description of the different options. Note that if you use Stata a lot, you probably will not use the menu bar often, because the most common tasks can be done with the buttons on the tool bar and the command box.
1. File: When we click this menu, 10 items drop down. These are:
Open: opens a data file already saved in Stata format
View: helps to view a data file
Save: saves the data file
Save As: saves the data file under a new name
File Name: selects a data file name to put in a command
Log: opens, closes, reviews, or converts a log file
Save Graph: saves a file containing a graph
Print Graph: prints a graph
Print Results: prints the contents of the current window
Exit: leaves Stata
2. Edit: Opening the Edit menu brings up five items, as stated below:
Copy Text: copies marked text
Copy Table: copies a table for insertion into a spreadsheet or word processor
Paste: inserts something previously copied
Table Copy Options: provides options for how tables are copied
Graph Copy Options: provides options for how graphs are copied
3. Prefs: This is used to set various preference options, for example to restore the default windowing or to change the colors used in the Stata windows.
4. Window: When we click this menu, 10 items are displayed: Results, Graph, Log, Viewer, Command, Review, Variables, Help/Search, Data Editor and Do-file Editor. Selecting any of these brings that particular window to the front.
5. Help: This menu is used whenever additional information or support is needed. Within it there are Contents, Search, Stata Command and What's New. These will be discussed later on.
B. Tool Bar: Notice that the buttons on the tool bar are designed to make it easier to carry out the most common tasks. The following diagram presents the various items in the tool bar.
Figure 2.9 Items in the tool bar
Note that most of the items presented in the tool bar are also found inside the menu bar presented earlier in section A.
C. Stata Windows: These are the four windows that Stata displays by default, in addition to the menu bar and tool bar. The four windows and their purposes are explained, with diagrammatic illustrations, as follows.
1. The Stata Command window: This is located at the bottom right of the screen. It is the place where Stata commands are entered and edited. Note that once the appropriate command is typed, we have to press the Enter key to execute it.
2. The Stata Results window: This is located at the right side of the screen. It displays the results of the commands entered. In case there is any error in the command, a message in red text will appear.
3. The Review window: As the figure below shows, this is located at the upper left side of the screen. It lists all the executed commands of the session (both correct and incorrect). Note that by clicking on a listed command, we can make it reappear in the command window so that we can either execute it again or modify it.
4. The Variables window: As the figure above shows, this is located at the bottom left side of the screen. It displays all the variables contained in the currently used data set.
Note that we can choose the fonts that each of the four windows displays. As you can see, each window has a window control menu box (the little box in the upper left corner of the window). Simply click once on it, then choose Font, which brings up a standard Windows font dialog box. Select your font preferences and click OK. You have now changed the font that this window displays. To change the fonts for the other windows, you have to follow this process for each window separately, which allows you to choose a different font for each window. Note that the next time Stata starts, it will come up according to the latest window modifications.
Once we have organized the screen, we are ready to learn Stata's basic features. Do not forget that Stata is a command-driven application (this is what makes it really powerful), so we have to learn how to insert and combine Stata commands. Note that Stata commands are case sensitive (i.e. Stata distinguishes between lowercase and uppercase syntax); all Stata commands included in this material are in lowercase. For some basic operations, Stata offers windows facilities as an alternative to commands.
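For instance (a minimal illustration using the standard summarize command; the variable name income is assumed to exist in the current data set):

summarize income
SUMMARIZE income

The first line runs and displays summary statistics for income, while the second is rejected as an unrecognized command, precisely because Stata expects lowercase syntax.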
Note that in order to get the full power out of Stata, we should become familiar with its help facilities. To do so, we first open the Help menu. This will display a number of sub-items, as shown below.
To search for a particular topic, we click on Search. This will open the following dialog box, called the keyword search box, which helps to search for an item in the program.
Figure 2.12 Stata Search Box
Note that by selecting the first option and specifying a particular word, Stata produces the corresponding results. For example, if we type the word regression in the Keywords box, all those entries in Stata that contain this word are displayed.
As you can see from the diagram above, there are hypertext links (clickable words in blue) that will take you to the help files for the appropriate Stata commands.
If we click again on Help and select Stata Command from the pull-down menu, a dialog box appears that asks us to specify a Stata command. This box is shown below.
Figure 2.14 Stata command box
By specifying a command name and clicking OK, we can see detailed information on its syntax as well as some examples of how the command can be used. For illustration purposes, let us specify the command regress, which is used to perform linear regressions. Clicking OK then brings up the following Stata on-line manual page.
Note that we can get the same help facilities by simply using the help and search commands. Typing search regression in the command window and pressing Enter produces the same output as choosing Search from the Help menu and entering the keyword regression. In the Stata Results window, a list of all Stata commands that relate to regression will appear in green. When you position the mouse pointer near a hypertext link (displayed in blue), the pointer will change to a hand; if you click while the hand is pointing at a command name, you will go to the help file of the selected command. In general, we can use search followed by a keyword whenever we need to find out the names of the commands related to that keyword.
On the other hand, we can use the help command when we know the exact name of the Stata command on which we want more information. Notice that executing help regress produces the same output as choosing Stata Command from the Help menu and specifying regress in the Stata Command dialog box (as described above). For example, suppose we type, in the command window, help followed by the command name regress. In the Stata Results window, we then obtain information on the specific command (which is all the information that Stata's on-line manual pages, a shortened version of the printed manuals, include). If we type help alone, we obtain information on how to use the help system.
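As a hedged illustration of what such a manual page documents (the variable names consume, income and price are hypothetical), a linear regression is run by typing regress followed by the dependent variable and then the regressors:

regress consume income price

This regresses consume on income and price and displays the OLS estimation results in the Stata Results window.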
To perform any kind of analysis using Stata, we need to have data in the Stata spreadsheet. The data can be imported from another spreadsheet (such as Excel) or entered via the keyboard into the Stata spreadsheet. The following explains the steps required in each case.
A. Copying data from another spreadsheet: This approach is preferred when we want to make use of data that is already available in soft copy in a certain spreadsheet (such as Excel). The steps required to execute this method are:
1. First open the required spreadsheet program and open the data set.
2. Then select the whole data set and copy it.
3. Open Stata and bring the spreadsheet (data editor) window to the front. (This can be done either by opening the spreadsheet using the menu/tool bar or by writing the command edit in the command box and pressing the Enter key.)
4. Click Edit in the menu bar and select Paste from the drop-down menu. This transfers the whole spreadsheet of data into the Stata spreadsheet.
B. Entering data via the Data Editor: This approach is more appropriate when we have data in hard copy and want it in the Stata spreadsheet. Suppose we have data on the ID numbers of five individuals and their corresponding income and expenditure values. To enter this information directly into the Stata spreadsheet, we first open an empty spreadsheet (Data Editor) window. Then, treating each column as a separate variable, we begin entering the data: first, the respondents' ID values are entered in the first column. This will produce a result like the one shown below.
Note that the moment we start entering the values of the ID variable, Stata names that column var1. To change this name to the actual one, we double-click on var1 and then enter the variable name ID, as shown below.
Figure 2.17 Changing variable names.
Note that in the Stata variable information box, we enter the name of the variable
(which is ID for this case) and, if we wish, enter the variable label. We then
proceed to do this for each additional variable.
Once the data are entered, we can close the spreadsheet and return to the Stata window. In
this case the Variables window will display the names of the variables (i.e. ID and the others)
that we have entered in the spreadsheet.
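For small data sets, the same entry job can also be done entirely from the Command window with the input command. The sketch below enters five hypothetical respondents (the income and expenditure figures are invented for illustration); typing end closes the entry.
clear
input ID income expenditure
1 1200 950
2 1500 1100
3 900 800
4 2000 1450
5 1750 1300
end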
3. Consider the data below:
Y 10 15 18 16 25 30 40 35
X 15 13 11 9 7 9 11 13
Enter the above data into the Stata spreadsheet using (a) the copy-and-paste approach, and
(b) direct typing (via the Stata data spreadsheet).
The first step in data analysis involves organizing the raw data into a format usable by
Stata. Data management encompasses the initial tasks of creating a data set, editing it to
correct errors, and adding internal documentation such as variable and value labels. It
also encompasses many other jobs required by ongoing projects, such as reorganizing,
simplifying, or sampling from the data. Moreover, it includes adding further observations
or variables; separating, combining or collapsing data sets; converting variable types; and
creating new variables through algebraic or logical expressions. Note that although Stata
is best known for its analytical capabilities, it possesses a broad range of data
management features as well. This section introduces some of the basics in this regard.
This discussion states the various Stata commands that are used in the data management
process. Assuming that we have data on GDP, INV (investment) and SAV (domestic
saving) in the Stata spreadsheet, we explain briefly below what each Stata
command does. However, to adequately understand each command, the student is
advised to practice using his/her own data. Note that each command is identified by bold
letters, and varname (or varlist) refers to the variable name (or list of variables) to be used
in the data management work. To obtain the result after we write a command, we press
Enter on the keyboard.
compare varname1 varname2
This reports the differences and similarities between the two variables.
describe
Note that describe displays a summary of the contents of the data in memory or of the data
stored in a Stata-format dataset. The related command ds lists variable names in a compact
format.
drop varlist
keep varlist
Keep works the same as drop except that you specify the variables or observations
to be kept rather than those to be deleted.
Example: keep GDP EXPO keeps the GDP and export variables and eliminates all other
variables from the spreadsheet.
clear
This removes all the data (variables and observations) from memory. To eliminate specific
variables instead, the drop command described above is used.
Example: drop INV SAV will eliminate the variables investment and saving from the
data, whereas clear empties the whole data set.
edit
The command edit brings up a spreadsheet-style data editor for entering new data and
editing existing data.
browse
This is like edit except that it does not allow changing the data. That is, data displayed in
this way cannot be edited.
merge using filename
merge joins corresponding observations from the dataset currently in memory
(called the master dataset) with those from the Stata-format dataset stored as
filename (called the using dataset) into single observations. Note that if filename is
specified without an extension, .dta is assumed.
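A minimal sketch of a typical merge, assuming two hypothetical Stata-format files master.dta and extra.dta that share an identifier variable id (in this version of Stata both datasets must already be sorted by the matching variable, so extra.dta is assumed to have been sorted by id and saved):
use master, clear
sort id
merge id using extra
After the merge, Stata creates the variable _merge, which records how each observation was matched.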
order varlist
Notice that, order changes the order of the variables in the current dataset. The variables
specified are moved, in order, to the front of the dataset.
The command move also reorders variables. It relocates varname1 to the position of
varname2 and shifts the remaining variables, including varname2, to make
room. Similarly, the command aorder alphabetizes the variables specified in the variable list
and moves them to the front of the dataset. If no variable list is specified, _all is
assumed.
rename old_varname new_varname
Note that rename changes the name of an existing variable; the contents of the
variable remain unchanged.
Example: rename INV INVE. This renames INV as INVE.
save filename
Note that save stores the dataset currently in memory on disk under the name filename.
If filename is not specified, the name under which the data was last known to Stata
is used. If filename is specified without an extension, .dta is used.
Example: save myfile. This will save the dataset under the name myfile.dta.
Before conducting estimation and analysis, it may be important to transform the data into
the required form. Moreover, performing pre-estimation tests is important in order to
adjust the data set into an acceptable form.
replace
This command changes the contents of an existing variable.
Example: the command replace GDP=100*GDP replaces the old values of the variable
GDP with 100 times their previous values.
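When the original values should be preserved, a new variable can be created instead with the companion command generate. A small sketch (the new variable name GDP100 is just an illustrative label):
generate GDP100 = 100*GDP
Here GDP itself is left untouched, whereas replace overwrites the existing variable.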
gsort
Note that gsort arranges the observations in ascending or descending order of the
specified variables. This command differs from sort in that sort can produce
only ascending-order arrangements. Each variable can be numeric or string. The
observations are placed in ascending order of a variable if + (or nothing) is typed in
front of its name, and in descending order if - is typed.
Example: gsort GDP creates an ascending-order arrangement of GDP, but if we write
gsort GDP -INV then the data are sorted in ascending order of GDP and, within equal
values of GDP, in descending order of INV.
inspect
This command displays a simple summary of a variable's characteristics. It reports the
number of negative, zero, and positive values; the number of integers and non-integers;
the number of unique values; the number of missing values; and produces a small
histogram. Its purpose is not analytical; instead it allows you to quickly gain familiarity
with unknown data. Example: inspect EXPO
list
List displays the values of variables. If no variable list is specified, the values of
all the variables are displayed.
sample
Note that sample draws random samples from the data in memory. Sampling here is
defined as drawing observations without replacement. The size of the sample to be drawn
can be specified either as a percentage or as a count.
Suppose that we have data on males and females recorded in a variable sex, and the data
are sorted by sex. Then consider the command below.
by sex: sample 50, count
This draws a sample of 50 observations for men and 50 for women.
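Because the drawing is random, the observations kept change from run to run unless the random-number seed is fixed first. A minimal sketch (the seed value is arbitrary):
set seed 12345
sample 10
The first line makes the draw reproducible, and the second keeps a 10% random sample of the observations in memory, deleting the rest.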
Usually the first thing we should do before performing any kind of data analysis is to
examine the behavior of the data to be used in the analysis. Such an approach helps to
adjust the data in a manner that is applicable to appropriate estimations. In this
connection, we discuss here an approach that helps us identify whether a variable has
outliers. Recall that by an outlier we mean an extreme value compared to the average.
The two plots usually used in this regard are the box plot and the one-way scatter plot.
Consider data on imports (given by IMPO) of Ethiopia for the period 1953 to 1995 E.C.
The commands below are used to construct the one-way scatter plot and the box plot
respectively.
graph IMPO, oneway
This command produces a one-way scatter plot of the variable import (IMPO) as shown
hereunder.
Figure 2.18 Scatter plot result for the variable import.
As can be seen from the result above, most of the values of the variable IMPO (import)
are concentrated around a value of 260, as shown in the (dark) left side of the scatter
plot. The figure also points out the presence of an outlier on the right-hand side of the
scatter plot. That is, values around 21,550 are extreme and hence represent an outlier
compared to the majority of the values of the variable.
Similar results can be obtained if we make use of the box plot approach. In this case the
result makes use of a box, while the conclusion is more or less the same. To construct the
box plot for the variable IMPO (import) we write the command graph IMPO, box. This
draws a box plot for the variable IMPO as follows.
[Box plot of IMPO; the plotted values range from 258.724 to 21557.6]
Notice from the box plot result above that there are outliers, shown as dots at the
upper part of the box plot. As can be seen clearly, values around 21,550 are extreme and
hence represent outliers compared to the majority of the values of the variable.
Note that such outlier values strongly affect the mean value of the variable and also
make the standard deviation larger than it would have been without such
extreme values. Because of such complications and the unattractive results that follow
from outliers, we may omit or exclude extreme observations. This is usually practiced in
regression analysis, since regression is nothing but a conditional expectation, which is
sensitive to extreme values. In Stata such an exclusion can be done as sketched below.
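A minimal sketch of such an exclusion for the import series, assuming we treat values above 20,000 as outliers (the cut-off is purely illustrative, not a rule):
drop if IMPO > 20000 & IMPO < .
The second condition guards against dropping missing values, since Stata treats a missing value as larger than any number.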
Check Your Progress 2.5
Consider the following data:
Y 10 12 13 11 14 10 16 15 18
X1 0.25 0.75 1.25 3.5 8 6.1.25 1.00 0.9 1.5
X2 150 180 600 400 100 120 140 130 175
Using EViews we can transform variables into other forms. For example, the variable GDP
can be transformed into its logarithmic form, the lagged value of GDP can be created, and
so on. To transform a variable we need to write a command that makes EViews perform
the adjustment. The command is written after clicking the GENR button on the workfile
toolbar. The steps in doing so are explained as follows. Note that there are a number of
possible transformations, but we will look at two of them.
First open the workfile box that contains the names of the variables that we want
to transform.
Then from the tool bar click GENR. This will open the Generate Series by Equation
box. It is in this box that we write the command to transform the variable into
another form.
If our objective is to transform variable Y into its logarithmic form, we write the
following command in the box:
LY = Log(Y) and then click OK (or press ENTER). This will create the
logarithm of Y and store it as LY in the EViews spreadsheet.
For example, consider the following table, which contains the variables GDP, export
(represented by EXPO) and import (represented by IMPO) for the period 1986 to 1995
E.C.
Year GDP EXPO IMPO
1986 28328.90 3223.000 6090.500
1987 33885.00 4898.100 7950.000
1988 37937.60 4969.700 8721.500
1989 41465.10 6730.600 10584.70
1990 44840.30 7116.900 11341.20
1991 48803.20 6878.000 14101.50
1992 53189.70 8017.600 15969.30
1993 54210.70 7981.500 16193.60
1994 51760.60 8027.400 17709.50
1995 54585.90 8319.300 21557.60
Table 2.1 Data on GDP, Export, and Import from 1986 to 1995
Suppose that we wanted to create the logarithm and lagged value of each of the above
tabled variables. The transformation is done by writing the following command in the
Generate Series By Equation box
A) Transformation into logarithms
- For GDP we write LGDP = log(GDP) and click OK or press ENTER
- For EXPO we write LEXPO = log(EXPO) and click OK or press ENTER
- For IMPO we write LIMPO = log(IMPO) and click OK or press ENTER
B) Generating one-period lags
- For GDP, a one-period lag can be generated with, for example, LAGGDP = GDP(-1),
clicking OK or pressing ENTER (the name LAGGDP is just an illustrative label); EXPO
and IMPO are treated in the same way.
The above commands will create the logarithm and the one-period lagged value of the
GDP, export and import variables as shown below.
Notice that the first value of each lagged variable is NA, indicating that it is not
available.
Consider the values of Y, X1 and X2 provided in Check Your Progress 2.5 and attempt the
following questions
1. Using EViews create the logarithm of Y, X1 and X2
________________________________________________________________________
________________________________________________________________________
2.3 Summary
In this unit we have seen what we mean by data entry, transformation and pre-estimation
tests. In an Excel spreadsheet we can enter numbers, text, dates or times. To do so, we first
click the cell where we want to enter data; then we type the data and press ENTER or
TAB. Note that we can enter the data row by row or column by column.
As we have seen, EViews provides convenient visual ways to enter data series from the
keyboard or from disk files, to create new series from existing ones, and to perform
data transformations. EViews takes advantage of the visual features of modern Windows
software. You can use your mouse to guide the operation with standard Windows menus
and dialogs. Results appear in windows and can be manipulated with standard
Windows techniques. In EViews there are two main ways of data entry: entering data
from the keyboard and entering data from a file.
The other software used for the discussion is Stata. To open Stata, just double click on the
Stata icon (or wstata icon) found in the program files of your computer. Once Stata is
opened, your computer's screen will display the Stata window, which contains the four
windows that Stata shows by default in addition to the menu and tool bar. To perform
any kind of analysis using Stata, we have to have the data in the Stata spreadsheet. The data
can be imported from another spreadsheet (such as Excel) or entered via the keyboard into
the Stata spreadsheet.
Data management encompasses the initial tasks of creating a data set, editing it to correct
errors, and adding internal documentation such as variable and value labels. It also
encompasses many other jobs required by ongoing projects, such as reorganizing,
simplifying, or sampling from the data. Moreover, it includes adding further observations
or variables; separating, combining or collapsing data sets; converting variable types; and
creating new variables through algebraic or logical expressions. Note that although Stata
is best known for its analytical capabilities, it possesses a broad range of data
management features as well. Similarly, using EViews we can transform variables into
other forms. For example, the variable GDP can be transformed into its logarithmic form,
the lagged value of GDP can be created, and so on. To transform a variable we write a
command that makes EViews perform the adjustment.
Model Examination
Consider the following quarterly data on GDP, M1 (money supply), PR (price) and RS
(interest rate) for the period 1993:1 up to 1996:4.
Year GDP PR M1 RS
1993:1 1611.1 1.018411 1098.221 2.993333
1993:2 1627.3 1.023475 1135.69 2.983333
1993:3 1643.625 1.02831 1168.657 3.02
1993:4 1676.025 1.035079 1187.475 3.08
1994:1 1698.6 1.041367 1210.237 3.25
1994:2 1727.875 1.047149 1211.559 4.036667
1994:3 1746.65 1.053865 1210.962 4.51
1994:4 1773.95 1.06088 1204.365 5.283333
1995:1 1792.25 1.069409 1209.235 5.78
1995:2 1802.375 1.074633 1219.42 5.623333
1995:3 1825.3 1.080187 1204.52 5.38
1995:4 1845.475 1.086133 1197.609 5.27
1996:1 1866.875 1.093915 1195.807 4.95
1996:2 1901.95 1.098441 1208.025 5.04
1996:3 1919.05 1.105475 1218.991 5.136667
1996:4 1948.225 1.110511 1202.149 4.97
1. Using the box plot and scatter plot, test for the presence of outliers in each variable given
above.
2. Using Stata, generate the logarithm of each variable.
3. Draw a random sample of 30% from the above data.
4. Using EViews, create the logarithm of each variable.
5. Using EViews, generate a one-period lagged value of each variable.
Unit Three: Statistical Estimation and Graphical Analysis
3.0 Objective
3.1 Introduction
3.2 What Is Statistical Data Analysis
3.3 Some Basic Statistics Concepts
3.4 Statistical Estimation, Graphing and Analysis
3.4.1 Using Excel
3.4.2 Using EViews
3.4.3 Using Stata
3.5 Summary
3.6 Answers to Check Your Progress
3.7 Model Examination
3.0 Objective
The aim of this unit is to equip the student with the various ways of computing statistics
and drawing graphs. After completing this unit, a student will be able to:
Understand several statistics-related concepts
Compute several statistics and analyze them
Draw graphs and analyze the results
3.1 Introduction
The original idea of "statistics" was the collection of information about and for the
"state". The word statistics derives directly not from any classical Greek or Latin roots, but
from the Italian word for state.
The birth of statistics occurred in the mid-17th century. A man named John Graunt, who was
a native of London, began reviewing a weekly church publication issued by the local
parish clerk that listed the number of births, christenings, and deaths in each parish.
These so-called Bills of Mortality also listed the causes of death. Graunt, who was a
shopkeeper, organized these data in the form we now call descriptive statistics, which he
published as Natural and Political Observations Made upon the Bills of Mortality.
Probability has a much longer history. The word probability is derived from the verb to
probe, meaning to "find out" what is not too easily accessible or understandable. The word
"proof" has the same origin, providing the necessary details to understand what is claimed
to be true. Probability originated from the study of games of chance and gambling during
the sixteenth century, and probability theory was a branch of mathematics studied by Blaise
Pascal and Pierre de Fermat in the seventeenth century. Currently, in the 21st century,
probabilistic modeling is used in many areas such as quality control, insurance,
investment, and other sectors of business and industry.
Note that statistical models are currently used in various fields of business and
economics. However, the terminology differs from field to field. For example, the fitting
of models to data is variously called calibration, history matching, or data assimilation;
all of these terms are synonymous with parameter estimation.
3.2 What is Statistical Data Analysis? Data are not Information
In this section we first explain various concepts of statistical analysis and then discuss the
various ways of performing statistical estimation using EViews and Stata.
A given database contains a wealth of information, yet we typically make use of only a
fraction of it. In organizations, employees waste time scouring multiple sources for a
database. Decision-makers are frustrated because they cannot get business-critical data
exactly when they need it. Therefore, too many decisions are based on guesswork, not
facts, and many opportunities are missed, if they are noticed at all.
Data is known to be crude information and not knowledge by itself. The sequence from
data to knowledge is: from Data to Information, from Information to Facts, and finally,
from Facts to Knowledge. Data becomes information, when it becomes relevant to your
decision problem. Information becomes a fact when the data can support it. Facts are what
the data reveal. However, decisive instrumental (i.e., applied) knowledge is expressed
together with some statistical degree of confidence.
Figure 3.1 The sequence from data to knowledge
The above figure depicts the fact that as the exactness of a statistical model increases, the
level of improvements in decision-making increases. That is why we need statistical data
analysis. Statistical data analysis arose from the need to place knowledge on a systematic
evidence base. This required a study of the laws of probability, the development of
measures of data properties and relationships, and so on.
Considering the uncertain environment, the chance that "good decisions" are made
increases with the availability of "good information." The chance that "good information"
is available increases with the level of structuring of the process of knowledge
management.
Wisdom is about knowing how such knowledge is best used to meet the needs of the
decision-maker. Wisdom, for example, creates statistical software that is useful, rather
than technically brilliant.
The appearance of computer software is among the most important events in the process of
model-based statistical decision making. These tools allow us to construct numerical
examples, to understand the concepts easily, and to find their significance for ourselves.
Note that data are not information. To determine what statistical data analysis is, one
must first define statistics. Statistics is a set of methods that are used to collect, analyze,
present, and interpret data. Statistical methods are used in a wide variety of occupations
and help people identify, study, and solve many complex problems. In the business and
economic world, these methods enable decision makers and managers to make informed
and better decisions about uncertain situations.
Vast amounts of statistical information are available in today's global and economic
environment because of continual improvements in computer technology. To compete
successfully globally, managers and decision makers must be able to understand the
information and use it effectively. Statistical data analysis provides hands-on experience
that promotes the use of statistical thinking and techniques in order to make educated
decisions.
Computers play a very important role in statistical data analysis. Statistical software
packages offer extensive data-handling capabilities and numerous statistical analysis
routines that can analyze data sets from small to very large. The computer assists in
the summarization of data, but statistical data analysis focuses on the interpretation of the
output to make inferences and predictions.
Studying a problem through the use of statistical data analysis usually involves four basic
steps. These are (i) defining the problem, (ii) collecting the data, (iii) analyzing the data,
and (iv) reporting the results. The following discusses each step briefly.
An exact definition of the problem is imperative in order to obtain accurate data about it.
It is extremely difficult to gather data without a clear definition of the problem.
We live and work at a time when data collection and statistical computation have
become easy almost to the point of triviality. Paradoxically, the design of data collection,
never sufficiently emphasized in statistical data analysis textbooks, has been
weakened by an apparent belief that extensive computation can make up for any
deficiencies in the design of data collection. One must start with an emphasis on the
importance of defining the population about which we are seeking to make inferences; all
the requirements of sampling and experimental design must be met.
Designing ways to collect data is an important job in statistical data analysis. Statistical
inference refers to extending our knowledge obtained from a random sample to the whole
population from which it was drawn. This is known as inductive reasoning; that is,
knowledge of the whole from a part. Its main application is in hypothesis testing about
a given population. The purpose of statistical inference is to obtain information about a
population from information contained in a sample. It is simply not feasible to test the entire
population, so a sample is the only realistic way to obtain data, because of time and
cost constraints. Data can be either quantitative or qualitative. Qualitative data are labels
or names used to identify an attribute of each element. Quantitative data are always
numeric and indicate either how much or how many.
Statistical data analysis divides the methods for analyzing data into two categories:
exploratory methods and confirmatory methods. Exploratory methods are used to
discover what the data seems to be saying by using simple arithmetic and easy-to-draw
pictures to summarize data. Confirmatory methods use ideas from probability theory in
the attempt to answer specific questions. Probability is important in decision making
because it provides a mechanism for measuring, expressing, and analyzing the
uncertainties associated with future events. The majority of the topics addressed in this
course fall under this heading.
3.3 Some Basic Statistics Concepts
Mean. This measures the average of the observations. The most common measure of average
is the arithmetic mean. Given observations X1, X2, ..., Xn, the mean is given by the following formula:
\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}
Note that the mean lends itself to a subsequent analysis because it includes the values of
all items.
Dispersion. This is the variation or scatter of a set of values. A measure of the degree of
dispersion of data is needed for assessing the reliability of the average of the data.
The most important measures of dispersion are the variance and the standard deviation. Note
that the higher the dispersion, the lower the reliability of the average of the data.
Covariance. The covariance between any two variables measures the co-movement
between them. If the covariance is positive, the two variables move together; if it is
negative, the two variables move in opposite directions; and if it is zero, the two variables
are not linearly related. Note that the magnitude of the covariance cannot be interpreted
as an indication of the degree of linear association between the two variables. To convey
information about the relative strength of the relationship between any two variables, we
employ the concept of correlation.
Correlation. One of the methods for measuring the relationship between variables is the
correlation coefficient. Correlation is defined as the degree of relationship existing
between two or more variables. The degree of relationship existing between two variables
is called simple correlation, and the degree of relationship existing between three or more
variables is called multiple correlation.
The correlation coefficient, like the covariance, is a measure of the extent to which two
measurement variables “vary together.” Unlike the covariance, the correlation coefficient
is scaled so that its value is independent of the units in which the two measurement
variables are expressed. For example, if the two measurement variables are weight and
height, the value of the correlation coefficient is unchanged if weight is converted from
pounds to kilograms. The value of any correlation coefficient must be between -1 and +1
inclusive.
Two variables are said to be positively correlated if they tend to vary together in the same
direction, i.e. if they tend to increase or decrease together. For example, price of a
commodity and quantity supplied are positively correlated because when the price
increases, the quantity supplied increases and conversely when price falls, the quantity
supplied decreases. On the other hand, two variables are said to be negatively correlated
if they tend to vary in opposite directions, i.e. when one of the variables increases the
other decreases, and vice versa. For example, the price of a commodity and the quantity
demanded are negatively correlated: when the price increases, the quantity demanded
decreases, and when the price falls, demand for the commodity increases. Note that the
difference between covariance and correlation is that correlation coefficients are scaled to
lie between -1 and +1 inclusive, whereas the corresponding covariances are not scaled. Both
the correlation coefficient and the covariance are measures of the extent to which two
variables "vary together."
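The scaling can be written explicitly: the correlation coefficient is the covariance divided by the product of the standard deviations of the two variables,
\rho_{XY} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y}
so converting the units of either variable rescales the numerator and the denominator by the same factor and leaves the ratio unchanged.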
To obtain the quantitative measure of the degree of correlation between two variables, we
use a parameter called the correlation coefficient. We have two types of correlation
coefficients. These are population correlation coefficient, which is denoted by ρ, and
sample correlation coefficient, denoted by r.
Population correlation coefficient (ρ) refers to the correlation of all the values of the
population of the variables and sample correlation coefficient (r) refers to the estimate of
population correlation coefficient from the sample. The sample correlation coefficient
between two variables X and Y is defined by the following formula as:
r_{xy} = \frac{\sum_{i}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i}(X_i - \bar{X})^2 \sum_{i}(Y_i - \bar{Y})^2}}
The closer the value of the correlation coefficient to one, the greater the degree of
correlation (the closer the scatter of points approach a straight line). On the other hand,
the closer the value of the correlation coefficient to zero, the greater the scatter of points.
In hypothesis testing, the most common approach is to establish a pair of mutually
exclusive and exhaustive hypotheses about the true value of the parameter under study.
Then a sample is used to assess the hypotheses. In short, the following are the steps to be
used in testing a hypothesis.
The first thing that we do is formulate the null and the alternative hypotheses. The null
hypothesis represents the hypothesis to be tested empirically. Thus, this hypothesis is
developed for the purpose of testing. It is represented by H0. The alternative hypothesis, on
the other hand, is the counter-proposition against which we test the null hypothesis. It
describes what we will conclude if we reject the null hypothesis. It is represented by H1.
Then we need to set the level of significance. The level of significance is the probability
of rejecting the null hypothesis when it is true. Generally, in making decisions regarding
tests of hypotheses, there are two possible errors. These are:
Type I error. This is the error committed when we reject H0 while it is true. The
probability of committing a Type I error is denoted by α.
Type II error. This is the error committed when we accept H0 while it is false. The
probability of committing a Type II error is denoted by β.
Note that α is called the level of significance. By setting α at 0.1, 0.05, or
0.01, we limit the probability of committing a Type I error.
Recall that we said hypothesis testing is a procedure that helps us decide whether the
observed difference between the sample value and the hypothesized population value is real
or due to chance. To test whether the observed difference between the data and what is
expected under the null hypothesis is real or due to chance variation, we use a test statistic.
In the procedure we compare the value obtained from the test statistic with the critical (or
table) value. We then reject the null hypothesis if the test statistic is statistically significant
at the chosen significance level α; otherwise the null hypothesis is not rejected. The latter
occurs when the test statistic is not statistically significant. A common example of such a
test statistic is sketched below.
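For instance, when testing H0: μ = μ0 about a population mean on the basis of a sample of size n with sample mean \bar{X} and sample standard deviation s, a commonly used test statistic is
t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}
which is compared with the critical value of the t distribution with n - 1 degrees of freedom at the chosen level α; H0 is rejected when the computed |t| exceeds the critical value.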
The P-value, which directly depends on a given sample, attempts to provide a measure of
the strength of the results of a test, in contrast to a simple reject or do not reject. If the
null hypothesis is true and the chance of random variation is the only reason for sample
differences, then the P-value is a quantitative measure to feed into the decision making
process as evidence. The following table provides a reasonable interpretation of P-values:
P-value            Interpretation
P < 0.01           very strong evidence against H0
0.01 < P < 0.05    moderate evidence against H0
0.05 < P < 0.10    suggestive evidence against H0
P > 0.10           little or no real evidence against H0
This interpretation is widely accepted, and many scientific journals routinely publish
papers using it for the results of hypothesis tests. In general, when a p-value is associated
with a set of data, it is a measure of the probability that the data could have arisen as a
random sample from some population described by the statistical (testing) model.
3.4 Statistical Estimation, Graphing and Analysis
In this section we explain the approaches employed in making graphical and
statistical estimations. The discussion first makes a brief presentation of the Excel program,
followed by the EViews and Stata programs.
3.4.1 Using Excel
Note that Excel can carry out several statistical analyses. The following explains the
steps required to compute various statistics. The discussion assumes that we have data in
an Excel spreadsheet. To perform the estimation we use the following steps.
1. On the Tools menu, click Data Analysis.
2. In the Data Analysis dialog box, click the name of the analysis tool you want to
use, then click OK.
3. In the dialog box for the tool you selected, set the analysis options you want.
You can use the Help button on the dialog box to get more information about the options.
Note that if Data Analysis is not available in the first place, we have to load the Analysis
ToolPak. This is performed using the following approach.
A. On the Tools menu, click Add-Ins.
B. In the Add-Ins available list, select the Analysis ToolPak box, and then
click OK.
The following presents some of the tools provided by Excel in statistical estimation.
Anova. The Anova analysis tools provide different types of variance analysis. The tool to
use depends on the number of factors and the number of samples you have from the
populations you want to test.
Anova: Single Factor. This tool performs a simple analysis of variance on data for two or
more samples. The analysis provides a test of the hypothesis that each sample is drawn
from the same underlying probability distribution against the alternative hypothesis that
underlying probability distributions are not the same for all samples.
Anova: Two-Factor With Replication. This analysis tool is useful when data can be
classified along two different dimensions and there are multiple observations for each
combination.
Anova: Two-Factor Without Replication. This analysis tool is useful when data are
classified along two different dimensions as in the With Replication case, but with a
single observation for each combination.
Correlation. The CORREL and PEARSON worksheet functions both calculate the
correlation coefficient between two measurement variables when measurements on each
variable are observed for each of N subjects. Note that any missing observation for any
subject causes that subject to be ignored in the analysis. The Correlation analysis tool is
particularly useful when there are more than two measurement variables for each of N
subjects. It provides an output table, a correlation matrix, showing the value of CORREL
(or PEARSON) applied to each possible pair of measurement variables.
Covariance. The Correlation and Covariance tools can both be used in the same setting,
when you have N different measurement variables observed on a set of individuals. The
Correlation and Covariance tools each give an output table, a matrix, showing the
correlation coefficient or covariance, respectively, between each pair of measurement
variables.
The Covariance tool computes the value of the worksheet function COVAR for each pair
of measurement variables. Note that direct use of COVAR rather than the Covariance
tool is a reasonable alternative when there are only two measurement variables. The
entry on the diagonal of the Covariance tool’s
output table in row i, column i is the covariance of the i-th measurement variable with
itself. This is just the population variance for that variable as calculated by the worksheet
function, VARP.
Descriptive Statistics. The Descriptive Statistics analysis tool generates a report of
univariate statistics for data in the input range, providing information about the central
tendency and variability of your data.
Histogram. The Histogram analysis tool calculates individual and cumulative frequencies
for a cell range of data and data bins. This tool generates data for the number of
occurrences of a value in a data set.
For example, in a class of 20 students, you could determine the distribution of scores in
letter-grade (A, B, ...) categories. A histogram table presents the letter-grade boundaries
and the number of scores between the lowest bound and the current bound. The single
most-frequent score is the mode of the data.
Rank and Percentile. The Rank and Percentile analysis tool produces a table that contains
the ordinal and percentage rank of each value in a data set. You can analyze the relative
standing of values in a data set. This tool uses the worksheet functions, RANK and
PERCENTRANK.
The above are some of the statistical computations that Excel can conduct. Note,
however, that Excel is not as user-friendly and appealing for this purpose as specialized
software such as EViews and Stata. Therefore, with this brief introduction we move on to
the discussion of statistical and graphical computation using EViews and Stata.
3.4.2 Using EViews
EViews very simply and attractively provides us with many types of graphical results.
Note, however, that to execute any kind of graph we should first have the relevant data in
the EViews spreadsheet. That is, in the workfile (untitled, or titled if we have already saved
the data), there have to be names of variables in addition to C and RESID.
There are alternative ways to draw graph using EViews. One of the options is to first
click Quick from the main menu and then from the drop down we select Graph. This will
display the list of graph types available in EViews. These are Line Graph, Bar Graph,
Scatter, XY line and Pie. This is clearly shown in the diagram below.
Selecting any of the various types of graph will give the corresponding results. As we
saw in the above diagram, the combo box in the dialog allows us to select a graph type.
To change the graph type, we simply select an entry from the combo box. Note that some
graph types are not available for different data types (for example, we cannot view a
scatter diagram for a single series). Furthermore, some views do not allow us to change
the graph type. In such cases, the Graph Type will display Special, and we will not have
access to the entries in the combo box.
A. Graph Type
The basic graph types are stated as follows:
Line Graph displays a plot of the series, with each value plotted vertically against
either an observation indicator or time.
Bar Graph displays the value of each series as the height of a bar.
Scatter Diagram displays a scatter with the first series on the horizontal axis and
the remaining series on the vertical axis, each with a different symbol.
XY Line Graph plots the first series on the horizontal axis and the remaining
series on the vertical axis, each connected as a line.
Pie Chart displays each observation as a pie with each series shown as a wedge in
a different color, where the width of the wedge is proportional to the percentage
contribution of the series to the sum of all series for that observation. Note that
series with negative values are dropped from the chart.
Note that the appearance of graphical views can be customized extensively. However,
changes to a graphical view will often be lost when the view is redrawn (including when
the object is closed and reopened, when the workfile sample is modified, or when the
data underlying the object are changed). Often one would like to preserve the current
view so that it does not change when the object changes. In EViews, this is referred to as
freezing the view. Freezing a graphical view creates a graph object. Thus, if we would
like to customize a view for presentation purposes, we should first freeze the view as a
graph object to ensure that our changes are not lost. To do this we click Freeze in the
graph menu.
Consider having quarterly data on GDP and M1 (money supply) for a given country for the
period 1952:1 up to 1996:4. We can plot these variables together in one plane as shown
below.
[Line graph of GDP and M1 against time, 1952:1 to 1996:4]
The following figure is based on annual data on the exports and imports of Ethiopia for the
period 1953 up to 1995 E.C. The figure clearly shows the relationship between the two
external trade variables.
[Line graph of EXPO and IMPO against time, 1953 to 1995 E.C.]
Note that the diagrams above are drawn using the line graph type. As can be seen from the
second diagram, the gap between imports and exports widens over time. In general, we
can observe the behavior of an individual variable, as well as its relationship with other
variables over time, using graphical presentation.
B. Graph Options
Note that for each graph type stated previously, there are a number of available options.
It is also possible to set the font used in labeling the figures and to add text. The steps
for doing so are explained as follows.
Setting Fonts
We can change the fonts used in the graph labels (the axes, legends, and added text) by
clicking the Fonts button. The Font button in the Graph Options dialog sets the default
font for all graph components. If we wish, we may then set individual fonts for axes,
legends and text.
Adding Text
Moreover, we can customize a graph by adding one or more lines of text anywhere on the
graph. This can be useful for labeling a particular observation or period, or for adding
titles or remarks to the graph. In a frozen graph object, we simply click on the AddText
button in the toolbar or select Procs and then click on Add text…. The Text Label dialog
will come up.
We then enter the text we want to display in the large edit field. Note that spacing and
capitalization (upper and lower case letters) will be preserved. If we want to enter more
than one line, we press ENTER after each line.
C. Statistical Computation using EViews
Based on its window-based approach, EViews computes several statistics. Very
simply, it estimates and reports descriptive statistics for a variable, including the
mean, median, standard deviation, skewness and the like. Moreover, EViews
computes the covariance and the correlation between variables.
To perform the above stated statistical estimations we have to open the workfile that
contains the names of the variables to be used in the analysis. The estimation is
conducted as follows.
Descriptive Statistics
To derive descriptive statistics for each variable we first click Quick and select Group
Statistics. Then we choose Descriptive Statistics, followed by either Individual Samples or
Common Sample. In this case, a Series List box will appear where we have to list the
name(s) of the variable(s) (series). Then we click OK and the result will be displayed. As
an illustration, consider the following descriptive statistics for the variables GDP,
M1 (money supply), price level (PR) and interest rate (RS) for a hypothetical
country.
GDP M1 PR RS
Mean 632.4190 445.0064 0.514106 5.412928
Median 374.3000 298.3990 0.383802 5.057500
Maximum 1948.225 1219.420 1.110511 15.08733
Minimum 87.87500 126.5370 0.197561 0.814333
Std. Dev. 564.2441 344.8315 0.303483 2.908939
Skewness 0.845880 0.997776 0.592712 0.986782
Kurtosis 2.345008 2.687096 1.829239 4.049883
Note that the above table provides descriptive statistical information for each variable.
From it, we learn that the average (mean) value of GDP for the period given
earlier is 632.41 dollars, while it is 445.00, 0.514 and 5.412 for the money supply, the price
level and the interest rate respectively. Similarly, the table presents the median, standard
deviation, skewness and other results for each variable.
Consider the following result based on annual economic data of Ethiopia collected for
the period 1953 to 1995 E.C. The variables used for the computation are GDP, INV
(investment), SAV (domestic saving), IMPO (import), and EXPO (export), measured in
millions of birr at current market prices.
[Descriptive statistics table for GDP, INV, SAV, IMPO and EXPO; 43 observations for each variable]
Note that the table provides descriptive statistical information for each variable.
From it, we learn that the mean value of GDP for the period is 17642.85 million
birr, while it is 2849.88 and 954.17 for investment and saving respectively. Similarly, the
table presents the standard deviation, skewness and other results for each variable.
Note that in addition to the above results, EViews computes several other statistics with
simple manipulation of its buttons. The following discussion explains how to compute
the variance and covariance between variables as well as the correlation between any two
variables.
Variance and Covariance
To compute the variances and covariances of variables we first click Quick from the
main menu and select Group Statistics. Then we choose Covariances. This will display a
Series List box in which we write the names of the variables to be used in the computation.
The result is a covariance matrix. Note that the values listed on the diagonal of the table
represent the variance of each variable, whereas the off-diagonal values describe the
covariance between two variables. The table below presents such results based on
hypothetical data.
Covariance Matrix
[Table of covariances for GDP, M1, PR and RS; the individual values are discussed below]
The table above presents the covariance matrix for the variables GDP, M1, PR and RS.
Note from the result that the values listed on the diagonal represent the variance of each
variable. For example, the variance of GDP is 316602.7
whereas it is 118248.2, 0.091590 and 8.414915 for M1, PR and RS variables
respectively.
The values listed off the diagonal show the covariance between two variables. For instance,
the covariance between GDP and M1 is 192558.9, and it is 0.362112 between PR and
RS. A similar approach is used to read the other covariance results in the table.
Note, however, that the values presented below the diagonal match those above the
diagonal. The same job can be repeated using annual data of Ethiopia on GDP, investment,
saving, import and export for the period 1953 to 1995 E.C.; the matrix below, however,
reports the correlations among these variables rather than their covariances (its diagonal
entries all equal one).
GDP INV SAV IMPO EXPO
GDP 1.000000 0.979769 0.366253 0.967962 0.979475
INV 0.979769 1.000000 0.296736 0.991192 0.983098
SAV 0.366253 0.296736 1.000000 0.193738 0.357570
IMPO 0.967962 0.991192 0.193738 1.000000 0.980851
EXPO 0.979475 0.983098 0.357570 0.980851 1.000000
Table 3.5 Correlation Matrix
The table presents the correlation matrix for the variables; it is discussed further under
Correlation below. Note that the values presented below the diagonal match those above
the diagonal.
Correlation
To compute the correlation between any two variables we first click Quick from the main
menu and select Group Statistics. Then we choose Correlations. This will display a Series
List box in which we write the names of the variables to be used in the correlation
computation. For example, consider the following correlation matrix obtained using
EViews for hypothetical data on GDP, M1, PR and RS.
GDP M1 PR RS
GDP 1.000000 0.995197 0.992475 0.333494
The values listed on the diagonal of the table have no relevant meaning, because each
represents the correlation of a variable with itself; that is why the result is 1 in all cases
along the diagonal. The results off the diagonal, however, give the correlation between the
variables GDP, M1, PR and RS. For example, the correlation between GDP and M1 (or
between M1 and GDP) equals 0.995. This suggests that there is a strong correlation
between the two variables. Similarly, the correlation between RS and M1 (or M1 and RS)
equals 0.27, indicating a weak correlation between the two variables.
We can repeat the same job using the annual data of Ethiopia on GDP, investment, saving,
import and export for the period 1953 to 1995 E.C.; the result is the matrix shown in
Table 3.5 above, which indicates the extent of the relationship between the economic
variables under study.
Note that the result is 1 across the board diagonally. The correlation between GDP and
INV equals 0.98, which suggests that there is a strong correlation between the two
variables. Similarly, the correlation between INV and SAV equals 0.29, indicating a
weak correlation between the two variables. Note also that there is a strong correlation
(0.98) between import and export.
EViews can also present the histogram and the descriptive statistics simultaneously
for each variable separately. Such a presentation helps us examine the various numerical
results together with a graphical presentation of the distribution of a variable. To construct
the histogram and statistical computations simultaneously, we first click Quick from the main
menu and select Series Statistics. Then we choose Histogram and Stats. This will display
the histogram and descriptive statistical results for the variable (or series) selected for the
purpose. For example, the figure below describes the distribution and statistical
computations for the variable GDP
[Histogram of GDP]
Series: GDP; Sample 1953 1995; Observations 43
Mean 17642.85; Median 10635.77; Maximum 54585.90; Minimum 2883.800;
Std. Dev. 16578.71; Skewness 1.177501; Kurtosis 2.961864;
Jarque-Bera 9.939243 (Probability 0.006946)
Notice from the histogram above that most of the GDP values are less than 20,000. The
diagram and the statistical results point out that the variable GDP is positively skewed. The
figure below is based on the variable investment (INV).
[Histogram of INV]
Series: INV; Sample 1953 1995; Observations 43
Mean 2849.879; Median 1394.000; Maximum 12093.00; Minimum 437.4000;
Std. Dev. 3148.441; Skewness 1.513733; Kurtosis 4.066041;
Jarque-Bera 18.45775 (Probability 0.000098)
Figure 3.5 Histogram and statistical results for INV.
Note from the above distribution that the variable investment is positively skewed.
Note also that the statistical results presented to the right of the histogram are what we can
obtain using the process stated earlier. That is, to derive descriptive statistics for the
variable INV we first click Quick and select Group Statistics; then we choose Descriptive
Statistics, followed by either Individual Samples or Common Sample. In this case, a Series
List box will appear where we write INV and click OK. This will display descriptive
results like the ones we obtained above.
Check Your Progress 3.3
1. State the steps required to compute the mean and standard deviation of a variable using
EViews
________________________________________________________________________
______________________________________________________________________
________________________________________________________________________
2. From table 3.1 describe the skewness of M1. What does the result imply?
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
3. Using table 3.4 above what is the covariance between
A) GDP and EXPO__________________________________________________
B) INV and IMPO___________________________________________________
4. Given table 3.5 above compute the correlation between
A) SAV and IMPO__________________________________________________
B) GDP and IMPO__________________________________________________
3.4.3 Using Stata
In this section, we will examine the various ways of conducting statistical estimations
using Stata. The discussion encompasses simple as well as comprehensive approaches.
To compute the confidence interval for a given variable, the following command is used:
ci varlist
Note that ci computes standard errors and confidence intervals for each of the variables in
the variable list. For example, consider survey data collected from a sample of 1100
respondents of Michigan University regarding their gender (gn), the department they
attended (Dept), the degree they earned (Deg) and their salary (Sal) after graduation.
To compute the confidence interval for salary, the command is ci Sal; we then press
Enter. Accordingly, the following result will be displayed.
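The output has roughly the following form (a sketch: the mean and standard error shown are implied by the interval's midpoint and half-width, and Stata's exact layout may differ slightly):
Variable |  Obs     Mean    Std. Err.    [95% Conf. Interval]
---------+---------------------------------------------------
     Sal | 1100   26064.2      210.1     25651.98    26476.43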
Note that the command produces the mean and standard error of the variable in addition
to the 95% confidence interval. From the result, we are 95% confident that a graduate
from the university earns, on average, between 25,651.98 and 26,476.43.
The correlate command with the covariance option (illustrated below) displays the
covariance matrix, where the diagonal values represent the variance of each variable
whereas the off-diagonal values give the covariance between two variables. Consider the
following data collected from 10 respondents about their monthly income (Ic), their
demand for a good in kg (X) and the market price of the good (Ps).
Ind. X Ps Ic
1 15 5.5 500
2 18 5.2 600
3 16 4.8 650
4 21 4.5 680
5 22 4.6 750
6 26 4.7 780
7 24 4.2 800
8 29 4 900
9 28 3.6 950
10 30 3.8 975
To compute the correlation between the variables, any of the following commands can be
used: correlate, or correlate X Ps Ic, or pwcorr. Using any of these commands, we
obtain the following result.
| X Ps Ic
-------------+----------------------------------
X| 1.0000
Ps | -0.8804 1.0000
Ic | 0.9579 -0.9594 1.0000
Note from the result that the correlation between X and Ps is -0.88, while it is -0.9594
between Ps and Ic. As in the EViews results, the diagonal values equal one since each
represents the correlation of a variable with itself.
On the other hand, to compute the covariances between the variables, we write the
following command:
correlate X Ps Ic, covariance
This displays the covariances of the variables as shown hereunder.
| X Ps Ic
---------- +---------------------------------------
X | 29.2111
Ps | -2.86778 .363222
Ic | 801.5 -89.5167 23966.9
Notice that the covariance between Ps and X is -2.867, whereas it is -89.5167 between
Ic and Ps and 801.5 between Ic and X. Note that the diagonal values represent the
variance of each variable.
Recall that a partial correlation measures the correlation between two variables while
holding other variables constant. In Stata the command required to compute this is
pcorr varname varlist
Note that pcorr displays the partial correlation coefficient of varname with each
variable in varlist, holding the other variables in varlist constant.
For example, to compute the partial correlation of X with Ps or Ic, we use the command
pcorr X Ps Ic. The result of this command is listed as follows.
Variable | Corr. Sig.
-------------+------------------------
Ps | 0.4772 0.194
Ic | 0.8467 0.004
Notice from the result that the partial correlation between X and Ps is 0.47, whereas it is
0.84, and significant at 1%, between X and Ic. It is significant at 1% because the
significance level (p = 0.004) is less than 1% (i.e. less than 0.01). Notice, however, that
the correlation between X and Ps is not significant even at 10% (i.e. 0.1).
Accordingly, to compute the partial correlation of Ps with X or Ic, we use the following
command. pcorr Ps X Ic. The result will be tabulated as follows.
Variable | Corr. Sig.
-------------+-----------------------------
X| 0.4772 0.194
Ic | -0.8526 0.003
The interpretation of the above result is left for the student as an exercise. However, note
that the result is partial in the sense that it holds true while the other variables are held
constant.
E. Skewness and kurtosis test for normality
Stata computes the skewness and kurtosis of a variable as a test of normality. The
appropriate command to execute this is given by:
sktest varlist
Note that for each variable in variable list, sktest presents a test for normality based on
skewness and another based on kurtosis and then combines the two tests into an
overall test statistic. Note also that sktest requires a minimum of 8 observations to make
its calculations. To compute a test of skewness and kurtosis for the variables X Ps and Ic
we make use of the following command. sktest X Ps Ic
Skewness/Kurtosis tests for Normality
------- joint ------
Variable | Pr(Skewness) Pr(Kurtosis) adj chi2(2) Prob>chi2
-------------+--------------------------------------------------------------------------------
X| 0.793 0.206 1.95 0.3763
Ps | 0.809 0.699 0.21 0.9010
Ic | 0.868 0.568 0.34 0.8421
Notice that the result cannot reject the normality hypothesis for any of the variables, as can
be observed from the probability values, which are all higher than 10% (i.e. 0.1).
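F. Spearman rank correlation
Stata also computes Spearman's rank correlation, which measures the association between the ranks of two variables, with the command
spearman varname1 varname2
For the demand data used earlier, the command spearman X Ps produces output along the following lines (a sketch consistent with the values discussed below):
Number of obs = 10
Spearman's rho = -0.8667
Test of Ho: X and Ps are independent
    Prob > |t| = 0.0012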
As can be seen from the above result, the correlation coefficient (Spearman's rho) between
X and Ps is -0.87. The result represents a strong correlation between the two variables.
Moreover, the hypothesis test rejects the null hypothesis (Ho) that X and Ps are
independent: the reported value 0.0012 is less than 0.01 (or 1%), so Ho is rejected even at
1%. Note that since the correlation coefficient (-0.87) is very strong, it indicates that the
two variables are highly dependent on one another, and therefore the rejection of Ho is an
expected result. The following result represents the correlation and the associated
hypothesis test for Ps and Ic.
Number of obs = 10
Spearman's rho = -0.9394
The result represents a very strong negative correlation of -0.94. Accordingly, we
reject Ho at 1%, indicating that the two variables are dependent on each other.
Consider the following hypothetical data on three variables.
Y 10 12 9 7 12 11 13 12
X 8 6 9 10 7 12 15 14
Z 6 7 10 6 8 10 14 16
Based on the above information, attempt the following questions.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
G. Summary statistics
The other advantage of Stata is that it produces summary statistics. The appropriate command to handle this task is presented as follows.
summarize varlist
This command (i.e. summarize) calculates and displays a variety of univariate (individual) summary statistics. Note that if no variable list is specified after the command, summary statistics are calculated for all the variables available in the Stata data set. The following result reports the summary statistics for our variables X, Ps and Ic (note that the command is summarize X Ps Ic).
Note that the result gives the mean, standard deviation, minimum and maximum of each variable. This helps to obtain basic information about each variable (i.e. X, Ps and Ic).
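Where more detail is needed, summarize accepts the detail option (standard Stata syntax, shown here as an illustrative sketch):

summarize X Ps Ic, detail

This adds percentiles, variance, skewness and kurtosis to the basic output.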
In Stata it is also possible to present tables of summary statistics. The kind and form of the tabular summary depends on the command written. That is, we can order Stata to produce a simple frequency table, or a cross tabulation that represents the frequency of a variable for given values of other variables, and so on. The following commands illustrate such cases using the variables X, Ps and Ic.
table X. This gives a frequency table for the variable X.
table X Ps Ic. This gives a frequency table in a cross-tabulation form between the variables.
table X, contents(n Ps). This develops a frequency-based relationship between the variables X and Ps.
table Ic, c(n Ps mean Ps sd Ps median Ps). This reports the frequency, mean, standard deviation and median of Ps for each value of Ic. The result below represents this case.
--------------------------------------------------------------
Ic | N(Ps) mean(Ps) med(Ps)
----------+--------------------------------------------------
500 | 1 5.5 5.5
600 | 1 5.2 5.2
650 | 1 4.8 4.8
680 | 1 4.5 4.5
750 | 1 4.6 4.6
780 | 1 4.7 4.7
800 | 1 4.2 4.2
900 | 1 4 4
950 | 1 3.6 3.6
975 | 1 3.8 3.8
--------------------------------------------------------------
The above result displays the mean and median of Ps for a given value of Ic. Note that since, in our hypothetical table, the frequency of Ps for a given value of Ic is one [i.e. N(Ps)=1], the mean value of Ps is equal to the actual hypothesized observation.
The other command that produces a tabular statistical summary is tabstat. Note that tabstat displays summary statistics for a series of numeric variables in a single table, possibly broken down on (conditioned by) another variable. The result to be obtained depends on the command written. For example, the command
tabstat X Ps Ic, stats(mean range)
displays the mean and range of each variable (i.e. X, Ps and Ic).
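The table below, which breaks X and Ps down by Ic, is consistent with the by() form of the command (an illustrative sketch, since the exact command is not preserved here):

tabstat X Ps, by(Ic) statistics(mean)

This reports the mean of X and Ps for each value of Ic, together with an overall total.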
Ic | X Ps
---------+--------------------------
500 | 15 5.5
600 | 18 5.2
650 | 16 4.8
680 | 21 4.5
750 | 22 4.6
780 | 26 4.7
800 | 24 4.2
900 | 29 4
950 | 28 3.6
975 | 30 3.8
---------+---------------------------
Total | 22.9 4.49
--------------------------------------
Note that the above table shows the mean of X and Ps for a given value of Ic. For instance, given Ic = 800, the mean values of X and Ps are 24 and 4.2 respectively.
Stata can also produce one- and two-way tables of summary statistics. The result provides the mean, standard deviation and frequency of a variable for a given value of another variable. The command to execute this job is given as follows.
tabulate X, summarize(Ps)
This provides a one-way table of summary statistics for Ps for each value of the variable X. The result below provides a table of summary statistics for the variable Ic given the values of Ps. Note that the command required to do this is tabulate Ps, summarize(Ic)
| Summary of Ic
Ps | Mean Std. Dev. Freq.
------------+-----------------------------------------------
3.6 | 950 0 1
3.8 | 975 0 1
4 | 900 0 1
4.2 | 800 0 1
4.5 | 680 0 1
4.6 | 750 0 1
4.7 | 780 0 1
4.8 | 650 0 1
5.2 | 600 0 1
5.5 | 500 0 1
------------+-------------------------------------------------
Total | 758.5 154.81261 10
As you can see, Stata provides the mean, standard deviation and frequency of Ic given the value of Ps. For instance, at Ps = 4.2, the mean value of Ic is given by 800 birr.
I. Graphical Analysis Using Stata
Stata provides a number of graphing options. These include line graphs, bar graphs, pie charts and a number of others. The following discussion illustrates some of the commands and the corresponding outcomes.
Stata provides a bar graph based on the actual value or the mean (average) value of a variable. The appropriate command to construct such a graph from the actual values is given by
graph INV SAV IMPO EXPO Gcon Pcon, bar
This command displays a bar graph for the variables stated in the command. On the other hand, the following command draws a bar chart for the variables using the mean value of each variable:
graph INV SAV IMPO EXPO Gcon Pcon, bar means
Note that the bar chart diagram below is constructed using the first command. It displays the bar graph for the variables investment (INV), domestic savings (SAV), import (IMPO), export (EXPO), government consumption (Gcon) and private consumption (Pcon).
[Figure: bar graph of INV, Sav, Impo, Expo, Gcon and Pcon]
The other graph that can be drawn in Stata is the histogram. Note that a histogram is the default for graph with one variable. Consider the graph presented below. It represents the distribution of GDP together with a histogram of GDP using 20 bins. Note that the command to generate the graph is given by: graph GDP, bin(20) norm
[Figure: histogram of GDP with 20 bins and an overlaid normal curve; y-axis: Fraction]
Notice from the figure above that there are 20 bars, since we wrote 20 in the command. As the distribution curve shows, the variable GDP is slightly positively skewed. When domestic saving (SAV) is used in the analysis, we obtain an almost normally distributed curve. Recall that by normal distribution we mean a result where the mean, mode and median are equal, and the skewness is zero. The command used to obtain the result below is: graph SAV, bin(10) norm
[Figure: histogram of Sav with 10 bins and an overlaid normal curve; y-axis: Fraction]
Note that, in addition to this, Stata can also produce a categorical-variable histogram. The result is quite similar to that of the command above; however, this one is intended for use with integer-coded categorical variables. In this case, the x-axis is automatically labeled and those labels are centered below the corresponding bar.
The other type of graph is the pie chart. The relevant command takes a form such as the following (the exact command is not preserved in the source, but it follows the same graph syntax used above):
graph INV Impo Gcon, pie
This command produces a pie chart containing the variables investment, import and government consumption for the period discussed earlier. In developing a pie chart, we can specify up to 16 variables, and Stata will place up to 64 pie charts in a single image. The following pie chart is the result of the command specified above.
[Figure: pie chart — INV 17%, Impo 56%, Gcon 27%]
Note from the pie chart that 56% represents the share of import in the total of the three variables, while the shares are 27% and 17% for government consumption and investment respectively.
1. Write the command that displays individual summary statistics for the variables A1 and B2 ___________________________
2. Write the command that gives a frequency-based mean and median of A1 for a given value of C3 __________________________________
3. Write the command that draws a bar chart for the variables A1, B2 and C3 using the mean value of each variable ____________________________________
4. Write the command that draws the distribution of B1 together with its histogram using 25 bins _____________________________________________
3.5 Summary
The other important statistical concept is hypothesis testing. A hypothesis is some testable belief or opinion. It is a statement about the population developed for the purpose of testing; a statistical hypothesis is a statement about the values of parameters in the population. In hypothesis testing, the most common approach is to establish a set of two mutually exclusive and exhaustive hypotheses about the true value of the parameter under study. A sample is then used to assess the hypotheses.
Excel can carry out several statistical analyses. The steps required to handle statistical estimation using Excel are:
1. On the Tools menu, click Data Analysis.
2. In the Data Analysis dialog box, click the name of the analysis tool you want to use, then click OK.
3. In the dialog box for the tool you selected, set the analysis options you want.
EViews very simply and attractively provides us with many types of graphical results.
Note however, that to execute any kind of graph we should first have the relevant data in
EViews spreadsheet.
Based on its window-based approach, EViews estimates several statistical quantities. Very simply, it estimates and reports the descriptive statistics of a variable, including the mean, median, standard deviation, skewness and the like. Moreover, EViews
computes the covariance and the correlation between the variables. To perform the above
stated statistical estimations we have to open the workfile that contains the names of the
variables to be used in the analysis.
Using Stata we can construct a wide range of statistical and graphical results. This
includes constructing confidence interval, correlation and covariance, summary statistics
and the like.
3.6 Answers to Check Your Progress
Answer to Check Your Progress 3.1
Year GDP INV Sav Expo Impo
1981 15742.1 2269.23 1399.77 1422.8 2292.26
1982 16825.7 2100.49 1335.22 1295.04 2060.31
1983 19195.3 1996.38 660.39 1062.21 2398.2
1984 20792 1911.1 625.2 937.5 2223.4
1985 26671.4 3792.1 1494.1 2222.5 4520.5
1986 28328.9 4293.7 1426.2 3223 6090.5
1987 33885 5569 2517.1 4898.1 7950
1988 37937.6 6404.4 2652.6 4969.7 8721.5
1989 41465.1 7049.1 3195 6730.6 10584.7
1990 44840.3 7690.6 3466.3 7116.9 11341.2
1991 48803.2 8268.1 1044.6 6878 14101.5
1992 53189.7 8431.8 480.1 8017.6 15969.3
1993 54210.7 9646 1433.9 7981.5 16193.6
1994 51760.6 10613.5 931.4 8027.4 17709.5
1995 54585.9 12093 -1145.3 8319.3 21557.6
4.0 Objectives
4.1 Introduction
4.2 Concepts in Econometrics Analysis
4.3 Regression Estimation and Analysis Using EViews.
4.4 Regression Estimation and Analysis Using Stata.
4.5 Summary
4.6 Answers to Check Your Progress
4.7 Model Examination
4.0 Objectives
The aim of this unit is to explain the approaches used in computer based econometric
estimation and analysis.
After studying this, you will be able to:
Define the term econometrics
Explain the importance of studying econometrics and the associated concepts.
Understand the various ways of performing estimation using EViews and Stata.
Interpret the results obtained from the estimation.
4.1 Introduction
Economic theory makes statements or hypotheses that are mostly qualitative in nature. For example, the theory of demand states that, other things remaining the same, the quantity demanded of a commodity falls as its price rises. However, the theory itself does not provide any numerical measure of the relationship between the two; that is, it does not tell by how much the quantity will go up or down as a result of a certain change in the price of the commodity. It is the job of the econometrician to provide such numerical statements. Similarly, the main concern of mathematical economics is to express economic theory in mathematical form without
regard to measurability or empirical verification of the theory. Both economic theory and
mathematical economics state the same relationships. Economic theory uses verbal
exposition but mathematical economics employs mathematical symbolism. Neither of
them allows for random elements, which might affect the relationship and make it
stochastic. Furthermore, they do not provide numerical values for the coefficients of the relationships.
On the other hand, economic statistics is mainly concerned with collecting, processing, and presenting economic data in the form of charts and tables. It is mainly a descriptive
aspect of economics. It does not provide explanations of the development of the various
variables and it does not provide measurement of the parameters of economic
relationships. Nevertheless, econometrics is an amalgam of economic theory,
mathematical economics, economic statistics, and mathematical statistics. Yet, it is a
subject that deserves to be studied in its own right for the above-mentioned reasons.
An econometric study generally proceeds along the following steps.
Step I: The first step is an attempt to identify the relationship between variables and express the relationship in mathematical form. This is called the specification of the model. It involves the determination of the dependent and independent variables. For example, given Y = f(X1, X2, X3, ..., Xn), the variable Y whose behavior is to be explained is referred to as the dependent variable, while the variables X1, X2, X3, ..., Xn that influence the dependent variable Y are referred to as explanatory or independent variables.
In economic analysis, the choice of independent variables might come from economic
theory, past experience, other related studies or from intuitive judgment.
Step II. After identifying the dependent and the explanatory variables, the second step is
specifying the mathematical form of the model. Note that economic theory may or may
not indicate the mathematical form of the relationship and the number of equations to be
included in the model.
For example, the theory of demand does not specify whether the demand function will take a linear or non-linear form. Similarly, the theory of demand doesn't specify the number of equations to be included in the demand function. Thus, it is the researcher who is responsible for dealing with such issues. Note also that the determination of the a priori theoretical expectations about the sign and size of the parameters is also part of formulating or specifying the model.
Step III. This step refers to specifying the econometric model. It is based on economic theory and on any available information relating to the phenomenon being studied. Since most relationships among economic variables are inexact, this step reflects the issue by incorporating a disturbance term, U. This U captures the influence of any other variables that are not included in the model.
Note that U denotes the random error term, which represents all those forces/factors affecting the dependent variable but not explicitly introduced in the model. This error term distinguishes an econometric model from a mathematical model.
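As a simple illustration (a hypothetical consumption function, not an example taken from this unit), the two versions of the same relationship differ only in the disturbance term:

Mathematical model: Ci = β0 + β1Yi
Econometric model: Ci = β0 + β1Yi + Ui

where C denotes consumption, Y income and U the random disturbance.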
As we have said earlier, since economic theory does not explicitly state the number of equations to be included in the function (a single or simultaneous equation model), the researcher must decide the number of equations to be included in the model. In general, the number of equations depends on the complexity of the phenomenon being studied, the purpose of estimating the model and the availability of data.
Step IV. This step involves determining the numerical estimates of the coefficients of the model. Estimation of the coefficients of the model includes (i)
gathering of data on the variables included in the model and (ii) selecting the appropriate
econometric technique for the estimation of the function.
The coefficients of economic relationships may be estimated by single equation methods or simultaneous equations methods. Note that in this material we will focus on least squares methods.
Step V. Evaluation of Estimates. This refers to checking the reliability of the estimated results. The evaluation of the results includes deciding whether the estimates of the parameters are theoretically meaningful and statistically satisfactory. To check whether the estimates of the parameters are meaningful, we make use of economic criteria, statistical criteria and econometric criteria. Economic criteria are determined by economic theory and refer to the sign and size of the parameters of economic relationships. Statistical criteria reflect statistical theory and aim at evaluating the statistical reliability of the estimates of the parameters of the model. The most commonly used statistical criteria are the correlation coefficient and the standard deviation (error) of the estimates. The square of the correlation coefficient shows the percentage of the total variation of the dependent variable explained by changes in the explanatory variables. On the other hand, the standard error of an estimate is a measure of the dispersion of the estimate around the true parameter. The larger the standard errors of the parameters, the less reliable are the estimates.
The third one is econometric criteria. These are set by the theory of econometrics and aim at investigating whether the assumptions of the econometric method are satisfied. They help us check whether the estimates have the desirable properties of unbiasedness, consistency, efficiency, sufficiency and the like. If the assumptions are not satisfied, then the estimates of the parameters will not possess some of the desirable properties and become unreliable for determining the significance of the estimates. Note, therefore, that before accepting or rejecting the estimates, the researcher must use all the above criteria.
After we have formulated the model, we may want to perform hypothesis testing to find out whether the estimates obtained are in accordance with the expectations of the theory being tested. That is, we may want to find out whether the estimated model makes economic sense and conforms to economic theory. To this end we develop the necessary tools to test hypotheses suggested by economic theory and/or prior empirical experience. The confirmation or refutation of economic theories on the basis of sample evidence is known as hypothesis testing.
As you know the objective of any econometric research is to obtain good numerical
estimates of the coefficients of economic relationships and to use them for the prediction
of the values of economic variables. Before using the estimated model for forecasting the
value of the dependent variable, we must assess the predictive power of the model.
Note that if the chosen model confirms the theory, then we may use it to predict (forecast) the future value(s) of the dependent variable on the basis of known or expected future value(s) of the explanatory variables. A model may be economically, statistically and econometrically correct for the sample period over which it has been estimated and yet not be useful for forecasting. In this stage, we investigate the stability of the estimates and their sensitivity to changes in the size of the sample. Therefore, we have to check whether the estimated function performs well outside the sample data.
From the foregoing analysis we learn that a successful econometric analysis should make use of all the above stated steps. In our case the regression estimation will be based on the Ordinary Least Squares (OLS) method. Therefore, the next subsection gives a brief discussion of what OLS is before we examine how to compute it using EViews and Stata.
Regression analysis is concerned with describing and evaluating the relationship between
a given variable (dependent variable) and one or more other variables (explanatory
variables).
In general, given Y = f(X1, X2, ..., Xk), if we assume that there is a linear relationship, then we obtain

Yi = β0 + β1X1i + β2X2i + ... + βkXki + Ui

where β0 denotes the intercept and β1, β2, ..., βk represent the partial slope coefficients of the regression equation. However, the simplest form of the multiple linear regression model (i.e. a model with two explanatory variables) is given by:

Yi = β0 + β1X1i + β2X2i + Ui
Taking the expected value of the above model, we obtain:

E(Yi | X1i, X2i) = β0 + β1X1i + β2X2i

where E(Yi | X1i, X2i) represents the conditional mean of Yi given fixed values of X1i and X2i;
β0 is the average value of Yi when X1i = X2i = 0;
β1 = ∂E(Yi)/∂X1i, keeping X2i constant, represents the change in the mean value of Yi per unit change in X1i; and similarly
β2 = ∂E(Yi)/∂X2i, keeping X1i constant, represents the change in the mean value of Yi per unit change in X2i.

After specifying the model, the next step is to estimate the population parameters β0, β1 and β2 using sample observations on Yi, X1i and X2i. The population regression function is

Yi = β0 + β1X1i + β2X2i + Ui

and the counterpart sample regression function is given as:

Ŷi = β̂0 + β̂1X1i + β̂2X2i
Using the Ordinary Least Squares method of estimation, we obtain the following results for β̂0, β̂1 and β̂2:

β̂0 = Ȳ − β̂1X̄1 − β̂2X̄2

β̂1 = [ (Σx1iyi)(Σx2i²) − (Σx2iyi)(Σx1ix2i) ] / [ (Σx1i²)(Σx2i²) − (Σx1ix2i)² ]

β̂2 = [ (Σx2iyi)(Σx1i²) − (Σx1iyi)(Σx1ix2i) ] / [ (Σx1i²)(Σx2i²) − (Σx1ix2i)² ]
In addition to the partial regression coefficients, the variance of each is important for a wide range of purposes, such as confidence intervals, hypothesis testing and the like. The variances of β̂0, β̂1 and β̂2 are given by the following formulas:

Var(β̂0) = σ̂u² [ 1/n + ( X̄1²Σx2i² + X̄2²Σx1i² − 2X̄1X̄2Σx1ix2i ) / ( Σx1i²Σx2i² − (Σx1ix2i)² ) ]

Var(β̂1) = σ̂u² Σx2i² / ( Σx1i²Σx2i² − (Σx1ix2i)² )

Var(β̂2) = σ̂u² Σx1i² / ( Σx1i²Σx2i² − (Σx1ix2i)² )

where x1i = X1i − X̄1, x2i = X2i − X̄2, and σ̂u² = ΣÛi² / (n − k). Here k = 3, since we are dealing with three parameters: β̂0, β̂1 and β̂2.
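In practice these quantities need not be computed by hand. A minimal Stata sketch (assuming a data set with variables Y, X1 and X2 is in memory):

regress Y X1 X2
display e(rmse)^2    // the estimate of the error variance, RSS/(n-k)
matrix list e(V)     // variance-covariance matrix of the estimates

Here e(rmse) is the root mean squared error stored by regress, and the diagonal elements of e(V) are the variances of the estimated coefficients.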
Note that if there are more than two explanatory variables, the formulas for the coefficients and their corresponding variances become more complicated and very hard to compute by hand.
Given Y = f(X1, X2), the coefficient of multiple determination, R²y.X1X2, is the square of the multiple correlation coefficient. It is denoted by R² with subscripts for the variables whose relationship is being studied. The coefficient of multiple determination in the case of two explanatory variables X1 and X2 shows the percentage of the total variation of Y explained by the regression plane, i.e. by the changes in X1 and X2. In a multiple regression, R² measures the proportion of the variation in Y explained by the variables X1 and X2 jointly.
R²y.X1X2 = Σŷi² / Σyi² = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)² = 1 − ΣÛi² / Σyi² = 1 − RSS/TSS

where: RSS – residual sum of squares
TSS – total sum of squares

Using the estimated β̂'s, the coefficient of determination can be presented as follows:

R²y.X1X2 = ( β̂1Σx1iyi + β̂2Σx2iyi ) / Σyi²
Note that the value of R² lies between 0 and 1. The higher the R², the greater the percentage of the variation of Y explained by the regression plane, that is, the better the goodness of fit of the regression plane to the sample observations. The closer R² is to zero, the worse the fit.
The Adjusted R²
When new variables are introduced into the model, the coefficient of determination R² always increases, even if the variable added is not important to the model. To correct this defect, we adjust the coefficient of multiple determination by taking into account the degrees of freedom, which clearly decrease as new regressors are introduced into the function. Therefore, the adjusted R-square is given by

R̄² = 1 − [ ΣÛi² / (n − k) ] / [ Σyi² / (n − 1) ], or

R̄² = 1 − (1 − R²)(n − 1)/(n − k)

where k = the number of parameters in the model (including the intercept term)
n = the number of sample observations
R² = the unadjusted multiple coefficient of determination

Note that, as the number of explanatory variables increases, the adjusted R² becomes increasingly less than the unadjusted R². The adjusted R² (R̄²) can be negative, although R² is necessarily non-negative; in that case its value is taken as zero. If n is large, R̄² and R² will not differ much. But with small samples, if the number of regressors (X's) is large in relation to the sample observations, R̄² will be much smaller than R².
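The relationship between the two measures is easy to verify from the results that Stata's regress command stores (an illustrative sketch, assuming a regression has just been run):

display e(r2_a)
display 1 - (1 - e(r2))*(e(N) - 1)/(e(N) - e(df_m) - 1)    // reproduces e(r2_a)

since the number of estimated parameters k equals e(df_m) + 1 (the slope coefficients plus the intercept).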
E. Test of Significance of Parameters
I. Testing the Significance of Individual Parameters
To test the null hypothesis H0: βi = 0 against the alternative H1: βi ≠ 0, we compute the ratio

t = β̂i / S(β̂i)

This is the observed (or sample) value of the t ratio, which we compare with the theoretical value of t obtainable from the t-table with n − k degrees of freedom. The theoretical values of t (at the chosen level of significance) are the critical values that define the critical region in a two-tail test with n − k degrees of freedom.
If the computed t value exceeds the critical t value at the chosen level of significance, we may reject the null hypothesis; otherwise, we may accept it (i.e. βi is not significant at the chosen level of significance, and hence the corresponding regressor does not appear to contribute to the explanation of the variations in Y). Consider the figure below, which represents the t distribution and the critical values for a two-tailed test.
[Figure: t distribution showing the acceptance region in the middle and the rejection regions beyond the critical points in each tail]
II. Testing the Overall Significance of a Regression
This test aims at finding out whether the explanatory variables (X1, X2, ..., Xk) do actually have any significant influence on the dependent variable. Consider the following general regression model.
Yi = β0 + β1X1i + β2X2i + β3X3i + ... + βkXki + Ui
The test of the overall significance of the regression implies the following hypothesis:
H0 : β1 = β2 = β3 = ... = βk = 0
If the null hypothesis is true, then there is no linear relationship between the dependent variable and the explanatory variables.
To test the above stated hypothesis we use the following test statistic:
ESS
k 1
F= ~ F (k-1, n-k)
RSS
nk
where ESS represents the explained sum of squares and RSS represents the residual (unexplained) sum of squares. Accordingly, the decision rule is:
Compare the computed F value with the critical (table) value at the chosen level of significance, with (k − 1) numerator and (n − k) denominator degrees of freedom, obtained from the F-distribution table. The decision is based on the following procedure:
If the computed F value is greater than the critical value, reject the null hypothesis and accept that the regression is significant, i.e. that not all coefficients are zero. On the other hand, if the computed F value is less than the critical value obtained from the F-distribution table, then accept the null hypothesis, i.e. accept that the regression is not significant.
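In Stata the same joint test is available after regress through the test command (a sketch with hypothetical variable names):

regress Y X1 X2 X3
test X1 X2 X3    // H0: all three slope coefficients are zero

test reports the F statistic and its p-value; a p-value below the chosen significance level leads to rejection of H0.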
In a test of significance, we may reject the null hypothesis and come up with a significant result. Note, however, that rejection of the null hypothesis does not mean that our estimate β̂i is the correct estimate of the true population parameter βi. It means that our estimate comes from a sample drawn from a population whose parameter is different from zero. In order to know how close the estimate is to the true population parameter, we must construct confidence intervals for the true parameter. In confidence interval estimation, we establish limiting values around the estimate within which the true parameter is expected to lie with a certain 'degree of confidence'. We first select a significance level, denoted by α, so that the confidence level is 1 − α. For instance, if the confidence level is 90%, then in repeated sampling the confidence interval computed from the sample would include the true parameter in 90 times out of 100; in the remaining 10 times, the population parameter will fall outside the confidence interval. The following explains the method of constructing a confidence interval from the t-distribution.
Note that the t-distribution is used when the population is normal, the sample size is small and the population variance is unknown. In this case, the test statistic for testing hypotheses is given as:

t = ( β̂i − βi ) / S(β̂i), with (n − k) degrees of freedom

Given the significance level α, the probability that the observed t value lies between −tα/2 and tα/2 with (n − k) degrees of freedom is 1 − α, which yields the confidence interval:

β̂i − tα/2 · S(β̂i) ≤ βi ≤ β̂i + tα/2 · S(β̂i)
The foregoing brief discussion informs the reader what regression estimation means and introduces the various concepts related to it. More importantly, the above discussion shows how time-consuming and unfriendly it would be to compute the various quantities manually. Note that the formulas and the computation process become very long as the number of explanatory variables in the regression model increases. The interesting part is that, with the aid of EViews and Stata, it becomes very easy to compute the various quantities even with many explanatory variables.
_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________
3. Define the concept of R2 and show its relationship with the adjusted R2
_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________
I. Equation Objects
Single equation regression estimation in EViews is performed using the equation object.
To create an equation object in EViews we can use one of the following alternatives
From the main menu select Objects and then choose New Object and select
Equation
From the main menu click Quick and then select Estimate Equation. Or
Simply type the keyword equation in the command window and press Enter
Any of these alternatives creates the Equation Estimation dialog box as shown in the diagram below.
As we have said earlier, when we create an equation object, the Equation Estimation
dialog box appears. In that box we need to specify three things. These are: the equation
specification, the estimation method, and the sample to be used in estimation.
a) Equation Specification Box
In the upper edit box, we observe the equation specification box. It is used to specify the
equation. This refers to identifying the dependent (left-hand side) variable and the
independent (right-hand side) variables. Moreover we have to determine the functional
form (i.e. linear or non-linear). Note that there are two basic ways of specifying an
equation. These are the list and the formula approaches. The listing method is easier but
may only be used with unrestricted linear specifications whereas the formula method is
more general and can be used to specify nonlinear models or models with parametric
restrictions.
The simplest way to specify a linear equation is to provide a list of the variables you wish to use in the equation. In this approach we first include the name of the dependent variable or expression, followed by the names of the explanatory variables. For example, consider the following demand function: D = f(P, I). Now suppose the objective is to specify a linear demand function, with D regressed on a constant, own price P, and consumer income I, as follows: Di = β0 + β1Pi + β2Ii + Ui
In this case, the list method requires typing the following in the upper field of the Equation Estimation dialog box:
D c P I
Note that each variable is separated by a space and the presence of the series name C in
the list of regressors. This (i.e. c) is a built-in EViews series that is used to specify a
constant in a regression. Note that EViews does not automatically include a constant in a
regression so you must explicitly list the constant as a regressor.
You may have noticed from our previous chapter discussion that there is a pre-defined object C in your workfile. This is the default coefficient vector when you specify an equation by listing variable names. Note that EViews stores the estimated coefficients in this vector, in the order of appearance in the list.
The formula approach is used when the list method is not general enough for our specification. Many, but not all, estimation methods allow you to specify your equation using a formula.
Note that when you specify an equation by list, EViews converts this into an equivalent
equation formula. For example, suppose our regression model is given by the following
log-log model
logDi = logβ0 + β1logPi + β2logIi + logUi
In this case the list method is given by:
logD c logP logI
The two most common motivations for specifying your equation by formula are to
estimate restricted and nonlinear models.
Note that to estimate a nonlinear model, simply enter the nonlinear formula. EViews will
automatically detect the nonlinearity and estimate the model using nonlinear least
squares.
b) Estimation Methods
The second box in the Equation Estimation dialog box refers to the estimation method. Having specified our equation, we now need to choose an estimation method. To select the required method of estimation we simply click on the Method box, and we will see a drop-down menu listing the estimation methods. From the various alternatives we select LS – Least Squares, since standard single-equation regression is performed using this method of estimation.
Note that equations estimated by ordinary least squares, two-stage least squares, GMM,
and ARCH can be specified with a formula. But nonlinear equations are not allowed with
binary, ordered, censored, and count models, or in equations with ARMA terms.
c) Estimation Sample
After identifying the method of estimation, we should also specify the sample to be used
in estimation. EViews will fill out this dialog with the current workfile sample, but we
can change the sample for purposes of estimation by entering our sample string or object
in the box. Note that changing the estimation sample does not affect the current workfile
sample.
If any of the series used in estimation contain missing data, EViews will temporarily
adjust the estimation sample of observations to exclude those observations. EViews
notifies you that it has adjusted the sample by reporting the actual sample used in the
estimation results. The diagram below presents the regression result using a hypothetical
regression model.
Figure 4.3 Regression result
At this point, we are interested in the top of the equation output view. Notice that EViews tells you the dependent variable, the method, the sample and the like. For example, the above result uses 1824 observations.
Note that some operations, most notably estimation with moving average terms and autoregressive conditional heteroscedasticity, do not allow missing observations in the middle of the sample. When executing these procedures, an error message is displayed and execution is halted if an NA is encountered in the middle of the sample. EViews handles missing data at the very start or the very end of the sample range by adjusting the sample endpoints and proceeding with the estimation procedure.
2. Explain the two equation specification methods
_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________
Example: Suppose a researcher specifies the model GDP = f(INV, SAV, EXPO, POP), where GDP = gross domestic product, INV = investment, SAV = domestic saving, EXPO = export and POP = population (a proxy for labor force). Suppose the researcher uses data for Ethiopia for the period 1953 to 1995 E.C.
According to the list method, we write the following in the equation estimation box and click OK or press Enter:
GDP C INV SAV EXPO POP
Included observations: 43
Variable Coefficient Std. Error t-Statistic Prob.
C -10299.51 1181.469 -8.717549 0.0000
INV 0.056541 0.479796 0.117844 0.9068
SAV -0.374671 0.334872 -1.118848 0.2702
EXPO 3.777520 0.503058 7.509122 0.0000
POP 496.1977 42.92215 11.56041 0.0000
R-squared 0.993417 Mean dependent var 17642.85
Adjusted R-squared 0.992724 S.D. dependent var 16578.71
S.E. of regression 1414.109 Akaike info criterion 17.45533
Sum squared resid 75988722 Schwarz criterion 17.66012
Log likelihood -370.2896 F-statistic 1433.695
Durbin-Watson stat 1.000662 Prob(F-statistic) 0.000000
The following discussion summarizes the results obtained from EViews equation box
using the above result.
Regression Coefficients
On the result above the column labeled “Coefficient” depicts the estimated coefficients.
Recall that the least squares regression coefficients βi are computed by the standard OLS
formula
Note that if our equation is specified by list, the coefficients will be labeled in the
“Variable” column with the name of the corresponding regressor. But if the equation is
specified by formula, EViews lists the actual coefficients, C(1), C(2), etc.
For the simple linear models considered here, the coefficient measures the marginal
contribution of the independent variable to the dependent variable, holding all other
variables fixed. If present, the coefficient of the C is the constant or intercept in the
regression. It is the base level of the prediction when all of the other independent
variables are zero. The other coefficients are interpreted as the slope of the relation between the corresponding independent variable and the dependent variable (GDP in this case), assuming all other variables do not change.
For the above model the estimated GDP equation is given as
(Estimated) GDP = -10299.51 + 0.056INV - 0.374SAV + 3.777EXPO + 496.19POP
The partial regression coefficients β̂i are interpreted as follows:
β̂1 = 0.056 => other things being equal, an increase in investment by one unit increases GDP by 0.056 units
β̂2 = -0.374 => other things being equal, an increase in domestic saving by one unit decreases GDP by 0.374 units
β̂3 = 3.777 => other things being equal, an increase in export by one unit increases GDP by 3.777 units
β̂4 = 496.19 => other things being equal, an increase in labor force (population) by one unit increases GDP by 496.19 units
Standard Errors
The “Std. Error” column reports the estimated standard errors of the coefficient estimates.
The standard errors measure the statistical reliability of the coefficient estimates. Note
that the larger the standard errors, the more statistical noise in the estimates. If the errors
are normally distributed, there are about 2 chances in 3 that the true regression
coefficient lies within one standard error of the reported coefficient, and 95 chances out
of 100 that it lies within two standard errors.
Note that the standard errors of the estimated coefficients are the square roots of the diagonal elements of the coefficient covariance matrix. You can view the whole covariance matrix by choosing View/Covariance Matrix.
t-Statistics
The t-statistic, which is computed as the ratio of an estimated coefficient to its standard
error, is used to test the hypothesis that a coefficient is equal to zero. That is the
computed t value helps to test the significance of the parameter estimates individually. In
manual operation we would compare the computed t value with the critical (table) value. EViews, however, reports a probability instead: to interpret the t-statistic, we examine the probability of observing the t-statistic given that the coefficient is equal to zero. This probability computation is described below.
Probability
As you can see in the table the last column of the output shows the probability of drawing
a t-statistic as extreme as the one actually observed, under the assumption that the errors
are normally distributed, or that the estimated coefficients are asymptotically normally
distributed.
This probability is also known as the p-value or the marginal significance level. Given a
p-value, you can tell at a glance if you reject or accept the hypothesis that the true
coefficient is zero against a two-sided alternative that it differs from zero. For example, if you are performing the test at the 5% (or 0.05) significance level, a p-value lower than 0.05 is taken as evidence to reject the null hypothesis of a zero coefficient.
For the above tabulated result, for example, the hypothesis that the coefficient on INV or SAV is zero individually is not rejected at the 1%, 5% or 10% significance level. But the hypothesis that the coefficient on EXPO or POP is zero individually is rejected at the 1%, 5% and 10% levels. Note that the p-values are computed from a t-distribution with T − k degrees of freedom.
Summary Statistics
Notice from the regression result that the lower half of the result table provides a number
of summary statistics that are vital to measure the adequacy of the model from various
points of view. It is briefly explained as follows.
R-squared
The R-squared statistic measures the success of the regression in predicting the values of
the dependent variable within the sample. It is the fraction of the variance of the
dependent variable explained by the independent variables. The statistic will equal one if
the regression fits perfectly, and zero if it fits no better than the simple mean of the
dependent variable. Note that it can be negative if the regression does not have an
intercept or constant, or if the estimation method is two-stage least squares. In our
regression example, the R2 equals to 0.99, which represents a very good fit.
Adjusted R-squared
One problem with using R² as a measure of goodness of fit is that R² will never decrease as you add more regressors. In the extreme case, you can always obtain an R² of one if you include as many independent regressors as there are sample observations.
The adjusted R² penalizes the R² for the addition of regressors that do not contribute to the explanatory power of the model. EViews reports the adjusted R² just below the R².
Log Likelihood
EViews reports the value of the log likelihood function (assuming normally distributed
errors). It is evaluated at the estimated values of the coefficients. Likelihood ratio tests
may be conducted by looking at the difference between the log likelihood values of the
restricted and unrestricted versions of an equation.
Durbin-Watson Statistic
The Durbin-Watson statistic measures the serial correlation in the residuals. This concept and the interpretation of the result will be discussed in the next chapter.
The mean and standard deviation of the dependent variable (GDP in the above example) are computed using the standard formulas:

Ȳ = ΣYi / n and s = √[ Σ(Yi − Ȳ)² / (n − 1) ]
Schwarz Criterion
The Schwarz Criterion (SC) is an alternative to the Akaike information criterion (AIC) that imposes a larger penalty for additional coefficients. It is used for model selection.
The p-value given just below the F-statistic is denoted by Prob (F-statistic). It is the
marginal significance level of the F-test. If the p-value is less than the significance level
we are testing, say .05, we reject the null hypothesis that all slope coefficients are equal
to zero. For our example above, the p-value is essentially zero, so we reject the null
hypothesis that all of the regression coefficients are zero. Note that the F-test is a joint
test so that even if all the t-statistics are insignificant, the F-statistic can be highly
significant.
Example: Suppose the researcher is interested in applying a log-log model to the function stated earlier, that is, a regression model of the form
logGDPi = logβ0 + β1logINVi + β2logSAVi + β3logEXPOi + β4logPOPi + logUi
Let the researcher use the same data (i.e. 1953-1995 E.C.). Note, however, that first we have to transform each variable to be used in the estimation into its logarithmic form.
After this if we use the formula approach, we write the following into the equation
estimation box.
lgdp = c(1) + c(2)*linv + c(3)*lsav + c(4)*lexpo + c(5)*lpop
From the above result, we can write the estimated regression function as follows:
(estimated) LGDPi = 0.760 + 0.059LINVi - 0.028LSAVi + 0.222LEXPOi + 1.865LPOPi
Note that the interpretation of the partial regression coefficients differs from our previous interpretation: here each coefficient is a measure of elasticity. Thus, other things being equal, a one percent increase in investment is associated with a 0.059 percent increase in GDP, and similarly for the other regressors. Moreover, the regression result shows that the variables investment and saving do not have a significant effect on GDP, whereas the variables export and labor force play a significant role for GDP in the period under consideration. The F test result shows that the model is jointly significant even at the 1% level. According to the R² and adjusted R² results, the model is a very good fit.
The View button on the equation toolbar gives us a choice among three categories of tests
to check the specification of the equation.
Coefficient Tests
These tests evaluate restrictions on the estimated coefficients.
Consider the regression equation we used earlier, given by
logGDPi = logβ0 + β1logINVi + β2logSAVi + β3logEXPOi + β4logPOPi + logUi
Let the researcher want to test the following hypothesis:
H0 : β1 = β3 against the alternative
H1 : β1 ≠ β3
This hypothesis implies that the elasticity of GDP with respect to investment is equal to its elasticity with respect to export. Such tests, and other types, can be performed easily using EViews. Note that the coefficient restriction test in EViews is called the Wald Test.
The Wald test computes the test statistic by estimating the unrestricted regression without
imposing the coefficient restrictions specified by the null hypothesis. The Wald statistic
measures how close the unrestricted estimates come to satisfying the restrictions under
the null hypothesis. If the restrictions are in fact true, then the unrestricted estimates
should come close to satisfying the restrictions.
EViews reports both the chi-square and the F-statistics and the associated p-values.
To demonstrate how to perform Wald tests, once again consider the above regression
model. The estimated result (using the formula method) is presented as follows.
Dependent Variable: LGDP
Method: Least Squares
Date: 11/22/06 Time: 09:15
Sample(adjusted): 1953 1994
Included observations: 42 after adjusting endpoints
LGDP = C(1) + C(2)*LINV + C(3)*LSAV + C(4)*LEXPO + C(5) *LPOP
From the above result we observe that the coefficient of LINV (which is 0.059) is quite different from the coefficient of LEXPO (which is 0.222). But to determine whether the difference is statistically significant, we will conduct the hypothesis test described earlier.
Note that the restrictions should be expressed as equations involving the estimated
coefficients and constants. The coefficients should be referred to as C(1), C(2), and so on,
unless you have used a different coefficient vector in estimation.
To test our hypothesis H0: β1 = β3, we type the following restriction in the dialog box:
c(2) = c(4)
and click OK. Note that c(2) refers to the coefficient of LINV and c(4) represents the
coefficient of LEXPO. EViews reports the following result of the Wald test.
Wald Test:
Equation: Untitled
Null Hypothesis: C(2) = C(4)
F-statistic 4.610510 Probability 0.038397
Chi-square 4.610510 Probability 0.031777
Notice that EViews reports an F-statistic and a Chi-square statistic with associated p-
values. The Chi-square statistic is equal to the F-statistic times the number of restrictions
under test. In this example, there is only one restriction and so the two test statistics are
identical with the p-values of both statistics indicating that we can decisively reject the
null hypothesis of equal elasticity of investment and export of Ethiopia for the mentioned
period.
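The same restriction can be tested in Stata after running the corresponding regression, using the test command (a sketch, assuming the logged variables have already been generated):

regress lgdp linv lsav lexpo lpop
test linv = lexpo    // H0: the two elasticities are equal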
Consider the following case. Suppose a Cobb-Douglas production function for Ethiopia for the period 1953 to 1995 E.C. has been estimated in the form:

Qi = β0 Li^β1 Ki^β2 e^Ui

where Q, L and K denote GDP and the inputs of labor and capital respectively. To come up with a regression model that is linear in the parameters, we rewrite the above model in log-log form and obtain the following:

lnQi = lnβ0 + β1lnLi + β2lnKi + Ui

Let the researcher test the hypothesis of constant returns to scale, that is:

H0 : β1 + β2 = 1
Dependent Variable: LQ
Method: Least Squares
Date: 11/23/06 Time: 09:05
Sample: 1953 1995
Included observations: 43
LQ = C(1) + C(2)*LL+ C(3)*LK
Coefficient Std. Error t-Statistic Prob.
C(1) 0.191744 0.178262 1.075627 0.2885
C(2) 2.088333 0.139868 14.93072 0.0000
C(3) 0.211335 0.049131 4.301449 0.0001
R-squared 0.993536 Mean dependent var 9.366394
Adjusted R-squared 0.993212 S.D. dependent var 0.918770
S.E. of regression 0.075695 Akaike info criterion -2.256990
Sum squared resid 0.229191 Schwarz criterion -2.134115
Log likelihood 51.52528 F-statistic 3073.831
Durbin-Watson stat 0.670528 Prob(F-statistic) 0.000000
Notice from the result that the sum of the coefficients on LOGL (which is LL) and LOGK (which is LK) appears to be in excess of one. But to determine whether the difference is statistically significant, we will conduct the hypothesis test of constant returns.
To carry out a Wald test, we choose View then Coefficient Tests and select Wald-
Coefficient Restrictions… from the equation toolbar.
Wald Test:
Equation: Untitled
Null Hypothesis: C(2) + C(3) = 1
Notice from the result above that the p-values of both statistics indicate the rejection of the null hypothesis of constant returns to scale even at 1%.
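For comparison, the constant returns restriction could be tested in Stata as follows (a sketch, assuming logged series lq, ll and lk exist):

regress lq ll lk
test ll + lk = 1    // H0: constant returns to scale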
Dependent Variable: Y
Method: Least Squares
Date: 12/25/06 Time: 15:53
Sample: 1990:1 1995:4
Included observations: 24
Variable Coefficient Std. Error t-Statistic Prob.
C -133.7252 8.304021 -16.10367 0.0000
X1 0.958838 0.108965 8.799527 0.0000
X2 490.6481 81.96783 5.985862 0.0000
X3 2.010433 0.488545 4.115148 0.0005
R-squared 0.990404 Mean dependent var 101.3912
Adjusted R-squared 0.988964 S.D. dependent var 9.283278
S.E. of regression 0.975228 Akaike info criterion 2.938720
Sum squared resid 19.02139 Schwarz criterion 3.135063
Log likelihood -31.26464 F-statistic 688.0330
Durbin-Watson stat 1.695080 Prob(F-statistic) 0.000000
In this section we discuss how to perform such estimations using EViews and Stata. But as a brief introduction, the following section summarizes the concepts of nonlinear regression and dummy variables.
Note that a nonlinear regression model reflects the presence of a nonlinear relationship in the model. The following are some of the commonly used regression models that are nonlinear in the variables but linear in the parameters.
I. Double-Log Models
This model is very common in economics. Consider the following Cobb-Douglas model:

Yi = β0 X1i^β1 X2i^β2 e^Ui

The above specification may alternatively be expressed as

lnYi = lnβ0 + β1lnX1i + β2lnX2i + Ui

Since both the dependent and the explanatory variables are expressed in terms of logarithms, the model is known as a double-log, or log-log, model. Note that the coefficients β1 and β2 measure the elasticity of Y with respect to X1 and X2. For example, suppose the estimated value of β1 is 0.65. This implies that a one percent increase in X1 increases Y by 0.65 percent, holding X2 constant.
Example: Consider a double-log import function of the form
lnIMPOi = lnβ0 + β1lnINVi + β2lnSAVi + β3lnCOi + Ui
Where IMPO = import, INV = investment, SAV = domestic saving and CO = consumption expenditure. The data are for Ethiopia for the period 1986 to 1995 E.C. Notice that both the dependent and the explanatory variables are expressed in logarithmic form. The following estimation result is produced by EViews.
Log likelihood 21.27928 F-statistic 383.4959
Durbin-Watson stat 3.337651 Prob(F-statistic) 0.000003
II. Log-Lin and Lin-Log Models
In a log-lin model, β1 measures the relative change in Y for a given absolute change in X. In a lin-log model, by contrast, β1 measures the absolute change in Y for a given relative change in X.
Example: Consider a lin-log import function of the form IMPOi = β0 + β1lnINVi + β2lnSAVi + β3lnCOi + Ui, where IMPO = import, INV = investment, SAV = domestic saving and CO = consumption expenditure. The data are for Ethiopia for the period 1986 to 1995 E.C. Notice that the explanatory variables are expressed in logarithmic form while the dependent variable is not. The following estimation result is produced by EViews.
Dependent Variable: IMPO
Method: Least Squares
Date: 11/25/06 Time: 17:36
Sample(adjusted): 1986 1994
Included observations: 9 after adjusting endpoints
Variable Coefficient Std. Error t-Statistic Prob.
C -84347.99 16776.36 -5.027787 0.0040
LINV 12174.76 2747.187 4.431720 0.0068
LSAV -1851.376 384.2515 -4.818136 0.0048
LCO 167.1902 3491.878 0.047880 0.9637
R-squared 0.991462 Mean dependent var 12073.53
Adjusted R-squared 0.986339 S.D. dependent var 4105.149
S.E. of regression 479.8053 Akaike info criterion 15.48574
Sum squared resid 1151065. Schwarz criterion 15.57340
Log likelihood -65.68583 F-statistic 193.5409
Durbin-Watson stat 2.275361 Prob(F-statistic) 0.000014
Y 100 110 105 120 130 125 115 140 160 160
X1 10 11 10 13 15 14 15 16 12 20
X2 1.25 1.00 1.5 1.4 2.0 2.5 3 2.75 3 3.2
X3 600 650 550 600 700 800 750 900 800 850
Using the above data, answer the following questions.
2. Based on the result from question number one above, test the hypothesis that H0 : β1 = β2 at the 5% significance level
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Note that the number of variables entered in the above model is only one. This represents a two-variable (or simple) regression model. Using the data on GDP and investment, we obtain the following result.
------------------------------------------------------------------------------
gdp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------------------------
inv | 5.159164 .1645798 31.35 0.000 4.826788 5.491539
_cons | 2939.84 694.4404 4.23 0.000 1537.389 4342.291
-------------------------------------------------------------------------------------------------
Note from the above result that Stata produces several results in addition to the estimates of the coefficients of the parameters. The estimated regression equation is:
(estimated) GDP = 2939.84 + 5.159INV
Note that for a two-variable regression the coefficient of the independent variable (in our case investment) represents the slope of the function. Notice that one piece of additional information given by Stata's default result window is the confidence interval for each parameter. For example, the result shows that we are 95% confident that the unknown population parameter of INV lies between 4.826 and 5.491. The confidence interval for the intercept term can be explained accordingly. Moreover, notice from the result that the parameters are significant even at 1%, both jointly and individually.
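The reported 95% bounds can be reproduced from the stored coefficient and standard error (an illustrative sketch):

regress gdp inv
display _b[inv] - invttail(e(df_r), 0.025)*_se[inv]    // lower bound
display _b[inv] + invttail(e(df_r), 0.025)*_se[inv]    // upper bound

where invttail() returns the critical t value for the residual degrees of freedom e(df_r).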
The other interesting part while working with Stata is that, it is possible to plot the actual
dependent variable together with the estimated value of that same dependent variable.
Such graphical examination helps to identify to what extent the predicted values approximate the actual values of the dependent variable. To perform this we follow the steps below.
I. Given Y = f(X), run the associated regression model and obtain the estimated results.
II. Then construct the estimated value of Y from the regression result. This refers to computing the Y-hat value at each value of X. The command to do so is given by: predict yhat. This generates a new variable equal to the predicted values from the most recent regression.
III. To graph the actual value of Y with the estimated value (yhat) at each value of X, we write the following command: graph Y yhat X, connect(.s) symbol(oi). This command draws a scatter plot with the regression line using the variables Y, yhat and X.
We can conduct this approach using the model that we have estimated earlier.
Accordingly, after estimating the regression model from the function GDP = f(INV),
we construct the estimated value of GDP using the following command: predict
GDPhat. Note that this generates the predicted values of GDP from the regression
estimation.
To construct the graph that takes into account the actual values of GDP and the estimated values of GDP for each value of investment, we write the following command: graph GDP GDPhat INV, connect(.s) symbol(oi). The result of this command is given as follows.
[Figure: scatter plot of actual GDP (points) against INV, with the fitted regression line; GDP and fitted values on the vertical axis, INV on the horizontal axis]
Note that the line represents the regression line constructed using the fitted (estimated) values of GDP for each value of INV. On the other hand, the scatter points represent the actual value of GDP for a given value of INV. Such a drawing is helpful for a wide range of analyses.
Using the data on GDP, INV, IMPO, CO and POP, we can estimate a multiple regression model of GDP on these variables. Note that the command to be entered is given by: regress GDP INV IMPO CO POP
Source |       SS          df       MS               Number of obs =      43
-------------+-------------------------------------   F(  4,    38) = 20376.68
Model |  1.1538e+10     4  2.8846e+09                  Prob > F      =  0.0000
Residual |  5379456.74    38  141564.651               R-squared     =  0.9995
-------------+-------------------------------------   Adj R-squared =  0.9995
Total |  1.1544e+10    42   274853564                  Root MSE      =  376.25
--------------------------------------------------------------------------------------------------
gdp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+------------------------------------------------------------------------------------
inv | 1.521035 .1757031 8.66 0.000 1.165342 1.876727
co | 1.170104 .033189 35.26 0.000 1.102917 1.237292
pop | -83.28284 20.16384 -4.13 0.000 -124.1024 -42.46328
impo | -1.127762 .1050218 -10.74 0.000 -1.340368 -.9151567
_cons | 1705.201 468.9588 3.64 0.001 755.8434 2654.558
------------------------------------------------------------------------------
In the result, the 95% confidence interval for each parameter is also given. That is,
o we are 95% confident that the unknown population parameter of INV lies between 1.165342 and 1.876727.
o we are 95% confident that the unknown population parameter of CO lies between 1.102917 and 1.237292.
o we are 95% confident that the unknown population parameter of POP lies between -124.1024 and -42.46328.
o we are 95% confident that the unknown population parameter of IMPO lies between -1.340368 and -.9151567.
The result also shows that the parameters used in the estimation are significant even at 1 percent. Recall that for an individual test of significance we use the t-probability value (represented in Stata by P>|t|), whereas it is the F probability (given by Prob > F) that is used to examine the joint test of significance.
Consider a regression model designed to examine the elasticities of GDP with respect to investment, exports, imports and population. This requires a log-log model such as the following:
log GDPi = β0 + β1 log INVi + β2 log EXPOi + β3 log IMPOi + β4 log POPi + Ui
The estimation of the above log-log model using Stata first requires the transformation of each variable into its logarithmic form (recall the command to transform a variable into its logarithm). Accordingly, we obtain the following regression result.
-----------------------------------------------------------------------------------------------
lgdp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+---------------------------------------------------------------------------------
linv | .0036559 .0477036 0.08 0.939 -.092915 .1002269
lexpo | .2011124 .0554548 3.63 0.001 .08885 .3133747
limpo | .0523219 .0685613 0.76 0.450 -.0864732 .1911171
lpop | 1.849338 .1329017 13.92 0.000 1.580293 2.118384
_cons | .8004088 .2369478 3.38 0.002 .320733 1.280085
------------------------------------------------------------------------------------------------
Note from the above result that the coefficients represent partial elasticity coefficients. That is, other things being equal,
o an increase in investment by one percent increases GDP by about 0.004 percent.
o an increase in export by one percent increases GDP by 0.20 percent.
o an increase in import by one percent increases GDP by 0.05 percent.
o an increase in population by one percent increases GDP by 1.84 percent.
In the result, the 95% confidence interval for each elasticity is also given. That is,
o we are 95% confident that the population elasticity with respect to INV lies between -.092915 and .1002269.
o we are 95% confident that the population elasticity with respect to EXPO lies between .08885 and .3133747.
o we are 95% confident that the population elasticity with respect to IMPO lies between -.0864732 and .1911171.
o we are 95% confident that the population elasticity with respect to POP lies between 1.580293 and 2.118384.
The result also shows that the parameters used in the estimation are jointly significant even at 1 percent. However, only the EXPO and POP parameters are individually significant, while INV and IMPO are not.
Hypothesis Testing Using Stata
Stata conducts a wide range of hypothesis tests. This includes pre-estimation hypothesis testing about a variable, or between variables, in many respects. Moreover, post-estimation hypothesis testing about an individual parameter, or between parameters, can be computed easily and in a very friendly manner. Our discussion first focuses on pre-estimation hypothesis testing, followed by post-estimation testing.
The two basic commands for testing hypotheses about means are:
ttest varname = #
ttest varname1 = varname2
The first command tests the hypothesis that the mean of varname equals the number #; the second tests the equality of the means of varname1 and varname2 (a paired test when both variables come from the same sample). To illustrate, consider weekly data on the demand for roses (Y), the average retail price of roses (X1), the average retail price of a competing flower (X2) and average weekly family disposable income (X3). The values of X1, X2 and X3 are measured in birr. The following data represent this information.
Week Y X1 X2 X3
1 8429 3.07 4.06 165.26
2 10079 2.91 3.64 172.92
3 9240 2.73 3.21 178.46
4 8862 2.77 3.66 198.62
5 6216 3.59 3.76 186.28
6 8038 2.60 3.13 180.49
7 8038 2.60 3.13 180.49
8 7476 2.89 3.20 183.33
9 5911 3.77 3.65 181.87
10 7950 3.64 3.60 185.00
11 6134 2.82 2.94 184.00
12 5868 2.96 3.12 188.20
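As a sketch, these observations could be typed directly into Stata with the input command before running the tests below (recent Stata versions write the test as ttest x1 == 2.75):
clear
input week y x1 x2 x3
1 8429 3.07 4.06 165.26
2 10079 2.91 3.64 172.92
3 9240 2.73 3.21 178.46
4 8862 2.77 3.66 198.62
5 6216 3.59 3.76 186.28
6 8038 2.60 3.13 180.49
7 8038 2.60 3.13 180.49
8 7476 2.89 3.20 183.33
9 5911 3.77 3.65 181.87
10 7950 3.64 3.60 185.00
11 6134 2.82 2.94 184.00
12 5868 2.96 3.12 188.20
end
ttest x1 = 2.75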
Now, given the table above, we can test a hypothesis that makes use of the above-stated commands. For example, consider the following command:
ttest X1 = 2.75
This hypothesis argues that the average retail price of roses in the market is 2.75 birr. The following result is computed in Stata based on the above command.
One-sample t test
---------------------------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
--------- +----------------------------------------------------------------------------------------
x1 | 12 3.029167 .1182349 .4095776 2.768933 3.2894
----------------------------------------------------------------------------------------------------
Degrees of freedom: 11
Ha: mean < 2.75 Ha: mean ~= 2.75 Ha: mean > 2.75
t = 2.3611 t = 2.3611 t = 2.3611
P < t = 0.9811 P > |t| = 0.0377 P > t = 0.0189
Notice from the result that the null hypothesis is Ho: mean(x1) = 2.75, and that both two-tailed and one-tailed (left- and right-tail) tests are presented above. As we know, the null hypothesis is rejected when the t-probability value is less than the selected level of significance. Notice from the result that for the two-tailed and right-tailed tests, the null hypothesis is rejected both at 5% (0.05) and 10% (0.1), whereas it is not rejected at the 1% (0.01) level of significance. In general, the test suggests that the mean value of X1 in the market is different from, and higher than, 2.75 birr.
The command below performs a similar hypothesis test for the variable X2, namely Ho: mean(X2) = 3.5. Note that the appropriate command in this regard is: ttest X2 = 3.5. The result is summarized in the following box.
One-sample t test
--------------------------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
--------- +--------------------------------------------------------------------------------------
x2 | 12 3.425 .0991364 .3434186 3.206802 3.643198
--------------------------------------------------------------------------------------------------
Degrees of freedom: 11
Ha: mean < 3.5 Ha: mean ~= 3.5 Ha: mean > 3.5
t = -0.7565 t = -0.7565 t = -0.7565
P < t = 0.2326 P > |t| = 0.4652 P > t = 0.7674
Notice from the result that the null hypothesis is not rejected at any conventional level (1%, 5% or 10%) in either the two-tailed or the one-tailed (left and right) tests. This indicates that, on average, the price of X2 is 3.5 birr per unit.
Note that, in addition to this, Stata can compare the mean values of two variables collected from the same sample or from different ones. When both come from the same data this is called a paired test. Given the above table, one may hypothesize that the mean price of X1 is equal to that of X2. Accordingly, the command is: ttest X1 = X2. The following result is based on this command.
Paired t test
----------------------------------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
--------- +----------------------------------------------------------------------------------------------
x1 | 12 3.029167 .1182349 .4095776 2.768933 3.2894
x2 | 12 3.425 .0991364 .3434186 3.206802 3.643198
--------- +------------------------------------------------------------------------------------------------
diff | 12 -.3958334 .1029155 .3565098 -.6223489 -.1693178
-----------------------------------------------------------------------------------------------------------
Notice from the result that the null hypothesis is rejected at 1% for both the two-tailed test and the left-tailed test. In general, the test suggests that the mean value of X1 is different from the mean value of X2.
Stata can also test hypotheses about variances, using the following two commands:
sdtest varname = #
sdtest varname1 = varname2
Note from the two commands that sdtest performs tests on the equality of variances (standard deviations). In the first command, sdtest performs a chi-squared test of the hypothesis that the standard deviation of varname is #. That is, it examines whether a given variable's population standard deviation (variance) is equal to some number. In the second command, sdtest performs an F test (variance ratio test) of the hypothesis that varname1 and varname2 have the same variance. In other words, the second command checks whether two variables have equal variance (standard deviation) or not.
Considering the data on the demand for roses discussed earlier, we can test the above-stated hypotheses. Suppose a researcher hypothesized that the standard deviation of the price of roses (around its mean) is equal to 0.5. In this case the appropriate command is: sdtest X1 = 0.5. The Stata output of this command is displayed below.
sdtest x1 = 0.5
-------------------------------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
--------- +-------------------------------------------------------------------------------------------
x1 | 12 3.029167 .1182349 .4095776 2.768933 3.2894
--------------------------------------------------------------------------------------------------------
Ha: sd(x1) < 0.5 Ha: sd(x1) ~= 0.5 Ha: sd(x1) > 0.5
P < chi2 = 0.2326 2*(P < chi2) = 0.4651 P > chi2 = 0.7674
From the result above we see that both two-tailed and one-tailed (left- and right-tail) tests are reported. Recall that in hypothesis testing the null hypothesis is rejected when the probability value is less than the selected level of significance (10%, 5% or 1%). Notice from the result that the null hypothesis cannot be rejected even at the 10% (0.1) level of significance in the two-tailed test, and the same holds for both the left- and right-tailed tests. This suggests that the hypothesized value is acceptable; that is, the evidence is consistent with the population standard deviation of X1 being equal to 0.5.
In addition to this, we can test the equality of the variances of two variables. For instance, suppose we hypothesized that the variance of the price of X1 is equal to that of X2. To test this, the relevant command is: sdtest X1 = X2. The result obtained from this command is given as follows.
Variance ratio test
--------------------------------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
--------- +----------------------------------------------------------------------------------------------
x1 | 12 3.029167 .1182349 .4095776 2.768933 3.2894
x2 | 12 3.425 .0991364 .3434186 3.206802 3.643198
--------- +-----------------------------------------------------------------------------------------------
combined | 24 3.227083 .0860011 .4213176 3.049177 3.40499
----------------------------------------------------------------------------------------------------------
Observed F statistic: F(11,11) = F_obs = 1.422
Ha: sd(x1) < sd(x2) Ha: sd(x1) ~= sd(x2) Ha: sd(x1) > sd(x2)
P < F_obs = 0.7156 P < F_L + P > F_U = 0.5688 P > F_obs = 0.2844
Notice that neither the two-tailed nor the one-tailed tests can reject the null hypothesis that the variance of X1 is equal to that of X2, because the probability values of all the tests are well above even the 10% (0.1) level. Thus we do not reject the hypothesis that the variance of X1 equals that of X2.
Post-Estimation Hypothesis Testing
This section examines the process of hypothesis testing after a regression has been estimated. This includes tests of equality between parameters and many other linear restrictions.
The discussion will be based on the following cross section data that includes the output
(Y), the labor input (L), and capital input (K) of firms of a chemical industry.
Recall that Stata's default regression output already contains tests of significance of the parameters, individually and jointly. However, it is also possible to conduct tests of significance of the parameters directly, using a few commands.
In Stata such hypothesis tests are conducted using the test command. Note that test tests linear hypotheses about the estimated parameters from the most recently estimated model. Without arguments, test redisplays the results from the last test. The other command used is testparm, which provides a useful alternative to test in that it permits a variable list rather than just a list of coefficients. Note that test and testparm perform Wald tests.
Consider the following regression model that can be estimated using the above data
Yi = β0 + β1Li + β2Ki + Ui
After (single-equation) estimation of the above model, we can estimate a number of
hypotheses using the appropriate command as shown below.
test L = K
This command tests the hypothesis that the population parameters of L and K (which are β1 and β2) are equal. The hypothesis suggests that labor and capital have equal impact on output.
test K = L/2
This command tests the hypothesis that the population parameter of K is half that of L, i.e. that β2 is half of β1. In other words, the hypothesis suggests that the contribution of labor is twice that of capital.
test L = 2
This command tests the hypothesis that the population parameter of L equals 2. The hypothesis suggests that as labor changes by one unit, output increases by two units, ceteris paribus.
test L (or test K)
This produces an individual test of significance of the population parameter. The result of this test is the same as the one presented in the default Stata regression output.
test L K
This produces a joint test of significance of the population parameters. The result of this test is the same as the one presented in the default Stata regression output.
We can exercise the above stated commands using the cross section data posted above.
As we have said earlier, the first thing is to compute the regression model. Note that Stata
performs the linear hypotheses about the estimated parameters from the most recently
estimated model. Thus, it is very important to first conduct the regression estimation.
Accordingly, we obtain the following result.
regress y l k
-------------------------------------------------------------------------------------------------------------
y| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+-----------------------------------------------------------------------------------------------
l | .0409662 .0628447 0.65 0.527 -.0959606 .177893
k | .534077 .1424411 3.75 0.003 .2237244 .8444296
_cons | -48.62864 75.29277 -0.65 0.531 -212.6775 115.4202
--------------------------------------------------------------------------------------------------------------
Note that the default regression result reported in the Stata result window already includes tests of significance of the parameters, individually as well as jointly. However, we can perform the same tests using the following commands.
test L. This command tests whether the parameter of L (which is β1) is significant or not. The result is presented as follows.
test l
( 1) l = 0.0
F( 1, 12) = 0.42
Prob > F = 0.5268
Note that the above result is a test of significance of an individual parameter. The probability of this test is the same as the one posted in the regression result displayed earlier. Based on the probability result, we find β1 to be insignificant.
test K. This command tests whether the parameter of K (which is β2) is significant or not. The result is presented as follows.
test k
( 1) k = 0.0
F( 1, 12) = 14.06
Prob > F = 0.0028
Note from the above result that the probability of the test is the same as the one posted in the regression result. Moreover, we find β2 to be significant even at one percent.
test L = K. As we have said earlier, this examines the hypothesis that the population parameters of L and K (which are β1 and β2) are equal. Thus, the hypothesis suggests that labor and capital have equal impact on output. The result of this test is presented as follows.
test l = k
( 1) l - k = 0.0
F( 1, 12) = 6.63
Prob > F = 0.0243
The above result rejects the hypothesis that β1 equals β2 at the 5% significance level. This indicates that there is no statistical evidence to suggest that the contributions of L and K to Y are equal.
test L K. This represents the joint test of significance, as shown below.
. test l k
( 1) l = 0.0
( 2) k = 0.0
F( 2, 12) = 17.29
Prob > F = 0.0003
Next, suppose we transform each variable into its logarithm (ly, ll, lk) and estimate the Cobb-Douglas (log-log) form of the model. The regression result is given as follows.
------------------------------------------------------------------------------
ly | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ll | -.2157061 .2418451 -0.89 0.390 -.7426414 .3112292
lk | 1.67721 .3040903 5.52 0.000 1.014654 2.339766
_cons | -3.435928 1.637866 -2.10 0.058 -7.004531 .1326751
------------------------------------------------------------------------------
The estimated result can be presented as follows.
(estimated)LYi = -3.43 - 0.215LLi + 1.67LKi
Note from the result that the coefficients of L and K (which measure elasticities) are not equal. However, we can examine whether such a difference is statistically significant for the population parameters. Thus, we test the hypothesis that the elasticity of Y with respect to L is the same as the elasticity of Y with respect to K. The command for this is:
test LL = LK
The result is given as follows. (Note that the test must follow the regression estimation).
test ll= lk
( 1) ll - lk = 0.0
F( 1, 12) = 14.86
Prob > F = 0.0023
Note from the result that the hypothesis is rejected at the 5% significance level. This points out that there is no statistical evidence that the two elasticities are equal.
Moreover, note that the sum of the estimated elasticities is more than one (around 1.5). This suggests the presence of increasing returns to scale in the model. However, an examination is needed as to whether this result is statistically supported. To do so, we test the hypothesis of constant returns to scale using the following command.
test LL + LK =1
The test hypothesizes the presence of constant returns to scale in the model. The
following is the result of the test.
F( 1, 12) = 3.50
Prob > F = 0.0858
The result points out that the hypothesis of constant returns to scale cannot be rejected at the 1% and 5% significance levels; however, it is rejected at the 10% level.
Consider the data given under Check Your Progress 4.4 on Y, X1, X2 and X3. Then attempt the following using Stata.
4.5 Summary
Note that regression analysis is concerned with describing and evaluating the relationship
between a given variable (dependent variable) and one or more other variables
(explanatory variables).
This unit has employed computer-based analysis and explained a number of issues related to econometric estimation and analysis using EViews and Stata. That is, we described the approach employed in conducting regression estimation in EViews. This includes specifying and estimating a regression model, interpreting the estimation results, tests of significance of the parameters, and the like. Note that in the discussion we made use of the Ordinary Least Squares method of estimation, since it is the simplest as well as the most widely used method in basic regression estimation.
Moreover, the Stata program has been used to conduct similar tests. As we have seen, Stata conducts a number of regression estimations; the discussion explained simple linear regression using OLS. In addition to this, Stata conducts a wide range of hypothesis tests, including pre-estimation tests about a variable or between variables, while post-estimation tests about an individual parameter or between parameters can be computed easily and in a very friendly manner.
4.6 Answers to Check Your Progress
1. Step I: Identify the relationship between variables and express the relationship in mathematical form. Step II: Specify the mathematical form of the model. Step III: Specify the econometric model. Step IV: Determine the numerical estimates of the coefficients of the model. Step V: Evaluate the estimates.
3. R2 measures the proportion of the variation in Y explained by the explanatory
variables such as X1 and X2 jointly.
4. This refers to testing whether a particular variable, such as X1 or X2, is significant or not, either individually or jointly. In confidence interval estimation, on the other hand, we establish limiting values around the estimate within which the true parameter is expected to lie with a certain 'degree of confidence'.
4.7 Model Examination
The following table provides data on real gross product, labor input, and real capital input
in the manufacturing sector of a certain economy.
1. Using the above table, attempt the following questions using EViews
A. fit the following model to the above data, and report the results
Yi = β0 + β1X1 + β2X2 + Ui
B. Interpret the coefficient results and comment on the results of the adjusted R2
C. Test the significance of the parameters individually at 5% level.
D. Does the data support the hypothesis β1 = β2 ? Report your findings at 5% level
of significance
E. Test for the existence of autocorrelation problems in the model
2. Based on the data given above, answer the following questions using Stata
I. Fit the following model to the above data, and report the results
log Yi = α0 + α1 log X1i + α2 log X2i + Ui
II. Interpret the coefficient results and comment on the results of the adjusted R2
III. Test the significance of the parameters individually at 5% level.
IV. Does the data support the hypotheses (a) α1 = α2 and (b) α1 + α2 = 1? Use a 5% significance level.
V. Test for the existence of heteroscedasticity problems in the above model.
Unit Five: Diagnostic Tests
5.0 Objective
5.1 Introduction
5.2 The Concept of Diagnostic Checking.
5.3 Diagnostic Checking Using EViews
5.4 Diagnostic Checking Using Stata
5.5 Summary
5.6 Answers to Check Your Progress
5.7 Model Examination
5.0 Objective
The aim of this unit is to conduct computer-based examination on the regression results.
After completing this unit the student will be able to
Identify the concept of diagnostic checking. This includes explaining the sources
of the problem, the detection mechanism and the appropriate solutions.
Test for the presence of the problems using EViews and Stata.
5.1 Introduction
In regression estimation and analysis, the task of the econometrician is not limited to
performing estimation. Rather several tests that ascertain the reliability of the model must
be conducted. In this unit, we will employ a computer based method of assessing the
reliability of the estimates of the parameters from econometric criteria point of view.
Recall from your econometrics discussion that after the estimation of the parameters with
the method of ordinary least squares, we should assess the reliability of the estimates of
the parameters based on three types of criteria. These are:
A priori economic criteria which are determined by economic theory and related
to the sign and magnitude of the parameters.
Statistical criteria which are determined by the statistical theory.
Econometric criteria which are determined by the econometric theory.
Further, recall that the statistical criteria are the coefficient of determination, the standard
errors of the estimates and the related t and F-statistics. These tests are valid only if the
assumptions of the linear regression model are satisfied. Thus, if the assumptions of an
econometric method are violated, then the estimates obtained do not possess some or all
of their optimal properties discussed in the earlier units. Therefore, their standard errors become unreliable criteria.
Note that econometric criteria provide evidence about the validity or the violation of the
assumptions of the linear regression model. In this unit therefore, we will see with the aid
of computer programs the violation of the basic assumptions of the classical linear
regression model. In this regard, among a number of requirements, emphasis is usually given to three major econometric problems: autocorrelation, heteroscedasticity and multicollinearity. Thus this unit briefly examines what these problems look like conceptually and then conducts the tests using EViews and Stata.
Note that there are some criteria set by the theory of econometrics. Therefore, after
conducting regression estimation it is important to investigate whether the assumptions of
the econometric method are satisfied or not.
If the assumptions are not satisfied, then the estimates of the parameters will not possess some of the desirable properties and become unreliable for the determination of the significance of the estimates.
In general, before using the estimates for prediction, policy making or other objectives, the researcher must make use of all the criteria. If the assumptions are not satisfied, it is necessary to re-specify the model by introducing new variables, omitting variables or transforming variables, and then to re-estimate the model. This process of re-specification continues until the model satisfies all three criteria.
Given Yi = β0 + β1X1i + β2X2i + … + βkXki + Ui, the assumptions on which the classical linear regression model is based are:
Assumption 1: Randomness of the random variable U. That is, its value is unpredictable and hence depends on chance.
Assumption 2: Zero mean of the random variable U.
Assumption 3: The variance of each Ui is the same for all the Xi values. This is known as the assumption of homoscedasticity.
Assumption 4: The values of each Ui are normally distributed, i.e. Ui ~ N(0, σu²).
Assumption 5: The values of Ui corresponding to Xi are independent from the values of any other Uj corresponding to Xj. This is called the assumption of non-autocorrelation or serial independence of the U's.
Assumption 6: Every disturbance term Ui is independent of the explanatory variables.
Assumption 7: No errors of measurement in the X's.
Assumption 8: The explanatory variables are not perfectly linearly correlated. This is called the assumption of no perfect multicollinearity between the X's.
Assumption 9: The model has no specification error. That is, all the important explanatory variables appear explicitly in the function and the mathematical form is correctly specified.
It was on the basis of these assumptions that we try to estimate the model, and test the
significance of the model. Nevertheless, the question is what would be the implication if
some or all of these assumptions are violated. That is, if the assumptions are not fulfilled
what will be the outcome? Hereunder, a brief discussion is made of the most important of these assumptions.
A. The Assumption of Non-Autocorrelation
This assumption implies that the covariance of Ui and Uj is equal to zero. If this assumption is violated, the disturbances are said to be autocorrelated. Autocorrelated values of the disturbance term may be observed for many reasons. These are:
Omitted explanatory variables
Most economic variables tend to be autocorrelated. If an autocorrelated variable has been excluded from the set of explanatory variables, then its influence will be reflected in the random variable U. This is called "quasi-autocorrelation", since it is due to the autocorrelated pattern of the omitted explanatory variables and not to the pattern of the values of the random variable U itself. If several autocorrelated explanatory variables are omitted, the random variable U may nevertheless not be autocorrelated, because the autocorrelation patterns of the omitted variables may offset each other.
Misspecification of the mathematical form of the model
If we use a mathematical form which differs from the correct form of the relationship, then the random variable may show serial correlation. For example, if we choose a linear function while the correct form is non-linear, then the values of U will be correlated.
Prolonged influence of random factors
Many random factors, like war, drought, weather conditions, strikes, etc., exert influences that are spread over more than one period of time. For example, the effect of weather conditions on the agricultural sector will influence the performance of all other economic variables at several points in the future. A strike in an organization affects the production process, and the effect persists for several future periods. In such cases, the values of the U's become serially dependent, so that if we assume E(UiUj) = 0 we mis-specify the true pattern of the values of U. This type of autocorrelation is called "true autocorrelation".
Interpolation and smoothing of time series data
Most time series data involve some interpolation and "smoothing", for example to remove seasonal effects, which averages the true disturbances over successive time periods. As a result, the successive values of the U's become interrelated and show autocorrelation patterns.
The source of autocorrelation has a strong influence on the choice of a solution for the correction of autocorrelation. That is, the type of corrective action depends on the cause or source of the autocorrelation.
When the disturbance term exhibits serial correlation, the values as well as the standard errors of the parameter estimates are affected. Note that if the disturbances are correlated, then (i) the previous values of the disturbances have some information to convey about the current disturbances, and if this information is ignored the sample data are clearly not being used with maximum efficiency; (ii) the variance of the random term U may be seriously underestimated; and (iii) predictions based on the ordinary least squares estimates will be inefficient with autocorrelated errors.
The standard device for detecting autocorrelation is the Durbin-Watson d statistic, computed from the estimated residuals Ût as
d = [ Σ(t=2 to n) (Ût - Ût-1)² ] / [ Σ(t=1 to n) Ût² ]
which is simply the ratio of the sum of squared differences in successive residuals to the residual sum of squares, RSS. Note that in the numerator of the d statistic the number of observations is n - 1, because one observation is lost in taking successive differences. Note that expanding the above formula allows us to obtain the approximation
d ≈ 2(1 - ρ̂).
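As an illustration, d can also be computed by hand from the estimated residuals. The following is a rough sketch under stated assumptions: the variable names y, x1 and x2 are hypothetical, and the observations are assumed to be sorted in time order.
regress y x1 x2
predict uhat, resid
gen duhat2 = (uhat - uhat[_n-1])^2
gen uhat2 = uhat^2
quietly summarize duhat2
scalar num = r(sum)
quietly summarize uhat2
scalar den = r(sum)
display "Durbin-Watson d = " num/den
In recent Stata versions the same statistic is reported directly by estat dwatson, after the data have been declared as a time series with tsset.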
Note from the Durbin-Watson statistic that for positive autocorrelation (ρ > 0), successive disturbance values will tend to have the same sign, and the quantities (Ut - Ut-1)² will tend to be small relative to the squares of the actual values of the disturbances. We can therefore expect the value of d to be low. Indeed, in the extreme case ρ = 1 it is possible that Ut = Ut-1 for all t, so that the minimum possible value of d is zero. However, for negative autocorrelation (ρ < 0), since positive disturbance values now tend to be followed by negative ones and vice versa, the quantities (Ut - Ut-1)² will tend to be large relative to the squares of the U's; hence d now tends to be high, and in the extreme case ρ = -1 it approaches its maximum value of 4. Finally, when ρ = 0 the first-order scheme Ut = ρUt-1 + εt reduces to Ut = εt for all t, so that Ut takes on all the properties of εt (in particular, it is no longer autocorrelated), and we should expect d to take a value in the neighborhood of 2. Thus, in the absence of autocorrelation we can expect d to be close to 2; when negative autocorrelation is present, a value in excess of 2, possibly as high as 4; and when positive autocorrelation is present, a value lower than 2, possibly close to zero.
As discussed earlier, the Durbin-Watson (DW) test tests the hypothesis H0: ρ = 0 (implying that the error terms are not autocorrelated with a first-order scheme) against the alternative H1: ρ ≠ 0.
Note that the range of the DW statistic is between 0 and 4. The decision is made by comparing the calculated value with the critical (tabulated) values. As a rule of thumb, however, if d is found to be close to 2 in an application, one may assume that there is no first-order autocorrelation, either positive or negative. If d is close to 0, it is because the correlation between successive error terms is close to +1, indicating strong positive autocorrelation in the residuals. Similarly, the closer d is to 4, the greater the evidence of negative serial correlation, because the correlation between successive error terms is then close to -1.
The general approach to making a decision in the DW test is summarized below, where H0 denotes the hypothesis of no positive autocorrelation and H0* the hypothesis of no negative autocorrelation:
0 ≤ d < dL : reject H0 (evidence of positive autocorrelation)
dL ≤ d ≤ dU : zone of indecision
dU < d < 4 - dU : do not reject H0 or H0* or both
4 - dU ≤ d ≤ 4 - dL : zone of indecision
4 - dL < d ≤ 4 : reject H0* (evidence of negative autocorrelation)
Note that the computed DW statistic is compared with the Durbin-Watson critical (table) values, and the decision whether to accept or reject the null hypothesis of no (positive or negative) autocorrelation is made using the zones above. For example, suppose that the DW result is 1.05 and that from the Durbin-Watson table the critical values at 5% are dL = 1.38 and dU = 1.72. Since d = 1.05 < dL = 1.38, on the basis of the d test we can say that there is positive autocorrelation.
Note that it is important to address the autocorrelation problem when it exists in the regression result. If the source of the problem is suspected to be the omission of important variables, the solution is to include those omitted variables. If the source of the problem is believed to be misspecification of the model, the solution is to determine the appropriate mathematical form. However, if these causes are ruled out, the appropriate procedure is to transform the original data so as to come up with a new form (or model) which satisfies the assumption of no serial correlation.
B. The Assumption of Homoscedasticity
This is one of the important assumptions of the classical linear regression model. It says that the population disturbance terms Ui all have the same variance. It suggests that the conditional variance of the dependent variable, given the value of the explanatory variable, remains the same regardless of the values taken by the variable X. On the other hand, when the conditional variance of the dependent variable changes as the value of the explanatory variable changes, we say there is heteroscedasticity.
Note that there are a number of reasons why the variances of Ui may not be constant (heteroscedasticity). Some of these are:
I) As income grows, people have more discretionary income and hence more scope for choice. Hence the variance of the error term, σi², is likely to increase with income. Thus, in the regression of saving on income, we find that σi² increases with income.
III) Another source of heteroscedasticity arises from violating the assumption that the regression model is correctly specified. Heteroscedasticity can arise when important variables are omitted from the model. For example, in a demand function, if we omit the price of a complement or the price of a substitute, then the residuals obtained from the regression may suggest that the error variance is not constant. If the omitted variables are included in the model, the problem may disappear.
IV) Heteroscedasticity is also common in cross-section data, where one deals with members of a population (such as firms or households) of different sizes: small, medium or large. In time series data, by contrast, the variables tend to be of a similar order of magnitude, since the data are collected for the same entity over a period of time.
Note that there are both formal and informal methods of detecting heteroscedasticity. The formal methods include the Breusch-Pagan test. This test is relevant for a very wide class of alternative hypotheses, namely that the variance is some function of a linear combination of known variables. The generality of the test is both its strength (it does not require prior knowledge of the functional form involved) and its weakness (a test against a more specific alternative would be more powerful).
C. The Assumption of No Perfect Multicollinearity
This is one of the most important assumptions. Its violation means that the explanatory variables are perfectly linearly correlated.
Note that multicollinearity is not a condition that either exists or does not exist in economic variables; rather, it is inherent in most economic relationships due to the interdependence of many economic variables. In other words, multicollinearity is a question of degree, not of existence.
Multicollinearity is a problem because when any two explanatory variables are changing in the same way, it becomes difficult to measure the separate influence of each variable on the dependent variable.
Note that if the correlation between the explanatory variables is perfect, then the
estimates of the coefficients are indeterminate and the standard errors of these estimates
become infinitely large. When certain explanatory variables are more important than
others and correlated with the dependent variables, the seriousness of the problem is
greater.
The consequences of high multicollinearity are: (i) OLS estimates may not be precise; (ii) the confidence intervals tend to be much wider, which may affect hypothesis testing regarding the regression coefficients; (iii) the test statistics which are important for conducting hypothesis tests tend to be statistically insignificant; and (iv) although the test statistics are statistically insignificant, the overall measure of goodness of fit, R², can be very high.
The solutions for multicollinearity depend on the severity of the problem, on the availability of other sources of data, on the importance of the factors which are multicollinear, and on the purpose for which the model is being estimated.
If multicollinearity affects some of the less important factors (variables), one may exclude these factors from the model. If, on the other hand, multicollinearity has serious effects on the coefficient estimates of important factors, then one may (i) increase the sample size, (ii) introduce additional equations into the model, (iii) drop a variable, or (iv) transform the variables.
1. State the assumptions on which the classical linear regression model is based upon
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
4. State the problems associated with high multicollinearity among the explanatory
variables
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
5.3 Diagnostic Checking Using EViews
Note that whenever a regression model is specified there is uncertainty regarding its appropriateness. Once we estimate our equation, EViews provides tools for evaluating the quality of our specification along a number of dimensions. Note that the results of these tests influence the chosen specification.
Recall from the previous section that the DW statistic is compared with the Durbin-Watson critical (table) values, and that the decision whether to accept or reject the null hypothesis of no (positive or negative) autocorrelation is made on that basis. Now consider the regression model given by investment as a function of domestic saving, imports and GDP, as follows.
INVi = β0 + β1SAVi + β2IMPOi + β3GDPi + Ui
Suppose that annual data of Ethiopia for the period 1953-1995 is used to obtain the
following regression result
Dependent Variable: INV
Method: Least Squares
Date: 11/23/06 Time: 16:16
Sample: 1953 1995
Included observations: 43
Variable Coefficient Std. Error t-Statistic Prob.
SAV 0.410876 0.078253 5.250643 0.0000
IMPO 0.488312 0.037556 13.00236 0.0000
GDP 0.024106 0.012792 1.884443 0.0668
R-squared 0.992759 Mean dependent var 2849.883
Adjusted R-squared 0.992397 S.D. dependent var 3148.438
S.E. of regression 274.5304 Akaike info criterion 14.13522
Sum squared resid 3014679. Schwarz criterion 14.25809
Log likelihood -300.9071 F-statistic 2742.032
Durbin-Watson stat 0.218943 Prob(F-statistic) 0.000000
Note from the result above that the Durbin-Watson statistic is 0.219. As mentioned earlier, the DW test examines the hypothesis H0 of no positive or negative autocorrelation. The decision is made by comparing the calculated value stated above (0.219) with the dL and dU table values. From the DW table attached at the end of this material we observe that the computed value is below dL. Thus, it suggests that there is positive autocorrelation.
Note however, that if the regressors are very highly collinear, EViews may encounter
difficulty in computing the regression estimates. In such cases, EViews will issue an error
message by saying “Near singular matrix.” When you get this error message, you should
check to see whether the regressors are exactly collinear. The regressors are exactly
collinear if one regressor can be written as a linear combination of the other regressors.
Note that under exact collinearity, the regressor matrix X does not have full column rank and the OLS estimator cannot be computed. However, the problem of multicollinearity is said to exist even with strong (but not perfect) collinearity. The rule of thumb is that multicollinearity is strong when the correlation coefficient, ρ, between two regressors is greater than 0.8. To make such an examination, the student is advised to recall from our statistical computation discussion how to compute the correlation between variables.
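For instance, in Stata the pairwise correlations among the regressors of the GDP model estimated earlier could be inspected as follows (a sketch; the variable names are those used above):
correlate INV IMPO CO POP
Any pairwise correlation above roughly 0.8 would then signal strong multicollinearity.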
This is to briefly inform the student that EViews provides tests for autocorrelation,
heteroscedasticity, and autoregressive conditional heteroskedasticity (ARCH) in the
residuals from the estimated equation. The following explains in short the process of
doing these tests.
Correlograms and Q-statistics
This view displays the autocorrelations and partial autocorrelations of the equation residuals up to the specified number of lags. To display the correlograms and Q-statistics, click View, then Residual Tests, and select Correlogram-Q-statistics on the equation toolbar. This application is presented in the diagram below.
After selecting Correlogram-Q-statistics, we have to specify the number of lags we wish to use in computing the correlogram. Note that this is done in the Lag Specification dialog box.
Correlograms of Squared Residuals
This view displays the autocorrelations and partial autocorrelations of the squared residuals up to any specified number of lags. The correlograms of the squared residuals can be used to check for autoregressive conditional heteroscedasticity (ARCH) in the residuals. The diagram presented earlier shows the steps required to perform this test.
Note that if there is no ARCH in the residuals, the autocorrelations and partial
autocorrelations should be zero at all lags and the Q-statistics should not be significant.
To display the correlograms and Q-statistics of the squared residuals, click View then
choose Residual Tests and then select Correlogram Squared Residuals on the equation
toolbar. Then in the Lag Specification dialog box that opens, specify the number of lags
over which to compute the correlograms.
Histogram and Normality Test
To display the histogram and Jarque-Bera statistic, click View, then select Residual Tests and choose Histogram-Normality. Note that the Jarque-Bera statistic has a chi-squared distribution with two degrees of freedom under the null hypothesis of normally distributed errors.
Serial Correlation LM Test
This test is an alternative to the Q-statistics for testing serial correlation. The test belongs
to the class of asymptotic (large sample) tests known as Lagrange multiplier (LM) tests.
The null hypothesis of the LM test is that there is no serial correlation up to lag order p,
where p is a pre-specified integer.
The serial correlation LM test is available for residuals from least squares or two-stage
least squares. To carry out the test, click View and select Residual Tests and then Serial
Correlation LM Test… on the equation toolbar and specify the highest order of the
AR(autoregressive) or MA(moving average) process that might describe the serial
correlation. If the test indicates serial correlation in the residuals, LS standard errors are
invalid and should not be used for inference.
ARCH LM Test
This is a test of autoregressive conditional heteroscedasticity (ARCH) in the residuals. To carry out the test, click View, select Residual Tests and then ARCH LM Test… on the equation toolbar, and specify the number of lags to include in the test regression.
White's Heteroscedasticity Test
EViews reports two test statistics from the test regression. The F-statistic is an omitted variable test for the joint significance of all cross products, excluding the constant. It is presented for comparison purposes.
Notice from the earlier diagram that to carry out White's heteroscedasticity test, we first select View, then Residual Tests, and then White Heteroscedasticity. EViews has two
options for the test: cross terms and no cross terms. The cross terms version of the test is
the original version of White’s test that includes all of the cross product terms. However,
with many right-hand side variables in the regression, the number of possible cross
product terms becomes very large so that it may not be practical to include all of them.
The no cross terms option runs the test regression using only squares of the regressors.
Using the Durbin-Watson test, examine the presence of autocorrelation in the error term of the regression model.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
5.4 Diagnostic Analysis Using Stata
The Stata program has a number of ways of checking for the presence of errors in the estimated regression model. These include graphical examinations that can suggest whether a problem exists or not, and a number of other systematic tests, as discussed hereunder.
I. Test for Heteroscedasticity
Using Stata we can test for the presence of a heteroscedasticity problem in the regression model. To do this we can make use of a graphical test as well as a numerical test, as discussed below.
A. Graphing the residual-versus-fitted plot after regress
This helps to examine graphically whether there is a systematic relationship between the residuals and the fitted values (i.e. the estimated values of the dependent variable). If there is a systematic relationship between the two, this suggests the presence of a heteroscedastic variance of the error term. Note that such a result would indicate the importance of re-specifying the regression model.
Suppose the regression model is given by Yi = β0 + β1X1 + β2X2 + Ui. To check for the existence of the above-stated problem, we perform the following steps.
First: perform the regression using the command: regress Y X1 X2
Then: perform the graphical analysis using the command: rvfplot
To execute this, consider the following table, which reports the output (Y) measured in tons, the labor input (X1) measured in hours, and the capital input (X2) measured in machine hours for 10 firms of a textile industry.
Firms 1 2 3 4 5 6 7 8 9 10
Y 500 440 545 600 510 625 680 720 750 830
X1 1420 1600 1620 1600 1500 1700 1760 1700 1800 1500
X2 390 400 430 410 430 650 700 780 700 600
To conduct the graphical test, we should first perform the regression and obtain the result as follows (note that the command to do this is regress Y X1 X2).
Total | 139550.00 9 15505.5556 Root MSE = 80.868
-----------------------------------------------------------------------------------------------------------
y| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+---------------------------------------------------------------------------------------------
x1 | -.3390981 .3194092 -1.06 0.324 -1.094381 .4161846
x2 | .8439037 .2574407 3.28 0.014 .2351531 1.452654
_cons | 706.0358 427.3086 1.65 0.142 -304.3886 1716.46
------------------------------------------------------------------------------------------------------------
The interpretation of the results is left to the student. To conduct the graphical test, we
write the command: rvfplot. The result is given as follows.
[Figure 5.3: residual-versus-fitted plot produced by rvfplot; the residuals (from about -67.8 to 126.3) are plotted against the fitted values of Y (from about 501 to 788).]
Note that the above graph shows the relationship between the residuals of the regression and the fitted values of Y (the estimated Y, or Y-hat). As we can see from the graph, there is no systematic relationship (positive or negative) between the two. Thus, the result suggests that the variance of the residuals does not exhibit heteroscedasticity.
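For reference, the whole example can be collected into a short sketch: the ten observations from the table above are typed in with the input command, then the model is estimated and the plot drawn.
clear
input y x1 x2
500 1420 390
440 1600 400
545 1620 430
600 1600 410
510 1500 430
625 1700 650
680 1760 700
720 1700 780
750 1800 700
830 1500 600
end
regress y x1 x2
rvfplot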
B. Numerical test: the hettest command
Stata can very simply test for the presence of a heteroscedastic variance of the error term in the regression model. To perform this test we first estimate the regression model whose error variance we wish to test, and then write the following command:
hettest
Note that hettest performs the Cook and Weisberg (1983) test for heteroscedasticity. If no variable list is specified after the command hettest, the fitted values are used for the analysis. If, however, a variable list is specified, the variables listed are used for the computation.
Note that even though hettest was originally written following a 1983 article by Cook and Weisberg in the journal Biometrika, the same test was derived by Breusch and Pagan in Econometrica (1979). In fact, in econometrics, the test performed by hettest is known as the Breusch-Pagan test for heteroscedasticity. Thus students are advised to recall the approach employed in the Breusch-Pagan test for heteroscedasticity.
Consider the data on the 10 firms of the textile industry used earlier. After estimating the model Yi = β0 + β1X1 + β2X2 + Ui (which gives the result obtained earlier), we perform the test of heteroscedasticity as follows. Suppose we write the command hettest; then we get the following result.
hettest
Notice that our command is not followed by the list of explanatory variables used in the regression; in this case the fitted values are used for the analysis. The null hypothesis of the test is constant (homoscedastic) variance. Note that this test provides a chi-squared statistic together with the associated probability. As usual, the decision is to reject the null hypothesis if the chi-squared probability is less than the chosen level of significance. In our example, we cannot reject the null hypothesis even at the 10% (0.1)
significance level. Thus, the result points out that the regression model has a homoscedastic error variance. Recall that the graphical examination also suggested homoscedastic variance.
However, if a variable list is specified after the command hettest, the variables specified are used for the computation. In this case we obtain the result presented below.
hettest x1 x2
Notice that the Cook-Weisberg test for heteroscedasticity in this case used the variables X1 and X2 in its computation. Again, we cannot reject the null hypothesis even at the 10% (0.1) significance level, since the probability of the chi-squared test is far greater than 0.1. Therefore, the result points out that the regression model has a homoscedastic error variance.
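Pulling the commands together, a minimal sketch is as follows; note that in recent Stata releases the same test is invoked as estat hettest.
regress y x1 x2
hettest
hettest x1 x2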
II. Test for Omitted Variables (Ramsey RESET Test)
In regression analysis, one of the tests concerns the proper specification of the model. In this regard, Stata provides a test for omitted variables developed by Ramsey, known as the RESET test. To execute this test we first perform the regression estimation. Then, to check whether the model is correctly specified or not, we write the following command:
ovtest
Note that ovtest performs Ramsey's (1969) regression specification error test (RESET) for omitted variables. Recall that this test amounts to estimating y = xb + zt + u and then testing t = 0. If rhs is not specified after the command ovtest, powers of the fitted values are used for z. However, if rhs is specified, the powers of the right-hand-side (explanatory) variables are used in the test rather than the powers of the fitted values.
For example, consider the regression model given by Yi = β0 + β1X1 + β2X2 + Ui. To check whether there are omitted variables in the model, we write the following command after performing the regression estimation: ovtest, rhs. The result is presented as follows.
ovtest, rhs
Ramsey RESET test using powers of the independent variables
Ho: model has no omitted variables
F(6, 1) = 33.29
Prob > F = 0.1319
Note that the null hypothesis of the test is that the model has no omitted variables. As we can see from the probability of the F test, we cannot reject H0 even at the 10% (0.1) significance level. Therefore, the result points out that there is no statistical evidence to suggest that the model has omitted variables.
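In summary, a minimal sketch is the following; in recent Stata releases the same test is invoked as estat ovtest.
regress y x1 x2
ovtest
ovtest, rhs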
III. Test for Normality of the Residuals
This test determines whether the distribution of the residuals is statistically significantly different from a theoretical normal distribution. In Stata this can be done with the sktest command, which is a normality test based on skewness and kurtosis (a Kolmogorov-Smirnov comparison against a theoretical distribution is performed by the separate ksmirnov command). The test is applied to the residual variable as follows:
sktest resid
Then, by comparing the probability value with the selected level of significance, we arrive at our decision. We can examine this using the data on Y, X1 and X2 used earlier. After performing the regression estimation, writing the command sktest resid gives the following result.
sktest resid
Note from the above result that since the (joint) probability value exceeds even the 0.1 (10%) level, the distribution of the residuals of the estimated regression model is not significantly different from a normal distribution.
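Note that the residual variable must be generated before sktest can use it. A minimal sketch of the full sequence is given below; the name resid is just a label.
regress y x1 x2
predict resid, residuals
sktest resid
qnorm resid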
This comparison can also be made graphically by plotting the residuals against the inverse normal distribution (the command qnorm resid draws such a plot).
[Figure: normal quantile plot of the residuals against the inverse normal; the residuals range from about -95.2 to 126.3.]
In the above figure, the straight line represents the theoretical normal distribution, while the dots represent the residuals of the model stated earlier. The closer the dots lie to the straight line, the more normal the distribution is said to be.
Y 10.5 12.5 13 14 15.5 13 17 16 20 22
X1 100 70 60 80 150 120 110 90 130 140
X2 20 19 15 16 18 20 15 12 10 16
5.5 Summary
In this unit we examined the concept and approaches of diagnostic checking. Recall that
we said the task of the econometrician is not limited to performing regression estimation.
Rather several tests that ascertain the reliability of the model must be conducted.
Accordingly, we used EViews and Stata in the assessment of the reliability of the
estimates of the parameters. Recall that after the estimation of the parameters with the
method of ordinary least squares, we should assess the reliability of the estimates of the
parameters based on three types of criteria. This unit has focused on the econometric criteria.
Homoscedasticity is one of the important assumptions of the classical linear regression model. It says that the population disturbance terms Ui all have the same variance. It
suggests that the conditional variance of the dependent variable conditional upon the
given value of the explanatory variable remains the same regardless of the values taken
by the variable X. But, when the conditional variance of the dependent variable increases
as the value of the explanatory variable increases, then we say there is heteroscedasticity.
Note that the Breusch-Pagan test is one of the formal methods of detecting heteroscedasticity.
EViews provides tools for evaluating the quality of the specified model along a number of dimensions. EViews tests for the presence of autocorrelation using the Durbin-Watson (DW) statistic. Moreover, we have noted that EViews provides tests for autocorrelation, heteroscedasticity, and autoregressive conditional heteroscedasticity (ARCH) in the residuals from the estimated equation.
The Stata program also has a number of ways of checking for the presence of errors in the estimated regression model. These include graphical examinations that can suggest whether a problem exists or not, and a number of other systematic tests, including the test for a heteroscedastic error variance in the regression model and the test for proper specification of the model.
5.6 Answer to Check Your Progress
Answer to Check Your Progress 5.1
1. Assumption 1: Randomness of the random variable U. That is, its value is
unpredictable and hence depends on chance. Assumption 2: Zero mean of the random
variable U Assumption 3: The variance of each Ui is the same for all the Xi values. This is
known as the assumption of homoscedasticity Assumption 4: The values of each Ui are
normally distributed. Assumption 5: The values of Ui corresponding to Xi are independent
from the values of any other Uj corresponding to Xj. This is called the assumption of non-
autocorrelation or serial independence of the U's. Assumption 6: Every disturbance term
Ui is independent of the explanatory variables. Assumption 7: No errors of measurement
in the X's Assumption 8: The explanatory variables are not perfectly linearly correlated.
This is called the assumption of no perfect multicollinearity between the X's.
Assumption 9: The model has no specification error. That is all the important explanatory
variables appear explicitly in the function and the mathematical form is correctly
specified.
2. The consequences are:
the previous values of the disturbances have some information to convey about the current disturbances;
the variance of the random term U may be seriously underestimated;
the prediction based on ordinary least squares estimates will be inefficient with autocorrelated errors.
3. The concept of heteroscedasticity is that the conditional variance of the dependent variable changes as the value of the explanatory variable changes.
4. The problems associated with a high degree of multicollinearity are:
OLS estimation may not be precise;
the confidence intervals tend to be much wider, which may affect hypothesis testing
regarding the regression coefficients;
the test statistics which are important for conducting hypothesis testing tend to be
statistically insignificant;
although the test statistics are statistically insignificant, the overall measure of
goodness of fit, R2, can be very high.
5.7 Model Examination
Year GDP INV SAV Expo
1967 6427.78 859.576 633.688 638.531
1968 6874.17 755.484 631.362 710.507
1969 7872.84 831.699 539.167 785.007
1970 8308.34 808.716 321.569 809.123
1971 9286.52 1036.33 521.99 881.099
1972 9865.43 1266.31 612.063 1130.4
1973 10079 1366.84 763.71 1072.35
1974 10635.8 1456.64 630.63 1007.17
1975 11775.4 1435.7 644.66 1064.85
1976 10987.6 1850.69 890.55 1164.87
1977 13026.5 1394.02 368.29 1057.07
1978 13575.2 2225.63 1171.05 1271.73
1979 14391 2244.65 1093.06 1186.84
1980 14970.5 3060.51 1867.45 1205.37
1981 15742.1 2269.23 1399.77 1422.8
1982 16825.7 2100.49 1335.22 1295.04
1983 19195.3 1996.38 660.39 1062.21
Suppose the following regression model is estimated using the above data, with GDP as
the dependent variable:
GDPt = β0 + β1INVt + β2SAVt + β3EXPOt + Ut
1. Using EViews test for the presence of autocorrelation problem in the disturbance term.
2. Using correlograms and Q-statistics, comment on the autocorrelation result of the error
term.
3. Using Stata, test for the presence of homoscedastic variance of the error term.
4. Test for omitted variables in the regression model using Ramsey RESET test
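As a guide, the following Stata lines sketch one way of carrying out the four tests; the
variable names (year, gdp, inv, sav, expo) are assumptions about how the data were
entered.
    tsset year                 // declare the time variable
    regress gdp inv sav expo   // estimate the model by OLS
    estat dwatson              // Durbin-Watson test for autocorrelation (question 1)
    predict uhat, residuals    // save the residuals
    corrgram uhat              // correlogram and Q-statistics (question 2)
    estat hettest              // Breusch-Pagan test for heteroscedasticity (question 3)
    estat ovtest               // Ramsey RESET test for omitted variables (question 4)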
Unit Six: SPSS
6.0 Objective
6.1 Introduction
6.2 Components of SPSS
6.3 Data Entry, Operation and Transformation
6.4 Statistical Estimation and Graphing
6.5 Econometric Estimation
6.6 Summary
6.0 Objective
The objective of this unit is to familiarize the student with the basic approach of SPSS.
After completing this unit the student will be able to
Understand the functions of the SPSS windows and their components
Understand the process of data entry, operations, graphics and estimations
6.1 Introduction
Recall from the unit one discussion that EViews and Stata are not the only packages used
in data analysis; there are a number of other softwares designed to perform several kinds
of data analysis and modeling. Accordingly, this unit will briefly examine the application
of SPSS in this regard. SPSS is a comprehensive and flexible statistical analysis and data
management system. SPSS can take data from almost any type of file and use them to
generate tabulated reports, charts, plots of distributions and trends, and descriptive
statistics, and to conduct complex statistical analyses. SPSS is available on several
platforms. This unit makes introductory remarks on the more general and important
issues.
6.2 Components of SPSS
SPSS for Windows brings the full power of the mainframe version of SPSS to the
personal computer environment. The following discussion briefly explains the
components of the SPSS window.
A. Windows in SPSS
There are a number of different types of windows that you will see in SPSS. They are
described as follows.
Data Editor window
This window displays the contents of the data file. You may create new data files or
modify existing ones with the Data Editor. The Data Editor window opens automatically
when you start an SPSS session.
Viewer window
The Viewer window displays the statistical results, tables, and charts from the analysis
you performed (e.g., descriptive statistics, correlations, plots, charts). A Viewer window
opens automatically when you run a procedure that generates output. In the Viewer
window, you can edit, move, delete, and copy your results in a Microsoft Explorer-like
environment.
Draft Viewer window
You can display output as simple text (instead of interactive pivot tables) in the Draft
Viewer.
Pivot Table Editor window
Output displayed in pivot tables can be modified in many ways with the Pivot Table
Editor. You can edit text, swap data in rows and columns, add color, create
multidimensional tables, and selectively hide and show results.
Chart Editor window
You can modify and save high-resolution charts and plots in chart windows. You can
change the colors, select different type fonts or sizes, switch the horizontal and vertical
axes, rotate 3-D scatter plots, and even change the chart type.
Text Output Editor window
Text output not displayed in pivot tables can be modified with the Text Output Editor.
You can edit the output and change font characteristics (type, style, color, size).
Syntax Editor window
You can paste your dialog box choices into a Syntax Editor window, where your
selections appear in the form of command syntax. You can then edit the command syntax
to utilize special features of SPSS not available through dialog boxes. If you are familiar
with SPSS software under other operating systems (e.g., Unix), you can open up a Syntax
Editor window and enter SPSS commands exactly as you did under those platforms and
execute the job. You can save these commands in a file for use in subsequent SPSS
sessions.
Script Editor window
Scripting and OLE automation allow you to customize and automate many tasks in SPSS.
Use the Script Editor to create and modify basic scripts.
If you have more than one open Viewer window, output is routed to the designated
Viewer window. If you have more than one open Syntax Editor window, command
syntax is pasted into the designated Syntax Editor window. (Paste feature will be
explained later.) The designated windows are indicated by an exclamation point (!) in the
status bar at the bottom of each SPSS window. You can change the designated window at
any time by selecting it (making it active) and clicking the highlighted pushbutton on the
toolbar. An active window is the currently selected window which appears in the
foreground. An active window may not be a designated window until you instruct SPSS
to make it a designated window (by clicking the icon on the toolbar).
B. Menus in SPSS
Many of the tasks you may want to perform with SPSS start with menu selections. Each
window in SPSS has its own menu bar, with menu selections appropriate for that window
type, and an associated toolbar; the Data Editor window is one example. Note that most
menus are common to all windows, while some are found only in certain types of
windows.
I. Common menus
File
Use the File menu to create a new SPSS system file, open an existing system file, read in
spreadsheet or database files created by other software programs (you can read data into
SPSS from any database format for which you have an ODBC [Open Database
Connectivity] driver), read in an external ASCII data file from the Data Editor; create a
command file, retrieve an already created SPSS command file into the Syntax Editor;
open, save, and print output files from the Viewer and Pivot Table Editor; and save chart
templates and export charts in external formats in the Chart Editor, etc.
Edit
Use the Edit menu to cut, copy, and paste data values from the Data Editor; modify or
copy text from the Viewer or Syntax Editor; copy charts for pasting into other
applications from the Chart Editor, etc.
View
Use the View menu to turn toolbars and the status bar on and off, and turn grid lines on
and off from all window types; and control the display of value labels and data values in
the Data Editor.
Analyze
This menu is selected for various statistical procedures such as crosstabulation, analysis
of variance, correlation, linear regression, and factor analysis.
Graphs
Use the Graphs menu to create bar charts, pie charts, histograms, scatterplots, and other
full-color, high-resolution graphs. Some statistical procedures also generate graphs. All
graphs can be customized with the Chart Editor.
Utilities
Use the Utilities menu to display information about variables in the working data file and
control the list of variables from all window types; change the designated Viewer and
Syntax Editor, etc.
Window
Use the Window menu to switch between SPSS windows or to minimize all open SPSS
windows.
Help
This menu opens a standard Microsoft Help window containing information on how to
use the many features of SPSS. Context-sensitive help is available through the dialog
boxes.
II. Data Editor menus
Data
Use the Data menu to make global changes to the data file, such as sorting cases, merging
files, or selecting subsets of cases for analysis. These changes are only temporary and do
not affect the permanent file unless you save the file with the changes.
Transform
Use the Transform menu to make changes to selected variables in the data file and to
compute new variables based on the values of existing ones. These changes are
temporary and do not affect the permanent file unless you save the file with changes.
III. Chart Editor menus
Format
Use the Format menu to select fill patterns, colors, line styles, bar styles, bar label styles,
interpolation types, and text fonts and sizes. You can also swap the axes of plots, explode
one or more slices of a pie chart, change the treatment of missing values in lines, and
rotate 3-D scatterplots.
C. Toolbars
Each SPSS window has its own toolbar that provides quick and easy access to common
tasks. Tool Tips provide a brief description of each tool when you put the mouse pointer
on the tool. For example, the toolbar of the Syntax Editor window shows a Run tool tip
when the mouse pointer is placed on the run icon.
D. Status bar
A status bar at the bottom of the SPSS application window indicates the current status of
the SPSS processor. If the processor is running a command, it displays the command
name and a case counter indicating the current case number being processed. When you
first begin an SPSS session, the status bar displays the message Starting SPSS Processor.
When SPSS is ready, the message changes to SPSS Processor is ready. The status bar
also provides information such as command status, filter status, weight status, and split
file status. In a Viewer window, for example, the status bar shows whether that Viewer
window is the designated output window and whether the SPSS processor is ready to run.
Note that we can personalize our SPSS session by altering the default Options settings.
Select Edit/Options...
Click the tabs for the settings you want to change.
Change the settings.
Click OK or Apply.
For example, within variable list boxes in dialogs, you have the option of displaying
either the variable name alone or the entire variable label (up to 256 characters).
6.3 Data Entry, Operation and Transformation
Suppose you have three test scores collected from a class of 10 students (5 males and 5
females) during a semester. Each student was assigned an identification number. The
information you have for each student is the identification number, the student's gender,
and the scores for test one, test two, and test three (the full data set is displayed toward
the end of this section for you to view). Your first task is to present the data in a form
acceptable to SPSS for processing.
SPSS uses data organized in rows and columns. Cases are represented in rows and
variables are represented in columns. A case contains information for one unit of analysis
(e.g., a person, an animal, a machine). Variables are information collected for each case,
such as name, score, age, income, or educational level. In our example, each of the 10
students is a case, and there is one variable for each piece of information collected about
a student.
In SPSS, variables are named with eight or fewer characters. They must begin with a
letter, although the remaining characters can be any letter, any digit, a period, or the
symbols @, #, _, or $. Variable names cannot end with a period, and variable names that
end with an underscore should be avoided. Blanks and special characters such as &, !, ?,
', and * cannot be used in a variable name. Note that variable names are not case
sensitive. Each variable name must be unique; duplication is not allowed.
Following the conventions above, let us assign names to the variables in our data set: id,
sex, test1, test2, and test3. Once the variables are named according to SPSS conventions,
it is good practice to prepare a code book with details of the data layout: for each
variable, record its name, its position in the file, and the values it may take. Note that this
step is to present your data in an organized fashion; it is not mandatory for data analysis.
A code book becomes especially handy when dealing with a large number of variables.
The next issue is entering your data into the computer. There are several options. You
may create a data file using one of your favorite text editors, or word processing packages
(e.g., Word Perfect, MS-Word). Files created using word processing software should be
saved in text format before trying to read them into an SPSS session. You may enter your
data into a spreadsheet (e.g., Lotus 123, Excel, dBASE) and read it directly into SPSS for
Windows. Finally, you may enter the data directly into the spreadsheet-like Data Editor
of SPSS for Windows. In this document we are going to examine two of the above data
entry methods: using a text editor/word processor, and using the Data Editor of SPSS for
Windows. This is explained as follows.
Let us first look into the steps for using a text editor or word processor for entering data.
Note that if you have a data set with a limited number of variables, you may want to use
the SPSS Data Editor to enter your data. However, this example is for illustration
purposes. Open up your editor session, or word processing session, and enter the variable
values into appropriate columns as outlined in the code book. If you are using a word
processor, make sure to save your data in text format. Whichever style (format) you
choose, as long as you convey the format correctly to SPSS, it should not have any
impact on the analysis.
In many instances, you may have an external ASCII data file made available to you for
analysis. In such a situation, you do not have to enter your data again into the Data
Editor. You can direct SPSS to read the file from the SPSS Syntax Editor window.
Suppose you want to read a given file into SPSS from a Syntax Editor window and create
a system file. Creating a command file is a faster way to define your variables, especially
if you have a large number of variables. You may create a command file using your
favorite editor or word processor and then read it into a Syntax Editor window, or open a
Syntax Editor window and type in the command lines.
To read your already created command file into a Syntax Editor window
Select File/Open/Syntax...
Choose the syntax file (with .sps extension) you want to read and click Open
In the following example we are opening a new Syntax Editor window to enter the
following command lines.
Select File/New/Syntax
When the Syntax Editor window appears, type the appropriate command:
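For our sample data set, the command lines might read as follows; this is a sketch,
assuming the raw data were saved as a:\grade.dat in freefield format.
    DATA LIST FILE='a:\grade.dat' FREE
      / id (F2.0) sex (A1) test1 (F3.0) test2 (F3.0) test3 (F3.0).
    SAVE OUTFILE='a:\sample1.sav'.
The DATA LIST command names the variables and reads the raw data; the SAVE
OUTFILE command is the optional last line referred to below.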
Select Run/Selection. Alternatively, you can click the Run icon on the toolbar
The command file will read the specified variable values from the data file, and create a
system file, sample1.sav. Make sure you specify the pathname appropriately, indicating
the location of the external data file and where the newly created file is to be written.
However, you do not have to save a system file to do the analysis. This means the last
line is optional for data analysis. Every time you run the above lines, SPSS does create an
active file stored in the computer's memory. However, for large data sets, it will save
processing time if you save it as a system file and access it for analysis.
Using the Text Import Wizard is another way to direct SPSS to read an external ASCII
data file. Suppose you want to read the file grade.dat into SPSS with the Text Import
Wizard: select File/Read Text Data, choose the file, and follow the wizard's steps. The
data file is read into SPSS. We can save the data file as SAMPLE1.SAV.
Suppose you want to use the SPSS for Windows features for data entry. In that case, you
enter data directly into the SPSS spreadsheet-like Data Editor. This is convenient if you
have only a small number of variables. The first step is to enter the data into the Data
Editor window by opening an SPSS for Windows session. You will define your variables,
variable type (e.g., numeric, string), number of decimal places, and any other necessary
attributes while you are entering the data. In this mode of data entry, you must define
each variable in the Data Editor. You cannot define a group of variables (e.g., Q1 to Q10)
using the Data Editor. To define a group of variables, without individually specifying
them, you would use the Syntax window.
Let us start an SPSS for Windows session to enter the above data set. If you are using
your own PC, start Windows and launch SPSS.
This opens the SPSS Data Editor window (titled Untitled). The Data Editor window
contains the menu bar, which you use to open files, choose statistical procedures, create
graphs, etc. When you start an SPSS session, the Data Editor window always opens first.
You are ready to enter your data once the Data Editor window appears. The first step is to
enter the variable names that will appear as the top row of the data file. When you start
the session, the top row of the Data Editor window contains a dimmed var as the title of
every column, indicating that no data are present. In our sample data set, discussed above,
there are five variables named earlier as id, sex, test1, test2, and test3. Let us now enter
these variable names into the Data Editor.
To define the variables, click on the Variable View tag at the lower left corner of the Data
Editor window and:
Type in the variable name, id, at the first row under the column Name.
Press the Tab key to fill-in the variable's attributes with default settings.
Type in the variable name, sex, at the second row under the column Name.
Press the Tab key to fill-in the variable's attributes with default settings.
To modify the variable type, click on the icon in the Type column.
Select String by clicking on the circle to the left.
Define the remaining three numeric variables, test1, test2, and test3, the same way the
variable id was defined.
Click on the Data View tag. Now enter the data pressing [Tab] or the right arrow key
after each entry. After entering the last variable value for case number one use the arrow
key to move the cursor to the beginning of the next line. Continue the process until all the
data are entered.
After you have entered/read the data into the Data Editor, save it onto the diskette. Those
who are working from personally owned computers might want to save the file to the
hard disk.
Select Save... or Save As... from the File menu. A dialog box appears
In the box below File Name type a:\sample1.sav. You can use a longer file name;
for example, a:\first sample of data entry is a legitimate file name
Click OK
The data will be saved as an SPSS format file which is readable only by SPSS for
Windows. Note that the data file, grade.dat, you saved earlier and the file, sample1.sav,
you saved now are in different formats.
Even after saving the data file, the data will still be displayed on your screen. If not,
select sample1-SPSS Data Editor from the Window menu.
Before computing the descriptive statistics, we want to calculate the mean score from the
three tests for each student. To compute the mean score:
Select Compute... from the Transform menu. A dialog box appears
In the box below the Target Variable: type in average as the variable name you
want to assign to the mean score
Move the pointer to the box titled Numeric Expression: and type: mean (test1,
test2, test3)
Click OK
A new column titled average will be displayed in the Data Editor window with the values
of the mean score for each case. The number of decimal places in a newly created
variable can be tailored by selecting Edit/Options/Data/Display format for new numeric
variables prior to creating new variables. This display format setting affects the formats
of all new subsequent numeric variables.
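Equivalently, the same transformation can be run from the Syntax Editor; a minimal
sketch:
    * Compute the mean of the three tests for each case.
    COMPUTE average = MEAN(test1, test2, test3).
    EXECUTE.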
Suppose that you have the data set, sample1.sav, still displayed on your screen. If not,
select sample1 - SPSS Data Editor from the Window menu. The next step is to run some
basic statistical analysis with the data you entered. The commands you use to perform
statistical analysis are developed by simply pointing and clicking the mouse to
appropriate menu options. This frees you from typing in your command lines.
However, you may paste the command selections you made to a Syntax Editor window.
The command lines you paste to the Syntax Editor window may be edited and used for
subsequent analysis, or saved for later use. Use the Paste pushbutton to paste your dialog
box selections into a Syntax Editor window. If you don't have an open Syntax Editor
window, one opens automatically the first time you paste from a dialog box. Click the
Paste button only if you want to view the command lines you generated. Once you click
the Paste pushbutton the dialog selections are pasted to the Syntax Editor window, and
this window becomes active. To execute the pasted command lines, highlight them and
click run. You can always get back to the Data Editor window by selecting sample1-
SPSS Data Editor from the Window menu.
a) Frequencies
To obtain frequency counts for a variable, say sex, select Analyze/Descriptive
Statistics/Frequencies..., highlight the variable in the left box, and click the arrow button.
Now the selected variable appears in a box on the right and disappears from the left box.
Note that when a variable is highlighted in the left box, the arrow button is pointed right
for you to complete the selection. When a variable is highlighted in the right box, the
arrow button is pointed left to enable you to deselect a variable (by clicking the button) if
necessary. If you need additional statistics besides the frequency count, click the
Statistics... button at the bottom of the screen. When the Statistics... dialog box appears,
make appropriate selections and click Continue. In this instance, we are interested only in
frequency counts.
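Pasting this choice produces command syntax along the lines of:
    FREQUENCIES VARIABLES=sex.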
b) Descriptives
Our next task is to run the DESCRIPTIVES procedure on the four continuous variables in
the data set. Select Analyze/Descriptive Statistics/Descriptives.... A dialog box appears.
Names of all the numeric variables in the data set appear on the left side of the dialog
box.
Click the variable average and click the arrow button to the right of the selected
variable
Do the same thing for the variables test1 through test3
Now the selected variables appear in the box on the right and disappear from the box on
the left.
The mean, standard deviation, minimum, and maximum are displayed by default. The
variables are displayed, by default, in the order in which you selected them. Click
Options... for other statistics and display order.
Click OK
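The equivalent command syntax is roughly:
    DESCRIPTIVES VARIABLES=average test1 test2 test3
      /STATISTICS=MEAN STDDEV MIN MAX.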
c) Means
Suppose you want to obtain the above results for males and females separately. The
MEANS procedure displays means, standard deviations, and group counts for dependent
variables based on grouping variables. In our data set sex is the grouping variable and
test1, test2, test3, and average are the dependent variables.
To run the Means procedure:
Select Analyze/Compare Means/Means..., move test1, test2, test3, and average into
the Dependent List and sex into the Independent List, and click Options...
Select Mean, Number of cases, and Standard Deviation. Normally these options
are selected by default. If any other options are selected, deselect them by clicking
them
Click Continue
Click OK
The output will be displayed on the Viewer screen.
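In command syntax this corresponds to something like:
    MEANS TABLES=test1 test2 test3 average BY sex
      /CELLS=MEAN COUNT STDDEV.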
There may be other situations in which you want to select a specific category of cases
from a grouping variable (e.g., ethnic background, socio-economic status, education). To
do so, choose Data/Select Cases... to select the cases you want and do the analysis (e.g.,
from the grouping variable educate, select cases without a college degree). However,
make sure you reset your data if you want to include all the cases for subsequent data
analysis. If not, only the selected cases will appear in subsequent analysis. To reset your
data choose Data/Select Cases.../All Cases, and click OK.
d) SPSS Output
When you run a procedure in SPSS, the results are displayed in the Viewer window in the
order in which the procedures were run. In this window, you can easily navigate to
whichever part of output you want to see. You can also manipulate the output and create
a document that contains precisely the output you want, arranged and formatted
appropriately. You can use the Viewer to browse results, show or hide selected tables and
charts, rearrange the output, and move items between SPSS and other applications.
The Viewer is divided into two panes. The left pane contains an outline view of the
output contents. The right pane contains statistical tables, charts, and text output. You can
use the scroll bars to browse the results, or you can click an item in the outline to go
directly to the corresponding table or chart.
Suppose you want to copy the Descriptives table into another Windows application, such
as a word processing program or a spreadsheet. Click the table once to select it in the
Viewer, choose Edit/Copy, and paste it into the other application.
Much of the output in SPSS is presented in tables that can be pivoted interactively. You
can rearrange the rows, columns, and layers. To edit a pivot table, double-click the pivot
table and this activates the Pivot Table Editor. Or click the right mouse button on the
pivot table and from the context menu, choose SPSS Pivot Table Object/Open and the
pivot table will be ready to edit in its own separate Pivot Table Editor window. The
second feature is especially useful for viewing and editing a wide and long table that
otherwise cannot be viewed at a full scale.
Printing the Output
Once you are satisfied with your analysis you may want to obtain a hard copy of the
output. You may print the entire output on the Viewer window, or delete the sections you
do not want before you print. Or you can save the output to a diskette or hard drive and
print it later. In this case, let us print the entire output which is on the Viewer window. It
is assumed that there is a printer attached to your PC or you are working from a Student
Technology Center. Make sure that you are at the Viewer window by selecting Output1-
SPSS Viewer from the Window menu. If you open multiple Viewer windows, select the
output you want to print:
Select Edit/Options...
Click Viewer from the SPSS Options dialog box
Click Infinite for the Length under the Text Output Page Size parameter
Click OK
Click File/Print
Click OK
The contents of the output window will be directed to the printer. To save paper, choose
Infinite option for the Length. You can also control the page width by changing Width.
For some procedures, however, some statistics are only displayed in wide format.
So far, we've used SPSS to develop a basic idea about how SPSS for Windows works.
The next step is to examine a few other data analysis techniques (CORRELATIONS,
REGRESSION, T-TEST, ANOVA). All the statistical procedures available under a mini
or mainframe version of SPSS are available from SPSS for Windows. Refer to the vendor
documentation for the most complete information.
Correlation analysis
Select Analyze/Correlate/Bivariate... This opens the Bivariate Correlations dialog
box. The numeric variables in your data file appear on the source list on the left
side of the screen.
Select compopi, compscor, mathatti and mathscor from the list and click the
arrow box. The variables will be pasted into the selection box. The options
Pearson and Two-tailed are selected by default.
Click OK
A symmetric matrix with Pearson correlation as given below will be displayed on the
screen. Along with Pearson r, the number of cases and probability values are also
displayed.
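If you paste these choices, the resulting command syntax looks roughly like:
    CORRELATIONS
      /VARIABLES=compopi compscor mathatti mathscor
      /PRINT=TWOTAIL NOSIG.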
Note that a correlation coefficient tells you that some sort of relation exists between the
variables, but it does not tell you much more than that. For example, a correlation of 1.0
means that there exists a positive linear relationship between the two variables, but it does
not say anything about the form of the relation between the variables. When the
observations are not perfectly correlated, many different lines may be drawn through the
data. To select the line that describes the data as closely as possible, you employ
regression analysis, which is based on the least-squares principle. In the following task
you will perform a simple regression analysis with compscor as the dependent variable
and mathscor as the independent variable.
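The task can be run from Analyze/Regression/Linear..., with compscor as the Dependent
and mathscor as the Independent variable; the corresponding command syntax is roughly:
    REGRESSION
      /DEPENDENT compscor
      /METHOD=ENTER mathscor.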
One-way Analysis of Variance
The statistical technique used to test the null hypothesis that several population means are
equal is called analysis of variance. It is called that because it examines the variability in
the sample and, based on that variability, determines whether there is reason to believe
the population means are not equal. The statistical test for the null hypothesis that all of
the groups have the same mean in the population is based on computing the ratio of the
between-group and within-group variability estimates, called the F statistic. A significant
F value only tells you that the population means are probably not all equal. It does not tell
you which pairs of groups appear to have different means. To pinpoint exactly where the
differences are, multiple comparisons may be performed.
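In SPSS the procedure is run from Analyze/Compare Means/One-Way ANOVA..., or
with command syntax such as the following sketch (grouping our test scores by sex
purely for illustration):
    ONEWAY test1 BY sex.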
The explanation and the topics covered in this section illustrate some of the basic
features of SPSS for Windows. Examining additional features of SPSS for Windows is
beyond the scope of this section. For further help, refer to the SPSS for Windows
documents.
Unit Seven: PCGIVE and LIMDEP
7.0 Objective
7.1 Introduction
7.2 A Brief Introduction to PCGIVE
7.3 A Brief Introduction to LIMDEP
7.4 Summary
7.0 Objective
The objective of this unit is to introduce briefly the basics of LIMDEP and PCGIVE.
After completing this unit the student will:
Be able to enter and transform data using LimDep and PcGive
Understand the steps in making graphs and performing statistical and econometric
analysis
7.1 Introduction
This last unit briefly explains two software packages, LimDep and PcGive, both
developed by well-known econometricians. While PcGive is usually preferred and used
in advanced time series econometrics, LimDep is best suited for econometric analyses
that make use of cross-section data and qualitative dependent variables. Note, however,
that there are many issues that we do not address here. This section serves as a starting
point for introducing PcGive and LimDep.
7.2 A Brief Introduction to PCGIVE
When you first load PcGive, you get the initial GiveWin window. Note that loading the
database is done by simply choosing "Open..." under "File" and then finding the
designated file path. When the database has been opened, a window opens in GiveWin
where you can see the "raw" data and do simple editing of it. If the database has been
modified, you may wish to save the changes; this can be done by choosing "Save..."
under "File".
Note that it is also possible to enter data yourself. To do so, choose File/New, then
Database, and then choose the appropriate frequency. This will create an empty
spreadsheet. Double-click the top of the first column to give it a name: in Variable name,
type the name of the variable, such as "x". We can call the next column "y". Note that the
spreadsheet is filled with "missing" as we haven't entered any data yet. Go to a cell
marked "missing", double-click or press Enter, and type the value; continue until all the
data are entered, and save them.
To derive any new series, based on those already in the database, use the Calculator.
Select Tools | Calculator and then use the keys to create the new series. For example, to
create GDPdefl = GDPcurr/GDP95 you can click on the names of the series in the
database and the calculator button for / to get the formula. Then click on the = button
and type the name you require for the new series (GDPdefl). It will be added to the
database. Note: until you resave the database, the new series will not be permanently
added to the file. Thus, we suggest that after each set of transformations is complete you
resave the file.
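The same transformation can also be typed as a line of code in GiveWin's Algebra editor
(under Tools); a minimal sketch, assuming the series GDPcurr and GDP95 already exist
in the database:
    GDPdefl = GDPcurr / GDP95;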
Note that plots by GiveWin are shown in the window "GiveWin Graphics". If you make
several graphs, GiveWin will place them all in the same window. This can be avoided by
choosing "Keep Graph" under "View"; this forces GiveWin to open a new graphics
window the next time a graph is made. The graphs can be printed directly from
GiveWin, or may be exported to your favorite software program. Export to Windows
programs such as Word can easily be done using "Cut" and "Paste". Otherwise, GiveWin
allows you to save the graphs. This can be done by activating the graph window of
interest and then choosing "Save as..." under "File".
Finally, it should be noted that graphs can be edited in GiveWin either by choosing
"Edit" and then "Edit Graph", or simply by double-clicking the graph.
1. Start the PcGive module from within GiveWin (via the Modules menu).
2. In PcGive, select Model/Formulate... and choose the dependent (endogenous)
variable for the model.
3. In the lower right corner you can choose the number of lags which are to be
included in the model.
4. In the box to the right, the list of the available variables in the current database
is shown. Highlight the variable and click on the "<<Add" button.
5. In the box to the left, the model is shown. PcGive automatically adds a constant to
the model. If this had not been in our interest, it could be removed: highlight
"Constant" and click "Delete".
6. Next add the explanatory variables to the model. Notice that there can only be one
endogenous variable in PcGive; for two or more endogenous variables we need
PcFiml. To accept the model, click "OK".
7. PcGive now asks you to choose the method of estimation. Choose "OLS"
(ordinary least squares), which is the standard choice and, as you know, gives the
maximum likelihood estimators conditional on the initial observation. You have now
estimated the parameters of the model: the results of the estimation are reported in
the "Results" window in GiveWin.
7.3 A Brief Introduction to LIMDEP
1. Introduction
Note that LimDep opens multiple windows as it proceeds. When you start LimDep from
your desktop, LimDep starts with a window called Untitled Project. A project consists of
the data that you are going to analyze, the results of your analysis, and the procedures
that you have used. LimDep only allows one active project at a time.
2. Inputting Data
As we know from the foregoing units, in order to conduct any kind of data analysis, we
need to have data available in the spreadsheet under consideration. Note that there are
several ways of inputting data into the LimDep spreadsheet. The various approaches in
this regard are briefly explained as follows.
I. Via Stat/Transfer
The easiest way of inputting data into LimDep for Windows is to convert your data file
into LimDep for Windows format (with file extension .lpj) using Stat/Transfer. From
within LimDep you can simply open the data file from the File menu. One subtle
point is that the LimDep data format for Windows is different from that of LimDep for
Unix, and Stat/Transfer only handles LimDep for Windows. If you have a dataset in
LimDep for Unix format, you need first to convert it into ASCII format within LimDep
for Unix and then convert it into LimDep for Windows format using Stat/Transfer.
II. Excel Data
It is very easy to read an Excel data file into LimDep for Windows. We can simply issue
the command Import Variables from the pull-down menu Project and open our Excel file
from there. The only thing to keep in mind is that LimDep only reads Excel 4.0 (or 3.0)
files. If you have a newer-version Excel file, you first have to open it in Excel and save it
to an older version.
III. ASCII Data
If your data file is in ASCII format, you can use the LimDep command READ to read in
the data file. The ASCII data file can have commas, tabs, or simply spaces as delimiters.
Missing values can be coded as "."
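For example, a READ command along the following lines would load our earlier grade
data (the file name and path are illustrative assumptions):
    ? Read 10 observations on 5 variables from an ASCII file.
    READ ; File = "c:\data\grade.dat" ; Nobs = 10 ; Nvar = 5
         ; Names = id, sex, test1, test2, test3 $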
There are two ways of issuing a command within LimDep. One way is via a command
dialog box where one can type a command such as READ to read in a data file. The other
way is via a command window where one can enter multiple commands and save them as
a program file for later use. Let's focus on the second way here. We first need to open a
command window. This is done from the File menu by choosing the New option. A
dialog window will pop up asking for the type of new window we want to open. Choose
Text/Command Document and click on OK. Now we can enter our commands in the
command window, highlight the commands we want to run, and run them from the Run
menu. Note that creating a new variable in LimDep can also be done through the Project
pull-down menu.
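In command form, a new variable is created with CREATE; for example, a sketch using
our illustrative test-score variables:
    ? Mean of the three test scores for each observation.
    CREATE ; average = (test1 + test2 + test3) / 3 $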
3. Running Analysis
LimDep conducts a number of estimations. In this brief discussion we highlight the
approaches to formulating descriptive and simple regression analyses.
Descriptive Statistics
One way of checking that we have input our data correctly is to run some descriptive
statistics on the data set. It is fairly straightforward to do so in LimDep. Descriptive
Statistics is under Data Description in the Model pull-down menu. A window with a
list of variables will pop up so you can choose the variables. Note that the variable
ONE is created by LimDep to be the constant 1 and is used if we want to include the
constant term in statistical analysis.
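The command equivalent is DSTAT; a minimal sketch with the illustrative test-score
variables:
    ? Descriptive statistics.
    DSTAT ; Rhs = test1, test2, test3 $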
Most of the statistical analysis can be done through the Model pull-down menu. For
example, we can run a number of regression estimations.
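For example, an ordinary least squares regression is specified with REGRESS, including
the constant term through the variable ONE mentioned above (variable names again
illustrative):
    ? OLS regression of test3 on a constant, test1 and test2.
    REGRESS ; Lhs = test3 ; Rhs = one, test1, test2 $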
Running analysis
After the model is formulated, clicking on the Run button gives the result. Note that we
can save the output as a LimDep file with extension .lim, or we can simply copy and
paste the result into a Word document.
Note, however, that there are a lot of issues that we do not address here. This section
serves as a starting point for anyone who does not know LimDep for Windows at all but
may want to run some statistical analysis with LimDep. With LimDep one can indeed run
very sophisticated statistical analyses, but that is beyond the scope of this section.
7.4 Summary
This unit briefly discussed two software packages, LimDep and PcGive, that were
developed by well-known econometricians. PcGive is well suited for analyzing
multivariate and univariate autoregressive processes. To enter, load, and transform data
or produce pre-estimation graphs, we use GiveWin; however, we should open the PcGive
module to formulate, estimate, and test regression models. On the other hand, we noted
from the foregoing discussion that the LimDep program provides parameter estimation
for linear and nonlinear regression models as well as qualitative and limited dependent
variable models.