Computer Application in Economics

GENERAL INTRODUCTION

Dear Students, welcome to the "Computer Application in Economics" course. The importance of studying computer-based data analysis has increased dramatically in recent years. Statisticians and econometricians today rely heavily on computer programs. Students of economics must therefore learn how to conduct data analysis with the aid of specialized computer programs so that they can use these skills in their own data-based analysis.

This distance material introduces the basic steps in computer-based analysis. At the same time, it provides a sound foundation in basic statistics and econometrics. The course motivates students by providing the techniques and methods of computer-based data analysis.

This distance material presupposes that students taking the course Computer Application in Economics have a basic background in MS Windows programs, notably Excel. The material attempts to make the reader well versed in various ways of computer-based data analysis through different examples, explanations, activities and self-assessment questions. The distance material contains a total of seven units whose contents are briefly discussed hereunder.

Unit One deals with the basic introduction. In this unit, students will be made aware of the importance of computer-based analysis and will identify the major economic software packages.

Unit Two deals with data management. In this unit, students learn how to operate the different software packages, enter data from various sources, identify various ways of transforming data, and perform tests before estimation is made.

Unit Three deals with statistical estimation and graphical analysis. In this unit, students will learn several statistics-related concepts, compute various statistics and analyze them, and draw graphs and analyze the results.

Unit Four deals with econometric estimation and analysis. This unit emphasizes the various ways of performing estimation using the EViews and Stata programs. Moreover, it shows how to interpret the results obtained from the estimation.

Unit Five deals with diagnostic tests. In this unit, students will be made aware of the basic concepts and approaches for checking the results of a regression model and will identify the concept of diagnostic checking. This includes explaining the sources of each problem, the detection mechanisms, and the appropriate solutions. Students will learn how to test for the presence of these problems using EViews and Stata.

Unit Six briefly introduces the SPSS software, while Unit Seven briefly introduces the PcGive and LimDep programs.

General Course Description


After studying this course, you should be able to:
 Formulate a regression model using specialized computer programs.
 Estimate the parameters of a regression model using computer-based methods.
 Compute simple and multiple correlation coefficients and interpret the results.
 Compute computer-based partial correlation coefficients and interpret the results.
 Conduct individual and overall significance tests using various computer packages.
 Construct confidence intervals for population parameters with the help of computer programs.
 Conduct computer-based diagnostic checking on estimated regression results.
 Estimate nonlinear regression models and interpret the results.

TABLE OF CONTENTS

UNIT ONE: Introduction

UNIT TWO: Data Management

UNIT THREE: Statistical Estimation and Graphical Analysis

UNIT FOUR: Econometric Estimation and Analysis

UNIT FIVE: Diagnostic Tests

UNIT SIX: Introduction to SPSS

UNIT SEVEN: Introduction to PCGIVE and LIMDEP

Unit One: Introduction

1.0 Objective
1.1 Introduction
1.2 The Need for Computer Application in Economic Analysis
1.3 Major Economic Software
1.4 The Nature, Types and Sources of Data
1.5 Summary
1.6 Answers to Check Your Progress
1.7 Model Examination

1.0 Objective
The aim of this unit is to introduce the student to basic concepts related to computer applications in economic analysis. After completing this unit, the student will be able to:
 Understand the importance of computer-based analysis
 Identify the major economic software packages
 Understand the different types and nature of data.
1.1 Introduction
Whenever we talk about economics, what comes to the picture are the three basic problems that we face every day. These are: (i) What goods and services should be produced, and in what amounts? (ii) How should those goods and services be produced? and (iii) For whom should they be produced?

Note that these questions are universal problems because human wants are practically
unlimited, but all societies have only limited quantities of resources that can be used to
produce goods or services.

In this connection, the two main branches of economic analysis are microeconomics and
macroeconomics. Microeconomics is concerned with the behavior of individual firms,
industries and consumers (or households). Note that microeconomics deals with the problems of resource allocation, considers problems of income distribution, and is chiefly
interested in the determination of the relative prices of goods and services.

On the other hand, macroeconomics concerns itself with large aggregates, particularly for the economy as a whole. It deals with the factors which determine national output and employment, the general price level, total spending and saving in the economy, total imports and exports, and the demand for and supply of money and other financial assets.

To examine the relationships between variables, to formulate policies, and/or to criticize policies, we have to collect data and perform the appropriate analysis. In this regard, computer-based analysis is as important as quality data in obtaining correct results. This is because computer-based analysis:
 helps to attain efficient results;
 facilitates simple and comprehensive data analysis within a short period of time;
 produces results that are accurate and easily readable.
Therefore, learning the techniques of the various statistical and econometric software packages is crucial to come up with accurate results.
1.2 The Need for Computer Application in Economic Analysis

Personal computers, spreadsheets, professional statistical packages, and other information technologies are now everywhere in data analysis. Without using these tools, one cannot perform any realistic statistical data analysis on large data sets.

Statistical and econometric software systems are used to understand existing concepts and to find new properties. On the other hand, new developments in the process of decision making under uncertainty often motivate the development of new approaches and the revision of existing software systems. Statistical and econometric software systems rely on the cooperation of statisticians, econometricians and software developers. Note that without a computer one cannot perform any realistic data analysis on a large data set.

Some software packages are widely used and are powerful tools for data analysis with excellent data management capabilities. Overall, there are over 400 statistical packages; however, a working familiarity with some of the major statistical systems will carry over easily to other environments. The packages discussed below are professional statistical and econometric packages that are in widespread use internationally.

1.3 Major Economic Software


Over the last 30 years, a number of software packages for data analysis have been produced and widely applied. These packages have shown an astonishing development in both depth and simplicity, which has contributed a great deal to the development and efficiency of economic analysis. Many of the programs that are widely accepted and applied use a user-friendly approach. The major software packages for data analysis include Stata, EViews, SPSS, LimDep, PcGive, GiveWin and others. The following is a very brief discussion of some of these major packages.

EVIEWS: EViews provides sophisticated data analysis, regression, and forecasting tools
on Windows-based computers. With EViews, we can quickly develop a statistical
relation from our data and then use the relation to forecast future values of the data. Areas
where EViews can be useful include scientific data analysis and evaluation, financial
analysis, macroeconomic forecasting, simulation, sales forecasting, and cost analysis.
EViews was developed by economists and most of its uses are in economics. However,
there is nothing in its design that limits its usefulness only to economic time series. Even
quite large cross-section projects can be handled in EViews.

STATA: Stata is a modern and very powerful program especially designed for data management, statistical and econometric analysis, as well as graphics. It is mainly a command-driven program, providing sufficient flexibility to meet different users' needs.
SPSS: This program provides a comprehensive and flexible statistical analysis and data management system. It can generate tabulated reports, charts, and plots of distributions and trends, compute descriptive statistics, and conduct complex statistical analyses. Note that SPSS for Windows provides a user interface that makes statistical analysis more intuitive for all levels of users. Simple menus and dialog box selections make it possible to perform complex analyses without typing a single line of command syntax. The built-in SPSS Data Editor offers a simple and efficient spreadsheet-like utility for entering data and browsing the working data file.

PCGIVE: PcGive is an econometric software package for econometric modelling written by the renowned econometricians David Hendry and Jurgen Doornik of the University of Oxford. It is a computer program well suited for analyzing multivariate and univariate autoregressive processes.

LIMDEP: LimDep is econometric software developed by the well-known econometrician William H. Greene. The name LimDep is derived from LIMited DEPendent models. It provides parameter estimation for linear and nonlinear regression models and for qualitative and limited dependent variable models for cross-section, time series, and panel data. Its primary strength is its specialized and relatively advanced microeconometric analysis.

1.4 The Nature, Types and Sources of Data

A. The Nature and Types of Data
Data is the source of information. This information can be collected as qualitative or quantitative data. Qualitative data, such as the sex or educational level of individuals, is not computable by arithmetic relations. Such data are labels that tell us in which category or class an individual, object, or process falls. They are called categorical variables.

Quantitative data sets consist of measures that take numerical values for which descriptions such as means and standard deviations are meaningful. They can be put into an order and further divided into two groups: discrete data and continuous data. Discrete data are countable, for example, the number of defective items produced during a day's production. Continuous data arise when the variables are measurable and are expressed on a continuous scale, for example, the height of a person.

Measurement or counting theory is concerned with the connection between data and reality. A set of data is a representation (i.e., a model) of reality based on numerical and measurable scales. Data are called "primary" data if the analyst has been involved in collecting the data relevant to his/her investigation. Otherwise, they are called "secondary" data.

Data come in the forms of Nominal, Ordinal, Interval and Ratio scales. Moreover, data can be either continuous or discrete. The following chart illustrates the various forms of measuring data.

Figure 1.1 Measurement Scale

Note that both the zero point and the unit of measurement are arbitrary in the Interval scale. While the unit of measurement is arbitrary in the Ratio scale, its zero point is a natural attribute. Categorical variables are measured on an ordinal or nominal scale.

Note that data collected for the analysis and estimation of a model may be time series, pooled or cross-sectional data.

Time series data give information about the numerical values of variables from period to period; that is, they are a set of observations on the values that a variable takes at different times. For example, data on sales in a company, on GNP, on unemployment, or on the money supply over the period 1990-1999 form time series data. Such data may be collected at regular intervals such as daily, weekly, monthly, quarterly or annually. These data may be quantitative in nature (for example, income and price) or qualitative (like sex and religion). The qualitative variables are called categorical or dummy variables.

Consider the following table containing macro data of Ethiopia for the period 1984 to 1995 E.C., measured in millions of dollars.
Year GDP Saving Export Import
1984 20792.00 625.2000 937.5000 2223.400
1985 26671.40 1494.100 2222.500 4520.500
1986 28328.90 1426.200 3223.000 6090.500
1987 33885.00 2517.100 4898.100 7950.000
1988 37937.60 2652.600 4969.700 8721.500
1989 41465.10 3195.000 6730.600 10584.70
1990 44840.30 3466.300 7116.900 11341.20
1991 48803.20 1044.600 6878.000 14101.50
1992 53189.70 480.1000 8017.600 15969.30
1993 54210.70 1433.900 7981.500 16193.60
1994 51760.60 931.4000 8027.400 17709.50
1995 54585.90 -1145.300 8319.300 21557.60
Table 1.1 Time Series Data
Notice that the table above gives the values of the variables across time. Thus, it represents time series data.

Cross-sectional data are data that give information on one or more variables concerning individual consumers or producers at a given point in time. For example, a census of the population and surveys of consumer expenditure are cross-sectional data. This is because the data give information on the variables concerning individual agents (consumers or producers) at a given point in time. As an example, consider the table below, which shows the total population, the urban population, the median age, and the numbers of deaths, marriages and divorces for some selected states of the USA in a particular year.

state pop popurban medage death marriage divorce


Michigan 9262078 6551551 28.8 75102 86898 45047
Minnesota 4075970 2725202 29.2 33412 37641 15371
Mississippi 2520638 1192805 27.7 23570 27908 13846
Missouri 4916686 3349588 30.9 49329 54625 27595
Montana 786690 416402 29 6664 8336 4940
Nebraska 1569825 987859 29.7 14465 14239 6442
Nevada 800493 682947 30.2 5852 114333 13842
New Hampshire 920610 480325 30.1 7594 9251 5254
New Jersey 7364823 6557377 32.2 68762 55794 27796
New Mexico 1302894 939963 27.4 9016 16641 10426
New York 17558072 14858068 31.9 171769 144518 61972
Table 1.2 Cross Section Data

Note that the table above represents cross-sectional data since it shows the values of several variables at a particular time.
Pooled data have elements of both time series and cross-sectional data. For example, suppose we collect data on GDP and saving for 20 countries over 10 years. In this case, the GDP and saving of each country over the 10-year period represent time series data, while the GDP and saving of these countries for a particular year are cross-sectional data. Thus, pooled data have both characteristics.

Note that panel data is a special type of pooled data in which the same cross-sectional unit is surveyed over time. An example is a census of housing at periodic intervals, in which the same household is interviewed to find out whether there has been any change in that household since the last survey. Panel data are repeated surveys of a single sample in different periods of time; they record the behavior of the same individuals over time. The panel data that result from repeated interviewing of the same households at periodic intervals provide very useful information on the dynamics of household behavior.

Note that the success of any economics-related study depends on the quality and quantity of the data. Unlike the natural sciences, most data collected in the social sciences (like GNP, money supply, etc.) are non-experimental. This means that the data collecting agency may not have any direct control over the data.

B. The Sources of Data

Relevant data can be obtained from a number of sources. A governmental agency, an international agency, a private organization or an individual may collect the data used in empirical analysis. In Ethiopia, governmental institutions like MoFED (Ministry of Finance and Economic Development), CSA (Central Statistical Authority), and NBE (National Bank of Ethiopia) are the major sources of published data. International agencies that produce a wide range of data usable for analysis purposes include the International Monetary Fund (IMF) and the World Bank (WB).

Note that the individual researcher may also collect data through interviews or using questionnaires. In the social sciences, the data that one generally obtains are non-experimental in nature. In other words, they are not subject to the control of the researcher. For example, data on investment, unemployment, etc. are not directly under the control of the investigator, unlike natural science data. This often creates special problems for the researcher in articulating the exact cause or causes affecting a particular situation. Moreover, although there is plenty of data available for economic research, the quality of the data is often not that good. The reasons for this include:

 The possibility of observational errors, since most social science data are not experimental in nature.
 Errors of measurement arising from approximations and round-offs.
 The problem of non-response in questionnaire-type surveys.
 Respondents may not answer all the questions correctly.
 The sampling methods used in obtaining the data.
 The aggregation problem. Note that economic data are generally available only at a highly aggregate level. For example, most macro data like GNP, unemployment, inflation, etc. are available for the economy as a whole.
 Confidentiality: certain data can be published only in highly aggregate form. For example, data on individual tax, production, employment, etc. at the firm level are usually available only in aggregate form.

Because of all these and many other problems, the researcher should always keep in mind that the results of research are only as good as the quality of the data. Therefore, the results of research may be unsatisfactory due to the poor quality of the available data rather than due to a wrong model.

Check Your Progress


1. Briefly explain the importance of computer-based economic analysis.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
2. Differentiate between time series, cross-sectional and panel data.
________________________________________________________________________
________________________________________________________________________
_______________________________________________________________________

1.5 Summary

Computer-based data analysis has several advantages. It helps to attain efficient results and facilitates simple and comprehensive data analysis within a short period of time. Moreover, it produces results that are accurate and easily readable. Currently there are a number of software packages that are applied in data analysis activities, including EViews, Stata, SPSS and others. Note that without a computer one cannot perform any realistic data analysis on a large data set.

Statistical and econometric software systems are used to understand existing concepts and to find new properties. On the other hand, new developments in the process of decision making under uncertainty often motivate the development of new approaches and the revision of existing software systems.

Over the last 30 years, a number of software packages for data analysis have been produced and widely applied. These packages have shown an astonishing development in both depth and simplicity, which has contributed a great deal to the development and efficiency of economic analysis. Statistical and econometric software systems rely on the cooperation of statisticians, econometricians and software developers.

To perform analysis of any kind, we need data. As we know, data is the source of information, and it can be collected as qualitative or quantitative data. Qualitative data is not computable by arithmetic relations; such data are labels that tell us in which category or class an individual, object, or process falls, and are called categorical variables. Data come in the forms of Nominal, Ordinal, Interval and Ratio scales. Moreover, data can be either continuous or discrete. In addition, data collected for the analysis and estimation of a model may be time series, pooled or cross-sectional data.

Note that relevant data can be obtained from a number of sources. A governmental agency, an international agency, a private organization or an individual may collect the data used in empirical analysis. In Ethiopia, governmental institutions like MoFED (Ministry of Finance and Economic Development), CSA (Central Statistical Authority), and NBE (National Bank of Ethiopia) are the major sources of published data. International agencies that produce a wide range of data usable for analysis purposes include the International Monetary Fund (IMF) and the World Bank (WB).

1.6 Answers to Check Your Progress

1. Refer to Section 1.2 for the answer.

2. Refer to Section 1.4 for the answer.

1.7 Model Examination

1. Explain the various factors that limit the quality of data.

________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

2. Using your own example, develop a panel data set.


________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

3. What do we mean by "data is the source of information"?


________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

Unit Two: Data Management

2.0 Objective
2.1 Introduction
2.2 Data Management
2.2.1 Data Entry and Operations
2.2.1.1 Using Excel
2.2.1.2 Using EViews
2.2.1.3 Using Stata
2.2.2 Data Transformation and Pre-Estimation Tests
2.2.2.1 Using EViews
2.2.2.2 Using Stata
2.3 Summary
2.4 Answers to Check Your Progress
2.5 Model Examination
2.6 Reference

2.0 Objective
The objective of this unit is to explain the various tasks of data management. After completing this unit, a student will be able to:
 Operate the different software packages and enter data from various sources.
 Transform data in various ways and perform tests before estimation is made.
2.1 Introduction
Before any type of meaningful data analysis is performed, the available data has to be stored in an appropriate spreadsheet. Moreover, the data has to be transformed into the required form so that it can easily be used to perform the appropriate estimation. In this unit a discussion is made regarding data management. First, a brief explanation is given using the Microsoft Excel program. The purpose of this is to shed light on the basics of the Excel program in relation to data analysis. Later on, a detailed discussion is provided using the EViews and Stata programs. The explanation starts with a briefing on the basic windows of the two programs and extends to explaining the alternative ways of data entry, transformation and pre-estimation tests. Whenever appropriate and necessary, the discussion is supported by diagrams of various EViews and Stata windows.

2.2 Data Management

The task of data management encompasses the initial tasks of creating a data set, editing it, and other jobs. There is a broad range of data management features. The following discussion starts with a brief explanation of data management with Excel and then moves to a detailed analysis using EViews and Stata.

2.2.1 Data Entry and Operations
This subsection presents the methods used to enter data using the Excel, EViews and Stata programs. Moreover, it shows various operations using these programs.

2.2.1.1 Data Entry and Operation Using Excel (A Brief Introduction)

As an introduction, we start our discussion with the various ways of entering and editing data and performing operations in Excel. This discussion will shed light on the basics of Excel in relation to data analysis.

I. Create a new workbook

Note that when you open Excel, a new workbook is opened by default. If we want to open another new workbook, we click on the File menu, click New, and then select Blank Workbook in the New Workbook task pane.

II. Enter data in worksheet cells

In an Excel spreadsheet we can enter numbers, text, dates or times. To do so, we first click the cell where we want to enter data. Then we type the data and press ENTER or TAB. Note that we can enter the data row by row or column by column. In any case, the following approach is usually used.
1. Enter data in a cell in the first column, and then press TAB to move to the next cell.
2. At the end of the row, press ENTER to move to the beginning of the next row.
3. If the cell at the beginning of the next row doesn't become active, click Options on the Tools menu, and then click the Edit tab. Under Settings, select the Move selection after Enter check box, and then click Down in the Direction box.

III. Entering numbers with a fixed number of decimal places

The following discussion explains what we should do when our objective is to enter data with a fixed number of decimal places or with trailing zeros.

1. On the Tools menu, click Options, and then click the Edit tab.
2. Select the Fixed decimal check box. In the Places box, enter a positive number of digits to the right of the decimal point, or a negative number for digits to the left of the decimal point.

For example, if you enter 3 in the Places box and then type 1874 in the cell, the value will
be 1.874. If you enter -3 in the Places box and then type 183, the value will be 183000.
Note that any data you entered before selecting the Fixed decimal option is not affected.

IV. Entering the same data into several cells at once

To enter the same data into several cells at once, we follow these steps:

1. Select the cells where you want to enter data. Note that the cells do not have to be adjacent.
2. Type the data and press CTRL+ENTER.

This automatically enters the same data into all of the selected cells.

V. Filling automatically repeated entries in a column

Note that, if the first few characters you type in a cell match an existing entry in that
column, Microsoft Excel fills in the remaining characters for you. Excel completes only
those entries that contain text or a combination of text and numbers. Entries that contain
only numbers, dates, or times are not completed. In this regard, note the following.

 To accept the proposed entry, press ENTER. The completed entry exactly
matches the pattern of uppercase and lowercase letters of the existing entries.
 To replace the automatically entered characters, continue typing.
 To delete the automatically entered characters, press BACKSPACE.

 To select from a list of entries already in the column, right-click the cell, and then click Pick from List on the shortcut menu.

VI. Filling in a series of numbers, dates, or other items

To do this we should follow the steps as detailed hereunder.

1. Select the first cell in the range you want to fill.


2. Enter the starting value for the series.
3. Enter a value in the next cell to establish a pattern. For example, if you want the series 2, 3, 4, 5..., enter 2 and 3 in the first two cells. If you want the series 2, 4, 6, 8..., enter 2 and 4. If you want the series 2, 2, 2, 2..., you can leave the second cell blank. Then, to specify the type of series, use the right mouse button to drag the fill handle over the range, and click the appropriate command on the shortcut menu. Note that to fill in increasing order, we have to drag down or to the right; to fill in decreasing order, we have to drag up or to the left.

Note that the prime objective of the above brief discussion is to show how data is entered in the Excel program. This knowledge is important because (i) Excel can conduct different types of analysis, and (ii) more often than not, data to be used in other specialized software (programs) may be stored in Excel format.

Check Your Progress 2.1

1. Using the discussion made earlier, enter the following data into an Excel spreadsheet.

Year 1991 1992 1993 1994 1995 1996 1997 1998 1999
Y 50 100 150 200 250 300 350 400 450
X1 15.67 156.7 1.567 1567 15670 0.1567 0.01567 15.67 156.7
X2 1 2 3 4 5 6 7 8 9
X3 90 80 70 60 50 40 30 20 10

2.2.1.2 Data Entry and Operation Using EViews

I. Introduction to EVIEWS

EViews provides sophisticated data analysis, regression, and forecasting tools on
Windows-based computers. With EViews you can quickly develop a statistical relation
from your data and then use the relation to forecast future values of the data. Areas where
EViews can be useful include: scientific data analysis and evaluation, financial analysis,
macroeconomic forecasting, simulation, sales forecasting, and cost analysis.

EViews was developed by economists and most of its uses are in economics. However,
there is nothing in its design that limits its usefulness only to economic time series. Even
quite large cross-section projects can be handled in EViews.

EViews provides convenient visual ways to:
 enter data series from the keyboard or from disk files,
 create new series from existing ones,
 display and print series, and
 carry out statistical and regression analysis of the relationships among variables.

EViews takes advantage of the visual features of modern Windows software. You can use
your mouse to guide the operation with standard Windows menus and dialogs. Results
appear in windows. The results can be manipulated with standard Windows techniques.

Alternatively, we may use EViews’ powerful command and batch processing language.
That is, we can enter and edit commands in the command window. Moreover we can
create and store the commands in programs that document our research project for later
execution.

II. Installing EViews

Before discussing anything further, it is important to know the installation procedure of the EViews program, discussed hereunder. Note the following:

a) Once Windows is running, it is strongly recommended that you close all other Windows applications before beginning the installation procedure. This is because other applications may interfere with the installation program.
b) Insert the CD containing the EViews program into the CD drive. Then the program will be ready to be installed. Click Next to pass from one page to another in the installation process.
c) Once the installation procedure is complete, the installer will inform you that EViews has been successfully installed.

III. Windows Basics

As stated in the general introduction, it is assumed that you are familiar with the basics of Windows. Nevertheless, we provide a brief discussion of some useful techniques, concepts, and conventions that we will use in this distance material.

A. The Mouse

EViews supports the use of both buttons (i.e. left and right) of the standard Windows
mouse. Unless otherwise specified, clicking on an item means a single click of the left-
mouse button. On the other hand, double-click means to click the left-mouse button twice
in rapid succession. Moreover, dragging with the mouse means that you should click and
hold the button down while moving the mouse.

B. Window Control

As we work, we may wish to change the size of a window or temporarily move a window
out of the way. Alternatively, a window may not be large enough to display all of the
output, so that we want to move within the window in order to see relevant items.

Windows provides us with methods for performing each of the tasks stated above. These include the following.

1. Changing the active window

When working in EViews or other Windows programs, you may find that you have a
number of open windows. The currently active (top-most) window is easily identified
since its title bar will generally differ (in color and/or intensity) from the inactive
windows. You can make a window active by clicking anywhere in the window, or by
clicking on the word Window in the main menu, and selecting the window by clicking on
its name.

2. Scrolling

Windows provides both horizontal and vertical scroll bars so that you can view the
contents of windows that contain information, which does not fit inside the window.
When the information does fit, the scroll bars will be hidden.

The scroll box indicates the overall relative position of the window and the data. In the
example above, the vertical scroll box is near the bottom, indicating that the window is
showing the lower portion of our data. If the box is in the middle of the scroll bar, then
the window displays the halfway point of the information. The size of the box also
changes to show you the relative sizes of the amount of data in the window and the
amount of data that is offscreen. Here, the current display covers roughly half of the
horizontal contents of the window.

The up, down, left, and right scroll arrows on the scroll bar will scroll one line in that
direction. Clicking on the scroll bar on either side of a scroll box moves the information
one screen in that direction.

If you hold down the mouse button while you click on or next to a scroll arrow, you will
scroll continuously in the desired direction. To move quickly to any position in the
window, drag the scroll box to the desired position.

3. Minimize/Maximize/Restore/Close

There may be times that you wish to move EViews out of the way while you work in
another Windows program. Or you may wish to make the EViews window as large as
possible by using the entire display area.

In the upper right-hand corner of each window, you will see a set of buttons which control the window display. By clicking on the middle button, you can toggle between using your entire display area for the window and using the original window size. The maximize button uses your entire monitor display for the application window, whereas the restore button returns the window to its original size, allowing you to view multiple windows. If you are already using the entire display area for your window, the middle button will display the icon for restoring the window; otherwise, it will display the icon for using the full screen area.

You can minimize your window by clicking on the minimize button in the upper right-hand corner of the window. To restore a program that has been minimized, click on its icon in your taskbar.

Lastly, the close button provides you with a convenient method for closing the window.
To close all of your open EViews windows, you may also select Window in the main
menu, and either Close All, or Close All Objects.

4. Selecting and Opening Items

To select a single item, you should place the pointer over the item and single click. The
item will now be highlighted. If you change your mind, you can change your selection by
clicking on a different item, or you can cancel your selection by clicking on an area of the
window where there are no items. Double clicking on an item will usually open the item.

If you have multiple items selected, you can double click anywhere in the highlighted
area.

IV. Starting EViews

There are several methods for starting the EViews program. These are:
 Click on the Start button in the taskbar, select Programs, navigate to the EViews3 program group, and then select the EViews3.1 program icon. (Note: if we have installed EViews 5, similar steps are used to open it.)
 Navigate to the EViews directory using Windows Explorer, or via the My Computer icon on your desktop, and double click on the EViews3.1 (or EViews 5, if we have installed it) program icon.
 Double click on an EViews workfile or database icon.
The next subsection presents the various components displayed when EViews is opened.

The EViews Window

If the program is correctly installed, you should see the EViews window when you open or launch the program. The following are the main areas in the EViews window:

A. The Title Bar

In EViews, the title bar is located at the very top of the main window. When EViews is the active program in Windows, the title bar has a color and intensity that differs from the other windows (generally, it is darker). When another program is active, the EViews title bar will be lighter. If another program is active, EViews may be made active by clicking anywhere in the EViews window.

B. The Main Menu

Just below the title bar is the main menu. If you move the cursor to an entry in the main
menu and click on the left mouse button, a drop-down menu will appear. Clicking on an
entry in the drop-down menu selects the highlighted item. Some of the items in the drop-
down may be listed in black and others in gray. In menus, black items may be executed
while the gray items are not available.

C. The Work Area

The area in the middle of the window is the work area where EViews will display the
various object windows that it creates. Think of these windows as similar to the sheets of
paper you might place on your desk as you work. The windows will overlap each other
with the foremost window being in focus or active. Only the active window has a
darkened title bar.

When a window is partly covered, you can bring it to the top by clicking on its title bar or on a visible portion of the window. You can also cycle through the displayed windows by pressing the F6 or CTRL-TAB keys. Alternatively, you may directly select a window by clicking on the Window menu item and selecting the desired name. In any case, once we have opened the EViews window, we then create a workfile.
Check Your Progress 2.2
1. State the various ways of starting the EViews program
________________________________________________________________________
________________________________________________________________________
______________________________________________________________________
2. State the main areas in the EViews Window
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

V. Creating a Workfile

To enter data into the EViews spreadsheet, we first need to open a workfile after we have opened the EViews window. To create a workfile, we first click File, then select New, and choose Workfile. This process is shown in the diagram below.

Figure 2.1 Creating Work file

This will display the Workfile Create box. In the box, we have to specify the frequency as well as the start and end date. That is, we need to provide the appropriate frequency and enter the information for the workfile range. Note that the Start date is the earliest date or observation you plan to use in the project and the End date is the latest date or observation. Frequency includes many cases like annual, semi-annual, quarterly and the like.

The rules for describing the work file frequency are quite simple:

 Annual: This refers to the year. For example, 1995, 2000 or 2005. Thus, if the data to be entered and used is yearly data, we select annual.
 Quarterly: This refers to the year, followed by a colon or period, and the quarter number. Examples: 1994:2 (representing the 2nd quarter of 1994), 2001:1, 2006:3 and so on. If the data to be used is quarterly data, we select quarterly.
 Monthly: This refers to the year, followed by a colon or period, and the month number. Examples: 1984:1 (representing the first month of 1984), 1993:12, and so on. Thus, if the data to be used is monthly data, we select monthly.
 Weekly and Daily: By default, you should specify these dates as the month number, followed by a colon, followed by the day number, followed by a colon, followed by the year. Make sure that the date is entered in the order month/day/year (i.e. mm/dd/yyyy). However, using the Options/Dates-Frequency… menu item, you can reverse the order of the day and month by switching to European notation.

The same frequencies can also be specified from the command window, as the sketch below illustrates.
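
For readers who prefer typing commands, the following is a minimal sketch of how workfiles of the above frequencies can be created with the workfile command. The workfile names here are hypothetical, and the exact syntax may vary slightly across EViews versions, so check the command reference of your installation.

    ' create an annual workfile covering 1953-1995
    workfile annual_wf a 1953 1995
    ' create a quarterly workfile covering 1994:1-2006:4
    workfile quart_wf q 1994:1 2006:4
    ' create a monthly workfile covering 1984:01-1993:12
    workfile month_wf m 1984:01 1993:12
    ' create an undated workfile with 12 observations (for cross-section data)
    workfile cross_wf u 1 12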

After identifying the workfile frequency appropriate to the data to be entered into EViews, we should specify the corresponding start and end dates in the spaces provided. The diagram below displays the workfile window created for a specific frequency, start date and end date.

Figure 2.2 Work file Window


Note from the above that in the workfile created, the frequency is daily, with the start date being 10/31/1997 and the end date being 10/29/2004.

After we have finished supplying the information about the type of workfile, we need to click OK. This will create the workfile window. Note that the workfile is UNTITLED since we have not yet saved it; if we save the workfile under a certain file name, that name will appear instead of the word UNTITLED.

Notice from Figure 2.3 below that there are two icons in this newly created workfile. These icons represent the objects that are contained in every workfile: a vector of coefficients represented by C, and a series of residuals designated by RESID.

The Workfile Window


In the title bar of the workfile window, you will see the "Workfile:" designation followed by the workfile name. If the workfile has not been saved, it will be designated "UNTITLED". If the workfile has been saved to disk, you will see the name and the full disk path. For example, in the figure below the workfile is untitled, indicating that the file has not been saved.

Figure 2.3 Work file Window

As can be seen from the above diagram, just below the title bar there is a toolbar made up of a number of buttons. These buttons provide easy access to a number of useful workfile operations.

Below the toolbar are two lines of status information. That is, EViews displays the range
of the workfile, the current sample of the workfile (the range of observations that are to
be used in calculations and statistical operations), the display filter (rule used in choosing
a subset of objects to display in the workfile window), and the default equation (the last
equation estimated or operated on). You may change the range, sample, and filter by
double clicking on these labels and entering the relevant information in the dialog boxes.
Double clicking on the equation label opens the equation.
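
Incidentally, the current sample can also be set by typing a smpl command in the command window rather than double clicking on the sample label. A minimal sketch, assuming an annual workfile and a hypothetical series named gdp:

    ' use only the observations from 1960 to 1990
    smpl 1960 1990
    ' use only the observations for which gdp is positive
    smpl @all if gdp > 0
    ' restore the full workfile range
    smpl @all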

Saving Workfiles

Usually we want to name and save our workfile for future use. To do so, we push (or click) the Save button on the workfile toolbar to save a copy of the workfile on disk. We can also save the file using the File/Save As… or File/Save… choices from the main menu. When we use one of these three alternative saving methods, a standard Windows file dialog box will open. In the dialog box we can specify the target directory in the upper file menu labeled Save in. Note that we can navigate between directories in the standard Windows fashion: click once on the down arrow to access a directory tree, while double clicking on a directory name in the display area gives a list of all the files and subdirectories in that directory. Once you have worked your way to the right directory, type the name you want to give the workfile in the File name box and push the Save button. This will save the workfile with the name we chose.

Once the workfile is named and saved, we can save subsequent updates or changes using
File/Save… from the main menu or the Save button on the toolbar. EViews will use the
existing workfile name, and if the workfile has changed, will ask you whether you want to update the version on disk. Just like other Windows software, File/Save As… can be
used to save the file with a new name.

To bring back a previously saved workfile, we can use File/Open/Workfile…. Generally, we save a workfile containing all of our data and results at the end of the day; thus, to pick up where we left off, we use File/Open/Workfile….

Note that when you select File/Open/Workfile… you will see a standard Windows file
dialog. Simply navigate to the appropriate directory and double click on the name of the
workfile. Then the workfile window will open and all of the objects in the workfile will
immediately be available.

For your convenience, EViews keeps a record of the ten most recently used workfiles and
programs at the bottom of the File menu. Select an entry and it will be opened in EViews.
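
Saving and reopening can likewise be done from the command window. The file name and path below are hypothetical, and load is the older style command for opening a workfile (newer versions also provide wfopen), so verify the exact commands against the documentation of your version.

    ' save the active workfile to disk as mywork.wf1
    save c:\data\mywork
    ' reopen the saved workfile in a later session
    load c:\data\mywork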

Resizing Workfiles

Sometimes we may decide to add data, or we may want to use observations beyond the ending date or before the starting date of our workfile. Alternatively, we may wish to remove extra observations from the start or end of the workfile.

To change the size of our workfile, we select Procs, click Change Workfile Range…, and enter the required beginning and ending observations of the workfile in the dialog. If we enter dates that encompass the original workfile range, EViews will expand the workfile without additional comment. However, if we enter a workfile range that does not encompass the original workfile range, EViews will warn us that data will be lost, and ask us to confirm the operation.
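
A rough command-window equivalent is the expand command. The dates below are hypothetical and assume an annual workfile; as always, check the exact syntax in your version's command reference.

    ' extend the workfile range so that it runs from 1950 to 2005
    expand 1950 2005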

Sorting Workfiles

Basic data in workfiles are held in objects called series. If you click on Procs/Sort Series… in the workfile toolbar, you can sort all of the series in the workfile on the basis of the values of one or more of the series. A dialog box will open where you can provide the details of the sort.

If you list two or more series, EViews uses the values of the second series to resolve ties in the first series, and the values of the third series to resolve ties in the second, and so forth. If you wish to sort in descending order, select the appropriate option in the dialog. Note that if you are using a dated workfile, sorting the workfile will generally break the link between an observation and the corresponding date.
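
A sketch of the command-window equivalent, using hypothetical series names x and y:

    ' sort the workfile in ascending order of x, breaking ties with y
    sort x y
    ' sort the workfile in descending order of x
    sort(d) x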

The following names are reserved and should not be used to name a variable while working with EViews: ABS, ACOS, AR, ASIN, C, CON, CNORM, COEF, COS, D, DLOG, DNORM, ELSE, ENDIF, EXP, LOG, LOGIT, LPT1, LPT2, MA, NA, NRND, PDL, RESID, RND, SAR, SIN, SMA, SQR, and THEN.

VI. Data Entry

Once the workfile is created, the next step is data entry. In EViews there are three different ways of entering data:
 entering data from the keyboard,
 copying data from other sources, and
 spreadsheet import.

A. Entering Data from the Keyboard

This approach is preferred for small data sets in printed form, since in such cases we may wish to enter the data by typing at the keyboard. The steps required to enter data via the keyboard are explained as follows. (Note that it is only after we have created a workfile that we can enter data into the EViews spreadsheet.)

I. Our first step is to open a temporary spreadsheet window in which we will enter the data. To do this we choose Quick and select Empty Group (Edit Series) from the main menu to open an untitled group window. Note that we now have an empty EViews spreadsheet.

II. The next step is to create and name the series (or variables). In this regard, we first click once on the up arrow to display the second obs label in the left-hand column. The row of cells next to the second obs label is where you will enter and edit series names. Click once in the cell next to the second obs label. Then type your first variable name in the command window and press ENTER. (Note that the name in the cell changes as you type the required name in the command window.)

III. Repeat this procedure in subsequent columns for each additional series. If we want to rename one of our series, we simply select the cell containing the series name and then edit the name. After we have made the necessary change, we press ENTER. EViews will then prompt us to confirm the series rename, and we confirm the rename by clicking OK.

IV. To enter the data into the EViews spreadsheet, we click on the appropriate cell and type the number. Pressing ENTER after entering a number will move you to the next cell. If you prefer, you can use the cursor keys to navigate the spreadsheet while entering the data.

V. When you are finished entering data, close the group window. Now, in the workfile: UNTITLED box, you will have the names of the variables that were typed earlier. If you wish, you can first name the untitled group by clicking on the Name button. If you do not wish to keep the group, then choose to delete it. When you do this, EViews asks you to confirm the deletion, and you have to select Yes.
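
As an aside, the same data entry task can be started from the command window: the data command opens an edit group for the named series, and the rename command changes a series name. A minimal sketch with hypothetical names:

    ' create series y and x (if they do not exist) and open them for editing
    data y x
    ' rename the series x to income
    rename x income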

B. Copying Data from Other Sources and Pasting into EViews

By now you have learned that the Windows clipboard is a handy way to move data within EViews and between EViews and other software applications. The following discussion uses an example based on an Excel spreadsheet, but the basic principles apply to other Windows applications.

Suppose we have GDP and Investment data of Ethiopia for the years 1953 to 1995 E.C. in an Excel spreadsheet, and suppose that we would like to bring these variables into EViews. The following discussion illustrates the steps to be followed.

I. First, copy the data range from the Excel spreadsheet. Then start EViews and create a new annual workfile containing the dates in the Excel spreadsheet (in our example, 1953 through 1995). Then we select Quick and choose Empty Group (Edit Series). Note that the spreadsheet opens in edit mode, so there is no need to click the Edit +/– button. Since we have created an annual workfile with a range from 1953 to 1995, the first row of the EViews spreadsheet is labeled 1953. Since we are pasting in the series names as well, we should click on the up arrow in the scroll bar to make room for them.

II. Place the cursor in the upper-left cell, just to the right of the second obs label. Then select Edit and choose Paste from the main menu (not Edit +/– in the toolbar). This will paste the data into the group spreadsheet.

Note that you may now close the group window; the untitled group will be deleted without losing the two series. Notice then that in the workfile: UNTITLED box you will have the names of the two variables that we pasted.

Pasting into Existing Series

Note that we can bring data from the clipboard into an existing EViews series or group spreadsheet by using the Edit/Paste approach, as we did earlier. There are only a few additional issues to consider.

1. To paste several series, you will first open a group window containing the existing series. The easiest way to do this is to click on Show and then type the series names in the order they appear on the clipboard. Alternatively, you can create an untitled group by clicking on the first series, then CTRL-clicking on each subsequent series (in order), and then double clicking to open the group.
2. Next, make certain that the group window is in edit mode. If not, press the Edit +/– button to toggle between edit mode and protected mode. Place the cursor in the target cell, and select Edit and then Paste.

3. Finally, click on Edit +/– to return to protected mode.


Note that if we are pasting into a single series, we will need to make certain that the series window is in edit mode and that the series is viewed in a single column. If the series is in multiple columns, push the Smpl +/– button. Then click Edit and select Paste to paste the data. After this, click on Edit +/– to protect the data.

C. Spreadsheet Import

With this approach, we can read data directly from files created by other programs. The data may be in Excel (i.e., XLS) spreadsheet format or in other compatible formats. In importing data from other sources, the following steps must be considered.

I. First, make certain that you have an open workfile to receive the contents of the data import. For our case, consider the following Excel data containing information about the return from a certain investment for 12 rich countries of the world.

Figure 2.4 Excel data of 12 rich countries

Note that to import the above Excel spreadsheet into EViews format, we first have to create a workfile whose range and frequency are in line with the data to be imported.

II. Next, in the workfile (UNTITLED) box we click on Procs, select Import, and then Read Text-Lotus-Excel.... The diagram below shows this step.

Figure 2.5 The import Process

III. After this we will see a standard File dialog box asking us to specify the type and name of the file. Select a file type, navigate to the directory containing the file, and double-click on the name. Alternatively, type in the name of the file that you wish to read (with full path information, if appropriate); in this case, EViews will automatically set the file type, otherwise it will treat the file as an ASCII file. Then we click on Open. This will display the Excel spreadsheet import box shown in the diagram below.

Figure 2.6 Excel spreadsheet import box


As can be seen from the above diagram, EViews opens a dialog prompting you for additional information about the import procedure. Bear in mind that the dialog will differ depending on whether the source file is a spreadsheet or another file type. In any case, the title bar of the dialog will identify the type of file that we have asked EViews to read.

IV. To read from a spreadsheet file we have to fill in the dialog as follows:

A) First, we need to tell whether the data are ordered by observation or by series. By
observation it means that all of the data for the first observation are followed by all of the
data for the second observation, etc. By series it means that all of the data for the first
variable are followed by all data for the second variable, etc. Another interpretation for
"by observation" is that variables are arranged in columns, while "by row" implies that all of the observations for a variable are in a single row. For example, in the above diagram,
the data are ordered by observation.

B) Next, we have to tell EViews the location of the beginning cell (upper left-hand
corner) of the actual data, without including any label or date information. In the
diagram above, the upper-left data cell is A4. This implies that we are to import data
starting from the first column (represented by column A in the Excel spreadsheet) and the
fourth row (as the values displayed in figure 2.6 begin from the fourth row of the Excel
spreadsheet).

C) Then we enter the names of the series that we wish to read into the edit box.
In the above case we write the names of the 12 countries using the same names as written
in the Excel spreadsheet, such as fin, swe, nor, and so on.
Alternatively, if the names that you wish to use for your series are contained in the file,
you can simply provide the number of series to be read. The names must be adjacent to
your data. If the data are organized by row and the starting cell is B2, then the names
must be in column A, beginning at cell A2. If the data are organized by column beginning
in B2, then the names must be in row 1, starting in cell B1. If, in the course of reading the
data, EViews encounters an invalid cell name, it will automatically assign the next
unused name with the prefix SER, followed by a number (e.g., SER01, SER02, etc.).

D) Lastly, we should tell EViews the sample of data that we wish to import. Notice that
EViews begins with the first observation in the file and assigns it to the first date in the
sample for each variable. Each successive observation in the file is associated with
successive observations in the sample. Note also that if we are reading data from an
Excel 5 workbook file, there will be an additional edit box where we can enter the name
of the sheet containing our data. If we do not enter a name, EViews will read the topmost
sheet in the Excel workbook. When the dialog is completely filled out, simply click OK
and EViews will read your file, creating series and assigning values as requested.

This is confirmed by the following diagram

Figure 2.7 Work file with the imported variable names

The workfile: untitled box above shows that we have successfully imported the Excel data
containing the figures for the 12 countries. This is confirmed by the appearance of the
names of the variables (or countries) in addition to C and RESID. To see the data that we
have imported into the EViews spreadsheet, we can use one of the following steps:

 Select the variable names found in the workfile window and either click Show from
the same menu, or right-click and select Open as Group
 Click Show from the workfile menu and write the names of the variables in the
Show box

Check Your Progress 2.3

1. Discuss the various work file frequencies available in EViews


________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

2. State the three ways of data entry into EViews spreadsheet
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

3. Consider the following annual data on Y and X

Year 1993 1994 1995 1996 1997 1998 1999


Y 120 250 210 654 216 541 236
X 12.6 15.6 14.8 14.1 17.9 21.0 22.5

Using the three alternative data entry approaches, read the above data into EViews
spreadsheet

2.2.1.3 Data Entry and Operation Using Stata

I) Introduction
Stata is a modern and very powerful program especially designed for data management,
statistical and econometric analysis, as well as graphics. It is mainly a command-driven
program providing sufficient flexibility to meet different users' needs. The main aim of
this section is to familiarize you with the basic operations of Stata and teach you how to use
and combine some basic commands.

To open Stata, just double-click on the Stata icon (or wstata icon) found in the
program files of your computer. Once Stata is opened, your computer's screen will display
the following window.

Figure 2.8 The Stata Window.

The discussion below explains the different components of the Stata screen just opened.
Note from the figure above that the top of the Stata window contains the menu bar and the
tool bar, below which the Stata windows appear. These bars and windows are explained as follows.

A. Menu Bar: The menu bar has lists of commands that can be opened by clicking on the
corresponding item. Below we provide a brief description of the different options. Note that if you use
Stata a lot, you probably will not use the menu bar often, because the most common tasks
can be done with the buttons on the tool bar and the command box.

1. File: When we click this menu bar, 10 items will drop down. These are:
 Open: This opens a data file already saved in Stata format
 View: This helps to view a data file
 Save: This helps to save the data file
 Save As: This will save the data file under a new name
 File Name: This selects a data file name to put in a command
 Log: This helps to open, close, review, or convert a log file
 Save Graph: This is used to save a graph file
 Print Graph: This is used to print a graph
 Print Results: This will print the contents of the current window
 Exit: This helps to leave Stata

2. Edit: Opening the Edit menu will bring up five items, as stated below:
 Copy Text: This is used to copy marked text
 Copy Tables: This helps to copy tables to insert in a spreadsheet or word processor
 Paste: This is used to insert something previously copied
 Table Copy Options: This provides options for how tables are copied
 Graph Copy Options: This provides options for how graphs are copied

3. Prefs: This is used for various options for setting preferences. For example, it is used to
open the default windows, or to change the colors used in the Stata window.

4. Window: When we click this menu bar, 10 items will be displayed. These are Results,
Graph, Log, Viewer, Command, Review, Variables, Help/Search, Data editor and Do-file
editor. Selecting any of these will bring that particular window to the front.

5. Help: This bar is used whenever additional information or support is needed. Within
this menu bar there are Contents, Search, Stata command and What's new. This will be
discussed later on.

B. Tool Bar. Notice that the buttons on the tool bar are designed to make it easier to
carry out the most common tasks. The following diagram presents the various items in
the tool bar.

Figure 2.9 Items in the tool bar

Note that most of the items presented in the tool bar are all found inside the menu bar
presented earlier in section A. However, each icon represents the following.

 Open: open a stata data set


 Save: save a data set
 Print: prints contents of active window
 Log: to start or stop, pause or resume a log file
 Viewer: open viewer window, or bring to the front
 Results: open result window, or bring to the front
 Graph: open graph window, or bring to the front
 Do-file editor: open do-file editor, or bring to the front
 Data editor: open data editor window, or bring to the front
 Data browser: open data browser window, or bring to the front
 More: tells Stata to continue when it has paused in the middle of long output
 Break: stops the current task (or stops processing)

C. Stata window: This represents the four windows that Stata displays by default in
addition to the menu and tool bar. The four windows and their purposes, with diagrammatic
illustration, are explained as follows.

1. The Stata Command window. This is located at the bottom right side of the
screen. It is the place where Stata commands are entered and edited. Note that once the
appropriate command is typed, we have to press the Enter key to execute it.
2. The Stata Results window. This is located at the right side of the screen. It displays
the results of the commands executed. In case there is any error in the command
entered, a message in red text will appear.

3. The Review window. As the figure below shows, this is located at the upper left
side of the screen. It lists all the executed commands of the session (both valid and
erroneous). Note that by clicking on a listed command, we can make it reappear in the
command window so that we can either execute it again or modify it.

Figure 2.10 The Stata window

4. The Variable window. As the figure above shows, this is located at the left bottom
side of the screen. It displays all the variables that the currently used data set
contains.

Note that we can choose the fonts that each of the four windows displays. As you can see,
each window has a control menu box (the little box in the upper left corner of
the window). Simply click once on it, then choose Font, which brings up a standard
Windows font dialog box. Select your font preferences and then click OK. You have now
changed the fonts that this window displays. To change the fonts of the other windows,
follow the same process for each window separately; this allows you to choose a different
font for each window. Note that the next time Stata comes up, it will open according to
the latest window modifications.

Once we have organized the screen, we are ready to learn Stata's basic features. Do not
forget that Stata is a command-driven application (this makes it really powerful), and so
we have to learn how to insert and combine Stata commands. Note that Stata commands
are case sensitive (i.e. Stata distinguishes between lowercase and uppercase syntax). All
Stata commands that are included in this material are in lowercase. For some basic
operations, Stata offers windows facilities as a command alternative.

Note that in order to get the full power out of Stata we should become familiar with its
help facilities. In doing so we first open the Help menu. This will display a number of
sub-items as shown below.

Figure 2.11 Stata help Window

To search for a particular topic we click on Search. This will open the following dialog
box. The box is called the keyword search box, and it helps to search for an item in the
program.

Figure 2.12 Stata Search Box

Note that by selecting the first option and specifying a particular word, Stata will produce
the results accordingly. For example, if we type the word regression in the Keywords box,
all those entries in Stata that contain this word are displayed.

Figure 2.13 An example of a search result

As you can see from the diagram above, there are hypertext links (clickable words in
blue) that will link you to the help files for the appropriate Stata commands.

If we click again on help, and select Stata Command from the pull down, a dialog box
that asks you to specify a Stata command appears. This box is shown below

Figure 2.14 Stata command box

By specifying a command name and clicking OK, we can see detailed information on its
syntax as well as some examples of how the command can be used. For illustration
purposes, let us specify the command regress, which is used to perform linear
regressions. Then click OK and the program brings up the following Stata on-line
manual page.

Figure 2.15 An example of a Stata command help page

Note that we can get the same help facilities by simply using the help and search
commands.

Typing search regression in the Command window and pressing Enter produces the
same output as choosing Search from the Help menu and entering the keyword regression.
In the Stata Results window a list of all Stata commands that relate to regression will
appear in green. When you position the mouse pointer near a hypertext link (displayed in
blue), the pointer will change to a hand. If you click while the hand is pointing at a
command name, you will go to the help file of the selected command. In general, we can use
search followed by a keyword when we need to find out the names of the commands
related to this keyword.

Note that, on the other hand, we can use the help command when we know the exact
name of the Stata command on which we want more information. Notice that executing
help regress produces the same output as choosing Stata Command from the Help menu
and specifying regress in the Stata Command dialogue box (as described above). For
example, suppose we type in the Stata Command window help followed by the command
name regress. In the Stata Results window, we obtain information on the specific
command (which is all the information that Stata's on-line manual pages, a shortened
version of the printed manuals, include). If we type help alone, we obtain information on
how to use the help system.

III. Data Entry (Using Stata)

To perform any kind of analysis using Stata, we have to have data in the Stata spreadsheet.
The data can be imported from another spreadsheet (such as Excel) or entered via the
keyboard into the Stata spreadsheet. The following explains the steps required in each of the
cases.
A. Copying data from another spreadsheet. This approach is preferred when we want
to make use of data that is already available in soft copy in a certain spreadsheet (such as
Excel). The steps required to execute this method are:
1. First open the required spreadsheet and open the data set.
2. Then select the whole data set and copy it.
3. Open Stata and bring the spreadsheet window to the front. (This can be done by either
opening the spreadsheet using the menu/tool bar or by writing the command edit in the
command box and pressing the Enter key.)

4. Click Edit in the header bar (i.e. menu bar) and select Paste from the drop-down
menu. This transfers the whole spreadsheet of data into the Stata
spreadsheet.

B. Entering Data via the Data Spreadsheet. This approach is more appropriate when
we have data in hard copy and want it in the Stata spreadsheet. Suppose we have
data on the ID numbers of five individuals and their corresponding income and
expenditure values. To enter this information directly into the Stata spreadsheet, we
first open an empty spreadsheet window. Then, treating each column as a separate
variable, we begin entering the data. Thus, first the respondents' ID values are
entered in the first column. This will bring a result as shown below.

Figure 2.16 Data entry in to Stata

Note that the moment we start entering the values of the ID variable, Stata names
that column Var1. To change this name into the actual one, we double-click on
Var1 and then enter the variable name ID as shown below.

Figure 2.17 Changing variable names.

Note that in the Stata variable information box, we enter the name of the variable
(which is ID in this case) and, if we wish, enter a variable label. We then
proceed to do this for each additional variable.

Once the data are entered, we can close the spreadsheet and return to the Stata window. In
this case the Variables box will display the names of the variables (i.e. ID and the others)
that we have entered in the spreadsheet.

Check Your Progress 2.4

1. Briefly explain the various components of Stata's menu bar


________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

2. State the purposes of the Stata default window


________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

3. Consider the data below:

Y 10 15 18 16 25 30 40 35
X 15 13 11 9 7 9 11 13

Enter the above data in to Stata spreadsheet using (a) copy and paste approach, and (b)
own type (via stata data spreadsheet)

IV) Data Management Using Stata

The first step in data analysis involves organizing the raw data into a format usable by
Stata. Data management encompasses the initial tasks of creating a data set, editing it to
correct errors, and adding internal documentation such as variable and value labels. It
also encompasses many other jobs required by ongoing projects, such as reorganizing,
simplifying, or sampling from the data. Moreover, it includes adding further observations
or variables; separating, combining or collapsing data sets; converting variable types; and
creating new variables through algebraic or logical expressions. Note that although Stata
is best known for its analytical capabilities, it possesses a broad range of data
management features as well. This section introduces some of the basics in this regard.

This discussion states the various Stata commands that are used in the data management
process. Assuming that we have data on GDP, INV (investment), SAV (domestic saving)
and EXPO (export) in the Stata spreadsheet, we explain briefly below what each Stata
command stands for. However, to adequately understand each command, the student is
advised to practice using his/her own data. Note that each command is identified by bold
letters, and varname (or varlist) refers to the variable name (or list of variables) to be used
in the data management work. To obtain the result after we write a command, we press
Enter on the keyboard.

 compare varname1 varname2

The command compare performs an accounting of the differences and similarities
between varname1 and varname2 (i.e. variable 1 and variable 2).

Example: compare GDP INV

 describe

Note that describe displays a summary of the contents of the data in memory or of the data
stored in a Stata-format dataset. The related command ds lists variable names in a compact
format.

Example: describe GDP INV SAV EXPO

 drop varlist

Drop eliminates variables or observations from the data in memory.


Example: drop GDP EXPO eliminates the variables GDP and Export from the data.

 keep varlist

Keep works the same as drop except that we specify the variables or observations
to be kept rather than those to be deleted.
Example: keep GDP EXPO keeps the GDP and EXPO variables and eliminates all other
variables from the spreadsheet.

 clear
This eliminates all variables and observations from the data in memory. Note that, unlike
drop, clear does not take a list of variable names; to eliminate specific variables (such as
investment and saving) we use drop INV SAV instead.

Example: clear will remove the whole dataset from memory so that a new one can be
loaded.
 edit

The command edit brings up a spreadsheet-style data editor for entering new data and
editing existing data.
 browse
This is like edit except that it will not allow changing the data. That is, data displayed in
this way cannot be edited.

 Merge datasets
Merge joins corresponding observations from the dataset currently in memory
(called the master dataset) with those from a Stata-format dataset stored on disk as
filename (called the using dataset) into single observations. Note that if filename is
specified without an extension, .dta is assumed.
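As a brief hedged sketch of the syntax (the file name macro and the key variable year are hypothetical, and the exact syntax may vary with the Stata release; in older releases a match merge requires both datasets to be sorted on the key variable):

sort year This sorts the dataset in memory on the merge key
merge year using macro This joins the observations of macro.dta to the data in memory, matching on year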

 order varlist
Notice that, order changes the order of the variables in the current dataset. The variables
specified are moved, in order, to the front of the dataset.

 move varname1 varname2

The command move also reorders variables. It relocates varname1 to the position of
varname2 and shifts the remaining variables, including varname2, to make
room. Similarly, the command aorder alphabetizes the variables specified in the variable list
and moves them to the front of the dataset. If no variable list is specified, _all is
assumed.

 rename old varname new varname

Note that rename changes the name of an existing variable; the contents of the
variable remain unchanged.
Example: rename INV INVE renames INV as INVE.

 save filename

Note that save stores the dataset currently in memory on disk under the name filename.
If filename is not specified, the name under which the data was last known to Stata
is used. If filename is specified without an extension, .dta is used.
Example: save myfile. This will save the dataset under the name myfile.dta.

2.2.2 Data Transformation and Pre-Estimation Tests

Before conducting estimation and analysis, it may be important to transform the data into
the required form. Moreover, performing pre-estimation tests is important so as to adjust
the data set into an acceptable form.

A- Data Transformation and Pre-Estimation Tests Using Stata


The following commands are used to transform data and perform some tests.
 generate
This command creates a new variable.
Example: suppose that we want to create the logarithms of INV and EXPO. Then we write
the following Stata command: generate LINV=log(INV). This creates the (natural)
logarithm of investment and stores it in the spreadsheet as LINV. Similarly, to generate
the logarithm of export we write: generate LEXPO=log(EXPO)

 replace
This command changes the contents of an existing variable.
Example: if we write the command replace GDP=100*GDP, it replaces the old variable
GDP with 100 times their previous values.
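To illustrate how generate and replace can be combined, the following sketch creates a GDP growth-rate variable (the name GDPGR is our own choice; GDP[_n-1] refers to the value of GDP in the previous observation, so the data must already be sorted in time order):

generate GDPGR = (GDP - GDP[_n-1])/GDP[_n-1] This creates the growth rate of GDP
replace GDPGR = 100*GDPGR This re-expresses the growth rate in percentage terms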

 gsort
Note that gsort arranges the observations in ascending or descending order of the
specified variable names. This command is different from sort in that sort can produce
only ascending-order arrangements. Note that each variable name can be numeric or
string. The observations are placed in ascending order of a variable if + or nothing is
typed in front of the name, and in descending order if - is typed.
Example: gsort GDP creates an ascending-order arrangement of GDP. But if we write
gsort GDP -INV, then a dataset in ascending order of GDP and descending order of INV is
created.

 inspect
This command displays a simple summary of a variable's characteristics. It reports the number of
negative, zero, and positive values; the number of integers and non-integers; the number
of unique values; the number of missing values; and produces a small histogram. Its purpose is
not analytical; instead it allows you to quickly gain familiarity with unknown
data. Example: inspect EXPO

 list
List displays the values of variables. If no variable list is specified, the values of
all the variables are displayed.
 sample as a percent or as a count

Note that sample draws random samples from the data in memory. Sampling here is
defined as drawing observations without replacement. The size of the sample to be drawn
can be specified as a percentage or as a count.
Consider the following commands:

sample 50 This draws a 50% sample

sample 50, count This draws a sample of size 50

Suppose that we have data on males and females. Then consider the commands below:

by sex: sample 50 This draws 50% of the men and 50% of the women

by sex: sample 50, count This draws a sample of size 50 for men and 50 for women

Usually the first thing that we should do before we perform any kind of data analysis is to
examine the behavior of the data to be used in the analysis. Such an approach will help to
adjust the data in a manner that is applicable to appropriate estimations. In this
connection we discuss here an approach that will help us identify whether a variable has
outliers. Recall that by an outlier we mean an extreme value compared to the average.
The two commands usually used in this regard are the box plot and the one-way scatter plot.
Consider data on imports (given by IMPO) of Ethiopia for the period 1953 to 1995 E.C.
The commands below are used to construct the one-way scatter plot and the box plot respectively.

 graph IMPO, oneway: This command produces a scatter plot of the variable
import (IMPO) as shown hereunder.

Figure 2.18 Scatter plot result for the variable import.

As can be seen from the result above, most of the values of the variable IMPO (import)
are concentrated around a value of 260, as shown in the (dark) left side of the scatter
plot. The figure also points out the presence of an outlier on the right-hand side
of the scatter plot. That is, values around 21,550 are extreme and hence represent
outliers compared to the majority of the values of the variable.

A similar result can be obtained if we make use of the box plot approach. In this case the
output will make use of a box, while the interpretation is more or less the same. To construct the
box plot examination for the variable IMPO (import) we write the following command:

 graph IMPO, box

This command will draw a box plot for the variable IMPO as follows.


Figure 2.19 Box-Plot result for import (IMPO)

Notice from the box plot result above that there are outliers, as shown by the dots at the
upper part of the box plot. As can be seen clearly, values around 21,550 are extreme and
hence represent outliers compared to the majority of the values of the variable.

Note that such outlier values greatly affect the mean value of the variable and also
make the standard deviation larger than it would have been without such
extreme values. Because of such complications and the unattractive results that follow
from outliers, we may omit or exclude extreme observations. This is usually practiced in
regression analysis, since regression is nothing but a conditional expectation.
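Besides these graphical checks, a quick numerical way to inspect extreme values is the summarize command with the detail option, which reports the main percentiles along with the four smallest and four largest observations of a variable. For example:

summarize IMPO, detail This displays detailed summary statistics for IMPO, including its extreme values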

Check Your Progress 2.5


Consider the following hypothetical data on Y, X1 and X2

Y 10 12 13 11 14 10 16 15 18
X1 0.25 0.75 1.25 3.5 8 6.1.25 1.00 0.9 1.5
X2 150 180 600 400 100 120 140 130 175

a) Using Stata command, generate the logarithm of each variable


b) Test the presence of outliers (extreme values) for each variable

B. Data Transformation and Pre-Estimation Tests Using EViews

Using EViews we can transform variables into other forms. For example, the variable GDP
can be transformed into its logarithm, the lagged value of GDP can be created, and
so on. To transform a variable we need to write a command that makes EViews perform
the adjustment. The command is written after clicking the GENR toolbar button in the workfile
box. The steps in doing so are explained as follows. Note that there are many possible
transformations, but we will see two of them.

 First open the workfile box that contains the names of the variables that we want
to transform.

 Then from the tool bar click GENR. This will open Generate Series By Equation
box. It is in this box that we write the command to transform the variable into
another form.
 If our objective is to transform variable Y into its logarithm, we write the
following command in the box:

LY = Log(Y) and then click OK (or press ENTER). This will create the
logarithm of Y and store it as LY in the EViews spreadsheet.

 If our objective is to create the one-period lagged value of Y, we write:

DY = Y(-1) and press OK. This creates the one-period lagged value of the
variable Y and stores it in the EViews spreadsheet as DY. Note that in this case the
first observation of DY is missing, since the lagged value of the first observation
is not available; in EViews this is represented by NA.
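Other transformations can be generated in the same way. As an additional sketch (the names GY and DLY are our own labels), the following commands, written in the Generate Series By Equation box, create the growth rate of Y and the first difference of the log of Y:

GY = (Y - Y(-1))/Y(-1) creates the one-period growth rate of Y
DLY = log(Y) - log(Y(-1)) creates the first difference of the logarithm of Y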

For example, consider the following table, which contains the variables GDP, export
(represented by EXPO) and import (represented by IMPO) for the period 1986 to 1995
E.C.
Year GDP EXPO IMPO
1986 28328.90 3223.000 6090.500
1987 33885.00 4898.100 7950.000
1988 37937.60 4969.700 8721.500
1989 41465.10 6730.600 10584.70
1990 44840.30 7116.900 11341.20
1991 48803.20 6878.000 14101.50
1992 53189.70 8017.600 15969.30
1993 54210.70 7981.500 16193.60
1994 51760.60 8027.400 17709.50
1995 54585.90 8319.300 21557.60
Table 2.1 Data on GDP, Export, and Import from 1986 to 1995

Suppose that we want to create the logarithm and lagged value of each of the variables
tabled above. The transformation is done by writing the following commands in the
Generate Series By Equation box.

A) Transformation into logarithms
- For GDP we write LGDP = log(GDP) and click OK or press ENTER
- For EXPO we write LEXPO = log(EXPO) and click OK or press ENTER
- For IMPO we write LIMPO = log(IMPO) and click OK or press ENTER

B) Transformation into the first lagged value


- For GDP we write DGDP = GDP(-1) and click OK or press ENTER
- For EXPO we write DEXPO = EXPO(-1) and click OK or press ENTER
- For IMPO we write DIMPO = IMPO(-1) and click OK or press ENTER

The above commands will create the logarithm and one-period lagged value of the GDP, export
and import variables as shown below.

Obs LGDP LEXPO LIMPO DGDP DEXPO DIMPO


1986 10.25164 8.078068 8.714485 NA NA NA
1987 10.43073 8.496603 8.980927 28328.90 3223.000 6090.500
1988 10.54370 8.511115 9.073547 33885.00 4898.100 7950.000
1989 10.63261 8.814420 9.267165 37937.60 4969.700 8721.500
1990 10.71086 8.870228 9.336197 41465.10 6730.600 10584.70
1991 10.79555 8.836083 9.554036 44840.30 7116.900 11341.20
1992 10.88162 8.989394 9.678423 48803.20 6878.000 14101.50
1993 10.90063 8.984882 9.692371 53189.70 8017.600 15969.30
1994 10.85438 8.990616 9.781856 54210.70 7981.500 16193.60
1995 10.90753 9.026333 9.978484 51760.60 8027.400 17709.50
Table 2.2 Transformed values of GDP, import and export.

Notice that the first value of each lagged variable is NA, indicating that it is not
available.

Check Your Progress 2.6

Consider the values of Y, X1 and X2 provided in check your progress 2.5 and attempt the
following questions

1. Using EViews create the logarithm of Y, X1 and X2
________________________________________________________________________
________________________________________________________________________

2. Generate the lagged values of each variable.


________________________________________________________________________
________________________________________________________________________

2.3 Summary

In this unit we have seen what we mean by data entry, transformation and pre-estimation
tests. In an Excel spreadsheet we can enter numbers, text, dates or times. To do so, we first
click the cell where we want to enter data. Then we type the data and press ENTER or
TAB. Note that we can enter the data row by row or column by column.

As we have seen, EViews provides convenient visual ways to enter data series from the
keyboard or from disk files, to create new series from existing ones, and to perform
data transformations. EViews takes advantage of the visual features of modern Windows
software. You can use your mouse to guide the operation with standard Windows menus
and dialogs. Results appear in windows and can be manipulated with standard
Windows techniques. In EViews there are different ways of data entry: entering data
from the keyboard and entering data from a file.

The other software used for the discussion is Stata. To open Stata, just double-click on the
Stata icon (or wstata icon) found in the program files of your computer. Once Stata is
opened, your computer's screen will display the Stata window, which consists of the four
windows that Stata shows by default in addition to the menu and tool bar. To perform
any kind of analysis using Stata, we have to have data in the Stata spreadsheet. The data can
be imported from another spreadsheet (such as Excel) or entered via the keyboard into the
Stata spreadsheet.

Data management encompasses the initial tasks of creating a data set, editing it to correct
errors, and adding internal documentation such as variable and value labels. It also
encompasses many other jobs required by ongoing projects, such as reorganizing,
simplifying, or sampling from the data. Moreover, it includes adding further observations
or variables; separating, combining or collapsing data sets; converting variable types; and
creating new variables through algebraic or logical expressions. Note that although Stata
is best known for its analytical capabilities, it possesses a broad range of data
management features as well. Similarly, using EViews we can transform variables into
other forms. For example, the variable GDP can be transformed into its logarithm, the
lagged value of GDP can be created and so on. To transform a variable we need to write a
command that will make EViews perform the adjustment.

2.4 Answers to Check Your Progress


Answer to Check Your Progress 2.2

Refer to section 2.2.1.2 for the answer

Answer to Check Your Progress 2.3

Refer to section 2.2.1.2-V for the answer

Answer to Check Your Progress 2.4

Refer to section 2.2.1.3-V for the answer

2.5 Model Examination

Consider the following quarterly data on GDP, M1 (money supply), Pr (price) and Rs
(interest rate) for the period 1993:1 up to 1996:4

Year GDP PR M1 RS
1993:1 1611.1 1.018411 1098.221 2.993333
1993:2 1627.3 1.023475 1135.69 2.983333
1993:3 1643.625 1.02831 1168.657 3.02
1993:4 1676.025 1.035079 1187.475 3.08
1994:1 1698.6 1.041367 1210.237 3.25
1994:2 1727.875 1.047149 1211.559 4.036667
1994:3 1746.65 1.053865 1210.962 4.51
1994:4 1773.95 1.06088 1204.365 5.283333
1995:1 1792.25 1.069409 1209.235 5.78
1995:2 1802.375 1.074633 1219.42 5.623333
1995:3 1825.3 1.080187 1204.52 5.38
1995:4 1845.475 1.086133 1197.609 5.27
1996:1 1866.875 1.093915 1195.807 4.95
1996:2 1901.95 1.098441 1208.025 5.04
1996:3 1919.05 1.105475 1218.991 5.136667
1996:4 1948.225 1.110511 1202.149 4.97

Using the above data, attempt the following questions:

1. Using the box plot and scatter plot, test for the presence of outliers in each variable given
above.
2. Using Stata, generate the logarithm of each variable.
3. Draw a random sample of 30% from the above data.
4. Using EViews, create the logarithm of each variable.
5. Using EViews, generate a one-period lagged value of each variable.

Unit Three: Statistical Estimation and Graphical Analysis

3.0 Objective
3.1 Introduction
3.2 What Is Statistical Data Analysis
3.3 Some Basic Statistics Concepts
3.4 Statistical Estimation, Graphing and Analysis
3.4.1 Using Excel
3.4.2 Using EViews
3.4.3 Using Stata
3.5 Summary
3.6 Answers to Check Your Progress
3.7 Model Examination

3.0 Objective
The aim of this unit is to equip the student with the various ways of computing statistics
and drawing graphs. After completing this unit, a student will be able to:
 Understand several statistics-related concepts
 Compute several statistics concepts and issues with analysis
 Draw graphs and analyze the results

3.1 Introduction

The original idea of "statistics" was the collection of information about and for the
"state". The word statistics derives directly not from any classical Greek or Latin roots, but
from the Italian word for state.

The birth of statistics occurred in the mid-17th century. A man named John Graunt, who was
a native of London, began reviewing a weekly church publication issued by the local
parish clerk that listed the number of births, christenings, and deaths in each parish.
These so-called Bills of Mortality also listed the causes of death. Graunt, who was a
shopkeeper, organized these data in the form we call descriptive statistics, which was
published as Natural and Political Observations Made upon the Bills of Mortality.

Probability has a much longer history. Probability is derived from the verb to probe,
meaning to "find out" what is not too easily accessible or understandable. The word
"proof" has the same origin, providing the necessary details to understand what is claimed
to be true. Probability originated from the study of games of chance and gambling during
the sixteenth century. Probability theory was a branch of mathematics studied by Blaise
Pascal and Pierre de Fermat in the seventeenth century. Currently, in the 21st century,
probabilistic modeling is used in many areas such as quality control, insurance,
investment, and other sectors of business and industry.

Developments in the field of statistical data analysis often parallel or follow
advancements in other fields to which statistical methods are fruitfully applied. Because
practitioners of statistical analysis often address particular applied decision problems,
methods development is consequently motivated by the search for better decision
making under uncertainty.

Decision making under uncertainty is largely based on the application of statistical
data analysis for probabilistic risk assessment of one's decisions. Managers need to
understand variation for two key reasons. First, so that they can lead others to apply statistical
thinking in day-to-day activities, and second, to apply the concept for the purpose of
continuous improvement. This section will provide us with various computer-based
analyses so as to apply them to make educated decisions. Therefore, it is a section in
computer-supported statistical thinking via a data-oriented approach.

Note that statistical models are currently used in various fields of business and
economics. However, the terminology differs from field to field. For example, the fitting
of models to data is variously called calibration, history matching, or data assimilation;
all these terms are synonymous with parameter estimation.

3.2 What is Statistical Data Analysis? Data are not Information

In this section we first explain various concepts of statistical analysis and then discuss the
various ways of performing statistical estimation using EViews and Stata.

A. The process of Data to Knowledge

A given database contains a wealth of information, yet we make use of only a fraction of it. In
organizations, employees waste time scouring multiple sources for data. The
decision-makers are frustrated because they cannot get business-critical data exactly
when they need it. Therefore, too many decisions are based on guesswork, not facts.
Many opportunities are also missed, if they are even noticed at all.

Knowledge is what we know well. Information is the communication of knowledge.
Information can be classified into explicit and tacit forms. Explicit information can be
explained in structured form, while tacit information is inconsistent and fuzzy to explain.
We should know that data are only crude information and not knowledge by themselves.

The sequence from data to knowledge is: from Data to Information, from Information to Facts,
and finally, from Facts to Knowledge. Data become information when they become relevant to your
decision problem. Information becomes fact when the data can support it. Facts are what
the data reveal. However, decisive instrumental (i.e., applied) knowledge is expressed
together with some statistical degree of confidence.

A fact becomes knowledge when it is used in the successful completion of a decision
process. The following figure illustrates the statistical thinking process, based on data, in
constructing statistical models for decision making under uncertainty.

Figure 3.1 The sequence from data to knowledge

The above figure depicts the fact that as the exactness of a statistical model increases, the
level of improvement in decision-making increases. That is why we need statistical data
analysis. Statistical data analysis arose from the need to place knowledge on a systematic
evidence base. This required a study of the laws of probability, the development of
measures of data properties and relationships, and so on.

Statistical inference aims at determining whether any statistical significance can be
attached to results after due allowance is made for any random variation as a source of
error. Intelligent and critical inferences cannot be made by those who do not understand
the purpose, the conditions, and the applicability of the various techniques for judging
significance.

Considering the uncertain environment, the chance that "good decisions" are made
increases with the availability of "good information." The chance that "good information"
is available increases with the level of structuring of the process of knowledge
management.

Knowledge is more than knowing something technical. Knowledge needs wisdom.
Wisdom is the power to put our time and our knowledge to proper use. Wisdom is the
accurate application of accurate knowledge, and its key component is knowing the
limits of your knowledge. Wisdom is about knowing how something technical can be
best used to meet the needs of the decision-maker. Wisdom, for example, creates
statistical software that is useful rather than technically brilliant.

Almost every professional needs a statistical toolkit. Statistical skills enable us to
intelligently collect, analyze and interpret data relevant to our decision-making.
Statistical concepts enable us to solve problems in a diversity of contexts. Statistical
thinking enables us to add substance to our decisions.

The appearance of computer software is among the most important events in the process of
model-based statistical decision making. These tools allow us to construct numerical
cases, easily understand the concepts, and find their significance for ourselves.

B. Steps Required in Statistical Data Analysis

Note that data are not information. To determine what statistical data analysis is, one
must first define statistics. Statistics is a set of methods that are used to collect, analyze,
present, and interpret data. Statistical methods are used in a wide variety of occupations
and help people identify, study, and solve many complex problems. In the business and
economic world, these methods enable decision makers and managers to make informed
and better decisions about uncertain situations.

Vast amounts of statistical information are available in today's global and economic
environment because of continual improvements in computer technology. To compete
successfully globally, managers and decision makers must be able to understand the
information and use it effectively. Statistical data analysis provides hands on experience
to promote the use of statistical thinking and techniques to apply in order to make
educated decisions.

Computers play a very important role in statistical data analysis. Statistical software
packages offer extensive data-handling capabilities and numerous statistical analysis
routines that can analyze small to very large data sets. The computer will assist in
the summarization of data, but statistical data analysis focuses on the interpretation of the
output to make inferences and predictions.

Studying a problem through the use of statistical data analysis usually involves four basic
steps. These are (i) Defining the problem, (ii) Collecting the data, (iii) Analyzing the data,
and (iv) Reporting the results. The following discusses each case briefly.

I. Defining the Problem

An exact definition of the problem is imperative in order to obtain accurate data about it.
It is extremely difficult to gather data without a clear definition of the problem.

II. Collecting the Data

We live and work at a time when data collection and statistical computation have
become easy almost to the point of triviality. Paradoxically, the design of data collection,
never sufficiently emphasized in statistical data analysis textbooks, has been
weakened by an apparent belief that extensive computation can make up for any
deficiencies in the design of data collection. One must start with an emphasis on the
importance of defining the population about which we are seeking to make inferences; all
the requirements of sampling and experimental design must be met.

Designing ways to collect data is an important job in statistical data analysis. Statistical
inference refers to extending knowledge obtained from a random sample to the whole
population. This is known as inductive reasoning; that is, knowledge of the whole from
a part. Its main application is in hypothesis testing about a given population. The purpose
of statistical inference is to obtain information about a population from information
contained in a sample. It is just not feasible to test the entire population, so a sample is
the only realistic way to obtain data, because of time and cost constraints. Data can be
either quantitative or qualitative. Qualitative data are labels or names used to identify an
attribute of each element. Quantitative data are always numeric and indicate either how
much or how many.

III. Analyzing the Data

Statistical data analysis divides the methods for analyzing data into two categories:
exploratory methods and confirmatory methods. Exploratory methods are used to
discover what the data seems to be saying by using simple arithmetic and easy-to-draw
pictures to summarize data. Confirmatory methods use ideas from probability theory in
the attempt to answer specific questions. Probability is important in decision making
because it provides a mechanism for measuring, expressing, and analyzing the
uncertainties associated with future events. The majority of the topics addressed in this
course fall under this heading.

IV. Reporting the Results

Through inference, an estimate of, or a test of claims about, the characteristics of a population
can be obtained from a sample. The results may be reported in the form of a table, a
graph or a set of percentages. Because only a small collection (sample) has been
examined and not the entire population, the reported results must reflect the uncertainty
through the use of probability statements and intervals of values.

Check Your Progress 3.1

1. Explain the sequence of data to knowledge

________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

2. State the steps required in statistical data analysis

________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

3.3 Some Basic Statistics Concepts

Statistics is the science of collecting, organizing, analyzing and interpreting data to assist in
making more effective decisions. Before discussing computer-based statistical estimation,
we will give a brief description of some basic statistical concepts, as follows.

Mean. This measures the average of the observations. The most common measure of average
is the arithmetic mean. Given observations $X_1, X_2, \ldots, X_n$, the mean is given by the
following formula:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
Note that the mean lends itself to a subsequent analysis because it includes the values of
all items.
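For example, for the three observations 2, 4 and 6, the mean is:

$$\bar{X} = \frac{2 + 4 + 6}{3} = 4$$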

Dispersion. This is the variation or scatter of a set of values. A measure of the degree of
dispersion of the data is needed for assessing the reliability of the average of the data.
The most important measures of dispersion are the variance and the standard deviation. Note
that the higher the dispersion, the lower the reliability of the average of the data.
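For instance, the sample variance and the standard deviation are commonly computed as:

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}, \qquad s = \sqrt{s^2}$$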

Covariance. The covariance between any two variables measures the co-movement
between the variables. If the covariance is positive, then the two variables move
together. If it is negative, the two variables move in opposite directions. On the
other hand, if it is zero, the two variables are not linearly related. Note that the
magnitude of the covariance cannot be interpreted as an indication of the degree of linear
association between the two variables. To convey information about the relative
strength of the relationship between any two variables, we employ the concept of correlation.
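In sample terms, the covariance between X and Y can be written as follows (note that some texts, and Excel's COVAR function, divide by n instead of n-1):

$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$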

Correlation. One of the methods for measuring the relationship between variables is the
correlation coefficient. Correlation is defined as the degree of relationship existing
between two or more variables. The degree of relationship existing between two variables
is called simple correlation, and the degree of relationship existing between three or more
variables is called multiple correlation.

The correlation coefficient, like the covariance, is a measure of the extent to which two
measurement variables “vary together.” Unlike the covariance, the correlation coefficient
is scaled so that its value is independent of the units in which the two measurement
variables are expressed. For example, if the two measurement variables are weight and
height, the value of the correlation coefficient is unchanged if weight is converted from
pounds to kilograms. The value of any correlation coefficient must be between -1 and +1
inclusive.

Two variables are said to be positively correlated if they tend to vary together in the same
direction, i.e. if they tend to increase or decrease together. For example, price of a
commodity and quantity supplied are positively correlated because when the price
increases, the quantity supplied increases and conversely when price falls, the quantity
supplied decreases. On the other hand, two variables are said to be negatively correlated
if they tend to change/vary/ in the opposite direction i.e. when one of the variable
increases the other decreases and vice versa. For example the price of a commodity and
quantity demanded are negatively correlated. When price increases, quantity demanded
decreases, and when price falls, demand for the commodity increases. Note that the
difference between covariance and correlation is that correlation coefficients are scaled to
lie between -1 and +1 inclusive, while corresponding covariances are not scaled. Both the
correlation coefficient and the covariance are measures of the extent to which two
variables “vary together.”

To obtain a quantitative measure of the degree of correlation between two variables, we
use a parameter called the correlation coefficient. We have two types of correlation
coefficients. These are the population correlation coefficient, which is denoted by ρ, and
the sample correlation coefficient, denoted by r.

The population correlation coefficient (ρ) refers to the correlation over all the values of the
population of the variables, and the sample correlation coefficient (r) refers to the estimate of
the population correlation coefficient from the sample. The sample correlation coefficient
between two variables X and Y is defined by the following formula:

$$r_{xy} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

Note the following.

 When r > 0, the two variables increase or decrease together.
 When r = 1, there is perfect positive correlation between X and Y, and all
observations on X and Y lie on a straight line with a positive slope.
 When r < 0, the two variables move in opposite directions.
 When r = -1, there is perfect negative correlation between X and Y, and
all observations on X and Y lie on a line with a negative slope.
 When r = 0, the two variables are uncorrelated.

The closer the value of the correlation coefficient to one, the greater the degree of
correlation (the closer the scatter of points approach a straight line). On the other hand,
the closer the value of the correlation coefficient to zero, the greater the scatter of points.
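In Stata, for instance, both the correlation and the covariance of two variables can be obtained with the correlate command (here illustrated with the GDP and INV series used earlier):

correlate GDP INV This displays the correlation coefficient between GDP and INV
correlate GDP INV, covariance This displays their covariance instead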

Hypothesis Testing. A hypothesis is a testable belief or opinion. It is a statement
about the population developed for the purpose of testing. A statistical hypothesis is a
statement about the values of parameters in the population.

In hypothesis testing, the most common approach is to establish a set of two mutually
exclusive and exhaustive hypotheses about the true value of the parameter under study.
Then a sample is used to assess the hypotheses. In short, the following are the steps to be
used in testing a hypothesis.

The first thing that we do is formulate the null and the alternative hypothesis. The null
hypothesis represents the hypothesis to be tested empirically. Thus, this hypothesis is
developed for the purpose of testing. It is represented by H0. The alternative hypothesis, on
the other hand, is the counter-proposition against which we test the null hypothesis. It
describes what we will conclude if we reject the null hypothesis. It is represented by H1.

Then we need to set the level of significance. The level of significance is the probability
of rejecting the null hypothesis when it is true. Generally, in making decision regarding
test of hypothesis, there are two possible errors. These are:
Type I error. This is the type of error committed when we reject H0 while it is true. The
probability of committing a Type I error is denoted by α.

Type II error. This is the type of error committed when we accept H0 while it is false. The
probability of committing a Type II error is denoted by β.

Note that α is called the level of significance. By setting the level of α at 0.1, or 0.05, or
0.01, we minimize the probability of committing type I error.

Recall that we said hypothesis testing is a procedure that helps to decide whether the
observed difference between the sample value and the population value is real or due to
chance. To test whether the observed difference between the data and what is expected
under the null hypothesis is real or due to chance variation, we use a test statistic. In the
procedure we compare the value obtained using the test statistic with the critical (or table)
value. Then we reject the null hypothesis if the test statistic is statistically significant at
the chosen significance level α; otherwise the null hypothesis is not rejected. This happens
when the test statistic is not statistically significant.
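As a sketch of how such a test looks in practice, Stata's ttest command tests whether the mean of a variable equals a hypothesized value (the value 50000 below is arbitrary, and depending on the Stata release the equality is written with = or ==):

ttest GDP = 50000 This tests H0: the mean of GDP equals 50000, reporting the test statistic and P-values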

Alternatively, we can make use of the P-value in hypothesis testing. In a statistical hypothesis
test, the P-value is the probability of observing a test statistic at least as extreme as the
value actually observed, assuming that the null hypothesis is true. A P-value is a measure
of how much evidence you have against the null hypothesis: the smaller the P-value, the
more evidence you have. In such a case, if the P-value is less than some threshold
(usually 0.05, sometimes a bit larger like 0.1 or a bit smaller like 0.01), then you reject the
null hypothesis.

The P-value, which directly depends on a given sample, attempts to provide a measure of
the strength of the results of a test, in contrast to a simple reject or do not reject. If the
null hypothesis is true and the chance of random variation is the only reason for sample
differences, then the P-value is a quantitative measure to feed into the decision making
process as evidence. The following table provides a reasonable interpretation of P-values:

P-value            Interpretation
P < 0.01           very strong evidence against H0
0.01 < P < 0.05    moderate evidence against H0
0.05 < P < 0.10    suggestive evidence against H0
0.10 < P           little or no real evidence against H0

This interpretation is widely accepted, and many scientific journals routinely publish
papers using this interpretation for the results of tests of hypotheses. In general, when a
P-value is associated with a set of data, it is a measure of the probability that the data could
have arisen as a random sample from some population described by the statistical
(testing) model.

Check Your Progress 3.2


1. Explain the concepts of the mean and the standard deviation
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
2. Compare and contrast between covariance and correlation
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
3. State the steps required in formulating hypothesis testing
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

3.4 Statistical Estimation, Graphics and Analysis

In this section we will explain the approaches employed in making graphical and
statistical estimations. The discussion first makes a brief presentation of the Excel program,
to be followed by the EViews and Stata programs.

3.4.1 Statistical Estimation (Using Excel)

Note that Excel can carry out several kinds of statistical analysis. The following explains the
steps required to compute various statistical quantities. The discussion
assumes that we have data in an Excel spreadsheet. To perform the estimation we use the
following steps.
1. On the Tools menu, click Data Analysis.
2. In the Data Analysis dialog box, click the name of the analysis tool you want to
use, then click OK.
3. In the dialog box for the tool you selected, set the analysis options you want.

You can use the Help button on the dialog box to get more information about the options.

Note that if Data Analysis is not available in the first place, we have to load the Analysis
ToolPak. This is performed using the following approach.

A. On the Tools menu, click Add-Ins.

B. In the Add-Ins available list, select the Analysis ToolPak box, and then
click OK.
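As a quick alternative to the ToolPak for simple quantities, results can also be computed
directly with worksheet functions. A minimal sketch, assuming (hypothetically) that the
observations of two variables are stored in cells A2:A44 and B2:B44:

   =AVERAGE(A2:A44)          mean of the first variable
   =STDEV(A2:A44)            sample standard deviation of the first variable
   =CORREL(A2:A44,B2:B44)    correlation between the two variables
   =COVAR(A2:A44,B2:B44)     population covariance between the two variables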

Some Statistical Computation Tools Using Excel

The following presents some of the tools provided by Excel in statistical estimation.
Anova. The Anova analysis tools provide different types of variance analysis. The tool to
use depends on the number of factors and the number of samples you have from the
populations you want to test.

Anova: Single Factor. This tool performs a simple analysis of variance on data for two or
more samples. The analysis provides a test of the hypothesis that each sample is drawn
from the same underlying probability distribution against the alternative hypothesis that
underlying probability distributions are not the same for all samples.

Anova: Two-Factor With Replication. This analysis tool is useful when data can be

classified along two different dimensions.

Anova: Two-Factor Without Replication. This analysis tool is useful when data are
classified on two different dimensions as in the Two-Factor case With Replication.

Correlation. The CORREL and PEARSON worksheet functions both calculate the
correlation coefficient between two measurement variables when measurements on each
variable are observed for each of N subjects. Note that any missing observation for any
subject causes that subject to be ignored in the analysis. The Correlation analysis tool is
particularly useful when there are more than two measurement variables for each of N
subjects. It provides an output table, a correlation matrix, showing the value of CORREL
(or PEARSON) applied to each possible pair of measurement variables.

Covariance. The Correlation and Covariance tools can both be used in the same setting,
when you have N different measurement variables observed on a set of individuals. The
Correlation and Covariance tools each give an output table, a matrix, showing the
correlation coefficient or covariance, respectively, between each pair of measurement
variables.

The Covariance tool computes the value of the worksheet function COVAR for each pair
of measurement variables. Note that direct use of COVAR rather than the Covariance
tool is a reasonable alternative when there are only two measurement variables. The
entry on the diagonal of the Covariance tool's output table in row i, column i is the
covariance of the i-th measurement variable with itself. This is just the population
variance for that variable as calculated by the worksheet function VARP.

Descriptive Statistics. The Descriptive Statistics analysis tool generates a report of
univariate statistics for data in the input range, providing information about the central
tendency and variability of your data.

Histogram. The Histogram analysis tool calculates individual and cumulative frequencies
for a cell range of data and data bins. This tool generates data for the number of
occurrences of a value in a data set.

For example, in a class of 20 students, you could determine the distribution of scores in
letter-grade (A, B, ...) categories. A histogram table presents the letter-grade boundaries
and the number of scores between the lowest bound and the current bound. The single
most-frequent score is the mode of the data.

Rank and Percentile. The Rank and Percentile analysis tool produces a table that contains
the ordinal and percentage rank of each value in a data set. You can analyze the relative
standing of values in a data set. This tool uses the worksheet functions, RANK and
PERCENTRANK.

What we stated above are some of the statistical computations that Excel can conduct.
Note, however, that for this purpose Excel is not as user-friendly and appealing as
specialized software such as EViews and Stata. Therefore, with this brief introduction we
move on to the discussion of statistical and graphical computation using EViews and Stata.

3.4.2 Statistical and Graphical Analysis Using EViews

EViews very simply and attractively provides us with many types of graphical results.
Note, however, that to execute any kind of graph we should first have the relevant data in
the EViews spreadsheet. That is, the workfile (untitled, or titled if we have already saved
the data) has to contain the names of our variables in addition to the default objects C and RESID.

There are alternative ways to draw graph using EViews. One of the options is to first
click Quick from the main menu and then from the drop down we select Graph. This will

display the list of graph types available in EViews. These are Line Graph, Bar Graph,
Scatter, XY line and Pie. This is clearly shown in the diagram below.

Figure 3.1 The process to draw a graph

Selecting any of the various types of graph will give the corresponding results. As we
saw in the above diagram, the combo box in the dialog allows us to select a graph type.
To change the graph type, we simply select an entry from the combo box. Note that some
graph types are not available for different data types (for example, we cannot view a
scatter diagram for a single series). Furthermore, some views do not allow us to change
the graph type. In such cases, the Graph Type will display Special, and we will not have
access to the entries in the combo box.

A. Graph Type
The basic graph types are stated as follows:

• Line Graph displays a plot of the series, with each value plotted vertically against
either an observation indicator or time.
• Bar Graph displays the value of each series as the height of a bar.
• Scatter Diagram displays a scatter with the first series on the horizontal axis and
the remaining series on the vertical axis, each with a different symbol.
• XY Line Graph plots the first series on the horizontal axis and the remaining
series on the vertical axis, each connected as a line.
• Pie Chart displays each observation as a pie with each series shown as a wedge in
a different color, where the width of the wedge is proportional to the percentage
contribution of the series to the sum of all series for that observation. Note that
series with negative values are dropped from the chart.

Note that the appearance of graphical views can be customized extensively. However,
changes to a graphical view will often be lost when the view is redrawn (including when
the object is closed and reopened, when the workfile sample is modified, or when the
data underlying the object are changed). Often one would like to preserve the current
view so that it does not change when the object changes. In EViews, this is referred to as
freezing the view. Freezing a graphical view creates a graph object. Thus, if we would
like to customize a view for presentation purposes, we should first freeze the view as a
graph object to ensure that our changes are not lost. To do this we click Freeze in the
graph menu.

Consider having quarterly data on GDP and M1 (money supply) for a given country for
the period 1952:1 up to 1996:4. We can plot these variables together in one plane as
shown below.

[Line graph: GDP and M1 plotted against time, 1955 to 1995]

Figure 3.2 Graph of GDP and M1

The following figure is based on annual data on export and import of Ethiopia for the
period 1953 up to 1995 E.C. The figure clearly shows the relationship between the two
external trade variables.

[Line graph: EXPO and IMPO plotted against time, 1955 to 1995]

Figure 3.3 Graph of export and import from 1953- 1995

Note that the diagrams above are drawn using the line graph type. As can be seen from the
second diagram, the gap between import and export widens across time. In general, using
graphical presentation we can observe the behavior of an individual variable, as well as
its relationship with other variables, against time.

B. Graph Options
Note that for each of the graph types stated previously, there are a number of available
options. These include the following.

Line Graph Options


These options determine whether to show just lines connecting the data points for each
series, symbols marking the data points, or both. Line Patterns lets you choose the type of
line used to plot each series; click in the box and a menu of options will drop down. Note
that if you are plotting your series in color, selecting the line pattern will also change the
color of the displayed line. In fact, if you wish to change the color of your lines, you will
do so by selecting an alternative line pattern.

Bar Graphs Options


These options allow you to put values (height) of each bar in the graph as labels and to
control the spacing between bars of adjacent observations. For large samples with many
bars, the bars are not labeled and have no spacing between them.

Scatter Diagram Options


These options allow you to connect the consecutive points with lines and to plot the fitted
line from a bivariate regression of the vertical axis series on the horizontal axis series.

Pie Graph Options


The Label option puts the date or observation number of each pie in the graph. When
there are many pies, the pies are labeled sparsely.

It is also possible to set the font used in labeling the figures and to add a text. The steps in
doing so are explained as follows.

Setting Fonts
We can change the fonts used in the graph labels (the axes, legends, and added text) by
clicking the Fonts button. The Font button in the Graph Options dialog sets the default
font for all graph components. If we wish, we may then set individual fonts for axes,
legends and text.

Adding Text
Moreover, we can customize a graph by adding one or more lines of text anywhere on the
graph. This can be useful for labeling a particular observation or period, or for adding
titles or remarks to the graph. In a frozen graph object, we simply click on the AddText
button in the toolbar or select Procs and then click on Add text…. The Text Label dialog
will come up.
We then enter the text we want to display in the large edit field. Note that spacing and
capitalization (upper and lower case letters) will be preserved. If we want to enter more
than one line, it can be done by pressing ENTER after each line.

Copying Graphs to Other Windows Programs


Note that we can incorporate an EViews graph directly into a document in a Windows
word processor. To do this, first we need to activate the object window containing the
graph that we wish to move. To activate the required window we click anywhere in it so
that the titlebar changes to a bright color. Then we click on Edit and select Copy on the
EViews main menu, which brings up the Copy Graph as Metafile dialog box. We may then
copy the graph to the Windows clipboard or to a disk file. Note that it is possible to
specify that the graph be in color and that its lines be in bold. If we plan to use the
metafile in another document or application, we should select Make metafile placeable.

C. Statistical Computation using EViews
Based on its window-based approach, EViews computes several statistics. Very simply,
it estimates and reports descriptive statistics of a variable, including the mean, median,
standard deviation, skewness and the like. Moreover, EViews computes the covariance
and the correlation between variables.

To perform the above stated statistical estimations we have to open the workfile that
contains the names of the variables to be used in the analysis. The estimation is
conducted as follows.

Descriptive Statistics
To derive descriptive statistics for each variable we first click Quick and select Group
Statistics. Then we choose Descriptive Statistics, followed by either Individual or
Common Samples. In this case, a Series List box will appear where we have to list the
name(s) of the variable(s) (series). Then we click OK and the result will be displayed. As
an illustration, consider the following descriptive statistics for the variables GDP,
M1 (money supply), the price level (PR) and the interest rate (RS) for a hypothetical
country.
              GDP        M1         PR         RS
Mean          632.4190   445.0064   0.514106   5.412928
Median        374.3000   298.3990   0.383802   5.057500
Maximum       1948.225   1219.420   1.110511   15.08733
Minimum       87.87500   126.5370   0.197561   0.814333
Std. Dev.     564.2441   344.8315   0.303483   2.908939
Skewness      0.845880   0.997776   0.592712   0.986782
Kurtosis      2.345008   2.687096   1.829239   4.049883
Jarque-Bera   24.68300   30.60101   20.81933   37.47907
Probability   0.000004   0.000000   0.000030   0.000000

Table 3.1 Descriptive statistics for GDP, M1, PR and RS

Note that the above table provides descriptive statistical information for each variable.
From the table, we learn that the average (mean) value of GDP for the period given
earlier is 632.41 dollars, while it is 445.00 for money supply, 0.514 for the price level
and 5.412 for the interest rate. Similarly, the table presents the median, standard
deviation, skewness and other results for each variable.

Consider the following result based on annual economic data of Ethiopia collected for
the period 1953 to 1995 E.C. The variables used for the computation are GDP, INV
(investment), SAV (domestic saving), IMPO (import), and EXPO (export), measured in
billions of birr at current market prices.

              GDP        INV        SAV         IMPO       EXPO
Mean          17642.85   2849.879   954.1744    4050.395   2154.688
Median        10635.77   1394.000   688.8000    1833.200   1057.100
Maximum       54585.90   12093.00   3466.300    21557.60   8319.300
Minimum       2883.800   437.4000   -1145.300   258.7000   215.2000
Std. Dev.     16578.71   3148.441   801.4865    5551.567   2636.624
Skewness      1.177501   1.513733   1.220979    1.742244   1.444076
Kurtosis      2.961864   4.066041   5.940276    4.854723   3.431719
Jarque-Bera   9.939243   18.45775   26.17335    27.91713   15.27898
Probability   0.006946   0.000098   0.000002    0.000001   0.000481
Observations  43         43         43          43         43

Table 3.2 Descriptive statistics for GDP, INV, SAV, IMPO and EXPO

Note that the above table provides descriptive statistical information for each variable.
From the table, we learn that the mean value of GDP for the period is 17642.85 billion
birr, while it is 2849.88 for investment and 954.17 for saving. Similarly, the table
presents the standard deviation, skewness and other results of each variable.

Note that in addition to the above result, EViews computes several other statistics with
simple manipulation of its menus. The following discussion explains how to compute the
variance and covariance between variables as well as the correlation between any two
variables.

Variance and Covariance
To compute the variance and covariance between variables we first click Quick from the
main menu and select Group Statistics. Then we choose Covariance. This will display a
Series List on which we write the names of the variables to be used in the computation.
This will give us a covariance matrix. Note that the values listed on the diagonal of the
table represent the variance of each variable, whereas the off-diagonal values describe
the covariance between two variables. The table below presents such results based on
hypothetical data.
Covariance Matrix
GDP M1 PR RS

GDP 316602.7 192558.9 169.0057 544.3403

M1 192558.9 118248.2 102.0296 269.3900

PR 169.0057 102.0296 0.091590 0.362112

RS 544.3403 269.3900 0.362112 8.414915

Table 3.3 Covariance matrix for GDP, M1, PR and RS

The table above presents the covariance matrix for the variables GDP, M1, PR and RS.
Note from the result that the values listed on the diagonal represent the variance of each
variable. For example, the variance of GDP is 316602.7, whereas it is 118248.2 for M1,
0.091590 for PR and 8.414915 for RS.

The values listed off the diagonal show the covariance between two variables. For
instance, the covariance between GDP and M1 is 192558.9, and it is 0.362112 between
PR and RS. A similar approach is used to read the other covariance results stated in the
above table. Note, however, that the values presented below the diagonal match those
above the diagonal. We can repeat the same job, as shown below, using annual data of
Ethiopia on GDP, investment, saving, import and export for the period 1953 to 1995 E.C.
        GDP        INV        SAV        IMPO       EXPO
GDP     1.000000   0.979769   0.366253   0.967962   0.979475
INV     0.979769   1.000000   0.296736   0.991192   0.983098
SAV     0.366253   0.296736   1.000000   0.193738   0.357570
IMPO    0.967962   0.991192   0.193738   1.000000   0.980851
EXPO    0.979475   0.983098   0.357570   0.980851   1.000000

Table 3.4 Covariance matrix for GDP, INV, SAV, IMPO and EXPO

The table presents the covariance matrix for the variables. Note from the result that the
values listed on the diagonal represent the variance of each variable, while the values
listed off the diagonal show the covariance between two variables. Note also that the
values presented below the diagonal match those above the diagonal.

Correlation
To compute the correlation between any two variables we first click Quick from the main
menu and select Group Statistics. Then we choose Correlation. This will display a Series
List on which we must write the names of the variables to be used for the correlation
computation. For example, consider the following correlation matrix obtained using
EViews for hypothetical data on GDP, M1, PR and RS.

        GDP        M1         PR         RS
GDP     1.000000   0.995197   0.992475   0.333494
M1      0.995197   1.000000   0.980402   0.270059
PR      0.992475   0.980402   1.000000   0.412471
RS      0.333494   0.270059   0.412471   1.000000

Table 3.5 Correlation result for GDP, M1, PR and RS

The values listed on the diagonal of the table have no relevant meaning, because each
represents the correlation of a variable with itself measured over the same period. That
is why the result is 1 in all cases along the diagonal. The results off the diagonal,
however, give the correlation between the variables GDP, M1, PR and RS. For example,
the correlation between GDP and M1 (or between M1 and GDP) equals 0.995. This suggests

that there is a strong correlation between the two variables. Similarly the correlation
between RS and M1 (or M1 and RS) equals 0.27 indicating the presence of weak
correlation between the two variables.

We can repeat the same job using annual data of Ethiopia on GDP, investment, saving,
import and export for the period 1953 to 1995 E.C. The result indicates the extent of
relationship between the economic variables under study.

        GDP        INV        SAV        IMPO       EXPO
GDP     1.000000   0.979769   0.366253   0.967962   0.979475
INV     0.979769   1.000000   0.296736   0.991192   0.983098
SAV     0.366253   0.296736   1.000000   0.193738   0.357570
IMPO    0.967962   0.991192   0.193738   1.000000   0.980851
EXPO    0.979475   0.983098   0.357570   0.980851   1.000000

Table 3.6 Correlation result for GDP, INV, SAV, IMPO and EXPO

Note that the result is 1 across the board diagonally. The correlation between GDP and
INV equals 0.98. This suggests that there is a strong correlation between the two
variables. Similarly the correlation between INV and SAV equals 0.29 indicating the
presence of weak correlation between the two variables. Note also that there is a strong
correlation (= 0.98) between import and export.

Histograms and Statistical Results

EViews can also present the histogram and descriptive statistical results simultaneously
for each variable separately. Such presentation helps to examine the various numerical
results together with graphical presentation of the distribution of a variable. To construct
histogram and statistical computations simultaneously, we first click Quick from the main
menu and select Series Statistic. Then we choose Histogram and Stat. This will display

the histogram and descriptive statistical results for the variable (or series) selected for the
purpose. For example, the figure below describes the distribution and statistical
computations for the variable GDP

[Histogram of GDP with accompanying statistics box: Series: GDP; Sample 1953 1995;
Observations 43; Mean 17642.85; Median 10635.77; Maximum 54585.90; Minimum 2883.800;
Std. Dev. 16578.71; Skewness 1.177501; Kurtosis 2.961864; Jarque-Bera 9.939243;
Probability 0.006946]
Figure 3.4 Histogram and statistical results of GDP

Notice from the histogram above that most of the GDP values are less than 20,000. The
diagram and the statistical results point out that the variable GDP is positively skewed.
The figure below is based on the variable investment (INV).

[Histogram of INV with accompanying statistics box: Series: INV; Sample 1953 1995;
Observations 43; Mean 2849.879; Median 1394.000; Maximum 12093.00; Minimum 437.4000;
Std. Dev. 3148.441; Skewness 1.513733; Kurtosis 4.066041; Jarque-Bera 18.45775;
Probability 0.000098]
Figure 3.5 Histogram and statistical result for INV.

Note from the above distribution that the variable investment is positively skewed.
Note also that the statistical results presented to the right of the histogram are what we
can obtain using the process stated earlier. That is, to derive descriptive statistics for
the variable INV we first click Quick and select Group Statistics. Then we choose
Descriptive Statistics, followed by either Individual or Common Samples. In this case, a
Series List box will appear where we write INV and click OK. This will display
descriptive results like the ones we have obtained above.
Check Your Progress 3.3
1. State the steps required to compute the mean and standard deviation of a variable using
EViews
________________________________________________________________________
______________________________________________________________________
________________________________________________________________________
2. From table 3.1, describe the skewness of M1. What does the result imply?
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
3. Using table 3.4 above what is the covariance between
A) GDP and EXPO__________________________________________________
B) INV and IMPO___________________________________________________
4. Given table 3.6 above, compute the correlation between
A) SAV and IMPO__________________________________________________
B) GDP and IMPO__________________________________________________

3.4.3 Statistical Estimation, Analysis and Graphing Using Stata

In this section, we will examine the various ways of conducting statistical estimations
using Stata. The discussion encompasses simple as well as comprehensive approaches.

A. Confidence intervals for means, proportions, and counts

To compute the confidence interval for a given variable, the following command is used:
ci varlist
Note that ci computes standard errors and confidence intervals for each of the variables in
the variable list. For example, consider survey data collected from a sample of 1100
respondents of Michigan University regarding their gender (gn), the department they
attended (Dept), the degree earned (Deg) and their salary (Sal) after graduation.

To compute the confidence interval for salary, the command will be ci Sal; we type it and
then press Enter. Accordingly, the following result will be displayed.

Variable | Obs Mean Std. Err. [95% Conf. Interval]


-------------+-------------------------------------------------------------
Sal | 1100 26064.2 210.0926 25651.98 26476.43

Note that the command produces the mean and standard error of the variable in addition
to the 95% confidence interval. From the result, we are 95% confident that a graduate
from the university on average earns a salary between 25,651.98 and 26,476.43.

B. Correlations and Covariances of variables or estimators


Stata computes very simply the correlation and covariance that may exist between any
two variables. The following are the required commands to perform the job:
• correlate varlist. This presents the correlation between the variables
listed in the command box. Or
• pwcorr varlist
Note that the command correlate without a variable list displays the
correlations for all variables in the data. Similarly, the command pwcorr displays all the
pairwise correlation coefficients between the variables in the variable list or, if no
variable list is specified, between all the variables in the dataset.

In order to construct a covariance matrix we make use of the following command:

• correlate varlist, covariance

This displays the covariance matrix, where the diagonal values represent the variance of
each variable whereas the off-diagonal values give the covariance between two variables.
Consider the following data collected from 10 respondents about their monthly income
(Ic), demand for a good in kg (X) and the market price of the good (Ps).

Ind.   X    Ps    Ic
1      15   5.5   500
2      18   5.2   600
3      16   4.8   650
4      21   4.5   680
5      22   4.6   750
6      26   4.7   780
7      24   4.2   800
8      29   4.0   900
9      28   3.6   950
10     30   3.8   975
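As a side note, a small dataset like this can be typed directly into Stata with the input
command before any analysis is run. A minimal sketch:

clear
input X Ps Ic
15 5.5 500
18 5.2 600
16 4.8 650
21 4.5 680
22 4.6 750
26 4.7 780
24 4.2 800
29 4 900
28 3.6 950
30 3.8 975
end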

To compute the correlation between the variables, any of the following commands may be
used: correlate, or correlate X Ps Ic, or pwcorr. Using any of these commands, we
obtain the following result.

| X Ps Ic
-------------+----------------------------------
X| 1.0000
Ps | -0.8804 1.0000
Ic | 0.9579 -0.9594 1.0000

Note from the result that the correlation between X and Ps is -0.88, while it is -0.9594
between Ps and Ic. As in the EViews results, the diagonal values are equal to one since
each represents the correlation of a variable with itself.
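As a sketch, the significance level of each pairwise correlation can also be printed by
adding the sig option to pwcorr:

pwcorr X Ps Ic, sig

This reports, under each correlation coefficient, the p-value for the test that the
corresponding pair of variables is uncorrelated.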

On the other hand, to compute the covariance between any two variables, we write the
following command:
correlate X Ps Ic, covariance
This displays the covariance of the variables as shown hereunder.

| X Ps Ic
---------- +---------------------------------------
X | 29.2111
Ps | -2.86778 .363222
Ic | 801.5 -89.5167 23966.9

Notice that the covariance between Ps and X is -2.87, whereas it is -89.52 between Ic and
Ps and 801.5 between Ic and X. Note that the diagonal values represent the variance of
each variable.

C. Arithmetic, geometric, and harmonic means


To compute the mean of a variable, the Stata command to be used is:
• means varlist
Note that means reports the arithmetic, geometric, and harmonic means, along with their
respective confidence intervals, for each variable in the variable list, or for all the
variables in the data if no variable list is specified. If we simply want arithmetic means
and the corresponding confidence intervals, we can use the command ci discussed earlier.
For example, for the variables stated above (i.e. X, Ps and Ic), the command to be used is
means X Ps Ic

Variable | Type Obs Mean [95% Conf. Interval]


-------------+------------------------------------------------------------------------
X | Arithmetic 10 22.9 19.03369 26.76631
| Geometric 10 22.29455 18.66619 26.6282
| Harmonic 10 21.67061 18.30293 26.55702
-------------+------------------------------------------------------------------------
Ps | Arithmetic 10 4.49 4.058869 4.921131
| Geometric 10 4.453558 4.044027 4.90456
| Harmonic 10 4.417215 4.027381 4.890606
-------------+------------------------------------------------------------------------
Ic | Arithmetic 10 758.5 647.7537 869.2463
| Geometric 10 743.6627 638.6928 865.8846
| Harmonic 10 728.3064 628.6551 865.5014
---------------------------------------------------------------------------------------

D. Partial correlation coefficients

Recall that the partial correlation measures the relationship between two variables
keeping other things constant. In Stata the command required to compute this is:
• pcorr varname varlist
Note that pcorr displays the partial correlation coefficient of varname with each
variable in varlist, holding the other variables in varlist constant.
For example, to compute the partial correlation of X with Ps and with Ic, we use the
command pcorr X Ps Ic. The result of this command is listed as follows.
Variable | Corr. Sig.
-------------+------------------------
Ps | 0.4772 0.194
Ic | 0.8467 0.004

Notice from the result that the partial correlation between X and Ps is 0.48, whereas it
is 0.85 and significant at 1% between X and Ic. It is significant at 1% because the
significance level (p = 0.004) is less than 1% (i.e. less than 0.01). Notice, however,
that the partial correlation between X and Ps is not significant even at 10% (i.e. 0.1).

Accordingly, to compute the partial correlation of Ps with X or Ic, we use the following
command. pcorr Ps X Ic. The result will be tabulated as follows.
Variable | Corr. Sig.
-------------+-----------------------------
X| 0.4772 0.194
Ic | -0.8526 0.003

The interpretation of the above result is left for the student as an exercise. Note,
however, that the result is partial in the sense that it holds true only while the other
variables are held constant.
E. Skewness and kurtosis test for normality

Stata computes the skewness and kurtosis of a variable as a test of normality. The
appropriate command to execute this is given by:

• sktest varlist

Note that for each variable in the variable list, sktest presents a test for normality
based on skewness and another based on kurtosis, and then combines the two tests into an
overall test statistic. Note also that sktest requires a minimum of 8 observations to make
its calculations. To compute the test of skewness and kurtosis for the variables X, Ps and
Ic we make use of the following command: sktest X Ps Ic
Skewness/Kurtosis tests for Normality
------- joint ------
Variable | Pr(Skewness) Pr(Kurtosis) adj chi2(2) Prob>chi2
-------------+--------------------------------------------------------------------------------
X| 0.793 0.206 1.95 0.3763
Ps | 0.809 0.699 0.21 0.9010
Ic | 0.868 0.568 0.34 0.8421

Notice that the test cannot reject the normality hypothesis for any of the variables, as
can be observed from the probability values, which are all higher than 10% (i.e. 0.1).

F. Spearman's and Kendall's correlations


In Stata, we can compute Spearman's and Kendall's rank correlations between any two
variables. For Spearman's correlation we write the following command (Kendall's
counterpart is sketched at the end of this subsection):
• spearman varname1 varname2
Note that spearman displays the Spearman rank correlation coefficient between
varname1 and varname2. The result of this test is given along with a test that varname1
and varname2 are independent. Using our example of X, Ps and Ic, we can exercise this
issue. For example, to compute the Spearman correlation between X and Ps, we write the
command as follows: spearman X Ps. The result is given as shown below.
Number of obs = 10
Spearman's rho = -0.8667

Test of Ho: X and Ps are independent


Prob > |t| = 0.0012

As can be seen from the above result, the correlation coefficient (rho) between X and
Ps is -0.87. The result represents a strong negative correlation between the two variables.
Moreover, the hypothesis test rejects the null hypothesis (Ho) that X and Ps are
independent: the value 0.0012 is less than 0.01 (or 1%), so Ho is rejected even at 1%.
Note that since the correlation coefficient (-0.87) is very strong, it indicates that the
two variables are highly dependent on one another, and the rejection of Ho is therefore an
expected result. The following output presents the correlation result and the associated
hypothesis test for Ps and Ic.

Number of obs = 10
Spearman's rho = -0.9394

Test of Ho: Ps and Ic are independent


Prob > |t| = 0.0001

The result represents a very strong negative correlation of -0.94. Accordingly, we reject
Ho at 1%, indicating that the two variables are dependent on each other.
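Kendall's rank correlation, promised in the title of this subsection, is obtained
analogously with the ktau command. A minimal sketch for the same pair of variables:

ktau X Ps

This reports Kendall's rank correlation coefficients together with a test of the null
hypothesis that the two variables are independent, interpreted in the same way as the
spearman output above.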

Check Your Progress 3. 4

Consider the data below

Y 10 12 9 7 12 11 13 12
X 8 6 9 10 7 12 15 14
Z 6 7 10 6 8 10 14 16
Based on the above information, attempt the following questions

1. The 95% confidence interval for the variable Y is __________________


2. The 95% confidence interval for the variable Z is __________________
3. The partial correlation between X and Z is _________________________
4. The partial correlation between Y and X is_________________________
5. Is the partial correlation result in (4) above significant?

________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

G. Summary statistics

The other advantage of Stata is that it produces summary statistics. The appropriate
command to handle this task is:
• summarize varlist
This command (i.e. summarize) calculates and displays a variety of univariate
(individual) summary statistics. Note that if no variable list is specified after the
command, summary statistics are calculated for all the variables in the Stata dataset.
The following result reports the summary statistics for our variables X, Ps and Ic
(note that the command is summarize X Ps Ic).

Variable | Obs Mean Std. Dev. Min Max


-------------+-----------------------------------------------------------------
X| 10 22.9 5.40473 15 30
Ps | 10 4.49 .6026792 3.6 5.5
Ic | 10 758.5 154.8126 500 975

Note that the result gives the mean, standard deviation, minimum and maximum of each
variable. This helps to obtain basic information about each variable (i.e. X, Ps and Ic).
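A more detailed report, including percentiles, the smallest and largest values, the
variance, skewness and kurtosis, is produced by adding the detail option; a small sketch:

summarize X Ps Ic, detail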

H. Tables of summary statistics

In Stata it is possible to present tables of summary statistics. The kind and form of the
tabular summary depends on the command written. That is, we can order Stata to produce a
simple frequency table, or a cross tabulation that presents the frequency of a variable
for given values of other variables, and so on. The following commands represent such
cases using the variables X, Ps and Ic:

• table X. This gives a frequency table for the variable X.
• table X Ps Ic. This gives a frequency table in cross-tabulation form
between the variables.
• table X, contents(n Ps). This develops a frequency-based relationship
between the variables X and Ps.
• table Ic, c(n Ps mean Ps sd Ps median Ps). This reports a frequency-based
mean and median of Ps for each value of Ic. The result below represents this
case.

--------------------------------------------------------------
Ic | N(Ps) mean(Ps) med(Ps)
----------+--------------------------------------------------
500 | 1 5.5 5.5
600 | 1 5.2 5.2
650 | 1 4.8 4.8
680 | 1 4.5 4.5
750 | 1 4.6 4.6
780 | 1 4.7 4.7
800 | 1 4.2 4.2
900 | 1 4 4
950 | 1 3.6 3.6
975 | 1 3.8 3.8
--------------------------------------------------------------

The above result displays the mean and median of Ps for each value of Ic. Note that in
our hypothetical table, since the frequency of Ps for a given value of Ic is one
[i.e. N(Ps)=1], the mean value of Ps is equal to the actual observation.

The other command that produces a tabular statistical summary is tabstat. Note that
tabstat displays summary statistics for a series of numeric variables in a single table,
possibly broken down on (conditioned by) another variable. The result to be obtained
depends on the command written. Consider the following two commands.

• tabstat X Ps Ic, stats(mean range). This displays the mean and range of each
variable (i.e. X, Ps and Ic).

• tabstat X Ps, by(Ic). This produces summary statistics (the mean) of X and Ps by
categories of Ic. The table below represents the result of this command.

Ic | X Ps
---------+--------------------------
500 | 15 5.5
600 | 18 5.2
650 | 16 4.8
680 | 21 4.5
750 | 22 4.6
780 | 26 4.7
800 | 24 4.2
900 | 29 4
950 | 28 3.6
975 | 30 3.8
---------+---------------------------
Total | 22.9 4.49
--------------------------------------

Note that the above table shows the mean of X and Ps for a given value of Ic. For
instance, given Ic = 800, the mean value of X and Ps are 24 and 4.2 respectively.

Stata can also produce one- and two-way tables of summary statistics. The result provides
the mean, standard deviation and frequency of a variable for each value of another
variable. The command to execute this job is given as follows.

• tabulate X, summarize(Ps)

This provides a one-way table of summary statistics for Ps for each value of the variable
X. The result below provides a table of summary statistics for the variable Ic given the
values of Ps. Note that the command required to do this is tabulate Ps, summarize(Ic)

| Summary of Ic
Ps | Mean Std. Dev. Freq.
------------+-----------------------------------------------
3.6 | 950 0 1
3.8 | 975 0 1
4| 900 0 1
4.2 | 800 0 1
4.5 | 680 0 1
4.6 | 750 0 1
4.7 | 780 0 1
4.8 | 650 0 1
5.2 | 600 0 1
5.5 | 500 0 1
------------+-------------------------------------------------
Total | 758.5 154.81261 10

As you can see, Stata provides the mean, standard deviation and frequency of Ic given the
value of Ps. For instance, at Ps = 4.2, the mean value of Ic is given by 800 birr.

I. Graphical Analysis Using Stata

Stata provides a number of graphing options, including line graphs, bar graphs, pie
charts and several others. The following discussion illustrates some of the commands and
the corresponding outcomes.

Graph bar charts (graph, bar command)

Stata provides a bar graph based on the actual value or the mean (average) value of a
variable. The appropriate command to construct such a graph is the following:
• graph INV SAV IMPO EXPO Gcon Pcon, bar

This command displays a bar graph of the actual values of the variables stated in the
command.

• graph INV SAV IMPO EXPO Gcon Pcon, bar means

This command, on the other hand, draws a bar chart for the variables using the mean value
of each variable listed in the command.
Note that the bar chart below is constructed using the first command. It displays the
bar graph for the variables investment (INV), domestic saving (SAV), import (IMPO),
export (EXPO), government consumption (Gcon) and private consumption (Pcon).

[Bar chart of INV, Sav, Expo, Impo, Gcon and Pcon]

Figure 3.6 Bar chart

J. Graph histograms (graph, histogram command)

The other graph that can be drawn in Stata is the histogram. Note that the histogram is
the default for graph with one variable. Accordingly, the relevant commands are given as
follows.

• graph GDP: This command draws a histogram of GDP.

• graph GDP, bin(#): This also draws a histogram for GDP using # bins, where # may
be any number from 2 to 50.
• graph GDP, bin(#) norm: This command draws the distribution (a normal curve) for
GDP together with a histogram using # bins.

Consider the graph presented below. It represents the distribution of GDP together with
a histogram using 20 bins. Note that the command to generate the graph is given
by: graph GDP, bin(20) norm.

[Histogram of GDP with overlaid distribution curve; x-axis: GDP from 2883.83 to 54585.9;
y-axis: Fraction]

Figure 3.7. Graph and Histogram

Notice from the figure above that there are 20 bars, since we wrote 20 in the command.
As the distribution curve shows, the variable GDP is slightly positively skewed.

When domestic saving (SAV) is used in the analysis, we obtain an almost normally
distributed curve. Recall that by a normal distribution we mean one where the mean, mode
and median are equal and the skewness is zero. The command used to obtain the result
below is: graph SAV, bin(10) norm

[Histogram of Sav with overlaid distribution curve; x-axis: Sav from -1145.3 to 3466.3;
y-axis: Fraction]

Figure 3.8 Graph and histogram

Note that in addition to this, Stata can also produce a histogram for a categorical
variable. The result is quite similar to that of the above command; this one, however, is
intended for use with integer-coded categorical variables. In this case, the x-axis is
automatically labeled and the labels are centered below the corresponding bars.

K. Graph pie charts (graph, pie command)

The other type of graph is the pie chart. To execute this, the relevant command is
presented as follows.

• graph INV IMPO Gcon, pie

Note that this command produces a pie chart containing the variables investment, import
and government consumption for the period discussed earlier. In developing a pie chart,
we can specify up to 16 variables, and Stata will place up to 64 pie charts in a single
image. The following pie chart is the result of the command specified above.

[Pie chart: Impo 56%, Gcon 27%, INV 17%]

Figure 3.9 Pie chart

Note from the pie chart that import accounts for 56% of the total of the three variables,
while government consumption and investment account for 27% and 17% respectively.

Check Your Progress 3.5

Suppose that we have data on variables named A1, B2 and C3.

1. Write the command that displays individual summary statistics for the variables A1 and
B2 ___________________________

2. Write the command that gives a frequency-based mean and median of A1 for a
given value of C3 __________________________________
3. Write the command that draws a bar chart for the variables A1, B2 and C3 using the
mean value of each variable ____________________________________

4. Write the command that draws the distribution of B2 together with its histogram using
25 bins _____________________________________________

3.5 Summary

Statistics is the science of collecting, organizing, analyzing and interpreting data to
assist in making more effective decisions. The mean measures the average of the
observations. Dispersion, on the other hand, refers to the variation or scatter of a set
of values; the most important measures of dispersion are the variance and the standard
deviation. Correlation is one of the methods for measuring the relationship between
variables. It is defined as the degree of relationship existing between two or more
variables. The degree of relationship existing between two variables is called simple
correlation, and the degree of relationship existing between three or more variables is
called multiple correlation.

The other important statistics concept is hypothesis testing. A hypothesis is some testable
belief or opinion. It is a statement about the population developed for the purpose of
testing. A statistical hypothesis is a statement about the values of parameters in the
population. In hypothesis testing, the most common approach is to establish a set of two
mutually exclusive and exhaustive hypotheses about the true value of the parameter under
study. Then a sample is used to assess the hypothesis.

Excel can carry out several statistical analyses. The steps required to handle statistical
estimation using Excel are:
1. On the Tools menu, click Data Analysis.
2. In the Data Analysis dialog box, click the name of the analysis tool you want to
use, then click OK.
3. In the dialog box for the tool you selected, set the analysis options you want.

EViews very simply and attractively provides us with many types of graphical results.
Note however, that to execute any kind of graph we should first have the relevant data in
EViews spreadsheet.

Based on its window-based approach, EViews computes several statistics. Very simply,
it estimates and reports descriptive statistics of a variable, including the mean, median,
standard deviation, skewness and the like. Moreover, EViews computes the covariance and
the correlation between variables. To perform these statistical estimations we have to
open the workfile that contains the names of the variables to be used in the analysis.

Using Stata we can construct a wide range of statistical and graphical results. These
include constructing confidence intervals, correlations and covariances, summary
statistics and the like.
3.6 Answers to Check Your Progress
Answer to Check Your Progress 3.1

Refer section 3.2 for the answer


Answer to Check Your Progress 3.2

Refer section 3.3 for the answer


Answer to Check Your Progress 3.3

Refer section 3.3 for the answer


Answer to Check Your Progress 3.4

1. 9.092954 and 12.40705


2. 6.530996 and 12.719
3. 0.8581
4. - 0.4496
5. No
Answer to Check Your Progress 3.5

1. For A1 = summarize A1 and for B2 = summarize B2


2. table C3, c(n A1 mean A1 sd A1 median A1).
3. graph A1 B2 C3, bar
4. graph B2, bin(25) norm

3.7 Model Examination


Consider the following annual data on Ethiopia on GDP, SAV (saving),
INV (investment), EXPO (export), and IMPO (import) for the years 1981 up to 1995 E.C.

Year GDP INV Sav Expo Impo
1981 15742.1 2269.23 1399.77 1422.8 2292.26
1982 16825.7 2100.49 1335.22 1295.04 2060.31
1983 19195.3 1996.38 660.39 1062.21 2398.2
1984 20792 1911.1 625.2 937.5 2223.4
1985 26671.4 3792.1 1494.1 2222.5 4520.5
1986 28328.9 4293.7 1426.2 3223 6090.5
1987 33885 5569 2517.1 4898.1 7950
1988 37937.6 6404.4 2652.6 4969.7 8721.5
1989 41465.1 7049.1 3195 6730.6 10584.7
1990 44840.3 7690.6 3466.3 7116.9 11341.2
1991 48803.2 8268.1 1044.6 6878 14101.5
1992 53189.7 8431.8 480.1 8017.6 15969.3
1993 54210.7 9646 1433.9 7981.5 16193.6
1994 51760.6 10613.5 931.4 8027.4 17709.5
1995 54585.9 12093 -1145.3 8319.3 21557.6

Based on the above data, attempt the following:

1. Using Stata, compute the mean value of each variable together with its
95% confidence interval.
2. Using EViews, compute the covariance between
a) GDP and INV b) IMPO and EXPO c) INV and GDP
3. Using Stata, compute the partial correlation between
a) GDP and INV b) IMPO and EXPO c) SAV and GDP
4. Perform the skewness/kurtosis test for normality for each variable and
comment on the result.
5. Draw the distribution of each variable with the corresponding histogram
using 15 bins.

Unit Four: Econometric Estimation and Analysis

4.0 Objective
4.1 Introduction
4.2 Concepts in Econometrics Analysis
4.3 Regression Estimation and Analysis Using EViews.

4.4 Regression Estimation and Analysis Using Stata.
4.5 Summary
4.6 Answers to Check Your Progress
4.7 Model Examination

4.0 Objectives

The aim of this unit is to explain the approaches used in computer based econometric
estimation and analysis.
After studying this unit, you will be able to:
• Define the term econometrics.
• Explain the importance of studying econometrics and the associated concepts.
• Understand the various ways of performing estimation using EViews and Stata.
• Interpret the results obtained from the estimation.

4.1 Introduction

Econometrics deals with the measurement of economic relationships. It is a combination


of economic theory, mathematical economics and statistics, but it is completely distinct
from each one of these three branches of science.

This is because economic theory makes statements or hypotheses that are mostly
qualitative in nature. The theory itself does not provide any numerical measure of the
relationship between variables: for instance, it does not tell by how much the quantity
demanded will go up or down as a result of a certain change in the price of the
commodity. It is the job of the econometrician to provide such numerical statements.
Similarly, the main concern of mathematical economics is to express economic theory in
mathematical form without regard to measurability or empirical verification of the
theory. Both economic theory and mathematical economics state the same relationships:
economic theory uses verbal exposition, while mathematical economics employs mathematical
symbolism. Neither of them allows for the random elements which might affect the
relationship and make it stochastic. Furthermore, they do not provide numerical values
for the coefficients of the relationships.

On the other hand, economic statistics is mainly concerned with collecting, processing,
and presenting economic data in the form of charts and tables. It is mainly a descriptive
aspect of economics: it does not provide explanations of the development of the various
variables, nor does it provide measurements of the parameters of economic relationships.
Econometrics, by contrast, is an amalgam of economic theory, mathematical economics,
economic statistics, and mathematical statistics. Yet it is a subject that deserves to be
studied in its own right for the above-mentioned reasons.

In this unit, a computer-based approach is used to explain a number of issues related to
econometric estimation and analysis. Using EViews and Stata we will learn how to
estimate a regression equation, test the significance of parameters, develop confidence
intervals and the like.

4.2 Concepts in Econometric Estimation and Analysis

A. Definition and Steps in Econometric Analysis

Econometrics deals with the measurement of economic relationships. It is a combination


of economic theory, mathematical economics and statistics, but it is completely distinct
from each one of these three branches of science. In conducting econometric analysis,
there are some steps that we need to follow as a methodology.

Step I: The first step is to identify the relationship between variables and
express the relationship in mathematical form. This is called the specification of the
model. It involves the determination of the dependent and independent variables. For
example, given Y = f(X1, X2, X3, ..., Xn), the variable Y whose behavior is to be explained
is referred to as the dependent variable, while the variables X1, X2, X3, ..., Xn that
influence the dependent variable Y are referred to as explanatory or independent variables.

In economic analysis, the choice of independent variables might come from economic
theory, past experience, other related studies or from intuitive judgment.

Step II. After identifying the dependent and the explanatory variables, the second step is
specifying the mathematical form of the model. Note that economic theory may or may
not indicate the mathematical form of the relationship and the number of equations to be
included in the model.

For example, the theory of demand does not specify whether the demand function takes a
linear or non-linear form. Similarly, the theory of demand does not specify the number of
equations to be included in the demand model. Thus, it is the researcher who is
responsible for dealing with such issues. Note also that the determination of the a priori
theoretical expectations about the sign and size of the parameters is also part of
formulating, or specifying, the model.

Step III. This step refers to the specification of the econometric model. It is based on
economic theory and on any available information relating to the phenomenon being studied.

Since most relationships among economic variables are inexact, this step reflects the
issue by incorporating a disturbance term, U. This U captures the influence of any other
variables that are not included in the model.

Note that U denotes the random error term, which represents all those forces/factors
affecting the dependent variable that are not explicitly introduced in the model. This
error term distinguishes an econometric model from a mathematical model.

As we have said earlier, since economic theory does not explicitly state the number of
equations to be included in the function (a single or simultaneous equation model), the
researcher must decide the number of equations to be included in the model. In general,
the number of equations depends on the complexity of the phenomenon being studied, the
purpose of estimating the model and the availability of data.

Step IV. This step involves determining the numerical estimates of the
coefficients of the model. Estimation of the coefficients of the model includes (i)
gathering data on the variables included in the model and (ii) selecting the appropriate
econometric technique for the estimation of the function.

Note that the coefficients of economic relationships may be estimated by single equation
methods or by simultaneous equations methods. In this material we will focus on least
squares methods.

Step V. Evaluation of Estimates. This refers to checking the reliability of the estimated
results. The evaluation of the results includes deciding whether the estimates of the
parameters are theoretically meaningful and statistically satisfactory. To check whether
the estimates of the parameters are meaningful, we make use of economic, statistical and
econometric criteria. Economic criteria are determined by economic theory and refer to
the sign and size of the parameters of the economic relationship. Statistical criteria
reflect statistical theory and aim at evaluating the statistical reliability of the
estimates of the parameters of the model. The most commonly used statistical criteria are
the correlation coefficient and the standard deviation (error) of the estimates. The
square of the correlation coefficient shows the percentage of the total variation of the
dependent variable explained by changes in the explanatory variables. The standard error
of an estimate, on the other hand, is a measure of the dispersion of the estimate around
the true parameter: the larger the standard errors of the parameters, the less reliable
the estimates.

The third criterion is the econometric one. It is set by the theory of econometrics and
aims at investigating whether the assumptions of the econometric method are satisfied.
It helps us check whether the estimates have the desirable properties of unbiasedness,
consistency, efficiency, sufficiency and the like. If the assumptions are not satisfied,
then the estimates of the parameters will not possess some of the desirable properties
and become unreliable for the determination of the significance of the estimates. Note,

therefore, that, before accepting or rejecting the estimates, the researcher must use all the
above criteria.

After we have formulated the model, we may want to perform hypothesis testing to find out
whether the estimates obtained are in accordance with the expectations of the theory
being tested. That is, we may want to find out whether the estimated model makes economic
sense and conforms to economic theory. To this end we develop the necessary tools to test
hypotheses suggested by economic theory and/or prior empirical experience. The
confirmation or refutation of economic theories on the basis of sample evidence is known
as hypothesis testing.

Step VI Evaluation of the Forecasting Power of the Estimated Model.

As you know, the objective of any econometric research is to obtain good numerical
estimates of the coefficients of economic relationships and to use them for the prediction
of the values of economic variables. Before using the estimated model for forecasting the
value of the dependent variable, we must assess the predictive power of the model.

Note that if the chosen model confirms the theory, then we may use it to predict (forecast)
the future value(s) of the dependent variable on the basis of the known or expected future
value(s) of the explanatory variables. An estimated model may be economically, statistically
and econometrically correct for the sample period for which it has been estimated, and yet
not be useful for forecasting. In this stage, we investigate the stability of the estimates
and their sensitivity to changes in the size of the sample. Therefore, we have to check
whether the estimated function performs well outside the sample data.

From the foregoing analysis we learn that a successful econometric analysis should make
use of all the above stated steps. In our case the regression estimation will be based on
the Ordinary Least Squares (OLS) method. Therefore, the next subsection makes a brief
discussion of what OLS is before we examine how to compute it using EViews and Stata.

B. The Concept of Regression Analysis

Regression analysis is concerned with describing and evaluating the relationship between
a given variable (dependent variable) and one or more other variables (explanatory
variables).

Let us denote the explained (dependent) variable by Y and the explanatory (independent)
variables by X1, X2, ..., Xk. If there is only one explanatory variable, it is known as a
simple regression. For example, let the dependent variable be the demand (Di) for a
commodity and the explanatory variable its own price (Pi). Then we have Di = f(Pi). Thus a
simple regression shows the relationship between demand and own price. On the other hand,
if there is more than one explanatory variable, it is called a multiple regression. For
example, let the dependent variable be consumption expenditure (Ci) and the explanatory
variables family income (Ii) and family size (Si). Then we obtain Ci = f(Ii, Si). Notice
that a multiple regression shows the relationship between the dependent variable and a
number of explanatory variables.

In general, given Y = f(X1, X2, ..., Xk), if we assume that there is a linear relationship,
then we obtain

E(Y/Xi) = β0 + β1X1i + β2X2i + β3X3i + ... + βkXki

where β0 denotes the intercept and β1, β2, ..., βk represent the partial slopes of the
regression equation. However, the simplest form of the multiple linear regression model
(i.e. a model with two explanatory variables) is given by:

Yi = β0 + β1X1i + β2X2i + Ui

Taking the expected value of the above model, we obtain:

E(Yi/X1i, X2i) = β0 + β1X1i + β2X2i

where: E(Yi/X1i, X2i) represents the conditional mean of Yi given fixed values of X1i
and X2i;

β0 is the average value of Yi when X1i = X2i = 0;

β1 = ∂Yi/∂X1i, keeping X2i constant, which represents the change in the mean value
of Yi with respect to X1i, keeping X2i constant;

β2 = ∂Yi/∂X2i, keeping X1i constant, which represents the change in the mean value
of Yi with respect to X2i, keeping X1i constant.

C. Estimation of Parameters Using OLS

After specification of the model, the next step is to estimate the population parameters β0,
β1 and β2 using sample observations on Yi, X1i and X2i. The estimates β̂0, β̂1 and β̂2 of
the population parameters β0, β1 and β2 respectively are obtained by minimizing the sum of
squared residuals.

The population regression function is given as:

Yi = β0 + β1X1i + β2X2i + Ui

and the counterpart sample regression function is given as:

Ŷi = β̂0 + β̂1X1i + β̂2X2i.

Using the Ordinary Least Squares method of estimation we obtain the following results for
β̂0, β̂1 and β̂2:

β̂0 = Ȳ − β̂1X̄1 − β̂2X̄2

β̂1 = [(Σx1iyi)(Σx2i²) − (Σx2iyi)(Σx1ix2i)] / [(Σx1i²)(Σx2i²) − (Σx1ix2i)²]

β̂2 = [(Σx2iyi)(Σx1i²) − (Σx1iyi)(Σx1ix2i)] / [(Σx1i²)(Σx2i²) − (Σx1ix2i)²]

In addition to the partial regression coefficients, the variance of each β̂ is important for a
wide range of purposes such as confidence intervals, hypothesis testing and the like. The
variances of β̂0, β̂1 and β̂2 are given by the following formulas:

Var(β̂0) = σ̂u²[1/n + (X̄1²Σx2i² + X̄2²Σx1i² − 2X̄1X̄2Σx1ix2i) / (Σx1i²Σx2i² − (Σx1ix2i)²)]

Var(β̂1) = σ̂u²Σx2i² / [Σx1i²Σx2i² − (Σx1ix2i)²]

Var(β̂2) = σ̂u²Σx1i² / [Σx1i²Σx2i² − (Σx1ix2i)²]

where x1i = X1i − X̄1, x2i = X2i − X̄2, and σ̂u² = ΣÛi²/(n − k). Here k is 3 since we are
dealing with three parameters: β̂0, β̂1 and β̂2.

Note that if there are more than two explanatory variables, the formulas for each β̂ and the
corresponding variances become more complicated and very hard to work out without the
aid of a computer.
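For reference, the general k-variable case is written far more compactly in matrix notation.
The following is a standard textbook restatement (not derived in this module) of the
formulas above; under the classical assumptions,

β̂ = (X′X)⁻¹X′y and Var(β̂) = σ̂u²(X′X)⁻¹, with σ̂u² = Û′Û/(n − k),

where X is the n × k matrix of observations on the regressors (including a column of ones
for the intercept) and y is the n × 1 vector of observations on the dependent variable.
Packages such as EViews and Stata compute exactly these quantities, whatever the number
of regressors.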

D. The Coefficient of Multiple Determination, R²y.x1x2

Given Y = f(X1, X2), the coefficient of multiple determination, R²y.x1x2, is the square of the
multiple correlation coefficient. It is denoted by R² with the subscripts of the variables
whose relationship is being studied. In the case of two explanatory variables X1 and X2, the
coefficient of multiple determination shows the percentage of the total variation of Y
explained by the regression plane, i.e. by changes in X1 and X2. In a multiple regression,
R² measures the proportion of the variation in Y explained by the variables X1 and X2
jointly.

The formula of R²y.x1x2 is given as:

R²y.x1x2 = Σŷi²/Σyi² = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)²

         = 1 − ΣÛi²/Σyi² = 1 − RSS/TSS
where: RSS – residual sum of squares
TSS – total sum of squares
Using the estimated β̂'s, the coefficient of determination can be presented as follows:

R²y.x1x2 = (β̂1Σx1iyi + β̂2Σx2iyi) / Σyi²,
where x1i, x2i and yi are in their deviation forms.

Note that the value of R² lies between 0 and 1. The higher R² is, the greater the percentage
of the variation of Y explained by the regression plane, that is, the better the goodness of fit
of the regression plane to the sample observations. The closer R² is to zero, the worse the
fit.

The Adjusted R2
When new variables are introduced into the model, the coefficient of determination R²
always increases even if the variable added is not important to the model. To correct this
defect, we adjust the coefficient of multiple determination by taking into account the
degrees of freedom, which clearly decrease as new regressors are introduced into the
function. Therefore, the adjusted R-square is given by

R̄² = 1 − [ΣÛi²/(n − k)] / [Σyi²/(n − 1)], or

R̄² = 1 − (1 − R²)·(n − 1)/(n − k)
where k = the number of parameters in the model (including the intercept term)
n = the number of sample observations
R2 = is the unadjusted multiple coefficient of determination
Note that, as the number of explanatory variables increases, the adjusted R² becomes
increasingly less than the unadjusted R². The adjusted R² (R̄²) can be negative, although
R² is necessarily non-negative. In such a case its value is taken as zero.

If n is large, R̄² and R² will not differ much. But with small samples, if the number of
regressors (X's) is large in relation to the sample observations, R̄² will be much smaller
than R².
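As a quick numerical illustration, using figures that appear later in this unit (the log-log
GDP regression estimated with EViews, where R² = 0.997136, n = 42 and k = 5):

R̄² = 1 − (1 − 0.997136)·(42 − 1)/(42 − 5) = 1 − 0.002864 × (41/37) ≈ 0.99683,

which matches the Adjusted R-squared reported by the package (0.996827) up to rounding.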

E. Test of Significance of Parameters

Test of significance is by definition a hypothesis testing approach. Note that in a multiple
regression, hypothesis tests can be carried out on individual coefficients as well as on the
significance of all the parameters jointly.

I. Significance Test about Individual Partial Regression Coefficients

This refers to testing whether a particular variable X1 or X2 is significant or not holding


the other variable constant. In this case, the t test is used to test a hypothesis about any
individual partial regression coefficient. Now let us postulate for X1 as follows.
H0: β1 = 0
H1: β1 ≠ 0
The null hypothesis states that, holding X2 constant, X1 has no (linear) influence on y.
In examining the above hypothesis, we compute the t value that will be compared with
the critical (table) value. The t value is computed by the following formula:

t = β̂i / S(β̂i) ~ t(n − k)   (i = 0, 1, 2, …, k)

This is the observed (or sample) value of the t ratio, which we compare with the
theoretical value of t obtainable from the t-table with n – k degrees of freedom.

The theoretical values of t (at the chosen level of significance) are the critical values that
define the critical region in a two-tail test, with n – k degrees of freedom.

If the computed t value exceeds the critical t value at the chosen level of significance, we
may reject the hypothesis; otherwise, we may accept it (i.e. β̂i is not significant at the
chosen level of significance and hence the corresponding regressor does not appear to
contribute to the explanation of the variations in Y). Consider the figure below, which
represents the t distribution and the critical values for a two-tailed test.
represents the t distribution and the critical values for a two-tailed test in t-distribution

[Figure 4.1 Significance test using the t-distribution: the acceptance region lies between
the critical points −tα/2(n−k) and tα/2(n−k); the rejection regions lie in the two tails.]



Note that the greater the absolute value of the calculated t, the stronger the evidence that
β̂i is significant.
II. Testing the Overall Significance of a Regression

This test aims at finding out whether the explanatory variables (X1, X2, …Xk) do actually
have any significant influence on the dependent variable. Consider the following general
regression model.
Yi = β0 + β1X1i + β2X2i + β3X3i +……..+ βkXki + Ui

The test of the overall significance of the regression implies the following.

H0 : β1 = β2 = β3 = …….. = βk = 0

H1: Not all slope coefficients are simultaneously zero.

If the null hypothesis is true, then there is no linear relationship between the dependent
variable and the explanatory variables.

To test the above stated hypothesis we use the following test statistic:

F = [ESS/(k − 1)] / [RSS/(n − k)] ~ F(k − 1, n − k)

where ESS represents the explained sum of squares, and RSS represents the residual
(unexplained) sum of squares. Accordingly, the decision rule will be:

Compare the computed F-value with the critical (table) value at the chosen level of
significance, with (k−1) numerator and (n−k) denominator degrees of freedom, obtained
from the F-distribution table. The decision is based on the following procedure:

If the computed F-value is greater than the critical value, reject the null hypothesis and
conclude that the regression is significant, i.e. that not all coefficients are zero. On the
other hand, if the computed F-value is less than the critical value obtained from the F-
distribution table, do not reject the null hypothesis, i.e. conclude that the regression is not
significant.

F. Confidence Interval Estimation for the Parameter Estimates

In a test of significance, we may reject the null hypothesis and come up with a significant
result. However, note that rejection of the null hypothesis does not mean that our estimate
β̂i is the correct estimate of the true population parameter βi. It only means that our
estimate comes from a sample drawn from a population whose parameter is different from
zero.

In order to know how close the estimate is to the true population parameter, we must
construct confidence intervals for the true parameter. In confidence interval estimation,
we establish limiting values around the estimate within which the true parameter is
expected to lie with a certain 'degree of confidence'. We should select a confidence level,
say 90%; this means that in repeated sampling the confidence interval computed from the
sample would include the true parameter 90 times out of 100. In the remaining 10
times, the population parameter would fall outside the confidence interval. The following
explains the method of constructing a confidence interval from the t-distribution.

Note that the t-distribution is used when the population is normal, the sample size is small
and the population variance is unknown. In this case, the test statistic for testing hypotheses
is given as:

t = (β̂i − βi) / S(β̂i), with (n − k) degrees of freedom

Given the significance level α, the observed t-value lies between −tα/2 and tα/2 with
(n − k) degrees of freedom with probability 1 − α, so the confidence interval for βi is
given as:

β̂i − tα/2(n−k)·se(β̂i) ≤ βi ≤ β̂i + tα/2(n−k)·se(β̂i)

where se(β̂i) is the estimated standard error of β̂i and α is the chosen significance level.
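As a worked illustration, take the coefficient on EXPO from the linear GDP regression
reported later in this unit: β̂ = 3.7775 with se(β̂) = 0.5031, n = 43 and k = 5, so there are
38 degrees of freedom and tα/2(38) ≈ 2.024 at the 5% level. The 95% confidence interval
is therefore

3.7775 − 2.024 × 0.5031 ≤ β ≤ 3.7775 + 2.024 × 0.5031, i.e. approximately 2.76 ≤ β ≤ 4.80,

so we are 95% confident that the true export coefficient lies between about 2.76 and 4.80.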

The foregoing brief discussion informs the reader what regression estimation means and
introduces the various concepts related to it. More importantly, the above discussion shows
how time-consuming and unfriendly it would be to compute the various quantities manually.
Note that the formulas and the computation process become very long as the number of
explanatory variables in the regression model increases. The interesting part is that, with
the aid of EViews and Stata, it becomes very easy to compute the various quantities even
with many explanatory variables.

Check Your Progress 4.1

1. State the steps required in conducting econometric analysis


_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________

2. Define the concept of regression analysis

_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________

3. Define the concept of R2 and show its relationship with the adjusted R2
_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________

4. Define the concept of test of significance and confidence interval.


_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________

4.3 Regression Estimation and Analysis Using EViews

In the following discussion we describe the approach employed in conducting regression
estimation in EViews. This includes specifying a regression model, estimating and
interpreting it, tests of significance of the parameters, and the like. Note that in this
discussion we make use of the Ordinary Least Squares method of estimation since it is the
simplest as well as the most widely used method in basic regression estimation.
Note that regression estimation is performed using data. Thus, the following discussion
assumes that we already have data in an EViews workfile. Consequently, the first step in
regression estimation using EViews is to create an Equation Object.

I. Equation Objects
Single equation regression estimation in EViews is performed using the equation object.
To create an equation object in EViews we can use one of the following alternatives

•	From the main menu select Objects, then choose New Object and select Equation
•	From the main menu click Quick and then select Estimate Equation, or
•	Simply type the keyword equation in the command window and press Enter

Any of these alternatives creates the Equation Estimation dialog box as shown in the
diagram below.

Figure 4.2 Equation estimation dialog box


In the equation estimation box, there are three boxes that we are interested in. The
following section discusses the relevance of the three boxes as well as how to specify an
equation in EViews.

II. Specifying an Equation in EViews

As we have said earlier, when we create an equation object, the Equation Estimation
dialog box appears. In that box we need to specify three things. These are: the equation
specification, the estimation method, and the sample to be used in estimation.

a) Equation Specification Box

In the upper edit box, we observe the equation specification box. It is used to specify the
equation. This refers to identifying the dependent (left-hand side) variable and the
independent (right-hand side) variables. Moreover we have to determine the functional
form (i.e. linear or non-linear). Note that there are two basic ways of specifying an
equation. These are the list and the formula approaches. The listing method is easier but
may only be used with unrestricted linear specifications whereas the formula method is
more general and can be used to specify nonlinear models or models with parametric
restrictions.

Specifying an Equation by List

The simplest way to specify a linear equation is to provide a list of variables that you
wish to use in the equation. In this approach we first include the name of the dependent
variable or expression then a list of explanatory variables name is listed. For example,
consider the following demand function. D = f (P, I). Now suppose the objective is to
specify a linear demand function, D regressed on a constant, own price, P and income, I
of the consumer as follows. Di =  0 +  1Pi +  2Ii + Ui
In this case, the list method requires to type the following in the upper field of the
Equation Estimation dialog box

DcPI

Note that each variable is separated by a space and the presence of the series name C in
the list of regressors. This (i.e. c) is a built-in EViews series that is used to specify a
constant in a regression. Note that EViews does not automatically include a constant in a
regression so you must explicitly list the constant as a regressor.

You may have noticed from our previous chapter discussion that there is a pre-defined
object C in your workfile. This is the default coefficient vector when you specify an

equation by listing variable names. Note that EViews stores the estimated coefficients in
this vector, in the order of appearance in the list.

Specifying an Equation by Formula

This approach is used to specify our equation with a formula when the list method is not
general enough for our specification. Many, but not all, estimation methods allow you
to specify your equation using a formula.

An equation formula in EViews is a mathematical expression involving regressors and


coefficients. To specify an equation using a formula, simply enter the expression in the
dialog in place of the list of variables. EViews will add an implicit additive disturbance to
this equation and will estimate the parameters of the model using least squares.

Note that when you specify an equation by list, EViews converts this into an equivalent
equation formula. For example, suppose our regression model is given by the following
log-log model
logDi = logβ0 + β1logPi + β2logIi + logUi
In this case the list method is given by,
logD c logP logI

This list is interpreted by EViews as,

logD = c(1) + c(2)*logP + c(3)*logI

The two most common motivations for specifying your equation by formula are to
estimate restricted and nonlinear models.

Note that to estimate a nonlinear model, simply enter the nonlinear formula. EViews will
automatically detect the nonlinearity and estimate the model using nonlinear least
squares.

b) Estimation Methods

The second box in the equation estimation dialog box refers to estimation methods. Note
that having specified our equation, we now need to choose an estimation method. To
select the required method of estimation we simply click on the Method box and we will
see the drop-down menu listing estimation methods. From the various alternatives we
select LS – Least Squares, since standard, single-equation regression is performed using
this method of estimation.

Note that equations estimated by ordinary least squares, two-stage least squares, GMM,
and ARCH can be specified with a formula. But nonlinear equations are not allowed with
binary, ordered, censored, and count models, or in equations with ARMA terms.

c) Estimation Sample

After identifying the method of estimation, we should also specify the sample to be used
in estimation. EViews will fill out this dialog with the current workfile sample, but we
can change the sample for purposes of estimation by entering our sample string or object
in the box. Note that changing the estimation sample does not affect the current workfile
sample.

If any of the series used in estimation contain missing data, EViews will temporarily
adjust the estimation sample of observations to exclude those observations. EViews
notifies you that it has adjusted the sample by reporting the actual sample used in the
estimation results. The diagram below presents the regression result using a hypothetical
regression model.

Figure 4.3 Regression result

At this point, we are interested in the top of the equation output view. Notice that
EViews will tell you the dependent variable, the method, the sample and the like. For
example, the above result uses 1824 observations.

Note that some operations, most notably estimation with moving average terms and
autoregressive conditional heteroscedasticity, do not allow missing observations in the middle
of the sample. When executing these procedures, an error message is displayed and
execution is halted if an NA is encountered in the middle of the sample. EViews handles
missing data at the very start or the very end of the sample range by adjusting the sample
endpoints and proceeding with the estimation procedure.

Check Your Progress 4.2

1. State the three alternative ways of creating an equation object in EViews


_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________

2. Explain the two equation specification methods
_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________

III. Estimation Options

Consider the following regression model developed by a researcher to examine the
relationship between GDP and a number of explanatory variables. Assume a linear
relationship as follows:

GDPi = β0 + β1INVi + β2SAVi + β3EXPOi + β4POPi + Ui

Where GDP = gross domestic product, INV = investment, SAV = domestic saving,
EXPO = export, POP = population (a proxy for the labor force). Suppose the researcher
uses data of Ethiopia for the period 1953 up to 1995 E.C.

According to the List method, we write the following in the equation estimation box and
click OK or press Enter:

gdp c inv sav expo pop


However, if we are using the Formula method, we write the following in the equation
estimation box and click OK (or press Enter)
gdp = c(1) + c(2)*inv + c(3)*sav + c(4)*expo + c(5)*pop
When you click OK in the Equation estimation dialog, EViews displays the equation
window displaying the estimation output view. This will produce the following
regression result
Dependent Variable: GDP
Method: Least Squares
Date: 11/22/06 Time: 07:47
Sample: 1953 1995

Included observations: 43
Variable Coefficient Std. Error t-Statistic Prob.
C -10299.51 1181.469 -8.717549 0.0000
INV 0.056541 0.479796 0.117844 0.9068
SAV -0.374671 0.334872 -1.118848 0.2702
EXPO 3.777520 0.503058 7.509122 0.0000
POP 496.1977 42.92215 11.56041 0.0000
R-squared 0.993417 Mean dependent var 17642.85
Adjusted R-squared 0.992724 S.D. dependent var 16578.71
S.E. of regression 1414.109 Akaike info criterion 17.45533
Sum squared resid 75988722 Schwarz criterion 17.66012
Log likelihood -370.2896 F-statistic 1433.695
Durbin-Watson stat 1.000662 Prob(F-statistic) 0.000000

The following discussion summarizes the results obtained from EViews equation box
using the above result.

Regression Coefficients

In the result above, the column labeled "Coefficient" depicts the estimated coefficients.
Recall that the least squares regression coefficients β̂i are computed by the standard OLS
formula discussed in the previous section.
Note that if our equation is specified by list, the coefficients will be labeled in the
“Variable” column with the name of the corresponding regressor. But if the equation is
specified by formula, EViews lists the actual coefficients, C(1), C(2), etc.
For the simple linear models considered here, the coefficient measures the marginal
contribution of the independent variable to the dependent variable, holding all other
variables fixed. If present, the coefficient of C is the constant or intercept in the
regression. It is the base level of the prediction when all of the other independent
variables are zero. The other coefficients are interpreted as the slope of the relation
between the corresponding independent variable and the dependent variable (GDP in this
case), assuming all other variables do not change.

For the above model the estimated GDP equation is given as
(Estimated) GDP = −10299.51 + 0.056INV − 0.374SAV + 3.777EXPO + 496.19POP
The partial regression coefficients βi are interpreted as follows:
β1 = 0.056 => other things being equal, an increase in investment by one unit increases
GDP by 0.056 units
β2 = −0.374 => other things being equal, an increase in domestic saving by one unit
decreases GDP by 0.374 units
β3 = 3.777 => other things being equal, an increase in export by one unit increases GDP
by 3.777 units
β4 = 496.19 => other things being equal, an increase in the labor force by one unit
increases GDP by 496.19 units

Standard Errors
The “Std. Error” column reports the estimated standard errors of the coefficient estimates.
The standard errors measure the statistical reliability of the coefficient estimates. Note
that the larger the standard errors, the more statistical noise in the estimates. If the errors
are normally distributed, there are about 2 chances in 3 that the true regression
coefficient lies within one standard error of the reported coefficient, and 95 chances out
of 100 that it lies within two standard errors.

Note that the standard errors of the estimated coefficients are the square roots of the
diagonal elements of the coefficient covariance matrix. You can view the whole covariance
matrix by choosing View/Covariance Matrix.

t-Statistics

The t-statistic, which is computed as the ratio of an estimated coefficient to its standard
error, is used to test the hypothesis that a coefficient is equal to zero. That is the
computed t value helps to test the significance of the parameter estimates individually. In

manual operation we compare the computed t value with the critical (table) value.
EViews instead reports, for each coefficient, the probability of observing a t-statistic that
extreme given that the coefficient is truly zero. This probability computation is described
below.

Probability

As you can see in the table the last column of the output shows the probability of drawing
a t-statistic as extreme as the one actually observed, under the assumption that the errors
are normally distributed, or that the estimated coefficients are asymptotically normally
distributed.
This probability is also known as the p-value or the marginal significance level. Given a
p-value, you can tell at a glance whether to reject or accept the hypothesis that the true
coefficient is zero against a two-sided alternative that it differs from zero. For example, if
you are performing the test at the 5% (or 0.05) significance level, a p-value lower than
0.05 is taken as evidence to reject the null hypothesis of a zero coefficient.

For the tabulated result above, for example, the hypothesis that the coefficients on INV and
SAV are individually zero cannot be rejected at any conventional significance level (1%,
5% or 10%), whereas the hypothesis that the coefficients on EXPO and POP are
individually zero is rejected at all of these levels. Note that the p-values are computed
from a t-distribution with T − k degrees of freedom.
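In symbols, the reported two-sided probability is p = 2·P(t(T − k) > |t|), i.e. the chance that
a t-distributed variable with T − k degrees of freedom exceeds the observed t-ratio in
absolute value (a standard definition, restated here for reference).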

Summary Statistics
Notice from the regression result that the lower half of the result table provides a number
of summary statistics that are vital for measuring the adequacy of the model from various
points of view. These are briefly explained as follows.

R-squared
The R-squared statistic measures the success of the regression in predicting the values of
the dependent variable within the sample. It is the fraction of the variance of the
dependent variable explained by the independent variables. The statistic will equal one if

the regression fits perfectly, and zero if it fits no better than the simple mean of the
dependent variable. Note that it can be negative if the regression does not have an
intercept or constant, or if the estimation method is two-stage least squares. In our
regression example, the R2 equals to 0.99, which represents a very good fit.

Adjusted R-squared
One problem with using R² as a measure of goodness of fit is that it will never decrease as
you add more regressors. In the extreme case, you can always obtain an R² of one by
including as many independent regressors as there are sample observations.
The adjusted R² penalizes the R² for the addition of regressors which do not contribute to
the explanatory power of the model. The EViews regression result reports the adjusted R²
just below the R².

Standard Error of the Regression (S.E. of regression)

The standard error of the regression is a summary measure based on the estimated
variance of the residuals. It is computed as the square root of the residual sum of squares
divided by the degrees of freedom, s = √(ΣÛi²/(n − k)).

Sum of Squared Residuals

The sum of squared residuals can be used in a variety of statistical calculations, and is
presented separately for convenience.

Log Likelihood
EViews reports the value of the log likelihood function (assuming normally distributed
errors). It is evaluated at the estimated values of the coefficients. Likelihood ratio tests
may be conducted by looking at the difference between the log likelihood values of the
restricted and unrestricted versions of an equation.

Durbin-Watson Statistic
The Durbin-Watson statistic measures the serial correlation in the residuals. This concept
and the interpretation of the result will be discussed in the next chapter.

Mean and Standard Deviation (S.D.) of the Dependent Variable

The mean and standard deviation of the dependent variable (GDP in the above example)
are computed using the standard formulas for the sample mean and sample standard
deviation.

Akaike Information Criterion (AIC)

The AIC is often used in model selection among non-nested alternatives. According to this
criterion, smaller values of the AIC are preferred.

Schwarz Criterion

The Schwarz Criterion (SC) is an alternative to the AIC that imposes a larger penalty for
additional coefficients. It is likewise used for model selection.

F-Statistic and Probability


The F-statistic tests the hypothesis that all of the slope coefficients (excluding the
constant, or intercept) in a regression are zero. Recall that we said under the null
hypothesis with normally distributed errors, this statistic has an F-distribution with k-1
numerator degrees of freedom and T-k denominator degrees of freedom.

The p-value given just below the F-statistic is denoted by Prob (F-statistic). It is the
marginal significance level of the F-test. If the p-value is less than the significance level
we are testing, say .05, we reject the null hypothesis that all slope coefficients are equal
to zero. For our example above, the p-value is essentially zero, so we reject the null
hypothesis that all of the regression coefficients are zero. Note that the F-test is a joint
test so that even if all the t-statistics are insignificant, the F-statistic can be highly
significant.

Example: Suppose the researcher is interested in applying a log-log specification to the
function stated earlier. That is, consider the following regression model:

logGDPi = logβ0 + β1logINVi + β2logSAVi + β3logEXPOi + β4logPOPi + logUi

Let the researcher use the same data (i.e. 1953–1995 E.C.). Note, however, that first we
have to transform each variable to be used in the estimation into its logarithmic form.

After this if we use the formula approach, we write the following into the equation
estimation box.
lgdp = c(1) + c(2)*linv + c(3)*lsav + c(4)*lexpo + c(5)*lpop

where the prefix l represents the logarithm of each variable


Note that when we use the formula method, EViews result will express the variables not
by their names but by the coefficients c(2), c(3)... used in the estimation. Using OLS
method of estimation, we obtain the following.

Dependent Variable: LGDP


Method: Least Squares
Date: 11/22/06 Time: 09:15
Sample(adjusted): 1953 1994
Included observations: 42 after adjusting endpoints
LGDP = C(1) + C(2)*LINV + C(3)*LSAV + C(4)*LEXPO + C(5) *LPOP

Coefficient Std. Error t-Statistic Prob.


C(1) 0.760588 0.162611 4.677339 0.0000
C(2) 0.059884 0.049056 1.220744 0.2299
C(3) -0.028990 0.020265 -1.430590 0.1609
C(4) 0.222790 0.034374 6.481443 0.0000
C(5) 1.865356 0.099511 18.74527 0.0000
R-squared 0.997136 Mean dependent var 9.329700
Adjusted R-squared 0.996827 S.D. dependent var 0.897452
S.E. of regression 0.050556 Akaike info criterion -3.020128
Sum squared resid 0.094569 Schwarz criterion -2.813262
Log likelihood 68.42269 F-statistic 3220.738
Durbin-Watson stat 1.101603 Prob(F-statistic) 0.000000

From the above result, we can write the estimated regression function as follows

(estimated) LGDPi = 0.760 + 0.059LINVi − 0.028LSAVi + 0.222LEXPOi + 1.865LPOPi
Note that interpretation of the partial regression coefficient will be different from our
previous interpretation. Here, the coefficient is a measure of elasticity. Thus,

β1 = 0.0598 => other things being equal, an increase in investment by one percent
increases GDP by 0.0598 percent
β2 = −0.028 => other things being equal, an increase in domestic saving by one percent
decreases GDP by 0.028 percent
β3 = 0.222 => other things being equal, an increase in export by one percent increases
GDP by 0.222 percent.
β4 = 1.865 => other things being equal, an increase in the labor force by one percent
increases GDP by 1.865 percent.
Note from the result that except for the labor force, the other variables show inelastic
relationship with the GDP.

Moreover, the regression result shows that the variables investment and saving do not
have a significant effect on GDP whereas the variables export and labor force play a
significant role to the GDP in the periods under consideration. The F test result shows
that the model is significant jointly even at 1% level. According to the R 2 and adjusted R2
results, the model is a very good fit.

IV. Coefficient Restriction Test

The View button on the equation toolbar gives us a choice among three categories of tests
to check the specification of the equation.

Coefficient Tests
These tests evaluate restrictions on the estimated coefficients.
Consider the regression equation we used earlier given by
logGDPi = logβ0 + β1logINVi + β2logSAVi + β3logEXPOi + β4logPOPi + logUi
Let the researcher wants to test the following hypothesis

Ho: β1 = β3 against the alternative
H1: β1 ≠ β3

This hypothesis implies that the elasticity of investment on GDP is equal to the elasticity
of export on GDP. Such kind and other types of tests can be performed easily using
EViews. Note that coefficient restriction test in EViews is called the Wald Test.

The Wald test computes the test statistic by estimating the unrestricted regression without
imposing the coefficient restrictions specified by the null hypothesis. The Wald statistic
measures how close the unrestricted estimates come to satisfying the restrictions under
the null hypothesis. If the restrictions are in fact true, then the unrestricted estimates
should come close to satisfying the restrictions.

EViews reports both the chi-square and the F-statistics and the associated p-values.

How to Perform Wald Coefficient Tests

To demonstrate how to perform Wald tests, once again consider the above regression
model. The estimated result (using the formula method) is presented as follows.
Dependent Variable: LGDP
Method: Least Squares
Date: 11/22/06 Time: 09:15
Sample(adjusted): 1953 1994
Included observations: 42 after adjusting endpoints
LGDP = C(1) + C(2)*LINV + C(3)*LSAV + C(4)*LEXPO + C(5) *LPOP

Coefficient Std. Error t-Statistic Prob.


C(1) 0.760588 0.162611 4.677339 0.0000
C(2) 0.059884 0.049056 1.220744 0.2299
C(3) -0.028990 0.020265 -1.430590 0.1609
C(4) 0.222790 0.034374 6.481443 0.0000
C(5) 1.865356 0.099511 18.74527 0.0000
R-squared 0.997136 Mean dependent var 9.329700
Adjusted R-squared 0.996827 S.D. dependent var 0.897452
S.E. of regression 0.050556 Akaike info criterion -3.020128
Sum squared resid 0.094569 Schwarz criterion -2.813262
Log likelihood 68.42269 F-statistic 3220.738
Durbin-Watson stat 1.101603 Prob(F-statistic) 0.000000

Note that this result is identical to the result obtained earlier.

From the above result we observe that the coefficient of LINV (0.059) is quite different
from the coefficient of LEXPO (0.222). But to determine whether the difference is
statistically significant, we conduct the hypothesis test described earlier.

To carry out a Wald test, we perform the following.


First, from the equation toolbar we choose View, then select Coefficient Tests and
choose Wald-Coefficient Restrictions… This will open the Wald Test dialog box.
Next we enter the restrictions into the edit box.

Note that the restrictions should be expressed as equations involving the estimated
coefficients and constants. The coefficients should be referred to as C(1), C(2), and so on,
unless you have used a different coefficient vector in estimation.

To test our hypothesis Ho: β1 = β3, we type the following restriction in the dialog
box:
c(2) = c(4)

and click OK. Note that c(2) refers to the coefficient of LINV and c(4) represents the
coefficient of LEXPO. EViews reports the following result of the Wald test.

Wald Test:
Equation: Untitled
Null Hypothesis: C(2) = C(4)
F-statistic 4.610510 Probability 0.038397
Chi-square 4.610510 Probability 0.031777

Notice that EViews reports an F-statistic and a Chi-square statistic with associated p-
values. The Chi-square statistic is equal to the F-statistic times the number of restrictions
under test. In this example, there is only one restriction and so the two test statistics are
identical, with the p-values of both statistics indicating that we can reject the null
hypothesis of equal elasticities of investment and export at the 5% level (though not at
the 1% level) for the mentioned period.
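For reference, the statistic behind this output takes the standard textbook form for a set of
linear restrictions Rβ = r:

W = (Rβ̂ − r)′[R·V̂(β̂)·R′]⁻¹(Rβ̂ − r),

which is asymptotically χ² with q degrees of freedom, where q is the number of restrictions
and V̂(β̂) is the estimated coefficient covariance matrix; the F version reported alongside it
is W/q, consistent with the note above that the Chi-square equals the F-statistic times the
number of restrictions.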

Consider the following case. Suppose a Cobb-Douglas production function on Ethiopia
for the period 1953 to 1995 E.C. has been estimated in the form:
Qi = β0·Li^β1·Ki^β2·e^Ui

where Q, K and L denote GDP and the inputs of capital and labor respectively. To come
up with linear regression model in parameter we rewrite the above model in a log-log
form and obtain the following

logQi = logβ0 + β1logLi + β2logKi + Ui

Let the researcher come up with the hypothesis of constant returns to scale. That is

Ho: β1 + β2 = 1 against the alternative

H1: β1 + β2 ≠ 1
First we estimate the Cobb-Douglas production function (using annual data from 1953 to
1995) and obtain the following result.

Dependent Variable: LQ
Method: Least Squares
Date: 11/23/06 Time: 09:05
Sample: 1953 1995
Included observations: 43
LQ = C(1) + C(2)*LL+ C(3)*LK
Coefficient Std. Error t-Statistic Prob.
C(1) 0.191744 0.178262 1.075627 0.2885
C(2) 2.088333 0.139868 14.93072 0.0000
C(3) 0.211335 0.049131 4.301449 0.0001
R-squared 0.993536 Mean dependent var 9.366394
Adjusted R-squared 0.993212 S.D. dependent var 0.918770
S.E. of regression 0.075695 Akaike info criterion -2.256990
Sum squared resid 0.229191 Schwarz criterion -2.134115
Log likelihood 51.52528 F-statistic 3073.831
Durbin-Watson stat 0.670528 Prob(F-statistic) 0.000000
Notice from the result that the sum of the coefficients on LOGL (denoted LL) and LOGK
(denoted LK) appears to be well in excess of one. But to determine whether the departure
is statistically significant, we conduct the hypothesis test of constant returns.

To carry out a Wald test, we choose View then Coefficient Tests and select Wald-
Coefficient Restrictions… from the equation toolbar.

Then we type the following restriction in the dialog box:


c(2) + c(3) = 1 and click OK.
This will produce the following result of the Wald test:

Wald Test:
Equation: Untitled
Null Hypothesis: C(2) + C(3) = 1

F-statistic 99.79008 Probability 0.000000

Chi-square 99.79008 Probability 0.000000

Notice from the result above that the p-values of both statistics indicate rejection of the
null hypothesis of constant returns to scale even at 1%.

Check Your Progress 4.3

Consider the following regression result obtained using EViews

Dependent Variable: Y
Method: Least Squares
Date: 12/25/06 Time: 15:53
Sample: 1990:1 1995:4
Included observations: 24
Variable Coefficient Std. Error t-Statistic Prob.
C -133.7252 8.304021 -16.10367 0.0000

X1 0.958838 0.108965 8.799527 0.0000
X2 490.6481 81.96783 5.985862 0.0000
X3 2.010433 0.488545 4.115148 0.0005
R-squared 0.990404 Mean dependent var 101.3912
Adjusted R-squared 0.988964 S.D. dependent var 9.283278
S.E. of regression 0.975228 Akaike info criterion 2.938720
Sum squared resid 19.02139 Schwarz criterion 3.135063
Log likelihood -31.26464 F-statistic 688.0330
Durbin-Watson stat 1.695080 Prob(F-statistic) 0.000000

Based on the result, answer the following questions


1. Report the estimated regression result.
2. Are the parameters significant individually at the 5% level?
3. Are the parameters significant jointly at the 5% level?

V. Non-Linear Regression Models

Note that non-linear relationships exist among many economic variables. The coefficient
restriction test discussed earlier, for example, made use of a model that is
non-linear in the variables. Thus, it is important to examine how to model and interpret such
relationships. Moreover, note that in a regression analysis the dependent variable is
influenced not only by quantitative variables like price, output and income, but also by
variables that are qualitative in nature like education, sex, and the like. Since the
qualitative variables also influence the dependent variable, they should be included
among the explanatory variables. These qualitative variables indicate the presence or
absence of a “quality” or an “attribute”. Such variables are called dummy variables.

In this section we discuss how to perform such estimations using EViews and Stata.
But as a brief introduction, the following section summarizes the concepts of non-linear
regression and dummy variables.

Note that a non-linear regression model is one that involves a non-linear relationship in
the variables. The following are some of the commonly used regression models that are
non-linear in the variables but linear in the parameters.
I. Double-Log Models
This model is very common in economics. Consider the following Cobb-Douglas model

Yi = β0·X1i^β1·X2i^β2·e^Ui

The above specification may be alternatively expressed as

lnYi = lnβ0 + β1lnX1i + β2lnX2i + Ui
Since both the dependent and the explanatory variables are expressed in terms of
logarithm, the model is known as double-log, or log-log model.

This model is linear in the parameters and can be estimated by OLS.

Note that the coefficients β1 and β2 measure the elasticity of Y with respect to X1 and
X2.

For example, suppose the estimated value is β1 = 0.65. This implies that a one percent
increase in X1 will result in a 0.65% increase in Y, assuming that X2 is held constant.
Similarly, if β2 = 1.25, it implies that a one percent increase in X2 will result in a 1.25%
increase in Y, assuming that X1 is held constant.
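The elasticity interpretation follows directly from differentiating the log form; a one-line
derivation is added here for clarity:

∂lnY/∂lnX1 = β1 = (∂Y/Y)/(∂X1/X1),

so β1 is the percentage change in Y associated with a one-percent change in X1, holding
X2 constant.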


Consider the following regression model
logIMPOi = logβ0 + β1logINVi + β2logSAVi + β3logCOi + Ui

Where IMPO = import, INV= investment, SAV= domestic saving and CO= consumption
expenditure. The data is of Ethiopia for the period 1986 to 1995 E.C. Notice that both the
dependent and the explanatory variables are expressed in logarithmic form. The
following estimation result is performed by EViews.

Dependent Variable: LIMPO


Method: Least Squares
Date: 12/25/06 Time: 16:06
Sample(adjusted): 1986 1994
Included observations: 9 after adjusting endpoints
Variable Coefficient Std. Error t-Statistic Prob.
C -2.901645 1.067091 -2.719211 0.0418
LINV 0.732266 0.174740 4.190609 0.0086
LSAV -0.065123 0.024441 -2.664506 0.0446
LCO 0.584836 0.222107 2.633127 0.0464
R-squared 0.995673 Mean dependent var 9.342112
Adjusted R-squared 0.993077 S.D. dependent var 0.366780
S.E. of regression 0.030519 Akaike info criterion -3.839841
Sum squared resid 0.004657 Schwarz criterion -3.752185

Log likelihood 21.27928 F-statistic 383.4959
Durbin-Watson stat 3.337651 Prob(F-statistic) 0.000003

The estimated regression result is given as follows


(estimated) Limpo = -2.90 + 0.73Linv - 0.065Lsav + 0.58Lco

The interpretation of the model is presented as follows.


β1 = 0.73 => other things being equal, an increase in investment by one percent increases
import by 0.73 percent
β2 = −0.065 => other things being equal, an increase in domestic saving by one percent
decreases import by 0.065 percent
β3 = 0.58 => other things being equal, an increase in consumption expenditure by one
percent increases import by 0.58 percent.

II. Semilog Models: Log-lin and Lin-log models


Note that semilog models are those whose dependent or explanatory variable is written in
the log form. For example consider the following two models
lnYi = β0 + β1Xi + Ui
Yi = β0 + β1lnXi + Ui
The above models are called semilog models. We call the first model log-lin model and
the second model is known as lin-log model. The name given to the above models is
based on whether the dependent variable or the explanatory variable is in the log form.

If we consider the log-lin model above, β1 measures the relative change in Y for a given
absolute change in X. In the lin-log model, by contrast, β1 measures the absolute change
in Y for a given relative change in X.
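In symbols (a short derivation added for clarity): in the log-lin model,
β1 = ∂lnY/∂X = (∂Y/Y)/∂X, so 100·β1 is approximately the percentage change in Y per
unit change in X. In the lin-log model, β1 = ∂Y/∂lnX = ∂Y/(∂X/X), so a one-percent
change in X changes Y by approximately β1/100 units.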

For example consider the following regression model

IMPOi = β0 + β1logINVi + β2logSAVi + β3logCOi + Ui

Where IMPO = import, INV= investment, SAV= domestic saving and CO= consumption
expenditure. The data is of Ethiopia for the period 1986 to 1995 E.C. Notice that the
explanatory variables are expressed in logarithmic form while the dependent variable is
not. The following estimation result is performed by EViews.
Dependent Variable: IMPO
Method: Least Squares
Date: 11/25/06 Time: 17:36
Sample(adjusted): 1986 1994
Included observations: 9 after adjusting endpoints
Variable Coefficient Std. Error t-Statistic Prob.
C -84347.99 16776.36 -5.027787 0.0040
LINV 12174.76 2747.187 4.431720 0.0068
LSAV -1851.376 384.2515 -4.818136 0.0048
LCO 167.1902 3491.878 0.047880 0.9637
R-squared 0.991462 Mean dependent var 12073.53
Adjusted R-squared 0.986339 S.D. dependent var 4105.149
S.E. of regression 479.8053 Akaike info criterion 15.48574
Sum squared resid 1151065. Schwarz criterion 15.57340
Log likelihood -65.68583 F-statistic 193.5409
Durbin-Watson stat 2.275361 Prob(F-statistic) 0.000014

The result of the above estimation can be written as follows:

(estimated) IMPOi = −84347.99 + 12174.76logINVi − 1851.37logSAVi + 167.19logCOi

From the result, and recalling from the derivation above that a lin-log coefficient must be
divided by 100 for a one-percent change in the regressor, we learn the following:

•	a 1% increase in INV raises IMPO by about 12174.76/100 ≈ 121.75 units
•	a 1% increase in SAV lowers IMPO by about 1851.37/100 ≈ 18.51 units
•	a 1% increase in CO raises IMPO by about 167.19/100 ≈ 1.67 units

Check Your Progress 4.4

Consider the following hypothetical data

Y 100 110 105 120 130 125 115 140 160 160
X1 10 11 10 13 15 14 15 16 12 20
X2 1.25 1.00 1.5 1.4 2.0 2.5 3 2.75 3 3.2
X3 600 650 550 600 700 800 750 900 800 850

Using the above data, answer the following questions.

1. Estimate the following regression model


LogYi = Logβ0 + β1LogX1 + β2LogX2 + β3LogX3 + Ui
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

2. Based on the result from question number one above, test the hypothesis that
Ho: β1 = β2 at the 5% significance level.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

3. Is the regression result jointly significant at the 5% level?


________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

4.4 Regression Estimation and Analysis Using Stata

Stata conducts a number of regression estimations. In this introductory discussion, we
explain simple linear regression using OLS. Since Stata is a command-based program, we
state each command and the corresponding result in the following discussion. Suppose
GDP is a function of INV only. To compute a regression model for this function we write
the following command:
•	regress GDP INV
Note from the command that in Stata regression estimation, we first write the command
regress, then the dependent variable, followed by the explanatory variables. Only one
explanatory variable is entered in the above model, so it represents a two-variable (or
simple) regression model. Using the data on GDP and investment, we obtain the
following result.
the following result.

Source | SS df MS Number of obs = 43


-------------+------------------------------ F( 1, 41) = 982.66
Model | 1.1081e+10 1 1.1081e+10 Prob > F = 0.0000
Residual | 462356210 41 11276980.7 R-squared = 0.9599
-------------+------------------------------ Adj R-squared = 0.9590
Total | 1.1544e+10 42 274853564 Root MSE = 3358.1

------------------------------------------------------------------------------
gdp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------------------------
inv | 5.159164 .1645798 31.35 0.000 4.826788 5.491539
_cons | 2939.84 694.4404 4.23 0.000 1537.389 4342.291
-------------------------------------------------------------------------------------------------

Note from the above result that Stata produces several results in addition to the estimation
of the coefficients of the parameters. The estimated regression result is presented as
follows.
(estimated) GDP = 2939.84 + 5.159INV

Note that for a two-variable regression estimation, the coefficient of the independent
variable (in our case investment) represents the slope of the function. Notice that one
piece of additional information given by Stata's default result window is the confidence
interval for each parameter. For example, the result shows that we are 95% confident
that the unknown population parameter on INV lies between 4.826 and 5.491 units.
The confidence interval for the intercept term is read in the same way. Moreover,
notice from the result that the parameters are significant even at 1%, both jointly and
individually.

Another interesting feature of Stata is that it is possible to plot the actual dependent
variable together with the estimated value of that same dependent variable.
Such a graphical examination helps to identify to what extent the predicted value
approximates the actual value of the dependent variable. To perform this we follow these
steps.
I. Given Y = f(X), estimate the associated regression model and obtain the estimated
results.

II. Then construct the estimated value of Y from the regression result. This refers to
computing Y-hat value at each values of X. The command to do so is given by:
predict yhat. This generates a new variable equal to the predicted values from the
most recent regression.

III. To graph the actual value of Y with the estimated value (Y hat) at each value of
X, we write the following command: graph Y Yhat X, connect(.s) symbol(oi). This
command draws a scatter plot with regression line using the variables Y, Yhat and X

We can conduct this approach using the model that we have estimated earlier.
Accordingly, after estimating the regression model from the function GDP = f(INV),
we construct the estimated value of GDP using the following command: predict
GDPhat. Note that this generates the predicted values of GDP from the regression
estimation.

To construct the graph that takes into account the actual values of GDP and the
estimated value of GDP for each value of investment, we write the following
command: graph GDP GDPhat INV, connect(.s) symbol(oi). The result of this
command is given as follows

[Figure 4.4 Estimated regression line using fitted values of GDP: scatter of actual GDP
(roughly 2883.83 to 65329.6) against INV (roughly 437.424 to 12093), with the fitted
regression line overlaid.]

Note that the line represents the regression line constructed from the fitted (estimated)
values of GDP at each value of INV. The scatter points, on the other hand, represent the
actual values of GDP at each value of INV. Such a plot is helpful for a wide range of
analyses.
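Putting the whole sequence together as it would be typed at the Stata command line (a
minimal sketch; the residual line is an addition not shown above, and the graph syntax
follows the older Stata release used in this module):

    * estimate the simple regression of GDP on investment
    regress gdp inv
    * generate the fitted values from the most recent regression
    predict gdphat
    * (optional addition) generate the residuals as well
    predict uhat, residuals
    * plot actual and fitted GDP against INV
    graph gdp gdphat inv, connect(.s) symbol(oi)

In current Stata releases the plot would instead be drawn with twoway, for example:
twoway (scatter gdp inv) (line gdphat inv, sort).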

Consider the following regression model

GDP = f(INV, IMPO, CO, POP)


The associated regression model is given by

GDPi = β0 +β1INV + β2IMPO + β3CO +β4POP + Ui

Using the data on GDP, INV, IMPO, CO and POP, we can estimate this model. The
command is: regress GDP INV IMPO CO POP

Source | SS df MS Number of obs = 43

-------------+------------------------------------------------- F( 4, 38) =20376.68
Model | 1.1538e+10 4 2.8846e+09 Prob > F = 0.0000
Residual | 5379456.74 38 141564.651 R-squared = 0.9995
-------------+------------------------------------------------ Adj R-squared = 0.9995
Total | 1.1544e+10 42 274853564 Root MSE = 376.25

--------------------------------------------------------------------------------------------------
gdp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+------------------------------------------------------------------------------------
inv | 1.521035 .1757031 8.66 0.000 1.165342 1.876727
co | 1.170104 .033189 35.26 0.000 1.102917 1.237292
pop | -83.28284 20.16384 -4.13 0.000 -124.1024 -42.46328
impo | -1.127762 .1050218 -10.74 0.000 -1.340368 -.9151567
_cons | 1705.201 468.9588 3.64 0.001 755.8434 2654.558
------------------------------------------------------------------------------

The estimated regression result is given as follows.

(estimated) GDP = 1705.20 + 1.52INV + 1.17CO − 83.28POP − 1.13IMPO

Note that each coefficient in this case represents a partial regression coefficient. That is,
o an increase in investment by one unit increases GDP by 1.52 units, ceteris paribus
o an increase in consumption spending by one unit increases GDP by 1.17 units,
ceteris paribus
o an increase in population by one unit decreases GDP by 83.28 units, ceteris paribus
o an increase in import by one unit decreases GDP by 1.13 units, ceteris paribus

In the result, the 95% confidence interval for each parameter is also given. That is,
o we are 95% confident that the unknown population parameter on INV lies within
the interval 1.165342 to 1.876727.
o we are 95% confident that the unknown population parameter on CO lies within
the interval 1.102917 to 1.237292.
o we are 95% confident that the unknown population parameter on POP lies within
the interval −124.1024 to −42.46328.
o we are 95% confident that the unknown population parameter on IMPO lies within
the interval −1.340368 to −0.9151567.

The result also shows that the parameters used in the estimation are significant even at
1 percent. Recall that for individual tests of significance we use the t-probability value
(represented in Stata by P>|t|), whereas the F probability (given by Prob > F) is used to
examine the joint test of significance.

Consider a regression model designed to examine the elasticities of GDP with respect to
import and export. This requires a log-log model such as the following:

LogGDPi = Logβ0 + β1LogINVi + β2LogEXPOi + β3LogIMPOi + β4LogPOPi + LogUi

The estimation of the above log-log model using Stata first requires the transformation of
each variable into its logarithmic form (recall the command to transform a variable into
its logarithmic form).

Source | SS df MS Number of obs = 43


-------------+-------------------------------------- F( 4, 38) = 3121.69
Model | 35.346275 4 8.83656875 Prob > F = 0.0000
Residual | .107566666 38 .002830702 R-squared = 0.9970
-------------+-------------------------------------- Adj R-squared = 0.9966
Total | 35.4538417 42 .844139087 Root MSE = .0532

-----------------------------------------------------------------------------------------------
lgdp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+---------------------------------------------------------------------------------
linv | .0036559 .0477036 0.08 0.939 -.092915 .1002269
lexpo | .2011124 .0554548 3.63 0.001 .08885 .3133747
limpo | .0523219 .0685613 0.76 0.450 -.0864732 .1911171
lpop | 1.849338 .1329017 13.92 0.000 1.580293 2.118384
_cons | .8004088 .2369478 3.38 0.002 .320733 1.280085
------------------------------------------------------------------------------------------------

Note from the above result that the coefficients represent partial elasticity coefficients.
That is, other things being equal,
o an increase in investment by one percent increases GDP by 0.003 percent.
o an increase in export by one percent increases GDP by 0.20 percent.
o an increase in import by one percent increases GDP by 0.05 percent.
o an increase in population by one percent increases GDP by 1.84 percent.
In the result, the 95% confidence interval for each parameter is also given. That is,
o we are 95% confident that the true elasticity of GDP with respect to INV lies within
the interval −0.092915 to 0.1002269.
o we are 95% confident that the true elasticity of GDP with respect to EXPO lies
within the interval 0.08885 to 0.3133747.
o we are 95% confident that the true elasticity of GDP with respect to IMPO lies
within the interval −0.0864732 to 0.1911171.
o we are 95% confident that the true elasticity of GDP with respect to POP lies within
the interval 1.580293 to 2.118384.
The result also shows that the parameters used in the estimation are jointly significant
even at 1 percent. Individually, however, only the EXPO and POP parameters are
significant, while INV and IMPO are not.
Hypothesis Testing Using Stata
Stata conducts a wide range of hypothesis tests. These include pre-estimation tests about
a single variable or between variables in many respects, as well as post-estimation tests
about an individual parameter or between parameters, all computed in a very friendly
manner. Our discussion first focuses on pre-estimation hypothesis testing, to be followed
by the post-estimation one.
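As a pointer ahead, coefficient restrictions of the kind handled by the EViews Wald test
are available in Stata through the post-estimation test command; a minimal sketch,
assuming a log-log regression such as the one above is the most recent estimation:

    * Wald test that the GDP elasticities of investment and export are equal
    test linv = lexpo
    * Wald test of a constant-returns-type restriction
    test linv + lexpo = 1

After regress, Stata reports an F statistic and its p-value for each such restriction.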

A. Pre Estimation Test


As discussed earlier, a pre-estimation test refers to testing a hypothesis about a variable
individually or between the variables under study. Consider weekly data about the demand
for roses, randomly collected for 12 weeks. The data include the quantity of roses sold in
kg (Y), the average retail price of roses (X1), the average wholesale price of a substitute
good (X2) and average weekly family disposable income (X3). The values of X1, X2 and
X3 are measured in birr. The following data represent this information.
Week Y X1 X2 X3
1 8429 3.07 4.06 165.26
2 10079 2.91 3.64 172.92
3 9240 2.73 3.21 178.46
4 8862 2.77 3.66 198.62
5 6216 3.59 3.76 186.28
6 8038 2.60 3.13 180.49
7 8038 2.60 3.13 180.49
8 7476 2.89 3.20 183.33
9 5911 3.77 3.65 181.87
10 7950 3.64 3.60 185.00
11 6134 2.82 2.94 184.00
12 5868 2.96 3.12 188.20

Based on the above table we can conduct the following hypothesis tests.

Mean comparison tests


Using Stata we can compare the mean value of a variable with a predetermined value. Moreover, we can compare the mean (average) values of any two variables. In this regard there are several commands associated with such tests. The following commands are the basic hypothesis testing instruments.
 ttest varname = #
 ttest varname1 = varname2
 ttest varname1 = varname2, unpaired
In general, notice that the command ttest performs one-sample, two-sample, and paired t tests on the equality of means. In the first command, ttest performs a one-sample t test of the hypothesis that a given variable has a certain mean given by #. In the second command, without any options specified, ttest performs a paired test of the hypothesis that varname1 - varname2 has a mean of zero. In the third command, with the unpaired option specified, ttest performs a two-sample t test of the hypothesis that the mean of varname1 equals the mean of varname2.

Now, given the table above, we can test a hypothesis that makes use of the above stated commands. For example, consider the following command
 ttest X1 = 2.75
This hypothesis states that the average retail price of roses in the market is 2.75 birr. The following result is computed in Stata based on the above command.
One-sample t test

---------------------------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
--------- +----------------------------------------------------------------------------------------
x1 | 12 3.029167 .1182349 .4095776 2.768933 3.2894
----------------------------------------------------------------------------------------------------
Degrees of freedom: 11

Ho: mean(x1) = 2.75

Ha: mean < 2.75 Ha: mean ~= 2.75 Ha: mean > 2.75
t = 2.3611 t = 2.3611 t = 2.3611
P < t = 0.9811 P > |t| = 0.0377 P > t = 0.0189

Notice from the result that the null hypothesis is Ho: mean(x1) = 2.75, and that both two tailed and one tailed (left and right tail) tests are presented above. Recall that the null hypothesis is rejected when the probability value falls below the selected level of significance. Notice from the result that for the two tailed and right tailed tests, the null hypothesis is rejected both at 5% (0.05) and 10% (0.1), whereas it is not rejected at the 1% (0.01) level of significance. In general, the test suggests that the mean value of X1 is different from, and higher than, 2.75 birr.

The command below performs a similar hypothesis test for the variable X2, namely Ho: mean(X2) = 3.5. The appropriate command in this regard is: ttest X2 = 3.5. The result is summarized in the following box.
One-sample t test

--------------------------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
--------- +--------------------------------------------------------------------------------------
x2 | 12 3.425 .0991364 .3434186 3.206802 3.643198

--------------------------------------------------------------------------------------------------
Degrees of freedom: 11

Ho: mean(x2) = 3.5

Ha: mean < 3.5 Ha: mean ~= 3.5 Ha: mean > 3.5
t = -0.7565 t = -0.7565 t = -0.7565
P < t = 0.2326 P > |t| = 0.4652 P > t = 0.7674

Notice from the result that the null hypothesis is not rejected even at the 10% level in both the two tailed and one tailed (left and right) tests. This indicates that, on average, the price of X2 is 3.5 birr per unit.

Note that, in addition to this, Stata can compare the mean values of two variables collected from the same sample or from different ones. A comparison within the same sample is called a paired test. Given the above table, one may hypothesize that the mean price of X1 is equal to that of X2. Accordingly, the command is: ttest X1 = X2. The following result is based on this command.
Paired t test

----------------------------------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
--------- +----------------------------------------------------------------------------------------------
x1 | 12 3.029167 .1182349 .4095776 2.768933 3.2894
x2 | 12 3.425 .0991364 .3434186 3.206802 3.643198
--------- +------------------------------------------------------------------------------------------------
diff | 12 -.3958334 .1029155 .3565098 -.6223489 -.1693178
-----------------------------------------------------------------------------------------------------------

Ho: mean(x1 - x2) = mean(diff) = 0

Ha: mean(diff) < 0 Ha: mean(diff) ~= 0 Ha: mean(diff) > 0


t = -3.8462 t = -3.8462 t = -3.8462
P < t = 0.0014 P > |t| = 0.0027 P > t = 0.9986

Notice from the result that the null hypothesis is rejected at 1% for both the two tailed test and the left tailed test. In general, the test suggests that the mean value of X1 is different from (in fact, lower than) the mean value of X2.

Variance comparison tests


Using Stata it is also possible to test the hypothesis that the variance (or standard deviation) of a variable assumes some specific value. In addition, Stata can compare the variances (or standard deviations) of any two variables. The most important commands associated with such tests are given as follows.

 sdtest varname = #
 sdtest varname1 = varname2

Note from the two commands that sdtest performs tests on the equality of variances (standard deviations). In the first command, sdtest performs a chi-squared test of the hypothesis that the standard deviation of varname is #; that is, it examines whether a given variable's population variance (standard deviation) is equal to some number. In the second command, sdtest performs an F test (variance ratio test) of the hypothesis that varname1 and varname2 have the same variance. In other words, the second command checks whether two variables have equal variances (standard deviations) or not.

Considering the data on the demand for roses discussed earlier, we can conduct the above stated hypothesis tests. Suppose a researcher hypothesized that the variation of the price of roses (from the average) is equal to 0.5. In this case the appropriate command to execute the job is: sdtest X1 = 0.5. The Stata output of this command is displayed below.
sdtest x1 = 0.5

One-sample test of variance

-------------------------------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
--------- +-------------------------------------------------------------------------------------------

x1 | 12 3.029167 .1182349 .4095776 2.768933 3.2894
--------------------------------------------------------------------------------------------------------

Ho: sd(x1) = 0.5


chi2(11) = 7.381

Ha: sd(x1) < 0.5 Ha: sd(x1) ~= 0.5 Ha: sd(x1) > 0.5
P < chi2 = 0.2326 2*(P < chi2) = 0.4651 P > chi2 = 0.7674

From the result above we learn that there are both two tailed and one tailed (left and right tail) tests. Recall that the null hypothesis is rejected when the probability value falls below the selected level of significance (usually 10%, 5% or 1%). Notice from the result that the null hypothesis cannot be rejected even at the 10% (0.1) level of significance for the two tailed test, and the same holds for both the left and right tailed tests. This suggests that the hypothesized standard deviation of X1 is acceptable. That is, the data are consistent with a population standard deviation of X1 equal to 0.5.

In addition to this, we can test the equality of the variances of two variables. For instance, suppose we hypothesize that the price variation in X1 is equal to that of X2. To test this, the relevant command is: sdtest X1 = X2. The result obtained from this command is given as follows.
Variance ratio test

--------------------------------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
--------- +----------------------------------------------------------------------------------------------
x1 | 12 3.029167 .1182349 .4095776 2.768933 3.2894
x2 | 12 3.425 .0991364 .3434186 3.206802 3.643198
--------- +-----------------------------------------------------------------------------------------------
combined | 24 3.227083 .0860011 .4213176 3.049177 3.40499
----------------------------------------------------------------------------------------------------------

Ho: sd(x1) = sd(x2)

F(11,11) observed = F_obs = 1.422


F(11,11) lower tail = F_L = 1/F_obs = 0.703

F(11,11) upper tail = F_U = F_obs = 1.422

Ha: sd(x1) < sd(x2) Ha: sd(x1) ~= sd(x2) Ha: sd(x1) > sd(x2)
P < F_obs = 0.7156 P < F_L + P > F_U = 0.5688 P > F_obs = 0.2844

Notice that neither the two tailed nor the one tailed tests can reject the null hypothesis that the variance of X1 is equal to that of X2, because the probability values of all the tests are well above even the 10% (0.1) level. Thus we fail to reject the hypothesis that the variance of X1 equals that of X2.

The discussion below examines the process of hypothesis testing after we perform regression estimation. This includes tests of equality between parameters and many other linear restrictions.

The discussion will be based on the following cross section data, which include the output (Y), the labor input (L), and the capital input (K) of firms in a chemical industry.

Firm Y ('000 tons) L (hours) K (machine hours)


1 60 1100 300
2 120 1200 400
3 190 1430 420
4 250 500 400
5 300 520 510
6 300 1620 590
7 380 1800 600
8 430 1820 630
9 440 1800 610
10 490 1700 630
11 500 1900 850
12 520 1960 900
13 540 1830 980
14 410 1900 900
15 350 1500 800

B. Test of linear hypotheses after model estimation


This refers to testing a given hypothesis about a parameter individually or about its relationship with other parameters. Some test results are displayed by default together with the Stata regression result. These default results are significance tests of the parameters individually and jointly. However, it is also possible to conduct such tests explicitly using some commands.

In Stata such hypothesis tests are conducted using the test command. Note that test tests linear hypotheses about the estimated parameters from the most recently estimated model; without arguments, test redisplays the results from the last test. The other command used is testparm, which provides a useful alternative to test in that it permits a variable list rather than just a list of coefficients. Note that test and testparm perform Wald tests.
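For instance, after the regression of Y on L and K estimated below, the command testparm l k is equivalent to test l k: both jointly test that the two coefficients are zero (a minimal illustration; the worked examples that follow use test).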

Consider the following regression model that can be estimated using the above data
Yi = β0 + β1Li + β2Ki + Ui
After (single-equation) estimation of the above model, we can estimate a number of
hypotheses using the appropriate command as shown below.
 test L = K
This command tests the hypothesis that the population parameters of L and K (which are β1 and β2) are equal. The hypothesis suggests that labor and capital have equal impacts on output.
 test K = L/2
This command tests the hypothesis that the population parameter of K is half that of L; that is, β2 is half of β1. In other words, the hypothesis suggests that the contribution of labor is twice that of capital.
 test L = 2
This command tests the hypothesis that the population parameter of L equals 2. The hypothesis suggests that as labor changes by one unit, output increases by two units, ceteris paribus.
 test L or test K
This produces an individual test of significance of the population parameter. The result of this test is similar to the one presented in the Stata regression result.
 test L K

This produces a joint test of significance of the population parameters. The result of this test is similar to the one presented in the Stata regression result.

We can exercise the above stated commands using the cross section data posted above. As we said earlier, the first step is to compute the regression model: Stata tests linear hypotheses about the estimated parameters from the most recently estimated model, so it is very important to conduct the regression estimation first. Accordingly, we obtain the following result.

regress y l k

Source | SS df MS Number of obs = 15


-------------+-------------------------------------------------------- F( 2, 12) = 17.29
Model | 225428.323 2 112714.161 Prob > F = 0.0003
Residual | 78211.6775 12 6517.63979 R-squared = 0.7424
-------------+-------------------------------------------------------- Adj R-squared = 0.6995
Total | 303640.00 14 21688.5714 Root MSE = 80.732

-------------------------------------------------------------------------------------------------------------
y| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+-----------------------------------------------------------------------------------------------
l | .0409662 .0628447 0.65 0.527 -.0959606 .177893
k | .534077 .1424411 3.75 0.003 .2237244 .8444296
_cons | -48.62864 75.29277 -0.65 0.531 -212.6775 115.4202
--------------------------------------------------------------------------------------------------------------

Note that the default regression result reported in the Stata results window includes tests of significance of the parameters individually as well as jointly. However, we can perform the same tests using the following commands.

 test L. This command tests the hypothesis that the parameter of L (which is
β1) is significant or not. The result is presented as follows.

test l
( 1) l = 0.0
F( 1, 12) = 0.42
Prob > F = 0.5268

Note from the above result that this is a test of significance of an individual parameter. The probability of this test is similar to the one posted in the regression result displayed earlier. Based on the probability result, we find β1 to be insignificant.

 test K. This command tests the hypothesis that the parameter of K (which is
β2) is significant or not. The result is presented as follows.

test k
( 1) k = 0.0

F( 1, 12) = 14.06
Prob > F = 0.0028

Note from the above result that the probability of the test is similar to the one posted in the regression result. Moreover, we find β2 to be significant even at one percent.
 Test L = K. As we said earlier, this test examines the hypothesis that the population parameters of L and K (which are β1 and β2) are equal. Thus, the hypothesis suggests that labor and capital have equal impacts on output. The result of this test is presented as follows.
test l = k
( 1) l - k = 0.0

F( 1, 12) = 6.63
Prob > F = 0.0243

The above result rejects the hypothesis that β1 equals β2 at the 5% significance level. This indicates that there is no statistical evidence to suggest that the contributions of L and K to Y are equal.
 test L K. This test represents joint significance test as shown below
. test l k
( 1) l = 0.0
( 2) k = 0.0

F( 2, 12) = 17.29
Prob > F = 0.0003

Note that the parameters are jointly significant even at the 1% level.

Suppose we fit a Cobb-Douglas production function to the data presented earlier. Transformation of the function into linear form produces the following log-log model:

LogYi = Logβ0 + β1LogLi + β2LogKi + Ui

Recall that the parameters β1 and β2 in this model represent elasticity coefficients. The result of this model (which will be used for hypothesis testing) is presented as follows. [Note: we first have to transform each variable into its logarithm form; a minimal sketch is given below.]
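A rough sketch of the commands, assuming the variables are stored under the names y, l and k as in the table above (the logarithm forms ly, ll and lk then match the output below):

* y, l and k are assumed to be the variable names in the dataset
generate ly = log(y)
generate ll = log(l)
generate lk = log(k)
regress ly ll lk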
Source | SS df MS Number of obs = 15
-------------+------------------------------ F( 2, 12) = 20.17
Model | 4.05652733 2 2.02826366 Prob > F = 0.0001
Residual | 1.2066428 12 .100553566 R-squared = 0.7707
-------------+------------------------------ Adj R-squared = 0.7325
Total | 5.26317012 14 .375940723 Root MSE = .3171

------------------------------------------------------------------------------
ly | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ll | -.2157061 .2418451 -0.89 0.390 -.7426414 .3112292
lk | 1.67721 .3040903 5.52 0.000 1.014654 2.339766
_cons | -3.435928 1.637866 -2.10 0.058 -7.004531 .1326751
------------------------------------------------------------------------------

The estimated result can be presented as follows:
(estimated) LYi = -3.43 - 0.215LLi + 1.67LKi

Note from the result that the coefficients of L and K (which measure elasticities) are not equal. However, we can examine whether this difference is statistically significant for the population parameters. Thus, we test the hypothesis that the elasticity of Y with respect to L is the same as the elasticity of Y with respect to K. The command for this is:
 test LL =LK
The result is given as follows. (Note that the test must follow the regression estimation).

test ll= lk
( 1) ll - lk = 0.0

F( 1, 12) = 14.86
Prob > F = 0.0023

Note from the result that the hypothesis is rejected even at the 1% significance level (Prob > F = 0.0023). This points out that the two elasticities are not equal.

Moreover, note that the sum of the estimated elasticities is more than one (around 1.46). This suggests the presence of increasing returns to scale. However, we need to examine whether this result is statistically supported. To do this we test the hypothesis of constant returns to scale using the following command.
 test LL + LK =1
The test hypothesizes the presence of constant returns to scale in the model. The
following is the result of the test.

test ll+ lk=1


( 1) ll + lk = 1.0

F( 1, 12) = 3.50

Prob > F = 0.0858

The result shows that the hypothesis of constant returns to scale cannot be rejected at the 1% and 5% significance levels; however, it is rejected at the 10% level. The evidence for increasing returns to scale is therefore weak.

Check Your Progress 4.5

Consider the data given under check your progress 4.4 on Y, X 1, X2 and X3 . Then attempt
the following using Stata.

1. Estimate the following regression model

Yi = βo + β1Log X1 + β2Log X2 + β3Log X3 + Ui


________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

2. Test the hypothesis that A) β1 = β2 and B) β1 + β2 = 1


________________________________________________________________________
________________________________________________________________________
_______________________________________________________________________

4.5 Summary

Econometrics deals with the measurement of economic relationships. It is a combination


of economic theory, mathematical economics and statistics, but it is completely distinct
from each one of these three branches of science. In conducting econometric analysis,
there are some steps that we need to follow as a methodology.

Note that regression analysis is concerned with describing and evaluating the relationship
between a given variable (dependent variable) and one or more other variables
(explanatory variables).

This unit has employed computer-based analysis and explained a number of issues related to econometric estimation and analysis using EViews and Stata. That is, we described the approach employed in conducting regression estimation in EViews. This includes specifying and estimating a regression model, interpreting the estimation results, tests of significance of the parameters, and the like. Note that in the discussion we made use of the Ordinary Least Squares method of estimation since it is the simplest as well as the most widely used method in basic regression estimation.

Moreover, the Stata program has been used to conduct similar tests. As we have seen, Stata conducts a number of regression estimations; the discussion explained simple linear regression using OLS. In addition to this, Stata conducts a wide range of hypothesis tests. This includes pre-estimation hypothesis tests about a variable or between variables in many respects. Moreover, post-estimation hypothesis tests about an individual parameter or between parameters can be computed in a very friendly manner.

4.6 Answer to Check Your Progress

Answer to Check Your Progress 4.1

1. Step I: Identify the relationship between variables and express the relationship in mathematical form. Step II: Specify the mathematical form of the model. Step III: Specify the econometric model. Step IV: Determine the numerical estimates of the coefficients of the model. Step V: Evaluate the estimates.

2. Regression analysis is concerned with describing and evaluating the relationship


between a given variable (dependent variable) and one or more other variables
(explanatory variables).

3. R2 measures the proportion of the variation in Y explained by the explanatory
variables such as X1 and X2 jointly.
4. This refers to testing whether a particular variable X1 or X2 is significant or not, either individually or jointly. On the other hand, in confidence interval estimation, we establish limiting values around the estimate within which the true parameter is expected to lie with a certain 'degree of confidence'.

Answer to Check Your Progress 4.2


1. The three alternatives are
A) From the main menu select Objects and then choose New Object and select
Equation
B) From the main menu click Quick and then select Estimate Equation. Or
C) Simply type the keyword equation in the command window and press Enter
2. The two equation specification methods are: specifying an Equation by List and
Specifying an Equation by Formula

Answer to Check Your Progress 4.3

1. (estimated) Y = -133.72 + 0.958X1 + 490.65X2 + 2.01X3


2. Yes, all the parameters are individually significant even at 1%
3. Yes, the parameters are jointly significant even at 1%

Answer to Check Your Progress 4.4

1. (estimated) Y = 1.56 + 0.109LogX1 + 0.107LogX2 + 0.442LogX3


2. Accept Ho since P-value = 0.9942
3. It is not significant at 5% since F prob = 0.0651
Answer to Check Your Progress 4.5

1. (estimated) Y = -281.47 + 10.44LogX1 + 15.22LogX2 + 56.40LogX3


2. A) Accept Ho since P-value = 0.9260 B) Accept Ho since P-value = 0.5483

4.7 Model Examination
The following table provides data on real gross product, labor input, and real capital input
in the manufacturing sector of a certain economy.

YEAR    Y: Real Gross Product (billions of dollars)    X1: Labor Input (hundred thousands of persons)    X2: Real Capital Input (billions of dollars)
1958 8.9 2.82 121
1959 10.9 2.84 122
1960 11.1 2.89 123
1961 12.1 3.76 128
1962 12.8 3.75 131
1963 16.3 4.03 134
1964 19.5 4.78 139
1965 21.1 5.73 146
1966 23.1 6.17 154
1967 26.1 6.96 164
1968 29.6 7.90 177
1969 33.4 8.16 188
1970 38.4 8.48 206
1971 46.7 8.73 222
1972 54.3 9.99 240

1. Using the above table, attempt the following questions using EViews
A. Fit the following model to the above data, and report the results
Yi = β0 + β1X1 + β2X2 + Ui
B. Interpret the coefficient results and comment on the results of the adjusted R2
C. Test the significance of the parameters individually at 5% level.
D. Does the data support the hypothesis β1 = β2 ? Report your findings at 5% level
of significance
E. Test for the existence of autocorrelation problems in the model

2. Based on the data given above, answer the following questions using Stata

I. Fit the following model to the above data, and report the results
LogYi = Logα0 + α1LogX1 + α2LogX2 + LogUi
II. Interpret the coefficient results and comment on the results of the adjusted R2
III. Test the significance of the parameters individually at 5% level.
IV. Does the data support the hypotheses (a) α1 = α2 and (b) α1 + α2 = 1? Use a 5% significance level.
V. Test for the existence of heteroscedasticity problems in the above model

Unit Five: Diagnostic Tests

5.0 Objective
5.1 Introduction
5.2 The Concept of Diagnostic Checking.
5.3 Diagnostic Checking Using EViews
5.4 Diagnostic Checking Using Stata
5.5 Summary
5.6 Answers to Check Your Progress
5.7 Model Examination

5.0 Objective
The aim of this unit is to conduct computer-based examinations of regression results. After completing this unit, the student will be able to:

 Identify the concept of diagnostic checking. This includes explaining the sources
of the problem, the detection mechanism and the appropriate solutions.
 Test for the presence of the problems using EViews and Stata.

5.1 Introduction

In regression estimation and analysis, the task of the econometrician is not limited to
performing estimation. Rather several tests that ascertain the reliability of the model must
be conducted. In this unit, we will employ a computer based method of assessing the
reliability of the estimates of the parameters from econometric criteria point of view.
Recall from your econometrics discussion that after the estimation of the parameters with

the method of ordinary least squares, we should assess the reliability of the estimates of
the parameters based on three types of criteria. These are:
 A priori economic criteria which are determined by economic theory and related
to the sign and magnitude of the parameters.
 Statistical criteria which are determined by the statistical theory.
 Econometric criteria which are determined by the econometric theory.

Further, recall that the statistical criteria are the coefficient of determination, the standard
errors of the estimates and the related t and F-statistics. These tests are valid only if the
assumptions of the linear regression model are satisfied. Thus, if the assumptions of an
econometric method are violated, then the estimates obtained do not possess some or all
of their optimal properties discussed in the earlier units. Therefore, their standard error
becomes unreliable criteria.

Note that econometric criteria provide evidence about the validity or violation of the assumptions of the linear regression model. In this unit, therefore, we will see with the aid of computer programs how to detect violations of the basic assumptions of the classical linear regression model. In this regard, among a number of requirements, emphasis is usually given to three major econometric problems: autocorrelation, heteroscedasticity and multicollinearity. Thus this unit examines briefly what these problems look like conceptually and then conducts the tests using EViews and Stata.

5.2 The Concept of Diagnostic Checking

Note that there are some criteria set by the theory of econometrics. Therefore, after
conducting regression estimation it is important to investigate whether the assumptions of
the econometric method are satisfied or not.

If the assumptions are not satisfied, then the estimates of the parameters will not possess some of the desirable properties and become unreliable for determining the significance of the estimates.

167
In general, before using the estimates for prediction, policy making or other objectives, the researcher must make use of all the criteria. If the assumptions are not satisfied, it is necessary to re-specify the model by introducing new variables, omitting variables or transforming variables, and then re-estimate the model. This process of re-specification continues until the model satisfies all three criteria.

Given Yi = β0 + β1X1i + β2X2i + … + βkXki + Ui, the assumptions on which the classical linear regression model is based are:

Assumption 1: Randomness of the random variable U. That is, its value is unpredictable
and hence depends on chance.

Assumption 2: Zero mean of the random variable U

Assumption 3: The variance of each Ui is the same for all the X i values. This is known as
the assumption of homoscedasticity
Assumption 4: The values of each Ui are normally distributed, i.e. Ui ~ N(0, σu²)

Assumption 5: The values of Ui corresponding to Xi are independent of the values of any other Uj corresponding to Xj. This is called the assumption of non-autocorrelation or serial independence of the U's

Assumption 6: Every disturbance term Ui is independent of the explanatory variables.

Assumption 7: No errors of measurement in the X's

Assumption 8: The explanatory variables are not perfectly linearly correlated. This is called the assumption of no perfect multicollinearity between the X's

Assumption 9: The model has no specification error. That is all the important explanatory
variables appear explicitly in the function and the mathematical form is
correctly specified.

It was on the basis of these assumptions that we estimated the model and tested its significance. Nevertheless, the question is what the implication would be if some or all of these assumptions were violated. That is, if the assumptions are not fulfilled, what will be the outcome? A brief discussion follows on the most important of these assumptions.

A. The assumption of no autocorrelation

This assumption implies that the covariance of Ui and Uj is equal to zero. If this assumption is violated, the disturbances are said to be autocorrelated. Autocorrelated values of the disturbance term may be observed for many reasons. These are:
 Omitted explanatory variables
Most economic variables tend to be autocorrelated. If an autocorrelated variable has been excluded from the set of explanatory variables, then its influence will be reflected in the random variable U. This is called "quasi-autocorrelation", since it is due to the autocorrelated pattern of the omitted explanatory variables and not to the pattern of the values of the random variable U itself. If several autocorrelated explanatory variables are omitted, the random variable U may not be autocorrelated, because the autocorrelation patterns of the omitted variables may offset each other.

 Mis-specification of the mathematical form of the model

If we use a mathematical form which differs from the correct form of the relationship, then the random variable may show serial correlation. For example, if we choose a linear function while the correct form is non-linear, then the values of U will be correlated.

 Mis-specification of the true random term U

Many random factors like war, drought, weather conditions, strikes, etc. exert influences that are spread over more than one period of time. For example, the effect of weather conditions in the agricultural sector will influence the performance of other economic variables for several periods into the future. A strike in an organization affects the production process for several future periods. In such cases, the values of the U's become serially dependent, so that if we assume E(UiUj) = 0, we mis-specify the true pattern of the values of U. This type of autocorrelation is called "true autocorrelation".

 Interpolation in the statistical observation

Most time series data involve some interpolation and "smoothing" processes to remove seasonal effects, which average the true disturbances over successive time periods. As a result, the successive values of the U's are interrelated and show autocorrelation patterns. The source of autocorrelation has a strong influence on the choice of a solution for correcting autocorrelation; that is, the type of corrective action depends on the cause or source of the autocorrelation.

When the disturbance term exhibits serial correlation, the values as well as the standard errors of the parameter estimates are affected. Note that if the disturbances are correlated: (i) the previous values of the disturbances carry some information about the current disturbances, and if this information is ignored the sample data are not being used with maximum efficiency; (ii) the variance of the random term U may be seriously underestimated; and (iii) predictions based on ordinary least squares estimates will be inefficient with autocorrelated errors.

Autocorrelation is potentially a serious problem. Hence, it is essential to find out whether autocorrelation exists in a given situation. Some rough idea about its existence may be gained by plotting the residuals either against time or against their own lagged values. But the most celebrated test for detecting serial correlation is popularly known as the Durbin-Watson d-statistic. It is defined as

d = Σ_{t=2..n} (Ût − Ût−1)² / Σ_{t=2..n} Ût²

which is simply the ratio of the sum of squared differences in successive residuals to the
residual sum of squares, RSS. Note that in the numerator of the d statistic the number of

observations is n-1 because one observation is lost in taking successive differences. Note
that expanding the above formula allows us to obtain the approximation d ≈ 2(1 − ρ̂), where ρ̂ is the estimated first-order correlation coefficient between successive residuals.
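As a quick numerical illustration of this approximation: ρ̂ = 0.9 implies d ≈ 2(1 − 0.9) = 0.2, ρ̂ = 0 implies d ≈ 2, and ρ̂ = −0.9 implies d ≈ 2(1 + 0.9) = 3.8.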
Note from the Durbin-Watson statistic that with positive autocorrelation (ρ > 0), successive disturbance values tend to have the same sign, so the quantities (Ut − Ut−1)² tend to be small relative to the squares of the actual disturbance values; we can therefore expect the value of d to be low. Indeed, in the extreme case ρ = 1 it is possible that Ut = Ut−1 for all t, so that the minimum possible value of d is zero. With negative autocorrelation, positive disturbance values tend to be followed by negative ones and vice versa, so the quantities (Ut − Ut−1)² tend to be large relative to the squares of the U's, and the value of d tends to be high; in the extreme case ρ = −1, d may be as high as 4. Notice, however, that when ρ = 0 the first-order scheme Ut = ρUt−1 + εt reduces to Ut = εt for all t, so that Ut takes on all the properties of εt and, in particular, is no longer autocorrelated. Thus in the absence of autocorrelation we can expect d to take a value close to 2; when negative autocorrelation is present, a value in excess of 2 and possibly as high as 4; and when positive autocorrelation is present, a value lower than 2 and possibly close to zero.

As discussed earlier, the Durbin-Watson (DW) test tests the hypothesis H0: ρ = 0 (implying that the error terms are not autocorrelated under a first-order scheme) against the alternative H1: ρ ≠ 0.

Note that the range of the DW statistic is between 0 and 4. A decision is made by comparing the calculated value with the critical (tabulated) values. As a rule of thumb, however, if d is found to be close to 2 in an application, one may assume that there is no first order autocorrelation, either positive or negative. If d is close to 0, it is because the correlation between successive error terms is close to 1, indicating strong positive autocorrelation in the residuals. Similarly, the closer d is to 4, the greater the evidence of negative serial correlation, because the correlation between successive error terms is then close to −1.

The following summarizes the general approach to making a decision in the DW test, with H0: no positive autocorrelation and H0*: no negative autocorrelation:
 0 to dL: reject H0 (evidence of positive autocorrelation)
 dL to dU: zone of indecision
 dU to 4−dU: do not reject H0 or H0* (or both)
 4−dU to 4−dL: zone of indecision
 4−dL to 4: reject H0* (evidence of negative autocorrelation)

Figure 5.1 DW test

Note that the DW result is compared with the Durbin-Watson critical (table) value, and the decision either to accept or to reject the null hypothesis of no (positive or negative) autocorrelation is made using the regions above. For example, suppose the DW result is 1.05 and the critical values from the Durbin-Watson table at 5% are dL = 1.38 and dU = 1.72. Since 1.05 < dL, on the basis of the d test we conclude that there is positive autocorrelation.

Note that it is important to address the DW problem when it exists in the regression result. If the source of the problem is suspected to be the omission of important variables, the solution is to include those omitted variables. If the source of the problem is believed to be misspecification of the model, the solution is to determine the appropriate mathematical form. However, if these causes are ruled out, the appropriate procedure is to transform the original data so as to obtain a new form (or model) which satisfies the assumption of no serial correlation.
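One standard transformation of this kind is the Cochrane-Orcutt procedure. A minimal sketch in Stata (assuming annual time-series data with a year variable, a dependent variable y and regressors x1 and x2, all hypothetical names):

* declare the data as a time series, then run Cochrane-Orcutt via prais
tsset year
prais y x1 x2, corc

Without the corc option, prais runs the closely related Prais-Winsten estimator.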

B. The Assumption of Homoscedasticity

This is one of the important assumptions of the classical linear regression model. It says that the population disturbance terms Ui all have the same variance. It implies that the conditional variance of the dependent variable, given the values of the explanatory variable, remains the same regardless of the values taken by the variable X. On the other hand, when the conditional variance of the dependent variable changes as the value of the explanatory variable increases, we say there is heteroscedasticity.

Note that there are a number of reasons why the variances of Ui may differ across observations (heteroscedasticity). Some of these are:

I) As income grows, people have more discretionary income and hence more scope for choice. Hence the variance of the error term, σi², is likely to increase with income. Thus, in the regression of saving on income, we find that σi² increases with income.

II) Heteroscedasticity can also arise as a result of the presence of outliers. An outlier (outlying observation) is an observation that is much different, either very small or very large, in relation to the other observations in the sample. The inclusion or exclusion of such an observation, especially in a small sample, can substantially alter the results of regression analysis.

III) Another source of heteroscedasticity is violation of the assumption that the regression model is correctly specified. Heteroscedasticity can arise because some important variables are omitted from the model. For example, in a demand function, if we omit the price of a complement or the price of a substitute, then the residuals obtained from the regression may suggest that the error variance is not constant. If the omitted variables are included in the model, the problem may disappear.

IV) The problem of heteroscedasticity is likely to be more common in cross sectional than in time series data. In cross-sectional data, members of a population such as individual consumers, firms or industries are considered at a given point in time, and these members may be of different sizes (small, medium or large). In time series data, the variables tend to be more similar since the data are collected for the same entity over a period of time.

The consequences of the presence of heteroscedasticity in a regression model include: (i) the confidence intervals based on OLS will be unnecessarily large, and as a result the t and F tests are likely to give inaccurate results; and (ii) the prediction of the dependent variable for a given observation on the X's is inefficient when the method of ordinary least squares is used.

Note that there are formal and informal methods of detecting heteroscedasticity. These include the Breusch-Pagan test. This test is relevant for a very wide class of alternative hypotheses, namely that the variance is some function of a linear combination of known variables. The generality of this test is both its strength (it does not require prior knowledge of the functional form involved) and, potentially, its weakness (a test tailored to a specific alternative may be more powerful).

To illustrate this test, consider the k-variable linear regression model

Yi = β0 + β1X1i + … + βkXki + Ui

Assume that the error variance σi² is described as

σi² = f(α1 + α2Z2i + … + αmZmi)

that is, σi² is some function of the non-stochastic variables Z; some or all of the X's can serve as Z's. Specifically, assume that

σi² = α1 + α2Z2i + … + αmZmi

that is, σi² is a linear function of the Z's. If α2 = α3 = … = αm = 0, then σi² = α1, which is a constant. Therefore, to test whether σi² is homoscedastic, one tests the hypothesis that α2 = α3 = … = αm = 0.
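As a rough illustration of the auxiliary-regression idea behind tests of this kind, one common Lagrange-multiplier variant can be sketched in Stata as follows (the names y, x1 and x2 are hypothetical; Stata's built-in hettest command, discussed in Section 5.4, automates a closely related test):

* run OLS, save the residuals, square them, and regress them on the Z variables
regress y x1 x2
predict uhat, residuals
generate uhat2 = uhat^2
regress uhat2 x1 x2

Here the squared OLS residuals are regressed on the candidate Z variables (taken to be the regressors themselves in this sketch), and n·R² from the auxiliary regression is compared with a chi-square critical value whose degrees of freedom equal the number of Z's.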

C. The Assumption of No Multicollinearity

This is one of the most important assumptions. Its violation means that the explanatory variables are perfectly linearly correlated.

Note that multicollinearity is not a condition that either exists or does not exist in economic variables; rather, it is inherent in most economic relationships due to the interdependence of many economic variables. In other words, multicollinearity is a question of degree, not of existence.

Multicollinearity is a problem because when any two explanatory variables change in the same way, it becomes difficult to measure the influence of each variable on the dependent variable.

Note that if the correlation between the explanatory variables is perfect, then the
estimates of the coefficients are indeterminate and the standard errors of these estimates
become infinitely large. When certain explanatory variables are more important than
others and correlated with the dependent variables, the seriousness of the problem is
greater.

With multicollinearity, we may face the problem of mis-specification, because we may drop a variable whose standard error appears high even though that variable is an important determinant of the variation in the dependent variable. In general, if there is high multicollinearity, we may encounter several problems, including the following:

(i) OLS estimates may not be precise; (ii) the confidence intervals tend to be much wider, which may affect hypothesis testing regarding the regression coefficients; (iii) the test statistics which are important for conducting hypothesis tests tend to be statistically insignificant; and (iv) although the test statistics are statistically insignificant, the overall measure of goodness of fit, R2, can be very high.

In a regression result we may suspect the presence of multicollinearity if (i) we observe a high R2 but few significant t-ratios, or (ii) there are high pairwise correlations (in excess of 0.8) among the regressors.

The solutions for multicollinearity depend on the severity of the multicollinearity, on the availability of other sources of data, on the importance of the factors which are multicollinear and on the purpose for which the model is being estimated.

If multicollinearity affects some of the less important factors (variables), one may exclude these factors from the model. If, on the other hand, multicollinearity has serious effects on the coefficient estimates of important factors, then one may (i) increase the sample size, (ii) introduce additional equations in the model, (iii) drop a variable, or (iv) transform the variables.

Check Your Progress 5.1

1. State the assumptions on which the classical linear regression model is based
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

2. What are the consequences of having autocorrelated disturbances?


________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

3. Explain the concept of heteroscedasticity


________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

4. State the problems associated with high multicollinearity among the explanatory
variables
________________________________________________________________________
________________________________________________________________________

________________________________________________________________________

5.3 Diagnostic Checking Using EViews

Note that whenever a regression model is specified there is uncertainty regarding its appropriateness. Once we estimate our equation, EViews provides tools for evaluating the quality of our specification along a number of dimensions. Note that the results of these tests influence the chosen specification.

In this discussion, an attempt is made to provide some statistical background for conducting the tests. However, many of the descriptions are incomplete; thus, the student is advised to refer to standard econometrics references for further details.

Autocorrelation Test using EViews

EViews tests for the presence of autocorrelation using different approaches. Here we discuss what is known as the Durbin-Watson (DW) test, which is reported together with the regression result.

Recall from the previous section that the DW result is compared with the Durbin-Watson critical (table) value, and a decision either to accept or to reject the null hypothesis of no (positive or negative) autocorrelation is made. Now consider the regression model given by investment as a function of domestic saving, imports and GDP as follows.
INVi = β0 + β1SAVi + β2IMPOi + β3GDPi + Ui

Suppose that annual data for Ethiopia for the period 1953-1995 are used to obtain the following regression result.

Dependent Variable: INV
Method: Least Squares
Date: 11/23/06 Time: 16:16
Sample: 1953 1995
Included observations: 43
Variable Coefficient Std. Error t-Statistic Prob.
SAV 0.410876 0.078253 5.250643 0.0000
IMPO 0.488312 0.037556 13.00236 0.0000
GDP 0.024106 0.012792 1.884443 0.0668
R-squared 0.992759 Mean dependent var 2849.883
Adjusted R-squared 0.992397 S.D. dependent var 3148.438
S.E. of regression 274.5304 Akaike info criterion 14.13522
Sum squared resid 3014679. Schwarz criterion 14.25809
Log likelihood -300.9071 F-statistic 2742.032
Durbin-Watson stat 0.218943 Prob(F-statistic) 0.000000

Note from the result above that the Durbin-Watson statistic is 0.219. As mentioned earlier, the DW test examines the hypothesis H0: no positive or negative autocorrelation. A decision is made by comparing the calculated value (0.219) with the dL and dU table values. From the DW table attached at the end of this material we observe that the computed value is less than dL. Thus, it suggests that there is positive autocorrelation.

Note, however, that if the regressors are very highly collinear, EViews may encounter difficulty in computing the regression estimates. In such cases, EViews will issue an error message saying "Near singular matrix." When you get this error message, you should check whether the regressors are exactly collinear. The regressors are exactly collinear if one regressor can be written as a linear combination of the other regressors. Note that under exact collinearity, the regressor matrix X does not have full column rank and the OLS estimator cannot be computed. However, multicollinearity can be a problem even when collinearity is strong but not exact; the rule of thumb for strong multicollinearity is a correlation coefficient ρ greater than 0.8. To make such an examination, the student is advised to recall from our statistical computation discussion how to compute the correlation between variables; a minimal sketch follows.
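For instance, in Stata the pairwise correlation of two regressors, say x1 and x2 (hypothetical names), can be obtained with:

* report the correlation matrix of the listed variables
correlate x1 x2

A reported correlation above 0.8 in absolute value would, by the rule of thumb above, flag strong multicollinearity.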

Other Diagnostic Tests Performed by EViews

This is to briefly inform the student that EViews provides tests for autocorrelation, heteroskedasticity, and autoregressive conditional heteroskedasticity (ARCH) in the residuals from the estimated equation. The following briefly explains the process of conducting these tests.

Correlograms and Q-statistics

This view displays the autocorrelations and partial autocorrelations of the equation
residuals up to the specified number of lags. To display the correlograms and Q-statistics,
click View then Residual Tests and select Correlogram-Q-statistics on the equation
toolbar. This application is presented in the diagram below.

Figure 5.2 Correlograms and Q-statistics

After selecting Correlogram-Q-statistics, we have to specify the number of lags we wish to use in computing the correlogram. Note that this is done in the Lag Specification dialog box.

Correlograms of Squared Residuals

This view displays the autocorrelations and partial autocorrelations of the squared
residuals up to any specified number of lags. The correlograms of the squared residuals
can be used to check autoregressive conditional heteroscedasticity (ARCH) in the
residuals. The diagram presented earlier shows the steps required to perform this test.

Note that if there is no ARCH in the residuals, the autocorrelations and partial
autocorrelations should be zero at all lags and the Q-statistics should not be significant.

To display the correlograms and Q-statistics of the squared residuals, click View then
choose Residual Tests and then select Correlogram Squared Residuals on the equation
toolbar. Then in the Lag Specification dialog box that opens, specify the number of lags
over which to compute the correlograms.

Histogram and Normality Test


This view displays a histogram and descriptive statistics of the residuals, including the
Jarque-Bera statistic for testing normality. If the residuals are normally distributed, the
histogram should be bell-shaped and the Jarque-Bera statistic should not be significant.

To display the histogram and Jarque-Bera statistic, click View then select Residual Tests
and choose Histogram-Normality. Note that the Jarque-Bera statistic has a chi-square distribution with two degrees of freedom under the null hypothesis of normally distributed errors.

Serial Correlation LM Test

This test is an alternative to the Q-statistics for testing serial correlation. The test belongs to the class of asymptotic (large sample) tests known as Lagrange multiplier (LM) tests. The null hypothesis of the LM test is that there is no serial correlation up to lag order p, where p is a pre-specified integer.
The serial correlation LM test is available for residuals from least squares or two-stage
least squares. To carry out the test, click View and select Residual Tests and then Serial
Correlation LM Test… on the equation toolbar and specify the highest order of the
AR(autoregressive) or MA(moving average) process that might describe the serial
correlation. If the test indicates serial correlation in the residuals, LS standard errors are
invalid and should not be used for inference.

ARCH LM Test

This is a Lagrange multiplier (LM) test for autoregressive conditional heteroskedasticity


(ARCH) in the residuals. The ARCH LM test is available for equations estimated by least
squares, two-stage least squares, and nonlinear least squares. To carry out the test, click
View and choose Residual Tests and then ARCH LM Test… on the equation toolbar and
specify the order of ARCH to be tested.

White's Heteroskedasticity Test


This is a test for heteroskedasticity in the residuals from a least squares regression.
White’s test is a test of the null hypothesis of no heteroskedasticity against
heteroskedasticity of some unknown general form. The test statistic is computed by an
auxiliary regression, where we regress the squared residuals on all possible cross
products of the regressors.

EViews reports two test statistics from the test regression. The F-statistic is an omitted
variable test for the joint significance of all cross products, excluding the constant. It is
presented for comparison purposes.

181
Notice from the earlier diagram that to carry out White's heteroscedasticity test, we first select View, then Residual Tests, and then White Heteroscedasticity. EViews has two
options for the test: cross terms and no cross terms. The cross terms version of the test is
the original version of White’s test that includes all of the cross product terms. However,
with many right-hand side variables in the regression, the number of possible cross
product terms becomes very large so that it may not be practical to include all of them.
The no cross terms option runs the test regression using only squares of the regressors.

Check Your Progress 5.2


Consider the following regression result obtained using EViews
Dependent Variable: Y
Method: Least Squares
Date: 12/27/06 Time: 02:25
Sample: 1974:1 1991:4
Included observations: 72
Variable Coefficient Std. Error t-Statistic Prob.
C -181.0362 9.328736 -19.40630 0.0000
X1 1.200416 0.082592 14.53426 0.0000
X2 576.8731 83.41435 6.915754 0.0000
X3 0.830139 0.799752 1.037997 0.3029
R-squared 0.993192 Mean dependent var 146.9181
Adjusted R-squared 0.992892 S.D. dependent var 46.71829
S.E. of regression 3.938783 Akaike info criterion 5.633573
Sum squared resid 1054.953 Schwarz criterion 5.760055
Log likelihood -198.8086 F-statistic 3306.893
Durbin-Watson stat 0.155984 Prob(F-statistic) 0.000000

Using the Durbin-Watson test, examine the presence of autocorrelation in the error term of the regression model.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
5.4 Diagnostic Checking Using Stata
The Stata program has a number of ways of checking for the presence of errors in the estimated regression model. These include graphical examinations, which can suggest whether a problem exists or not, and a number of more systematic tests, as discussed hereunder.

I. Test for Heteroscedasticity
Using Stata we can test for the presence of a heteroscedasticity problem in the regression model. To do this we can make use of a graphical test as well as a numerical test, as discussed below.
A. Graph residual-versus-fitted plot after regress
This helps to examine graphically whether there is a systematic relationship between the residuals and the fitted values (i.e. the estimated values of the dependent variable). If there exists a systematic relationship between the two, this suggests a heteroscedastic variance of the error term. Note that such a result indicates the importance of re-specifying the regression model.
Suppose the regression model is given by Yi = β0 + β1X1 + β2X2 + Ui. To check for the existence of the above stated problem, we perform the following steps.
First: perform the regression using the command: regress Y X1 X2
Then: perform the graphical analysis using the command: rvfplot

To execute this, consider the following table, which reports the output (Y) measured in tons, the labor input (X1) measured in hours and the capital input (X2) measured in machine hours of 10 firms in the textile industry.

Firms 1 2 3 4 5 6 7 8 9 10
Y 500 440 545 600 510 625 680 720 750 830
X1 1420 1600 1620 1600 1500 1700 1760 1700 1800 1500
X2 390 400 430 410 430 650 700 780 700 600

To graphically test for heteroscedasticity, we first perform the regression and obtain the result as follows (note that the command to do this is regress Y X1 X2).

Source | SS df MS Number of obs = 10


-------------+--------------------------------------------------- F( 2, 7) = 7.17
Model | 93772.6073 2 46886.3036 Prob > F = 0.0202
Residual | 45777.3927 7 6539.62754 R-squared = 0.6720
-------------+--------------------------------------------------- Adj R-squared = 0.5782

Total | 139550.00 9 15505.5556 Root MSE = 80.868

-----------------------------------------------------------------------------------------------------------
y| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+---------------------------------------------------------------------------------------------
x1 | -.3390981 .3194092 -1.06 0.324 -1.094381 .4161846
x2 | .8439037 .2574407 3.28 0.014 .2351531 1.452654
_cons | 706.0358 427.3086 1.65 0.142 -304.3886 1716.46
------------------------------------------------------------------------------------------------------------

The interpretation of the results is left to the student. To conduct the graphical test, we
write the command: rvfplot. The result is given as follows.

[Residual-versus-fitted plot: residuals (ranging from −67.81 to 126.27) plotted against the fitted values of Y (ranging from 501.04 to 787.81)]

Figure 5.3 Graphical test: residuals against the fitted values of Y
Note that the above graph shows the relationship between the residuals of the regression and the fitted values of Y (called estimated Y or Y-hat). As we can see from the graph, there is no systematic relationship (positive or negative) between the two. Thus, the result suggests that the residual variance does not exhibit heteroscedasticity.

B. Numerical Test for Heteroscedasticity
Stata can very simply test for the presence of heteroscedastic variance of the error term in the regression model. To perform this test we first estimate the regression model whose error term we want to test for unequal variance. Then we write the following command.
• hettest
Note that hettest performs the Cook and Weisberg (1983) test for heteroscedasticity. After the command hettest, if a variable list is not specified, the fitted values are used for the analysis. If, however, a variable list is specified, the variables specified are used for the computation.
Note that even though hettest was originally written following a 1983 article in the journal "Biometrika" by Cook and Weisberg, the same test was derived by Breusch and Pagan in the journal "Econometrica" (1979). In fact, in econometrics, the test performed by hettest is known as the Breusch-Pagan test for heteroscedasticity. Thus students are required to recall the approach employed in the Breusch and Pagan test for heteroscedasticity.
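As a sketch, the full command sequence for the example that follows is simply the regression followed by the test (as an aside, in more recent versions of Stata the same test is invoked as estat hettest):

    regress y x1 x2
    hettest
    hettest x1 x2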
Consider the data on the 10 firms of the textile industry used earlier. After estimation of the model Yi = β0 + β1X1 + β2X2 + Ui (which gives the result obtained earlier), we perform the test of heteroscedasticity as follows. Suppose we write the command hettest; then we get the following result.
hettest

Cook-Weisberg test for heteroskedasticity using fitted values of y
Ho: Constant variance
chi2(1) = 0.17
Prob > chi2 = 0.6808
Notice that our command is not followed by the list of explanatory variables used in the regression. In this case, the fitted values are used for the analysis. The null hypothesis of the test is constant variance, or homoscedastic variance. Note that the test provides a chi-square statistic together with the associated probability. As usual, the decision is to reject the null hypothesis if the chi-square probability is less than the chosen level of significance. In our example, we cannot reject the null hypothesis even at the 10% (or 0.1) significance level. Thus, the result points out that the regression model has a homoscedastic error variance. Recall that the graphical analysis also suggested homoscedasticity.
However, after the command hettest, if a variable list is specified, the variables specified are used for the computation. In this case we obtain the result presented below.
hettest x1 x2

Cook-Weisberg test for heteroskedasticity using variables specified
Ho: Constant variance
chi2(2) = 1.26
Prob > chi2 = 0.5338
Notice that the Cook-Weisberg test for heteroscedasticity in this case used the variables X1 and X2 in its computation. Nevertheless, we cannot reject the null hypothesis even at the 10% (or 0.1) significance level, since the probability of the chi-square statistic (0.5338) is by far greater than that level. Therefore, the result points out that the regression model has a homoscedastic error variance.
C. The Ramsey RESET test for omitted variables
In regression analysis one of the tests concerns the proper specification of the model. In this regard, Stata provides a test for omitted variables developed by Ramsey, known as the RESET test. To execute this test we need to first perform the regression estimation. Then, to check whether the model is correctly specified or not, we write the following command.
• ovtest
Note that ovtest performs Ramsey's (1969) regression specification error test (RESET) for omitted variables. Recall that this test amounts to fitting y = xb + zt + u and then testing t = 0. In the command, if rhs is not specified after ovtest, powers of the fitted values are used for z; otherwise, powers of the individual elements of x are used. That is, if rhs is specified, the powers of the right-hand-side (explanatory) variables are used in the test rather than powers of the fitted values.
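A minimal sketch of the two variants, both issued after the regression has been estimated:

    regress y x1 x2
    ovtest
    ovtest, rhs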
For example, consider the regression model given by Yi = β0 + β1X1 + β2X2 + Ui. To check whether there are omitted variables in the model, we write the following command after performing the regression estimation: ovtest, rhs. The result is presented as follows.
ovtest, rhs
Ramsey RESET test using powers of the independent variables
Ho: model has no omitted variables
F(6, 1) = 33.29
Prob > F = 0.1319
Note that the null hypothesis of the test is that the model has no omitted variables. As we can see from the probability of the F test, we cannot reject H0 even at the 10% (or 0.1) significance level. Therefore, the result points out that there is no statistical evidence to suggest that the model has omitted variables.
D. Test for Normality of the Error Term
Recall the assumptions underlying the classical linear regression model. One of these
assumptions is the normal distribution of the error terms in a given regression model.
Stata can check whether this assumption holds using the following approach.
Suppose the regression model is given by: Yi = β0 + β1X1 + β2X2 +Ui. To perform the
normality test of the error term we first perform the estimation of the regression model.
Then we construct the predicted value of the error term using the following command.
• predict resid, residuals
After we generate the residuals of the estimated model, we can test the normality of the error term using a numerical test or a graphical test, as discussed below.
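In outline, the full sequence of commands for our example looks like this (sktest and qnorm are the two test commands discussed below):

    regress y x1 x2
    predict resid, residuals
    sktest resid
    qnorm resid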
Skewness/Kurtosis Test for Normality
This test determines whether the distribution of the residuals is statistically significantly different from a theoretical normal distribution, based on the skewness and kurtosis of the residuals. This can be done by using the following command.
• sktest resid
Then, by comparing the probability value with the selected level of significance, we arrive at our decision. We can examine this using the data on Y, X1 and X2 used earlier. After performing the regression estimation, when we write the command sktest resid we obtain the following result.
sktest resid

Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
Variable |  Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)    Prob>chi2
-------------+-------------------------------------------------------
resid |      0.244          0.453         2.28         0.3197
Note from the above result that since the (joint) probability value exceeds even the 0.1 (or 10%) level, the distribution of the residuals of the estimated regression model is not significantly different from a normal distribution.
Graphical Test for Normality
Alternatively, a graphical approach to testing the normality of the residuals can be performed based on the quantile-quantile (Q-Q) plot of the residuals against the normal distribution. The command to execute this is given by:
• qnorm resid
This generates a graph. Using the above model as an example, we get the following result.
[Figure 5.4 here: a quantile plot of the residuals (vertical axis, from -95.22 to 126.27) against the inverse normal (horizontal axis, from -95.22 to 95.22)]
Figure 5.4 Residual - Normal Quantile Plot
In the above figure, the straight line represents the theoretical normal distribution, while the dots represent the residuals of the model stated earlier. The closer the dots cleave to the straight line, the more normal the distribution is said to be.
E. Calculate the Durbin-Watson d statistic after regress
The other diagnostic check that can be performed by Stata is the Durbin-Watson d statistic test. Recall that this test examines the presence of autocorrelation between the disturbance terms of the model. The application of this test first requires performing the regression estimation and then testing the result using the following command.
• dwstat
Note that dwstat computes the Durbin-Watson d statistic to test for first-order serial correlation in the disturbances after a regression is performed. Note that Stata conducts the Durbin-Watson test only for time-series data; the data must first be declared as a time series (for example with the tsset command) before the test is performed.
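As a sketch, for a dataset with a time variable called year (a hypothetical name), the sequence would be as follows; in more recent versions of Stata the same statistic is obtained with estat dwatson:

    tsset year
    regress y x1 x2
    dwstat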
Check Your Progress 5.3

Consider the following hypothetical observations on Y, X1 and X2.
Y 10.5 12.5 13 14 15.5 13 17 16 20 22
X1 100 70 60 80 150 120 110 90 130 140
X2 20 19 15 16 18 20 15 12 10 16
Suppose the above data are used to estimate the following regression equation:

LogYi = Logβ0 + β1LogX1i + β2LogX2i + LogUi

Using Stata, attempt the following questions.
1. Test for the presence of heteroscedastic variance of the error term using the explanatory variables.
2. Test for omitted variables using the Ramsey RESET test.
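As a hint, one possible command sequence is sketched below; it assumes the data have been entered as variables y, x1 and x2, with the new log variable names being our own choice:

    gen ly = ln(y)
    gen lx1 = ln(x1)
    gen lx2 = ln(x2)
    regress ly lx1 lx2
    hettest lx1 lx2
    ovtest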
5.5 Summary
In this unit we examined the concept and approaches of diagnostic checking. Recall that
we said the task of the econometrician is not limited to performing regression estimation.
Rather several tests that ascertain the reliability of the model must be conducted.
Accordingly, we used EViews and Stata in the assessment of the reliability of the
estimates of the parameters. Recall that after the estimation of the parameters with the
method of ordinary least squares, we should assess the reliability of the estimates of the
parameters based on three types of criteria. This unit has focused on the econometric criteria.
The assumption of no autocorrelation implies that the covariance of Ui and Uj is equal to zero. If this assumption is violated, the disturbances are said to be autocorrelated. The most celebrated test for detecting serial correlation is popularly known as the Durbin-Watson d-statistic. The Durbin-Watson (DW) test tests the hypothesis H0: ρ = 0 (implying that the error terms are not autocorrelated with a first-order scheme) against the alternative.
Homoscedasticity is one of the important assumptions of the classical linear regression model. It says that the population disturbance terms Ui all have the same variance: the conditional variance of the dependent variable, given the values of the explanatory variables, remains the same regardless of the values taken by the variable X. But when the conditional variance of the dependent variable increases as the value of the explanatory variable increases, we say there is heteroscedasticity. Note that the Breusch-Pagan test is one of the formal methods of detecting heteroscedasticity.
EViews provides tools for evaluating the quality of the specified model along a number of dimensions. EViews tests for the presence of autocorrelation using the Durbin-Watson (DW) test. Moreover, we have noted that EViews provides tests for autocorrelation, heteroscedasticity, and autoregressive conditional heteroscedasticity (ARCH) in the residuals from the estimated equation.
The Stata program also has a number of checks for the presence of errors in the estimated regression model. These include graphical examinations that can suggest whether a problem exists, and a number of other systematic tests, among them tests for the presence of heteroscedastic variance of the error term in the regression model and for the proper specification of the model.
5.6 Answer to Check Your Progress
Answer to Check Your Progress 5.1
1. Assumption 1: Randomness of the random variable U; its value is unpredictable and depends on chance.
Assumption 2: Zero mean of the random variable U.
Assumption 3: The variance of each Ui is the same for all the Xi values; this is known as the assumption of homoscedasticity.
Assumption 4: The values of each Ui are normally distributed.
Assumption 5: The values of Ui corresponding to Xi are independent of the values of any other Uj corresponding to Xj; this is called the assumption of non-autocorrelation or serial independence of the U's.
Assumption 6: Every disturbance term Ui is independent of the explanatory variables.
Assumption 7: No errors of measurement in the X's.
Assumption 8: The explanatory variables are not perfectly linearly correlated; this is called the assumption of no perfect multicollinearity between the X's.
Assumption 9: The model has no specification error; that is, all the important explanatory variables appear explicitly in the function and the mathematical form is correctly specified.
2. The consequences are:
• the previous values of the disturbances have some information to convey about the current disturbances;
• the variance of the random term U may be seriously underestimated;
• predictions based on the ordinary least squares estimates will be inefficient.
3. The concept of heteroscedasticity is that the conditional variance of the dependent variable changes as the value of the explanatory variable changes.
4. The problems associated with a high degree of multicollinearity are:
• OLS estimation may not be precise;
• the confidence intervals tend to be much wider, which may affect hypothesis testing regarding the regression coefficients;
• the test statistics that are important for conducting hypothesis tests tend to be statistically insignificant;
• although the test statistics are statistically insignificant, the overall measure of goodness of fit, R2, can be very high.
Answer to Check Your Progress 5.2

There is positive autocorrelation between the disturbance terms.
Answer to Check Your Progress 5.3

1. Accept H0, since the P-value = 0.3417. Thus, there is no evidence to suggest the presence of heteroscedasticity.
2. Since the P-value = 0.6491, the test suggests that the model has no omitted variables.
5.7 Model Examination
Consider the following data on GDP, INV (investment), SAV (domestic saving) and EXPO (export) of Ethiopia for the period 1967 to 1983 E.C., measured in millions of birr.
Year GDP INV SAV Expo
1967 6427.78 859.576 633.688 638.531
1968 6874.17 755.484 631.362 710.507
1969 7872.84 831.699 539.167 785.007
1970 8308.34 808.716 321.569 809.123
1971 9286.52 1036.33 521.99 881.099
1972 9865.43 1266.31 612.063 1130.4
1973 10079 1366.84 763.71 1072.35
1974 10635.8 1456.64 630.63 1007.17
1975 11775.4 1435.7 644.66 1064.85
1976 10987.6 1850.69 890.55 1164.87
1977 13026.5 1394.02 368.29 1057.07
1978 13575.2 2225.63 1171.05 1271.73
1979 14391 2244.65 1093.06 1186.84
1980 14970.5 3060.51 1867.45 1205.37
1981 15742.1 2269.23 1399.77 1422.8
1982 16825.7 2100.49 1335.22 1295.04
1983 19195.3 1996.38 660.39 1062.21
Suppose the following regression model is employed using the above data:

LogGDPi = Logβ0 + β1LogSAVi + β2LogINVi + β3LogEXPOi + LogUi

1. Using EViews, test for the presence of autocorrelation in the disturbance term.
2. Using correlograms and Q-statistics, comment on the autocorrelation of the error term.
3. Using Stata, test for the presence of homoscedastic variance of the error term.
4. Test for omitted variables in the regression model using the Ramsey RESET test.
Unit Six: Introduction to SPSS
6.0 Objective
6.1 Introduction
6.2 Components of SPSS
6.3 Data Entry, Operation and Transformation
6.4 Statistical Estimation and Graphing
6.5 Econometric Estimation
6.6 Summary
6.0 Objective
The objective of this unit is to familiarize the student with the basic approach of SPSS. After completing this unit the student will be able to:
• Understand the functions of the SPSS window and its components
• Understand the process of data entry, operations, graphics and estimations
6.1 Introduction
Recall from the unit one discussion that EViews and Stata are not the only packages used in data analysis; there are a number of other software packages designed to perform several kinds of data analysis and modeling. Accordingly, this unit will examine briefly the application of SPSS in this regard. SPSS is a comprehensive and flexible statistical analysis and data management system. SPSS can take data from almost any type of file and use them to generate tabulated reports, charts and plots of distributions and trends, and descriptive statistics, and to conduct complex statistical analyses. SPSS is available on several platforms. This unit will make introductory remarks on the more general and important issues.
6.2 Components of SPSS
SPSS for Windows brings the full power of the mainframe version of SPSS to the personal computer environment. The following discussion briefly explains the components of the SPSS window.
A. Windows in SPSS
There are a number of different types of windows that you will see in SPSS. They are described as follows.
Data Editor window
This window displays the contents of the data file. You may create new data files, or
modify existing ones with the Data Editor. The Data Editor window opens automatically
when you start an SPSS session.
Viewer window
The Viewer window displays the statistical results, tables, and charts from the analysis
you performed (e.g., descriptive statistics, correlations, plots, charts). A Viewer window
opens automatically when you run a procedure that generates output. In the Viewer
windows, you can edit, move, delete and copy your results in a Microsoft Explorer-like
environment.
Draft Viewer window
You can display output as simple text (instead of interactive pivot tables) in the Draft
Viewer.
Pivot Table Editor window
Output displayed in pivot tables can be modified in many ways with the Pivot Table
Editor. You can edit text, swap data in rows and columns, add color, create
multidimensional tables, and selectively hide and show results.
Chart Editor window
You can modify and save high-resolution charts and plots in chart windows. You can
change the colors, select different type fonts or sizes, switch the horizontal and vertical
axes, rotate 3-D scatter plots, and even change the chart type.
Text Output Editor window
Text output not displayed in pivot tables can be modified with the Text Output Editor.
You can edit the output and change font characteristics (type, style, color, size).
Syntax Editor window
You can paste your dialog box choices into a Syntax Editor window, where your
selections appear in the form of command syntax. You can then edit the command syntax
to utilize special features of SPSS not available through dialog boxes. If you are familiar
with SPSS software under other operating systems (e.g., Unix), you can open up a Syntax
Editor window and enter SPSS commands exactly as you did under those platforms and
execute the job. You can save these commands in a file for use in subsequent SPSS
sessions.
Script Editor window
Scripting and OLE automation allow you to customize and automate many tasks in SPSS.
Use the Script Editor to create and modify basic scripts.

If you have more than one open Viewer window, output is routed to the designated
Viewer window. If you have more than one open Syntax Editor window, command
syntax is pasted into the designated Syntax Editor window. (Paste feature will be
explained later.) The designated windows are indicated by an exclamation point (!) in the
status bar at the bottom of each SPSS window. You can change the designated window at
any time by selecting it (making it active) and clicking the highlighted pushbutton on the
toolbar. An active window is the currently selected window which appears in the
foreground. An active window may not be a designated window until you instruct SPSS
to make it a designated window (by clicking the icon on the toolbar).
B. Menus in SPSS for Windows
Many of the tasks you may want to perform with SPSS start with menu selections. Each window in SPSS has its own menu bar with menu selections appropriate for that window type, together with an associated toolbar. Note that most menus are common to all windows, while some are found only in certain types of windows.
I. Common menus
File
Use the File menu to create a new SPSS system file, open an existing system file, read in
spreadsheet or database files created by other software programs (you can read data into
SPSS from any database format for which you have an ODBC [Open Database
Connectivity] driver), read in an external ASCII data file from the Data Editor; create a
command file, retrieve an already created SPSS command file into the Syntax Editor;
open, save, and print output files from the Viewer and Pivot Table Editor; and save chart
templates and export charts in external formats in the Chart Editor, etc.
Edit
Use the Edit menu to cut, copy, and paste data values from the Data Editor; modify or
copy text from the Viewer or Syntax Editor; copy charts for pasting into other
applications from the Chart Editor, etc.
View
Use the View menu to turn toolbars and the status bar on and off, and turn grid lines on
and off from all window types; and control the display of value labels and data values in
the Data Editor.
Analyze
This menu is selected for various statistical procedures such as crosstabulation, analysis
of variance, correlation, linear regression, and factor analysis.
Graphs
Use the Graphs menu to create bar charts, pie charts, histograms, scatterplots, and other
full-color, high-resolution graphs. Some statistical procedures also generate graphs. All
graphs can be customized with the Chart Editor.
Utilities
Use the Utilities menu to display information about variables in the working data file and
control the list of variables from all window types; change the designated Viewer and
Syntax Editor, etc.
Window
Use the Window menu to switch between SPSS windows or to minimize all open SPSS
windows.
Help
This menu opens a standard Microsoft Help window containing information on how to
use the many features of SPSS. Context-sensitive help is available through the dialog
boxes.
II. Data Editor specific menus
Data
Use the Data menu to make global changes to SPSS data files, such as transposing
variables and cases, or creating subsets of cases for analysis, and merging files. These
changes are only temporary and do not affect the permanent file unless you save the file
with the changes.
Transform
Use the Transform menu to make changes to selected variables in the data file and to
compute new variables based on the values of existing ones. These changes are
temporary and do not affect the permanent file unless you save the file with changes.

III. Draft View specific menus
Insert
Use the Insert menu to change the page breaks.
Format
Use the Format menu to change font characteristics, underline, and bold.

IV. Pivot Table Editor specific menus
Insert
Use the Insert menu to insert titles, captions, and footnotes; and to create table breaks.
Pivot
Use the Pivot menu to perform basic pivoting tasks, to turn pivoting trays on and off, and
to go to specific layers in a multidimensional pivot table.
Format
Use the Format menu to modify table and cell properties; to apply and change TableLook
formats; and to change font characteristics, footnote markers, and the width of data cells.

V. Chart Editor specific menus
Gallery
Use the Gallery menu to change the chart type.
Chart
Use the Chart menu to modify layout and labeling characteristics of your chart.
Series
Use the Series menu to select data series and categories to display or omit.
Format
Use the Format menu to select fill patterns, colors, line styles, bar style, bar label styles,
interpolation type, and text fonts and sizes. You can also swap axes of plots, explode one
or more slices of a pie chart, change the treatment of missing values in lines, and rotate 3-
D scatterplots.
VI. Text Output Editor specific menu
Insert
Use the Insert menu to change the page breaks.
VII. Syntax Editor specific menu
Run
Use the Run menu to run the selected commands.
VIII. Script Editor specific menu
Debug
Use the Debug menu to step through your code, executing one line or subroutine at a time and viewing the result. You can also insert a break point in the script to pause the execution at the line that contains the break point.
C. Toolbars in SPSS for Windows
Each SPSS window has its own toolbar that provides quick and easy access to common tasks. Tool Tips provide a brief description of each tool when you put the mouse pointer on the tool. For example, the toolbar of the Syntax Editor window shows such a tool tip when the mouse pointer is placed on the Run icon.
D. Status Bar in SPSS for Windows
A status bar at the bottom of the SPSS application window indicates the current status of
the SPSS processor. If the processor is running a command, it displays the command
name and a case counter indicating the current case number being processed. When you
first begin an SPSS session, the status bar displays the message Starting SPSS Processor.
When SPSS is ready, the message changes to SPSS Processor is ready. The status bar
also provides information such as command status, filter status, weight status, and split file status. A status bar in a Viewer window, for example, may show that the current Viewer window is the designated output window and that SPSS is ready to run.
E. Options in SPSS for Windows
Note that we can personalize our SPSS session by altering the default Options settings.
• Select Edit/Options...
• Click the tabs for the settings you want to change.
• Change the settings.
• Click OK or Apply.
For example, within variable list boxes in dialogs, you have the option to display either the variable name or the entire variable label (up to 256 characters). Then,
• Click General from the Options dialog box
• Click either Display labels or Display names under Variable Lists
• Click OK
6.3 Data Entry, Operation and Transformation
i) Organizing the Data for Analysis
Suppose you have three test scores collected from a class of 10 students (5 males, and 5
females) during a semester. Each student was assigned an identification number. The
information for each student you have is an identification number, gender of each
student, and scores for test one, test two, and test three (the full data set is displayed
toward the end of this section for you to view). Your first task is to present the data in a
form acceptable to SPSS for processing.

SPSS uses data organized in rows and columns. Cases are represented in rows and
variables are represented in columns. A case contains information for one unit of analysis
(e.g., a person, an animal, a machine). Variables are information collected for each case, such as name, score, age, income, or educational level.
In SPSS, variables are named with eight or fewer characters. They must begin with a letter, although the remaining characters can be any letter, any digit, a period, or the symbols @, #, _, or $. Variable names cannot end with a period, and variable names that end with an underscore should be avoided. Blanks and special characters such as &, !, ?, ', and * cannot be used in a variable name. Note that variable names are not case sensitive. Each variable name must be unique; duplication is not allowed.
Most variables are generally numeric (e.g., 12, 93.23) or character/string/alphanumeric (e.g., F, f, john). The maximum width for numeric variables is 40 characters; the maximum number of decimal positions is 16. String variables with a defined width of eight or fewer
characters are short strings, more than eight characters (up to 255 characters) are long
strings. Short string variables can be used in many SPSS procedures. You may leave a
blank for any missing numeric values or enter a user-defined missing value (e.g., 9, 999).
However, for string values a blank is considered a valid value. You may choose to enter a
user-defined missing (e.g., x, xxx, na) value for missing short string variables, but long
string variables cannot have user-missing values.
Following the conventions above, let us assign names for the variables in our data set: id, sex, test1, test2, and test3. Once the variables are named according to SPSS conventions, it is good practice to prepare a code book with details of the data layout (the variable names, their column locations, and their formats). Note that this step is to present your data in an organized fashion; it is not mandatory for data analysis. A code book becomes especially handy when dealing with a large number of variables.
ii) Data Entry
The next issue is entering your data into the computer. There are several options. You
may create a data file using one of your favorite text editors, or word processing packages
(e.g., Word Perfect, MS-Word). Files created using word processing software should be
saved in text format before trying to read them into an SPSS session. You may enter your
data into a spreadsheet (e.g., Lotus 123, Excel, dBASE) and read it directly into SPSS for
Windows. Finally, you may enter the data directly into the spreadsheet-like Data Editor
of SPSS for Windows. In this document we are going to examine two of the above data
entry methods: using a text editor/word processor, and using the Data Editor of SPSS for
Windows. This is explained as follows.
Using an Editor/Word Processor to Enter Data
Let us first look into the steps for using a text editor or word processor for entering data.
Note that if you have a data set with a limited number of variables, you may want to use
the SPSS Data Editor to enter your data. However, this example is for illustration
purposes. Open up your editor session, or word processing session, and enter the variable
values into appropriate columns as outlined in the code book. If you are using a word
processor, make sure to save your data in text format. Whichever style (format) you
choose, as long as you convey the format correctly to SPSS, it should not have any
impact on the analysis.
Creating a Command file to read in your data
In many instances, you may have an external ASCII data file made available to you for
analysis. In such a situation, you do not have to enter your data again into the Data
Editor. You can direct SPSS to read the file from the SPSS Syntax Editor window.

Suppose you want to read a given file into SPSS from a Syntax Editor window and create a system file. Creating a command file is a faster way to define your variables, especially if
you have a large number of variables. You may create a command file using your favorite
editor or word processor and then read it into a Syntax Editor window or open a Syntax
Editor window and type in the command lines.
To read your already created command file into a Syntax Editor window
• Select File/Open/Syntax...
• Choose the syntax file (with .sps extension) you want to read and click Open
In the following example we are opening a new Syntax Editor window to enter the
following command lines.
• Select File/New/Syntax

When the Syntax Editor window appears, type in the appropriate command lines. To execute them:

• Select Run/Selection. Alternatively, you can click the Run icon on the toolbar.
The command file will read the specified variable values from the data file and create a system file, sample1.sav. Make sure you specify the pathnames appropriately, indicating the location of the external data file and where the newly created file is to be written. However, you do not have to save a system file to do the analysis; this means the last line is optional for data analysis. Every time you run the above lines, SPSS creates an active file stored in the computer's memory. For large data sets, however, it will save processing time if you save it as a system file and access it for analysis.
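As an illustration, the command lines could take roughly the following form (a sketch only: the file paths and the FREE format are assumptions that must match your own data file):

    DATA LIST FILE='a:\grade.dat' FREE / id sex (A1) test1 test2 test3.
    SAVE OUTFILE='a:\sample1.sav'.
    EXECUTE.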
Using Text Import Wizard to Read Text Data
Using Text Import Wizard is another way to direct SPSS to read an external ASCII data
file.
Suppose you want to read the file, grade.dat, into SPSS from Text Import Wizard.
• Select File/Read Text Data
• Click Text (*.txt) for the file type from the Open File dialog box, choose the data file grade.dat on your (A:) drive and click Open
• The Text Import Wizard opens; follow Step 1 to Step 6 in the wizard to specify how the data should be read.
The data file is read into SPSS. We can then save it as SAMPLE1.SAV.
Using the SPSS Data Editor for entering data
Suppose you want to use the SPSS for Windows features for data entry. In that case, you
enter data directly into the SPSS spreadsheet-like Data Editor. This is convenient if you
have only a small number of variables. The first step is to enter the data into the Data
Editor window by opening an SPSS for Windows session. You will define your variables,
variable type (e.g., numeric, string), number of decimal places, and any other necessary
attributes while you are entering the data. In this mode of data entry, you must define
each variable in the Data Editor. You cannot define a group of variables (e.g., Q1 to Q10)
using the Data Editor. To define a group of variables, without individually specifying
them, you would use the Syntax window.
Let us start an SPSS for Windows session to enter the above data set. If you are using
your own PC, start Windows and launch SPSS. If you are using a PC in a UITS Student
Technology Center:

• Log on to an available workstation
• Click the Start button
• Click and drag Programs -> Statistics and Math -> SPSS for Windows -> SPSS 11.5 for Windows.
This opens the SPSS Data Editor window (titled Untitled). The Data Editor window
contains the menu bar, which you use to open files, choose statistical procedures, create
graphs, etc. When you start an SPSS session, the Data Editor window always opens first.

You are ready to enter your data once the Data Editor window appears. The first step is to
enter the variable names that will appear as the top row of the data file. When you start
the session, the top row of the Data Editor window contains a dimmed var as the title of
every column, indicating that no data are present. In our sample data set, discussed above,
there are five variables named earlier as id, sex, test1, test2, and test3. Let us now enter
these variable names into the Data Editor.
To define the variables, click on the Variable View tab at the lower left corner of the Data Editor window and:

• Type in the variable name, id, at the first row under the column Name.
• Press the Tab key to fill in the variable's attributes with default settings.
SPSS considers all variables as numeric variables by default. Since id is a numeric variable you do not have to redefine the variable type for id. However, you may want to change the current format for decimal places.

• Enter 0 for Decimals.
Now let us define the second variable, sex.

• Type in the variable name, sex, at the second row under the column Name.
• Press the Tab key to fill in the variable's attributes with default settings.
• To modify the variable type, click on the icon in the Type column.
• Select String by clicking on the circle to the left.
Define the remaining three numeric variables, test1, test2, and test3, the same way the
variable id was defined.

Click on the Data View tab. Now enter the data, pressing [Tab] or the right arrow key after each entry. After entering the last variable value for case number one, use the arrow key to move the cursor to the beginning of the next line. Continue the process until all the data are entered.
iii) Saving Your SPSS Data
After you have entered/read the data into the Data Editor, save it onto the diskette. Those
who are working from personally owned computers might want to save the file to the
hard disk.
• Select Save... or Save As... from the File menu. A dialog box appears.
• In the box below File Name, type a:\sample1.sav. You can use a longer file name; for example, a:\first sample of data entry is a legitimate file name.
• Click OK
The data will be saved as an SPSS format file which is readable only by SPSS for
Windows. Note that the data file, grade.dat, you saved earlier and the file, sample1.sav,
you saved now are in different formats.

Even after saving the data file, the data will still be displayed on your screen. If not,
select sample1-SPSS Data Editor from the Window menu.
iv) Generating a New Variable
Before computing the descriptive statistics, we want to calculate the mean score from the
three tests for each student. To compute the mean score:
• Select Compute... from the Transform menu. A dialog box appears.
• In the box below Target Variable:, type in average as the variable name you want to assign to the mean score.
• Move the pointer to the box titled Numeric Expression: and type: mean(test1, test2, test3)
• Click OK
A new column titled average will be displayed in the Data Editor window with the values
of the mean score for each case. The number of decimal places in a newly created
variable can be tailored by selecting Edit/Options/Data/Display format for new numeric
variables prior to creating new variables. This display format setting affects the formats
of all new subsequent numeric variables.
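For reference, the equivalent command syntax, roughly what the Paste button would generate, is (the variable name average is our own choice):

    COMPUTE average = MEAN(test1, test2, test3).
    EXECUTE.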
6.4 Statistical Estimation and Graphing
Suppose that you have the data set, sample1.sav, still displayed on your screen. If not,
select SPSS Data Editor - sample1 from the Window menu. The next step is to run some
basic statistical analysis with the data you entered. The commands you use to perform
statistical analysis are developed by simply pointing and clicking the mouse to
appropriate menu options. This frees you from typing in your command lines.
However, you may paste the command selections you made to a Syntax Editor window.
The command lines you paste to the Syntax Editor window may be edited and used for
subsequent analysis, or saved for later use. Use the Paste pushbutton to paste your dialog
box selections into a Syntax Editor window. If you don't have an open Syntax Editor
window, one opens automatically the first time you paste from a dialog box. Click the
Paste button only if you want to view the command lines you generated. Once you click
the Paste pushbutton the dialog selections are pasted to the Syntax Editor window, and
this window becomes active. To execute the pasted command lines, highlight them and
click run. You can always get back to the Data Editor window by selecting sample1-
SPSS Data Editor from the Window menu.
a) Frequencies
To run the FREQUENCIES procedure:
• Select Descriptive Statistics from the Analyze menu
• Choose Frequencies...
• A dialog box appears. Names of all the variables in the data set appear on the left side of the dialog box.
• Select the variable sex from the list. It is highlighted.
• Click the arrow button to the right of the selected variable.
Now the selected variable appears in a box on the right and disappears from the left box.
Note that when a variable is highlighted in the left box, the arrow button is pointed right
for you to complete the selection. When a variable is highlighted in the right box, the
arrow button is pointed left to enable you to deselect a variable (by clicking the button) if
necessary. If you need additional statistics besides the frequency count, click the
Statistics... button at the bottom of the screen. When the Statistics... dialog box appears,
make appropriate selections and click Continue. In this instance, we are interested only in
frequency counts.
• Click OK and the output appears on the Viewer screen.
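For reference, the command syntax behind this dialog looks roughly as follows (a sketch; the exact text produced by the Paste button can vary by SPSS version):

    FREQUENCIES VARIABLES=sex
      /ORDER=ANALYSIS.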
b) Descriptive
Our next task is to run the DESCRIPTIVES procedure on the four continuous variables in
the data set.
• Select Descriptive Statistics from the Analyze menu
• Choose Descriptives...
A dialog box appears. Names of all the numeric variables in the data set appear on the left
side of the dialog box.
• Click the variable average and click the arrow button to the right of the selected variable
• Do the same thing for the variables test1 through test3
Now the selected variables appear in the box on the right and disappear from the box on
the left.

The mean, standard deviation, minimum, and maximum are displayed by default. The
variables are displayed, by default, in the order in which you selected them. Click
Options... for other statistics and display order.
• Click OK
Means
Suppose you want to obtain the above results for males and females separately. The
MEANS procedure displays means, standard deviations, and group counts for dependent
variables based on grouping variables. In our data set sex is the grouping variable and
test1, test2, test3, and average are the dependent variables.
To run the Means procedure:
• Select Analyze/Compare Means/Means...
• Select test1, test2, test3, and average as the dependent variables
• Select sex as the independent variable
• Click Options...
• Select Mean, Number of cases, and Standard Deviation. Normally these options are selected by default. If any other options are selected, deselect them by clicking them.
• Click Continue
• Click OK. The output will be displayed on the Viewer screen.
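The pasted syntax equivalent of these selections would look roughly like this sketch:

    MEANS TABLES=test1 test2 test3 average BY sex
      /CELLS=MEAN COUNT STDDEV.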
There may be other situations in which you want to select a specific category of cases
from a grouping variable (e.g., ethnic background, socio-economic status, education). To
do so, choose Data/Select Cases... to select the cases you want and do the analysis (e.g.,
from the grouping variable educate, select cases without a college degree). However,
make sure you reset your data if you want to include all the cases for subsequent data
analysis. If not, only the selected cases will appear in subsequent analysis. To reset your
data choose Data/Select Cases.../All Cases, and click OK.
c) SPSS Output
When you run a procedure in SPSS, the results are displayed in the Viewer window in the
order in which the procedures were run. In this window, you can easily navigate to
whichever part of output you want to see. You can also manipulate the output and create
a document that contains precisely the output you want, arranged and formatted
appropriately. You can use the Viewer to:
• Browse output results, or show or hide selected tables and charts
• Change the display order of output by moving selected items
• Access the Pivot Table Editor, Text Output Editor, or Chart Editor for modifying output
• Move items between SPSS and other applications
The Viewer is divided into two panes. The left pane contains an outline view of the
output contents. The right pane contains statistical tables, charts, and text output. You can
use the scroll bars to browse the results, or you can click an item in the outline to go
directly to the corresponding table or chart.
Suppose you want to copy the Descriptives table into another Windows application, such
as a word processing program or a spreadsheet.
• Click the Descriptives table
• Select Edit/Copy
• Switch to the target application
• From the menus in the target application you can choose either Edit/Paste or Edit/Paste Special...
• If you choose Edit/Paste Special..., select the type of object you want to paste. Edit/Paste Special... allows you to paste the SPSS output as an embedded object into the target application. The pasted object can be activated in place by double-clicking and then edited as if in SPSS.
Manipulating Pivot Tables
Much of the output in SPSS is presented in tables that can be pivoted interactively. You
can rearrange the rows, columns, and layers. To edit a pivot table, double-click the pivot
table and this activates the Pivot Table Editor. Or click the right mouse button on the
pivot table and from the context menu, choose SPSS Pivot Table Object/Open and the
pivot table will be ready to edit in its own separate Pivot Table Editor window. The
second feature is especially useful for viewing and editing a wide and long table that
otherwise cannot be viewed at a full scale.
Printing the Output
Once you are satisfied with your analysis you may want to obtain a hard copy of the output. You may print the entire output in the Viewer window, or delete the sections you do not want before you print. Or you can save the output to a diskette or hard drive and print it later. In this case, let us print the entire output which is in the Viewer window. It is assumed that there is a printer attached to your PC or that you are working from a Student Technology Center. Make sure that you are at the Viewer window by selecting Output1-SPSS Viewer from the Window menu. If you have multiple Viewer windows open, select the output you want to print:
• Select Edit/Options...
• Click Viewer from the SPSS Options dialog box
• Click Infinite for the Length under the Text Output Page Size parameter
• Click OK
• Click File/Print
• Click OK
The contents of the output window will be directed to the printer. To save paper, choose the Infinite option for the Length. You can also control the page width by changing Width. For some procedures, however, some statistics are only displayed in wide format.
d) Further Data Analysis
So far, we've used SPSS to develop a basic idea about how SPSS for Windows works. The next step is to examine a few other data analysis techniques (CORRELATIONS, REGRESSION, T-TEST, ANOVA). All the statistical procedures available under a mini or mainframe version of SPSS are available in SPSS for Windows. Refer to the vendor documentation for the most complete information.
Correlation analysis
A correlation analysis is performed to quantify the strength of association between two numeric variables. In the following task we will perform a Pearson correlation analysis.
• Select Analyze/Correlate/Bivariate... This opens the Bivariate Correlations dialog box. The numeric variables in your data file appear on the source list on the left side of the screen.
• Select compopi, compscor, mathatti and mathscor from the list and click the arrow box. The variables will be pasted into the selection box. The options Pearson and Two-tailed are selected by default.
• Click OK
A symmetric matrix of Pearson correlations will be displayed on the screen. Along with Pearson r, the number of cases and the probability values are also displayed.
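The pasted syntax for this dialog would look roughly like the following sketch (the variable names are those of the data set assumed in this example):

    CORRELATIONS
      /VARIABLES=compopi compscor mathatti mathscor
      /PRINT=TWOTAIL NOSIG.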
6.5 Regression Estimation
Note that a correlation coefficient tells you that some sort of relation exists between the variables, but it does not tell you much more than that. For example, a correlation of 1.0 means that there exists a perfect positive linear relationship between the two variables, but it does not say anything about the form of the relation between the variables. When the observations are not perfectly correlated, many different lines may be drawn through the data. To select the line that describes the data as closely as possible, you employ regression analysis, which is based on the least-squares principle. In the following task you will perform a simple regression analysis with compscor as the dependent variable and mathscor as the independent variable.
• Choose Analyze/Regression/Linear... The Linear Regression dialog box appears.
• Choose compscor (Score in Computer Science) as the dependent variable
• Choose mathscor (Score in Mathematics) as the independent variable
• Click OK. This will display the output on the screen
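Equivalently, the pasted command syntax would be roughly this sketch:

    REGRESSION
      /DEPENDENT compscor
      /METHOD=ENTER mathscor.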
One-way Analysis of Variance
The statistical technique used to test the null hypothesis that several population means are
equal is called analysis of variance. It is called that because it examines the variability in
the sample, and based on the variability, it determines whether there is a reason to believe
the population means are not equal. The statistical test for the null hypothesis that all of
the groups have the same mean in the population is based on computing the ratio of
within and between group variability estimates, called the F statistic. A significant F
value only tells you that the population means are probably not all equal. It does not tell
you which pairs of groups appear to have different means. To pinpoint exactly where the
differences are, multiple comparisons may be performed.
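As an illustration, for a numeric dependent variable score and a numeric grouping variable group (both hypothetical names), the command syntax would be roughly the following sketch; the Tukey post-hoc test is one possible choice for the multiple comparisons just mentioned:

    ONEWAY score BY group
      /POSTHOC=TUKEY ALPHA(0.05).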
The explanation and the topics covered in this section illustrate some of the basic features of SPSS for Windows. Examining additional features of SPSS for Windows is beyond the scope of this section. For further help, refer to the SPSS for Windows documents.
Unit Seven: A Brief Introduction to PCGIVE and LIMDEP
7.0 Objective
7.1 Introduction
7.2 A Brief Introduction to PCGIVE
7.3 A Brief Introduction to LIMDEP
7.4 Summary

7.0 Objective
The objective of this unit is to introduce briefly the basics of LimDep and PcGive. After completing this unit the student will:
• Be able to enter and transform data using LimDep and PcGive
• Understand the steps in making graphs and performing statistical and econometric analysis
7.1 Introduction
This last unit briefly explains two software packages, LimDep and PcGive, both developed by well-known econometricians. While PcGive is usually preferred for advanced time-series econometrics, LimDep is best suited for econometric analyses that make use of cross-section data and qualitative dependent variables. Note, however, that there are many issues that we do not address here; this section serves only as a starting point for introducing PcGive and LimDep.
7.2 A Brief Introduction to PCGIVE
PcGive is a software package for econometric modelling written by two well-known econometricians, David Hendry and Jurgen Doornik of the University of Oxford. PcGive is a computer program well suited for analyzing multivariate and univariate autoregressive processes. One key thing to know before you start is that PcGive is one of a number of program modules, all of which operate through the "front end" program called GiveWin. So when you load PcGive it will automatically load up GiveWin. You use GiveWin to enter or load your data, to undertake preliminary data transformations or construct pre-regression graphs, and to view your results. You switch to the PcGive module to formulate, estimate and test your regression models. Note that the programs are menu-driven and have detailed "help" functions.
Step 1: Loading the data
When you first load PcGive, you get the initial GiveWin window. Note that loading the
data base is done by simply choosing ”Open...” under ”File”, and then finding the
designated file path. When the data base has been opened, a window opens in GiveWin
where you can see the ”raw” data, and do simple editing of it. If the data base has been
modified, you may wish to save the changes; this can be done by choosing ”Save...”
under ”File”.
Note that it is also possible to enter data yourself. To do so, choose File -> New, then Database, and then choose the appropriate frequency. This will create an empty spreadsheet. Double-click the top of the first column to give it a name: in Variable name, type the name of the variable, such as "x". We can call the next column "y". Note that the spreadsheet is filled with "missing" as we haven't entered any data yet. Go to one of the missing cells, and double-click or press Enter to enter data. Enter your data and save them.
Step 2: Deriving new series using the Calculator
To derive any new series, based on those already in the database, use the Calculator. Select Tools | Calculator and then use the keys to create the new series. For example, to create GDPdefl = GDPcurr/GDP95 you can click on the names of the series in the database and the calculator button for / to get the formula. Then click on the = button and type the name you require for the new series (GDPdefl). It will be added to the database. Note: until you resave the database, the new series will not be permanently added to the file. Thus, we suggest that after each set of transformations is complete you resave the file.
Step 3: Creating pre-regression graphs
Graphics handling in GiveWin is done in "Graphics" under "Tools". When it is opened, a window opens with four headers: "Variables", "Descriptive", "Cross Plot" and "QQ Plot". To apply any of the graphics functions to a variable (series) you highlight the appropriate name in the list of variables, then choose the desired plot, and then click "Apply". For example, if you wish to see how import evolves over time, you click on the header "Variables", highlight IMPO, then choose "Actual Values", and click "Apply".
Note that plots by GiveWin are shown in the window "GiveWin Graphics". If you make several graphs, GiveWin will place them all in the same window. This can be avoided by choosing "Keep Graph" under "View"; this forces GiveWin to open a new graphics window the next time a graph is made. The graphs can be printed directly from GiveWin, or may be exported to your favorite software program. Export to Windows programs such as Word can easily be done using "Cut" and "Paste". Otherwise, GiveWin allows you to save the graphs: this can be done by activating the graph window of interest and then choosing "Save as..." under "File".
Finally, it should be noted that graphs in GiveWin can be edited either by choosing "Edit" and then "Edit Graph", or simply by double-clicking the graph.
Step 4: Descriptive Statistics
Descriptive statistics (notably averages and standard deviations) are most easily created in PcGive. To do so we need to start the PcGive module by choosing Modules and clicking Start PcGive; PcGive pops up in a new window. To tell PcGive that we are doing descriptive statistics, choose Package and select Descriptive statistics. Then we go back to GiveWin, close all data windows, and open the data. Next we go to PcGive and click the "formulate" icon. After that we choose all the variables and press OK until we get the results.
Step 5: Formulating a regression model
The formulation and estimation of a regression model are done in PcGive. Here we will perform a simple regression analysis. To do so, just follow the instructions below:
1. From the GiveWin window choose ”Modules” and then ”Start PcGive”.
2. In PcGive: Choose the menu ”Model” and then ”Single-equation Dynamic
Modelling...”; the model is set up here. This opens the dialog window ”Formulate
Model”.
3. In the lower right corner you can choose the number of lags which are to be
included in the model.
4. In the box to the right, the list over the available variables in the current data base
is shown. Highlight the variable and click on the ”<<Add” button.
5. In the box to the left, the model is shown. PcGive automatically adds a constant to
the model. If this had not been in our interest it could be removed: Highlight
”Constant” and click ”Delete”.
6. Next add the explanatory variables to the model. Notice that there can only be one endogenous variable in PcGive; for two or more endogenous variables we need PcFiml. To accept the model, click "OK".
7. PcGive now asks you to choose the method of estimation. Choose "OLS" (ordinary least squares), which is the standard choice and, as you know, gives the maximum likelihood estimators conditional on the initial observation. You have now estimated the parameters of the model: the results of the estimation are reported in the "Results" window in GiveWin.
Step 6: Post-regression testing (diagnostic checking)
Once you have formulated and estimated the model you should always test for misspecification. Note that it is never enough to estimate a model and just report the results: maybe the model we used makes no sense, maybe it fits the data badly, maybe another model would fit better. Thus, an important part of regression estimation is model control.
Note that we can conduct a variety of post-regression tests. One possible approach uses graphical methods to check for a misspecified model. Moreover, in PcGive we can select Test and choose Test Summary to get a summary table of diagnostic test statistics.
This completes this basic and brief introduction. To find out more, explore the program itself and make use of the program's Help file.
7.3 A Brief Introduction to LIMDEP

1. Introduction
As briefly discussed in unit one, LimDep is econometric software developed by William H. Greene. The program provides parameter estimation for linear and nonlinear regression models and for qualitative and limited dependent variable models using cross-section, time-series, and panel data. Note that the name LimDep is derived from LIMited DEPendent variable models. This section is designed to provide very basic knowledge on how to run statistical analyses in LimDep for Windows. The discussion will be very brief, introducing the basics of data management and analysis.
Note that LimDep opens multiple windows as it proceeds. When you start LimDep from your desktop, it starts with a window called Untitled Project. A project consists of the data that you are going to analyze, the results of your analysis, and the procedures that you have used. LimDep allows only one active project at a time.
2. Inputting Data
As we know from the foregoing units, in order to conduct any kind of data analysis we need to have the data available in the spreadsheet under consideration. Note that there are several ways of inputting data into the LimDep spreadsheet. The various approaches are briefly explained as follows.
I. Via Stat/Transfer
The easiest way of inputting data into LimDep for Windows is to convert your data file into LimDep for Windows format (with file extension .lpj) using Stat/Transfer. From within LimDep you can then simply open the data file from the File pull-down menu. One subtle point is that the LimDep data format for Windows is different from that of LimDep for Unix, and Stat/Transfer only handles LimDep for Windows. If you have a dataset in LimDep for Unix format, you need first to convert it into ASCII format within LimDep for Unix and then convert it into LimDep for Windows format using Stat/Transfer.
II. Excel Data

It is very easy to read an Excel data file into LimDep for Windows. We can simply issue
the command Import Variables from the pull-down menu Project and open our Excel file
from there. The only thing to keep in mind is that LimDep only reads Excel 4.0 (or 3.0)
files. If your file was created with a newer version of Excel, you first have to open it in
Excel and save it in one of these older formats.

III. Via Command READ

If your data file is in ASCII format, you can read it in using the LimDep command
READ. The ASCII data file can have commas, tabs or simply spaces as delimiters.
Missing values can be coded as ".".

There are two ways of issuing a command within LimDep. One way is via a command
dialog box, where one can type a single command such as READ to read in a data file.
The other way is via a command window, where one can enter multiple commands and
save them as a program file for later use. Let's focus on the second way here. We first
need to open a command window: from the File menu, choose the New option. A
dialog window will pop up asking for the type of new window we want to open. Choose
Text/Command Document and click OK. Now we can enter our commands in the
command window, highlight the commands we want to run, and run them from the Run
menu. Note that creating a new variable in LimDep can be done through the Project
pull-down menu, or with the CREATE command, as in the sketch below.
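As an illustration, here is a minimal sketch of such a command file, assuming the
standard LimDep command syntax: each command ends with a $, lines beginning with ?
are comments, and the file path, sample size and variable names (wage, educ, exper) are
hypothetical.

    ? Read a raw ASCII file (path and layout are hypothetical)
    READ ; File = "c:\mydata\wages.txt"
         ; Nobs = 100 ; Nvar = 3
         ; Names = wage, educ, exper $

    ? Create a transformed variable from the command window
    CREATE ; logwage = Log(wage) $

Highlighting these lines and running them should load the three variables and add the
new variable logwage to the project.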

3. Running Analysis
LimDep can carry out a large number of estimations. In this brief discussion we
highlight how to obtain descriptive statistics and run a simple regression analysis.
• Descriptive Statistics

One way of checking that we have input our data correctly is to run some descriptive
statistics on the data set. It is fairly straightforward to do so in LimDep: Descriptive
Statistics is found under Data Description in the Model pull-down menu. A window
with a list of variables will pop up so you can choose the variables. Note that the
variable ONE is created by LimDep as the constant 1 and is used when we want to
include a constant term in a statistical analysis.
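The menu route above also has a one-line command equivalent. Continuing the
hypothetical example, the following is a sketch assuming the standard DSTAT command:

    DSTAT ; Rhs = wage, educ, exper $   ? descriptive statistics for the listed variables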

• Running LimDep procedures

Most statistical analysis can be done through the Model pull-down menu. For
example, we can run a number of different regression estimations.

• Running analysis

After the model is formulated, clicking the Run button produces the results. Note that we
can save the output as a LimDep file with extension .lim, or we can simply copy and
paste the results into a Word document.
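For completeness, the same kind of regression can also be issued from the command
window. The sketch below again uses the hypothetical variable names from above, and
ONE is the built-in constant mentioned earlier:

    REGRESS ; Lhs = wage ; Rhs = ONE, educ, exper $   ? OLS of wage on a constant, educ and exper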

Note, however, that there are many issues we do not address here. This section serves as
a starting point for anyone who does not know LimDep for Windows at all but may
want to run some statistical analysis with it. One can indeed run very sophisticated
statistical analyses with LimDep, but those are beyond the scope of this section.

7.4 Summary

This unit briefly discussed two software packages, PcGive and LimDep, both developed
by well-known econometricians. PcGive is well suited for analyzing univariate and
multivariate autoregressive processes. To enter and load data, perform data
transformations, or produce pre-estimation graphs, we use GiveWin; however, we open
the PcGive module to formulate, estimate and test regression models. On the other hand,
we noted from the foregoing discussion that the LimDep program provides parameter
estimation for linear and nonlinear regression models as well as for qualitative and
limited dependent variable models.
