
Continuing Legal Education

The Fundamentals
of Regression Analysis
A Primer for Antitrust Attorneys
by
Russell Lamb, Ph.D.
Senior Vice President
Nathan Associates Inc.
Arlington, VA.

Outline

• Introduction
• Building a regression model
• Interpreting the regression coefficients
• Estimating the regression coefficients
• Empirical example
• The indicator variable model
• Hypothesis testing
• Assessing the OLS estimators
• Assessing the regression model
• Identification
• Introduction to the forecasting method
• Appendix A - Probability Basics
• Appendix B - Assumptions of the Multiple Linear Regression Model

• Appendix C - Random Sampling
Introduction

• Econometrics is the application of statistical methods to economic data [1].
• Econometric methods, particularly those involving
regression analysis, play an important role in legal
proceedings.
• This presentation summarizes the essential ideas of
regression analysis, with a focus on those methods
that are commonly used in price-fixing litigation.

Regression analysis
The "how much" question
• Regression analysis is a statistical tool that is used
to understand the relationship among two or more
economic variables.
• Regression analysis aims to answer the "how
much" question. For example, the regression
model might seek to answer

• If the market price of a product goes down, how much will that affect the amount that firms will be willing to supply to the market?
• If the market price of a product goes up, how much will
that affect the sales of another product?

Regression analysis
It all begins with theory

• When applied to the field of economics, regression analysis is carried out by first developing an economic model and then a corresponding statistical model.
• Economic models are based on theories about the
relationships between economic variables.
• Economic theory describes relationships between
economic variables using the mathematical concept
of a function. For example:

CONSUMPTION = f(INCOME)

which says that consumption is some function of income.
Key concept
Functions and variables
• The general expression for a function is y = f(x).
• The function y = f(x) involves two variables:

• y is the output or value of the function


• x is the input or argument of the function

INPUT x  →  FUNCTION f  →  OUTPUT y = f(x)

• It follows that a variable is an attribute, condition or event that takes on 2 or more values.

The economic model

• To illustrate the model-building process, we will use an example motivated by the consumption function, which says that consumption is some function of income.
• Consumption is measured by weekly food expenditure and income is measured by weekly earnings.
• The economic model must say something about the functional (mathematical) form of the relationship between weekly earnings and weekly food expenditure.
• We will begin by modeling a relationship that is linear in the variables.
The linear model
Food Expenditure = β0 + β1·Income

[Figure: Linear model Y = β0 + β1X, plotted in terms of Income and Food Expenditure, with slope β1 = ΔY/ΔX and intercept β0.]
• This model is linear in the variables because plotting the function y = β0 + β1x in terms of Income and Food Expenditure generates a straight line, where β1 is the slope of the line and β0 is the y-intercept.
The simple linear regression model

• The food expenditure function on the previous slide is an example of a simple linear regression model.
• This regression model is simple not because it is necessarily
easy to carry out, but because there is a single independent
variable.
• The general form of the simple linear model is

Yi = β0 + β1xi

where the subscript i runs over observations, i = 1, …, N;
Yi is the dependent variable;
xi is an independent or explanatory variable; and
β0 and β1 are the unknown population parameters, or regression coefficients.
Introduction to the error term
• Every regression model can be thought of as having two components:
a systematic component, which is obtained from theory, and a
random component [2].
• Since economic theory describes the average behavior of many
individuals or firms, the systematic portion can be thought of as the
expected value of Y given X, which is the mean value of the Ys
associated with a particular value of X. The systematic component is

E(Y|X = x) = β0 + β1x


• Economic theory does not claim to perfectly predict the behavior of
each individual or firm.
> In the real world, the value of Y is unlikely to be equal to exactly the systematic component.
> As a result, a random error term e must be added to the equation

Y = β0 + β1x + e
Introduction to the error term
Y = β0 + β1x + e
By explicitly adding an error term to the regression model,
econometricians admit the existence of unexplained variation due
to the inherent uncertainty in economic behavior.
• The error term is random because its value is determined
entirely by chance.
• In addition to unpredictable occurrences ("noise"), the error term consists of

1. All factors affecting the dependent variable that are not included in the model.
2. Any approximation error that arises from the fact that the
underlying theoretical equation might have a different functional
form than the one chosen for the regression [3].

Multiple regression models
• When studying the effect of an explanatory variable
x on the dependent variable y, we generally want to
"control" for other factors that influence y by
including these factors in the model.
• Including other factors in the model allows us to
isolate and measure the impact of a variable of
interest while accounting for all other factors that are
thought to influence outcomes.
• For this purpose, we use the multiple regression
model, the general form of which is

Yi = β0 + β1xi1 + β2xi2 + ⋯ + βKxiK + ei,


for k = 1, …, K explanatory variables.
Nonlinear models
• Many economic relationships are nonlinear, which means they
are represented by curves rather than lines.
• For example, economic theory predicts that food expenditure
increases with income, but at a decreasing rate. Therefore, the
underlying theoretical equation for this relationship is nonlinear

Food Expenditure = β0 + β1·ln(Income)

where ln(Income) is the natural logarithm transformation of the variable Income.
• This regression model is nonlinear in the variables because plotting the function in terms of Income and Food Expenditure generates a curve.
> Unlike the linear regression model, the slope is not constant.
Nonlinear models
Food Expenditure = β0 + β1·ln(Income)

[Figure: Linear-log model — scatter plot of weekly food expenditure against weekly income with the fitted curve.]

Data Source: R. Carter Hill, William E. Griffiths, and Guay C. Lim (2011), Principles of Econometrics, 4th Ed.
Models that are linear in the
coefficients
Food Expenditure = β0 + β1·ln(Income)

• Because the coefficients in the model above appear in their simplest form, the model is linear in the coefficients [4]:

• The coefficients are not raised to any powers (other than one).
• The coefficients are not multiplied or divided by other coefficients.
• The coefficients do not themselves include some sort of function.

• Linear regression analysis can be applied to an equation that is nonlinear in the variables if the equation can be transformed in a way that is linear in the coefficients.
> When econometricians refer to "linear regression," they usually mean "regression that is linear in the coefficients."
Interpreting the regression coefficients
Simple linear regression model

Y = β0 + β1x1 + e

• β0 is the intercept coefficient, and it indicates the value of y when x1 is equal to zero.
• β1 is the slope coefficient, and it indicates the amount that the dependent variable y will change when x1 increases by one unit.
Interpreting the regression coefficients
Multiple regression models

Y = β0 + β1x1 + β2x2 + β3x3 + e

• The interpretation of β1 in the multiple regression model above is different than it was on the previous slide, when x1 was the only explanatory variable.
• In the multiple regression model, β1 is the impact of a one-unit increase in x1 on the dependent variable y, holding constant, or controlling for, the other independent variables (x2 and x3).
Key concept
Ceteris paribus [5]

• The ceteris paribus notion means to hold all other relevant factors fixed.
• The ceteris paribus assumption is crucial for establishing a causal relationship.
• In applying the ceteris paribus assumption to the
linear regression model, all variables except the
one under immediate consideration are held
constant.

Interpreting the regression coefficients
Nonlinear models

• The general form of a model that is linear in the coefficients is:

f(y) = β0 + β1·f(x)

• For example, f(y) and f(x) could be the natural logarithm transformations of the variables y and x, respectively.
• The figures on the next slide illustrate some commonly-used alternative functional forms that are linear in the coefficients while being nonlinear in the variables.
• The interpretation of the slope coefficient β1 for these models is more complicated because the slopes are not constant (i.e., they are either increasing or decreasing in x).
Interpreting the regression coefficients
Nonlinear models

Some Alternative Functional Forms

[Figure: Four panels showing alternative functional forms —
Log-linear model: ln(y) = β0 + β1x
Linear-log model: y = β0 + β1·ln(x)
Log-log model: ln(y) = β0 + β1·ln(x)
Quadratic model: y = β0 + β1x²]

• In order to understand how to interpret some of these slope coefficients, we will first introduce the concept of elasticity.
Key concept
Elasticity

• In economics, elasticity is a unit-free measure of how responsive an economic variable is to changes in another.
• This sensitivity is measured as the percentage change in
one variable given a 1-percent increase in another variable.
• Price elasticity of demand is the percentage change in
quantity demanded resulting from a 1-percent increase in
price.
• A low price elasticity of demand indicates an industry in
which collusion would be profitable.
• Cross-price elasticity of demand is the percentage
change in the quantity demanded of one good resulting
from a 1-percent increase in the price of another good.
• A positive and high cross-price elasticity of demand
between two goods suggests that they are part of the same
relevant market.
Economic functions and their elasticities

• The table below summarizes some useful functions, their elasticities, and the interpretation of their slope coefficients [6].

Name         Function                  Slope = dy/dx    Elasticity
Linear       y = β0 + β1x              β1               β1(x/y)
Quadratic    y = β0 + β1x²             2β1x             (2β1x)(x/y)
Log-log      ln(y) = β0 + β1ln(x)      β1(y/x)          β1
             A 1% change in x leads to a β1% change in y.
Log-linear   ln(y) = β0 + β1x          β1y              β1x
             A 1-unit change in x leads to a (100·β1)% change in y.
Linear-log   y = β0 + β1ln(x)          β1(1/x)          β1(1/y)
             A 1% change in x leads to a (β1/100)-unit change in y.
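As a worked illustration of the elasticity column for the linear model (a minimal Python sketch; the slope and sample means below are hypothetical numbers, chosen only to show the calculation):

b1 = 10.0        # hypothetical slope of a linear food expenditure model
x_mean = 19.6    # hypothetical sample mean of income
y_mean = 283.6   # hypothetical sample mean of food expenditure

# Elasticity of a linear model, evaluated at the sample means: b1 * (x / y)
elasticity = b1 * x_mean / y_mean
print(round(elasticity, 2))   # about 0.69: a 1% rise in income -> roughly a 0.69% rise in expenditure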
Estimating the regression coefficients
Linear least squares

• To estimate the regression coefficients, we need a rule for fitting the regression model to the data.
• Linear least squares or ordinary least squares
(OLS) is a commonly-used rule for fitting the
regression model to the data where the assumed
function is linear in the coefficients.

Estimating the regression coefficients
Least squares principle

• The least squares principle: Fit a line such that we minimize the
vertical distances from each point to the line. This distance is called the
residual.
[Figure: Statistical fit of a line — the fitted line with intercept b0 and slope b1 = ΔY/ΔX, with each residual (yi − ŷi) shown as the vertical distance from a data point to the line.]

• The modeled values, denoted as ŷi, are referred to as fitted values.


• Fitted values are calculated as the intercept plus a weighted average of the values of
the explanatory variables, with the estimated coefficients used as weights.
• For each sample observation on y, the difference between the data value and the
fitted value is the residual value (yi − ŷi).
Estimating the regression coefficients
The least squares estimators

Y = β0 + β1x1 + β2x2 + ⋯ + βKxK + e

• Given sample observations on y and x1, x2, x3, …, xK, find the values of β0, β1, β2, …, βK that minimize the sum of squared residuals.
• This rule is known as the least squares problem.
• The solution to the least squares problem yields a set of
formulas called the least squares estimators. These
formulas are perfectly general and can be applied to any
sample data.
• When you plug the sample data values into the estimators
you obtain numbers, which are the least squares estimates.

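To make the least squares problem concrete, the sketch below (assuming numpy is available; the data are simulated and all names are hypothetical) computes the coefficient values that minimize the sum of squared residuals:

import numpy as np

# Simulated sample: y depends on two explanatory variables x1 and x2
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 0.5 * x1 - 1.0 * x2 + rng.normal(scale=0.3, size=n)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), x1, x2])

# Least squares estimates: the b that minimizes sum((y - X @ b)**2)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients (b0, b1, b2):", b)

# Residuals and the minimized sum of squared residuals
residuals = y - X @ b
print("sum of squared residuals:", residuals @ residuals)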
The indicator-variable model
Introduction

• Indicator variables (also known as dummy variables) are explanatory variables that take on the value of zero or 1.
• Indicator variables are used to represent qualitative
characteristics such as gender.
• The general form of the indicator-variable model is

Yi = β0 + β1xi1 + φDi + ei

where Di is a 0/1 indicator variable.


• In this model, the group of observations with a value of zero
for the indicator variable is called the reference group.
• The coefficient φ is a measure of the effect on the dependent variable y of the characteristic represented in Di.
The indicator-variable model
Modeling but-for prices

• In price-fixing litigation, the reference group is used to estimate the but-for price.
• The reference groups used in this context are based on the following two approaches:

1. "Before-and-After," where but-for prices are prices before and/or after the alleged conspiracy; or

2. "Yardstick," where but-for prices are the prices of the same product manufactured in another location (e.g., other countries), or the prices of other products with similar demand, cost, and market structure conditions.
Hypothetical empirical example

• Suppose there is an alleged conspiracy in the primary aluminum industry.
• A cartel comprised of manufacturers attempted to maintain supracompetitive prices on sales in the United States from 2005 through 2006.
• Linear regression can be used to:
• Assess whether the cartel was effective
• Measure the magnitude of the cartel impact on prices

Hypothetical empirical example

• We collect the following data for the period January 2002 - December 2006:
• Primary aluminum prices
• Variables that affect the cost of producing aluminum
• Variables that affect the demand for aluminum
• The data used in this hypothetical empirical
example are all generated and do not reflect any
real matters.

Hypothetical empirical example

• We transform the variables into their natural logarithms and run the following regression

ln(Pit) = β0 + β1·ln(Cit) + β2·ln(Dit) + β3·Cartelit + eit

where:
Pit is the price paid by the ith customer at time t;
Cit is an index that includes factors that affect production costs;
Dit is personal disposable income, which affects demand;
Cartelit is an indicator variable that is equal to 1 during the conspiracy period, and 0 during the benchmark period; and
eit is the random error term.
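A minimal Python sketch of estimating this kind of indicator-variable price regression (assuming the numpy, pandas, and statsmodels packages are available; the data below are simulated and every number is hypothetical):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 1000
ln_cost = rng.normal(3.0, 0.2, n)                   # log of a cost index
ln_demand = rng.normal(8.0, 0.3, n)                 # log of disposable income
cartel = (rng.uniform(size=n) < 0.4).astype(int)    # 1 during the alleged conspiracy
ln_price = (1.7 + 0.36 * ln_cost + 0.29 * ln_demand
            + 0.19 * cartel + rng.normal(0, 0.12, n))

df = pd.DataFrame({"ln_price": ln_price, "ln_cost": ln_cost,
                   "ln_demand": ln_demand, "cartel": cartel})

# Reduced-form price regression with a cartel indicator
fit = smf.ols("ln_price ~ ln_cost + ln_demand + cartel", data=df).fit()
print(fit.params)       # estimates of beta0 through beta3
print(fit.tvalues)      # t-statistics
print(fit.rsquared)     # goodness of fit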
Hypothetical empirical example

ln(Pit) = β0 + β1·ln(Cit) + β2·ln(Dit) + β3·Cartelit + eit

• Since Pit, Cit and Dit are in their natural logarithm transformations, the slope coefficients β1 and β2 are elasticities.
• The coefficient on the indicator variable, β3, is an estimate of the impact of the alleged cartel on prices, while accounting for the influences on prices of demand (personal disposable income) and production costs.
• The cartel effect is (approximately) a 100·β3 % change in prices.
• The exact calculation is 100·(e^β3 − 1) %, where e^x is the natural exponential function.
Hypothesis testing

• In addition to estimating the population parameters, the regression model is also used to statistically test
relationships between economic variables in a process
called hypothesis testing.
• The application of hypothesis testing to regression
analysis involves the following steps:
1. Specify the null and alternative hypotheses.
2. Specify the test statistic.
3. Select a significance level and determine the "rejection
region."
4. Run the regression and calculate the test statistic.
5. State the conclusions of the test based on the magnitude
of the test statistic relative to the predetermined "rejection region."
Hypothesis testing
The null and alternative hypotheses

• The specification of the hypothesis to be tested involves formalizing what the expert thinks is true, and then stating a range of values of a regression coefficient that would be expected to occur if the theory were not true.
• This statement is called the null hypothesis. The notation used to refer to a null hypothesis is "H0:".
• The alternative hypothesis specifies the range of values that would occur if the expert's theory were correct. The notation used to refer to an alternative hypothesis is "HA:".
The null and alternative hypotheses
Price-fixing example

• In a price-fixing case, an expert may want to evaluate the null hypothesis of no legal impact against the alternative hypothesis that anticompetitive misconduct led to supracompetitive prices.
• This test is often conducted by specifying an indicator-variable regression model, and then testing the null hypothesis:

H0: βCARTEL ≤ 0 (the values that are not expected to be true)

against the alternative hypothesis

HA: βCARTEL > 0 (the values that are expected to be true),

where βCARTEL is the coefficient on an indicator variable that is equal to 1 during the alleged conspiracy period, and zero during the benchmark period.
How are statistical tests used?

• It is almost impossible to prove with absolute certainty that a theory is correct using statistical evidence alone.
• The conclusions drawn from statistical tests are used
in legal proceedings to provide evidence contrary to
the view that a particular violation has not occurred.
• This evidence can aid finders of fact in assessing the
likelihood that a violation has occurred.
Statistical significance

• Using hypothesis tests to assess the likelihood that a violation has occurred involves quantifying the probability that the null hypothesis is rejected when it is true (Type I Error).
• The significance level of a statistical test measures the probability of making a Type I Error.

> The lower the percentage required for statistical significance, the more difficult it is to reject the null hypothesis.
> In most scientific work, the level of significance required to reject the null hypothesis is set conventionally at 5%.

• If the null hypothesis is rejected, then we can say that the regression coefficient is statistically significant at the chosen significance level.
Hypothesis testing
The p-value

• It has also become standard practice to report the p-value (probability value) of the statistical test.
• The p-value is the probability of observing a test statistic equal to or "more extreme" than the calculated test statistic if the null hypothesis were true.
• For example, a p-value of 0.01 means that there is a 1% chance of observing the calculated test statistic if the null hypothesis were true.
> Thus the estimated coefficient is statistically significant at the 1% significance level.
Hypothesis testing
The t-statistic

• The test statistic that econometricians usually use to test hypotheses about individual regression coefficients is the t-statistic.
coefficients is the t-statistic.
• The test using this statistic is called a t-test.
• The t-statistic describes how far an estimate of a coefficient is from its hypothesized value under the null hypothesis.
• If a t-statistic is sufficiently large in absolute
magnitude, then the expert can reject the null
hypothesis.

Hypothesis testing
The t-statistic

• In our price-fixing indicator-variable model, the t-statistic is

t = (β̂CARTEL − βH0) / SE(β̂CARTEL)

where:
β̂CARTEL is the estimated regression coefficient on the CARTEL indicator variable;
SE(β̂CARTEL) is the square root of the estimated variance of the distribution of β̂CARTEL; and
βH0 is the value of the CARTEL coefficient under the null hypothesis (i.e., zero).
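As a worked illustration of this formula (a minimal Python sketch, using the CARTEL estimate and standard error reported in the regression output on the following slides):

beta_hat = 0.1911031    # estimated coefficient on the CARTEL indicator
se_beta = 0.003181      # its estimated standard error
beta_h0 = 0.0           # hypothesized value under the null
t_stat = (beta_hat - beta_h0) / se_beta
print(round(t_stat, 2))   # about 60.08, matching the reported t value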
Hypothesis testing
Hypothetical empirical example
• Returning to our indicator-variable regression model

ln(Pit) = β0 + β1·ln(Cit) + β2·ln(Dit) + β3·Cartelit + eit,

we will now conduct the following statistical test:

H0: βCARTEL ≤ 0 vs. HA: βCARTEL > 0.

• This test essentially determines whether the coefficient on the CARTEL indicator variable is positive.
• If the null hypothesis is rejected, then the coefficient on the indicator variable, β3, is significantly greater than zero, and the model provides empirical evidence contrary to the view of no anticompetitive impact on prices.
Hypothesis testing
Hypothetical empirical example

• The computer output for the indicator-variable model is:

. regress ln_PRICE ln_COST ln_DEMAND CARTEL, noheader

ln_PRICE        Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
ln_COST      .3606615    .0163687   22.03    0.000     .3285776    .3927454
ln_DEMAND    .2882051    .0049012   58.80    0.000     .2785985    .2978117
CARTEL       .1911031     .003181   60.08    0.000      .184868    .1973381
_cons        1.683021    .0610985   27.55    0.000     1.563264    1.802778

• The OLS estimate of the CARTEL indicator variable suggests that the alleged conspiracy led to (approximately) a 19.11% increase in prices.
• This estimate is statistically significant since the p-value (P>|t|) is less than 0.00005.
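As a worked illustration of the exact calculation described earlier, 100·(e^β3 − 1)%, applied to the estimate above (a minimal Python sketch):

import math

beta3 = 0.1911031                      # estimated CARTEL coefficient
exact_pct = 100 * (math.exp(beta3) - 1)
print(round(exact_pct, 2))             # about 21.06%, versus the 19.11% approximation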
Hypothesis testing
Hypothetical empirical example

. regress ln_PRICE ln_COST ln_DEMAND CARTEL, noheader

ln_PRICE        Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
ln_COST      .3606615    .0163687   22.03    0.000     .3285776    .3927454
ln_DEMAND    .2882051    .0049012   58.80    0.000     .2785985    .2978117
CARTEL       .1911031     .003181   60.08    0.000      .184868    .1973381
_cons        1.683021    .0610985   27.55    0.000     1.563264    1.802778

• The estimated coefficients on the demand and cost variables both have the expected signs and are statistically significant.
• A 1% increase in production costs is estimated to increase prices by 0.360 percent.
• A 1% increase in personal disposable income is estimated to increase prices by 0.288 percent.
Statistical properties of the OLS
estimators
• Since their values are not known until a sample is
collected, the least squares estimators are random
variables.
• As random variables, the least squares estimators have
• A probability density function (pdf), which describes
how the OLS estimates are distributed in repeated
sampling.
• An expected value, which is the average value of an
estimator that occurs in many repeated samples of the
same size. It is the center of the pdf of the random
variable.
• A variance, which is a measure of the spread of the
probability distribution of an estimator.

> Appendix A provides a more detailed explanation of these terms.
Assessing the OLS estimators
Unbiasedness

• When the expected value of any estimator of a parameter equals the true parameter value, then that estimator is unbiased [7].
• It can be shown that the least squares estimator of β1 is unbiased, that is E(b1) = β1, where E(b1) is the average value of the estimator b1 that occurs in many repeated samples of the same size.

[Figure: The probability density function of the least squares estimator b1, centered at β1.]
Assessing the OLS estimators
Variance
• The variance of an estimator is a measure of the spread of its
probability distribution.
• The smaller the variance of an estimator is, the greater the
precision of that estimator [8].
> When comparing two estimators, the one with smaller variance is
best since this rule gives us a higher probability of obtaining an
estimate that is close to the true parameter value.

[Figure: Two possible probability density functions for b1.]
The Gauss-Markov theorem

• The Gauss-Markov theorem states that if the regression model is correctly specified,¹ the data are obtained from random sampling,² no explanatory variable is a perfect linear function of another explanatory variable, and the variance of the error term is homoskedastic,³ then the OLS estimators are the best linear unbiased estimators (BLUE) of the regression coefficients. They are best in their class because they have the minimum variance.

1. A correctly specified model has not omitted any important variables, and has the correct functional (mathematical) form of the variables (see Appendix B for a technical description of the multiple linear regression assumptions).
2. The process by which the data are collected is such that each observation (yi, xi1, xi2, …, xiK) is statistically independent of every other observation (see Appendix C for a more detailed description of the random sampling assumption).
3. Homoskedastic error variances are defined in Appendix B.
Assessing the regression model
Measuring goodness-of-fit

• The fit of the linear model is measured by the coefficient of determination, called R².
• R² measures the variation in the dependent variable
y that is explained by the variation in the explanatory
variables.
• The total variation in the dependent variable that we
wish to explain is measured by the sum of squared
differences between the sample values of the
dependent variable and the sample mean of the
dependent variable. This measure is called the total
sum of squares (SST).
Assessing the regression model
Measuring goodness-of-fit
• The total sum of squares (SST) can be decomposed into
• The sum of squares that is explained by the regression (SSR).
• The sum of squares due to error (SSE), which is that part of SST that
is not explained by the regression.
[Figure: Explained and unexplained components of yi — the deviation of yi from the sample mean ȳ is split into an explained part (ŷi − ȳ) and an unexplained part, the residual (yi − ŷi).]
Assessing the regression model
Measuring goodness-of-fit

• R² is the proportion of variation in the dependent variable that is explained by the variation in the explanatory variables.
• R² is equal to the ratio of SSR to SST

R² = SSR / SST

• An R² close to 1 means that the sample values are close to the regression line.
• An R² close to 0 means that the sample data for y and x are uncorrelated.
Assessing the regression model
Measuring goodness-of-fit

[Figure: Two scatter plots — in the left panel, x and y are not related, and R² would be zero; in the right panel, all of the data points are on the regression line, and the resulting R² is equal to 1.]
Assessing the regression model
The F-test of overall significance

• The t-test that we introduced earlier is used for testing hypotheses about individual coefficients.
• To test hypotheses about more than one coefficient, we use
the F-test.
• The F-test is frequently used to test the overall
significance of a regression equation.
• The F-test tests the null hypothesis that all the coefficients in
the model except the intercept are equal to zero:
H0: β1 = β2 = ⋯ = βK = 0
HA: at least one of the βk is nonzero, for k = 1, …, K explanatory variables excluding the intercept.
Assessing the regression model
The F-test of overall significance
• To test the significance of the model, we calculate the F-statistic:

F = [(SST − SSE)/(K − 1)] / [SSE/(N − K)]

where:
K is the number of parameters in the model including the
intercept;
N is the number of observations;
SST is the total sum of squares; and
SSE is the sum of squares due to error (sum of squared OLS residuals).
• Thus, the F-statistic in a test for overall significance is the ratio of the explained sum of squares to the residual sum of squares, adjusted for the number of independent variables and the number of observations.
Assessing the regression model
Hypothetical empirical example
• The R² tells us that approximately 81% of the variation in ln(PRICE) can be explained by the variation in the explanatory variables.
• The probability of observing an F-statistic as large or larger than the observed statistic under the null hypothesis is smaller than 0.00005.
> We reject the hypothesis that the slope coefficients are jointly equal to zero.

Source         SS           df         MS            Number of obs =   22,624
                                                      F(3, 22620)   = 32297.51
Model      1503.98605        3     501.328683        Prob > F      =   0.0000
Residual   351.112389   22,620      .01552221        R-squared     =   0.8107
                                                      Adj R-squared =   0.8107
Total      1855.09844   22,623      .08200055        Root MSE      =   .12459

ln_PRICE        Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
ln_COST      .3606615    .0163687   22.03    0.000     .3285776    .3927454
ln_DEMAND    .2882051    .0049012   58.80    0.000     .2785985    .2978117
CARTEL       .1911031     .003181   60.08    0.000      .184868    .1973381
_cons        1.683021    .0610985   27.55    0.000     1.563264    1.802778
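The reported R-squared and F-statistic can be reproduced from the sums of squares in the output above (a minimal Python sketch using the formulas from the preceding slides):

SST = 1855.09844    # total sum of squares
SSE = 351.112389    # sum of squared residuals
K = 4               # parameters, including the intercept
N = 22624           # observations

R2 = (SST - SSE) / SST
F = ((SST - SSE) / (K - 1)) / (SSE / (N - K))
print(round(R2, 4))   # about 0.8107
print(round(F, 1))    # about 32297.5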
Key concept
Identification

• The term identified is used to indicate that the model parameters can be consistently estimated [9].
• An estimator is consistent if it tends to become more and more accurate as the sample size increases.
• We will now summarize the key threats to
identification, each of which persists even when the
sample size increases.

Threats to identification
Biasedness

• OLS estimators are biased if one or more explanatory variables are correlated with the error term.
• The causes of correlation between an explanatory
variable and the error term are:

• The omitted variable problem


• The incorrect functional form problem
• The errors-in-variables or measurement-error problem
• The simultaneous causality problem

• The bias caused by any of these problems persists even when the sample size increases.
Omitted variable bias
Problem
• Omitted variable bias (OVB) occurs when a model
incorrectly leaves out one or more causal factors, and
the omitted variables are correlated with at least one of
the included variables.
• The magnitude and direction of the OVB depends on:
• the correlation between each omitted variable and the
included variable; and
• the magnitude and direction of the effect of the omitted
variable on the dependent variable.
• When there are two or more omitted variables, the
direction of the bias may be difficult to sign because the
formula for the bias term is the sum of the individual
biases associated with each omitted variable (e.g., a
positive bias can be perfectly offset by a negative bias).
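A brief sketch of the standard omitted-variable-bias formula for the simplest case (one included regressor x1 and one omitted regressor x2; a textbook result stated here for illustration): if the true model is y = β0 + β1x1 + β2x2 + e but x2 is left out, the expected value of the OLS slope on x1 is β1 + β2·δ1, where δ1 is the slope from a regression of the omitted variable x2 on the included variable x1. The bias term β2·δ1 takes the sign of the product of the omitted variable's effect (β2) and its association with the included variable (δ1).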
Omitted variable bias
Solutions

• If the variable can be measured, include it as an explanatory variable in the multiple regression model.
• If the variable cannot be measured but a proxy variable
exists, include the proxy as an explanatory variable in the
multiple regression model.
• If the variable reflects entity-specific, time-invariant
characteristics, and if panel data are available, the fixed
effects regression model can be used to "control for" the
omitted fixed factors.
• Use instrumental variable (IV) regression, which relies on
a new variable, called an instrumental variable. (IV
regression is a relatively advanced topic that is not covered
in this presentation.)
Incorrect functional form
Problem

• If the functional form of the estimated regression function is different from the functional form of the true population regression, then the OLS estimator is biased.
• To detect functional form misspecification, we can
plot the data points and the fitted regression line.
• As shown in the figures on the next slide, a linear
regression function does not fit the data well.
> Either the fitted quadratic relationship or the fitted log-linear relationship fits the data better.
Incorrect functional form
Solution
[Figure: Three panels — a fitted linear relationship (sum of squared residuals 6,868,481.92), a fitted quadratic relationship (sum of squared residuals 5,015,017.86), and a fitted log-linear relationship (the smallest sum of squared residuals of the three).]

• In evaluating different models, we can choose the model with the smallest sum of squared residuals (SSE).
• Comparing the SSE for the 3 models, we can see that the log-linear model fits the data the best.

Data Source: R. Carter Hill, William E. Griffiths, and Guay C. Lim (2011), Principles of Econometrics, 4th Ed.
Measurement-error bias
Problem and solutions

• Errors-in-variables, or measurement-error, bias of the OLS estimator arises when an explanatory variable is measured with error.
• Measurement error will always bias the OLS estimators.
Under certain conditions, it will bias them towards zero, which
is called attenuation bias.
• The 3 solutions to the measurement-error problem are
1. Get better measurements of the variable.
2. Use the technique of instrumental variable regression
(this is a relatively advanced topic that is not covered in
this presentation).
3. Attempt to quantify the bias, and use the resulting formula
to adjust the estimated coefficients [10].
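A brief sketch of attenuation bias in the simplest case (classical measurement error in a single regressor; a standard result stated here for illustration): if the true regressor is x but we observe x plus a measurement error w, where w has variance σw² and is uncorrelated with x and with the regression error, then in large samples the OLS slope estimator converges to β1·σx²/(σx² + σw²) rather than to β1. Because the ratio σx²/(σx² + σw²) lies between 0 and 1, the estimate is pulled toward zero.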
Simultaneous causality bias
Problem

• Simultaneous causality bias arises when, in addition to the causal link from the explanatory variable x to the dependent variable y, there is a causal link from y to x.
• A classic example of simultaneous causality is a
model that attempts to explain the relationship
between market price and quantity

Q = β0 + β1P + e,
where P is the equilibrium price of the product,
and Q is the equilibrium quantity.

Simultaneous causality bias
Problem

• Economic theory tells us that market price and quantity are jointly determined by the intersection of the supply and demand curves.

[Figure: Supply and demand equilibrium — the intersection of the supply and demand curves determines the equilibrium price and quantity Q*.]
Simultaneous causality bias
Problem

• Since market price and quantity are jointly determined by the intersection of the supply and demand curves, the model describing this relationship consists of the following two equations

Demand: Q = α0 + α1P + α2X + ed
Supply:  Q = β0 + β1P + es

where X is a determinant of demand such as income.


• This set of equations is called the structural equation
system (also known as the simultaneous equations
model).
• The parameters of this model are called the structural
parameters.
Key concept
Endogenous vs. exogenous variables

Demand: Q = α0 + α1P + α2X + ed
Supply:  Q = β0 + β1P + es

• Endogenous variables are those whose values are "determined within the system created."
• Exogenous variables are those whose values are "determined outside the system created."
• In the two structural equations above, both Q (quantity) and P (price) are endogenous variables, and X (income) is an exogenous variable.
Simultaneous causality solutions
Reduced-form methods

Demand: Q = α0 + α1P + α2X + ed
Supply:  Q = β0 + β1P + es

• The two structural equations above can be solved to express the endogenous variables Q (quantity) and P (price) as a function of the exogenous variable X (income).
• The resulting model is called the reduced form of the structural equation system.
• The reduced-form parameters can be estimated using linear least squares.
• The reduced-form parameters are typically functions of a number of structural parameters.
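As an illustration of the algebra behind the reduced form (a sketch using the two structural equations above): setting the demand and supply equations equal to each other and solving for price gives

P = (α0 − β0)/(β1 − α1) + [α2/(β1 − α1)]·X + (ed − es)/(β1 − α1),

and substituting this expression back into the supply equation gives Q as a function of X alone. Because the only right-hand-side variable in each reduced-form equation is the exogenous variable X, the reduced-form parameters can be estimated by least squares, and each of them is a combination of several structural parameters.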
Reduced-form model

• The reduced-form model is probably the most widely-used and accepted approach to estimating damages in price-fixing cases [11].
• Indeed, the model in our hypothetical empirical example is a reduced-form price equation:

ln(Pit) = β0 + β1·ln(Cit) + β2·ln(Dit) + β3·Cartelit + eit
Introduction to the forecasting method

• An alternative approach to estimating overcharges is to estimate the parameters of the regression equation using data from the benchmark period, and use these estimates to forecast what prices would have been during the alleged conspiracy period.
• Any deviation of actual prices from those predicted
using data from the benchmark period is the
amount of the overcharge.

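A minimal Python sketch of the forecasting method (assuming the numpy, pandas, and statsmodels packages are available; the monthly data below are simulated and every number is hypothetical):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated monthly data for 2002-2006
rng = np.random.default_rng(1)
dates = pd.date_range("2002-01-01", "2006-12-01", freq="MS")
ln_cost = rng.normal(3.0, 0.1, len(dates))
ln_demand = rng.normal(8.0, 0.1, len(dates))
conspiracy_flag = dates.year >= 2005
ln_price = (1.7 + 0.36 * ln_cost + 0.29 * ln_demand
            + 0.19 * conspiracy_flag + rng.normal(0, 0.02, len(dates)))
df = pd.DataFrame({"date": dates, "ln_price": ln_price,
                   "ln_cost": ln_cost, "ln_demand": ln_demand})

# 1. Estimate the price equation over the benchmark period (2002-2004) only
benchmark = df[df["date"].dt.year <= 2004]
fit = smf.ols("ln_price ~ ln_cost + ln_demand", data=benchmark).fit()

# 2. Forecast but-for (log) prices during the alleged conspiracy period (2005-2006)
conspiracy = df[df["date"].dt.year >= 2005]
predicted_ln_price = fit.predict(conspiracy)

# 3. The overcharge is the difference between actual and predicted prices
overcharge = np.exp(conspiracy["ln_price"]) - np.exp(predicted_ln_price)
print(overcharge.mean())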
Introduction to the forecasting method

[Figure: Actual and predicted prices, with the regression estimated on the 2002-2004 benchmark sample — the in-sample predictions track actual prices closely during the benchmark period, while the out-of-sample predictions fall below actual prices during the alleged conspiracy period.]

This model appears to fit the benchmark period (2002-2004), and suggests a large overcharge during the alleged conspiracy period (2005-2006).
Summary

• Regression analysis is commonly used to isolate and measure the effect of an alleged price-fixing conspiracy.
• When correctly implemented, regression analysis
provides an estimate of the impact of the alleged
conspiracy on price, while accounting for other factors
that are determinants of price.
• Ordinary least squares is a method that provides
estimators of the regression coefficients that minimize
the sum of squared residuals.
• The two main regression-based methods for estimating
an overcharge are the indicator-variable (or dummy
variable) method and the forecasting method.

Summary

• The indicator-variable method estimates a regression over the entire analysis period, and includes an indicator
variable that is equal to 1 during the conspiracy period,
and zero during the benchmark period. The estimated
coefficient on the indicator variable is an estimate of the
average effect on price due to the conspiracy.
• The forecasting method estimates a regression over the
benchmark period only, and the estimated coefficients
from the benchmark period are used to predict what
prices would have been during the alleged conspiracy.
The overcharge is calculated as the difference between
the actual price and the predicted price.

References

• A. H. Studenmund (2001), Using Econometrics: A Practical Guide, 4th Ed., Addison Wesley Longman, Inc.
• Daniel Rubinfeld, "Reference Guide on Multiple Regression,"
Reference Manual on Scientific Evidence, Federal Judicial
Center 2011, Third Edition.
• James A. Brander and Thomas W. Ross (2006). "Estimating
Damages from Price-fixing," Canadian Class Action Review 3,
no. 1: 335-369.
• James Stock and Mark Watson (2007), Introduction to
Econometrics, Pearson Education Inc.
• Jonathan B. Baker and Daniel L. Rubinfeld (1999). "Empirical
Methods in Antitrust Litigation: Review and Critique," American
Law and Economics Review, 1(1/2): 386-436.

References

• Jeffrey M. Wooldridge (2002), Econometric Analysis of Cross Section and Panel Data, MIT Press.
• M. Hashem Pesaran (1987). "Econometrics," The New Palgrave: A Dictionary of Economics, 1st Ed., Palgrave Macmillan, v. 2: 8-22.
• R. Carter Hill, William E. Griffiths, and Guay C. Lim (2011), Principles of Econometrics, 4th Ed., John Wiley & Sons, Inc.
In-text Citations

1. M. Hashem Pesaran (1987), p. 8.
2. R. Carter Hill, William E. Griffiths, and Guay C. Lim (2011), p. 46.
3. R. Carter Hill, William E. Griffiths, and Guay C. Lim (2011), p. 48.
4. A. H. Studenmund (2001), p. 13.
5. Jeffrey M. Wooldridge (2002), p. 3.
6. R. Carter Hill, William E. Griffiths, and Guay C. Lim (2011), p. 153.
7. R. Carter Hill, William E. Griffiths, and Guay C. Lim (2011), p. 58.
8. R. Carter Hill, William E. Griffiths, and Guay C. Lim (2011), p. 60.
9. R. Carter Hill, William E. Griffiths, and Guay C. Lim (2011), p. 417.
10. James Stock and Mark Watson (2007), p. 321.
11. Jonathan B. Baker and Daniel L. Rubinfeld (1999), p. 392.
12. R. Carter Hill, William E. Griffiths, and Guay C. Lim (2011), p. 41.
13. R. Carter Hill, William E. Griffiths, and Guay C. Lim (2011), p. 23.
14. James Stock and Mark Watson (2007), p. 778.
Appendices

• Appendix A - Probability Basics


• Appendix B - Assumptions of the Multiple Linear
Regression Model
• Appendix C - Random Sampling

Appendix A - Probability Basics
Random variables

• A variable is random if its value is not known until a sample is collected.
• A random variable takes on a set of possible different
values, each with an associated probability.
• Notation:
• We will follow the convention of using upper case letters to
denote random variables and lower case letters to denote
realizations of random variables.
• For instance, Y represents a random variable, and y is a
realization of that random variable.

Appendix A - Probability Basics
Probability density function

• The probability density function (pdf) of a random variable summarizes the probabilities of possible outcomes occurring through repeated trials of an experiment (e.g., repeated sampling).
Appendix A - Probability Basics
Probability density function
Probability density function of X:

X    f(x)
1    0.1
2    0.2
3    0.3
4    0.4

[Figure: Bar chart of the probability density function for X.]

• For a discrete (i.e., countable) random variable, the pdf indicates the probability that the random variable X takes on the value x.
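As a worked illustration using the pdf shown above (a minimal Python sketch):

values = [1, 2, 3, 4]
probs = [0.1, 0.2, 0.3, 0.4]
mean = sum(v * p for v, p in zip(values, probs))                    # E(X) = 3.0
variance = sum((v - mean) ** 2 * p for v, p in zip(values, probs))  # var(X) = 1.0
print(mean, variance)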
Appendix A - Probability Basics
Probability density function

[Figure: Probability density function for a continuous random variable — the shaded area gives Probability(5 ≤ X ≤ 10).]

• For a continuous (i.e., uncountable) random variable, the pdf indicates the
probability of outcomes being in certain ranges.

Appendix A - Probability Basics
Mathematical expectation

• The mean of a random variable is the long-run average value of a random variable over many repeated trials or occurrences.
• It is NOT the same as the sample mean, which is the
arithmetic average of numerical values [12].
• The mean of a random variable is given by its
mathematical expectation, which is called the
expected value, denoted as E(Y).
• The expected value of a random variable is the
center of its probability density function, and is also
called the population mean value.

Appendix A - Probability Basics
Conditional mean
• With two or more random variables, we need to consider
the joint probability density function.
• For example, say Y is a random variable that takes on the
values 1, 2, 3, and 4, and X is a random variable that
takes on the values 0 and 1. The joint probability
density function of X and Y allows us to say things like
"The probability of Y being equal to 2 when X is equal to 1
is .25."
• The conditional expectation or conditional mean of Y
given X = x is the average value of Y in repeated
sampling where X = x has occurred.

Appendix A - Probability Basics
Variance
• The variance of a random variable, denoted as var(Y),
characterizes the spread of the probability density function.

[Figure: Two distributions with different variances — the narrower pdf has the smaller variance and the more spread-out pdf has the larger variance.]

• With two or more random variables, the conditional variance, denoted as var(Y|X = x), characterizes the spread of the conditional probability density function.
Appendix A - Probability Basics
Covariance

• The covariance between two random variables, denoted as cov(Y, X), is a measure of the linear association between them.
• If the covariance between Y and X is positive,
then when the value of Y is above its population
mean value, then the value of X also tends to be
above its population mean.
• If the covariance between Y and X is negative,
then when the value of Y is above its population
mean value, then the value of X tends to be
below its mean.

Appendix A - Probability Basics
Correlation

• The correlation between two random variables is a unit-free measure of the degree of linear association between them.
• Unlike the covariance, the correlation must lie between
-1 and 1.
• A correlation between Y and X of -1 means that Y is a
perfect negative function of X.
• A correlation of 1 means that Y is a perfect positive
function of X.
• A correlation of zero means that there is no linear
association between Y and X.

Appendix A - Probability Basics
Statistical independence

• Two random variables are statistically independent if the conditional probability that Y is equal to y given that X = x has occurred is equal to the probability that Y is equal to y [13].
• Said another way, two random variables are statistically independent if knowing the value of one variable provides no information about the other [14].

Appendix A - Probability Basics
Normality
• The normal distribution, denoted as N(μ, σ²), is a probability density function that is symmetric and centered around its population mean value μ.
• Random variables that have "bell-shaped" probability density functions are said to be "normally distributed."
• In the figure below, the random variable Y is normally distributed with mean μ = E(Y) and variance σ². The area under the normal pdf between μ − 1.96σ and μ + 1.96σ is 0.95, where σ is the square root of the variance, or the standard deviation of Y.

[Figure: The normal probability density, with the area between μ − 1.96σ and μ + 1.96σ equal to 0.95.]
Appendix B - Assumptions of the
Multiple Linear Regression Model
• A1. yi = β0 + β1xi1 + β2xi2 + ⋯ + βKxiK + ei, i = 1, …, N, correctly describes the relationship between y and x in the population.
• A2. The data pairs (xi1, xi2, …, xiK, yi), i = 1, …, N, are obtained by random sampling (see Appendix C).
• A3. E(e|x1, x2, …, xK) = 0. The expected value of the error term conditional on x1, x2, …, xK is zero.
• Sometimes y can be above the population regression line, and sometimes y can be below the population regression line, but on average, y falls on the population regression line.
> This means that the expected value of e = y − E(y|x1, x2, …, xK), conditional on any value of xk, is zero.
• If xk and e are correlated, then it can be shown that E(e|x1, x2, …, xK) ≠ 0.
Appendix B - Assumptions of the
Multiple Linear Regression Model
• A4. In the sample, each xk must take on at least 2 different values, and the values of each xk are not exact linear functions of the other explanatory variables (no perfect multicollinearity).
• Perfect multicollinearity is the condition where variables are essentially redundant.
• Under perfect multicollinearity, the least squares procedure fails.
• A5. var(e|x1, x2, …, xK) = σ². The variance of the error term, conditional on any value of x, is a constant σ²:
> It is assumed to be the same for each observation.
> It is not directly related to any of the explanatory variables.
Errors with this property are said to be homoskedastic.
• A6. The distribution of the error term is normal.

If assumptions A1 through A5 hold, then the OLS estimators are the best linear unbiased estimators (BLUE) of the regression coefficients.
Appendix C - Random Sampling
• By random sampling, we mean that the process by which the data are collected is such that each observation (yi, xi1, xi2, …, xiK) is statistically independent of every other observation.
• Statistical independence means that knowing the values (y, x1, x2, …, xK) for one observation provides no information about the values for another observation.
Glossary

• alternative hypothesis. See hypothesis test.


• best linear unbiased estimator (BLUE). An estimator that has the
smallest variance of any estimator that is a linear function of the sample
values Y and is unbiased.
• bias. The expected value of the difference between an estimator and the
parameter that it is estimating. A biased estimator of a parameter differs on
average from the true parameter.
• ceteris paribus. Hold all other relevant factors fixed.
• coefficient. The intercept and slope(s) of the population regression line, also
known as the parameters of the population regression line.
• consistent estimator. An estimator that tends to become more and more
accurate as the sample size grows.
• correlation. A unit-free measure of the extent to which two random
variables move, or vary, together. Two variables are correlated positively if,
on average, they move in the same direction; two variables are correlated
negatively if, on average, they move in opposite directions.

Glossary

• covariance. A measure of the extent to which two random variables move


together. If the covariance between Y and X is positive, then when the value
of Y is above its population mean value, then the value of X also tends to be
above its population mean. If the covariance between Y and X is negative,
then when the value of Y is above its population mean value, then the value
of X tends to be below its mean.
• cross-price elasticity of demand. The percentage change in the quantity
demanded of one good resulting from a 1-percent increase in the price of
another good.
• dependent variable. The variable to be explained or predicted in a
regression model.
• dummy variable (see indicator variable).
• elasticity. A unit-free measure of how responsive an economic variable is to
changes in another. It is measured as the percentage change in one variable
given a 1-percent increase in another variable.
• endogenous variable. A variable that is correlated with the error term.
• errors-in-variables bias. The bias in an estimator of a regression coefficient
that arises when an explanatory variable is measured with error.
Glossary

• estimate. The numerical value of an estimator computed from the data in a particular sample.
• estimator. A procedure for using sample data to compute estimates of the
population parameters. For example, the least squares estimators are a set of
formulas obtained from the solution of the least squares problem. When
sample data values are plugged into the least squares estimators, estimates
of the regression coefficients are obtained; these numerical values vary from
sample to sample.
• exogenous variable. A variable that is uncorrelated with the error term.
• expected value of a random variable. The long-run average value of a
random variable over many repeated trials or occurrences.
• explanatory variable. A variable appearing on the right-hand-side of a
regression model that is associated with changes in the dependent variable.
• F-test. A statistical test (based on an F-ratio) of the null hypothesis that a
group of explanatory variables are jointly equal to 0. When applied to all the
explanatory variables in a multiple regression model, the F-test tests the
overall significance of the model.

Glossary

• feedback. When changes in an explanatory variable affect the values of the


dependent variable, and changes in the dependent variable also affect the
explanatory variable. When both effects occur at the same time, the two
variables are described as being determined simultaneously.
• fitted value. The estimated value for the dependent variable; in a linear
regression, this value is calculated as the intercept plus a weighted average
of the values of the explanatory variables, with the estimated coefficients
used as weights.
• fixed effects. Binary (0/1) variables indicating the entity or time period in a
panel data regression.
• fixed effects regression model. A panel data regression that includes entity
fixed effects.
• functional form misspecification. When the mathematical form of the
regression function does not match the form of the population regression
function.
• Gauss-Markov Theorem. States that under certain conditions, the OLS
estimator is the best linear unbiased estimator (BLUE) of the regression
coefficients.
Glossary

• heteroskedasticity. The situation in which the variance of the error


associated with a regression model is not constant.
• homoskedasticity. The variance of the error term, conditional on the
explanatory variables is constant.
• hypothesis test. A statement about the coefficients in a multiple
regression model. The null hypothesis states that certain coefficients have
specified values or ranges; the alternative hypothesis would specify other
values or ranges.
• imperfect multicollinearity. The situation in which two or more explanatory
variables are highly correlated.
• independence. Two random variables are statistically independent if
knowing the value of one variable provides no information about the other.
• independent variable (see explanatory variable).
• indicator variable. A variable that takes on only two values, usually 0 and 1, with one value indicating the presence of a characteristic, attribute, or effect (1), and the other value indicating its absence (0).
• intercept. The value of the dependent variable when each of the explanatory variables takes on the value of 0 in a regression equation.
Glossary

• linear least squares. The estimator of the regression intercept and slope(s)
that minimizes the sum of squared residuals.
• linear-log model. A nonlinear regression function in which the dependent
variable is y and the independent variable is the natural logarithm transformation of x.

• linear regression model. A regression model with a constant slope.


• log-linear model. A nonlinear regression function in which the dependent
variable is the natural logarithm transformation of y and the independent
variable is x.
• log-log model. A nonlinear regression function in which both the dependent
variable and the independent variable(s) are in their natural logarithm
transformations.
• measurement-error bias (see errors-in-variables bias).
• multiple regression model. An extension of the simple linear regression model
that allows the dependent variable to depend on multiple explanatory variables.
• nonlinear regression model. A model having the property that changes in
explanatory variables will have differential effects on the dependent variable as

the values of the explanatory variables change.
Glossary

• normal distribution. A bell-shaped probability distribution having the property that about 95% of the distribution lies within 2 standard deviations of the mean.
• null hypothesis. The hypothesis being tested in a hypothesis test. See
hypothesis test.
• omitted variable bias. The bias in an estimator that arises because a
variable that is a determinant of the dependent variable y and correlated with
at least one of the independent variables x has been omitted from the
regression.
• ordinary least squares (see linear least squares).
• p-value. The significance level in a statistical test; the probability of getting a
test statistic as extreme or more extreme than the observed value. The
larger the p-value, the more likely that the null hypothesis is valid.
• parameter. A constant that determines a characteristic of a probability
distribution or a population regression function.
• perfect multicollinearity. The situation in which one of the explanatory
variables is an exact linear function of other explanatory variables.

• population. All the units of interest to the researcher; also, universe.
Glossary

• population regression function or line. The relationship between the


dependent variable y and the independent variable(s) that holds on average
in the population.
• price elasticity of demand. The percentage change in quantity demanded
resulting from a 1-percent increase in price.
• probability density function (pdf) of a random variable. For a discrete
random variable, the pdf lists all possible outcomes and the probability that
each will occur. For a continuous random variable, the area underneath the
probability density function between any two points is the probability that the
random variable falls between those two points.
• R-squared (R²). In a regression, the fraction of the sample variance of the
dependent variable that is explained by the explanatory variable(s). R-
squared is the most commonly used measure of goodness of fit of a
regression model.
• random error term. A term in a regression model that reflects random error
that is the result of chance.
• random variable. A variable is random if its value is not known until a
sample is collected.
Glossary

• reduced form model. The model resulting from solving a system of


equations to express the endogenous variables as a function of the
exogenous variables.
• regression residual. The difference between the actual value of a
dependent variable and the value predicted by the regression equation (the
fitted value).
• sample. A selection of data chosen for a study; a subset of a population.
• significance level. A prespecified rejection probability of a statistical
hypothesis test when the null hypothesis is true.
• simple regression model. A linear regression model where the dependent
variable depends on a single explanatory variable.
• simultaneous causality bias. The bias in an estimator of a regression
coefficient that arises when in addition to the causal link from the explanatory
variable x to the dependent variable y, there is a causal link from y to x .
• slope. The change in the dependent variable associated with a one-unit
change in an explanatory variable.
• standard deviation. The square root of the variance.

Glossary

• standard error of the coefficient; standard error (SE). An estimator of the


standard deviation of the estimator. The standard deviation of an estimator
measures the variation of a parameter estimate or coefficient about the true
parameter. The standard error is a standard deviation that is calculated from
the probability distribution of estimated parameters.
• statistical significance. A test used to evaluate the degree of association
between a dependent variable and one or more explanatory variables. If the
calculated p-value is smaller than 5%, the result is said to be statistically
significant (at the 5% level). If p is greater than 5%, the result is statistically
insignificant (at the 5% level).
• sum of squares due to error. The sum of the squared OLS residuals.
• sum of squares due to the regression. The part of total sum of squares
that is explained by the regression. It is measured as the sum of squared
deviations of the fitted values ŷi from their average ȳ.
• t-statistic. A test statistic that describes how far an estimate of a parameter
is from its hypothesized value (i.e., given a null hypothesis). If a t-statistic is
sufficiently large (in absolute magnitude), an expert can reject the null
hypothesis.

Glossary

• t-test. A test of the null hypothesis that a regression coefficient takes on a


particular value, usually 0. The test is based on the t-statistic.
• total sum of squares. The sum of squared deviations of the yi from their average ȳ.
• type I error. In hypothesis testing, the error made when the null hypothesis
is true but is rejected.
• variable. Any attribute, phenomenon, condition, or event that can have two
or more values.
• variance. The expected value of the squared difference between a random
variable and its mean.

