LEC12_ECMT

The document discusses the criteria for selecting good models in empirical analysis, emphasizing the importance of parsimony, parameter constancy, and goodness of fit. It outlines common specification errors, such as omitting relevant variables or including unnecessary ones, and their consequences on regression results. Additionally, it addresses the impact of measurement errors and the significance of testing for specification errors using methods like the RESET and MWD tests.

SS|FINANCE|RU
MODEL SELECTION
CRITERIA AND TESTS
Chapter 7

ATTRIBUTES OF A GOOD MODEL
Whether a model used in empirical analysis is good/appropriate, or the
"right" model, cannot be determined without some reference criteria or
guidelines.
• Principle of Parsimony: the principle of parsimony emphasizes the
necessity of keeping a model as simple as possible since complete
representation of reality is unattainable.
• Parameter constancy: parameter constancy is essential, meaning that
the values of model parameters should remain stable. Fluctuating
parameter values make forecasting challenging, and the true test of a
model lies in comparing its predictions with actual observations.
• Goodness of Fit: in regression analysis, the primary goal is to explain the
maximum variance in the dependent variable using the explanatory
variables. A model is considered good when it can account for a
substantial portion of this variance, which is typically assessed through
R² and adjusted R².
ATTRIBUTES OF A GOOD MODEL
• Theoretical Consistency: even if the goodness-of-fit measures are very
high, a model may still not be deemed good if it exhibits coefficients
with incorrect signs. For instance, in the demand function for a
commodity, if the price coefficient shows a positive sign (indicating an
upward-sloping demand curve), the model is suspect unless the commodity
is a Giffen good. It is imperative to verify that the coefficients align
with theoretical expectations.
• It is one thing to list criteria of a ‘good’ model and quite another to develop
it, because in practice one is likely to commit various model specification
errors, which we discuss next.
• One of the assumptions of the classical linear regression model (CLRM), is
that the regression model used in the analysis is ‘correctly’ specified: If the
model is not “correctly” specified, we encounter the problem of model
specification error or model specification bias.
TYPES OF SPECIFICATION ERRORS
SPECIFICATION BIAS/ERRORS
• Knowing the consequences of specification errors is one thing but finding
out whether one has committed such errors is quite another, because we
do not deliberately commit such errors.
• Very often specification errors arise inadvertently, perhaps from our inability
to formulate the model as precisely as possible because the underlying
theory is weak or because we do not have the right kind of data to test the
model.
• Because of the non-experimental nature of economics, we are never sure
how the observed data were generated.
• In such a set up, the practical question then is not why specification errors
are made, but how to detect them. Once it is found that specification errors
have been made, the remedies often suggest themselves.
• We will concentrate on the following biases, which are directly related to
functional misspecification:
#1. Omission of relevant X variables
#2. Inclusion of unnecessary variable(s)
BIAS DUE TO EXCLUSION OF RELEVANT X
VARIABLE(S): UNDERFITTING A MODEL
• The estimation of a regression model without relevant explanatory
variable(s) may introduce bias into the estimates. For instance, assume
that the correct function explaining variation in Y is as follows:
Yᵢ = β₁ + β₂X₂ᵢ + β₃X₃ᵢ + uᵢ [1]
In mean-deviated form (lowercase letters denote deviations from means),
EQ.[1] can be written as:
yᵢ = β₂x₂ᵢ + β₃x₃ᵢ + uᵢ (correct model) [2]
• Now, assume that either due to ignorance about the true relation or
unavailability of relevant data on X₃, the following regression equation is
estimated:
yᵢ = α₂x₂ᵢ + vᵢ (estimated model) [3]
• It can be shown that E(α̂₂) is different from β₂. See Gujarati and Porter,
"Basic Econometrics", 5th Ed., p. 519, for the proof.
BIAS DUE TO EXCLUSION OF RELEVANT X
VARIABLE(S)
• On applying OLS to the misspecified regression model in EQ.[3], we
obtain:
α̂₂ = Σx₂ᵢyᵢ / Σx₂ᵢ² [4]
• Contrarily, the normal equations of the correctly specified regression
model in EQ.[2] are:
Σx₂ᵢyᵢ = β₂Σx₂ᵢ² + β₃Σx₂ᵢx₃ᵢ (EQ.[2] multiplied by x₂ᵢ and summed) [5]
Σx₃ᵢyᵢ = β₂Σx₂ᵢx₃ᵢ + β₃Σx₃ᵢ² (EQ.[2] multiplied by x₃ᵢ and summed) [6]
• Dividing EQ.[5] by Σx₂ᵢ², we obtain:
E(α̂₂) = β₂ + β₃b₃₂ [7]
where b₃₂ = Σx₂ᵢx₃ᵢ / Σx₂ᵢ² is the slope coefficient in the regression of the
omitted variable x₃ on the included variable x₂.
BIAS DUE TO EXCLUSION OF RELEVANT X
VARIABLE(S)
• Therefore, E(α̂₂) = β₂ iff the term β₃b₃₂ = 0.
• Now, from EQ.[7], we can obtain the bias introduced by the exclusion of a
relevant explanatory variable as:
Specification bias = E(α̂₂) − β₂ = β₃b₃₂ [8]
• The above exposition tells us that the bias due to exclusion of relevant
explanatory variable(s) depends on two terms:
(a) β₃, the regression coefficient of the explanatory variable(s) excluded
from the fitted model.
(b) b₃₂, the relationship between the explanatory variable(s) dropped
and kept in the fitted regression model.

Note: We work with data in mean-deviation form to simplify the derivations,
since in econometrics most (but not all) intercept terms in fitted regression
models do not have any economic interpretation.
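The bias formula in EQ.[8] can be illustrated with a small simulation. The sketch below uses only the Python standard library; the parameter values (β₂ = 2, β₃ = 1.5, the 0.5 link between x₂ and x₃) are illustrative choices, not taken from the lecture.

```python
import random

random.seed(42)
n = 10_000
beta2, beta3 = 2.0, 1.5

# x3 is correlated with the included regressor x2
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [0.5 * a + random.gauss(0, 1) for a in x2]
y = [beta2 * a + beta3 * b + random.gauss(0, 1) for a, b in zip(x2, x3)]

def demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

x2d, x3d, yd = demean(x2), demean(x3), demean(y)

# misspecified regression of y on x2 alone, as in EQ.[4]
alpha2_hat = sum(a * c for a, c in zip(x2d, yd)) / sum(a * a for a in x2d)

# auxiliary slope b32 of the omitted x3 on the included x2, as in EQ.[7]
b32 = sum(a * b for a, b in zip(x2d, x3d)) / sum(a * a for a in x2d)

print(f"alpha2_hat          = {alpha2_hat:.3f}")
print(f"beta2 + beta3 * b32 = {beta2 + beta3 * b32:.3f}")
```

With β₃ > 0 and b₃₂ > 0, α̂₂ lands well above the true β₂ = 2, matching the sign prediction of EQ.[8].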
EXCLUSION OF RELEVANT X VARIABLE(S):
CONSEQUENCES
• If the left-out, or omitted, variable x₃ is correlated with the included variable
x₂, the slope coefficient α̂₂ of the fitted regression will be biased.
• The disturbance variance σ² is incorrectly estimated.
• The variance of α̂₂ is a biased estimator of the variance of the true
estimator β̂₂. The variance of α̂₂ will, on average, overestimate the true
variance, even if x₂ and x₃ are not correlated:
E[var(α̂₂)] = var(β̂₂) + β₃²Σx₃ᵢ² / [(n − 2)Σx₂ᵢ²]
• In consequence, the usual confidence-interval and hypothesis-testing
procedures are likely to give misleading conclusions about the statistical
significance of the estimated parameters.
• As another consequence, the forecasts based on the incorrect model and
the forecast (confidence) intervals will be unreliable.
INCLUSION OF UNNECESSARY/IRRELEVANT X
VARIABLES: CONSEQUENCES
• Another type of specification bias may arise when the set of explanatory
variables is enlarged by the inclusion of one or more irrelevant variables.
• The philosophy is that so long as you include the theoretically relevant
variables, inclusion of one or more unnecessary or 'nuisance' variables
will not hurt (unnecessary in the sense that there is no solid theory that
says they should be included).
• In that case inclusion of such variables will certainly increase R² (and
adjusted R² when the |t| value of the added variable exceeds 1).
• This is called overfitting a model. But if the variables are not
economically meaningful and relevant, such a strategy is not
recommended.
INCLUSION OF UNNECESSARY/IRRELEVANT X
VARIABLES: CONSEQUENCES
• Suppose the correctly specified model is as follows:
Yᵢ = β₁ + β₂X₂ᵢ + uᵢ (original/correct model) [9]
• But a researcher adds the unnecessary variable X₃ and estimates the
following model:
Yᵢ = α₁ + α₂X₂ᵢ + α₃X₃ᵢ + vᵢ (estimated/incorrect model) [10]
• The OLS estimators of the "incorrect" model are unbiased (as well as
consistent). That is, E(α̂₁) = β₁, E(α̂₂) = β₂, and E(α̂₃) = 0. If X₃ does not
belong to the model, α₃ is expected to be zero.
• Also, the estimator of the error variance σ² obtained from the over-fitted
regression is correctly estimated.
INCLUSION OF UNNECESSARY/IRRELEVANT X
VARIABLES: CONSEQUENCES
• The standard confidence-interval and hypothesis-testing procedures
based on the t and F tests remain valid.
• However, the α̂'s estimated from the over-fitted regression are inefficient:
their variances will generally be larger than those of the β̂'s estimated
from the true model.
• As a result, the confidence intervals based on the standard errors of the
α̂'s will be wider than those based on the standard errors of the β̂'s of
the true model.
• However, we can use the F-test statistic (pls. see lecture on multivariate
regression) to choose the right variable(s) for our model.
IS IT BETTER TO INCLUDE IRRELEVANT
VARIABLES THAN TO EXCLUDE THE RELEVANT
ONES?
• The addition of unnecessary variables will lead to a loss in the efficiency
of the estimators (i.e., larger standard errors) and may also lead to the
problem of multicollinearity (Why?), not to mention the loss of degrees of
freedom.
• The best approach is to include only explanatory variables that on
theoretical grounds directly influence the dependent variable and are not
accounted for by other included variables.
WRONG FUNCTIONAL FORM (LINEAR OR NON-
LINEAR)
• Sometimes researchers mistakenly do not account for the nonlinear nature
of variables in a model.
• Moreover, some dependent variables (such as wage, which tends to be
skewed to the right) are more appropriately entered in natural log form.
• Consider the following true marginal cost model:
MCᵢ = β₁ + β₂Qᵢ + β₃Qᵢ² + uᵢ [11]
• Instead, the econometrician estimated the following linear model:
MCᵢ = α₁ + α₂Qᵢ + vᵢ [12]
• Outside the range between points P and Q (where the two curves cross),
the fitted linear marginal cost curve will consistently underestimate the
true marginal cost, while between points P and Q it will consistently
overestimate it.
TESTING FOR SPECIFICATION
ERRORS
TESTS FOR OMITTED VARIABLE AND
FUNCTIONAL FORM OF REGRESSION EQUATION:
RESET TEST
• Consider the original model: Yᵢ = β₁ + β₂Xᵢ + uᵢ [13]
• RESET (Regression Equation Specification Error Test) adds polynomials of the OLS
fitted values to EQ.[13] above to detect general kinds of functional form
misspecification.
• To implement RESET, we need to decide how many functions of the fitted values to
include in the expanded regression. However, there is no right or wrong answer.
• Let Ŷᵢ denote the OLS fitted value.
• Consider the expanded model: Yᵢ = β₁ + β₂Xᵢ + δ₁Ŷᵢ² + δ₂Ŷᵢ³ + vᵢ [14]
• We use this equation to test whether the original equation has missed important
nonlinearities.
• H₀: no misspecification, that is, δ₁ = δ₂ = 0. To test this hypothesis, use the F-test
statistic:
F = [(R²_new − R²_old)/m] / [(1 − R²_new)/(n − k)]
where m is the number of added regressors (here 2) and k is the number of
parameters in the expanded model.
• A significant F statistic suggests some sort of functional-form problem.
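The RESET steps can be sketched end to end with a small stdlib-only OLS helper. The data-generating process (a quadratic in X, fitted with a straight line) and all names are illustrative assumptions for the demonstration.

```python
import random

def ols(X, y):
    """OLS via Gauss-Jordan on the normal equations (X includes an intercept
    column). Returns (coefficients, fitted values, R-squared)."""
    n, k = len(X), len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    M = [XtX[a] + [Xty[a]] for a in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    beta = [M[a][k] for a in range(k)]
    fitted = [sum(b * v for b, v in zip(beta, X[i])) for i in range(n)]
    ybar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return beta, fitted, 1.0 - ss_res / ss_tot

random.seed(0)
n = 200
x = [random.uniform(1, 10) for _ in range(n)]
y = [1.0 + 0.5 * xi + 0.3 * xi ** 2 + random.gauss(0, 1) for xi in x]  # true DGP is quadratic

# Step 1: estimate the (misspecified) linear model, as in EQ.[13]
_, yhat, r2_old = ols([[1.0, xi] for xi in x], y)

# Step 2: add yhat^2 and yhat^3 to form the expanded model, as in EQ.[14]
_, _, r2_new = ols([[1.0, xi, f ** 2, f ** 3] for xi, f in zip(x, yhat)], y)

# Step 3: F statistic for H0: delta1 = delta2 = 0
m, k = 2, 4  # added regressors; parameters in the expanded model
F = ((r2_new - r2_old) / m) / ((1 - r2_new) / (n - k))
print(f"RESET F = {F:.2f}")
```

Because the true relationship is quadratic, the fitted-value powers soak up substantial residual variation and the F statistic comes out far above any conventional critical value.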
TESTS FOR OMITTED VARIABLE AND FUNCTIONAL FORM
OF REGRESSION EQUATION: MWD TEST
• MacKinnon-White-Davidson (MWD) test
• Consider the following two models:
Linear: Yᵢ = β₁ + β₂X₂ᵢ + β₃X₃ᵢ + uᵢ
Log-linear: ln Yᵢ = α₁ + α₂ ln X₂ᵢ + α₃ ln X₃ᵢ + uᵢ
• To illustrate the use of the MWD test to identify which functional form is correct,
we specify the hypotheses as follows:
H₀ (Linear Model): Y is a linear function of the X's.
H₁ (Log-linear Model): ln Y is a linear function of the logs of the X's.
TESTS FOR OMITTED VARIABLE AND FUNCTIONAL FORM OF
REGRESSION EQUATION: MWD TEST…
The MWD test involves the following steps:
• Estimate the linear model and obtain the fitted values, Yf.
• Estimate the log-linear model and obtain the fitted values, ln f.
• Create a new variable Z₁ᵢ = ln(Yfᵢ) − ln fᵢ.
• Regress Y on the X's and Z₁.
• Reject H₀ if the coefficient of Z₁ is statistically significant by the usual t test.
• Obtain Z₂ᵢ = antilog(ln fᵢ) − Yfᵢ.
• Regress ln Y on the logs of the X's and Z₂.
• Reject H₁ if the coefficient of Z₂ in the preceding equation is statistically
significant.
☛ The idea behind the MWD test is simple: if the linear model is in fact correct,
the constructed variable Z₁ should not be significant, because in that case the
estimated Y values from the linear model and those estimated from the log-linear
model (after taking their antilogs for comparative purposes) should not be very
different.
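The steps above can be sketched with a single-regressor version of the test. This is a stdlib-only illustration under assumed data (a log-linear process with x kept well away from zero so the linear fitted values stay positive for the log in Z₁); the OLS helper computes t statistics from the inverse of X′X.

```python
import math
import random

def ols_tstats(X, y):
    """OLS via Gauss-Jordan on the augmented system [X'X | I | X'y].
    Returns (coefficients, t statistics, fitted values)."""
    n, k = len(X), len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    M = [XtX[a] + [1.0 if b == a else 0.0 for b in range(k)] + [Xty[a]]
         for a in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    beta = [M[a][-1] for a in range(k)]
    fitted = [sum(b * v for b, v in zip(beta, X[i])) for i in range(n)]
    s2 = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - k)
    tstats = [beta[a] / math.sqrt(s2 * M[a][k + a]) for a in range(k)]
    return beta, tstats, fitted

random.seed(1)
n = 400
x = [random.uniform(3, 10) for _ in range(n)]
# assumed DGP: the log-linear model is the true one, ln Y = 0.5 + 1.5 ln X + u
y = [math.exp(0.5 + 1.5 * math.log(xi) + random.gauss(0, 0.1)) for xi in x]
lny = [math.log(yi) for yi in y]
lnx = [math.log(xi) for xi in x]

_, _, yf = ols_tstats([[1.0, xi] for xi in x], y)       # linear fit -> Yf
_, _, lnf = ols_tstats([[1.0, li] for li in lnx], lny)  # log-linear fit -> lnf
z1 = [math.log(f) - l for f, l in zip(yf, lnf)]         # Z1 = ln(Yf) - lnf
_, t_lin, _ = ols_tstats([[1.0, xi, zi] for xi, zi in zip(x, z1)], y)
z2 = [math.exp(l) - f for l, f in zip(lnf, yf)]         # Z2 = antilog(lnf) - Yf
_, t_log, _ = ols_tstats([[1.0, li, zi] for li, zi in zip(lnx, z2)], lny)

print(f"t on Z1 in the linear regression: {t_lin[2]:.2f}")
print(f"t on Z2 in the log-linear regression: {t_log[2]:.2f}")
```

Since the data are generated by the log-linear model, the t statistic on Z₁ is large (rejecting H₀), while the t on Z₂ is typically insignificant, supporting the log-linear form.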
TESTS FOR OMITTED VARIABLE AND FUNCTIONAL
FORM OF REGRESSION EQUATION: LM TEST
• This is an alternative to the RESET test. Estimate the model in EQ.[13] and
obtain the estimated residuals, ûᵢ.
• If in fact EQ.[13] is the correct model, then the residuals obtained from this
model should not be related to the regressors omitted from the model.
• We now regress ûᵢ on the regressors in the original model and the omitted
variables (e.g., higher-order terms of X):
ûᵢ = α₁ + α₂Xᵢ + α₃Xᵢ² + α₄Xᵢ³ + εᵢ [15]
• If the sample size is large, it can be shown that n (the sample size) times the
R² obtained from the auxiliary regression follows a χ² distribution, with degrees
of freedom equal to the number of omitted regressors; symbolically, nR² ~ χ²(df).
• If the computed χ² value > critical χ² value at the chosen level of significance,
we reject the null of no misspecification.
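A minimal LM-test sketch, again stdlib-only and under an assumed quadratic data-generating process; here a single omitted term (x²) is used in the auxiliary regression, so nR² is compared against χ²(1).

```python
import random

def ols(X, y):
    """OLS via Gauss-Jordan on the normal equations.
    Returns (coefficients, fitted values, R-squared)."""
    n, k = len(X), len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    M = [XtX[a] + [Xty[a]] for a in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    beta = [M[a][k] for a in range(k)]
    fitted = [sum(b * v for b, v in zip(beta, X[i])) for i in range(n)]
    ybar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return beta, fitted, 1.0 - ss_res / ss_tot

random.seed(7)
n = 200
x = [random.uniform(0, 5) for _ in range(n)]
y = [2 + 1.5 * xi + 0.8 * xi ** 2 + random.gauss(0, 1) for xi in x]  # true DGP has x^2

# Step 1: estimate the restricted (linear) model and keep the residuals
_, fitted, _ = ols([[1.0, xi] for xi in x], y)
u = [yi - fi for yi, fi in zip(y, fitted)]

# Step 2: auxiliary regression of the residuals on the original regressor
# plus the omitted term (here x^2), as in EQ.[15]
_, _, r2_aux = ols([[1.0, xi, xi ** 2] for xi in x], u)

# Step 3: LM = n * R^2 ~ chi-square(1) under the null of no misspecification
LM = n * r2_aux
print(f"LM = {LM:.1f}  (chi-square(1) 5% critical value = 3.84)")
```

The omitted quadratic term leaves a strong systematic pattern in the residuals, so nR² far exceeds the 3.84 critical value and the null of no misspecification is rejected.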
FACTORS OTHER THAN SPECIFICATION
ERRORS THAT CAN AFFECT REGRESSION
RESULTS
ERRORS OF MEASUREMENT
• So far, we have assumed implicitly that the dependent variable Y and the
explanatory variables, the X's, are measured without any errors.
• Although not explicitly spelled out, this presumes that the values of the
dependent as well as independent variables are accurate. That is, they are
not guess estimates, extrapolated, interpolated or rounded off in any
systematic manner or recorded with errors.
Consequences of errors of measurement in the dependent variable:
1. The OLS estimators are still unbiased.
2. The variances and standard errors of the OLS estimators are still unbiased.
3. But the estimated error variances, and ipso facto the standard errors,
are larger than in the absence of such errors.
In short, errors of measurement in the dependent variable do not pose a
very serious threat to OLS estimation.
ERRORS OF MEASUREMENT…
Consequences of errors of measurement in the independent variables:
1. The OLS estimators are biased as well as inconsistent.
2. Errors in a single regressor can lead to biased and inconsistent estimates of the
coefficients of the other regressors in the model.
It is not easy to establish the size and direction of the bias in the estimated
coefficients.
3. It is often suggested that we use instrumental or proxy variables for variables
suspected of having measurement errors.
The proxy variables must satisfy two requirements: that they are highly
correlated with the variables for which they are a proxy, and that they are
uncorrelated with the usual equation error as well as the measurement error.
But such proxies are not easy to find.
We should thus be very careful in collecting the data and making sure that such
errors of measurement are kept to a minimum.
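The bias from measurement error in a regressor (classical attenuation toward zero) can be shown in a few lines. This is an illustrative simulation, not from the lecture; with classical errors-in-variables the OLS slope converges to βλ, where λ = σ²ₓ/(σ²ₓ + σ²_w).

```python
import random

random.seed(3)
n = 20_000
beta = 2.0
sig_x, sig_w = 1.0, 1.0  # std devs of the true regressor and the measurement error

xstar = [random.gauss(0, sig_x) for _ in range(n)]            # true regressor
y = [beta * xs + random.gauss(0, 1) for xs in xstar]
x_obs = [xs + random.gauss(0, sig_w) for xs in xstar]         # recorded with error

def slope(x, y):
    """OLS slope of y on x (simple regression)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

b_hat = slope(x_obs, y)
lam = sig_x ** 2 / (sig_x ** 2 + sig_w ** 2)                  # attenuation factor
print(f"b_hat = {b_hat:.3f}, plim = {beta * lam:.3f}")        # biased toward zero
```

With equal signal and error variances, λ = 0.5, so the estimated slope converges to about half the true coefficient, illustrating why the bias does not vanish even in large samples (inconsistency).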
OUTLIERS, LEVERAGE, AND INFLUENTIAL POINTS:
OUTLIER
• An observation could be unusual w.r.t. its y-value or x-value. However, rather
than calling them x- or y-unusual observations, they are categorized as outlier,
leverage, and influential points according to their impact on the regression
model.
OUTLIER:
• An outlier is an unusual observation w.r.t. either its x-value or its y-value.
• An x-outlier may seriously affect the regression outcomes. In an unplanned
study, the data is often collected before putting much thought into it. In those
situations, there is a possibility of having x-outliers.
• The y-outliers are usually not as severe as the x-outliers. Nevertheless, the
effects of y-outliers must be investigated further to check whether they reflect a
simple data-entry error, some severe issue in the process, or just a random
phenomenon.
OUTLIERS…
• The figure below shows both an x-outlier (left) and a y-outlier (right).
• Both plots show that a better linear relationship would be possible without
these outliers.
• In this situation, the x-outlier is rotating the line clockwise, changing both
the slope and the intercept of the relationship, while the y-outlier is moving
the predicted line upward.
• The solid line shows the predicted relationship without these outliers.
A SIMPLE WAY TO IDENTIFY OUTLIERS: IQR
• To explain the IQR method easily, let's start with a box plot.
• A box plot tells us about the distribution of the data. It gives a sense of how
much the data is spread about, what its range is, and its skewness.
• As you might have noticed in the figure, a box plot enables us to draw
inferences about ordered data, i.e., it tells us about the various metrics of
data arranged in ascending order.
• The median (or center point), also called the second quartile, splits the
ordered data in half.
• Q1 is the first quartile of the data, i.e., 25% of the data lies between the
minimum and Q1.
• Q3 is the third quartile of the data, i.e., 75% of the data lies between the
minimum and Q3.
A SIMPLE WAY TO IDENTIFY OUTLIERS: IQR EXAMPLE
• A survey was given to a random sample of 20 first-year university students.
They were asked "How many textbooks do you own?" Their responses were:
0, 0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 12, 14, 15, 20, and 25
• To detect the outliers using the IQR method, we define a new range, let's call
it the decision range, and any data point lying outside this range is considered
an outlier and is dealt with accordingly:
Lower fence = Q1 − 1.5 × IQR = 8 − 1.5 × 4 = 2, and
Upper fence = Q3 + 1.5 × IQR = 12 + 1.5 × 4 = 18
• Thus, any observation of fewer than 2 books or more than 18 books is an
outlier.
Why only 1.5 times the IQR? Why not any other number?
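The textbook example can be checked directly in code. The sketch below computes Q1 and Q3 as the medians of the lower and upper halves of the sorted data (the inclusive-halves convention, which reproduces the slide's fences of 2 and 18; other quartile conventions give slightly different values).

```python
books = [0, 0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 12, 14, 15, 20, 25]

def median(v):
    """Median of an already-sorted list."""
    m = len(v) // 2
    return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2

def quartiles(data):
    """Q1/Q3 as medians of the lower and upper halves (inclusive-halves method)."""
    s = sorted(data)
    half = len(s) // 2
    return median(s[:half]), median(s[-half:])

q1, q3 = quartiles(books)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [b for b in books if b < lower or b > upper]

print(q1, q3, iqr)        # 8 12 4
print(lower, upper)       # 2.0 18.0
print(outliers)           # [0, 0, 20, 25]
```

So the decision range is [2, 18], and the two zeros, 20, and 25 are flagged as outliers, exactly as in the slide.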
LEVERAGE
• A leverage point is a data point whose x-value is unusual but whose y-value
follows the predicted regression line.
• A leverage point may look okay as it sits on the predicted regression line.
• However, a leverage point will inflate the strength of the regression
relationship, both in statistical significance (reducing the p-value and
increasing the chance of a significant relationship) and in practical
significance (increasing r-square).
• Leverage points have little impact on the coefficients, however, because
the point follows the predicted regression line.
LEVERAGE…
Leverage Point (Right) in a Regression Analysis
INFLUENTIAL POINTS
• A data point that unduly influences the regression results. A point is
considered influential if its exclusion causes major changes in the fitted
regression function. Depending on the location of the point, it may affect
all statistics, including the p-value, r-square, coefficients, and intercept.
• The solid line in the figure shows the predicted relationship.
HOW DO WE HANDLE UNUSUAL DATA POINTS?
• Should we just drop them and confine our attention to the remaining data points?
• Automatic rejection of unusual data points is not always a wise procedure.
• Sometimes an outlier provides information that other data points cannot, because
it arises from an unusual circumstance which may be of interest and requires
further investigation rather than rejection.
• As a rule, unusual data points should be rejected out of hand only if they can be
traced to recording errors.
1. Winsorization: replaces extreme values with less extreme but still plausible
values. By capping or flooring the outliers, we can reduce their impact on the
statistical analysis.
2. Data transformation: using measures such as logarithms can help mitigate the
impact of outliers. Such transformations can make the data more normally
distributed and lessen the influence of extreme values.
3. Removing or adjusting outliers: in some cases, we may have domain knowledge
that justifies removing or adjusting individual outliers.
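Winsorization (option 1 above) can be sketched on the textbook data from the IQR example. The percentile indexing below is a deliberately crude nearest-rank convention for illustration; statistical libraries offer more careful interpolation schemes.

```python
def winsorize(data, lower_pct, upper_pct):
    """Cap values below the lower percentile and above the upper percentile
    (crude nearest-rank percentiles; conventions vary across libraries)."""
    s = sorted(data)
    lo = s[int(lower_pct * (len(s) - 1))]
    hi = s[int(upper_pct * (len(s) - 1))]
    return [min(max(v, lo), hi) for v in data]

books = [0, 0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 12, 14, 15, 20, 25]
w = winsorize(books, 0.10, 0.90)
print(w)  # the two extreme values 20 and 25 are capped to 15
```

The capped series keeps all 20 observations (unlike outright deletion), but the extreme values 20 and 25 no longer dominate means, variances, or fitted regression lines.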
