LEC12_ECMT
RU
MODEL SELECTION
CRITERIA AND TESTS
Chapter 7
ATTRIBUTES OF A GOOD MODEL
Whether a model used in empirical analysis is good, appropriate, or the
"right" model cannot be determined without some reference criteria or
guidelines.
• Principle of Parsimony: the principle of parsimony emphasizes the
necessity of keeping a model as simple as possible since complete
representation of reality is unattainable.
• Parameter constancy: parameter constancy is essential, meaning that
the values of model parameters should remain stable. Fluctuating
parameter values make forecasting challenging, and the true test of a
model lies in comparing its predictions with actual observations.
• Goodness of Fit: in regression analysis, the primary goal is to explain the
maximum variance in the dependent variable using the explanatory
variables. A model is considered good when it can account for a
substantial portion of this variance, which is typically assessed through
R^2 and adjusted R^2.
SS|FINANCE|
ATTRIBUTES OF A GOOD MODEL
• Theoretical Consistency: even if the goodness of fit measures are very
high, a model may still not be deemed good if it exhibits coefficients
with incorrect signs. For instance, in the demand function for a
commodity, a positive price coefficient would imply an upward-sloping
demand curve, which contradicts theory except for Giffen goods. It is
therefore imperative to verify that the coefficients align with theoretical
expectations.
• It is one thing to list criteria of a ‘good’ model and quite another to develop
it, because in practice one is likely to commit various model specification
errors, which we discuss next.
• One of the assumptions of the classical linear regression model (CLRM), is
that the regression model used in the analysis is ‘correctly’ specified: If the
model is not “correctly” specified, we encounter the problem of model
specification error or model specification bias. SS|FINANCE|
RU
SS|FINANCE|RU
TYPES OF SPECIFICATION ERRORS
SPECIFICATION BIAS/ERRORS
• Knowing the consequences of specification errors is one thing but finding
out whether one has committed such errors is quite another, because we
do not deliberately commit such errors.
• Very often specification errors arise inadvertently, perhaps from our inability
to formulate the model as precisely as possible because the underlying
theory is weak or because we do not have the right kind of data to test the
model.
• Because of the non-experimental nature of economics, we are never sure
how the observed data were generated.
• In such a setup, the practical question then is not why specification errors
are made, but how to detect them. Once it is found that specification errors
have been made, the remedies often suggest themselves.
• We will concentrate on the following biases, directly related to functional
mis-specification:
#1. Omission of relevant X variables
#2. Inclusion of unnecessary variable(s)
BIAS DUE TO EXCLUSION OF RELEVANT X VARIABLE(S): UNDERFITTING A MODEL
• The estimation of a regression model without relevant explanatory
variable(s) may introduce bias into the estimates. For instance, assume
that the correct function explaining variation in Y is as follows:
Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i    [1]
In mean-deviated form, EQ.[1] can be written as:
y_i = \beta_2 x_{2i} + \beta_3 x_{3i} + u_i    (correct model) [2]
• Now, assume that either due to ignorance about the true relation or
unavailability of relevant data on X_3, the following regression equation is
estimated:
y_i = b_2 x_{2i} + v_i    (estimated model) [3]
• It can be shown that b_2 is different from \beta_2. See Gujarati and Porter,
"Basic Econometrics," 5th ed., p. 519, for the proof.
BIAS DUE TO EXCLUSION OF RELEVANT X VARIABLE(S)
• On applying OLS to the misspecified regression model in EQ.[3], we
obtain:
b_2 = \frac{\sum x_{2i} y_i}{\sum x_{2i}^2}    [4]
• Contrarily, the normal equations of the correctly specified regression
model in EQ.[2] are:
\sum x_{2i} y_i = \hat{\beta}_2 \sum x_{2i}^2 + \hat{\beta}_3 \sum x_{2i} x_{3i}    (EQ.[2] multiplied by x_{2i} and summed) [5]
\sum x_{3i} y_i = \hat{\beta}_2 \sum x_{2i} x_{3i} + \hat{\beta}_3 \sum x_{3i}^2    (EQ.[2] multiplied by x_{3i} and summed) [6]
• Dividing EQ.[5] by \sum x_{2i}^2, we obtain:
b_2 = \hat{\beta}_2 + \hat{\beta}_3 \frac{\sum x_{2i} x_{3i}}{\sum x_{2i}^2} = \hat{\beta}_2 + \hat{\beta}_3 b_{32}    [7]
where b_{32} is the slope coefficient in the regression of the omitted variable
X_3 on the included variable X_2.
BIAS DUE TO EXCLUSION OF RELEVANT X VARIABLE(S)
• Therefore, b_2 = \hat{\beta}_2 iff the term \hat{\beta}_3 b_{32} = 0.
• Now, from EQ.[7], we can obtain the bias introduced by the exclusion of a
relevant explanatory variable as:
b_2 - \hat{\beta}_2 = \hat{\beta}_3 b_{32}    (specification bias) [8]
• The above exposition tells us that the bias due to exclusion of relevant
explanatory variable(s) depends on two terms:
(a) \hat{\beta}_3, the regression coefficient of the explanatory variable(s) excluded
from the fitted model.
(b) The relationship between the explanatory variable(s) dropped
and kept in the fitted regression model, that is, b_{32}.
Note: We work with data in mean-deviation form, which removes the intercept
term and simplifies the derivations; as we know, in econometrics most (but not all)
intercept terms in fitted regression models have no economic interpretation.
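The decomposition in EQ.[7] can be checked numerically. The sketch below uses simulated data with assumed coefficient values (true model y = 1 + 2*x2 + 3*x3 + u); it fits both the correct two-regressor model and the underfitted one, and verifies that b_2 = \hat{\beta}_2 + \hat{\beta}_3 b_{32} holds exactly in-sample while b_2 differs noticeably from \hat{\beta}_2:

```python
import random

random.seed(1)
n = 200
# Assumed data-generating process for illustration: true model includes X2 and X3.
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [0.5 * a + random.gauss(0, 1) for a in x2]   # the soon-to-be-omitted variable
y = [1 + 2 * a + 3 * b + random.gauss(0, 1) for a, b in zip(x2, x3)]

def dev(v):
    """Return the series in mean-deviation form."""
    m = sum(v) / len(v)
    return [t - m for t in v]

d2, d3, dy = dev(x2), dev(x3), dev(y)
S22 = sum(a * a for a in d2)
S33 = sum(a * a for a in d3)
S23 = sum(a * b for a, b in zip(d2, d3))
S2y = sum(a * b for a, b in zip(d2, dy))
S3y = sum(a * b for a, b in zip(d3, dy))

# Correct model: solve the two normal equations (EQ.[5] and EQ.[6]).
det = S22 * S33 - S23 * S23
beta2_hat = (S33 * S2y - S23 * S3y) / det   # coefficient on x2
beta3_hat = (S22 * S3y - S23 * S2y) / det   # coefficient on x3

# Misspecified model (x3 omitted), EQ.[4]; and slope of omitted x3 on included x2.
b2 = S2y / S22
b32 = S23 / S22

# EQ.[7] holds exactly in-sample: b2 = beta2_hat + beta3_hat * b32,
# so the omitted-variable bias is beta3_hat * b32.
```

Because EQ.[7] is an algebraic identity of OLS, the equality holds to machine precision in every sample, not just on average.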
EXCLUSION OF RELEVANT X VARIABLE(S): CONSEQUENCES
• If the left-out, or omitted, variable X_3 is correlated with the included variable
X_2, the slope coefficient b_2 of the fitted regression will be biased.
• The disturbance variance \sigma^2 is incorrectly estimated.
• The variance of b_2 is a biased estimator of the variance of the true
estimator \hat{\beta}_2. The variance of b_2 will, on average, overestimate the true
variance, even if X_2 and X_3 are not correlated:
E[\widehat{var}(b_2)] = var(\hat{\beta}_2) + \frac{\beta_3^2 \sum x_{3i}^2}{(n-2)\sum x_{2i}^2}
• In consequence, the usual confidence interval and hypothesis-testing
procedures are likely to give misleading conclusions about the statistical
significance of the estimated parameters.
• As another consequence, the forecasts based on the incorrect model and
the forecast (confidence) intervals will be unreliable.
INCLUSION OF UNNECESSARY/IRRELEVANT X VARIABLES: CONSEQUENCES
• Another type of specification bias may arise when the set of explanatory
variables is enlarged by inclusion of one or more irrelevant variables.
• The philosophy is that so long as you include the theoretically relevant
variables, inclusion of one or more unnecessary or ‘nuisance’ variables
will not hurt (unnecessary in the sense that there is no solid theory that
says they should be included).
• In that case, inclusion of such variables will certainly increase R^2 (and
adjusted R^2 when the |t| value of the added variable exceeds 1).
• This is called overfitting a model. But if the variables are not
economically meaningful and relevant, such a strategy is not
recommended.
INCLUSION OF UNNECESSARY/IRRELEVANT X VARIABLES: CONSEQUENCES
• The OLS estimators of the "incorrect" (over-fitted) model are unbiased (as well
as consistent). That is, E(\hat{\alpha}_1) = \beta_1, E(\hat{\alpha}_2) = \beta_2, and
E(\hat{\alpha}_3) = 0. If X_3 does not belong to the model, \alpha_3 is expected
to be zero.
• Also, the estimator of the error variance \sigma^2 obtained from the over-fitted
regression is correctly estimated.
INCLUSION OF UNNECESSARY/IRRELEVANT X VARIABLES: CONSEQUENCES
• The standard confidence interval and hypothesis-testing procedures
based on the t and F tests remain valid.
• However, the \hat{\alpha}'s estimated from the over-fitted regression are
inefficient: their variances will generally be larger than those of the \hat{\beta}'s
estimated from the true model.
• As a result, the confidence intervals based on the standard errors of the \hat{\alpha}'s
will be wider than those based on the standard errors of the \hat{\beta}'s of the true
model.
• However, we can use the F-test statistic (please see the lecture on multivariate
regression) to choose the right variable(s) for our model.
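These two consequences (unbiasedness but inflated variance) can be illustrated with a small Monte Carlo experiment. The sketch below uses assumed values for illustration: X3 is correlated with X2 but does not enter the true model (y = 1 + 2*x2 + u). Both estimators of the slope on X2 center on the true value 2, but the over-fitted estimator has a larger sampling variance:

```python
import random
import statistics

random.seed(42)

def slope_bivariate(y, x):
    """OLS slope from regressing y on a constant and x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

def slope_trivariate(y, x2, x3):
    """Slope on x2 from OLS of y on (const, x2, x3), via deviation-form normal equations."""
    n = len(y)
    m2, m3, my = sum(x2) / n, sum(x3) / n, sum(y) / n
    d2 = [v - m2 for v in x2]
    d3 = [v - m3 for v in x3]
    dy = [v - my for v in y]
    S22 = sum(a * a for a in d2)
    S33 = sum(a * a for a in d3)
    S23 = sum(a * b for a, b in zip(d2, d3))
    S2y = sum(a * b for a, b in zip(d2, dy))
    S3y = sum(a * b for a, b in zip(d3, dy))
    return (S33 * S2y - S23 * S3y) / (S22 * S33 - S23 * S23)

b_true, b_over = [], []
for _ in range(500):
    x2 = [random.gauss(0, 1) for _ in range(50)]
    x3 = [0.8 * a + random.gauss(0, 0.6) for a in x2]  # correlated with x2 but irrelevant
    y = [1 + 2 * a + random.gauss(0, 1) for a in x2]   # true model: y depends on x2 only
    b_true.append(slope_bivariate(y, x2))
    b_over.append(slope_trivariate(y, x2, x3))

# Both estimators center on the true slope 2, but the over-fitted one is noisier.
```

The variance inflation comes from the correlation between X2 and X3: the variance of the slope on X2 in the over-fitted model is larger by the factor 1/(1 - r^2), where r is the sample correlation between X2 and X3.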
IS IT BETTER TO INCLUDE IRRELEVANT VARIABLES THAN TO EXCLUDE THE RELEVANT ONES?
WRONG FUNCTIONAL FORM (LINEAR OR NON-
LINEAR)
• Sometimes researchers mistakenly do not account for the nonlinear nature
of variables in a model.
• Moreover, some dependent variables (such as wage, which tends to be
skewed to the right) are more appropriately entered in natural log form.
• Consider the following true marginal cost model:
MC_i = \beta_1 + \beta_2 Q_i + \beta_3 Q_i^2 + u_i    [11]
• Instead, the econometrician estimated the following linear model:
MC_i = \alpha_1 + \alpha_2 Q_i + v_i    [12]
• The fitted linear model will consistently underestimate the true marginal
cost.
THE LAGRANGE MULTIPLIER (LM) TEST FOR ADDING VARIABLES
• This is an alternative to the RESET test. Estimate the restricted model in
EQ.[13] and obtain the estimated residuals, \hat{u}_i.
• If in fact EQ.[13] is the correct model, then the residuals obtained from this
model should not be related to the regressors omitted from the model.
• We now regress \hat{u}_i on the regressors in the original model and the omitted
variables from the original model:
\hat{u}_i = \alpha_1 + \alpha_2 X_i + \cdots + v_i    [15]
• If the sample size is large, it can be shown that n (the sample size) times the
R^2 obtained from the auxiliary regression follows a \chi^2 distribution with
degrees of freedom equal to the number of omitted regressors; symbolically,
nR^2 \sim_{asy} \chi^2_{(\text{no. of omitted regressors})}.
• If the computed \chi^2 value exceeds the critical \chi^2 value at the chosen
level of significance, we reject the null of no misspecification.
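The LM procedure can be sketched end to end. In the simulation below (coefficients and sample size are assumptions for illustration), the true model is quadratic but a linear model is fitted; the single omitted regressor is Q^2, so nR^2 from the auxiliary regression is compared against the chi-square(1) 5% critical value of 3.84:

```python
import random

random.seed(7)
n = 200
x = [random.uniform(0, 10) for _ in range(n)]
# Assumed true DGP for illustration: quadratic in x.
y = [1 + 2 * a + 0.5 * a * a + random.gauss(0, 1) for a in x]

# Step 1: fit the restricted (linear) model y = l1 + l2*x and keep the residuals.
mx, my = sum(x) / n, sum(y) / n
l2 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
l1 = my - l2 * mx
u = [b - (l1 + l2 * a) for a, b in zip(x, y)]

# Step 2: auxiliary regression of the residuals on x and the omitted regressor x^2.
xsq = [a * a for a in x]

def dev(v):
    m = sum(v) / len(v)
    return [t - m for t in v]

d1, d2, du = dev(x), dev(xsq), dev(u)
S11 = sum(a * a for a in d1)
S22 = sum(a * a for a in d2)
S12 = sum(a * b for a, b in zip(d1, d2))
S1u = sum(a * b for a, b in zip(d1, du))
S2u = sum(a * b for a, b in zip(d2, du))
det = S11 * S22 - S12 * S12
a2 = (S22 * S1u - S12 * S2u) / det   # slope on x
a3 = (S11 * S2u - S12 * S1u) / det   # slope on x^2
ess = a2 * S1u + a3 * S2u            # explained sum of squares
tss = sum(t * t for t in du)
r2 = ess / tss

# Step 3: LM statistic, asymptotically chi-square with df = 1 omitted regressor.
lm = n * r2
# lm far exceeds 3.84, so the linear specification is rejected.
```

Because the residuals of the linear fit retain the quadratic pattern, the auxiliary R^2 is large and nR^2 rejects the null of correct specification decisively.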
FACTORS OTHER THAN SPECIFICATION ERRORS THAT CAN AFFECT REGRESSION RESULTS
ERRORS OF MEASUREMENT
• So far, we have assumed implicitly that the dependent variable Y and the
explanatory variables, the X's, are measured without any errors.
• Although not explicitly spelled out, this presumes that the values of the
dependent as well as independent variables are accurate. That is, they are
not guess estimates, extrapolated, interpolated or rounded off in any
systematic manner or recorded with errors.
Consequences of errors of measurement in the dependent variable: the OLS
estimators remain unbiased and consistent, but their estimated variances are
larger than when there are no such errors, so the estimates are less precise.
• Figure below shows both x-outlier (left) and y-outlier (right).
• Both plots show that a better linear relationship will be possible without
these outliers.
• In this situation, the x-
outlier is rotating the line
clockwise to change both
the slope and the
intercept of the
relationship, while the y-
outlier is moving the
predicted line upward.
• The solid line shows the predicted relationship without the outliers.
A SIMPLE WAY TO IDENTIFY OUTLIERS: IQR
• To explain IQR Method easily, let’s start with
a box plot.
• A box plot tells us about the distribution of
the data. It gives a sense of how much the
data is spread about, what’s its range, and
about its skewness.
As you might have noticed in the figure, a box plot enables us to draw
inferences from ordered data, i.e., it tells us about the various metrics of
data arranged in ascending order.
• The median (or center point), also called the second quartile, is the middle
value of the ordered data.
• Q1 is the first quartile of the data, i.e., to say 25% of the data lies between
minimum and Q1.
• Q3 is the third quartile of the data, i.e., to say 75% of the data lies between
the minimum and Q3.
A SIMPLE WAY TO IDENTIFY OUTLIERS: IQR EXAMPLE
0, 0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 12, 14, 15, 20, and 25
• To detect the outliers using IQR method, we define a new range, let’s call it
decision range, and any data point lying outside this range is considered as
outlier and is accordingly dealt with.
Lower bound = Q1 − 1.5 × IQR = 8 − 1.5(4) = 2 and
Upper bound = Q3 + 1.5 × IQR = 12 + 1.5(4) = 18
• Thus, any observation of fewer than 2 books or more than 18 books is an
outlier (here: 0, 0, 20, and 25).
Why only 1.5 times the IQR? Why not any
other number?
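The IQR rule above can be sketched in a few lines. Note that quartiles are computed here as the medians of the lower and upper halves of the sorted data, matching the worked example; other software may use interpolation conventions that give slightly different quartiles:

```python
def iqr_outliers(data):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

    Quartiles are the medians of the lower/upper halves of the sorted data
    (the convention used in the worked example above).
    """
    xs = sorted(data)
    half = len(xs) // 2
    lower, upper = xs[:half], xs[-half:]

    def median(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2

    q1, q3 = median(lower), median(upper)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi], (lo, hi)

# The book-count data from the example above.
books = [0, 0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 12, 14, 15, 20, 25]
outliers, (lo, hi) = iqr_outliers(books)
# → outliers [0, 0, 20, 25], decision range (2.0, 18.0)
```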
LEVERAGE
LEVERAGE…
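Leverage measures how far an observation's x-value lies from the mean of x. In a simple regression, the hat-matrix diagonal is h_ii = 1/n + (x_i − x̄)^2 / Σ_j (x_j − x̄)^2; the leverages always sum to the number of estimated parameters (2 here), and a common rule of thumb flags points with h_ii greater than twice that average. A minimal sketch (the sample x-values are made up for illustration):

```python
def leverages(x):
    """Hat-matrix diagonal for simple regression: h_ii = 1/n + (x_i - xbar)^2 / S_xx."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    return [1 / n + (v - xbar) ** 2 / sxx for v in x]

x = [1, 2, 3, 4, 5, 20]          # 20 is far from the rest: a high-leverage point
h = leverages(x)

# Rule of thumb: flag h_ii > 2 * (number of parameters) / n = 2 * 2 / n.
flagged = [xi for xi, hi in zip(x, h) if hi > 2 * 2 / len(x)]
# → flagged [20]
```

Because the leverages sum to the number of parameters, their average is 2/n, which motivates the 2·2/n threshold.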
HOW DO WE HANDLE UNUSUAL DATA POINTS?
• Should we just drop them and confine our attention to the remaining data points?
• Automatic rejection of unusual data points is not always a wise procedure.
• Sometimes the outlier provides information that other data points cannot be because
it arises from an unusual circumstance which may be of interest and requires further
investigation rather than rejection.
• As a rule, unusual data points should be rejected out of hand only if they can be
traced to recording errors.
1. Winsorization: replaces extreme values with less extreme but still plausible values.
By capping or flooring the outliers, we can reduce their impact on the statistical
analysis.
2. Data transformation: transforming the data, for example by taking logarithms, can
help mitigate the impact of outliers. Such transformations can make the data more
normally distributed and lessen the influence of extreme values.
3. Removing or adjusting outliers: in some cases, we may have domain knowledge
that justifies removing or adjusting specific observations.
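Winsorization (item 1 above) can be sketched as follows. The 10th/90th percentile cut-offs and the nearest-rank percentile convention are illustrative choices, not part of the method itself; statistical libraries typically interpolate between ranks and may give slightly different cut-offs:

```python
def winsorize(data, lower_pct=0.10, upper_pct=0.90):
    """Floor values below the lower percentile and cap values above the upper one."""
    xs = sorted(data)
    n = len(xs)
    lo = xs[int(lower_pct * (n - 1))]   # nearest-rank percentile (illustrative convention)
    hi = xs[int(upper_pct * (n - 1))]
    return [min(max(v, lo), hi) for v in data]

# The book-count data from the IQR example above.
books = [0, 0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 12, 14, 15, 20, 25]
capped = winsorize(books)   # the extreme counts 20 and 25 are pulled down to 15
```

Unlike deletion, winsorization keeps the sample size intact while bounding the influence of the extreme observations.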