Midterm 2
Chapter 7: Simple regression
Regression basics
• Regression analysis is a method that uncovers the average value of a variable y for
different values of variable x
• The model for conditional mean shows the mean value of y for various values of x
• E[y | x] = f(x)
• y^E = f(x)
o it shows the result of plugging a specific value of x into the model
• y: dependent variable, left-hand-side variable
• x: explanatory variable, right-hand-side variable
• Regression analysis may reveal that average y tends to be higher at higher values of x
o pattern of positive mean-dependence/association
o Linear patterns: positive (negative) association - average y tends to be higher
(lower) at higher values of x.
o Non-linear patterns: association may be non-monotonic - y tends to be higher
for higher values of x in a certain range of the x variable and lower for higher
values of x in another range of the x variable
o No association or relationship
o goal: uncover this relationship
• Non-parametric regression
o describe the y^E = f(x) pattern without imposing a specific functional form
on f
o Let the data dictate what that function looks like, at least approximately.
o Can spot (any) patterns well
o When x has few values and there are many observations in the data, the best
and most intuitive non-parametric regression for y^E = f(x) shows average y
for each and every value of x.
o With many values of x: two ways
▪ bin scatters: show the average value of y that corresponds to each bin
created from the values of x, in an x-y coordinate system
▪ smoothing: shows the average y over the entire range of bins
▪ smoothed conditional means plots: lowess is one of them – we
interpret these graphs in a qualitative way
▪ lowess: locally weighted scatterplot smoothing – a smooth curve fit
around a bin scatter
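As a minimal illustration (not from the course material), the sketch below runs a lowess smoother on simulated data with an assumed nonlinear pattern, using statsmodels; the frac parameter is an arbitrary tuning choice.

```python
# Minimal sketch (simulated data): lowess = locally weighted scatterplot smoothing.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = np.sin(x) + rng.normal(0, 0.3, 500)   # nonlinear pattern plus noise

# lowess returns the smoothed conditional mean of y evaluated at the sorted x values;
# frac controls how much of the data each local fit uses (a tuning choice).
smoothed = lowess(y, x, frac=0.3)
print(smoothed[:5])   # columns: x (sorted), smoothed average y
```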
• Binary explanatory variable
o β is the difference in average y between observations with x = 1 and
observations with x = 0
o Graphically, the regression line of linear regression goes through two points:
average y when x is zero (α) and average y when x is one (α + β).
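A quick check of this (hypothetical simulated data, statsmodels assumed): with a binary x, the OLS intercept is the mean of y in the x = 0 group and the slope is the difference in group means.

```python
# Minimal sketch: binary explanatory variable, slope = difference in means.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.integers(0, 2, 1000)                # binary explanatory variable
y = 2.0 + 1.5 * x + rng.normal(0, 1, 1000)  # true alpha = 2, beta = 1.5

fit = sm.OLS(y, sm.add_constant(x)).fit()
alpha_hat, beta_hat = fit.params
print(alpha_hat, y[x == 0].mean())                     # alpha = mean y when x = 0
print(beta_hat, y[x == 1].mean() - y[x == 0].mean())   # beta = difference in means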
• Coefficient formula
o Calculated estimates: α̂ and β̂ (use the data to calculate the statistics)
o β̂ = Cov[x, y] / Var[x]
o The formula of the intercept reveals that the regression line always goes
through the point of average x and average y
o α̂ = ȳ − β̂ x̄
o OLS gives the best-fitting linear regression line
o OLS method finds the values of the coefficients of the linear regression that
minimize the sum of squares of the difference between actual y values and
their values implied by the regression, α̂ + β̂x
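A minimal sketch of the coefficient formulas above (simulated data, numpy and statsmodels assumed): computing β̂ and α̂ by hand and checking them against a library fit.

```python
# Minimal sketch: beta_hat = Cov[x,y]/Var[x], alpha_hat = mean(y) - beta_hat*mean(x).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 200)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 200)

beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(alpha_hat, beta_hat)   # by hand
print(fit.params)            # statsmodels gives the same numbers
```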
Residuals and predicted values
• Predicted values
o The predicted value of the dependent variable = best guess for its average
value if we know the value of the explanatory variable, using our model
o The predicted values of the dependent variable are the points of the regression
line itself.
o ŷ = α̂ + β̂x
• Residuals
o The residual is the difference between the actual value of the dependent
variable for an observation and its predicted value :
o eᵢ = yᵢ − ŷᵢ
o The residual is meaningful only for actual observations
▪ While we can have predicted values for any x, actual y values are only
available for the observations in our data
o above regression line: positive
o under regression line: negative
o Residuals sum to zero if a linear regression is fitted by OLS
▪ Sum is zero –> average of the residuals is zero, too.
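A quick check of these properties (simulated data, statsmodels assumed): predicted values lie on the fitted line and OLS residuals sum to zero.

```python
# Minimal sketch: predicted values and residuals from an OLS fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 200)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 200)

fit = sm.OLS(y, sm.add_constant(x)).fit()
y_hat = fit.fittedvalues          # predicted values: alpha_hat + beta_hat * x
e = fit.resid                     # residuals: y - y_hat
print(np.allclose(e.sum(), 0))    # True: residuals sum (and average) to zero under OLS
print(np.allclose(y_hat, fit.params[0] + fit.params[1] * x))   # True: points on the line
```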
• Interpolation
o predicting y for x values that are not in the data but lie in between x values in the data
• Extrapolation
o predicting y for x not in the data and outside the range of x in the data
R-squared
• Fit of a regression captures how predicted values compare to the actual values
• gives the goodness of fit
• R² = Var[ŷ] / Var[y] = 1 − Var[e] / Var[y]
• can be defined for parametric and non-parametric regressions
• always between 0 and 1
o 1: if the regression fits perfectly the data
o 0: all of the predicted y values are equal to the overall average value of y in the
data regardless of the value of the explanatory variable x – regression line is
completely flat
• Fit depends on (1): how well the particular version of the regression captures the actual
function f in y^E = f(x)
• Fit depends (2): how far actual values of y are spread around what would be predicted
using the actual function f
• R-squared may help in choosing between different versions of regression for the same
data
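A minimal sketch of the two R-squared formulas above (simulated data, statsmodels assumed): both agree with the library's reported value.

```python
# Minimal sketch: R-squared computed from predicted values and from residuals.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(0, 1, 300)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 300)

fit = sm.OLS(y, sm.add_constant(x)).fit()
r2_from_pred = np.var(fit.fittedvalues) / np.var(y)    # Var[y_hat] / Var[y]
r2_from_resid = 1 - np.var(fit.resid) / np.var(y)      # 1 - Var[e] / Var[y]
print(r2_from_pred, r2_from_resid, fit.rsquared)        # all (about) the same
```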
Causation
• R-squared of the simple linear regression is the square of the correlation coefficient
o R² = (Corr[y, x])²
o So the R-squared is yet another measure of the association between the two
variables.
• reverse regression
o x^E = γ + δy (regressing x on y)
o the two slopes are not the same unless the variances of x and y are equal
o but they always have the same sign
o both are larger in magnitude the larger the covariance
o R² for the simple linear regression and the reverse regression is the same
• Slope of the y^E = f(x) regression is not zero in our data
o Several reasons, not mutually exclusive:
▪ x causes y
▪ y causes x
▪ A third variable causes both x and y (or many such variables do)
Chapter 8: Complicated patterns and messy data
The shape of association
• When it is important whether the shape of a regression is linear or not:
o we want to make a prediction or analyze residuals - better fit
o we want to go beyond the average pattern of association - good reason for
complicated patterns
o all we care about is the average pattern of association, but the linear regression
gives a bad approximation to that - linear approximation is bad
• When potential nonlinear shapes don’t matter:
o all we care about is the average pattern of association
o linear regression is good approximation to the average pattern
Logs
• Frequent nonlinear patterns are often better approximated with y or x transformed by
taking logs, which amounts to working with relative differences
• Log differences works because differences in natural logs approximate percentage
differences!
• we usually use ln
• In cross-sectional data usually there is no natural base for comparison
• Log transformation allows for comparison in relative terms – percentages
o Log transformation allows for comparison in relative terms (percentages),
because:
o ln(x + Δx) − ln(x) = ln(1 + Δx/x) ≈ Δx/x (for small differences)
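A quick numeric check of the approximation above (made-up numbers): for small relative differences the log difference is close to the percentage difference, and the two drift apart as the difference grows.

```python
# Minimal sketch: log differences approximate relative (percentage) differences.
import numpy as np

x = 100.0
for dx in (1.0, 5.0, 20.0):                 # 1%, 5%, 20% differences
    log_diff = np.log(x + dx) - np.log(x)
    rel_diff = dx / x
    print(rel_diff, log_diff)               # close for small dx, less so for large dx
```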
• when to take logs?
o Percentage differences
o relative comparisons are free from the units of measurement of the variables, which are
often different across time and space and are arbitrary to begin with
o economic decisions
o relative differences are likely to be more stable across time
• The distribution of many important economic variables is skewed with a long right
tail and is reasonably well approximated by a log-normal distribution.
• Logs cannot be taken of non-positive values, so variables with zero or negative values
need special treatment (see the book for more on when to take logs)
• Log-level
o ln(y)^E = α + βx
o α is average ln(y) when x is zero. (Often meaningless.)
o β: y is β × 100 percent higher, on average, for observations with one unit higher x
• Level-log
o y^E = α + β ln(x)
o α: average y when ln(x) is zero (and thus x is one)
o β: y is β/100 units higher, on average, for observations with one percent higher x
• Log-log
o ln(y)^E = α + β ln(x)
o α: is average ln(y) when ln(x) is zero. (Often meaningless.)
o β: y is β percent higher on average for observations with one percent higher x.
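A minimal sketch of the log-log case (simulated data with an assumed true elasticity of 0.8, statsmodels assumed): the slope is read as the percent difference in y per one percent difference in x.

```python
# Minimal sketch: log-log regression and its elasticity interpretation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = np.exp(rng.normal(0, 1, 500))                              # positive, right-skewed x
y = np.exp(0.5 + 0.8 * np.log(x) + rng.normal(0, 0.2, 500))    # true elasticity 0.8

fit = sm.OLS(np.log(y), sm.add_constant(np.log(x))).fit()
print(fit.params)   # slope ~ 0.8: y is about 0.8 percent higher for 1 percent higher x
```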
• per capita measures
o Most often: per capita: GDP/capita, revenues/employee, sales/shop
o can take logs easily
Polynomials
• polynomials do not require the analyst to specify where the pattern may change
• Quadratic
o Technically: quadratic function is not a linear function (a parabola, not a line)
o Handles only nonlinearity, which can be captured by a parabola
o y^E = β₀ + β₁x + β₂x²
o if β₂ is positive: convex relationship
o if β₂ is negative: concave relationship
o we can get the slope of the function by taking the derivative: the slope at x is β₁ + 2β₂x
o We can compare two observations, denoted by j and k, that differ in x by one unit so
that xₖ = xⱼ + 1. y is higher by approximately β₁ + 2β₂xⱼ units for observation k than
for observation j
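A minimal sketch of a quadratic fit (simulated data with made-up coefficients, statsmodels assumed), evaluating the slope β₁ + 2β₂x at a few x values.

```python
# Minimal sketch: quadratic regression and its local slope.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 400)
y = 1.0 + 2.0 * x - 0.15 * x**2 + rng.normal(0, 1, 400)   # concave true pattern

X = sm.add_constant(np.column_stack([x, x**2]))
fit = sm.OLS(y, X).fit()
b0, b1, b2 = fit.params
for xj in (2.0, 5.0, 8.0):
    print(xj, b1 + 2 * b2 * xj)   # local slope; switches sign past the parabola's peak
```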
Other
• robustness checks: running several different regressions and comparing results
• Influential observations
o the slope of the regression is different when we include them in the data from
the slope when we exclude them from the data
o extreme values
o why the values are extreme for the influential observations
o what the question of the analysis is
• Measurement error in variables
o such errors may arise due to technical or fundamental reasons
o latent variables: Latent variables are unobservable variables that are inferred
from observable variables in a statistical model.
o proxy variables: observed variables that we use instead of latent variables
o classical measurement error: an error that is zero on average and is
independent of all other relevant variables, including the error-free variable
o noise-to-signal ratio: the importance of measurement error
o attenuation bias: the effect of classical measurement error in the explanatory
variable
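A minimal simulation of attenuation bias (hypothetical numbers, statsmodels assumed): classical measurement error in the explanatory variable pulls the estimated slope toward zero.

```python
# Minimal sketch: classical measurement error in x attenuates the slope.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x_true = rng.normal(0, 1, 5000)
y = 1.0 + 1.0 * x_true + rng.normal(0, 1, 5000)     # true slope = 1
x_noisy = x_true + rng.normal(0, 1, 5000)           # classical measurement error in x

slope_clean = sm.OLS(y, sm.add_constant(x_true)).fit().params[1]
slope_noisy = sm.OLS(y, sm.add_constant(x_noisy)).fit().params[1]
print(slope_clean)   # close to 1
print(slope_noisy)   # close to 0.5 here: Var[x] / (Var[x] + Var[error]) = 1/2
```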
• Using weights
o one way to compensate for unequal sampling and biased coverage
o use it with aggregate data
▪ countries, firms, families
Chapter 9: Generalizing results of a regression
Generalizing linear regression coefficients
• Question: Is the pattern we see in our data
o True in general?
o or is it just a special feature of the particular data we see?
• inference: the act of generalizing results
• statistical inference: generalizing from the data to the population, or general pattern, it
represents
• external validity
o if it is low we may consider wider ranges
o Beyond (other dates, countries, people, firms)
CI and SE of regression coefficients
• true value: the true value (of beta) in the population, or general pattern, represented by
the data
• β̂: the average difference in y corresponding to a one-unit difference in x in the data
o the question of statistical inference is what the true value of β is
• ŷᵢ: best guess for the expected value (average) of the dependent variable for
observation i with value xᵢ for the explanatory variable in the dataset
• CI
o confidence interval of the regression coefficient
o can tell us where the true value of beta is with 95% likelihood
o the smaller the SE, the narrower the CI and the more precise the coefficient estimate
o 95% CI(β̂) = [β̂ − 2SE(β̂), β̂ + 2SE(β̂)]
• SE
o measures the spread of the values of the statistic across hypothetical repeated
samples drawn from the same population, or general pattern, that our data
represents
o In the context of linear regression, the standard error is used to quantify the
precision of the regression coefficients. Each coefficient has its standard error,
and the ratio of the coefficient estimate to its standard error is used to assess
statistical significance.
• simple SE formula
o Simple SE formula is not correct in general
o assumes that variables are independent across observations
o Homoskedasticity assumption
o SE(β̂) = Std[e] / (√n × Std[x])   (e: regression residuals)
o smaller:
▪ smaller the standard deviation of the residual
▪ larger the standard deviation of the explanatory variable
▪ more observations are in the data
• homoskedasticity
o the assumption that the fit of the regression line is the same across the entire
range of the x variable
• heteroskedasticity
o the fit may differ at different values of x, in which case the spread of actual y
around the regression line is different for different values of x
• robust SE formula
o In statistics, a robust statistical method is one that remains valid and effective
even when the assumptions of the method are not perfectly met. Robust
statistical techniques are less sensitive to outliers or deviations from the
expected data distribution.
o Same properties as the simple formula: smaller when Std[e] is small, Std[x] is
large and n is large
o allows for heteroskedasticity (it does not assume homoskedasticity)
o sometimes the simple and robust SEs are the same, sometimes the robust one is larger
o Coefficient estimates, R squared etc. remain the same
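A minimal sketch comparing the two SE formulas (simulated heteroskedastic data, statsmodels assumed; "HC1" is one of the robust covariance options).

```python
# Minimal sketch: simple vs. heteroskedasticity-robust standard errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 1000)
y = 1.0 + 0.5 * x + rng.normal(0, 0.2 + 0.3 * x, 1000)   # spread of y grows with x

X = sm.add_constant(x)
fit_simple = sm.OLS(y, X).fit()                  # default: simple SE formula
fit_robust = sm.OLS(y, X).fit(cov_type="HC1")    # heteroskedasticity-robust SE

print(fit_simple.params, fit_robust.params)  # coefficient estimates are identical
print(fit_simple.bse)                        # simple SEs
print(fit_robust.bse)                        # robust SEs (here somewhat different)
```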
Intervals for predicted values
• How can we quantify and present the uncertainty corresponding to a specific
predicted value ŷⱼ?
• the CI of the predicted value/CI of the regression line
o The CI of the predicted value combines the CI for α̂ and the CI for β̂
o 95% CI(ŷⱼ) = ŷⱼ ± 2SE(ŷⱼ)
o It answers the question of where we can expect y^E to lie if we know the value
of x and we have estimates of the coefficients α̂ and β̂ from the data
o SE(ŷⱼ) can be estimated using bootstrap or an appropriate formula
o SE(ŷⱼ) = Std[e] × √( 1/n + (xⱼ − x̄)² / (n × Var[x]) )
o the SE of the predicted value for a particular observation is small if the SEs of
the coefficient estimates are small and the particular observation has an x
value close to its average
o The second part means that predictions for observations with more extreme x
values are bound to have larger standard errors and thus wider confidence
intervals
o Can be used for any model
o In general, the CI for the predicted value is an interval that tells where to
expect average y given the value of x in the population, or general pattern,
represented by the data
• Prediction interval
o The prediction interval for yⱼ starts from the CI for ŷⱼ and adds the extra
uncertainty due to the fact that the actual yⱼ will be somewhere around ŷⱼ.
o 95% PI(yⱼ) = ŷⱼ ± 2SPE(ŷⱼ)
o Standard prediction error:
▪ SPE(ŷⱼ) = Std[e] × √( 1 + 1/n + (xⱼ − x̄)² / (n × Var[x]) )
o In the formula, all elements get very small as n gets large, except for the new
element (the 1 under the square root)
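A minimal sketch of both intervals (simulated data, statsmodels assumed): get_prediction reports the CI of the predicted value and the wider prediction interval for a few chosen x values.

```python
# Minimal sketch: CI of the regression line vs. prediction interval for actual y.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.normal(5, 2, 300)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 300)

fit = sm.OLS(y, sm.add_constant(x)).fit()
new_x = sm.add_constant(np.array([1.0, 5.0, 9.0]), has_constant="add")
pred = fit.get_prediction(new_x).summary_frame(alpha=0.05)
# mean_ci_*: CI for the predicted value (narrow, widens away from average x)
# obs_ci_*: prediction interval for an actual y (much wider)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```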
Testing hypotheses
• whether the true value of β, meaning its value in the population or general pattern
represented by the data, is zero
• H₀: β_true = 0
• H_A: β_true ≠ 0
• t-statistic
o the t-statistic is best viewed as a measure of distance: how far the value of the
statistic (β̂) in the data is from its value hypothesized by the null (zero)
o A t-statistic of 0 means that the value of β̂ is exactly what’s in the
null hypothesis (zero distance)
o A t-statistic of 1 means that the value of β̂ is exactly one SE larger than the
value in the null hypothesis
o t = (β̂ − c) / SE(β̂), where c is the value hypothesized by the null (zero here)
o The t-statistic for the intercept coefficient is analogous
o critical value: approximately ±2 at the 5% significance level. We reject the null if the
t-statistic is larger than 2 or smaller than −2, and we don’t reject the null if it’s in-between
• p-value
o the p-value is the smallest significance level at which we can reject the null
o We have to simply look at the p-value and decide if it is larger or smaller than
the level of significance that we set for ourselves in advance
o x is said to be statistically significant at 5%
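A minimal sketch (simulated data, statsmodels assumed): the t-statistic for the null of zero is the coefficient divided by its SE, and the reported p-value is compared with the chosen significance level.

```python
# Minimal sketch: t-statistic and p-value for the slope coefficient.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
x = rng.normal(0, 1, 500)
y = 1.0 + 0.3 * x + rng.normal(0, 1, 500)

fit = sm.OLS(y, sm.add_constant(x)).fit()
beta_hat, se_beta = fit.params[1], fit.bse[1]
print(beta_hat / se_beta, fit.tvalues[1])   # t-statistic computed two ways
print(fit.pvalues[1])                       # reject H0: beta = 0 at 5% if below 0.05
```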
• proof of concept
o example: cross-country data
• proof beyond reasonable doubt
Other
• Usually, one star is attached to coefficients that are significant at 5% (we can reject
that they are zero at the 5% level), and two stars are attached to coefficients that are
significant at 1%
• As external validity is about generalizing beyond what our data represents, we can’t
assess it using our data
o analyzing other data may help
Chapter 10: Multiple linear regression
Multiple linear regression
• Multiple regression analysis uncovers average y as a function of more than one x
variable: y^E = f(x₁, x₂, ...).
• y^E = β₀ + β₁x₁ + β₂x₂
• The slope coefficient on x₁ shows the difference in average y across observations
with different values of x₁ but with the same value of x₂
• This way, multiple regression with two explanatory variables compares observations
that are similar in one explanatory variable to see the differences related to the other
explanatory variable
• x-x regression
o whether x₁ and x₂ are related
o δ would tell us how much the two x variables tend to move together
o x₂^E = γ + δx₁
o The slope of x₁ in a simple regression is different from its slope in the
multiple regression, the difference being the product of its slope in the
regression of x₂ on x₁ and the slope of x₂ in the multiple regression.
• omitted variable bias
o the slope in simple regression is different from the slope in multiple regression
by the slope in the x−x regression times the slope of the other x in the multiple
regression
o Corresponding differences in y may be due to differences in x₁ but also due to
differences in x₂
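A minimal simulation of the decomposition above (hypothetical numbers, statsmodels assumed): the simple-regression slope equals the multiple-regression slope plus the x–x slope times the slope of the other x.

```python
# Minimal sketch: omitted variable bias decomposition.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
x1 = rng.normal(0, 1, 2000)
x2 = 0.6 * x1 + rng.normal(0, 1, 2000)              # x2 correlated with x1
y = 1.0 + 0.5 * x1 + 0.8 * x2 + rng.normal(0, 1, 2000)

b_simple = sm.OLS(y, sm.add_constant(x1)).fit().params[1]
b_multi = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit().params
delta = sm.OLS(x2, sm.add_constant(x1)).fit().params[1]      # x-x regression slope

print(b_simple)                          # slope from the simple regression
print(b_multi[1] + delta * b_multi[2])   # equals it (up to rounding) by the decomposition
```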
Multiple Linear Regression Terminology