
Békés-Kézdi: Data Analysis, Chapter 07: Simple regression

Data Analysis for Business, Economics, and Policy
Gábor Békés (Central European University)
Gábor Kézdi (University of Michigan)
Cambridge University Press, 2021
gabors-data-analysis.com

Central European University


Version: v3.1 License: CC BY-NC 4.0
Any comments or suggestions:
[email protected]
Introduction Regression basics Case: Hotels 1 Linear regression CS:A2 Residuals CS:A3 OLS Modeling Causation CS:A4 Summary

Motivation

- You spend a night in Vienna and want to find a good deal for your stay.
- Travel time to the city center is rather important.
- Looking for a good deal: as low a price as possible and as close to the city center as possible.
- Collect data on suitable hotels.

Data Analysis for Business, Economics, and Policy 2 / 57 Gábor Békés (Central European University)

Introduction

- Regression is the most widely used method of comparison in data analysis.
- Simple regression analysis amounts to comparing average values of a dependent variable (y) for observations that are different in the explanatory variable (x).
- Simple regression: comparing conditional means.
- Doing so uncovers the pattern of association between y and x. What you use for y and for x is important and not interchangeable!


Regression: comparing conditional means


Regression

- Simple regression analysis uncovers mean-dependence between two variables.
- It amounts to comparing average values of one variable, called the dependent variable (y), for observations that are different in the other variable, the explanatory variable (x).
- Multiple regression analysis involves more variables -> later.


Regression - uses

- Discovering patterns of association between variables is often a good starting point even if our question is more ambitious.
- Causal analysis: uncovering the effect of one variable on another variable. Concerned with a parameter.
- Predictive analysis: what to expect of a y variable (long-run polls, hotel prices) for various values of another x variable (immediate polls, distance to the city center). Concerned with the predicted value of y using x.


Regression - names and notation

- Regression analysis is a method that uncovers the average value of a variable y for different values of another variable x:

  E[y|x] = f(x)   (1)

  We use a simpler shorthand notation:

  y^E = f(x)   (2)

- dependent variable, left-hand-side variable, or simply the y variable
- explanatory variable, right-hand-side variable, or simply the x variable
- "regress y on x" or "run a regression of y on x" = do simple regression analysis with y as the dependent variable and x as the explanatory variable.


Regression - type of patterns

Regression may find

- Linear patterns: positive (negative) association - average y tends to be higher (lower) at higher values of x.
- Non-linear patterns: the association may even be non-monotonic - y tends to be higher for higher values of x in a certain range of the x variable and lower for higher values of x in another range of the x variable.
- No association or relationship.


Non-parametric and parametric regression

- Non-parametric regressions describe the y^E = f(x) pattern without imposing a specific functional form on f.
  - Data-driven and flexible, no theory.
  - Can capture any pattern.

- Parametric regressions impose a functional form on f. Parametric examples include:
  - linear functions: f(x) = a + bx;
  - exponential functions: f(x) = ax^b;
  - quadratic functions: f(x) = a + bx + cx^2;
  - or any function that has parameters a, b, c, etc.
  - Restrictive, but they produce readily interpretable numbers.


Non-parametric regression: average by each value

- Non-parametric regressions come in various forms.
- The most intuitive non-parametric regression for y^E = f(x) shows average y for each and every value of x.
  - Works well when x has few values and there are many observations in the data.
  - There is no functional form imposed on f here.


Non-parametric regression: Categorical variable

- Sometimes there is no straightforward functional form for f:
  - Categorical variables
  - Ordered variables
- For example, hotels: take the average price of hotels with the same number of stars and compare these averages = non-parametric regression analysis.


Non-parametric regression: bins

- With many x values, there are two ways to do non-parametric regression analysis: bins and smoothing.
- Bins - based on grouped values of x.
  - Bins are disjoint categories (no overlap) that span the entire range of x (no gaps).
  - Many ways to create bins - equal size, equal number of observations per bin, or bins defined by the analyst.
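The bin idea above can be sketched in a few lines of pure Python. This is a minimal illustration on hypothetical distance/price pairs (the function name `bin_means` and the data are made up for this example): split the range of x into equal-width disjoint bins, then take the mean of y within each bin.

```python
# Sketch: non-parametric regression with equal-width bins (hypothetical data).
# Each observation is assigned to a disjoint bin of x; we report mean y per bin.
from statistics import mean

def bin_means(x, y, n_bins):
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    # bin index for each observation; the maximum x falls into the last bin
    idx = [min(int((xi - lo) / width), n_bins - 1) for xi in x]
    return {b: mean(yi for i, yi in zip(idx, y) if i == b) for b in sorted(set(idx))}

# hypothetical distance (miles) and price (EUR) pairs
x = [0.2, 0.4, 0.9, 1.1, 1.6, 2.3, 2.8, 3.5]
y = [220, 190, 140, 120, 110, 95, 90, 80]
print(bin_means(x, y, n_bins=2))  # average price in the near bin and the far bin
```

With two bins the output is one conditional mean per half of the distance range; increasing `n_bins` gives a finer (but noisier) picture, mirroring the bin-scatter figures later in the chapter.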


Non-parametric regression: lowess (loess)

- Produces a "smooth" graph: one that is both continuous and has no kink at any point.
- Also called smoothed conditional means plots = a non-parametric regression that shows conditional means, smoothed to get a better image.
- Lowess = the most widely used non-parametric regression method that produces a smooth graph.
  - Locally weighted scatterplot smoothing (sometimes abbreviated as "loess").
- A smooth curve fit around a bin scatter.
- Related to density plots (Chapter 03): we set the bandwidth for smoothing.
  - A wider bandwidth results in a smoother graph but may miss important details of the pattern.
  - A narrower bandwidth produces a more rugged-looking graph.
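The bandwidth trade-off can be seen even without full lowess machinery. The sketch below is not lowess itself but a much simpler relative (a uniform-kernel conditional-mean smoother, with made-up data): for each grid point it averages y over observations whose x lies within ±h. A larger h averages over more points, so the curve gets smoother.

```python
# Illustration of bandwidth in smoothing (a simplified stand-in for lowess):
# at each grid point, average y over observations with x within +/- h.
from statistics import mean

def kernel_mean(x, y, grid, h):
    out = []
    for g in grid:
        nearby = [yi for xi, yi in zip(x, y) if abs(xi - g) <= h]
        out.append(mean(nearby) if nearby else None)
    return out

x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [200, 150, 130, 120, 100, 95]
print(kernel_mean(x, y, grid=[1.0, 2.0, 3.0], h=0.6))  # narrow bandwidth: rugged
print(kernel_mean(x, y, grid=[1.0, 2.0, 3.0], h=1.6))  # wide bandwidth: smoother
```

Real lowess additionally weights nearby points more heavily and fits local lines rather than local means, but the bandwidth intuition is the same.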


Non-parametric regression: lowess (loess)

- Smooth non-parametric regression methods, including lowess, do not produce numbers that would summarize the y^E = f(x) pattern.
- They provide a value of y^E for each of the particular x values that occur in the data, as well as for all x values in between.
- Graph: we interpret these graphs in qualitative, not quantitative ways.
  - They can show interesting shapes in the pattern, such as non-monotonic parts, steeper and flatter parts, etc.
- A great way to find relationship patterns.


Case Study: Finding a good deal among hotels

- We look at Vienna hotels for a November 2017 weekday.
- We focus on hotels that are (i) in actual Vienna, (ii) not too far from the center, (iii) classified as hotels, (iv) 3-4 stars, and (v) do not have an extremely high price classified as an error.
- There are 428 hotel prices for that weekday in Vienna; our focused sample has N = 207 observations.


Case Study: Finding a good deal among hotels

[Figures: bin scatter non-parametric regression with 2 bins and with 4 bins]

Case Study: Finding a good deal among hotels

[Figures: scatter and bin scatter non-parametric regression with 4 bins and with 7 bins]

Case Study: Finding a good deal among hotels

- Lowess non-parametric regression, together with the scatterplot.
- The bandwidth selected by the software is 0.8 miles.
- The smooth non-parametric regression retains some aspects of the previous bin scatter: it is a smoother version of the corresponding non-parametric regression with disjoint bins of similar width.


Linear regression


Linear regression

Linear regression is the most widely used method in data analysis.

- It imposes linearity of the function f in y^E = f(x).
- Linear functions have two parameters, also called coefficients: the intercept and the slope.

  y^E = α + βx   (3)

- Linearity is in terms of the coefficients.
  - The regression can contain any function, including any nonlinear function, of the original variables themselves.
- Linear regression is a line through the x-y scatterplot.
  - This line is the best-fitting line one can draw through the scatterplot.
  - It is the best fit in the sense that it is the line that is closest to all points of the scatterplot.


Linear regression - assumption vs approximation

- Linearity as an assumption:
  - We assume that the regression function is linear in its coefficients.
- Linearity as an approximation:
  - Whatever the form of the y^E = f(x) relationship, the y^E = α + βx regression fits a line through it.
  - This may or may not be a good approximation.
  - By fitting a line we approximate the average slope of the y^E = f(x) curve.


Linear regression coefficients

Coefficients have a clear interpretation, based on comparing conditional means:

  E[y|x] = α + βx

Two coefficients:

- intercept: α = average value of y when x is zero:

  E[y|x = 0] = α + β × 0 = α

- slope: β = expected difference in y corresponding to a one-unit difference in x:

  E[y|x = x0 + 1] − E[y|x = x0] = (α + β(x0 + 1)) − (α + βx0) = β


Regression - slope coefficient

- slope: β = expected difference in y corresponding to a one-unit difference in x.
  - y is higher, on average, by β for observations with a one-unit higher value of x.
  - Comparing two observations that differ in x by one unit, we expect y to be β higher for the observation with the one-unit higher x.


Regression - slope coefficient interpretation

Several good ways to interpret the slope coefficient:

- y is higher, on average, by β for observations with a one-unit higher value of x.
- Comparing two observations that differ in x by one unit, we expect y to be β higher for the observation with the one-unit higher x.

Avoid using
- "decrease/increase" - not right, unless time series or causal relationship only
- "effect" - not right, unless causal relationship


Regression: binary explanatory

Simplest case:
- x is a binary variable: zero or one.
- α is the average value of y when x is zero (E[y|x = 0] = α).
- β is the difference in average y between observations with x = 1 and observations with x = 0:
  - E[y|x = 1] − E[y|x = 0] = (α + β × 1) − (α + β × 0) = β.
- The average value of y when x is one is E[y|x = 1] = α + β.
- Graphically, the regression line goes through two points: average y when x is zero (α) and average y when x is one (α + β).
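With a binary x the whole regression reduces to two group means, which a few lines of Python make concrete. The data here are hypothetical (the 3-star/4-star labeling is just an illustrative reading of the dummy):

```python
# Binary explanatory variable: the OLS line connects the two group means.
# alpha = mean of y when x = 0; alpha + beta = mean of y when x = 1.
from statistics import mean

x = [0, 0, 0, 1, 1, 1]        # e.g. 0 = 3-star, 1 = 4-star (hypothetical)
y = [80, 90, 100, 110, 120, 130]

mean0 = mean(yi for xi, yi in zip(x, y) if xi == 0)
mean1 = mean(yi for xi, yi in zip(x, y) if xi == 1)
alpha = mean0          # 90
beta = mean1 - mean0   # 30
print(alpha, beta)
```

The slope formula Cov[x, y]/Var[x] gives exactly the same β here, which is why the binary case is a good way to see that regression is comparing conditional means.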


Regression coefficient formula

Notation:
- General coefficients are α and β.
- Calculated estimates are α̂ and β̂ (use the data and calculate the statistic).
- The slope coefficient formula is

  β̂ = Cov[x, y] / Var[x] = [ (1/n) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) ] / [ (1/n) Σ_{i=1}^{n} (x_i − x̄)² ]

- The slope coefficient formula is a normalized version of the covariance between x and y.
  - The slope measures the covariance relative to the variation in x.
  - That is why the slope can be interpreted as differences in average y corresponding to differences in x.
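The two formulas above translate directly into code. A minimal pure-Python sketch on hypothetical data (the helper name `ols` is ours):

```python
# OLS slope and intercept as a direct translation of the formulas:
# beta_hat = Cov[x, y] / Var[x], alpha_hat = ybar - beta_hat * xbar.
def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
    var = sum((xi - xbar) ** 2 for xi in x) / n
    beta = cov / var
    alpha = ybar - beta * xbar
    return alpha, beta

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
alpha, beta = ols(x, y)
print(alpha, beta)
```

Note that dividing both the covariance and the variance by n (or by n − 1) makes no difference to β̂, since the factor cancels.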


Regression coefficient formula

- The intercept is average y minus average x multiplied by the estimated slope β̂:

  α̂ = ȳ − β̂x̄

- The formula of the intercept reveals that the regression line always goes through the point of average x and average y.
  - Note that you can rearrange to get: ȳ = α̂ + β̂x̄.


Ordinary Least Squares (OLS)

- OLS gives the best-fitting linear regression line.
- Picture a vertical line at the average value of x and a horizontal line at the average value of y: the regression line goes through the point of average x and average y.


Regression coefficient formula

- Ordinary Least Squares (OLS) is a method to find the best fit with a formula.
- The idea underlying OLS is to find the values of the intercept and slope parameters that make the regression line fit the scatterplot best.
- The OLS method finds the values of the coefficients of the linear regression that minimize the sum of squares of the differences between the actual y values and the values implied by the regression, α̂ + β̂x:

  min_{α,β} Σ_{i=1}^{n} (y_i − α − βx_i)²   (4)

For this minimization problem, we can use calculus to derive α̂ and β̂, the values of α and β that give the minimum. Please check out U7.1.
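A quick sanity check of the "least squares" idea, on hypothetical data: compute the closed-form α̂ and β̂, then verify that nudging them in any direction never lowers the sum of squared deviations.

```python
# OLS minimizes the sum of squared deviations. We check that the closed-form
# solution is a minimum by perturbing it and confirming the criterion only grows.
def ssr(alpha, beta, x, y):
    return sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
alpha = ybar - beta * xbar

best = ssr(alpha, beta, x, y)
for da in (-0.1, 0.0, 0.1):
    for db in (-0.1, 0.0, 0.1):
        assert ssr(alpha + da, beta + db, x, y) >= best
print(best)
```

This is only a spot check, of course; the calculus argument referenced as U7.1 shows the closed-form solution is the global minimum.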


Recap

- Simple regression analysis amounts to comparing average values of a dependent variable (y) for observations that are different in the explanatory variable (x).
- Simple regression, in any way or form: comparing conditional means.


Case Study: Finding a good deal among hotels

- The linear regression of hotel prices (in EUR) on distance (in miles) produces an intercept of 133 and a slope of -14.
- The intercept is 133, suggesting that the average price of hotels right in the city center is EUR 133.
- The slope of the linear regression is -14. Hotels that are 1 mile further away from the city center are, on average, EUR 14 cheaper in our data.
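Plugging the slide's rounded estimates into ŷ = α̂ + β̂x gives predicted average prices at any distance. A tiny sketch (the function name is ours; the coefficients 133 and −14 are the rounded values quoted above, not the exact fitted ones):

```python
# Predicted average price from the fitted line, using the slide's rounded
# estimates: intercept 133 EUR, slope -14 EUR per mile.
def predicted_price(miles, alpha=133.0, beta=-14.0):
    return alpha + beta * miles

print(predicted_price(0.0))  # 133.0 EUR right at the center
print(predicted_price(2.0))  # 105.0 EUR two miles out
```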


Case Study: Finding a good deal among hotels

- Compare the linear model and the non-parametric ones.
  - The linear model is an average that fails to capture the steep decline close to the center.
  - Not a bad approximation overall.


Predicted values
- The predicted value of the dependent variable = the best guess for its average value if we know the value of the explanatory variable, using our model.
- The predicted value can be calculated from the regression for any x.
- The predicted values of the dependent variable are the points of the regression line itself.
- The predicted value of the dependent variable y is denoted ŷ:

  ŷ = α̂ + β̂x

- What about non-parametric regressions?
  - Predicted dependent variables exist there, too.
  - They form a complete list of predicted values of the dependent variable for each value of the explanatory variable in the data.

Residuals

- The residual is the difference between the actual value of the dependent variable for an observation and its predicted value:

  e_i = y_i − ŷ_i, where ŷ_i = α̂ + β̂x_i

- The residual is meaningful only for actual observations.
  - While we can have predicted values for any x, actual y values are only available for the observations in our data.
- The residual is the vertical distance between the scatterplot point and the regression line.
  - For points above the regression line the residual is positive.
  - For points below the regression line the residual is negative.
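Predicted values and residuals follow mechanically once α̂ and β̂ are in hand. A short sketch on hypothetical data, which also previews the OLS property discussed two slides below (residuals sum to zero):

```python
# Residuals e_i = y_i - yhat_i for an OLS-fitted line (hypothetical data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
alpha = ybar - beta * xbar

yhat = [alpha + beta * xi for xi in x]          # points on the regression line
resid = [yi - yh for yi, yh in zip(y, yhat)]    # vertical distances to the line
print(sum(resid))          # ~0 up to floating-point error (an OLS property)
print(sum(yhat) / n, ybar) # average prediction equals average outcome
```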


Use of residuals

- The residual may be important in its own right.
- We may be interested in identifying observations that are special in that they have a dependent variable that is much higher or much lower than "it should be" as predicted by the regression.


Predicted dependent variable and residuals

- Residuals sum to zero if a linear regression is fitted by OLS.
  - Sum is zero -> the average of the residuals is zero, too.
  - The predicted average is equal to the actual average of y: average ŷ equals average y.
- See U7.2 for details.


Case Study: Finding a good deal among hotels

- The residual is a vertical distance.
- A positive residual is shown here: the price is above what is predicted by the regression line.


Case Study: Finding a good deal among hotels

- We can look at residuals from linear regressions.
  - Centered around zero.
  - Both positive and negative.


Case Study: Finding a good deal among hotels

- If linear regression is the accepted model for prices:
  - Draw a scatterplot with the regression line.
  - With the model you can identify the over- and underpriced hotels.


Case Study: Finding a good deal among hotels


A list of the hotels with the five lowest values of the residual:

No.  Hotel_id  Distance  Price  Predicted price  Residual
1    22080     1.1       54     116.17           -62.17
2    21912     1.1       60     116.17           -56.17
3    22152     1.0       63     117.61           -54.61
4    22408     1.4       58     111.85           -53.85
5    22090     0.9       68     119.05           -51.05

- Bear in mind, we can (and will) do better - this is not the best model for price prediction:
  - Non-linear pattern
  - Functional form
  - Taking into account differences beyond distance


Regression modelling


Model fit - R^2

- The fit of a regression captures how the predicted values compare to the actual values.
- R-squared (R^2): how much of the variation in y is captured by the regression, and how much is left for residual variation.

  R^2 = Var[ŷ] / Var[y] = 1 − Var[e] / Var[y]   (5)

  where Var[y] = (1/n) Σ_{i=1}^{n} (y_i − ȳ)², Var[ŷ] = (1/n) Σ_{i=1}^{n} (ŷ_i − ȳ)², and Var[e] = (1/n) Σ_{i=1}^{n} e_i². Note that the mean of ŷ equals ȳ, and ē = 0.

- Decomposition of the overall variation in y into variation in the predicted values ("explained by the regression") and residual variation ("not explained by the regression"):

  Var[y] = Var[ŷ] + Var[e]   (6)
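Equations (5) and (6) are easy to verify numerically. A pure-Python sketch on hypothetical data: fit the line by OLS, then compute R^2 both ways and check the variance decomposition.

```python
# R^2 computed two ways, plus the decomposition Var[y] = Var[yhat] + Var[e].
x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
alpha = ybar - beta * xbar
yhat = [alpha + beta * xi for xi in x]
resid = [yi - yh for yi, yh in zip(y, yhat)]

def var(v, m):
    return sum((vi - m) ** 2 for vi in v) / len(v)

r2_a = var(yhat, ybar) / var(y, ybar)       # Var[yhat] / Var[y]
r2_b = 1 - var(resid, 0.0) / var(y, ybar)   # 1 - Var[e] / Var[y]; mean residual is 0
print(r2_a, r2_b)
```

Both routes agree, and Var[ŷ] + Var[e] reproduces Var[y] exactly, which is what makes R^2 a share between zero and one.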


Model fit - R^2

- R-squared (R^2) can be defined for both parametric and non-parametric regressions.
  - Any kind of regression produces predicted ŷ values, and all we need to compute R^2 is their variance compared to the variance of y.
- The value of R-squared is always between zero and one.
- R-squared is zero if the predicted values are just the average of the observed outcome: ŷ_i = ȳ for all i.


Model fit - R^2 - A question

- When the R-squared is zero, what does the regression line look like?
- What about when it's not zero, but very small?


Model fit

- Fit depends on (1): how well the particular version of the regression captures the actual function f in y^E = f(x).
  - Can be helped by modeling.
- Fit depends on (2): how far the actual values of y are spread around what would be predicted using the actual function f.
  - Given by the data.


Model fit - how to use R^2

- R-squared may help in choosing between different versions of a regression for the same data.
  - Choose between regressions with different functional forms.
  - Predictions are likely to be better with a high R^2.
  - More on this in Chapters 13-14.
- R-squared matters less when the goal is to characterize the association between y and x.


Correlation and linear regression


- Linear regression is closely related to correlation.
- Remember the OLS formula for the slope:

  β̂ = Cov[y, x] / Var[x]

- In contrast with the correlation coefficient, its value can be anything. Furthermore, y and x are not interchangeable.
- The covariance and the correlation coefficient can be substituted in to get β̂:

  β̂ = Corr[x, y] × Std[y] / Std[x]

- Covariance, the correlation coefficient, and the slope of a linear regression capture similar information: the degree of association between the two variables.

Correlation and R^2 in linear regression

- The R-squared of the simple linear regression is the square of the correlation coefficient:

  R^2 = (Corr[y, x])²

- So the R-squared is yet another measure of the association between the two variables.
- To show this equality holds, the trick is to substitute into the numerator of R-squared and manipulate:

  R^2 = Var[ŷ] / Var[y] = Var[α̂ + β̂x] / Var[y] = β̂² Var[x] / Var[y] = (β̂ Std[x] / Std[y])² = (Corr[y, x])²
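The identity R^2 = Corr[y, x]² can be confirmed numerically in a few lines. A pure-Python sketch on hypothetical data:

```python
# Check that R^2 of the simple linear regression equals the squared correlation.
x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
var_x = sum((xi - xbar) ** 2 for xi in x) / n
var_y = sum((yi - ybar) ** 2 for yi in y) / n
corr = cov / (var_x ** 0.5 * var_y ** 0.5)

beta = cov / var_x
yhat = [(ybar - beta * xbar) + beta * xi for xi in x]
var_yhat = sum((yh - ybar) ** 2 for yh in yhat) / n
r2 = var_yhat / var_y
print(r2, corr ** 2)  # the two numbers agree
```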


Reverse regression

- Consider two similar models:

  y^E = α + βx
  x^E = γ + δy

- What can we say about the estimated coefficients?
- What can we say about the R^2?


Reverse regression

- One can switch the variables, but the interpretation is going to change as well!

  x^E = γ + δy

- The OLS estimator for the slope coefficient here is δ̂ = Cov[y, x] / Var[y].
- The OLS slopes of the original regression and the reverse regression are related:

  β̂ = δ̂ × Var[y] / Var[x]

- They are different, unless Var[x] = Var[y],
  - but they always have the same sign,
  - and both are larger in magnitude the larger the covariance.
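The relationship between the two slopes (and the answer to the R^2 question on the next slide) can be checked directly. A pure-Python sketch on hypothetical data; note that β̂ × δ̂ = Corr[x, y]², which is the R^2 shared by both regressions:

```python
# Reverse regression: delta_hat = Cov[y, x] / Var[y], so
# beta_hat = delta_hat * Var[y] / Var[x], and both regressions share R^2 = Corr^2.
x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
var_x = sum((xi - xbar) ** 2 for xi in x) / n
var_y = sum((yi - ybar) ** 2 for yi in y) / n

beta = cov / var_x                 # slope of y on x
delta = cov / var_y                # slope of x on y
r2 = cov ** 2 / (var_x * var_y)    # shared R^2 = Corr[x, y]^2 = beta * delta
print(beta, delta, r2)
```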


Reverse regression: A question

- Is the R^2 of the simple linear regression and the reverse regression
  - exactly the same,
  - close but not the same, or
  - different?
- Why?


Regression and causation

- We were very careful to use neutral language and not talk about causation.
- Think back to the sources of variation in x.
- When we have observational data, we pick x and y and decide how to run the regression.
- Regression is a method of comparison: it compares observations that are different in variable x and shows the corresponding average differences in variable y.
  - It is a way to find patterns of association by comparisons.
- If we can't infer causation from regression analysis, that is not the fault of the method.


Regression and causation - possible relations

- The slope of the y^E = α + βx regression is not zero in our data.
- There are several reasons, not mutually exclusive:
  - x causes y;
  - y causes x;
  - a third variable causes both x and y (or many such variables do).
- In reality, if we have observational data, there is a mix of these relations.
- For more, see Chapters 19-21.


Regression and causation

- Yes: "correlation (regression) does not imply causation."
- Better: we should not infer cause and effect from comparisons in observational data.
- The suggested approach has two steps:
  - First, interpret precisely the object (correlation or slope coefficient).
  - Then conclude, and discuss causal claims if any.


Case Study: Finding a good deal among hotels

- Fit and causation:
  - The R-squared of the regression is 0.16 = 16%.
  - This means that of the overall variation in hotel prices, 16% is explained by the linear regression with distance to the city center; the remaining 84% is left unexplained.
  - 16% is good for a cross-sectional regression with a single explanatory variable.
  - In any case, it is the fit of the best-fitting line.


Case Study: Finding a good deal among hotels

- The slope is -14.
- Does that mean that a longer distance causes hotels to be cheaper?


Summary take-away

- Regression: a method to compare average y across observations with different values of x.
- Non-parametric regressions (bin scatter, lowess) visualize complicated patterns of association between y and x, but produce no interpretable number.
- Linear regression: a linear approximation of the average pattern of association between y and x.
- In y^E = α + βx, β shows how much larger y is, on average, for observations with a one-unit larger x.
- When β is not zero, one of three things (or any combination) may be true:
  - x causes y
  - y causes x
  - a third variable causes both x and y

