Da Public Slides Ch07 v3 2023
Motivation
Data Analysis for Business, Economics, and Policy 2 / 57 Gábor Békés (Central European University)
Introduction Regression basics Case: Hotels 1 Linear regression CS:A2 Residuals CS:A3 OLS Modeling Causation CS:A4 Summary
Introduction
Regression
Regression - uses
y^E = f(x)    (2)
- y: dependent variable, left-hand-side variable, or simply the y variable.
- x: explanatory variable, right-hand-side variable, or simply the x variable.
- “Regress y on x” or “run a regression of y on x” = do simple regression analysis with y as the dependent variable and x as the explanatory variable.
- With many x values, there are two ways to do non-parametric regression analysis: bins and smoothing.
- Bins are based on grouped values of x.
  - Bins are disjoint categories (no overlap) that span the entire range of x (no gaps).
  - There are many ways to create bins: equal size, equal number of observations per bin, or bins defined by the analyst.
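A minimal sketch of the bin approach, assuming equal-width bins and made-up numbers (the function name `bin_means` and the data are illustrative, not from the slides). It computes the average of y within each bin of x:

```python
from statistics import fmean

def bin_means(x, y, n_bins):
    """Average y within n_bins equal-width bins of x (empty bins are skipped).

    Assumes x is not constant, so that the bin width is positive.
    """
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    groups = [[] for _ in range(n_bins)]
    for xi, yi in zip(x, y):
        # The largest x goes into the last bin so the bins cover the whole range.
        k = min(int((xi - lo) / width), n_bins - 1)
        groups[k].append(yi)
    return [
        (lo + k * width, lo + (k + 1) * width, fmean(g))
        for k, g in enumerate(groups)
        if g
    ]

# Made-up illustrative data (x could be distance, y could be price).
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
y = [200, 180, 150, 140, 120, 125, 110, 100]

for left, right, mean_y in bin_means(x, y, n_bins=2):
    print(f"x from {left:.1f} to {right:.1f}: mean y = {mean_y:.2f}")
```

Bins with an equal number of observations would need a different split rule (for example, cutting at quantiles of x); the averaging step stays the same.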
- Smoothing produces a “smooth” graph: one that is continuous and has no kink at any point.
- Also called smoothed conditional means plots: non-parametric regression shows conditional means, smoothed to give a clearer picture.
- Lowess is the most widely used non-parametric regression method that produces a smooth graph.
  - Lowess stands for locally weighted scatterplot smoothing (sometimes abbreviated as “loess”).
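To give a feel for what "locally weighted" means, here is a simplified sketch: at each x value it fits a weighted line using only nearby points, with tricube weights that fall to zero at the edge of the window. This is an illustration under stated assumptions, not the full lowess algorithm (it has no robustness iterations), and the function name `local_linear_smooth`, the `frac` parameter, and the data are made up for this example.

```python
def local_linear_smooth(x, y, frac=0.5):
    """Fit a tricube-weighted line around each x value; return the fitted values.

    Assumes len(x) >= 2 and 0 < frac <= 1.
    """
    n = len(x)
    k = max(2, round(frac * n))  # number of nearest points that define each window
    fitted = []
    for x0 in x:
        # Window half-width: distance to the k-th nearest point (guard against zero).
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
        # Tricube weights: 1 at x0, decreasing to 0 at distance h and beyond.
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(y_bar + slope * (x0 - x_bar))
    return fitted

# Made-up illustrative data.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [9.0, 7.5, 7.0, 5.0, 5.5, 4.0, 4.5, 3.0]
smooth = local_linear_smooth(x, y, frac=0.5)
print([round(v, 2) for v in smooth])
```

A larger `frac` uses more neighbours per window and gives a smoother curve; a smaller one follows the data more closely.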
[Figure, two panels: bin scatter non-parametric regression with 2 bins (left) and with 4 bins (right)]
[Figure, two panels: scatterplot with bin scatter non-parametric regression, 4 bins (left) and 7 bins (right)]
Linear regression
- Linearity as an assumption:
  - Assume that the regression function is linear in its coefficients.
- Linearity as an approximation:
  - Whatever the form of the y^E = f(x) relationship, the y^E = α + βx regression fits a line through it.
  - This may or may not be a good approximation.
  - By fitting a line we approximate the average slope of the y^E = f(x) curve.
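E[y|x] = α + βx
Two coefficients:
- intercept: α = average value of y when x is zero:
  - E[y|x = 0] = α + β × 0 = α.
- slope: β = difference in average y between observations that differ in x by one unit:
  - E[y|x = x0 + 1] − E[y|x = x0] = β.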
Avoid using:
- “decrease/increase”: not right unless the data are time series or the relationship is causal
- “effect”: not right unless the relationship is causal
Simplest case:
- x is a binary variable, zero or one.
- α is the average value of y when x is zero (E[y|x = 0] = α).
- β is the difference in average y between observations with x = 1 and observations with x = 0:
  - E[y|x = 1] − E[y|x = 0] = (α + β × 1) − (α + β × 0) = β.
- The average value of y when x is one is E[y|x = 1] = α + β.
- Graphically, the regression line of linear regression goes through two points: average y when x is zero (α) and average y when x is one (α + β).
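The binary case can be checked numerically. The sketch below uses made-up numbers and a small helper (`ols`, a name chosen here for illustration) to show that the estimated intercept equals the average y in the x = 0 group and the estimated slope equals the difference between the two group averages:

```python
from statistics import fmean

def ols(x, y):
    """Return (intercept, slope) of the simple linear regression of y on x."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up data: x is binary, y is any outcome.
x = [0, 0, 0, 1, 1]
y = [10.0, 12.0, 14.0, 20.0, 22.0]

alpha_hat, beta_hat = ols(x, y)
mean_y0 = fmean(yi for xi, yi in zip(x, y) if xi == 0)
mean_y1 = fmean(yi for xi, yi in zip(x, y) if xi == 1)

print(alpha_hat, mean_y0)            # intercept vs. average y when x = 0
print(beta_hat, mean_y1 - mean_y0)   # slope vs. difference in group averages
```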
Notation:
- General coefficients are α and β.
- Calculated estimates are α̂ and β̂ (use the data and calculate the statistic).
- The slope coefficient formula is

$$\hat{\beta} = \frac{\mathrm{Cov}[x, y]}{\mathrm{Var}[x]} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
- The intercept is average y minus average x multiplied by the estimated slope β̂:
  α̂ = ȳ − β̂ x̄
- The formula of the intercept reveals that the regression line always goes through the point of average x and average y.
- Note that rearranging gives ȳ = α̂ + β̂ x̄.
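The slope and intercept formulas translate directly into code. The following is a sketch with made-up numbers; the function name `ols_coefficients` is chosen for this example only:

```python
from statistics import fmean

def ols_coefficients(x, y):
    """Slope = Cov[x, y] / Var[x]; intercept = mean(y) - slope * mean(x)."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - x_bar) ** 2 for xi in x) / n
    beta_hat = cov_xy / var_x
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

# Made-up numbers, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

alpha_hat, beta_hat = ols_coefficients(x, y)
print(alpha_hat, beta_hat)

# The fitted line evaluated at average x gives average y (up to rounding error).
print(alpha_hat + beta_hat * fmean(x), fmean(y))
```

The 1/n factors in the covariance and the variance cancel, so dividing the raw sums gives the same slope.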
- Ordinary Least Squares (OLS) is a method to find the best fit with a formula.
- The idea underlying OLS is to find the values of the intercept and slope parameters that make the regression line fit the scatterplot best.
- The OLS method finds the values of the coefficients of the linear regression that minimize the sum of squares of the difference between actual y values and their values implied by the regression, α̂ + β̂x:

$$\min_{\alpha,\beta} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 \qquad (4)$$

For this minimization problem, we can use calculus to obtain α̂ and β̂, the values of α and β that give the minimum. See U7.1 for details.
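The minimization property can be illustrated numerically. This sketch (made-up data, illustrative function names) computes the sum of squared residuals at the closed-form estimates and checks that moving the intercept or the slope away from them never lowers it. It is a spot check on a small grid, not a proof:

```python
from statistics import fmean

def ssr(alpha, beta, x, y):
    """Sum of squared differences between actual y and alpha + beta * x."""
    return sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))

def ols_coefficients(x, y):
    """Closed-form OLS intercept and slope for simple linear regression."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sxy / sxx
    return y_bar - beta_hat * x_bar, beta_hat

# Made-up numbers, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

alpha_hat, beta_hat = ols_coefficients(x, y)
best = ssr(alpha_hat, beta_hat, x, y)

# Try nearby coefficient values; none should give a smaller sum of squares.
steps = [-0.5, -0.1, 0.0, 0.1, 0.5]
no_improvement = all(
    ssr(alpha_hat + da, beta_hat + db, x, y) >= best - 1e-12
    for da in steps
    for db in steps
)
print(best, no_improvement)
```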
Recap
Predicted values
- The predicted value of the dependent variable is the best guess for its average value if we know the value of the explanatory variable, using our model.
- The predicted value can be calculated from the regression for any x.
- The predicted values of the dependent variable are the points of the regression line itself.
- The predicted value of dependent variable y is denoted as ŷ:
  ŷ = α̂ + β̂x
Residuals
- The residual is the difference between the actual value of the dependent variable for an observation and its predicted value:
  e = y − ŷ
- The residual is the vertical distance between the scatterplot point and the regression line.
  - For points above the regression line the residual is positive.
  - For points below the regression line the residual is negative.
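Predicted values and residuals follow mechanically once the coefficients are estimated. A short sketch with made-up numbers (variable and function names are illustrative):

```python
from statistics import fmean

def ols_coefficients(x, y):
    """Closed-form OLS intercept and slope for simple linear regression."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sxy / sxx
    return y_bar - beta_hat * x_bar, beta_hat

# Made-up numbers, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

alpha_hat, beta_hat = ols_coefficients(x, y)
y_hat = [alpha_hat + beta_hat * xi for xi in x]      # points on the regression line
residuals = [yi - yh for yi, yh in zip(y, y_hat)]    # actual minus predicted

for xi, yi, e in zip(x, y, residuals):
    position = "above" if e > 0 else "below" if e < 0 else "on"
    print(f"x={xi}: y={yi}, residual={e:+.2f} ({position} the line)")

# With an intercept in the regression, OLS residuals average to zero.
print(abs(sum(residuals)) < 1e-9)
```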
Use of residuals
- If linear regression is the accepted model for prices
- Draw a scatterplot with the regression line
- With the model you can identify the overpriced and underpriced hotels
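One way to use the residuals for this purpose is to rank observations by residual: the most negative residuals are the hotels that are cheapest relative to what the line predicts for their distance. The sketch below uses entirely made-up hotel names, distances, and prices; it is not the case study data.

```python
from statistics import fmean

# Hypothetical hotels: (name, distance, price). All values are made up.
hotels = [
    ("A", 0.5, 160.0),
    ("B", 1.0, 120.0),
    ("C", 1.5, 135.0),
    ("D", 2.0, 90.0),
    ("E", 3.0, 95.0),
    ("F", 4.0, 70.0),
]

dist = [h[1] for h in hotels]
price = [h[2] for h in hotels]

# Simple linear regression of price on distance.
d_bar, p_bar = fmean(dist), fmean(price)
beta_hat = sum((d - d_bar) * (p - p_bar) for d, p in zip(dist, price)) / sum(
    (d - d_bar) ** 2 for d in dist
)
alpha_hat = p_bar - beta_hat * d_bar

# Residual = actual price minus predicted price; sort from most negative up.
ranked = sorted(
    ((name, p - (alpha_hat + beta_hat * d)) for name, d, p in hotels),
    key=lambda item: item[1],
)
most_underpriced = ranked[:2]
for name, resid in most_underpriced:
    print(f"hotel {name}: residual = {resid:.1f}")
```

The ranking is only as good as the model: a hotel may look underpriced simply because the line misses features other than distance.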
- Bear in mind that we can (and will) do better: this is not the best model for price prediction.
  - Non-linear pattern
  - Functional form
  - Taking into account differences beyond distance
Regression modelling
Model fit - R²
- Fit of a regression captures how predicted values compare to the actual values.
- R-squared (R²): how much of the variation in y is captured by the regression, and how much is left for residual variation:

$$R^2 = \frac{\mathrm{Var}[\hat{y}]}{\mathrm{Var}[y]} = 1 - \frac{\mathrm{Var}[e]}{\mathrm{Var}[y]} \qquad (5)$$
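Both expressions for R-squared can be computed and compared. A sketch with made-up numbers, using population variances from the standard library:

```python
from statistics import fmean, pvariance

# Made-up numbers, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Simple linear regression of y on x.
x_bar, y_bar = fmean(x), fmean(y)
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
alpha_hat = y_bar - beta_hat * x_bar

y_hat = [alpha_hat + beta_hat * xi for xi in x]
e = [yi - yh for yi, yh in zip(y, y_hat)]

r2_explained = pvariance(y_hat) / pvariance(y)   # Var[y_hat] / Var[y]
r2_residual = 1 - pvariance(e) / pvariance(y)    # 1 - Var[e] / Var[y]
print(r2_explained, r2_residual)
```

The two numbers agree because, for OLS with an intercept, the variance of y splits into the variance of the predicted values plus the variance of the residuals.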
- When the R-squared is zero, what does the regression line look like?
- What about when it is not zero, but very small?
Model fit
- Fit depends on (1) how well the particular version of the regression captures the actual function f in y^E = f(x).
  - Can be helped by modeling.
- Fit depends on (2) how far actual values of y are spread around what would be predicted using the actual function f.
  - Given by the data.
- R-squared may help in choosing between different versions of regression for the same data.
  - Choose between regressions with different functional forms.
  - Predictions are likely to be better with high R².
  - More on this in Chapters 13-14.
- R-squared matters less when the goal is to characterize the association between y and x.
$$\hat{\beta} = \frac{\mathrm{Cov}[y, x]}{\mathrm{Var}[x]}$$

- In contrast with the correlation coefficient, the value of the slope can be anything. Furthermore, y and x are not interchangeable.
- Writing the covariance in terms of the correlation coefficient gives another expression for β̂:

$$\hat{\beta} = \mathrm{Corr}[x, y]\,\frac{\mathrm{Std}[y]}{\mathrm{Std}[x]}$$

- Covariance, the correlation coefficient, and the slope of a linear regression capture similar information: the degree of association between the two variables.
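The two slope expressions can be compared on a small made-up dataset. The helper `cov` below is defined for this example (population covariance); it is not a library function:

```python
from statistics import fmean, pstdev

def cov(a, b):
    """Population covariance of two equal-length lists."""
    a_bar, b_bar = fmean(a), fmean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / len(a)

# Made-up numbers, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

corr_xy = cov(x, y) / (pstdev(x) * pstdev(y))

beta_from_cov = cov(x, y) / cov(x, x)                 # Cov[y, x] / Var[x]
beta_from_corr = corr_xy * pstdev(y) / pstdev(x)      # Corr[x, y] * Std[y] / Std[x]
print(beta_from_cov, beta_from_corr)
```

The correlation is always between -1 and 1, while the slope takes on the units of y per unit of x, which is why its value is unrestricted.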
- R-squared of the simple linear regression is the square of the correlation coefficient:
  R² = (Corr[y, x])²
- So the R-squared is yet another measure of the association between the two variables.
- To show that this equality holds, the trick is to substitute the numerator of R-squared and manipulate:

$$R^2 = \frac{\mathrm{Var}[\hat{y}]}{\mathrm{Var}[y]} = \frac{\mathrm{Var}[\hat{\alpha} + \hat{\beta}x]}{\mathrm{Var}[y]} = \frac{\hat{\beta}^2\,\mathrm{Var}[x]}{\mathrm{Var}[y]} = \left(\hat{\beta}\,\frac{\mathrm{Std}[x]}{\mathrm{Std}[y]}\right)^2 = \left(\mathrm{Corr}[y, x]\right)^2$$
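The equality can also be verified numerically. A sketch with made-up numbers that computes R-squared from the fitted values and compares it with the squared correlation:

```python
from statistics import fmean, pstdev, pvariance

# Made-up numbers, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar, y_bar = fmean(x), fmean(y)
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)

# R-squared from the regression of y on x.
beta_hat = cov_xy / pvariance(x)
alpha_hat = y_bar - beta_hat * x_bar
y_hat = [alpha_hat + beta_hat * xi for xi in x]
r_squared = pvariance(y_hat) / pvariance(y)

# Squared correlation coefficient.
corr_squared = (cov_xy / (pstdev(x) * pstdev(y))) ** 2
print(r_squared, corr_squared)
```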
Reverse regression
y^E = α + βx
x^E = γ + δy
- What can we say about the estimated coefficients?
- What can we say about the R²?
Reverse regression
- One can swap the variables, but the interpretation is going to change as well:
  x^E = γ + δy
- The OLS estimator for the slope coefficient here is

$$\hat{\delta} = \frac{\mathrm{Cov}[y, x]}{\mathrm{Var}[y]}$$

- The OLS slopes of the original regression and the reverse regression are related:

$$\hat{\beta} = \hat{\delta}\,\frac{\mathrm{Var}[y]}{\mathrm{Var}[x]}$$
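A numerical sketch of the relationship between the two slopes, on made-up numbers. It also shows two facts that follow from the formulas: the product of the two slopes equals the squared correlation (so both regressions have the same R-squared), and the reverse slope is generally not the reciprocal of the original slope.

```python
from statistics import fmean, pvariance

# Made-up numbers, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar, y_bar = fmean(x), fmean(y)
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)

beta_hat = cov_xy / pvariance(x)    # slope of y regressed on x
delta_hat = cov_xy / pvariance(y)   # slope of x regressed on y

# Relationship between the two slopes.
print(beta_hat, delta_hat * pvariance(y) / pvariance(x))

# Product of the slopes equals the squared correlation (the common R-squared).
corr_squared = cov_xy ** 2 / (pvariance(x) * pvariance(y))
print(beta_hat * delta_hat, corr_squared)

# The reverse slope is not simply 1 / beta_hat unless the fit is perfect.
print(delta_hat, 1 / beta_hat)
```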
- We were very careful to use neutral language and not talk about causation.
- Think back to sources of variation in x.
- With observational data, we pick x and y and decide how to run the regression.
- Regression is a method of comparison: it compares observations that are different in variable x and shows corresponding average differences in variable y.
- It is a way to find patterns of association by comparisons.
- If we cannot infer causation from regression analysis, that is not the fault of the method.
- Slope is -14.
- Does that mean that a longer distance causes hotels to be cheaper?
Summary take-away