
Békés-Kézdi: Data Analysis, Chapter 07: Simple regression

Data Analysis for Business, Economics, and Policy
Gábor Békés (Central European University)
Gábor Kézdi (University of Michigan)
Cambridge University Press, 2021
gabors-data-analysis.com

Central European University


Version: v3.1 License: CC BY-NC 4.0
Any comments or suggestions:
[email protected]
Introduction Regression basics Case: Hotels 1 Linear regression CS:A2 Residuals CS:A3 OLS Modeling Causation CS:A4 Summary

Motivation

- You spend a night in Vienna and want to find a good deal for your stay.
- Travel time to the city center is rather important.
- Looking for a good deal: as low a price as possible and as close to the city center as possible.
- Collect data on suitable hotels.

Data Analysis for Business, Economics, and Policy 2 / 57 Gábor Békés (Central European University)

Introduction

- Regression is the most widely used method of comparison in data analysis.
- Simple regression analysis amounts to comparing average values of a dependent variable (y) for observations that are different in the explanatory variable (x).
- Simple regression: comparing conditional means.
- Doing so uncovers the pattern of association between y and x. What you use for y and for x is important and not interchangeable!


Regression: comparing conditional means


Regression

- Simple regression analysis uncovers mean-dependence between two variables.
- It amounts to comparing average values of one variable, called the dependent variable (y), for observations that are different in the other variable, the explanatory variable (x).
- Multiple regression analysis involves more variables -> later.


Regression - uses

- Discovering patterns of association between variables is often a good starting point even if our question is more ambitious.
- Causal analysis: uncovering the effect of one variable on another variable. Concerned with a parameter.
- Predictive analysis: what to expect of a y variable (long-run polls, hotel prices) for various values of another x variable (immediate polls, distance to the city center). Concerned with the predicted value of y using x.


Regression - names and notation

- Regression analysis is a method that uncovers the average value of a variable y for different values of another variable x:

  E[y|x] = f(x)   (1)

  We use a simpler shorthand notation:

  y^E = f(x)   (2)

- dependent variable, left-hand-side variable, or simply the y variable
- explanatory variable, right-hand-side variable, or simply the x variable
- "regress y on x" or "run a regression of y on x" = do simple regression analysis with y as the dependent variable and x as the explanatory variable.


Regression - type of patterns

Regression may find

- Linear patterns: positive (negative) association - average y tends to be higher (lower) at higher values of x.
- Non-linear patterns: the association may even be non-monotonic - y tends to be higher for higher values of x in a certain range of the x variable and lower for higher values of x in another range of the x variable.
- No association or relationship.


Non-parametric and parametric regression

- Non-parametric regressions describe the y^E = f(x) pattern without imposing a specific functional form on f.
  - Data-driven and flexible, no theory.
  - Can capture any pattern.

- Parametric regressions impose a functional form on f. Parametric examples include:
  - linear functions: f(x) = a + bx;
  - exponential functions: f(x) = ax^b;
  - quadratic functions: f(x) = a + bx + cx^2;
  - or any function that has parameters a, b, c, etc.
  - Restrictive, but they produce readily interpretable numbers.


Non-parametric regression: average by each value

- Non-parametric regressions come in various forms.
- The most intuitive non-parametric regression for y^E = f(x) shows average y for each and every value of x.
  - Works well when x has few values and there are many observations in the data.
  - There is no functional form imposed on f here.


Non-parametric regression: Categorical variable

- Sometimes there is no straightforward functional form for f:
  - Categorical variables
  - Ordered variables
- For example, hotels: take the average price of hotels with the same number of stars and compare these averages = non-parametric regression analysis.


Non-parametric regression: bins

- With many x values, there are two ways to do non-parametric regression analysis: bins and smoothing.
- Bins - based on grouped values of x.
  - Bins are disjoint categories (no overlap) that span the entire range of x (no gaps).
  - Many ways to create bins - equal size, equal number of observations per bin, or bins defined by the analyst.
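The bin idea above can be sketched in a few lines of pure Python. This is a minimal illustration on hypothetical distance/price pairs (the function name `bin_means` and the data are made up for this example): split the range of x into equal-width disjoint bins, then take the mean of y within each bin.

```python
# Sketch: non-parametric regression with equal-width bins (hypothetical data).
# Each observation is assigned to a disjoint bin of x; we report mean y per bin.
from statistics import mean

def bin_means(x, y, n_bins):
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    # bin index for each observation; the maximum x falls into the last bin
    idx = [min(int((xi - lo) / width), n_bins - 1) for xi in x]
    return {b: mean(yi for i, yi in zip(idx, y) if i == b) for b in sorted(set(idx))}

# hypothetical distance (miles) and price (EUR) pairs
x = [0.2, 0.4, 0.9, 1.1, 1.6, 2.3, 2.8, 3.5]
y = [220, 190, 140, 120, 110, 95, 90, 80]
print(bin_means(x, y, n_bins=2))  # average price in the near bin and the far bin
```

With two bins the output is one conditional mean per half of the distance range; increasing `n_bins` gives a finer (but noisier) picture, mirroring the bin-scatter figures later in the chapter.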


Non-parametric regression: lowess (loess)

- Produces a "smooth" graph: one that is both continuous and has no kink at any point.
- Also called smoothed conditional means plots = a non-parametric regression that shows conditional means, smoothed to get a better image.
- Lowess = the most widely used non-parametric regression method that produces a smooth graph.
  - Locally weighted scatterplot smoothing (sometimes abbreviated as "loess").
- A smooth curve fit around a bin scatter.
- Related to density plots (Chapter 03): we set the bandwidth for smoothing.
  - A wider bandwidth results in a smoother graph but may miss important details of the pattern.
  - A narrower bandwidth produces a more rugged-looking graph.
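The bandwidth trade-off can be seen even without full lowess machinery. The sketch below is not lowess itself but a much simpler relative (a uniform-kernel conditional-mean smoother, with made-up data): for each grid point it averages y over observations whose x lies within ±h. A larger h averages over more points, so the curve gets smoother.

```python
# Illustration of bandwidth in smoothing (a simplified stand-in for lowess):
# at each grid point, average y over observations with x within +/- h.
from statistics import mean

def kernel_mean(x, y, grid, h):
    out = []
    for g in grid:
        nearby = [yi for xi, yi in zip(x, y) if abs(xi - g) <= h]
        out.append(mean(nearby) if nearby else None)
    return out

x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [200, 150, 130, 120, 100, 95]
print(kernel_mean(x, y, grid=[1.0, 2.0, 3.0], h=0.6))  # narrow bandwidth: rugged
print(kernel_mean(x, y, grid=[1.0, 2.0, 3.0], h=1.6))  # wide bandwidth: smoother
```

Real lowess additionally weights nearby points more heavily and fits local lines rather than local means, but the bandwidth intuition is the same.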


Non-parametric regression: lowess (loess)

- Smooth non-parametric regression methods, including lowess, do not produce numbers that would summarize the y^E = f(x) pattern.
- They provide a value of y^E for each of the particular x values that occur in the data, as well as for all x values in between.
- Graph: we interpret these graphs in qualitative, not quantitative ways.
  - They can show interesting shapes in the pattern, such as non-monotonic parts, steeper and flatter parts, etc.
- A great way to find relationship patterns.


Case Study: Finding a good deal among hotels

- We look at Vienna hotels for a November 2017 weekday.
- We focus on hotels that are (i) in actual Vienna, (ii) not too far from the center, (iii) classified as hotels, (iv) 3-4 stars, and (v) do not have an extremely high price classified as an error.
- There are 428 hotel prices for that weekday in Vienna; our focused sample has N = 207 observations.


Case Study: Finding a good deal among hotels

[Figures: bin scatter non-parametric regression with 2 bins and with 4 bins]

Case Study: Finding a good deal among hotels

[Figures: scatter and bin scatter non-parametric regression with 4 bins and with 7 bins]

Case Study: Finding a good deal among hotels

- Lowess non-parametric regression, together with the scatterplot.
- The bandwidth selected by the software is 0.8 miles.
- The smooth non-parametric regression retains some aspects of the previous bin scatter: it is a smoother version of the corresponding non-parametric regression with disjoint bins of similar width.


Linear regression


Linear regression

Linear regression is the most widely used method in data analysis.

- It imposes linearity of the function f in y^E = f(x).
- Linear functions have two parameters, also called coefficients: the intercept and the slope.

  y^E = α + βx   (3)

- Linearity is in terms of the coefficients.
  - The regression can contain any function, including any nonlinear function, of the original variables themselves.
- Linear regression is a line through the x-y scatterplot.
  - This line is the best-fitting line one can draw through the scatterplot.
  - It is the best fit in the sense that it is the line that is closest to all points of the scatterplot.


Linear regression - assumption vs approximation

- Linearity as an assumption:
  - We assume that the regression function is linear in its coefficients.
- Linearity as an approximation:
  - Whatever the form of the y^E = f(x) relationship, the y^E = α + βx regression fits a line through it.
  - This may or may not be a good approximation.
  - By fitting a line we approximate the average slope of the y^E = f(x) curve.


Linear regression coefficients

Coefficients have a clear interpretation, based on comparing conditional means:

  E[y|x] = α + βx

Two coefficients:

- intercept: α = average value of y when x is zero:

  E[y|x = 0] = α + β × 0 = α

- slope: β = expected difference in y corresponding to a one-unit difference in x:

  E[y|x = x0 + 1] − E[y|x = x0] = (α + β(x0 + 1)) − (α + βx0) = β


Regression - slope coefficient

- slope: β = expected difference in y corresponding to a one-unit difference in x.
  - y is higher, on average, by β for observations with a one-unit higher value of x.
  - Comparing two observations that differ in x by one unit, we expect y to be β higher for the observation with the one-unit higher x.


Regression - slope coefficient interpretation

Several good ways to interpret the slope coefficient:

- y is higher, on average, by β for observations with a one-unit higher value of x.
- Comparing two observations that differ in x by one unit, we expect y to be β higher for the observation with the one-unit higher x.

Avoid using
- "decrease/increase" - not right, unless time series or causal relationship only
- "effect" - not right, unless causal relationship


Regression: binary explanatory

Simplest case:
- x is a binary variable: zero or one.
- α is the average value of y when x is zero (E[y|x = 0] = α).
- β is the difference in average y between observations with x = 1 and observations with x = 0:
  - E[y|x = 1] − E[y|x = 0] = (α + β × 1) − (α + β × 0) = β.
- The average value of y when x is one is E[y|x = 1] = α + β.
- Graphically, the regression line goes through two points: average y when x is zero (α) and average y when x is one (α + β).
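With a binary x the whole regression reduces to two group means, which a few lines of Python make concrete. The data here are hypothetical (the 3-star/4-star labeling is just an illustrative reading of the dummy):

```python
# Binary explanatory variable: the OLS line connects the two group means.
# alpha = mean of y when x = 0; alpha + beta = mean of y when x = 1.
from statistics import mean

x = [0, 0, 0, 1, 1, 1]        # e.g. 0 = 3-star, 1 = 4-star (hypothetical)
y = [80, 90, 100, 110, 120, 130]

mean0 = mean(yi for xi, yi in zip(x, y) if xi == 0)
mean1 = mean(yi for xi, yi in zip(x, y) if xi == 1)
alpha = mean0          # 90
beta = mean1 - mean0   # 30
print(alpha, beta)
```

The slope formula Cov[x, y]/Var[x] gives exactly the same β here, which is why the binary case is a good way to see that regression is comparing conditional means.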


Regression coefficient formula

Notation:
- General coefficients are α and β.
- Calculated estimates are α̂ and β̂ (use the data and calculate the statistic).
- The slope coefficient formula is

  β̂ = Cov[x, y] / Var[x] = [ (1/n) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) ] / [ (1/n) Σ_{i=1}^{n} (x_i − x̄)² ]

- The slope coefficient formula is a normalized version of the covariance between x and y.
  - The slope measures the covariance relative to the variation in x.
  - That is why the slope can be interpreted as differences in average y corresponding to differences in x.
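The two formulas above translate directly into code. A minimal pure-Python sketch on hypothetical data (the helper name `ols` is ours):

```python
# OLS slope and intercept as a direct translation of the formulas:
# beta_hat = Cov[x, y] / Var[x], alpha_hat = ybar - beta_hat * xbar.
def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
    var = sum((xi - xbar) ** 2 for xi in x) / n
    beta = cov / var
    alpha = ybar - beta * xbar
    return alpha, beta

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
alpha, beta = ols(x, y)
print(alpha, beta)
```

Note that dividing both the covariance and the variance by n (or by n − 1) makes no difference to β̂, since the factor cancels.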


Regression coefficient formula

- The intercept is average y minus average x multiplied by the estimated slope β̂:

  α̂ = ȳ − β̂x̄

- The formula of the intercept reveals that the regression line always goes through the point of average x and average y.
  - Note that you can rearrange to get: ȳ = α̂ + β̂x̄.


Ordinary Least Squares (OLS)

- OLS gives the best-fitting linear regression line.
- Picture a vertical line at the average value of x and a horizontal line at the average value of y: the regression line goes through the point of average x and average y.


Regression coefficient formula

- Ordinary Least Squares (OLS) is a method to find the best fit with a formula.
- The idea underlying OLS is to find the values of the intercept and slope parameters that make the regression line fit the scatterplot best.
- The OLS method finds the values of the coefficients of the linear regression that minimize the sum of squares of the differences between the actual y values and the values implied by the regression, α̂ + β̂x:

  min_{α,β} Σ_{i=1}^{n} (y_i − α − βx_i)²   (4)

For this minimization problem, we can use calculus to derive α̂ and β̂, the values of α and β that give the minimum. Please check out U7.1.
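A quick sanity check of the "least squares" idea, on hypothetical data: compute the closed-form α̂ and β̂, then verify that nudging them in any direction never lowers the sum of squared deviations.

```python
# OLS minimizes the sum of squared deviations. We check that the closed-form
# solution is a minimum by perturbing it and confirming the criterion only grows.
def ssr(alpha, beta, x, y):
    return sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
alpha = ybar - beta * xbar

best = ssr(alpha, beta, x, y)
for da in (-0.1, 0.0, 0.1):
    for db in (-0.1, 0.0, 0.1):
        assert ssr(alpha + da, beta + db, x, y) >= best
print(best)
```

This is only a spot check, of course; the calculus argument referenced as U7.1 shows the closed-form solution is the global minimum.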


Recap

- Simple regression analysis amounts to comparing average values of a dependent variable (y) for observations that are different in the explanatory variable (x).
- Simple regression, in any way or form: comparing conditional means.


Case Study: Finding a good deal among hotels

- The linear regression of hotel prices (in EUR) on distance (in miles) produces an intercept of 133 and a slope of -14.
- The intercept is 133, suggesting that the average price of hotels right in the city center is EUR 133.
- The slope of the linear regression is -14. Hotels that are 1 mile further away from the city center are, on average, EUR 14 cheaper in our data.
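Plugging the slide's rounded estimates into ŷ = α̂ + β̂x gives predicted average prices at any distance. A tiny sketch (the function name is ours; the coefficients 133 and −14 are the rounded values quoted above, not the exact fitted ones):

```python
# Predicted average price from the fitted line, using the slide's rounded
# estimates: intercept 133 EUR, slope -14 EUR per mile.
def predicted_price(miles, alpha=133.0, beta=-14.0):
    return alpha + beta * miles

print(predicted_price(0.0))  # 133.0 EUR right at the center
print(predicted_price(2.0))  # 105.0 EUR two miles out
```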


Case Study: Finding a good deal among hotels

- Compare the linear model and the non-parametric ones.
  - The linear model is an average that fails to capture the steep decline close to the center.
  - Not a bad approximation overall.


Predicted values
- The predicted value of the dependent variable = the best guess for its average value if we know the value of the explanatory variable, using our model.
- The predicted value can be calculated from the regression for any x.
- The predicted values of the dependent variable are the points of the regression line itself.
- The predicted value of the dependent variable y is denoted ŷ:

  ŷ = α̂ + β̂x

- What about non-parametric regressions?
  - Predicted dependent variables exist there, too.
  - They form a complete list of predicted values of the dependent variable for each value of the explanatory variable in the data.

Residuals

- The residual is the difference between the actual value of the dependent variable for an observation and its predicted value:

  e_i = y_i − ŷ_i, where ŷ_i = α̂ + β̂x_i

- The residual is meaningful only for actual observations.
  - While we can have predicted values for any x, actual y values are only available for the observations in our data.
- The residual is the vertical distance between the scatterplot point and the regression line.
  - For points above the regression line the residual is positive.
  - For points below the regression line the residual is negative.
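Predicted values and residuals follow mechanically once α̂ and β̂ are in hand. A short sketch on hypothetical data, which also previews the OLS property discussed two slides below (residuals sum to zero):

```python
# Residuals e_i = y_i - yhat_i for an OLS-fitted line (hypothetical data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
alpha = ybar - beta * xbar

yhat = [alpha + beta * xi for xi in x]          # points on the regression line
resid = [yi - yh for yi, yh in zip(y, yhat)]    # vertical distances to the line
print(sum(resid))          # ~0 up to floating-point error (an OLS property)
print(sum(yhat) / n, ybar) # average prediction equals average outcome
```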


Use of residuals

- The residual may be important in its own right.
- We may be interested in identifying observations that are special in that they have a dependent variable that is much higher or much lower than "it should be" as predicted by the regression.


Predicted dependent variable and residuals

- Residuals sum to zero if a linear regression is fitted by OLS.
  - Sum is zero -> the average of the residuals is zero, too.
  - The predicted average is equal to the actual average of y: average ŷ equals average y.
- See U7.2 for details.


Case Study: Finding a good deal among hotels

- The residual is a vertical distance.
- A positive residual is shown here: the price is above what is predicted by the regression line.


Case Study: Finding a good deal among hotels

- We can look at residuals from linear regressions.
  - Centered around zero.
  - Both positive and negative.


Case Study: Finding a good deal among hotels

- If linear regression is the accepted model for prices:
  - Draw a scatterplot with the regression line.
  - With the model you can identify the over- and underpriced hotels.


Case Study: Finding a good deal among hotels


A list of the hotels with the five lowest values of the residual:

No.  Hotel_id  Distance  Price  Predicted price  Residual
1    22080     1.1       54     116.17           -62.17
2    21912     1.1       60     116.17           -56.17
3    22152     1.0       63     117.61           -54.61
4    22408     1.4       58     111.85           -53.85
5    22090     0.9       68     119.05           -51.05

- Bear in mind, we can (and will) do better - this is not the best model for price prediction:
  - Non-linear pattern
  - Functional form
  - Taking into account differences beyond distance


Regression modelling


Model fit - R^2

- The fit of a regression captures how the predicted values compare to the actual values.
- R-squared (R^2): how much of the variation in y is captured by the regression, and how much is left for residual variation.

  R^2 = Var[ŷ] / Var[y] = 1 − Var[e] / Var[y]   (5)

  where Var[y] = (1/n) Σ_{i=1}^{n} (y_i − ȳ)², Var[ŷ] = (1/n) Σ_{i=1}^{n} (ŷ_i − ȳ)², and Var[e] = (1/n) Σ_{i=1}^{n} e_i². Note that the mean of ŷ equals ȳ, and ē = 0.

- Decomposition of the overall variation in y into variation in the predicted values ("explained by the regression") and residual variation ("not explained by the regression"):

  Var[y] = Var[ŷ] + Var[e]   (6)
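Equations (5) and (6) are easy to verify numerically. A pure-Python sketch on hypothetical data: fit the line by OLS, then compute R^2 both ways and check the variance decomposition.

```python
# R^2 computed two ways, plus the decomposition Var[y] = Var[yhat] + Var[e].
x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
alpha = ybar - beta * xbar
yhat = [alpha + beta * xi for xi in x]
resid = [yi - yh for yi, yh in zip(y, yhat)]

def var(v, m):
    return sum((vi - m) ** 2 for vi in v) / len(v)

r2_a = var(yhat, ybar) / var(y, ybar)       # Var[yhat] / Var[y]
r2_b = 1 - var(resid, 0.0) / var(y, ybar)   # 1 - Var[e] / Var[y]; mean residual is 0
print(r2_a, r2_b)
```

Both routes agree, and Var[ŷ] + Var[e] reproduces Var[y] exactly, which is what makes R^2 a share between zero and one.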


Model fit - R^2

- R-squared (R^2) can be defined for both parametric and non-parametric regressions.
  - Any kind of regression produces predicted ŷ values, and all we need to compute R^2 is their variance compared to the variance of y.
- The value of R-squared is always between zero and one.
- R-squared is zero if the predicted values are just the average of the observed outcome: ŷ_i = ȳ for all i.


Model fit - R^2 - A question

- When the R-squared is zero, what does the regression line look like?
- What about when it's not zero, but very small?


Model fit

- Fit depends on (1): how well the particular version of the regression captures the actual function f in y^E = f(x).
  - Can be helped by modeling.
- Fit depends on (2): how far the actual values of y are spread around what would be predicted using the actual function f.
  - Given by the data.


Model fit - how to use R^2

- R-squared may help in choosing between different versions of a regression for the same data.
  - Choose between regressions with different functional forms.
  - Predictions are likely to be better with a high R^2.
  - More on this in Chapters 13-14.
- R-squared matters less when the goal is to characterize the association between y and x.


Correlation and linear regression


- Linear regression is closely related to correlation.
- Remember the OLS formula for the slope:

  β̂ = Cov[y, x] / Var[x]

- In contrast with the correlation coefficient, its value can be anything. Furthermore, y and x are not interchangeable.
- The covariance and the correlation coefficient can be substituted in to get β̂:

  β̂ = Corr[x, y] × Std[y] / Std[x]

- Covariance, the correlation coefficient, and the slope of a linear regression capture similar information: the degree of association between the two variables.

Correlation and R^2 in linear regression

- The R-squared of the simple linear regression is the square of the correlation coefficient:

  R^2 = (Corr[y, x])²

- So the R-squared is yet another measure of the association between the two variables.
- To show this equality holds, the trick is to substitute into the numerator of R-squared and manipulate:

  R^2 = Var[ŷ] / Var[y] = Var[α̂ + β̂x] / Var[y] = β̂² Var[x] / Var[y] = (β̂ Std[x] / Std[y])² = (Corr[y, x])²
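The identity R^2 = Corr[y, x]² can be confirmed numerically in a few lines. A pure-Python sketch on hypothetical data:

```python
# Check that R^2 of the simple linear regression equals the squared correlation.
x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
var_x = sum((xi - xbar) ** 2 for xi in x) / n
var_y = sum((yi - ybar) ** 2 for yi in y) / n
corr = cov / (var_x ** 0.5 * var_y ** 0.5)

beta = cov / var_x
yhat = [(ybar - beta * xbar) + beta * xi for xi in x]
var_yhat = sum((yh - ybar) ** 2 for yh in yhat) / n
r2 = var_yhat / var_y
print(r2, corr ** 2)  # the two numbers agree
```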


Reverse regression

- Consider two similar models:

  y^E = α + βx
  x^E = γ + δy

- What can we say about the estimated coefficients?
- What can we say about the R^2?


Reverse regression

- One can switch the variables, but the interpretation is going to change as well!

  x^E = γ + δy

- The OLS estimator for the slope coefficient here is δ̂ = Cov[y, x] / Var[y].
- The OLS slopes of the original regression and the reverse regression are related:

  β̂ = δ̂ × Var[y] / Var[x]

- They are different, unless Var[x] = Var[y],
  - but they always have the same sign,
  - and both are larger in magnitude the larger the covariance.
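The relationship between the two slopes (and the answer to the R^2 question on the next slide) can be checked directly. A pure-Python sketch on hypothetical data; note that β̂ × δ̂ = Corr[x, y]², which is the R^2 shared by both regressions:

```python
# Reverse regression: delta_hat = Cov[y, x] / Var[y], so
# beta_hat = delta_hat * Var[y] / Var[x], and both regressions share R^2 = Corr^2.
x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
var_x = sum((xi - xbar) ** 2 for xi in x) / n
var_y = sum((yi - ybar) ** 2 for yi in y) / n

beta = cov / var_x                 # slope of y on x
delta = cov / var_y                # slope of x on y
r2 = cov ** 2 / (var_x * var_y)    # shared R^2 = Corr[x, y]^2 = beta * delta
print(beta, delta, r2)
```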


Reverse regression: A question

- Is the R^2 of the simple linear regression and the reverse regression
  - exactly the same,
  - close but not the same, or
  - different?
- Why?


Regression and causation

- We were very careful to use neutral language and not talk about causation.
- Think back to the sources of variation in x.
- When we have observational data, we pick x and y and decide how to run the regression.
- Regression is a method of comparison: it compares observations that are different in variable x and shows the corresponding average differences in variable y.
  - It is a way to find patterns of association by comparisons.
- If we can't infer causation from regression analysis, that is not the fault of the method.


Regression and causation - possible relations

- The slope of the y^E = α + βx regression is not zero in our data.
- There are several reasons, not mutually exclusive:
  - x causes y;
  - y causes x;
  - a third variable causes both x and y (or many such variables do).
- In reality, if we have observational data, there is a mix of these relations.
- For more, see Chapters 19-21.


Regression and causation

- Yes: "correlation (regression) does not imply causation."
- Better: we should not infer cause and effect from comparisons in observational data.
- The suggested approach has two steps:
  - First, interpret precisely the object (correlation or slope coefficient).
  - Then conclude, and discuss causal claims if any.


Case Study: Finding a good deal among hotels

- Fit and causation:
  - The R-squared of the regression is 0.16 = 16%.
  - This means that of the overall variation in hotel prices, 16% is explained by the linear regression with distance to the city center; the remaining 84% is left unexplained.
  - 16% is good for a cross-sectional regression with a single explanatory variable.
  - In any case, it is the fit of the best-fitting line.


Case Study: Finding a good deal among hotels

- The slope is -14.
- Does that mean that a longer distance causes hotels to be cheaper?


Summary take-away

- Regression: a method to compare average y across observations with different values of x.
- Non-parametric regressions (bin scatter, lowess) visualize complicated patterns of association between y and x, but produce no interpretable number.
- Linear regression: a linear approximation of the average pattern of association between y and x.
- In y^E = α + βx, β shows how much larger y is, on average, for observations with a one-unit larger x.
- When β is not zero, one of three things (or any combination) may be true:
  - x causes y
  - y causes x
  - a third variable causes both x and y

