Lect2 Part1
Javier Abellán, Màxim Ventura and Carlos Suárez (UPF) Topic 1 April 3, 2024 1 / 64
Fundamentals of Regression Analysis
Contents
Fundamentals of Regression Analysis The role of Econometrics: simple regression model
• Using all this information, we want to know how the area (in square meters) of an apartment in Barcelona affects its price (in euros)
• We could then draw a line through the data points:
Pricei = β0 + βarea Areai
clear all
set more off
use habitatge_BCN_1920_12.dta, clear
twoway (scatter preu superf, msize(tiny) mcolor(edkblue)) ///
(lfit preu superf, lwidth(medium) lcolor(black))
corr superf preu
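The course file habitatge_BCN_1920_12.dta is not reproduced here, so as a rough sketch the same scatter-plus-fitted-line exercise can be mimicked in Python on synthetic data (all parameter values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the course dataset: area ("superf") in m2 and
# price ("preu") in euros, with made-up parameters.
superf = rng.uniform(30, 150, size=500)
preu = 20_000 + 1_600 * superf + rng.normal(0, 35_000, size=500)

# Counterpart of "lfit preu superf": fitted line preu = b0 + b1 * superf
b1, b0 = np.polyfit(superf, preu, deg=1)

# Counterpart of "corr superf preu"
r = np.corrcoef(superf, preu)[0, 1]

print(f"slope = {b1:.1f} euros/m2, intercept = {b0:.1f}, corr = {r:.3f}")
```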
• In general
Yi = β0 + β1 Xi + ui (1)
• This is the simple linear regression model with just one regressor, where
Yi is the dependent variable for unit i, Xi is the regressor or independent
variable for unit i and ui is the error term for unit i.
• The first part β0 + β1 Xi is the population regression line, that is, the
average relation between X and Y that we see in the population
• If we knew the values of β0 and β1 , then for a given X we could use the population regression line to predict the corresponding Y
• Last but not least, ui is the difference between Yi and the corresponding
regression line, and arises from all the other factors affecting Yi that have
not been included in our model
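A quick way to internalise equation (1) is to simulate it: pick values for β0 and β1, draw X and u, and construct Y. A minimal sketch with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population parameters (made up for illustration)
beta0, beta1 = 50_000.0, 1_600.0
n = 1_000

X = rng.uniform(30, 150, size=n)        # regressor, e.g. area in m2
u = rng.normal(0.0, 30_000.0, size=n)   # error term: all omitted factors
Y = beta0 + beta1 * X + u               # Yi = beta0 + beta1*Xi + ui

# beta0 + beta1*X is the population regression line (the average relation);
# u is each unit's deviation from that line.
line = beta0 + beta1 * X
print(np.mean(Y - line))  # sample average of u, close to zero
```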
• So, how do we estimate the slope of a line that goes through the scatterplot of Size and Price?
• Of course, there is no line that will go through all the data points and we
can actually draw an infinite number of different lines that go through the
data points
• So, which criterion should we use to pick one among all the possibilities?
Fundamentals of Regression Analysis Different types of data
Experimental vs Observational
Fundamentals of Regression Analysis The Ordinary Least Squares (OLS) estimator
• The ordinary least squares (OLS) estimator extends this idea to the linear regression model
• Let’s assume that b0 and b1 are some estimators of the unknown
parameters β0 and β1
• Based on those estimators, the regression line is b0 + b1 X
• From that estimation, the predicted value of Yi is Ŷi = b0 + b1 Xi
• Therefore, the residual from the ith prediction will be
ûi = Yi − Ŷi = Yi − (b0 + b1 Xi ) = Yi − b0 − b1 Xi
• Note: The residual ûi can be interpreted as the sample counterpart of ui
• OLS looks for the pair (b0 , b1 ) that minimises the sum of squared residuals Σi ûi²
• We will refer to the pair that minimises the sum of squares as (β̂0 , β̂1 )
β̂1 = Σi (Xi − X̄)(Yi − Ȳ) / Σi (Xi − X̄)² = sXY / s²X

β̂0 = Ȳ − β̂1 X̄

ûi = Yi − Ŷi = Yi − β̂0 − β̂1 Xi
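These formulas can be implemented directly. A sketch on synthetic data, cross-checked against numpy's own least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(30, 150, size=400)
Y = 50_000 + 1_600 * X + rng.normal(0, 30_000, size=400)  # synthetic data

# OLS formulas: slope = s_XY / s_X^2, intercept = Ybar - slope * Xbar
b1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0_hat = Y.mean() - b1_hat * X.mean()
uhat = Y - b0_hat - b1_hat * X  # residuals

# Cross-check against numpy's least-squares line
b1_ref, b0_ref = np.polyfit(X, Y, deg=1)
assert np.allclose([b1_hat, b0_hat], [b1_ref, b0_ref])
print(b1_hat, b0_hat)
```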
• The first two equations are the OLS estimators of the unknown
population parameters β0 and β1 ; the third is the residual from model
prediction (sample counterpart of the error term, but we should never
interpret them as equivalent)
• Different samples will generate different estimates of β̂0 and β̂1 (that is, different estimated values)
• That is, the estimators are random variables and their particular value
will depend on the sample
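This sampling variability can be seen by re-estimating β̂1 on many samples drawn from the same simulated population (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
beta1 = 1_600.0  # true slope in the simulated population

def ols_slope(X, Y):
    return np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

# Each sample from the same population yields a different estimate of beta1
slopes = []
for _ in range(2_000):
    X = rng.uniform(30, 150, size=100)
    Y = 50_000 + beta1 * X + rng.normal(0, 30_000, size=100)
    slopes.append(ols_slope(X, Y))
slopes = np.array(slopes)

print(f"mean of estimates: {slopes.mean():.1f}, std: {slopes.std():.1f}")
```

The estimates centre on the true slope but spread around it: the estimator is a random variable.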
Notation
Fundamentals of Regression Analysis First order conditions of the OLS estimators
∂/∂b0 : −2 Σi (Yi − b0 − b1 Xi ) = 0

∂/∂b1 : −2 Σi (Yi − b0 − b1 Xi ) Xi = 0
• The appendix of these notes contains the formal proofs of these properties; let's take a look at them using the data from our example.
• In order to do so, we will:
1. Run the OLS regression of price on area
2. Predict the value of Ŷi using X̄: β̂0 + β̂1 X̄
3. Predict the OLS residuals ûi as the difference Yi − Ŷi , where Ŷi = β̂0 + β̂1 Xi
4. Calculate Σi ûi
5. Calculate Corr(Xi , ûi )
6. Calculate Corr(Ŷi , ûi )
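The same six steps can be sketched in Python, with synthetic data standing in for the course dataset:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(30, 150, size=500)                        # area
Y = 50_000 + 1_600 * X + rng.normal(0, 30_000, size=500)  # price (synthetic)

# 1. OLS regression of price on area
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

# 2. Prediction at Xbar: the fitted line passes through (Xbar, Ybar)
print(b0 + b1 * X.mean(), Y.mean())

# 3. Residuals
yhat = b0 + b1 * X
uhat = Y - yhat

# 4-6. First order condition properties
print(uhat.sum())                     # sum of residuals: ~0
print(np.corrcoef(X, uhat)[0, 1])     # Corr(X, uhat): ~0
print(np.corrcoef(yhat, uhat)[0, 1])  # Corr(yhat, uhat): ~0
```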
Stata code
reg preu superf
sum superf
local msuperf = r(mean)
disp _b[_cons] + _b[superf]*`msuperf'
sum preu
predict yhat, xb
predict resid, r
sum resid
corr superf resid
corr yhat resid
Property # 1
Property # 2
Property # 3
Property # 4
Fundamentals of Regression Analysis Interpretation of the OLS coefficients
• In Stata, the command reg (or regress) will estimate a linear model of Y on X
• Let's take a look at the result of using that command for the housing example, where Y is the price of the dwelling in euros and X its area in square meters
• According to the results, β̂1 = 1641.242 euros/m²
• That is, according to our estimation, if the size of the dwelling is one square meter larger, we expect the price of the dwelling to be 1641.24 euros higher
• When X is a binary 0/1 variable, β̂1 is the difference of the sample mean
of the price of the dwelling between the two groups (large dwellings vs
small dwellings)
• That is, β̂1 = (Ȳ | X = 1) − (Ȳ | X = 0) = Ȳ1 − Ȳ0
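This equivalence is easy to verify numerically; a sketch with a simulated 0/1 regressor (the "large vs small dwelling" labels and all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated 0/1 regressor: 1 = "large dwelling", 0 = "small dwelling"
X = rng.integers(0, 2, size=600).astype(float)
Y = 150_000 + 80_000 * X + rng.normal(0, 40_000, size=600)  # synthetic prices

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
diff_in_means = Y[X == 1].mean() - Y[X == 0].mean()

# The OLS slope on a dummy equals the difference in group means
assert np.isclose(b1, diff_in_means)
print(b1, diff_in_means)
```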
1. By construction, ûi (the residuals) are not correlated with Xi . This can be interpreted as the OLS estimator extracting all the linear information in X that is useful to predict Y
2. ûi ̸= ui . The first is the difference between the value of Yi and the
predicted value Ŷi ; the second is the unknown error from the population
regression line
3. The units of β̂1 are the units of Y over the units of X, while the units of β̂0
are in Y’s units. Therefore, a different unit of X or a different unit of Y will
have consequences on the estimated coefficients (but not on the
underlying conclusions of the model)
4. The interpretation of β̂0 is meaningful only if X = 0 is a possible value of the regressor (that is, only if the probability of observing X = 0 is larger than zero)
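Remark 3 can be checked numerically: converting X to different units rescales β̂1 by exactly the conversion factor, while the fitted line and conclusions are unchanged (synthetic data; the m²-to-ft² conversion is just an example):

```python
import numpy as np

rng = np.random.default_rng(6)
X_m2 = rng.uniform(30, 150, size=400)                        # area in m2
Y = 50_000 + 1_600 * X_m2 + rng.normal(0, 30_000, size=400)  # price in euros

def ols_slope(X, Y):
    return np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

FT2_PER_M2 = 10.7639  # unit-conversion factor
b1_m2 = ols_slope(X_m2, Y)                # euros per m2
b1_ft2 = ols_slope(X_m2 * FT2_PER_M2, Y)  # euros per ft2

# The slope rescales by exactly the conversion factor; predictions and
# substantive conclusions are unchanged.
assert np.isclose(b1_m2, b1_ft2 * FT2_PER_M2)
print(b1_m2, b1_ft2)
```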
Fundamentals of Regression Analysis Measures of goodness of fit
• A natural question is how well the regression line "fits" or explains the
data
• How much of the observed variation in Y is explained by our model?
How close is the regression line to the observations?
• There are two regression statistics that provide complementary
measures of the quality of the fit:
▶ The R2 statistic measures the fraction of the variance of Y that is
explained by X; it is unitless and ranges between zero (no fit) and
one (perfect fit)
▶ The standard error of the regression (SER) measures the typical
size of a regression residual in the units of Y.
R2
• Therefore R2 = 1 − RSS/TSS
Rewriting the equation for the residual, we have that Yi = Ŷi + ûi . Therefore,
R2 = Corr(X, Y)²
• One interesting thing is that the R2 is the same as the sample correlation
of X and Y squared
• That is because, in the end, the slope of the linear regression model just rescales the sample covariance between X and Y by the variance of X. So the model fits only as well as these two variables are correlated
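Both the definition R2 = 1 − RSS/TSS and this identity can be checked in one numerical sketch (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.uniform(30, 150, size=500)
Y = 50_000 + 1_600 * X + rng.normal(0, 30_000, size=500)  # synthetic data

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
uhat = Y - b0 - b1 * X

TSS = np.sum((Y - Y.mean()) ** 2)  # total sum of squares
RSS = np.sum(uhat ** 2)            # residual sum of squares
R2 = 1 - RSS / TSS

r = np.corrcoef(X, Y)[0, 1]
assert np.isclose(R2, r ** 2)  # R2 equals the squared sample correlation
print(R2, r ** 2)
```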
R2 : warning
• In our case, R2 = 0.6867, which means that the area of the dwelling explains 68.67% of the variance in the price of the dwelling
• According to the model, since the RSS is 1545445061347.875 and we have 1267 observations, the SER is 34953 euros and the RMSE is 34925 euros
• Word of caution: Stata calculates the RMSE differently, dividing by (n − 2) instead of n. So what Stata reports as the RMSE is actually the SER
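The two numbers reported above can be reproduced from the stated RSS and sample size:

```python
import math

# Reproduce the reported SER and RMSE from the stated RSS and n
RSS = 1_545_445_061_347.875
n = 1267

SER = math.sqrt(RSS / (n - 2))   # divides by n - 2 (what Stata labels "Root MSE")
RMSE = math.sqrt(RSS / n)        # divides by n

print(round(SER), round(RMSE))   # 34953 34925
```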
Fundamentals of Regression Analysis Appendix
Appendix
∂/∂b0 : −2 Σi (Yi − b0 − b1 Xi ) = 0

∂/∂b1 : −2 Σi (Yi − b0 − b1 Xi ) Xi = 0
Σi (Yi − b0 − b1 Xi ) = 0

Σi Yi − N b0 − b1 Σi Xi = 0

N b0 = Σi Yi − b1 Σi Xi

b0 = Ȳ − b1 X̄
Σi (Yi − b0 − b1 Xi ) Xi = 0

Σi Yi Xi − b0 Σi Xi − b1 Σi Xi² = 0
Substituting b0 = Ȳ − b1 X̄:

Σi Yi Xi − Ȳ Σi Xi + b1 X̄ Σi Xi − b1 Σi Xi² = 0

b1 (Σi Xi² − X̄ Σi Xi ) = Σi Yi Xi − Ȳ Σi Xi

b1 Σi Xi (Xi − X̄) = Σi Xi (Yi − Ȳ)
Σi (Xi − X̄)² = Σi Xi² + N X̄² − 2 X̄ Σi Xi
            = Σi Xi² + N X̄² − 2 N X̄²
            = Σi Xi² − N X̄²
            = Σi Xi² − X̄ Σi Xi
            = Σi Xi (Xi − X̄)
b1 = Σi (Xi − X̄)(Yi − Ȳ) / Σi (Xi − X̄)²

β̂1 = sXY / s²X

β̂0 = Ȳ − β̂1 X̄
Proof #1
Proof #2
Σi ûi Xi = Σi ûi (Xi − X̄)    (since Σi ûi = 0)
        = Σi ((Yi − Ȳ) − β̂1 (Xi − X̄))(Xi − X̄)
        = Σi (Yi − Ȳ)(Xi − X̄) − β̂1 Σi (Xi − X̄)²

Since β̂1 = Σi (Yi − Ȳ)(Xi − X̄) / Σi (Xi − X̄)², we have

Σi ûi Xi = Σi (Yi − Ȳ)(Xi − X̄) − [Σi (Yi − Ȳ)(Xi − X̄) / Σi (Xi − X̄)²] Σi (Xi − X̄)² = 0
Proof #5
TSS = Σi (Yi − Ȳ)² = Σi (Yi − Ŷi + Ŷi − Ȳ)²
    = Σi (Yi − Ŷi )² + Σi (Ŷi − Ȳ)² + 2 Σi (Yi − Ŷi )(Ŷi − Ȳ)
    = RSS + ESS + 2 Σi ûi (Ŷi − Ȳ) = RSS + ESS

since Σi ûi = 0 and Σi ûi Ŷi = 0 by the first order conditions
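The decomposition TSS = RSS + ESS can be confirmed numerically (synthetic data; ESS denotes the explained sum of squares):

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.uniform(30, 150, size=400)
Y = 50_000 + 1_600 * X + rng.normal(0, 30_000, size=400)  # synthetic data

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
yhat = b0 + b1 * X
uhat = Y - yhat

TSS = np.sum((Y - Y.mean()) ** 2)
ESS = np.sum((yhat - Y.mean()) ** 2)
RSS = np.sum(uhat ** 2)

# The cross term 2*sum(uhat*(yhat - Ybar)) vanishes, so TSS = RSS + ESS
assert np.isclose(TSS, RSS + ESS)
print(TSS, RSS + ESS)
```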
• OLS estimator:

β̂1 = Σi (Xi − X̄)(Yi − Ȳ) / Σi (Xi − X̄)² = Σi Xi (Yi − Ȳ) / Σi Xi (Xi − X̄)
• Let's rewrite the OLS estimator in terms of this binary or dummy variable X:

β̂1 = Σi 1(i ∈ T)(Yi − Ȳ) / Σi 1(i ∈ T)[1(i ∈ T) − NT/N]

• NT/N is the result of:

X̄ = (1/N) Σi 1(i ∈ T) = (1/N)(NT × 1 + NNT × 0)

where NT is the number of treated units and NNT = N − NT the number of non-treated units
• Thus, X̄ = NT/N
• The numerator is Σi 1(i ∈ T)(Yi − Ȳ) = NT [(1/NT) Σi 1(i ∈ T)Yi ] − NT Ȳ, that is:
• NT E[Yi | Xi = 1] − (NT/N)[NT E[Yi | Xi = 1] + NNT E[Yi | Xi = 0]]
• = E[Yi | Xi = 1](NT − NT²/N) − (NT/N) NNT E[Yi | Xi = 0]
• Since (NT − NT²/N) = NT(1 − NT/N) and (NT/N) NNT = (NT/N)(N − NT) = NT(1 − NT/N)
• = NT(1 − NT/N)[E[Yi | Xi = 1] − E[Yi | Xi = 0]]
• Therefore,

β̂1 = NT(1 − NT/N)[E[Yi | Xi = 1] − E[Yi | Xi = 0]] / [NT(1 − NT/N)] = E[Yi | Xi = 1] − E[Yi | Xi = 0]

• β̂1 = ȲT − ȲNT