CH 02 Simple Regression TQT

Chapter 2: The Simple Linear Regression

2.1. The definition of simple linear regression (SLR) model
2.2. The method of Ordinary Least Squares (OLS)
2.3. Interpretation of the SLR model
2.4. Properties of the OLS estimator
2.5. Unit of measurement and functional form
2.6. Assumptions of the OLS estimator
2.7. Mean and variances of the OLS estimator


2.1. The definition of simple linear regression model

Econometrics is based on techniques such as regression analysis and hypothesis testing.

What is regression analysis?

“Regression analysis is concerned with the study of the dependence of one variable, the dependent variable, on one or more other variables, the explanatory variables, with a view to estimating and/or predicting the (population) mean or average value of the former in terms of the known or fixed (in repeated sampling) values of the latter” [2, p.18].

- Regression studies the dependence of one variable (the dependent variable) on other variables (the explanatory or independent variables).
- The goal of regression analysis is to estimate or predict the population mean of one variable on the basis of the known or fixed values of the other variables.

Simple linear regression (SLR) is a linear regression model with a single explanatory variable:

y = β0 + β1x + u

(Cont)

y = β0 + β1x + u

The slope parameter (β1) shows how much y changes if x increases by one unit. But this interpretation is only correct if all other factors are constant (Δu = 0).

- β0: intercept parameter or constant term.
- y: dependent variable, outcome variable, explained variable, response variable, predicted variable, regressand, …
- x: independent variable, explanatory variable, control variable, predictor, regressor, …
- u: error term, unobservables, disturbance, white noise, …
Simple linear regression: graphical presentation of the coefficients

(Cont)

A constant slope indicates that a one-unit change in X has the same effect on Y regardless of X's initial values.
Simple linear regression: (Cont)

- output = β0 + β1·fertilizer + u, where u contains sunlight, rainfall, moisture, diseases, soil fertility, …
β1 quantifies the effect of fertilizer on output, holding all other factors constant.

- wage = β0 + β1·educ + u, where u contains work experience, ethnicity, gender, ability, …
β1 measures the change in wage given an additional year of formal schooling, holding all other factors fixed.
The population linear regression model (PRM):

We have the population linear regression model (PRM):

yᵢ = β0 + β1xᵢ + uᵢ

- y and x are two variables that describe the properties of the population under consideration: y is explained by x.
- u is the error term, representing all other unobserved factors than x.

The goal of the linear regression is to estimate the population mean of the dependent variable on the basis of the known values of the independent variable(s):
to estimate E(y|x), which is known as the conditional expectation function or the population linear regression function.
The conditional mean: E(wage|edu)
(Cont)

[Figure: conditional distributions of wages (thousand VND) for various levels of education — no edu, primary, lower secondary, upper secondary, college, university, master — with the conditional means E(wages|edu) lying on the regression line. The highlighted curve is the distribution of wages given education = the college level.]
The population linear regression function (PRF):
E(y|x) = β0 + β1x
(Cont)
Under certain assumptions, we can capture a ceteris paribus relationship between y and x:

- Linear in parameters β0 and β1.
- Zero conditional mean assumption: E(u|x) = 0. The average value of u does not depend on the value of x and equals zero.
- This assumption implies that E(u) = 0 and Cov(x, u) = 0.

Taking the expected value of the PRM (y = β0 + β1x + u) conditional on x and using the zero conditional mean assumption, we have the population regression function (PRF):

E(y|x) = β0 + β1x + E(u|x)
E(y|x) = β0 + β1x

The average value of y can be expressed as a linear function of x:

E(y|x) = β0 + β1x
Zero conditional mean assumption: E(u|x) = 0
(Cont)
This assumption means that u represents other unobservable factors that do not have a systematic effect on y. This assumption is critical for causal analysis; it cannot be tested statistically and has to be argued by economic theory.

uᵢ = the observed value − the mean = yᵢ − E(y|xᵢ)

The positive and negative values of u cancel each other out, which makes their average or mean effect on Y equal to zero.

Note that u₁, u₂, … here refer to observations of u and not to different variables.
Zero conditional mean assumption: E(u|x) = 0
(Cont)
E(u|x) = 0: This assumption is a strong assumption for ceteris paribus analysis; it cannot be tested statistically and has to be argued by economic theory.

wage = β0 + β1·educ + u; u: ability
To capture the ceteris paribus relationship between wage and education, we have to assume that the average ability is the same for all levels of education.
E.g. E(ability|edu = 8) = E(ability|edu = 12) = … = E(ability|edu = 16) = 0.

Note:
- E(u) = 0: the assumption for defining the intercept.
- E(u|x) = E(u) = 0: the assumption with impact.
Zero conditional mean assumption: E(u|x) = 0
(Cont)

output = β0 + β1·fertilizer + u; u: land quality

To capture the ceteris paribus relationship between output and fertilizer, we have to assume that the average quality of land is independent of the amount of fertilizer.
In other words, E(u|fertilizer) = E(u) = 0: the amount of fertilizer is applied independently of other plot characteristics (e.g. land quality).
The sample linear regression function:
SRF
(Cont)

[Figure: scatter plot of monthly per capita income (thousand VND) against schooling years of household head; the actual values lie around the fitted values (regression line).]

ŷ = 815 + 53x: sample linear regression function

The average income for all household heads with 12 years of education is E(Income|edu=12) = 815 + 53·12 = 1451.

Note: But it is false to interpret that every household head with 12 years of education will earn 1451.
The sample linear regression function (SRF)
(Cont)
- ŷ is called “Y-hat” or “Y-cap”.
- ŷᵢ is the estimator of E(y|xᵢ), which also means the fitted or predicted value of yᵢ.
- β̂0 is the estimator of β0.
- β̂1 is the estimator of β1.
- ûᵢ is the estimator of uᵢ, which is also known as the residual: ûᵢ = yᵢ − ŷᵢ.
- The subscript i is used to index the observations of a sample.

              Population                              Sample
Model         Population linear regression model      Sample linear regression model
              (PRM): yᵢ = β0 + β1xᵢ + uᵢ              (SRM): yᵢ = β̂0 + β̂1xᵢ + ûᵢ
Function      Population linear regression function   Sample linear regression function
              (PRF): E(y|x) = β0 + β1x                (SRF): ŷᵢ = β̂0 + β̂1xᵢ
2.2. The method of Ordinary Least Squares (OLS)

What is OLS (Ordinary Least Squares)?

Deriving the OLS estimators
(Cont)

- Let {(xᵢ, yᵢ): i = 1, …, n} denote a random sample of size n from the population.
- Let b0, b1 be possible values of the population parameters β0, β1.
- From the sample, we have to find b0, b1 to minimize the RSS (the residual sum of squares): RSS = Σ(yᵢ − b0 − b1xᵢ)².

- Take partial derivatives of RSS with respect to b0 and b1.
- Set each of the partial derivatives to zero.
- Solve for {b0, b1} and replace them with the solutions β̂0, β̂1.
- β̂0, β̂1 are chosen to make the residuals add up to zero.
OLS estimators
(Cont)

β̂1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
β̂0 = ȳ − β̂1x̄

These sample functions are the OLS estimators for β0, β1.
For a given sample, the numerical values of β̂0, β̂1 are called the OLS estimates for β0, β1.
We have derived the formulas for calculating the OLS estimates of our parameters, but you don't have to compute them by hand.

Regressions in Stata are very easy and simple; to run the regression of y on x, just type: reg y x
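Still, the formulas are easy to check numerically. The sketch below (plain Python, no regression library) applies them to the 11-household income/consumption sample used in the tables later in this chapter and reproduces the fitted line ŷ = 356.067 + 0.1431x:

```python
# OLS estimates from the textbook formulas:
#   b1 = sum((x - xbar)*(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1*xbar
# Data: 11 households (income, consumption, both in 1000 VND).
x = [3000, 8208, 3613, 4624, 4751, 5151, 5884, 2696, 2485, 8860, 1436]
y = [995, 2900, 1450, 1460, 510, 760, 1005, 100, 912, 570, 512]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)

b1 = sxy / sxx          # slope estimate
b0 = ybar - b1 * xbar   # intercept estimate
print(round(b0, 3), round(b1, 4))  # 356.067 0.1431
```

The same numbers come out of `reg consumption income` in Stata; the hand computation only makes the formulas concrete.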
Estimators vs estimates
- An estimator, also referred to as a “sample statistic”, is a rule, formula, or procedure that explains how to estimate the population parameter from the sample data.
- An estimate is a specific numerical value generated by the estimator in an application.

Estimator: β̂1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)², β̂0 = ȳ − β̂1x̄
Estimate: e.g. β̂1 = 0.1431, β̂0 = 356.067 (from the 11-household sample below)
How do we fit the regression line to the data?
Answer:
(Cont)

[Figure: scatter of monthly per capita income against schooling years of household head, with the fitted line ŷ = 815 + 53x. For a household head with 10 years of education, the fitted value is E(income|edu=10) = 815 + 53·10 = 1345. Observations above the line (yᵢ > ŷᵢ) have positive residuals ûᵢ = yᵢ − ŷᵢ; observations below the line (yᵢ < ŷᵢ) have negative residuals.]
The difference between the residual and the error:
The residual ûᵢ = yᵢ − ŷᵢ is computed from the estimated coefficients and is observable; the error uᵢ = yᵢ − β0 − β1xᵢ depends on the unknown population parameters and is not. The residual can be seen as an estimate of the error.
2.3. Interpretation of the SLR model

ŷ = β̂0 + β̂1x

- β̂0: the constant or intercept, which shows the average value of the dependent variable when the independent variable is set to zero.
- β̂1: the slope = Δŷ/Δx, which indicates the average amount by which ŷ changes when x increases by one unit, holding other factors in the model constant (Δu = 0).

β̂1 > 0: a positive association between x and y
β̂1 < 0: a negative association between x and y
β̂1 = 0: no association between x and y
Meat consumption and income in Lao Cai
(Cont)
meat: monthly meat consumption per household (1000 VND); income: monthly household income per capita (1000 VND)

Fitted regression: m̂eat = 299 + 0.161·income

- β̂0: The intercept (constant) of 299 means that a household with zero income has a predicted meat consumption of 299 thousand VND.
- β̂1: The slope estimate of 0.161 means that a household's meat consumption would increase by 0.161 thousand VND if their per capita income increased by one thousand VND.

Question: What is the average meat consumption for households with a monthly income per capita of one million VND?
Wage and education among workers in FDI enterprises in Hanoi, 2018
(Cont)
wage: monthly wage (1000 VND); educ: years of education

Fitted regression: ŵage = −856 + 754·educ

- β̂1: The slope estimate of 754 means that, on average, each additional year of education would increase the wage by 754·1 = 754 thousand VND/month.
- β̂0: The intercept (constant) of −856 means that workers without education have a predicted wage of −856 thousand VND/month. Is this meaningful?
- The average predicted wage for all workers with 12 years of education is −856 + (754·12) = 8192 thousand VND/month (about 8.2 million VND).
- The result should not be interpreted as a causal effect.
The intercept often does not make sense to interpret
(Cont)

- β̂0: The intercept (constant) of −856 means that workers without education have a predicted wage of −856 thousand VND/month. Is this meaningful?

NOTE: the intercept or constant often has no real meaning, for several reasons:
- Zero settings for all independent variables are often impossible or irrational (e.g., can we set a household's food consumption to zero?).
- The intercept might be outside of the observed data (e.g., no worker without education).
- The intercept is estimated to make sure that the residual mean equals zero, E(û) = 0, which can make the intercept itself meaningless.
The intercept is outside of the data range
(Cont)

- The intercept = −856 when setting the independent variable (education) to zero (zero years of education).
- But the data show that education has a smallest value of 6 (6 years of education).
- Therefore, we can't interpret the intercept, because it is outside the range of the study data.
The intercept absorbs the bias for the regression model.
(Cont)

- The omission of some relevant variables in the model: e.g., factors other than education, such as work experience or ability, can affect wage.
- This omission can cause bias: e.g., the residuals may have an overall positive or negative mean.
- The intercept prevents this bias by compelling the residual mean to equal zero: mean(û) = 0.

It is crucial to include the intercept in the regression model.
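This point can be illustrated numerically (a minimal sketch using the chapter's 11-household income/consumption data): if the intercept is dropped, the OLS slope through the origin is b = Σxᵢyᵢ / Σxᵢ², and nothing forces the residual mean to zero.

```python
# Regression through the origin: b = sum(x*y) / sum(x^2).
# Without an intercept, the residuals no longer average to zero.
x = [3000, 8208, 3613, 4624, 4751, 5151, 5884, 2696, 2485, 8860, 1436]
y = [995, 2900, 1450, 1460, 510, 760, 1005, 100, 912, 570, 512]

b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
resid = [yi - b * xi for xi, yi in zip(x, y)]
mean_resid = sum(resid) / len(resid)
print(round(b, 4), round(mean_resid, 1))  # residual mean is clearly nonzero
```

With the intercept included (previous slides), the same residuals would sum exactly to zero.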
The intercept absorbs the bias for the regression model.
(Cont)
The regression model can always be redefined with the same slope but a new intercept and error, where the new error has a zero mean:

y = (β0 + E(u)) + β1x + (u − E(u))

[Figure: household income against schooling years of household head, with three parallel lines of the same slope: one above the OLS line (E(û) < 0), the OLS line with intercept β̂0 (E(û) = 0), and one below it (E(û) > 0).]

2.4. Properties of the OLS estimator: fitted values and residuals

Fitted regression: ŷᵢ = 356.067 + 0.1431·xᵢ; e.g. for x₁ = 3000: ŷ₁ = 356.067 + 0.1431·3000 ≈ 785.42
Obs   xᵢ     yᵢ     ŷᵢ (predicted)   ûᵢ = yᵢ − ŷᵢ (residual)
1 3000 995 785.423017 209.577
2 8208 2900 1530.78546 1369.215
3 3613 1450 873.15481 576.8452
4 4624 1460 1017.84787 442.1521
5 4751 510 1036.02395 -526.024
6 5151 760 1093.27145 -333.271
7 5884 1005 1198.17749 -193.177
8 2696 100 741.914917 -641.915
9 2485 912 711.716861 200.2831
10 8860 570 1624.09889 -1054.1
11 1436 512 561.585292 -49.5853
Fitted values and residuals
(Cont)

Some algebraic properties of OLS
(Cont)
From the first order conditions of OLS, we have some algebraic properties of OLS:
 The sum and mean of the residuals will always equal zero: Σûᵢ = 0 and mean(û) = 0.
 The residuals will be uncorrelated with the independent variable: Σxᵢûᵢ = 0, or Cov(x, û) = 0.
 The residuals will be uncorrelated with the fitted or predicted values: Cov(ŷ, û) = 0.
 The sample averages of y and x lie on the regression line: ȳ = β̂0 + β̂1x̄.
 The average of the predicted values is equal to the average of the actual values: mean(ŷ) = ȳ.
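These properties can be checked numerically on the 11-household sample (a sketch in plain Python; the OLS formulas are the ones derived above):

```python
# Verify the algebraic properties of OLS: residuals sum to zero, are
# uncorrelated with x and with the fitted values, and mean(yhat) = mean(y).
x = [3000, 8208, 3613, 4624, 4751, 5151, 5884, 2696, 2485, 8860, 1436]
y = [995, 2900, 1450, 1460, 510, 760, 1005, 100, 912, 570, 512]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

yhat = [b0 + b1 * xi for xi in x]
uhat = [yi - yh for yi, yh in zip(y, yhat)]

print(abs(sum(uhat)) < 1e-6)                                   # sum of residuals = 0
print(abs(sum(xi * ui for xi, ui in zip(x, uhat))) < 1e-4)     # x uncorrelated with uhat
print(abs(sum(yh * ui for yh, ui in zip(yhat, uhat))) < 1e-4)  # yhat uncorrelated with uhat
print(abs(sum(yhat) / n - ybar) < 1e-6)                        # mean(yhat) = mean(y)
```

All four checks print True (up to floating-point rounding), matching the table of means, sums, and covariances below.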
Some algebraic properties of OLS
(Cont)
Observation   X      Y      Predicted Y: ŷᵢ   Residuals: ûᵢ

1 3000 995 785.423017 209.576983

2 8208 2900 1530.78546 1369.214536

3 3613 1450 873.15481 576.845190

4 4624 1460 1017.84787 442.152134

5 4751 510 1036.02395 -526.023947

6 5151 760 1093.27145 -333.271447

7 5884 1005 1198.17749 -193.177490

8 2696 100 741.914917 -641.914917

9 2485 912 711.716861 200.283139

10 8860 570 1624.09889 -1054.098889

11 1436 512 561.585292 -49.585292

Mean          —      1015.818   1015.81818    0.00000000

Sum of residuals = 0.00000000

Cov(x, û) = 0.0000

Cov(ŷ, û) = 0.00000
Decomposition of total variation

ȳ            yᵢ    ŷᵢ          (yᵢ − ȳ)² (TSS)   (ŷᵢ − ȳ)² (ESS)   ûᵢ           ûᵢ² (RSS)
1015.818182 995 785.4230167 433.3966942 53081.93213 209.576983 43922.51
1015.818182 2900 1530.785464 3550141.124 265191.3019 1369.214536 1874748
1015.818182 1450 873.1548101 188513.8512 20352.83762 576.845190 332750.4
1015.818182 1460 1017.847866 197297.4876 4.119617482 442.152134 195498.5
1015.818182 510 1036.023947 255852.0331 408.2729503 -526.023947 276701.2
1015.818182 760 1093.271447 65442.94215 5999.008273 -333.271447 111069.9
1015.818182 1005 1198.17749 117.0330579 33254.91739 -193.177490 37317.54
1015.818182 100 741.9149168 838722.9421 75022.99858 -641.914917 412054.8
1015.818182 912 711.7168607 10778.21488 92477.61353 200.283139 40113.34
1015.818182 570 1624.098889 198753.8512 370005.4186 -1054.098889 1111124

1015.818182 512 561.5852924 253832.7603 206327.5178 -49.585292 2458.701


TSS (Total sum of squares) 5559885.636

ESS (Explained sum of squares) 1122125.939

RSS (Residual sum of squares) 4437760

R-squared = ESS/TSS = 0.201825363
Goodness-of-fit measure (R-squared)

TSS = Σ(yᵢ − ȳ)²: total sum of squares; ESS = Σ(ŷᵢ − ȳ)²: explained sum of squares; RSS = Σûᵢ²: residual sum of squares.

TSS = ESS + RSS

- TSS represents the total variation in the dependent variable.
- ESS represents the variation explained by the regression.
- RSS represents the variation not explained by the regression.

Total variation = explained part + unexplained part

Goodness-of-fit measure: R-squared, or the coefficient of determination, R² = ESS/TSS = 1 − RSS/TSS, measures the proportion of the total variation in y that is explained by the regression.
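The decomposition and R² can be verified on the 11-household sample (a sketch; the totals match the decomposition table above up to rounding):

```python
# Decompose the total variation and compute R-squared = ESS/TSS = 1 - RSS/TSS.
# The identity TSS = ESS + RSS must hold for OLS with an intercept.
x = [3000, 8208, 3613, 4624, 4751, 5151, 5884, 2696, 2485, 8860, 1436]
y = [995, 2900, 1450, 1460, 510, 760, 1005, 100, 912, 570, 512]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

tss = sum((yi - ybar) ** 2 for yi in y)               # total variation
ess = sum((yh - ybar) ** 2 for yh in yhat)            # explained variation
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained variation

print(round(tss), round(ess), round(rss))  # ≈ 5559886, 1122126, 4437760
print(round(ess / tss, 4))                 # R-squared ≈ 0.2018
```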
Graphic presentation for decomposing the total variation
Goodness-of-fit measure (R-squared)

- Wage regression: N = 265; R² = 0.172. The regression explains 17.2% of the total variation in monthly wage.
- Meat consumption regression: N = 11; R² = 0.202. The regression explains 20.2% of the total variation in meat consumption.

Note: A high R-squared doesn't mean the regression has a causal interpretation.
2.5. Unit of measurement and functional form
Units of measurement

Change in the measurement unit of the dependent variable → intercept and slope coefficient:
- If Y is divided by a constant c, both the intercept and the slope are divided by c.
- If Y is multiplied by a constant c, both the intercept and the slope are multiplied by c.

E.g. wage measured in 1,000 VND: ŵage = −856 + 754·educ
wage measured in 1,000,000 VND (wage/1000): ŵage = −0.856 + 0.754·educ

Change in the measurement unit of the independent variable → intercept and slope coefficient:
- If X is divided by a constant c, the slope is multiplied by c; the intercept is unchanged.
- If X is multiplied by a constant c, the slope is divided by c; the intercept is unchanged.

E.g. education measured in years: ŵage = −856 + 754·educ
education is measured in months (edu·12): the slope becomes 754/12 ≈ 62.8; the intercept stays at −856.
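These rescaling rules can be confirmed numerically (a sketch; the 11-household data are used, but any bivariate sample would behave the same way):

```python
# Rescaling rules: dividing y by c divides both coefficients by c;
# multiplying x by c divides the slope by c and leaves the intercept alone.
def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

x = [3000, 8208, 3613, 4624, 4751, 5151, 5884, 2696, 2485, 8860, 1436]
y = [995, 2900, 1450, 1460, 510, 760, 1005, 100, 912, 570, 512]

b0, b1 = ols(x, y)
c = 1000.0

b0_y, b1_y = ols(x, [yi / c for yi in y])  # y rescaled: both coefficients shrink by c
b0_x, b1_x = ols([xi * c for xi in x], y)  # x rescaled: only the slope shrinks by c

print(abs(b0_y - b0 / c) < 1e-9, abs(b1_y - b1 / c) < 1e-9)
print(abs(b0_x - b0) < 1e-6, abs(b1_x - b1 / c) < 1e-9)
```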
Functional forms
(Cont)

[Figure: fitted values of monthly wages (1000 VND) per household against average schooling years of working members, comparing a straight-line fit with a fit whose slope increases with education.]

 In the linear model, each additional year of education has the same effect on wages. Is this reasonable?
In fact, the effect may be greater at higher levels of education.


Functional forms
(Cont)
 We can model how each year of education increases wages by a constant percentage.
 From the exponential function wage = exp(β0 + β1·educ + u), an approximately constant percentage-change effect can be modeled as: log(wage) = β0 + β1·educ + u.

Percentage change of wage if education increases by one year:
%Δwage ≈ (100·β1)·Δeduc (*)
(*) holds if β1·Δeduc is very small.
We multiply β1 by 100 to obtain the percentage change given one extra year of education.

E.g. fitted model: each extra year of education raises log(wage) by β̂1 = 0.0436, i.e. wages by about 4.36%.

Important notes:
 Exact %Δwage = 100·[exp(0.0436·1) − 1] = 100·0.04456445 ≈ 4.46%.
 R² = 0.16 shows that educ explains about 16% of the variation in log(wage) (not wage).
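The gap between the approximate effect 100·β1 and the exact effect 100·[exp(β1) − 1] is easy to check (a sketch; β1 = 0.0436 is the slide's example value):

```python
# Log-level model log(wage) = b0 + b1*educ + u: the approximate effect of one
# more year of education is 100*b1 percent; the exact effect is 100*(exp(b1)-1).
import math

b1 = 0.0436
approx = 100 * b1                  # approximate percentage change
exact = 100 * (math.exp(b1) - 1)   # exact percentage change

print(round(approx, 2))  # 4.36
print(round(exact, 2))   # 4.46 -- close to the approximation, since b1 is small
```

For a large coefficient (say b1 = 0.5) the two numbers would diverge sharply, which is why the 100·β1 shortcut is only safe for small coefficients.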
2.6. Standard assumptions for the simple linear regression model

 Assumption SLR.1 (Linear in parameters)

 Assumption SLR.2 (Random sampling)

 Assumption SLR.3 (Sample variation in the explanatory

variable)

 Assumption SLR.4 (Zero conditional mean)

 Assumption SLR.5 (Homoskedasticity)


Assumption SLR.1 (Linear in parameters)

The relationship between Y and X is linear in parameters: y = β0 + β1x + u.

(1) Linear in both the parameters and the variables: y = β0 + β1x + u.
This implies that a one-unit change in X has the same effect on Y regardless of X's initial values: E(y|x) is a linear function of x, and the regression line is a straight line.

(2) Linear in parameters but non-linear in variables: e.g. y = β0 + β1x² + u.

(3) Linear in variables but non-linear in parameters: e.g. y = β0 + β1²x + u.

Regression models (1) and (2) are both linear in parameters.
(Cont)

[Figures: fitted curves of English and Math scores against household income (1,000,000 VND/month).]

Stata commands: curvefit english income, function(1); curvefit math income, function(4)
Assumption SLR.2 (Random sampling)
(Cont)

 The data is a random sample drawn from the population.
 It means that every member of the population has an equal chance of being selected for the sample.
 This assumption is likely to be violated. Why?
Assumption SLR.3 (Sample variation in the independent variable)
(Cont)

 Sample variation in the independent variable means that the values of the independent variable are not all the same: Σ(xᵢ − x̄)² > 0.

 If the values of the independent variable are all identical, it is impossible to estimate β̂1, because the denominator of the OLS formula equals zero: Σ(xᵢ − x̄)² = 0.

E.g. no variation in education: every employee has the same level of education, 12 years.
(Cont)
Assumption SLR.4 (Zero conditional mean): E(u|x) = 0
(Cont)
We have already discussed this crucial assumption.

 In other words: the value of the independent variable (X) must contain no information about the mean of the unobservables (u).
 Note: E(u|x) = 0 rules out any linear or non-linear relationship between the mean of u and x.
 E(u|x) = 0 implies that the explanatory variable is exogenous: Cov(x, u) = 0, i.e. there is no linear association between x and u.

Questions:
E(u|x) = 0 implies that Cov(x, u) = 0.
Cov(x, u) = 0 does not imply that E(u|x) = 0.
Assumption SLR.4 (Zero conditional mean): E(u|x) = 0
(Cont)
Causality vs correlation

 β̂1 is biased if x is endogenous (Cov(x, u) ≠ 0).
Assumption SLR.5 (Homoskedasticity): Var(u|x) = σ²
(Cont)

 Given any value of the explanatory variable, the error term has the same variance: Var(u|x) = σ², which is the same as Var(y|x) = σ².
 In other words: the value of the independent variable (X) must contain no information about the variance of the unobservables (u).
E.g.: Even though the average wage goes up with education, the spread of wages around the mean is assumed to stay the same at all levels of education.
Note: heteroskedasticity occurs whenever Var(u|x) is a function of x.
Homoskedasticity vs heteroskedasticity
(Cont)
- Homoskedasticity: for any value of x, the variance of y is the same.
- Heteroskedasticity: wage variation around the mean is not constant; it increases with the education level.

Heteroskedasticity before and after logarithm transformation
(Cont)

Stata commands — before: reg wage edu; hettest. After: gen Log_Wage=ln(wage); reg Log_Wage edu; hettest.
Gauss-Markov assumptions of the Simple Linear Regression (SLR)

Under SLR.1–SLR.4, β̂1 is unbiased: E(β̂1) = β1.
Adding SLR.5, β̂1 has the smallest variance among all linear unbiased estimators.
Under SLR.1–SLR.5, β̂1 is the best linear unbiased estimator (BLUE) of β1.

1. Assumption SLR.1 (Linear in parameters)
2. Assumption SLR.2 (Random sampling)
3. Assumption SLR.3 (Sample variation in the explanatory variable)
4. Assumption SLR.4 (Zero conditional mean: E(u|x) = 0)
5. Assumption SLR.5 (Homoskedasticity: Var(u|x) = σ²)

What happens if any of our four assumptions SLR.1–SLR.4 is not satisfied?

What happens if the assumptions SLR.1–SLR.4 are satisfied but the assumption SLR.5 is not?
2.7. Mean and variances of the OLS estimator
Interpretation of unbiasedness
 Under SLR.1–SLR.4, β̂1 is unbiased: E(β̂1) = β1.
 Unbiasedness does not imply that, with a given sample, our estimated parameters would equal the exact true values of the population parameters.
 In a given sample, estimates may be larger (β̂0 > β0; β̂1 > β1) or smaller (β̂0 < β0; β̂1 < β1) than the true values.
Instead, unbiasedness should be interpreted as follows:

o if sampling is repeated from the same population;
o and the estimation is repeated many times;
o then the expected value of the estimated parameters will equal the true population parameters:
E(β̂0) = β0; E(β̂1) = β1.
Variances of OLS estimators:
Estimating the variance of the error term:
From the population regression model yᵢ = β0 + β1xᵢ + uᵢ (uᵢ is the error term, representing all unobservables),
we have the sample regression model yᵢ = β̂0 + β̂1xᵢ + ûᵢ, where ûᵢ = yᵢ − ŷᵢ is the residual.

Recall that the residual (ûᵢ) can be seen as an estimate of the error term (uᵢ).
Now we have to obtain an unbiased estimate of the variance of the error term, σ² = Var(u).

 Estimating the error variance or the variance of the error term:

An unbiased estimate of the error variance can be calculated as:

σ̂² = RSS/(n − k − 1) = Σûᵢ²/(n − k − 1), where k = the number of independent variables.

 Estimating the standard error of the regression: σ̂ = sqrt(σ̂²).

Note: Under the assumptions SLR.1–SLR.5: E(σ̂²) = σ².

σ̂ measures the average distance between the observed values and the regression line (the fitted values).
Estimating the variance of the error term:
(Cont)

We have RSS = Σûᵢ² = 4437760, where n = 11 and k = 1 (one independent variable).

σ̂² = 4437760/(11 − 2) = 4437760/9 = 493084.44
σ̂ = sqrt(493084.44) = 702.19972

 σ̂ ≈ 702 indicates that the average difference between the observed and fitted meat consumption values is about 702 thousand VND.

Note:
 The standard error of the regression (σ̂) has the same unit as the dependent variable.
 σ̂ also has other names: the standard error of the estimate and the root mean squared error (Root MSE).
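The computation above can be reproduced in a few lines (a sketch using the chapter's numbers):

```python
# Standard error of the regression: sigma2_hat = RSS/(n - k - 1), ser = sqrt(...).
# RSS = 4437760 from the 11-household regression, with n = 11 and k = 1.
import math

rss, n, k = 4437760, 11, 1
sigma2_hat = rss / (n - k - 1)  # unbiased estimate of the error variance
ser = math.sqrt(sigma2_hat)     # standard error of the regression (Root MSE)

print(round(sigma2_hat, 2))  # 493084.44
print(round(ser, 2))         # 702.2
```

In Stata's `regress` output this quantity appears as "Root MSE", in the same units as the dependent variable.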
Variances and standard errors for regression coefficients
(Cont)

Under SLR.1–SLR.5:

 Variance and standard error for the slope coefficient:
Var(β̂1) = σ² / Σ(xᵢ − x̄)²; se(β̂1) = σ̂ / sqrt(Σ(xᵢ − x̄)²)

 Variance and standard error for the intercept coefficient:
Var(β̂0) = σ²·(n⁻¹Σxᵢ²) / Σ(xᵢ − x̄)²; se(β̂0) = σ̂·sqrt((n⁻¹Σxᵢ²) / Σ(xᵢ − x̄)²)

Note:
 Standard errors are the estimated standard deviations of the regression coefficients.

 They measure how precisely the coefficients are estimated.
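A sketch of these formulas applied to the 11-household regression (Σ(xᵢ − x̄)² and RSS are computed from the raw data rather than taken from the slides):

```python
# Standard errors of the OLS coefficients:
#   se(b1) = sigma_hat / sqrt(Sxx),  se(b0) = sigma_hat * sqrt((sum(x^2)/n) / Sxx)
import math

x = [3000, 8208, 3613, 4624, 4751, 5151, 5884, 2696, 2485, 8860, 1436]
y = [995, 2900, 1450, 1460, 510, 760, 1005, 100, 912, 570, 512]

n, k = len(x), 1
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
sigma_hat = math.sqrt(rss / (n - k - 1))  # standard error of the regression

se_b1 = sigma_hat / math.sqrt(sxx)
se_b0 = sigma_hat * math.sqrt(sum(xi ** 2 for xi in x) / n / sxx)

print(round(se_b1, 4), round(se_b0, 1))
```

These are the standard errors that `reg` reports next to each coefficient; they feed directly into the t-statistics and confidence intervals of later chapters.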


Properties of the mean and variances

E(β̂1) = β1: True (under SLR.1–SLR.4, the OLS estimator is unbiased).
Properties of the mean and variances
Exercise:
1. Is the residual the error term? Explain.
2. Why do the sum and mean of the residuals always equal zero?
3. What happens to the sum and mean of the residuals if we exclude the intercept from the OLS model?
4. What happens to the OLS estimator if the sample is not randomly selected from a population?
5. What happens to a simple linear regression model if the value of the explanatory variable is the same for all observations?
6. Suppose our model satisfies the SLR assumptions 1–4 but suffers from heteroskedasticity. In this case, are our estimates biased? What is the consequence of the heteroskedasticity?
7. Comment on the statement that a model with a high R-squared shows a strongly causal relationship.
8. Which model violates the assumptions of the OLS?
(1) …
(2) …
(3) …
Exercise
9. Let Qd denote the quantity demanded of a given product, and let P denote the price of that product. A simple model connecting quantity demanded to price is: Qd = β0 + β1P + u.
(i) What possible factors are contained in u? Is it likely that these will be related to price?
(ii) Will a simple regression analysis show the ceteris paribus effect of price on quantity demanded? Explain.

10. The following table contains monthly meat consumption per household (thousand VND) and monthly household income per capita (thousand VND) for 20 households.

Obs   meat   income     Obs   meat   income
1     1390   5031       11    1770   4365
2     1320   6491       12    1620   4727
3     2900   4900       13    1460   5067
4     790    3267       14    650    5094
5     1600   5164       15    995    3000
6     2400   3260       16    2900   8208
7     1310   4847       17    1450   3613
8     1690   8395       18    1460   4624
9     1880   6625       19    510    4751
10    1205   2394       20    760    5151

(i) Estimate the relationship between the dependent variable (meat consumption) and the independent variable (household income per capita) using an OLS regression model. Comment on the link between the two variables. What is the meaning of the intercept and slope coefficients?
(ii) How much higher is the level of meat consumption predicted to be if the monthly income per capita is increased by 200 thousand VND?
(iii) Is it true to say that, given a one million VND increase in household income per capita, the value of meat consumption increases by the same amount for all households?
(iv) Calculate the fitted values of the dependent variable and the residuals. Do the sum and mean of the residuals equal zero? What is the average of the fitted values and the observed values of the dependent variable?
(v) Please interpret the R-squared. How much of the variation in meat consumption is unexplained by the regression?
(vi) Calculate the standard errors of the regression coefficients and the standard error of the regression. What is the unit of analysis for the standard error of the regression?
11. Using a simple linear regression model, a researcher investigates the dependence of the monthly wage (in
thousand VND) on the number of years of education among wage workers in Hanoi in 2018.
(i) What is the average predicted wage when education equals zero?
(ii) How much does the monthly wage increase if the number of years of education increases from 12 to 16 years?
(iii) Does this model infer a causal relationship between wage and education?
(iv) What percentage of the variance in wages is explained by education?
12. A sample of 11 households with their income and food consumption is given in the table.

Income (thousand VND/person/month)   Food consumption (thousand VND/person/month)
3000                                 995
8208                                 2900
3613                                 1450
4624                                 1460
4751                                 510
5151                                 760
5884                                 1005
2696                                 100
2485                                 912
8860                                 570
1436                                 512

Using the OLS estimator, estimate the relationship between the dependent variable (food consumption) and the explanatory variable (income): food = β0 + β1·income + u.
(i) Using the regression result, please report the marginal propensity to consume food (MPCF).
(iii) What is the MPCF if the regression model excludes the intercept: food = β1·income + u? (Note: use “Constant is Zero” in Excel, or the noconstant option in Stata.)
(iv) Using the result from the model without an intercept, calculate the fitted values of the dependent variable and the residuals. Do the sum and mean of the residuals equal zero? What is the average of the fitted values and the observed values of the dependent variable?
(v) Does the exclusion of the intercept from the model cause bias? Explain.
