
ECA5103 ASSIGNMENT 1-SEC1

Maesha Armeen

A0210255W

29.09.2019

Answer 1

(a) Under what assumptions will the OLS estimate of β1 provide an unbiased estimate of β1?
Are these assumptions realistic?

a) The OLS estimate of β1 will be an unbiased and consistent estimate of β1 under assumptions MLR1-4, which are as
follows:

MLR1: Linear in parameters. The model must be linear in its parameters, meaning that the price and the
unit size of the house are related through an equation that is linear in the coefficients. The researcher can
apply transformations such as logs or exponentials when running the specified regression, but linearity in
the parameters of the original regression must hold.

MLR2: Random sampling. The housing data must be a random sample from the population. This allows
us to make unbiased inferences; otherwise, a flawed conclusion would be drawn from an unrepresentative
portion of the population.

MLR3: No perfect collinearity. No independent variable may take the same value for all observations
(i.e. be constant), and no independent variable may be an exact linear function of the others.

MLR4: Zero conditional mean. The explanatory variables must contain no information about the
mean of the error term, i.e. E(u | x) = 0; the explanatory variables must be exogenous.

There are two further assumptions: MLR5 (homoskedasticity), the variance of the error term must be
constant conditional on the explanatory variables; and MLR6 (normality), the error term is normally
distributed and independent of the explanatory variables. Under MLR1-5, OLS is BLUE (best linear
unbiased estimator). However, the coefficient is unbiased if it satisfies just MLR1-4.

These assumptions are not always realistic; however, it is important to make them in order to infer a
causal effect. In real life they are unlikely to hold fully: unit size may not affect price linearly in the
specified form, researchers often have no details about the randomness of the sample data, and the
explanatory variables are likely to hold some information about the residuals or other omitted determinants of price.

(b) If these assumptions are violated, will the OLS estimate of β1 over- or under-estimate the
impact of size on price? Explain your answer.

Violations of these assumptions may bias the results or render them worthless. In this case, omitting a
relevant variable (such as the number of bedrooms) from the regression violates the zero conditional mean
assumption and produces omitted variable bias. β1 is likely to be overestimated, since the number of
bedrooms both raises price and is positively correlated with unit size.
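The direction of this omitted variable bias can be sketched numerically (a Python illustration only; the 0.8 loading of bedrooms on size and all coefficients are hypothetical). When the omitted bedrooms variable has a positive effect on price and is positively correlated with size, the short regression of price on size alone overstates the size coefficient:

```python
import random

random.seed(0)

def ols_slope(x, y):
    """Simple-regression OLS slope: cov(x, y) / var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

n = 5000
size = [random.gauss(0, 1) for _ in range(n)]
# bedrooms positively correlated with size (hypothetical 0.8 loading)
bedrooms = [0.8 * s + random.gauss(0, 0.5) for s in size]
u = [random.gauss(0, 1) for _ in range(n)]
# hypothetical true model: both size and bedrooms raise price
price = [1 + 2.0 * s + 1.5 * b + e for s, b, e in zip(size, bedrooms, u)]

short_slope = ols_slope(size, price)   # bedrooms omitted from the regression
print(round(short_slope, 1))           # well above the true 2.0 (about 3.2)
```

The short-regression slope picks up roughly 2.0 + 1.5 × 0.8 = 3.2: the true effect plus the omitted variable's effect times its association with size, which is why a positive correlation with a positively-signed omitted variable produces overestimation.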
Answer 2

(a) Rename variables R0000300 R0000500 R0618300 to birth_month, birth_day, and afqt.

. rename R0000300 birth_month

. rename R0000500 birth_day

. rename R0618300 afqt

(b) Use recode 1) to convert invalid values into missing and 2) to recode sex, which is currently
defined as = 1 for male, = 2 for female, into = 0 for male and = 1 for female, and rename
the variable female.

i) After checking all the variables with tab for negative values, I found that birth_month and birth_day
have no negative values; thus, recoding them is not required.

. recode afqt (-4/-3=.)

(afqt: 808 changes made)

. recode ind04(-5/-3=.)

(ind04: 6052 changes made)

. recode wage04(-5/-4=.)

(wage04: 10002 changes made)

. recode incwg04 (-5/-1=.)

(incwg04: 5417 changes made)

. recode edu(-5=.)

(edu: 5025 changes made)

. recode age04(-5=.)

(age04: 5025 changes made)

ii)

. recode sex(1=0)

(sex: 6403 changes made)

. recode sex(2=1)

(sex: 6283 changes made)

. rename sex female


(c) Generate the summary statistics for AFQT female wage04 edu age04 and reports the results

in Table 1 of the provided template.

. summarize afqt female wage04 edu age04

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        afqt |     11,878    40.95193    28.75716          1         99
      female |     12,686    .4952704    .4999973          0          1
      wage04 |      2,684     3522610     3965777         -3   3.01e+07
         edu |      7,661    13.23026    2.518906          0         20
       age04 |      7,661    43.15703    2.256123         27         48

Table 1:

Variable          Obs        Mean       Min        Max
afqt           11,878    40.95193         1         99
female         12,686     0.49527         0          1
wage04          2,684     3522610        -3   3.01E+07
edu             7,661    13.23026         0         20
age04           7,661    43.15703        27         48

(d) Create a new variable birthq that contains information on a person's birth quarter. For
example, if a person was born in May, then his birthq will have the value of 2.

. gen birthq=birth_month

. recode birthq(1/3=1)

(birthq: 2019 changes made)

. recode birthq(4/6=2)

(birthq: 3038 changes made)

. recode birthq(7/9=3)

(birthq: 3562 changes made)

. recode birthq(10/12=4)

(birthq: 2941 changes made)

. tab birthq

     birthq |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      3,145       24.79       24.79
          2 |      3,038       23.95       48.74
          3 |      3,562       28.08       76.82
          4 |      2,941       23.18      100.00
------------+-----------------------------------
      Total |     12,686      100.00
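The recode above maps months 1-3 to quarter 1, 4-6 to quarter 2, 7-9 to quarter 3, and 10-12 to quarter 4. The same mapping can be written compactly (a Python sketch for illustration, not part of the Stata log):

```python
# month-to-quarter mapping used for birthq: 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
def birth_quarter(month):
    return (month - 1) // 3 + 1

# e.g. a person born in May (month 5) gets birthq = 2
print([birth_quarter(m) for m in (1, 5, 9, 12)])  # [1, 2, 3, 4]
```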


(e) Test whether people born in the last quarter are less educated than people born in other

quarters.

. mean(edu) if birthq ==4

Mean estimation Number of obs = 1,804

Mean Std. Err. [95% Conf. Interval]

edu 13.15521 .05824 13.04099 13.26944

. mean (edu) if birthq==3|birthq==2|birthq==1

Mean estimation Number of obs = 5,857

Mean Std. Err. [95% Conf. Interval]

edu 13.25337 .0330904 13.1885 13.31824

Comparing the two tables, the mean education of people born in the fourth quarter (13.155) is lower than
the mean for the first three quarters (13.253). The point estimates therefore suggest that people born in
the last quarter are slightly less educated, although the difference is small and the confidence intervals
overlap, so a formal test (e.g. a two-sample t-test) would be needed to establish significance.

(f) Plot the histogram of wage04 and log(wage04). (You can use the command histogram and
you need to create a new variable lw04 = log(wage04)).

. gen lw04=log(wage04)

(10,037 missing values generated)

. hist wage04

(bin=34, start=-3, width=884705.97)


. hist lw04

(bin=34, start=7.6797137, width=.28057818)

(g) Explain why researchers prefer to use log wage rather than wage in the regression.

Researchers prefer to use log wage instead of wage as they care about percentage changes in wages
rather than absolute changes. Moreover, using log enables them to normalize the data.
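The percentage-change interpretation can be checked numerically (a Python sketch; the wage figures are made up): for small changes, the difference in logs approximates the proportional change.

```python
import math

# a change in log wage approximates a percentage change in wage
old_wage, new_wage = 100.0, 110.0
pct_change = (new_wage - old_wage) / old_wage          # exactly 0.10
log_change = math.log(new_wage) - math.log(old_wage)   # about 0.0953
print(round(pct_change, 4), round(log_change, 4))
```

This is why a coefficient in a log-wage regression is read as an approximate percentage effect; the approximation is tighter the smaller the change.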

(h) Generate potential years of experience using exp = age - edu - 5.

. gen exp=age04-edu-5

(5,025 missing values generated)

(i) Regress log wage on education, female, a quadratic function of potential years of experience.
Report the regression results in column 1 of Table 2.

. gen exp2=(exp)^2

(5,025 missing values generated)

. reg lw04 edu female exp2

      Source |       SS           df       MS      Number of obs   =     2,649
-------------+----------------------------------   F(3, 2645)      =     27.82
       Model |  429.167305         3  143.055768   Prob > F        =    0.0000
    Residual |  13600.3953     2,645  5.14192638   R-squared       =    0.0306
-------------+----------------------------------   Adj R-squared   =    0.0295
       Total |  14029.5626     2,648  5.29817318   Root MSE        =    2.2676

------------------------------------------------------------------------------
        lw04 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0849158   .0293516     2.89   0.004     .0273613    .1424702
      female |  -.8019071   .0895686    -8.95   0.000    -.9775387   -.6262755
        exp2 |   .0003259   .0003809     0.86   0.392     -.000421    .0010729
       _cons |   12.78836   .5632276    22.71   0.000     11.68395    13.89277
------------------------------------------------------------------------------
Table 2

lw04              (1)        (2)        (3)
edu           0.0849158
female       -0.8019071
exp2          0.0003259
_cons          12.78836
Observations      2,649
R-squared        0.0306

(j) Regress log wage on education, female, a quadratic function of potential years of experience, and
AFQT scores. Report the regression results in column 2 of Table 2.

. reg lw04 edu female exp2 afqt

      Source |       SS           df       MS      Number of obs   =     2,541
-------------+----------------------------------   F(4, 2536)      =     21.77
       Model |  446.949788         4  111.737447   Prob > F        =    0.0000
    Residual |  13016.7549     2,536  5.13278979   R-squared       =    0.0332
-------------+----------------------------------   Adj R-squared   =    0.0317
       Total |  13463.7047     2,540  5.30067114   Root MSE        =    2.2656

------------------------------------------------------------------------------
        lw04 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .1160641   .0342978     3.38   0.001     .0488095    .1833187
      female |  -.8276587   .0915236    -9.04   0.000    -1.007127   -.6481901
        exp2 |   .0004232    .000393     1.08   0.282    -.0003474    .0011938
        afqt |  -.0030298   .0020755    -1.46   0.144    -.0070995      .00104
       _cons |   12.45203   .6019054    20.69   0.000     11.27175    13.63231
------------------------------------------------------------------------------

Table 2

lw04              (1)          (2)        (3)
edu           0.0849158    0.1160641
female       -0.8019071   -0.8276587
exp2          0.0003259    0.0004232
afqt                      -0.0030298
_cons          12.78836     12.45203
Observations      2,649        2,541
R-squared        0.0306       0.0332

(k) Comment on the difference in the coefficients on education between these two columns.

The coefficient on edu increases from 0.0849 to 0.116 after afqt is added to the regression. This suggests
that the coefficient in the first regression was biased downward by an omitted variable: afqt is positively
correlated with education but carries a negative partial coefficient here, so leaving it out pushed the edu
coefficient down. Controlling for AFQT scores therefore reduces the omitted variable bias.
