Econ 103: Econometrics
Lab 1: Review of Estimation, Hypothesis Testing and SLR
Dr. Randall R. Rojas
Contents
Problem 1: Descriptive Analysis
A. Summary Statistics Table
B. Distributions of the Variables
Problem 2: Confidence Intervals
A. Mean
B. Difference of Two Means
Problem 3: Hypothesis Testing
A. Mean
B. Difference of Two Means
Problem 4: Ordinary Least Squares
Problem 1: Descriptive Analysis
For this problem we will work with the data set food from (HGL), which includes the variables food_exp (weekly
food expenditure in $) and income (weekly household income in $100s). Our objective is to understand
the statistical characteristics of our variables. This will help us build better models (in the context of linear
regression) for how the variables are related.
Note: Descriptions of your textbook data sets can be found here: [Link]
PoEdata/man/
A. Summary Statistics Table
For your data, produce a table that includes various relevant statistical measures such as the mean, median,
min, max, etc., and comment on their values.
B. Distributions of the Variables
Plot a histogram for each one of your variables, and comment on attributes such as the shape, moments,
unusual observations, and any other features you consider relevant for its description. For each variable
show its boxplot, and compare and contrast it with its respective histogram.
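Comments on shape can be backed by sample moments. The following is a minimal sketch, assuming simulated right-skewed data in place of the textbook variables. The helper names `sample_skewness` and `sample_kurtosis` are hypothetical (base R has no built-in equivalents), and the seed is arbitrary.

```r
# Moment-based shape measures (helper names are illustrative, not from a package)
sample_skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5   # third standardized moment (0 for a symmetric sample)
}
sample_kurtosis <- function(x) {
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2     # fourth standardized moment (about 3 for Normal data)
}

set.seed(103)                 # assumed seed, for reproducibility only
x <- rexp(1000, rate = 1)     # simulated right-skewed data (assumption)
skew_x <- sample_skewness(x)
kurt_x <- sample_kurtosis(x)
c(mean = mean(x), median = median(x), skewness = skew_x, kurtosis = kurt_x)
```

A positive skewness together with a mean above the median is what a histogram with a long right tail would suggest.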
First, load the library that contains all the data sets:
# Read in the data and check what variables are included
library(POE5Rdata)
data("food") # Load the data
names(food) # Check what variables are included
## [1] "food_exp" "income"
attach(food) # This allows us to reference variables by their names*
# *Note: In general, it is better not to attach variables
# and instead reference them by their respective data set
# so that there is no conflict with variables named the
# same in other data sets. For example, we can reference
# the variable 'income' as 'food$income', without having to
# attach it.
# A. Summary Statistics
# For a short description of each variable:
summary(income)  # summary of a single variable
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.69 17.11 20.03 19.60475 24.3975 33.4
summary(food)  # summary of the whole data set
food_exp income
Min. :109.7 Min. : 3.69
1st Qu.:200.4 1st Qu.:17.11
Median :264.5 Median :20.03
Mean :283.6 Mean :19.60
3rd Qu.:363.3 3rd Qu.:24.40
Max. :587.7 Max. :33.40
# For a more detailed description, including confidence
# intervals
library(pastecs)
stat.desc(food)  # nbr.na counts missing values ("holes") in the data set
                  food_exp      income
nbr.val       4.000000e+01  40.0000000
nbr.null      0.000000e+00   0.0000000
nbr.na        0.000000e+00   0.0000000
min           1.097100e+02   3.6900000
max           5.876600e+02  33.4000000
range         4.779500e+02  29.7100000
sum           1.134294e+04 784.1900000
median        2.644800e+02  20.0300000
mean          2.835735e+02  19.6047500
SE.mean       1.781551e+01   1.0827279
CI.mean.0.95  3.603527e+01   2.1900239
var           1.269570e+04  46.8919897
std.dev       1.126752e+02   6.8477726
coef.var      3.973403e-01   0.3492915
# B. Distributions of the Variables
# Histograms
par(mfrow = c(2, 1))  # 2 rows, 1 column; the default is one plot per figure
hist(food_exp)
hist(income)
[Figure: "Histogram of food_exp" and "Histogram of income", each with Frequency on the y-axis]
par(mfrow = c(1, 1))
# We can customize the histograms for improved
# visualization and understanding
library(MASS)
truehist(income, col = "slategray3", main = "Histogram of Weekly Household Income",
    xlab = "Weekly Income ($100s)", ylab = "Proportion", xlim = c(0,
        40), ylim = c(0, 0.07))  # color, title, x-axis label, y-axis label
lines(density(income), lwd = 2, col = "blue3")  # overlay density curve
[Figure: "Histogram of Weekly Household Income" with density curve (x-axis: Weekly Income ($100s); y-axis: Proportion)]
truehist(food_exp, col = "slategray3", main = "Histogram of Weekly Food Expenditure",
xlab = "Weekly Food Expenditure ($)", ylab = "Proportion",
xlim = c(0, 700), ylim = c(0, 0.0035))
lines(density(food_exp), lwd = 2, col = "blue3")
[Figure: "Histogram of Weekly Food Expenditure" with density curve (x-axis: Weekly Food Expenditure ($); y-axis: Proportion)]
# Boxplots
par(mfrow = c(1, 2))  # 1 row, 2 columns: boxplots side by side for comparison
library(car)
Boxplot(income)
## [1] 1 2 3
Boxplot(food_exp)
[Figure: side-by-side boxplots of income and food_exp; observations 1, 2, and 3 are labeled on the income boxplot]
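The points a boxplot draws individually are those lying more than 1.5 times the hinge spread beyond the hinges. The sketch below applies that rule by hand and compares it with base R's `boxplot.stats()`. The data are simulated with one planted extreme value; this is an assumption, not the textbook data.

```r
set.seed(103)                                # assumed seed
x <- c(rnorm(40, mean = 20, sd = 5), 60)     # simulated data plus one planted extreme value

hinges <- fivenum(x)[c(2, 4)]                # lower and upper hinges, as used by boxplot()
spread <- diff(hinges)                       # hinge spread (close to the IQR)
lower  <- hinges[1] - 1.5 * spread
upper  <- hinges[2] + 1.5 * spread
flagged <- x[x < lower | x > upper]          # values beyond the fences

# boxplot.stats() applies the same rule with its default coef = 1.5
flagged_builtin <- boxplot.stats(x)$out
```

Comparing the flagged values with the tails of the histogram is one way to carry out the "compare and contrast" part of the exercise.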
Problem 2: Confidence Intervals
A. Mean
The data file cps4 contains 4838 observations on variables such as earnings per hour (wage), gender (female=1
if female), and so on. After loading and inspecting the data, construct a 95% confidence interval for the
mean earnings per hour. Interpret your confidence interval, and discuss what you expect to find if you had
randomly sampled with replacement from the variable wage, and for each sample, computed its respective
95% confidence interval.
If you estimated 100 such intervals, how many would you expect not to contain the true but unobserved population
mean? To test your intuition, sample 50 times from a Normal distribution with parameters µ = 3 and σ = 0.5,
estimate the respective confidence intervals, and plot them. How many of the intervals actually contain the
true mean µ = 3? If you repeat this calculation, will you get the same number of intervals containing the
true mean? Why or why not? Explain.
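Recall that for a population mean ($\mu$) with unknown standard deviation (estimated by the sample standard deviation $s$), the two-sided confidence interval
is given by $CI = \bar{x} \pm t_{\alpha/2}\, s/\sqrt{n}$, and for the known standard deviation case, $CI = \bar{x} \pm z_{\alpha/2}\, \sigma/\sqrt{n}$.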
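The repeated-sampling exercise can be sketched as follows. This is one possible implementation, not the code behind the handout's figure: the per-sample size `n = 30` and the seed are assumptions (the problem gives only µ = 3, σ = 0.5, and 50 samples), and the t-based interval is used for each sample.

```r
set.seed(103)                       # assumed seed
mu <- 3; sigma <- 0.5               # parameters stated in the problem
n_samples <- 50                     # number of repeated samples, as in the problem
n <- 30                             # observations per sample (assumption; not given)

# One t-based 95% interval per sample: xbar +/- t_{alpha/2} * s / sqrt(n)
ci <- t(replicate(n_samples, {
  x <- rnorm(n, mean = mu, sd = sigma)
  mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
}))
covers <- ci[, 1] <= mu & mu <= ci[, 2]
sum(covers)                         # number of intervals containing mu

# Plot the intervals; those that miss mu are drawn in red
plot(NA, xlim = c(1, n_samples), ylim = range(ci),
     xlab = "50 Random Samples", ylab = "Confidence Intervals")
segments(seq_len(n_samples), ci[, 1], seq_len(n_samples), ci[, 2],
         col = ifelse(covers, "black", "red3"))
abline(h = mu, lty = 2)
```

Removing or changing the seed gives a different count each run, because each interval's endpoints are random; on average about 95% of such intervals cover µ.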
B. Difference of Two Means
Using the same data file, cps4, construct a 95% confidence interval for the difference of mean earnings per
hour between males and females. Discuss the economic and the statistical significance of your result. Next,
compare the difference of mean earnings per hour between individuals with 12 or fewer years of education
and those with more than 12 years of education. Again, discuss the economic and the statistical significance
of your result.
Recall that for the known variances case, the confidence interval for the difference of means is given by:
$\bar{x} - \bar{y} \pm z_{\alpha/2} \sqrt{\sigma_X^2/n + \sigma_Y^2/m}$
and for the unknown variances case, we would use: $\bar{x} - \bar{y} \pm t_{\alpha/2,\, n+m-2}\, S_p \sqrt{1/n + 1/m}$, where
$S_p = \sqrt{\dfrac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}}$
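As a check on the pooled formula, the sketch below computes the interval by hand and compares it with `t.test(..., var.equal = TRUE)`. The two groups are simulated (an assumption); only base R is used.

```r
set.seed(103)                        # assumed seed
x <- rnorm(40, mean = 25, sd = 10)   # simulated group 1 (assumption)
y <- rnorm(60, mean = 22, sd = 10)   # simulated group 2 (assumption)
n <- length(x); m <- length(y)

# Pooled standard deviation S_p and the equal-variance 95% interval
s_p <- sqrt(((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2))
ci_pooled <- (mean(x) - mean(y)) +
  c(-1, 1) * qt(0.975, df = n + m - 2) * s_p * sqrt(1 / n + 1 / m)

# t.test() reproduces this interval only when var.equal = TRUE;
# its default (var.equal = FALSE) is the Welch interval.
ci_builtin <- as.numeric(t.test(x, y, var.equal = TRUE)$conf.int)
rbind(ci_pooled, ci_builtin)
```

Note that calling `t.test()` with its defaults gives the Welch version, which does not pool the variances and reports fractional degrees of freedom rather than n + m − 2.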
# Read in the data and check what variables are included
library(POE5Rdata)
data("cps5")
names(cps5)
## [1] "age" "asian" "black" "divorced" "educ" "exper"
## [7] "faminc" "female" "hrswork" "insure" "married" "mcaid"
## [13] "mcare" "metro" "midwest" "nchild" "northeast" "single"
## [19] "south" "union" "wage" "west" "white"
attach(cps5)
library(Rmisc)
# Construct a 95% confidence interval for the mean of 'wage'
CI(wage, ci = 0.95)
## upper mean lower
## 23.77836 23.46008 23.14180
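With a sample this large, the t-based and z-based intervals are nearly identical. A minimal sketch, assuming simulated wage-like data (the lognormal parameters and seed are arbitrary choices, not estimates from the textbook data):

```r
set.seed(103)                                  # assumed seed
w <- rlnorm(5000, meanlog = 3, sdlog = 0.5)    # simulated wage-like data (assumption)
n <- length(w)
se <- sd(w) / sqrt(n)                          # standard error of the mean

ci_t <- mean(w) + c(-1, 1) * qt(0.975, df = n - 1) * se   # t-based interval
ci_z <- mean(w) + c(-1, 1) * qnorm(0.975) * se            # z-based interval

# t.test() reports the t-based interval
ci_builtin <- as.numeric(t.test(w)$conf.int)
rbind(ci_t, ci_z, ci_builtin)
```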
# Are the observed differences in mean wages between men
# and women statistically significant at the 5%
# significance level? Wages based on gender
male_wages = wage[female == 0]
female_wages = wage[female == 1]
par(mfrow = c(1, 2))
hist(female_wages, xlim = c(0, 110), main = "Wages: Female")
abline(v = mean(female_wages), col = "red3", lwd = 2)
hist(male_wages, xlim = c(0, 110), main = "Wages: Male")
abline(v = mean(male_wages), col = "red3", lwd = 2)
[Figure: histograms "Wages: Female" and "Wages: Male" (x-axes: female_wages, male_wages; y-axis: Frequency)]
t.test(female_wages, male_wages)
##
## Welch Two Sample t-test
##
## data: female_wages and male_wages
## t = -8.6963, df = 8815.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.512138 -2.220050
## sample estimates:
## mean of x mean of y
## 21.87363 24.73972
# Wages based on years of education
high_wages = wage[educ > 12]
low_wages = wage[educ <= 12]
par(mfrow = c(1, 2))
hist(low_wages, xlim = c(0, 110), main = "Wages: Years of Educ <= 12")
abline(v = mean(low_wages), col = "red3", lwd = 2)
hist(high_wages, xlim = c(0, 110), main = "Wages: Years of Educ > 12")
abline(v = mean(high_wages), col = "red3", lwd = 2)
[Figure: histograms "Wages: Years of Educ <= 12" and "Wages: Years of Educ > 12" (x-axes: low_wages, high_wages; y-axis: Frequency)]
t.test(low_wages, high_wages)
##
## Welch Two Sample t-test
##
## data: low_wages and high_wages
## t = -35.398, df = 9697.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -10.066181 -9.009814
## sample estimates:
## mean of x mean of y
## 16.92394 26.46194
[Figure: confidence intervals for 50 random samples (x-axis: 50 Random Samples; y-axis: Confidence Intervals)]
sex [Link]
F 54.70
M 65.36
Problem 3: Hypothesis Testing
A. Mean
Recall that the hypothesis test for one mean is given by:
$H_0: \mu = \mu_0$
$H_a: \mu \neq \mu_0$
In R we would use the 't.test' function:
[Figure: "Histogram of x" (y-axis: Frequency)]
##
## One Sample t-test
##
## data: x
## t = -0.5513, df = 999, p-value = 0.5815
## alternative hypothesis: true mean is not equal to 10
## 95 percent confidence interval:
## 9.961399 10.021669
## sample estimates:
## mean of x
## 9.991534
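The code behind the output above is not shown. A minimal sketch that produces output of the same form, assuming `x` is 1000 draws from a Normal distribution (1000 matches df = 999; the mean, standard deviation, and seed are assumptions):

```r
set.seed(103)                            # assumed seed
x <- rnorm(1000, mean = 10, sd = 0.5)    # simulated sample (assumed parameters)
hist(x)
res <- t.test(x, mu = 10)                # H0: mu = 10 against the two-sided alternative
res

# The test statistic by hand: t = (xbar - mu0) / (s / sqrt(n))
t_manual <- (mean(x) - 10) / (sd(x) / sqrt(length(x)))
```

We reject H0 at the 5% level when the p-value is below 0.05, or equivalently when the 95% confidence interval excludes the hypothesized value 10.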
B. Difference of Two Means
For this case, the test statistic depends on whether or not we know the population variances. Therefore, we
have to consider two cases: (1) known variances, or (2) unknown variances.
$H_0: \mu_1 - \mu_2 = 0$
$H_a: \mu_1 - \mu_2 \neq 0$
##
## Two Sample t-test
##
## data: x1 and x2
## t = -110.52, df = 98, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -10.163020 -9.804493
## sample estimates:
## mean of x mean of y
## 10.02350 20.00726
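The output above has the title "Two Sample t-test" and df = 98, which is what R prints for the pooled-variance test with 100 observations in total. A sketch under the assumption of two simulated groups of 50 with means 10 and 20 (the standard deviation and seed are assumptions):

```r
set.seed(103)                                # assumed seed
x1 <- rnorm(50, mean = 10, sd = 0.5)         # simulated group 1 (assumed parameters)
x2 <- rnorm(50, mean = 20, sd = 0.5)         # simulated group 2 (assumed parameters)

res2 <- t.test(x1, x2, var.equal = TRUE)     # pooled variances: df = n + m - 2
res2

res_welch <- t.test(x1, x2)                  # default Welch test: fractional df
```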
Problem 4: Ordinary Least Squares
For our example we will regress hourly wages in US dollars (wage) on years of education (educ),
as given by:
$y = \beta_1 + \beta_2 x + e \;\rightarrow\; \text{wage} = \beta_1 + \beta_2\, \text{educ} + e$
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
##
## Call:
## lm(formula = wage ~ educ, data = cps5_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.785 -8.381 -3.166 5.708 193.152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.4000 1.9624 -5.3 1.38e-07 ***
## educ 2.3968 0.1354 17.7 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.55 on 1198 degrees of freedom
## Multiple R-squared: 0.2073, Adjusted R-squared: 0.2067
## F-statistic: 313.3 on 1 and 1198 DF, p-value: < 2.2e-16
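The fitting code is not shown; the `Call:` line indicates `lm(wage ~ educ, data = cps5_small)`. The sketch below uses simulated data instead (an assumption, so that it runs without the textbook package) and checks `lm()` against the closed-form simple linear regression estimates.

```r
set.seed(103)                                       # assumed seed
educ <- sample(8:20, 200, replace = TRUE)           # simulated regressor (assumption)
wage <- -10 + 2.4 * educ + rnorm(200, sd = 13)      # simulated response (assumption)
sim  <- data.frame(wage, educ)

mod <- lm(wage ~ educ, data = sim)
names(mod)          # components stored in the fitted object
summary(mod)        # coefficient table, residual standard error, R-squared

# Closed-form SLR estimates: b2 = cov(x, y) / var(x), b1 = ybar - b2 * xbar
b2 <- cov(sim$educ, sim$wage) / var(sim$educ)
b1 <- mean(sim$wage) - b2 * mean(sim$educ)
c(b1 = b1, b2 = b2)
```

In the output above, the slope estimate of 2.3968 means that one additional year of education is associated with an increase of about $2.40 in the hourly wage.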
[Figure: scatter plot of wage against educ (legend: Perfect Fit, Least Squares Fit)]
[Figure: residuals plotted against observation index (x-axis: Index; y-axis: Residuals)]
## [1] 9.884686e-16
## [1] 183.5382
## [1] 231.5435
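The three values above appear to be the mean of the residuals, the variance of the residuals, and the variance of wage; this reading is an inference, supported by 1 − 183.5382/231.5435 ≈ 0.2073, which matches the reported R-squared. A sketch of the same checks on simulated data (an assumption):

```r
set.seed(103)                           # assumed seed
x <- runif(300, 0, 20)                  # simulated regressor (assumption)
y <- 1 + 2 * x + rnorm(300, sd = 5)     # simulated response (assumption)
fit <- lm(y ~ x)
e <- resid(fit)

mean(e)                                 # zero up to floating-point error
var(e)                                  # variance of the residuals
var(y)                                  # total variance of the response
r2_from_var <- 1 - var(e) / var(y)      # equals R-squared when the model has an intercept
```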
[Figure: "Histogram of the Residuals" (x-axis: Residuals; y-axis: Density)]