HW2 Solution
1. Run the following R code to find a symmetrizing transformation for 1998 earnings
data from the Current Population Survey. We focus only on the female earnings.
library("Ecdat")
data(CPSch3)
dimnames(CPSch3)[[2]]
earnings = CPSch3[CPSch3[ ,3] == "female", 2]
a) Transform the earnings using the square-root and log transformations. Plot the
QQ-normal plots, boxplots, and kernel density estimates for the untransformed data
and the two transformed datasets. Which of the three datasets provides the
most symmetric distribution? Try other powers besides the square root. Which
power do you think is best for symmetrization?
Answer: Of the two transformations, the square-root transformation creates the most
symmetric distribution of the three datasets. Its normal plot is closest to linear. The
untransformed data have a concave pattern indicating right-skewness. The
log-transformed data have a convex pattern indicating left-skewness, though the left
tail is short, as can be seen by the quick downturn at the extreme left that deviates
from the convex pattern elsewhere.
In the boxplots, the untransformed data have most of their extreme values on the
top, the log-transformed data have most of their extreme values on the bottom, and
the square-root transformed data have extreme values on both the top and bottom,
though the left tail is compressed relative to the right tail. Thus, we see the same
types of skewness seen in the normal plots.
The density estimates agree with our earlier conclusions that the untransformed data
are right-skewed, the log-transformed data are left-skewed, and the square-root
transformed data are approximately symmetric.
[Figure: QQ-normal plots, boxplots, and kernel density estimates of the untransformed, square-root-transformed, and log-transformed earnings]
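A sketch of code that could produce these plots (the panel layout and object names are illustrative, not the original script):
> sqrt.earnings = sqrt(earnings)
> log.earnings = log(earnings)
> par(mfrow = c(3, 3))
> qqnorm(earnings, main = "untransformed"); qqline(earnings)
> qqnorm(sqrt.earnings, main = "square root"); qqline(sqrt.earnings)
> qqnorm(log.earnings, main = "log"); qqline(log.earnings)
> boxplot(earnings, main = "untransformed")
> boxplot(sqrt.earnings, main = "square root")
> boxplot(log.earnings, main = "log")
> plot(density(earnings), main = "untransformed")
> plot(density(sqrt.earnings), main = "square root")
> plot(density(log.earnings), main = "log")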
The following figure contains normal plots with the power ranging from 0.1 to 0.8.
One can see that the left tail remains short regardless of which power is used. For
the rest of the data, the plot changes from convex to concave as the power
increases, that is, the transformed data (excluding the left tail) change from left-
skewed to right-skewed as the power increases. When the power is between 0.4 and
0.5, the data (excluding the left tail) are close to normal.
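A sketch of how such a grid of normal plots could be generated (the layout and the set of powers are assumptions based on the description above):
> par(mfrow = c(2, 4))
> for (p in seq(0.1, 0.8, by = 0.1)) {
+   qqnorm(earnings^p, main = paste("power =", p))
+   qqline(earnings^p)
+ }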
Answer: The plot of the log-likelihood against λ on the default grid seq(-2, 2, 1/10), together with a zoom-in on the high-likelihood region, is shown below:
[Figure: likelihood plot for λ over the default grid, with a zoomed-in view of the high-likelihood region]
ind indicates which value of λ maximizes the likelihood and ind2 indicates
which values of λ are in the 95% confidence interval for λ.
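A sketch of how ind and ind2 could be computed, assuming the profile likelihood comes from boxcox() in the MASS library and the 95% interval uses the usual chi-squared cutoff (object names are illustrative):
> library(MASS)
> bc = boxcox(earnings ~ 1, lambda = seq(-2, 2, 1/10))   # log-likelihood over the default grid
> ind = (bc$y == max(bc$y))                              # value of lambda maximizing the likelihood
> ind2 = (bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2)   # lambdas inside the 95% confidence interval
> bc$x[ind]
> bc$x[ind2]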
The fitted skewed-t density is not very close to the kernel density estimate,
especially around the mode, and the differences between the two are more likely
due to lack of fit than purely to random variation. Therefore, the skewed-t model
does not seem to be an appropriate choice for parametric modeling of
female.earnings; other skewed distributions should be considered.
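One way such a comparison could be produced, assuming the skewed-t fit comes from sstdFit() in the fGarch package (the object names are illustrative and this is not the original analysis code):
> library(fGarch)
> fit.st = sstdFit(earnings)                       # maximum-likelihood fit of a skewed t distribution
> est = fit.st$estimate                            # estimated parameters: mean, sd, nu, xi
> x = seq(min(earnings), max(earnings), length.out = 200)
> plot(density(earnings), main = "kernel estimate vs. fitted skewed t")
> lines(x, dsstd(x, mean = est[1], sd = est[2], nu = est[3], xi = est[4]), lty = 2)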
2. We will now consider the Boston housing data set from the MASS library.
(a) Based on this data set, provide an estimate for the population mean of medv. Call
this estimate μ̂.
Answer: We use the sample mean as an estimate for the population mean of medv,
and μ̂ = 22.53.
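A minimal sketch of the computation, assuming the MASS library is loaded and Boston is attached (the later sd(medv) call suggests this):
> library(MASS)
> attach(Boston)
> mean(medv)   # approximately 22.53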
(b) What is your estimate of the standard error of μ̂?
Hint: We can compute the standard error of the sample mean by dividing the
sample standard deviation by the square root of the number of observations.
Answer: The estimate of the standard error is 0.409.
> medv.err = sd(medv)/sqrt(length(medv))
> medv.err
[1] 0.4089
(c) Now estimate the standard error of μ̂ using the bootstrap. How does this
compare to your answer from (b)? Based on your bootstrap estimate, provide a
95% confidence interval for the mean of medv. Compare it to the results
obtained using t.test(Boston$medv).
Hint: You can approximate a 95% confidence interval using the formula
[μ̂ − 2·SE(μ̂), μ̂ + 2·SE(μ̂)].
Answer: We run the bootstrap with 1000 repetitions.
The estimated standard error of μ̂ using the bootstrap is 0.4108, which agrees with
the answer from (b) to two significant digits (0.4108 vs. 0.4089). Using the bootstrap
estimate, the approximate 95% confidence interval is [21.71, 23.35]. The interval
obtained from t.test(Boston$medv) is [21.73, 23.34]; the bootstrap endpoints differ
from the t.test endpoints by at most about 0.02.
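A sketch of the bootstrap computation, assuming the boot package and the ±2·SE formula from the hint (the seed and function name are illustrative, so the exact numbers will vary slightly):
> library(boot)
> boot.mean = function(data, index) mean(data[index])
> set.seed(1)                                            # seed value is an assumption
> boot.out = boot(medv, boot.mean, R = 1000)
> se.boot = sd(boot.out$t[, 1])                          # bootstrap standard error of the mean
> c(mean(medv) - 2 * se.boot, mean(medv) + 2 * se.boot)  # approximate 95% confidence interval
> t.test(medv)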
(d) Based on this data set, provide an estimate, μ̂_med, for the median value of medv
in the population.
Answer: Similarly, we use the sample median as the estimate, which is 21.2.
> medv.med = median(medv)
> medv.med
[1] 21.2
(f) Based on this data set, provide an estimate for the tenth percentile of medv in
Boston suburbs. Call this quantity μ̂_0.1. (You can use the quantile() function.) Use
the bootstrap to estimate the standard error of μ̂_0.1. Comment on your findings.
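A sketch of how the estimate and its bootstrap standard error could be obtained, again assuming the boot package (the function name is illustrative):
> boot.q10 = function(data, index) quantile(data[index], 0.1)
> quantile(medv, 0.1)                 # point estimate of the tenth percentile
> boot(medv, boot.q10, R = 1000)      # output reports the bootstrap standard error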
3. Use logistic regression to predict the probability of default using income and
balance on the Default data set (in ISLR library). Do not forget to set a random seed
before beginning your analysis.
a) Fit a logistic regression model that uses income and balance to predict default.
Hint: you can use the glm() function and set the argument family = binomial to run
logistic regression. See ?glm in R for more details.
Answer: We fit a logistic regression model using all observations and obtain the
output below. It shows that both income and balance are significant predictors of
default.
[Output: summary() of the fitted logistic regression model]
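A sketch of the fit on the full data set, assuming the ISLR library is loaded (the seed value is an assumption):
> library(ISLR)
> set.seed(1234)   # set a random seed before beginning the analysis, as the question asks
> glm.fit = glm(default ~ income + balance, data = Default, family = binomial)
> summary(glm.fit)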
b) Using the validation set approach, estimate the test error of this model. In order
to do this, you must perform the following steps:
i. Split the sample set into a training set and a validation set.
ii. Fit a multiple logistic regression model using only the training observations.
iii. Obtain a prediction of default status for each individual in the validation set by
computing the posterior probability of default for that individual, and
classifying the individual to the default category if the posterior probability is
greater than 0.5.
iv. Compute the validation set error, which is the fraction of the observations in
the validation set that are misclassified.
Answer: There are 10000 observations in total. We randomly split the data, using
80% for training and 20% for testing.
> set.seed(1234)
> dim(Default)
[1] 10000 4
> train = sample(dim(Default)[1], dim(Default)[1]*0.8)
> Default.train = Default[train, ]
> Default.test = Default[-train, ]
Then we build the model on the training sample; if the predicted probability is larger
than 0.5, we set the predicted label to default "Yes":
> # train model
> lm.fit = glm(default ~ income + balance, data = Default.train, family = binomial)
> probability=predict(lm.fit, Default.test, type = "response")
> lm.pred = ifelse(probability > 0.5, "Yes", "No")
Finally, we compute the misclassification rate on the validation sample; it is 2.3%.
> accuracy = function(actual, predicted){
+ mean(actual == predicted)
+}
> table(Default.test$default, lm.pred)
lm.pred
No Yes
No 1928 7
Yes 39 26
> 1-accuracy(predicted = lm.pred, actual =Default.test$default)
[1] 0.023
c) Using the summary() and glm() functions used in (a), determine the estimated
standard errors for the coefficients associated with income and balance in a
multiple logistic regression model that uses both predictors.
Answer: Based on the summary() output in (a), the estimated standard error for the
coefficient associated with income is 4.99e-6, and for balance it is 2.27e-4.
d) Write a function, boot.fn(), that takes as input the Default data set as well as an
index of the observations, and that outputs the coefficient estimates for income
and balance in the multiple logistic regression model. Use the boot() function
together with your boot.fn() function to estimate the standard errors of the logistic
regression coefficients for income and balance.
Answer: The function boot.fn() is:
> boot.fn = function(data, index){
+ return(coef(glm(default ~ income + balance,
+ data = data, family = binomial, subset = index)))
+}
We run 1000 bootstrap replications; the estimated standard error (SE) of the
coefficient for income is 5.021e-06, and the SE for balance is 2.277e-04.
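A sketch of the corresponding boot() call, assuming the boot library (the seed value is an assumption, so the reported SEs may differ slightly):
> library(boot)
> set.seed(1234)
> boot(Default, boot.fn, R = 1000)   # output reports bootstrap SEs for the intercept, income, and balance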
e) Comment on the estimated standard errors obtained using the glm() function and
using your bootstrap function.
Answer: The estimated standard errors obtained using the glm() function and using
the bootstrap function are quite similar, agreeing to two or three significant digits.