HW2+Solution

The document outlines a homework assignment for a Data Science course focusing on quantitative finance, specifically analyzing earnings data and housing data using R. It covers various statistical transformations, maximum likelihood estimation, logistic regression, and bootstrap methods for estimating parameters and standard errors. The findings include the best transformation for symmetrizing earnings data, estimates for population means and medians, and the performance of logistic regression models in predicting default probabilities.

Uploaded by

Jake Huang

DSA5205 Data Science in Quantitative Finance AY2021/22SEM1

Homework 2 and Answers

1. Run the following R code to find a symmetrizing transformation for 1998 earnings
data from the Current Population Survey. We focus only on female.earnings.
library("Ecdat")
data(CPSch3)
dimnames(CPSch3)[[2]]
earnings = CPSch3[CPSch3[ ,3] == "female", 2]

a) Transform the earnings using the square-root and log transformations. Plot the QQ-
normal plots, boxplots, and kernel density estimates for the untransformed data
and the two transformed datasets. Which of the three datasets provides the
most symmetric distribution? Try other powers besides the square root. Which
power do you think is best for symmetrization?
Answer: Of the two transformations, the square-root transformation creates the more
symmetric distribution. Its normal plot is closest to linear. The untransformed data
have a concave pattern indicating right-skewness. The log-transformed data have a
convex pattern indicating left-skewness, though the left tail is short, as can be seen
from the quick downturn at the extreme left that deviates from the convex pattern
elsewhere.

In the boxplots, the untransformed data have most of their extreme values on the
top, the log-transformed data have most of their extreme values on the bottom, and
the square-root transformed data have extreme values on both the top and bottom,
though the left tail is compressed relative to the right tail. Thus, we see the same
types of skewness seen in the normal plots.

The density estimates agree with our earlier conclusions that the untransformed data
are right-skewed, the log-transformed data are left-skewed, and the square-root
transformed data are approximately symmetric.
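The plots described above can be produced with code along these lines (a sketch; the list names and panel layout are illustrative choices, not part of the original assignment code):

```r
library("Ecdat")
data(CPSch3)
earnings = CPSch3[CPSch3[, 3] == "female", 2]

# Raw data plus the two candidate transformations
datasets = list(raw = earnings, sqrt = sqrt(earnings), log = log(earnings))

# One row of plots per dataset: QQ-normal, boxplot, kernel density
par(mfrow = c(3, 3))
for (nm in names(datasets)) {
  x = datasets[[nm]]
  qqnorm(x, main = paste("QQ-normal:", nm))
  qqline(x)
  boxplot(x, main = paste("Boxplot:", nm))
  plot(density(x), main = paste("KDE:", nm))
}
```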

DSA5205 A/P Chen Ying
[Figure: QQ-normal plots, boxplots, and kernel density estimates for the untransformed, log-transformed, and square-root-transformed earnings]
The following figure contains normal plots with the power ranging from 0.1 to 0.8.
One can see that the left tail remains short regardless of which power is used. For
the rest of the data, the plot changes from convex to concave as the power
increases, that is, the transformed data (excluding the left tail) change from left-
skewed to right-skewed as the power increases. When the power is between 0.4 and
0.5, the data (excluding the left tail) are close to normal.
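The grid of normal plots over powers can be generated with a loop such as the following (a sketch; the 0.1 to 0.8 sequence matches the range described for the figure):

```r
library("Ecdat")
data(CPSch3)
earnings = CPSch3[CPSch3[, 3] == "female", 2]

# QQ-normal plot of earnings^p for each power p
powers = seq(0.1, 0.8, by = 0.1)
par(mfrow = c(2, 4))
for (p in powers) {
  qqnorm(earnings^p, main = paste("power =", p))
  qqline(earnings^p)
}
```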

b) Next, you will estimate the Box–Cox transformation parameter by maximum
likelihood. The model is that the data are N(µ, σ²)-distributed after being
transformed by some λ. The unknown parameters are λ, µ, and σ. The following
R code plots the profile likelihood for λ on the default grid and then zooms in on the
grid seq(0.14, 0.28, 1/100). The command boxcox takes an R formula as input. In this
application, the model has only an intercept, which is indicated by "1."
Question: Display the boxcox plots under the two grids. What are ind and ind2 and
what purposes do they serve? What is the MLE of λ?
> library("MASS")
> par(mfrow=c(1,2))
> boxcox(earnings~1)
> boxcox(earnings~1,lambda = seq(0.14, 0.28, 1/100))
> bc = boxcox(earnings~1,lambda = seq(0.14, 0.28, by=1/100),interp=T,plotit = F)
> ind = (bc$y==max(bc$y))
> ind2 = (bc$y > max(bc$y) - qchisq(.95,df=1)/2)
> bc$x[ind]
> bc$x[ind2]

Answer: The plots for λ on the default grid seq(-2, 2, 1/10) and zoomed in on the high-
likelihood region are shown below:

[Figure: profile log-likelihood for λ on the default grid (left) and zoomed in on seq(0.14, 0.28, 1/100) (right)]
ind indicates which value of λ maximizes the likelihood and ind2 indicates
which values of λ are in the 95% confidence interval for λ.

The MLE of λ is 0.2079 as can be seen from the following output:


> bc$x[ind]
[1] 0.2079
> bc$x[ind2]
[1] 0.1598 0.1612 0.1626 0.1640 0.1655 0.1669 0.1683 0.1697 0.1711 0.1725
0.1739 0.1754 0.1768 0.1782 0.1796 0.1810 0.1824 0.1838 0.1853 0.1867
0.1881 0.1895 0.1909 0.1923
……
The location of the MLE can also be seen in the second plot in the figure above, which
was produced by boxcox(earnings~1, lambda = seq(0.14, 0.28, 1/100)), where the
values of lambda were limited to a restricted range to provide detail near the MLE.

c) Fit a skewed t-distribution to female.earnings. What are the estimates of the
degrees-of-freedom parameter and of ξ?
Answer: The MLE of the degrees of freedom parameter is 8.527 and the MLE of ξ is
1.682.
> library("fGarch")
> fit = sstdFit(earnings,hessian=T)
> fit
$minimum
[1] 16437
$estimate
   mean      sd      nu      xi
 15.035   6.292   8.527   1.682

d) Produce a plot of a kernel density estimate of the pdf of female.earnings. Overlay
a plot of the skewed t-density with MLEs of the parameters. Include your plot with
your work. Compare the parametric and nonparametric estimates of the pdf. Do
they seem similar? Based on the plots, do you believe that the skewed t-model
provides an adequate fit to female.earnings?
Answer: The plots for the skewed-t fit are produced by this code:
> fit = sstdFit(earnings)
> para = fit$estimate
> xgrid = seq(0, max(earnings)+5, length.out=100)
> par(mfrow=c(1,1))
> plot(density(earnings), main="Female Earnings")
> lines(xgrid, dsstd(xgrid, mean=para[1], sd=para[2], nu=para[3], xi=para[4]),
    type="l", lty=5)
> legend("topright", c("KDE","Skewed-t"), lty=c(1,5))

The fitted skewed-t density is not very close to the kernel density estimate,
especially around the mode, and the differences between the two are likely due to
lack of fit rather than purely to random variation. Therefore, the skewed-t model
does not seem to be a proper choice for parametric modeling of female.earnings.
Other skewed distributions should be considered.

2. We will now consider the Boston housing data set, from the MASS library.
(a) Based on this data set, provide an estimate for the population mean of medv. Call
this estimate μ̂.
Answer: We use the sample mean as an estimate for the population mean of medv,
and μ̂ = 22.53.

(b) Provide an estimate of the standard error of μ̂. Interpret this result.
Hint: We can compute the standard error of the sample mean by dividing the
sample standard deviation by the square root of the number of observations.
Answer: The estimate of the standard error is 0.409:
> medv.err = sd(medv)/sqrt(length(medv))
> medv.err
[1] 0.4089

(c) Now estimate the standard error of μ̂ using the bootstrap. How does this
compare to your answer from (b)? Based on your bootstrap estimate, provide a
95% confidence interval for the mean of medv. Compare it to the results
obtained using t.test(Boston$medv).
Hint: You can approximate a 95% confidence interval using the formula
[μ̂ − 2SE(μ̂), μ̂ + 2SE(μ̂)].
Answer: We run the bootstrap with 1000 repetitions. The estimated standard error of
μ̂ using the bootstrap is 0.4108, which agrees with the answer from (b) to two
significant digits (0.4108 vs. 0.4089). The bootstrap 95% confidence interval is
[21.71, 23.35], while t.test(Boston$medv) gives [21.73, 23.34]; the bootstrap
endpoints are at most 0.02 away from the t.test endpoints.
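One way to carry out this bootstrap with the boot package (a sketch; the function name mean.fn and the seed are illustrative, not from the original solution):

```r
library(MASS)   # Boston data
library(boot)
set.seed(1)

# Statistic function: sample mean of the resampled observations
mean.fn = function(data, index) mean(data[index])
boot.mean = boot(Boston$medv, mean.fn, R = 1000)
boot.mean                       # bootstrap SE appears in the printout

# Approximate 95% CI: estimate +/- 2 * bootstrap SE
mu.hat = mean(Boston$medv)
se.hat = sd(boot.mean$t)
c(mu.hat - 2 * se.hat, mu.hat + 2 * se.hat)

t.test(Boston$medv)             # compare with the t-based interval
```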

(d) Based on this data set, provide an estimate, μ̂_med, for the median value of medv
in the population.
Answer: Similarly, we use the sample median as the estimate, which is 21.2:
> medv.med = median(medv)
> medv.med
[1] 21.2

(e) We now would like to estimate the standard error of μ̂_med. Unfortunately, there is
no simple formula for computing the standard error of the median. Instead,
estimate the standard error of the median using the bootstrap. Comment on your
findings.
Answer: Based on the bootstrap, the SE is 0.381, which is small relative to the median
value of 21.2.
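The median bootstrap follows the same pattern (a sketch; the function name median.fn and the seed are illustrative):

```r
library(MASS)
library(boot)
set.seed(1)

# Statistic function: sample median of the resampled observations
median.fn = function(data, index) median(data[index])
boot.med = boot(Boston$medv, median.fn, R = 1000)
boot.med            # bootstrap SE of the sample median
```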

(f) Based on this data set, provide an estimate for the tenth percentile of medv in
Boston suburbs. Call this quantity μ̂_0.1 (you can use the quantile() function). Use
the bootstrap to estimate the standard error of μ̂_0.1. Comment on your findings.
Answer: The tenth percentile is 12.75 with an SE of 0.4932, a small standard error
relative to the tenth-percentile value.
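A bootstrap for the tenth percentile can be sketched the same way (the function name q10.fn and the seed are illustrative):

```r
library(MASS)
library(boot)
set.seed(1)

# Statistic function: tenth percentile of the resampled observations
q10.fn = function(data, index) quantile(data[index], 0.1)
boot.q10 = boot(Boston$medv, q10.fn, R = 1000)
boot.q10            # bootstrap SE of the tenth percentile
```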

3. Use logistic regression to predict the probability of default using income and
balance on the Default data set (in ISLR library). Do not forget to set a random seed
before beginning your analysis.
a) Fit a logistic regression model that uses income and balance to predict default.
Hint: you can use glm() function and set the argument “family = binomial” to run
logistic regression. ?glm for more details in R.
Answer: We fit a logistic regression model using all observations and obtain the
output below. It shows that both income and balance are significant predictors of
default.

[R output: summary() of the logistic regression of default on income and balance]
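The fit and summary can be obtained as follows (a sketch; the seed value is illustrative):

```r
library(ISLR)       # Default data
set.seed(1234)

# Logistic regression of default on income and balance, using all observations
glm.fit = glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)    # coefficient table with estimates and standard errors
```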
b) Using the validation set approach, estimate the test error of this model. In order
to do this, you must perform the following steps:
i. Split the sample set into a training set and a validation set.

ii. Fit a multiple logistic regression model using only the training observations.
iii. Obtain a prediction of default status for each individual in the validation set by
computing the posterior probability of default for that individual, and
classifying the individual to the default category if the posterior probability is
greater than 0.5.
iv. Compute the validation set error, which is the fraction of the observations in
the validation set that are misclassified.
Answer: There are 10000 observations in total. We randomly split the data, using
80% for training and 20% for testing.
> set.seed(1234)
> dim(Default)
[1] 10000 4
> train = sample(dim(Default)[1], dim(Default)[1]*0.8)
> Default.train = Default[train, ]
> Default.test = Default[-train, ]
Then we build the model using the training sample; if the predicted probability is larger
than 0.5, we set the predicted label to default "Yes":
> # train model
> lm.fit = glm(default ~ income + balance, data = Default.train, family =
binomial)
> probability=predict(lm.fit, Default.test, type = "response")
> lm.pred = ifelse(probability > 0.5, "Yes", "No")

Finally, we compute the misclassification rate on the validation sample, which is 2.3%:
> accuracy = function(actual, predicted){
+ mean(actual == predicted)
+}
> table(Default.test$default, lm.pred)
       lm.pred
          No  Yes
  No    1928    7
  Yes     39   26
> 1-accuracy(predicted = lm.pred, actual =Default.test$default)
[1] 0.023

c) Using the summary() and glm() functions used in (a), determine the estimated
standard errors for the coefficients associated with income and balance in a
multiple logistic regression model that uses both predictors.
Answer: Based on the summary() output in (a), the estimated standard error for the
coefficient associated with income is 4.99e-6, and for balance it is 2.27e-4.

d) Write a function, boot.fn(), that takes as input the Default data set as well as an
index of the observations, and that outputs the coefficient estimates for income
and balance in the multiple logistic regression model. Use the boot() function
together with your boot.fn() function to estimate the standard errors of the logistic
regression coefficients for income and balance.
Answer: The function boot.fn() is:
> boot.fn = function(data, index){
+   return(coef(glm(default ~ income + balance,
+     data = data, family = binomial, subset = index)))
+ }
We run 1000 bootstrap repetitions; the estimated standard error (SE) of the
coefficient for income is 5.021e-06, and the SE for balance is 2.277e-04.
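The bootstrap itself can be run with a call along these lines (a sketch; R = 1000 matches the repetition count in the text, and the seed is illustrative):

```r
library(ISLR)
library(boot)
set.seed(1234)

# Statistic function: coefficient estimates on the resampled observations
boot.fn = function(data, index) {
  coef(glm(default ~ income + balance,
           data = data, family = binomial, subset = index))
}
boot(Default, boot.fn, R = 1000)   # "std. error" column gives the bootstrap SEs
```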

e) Comment on the estimated standard errors obtained using the glm() function and
using your bootstrap function.
Answer: The estimated standard errors obtained using the glm() function and the
bootstrap function are quite similar, agreeing to two or three significant digits.
