
DSA5205 Data Science in Quantitative Finance AY2021/22SEM1

Homework 4 and Answers

1. In this exercise, we will generate simulated data, and will then use this data to
perform best subset selection.
(a) Use the rnorm() function to generate a predictor X of length n = 100, as well as a
noise vector ε of length n = 100. Generate a response vector Y of length n = 100
according to the model

Y = β0 + β1·X + β2·X² + β3·X³ + ε,

where β0, β1, β2, and β3 are constants of your choice.

Answer: We select β0 = 1, β1 = -2.5, β2 = 2 and β3 = 0.5.
> set.seed(1)
> X = rnorm(100)
> eps = rnorm(100)
> beta0 = 1
> beta1 = -2.5
> beta2 = 2
> beta3 = 0.5
> Y = beta0 + beta1 * X + beta2 * X^2 + beta3 * X^3 + eps

(b) Use the regsubsets() function to perform best subset selection in order to choose
the best model containing the predictors X, X², ..., X¹⁰. What is the best model
obtained according to Cp, BIC, and adjusted R²? Show some plots to provide
evidence for your answer, and report the coefficients of the best model obtained.
Answer: We use regsubsets() to select the best model over polynomials of X up to
degree 10. Cp, BIC and adjusted R² all select the same best model, which has
size 3.
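The selection step can be sketched as follows (a sketch, assuming the leaps package; the β values are those chosen in part (a)):

```r
# Best subset selection over polynomials of X up to degree 10
library(leaps)

set.seed(1)
X   = rnorm(100)
eps = rnorm(100)
Y   = 1 - 2.5 * X + 2 * X^2 + 0.5 * X^3 + eps

data.full   = data.frame(y = Y, x = X)
mod.full    = regsubsets(y ~ poly(x, 10, raw = TRUE), data = data.full, nvmax = 10)
mod.summary = summary(mod.full)

# Model sizes favoured by each criterion
size.cp    = which.min(mod.summary$cp)
size.bic   = which.min(mod.summary$bic)
size.adjr2 = which.max(mod.summary$adjr2)

# Coefficients of the model chosen by BIC
coef(mod.full, size.bic)
```

Plotting `mod.summary$cp`, `mod.summary$bic` and `mod.summary$adjr2` against model size gives the evidence plots the question asks for.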

The coefficients of the best model obtained are as below:

DSA5205 A/P Chen Ying



All statistics pick X⁵ over X³. The remaining coefficients are close to the true βs.

(c) Repeat (b), using forward stepwise selection and also using backward stepwise
selection. How does your answer compare to the results in (b)?
Answer: All statistics again pick 3 variables. Here are the coefficients:
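The stepwise fits can be sketched by reusing the same design matrix and switching the method argument of regsubsets() (a sketch, assuming the leaps package):

```r
# Forward and backward stepwise selection over the degree-10 polynomial basis
library(leaps)

set.seed(1)
X   = rnorm(100)
eps = rnorm(100)
Y   = 1 - 2.5 * X + 2 * X^2 + 0.5 * X^3 + eps
data.full = data.frame(y = Y, x = X)

mod.fwd = regsubsets(y ~ poly(x, 10, raw = TRUE), data = data.full,
                     nvmax = 10, method = "forward")
mod.bwd = regsubsets(y ~ poly(x, 10, raw = TRUE), data = data.full,
                     nvmax = 10, method = "backward")

size.fwd = which.min(summary(mod.fwd)$bic)   # size chosen by BIC, forward
size.bwd = which.min(summary(mod.bwd)$bic)   # size chosen by BIC, backward
coef(mod.fwd, size.fwd)
coef(mod.bwd, size.bwd)
```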

(d) Now fit a lasso model to the simulated data, again using X, X², ..., X¹⁰ as
predictors. Use cross-validation to select the optimal value of λ. Create plots of
the cross-validation error as a function of λ. Report the resulting coefficient
estimates, and discuss the results obtained.

Answer: The optimal value of λ selected by 10-fold cross-validation is 0.06295, and
the resulting coefficient estimates are shown below. Lasso picks X⁴ over X³;
besides, it picks X⁵ with a small coefficient and X⁷ with a negligible coefficient.
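The lasso fit can be sketched with cv.glmnet(), whose default is 10-fold cross-validation (a sketch, assuming the glmnet package):

```r
# Lasso over the degree-10 polynomial basis, lambda chosen by 10-fold CV
library(glmnet)

set.seed(1)
X   = rnorm(100)
eps = rnorm(100)
Y   = 1 - 2.5 * X + 2 * X^2 + 0.5 * X^3 + eps

xmat   = poly(X, 10, raw = TRUE)          # 100 x 10 design matrix
cv.out = cv.glmnet(xmat, Y, alpha = 1)    # alpha = 1 -> lasso
plot(cv.out)                              # CV error as a function of log(lambda)

best.lambda = cv.out$lambda.min
predict(cv.out, s = best.lambda, type = "coefficients")
```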


(e) Now generate a response vector Y according to the model Y = β0 + β7·X⁷ + ε,
and perform best subset selection and the lasso. Discuss the results obtained.

Answer: We create a new Y with β7 = 5, keeping β0 = 1.
> beta7 = 5
> Y = beta0 + beta7 * X^7 + eps

> # Best subset selection using regsubsets
> data.full = data.frame(y = Y, x = X)
> mod.full = regsubsets(y ~ poly(x, 10, raw = T), data = data.full, nvmax = 10)
> mod.summary = summary(mod.full)
> # Find the model size for best Cp, BIC and adjusted R2


> which.min(mod.summary$cp)
[1] 2
> which.min(mod.summary$bic)
[1] 1
> which.max(mod.summary$adjr2)
[1] 4
The coefficients of the models of size 2, 1, and 4 are, respectively:

We see that BIC picks the most accurate 1-variable model with matching
coefficients. The other criteria pick additional variables.

We then fit a lasso model to the simulated data, again using X, X², ..., X¹⁰ as
predictors. The optimal value of λ selected by 10-fold cross-validation is 8.835, and
the resulting coefficient estimates are shown below. We see that Lasso also picks
the best 1-variable model, but the intercept is off (1.574 vs 1).

2. In this exercise, we will predict the number of applications received using the other
variables in the College data set.
(a) Split the data set into a training set and a test set.
Answer: We split the data into training and test sets at a 7:3 ratio.
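A minimal sketch of the split, assuming the College data from the ISLR package (the seed is an arbitrary choice for reproducibility):

```r
# 7:3 train/test split of the College data
library(ISLR)

set.seed(1)
n     = nrow(College)
train = sample(n, round(0.7 * n))   # indices of the 70% training rows
College.train = College[train, ]
College.test  = College[-train, ]
```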


(b) Fit a linear model using least squares on the training set, and report the test error
obtained.
Answer: Number of applications is the Apps variable. The test error obtained is
1,261,630.
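The OLS fit and test-error computation can be sketched as follows (a sketch, assuming the ISLR package and the 7:3 split from part (a)):

```r
# Least-squares fit on the training set and test MSE
library(ISLR)

set.seed(1)
train = sample(nrow(College), round(0.7 * nrow(College)))

lm.fit  = lm(Apps ~ ., data = College[train, ])
lm.pred = predict(lm.fit, College[-train, ])
mean((College$Apps[-train] - lm.pred)^2)   # test error
```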

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation.
Report the test error obtained.
Answer: We use the default sequence of 100 λ values in cv.glmnet(). The optimal λ is
selected by 10-fold cross-validation. The test error is 1,121,034, lower than that of OLS.
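A sketch of the ridge fit, assuming the ISLR and glmnet packages and the same 7:3 split:

```r
# Ridge regression on the College data, lambda chosen by 10-fold CV
library(ISLR)
library(glmnet)

set.seed(1)
train = sample(nrow(College), round(0.7 * nrow(College)))
x = model.matrix(Apps ~ ., College)[, -1]   # predictor matrix, intercept dropped
y = College$Apps

cv.ridge   = cv.glmnet(x[train, ], y[train], alpha = 0)   # alpha = 0 -> ridge
ridge.pred = predict(cv.ridge, s = cv.ridge$lambda.min, newx = x[-train, ])
mean((y[-train] - ridge.pred)^2)   # test error
```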

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report
the test error obtained, along with the number of non-zero coefficient estimates.
Answer: We use the default sequence of 100 λ values in cv.glmnet(). The optimal λ is
selected by 10-fold cross-validation. The test error, 1,255,241, is slightly lower than
that of OLS. The coefficients are:
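The lasso fit and the count of non-zero coefficients can be sketched as follows (a sketch, assuming the ISLR and glmnet packages and the same 7:3 split):

```r
# Lasso on the College data, lambda chosen by 10-fold CV
library(ISLR)
library(glmnet)

set.seed(1)
train = sample(nrow(College), round(0.7 * nrow(College)))
x = model.matrix(Apps ~ ., College)[, -1]
y = College$Apps

cv.lasso   = cv.glmnet(x[train, ], y[train], alpha = 1)   # alpha = 1 -> lasso
lasso.pred = predict(cv.lasso, s = cv.lasso$lambda.min, newx = x[-train, ])
mean((y[-train] - lasso.pred)^2)                          # test error

lasso.coef = predict(cv.lasso, s = cv.lasso$lambda.min, type = "coefficients")
sum(lasso.coef != 0) - 1   # non-zero coefficients, excluding the intercept
```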

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report
the test error obtained, along with the value of M selected by cross-validation.
Answer: The test error for PCR is about 1,283,651, which is slightly higher than OLS.
The CV MSEP for different numbers of components is below:
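The PCR fit can be sketched with the pls package (a sketch; it assumes the minimum CV MSEP is not attained by the 0-component model):

```r
# PCR on the College data with M chosen by cross-validation
library(ISLR)
library(pls)

set.seed(1)
train   = sample(nrow(College), round(0.7 * nrow(College)))
pcr.fit = pcr(Apps ~ ., data = College[train, ], scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")   # CV MSEP vs number of components

cv.msep = MSEP(pcr.fit)$val["CV", 1, ]       # first entry is the 0-component model
M       = which.min(cv.msep) - 1             # best number of components

pcr.pred = predict(pcr.fit, College[-train, ], ncomp = M)
mean((College$Apps[-train] - pcr.pred)^2)    # test error
```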


(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the
test error obtained, along with the value of M selected by cross-validation.

Answer: The test error for PLS is about 1,129,846, which is lower than that of OLS
and PCR and close to ridge. The CV MSEP for different numbers of components is
below:
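The PLS fit follows the same pattern as PCR, with plsr() in place of pcr() (a sketch, assuming the ISLR and pls packages and the same 7:3 split):

```r
# PLS on the College data with M chosen by cross-validation
library(ISLR)
library(pls)

set.seed(1)
train   = sample(nrow(College), round(0.7 * nrow(College)))
pls.fit = plsr(Apps ~ ., data = College[train, ], scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")   # CV MSEP vs number of components

cv.msep = MSEP(pls.fit)$val["CV", 1, ]       # first entry is the 0-component model
M       = which.min(cv.msep) - 1             # best number of components

pls.pred = predict(pls.fit, College[-train, ], ncomp = M)
mean((College$Apps[-train] - pls.pred)^2)    # test error
```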

(g) Comment on the results obtained. How accurately can we predict the number of
college applications received? Is there much difference among the test errors
resulting from these five approaches?
Answer: In terms of test MSE, the results for OLS, lasso and PLS are comparable,
ridge regression performs best, and PCR performs worst. In terms of test R², the
differences among the five approaches are not very significant, though ridge again
performs best. The test MSE and R² for all models are:


3. We do a principal components analysis of daily yield data in the file yields.txt. R
has functions, which we will use later, that automate PCA, but it is easy to do PCA
"from scratch" and it is instructive to do so.
(a) First load the data and, to get a feel for what yield curves look like, plot the yield
curves on days 1, 101, 201, 301, ..., 1101.
Answer: The plot is shown below:

(b) It is generally recommended that PCA be applied to time series that are
stationary. Plot the first column of yieldDat. (You can look at other columns as
well.) Does the plot appear stationary? Why or why not? Include your plot with
your work.


Answer: The time series plots for the first and fifth maturities are below. The plots
appear non-stationary, with significant decreasing and increasing trends over some
time periods.

(c) Another way to check for stationarity is to run the augmented Dickey–Fuller
(ADF) test:
> library("tseries")
> adf.test(yieldDat[, 1])
Based on the augmented Dickey–Fuller test, do you think the first column of
yieldDat is stationary? Why or why not?

Answer: The ADF test suggests that the first column of yieldDat is non-stationary.
The null hypothesis of the ADF test is that the series is non-stationary; since the
p-value is 0.9, we cannot reject the null hypothesis. Thus the first column of
yieldDat appears non-stationary.
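The test can be sketched as follows (the read.table() call assumes yields.txt is whitespace-delimited with a header row; adjust it to the actual file format):

```r
# Augmented Dickey-Fuller test on the first yield series
library(tseries)

yieldDat = read.table("yields.txt", header = TRUE)
adf.test(yieldDat[, 1])   # a large p-value means we cannot reject non-stationarity
```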

(d) Compute the changes in the yield curves, defined as diff_yield. Plot the first
column of diff_yield and run the augmented Dickey–Fuller test to check for
stationarity. Do you think the first column of diff_yield is stationary? Why or why
not?
Answer: The time series plot shows that the first column of diff_yield is stationary;
there is no significant increasing or decreasing trend. The ADF test also rejects the
null hypothesis of unit-root non-stationarity with a p-value smaller than 0.01. Thus
we conclude that diff_yield is stationary.
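The differencing and the re-test can be sketched as follows (again assuming a whitespace-delimited yields.txt with a header row):

```r
# Day-over-day changes in each yield series, then an ADF test on the first column
library(tseries)

yieldDat   = read.table("yields.txt", header = TRUE)
diff_yield = apply(yieldDat, 2, diff)     # difference each column

plot(diff_yield[, 1], type = "l")         # no visible trend
adf.test(diff_yield[, 1])                 # a small p-value rejects the unit root
```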


(e) Use the R function eigen() to compute the eigenvalues and corresponding
eigenvectors of the covariance matrix of diff_yield, and plot the first 4
eigenvectors.
Answer: The eigenvalues and corresponding eigenvectors are:
> eig = eigen(cov(diff_yield))
> eig$values
[1] 3.13e-02 2.85e-03 1.03e-03 2.09e-04 1.22e-04 7.83e-05
[7] 6.55e-05 4.04e-05 1.83e-05 1.40e-05 3.90e-06
> eig$vectors


(f) Perform a PCA using the function princomp(), which by default operates on the
covariance matrix. The output from names() includes the following: [1] "sdev"
"loadings" "center" "scores". Print and describe each of these components.
Answer: We perform a PCA using the function princomp() with the following code:
> pca_del = princomp(diff_yield)


> names(pca_del)
> summary(pca_del)
> par(mfrow = c(1, 1))
> plot(pca_del, col = 4)

(i) sdev is a vector containing the square roots of the eigenvalues of the covariance
matrix. This fact can be verified by using eigen() to find the eigenvalues; see the R
output below. Apparently, princomp() uses n (the sample size) rather than n − 1 as
the divisor in the sample covariance matrix. To compensate, we use
cov(diff_yield) * ((n - 1) / n) in the second line of the R code. With this change, the
square roots of the eigenvalues are equal to the sdev from princomp().
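The divisor-n relationship can be checked on any data; a self-contained sketch on random data in place of diff_yield:

```r
# Verify that princomp()'s sdev equals the square roots of the eigenvalues of
# the covariance matrix computed with divisor n (not n - 1)
set.seed(1)
Z = matrix(rnorm(200), ncol = 4)   # stand-in for diff_yield
n = nrow(Z)

pca = princomp(Z)
ev  = eigen(cov(Z) * ((n - 1) / n))$values   # rescale to the divisor-n covariance

all.equal(as.numeric(pca$sdev), sqrt(ev))    # TRUE: the two agree
```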

(ii) loadings contains the eigenvectors of the covariance matrix. Below is a check
that the first eigenvector is equal to the first column of loadings; the other
eigenvectors can be checked in the same way. Eigenvectors are determined only up
to sign changes, so you might find that some eigenvectors from eigen() and the
corresponding ones from princomp() have different signs.


(iii) center is the mean vector. This is verified below.

(iv) scores contains the projections of the yields minus their means onto the
eigenvectors. See the R output, which only checks the projection onto the first
eigenvector.

(g) Suppose you wish to “explain” at least 95% of the variation in the changes in the
yield curves. Then how many principal components should you use?
Answer: Using the summary() function, the output is:


Therefore, only the first 2 principal components are needed, since they explain
95.59% of the variance, as seen in the cumulative proportions.
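The component count can be read off programmatically from the cumulative proportions; a self-contained sketch on random data in place of diff_yield (for the actual yield data this returns 2):

```r
# Smallest number of principal components explaining at least 95% of the variance
set.seed(1)
Z   = matrix(rnorm(600), ncol = 6)   # stand-in for diff_yield
pca = princomp(Z)

varprop = cumsum(pca$sdev^2) / sum(pca$sdev^2)   # cumulative proportion of variance
k = which(varprop >= 0.95)[1]                    # first index reaching 95%
k
```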

