HW4+Solution
1. In this exercise, we will generate simulated data, and will then use this data to
perform best subset selection.
(a) Use the rnorm() function to generate a predictor X of length n = 100, as well as a
noise vector ϵ of length n = 100. Generate a response vector Y of length n = 100
according to the model
Y = β0 + β1 X + β2 X^2 + β3 X^3 + ϵ,
where β0, β1, β2, and β3 are constants of your choice.
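A minimal sketch of one way to generate such data (the seed and the values β0 = 1, β1 = 2, β2 = −3, β3 = 0.5 are arbitrary choices, not part of the original solution):
> set.seed(1)
> n <- 100
> x <- rnorm(n)                              # predictor X
> eps <- rnorm(n)                            # noise vector
> b0 <- 1; b1 <- 2; b2 <- -3; b3 <- 0.5      # constants of our choice
> y <- b0 + b1 * x + b2 * x^2 + b3 * x^3 + eps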
(b) Use the regsubsets() function to perform best subset selection in order to choose
the best model containing the predictors X, X^2, ..., X^10. What is the best model
obtained according to Cp, BIC, and adjusted R2? Show some plots to provide
evidence for your answer, and report the coefficients of the best model obtained.
Answer: We use regsubsets() to select the best model among polynomials of X up to
degree 10. Cp, BIC, and adjusted R2 all select the same best model, which has size 3.
The coefficients of the best model obtained are shown below. All statistics pick X^5
over X^3; the remaining coefficients are close to the true βs.
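A sketch of the regsubsets() workflow behind these results (object names such as data.full are illustrative):
> library(leaps)
> data.full <- data.frame(y = y, x = x)
> mod.full <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = data.full, nvmax = 10)
> mod.summary <- summary(mod.full)
> which.min(mod.summary$cp)      # subset size minimizing Cp
> which.min(mod.summary$bic)     # subset size minimizing BIC
> which.max(mod.summary$adjr2)   # subset size maximizing adjusted R2
> plot(mod.summary$cp, xlab = "Subset size", ylab = "Cp", type = "l")
> coef(mod.full, id = 3)         # coefficients of the selected 3-variable model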
(c) Repeat (b), using forward stepwise selection and also using backwards stepwise
selection. How does your answer compare to the results in (b)?
Answer: We see that all statistics also pick the 3-variable model. Here are the coefficients:
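Forward and backward stepwise selection reuse the same call with a method argument (a sketch, reusing data.full from above):
> mod.fwd <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = data.full,
+                       nvmax = 10, method = "forward")
> mod.bwd <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = data.full,
+                       nvmax = 10, method = "backward")
> coef(mod.fwd, id = 3)
> coef(mod.bwd, id = 3)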
(d) Now fit a lasso model to the simulated data, again using X, X^2, ..., X^10 as
predictors. Use cross-validation to select the optimal value of λ.
Create plots of the cross-validation error as a function of λ. Report the resulting
coefficient estimates, and discuss the results obtained.
Answer: For best subset selection on this data, the model sizes chosen by the three criteria are:
> which.min(mod.summary$cp)
[1] 2
> which.min(mod.summary$bic)
[1] 1
> which.max(mod.summary$adjr2)
[1] 4
The coefficients for the models of size 2, 1, and 4 are, respectively:
We see that BIC picks the most accurate 1-variable model with matching
coefficients. Other criteria pick additional variables.
We then fit a lasso model to the simulated data, again using X, X^2, ..., X^10 as
predictors. The optimal value of λ selected by 10-fold cross-validation is 8.835. The
resulting coefficient estimates are shown below. We see that the lasso also picks
the best 1-variable model, but the intercept is off (1.574 vs. 1).
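A sketch of the lasso fit with cross-validation (glmnet requires a numeric predictor matrix; names are illustrative):
> library(glmnet)
> xmat <- model.matrix(y ~ poly(x, 10, raw = TRUE), data = data.full)[, -1]
> mod.lasso <- cv.glmnet(xmat, y, alpha = 1)   # 10-fold CV by default
> plot(mod.lasso)                              # CV error as a function of log(lambda)
> best.lambda <- mod.lasso$lambda.min
> predict(mod.lasso, s = best.lambda, type = "coefficients")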
2. In this exercise, we will predict the number of applications received using the other
variables in the College data set.
(a) Split the data set into a training set and a test set.
Answer: We split the data into training and test sets at a ratio of 7:3.
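One way to make the split (a sketch, assuming the College data from the ISLR package; the seed is arbitrary):
> library(ISLR)
> set.seed(1)
> train <- sample(nrow(College), round(0.7 * nrow(College)))
> College.train <- College[train, ]
> College.test <- College[-train, ]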
(b) Fit a linear model using least squares on the training set, and report the test error
obtained.
Answer: The number of applications is the Apps variable. The test MSE obtained is
1,261,630.
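A sketch of the least-squares fit and the test-MSE computation, using the split above:
> lm.fit <- lm(Apps ~ ., data = College.train)
> lm.pred <- predict(lm.fit, College.test)
> mean((College.test$Apps - lm.pred)^2)   # test MSE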
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation.
Report the test error obtained.
Answer: We use the default sequence of 100 λ values in cv.glmnet(), with the optimal λ
selected by 10-fold cross-validation. The test MSE is 1,121,034, lower than that of OLS.
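A sketch of the ridge fit (alpha = 0 in glmnet); model.matrix builds the numeric predictor matrices:
> train.mat <- model.matrix(Apps ~ ., data = College.train)[, -1]
> test.mat <- model.matrix(Apps ~ ., data = College.test)[, -1]
> ridge.cv <- cv.glmnet(train.mat, College.train$Apps, alpha = 0)   # 10-fold CV
> ridge.pred <- predict(ridge.cv, s = ridge.cv$lambda.min, newx = test.mat)
> mean((College.test$Apps - ridge.pred)^2)   # test MSE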
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report
the test error obtained, along with the number of non-zero coefficient estimates.
Answer: We use the default sequence of 100 λ values in cv.glmnet(), with the optimal λ
selected by 10-fold cross-validation. The test MSE is 1,255,241, slightly lower than
that of OLS. The coefficients are shown below:
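The lasso fit is the same call with alpha = 1 (a sketch, reusing the matrices above):
> lasso.cv <- cv.glmnet(train.mat, College.train$Apps, alpha = 1)
> lasso.pred <- predict(lasso.cv, s = lasso.cv$lambda.min, newx = test.mat)
> mean((College.test$Apps - lasso.pred)^2)                           # test MSE
> predict(lasso.cv, s = lasso.cv$lambda.min, type = "coefficients")  # count the non-zero entries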
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report
the test error obtained, along with the value of M selected by cross-validation.
Answer: The test MSE for PCR is about 1,283,651, slightly higher than that of OLS.
The CV MSEP for different numbers of components is shown below:
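A sketch of the PCR fit using the pls package; M is the component count minimizing the CV MSEP:
> library(pls)
> pcr.fit <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
> validationplot(pcr.fit, val.type = "MSEP")          # CV MSEP vs number of components
> M <- which.min(MSEP(pcr.fit)$val["CV", 1, ]) - 1    # subtract 1 for the 0-component entry
> pcr.pred <- predict(pcr.fit, College.test, ncomp = M)
> mean((College.test$Apps - pcr.pred)^2)              # test MSE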
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the
test error obtained, along with the value of M selected by cross-validation.
Answer: The test MSE for PLS is about 1,129,846, which is lower than that of OLS and
PCR and close to that of ridge regression. The CV MSEP for different numbers of
components is shown below:
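The PLS fit is analogous, via plsr() (a sketch):
> pls.fit <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
> validationplot(pls.fit, val.type = "MSEP")          # CV MSEP vs number of components
> M <- which.min(MSEP(pls.fit)$val["CV", 1, ]) - 1
> pls.pred <- predict(pls.fit, College.test, ncomp = M)
> mean((College.test$Apps - pls.pred)^2)              # test MSE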
(g) Comment on the results obtained. How accurately can we predict the number of
college applications received? Is there much difference among the test errors
resulting from these five approaches?
Answer: In terms of test MSE, OLS and the lasso are comparable, ridge regression and
PLS perform best, and PCR performs worst. In terms of test R2, the differences among
these five approaches are not very large, though ridge regression again performs best.
The test MSE and R2 for all models are shown below.
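Test R2 can be computed against the mean-only benchmark; a sketch for the OLS predictions (the same formula applies to each model):
> test.avg <- mean(College.test$Apps)
> ss.tot <- mean((College.test$Apps - test.avg)^2)
> 1 - mean((College.test$Apps - lm.pred)^2) / ss.tot   # test R2 for OLS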
3. This exercise applies principal components analysis (PCA) to the yield-curve data in yieldDat.
(b) It is generally recommended that PCA be applied to time series that are
stationary. Plot the first column of yieldDat. (You can look at other columns as
well.) Does the plot appear stationary? Why or why not? Include your plot with
your work.
Answer: The time series plots for the first and fifth maturities are below. The plots
appear non-stationary, with clearly decreasing and increasing trends over some
time periods.
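A sketch of the plots (assuming yieldDat is already loaded):
> par(mfrow = c(1, 2))
> plot(yieldDat[, 1], type = "l", ylab = "Yield", main = "First maturity")
> plot(yieldDat[, 5], type = "l", ylab = "Yield", main = "Fifth maturity")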
(c) Another way to check for stationarity is to run the augmented Dickey–Fuller
(ADF) test:
library("tseries")
adf.test(yieldDat[ , 1])
Based on the augmented Dickey–Fuller test, do you think the first column of
yieldDat is stationary? Why or why not?
Answer: The ADF test indicates that the first column of yieldDat is non-stationary.
The null hypothesis of the ADF test is that the series is non-stationary; the p-value for
the test is 0.9, so we cannot reject the null hypothesis. Thus the first column of
yieldDat appears non-stationary.
(d) Compute the changes in the yield curves, defined as diff_yield. Plot the first
column of diff_yield and run the augmented Dickey–Fuller test to check for
stationarity. Do you think the first column of diff_yield is stationary? Why or why
not?
Answer: The time series plot shows that the first column of diff_yield looks stationary:
we cannot find a significant increasing or decreasing trend. The ADF test also rejects
the null hypothesis of unit-root non-stationarity, with a p-value smaller than 0.01.
Thus we conclude that diff_yield is stationary.
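A sketch of computing and testing the differenced series:
> diff_yield <- diff(as.matrix(yieldDat))    # changes in the yield curves
> plot(diff_yield[, 1], type = "l", ylab = "Change in yield")
> adf.test(diff_yield[, 1])                  # p-value < 0.01: reject the unit root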
(f) Perform a PCA using the function princomp() on the covariance matrix, which
is the default for princomp(). The output from names() includes the following: [1] "sdev"
"loadings" "center" "scores". Print and describe each of these components.
Answer: We perform a PCA using the function princomp() on the covariance matrix
using the following code:
> pca_del = princomp(diff_yield)
> names(pca_del)
> summary(pca_del)
> par(mfrow = c(1, 1))
> plot(pca_del, col = 4)
(i) sdev is a vector containing the square roots of the eigenvalues of the covariance
matrix. This fact can be verified by using eigen() to find the eigenvalues; see the R
output below. Apparently, princomp() uses n (the sample size) rather than n − 1 as
the divisor in the sample covariance matrix. To compensate, I used
cov(diff_yield)*((n-1)/n) in the second line of the R code. With that change, the square
roots of the eigenvalues are equal to the sdev from princomp().
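A sketch of this check:
> n <- nrow(diff_yield)
> eig <- eigen(cov(diff_yield) * ((n - 1) / n))   # rescale: princomp() divides by n
> sqrt(eig$values)                                # matches pca_del$sdev
> pca_del$sdev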
(ii) loadings contains the eigenvectors of the covariance matrix. Below is a check that
the first eigenvector is equal to the first column of loadings; the other eigenvectors can
be checked in the same way. Eigenvectors are determined only up to sign changes,
so you might find that some eigenvectors from eigen() and the corresponding ones
from princomp() have different signs.
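The check for the first eigenvector (a sketch; signs may be flipped):
> eig$vectors[, 1]
> pca_del$loadings[, 1]   # equal up to a possible sign change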
(iii) center contains the sample means of the columns of diff_yield, which princomp()
subtracts before computing the projections.
(iv) scores contains the projections of the yield changes, minus their means, onto the
eigenvectors. See the R output below, which only checks the projections onto the first
eigenvector.
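A sketch of the projection check for the first eigenvector:
> centered <- scale(diff_yield, center = TRUE, scale = FALSE)   # subtract column means
> proj1 <- centered %*% eig$vectors[, 1]
> head(cbind(proj1, pca_del$scores[, 1]))   # equal up to a possible sign change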
(g) Suppose you wish to “explain” at least 95% of the variation in the changes in the
yield curves. Then how many principal components should you use?
Answer: Using the summary() function, we obtain the proportion and cumulative
proportion of variance explained by each component. Only the first 2 principal
components should be used, since the first two principal components explain 95.59%
of the variance, as seen in the cumulative proportions.