HW4+Solution
1. In this exercise, we will generate simulated data, and will then use this data to
perform best subset selection.
(a) Use the rnorm() function to generate a predictor X of length n = 100, as well as a
noise vector ϵ of length n = 100. Generate a response vector Y of length n = 100
according to the model
Y = β0 + β1 X + β2 X^2 + β3 X^3 + ϵ,
where β0, β1, β2, and β3 are constants of your choice.
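A minimal sketch of one way to generate such data (the seed and the values β0 = 1, β1 = 2, β2 = −3, β3 = 0.5 are arbitrary choices, not part of the original solution):
> set.seed(1)
> n <- 100
> x <- rnorm(n)                              # predictor X
> eps <- rnorm(n)                            # noise vector
> b0 <- 1; b1 <- 2; b2 <- -3; b3 <- 0.5      # constants of our choice
> y <- b0 + b1 * x + b2 * x^2 + b3 * x^3 + eps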
(b) Use the regsubsets() function to perform best subset selection in order to choose
the best model containing the predictors X, X^2, ..., X^10. What is the best model
obtained according to Cp, BIC, and adjusted R2? Show some plots to provide
evidence for your answer, and report the coefficients of the best model obtained.
Answer: We use regsubsets() to select the best model among polynomials of X up to
degree 10. Cp, BIC, and adjusted R2 all select the same best model, which has size 3.
The coefficients of the best model obtained are shown below. All statistics pick X^5
over X^3; the remaining coefficients are close to the true βs.
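A sketch of the regsubsets() workflow behind these results (object names such as data.full are illustrative):
> library(leaps)
> data.full <- data.frame(y = y, x = x)
> mod.full <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = data.full, nvmax = 10)
> mod.summary <- summary(mod.full)
> which.min(mod.summary$cp)      # subset size minimizing Cp
> which.min(mod.summary$bic)     # subset size minimizing BIC
> which.max(mod.summary$adjr2)   # subset size maximizing adjusted R2
> plot(mod.summary$cp, xlab = "Subset size", ylab = "Cp", type = "l")
> coef(mod.full, id = 3)         # coefficients of the selected 3-variable model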
(c) Repeat (b), using forward stepwise selection and also using backwards stepwise
selection. How does your answer compare to the results in (b)?
Answer: We see that all statistics also pick the 3-variable model. Here are the coefficients:
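Forward and backward stepwise selection reuse the same call with a method argument (a sketch, reusing data.full from above):
> mod.fwd <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = data.full,
+                       nvmax = 10, method = "forward")
> mod.bwd <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = data.full,
+                       nvmax = 10, method = "backward")
> coef(mod.fwd, id = 3)
> coef(mod.bwd, id = 3)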
(d) Now fit a lasso model to the simulated data, again using X, X^2, ..., X^10 as
predictors. Use cross-validation to select the optimal value of λ.
Create plots of the cross-validation error as a function of λ. Report the resulting
coefficient estimates, and discuss the results obtained.
Answer: For best subset selection on this data, the model sizes chosen by the three criteria are:
> which.min(mod.summary$cp)
[1] 2
> which.min(mod.summary$bic)
[1] 1
> which.max(mod.summary$adjr2)
[1] 4
The coefficients for the models of size 2, 1, and 4 are, respectively:
We see that BIC picks the most accurate 1-variable model with matching
coefficients. Other criteria pick additional variables.
We then fit a lasso model to the simulated data, again using X, X^2, ..., X^10 as
predictors. The optimal value of λ selected by 10-fold cross-validation is 8.835. The
resulting coefficient estimates are shown below. We see that the lasso also picks
the best 1-variable model, but the intercept is off (1.574 vs. 1).
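A sketch of the lasso fit with cross-validation (glmnet requires a numeric predictor matrix; names are illustrative):
> library(glmnet)
> xmat <- model.matrix(y ~ poly(x, 10, raw = TRUE), data = data.full)[, -1]
> mod.lasso <- cv.glmnet(xmat, y, alpha = 1)   # 10-fold CV by default
> plot(mod.lasso)                              # CV error as a function of log(lambda)
> best.lambda <- mod.lasso$lambda.min
> predict(mod.lasso, s = best.lambda, type = "coefficients")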
2. In this exercise, we will predict the number of applications received using the other
variables in the College data set.
(a) Split the data set into a training set and a test set.
Answer: We split the data into training and test sets at a ratio of 7:3.
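One way to make the split (a sketch, assuming the College data from the ISLR package; the seed is arbitrary):
> library(ISLR)
> set.seed(1)
> train <- sample(nrow(College), round(0.7 * nrow(College)))
> College.train <- College[train, ]
> College.test <- College[-train, ]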
(b) Fit a linear model using least squares on the training set, and report the test error
obtained.
Answer: The number of applications is the Apps variable. The test MSE obtained is
1,261,630.
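A sketch of the least-squares fit and the test-MSE computation, using the split above:
> lm.fit <- lm(Apps ~ ., data = College.train)
> lm.pred <- predict(lm.fit, College.test)
> mean((College.test$Apps - lm.pred)^2)   # test MSE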
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation.
Report the test error obtained.
Answer: We use the default sequence of 100 λ values in cv.glmnet(), with the optimal λ
selected by 10-fold cross-validation. The test MSE is 1,121,034, lower than that of OLS.
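A sketch of the ridge fit (alpha = 0 in glmnet); model.matrix builds the numeric predictor matrices:
> train.mat <- model.matrix(Apps ~ ., data = College.train)[, -1]
> test.mat <- model.matrix(Apps ~ ., data = College.test)[, -1]
> ridge.cv <- cv.glmnet(train.mat, College.train$Apps, alpha = 0)   # 10-fold CV
> ridge.pred <- predict(ridge.cv, s = ridge.cv$lambda.min, newx = test.mat)
> mean((College.test$Apps - ridge.pred)^2)   # test MSE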
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report
the test error obtained, along with the number of non-zero coefficient estimates.
Answer: We use the default sequence of 100 λ values in cv.glmnet(), with the optimal λ
selected by 10-fold cross-validation. The test MSE is 1,255,241, slightly lower than
that of OLS. The coefficients are shown below:
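The lasso fit is the same call with alpha = 1 (a sketch, reusing the matrices above):
> lasso.cv <- cv.glmnet(train.mat, College.train$Apps, alpha = 1)
> lasso.pred <- predict(lasso.cv, s = lasso.cv$lambda.min, newx = test.mat)
> mean((College.test$Apps - lasso.pred)^2)                           # test MSE
> predict(lasso.cv, s = lasso.cv$lambda.min, type = "coefficients")  # count the non-zero entries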
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report
the test error obtained, along with the value of M selected by cross-validation.
Answer: The test MSE for PCR is about 1,283,651, slightly higher than that of OLS.
The CV MSEP for different numbers of components is shown below:
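A sketch of the PCR fit using the pls package; M is the component count minimizing the CV MSEP:
> library(pls)
> pcr.fit <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
> validationplot(pcr.fit, val.type = "MSEP")          # CV MSEP vs number of components
> M <- which.min(MSEP(pcr.fit)$val["CV", 1, ]) - 1    # subtract 1 for the 0-component entry
> pcr.pred <- predict(pcr.fit, College.test, ncomp = M)
> mean((College.test$Apps - pcr.pred)^2)              # test MSE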
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the
test error obtained, along with the value of M selected by cross-validation.
Answer: The test MSE for PLS is about 1,129,846, which is lower than that of OLS and
PCR and close to that of ridge regression. The CV MSEP for different numbers of
components is shown below:
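The PLS fit is analogous, via plsr() (a sketch):
> pls.fit <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
> validationplot(pls.fit, val.type = "MSEP")          # CV MSEP vs number of components
> M <- which.min(MSEP(pls.fit)$val["CV", 1, ]) - 1
> pls.pred <- predict(pls.fit, College.test, ncomp = M)
> mean((College.test$Apps - pls.pred)^2)              # test MSE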
(g) Comment on the results obtained. How accurately can we predict the number of
college applications received? Is there much difference among the test errors
resulting from these five approaches?
Answer: In terms of test MSE, OLS and the lasso are comparable, ridge regression and
PLS perform best, and PCR performs worst. In terms of test R2, the differences among
these five approaches are not very large, though ridge regression again performs best.
The test MSE and R2 for all models are shown below.
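Test R2 can be computed against the mean-only benchmark; a sketch for the OLS predictions (the same formula applies to each model):
> test.avg <- mean(College.test$Apps)
> ss.tot <- mean((College.test$Apps - test.avg)^2)
> 1 - mean((College.test$Apps - lm.pred)^2) / ss.tot   # test R2 for OLS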
3. This exercise applies principal components analysis (PCA) to the yield-curve data in yieldDat.
(b) It is generally recommended that PCA be applied to time series that are
stationary. Plot the first column of yieldDat. (You can look at other columns as
well.) Does the plot appear stationary? Why or why not? Include your plot with
your work.
Answer: The time series plots for the first and fifth maturities are below. The plots
appear non-stationary, with clearly decreasing and increasing trends over some
time periods.
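A sketch of the plots (assuming yieldDat is already loaded):
> par(mfrow = c(1, 2))
> plot(yieldDat[, 1], type = "l", ylab = "Yield", main = "First maturity")
> plot(yieldDat[, 5], type = "l", ylab = "Yield", main = "Fifth maturity")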
(c) Another way to check for stationarity is to run the augmented Dickey–Fuller
(ADF) test:
library("tseries")
adf.test(yieldDat[ , 1])
Based on the augmented Dickey–Fuller test, do you think the first column of
yieldDat is stationary? Why or why not?
Answer: The ADF test indicates that the first column of yieldDat is non-stationary.
The null hypothesis of the ADF test is that the series is non-stationary; the p-value for
the test is 0.9, so we cannot reject the null hypothesis. Thus the first column of
yieldDat appears non-stationary.
(d) Compute the changes in the yield curves, defined as diff_yield. Plot the first
column of diff_yield and run the augmented Dickey–Fuller test to check for
stationarity. Do you think the first column of diff_yield is stationary? Why or why
not?
Answer: The time series plot shows that the first column of diff_yield looks stationary:
we cannot find a significant increasing or decreasing trend. The ADF test also rejects
the null hypothesis of unit-root non-stationarity, with a p-value smaller than 0.01.
Thus we conclude that diff_yield is stationary.
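A sketch of computing and testing the differenced series:
> diff_yield <- diff(as.matrix(yieldDat))    # changes in the yield curves
> plot(diff_yield[, 1], type = "l", ylab = "Change in yield")
> adf.test(diff_yield[, 1])                  # p-value < 0.01: reject the unit root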
(f) Perform a PCA using the function princomp() on the covariance matrix, which
is the default for princomp(). The output from names() includes the following: [1] "sdev"
"loadings" "center" "scores". Print and describe each of these components.
Answer: We perform a PCA using the function princomp() on the covariance matrix
using the following code:
> pca_del = princomp(diff_yield)
> names(pca_del)
> summary(pca_del)
> par(mfrow = c(1, 1))
> plot(pca_del, col = 4)
(i) sdev is a vector containing the square roots of the eigenvalues of the covariance
matrix. This fact can be verified by using eigen() to find the eigenvalues; see the R
output below. Apparently, princomp() uses n (the sample size) rather than n − 1 as
the divisor in the sample covariance matrix. To compensate, I used
cov(diff_yield)*((n-1)/n) in the second line of the R code. With that change, the square
roots of the eigenvalues are equal to the sdev from princomp().
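A sketch of this check:
> n <- nrow(diff_yield)
> eig <- eigen(cov(diff_yield) * ((n - 1) / n))   # rescale: princomp() divides by n
> sqrt(eig$values)                                # matches pca_del$sdev
> pca_del$sdev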
(ii) loadings contains the eigenvectors of the covariance matrix. Below is a check that
the first eigenvector is equal to the first column of loadings; the other eigenvectors can
be checked in the same way. Eigenvectors are determined only up to sign changes,
so you might find that some eigenvectors from eigen() and the corresponding ones
from princomp() have different signs.
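The check for the first eigenvector (a sketch; signs may be flipped):
> eig$vectors[, 1]
> pca_del$loadings[, 1]   # equal up to a possible sign change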
(iii) center contains the sample means of the columns of diff_yield, which princomp()
subtracts before computing the projections.
(iv) scores contains the projections of the yield changes, minus their means, onto the
eigenvectors. See the R output below, which only checks the projections onto the first
eigenvector.
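A sketch of the projection check for the first eigenvector:
> centered <- scale(diff_yield, center = TRUE, scale = FALSE)   # subtract column means
> proj1 <- centered %*% eig$vectors[, 1]
> head(cbind(proj1, pca_del$scores[, 1]))   # equal up to a possible sign change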
(g) Suppose you wish to “explain” at least 95% of the variation in the changes in the
yield curves. Then how many principal components should you use?
Answer: Using the summary() function, we obtain the proportion and cumulative
proportion of variance explained by each component. Only the first 2 principal
components should be used, since the first two principal components explain 95.59%
of the variance, as seen in the cumulative proportions.