assignment
Do all computational problems in Python unless otherwise noted. Please post any questions
on the Piazza discussion board.
1. In several discussions over the last week, there have been questions about what the
eigendecomposition of a covariance matrix means. This is just a simple problem to better
understand this idea. In this problem, we use simulation to explore this relationship
for the multivariate normal distribution. This has close ties to our discussion of ridge
regression.
Let µ be a length p vector, and Σ be a p × p matrix.
You can generate a random multivariate normal vector X ∼ N (µ, Σ) using
numpy.random.multivariate_normal or scipy.stats.multivariate_normal.
Define matrices mu and Sigma to be

    µ = (1, 2)ᵀ,    Σ = [[1, ρ], [ρ, 1]].
(a) Draw a sample of size n = 100 from a multivariate normal with mean µ and
covariance Σ. Confirm that the output is a 100 × 2 matrix. I will refer to this
as X below. Each row represents a sample from your 2-dimensional multivariate
normal distribution.
Plot the first column of X against the second column of X. Set the x-axis limits
to [−2, 4] and the y-axis limits to [−1, 5]. Add a large red dot in the middle to
represent your mean µ.
Remake the same plot but using ρ = 0.9 in Σ. What changes?
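A minimal sketch of this sampling-and-plotting step. The first value of ρ is not specified above, so ρ = 0.5 here is just an assumed placeholder; substitute whichever value you are using.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this when working interactively
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
rho = 0.5  # assumed placeholder; substitute the rho value you are using
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, rho], [rho, 1.0]])

# Each row of X is one draw from the 2-dimensional multivariate normal
X = rng.multivariate_normal(mu, Sigma, size=100)
assert X.shape == (100, 2)

plt.scatter(X[:, 0], X[:, 1])
plt.plot(mu[0], mu[1], "ro", markersize=12)  # large red dot at the mean
plt.xlim([-2, 4])
plt.ylim([-1, 5])
plt.savefig("scatter.png")
```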
(b) Now we will look at the marginal distributions of your data (just do this for
one choice of ρ). Plot a histogram of the first column of your data (set
density=True when you call hist, so that the histogram is on the density scale;
this matters for the next step). Overlay a normal density curve on the same plot
with mean mu[0] and variance Sigma[0,0].
You should see that this marginal distribution looks quite Gaussian!
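A minimal sketch of the histogram-plus-density overlay (again with an assumed placeholder ρ; note the 0-based indices mu[0] and Sigma[0, 0] for the first column):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop when working interactively
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(0)
rho = 0.5  # assumed placeholder
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, rho], [rho, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=100)

# density=True puts the histogram on the density scale, so the
# normal curve overlay is directly comparable
plt.hist(X[:, 0], bins=15, density=True)

grid = np.linspace(-2, 4, 200)
plt.plot(grid, norm.pdf(grid, loc=mu[0], scale=np.sqrt(Sigma[0, 0])))
plt.savefig("marginal.png")
```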
(c) Let v = (1/√2, 1/√2)ᵀ, the diagonal vector of length 1.
(d) For both settings in part (a), compute the eigenvectors of Σ. Call these v1 and
v2 respectively.
Starting with the plots from (a), draw v1 and v2 as lines starting at the mean
(do this for both plots). Make each line as long as 2√λi, where λi is the corre-
sponding eigenvalue for vi. What do you notice?
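One way to sketch this step, using numpy.linalg.eigh for the symmetric matrix Σ (shown only for ρ = 0.9; repeat for the other setting):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop when working interactively
import matplotlib.pyplot as plt

rho = 0.9
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, rho], [rho, 1.0]])

# eigh is for symmetric matrices; column eigvecs[:, i] pairs with eigvals[i]
eigvals, eigvecs = np.linalg.eigh(Sigma)

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mu, Sigma, size=100)
plt.scatter(X[:, 0], X[:, 1])
for lam, v in zip(eigvals, eigvecs.T):
    tip = mu + 2 * np.sqrt(lam) * v  # line of length 2*sqrt(lambda_i)
    plt.plot([mu[0], tip[0]], [mu[1], tip[1]], "r-", linewidth=3)
plt.xlim([-2, 4])
plt.ylim([-1, 5])
plt.savefig("eigvecs.png")
```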
(e) Form a matrix Z that has columns Xv1 and Xv2 , the projections of your
samples onto each of the eigenvectors.
Plot the first column of Z against the second column of Z. What do you notice?
Compute the correlation of the first column of X with the second column of
X. Similarly, compute the correlation of the first column of Z with the second
column of Z. What do you notice?
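A sketch of the projection step (ρ = 0.9 assumed; the point is that the columns of Z come out nearly uncorrelated while the columns of X are strongly correlated):

```python
import numpy as np
from scipy.stats import pearsonr

rho = 0.9
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, rho], [rho, 1.0]])
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mu, Sigma, size=100)

eigvals, eigvecs = np.linalg.eigh(Sigma)

# Columns of Z are the projections X v1 and X v2
Z = X @ eigvecs

r_X = pearsonr(X[:, 0], X[:, 1])[0]  # close to rho
r_Z = pearsonr(Z[:, 0], Z[:, 1])[0]  # close to zero
print(r_X, r_Z)
```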
2. Bad validation. Here we demonstrate one way that validation of ML methods can go
wrong. This ties to our recent discussion of cross-validation, and to some practical
discussions we will have later in the course.
This problem will walk you through a common (but wrong) approach to model fitting
and validation. Don’t do this in real life.
The file bad_valid.py contains starter code for this problem. You will use the
function get_data to generate your data sets.
(a) Generate a data set (n = 100) using the get_data function. We will be using
this data set for both training and testing by using a validation split.
First, you decide to investigate how useful your variables are for understanding
y. Compute a vector of correlations between each column of X and the outcome
vector y.
You can compute the correlation between y and a single column of X using the
pearsonr function from scipy.stats. I recommend using np.apply_along_axis
to compute correlations with all the columns at once.
Plot a histogram of the correlations between y and each of the columns of X in
your training set.
You should notice a nice bell-shaped curve of correlations around zero. (You
might expect this in real life, since you know that you are just analyzing noise.)
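Since bad_valid.py is not reproduced here, this sketch substitutes pure-noise stand-in data for get_data (n = 100; the column count p = 500 is an assumption). The apply_along_axis pattern is the part that carries over:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop when working interactively
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n, p = 100, 500  # p = 500 is an assumption; use get_data in the real problem
X = rng.standard_normal((n, p))  # stand-in for get_data: pure noise
y = rng.standard_normal(n)

# Correlation of y with every column of X at once (axis=0 -> one column at a time)
corrs = np.apply_along_axis(lambda col: pearsonr(col, y)[0], 0, X)

plt.hist(corrs, bins=30)  # bell-shaped and centered at zero
plt.savefig("corr_hist.png")
```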
(b) Suppose that you decide that you have too many variables. Since you’ve noticed
that most of the variables are worthless, you decide to discard those worthless
variables before running your fancy ML methods.
Figure out which columns of X have correlation with y bigger than 0.2 in absolute
value. Make a new data frame that contains only those highly correlated columns; call
it X_good. (You will probably want to keep track of the indices of these columns,
since you'll need to make this matrix again later.)
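Continuing the same noise stand-in, the screening step might look like:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n, p = 100, 500  # assumed shapes; use get_data in the real problem
X = rng.standard_normal((n, p))  # stand-in for get_data: pure noise
y = rng.standard_normal(n)

corrs = np.apply_along_axis(lambda col: pearsonr(col, y)[0], 0, X)

# Keep the indices: you will need them to extract the SAME columns later
good_idx = np.where(np.abs(corrs) > 0.2)[0]
X_good = X[:, good_idx]
print(X_good.shape)
```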
(c) Split your X_good data set into the first 0.8*n points (training set) and the last
0.2*n points (test set). Think of this like back-testing: you're going to train on
the older data and predict on the more recent data.
Train a lasso model on the training set, using the LassoCV class from
sklearn.linear_model, with cross-validation to pick your regularization pa-
rameter.
What is the cross-validation estimate of your model's mean squared error? Use
your model to predict on the test portion of the X_good data set. What is your
test mean squared error?
At this point, everything should be looking pretty good!
Hints:
To get the cross-validated MSE, note that the mse_path_ attribute contains
all your cross-validated MSEs, for each regularization parameter value and
every fold separately. Averaging over the folds will give you
the CV-MSE at each regularization parameter value. The minimum of these
is your cross-validated MSE for the selected model.
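A sketch of this fit on the same stand-in data (the data shapes and cv=5 are assumptions; in the real problem use get_data and your X_good from part (b)):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 100, 500  # assumed shapes; use get_data in the real problem
X = rng.standard_normal((n, p))  # stand-in for get_data: pure noise
y = rng.standard_normal(n)

corrs = np.apply_along_axis(lambda col: pearsonr(col, y)[0], 0, X)
good_idx = np.where(np.abs(corrs) > 0.2)[0]
X_good = X[:, good_idx]

# First 0.8*n rows for training, last 0.2*n for testing
n_train = int(0.8 * n)
X_tr, y_tr = X_good[:n_train], y[:n_train]
X_te, y_te = X_good[n_train:], y[n_train:]

model = LassoCV(cv=5).fit(X_tr, y_tr)

# mse_path_ has one row per alpha and one column per fold;
# average over the folds, then minimize over the path
cv_mse = model.mse_path_.mean(axis=1).min()
test_mse = np.mean((model.predict(X_te) - y_te) ** 2)
print(cv_mse, test_mse)
```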
(d) Generate another data set of size 100 from get_data and assess your model
on this fresh data. You will have to first extract the same columns that you
extracted when you formed X_good, and then use your lasso model to predict
with those subset columns. What is your mean squared error on this fresh data?
Explain why this number is quite different from your cross-validated error and
your test error on the previously held-out data.
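And the punchline on the stand-in data: fresh noise, same selected columns, same fitted model (all shapes and the cv=5 choice are assumptions carried over from the earlier sketches):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 100, 500  # assumed shapes; use get_data in the real problem
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

corrs = np.apply_along_axis(lambda col: pearsonr(col, y)[0], 0, X)
good_idx = np.where(np.abs(corrs) > 0.2)[0]
X_good = X[:, good_idx]

n_train = int(0.8 * n)
model = LassoCV(cv=5).fit(X_good[:n_train], y[:n_train])

# Fresh draw, standing in for a second call to get_data
X_fresh_full = rng.standard_normal((n, p))
y_fresh = rng.standard_normal(n)

# Extract the SAME columns that were selected on the original data
X_fresh = X_fresh_full[:, good_idx]
fresh_mse = np.mean((model.predict(X_fresh) - y_fresh) ** 2)
print(fresh_mse)  # the screening step used y, so the earlier error estimates were optimistic
```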
3. In this problem, we will explore ideas related to a paper, Preis et al. (2013), which
attempts to model DJIA returns based on Google Trends data.
In trends_train.csv and trends_test.csv, you will find training and test data sets
of matching column format. For now we'll work with the training data set.
The data sets consist of the following columns:
• The date of the Sunday that ends the week of Google Trends data, which is
aggregated Monday through Sunday. The volume differences in the other columns
start one week before this date and end on this date.
• Google Trends data for 96 words, with words selected by the methodology in the
paper. The time series for each word reflects the change in the volume of search
queries that week, compared to the previous week. The original time series
(before taking differences) were normalized to have a maximum of 100 over
time; you can find these in trend_timeseries.csv just for fun.
• The weekly DJIA log returns for the week after the Google Trends data (Monday-
to-Monday adjusted closing price).
You should do this problem in R with glmnet; you will find it easier.
See the demo code from class for glmnet syntax. I suggest starting this problem
early in case you have software issues with rpy2.
Preis, T., Moat, H. S., & Stanley, H. E. (2013). Quantifying trading behavior in financial markets
using Google Trends. Scientific Reports, 3, srep01684.
(a) Load the training data and see what you have. Fit the lasso to predict the
weekly log return from the weekly changes in search volume. Use the default
range of penalization parameters. Plot the coefficients as a function of log(λ).
(b) Carry out cross-validation for the lasso (using the built-in procedure). Plot the
cross-validation curve with its standard errors.
Note: For the purpose of this problem, we’re going to carry out cross-validation
as if the samples were i.i.d., neglecting the temporal structure. In reality, this
would be something to worry about.
(c) Look at the models that you obtain through each selection rule (minimization
vs. 1 s.e.). How many nonzero coefficients does each model have (other than
the intercept)? What does this suggest to you about the amount of signal in
the problem?
Note: The output of cross-validation depends on the random splits that you
select. In this problem, you should get different values for λmin and λ1s.e., to
make it interesting. If by some chance you select the same model both ways, just
rerun cross-validation or change your seed. (NOTE: This is just to make
a nice homework problem; it's not good practice in general. We will discuss
further in class.)
(d) Let’s try the λmin model out. Make predictions on your training data by using
the predict function and passing your training X in as the newx parameter. The
s parameter is the λ value you want to fit at. Make a scatterplot of your fitted
values against the true values; is there any pattern?
(e) We could carry out a simple (and overly simplified) trading strategy based on
your model, as outlined in the paper. Each Sunday, we look at the prediction
from your model. If the log returns for the next week are predicted to be positive,
we buy on the Monday and sell on the following Monday. If the log returns are
predicted to be negative, we do the opposite. This lets us realize log returns for
each week of
(sign of our prediction) · (DJIA log return)
which is easy to work with. Compute your return for this strategy on the
training data, using your λmin model. Compare it to the return you would have
realized if you bought on the first day of the data set and sold on the last day.
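Although this problem is done in R, the strategy arithmetic itself is simple; here is a sketch in Python on hypothetical prediction and return vectors:

```python
import numpy as np

# Hypothetical weekly model predictions and realized DJIA log returns
preds = np.array([0.01, -0.02, 0.005, -0.001])
log_returns = np.array([0.012, -0.015, -0.004, 0.003])

# Each week we realize sign(prediction) * (DJIA log return)
strategy_weekly = np.sign(preds) * log_returns
strategy_total = strategy_weekly.sum()  # log returns add across weeks

# Buy on the first day and sell on the last: just the sum of weekly log returns
buy_and_hold_total = log_returns.sum()
print(strategy_total, buy_and_hold_total)
```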
(f) Now apply your λmin model, trained on the training set, to your test data.
Again, you just use the predict function, changing the newx parameter. Plot
your fitted values as before.
Carry out the same simple strategy on the test data. How well do you do?
Again, compare to the overall return for the DJIA in this period. In the end,
would you have been better off with the λ1s.e. model?
4. Understanding how splines are fit. This problem is intended to walk you through
some claims that I will make in class on Monday about smoothing splines. Today
we’ll stop a few sentences short of setting this problem up, so I will tell you the
missing pieces here:
• The solution to the optimization problem on slide 31 (repeated below) is always
a function in the class of natural cubic splines. We call the solutions to the
optimization problem smoothing splines.
• Natural cubic splines are piecewise cubic functions, meaning that they’re
made up of a bunch of sections of different cubic functions all concatenated
together. They are concatenated together in such a way that the final function
is continuous with continuous first and second derivatives. The cubic pieces of
the function change at x-values called knots.
• The solution to such a problem must be a natural cubic spline with knots at the
xi from the training set.
• Natural cubic splines can be represented in terms of a basis of known functions
of x:
    f(x) = β0 + Σ_{k=1}^{K} βk bk(x)
Here the bk(x) are the known basis functions. (Note: I'm using K instead
of the K + 3 you will see in the notes, just to keep things clean. Imagine
you redefined K; it's fine.)
This means that to fit your f , you only need to figure out the βk .
(a) Since we know each of the bk(x) functions, we could evaluate them at all of
our training coordinates xi. Show that you can rewrite the first part of your
optimization objective, Σ_{i=1}^{n} (yi − f(xi))², entirely in terms of the yi, bk(xi), and
βk. Write the term entirely in matrix/vector notation, as we do in regression.
You will need to define a matrix of your bk(xi) values.
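A numerical check of this identity, with hypothetical polynomial bk's standing in for the natural-spline basis (the real basis is in the notes; only the matrix bookkeeping matters here):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, size=20))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

# Hypothetical basis functions b_k(x), for illustration only
basis = [lambda t: t, lambda t: t ** 2, lambda t: t ** 3]

# B[i, k] = b_k(x_i), so sum_i (y_i - f(x_i))^2 = ||y - beta0*1 - B beta||^2
B = np.column_stack([b(x) for b in basis])

beta0 = 0.1
beta = np.array([1.0, -2.0, 0.5])  # arbitrary coefficients
resid = y - beta0 - B @ beta
sse_matrix = resid @ resid

# Direct term-by-term evaluation agrees with the matrix form
sse_direct = sum(
    (yi - (beta0 + sum(bk * b(xi) for bk, b in zip(beta, basis)))) ** 2
    for xi, yi in zip(x, y)
)
assert np.isclose(sse_matrix, sse_direct)
```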
(b) Similarly, rewrite your penalty term, ∫ (f′′(x))² dx, in terms of your functions
(and their derivatives). It may help to first think about what f ′′ (x) looks like in
terms of the bk (x) functions, and then move along to squaring and integrating
it.
(c) Combining these two pieces, you should now have an objective that is written
entirely in terms of matrices and vectors. The only unknowns will be your vector
of β coefficients. If everything has gone right, your objective will now remind
you very strongly of ridge regression.
Thinking back to your homework from last week, find a closed-form
solution for β (in terms of the matrices you defined above). You are allowed to
just guess the answer based on the similarity to your ridge regression homework.
You are not required to do any actual justification.
The result from this problem means that you can calculate your coefficient β as a
linear operator on your vector of Y. Once you have this β, your spline would just be
f(x) = β0 + Σ_{k=1}^{K} βk bk(x).