Ridge and Lasso in Python
Want to follow along on your own machine? Download the .py (lab10.py) or Jupyter Notebook (Lab 10 - Ridge Regression
and the Lasso in Python.ipynb) version.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
We will use the sklearn package in order to perform ridge regression and the lasso. The main functions in this package that
we care about are Ridge(), which can be used to fit ridge regression models, and Lasso(), which will fit lasso models. They
also have cross-validated counterparts: RidgeCV() and LassoCV(). We'll use these a bit later.
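All of these estimators live in sklearn.linear_model. A reasonable set of imports for the rest of this lab (exactly which helpers you need depends on the cells you run) is shown below. Note that the normalize = True argument used in some cells was removed in newer scikit-learn releases, where you would standardize the features yourself instead (e.g. with scale()).
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
from sklearn.metrics import mean_squared_error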
Before proceeding, let's first ensure that the missing values have been removed from the data, as described in the previous
lab.
df = pd.read_csv('Hitters.csv').dropna().drop('Player', axis = 1)
df.info()
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
We will now perform ridge regression and the lasso in order to predict Salary on the Hitters data. Let's set up our data:
y = df.Salary

# Drop the column with the dependent variable (Salary), and columns for which we created dummy variables
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis = 1).astype('float64')

# Define the feature set X, adding back one dummy column per categorical variable
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis = 1)

X.info()
Associated with each alpha value is a vector of ridge regression coefficients, which we'll store in a matrix coefs. In this case,
it will be a 100 × 19 matrix, with 100 rows (one for each value of alpha) and 19 columns (one for each predictor). Remember that
we'll want to standardize the variables so that they are on the same scale. To do this, we can use the normalize = True
parameter:
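The code below is a minimal sketch of that setup. The exact grid of alpha values is an illustrative choice, not something fixed by the lab, but it should span several orders of magnitude so that the plot which follows is informative.
# 100 candidate values of alpha on a log scale (the exact range is an illustrative choice)
alphas = 10**np.linspace(10, -2, 100)*0.5
ridge = Ridge(normalize = True)
coefs = []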
for a in alphas:
    ridge.set_params(alpha = a)
    ridge.fit(X, y)
    coefs.append(ridge.coef_)

np.shape(coefs)
We expect the coefficient estimates to be much smaller, in terms of l2 norm, when a large value of alpha is used, as
compared to when a small value of alpha is used. Let's plot and find out:
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('weights')
We now split the samples into a training set and a test set in order to estimate the test error of ridge regression and the lasso:
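A sketch of one way to do this with train_test_split(); the 50/50 split and the random seed are illustrative choices, not fixed by the lab:
# Hold out half the observations as a test set (split fraction and seed are arbitrary here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 1)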
Next we fit a ridge regression model on the training set, and evaluate its MSE on the test set, using λ = 4 :
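A sketch of that fit (the names ridge2 and pred2 are just illustrative):
ridge2 = Ridge(alpha = 4, normalize = True)
ridge2.fit(X_train, y_train)            # fit on the training set
pred2 = ridge2.predict(X_test)          # predict on the test set
mean_squared_error(y_test, pred2)       # test MSE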
The test MSE when alpha = 4 is 106216. Now let's see what happens if we use a huge value of alpha, say 10^10:
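For example (again, the object names are illustrative):
ridge3 = Ridge(alpha = 10**10, normalize = True)
ridge3.fit(X_train, y_train)
pred3 = ridge3.predict(X_test)
mean_squared_error(y_test, pred3)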
This big penalty shrinks the coefficients to a very large degree, essentially reducing the fit to a model containing just the
intercept. This over-shrinking makes the model more biased, resulting in a higher test MSE.
Okay, so fitting a ridge regression model with alpha = 4 leads to a much lower test MSE than fitting a model with just an
intercept. We now check whether there is any benefit to performing ridge regression with alpha = 4 instead of just performing
least squares regression. Recall that least squares is simply ridge regression with alpha = 0.
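One way to check, sketched here by reusing the illustrative ridge2 object from above, is to turn the penalty off and compare test MSEs:
ridge2.set_params(alpha = 0)            # alpha = 0 gives ordinary least squares
ridge2.fit(X_train, y_train)
mean_squared_error(y_test, ridge2.predict(X_test))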
Instead of arbitrarily choosing alpha = 4 , it would be better to use cross-validation to choose the tuning parameter alpha. We
can do this using the cross-validated ridge regression function, RidgeCV(). By default, the function performs generalized
cross-validation (an efficient form of LOOCV), though this can be changed using the argument cv.
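A sketch of the cross-validated fit, where alphas is the grid defined earlier and the scoring choice and object name are assumptions:
ridgecv = RidgeCV(alphas = alphas, scoring = 'neg_mean_squared_error', normalize = True)
ridgecv.fit(X_train, y_train)
ridgecv.alpha_                          # value of alpha with the smallest cross-validation error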
Therefore, we see that the value of alpha that results in the smallest cross-validation error is 0.57. What is the test MSE
associated with this value of alpha?
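A sketch of that calculation, which also defines the ridge4 estimator that the next cell refits on the full data:
ridge4 = Ridge(alpha = ridgecv.alpha_, normalize = True)
ridge4.fit(X_train, y_train)
mean_squared_error(y_test, ridge4.predict(X_test))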
This represents a further improvement over the test MSE that we got using alpha = 4 . Finally, we refit our ridge regression
model on the full data set, using the value of alpha chosen by cross-validation, and examine the coefficient estimates.
ridge4.fit(X, y)
pd.Series(ridge4.coef_, index = X.columns)
As expected, none of the coefficients are exactly zero - ridge regression does not perform variable selection!
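We now ask whether the lasso can do as well. The coefficient-path loop below needs a Lasso estimator and an empty list of coefficients; a minimal sketch (the max_iter value is an assumption to help the solver converge):
lasso = Lasso(max_iter = 10000)         # max_iter raised to help convergence (an assumption)
coefs = []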
for a in alphas:
    lasso.set_params(alpha=a)
    lasso.fit(scale(X_train), y_train)
    coefs.append(lasso.coef_)
ax = plt.gca()
ax.plot(alphas*2, coefs)
ax.set_xscale('log')
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('weights')
Notice in the coefficient plot that, depending on the choice of tuning parameter, some of the coefficients are exactly equal
to zero. We now perform 10-fold cross-validation to choose the best alpha, refit the model, and compute the associated test
error:
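A sketch of the cross-validation step: cv = 10 follows the text, alphas = None lets LassoCV choose its own grid, and max_iter is again an assumption:
lassocv = LassoCV(alphas = None, cv = 10, max_iter = 100000)
lassocv.fit(X_train, y_train)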
lasso.set_params(alpha=lassocv.alpha_)
lasso.fit(X_train, y_train)
mean_squared_error(y_test, lasso.predict(X_test))
This is substantially lower than the test set MSE of the null model and of least squares, and only a little worse than the test
MSE of ridge regression with alpha chosen by cross-validation.
However, the lasso has a substantial advantage over ridge regression in that the resulting coefficient estimates are sparse.
Here we see that 13 of the 19 coefficient estimates are exactly zero:
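One way to inspect the fitted coefficients:
# Coefficients set exactly to zero were dropped by the lasso
pd.Series(lasso.coef_, index = X.columns)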
Your turn!
Now it's time to test out these approaches (ridge regression and the lasso) and evaluation methods (validation set, cross-
validation) on other datasets. You may want to work with a team on this portion of the lab. You may use any of the datasets
included in ISLR, or choose one from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html).
Download a dataset, and try to determine the optimal set of parameters for modeling it! You are free to use the same dataset
you used in Lab 9, or you can choose a new one.