Regression Analysis Tutorial - Lasso and Ridge
Regularization Regression
Creator: Muhammad Bilal Alam
What is Lasso Regression (L1)?
Lasso regression, also known as L1 regularization, is a linear regression technique that adds a
penalty term to the cost function, which shrinks the coefficients of the model towards zero and
performs variable selection. The penalty term is the absolute value of the coefficients multiplied
by a constant alpha, which determines the strength of the penalty. The smaller the value of
alpha, the weaker the penalty and the larger the coefficients. Conversely, the larger the value of
alpha, the stronger the penalty and the smaller the coefficients.
The formula for the cost function of Lasso regression is:
Cost = RSS + alpha * sum(|coefficients|)
where
RSS is the residual sum of squares,
alpha is the regularization parameter, and
sum(|coefficients|) is the sum of the absolute values of the coefficients.
The goal of Lasso regression is to minimize the cost function by adjusting the coefficients of the
model.
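The cost function can be computed directly with NumPy. The data, coefficients, and alpha below are made-up illustrative values, not taken from this tutorial's dataset:

```python
import numpy as np

# Made-up illustrative values (not from the tutorial's dataset)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]])  # feature matrix
y = np.array([3.0, 4.0, 10.0])                      # target values
coef = np.array([1.5, 1.0])                         # hypothetical coefficients
alpha = 0.5                                         # regularization strength

rss = np.sum((y - X @ coef) ** 2)          # residual sum of squares
l1_penalty = alpha * np.sum(np.abs(coef))  # alpha * sum(|coefficients|)
cost = rss + l1_penalty
print(cost)  # 3.75
```

Lasso's optimizer searches for the coefficient vector that makes this quantity as small as possible.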
What is Ridge Regression (L2)?
Ridge regression is a type of linear regression that includes L2 regularization to prevent
overfitting in the model. The L2 regularization adds a penalty term to the loss function that is
proportional to the sum of the squared values of the model's coefficients.
The formula for the loss function in ridge regression is as follows:
Loss function = RSS + αΣβ^2
where
RSS is the residual sum of squares,
Σβ^2 is the sum of the squared values of the model's coefficients, and
α is the regularization parameter that controls the strength of the penalty term. The larger
the value of α, the greater the penalty on the coefficients, which results in a simpler model
with smaller coefficients. The value of α can be chosen using cross-validation to find the
optimal balance between the bias and variance of the model.
The goal of ridge regression is to address the problem of multicollinearity, which occurs when
the predictor variables in a linear regression model are highly correlated with each other. This
can lead to unstable and inaccurate estimates of the regression coefficients.
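The ridge loss only differs from the Lasso cost in the penalty term. A minimal sketch with made-up illustrative values (not from this tutorial's dataset):

```python
import numpy as np

# Made-up illustrative values (not from the tutorial's dataset)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]])
y = np.array([3.0, 4.0, 10.0])
coef = np.array([1.5, 1.0])
alpha = 0.5

rss = np.sum((y - X @ coef) ** 2)       # residual sum of squares
l2_penalty = alpha * np.sum(coef ** 2)  # alpha * sum(coefficients^2)
loss = rss + l2_penalty
print(loss)  # 4.125
```

Because the penalty grows with the square of each coefficient, ridge shrinks large coefficients strongly but never sets them exactly to zero.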
What is the Difference Between Ridge and Lasso Regression?
Multicollinearity refers to the situation where two or more independent variables in a regression
model are highly correlated with each other. This can cause issues in the model, as it becomes
difficult to determine the individual impact of each variable on the dependent variable. In such
cases, Ridge regression can be helpful as it shrinks the coefficients towards zero and reduces
the impact of multicollinearity.
On the other hand, if we have a large number of features in the dataset, it can lead to
overfitting, where the model becomes too complex and does not generalize well to new data. In
this scenario, Lasso regression can be useful as it performs feature selection by setting some of
the coefficients to zero. This helps in simplifying the model and removing irrelevant features,
leading to better generalization performance.
So, in simple terms, Ridge regression is helpful when we have highly correlated predictors, and
Lasso regression is useful when we have too many features and want to simplify the model by
selecting only the important ones.
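To make the contrast concrete, here is a small sketch on synthetic data; the features, alpha values, and random seed are illustrative choices, not part of this tutorial. Lasso tends to set the coefficient of an irrelevant feature exactly to zero, while Ridge only shrinks it:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)       # informative feature
noise = rng.normal(size=200)    # feature unrelated to the target
X = np.column_stack([x1, noise])
y = 3.0 * x1 + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: performs feature selection
ridge = Ridge(alpha=10.0).fit(X, y)  # L2: shrinks but keeps all features
print("Lasso coefficients:", lasso.coef_)  # irrelevant coefficient driven to exactly 0
print("Ridge coefficients:", ridge.coef_)  # irrelevant coefficient small but nonzero
```

This is why Lasso is the usual choice for pruning a large feature set, while Ridge is preferred when correlated predictors should all stay in the model with stabilized coefficients.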
The California Housing Dataset
The California Housing Dataset contains information on the median income, housing age, and
other features for census tracts in California. The dataset was originally published by Pace, R.
Kelley and Ronald Barry in their 1997 paper "Sparse Spatial Autoregressions" and is available in
the sklearn.datasets module.
The dataset consists of 20,640 instances, each representing a census tract in California. There
are eight features in the dataset, including:
MedInc: Median income in the census tract.
HouseAge: Median age of houses in the census tract.
AveRooms: Average number of rooms per dwelling in the census tract.
AveBedrms: Average number of bedrooms per dwelling in the census tract.
Population: Total number of people living in the census tract.
AveOccup: Average number of people per household in the census tract.
Latitude: Latitude of the center of the census tract.
Longitude: Longitude of the center of the census tract.
Step 1: Import the necessary libraries
In [1]: import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')
Step 2: Load the dataset
In [2]: # Load the California Housing Dataset from scikit-learn
california = fetch_california_housing()
# Convert the data to a pandas dataframe
california_df = pd.DataFrame(data=california.data, columns=california.feature_names)
# Add the target variable to the dataframe
california_df['MedHouseVal'] = california.target
# Print the first 5 rows of the dataframe
california_df.head()
Out[2]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422
Step 3: Data Preprocessing and Exploratory Data Analysis
Step 3(a): Check Shape of Dataframe
Checking the shape of the dataframe tells us how many rows and columns we have in the dataset.
In [3]: # Print the shape of the dataframe
print("Data shape:", california_df.shape)
Data shape: (20640, 9)
Step 3(b): Check Info of Dataframe
This is very useful to quickly get an overview of the structure and properties of a dataset, and to
check for any missing or null values that may need to be addressed before performing any
analysis or modeling.
In [4]: california_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 20640 non-null float64
1 HouseAge 20640 non-null float64
2 AveRooms 20640 non-null float64
3 AveBedrms 20640 non-null float64
4 Population 20640 non-null float64
5 AveOccup 20640 non-null float64
6 Latitude 20640 non-null float64
7 Longitude 20640 non-null float64
8 MedHouseVal 20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB
Step 3(c): Show Descriptive Statistics of each numerical column
Looking at descriptive statistics in machine learning is important because it gives an overview of
the dataset's distribution and key characteristics. Some of the reasons why we should look at
descriptive statistics include:
Understanding the distribution of data: Descriptive statistics provide information about
the central tendency and the spread of the data. This information is useful in determining
the type of distribution and whether the data is skewed or symmetrical.
Identifying outliers: Descriptive statistics help to identify any extreme values or outliers in
the dataset. These outliers can have a significant impact on the analysis and should be
investigated further.
From the descriptive statistics, we can observe the following:
Outliers: The 'AveRooms', 'AveBedrms', 'Population', and 'AveOccup' columns have very high
maximum values, indicating the presence of outliers in the data. These outliers may need to
be treated or removed before model selection. We will create visuals to see them more
clearly.
Distribution: The 'MedInc', 'HouseAge', and 'MedHouseVal' columns appear to be roughly
normally distributed, as the mean and median values are close to each other and the
standard deviation is not very high. The 'Latitude' column is skewed to the right, as the mean
(35.63) is greater than the median (34.26). The 'Longitude' column is skewed to the left, as the
mean (-119.57) is less than the median (-118.49).
In [5]: california_df.describe().T
Out[5]: count mean std min 25% 50% 75%
MedInc 20640.0 3.870671 1.899822 0.499900 2.563400 3.534800 4.743250
HouseAge 20640.0 28.639486 12.585558 1.000000 18.000000 29.000000 37.000000
AveRooms 20640.0 5.429000 2.474173 0.846154 4.440716 5.229129 6.052381
AveBedrms 20640.0 1.096675 0.473911 0.333333 1.006079 1.048780 1.099526
Population 20640.0 1425.476744 1132.462122 3.000000 787.000000 1166.000000 1725.000000
AveOccup 20640.0 3.070655 10.386050 0.692308 2.429741 2.818116 3.282261
Latitude 20640.0 35.631861 2.135952 32.540000 33.930000 34.260000 37.710000
Longitude 20640.0 -119.569704 2.003532 -124.350000 -121.800000 -118.490000 -118.010000
MedHouseVal 20640.0 2.068558 1.153956 0.149990 1.196000 1.797000 2.647250
Step 3(d): Check for missing values in the Dataframe
This is important because most machine learning algorithms cannot handle missing data and
will throw an error if missing values are present. Therefore, it is necessary to check for missing
values and impute or remove them before fitting the data into a machine learning model. This
helps to ensure that the model is trained on complete and accurate data, which leads to better
performance and more reliable predictions.
Here we have no missing values, so let's move on.
In [6]: # Check for missing values
print("Missing values:\n", california_df.isnull().sum())
Missing values:
MedInc 0
HouseAge 0
AveRooms 0
AveBedrms 0
Population 0
AveOccup 0
Latitude 0
Longitude 0
MedHouseVal 0
dtype: int64
Step 3(e): Check for duplicate values in the Dataframe
Checking for duplicate values in machine learning is important because it can affect the
accuracy of your model. Duplicate values can skew your data and lead to overfitting, where your
model is too closely fit to the training data and does not generalize well to new data.
We have no duplicate values, so that's good.
In [7]: california_df.duplicated().sum()
Out[7]: 0
Step 3(f)(i): Check for Outliers in the Dataframe
We should check for outliers as they can have a negative impact on machine learning
algorithms as they can skew the results of the analysis. Outliers can significantly alter the mean,
standard deviation, and other statistical measures, which can misrepresent the true
characteristics of the data. Linear regression models are sensitive to outliers and can produce
inaccurate results if the outliers are not properly handled or removed. Therefore, it is important
to identify and handle outliers appropriately to ensure the accuracy and reliability of the
models.
In the boxplots below we can clearly see very high outliers (the points far beyond each box), so
we need to deal with them appropriately.
In [8]: # Define the colors for each feature
colors = ['blue', 'red', 'green', 'purple']
# Select the features to plot
features = ['AveBedrms', 'AveRooms', 'MedInc', 'AveOccup']
# Create a figure and axis object
fig, ax = plt.subplots()
# Create a boxplot for each feature
bp = ax.boxplot([california_df[f] for f in features],
                sym='o',
                patch_artist=True,
                notch=True)
# Assign unique colors to each feature
for patch, color in zip(bp['boxes'], colors[:len(features)]):
    patch.set_facecolor(color)
# Customize the x-axis tick labels
ax.set_xticklabels(features)
# Customize the title and axes labels
ax.set_title('Boxplot of Selected Features')
ax.set_xlabel('Features')
ax.set_ylabel('Values')
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Set the size of the plot
fig.set_size_inches(8, 6)
# Show the plot
plt.show()
In [9]: # Define the color for the feature
colors = ['orange']
# Select the feature to plot
features = ['Population']
# Create a figure and axis object
fig, ax = plt.subplots()
# Create a boxplot for the feature
bp = ax.boxplot([california_df[f] for f in features],
                sym='o',
                patch_artist=True,
                notch=True)
# Assign unique colors to each feature
for patch, color in zip(bp['boxes'], colors[:len(features)]):
    patch.set_facecolor(color)
# Customize the x-axis tick labels
ax.set_xticklabels(features)
# Customize the title and axes labels
ax.set_title('Boxplot of Selected Features')
ax.set_xlabel('Features')
ax.set_ylabel('Values')
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Set the size of the plot
fig.set_size_inches(8, 6)
# Show the plot
plt.show()
Step 3(f)(ii): Deal with Outliers in the Dataframe using Winsorization:
This method involves replacing extreme values with the nearest values that are within a certain
percentile range. For example, we replace values above the 95th percentile with the value at the
95th percentile and values below the 1st percentile with the value at the 1st percentile. From the
updated visuals we can see that the extreme outliers have now been capped.
In [10]: # Define the percentile limits for winsorization
pct_lower = 0.01
pct_upper = 0.95
# Apply winsorization to the five columns
california_df['AveRooms'] = np.clip(california_df['AveRooms'],
                                    california_df['AveRooms'].quantile(pct_lower),
                                    california_df['AveRooms'].quantile(pct_upper))
california_df['AveBedrms'] = np.clip(california_df['AveBedrms'],
                                     california_df['AveBedrms'].quantile(pct_lower),
                                     california_df['AveBedrms'].quantile(pct_upper))
california_df['Population'] = np.clip(california_df['Population'],
                                      california_df['Population'].quantile(pct_lower),
                                      california_df['Population'].quantile(pct_upper))
california_df['AveOccup'] = np.clip(california_df['AveOccup'],
                                    california_df['AveOccup'].quantile(pct_lower),
                                    california_df['AveOccup'].quantile(pct_upper))
california_df['MedInc'] = np.clip(california_df['MedInc'],
                                  california_df['MedInc'].quantile(pct_lower),
                                  california_df['MedInc'].quantile(pct_upper))
In [11]: # Define the colors for each feature
colors = ['blue', 'red', 'green', 'purple']
# Select the features to plot
features = ['AveBedrms', 'AveRooms', 'MedInc', 'AveOccup']
# Create a figure and axis object
fig, ax = plt.subplots()
# Create a boxplot for each feature
bp = ax.boxplot([california_df[f] for f in features],
                sym='o',
                patch_artist=True,
                notch=True)
# Assign unique colors to each feature
for patch, color in zip(bp['boxes'], colors[:len(features)]):
    patch.set_facecolor(color)
# Customize the x-axis tick labels
ax.set_xticklabels(features)
# Customize the title and axes labels
ax.set_title('Boxplot of Selected Features')
ax.set_xlabel('Features')
ax.set_ylabel('Values')
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Set the size of the plot
fig.set_size_inches(8, 6)
# Show the plot
plt.show()
In [12]: # Define the color for the feature
colors = ['orange']
# Select the feature to plot
features = ['Population']
# Create a figure and axis object
fig, ax = plt.subplots()
# Create a boxplot for the feature
bp = ax.boxplot([california_df[f] for f in features],
                sym='o',
                patch_artist=True,
                notch=True)
# Assign unique colors to each feature
for patch, color in zip(bp['boxes'], colors[:len(features)]):
    patch.set_facecolor(color)
# Customize the x-axis tick labels
ax.set_xticklabels(features)
# Customize the title and axes labels
ax.set_title('Boxplot of Selected Features')
ax.set_xlabel('Features')
ax.set_ylabel('Values')
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Set the size of the plot
fig.set_size_inches(8, 6)
# Show the plot
plt.show()
Step 3(g): Check for Skewness using a Histogram
Skewed data can result in biased estimates of model parameters and reduce the accuracy of
predictions. Therefore, it is important to assess the distribution of features and target variables
to identify any potential issues and take appropriate measures to address them.
Here almost all the features and the target look roughly normally distributed. There is some
skewness in MedHouseVal, but not enough to warrant a transformation.
Note: For learning purposes I have shown how to do MedHouseVal transformation for
skewness in my previous tutorial of Simple Linear Regression. Feel free to check that out!
In [13]: # Select the features to plot
features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
            'Latitude', 'Longitude', 'MedHouseVal']
# Define the colors for each feature
colors = ['blue', 'red', 'green', 'orange', 'purple', 'brown', 'gray', 'pink', 'cyan']
# Set up the plot
fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(20, 12))
axs = axs.flatten()
fig.suptitle('Histograms of Selected Features', fontsize=16)
# Create a histogram for each feature
for i, feature in enumerate(features):
    # Get the data for the current feature
    data = california_df[feature].values
    # Calculate the mean of the data
    mean_value = np.mean(data)
    # Plot the histogram and mean line
    axs[i].hist(data, bins=50, color=colors[i], alpha=0.7)
    axs[i].axvline(mean_value, color='black', linestyle='--', linewidth=2)
    axs[i].set_title(feature)
    axs[i].set_xlabel('Value')
    axs[i].set_ylabel('Frequency')
    axs[i].grid(axis='y', linestyle='--', alpha=0.7)
# Adjust the spacing between subplots
fig.subplots_adjust(hspace=0.4, wspace=0.4)
# Show the plot
plt.show()
Step 3(h): Create a Vertical Correlation Heatmap
The correlation matrix shows the correlation coefficients between every pair of variables in the
dataset. A correlation coefficient ranges from -1 to 1 and measures the strength and direction
of the linear relationship between two variables. A coefficient of -1 indicates a perfect negative
correlation, a coefficient of 0 indicates no correlation, and a coefficient of 1 indicates a perfect
positive correlation.
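As a quick illustration with made-up numbers, np.corrcoef returns exactly these limiting values for perfectly linear relationships:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]   # perfect positive linear relationship
r_neg = np.corrcoef(x, -3 * x + 4)[0, 1]  # perfect negative linear relationship
print(r_pos, r_neg)  # approximately 1.0 and -1.0
```

Real features, of course, fall somewhere between these extremes, which is what the heatmap below visualizes.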
In [14]: # Calculate the correlation matrix
corr_matrix = california_df.corr()
# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(10, 8))
# Create the heatmap
sns.heatmap(corr_matrix, cmap='Spectral_r', annot=True, fmt='.2f', linewidths=.5, ax=ax)
# Set the title and axis labels
ax.set_title('Correlation Heatmap for California Housing Dataset', fontsize=16)
ax.set_xlabel('Features', fontsize=14)
ax.set_ylabel('Features', fontsize=14)
# Rotate the x-axis labels for readability
plt.xticks(rotation=45, ha='right')
# Show the plot
plt.show()
Step 3(i): Perform Feature Scaling
Feature scaling is the process of transforming numerical features in a dataset to have similar
scales or ranges of values. The purpose of feature scaling is to ensure that all features have the
same level of impact on the model and to prevent certain features from dominating the model
simply because they have larger values. In linear regression, feature scaling is particularly
important because the coefficients of the model represent the change in the dependent
variable associated with a one-unit change in the independent variable. Scaling the features to
have similar ranges can result in a more accurate and reliable model with more accurate
representations of the relationships between the independent variables and the dependent
variable.
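A minimal sketch of what StandardScaler does (the small array is a made-up example): each column is transformed to zero mean and unit standard deviation, so features measured on very different scales become comparable.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (made-up values)
data = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0]])

scaled = StandardScaler().fit_transform(data)
print(scaled.mean(axis=0))  # each column now has mean ~0
print(scaled.std(axis=0))   # and standard deviation ~1
```

This is exactly the transformation applied to the housing dataframe in the next cell.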
In [15]: scaler = StandardScaler()
california_df_scaled = scaler.fit_transform(california_df)
california_df_scaled = pd.DataFrame(california_df_scaled, columns=california_df.columns)
Step 3(j): Check for Assumptions using Scatter Plots
From the scatter plots, we can observe that there is a linear relationship between the dependent
variable (Median House Value) and some of the independent variables, such as Median Income and
Average Rooms. However, we can also see that some of the independent variables, such as Longitude,
Latitude, and House Age, do not have a clear linear relationship with the dependent variable. This
suggests that a linear regression model might not capture those variables well when predicting the
Median House Value.
In [16]: # Create scatter plots
fig, axs = plt.subplots(nrows=2, ncols=4, figsize=(25, 15))
axs[0,0].scatter(california_df_scaled['Latitude'], california_df['MedHouseVal'])
axs[0,0].set_xlabel('Latitude')
axs[0,0].set_ylabel('Median House Value')
axs[0,0].set_title('Latitude vs Median House Value')
axs[0,1].scatter(california_df_scaled['Longitude'], california_df['MedHouseVal'])
axs[0,1].set_xlabel('Longitude')
axs[0,1].set_ylabel('Median House Value')
axs[0,1].set_title('Longitude vs Median House Value')
axs[0,2].scatter(california_df_scaled['HouseAge'], california_df['MedHouseVal'])
axs[0,2].set_xlabel('House Age')
axs[0,2].set_ylabel('Median House Value')
axs[0,2].set_title('House Age vs Median House Value')
axs[0,3].scatter(california_df_scaled['AveRooms'], california_df['MedHouseVal'])
axs[0,3].set_xlabel('Average Rooms')
axs[0,3].set_ylabel('Median House Value')
axs[0,3].set_title('Average Rooms vs Median House Value')
axs[1,0].scatter(california_df_scaled['AveBedrms'], california_df['MedHouseVal'])
axs[1,0].set_xlabel('Average Bedrooms')
axs[1,0].set_ylabel('Median House Value')
axs[1,0].set_title('Average Bedrooms vs Median House Value')
axs[1,1].scatter(california_df_scaled['Population'], california_df['MedHouseVal'])
axs[1,1].set_xlabel('Population')
axs[1,1].set_ylabel('Median House Value')
axs[1,1].set_title('Population vs Median House Value')
axs[1,2].scatter(california_df_scaled['AveOccup'], california_df['MedHouseVal'])
axs[1,2].set_xlabel('Average Occupancy')
axs[1,2].set_ylabel('Median House Value')
axs[1,2].set_title('Average Occupancy vs Median House Value')
axs[1,3].scatter(california_df_scaled['MedInc'], california_df['MedHouseVal'])
axs[1,3].set_xlabel('Median Income')
axs[1,3].set_ylabel('Median House Value')
axs[1,3].set_title('Median Income vs Median House Value')
plt.show()
Step 4: Define Dependent and Independent Variables
In [17]: X = california_df_scaled.drop(['MedHouseVal'], axis=1)
y = california_df['MedHouseVal'] # We don't scale the dependent variable
Step 5: Perform Lasso (L1) Regression
Step 5(a): Use LassoCV to find the best alpha value
LassoCV uses cross-validation to determine the optimal value of the regularization parameter
(alpha) that balances the bias-variance tradeoff in the model. It iteratively fits the Lasso
regression model to different subsets of the training data and computes the cross-validated
mean squared error for each value of alpha. It then selects the alpha value that minimizes the
cross-validated mean squared error as the optimal value.
LassoCV is useful when you have a large number of features and want to perform feature
selection and regularization at the same time. By setting some of the coefficients to exactly
zero, it can help identify the most important features and simplify the model, which can lead to
better performance and interpretation.
In [18]: # Create a LassoCV object with 5-fold cross-validation
lasso_cv = LassoCV(cv=5)
# Fit the LassoCV model to the data
lasso_cv.fit(X, y)
# Print the best alpha and coefficients
print("Best alpha:", lasso_cv.alpha_)
Best alpha: 0.0007860894985619018
Step 5(b): Using SelectFromModel Feature Selection Method to Select the
Features for Regression
SelectFromModel is a feature selection method in scikit-learn that uses a supervised estimator
to identify the most important features in a dataset. The estimator is trained on the data, and
then the feature importances are ranked. Features that have importance scores above a
threshold are selected for use in modeling, while features with lower scores are excluded.
The process of selecting the best features is an important step in machine learning, as it can
improve model accuracy and reduce overfitting.
In [19]: # Instantiate the Lasso model with the best alpha
lasso = Lasso(alpha=lasso_cv.alpha_, max_iter=10000)
# Select the features using Lasso regularization
selector = SelectFromModel(estimator=lasso)
X_selected = selector.fit_transform(X, y)
# Print the selected features
california_features = X.columns
selected_features = california_features[selector.get_support()]
print(selected_features)
Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
'Latitude', 'Longitude'],
dtype='object')
Note that because the optimal alpha is very small, the L1 penalty is weak and all eight features are retained here.
Step 5(c): Perform Lasso Regression Using the Selected Features
In [20]: # Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)
# Create a LassoCV object with 5-fold cross-validation
lasso_cv = LassoCV(cv=5)
# Fit the LassoCV model to the data
lasso_cv.fit(X_train, y_train)
# Predict the median house values for the testing data
y_pred = lasso_cv.predict(X_test)
Step 5(d): Evaluating the performance of the Model
In [21]: # Calculate the root mean squared error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root mean squared error:", rmse)
# Calculate the R2 score
r2 = r2_score(y_test, y_pred)
print("R2 score:", r2)
Root mean squared error: 0.6802817409760694
R2 score: 0.6528874328086174
Step 6: Performing Ridge (L2) Regression
Step 6(a): Split the data into training and testing sets in the ratio 70:30
In [22]: # Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Step 6(b): Use RidgeCV to find the best value of alpha
RidgeCV performs cross-validation to automatically select the optimal value of the
regularization parameter (alpha) that balances the bias-variance trade-off. The algorithm fits the
model using different values of alpha and selects the value that results in the lowest mean
squared error (MSE). This process helps to prevent overfitting and improve the model's
generalization performance.
In [23]: # Create a RidgeCV object with 5-fold cross-validation (default alpha grid)
ridge_cv = RidgeCV(cv=5)
# Fit the Ridge regression model to the training data
ridge_cv.fit(X_train, y_train)
# Print the best alpha value
print("Best alpha value:", ridge_cv.alpha_)
Best alpha value: 1.0
Step 6(c): Perform Ridge Regression
In [24]: # Create a Ridge regression object with the best alpha value
ridge = Ridge(alpha=ridge_cv.alpha_)
# Fit the Ridge regression model to the training data
ridge.fit(X_train, y_train)
# Predict the median house values for the testing data
y_pred = ridge.predict(X_test)
Step 6(d): Evaluating the performance of the Model
In [25]: # Compute the mean squared error on the testing data
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error:", mse)
# Compute the R-squared on the testing data
r2 = ridge.score(X_test, y_test)
print("R-squared:", r2)
Mean squared error: 0.46273885187759334
R-squared: 0.6529207316404193
Conclusion:
In this Ridge and Lasso regression tutorial, we explored how to use regularized linear regression
to model the relationship between housing features and the target variable, the median house
value in California. We started by loading and cleaning the dataset, and then visualized the
relationships between the variables using scatter plots.
Next, we split the data into training and testing sets and used Ridge and Lasso regression to fit
a model to the training data. We used cross-validation (LassoCV and RidgeCV) to select the alpha
of each model, achieving a good balance between bias and variance. We then used the models to
make predictions on the test data and evaluated their performance using mean squared error and
the R-squared score.
Compared to polynomial regression, Ridge and Lasso regression provided a way to address
overfitting by introducing a penalty term to the model's cost function. Ridge regression added a
penalty term proportional to the square of the magnitude of the coefficients, while Lasso
regression added a penalty term proportional to the absolute value of the coefficients. The L1
penalty encourages Lasso to select only the most important features by setting some coefficients
exactly to zero, while the L2 penalty in Ridge shrinks all coefficients and reduces the impact of
less important features on the model's predictions.
While polynomial regression may have performed better in this particular case (Shown in the
previous tutorial), Ridge and Lasso regression are often used in situations where there is high
multicollinearity among the independent variables, which can lead to overfitting in traditional
linear regression. They can also be useful in feature selection, as they can reduce the impact of
less important variables in the model.