  • Introduction to Lasso and Ridge Regression
  • The California Housing Dataset
  • Difference Between Ridge and Lasso
  • Step 1: Import Libraries and Data
  • Step 2: Load and Explore Dataset
  • Step 3: Data Preprocessing and Analysis
  • Step 4: Define Variables and Perform Regression

Regression Analysis Tutorial - Lasso and Ridge Regularization Regression

Creator: Muhammad Bilal Alam

What is Lasso Regression (L1)?


Lasso regression, also known as L1 regularization, is a linear regression technique that adds a
penalty term to the cost function, which shrinks the coefficients of the model towards zero and
performs variable selection. The penalty term is the absolute value of the coefficients multiplied
by a constant alpha, which determines the strength of the penalty. The smaller the value of
alpha, the weaker the penalty and the larger the coefficients. Conversely, the larger the value of
alpha, the stronger the penalty and the smaller the coefficients.

The formula for the cost function of Lasso regression is:

Cost = RSS + alpha * sum(|coefficients|)

where

  • RSS is the residual sum of squares,
  • alpha is the regularization parameter, and
  • sum(|coefficients|) is the sum of the absolute values of the coefficients.

The goal of Lasso regression is to minimize the cost function by adjusting the coefficients of the
model.
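As a concrete illustration, the cost above can be computed directly with NumPy. The toy data and the `lasso_cost` helper below are made up for this sketch; they are not part of the tutorial's dataset:

```python
import numpy as np

# Toy data: y is exactly 2 times the first feature (illustrative values only)
X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 1.5], [4.0, 2.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def lasso_cost(coef, X, y, alpha):
    """Cost = RSS + alpha * sum(|coefficients|), as in the formula above."""
    residuals = y - X @ coef
    rss = np.sum(residuals ** 2)
    return rss + alpha * np.sum(np.abs(coef))

coef = np.array([2.0, 0.0])               # exact fit, so RSS is 0
print(lasso_cost(coef, X, y, alpha=0.5))  # only the penalty remains: 0.5 * (|2| + |0|) = 1.0
```

Note that even a perfect fit pays the penalty term, which is exactly the pressure that pushes coefficients toward zero.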

What is Ridge Regression (L2)?


Ridge regression is a type of linear regression that includes L2 regularization to prevent
overfitting in the model. The L2 regularization adds a penalty term to the loss function that is
proportional to the sum of the squared values of the model's coefficients.

The formula for the loss function in ridge regression is as follows:

Loss function = RSS + αΣβ^2

where

  • RSS is the residual sum of squares,
  • Σβ^2 is the sum of the squared values of the model's coefficients, and
  • α is the regularization parameter that controls the strength of the penalty term. The larger the value of α, the greater the penalty on the coefficients, which results in a simpler model with smaller coefficients. The value of α can be chosen using cross-validation to find the optimal balance between the bias and variance of the model.

The goal of ridge regression is to address the problem of multicollinearity, which occurs when the predictor variables in a linear regression model are highly correlated with each other. This can lead to unstable and inaccurate estimates of the regression coefficients.
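Unlike Lasso, the ridge loss has a closed-form minimizer, β = (XᵀX + αI)⁻¹Xᵀy (ignoring the intercept). A minimal sketch on made-up, centered data — the coefficients and noise scale below are illustrative assumptions:

```python
import numpy as np

# Illustrative data (not from the tutorial); columns centered so the intercept can be ignored
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X -= X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
y -= y.mean()

alpha = 1.0
# Closed-form ridge solution: beta = (X^T X + alpha * I)^(-1) X^T y
beta = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
print(beta)  # close to the true [1.0, -2.0, 0.5], pulled slightly toward zero by the penalty
```

The added αI term keeps XᵀX invertible even when predictors are highly correlated, which is why ridge remains stable under multicollinearity.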

What is the Difference Between Ridge and Lasso Regression?


Multicollinearity refers to the situation where two or more independent variables in a regression
model are highly correlated with each other. This can cause issues in the model, as it becomes
difficult to determine the individual impact of each variable on the dependent variable. In such
cases, Ridge regression can be helpful as it shrinks the coefficients towards zero and reduces
the impact of multicollinearity.

On the other hand, if we have a large number of features in the dataset, it can lead to
overfitting, where the model becomes too complex and does not generalize well to new data. In
this scenario, Lasso regression can be useful as it performs feature selection by setting some of
the coefficients to zero. This helps in simplifying the model and removing irrelevant features,
leading to better generalization performance.

So, in simple terms, Ridge regression is helpful when we have highly correlated predictors, and
Lasso regression is useful when we have too many features and want to simplify the model by
selecting only the important ones.
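A small made-up experiment (not part of the tutorial's dataset) makes the contrast concrete: on data with a redundant feature and a pure-noise feature, Lasso drives some coefficients exactly to zero while Ridge only shrinks them. The alpha values here are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
x0 = rng.normal(size=200)
X = np.column_stack([
    x0,                                # informative
    rng.normal(size=200),              # informative
    x0 + 0.01 * rng.normal(size=200),  # near-duplicate of the first feature
    rng.normal(size=200),              # pure noise
])
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso:", np.round(lasso.coef_, 3))  # at least one coefficient is exactly 0
print("Ridge:", np.round(ridge.coef_, 3))  # all coefficients shrunk but non-zero
```

Lasso typically zeroes the noise feature and one of the near-duplicate pair, while Ridge spreads weight across both correlated copies.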

The California Housing Dataset


The California Housing Dataset contains information on the median income, housing age, and
other features for census tracts in California. The dataset was originally published by Pace, R.
Kelley and Ronald Barry in their 1997 paper "Sparse Spatial Autoregressions" and is available in
the sklearn.datasets module.

The dataset consists of 20,640 instances, each representing a census tract in California. There
are eight features in the dataset, including:

  • MedInc: Median income in the census tract.
  • HouseAge: Median age of houses in the census tract.
  • AveRooms: Average number of rooms per dwelling in the census tract.
  • AveBedrms: Average number of bedrooms per dwelling in the census tract.
  • Population: Total number of people living in the census tract.
  • AveOccup: Average number of people per household in the census tract.
  • Latitude: Latitude of the center of the census tract.
  • Longitude: Longitude of the center of the census tract.

Step 1: Import the necessary libraries


In [1]: import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

Step 2: Load the dataset


In [2]: # Load the California Housing Dataset from sklearn

california = fetch_california_housing()

# Convert the data to a pandas dataframe
california_df = pd.DataFrame(data=california.data, columns=california.feature_names)

# Add the target variable to the dataframe
california_df['MedHouseVal'] = california.target

# Print the first 5 rows of the dataframe
california_df.head()

Out[2]:    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal

0          8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
1          8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
2          7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
3          5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
4          3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422

Step 3: Data Preprocessing and Exploratory Data Analysis

Step 3(a): Check Shape of Dataframe


Checking the shape of the dataframe tells us how many rows and columns we have in the dataset.

In [3]: # Print the shape of the dataframe
print("Data shape:", california_df.shape)

Data shape: (20640, 9)

Step 3(b): Check Info of Dataframe


This is very useful to quickly get an overview of the structure and properties of a dataset, and to
check for any missing or null values that may need to be addressed before performing any
analysis or modeling.

In [4]: california_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 20640 non-null float64
1 HouseAge 20640 non-null float64
2 AveRooms 20640 non-null float64
3 AveBedrms 20640 non-null float64
4 Population 20640 non-null float64
5 AveOccup 20640 non-null float64
6 Latitude 20640 non-null float64
7 Longitude 20640 non-null float64
8 MedHouseVal 20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB

Step 3(c): Show Descriptive Statistics of each numerical column


Looking at descriptive statistics in machine learning is important because it gives an overview of
the dataset's distribution and key characteristics. Some of the reasons why we should look at
descriptive statistics include:

  • Understanding the distribution of data: Descriptive statistics provide information about the central tendency and the spread of the data. This information is useful in determining the type of distribution and whether the data is skewed or symmetrical.

  • Identifying outliers: Descriptive statistics help to identify any extreme values or outliers in the dataset. These outliers can have a significant impact on the analysis and should be investigated further.

From the descriptive statistics, we can observe the following:

  • Outliers: The 'AveRooms', 'AveBedrms', 'Population', and 'AveOccup' columns have high maximum values, indicating the presence of outliers in the data. These outliers may need to be treated or removed before model selection. We will create visuals to see them more clearly.

  • Distribution: The 'MedInc', 'HouseAge', and 'MedHouseVal' columns appear to be roughly normally distributed, as the mean and median values are close to each other and the standard deviation is not very high. The 'Latitude' column is skewed to the right, as its mean (35.63) is greater than its median (34.26). The 'Longitude' column is skewed to the left, as its mean (-119.57) is less than its median (-118.49).

In [5]: california_df.describe().T
Out[5]: count mean std min 25% 50% 75%

MedInc 20640.0 3.870671 1.899822 0.499900 2.563400 3.534800 4.743250

HouseAge 20640.0 28.639486 12.585558 1.000000 18.000000 29.000000 37.000000

AveRooms 20640.0 5.429000 2.474173 0.846154 4.440716 5.229129 6.052381

AveBedrms 20640.0 1.096675 0.473911 0.333333 1.006079 1.048780 1.099526

Population 20640.0 1425.476744 1132.462122 3.000000 787.000000 1166.000000 1725.000000

AveOccup 20640.0 3.070655 10.386050 0.692308 2.429741 2.818116 3.282261

Latitude 20640.0 35.631861 2.135952 32.540000 33.930000 34.260000 37.710000

Longitude 20640.0 -119.569704 2.003532 -124.350000 -121.800000 -118.490000 -118.010000

MedHouseVal 20640.0 2.068558 1.153956 0.149990 1.196000 1.797000 2.647250
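These distribution claims can be checked numerically: pandas' `skew()` reports sample skewness per column (near 0 for symmetric data, large and positive for a long right tail). Below is a self-contained sketch on synthetic data; in the notebook itself you would simply call `california_df.skew()`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "symmetric": rng.normal(size=10_000),        # skewness near 0
    "right_tail": rng.exponential(size=10_000),  # positive skew: long right tail / outliers
})
print(df.skew().round(2))
```

A column with skewness far from 0 is a candidate for the outlier treatment and transformations discussed below.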

Step 3(d): Check for missing values in the Dataframe


This is important because most machine learning algorithms cannot handle missing data and
will throw an error if missing values are present. Therefore, it is necessary to check for missing
values and impute or remove them before fitting the data into a machine learning model. This
helps to ensure that the model is trained on complete and accurate data, which leads to better
performance and more reliable predictions.

Here we have no missing values, so let's move on.

In [6]: # Check for missing values
print("Missing values:\n", california_df.isnull().sum())

Missing values:
MedInc 0
HouseAge 0
AveRooms 0
AveBedrms 0
Population 0
AveOccup 0
Latitude 0
Longitude 0
MedHouseVal 0
dtype: int64

Step 3(e): Check for duplicate values in the Dataframe


Checking for duplicate values in machine learning is important because it can affect the
accuracy of your model. Duplicate values can skew your data and lead to overfitting, where your
model is too closely fit to the training data and does not generalize well to new data.

We have no duplicate values, so that's good.

In [7]: california_df.duplicated().sum()

Out[7]: 0

Step 3(f)(i): Check for Outliers in the Dataframe


We should check for outliers because they can skew the results of the analysis. Outliers can significantly alter the mean, standard deviation, and other statistical measures, which can misrepresent the true characteristics of the data. Linear regression models are sensitive to outliers and can produce inaccurate results if the outliers are not properly handled or removed. Therefore, it is important to identify and handle outliers appropriately to ensure the accuracy and reliability of the models.

Here in the plots we can clearly see very high outliers on the right-hand side, so we need to deal with them appropriately.

In [8]: # Define the colors for each feature
colors = ['blue', 'red', 'green', 'purple']

# Select the features to plot
features = ['AveBedrms', 'AveRooms', 'MedInc', 'AveOccup']

# Create a figure and axis object
fig, ax = plt.subplots()

# Create a boxplot for each feature
bp = ax.boxplot([california_df[f] for f in features],
                sym='o',
                patch_artist=True,
                notch=True)

# Assign unique colors to each feature
for patch, color in zip(bp['boxes'], colors[:len(features)]):
    patch.set_facecolor(color)

# Customize the x-axis tick labels
ax.set_xticklabels(features)

# Customize the title and axes labels
ax.set_title('Boxplot of Selected Features')
ax.set_xlabel('Features')
ax.set_ylabel('Values')

# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Set the size of the plot
fig.set_size_inches(8, 6)

# Show the plot
plt.show()
In [9]: # Define the color for the feature
colors = ['orange']

# Select the feature to plot
features = ['Population']

# Create a figure and axis object
fig, ax = plt.subplots()

# Create a boxplot for the feature
bp = ax.boxplot([california_df[f] for f in features],
                sym='o',
                patch_artist=True,
                notch=True)

# Assign a color to the feature
for patch, color in zip(bp['boxes'], colors[:len(features)]):
    patch.set_facecolor(color)

# Customize the x-axis tick labels
ax.set_xticklabels(features)

# Customize the title and axes labels
ax.set_title('Boxplot of Selected Features')
ax.set_xlabel('Features')
ax.set_ylabel('Values')

# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Set the size of the plot
fig.set_size_inches(8, 6)

# Show the plot
plt.show()

Step 3(f)(ii): Deal with Outliers in the Dataframe using Winsorization

This method involves replacing extreme values with the nearest values that are within a certain percentile range. For example, we replace values above the 95th percentile with the value at the 95th percentile, and values below the 1st percentile with the value at the 1st percentile. From the visuals we can clearly see that the data is much closer to normally distributed afterwards.

In [10]: # Define the percentile limits for winsorization
pct_lower = 0.01
pct_upper = 0.95

# Apply winsorization to the five columns
california_df['AveRooms'] = np.clip(california_df['AveRooms'],
                                    california_df['AveRooms'].quantile(pct_lower),
                                    california_df['AveRooms'].quantile(pct_upper))
california_df['AveBedrms'] = np.clip(california_df['AveBedrms'],
                                     california_df['AveBedrms'].quantile(pct_lower),
                                     california_df['AveBedrms'].quantile(pct_upper))
california_df['Population'] = np.clip(california_df['Population'],
                                      california_df['Population'].quantile(pct_lower),
                                      california_df['Population'].quantile(pct_upper))
california_df['AveOccup'] = np.clip(california_df['AveOccup'],
                                    california_df['AveOccup'].quantile(pct_lower),
                                    california_df['AveOccup'].quantile(pct_upper))
california_df['MedInc'] = np.clip(california_df['MedInc'],
                                  california_df['MedInc'].quantile(pct_lower),
                                  california_df['MedInc'].quantile(pct_upper))
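The five repeated np.clip calls can also be written as a single `DataFrame.clip` call. Here is a self-contained sketch on made-up, right-skewed data; in the notebook the same pattern would apply to the five columns above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"a": rng.exponential(size=1000), "b": rng.exponential(size=1000)})

# Winsorize every column at its own 1st/95th percentiles in one call
lower, upper = df.quantile(0.01), df.quantile(0.95)
clipped = df.clip(lower=lower, upper=upper, axis=1)

print((clipped.max() <= upper).all())  # True: nothing exceeds the 95th percentile
```

Passing `axis=1` aligns the two quantile Series with the dataframe's columns, so each column is clipped against its own limits.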

In [11]: # Define the colors for each feature
colors = ['blue', 'red', 'green', 'purple']

# Select the features to plot
features = ['AveBedrms', 'AveRooms', 'MedInc', 'AveOccup']

# Create a figure and axis object
fig, ax = plt.subplots()

# Create a boxplot for each feature
bp = ax.boxplot([california_df[f] for f in features],
                sym='o',
                patch_artist=True,
                notch=True)

# Assign unique colors to each feature
for patch, color in zip(bp['boxes'], colors[:len(features)]):
    patch.set_facecolor(color)

# Customize the x-axis tick labels
ax.set_xticklabels(features)

# Customize the title and axes labels
ax.set_title('Boxplot of Selected Features')
ax.set_xlabel('Features')
ax.set_ylabel('Values')

# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Set the size of the plot
fig.set_size_inches(8, 6)

# Show the plot
plt.show()
In [12]: # Define the color for the feature
colors = ['orange']

# Select the feature to plot
features = ['Population']

# Create a figure and axis object
fig, ax = plt.subplots()

# Create a boxplot for the feature
bp = ax.boxplot([california_df[f] for f in features],
                sym='o',
                patch_artist=True,
                notch=True)

# Assign a color to the feature
for patch, color in zip(bp['boxes'], colors[:len(features)]):
    patch.set_facecolor(color)

# Customize the x-axis tick labels
ax.set_xticklabels(features)

# Customize the title and axes labels
ax.set_title('Boxplot of Selected Features')
ax.set_xlabel('Features')
ax.set_ylabel('Values')

# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Set the size of the plot
fig.set_size_inches(8, 6)

# Show the plot
plt.show()
Step 3(g): Check for Skewness using a Histogram

Skewed data can result in biased estimates of model parameters and reduce the accuracy of predictions. Therefore, it is important to assess the distribution of features and target variables to identify any potential issues and take appropriate measures to address them.

Here almost all the features and the target look normally distributed. There is some skewness in MedHouseVal, but not enough to warrant a transformation.

Note: For learning purposes, I have shown how to transform MedHouseVal for skewness in my previous tutorial on Simple Linear Regression. Feel free to check that out!

In [13]: # Select the features to plot
features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
            'Latitude', 'Longitude', 'MedHouseVal']

# Define the colors for each feature
colors = ['blue', 'red', 'green', 'orange', 'purple', 'brown', 'gray', 'pink', 'cyan']

# Set up the plot
fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(20, 12))
axs = axs.flatten()
fig.suptitle('Histograms of Selected Features', fontsize=16)

# Create a histogram for each feature
for i, feature in enumerate(features):
    # Get the data for the current feature
    data = california_df[feature].values

    # Calculate the mean of the data
    mean_value = np.mean(data)

    # Plot the histogram and mean line
    axs[i].hist(data, bins=50, color=colors[i], alpha=0.7)
    axs[i].axvline(mean_value, color='black', linestyle='--', linewidth=2)
    axs[i].set_title(feature)
    axs[i].set_xlabel('Value')
    axs[i].set_ylabel('Frequency')
    axs[i].grid(axis='y', linestyle='--', alpha=0.7)

# Adjust the spacing between subplots
fig.subplots_adjust(hspace=0.4, wspace=0.4)

# Show the plot
plt.show()

Step 3(h): Create a Vertical Correlation Heatmap


The correlation matrix shows the correlation coefficients between every pair of variables in the
dataset. A correlation coefficient ranges from -1 to 1 and measures the strength and direction
of the linear relationship between two variables. A coefficient of -1 indicates a perfect negative
correlation, a coefficient of 0 indicates no correlation, and a coefficient of 1 indicates a perfect
positive correlation.
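For reference, the coefficient shown in the heatmap is the Pearson correlation. A quick made-up example (the numbers below are illustrative) computing it by hand against np.corrcoef:

```python
import numpy as np

# Pearson correlation by hand vs. numpy, on illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + np.array([0.1, -0.2, 0.0, 0.2, -0.1])

# r = covariance(x, y) / (std(x) * std(y))
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2)))
r_numpy = np.corrcoef(x, y)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))  # both near +1: strong positive linear relationship
```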

In [14]: # Calculate the correlation matrix
corr_matrix = california_df.corr()

# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(10, 8))

# Create the heatmap
sns.heatmap(corr_matrix, cmap='Spectral_r', annot=True, fmt='.2f', linewidths=.5, ax=ax)

# Set the title and axis labels
ax.set_title('Correlation Heatmap for California Housing Dataset', fontsize=16)
ax.set_xlabel('Features', fontsize=14)
ax.set_ylabel('Features', fontsize=14)

# Rotate the x-axis labels for readability
plt.xticks(rotation=45, ha='right')

# Show the plot
plt.show()

Step 3(i): Perform Feature Scaling


Feature scaling is the process of transforming numerical features in a dataset to have similar
scales or ranges of values. The purpose of feature scaling is to ensure that all features have the
same level of impact on the model and to prevent certain features from dominating the model
simply because they have larger values. In linear regression, feature scaling is particularly
important because the coefficients of the model represent the change in the dependent
variable associated with a one-unit change in the independent variable. Scaling the features to
have similar ranges can result in a more accurate and reliable model with more accurate
representations of the relationships between the independent variables and the dependent
variable.

In [15]: scaler = StandardScaler()
california_df_scaled = scaler.fit_transform(california_df)
california_df_scaled = pd.DataFrame(california_df_scaled, columns=california_df.columns)
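What StandardScaler actually does is the z-score transform, (x − mean) / std, applied column by column. A small self-contained check on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales (illustrative values)
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_scaled = StandardScaler().fit_transform(X)

# The same transform written out by hand (StandardScaler uses the population std, ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_scaled, manual))                 # True
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))   # means ~0, stds ~1
```

After scaling, both columns contribute on the same footing, which is exactly what the Lasso and Ridge penalties below require.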

Step 3(j): Check for Assumptions using Scatter Plots


From the scatter plots, we can observe that there is a linear relationship between the dependent variable (Median House Value) and some of the independent variables, like Median Income and Average Rooms. However, some of the independent variables, like Longitude, Latitude, and House Age, do not have a clear linear relationship with the dependent variable. This suggests that a linear regression model might not be the best fit for predicting the Median House Value based on these variables.

In [16]: # Create scatter plots
fig, axs = plt.subplots(nrows=2, ncols=4, figsize=(25, 15))

axs[0,0].scatter(california_df_scaled['Latitude'], california_df['MedHouseVal'], color='blue')
axs[0,0].set_xlabel('Latitude')
axs[0,0].set_ylabel('Median House Value')
axs[0,0].set_title('Latitude vs Median House Value')

axs[0,1].scatter(california_df_scaled['Longitude'], california_df['MedHouseVal'], color='red')
axs[0,1].set_xlabel('Longitude')
axs[0,1].set_ylabel('Median House Value')
axs[0,1].set_title('Longitude vs Median House Value')

axs[0,2].scatter(california_df_scaled['HouseAge'], california_df['MedHouseVal'], color='green')
axs[0,2].set_xlabel('House Age')
axs[0,2].set_ylabel('Median House Value')
axs[0,2].set_title('House Age vs Median House Value')

axs[0,3].scatter(california_df_scaled['AveRooms'], california_df['MedHouseVal'], color='orange')
axs[0,3].set_xlabel('Average Rooms')
axs[0,3].set_ylabel('Median House Value')
axs[0,3].set_title('Average Rooms vs Median House Value')

axs[1,0].scatter(california_df_scaled['AveBedrms'], california_df['MedHouseVal'], color='purple')
axs[1,0].set_xlabel('Average Bedrooms')
axs[1,0].set_ylabel('Median House Value')
axs[1,0].set_title('Average Bedrooms vs Median House Value')

axs[1,1].scatter(california_df_scaled['Population'], california_df['MedHouseVal'], color='brown')
axs[1,1].set_xlabel('Population')
axs[1,1].set_ylabel('Median House Value')
axs[1,1].set_title('Population vs Median House Value')

axs[1,2].scatter(california_df_scaled['AveOccup'], california_df['MedHouseVal'], color='gray')
axs[1,2].set_xlabel('Average Occupancy')
axs[1,2].set_ylabel('Median House Value')
axs[1,2].set_title('Average Occupancy vs Median House Value')

axs[1,3].scatter(california_df_scaled['MedInc'], california_df['MedHouseVal'], color='pink')
axs[1,3].set_xlabel('Median Income')
axs[1,3].set_ylabel('Median House Value')
axs[1,3].set_title('Median Income vs Median House Value')

plt.show()

Step 4: Define Dependent and Independent Variables

In [17]: X = california_df_scaled.drop(['MedHouseVal'], axis=1)
y = california_df['MedHouseVal']  # We don't scale the dependent variable

Step 5: Perform Lasso (L1) Regression

Step 5(a): Use LassoCV to find the best alpha value


LassoCV uses cross-validation to determine the optimal value of the regularization parameter
(alpha) that balances the bias-variance tradeoff in the model. It iteratively fits the Lasso
regression model to different subsets of the training data and computes the cross-validated
mean squared error for each value of alpha. It then selects the alpha value that minimizes the
cross-validated mean squared error as the optimal value.

LassoCV is useful when you have a large number of features and want to perform feature
selection and regularization at the same time. By setting some of the coefficients to exactly
zero, it can help identify the most important features and simplify the model, which can lead to
better performance and interpretation.

In [18]: # Create a LassoCV object with 5-fold cross-validation
lasso_cv = LassoCV(cv=5)

# Fit the LassoCV model to the data
lasso_cv.fit(X, y)

# Print the best alpha
print("Best alpha:", lasso_cv.alpha_)

Best alpha: 0.0007860894985619018
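Under the hood, the fitted object exposes the alpha grid it searched (`alphas_`) and the per-fold errors (`mse_path_`); the chosen `alpha_` is simply the grid point with the lowest mean CV error. A self-contained sketch on synthetic data (the coefficients below are illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

lasso_cv = LassoCV(cv=5).fit(X, y)

mean_cv_mse = lasso_cv.mse_path_.mean(axis=1)        # average MSE across the 5 folds
best_alpha = lasso_cv.alphas_[np.argmin(mean_cv_mse)]
print(best_alpha, lasso_cv.alpha_)                   # the selected alpha_ minimizes mean CV error
```

Plotting `mean_cv_mse` against `lasso_cv.alphas_` is a quick way to see how flat or sharp the bias-variance trade-off is around the chosen alpha.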

Step 5(b): Using SelectFromModel to Select the Features for Regression

SelectFromModel is a feature selection method in scikit-learn that uses a supervised estimator to identify the most important features in a dataset. The estimator is trained on the data, and the feature importances are then ranked. Features with importance scores above a threshold are selected for use in modeling, while features with lower scores are excluded.

The process of selecting the best features is an important step in machine learning, as it can
improve model accuracy and reduce overfitting.

In [19]: # Instantiate the Lasso model with the best alpha
lasso = Lasso(alpha=lasso_cv.alpha_, max_iter=10000)

# Select the features using Lasso regularization
selector = SelectFromModel(estimator=lasso)
X_selected = selector.fit_transform(X, y)

# Print the selected features
california_features = X.columns
selected_features = california_features[selector.get_support()]
print(selected_features)

Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude'],
      dtype='object')
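All eight features survive here because, for an L1-penalized estimator, SelectFromModel's default threshold is a tiny 1e-5: only coefficients that Lasso drove (essentially) to exactly zero get dropped. A stricter cut can be requested explicitly via the `threshold` parameter. A self-contained sketch on synthetic data, where the alpha and threshold choices are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# 8 features, only 3 of which actually drive the target
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=1.0, random_state=0)

# threshold='mean' keeps only features whose |coefficient| is at least the mean |coefficient|
selector = SelectFromModel(Lasso(alpha=1.0), threshold='mean').fit(X, y)
print(selector.get_support().sum())  # far fewer than 8 features survive
```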

Step 5(c): Perform Lasso Regression Using the Selected Features

In [20]: # Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)

# Create a LassoCV object with 5-fold cross-validation
lasso_cv = LassoCV(cv=5)

# Fit the LassoCV model to the training data
lasso_cv.fit(X_train, y_train)

# Predict the median house values for the testing data
y_pred = lasso_cv.predict(X_test)

Step 5(d): Evaluating the performance of the Model

In [21]: # Calculate the root mean squared error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root mean squared error:", rmse)

# Calculate the R2 score
r2 = r2_score(y_test, y_pred)
print("R2 score:", r2)

Root mean squared error: 0.6802817409760694
R2 score: 0.6528874328086174

Step 6: Performing Ridge (L2) Regression

Step 6(a): Split the data into training and testing sets in the ratio 70:30

In [22]: # Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Step 6(b): Use RidgeCV to find the best value of alpha


RidgeCV performs cross-validation to automatically select the optimal value of the
regularization parameter (alpha) that balances the bias-variance trade-off. The algorithm fits the
model using different values of alpha and selects the value that results in the lowest mean
squared error (MSE). This process helps to prevent overfitting and improve the model's
generalization performance.

In [23]: # Create a RidgeCV object with 5-fold cross-validation (default alpha grid: 0.1, 1.0, 10.0)
ridge_cv = RidgeCV(cv=5)

# Fit the Ridge regression model to the training data
ridge_cv.fit(X_train, y_train)

# Print the best alpha value
print("Best alpha value:", ridge_cv.alpha_)

Best alpha value: 1.0
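Note that RidgeCV's default grid contains only (0.1, 1.0, 10.0), so "best alpha: 1.0" just means 1.0 beat those two alternatives. Passing a wider, log-spaced grid is a common refinement; a self-contained sketch on made-up data (the grid bounds and coefficients are illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + rng.normal(scale=0.5, size=200)

# Search 13 alphas from 1e-3 to 1e3 instead of the default (0.1, 1.0, 10.0)
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print(ridge_cv.alpha_)  # chosen from the wider grid
```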

Step 6(c): Perform Ridge Regression

In [24]: # Create a Ridge regression object with the best alpha value
ridge = Ridge(alpha=ridge_cv.alpha_)

# Fit the Ridge regression model to the training data
ridge.fit(X_train, y_train)

# Predict the median house values for the testing data
y_pred = ridge.predict(X_test)

Step 6(d): Evaluating the performance of the Model

In [25]: # Compute the mean squared error on the testing data
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error:", mse)

# Compute the R-squared on the testing data
r2 = ridge.score(X_test, y_test)
print("R-squared:", r2)

Mean squared error: 0.46273885187759334
R-squared: 0.6529207316404193

Conclusion:
In this Ridge and Lasso regression tutorial, we explored how to use Ridge and Lasso regression to model the relationship between a set of predictor variables and a target, specifically the median house value in California. We started by loading and cleaning the dataset, and then visualized the relationships between the variables using scatter plots.

Next, we split the data into training and testing sets and used Ridge and Lasso regression to fit models to the training data. We used cross-validation to select the alpha that achieves a good balance between bias and variance. We then used the models to make predictions on the test data and evaluated their performance using mean squared error and the R-squared score.

Compared to polynomial regression, Ridge and Lasso regression provided a way to address
overfitting by introducing a penalty term to the model's cost function. Ridge regression added a
penalty term proportional to the square of the magnitude of the coefficients, while Lasso
regression added a penalty term proportional to the absolute value of the coefficients. This
encourages the models to select only the most important features and reduce the impact of less
important features on the model's predictions.

While polynomial regression may have performed better in this particular case (Shown in the
previous tutorial), Ridge and Lasso regression are often used in situations where there is high
multicollinearity among the independent variables, which can lead to overfitting in traditional
linear regression. They can also be useful in feature selection, as they can reduce the impact of
less important variables in the model.

Common questions

Powered by AI

Mean squared error (MSE) and R-squared are complementary metrics used to evaluate model performance. MSE measures the average squared difference between the observed and predicted values, providing a measure of accuracy; lower values indicate better model performance . R-squared, on the other hand, indicates the proportion of variance in the dependent variable that is predictable from the independent variables, assessing model fit; a higher value suggests a better model fit . Utilizing both metrics provides a fuller picture of a model's accuracy and its ability to generalize to new data, ensuring robust evaluation .

The alpha parameter in both Lasso and Ridge regression determines the strength of the regularization penalty. In Lasso, a larger alpha value increases the penalty, leading more coefficients to be set to zero, effectively performing stronger feature selection . In Ridge regression, a larger alpha value implies a stronger penalty on large coefficients, which leads to a simpler model with reduced coefficient magnitudes . The alpha parameter is typically determined through cross-validation to find the value that provides an optimal balance between bias and variance, ensuring the model generalizes well to unseen data .

Handling outliers and missing values is crucial because they can significantly skew the results of machine learning algorithms. Outliers can distort the mean, standard deviation, and other statistical measures, affecting the model's accuracy and leading to biased predictions. Missing values, if not addressed, can cause errors or biases in model training, since most algorithms require complete data for accurate computations. Proper preprocessing that handles outliers and missing data ensures that the training data reflects the true patterns and relationships in the dataset, leading to more reliable and generalizable model performance.
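A minimal preprocessing sketch with pandas, using a hypothetical "income" column (the median fill and the 1st/99th percentile caps are illustrative choices, not the tutorial's exact steps):

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing value and an extreme outlier
df = pd.DataFrame({"income": [3.2, 4.1, np.nan, 2.8, 150.0]})

# Fill the missing value with the median, which is robust to outliers
df["income"] = df["income"].fillna(df["income"].median())

# Cap extreme values at the 1st and 99th percentiles
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(low, high)

print(df["income"].tolist())
```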

Feature scaling is crucial in linear regression models because it ensures that all features contribute equally to the model's predictions by aligning their scales, especially when regularization is involved. Without feature scaling, coefficients of larger-scaled features could become disproportionately large, impacting model stability and interpretability. In regularized models like Ridge and Lasso, unscaled features can lead to improper shrinkage or selection, because the penalty imposed is relative to the scale of the coefficients. Scaling helps ensure that the regularization term penalizes all features equally, leading to more accurate and reliable models.
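In scikit-learn this is commonly expressed as a pipeline, so the scaler is fit only on training data; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed setup for illustration)
X, y = make_regression(n_samples=100, n_features=5, random_state=42)

# Standardize features before the L2 penalty so all coefficients
# are penalized on a comparable scale
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)
```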

Lasso regression is preferred in high-dimensional datasets with many irrelevant features due to its ability to perform automatic variable selection by setting some of the regression coefficients to zero. This capability helps to simplify the model by eliminating superfluous features, which is particularly beneficial when dimensionality reduction and sparsity are desired. Ridge regression, while effective for multicollinearity, does not zero out coefficients and thus includes all features, which may not be ideal when irrelevant features dominate the dataset. Lasso's feature selection results in a more interpretable, parsimonious model, suited for scenarios where model simplicity is prioritized.
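This sparsity is easy to observe on synthetic data where only a few of many features are informative (the feature counts and alpha below are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 20 features, but only 3 carry signal (hypothetical setup)
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
n_selected = int(np.sum(lasso.coef_ != 0))
print(f"{n_selected} of {lasso.coef_.size} coefficients are non-zero")
```

The uninformative features receive coefficients of exactly zero, which is the automatic variable selection the paragraph describes.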

A high R-squared value in Ridge and Lasso regressions indicates that a high proportion of the variance in the dependent variable is explained by the model. However, in the context of regularization, this does not automatically imply a good model. Overfitting is a common concern: a model may fit the training data well (high R-squared) but perform poorly on new, unseen data. Regularization techniques like Ridge and Lasso help mitigate overfitting by penalizing larger coefficients, thus maintaining model simplicity and generalizability. Therefore, a balanced interpretation of R-squared alongside other metrics, such as RMSE and cross-validation scores, is critical for evaluating model performance.
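The gap between training R-squared and cross-validated R-squared is a simple overfitting check; a minimal sketch on synthetic data (the data shape and alpha are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Many features relative to samples encourages some overfitting
X, y = make_regression(n_samples=100, n_features=20, noise=20.0, random_state=0)

ridge = Ridge(alpha=1.0)
train_r2 = ridge.fit(X, y).score(X, y)                      # R^2 on training data
cv_r2 = cross_val_score(ridge, X, y, cv=5, scoring="r2").mean()  # held-out R^2

print(f"Train R^2: {train_r2:.3f}, CV R^2: {cv_r2:.3f}")
```

A large drop from the training score to the cross-validated score is the warning sign the paragraph describes.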

Winsorization helps deal with outliers by capping extreme values at a specified percentile range, effectively reducing the influence of outliers on the dataset's overall distribution. By replacing extreme values with threshold values (e.g., the 1st and 95th percentiles), Winsorization can lead to a more normal distribution of data, which can improve model robustness. However, one disadvantage of Winsorization is that it can lead to loss of information regarding the true variability of the data, as the actual extreme values are not retained, potentially masking true data patterns or relationships.
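SciPy provides this directly via `scipy.stats.mstats.winsorize`; a minimal sketch with assumed 10%/10% limits (chosen only so one value on each side is capped in this tiny example):

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical data with one extreme high value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Cap the lowest 10% and highest 10% of values at the nearest retained value
capped = winsorize(data, limits=[0.1, 0.1])
print(capped.min(), capped.max())
```

Here the outlier 100 is replaced by the next-largest value rather than dropped, so the sample size is preserved but the true extremes are lost, as the paragraph notes.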

Lasso regression, also known as L1 regularization, handles multicollinearity by performing feature selection, setting some coefficients to zero and thus reducing model complexity by eliminating irrelevant features. Ridge regression, on the other hand, uses L2 regularization to add a penalty term proportional to the sum of squared coefficients, shrinking all coefficients towards zero without setting them explicitly to zero. It reduces the impact of multicollinearity by stabilizing coefficient estimates without eliminating features, which keeps all predictors in the model but with lower impact.
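The contrast shows up directly in the fitted coefficients; a minimal sketch fitting both models to the same synthetic data (setup and alpha are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso produces exact zeros; Ridge only shrinks, keeping every feature
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```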

A correlation heatmap visually represents the correlation coefficients between variables, making it easier to identify patterns of linear relationships during data exploration. This can help detect multicollinearity issues that may affect regression models, guiding feature selection or dimensionality reduction efforts. However, its limitations include a focus on linear relationships only, potentially overlooking non-linear associations. Additionally, heatmaps for large datasets can be overwhelming, leading to difficulty in interpretation without proper context or thresholds for significant correlation. Overall, while beneficial, heatmaps should be complemented with other analyses to fully understand data relationships.
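The correlation matrix underlying such a heatmap comes from `DataFrame.corr()`; a minimal sketch on synthetic data with one deliberately collinear pair (column names and noise levels are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_copy": x + rng.normal(scale=0.1, size=200),  # nearly collinear with x
    "noise": rng.normal(size=200),                  # unrelated feature
})

corr = df.corr()  # pairwise Pearson correlations
print(corr.round(2))
# To visualize as a heatmap: seaborn.heatmap(corr, annot=True, cmap="coolwarm")
```

The near-1.0 entry for `x` vs `x_copy` is the multicollinearity signal a heatmap makes visible at a glance.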

Lasso regression incorporates feature selection within the model fitting process by applying L1 regularization, shrinking some coefficients to zero; it is an automated process driven by the regularization strength (alpha). In contrast, stepwise selection is a manual, iterative process that involves adding or removing predictors based on specific criteria such as p-values or the Akaike Information Criterion (AIC). Lasso is typically more efficient and handles high-dimensional data better, whereas stepwise selection is more prone to overfitting due to multiple testing. Additionally, Lasso simplifies the model automatically, whereas stepwise selection requires explicit criteria for the inclusion or exclusion of variables.
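In scikit-learn, Lasso-driven selection can be wrapped in `SelectFromModel`, so the selection happens inside a single fit with no manual stepwise loop; a minimal sketch on synthetic data (dimensions and alpha are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# 15 features, only 4 informative (hypothetical setup)
X, y = make_regression(n_samples=100, n_features=15, n_informative=4,
                       noise=1.0, random_state=1)

# One fit selects features via the L1 penalty — no iterative add/remove steps
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
selected = selector.get_support(indices=True)
print("Selected feature indices:", selected)
```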

You might also like