  • Introduction to Lasso and Ridge Regression
  • The California Housing Dataset
  • Difference Between Ridge and Lasso
  • Step 1: Import Libraries and Data
  • Step 2: Load and Explore Dataset
  • Step 3: Data Preprocessing and Analysis
  • Step 4: Define Variables and Perform Regression

Regression Analysis Tutorial - Lasso and Ridge Regularization Regression

Creator: Muhammad Bilal Alam

What is Lasso Regression (L1)?


Lasso regression, also known as L1 regularization, is a linear regression technique that adds a
penalty term to the cost function, which shrinks the coefficients of the model towards zero and
performs variable selection. The penalty term is the absolute value of the coefficients multiplied
by a constant alpha, which determines the strength of the penalty. The smaller the value of
alpha, the weaker the penalty and the larger the coefficients. Conversely, the larger the value of
alpha, the stronger the penalty and the smaller the coefficients.

The formula for the cost function of Lasso regression is:

Cost = RSS + alpha * sum(|coefficients|)

where

  • RSS is the residual sum of squares,
  • alpha is the regularization parameter, and
  • sum(|coefficients|) is the sum of the absolute values of the coefficients.

The goal of Lasso regression is to minimize the cost function by adjusting the coefficients of the
model.
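As a concrete illustration, the cost above can be computed directly with NumPy. The toy data and the `lasso_cost` helper below are made up for this sketch; they are not part of the tutorial's dataset:

```python
import numpy as np

# Toy data: y is exactly 2 times the first feature (illustrative values only)
X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 1.5], [4.0, 2.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

def lasso_cost(coef, X, y, alpha):
    """Cost = RSS + alpha * sum(|coefficients|), as in the formula above."""
    residuals = y - X @ coef
    rss = np.sum(residuals ** 2)
    return rss + alpha * np.sum(np.abs(coef))

coef = np.array([2.0, 0.0])               # exact fit, so RSS is 0
print(lasso_cost(coef, X, y, alpha=0.5))  # only the penalty remains: 0.5 * (|2| + |0|) = 1.0
```

Note that even a perfect fit pays the penalty term, which is exactly the pressure that pushes coefficients toward zero.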

What is Ridge Regression (L2)?


Ridge regression is a type of linear regression that includes L2 regularization to prevent
overfitting in the model. The L2 regularization adds a penalty term to the loss function that is
proportional to the sum of the squared values of the model's coefficients.

The formula for the loss function in ridge regression is as follows:

Loss function = RSS + αΣβ^2

where

  • RSS is the residual sum of squares,
  • Σβ^2 is the sum of the squared values of the model's coefficients, and
  • α is the regularization parameter that controls the strength of the penalty term. The larger the value of α, the greater the penalty on the coefficients, which results in a simpler model with smaller coefficients. The value of α can be chosen using cross-validation to find the optimal balance between the bias and variance of the model.

The goal of ridge regression is to address the problem of multicollinearity, which occurs when the predictor variables in a linear regression model are highly correlated with each other. This can lead to unstable and inaccurate estimates of the regression coefficients.
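Unlike Lasso, the ridge loss has a closed-form minimizer, β = (XᵀX + αI)⁻¹Xᵀy (ignoring the intercept). A minimal sketch on made-up, centered data — the coefficients and noise scale below are illustrative assumptions:

```python
import numpy as np

# Illustrative data (not from the tutorial); columns centered so the intercept can be ignored
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X -= X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
y -= y.mean()

alpha = 1.0
# Closed-form ridge solution: beta = (X^T X + alpha * I)^(-1) X^T y
beta = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
print(beta)  # close to the true [1.0, -2.0, 0.5], pulled slightly toward zero by the penalty
```

The added αI term keeps XᵀX invertible even when predictors are highly correlated, which is why ridge remains stable under multicollinearity.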

What is the Difference Between Ridge and Lasso Regression?


Multicollinearity refers to the situation where two or more independent variables in a regression
model are highly correlated with each other. This can cause issues in the model, as it becomes
difficult to determine the individual impact of each variable on the dependent variable. In such
cases, Ridge regression can be helpful as it shrinks the coefficients towards zero and reduces
the impact of multicollinearity.

On the other hand, if we have a large number of features in the dataset, it can lead to
overfitting, where the model becomes too complex and does not generalize well to new data. In
this scenario, Lasso regression can be useful as it performs feature selection by setting some of
the coefficients to zero. This helps in simplifying the model and removing irrelevant features,
leading to better generalization performance.

So, in simple terms, Ridge regression is helpful when we have highly correlated predictors, and
Lasso regression is useful when we have too many features and want to simplify the model by
selecting only the important ones.
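A small made-up experiment (not part of the tutorial's dataset) makes the contrast concrete: on data with a redundant feature and a pure-noise feature, Lasso drives some coefficients exactly to zero while Ridge only shrinks them. The alpha values here are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
x0 = rng.normal(size=200)
X = np.column_stack([
    x0,                                # informative
    rng.normal(size=200),              # informative
    x0 + 0.01 * rng.normal(size=200),  # near-duplicate of the first feature
    rng.normal(size=200),              # pure noise
])
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso:", np.round(lasso.coef_, 3))  # at least one coefficient is exactly 0
print("Ridge:", np.round(ridge.coef_, 3))  # all coefficients shrunk but non-zero
```

Lasso typically zeroes the noise feature and one of the near-duplicate pair, while Ridge spreads weight across both correlated copies.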

The California Housing Dataset


The California Housing Dataset contains information on the median income, housing age, and
other features for census tracts in California. The dataset was originally published by Pace, R.
Kelley and Ronald Barry in their 1997 paper "Sparse Spatial Autoregressions" and is available in
the sklearn.datasets module.

The dataset consists of 20,640 instances, each representing a census tract in California. There
are eight features in the dataset, including:

  • MedInc: Median income in the census tract.
  • HouseAge: Median age of houses in the census tract.
  • AveRooms: Average number of rooms per dwelling in the census tract.
  • AveBedrms: Average number of bedrooms per dwelling in the census tract.
  • Population: Total number of people living in the census tract.
  • AveOccup: Average number of people per household in the census tract.
  • Latitude: Latitude of the center of the census tract.
  • Longitude: Longitude of the center of the census tract.

Step 1: Import the necessary libraries


In [1]: import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

Step 2: Load the dataset


In [2]: # Load the California Housing Dataset from sklearn

california = fetch_california_housing()

# Convert the data to a pandas dataframe
california_df = pd.DataFrame(data=california.data, columns=california.feature_names)

# Add the target variable to the dataframe
california_df['MedHouseVal'] = california.target

# Print the first 5 rows of the dataframe
california_df.head()

Out[2]:    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal

0          8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
1          8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
2          7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
3          5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
4          3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422

Step 3: Data Preprocessing and Exploratory Data Analysis

Step 3(a): Check Shape of Dataframe


Checking the shape of the dataframe tells us how many rows and columns we have in the dataset.

In [3]: # Print the shape of the dataframe
print("Data shape:", california_df.shape)

Data shape: (20640, 9)

Step 3(b): Check Info of Dataframe


This is very useful to quickly get an overview of the structure and properties of a dataset, and to
check for any missing or null values that may need to be addressed before performing any
analysis or modeling.

In [4]: california_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 20640 non-null float64
1 HouseAge 20640 non-null float64
2 AveRooms 20640 non-null float64
3 AveBedrms 20640 non-null float64
4 Population 20640 non-null float64
5 AveOccup 20640 non-null float64
6 Latitude 20640 non-null float64
7 Longitude 20640 non-null float64
8 MedHouseVal 20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB

Step 3(c): Show Descriptive Statistics of each numerical column


Looking at descriptive statistics in machine learning is important because it gives an overview of
the dataset's distribution and key characteristics. Some of the reasons why we should look at
descriptive statistics include:

  • Understanding the distribution of data: Descriptive statistics provide information about the central tendency and the spread of the data. This information is useful in determining the type of distribution and whether the data is skewed or symmetrical.

  • Identifying outliers: Descriptive statistics help to identify any extreme values or outliers in the dataset. These outliers can have a significant impact on the analysis and should be investigated further.

From the descriptive statistics, we can observe the following:

  • Outliers: The 'AveRooms', 'AveBedrms', 'Population', and 'AveOccup' columns have high maximum values, indicating the presence of outliers in the data. These outliers may need to be treated or removed before model selection. We will create visuals to see them more clearly.

  • Distribution: The 'MedInc', 'HouseAge', and 'MedHouseVal' columns appear to be roughly normally distributed, as the mean and median values are close to each other and the standard deviation is not very high. The 'Latitude' column is skewed to the right, as its mean (35.63) is greater than its median (34.26). The 'Longitude' column is skewed to the left, as its mean (-119.57) is less than its median (-118.49).

In [5]: california_df.describe().T
Out[5]: count mean std min 25% 50% 75%

MedInc 20640.0 3.870671 1.899822 0.499900 2.563400 3.534800 4.743250

HouseAge 20640.0 28.639486 12.585558 1.000000 18.000000 29.000000 37.000000

AveRooms 20640.0 5.429000 2.474173 0.846154 4.440716 5.229129 6.052381

AveBedrms 20640.0 1.096675 0.473911 0.333333 1.006079 1.048780 1.099526

Population 20640.0 1425.476744 1132.462122 3.000000 787.000000 1166.000000 1725.000000

AveOccup 20640.0 3.070655 10.386050 0.692308 2.429741 2.818116 3.282261

Latitude 20640.0 35.631861 2.135952 32.540000 33.930000 34.260000 37.710000

Longitude 20640.0 -119.569704 2.003532 -124.350000 -121.800000 -118.490000 -118.010000

MedHouseVal 20640.0 2.068558 1.153956 0.149990 1.196000 1.797000 2.647250
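These distribution claims can be checked numerically: pandas' `skew()` reports sample skewness per column (near 0 for symmetric data, large and positive for a long right tail). Below is a self-contained sketch on synthetic data; in the notebook itself you would simply call `california_df.skew()`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "symmetric": rng.normal(size=10_000),        # skewness near 0
    "right_tail": rng.exponential(size=10_000),  # positive skew: long right tail / outliers
})
print(df.skew().round(2))
```

A column with skewness far from 0 is a candidate for the outlier treatment and transformations discussed below.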

Step 3(d): Check for missing values in the Dataframe


This is important because most machine learning algorithms cannot handle missing data and
will throw an error if missing values are present. Therefore, it is necessary to check for missing
values and impute or remove them before fitting the data into a machine learning model. This
helps to ensure that the model is trained on complete and accurate data, which leads to better
performance and more reliable predictions.

Here we have no missing values, so let's move on.

In [6]: # Check for missing values
print("Missing values:\n", california_df.isnull().sum())

Missing values:
MedInc 0
HouseAge 0
AveRooms 0
AveBedrms 0
Population 0
AveOccup 0
Latitude 0
Longitude 0
MedHouseVal 0
dtype: int64

Step 3(e): Check for duplicate values in the Dataframe


Checking for duplicate values in machine learning is important because it can affect the
accuracy of your model. Duplicate values can skew your data and lead to overfitting, where your
model is too closely fit to the training data and does not generalize well to new data.

We have no duplicate values, so that's good.

In [7]: california_df.duplicated().sum()

Out[7]: 0

Step 3(f)(i): Check for Outliers in the Dataframe


We should check for outliers because they can skew the results of the analysis. Outliers can significantly alter the mean, standard deviation, and other statistical measures, which can misrepresent the true characteristics of the data. Linear regression models are sensitive to outliers and can produce inaccurate results if the outliers are not properly handled or removed. Therefore, it is important to identify and handle outliers appropriately to ensure the accuracy and reliability of the models.

Here in the plots we can clearly see very high outliers on the right-hand side, so we need to deal with them appropriately.

In [8]: # Define the colors for each feature
colors = ['blue', 'red', 'green', 'purple']

# Select the features to plot
features = ['AveBedrms', 'AveRooms', 'MedInc', 'AveOccup']

# Create a figure and axis object
fig, ax = plt.subplots()

# Create a boxplot for each feature
bp = ax.boxplot([california_df[f] for f in features],
                sym='o',
                patch_artist=True,
                notch=True)

# Assign unique colors to each feature
for patch, color in zip(bp['boxes'], colors[:len(features)]):
    patch.set_facecolor(color)

# Customize the x-axis tick labels
ax.set_xticklabels(features)

# Customize the title and axes labels
ax.set_title('Boxplot of Selected Features')
ax.set_xlabel('Features')
ax.set_ylabel('Values')

# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Set the size of the plot
fig.set_size_inches(8, 6)

# Show the plot
plt.show()
In [9]: # Define the color for the feature
colors = ['orange']

# Select the feature to plot
features = ['Population']

# Create a figure and axis object
fig, ax = plt.subplots()

# Create a boxplot for the feature
bp = ax.boxplot([california_df[f] for f in features],
                sym='o',
                patch_artist=True,
                notch=True)

# Assign a color to the feature
for patch, color in zip(bp['boxes'], colors[:len(features)]):
    patch.set_facecolor(color)

# Customize the x-axis tick labels
ax.set_xticklabels(features)

# Customize the title and axes labels
ax.set_title('Boxplot of Selected Features')
ax.set_xlabel('Features')
ax.set_ylabel('Values')

# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Set the size of the plot
fig.set_size_inches(8, 6)

# Show the plot
plt.show()

Step 3(f)(ii): Deal with Outliers in the Dataframe using Winsorization

This method involves replacing extreme values with the nearest values that are within a certain percentile range. For example, we replace values above the 95th percentile with the value at the 95th percentile, and values below the 1st percentile with the value at the 1st percentile. From the visuals we can clearly see that the data is much closer to normally distributed afterwards.

In [10]: # Define the percentile limits for winsorization
pct_lower = 0.01
pct_upper = 0.95

# Apply winsorization to the five columns
california_df['AveRooms'] = np.clip(california_df['AveRooms'],
                                    california_df['AveRooms'].quantile(pct_lower),
                                    california_df['AveRooms'].quantile(pct_upper))
california_df['AveBedrms'] = np.clip(california_df['AveBedrms'],
                                     california_df['AveBedrms'].quantile(pct_lower),
                                     california_df['AveBedrms'].quantile(pct_upper))
california_df['Population'] = np.clip(california_df['Population'],
                                      california_df['Population'].quantile(pct_lower),
                                      california_df['Population'].quantile(pct_upper))
california_df['AveOccup'] = np.clip(california_df['AveOccup'],
                                    california_df['AveOccup'].quantile(pct_lower),
                                    california_df['AveOccup'].quantile(pct_upper))
california_df['MedInc'] = np.clip(california_df['MedInc'],
                                  california_df['MedInc'].quantile(pct_lower),
                                  california_df['MedInc'].quantile(pct_upper))
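The five repeated np.clip calls can also be written as a single `DataFrame.clip` call. Here is a self-contained sketch on made-up, right-skewed data; in the notebook the same pattern would apply to the five columns above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"a": rng.exponential(size=1000), "b": rng.exponential(size=1000)})

# Winsorize every column at its own 1st/95th percentiles in one call
lower, upper = df.quantile(0.01), df.quantile(0.95)
clipped = df.clip(lower=lower, upper=upper, axis=1)

print((clipped.max() <= upper).all())  # True: nothing exceeds the 95th percentile
```

Passing `axis=1` aligns the two quantile Series with the dataframe's columns, so each column is clipped against its own limits.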

In [11]: # Define the colors for each feature
colors = ['blue', 'red', 'green', 'purple']

# Select the features to plot
features = ['AveBedrms', 'AveRooms', 'MedInc', 'AveOccup']

# Create a figure and axis object
fig, ax = plt.subplots()

# Create a boxplot for each feature
bp = ax.boxplot([california_df[f] for f in features],
                sym='o',
                patch_artist=True,
                notch=True)

# Assign unique colors to each feature
for patch, color in zip(bp['boxes'], colors[:len(features)]):
    patch.set_facecolor(color)

# Customize the x-axis tick labels
ax.set_xticklabels(features)

# Customize the title and axes labels
ax.set_title('Boxplot of Selected Features')
ax.set_xlabel('Features')
ax.set_ylabel('Values')

# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Set the size of the plot
fig.set_size_inches(8, 6)

# Show the plot
plt.show()
In [12]: # Define the color for the feature
colors = ['orange']

# Select the feature to plot
features = ['Population']

# Create a figure and axis object
fig, ax = plt.subplots()

# Create a boxplot for the feature
bp = ax.boxplot([california_df[f] for f in features],
                sym='o',
                patch_artist=True,
                notch=True)

# Assign a color to the feature
for patch, color in zip(bp['boxes'], colors[:len(features)]):
    patch.set_facecolor(color)

# Customize the x-axis tick labels
ax.set_xticklabels(features)

# Customize the title and axes labels
ax.set_title('Boxplot of Selected Features')
ax.set_xlabel('Features')
ax.set_ylabel('Values')

# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Set the size of the plot
fig.set_size_inches(8, 6)

# Show the plot
plt.show()
Step 3(g): Check for Skewness using a Histogram

Skewed data can result in biased estimates of model parameters and reduce the accuracy of predictions. Therefore, it is important to assess the distribution of features and target variables to identify any potential issues and take appropriate measures to address them.

Here almost all the features and the target look normally distributed. There is some skewness in MedHouseVal, but not enough to warrant a transformation.

Note: For learning purposes, I have shown how to transform MedHouseVal for skewness in my previous tutorial on Simple Linear Regression. Feel free to check that out!

In [13]: # Select the features to plot
features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
            'Latitude', 'Longitude', 'MedHouseVal']

# Define the colors for each feature
colors = ['blue', 'red', 'green', 'orange', 'purple', 'brown', 'gray', 'pink', 'cyan']

# Set up the plot
fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(20, 12))
axs = axs.flatten()
fig.suptitle('Histograms of Selected Features', fontsize=16)

# Create a histogram for each feature
for i, feature in enumerate(features):
    # Get the data for the current feature
    data = california_df[feature].values

    # Calculate the mean of the data
    mean_value = np.mean(data)

    # Plot the histogram and mean line
    axs[i].hist(data, bins=50, color=colors[i], alpha=0.7)
    axs[i].axvline(mean_value, color='black', linestyle='--', linewidth=2)
    axs[i].set_title(feature)
    axs[i].set_xlabel('Value')
    axs[i].set_ylabel('Frequency')
    axs[i].grid(axis='y', linestyle='--', alpha=0.7)

# Adjust the spacing between subplots
fig.subplots_adjust(hspace=0.4, wspace=0.4)

# Show the plot
plt.show()

Step 3(h): Create a Vertical Correlation Heatmap


The correlation matrix shows the correlation coefficients between every pair of variables in the
dataset. A correlation coefficient ranges from -1 to 1 and measures the strength and direction
of the linear relationship between two variables. A coefficient of -1 indicates a perfect negative
correlation, a coefficient of 0 indicates no correlation, and a coefficient of 1 indicates a perfect
positive correlation.
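For reference, the coefficient shown in the heatmap is the Pearson correlation. A quick made-up example (the numbers below are illustrative) computing it by hand against np.corrcoef:

```python
import numpy as np

# Pearson correlation by hand vs. numpy, on illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + np.array([0.1, -0.2, 0.0, 0.2, -0.1])

# r = covariance(x, y) / (std(x) * std(y))
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2)))
r_numpy = np.corrcoef(x, y)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))  # both near +1: strong positive linear relationship
```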

In [14]: # Calculate the correlation matrix
corr_matrix = california_df.corr()

# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(10, 8))

# Create the heatmap
sns.heatmap(corr_matrix, cmap='Spectral_r', annot=True, fmt='.2f', linewidths=.5, ax=ax)

# Set the title and axis labels
ax.set_title('Correlation Heatmap for California Housing Dataset', fontsize=16)
ax.set_xlabel('Features', fontsize=14)
ax.set_ylabel('Features', fontsize=14)

# Rotate the x-axis labels for readability
plt.xticks(rotation=45, ha='right')

# Show the plot
plt.show()

Step 3(i): Perform Feature Scaling


Feature scaling is the process of transforming numerical features in a dataset to have similar
scales or ranges of values. The purpose of feature scaling is to ensure that all features have the
same level of impact on the model and to prevent certain features from dominating the model
simply because they have larger values. In linear regression, feature scaling is particularly
important because the coefficients of the model represent the change in the dependent
variable associated with a one-unit change in the independent variable. Scaling the features to
have similar ranges can result in a more accurate and reliable model with more accurate
representations of the relationships between the independent variables and the dependent
variable.

In [15]: scaler = StandardScaler()
california_df_scaled = scaler.fit_transform(california_df)
california_df_scaled = pd.DataFrame(california_df_scaled, columns=california_df.columns)
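What StandardScaler actually does is the z-score transform, (x − mean) / std, applied column by column. A small self-contained check on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales (illustrative values)
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_scaled = StandardScaler().fit_transform(X)

# The same transform written out by hand (StandardScaler uses the population std, ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_scaled, manual))                 # True
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))   # means ~0, stds ~1
```

After scaling, both columns contribute on the same footing, which is exactly what the Lasso and Ridge penalties below require.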

Step 3(j): Check for Assumptions using Scatter Plots


From the scatter plots, we can observe that there is a linear relationship between the dependent variable (Median House Value) and some of the independent variables, like Median Income and Average Rooms. However, some of the independent variables, like Longitude, Latitude, and House Age, do not have a clear linear relationship with the dependent variable. This suggests that a linear regression model might not be the best fit for predicting the Median House Value based on these variables.

In [16]: # Create scatter plots
fig, axs = plt.subplots(nrows=2, ncols=4, figsize=(25, 15))

axs[0,0].scatter(california_df_scaled['Latitude'], california_df['MedHouseVal'], color='blue')
axs[0,0].set_xlabel('Latitude')
axs[0,0].set_ylabel('Median House Value')
axs[0,0].set_title('Latitude vs Median House Value')

axs[0,1].scatter(california_df_scaled['Longitude'], california_df['MedHouseVal'], color='red')
axs[0,1].set_xlabel('Longitude')
axs[0,1].set_ylabel('Median House Value')
axs[0,1].set_title('Longitude vs Median House Value')

axs[0,2].scatter(california_df_scaled['HouseAge'], california_df['MedHouseVal'], color='green')
axs[0,2].set_xlabel('House Age')
axs[0,2].set_ylabel('Median House Value')
axs[0,2].set_title('House Age vs Median House Value')

axs[0,3].scatter(california_df_scaled['AveRooms'], california_df['MedHouseVal'], color='orange')
axs[0,3].set_xlabel('Average Rooms')
axs[0,3].set_ylabel('Median House Value')
axs[0,3].set_title('Average Rooms vs Median House Value')

axs[1,0].scatter(california_df_scaled['AveBedrms'], california_df['MedHouseVal'], color='purple')
axs[1,0].set_xlabel('Average Bedrooms')
axs[1,0].set_ylabel('Median House Value')
axs[1,0].set_title('Average Bedrooms vs Median House Value')

axs[1,1].scatter(california_df_scaled['Population'], california_df['MedHouseVal'], color='brown')
axs[1,1].set_xlabel('Population')
axs[1,1].set_ylabel('Median House Value')
axs[1,1].set_title('Population vs Median House Value')

axs[1,2].scatter(california_df_scaled['AveOccup'], california_df['MedHouseVal'], color='gray')
axs[1,2].set_xlabel('Average Occupancy')
axs[1,2].set_ylabel('Median House Value')
axs[1,2].set_title('Average Occupancy vs Median House Value')

axs[1,3].scatter(california_df_scaled['MedInc'], california_df['MedHouseVal'], color='pink')
axs[1,3].set_xlabel('Median Income')
axs[1,3].set_ylabel('Median House Value')
axs[1,3].set_title('Median Income vs Median House Value')

plt.show()

Step 4: Define Dependent and Independent Variables

In [17]: X = california_df_scaled.drop(['MedHouseVal'], axis=1)
y = california_df['MedHouseVal']  # We don't scale the dependent variable

Step 5: Perform Lasso (L1) Regression

Step 5(a): Use LassoCV to find the best alpha value


LassoCV uses cross-validation to determine the optimal value of the regularization parameter
(alpha) that balances the bias-variance tradeoff in the model. It iteratively fits the Lasso
regression model to different subsets of the training data and computes the cross-validated
mean squared error for each value of alpha. It then selects the alpha value that minimizes the
cross-validated mean squared error as the optimal value.

LassoCV is useful when you have a large number of features and want to perform feature
selection and regularization at the same time. By setting some of the coefficients to exactly
zero, it can help identify the most important features and simplify the model, which can lead to
better performance and interpretation.

In [18]: # Create a LassoCV object with 5-fold cross-validation
lasso_cv = LassoCV(cv=5)

# Fit the LassoCV model to the data
lasso_cv.fit(X, y)

# Print the best alpha
print("Best alpha:", lasso_cv.alpha_)

Best alpha: 0.0007860894985619018
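Under the hood, the fitted object exposes the alpha grid it searched (`alphas_`) and the per-fold errors (`mse_path_`); the chosen `alpha_` is simply the grid point with the lowest mean CV error. A self-contained sketch on synthetic data (the coefficients below are illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

lasso_cv = LassoCV(cv=5).fit(X, y)

mean_cv_mse = lasso_cv.mse_path_.mean(axis=1)        # average MSE across the 5 folds
best_alpha = lasso_cv.alphas_[np.argmin(mean_cv_mse)]
print(best_alpha, lasso_cv.alpha_)                   # the selected alpha_ minimizes mean CV error
```

Plotting `mean_cv_mse` against `lasso_cv.alphas_` is a quick way to see how flat or sharp the bias-variance trade-off is around the chosen alpha.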

Step 5(b): Using SelectFromModel to Select the Features for Regression

SelectFromModel is a feature selection method in scikit-learn that uses a supervised estimator to identify the most important features in a dataset. The estimator is trained on the data, and the feature importances are then ranked. Features with importance scores above a threshold are selected for use in modeling, while features with lower scores are excluded.

The process of selecting the best features is an important step in machine learning, as it can
improve model accuracy and reduce overfitting.

In [19]: # Instantiate the Lasso model with the best alpha
lasso = Lasso(alpha=lasso_cv.alpha_, max_iter=10000)

# Select the features using Lasso regularization
selector = SelectFromModel(estimator=lasso)
X_selected = selector.fit_transform(X, y)

# Print the selected features
california_features = X.columns
selected_features = california_features[selector.get_support()]
print(selected_features)

Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude'],
      dtype='object')
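All eight features survive here because, for an L1-penalized estimator, SelectFromModel's default threshold is a tiny 1e-5: only coefficients that Lasso drove (essentially) to exactly zero get dropped. A stricter cut can be requested explicitly via the `threshold` parameter. A self-contained sketch on synthetic data, where the alpha and threshold choices are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# 8 features, only 3 of which actually drive the target
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=1.0, random_state=0)

# threshold='mean' keeps only features whose |coefficient| is at least the mean |coefficient|
selector = SelectFromModel(Lasso(alpha=1.0), threshold='mean').fit(X, y)
print(selector.get_support().sum())  # far fewer than 8 features survive
```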

Step 5(c): Perform Lasso Regression Using the Selected Features

In [20]: # Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)

# Create a LassoCV object with 5-fold cross-validation
lasso_cv = LassoCV(cv=5)

# Fit the LassoCV model to the training data
lasso_cv.fit(X_train, y_train)

# Predict the median house values for the testing data
y_pred = lasso_cv.predict(X_test)

Step 5(d): Evaluating the performance of the Model

In [21]: # Calculate the root mean squared error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root mean squared error:", rmse)

# Calculate the R2 score
r2 = r2_score(y_test, y_pred)
print("R2 score:", r2)

Root mean squared error: 0.6802817409760694
R2 score: 0.6528874328086174

Step 6: Performing Ridge (L2) Regression

Step 6(a): Split the data into training and testing sets in the ratio 70:30

In [22]: # Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Step 6(b): Use RidgeCV to find the best value of alpha


RidgeCV performs cross-validation to automatically select the optimal value of the
regularization parameter (alpha) that balances the bias-variance trade-off. The algorithm fits the
model using different values of alpha and selects the value that results in the lowest mean
squared error (MSE). This process helps to prevent overfitting and improve the model's
generalization performance.

In [23]: # Create a RidgeCV object with 5-fold cross-validation (default alpha grid: 0.1, 1.0, 10.0)
ridge_cv = RidgeCV(cv=5)

# Fit the Ridge regression model to the training data
ridge_cv.fit(X_train, y_train)

# Print the best alpha value
print("Best alpha value:", ridge_cv.alpha_)

Best alpha value: 1.0
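Note that RidgeCV's default grid contains only (0.1, 1.0, 10.0), so "best alpha: 1.0" just means 1.0 beat those two alternatives. Passing a wider, log-spaced grid is a common refinement; a self-contained sketch on made-up data (the grid bounds and coefficients are illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + rng.normal(scale=0.5, size=200)

# Search 13 alphas from 1e-3 to 1e3 instead of the default (0.1, 1.0, 10.0)
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print(ridge_cv.alpha_)  # chosen from the wider grid
```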

Step 6(c): Perform Ridge Regression

In [24]: # Create a Ridge regression object with the best alpha value
ridge = Ridge(alpha=ridge_cv.alpha_)

# Fit the Ridge regression model to the training data
ridge.fit(X_train, y_train)

# Predict the median house values for the testing data
y_pred = ridge.predict(X_test)

Step 6(d): Evaluating the performance of the Model

In [25]: # Compute the mean squared error on the testing data
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error:", mse)

# Compute the R-squared on the testing data
r2 = ridge.score(X_test, y_test)
print("R-squared:", r2)

Mean squared error: 0.46273885187759334
R-squared: 0.6529207316404193

Conclusion:
In this Ridge and Lasso regression tutorial, we explored how to use Ridge and Lasso regression to model the relationship between a set of predictor variables and a target, specifically the median house value in California. We started by loading and cleaning the dataset, and then visualized the relationships between the variables using scatter plots.

Next, we split the data into training and testing sets and used Ridge and Lasso regression to fit models to the training data. We used cross-validation to select the alpha that achieves a good balance between bias and variance. We then used the models to make predictions on the test data and evaluated their performance using mean squared error and the R-squared score.

Compared to polynomial regression, Ridge and Lasso regression provided a way to address
overfitting by introducing a penalty term to the model's cost function. Ridge regression added a
penalty term proportional to the square of the magnitude of the coefficients, while Lasso
regression added a penalty term proportional to the absolute value of the coefficients. This
encourages the models to select only the most important features and reduce the impact of less
important features on the model's predictions.

While polynomial regression may have performed better in this particular case (Shown in the
previous tutorial), Ridge and Lasso regression are often used in situations where there is high
multicollinearity among the independent variables, which can lead to overfitting in traditional
linear regression. They can also be useful in feature selection, as they can reduce the impact of
less important variables in the model.

Common questions

Powered by AI

Mean squared error (MSE) and R-squared are complementary metrics used to evaluate model performance. MSE measures the average squared difference between the observed and predicted values, providing a measure of accuracy; lower values indicate better model performance . R-squared, on the other hand, indicates the proportion of variance in the dependent variable that is predictable from the independent variables, assessing model fit; a higher value suggests a better model fit . Utilizing both metrics provides a fuller picture of a model's accuracy and its ability to generalize to new data, ensuring robust evaluation .

The alpha parameter in both Lasso and Ridge regression determines the strength of the regularization penalty. In Lasso, a larger alpha value increases the penalty, leading more coefficients to be set to zero, effectively performing stronger feature selection . In Ridge regression, a larger alpha value implies a stronger penalty on large coefficients, which leads to a simpler model with reduced coefficient magnitudes . The alpha parameter is typically determined through cross-validation to find the value that provides an optimal balance between bias and variance, ensuring the model generalizes well to unseen data .

Handling outliers and missing values is crucial because they can significantly skew the results of machine learning algorithms. Outliers can distort the mean, standard deviation, and other statistical measures, affecting the model's accuracy and leading to biased predictions. Missing values, if not addressed, can cause errors or biases in model training, since most algorithms require complete data for accurate computations. Proper preprocessing that handles outliers and missing data ensures that the training data reflects the true patterns and relationships in the dataset, leading to more reliable and generalizable model performance.
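A minimal preprocessing sketch with pandas, using a hypothetical "income" column (the median fill and the 1st/99th percentile caps are illustrative choices, not the tutorial's exact steps):

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing value and an extreme outlier
df = pd.DataFrame({"income": [3.2, 4.1, np.nan, 2.8, 150.0]})

# Fill the missing value with the median, which is robust to outliers
df["income"] = df["income"].fillna(df["income"].median())

# Cap extreme values at the 1st and 99th percentiles
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(low, high)

print(df["income"].tolist())
```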

Feature scaling is crucial in linear regression models because it ensures that all features contribute equally to the model's predictions by aligning their scales, especially when regularization is involved. Without feature scaling, coefficients of larger-scaled features could become disproportionately large, impacting model stability and interpretability. In regularized models like Ridge and Lasso, unscaled features can lead to improper shrinkage or selection, because the penalty imposed is relative to the scale of the coefficients. Scaling helps ensure that the regularization term penalizes all features equally, leading to more accurate and reliable models.
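In scikit-learn this is commonly expressed as a pipeline, so the scaler is fit only on training data; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed setup for illustration)
X, y = make_regression(n_samples=100, n_features=5, random_state=42)

# Standardize features before the L2 penalty so all coefficients
# are penalized on a comparable scale
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)
```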

Lasso regression is preferred in high-dimensional datasets with many irrelevant features due to its ability to perform automatic variable selection by setting some of the regression coefficients to zero. This capability helps to simplify the model by eliminating superfluous features, which is particularly beneficial when dimensionality reduction and sparsity are desired. Ridge regression, while effective for multicollinearity, does not zero out coefficients and thus includes all features, which may not be ideal when irrelevant features dominate the dataset. Lasso's feature selection results in a more interpretable, parsimonious model, suited for scenarios where model simplicity is prioritized.
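This sparsity is easy to observe on synthetic data where only a few of many features are informative (the feature counts and alpha below are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 20 features, but only 3 carry signal (hypothetical setup)
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
n_selected = int(np.sum(lasso.coef_ != 0))
print(f"{n_selected} of {lasso.coef_.size} coefficients are non-zero")
```

The uninformative features receive coefficients of exactly zero, which is the automatic variable selection the paragraph describes.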

A high R-squared value in Ridge and Lasso regressions indicates that a high proportion of the variance in the dependent variable is explained by the model. However, in the context of regularization, this does not automatically imply a good model. Overfitting is a common concern: a model may fit the training data well (high R-squared) but perform poorly on new, unseen data. Regularization techniques like Ridge and Lasso help mitigate overfitting by penalizing larger coefficients, thus maintaining model simplicity and generalizability. Therefore, a balanced interpretation of R-squared alongside other metrics, such as RMSE and cross-validation scores, is critical for evaluating model performance.
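The gap between training R-squared and cross-validated R-squared is a simple overfitting check; a minimal sketch on synthetic data (the data shape and alpha are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Many features relative to samples encourages some overfitting
X, y = make_regression(n_samples=100, n_features=20, noise=20.0, random_state=0)

ridge = Ridge(alpha=1.0)
train_r2 = ridge.fit(X, y).score(X, y)                      # R^2 on training data
cv_r2 = cross_val_score(ridge, X, y, cv=5, scoring="r2").mean()  # held-out R^2

print(f"Train R^2: {train_r2:.3f}, CV R^2: {cv_r2:.3f}")
```

A large drop from the training score to the cross-validated score is the warning sign the paragraph describes.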

Winsorization helps deal with outliers by capping extreme values at a specified percentile range, effectively reducing the influence of outliers on the dataset's overall distribution. By replacing extreme values with threshold values (e.g., the 1st and 95th percentiles), Winsorization can lead to a more normal distribution of data, which can improve model robustness. However, one disadvantage of Winsorization is that it can lead to loss of information regarding the true variability of the data, as the actual extreme values are not retained, potentially masking true data patterns or relationships.
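SciPy provides this directly via `scipy.stats.mstats.winsorize`; a minimal sketch with assumed 10%/10% limits (chosen only so one value on each side is capped in this tiny example):

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical data with one extreme high value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Cap the lowest 10% and highest 10% of values at the nearest retained value
capped = winsorize(data, limits=[0.1, 0.1])
print(capped.min(), capped.max())
```

Here the outlier 100 is replaced by the next-largest value rather than dropped, so the sample size is preserved but the true extremes are lost, as the paragraph notes.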

Lasso regression, also known as L1 regularization, handles multicollinearity by performing feature selection, setting some coefficients to zero and thus reducing model complexity by eliminating irrelevant features. Ridge regression, on the other hand, uses L2 regularization to add a penalty term proportional to the sum of squared coefficients, shrinking all coefficients towards zero without setting them explicitly to zero. It reduces the impact of multicollinearity by stabilizing coefficient estimates without eliminating features, which keeps all predictors in the model but with lower impact.
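The contrast shows up directly in the fitted coefficients; a minimal sketch fitting both models to the same synthetic data (setup and alpha are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso produces exact zeros; Ridge only shrinks, keeping every feature
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```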

A correlation heatmap visually represents the correlation coefficients between variables, making it easier to identify patterns of linear relationships during data exploration. This can help detect multicollinearity issues that may affect regression models, guiding feature selection or dimensionality reduction efforts. However, its limitations include a focus on linear relationships only, potentially overlooking non-linear associations. Additionally, heatmaps for large datasets can be overwhelming, leading to difficulty in interpretation without proper context or thresholds for significant correlation. Overall, while beneficial, heatmaps should be complemented with other analyses to fully understand data relationships.
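The correlation matrix underlying such a heatmap comes from `DataFrame.corr()`; a minimal sketch on synthetic data with one deliberately collinear pair (column names and noise levels are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_copy": x + rng.normal(scale=0.1, size=200),  # nearly collinear with x
    "noise": rng.normal(size=200),                  # unrelated feature
})

corr = df.corr()  # pairwise Pearson correlations
print(corr.round(2))
# To visualize as a heatmap: seaborn.heatmap(corr, annot=True, cmap="coolwarm")
```

The near-1.0 entry for `x` vs `x_copy` is the multicollinearity signal a heatmap makes visible at a glance.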

Lasso regression incorporates feature selection within the model fitting process by applying L1 regularization, shrinking some coefficients to zero; it is an automated process driven by the regularization strength (alpha). In contrast, stepwise selection is a manual, iterative process that involves adding or removing predictors based on specific criteria such as p-values or the Akaike Information Criterion (AIC). Lasso is typically more efficient and handles high-dimensional data better, whereas stepwise selection is more prone to overfitting due to multiple testing. Additionally, Lasso simplifies the model automatically, whereas stepwise selection requires explicit criteria for the inclusion or exclusion of variables.
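In scikit-learn, Lasso-driven selection can be wrapped in `SelectFromModel`, so the selection happens inside a single fit with no manual stepwise loop; a minimal sketch on synthetic data (dimensions and alpha are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# 15 features, only 4 informative (hypothetical setup)
X, y = make_regression(n_samples=100, n_features=15, n_informative=4,
                       noise=1.0, random_state=1)

# One fit selects features via the L1 penalty — no iterative add/remove steps
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
selected = selector.get_support(indices=True)
print("Selected feature indices:", selected)
```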

You might also like