
Ridge, Lasso and Elastic Net

Presented by: Pham Gi Nguyen
Student ID: BA12-142
Course: AI and Machine Learning
Lecturer: Nguyen Cam

01
Overview of Topics

1. Advancements with Regression
2. Ridge Regression
3. Lasso Regression
4. Elastic Net

02
1. Introduction to Regression Analysis

01 If we continue to draw from OLS as our only approach to linear regression techniques, methodologically speaking, we are still within the late 1800s and early 1900s timeframe.

02 With advancements in computing technology, regression analysis can be carried out using a variety of different statistical techniques, which has led to the development of new tools and methods.

03 The techniques we will discuss today will bring us up to date with advancements in regression analysis.

04 In modern data analysis, we will often find data with a very high number of independent variables, and we need better regression techniques to handle this high-dimensional modeling.

03
Review of Linear Regression Analysis

Simple Linear Regression Formula

The simple regression model can be represented as follows:

Y = β0 + β1X1 + ϵ

The β0 represents the Y-intercept value, the coefficient β1 represents the slope of the line, X1 is an independent variable, and ϵ is the error term. The error term is the value needed to correct for the prediction error between the observed and predicted values.

The output of a regression analysis will produce a coefficient table similar to the one below.
• This table shows that the intercept is -114.326 and the Height coefficient is 106.505 ± 11.55.
• This can be interpreted as: for each unit increase in X, we can expect Y to increase by 106.5.
• Also, the t value and Pr > |t| indicate that these variables are statistically significant at the 0.05 level and can be included in the model.
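To make the coefficient-table reading concrete, here is a minimal sketch (not part of the original deck) that fits a simple linear regression with statsmodels on synthetic data; the library, variable names, and data are illustrative assumptions, chosen only so the summary shows the same kind of intercept, slope, t value, and Pr > |t| columns discussed above.

```python
# Minimal sketch, not from the slides: synthetic height-vs-weight style data,
# fitted with statsmodels OLS so that model.summary() prints a coefficient
# table (coef, std err, t, P>|t|) like the one shown on this slide.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
height = rng.uniform(1.4, 2.0, size=50)                        # predictor X1
weight = -114.0 + 106.5 * height + rng.normal(0, 8, size=50)   # Y = β0 + β1·X1 + ϵ

X = sm.add_constant(height)        # adds the intercept column for β0
model = sm.OLS(weight, X).fit()
print(model.summary())             # coefficient table with t values and Pr > |t|
```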

04
Ordinary Least Squares

What is Ordinary Least Squares or OLS?

• In statistics, ordinary least squares (OLS) or linear least squares is a method for estimating the unknown parameters in a linear regression model.

• The goal of OLS is to minimize the differences between the observed responses in some arbitrary dataset and the responses predicted by the linear approximation of the data.
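As a short formal sketch (this equation is not on the original slide), "minimizing the differences" means choosing the coefficients that minimize the residual sum of squares:

```latex
\hat{\beta}^{\text{OLS}} = \arg\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}
```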

05
2. Ridge Regression

01 Ridge Regression is a modeling technique that works to solve the multicollinearity problem in OLS models through the incorporation of the shrinkage parameter, λ.

02 The assumptions of the model are the same as OLS: linearity, constant variance, and independence. Normality need not be assumed.

03 Additionally, multiple linear regression (OLS) has no way to identify a smaller subset of important variables.

06
Ridge Regression

• In OLS regression, the equation Y = β0 + β1X1 + β2X2 + … + ϵ can be represented in matrix notation as follows:

  X'Xβ = X'Y

• Where X is the design matrix with [X]ij = xij, Y is the vector of responses (y1, …, yn), and β is the vector of coefficients (β1, …, βp).

• This equation can be rearranged to show the following:

  β̂ = (X'X)⁻¹X'Y

• Where R = X'X, and R is the correlation matrix of the independent variables (assuming standardized predictors).

• These estimates are unbiased, so the expected values of the estimates are the population values. That is,

  E(β̂) = β

• The variance-covariance matrix of the estimates is

  V(β̂) = σ²R⁻¹

07
Ridge Regression

• Ridge Regression proceeds by adding a small value, λ, to the diagonal elements of the correlation matrix. (This is where ridge regression gets its name, since the diagonal of ones may be thought of as a ridge.)

• λ is a positive value less than one (usually less than 0.3).

• The amount of bias of the estimator and the covariance matrix of the estimates are given by the expressions below.
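The two formulas referenced above appeared as images in the original deck. As a hedged reconstruction using standard results, and the R = X'X notation from the previous slide, they take the form:

```latex
% Standard forms (an assumption; not reproduced from the slide's images):
\hat{\beta}^{\text{ridge}} = (R + \lambda I)^{-1} X'Y
% bias introduced by the shrinkage:
E\big[\hat{\beta}^{\text{ridge}}\big] - \beta = \big[(R + \lambda I)^{-1} R - I\big]\beta = -\lambda\,(R + \lambda I)^{-1}\beta
% variance-covariance matrix of the ridge estimates:
V\big(\hat{\beta}^{\text{ridge}}\big) = \sigma^{2}\,(R + \lambda I)^{-1} R\,(R + \lambda I)^{-1}
```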

08
Ridge Trace

• One of the main obstacles in using ridge regression is choosing an appropriate value of λ. The inventors of ridge regression suggested using a graphic which they called a "ridge trace."

• A ridge trace is a plot that shows the ridge regression coefficients as a function of λ.

• When viewing the ridge trace, we are looking for the λ for which the regression coefficients have stabilized. Often the coefficients will vary widely for small values of λ and then stabilize.

• Choose the smallest value of λ possible (which introduces the smallest bias) after which the regression coefficients seem to have remained constant.

• Note: Increasing λ will eventually drive the regression coefficients to 0.
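A minimal sketch of how a ridge trace can be drawn (not part of the original deck); the data, the λ grid, and the use of scikit-learn and matplotlib are illustrative assumptions:

```python
# Sketch: plot ridge coefficient paths against lambda on a log scale.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=80, n_features=8, noise=15.0, random_state=1)
X = StandardScaler().fit_transform(X)            # predictors on a common scale

lambdas = np.logspace(-3, 3, 50)                 # grid of shrinkage values
coef_paths = [Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas]

plt.plot(lambdas, coef_paths)                    # one path per coefficient
plt.xscale("log")
plt.xlabel("lambda (called alpha in scikit-learn)")
plt.ylabel("ridge coefficient")
plt.title("Ridge trace")
plt.show()
```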

09
Scale in Ridge Regression

• Here is a visual representation of the ridge coefficients for λ versus a linear regression.

• We can see that the size of the (penalized) coefficients has decreased through our shrinking function, ℓ2.

• It is also important to point out that in ridge regression we usually leave the intercept unpenalized because it is not on the same scale as the other predictors.

• The λ penalty is unfair if the predictor variables are not on the same scale.

• Therefore, if we know that the variables are not measured in the same units, we typically center and scale all of the variables before building a ridge regression.
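A minimal sketch of the center-and-scale recommendation above (not from the original deck); the pipeline, data, and penalty value are illustrative assumptions, and note that scikit-learn's Ridge leaves the intercept unpenalized by default, matching the point about the intercept:

```python
# Sketch: standardize predictors before fitting ridge regression.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))   # alpha plays the role of lambda
model.fit(X, y)
print(model.named_steps["ridge"].coef_)        # penalized slopes on the standardized scale
print(model.named_steps["ridge"].intercept_)   # intercept, left unpenalized
```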

10
Variable Selection

• The problem of picking out the relevant variables from a larger set is called variable selection.

• Suppose there is a subset of coefficients that are identically zero. This means that the mean response doesn't depend on these predictors at all.

• The red paths on the plot are the true non-zero coefficients; the grey paths are the true zeros.

• The vertical dashed line is the point at which ridge regression's MSE starts losing to linear regression.

• Note: the grey coefficient paths are not exactly zero; they are shrunken, but non-zero.

11
Variable Selection

• We can show that ridge regression doesn't set the coefficients exactly to zero unless λ = ∞, in which case they are all zero.

• Therefore, ridge regression cannot perform variable selection.

• Ridge regression performs well when there is a subset of true coefficients that are small or zero.

• It doesn't do well when all of the true coefficients are moderately large; however, it will still perform better than OLS regression.

12
Ridge Regression

Advantages
• Reduces Overfitting: By adding a penalty to the size of coefficients, Ridge Regression reduces the risk of overfitting.
• Handles Multicollinearity: It is effective in dealing with multicollinearity (when predictor variables are highly correlated).
• Computationally Efficient: It is computationally efficient and can be solved using standard linear algebra techniques.

Disadvantages
• Does Not Perform Variable Selection: Ridge Regression does not set any coefficients to zero, so it does not perform variable selection.
• Interpretability: The model can be less interpretable because it includes all predictors in the final model.

Potential Applications

Genomics and Bioinformatics:
• Gene Expression Data: Ridge Regression is used to handle multicollinearity among gene expression data and improve the accuracy of gene function prediction.

Economics and Finance:
• Portfolio Optimization: It's applied in financial models to predict returns and manage portfolios, especially when dealing with a large number of correlated financial indicators.

Healthcare:
• Disease Risk Prediction: Ridge Regression helps in predicting disease risk by considering multiple correlated health metrics.

13
3. Lasso Regression

01 The lasso combines some of the shrinking advantages of ridge regression with variable selection.

02 Lasso is an acronym for 'Least Absolute Shrinkage and Selection Operator'.

03 The lasso is very competitive with ridge regression in regards to prediction error.

04 The only difference between the lasso and ridge regression is that ridge uses the ℓ2 penalty ∥β∥2², whereas the lasso uses the ℓ1 penalty ∥β∥1.
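To make point 04 concrete, here is a sketch of the two penalized criteria in a standard form (consistent with the slide, but the exact formulas are not copied from it):

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\ \|y - X\beta\|_2^{2} + \lambda\,\|\beta\|_2^{2}
\qquad
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\ \|y - X\beta\|_2^{2} + \lambda\,\|\beta\|_1
```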

14
Lasso Regression

• The tuning parameter λ controls the strength of the penalty and, like ridge regression, we get β̂lasso = the linear regression estimate when λ = 0, and β̂lasso = 0 when λ = ∞.

• For λ in between these two extremes, we are balancing two ideas: fitting a linear model of y on X, and shrinking the coefficients.

• The nature of the ℓ1 penalty causes some of the coefficients to be shrunken to exactly zero.

• This is what makes the lasso different from ridge regression: it is able to perform variable selection in the linear model.

• Important: As λ increases, more coefficients are set to zero (fewer variables selected), and among the non-zero coefficients, more shrinkage is employed.
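A minimal sketch of this behaviour (not from the original deck); the data and the λ values are illustrative assumptions, and scikit-learn again calls the penalty strength alpha:

```python
# Sketch: count how many lasso coefficients are exactly zero as lambda grows.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

for lam in [0.1, 1.0, 10.0, 100.0]:
    coef = Lasso(alpha=lam, max_iter=10_000).fit(X, y).coef_
    print(f"lambda={lam:>6}: {np.sum(coef != 0)} non-zero coefficients")
```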

15
Lasso Regression

Because the lasso sets some coefficients to exactly zero, it performs variable selection in the linear model.

16
Lasso Regression

We can also use plots of the degrees of freedom (df) to put different estimates on equal footing.

17
Constrained Form

• It can be helpful to think about our ℓ1 and ℓ2 penalties in the following form: the penalized problem can be rewritten in an equivalent constrained form (see the sketch after this list).

• t is a tuning parameter (which we have been calling λ earlier).

• The usual OLS regression solves the unconstrained least squares problem; these estimates constrain the coefficient vector to lie in some geometric shape centered around the origin.
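The formulas on this slide were images in the original deck; a hedged reconstruction of the standard constrained forms they refer to is:

```latex
\text{lasso:}\quad \min_{\beta}\ \|y - X\beta\|_2^{2} \ \ \text{subject to}\ \ \|\beta\|_1 \le t
\qquad
\text{ridge:}\quad \min_{\beta}\ \|y - X\beta\|_2^{2} \ \ \text{subject to}\ \ \|\beta\|_2^{2} \le t
```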

18
Constrained Form

This generally reduces the variance because it keeps the estimate close to zero. But the shape that we choose really matters!

The contour lines are the least squares error function. The blue diamond is the constraint region for the lasso regression; the blue circle is the constraint region for the ridge regression.

19
Lasso Regression

Advantages
• Performs Variable Selection: Lasso can shrink some coefficients to zero, effectively performing variable selection and producing simpler, more interpretable models.
• Reduces Overfitting: Like Ridge Regression, Lasso also reduces the risk of overfitting by adding a penalty to the size of coefficients.

Disadvantages
• Computationally Intensive: It can be more computationally intensive than Ridge Regression, especially with a large number of predictors.
• Bias: Lasso can introduce bias into the model, especially if the true relationship between predictors and the response is not sparse.

Potential Applications

Feature Selection in Machine Learning:
• High-Dimensional Data: Lasso is extensively used for selecting important features in datasets with a large number of variables, such as text classification and image recognition.

Credit Scoring:
• Risk Assessment: In finance, Lasso can identify key predictors of credit risk from a large set of financial indicators.

Marketing:
• Customer Segmentation: Helps in identifying the most influential factors that differentiate between different customer segments.

20
4. Elastic Net

01 When we are working with high-dimensional data, correlations between the variables can be high, resulting in multicollinearity.

02 These correlated variables can sometimes form groups or clusters of correlated variables.

03 There are many times where we would want to include the entire group in the model selection if one variable has been selected.

04 This can be thought of as an elastic net catching a school of fish instead of singling out a single fish.

21
Elastic Net

• The total number of variables that the lasso variable selection procedure can select is bounded by the total number of samples in the dataset.

• Additionally, the lasso fails to perform grouped selection. It tends to select one variable from a group and ignore the others.

• The elastic net forms a hybrid of the ℓ1 and ℓ2 penalties.

22
Elastic Net

• Ridge, Lasso, and Elastic Net are all part of the same family, sharing a penalty term (see the sketch after this list).

• If α = 0, then we have Ridge Regression.
• If α = 1, then we have the LASSO.
• If 0 < α < 1, then we have the elastic net.
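The penalty term itself was an image in the original deck; one common parameterization that is consistent with the α cases above (an assumption, not necessarily the slide's exact formula) is:

```latex
P_{\lambda,\alpha}(\beta) = \lambda\Big(\alpha\,\|\beta\|_1 + \tfrac{1-\alpha}{2}\,\|\beta\|_2^{2}\Big)
```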

23
Elastic Net

• The specification of the elastic net penalty above is actually considered a naïve elastic net.

• Unfortunately, the naïve elastic net does not perform well in practice.

• The parameters are penalized twice with the same α level (this is why it is called naïve).

• To correct this, we can use the rescaling sketched below.
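The correction formula was an image in the original slides. In the (λ1, λ2) parameterization of the penalty, the standard fix rescales the naïve estimate to undo the double shrinkage; as a hedged sketch:

```latex
\hat{\beta}^{\text{elastic net}} = (1 + \lambda_2)\,\hat{\beta}^{\text{naive}}
```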

24
Elastic Net - Constraint

Here is the visualization of the constrained region for the elastic net.

(Figure: visualizations of the constraint regions for Ridge Regression, the Lasso, and the Elastic Net.)

25
Elastic Net

Advantages
• Combines Ridge and Lasso: Elastic Net combines the penalties of Ridge and Lasso, providing a balance between the two methods.
• Handles Multicollinearity and Variable Selection: It can handle multicollinearity and perform variable selection simultaneously.
• Flexibility: The mixing parameter allows for flexibility in the amount of Ridge and Lasso penalties applied.

Disadvantages
• Complexity: The model can be more complex to tune due to the additional mixing parameter.
• Computationally Intensive: Like Lasso, Elastic Net can be computationally intensive, especially with a large number of predictors.

Potential Applications

Predictive Modeling in Medicine:
• Genetic Studies: Elastic Net is useful in predictive modeling where there are groups of correlated genes, combining the strengths of Ridge and Lasso.

Chemoinformatics:
• Drug Discovery: Used to identify important molecular descriptors and predict the biological activity of compounds.

Social Science Research:
• Survey Data Analysis: Elastic Net helps in analyzing large survey datasets by selecting significant predictors while handling multicollinearity.
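A minimal sketch of an elastic net fit (not from the original deck); the data are illustrative, and note the naming mismatch: scikit-learn's alpha is the overall penalty strength (our λ), while l1_ratio is the mixing parameter (the α of the previous slides):

```python
# Sketch: elastic net with standardized predictors and a 50/50 l1/l2 mix.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=120, n_features=30, n_informative=8,
                       noise=5.0, random_state=0)

# l1_ratio=0 behaves like ridge, l1_ratio=1 like lasso, values in between give the elastic net
model = make_pipeline(StandardScaler(),
                      ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10_000))
model.fit(X, y)
print(model.named_steps["elasticnet"].coef_)   # partly sparse coefficient vector
```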
26
References

Ridge Regression
• Ridge Regression (Wikipedia)
• Ridge Regression Lecture Notes from "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman.

Lasso Regression
• Lasso Regression (Wikipedia)
• Lasso Regression Lecture Notes by Trevor Hastie.

Elastic Net
• Elastic Net (Wikipedia)
• Elastic Net Regression Tutorial on Cross Validated, a Stack Exchange site.

Most Important: TRUST ME BRO

27
Thank You For Watching

28
