Group30 Linear Regression
[Figure: training examples plotted with the inputs on the horizontal axis and the outputs on the vertical axis]
• Assume the function that approximates the I/O relationship to be a linear model: $y_n \approx f(\boldsymbol{x}_n) = \boldsymbol{w}^\top \boldsymbol{x}_n$, for $n = 1, 2, \ldots, N$
  (Can also write all of them compactly using matrix-vector notation as $\boldsymbol{y} \approx \boldsymbol{X}\boldsymbol{w}$)
• Let's write the total error or "loss" of this model over the training data as $L(\boldsymbol{w}) = \sum_{n=1}^{N} \ell\big(y_n, f(\boldsymbol{x}_n)\big)$, where each term $\ell\big(y_n, f(\boldsymbol{x}_n)\big)$ measures the prediction error or "loss" or "deviation" of the model on a single training input
• Goal of learning is to find the $\boldsymbol{w}$ that minimizes this loss + does well on test data
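For illustration, a minimal NumPy sketch of the model in matrix-vector form (the toy data, the names X, y, w, and the squared-error choice of loss are assumptions made for this example, not part of the slides):

```python
import numpy as np

# Toy training data: N = 5 examples, D = 3 features (made-up numbers)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))            # each row is x_n^T
w_true = np.array([1.0, -2.0, 0.5])    # "true" weights used to generate y
y = X @ w_true + 0.1 * rng.normal(size=5)

# Linear model: predictions for all n at once, y_hat ≈ X w
w = np.zeros(3)                        # some candidate weight vector
y_hat = X @ w

# Total loss over the training data (here: squared error, one common choice)
loss = np.sum((y - y_hat) ** 2)
print(loss)
```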
Linear Regression
• With one-dimensional inputs, linear regression would look like
[Figure: the fitted line y = wx in the input-output plane; for a training point (x_n, y_n), the vertical gap y_n - w x_n is the model's error on that point]
• What if a line/plane doesn't model the input-output relationship very well, e.g., if their relationship is better modeled by a nonlinear curve or curved surface? Do linear models become useless in such cases?
• No. We can even fit a curve using a linear model after suitably transforming the inputs.
[Figure: with the original (single) feature a nonlinear curve is needed; with two features (after transforming the inputs) a plane, i.e., a linear model, can fit the data]
• The line/plane must also predict outputs well on the unseen (test) inputs.
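A small sketch of the "transform the inputs, then fit a linear model" idea (the polynomial feature map and the toy data are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 50)
y = 2.0 * x**2 - x + rng.normal(scale=0.5, size=x.shape)   # nonlinear relationship

# Transform the scalar input into a feature vector phi(x) = [1, x, x^2]
Phi = np.column_stack([np.ones_like(x), x, x**2])

# The model is still linear in w, so ordinary least squares applies
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)   # roughly recovers [0, -1, 2]
```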
Loss Functions for Regression
• Many possible loss functions for regression problems
• Choice of loss function usually depends on the nature of the data. Also, some loss functions result in an easier optimization problem than others.
• Squared loss: $\big(y_n - f(\boldsymbol{x}_n)\big)^2$. Very commonly used for regression. Leads to an easy-to-solve optimization problem.
• Absolute loss: $\big|y_n - f(\boldsymbol{x}_n)\big|$. Grows more slowly than squared loss, thus better suited when the data has some outliers (inputs on which the model makes large errors).
• Huber loss: squared loss for small errors (say up to $\delta$); absolute loss for larger errors. Good for data with outliers.
• $\epsilon$-insensitive loss (a.k.a. Vapnik loss): zero loss for small errors (say up to $\epsilon$); loss $\big|y_n - f(\boldsymbol{x}_n)\big| - \epsilon$ for larger errors. (Note: can also use squared loss instead of absolute loss for the larger errors.)
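A rough NumPy sketch of these loss functions (the exact scaling constants in the Huber loss are one common convention, assumed here; the default values of delta and eps are also illustrative):

```python
import numpy as np

def squared_loss(y, f):
    return (y - f) ** 2

def absolute_loss(y, f):
    return np.abs(y - f)

def huber_loss(y, f, delta=1.0):
    # Squared loss for small errors (up to delta), absolute-type loss beyond that
    e = np.abs(y - f)
    return np.where(e <= delta, 0.5 * e**2, delta * (e - 0.5 * delta))

def eps_insensitive_loss(y, f, eps=0.1):
    # Zero loss for errors up to eps, absolute loss (shifted by eps) beyond that
    return np.maximum(0.0, np.abs(y - f) - eps)

y, f = np.array([1.0, 2.0, 10.0]), np.array([1.1, 1.5, 2.0])
for loss in (squared_loss, absolute_loss, huber_loss, eps_insensitive_loss):
    print(loss.__name__, loss(y, f))
```

Note how the large error on the third example is penalized much more heavily by the squared loss than by the other three.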
Loss Functions for Regression (contd.)
[Figure: plots of the loss functions' values against the prediction error, showing how slowly or quickly each grows compared to the squared loss values in a dataset]
• Loss functions such as the squared loss are easy to optimize (convex and differentiable)
• Minimizing the total squared loss $\sum_{n=1}^{N} (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)^2$ w.r.t. $\boldsymbol{w}$: setting the derivative to zero (note that the partial derivative of the dot product $\boldsymbol{w}^\top \boldsymbol{x}_n$ w.r.t. each element of $\boldsymbol{w}$ gives $\boldsymbol{x}_n$, which is the same size as $\boldsymbol{w}$) yields
$$\sum_{n=1}^{N} 2\,\boldsymbol{x}_n \big(y_n - \boldsymbol{x}_n^\top \boldsymbol{w}\big) = 0 \;\;\Rightarrow\;\; \sum_{n=1}^{N} \big(y_n \boldsymbol{x}_n - \boldsymbol{x}_n \boldsymbol{x}_n^\top \boldsymbol{w}\big) = 0$$
• Solving for $\boldsymbol{w}$ gives the least squares solution
$$\boldsymbol{w} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$$
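A minimal NumPy sketch of this closed-form solution on toy data (np.linalg.solve is used instead of explicitly inverting X^T X, a standard numerical choice; the data is made up):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 100, 4
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

# Closed-form least squares: w = (X^T X)^{-1} X^T y
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_ls, w_true, atol=0.1))   # should recover weights close to w_true
```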
Problem(s) with the Solution!
• We minimized the objective w.r.t. $\boldsymbol{w}$ and got
$$\boldsymbol{w} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$$
• Problem: The matrix $\boldsymbol{X}^\top \boldsymbol{X}$ may not be invertible
  • This may lead to non-unique solutions for $\boldsymbol{w}$
• Problem: Overfitting, since we only minimized the loss defined on the training data
  • Weights may become arbitrarily large to fit the training data perfectly
  • Such weights may however perform poorly on the test data
• One solution: Minimize a regularized objective $L(\boldsymbol{w}) + \lambda R(\boldsymbol{w})$
  • $R(\boldsymbol{w})$ is called the regularizer and measures the "magnitude" of $\boldsymbol{w}$
  • $\lambda$ is the regularization hyperparameter; it controls how much we wish to regularize (needs to be tuned via cross-validation)
  • The regularizer will prevent the elements of $\boldsymbol{w}$ from becoming too large
  • Reason: Now we are minimizing training error + magnitude of the weight vector
• Two popular examples of regularization for linear regression are:
  1. Ridge Regression or L2 Regularization
  2. Lasso Regression or L1 Regularization
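A quick NumPy check of the invertibility issue (toy sizes chosen for illustration, not from the slides): with fewer examples than features, X^T X is rank-deficient, and adding a small multiple of the identity, as ridge regression does next, restores full rank:

```python
import numpy as np

# When N < D, the D x D matrix X^T X has rank at most N, so it is not
# invertible and the least squares solution is not unique.
rng = np.random.default_rng(3)
N, D = 5, 10
X = rng.normal(size=(N, D))

A = X.T @ X
print(np.linalg.matrix_rank(A))                      # at most 5 (= N), i.e. < D: singular
# np.linalg.inv(A) would fail or be numerically meaningless here;
# adding lam * I makes the matrix full rank and hence invertible:
lam = 0.1
print(np.linalg.matrix_rank(A + lam * np.eye(D)))    # full rank D
```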
12
Regularized Least Squares (a.k.a. Ridge Regression)
• Recall that the regularized objective is of the form $L(\boldsymbol{w}) + \lambda R(\boldsymbol{w})$. With squared loss and $R(\boldsymbol{w}) = \boldsymbol{w}^\top \boldsymbol{w}$, this gives the ridge regression objective:
$$\hat{\boldsymbol{w}} = \arg\min_{\boldsymbol{w}} L_{reg}(\boldsymbol{w}), \qquad L_{reg}(\boldsymbol{w}) = \sum_{n=1}^{N} \big(y_n - \boldsymbol{w}^\top \boldsymbol{x}_n\big)^2 + \lambda\, \boldsymbol{w}^\top \boldsymbol{w}$$
• Minimizing the above objective w.r.t. w does two things
  • Keeps the training error small
  • Keeps the norm of w small (and thus also the individual components of w): regularization
• There is a trade-off between the two terms: The regularization hyperparameter λ > 0 controls it
  • Very small λ means almost no regularization (can overfit)
  • Very large λ means very high regularization (can underfit - high training error)
  • Can use cross-validation to choose the "right" λ
• The solution to the above optimization problem is:
$$\boldsymbol{w}_{ridge} = (\boldsymbol{X}^\top \boldsymbol{X} + \lambda I_D)^{-1} \boldsymbol{X}^\top \boldsymbol{y}$$
(Hence the name "ridge" regression: compared to the least squares solution, we are adding a small value $\lambda$ to the diagonals of the $D \times D$ matrix $\boldsymbol{X}^\top \boldsymbol{X}$, like adding a ridge/mountain to some land.)
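A minimal NumPy sketch of the ridge solution on toy data, illustrating that larger λ shrinks the norm of w (the function name, data, and λ values are illustrative assumptions):

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I_D)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(4)
N, D = 20, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

for lam in (0.0, 0.1, 10.0, 1000.0):
    w = ridge_solution(X, y, lam)
    print(lam, np.linalg.norm(w))   # larger lambda -> smaller norm of w
```

With λ = 0 this reduces to the unregularized least squares solution; in practice λ would be chosen by cross-validation.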
Other Ways to Control Overfitting
• Use a regularizer defined by other norms, e.g., the $\ell_1$ norm regularizer
$$\|\boldsymbol{w}\|_1 = \sum_{d=1}^{D} |w_d|$$
or the $\ell_0$ norm regularizer (counts the number of nonzeros in $\boldsymbol{w}$)
$$\|\boldsymbol{w}\|_0 = \text{nnz}(\boldsymbol{w})$$
• When should these regularizers be used instead of the $\ell_2$ regularizer? Use them if you have a very large number of features but many irrelevant features. These regularizers can help in automatic feature selection.
• Using such regularizers gives a sparse weight vector as the solution (a small sketch of this effect is shown below).
  (Automatic feature selection? Wow, cool!!! But how exactly?)
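A small sketch of the sparsity effect, assuming scikit-learn's Lasso and Ridge estimators are available (the toy data and the alpha value are illustrative): an L1-regularized fit typically zeroes out the weights on irrelevant features, while an L2-regularized fit merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
N, D = 100, 20
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:3] = [3.0, -2.0, 1.5]            # only the first 3 features are relevant
y = X @ w_true + 0.1 * rng.normal(size=N)

lasso = Lasso(alpha=0.1).fit(X, y)       # L1 regularization
ridge = Ridge(alpha=0.1).fit(X, y)       # L2 regularization

print("nonzeros in Lasso weights:", np.sum(lasso.coef_ != 0))   # close to 3
print("nonzeros in Ridge weights:", np.sum(ridge.coef_ != 0))   # typically all 20
```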
• Use non-regularization based approaches
  • Early stopping (stopping training just when we have a decent validation-set accuracy)
  • Dropout (in each iteration, don't update some of the weights)
  • Injecting noise in the inputs
• All of these are very popular ways to control overfitting in deep learning models. More on these later when we talk about deep learning.
$\ell_0$, $\ell_1$, and $\ell_2$ regularizations: Some Comments
• Many ways to regularize ML models (for linear as well as other models)
• Some are based on adding a norm of $\boldsymbol{w}$ to the loss function (as we already saw)
• Using the $\ell_2$ norm in the loss function promotes the individual entries of $\boldsymbol{w}$ to be small (we saw that)
• Using the $\ell_0$ norm $\|\boldsymbol{w}\|_0 = \text{nnz}(\boldsymbol{w})$ encourages very few non-zero entries in $\boldsymbol{w}$ (thereby promoting a "sparse" $\boldsymbol{w}$)
• Note: Since they learn a sparse $\boldsymbol{w}$, $\ell_0$ or $\ell_1$ regularization is also useful for doing feature selection (a tiny numeric example of these norms is given below)
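For concreteness, the three norms for a made-up sparse weight vector:

```python
import numpy as np

w = np.array([0.0, 2.0, 0.0, -0.5, 0.0])

l0 = np.count_nonzero(w)        # ||w||_0 = nnz(w) = 2
l1 = np.sum(np.abs(w))          # ||w||_1 = 2.5
l2 = np.sqrt(np.sum(w ** 2))    # ||w||_2 ≈ 2.06
print(l0, l1, l2)
```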
Linear/Ridge Regression via Gradient Descent
• Both least squares regression and ridge regression require matrix inversion:
$$\boldsymbol{w}_{LS} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}, \qquad \boldsymbol{w}_{ridge} = (\boldsymbol{X}^\top \boldsymbol{X} + \lambda I_D)^{-1} \boldsymbol{X}^\top \boldsymbol{y}$$
• Can be computationally expensive when $D$ is very large
• A faster way is to use iterative optimization, such as batch or stochastic gradient descent
• A basic batch gradient-descent based procedure looks like this (see the sketch after this list):
  • Start with an initial value of $\boldsymbol{w}$
  • Update $\boldsymbol{w}$ by moving in the opposite direction of the gradient of the loss function, with step size (learning rate) $\eta$
  • Repeat until convergence
[Figure: gradient descent on a convex loss (starting points A and B both reach the single global minimum) vs. a non-convex loss (starting points A and B may end up in different local minima)]
• For gradient descent, the learning rate $\eta$ is important (should not be too large or too small).
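A minimal batch gradient descent sketch for the ridge objective (the step size, iteration count, and toy data are illustrative assumptions; a fixed learning rate is used for simplicity):

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=0.1, eta=1e-3, num_iters=500):
    """Batch gradient descent for sum_n (y_n - w^T x_n)^2 + lam * w^T w (a sketch)."""
    N, D = X.shape
    w = np.zeros(D)                                    # start with an initial value of w
    for _ in range(num_iters):
        grad = -2 * X.T @ (y - X @ w) + 2 * lam * w    # gradient of the objective
        w = w - eta * grad                             # move opposite to the gradient
    return w

rng = np.random.default_rng(6)
N, D = 200, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

w_gd = ridge_gradient_descent(X, y, lam=0.1, eta=1e-3)
w_closed = np.linalg.solve(X.T @ X + 0.1 * np.eye(D), X.T @ y)
print(np.allclose(w_gd, w_closed, atol=1e-2))   # iterative solution matches closed form
```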
Linear Regression as Solving System of Linear Eqs
• The form of the lin. reg. model is akin to a system of linear equations
• Assuming $N$ training examples with $D$ features each, we have $N$ equations and $D$ unknowns ($w_1, w_2, \ldots, w_D$):
  First training example: $y_1 = x_{11} w_1 + x_{12} w_2 + \ldots + x_{1D} w_D$
  Second training example: $y_2 = x_{21} w_1 + x_{22} w_2 + \ldots + x_{2D} w_D$
  $\vdots$
  N-th training example: $y_N = x_{N1} w_1 + x_{N2} w_2 + \ldots + x_{ND} w_D$
• Note: Here $x_{nd}$ denotes the $d$-th feature of the $n$-th training example
(A small numerical sketch of this view is given below.)
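A small numeric sketch of this view (the numbers are made up): each row of X together with the corresponding entry of y is one linear equation in the unknowns, and with N > D the system is solved in the least squares sense.

```python
import numpy as np

# N = 3 equations, D = 2 unknowns (w1, w2)
X = np.array([[1.0, 2.0],    # y_1 = 1*w1 + 2*w2
              [3.0, 1.0],    # y_2 = 3*w1 + 1*w2
              [0.0, 4.0]])   # y_3 = 0*w1 + 4*w2
y = np.array([5.0, 5.0, 8.0])

# More equations than unknowns: solve in the least squares sense
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(w)   # [1, 2], which here happens to satisfy all three equations exactly
```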