Group30 Linear Regression
[Figure: training examples plotted with the inputs on the horizontal axis and the outputs on the vertical axis]
• Assume the function that approximates the I/O relationship to be a linear model: $y_n \approx f(\boldsymbol{x}_n) = \boldsymbol{w}^\top \boldsymbol{x}_n$, for $n = 1, 2, \ldots, N$
  (Can also write all of them compactly using matrix-vector notation as $\boldsymbol{y} \approx \boldsymbol{X}\boldsymbol{w}$)
• Let's write the total error or "loss" of this model over the training data as $L(\boldsymbol{w}) = \sum_{n=1}^{N} \ell\big(y_n, f(\boldsymbol{x}_n)\big)$, where each term $\ell\big(y_n, f(\boldsymbol{x}_n)\big)$ measures the prediction error or "loss" or "deviation" of the model on a single training input
• Goal of learning is to find the $\boldsymbol{w}$ that minimizes this loss + does well on test data
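For illustration, a minimal NumPy sketch of the model in matrix-vector form (the toy data, the names X, y, w, and the squared-error choice of loss are assumptions made for this example, not part of the slides):

```python
import numpy as np

# Toy training data: N = 5 examples, D = 3 features (made-up numbers)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))            # each row is x_n^T
w_true = np.array([1.0, -2.0, 0.5])    # "true" weights used to generate y
y = X @ w_true + 0.1 * rng.normal(size=5)

# Linear model: predictions for all n at once, y_hat ≈ X w
w = np.zeros(3)                        # some candidate weight vector
y_hat = X @ w

# Total loss over the training data (here: squared error, one common choice)
loss = np.sum((y - y_hat) ** 2)
print(loss)
```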
Linear Regression
• With one-dimensional inputs, linear regression would look like
[Figure: the fitted line y = wx in the input-output plane; for a training point (x_n, y_n), the vertical gap y_n - w x_n is the model's error on that point]
• What if a line/plane doesn't model the input-output relationship very well, e.g., if their relationship is better modeled by a nonlinear curve or curved surface? Do linear models become useless in such cases?
• No. We can even fit a curve using a linear model after suitably transforming the inputs.
[Figure: with the original (single) feature a nonlinear curve is needed; with two features (after transforming the inputs) a plane, i.e., a linear model, can fit the data]
• The line/plane must also predict outputs well on the unseen (test) inputs.
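A small sketch of the "transform the inputs, then fit a linear model" idea (the polynomial feature map and the toy data are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 50)
y = 2.0 * x**2 - x + rng.normal(scale=0.5, size=x.shape)   # nonlinear relationship

# Transform the scalar input into a feature vector phi(x) = [1, x, x^2]
Phi = np.column_stack([np.ones_like(x), x, x**2])

# The model is still linear in w, so ordinary least squares applies
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)   # roughly recovers [0, -1, 2]
```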
Loss Functions for Regression
• Many possible loss functions for regression problems
• Choice of loss function usually depends on the nature of the data. Also, some loss functions result in an easier optimization problem than others.
• Squared loss: $\big(y_n - f(\boldsymbol{x}_n)\big)^2$. Very commonly used for regression. Leads to an easy-to-solve optimization problem.
• Absolute loss: $\big|y_n - f(\boldsymbol{x}_n)\big|$. Grows more slowly than squared loss, thus better suited when the data has some outliers (inputs on which the model makes large errors).
• Huber loss: squared loss for small errors (say up to $\delta$); absolute loss for larger errors. Good for data with outliers.
• $\epsilon$-insensitive loss (a.k.a. Vapnik loss): zero loss for small errors (say up to $\epsilon$); loss $\big|y_n - f(\boldsymbol{x}_n)\big| - \epsilon$ for larger errors. (Note: can also use squared loss instead of absolute loss for the larger errors.)
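A rough NumPy sketch of these loss functions (the exact scaling constants in the Huber loss are one common convention, assumed here; the default values of delta and eps are also illustrative):

```python
import numpy as np

def squared_loss(y, f):
    return (y - f) ** 2

def absolute_loss(y, f):
    return np.abs(y - f)

def huber_loss(y, f, delta=1.0):
    # Squared loss for small errors (up to delta), absolute-type loss beyond that
    e = np.abs(y - f)
    return np.where(e <= delta, 0.5 * e**2, delta * (e - 0.5 * delta))

def eps_insensitive_loss(y, f, eps=0.1):
    # Zero loss for errors up to eps, absolute loss (shifted by eps) beyond that
    return np.maximum(0.0, np.abs(y - f) - eps)

y, f = np.array([1.0, 2.0, 10.0]), np.array([1.1, 1.5, 2.0])
for loss in (squared_loss, absolute_loss, huber_loss, eps_insensitive_loss):
    print(loss.__name__, loss(y, f))
```

Note how the large error on the third example is penalized much more heavily by the squared loss than by the other three.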
Loss Functions for Regression (contd.)
[Figure: plots of the loss functions' values against the prediction error, showing how slowly or quickly each grows compared to the squared loss values in a dataset]
• Loss functions such as the squared loss are easy to optimize (convex and differentiable)
• Minimizing the total squared loss $\sum_{n=1}^{N} (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)^2$ w.r.t. $\boldsymbol{w}$: setting the derivative to zero (note that the partial derivative of the dot product $\boldsymbol{w}^\top \boldsymbol{x}_n$ w.r.t. each element of $\boldsymbol{w}$ gives $\boldsymbol{x}_n$, which is the same size as $\boldsymbol{w}$) yields
$$\sum_{n=1}^{N} 2\,\boldsymbol{x}_n \big(y_n - \boldsymbol{x}_n^\top \boldsymbol{w}\big) = 0 \;\;\Rightarrow\;\; \sum_{n=1}^{N} \big(y_n \boldsymbol{x}_n - \boldsymbol{x}_n \boldsymbol{x}_n^\top \boldsymbol{w}\big) = 0$$
• Solving for $\boldsymbol{w}$ gives the least squares solution
$$\boldsymbol{w} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$$
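A minimal NumPy sketch of this closed-form solution on toy data (np.linalg.solve is used instead of explicitly inverting X^T X, a standard numerical choice; the data is made up):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 100, 4
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

# Closed-form least squares: w = (X^T X)^{-1} X^T y
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_ls, w_true, atol=0.1))   # should recover weights close to w_true
```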
Problem(s) with the Solution!
• We minimized the objective w.r.t. $\boldsymbol{w}$ and got
$$\boldsymbol{w} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$$
• Problem: The matrix $\boldsymbol{X}^\top \boldsymbol{X}$ may not be invertible
  • This may lead to non-unique solutions for $\boldsymbol{w}$
• Problem: Overfitting, since we only minimized the loss defined on the training data
  • Weights may become arbitrarily large to fit the training data perfectly
  • Such weights may however perform poorly on the test data
• One solution: Minimize a regularized objective $L(\boldsymbol{w}) + \lambda R(\boldsymbol{w})$
  • $R(\boldsymbol{w})$ is called the regularizer and measures the "magnitude" of $\boldsymbol{w}$
  • $\lambda$ is the regularization hyperparameter; it controls how much we wish to regularize (needs to be tuned via cross-validation)
  • The regularizer will prevent the elements of $\boldsymbol{w}$ from becoming too large
  • Reason: Now we are minimizing training error + magnitude of the weight vector
• Two popular examples of regularization for linear regression are:
  1. Ridge Regression or L2 Regularization
  2. Lasso Regression or L1 Regularization
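A quick NumPy check of the invertibility issue (toy sizes chosen for illustration, not from the slides): with fewer examples than features, X^T X is rank-deficient, and adding a small multiple of the identity, as ridge regression does next, restores full rank:

```python
import numpy as np

# When N < D, the D x D matrix X^T X has rank at most N, so it is not
# invertible and the least squares solution is not unique.
rng = np.random.default_rng(3)
N, D = 5, 10
X = rng.normal(size=(N, D))

A = X.T @ X
print(np.linalg.matrix_rank(A))                      # at most 5 (= N), i.e. < D: singular
# np.linalg.inv(A) would fail or be numerically meaningless here;
# adding lam * I makes the matrix full rank and hence invertible:
lam = 0.1
print(np.linalg.matrix_rank(A + lam * np.eye(D)))    # full rank D
```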
12
Regularized Least Squares (a.k.a. Ridge Regression)
• Recall that the regularized objective is of the form $L(\boldsymbol{w}) + \lambda R(\boldsymbol{w})$. With squared loss and $R(\boldsymbol{w}) = \boldsymbol{w}^\top \boldsymbol{w}$, this gives the ridge regression objective:
$$\hat{\boldsymbol{w}} = \arg\min_{\boldsymbol{w}} L_{reg}(\boldsymbol{w}), \qquad L_{reg}(\boldsymbol{w}) = \sum_{n=1}^{N} \big(y_n - \boldsymbol{w}^\top \boldsymbol{x}_n\big)^2 + \lambda\, \boldsymbol{w}^\top \boldsymbol{w}$$
• Minimizing the above objective w.r.t. w does two things
  • Keeps the training error small
  • Keeps the norm of w small (and thus also the individual components of w): regularization
• There is a trade-off between the two terms: The regularization hyperparameter λ > 0 controls it
  • Very small λ means almost no regularization (can overfit)
  • Very large λ means very high regularization (can underfit - high training error)
  • Can use cross-validation to choose the "right" λ
• The solution to the above optimization problem is:
$$\boldsymbol{w}_{ridge} = (\boldsymbol{X}^\top \boldsymbol{X} + \lambda I_D)^{-1} \boldsymbol{X}^\top \boldsymbol{y}$$
(Hence the name "ridge" regression: compared to the least squares solution, we are adding a small value $\lambda$ to the diagonals of the $D \times D$ matrix $\boldsymbol{X}^\top \boldsymbol{X}$, like adding a ridge/mountain to some land.)
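A minimal NumPy sketch of the ridge solution on toy data, illustrating that larger λ shrinks the norm of w (the function name, data, and λ values are illustrative assumptions):

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I_D)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(4)
N, D = 20, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

for lam in (0.0, 0.1, 10.0, 1000.0):
    w = ridge_solution(X, y, lam)
    print(lam, np.linalg.norm(w))   # larger lambda -> smaller norm of w
```

With λ = 0 this reduces to the unregularized least squares solution; in practice λ would be chosen by cross-validation.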
Other Ways to Control Overfitting
• Use a regularizer defined by other norms, e.g., the $\ell_1$ norm regularizer
$$\|\boldsymbol{w}\|_1 = \sum_{d=1}^{D} |w_d|$$
or the $\ell_0$ norm regularizer (counts the number of nonzeros in $\boldsymbol{w}$)
$$\|\boldsymbol{w}\|_0 = \text{nnz}(\boldsymbol{w})$$
• When should these regularizers be used instead of the $\ell_2$ regularizer? Use them if you have a very large number of features but many irrelevant features. These regularizers can help in automatic feature selection.
• Using such regularizers gives a sparse weight vector as the solution (a small sketch of this effect is shown below).
  (Automatic feature selection? Wow, cool!!! But how exactly?)
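A small sketch of the sparsity effect, assuming scikit-learn's Lasso and Ridge estimators are available (the toy data and the alpha value are illustrative): an L1-regularized fit typically zeroes out the weights on irrelevant features, while an L2-regularized fit merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
N, D = 100, 20
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:3] = [3.0, -2.0, 1.5]            # only the first 3 features are relevant
y = X @ w_true + 0.1 * rng.normal(size=N)

lasso = Lasso(alpha=0.1).fit(X, y)       # L1 regularization
ridge = Ridge(alpha=0.1).fit(X, y)       # L2 regularization

print("nonzeros in Lasso weights:", np.sum(lasso.coef_ != 0))   # close to 3
print("nonzeros in Ridge weights:", np.sum(ridge.coef_ != 0))   # typically all 20
```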
• Use non-regularization based approaches
  • Early stopping (stopping training just when we have a decent validation-set accuracy)
  • Dropout (in each iteration, don't update some of the weights)
  • Injecting noise in the inputs
• All of these are very popular ways to control overfitting in deep learning models. More on these later when we talk about deep learning.
$\ell_0$, $\ell_1$, and $\ell_2$ regularizations: Some Comments
• Many ways to regularize ML models (for linear as well as other models)
• Some are based on adding a norm of $\boldsymbol{w}$ to the loss function (as we already saw)
• Using the $\ell_2$ norm in the loss function promotes the individual entries of $\boldsymbol{w}$ to be small (we saw that)
• Using the $\ell_0$ norm $\|\boldsymbol{w}\|_0 = \text{nnz}(\boldsymbol{w})$ encourages very few non-zero entries in $\boldsymbol{w}$ (thereby promoting a "sparse" $\boldsymbol{w}$)
• Note: Since they learn a sparse $\boldsymbol{w}$, $\ell_0$ or $\ell_1$ regularization is also useful for doing feature selection (a tiny numeric example of these norms is given below)
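For concreteness, the three norms for a made-up sparse weight vector:

```python
import numpy as np

w = np.array([0.0, 2.0, 0.0, -0.5, 0.0])

l0 = np.count_nonzero(w)        # ||w||_0 = nnz(w) = 2
l1 = np.sum(np.abs(w))          # ||w||_1 = 2.5
l2 = np.sqrt(np.sum(w ** 2))    # ||w||_2 ≈ 2.06
print(l0, l1, l2)
```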
Linear/Ridge Regression via Gradient Descent
• Both least squares regression and ridge regression require matrix inversion:
$$\boldsymbol{w}_{LS} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}, \qquad \boldsymbol{w}_{ridge} = (\boldsymbol{X}^\top \boldsymbol{X} + \lambda I_D)^{-1} \boldsymbol{X}^\top \boldsymbol{y}$$
• Can be computationally expensive when $D$ is very large
• A faster way is to use iterative optimization, such as batch or stochastic gradient descent
• A basic batch gradient-descent based procedure looks like this (see the sketch after this list):
  • Start with an initial value of $\boldsymbol{w}$
  • Update $\boldsymbol{w}$ by moving in the opposite direction of the gradient of the loss function, with step size (learning rate) $\eta$
  • Repeat until convergence
[Figure: gradient descent on a convex loss (starting points A and B both reach the single global minimum) vs. a non-convex loss (starting points A and B may end up in different local minima)]
• For gradient descent, the learning rate $\eta$ is important (should not be too large or too small).
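A minimal batch gradient descent sketch for the ridge objective (the step size, iteration count, and toy data are illustrative assumptions; a fixed learning rate is used for simplicity):

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=0.1, eta=1e-3, num_iters=500):
    """Batch gradient descent for sum_n (y_n - w^T x_n)^2 + lam * w^T w (a sketch)."""
    N, D = X.shape
    w = np.zeros(D)                                    # start with an initial value of w
    for _ in range(num_iters):
        grad = -2 * X.T @ (y - X @ w) + 2 * lam * w    # gradient of the objective
        w = w - eta * grad                             # move opposite to the gradient
    return w

rng = np.random.default_rng(6)
N, D = 200, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

w_gd = ridge_gradient_descent(X, y, lam=0.1, eta=1e-3)
w_closed = np.linalg.solve(X.T @ X + 0.1 * np.eye(D), X.T @ y)
print(np.allclose(w_gd, w_closed, atol=1e-2))   # iterative solution matches closed form
```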
Linear Regression as Solving System of Linear Eqs
• The form of the lin. reg. model is akin to a system of linear equations
• Assuming $N$ training examples with $D$ features each, we have $N$ equations and $D$ unknowns ($w_1, w_2, \ldots, w_D$):
  First training example: $y_1 = x_{11} w_1 + x_{12} w_2 + \ldots + x_{1D} w_D$
  Second training example: $y_2 = x_{21} w_1 + x_{22} w_2 + \ldots + x_{2D} w_D$
  $\vdots$
  N-th training example: $y_N = x_{N1} w_1 + x_{N2} w_2 + \ldots + x_{ND} w_D$
• Note: Here $x_{nd}$ denotes the $d$-th feature of the $n$-th training example
(A small numerical sketch of this view is given below.)
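A small numeric sketch of this view (the numbers are made up): each row of X together with the corresponding entry of y is one linear equation in the unknowns, and with N > D the system is solved in the least squares sense.

```python
import numpy as np

# N = 3 equations, D = 2 unknowns (w1, w2)
X = np.array([[1.0, 2.0],    # y_1 = 1*w1 + 2*w2
              [3.0, 1.0],    # y_2 = 3*w1 + 1*w2
              [0.0, 4.0]])   # y_3 = 0*w1 + 4*w2
y = np.array([5.0, 5.0, 8.0])

# More equations than unknowns: solve in the least squares sense
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(w)   # [1, 2], which here happens to satisfy all three equations exactly
```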