02 - Linear Models - A

Linear regression aims to model the relationship between features (x) and a target (y) using a linear function. It can be expressed in vector and matrix forms. The goal is to find the model parameters (weights w) that minimize the mean squared error between the predicted and actual target values. This optimization problem can be solved analytically using the normal equation, which yields the global optimum weights in one step. Numerical solutions like gradient descent can also be used. Basis functions can transform input features to help learning. Polynomial regression is an example of a linear model that uses polynomial basis functions.


Training Linear Models

Sungjin Ahn

School of Computing
KAIST
Linear Regression
Linear Regression

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d$$

• $x_i$ is the $i$-th feature
• $w_j$ is the $j$-th model parameter (also called the $j$-th weight)
• $w_0$ is the bias term
• $\hat{y}$ is the predicted value
• $d$ is the number of features
Vectorized Form

$$\hat{y} = h_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x}$$

For example, with $d = 4$ features (the constant feature 1 absorbs the bias $w_0$):

$$\hat{y} = \begin{bmatrix} w_0 & w_1 & w_2 & w_3 & w_4 \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}$$
Matrix Form

§ X: N x (d+1) matrix (N: the number of training data); this is called the design matrix
• $x_d^{(n)}$: the $d$-th feature of the $n$-th data point
§ w: (d+1) x 1 vector
§ $\hat{\mathbf{y}}$: N x 1 vector

$$
\begin{bmatrix} \hat{y}^{(1)} \\ \hat{y}^{(2)} \\ \hat{y}^{(3)} \\ \vdots \\ \hat{y}^{(N)} \end{bmatrix}
=
\begin{bmatrix}
1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_d^{(1)} \\
1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_d^{(2)} \\
1 & x_1^{(3)} & x_2^{(3)} & \cdots & x_d^{(3)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(N)} & x_2^{(N)} & \cdots & x_d^{(N)}
\end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}
\qquad\text{i.e.,}\qquad \hat{\mathbf{y}} = \mathbf{X}\mathbf{w}
$$
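As a concrete illustration of the matrix form, here is a minimal NumPy sketch (not part of the original slides; the feature values and weights are made up): the design matrix is built by prepending a column of 1s, and all predictions are computed at once as $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$.

```python
import numpy as np

# Toy data: N = 4 points, d = 2 features (values chosen only for illustration).
X_raw = np.array([[1.0, 2.0],
                  [0.5, 1.5],
                  [3.0, 0.2],
                  [2.2, 2.8]])              # shape (N, d)
N = X_raw.shape[0]

# Design matrix: prepend a column of 1s so the bias w_0 is absorbed -> shape (N, d+1).
X = np.hstack([np.ones((N, 1)), X_raw])

# Parameter vector w = (w_0, w_1, w_2), also made up for the example.
w = np.array([0.5, 2.0, -1.0])              # shape (d+1,)

# Predictions for all N points at once: y_hat = X w.
y_hat = X @ w
print(y_hat)                                 # shape (N,)
```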
Basis Functions

• In the polynomial regression example, we assumed that the input x is 1D.


• But we increased the input dimension from 1 to M by applying $x, x^2, x^3, \dots, x^M$
• Generalizing this idea, we can apply a function 𝜙(x), called a basis function, that
transforms the input into some other form (if we believe that the transform will help
learning).
§ This is the feature design or feature engineering process. (A human is the designer)

• Examples,
§ Polynomial: $\phi_j(x) = x^j$
§ Gaussian: $\phi_j(x) = \exp\left(-\dfrac{(x - \mu_j)^2}{2s^2}\right)$
§ Sigmoid: $\phi_j(x) = \sigma\left(\dfrac{x - \mu_j}{s}\right)$

where $\mu_j$ controls the location and $s$ the scale of the basis function.
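A small sketch of the three basis functions listed above applied to a 1-D input (not from the slides; the centers $\mu_j$ and the scale $s$ are arbitrary illustrative choices):

```python
import numpy as np

def polynomial_basis(x, j):
    """phi_j(x) = x^j."""
    return x ** j

def gaussian_basis(x, mu_j, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-(x - mu_j) ** 2 / (2 * s ** 2))

def sigmoid_basis(x, mu_j, s):
    """phi_j(x) = sigma((x - mu_j) / s) with sigma the logistic function."""
    a = (x - mu_j) / s
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(0.0, 1.0, 5)              # 1-D inputs
mus = np.array([0.25, 0.5, 0.75])         # arbitrary centers
s = 0.1                                    # arbitrary scale

# Transform x into a feature matrix Phi with one column per basis function.
Phi = np.column_stack([gaussian_basis(x, mu, s) for mu in mus])
print(Phi.shape)                           # (5, 3)
```

The transformed matrix `Phi` then plays the role of the design matrix in the linear model.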
Polynomial Regression is a Linear Model

• $\beta_j$ is the model parameter. Then, the matrix form of polynomial regression is $\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta}$, where the design matrix $\mathbf{X}$ has rows $\left(1,\, x^{(n)},\, (x^{(n)})^2,\, \dots,\, (x^{(n)})^M\right)$, so the model is still linear in the parameters $\boldsymbol{\beta}$.
[Figure 1.4 (Bishop): plots of polynomials having various orders $M$ ($M = 0, 1, 3, 9$), shown as red curves, fitted to the data set shown in Figure 1.2.]

The accompanying root-mean-square (RMS) error is defined by
$$E_{\mathrm{RMS}} = \sqrt{2E(\mathbf{w}^{\star})/N},$$
in which the division by $N$ allows us to compare different sizes of data sets on an equal footing, and the square root ensures that $E_{\mathrm{RMS}}$ is measured on the same scale as the target.
Mean Squared Error (MSE)

• A cost function for linear regression
• The squared loss of each data item measures the distance between the prediction $\hat{y}$ and the label (= target) $y$ (why square?):
$$l^{(i)}(\mathbf{w}) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
§ Large differences between the estimate $\hat{y}^{(i)}$ and the observation $y^{(i)}$ lead to even larger contributions to the loss, due to the quadratic dependence.
§ The constant $1/2$ makes no real difference but proves notationally convenient when we take the derivative of the loss.
• The Mean Squared Error (MSE) for the whole training data is
$$L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{2}\left(\mathbf{w}^{\top}\mathbf{x}^{(i)} - y^{(i)}\right)^2$$
• The goal is to find the model parameters that minimize this loss:
$$\mathbf{w}^{*} = \operatorname*{argmin}_{\mathbf{w}} L(\mathbf{w})$$
• There are two approaches to this optimization problem

[Fig. 3.1.1: Fit data with a linear model.]
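To make the MSE concrete, here is a minimal NumPy sketch of $L(\mathbf{w})$ (my own illustration, not the lecture's code), using the design-matrix convention from the earlier slide; the tiny dataset is made up:

```python
import numpy as np

def mse_loss(X, y, w):
    """L(w) = (1/N) * sum_i 0.5 * (w^T x^(i) - y^(i))^2, with X the N x (d+1) design matrix."""
    residuals = X @ w - y                  # shape (N,)
    return np.mean(0.5 * residuals ** 2)

# Tiny made-up example: bias column plus one feature.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
w = np.array([1.0, 2.0])                   # happens to fit the data exactly
print(mse_loss(X, y, w))                   # 0.0
```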
Solving Linear Regression

1. Analytic solution → The Normal Equation

2. Numerical solution → The Gradient Descent Method
The Analytic Solution: The Normal Equation
Linear regression happens to be an unusually simple optimization problem. Unlike most of the models we will encounter, linear regression can be solved analytically by applying a simple formula, yielding a global optimum. To start, we can subsume the bias into the parameter w by appending a column consisting of all 1s to the design matrix. (Not exactly the same setup as ours, but the solution should be the same.)

• In the matrix form, the objective can be written as: minimize $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|$
§ which is a quadratic form and convex (no local minima)
§ $\|\mathbf{x}\|$ is the L2 norm, i.e., $\|\mathbf{x}\| = \sqrt{\sum_i x_i^2}$
§ X: N x (D+1) matrix
• As long as the problem is not degenerate (our features are linearly independent), the loss is strictly convex: there is just one critical point on the loss surface, and it corresponds to the global minimum.
• Taking the derivative of the objective w.r.t. w and setting it equal to zero, i.e., $\nabla_{\mathbf{w}}\|\mathbf{y} - \mathbf{X}\mathbf{w}\| = 0$, yields the following solution (you will derive this yourself):
$$\mathbf{w}^{*} = \left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}$$
• This equation is known as the Normal Equation.
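A short NumPy sketch of the Normal Equation on synthetic data (not from the slides; the true coefficients 4 and 3 and the noise level are arbitrary). Solving the linear system rather than forming the inverse explicitly is a numerical-stability choice on my part:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 4 + 3*x + noise (coefficients chosen arbitrarily).
N = 100
x = rng.uniform(0, 2, size=N)
y = 4.0 + 3.0 * x + 0.1 * rng.standard_normal(N)

# Design matrix with a leading column of 1s for the bias.
X = np.column_stack([np.ones(N), x])

# Normal Equation: w* = (X^T X)^{-1} X^T y.
# Solving the linear system is numerically safer than computing the inverse explicitly.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)   # approximately [4.0, 3.0]
```

In practice one would often use `np.linalg.lstsq(X, y, rcond=None)` instead, which handles nearly degenerate feature matrices more gracefully.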
Computational Complexity

• Solving the Normal Equation includes computing the inverse of the matrix $\mathbf{X}^{\top}\mathbf{X}$ of (D+1) x (D+1) shape.
• Inverting this matrix typically takes $O(D^{2.4})$ to $O(D^{3})$ time, depending on the implementation of the underlying numerical algorithm.
• This means the computation grows roughly cubically with the number of features → very expensive.

• What if N >> D?
§ Computing $\mathbf{X}^{\top}\mathbf{X}$ is $O(ND^{2})$, so if N is very large (e.g., N = # of all webpages on the internet), this computation can dominate the cost of the inverse.
§ What would be the solution?
Gradient Descent
The Gradient Descent Method

• Gradient descent is a generic optimization algorithm. The general idea is to tweak the parameters iteratively in order to minimize a cost function.
• We use the gradient of the cost function to decide the direction of the
parameter update. So, GD is applicable to training any model as long as
the gradient of the cost function w.r.t. the parameter is computable.
§ To all differentiable cost functions

Differentiable means that the derivative exists.


The Gradient Descent Method

1. Randomly initialize the parameter


2. Compute the gradient, dL(w)/dw, of the cost
function and update the parameter as follows

$$\mathbf{w} \leftarrow \mathbf{w} - \eta\,\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}}$$
§ Here, $\eta$ is the learning rate (aka step size), a hyperparameter controlling how far to move in the direction of the gradient.

3. Go to step 2 until a stopping criterion is met. e.g.,


§ The loss doesn't decrease anymore
§ Or, the norm of the gradient is near zero

Computing the Gradient of the Linear Regression MSE

• MSE Loss
$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(\mathbf{w}^{\top}\mathbf{x}^{(i)} + b - y^{(i)}\right)^2$$
• When training the model, we want to find the parameters $(\mathbf{w}^{*}, b^{*})$ that minimize the total loss across all training samples:
$$\mathbf{w}^{*}, b^{*} = \operatorname*{argmin}_{\mathbf{w}, b} L(\mathbf{w}, b)$$
• Gradient
$$\partial_{\mathbf{w}} L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} \left(\mathbf{w}^{\top}\mathbf{x}^{(i)} + b - y^{(i)}\right)\mathbf{x}^{(i)}$$
$$\partial_{b} L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} \left(\mathbf{w}^{\top}\mathbf{x}^{(i)} + b - y^{(i)}\right)$$
• Update (for w); $(\mathbf{w}, b)$ can also be updated jointly:
$$\mathbf{w} \leftarrow \mathbf{w} - \eta\,\partial_{\mathbf{w}} L(\mathbf{w}, b)$$
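The gradients and update rule above can be turned into a small batch gradient-descent loop. The following is a sketch of my own (made-up data, learning rate, and iteration count), not the lecture's reference implementation:

```python
import numpy as np

def gradients(X, y, w, b):
    """Gradients of L(w, b) = (1/n) * sum_i 0.5 * (w^T x^(i) + b - y^(i))^2."""
    err = X @ w + b - y                    # shape (n,)
    grad_w = X.T @ err / len(y)            # (1/n) * sum_i err_i * x^(i)
    grad_b = err.mean()                    # (1/n) * sum_i err_i
    return grad_w, grad_b

# Made-up data: y ≈ 2*x1 - x2 + 0.5.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = X @ np.array([2.0, -1.0]) + 0.5 + 0.01 * rng.standard_normal(200)

w, b = np.zeros(2), 0.0
eta = 0.1                                  # learning rate (step size)
for _ in range(500):                       # fixed step count for simplicity
    grad_w, grad_b = gradients(X, y, w, b)
    w, b = w - eta * grad_w, b - eta * grad_b

print(w, b)                                # close to [2.0, -1.0] and 0.5
```

A real training loop would replace the fixed step count with one of the stopping criteria above (loss no longer decreasing, or gradient norm near zero).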
Learning Rate

• The learning rate is usually the most important hyperparameter in training a model by GD (especially for neural networks)

[Two panels plotted over w: a learning rate that is too small vs. one that is too large.]
Convex & Non-convex Function
• A function is said to be convex if, when you pick any two points on the curve, the line segment joining them never crosses the curve.
• Formally, a function $f: X \to \mathbb{R}$ is convex if for any $t \in [0, 1]$ and all $x_1, x_2 \in X$, $f(t x_1 + (1 - t) x_2) \le t f(x_1) + (1 - t) f(x_2)$.
• If it is convex, any local minimum is also a global minimum. If it is non-convex, the function may have multiple local minima.

[Figure: a non-convex curve with a local minimum and the global minimum marked.]
GD in convex

• For convex functions, GD is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and if the learning rate is eventually made small enough).
• e.g., MSE of linear regression
Feature Scaling

• A convex loss function looks like a bowl. But if the scales of the features are too different, the bowl would look skewed and gradient descent becomes inefficient.
• Feature scaling is very useful here (worth trying in almost every ML project).
• There are a few different ways to do this scaling (see the sketch after the link below):

§ Standardization    § Normalization (min-max scaling)

For more info: https://en.wikipedia.org/wiki/Feature_scaling
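A sketch of the two scaling methods named above, using their usual definitions (the slide itself does not spell out the formulas, so treat these as the common conventions rather than the lecture's exact ones):

```python
import numpy as np

def standardize(X):
    """Standardization: subtract the per-feature mean, divide by the per-feature std."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max_normalize(X):
    """Normalization (min-max): rescale each feature to the [0, 1] range."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

# Features with very different scales (made-up example).
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])

print(standardize(X).std(axis=0))        # [1. 1.]
print(min_max_normalize(X).min(axis=0))  # [0. 0.]
```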


GD in Non-convexity

• GD can only see the gradient at the current position (i.e., the current weights/parameters). That is, it does not know what the loss function looks like globally.
• Even if it reaches the global minimum, there is no way it can know that it is actually at the global minimum.
• The MSE of a neural network is a non-convex function.
When to use GD

• Most of the powerful learning models, including neural networks, use a non-convex cost function
→ Analytic solutions do not exist, and GD is usually the only practical choice

• GD is very generally applicable as long as the gradient of the cost function is available
→ Applicable to both linear and nonlinear models

• If the dataset is very big (N >> D), (stochastic) GD could be more efficient than the analytic solution (a mini-batch sketch follows below)
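A minimal mini-batch stochastic gradient descent sketch for the MSE objective (my own illustration; batch size, learning rate, and number of epochs are arbitrary choices):

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.05, batch_size=32, epochs=20, seed=0):
    """Mini-batch SGD for MSE linear regression; X is the N x (d+1) design matrix."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)                   # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            err = X[batch] @ w - y[batch]
            grad = X[batch].T @ err / len(batch)   # gradient on the mini-batch only
            w -= eta * grad
    return w

# Made-up data: y ≈ 1 + 2*x.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 1000)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.05 * rng.standard_normal(1000)
print(sgd_linear_regression(X, y))                 # close to [1.0, 2.0]
```

Each update touches only a small batch of examples, so the per-step cost does not depend on N, which is what makes SGD attractive when N is very large.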
