02 - Linear Models - A
Sungjin Ahn
School of Computing
KAIST
Linear Regression
$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d$

• $x_i$ is the $i$th feature
• $w_j$ is the $j$th model parameter (also called the $j$th weight)
• $w_0$ is the bias term
• $\hat{y}$ is the predicted value
• $d$ is the number of features
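A minimal sketch (variable names and values are illustrative, not from the slides) of computing this prediction for a single data point:

```python
import numpy as np

# Illustrative example with d = 3 features plus a bias term w0.
w0 = 0.5                          # bias term
w = np.array([1.0, -2.0, 0.3])    # weights w1..wd
x = np.array([2.0, 0.5, 4.0])     # one data point's features x1..xd

# y_hat = w0 + w1*x1 + ... + wd*xd
y_hat = w0 + np.dot(w, x)
print(y_hat)  # 2.7
```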
Vectorized Form

$\hat{y} = h_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$, where $\mathbf{x} = (1, x_1, \ldots, x_d)^\top$ and $\mathbf{w} = (w_0, w_1, \ldots, w_d)^\top$

For example, with $d = 4$ features:

$\hat{y} = \begin{bmatrix} w_0 & w_1 & w_2 & w_3 & w_4 \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}$
Matrix Form

§ X: N x (d+1) matrix (N: the number of training data points); this is called the design matrix
• $x_d^{(n)}$: the $d$th feature of the $n$th data point
§ w: (d+1) x 1 vector
§ $\hat{\mathbf{y}}$: N x 1 vector

$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$:

$\begin{bmatrix} \hat{y}^{(1)} \\ \hat{y}^{(2)} \\ \hat{y}^{(3)} \\ \vdots \\ \hat{y}^{(N)} \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_d^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_d^{(2)} \\ 1 & x_1^{(3)} & \cdots & x_d^{(3)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix}$
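A minimal NumPy sketch of this matrix form (data and names are illustrative): the design matrix gets a leading column of ones so the bias $w_0$ is handled by the same matrix product.

```python
import numpy as np

# Illustrative data: N = 5 points, d = 2 features.
X_raw = np.random.randn(5, 2)

# Design matrix: prepend a column of ones -> shape (N, d+1).
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

w = np.array([0.5, 1.0, -2.0])   # (d+1,) vector: [w0, w1, w2]

y_hat = X @ w                    # N x 1 vector of predictions (shape (5,))
```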
Basis Functions
• Examples:
§ Polynomial: $\phi_j(x) = x^j$
§ Gaussian: $\phi_j(x) = \exp\left(-\dfrac{(x - \mu_j)^2}{2 s^2}\right)$
§ Sigmoid: $\phi_j(x) = \sigma\left(\dfrac{x - \mu_j}{s}\right)$
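A small sketch (the centers and widths are illustrative assumptions) of mapping a scalar input through these basis functions before fitting a linear model on the transformed features:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(0.0, 1.0, 50)          # scalar inputs
mus = np.array([0.25, 0.5, 0.75])      # basis centers (assumed)
s = 0.1                                # width / scale (assumed)

poly_feats  = np.stack([x**j for j in range(1, 4)], axis=1)    # x, x^2, x^3
gauss_feats = np.exp(-(x[:, None] - mus)**2 / (2 * s**2))      # Gaussian bumps
sigm_feats  = sigmoid((x[:, None] - mus) / s)                  # sigmoidal ramps

# The model stays linear in w: y_hat = Phi @ w, with Phi any of the feature matrices above.
```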
Polynomial Regression is a Linear Model
[Figure 1.4 (Bishop): Plots of polynomials having various orders M = 0, 1, 3, 9, shown as red curves, fitted to the data set shown in Figure 1.2; axes are x and t.]
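A sketch (the data generated here is illustrative, mimicking the noisy sinusoid in the figure) showing that polynomial regression is still a linear model: the fit is ordinary linear least squares on polynomial features.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)   # noisy targets

M = 3                                                       # polynomial order
Phi = np.stack([x**j for j in range(M + 1)], axis=1)        # columns: 1, x, x^2, x^3

# Linear in w, even though the fitted curve in x is a cubic.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
t_hat = Phi @ w
```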
• A cost function for linear regression
• The squared loss of each data item measures the distance between the prediction $\hat{y}$ and the label (= target) $y$ (why square?)

$l^{(i)}(\mathbf{w}) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$

• The constant 1/2 makes no real difference but will prove notationally convenient when we take the derivative of the loss. Since the training dataset is given to us and beyond our control, the empirical error is only a function of the model parameters. To make this concrete, consider the one-dimensional regression example shown in Fig. 3.1.1.

[Fig. 3.1.1: Fit data with a linear model.]

• Note that large differences between estimates $\hat{y}^{(i)}$ and observations $y^{(i)}$ lead to even larger contributions to the loss, due to the quadratic dependence.
• Mean Squared Error (MSE) for the whole training data: to measure the quality of a model on the entire dataset, we simply average (or equivalently, sum) the losses on the training set

$L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} - y^{(i)}\right)^2$

• The goal is to find the model parameters that minimize this loss

$\mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} L(\mathbf{w})$

• There are two approaches to this optimization problem
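A minimal sketch (illustrative names and data) of computing this MSE loss:

```python
import numpy as np

def mse_loss(w, b, X, y):
    """Average squared-error loss with the conventional 1/2 factor."""
    residual = X @ w + b - y
    return 0.5 * np.mean(residual**2)

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3
print(mse_loss(np.zeros(3), 0.0, X, y))   # loss of the all-zero model
```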
Solving Linear Regression
• What if N >> d?
§ Computing $\mathbf{X}^\top\mathbf{X}$ is $O(Nd^2)$, so if N is very large (e.g., N = the number of all webpages on the internet), this product can cost more than the matrix inversion itself
§ What would be the solution?
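For reference, a sketch of the closed-form (normal-equation) solution whose cost the bullet above refers to; it assumes the design matrix X already contains the bias column of ones.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Least-squares solution w = (X^T X)^{-1} X^T y.

    Forming X^T X costs O(N d^2) and solving the (d+1)x(d+1) system costs O(d^3),
    so when N >> d the X^T X product dominates the overall cost.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```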
Gradient Descent
The Gradient Descent Method
$\mathbf{w} \leftarrow \mathbf{w} - \eta\,\dfrac{\partial E(\mathbf{w})}{\partial \mathbf{w}}$

§ Here, $\eta$ is the learning rate (aka step size), a hyperparameter controlling how far we move in the direction of the (negative) gradient.
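A generic sketch of this update rule on a simple one-dimensional objective (the objective and its gradient are illustrative, not from the slides):

```python
# Gradient descent on E(w) = (w - 3)^2, whose gradient is 2*(w - 3).
def grad_E(w):
    return 2.0 * (w - 3.0)

w, eta = 0.0, 0.1            # initial weight and learning rate
for _ in range(100):
    w = w - eta * grad_E(w)  # w <- w - eta * dE/dw
print(w)                     # approaches the minimizer w = 3
```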
• MSE Loss

$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2$

• When training the model, we want to find parameters $(\mathbf{w}^*, b^*)$ that minimize the total loss across all training samples:

$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b} L(\mathbf{w}, b)$

• Gradient

$\partial_{\mathbf{w}} L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)\mathbf{x}^{(i)}$

$\partial_{b} L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)$

• Update (for w; $(\mathbf{w}, b)$ can also be updated jointly)

$\mathbf{w} \leftarrow \mathbf{w} - \eta\,\partial_{\mathbf{w}} L(\mathbf{w}, b)$
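Putting the gradients and the update together, a sketch of full-batch gradient descent for linear regression (data and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3          # synthetic targets

w, b, eta = np.zeros(3), 0.0, 0.1                 # initialization and learning rate
for _ in range(500):
    err = X @ w + b - y                           # w^T x^(i) + b - y^(i)
    grad_w = X.T @ err / len(y)                   # gradient w.r.t. w
    grad_b = err.mean()                           # gradient w.r.t. b
    w, b = w - eta * grad_w, b - eta * grad_b     # joint update
print(w, b)                                       # approaches [1.0, -2.0, 0.5] and 0.3
```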
Learning Rate
[Figure: loss over w with a learning rate that is too small (left: slow progress toward the minimum) vs. too large (right: updates overshoot and may diverge).]
Convex & Non-convex Function
• A function is said to be convex if you pick any two points on the curve and the line segment joining them never crosses the curve (it lies on or above it).
• Formally, a function $f: X \to \mathbb{R}$ is convex if for any $t \in [0, 1]$ and all $x_1, x_2 \in X$, $f(t x_1 + (1 - t) x_2) \le t f(x_1) + (1 - t) f(x_2)$
• If it is convex, the function has one global minimum (any local minimum is global). If it is non-convex, the function may have multiple local minima.

[Figure: a non-convex function with a local minimum and a global minimum.]
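As an illustration (not from the slides), the defining inequality can be checked numerically for a convex and a non-convex function:

```python
import numpy as np

def convexity_violated(f, x1, x2, ts):
    """True if f(t*x1 + (1-t)*x2) > t*f(x1) + (1-t)*f(x2) for some t in ts."""
    lhs = f(ts * x1 + (1 - ts) * x2)
    rhs = ts * f(x1) + (1 - ts) * f(x2)
    return bool(np.any(lhs > rhs + 1e-12))

ts = np.linspace(0.0, 1.0, 101)
print(convexity_violated(lambda x: x**2, -2.0, 3.0, ts))   # False: x^2 is convex
print(convexity_violated(np.sin, 0.0, 4.0, ts))            # True: sin is non-convex on [0, 4]
```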
GD on a Convex Function
Standardization / Normalization