Linear Regression
M. Soleymani
Fall 2016
Topics
Linear regression
Error (cost) function
Optimization
Generalization
Regression problem
The goal is to make real-valued predictions given input features.
Learning problem
Selecting a hypothesis space
Hypothesis space: a set of mappings from feature vector to
target
Hypothesis space
Specify the class of functions (e.g., linear)
Linear regression: hypothesis space
Univariate: 𝑓 ∶ ℝ → ℝ,  𝑓(𝑥; 𝒘) = 𝑤0 + 𝑤1𝑥

Multivariate: 𝑓 ∶ ℝ𝑑 → ℝ,  𝑓(𝒙; 𝒘) = 𝑤0 + 𝑤1𝑥1 + … + 𝑤𝑑𝑥𝑑

𝒘 = [𝑤0, 𝑤1, …, 𝑤𝑑]ᵀ are the parameters we need to set.
Learning algorithm
Given a training set 𝐷, the learning algorithm must:
(1) measure how well 𝑓(𝑥; 𝒘) approximates the target
(2) choose 𝒘 = (𝑤0, 𝑤1) to minimize that error measure
How to measure the error
[Figure: scatter plot of training data with a fitted line; the vertical deviation of each point from the line is the residual 𝑦(𝑖) − 𝑓(𝑥(𝑖); 𝒘)]

Squared error: (𝑦(𝑖) − 𝑓(𝑥(𝑖); 𝒘))²
Linear regression: univariate example
[Figure: the same data and fitted line, with residuals 𝑦(𝑖) − 𝑓(𝑥(𝑖); 𝒘) marked]

Cost function:

𝐽(𝒘) = ∑ᵢ₌₁ⁿ (𝑦(𝑖) − 𝑓(𝑥(𝑖); 𝒘))² = ∑ᵢ₌₁ⁿ (𝑦(𝑖) − 𝑤0 − 𝑤1𝑥(𝑖))²
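The cost 𝐽(𝑤0, 𝑤1) is straightforward to compute directly. A minimal Python sketch (the toy data values here are made up for illustration):

```python
# Sum-of-squared-errors cost J(w0, w1) for a univariate linear model.
# The toy data below is hypothetical, chosen only to illustrate the computation.

def sse_cost(w0, w1, xs, ys):
    """J(w) = sum_i (y_i - w0 - w1 * x_i)^2"""
    return sum((y - w0 - w1 * x) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # exactly y = 2x

print(sse_cost(0.0, 2.0, xs, ys))  # perfect fit -> 0.0
print(sse_cost(0.0, 1.0, xs, ys))  # residuals 1, 2, 3 -> 1 + 4 + 9 = 14.0
```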
Regression: squared loss
In the SSE cost function, we used squared error as the
prediction loss:
𝐿𝑜𝑠𝑠(𝑦, 𝑦̂) = (𝑦 − 𝑦̂)²  where 𝑦̂ = 𝑓(𝒙; 𝒘)

𝐽(𝒘) = ∑ᵢ₌₁ⁿ (𝑦(𝑖) − 𝑓(𝒙(𝑖); 𝒘))²

Minimize 𝐽(𝒘): find the optimal 𝑓̂(𝒙) = 𝑓(𝒙; 𝒘̂) where 𝒘̂ = argmin𝒘 𝐽(𝒘)
Cost function: univariate example
[Figure: housing data — price in $1000s vs. size in feet² — with a candidate line, next to the cost 𝐽(𝒘) as a function of the parameters 𝑤0, 𝑤1]

This example has been adapted from Prof. Andrew Ng's slides.
Cost function: univariate example

𝑓(𝑥; 𝑤0, 𝑤1) = 𝑤0 + 𝑤1𝑥  (for fixed 𝑤0, 𝑤1, this is a function of 𝑥)
𝐽(𝑤0, 𝑤1)  (a function of the parameters 𝑤0, 𝑤1)

[Figure sequence: several choices of (𝑤0, 𝑤1), each shown as a line through the data on the left and as a point on the contours of 𝐽(𝑤0, 𝑤1) on the right]
Cost function optimization: univariate
𝐽(𝒘) = ∑ᵢ₌₁ⁿ (𝑦(𝑖) − 𝑤0 − 𝑤1𝑥(𝑖))²

Setting the partial derivatives to zero:

∂𝐽(𝒘)/∂𝑤0 = 0
∂𝐽(𝒘)/∂𝑤1 = 0
Optimality conditions: univariate
𝐽(𝒘) = ∑ᵢ₌₁ⁿ (𝑦(𝑖) − 𝑤0 − 𝑤1𝑥(𝑖))²

∂𝐽(𝒘)/∂𝑤0 = ∑ᵢ₌₁ⁿ 2(𝑦(𝑖) − 𝑤0 − 𝑤1𝑥(𝑖))(−1) = 0
∂𝐽(𝒘)/∂𝑤1 = ∑ᵢ₌₁ⁿ 2(𝑦(𝑖) − 𝑤0 − 𝑤1𝑥(𝑖))(−𝑥(𝑖)) = 0
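Solving these two equations in closed form gives the familiar formulas 𝑤1 = ∑(𝑥(𝑖) − 𝑥̄)(𝑦(𝑖) − 𝑦̄) / ∑(𝑥(𝑖) − 𝑥̄)² and 𝑤0 = 𝑦̄ − 𝑤1𝑥̄. A sketch in Python (the function name and toy data are my own, for illustration):

```python
# Closed-form solution of the two optimality conditions above:
#   w1 = sum((x_i - mean_x)(y_i - mean_y)) / sum((x_i - mean_x)^2)
#   w0 = mean_y - w1 * mean_x
# (standard rearrangement of dJ/dw0 = dJ/dw1 = 0; the data is made up)

def fit_univariate(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    w0 = my - w1 * mx
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly y = 1 + 2x
w0, w1 = fit_univariate(xs, ys)
print(w0, w1)  # -> 1.0 2.0
```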
Cost function: multivariate
We have to minimize the empirical squared loss:
𝐽(𝒘) = ∑ᵢ₌₁ⁿ (𝑦(𝑖) − 𝑓(𝒙(𝑖); 𝒘))²

𝑓(𝒙; 𝒘) = 𝑤0 + 𝑤1𝑥1 + … + 𝑤𝑑𝑥𝑑
𝒘 = [𝑤0, 𝑤1, …, 𝑤𝑑]ᵀ

𝒘̂ = argmin𝒘∈ℝ𝑑+1 𝐽(𝒘)
Cost function and optimal linear model
𝒚 = [𝑦(1), 𝑦(2), …, 𝑦(𝑛)]ᵀ
𝒘 = [𝑤0, 𝑤1, …, 𝑤𝑑]ᵀ

𝑿 is the 𝑛 × (𝑑 + 1) design matrix whose 𝑖-th row is [1, 𝑥1(𝑖), …, 𝑥𝑑(𝑖)] (a leading 1 for the bias term).

𝐽(𝒘) = ‖𝒚 − 𝑿𝒘‖²
Minimizing cost function
Optimal linear weight vector (for SSE cost function):
𝐽(𝒘) = ‖𝒚 − 𝑿𝒘‖²

∇𝒘𝐽(𝒘) = −2𝑿ᵀ(𝒚 − 𝑿𝒘)
∇𝒘𝐽(𝒘) = 𝟎 ⇒ 𝑿ᵀ𝑿𝒘 = 𝑿ᵀ𝒚
𝒘̂ = (𝑿ᵀ𝑿)⁻¹𝑿ᵀ𝒚
Minimizing cost function
𝒘̂ = (𝑿ᵀ𝑿)⁻¹𝑿ᵀ𝒚 = 𝑿†𝒚

𝑿† = (𝑿ᵀ𝑿)⁻¹𝑿ᵀ is the pseudo-inverse of 𝑿.
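In code, the pseudo-inverse solution can be sketched with NumPy (`np.linalg.pinv` computes 𝑿†; the data and weight values below are made up for illustration):

```python
import numpy as np

# Closed-form least squares: w = (X^T X)^{-1} X^T y = pinv(X) @ y.
# Hypothetical noiseless data so the true weights are recovered exactly.
rng = np.random.default_rng(0)
X_raw = rng.uniform(0.0, 1.0, size=(50, 2))
X = np.column_stack([np.ones(len(X_raw)), X_raw])  # prepend bias column of 1s
true_w = np.array([0.5, 2.0, -1.0])
y = X @ true_w                                     # noiseless targets

w_hat = np.linalg.pinv(X) @ y                      # w = X† y
print(w_hat)  # recovers [0.5, 2.0, -1.0] up to numerical precision

# the same answer from solving the normal equations X^T X w = X^T y directly:
w_ne = np.linalg.solve(X.T @ X, X.T @ y)
```

In practice `np.linalg.lstsq` is preferred over forming 𝑿ᵀ𝑿 explicitly, since the normal equations square the condition number.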
Another approach for optimizing the sum of squared errors

An iterative approach for solving the following optimization problem:

𝐽(𝒘) = ∑ᵢ₌₁ⁿ (𝑦(𝑖) − 𝑓(𝒙(𝑖); 𝒘))²
Review:
Iterative optimization of cost function
Cost function: 𝐽(𝒘)
Optimization problem: 𝒘̂ = argmin𝒘 𝐽(𝒘)

Steps:
Start from some initial 𝒘0
Repeat: update 𝒘𝑡 to 𝒘𝑡+1 in order to reduce 𝐽, then 𝑡 ← 𝑡 + 1
until we hopefully end up at a minimum
Review:
Gradient descent
First-order optimization algorithm to find 𝒘* = argmin𝒘 𝐽(𝒘)
Also known as "steepest descent"

To minimize 𝐽(𝒘), take repeated steps against the gradient:

𝒘𝑡+1 = 𝒘𝑡 − 𝜂∇𝒘𝐽(𝒘𝑡)

where 𝜂 is the step size (learning rate parameter) and

∇𝒘𝐽(𝒘) = [∂𝐽(𝒘)/∂𝑤1, ∂𝐽(𝒘)/∂𝑤2, …, ∂𝐽(𝒘)/∂𝑤𝑑]
Review:
Gradient descent disadvantages
Review: Problem of gradient descent with
non-convex cost functions
[Figure: a non-convex cost surface 𝐽(𝑤0, 𝑤1); starting gradient descent from different initial points can lead to different local minima]

This example has been adapted from Prof. Ng's slides (ML Online Course, Stanford).
Gradient descent for SSE cost function
Minimize 𝐽(𝒘):  𝒘𝑡+1 = 𝒘𝑡 − 𝜂∇𝒘𝐽(𝒘𝑡)

Weight update rule, with 𝑓(𝒙; 𝒘) = 𝒘ᵀ𝒙:

𝒘𝑡+1 = 𝒘𝑡 + 𝜂 ∑ᵢ₌₁ⁿ (𝑦(𝑖) − 𝒘𝑡ᵀ𝒙(𝑖)) 𝒙(𝑖)

Batch mode: each step considers all training data.
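A sketch of this batch update in NumPy (the step size, iteration count, and data are illustrative choices of mine, not prescribed by the slides):

```python
import numpy as np

# Batch gradient descent with the update rule above:
#   w <- w + eta * sum_i (y_i - w^T x_i) x_i
def gd_linear_regression(X, y, eta=0.005, n_steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        residuals = y - X @ w            # y^(i) - w^T x^(i) for all i at once
        w = w + eta * (X.T @ residuals)  # one batch update over all data
    return w

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.uniform(-1.0, 1.0, 100)])
y = X @ np.array([1.0, 2.0])             # noiseless synthetic targets
w = gd_linear_regression(X, y)
print(w)  # converges to approximately [1.0, 2.0]
```

Note that 𝜂 must be small enough for the iteration to be stable; too large a step size makes the updates diverge.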
[Figure sequence: gradient descent iterations on the univariate example — each step moves (𝑤0, 𝑤1) downhill on the contours of 𝐽(𝑤0, 𝑤1) while the corresponding line 𝑓(𝑥; 𝑤0, 𝑤1) = 𝑤0 + 𝑤1𝑥 fits the data progressively better]
Stochastic gradient descent
Batch techniques process the entire training set in one go, so they can be computationally costly for large data sets.
Stochastic gradient descent instead updates the parameters using a single training example (or a small batch) at a time.
Stochastic gradient descent
Example: Linear regression with SSE cost function
𝐽(𝑖)(𝒘) = (𝑦(𝑖) − 𝒘ᵀ𝒙(𝑖))²

𝒘𝑡+1 = 𝒘𝑡 + 𝜂(𝑦(𝑖) − 𝒘𝑡ᵀ𝒙(𝑖))𝒙(𝑖)

This update rule is known as Least Mean Squares (LMS).
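A sketch of the LMS updates in NumPy (the learning rate, epoch count, and data are illustrative assumptions):

```python
import numpy as np

# LMS / stochastic gradient descent: one training example per update,
#   w <- w + eta * (y_i - w^T x_i) x_i
def lms(X, y, eta=0.05, n_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):          # shuffle order each epoch
            w = w + eta * (y[i] - w @ X[i]) * X[i]  # single-example update
    return w

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.uniform(-1.0, 1.0, 200)])
y = X @ np.array([0.5, -1.5])                      # noiseless synthetic targets
w = lms(X, y)
print(w)  # converges to approximately [0.5, -1.5]
```

With noisy data, a fixed 𝜂 leaves the iterates fluctuating around the minimum; a decaying step size is the usual remedy.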
Stochastic gradient descent: online learning
Sequential learning is also appropriate for real-time applications, where data observations arrive in a continuous stream and predictions must be made before all of the data has been seen.
Evaluation and generalization
Empirical loss:  min𝜽 ∑ᵢ₌₁ⁿ 𝐿𝑜𝑠𝑠(𝑦(𝑖), 𝑓(𝒙(𝑖); 𝜽))
Training and test performance
Assumption: training and test examples are drawn independently
at random from the same but unknown distribution.
Each training/test example (𝒙, 𝑦) is a sample from joint probability
distribution 𝑃 𝒙, 𝑦 , i.e., 𝒙, 𝑦 ~𝑃
Empirical (training) loss = (1/𝑛) ∑ᵢ₌₁ⁿ 𝐿𝑜𝑠𝑠(𝑦(𝑖), 𝑓(𝒙(𝑖); 𝜽))
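To make the train/test distinction concrete, here is a sketch that fits a model on one split of synthetic data drawn from a single distribution and evaluates the empirical squared loss on both splits (all data and constants are made up):

```python
import numpy as np

# Train and test examples drawn i.i.d. from the same distribution:
# y = 1 + 2x + Gaussian noise with standard deviation 0.1.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(300), rng.uniform(-1.0, 1.0, 300)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(300)

X_tr, y_tr = X[:200], y[:200]    # training split
X_te, y_te = X[200:], y[200:]    # held-out test split

w = np.linalg.pinv(X_tr) @ y_tr  # fit on the training data only

mse_tr = np.mean((y_tr - X_tr @ w) ** 2)  # empirical (training) loss
mse_te = np.mean((y_te - X_te @ w) ** 2)  # test loss
print(mse_tr, mse_te)  # both near the noise variance 0.01
```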
Linear regression: number of training data
[Figure: fitted lines for training set sizes 𝑛 = 10, 𝑛 = 20, and 𝑛 = 50]
Linear regression: generalization
By increasing the number of training examples, will the solution get better?
Why does the mean squared error stop decreasing after reaching a certain level?
Linear regression: types of errors
Structural error: the error introduced by the limited function class, even with infinite training data:

𝒘* = argmin𝒘 𝐸𝒙,𝑦[(𝑦 − 𝒘ᵀ𝒙)²]

Structural error: 𝐸𝒙,𝑦[(𝑦 − 𝒘*ᵀ𝒙)²]
Linear regression: types of errors
Approximation error measures how close we can get to the optimal linear predictor with limited training data:

𝒘* = argmin𝒘 𝐸𝒙,𝑦[(𝑦 − 𝒘ᵀ𝒙)²]
𝒘̂ = argmin𝒘 ∑ᵢ₌₁ⁿ (𝑦(𝑖) − 𝒘ᵀ𝒙(𝑖))²

Approximation error: 𝐸𝒙[(𝒘*ᵀ𝒙 − 𝒘̂ᵀ𝒙)²]
Linear regression: error decomposition
The expected error decomposes into the sum of the structural and approximation errors:

𝐸𝒙,𝑦[(𝑦 − 𝒘̂ᵀ𝒙)²] = 𝐸𝒙,𝑦[(𝑦 − 𝒘*ᵀ𝒙)²] + 𝐸𝒙[(𝒘*ᵀ𝒙 − 𝒘̂ᵀ𝒙)²]

Derivation:

𝐸𝒙,𝑦[(𝑦 − 𝒘̂ᵀ𝒙)²] = 𝐸𝒙,𝑦[((𝑦 − 𝒘*ᵀ𝒙) + (𝒘*ᵀ𝒙 − 𝒘̂ᵀ𝒙))²]
= 𝐸𝒙,𝑦[(𝑦 − 𝒘*ᵀ𝒙)²] + 𝐸𝒙[(𝒘*ᵀ𝒙 − 𝒘̂ᵀ𝒙)²] + 2𝐸𝒙,𝑦[(𝑦 − 𝒘*ᵀ𝒙)(𝒘*ᵀ𝒙 − 𝒘̂ᵀ𝒙)]

The cross term is zero: the optimality condition for 𝒘* gives 𝐸𝒙,𝑦[(𝑦 − 𝒘*ᵀ𝒙)𝒙] = 𝟎, since ∇𝒘𝐸𝒙,𝑦[(𝑦 − 𝒘ᵀ𝒙)²] |𝒘=𝒘* = 𝟎
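The decomposition can be checked numerically by Monte Carlo: approximate 𝒘* on a very large sample, fit 𝒘̂ on a small training set, and compare the three expectations on fresh data. A sketch (the quadratic target and all constants are my own choices for illustration):

```python
import numpy as np

# Monte Carlo sanity check of the decomposition
#   E[(y - w_hat^T x)^2] = E[(y - w*^T x)^2] + E[(w*^T x - w_hat^T x)^2]
# w* is approximated on a very large sample; the target y = x^2 + noise is
# deliberately nonlinear so the structural error is nonzero.
rng = np.random.default_rng(4)

def sample(n):
    x = rng.uniform(-1.0, 1.0, n)
    X = np.column_stack([np.ones(n), x])
    y = x ** 2 + 0.1 * rng.standard_normal(n)
    return X, y

X_big, y_big = sample(500_000)
w_star = np.linalg.pinv(X_big) @ y_big     # ~ population-optimal linear w*

X_tr, y_tr = sample(20)                    # small training set
w_hat = np.linalg.pinv(X_tr) @ y_tr       # empirical minimizer

X_ev, y_ev = sample(500_000)               # fresh evaluation sample
total      = np.mean((y_ev - X_ev @ w_hat) ** 2)
structural = np.mean((y_ev - X_ev @ w_star) ** 2)
approx     = np.mean((X_ev @ (w_star - w_hat)) ** 2)
print(total, structural + approx)  # approximately equal
```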