Linear-Regression

The document discusses linear regression as a method for making real-valued predictions based on features, exemplified by predicting house prices from attributes like size and age. It covers the learning problem, including hypothesis space selection, cost function optimization, and evaluation of generalization to unseen examples. The document also details the process of minimizing the sum of squared errors to find optimal parameters for the regression model.


Linear Regression

CE-717: Machine Learning


Sharif University of Technology

M. Soleymani
Fall 2016
Topics
• Linear regression
• Error (cost) function
• Optimization
• Generalization

2
Regression problem
• The goal is to make real-valued predictions given features
• Example: predicting house price from 3 attributes

Size (m²) | Age (year) | Region | Price (10⁶ T)
--------- | ---------- | ------ | -------------
100       | 2          | 5      | 500
80        | 25         | 3      | 250
…         | …          | …      | …

3
Learning problem
• Selecting a hypothesis space
  • Hypothesis space: a set of mappings from feature vector to target
• Learning (estimation): optimization of a cost function
  • Based on the training set $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$ and a cost function, we find an estimate $\hat{f} \in F$ of the target function
• Evaluation: we measure how well $\hat{f}$ generalizes to unseen examples

4
Hypothesis space
• Specify the class of functions (e.g., linear)
• We begin with the class of linear functions
  • easy to extend to generalized linear models, and so cover more complex regression functions

6
Linear regression: hypothesis space
• Univariate: $f: \mathbb{R} \rightarrow \mathbb{R}$
  $f(x; \mathbf{w}) = w_0 + w_1 x$
• Multivariate: $f: \mathbb{R}^d \rightarrow \mathbb{R}$
  $f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_d x_d$
• $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$ are the parameters we need to set.

7
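To make the hypothesis space concrete, here is a minimal NumPy sketch of the linear hypothesis (the function name and example numbers are illustrative, not from the slides; a leading 1 is prepended to the feature vector so that $w_0$ acts as the intercept):

```python
import numpy as np

def linear_hypothesis(x, w):
    """f(x; w) = w0 + w1*x1 + ... + wd*xd for a single feature vector x."""
    x_aug = np.concatenate(([1.0], x))  # prepend 1 so w[0] plays the role of w0
    return float(w @ x_aug)

# Hypothetical parameters for d = 2 features (size, age)
w = np.array([50.0, 4.0, -2.0])
print(linear_hypothesis(np.array([100.0, 2.0]), w))  # 50 + 4*100 - 2*2 = 446.0
```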
Learning algorithm
• Select how to measure the error (i.e., the prediction loss)
• Find the minimum of the resulting error or cost function

9
Learning algorithm
[Diagram: Training set $D$ → Learning algorithm → $w_0, w_1$; size of house $x$ → $\hat{f}(x) = f(x; \hat{\mathbf{w}})$ → estimated price]
We need to:
(1) measure how well $f(x; \mathbf{w})$ approximates the target
(2) choose $\mathbf{w}$ to minimize the error measure

10
How to measure the error
[Figure: house price vs. size $x$; vertical segments show the residuals $y^{(i)} - f(x^{(i)}; \mathbf{w})$ between each data point and the fitted line]
Squared error: $\left(y^{(i)} - f(x^{(i)}; \mathbf{w})\right)^2$

11
Linear regression: univariate example
[Figure: house price vs. size $x$ with residuals $y^{(i)} - f(x^{(i)}; \mathbf{w})$]
Cost function:
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(x^{(i)}; \mathbf{w})\right)^2 = \sum_{i=1}^{n} \left(y^{(i)} - w_0 - w_1 x^{(i)}\right)^2$

12
Regression: squared loss
• In the SSE cost function, we use the squared error as the prediction loss:
$\mathrm{Loss}(y, \hat{y}) = (y - \hat{y})^2, \quad \hat{y} = f(\mathbf{x}; \mathbf{w})$
• Cost function (based on the training set):
$J(\mathbf{w}) = \sum_{i=1}^{n} \mathrm{Loss}\left(y^{(i)}, f(\mathbf{x}^{(i)}; \mathbf{w})\right) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$
• Minimizing the sum (or mean) of squared errors is a common approach in curve fitting, neural networks, etc.
13
Sum of Squares Error (SSE) cost function
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$
• $J(\mathbf{w})$: sum of the squares of the prediction errors on the training set
• We want to find the best regression function $f(\mathbf{x}; \mathbf{w})$
  • equivalently, the best $\mathbf{w}$
• Minimize $J(\mathbf{w})$
  • Find the optimal $\hat{f}(\mathbf{x}) = f(\mathbf{x}; \hat{\mathbf{w}})$ where $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} J(\mathbf{w})$

14
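The SSE cost translates almost line-for-line into NumPy. A sketch, assuming $\mathbf{X}$ is the design matrix whose first column is all ones (the helper name is hypothetical):

```python
import numpy as np

def sse_cost(w, X, y):
    """Sum-of-squares error J(w) = sum_i (y^(i) - w^T x^(i))^2.

    X: n x (d+1) design matrix whose first column is all ones
    y: length-n vector of targets
    """
    residuals = y - X @ w            # prediction errors on the training set
    return residuals @ residuals     # equivalent to np.sum(residuals**2)
```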
Cost function: univariate example
$J(\mathbf{w})$ (a function of the parameters $w_0, w_1$)
[Figure: training data, price ($) in 1000's vs. size in feet² ($x$), shown next to the cost surface $J(w_0, w_1)$]
15 This example has been adapted from: Prof. Andrew Ng’s slides
Cost function: univariate example
$f(x; w_0, w_1) = w_0 + w_1 x$ (for fixed $w_0, w_1$, this is a function of $x$); $J(w_0, w_1)$ (a function of the parameters $w_0, w_1$)
[Figure: the fitted line for a particular $(w_0, w_1)$, shown next to the corresponding point on the contour plot of $J(w_0, w_1)$]
16 This example has been adapted from: Prof. Andrew Ng’s slides
Cost function optimization: univariate
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - w_0 - w_1 x^{(i)}\right)^2$
• Necessary conditions for the "optimal" parameter values:
$\frac{\partial J(\mathbf{w})}{\partial w_0} = 0$
$\frac{\partial J(\mathbf{w})}{\partial w_1} = 0$

20
Optimality conditions: univariate
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - w_0 - w_1 x^{(i)}\right)^2$
$\frac{\partial J(\mathbf{w})}{\partial w_1} = \sum_{i=1}^{n} 2\left(y^{(i)} - w_0 - w_1 x^{(i)}\right)\left(-x^{(i)}\right) = 0$
$\frac{\partial J(\mathbf{w})}{\partial w_0} = \sum_{i=1}^{n} 2\left(y^{(i)} - w_0 - w_1 x^{(i)}\right)(-1) = 0$
• A system of 2 linear equations

21
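Solving this system by hand gives the familiar closed form $w_1 = \frac{\sum_i (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_i (x^{(i)} - \bar{x})^2}$ and $w_0 = \bar{y} - w_1 \bar{x}$. A minimal sketch with hypothetical data:

```python
import numpy as np

def fit_univariate(x, y):
    """Solve dJ/dw0 = dJ/dw1 = 0 for the univariate SSE cost."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Hypothetical data: price (10^6 T) vs. size (m^2)
x = np.array([100.0, 80.0, 120.0, 60.0])
y = np.array([500.0, 250.0, 550.0, 200.0])
print(fit_univariate(x, y))
```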
Cost function: multivariate
• We have to minimize the empirical squared loss:
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$
$f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_d x_d, \quad \mathbf{w} = [w_0, w_1, \ldots, w_d]^T$
$\hat{\mathbf{w}} = \arg\min_{\mathbf{w} \in \mathbb{R}^{d+1}} J(\mathbf{w})$

22
Cost function and optimal linear model
• Necessary conditions for the "optimal" parameter values:
$\nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbf{0}$
• A system of $d + 1$ linear equations
23
Cost function: matrix notation
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2 = \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right)^2$

$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \quad \mathbf{X} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_d^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_d^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \cdots & x_d^{(n)} \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix}$

$J(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$
24
Minimizing cost function
Optimal linear weight vector (for the SSE cost function):
$J(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$
$\nabla_{\mathbf{w}} J(\mathbf{w}) = -2\mathbf{X}^T (\mathbf{y} - \mathbf{X}\mathbf{w})$
$\nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbf{0} \;\Rightarrow\; \mathbf{X}^T \mathbf{X} \hat{\mathbf{w}} = \mathbf{X}^T \mathbf{y}$
$\hat{\mathbf{w}} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}$

25
Minimizing cost function
$\hat{\mathbf{w}} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y} = \mathbf{X}^{\dagger} \mathbf{y}$
where $\mathbf{X}^{\dagger} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T$ is the pseudoinverse of $\mathbf{X}$

26
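A sketch of the closed-form solution in NumPy with hypothetical data. In practice np.linalg.lstsq or the pseudoinverse is preferred to explicitly inverting $\mathbf{X}^T\mathbf{X}$, which can be ill-conditioned:

```python
import numpy as np

# Hypothetical training data: n = 4 examples, d = 2 features (size, age)
X_raw = np.array([[100.0, 2.0], [80.0, 25.0], [120.0, 5.0], [60.0, 30.0]])
y = np.array([500.0, 250.0, 550.0, 200.0])

# Design matrix with a leading column of ones for w0
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Normal equations: solve (X^T X) w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically safer alternatives
w_pinv = np.linalg.pinv(X) @ y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal, w_pinv, w_lstsq)
```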
Another approach for optimizing the sum of squared errors
• Iterative approach for solving the following optimization problem:
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$

27
Review: Iterative optimization of cost function
• Cost function: $J(\mathbf{w})$
• Optimization problem: $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} J(\mathbf{w})$
• Steps:
  • Start from $\mathbf{w}^0$
  • Repeat
    • Update $\mathbf{w}^t$ to $\mathbf{w}^{t+1}$ in order to reduce $J$
    • $t \leftarrow t + 1$
  • until we hopefully end up at a minimum

28
Review: Gradient descent
• First-order optimization algorithm to find $\mathbf{w}^* = \arg\min_{\mathbf{w}} J(\mathbf{w})$
• Also known as "steepest descent"
• In each step, takes steps proportional to the negative of the gradient vector of the function at the current point $\mathbf{w}^t$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \gamma_t \nabla J(\mathbf{w}^t)$
  • $J(\mathbf{w})$ decreases fastest if one goes from $\mathbf{w}^t$ in the direction of $-\nabla J(\mathbf{w}^t)$
  • Assumption: $J(\mathbf{w})$ is defined and differentiable in a neighborhood of the point $\mathbf{w}^t$
• Gradient ascent takes steps proportional to (the positive of) the gradient to find a local maximum of the function
29
Review: Gradient descent
• Minimize $J(\mathbf{w})$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$
where $\eta$ is the step size (learning rate parameter) and
$\nabla_{\mathbf{w}} J(\mathbf{w}) = \left[ \frac{\partial J(\mathbf{w})}{\partial w_1}, \frac{\partial J(\mathbf{w})}{\partial w_2}, \ldots, \frac{\partial J(\mathbf{w})}{\partial w_d} \right]$
• If $\eta$ is small enough, then $J(\mathbf{w}^{t+1}) \leq J(\mathbf{w}^t)$.
• $\eta$ can be allowed to change at every iteration as $\eta_t$.
30
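A generic gradient-descent loop matching the update rule above (a sketch: grad_J stands for any function returning $\nabla_{\mathbf{w}} J(\mathbf{w})$, and the fixed step size and stopping tolerance are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def gradient_descent(grad_J, w0, eta=0.01, tol=1e-6, max_iter=10000):
    """Iterate w <- w - eta * grad_J(w) until the gradient is small."""
    w = w0.astype(float)
    for _ in range(max_iter):
        g = grad_J(w)
        if np.linalg.norm(g) < tol:  # near a stationary point
            break
        w = w - eta * g
    return w

# Example on a toy convex function J(w) = ||w||^2, whose gradient is 2w
print(gradient_descent(lambda w: 2 * w, np.array([3.0, -4.0])))
```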
Review: Gradient descent disadvantages
• Local minima problem
• However, when $J$ is convex, all local minima are also global minima ⇒ gradient descent can converge to the global solution.

31
Review: Problem of gradient descent with non-convex cost functions
[Figure: a non-convex surface $J(w_0, w_1)$; starting points in different basins lead gradient descent to different local minima]
32 This example has been adapted from: Prof. Ng's slides (ML Online Course, Stanford)
Gradient descent for SSE cost function
• Minimize $J(\mathbf{w})$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$
• $J(\mathbf{w})$: sum of squares error
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$
• Weight update rule for $f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x}$:
$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^{t\,T} \mathbf{x}^{(i)}\right) \mathbf{x}^{(i)}$

34
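The loop specialized to the SSE gradient. A sketch in which, as in the slide's update rule, the constant factor from differentiating the square is absorbed into $\eta$ (the step size and iteration count are hypothetical, and features are assumed scaled so a single $\eta$ works):

```python
import numpy as np

def batch_gd_sse(X, y, eta=1e-4, iters=1000):
    """Batch gradient descent for J(w) = ||y - Xw||^2.

    Each step uses all n training examples:
    w <- w + eta * sum_i (y^(i) - w^T x^(i)) x^(i)
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        residuals = y - X @ w              # length-n vector of errors
        w = w + eta * (X.T @ residuals)    # the summed update over all i
    return w
```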
Gradient descent for SSE cost function
• Weight update rule for $f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x}$:
$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right) \mathbf{x}^{(i)}$
Batch mode: each step considers all training data.
• $\eta$ too small → gradient descent can be slow.
• $\eta$ too large → gradient descent can overshoot the minimum. It may fail to converge, or even diverge.

35
$f(x; w_0, w_1) = w_0 + w_1 x$; $J(w_0, w_1)$ (a function of the parameters $w_0, w_1$)
[Figure: successive gradient-descent iterations; the fitted line improves as $(w_0, w_1)$ moves down the contours of $J(w_0, w_1)$]
36 This example has been adapted from: Prof. Ng's slides (ML Online Course, Stanford)
Stochastic gradient descent
• Batch techniques process the entire training set in one go
  • thus they can be computationally costly for large data sets.
• Stochastic gradient descent can be used when the cost function comprises a sum over data points:
$J(\mathbf{w}) = \sum_{i=1}^{n} J^{(i)}(\mathbf{w})$
• Update after presentation of $(\mathbf{x}^{(i)}, y^{(i)})$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J^{(i)}(\mathbf{w})$

45
Stochastic gradient descent
• Example: linear regression with the SSE cost function
$J^{(i)}(\mathbf{w}) = \left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right)^2$
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J^{(i)}(\mathbf{w}) = \mathbf{w}^t + \eta \left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right) \mathbf{x}^{(i)}$
This update is known as Least Mean Squares (LMS). It is suitable for sequential or online learning.

46
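The LMS rule as an online loop. A sketch in which the data are shuffled on each pass and $\eta$ is a small fixed constant (both illustrative choices; in a true streaming setting each example would simply arrive once):

```python
import numpy as np

def lms(X, y, eta=1e-5, epochs=10, seed=0):
    """Stochastic gradient descent (LMS): update after each example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):     # shuffle each pass
            error = y[i] - w @ X[i]           # y^(i) - w^T x^(i)
            w = w + eta * error * X[i]        # LMS update
    return w
```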
Stochastic gradient descent: online learning
• Sequential learning is also appropriate for real-time applications
  • data observations arrive in a continuous stream
  • and predictions must be made before seeing all of the data
• The value of $\eta$ needs to be chosen with care to ensure that the algorithm converges

47
Evaluation and generalization
• Why do we minimize the cost function (based only on the training data) when we are interested in the performance on new examples?
$\min_{\boldsymbol{\theta}} \sum_{i=1}^{n} \mathrm{Loss}\left(y^{(i)}, f(\mathbf{x}^{(i)}; \boldsymbol{\theta})\right) \quad$ (empirical loss)
• Evaluation: after training, we need to measure how well the learned prediction function predicts the target for unseen examples

48
Training and test performance
• Assumption: training and test examples are drawn independently at random from the same but unknown distribution.
  • Each training/test example $(\mathbf{x}, y)$ is a sample from the joint probability distribution $P(\mathbf{x}, y)$, i.e., $(\mathbf{x}, y) \sim P$
Empirical (training) loss $= \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}\left(y^{(i)}, f(\mathbf{x}^{(i)}; \boldsymbol{\theta})\right)$
Expected (test) loss $= E_{\mathbf{x},y}\left[\mathrm{Loss}\left(y, f(\mathbf{x}; \boldsymbol{\theta})\right)\right]$
• We minimize the empirical loss (on the training data) and expect to also obtain an acceptable expected loss
  • Empirical loss serves as a proxy for the performance over the whole distribution.

49
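In practice the expected loss is estimated on held-out examples drawn from the same distribution. A sketch on synthetic data (the generating model and split sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = 2 + 3x + noise, so every (x, y) ~ P
n = 200
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])

# Split: fit on training examples, estimate expected loss on unseen ones
X_tr, y_tr, X_te, y_te = X[:100], y[:100], X[100:], y[100:]
w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

train_loss = np.mean((y_tr - X_tr @ w) ** 2)  # empirical (training) loss
test_loss = np.mean((y_te - X_te @ w) ** 2)   # estimate of expected loss
print(train_loss, test_loss)
```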
Linear regression: number of training data
[Figure: fitted lines and their errors for $n = 10$, $n = 20$, and $n = 50$ training examples]

50
Linear regression: generalization
• By increasing the number of training examples, will the solution get better?
• Why does the mean squared error not decrease further after reaching a certain level?

51
Linear regression: types of errors
• Structural error: the error introduced by the limited function class (even with infinite training data):
$\mathbf{w}^* = \arg\min_{\mathbf{w}} E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^T \mathbf{x}\right)^2\right]$
Structural error: $E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^{*T} \mathbf{x}\right)^2\right]$
• where $\mathbf{w}^* = (w_0^*, \ldots, w_d^*)$ are the optimal linear regression parameters (infinite training data)

52
Linear regression: types of errors
• Approximation error measures how close we can get to the optimal linear predictions with limited training data:
$\mathbf{w}^* = \arg\min_{\mathbf{w}} E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^T \mathbf{x}\right)^2\right]$
$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right)^2$
Approximation error: $E_{\mathbf{x}}\left[\left(\mathbf{w}^{*T} \mathbf{x} - \hat{\mathbf{w}}^T \mathbf{x}\right)^2\right]$
• where $\hat{\mathbf{w}}$ are the parameter estimates based on a small training set (so they are themselves random variables)
53
Linear regression: error decomposition
• The expected error decomposes into the sum of the structural and approximation errors:
$E_{\mathbf{x},y}\left[\left(y - \hat{\mathbf{w}}^T \mathbf{x}\right)^2\right] = E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^{*T} \mathbf{x}\right)^2\right] + E_{\mathbf{x}}\left[\left(\mathbf{w}^{*T} \mathbf{x} - \hat{\mathbf{w}}^T \mathbf{x}\right)^2\right]$
• Derivation:
$E_{\mathbf{x},y}\left[\left(y - \hat{\mathbf{w}}^T \mathbf{x}\right)^2\right] = E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^{*T} \mathbf{x} + \mathbf{w}^{*T} \mathbf{x} - \hat{\mathbf{w}}^T \mathbf{x}\right)^2\right]$
$= E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^{*T} \mathbf{x}\right)^2\right] + E_{\mathbf{x}}\left[\left(\mathbf{w}^{*T} \mathbf{x} - \hat{\mathbf{w}}^T \mathbf{x}\right)^2\right] + \underbrace{2\, E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^{*T} \mathbf{x}\right)\left(\mathbf{w}^{*T} \mathbf{x} - \hat{\mathbf{w}}^T \mathbf{x}\right)\right]}_{=\,0}$
Note: the optimality condition for $\mathbf{w}^*$ gives $E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^{*T} \mathbf{x}\right)\mathbf{x}\right] = \mathbf{0}$, since $\nabla_{\mathbf{w}} E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^T \mathbf{x}\right)^2\right]\Big|_{\mathbf{w}^*} = \mathbf{0}$
55
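The decomposition can be checked numerically on a synthetic distribution where $\mathbf{w}^*$ is known. In this well-specified Monte-Carlo sketch the true coefficients are the optimal linear parameters, so the structural error is just the noise variance (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, 2.0])   # for a well-specified model, w* = w_true
sigma = 0.5                      # noise std; structural error = sigma^2

def sample(n):
    x = rng.uniform(-1, 1, n)
    X = np.column_stack([np.ones(n), x])
    return X, X @ w_true + rng.normal(0, sigma, n)

# Fit w_hat on a small training set
X_tr, y_tr = sample(20)
w_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

# Monte-Carlo estimates of the expectations on a large fresh sample
X, y = sample(200000)
total = np.mean((y - X @ w_hat) ** 2)
structural = np.mean((y - X @ w_true) ** 2)      # approx. sigma^2
approx = np.mean((X @ w_true - X @ w_hat) ** 2)  # approximation error
print(total, structural + approx)                # the two should nearly match
```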
