Linear-Regression

The document discusses linear regression as a method for making real-valued predictions based on features, exemplified by predicting house prices from attributes like size and age. It covers the learning problem, including hypothesis space selection, cost function optimization, and evaluation of generalization to unseen examples. The document also details the process of minimizing the sum of squared errors to find optimal parameters for the regression model.


Linear Regression

CE-717: Machine Learning


Sharif University of Technology

M. Soleymani
Fall 2016
Topics
• Linear regression
• Error (cost) function
• Optimization
• Generalization

2
Regression problem
• The goal is to make real-valued predictions given features
• Example: predicting house price from 3 attributes

Size (m²) | Age (year) | Region | Price (10⁶ T)
--------- | ---------- | ------ | -------------
100       | 2          | 5      | 500
80        | 25         | 3      | 250
…         | …          | …      | …

3
Learning problem
• Selecting a hypothesis space
  • Hypothesis space: a set of mappings from feature vector to target
• Learning (estimation): optimization of a cost function
  • Based on the training set $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$ and a cost function, we find an estimate $\hat{f} \in F$ of the target function
• Evaluation: we measure how well $\hat{f}$ generalizes to unseen examples

4
Hypothesis space
• Specify the class of functions (e.g., linear)
• We begin with the class of linear functions
  • easy to extend to generalized linear models, and so cover more complex regression functions

6
Linear regression: hypothesis space
• Univariate: $f: \mathbb{R} \rightarrow \mathbb{R}$
  $f(x; \mathbf{w}) = w_0 + w_1 x$
• Multivariate: $f: \mathbb{R}^d \rightarrow \mathbb{R}$
  $f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_d x_d$
• $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$ are the parameters we need to set.

7
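To make the hypothesis space concrete, here is a minimal NumPy sketch of the linear hypothesis (the function name and example numbers are illustrative, not from the slides; a leading 1 is prepended to the feature vector so that $w_0$ acts as the intercept):

```python
import numpy as np

def linear_hypothesis(x, w):
    """f(x; w) = w0 + w1*x1 + ... + wd*xd for a single feature vector x."""
    x_aug = np.concatenate(([1.0], x))  # prepend 1 so w[0] plays the role of w0
    return float(w @ x_aug)

# Hypothetical parameters for d = 2 features (size, age)
w = np.array([50.0, 4.0, -2.0])
print(linear_hypothesis(np.array([100.0, 2.0]), w))  # 50 + 4*100 - 2*2 = 446.0
```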
Learning algorithm
• Select how to measure the error (i.e., the prediction loss)
• Find the minimum of the resulting error or cost function

9
Learning algorithm
[Diagram: Training set $D$ → Learning algorithm → $w_0, w_1$; size of house $x$ → $\hat{f}(x) = f(x; \hat{\mathbf{w}})$ → estimated price]
We need to:
(1) measure how well $f(x; \mathbf{w})$ approximates the target
(2) choose $\mathbf{w}$ to minimize the error measure

10
How to measure the error
[Figure: house price vs. size $x$; vertical segments show the residuals $y^{(i)} - f(x^{(i)}; \mathbf{w})$ between each data point and the fitted line]
Squared error: $\left(y^{(i)} - f(x^{(i)}; \mathbf{w})\right)^2$

11
Linear regression: univariate example
[Figure: house price vs. size $x$ with residuals $y^{(i)} - f(x^{(i)}; \mathbf{w})$]
Cost function:
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(x^{(i)}; \mathbf{w})\right)^2 = \sum_{i=1}^{n} \left(y^{(i)} - w_0 - w_1 x^{(i)}\right)^2$

12
Regression: squared loss
• In the SSE cost function, we use the squared error as the prediction loss:
$\mathrm{Loss}(y, \hat{y}) = (y - \hat{y})^2, \quad \hat{y} = f(\mathbf{x}; \mathbf{w})$
• Cost function (based on the training set):
$J(\mathbf{w}) = \sum_{i=1}^{n} \mathrm{Loss}\left(y^{(i)}, f(\mathbf{x}^{(i)}; \mathbf{w})\right) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$
• Minimizing the sum (or mean) of squared errors is a common approach in curve fitting, neural networks, etc.
13
Sum of Squares Error (SSE) cost function
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$
• $J(\mathbf{w})$: sum of the squares of the prediction errors on the training set
• We want to find the best regression function $f(\mathbf{x}; \mathbf{w})$
  • equivalently, the best $\mathbf{w}$
• Minimize $J(\mathbf{w})$
  • Find the optimal $\hat{f}(\mathbf{x}) = f(\mathbf{x}; \hat{\mathbf{w}})$ where $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} J(\mathbf{w})$

14
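The SSE cost translates almost line-for-line into NumPy. A sketch, assuming $\mathbf{X}$ is the design matrix whose first column is all ones (the helper name is hypothetical):

```python
import numpy as np

def sse_cost(w, X, y):
    """Sum-of-squares error J(w) = sum_i (y^(i) - w^T x^(i))^2.

    X: n x (d+1) design matrix whose first column is all ones
    y: length-n vector of targets
    """
    residuals = y - X @ w            # prediction errors on the training set
    return residuals @ residuals     # equivalent to np.sum(residuals**2)
```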
Cost function: univariate example
$J(\mathbf{w})$ (a function of the parameters $w_0, w_1$)
[Figure: training data, price ($) in 1000's vs. size in feet² ($x$), shown next to the cost surface $J(w_0, w_1)$]
15 This example has been adapted from: Prof. Andrew Ng’s slides
Cost function: univariate example
$f(x; w_0, w_1) = w_0 + w_1 x$ (for fixed $w_0, w_1$, this is a function of $x$); $J(w_0, w_1)$ (a function of the parameters $w_0, w_1$)
[Figure: the fitted line for a particular $(w_0, w_1)$, shown next to the corresponding point on the contour plot of $J(w_0, w_1)$]
16 This example has been adapted from: Prof. Andrew Ng’s slides
Cost function optimization: univariate
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - w_0 - w_1 x^{(i)}\right)^2$
• Necessary conditions for the "optimal" parameter values:
$\frac{\partial J(\mathbf{w})}{\partial w_0} = 0$
$\frac{\partial J(\mathbf{w})}{\partial w_1} = 0$

20
Optimality conditions: univariate
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - w_0 - w_1 x^{(i)}\right)^2$
$\frac{\partial J(\mathbf{w})}{\partial w_1} = \sum_{i=1}^{n} 2\left(y^{(i)} - w_0 - w_1 x^{(i)}\right)\left(-x^{(i)}\right) = 0$
$\frac{\partial J(\mathbf{w})}{\partial w_0} = \sum_{i=1}^{n} 2\left(y^{(i)} - w_0 - w_1 x^{(i)}\right)(-1) = 0$
• A system of 2 linear equations

21
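Solving this system by hand gives the familiar closed form $w_1 = \frac{\sum_i (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_i (x^{(i)} - \bar{x})^2}$ and $w_0 = \bar{y} - w_1 \bar{x}$. A minimal sketch with hypothetical data:

```python
import numpy as np

def fit_univariate(x, y):
    """Solve dJ/dw0 = dJ/dw1 = 0 for the univariate SSE cost."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Hypothetical data: price (10^6 T) vs. size (m^2)
x = np.array([100.0, 80.0, 120.0, 60.0])
y = np.array([500.0, 250.0, 550.0, 200.0])
print(fit_univariate(x, y))
```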
Cost function: multivariate
• We have to minimize the empirical squared loss:
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$
$f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_d x_d, \quad \mathbf{w} = [w_0, w_1, \ldots, w_d]^T$
$\hat{\mathbf{w}} = \arg\min_{\mathbf{w} \in \mathbb{R}^{d+1}} J(\mathbf{w})$

22
Cost function and optimal linear model
• Necessary conditions for the "optimal" parameter values:
$\nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbf{0}$
• A system of $d + 1$ linear equations
23
Cost function: matrix notation
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2 = \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right)^2$

$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \quad \mathbf{X} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_d^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_d^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \cdots & x_d^{(n)} \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix}$

$J(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$
24
Minimizing cost function
Optimal linear weight vector (for the SSE cost function):
$J(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$
$\nabla_{\mathbf{w}} J(\mathbf{w}) = -2\mathbf{X}^T (\mathbf{y} - \mathbf{X}\mathbf{w})$
$\nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbf{0} \;\Rightarrow\; \mathbf{X}^T \mathbf{X} \hat{\mathbf{w}} = \mathbf{X}^T \mathbf{y}$
$\hat{\mathbf{w}} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}$

25
Minimizing cost function
$\hat{\mathbf{w}} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y} = \mathbf{X}^{\dagger} \mathbf{y}$
where $\mathbf{X}^{\dagger} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T$ is the pseudoinverse of $\mathbf{X}$

26
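A sketch of the closed-form solution in NumPy with hypothetical data. In practice np.linalg.lstsq or the pseudoinverse is preferred to explicitly inverting $\mathbf{X}^T\mathbf{X}$, which can be ill-conditioned:

```python
import numpy as np

# Hypothetical training data: n = 4 examples, d = 2 features (size, age)
X_raw = np.array([[100.0, 2.0], [80.0, 25.0], [120.0, 5.0], [60.0, 30.0]])
y = np.array([500.0, 250.0, 550.0, 200.0])

# Design matrix with a leading column of ones for w0
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Normal equations: solve (X^T X) w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically safer alternatives
w_pinv = np.linalg.pinv(X) @ y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal, w_pinv, w_lstsq)
```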
Another approach for optimizing the sum of squared errors
• Iterative approach for solving the following optimization problem:
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$

27
Review: Iterative optimization of cost function
• Cost function: $J(\mathbf{w})$
• Optimization problem: $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} J(\mathbf{w})$
• Steps:
  • Start from $\mathbf{w}^0$
  • Repeat
    • Update $\mathbf{w}^t$ to $\mathbf{w}^{t+1}$ in order to reduce $J$
    • $t \leftarrow t + 1$
  • until we hopefully end up at a minimum

28
Review: Gradient descent
• First-order optimization algorithm to find $\mathbf{w}^* = \arg\min_{\mathbf{w}} J(\mathbf{w})$
• Also known as "steepest descent"
• In each step, takes steps proportional to the negative of the gradient vector of the function at the current point $\mathbf{w}^t$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \gamma_t \nabla J(\mathbf{w}^t)$
  • $J(\mathbf{w})$ decreases fastest if one goes from $\mathbf{w}^t$ in the direction of $-\nabla J(\mathbf{w}^t)$
  • Assumption: $J(\mathbf{w})$ is defined and differentiable in a neighborhood of the point $\mathbf{w}^t$
• Gradient ascent takes steps proportional to (the positive of) the gradient to find a local maximum of the function
29
Review: Gradient descent
• Minimize $J(\mathbf{w})$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$
where $\eta$ is the step size (learning rate parameter) and
$\nabla_{\mathbf{w}} J(\mathbf{w}) = \left[ \frac{\partial J(\mathbf{w})}{\partial w_1}, \frac{\partial J(\mathbf{w})}{\partial w_2}, \ldots, \frac{\partial J(\mathbf{w})}{\partial w_d} \right]$
• If $\eta$ is small enough, then $J(\mathbf{w}^{t+1}) \leq J(\mathbf{w}^t)$.
• $\eta$ can be allowed to change at every iteration as $\eta_t$.
30
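A generic gradient-descent loop matching the update rule above (a sketch: grad_J stands for any function returning $\nabla_{\mathbf{w}} J(\mathbf{w})$, and the fixed step size and stopping tolerance are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def gradient_descent(grad_J, w0, eta=0.01, tol=1e-6, max_iter=10000):
    """Iterate w <- w - eta * grad_J(w) until the gradient is small."""
    w = w0.astype(float)
    for _ in range(max_iter):
        g = grad_J(w)
        if np.linalg.norm(g) < tol:  # near a stationary point
            break
        w = w - eta * g
    return w

# Example on a toy convex function J(w) = ||w||^2, whose gradient is 2w
print(gradient_descent(lambda w: 2 * w, np.array([3.0, -4.0])))
```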
Review: Gradient descent disadvantages
• Local minima problem
• However, when $J$ is convex, all local minima are also global minima ⇒ gradient descent can converge to the global solution.

31
Review: Problem of gradient descent with non-convex cost functions
[Figure: a non-convex surface $J(w_0, w_1)$; starting points in different basins lead gradient descent to different local minima]
32 This example has been adapted from: Prof. Ng's slides (ML Online Course, Stanford)
Gradient descent for SSE cost function
• Minimize $J(\mathbf{w})$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$
• $J(\mathbf{w})$: sum of squares error
$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$
• Weight update rule for $f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x}$:
$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^{t\,T} \mathbf{x}^{(i)}\right) \mathbf{x}^{(i)}$

34
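The loop specialized to the SSE gradient. A sketch in which, as in the slide's update rule, the constant factor from differentiating the square is absorbed into $\eta$ (the step size and iteration count are hypothetical, and features are assumed scaled so a single $\eta$ works):

```python
import numpy as np

def batch_gd_sse(X, y, eta=1e-4, iters=1000):
    """Batch gradient descent for J(w) = ||y - Xw||^2.

    Each step uses all n training examples:
    w <- w + eta * sum_i (y^(i) - w^T x^(i)) x^(i)
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        residuals = y - X @ w              # length-n vector of errors
        w = w + eta * (X.T @ residuals)    # the summed update over all i
    return w
```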
Gradient descent for SSE cost function
• Weight update rule for $f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x}$:
$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right) \mathbf{x}^{(i)}$
Batch mode: each step considers all training data.
• $\eta$ too small → gradient descent can be slow.
• $\eta$ too large → gradient descent can overshoot the minimum. It may fail to converge, or even diverge.

35
$f(x; w_0, w_1) = w_0 + w_1 x$; $J(w_0, w_1)$ (a function of the parameters $w_0, w_1$)
[Figure: successive gradient-descent iterations; the fitted line improves as $(w_0, w_1)$ moves down the contours of $J(w_0, w_1)$]
36 This example has been adapted from: Prof. Ng's slides (ML Online Course, Stanford)
Stochastic gradient descent
• Batch techniques process the entire training set in one go
  • thus they can be computationally costly for large data sets.
• Stochastic gradient descent can be used when the cost function comprises a sum over data points:
$J(\mathbf{w}) = \sum_{i=1}^{n} J^{(i)}(\mathbf{w})$
• Update after presentation of $(\mathbf{x}^{(i)}, y^{(i)})$:
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J^{(i)}(\mathbf{w})$

45
Stochastic gradient descent
• Example: linear regression with the SSE cost function
$J^{(i)}(\mathbf{w}) = \left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right)^2$
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J^{(i)}(\mathbf{w}) = \mathbf{w}^t + \eta \left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right) \mathbf{x}^{(i)}$
This update is known as Least Mean Squares (LMS). It is suitable for sequential or online learning.

46
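The LMS rule as an online loop. A sketch in which the data are shuffled on each pass and $\eta$ is a small fixed constant (both illustrative choices; in a true streaming setting each example would simply arrive once):

```python
import numpy as np

def lms(X, y, eta=1e-5, epochs=10, seed=0):
    """Stochastic gradient descent (LMS): update after each example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):     # shuffle each pass
            error = y[i] - w @ X[i]           # y^(i) - w^T x^(i)
            w = w + eta * error * X[i]        # LMS update
    return w
```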
Stochastic gradient descent: online learning
• Sequential learning is also appropriate for real-time applications
  • data observations arrive in a continuous stream
  • and predictions must be made before seeing all of the data
• The value of $\eta$ needs to be chosen with care to ensure that the algorithm converges

47
Evaluation and generalization
• Why do we minimize the cost function (based only on the training data) when we are interested in the performance on new examples?
$\min_{\boldsymbol{\theta}} \sum_{i=1}^{n} \mathrm{Loss}\left(y^{(i)}, f(\mathbf{x}^{(i)}; \boldsymbol{\theta})\right) \quad$ (empirical loss)
• Evaluation: after training, we need to measure how well the learned prediction function predicts the target for unseen examples

48
Training and test performance
• Assumption: training and test examples are drawn independently at random from the same but unknown distribution.
  • Each training/test example $(\mathbf{x}, y)$ is a sample from the joint probability distribution $P(\mathbf{x}, y)$, i.e., $(\mathbf{x}, y) \sim P$
Empirical (training) loss $= \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}\left(y^{(i)}, f(\mathbf{x}^{(i)}; \boldsymbol{\theta})\right)$
Expected (test) loss $= E_{\mathbf{x},y}\left[\mathrm{Loss}\left(y, f(\mathbf{x}; \boldsymbol{\theta})\right)\right]$
• We minimize the empirical loss (on the training data) and expect to also obtain an acceptable expected loss
  • Empirical loss serves as a proxy for the performance over the whole distribution.

49
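In practice the expected loss is estimated on held-out examples drawn from the same distribution. A sketch on synthetic data (the generating model and split sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = 2 + 3x + noise, so every (x, y) ~ P
n = 200
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])

# Split: fit on training examples, estimate expected loss on unseen ones
X_tr, y_tr, X_te, y_te = X[:100], y[:100], X[100:], y[100:]
w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

train_loss = np.mean((y_tr - X_tr @ w) ** 2)  # empirical (training) loss
test_loss = np.mean((y_te - X_te @ w) ** 2)   # estimate of expected loss
print(train_loss, test_loss)
```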
Linear regression: number of training data
[Figure: fitted lines and their errors for $n = 10$, $n = 20$, and $n = 50$ training examples]

50
Linear regression: generalization
• By increasing the number of training examples, will the solution get better?
• Why does the mean squared error not decrease further after reaching a certain level?

51
Linear regression: types of errors
• Structural error: the error introduced by the limited function class (even with infinite training data):
$\mathbf{w}^* = \arg\min_{\mathbf{w}} E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^T \mathbf{x}\right)^2\right]$
Structural error: $E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^{*T} \mathbf{x}\right)^2\right]$
• where $\mathbf{w}^* = (w_0^*, \ldots, w_d^*)$ are the optimal linear regression parameters (infinite training data)

52
Linear regression: types of errors
• Approximation error measures how close we can get to the optimal linear predictions with limited training data:
$\mathbf{w}^* = \arg\min_{\mathbf{w}} E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^T \mathbf{x}\right)^2\right]$
$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right)^2$
Approximation error: $E_{\mathbf{x}}\left[\left(\mathbf{w}^{*T} \mathbf{x} - \hat{\mathbf{w}}^T \mathbf{x}\right)^2\right]$
• where $\hat{\mathbf{w}}$ are the parameter estimates based on a small training set (so they are themselves random variables)
53
Linear regression: error decomposition
• The expected error decomposes into the sum of the structural and approximation errors:
$E_{\mathbf{x},y}\left[\left(y - \hat{\mathbf{w}}^T \mathbf{x}\right)^2\right] = E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^{*T} \mathbf{x}\right)^2\right] + E_{\mathbf{x}}\left[\left(\mathbf{w}^{*T} \mathbf{x} - \hat{\mathbf{w}}^T \mathbf{x}\right)^2\right]$
• Derivation:
$E_{\mathbf{x},y}\left[\left(y - \hat{\mathbf{w}}^T \mathbf{x}\right)^2\right] = E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^{*T} \mathbf{x} + \mathbf{w}^{*T} \mathbf{x} - \hat{\mathbf{w}}^T \mathbf{x}\right)^2\right]$
$= E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^{*T} \mathbf{x}\right)^2\right] + E_{\mathbf{x}}\left[\left(\mathbf{w}^{*T} \mathbf{x} - \hat{\mathbf{w}}^T \mathbf{x}\right)^2\right] + \underbrace{2\, E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^{*T} \mathbf{x}\right)\left(\mathbf{w}^{*T} \mathbf{x} - \hat{\mathbf{w}}^T \mathbf{x}\right)\right]}_{=\,0}$
Note: the optimality condition for $\mathbf{w}^*$ gives $E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^{*T} \mathbf{x}\right)\mathbf{x}\right] = \mathbf{0}$, since $\nabla_{\mathbf{w}} E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^T \mathbf{x}\right)^2\right]\Big|_{\mathbf{w}^*} = \mathbf{0}$
55
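The decomposition can be checked numerically on a synthetic distribution where $\mathbf{w}^*$ is known. In this well-specified Monte-Carlo sketch the true coefficients are the optimal linear parameters, so the structural error is just the noise variance (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, 2.0])   # for a well-specified model, w* = w_true
sigma = 0.5                      # noise std; structural error = sigma^2

def sample(n):
    x = rng.uniform(-1, 1, n)
    X = np.column_stack([np.ones(n), x])
    return X, X @ w_true + rng.normal(0, sigma, n)

# Fit w_hat on a small training set
X_tr, y_tr = sample(20)
w_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

# Monte-Carlo estimates of the expectations on a large fresh sample
X, y = sample(200000)
total = np.mean((y - X @ w_hat) ** 2)
structural = np.mean((y - X @ w_true) ** 2)      # approx. sigma^2
approx = np.mean((X @ w_true - X @ w_hat) ** 2)  # approximation error
print(total, structural + approx)                # the two should nearly match
```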
