
Machine Learning Course

Linear Regression

Dr. Mohamed-Rafik Bouguelia


[email protected]

Halmstad University, Sweden.


You can also watch the video corresponding to this lecture at: https://youtu.be/-wmjwMWRsZU
Example with one feature

  House size in m² (x)    House price in $1000 (y)
  x^(1) = 2104            y^(1) = 460
  x^(2) = 1416            y^(2) = 230
  x^(3) = 1534            y^(3) = 315
  x^(4) = 852             y^(4) = 178
  ...                     ...

Question: Given a new house with a size of 1250 m², how do we predict its price?

[Figure: scatter plot of the training houses, price ($1000) vs. size (m²), with the new size 1250 m² marked on the x-axis]
Example with one feature (continued)

1. Assume that the relation between size and price is linear.
2. Training: find a line that fits the training dataset well.
3. Predicting: use the line to predict the price of the new house.

[Figure: the same scatter plot of price ($1000) vs. size (m²), now with a fitted line used to read off the predicted price at 1250 m²]
Example with one feature (continued)

The fitted line is our model (hypothesis): hθ(x) = θ0 + θ1 x.
Question: how do we find the best parameters θ0 and θ1 (i.e. the best-fitting line)?

[Figure: the scatter plot of price ($1000) vs. size (m²) with a candidate line drawn through the data]
Example with one feature
• Choose θ0, θ1 so that hθ(x^(i)) is close to y^(i) for all our training examples
  (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n)).

• We want to find the parameter vector that minimizes the cost function E(θ0, θ1),
  the mean squared error cost function.

[Figure: the data, price ($1000) vs. size (m²), with the line hθ(x)]
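The cost function itself appears only as an image on the slide. Written out, a standard form of the mean squared error cost for this model (the 1/2 factor is a common convention that simplifies the derivatives; the exact scaling constant on the original slide may differ) is:

```latex
E(\theta_0, \theta_1) = \frac{1}{2n} \sum_{i=1}^{n} \Big( h_\theta\big(x^{(i)}\big) - y^{(i)} \Big)^{2},
\qquad h_\theta(x) = \theta_0 + \theta_1 x
```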
Example with one feature
• To simplify, let's first assume that θ0 = 0, so our model hθ is of the form: hθ(x) = θ1 x.

• In this case, we only need to find the optimal value of θ1:

    minimize E(θ1) over θ1

[Figure: the data, price ($1000) vs. size (m²), with a line through the origin whose slope is θ1]
Example with one feature
hθ(x) = θ1 x

Hypothesis function (model): for a fixed θ1, hθ(x) is a function of the input x.
Error (cost) function: E(θ1) is a function of the parameter θ1 (the mean squared error).

[Figure, repeated over three slides for different values of θ1: on the left, price vs. size (m²) with the line hθ(x) and the residuals y^(i) − hθ(x^(i)); on the right, the curve E(θ1), whose minimum error is reached at the optimal θ1]
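To make the right-hand curve concrete, here is a minimal sketch (my own illustration, not code from the lecture) that evaluates E(θ1) on a grid of slopes for the four houses in the table, assuming the 1/(2n) scaling:

```python
import numpy as np

# Training data from the slides: house size (m^2) and price ($1000).
x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 230, 315, 178], dtype=float)

def cost(theta1, x, y):
    """Mean squared error E(theta1) for the simplified model h(x) = theta1 * x.
    (Assumes the 1/(2n) scaling convention.)"""
    residuals = theta1 * x - y
    return np.mean(residuals ** 2) / 2.0

# Evaluate E(theta1) on a grid of slopes; the resulting curve is bowl-shaped,
# with its minimum at the optimal theta1.
grid = np.linspace(0.0, 0.5, 101)
errors = [cost(t, x, y) for t in grid]
best = grid[int(np.argmin(errors))]
print(f"approximate optimal theta1 on the grid: {best:.3f}")
```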
Example with one feature
hθ(x) = θ0 + θ1 x

Hypothesis function (model): for fixed θ0, θ1, hθ(x) is a function of the input x.
Error (cost) function: E(θ0, θ1) is a function of the parameters θ0, θ1.

[Figure: on the left, the line hθ(x) over the data (price vs. size in m²); on the right, the bowl-shaped surface E(θ0, θ1) over the (θ0, θ1) plane]
Example with one feature
hθ(x) = θ0 + θ1 x

Hypothesis function (model): for fixed θ0, θ1, hθ(x) is a function of the input x.
Error (cost) function: E(θ0, θ1) is a function of the parameters θ0, θ1, shown here as a contour plot.

[Figure: on the left, the line hθ(x) over the data (price vs. size in m²); on the right, a contour plot of E(θ0, θ1) in the (θ0, θ1) plane, where each contour is a set of parameter values with the same cost]
Optimizing the cost function
• To find the parameters that minimize the cost function, we can use an optimization algorithm called gradient descent.

• Gradient descent is a general optimization algorithm; it is not specific to this cost function.

Optimization problem: find the parameters θ = (θ0, θ1) that minimize E(θ).
Cost function: E(θ0, θ1), the mean squared error defined above.
Hypothesis function: hθ(x) = θ0 + θ1 x.
Optimization using
Gradient Descent

Machine Learning Course.


Dr. Mohamed-Rafik Bouguelia.
[email protected]
Gradient Descent – Basic idea
1. Start with some values for the parameters θ0, θ1.
2. Keep updating θ0, θ1 to reduce E(θ0, θ1) until we end up at a minimum.

[Figure: surface plot of a function G(θ0, θ1) over the (θ0, θ1) plane]

Note: G(…) is just some arbitrary (non-convex) function used for this example. It is not our previous (convex) cost function E(…).
Gradient Descent – Basic idea (continued)
• Depending on the initial parameter values, we might end up at a different (local) minimum.

[Figure: the same non-convex surface G(θ0, θ1), with two descent paths that start from nearby initial points and end at different local minima]
Gradient Descent – Basic idea (continued)
Example with a convex error function (the MSE): there is only one (global) minimum.

[Figure: the bowl-shaped MSE surface E(θ0, θ1) over the (θ0, θ1) plane]

Question: at each update step, how does gradient descent decide whether to increase or decrease each of the parameters θ0 and θ1?
Gradient Descent – Algorithm

Repeat until convergence, updating every parameter θj simultaneously:

    θj := θj − α ∂E(θ)/∂θj

where α > 0 is the learning rate and ∂E(θ)/∂θj is the derivative of E with respect to θj.

• The gradient of the cost function E(θ) is simply a vector containing the derivative of E(θ) with respect to each parameter θj:

    ∇E(θ) = [ ∂E(θ)/∂θ0 , ∂E(θ)/∂θ1 ]
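As a minimal sketch of this update rule (my own illustration; `grad_E` stands for whatever routine computes the gradient of the cost):

```python
import numpy as np

def gradient_descent_step(theta, grad_E, alpha):
    """One gradient descent update: theta_j := theta_j - alpha * dE/dtheta_j,
    applied to all parameters simultaneously (as a single vector operation).

    theta  : current parameter vector, e.g. np.array([theta0, theta1])
    grad_E : function returning the gradient of E at theta (same shape as theta)
    alpha  : learning rate, alpha > 0
    """
    return theta - alpha * grad_E(theta)
```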
Gradient Descent – Algorithm
Assume for now that we have only one parameter θ1.

• Pick some initial value for θ1.
• Update θ1:  θ1 := θ1 − α dE(θ1)/dθ1

The derivative dE(θ1)/dθ1 is the slope of the (red) line which is tangent to the function E at the current θ1.

[Figure, built up over several slides: the curve E(θ1) with the current value of θ1 marked and the tangent line drawn at that point]
Gradient Descent – Algorithm (continued)
• Suppose the current θ1 lies to the right of the minimum, so the derivative is ≥ 0 (positive slope).
• In this case, since the derivative is positive and α > 0, the update decreases θ1, and θ1 gets closer to the optimal value.

[Figure: E(θ1) with the current θ1 to the right of the minimum and an upward-sloping tangent line]
Gradient Descent – Algorithm (continued)
If our initial value of θ1 was too small (i.e. on the left side of the minimum):
• At the current θ1, the derivative is ≤ 0 (negative slope).
• In this case, since the derivative is negative and α > 0, the update increases θ1, and θ1 gets closer to the optimal value.

[Figure: E(θ1) with the current θ1 to the left of the minimum and a downward-sloping tangent line]
Gradient Descent – Algorithm (continued)

Reasonably small value of α:
• As we approach a local minimum, gradient descent automatically takes smaller steps, because the derivative (the slope of the tangent line) gets closer to zero as we get closer to the minimum. So there is no need to decrease α over time.

Very large value of α:
• If α is too large, gradient descent may fail to converge, or may even diverge.

[Figure: on the left, E(θ1) with steps that shrink as they approach the minimum; on the right, E(θ1) with large steps that overshoot the minimum and move further away at each update]
Gradient Descent – Local minimum
Assume that we have reached a local optimum (a local minimum here) of some non-convex function of θ1.

• The derivative at a local minimum equals 0, so the update leaves θ1 unchanged: gradient descent has converged, and we get stuck at this local minimum.

[Figure: a non-convex function of θ1 with the current value of θ1 sitting at a local minimum, where the tangent line is horizontal]
Gradient descent for linear regression
Details of the derivatives computation:
• Derivative of E(θ) with respect to θ0
• Derivative of E(θ) with respect to θ1
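The derivatives themselves appear only as images on the slide. Assuming the 1/(2n)-scaled MSE cost written out earlier, they work out to:

```latex
\frac{\partial E(\theta)}{\partial \theta_0}
  = \frac{1}{n}\sum_{i=1}^{n}\Big(h_\theta\big(x^{(i)}\big)-y^{(i)}\Big),
\qquad
\frac{\partial E(\theta)}{\partial \theta_1}
  = \frac{1}{n}\sum_{i=1}^{n}\Big(h_\theta\big(x^{(i)}\big)-y^{(i)}\Big)\,x^{(i)}
```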


Example with multiple features

  Size (x1)   Nb rooms (x2)   Location (x3)   Nb floors (x4)   Age (x5)   ...   Price (y)
  2104        6               2               2                45         ...   460
  1416        5               10              1                40         ...   230
  1534        5               3               2                30         ...   315
  852         4               2               1                35         ...   178
  ...         ...             ...             ...              ...        ...   ...

• For convenience of notation, define x0 = 1.
  Think of it as an additional feature which equals 1 for all data points: x0^(i) = 1, for all i = 1 … n.

• The hypothesis then becomes: hθ(x) = θ^T x
Multivariate linear regression

hθ(x) = θ^T x = θ0 x0 + θ1 x1 + … + θd xd, where θ = [θ0, θ1, …, θd]^T is the parameter vector and x = [x0, x1, …, xd]^T is the feature vector (with x0 = 1).
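A minimal sketch of this hypothesis in code (the particular θ values are only illustrative, and the feature order is taken loosely from the table above):

```python
import numpy as np

def h(theta, x):
    """Multivariate hypothesis h_theta(x) = theta^T x.
    `x` is a feature vector WITHOUT the constant feature; x0 = 1 is prepended here."""
    x = np.concatenate(([1.0], x))   # add x0 = 1
    return theta @ x

# Hypothetical parameters [theta0, theta1, ..., theta5] and the first house
# from the table (size, nb rooms, location, nb floors, age).
theta = np.array([50.0, 0.15, 5.0, 2.0, 3.0, -0.5])
x_new = np.array([2104, 6, 2, 2, 45], dtype=float)
print(h(theta, x_new))  # predicted price in $1000
```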
Batch and Stochastic Gradient Descent
• Batch gradient descent uses all the training examples to update the model parameters at each step.

• Online (also called stochastic) gradient descent updates the model parameters based on one training example at a time. This can be useful when, for example:
  – (1) you don't have the whole training dataset beforehand: your training examples arrive one by one over time, as a stream;
  – (2) your training dataset is very big (it is computationally expensive to use batch GD, or the dataset doesn't fit in memory).

A sketch contrasting the two update schemes is given below.
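A minimal sketch of both variants for linear regression with the 1/(2n)-scaled MSE cost (my own illustration, not code from the lecture; in practice features are usually rescaled so that a fixed learning rate like the one below behaves well):

```python
import numpy as np

def batch_gd(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent: every update uses ALL n training examples.
    X must include a leading column of ones (the x0 = 1 feature); y is the target vector."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / n   # gradient of the 1/(2n)-scaled MSE
        theta -= alpha * gradient
    return theta

def stochastic_gd(X, y, alpha=0.01, n_epochs=10):
    """Stochastic (online) gradient descent: each update uses ONE example,
    so it also applies when examples arrive one by one as a stream."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_epochs):
        for i in np.random.permutation(n):
            error = X[i] @ theta - y[i]        # prediction error on example i
            theta -= alpha * error * X[i]      # per-example gradient step
    return theta
```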
Convergence and selecting α
• For a sufficiently small α, the cost E(θ) (on the training set) should decrease at every iteration of gradient descent.

• One can consider that gradient descent has converged (and thus stop) if E(θ) decreases by less than some small ε (e.g. 0.0001) in one iteration.

• If E(θ) increases with the number of iterations, or keeps oscillating, α is too large: you should use a smaller α.

• Note: if α is too small, gradient descent can be slow to converge.

[Figure: E(θ) plotted against the number of GD iterations for a well-chosen α (steadily decreasing curve), and for values of α that are too large (increasing or oscillating curves)]
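A small sketch of this stopping rule (assuming `cost` and `grad` are functions computing E(θ) and its gradient, as in the earlier snippets):

```python
def gd_until_convergence(theta, cost, grad, alpha=0.01, eps=1e-4, max_iters=100000):
    """Run gradient descent until E(theta) decreases by less than eps in one
    iteration (the stopping rule suggested above), or max_iters is reached."""
    prev = cost(theta)
    for _ in range(max_iters):
        theta = theta - alpha * grad(theta)
        current = cost(theta)
        if prev - current < eps:   # a negative difference also signals that alpha is too large
            break
        prev = current
    return theta
```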
Linear Regression without
using Gradient Descent

Machine Learning Course.


Dr. Mohamed-Rafik Bouguelia.
[email protected]
Linear Regression without GD
• There is a method to solve for θ analytically.
• The derivative at the optimal θj equals 0. So, set each derivative to 0,

    ∂E(θ)/∂θj = 0,

  and solve for θj.

• The solution is θ = (X^T X)^(−1) X^T y, where X is the matrix of training inputs (one row per example, including x0 = 1) and y is the vector of targets. In practice the pseudo-inverse is used.

• Disadvantages:
  – Need to compute (X^T X)^(−1), which is slow when the number of features d is very large.
  – Does not apply to other optimization problems.
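A minimal sketch of this closed-form solution using the pseudo-inverse (my own illustration):

```python
import numpy as np

def fit_normal_equation(X, y):
    """Closed-form least-squares solution theta = (X^T X)^(-1) X^T y.
    X must include the leading column of ones (x0 = 1).
    np.linalg.pinv (the pseudo-inverse) is used so the computation also works
    when X^T X is singular or ill-conditioned."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Example with the single-feature house data from the slides.
sizes = np.array([2104, 1416, 1534, 852], dtype=float)
prices = np.array([460, 230, 315, 178], dtype=float)
X = np.column_stack([np.ones_like(sizes), sizes])
theta = fit_normal_equation(X, prices)
print(theta)                   # [theta0, theta1]
print(theta @ [1.0, 1250.0])   # predicted price for a 1250 m^2 house, in $1000
```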
Details – Linear Regression without GD
[Slide with the step-by-step derivation of the analytical solution; the equations are not recoverable from the extracted text]