Lecture 2.1 Linear Regression
Linear Regression
Example with one feature
House size in m² ($x$)    House price in $1000 ($y$)
$x^{(1)} = 2104$          $y^{(1)} = 460$
$x^{(2)} = 1416$          $y^{(2)} = 230$
$x^{(3)} = 1534$          $y^{(3)} = 315$
$x^{(4)} = 852$           $y^{(4)} = 178$
…                         …

Question: Given a new house with a size of 1250 m², how do we predict its price?
1. Assume that the relation between size and price is linear.
(Figure: house price in $1000 vs. size in m², with the new house at 1250 m² marked.)
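To make the setup concrete, here is a minimal Python sketch (not from the lecture) that stores the four training examples from the table above and evaluates a linear model of the form used in the following slides, $h_\theta(x) = \theta_0 + \theta_1 x$, at the query size of 1250 m². The parameter values `theta0` and `theta1` below are placeholders chosen only for illustration; finding good values is exactly the problem the rest of the lecture addresses.

```python
# Training data from the table above (size in m^2, price in $1000).
x_train = [2104, 1416, 1534, 852]
y_train = [460, 230, 315, 178]

def h(x, theta0, theta1):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Placeholder parameters (NOT fitted values -- only to show how a prediction
# would be made once theta0 and theta1 have been learned).
theta0, theta1 = 0.0, 0.2

print(h(1250, theta0, theta1))  # predicted price (in $1000) for a 1250 m^2 house
```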
Example with one feature
(Figure: the same data, price ($1000) vs. size (m²); the unknown price of the 1250 m² house is marked with a "?".)
How do we find the best parameters $\theta_0$ and $\theta_1$ of the linear model $h_\theta(x) = \theta_0 + \theta_1 x$ (i.e., the best-fitting line)?
Example with one feature
• Choose $\theta_0, \theta_1$ so that $h_\theta(x^{(i)})$ is close to $y^{(i)}$ for all our training examples $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})$.
Example with one feature
• To simplify, let's first assume that $\theta_0 = 0$, so our model $h_\theta$ is of the form $h_\theta(x) = \theta_1 x$.
Optimization problem: $\min_{\theta_1} E(\theta_1)$
Example with one feature
$h_\theta(x) = \theta_1 x$

Hypothesis function (model): for a fixed $\theta_1$, $h_\theta(x)$ is a function of the input $x$.
Error (cost) function: $E(\theta_1)$ is a function of the parameter $\theta_1$.

(Figure: left, the training data with the line $h_\theta(x)$ and the vertical distances $y^{(i)} - h_\theta(x^{(i)})$; right, the curve $E(\theta_1)$ with the minimum error marked, shown for several values of $\theta_1$.)
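As a rough illustration of the right-hand curve, the sketch below evaluates an error function over a grid of $\theta_1$ values and reports the one with the smallest error. It assumes $E(\theta_1)$ is the mean squared error of $h_\theta(x) = \theta_1 x$ on the training data, consistent with the MSE named later in the lecture; the grid search is only for visualization and is not the optimization method the lecture uses (that is gradient descent).

```python
# Assumed form of E(theta1): mean squared error of h_theta(x) = theta1 * x.
# (The exact scaling constant does not change where the minimum is.)
x_train = [2104, 1416, 1534, 852]
y_train = [460, 230, 315, 178]

def E(theta1):
    n = len(x_train)
    return sum((theta1 * x - y) ** 2 for x, y in zip(x_train, y_train)) / n

# Evaluate E on a grid of theta1 values and report the best one.
grid = [i / 1000 for i in range(0, 501)]   # theta1 in [0, 0.5]
best = min(grid, key=E)
print(f"theta1 ~ {best:.3f}, E(theta1) = {E(best):.1f}")
```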
Example with one feature

$h_\theta(x) = \theta_0 + \theta_1 x$

Hypothesis function (model): for fixed $\theta_0, \theta_1$, $h_\theta(x)$ is a function of the input $x$.
Error (cost) function: $E(\theta_0, \theta_1)$ is a function of the parameters $\theta_0$ and $\theta_1$.

(Figures: left, price ($1000) vs. size (m²) with the line $h_\theta(x)$; right, the surface plot and the contour plot of $E(\theta_0, \theta_1)$ over $\theta_0$ and $\theta_1$.)
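The surface and contour plots can be reproduced with a few lines of NumPy/Matplotlib. The sketch below again assumes the MSE form of $E(\theta_0, \theta_1)$ and simply evaluates it on a grid of parameter values before drawing the contours; the axis ranges are arbitrary choices for illustration, not values from the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 230, 315, 178], dtype=float)

def E(theta0, theta1):
    """Assumed MSE cost of h_theta(x) = theta0 + theta1 * x."""
    return np.mean((theta0 + theta1 * x - y) ** 2)

# Evaluate E on a grid of (theta0, theta1) values.
t0 = np.linspace(-200.0, 200.0, 100)
t1 = np.linspace(0.0, 0.5, 100)
T0, T1 = np.meshgrid(t0, t1)
Z = np.array([[E(a, b) for a in t0] for b in t1])

plt.contour(T0, T1, Z, levels=30)
plt.xlabel("theta0"); plt.ylabel("theta1")
plt.title("Contour plot of E(theta0, theta1)")
plt.show()
```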
Optimizing the cost function
• To find the parameters that minimize the cost function, we can use an optimization algorithm called gradient descent.
Optimization problem: $\min_{\theta_0, \theta_1} E(\theta_0, \theta_1)$
Cost function: $E(\theta_0, \theta_1)$, the error of $h_\theta$ on the training set
Hypothesis function: $h_\theta(x) = \theta_0 + \theta_1 x$
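Filling in the right-hand sides: the hypothesis is the linear model from the earlier slides, and a common way to write the MSE cost named later in the lecture is given below. The $1/n$ normalization is an assumption (some courses include an extra factor of $1/2$ for convenience when differentiating); it does not change which parameters minimize the cost.

```latex
h_\theta(x) = \theta_0 + \theta_1 x
\qquad
E(\theta_0,\theta_1) = \frac{1}{n}\sum_{i=1}^{n}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2
\qquad
\min_{\theta_0,\theta_1} E(\theta_0,\theta_1)
```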
Optimization using Gradient Descent

(Figure: surface plot of a function $G(\theta_0, \theta_1)$ over $\theta_0$ and $\theta_1$.)

Note: $G(\ldots)$ is just some arbitrary (non-convex) function used for this example. It is not our previous (convex) cost function $E(\ldots)$.
Gradient Descent – Basic idea
1. Start with some values for the parameters $\theta_0, \theta_1$.
2. Keep updating $\theta_0, \theta_1$ to reduce $E(\theta_0, \theta_1)$ until we end up at a minimum.
Gradient Descent – Basic idea
Example with a convex error function (MSE).
Only one (global) minimum.
(Figure: surface plot of the convex cost $E(\theta_0, \theta_1)$, the MSE.)

Update rule, with learning rate $\alpha > 0$ and the gradient of $E$:
$\theta := \theta - \alpha \, \nabla_\theta E(\theta)$

• The gradient of the cost function $E(\theta)$ is simply a vector containing the derivative of $E(\theta)$ with respect to each parameter $\theta_j$:
$\nabla_\theta E(\theta) = \begin{bmatrix} \partial E(\theta)/\partial \theta_0 \\ \partial E(\theta)/\partial \theta_1 \end{bmatrix}$ (derivative of $E$ with respect to $\theta_0$; derivative of $E$ with respect to $\theta_1$)
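For concreteness, here is a sketch of what the two components of the gradient look like if $E$ is the MSE cost assumed earlier (with the $1/n$ normalization; a $1/(2n)$ convention would simply remove the factor of 2), together with the resulting simultaneous updates:

```latex
\frac{\partial E}{\partial \theta_0} = \frac{2}{n}\sum_{i=1}^{n}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr),
\qquad
\frac{\partial E}{\partial \theta_1} = \frac{2}{n}\sum_{i=1}^{n}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x^{(i)}

\theta_0 := \theta_0 - \alpha\,\frac{\partial E}{\partial \theta_0},
\qquad
\theta_1 := \theta_1 - \alpha\,\frac{\partial E}{\partial \theta_1}
\qquad \text{(updated simultaneously)}
```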
Gradient Descent – Algorithm

Assume for now that we have only one parameter $\theta_1$.

• Update $\theta_1$: $\theta_1 := \theta_1 - \alpha \, \dfrac{dE(\theta_1)}{d\theta_1}$
• If the slope is positive ($dE/d\theta_1 \geq 0$), the update decreases $\theta_1$; if the slope is negative ($dE/d\theta_1 \leq 0$), the update increases $\theta_1$. Either way, $\theta_1$ moves towards the minimum of $E(\theta_1)$.

(Figure: $E(\theta_1)$ vs. $\theta_1$, showing the update at a point of positive slope and at a point of negative slope.)
Gradient Descent – Algorithm
(Figure: two panels of $E(\theta_1)$ vs. $\theta_1$, showing gradient-descent steps for a reasonably small value of $\alpha$ and for a very large value of $\alpha$.)

Reasonably small value of $\alpha$: as we approach a local minimum, gradient descent will automatically take smaller steps (why?), so there is no need to decrease $\alpha$ over time.
Very large value of $\alpha$: if $\alpha$ is too large, it may fail to converge, or may even diverge.
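The behaviour in the two panels is easy to reproduce numerically. The sketch below runs gradient descent on the one-parameter MSE cost assumed earlier, once with a small learning rate and once with a deliberately oversized one; the specific $\alpha$ values are arbitrary illustrations, not recommendations from the lecture.

```python
# Gradient descent on E(theta1) = mean squared error of h(x) = theta1 * x
# (assumed cost), with a small and a deliberately too-large learning rate.
x_train = [2104, 1416, 1534, 852]
y_train = [460, 230, 315, 178]
n = len(x_train)

def E(t1):
    return sum((t1 * x - y) ** 2 for x, y in zip(x_train, y_train)) / n

def dE(t1):
    # Derivative of the assumed MSE cost with respect to theta1.
    return 2.0 / n * sum((t1 * x - y) * x for x, y in zip(x_train, y_train))

def run_gd(alpha, steps=10, t1=0.0):
    history = [E(t1)]
    for _ in range(steps):
        t1 = t1 - alpha * dE(t1)   # the gradient-descent update
        history.append(E(t1))
    return history

print(run_gd(alpha=1e-7))   # small alpha: E decreases steadily
print(run_gd(alpha=1e-5))   # large alpha: E grows at every step (diverges)
```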
Gradient Descent – Local minimum
(Figure: a non-convex function of $\theta_1$ with the current value of $\theta_1$ marked; from here, gradient descent can end up in a local minimum rather than the global one.)
Multivariate linear regression

$h_\theta(x) = \theta^T x$, where $x = [1, x_1, \ldots, x_d]^T$ collects the input features (with $x_0 = 1$) and $\theta = [\theta_0, \theta_1, \ldots, \theta_d]^T$ collects the parameters, so that $\theta^T x = \theta_0 + \theta_1 x_1 + \cdots + \theta_d x_d$.
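In code, the vectorized hypothesis is just a matrix–vector product once a constant column of ones is prepended to the inputs, so that $\theta_0$ plays the role of the intercept. The NumPy sketch below uses made-up feature values and placeholder parameters purely for illustration.

```python
import numpy as np

# Made-up design matrix: each row is one example's features [x1, x2].
X_raw = np.array([[2104.0, 3.0],
                  [1416.0, 2.0],
                  [1534.0, 3.0],
                  [ 852.0, 2.0]])

# Prepend x0 = 1 so that theta[0] is the intercept theta_0.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

theta = np.array([10.0, 0.1, 5.0])   # placeholder parameters, not fitted values

# h_theta(x) = theta^T x, computed for every training example at once.
predictions = X @ theta
print(predictions)
```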
Batch and Stochastic Gradient Descent
• Batch gradient descent uses all the training examples to update the model parameters.
• Stochastic gradient descent instead updates the parameters one training example at a time, which is useful when:
– e.g. (2) your training dataset is very big (computationally expensive to use batch GD, or the dataset doesn't fit in memory).
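The difference between the two variants is just how much data each parameter update looks at. A minimal NumPy sketch, assuming the MSE cost and the design-matrix convention from the earlier sketches, could look like this:

```python
import numpy as np

def batch_gd_step(theta, X, y, alpha):
    """One batch update: the gradient is averaged over ALL training examples."""
    n = X.shape[0]
    grad = 2.0 / n * X.T @ (X @ theta - y)
    return theta - alpha * grad

def sgd_epoch(theta, X, y, alpha):
    """One stochastic epoch: update the parameters once per single example."""
    for i in np.random.permutation(X.shape[0]):
        xi, yi = X[i], y[i]
        grad_i = 2.0 * (xi @ theta - yi) * xi
        theta = theta - alpha * grad_i
    return theta
```

Batch GD computes one exact gradient per pass over the data, while stochastic GD makes many cheap, noisy updates per pass, which is what makes it attractive when the dataset is very large or does not fit in memory.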
Convergence and selecting $\alpha$
(Figure: $E(\theta)$ decreasing over the course of training.)

• For a sufficiently small $\alpha$, $E(\theta)$ (on the training set) should decrease at every iteration.

(Figure: curves of $E(\theta)$ that do not decrease at every iteration.)
In these cases, you should use a smaller $\alpha$.
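A practical way to apply this check is to record $E(\theta)$ after every update and look at the trend. The sketch below assumes the batch update and MSE cost from the earlier sketches; if the recorded cost ever increases, that is the signal to try a smaller $\alpha$.

```python
import numpy as np

def train_and_monitor(theta, X, y, alpha, iterations=100):
    """Batch gradient descent that records E(theta) (assumed MSE) at every iteration."""
    def E(t):
        return float(np.mean((X @ t - y) ** 2))

    history = [E(theta)]
    n = X.shape[0]
    for _ in range(iterations):
        theta = theta - alpha * (2.0 / n) * X.T @ (X @ theta - y)  # batch GD update
        history.append(E(theta))
        if history[-1] > history[-2]:
            print("E(theta) increased -- consider a smaller alpha")
    return theta, history
```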