
Lecture 3a

Linear Regression
with One Variable

Silvia Ahmed CSE445 Machine Learning ECE@NSU 1


Learning goals

• After this presentation, you should be able to:


• Understand the Model Representation
• Explain the Hypothesis Function
• Understand the Cost Function
• Apply the Gradient Descent Algorithm
• Describe Batch Gradient Descent
• Use Matrix Multiplication for Implementation

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 2


Linear Regression
• Housing price data example
• Supervised learning regression problem
• What do we start with?
  • Training set (this is your data set):

  Size in feet² (x) | Price ($) in 1000's (y)
  2104              | 460
  1416              | 232
  1534              | 315
  852               | 178
  …                 | …

• Notation (used throughout the course)
  • m = number of training examples
  • x's = input variables / features
  • y's = output variable / "target" variable
  • (x, y) - a single training example
  • (x(i), y(i)) - a specific example (the ith training example)
  • i is an index into the training set

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 3


Training Set of Housing prices

[Scatter plot "Housing Price": size in feet² (x-axis, 0-3000) vs. price in $1000s (y-axis, 0-600).]

  Size in feet² (x) | Price ($) in 1000's (y)
  2104              | 460
  1416              | 232
  1534              | 315
  852               | 178
  …                 | …

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 4


Model Representation
• With our training set defined - how do
we use it?
• Take training set
• Pass into a learning algorithm
• Algorithm outputs a function (denoted h )
(h = hypothesis)
• This function takes an input (e.g. size of new
house)
• Tries to output the estimated value of y

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 5


How do we Represent h?
• Hypothesis function h -
• hθ(x) = θ0 + θ1x

[Scatter plot "Housing Price" of the housing data - size in feet² (x-axis, 0-3000) vs. price in $1000s (y-axis, 0-600) - with the fitted straight line h(x) = θ0 + θ1x drawn through the points.]

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 6


Model Representation
• What does this mean?
• Means y is a linear function of x.
• θi are parameters
• θ0 is the zero condition (intercept - the value of h when x = 0)
• θ1 is the gradient (slope)
• This kind of function is a linear regression with one variable
• Also called Univariate Linear Regression
• So in summary
• A hypothesis takes in some variable
• Uses parameters determined by a learning system
• Outputs a prediction based on that input

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 7


Hypothesis Function
hθ(x) = θ0 + θ1x
• The equation of a straight line.
• We set values for θ0 and θ1 to get the estimated output hθ(x).
• hθ is a function that tries to map the input data (the x's) to the output data (the y's).
• Example -

  Input x | Output y
  0       | 4
  1       | 7
  2       | 7
  3       | 8

• Let θ0 = 2 and θ1 = 2, then hθ(x) = 2 + 2x
• For input 1 to our hypothesis, hθ(1) = 4 - off by 3 from the target y = 7.
• We need to try out various values of θ0 and θ1 to find the best possible fit -
• or the most representative "straight line" through the data points mapped on the x-y plane.

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 8
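As a quick sanity check of the example above, here is a small Python sketch (not from the slides) that evaluates the guessed hypothesis hθ(x) = 2 + 2x on the four sample points and prints how far each prediction is from the target:

```python
# Illustrative check of the slide's example: theta0 = 2, theta1 = 2.
theta0, theta1 = 2.0, 2.0

def h(x):
    return theta0 + theta1 * x  # h_theta(x) = theta0 + theta1 * x

for x, y in [(0, 4), (1, 7), (2, 7), (3, 8)]:
    print(f"x={x}  h(x)={h(x):.0f}  y={y}  error={h(x) - y:.0f}")
# For x = 1 the prediction is 4 while the target is 7, i.e. off by 3,
# matching the slide.
```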


Cost Function
  Size in feet² (x) | Price ($) in 1000's (y)
  2104              | 460
  1416              | 232
  1534              | 315
  852               | 178
  …                 | …

• Hypothesis: hθ(x) = θ0 + θ1x
• θi's: parameters

How to choose θi's?

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 9


Cost Function
• A cost function lets us figure out how to fit the best straight line
to our data.
• We can measure the accuracy of our hypothesis function by
using a cost function.
• Choosing values for θi (parameters)
• Different values give different functions.

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 10


Cost Function
hθ(x) = θ0 + θ1x

[Three example lines, each plotted on 0-3 axes:]

  h(x) = 1.5            (θ0 = 1.5, θ1 = 0)
  h(x) = 0.5x           (θ0 = 0,   θ1 = 0.5)
  h(x) = 1 + 0.5x       (θ0 = 1,   θ1 = 0.5)

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 11


Linear Regression – Cost Function

• Idea: choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y)

• Based on our training set we want to generate parameters which make the straight line fit the data as well as possible
• Choose these parameters so hθ(x) is close to y for our training examples
  • Basically, use the x's in the training set with hθ(x) to give an output which is as close to the actual y value as possible
  • Think of hθ(x) as a "y imitator" - it tries to convert the x into y, and since we already have the actual y, we can evaluate how well hθ(x) does this

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 12


Cost Function
• To formalize this;
• We want to solve a minimization problem
• Minimize the difference between hθ(x) and y for each training example:
  • minimize (hθ(x) − y)²
• Sum this over the training set

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 13


Cost Function
• Minimize the squared difference between the predicted house price and the actual house price
• 1/(2m)
  • 1/m - means we determine the average
  • 1/(2m) - the 2 makes the math a bit easier, and doesn't change the values we determine at all (i.e. half the smallest value is still the smallest value!)
• Minimizing this difference means we get the values of θ0 and θ1 which on average give the minimal deviation of hθ(x) from y when we use those parameters in our hypothesis function
• More cleanly, this is a cost function:

  J(θ0, θ1) = (1/2m) · Σ (i = 1 to m) (hθ(x(i)) − y(i))²

• And we want to minimize this cost function: minimize J(θ0, θ1) over θ0, θ1

• Our cost function is (because of the summation term) inherently looking at ALL the data in the training set at any time

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 14
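To make the formula concrete, here is a minimal Python sketch of the cost function. The function name compute_cost and the reuse of the four housing rows from the earlier slides are illustrative choices, not part of the lecture:

```python
def compute_cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum_i (h_theta(x_i) - y_i)^2."""
    m = len(xs)
    squared_errors = ((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return sum(squared_errors) / (2 * m)

# Example call with the housing rows shown earlier (size in ft^2, price in $1000s):
sizes  = [2104, 1416, 1534, 852]
prices = [460, 232, 315, 178]
print(compute_cost(0.0, 0.2, sizes, prices))
```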


Cost Function
• Let's consider some intuition about the cost function and why we want to use it
• The cost function determines the parameters
  • The values of the parameters determine how your hypothesis behaves; different values generate different costs
• Simplified hypothesis
  • Assume θ0 = 0
  • hθ(x) = θ1x
  • [Plot: the line hθ(x) = θ1x on 0-3 axes, passing through the origin since θ0 = 0.]

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 15


Cost Function
• Cost function and goal here are very similar to when we have θ0, but with a
simpler parameter
• Simplified hypothesis makes visualizing cost function J(θ) a bit easier
• So the hypothesis passes through (0, 0)
• Two key functions we want to understand
  • hθ(x)
    • the hypothesis as a function of x - a function of what the size of the house is
  • J(θ1)
    • the cost as a function of the parameter θ1

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 16


Cost Function
hθ(x) = θ1x

[Left plot: hθ(x) as a function of x (for fixed θ1). With θ1 = 1 the line passes exactly through all the training points.
Right plot: J(θ1) as a function of the parameter θ1. Since the fit is exact at θ1 = 1, J(1) = 0.]

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 17


Cost Function
hθ(x) = θ1x

[Left plot: hθ(x) with θ1 = 0.5 - the line now passes below the training points.
Right plot: J(θ1) as a function of the parameter θ1 - the corresponding cost is J(0.5) ≈ 0.58.]

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 18
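The J(0.5) ≈ 0.58 value can be reproduced in a few lines, assuming (as the plots suggest) the toy training set (1, 1), (2, 2), (3, 3); that data set is inferred from the figures, not stated in the text:

```python
# Assumed toy data from the plots: the line with theta1 = 1 fits it exactly (J = 0).
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

theta1 = 0.5
m = len(xs)
J = sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
print(round(J, 2))  # 0.58, matching the right-hand plot
```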


Cost Function
hθ(x) = θ1x

[Left plot: hθ(x) for several values of θ1 (0, 0.5, 1).
Right plot: plotting J(θ1) for each value traces out a bowl-shaped curve in θ1, whose minimum is at θ1 = 1.]

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 19


Deeper Insight - Cost Function
• Using our original hypothesis with two parameters, θ0 and θ1,
• the cost function is
  • J(θ0, θ1)
• Example,
• Say
• θ0 = 50
• θ1 = 0.06
• Previously we plotted our cost function by plotting
• θ1 vs J(θ1)

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 20


Deeper Insight - Cost Function
• Now we have two parameters
• Plot becomes a bit more
complicated
• Generates a 3D surface plot where the axes are
  • X = θ1
  • Z = θ0
  • Y = J(θ0, θ1)
• the height (y) indicates the value
of the cost function, so find where
y is at a minimum

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 21


Deeper Insight - Cost Function
• Instead of a surface plot we can use contour figures/plots
  • A set of ellipses in different colors
  • Each ellipse (color) is a set of points with the same value of J(θ0, θ1); they plot to different locations because θ0 and θ1 vary
  • Imagine a bowl-shaped function coming out of the screen, so the middle of the concentric ellipses is the minimum

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 22


Deeper Insight - Cost Function
• Each point on the contour plot represents a pair of parameter values for θ0 and θ1
• Our example here put the values at
• θ0 = ~800
• θ1 = ~-0.15
• Not a good fit
• i.e. these parameters give a value on our contour plot far from the center
• If we have
• θ0 = ~360
• θ1 = 0
• This gives a better hypothesis, but still not great - not in the center of the contour plot
• Finally we find the minimum, which gives the best hypothesis
• Doing this by eye/hand is a painful task
• What we really want is an efficient algorithm for finding the minimum for θ0 and θ1

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 23


Deeper Insight - Cost Function

[Slides 24-28: figure-only slides (hypothesis lines and the corresponding contour plots of J(θ0, θ1)); no extractable text.]


Gradient Descent Algorithm
• Minimize cost function J
• Gradient descent
• Used all over machine learning for minimization
• Start by looking at a general J() function
• Problem
• We have J(θ0, θ1)
• We want to get min J(θ0, θ1)
• Gradient descent applies to more general functions
• J(θ0, θ1, θ2 .... θn)
• min J(θ0, θ1, θ2 .... θn)

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 29


Gradient Descent Algorithm
How does it work?
• Start with initial guesses
  • Start at (0, 0) (or any other value)
• Keep changing θ0 and θ1 a little bit to try and reduce J(θ0, θ1)
  • Each time you change the parameters, you take a step in the direction that reduces J(θ0, θ1) the most
• Repeat
  • Do so until you converge to a local minimum
• Has an interesting property
  • Where you start can determine which minimum you end up in

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 30


Gradient Descent Algorithm

• Here we can see one initialization point led to one local minimum
• The other led to a different one
Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 31
Gradient Descent Algorithm

• What does this all mean?
• Each parameter θj is repeatedly updated as
  • θj := θj − α · ∂/∂θj J(θ0, θ1)
  • i.e. subtract from θj the learning rate α times the partial derivative of the cost function with respect to θj
• Notation
  • := denotes assignment
  • α (alpha)
    • is a number called the learning rate
    • controls how big a step you take
    • if α is big we have an aggressive gradient descent
    • if α is small we take tiny steps
  • ∂/∂θj J(θ0, θ1) is the derivative term
Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 32


Gradient Descent Algorithm

• There is a subtlety about how this gradient descent algorithm is implemented
• We do this for θ0 and θ1
  • For j = 0 and j = 1, we simultaneously update both
• How do we do this?
  • Compute the right-hand side for both θ0 and θ1
  • So we need temp values
  • Then update θ0 and θ1 at the same time (see the sketch below)

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 33
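A minimal sketch of that simultaneous update, assuming helper functions dJ_dtheta0 and dJ_dtheta1 (hypothetical names) that evaluate the two partial derivatives at the current parameters:

```python
def gradient_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    # Evaluate BOTH right-hand sides with the old theta0 and theta1 ...
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # ... and only then assign, so neither update sees a half-updated parameter.
    return temp0, temp1
```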


Gradient Descent Algorithm
• To understand gradient descent, we'll return to a simpler function where
we minimize one parameter to help explain the algorithm in more detail
• where θ1 is a real number
• Two key terms in the algorithm
• Alpha
• Derivative term
• Notation nuances
  • Partial derivative (∂) vs. derivative (d)
    • Use the partial derivative when the function has multiple variables but we differentiate with respect to only one
    • Use the ordinary derivative when the function has a single variable

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 34


Gradient Descent Algorithm
• Derivative term: d/dθ1 J(θ1)
• The derivative says
  • take the tangent at the current point and look at the slope of that line
  • if the slope is negative (we are to the left of the minimum), θ1 := θ1 − α · (negative number) increases θ1, moving it towards the minimum
  • if the slope is positive (we are to the right of the minimum), θ1 decreases, again moving towards the minimum
  • either way, J(θ1) becomes smaller

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 35


Gradient Descent Algorithm
[Two plots of J(θ1) for θ1 ∈ ℝ:]

• Positive slope (point to the right of the minimum):
  d/dθ1 J(θ1) ≥ 0, so θ1 := θ1 − α · (positive number) - θ1 decreases.

• Negative slope (point to the left of the minimum):
  d/dθ1 J(θ1) ≤ 0, so θ1 := θ1 − α · (negative number) - θ1 increases.

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 36


Gradient Descent Algorithm

[Two plots of J(θ1) for θ1 ∈ ℝ, illustrating the update θ1 := θ1 − α · d/dθ1 J(θ1):]

• If α is too small, gradient descent takes tiny steps and can be slow.
• If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 37
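A toy illustration (not from the slides) of both failure modes, using the one-parameter cost J(θ) = θ², whose derivative is 2θ:

```python
def run(alpha, theta=1.0, steps=10):
    for _ in range(steps):
        theta = theta - alpha * (2 * theta)  # gradient descent on J(theta) = theta^2
    return theta

print(run(alpha=0.01))  # too small: after 10 steps theta is still ~0.82 (slow)
print(run(alpha=0.5))   # reasonable: reaches the minimum at theta = 0
print(run(alpha=1.1))   # too large: |theta| grows every step (diverges)
```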


Gradient Descent Algorithm
• Suppose θ1 is already at a local optimum of J(θ1) (θ1 ∈ ℝ).
• There the slope of the tangent is zero: d/dθ1 J(θ1) = 0.
• So the update θ1 := θ1 − α · 0 leaves θ1 unchanged - gradient descent stays at the optimum.

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 38


Gradient Descent Algorithm
• Gradient descent can converge to a local minimum, even with the
learning rate α fixed.
• The update is θ1 := θ1 − α · d/dθ1 J(θ1).
• As we approach a local minimum, the derivative term gets smaller, so gradient descent automatically takes smaller steps. There is no need to decrease α over time.

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 39


Gradient Descent Algorithm

[Slides 40-41: figure-only content; no extractable text.]


Gradient Descent Algorithm

For the linear regression cost function, plugging the derivatives into the update rule gives:

repeat until convergence {
    θ0 := θ0 − α · (1/m) · Σ (i = 1 to m) (hθ(x(i)) − y(i))
    θ1 := θ1 − α · (1/m) · Σ (i = 1 to m) (hθ(x(i)) − y(i)) · x(i)
}
(update θ0 and θ1 simultaneously)

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 42
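A short sketch of how those two derivative terms could be computed for a given training set; the function name gradients and the use of plain Python lists are illustrative choices, not from the slides:

```python
def gradients(theta0, theta1, xs, ys):
    """Return (1/m)*sum(h(x)-y) and (1/m)*sum((h(x)-y)*x)."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    d_theta0 = sum(errors) / m
    d_theta1 = sum(e * x for e, x in zip(errors, xs)) / m
    return d_theta0, d_theta1
```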


Gradient Descent Algorithm

[Slides 43-54: figure-only slides illustrating successive gradient descent steps; no extractable text.]


“Batch” Gradient Descent

• “Batch”: each step of gradient descent uses all the training examples

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 55
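Putting the pieces together, a minimal batch gradient descent sketch might look like the following; every iteration sums over all m training examples, which is what makes it "batch". (With the raw house sizes and no feature scaling, a very small α would be needed in practice.)

```python
def batch_gradient_descent(xs, ys, alpha, iterations=1000):
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Each step uses ALL m training examples ("batch").
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1
```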


Implementation
House sizes: 2104, 1416, 1534, 852
Hypothesis: hθ(x) = −40 + 0.25x

Build a data matrix with one row per house (a leading 1 multiplies θ0) and multiply it by the parameter vector:

  [ 1  2104 ]               [ −40 × 1 + 0.25 × 2104 ]
  [ 1  1416 ]   [ −40  ]    [ −40 × 1 + 0.25 × 1416 ]
  [ 1  1534 ] × [ 0.25 ] =  [ −40 × 1 + 0.25 × 1534 ]
  [ 1   852 ]               [ −40 × 1 + 0.25 × 852  ]

Prediction = Data Matrix × Parameters

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 56
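The same computation in NumPy (an implementation choice, not prescribed by the slides): stack a column of ones next to the sizes and multiply by the parameter vector:

```python
import numpy as np

X = np.array([[1, 2104],
              [1, 1416],
              [1, 1534],
              [1,  852]], dtype=float)   # each row: [1, house size]
theta = np.array([-40.0, 0.25])          # h(x) = -40 + 0.25x
predictions = X @ theta
print(predictions)                       # [486.  314.  343.5 173. ]
```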


Implementation
House sizes: 2104, 1416, 1534, 852

Have 3 competing hypotheses:
1. hθ(x) = −40 + 0.25x
2. hθ(x) = 200 + 0.1x
3. hθ(x) = −150 + 0.4x

  [ 1  2104 ]                           [ 486  410  692 ]
  [ 1  1416 ]   [ −40    200   −150 ]   [ 314  342  416 ]
  [ 1  1534 ] × [ 0.25   0.1    0.4 ] = [ 344  353  464 ]
  [ 1   852 ]                           [ 173  285  191 ]

  Column 1: predictions of the 1st hθ
  Column 2: predictions of the 2nd hθ
  Column 3: predictions of the 3rd hθ

Silvia Ahmed (SvA) CSE445 Machine Learning ECE@NSU 57
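The three-hypothesis case is the same NumPy product with a parameter matrix instead of a vector; column j of the result holds the predictions of hypothesis j:

```python
import numpy as np

X = np.array([[1, 2104], [1, 1416], [1, 1534], [1, 852]], dtype=float)
Theta = np.array([[-40.0, 200.0, -150.0],   # theta0 of hypotheses 1, 2, 3
                  [ 0.25,   0.1,    0.4]])  # theta1 of hypotheses 1, 2, 3
print(X @ Theta)  # 4x3 matrix matching the table on the slide (rounded)
```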
