Supervised Machine Learning

Supervised learning algorithms aim to learn a function from labeled training data. For linear regression, this function is represented by hθ(x) = θ0 + θ1x. A cost function measures the accuracy of the hypothesis by calculating the average squared difference between predicted and actual outputs. Gradient descent is used to minimize the cost function by iteratively adjusting the θ values in the direction of steepest descent as determined by the partial derivatives of the cost function with respect to θ0 and θ1. This process repeats until convergence is reached at the minimum cost.


Chapter 1

Supervised Learning Algorithms


Regression

1.1 Notations
m = number of training examples
x's = "input" variables / features
y's = "output" variables / "target" variables
(x, y) = a single training example, i.e. a single row in a table of training data
(x^(i), y^(i)) = the i-th training example

1.2 How does supervised learning work?


So here's how this supervised learning algorithm works. We feed the training set (like our training set of housing prices) to our learning algorithm. The job of the learning algorithm is to output a function, which by convention is usually denoted lowercase h, where h stands for hypothesis. The job of the hypothesis is to take as input the size of a house (say, the size of the new house your friend is trying to sell), that is, a value of x, and to output the estimated value of y for the corresponding house. So h is a function that maps from x's to y's.
Figure 1.1: Supervised Learning flow

1.3 Basics
1.3.1 What is a model?
The algorithm you use for machine learning is called a model. It is sometimes also referred
to as a learning model.

1.3.2 How to represent hypothesis function (h)?


We represent h as follows

hθ (x) = θ0 + θ1 x (1.1)
where θ0 and θ1 are the parameters of the model.

Shorthand notations for hθ (x) are h(x) or y.

The above equation (1.1) is plotted in the accompanying picture; it shows that we are going to predict that y is a linear function of x. What this function is doing is predicting that y is some straight-line function of x. That is h(x) = θ0 + θ1 x.

This model is called linear regression, or, in this example, linear regression with one variable, with the variable being x: we are predicting all the prices as functions of the one variable x. Another name for this model is univariate linear regression, and univariate is just a fancy way of saying one variable.

1.3.3 Cost function


Let us consider the following example.

x (input)   y (output)
    0           4
    1           7
    2           7
    3           8

Now we can make a random guess about our hθ function: θ0 = 2 and θ1 = 2. The
hypothesis function becomes hθ (x) = 2 + 2x.

So for input of 1 to our hypothesis, y will be 4. This is off by 3.
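This check can be run in code. A minimal sketch, using the four-row data set and the guess θ0 = θ1 = 2 from above:

```python
# Hypothesis h_theta(x) = theta0 + theta1 * x with the guessed parameters
def h(x, theta0=2.0, theta1=2.0):
    return theta0 + theta1 * x

xs = [0, 1, 2, 3]   # inputs from the table
ys = [4, 7, 7, 8]   # actual outputs

for x, y in zip(xs, ys):
    print(f"x={x}: predicted {h(x)}, actual {y}, off by {y - h(x)}")
# for x = 1 the prediction is 4.0 while the actual output is 7, off by 3
```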

We can measure the accuracy of our hypothesis function by using a cost function. This takes an average over all the results of the hypothesis with inputs from the x's compared to the actual outputs y's.

    J(θ0, θ1) = (1/2m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) )^2    (1.2)

To break it apart, it is (1/2) x̄, where x̄ is the mean of the squares of hθ(x^(i)) − y^(i), i.e. of the differences between the predicted values and the actual values.

This function is otherwise called the "squared error function", or mean squared error. The mean is halved (1/2m instead of 1/m) as a convenience for the computation of gradient descent, as the derivative of the square term will cancel out the 1/2. (We will be seeing this soon.)

Now we are able to concretely measure the accuracy of our predictor function against the
correct results we have so that we can predict new results we don’t have.
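Equation (1.2) translates directly into code. A minimal sketch, assuming the four-row data set and the guess θ0 = θ1 = 2 from the example above:

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) from equation (1.2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [0, 1, 2, 3]
ys = [4, 7, 7, 8]
print(cost(2, 2, xs, ys))  # (1/8) * (4 + 9 + 1 + 0) = 1.75
```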

1.3.4 Gradient Descent


So we have our hypothesis function and we have a way of measuring how accurate it is.
Now what we need is a way to automatically improve our hypothesis function. That’s where
gradient descent comes in.

Imagine that we graph our hypothesis function based on its fields θ0 and θ1 (actually we
are graphing the cost function for the combinations of parameters). We are not graphing x
and y itself, but the guesses of our hypothesis function.

We put θ0 on the x axis and θ1 on the z axis, with the cost function on the vertical y
axis. The points on our graph will be the result of the cost function using our hypothesis
with those specific theta parameters.

We will know that we have succeeded when our cost function is at the very bottom of the
pits in our graph, i.e. when its value is the minimum.

The way we do this is by taking the derivative (the slope of the line tangent to the function) of our cost function. The slope of the tangent at a point is the derivative there, and it gives us a direction to move in. We take steps down that slope, with the step size set by the parameter α, called the learning rate.

Algorithm outline
1. Start with some θ0 , θ1
2. Keep changing θ0 , θ1 to reduce J(θ0 , θ1 )
3. Repeat step 2 until we reach a minimum

The gradient descent equation is:

repeat until convergence:

    θj := θj − α · (∂/∂θj) J(θ0, θ1)    (1.3)

for j = 0 and j = 1

This could also be thought of as:

repeat until convergence:

θj := θj − α[Slope of tangent aka derivative] (1.4)
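Update rule (1.4) already works in one dimension. A minimal sketch, using a made-up toy function f(θ) = (θ − 3)^2 (not from these notes), whose slope is 2(θ − 3):

```python
def slope(theta):
    # derivative of the toy function f(theta) = (theta - 3) ** 2
    return 2 * (theta - 3)

alpha = 0.1   # learning rate
theta = 0.0   # initial guess
for _ in range(100):
    theta = theta - alpha * slope(theta)   # theta := theta - alpha * slope
print(theta)  # converges toward 3, the minimizer of f
```

Each step moves θ downhill; the closer θ gets to the minimum, the smaller the slope and hence the smaller the step.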

1.3.5 Gradient Descent for Linear Regression


When specifically applied to the case of linear regression, a new form of the gradient
descent equation can be derived. We can substitute our actual cost function and our actual
hypothesis function and modify the equation as follows.

For our cost function, think of it this way:


    g(θ0, θ1) = J(θ0, θ1) = (1/2m) Σ_{i=1}^m ( f(θ0, θ1)^(i) )^2    (1.5)

    f(θ0, θ1)^(i) = θ0 + θ1 x^(i) − y^(i)    (1.6)

Now substitute f(θ0, θ1)^(i) into g(θ0, θ1) and we get,

    g(f(θ0, θ1)^(i)) = (1/2m) Σ_{i=1}^m ( θ0 + θ1 x^(i) − y^(i) )^2    (1.7)
This is, indeed, our entire cost function.

Thus, the partial derivatives work like this:

    (∂/∂θ0) g(θ0, θ1) = (∂/∂θ0) (1/2m) Σ_{i=1}^m ( f(θ0, θ1)^(i) )^2 = (1/2m) Σ_{i=1}^m 2 · f(θ0, θ1)^(i) · (∂/∂θ0) f(θ0, θ1)^(i) = (1/m) Σ_{i=1}^m f(θ0, θ1)^(i)    (1.8)

    (∂/∂θ0) f(θ0, θ1)^(i) = (∂/∂θ0) ( θ0 + θ1 x^(i) − y^(i) ) = 1    (1.9)
Using chain rule,
    (∂/∂θ0) g(f(θ0, θ1)^(i)) = (∂/∂θ0) g(θ0, θ1) · (∂/∂θ0) f(θ0, θ1)^(i)    (1.10)

Substituting g(θ0, θ1) and f(θ0, θ1)^(i) in the above equation, we have:

    (1/m) Σ_{i=1}^m f(θ0, θ1)^(i) · (∂/∂θ0) f(θ0, θ1)^(i) = (1/m) Σ_{i=1}^m ( θ0 + θ1 x^(i) − y^(i) ) × 1 = (1/m) Σ_{i=1}^m ( θ0 + θ1 x^(i) − y^(i) )    (1.11)

Similarly for θ1. Our term g(θ0, θ1) is identical, so we just need to take the derivative of f(θ0, θ1)^(i), this time treating θ1 as the variable and the other terms as "just a number." That goes like this:
    (∂/∂θ1) f(θ0, θ1)^(i) = (∂/∂θ1) ( θ0 + θ1 x^(i) − y^(i) )    (1.12)

    (∂/∂θ1) f(θ0, θ1)^(i) = (∂/∂θ1) ( [a number] + θ1 · [a number, x^(i)] − [a number] )    (1.13)

    (∂/∂θ1) f(θ0, θ1)^(i) = 0 + 1 · (θ1)^(1−1=0) · x^(i) − 0 = 1 × 1 × x^(i) = x^(i)    (1.14)

Thus, the entire answer becomes:

    (∂/∂θ1) g(f(θ0, θ1)^(i)) = (∂/∂θ1) g(θ0, θ1) · (∂/∂θ1) f(θ0, θ1)^(i) = (1/m) Σ_{i=1}^m f(θ0, θ1)^(i) · (∂/∂θ1) f(θ0, θ1)^(i) = (1/m) Σ_{i=1}^m ( θ0 + θ1 x^(i) − y^(i) ) x^(i)    (1.15)
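One way to sanity-check the two partial derivatives (1.11) and (1.15) is to compare them against numerical derivatives of J. A minimal sketch, assuming the example data and the guess θ0 = θ1 = 2 from Section 1.3.3:

```python
xs = [0.0, 1.0, 2.0, 3.0]
ys = [4.0, 7.0, 7.0, 8.0]
m = len(xs)

def J(t0, t1):
    # cost function from equation (1.2)
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# analytic partial derivatives at (theta0, theta1) = (2, 2)
d0 = sum(2 + 2 * x - y for x, y in zip(xs, ys)) / m          # equation (1.11)
d1 = sum((2 + 2 * x - y) * x for x, y in zip(xs, ys)) / m    # equation (1.15)

# numerical partial derivatives by central differences
eps = 1e-6
n0 = (J(2 + eps, 2) - J(2 - eps, 2)) / (2 * eps)
n1 = (J(2, 2 + eps) - J(2, 2 - eps)) / (2 * eps)
print(d0, n0)  # both close to -1.5
print(d1, n1)  # both close to -1.25
```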

So our final algorithm will be as follows

repeat until convergence: {

    θ0 := θ0 − α · (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) )

    θ1 := θ1 − α · (1/m) Σ_{i=1}^m ( hθ(x^(i)) − y^(i) ) · x^(i)

}

where m is the size of the training set, θ0 is a value that changes simultaneously with θ1, and x^(i), y^(i) are values from the given training set (data).

Note that we have separated out the two cases for θ0 and θ1, and that for θ1 we multiply by x^(i) at the end due to the derivative.

The point of all this is that if we start with a guess for our hypothesis and then repeatedly
apply these gradient descent equations, our hypothesis will become more and more accurate.
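The full loop can be sketched as follows, assuming the example data from Section 1.3.3; the learning rate α = 0.1 and the iteration count are arbitrary choices, not from the notes:

```python
xs = [0.0, 1.0, 2.0, 3.0]
ys = [4.0, 7.0, 7.0, 8.0]
m = len(xs)

def gradient_descent(alpha=0.1, iters=5000):
    t0 = t1 = 0.0
    for _ in range(iters):
        errors = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                            # (1/m) sum of h(x) - y
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # (1/m) sum of (h(x) - y) * x
        # simultaneous update of both parameters
        t0, t1 = t0 - alpha * grad0, t1 - alpha * grad1
    return t0, t1

theta0, theta1 = gradient_descent()
print(theta0, theta1)  # close to 4.7 and 1.2, the least-squares fit
```

For this data set the updates settle at θ0 = 4.7, θ1 = 1.2, the straight line that minimizes J.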

1.4 Quiz

Quiz 1
