Supervised Machine Learning

Supervised learning algorithms aim to learn a function from labeled training data. For linear regression, this function is represented by hθ(x) = θ0 + θ1x. A cost function measures the accuracy of the hypothesis by calculating the average squared difference between predicted and actual outputs. Gradient descent is used to minimize the cost function by iteratively adjusting the θ values in the direction of steepest descent as determined by the partial derivatives of the cost function with respect to θ0 and θ1. This process repeats until convergence is reached at the minimum cost.

Uploaded by

ram

Chapter 1

Supervised Learning Algorithms

Regression

1.1 Notations
m = number of training examples
x's = "input" variables / features
y's = "output" variable / "target" variable
(x, y) = a single training example, i.e. a single row in the table below
(x^(i), y^(i)) = the i-th training example

1.2 How does supervised learning work?

Here is how a supervised learning algorithm works. We feed the training set (like our training set of housing prices) to our learning algorithm. The job of the learning algorithm is then to output a function, which by convention is denoted by a lowercase h, where h stands for hypothesis. The hypothesis is a function that takes as input the size of a house, such as the new house your friend is trying to sell: it takes in the value of x and tries to output the estimated value of y for the corresponding house. So h is a function that maps from x's to y's.
Figure 1.1: Supervised Learning flow

1.3 Basics
1.3.1 What is a model?
The algorithm you use for machine learning is called a model. It is sometimes also referred
to as a learning model.

1.3.2 How to represent the hypothesis function (h)?

We represent h as follows:

\[
h_\theta(x) = \theta_0 + \theta_1 x \tag{1.1}
\]

where θ0 and θ1 are the parameters of the model.

Shorthand notations for hθ(x) are h(x) or y.

Equation (1.1) is plotted in this picture. It shows that we are going to predict that y is a linear function of x: the function predicts that y is some straight-line function of x, namely h(x) = θ0 + θ1 x.

This model is called linear regression. More precisely, this example is linear regression with one variable, the variable being x: we predict all the prices as functions of the single variable x. Another name for this model is univariate linear regression, where univariate is just a fancy way of saying one variable.

1.3.3 Cost function

Let us consider the following example.

x (input)   y (output)
0           4
1           7
2           7
3           8

Now we can make a random guess about our hθ function: θ0 = 2 and θ1 = 2. The hypothesis function becomes hθ(x) = 2 + 2x.

So for an input of 1, our hypothesis predicts 2 + 2 × 1 = 4, while the actual output is 7. The prediction is off by 3.
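To make the guess concrete, the following is a minimal Python sketch that evaluates the guessed hypothesis on every row of the example table. The function name `hypothesis` and the variable names are illustrative choices, not part of the original notes.

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Example table from the text.
xs = [0, 1, 2, 3]
ys = [4, 7, 7, 8]

# The guess from the text: theta0 = 2, theta1 = 2.
predictions = [hypothesis(2, 2, x) for x in xs]

# Signed error of each prediction (predicted minus actual).
errors = [p - y for p, y in zip(predictions, ys)]

for x, y, p, e in zip(xs, ys, predictions, errors):
    print(f"x={x}  actual y={y}  predicted={p}  error={e}")
```

For x = 1 this prints a prediction of 4 and an error of -3, matching the "off by 3" observation above.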

We can measure the accuracy of our hypothesis function by using a cost function. It takes an average (with an extra factor of 1/2) of the squared differences between the results of the hypothesis on the inputs x and the actual outputs y.

\[
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \tag{1.2}
\]

To break it apart, it is (1/2) x̄, where x̄ is the mean of the squares of hθ(x^(i)) − y^(i), the differences between the predicted values and the actual values.

This function is otherwise called the "squared error function", or mean squared error. The mean is halved (giving the factor 1/(2m)) as a convenience for the computation of gradient descent, as the derivative of the square function will cancel out the 1/2 term. (We will be seeing this soon.)

Now we are able to concretely measure the accuracy of our predictor function against the
correct results we have so that we can predict new results we don’t have.
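As a worked illustration, here is a short Python sketch of the squared error cost function, applied to the example table with the guess θ0 = 2, θ1 = 2. The function name `cost` and the variable `j_guess` are assumptions made for this example.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost: J = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    squared_errors = [(theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / (2 * m)


# Example table from the text.
xs = [0, 1, 2, 3]
ys = [4, 7, 7, 8]

# Squared errors are 4, 9, 1, 0, so J = 14 / (2 * 4).
j_guess = cost(2, 2, xs, ys)
print(j_guess)  # 1.75
```

A hypothesis that passes through every training point would give a cost of 0, so a smaller J means a better fit on the training set.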

1.3.4 Gradient Descent

So we have our hypothesis function and we have a way of measuring how accurate it is.
Now what we need is a way to automatically improve our hypothesis function. That’s where
gradient descent comes in.

Imagine that we graph the cost function against the parameters θ0 and θ1 of our hypothesis function. We are not graphing x and y themselves, but the cost that results from each guess of the parameters.

We put θ0 on the x axis and θ1 on the z axis, with the cost function on the vertical y
axis. The points on our graph will be the result of the cost function using our hypothesis
with those specific theta parameters.

We will know that we have succeeded when our cost function is at the very bottom of the
pits in our graph, i.e. when its value is the minimum.

The way we do this is by taking the derivative of our cost function. The derivative at a point is the slope of the line tangent to the function there, and it gives us a direction to move towards. We take steps down the cost function in the direction of steepest descent, with the size of each step determined by the parameter α, called the learning rate.

Algorithm outline
1. Start with some θ0 , θ1
2. Keep changing θ0 , θ1 to reduce J(θ0 , θ1 )
3. Repeat step 2 until we reach a minimum

The gradient descent equation is:

repeat until convergence:

\[
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \tag{1.3}
\]

for j = 0 and j = 1

This could also be thought of as:

repeat until convergence:

\[
\theta_j := \theta_j - \alpha \, [\text{slope of tangent, aka derivative}] \tag{1.4}
\]
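The update rule can be tried on a toy problem before applying it to the cost function. The sketch below minimizes the one-parameter function (θ − 3)^2, whose derivative is 2(θ − 3). This toy function, the helper names, the learning rate, and the fixed number of steps (used here in place of a convergence test) are all assumptions chosen for illustration.

```python
def gradient_descent_1d(derivative, theta, alpha, num_steps):
    """Repeatedly apply theta := theta - alpha * derivative(theta)."""
    for _ in range(num_steps):
        theta = theta - alpha * derivative(theta)
    return theta


def d_toy(theta):
    """Derivative of the toy function (theta - 3)^2."""
    return 2 * (theta - 3)


# Start away from the minimum and take 200 steps with learning rate 0.1.
theta_min = gradient_descent_1d(d_toy, theta=0.0, alpha=0.1, num_steps=200)
print(theta_min)
```

Each step moves θ against the sign of the slope, so the iterate approaches the minimum at θ = 3. With a learning rate that is too large for this function (above 1.0), the same loop would overshoot and diverge.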

1.3.5 Gradient Descent for Linear Regression

When specifically applied to the case of linear regression, a new form of the gradient
descent equation can be derived. We can substitute our actual cost function and our actual
hypothesis function and modify the equation as follows.

For our cost function, think of it this way:

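\[
g(\theta_0, \theta_1) = J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( f(\theta_0, \theta_1)^{(i)} \right)^2 \tag{1.5}
\]

\[
f(\theta_0, \theta_1)^{(i)} = \theta_0 + \theta_1 x^{(i)} - y^{(i)} \tag{1.6}
\]

Now substitute f(θ0, θ1)^(i) into g(θ0, θ1) and we get

\[
g(f(\theta_0, \theta_1)^{(i)}) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2 \tag{1.7}
\]

This is, indeed, our entire cost function.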

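Thus, the partial derivatives work like this. First differentiate the outer function g with respect to a single f(θ0, θ1)^(i), using the power rule:

\[
\frac{\partial g}{\partial f(\theta_0, \theta_1)^{(i)}} = \frac{\partial}{\partial f(\theta_0, \theta_1)^{(i)}} \frac{1}{2m} \sum_{k=1}^{m} \left( f(\theta_0, \theta_1)^{(k)} \right)^2 = \frac{1}{2m} \times 2 \times \left( f(\theta_0, \theta_1)^{(i)} \right)^{2-1} = \frac{1}{m} f(\theta_0, \theta_1)^{(i)} \tag{1.8}
\]

Then differentiate the inner function with respect to θ0:

\[
\frac{\partial}{\partial \theta_0} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_0} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) = 1 \tag{1.9}
\]

Using the chain rule,

\[
\frac{\partial}{\partial \theta_0} g(f(\theta_0, \theta_1)^{(i)}) = \sum_{i=1}^{m} \frac{\partial g}{\partial f(\theta_0, \theta_1)^{(i)}} \, \frac{\partial}{\partial \theta_0} f(\theta_0, \theta_1)^{(i)} \tag{1.10}
\]

Substituting (1.8) and (1.9) into the above equation, we have:

\[
\frac{1}{m} \sum_{i=1}^{m} f(\theta_0, \theta_1)^{(i)} \, \frac{\partial}{\partial \theta_0} f(\theta_0, \theta_1)^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) \times 1 = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) \tag{1.11}
\]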

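Similarly for θ1. The outer derivative of g with respect to each f(θ0, θ1)^(i), which is (1/m) f(θ0, θ1)^(i), is identical, so we just need to take the derivative of f(θ0, θ1)^(i), this time treating θ1 as the variable and the other terms as "just a number." That goes like this:

\[
\frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_1} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) \tag{1.12}
\]

\[
\frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_1} \left( [\text{a number}] + \theta_1 [\text{a number, } x^{(i)}] - [\text{a number}] \right) \tag{1.13}
\]

\[
\frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = 0 + 1 \times \theta_1^{\,1-1} x^{(i)} - 0 = 1 \times \theta_1^{\,0} \times x^{(i)} = 1 \times 1 \times x^{(i)} = x^{(i)} \tag{1.14}
\]

Thus, the entire answer becomes:

\[
\frac{\partial}{\partial \theta_1} g(f(\theta_0, \theta_1)^{(i)}) = \sum_{i=1}^{m} \frac{\partial g}{\partial f(\theta_0, \theta_1)^{(i)}} \, \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = \frac{1}{m} \sum_{i=1}^{m} f(\theta_0, \theta_1)^{(i)} \, \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) x^{(i)} \tag{1.15}
\]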
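One way to sanity-check the two derived partial derivatives is to compare them with a numerical approximation. The sketch below does this on the example table from section 1.3.3 at θ0 = 2, θ1 = 2. The central-difference check is an added technique, not part of the original derivation, and the function names and step size `eps` are assumptions.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost: J = 1/(2m) * sum((theta0 + theta1*x_i - y_i)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradients(theta0, theta1, xs, ys):
    """Analytic partial derivatives of J with respect to theta0 and theta1."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    d_theta0 = sum(errors) / m
    d_theta1 = sum(e * x for e, x in zip(errors, xs)) / m
    return d_theta0, d_theta1


def numeric_gradients(theta0, theta1, xs, ys, eps=1e-6):
    """Central-difference approximation of the same two partial derivatives."""
    d_theta0 = (cost(theta0 + eps, theta1, xs, ys)
                - cost(theta0 - eps, theta1, xs, ys)) / (2 * eps)
    d_theta1 = (cost(theta0, theta1 + eps, xs, ys)
                - cost(theta0, theta1 - eps, xs, ys)) / (2 * eps)
    return d_theta0, d_theta1


xs = [0, 1, 2, 3]
ys = [4, 7, 7, 8]

analytic = gradients(2.0, 2.0, xs, ys)
numeric = numeric_gradients(2.0, 2.0, xs, ys)
print("analytic:", analytic)
print("numeric: ", numeric)
```

At this point the errors are -2, -3, -1, 0, so the analytic values are -6/4 = -1.5 for θ0 and -5/4 = -1.25 for θ1, and the numerical values should agree up to rounding.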

So our final algorithm will be as follows

repeat until convergence: {

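\[
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
\]

\[
\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
\]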
}

where m is the size of the training set, θ0 and θ1 are the parameters, which are updated simultaneously, and x^(i), y^(i) are the values of the given training set (data).

Note that we have separated out the two cases for θ0 and θ1, and that for θ1 we multiply by x^(i) at the end due to the derivative.

The point of all this is that if we start with a guess for our hypothesis and then repeatedly
apply these gradient descent equations, our hypothesis will become more and more accurate.
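The final algorithm can be sketched in a few lines of Python and run on the example table from section 1.3.3. This is a minimal illustration under stated assumptions: the function name `gradient_descent`, the zero starting guess, the learning rate of 0.1, and the fixed iteration count (standing in for "repeat until convergence") are all choices made here, not taken from the notes.

```python
def gradient_descent(xs, ys, alpha=0.1, num_iterations=5000):
    """Batch gradient descent for univariate linear regression."""
    theta0, theta1 = 0.0, 0.0  # assumed starting guess
    m = len(xs)
    for _ in range(num_iterations):
        # Errors h(x_i) - y_i under the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients use the old parameter values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


xs = [0, 1, 2, 3]
ys = [4, 7, 7, 8]

theta0, theta1 = gradient_descent(xs, ys)
print(f"theta0 = {theta0:.4f}, theta1 = {theta1:.4f}")
# theta0 = 4.7000, theta1 = 1.2000
```

The result matches the least-squares line for this data, h(x) = 4.7 + 1.2x. Computing both gradients before assigning either parameter is what makes the update simultaneous.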

1.4 Quiz

Quiz 1
