
Machine Learning

Linear Regression

Na Lu
Xi’an Jiaotong University
Machine Learning

• Machine learning: Field of study that gives computers the ability to learn without being explicitly programmed.
  – By Arthur Samuel (1959)
Machine Learning

• Well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
  – By Tom Mitchell (1998)
Question
Three types of learning

• Supervised learning
  • Learn to predict an output when given an input vector.
• Reinforcement learning
  • Learn to select an action to maximize payoff.
• Unsupervised learning
  • Discover a good internal representation of the input and the structure it contains.
Two types of supervised learning
• Each training case consists of an input vector x and a
target output y.
• Regression: The target output is a real number or a
whole vector of real numbers.
– The price of a stock.
– The temperature during a day.
– Aim: to get as close as you can to the real number.
• Classification: The target output is a class label.
– The simplest case is a choice between 1 and 0.
– Facial identities with multiple labels.
– Aim: to classify the input into the correct category.
How does supervised learning work

• Start by choosing a model class: y = f(x; W)
  – A model class f is a mapping of the input vector x to the predicted output y, with parameters in W.
• The goal of learning is to reduce the discrepancy between the model's predicted output and the actual output on the given input-output pairs, as in the sketch below.
  – The L2 norm (least squares) is a widely used measure of the discrepancy for regression.
  – The L2 norm and other task-specific measures can be used for classification.
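To make the discrepancy idea concrete, here is a minimal Python/NumPy sketch (my illustration; the function names and the linear form of f are assumptions, not from the slides):

import numpy as np

def f(x, W):
    # One possible model class: a linear map from input vector x to output.
    return W @ x

def discrepancy(X, Y, W):
    # L2 (least-squares) discrepancy over the given input-output pairs.
    return sum(np.sum((f(x, W) - y) ** 2) for x, y in zip(X, Y))

Learning then amounts to adjusting W to make this discrepancy small.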
Reinforcement learning
• The output is an action or a sequence of actions, and the supervisory information is an occasional scalar reward.
  – The goal of action selection is to maximize the expected future rewards.
  – A discount factor is employed to incorporate the delayed rewards, as illustrated below.
• Difficulties in reinforcement learning:
  – The delayed rewards make it hard to know when we went wrong.
  – A scalar reward does not supply sufficient information.
  – Far fewer parameters can be learnt with reinforcement learning than with supervised and unsupervised learning.
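As a small illustration of the discount factor (my example, not from the slides): a reward received t steps in the future is weighted by gamma**t, so delayed rewards count for less.

def discounted_return(rewards, gamma=0.9):
    # Expected future reward with a discount factor gamma in (0, 1).
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A reward of 1 arriving three steps late is worth only 0.9**3 ≈ 0.73 now.
print(discounted_return([0, 0, 0, 1]))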
Unsupervised learning

• The aim of unsupervised learning is hard to define.
  – One major aim is to find an internal representation of the input for subsequent supervised learning or reinforcement learning.
  – The related research focuses on clustering.
  – Unsupervised learning was largely ignored for about 40 years by the machine learning community.
Other goals of unsupervised learning

• To provide a compact, low-dimensional representation of the input.
  – High-dimensional inputs typically reside on or near a low-dimensional manifold (or several such manifolds).
  – Principal component analysis is a representative linear method; see the sketch below.
• To find the clusters in the input.
  – Clusters can be interpreted as a very sparse code in which only one feature is nonzero.
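A short sketch of principal component analysis via the SVD, as one such linear method (the mean-centering step and the choice of k are my assumptions, not from the slides):

import numpy as np

def pca(X, k):
    # Rows of X are input vectors; project onto the top-k principal axes.
    Xc = X - X.mean(axis=0)                       # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # compact k-dimensional codes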
Question
Supervised learning

• Data from Portland, Oregon, US
• Supervised learning: the right answer is given
• Regression: predict a continuous-valued output (house price)
Supervised learning

• Breast cancer (malignant or benign?)
• Supervised learning: the correct labels are given
• Classification: discrete-valued output (0 or 1)
Supervised learning

• More than one feature considered

– Clump thickness
– Uniformity of cell size
– Uniformity of cell shape
– ……
Unsupervised learning
Question
Applications
Cocktail party problem

• Inputs: overlapping recordings (Microphone 1, …); outputs: the separated sources (Output 1, Output 2)

% The Octave one-liner from the lecture that separates the sources via SVD,
% with its missing comma and parenthesis restored:
[W, s, v] = svd((repmat(sum(x.*x, 1), size(x, 1), 1) .* x) * x');
Problem

[Scatter plot: Housing Prices (Portland, OR) — Price (in 1000s of dollars) vs. Size (feet²)]

Supervised Learning: given the “right answer” for each example in the data.
Regression Problem: predict real-valued output.

Training set of housing prices (Portland, OR):

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …
Notation:
m = number of training examples
x's = “input” variable / features
y's = “output” variable / “target” variable

(x, y): one training example
(x⁽ⁱ⁾, y⁽ⁱ⁾): the i-th training example
How do we represent h?

Training Set → Learning Algorithm → h

Size of house x → h → Estimated price y

hθ(x) = θ0 + θ1x

Linear regression with one variable.
Univariate linear regression.
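In code, the hypothesis is a one-liner (a direct transcription of the formula above):

def h(x, theta0, theta1):
    # Univariate linear hypothesis: maps house size x to an estimated price.
    return theta0 + theta1 * x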
Question
Linear regression
with one variable

Cost function
Training Set:

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Hypothesis: hθ(x) = θ0 + θ1x

θ0, θ1: Parameters

How to choose the θ's?
[Three example plots of hθ(x) = θ0 + θ1x for different choices of θ0 and θ1]
J(θ0, θ1) = (1/2m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

Minimize J(θ0, θ1) over θ0, θ1.

Idea: Choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y).
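A minimal Python/NumPy sketch of this cost function (my transcription of the formula above):

import numpy as np

def J(theta0, theta1, x, y):
    # Squared-error cost over training examples x, y given as NumPy arrays.
    m = len(x)                              # number of training examples
    residuals = (theta0 + theta1 * x) - y   # h_theta(x^(i)) - y^(i)
    return np.sum(residuals ** 2) / (2 * m)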
Question
Linear regression
with one variable

Cost function
intuition I
Simplified hypothesis: hθ(x) = θ1x (set θ0 = 0)

Parameter: θ1

Cost function: J(θ1) = (1/2m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

Goal: minimize J(θ1) over θ1
(For fixed θ1, hθ(x) is a function of x; J(θ1) is a function of the parameter θ1.)

[Left: training data and the line hθ(x); right: J(θ1) plotted against θ1]
J(θ1) = (1/2m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
      = (1/2m) Σᵢ₌₁ᵐ (θ1x⁽ⁱ⁾ − y⁽ⁱ⁾)²
(For fixed θ1, hθ(x) is a function of x; J(θ1) is a function of the parameter θ1.)

[Left: training points and the line hθ(x) = 0.5x; right: J(θ1) with the point at θ1 = 0.5 marked]

J(0.5) = (1/6)[(0.5 − 1)² + (1 − 2)² + (1.5 − 3)²]
       = 3.5/6 ≈ 0.58
(For fixed θ1, hθ(x) is a function of x; J(θ1) is a function of the parameter θ1.)

[Left: training points and the horizontal line hθ(x) = 0; right: J(θ1) with the point at θ1 = 0 marked]

J(0) = (1/6)[1² + 2² + 3²]
     = 14/6 ≈ 2.3
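The two evaluations above can be checked in a few lines of Python; the training points (1, 1), (2, 2), (3, 3) are inferred from the arithmetic on the slides:

x = [1, 2, 3]
y = [1, 2, 3]

def J(theta1):
    # Simplified cost J(theta1) = (1/2m) * sum((theta1*x_i - y_i)^2)
    m = len(x)
    return sum((theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

print(J(0.5))  # 3.5/6 ≈ 0.58
print(J(0))    # 14/6 ≈ 2.3
print(J(1))    # 0: the line y = x fits the data exactly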
Question
Hypothesis: hθ(x) = θ0 + θ1x

Parameters: θ0, θ1

Cost function: J(θ0, θ1) = (1/2m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

Goal: minimize J(θ0, θ1) over θ0, θ1
(For fixed θ0, θ1, hθ(x) is a function of x; J(θ0, θ1) is a function of the parameters.)

[Left: housing data, Price ($) in 1000's vs. Size in feet² (x), with a fitted line; right: surface and contour plots of J(θ0, θ1)]
Linear regression
with one variable

Gradient descent
Have some function J(θ0, θ1).

Want: min over θ0, θ1 of J(θ0, θ1).

Outline:
• Start with some θ0, θ1
• Keep changing θ0, θ1 to reduce J(θ0, θ1) until we hopefully end up at a minimum
[Two surface plots of J(θ0, θ1) over the (θ0, θ1) plane]
Gradient descent algorithm

repeat until convergence {
    θj := θj − α · ∂/∂θj J(θ0, θ1)    (simultaneously for j = 0 and j = 1)
}

Correct (simultaneous update):
    temp0 := θ0 − α · ∂/∂θ0 J(θ0, θ1)
    temp1 := θ1 − α · ∂/∂θ1 J(θ0, θ1)
    θ0 := temp0
    θ1 := temp1

Incorrect:
    temp0 := θ0 − α · ∂/∂θ0 J(θ0, θ1)
    θ0 := temp0
    temp1 := θ1 − α · ∂/∂θ1 J(θ0, θ1)    (uses the already-updated θ0)
    θ1 := temp1
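A sketch of the correct simultaneous update in Python (the gradient callables grad0 and grad1 are assumed placeholders, not from the slides):

def simultaneous_update(theta0, theta1, alpha, grad0, grad1):
    # Both partial derivatives are evaluated at the old (theta0, theta1)
    # before either parameter changes.
    temp0 = theta0 - alpha * grad0(theta0, theta1)
    temp1 = theta1 - alpha * grad1(theta0, theta1)
    return temp0, temp1
# The incorrect version assigns theta0 = temp0 first, so grad1 would then
# be evaluated at the new theta0 instead of the old one.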
Question
Linear regression
with one variable

Gradient descent
intuition
Gradient descent algorithm

If α is too small, gradient descent can be slow.

If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
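A toy illustration of both regimes (my construction, reusing the one-parameter cost and the points (1, 1), (2, 2), (3, 3) inferred earlier):

def step(theta1, alpha):
    # One gradient step on J(theta1); dJ/dtheta1 = (1/m) * sum((theta1*x_i - y_i) * x_i)
    x, y = [1, 2, 3], [1, 2, 3]
    grad = sum((theta1 * xi - yi) * xi for xi, yi in zip(x, y)) / len(x)
    return theta1 - alpha * grad

theta = 0.0
for _ in range(10):
    theta = step(theta, 0.01)   # alpha too small: theta creeps toward the optimum 1
print(theta)                    # still far from 1 after 10 steps

theta = 0.0
for _ in range(10):
    theta = step(theta, 0.5)    # alpha too large: every step overshoots
print(theta)                    # |theta - 1| grows: gradient descent diverges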
Question
Gradient descent at local optima

At a local optimum the derivative is zero, so the update θ1 := θ1 − α · 0 leaves the current value of θ1 unchanged.

Gradient descent can converge to a local minimum, even with the learning rate α fixed.

As we approach a local minimum, the derivative shrinks, so gradient descent automatically takes smaller steps. So there is no need to decrease α over time.
Linear regression
with one variable

Gradient descent for
linear regression
Gradient descent algorithm + linear regression model

∂/∂θj J(θ0, θ1) = ∂/∂θj (1/2m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
               = ∂/∂θj (1/2m) Σᵢ₌₁ᵐ (θ0 + θ1x⁽ⁱ⁾ − y⁽ⁱ⁾)²

j = 0:  ∂/∂θ0 J(θ0, θ1) = (1/m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
j = 1:  ∂/∂θ1 J(θ0, θ1) = (1/m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x⁽ⁱ⁾
Gradient descent algorithm

repeat until convergence {
    θ0 := θ0 − α · (1/m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
    θ1 := θ1 − α · (1/m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x⁽ⁱ⁾
}

Update θ0 and θ1 simultaneously.
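Putting the pieces together, a runnable Python/NumPy sketch of batch gradient descent for univariate linear regression (the data and hyperparameters are illustrative assumptions, not from the slides):

import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    m = len(x)
    theta0, theta1 = 0.0, 0.0                # start from (0, 0)
    for _ in range(iterations):
        err = (theta0 + theta1 * x) - y      # h_theta(x^(i)) - y^(i)
        grad0 = err.sum() / m                # dJ/dtheta0
        grad1 = (err * x).sum() / m          # dJ/dtheta1
        # Simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(gradient_descent(x, y))                # ≈ (0.0, 1.0): the line y = x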
[Two surface plots of J(θ0, θ1) showing the path taken by gradient descent]
(For fixed θ0, θ1, hθ(x) is a function of x; J(θ0, θ1) is a function of the parameters.)

[Sequence of plots: the fit hθ(x) through the housing data and the corresponding point on the contour of J(θ0, θ1) at successive gradient descent steps]
“Batch” Gradient Descent

“Batch”: Each step of gradient descent uses all the training examples.
Question
The End

Thank you
