
Linear Regression

Jia-Bin Huang
ECE-5424G / CS-5824 Virginia Tech Spring 2019
BRACE YOURSELVES

WINTER IS COMING
BRACE YOURSELVES

HOMEWORK IS COMING
Administrative
• Office hour
• Chen Gao
• Shih-Yang Su

• Feedback (Thanks!)
• Notation?

• More descriptive slides?

• Video/audio recording?

• TA hours (uniformly spread over the week)?


Recap: Machine learning algorithms
• Discrete output: Classification (supervised), Clustering (unsupervised)
• Continuous output: Regression (supervised), Dimensionality reduction (unsupervised)
Recap: Nearest neighbor classifier
• Training data: {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))}

• Learning
Do nothing (just store the training data).

• Testing
h(x) = y^(k), where k = argmin_i dist(x^(i), x)
Recap: Instance/Memory-based Learning
1. A distance metric
• Continuous? Discrete? PDF? Gene data? Learn the metric?
2. How many nearby neighbors to look at?
• 1? 3? 5? 15?
3. A weighting function (optional)
• Closer neighbors matter more
4. How to fit with the local points?
• Kernel regression

Slide credit: Carlos Guestrin
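As a concrete illustration of these four design choices, here is a minimal sketch of kernel-weighted k-NN regression in NumPy (the function name, the default k, and the Gaussian bandwidth are illustrative assumptions, not from the slides):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5, bandwidth=1.0):
    """Kernel-weighted k-NN regression for a single query point.

    The four design choices from the slide:
      1. distance metric: Euclidean distance
      2. number of neighbors: k
      3. weighting function: Gaussian kernel on the distance
      4. local fit: weighted average of the neighbors' targets
    """
    dists = np.linalg.norm(X_train - x_query, axis=1)                  # 1. distance metric
    nearest = np.argsort(dists)[:k]                                    # 2. k nearest neighbors
    weights = np.exp(-dists[nearest] ** 2 / (2 * bandwidth ** 2))      # 3. closer neighbors matter more
    return np.sum(weights * y_train[nearest]) / np.sum(weights)        # 4. kernel regression
```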


Validation set
• Splitting the training set: a fake test set used to tune hyper-parameters

Slide credit: CS231 @ Stanford


Cross-validation
• 5-fold cross-validation: split the training data into 5 equal folds
• Use 4 of them for training and 1 for validation, rotating which fold is held out

Slide credit: CS231 @ Stanford
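A minimal sketch of setting up such a 5-fold split in NumPy (the function name, the shuffling step, and the example size of 100 are illustrative assumptions):

```python
import numpy as np

def five_fold_indices(n_examples, seed=0):
    """Shuffle example indices and split them into 5 roughly equal folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_examples)
    return np.array_split(indices, 5)

# Each fold takes one turn as the validation set; the other 4 serve as training data.
folds = five_fold_indices(100)
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # ... train on train_idx, evaluate the hyper-parameter choice on val_idx ...
```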


Things to remember
• Supervised Learning
• Training/testing data; classification/regression; Hypothesis
• k-NN
• Simplest learning algorithm
• With sufficient data, a "strawman" approach that is very hard to beat
• Kernel regression/classification
• Set k to n (the number of data points) and choose a kernel width
• Smoother than k-NN
• Problems with k-NN
• Curse of dimensionality
• Not robust to irrelevant features
• Slow NN search: must remember (very large) dataset for prediction
Today’s plan: Linear Regression
• Model representation

• Cost function

• Gradient descent

• Features and polynomial regression

• Normal equation
Linear Regression
• Model representation

• Cost function

• Gradient descent

• Features and polynomial regression

• Normal equation
Regression
Training set (with real-valued outputs) → Learning Algorithm → hypothesis h

x (size of house) → h → y (estimated price)
House pricing prediction
[Scatter plot: Price ($) in 1000's vs. Size in feet²]
Training set (m = 47 examples):
  Size in feet² (x) | Price ($) in 1000's (y)
  2104              | 460
  1416              | 232
  1534              | 315
  852               | 178
  …                 | …

• Notation:
• m = number of training examples
• x = input variable / features
• y = output variable / target variable
• (x, y) = one training example
• (x^(i), y^(i)) = the i-th training example
Examples: x^(1) = 2104, y^(1) = 460
Slide credit: Andrew Ng
Model representation

Training set → Learning Algorithm → hypothesis h
x (size of house) → h → y (estimated price)

Hypothesis: h_θ(x) = θ₀ + θ₁x   (shorthand: h(x))
[Plot: Price ($) in 1000's vs. Size in feet², with a fitted line through the data]

Univariate linear regression (linear regression with one input variable)

Slide credit: Andrew Ng
Linear Regression
• Model representation

• Cost function

• Gradient descent

• Features and polynomial regression

• Normal equation
Training set (m = 47 examples):
  Size in feet² (x) | Price ($) in 1000's (y)
  2104              | 460
  1416              | 232
  1534              | 315
  852               | 178
  …                 | …

• Hypothesis: h_θ(x) = θ₀ + θ₁x

• θ₀, θ₁: parameters/weights

• How to choose the θ's?

Slide credit: Andrew Ng
h_θ(x) = θ₀ + θ₁x

[Three example plots of h_θ(x) over the same axes, showing how different choices of θ₀ and θ₁ give different intercepts and slopes]
Slide credit: Andrew Ng


Cost function
• Idea: choose θ₀, θ₁ so that h_θ(x^(i)) is close to y^(i) for our training examples (x^(i), y^(i))

h_θ(x) = θ₀ + θ₁x

J(θ₀, θ₁) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

[Plot: Price ($) in 1000's vs. Size in feet², with a candidate hypothesis line]

Goal: minimize J(θ₀, θ₁) over θ₀, θ₁
Slide credit: Andrew Ng
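A minimal NumPy sketch of this cost function; the helper name `cost` is illustrative, and the example data points are the four rows from the table above:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = (1/2m) * sum_i (h_theta(x_i) - y_i)^2."""
    m = len(y)
    predictions = theta0 + theta1 * x              # h_theta(x) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Housing data from the slide (sizes in feet^2, prices in $1000's):
x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
print(cost(0.0, 0.2, x, y))   # cost of the hypothesis h(x) = 0.2 * x
```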
Simplified

Original:
• Hypothesis: h_θ(x) = θ₀ + θ₁x
• Parameters: θ₀, θ₁
• Cost function: J(θ₀, θ₁) = (1/2m) Σᵢ (h_θ(x^(i)) − y^(i))²
• Goal: minimize J(θ₀, θ₁) over θ₀, θ₁

Simplified (set θ₀ = 0):
• Hypothesis: h_θ(x) = θ₁x
• Parameter: θ₁
• Cost function: J(θ₁) = (1/2m) Σᵢ (h_θ(x^(i)) − y^(i))²
• Goal: minimize J(θ₁) over θ₁
Slide credit: Andrew Ng
h_θ(x) as a function of x (for a fixed θ₁), and J(θ₁) as a function of θ₁

[Sequence of paired plots: on the left, the hypothesis line h_θ(x) = θ₁x over the data for a particular θ₁; on the right, the corresponding point on the cost curve J(θ₁). As θ₁ varies, the cost traces out a bowl-shaped curve whose minimum is at the best-fit slope.]

Slide credit: Andrew Ng


• Hypothesis: h_θ(x) = θ₀ + θ₁x

• Parameters: θ₀, θ₁

• Cost function: J(θ₀, θ₁) = (1/2m) Σᵢ (h_θ(x^(i)) − y^(i))²

• Goal: minimize J(θ₀, θ₁) over θ₀, θ₁
Slide credit: Andrew Ng
Cost function

Slide credit: Andrew Ng


How do we find good θ₀, θ₁ that minimize J(θ₀, θ₁)?
Slide credit: Andrew Ng
Linear Regression
• Model representation

• Cost function

• Gradient descent

• Features and polynomial regression

• Normal equation
Gradient descent
Have some function J(θ₀, θ₁)
Want: min over θ₀, θ₁ of J(θ₀, θ₁)

Outline:
• Start with some θ₀, θ₁ (e.g., θ₀ = 0, θ₁ = 0)
• Keep changing θ₀, θ₁ to reduce J(θ₀, θ₁),
until we hopefully end up at a minimum
Slide credit: Andrew Ng
Gradient descent
Repeat until convergence {
  θ_j := θ_j − α (∂/∂θ_j) J(θ₀, θ₁)   (simultaneously for j = 0 and j = 1)
}

α: learning rate (step size)

(∂/∂θ_j) J(θ₀, θ₁): partial derivative (rate of change)

Slide credit: Andrew Ng


Gradient descent
Correct: simultaneous update
  temp0 := θ₀ − α (∂/∂θ₀) J(θ₀, θ₁)
  temp1 := θ₁ − α (∂/∂θ₁) J(θ₀, θ₁)
  θ₀ := temp0
  θ₁ := temp1

Incorrect:
  temp0 := θ₀ − α (∂/∂θ₀) J(θ₀, θ₁)
  θ₀ := temp0
  temp1 := θ₁ − α (∂/∂θ₁) J(θ₀, θ₁)   (this now uses the already-updated θ₀)
  θ₁ := temp1

Slide credit: Andrew Ng


θ₁ := θ₁ − α (∂/∂θ₁) J(θ₁)

[Plot of J(θ₁) vs. θ₁: to the right of the minimum the slope (∂/∂θ₁) J(θ₁) > 0, so the update decreases θ₁; to the left of the minimum the slope (∂/∂θ₁) J(θ₁) < 0, so the update increases θ₁. Either way θ₁ moves toward the minimum.]
Slide credit: Andrew Ng
Learning rate
Gradient descent for linear regression
Repeat until convergence {
  θ_j := θ_j − α (∂/∂θ_j) J(θ₀, θ₁)   (simultaneously for j = 0 and j = 1)
}

• Linear regression model
  h_θ(x) = θ₀ + θ₁x
  J(θ₀, θ₁) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

Slide credit: Andrew Ng


Computing partial derivative
• (∂/∂θ_j) J(θ₀, θ₁) = (∂/∂θ_j) (1/2m) Σ_{i=1}^{m} (θ₀ + θ₁x^(i) − y^(i))²

• j = 0: (∂/∂θ₀) J(θ₀, θ₁) = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
• j = 1: (∂/∂θ₁) J(θ₀, θ₁) = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)

Slide credit: Andrew Ng


Gradient descent for linear regression
Repeat until convergence {
  θ₀ := θ₀ − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
  θ₁ := θ₁ − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
}
Update θ₀ and θ₁ simultaneously

Slide credit: Andrew Ng


Batch gradient descent
• "Batch": each step of gradient descent uses all m training examples
Repeat until convergence {
  θ₀ := θ₀ − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
  θ₁ := θ₁ − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
}
m: number of training examples

Slide credit: Andrew Ng
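A minimal NumPy sketch of this batch update rule; the function name, learning rate, and iteration count are illustrative choices, not values from the slides:

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        error = (theta0 + theta1 * x) - y          # uses all m training examples ("batch")
        grad0 = np.sum(error) / m                  # dJ/dtheta0
        grad1 = np.sum(error * x) / m              # dJ/dtheta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1   # simultaneous update
    return theta0, theta1
```

Note that with raw house sizes in the thousands, a learning rate of 0.01 would actually diverge on the housing data; in practice you would scale the features first (see the feature-scaling slide below) or use a much smaller α.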


Linear Regression
• Model representation

• Cost function

• Gradient descent

• Features and polynomial regression

• Normal equation
Training dataset
  Size in feet² (x) | Price ($) in 1000's (y)
  2104              | 460
  1416              | 232
  1534              | 315
  852               | 178
  …                 | …

h_θ(x) = θ₀ + θ₁x

Slide credit: Andrew Ng


Multiple features (input variables)
  Size in feet² (x₁) | Bedrooms (x₂) | Floors (x₃) | Age in years (x₄) | Price ($) in 1000's (y)
  2104               | 5             | 1           | 45                | 460
  1416               | 3             | 2           | 40                | 232
  1534               | 3             | 2           | 30                | 315
  852                | 2             | 1           | 36                | 178
  …                  |               |             |                   | …

Notation:
• n = number of features
• x^(i) = input features of the i-th training example
• x_j^(i) = value of feature j in the i-th training example
Slide credit: Andrew Ng
Hypothesis
Previously: h_θ(x) = θ₀ + θ₁x

Now: h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

Slide credit: Andrew Ng


h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ
• For convenience of notation, define x₀ = 1 (for all examples)
• Then x = [x₀, x₁, …, xₙ]ᵀ, θ = [θ₀, θ₁, …, θₙ]ᵀ, and h_θ(x) = θᵀx

Slide credit: Andrew Ng


Gradient descent
• Previously (n = 1):
Repeat until convergence {
  θ₀ := θ₀ − α (1/m) Σᵢ (h_θ(x^(i)) − y^(i))
  θ₁ := θ₁ − α (1/m) Σᵢ (h_θ(x^(i)) − y^(i)) · x^(i)
}

• New algorithm (n ≥ 1):
Repeat until convergence {
  θ_j := θ_j − α (1/m) Σᵢ (h_θ(x^(i)) − y^(i)) · x_j^(i)   (for j = 0, 1, …, n)
}
Simultaneously update θ_j for every j


Slide credit: Andrew Ng
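For the multivariate case, the same simultaneous update is often written in vectorized form. A minimal sketch assuming a design matrix X whose first column holds the x₀ = 1 entries (the helper name and defaults are illustrative):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Simultaneous update theta_j := theta_j - alpha*(1/m)*sum_i (h(x_i) - y_i)*x_ij for all j,
    with X of shape (m, n+1) including the x0 = 1 column."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(n_iters):
        error = X @ theta - y                          # h_theta(x_i) - y_i for every example
        theta = theta - (alpha / m) * (X.T @ error)    # updates every theta_j at once
    return theta

# Adding the x0 = 1 column to raw features of shape (m, n):
# X = np.hstack([np.ones((features.shape[0], 1)), features])
```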


Gradient descent in practice: Feature scaling
• Idea: make sure features are on a similar scale (e.g., roughly −1 ≤ x_j ≤ 1)
• E.g., x₁ = size (0–2000 feet²), x₂ = number of bedrooms (1–5)

[Contour plots of the cost over (θ₁, θ₂): without scaling the contours are elongated and gradient descent zigzags; with scaling they are closer to circular and gradient descent converges faster.]

Slide credit: Andrew Ng
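One common way to put features on a similar scale is mean normalization; a minimal sketch (the exact recipe of subtracting the mean and dividing by the range is an assumption, since the slide does not pin one down):

```python
import numpy as np

def mean_normalize(features):
    """Scale each column to roughly [-1, 1]: subtract its mean, divide by its range."""
    mu = features.mean(axis=0)                              # per-feature mean
    rng = features.max(axis=0) - features.min(axis=0)       # per-feature range (std also works)
    return (features - mu) / rng, mu, rng

# Size in [0, 2000] and bedrooms in [1, 5] end up on comparable scales:
raw = np.array([[2104, 5], [1416, 3], [1534, 3], [852, 2]], dtype=float)
scaled, mu, rng = mean_normalize(raw)
```

The same mu and rng learned from the training data must be applied to any new example before predicting.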
Gradient descent in practice: Learning rate
• Automatic convergence test: declare convergence when J(θ) decreases by less than a small threshold in one iteration
• α too small: slow convergence
• α too large: J(θ) may not decrease on every iteration; may not converge

• To choose α, try

  0.001, …, 0.01, …, 0.1, …, 1

Image credit: CS231n@Stanford
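A sketch of trying learning rates a few orders of magnitude apart and watching J(θ) per iteration, as suggested above; the toy data, the convergence threshold, and the helper name are illustrative assumptions:

```python
import numpy as np

def cost_history(X, y, alpha, n_iters=200):
    """Run vectorized gradient descent and record J(theta) after every iteration."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(n_iters):
        error = X @ theta - y
        theta -= (alpha / m) * (X.T @ error)
        history.append(np.sum((X @ theta - y) ** 2) / (2 * m))
    return history

# Toy data with the x0 = 1 column already included:
X = np.array([[1, 0.5], [1, 1.0], [1, 1.5], [1, 2.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
for alpha in (0.001, 0.01, 0.1, 1.0):
    J = cost_history(X, y, alpha)
    converged = abs(J[-2] - J[-1]) < 1e-6        # simple automatic convergence check
    print(alpha, J[-1], converged)               # small alpha: slow; too-large alpha: J blows up
```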


House prices prediction

• You are not limited to the features as given; you can define new ones
• E.g., combine the raw lot measurements into a single feature, area = frontage × depth, and fit h_θ(x) = θ₀ + θ₁ · area

Slide credit: Andrew Ng


Polynomial regression

[Plot: Price ($) in 1000's vs. Size in feet², with a polynomial curve fit to the data]

h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃, with
  x₁ = (size)
  x₂ = (size)²
  x₃ = (size)³

Slide credit: Andrew Ng
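A minimal sketch of generating these polynomial features and scaling them before fitting; the cubic choice matches the slide, while the scaling step is an added practical assumption:

```python
import numpy as np

size = np.array([2104, 1416, 1534, 852], dtype=float)

# x1 = size, x2 = size^2, x3 = size^3 -- the model stays linear in the parameters theta.
poly = np.column_stack([size, size ** 2, size ** 3])

# Polynomial features have wildly different ranges, so feature scaling matters even more here.
poly = (poly - poly.mean(axis=0)) / (poly.max(axis=0) - poly.min(axis=0))

# Prepend the x0 = 1 column to form the design matrix for gradient descent or the normal equation.
X = np.hstack([np.ones((poly.shape[0], 1)), poly])
```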


Linear Regression
• Model representation

• Cost function

• Gradient descent

• Features and polynomial regression

• Normal equation
  x₀ | Size in feet² (x₁) | Bedrooms (x₂) | Floors (x₃) | Age in years (x₄) | Price ($) in 1000's (y)
  1  | 2104               | 5             | 1           | 45                | 460
  1  | 1416               | 3             | 2           | 40                | 232
  1  | 1534               | 3             | 2           | 30                | 315
  1  | 852                | 2             | 1           | 36                | 178
  …  |                    |               |             |                   | …

X = matrix whose rows are the training examples (with x₀ = 1),  y = [460, 232, 315, 178]ᵀ

θ = (XᵀX)⁻¹ Xᵀ y

Slide credit: Andrew Ng
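A minimal NumPy sketch of the normal equation on this table; using a least-squares solver instead of an explicit matrix inverse is a standard numerical practice, not something the slide specifies:

```python
import numpy as np

# Design matrix X: x0 = 1, size, bedrooms, floors, age -- one row per training example.
X = np.array([
    [1, 2104, 5, 1, 45],
    [1, 1416, 3, 2, 40],
    [1, 1534, 3, 2, 30],
    [1,  852, 2, 1, 36],
], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

# theta = (X^T X)^{-1} X^T y, computed via a least-squares solver.
# With only these 4 rows there are more parameters than examples, so X^T X is singular and
# lstsq returns the minimum-norm solution; with the full m = 47 dataset the explicit formula applies.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)          # fitted parameters
print(X @ theta)      # predictions on the training examples
```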
Least square solution

Justification/interpretation 1
• Loss minimization

• Least squares loss

• Empirical Risk Minimization (ERM)


Justification/interpretation 2
• Probabilistic model
• Assume linear model with Gaussian errors

• Solving maximum likelihood

Image credit: CS 446@UIUC


Justification/interpretation 3
• Geometric interpretation

[Diagram: the target vector y, its projection Xθ onto the column space of X, and the residual Xθ − y]

• Xθ lies in the column space of X, i.e., the span of the columns of X

• The residual Xθ − y is orthogonal to the column space of X
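A short derivation tying these interpretations to the formula θ = (XᵀX)⁻¹Xᵀy (standard least-squares algebra, not text from the slides):

```latex
% Least-squares objective in matrix form (matching J above):
J(\theta) = \tfrac{1}{2m}\,\lVert X\theta - y\rVert^{2}
          = \tfrac{1}{2m}\,(X\theta - y)^{\top}(X\theta - y)

% Set the gradient to zero:
\nabla_{\theta} J(\theta) = \tfrac{1}{m}\,X^{\top}(X\theta - y) = 0
\;\Longrightarrow\; X^{\top}X\,\theta = X^{\top}y
\;\Longrightarrow\; \theta = (X^{\top}X)^{-1}X^{\top}y \quad (\text{when } X^{\top}X \text{ is invertible})
```

The middle condition, Xᵀ(Xθ − y) = 0, is exactly the geometric statement above: the residual is perpendicular to every column of X.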


m training examples, n features

Gradient Descent:
• Need to choose α
• Needs many iterations
• Works well even when n is large

Normal Equation:
• No need to choose α
• No need to iterate
• Need to compute (XᵀX)⁻¹, roughly O(n³)
• Slow if n is very large

Slide credit: Andrew Ng


Things to remember
• Model representation
  h_θ(x) = θ₀ + θ₁x₁ + … + θₙxₙ = θᵀx

• Cost function
  J(θ) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
• Gradient descent for linear regression
  Repeat until convergence { θ_j := θ_j − α (1/m) Σᵢ (h_θ(x^(i)) − y^(i)) · x_j^(i) }
• Features and polynomial regression
  Can combine features; can use different functions to generate features (e.g., polynomial)
• Normal equation
  θ = (XᵀX)⁻¹ Xᵀ y
Next
• Naïve Bayes, Logistic regression, Regularization
