Lec9 - Linear Models

The document discusses linear models for machine learning, including linear regression and linear classification. It covers the history of linear regression, how it works using least squares to find the coefficients that minimize error, and pros and cons of different methods like gradient descent. Linear classification with perceptrons is also introduced.

Uploaded by

Khawir Mahmood
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
39 views

Lec9 - Linear Models

The document discusses linear models for machine learning, including linear regression and linear classification. It covers the history of linear regression, how it works using least squares to find the coefficients that minimize error, and pros and cons of different methods like gradient descent. Linear classification with perceptrons is also introduced.

Uploaded by

Khawir Mahmood
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 44

Machine Learning

Linear Models
Outline
II - Linear Models
1. Linear Regression
(a) Linear regression: History
(b) Linear regression with Least Squares
(c) Matrix representation and Normal Equation Method
(d) Iterative Method: Gradient descent
(e) Pros and Cons of both methods
2. Linear Classification: Perceptron
(a) Definition and history
(b) Example
(c) Algorithm
Supervised Learning
Training data: "examples" x with "labels" y.

(x_1, y_1), \ldots, (x_n, y_n), \quad x_i \in \mathbb{R}^d

• Regression: y is a real value, y \in \mathbb{R}.
  f : \mathbb{R}^d \to \mathbb{R}. f is called a regressor.

• Classification: y is discrete. To simplify, y \in \{-1, +1\}.
  f : \mathbb{R}^d \to \{-1, +1\}. f is called a binary classifier.


Linear Regression: History
• A very popular technique.

• Rooted in Statistics.

• Method of Least Squares used as early as 1795 by Gauss.

• Re-invented in 1805 by Legendre.

• Frequently applied in astronomy to study the large-scale structure of the universe.

[Portrait: Carl Friedrich Gauss]

• Still a very useful tool today.
Linear Regression
Given: Training data (x_1, y_1), \ldots, (x_n, y_n), with x_i \in \mathbb{R}^d and y_i \in \mathbb{R}:

example x_1 →   x_{11}  x_{12}  ...  x_{1d}   |  y_1  label
...              ...     ...    ...   ...     |  ...
example x_i →   x_{i1}  x_{i2}  ...  x_{id}   |  y_i  label
...              ...     ...    ...   ...     |  ...
example x_n →   x_{n1}  x_{n2}  ...  x_{nd}   |  y_n  label

Task: Learn a regression function:

f : \mathbb{R}^d \to \mathbb{R}
f(x) = y

Linear Regression: A regression model is said to be linear if it is represented by a linear function.
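
In code, such a training set is commonly stored as a feature matrix and a label vector. A minimal NumPy sketch (the array values are purely illustrative, not from the slides):

```python
import numpy as np

# Toy training set with n = 4 examples and d = 2 features (values purely illustrative).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])          # shape (n, d): row i holds the features of example x_i
y = np.array([3.1, 2.4, 4.6, 6.9])  # shape (n,): entry i is the label y_i of example x_i

print(X.shape, y.shape)             # (4, 2) (4,)
```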
Linear Regression

[Figure: left, Sales regressed on TV advertising (d = 1, a line in \mathbb{R}^2); right, a regression plane over two features (d = 2, a hyperplane in \mathbb{R}^3). Credit: Introduction to Statistical Learning.]

Linear Regression
Linear Regression Model:

f(x) = \beta_0 + \sum_{j=1}^{d} \beta_j x_j, \quad \text{with } \beta_j \in \mathbb{R}, \; j \in \{1, \ldots, d\}

The \beta's are called parameters or coefficients or weights.

Learning the linear model → learning the \beta's.

Estimation with Least Squares:

Use the least-squares loss: loss(y_i, f(x_i)) = (y_i - f(x_i))^2

We want to minimize the loss over all examples, that is, minimize the risk or cost function R:

R = \frac{1}{2n} \sum_{i=1}^{n} (y_i - f(x_i))^2
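
This cost is straightforward to compute directly. A minimal sketch in NumPy (the function and variable names `cost`, `beta0`, `beta` are mine, not from the slides):

```python
import numpy as np

def cost(beta0, beta, X, y):
    """Least-squares cost R = 1/(2n) * sum_i (y_i - f(x_i))^2, with f(x) = beta0 + x . beta."""
    predictions = beta0 + X @ beta        # f(x_i) for every row x_i of X
    residuals = y - predictions           # y_i - f(x_i)
    return np.mean(residuals ** 2) / 2.0  # equals (1/2n) * sum of squared residuals
```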
Linear Regression
A simple case with one feature (d = 1):

f(x) = \beta_0 + \beta_1 x

We want to minimize:

R = \frac{1}{2n} \sum_{i=1}^{n} (y_i - f(x_i))^2

R(\beta) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

Find \beta_0 and \beta_1 that minimize:

R(\beta) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
Linear Regression

[Figure: contour and surface plots of the residual sum of squares (RSS) as a function of \beta_0 and \beta_1. Credit: Introduction to Statistical Learning.]

Linear Regression
Find \beta_0 and \beta_1 so that:

\operatorname{argmin}_{\beta_0, \beta_1} \; \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

Minimize R(\beta_0, \beta_1), that is: \frac{\partial R}{\partial \beta_0} = 0 and \frac{\partial R}{\partial \beta_1} = 0

\frac{\partial R}{\partial \beta_0} = 2 \times \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \times \frac{\partial}{\partial \beta_0} (y_i - \beta_0 - \beta_1 x_i)

\frac{\partial R}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \times (-1) = 0

\beta_0 = \frac{1}{n} \sum_{i=1}^{n} y_i - \beta_1 \frac{1}{n} \sum_{i=1}^{n} x_i
Linear Regression

\frac{\partial R}{\partial \beta_1} = 2 \times \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \times \frac{\partial}{\partial \beta_1} (y_i - \beta_0 - \beta_1 x_i)

\frac{\partial R}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \times (-x_i) = 0

\beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i - \beta_0 \sum_{i=1}^{n} x_i

Plugging \beta_0 into \beta_1:

\beta_1 = \frac{\sum_{i=1}^{n} y_i x_i - \frac{1}{n} \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i}
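
These two closed-form expressions translate directly into code. A minimal sketch for the d = 1 case (function name and toy data are mine, for illustration only):

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Closed-form least-squares fit for f(x) = beta0 + beta1 * x (d = 1)."""
    n = len(x)
    # beta1 = (sum(y*x) - (1/n) sum(y) sum(x)) / (sum(x^2) - (1/n) (sum(x))^2)
    beta1 = (np.sum(x * y) - np.sum(y) * np.sum(x) / n) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
    # beta0 = mean(y) - beta1 * mean(x)
    beta0 = np.mean(y) - beta1 * np.mean(x)
    return beta0, beta1

# Toy usage (values purely illustrative): data roughly following y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
print(fit_simple_linear_regression(x, y))  # approximately (1.0, 2.0)
```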
Linear Regression
With more than one feature:

f(x) = \beta_0 + \sum_{j=1}^{d} \beta_j x_j

Find the \beta_j that minimize:

R = \frac{1}{2n} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{d} \beta_j x_{ij} \Big)^2

Let's write it more elegantly with matrices!

Matrix representation
Let X be an n \times (d + 1) matrix where each row starts with a 1 followed by a feature vector.
Let y be the label vector of the training set.
Let \beta be the vector of weights (that we want to estimate!).

X := \begin{pmatrix}
1 & x_{11} & \cdots & x_{1j} & \cdots & x_{1d} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_{i1} & \cdots & x_{ij} & \cdots & x_{id} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_{n1} & \cdots & x_{nj} & \cdots & x_{nd}
\end{pmatrix}

y := \begin{pmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix}
\qquad
\beta := \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_d \end{pmatrix}
Normal Equation
We want to find the (d + 1) \beta's that minimize R. We write R:

R(\beta) = \frac{1}{2n} \| y - X\beta \|^2

R(\beta) = \frac{1}{2n} (y - X\beta)^T (y - X\beta)

\frac{\partial R}{\partial \beta} = -\frac{1}{n} X^T (y - X\beta)

We have that:

\frac{\partial^2 R}{\partial \beta^2} = \frac{1}{n} X^T X

is positive definite, which ensures that the solution is a minimum. We solve:

X^T (y - X\beta) = 0

The unique solution is: \beta = (X^T X)^{-1} X^T y
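
A minimal NumPy sketch of this solution (function and variable names are mine; it solves the linear system X^T X \beta = X^T y rather than forming the inverse, which gives the same unique \beta when X^T X is invertible):

```python
import numpy as np

def fit_normal_equation(X_raw, y):
    """Solve X^T X beta = X^T y for the design matrix X = [1, X_raw]."""
    n = X_raw.shape[0]
    X = np.hstack([np.ones((n, 1)), X_raw])  # n x (d+1): prepend the intercept column of ones
    # Solving the linear system is numerically preferable to computing (X^T X)^{-1} explicitly.
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    return beta                               # beta[0] is the intercept beta_0

```

Note that `np.linalg.solve` raises an error when X^T X is singular, which is exactly the non-invertible case discussed in the practical considerations below.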
Gradient descent
Gradient Descent is an optimization method.

Repeat until convergence:

Update simultaneously all \beta_j (for j = 0 and j = 1):

\beta_0 := \beta_0 - \alpha \frac{\partial}{\partial \beta_0} R(\beta_0, \beta_1)

\beta_1 := \beta_1 - \alpha \frac{\partial}{\partial \beta_1} R(\beta_0, \beta_1)

\alpha is the learning rate.
Gradient descent
In the linear case:

\frac{\partial R}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \times (-1)

\frac{\partial R}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \times (-x_i)

Repeat until convergence:

Update simultaneously all \beta_j (for j = 0 and j = 1):

\beta_0 := \beta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)

\beta_1 := \beta_1 - \alpha \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i) \, x_i
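
A minimal sketch of these updates for the d = 1 case (the function name, default learning rate, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def gradient_descent_d1(x, y, alpha=0.05, n_iters=2000):
    """Batch gradient descent for f(x) = beta0 + beta1 * x with the 1/(2n) squared-error cost."""
    n = len(x)
    beta0, beta1 = 0.0, 0.0
    for _ in range(n_iters):
        error = beta0 + beta1 * x - y       # f(x_i) - y_i for every example
        grad0 = np.sum(error) / n           # dR/dbeta0
        grad1 = np.sum(error * x) / n       # dR/dbeta1
        # Simultaneous update of both parameters
        beta0, beta1 = beta0 - alpha * grad0, beta1 - alpha * grad1
    return beta0, beta1
```

In practice one would also track R across iterations to monitor convergence, as the practical considerations below suggest.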
Pros and Cons
Analytical approach: Normal Equation

+ No need to specify a learning rate or to iterate.

- Works only if X^T X is invertible.
- Very slow if d is large: O(d^3) to compute (X^T X)^{-1}.

Iterative approach: Gradient Descent

+ Effective and efficient even in high dimensions.

- Iterative (sometimes needs many iterations to converge).
- Needs a choice of learning rate \alpha.
Practical considerations
1. Scaling: Bring your features to a similar scale (a minimal sketch follows this list):

   x_i := \frac{x_i - \mu_i}{\operatorname{stdev}(x_i)}

2. Learning rate: Don't use a rate that is too small or too large.

3. R should decrease after each iteration.

4. Declare convergence when R decreases by less than \epsilon between iterations.

5. What if X^T X is not invertible?
   (a) Too many features compared to the number of examples (e.g., 50 examples and 500 features).
   (b) Linearly dependent features: e.g., weight in pounds and weight in kilograms.
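
A minimal sketch of the per-feature standardization in item 1 (the function name is mine; returning the mean and standard deviation is a common convention so that new inputs can be scaled the same way):

```python
import numpy as np

def standardize(X):
    """Standardize each feature (column) of X to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)                   # per-feature mean mu_j
    sigma = X.std(axis=0)                 # per-feature standard deviation (assumed nonzero)
    return (X - mu) / sigma, mu, sigma    # keep mu, sigma to transform future data identically
```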
Credit
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition, 2009. T. Hastie, R. Tibshirani, J. Friedman.
• Machine Learning. 1997. Tom Mitchell.
