Lec9 - Linear Models
Linear Models
Outline
II - Linear Models
1. Linear Regression
(a) Linear regression: History
(b) Linear regression with Least Squares
(c) Matrix representation and Normal Equation Method
(d) Iterative Method: Gradient descent
(e) Pros and Cons of both methods
2. Linear Classification: Perceptron
(a) Definition and history
(b) Example
(c) Algorithm
Supervised Learning
Training data: "examples" x with "labels" y.
• Regression: learn f : \mathbb{R}^d \to \mathbb{R} such that f(x) = y; f is called a regressor.
• Rooted in Statistics.
Linear Regression
Given: training data (x_1, y_1), \ldots, (x_n, y_n), with x_i \in \mathbb{R}^d and y_i \in \mathbb{R}.
Find f : \mathbb{R}^d \to \mathbb{R} such that f(x) = y.
#"
25
20
Sales
15
X2
10
5
TV
d = 1: a line in \mathbb{R}^2;  d = 2: a hyperplane in \mathbb{R}^3.
f(x) = \beta_0 + \beta_1 x
Linear Regression
A simple case with one feature (d = 1):
f(x) = \beta_0 + \beta_1 x
We want to minimize:
R = \frac{1}{2n} \sum_{i=1}^{n} (y_i - f(x_i))^2
Substituting f(x_i) = \beta_0 + \beta_1 x_i:
R(\beta) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
Find the \beta_0 and \beta_1 that minimize R(\beta).
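As a concrete illustration (not from the lecture), here is a minimal Python sketch of this cost function; the function name and toy data are my own:

```python
import numpy as np

def cost(beta0, beta1, x, y):
    """R(beta) = (1/2n) * sum_i (y_i - beta0 - beta1 * x_i)^2."""
    n = len(x)
    residuals = y - (beta0 + beta1 * x)
    return np.sum(residuals ** 2) / (2 * n)

# Hypothetical toy data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(cost(0.0, 2.0, x, y))  # cost of the candidate line f(x) = 0 + 2x
```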
Linear Regression
[Figure: contour plot and 3-D surface of the residual sum of squares (RSS) as a function of \beta_0 and \beta_1.]
Find \beta_0 and \beta_1 so that:
\arg\min_{\beta_0,\beta_1} \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
Differentiate with respect to \beta_0:
\frac{\partial R}{\partial \beta_0} = 2 \times \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \times \frac{\partial}{\partial \beta_0}(y_i - \beta_0 - \beta_1 x_i)
\frac{\partial R}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \times (-1) = 0
Solving for \beta_0:
\beta_0 = \frac{1}{n} \sum_{i=1}^{n} y_i - \beta_1 \frac{1}{n} \sum_{i=1}^{n} x_i
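As a quick sanity check on this step (my own sketch, not part of the lecture), a sympy snippet with hypothetical toy data that sets \partial R / \partial \beta_0 = 0 and recovers \beta_0 = \bar{y} - \beta_1 \bar{x}:

```python
import sympy as sp

b0, b1 = sp.symbols('beta0 beta1')

# Hypothetical toy data
xs = [1, 2, 3]
ys = [2, 4, 7]
n = len(xs)

# R(beta) = (1/2n) * sum_i (y_i - beta0 - beta1*x_i)^2
R = sp.Rational(1, 2 * n) * sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

# Solving dR/dbeta0 = 0 for beta0 gives mean(y) - beta1 * mean(x)
sol = sp.solve(sp.diff(R, b0), b0)[0]
print(sp.simplify(sol))  # 13/3 - 2*beta1, i.e. mean(ys) - beta1*mean(xs)
```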
Linear Regression
Similarly, differentiate with respect to \beta_1:
\frac{\partial R}{\partial \beta_1} = 2 \times \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \times \frac{\partial}{\partial \beta_1}(y_i - \beta_0 - \beta_1 x_i)
\frac{\partial R}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \times (-x_i) = 0
\beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i - \beta_0 \sum_{i=1}^{n} x_i
Plugging the expression for \beta_0 into the equation for \beta_1:
\beta_1 = \frac{\sum_{i=1}^{n} y_i x_i - \frac{1}{n} \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i}
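Putting the two closed-form expressions together, a minimal Python implementation of simple (d = 1) least squares; the function name and toy data are hypothetical:

```python
import numpy as np

def fit_simple_least_squares(x, y):
    """Closed-form beta0, beta1 for f(x) = beta0 + beta1 * x."""
    n = len(x)
    beta1 = (np.sum(x * y) - np.sum(y) * np.sum(x) / n) / \
            (np.sum(x ** 2) - np.sum(x) * np.sum(x) / n)
    beta0 = np.mean(y) - beta1 * np.mean(x)
    return beta0, beta1

# Hypothetical toy data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(fit_simple_least_squares(x, y))  # approximately (0.0, 2.03) for this toy data
```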
Linear Regression
With more than one feature:
f(x) = \beta_0 + \sum_{j=1}^{d} \beta_j x_j
Find the \beta_j that minimize:
R = \frac{1}{2n} \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{d} \beta_j x_{ij}\right)^2
In matrix form, setting the gradient of R to zero gives:
X^T (y - X\beta) = 0
When X^T X is invertible, the unique solution is: \beta = (X^T X)^{-1} X^T y
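A minimal numpy sketch of the normal-equation solution (the data is hypothetical; the intercept is handled by prepending a column of ones, and np.linalg.solve is used instead of forming the inverse explicitly):

```python
import numpy as np

# Hypothetical data: n = 5 examples, d = 2 features
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0],
                  [5.0, 2.5]])
y = np.array([3.1, 3.9, 6.0, 9.2, 10.1])

# Prepend a column of ones so beta[0] plays the role of beta_0
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Normal equation: solve (X^T X) beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)
```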
Gradient descent
Gradient Descent is an iterative optimization method. Repeat the update
\beta_1 := \beta_1 - \alpha \frac{\partial}{\partial \beta_1} R(\beta_0, \beta_1)
(and likewise for \beta_0) until convergence.
\alpha is the learning rate.
Gradient descent
In the linear case:
\frac{\partial R}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \times (-1)
\frac{\partial R}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \times (-x_i)
Let's generalize it! The updates become:
\beta_0 := \beta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)
\beta_1 := \beta_1 - \alpha \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i) \, x_i
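A minimal Python sketch of these two updates for the d = 1 case; the learning rate, iteration count, and toy data are hypothetical choices, not values from the lecture:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, num_iters=2000):
    """Repeatedly apply the beta0 and beta1 updates above."""
    n = len(x)
    beta0, beta1 = 0.0, 0.0
    for _ in range(num_iters):
        error = beta0 + beta1 * x - y      # (beta0 + beta1*x_i - y_i)
        grad0 = np.sum(error) / n          # dR/dbeta0
        grad1 = np.sum(error * x) / n      # dR/dbeta1
        beta0 -= alpha * grad0             # simultaneous update of both parameters
        beta1 -= alpha * grad1
    return beta0, beta1

# Hypothetical toy data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(gradient_descent(x, y))  # should approach the closed-form solution
```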
Pros and Cons
Analytical approach: Normal Equation
5. What if X^T X is not invertible?
Practical considerations
1. Scaling: bring your features to a similar scale (see the sketch after this list):
x_i := \frac{x_i - \mu_i}{\mathrm{stdev}(x_i)}
2. Learning rate: don't use a rate that is too small (convergence becomes very slow) or too large (the updates can overshoot the minimum and diverge).
5. What if X^T X is not invertible? Two common causes:
(a) Too many features compared to the number of examples (e.g., 50 examples and 500 features).
(b) Linearly dependent features: e.g., weight in pounds and weight in kilograms.
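To make the scaling step and the invertibility issue concrete, a short numpy sketch; the feature values and the pounds-to-kilograms example data are my own illustration:

```python
import numpy as np

# Hypothetical features on very different scales
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

# 1. Scaling: x_i := (x_i - mu_i) / stdev(x_i), one mean/stdev per feature
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # roughly 0 and 1 per column

# 5. Linearly dependent features: weight in kilograms next to weight in pounds
pounds = np.array([[150.0], [135.0], [180.0], [120.0]])
X_dep = np.hstack([pounds, pounds * 0.4536])        # second column = 0.4536 * first
print(np.linalg.matrix_rank(X_dep.T @ X_dep))       # 1 < 2, so X^T X is not invertible
```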
Credit
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition, 2009. T. Hastie, R. Tibshirani, J. Friedman.
• Machine Learning. 1997. Tom Mitchell.