Chapter 5 Learning Deterministic Models
Supervised Learning
Supervised learning involves learning a function from a training set.
[Figure: scatter plot of Dollars (Y), ranging from 2000 to 7000, against Miles (X) for the American Express example.]
It looks like there is an approximate linear relationship between Dollars and Miles.
We assume that

y = β_0 + β_1 x + ε_x,

where ε_x is a random variable, which depends on the value x of X, with the following properties: its expected value is 0 for every x, and its variance does not depend on x.
Note that these assumptions entail that the expected value of Y given a value x of X
is given by

E(Y | X = x) = β_0 + β_1 x.

However, the actual value y of Y is not uniquely determined by the value of X because of the
random error term ε_x.
To estimate the values of β_0 and β_1, we find the values of b_0 and b_1 that minimize the Mean
Square Error (MSE), which is

MSE = (1/n) Σ_{i=1}^{n} [y_i - (b_0 + b_1 x_i)]²,

where n is the size of the sample, and x_i and y_i are the values of
X and Y for the i-th item.
In the case of the American Express example we obtain the following:

y = b_0 + b_1 x = 274.8 + 1.26 x.
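As a sketch of how b_0 and b_1 can be obtained, the following Python function computes the closed-form least-squares estimates that minimize the MSE; the function and variable names are illustrative, and the American Express data itself is not reproduced here.

def fit_simple_regression(xs, ys):
    # Closed-form least-squares estimates b0, b1 minimizing
    # (1/n) * sum((y_i - (b0 + b1 * x_i)) ** 2).
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# With the (miles, dollars) pairs of the American Express example, this
# would yield approximately b0 = 274.8 and b1 = 1.26.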
A statistical package usually provides additional summary information
when doing linear regression.
[Figure: four panels plotting y against x: (a) 9 data points; (b) connect the dots; (c) linear regression; (d) quadratic regression.]
(b), (c), and (d) show three models we could learn from the data in (a).
However, (b) might overfit the data and perform poorly on out-of-sample data.
In order to avoid overfitting, we use techniques that evaluate the fit of the model
to the underlying system.
In the test set method, we partition the data into a training set and a test set.
We then learn a model from the training set and estimate the performance
using the test set.
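A minimal sketch of the test set method in Python, assuming the data is a list of (x, y) pairs and that a hypothetical learn function returns a fitted model callable as model(x):

import random

def mse(model, data):
    # Mean square error of the model's predictions on (x, y) pairs.
    return sum((y - model(x)) ** 2 for x, y in data) / len(data)

def test_set_mse(data, learn, train_fraction=0.7, seed=0):
    # Randomly partition the data into a training set and a test set,
    # learn a model from the training set, and estimate its
    # performance by the MSE on the test set.
    items = list(data)
    random.Random(seed).shuffle(items)
    cut = int(train_fraction * len(items))
    train, test = items[:cut], items[cut:]
    return mse(learn(train), test)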
Allocating 70% of the data in the previous figure to the training set, and 30% to
the test set, we obtain the results in the first row of the following table
(MSE denotes mean square error):
In Leave-One-Out Cross Validation, we remove each data item in turn and learn a model from the remaining n-1 items.
We then compute the error for the removed item relative to the model learned.
After this process is repeated for all n data items, the MSE is computed.
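A sketch of Leave-One-Out Cross Validation under the same assumptions (a hypothetical learn function that returns a model callable as model(x)):

def leave_one_out_mse(data, learn):
    # Remove each item in turn, train on the remaining n - 1 items,
    # and accumulate the squared error of the removed item;
    # the MSE is the average over all n items.
    total = 0.0
    for i, (x, y) in enumerate(data):
        model = learn(data[:i] + data[i + 1:])
        total += (y - model(x)) ** 2
    return total / len(data)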
In k-Fold Cross Validation, we divide the data into k partitions of the same size.
For each partition j we train using the data items in the remaining k-1 partitions,
and we compute the error for each data item in partition j relative to the model learned.
After this process is repeated for all k partitions, the MSE for all data items is computed.
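A sketch of k-Fold Cross Validation under the same assumptions:

def k_fold_mse(data, learn, k):
    # Divide the data into k partitions of (roughly) equal size; for each
    # partition j, train on the other k - 1 partitions and accumulate the
    # squared errors of the items in partition j.
    folds = [data[j::k] for j in range(k)]
    total = 0.0
    for j in range(k):
        train = [item for i, fold in enumerate(folds) if i != j for item in fold]
        model = learn(train)
        total += sum((y - model(x)) ** 2 for x, y in folds[j])
    return total / len(data)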
Learning a Decision Tree
Information Theory
Huffman Code
Suppose we are going to transmit to a colleague a sequence as follows:
aaabbbccaaddcccc
Using the fixed length binary code

a: 00   b: 01   c: 10   d: 11

the 16-symbol sequence requires 32 bits. Using the variable length binary code

a: 0   b: 10   c: 110   d: 111

it requires 5(1) + 3(2) + 6(3) + 2(3) = 35 bits. So this variable length code has more total bits
than the fixed length code.
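A small Python sketch that counts the bits each code uses on this sequence; the codes are the ones given above, and the 32 and 35 bit totals follow from the symbol frequencies.

sequence = "aaabbbccaaddcccc"
fixed_code    = {"a": "00", "b": "01", "c": "10", "d": "11"}
variable_code = {"a": "0", "b": "10", "c": "110", "d": "111"}

def total_bits(code, seq):
    # Total number of bits needed to encode seq with the given code.
    return sum(len(code[symbol]) for symbol in seq)

print(total_bits(fixed_code, sequence))     # 32
print(total_bits(variable_code, sequence))  # 35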
Now suppose we transmit the outcomes of coin flips, encoding the flips in pairs.

If P(H) = 1/2, each pair HH, HT, TH, TT has probability 1/4, and with a 2-bit code for each pair

E(avg. # bits per flip) = (2(1/4) + 2(1/4) + 2(1/4) + 2(1/4)) / 2 = 1.

If P(H) = 3/4, the pairs have probabilities 9/16, 3/16, 3/16, and 1/16, and with codes of lengths 1, 2, 3, and 3 bits

E(avg. # bits per flip) = (1(9/16) + 2(3/16) + 3(3/16) + 3(1/16)) / 2 = 0.84375.
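A sketch reproducing the two expected-bits-per-flip calculations, encoding the flips in pairs with the code lengths used above; the function name is illustrative.

def expected_bits_per_flip(p_heads, pair_code_lengths):
    # Pairs HH, HT, TH, TT with their probabilities; divide the expected
    # number of code bits per pair by 2 to get bits per single flip.
    q = 1 - p_heads
    pair_probs = [p_heads * p_heads, p_heads * q, q * p_heads, q * q]
    return sum(p * bits for p, bits in zip(pair_probs, pair_code_lengths)) / 2

print(expected_bits_per_flip(0.5,  [2, 2, 2, 2]))   # 1.0
print(expected_bits_per_flip(0.75, [1, 2, 3, 3]))   # 0.84375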
The entropy H of a probability distribution over two outcomes is defined as follows:

H = -(p_1 log_2 p_1 + p_2 log_2 p_2).

If P(Heads) = 0.5,

H = -(.5 log_2 .5 + .5 log_2 .5) = 1.

If P(Heads) = 0.75,

H = -(.75 log_2 .75 + .25 log_2 .25) = 0.81128.

H is minimized when p = 1 for some outcome (taking 0 log_2 0 to be 0):

H = -(1 log_2 1 + 0 log_2 0) = 0.

In general, for a distribution over m outcomes,

H = -Σ_{i=1}^{m} p_i log_2 p_i.
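A sketch of the entropy computation in Python, with 0 log_2 0 taken to be 0:

from math import log2

def entropy(probs):
    # H = sum(-p_i * log2(p_i)), skipping zero-probability outcomes.
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0
print(entropy([0.75, 0.25]))  # 0.81128 (approximately)
print(entropy([1.0, 0.0]))    # 0.0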
[Figure: a decision tree for the walk / tennis / stay home data, with Temp? at the root and Humidity?, Wind?, and Outlook? tests at lower levels.]
This parsimonious decision tree classifies the data:
[Figure: a decision tree with Outlook? at the root; the rain branch leads to stay home, while the overcast and sunny branches test Temp? and, below that, Humidity?, with leaves walk, stay home, and tennis.]
Our goal is to learn the most parsimonious decision tree from the data.
From the 14 data items we compute the following entropies:

H(Activity) = -(6/14 log_2 6/14 + 3/14 log_2 3/14 + 5/14 log_2 5/14) = 1.5306

H(Activity | sunny) = -(5/6 log_2 5/6 + 1/6 log_2 1/6 + 0/6 log_2 0/6) = 0.650

H(Activity | overcast) = -(2/4 log_2 2/4 + 2/4 log_2 2/4 + 0/4 log_2 0/4) = 1.0
In general, the entropy of Z given a value x_j of X is

H(Z | x_j) = -Σ_{i=1}^{m} P(z_i | x_j) log_2 P(z_i | x_j),

the expected conditional entropy of Z given X is

EH(Z | X) = Σ_j H(Z | x_j) P(x_j),

where the sum is over the values x_j of X, and the information gain is

IG(Z; X) = H(Z) - EH(Z | X).

When learning the decision tree, we split each node on the variable X with the greatest information gain.
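As a sketch, the following Python code computes these quantities from class counts; the function names are illustrative, and the printed values reproduce the entropies calculated above for the Activity example (the full data table is assumed to be available elsewhere).

from math import log2

def entropy(probs):
    # H = sum(-p_i * log2(p_i)), skipping zero-probability outcomes.
    return sum(-p * log2(p) for p in probs if p > 0)

def expected_conditional_entropy(counts_by_value):
    # counts_by_value maps each value x_j of X to the list of counts
    # of the classes z_1, ..., z_m among the items with X = x_j.
    n = sum(sum(counts) for counts in counts_by_value.values())
    return sum((sum(counts) / n) * entropy([c / sum(counts) for c in counts])
               for counts in counts_by_value.values())

def information_gain(class_counts, counts_by_value):
    # IG(Z; X) = H(Z) - EH(Z | X).
    n = sum(class_counts)
    h_z = entropy([c / n for c in class_counts])
    return h_z - expected_conditional_entropy(counts_by_value)

# Reproducing the entropies computed above for the Activity example:
print(entropy([6/14, 3/14, 5/14]))  # 1.5306 = H(Activity)
print(entropy([5/6, 1/6, 0/6]))     # 0.650  = H(Activity | sunny)
print(entropy([2/4, 2/4, 0/4]))     # 1.0    = H(Activity | overcast)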