Chapter 5 Learning Deterministic Models

Learning Deterministic Models

Supervised Learning
Supervised learning involves learning a function from a training set.

The function maps a variable x (which may be a vector) to a variable y.

The training set is a set of known values of (x, y) pairs.

The variables in x are called the predictors.

Variable y is called the target.


American Express suspected that charges on American Express cards increased
with the number of miles traveled by the card holder. To investigate this matter,
a research firm randomly selected 25 card holders and obtained the data shown
in the following table.
The following is a scatterplot of the data in the table:

[Scatterplot: Dollars (Y) on the vertical axis versus Miles (X) on the horizontal axis, showing an increasing, roughly linear pattern.]

It looks like there is an approximate linear relationship between Dollars and Miles.

Linear regression endeavors to find such a linear relationship.


In simple linear regression, we assume we have an independent random variable X and a dependent random variable Y such that

y = \beta_0 + \beta_1 x + \varepsilon_x,

where \varepsilon_x is a random variable, which depends on the value x of X, with the following properties:

1) For every value x of X, \varepsilon_x is normally distributed with 0 mean.

2) For every value x of X, \varepsilon_x has the same standard deviation \sigma.

3) The random variables \varepsilon_x for all x are mutually independent.

Note that these assumptions entail that the expected value of Y given a value x of X is given by

E(Y \mid X = x) = \beta_0 + \beta_1 x.

The idea is that the expected value of Y is a deterministic linear function of x.

However, the actual value y of Y is not uniquely determined by the value of X because of the random error term \varepsilon_x.
To estimate the values of \beta_0 and \beta_1, we find the values of b_0 and b_1 that minimize the Mean Square Error (MSE), which is

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left[ y_i - (b_0 + b_1 x_i) \right]^2,

where n is the size of the sample, and x_i and y_i are the values of X and Y for the i-th item.
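As a concrete sketch, the minimizing values can be computed in closed form: b_1 is the ratio of the sample covariance of x and y to the sample variance of x, and b_0 = \bar{y} - b_1 \bar{x}. The data below are synthetic stand-ins (the 25-row American Express table is not reproduced here), so the printed coefficients illustrate the procedure rather than the actual fit.

```python
import numpy as np

# Synthetic stand-in data; the actual 25-cardholder table is not reproduced here.
rng = np.random.default_rng(0)
miles = rng.uniform(1200, 6000, size=25)
dollars = 275 + 1.25 * miles + rng.normal(0, 400, size=25)

# Closed-form least-squares estimates that minimize the MSE:
#   b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  b0 = y_bar - b1 * x_bar
x_bar, y_bar = miles.mean(), dollars.mean()
b1 = np.sum((miles - x_bar) * (dollars - y_bar)) / np.sum((miles - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

mse = np.mean((dollars - (b0 + b1 * miles)) ** 2)
print(f"y = {b0:.1f} + {b1:.3f} x    (MSE = {mse:.1f})")
```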
In the case of the American Express example we obtain the following:

y = b_0 + b_1 x = 274.8 + 1.26 x.
A statistical package usually provides additional information when doing linear regression, including, for each estimated coefficient, a standard error, a t statistic, and (in the far right column of the output) a p-value.

The p-value for the slope (Miles) coefficient is very small, so we can be highly confident in the linear relationship between Y and X.

However, the p-value for the constant term is larger, so we cannot be so confident in the constant.

Overfitting the Data

[Figure with four panels, each plotting y against x: (a) 9 data points, (b) connect the dots, (c) linear regression, (d) quadratic regression.]

(b), (c), and (d) show 3 models we could learn from the data in (a).

(b) predicts the data in the sample perfectly.

However, (b) might overfit the data and perform poorly on out-of-sample data.
In order to avoid overfitting, we use techniques that evaluate the fit of the model
to the underlying system.

In the test set method, we partition the data into a training set and a test set.

We then learn a model from the training set and estimate the performance
using the test set.
Allocating 70% of the data in the previous figure to the training set, and 30% to the test set, we obtain the results in the first row of the following table (MSE denotes mean square error):

LOOCV and 3-Fold CV are other techniques discussed next.

All 3 techniques found quadratic regression to be the best model.
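A minimal sketch of the test-set method under assumed data: since the nine points of the figure are not reproduced here, a noisy quadratic is generated instead, each candidate model is fit on a 70% training split, and its MSE is measured on the held-out 30%.

```python
import numpy as np

# Synthetic stand-in for the 9 data points of the overfitting figure.
rng = np.random.default_rng(1)
x = np.arange(9, dtype=float)
y = 0.5 * (x - 4) ** 2 + rng.normal(0, 0.5, size=9)

# 70% / 30% split into a training set and a test set.
idx = rng.permutation(9)
train, test = idx[:6], idx[6:]

# Degree 5 interpolates the 6 training points, playing the role of "connect the dots".
for degree, name in [(1, "linear"), (2, "quadratic"), (5, "interpolating")]:
    coefs = np.polyfit(x[train], y[train], degree)               # fit on training set
    mse = np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2)   # evaluate on test set
    print(f"{name:13s} test MSE = {mse:.3f}")
```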


In Leave-One-Out Cross Validation (LOOCV), we remove one of the n data
items and train using the remaining n-1 data items.

We then compute the error for the removed item relative to the model learned.

After this process is repeated for all n data items, the MSE is computed.

In k-Fold Cross Validation, we divide the data into k partitions of the same size.

For each partition j we train using the data items in the remaining k-1 partitions,
and we compute the error for each data item in partition j relative to the model learned.

After this process is repeated for all k partitions, the MSE for all data items is computed.
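A sketch of LOOCV and 3-fold cross validation for the same comparison, using scikit-learn's split iterators on the same kind of synthetic data (an assumption, since the original points are not reproduced here):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Synthetic stand-in for the 9 data points of the overfitting figure.
rng = np.random.default_rng(1)
x = np.arange(9, dtype=float)
y = 0.5 * (x - 4) ** 2 + rng.normal(0, 0.5, size=9)

def cv_mse(splitter, degree):
    """MSE over all held-out items for a polynomial model of the given degree."""
    errors = []
    for train_idx, test_idx in splitter.split(x):
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)
        errors.extend((np.polyval(coefs, x[test_idx]) - y[test_idx]) ** 2)
    return np.mean(errors)

for degree, name in [(1, "linear"), (2, "quadratic")]:
    loocv = cv_mse(LeaveOneOut(), degree)
    kfold = cv_mse(KFold(n_splits=3, shuffle=True, random_state=0), degree)
    print(f"{name:10s} LOOCV MSE = {loocv:.3f}   3-fold MSE = {kfold:.3f}")
```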
Learning a Decision Tree
Information Theory
Huffman Code
Suppose we are going to transmit to a colleague a sequence as follows:
aaabbbccaaddcccc

Fixed-length binary code:
a: 00  b: 01  c: 10  d: 11

Variable-length binary code:
a: 0  b: 10  c: 110  d: 111

For this sequence, the variable-length code uses more total bits than the fixed-length code (35 versus 32).

If all characters are transmitted the same number of times, the fixed-length code is better.

However, if “a” is transmitted a large fraction of the time, the variable-length code is better.
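A quick check of those bit counts (a sketch; the second message is a made-up example dominated by “a”):

```python
fixed = {"a": "00", "b": "01", "c": "10", "d": "11"}
variable = {"a": "0", "b": "10", "c": "110", "d": "111"}

def total_bits(code, message):
    """Total number of bits needed to encode the message with the given code."""
    return sum(len(code[ch]) for ch in message)

msg = "aaabbbccaaddcccc"
print(total_bits(fixed, msg), total_bits(variable, msg))     # 32 vs 35: fixed wins

msg2 = "a" * 13 + "bcd"                                      # "a" dominates
print(total_bits(fixed, msg2), total_bits(variable, msg2))   # 32 vs 21: variable wins
```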
Huffman’s algorithm finds the optimal binary code.

Character   Frequency   Fixed Length   Huffman
a           16          000            00
b            5          001            1110
c           12          010            110
d           17          011            01
e           10          100            1111
f           25          101            10
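A minimal sketch of Huffman's algorithm using a heap; run on the frequencies above, it reproduces the code lengths in the table (the exact 0/1 labels can differ depending on tie-breaking).

```python
import heapq

def huffman_code(freqs):
    """Build a prefix code by repeatedly merging the two lowest-frequency subtrees."""
    # Heap entries: (total frequency, tie-breaker id, {symbol: code bits so far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + bits for sym, bits in left.items()}
        merged.update({sym: "1" + bits for sym, bits in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

freqs = {"a": 16, "b": 5, "c": 12, "d": 17, "e": 10, "f": 25}
for sym, bits in sorted(huffman_code(freqs).items()):
    print(sym, freqs[sym], bits)
```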
Suppose we toss a coin once.

If P(H) = 1/2, the expected number of bits needed to report the outcome is 1.

If P(H) = 3/4, the expected number of bits needed to report the outcome is still 1.
Suppose we toss the coin twice.

P(H) = 1/2 (2-bit code for each of the four outcomes):
E(avg # bits per toss) = [2(1/4) + 2(1/4) + 2(1/4) + 2(1/4)] / 2 = 1

P(H) = 3/4 (variable-length code for the four outcomes, with the shortest codeword for HH):
E(avg # bits per toss) = [1(9/16) + 2(3/16) + 3(3/16) + 3(1/16)] / 2 = 0.84375
The entropy H of a probability distribution is defined as follows:

H = \lim_{n \to \infty} E(\text{avg \# bits per outcome} \mid \text{optimal code for } n \text{ outcomes})

It is possible to show that for a binary outcome (coin toss):

H = -(p_1 \log_2 p_1 + p_2 \log_2 p_2)
If P(Heads) = 0.5,

H = -(p_1 \log_2 p_1 + p_2 \log_2 p_2) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1.

If P(Heads) = 0.75,

H = -(0.75 \log_2 0.75 + 0.25 \log_2 0.25) = 0.81128.

H is minimized when p = 1 for some outcome:

H = -(1 \log_2 1 + 0 \log_2 0) = 0.

H is maximized when p_1 = p_2 = 1/2:

H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1.

H is a measure of uncertainty in the outcome.


When there are m outcomes, the entropy is as follows:

H = -\sum_{i=1}^{m} p_i \log_2 p_i

Entropy is minimized when p_i = 1 for some outcome.

Entropy is maximized when p_i = 1/m for all outcomes.
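A small sketch of this formula; it reproduces the coin-toss values above (with the usual convention that 0 \log_2 0 = 0):

```python
import math

def entropy(probs):
    """H = -sum(p * log2(p)), using the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))      # 1.0
print(entropy([0.75, 0.25]))    # 0.8112781244591328
print(entropy([1.0, 0.0]))      # 0.0   (minimum: no uncertainty)
print(entropy([0.25] * 4))      # 2.0   (maximum for m = 4 outcomes)
```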


Suppose we want a decision tree that informs us of the activity we should engage in on a given day based on the outlook, temperature, humidity, and wind.

We want the decision tree to classify these data correctly:

Day  Outlook   Temp  Humidity  Wind    Activity
1    rain      hot   high      strong  stay home
2    overcast  cool  high      strong  stay home
3    overcast  cool  normal    strong  walk
4    rain      cool  normal    strong  stay home
5    sunny     cool  normal    strong  tennis
6    sunny     cool  normal    weak    tennis
7    rain      hot   normal    strong  stay home
8    sunny     hot   normal    weak    walk
9    sunny     mild  normal    strong  tennis
10   sunny     mild  high      weak    tennis
11   rain      mild  high      strong  stay home
12   overcast  mild  high      strong  walk
13   sunny     mild  high      strong  tennis
14   overcast  hot   high      strong  stay home
This decision tree classifies the data:

Temp?
  cool -> Humidity?
    high -> stay home
    normal -> Wind?
      strong -> Outlook?
        rain -> stay home
        overcast -> walk
        sunny -> tennis
      weak -> tennis
  hot -> Wind?
    strong -> stay home
    weak -> walk
  mild -> Humidity?
    high -> Wind?
      strong -> Outlook?
        rain -> stay home
        overcast -> walk
        sunny -> tennis
      weak -> tennis
    normal -> tennis
This parsimonious decision tree classifies the data:

Outlook?
  overcast -> Temp?
    mild -> walk
    cool -> Humidity?
      normal -> walk
      high -> stay home
    hot -> stay home
  rain -> stay home
  sunny -> Temp?
    mild -> tennis
    cool -> tennis
    hot -> walk
Our goal is to learn the most parsimonious decision tree from the data.

Algorithm 5.3 (ID3) uses information theory to learn a parsimonious tree.

Next we illustrate how the algorithm learns the top node.


The entropy of Activity:

H(Activity) = -\left( \frac{6}{14} \log_2 \frac{6}{14} + \frac{3}{14} \log_2 \frac{3}{14} + \frac{5}{14} \log_2 \frac{5}{14} \right) = 1.5306

The entropy of Activity given rain, sunny, and overcast:

H(Activity \mid rain) = -\left( \frac{4}{4} \log_2 \frac{4}{4} + \frac{0}{4} \log_2 \frac{0}{4} + \frac{0}{4} \log_2 \frac{0}{4} \right) = 0

H(Activity \mid sunny) = -\left( \frac{5}{6} \log_2 \frac{5}{6} + \frac{1}{6} \log_2 \frac{1}{6} + \frac{0}{6} \log_2 \frac{0}{6} \right) = 0.650

H(Activity \mid overcast) = -\left( \frac{2}{4} \log_2 \frac{2}{4} + \frac{2}{4} \log_2 \frac{2}{4} + \frac{0}{4} \log_2 \frac{0}{4} \right) = 1.0

The expected value of the entropy of Activity given Outlook:

EH(Activity \mid Outlook) = \frac{4}{14}(0) + \frac{6}{14}(0.650) + \frac{4}{14}(1) = 0.564

The information gain of Activity relative to Outlook:

IG(Activity; Outlook) = H(Activity) - EH(Activity \mid Outlook) = 1.5306 - 0.564 = 0.967

The information gains of Activity relative to Humidity, Temp, and Wind are all smaller.

So, we choose Outlook as the top node in the decision tree.

The ID3 algorithm then recursively learns the remaining nodes in the same fashion.
Formal Definition of Information Gain

Let P(Z) be a probability distribution, where Z has m alternatives z_1, \ldots, z_m, and let X be a random variable with alternatives x_j. We define the conditional entropy of Z given x_j as follows:

H(Z \mid x_j) = -\sum_{i=1}^{m} P(z_i \mid x_j) \log_2 P(z_i \mid x_j)

The conditional entropy of Z given X is the expected value of the entropy of Z given X. It is defined as follows (where X has k alternatives):

EH(Z \mid X) = \sum_{j=1}^{k} H(Z \mid x_j) P(x_j).

The Information Gain of X for Z is as follows:

IG(Z; X) = H(Z) - EH(Z \mid X)
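The sketch below applies these definitions to the weather data above to select the root node; it reproduces IG(Activity; Outlook) of about 0.967 and confirms that the gains for the other predictors are smaller.

```python
import math
from collections import Counter

# (Outlook, Temp, Humidity, Wind, Activity) rows from the table above.
data = [
    ("rain", "hot", "high", "strong", "stay home"),
    ("overcast", "cool", "high", "strong", "stay home"),
    ("overcast", "cool", "normal", "strong", "walk"),
    ("rain", "cool", "normal", "strong", "stay home"),
    ("sunny", "cool", "normal", "strong", "tennis"),
    ("sunny", "cool", "normal", "weak", "tennis"),
    ("rain", "hot", "normal", "strong", "stay home"),
    ("sunny", "hot", "normal", "weak", "walk"),
    ("sunny", "mild", "normal", "strong", "tennis"),
    ("sunny", "mild", "high", "weak", "tennis"),
    ("rain", "mild", "high", "strong", "stay home"),
    ("overcast", "mild", "high", "strong", "walk"),
    ("sunny", "mild", "high", "strong", "tennis"),
    ("overcast", "hot", "high", "strong", "stay home"),
]
attributes = ["Outlook", "Temp", "Humidity", "Wind"]

def entropy(labels):
    """H(Z) = -sum over classes of P(z) log2 P(z), estimated from counts."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(rows, attr_index):
    """IG(Activity; X) = H(Activity) - EH(Activity | X)."""
    labels = [row[-1] for row in rows]
    expected = 0.0
    for value in set(row[attr_index] for row in rows):
        subset = [row[-1] for row in rows if row[attr_index] == value]
        expected += (len(subset) / len(rows)) * entropy(subset)
    return entropy(labels) - expected

for i, name in enumerate(attributes):
    print(f"IG(Activity; {name}) = {information_gain(data, i):.3f}")
# Outlook has the largest gain, so ID3 places it at the root.
```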
