Lecture 7

Parametric vs. Non-parametric Methods (II)

Ghada Khoriba
[email protected]

1
Recommending App

For a woman who works at an office, which app do we recommend?

For a man who works at a factory, which app do we recommend?

ML asks: between Gender and Occupation, which one seems more decisive for predicting which app the users will download?

2
Recommending App
[Decision tree: split on Occupation — School → Pokemon Go; Work → split on Gender — F → WhatsApp, M → Snapchat]
3
Between a horizontal
and a vertical line, which
one would cut the data
better?

4
5
Non-parametric Estimation
• A non-parametric model is not fixed, but its complexity
depends on the size of the training set or, rather, the
complexity of the problem inherent in the data.
• A nonparametric model does not mean that the model has no
parameters; it means that the number of parameters is not
fixed and that their number can grow depending on the size
of the data or, better still, depending on the complexity of the
regularity that underlies the data.

6
Decision tree
• A decision tree is a hierarchical data structure implementing
the divide-and-conquer strategy.
• It is an efficient nonparametric method that can be used for
both classification and regression.
• A decision tree is also a nonparametric model in the sense that we do not assume any parametric form for the class densities and the tree structure is not fixed a priori; rather, the tree grows, and branches and leaves are added during learning, depending on the complexity of the problem inherent in the data.

7
Function Approximation

Problem Setting
• Set of possible instances X
• Set of possible labels Y
• Unknown target function f : X → Y
• Set of function hypotheses H = { h | h : X → Y }

Input: training examples of the unknown target function f
{⟨x_i, y_i⟩}_{i=1}^n = {⟨x_1, y_1⟩, ..., ⟨x_n, y_n⟩}

Output: hypothesis h ∈ H that best approximates f

8
Ref: https://www.seas.upenn.edu, Eric Eaton.
Based on slide by Tom Mitchell
Sample Dataset (was Tennis Played?)
• Columns denote features X_i
• Rows denote labeled instances ⟨x_i, y_i⟩
• Class label denotes whether a tennis game was played

9
Ref: https://www.seas.upenn.edu, Eric Eaton.
Decision Tree
• A possible decision tree for the data:

• Each internal node: test one attribute X_i
• Each branch from a node: selects one value for X_i
• Each leaf node: predict Y

10
Ref: https://www.seas.upenn.edu, Eric Eaton.
Based on slide by Tom Mitchell
Decision Tree
• A possible decision tree for the data:

• What prediction would we make for
  <outlook=sunny, temperature=hot, humidity=high, wind=weak> ?
Ref: https://www.seas.upenn.edu, Eric Eaton. 11


Based on slide by Tom Mitchell
Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-parallel (hyper-)rectangles
• Each rectangular region is labeled with one label
  – or a probability distribution over labels

[Figure: decision boundary in the feature space]

12
Ref: https://www.seas.upenn.edu, Eric Eaton.
Stages of (Batch) Machine Learning

Given: labeled training data X, Y = {⟨x_i, y_i⟩}_{i=1}^n
• Assumes each x_i ~ D(X) with y_i = f_target(x_i)

Train the model:
model ← classifier.train(X, Y)

Apply the model to new data:
• Given: new unlabeled instance x ~ D(X)
y_prediction ← model.predict(x)

[Diagram: X, Y → learner → model; x → model → y_prediction]

13
Ref: https://www.seas.upenn.edu, Eric Eaton.
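As a concrete illustration of this train/predict pattern, here is a minimal sketch using scikit-learn's DecisionTreeClassifier; the toy arrays X, Y, and x_new are invented for the example and are not from the lecture.

```python
# Minimal sketch of the batch train/predict pattern with scikit-learn.
# The toy data below is made up purely for illustration.
from sklearn.tree import DecisionTreeClassifier

# Labeled training data: each row of X is a feature vector x_i, each entry of Y is y_i.
X = [[0, 1], [1, 1], [1, 0], [0, 0]]
Y = ["yes", "yes", "no", "no"]

# Train the model:  model <- classifier.train(X, Y)
model = DecisionTreeClassifier()
model.fit(X, Y)

# Apply the model to a new, unlabeled instance:  y_prediction <- model.predict(x)
x_new = [[1, 0]]
y_prediction = model.predict(x_new)
print(y_prediction)  # e.g. ['no']
```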
Basic Algorithm for Top-Down Learning of Decision Trees
[ID3, C4.5 by Quinlan]

node = root of decision tree

Main loop:
1. A ← the “best” decision attribute for the next node.
2. Assign A as the decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, stop. Else, recurse over new leaf nodes.

How do we choose which attribute is best?

14
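A short Python sketch of this top-down loop, using information gain as the “best”-attribute heuristic. This is not Quinlan's exact ID3/C4.5 implementation; the helper names (entropy, information_gain, id3) and the dictionary-based instance format are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_i p_i * log2(p_i) over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Expected reduction in label entropy from splitting on `attribute`."""
    n = len(labels)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [y for ex, y in zip(examples, labels) if ex[attribute] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, labels, attributes):
    """Grow the tree top-down: pick the 'best' attribute, split, recurse."""
    if len(set(labels)) == 1:                # perfectly classified: make a leaf
        return labels[0]
    if not attributes:                       # nothing left to split on: majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree[best][value] = id3([examples[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != best])
    return tree

# Toy usage (attribute values invented for illustration):
data = [{"outlook": "sunny", "wind": "weak"}, {"outlook": "rain", "wind": "strong"},
        {"outlook": "sunny", "wind": "strong"}, {"outlook": "rain", "wind": "weak"}]
play = ["no", "no", "no", "yes"]
print(id3(data, play, ["outlook", "wind"]))
```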
Choosing the Best Attribute
Key problem: choosing which attribute to split a given set of examples
• Some possibilities are:
– Random: Select any attribute at random
– Least-Values: Choose the attribute with the smallest number of possible values
– Most-Values: Choose the attribute with the largest number of possible values
– Max-Gain: Choose the attribute that has the largest expected information gain
• i.e., attribute that results in smallest expected size of subtrees rooted at its children

• The ID3 algorithm uses the Max-Gain method of selecting the best
attribute

15
Information Gain
Which test is more informative?
Split over whether Balance exceeds 50K, or split over whether the applicant is employed?

[Figure: two candidate splits: Balance ≤ 50K vs. over 50K, and Unemployed vs. Employed]

16
Based on slide by Pedro Domingos
Information Gain
Impurity/Entropy (informal)
– Measures the level of impurity in a group of examples

17
Based on slide by Pedro Domingos
18
Entropy: a common way to measure impurity

19
2-Class Cases:

Entropy: H(x) = -Σ_{i=1}^{n} P(x = i) log2 P(x = i)

• What is the entropy of a group in which all examples belong to the same class?
  – entropy = -1 log2 1 = 0
  → Minimum impurity: not a good training set for learning

• What is the entropy of a group with 50% in either class?
  – entropy = -0.5 log2 0.5 - 0.5 log2 0.5 = 1
  → Maximum impurity: a good training set for learning

20
Based on slide by Pedro Domingos
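A quick numerical check of these two cases; the helper name two_class_entropy is invented for this sketch.

```python
import math

def two_class_entropy(p):
    """H for a two-class group where one class has proportion p (0 <= p <= 1)."""
    if p in (0.0, 1.0):          # log2(0) is undefined; a pure group has zero entropy
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(two_class_entropy(1.0))   # 0.0 -> minimum impurity (all examples in one class)
print(two_class_entropy(0.5))   # 1.0 -> maximum impurity (50% in either class)
```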
Sample Entropy

21
Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a decision tree.

22
Based on slide by Pedro Domingos
From Entropy to Information Gain

Entropy H(X) of a random variable X

Specific conditional entropy H(X | Y = v) of X given Y = v

Conditional entropy H(X | Y) of X given Y

Mutual information (aka Information Gain) of X and Y

Information Gain is the mutual information between input attribute A and target variable Y.
Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A.

Slide by Tom Mitchell
23
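The formulas on the original slide are images and do not survive as text; the standard definitions they refer to can be written as:

```latex
% Entropy of a random variable X
H(X) = -\sum_{x} P(X = x)\,\log_2 P(X = x)

% Specific conditional entropy of X given Y = v
H(X \mid Y = v) = -\sum_{x} P(X = x \mid Y = v)\,\log_2 P(X = x \mid Y = v)

% Conditional entropy of X given Y
H(X \mid Y) = \sum_{v} P(Y = v)\, H(X \mid Y = v)

% Mutual information (information gain) of X and Y
I(X, Y) = H(X) - H(X \mid Y)

% Information gain of attribute A for target Y on data sample S
\mathrm{Gain}(S, A) = H_S(Y) - H_S(Y \mid A)
```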
Calculating Information Gain

Information Gain = entropy(parent) - [average entropy(children)]

Entire population (30 instances):
parent entropy = -(14/30) log2(14/30) - (16/30) log2(16/30) = 0.996

Child 1 (17 instances):
child entropy (impurity) = -(13/17) log2(13/17) - (4/17) log2(4/17) = 0.787

Child 2 (13 instances):
child entropy (impurity) = -(1/13) log2(1/13) - (12/13) log2(12/13) = 0.391

(Weighted) Average Entropy of Children = (17/30) × 0.787 + (13/30) × 0.391 = 0.615

Information Gain = 0.996 - 0.615 = 0.38

24
Based on slide by Pedro Domingos
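The same worked example can be checked with a few lines of Python; the class counts (14/16 in the parent, 13/4 and 1/12 in the children) are taken from the slide above.

```python
import math

def entropy(counts):
    """Entropy of a group given its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = entropy([14, 16])     # 0.996 for the 30-instance population
left   = entropy([13, 4])      # 0.787 for the 17-instance child
right  = entropy([1, 12])      # 0.391 for the 13-instance child

children = (17 / 30) * left + (13 / 30) * right   # weighted average = 0.615
gain = parent - children                          # 0.996 - 0.615 = 0.38
print(round(parent, 3), round(children, 3), round(gain, 2))
```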
Slide by Tom Mitchell
25

Slide by Tom Mitchell
26
Which Tree Should We Output?
• ID3 performs heuristic
search through space of
decision trees
• It stops at smallest
acceptable tree. Why?

Occam’s razor: prefer the


simplest hypothesis that
fits the data

Slide by Tom Mitchell


27
Overfitting in Decision Trees
• Many kinds of noise can occur in the examples:
  – Two examples have same attribute/value pairs, but different classifications
  – Some values of attributes are incorrect because of errors in the data acquisition process or the preprocessing phase
  – The instance was labeled incorrectly (+ instead of -)
• Also, some attributes are irrelevant to the decision-making process
  – e.g., color of a die is irrelevant to its outcome

28
Based on Slide from M. desJardins & T. Finin
Overfitting in Decision Trees
• Irrelevant attributes can result in overfitting the training example data
  – If hypothesis space has many dimensions (large number of attributes), we may find meaningless regularity in the data that is irrelevant to the true, important, distinguishing features
• If we have too little training data, even a reasonable hypothesis space will overfit

29
Based on Slide from M. desJardins & T. Finin
Avoiding Overfitting in Decision Trees
How can we avoid overfitting?
• Stop growing when data split is not statistically significant
• Acquire more training data
• Remove irrelevant attributes (manual process – not always possible)
• Grow full tree, then post-prune

How to select the “best” tree:
• Measure performance over training data
• Measure performance over separate validation data set
• Add complexity penalty to performance measure
  (heuristic: simpler is better)

30
Based on Slide by Pedro Domingos
Reduced-Error Pruning
Split training data further into training and validation sets
Grow tree based on training set
Do until further pruning is harmful:
1. Evaluate impact on validation set of pruning each possible node (plus those below it)
2. Greedily remove the node that most improves validation set accuracy

31
Slide by Pedro Domingos
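A hedged Python sketch of reduced-error pruning, assuming the dictionary-based tree format from the earlier ID3 sketch ({attribute: {value: subtree_or_label}}). For brevity, a pruned node is collapsed to the validation set's majority label rather than the training-set majority at that node, so this is an approximation of the procedure, not a reference implementation.

```python
import copy
from collections import Counter

def predict(tree, example, default):
    """Follow the dict-based tree; fall back to `default` for unseen attribute values."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(example.get(attr), default)
    return tree

def accuracy(tree, examples, labels, default):
    hits = sum(predict(tree, ex, default) == y for ex, y in zip(examples, labels))
    return hits / len(labels)

def prunable_nodes(tree, path=()):
    """Yield the path to every internal node as a tuple of (attribute, value) steps."""
    if not isinstance(tree, dict):
        return
    yield path
    attr = next(iter(tree))
    for value, subtree in tree[attr].items():
        yield from prunable_nodes(subtree, path + ((attr, value),))

def replace_with_leaf(tree, path, label):
    """Return a copy of `tree` with the node at `path` collapsed to the leaf `label`."""
    if not path:
        return label
    new_tree = copy.deepcopy(tree)
    node = new_tree
    for attr, value in path[:-1]:
        node = node[attr][value]
    attr, value = path[-1]
    node[attr][value] = label
    return new_tree

def reduced_error_pruning(tree, val_examples, val_labels):
    """Greedily prune nodes while validation accuracy does not get worse."""
    default = Counter(val_labels).most_common(1)[0][0]
    best_acc = accuracy(tree, val_examples, val_labels, default)
    while True:
        best_candidate, best_candidate_acc = None, -1.0
        for path in prunable_nodes(tree):
            candidate = replace_with_leaf(tree, path, default)
            acc = accuracy(candidate, val_examples, val_labels, default)
            if acc > best_candidate_acc:
                best_candidate, best_candidate_acc = candidate, acc
        if best_candidate is None or best_candidate_acc < best_acc:
            return tree          # any further pruning would hurt validation accuracy
        tree, best_acc = best_candidate, best_candidate_acc
```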
Effect of Reduced-Error Pruning

The tree is pruned back to the red line where it gives more accurate results on the test data

32
Based on Slide by Pedro Domingos
Summary: Decision Tree Learning
• Representation: decision trees
• Bias: prefer small decision trees
• Search algorithm: greedy
• Heuristic function: information gain or information content or others
• Overfitting / pruning

• Widely used in practice
• Strengths include
  – Fast and simple to implement
  – Can convert to rules
  – Handles noisy data
• Weaknesses include
  – Univariate splits/partitioning using only one attribute at a time, which limits the types of possible trees
  – Large decision trees may be hard to understand
  – Requires fixed-length feature vectors
  – Non-incremental (i.e., batch method)

33
Slide by Pedro Domingos
K-Nearest Neighbor
• The nearest neighbor class of estimators adapts the amount
of smoothing to the local density of data. The degree of
smoothing is controlled by k, the number of neighbors taken
into account, which is much smaller than N, the sample size.
• 1‐Nearest Neighbor

34
35
36
37
KNN has three basic steps, sketched in code below:
1. Calculate the distance.
2. Find the k nearest neighbors.
3. Vote for classes.

38
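A minimal sketch of these three steps for classification, using Euclidean distance and a majority vote; the function name knn_predict and the toy data are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # 1. Calculate the distance from `query` to every training point.
    distances = [
        (math.dist(query, x), y)          # Euclidean distance, paired with the label
        for x, y in zip(train_X, train_y)
    ]
    # 2. Find the k nearest neighbors.
    neighbors = sorted(distances, key=lambda d: d[0])[:k]
    # 3. Vote for classes: the majority label among the k neighbors wins.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy usage (made-up 2-D points):
X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
y = ["A", "A", "B", "B"]
print(knn_predict(X, y, (1.1, 0.9), k=3))   # -> "A"
```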
