Lec7 - Nonparametric Methods - II
Parametric vs. non-parametric methods (II)
Ghada Khoriba
[email protected]
Recommending App
ML question:
Between Gender and Occupation, which one seems more decisive for predicting which app the users will download?
Recommending App
(Figure: a small decision tree for the app data. The root splits on Occupation; the School branch leads to the leaf Pokemon Go, while the Work branch splits on Gender, with F leading to WhatsApp and M leading to Snapchat.)
Between a horizontal and a vertical line, which one would cut the data better?
Non-parametric Estimation
• A non-parametric model is not fixed: its complexity depends on the size of the training set or, rather, on the complexity of the problem inherent in the data.
• A nonparametric model does not mean that the model has no parameters; it means that the number of parameters is not fixed and can grow with the size of the data or, better still, with the complexity of the regularity that underlies the data.
Decision tree
• A decision tree is a hierarchical data structure implementing the divide-and-conquer strategy.
• It is an efficient nonparametric method that can be used for both classification and regression.
• A decision tree is also a nonparametric model in the sense that we do not assume any parametric form for the class densities and the tree structure is not fixed a priori: the tree grows, and branches and leaves are added, during learning depending on the complexity of the problem inherent in the data.
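As a quick illustration of this growth during learning (not from the original slides), here is a minimal scikit-learn sketch; the iris data and the entropy criterion are arbitrary choices for the example.

```python
# Minimal sketch (assumes scikit-learn is installed): the tree's size is not fixed
# in advance but is determined while fitting the data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# The number of leaves and the depth are outcomes of learning, not parameters set a priori.
print("leaves grown from the data:", tree.get_n_leaves())
print("depth grown from the data: ", tree.get_depth())
```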
Function Approximation

Problem Setting
• Set of possible instances X
• Set of possible labels Y
• Unknown target function f : X → Y
• Set of function hypotheses H = {h | h : X → Y}
• Training examples ⟨xᵢ, yᵢ⟩
Ref: https://www.seas.upenn.edu, Eric Eaton.
Decision Tree
• A possible decision tree for the data (figure: the tree and the corresponding decision boundary in feature space).
Ref: https://www.seas.upenn.edu, Eric Eaton.
Stages of (Batch) Machine Learning

Given: labeled training data X, Y = {⟨xᵢ, yᵢ⟩} for i = 1, …, n
• Assumes each xᵢ ~ D(X) with yᵢ = f_target(xᵢ)

Train the model:
model ← classifier.train(X, Y)

Apply the model to new data x:
y_prediction ← model.predict(x)

• The ID3 algorithm uses the Max-Gain method of selecting the best attribute
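A hedged sketch of these stages in Python: scikit-learn's DecisionTreeClassifier stands in for the slide's abstract classifier (it implements CART rather than ID3, but criterion="entropy" makes it choose splits by information gain, in the spirit of Max-Gain).

```python
# Sketch of the batch ML stages: train on labeled data (X, Y), then predict on new x.
# Assumes scikit-learn; the dataset is just an example.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, Y = load_iris(return_X_y=True)                      # labeled training data {(x_i, y_i)}
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# Train the model:  model <- classifier.train(X, Y)
model = DecisionTreeClassifier(criterion="entropy").fit(X_train, Y_train)

# Apply the model to new data:  y_prediction <- model.predict(x)
y_prediction = model.predict(X_test)
print("test accuracy:", np.mean(y_prediction == Y_test))
```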
Information Gain
Which test is more informative: a split over whether Balance exceeds 50K, or a split over whether the applicant is employed?
Based on slide by Pedro Domingos
Entropy: a common way to measure impurity
2-Class Cases:

Entropy: H(x) = − Σᵢ P(x = i) · log₂ P(x = i), summed over the n possible values i = 1, …, n

• What is the entropy of a group in which all examples belong to the same class?
  – entropy = − 1 · log₂ 1 = 0
  – Minimum impurity: not a good training set for learning
• What is the entropy of a group with 50% in either class?
  – entropy = − 0.5 · log₂ 0.5 − 0.5 · log₂ 0.5 = 1
  – Maximum impurity: a good training set for learning

Based on slide by Pedro Domingos
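A small sanity check of the two cases above (not from the slides; plain NumPy, using the usual convention 0 · log₂ 0 = 0):

```python
import numpy as np

def entropy(probs):
    """Entropy H(x) = -sum_i P(x=i) * log2 P(x=i), with 0*log2(0) treated as 0."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                              # drop zero-probability classes
    return float(-np.sum(p * np.log2(p)) + 0.0)   # +0.0 avoids printing -0.0

print(entropy([1.0, 0.0]))   # all examples in one class -> 0.0 (minimum impurity)
print(entropy([0.5, 0.5]))   # 50/50 split              -> 1.0 (maximum impurity)
```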
Sample Entropy
Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
Based on slide by Pedro Domingos
From Entropy to Information Gain
• Entropy H(X) of a random variable X
• Conditional entropy H(X | A): the entropy of X after conditioning on variable A
• Information Gain: Gain(X, A) = H(X) − H(X | A)
Calculating Information Gain

Information Gain = entropy(parent) − [average entropy(children)]

Entire population (30 instances), split into one child with 17 instances and one child with 13 instances:

parent entropy = − (14/30) · log₂(14/30) − (16/30) · log₂(16/30) = 0.996

child entropy (17 instances) = − (13/17) · log₂(13/17) − (4/17) · log₂(4/17) = 0.787

child entropy (13 instances) = − (1/13) · log₂(1/13) − (12/13) · log₂(12/13) = 0.391

(Weighted) Average Entropy of Children = (17/30) × 0.787 + (13/30) × 0.391 = 0.615

Information Gain = 0.996 − 0.615 = 0.38

Based on slide by Pedro Domingos
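The worked example can be reproduced with a short script (illustrative only; the class counts 14/16, 13/4 and 1/12 are taken from the slide):

```python
import numpy as np

def entropy_from_counts(counts):
    """Entropy of a node given the class counts at that node."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

parent = [14, 16]                 # 30 instances total
children = [[13, 4], [1, 12]]     # 17-instance child and 13-instance child

h_parent = entropy_from_counts(parent)
sizes = [sum(c) for c in children]
h_children = sum(
    (n / sum(sizes)) * entropy_from_counts(c) for n, c in zip(sizes, children)
)

print(round(h_parent, 3))               # ~0.997 (0.996 on the slide)
print(round(h_children, 3))             # ~0.616 (0.615 on the slide, from rounded child entropies)
print(round(h_parent - h_children, 2))  # information gain ~0.38
```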
Slide by Tom Mitchell
Which Tree Should We Output?
• ID3 performs a heuristic search through the space of decision trees
• It stops at the smallest acceptable tree. Why?
Based on Slide from M. desJardins & T. Finin
Overfitting in Decision Trees
• Many kinds of noise can occur in the examples:
  – Two examples have the same attribute/value pairs, but different classifications
  – Some values of attributes are incorrect because of errors in the data acquisition process or the preprocessing phase
  – The instance was labeled incorrectly (+ instead of −)
• Also, some attributes are irrelevant to the decision-making process
  – e.g., the color of a die is irrelevant to its outcome
• Irrelevant attributes can result in overfitting the training example data
• If the hypothesis space has many dimensions (a large number of attributes), we may find meaningless regularity in the data that is irrelevant to the true, important, distinguishing features
• If we have too little training data, even a reasonable hypothesis space will overfit
Based on Slide from M. desJardins & T. Finin
Avoiding Overfitting in Decision Trees
How can we avoid overfitting?
• Stop growing when the data split is not statistically significant
• Acquire more training data
• Remove irrelevant attributes (manual process – not always possible)
• Grow the full tree, then post-prune (a sketch follows below)
Slide by Pedro Domingos
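As one concrete way to "grow the full tree, then post-prune", here is a hedged sketch using scikit-learn's minimal cost-complexity pruning. This is a different pruning criterion than reduced-error pruning, but the held-out validation set plays the same role; the dataset is just an example.

```python
# Sketch: grow a full tree, then post-prune it via cost-complexity pruning (scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Full (unpruned) tree: fits the training data very closely and tends to overfit.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate pruning strengths (alphas) computed from the training data.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Pick the alpha that does best on held-out data; the validation set plays the
# role of the "pruning set" in reduced-error pruning.
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    alpha = max(alpha, 0.0)   # guard against tiny negative values from floating-point error
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_val, y_val)
    if score >= best_score:   # ">=" prefers the most pruned tree among ties
        best_alpha, best_score = alpha, score

pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print("leaves before pruning:", full_tree.get_n_leaves())
print("leaves after pruning: ", pruned_tree.get_n_leaves(), "validation accuracy:", best_score)
```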
Effect of Reduced-Error Pruning
KNN has three basic steps (a sketch follows below):
1. Calculate the distance from the query point to every training example.
2. Find the k nearest neighbors.
3. Vote for the class: take the majority label among those neighbors.
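A minimal sketch of those three steps in plain NumPy (Euclidean distance, majority vote; the helper name knn_predict and the toy data are made up for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point with k-nearest neighbors."""
    # 1. Calculate the distance from the query to every training example.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # 2. Find the k nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # 3. Vote for the class: majority label among those neighbors.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy example (made-up data): two clusters of 2-D points.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))   # -> 1
```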