
Dempster-Shafer Reasoning with Uncertainty

Theory and Applications

Huỳnh Văn Nam

Japan Advanced Institute of Science and Technology


1-1 Asahidai, Nomi, Ishikawa, 923-1292 Japan
Email: [email protected]

Ho Chi Minh City University of Technology (HCMUT), 22/02/2011

Part 2 – Combination and Applications
Evidence Combination and Conflict
Combination Rules: A Review
Conflict Revisited
Difference Between two BoEs
Discounting and Combination Solution
Applications to Ensemble Learning
Application to Ensemble Classification
Application to Ensemble Clustering
An Illustrative Application
Word Sense Disambiguation
Multi-Representation of Context
Discounting-and-Combination Method for WSD
Experimental Results

Combination of Evidence in D-S Theory

∎ Criticisms of the counterintuitive results produced by applying Dempster's combination rule to conflicting beliefs emerged soon after its inception.
∎ In Dempster's rule of combination, the combined mass assigned to the empty set, considered as the conflict, is redistributed proportionally among the other masses.
∎ Zadeh (1984) presented an example in which Dempster's rule of combination produces unsatisfactory results.
∎ Since then, many alternatives have been proposed in the literature.
∎ The study of combination rules in D-S theory when evidence is in conflict has recently re-emerged as an interesting topic, especially in data/information fusion applications.

Evidence Combination and Conflict

Dempster’s Rule and Conflict

∎ Let m1 and m2 be two mass functions defined on frame Θ.
∎ Denote by m⊕ = (m1 ⊕ m2) the combined mass function given by Dempster's rule of combination:

    m⊕(A) ≜ (1 / (1 − κ)) ∑_{B∩C=A} m1(B) × m2(C),   ∀A ⊆ Θ, A ≠ ∅

  where
    κ = ∑_{B∩C=∅} m1(B) × m2(C)

∎ κ can be interpreted as the combined mass assigned to the empty set before normalization. It is therefore also denoted by m⊕(∅) and conventionally considered as the degree of conflict.
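For concreteness, here is a minimal Python sketch of Dempster's rule, assuming mass functions are stored as dicts mapping frozenset focal elements to masses (all names here are illustrative, not from the slides):

```python
from itertools import product

def combine_dempster(m1, m2):
    """Dempster's rule: conjunctive combination followed by normalization.

    m1, m2: dicts mapping frozenset focal elements to masses summing to 1.
    """
    combined = {}
    kappa = 0.0  # mass falling on the empty set, i.e. the degree of conflict
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        A = B & C
        if A:
            combined[A] = combined.get(A, 0.0) + mB * mC
        else:
            kappa += mB * mC
    if kappa >= 1.0:
        raise ValueError("total conflict: combination undefined")
    return {A: v / (1.0 - kappa) for A, v in combined.items()}
```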



Zadeh's example
∎ One doctor believes a patient has either meningitis, with probability 0.99, or a brain tumor, with probability only 0.01.
∎ A second doctor believes the patient suffers from concussion, with probability 0.99, and also allows a brain tumor with probability only 0.01.
Combining these two pieces of evidence with Dempster's rule yields

    m⊕(brain tumor) = Bel⊕(brain tumor) = 1

✗ This result implies complete support for the diagnosis of a brain tumor, which both doctors considered very unlikely.
⇒ Many alternative rules of combination have been developed.
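Running the combine_dempster sketch above on Zadeh's numbers reproduces the paradox:

```python
m1 = {frozenset({"meningitis"}): 0.99, frozenset({"tumor"}): 0.01}
m2 = {frozenset({"concussion"}): 0.99, frozenset({"tumor"}): 0.01}
m = combine_dempster(m1, m2)
# kappa = 0.9999, so after normalization the tiny shared mass becomes certainty:
print(m[frozenset({"tumor"})])  # 1.0 (up to floating-point rounding)
```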


Smets’s Rule of Combination

The transferable belief model [Smets & Kennes, Artif. Intell. 66 (1994)]:
∎ Justifies the use of belief functions to model subjective, personal beliefs.
∎ In general, in the definition of a mass function, the condition m(∅) = 0
is not required.

Smets's rule of combination (2005)
Let m1 and m2 be two mass functions defined on frame Θ. The so-called conjunctive rule of combination, denoted m∩ = m1 ∩ m2, is defined as

    m∩(A) ≜ ∑_{B∩C=A} m1(B) × m2(C),   ∀A ⊆ Θ


✓ Masses are not renormalized.
✓ The conflict is stored in the mass given to the empty set ⇒ the open world assumption, i.e. the "actual world" (the true value of variable X) might not be in Θ.
Zadeh's example revisited:
∎ m∩(brain tumor) = 0.0001
∎ m∩(meningitis) = 0
∎ m∩(concussion) = 0
∎ m∩(∅) = 0.9999
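A sketch of the unnormalized conjunctive rule, in the same dict representation as before (reusing the itertools.product import from the earlier sketch; the empty frozenset carries the conflict):

```python
def combine_conjunctive(m1, m2):
    """Smets's conjunctive rule: like Dempster's rule but without
    normalization; the conflicting mass stays on the empty set."""
    combined = {}
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        A = B & C  # may be frozenset(), the empty set
        combined[A] = combined.get(A, 0.0) + mB * mC
    return combined
```

On Zadeh's example this yields combined[frozenset()] ≈ 0.9999, matching the values above.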


Yager’s Rule of Combination

∎ Yager’s solution (1987): the conflict is transferred to the universe Θ.


∎ Given two mass functions m1 and m2 on frame Θ, the mass function resulting from the application of Yager's combination rule is given by

    mY⊕(A) = ∑_{B∩C=A} m1(B) × m2(C),   ∀A ⊆ Θ, A ≠ ∅, A ≠ Θ
    mY⊕(∅) = 0
    mY⊕(Θ) = m1(Θ) × m2(Θ) + m⊕(∅)

  where
    m⊕(∅) = ∑_{B∩C=∅} m1(B) × m2(C)
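A corresponding sketch, reusing the dict representation (theta is the full frame):

```python
def combine_yager(m1, m2, theta):
    """Yager's rule: the conflicting mass is transferred to the whole frame."""
    theta = frozenset(theta)
    combined = {}
    conflict = 0.0
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        A = B & C
        if A:
            combined[A] = combined.get(A, 0.0) + mB * mC
        else:
            conflict += mB * mC
    combined[theta] = combined.get(theta, 0.0) + conflict
    return combined
```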


✓ Yager's combination rule is commutative but not associative!
Zadeh's example again:
∎ mY⊕(brain tumor) = 0.0001
∎ mY⊕(∅) = 0
∎ mY⊕(Θ) = 0.9999
⇒ The conflict between the sources of information being combined is treated as ignorance.


Dubois and Prade’s Rule of Combination


∎ Let m1 and m2 be two mass functions on frame Θ. The disjunctive rule of combination, denoted m⊎ = (m1 ⊎ m2), is defined by

    m⊎(A) = ∑_{B∪C=A} m1(B) × m2(C),   ∀A ⊆ Θ

∎ Interpretation: only one of the two sources of evidence represented by m1 and m2 is fully reliable, but we do not know which one.
∎ Dubois and Prade (1988) also proposed a "hybrid" rule, intermediate between the conjunctive and disjunctive sums:

    mH(A) = ∑_{B∩C=A} m1(B) × m2(C) + ∑_{B∩C=∅, B∪C=A} m1(B) × m2(C)

  for any A ⊆ Θ with A ≠ ∅, and mH(∅) = 0.
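A sketch of the hybrid rule in the same representation: non-conflicting products go to the intersection, conflicting ones to the union:

```python
def combine_dubois_prade(m1, m2):
    """Dubois & Prade's hybrid rule: transfer each conflicting product
    m1(B)m2(C) (with B and C disjoint) to the union B | C."""
    combined = {}
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        A = (B & C) or (B | C)  # intersection if non-empty, else union
        combined[A] = combined.get(A, 0.0) + mB * mC
    return combined
```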


∎ Dubois and Prade's "hybrid" rule is not associative, but it usually provides a good summary of partially conflicting items of evidence!
Solution to Zadeh's example:
∎ mH(brain tumor) = 0.0001
∎ mH({meningitis, brain tumor}) = 0.0099
∎ mH({concussion, brain tumor}) = 0.0099
∎ mH({meningitis, concussion}) = 0.9801
∎ mH(∅) = 0
∎ mH(Θ) = 0
⇒ A solution more flexible than Yager's for the transfer of the conflicting masses.


Remarks

∎ Many other suggestions have been made, creating a "jungle" of combination rules.
∎ A good survey of the combination rules and their applications:
[Sentz & Ferson, Combination of Evidence in Dempster-Shafer Theory, Sandia National Laboratories SAND 2002-0835, 2002]

Observation
∎ Most of these works began by analyzing counterintuitive examples arising from existing combination rules, and then proposed new rules giving more reasonable results in those particular situations.
∎ This approach may only yield solutions that are good 'locally', and consequently it is difficult to justify them theoretically.


m⊕(∅) as Conflict?

Liu [Artif. Intell. 170 (2006)] argued that the value m⊕(∅) cannot be used as a measure of conflict between two bodies of evidence; it only represents the mass of uncommitted belief resulting from the combination.
Example – Two identical mass functions
Consider two identical mass functions m1 = m2 on Θ = {θ1, . . . , θ5}:
∎ m1(θi) = m2(θi) = 0.2 for i = 1, . . . , 5
∎ Then m⊕(∅) = 0.8, which is quite high even though there is a total absence of conflict, as the two mass functions are identical.

Remark:
More generally, we always get m⊕(∅) > 0 for two identical mass functions whenever their focal elements form a partition of the frame!


Liu’s Criteria for Conflict


Two mass functions m1 and m2 are said to be in conflict if and only if

    m⊕(∅) > ε   and   difBetP(m1, m2) > ε

where ε ∈ [0, 1] is a threshold of conflict tolerance and difBetP(m1, m2) is defined by

    difBetP(m1, m2) = max_{A⊆Θ} ∣BetPm1(A) − BetPm2(A)∣

and is called the distance between betting commitments of the two mass functions.
+ A comprehensive analysis of combination rules and conflict management:
[P. Smets, Analyzing the combination of conflicting belief functions, Information Fusion 8 (2007)].
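A sketch of the pignistic transformation and Liu's difBetP, assuming m(∅) = 0 and a small frame (the max ranges over all 2^∣Θ∣ subsets):

```python
from itertools import chain, combinations

def betp(m, theta):
    """Pignistic transformation: spread each focal mass uniformly
    over the elements of the focal set."""
    p = {x: 0.0 for x in theta}
    for A, mass in m.items():
        for x in A:
            p[x] += mass / len(A)
    return p

def dif_betp(m1, m2, theta):
    """Liu's distance between betting commitments:
    max over non-empty A of |BetP_m1(A) - BetP_m2(A)|."""
    p1, p2 = betp(m1, theta), betp(m2, theta)
    subsets = chain.from_iterable(
        combinations(theta, r) for r in range(1, len(theta) + 1))
    return max(abs(sum(p1[x] for x in A) - sum(p2[x] for x in A))
               for A in subsets)
```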



Example: Consider the following pair of mass functions on the same frame Θ = {θi ∣ i = 1, . . . , 7}:

    m1({θ1, θ2, θ3, θ4}) = 1 and m2({θ4, θ5, θ6, θ7}) = 1

Then m⊕(∅) = 0, i.e., according to this measure the mass functions are not in conflict at all. However, using the second criterion we easily get

    difBetP(m1, m2) = 0.75

Note that m1 and m2 assign, by definition, the total mass exactly to {θ1, θ2, θ3, θ4} and {θ4, θ5, θ6, θ7}, respectively, and none to any of their proper subsets. So intuitively these two mass functions are partly in conflict. Such a partial conflict cannot be detected by means of m⊕(∅), but it is captured by difBetP(m1, m2), as shown above.


Distance Between two Mass Functions

∎ Let B1 = (Fm1, m1) and B2 = (Fm2, m2) be two bodies of evidence on the same frame Θ.
∎ The distance between m1 and m2, denoted d(m1, m2), is defined as follows:

    d(m1, m2) = max_{A⊆Θ} ∣m1(A) − m2(A)∣

∎ Obviously, d(m1, m2) = 0 if and only if m1 = m2.
This distance is considered a quantitative measure for judging the difference between the two bodies of evidence B1 and B2 (Huynh, 2009).
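In the dict representation this maximum only needs to range over the focal elements of either source, since both masses vanish elsewhere:

```python
def d_mass(m1, m2):
    """Max-norm distance between two mass functions."""
    focals = set(m1) | set(m2)
    return max(abs(m1.get(A, 0.0) - m2.get(A, 0.0)) for A in focals)
```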


Difference Between two BoEs

Denote by difF(m1, m2) the symmetric difference between the two families of focal elements Fm1 and Fm2, i.e.,

    difF(m1, m2) = (Fm1 ∖ Fm2) ∪ (Fm2 ∖ Fm1)

∎ If difF(m1, m2) = Fm1 ∪ Fm2, and A ∩ B = ∅ for any A ∈ Fm1 and B ∈ Fm2, then m⊕(∅) = 1 – full conflict.
∎ If difF(m1, m2) = ∅ and d(m1, m2) > 0, then qualitatively the two sources are not in conflict but have different preferences in distributing their masses over the focal elements.

⇒ This captures how much the two sources differ in their view of where the true hypothesis lies.


Let us denote

    dif(B1, B2) = ⟨d(m1, m2), difF(m1, m2)⟩

and call it the difference measure of the two bodies of evidence.
∎ The conflict between two bodies of evidence originates from either or both of d(m1, m2) (quantitative) and difF(m1, m2) (qualitative).


∎ Liu's criterion of using difBetP(m1, m2) is somewhat weaker than using the direct distance d(m1, m2).
Example: consider again the following pair of mass functions:

    m1({θ1, θ2, θ3, θ4}) = 1 and m2({θ4, θ5, θ6, θ7}) = 1

Then we have d(m1, m2) = 1 whilst difBetP(m1, m2) = 0.75.
∎ In addition, if m1 = m2 then difBetP(m1, m2) = 0, but the converse does not hold in general.


Quantifying Conflict

∎ We have recently argued that only a part of the value m⊕(∅) should be used to quantify a conflict, namely the part qualitatively stemming from difF(m1, m2).
∎ Let

    mcomb⊕(∅) = ∑_{A,B ∈ F1∩F2, A∩B=∅} m1(A) × m2(B)

∎ Clearly, mcomb⊕(∅) is a part of m⊕(∅), and it intuitively represents the mass of uncommitted belief resulting from the combination rather than a conflict.
∎ Therefore, the conflict is properly represented by the remainder of m⊕(∅), i.e.

    mconf⊕(∅) = m⊕(∅) − mcomb⊕(∅)
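A direct transcription of this split into code, in the dict representation used throughout (mcomb counts the disjoint products of focal elements common to both sources):

```python
def conflict_parts(m1, m2):
    """Split the conjunctive mass on the empty set into the 'uncommitted'
    part m_comb (from focal elements shared by both sources) and the
    remainder m_conf, following the definition above."""
    shared = set(m1) & set(m2)
    m_empty = 0.0
    m_comb = 0.0
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        if not (B & C):
            m_empty += mB * mC
            if B in shared and C in shared:
                m_comb += mB * mC
    return m_comb, m_empty - m_comb  # (m_comb, m_conf)
```

For two identical uniform mass functions on five singletons this returns m_comb = 0.8 and m_conf = 0, matching the example on the next slide.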


Remark
With this formulation of conflict, the fact used to question the validity of Dempster's rule, namely that two identical probability measures are always conflicting, becomes inappropriate!

Example
Consider again the two identical mass functions on Θ = {θi ∣ i = 1, . . . , 5}: m1(θi) = m2(θi) = 0.2 for i = 1, . . . , 5. Then we get mcomb⊕(∅) = 0.8 and mconf⊕(∅) = 0, and hence no conflict at all appears between the two.

⇒ Generally, we always get mconf⊕(∅) = 0 whenever the two mass functions being combined are identical.


Zadeh's famous example revisited
Consider two mass functions m1 and m2 defined on Θ = {a, b, c} as:
∎ m1(a) = 0.99, m1(b) = 0.01
∎ m2(c) = 0.99, m2(b) = 0.01
Then we get mconf⊕(∅) = 0.98, which accurately reflects a very high conflict between the two sources of evidence.

Critical remark
With such a high conflict, assuming that both sources are still fully reliable and directly applying Dempster's rule to them (to get unsatisfactory results) seems irrational!


A Solution for Resolving Conflict

Main Idea
∎ According to Smets' two-level view of evidence (Smets, 1994), to make decisions based on evidence, the beliefs encoding the evidence must first be transformed into probabilities using the so-called pignistic transformation.
∎ Guided by this view, we propose to discount each mass function involved in a combination based upon how sure its decision is when it is used alone for decision making.
∎ More particularly, we provide a method for defining the discount rates of the mass functions being combined using the entropy of their corresponding pignistic probability functions.



Ambiguity Measure
∎ Let m1 and m2 be two mass functions on the frame Θ, and let BetPm1 and BetPm2 be the pignistic probability functions of m1 and m2, respectively.
∎ For i = 1, 2, we denote by

    H(mi) = − ∑_{θ∈Θ} BetPmi(θ) log2(BetPmi(θ))

  the Shannon entropy of the pignistic probability distribution BetPmi.
∎ This measure was used in Jousselme et al. (2006) as an ambiguity measure of belief functions.
∎ Clearly, H(mi) ∈ [0, log2(∣Θ∣)].



Entropy-based Discount Rate:
∎ The discount rate of BPA mi (i = 1, 2), denoted δ(mi), is defined by

    δ(mi) = H(mi) / log2(∣Θ∣)

∎ That is, the higher the uncertainty (in its decision) of a source of evidence, the higher the discount rate applied to it.
General Discounting and Combination Rule:

    m⊕ = m1^(1−δ(m1)) ⊕ m2^(1−δ(m2))

where ⊕ is a combination operator in general and mi^(1−δ(mi)) is the discounted mass function obtained from mi.
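A sketch of the two ingredients, reusing betp from the earlier sketch. Here discount(m, alpha, theta) keeps the fraction alpha of each mass and moves the rest to the whole frame, so the slide's mi^(1−δ(mi)) corresponds to discount(mi, 1 - delta, theta):

```python
import math

def discount(m, alpha, theta):
    """Classical discounting: retain a fraction alpha of each mass and
    transfer the remainder 1 - alpha to the whole frame theta."""
    theta = frozenset(theta)
    out = {A: alpha * v for A, v in m.items() if A != theta}
    out[theta] = alpha * m.get(theta, 0.0) + (1.0 - alpha)
    return out

def entropy_discount_rate(m, theta):
    """delta(m) = H(BetP_m) / log2(|Theta|): normalized pignistic entropy."""
    p = betp(m, theta)
    h = -sum(v * math.log2(v) for v in p.values() if v > 0)
    return h / math.log2(len(theta))
```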


Selecting a Combination Rule


The information from dif(B1 , B2 ) and mconf
⊕ (∅) can properly provide
helpful suggestions for conflict management on selecting appropriate
combination rules in some typical situations.

∎ If dif F (m1 , m2 ) = F1 ∪ F2 and A ∩ B = ∅ for any A ∈ Fm1 and


B ∈ Fm2 , we have mconf ⊕ (∅) = 1 and two sources are fully conflict.

ê A discounting and then combination strategy should be applied, where


different attitudes may suggest different combination rules for use.

∎ If dif F (m1 , m2 ) = ∅ and d(m1 , m2 ) > 0, we have mconf


⊕ (∅) = 0: Two
sources qualitatively are not in conflict but having different beliefs
attributed to focal elements.
ê A compromise attitude may suggest to use the trade-off rule (Dubois and
Prade, 1988), or its special case of averaging operator.


∎ If difF(m1, m2) ≠ ∅, then we have d(m1, m2) > 0.
⇒ If m⊕(Θ) = m1(Θ) × m2(Θ) > 0, the two sources may provide complementary information to each other, and Dempster's rule can be applied.
⇒ If m⊕(Θ) = 0, the two sources may be in partial conflict; then, depending on whether the value mconf⊕(∅) is tolerated and whether information on meta-belief is available, one may apply a discounting-then-combination strategy or a disjunctive combination rule.

✓ Reference: Huynh VN, Discounting and combination scheme in evidence theory for dealing with conflict in information fusion. MDAI 2009, Springer.

Applications to Ensemble Learning

D-S Theory and Its Applications

∎ D-S theory has been theoretically well studied and widely applied in areas such as
● Classification, Identification, Recognition
● Decision Making, Expert Systems
● Fault Detection and Failure Diagnosis
● Image Processing, Medical Applications
● Risk and Reliability
● Robotics, Multiple Sensors
● Signal Processing
● Etc.

+ A good collection of references to applications of D-S theory:


[Sentz & Ferson, Combination of Evidence in Dempster-Shafer Theory,
Sandia National Laboratories SAND 2002-0835, 2002]


Classifier Combination

Observation
As observed in studies of machine learning systems:
∎ the sets of patterns misclassified by different classification systems do not necessarily overlap;
∎ different classifiers potentially offer complementary information about the patterns to be classified.

Remark: This observation strongly motivated the interest in combining classifiers over the last two decades (Kittler et al., IEEE PAMI 1998).
Combination Scenarios:
∎ All classifiers use the same representation of the input
∎ Each classifier uses its own representation of the input


D-S theory in Classifier Combination

∎ The application of D-S theory to classifier combination has received attention since the early 1990s.
∎ In the context of a single-class classification problem, the frame of discernment is often modeled as the set of all possible classes that can be assigned to an input pattern.
∎ Given an input pattern, each individual classifier produces an output that is considered a source of information for classifying that pattern.
∎ The sources of information from all classifiers participating in the combination process are combined to make the final classification decision.


∎ Let C = {c1, c2, . . . , cM} be the set of classes – the frame of discernment of the problem.
∎ Assume that we have R classifiers: {ψ1, . . . , ψR}.
∎ For an input x, each classifier ψi produces an output ψi(x) defined as

    ψi(x) = [si1, . . . , siM]

  where sij indicates the degree of confidence or support for saying that "the pattern x is assigned to class cj according to classifier ψi."
⇒ Note that sij can be a binary value or a continuous numeric value, and its semantic interpretation depends on the type of learning algorithm used to build ψi.


Xu’s Combination Method


∎ Each individual classifier produces a crisp decision on classifying an input x, which is used as the evidence coming from the corresponding classifier.
∎ This evidence is then associated with prior knowledge, defined in terms of the classifier's performance indexes, to define its corresponding mass function.
∎ The performance indexes of a classifier are its recognition, substitution and rejection rates, obtained by testing the classifier on a test sample set.

✓ Reference: Xu et al., Several methods for combining multiple classifiers and their applications in handwritten character recognition. IEEE Trans. SMC 22 (1992).


∎ Let the recognition rate and substitution rate of ψi be ri and si (usually ri + si < 1, due to the rejection action), respectively.
∎ The mass function mi derived from ψi(x) is defined by:
1. If ψi rejects x, i.e. ψi(x) = [0, . . . , 0], then mi has the single focal element C, with mi(C) = 1.
2. If ψi(x) = [0, . . . , 0, sij = 1, 0, . . . , 0], then mi({cj}) = ri, mi(¬{cj}) = si, where ¬{cj} = C ∖ {cj}, and mi(C) = 1 − ri − si.
∎ In a similar way one obtains all mi (i = 1, . . . , R) from the R classifiers ψi (i = 1, . . . , R).
∎ Dempster's rule is then applied to combine these mi's into a combined m = m1 ⊕ . . . ⊕ mR, which is used to make the final decision on the classification of x.
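A sketch of Xu's mass assignment (names hypothetical; r_rate and s_rate are the classifier's recognition and substitution rates):

```python
from functools import reduce

def xu_mass(decision, r_rate, s_rate, classes):
    """Mass function from one crisp classifier decision (Xu et al., 1992).
    decision is the predicted class label, or None if the classifier rejects."""
    C = frozenset(classes)
    if decision is None:
        return {C: 1.0}  # rejection: total ignorance
    cj = frozenset({decision})
    return {cj: r_rate,
            C - cj: s_rate,  # mass on the complement of c_j
            C: 1.0 - r_rate - s_rate}

# combined evidence from R classifiers, e.g.:
# m = reduce(combine_dempster,
#            [xu_mass(d, r, s, classes) for d, r, s in outputs])
```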


Rogova’s Combination Method

∎ Rogova used a proximity measure between a reference vector for each class and a classifier's output vector.
∎ The reference vector is the mean vector µij of the output set of each classifier ψi for each class cj.
∎ Then, for any input pattern x, the proximity measures dij = φ(µij, ψi(x)) are transformed into the following mass functions:

    mi({cj}) = dij,   mi(C) = 1 − dij
    m¬i(¬{cj}) = 1 − ∏_{k≠j}(1 − dik),   m¬i(C) = ∏_{k≠j}(1 − dik)

which together constitute the knowledge about cj.


∎ Hence, these mi and m¬i are combined to define the evidence from classifier ψi on classifying x as mi ⊕ m¬i.
∎ Finally, the evidence from all classifiers is combined using Dempster's rule to obtain an overall mass function for making the final classification decision.

✓ Reference: Rogova, Combining the results of several neural network classifiers. Neural Networks 7 (1994).
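A sketch of the per-classifier evidence in Rogova's scheme, assuming the proximities d_ik are already computed and lie in [0, 1] (the choice of φ and of the reference vectors is left to the caller):

```python
def rogova_evidence(proximities, j, classes):
    """Evidence about class c_j from one classifier (Rogova, 1994):
    combine the pro-c_j and contra-c_j mass functions by Dempster's rule."""
    C = frozenset(classes)
    cj = frozenset({classes[j]})
    prod = 1.0
    for k, d in enumerate(proximities):
        if k != j:
            prod *= 1.0 - d
    m_pro = {cj: proximities[j], C: 1.0 - proximities[j]}
    m_con = {C - cj: 1.0 - prod, C: prod}
    return combine_dempster(m_pro, m_con)
```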


Al-Ani & Deriche’s Combination Method

∎ The distance between the output classification vector provided by each single classifier and a reference vector is used to estimate the mass functions.
∎ These mass functions are then combined using Dempster's rule to obtain a new output vector that represents the combined confidence in each class label.
∎ However, instead of defining the reference vector as the mean vector of a classifier's output set for a class, as in Rogova's work, it is estimated so that the mean square error (MSE) between the new output vector obtained after combination and the target vector of a training data set is minimized.
⇒ This interestingly makes their combination algorithm trainable.



∎ Given an input x, the mass function mi derived from classifier ψi is defined as follows:

    mi({cj}) = dji / (∑_{k=1}^{M} dki + gi)
    mi(C) = gi / (∑_{k=1}^{M} dki + gi)

  where dji = exp(−∥vji − ψi(x)∥²), vji is a reference vector and gi is a coefficient.
∎ Both vji and gi are estimated via the MSE-minimizing learning process.

✓ Reference: Al-Ani & Deriche, A new technique for combining multiple classifiers using the Dempster-Shafer theory of evidence, Journal of Artificial Intelligence Research 17 (2002).
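A sketch of this mass assignment, assuming refs[j] holds the trained reference vector for class j and g the trained coefficient (the MSE training loop itself is omitted):

```python
import math

def alani_mass(output, refs, g, classes):
    """Mass function from one classifier's output vector
    (Al-Ani & Deriche, 2002), given trained reference vectors."""
    C = frozenset(classes)
    d = [math.exp(-sum((r - o) ** 2 for r, o in zip(refs[j], output)))
         for j in range(len(classes))]
    z = sum(d) + g
    m = {frozenset({c}): d[j] / z for j, c in enumerate(classes)}
    m[C] = g / z
    return m
```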

Bell’s Combination Method

∎ A new method and technique for representing and combining the outputs of different classifiers for text categorization.
∎ Unlike all the methods mentioned above, Bell et al. (2005) directly used the outputs of individual classifiers to define so-called 2-points focused mass functions.
∎ These 2-points focused mass functions are then combined using Dempster's rule to obtain an overall mass function for making the final classification decision.

✓ Reference: Bell, Guan & Bi, On combining classifier mass functions for text categorization, IEEE Trans. KDE 17 (2005).



∎ Given an input x, the output ψi(x) from classifier ψi is normalized:

    pi(cj) = sij / ∑_{k=1}^{M} sik,   for j = 1, . . . , M

∎ Then the collection {pi(cj)}_{j=1}^{M} is arranged so that

    pi(ci1) ≥ pi(ci2) ≥ . . . ≥ pi(ciM)

∎ The mass function mi induced from ψi on the classification of x:

    mi({ci1}) = pi(ci1)
    mi({ci2}) = pi(ci2)
    mi(C) = 1 − mi({ci1}) − mi({ci2})

∎ This mass function is called the 2-points focused mass function, and the set {{ci1}, {ci2}, C} is referred to as a triplet.
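A sketch of the 2-points focused mass function (names illustrative):

```python
def bell_triplet_mass(scores, classes):
    """Bell et al.'s triplet mass function: the two top-scoring classes
    become singleton focal elements; the rest of the mass goes to C."""
    total = sum(scores)
    p = [s / total for s in scores]
    order = sorted(range(len(classes)), key=lambda j: p[j], reverse=True)
    j1, j2 = order[0], order[1]
    return {frozenset({classes[j1]}): p[j1],
            frozenset({classes[j2]}): p[j2],
            frozenset(classes): 1.0 - p[j1] - p[j2]}
```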

Le’s Combination Method

∎ Le et al. built Naive Bayes classifiers corresponding to distinct representations of the input.
∎ They then weighted each classifier by its accuracy, obtained by testing on a test sample set, where the weighting is modeled by the discounting operator.
∎ Finally, the discounted mass functions are combined to obtain the final mass function, which is used for making the classification decision.

✓ Reference: Le, Huynh, Shimazu & Nakamori, Combining classifiers for word sense disambiguation based on Dempster-Shafer theory and OWA operators, Data & Knowledge Engineering 63 (2007).


∎ Let fi be the i-th representation of an input x, and let the classifier ψi built on fi produce a posterior probability distribution P(⋅∣fi) on C.
∎ Assume that αi is the weight of ψi, defined by its accuracy.
∎ Then the piece of evidence represented by P(⋅∣fi) is discounted at a discount rate of (1 − αi), resulting in a mass function mi defined by

    mi({cj}) = αi × P(cj∣fi), for j = 1, . . . , M
    mi(C) = 1 − αi

∎ These discounted mass functions are then combined using either Dempster's rule or the averaging operator.
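A sketch of the accuracy-weighted evidence, assuming posterior maps each class label to P(c ∣ fi):

```python
def le_mass(posterior, alpha, classes):
    """Le et al.'s (2007) evidence: discount the posterior P(. | f_i)
    at rate 1 - alpha, sending the removed mass to the whole frame C."""
    m = {frozenset({c}): alpha * p for c, p in posterior.items()}
    m[frozenset(classes)] = 1.0 - alpha
    return m
```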


Remarks on Le’s Method


∎ This method of weighting clearly focuses on only the strength of
individual classifiers, which is defined by testing them on the designed
sample data set.
∎ Therefore it does not be influenced by an input pattern under
classification.
∎ However, the information quality of soft decisions or outputs provided
by individual classifiers might vary from pattern to pattern.
ê The general discounting and combination strategy for solving conflict
discussed above has been applied to classifier combination.

3 Reference: Huynh, Nguyen & Le, Adaptively entropy-based weighting


classifiers in combination using Dempster-Shafer theory for word sense
disambiguation, Computer Speech and Language 24 (2010).


Lattice Intervals

∎ Let (L, ≤) be a lattice.
∎ A (lattice) interval of L is defined as

    [a, b] = {x ∈ L ∣ a ≤ x ≤ b}

  for a, b ∈ L with a ≤ b.
∎ Let IL be the set of intervals of L, including the empty set.
∎ (IL, ⊆) is a lattice with
● meet (⊓) = intersection (∩)
● join (⊔) defined by [a, b] ⊔ [c, d] = [a ∧ c, b ∨ d]
● least element = ∅; greatest element = L


Lattice of Intervals - Example

[Figure: a 4-element lattice L = {x, y, z, t} and its lattice of intervals (IL, ⊆)]


Application of Belief Structures on (IL , ⊆)

∎ Application to multi-label classification:
+ Denoeux, Younes & Abdallah, Representing uncertainty on set-valued variables using belief functions, Artif. Intell. 174 (2010).
∎ Application to ensemble clustering:
+ Masson & Denoeux, Belief functions and cluster ensembles. ECSQARU 2009 (Springer).


Partitions of a Finite Set

∎ In clustering, the frame of discernment is the set of all partitions of a finite data set D, denoted P(D).
∎ This set can be partially ordered using the following relation:
● a partition p is said to be finer than a partition p′ (or, equivalently, p′ is coarser than p), denoted p ⪯ p′, if the clusters of p can be obtained by splitting those of p′.
∎ The poset (P(D), ⪯) is a lattice.


Ensemble Clustering

∎ Ensemble clustering aims at combining the outputs of several clustering algorithms into a single clustering structure.
∎ This problem can be addressed using D-S theory by assuming that:
● There exists a "true" partition p∗.
● Each clusterer provides evidence about p∗.
● The evidence from multiple clusterers can be combined to draw plausible conclusions about p∗.
∎ To implement this scheme, we need to manipulate mass functions whose focal elements are sets of partitions.
∎ This is feasible if we restrict ourselves to intervals of the lattice (P(D), ⪯).

An Illustrative Application

Word Sense Disambiguation

Polysemous Words
A polysemous word has more than one possible meaning (sense). The intended sense is determined by the context in which the word appears.

Example: "interest"
∎ Context 1: "My guess would be that interest rates will decline moderately into the spring of 1961."
∎ Context 2: "A few of his examples are of very great interest, and the whole discussion of some importance for theory."

WSD
∎ WSD involves associating a given word in a text or discourse with a particular sense among the numerous potential senses of that word.
∎ It is an "intermediate task" necessary for accomplishing most natural language processing tasks.

WSD as a classification problem
Given an ambiguous word w:
∎ c1, c2, . . . , cm – the possible senses (classes) of w;
∎ N contexts of w are given, in each of which w is tagged with the right sense (the training data);
∎ for a new occurrence of w in a context C, WSD aims at identifying the most appropriate sense of w given C.

How to Use Context in WSD?

Generally, a given context C can be used in two ways.

The bag-of-words approach
The context is considered as the words in some window surrounding the target word w.

The relational information based approach
The context is considered in terms of some relation to the target, such as
1. distance from the target,
2. syntactic relations,
3. phrasal collocation, etc.


Example (Target word: interest)
"My[PRP] guess[NN] would[MD] be[VB] that[IN] interest[NN] rates[NNS] will[MD] decline[VBP] moderately[RB] into[IN] the[DT] spring[NN] of[IN] 1961[CD]"

Bag of words:                     my, guess, rates, decline, . . .
Collocations of words:            interest rates, interest rates will, that interest, . . .
Collocations of part-of-speech:   interest [NNS], interest [NNS] [MD], [IN] interest, . . .
(word, position):                 (be, -2), (that, -1), (rates, 1), . . .
(part-of-speech tag, position):   (VB, -2), (IN, -1), (NNS, 1), . . .

Classifier Combination – First Scenario

Individual Classifiers in Combination
∎ Naive Bayes (NB),
∎ Maximum Entropy Model (MEM),
∎ Support Vector Machines (SVM).
The selection of these learning methods is basically guided by the direct use of their output results for defining mass functions.

Combination Algorithms
1. Discounting-and-Dempster's combination algorithm (DCA1)
2. Discounting-and-averaging combination algorithm (DCA2)


Classifier Combination – Second Scenario

Individual Classifiers in Combination
∎ The same NB learning algorithm is used for all individual classifiers; however, each classifier is built using a distinct set of features corresponding to a distinct representation of the polysemous word to be disambiguated.
Note that NB is commonly accepted as one of the learning methods representing state-of-the-art accuracy in supervised WSD (Escudero, 2000).

Combination Algorithms
1. Discounting-and-Dempster's combination algorithm (DCA1)
2. Discounting-and-averaging combination algorithm (DCA2)


The Experimental Results

Table: Experimental results for the first scenario of combination

             Individual classifiers     Combined classifiers
%            NB     MEM    SVM          DCA1    DCA2
Senseval-2   65.6   65.5   63.5         66.3    66.5
Senseval-3   72.9   72.0   72.5         73.3    73.3

Remark:
The results yielded by the discounting-and-averaging combination algorithm are comparable to, or even better than, those given by the discounting-and-orthogonal-sum combination algorithm, while the former is computationally simpler than the latter.


Table: Experimental results for the second scenario of combination

             Individual classifiers                       Combined classifiers
%            C1     C2     C3     C4     C5     C6        DCA1    DCA2
Senseval-2   56.7   54.6   54.7   56.8   56.8   52.5      64.4    65.0
Senseval-3   62.4   62.3   64.1   61.9   63.9   59.5      71.0    72.3


Table: A comparison with the best systems in the Senseval-2 and Senseval-3 contests

             Best systems   Accuracy-based weighting   Adaptively weighting
%                           DS1 [Le et al. (2007)]     DCA2
Senseval-2   64.2           64.7                       66.3
Senseval-3   72.9           72.4                       73.3

Remark:
Both combination algorithms derived from the discounting and combination scheme yield an improvement in overall accuracy compared to previous work on WSD.


THANK YOU FOR LISTENING!!!
