Bayesian Classification
Dr. Navneet Goyal
BITS, Pilani
Bayesian Classification
What are Bayesian Classifiers?
Statistical Classifiers
Predict class membership
probabilities
Based on Bayes Theorem
Naïve Bayesian Classifier
Computationally Simple
Comparable performance with DT
and NN classifiers
Bayesian Classification
Probabilistic learning: Calculate explicit
probabilities for hypothesis, among the
most practical approaches to certain
types of learning problems
Incremental: Each training example can
incrementally increase/decrease the
probability that a hypothesis is correct.
Prior knowledge can be combined with
observed data.
Bayes Theorem
Let X be a data sample whose class
label is unknown
Let H be some hypothesis that X
belongs to a class C
For classification determine P(H/X)
P(H/X) is the probability that H holds
given the observed data sample X
P(H/X) is posterior probability
Bayes Theorem
Example: Sample space: All Fruits
X is “round” and “red”
H= hypothesis that X is an Apple
P(H/X) is our confidence that X is an
apple given that X is “round” and “red”
P(H) is Prior Probability of H, ie, the
probability that any given data sample is
an apple regardless of how it looks
P(H/X) is based on more information
Note that P(H) is independent of X
Bayes Theorem
Example: Sample space: All Fruits
P(X/H) ?
It is the probability that X is round and
red given that we know that it is true
that X is an apple
Here P(X) is prior probability =
P(data sample from our set of fruits is
red and round)
Estimating Probabilities
P(X), P(H), and P(X/H) may be
estimated from given data
Bayes Theorem
P(H | X ) P( X | H ) P( H )
P( X )
Use of Bayes Theorem in Naïve Bayesian
Classifier!!
Naïve Bayesian Classification
Also called Simple BC
Why Naïve/Simple??
Class Conditional Independence
Effectof an attribute values on a given class is
independent of the values of other attributes
This assumption simplifies computations
Naïve Bayesian Classification
Steps Involved
1. Each data sample is of the type
X=(xi) i =1(1)n, where xi is the values of X for
attribute Ai
2. Suppose there are m classes Ci, i=1(1)m.
X Ci iff
P(Ci|X) > P(Cj|X) for 1 j m, ji
i.e BC assigns X to class Ci having highest
Naïve Bayesian Classification
The class for which P(Ci|X) is maximized is called the
maximum posterior hypothesis.
From Bayes Theorem
P(Ci | X ) P( X | Ci) P(Ci)
P( X )
3. P(X) is constant. Only P( X | Ci)P(Ci) need be
maximized.
If class prior probabilities not known, then assume
all classes to be equally likely
Otherwise maximize P( X | Ci)P(Ci)
P(Ci) = Si/S
Problem: computing P(X|Ci) is unfeasible!
(find out how you would find it and why it is infeasible)
Naïve Bayesian Classification
4. Naïve assumption: attribute independence
P( X | Ci) = P(x1,…,xn|C) = P(xk|C)
5. In order to classify an unknown sample
X, evaluate P( X | Ci)P(Ci) for each class Ci.
Sample X is assigned to the class Ci iff
P(X|Ci)P(Ci) > P(X|Cj) P(Cj) for 1 j m, ji
Naïve Bayesian Classification
EXAMPLE
Age Income Student Credit_rating Class:Buys_comp
<=30 HIGH N FAIR N
<=30 HIGH N EXCELLENT N
31…..40 HIGH N FAIR Y
>40 MEDIUM N FAIR Y
>40 LOW Y FAIR Y
>40 LOW Y EXCELLENT N
31…..40 LOW Y EXCELLENT Y
<=30 MEDIUM N FAIR N
<=30 LOW Y FAIR Y
>40 MEDIUM Y FAIR Y
<=30 MEDIUM Y EXCELLENT Y
31….40 MEDIUM N EXCELLENT Y
31….40 HIGH Y FAIR Y
>40 MEDIUM N EXCELLENT N
Naïve Bayesian Classification
EXAMPLE
X= (<=30,MEDIUM, Y,FAIR, ???)
We need to maximize:
P(X|Ci)P(Ci) for i=1,2.
P(Ci) is computed from training sample
P(buys_comp=Y) = 9/14 = 0.643
P(buys_comp=N) = 5/14 = 0.357
How to calculate P(X|Ci)P(Ci) for i=1,2?
P(X|Ci) = P(x1, x2, x3, x4|C) = P(xk|C)
Naïve Bayesian Classification
EXAMPLE
P(age<=30 | buys_comp=Y)=2/9=0.222
P(age<=30 | buys_comp=N)=3/5=0.600
P(income=medium | buys_comp=Y)=4/9=0.444
P(income=medium | buys_comp=N)=2/5=0.400
P(student=Y | buys_comp=Y)=6/9=0.667
P(student=Y | buys_comp=N)=1/5=0.200
P(credit_rating=FAIR | buys_comp=Y)=6/9=0.667
P(credit_rating=FAIR | buys_comp=N)=2/5=0.400
Naïve Bayesian Classification
EXAMPLE
P(X | buys_comp=Y)=0.222*0.444*0.667*0.667=0.044
P(X | buys_comp=N)=0.600*0.400*0.200*0.400=0.019
P(X | buys_comp=Y)P(buys_comp=Y) = 0.044*0.643=0.028
P(X | buys_comp=N)P(buys_comp=N) = 0.019*0.357=0.007
CONCLUSION: X buys computer
Naïve Bayes Classifier:
Issues
Probability values ZERO!
Recall what you observed in WEKA!
If Ak is continuous valued!
Recall what you observed in WEKA!
If there are no tuples in the training set corresponding
to students for the class buys-comp=NO
P(student = Y|buys_comp=N)=0
Implications?
Solution?
Naïve Bayes Classifier:
Issues
Laplacian Correction or Laplace Estimator
Philosophy – we assume that the training data set is so large
that adding one to each count that we need would only make
a negligible difference in the estimated prob. value.
Example: D (1000)
Class: buys_comp=Y
income=low – zero tuples
income=medium – 990 tuples
income=high – 10 tuples
Without Laplacian Correction the probs. are 0, 0.990, and
0.010
With Laplacian correction: 1/1003 = 0.001,
991/1003=0.988, and 11/1003=0.011 respectively.
Naïve Bayes Classifier:
Issues
Continuous variable: need to do more
work than categorical attributes!
It is typically assumed to have a
Guassian distribution with a mean
and a std. dev. .
Do it yourself! And cross check with
WEKA!
Naïve Bayes (Summary)
Robust to isolated noise points
Handle missing values by ignoring the
instance during probability estimate
calculations
Robust to irrelevant attributes
Independence assumption may not hold for
some attributes
Use other techniques such as Bayesian Belief
Networks (BBN)
Probability Calculations
Age Income Student Credit_rating Class:Buys_comp
<=30 HIGH N FAIR N
No. of attributes = 4
<=30 HIGH N EXCELLENT N
Distinct values = 3,3,3,3
31…..40 HIGH N FAIR Y
No. of classes = 2
>40 MEDIUM N FAIR Y
>40 LOW Y GOOD Y
Total no. of probability
>40 LOW Y EXCELLENT N
calculations in NBC
= 4*3*2 = 24!
31…..40 LOW Y EXCELLENT Y
<=30 MEDIUM N FAIR N
<=30 LOW Y GOOD Y
What if conditional ind. was not
>40 MEDIUM Y FAIR Y
assumed?
<=30 MEDIUM Y EXCELLENT Y
O(kp) for p k-valued attributes
31….40 MEDIUM N EXCELLENT Y Multiply by m classes.
31….40 HIGH Y FAIR Y
>40 MEDIUM N EXCELLENT N
Bayesian Belief Networks
Naïve BC assumes Class Conditional Independence
This assumption simplifies computations
When this assumption holds true, Naïve BC is most
accurate compared to all other classifiers
In real problems, dependencies do exist between
variables
2 methods to overcome this limitation of NBC
Bayesian networks, that combine Bayesian
reasoning with causal relationships between
attributes
Decision trees, that reason on one attribute at the
time, considering ‘most important’ attributes first
Conditional Independence
Let X, Y, & Z denote three set of random
variables. The variables in X are said to
be conditionally independent of Y, given
Z if
P(X|Y,Z) = P(X|Z)
Rel. bet. a person’s arm length and
his/her reading skills!!
One might observe that people with
longer arms tend to have higher levels of
reading skills
How do you explain this rel.?
Conditional Independence
Can be explained through a confounding
factor, AGE
A young child tends to have short arms
and lacks the reading skills of an adult
If the age of a person is fixed, then the
observed rel. between arm length and
reading skills disappears
We can this conclude that arm length and
reading skills are conditionally
independent when the age variable is
fixed
P(reading skills| long arms,age) = P(reading skills|age)
Conditional Independence
P(X,Y|Z) = P(X,Y,Z)/P(Z)
= P(X,Y,Z)/P(Y,Z) x P(Y,Z)/P(Z)
= P(X|Y,Z) x P(Y|Z)
= P(X|Z) x P(Y|Z)
This explains the Naïve Bayesian:
P(X|Ci) = P(x1, x2, x3,…,xn|C) = P(xk|C)
Bayesian Belief Networks
Belief Networks
Bayesian Networks
Probabilistic Networks
Bayesian Belief Networks
Conditional Independence (CI) assumption
made by NBC may be too rigid
Specially for classification problems in
which attributes are somewhat correlated
We need a more flexible approach for
modeling the class conditional probabilities
P(X|Ci) = P(x1, x2, x3,…,xn|C)
instead of requiring that all the attributes
be CI given the class, BBN allows us to
specify which pair of attributes are CI
Bayesian Belief Networks
Belief Networks has 2 components
Directed Acyclic Graph (DAG)
Conditional Probability Table (CPT)
Bayesian Belief Networks
A node in BBN is CI of its non-
descendants, if its parents are known
Bayesian Belief Networks
Family
Smoker
History
(FH, S) (FH, ~S)(~FH, S) (~FH, ~S)
LC 0.8 0.5 0.7 0.1
LungCancer Emphysema ~LC 0.2 0.5 0.3 0.9
The conditional probability table
for the variable LungCancer
PositiveXRay Dyspnea
Bayesian Belief Networks
Bayesian Belief Networks
6 boolean variables
Arcs allow representation of causal
knowledge
Having lung cancer is influenced by family
history and smoking
PositiveXray is ind. of whether the paient has
a FH or if he/she is a smoker given that we
know that the patient has lung cancer
Once we know the outcome of Lung Cancer,
FH & Smoker do not provide any additional
info. about PositiveXray
Bayesian Belief Networks
Lung Cancer is CI of Emphysema, given its
parents, FH & Smoker
BBN has a Conditional Probability Table (CPT) for
each variable in the DAG
CPT for a variable Y specifies the conditional
distribution P(Y|parents(Y))
(FH, S) (FH, ~S)(~FH, S) (~FH, ~S)
LC 0.8 0.5 0.7 0.1 P(LC=Y|FH=Y,S=Y) = 0.8
~LC 0.2 0.5 0.3 0.9 P(LC=N|FH=N,S=N) = 0.9
CPT for LungCancer
Bayesian Belief Networks
Let X=(x1, x2,…,xn) be a tuple described by variables
or attributes Y1, Y2, …,Yn respectively
Each variable is CI of its nondescendants given its
parents
Allows he DAG to provide a complete representation of
the existing Joint Probability Distribution by:
P(x1, x2, x3,…,xn)=P(xi|Parents(Yi))
where P(x1, x2, x3,…,xn) is the prob. of a particular
combination of values of X, and the values for P(xi|
Parents(Yi)) correspond to the entries in CPT for Yi
Bayesian Belief Networks
A node within the network can selected as an
‘output’ node, representing a class label attribute
More than one output node
Rather than returning a single class label, the
classification process can return a probability
distribution that gives the probability of each
class
Training BBN!!
Training BBN
Number of scenarios possible
Network topology may be given in advance or
inferred from data
Variables may be observable or hidden (mising
or incomplete data) in all or some of the training
tuples
Many algos for learning the network topology
from the training data given observable attibutes
If network topology is known and the variables
observable, training is straightforward (just
compute CPT entries)
Training BBNs
Topology given, but some variables are hidden
Gradient Descent (self study)
Falls under the class of algos called Adaptive
Probabilistic Networks
BBNs are computationally expensive
BBNs provide explicit representation of Causal
structure
Domain experts can provide prior knowledge to the
training process in the form of topology and/or in
conditional probability values
This leads to significant improvement in the learning
process