Bayesian Classification

Dr. Navneet Goyal


BITS, Pilani
Bayesian Classification
 What are Bayesian Classifiers?
   Statistical classifiers
   Predict class membership probabilities
   Based on Bayes' theorem
 Naïve Bayesian Classifier
   Computationally simple
   Comparable performance with decision tree (DT) and neural network (NN) classifiers
Bayesian Classification
 Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
 Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
Bayes' Theorem
 Let X be a data sample whose class label is unknown
 Let H be some hypothesis that X belongs to a class C
 For classification, determine P(H|X)
 P(H|X) is the probability that H holds given the observed data sample X
 P(H|X) is the posterior probability
Bayes' Theorem
Example: Sample space: all fruits
X is "round" and "red"
H = hypothesis that X is an apple
P(H|X) is our confidence that X is an apple given that X is "round" and "red"
 P(H) is the prior probability of H, i.e., the probability that any given data sample is an apple, regardless of how it looks
 P(H|X) is based on more information
 Note that P(H) is independent of X
Bayes' Theorem
Example: Sample space: all fruits
 P(X|H)?
 It is the probability that X is round and red, given that we know X is an apple
 Here P(X) is the prior probability = P(a data sample from our set of fruits is red and round)
Estimating Probabilities
 P(X), P(H), and P(X|H) may be estimated from the given data
 Bayes' Theorem:

   P(H|X) = P(X|H) P(H) / P(X)

 Use of Bayes' Theorem in the Naïve Bayesian Classifier!!
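A minimal sketch of the theorem in Python for the fruit example above. The prior P(H), likelihood P(X|H), and evidence P(X) are made-up illustrative numbers, not values from the slides.

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
# Hypothetical numbers for the fruit example (not from the slides):
p_h = 0.20           # prior: fraction of fruits that are apples
p_x_given_h = 0.90   # likelihood: an apple is round and red
p_x = 0.30           # evidence: any fruit is round and red

p_h_given_x = p_x_given_h * p_h / p_x   # posterior
print(p_h_given_x)   # 0.6 -> confidence that a round, red fruit is an apple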
Naïve Bayesian Classification
 Also called the Simple Bayesian Classifier
 Why Naïve/Simple?? Class Conditional Independence:
   The effect of an attribute value on a given class is independent of the values of the other attributes
   This assumption simplifies computations
Naïve Bayesian Classification
Steps Involved
1. Each data sample is of the type X = (xi), i = 1(1)n, where xi is the value of X for attribute Ai
2. Suppose there are m classes Ci, i = 1(1)m.
   X ∈ Ci iff
   P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
   i.e., the BC assigns X to the class Ci having the highest P(Ci|X)
Naïve Bayesian Classification
The class Ci for which P(Ci|X) is maximized is called the maximum posterior hypothesis.
From Bayes' Theorem:
   P(Ci|X) = P(X|Ci) P(Ci) / P(X)
3. P(X) is constant for all classes, so only P(X|Ci) P(Ci) needs to be maximized.
 If the class prior probabilities are not known, assume all classes to be equally likely
 Otherwise maximize P(X|Ci) P(Ci), with P(Ci) = si/s, where si is the number of training samples of class Ci and s is the total number of training samples
Problem: computing P(X|Ci) directly is infeasible!
(Find out how you would compute it and why it is infeasible.)
Naïve Bayesian Classification
4. Naïve assumption: attribute independence
   P(X|Ci) = P(x1, …, xn|Ci) = ∏k P(xk|Ci)
5. To classify an unknown sample X, evaluate P(X|Ci) P(Ci) for each class Ci. Sample X is assigned to the class Ci iff
   P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i
   (see the sketch below)
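A minimal Python sketch of this decision rule for categorical attributes, using simple frequency counts and no smoothing; the function and variable names are illustrative, not from the slides.

from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(Ci) and P(xk|Ci) by counting (no smoothing)."""
    prior = Counter(labels)
    cond = defaultdict(Counter)              # counts indexed by (attribute index, class)
    for row, c in zip(rows, labels):
        for k, v in enumerate(row):
            cond[(k, c)][v] += 1
    return prior, cond, len(labels)

def classify_nb(x, prior, cond, n):
    """Return the class Ci maximizing P(X|Ci) * P(Ci)."""
    best_class, best_score = None, -1.0
    for c, count_c in prior.items():
        score = count_c / n                      # P(Ci)
        for k, v in enumerate(x):
            score *= cond[(k, c)][v] / count_c   # P(xk|Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class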
Naïve Bayesian Classification
EXAMPLE
Age     Income   Student  Credit_rating  Class: Buys_comp
<=30    HIGH     N        FAIR           N
<=30    HIGH     N        EXCELLENT      N
31…40   HIGH     N        FAIR           Y
>40     MEDIUM   N        FAIR           Y
>40     LOW      Y        FAIR           Y
>40     LOW      Y        EXCELLENT      N
31…40   LOW      Y        EXCELLENT      Y
<=30    MEDIUM   N        FAIR           N
<=30    LOW      Y        FAIR           Y
>40     MEDIUM   Y        FAIR           Y
<=30    MEDIUM   Y        EXCELLENT      Y
31…40   MEDIUM   N        EXCELLENT      Y
31…40   HIGH     Y        FAIR           Y
>40     MEDIUM   N        EXCELLENT      N
Naïve Bayesian Classification
EXAMPLE
X = (<=30, MEDIUM, Y, FAIR, ???)
We need to maximize P(X|Ci) P(Ci) for i = 1, 2.
P(Ci) is computed from the training sample:
P(buys_comp=Y) = 9/14 = 0.643
P(buys_comp=N) = 5/14 = 0.357
How to calculate P(X|Ci) P(Ci) for i = 1, 2?
P(X|Ci) = P(x1, x2, x3, x4|Ci) = ∏k P(xk|Ci)
Naïve Bayesian Classification
EXAMPLE
P(age<=30 | buys_comp=Y)=2/9=0.222
P(age<=30 | buys_comp=N)=3/5=0.600
P(income=medium | buys_comp=Y)=4/9=0.444
P(income=medium | buys_comp=N)=2/5=0.400
P(student=Y | buys_comp=Y)=6/9=0.667
P(student=Y | buys_comp=N)=1/5=0.200
P(credit_rating=FAIR | buys_comp=Y)=6/9=0.667
P(credit_rating=FAIR | buys_comp=N)=2/5=0.400
Naïve Bayesian Classification
EXAMPLE
P(X | buys_comp=Y) = 0.222*0.444*0.667*0.667 = 0.044
P(X | buys_comp=N) = 0.600*0.400*0.200*0.400 = 0.019

P(X | buys_comp=Y) P(buys_comp=Y) = 0.044*0.643 = 0.028
P(X | buys_comp=N) P(buys_comp=N) = 0.019*0.357 = 0.007

CONCLUSION: X buys a computer (buys_comp = Y)
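The same arithmetic as a short Python check; the probabilities are exactly the ones estimated from the training table above.

p_yes, p_no = 9/14, 5/14                        # class priors
px_yes = 0.222 * 0.444 * 0.667 * 0.667          # P(X | buys_comp=Y)
px_no  = 0.600 * 0.400 * 0.200 * 0.400          # P(X | buys_comp=N)
print(px_yes, px_no)                            # ~0.044, ~0.019
print(px_yes * p_yes, px_no * p_no)             # ~0.028 > ~0.007 -> predict Y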


Naïve Bayes Classifier: Issues
 Probability values of ZERO!
   Recall what you observed in WEKA!
 What if Ak is continuous valued?
   Recall what you observed in WEKA!
 If there are no tuples in the training set corresponding to students for the class buys_comp=N, then
   P(student=Y | buys_comp=N) = 0
 Implications? Solution?
Naïve Bayes Classifier: Issues
 Laplacian Correction or Laplace Estimator
 Philosophy: we assume the training data set is so large that adding one to each count we need makes only a negligible difference in the estimated probability value.
 Example: D (1000 tuples), class buys_comp=Y
   income=low – zero tuples
   income=medium – 990 tuples
   income=high – 10 tuples
 Without the Laplacian correction the probabilities are 0, 0.990, and 0.010
 With the Laplacian correction: 1/1003 = 0.001, 991/1003 = 0.988, and 11/1003 = 0.011 respectively
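A one-line version of the correction for the income counts above: add 1 to each of the three counts and 3 to the denominator.

counts = {"low": 0, "medium": 990, "high": 10}   # tuples with buys_comp=Y
total = sum(counts.values())                     # 1000
smoothed = {v: (c + 1) / (total + len(counts)) for v, c in counts.items()}
print(smoothed)   # {'low': ~0.001, 'medium': ~0.988, 'high': ~0.011}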
Naïve Bayes Classifier: Issues
 Continuous variables need more work than categorical attributes!
 A continuous attribute is typically assumed to have a Gaussian distribution with mean μ and standard deviation σ.
 Do it yourself! And cross-check with WEKA!
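A minimal sketch of the usual Gaussian treatment: estimate μ and σ per class from the training tuples, then use the normal density in place of the categorical P(xk|Ci). The example mean and standard deviation below are hypothetical.

import math

def gaussian_likelihood(x, mu, sigma):
    """Normal density used as P(xk|Ci) for a continuous attribute."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# e.g. if the ages of buyers in a training set had mu=38 and sigma=12 (hypothetical):
print(gaussian_likelihood(30, 38, 12))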
Naïve Bayes (Summary)
 Robust to isolated noise points
 Handles missing values by ignoring the instance during probability estimate calculations
 Robust to irrelevant attributes
 Independence assumption may not hold for some attributes
   Use other techniques such as Bayesian Belief Networks (BBN)
Probability Calculations
(Training data table as in the earlier example)
 No. of attributes = 4
 Distinct values = 3, 3, 3, 3
 No. of classes = 2
 Total no. of probability calculations in NBC = 4*3*2 = 24!
 What if conditional independence was not assumed?
   O(k^p) for p k-valued attributes; multiply by m classes.
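A quick check of the counts on this slide, assuming k = 3 values per attribute, p = 4 attributes, and m = 2 classes as stated above: with the naive assumption the classifier stores one conditional probability per attribute value per class, while without it the joint over all attribute combinations is needed.

k, p, m = 3, 4, 2            # values per attribute, attributes, classes
print(k * p * m)             # 24 probabilities with the naive assumption
print(k ** p * m)            # 162 joint value combinations without it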


Bayesian Belief Networks
 The Naïve BC assumes Class Conditional Independence
 This assumption simplifies computations
 When this assumption holds true, the Naïve BC is the most accurate compared with all other classifiers
 In real problems, dependencies do exist between variables
 Two methods to overcome this limitation of the NBC:
   Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
   Decision trees, which reason on one attribute at a time, considering the 'most important' attributes first
Conditional Independence
 Let X, Y, and Z denote three sets of random variables. The variables in X are said to be conditionally independent of Y, given Z, if
   P(X|Y,Z) = P(X|Z)
 Relationship between a person's arm length and his/her reading skills!!
 One might observe that people with longer arms tend to have higher levels of reading skills
 How do you explain this relationship?
Conditional Independence
 It can be explained through a confounding factor, AGE
 A young child tends to have short arms and lacks the reading skills of an adult
 If the age of a person is fixed, the observed relationship between arm length and reading skills disappears
 We can thus conclude that arm length and reading skills are conditionally independent when the age variable is fixed
   P(reading skills | long arms, age) = P(reading skills | age)
Conditional Independence
P(X,Y|Z) = P(X,Y,Z)/P(Z)
         = P(X,Y,Z)/P(Y,Z) x P(Y,Z)/P(Z)
         = P(X|Y,Z) x P(Y|Z)
         = P(X|Z) x P(Y|Z)

This explains the Naïve Bayesian assumption:
P(X|Ci) = P(x1, x2, x3, …, xn|Ci) = ∏k P(xk|Ci)
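A small numeric illustration of the identity on a toy distribution in which X and Y are conditionally independent given Z; all numbers are hypothetical.

# P(z), P(X=1|z), P(Y=1|z) for binary variables (hypothetical numbers)
p_z = {0: 0.5, 1: 0.5}
p_x_given_z = {0: 0.2, 1: 0.7}
p_y_given_z = {0: 0.4, 1: 0.9}

# joint built so that X and Y are conditionally independent given Z
joint = p_z[1] * p_x_given_z[1] * p_y_given_z[1]   # P(X=1, Y=1, Z=1)
lhs = joint / p_z[1]                               # P(X=1, Y=1 | Z=1)
rhs = p_x_given_z[1] * p_y_given_z[1]              # P(X=1|Z=1) * P(Y=1|Z=1)
print(lhs, rhs)   # both 0.63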
Bayesian Belief Networks
 Belief Networks
 Bayesian Networks
 Probabilistic Networks
Bayesian Belief Networks
 The Conditional Independence (CI) assumption made by the NBC may be too rigid
 Especially for classification problems in which the attributes are somewhat correlated
 We need a more flexible approach for modeling the class conditional probabilities
   P(X|Ci) = P(x1, x2, x3, …, xn|Ci)
 Instead of requiring that all the attributes be CI given the class, a BBN allows us to specify which pairs of attributes are CI
Bayesian Belief Networks
A belief network has two components:
 Directed Acyclic Graph (DAG)
 Conditional Probability Table (CPT)
Bayesian Belief Networks
A node in a BBN is CI of its non-descendants, given its parents
Bayesian Belief Networks
[Figure: a belief network over six boolean variables – FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea – with arcs from FamilyHistory (FH) and Smoker (S) to LungCancer]

The conditional probability table for the variable LungCancer:
        (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC       0.8      0.5       0.7       0.1
~LC      0.2      0.5       0.3       0.9
Bayesian Belief Networks
 Six boolean variables
 Arcs allow representation of causal knowledge
   Having lung cancer is influenced by family history and smoking
   PositiveXRay is independent of whether the patient has a family history or is a smoker, given that we know the patient has lung cancer
   Once we know the outcome of LungCancer, FH and Smoker provide no additional information about PositiveXRay
Bayesian Belief Networks
 LungCancer is CI of Emphysema, given its parents, FH and Smoker
 A BBN has a Conditional Probability Table (CPT) for each variable in the DAG
 The CPT for a variable Y specifies the conditional distribution P(Y | parents(Y))

CPT for LungCancer:
        (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC       0.8      0.5       0.7       0.1        P(LC=Y | FH=Y, S=Y) = 0.8
~LC      0.2      0.5       0.3       0.9        P(LC=N | FH=N, S=N) = 0.9
Bayesian Belief Networks
 Let X = (x1, x2, …, xn) be a tuple described by variables or attributes Y1, Y2, …, Yn respectively
 Each variable is CI of its non-descendants given its parents
 This allows the DAG to provide a complete representation of the joint probability distribution:
   P(x1, x2, x3, …, xn) = ∏i P(xi | Parents(Yi))
   where P(x1, x2, x3, …, xn) is the probability of a particular combination of values of X, and the values P(xi | Parents(Yi)) correspond to the entries in the CPT for Yi
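A partial sketch of this factorization using the LungCancer CPT from the earlier slide. The priors for the root nodes FamilyHistory and Smoker are hypothetical (the slides do not give them), and the full joint over all six variables would also need the unshown CPTs for Emphysema, PositiveXRay, and Dyspnea.

# CPT for LungCancer from the slide: P(LC=yes | FH, S)
p_lc_yes = {(True, True): 0.8, (True, False): 0.5,
            (False, True): 0.7, (False, False): 0.1}

# Hypothetical priors for the root nodes (not given on the slides)
p_fh_yes, p_s_yes = 0.3, 0.4

# P(FH=yes, S=yes, LC=yes) = P(FH) * P(S) * P(LC | FH, S)
joint = p_fh_yes * p_s_yes * p_lc_yes[(True, True)]
print(joint)   # 0.3 * 0.4 * 0.8 = 0.096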
Bayesian Belief Networks
 A node within the network can be selected as an 'output' node, representing a class label attribute
 There may be more than one output node
 Rather than returning a single class label, the classification process can return a probability distribution that gives the probability of each class
 Training a BBN!!
Training BBN
 A number of scenarios are possible:
   The network topology may be given in advance or inferred from the data
   Variables may be observable or hidden (missing or incomplete data) in all or some of the training tuples
 Many algorithms exist for learning the network topology from the training data given observable attributes
 If the network topology is known and the variables are observable, training is straightforward (just compute the CPT entries)
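A minimal sketch of "just compute the CPT entries" for one node: each entry is a conditional relative frequency estimated from the training tuples. The function and variable names are illustrative, not from the slides.

def cpt_entry(data, child, child_val, parent_assignment):
    """Estimate P(child = child_val | parents = parent_assignment) by counting."""
    matching = [row for row in data
                if all(row[p] == v for p, v in parent_assignment.items())]
    if not matching:
        return 0.0
    hits = sum(1 for row in matching if row[child] == child_val)
    return hits / len(matching)

# e.g. from a list of dicts describing patients:
# cpt_entry(rows, "LungCancer", "yes", {"FamilyHistory": "yes", "Smoker": "yes"})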
Training BBNs
 Topology given, but some variables are hidden:
   Gradient Descent (self study)
   Falls under the class of algorithms called Adaptive Probabilistic Networks
 BBNs are computationally expensive
 BBNs provide an explicit representation of causal structure
 Domain experts can provide prior knowledge to the training process in the form of the topology and/or conditional probability values
 This leads to significant improvement in the learning process
