Bayes Classifier
Dr. Partha Pratim Sarangi
Introduction
• The naive Bayes classifier is among the most effective algorithms for learning tasks such as classifying text documents.
• The naive Bayes technique is particularly helpful for very large datasets.
• For example, Google employs a naive Bayes classifier to correct spelling mistakes in the text typed in by users.
• It also gives a meaningful perspective on various learning algorithms that do not explicitly manipulate probabilities.
• Bayes theorem is the cornerstone of Bayesian learning methods.
Bayes theorem
• Bayes theorem offers a method of calculating the probability of a
hypothesis on the basis of its prior probability, the probabilities of
observing different data given the hypothesis, and the observed data itself.
• The distribution of all possible values of the discrete random variable y is expressed as a probability distribution.
• We assume that there is some a priori probability (or simply prior) P(yq) that the next feature vector belongs to class yq.
• The continuous attributes are binned and converted to categorical variables.
• Therefore, each attribute xj is assumed to have a countable set of values.
• Bayes theorem provides a way to calculate the posterior P(yk|x), k ∈ {1, …, M}, from the known priors P(yq) together with the known class-conditional probabilities P(x|yq), q = 1, …, M:
P(yk|x) = P(x|yk) P(yk) / P(x)
• The posterior is difficult to calculate directly; using this relation, it becomes easier.
• P(x) expresses the variability of the observed data, independent of the class: P(x) = Σq P(x|yq) P(yq).
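As a small numerical illustration of this relation, the sketch below computes posteriors from assumed priors and class-conditional probabilities; all numbers are hypothetical, and only the mechanics follow the theorem.

```python
# Minimal sketch of Bayes theorem for M = 2 classes; all numbers are hypothetical.
priors = {"y1": 0.18, "y2": 0.82}        # P(y_q)
likelihoods = {"y1": 0.30, "y2": 0.05}   # P(x | y_q) for one observed feature vector x

# Evidence P(x) = sum_q P(x | y_q) P(y_q), independent of the class
evidence = sum(likelihoods[c] * priors[c] for c in priors)

# Posterior P(y_k | x) = P(x | y_k) P(y_k) / P(x)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}
print(posteriors)   # {'y1': ~0.568, 'y2': ~0.432}
```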
Naive Rule
• As per this rule, the record gets classified as a member of the majority class.
• Assume that there are six attributes in the data table
• x1: Day of Week, x2: Departure Time, x3: Origin, x4: Destination,
• x5: Carrier, x6: Weather
• and output y gives class labels (Delayed, On Time).
• Say 82% of the entries in the y column record ‘On Time’.
• A naive rule for classifying a flight into the two classes, ignoring the information in x1, x2, …, x6, is to classify all flights as ‘On Time’ (a small sketch of this baseline follows after this list).
• The naive rule is used as a baseline for evaluating the performance of more
complicated classifiers.
• Clearly, a classifier that uses attribute information should outperform the naive
rule.
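A minimal sketch of this baseline, assuming the labels are available as a plain Python list; the data below is hypothetical and only mirrors the 82% / 18% split mentioned above.

```python
from collections import Counter

# Hypothetical label column y for the flight data: 82% 'On Time', 18% 'Delayed'.
y_train = ["On Time"] * 82 + ["Delayed"] * 18

# The naive rule ignores all attributes x1..x6 and always predicts the majority class.
majority_class = Counter(y_train).most_common(1)[0][0]

def naive_rule_predict(x):
    # x (day of week, departure time, ...) is ignored entirely.
    return majority_class

print(naive_rule_predict({"x1": "Mon", "x6": "Good"}))  # -> 'On Time'
```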
Naive Bayes Classifier
• Treats the features as equally important and independent of each other, given the class.
• This is rarely the scenario in real-life data.
• Each of the priors P(yq) may be estimated simply by counting the frequency with which class yq occurs in the training data: P(yq) ≈ Nq/N, where Nq is the number of training patterns in class yq and N is the total number of training patterns.
• If the decision must be made with so little information (just like the naive rule), it seems logical to use the following rule: assign the pattern to the class with the largest prior, i.e., y = argmaxq P(yq).
• For balanced data this rule will not work; the decision is likely to be right only when one prior is very much greater than the others.
• In most other circumstances, we need to estimate the class-conditional probabilities P(x|yq) as well.
• According to the naive Bayes assumption (attribute values are conditionally independent, given the class), the probability of observing the conjunction x1, x2, …, xn given the class of the pattern is just the product of the probabilities for the individual attributes:
P(x|yq) = P(x1, x2, …, xn|yq) = ∏j P(xj|yq)
• Combining this with the priors gives the decision rule
yNB = argmaxq P(yq) ∏j P(xj|yq)
• where yNB denotes the class output by the naive Bayes classifier.
• The number of distinct P(xj | yq) terms that must be estimated from the
training data is just the number of distinct attributes (n) times the number
of distinct classes (M).
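A minimal sketch of this counting-based estimation and the resulting decision rule, assuming purely categorical attributes; the tiny dataset at the bottom is illustrative and not taken from the slides.

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Estimate P(y_q) and P(x_j | y_q) by frequency counting."""
    n_samples = len(y)
    class_counts = Counter(y)                                   # N_q per class
    priors = {c: class_counts[c] / n_samples for c in class_counts}

    # cond_counts[class][attribute index][value] = count
    cond_counts = defaultdict(lambda: defaultdict(Counter))
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            cond_counts[yi][j][v] += 1

    def predict(x):
        scores = {}
        for c in priors:
            score = priors[c]
            for j, v in enumerate(x):
                # P(x_j = v | y = c); zero if the value was never seen with this class
                score *= cond_counts[c][j][v] / class_counts[c]
            scores[c] = score
        return max(scores, key=scores.get)                      # y_NB = argmax

    return predict

# Tiny illustrative dataset: (weather, carrier) -> delayed / on-time
X = [("Bad", "A"), ("Good", "A"), ("Good", "B"), ("Bad", "B"), ("Good", "A")]
y = ["Delayed", "On Time", "On Time", "Delayed", "On Time"]
predict = train_naive_bayes(X, y)
print(predict(("Bad", "A")))   # -> 'Delayed'
```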
Example 1: predict y for x = {M, 1.95 m}?
• y1 corresponds to the class ‘short’,
• y2 corresponds to the class ‘medium’, and
• y3 corresponds to the class ‘tall’.
N1= no. of y1=4; N2= no. of y2=8; N3= no. of y3=3;
Sorted w.r.t. x2:

Gender x1    Height x2 (m)    Class y
F            1.6              Short   y1
F            1.6              Short   y1
F            1.7              Short   y1
M            1.7              Short   y1
F            1.75             Medium  y2
F            1.8              Medium  y2
F            1.8              Medium  y2
M            1.85             Medium  y2
F            1.88             Medium  y2
F            1.9              Medium  y2
F            1.9              Medium  y2
M            1.95             Medium  y2
M            2.0              Tall    y3
M            2.1              Tall    y3
M            2.2              Tall    y3
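The per-class calculations from the original slides are not reproduced in this text; the sketch below reconstructs them under the assumption that heights are discretized into 0.1 m bins (so 1.95 m and 2.0 m fall into the same bin), which is one common way to handle the continuous attribute here.

```python
from collections import Counter
import math

# Training data from the table above: (gender, height in m, class)
data = [("F", 1.6, "y1"), ("F", 1.6, "y1"), ("F", 1.7, "y1"), ("M", 1.7, "y1"),
        ("F", 1.75, "y2"), ("F", 1.8, "y2"), ("F", 1.8, "y2"), ("M", 1.85, "y2"),
        ("F", 1.88, "y2"), ("F", 1.9, "y2"), ("F", 1.9, "y2"), ("M", 1.95, "y2"),
        ("M", 2.0, "y3"), ("M", 2.1, "y3"), ("M", 2.2, "y3")]

def height_bin(h):
    # Assumption: bin heights into intervals (a, a + 0.1]; 1.95 and 2.0 share bin (1.9, 2.0].
    return math.ceil(round(h * 10, 6)) / 10

classes = Counter(c for _, _, c in data)   # N1 = 4, N2 = 8, N3 = 3
N = len(data)

def score(cls, gender, height):
    members = [(g, h) for g, h, c in data if c == cls]
    p_gender = sum(g == gender for g, _ in members) / len(members)
    p_height = sum(height_bin(h) == height_bin(height) for _, h in members) / len(members)
    return (classes[cls] / N) * p_gender * p_height   # P(y_q) * P(x1|y_q) * P(x2|y_q)

scores = {c: score(c, "M", 1.95) for c in classes}
print(scores)                       # {'y1': 0.0, 'y2': ~0.017, 'y3': ~0.067}
print(max(scores, key=scores.get))  # -> 'y3' (tall)
```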
• The maximum score is obtained for q = 3.
• Therefore, for the pattern x = {M, 1.95 m}, the predicted class is ‘tall’.
• The true class in the data table is ‘medium’.
• Using the naive Bayes algorithm on real-life datasets brings out the power of the naive Bayes classifier when N is large.
Example 2:
• Let us say we want to classify a Red Domestic SUV as stolen or not stolen.
• We need to calculate the probabilities P(Red|Yes), P(SUV|Yes), P(Domestic|Yes), P(Red|No), P(SUV|No), and P(Domestic|No),
• and multiply them by P(Yes) and P(No) respectively.
• Then we can compare the two resulting products, as in the equation for yNB.
• Looking at P(Red|Yes), we have 5 cases where the class is Yes, and in 3 of those cases the colour is Red.
• So for P(Red|Yes), n = 5 and nc = 3.
• Note that all attributes are binary (two possible values).
• We are assuming no other information, so p = 1/(number of attribute values) = 0.5 for all of our attributes.
• Our m value is arbitrary (we will use m = 3), but it is kept consistent for all attributes.
• Now we simply apply equation (3), the m-estimate P = (nc + m·p)/(n + m), using the precomputed values of n, nc, p, and m.
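A minimal sketch of this step, assuming equation (3) is the m-estimate quoted above and using only the values stated in the text (n = 5, nc = 3, p = 0.5, m = 3); the function name is illustrative.

```python
def m_estimate(nc, n, p, m):
    """m-estimate of a conditional probability: (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

# P(Red | Yes): 5 stolen cars in total, 3 of them Red; p = 0.5, m = 3.
p_red_given_yes = m_estimate(nc=3, n=5, p=0.5, m=3)
print(p_red_given_yes)   # -> 0.5625
```

By the same recipe, the remaining five conditional probabilities can be computed and the two products P(Yes)·∏P(xj|Yes) and P(No)·∏P(xj|No) compared.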