Bayes Classifier
• We assume that there is some a priori probability (or simply prior) P(yq)
that the next feature vector belongs to class yq.
Bayes theorem
• The continuous attributes are binned and converted to categorical variables.
• Therefore, each attribute xj is assumed to take values from a countable (finite) set.
• Bayes theorem provides a way to calculate the posterior P(yk | x); k ∈ {1, …, M}
from the known priors P(yq), together with the known conditional probabilities
P(x | yq); q = 1, …, M.
• Under the naive conditional-independence assumption, the predicted class is
  yNB = argmax over q ∈ {1, …, M} of P(yq) · P(x1 | yq) · … · P(xn | yq)
• where yNB denotes the class output by the naive Bayes classifier.
• The number of distinct P(xj | yq) terms that must be estimated from the
training data is just the number of distinct attributes (n) times the number
of distinct classes (M).
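As a rough sketch (not from the slides; the function names and count-based estimates are illustrative), the rule above can be implemented by estimating the n · M conditional probabilities from training counts and picking the class with the largest P(yq) · P(x1 | yq) · … · P(xn | yq):

    from collections import defaultdict

    def train_naive_bayes(X, y):
        # Estimate priors P(yq) and conditionals P(xj | yq) by relative frequency.
        priors, class_counts = {}, defaultdict(int)
        cond_counts = defaultdict(lambda: defaultdict(int))  # per class: (attribute j, value) -> count
        for xi, yi in zip(X, y):
            class_counts[yi] += 1
            for j, v in enumerate(xi):
                cond_counts[yi][(j, v)] += 1
        for c, cnt in class_counts.items():
            priors[c] = cnt / len(y)
        return priors, cond_counts, class_counts

    def predict_naive_bayes(x, priors, cond_counts, class_counts):
        # yNB = argmax over classes of P(yq) * prod_j P(xj | yq)
        best_class, best_score = None, -1.0
        for c in priors:
            score = priors[c]
            for j, v in enumerate(x):
                score *= cond_counts[c][(j, v)] / class_counts[c]
            if score > best_score:
                best_class, best_score = c, score
        return best_class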
Introduction: Naïve Bayes Classifier
• The naive Bayes classifier is probably among the most effective
algorithms for text-classification learning tasks.
• The naive Bayes technique is extremely helpful in the case of huge
datasets.
• For example, Google employs a naive Bayes classifier to correct spelling
mistakes in the text typed in by users.
• It also gives a meaningful perspective for understanding various
learning algorithms that do not explicitly manipulate probabilities.
• Bayes theorem is the cornerstone of Bayesian learning methods.
Bayesian Classification: Why?
• Probabilistic learning: Calculate explicit probabilities for hypotheses;
among the most practical approaches to certain types of learning
problems
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct. Prior
knowledge can be combined with observed data.
• Probabilistic prediction: Predict multiple hypotheses, weighted by
their probabilities
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision making
against which other methods can be measured
Bayesian classification
• The classification problem may be formalized using a-posteriori
probabilities:
• P(C|X) = probability that the sample tuple
X=<x1,…,xk> is of class C.
• Idea: assign to sample X the class label C such that P(C|X) is maximal
Estimating a-posteriori probabilities
• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative frequency of class C samples
• C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
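A tiny numeric illustration (the probability values below are made up): because P(X) is the same for every class, comparing the unnormalized products P(X|C)·P(C) picks the same class as the normalized posteriors P(C|X).

    # Hypothetical P(C) * P(X|C) for two classes
    scores = {"Pass": 0.60 * 0.02, "Fail": 0.40 * 0.01}
    evidence = sum(scores.values())                       # P(X) = sum over classes of P(X|C)*P(C)
    posteriors = {c: s / evidence for c, s in scores.items()}
    print(max(scores, key=scores.get))                    # "Pass"
    print(max(posteriors, key=posteriors.get))            # "Pass" again: the argmax is unchanged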
Naïve Bayesian Classification
• Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
• If the i-th attribute is categorical:
P(xi|C) is estimated as the relative frequency of samples
having value xi as the i-th attribute in class C
• With Laplace smoothing, the estimate becomes
  P(xi | y) = (count(xi and y) + α) / (count(y) + α · K)
Where:
• α is the smoothing parameter (usually α = 1).
• count(xi and y) is the number of training instances where xi occurs with class y.
• count(y) is the number of training instances of class y.
• K is the number of distinct categories (unique values) of the feature xi.
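A one-line sketch of the smoothed estimate above (the helper name is illustrative, not from the slides):

    def smoothed_conditional(count_xi_and_y, count_y, n_unique_values, alpha=1.0):
        # P(xi | y) = (count(xi and y) + alpha) / (count(y) + alpha * K)
        return (count_xi_and_y + alpha) / (count_y + alpha * n_unique_values)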
Smoothing the Outlook Feature
• P(sunny | Yes) = (2 + α) / (9 + 3α)        P(sunny | No) = (3 + α) / (5 + 3α)
• P(overcast | Yes) = (4 + α) / (9 + 3α)     P(overcast | No) = (2 + α) / (5 + 3α)
• P(rain | Yes) = (2 + α) / (9 + 3α)         P(rain | No) = (0 + α) / (5 + 3α)
• 3α appears in the denominators because the Outlook feature has three different values (sunny, overcast, rain)
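Plugging the counts behind the table above into the same formula with α = 1 (K = 3 Outlook values) gives a quick numeric check:

    alpha, K = 1.0, 3
    counts_yes = {"sunny": 2, "overcast": 4, "rain": 2}   # 9 "Yes" instances in total
    counts_no = {"sunny": 3, "overcast": 2, "rain": 0}    # 5 "No" instances in total
    for v in counts_yes:
        p_yes = (counts_yes[v] + alpha) / (9 + alpha * K)
        p_no = (counts_no[v] + alpha) / (5 + alpha * K)
        print(f"P({v}|Yes) = {p_yes:.3f}   P({v}|No) = {p_no:.3f}")
    # P(rain|No) becomes (0 + 1) / (5 + 3) = 0.125 instead of zero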
When to Adjust Alpha
• Laplace Smoothing is the method to handle zero probabilities.
• With Laplace Smoothing, none of the probabilities are zero, and the
model can handle unseen data without invalidating predictions.
• Small Dataset: Use 𝛼>1 to add more smoothing since fewer examples
increase the likelihood of zero probabilities.
• Large Dataset: 𝛼=1 is typically sufficient since larger datasets naturally
reduce the chances of zero probabilities.
• Validation: Test different 𝛼 values using cross-validation to find the best
fit for your dataset.
• The default value of alpha is 1:
  from sklearn.naive_bayes import CategoricalNB
  nb = CategoricalNB(alpha=1)
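A hedged sketch of the cross-validation advice above using scikit-learn's CategoricalNB; the toy data is illustrative, and min_categories is set only so category indices stay valid when a value is missing from a training fold.

    import numpy as np
    from sklearn.naive_bayes import CategoricalNB
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.model_selection import GridSearchCV

    X_raw = np.array([["sunny", "hot"], ["overcast", "mild"], ["rain", "cool"],
                      ["sunny", "mild"], ["rain", "mild"], ["overcast", "hot"]])
    y = np.array(["No", "Yes", "Yes", "No", "Yes", "Yes"])

    X = OrdinalEncoder().fit_transform(X_raw).astype(int)   # CategoricalNB expects integer-coded features
    search = GridSearchCV(CategoricalNB(min_categories=3),  # default alpha is 1.0
                          param_grid={"alpha": [0.5, 1.0, 2.0, 5.0]},
                          cv=2)
    search.fit(X, y)
    print(search.best_params_)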
Problem: (Confident = Yes, Sick = No) => (Fail or Pass)
• Find out whether the student with attribute Confident = Yes, Sick =
No will Fail or Pass using Bayesian classification.
• Let, C1 correspond to the class Result = Pass and C2 correspond to
Result = Fail.
Solution
• We wish to determine whether the test feature vector X = (Confident = Yes,
Sick = No) more likely belongs to C1 or to C2.
• The classifier predicts that the class label of tuple X is the class Ci if
and only if P(X | Ci) · P(Ci) > P(X | Cj) · P(Cj) for j ≠ i.
• Step 1: Compute the prior probabilities
• The prior probability of each class can be computed from the
training set.
• Step 2: Compute the likelihood probabilities
• To compute P(X | Ci) for i = 1, 2, we compute the conditional probabilities
P(Confident = Yes | Ci) and P(Sick = No | Ci), and multiply them under the
naive independence assumption.
• Therefore, the naive Bayesian classifier predicts Result = Pass for tuple X.
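Since the training table itself is not reproduced in this section, the counts below are purely hypothetical; the sketch only mirrors the structure of Steps 1 and 2 (priors, then per-attribute likelihoods, then their product).

    # Hypothetical counts for illustration only
    counts = {
        "Pass": {"total": 6, "Confident=Yes": 5, "Sick=No": 4},
        "Fail": {"total": 4, "Confident=Yes": 1, "Sick=No": 2},
    }
    n = counts["Pass"]["total"] + counts["Fail"]["total"]

    scores = {}
    for c, tbl in counts.items():
        prior = tbl["total"] / n                                      # Step 1: P(Ci)
        likelihood = (tbl["Confident=Yes"] / tbl["total"]) * \
                     (tbl["Sick=No"] / tbl["total"])                  # Step 2: P(X | Ci)
        scores[c] = prior * likelihood                                # P(X | Ci) * P(Ci)

    print(max(scores, key=scores.get))   # predicted class for X under these made-up counts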
How to Handle Continuous Features in Naïve Bayes
Gaussian Naïve Bayes
• In Gaussian naïve Bayes, the continuous values associated with each
feature are assumed to follow a Gaussian distribution. A random variable
is said to follow a Gaussian (normal) distribution when its plot gives a
bell-shaped curve that is symmetric about the mean.
• The likelihood of the feature is assumed to be Gaussian, and hence the
conditional probability is given by:
  P(xi | y) = (1 / √(2πσy²)) · exp( −(xi − μy)² / (2σy²) )
• where μy and σy² are the mean and variance of feature xi estimated from
the training samples of class y.
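A small sketch of the Gaussian likelihood above, with the equivalent model in scikit-learn; the numeric values are invented for illustration.

    import math
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def gaussian_likelihood(x, mean, var):
        # P(xi | y) = 1/sqrt(2*pi*var_y) * exp(-(xi - mean_y)**2 / (2*var_y))
        return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    print(gaussian_likelihood(x=72.0, mean=74.0, var=36.0))   # hypothetical class statistics

    # GaussianNB estimates the per-class means and variances from the training data.
    X = np.array([[66.0], [70.0], [72.0], [80.0], [85.0], [90.0]])
    y = np.array(["Yes", "Yes", "Yes", "No", "No", "No"])
    model = GaussianNB().fit(X, y)
    print(model.predict([[75.0]]))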
Solve Problem using Gaussian Distribution
Multivariate Gaussian distribution
Application of Naïve Bayes
• Spam Detection: One of the earliest and most famous applications of
Naive Bayes is in the filtering of unwanted emails based on the
likelihood of certain words appearing in spam versus non-spam emails.
• Sentiment Analysis: Naive Bayes is commonly used in sentiment
analysis, determining whether a text expresses positive, negative, or
neutral sentiments, particularly useful in social media monitoring and
market research.
• Document Classification: It is extensively used in classifying
documents, such as categorizing news articles into various topics or
organizing books into genres.
• Healthcare: Naive Bayes has applications in the medical field for
disease prediction and discovering relationships between various risk
factors and diagnosis.
Advantages of Using Naive Bayes
• Efficiency: Naive Bayes is known for its simplicity and speed. It can
make quick predictions even with large datasets, which is invaluable
in real-time applications.
• Easy to Implement: With fewer parameters to tune, Naive Bayes can
be easier to implement compared to more complex models like
neural networks.
• Good Performance with Small Data: Unlike some models that require
vast amounts of training data to perform well, Naive Bayes can
achieve good results even with a smaller dataset.
• Probabilistic Interpretation: The model provides probabilities for
outcomes, offering more insight into the results, such as how likely a
given class is the correct classification.
Limitations and Considerations
• Independence Assumption: The biggest limitation is the
assumption of independent predictors. In real-world scenarios,
features often influence each other, and this assumption can
lead to incorrect predictions.
• Zero-Frequency Problem: If a categorical variable has a
category in the test data set that was not observed in the training
data set, the model will assign it a zero probability and will be
unable to make a prediction. This is often mitigated by a smoothing
technique.
• Biased Estimates: Because it relies heavily on the actual
distribution of classes and features in the training set, Naive
Bayes can produce biased estimates if the training data is not
representative.