Ho Chi Minh University of Banking DAT 704
Department of Data Science in Business
Machine Learning
• Naïve Bayes Classifier
12/4/2024 Vuong Trong Nhan
Content
• Introduction to Machine Learning
• Data Representation
• Supervised Learning
• Linear Regression
• Logistic Regression
• k-Nearest Neighbors
• Naïve Bayes Classifier
• Decision Tree and Random Forest
• Support Vector Machine
• Neural Network
• Unsupervised Learning
• Model Evaluation and Improvement
Outline
▪ Naive Bayes: Introduction
▪ Bayes’s Theorem
▪ Example using Naive Bayes
▪ Types of Naïve Bayes classifiers
▪ Evaluation
▪ Advantages and Disadvantages
▪ Some applications
▪ Exercises
Naïve Bayes classifiers
• The Naïve Bayes classifier is a supervised machine learning algorithm used for classification tasks.
• It is based on applying Bayes’ theorem
• with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
Applications of the Naïve Bayes classifier
▪ Spam filtering:
One of the most popular applications of Naïve Bayes.
▪ Document classification:
Document and text classification go hand in hand. Another popular use case of Naïve
Bayes is content classification. Imagine the content categories of a news media
website: all the content categories can be classified under a topic taxonomy based on
each article on the site. Frederick Mosteller and David Wallace are credited with the
first application of Bayesian inference in their 1963 paper.
▪ Sentiment analysis:
While this is another form of text classification, sentiment analysis is commonly
leveraged within marketing to better understand and quantify opinions and attitudes
around specific products and brands.
▪ Mental state predictions:
Using fMRI data, Naïve Bayes has been leveraged to predict different cognitive states
among humans. The goal of this research was to assist in better understanding
hidden cognitive states, particularly among brain injury patients.
https://www.ibm.com/topics/naive-bayes
Example
Dataset that describes the weather conditions for playing tennis.
Features: ‘Outlook’, ‘Temperature’, ‘Humidity’, ‘Wind’. Class: Play Tennis.

Day   Outlook    Temperature   Humidity   Wind     Play Tennis
D1    Sunny      Hot           High       Weak     No
D2    Sunny      Hot           High       Strong   No
D3    Overcast   Hot           High       Weak     Yes
D4    Rainy      Mild          High       Weak     Yes
D5    Rainy      Cool          Normal     Weak     Yes
D6    Rainy      Cool          Normal     Strong   No
D7    Overcast   Cool          Normal     Strong   Yes
D8    Sunny      Mild          High       Weak     No
D9    Sunny      Cool          Normal     Weak     Yes
D10   Rainy      Mild          Normal     Weak     Yes
D11   Sunny      Mild          Normal     Strong   Yes
D12   Overcast   Mild          High       Strong   Yes
D13   Overcast   Hot           Normal     Weak     Yes
D14   Rainy      Mild          High       Strong   No
D15   Sunny      Hot           Normal     Weak     ???

Predict:
• Today = D15
• Play Tennis = ?
Bayes’ Theorem
• Bayes’ theorem finds the probability of an event occurring given the probability of
another event that has already occurred:

    P(Y|X) = P(X|Y) · P(Y) / P(X)

  where Y and X are events and P(X) ≠ 0.
• We are trying to find the probability of event Y, given that event X is true. Event X
is also termed the evidence.
• P(Y) is the prior probability of Y:
  • the probability of the event before the evidence is seen.
  • The evidence is an attribute value of an unknown instance (here, it is event X).
• P(X|Y) is the likelihood: the probability of the evidence given that Y holds.
• P(Y|X) is the posterior probability of Y, i.e. the probability of the event after the
evidence is seen.
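A tiny numeric sketch of the theorem (the spam-filter probabilities below are made-up values, used only to illustrate the formula):

```python
# Assumed (illustrative) probabilities for a toy spam filter
p_spam = 0.20                 # P(spam): prior
p_free_given_spam = 0.60      # P("free" appears | spam): likelihood
p_free_given_ham = 0.05       # P("free" appears | not spam)

# Evidence P("free") via the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' theorem: posterior P(spam | "free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(p_spam_given_free)      # ≈ 0.75
```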
“Naïve” Bayes Assumption
Assumption:
Each feature contributes independently and equally to the outcome.
• Independence: no pair of features are dependent.
• I.e., the temperature being ‘Hot’ has nothing to do with the humidity,
and the outlook being ‘Rainy’ has no effect on the wind.
• Hence, the features are assumed to be independent.
• Equality: each feature is given the same weight (or
importance).
• I.e., knowing only temperature and humidity alone can’t predict the
outcome accurately.
• None of the attributes is irrelevant and assumed to be contributing
equally to the outcome.
Note: In fact, the independence assumption is never exactly correct, but it often works
well in practice.
Data representation
▪ Dataset D: (X, y)
▪ X = (x1, x2, …, xn) is the feature vector (of size n)
▪ y is the class variable
▪ Apply Bayes’ theorem:

    P(y|X) = P(X|y) · P(y) / P(X)        (1)

E.g.:
X = (Rainy, Hot, High, Weak), y = No.
P(y|X) means the probability of “not playing tennis” given that the weather conditions
are “rainy outlook”, “hot temperature”, “high humidity” and “weak wind”.
Naïve Bayes
▪ Since any two features A and B are assumed independent (naive assumption):
    P(A, B) = P(A) · P(B)
▪ (1) becomes:

    P(y | x1, …, xn) = [P(x1|y) · P(x2|y) · … · P(xn|y) · P(y)] / [P(x1) · P(x2) · … · P(xn)]        (2)

▪ (2) can be expressed as:

    P(y | x1, …, xn) = P(y) · ∏ P(xi|y) / ∏ P(xi)   (products over i = 1 … n)        (3)

▪ As the denominator remains constant for a given input, we can remove that term:

    P(y | x1, …, xn) ∝ P(y) · ∏ P(xi|y)        (4)

▪ Finally, choose the class y with the maximum posterior probability:

    ŷ = argmax_y  P(y) · ∏ P(xi|y)

Note: The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(xi | y).
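A minimal sketch of decision rule (4), assuming the prior P(y) and the conditionals P(xi|y) have already been estimated; the dictionary layout and names are illustrative only:

```python
import math

def nb_predict(priors, cond, x):
    """Return the class maximizing P(y) * prod_i P(x_i | y), i.e. formula (4).

    priors: {class: P(class)}
    cond:   {class: [ {feature value: P(value | class)}, ... one dict per feature ]}
    x:      tuple of feature values
    """
    best_class, best_score = None, -math.inf
    for y, p_y in priors.items():
        score = p_y
        for i, xi in enumerate(x):
            score *= cond[y][i].get(xi, 0.0)   # unseen value -> 0 (later handled by Laplace smoothing)
        if score > best_score:
            best_class, best_score = y, score
    return best_class, best_score
```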
Naïve Bayes
▪ Step 1: Calculate the prior P(y)
  ▪ y = Yes
  ▪ y = No

Play Tennis
  Yes   No    P(Yes)   P(No)
  9     5     9/14     5/14

P(play = Yes) = 9/14
P(play = No) = 5/14
Naïve Bayes
Predict Today: D15 = (Sunny, Hot, Normal, Weak), Play Tennis = ???

Step 2: Calculate P(xi|y)

Outlook
  xi         Yes   No    P(xi|Yes)   P(xi|No)
  Sunny      2     3     2/9         3/5
  Overcast   4     0     4/9         0/5
  Rainy      3     2     3/9         2/5

Temperature
  xi         Yes   No    P(xi|Yes)   P(xi|No)
  Hot        2     2     2/9         2/5
  Mild       4     2     4/9         2/5
  Cool       3     1     3/9         1/5

Humidity
  xi         Yes   No    P(xi|Yes)   P(xi|No)
  High       3     4     3/9         4/5
  Normal     6     1     6/9         1/5

Wind
  xi         Yes   No    P(xi|Yes)   P(xi|No)
  Weak       6     2     6/9         2/5
  Strong     3     3     3/9         3/5

E.g.:
P(Outlook = Sunny | play = Yes) = 2/9        P(Temp. = Hot | play = Yes) = 2/9
P(Outlook = Sunny | play = No) = 3/5         P(Temp. = Hot | play = No) = 2/5
P(Humidity = Normal | play = Yes) = 6/9      P(Wind = Weak | play = Yes) = 6/9
P(Humidity = Normal | play = No) = 1/5       P(Wind = Weak | play = No) = 2/5
Naïve Bayes
Step 3: Predict Today = D15 = (Sunny, Hot, Normal, Weak), class = ?

P(Yes | D15) ∝ P(Yes) · P(Sunny|Yes) · P(Hot|Yes) · P(Normal|Yes) · P(Weak|Yes)
             = 9/14 · 2/9 · 2/9 · 6/9 · 6/9 ≈ 0.0141
P(No | D15)  ∝ P(No) · P(Sunny|No) · P(Hot|No) · P(Normal|No) · P(Weak|No)
             = 5/14 · 3/5 · 2/5 · 1/5 · 2/5 ≈ 0.0069

Since 0.0141 > 0.0069 → predict Play Tennis (D15) = Yes.
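The three steps above can be reproduced in a short, self-contained Python sketch (plain counting, standard library only; written as an illustration of the calculation, not as official course code):

```python
from collections import Counter, defaultdict

# Training data: (Outlook, Temperature, Humidity, Wind) -> Play Tennis
data = [
    (("Sunny", "Hot", "High", "Weak"), "No"),
    (("Sunny", "Hot", "High", "Strong"), "No"),
    (("Overcast", "Hot", "High", "Weak"), "Yes"),
    (("Rainy", "Mild", "High", "Weak"), "Yes"),
    (("Rainy", "Cool", "Normal", "Weak"), "Yes"),
    (("Rainy", "Cool", "Normal", "Strong"), "No"),
    (("Overcast", "Cool", "Normal", "Strong"), "Yes"),
    (("Sunny", "Mild", "High", "Weak"), "No"),
    (("Sunny", "Cool", "Normal", "Weak"), "Yes"),
    (("Rainy", "Mild", "Normal", "Weak"), "Yes"),
    (("Sunny", "Mild", "Normal", "Strong"), "Yes"),
    (("Overcast", "Mild", "High", "Strong"), "Yes"),
    (("Overcast", "Hot", "Normal", "Weak"), "Yes"),
    (("Rainy", "Mild", "High", "Strong"), "No"),
]

# Step 1: priors P(y)
class_counts = Counter(y for _, y in data)
priors = {y: c / len(data) for y, c in class_counts.items()}   # {'No': 5/14, 'Yes': 9/14}

# Step 2: conditional counts for P(x_i | y)
cond_counts = defaultdict(Counter)          # key (feature index, class) -> Counter of values
for x, y in data:
    for i, xi in enumerate(x):
        cond_counts[(i, y)][xi] += 1

def p_cond(i, xi, y):
    """P(feature i has value xi | class y), estimated from counts."""
    return cond_counts[(i, y)][xi] / class_counts[y]

# Step 3: score each class with formula (4) and take the maximum
def predict(x):
    scores = {}
    for y in priors:
        score = priors[y]
        for i, xi in enumerate(x):
            score *= p_cond(i, xi, y)
        scores[y] = score
    return max(scores, key=scores.get), scores

d15 = ("Sunny", "Hot", "Normal", "Weak")
print(predict(d15))   # ('Yes', {'No': ~0.0069, 'Yes': ~0.0141})
```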
Evaluate a Naïve Bayes classifier
• Accuracy
• Precision, Recall
• Confusion matrix
These are computed exactly as for any other classifier; a small scikit-learn sketch follows.
https://www.ibm.com/topics/naive-bayes
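A brief sketch using scikit-learn’s metrics (the y_true / y_pred labels below are made up for illustration, not results from the tennis example):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Made-up true and predicted labels, just to show the calls
y_true = ["Yes", "No", "Yes", "Yes", "No", "Yes", "No", "No"]
y_pred = ["Yes", "No", "No",  "Yes", "No", "Yes", "Yes", "No"]

print(accuracy_score(y_true, y_pred))                          # fraction of correct predictions
print(precision_score(y_true, y_pred, pos_label="Yes"))        # TP / (TP + FP)
print(recall_score(y_true, y_pred, pos_label="Yes"))           # TP / (TP + FN)
print(confusion_matrix(y_true, y_pred, labels=["Yes", "No"]))  # rows: true class, columns: predicted
```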
Types of Naïve Bayes classifiers
▪ Based on the distributions of the feature values:
▪ Gaussian Naïve Bayes (GaussianNB):
  o Feature: continuous variables
  o e.g., Age ∈ [18, 60]
  o Gaussian distribution
▪ Multinomial Naïve Bayes (MultinomialNB):
  o Feature: discrete values (e.g., frequency counts)
  o e.g., outlook = {sunny, overcast, rainy}
  o Multinomial distribution
▪ Bernoulli Naïve Bayes (BernoulliNB):
  o Feature: Boolean variables
  o {True, False} or {1, 0}
  o Bernoulli distribution
▪ Hybrid NB
  o obtained by combining existing Naive Bayes models
https://www.ibm.com/topics/naive-bayes
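A brief scikit-learn sketch of the three variants (the feature matrices below are toy values, made up for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 1, 1, 0])                                       # toy class labels

# GaussianNB: continuous features (e.g., age, income)
X_cont = np.array([[25, 40000], [47, 65000], [52, 80000], [23, 32000]])
print(GaussianNB().fit(X_cont, y).predict([[30, 45000]]))

# MultinomialNB: count features (e.g., word frequencies per document)
X_counts = np.array([[2, 1, 0], [0, 3, 1], [1, 0, 4], [0, 2, 2]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 1, 0]]))

# BernoulliNB: binary features (e.g., word present / absent)
X_bin = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]]))
```

scikit-learn also offers CategoricalNB for categorical features such as Outlook in the tennis example.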
Advantages and disadvantages
▪ Advantages
▪ Less complex:
o Naïve Bayes is considered a simpler classifier since the
parameters are easier to estimate.
▪ Scales well:
o Compared to logistic regression, Naïve Bayes is considered a
fast and efficient classifier that is fairly accurate when the
conditional independence assumption holds. It also has low
storage requirements.
▪ Can handle high-dimensional data:
  o Use cases such as document classification can have a high
    number of dimensions, which can be difficult for other
    classifiers to manage.
https://www.ibm.com/topics/naive-bayes
Advantages and disadvantages
▪ Disadvantages:
▪ Subject to zero frequency:
  o Zero frequency occurs when a categorical feature value never
    appears together with a class in the training set.
  o For example, imagine that we’re trying to find the maximum
    likelihood estimate for the word “sir” given the class “spam”, but
    the word “sir” doesn’t exist in the training data. The probability
    in this case would be zero, and since this classifier multiplies all the
    conditional probabilities together, the posterior probability would
    also be zero. (To avoid this issue, Laplace smoothing can be
    leveraged.)
▪ Unrealistic core assumption:
o While the conditional independence assumption overall
performs well, the assumption does not always hold, leading to
incorrect classifications.
https://www.ibm.com/topics/naive-bayes
(Optional)
• Dealing with the zero-frequency problem
  • Laplace smoothing/correction
• Dealing with continuous features
  • Discretization
  • Probability density function
Zero-frequency problem

Play Tennis
  Yes   No    P(Yes)   P(No)
  9     5     9/14     5/14

Outlook
  xi         Yes   No    P(xi|Yes)   P(xi|No)
  Sunny      2     3     2/9         3/5
  Overcast   4     0     4/9         0/5
  Rainy      3     2     3/9         2/5

Temperature
  xi         Yes   No    P(xi|Yes)   P(xi|No)
  Hot        2     2     2/9         2/5
  Mild       4     2     4/9         2/5
  Cool       3     1     3/9         1/5

Humidity
  xi         Yes   No    P(xi|Yes)   P(xi|No)
  High       3     4     3/9         4/5
  Normal     6     1     6/9         1/5

Wind
  xi         Yes   No    P(xi|Yes)   P(xi|No)
  Weak       6     2     6/9         2/5
  Strong     3     3     3/9         3/5

Predict: D16 = (Overcast, Cool, High, Strong), Play Tennis = ?
Here P(Outlook = Overcast | No) = 0/5 = 0, so the whole product for class “No” becomes zero,
no matter what the other feature values say.
Laplace Smoothing/Correction
• In Naive Bayes classification, Laplace smoothing, also
known as add-one smoothing, is a technique used to handle
the problem of zero probabilities:

    P_LAP,α(xi | y) = (count(xi, y) + α) / (count(y) + α · |X|)

Where:
• P(xi | y) is the probability of feature value xi given class y.
• α is the smoothing parameter (α > 0, usually α = 1).
• count(xi, y) is the count of occurrences of feature value xi with class y in the training data.
• count(y) is the total count of instances of class y in the training data.
• |X| is the number of unique feature values (or the size of the vocabulary).
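The smoothed estimate is a one-liner; a small sketch (the function name is illustrative), using the Overcast counts from the next slide as the usage example:

```python
def p_laplace(count_xi_y, count_y, n_values, alpha=1.0):
    """Laplace-smoothed estimate of P(x_i | y)."""
    return (count_xi_y + alpha) / (count_y + alpha * n_values)

# Outlook takes 3 values {Sunny, Overcast, Rainy}; c(Yes) = 9, c(No) = 5
print(p_laplace(4, 9, 3))   # P(Overcast | Yes) = (4 + 1) / (9 + 1*3) = 5/12 ≈ 0.417
print(p_laplace(0, 5, 3))   # P(Overcast | No)  = (0 + 1) / (5 + 1*3) = 1/8  = 0.125
```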
Laplace smoothing/correction

P(xi|y) without using Laplace smoothing:

Outlook
  xi         Yes   No    P(xi|Yes)   P(xi|No)
  Sunny      2     3     2/9         3/5
  Overcast   4     0     4/9         0/5
  Rainy      3     2     3/9         2/5

Predict: D16 = (Overcast, Cool, High, Strong), Play Tennis = ?

P(xi|y) using Laplace smoothing:

Outlook (using Laplace smoothing)
  xi         Yes   No    P(xi|Yes)   P(xi|No)
  Sunny      2     3     3/12        4/8
  Overcast   4     0     5/12        1/8
  Rainy      3     2     4/12        3/8

• Choose α = 1
• Outlook = {Sunny, Overcast, Rainy}, so |Outlook| = 3
• c(Overcast, Yes) = 4, c(Overcast, No) = 0
• c(Yes) = 9, c(No) = 5

P(Outlook = Overcast | Yes) = (4 + 1) / (9 + 1·3) = 5/12
P(Outlook = Overcast | No) = (0 + 1) / (5 + 1·3) = 1/8
NBC using Laplace smoothing

Play Tennis
  Yes   No    P(Yes)   P(No)
  9     5     9/14     5/14

Outlook (Laplace smoothing)
  xi         Yes   No    P(xi|Yes)   P(xi|No)
  Sunny      2     3     3/12        4/8
  Overcast   4     0     5/12        1/8
  Rainy      3     2     4/12        3/8

Temperature (Laplace smoothing)
  xi         Yes   No    P(xi|Yes)   P(xi|No)
  Hot        2     2     3/12        3/8
  Mild       4     2     5/12        3/8
  Cool       3     1     4/12        2/8

Humidity (Laplace smoothing)
  xi         Yes   No    P(xi|Yes)   P(xi|No)
  High       3     4     4/11        5/7
  Normal     6     1     7/11        2/7

Wind (Laplace smoothing)
  xi         Yes   No    P(xi|Yes)   P(xi|No)
  Weak       6     2     7/11        3/7
  Strong     3     3     4/11        4/7

Predict: D16 = (Overcast, Cool, High, Strong), Play Tennis = ?
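As a quick illustrative check, D16 can be scored with the smoothed fractions above (values copied from the tables):

```python
# D16 = (Overcast, Cool, High, Strong), with Laplace-smoothed conditionals
p_yes = 9/14 * 5/12 * 4/12 * 4/11 * 4/11   # P(Yes)*P(Overcast|Yes)*P(Cool|Yes)*P(High|Yes)*P(Strong|Yes)
p_no  = 5/14 * 1/8  * 2/8  * 5/7  * 4/7    # P(No)*P(Overcast|No)*P(Cool|No)*P(High|No)*P(Strong|No)
print(p_yes, p_no)                          # ≈ 0.0118 vs ≈ 0.0046
print("Yes" if p_yes > p_no else "No")      # smoothing removes the hard zero from P(Overcast|No)
```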
NBC with continuous features
• Dealing with continuous values:
  • Convert to discrete values (data binning)
    • E.g.:
      • Temperature = 80 → high
      • Temperature = 70 → mild
      • Temperature = 60 → cool
  • Use a probability density function (f):

    P(X = (x1, x2, …, xn) | Y = y) = ∏ f(Xi = xi | Y = y)   (product over i = 1 … n)

Probability density function for the normal distribution (Gaussian distribution):

    f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
NBC with continuous features
Using the probability density function (f).

Dataset (Temperature and Humidity as numeric values):

Day   Outlook    Temperature   Humidity   Wind     Play Tennis
D1    Sunny      85            85         Weak     No
D2    Sunny      80            90         Strong   No
D3    Overcast   83            86         Weak     Yes
D4    Rainy      70            96         Weak     Yes
D5    Rainy      68            80         Weak     Yes
D6    Rainy      65            70         Strong   No
D7    Overcast   64            65         Strong   Yes
D8    Sunny      72            95         Weak     No
D9    Sunny      69            70         Weak     Yes
D10   Rainy      75            80         Weak     Yes
D11   Sunny      75            70         Strong   Yes
D12   Overcast   72            90         Strong   Yes
D13   Overcast   81            75         Weak     Yes
D14   Rainy      71            91         Strong   No

Predict: D17 = {Outlook = Overcast, Temperature = 60, Humidity = 62, Wind = Weak}

For the Temperature feature:

    μ(Temp|yes) = (83 + 70 + … + 81) / 9 = 73
    σ(Temp|yes) = √[ ((83 − 73)² + (70 − 73)² + … + (81 − 73)²) / (9 − 1) ] ≈ 6.2

    μ(Temp|no) = (85 + 80 + … + 71) / 5 = 74.6
    σ(Temp|no) = √[ ((85 − 74.6)² + (80 − 74.6)² + … + (71 − 74.6)²) / (5 − 1) ] ≈ 7.9

Using the probability density function for the normal distribution,
f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²)):

    f(temp = 60 | yes) ≈ 0.0071
    f(temp = 60 | no) ≈ 0.0094
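A minimal sketch of the same Gaussian estimate for Temperature (data taken from the table above; the small differences from the slide values come from rounding σ before evaluating f):

```python
import math

temp_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]   # Temperature on the 9 "Yes" days
temp_no  = [85, 80, 65, 72, 71]                    # Temperature on the 5 "No" days

def gaussian_density(x, values):
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)   # sample variance (n - 1)
    sigma = math.sqrt(var)
    return (1 / (sigma * math.sqrt(2 * math.pi))) * math.exp(-(x - mu) ** 2 / (2 * var))

print(gaussian_density(60, temp_yes))   # ≈ 0.0070  (μ = 73,   σ ≈ 6.16)
print(gaussian_density(60, temp_no))    # ≈ 0.0091  (μ = 74.6, σ ≈ 7.89)
```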
Summary
• Naïve Bayes Classifier
• Naïve assumption
• Bayes’ theorem
• Types:
• Gaussian NB
• Multinomial NB
• Bernoulli NB
Exercise
Day   Outlook    Temperature   Humidity   Wind     Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Use the Naïve Bayes algorithm to predict:
• D15 = (Sunny, Hot, High, Weak) → Play Tennis (D15) = ?
• D16 = (Rain, Mild, Normal, Weak) → Play Tennis (D16) = ?