
Text Classification using Naïve Bayes

Hina Arora
TextClassification.ipynb
Text Classification Examples
• Sentiment Analysis (product reviews)
• Spam detection
• Authorship identification
• Subject Categorization
Naïve Bayes
Bayes Theorem
Prior Probability
• The probability that some hypothesis (h) is true: P(h)

Posterior or Conditional Probability

• The probability that some hypothesis (h) is true given some additional evidence (d):

  P(h|d) = P(h ∩ d) / P(d)

Bayes Theorem
• Describes the relationship between P(h|d), P(d), P(d|h), and P(h)
• Used because it is often easier to get data about P(d|h) than it is to get data about P(h|d)

  P(h|d) = P(d|h) P(h) / P(d)
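To make the relationship concrete, here is a minimal Python sketch that computes the posterior from the likelihood, prior, and evidence (the function name and the example numbers are illustrative only, not taken from the slides):

```python
def posterior(p_d_given_h, p_h, p_d):
    """Bayes' theorem: P(h|d) = P(d|h) * P(h) / P(d)."""
    return p_d_given_h * p_h / p_d

# Illustrative numbers only: likelihood 0.8, prior 0.3, evidence 0.5
print(posterior(p_d_given_h=0.8, p_h=0.3, p_d=0.5))  # 0.48
```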
Example-1
Laptop and Phone purchase data

• Probability that a randomly selected person purchases an iPhone: P(iPhone)

• Probability that a randomly selected person purchases a Mac: P(Mac)

• Probability that a randomly selected person purchases an iPhone and a Mac: P(iPhone ∩ Mac)

• Probability that a randomly selected person purchases an iPhone given that the person purchased a Mac: P(iPhone|Mac)

• Probability that a randomly selected person does not purchase an iPhone given that the person purchased a Mac: P(~iPhone|Mac)

• Probability that a randomly selected person purchases a Mac given that the person purchased an iPhone: P(Mac|iPhone)

• Probability that a randomly selected person does not purchase a Mac given that the person purchased an iPhone: P(~Mac|iPhone)
Laptop and Phone purchase data

• Probability that a randomly selected person purchases an iPhone:
  P(iPhone) = 5/10 = 0.5

• Probability that a randomly selected person purchases a Mac:
  P(Mac) = 6/10 = 0.6

• Probability that a randomly selected person purchases an iPhone and a Mac:
  P(iPhone ∩ Mac) = 4/10 = 0.4

• Probability that a randomly selected person purchases an iPhone given that the person purchased a Mac:
  P(iPhone|Mac) = P(iPhone ∩ Mac) / P(Mac) = 0.4 / 0.6 = 0.667

• Probability that a randomly selected person does not purchase an iPhone given that the person purchased a Mac:
  P(~iPhone|Mac) = 1 − 0.667 = 0.333

• Probability that a randomly selected person purchases a Mac given that the person purchased an iPhone:
  P(Mac|iPhone) = P(iPhone ∩ Mac) / P(iPhone) = 0.4 / 0.5 = 0.8

• Probability that a randomly selected person does not purchase a Mac given that the person purchased an iPhone:
  P(~Mac|iPhone) = 1 − 0.8 = 0.2
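As a sanity check, here is a small Python sketch that recomputes these quantities from a purchase table. The original table is not reproduced in the text, so the boolean lists below are a hypothetical 10-person dataset constructed to match the stated counts (5 iPhone buyers, 6 Mac buyers, 4 who bought both):

```python
# Hypothetical data consistent with the slide's counts (not the original table)
iphone = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # 5 iPhone purchasers
mac    = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]   # 6 Mac purchasers, 4 overlapping with iPhone

n = len(iphone)
p_iphone = sum(iphone) / n                              # 0.5
p_mac = sum(mac) / n                                    # 0.6
p_both = sum(i and m for i, m in zip(iphone, mac)) / n  # 0.4

p_iphone_given_mac = p_both / p_mac                     # 0.667
p_mac_given_iphone = p_both / p_iphone                  # 0.8

print(round(p_iphone_given_mac, 3), round(p_mac_given_iphone, 3))
```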
Maximum A Posteriori (MAP) Decision Rule

We typically use Bayes Theorem to decide among alternative hypotheses (given some evidence) by
computing the conditional probability of each hypothesis and picking the hypothesis with the
highest conditional probability. This is called the maximum a posteriori decision rule.

Let's say we have a number of possible hypotheses:

H = {h1, h2, h3, h4, …, hn}

Then,

h_MAP = argmax_{h∈H} P(h|d) = argmax_{h∈H} P(d|h) P(h) / P(d) ≈ argmax_{h∈H} P(d|h) P(h)

Note: we can ignore the denominator P(d) since it is common across all hypotheses over which
we are evaluating the argmax.
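A minimal sketch of the MAP rule in Python, assuming we already have the unnormalized posterior scores P(d|h)·P(h) for each hypothesis (the dictionary below is purely illustrative):

```python
# Unnormalized posteriors P(d|h) * P(h) for each hypothesis (illustrative values)
scores = {"h1": 0.02, "h2": 0.07, "h3": 0.01}

# MAP decision: pick the hypothesis with the highest score; P(d) can be ignored
h_map = max(scores, key=scores.get)
print(h_map)  # h2
```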
Example-2
Let's say the purchase data in the previous table came from an online shopping site.

• Now let’s say a person was online-shopping, and let’s say we want to display an ad for iPhones, but
only if we think that the person is likely to buy an iPhone. Let’s assume we know that this person
already owns a Mac.

• There are two competing hypotheses:


o This person will purchase an iPhone given he has already purchased a Mac
o This person will not purchase an iPhone given he has already purchased a Mac

• Recall that we have already computed 𝑃 𝑖𝑃ℎ𝑜𝑛𝑒|𝑀𝑎𝑐 = 0.667 and 𝑃 ~𝑖𝑃ℎ𝑜𝑛𝑒|𝑀𝑎𝑐 = 0.333

• Since 𝑃 𝑖𝑃ℎ𝑜𝑛𝑒|𝑀𝑎𝑐 > 𝑃 ~𝑖𝑃ℎ𝑜𝑛𝑒|𝑀𝑎𝑐 , we can conclude that it is more likely that the person
will buy an iPhone and hence show the ad
Example-3
Let’s say we wanted to determine whether or not a patient has a particular type of disease.

And let’s say we have a blood test that returns a POS or NEG to indicate whether or not a person has the
disease.

Assume we have the following information:


• Percent population in the US that has this type of disease is 0.8%
• When the disease is present, the blood test returns a correct POS result 98% of the time
• When the disease is not present, the blood test returns a correct NEG result 97% of the time

Now assume a person comes in for a blood test, and tests POS for the disease.

Is he likely to have the disease?


Translating this information into probabilities:

• P(disease) = 0.008, so P(~disease) = 0.992
• P(POS|disease) = 0.98, so P(NEG|disease) = 0.02
• P(NEG|~disease) = 0.97, so P(POS|~disease) = 0.03

What is the probability that he has the disease?


There are two competing hypotheses:

• This person has the disease, given he tested POS

P(disease|POS) = P(POS|disease) * P(disease) / P(POS) = (0.98 * 0.008) / P(POS) = 0.0078 / P(POS)

• This person does not have the disease, given he tested POS

P(~disease|POS) = P(POS|~disease) * P(~disease) / P(POS) = (0.03 * 0.992) / P(POS) = 0.0298 / P(POS)

Since 𝑃 ~𝑑𝑖𝑠𝑒𝑎𝑠𝑒|𝑃𝑂𝑆 > 𝑃 𝑑𝑖𝑠𝑒𝑎𝑠𝑒|𝑃𝑂𝑆 , we can conclude that it is more likely that this person does not
have the disease

In fact, this person has only a 21% chance of having the disease:

P(disease|POS) = 0.0078 / (0.0078 + 0.0298) = 0.21
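Here is a short Python sketch that reproduces this calculation, using the numbers from the example above:

```python
# Prior and test characteristics from the example
p_disease = 0.008
p_no_disease = 1 - p_disease            # 0.992
p_pos_given_disease = 0.98
p_pos_given_no_disease = 0.03

# Unnormalized posteriors (the numerators of Bayes' theorem)
num_disease = p_pos_given_disease * p_disease            # ~0.0078
num_no_disease = p_pos_given_no_disease * p_no_disease   # ~0.0298

# Normalize by P(POS), which is the sum of the numerators
p_disease_given_pos = num_disease / (num_disease + num_no_disease)
print(round(p_disease_given_pos, 2))  # ~0.21
```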
Naïve Bayes Classifier
• Often, we have more than just one piece of evidence.

• To compute the probability of a hypothesis given multiple pieces of evidence, we can simply
multiply the individual probabilities.
Note: we are able to do this because we are making a naïve conditional independence
assumption. This assumption is often violated in the real world, but we will make it anyway, since
it’s been found that Naïve Bayes works quite well even with this naïve assumption!

• Let’s say we have a number of possible hypotheses 𝐻 = {ℎ1 , ℎ2 , ℎ3 , ℎ4 , … , ℎ𝑛 }


And let's say we have multiple pieces of evidence D = {d1, d2, d3, d4, …, dm}.

Then,

h_MAP = argmax_{h∈H} P(h|d1 d2 … dm)
      = argmax_{h∈H} P(d1|h) P(d2|h) … P(dm|h) P(h)
      = argmax_{h∈H} P(h) ∏_{d∈D} P(d|h)
Example-4
User profiles for users who purchased i100 vs i500
Which model should we recommend to a person whose main interest is health, current exercise level is moderate, is
moderately motivated, and is comfortable with technological devices?

P(i100|healthInterest, moderateExercise, moderateMotivation, techComfortable)


= P(i100) *
P(healthInterest|i100) *
P(moderateExercise|i100) *
P(moderateMotivation|i100) *
P(techComfortable|i100)
= 6/15 * 1/6 * 1/6 * 5/6 * 2/6
= 0.00309

P(i500|healthInterest, moderateExercise, moderateMotivation, techComfortable)


= P(i500) *
P(healthInterest|i500) *
P(moderateExercise|i500) *
P(moderateMotivation|i500) *
P(techComfortable|i500)
= 9/15 * 4/9 * 3/9 * 3/9 * 6/9
= 0.01975

Since 0.01975 > 0.00309, we should recommend i500.
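A small Python sketch that carries out this comparison using the fractions quoted above (the underlying user-profile table itself is not reproduced here):

```python
from math import prod

# Probabilities read off the slide's user-profile counts
p_i100 = prod([6/15, 1/6, 1/6, 5/6, 2/6])   # ≈ 0.00309
p_i500 = prod([9/15, 4/9, 3/9, 3/9, 6/9])   # ≈ 0.01975

recommendation = "i100" if p_i100 > p_i500 else "i500"
print(round(p_i100, 5), round(p_i500, 5), recommendation)  # 0.00309 0.01975 i500
```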


Naïve Bayes: Training, Testing, Classification
h_MAP = argmax_{h∈H} P(h|D) ≈ argmax_{h∈H} P(h) ∏_{d∈D} P(d|h)
Where,
Hypotheses 𝐻 = {ℎ1 , ℎ2 , ℎ3 , ℎ4 , … , ℎ𝑛 }
Evidence 𝐷 = {𝑑1 , 𝑑2 , 𝑑3 , 𝑑4 , … , 𝑑𝑚 }

Two-step process:
1) Build the Naïve Bayes Classifier.
• First partition the given data into two sets:
• Use the Training Partition to get frequency-based estimates of the various probabilities
• Use the Testing Partition to classify the test data and evaluate the model
2) Use the Naïve Bayes Classifier to classify new data
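As a concrete illustration of this two-step process, here is a hedged sketch using scikit-learn's MultinomialNB (the tiny corpus, variable names, and split ratio are placeholders, not taken from the slides or the accompanying notebook):

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Placeholder corpus; in practice `texts` and `labels` come from your own data
texts = ["great yogurt great flavor", "big mistake", "great texture", "not good at all"]
labels = ["POS", "NEG", "POS", "NEG"]

# Step 1a: partition the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels)

# Step 1b: get frequency-based probability estimates from the training partition
vectorizer = CountVectorizer()
model = MultinomialNB(alpha=1.0)              # alpha=1.0 is Laplace smoothing
model.fit(vectorizer.fit_transform(X_train), y_train)

# Step 1c: classify the testing partition and evaluate the model
y_pred = model.predict(vectorizer.transform(X_test))
print(accuracy_score(y_test, y_pred))

# Step 2: use the classifier on new data
print(model.predict(vectorizer.transform(["great flavor"])))
```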
Issue
• What if the training set is missing a combination of evidence and hypothesis?

• This would cause the frequency-based estimate of the conditional probability to be 0,
  which would in turn cause the Naïve Bayes probability estimate to be 0:

  P(h|d1 d2 … dm) ∝ P(d1|h) P(d2|h) … P(dm|h) P(h)

• Therefore, it is customary to incorporate a small correction factor in all probability
  estimates so that no probability is ever exactly zero. You can use Laplace or
  Lidstone smoothing to handle this. We'll see how this is done in the next section.
Aside: Continuous Data
• Since we are going to be dealing with text data, our discussion of Naïve Bayes so far has been around
frequency-based estimates of probabilities.

• As we saw, we calculate frequency-based estimates of probabilities just by dividing frequencies of
  occurrence:

  o P(x|y) = n_c / n, where n is the total number of instances of class y in the training set, and n_c is the
    total number of instances of class y that have the value x

• For continuous data, we can take one of two approaches to estimate probabilities:
  o Make categories by discretizing the continuous attributes, and then just treat them as categorical variables
    (e.g. age: <18, 18-25, 26-40, 41-60, >60)
  o Use Gaussian distributions to estimate the probabilities (mean μ and standard deviation σ):

    P(x_i|y_j) = (1 / (√(2π) σ_ij)) · exp(−(x_i − μ_ij)² / (2 σ_ij²))
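A minimal Python sketch of the Gaussian estimate (the feature value and class statistics below are illustrative only):

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """P(x | class) under a Gaussian with the class's mean mu and std dev sigma."""
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Illustrative: likelihood of age 30 under a class whose ages have mean 35, std dev 5
print(gaussian_likelihood(30, mu=35, sigma=5))  # ≈ 0.0484
```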
Text Classification using
Naïve Bayes
Recall: Naïve Bayes Classification
h_MAP = argmax_{h∈H} P(h|D) ≈ argmax_{h∈H} P(h) ∏_{d∈D} P(d|h)
Where,
Hypotheses 𝐻 = {ℎ1 , ℎ2 , ℎ3 , ℎ4 , … , ℎ𝑛 }
Evidence 𝐷 = {𝑑1 , 𝑑2 , 𝑑3 , 𝑑4 , … , 𝑑𝑚 }

Two-step process:
1) Build the Naïve Bayes Classifier.
• First partition the given data into two sets:
• Use the Training Partition to get frequency-based estimates of the various probabilities
• Use the Testing Partition to classify the test data and evaluate the model
2) Use the Naïve Bayes Classifier to classify new data
Naïve Bayes: Text Classification
Recall that 𝑃 𝑑|ℎ is the probability of seeing some evidence or data 𝑑 ∈ 𝐷, given the
hypothesis ℎ.

• In text classification, the hypotheses ℎ ∈ 𝐻 are going to be class labels, so for instance
whether the text is positive or negative sentiment.

• And the data 𝑑 ∈ 𝐷 we are going to use are essentially the words in the text. We are
going to treat documents as bags of unordered words. So we can then ask, “What is the
probability that the word amazing occurs given positive sentiment documents?”.
P(amazing|POS) = (number of times the word 'amazing' occurs in positive sentiment documents) /
                 (total number of words in positive sentiment documents)
h_MAP ≈ argmax_{c_i∈C} P(c_i) ∏_{w_k∈W} P(w_k|c_i)

Where,
Class C = {c1, c2, c3, c4, …, cn}
Words W = {w1, w2, w3, w4, …, wm}

P(c_i) = (number of documents ∈ c_i) / (total number of documents)

P(w_k|c_i) = (number of times word w_k occurs in documents ∈ c_i) / (total number of words in documents ∈ c_i)
           = n_{w_k,c_i} / n_{c_i}
           = (n_{w_k,c_i} + α) / (n_{c_i} + α|V|)

With no smoothing (α = 0), Lidstone smoothing (α < 1), or Laplace/add-one smoothing (α = 1),
where |V| is the number of unique words (the vocabulary) across all documents in the training corpus (any class)
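A short Python sketch of these estimates, assuming the training documents have already been tokenized into (class, word list) pairs (the tiny corpus below is hypothetical):

```python
from collections import Counter, defaultdict

# Hypothetical tokenized training corpus: (class label, list of words)
corpus = [
    ("POS", ["great", "yogurt", "great", "flavor"]),
    ("POS", ["great", "texture"]),
    ("NEG", ["big", "mistake"]),
]

word_counts = defaultdict(Counter)   # n_{w_k, c_i}
class_totals = Counter()             # n_{c_i}: total words in class c_i
doc_counts = Counter()               # number of documents in class c_i
vocab = set()

for label, words in corpus:
    doc_counts[label] += 1
    class_totals[label] += len(words)
    word_counts[label].update(words)
    vocab.update(words)

def prior(c):
    # P(c_i) = documents in c_i / total documents
    return doc_counts[c] / sum(doc_counts.values())

def cond_prob(w, c, alpha=1.0):
    # (n_{w,c} + alpha) / (n_c + alpha * |V|); alpha=1 is Laplace smoothing
    return (word_counts[c][w] + alpha) / (class_totals[c] + alpha * len(vocab))

print(prior("POS"), cond_prob("great", "POS"), cond_prob("great", "NEG"))
```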
Example:
Tweets about a New Yogurt Flavor

preprocessText()
Prior Probabilities from the Training Data
Conditional Probabilities from the Training Data
Predicting class of Test Data

P(POS|great, yogurt, great, flavor, great, texture, big, mistake)


= P(POS) *
P(great|POS) * P(yogurt|POS) * P(great|POS) * P(flavor|POS) *
P(great|POS) * P(texture|POS) * P(big|POS) * P(mistake|POS)
= P(c1) *
P(w1|c1) * P(w5|c1) * P(w1|c1) * P(w2|c1) *
P(w1|c1) * P(w4|c1) * P(w6|c1) * P(w7|c1)
= 3/4 * 5/15 * 2/15 * 5/15 * 2/15 * 5/15 * 2/15 * 1/15 * 1/15
= 2.9e-7

P(NEG|great, yogurt, great, flavor, great, texture, big, mistake)


= P(NEG) *
P(great|NEG) * P(yogurt|NEG) * P(great|NEG) * P(flavor|NEG) *
P(great|NEG) * P(texture|NEG) * P(big|NEG) * P(mistake|NEG)
= P(c2) *
P(w1|c2) * P(w5|c2) * P(w1|c2) * P(w2|c2) *
P(w1|c2) * P(w4|c2) * P(w6|c2) * P(w7|c2)
= 1/4 * 2/10 * 1/10 * 2/10 * 1/10 * 2/10 * 1/10 * 2/10 * 2/10
= 0.8e-7

Since 2.9e-7 > 0.8e-7, we would classify the document as POS.
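A Python sketch reproducing this comparison directly from the probabilities quoted above (the underlying word-count tables are on the preceding slides and not reproduced here):

```python
from math import prod

# Prior followed by the word conditionals for the test document, per class
p_pos = prod([3/4, 5/15, 2/15, 5/15, 2/15, 5/15, 2/15, 1/15, 1/15])
p_neg = prod([1/4, 2/10, 1/10, 2/10, 1/10, 2/10, 1/10, 2/10, 2/10])

print(f"{p_pos:.1e} {p_neg:.1e}", "POS" if p_pos > p_neg else "NEG")  # 2.9e-07 8.0e-08 POS
```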


Log-form of Naïve Bayes

Notice how we can end up with some really tiny fractions when using Naïve Bayes in Text
Classification.

We therefore often use the log-form to avoid floating-point underflow.


Note: ln = log_e (the natural logarithm)

h_MAP ≈ argmax_{c_i∈C} P(c_i) ∏_{w_k∈W} P(w_k|c_i)
      = argmax_{c_i∈C} [ ln P(c_i) + Σ_{w_k∈W} ln P(w_k|c_i) ]

Where,
Class 𝐶 = {𝑐1 , 𝑐2 , 𝑐3 , 𝑐4 , … , 𝑐𝑛 }
Words 𝑊 = {𝑤1 , 𝑤2 , 𝑤3 , 𝑤4 , … , 𝑤𝑚 }
http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
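A Python sketch of the log-sum form, reusing the probabilities from the yogurt example above (an illustration only, not the notebook's implementation):

```python
from math import log

# Prior followed by the word conditionals for the test document, per class
probs = {
    "POS": [3/4, 5/15, 2/15, 5/15, 2/15, 5/15, 2/15, 1/15, 1/15],
    "NEG": [1/4, 2/10, 1/10, 2/10, 1/10, 2/10, 1/10, 2/10, 2/10],
}

# Sum of logs replaces the product of probabilities; the argmax is unchanged
log_scores = {c: sum(log(p) for p in ps) for c, ps in probs.items()}
print(log_scores, max(log_scores, key=log_scores.get))  # POS wins
```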
