Text Mining - Classification
Hina Arora
TextClassification.ipynb
Text Classification Examples
• Sentiment Analysis (product reviews)
• Spam detection
• Authorship identification
• Subject Categorization
Naïve Bayes
Bayes Theorem
Prior Probability
• The probability that some hypothesis $h$ is true: $P(h)$
Bayes Theorem
• Describes the relationship between $P(h|d)$, $P(d)$, $P(d|h)$, and $P(h)$
• Used because it is often easier to get data about $P(d|h)$ than it is to get data about $P(h|d)$

$$P(h|d) = \frac{P(d|h)\,P(h)}{P(d)}$$
Example-1
Laptop and Phone purchase data
Probability that a randomly selected person purchases an iPhone: $P(iPhone)$
Probability that a randomly selected person purchases a Mac: $P(Mac)$
We typically use Bayes Theorem to decide among alternative hypotheses (given some evidence) by computing the conditional probability of each hypothesis and picking the hypothesis with the highest conditional probability. This is called the maximum a posteriori (MAP) decision rule.
Then,

$$h_{MAP} = \arg\max_{h \in H} P(h|d) = \arg\max_{h \in H} \frac{P(d|h)\,P(h)}{P(d)} \approx \arg\max_{h \in H} P(d|h)\,P(h)$$
Note: we can ignore the denominator $P(d)$ since it is the same for every hypothesis over which we are evaluating the argmax.
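A minimal sketch of the MAP rule in Python, using made-up hypothesis names, priors, and likelihoods (the values below are purely illustrative):

# Hypothetical priors P(h) and likelihoods P(d|h) for one observed piece of evidence d
priors = {"h1": 0.6, "h2": 0.4}
likelihoods = {"h1": 0.2, "h2": 0.5}

# P(d) is common to every hypothesis, so we compare P(d|h) * P(h) directly
scores = {h: likelihoods[h] * priors[h] for h in priors}
h_map = max(scores, key=scores.get)

print(scores)   # {'h1': 0.12, 'h2': 0.2}
print(h_map)    # h2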
Example-2
Let’s say the purchase data we saw in the previous table came from an online shopping site.
• Now let’s say a person was online-shopping, and let’s say we want to display an ad for iPhones, but
only if we think that the person is likely to buy an iPhone. Let’s assume we know that this person
already owns a Mac.
• Recall that we have already computed $P(iPhone|Mac) = 0.667$ and $P(\sim iPhone|Mac) = 0.333$
• Since $P(iPhone|Mac) > P(\sim iPhone|Mac)$, we conclude that the person is more likely to buy an iPhone, and hence we show the ad (a small sketch of this computation follows below)
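The purchase table itself is not reproduced here, so the counts below are assumed values chosen only to be consistent with the probabilities quoted above:

# Hypothetical counts (the actual purchase table lives in the accompanying slides/notebook)
mac_owners = 6                # assumed: number of people who own a Mac
mac_and_iphone = 4            # assumed: Mac owners who also purchased an iPhone

p_iphone_given_mac = mac_and_iphone / mac_owners      # 4/6 ≈ 0.667
p_not_iphone_given_mac = 1 - p_iphone_given_mac       # ≈ 0.333

# Show the ad only if buying an iPhone is the more likely outcome
show_ad = p_iphone_given_mac > p_not_iphone_given_mac
print(show_ad)   # True -> show the iPhone ad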
Example-3
Let’s say we wanted to determine whether or not a patient has a particular type of disease.
And let’s say we have a blood test that returns a POS or NEG to indicate whether or not a person has the
disease.
• When the disease is present, the blood test returns a correct POS result 98% of the time
=> P(POS|disease) = 0.98 => P(NEG|disease) = 0.02
• When the disease is not present, the blood test returns a correct NEG result 97% of the time
=> P(NEG|~disease) = 0.97 => P(POS|~disease) = 0.03
Now assume a person comes in for a blood test, and tests POS for the disease. Does this person have the disease?
• Taking the prior probability of the disease to be $P(disease) = 0.008$ (so $P(\sim disease) = 0.992$), the unnormalized posteriors are
$P(POS|disease)\,P(disease) = 0.98 \times 0.008 \approx 0.0078$ and $P(POS|\sim disease)\,P(\sim disease) = 0.03 \times 0.992 \approx 0.0298$
• Since $P(\sim disease|POS) > P(disease|POS)$, we can conclude that it is more likely that this person does not have the disease
In fact, this person has only a 21% chance of having the disease:

$$P(disease|POS) = \frac{0.0078}{0.0078 + 0.0298} \approx 0.21$$
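A minimal sketch of this calculation in Python, using the disease prior of 0.008 that underlies the arithmetic above:

# Prior probability of the disease (base rate)
p_disease = 0.008
p_no_disease = 1 - p_disease            # 0.992

# Test characteristics from the example
p_pos_given_disease = 0.98
p_pos_given_no_disease = 0.03

# Unnormalized posteriors P(POS|h) * P(h)
score_disease = p_pos_given_disease * p_disease              # ~0.0078
score_no_disease = p_pos_given_no_disease * p_no_disease     # ~0.0298

# Normalize to get P(disease|POS)
p_disease_given_pos = score_disease / (score_disease + score_no_disease)
print(round(p_disease_given_pos, 2))    # 0.21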
Naïve Bayes Classifier
• Often, we have more than just one piece of evidence.
• To compute the probability of a hypothesis given multiple pieces of evidence, we simply multiply the individual conditional probabilities $P(d_i|h)$ (along with the prior $P(h)$).
Note: we are able to do this because we are making a naïve conditional independence
assumption. This assumption is often violated in the real world, but we will make it anyway, since
it’s been found that Naïve Bayes works quite well even with this naïve assumption!
Two-step process:
1) Build the Naïve Bayes Classifier.
• First partition the given data into two sets:
• Use the Training Partition to get frequency-based estimates of the various probabilities
• Use the Testing Partition to classify the test data and evaluate the model
2) Use the Naïve Bayes Classifier to classify new data
Issue
• What if the training set is missing a combination of evidence and hypothesis?
• For continuous data, we can take one of two approaches to estimate probabilities:
o Make categories by discretizing the continuous attributes, and then treat them as categorical variables (e.g. age: <18, 18-25, 26-40, 41-60, >60)
o Use Gaussian Distributions to estimate the probabilities (mean 𝜇 and standard deviation 𝜎)
$$P(x_i|y_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \; e^{-\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2}}$$
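A minimal sketch of the Gaussian estimate of $P(x_i|y_j)$; in practice the mean and standard deviation would be estimated from the training rows of class $y_j$, and the values used here are assumed for illustration:

import math

def gaussian_likelihood(x, mu, sigma):
    # P(x | y) under a normal distribution with class-conditional mean mu and std dev sigma
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coeff * math.exp(exponent)

# e.g. age = 35 for a class whose ages have mean 40 and std dev 10 (assumed values)
print(gaussian_likelihood(35, mu=40, sigma=10))   # ≈ 0.0352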
Text Classification using Naïve Bayes
Recall: Naïve Bayes Classification
$$h_{MAP} = \arg\max_{h \in H} P(h|D) \approx \arg\max_{h \in H} P(h) \prod_{d \in D} P(d|h)$$
Where,
Hypotheses $H = \{h_1, h_2, h_3, h_4, \ldots, h_n\}$
Evidence $D = \{d_1, d_2, d_3, d_4, \ldots, d_m\}$
Naïve Bayes: Text Classification
Recall that $P(d|h)$ is the probability of seeing some evidence or data $d \in D$, given the hypothesis $h$.
• In text classification, the hypotheses $h \in H$ are going to be class labels, so for instance whether the text is positive or negative sentiment.
• And the data $d \in D$ we are going to use are essentially the words in the text. We are going to treat documents as bags of unordered words. So we can then ask, “What is the probability that the word amazing occurs given positive sentiment documents?”
$$P(amazing|POS) = \frac{\text{number of times the word 'amazing' occurs in positive sentiment documents}}{\text{total number of words in positive sentiment documents}}$$
$$h_{MAP} \approx \arg\max_{c_i \in C} P(c_i) \prod_{w_k \in W} P(w_k|c_i)$$
Where,
Classes $C = \{c_1, c_2, c_3, c_4, \ldots, c_n\}$
Words $W = \{w_1, w_2, w_3, w_4, \ldots, w_m\}$
$$P(c_i) = \frac{\text{number of documents in class } c_i}{\text{total number of documents}}$$
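A minimal sketch of these frequency-based estimates on a tiny made-up training set (the course notebook TextClassification.ipynb uses its own data and preprocessing):

from collections import Counter, defaultdict

# Toy labelled training documents (assumed for illustration)
train = [
    ("amazing phone great battery", "POS"),
    ("great screen amazing camera", "POS"),
    ("terrible battery awful screen", "NEG"),
]

# Prior probabilities: P(c) = (# documents in class c) / (total # documents)
class_counts = Counter(label for _, label in train)
priors = {c: n / len(train) for c, n in class_counts.items()}

# Conditional probabilities: P(w|c) = (# times w occurs in class c) / (total # words in class c)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())

def p_word_given_class(word, c):
    return word_counts[c][word] / sum(word_counts[c].values())

print(priors)                                  # POS ≈ 0.667, NEG ≈ 0.333
print(p_word_given_class("amazing", "POS"))    # 2/8 = 0.25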
preprocessText()
Prior Probabilities from the Training Data
Conditional Probabilities from the Training Data
Predicting class of Test Data
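The notebook walks through these four steps (preprocessing, priors, conditionals, prediction) by hand; an equivalent end-to-end sketch using scikit-learn, which is not necessarily how the notebook implements it, would look roughly like this:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (assumed for illustration)
train_docs = ["amazing phone great battery", "terrible battery awful screen"]
train_labels = ["POS", "NEG"]

vectorizer = CountVectorizer()                 # bag-of-words counts
X_train = vectorizer.fit_transform(train_docs)

clf = MultinomialNB()                          # multinomial Naive Bayes classifier
clf.fit(X_train, train_labels)

# Predict the class of a new (test) document
X_test = vectorizer.transform(["amazing great phone"])
print(clf.predict(X_test))                     # ['POS']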
Notice how we can end up with some really tiny fractions when using Naïve Bayes in Text Classification. To avoid numerical underflow, we take logs and add instead of multiplying:

$$h_{MAP} \approx \arg\max_{c_i \in C} \left[ \ln P(c_i) + \sum_{w_k \in W} \ln P(w_k|c_i) \right]$$
Where,
Classes $C = \{c_1, c_2, c_3, c_4, \ldots, c_n\}$
Words $W = \{w_1, w_2, w_3, w_4, \ldots, w_m\}$
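A minimal sketch of the log-space scoring, with assumed class priors and word conditionals; summing logs avoids the underflow that multiplying many tiny $P(w_k|c_i)$ values can cause:

import math

# Assumed class priors P(c) and word conditionals P(w|c), for illustration only
priors = {"POS": 0.5, "NEG": 0.5}
cond = {
    "POS": {"amazing": 0.05, "battery": 0.02},
    "NEG": {"amazing": 0.001, "battery": 0.03},
}
doc_words = ["amazing", "battery"]   # words in the document being classified

# Score each class in log space: ln P(c) + sum of ln P(w|c)
scores = {
    c: math.log(priors[c]) + sum(math.log(cond[c][w]) for w in doc_words)
    for c in priors
}
print(max(scores, key=scores.get))   # POS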
http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf