Text Classification and Naive Bayes
Is this spam?
Who wrote which Federalist Papers?
1787–88: essays anonymously written by Alexander Hamilton, James Madison, and John Jay to convince New York to ratify the U.S. Constitution.
Authorship of 12 of the essays was disputed between Hamilton and Madison.
What is the subject of this article?
Example: "research in the cure of tuberculosis of lungs by x-ray conducted in India in 1950"
would be categorized as:
Medicine, Lungs; Tuberculosis: Treatment; X-ray: Research. India, 1950
https://en.wikipedia.org/wiki/Colon_classification
Text Classification
Assigning subject categories, topics, or genres
Spam detection
Authorship identification (who wrote this?)
Language Identification (is this Portuguese?)
Sentiment analysis
…
Text Classification: definition
Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2, …, cJ}
Output:
◦ a predicted class c ∈ C
Classification Methods:
Supervised Machine Learning
Many kinds of classifiers!
• Naïve Bayes (this lecture)
• Logistic regression
• Neural networks
• k-nearest neighbors
• …
The bag of words representation
[Figure: a document is reduced to a bag of word counts, and the classifier γ maps that bag to a class c — e.g., seen: 2, sweet: 1, whimsical: 1, recommend: 1, happy: 1, …]
Bayes’ Rule Applied to Documents and Classes
P(c | d) = P(d | c) P(c) / P(d)
Naive Bayes Classifier (I)
c_MAP = argmax_{c∈C} P(c | d)                 (most likely class)
      = argmax_{c∈C} P(d | c) P(c) / P(d)     (Bayes rule)
      = argmax_{c∈C} P(d | c) P(c)            (dropping the denominator, which doesn't depend on c)
Naive Bayes Classifier (II)
c_MAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)
where P(x1, …, xn | c) is the "likelihood" and P(c) is the "prior", and document d is represented as features x1, x2, …, xn.
Without further assumptions the likelihood has far too many parameters; it could only be estimated if a very, very large number of training examples was available.
Multinomial Naive Bayes: Independence Assumptions
P(x1, x2, …, xn | c)
Bag of Words assumption: assume position doesn't matter.
Conditional independence: assume the feature probabilities P(xi | cj) are independent given the class cj:
P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · … · P(xn | c)
This gives the multinomial naive Bayes classifier:
c_NB = argmax_{cj∈C} P(cj) ∏_i P(xi | cj)
Notes:
1) Taking log doesn't change the ranking of classes!
The class with highest probability also has highest log probability!
2) It's a linear model in log space:
c_NB = argmax_{c∈C} [ log P(c) + Σ_i log P(xi | c) ]
Just a max of a sum of weights: a linear function of the inputs.
So naive Bayes is a linear classifier.
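The "max of a sum of weights" view can be sketched in a few lines of Python. All probabilities below are made up for illustration; only the scoring structure matters:

```python
import math

# Hypothetical toy parameters: log-priors and per-word log-likelihoods
# for two classes (the numbers are invented for illustration).
log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
log_lik = {
    "pos": {"great": math.log(0.05), "boring": math.log(0.001)},
    "neg": {"great": math.log(0.005), "boring": math.log(0.04)},
}

def nb_score(words, c):
    # A sum of weights: log P(c) + sum_i log P(w_i | c)
    return log_prior[c] + sum(log_lik[c][w] for w in words)

doc = ["great", "great", "boring"]
best = max(log_prior, key=lambda c: nb_score(doc, c))  # "pos"
```

Because the score is a fixed weight per word occurrence plus a class bias, ranking classes by it is exactly a linear decision rule.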
The Naive Bayes Classifier
Naive Bayes: Learning
Sec.13.3
Prior:      P̂(cj) = N_cj / N_total
Likelihood: P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
Parameter estimation
Problem: suppose no positive training document contains the word "fantastic". Then
P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w∈V} count(w, positive) = 0
and a single zero wipes out the whole product, no matter what the other words say.
Laplace (add-1) smoothing:
P̂(wi | c) = (count(wi, c) + 1) / Σ_{w∈V} (count(w, c) + 1)
          = (count(wi, c) + 1) / ( (Σ_{w∈V} count(w, c)) + |V| )
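The two estimates above (prior and add-1-smoothed likelihood) can be sketched as a small training function. The toy corpus is invented for illustration:

```python
from collections import Counter

def train_nb(docs):
    """Train multinomial naive Bayes with add-1 smoothing.
    docs: list of (list_of_words, class_label) pairs."""
    classes = {c for _, c in docs}
    vocab = {w for words, _ in docs for w in words}
    n_total = len(docs)
    prior, likelihood = {}, {}
    for c in classes:
        class_docs = [words for words, label in docs if label == c]
        prior[c] = len(class_docs) / n_total            # P(c) = N_c / N_total
        counts = Counter(w for words in class_docs for w in words)
        denom = sum(counts.values()) + len(vocab)       # add |V| for smoothing
        likelihood[c] = {w: (counts[w] + 1) / denom for w in vocab}
    return prior, likelihood, vocab

docs = [(["good", "great"], "pos"), (["bad", "boring"], "neg")]
prior, lik, V = train_nb(docs)
```

Note that every vocabulary word gets a nonzero probability in every class, which is the whole point of the smoothing.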
Multinomial Naïve Bayes: Learning
Binary multinomial naive Bayes
A variant often used for sentiment: clip each word's count within a document at 1 (equivalently, remove duplicate words from each document before counting). For sentiment, word occurrence seems to matter more than word frequency.
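The per-document clipping step can be sketched as a simple deduplication that preserves word order:

```python
def binarize(doc_words):
    # Binary multinomial NB preprocessing: keep at most one occurrence
    # of each word per document, so within-document counts are 0 or 1.
    seen, out = set(), []
    for w in doc_words:
        if w not in seen:
            seen.add(w)
            out.append(w)
    return out
```

Training then proceeds exactly as in ordinary multinomial naive Bayes, just on the binarized documents.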
Riloff and Wiebe (2003). Learning extraction patterns for subjective expressions. EMNLP-2003.
Using Lexicons in Sentiment Classification
Add a feature that gets a count whenever a word
from the lexicon occurs
◦ E.g., a feature called "this word occurs in the positive
lexicon" or "this word occurs in the negative lexicon"
Now all positive words (good, great, beautiful,
wonderful) or negative words count for that feature.
Using 1-2 features isn't as good as using all the words.
• But when training data is sparse or not representative of the
test set, dense lexicon features can help
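A lexicon-count feature of the kind described above can be sketched as follows; the tiny lexicons here are made up for illustration:

```python
# Hypothetical miniature sentiment lexicons (real ones have thousands of words).
POS_LEX = {"good", "great", "beautiful", "wonderful"}
NEG_LEX = {"bad", "terrible", "awful"}

def lexicon_features(words):
    # Two dense features instead of one feature per word:
    # how many tokens fall in each lexicon.
    return {
        "pos_lexicon_count": sum(w in POS_LEX for w in words),
        "neg_lexicon_count": sum(w in NEG_LEX for w in words),
    }

feats = lexicon_features(["a", "great", "and", "beautiful", "but", "awful"])
```

Because the counts pool evidence across many words, these features stay informative even when individual lexicon words never appeared in training.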
Naive Bayes in Other tasks: Spam Filtering
SpamAssassin Features:
◦ Mentions millions of dollars ($NN,NNN,NNN.NN)
◦ From: starts with many numbers
◦ Subject is all capitals
◦ HTML has a low ratio of text to image area
◦ "One hundred percent guaranteed"
◦ Claims you can be removed from the list
Naive Bayes in Language ID
Determining what language a piece of text is written in.
Features based on character n-grams do very well
Important to train on lots of varieties of each language
(e.g., American English varieties like African-American English,
or English varieties around the world like Indian English)
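Character n-gram features of the kind mentioned above can be extracted with a one-liner; padding with spaces is one common choice (an assumption here, not the only option) so that word boundaries show up as n-grams:

```python
def char_ngrams(text, n=3):
    # Character n-grams (trigrams by default) as language-ID features.
    text = " " + text.lower() + " "   # pad so word edges appear in the n-grams
    return [text[i:i + n] for i in range(len(text) - n + 1)]

grams = char_ngrams("olá")
```

A naive Bayes language identifier then treats these n-grams exactly like words in text classification.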
Knowledge to numbers
• Labeling a few documents (supervised training data)
• Labeling a few words (lexicons)
Summary: Naive Bayes is Not So Naive
Very fast, with low storage requirements
Works well with very small amounts of training data
Robust to irrelevant features
Irrelevant features cancel each other out without affecting results
Naïve Bayes: Relationship to Language Modeling
Dan Jurafsky
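The connection can be sketched directly: each class acts as a unigram language model, and P(sentence | c) is the product of its per-word probabilities. The class LM below is invented for illustration:

```python
import math

# Hypothetical unigram language model for one class (made-up probabilities).
lm_china = {"chinese": 0.5, "beijing": 0.2, "shanghai": 0.2, "tokyo": 0.1}

def log_prob(sentence, lm):
    # log P(sentence | c) under the class-conditional unigram LM:
    # the sum of per-word log probabilities.
    return sum(math.log(lm[w]) for w in sentence)

lp = log_prob(["chinese", "beijing"], lm_china)
```

Classifying a sentence with naive Bayes is then just asking which class's language model assigns it the highest probability (weighted by the prior).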
Precision, Recall, and F1
Evaluating Classifiers: How well does our
classifier work?
Let's first address binary classifiers:
• Is this email spam?
spam (+) or not spam (-)
• Is this post about Delicious Pie Company?
about Del. Pie Co. (+) or not about Del. Pie Co. (-)
We can define precision and recall for multiple classes like this 3-way email task:
                    gold labels
                 urgent  normal  spam
        urgent      8      10      1    precision_u = 8 / (8+10+1)     ≈ .42
system  normal      5      60     50    precision_n = 60 / (5+60+50)   ≈ .52
output  spam        3      30    200    precision_s = 200 / (3+30+200) ≈ .86

recall_u = 8 / (8+5+3) = .50    recall_n = 60 / (10+60+30) = .60    recall_s = 200 / (1+50+200) ≈ .80
How to combine P/R values for different classes:
Microaveraging vs Macroaveraging
macroaverage precision = (.42 + .52 + .86) / 3 = .60
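Both averages can be computed directly from the 3-way email confusion matrix above:

```python
# Confusion matrix from the 3-way email example:
# rows = system output, columns = gold label, order: urgent, normal, spam.
conf = [
    [8, 10, 1],
    [5, 60, 50],
    [3, 30, 200],
]

def per_class_precision(conf):
    # precision_i = correct_i / everything the system labeled class i (row sum)
    return [row[i] / sum(row) for i, row in enumerate(conf)]

def macro_precision(conf):
    # Macroaverage: compute precision per class, then average the values.
    ps = per_class_precision(conf)
    return sum(ps) / len(ps)

def micro_precision(conf):
    # Microaverage: pool all decisions, then compute one precision.
    correct = sum(conf[i][i] for i in range(len(conf)))
    total = sum(sum(row) for row in conf)
    return correct / total

mp = macro_precision(conf)   # ≈ .60, matching the slide
```

Note the two averages differ: microaveraging is dominated by the large spam class, while macroaveraging weights all three classes equally.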
Avoiding Harms in Classification
Harms of classification
Classifiers, like any NLP algorithm, can cause harms
This is true for any classifier, whether Naive Bayes or
other algorithms
Representational Harms
• Harms caused by a system that demeans a social group
• Such as by perpetuating negative stereotypes about them.
• Kiritchenko and Mohammad 2018 study
• Examined 200 sentiment analysis systems on pairs of sentences
• Identical except for names:
• common African American (Shaniqua) or European American (Stephanie).
• Like "I talked to Shaniqua yesterday" vs "I talked to Stephanie yesterday"
• Result: systems assigned lower sentiment and more negative
emotion to sentences with African American names
• Downstream harm:
• Perpetuates stereotypes about African Americans
• African Americans treated differently by NLP tools like sentiment (widely
used in marketing research, mental health studies, etc.)