MultinomialNB

The document discusses text classification, focusing on the Naïve Bayes method, which uses Bayes' rule for classifying documents into predefined categories. It covers various applications such as spam detection, authorship identification, and sentiment analysis, and explains the bag of words representation and the assumptions behind the Naïve Bayes classifier. Additionally, it addresses challenges like zero probabilities and the importance of precision, recall, and the F measure in evaluating classification performance.


Text Classification and Naïve Bayes

The Task of Text Classification

Is this spam?
Who wrote which Federalist papers?
• 1787–8: anonymous essays try to convince New York to ratify the U.S. Constitution; written by Jay, Madison, and Hamilton
• Authorship of 12 of the letters is in dispute
• 1963: solved by Mosteller and Wallace using Bayesian methods

[Portraits: James Madison, Alexander Hamilton]


Male or female author?
1. By 1925 present-day Vietnam was divided into three parts
under French colonial rule. The southern region embracing
Saigon and the Mekong delta was the colony of Cochin-China;
the central area with its imperial capital at Hue was the
protectorate of Annam…
2. Clara never failed to be astonished by the extraordinary felicity
of her own name. She found it hard to trust herself to the
mercy of fate, which had managed over the years to convert
her greatest shame into one of her greatest assets…
S. Argamon, M. Koppel, J. Fine, and A. R. Shimoni. 2003. "Gender, Genre, and Writing Style in Formal Written Texts." Text 23(3), pp. 321–346.
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some
great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing
scenes.

What is the subject of this article?

MEDLINE Article → MeSH Subject Category Hierarchy
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language Identification
• Sentiment analysis
• …
Text Classification: definition
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
• Output: a predicted class c ∈ C

Classification Methods: Hand-coded rules
• Rules based on combinations of words or other features
  • spam: black-list-address OR ("dollars" AND "have been selected")
• Accuracy can be high
  • if the rules are carefully refined by an expert
• But building and maintaining these rules is expensive
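
As an illustration, here is a minimal sketch of a hand-coded classifier along the lines of the spam rule above; the blacklist address and the test messages are hypothetical placeholders, not taken from any real system.

# A minimal sketch of the hand-coded spam rule:
# black-list-address OR ("dollars" AND "have been selected").
# The blacklist entry below is a hypothetical placeholder.
BLACKLIST = {"offers@example-spam.biz"}

def is_spam(sender, body):
    text = body.lower()
    if sender.lower() in BLACKLIST:
        return True
    return "dollars" in text and "have been selected" in text

print(is_spam("friend@example.com", "You have been selected to receive a million dollars"))  # True
print(is_spam("friend@example.com", "Lunch tomorrow?"))  # False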
Classification Methods: Supervised Machine Learning
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
• Output:
  • a learned classifier γ: d → c

Classification Methods: Supervised Machine Learning
• Any kind of classifier
  • Naïve Bayes
  • Logistic regression
  • Support-vector machines
  • k-Nearest Neighbors
  • …
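
A minimal supervised-learning sketch using scikit-learn's CountVectorizer and MultinomialNB (the classifier this deck is named after); the toy training documents and labels are invented for illustration.

# Supervised text classification with bag-of-words counts and multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["full of zany characters and great plot twists",
              "unbelievably disappointing",
              "the greatest screwball comedy ever filmed",
              "it was pathetic, the worst boxing scenes"]
train_labels = ["pos", "neg", "pos", "neg"]

vectorizer = CountVectorizer()                 # bag-of-words feature extraction
X_train = vectorizer.fit_transform(train_docs)
clf = MultinomialNB(alpha=1.0)                 # alpha=1.0 is add-1 (Laplace) smoothing
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["zany characters and a great comedy"])
print(clf.predict(X_test))                     # expected: ['pos']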
Text Classification and Naïve Bayes

Naïve Bayes (I)

Naïve Bayes Intuition
• Simple ("naïve") classification method based on Bayes' rule
• Relies on a very simple representation of the document
  • Bag of words
The bag of words representation
• A document is reduced to an unordered set of word counts, which the classifier γ maps to a class:

  seen        2
  sweet       1
  whimsical   1
  recommend   1
  happy       1
  ...         ...

  γ( {seen: 2, sweet: 1, whimsical: 1, recommend: 1, happy: 1, ...} ) = c
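
A bag of words is just a multiset of word counts. Here is a minimal sketch using Python's collections.Counter; the abridged review text is made up so that its counts match the example above.

# Reduce a document to unordered word counts (the bag-of-words representation).
from collections import Counter
import re

review = ("I loved it. I have seen it twice, and seen nothing so sweet and whimsical; "
          "I recommend it to anyone who wants a happy film.")
bag = Counter(re.findall(r"[a-z']+", review.lower()))
print(bag["seen"], bag["sweet"], bag["whimsical"], bag["recommend"], bag["happy"])  # 2 1 1 1 1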
Text Classification and Naïve Bayes

Formalizing the Naïve Bayes Classifier
Bayes' Rule Applied to Documents and Classes
• For a document d and a class c:

  P(c | d) = P(d | c) P(c) / P(d)

Naïve Bayes Classifier (I)
• MAP is "maximum a posteriori" = most likely class

  cMAP = argmax_{c ∈ C} P(c | d)

• Bayes' rule:

  cMAP = argmax_{c ∈ C} P(d | c) P(c) / P(d)

• Dropping the denominator (P(d) is identical for every class):

  cMAP = argmax_{c ∈ C} P(d | c) P(c)
Naïve Bayes Classifier (II)

  cMAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c)

• Document d is represented as features x1, …, xn
Naïve Bayes Classifier (IV)

  cMAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c)

• The likelihood term P(x1, x2, …, xn | c) has O(|X|^n · |C|) parameters, which could only be estimated if a very, very large number of training examples was available.
• The prior P(c) answers "how often does this class occur?"; we can just count the relative frequencies in a corpus.
Multinomial Naïve Bayes Independence Assumptions

  P(x1, x2, …, xn | c)

• Bag of Words assumption: assume position doesn't matter
• Conditional Independence: assume the feature probabilities P(xi | cj) are independent given the class cj:

  P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · … · P(xn | c)
Multinomial Naïve Bayes Classifier

  cMAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c)

  cNB = argmax_{c ∈ C} P(c) Π_i P(xi | c)
Applying Multinomial Naive Bayes Classifiers to Text Classification
• positions ← all word positions in the test document

  cNB = argmax_{c ∈ C} P(c) Π_{i ∈ positions} P(xi | c)

Problems with multiplying lots of probs
• There's a problem with this:

• Multiplying lots of probabilities can result in floating-point underflow!


• .0006 * .0007 * .0009 * .01 * .5 * .000008….
• Idea: Use logs, because log(ab) = log(a) + log(b)
• We'll sum logs of probabilities instead of multiplying
probabilities!
We actually do everything in log space
• Instead of this:

  cNB = argmax_{c ∈ C} P(c) Π_{i ∈ positions} P(xi | c)

• This:

  cNB = argmax_{c ∈ C} [ log P(c) + Σ_{i ∈ positions} log P(xi | c) ]

Notes:
1) Taking the log doesn't change the ranking of classes!
   The class with the highest probability also has the highest log probability.
2) It's a linear model: just a max of a sum of weights, a linear function of the inputs.
   So naive Bayes is a linear classifier.
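
A minimal sketch of this log-space decision rule in Python; the log-prior and log-likelihood tables are assumed to have been estimated already (see the learning slides that follow), and out-of-vocabulary words are skipped, as discussed later.

# Log-space Naive Bayes: argmax over classes of log P(c) + sum_i log P(w_i | c).
import math

def classify(doc_words, log_prior, log_likelihood):
    # log_prior: {class: log P(c)}
    # log_likelihood: {class: {word: log P(w | c)}}
    best_class, best_score = None, -math.inf
    for c in log_prior:
        score = log_prior[c]
        for w in doc_words:
            if w in log_likelihood[c]:   # words outside the vocabulary are ignored
                score += log_likelihood[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class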
Text Classification and Naïve Bayes

Naïve Bayes: Learning
Sec.13.3

Learning the Multinomial Naïve Bayes Model

• First attempt: maximum likelihood estimates


• simply use the frequencies in the data
Parameter estimation
• P̂(wi | cj) = fraction of times word wi appears among all words in documents of topic cj
• Create a mega-document for topic j by concatenating all the docs in this topic
• Use the frequency of wi in the mega-document
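
In standard notation, the two maximum-likelihood estimates described above are:

$$\hat{P}(c_j) = \frac{N_{c_j}}{N_{doc}} \qquad \hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}$$

where N_{c_j} is the number of training documents of class cj, N_doc is the total number of training documents, and count(w, cj) is the number of occurrences of w in the class-cj mega-document.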
Sec. 13.3

Problem with Maximum Likelihood
• What if we have seen no training documents with the word fantastic that are classified in the topic positive (thumbs-up)?

  P̂("fantastic" | positive) = count("fantastic", positive) / Σw count(w, positive) = 0

• Zero probabilities cannot be conditioned away, no matter the other evidence!
Laplace (add-1) smoothing for Naïve Bayes
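
In the same notation, the add-1 (Laplace) smoothed estimate is:

$$\hat{P}(w_i \mid c) = \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V}\bigl(\mathrm{count}(w, c) + 1\bigr)} = \frac{\mathrm{count}(w_i, c) + 1}{\Bigl(\sum_{w \in V}\mathrm{count}(w, c)\Bigr) + |V|}$$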
Multinomial Naïve Bayes: Learning
• From the training corpus, extract the Vocabulary
• Calculate the P(cj) terms:
  • For each cj in C:
    • docsj ← all docs with class = cj
    • P(cj) = |docsj| / |total # of documents|
• Calculate the P(wk | cj) terms:
  • Textj ← single doc containing all docsj
  • For each word wk in Vocabulary:
    • nk ← # of occurrences of wk in Textj
    • P(wk | cj) = (nk + 1) / (n + |Vocabulary|), where n is the total number of word tokens in Textj
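
A minimal training sketch following this recipe, with add-1 smoothing and log-space parameters; a toy tokenized corpus would stand in for the real training set. Paired with the log-space classify sketch earlier, this gives a complete classifier.

# Train multinomial Naive Bayes with add-1 smoothing, in log space.
from collections import Counter
import math

def train_nb(docs, labels):
    # docs: list of token lists; labels: list of class labels, one per doc.
    vocab = {w for d in docs for w in d}
    log_prior, log_likelihood = {}, {}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        mega = Counter(w for d in class_docs for w in d)   # the "mega-document" counts
        n = sum(mega.values())                             # total word tokens in Text_j
        log_likelihood[c] = {w: math.log((mega[w] + 1) / (n + len(vocab))) for w in vocab}
    return log_prior, log_likelihood, vocab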
Unknown words
• What about unknown words
• that appear in our test data
• but not in our training data or vocabulary?
• We ignore them
• Remove them from the test document!
• Pretend they weren't there!
• Don't include any probability for them at all!
• Why don't we build an unknown word model?
• It doesn't help: knowing which class has more unknown words is not
generally helpful!
Stop words
• Some systems ignore stop words
• Stop words: very frequent words like the and a.
• Sort the vocabulary by word frequency in training set
• Call the top 10 or 50 words the stopword list.
• Remove all stop words from both training and test sets
• As if they were never there!
• But removing stop words doesn't usually help
• So in practice most NB algorithms use all words and don't use
stopword lists
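
A frequency-based stopword list as described above is easy to build; a sketch, assuming the training documents are already tokenized:

# Take the k most frequent training-set words as the stopword list.
from collections import Counter

def top_k_stopwords(train_token_lists, k=50):
    counts = Counter(w for doc in train_token_lists for w in doc)
    return {w for w, _ in counts.most_common(k)}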
Text Classification and Naïve Bayes

Naïve Bayes: Relationship to Language Modeling
Generative Model for Multinomial Naïve Bayes

[Figure: a class node c = China generates the word sequence X1 = Shanghai, X2 = and, X3 = Shenzhen, X4 = issue, X5 = bonds]
Naïve Bayes and Language Modeling
• Naïve Bayes classifiers can use any sort of feature
  • URL, email address, dictionaries, network features
• But if, as in the previous slides,
  • we use only word features
  • and we use all of the words in the text (not a subset)
• then Naïve Bayes has an important similarity to language modeling.
Sec. 13.2.1

Each class = a unigram language model
• Assign to each word: P(word | c)
• Assign to each sentence: P(s | c) = Π P(word | c)

Class pos:
  I     0.1
  love  0.1
  this  0.01
  fun   0.05
  film  0.1

  s = "I love this fun film"
  P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005

Sec. 13.2.1

Naïve Bayes as a Language Model
• Which class assigns the higher probability to s = "I love this fun film"?

  Model pos        Model neg
  I     0.1        I     0.2
  love  0.1        love  0.001
  this  0.01       this  0.01
  fun   0.05       fun   0.005
  film  0.1        film  0.1

  P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1    = 0.0000005
  P(s | neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 = 0.000000001

  P(s | pos) > P(s | neg)
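
As a quick check of the comparison on this slide, using exactly the per-word probabilities from the two tables:

# Which unigram class model gives s = "I love this fun film" the higher probability?
import math

pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}
s = ["I", "love", "this", "fun", "film"]

p_pos = math.prod(pos[w] for w in s)   # 5e-07
p_neg = math.prod(neg[w] for w in s)   # 1e-09
print(p_pos > p_neg)                   # True: the pos model assigns the higher probability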
Text Classification and Naïve Bayes

Multinomial Naïve Bayes: A Worked Example
            Doc   Words                                   Class
Training    1     Chinese Beijing Chinese                 c
            2     Chinese Chinese Shanghai                c
            3     Chinese Macao                           c
            4     Tokyo Japan Chinese                     j
Test        5     Chinese Chinese Chinese Tokyo Japan     ?

Priors:
  P(c) = 3/4
  P(j) = 1/4

Conditional probabilities (add-1 smoothing):
  P(Chinese | c) = (5 + 1) / (8 + 6) = 6/14 = 3/7
  P(Tokyo | c)   = (0 + 1) / (8 + 6) = 1/14
  P(Japan | c)   = (0 + 1) / (8 + 6) = 1/14
  P(Chinese | j) = (1 + 1) / (3 + 6) = 2/9
  P(Tokyo | j)   = (1 + 1) / (3 + 6) = 2/9
  P(Japan | j)   = (1 + 1) / (3 + 6) = 2/9

Choosing a class:
  P(c | d5) ∝ 3/4 × (3/7)^3 × 1/14 × 1/14 ≈ 0.0003
  P(j | d5) ∝ 1/4 × (2/9)^3 × 2/9 × 2/9 ≈ 0.0001
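
A quick computational check of the two scores above:

# d5 = "Chinese Chinese Chinese Tokyo Japan"
p_c = 3/4 * (3/7)**3 * (1/14) * (1/14)   # class c: ~0.0003
p_j = 1/4 * (2/9)**3 * (2/9) * (2/9)     # class j: ~0.0001
print(p_c > p_j)                          # True: choose class c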
Naïve Bayes in Spam Filtering
• SpamAssassin Features:
• Online Pharmacy
• Mentions millions of dollars ($NN,NNN,NNN.NN)
• Phrase: impress ...
• From: starts with many numbers
• Subject is all capitals
• HTML has a low ratio of text to image area
• One hundred percent guaranteed
• Claims you can be removed from the list
• 'Prestigious Non-Accredited Universities'
• http://spamassassin.apache.org/tests_3_3_x.html
Summary: Naive Bayes is Not So Naive
• Very fast, low storage requirements
• Robust to irrelevant features
  • Irrelevant features cancel each other out without affecting results
• Very good in domains with many equally important features
  • Decision trees suffer from fragmentation in such cases, especially with little data
• Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
• A good, dependable baseline for text classification
  • But we will see other classifiers that give better accuracy
Text Classification and Naïve Bayes

Precision, Recall, and the F measure
The 2-by-2 contingency table

                 correct    not correct
  selected       tp         fp
  not selected   fn         tn
Precision and recall
• Precision: % of selected items that are correct
• Recall: % of correct items that are selected

                 correct    not correct
  selected       tp         fp
  not selected   fn         tn
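
In terms of the cells of the table:

$$\text{Precision} = \frac{tp}{tp + fp} \qquad \text{Recall} = \frac{tp}{tp + fn}$$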
A combined measure: F
• A combined measure that assesses the P/R tradeoff is the F measure (a weighted harmonic mean):

  F = 1 / ( α (1/P) + (1 − α) (1/R) ) = (β² + 1) P R / (β² P + R)

• The harmonic mean is a very conservative average
• People usually use the balanced F1 measure
  • i.e., with β = 1 (that is, α = ½): F = 2PR / (P + R)
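
A small sketch computing precision, recall, and balanced F1 from the contingency counts; the example counts are invented for illustration.

# Precision, recall, and F1 from 2-by-2 contingency counts.
def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

print(precision_recall_f1(tp=40, fp=10, fn=20))   # (0.8, 0.666..., 0.727...)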
