
CS464
Chapter 4: Naïve Bayes
(slides based on the slides provided by Öznur Taştan and Mehmet Koyutürk)

Last Chapter: Density Estimation

Outline Today
•  Naïve Bayes Classifier
•  Generalization of maximum a posteriori estimation
•  Text Classification
•  Application of Naïve Bayes
•  Illustration of feature extraction/encoding and feature selection

A Bayesian Classifier
-  Compute the conditional probability of each value of Y given the attributes
-  Classify the example into the class that is most probable given the attributes

Learning a Classifier By Learning P(Y|X)

Joint probability table P(G, W, H)
   W: Wealth
   G: Gender
   H: HoursWorked

Conditional probability table P(W | G, H)

A Bayesian Classifier
Predict the class label that is most probable given the attributes (values of features)

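In symbols (notation added here for reference, not shown on the slide), the classifier picks

  ŷ = argmax_y P(Y = y | X_1, …, X_n) = argmax_y P(X_1, …, X_n | Y = y) P(Y = y)

where the second equality follows from Bayes' rule, dropping the denominator P(X_1, …, X_n) because it is the same for every class y.
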
Building a Classifier By Learning P(Y|X)
•  Two binary features, one class label

How many parameters do we need to estimate?

Can we reduce the number of parameters using Bayes' Rule?

Can we reduce the number of parameters using Bayes' Rule?

30 features → more than 30 billion parameters!

Naïve Bayes
•  Naïve Bayes assumes that the random variables (features) Xi and Xj are conditionally independent of each other given the class label Y, for all i ≠ j

Conditional Independence
•  X and Y are conditionally independent given Z iff the conditional probability of the joint variable can be written as the product of conditional probabilities:

   X ⊥ Y | Z  ⟺  P(X, Y | Z) = P(X | Z) P(Y | Z)

Naïve Bayes in a Nutshell

How many parameters?
→ 2n + 1 if Y is binary

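For reference, the factored decision rule this slide summarizes is

  ŷ = argmax_y P(Y = y) ∏_i P(X_i | Y = y)

and the 2n + 1 count assumes n binary features and a binary class: n parameters P(X_i = 1 | Y = 1), n parameters P(X_i = 1 | Y = 0), and one parameter P(Y = 1).
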
Example: Shall we play tennis?

Applying Naïve Bayes Assumption

Applying the Naïve Bayes Assumption:
O: Outlook
T: Temperature
H: Humidity
W: Wind

Applying Naïve Bayes Assumption

Applying the Naïve Bayes Assumption:

Parameters to Estimate

Relative Frequencies
•  Consider each feature independently and estimate:

Applying Naïve Bayes
•  Posterior probability for a new instance with the feature vector
   Xnew = (sunny, cool, high, true)

   (posterior ∝ likelihood × prior)

Applying Naïve Bayes
X = (sunny, cool, humid, windy)
•  Estimating the likelihood:
•  Estimating the posterior:
•  Class label predicted for X is then Play = No

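The computation above can be reproduced with a short script. The counts below assume the classic 14-day play-tennis data set (9 Yes / 5 No days); treat the exact numbers as an illustration rather than the slide's own table.

# Naive Bayes posterior for Xnew = (sunny, cool, high, true),
# assuming the classic 14-day play-tennis counts (9 Yes / 5 No).
priors = {"yes": 9/14, "no": 5/14}
likelihoods = {
    "yes": {"outlook=sunny": 2/9, "temp=cool": 3/9, "humidity=high": 3/9, "wind=true": 3/9},
    "no":  {"outlook=sunny": 3/5, "temp=cool": 1/5, "humidity=high": 4/5, "wind=true": 3/5},
}

scores = {}
for label in priors:
    score = priors[label]                  # start from the prior P(y)
    for p in likelihoods[label].values():
        score *= p                         # naive Bayes: multiply per-feature likelihoods
    scores[label] = score                  # unnormalized posterior P(y) * prod_i P(xi | y)

print(scores)                              # roughly {'yes': 0.0053, 'no': 0.0206}
print(max(scores, key=scores.get))         # -> 'no', matching the slide's prediction
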
Numerical Issues
•  Multiplying many probabilities, each between 0 and 1 by definition, can result in floating-point underflow.
•  Underflow occurs when the result of an operation is smaller in magnitude than the smallest representable non-zero floating-point number.
•  Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
•  The class with the highest final unnormalized log-probability score is still the most probable.

Underflow
•  Therefore, instead of using this formulation:
•  Use the following equivalent rule:
•  Avoiding underflow is an important implementation detail!

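A minimal log-space version of the same decision, reusing the assumed play-tennis numbers from the earlier sketch:

import math

priors = {"yes": 9/14, "no": 5/14}
likelihoods = {
    "yes": [2/9, 3/9, 3/9, 3/9],   # P(sunny|yes), P(cool|yes), P(high|yes), P(true|yes)
    "no":  [3/5, 1/5, 4/5, 3/5],
}

# Sum log-probabilities instead of multiplying probabilities to avoid underflow.
log_scores = {
    label: math.log(priors[label]) + sum(math.log(p) for p in likelihoods[label])
    for label in priors
}
print(max(log_scores, key=log_scores.get))  # -> 'no'; the argmax is unchanged
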
Text Classification Using Naïve Bayes

Identifying Spam

Other Text Classification Tasks
•  Classify email as ‘Spam’, ‘Ham’
•  Classify web pages as ‘Student’, ‘Faculty’, ‘Other’
•  Classify news stories into topics such as ‘Sports’, ‘Politics’, ...
•  Classify movie reviews as ‘favorable’, ‘unfavorable’, ‘neutral’

Text Classification
•  Classify email as
   –  ‘Spam’, ‘Ham’
•  Classify web pages as
   –  ‘Student’, ‘Faculty’, ‘Other’
•  Classify news stories into topics as
   –  ‘Sports’, ‘Politics’, ...
•  Classify movie reviews as
   –  ‘favorable’, ‘unfavorable’, ‘neutral’
   (these category names are the class labels, y)

•  What about the features X?
•  How to represent the document?

How do we represent a document?
•  A sequence of words?
   –  computationally very expensive, can be difficult to train
•  A set of words (Bag-of-Words)
   –  Ignore the position of the word in the document
   –  Ignore the ordering of the words in the document
   –  Consider the words in a predefined vocabulary
(Image courtesy: Joseph Gonzalez)

Document Models
•  Bernoulli document model: a document is represented by a binary feature vector, whose elements indicate the absence or presence of the corresponding word in the document
•  Multinomial document model: a document is represented by an integer feature vector, whose elements indicate the frequency of the corresponding word in the document

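A small illustration of the two encodings, using a made-up four-word vocabulary and document (not the ones from the slides):

# Encode one document under the multinomial and Bernoulli models.
vocabulary = ["lottery", "winner", "meeting", "deadline"]        # assumed toy vocabulary
document = "congratulations winner you are a lottery winner".split()

# Multinomial model: frequency of each vocabulary word in the document
counts = [document.count(word) for word in vocabulary]           # [1, 2, 0, 0]

# Bernoulli model: presence/absence of each vocabulary word
binary = [1 if c > 0 else 0 for c in counts]                     # [1, 1, 0, 0]

print(counts, binary)
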
Bag-of-words document models
•  Document:
   "Congratulations to you as we bring to your notice, the results of the First Category draws of THE HOLLAND CASINO LOTTO PROMO INT. We are happy to inform you that you have emerged a winner under the First Category, which is part of our promotional draws."

Example
•  Classify documents as Sports or Informatics
•  Assume the vocabulary contains 8 words
•  Good vocabularies usually do not include common words (a.k.a. stop words)

Training Data
•  Rows are documents
•  6 examples of sports documents
•  5 examples of informatics documents
•  Columns are words in the order of vocabulary

Estimating Parameters

Bernoulli Document Model

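The estimates themselves are in the figure; for reference, the standard form of the Bernoulli document model (notation added here) is

  P(D | class k) = ∏_t P(w_t | k)^{b_t} · (1 − P(w_t | k))^{1 − b_t}

where b_t ∈ {0, 1} indicates whether vocabulary word w_t occurs in document D, and P(w_t | k) is estimated as the fraction of training documents of class k that contain w_t.
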
Classification with Bernoulli Model

Classifying a given sample
•  A test document:

•  Priors and likelihoods:

•  Posterior probabilities:

•  Classify this document as Sports


Multinomial Document Model

Multinomial Document Model

Words are i.i.d. samples from a multinomial distribution.

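For reference, the corresponding multinomial form (notation added here, matching the Bernoulli note above) is

  P(D | class k) ∝ ∏_t P(w_t | k)^{n_t}

where n_t is the number of times vocabulary word w_t occurs in document D, and P(w_t | k) is estimated as the relative frequency of w_t among all word occurrences in the training documents of class k.
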
Classification with Multinomial Model

Add-one (Laplace) Smoothing
-  Add one imaginary occurrence of every word to every document

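A compact sketch of multinomial Naïve Bayes with add-one smoothing, trained and applied in log space; the tiny corpus and class names are invented for illustration, not taken from the slides:

import math
from collections import Counter

# Toy corpus (invented): (document words, class label)
training = [
    ("goal match team win".split(),         "Sports"),
    ("match referee goal goal".split(),     "Sports"),
    ("algorithm data compiler".split(),     "Informatics"),
    ("data network algorithm data".split(), "Informatics"),
]
vocabulary = sorted({w for doc, _ in training for w in doc})
labels = {y for _, y in training}

# Class priors and per-class word counts (multinomial model)
priors = {k: sum(1 for _, y in training if y == k) / len(training) for k in labels}
word_counts = {k: Counter() for k in labels}
for doc, y in training:
    word_counts[y].update(doc)

def word_prob(word, k):
    # Add-one (Laplace) smoothing: (count + 1) / (total words in class + |V|)
    return (word_counts[k][word] + 1) / (sum(word_counts[k].values()) + len(vocabulary))

def predict(doc):
    scores = {
        k: math.log(priors[k]) + sum(math.log(word_prob(w, k)) for w in doc if w in vocabulary)
        for k in labels
    }
    return max(scores, key=scores.get)

print(predict("goal data team".split()))   # classifies a toy test document
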
Text Classification Framework
(pipeline) Documents → Preprocessing → Features / Indexing → Feature filtering → Applying classification algorithms → Performance measure

Preprocessing
•  Token normalization
   –  Remove superficial character variances from words
      normelization → normalization
•  Stop-word removal
   –  Remove predefined common words that are not specific or discriminatory to the different classes
      is, a, the, you, as, …
•  Stemming
   –  Reduce different forms of the same word to a single word (base/root form)
      swimming, swimmer, swims → swim
•  Feature selection
   –  Choose features that are more relevant and complementary; this can be part of the design process, but in general it is done computationally by trying different combinations

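A toy sketch of these preprocessing steps using only the standard library; the stop-word list and the crude suffix-stripping "stemmer" are stand-ins for real resources such as curated stop-word lists and a proper stemming algorithm:

import re

STOP_WORDS = {"is", "a", "the", "you", "as", "are", "can"}   # assumed tiny stop-word list

def normalize(token):
    # Token normalization: lowercase and strip non-alphanumeric characters
    return re.sub(r"[^a-z0-9]", "", token.lower())

def stem(token):
    # Very crude stemming stand-in: strip a few common suffixes
    for suffix in ("ming", "mers", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = [normalize(t) for t in text.split()]
    return [stem(t) for t in tokens if t and t not in STOP_WORDS]

print(preprocess("The swimmers are swimming, as you can see."))   # -> ['swim', 'swim', 'see']
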
Preprocessing
•  Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
•  How to handle special cases involving apostrophes, hyphens, etc.?
•  C++, C#, URLs, emails, phone numbers, dates
•  San Francisco, Los Angeles

Tokenization
•  Divide the text into a sequence of words by combining, dividing words, handling special characters, etc.
•  Issues of tokenization are language specific
   –  Requires the language to be known
      German compound nouns
   –  East Asian languages (Chinese, Japanese, Korean, Thai)
      •  Text is written without any spaces between words

Normalization
•  Token normalization
   –  Canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
   –  U.S.A vs USA
   –  Anti-discriminatory vs antidiscriminatory
   –  Car vs automobile?

Stop Words
•  Very common words that have no discriminatory power
•  Sort terms by collection frequency and take the most frequent words
•  For an application, an additional domain-specific stop-word list may be constructed
   –  In a collection about insurance practices, “insurance” would be a stop word

Feature Encoding
(figure: feature-encoding steps in a text classification pipeline)
Source: http://www.3n1ltk.org/book/ch06.html

Feature Encoding
•  How to represent the features
•  Feature encoding can have tremendous impact on the classifier

Feature Extraction vs Feature Selection
•  Feature extraction:
   –  Transform data into a new feature space, usually by mapping existing features into a lower-dimensional space (PCA, ICA, etc.; we will come back to these)
•  Feature selection:
   –  Select a subset of the existing features without a transformation

Feature (Subset) Selection
•  Necessary in a number of situations:
   •  Features may be expensive to obtain
      –  Evaluate a large number of features in the test bed and select a subset for the final implementation
   •  You want to extract meaningful rules from your classifier
   •  Fewer features means fewer model parameters
      –  Improved generalization capabilities
      –  Reduced complexity and run-time

Runtime of Naïve Bayes
•  It is fast
•  Computation of the parameters can be done in O(CD)
   •  C: number of classes
   •  D: number of attributes/features

Incremental Updates
•  If the model is going to be updated very often as new data come in, you may implement it such that it allows easy incremental updates
•  For example: store raw counts instead of probabilities
   •  New example of class k:
      •  For each feature, update the counts based on the example's feature vector
      •  Update the class counts; update the number of training examples
   •  When you need to classify, compute the probabilities from the counts

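A minimal sketch of that idea: store raw counts so each new labeled example is a cheap update, and turn counts into (smoothed) probabilities only at prediction time. The class structure and names are illustrative, not from the slides.

import math
from collections import defaultdict

class IncrementalNB:
    """Multinomial-style Naive Bayes that stores raw counts, not probabilities."""

    def __init__(self):
        self.class_counts = defaultdict(int)                        # documents seen per class
        self.word_counts = defaultdict(lambda: defaultdict(int))    # word counts per class
        self.n_examples = 0

    def update(self, words, label):
        # Incremental update for one new labeled example: just bump counts
        self.class_counts[label] += 1
        self.n_examples += 1
        for w in words:
            self.word_counts[label][w] += 1

    def predict(self, words, vocab_size):
        # Probabilities are derived from the stored counts only when classifying
        best, best_score = None, float("-inf")
        for k in self.class_counts:
            score = math.log(self.class_counts[k] / self.n_examples)
            total = sum(self.word_counts[k].values())
            for w in words:
                score += math.log((self.word_counts[k][w] + 1) / (total + vocab_size))  # add-one smoothed
            if score > best_score:
                best, best_score = k, score
        return best

nb = IncrementalNB()
nb.update("goal match team".split(), "Sports")
nb.update("data algorithm compiler".split(), "Informatics")
print(nb.predict("goal team".split(), vocab_size=6))   # -> 'Sports'
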
The Independence Assumption
•  Usually features are not conditionally independent
•  That is why it is called naïve
•  In practice it often works well
•  Naïve Bayes does not produce accurate probability estimates when its independence assumptions are violated, but it may still (and often does) pick the correct maximum-probability class [Domingos & Pazzani, 1996]
•  It typically handles noise well, since it does not even focus on completely fitting the training data

What You Should Know
•  Training and using classifiers based on Bayes' rule
•  Conditional independence
   •  What it is
   •  Why it is important
•  Naïve Bayes
   •  What it is
   •  How to estimate the parameters
   •  How to make predictions
•  Mutual information is a good measure for filtering features
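For reference (definition added here, not spelled out on the slide), the mutual information between a word feature X and the class Y is

  I(X; Y) = Σ_x Σ_y P(x, y) log [ P(x, y) / (P(x) P(y)) ]

Features can then be ranked by this score, keeping the words that carry the most information about the class label.
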
