CS464 Chapter 4: Naïve Bayes (Slides Based on the Slides Provided by Öznur Taştan and Mehmet Koyutürk)
Outline Today
• Naïve Bayes Classifier
• Text Classification
A Bayesian Classifier
• Compute the conditional probability of each value of Y given the attributes
Learning a Classifier By Learning P(Y|X)
Conditional probability table P(W | G, H)
A Bayesian Classifier
Predict the class label that is most probable given the attributes (values of features)
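In symbols, a standard way of stating this decision rule (using y for the class label and x_1, …, x_d for the observed attribute values):

\hat{y} = \arg\max_{y} P(Y = y \mid X_1 = x_1, \dots, X_d = x_d)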
Building a Classifier By Learning P(Y|X)
• Two binary features, one class label
How many parameters do we need to estimate?
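One common way to count, assuming d binary features and a binary label: specifying P(Y = 1 \mid X_1, \dots, X_d) directly needs a separate probability for each of the 2^d feature combinations, i.e. 2^d parameters (4 here, with d = 2), which grows exponentially with the number of features. The Naïve Bayes model introduced below needs only 2d + 1.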
Can we reduce the number of parameters using Bayes' Rule?
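For reference, Bayes' rule rewrites the posterior in terms of the class-conditional likelihood and the class prior:

P(Y \mid X_1, \dots, X_d) = \frac{P(X_1, \dots, X_d \mid Y)\, P(Y)}{P(X_1, \dots, X_d)}

The denominator is the same for every class, so it can be ignored when picking the most probable class.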
Naïve Bayes
• Naïve Bayes assumes that the features are conditionally independent given the class label:
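Written out, the assumption (in its standard form) is:

P(X_1, \dots, X_d \mid Y) = \prod_{i=1}^{d} P(X_i \mid Y)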
Conditional Independence
Example: Shall we play tennis?
Applying Naïve Bayes Assumption
O: Outlook
T: Temperature
H: Humidity
W: Wind
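Under this assumption, and writing Play for the class label (the label name is an assumption about the slide's notation), the likelihood factorizes as:

P(O, T, H, W \mid Play) = P(O \mid Play)\, P(T \mid Play)\, P(H \mid Play)\, P(W \mid Play)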
Applying Naïve Bayes Assumption
Parameters to Estimate
Relative Frequencies
• Consider each feature independently and estimate:
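A standard relative-frequency (maximum-likelihood) estimate from a training set of size N is:

\hat{P}(Y = y) = \frac{\#\{Y = y\}}{N}, \qquad \hat{P}(X_i = x \mid Y = y) = \frac{\#\{X_i = x,\, Y = y\}}{\#\{Y = y\}}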
Applying Naïve Bayes
• Posterior probability for a new instance with the feature vector Xnew = (sunny, cool, high, true)
Applying Naïve Bayes
X = (sunny, cool, humid, windy)
• Estimating the likelihood:
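Under the Naïve Bayes assumption this likelihood is a product of per-feature terms (writing yes/no for the two class values; the individual probabilities come from the relative-frequency estimates above):

P(X \mid yes) = P(sunny \mid yes)\, P(cool \mid yes)\, P(humid \mid yes)\, P(windy \mid yes)

and analogously for P(X \mid no); the prediction compares P(yes)\,P(X \mid yes) against P(no)\,P(X \mid no).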
Numerical Issues
• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
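The usual fix is to compare sums of log probabilities instead of products of probabilities. A minimal sketch in Python (the function name and the plugged-in numbers are only illustrative, not taken from the slides):

import math

def log_posterior(prior, likelihoods):
    # log P(y) + sum_i log P(x_i | y); monotone in the original product,
    # so the argmax over classes is unchanged but underflow is avoided.
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# Illustrative values only: class prior and per-feature likelihoods per class.
scores = {
    "yes": log_posterior(0.6, [0.2, 0.3, 0.3, 0.3]),
    "no":  log_posterior(0.4, [0.6, 0.2, 0.8, 0.6]),
}
prediction = max(scores, key=scores.get)
print(prediction, scores)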
Identifying Spam
Other Text Classification Tasks
• Classify email as ‘Spam’, ‘Ham’
Text Classification
• Classify email as
– ‘Spam’, ‘Ham’
How do we represent a document?
• A sequence of words?
 – Computationally very expensive, can be difficult to train
Document Models
• Bernoulli document model: a document is represented by a binary feature vector, whose elements indicate the absence or presence of the corresponding word in the document
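A minimal sketch of building such a binary vector in Python (the vocabulary and the helper name are made up for illustration; the deck itself assumes an 8-word vocabulary a few slides later):

vocabulary = ["goal", "match", "win", "algorithm", "data", "model", "lotto", "casino"]

def bernoulli_vector(document, vocabulary):
    # Lower-case and split on whitespace; real preprocessing would do more.
    tokens = set(document.lower().split())
    # 1 if the vocabulary word occurs anywhere in the document, else 0.
    return [1 if word in tokens else 0 for word in vocabulary]

print(bernoulli_vector("The model fit the data", vocabulary))
# -> [0, 0, 0, 0, 1, 1, 0, 0]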
Bag-of-words document models
• Document:
Congratulations to you as we bring to your notice, the results of the First Category draws of THE HOLLAND CASINO LOTTO PROMO INT. We are happy to inform you that you have emerged a winner under the First Category, which is part of our promotional draws.
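In a bag-of-words representation such a document becomes a vector of word counts over the vocabulary, ignoring word order. A minimal Python sketch (vocabulary chosen only for illustration):

from collections import Counter

def bag_of_words(document, vocabulary):
    # Count how many times each vocabulary word occurs in the document.
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["congratulations", "winner", "lotto", "draws", "category"]
doc = "congratulations you have emerged a winner under the first category"
print(bag_of_words(doc, vocabulary))  # -> [1, 1, 0, 0, 1]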
Example
• Classify documents into two classes: Sports and Informatics
• Assume the vocabulary contains 8 words
• Good vocabularies usually do not include common words (a.k.a. stop words)
Training Data
• Rows are documents
• 6 examples of sports documents
• 5 examples of informatics documents
• Columns are words, in the order of the vocabulary
Estimating Parameters
Bernoulli Document Model
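With vocabulary words w_1, …, w_|V| and b_t \in \{0, 1\} indicating whether w_t occurs in document D, the Bernoulli class-conditional likelihood is usually written as:

P(D \mid C) = \prod_{t=1}^{|V|} P(w_t \mid C)^{b_t} \bigl(1 - P(w_t \mid C)\bigr)^{1 - b_t}

where P(w_t \mid C) is estimated as the fraction of class-C training documents that contain w_t.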
Classification with Bernoulli Model
Classifying a given sample
• A test document:
• Posterior probabilities:
Multinomial Document Model
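Here a document D is represented by the count x_t of each vocabulary word, and the multinomial class-conditional likelihood is, up to a constant that does not depend on the class:

P(D \mid C) \propto \prod_{t=1}^{|V|} P(w_t \mid C)^{x_t}

with P(w_t \mid C) estimated as the fraction of all word occurrences in class-C training documents that are w_t.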
Classification with Multinomial Model
Add-one (Laplace) Smoothing
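For the multinomial model, add-one smoothing replaces the raw relative frequency with a smoothed estimate so that words never seen with a class do not get probability zero:

\hat{P}(w_t \mid C) = \frac{\mathrm{count}(w_t, C) + 1}{\sum_{s=1}^{|V|} \mathrm{count}(w_s, C) + |V|}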
Text Classification Framework
Pipeline: Documents → Preprocessing → Indexing → Feature filtering → Applying classification algorithms → Performance measure
Preprocessing
• Token normalization
 – Remove superficial character variances from words
   normelization -> normalization
• Stop-word removal
 – Remove predefined common words that are not specific or discriminatory to the different classes
   is, a, the, you, as…
• Stemming
 – Reduce different forms of the same word into a single word (base/root form)
   swimming, swimmer, swims -> swim
• Feature selection
 – Choose features that are more relevant and complementary; this can be part of the design process, but in general it is done computationally by trying different combinations (a sketch of the first three steps follows this list)
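A minimal Python sketch of the first three steps, assuming a hand-written stop-word list and a deliberately crude suffix-stripping "stemmer" (a real system would use a proper stemmer such as Porter's):

import re

STOP_WORDS = {"is", "a", "the", "you", "as", "to", "of", "and"}

def preprocess(text):
    # Token normalization: lower-case and keep only alphabetic tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stop-word removal: drop predefined common words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Very crude stemming: strip a few suffixes (illustration only).
    stemmed = []
    for t in tokens:
        for suffix in ("ming", "mer", "ing", "er", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("The swimmer is swimming as you watch"))  # -> ['swim', 'swim', 'watch']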
Preprocessing
Tokenization
Normalization
• Token normalization
 – Canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
 – U.S.A vs USA
 – Anti-discriminatory vs antidiscriminatory
 – Car vs automobile?
Stop Words
• Very common words that have no discriminatory power
Feature Encoding
(Figure: feature encoding. Source: http://www.nltk.org/book/ch06.html)
Feature Encoding
• How to represent the features
Feature Extraction vs Feature Selection
• Feature extraction:
 – Transform data into a new feature space, usually by mapping existing features into a lower-dimensional space (PCA, ICA, etc.; we will come back to these)
• Feature selection:
 – Select a subset of the existing features without a transformation
Feature (Subset) Selection
• Necessary in a number of situations:
 – Features may be expensive to obtain
 – Evaluate a large number of features in the test bed and select a subset for the final implementation
Runtime of Naïve Bayes
• It is fast
• Computation of parameters can be done in O(C·D)
 – C: number of classes
 – D: number of attributes/features
Incremental Updates
• Conditional independence
 – What it is
 – Why it is important
• Naïve Bayes
 – What it is
 – How to estimate the parameters
 – How to make predictions