Text Classification
Introduction to NLP
511.
Classification
• Assigning documents or sentences to predefined
categories
– topics, languages, users …
• Input:
– a document (or sentence) d
– a fixed set of classes C = {c1, c2,…, cJ}
• Output: a predicted class c ∈ C
Variants of Problem Formulation
• Binary categorization: only two categories
– Retrieval: {relevant-doc, irrelevant-doc}
– Spam filtering: {spam, non-spam}
– Opinion: {positive, negative}
• K-category categorization: more than two categories
– Topic categorization: {sports, science, travel, business,…}
– Word sense disambiguation: {bar1, bar2, bar3, …}
• Hierarchical vs. flat
• Overlapping (soft) vs non-overlapping (hard)
Hierarchical Classification
(Figure: documents from topic1 and topic2 plotted in a two-dimensional feature space with axes x1 and x2.)
Decision surfaces
(Figure: a decision surface separating topic1 from topic2 in the (x1, x2) plane.)
Decision trees
(Figure: decision-tree splits partitioning the (x1, x2) plane into topic1 and topic2 regions.)
Classification Using Centroids
• Centroid
– the point most representative of a class
• Compute the centroid by taking the vector average of the known class members
• The decision boundary is the line equidistant from the two centroids
• A new document on one side of the boundary goes in one class; a new document on the other side goes in the other (see the sketch below)
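A minimal sketch of this nearest-centroid idea; the two-dimensional document vectors, class names, and test points are made up for illustration:

```python
import numpy as np

# Toy 2-D document vectors for two topics (made-up data for illustration)
topic1_docs = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2]])
topic2_docs = np.array([[4.0, 0.5], [4.2, 0.8], [3.8, 0.4]])

# Centroid = vector average of the known members of each class
centroid1 = topic1_docs.mean(axis=0)
centroid2 = topic2_docs.mean(axis=0)

def classify(doc):
    """Assign a new document to the class whose centroid is closer."""
    d1 = np.linalg.norm(doc - centroid1)
    d2 = np.linalg.norm(doc - centroid2)
    return "topic1" if d1 < d2 else "topic2"

print(classify(np.array([1.3, 1.9])))  # -> topic1
print(classify(np.array([4.1, 0.6])))  # -> topic2
```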
Linear boundary
(Figure: a linear decision boundary between topic1 and topic2 in the (x1, x2) plane.)
Classification Using Centroids
(Figure: centroid1 and centroid2 with the decision boundary equidistant between them in the (x1, x2) plane.)
Introduction to NLP
Linear Classifiers
Decision Boundary
(Figure: a linear decision boundary with a point x and the weight vector w.)
Linear Separators
• Two-dimensional line:
– w1x1 + w2x2 = b is the linear separator
– w1x1 + w2x2 > b for the positive class
• In n-dimensional spaces:
$w^T x = \sum_{i=1}^{n} w_i x_i = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
• One can also fold the constant b into the weight vector by adding a constant bias feature x0
• w is the weight vector
• x is the feature vector
Example
• Score: compare w·x to the bias b (here b = 0)
• Weights:
    A  0.6    G  -0.7
    B  0.5    H  -0.5
    C  0.5    I  -0.3
    D  0.4    J  -0.2
    E  0.4    K  -0.2
    F  0.3    L  -0.2
• Sentence is “A D E H”
• Its score will be 0.6*1 + 0.4*1 + 0.4*1 + (-0.5)*1 = 0.9 > 0, so the positive class (checked in the sketch below)
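A quick check of this example in code; the letters are the features (words), and the bag-of-words vector has a 1 for each word that occurs in the sentence:

```python
# Weights from the example table above
weights = {"A": 0.6, "B": 0.5, "C": 0.5, "D": 0.4, "E": 0.4, "F": 0.3,
           "G": -0.7, "H": -0.5, "I": -0.3, "J": -0.2, "K": -0.2, "L": -0.2}
bias = 0.0  # b = 0 in this example

sentence = "A D E H".split()
score = sum(weights[w] for w in sentence)  # w . x for a 0/1 bag-of-words vector
print(score)         # 0.9
print(score > bias)  # True -> positive class
```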
How to Find the Linear Boundary?
• Find the linear boundary = find w
• Many methods
– E.g., Perceptron
• Problem:
– There is an infinite number of linear boundaries if the two classes are linearly separable!
– Maximum margin: Support Vector Machines (SVM)
(Figure: two linearly separable clusters of + and − points in the (x1, x2) plane.)
General Training Idea
• Go through the training data
• Predict the class y (1 or -1)
513.
Naïve Bayes
Generative vs. Discriminative Models
Generative:
• Learn a model of the joint probability p(d, c)
• Use Bayes’ Rule to calculate p(c|d)
• Build a model of each class; given an example, return the model most likely to have generated that example
• Examples:
– Naïve Bayes
– Gaussian Discriminant Analysis
– HMM
– Probabilistic CFG
Discriminative:
• Model the posterior probability p(c|d) directly
• The class is a function of the document vector
• Find the exact function that minimizes classification errors on the training data
• Examples:
– Logistic regression
– Neural Networks (NNs)
– Support Vector Machines (SVMs)
– Decision Trees
Assumptions of Discriminative Classifiers
• Data examples (documents) are represented as vectors of features (words, phrases, n-grams, etc.)
• We are looking for a function that maps each vector into a class.
• This function can be found by minimizing the errors on
the training data (plus other various criteria)
• Different classifiers vary on what the function looks like,
and how they find the function
Discriminative vs. Generative
• Discriminative classifiers are generally more effective,
since they directly optimize the classification accuracy. But
– They are all sensitive to the choice of features, and so far these
features are extracted heuristically
– Also, overfitting can happen if data is sparse
• Generative classifiers are the “opposite”
– They directly model text, an unnecessarily harder problem than
classification
– They can easily exploit unlabeled data
Naïve Bayes Intuition
$\hat{c} = \operatorname{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)}$   (Bayes’ Rule)
$\hat{c} = \operatorname{argmax}_{c \in C} P(d \mid c)\,P(c)$   (dropping the denominator)
Naïve Bayes Classifier
$\hat{P}(c_j) = \frac{\text{doccount}(C = c_j)}{N_{doc}}$
$\hat{P}(w_i \mid c_j) = \frac{\text{count}(w_i, c_j)}{\sum_{w \in V} \text{count}(w, c_j)}$
Parameter Estimation
$\hat{P}(w_i \mid c) = \frac{\text{count}(w_i, c) + 1}{\sum_{w \in V} \left(\text{count}(w, c) + 1\right)} = \frac{\text{count}(w_i, c) + 1}{\left(\sum_{w \in V} \text{count}(w, c)\right) + |V|}$
Multinomial Naïve Bayes
Independence Assumptions
$P(x_1, x_2, \dots, x_n \mid c)$
• Bag of Words assumption
– Assume position doesn’t matter
• Conditional Independence
– Assume the feature probabilities P(xi|c) are independent given
the class c.
$P(x_1, \dots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdots P(x_n \mid c)$
[Jurafsky and Martin]
Multinomial Naïve Bayes
$c_{NB} = \operatorname{argmax}_{c_j \in C} P(c_j) \prod_{x \in X} P(x \mid c_j)$
$c_{NB} = \operatorname{argmax}_{c_j \in C} P(c_j) \prod_{i \in \text{positions}} P(x_i \mid c_j)$
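A minimal sketch putting the pieces together — priors, add-one-smoothed likelihoods, and the argmax computed in log space. The tiny training set and class names are made up for illustration:

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training set (documents as token lists, with a class label)
train = [("Chinese Beijing Chinese".split(), "china"),
         ("Chinese Chinese Shanghai".split(), "china"),
         ("Chinese Macao".split(), "china"),
         ("Tokyo Japan Chinese".split(), "japan")]

classes = {c for _, c in train}
vocab = {w for doc, _ in train for w in doc}
doc_count = Counter(c for _, c in train)
word_count = defaultdict(Counter)          # word_count[c][w] = count(w, c)
for doc, c in train:
    word_count[c].update(doc)

def log_prior(c):
    # P(c) = doccount(C = c) / N_doc
    return math.log(doc_count[c] / len(train))

def log_likelihood(w, c):
    # Add-one smoothing: (count(w, c) + 1) / (sum_w count(w, c) + |V|)
    return math.log((word_count[c][w] + 1) /
                    (sum(word_count[c].values()) + len(vocab)))

def classify(doc):
    # c_NB = argmax_c P(c) * prod_i P(x_i | c), computed in log space
    return max(classes, key=lambda c: log_prior(c) +
               sum(log_likelihood(w, c) for w in doc if w in vocab))

print(classify("Chinese Chinese Chinese Tokyo Japan".split()))  # -> china
```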
Generative Model for Multinomial NB
(Figure: generative process for an example document with class c = China.)
[Greg Durrett]
Summary (1)
[Greg Durrett]
Summary (2)
[Greg Durrett]
Not So Naïve after all!
• Very fast, low storage requirements
• Robust to irrelevant features
– Irrelevant features cancel each other out without affecting results
• Very good in domains with many equally important features
– Decision trees suffer from fragmentation in such cases, especially with little data
• Optimal if the independence assumptions hold
– If the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
• A good, dependable baseline for text classification
– But other classifiers give better accuracy
[Jurafsky and Martin]
Naïve Bayes as a Linear Model
541.
Text Classification
Who wrote which Federalist papers?
Feature Selection
Feature selection: The χ² test
• For a term t, build a 2×2 contingency table of counts:

              It = 0    It = 1
    C = 0      k00       k01
    C = 1      k10       k11

• C = class, It = indicator for term t
• Testing for independence: P(C=0, It=0) should be equal to P(C=0) P(It=0)
  P(C=0) = (k00 + k01)/n
  P(C=1) = 1 − P(C=0) = (k10 + k11)/n
  P(It=0) = (k00 + k10)/n
  P(It=1) = 1 − P(It=0) = (k01 + k11)/n
Feature Selection: The χ² test
$\chi^2 = \frac{n\,(k_{11} k_{00} - k_{10} k_{01})^2}{(k_{11} + k_{10})(k_{01} + k_{00})(k_{11} + k_{01})(k_{10} + k_{00})}$
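A small sketch of the statistic above; the counts are made up, and in practice one would keep the terms with the highest χ² scores as features:

```python
def chi_square(k00, k01, k10, k11):
    """Chi-square statistic for a 2x2 class/feature contingency table."""
    n = k00 + k01 + k10 + k11
    num = n * (k11 * k00 - k10 * k01) ** 2
    den = (k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)
    return num / den

# Made-up counts: how often term t does / does not occur in docs of each class
print(chi_square(k00=40, k01=10, k10=5, k11=45))  # ~49.5: strong association
```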
Evaluation of Classification
Accuracy
Micro- vs. Macro-Averaging
What happens when your model learns your training data a little too well?
Development Test Sets and Cross-validation
(Figure: the data is divided into a training set, a development test set, and a test set.)
516.
Logistic Regression
Combining Features
• Linear interpolation for language modeling
– Estimating the trigram probability P(c|ab)?
– Features: P_MLE(c|a), P_MLE(c), etc.
– Weights: λ1, λ2, etc.
• We may want to consider other features
– E.g., POS tags of previous words, heads, word endings, etc.
• General idea
– Compute the conditional probability P(y|x)
– P(y|x) = sum of weights*features
• Label Y for a given history x in X
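As a sketch, assuming the standard trigram/bigram/unigram interpolation with weights λ that sum to 1 (the exact features being interpolated may differ):

```python
def interpolated_trigram(p_mle_trigram, p_mle_bigram, p_mle_unigram,
                         lambdas=(0.6, 0.3, 0.1)):
    """P(c|ab) ~ l1*P_MLE(c|ab) + l2*P_MLE(c|b) + l3*P_MLE(c); lambdas sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_mle_trigram + l2 * p_mle_bigram + l3 * p_mle_unigram

# An unseen trigram (MLE = 0) still gets probability mass from the lower orders
print(interpolated_trigram(0.0, 0.2, 0.05))
```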
Logistic Regression
• Similar to Naïve Bayes (but discriminative!)
– Log-linear model
– Features don’t have to be independent
• Examples of features
– Anything of use
– Linguistic and non-linguistic
– Count of “good”
– Count of “not good”
– Sentence length
• Convex optimization problem
– It can be solved using gradient descent
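A sketch of such features in code; the feature names and the helper function are illustrative, the point being that any potentially useful signal can become a feature:

```python
def extract_features(sentence):
    """Turn a sentence into a small hand-crafted feature dictionary."""
    tokens = sentence.lower().split()
    text = " ".join(tokens)
    return {
        "count_good": tokens.count("good"),
        "count_not_good": text.count("not good"),
        "length": len(tokens),
    }

print(extract_features("The plot was good but the acting was not good"))
# {'count_good': 2, 'count_not_good': 1, 'length': 10}
```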
Feature Templates
$\operatorname{argmax}_{y^*} G(y^*, \theta)$
Mapping x to a 1-D coordinate
• Compute $z = \sum_i w_i x_i$
• Compute the logistic function (sigmoid):
$f(z) = \frac{1}{1 + e^{-z}}$
Examples
• Example 1
x = (2,1,1,1)
w = (1,-1,-2,3)
z = 2-1-2+3=2
f(z) = 1/(1 + e^{-2}) ≈ 0.88
• Example 2
x = (2,1,0,1)
w = (0,0,-3,0)
z=0
f(z) = 1/(1 + e^{0}) = 1/2
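The two examples, checked in code with the sigmoid defined above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))  # z = w . x

# Example 1
z1 = score((1, -1, -2, 3), (2, 1, 1, 1))   # 2 - 1 - 2 + 3 = 2
print(z1, sigmoid(z1))                      # 2  ~0.88

# Example 2
z2 = score((0, 0, -3, 0), (2, 1, 0, 1))    # 0
print(z2, sigmoid(z2))                      # 0  0.5
```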
Non-linear Activation Functions
Training and Testing LR
Classification using LR
Example
Example
Per-class Probabilities
Period Disambiguation
Cross-Entropy Loss (5.9)
Two examples (.69 vs .31)
Gradient Descent
Bivariate Gradient
Using the Gradient
Updating using the Gradient
Gradient for Logistic Regression
Stochastic Gradient Descent
Example (Cont’d)
Example (Cont’d)
Regularization
L2 Regularization
L1 Regularization
Constraining the Weights (1/2)
Constraining the Weights (2/2)
Multinomial LR
515.
Perceptrons
The Perceptron
• A simple but very important
(discriminative) classifier
• Model of a neuron
– Input excitations
– If excitation > inhibition, send an
electrical signal out the axon
• Earliest neural network
– invented in 1957 by Frank
Rosenblatt at Cornell
Perceptron Idea
(Figure: inputs x0, x1, x2 with weights w0, w1, w2 feed into a sum Σ; if the sum exceeds a threshold the output is 1 (yes), otherwise −1 (no).)
• The weighted sum can be written as a dot product of two vectors: w ∙ x
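A minimal sketch of this weighted-sum-and-threshold computation; the weights, inputs, and threshold value are made up for illustration:

```python
def perceptron_output(w, x, threshold=0.0):
    """Fire (1, 'yes') if the weighted sum w . x exceeds the threshold,
    otherwise output -1 ('no')."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > threshold else -1

# Made-up weights and inputs
print(perceptron_output(w=(0.5, 1.0, -2.0), x=(1.0, 0.7, 0.3)))  # sum = 0.6 -> 1
```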
Gradient Descent
https://en.wikipedia.org/wiki/Gradient_descent
Updating Parameters
• Basic approach
– We want to increase the probability of the entire data set:
$w_{MLE} = \operatorname{argmax}_w \sum_i \log P(y_i \mid x_i; w)$
– Gradient descent
– Take the derivative of the log likelihood with respect to the parameters
– Make a little change to the parameters in the direction the derivative tells you is uphill:
$w := w + \alpha \frac{\partial}{\partial w} L(w)$
• α here is the learning rate
– how much do you want to change each time?
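A sketch of one such update for logistic regression on a single example, assuming the usual sigmoid model with y ∈ {0, 1}, for which the per-example gradient of the log likelihood is (y − f(w·x))·x:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, x, y, alpha=0.1):
    """One stochastic update w := w + alpha * d/dw log P(y|x; w),
    with y in {0, 1}; the per-example gradient is (y - sigmoid(w.x)) * x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    error = y - sigmoid(z)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0, 0.0]
w = sgd_step(w, x=[1.0, 2.0, 0.5], y=1)  # nudges w uphill on the log likelihood
print(w)                                 # [0.05, 0.1, 0.025]
```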
Non-linearities: sigmoid, tanh
Notes about perceptrons
• Bag of words is not enough
381.
Sentiment Analysis
Reviews of 1Q84 by Haruki Murakami
• “1Q84 is a tremendous feat and a triumph . . . A must-read for anyone who
wants to come to terms with contemporary Japanese culture.”
—Lindsay Howell, Baltimore Examiner
• “Perhaps one of the most important works of science fiction of the year . . .
1Q84 does not disappoint . . . [It] envelops the reader in a shifting world of
strange cults and peculiar characters that is surreal and entrancing.”
—Matt Staggs, Suvudu.com
• “Ambitious, sprawling and thoroughly stunning . . . Orwellian dystopia, sci-fi,
the modern world (terrorism, drugs, apathy, pop novels)—all blend in this
dreamlike, strange and wholly unforgettable epic.”
—Kirkus Reviews (starred review)
Sentiment about Companies
Product Reviews
Introduction
• Many posts, blogs
• Expressing personal opinions
• Research questions
– Subjectivity analysis
– Polarity analysis (positive/negative, number of stars)
– Viewpoint analysis (Chelsea vs. Manchester United, republican vs.
democrat)
• Sentiment target
– entity
– aspect
Introduction
• Level of granularity
– Document
– Sentence
– Attribute
• Opinion words
– Base
– Comparative (better, slower)
• Negation analysis
– Just counting negative words is not enough
Observations
• “raise”
– can be negative or positive depending on the object
• “very”
– enhances the polarity of the adjective or adverb
• “advanced”, “progressive”
• http://www.theysay.io/
– claim to have 65,000 manual patterns
SA as a Classification Problem
• Set of features
– Words
• Presence is more important than frequency
– Punctuation
– Phrases
– Syntax
• A lot of training data is available
– E.g., movie review sentences and stars
• Techniques
– MaxEnt, SVM, Naive Bayes, RNN
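A small sketch of presence-based (rather than frequency-based) word features, per the note above; the vocabulary and helper name are illustrative:

```python
def binary_word_features(sentence, vocab):
    """0/1 presence features: whether each vocabulary word occurs at all."""
    tokens = set(sentence.lower().split())
    return {w: int(w in tokens) for w in vocab}

vocab = ["good", "bad", "boring", "great"]  # illustrative vocabulary
print(binary_word_features("Great movie, great cast, not boring", vocab))
# {'good': 0, 'bad': 0, 'boring': 1, 'great': 1}
```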
Results
https://github.com/sebastianruder/NLP-progress/blob/master/english/sentiment_analysis.md
Results
Difficult Problems
• Subtlety
• Concession
• Manipulation
• Sarcasm and irony
Introduction to NLP
382.
Sentiment Lexicons
• SentiWordNet
– http://sentiwordnet.isti.cnr.it/
• General Inquirer
– 2,000 positive words and 2,000 negative words
– http://www.wjh.harvard.edu/~inquirer/
• LIWC
– http://liwc.wpengine.com/
• Bing Liu’s opinion dataset
– http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
• MPQA subjectivity lexicon
– http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
General Inquirer
• Annotations
– Strong Power Weak Submit Active Passive Pleasur Pain Feel Arousal EMOT Virtue Vice Ovrst Undrst
Academ Doctrin Econ@ Exch ECON Exprsv Legal Milit Polit@ POLIT Relig Role COLL Work Ritual
SocRel Race Kin@ MALE Female Nonadlt HU ANI PLACE Social Region Route Aquatic Land Sky
Object Tool Food Vehicle BldgPt ComnObj NatObj BodyPt ComForm COM Say Need Goal Try
Means Persist Complet Fail NatrPro Begin Vary Increas Decreas Finish Stay Rise Exert Fetch Travel
Fall Think Know Causal Ought Perceiv Compare Eval@ EVAL Solve Abs@ ABS Quality Quan NUMB
ORD CARD FREQ DIST Time@ TIME Space POS DIM Rel COLOR Self Our You Name Yes No
Negate Intrj IAV DAV SV IPadj IndAdj PowGain PowLoss PowEnds PowAren PowCon PowCoop
PowAuPt PowPt PowDoct PowAuth PowOth PowTot RcEthic RcRelig RcGain RcLoss RcEnds RcTot
RspGain RspLoss RspOth RspTot AffGain AffLoss AffPt AffOth AffTot WltPt WltTran WltOth WltTot
WlbGain WlbLoss WlbPhys WlbPsyc WlbPt WlbTot EnlGain EnlLoss EnlEnds EnlPt EnlOth EnlTot
SklAsth SklPt SklOth SklTot TrnGain TrnLoss TranLw MeansLw EndsLw ArenaLw PtLw Nation
Anomie NegAff PosAff SureLw If NotLw TimeSpc
• http://www.webuse.umd.edu:9090/tags
– Positive: able, accolade, accuracy, adept, adequate…
– Negative: addiction, adversity, adultery, affliction, aggressive…
Dictionary-based Methods
(Figure: a graph of made-up words — cluvious, molistic, danty, slatty, cloovy, blitty, struffy, weasy, strungy, sloshful, frumsy — connected to each other and to the known-positive phrase “pleasure to watch”, illustrating how polarity can be propagated from labeled to unlabeled words through the graph.)
PMI (Turney)
• PMI = pointwise mutual information
• Check how often a given unlabeled word appears with a known positive
word (“excellent”)
• Same for a known negative word (“poor”)
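A sketch of the resulting semantic-orientation score, SO(word) = PMI(word, “excellent”) − PMI(word, “poor”); the co-occurrence counts below are made up, and how they are obtained (e.g., from search hit counts) is left out:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from co-occurrence counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

def semantic_orientation(c, total):
    """Turney-style score: PMI(word, 'excellent') - PMI(word, 'poor')."""
    return (pmi(c["with_excellent"], c["word"], c["excellent"], total)
            - pmi(c["with_poor"], c["word"], c["poor"], total))

# Made-up counts for an unlabeled word
counts = {"word": 1000, "excellent": 5000, "poor": 5000,
          "with_excellent": 80, "with_poor": 20}
print(semantic_orientation(counts, total=1_000_000))  # 2.0 > 0 -> leans positive
```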