Text Classification
Introduction to NLP
511.
Classification
• Assigning documents or sentences to predefined
categories
– topics, languages, users …
• Input:
– a document (or sentence) d
– a fixed set of classes C = {c1, c2,…, cJ}
• Output: a predicted class c ∈ C
Variants of Problem Formulation
• Binary categorization: only two categories
– Retrieval: {relevant-doc, irrelevant-doc}
– Spam filtering: {spam, non-spam}
– Opinion: {positive, negative}
• K-category categorization: more than two categories
– Topic categorization: {sports, science, travel, business,…}
– Word sense disambiguation: {bar1, bar2, bar3, …}
• Hierarchical vs. flat
• Overlapping (soft) vs non-overlapping (hard)
Hierarchical Classification
(Figure: documents from topic1 and topic2 plotted in a two-dimensional feature space with axes x1 and x2.)
Decision surfaces
(Figure: a decision surface separating topic1 from topic2 in the (x1, x2) plane.)
Decision trees
(Figure: decision-tree splits partitioning the (x1, x2) plane into topic1 and topic2 regions.)
Classification Using Centroids
• Centroid
– the point most representative of a class
• Compute the centroid by taking the vector average of the known class members
• The decision boundary is the line equidistant from the two centroids
• A new document on one side of the boundary goes in one class; a new document on the other side goes in the other (see the sketch below)
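A minimal sketch of this nearest-centroid idea; the two-dimensional document vectors, class names, and test points are made up for illustration:

```python
import numpy as np

# Toy 2-D document vectors for two topics (made-up data for illustration)
topic1_docs = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2]])
topic2_docs = np.array([[4.0, 0.5], [4.2, 0.8], [3.8, 0.4]])

# Centroid = vector average of the known members of each class
centroid1 = topic1_docs.mean(axis=0)
centroid2 = topic2_docs.mean(axis=0)

def classify(doc):
    """Assign a new document to the class whose centroid is closer."""
    d1 = np.linalg.norm(doc - centroid1)
    d2 = np.linalg.norm(doc - centroid2)
    return "topic1" if d1 < d2 else "topic2"

print(classify(np.array([1.3, 1.9])))  # -> topic1
print(classify(np.array([4.1, 0.6])))  # -> topic2
```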
Linear boundary
(Figure: a linear decision boundary between topic1 and topic2 in the (x1, x2) plane.)
Classification Using Centroids
(Figure: centroid1 and centroid2 with the decision boundary equidistant between them in the (x1, x2) plane.)
Introduction to NLP
Linear Classifiers
Decision Boundary
(Figure: a linear decision boundary with a point x and the weight vector w.)
Linear Separators
• Two-dimensional line:
– w1x1 + w2x2 = b is the linear separator
– w1x1 + w2x2 > b for the positive class
• In n-dimensional spaces:
$w^T x = \sum_{i=1}^{n} w_i x_i = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
• One can also fold the constant b into the weight vector by adding a constant bias feature x0
• w is the weight vector
• x is the feature vector
Example
• Score: compare w·x to the bias b (here b = 0)
• Weights:
    A  0.6    G  -0.7
    B  0.5    H  -0.5
    C  0.5    I  -0.3
    D  0.4    J  -0.2
    E  0.4    K  -0.2
    F  0.3    L  -0.2
• Sentence is “A D E H”
• Its score will be 0.6*1 + 0.4*1 + 0.4*1 + (-0.5)*1 = 0.9 > 0, so the positive class (checked in the sketch below)
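A quick check of this example in code; the letters are the features (words), and the bag-of-words vector has a 1 for each word that occurs in the sentence:

```python
# Weights from the example table above
weights = {"A": 0.6, "B": 0.5, "C": 0.5, "D": 0.4, "E": 0.4, "F": 0.3,
           "G": -0.7, "H": -0.5, "I": -0.3, "J": -0.2, "K": -0.2, "L": -0.2}
bias = 0.0  # b = 0 in this example

sentence = "A D E H".split()
score = sum(weights[w] for w in sentence)  # w . x for a 0/1 bag-of-words vector
print(score)         # 0.9
print(score > bias)  # True -> positive class
```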
How to Find the Linear Boundary?
• Find the linear boundary = find w
• Many methods
– E.g., Perceptron
• Problem:
– There is an infinite number of linear boundaries if the two classes are linearly separable!
– Maximum margin: Support Vector Machines (SVM)
(Figure: two linearly separable clusters of + and − points in the (x1, x2) plane.)
General Training Idea
• Go through the training data
• Predict the class y (1 or -1)
513.
Naïve Bayes
Generative vs. Discriminative Models
Generative:
• Learn a model of the joint probability p(d, c)
• Use Bayes’ Rule to calculate p(c|d)
• Build a model of each class; given an example, return the model most likely to have generated that example
• Examples:
– Naïve Bayes
– Gaussian Discriminant Analysis
– HMM
– Probabilistic CFG
Discriminative:
• Model the posterior probability p(c|d) directly
• The class is a function of the document vector
• Find the exact function that minimizes classification errors on the training data
• Examples:
– Logistic regression
– Neural Networks (NNs)
– Support Vector Machines (SVMs)
– Decision Trees
Assumptions of Discriminative Classifiers
• Data examples (documents) are represented as vectors of features (words, phrases, n-grams, etc.)
• We are looking for a function that maps each vector into a class.
• This function can be found by minimizing the errors on
the training data (plus other various criteria)
• Different classifiers vary on what the function looks like,
and how they find the function
Discriminative vs. Generative
• Discriminative classifiers are generally more effective,
since they directly optimize the classification accuracy. But
– They are all sensitive to the choice of features, and so far these
features are extracted heuristically
– Also, overfitting can happen if data is sparse
• Generative classifiers are the “opposite”
– They directly model text, an unnecessarily harder problem than
classification
– They can easily exploit unlabeled data
Naïve Bayes Intuition
$\hat{c} = \operatorname{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)}$   (Bayes’ Rule)
$\hat{c} = \operatorname{argmax}_{c \in C} P(d \mid c)\,P(c)$   (dropping the denominator)
Naïve Bayes Classifier
$\hat{P}(c_j) = \frac{\text{doccount}(C = c_j)}{N_{doc}}$
$\hat{P}(w_i \mid c_j) = \frac{\text{count}(w_i, c_j)}{\sum_{w \in V} \text{count}(w, c_j)}$
Parameter Estimation
$\hat{P}(w_i \mid c) = \frac{\text{count}(w_i, c) + 1}{\sum_{w \in V} \left(\text{count}(w, c) + 1\right)} = \frac{\text{count}(w_i, c) + 1}{\left(\sum_{w \in V} \text{count}(w, c)\right) + |V|}$
Multinomial Naïve Bayes
Independence Assumptions
$P(x_1, x_2, \dots, x_n \mid c)$
• Bag of Words assumption
– Assume position doesn’t matter
• Conditional Independence
– Assume the feature probabilities P(xi|c) are independent given
the class c.
$P(x_1, \dots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdots P(x_n \mid c)$
[Jurafsky and Martin]
Multinomial Naïve Bayes
$c_{NB} = \operatorname{argmax}_{c_j \in C} P(c_j) \prod_{x \in X} P(x \mid c_j)$
$c_{NB} = \operatorname{argmax}_{c_j \in C} P(c_j) \prod_{i \in \text{positions}} P(x_i \mid c_j)$
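A minimal sketch putting the pieces together — priors, add-one-smoothed likelihoods, and the argmax computed in log space. The tiny training set and class names are made up for illustration:

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training set (documents as token lists, with a class label)
train = [("Chinese Beijing Chinese".split(), "china"),
         ("Chinese Chinese Shanghai".split(), "china"),
         ("Chinese Macao".split(), "china"),
         ("Tokyo Japan Chinese".split(), "japan")]

classes = {c for _, c in train}
vocab = {w for doc, _ in train for w in doc}
doc_count = Counter(c for _, c in train)
word_count = defaultdict(Counter)          # word_count[c][w] = count(w, c)
for doc, c in train:
    word_count[c].update(doc)

def log_prior(c):
    # P(c) = doccount(C = c) / N_doc
    return math.log(doc_count[c] / len(train))

def log_likelihood(w, c):
    # Add-one smoothing: (count(w, c) + 1) / (sum_w count(w, c) + |V|)
    return math.log((word_count[c][w] + 1) /
                    (sum(word_count[c].values()) + len(vocab)))

def classify(doc):
    # c_NB = argmax_c P(c) * prod_i P(x_i | c), computed in log space
    return max(classes, key=lambda c: log_prior(c) +
               sum(log_likelihood(w, c) for w in doc if w in vocab))

print(classify("Chinese Chinese Chinese Tokyo Japan".split()))  # -> china
```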
Generative Model for Multinomial NB
(Figure: generative process for an example document with class c = China.)
[Greg Durrett]
Summary (1)
[Greg Durrett]
Summary (2)
[Greg Durrett]
Not So Naïve after all!
• Very fast, low storage requirements
• Robust to irrelevant features
– Irrelevant features cancel each other out without affecting results
• Very good in domains with many equally important features
– Decision trees suffer from fragmentation in such cases, especially with little data
• Optimal if the independence assumptions hold
– If the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
• A good, dependable baseline for text classification
– But other classifiers give better accuracy
[Jurafsky and Martin]
Naïve Bayes as a Linear Model
541.
Text Classification
Who wrote which Federalist papers?
Feature Selection
Feature selection: The χ² test
• For a term t, build a 2×2 contingency table of counts:

              It = 0    It = 1
    C = 0      k00       k01
    C = 1      k10       k11

• C = class, It = indicator for term t
• Testing for independence: P(C=0, It=0) should be equal to P(C=0) P(It=0)
  P(C=0) = (k00 + k01)/n
  P(C=1) = 1 − P(C=0) = (k10 + k11)/n
  P(It=0) = (k00 + k10)/n
  P(It=1) = 1 − P(It=0) = (k01 + k11)/n
Feature Selection: The χ² test
$\chi^2 = \frac{n\,(k_{11} k_{00} - k_{10} k_{01})^2}{(k_{11} + k_{10})(k_{01} + k_{00})(k_{11} + k_{01})(k_{10} + k_{00})}$
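A small sketch of the statistic above; the counts are made up, and in practice one would keep the terms with the highest χ² scores as features:

```python
def chi_square(k00, k01, k10, k11):
    """Chi-square statistic for a 2x2 class/feature contingency table."""
    n = k00 + k01 + k10 + k11
    num = n * (k11 * k00 - k10 * k01) ** 2
    den = (k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)
    return num / den

# Made-up counts: how often term t does / does not occur in docs of each class
print(chi_square(k00=40, k01=10, k10=5, k11=45))  # ~49.5: strong association
```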
Evaluation of Classification
Accuracy
Micro- vs. Macro-Averaging
What happens when your model learns your training data a little too well?
Development Test Sets and Cross-validation
(Figure: the data is divided into a training set, a development test set, and a test set.)
516.
Logistic Regression
Combining Features
• Linear interpolation for language modeling
– Estimating the trigram probability P(c|ab)?
– Features: P_MLE(c|a), P_MLE(c), etc.
– Weights: λ1, λ2, etc.
• We may want to consider other features
– E.g., POS tags of previous words, heads, word endings, etc.
• General idea
– Compute the conditional probability P(y|x)
– P(y|x) = sum of weights*features
• Label Y for a given history x in X
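As a sketch, assuming the standard trigram/bigram/unigram interpolation with weights λ that sum to 1 (the exact features being interpolated may differ):

```python
def interpolated_trigram(p_mle_trigram, p_mle_bigram, p_mle_unigram,
                         lambdas=(0.6, 0.3, 0.1)):
    """P(c|ab) ~ l1*P_MLE(c|ab) + l2*P_MLE(c|b) + l3*P_MLE(c); lambdas sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_mle_trigram + l2 * p_mle_bigram + l3 * p_mle_unigram

# An unseen trigram (MLE = 0) still gets probability mass from the lower orders
print(interpolated_trigram(0.0, 0.2, 0.05))
```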
Logistic Regression
• Similar to Naïve Bayes (but discriminative!)
– Log-linear model
– Features don’t have to be independent
• Examples of features
– Anything of use
– Linguistic and non-linguistic
– Count of “good”
– Count of “not good”
– Sentence length
• Convex optimization problem
– It can be solved using gradient descent
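A sketch of such features in code; the feature names and the helper function are illustrative, the point being that any potentially useful signal can become a feature:

```python
def extract_features(sentence):
    """Turn a sentence into a small hand-crafted feature dictionary."""
    tokens = sentence.lower().split()
    text = " ".join(tokens)
    return {
        "count_good": tokens.count("good"),
        "count_not_good": text.count("not good"),
        "length": len(tokens),
    }

print(extract_features("The plot was good but the acting was not good"))
# {'count_good': 2, 'count_not_good': 1, 'length': 10}
```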
Feature Templates
$\operatorname{argmax}_{y^*} G(y^*, \theta)$
Mapping x to a 1-D coordinate
• Compute $z = \sum_i w_i x_i$
• Compute the logistic function (sigmoid):
$f(z) = \frac{1}{1 + e^{-z}}$
Examples
• Example 1
x = (2,1,1,1)
w = (1,-1,-2,3)
z = 2-1-2+3=2
f(z) = 1/(1 + e^{-2}) ≈ 0.88
• Example 2
x = (2,1,0,1)
w = (0,0,-3,0)
z=0
f(z) = 1/(1 + e^{0}) = 1/2
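The two examples, checked in code with the sigmoid defined above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))  # z = w . x

# Example 1
z1 = score((1, -1, -2, 3), (2, 1, 1, 1))   # 2 - 1 - 2 + 3 = 2
print(z1, sigmoid(z1))                      # 2  ~0.88

# Example 2
z2 = score((0, 0, -3, 0), (2, 1, 0, 1))    # 0
print(z2, sigmoid(z2))                      # 0  0.5
```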
Non-linear Activation Functions
Training and Testing LR
Classification using LR
Example
Example
Per-class Probabilities
Period Disambiguation
Cross-Entropy Loss (5.9)
Two examples (.69 vs .31)
Gradient Descent
Bivariate Gradient
Using the Gradient
Updating using the Gradient
Gradient for Logistic Regression
Stochastic Gradient Descent
Example (Cont’d)
Example (Cont’d)
Regularization
L2 Regularization
L1 Regularization
Constraining the Weights (1/2)
Constraining the Weights (2/2)
Multinomial LR
515.
Perceptrons
The Perceptron
• A simple but very important
(discriminative) classifier
• Model of a neuron
– Input excitations
– If excitation > inhibition, send an
electrical signal out the axon
• Earliest neural network
– invented in 1957 by Frank
Rosenblatt at Cornell
Perceptron Idea
(Figure: inputs x0, x1, x2 with weights w0, w1, w2 feed into a sum Σ; if the sum exceeds a threshold the output is 1 (yes), otherwise −1 (no).)
• The weighted sum can be written as a dot product of two vectors: w ∙ x
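A minimal sketch of this weighted-sum-and-threshold computation; the weights, inputs, and threshold value are made up for illustration:

```python
def perceptron_output(w, x, threshold=0.0):
    """Fire (1, 'yes') if the weighted sum w . x exceeds the threshold,
    otherwise output -1 ('no')."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > threshold else -1

# Made-up weights and inputs
print(perceptron_output(w=(0.5, 1.0, -2.0), x=(1.0, 0.7, 0.3)))  # sum = 0.6 -> 1
```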
Gradient Descent
https://en.wikipedia.org/wiki/Gradient_descent
Updating Parameters
• Basic approach
– We want to increase the probability of the entire data set:
$w_{MLE} = \operatorname{argmax}_w \sum_i \log P(y_i \mid x_i; w)$
– Gradient descent
– Take the derivative of the log likelihood with respect to the parameters
– Make a little change to the parameters in the direction the derivative tells you is uphill:
$w := w + \alpha \frac{\partial}{\partial w} L(w)$
• α here is the learning rate
– how much do you want to change each time?
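A sketch of one such update for logistic regression on a single example, assuming the usual sigmoid model with y ∈ {0, 1}, for which the per-example gradient of the log likelihood is (y − f(w·x))·x:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, x, y, alpha=0.1):
    """One stochastic update w := w + alpha * d/dw log P(y|x; w),
    with y in {0, 1}; the per-example gradient is (y - sigmoid(w.x)) * x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    error = y - sigmoid(z)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0, 0.0]
w = sgd_step(w, x=[1.0, 2.0, 0.5], y=1)  # nudges w uphill on the log likelihood
print(w)                                 # [0.05, 0.1, 0.025]
```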
Non-linearities: sigmoid, tanh
Notes about perceptrons
• Bag of words is not enough
381.
Sentiment Analysis
Reviews of 1Q84 by Haruki Murakami
• “1Q84 is a tremendous feat and a triumph . . . A must-read for anyone who
wants to come to terms with contemporary Japanese culture.”
—Lindsay Howell, Baltimore Examiner
• “Perhaps one of the most important works of science fiction of the year . . .
1Q84 does not disappoint . . . [It] envelops the reader in a shifting world of
strange cults and peculiar characters that is surreal and entrancing.”
—Matt Staggs, Suvudu.com
• “Ambitious, sprawling and thoroughly stunning . . . Orwellian dystopia, sci-fi,
the modern world (terrorism, drugs, apathy, pop novels)—all blend in this
dreamlike, strange and wholly unforgettable epic.”
—Kirkus Reviews (starred review)
Sentiment about Companies
Product Reviews
Introduction
• Many posts, blogs
• Expressing personal opinions
• Research questions
– Subjectivity analysis
– Polarity analysis (positive/negative, number of stars)
– Viewpoint analysis (Chelsea vs. Manchester United, republican vs.
democrat)
• Sentiment target
– entity
– aspect
Introduction
• Level of granularity
– Document
– Sentence
– Attribute
• Opinion words
– Base
– Comparative (better, slower)
• Negation analysis
– Just counting negative words is not enough
Observations
• “raise”
– can be negative or positive depending on the object
• “very”
– enhances the polarity of the adjective or adverb
• “advanced”, “progressive”
• http://www.theysay.io/
– claim to have 65,000 manual patterns
SA as a Classification Problem
• Set of features
– Words
• Presence is more important than frequency
– Punctuation
– Phrases
– Syntax
• A lot of training data is available
– E.g., movie review sentences and stars
• Techniques
– MaxEnt, SVM, Naive Bayes, RNN
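A small sketch of presence-based (rather than frequency-based) word features, per the note above; the vocabulary and helper name are illustrative:

```python
def binary_word_features(sentence, vocab):
    """0/1 presence features: whether each vocabulary word occurs at all."""
    tokens = set(sentence.lower().split())
    return {w: int(w in tokens) for w in vocab}

vocab = ["good", "bad", "boring", "great"]  # illustrative vocabulary
print(binary_word_features("Great movie, great cast, not boring", vocab))
# {'good': 0, 'bad': 0, 'boring': 1, 'great': 1}
```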
Results
https://github.com/sebastianruder/NLP-progress/blob/master/english/sentiment_analysis.md
Results
Difficult Problems
• Subtlety
• Concession
• Manipulation
• Sarcasm and irony
Introduction to NLP
382.
Sentiment Lexicons
• SentiWordNet
– http://sentiwordnet.isti.cnr.it/
• General Inquirer
– 2,000 positive words and 2,000 negative words
– http://www.wjh.harvard.edu/~inquirer/
• LIWC
– http://liwc.wpengine.com/
• Bing Liu’s opinion dataset
– http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
• MPQA subjectivity lexicon
– http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
General Inquirer
• Annotations
– Strong Power Weak Submit Active Passive Pleasur Pain Feel Arousal EMOT Virtue Vice Ovrst Undrst
Academ Doctrin Econ@ Exch ECON Exprsv Legal Milit Polit@ POLIT Relig Role COLL Work Ritual
SocRel Race Kin@ MALE Female Nonadlt HU ANI PLACE Social Region Route Aquatic Land Sky
Object Tool Food Vehicle BldgPt ComnObj NatObj BodyPt ComForm COM Say Need Goal Try
Means Persist Complet Fail NatrPro Begin Vary Increas Decreas Finish Stay Rise Exert Fetch Travel
Fall Think Know Causal Ought Perceiv Compare Eval@ EVAL Solve Abs@ ABS Quality Quan NUMB
ORD CARD FREQ DIST Time@ TIME Space POS DIM Rel COLOR Self Our You Name Yes No
Negate Intrj IAV DAV SV IPadj IndAdj PowGain PowLoss PowEnds PowAren PowCon PowCoop
PowAuPt PowPt PowDoct PowAuth PowOth PowTot RcEthic RcRelig RcGain RcLoss RcEnds RcTot
RspGain RspLoss RspOth RspTot AffGain AffLoss AffPt AffOth AffTot WltPt WltTran WltOth WltTot
WlbGain WlbLoss WlbPhys WlbPsyc WlbPt WlbTot EnlGain EnlLoss EnlEnds EnlPt EnlOth EnlTot
SklAsth SklPt SklOth SklTot TrnGain TrnLoss TranLw MeansLw EndsLw ArenaLw PtLw Nation
Anomie NegAff PosAff SureLw If NotLw TimeSpc
• http://www.webuse.umd.edu:9090/tags
– Positive: able, accolade, accuracy, adept, adequate…
– Negative: addiction, adversity, adultery, affliction, aggressive…
Dictionary-based Methods
(Figure: a graph of made-up words — cluvious, molistic, danty, slatty, cloovy, blitty, struffy, weasy, strungy, sloshful, frumsy — connected to each other and to the known-positive phrase “pleasure to watch”, illustrating how polarity can be propagated from labeled to unlabeled words through the graph.)
PMI (Turney)
• PMI = pointwise mutual information
• Check how often a given unlabeled word appears with a known positive
word (“excellent”)
• Same for a known negative word (“poor”)
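A sketch of the resulting semantic-orientation score, SO(word) = PMI(word, “excellent”) − PMI(word, “poor”); the co-occurrence counts below are made up, and how they are obtained (e.g., from search hit counts) is left out:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from co-occurrence counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

def semantic_orientation(c, total):
    """Turney-style score: PMI(word, 'excellent') - PMI(word, 'poor')."""
    return (pmi(c["with_excellent"], c["word"], c["excellent"], total)
            - pmi(c["with_poor"], c["word"], c["poor"], total))

# Made-up counts for an unlabeled word
counts = {"word": 1000, "excellent": 5000, "poor": 5000,
          "with_excellent": 80, "with_poor": 20}
print(semantic_orientation(counts, total=1_000_000))  # 2.0 > 0 -> leans positive
```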