Lecture_5_Part_Of_Speech_Tagging
@Copyrights: Natural Language Processing (NLP) Organized by Dr. Ahmad Jalal (http://portals.au.edu.pk/imc/)
1. Part-of-Speech Tagging (Some Concepts)
A description of the 8 parts-of-speech:
- Noun: a word (other than a pronoun) used to identify any of a class of people,
places, or things (common noun), e.g., teacher, city, book.
- Verb: a word used to describe an action, state, or occurrence, and forming the
main part of the predicate of a sentence, e.g., hear, become, happen.
- Pronoun: a word that can function as a noun phrase used by itself and that
refers either to the participants in the discourse (e.g., I, you) or to someone
or something mentioned elsewhere in the discourse (e.g., she, it, this).
- Preposition: a word usually used in front of a noun or pronoun to show the
relationship between that noun or pronoun and other words in a
sentence, e.g., after, in, to, on, and with.
- Adverb: a word or phrase that modifies the meaning of an adjective, verb, or
other adverb, expressing manner, place, time, or degree
(e.g., gently, here, now, very).
1. Part-of-Speech Tagging (Some Concepts) (Cont…)
A description of the 8 parts-of-speech (continued):
- Conjunction: a word used to connect clauses or sentences or to coordinate
words in the same clause (e.g., and, but, if).
- Participle: a word formed from a verb (e.g., going, gone, being, been) and
used as an adjective (e.g., working woman, burnt toast) or a noun (e.g., good
breeding).
In English; participles are also used to make compound verb forms
(e.g., is going, has been ).
- Article: Articles are words that define a noun as specific or unspecific.
Consider the following examples:
Example- 1: After the long day, the cup of tea tasted particularly good.
Example-2: After a long day, a cup of tea tastes particularly good.
1. Part-of-Speech Tagging (Some Concepts) (Cont…)
More recent lists of parts-of-speech (or tagsets) have many more word
classes:
- 45 for the Penn Treebank (Marcus et al., 1993).
- 87 for the Brown corpus (Francis, 1979).
- 146 for the C7 tagset (Garside et al., 1997).
Part-of-speech tagging (or just tagging for short) is the process of
assigning a part-of-speech or other syntactic class marker to each word in a corpus.
Because tags are generally also applied to punctuation, tagging requires that the
punctuation marks (period, comma, etc.) be separated from the words.
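As a minimal sketch of this preprocessing step, assuming the NLTK library and its punkt tokenizer model are available (both are assumptions, not part of the lecture), punctuation can be split off before tagging:

```python
# Sketch: separating punctuation from words prior to tagging.
# Assumes NLTK is installed; the punkt model may need a one-time download.
import nltk
# nltk.download('punkt')  # uncomment on first run

sentence = "The grand jury commented on a number of other topics."
tokens = nltk.word_tokenize(sentence)
print(tokens)  # the final '.' becomes its own token, ready to receive a tag
```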
2. English Word Classes (Cont…)
(ii) Common nouns are divided into:
1) Count nouns, which allow grammatical enumeration; that is:
- they can occur in both the singular and plural
- e.g., (goat/goats, relationship/relationships), and they can be counted
(one goat, two goats).
2. English Word Classes (Cont…)
(d) The final open class, adverbs, is rather a hodge-podge, both semantically and
formally.
For example; all the italicized words are adverbs:
Unfortunately, John walked home extremely slowly yesterday.
(i) Directional adverbs or locative adverbs;
- Specify the direction or location of some action. e.g., home, downhill, etc.
(ii) Degree adverbs;
- Specify the extent of some action, process, or property. e.g., extremely, very,
somewhat, etc.
(iii) Manner adverbs;
- Describe the manner of some action or process. e.g., slowly, delicately, etc.
(iv) Temporal adverbs;
- Describe the time that some action or event took place. e.g., yesterday, Monday, etc.
3. Tagsets for English
Most of the popular tagsets for English evolved from the 87-tag tagset used
for the Brown corpus.
Two of the most commonly used tagsets are:
- the small 45-tag Penn Treebank tagset;
- the medium-sized 61-tag C5 tagset.
Example:
- Some examples of tagged sentences from the Penn Treebank
version of the Brown corpus are:
a) The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
b) There/EX are/VBP 70/CD children/NNS there/RB
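As an illustrative sketch, an off-the-shelf tagger reproduces Penn Treebank-style tags like those above (this assumes NLTK and its averaged_perceptron_tagger model, neither of which is part of the lecture; exact output depends on the trained model):

```python
# Sketch: tagging with NLTK's default tagger, which uses the Penn Treebank tagset.
import nltk
# nltk.download('averaged_perceptron_tagger')  # first run only

tokens = "There are 70 children there .".split()
print(nltk.pos_tag(tokens))
# Expected output along the lines of:
# [('There', 'EX'), ('are', 'VBP'), ('70', 'CD'),
#  ('children', 'NNS'), ('there', 'RB'), ('.', '.')]
```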
3. Tagsets for English (Class Participation)
a) Although preliminary findings were reported more than a year ago, the latest results appear in today's New England Journal of Medicine.
b) Mrs. Shaefer never got around to joining.
c) All we gotta do is go around the corner.
d) She told off her friends.
e) She stepped off the train.
f) They were married by the Justice of the Peace yesterday at 5:00.
4. Part-of-Speech Tagging
Part-of-speech tagging;
- is the process of assigning a part-of-speech or other syntactic class marker to each word in a corpus.
Problem:
Book/VB that/DT flight/NN ./.
Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.
Book is ambiguous.
- That is, it has more than one possible usage and part-of-speech.
(i) It can be a verb (as in book that flight or to book the suspect),
(ii) or a noun (as in hand me that book or a book of matches).
Solution:
The problem of POS-tagging is to resolve these ambiguities, choosing the proper tag for the context.
Finer-grained versions of POS tagging use larger tagsets, such as the 87-tag Brown corpus tagset.
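As a small sketch of this ambiguity, again assuming NLTK (the tags shown are the ideal outcome; the actual output depends on the trained model):

```python
# Sketch: the same word 'book' should receive different tags in different
# contexts: ideally VB in the first sentence and NN in the second.
import nltk

for sent in ["Book that flight .", "Hand me that book ."]:
    print(nltk.pos_tag(sent.split()))
```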
4. Part-of-Speech Tagging (Brown Corpus Tags) (Cont…)
Example: each form of the verb go receives a different tag.
Go away! (go/VB)
He sometimes goes to the cafe. (goes/VBZ)
All the cakes have gone. (gone/VBN)
We went on the excursion. (went/VBD)
4. Part-of-Speech Tagging (Penn Treebank tagset vs Brown Corpus Tags)
(Class Participation)
My aunt's can opener can open a drum should look like this:
The old car broke down in the car park
At least two men broke in and stole my TV
The horses were broken in and ridden in two weeks
Kim and Sandy both broke up with their partners
The horse which Kim sometimes rides is more bad tempered than mine
The horse as well as the rabbits which we wanted to eat has escaped
It was my aunt's car which we sold at auction last year in February
The only rabbit that I ever liked was eaten by my parents one summer
The veterans who I thought that we would meet at the reunion were dead
Natural disasters – storms, flooding, hurricanes – occur infrequently but cause devastation that strains resources to breaking point
Letters delivered on time by old-fashioned means are increasingly rare, so it is as well that that is not the only option available
It won't rain but there might be snow on high ground if the temperature stays about the same over the next 24 hours
The long and lonely road to redemption begins with self-reflection: the need to delve inwards to deconstruct layers of psychological obfuscation
My wildest dream is to build a POS tagger which processes 10K words per second and uses only 1MB of RAM, but it may prove too hard
5. Rule-Based Part-Of-Speech Tagging (Algorithm) (Cont…)
(2) The second stage used large lists of hand-written disambiguation rules to
narrow down this list to a single part-of-speech for each word.
5. Rule-Based Part-Of-Speech Tagging (Algorithm) (Cont…)
(1) First stage: use a dictionary to assign each word a list of potential parts-of-speech. (Example)
Which POS is more likely in a corpus (1,273,000 tokens)?
NN VB Total
race 400 600 1000
P(NN|race) = P(race&NN) / P(race) by the definition of conditional probability
- P(race) ≅ 1000/1,273,000 = .0008
- P(race&NN) ≅ 400/1,273,000 =.0003
- P(race&VB) ≅ 600/1,273,000 = .0005
And so we obtain:
- P(NN|race) = P(race&NN)/P(race) = .0003/.0008 = .375
- P(VB|race) = P(race&VB)/P(race) = .0005/.0008 = .625
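The same arithmetic in a few lines of Python (note that the exact ratios are .4 and .6; the slide's .375 and .625 come from rounding the intermediate values to four decimal places first):

```python
# Recomputing the conditional probabilities from the raw counts above.
total = 1_273_000
race_nn, race_vb = 400, 600

p_race = (race_nn + race_vb) / total   # ≈ .0008
p_race_and_nn = race_nn / total        # ≈ .0003
p_race_and_vb = race_vb / total        # ≈ .0005

print(p_race_and_nn / p_race)  # P(NN|race) = 0.4 (.375 with the slide's rounding)
print(p_race_and_vb / p_race)  # P(VB|race) = 0.6 (.625 with the slide's rounding)
```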
5. Rule-Based Part-Of-Speech Tagging (Algorithm) (Cont…)
(2) Second stage: hand-written disambiguation rules.
- Uses a 56,000-word lexicon which lists the parts-of-speech for each word
(using two-level morphology).
- Uses up to 3,744 rules, or constraints, for POS disambiguation.
Reminder:
- ADJ = attribute of a noun, e.g., sweet color, red car, sixteen candles.
- ADV = modifies a verb, adjective, or other adverb, e.g., very tall, too quickly.
5. Rule-Based Part-Of-Speech Tagging (Algorithm) (Cont…)
Algorithm description:
The first two clauses of this rule check that the that directly precedes a
sentence-final adjective, adverb, or quantifier.
The last clause eliminates cases preceded by verbs like consider or believe,
which can take a noun and an adjective; this is to avoid tagging the following
instance of that as an adverb:
I consider that odd.
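A toy sketch of this constraint (this is not the actual ENGTWOL rule; the mini-lexicon and verb list below are hypothetical stand-ins):

```python
# Toy version of the adverbial-'that' constraint described above: 'that' is
# adverbial only when it directly precedes a sentence-final adjective,
# adverb, or quantifier, and is not preceded by a verb like 'consider'.
ADJ_ADV_QUANT = {"odd", "quickly", "much"}     # hypothetical mini-lexicon
EXCEPTION_VERBS = {"consider", "believe"}

def that_is_adverbial(tokens, i):
    """tokens: lowercased sentence tokens; i: index of 'that'."""
    next_is_final_target = (i + 1 == len(tokens) - 1
                            and tokens[i + 1] in ADJ_ADV_QUANT)
    preceded_by_exception = i > 0 and tokens[i - 1] in EXCEPTION_VERBS
    return next_is_final_target and not preceded_by_exception

print(that_is_adverbial("it isn't that odd".split(), 2))    # True  -> ADV
print(that_is_adverbial("i consider that odd".split(), 2))  # False -> not ADV
```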
6. Statistical Tagging (HMM Part-of-speech tagging)
In part-of-speech tagging, probability-based tagging plays a major role as an
alternative to rule-based tagging with hand-written rules.
Machines can learn from examples
- Learning can be supervised or unsupervised.
Given training data, machines analyze the data, and learn rules which generalize
to new examples.
- Can be sub-symbolic (rule may be a mathematical function) e.g., neural nets.
- Or it can be symbolic (rules are in a representation that is similar to
representation used for hand-coded rules).
In general, machine learning approaches allow for more tuning to the needs of a
corpus, and can be reused across corpora.
6. Statistical Tagging (HMM Part-of-speech tagging) (Cont…)
In a classification task, we are given some observation(s) and our job is to
determine which of a set of classes it belongs to.
Part-of-speech tagging is generally treated as a sequence classification task.
- the observation is a sequence of words (let’s say a sentence), and it is our job to
assign them a sequence of part-of-speech tags.
For example, say we are given a sentence like
- “He will race”.
• What is the best sequence of tags which corresponds to this sequence of
words?
- The Bayesian interpretation of this task starts by considering all possible
sequences of classes (in this case, all possible sequences of tags).
- Out of this universe of tag sequences, we want to choose the tag sequence which
is most probable given the observation sequence of n words w1..wn.
6. Statistical Tagging (HMM Part-of-speech tagging) [Example]
What you want to do is find the "best sequence" of POS tags T=T1..Tn for a
sentence W=W1..Wn.
- (Here T1 is pos_tag(W1).)
- That is, find a sequence of POS tags T that maximizes P(T|W).
Using Bayes' Rule, we can write:
P(T|W) = P(W|T)*P(T)/P(W)
We want to find the value of T which maximizes the RHS.
=> The denominator can be discarded (it is the same for every T).
=> So find the T which maximizes P(W|T) * P(T).
Example: He will race
- W = W1 W2 W3 = He will race
- T = T1 T2 T3
Possible sequences (4 different probability values for "He will race"):
• T = PRP MD NN (He/PRP will/MD race/NN)
• T = PRP NN NN (He/PRP will/NN race/NN)
• T = PRP MD VB (He/PRP will/MD race/VB)
• T = PRP NN VB (He/PRP will/NN race/VB)
6. Statistical Tagging (HMM Part-of-speech tagging) [Independence Assumptions]
Assumption (Case 1):
Assume that the current event depends only on the previous n-1 events (for a
bigram model, only on the previous one event):
P(T1….Tn) ≅ Πi=1, n P(Ti| Ti-1)
- assumes that the event of a POS tag occurring is independent of the event of any
other POS tag occurring, except for the immediately previous POS tag.
=> From a linguistic standpoint, this seems an unreasonable assumption, due to
long-distance dependencies (e.g., Ali and his friends [go or goes?]).
Assumption (Case 2):
P(W1….Wn | T1….Tn) ≅ Πi=1, n P(Wi| Ti)
- assumes that the event of a word appearing in a category is independent of the
event of any surrounding word or tag, except for the tag at this position.
6. Statistical Tagging (HMM Part-of-speech tagging)
POS Tagging Based on Bigrams
Problem: Find T which maximizes P(W | T) * P(T)
- Here W=W1..Wn and T=T1..Tn
Using the bigram model, we get:
(a) Transition probabilities (prob. of transitioning from one state/tag to
another):
• P(T1….Tn) ≅ Πi=1, n P(Ti|Ti-1)
(b) Emission probabilities (prob. of emitting a word at a given state):
• P(W1….Wn | T1….Tn) ≅ Πi=1, n P(Wi| Ti)
So, we want to find the value of T1..Tn which maximizes:
Πi=1, n P(Wi| Ti) * P(Ti| Ti-1)
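A brute-force sketch of this maximization for "He will race". Apart from P(PRP|&lt;s&gt;) = 1, P(MD|PRP) = .8, and P(NN|MD) = .4, which are quoted on the next slide, every probability below is an illustrative made-up value:

```python
# Brute force: score every candidate tag sequence with the
# product of P(Wi|Ti) * P(Ti|Ti-1) and keep the best one.
from itertools import product

words = ["He", "will", "race"]
candidates = {"He": ["PRP"], "will": ["MD", "NN"], "race": ["NN", "VB"]}

# Toy probability tables ('<s>' marks the start of the sentence).
trans = {("<s>", "PRP"): 1.0, ("PRP", "MD"): 0.8, ("PRP", "NN"): 0.2,
         ("MD", "NN"): 0.4, ("MD", "VB"): 0.6,
         ("NN", "NN"): 0.3, ("NN", "VB"): 0.2}
emit = {("He", "PRP"): 0.3, ("will", "MD"): 0.8, ("will", "NN"): 0.2,
        ("race", "NN"): 0.4, ("race", "VB"): 0.6}

def score(tags):
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * emit.get((w, t), 0.0)
        prev = t
    return p

best = max(product(*(candidates[w] for w in words)), key=score)
print(best, score(best))  # ('PRP', 'MD', 'VB') with these toy numbers
```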
6. Statistical Tagging (HMM Part-of-speech tagging)
POS Tagging Based on Bigrams
(a) Transition probabilities: P(T1….Tn) ≅ Πi=1, n P(Ti|Ti-1)
Example: He will race
Choices for T=T1..T3
- T= PRP MD NN
- T= PRP NN NN
- T = PRP MD VB
- T = PRP NN VB
POS bigram probabilities from a training corpus can be used for P(T), e.g.:
P(PRP-MD-NN) = P(PRP|start) * P(MD|PRP) * P(NN|MD) = 1 * .8 * .4 = .32
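As a quick check of that product:

```python
# Transition probability of the tag sequence PRP MD NN,
# using the bigram values quoted above.
p = 1.0 * 0.8 * 0.4   # P(PRP|start) * P(MD|PRP) * P(NN|MD)
print(p)              # 0.32 (up to floating-point rounding)
```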
6. Statistical Tagging (HMM Part-of-speech tagging) (Cont…)
(a) Transition probabilities
From the training corpus, we need to find the T which maximizes
Πi=1, n P(Wi| Ti) * P(Ti| Ti-1)
So we'll also need to factor in the lexical generation (emission)
probabilities:
Choices for T=T1..T3
- T= PRP MD NN
- T= PRP NN NN
- T = PRP MD VB
- T = PRP NN VB
6. Statistical Tagging (HMM Part-of-speech tagging) (Cont…)
(b) Adding Emission probabilities
7. HMM Part-of-speech tagging (Tag Transition Probability)
HMM part-of-speech tagging uses two kinds of probabilities:
(a) Tag transition probabilities
(b) Word likelihood probabilities
(a) The tag transition probabilities, P(ti|ti−1), represent the probability of a tag
given the previous tag, and are computed from counts of tag-and-tag combinations.
(b) The word likelihood probabilities represent the probability of a word given its
tag, e.g.:
P(is|VBZ) = C(VBZ, is) / C(VBZ) = 10073 / 21627 = .47
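As a sketch, this maximum-likelihood estimate is just a ratio of corpus counts:

```python
# Maximum-likelihood estimate of a word-likelihood probability
# from corpus counts, as in the P(is|VBZ) example above.
def likelihood(count_tag_word, count_tag):
    return count_tag_word / count_tag

print(round(likelihood(10073, 21627), 2))  # 0.47
```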
7. HMM Part-of-speech tagging (Example)
Example of tagging the word race (as in to race) as VB as well as NN.
8. Formalizing Hidden Markov Model Taggers
The HMM is an extension of the finite automaton. A finite automaton is defined by a set of states, and a set of
transitions between states that are taken based on the input observations.
A weighted finite-state automaton is a simple augmentation of the finite automaton in which each arc is
associated with a probability, indicating how likely that path is to be taken. The probability on all the arcs
leaving a node must sum to 1.
A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines
which states the automaton will go through. Because it cannot represent inherently ambiguous problems, a
Markov chain is only useful for assigning probabilities to unambiguous sequences. While the Markov chain
is appropriate for situations where we can see the actual conditioning events, it is not appropriate in part-of-
speech tagging. This is because in part-of-speech tagging, while we observe the words in the input, we do not
observe the part-of-speech tags.
Thus we can’t condition any probabilities on, say, a previous part-of-speech tag, because we cannot be
completely certain exactly which tag applied to the previous word.
A Hidden Markov Model (HMM) allows us to talk about both observed events (like the words that we
see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our
probabilistic model. An HMM is specified by the following components:
8. Formalizing Hidden Markov Model Taggers (Variable definitions)
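Following the standard textbook formalization (Jurafsky & Martin), an HMM for tagging is specified by:
- Q = q1, q2, …, qN: a set of N states (here, the tags);
- A = a11, a12, …, aNN: a transition probability matrix, where aij is the probability of moving from state i to state j, with Σj aij = 1 for all i;
- O = o1, o2, …, oT: a sequence of T observations (here, the words);
- B = bi(ot): a sequence of observation likelihoods (emission probabilities), each giving the probability of observation ot being generated from state i;
- q0, qF: special start and end states that are not associated with observations.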
8. Formalizing Hidden Markov Model Taggers (Apply Chain rule)
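Applying Bayes' rule, the chain rule, and the two independence assumptions from Section 6 gives the standard derivation (following Jurafsky & Martin; t1..tn are tags and w1..wn are words):

```latex
\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)
            = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}
            = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)
            \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```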
8. Formalizing Hidden Markov Model Taggers (Example)
• Example of “Ali is Intelligent”.
[Figure: HMM state diagram with hidden states NNP, VB, and TO, plus Start and End states; the observed word Ali is attached to the NNP state, and the arcs a01, a02, a03, a11, a12, a13, a14, a31, a32, a33, a34 are the tag transition probabilities aij between states.]
8. Formalizing Hidden Markov Model Taggers (Class Participation)
Apply the single chain rule of HMM taggers to the following NLP sentences:
• Secretariat is expected to race tomorrow.
• is Secretariat expected to race tomorrow.
• expected Secretariat is to race tomorrow.
• to Secretariat is expected race tomorrow.
• race Secretariat is expected to tomorrow.
• tomorrow Secretariat is expected to race .
9. The Viterbi Algorithm for HMM Tagging
For any model, such as an HMM, that contains hidden variables,
- the task of determining which sequence of variables is the underlying source of some
sequence of observations is called the decoding task.
The Viterbi algorithm is perhaps the most common decoding algorithm used for HMMs,
whether for part-of-speech tagging or for speech recognition.
- looks a lot like the minimum edit distance algorithm.
The slightly simplified version of the Viterbi algorithm that we will present takes as
input a single HMM and
- a set of observed words O = (o1 o2 o3 . . . oT ), and
- returns the most probable state/tag sequence Q = (q1 q2 q3 . . . qT), together with its
probability.
Let the HMM be defined by two tables (next slide). The first table expresses the aij probabilities,
- the transition probabilities between hidden states (i.e., part-of-speech tags).
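A compact Python sketch of this decoder (the toy transition table a and observation-likelihood table b below are illustrative stand-ins, not the actual tables from the slides):

```python
# Viterbi decoding: given transition probabilities a[prev][cur] (with a '<s>'
# start state) and observation likelihoods b[tag][word], return the most
# probable tag sequence for the observed words, together with its probability.
def viterbi(words, tags, a, b):
    # V[t][tag] = (best prob. of any sequence ending in 'tag' at step t, backpointer)
    V = [{tag: (a["<s>"].get(tag, 0.0) * b[tag].get(words[0], 0.0), None)
          for tag in tags}]
    for t in range(1, len(words)):
        col = {}
        for tag in tags:
            prob, prev = max(
                (V[t - 1][p][0] * a[p].get(tag, 0.0) * b[tag].get(words[t], 0.0), p)
                for p in tags)
            col[tag] = (prob, prev)
        V.append(col)
    best_prob, last = max((V[-1][tag][0], tag) for tag in tags)
    path = [last]
    for t in range(len(words) - 1, 0, -1):      # follow backpointers
        last = V[t][last][1]
        path.append(last)
    return list(reversed(path)), best_prob

# Toy run on "He will race" (illustrative numbers only):
a = {"<s>": {"PRP": 1.0}, "PRP": {"MD": 0.8, "NN": 0.2},
     "MD": {"NN": 0.4, "VB": 0.6}, "NN": {"NN": 0.3, "VB": 0.2}, "VB": {}}
b = {"PRP": {"He": 0.3}, "MD": {"will": 0.8},
     "NN": {"will": 0.2, "race": 0.4}, "VB": {"race": 0.6}}
print(viterbi(["He", "will", "race"], ["PRP", "MD", "NN", "VB"], a, b))
# -> (['PRP', 'MD', 'VB'], ≈0.06912)
```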
9. The Viterbi Algorithm for HMM Tagging (Cont…)
[Table: the aij tag transition probabilities between hidden states.]
9. The Viterbi Algorithm for HMM Tagging (Cont…)
The figure expresses the bi(ot) probabilities, the observation likelihoods of words given tags.