Lecture_5_Part_Of_Speech_Tagging

The document provides an overview of Part-of-Speech (POS) tagging, detailing various tagging methods including rule-based, statistical, and transformation-based approaches. It describes the eight parts of speech in English, their classifications, and the significance of tagsets like the Penn Treebank and Brown corpus. Additionally, it discusses the algorithms used for POS tagging, emphasizing the importance of disambiguation rules and the challenges posed by ambiguous words.

Part-of-Speech Tagging

 Part-of-speech tagging
- Rule-based tagging
- Statistical model tagging
- Transformation-based tagging
 (Mostly) English Word Classes
- Closed class, Open class
- Noun, proper/common nouns
- Verb, Adjectives, adverbs
 Tagsets for English
- Penn Treebank part-of-speech tags
 Part-of-Speech Tagging
- Tagging, Ambiguous
- Brown corpus
 Rule-based Part-of-speech Tagging
- First stage: used a dictionary
- Second stage: used large lists of hand-written disambiguation rules
 HMM Part-of-Speech Tagging
- Prior probability
- Likelihood of tag sequence
- Computing the Most Likely Tag Sequence: An Example
- Formalizing Hidden Markov Model Taggers
 Transformation-based Tagging
- Transformation-based learning
- How TBL Rules are Applied

1. Part-of-Speech Tagging (Some Concepts)
 A description of the 8 parts-of-speech:
- Noun: a word (other than a pronoun) used to identify any of a class of people, places, or things (common noun).
- Verb: a word used to describe an action, state, or occurrence, and forming the main part of the predicate of a sentence, e.g., hear, become, happen.
- Pronoun: a word that can function as a noun phrase used by itself and that refers either to the participants in the discourse (e.g., I, you, she, it, this).
- Preposition: prepositions are usually used in front of nouns or pronouns, and they show the relationship between the noun or pronoun and other words in a sentence, e.g., after, in, to, on, and with.
- Adverb: a word or phrase that modifies the meaning of an adjective, verb, or other adverb, expressing manner, place, time, or degree (e.g., gently, here, now, very).
1. Part-of-Speech Tagging (Some Concepts) (Cont…)
 A description of the 8 parts-of-speech (continued):
- Conjunction: a word used to connect clauses or sentences or to coordinate words in the same clause (e.g., and, but, if).
- Participle: a word formed from a verb (e.g., going, gone, being, been) and used as an adjective (e.g., working woman, burnt toast) or a noun (e.g., good breeding).
In English, participles are also used to make compound verb forms (e.g., is going, has been).
- Article: articles are words that define a noun as specific or unspecific.
Consider the following examples:
Example 1: After the long day, the cup of tea tasted particularly good.
Example 2: After a long day, a cup of tea tastes particularly good.
1. Part-of-Speech Tagging (Some Concepts) (Cont…)
 More recent lists of parts-of-speech (or tagsets) have many more word classes:
- 45 for the Penn Treebank (Marcus et al., 1993).
- 87 for the Brown corpus (Francis, 1979).
- 146 for the C7 tagset (Garside et al., 1997).

 The significance of part-of-speech (POS) tagsets includes:
- The large amount of information they give about a word and its neighbors.
For example, tagsets distinguish between possessive pronouns (my, your, his, her, its) and personal pronouns (I, you, he, me).
- Knowing whether a word is a possessive pronoun or a personal pronoun can tell us what words are likely to occur in its vicinity.
1. Part-of-Speech Tagging (Some Concepts) (Cont…)

Part-of-speech tagging (or just tagging for short) is the process of assigning a part-of-speech or other syntactic class marker to each word in a corpus.
 Because tags are generally also applied to punctuation, tagging requires that the punctuation marks (period, comma, etc.) be separated off of the words.
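A minimal sketch of this separation step (a hypothetical regex-based tokenizer; production tokenizers handle clitics, abbreviations, and many more cases):

    import re

    def tokenize(text):
        # keep runs of word characters; split each punctuation mark off as its own token
        return re.findall(r"\w+|[^\w\s]", text)

    print(tokenize("Does that flight serve dinner?"))
    # ['Does', 'that', 'flight', 'serve', 'dinner', '?']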
 Computational algorithms for assigning parts-of-speech to words (part-of-speech tagging) fall into three families:
1. Hand-written rules (rule-based tagging),
2. Statistical methods (HMM tagging and maximum entropy tagging),
3. Transformation-based tagging and memory-based tagging.

 Rule-based taggers generally involve a large database of hand-written disambiguation rules which specify, for example, that an ambiguous word is a noun rather than a verb if it follows a determiner.
2. English Word Classes
 Parts-of-speech can be divided into two broad supercategories:
(1) Closed class type:
- Closed class words are also generally function words,
- which tend to be very short, occur frequently, and often have structuring uses in grammar.
e.g., of, it, and, or, you.

(2) Open class type:
- 4 major open classes occur in the languages of the world: nouns, verbs, adjectives, and adverbs.
(a) Noun: the name given to the syntactic class.
- Grouped into (i) proper nouns and (ii) common nouns.
(i) Proper nouns like Regina, Colorado, and IBM.
2. English Word Classes (Cont…)
(ii) Common nouns are divided into:

1) Count nouns: those that allow grammatical enumeration; that is,
- they can occur in both the singular and plural,
- e.g., (goat/goats, relationship/relationships), and they can be counted (one goat, two goats).

2) Mass nouns: used when something is conceptualized as a homogeneous group,
- e.g., words like snow, salt, and water are not counted (i.e., *two snows or *two waters).
2. English Word Classes (Cont…)
(b) Verb: includes most of the words referring to actions and processes.
- English verbs have a number of morphological forms:
(i) Non-third-person-sg (eat)
(ii) Third-person-sg (eats)
(iii) Progressive (eating)
(iv) Past participle (eaten)

(c) Adjectives: includes many terms that describe properties or qualities,
- e.g., color (white, black), age (old, young), and value (good, bad).

2. English Word Classes (Cont…)
(d) Adverbs: the final open class form is rather a hodge-podge, both semantically and formally.
For example, all the italicized words here are adverbs:
Unfortunately, John walked home extremely slowly yesterday.
(i) Directional adverbs or locative adverbs:
- Specify the direction or location of some action, e.g., home, downhill.
(ii) Degree adverbs:
- Specify the extent of some action, process, or property, e.g., extremely, very, somewhat.
(iii) Manner adverbs:
- Describe the manner of some action or process, e.g., slowly, delicately.
(iv) Temporal adverbs:
- Describe the time that some action or event took place, e.g., yesterday, Monday.
3. Tagsets for English
 One of the most popular tagsets for English is:
- the 87-tag tagset used for the Brown corpus.
 Two of the most commonly used tagsets are:
- the small 45-tag Penn Treebank tagset, and
- the medium-sized 61-tag C5 tagset.

Example:
- Some examples of tagged sentences from the Penn Treebank version of the Brown corpus are:
a) The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
b) There/EX are/VBP 70/CD children/NNS there/RB
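For a quick hands-on illustration (assuming the NLTK library and its default English tagger model are installed; resource names may vary slightly across NLTK versions), Penn Treebank tags can be produced as follows:

    import nltk

    nltk.download('punkt')                        # one-time: tokenizer model
    nltk.download('averaged_perceptron_tagger')   # one-time: default POS tagger model

    tokens = nltk.word_tokenize("There are 70 children there")
    print(nltk.pos_tag(tokens))
    # expected (approximately): [('There', 'EX'), ('are', 'VBP'), ('70', 'CD'),
    #                            ('children', 'NNS'), ('there', 'RB')]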

3. Tagsets for English (Class Participation)
a) Although preliminary findings were
reported more than a year ago, the
latest results appear in today's New
England Journal of Medicine.
b) Mrs. Shaefer never got around to
joining.
c) All we gotta do is go around the
corner.
d) She told off her friends.
e) She stepped off the train.
f) They were married by the Justice of
the Peace yesterday at 5:00.

4. Part-of-Speech Tagging
 Part-of-speech tagging:
- is the process of assigning a part-of-speech or other syntactic class marker to each word in a corpus.
Problem:
 Book/VB that/DT flight/NN ./.
 Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.

Book is ambiguous.
- That is, it has more than one possible usage and part-of-speech.
(i) It can be a verb (as in book that flight or to book the suspect),
(ii) or a noun (as in hand me that book or a book of matches).

Solution:
 The problem of POS-tagging is to resolve these ambiguities, choosing the proper tag for the context.
 A larger tagset, such as the 87-tag Brown corpus tagset, can also be used for this task.

4. Part-of-Speech Tagging (Brown Corpus Tags) (Cont…)
Example
 Go away!
 He sometimes goes to the cafe.
 All the cakes have gone.
 We went on the excursion.

Figure: 87-tag Brown corpus tagset.

4. Part-of-Speech Tagging (Penn Treebank tagset vs Brown Corpus Tags)
(Class Participation)
 My aunt’s can opener can open a drum should look like this:
 The old car broke down in the car park
 At least two men broke in and stole my TV
 The horses were broken in and ridden in two weeks
 Kim and Sandy both broke up with their partners
 The horse which Kim sometimes rides is more bad tempered than mine
 The horse as well as the rabbits which we wanted to eat has escaped
 It was my aunt’s car which we sold at auction last year in February
 The only rabbit that I ever liked was eaten by my parents one summer
 The veterans who I thought that we would meet at the reunion were dead
 Natural disasters – storms, flooding, hurricanes – occur infrequently but cause devastation that strains resources to breaking point
 Letters delivered on time by old-fashioned means are increasingly rare, so it is as well that that is not the only option available
 It won’t rain but there might be snow on high ground if the temperature stays about the same over the next 24 hours
 The long and lonely road to redemption begins with self-reflection: the need to delve inwards to deconstruct layers of psychological obfuscation
 My wildest dream is to build a POS tagger which processes 10K words per second and uses only 1MB of RAM, but it may prove too hard
5. Rule-Based Part-Of-Speech Tagging (Algorithm) (Cont…)

• The earliest algorithms for automatically assigning parts-of-speech were based on a two-stage architecture (Harris, 1962; Klein and Simmons, 1963).
(1) The first stage used a dictionary to assign each word a list of potential parts-
of-speech.

(2) The second stage used large lists of hand-written disambiguation rules to
narrow down this list to a single part-of-speech for each word.

5. Rule-Based Part-Of-Speech Tagging (Algorithm) (Cont…)

(1) First stage: use a dictionary to assign each word its tags

 Choose the most likely tag for each ambiguous word, independent of previous words,
- i.e., assign each token the POS category it occurred as most often in the training set.
- e.g., race: which POS is more likely in a corpus?

 This strategy gives about 90% accuracy in controlled tests.
- So any proposed tagger must always be compared against this "unigram baseline".
5. Rule-Based Part-Of-Speech Tagging (Algorithm) (Cont…)
(1) First stage: dictionary to assign each word (Example)
Which POS is more likely in a corpus (1,273,000 tokens)?

          NN     VB     Total
  race    400    600    1000

P(NN|race) = P(race&NN) / P(race), by the definition of conditional probability
- P(race) ≅ 1000/1,273,000 = .0008
- P(race&NN) ≅ 400/1,273,000 = .0003
- P(race&VB) ≅ 600/1,273,000 = .0005
And so we obtain:
- P(NN|race) = P(race&NN)/P(race) = .0003/.0008 = .375
- P(VB|race) = P(race&VB)/P(race) = .0005/.0008 = .625
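The same computation can be reproduced in a few lines (counts taken from the table above; the intermediate values are rounded to 4 decimal places exactly as on this slide, which is why the results are .375/.625 rather than the unrounded .4/.6):

    total_tokens = 1_273_000
    count_race_nn, count_race_vb = 400, 600
    count_race = count_race_nn + count_race_vb              # 1000

    p_race = round(count_race / total_tokens, 4)            # .0008
    p_race_and_nn = round(count_race_nn / total_tokens, 4)  # .0003
    p_race_and_vb = round(count_race_vb / total_tokens, 4)  # .0005

    # P(T|race) = P(race & T) / P(race), by the definition of conditional probability
    print(round(p_race_and_nn / p_race, 3))  # 0.375 -> P(NN|race)
    print(round(p_race_and_vb / p_race, 3))  # 0.625 -> P(VB|race)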
5. Rule-Based Part-Of-Speech Tagging (Algorithm) (Cont…)
(2) Second stage: hand-written disambiguation rules
 Uses a 56,000-word lexicon which lists the parts-of-speech for each word (using two-level morphology).
 Uses up to 3,744 rules, or constraints, for POS disambiguation.

ADV-that rule [sample sentence: it isn't that old]
Given input "that" (ADV/PRON/DET/COMP)
If   (+1 A/ADV/QUANT)   # next word is an adjective, adverb, or quantifier
     (+2 SENT_LIM)      # and the following word is a sentence boundary
     (NOT -1 SVOC/A)    # and the previous word is not a verb like consider, which allows adjectives as object complements
Then eliminate non-ADV tags
Else eliminate ADV tag

Tag glossary:
- Adj = attribute of a noun, e.g., sweet color, red car, sixteen candles.
- ADV = modifies a verb, adjective, or other adverb, e.g., very tall, too quickly.
- QUANT = a determiner or pronoun indicative of quantity, e.g., all people, both parties.
- PRON = a word that can function as a noun, e.g., I, you.
- DET = acting as determiner: a modifying word that determines the kind of reference a noun or noun group has, e.g., a person, the game, every moment.
- COMP = acting as complement: a word which completes the meaning of an expression, e.g., He is weak, He is old.

5. Rule-Based Part-Of-Speech Tagging (Algorithm) (Cont…)

Algorithm description:
 The first two clauses of this rule check that the that directly precedes a sentence-final adjective, adverb, or quantifier.

 In all other cases the adverb reading is eliminated.

 The last clause eliminates cases preceded by verbs like consider or believe, which can take a noun and an adjective; this is to avoid tagging the following instance of that as an adverb (as in I consider that odd).

6. Statistical Tagging (HMM Part-of-speech tagging)
 During part-of-speech tagging, probability-based tagging plays a major role, rather than rule-based tagging or hand-written rule tagging.
 Machines can learn from examples.
- Learning can be supervised or unsupervised.
 Given training data, machines analyze the data and learn rules which generalize to new examples.
- Learning can be sub-symbolic (a rule may be a mathematical function), e.g., neural nets.
- Or it can be symbolic (rules are in a representation that is similar to the representation used for hand-coded rules).
 In general, machine learning approaches allow for more tuning to the needs of a corpus, and can be reused across corpora.

6. Statistical Tagging (HMM Part-of-speech tagging) (Cont…)
 In a classification task, we are given some observation(s) and our job is to
determine which of a set of classes it belongs to.
Part-of-speech tagging is generally treated as a sequence classification task.
- The observation is a sequence of words (let's say a sentence), and it is our job to assign them a sequence of part-of-speech tags.
For example, say we are given a sentence like
- "He will race".
• What is the best sequence of tags which corresponds to this sequence of words?
- The Bayesian interpretation of this task starts by considering all possible sequences of classes; in this case, all possible sequences of tags.
- Out of this universe of tag sequences, we want to choose the tag sequence which is most probable given the observation sequence of n words W1..Wn.
6. Statistical Tagging (HMM Part-of-speech tagging) [Example]
What we want to do is find the "best sequence" of POS tags T = T1..Tn for a sentence W = W1..Wn (here Ti is pos_tag(Wi)).
 That is, find a sequence of POS tags T that maximizes P(T|W).
 Using Bayes' Rule, we can say:
P(T|W) = P(W|T) * P(T) / P(W)
 We want to find the value of T which maximizes the RHS,
=> so the denominator can be discarded (it is the same for every T),
=> i.e., find T which maximizes P(W|T) * P(T).

Example: He will race
W = W1 W2 W3 = He will race; T = T1 T2 T3.
Four possible tag sequences, with four different sequence probabilities:
- He/PRP will/MD race/NN (T = PRP MD NN)
- He/PRP will/NN race/NN (T = PRP NN NN)
- He/PRP will/MD race/VB (T = PRP MD VB)
- He/PRP will/NN race/VB (T = PRP NN VB)
6. Statistical Tagging (HMM Part-of-speech tagging) [Independence Assumptions]

Assumption, Case 1:
Assume that the current event is based only on the previous n−1 events (for a bigram model, it is based only on the previous 1 event):
 P(T1….Tn) ≅ Πi=1,n P(Ti|Ti-1)
- This assumes that the event of a POS tag occurring is independent of the event of any other POS tag occurring, except for the immediately previous POS tag.
=> From a linguistic standpoint, this seems an unreasonable assumption, due to long-distance dependencies (e.g., Ali and his friends (go or goes?)).

Assumption, Case 2:
 P(W1….Wn | T1….Tn) ≅ Πi=1,n P(Wi|Ti)
- This assumes that the event of a word appearing in a category is independent of the event of any surrounding word or tag, except for the tag at this position.
6. Statistical Tagging (HMM Part-of-speech tagging) [Independence Assumptions]
(Cont…)

 Linguists know both these assumptions are incorrect!


- But, nevertheless, statistical approaches based on these assumptions work
pretty well for part-of-speech tagging.
 In particular, with Hidden Markov Models (HMMs)
- Very widely used in both POS-tagging and speech recognition, among
other problems.
- A Markov model, or Markov chain, is just a weighted Finite State
Automaton.

6. Statistical Tagging (HMM Part-of-speech tagging)
POS Tagging Based on Bigrams
 Problem: Find T which maximizes P(W | T) * P(T)
- Here W=W1..Wn and T=T1..Tn
 Using the bigram model, we get:
(a) Transition probabilities (prob. of transitioning from one state/tag to
another):
• P(T1….Tn) ≅ Πi=1, n P(Ti|Ti-1)
(b) Emission probabilities (prob. of emitting a word at a given state):
• P(W1….Wn | T1….Tn) ≅ Πi=1, n P(Wi| Ti)
 So, we want to find the value of T1..Tn which maximizes:
Πi=1, n P(Wi| Ti) * P(Ti| Ti-1)
6. Statistical Tagging (HMM Part-of-speech tagging)
POS Tagging Based on Bigrams
(a) Transition probabilities: P(T1….Tn) ≅ Πi=1,n P(Ti|Ti-1)

Example: He will race
Choices for T = T1..T3:
- T = PRP MD NN
- T = PRP NN NN
- T = PRP MD VB
- T = PRP NN VB
POS bigram probabilities from the training corpus can be used for P(T), e.g.:
P(PRP-MD-NN) = 1 * .8 * .4 = .32
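Because a three-word sentence has only a handful of candidate tag sequences, the argmax can be computed by brute-force enumeration. A sketch (the transition values echo the P(PRP-MD-NN) = 1 * .8 * .4 example above; the remaining numbers are illustrative placeholders, not corpus estimates):

    from itertools import product

    trans = {('<s>', 'PRP'): 1.0, ('PRP', 'MD'): 0.8, ('PRP', 'NN'): 0.2,
             ('MD', 'NN'): 0.4, ('MD', 'VB'): 0.6, ('NN', 'NN'): 0.3, ('NN', 'VB'): 0.7}
    emit = {('He', 'PRP'): 0.3, ('will', 'MD'): 0.8, ('will', 'NN'): 0.2,
            ('race', 'NN'): 0.4, ('race', 'VB'): 0.6}

    words = ['He', 'will', 'race']
    choices = [['PRP'], ['MD', 'NN'], ['NN', 'VB']]

    def score(tags):
        # P(W|T) * P(T) under the bigram independence assumptions
        p, prev = 1.0, '<s>'
        for w, t in zip(words, tags):
            p *= emit.get((w, t), 0.0) * trans.get((prev, t), 0.0)
            prev = t
        return p

    best = max(product(*choices), key=score)
    print(best, score(best))   # ('PRP', 'MD', 'VB') wins with these numbers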
6. Statistical Tagging (HMM Part-of-speech tagging) (Cont…)
(a) Transition probabilities
 From the training corpus, we need to find the Ti which maximize
Πi=1,n P(Wi|Ti) * P(Ti|Ti-1)
 So we will also need to factor in the lexical generation (emission) probabilities.
 Choices for T = T1..T3:
- T = PRP MD NN
- T = PRP NN NN
- T = PRP MD VB
- T = PRP NN VB

6. Statistical Tagging (HMM Part-of-speech tagging) (Cont…)
(b) Adding Emission probabilities

7. HMM Part-of-speech tagging (Tag Transition Probability)
 HMM part-of-speech tagging contains two kinds of probabilities,
(a) Tag transition probabilities
(b) Word likelihood probabilities
(a) Tag transition probabilities
 Tag-and-tag combinations.
 (The tag transition probabilities, P(ti|ti−1), represent the probability of a tag given the previous tag.)

For example (This/DT book/NN is interesting):
 In the 45-tag Treebank Brown corpus, the tag DT occurs 116,454 times. Of these, DT is followed by NN 56,509 times.
 Thus the MLE estimate of the transition probability is calculated as follows:
P(NN|DT) = C(DT,NN) / C(DT) = 56,509 / 116,454 = .49
7. HMM Part-of-speech tagging (Word likelihood Probability)

(b) Word likelihood probabilities
 Word-and-tag combinations.
 (The word likelihood probabilities, P(wi|ti), represent the probability, given that we see a given tag, that it will be associated with a given word.)

For example (This book is/VBZ interesting):
 In the Treebank Brown corpus, the tag VBZ occurs 21,627 times, and VBZ is the tag for "is" 10,073 times. Thus:
P(is|VBZ) = C(VBZ,is) / C(VBZ) = 10,073 / 21,627 = .47
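Both kinds of probabilities are just ratios of counts, so they can be estimated directly from any tagged corpus. A sketch with a two-sentence toy corpus standing in for the Treebank (on the full Brown corpus, the analogous counts give the .49 and .47 estimates above):

    from collections import Counter

    corpus = [[('This', 'DT'), ('book', 'NN'), ('is', 'VBZ'), ('interesting', 'JJ')],
              [('The', 'DT'), ('flight', 'NN'), ('is', 'VBZ'), ('late', 'JJ')]]

    tag_count, tag_bigram, word_tag = Counter(), Counter(), Counter()
    for sentence in corpus:
        prev = '<s>'
        for word, tag in sentence:
            tag_count[tag] += 1
            tag_bigram[(prev, tag)] += 1
            word_tag[(word, tag)] += 1
            prev = tag

    # MLE estimates, mirroring the two formulas above
    print(tag_bigram[('DT', 'NN')] / tag_count['DT'])   # P(NN|DT) = C(DT,NN)/C(DT)
    print(word_tag[('is', 'VBZ')] / tag_count['VBZ'])   # P(is|VBZ) = C(VBZ,is)/C(VBZ)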
7. HMM Part-of-speech tagging (Example)
Example of tagging the word "race" (as in "to race") as VB as well as NN.

Figure: tag-to-tag and word-to-tag combination probabilities for the two readings (not reproduced).

8. Formalizing Hidden Markov Model Taggers

 The HMM is an extension of the finite automaton. A finite automaton is defined by a set of states, and a set of transitions between states that are taken based on the input observations.
 A weighted finite-state automaton is a simple augmentation of the finite automaton in which each arc is associated with a probability, indicating how likely that path is to be taken. The probabilities on all the arcs leaving a node must sum to 1.
 A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through. Because they can't represent inherently ambiguous problems, a Markov chain is only useful for assigning probabilities to unambiguous sequences. While the Markov chain is appropriate for situations where we can see the actual conditioning events, it is not appropriate in part-of-speech tagging: while we observe the words in the input, we do not observe the part-of-speech tags.
 Thus we can't condition any probabilities on, say, a previous part-of-speech tag, because we cannot be completely certain exactly which tag applied to the previous word.
 A Hidden Markov Model (HMM) allows us to talk about both observed events (like the words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model. An HMM is specified by the following components:

8. Formalizing Hidden Markov Model Taggers (Variable definitions)
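The components table on this slide is not reproduced; in the standard textbook formulation (following Jurafsky and Martin), an HMM for tagging is specified by:
- Q = q1, q2, …, qN : a set of N states (here, the POS tags);
- A = a11, a12, …, aNN : a transition probability matrix, where aij is the probability of moving from state i to state j, with Σj aij = 1 for each state i;
- O = o1, o2, …, oT : a sequence of T observations (here, the words of the input sentence);
- B = bi(ot) : a sequence of observation likelihoods (emission probabilities), each giving the probability of observation ot being generated from state i;
- q0, qF : special start and end states, not associated with observations.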

8. Formalizing Hidden Markov Model Taggers (Apply Chain rule)
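The derivation on this slide is not reproduced; combining Bayes' Rule with the two independence assumptions from Section 6, the HMM tagging equation comes out as:
Tbest = argmaxT P(T|W) = argmaxT P(W|T) * P(T) ≅ argmaxT Πi=1,n P(Wi|Ti) * P(Ti|Ti-1)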

8. Formalizing Hidden Markov Model Taggers (Example)
• Example of "Ali is Intelligent".

Figure: HMM state network for the sentence, with hidden states NNP, VB, and TO between the Start and End states, linked by transition probabilities a01, a02, a03, a11, a12, a13, a14, a31, a32, a33, a34 (not reproduced).
8. Formalizing Hidden Markov Model Taggers (Class Participation)

Apply the single chain rule of HMM taggers to the following sentences:
• Secretariat is expected to race tomorrow.
• is Secretariat expected to race tomorrow.
• expected Secretariat is to race tomorrow.
• to Secretariat is expected race tomorrow.
• race Secretariat is expected to tomorrow.
• tomorrow Secretariat is expected to race.

9. The Viterbi Algorithm for HMM Tagging
 For any model, such as an HMM, that contains hidden variables,
- the task of determining which sequence of variables is the underlying source of some sequence of observations is called the decoding task.

 The Viterbi algorithm is perhaps the most common decoding algorithm used for HMMs, whether for part-of-speech tagging or for speech recognition.
- It looks a lot like the minimum edit distance algorithm.

 The slightly simplified version of the Viterbi algorithm that we present takes as input a single HMM and
- a set of observed words O = (o1 o2 o3 . . . oT), and
- returns the most probable state/tag sequence Q = (q1 q2 q3 . . . qT), together with its probability.

 Let the HMM be defined by the two tables (next slide). The first expresses the aij probabilities,
- the transition probabilities between hidden states (i.e., part-of-speech tags).
9. The Viterbi Algorithm for HMM Tagging (Cont…)

Table: tag-to-tag combination matrix (transition probabilities; not reproduced).

Table: word-to-tag combination matrix (word likelihoods; not reproduced).

9. The Viterbi Algorithm for HMM Tagging (Cont…)
 The figure (not reproduced) expresses the bi(ot) probabilities, the observation likelihoods of words given tags.
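A compact sketch of the decoder just described (a generic bigram Viterbi; the toy trans/emit tables from the "He will race" example earlier can be passed straight in, and on real data the matrices from the two tables above would be used):

    def viterbi(words, tags, trans, emit, start='<s>'):
        """Return the most probable tag sequence for words, with its probability.

        trans[(t_prev, t)] = P(t|t_prev); emit[(word, t)] = P(word|t).
        """
        # initialization: start-transition probability times emission of the first word
        best = {t: trans.get((start, t), 0.0) * emit.get((words[0], t), 0.0) for t in tags}
        backptr = []
        for w in words[1:]:
            new_best, pointers = {}, {}
            for t in tags:
                # best previous tag on a path ending in tag t at this word
                prev = max(tags, key=lambda tp: best[tp] * trans.get((tp, t), 0.0))
                new_best[t] = best[prev] * trans.get((prev, t), 0.0) * emit.get((w, t), 0.0)
                pointers[t] = prev
            best = new_best
            backptr.append(pointers)
        # recover the path by following backpointers from the most probable final tag
        last = max(best, key=best.get)
        path = [last]
        for pointers in reversed(backptr):
            path.append(pointers[path[-1]])
        return path[::-1], best[last]

    tags = ['PRP', 'MD', 'NN', 'VB']
    trans = {('<s>', 'PRP'): 1.0, ('PRP', 'MD'): 0.8, ('PRP', 'NN'): 0.2,
             ('MD', 'NN'): 0.4, ('MD', 'VB'): 0.6, ('NN', 'NN'): 0.3, ('NN', 'VB'): 0.7}
    emit = {('He', 'PRP'): 0.3, ('will', 'MD'): 0.8, ('will', 'NN'): 0.2,
            ('race', 'NN'): 0.4, ('race', 'VB'): 0.6}
    print(viterbi(['He', 'will', 'race'], tags, trans, emit))
    # (['PRP', 'MD', 'VB'], 0.06912)

Unlike the brute-force enumeration sketched earlier, this runs in time linear in sentence length (O(T·N²) for N tags), which is what makes HMM decoding practical for long sentences.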

