Part of Speech Tagging
Example
Word     Tag
heat     verb (noun)
water    noun (verb)
in       prep (noun, adv)
a        det (noun)
large    adj (noun)
vessel   noun
Choosing a tagset
Need to choose a standard set of tags to do POS tagging
  One tag for each part of speech
Could pick a very coarse tagset
  N, V, Adj, Adv, Prep
More commonly used sets are finer-grained
  E.g., the UPenn TreeBank II tagset has 36 word tags
  PRP, PRP$, VBG, VBD, JJR, JJS, ...
  (it also has tags for phrases)
Even more finely grained tagsets exist
Stochastic Tagging
Based on the probability of a certain tag occurring, given various possibilities
Necessitates a training corpus
  A collection of sentences that have already been tagged
Several such corpora exist
  One of the best known is the Brown University Standard Corpus of Present-Day American English (or just the Brown Corpus)
  About 1,000,000 words from a wide variety of sources
  POS tags assigned to each word
Approach 1
Assign each word its most likely POS tag
If w has tags $t_1, \ldots, t_k$, then we can use
  $P(t_i \mid w) = \frac{c(w,t_i)}{c(w,t_1) + \cdots + c(w,t_k)}$,
  where $c(w,t_i)$ = number of times w/t_i appears in the corpus
Success: 91% for English
Example
  heat :: noun/89, verb/5
  so P(noun | heat) = 89/94 ≈ 0.95, and heat is always tagged noun
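Approach 1 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the toy corpus, and the fallback tag for unseen words are all assumptions made for the example. The toy corpus simply reproduces the heat counts (noun/89, verb/5).

```python
from collections import Counter, defaultdict


def count_word_tags(tagged_corpus):
    """Collect c(w, t): how often each word appears with each tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return counts


def tag_probability(counts, word, tag):
    """P(t | w) = c(w, t) / (c(w, t1) + ... + c(w, tk))."""
    word_counts = counts.get(word, Counter())
    total = sum(word_counts.values())
    return word_counts[tag] / total if total else 0.0


def most_likely_tag(counts, word, default="noun"):
    """Return the most frequent tag for the word.

    The default for unseen words is an assumption of this sketch.
    """
    if word not in counts:
        return default
    return counts[word].most_common(1)[0][0]


# Hypothetical toy corpus that reproduces the counts heat :: noun/89, verb/5.
toy_corpus = [[("heat", "noun")]] * 89 + [[("heat", "verb")]] * 5
word_tag_counts = count_word_tags(toy_corpus)

print(most_likely_tag(word_tag_counts, "heat"))                    # noun
print(round(tag_probability(word_tag_counts, "heat", "noun"), 4))  # 0.9468
```

Because the choice ignores context, this tagger labels heat as a noun everywhere, including in sentences where it is used as a verb.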
Approach 2
Given: a sequence of words W
  $W = w_1, w_2, \ldots, w_n$ (a sentence)
  e.g., W = heat water in a large vessel
Assign a sequence of tags T:
  $T = t_1, t_2, \ldots, t_n$
Find the T that maximizes $P(T \mid W)$
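One standard way to make this maximization tractable is sketched here. The slides do not spell the derivation out, so the steps rest on two assumptions: each tag depends only on the previous tag (a bigram model), and each word depends only on its own tag. The symbol $t_0$ is a hypothetical sentence-start marker introduced for the sketch.

```latex
\begin{align*}
\hat{T} &= \arg\max_{T} P(T \mid W) \\
        &= \arg\max_{T} \frac{P(W \mid T)\,P(T)}{P(W)}
           && \text{(Bayes' rule)} \\
        &= \arg\max_{T} P(W \mid T)\,P(T)
           && \text{($P(W)$ is the same for every $T$)} \\
        &\approx \arg\max_{T} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})
           && \text{(independence assumptions)}
\end{align*}
```

Under these assumptions only two kinds of probability are needed: the tag-to-tag probabilities $P(t_i \mid t_{i-1})$ and the word-given-tag probabilities $P(w_i \mid t_i)$.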
Let
  $c(t_i)$ = frequency of $t_i$ in the corpus
  $c(w_i,t_i)$ = frequency of $w_i/t_i$ in the corpus
  $c(t_{i-1},t_i)$ = frequency of $t_{i-1}$ immediately followed by $t_i$ in the corpus
Then we can use
  $P(t_i \mid t_{i-1}) = \frac{c(t_{i-1},t_i)}{c(t_{i-1})}$
  $P(w_i \mid t_i) = \frac{c(w_i,t_i)}{c(t_i)}$
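The counts and the two relative-frequency estimates can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names, the `<s>` sentence-start marker, and the three-sentence toy corpus are hypothetical and do not come from the Brown Corpus. No smoothing is applied, so unseen events get probability 0.

```python
from collections import Counter


def count_events(tagged_corpus, start="<s>"):
    """Collect c(t), c(w, t) and c(t_prev, t) from a tagged corpus.

    The start marker is an assumption of this sketch; it gives the first
    tag of each sentence a predecessor.
    """
    tag_counts = Counter()    # c(t_i)
    emit_counts = Counter()   # c(w_i, t_i)
    trans_counts = Counter()  # c(t_{i-1}, t_i)
    for sentence in tagged_corpus:
        prev = start
        tag_counts[start] += 1
        for word, tag in sentence:
            tag_counts[tag] += 1
            emit_counts[(word, tag)] += 1
            trans_counts[(prev, tag)] += 1
            prev = tag
    return tag_counts, emit_counts, trans_counts


def p_transition(trans_counts, tag_counts, prev, tag):
    """P(t_i | t_{i-1}) = c(t_{i-1}, t_i) / c(t_{i-1})."""
    denom = tag_counts[prev]
    return trans_counts[(prev, tag)] / denom if denom else 0.0


def p_emission(emit_counts, tag_counts, word, tag):
    """P(w_i | t_i) = c(w_i, t_i) / c(t_i)."""
    denom = tag_counts[tag]
    return emit_counts[(word, tag)] / denom if denom else 0.0


# Hypothetical three-sentence corpus, for illustration only.
toy_corpus = [
    [("to", "TO"), ("race", "VB")],
    [("the", "DT"), ("race", "NN")],
    [("to", "TO"), ("run", "VB")],
]
tag_counts, emit_counts, trans_counts = count_events(toy_corpus)

print(p_transition(trans_counts, tag_counts, "TO", "VB"))  # 1.0
print(p_emission(emit_counts, tag_counts, "race", "VB"))   # 0.5
```

A practical tagger would normally smooth these estimates so that tag pairs or word/tag pairs missing from the training corpus do not zero out an entire sequence.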
Example
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
  to/TO race/???
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
  the/DT race/???
For each word $w_i$, $t_i = \arg\max_t P(t \mid t_{i-1})\,P(w_i \mid t)$
For to/TO race/???, compare
  max( P(VB|TO) P(race|VB), P(NN|TO) P(race|NN) )
From the Brown corpus
  P(NN|TO) = .021    P(race|NN) = .00041
  P(VB|TO) = .34     P(race|VB) = .00003
So
  P(NN|TO) P(race|NN) = .021 × .00041 ≈ .0000086
  P(VB|TO) P(race|VB) = .34 × .00003 ≈ .00001
The VB product is larger, so race is tagged VB after to/TO
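The comparison for to/TO race/??? can be checked with a short script. The four probabilities are the Brown corpus values quoted above. The dictionary layout and the `best_tag` helper are hypothetical names chosen for this sketch. The the/DT race/??? case is not computed, because the slides give no probabilities for it.

```python
# Probabilities quoted in the example (Brown corpus).
transition = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}           # P(t | t_prev)
emission = {("race", "VB"): 0.00003, ("race", "NN"): 0.00041}    # P(w | t)


def best_tag(word, prev_tag, candidates):
    """Pick argmax over t of P(t | prev_tag) * P(word | t)."""
    scores = {
        tag: transition[(prev_tag, tag)] * emission[(word, tag)]
        for tag in candidates
    }
    return max(scores, key=scores.get), scores


tag, scores = best_tag("race", "TO", ["VB", "NN"])
print(tag)  # VB
for candidate, score in scores.items():
    print(candidate, f"{score:.2e}")
```

The two products are close (about 1.0e-05 for VB against 8.6e-06 for NN), so the decision is carried mostly by the much higher probability of a verb following TO.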