Lecture 9: Part of Speech
Kai-Wei Chang
CS @ University of Virginia
[email protected]
Course webpage: http://kwchang.net/teaching/NLP16
CS6501 Natural Language Processing 1
CuuDuongThanCong.com https://fb.com/tailieudientucntt
This lecture
v Parts of speech (POS)
v POS Tagsets
Parts of Speech
v Traditional parts of speech
v ~ 8 of them
POS examples
v N noun chair, bandwidth, pacing
v V verb study, debate, munch
v ADJ adjective purple, tall, ridiculous
v ADV adverb unfortunately, slowly
v P preposition of, by, to
v PRO pronoun I, me, mine
v DET determiner the, a, that, those
Parts of Speech
v A.k.a. parts-of-speech, lexical categories,
word classes, morphological classes,
lexical tags...
v Lots of debate within linguistics about the
number, nature, and universality of these
POS Tagging
v The process of assigning a part-of-speech to
each word in a collection (sentence).
WORD tag
the DET
koala N
put V
the DET
keys N
on P
the DET
table N
Why is POS Tagging Useful?
v First step of a vast number of practical tasks
v Parsing
v Need to know if a word is an N or V before you can parse
v Information extraction
v Finding names, relations, etc.
v Speech synthesis/recognition
v OBject (noun) vs. obJECT (verb)
v OVERflow (noun) vs. overFLOW (verb)
v DIScount (noun) vs. disCOUNT (verb)
v CONtent (noun) vs. conTENT (adjective)
v Machine Translation
Open and Closed Classes
v Closed class: a small fixed membership
v Prepositions: of, in, by, …
v Pronouns: I, you, she, mine, his, them, …
v Usually function words (short common words which
play a role in grammar)
v Open class: new ones can be created
v English has 4: Nouns, Verbs, Adjectives, Adverbs
v Many languages have these 4, but not all!
Open Class Words
v Nouns
v Proper nouns (Boulder, Granby, Eli Manning)
v Common nouns (the rest).
v Count nouns and mass nouns
v Count: have plurals, get counted: goat/goats, one
goat, two goats
v Mass: don’t get counted (snow, salt, communism)
(*two snows)
v Verbs
v In English, have morphological affixes (eat/eats/eaten)
Closed Class Words
Examples:
v prepositions: on, under, over, …
v particles: up, down, on, off, …
v determiners: a, an, the, …
v pronouns: she, who, I, …
v conjunctions: and, but, or, …
v auxiliary verbs: can, may, should, …
v numerals: one, two, three, third, …
Prepositions from CELEX
CELEX: online dictionary
Frequency counts are from the COBUILD 16-million-word corpus
English Particles
Conjunctions
Choosing a Tagset
v Could pick very coarse tagsets
v N, V, Adj, Adv, Other
v More commonly used set is finer grained
v E.g., “Penn TreeBank tagset”, 45 tags: PRP$, WRB,
WP$, VBG
v Brown corpus, 87 tags.
v Prague Dependency Treebank (Czech)
v 4452 tags
v AAFP3----3N----: (nejnezajímavějším)
Adj Regular Feminine Plural….Superlative [Hajic 2006, VMC tutorial]
Penn TreeBank POS Tagset
Using the Penn Tagset
v The/DT grand/JJ jury/NN
commented/VBD on/IN a/DT number/NN
of/IN other/JJ topics/NNS ./.
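The word/TAG notation above is easy to read programmatically. A minimal sketch, assuming whitespace-separated tokens and that the tag follows the last slash; the function name parse_tagged is made up for this example:

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains '/' keeps everything before the final slash.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sentence = ("The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN "
            "of/IN other/JJ topics/NNS ./.")
pairs = parse_tagged(sentence)
words = [w for w, _ in pairs]
tags = [t for _, t in pairs]
print(list(zip(words, tags)))
```

Splitting on the last slash also handles the final token "./." correctly: the word is "." and the tag is ".".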
Universal Tag set
v ~ 12 different tags
v NOUN, VERB, ADJ, ADV, PRON, DET, ADP,
NUM, CONJ, PRT, “.”, X
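A fine-grained tagset can be collapsed into the universal one with a lookup table. A minimal sketch covering only a handful of Penn tags; the table is deliberately partial, and sending unlisted tags to X is an assumption of this example, not part of any official mapping:

```python
# Partial Penn Treebank -> universal tag table (illustrative subset only).
PENN_TO_UNIVERSAL = {
    "DT": "DET",
    "JJ": "ADJ",
    "NN": "NOUN",
    "NNS": "NOUN",
    "VB": "VERB",
    "VBD": "VERB",
    "IN": "ADP",
    "RB": "ADV",
    "PRP": "PRON",
    "CD": "NUM",
    "CC": "CONJ",
    "RP": "PRT",
    ".": ".",
}


def to_universal(penn_tag):
    """Map a Penn tag to a universal tag; unlisted tags fall back to X
    (an assumption made for this sketch)."""
    return PENN_TO_UNIVERSAL.get(penn_tag, "X")


penn_tags = ["DT", "JJ", "NN", "VBD", "IN", "DT", "NN", "IN", "JJ", "NNS", "."]
universal_tags = [to_universal(t) for t in penn_tags]
print(universal_tags)
```

Several Penn tags collapse into one universal tag (NN and NNS both become NOUN), so the mapping only works in one direction.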
POS Tagging vs. Word clustering
v Words often have more than one POS:
back
v The back door = JJ
v On my back = NN
v Win the voters back = RB
v Promised to back the bill = VB
These examples from Dekang Lin
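The ambiguity of "back" can be made concrete by collecting every tag each word type receives. A minimal sketch: the tags for "back" come from the slide, while the tags on the other words are assumed here for illustration:

```python
from collections import defaultdict

# The four "back" phrases; only the tags on "back" are from the slide,
# the remaining tags are assumptions made for this example.
tagged_phrases = [
    [("The", "DT"), ("back", "JJ"), ("door", "NN")],
    [("On", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("Win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB")],
    [("Promised", "VBD"), ("to", "TO"), ("back", "VB"),
     ("the", "DT"), ("bill", "NN")],
]

# Map each lowercased word type to the set of tags it was seen with.
tags_by_word = defaultdict(set)
for phrase in tagged_phrases:
    for word, tag in phrase:
        tags_by_word[word.lower()].add(tag)

# Word types observed with more than one tag.
ambiguous = sorted(w for w, tags in tags_by_word.items() if len(tags) > 1)
print(ambiguous, sorted(tags_by_word["back"]))
```

A tagger has to pick among these candidate tags from context, which is what separates tagging from assigning each word type to a single cluster.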
How Hard is POS Tagging?
POS tag sequences
v Some tag sequences are more likely to occur
than others
v POS Ngram view
https://books.google.com/ngrams/graph?content=_ADJ_+_NOUN_%2C_ADV_+_NOUN_%2C+_ADV_+_VERB_
Existing methods often model POS tagging as a
sequence tagging problem
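The idea that some tag sequences are more likely than others can be checked by counting tag bigrams. A minimal sketch on the tag sequence of "the koala put the keys on the table" from the earlier slide; the helper transition_prob is a made-up name, and one sentence is far too little data for real estimates:

```python
from collections import Counter

# Tags of "the koala put the keys on the table" from the POS Tagging slide.
tags = ["DET", "N", "V", "DET", "N", "P", "DET", "N"]

# Count adjacent tag pairs (tag bigrams).
bigrams = Counter(zip(tags, tags[1:]))


def transition_prob(prev_tag, next_tag):
    """Relative frequency of next_tag among the tags that follow prev_tag."""
    total = sum(c for (p, _), c in bigrams.items() if p == prev_tag)
    return bigrams[(prev_tag, next_tag)] / total if total else 0.0


print(bigrams.most_common())
print(transition_prob("DET", "N"), transition_prob("N", "DET"))
```

In this sentence DET is always followed by N and never the other way around, which is the kind of regularity a sequence tagging model exploits.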
Evaluation
v How many words in the unseen test data
can be tagged correctly?
v Usually evaluated on Penn Treebank
v State of the art ~97%
v Trivial baseline (most likely tag) ~94%
v Human performance ~97%
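The most-likely-tag baseline and the accuracy measure can be sketched in a few lines. This is a toy illustration, assuming a tiny hand-made training set that reuses phrases from earlier slides (the phrase "on the back" and the test sentence are invented for this example); its accuracy says nothing about the Penn Treebank figures above:

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag, plus the overall most frequent tag."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]  # used for unseen words
    return lexicon, default_tag


def tag_words(words, lexicon, default_tag):
    """Tag each word independently, ignoring context."""
    return [lexicon.get(w.lower(), default_tag) for w in words]


def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


train = [
    [("the", "DET"), ("koala", "N"), ("put", "V"), ("the", "DET"),
     ("keys", "N"), ("on", "P"), ("the", "DET"), ("table", "N")],
    [("on", "P"), ("my", "PRO"), ("back", "N")],
    [("the", "DET"), ("back", "ADJ"), ("door", "N")],
    [("on", "P"), ("the", "DET"), ("back", "N")],
]
test = [("put", "V"), ("the", "DET"), ("keys", "N"), ("on", "P"),
        ("the", "DET"), ("back", "ADJ"), ("table", "N")]

lexicon, default_tag = train_baseline(train)
predicted = tag_words([w for w, _ in test], lexicon, default_tag)
acc = accuracy([t for _, t in test], predicted)
print(predicted, acc)
```

The baseline tags "back" as N everywhere because that is its most frequent tag in the training data, so it misses the adjective use in the test sentence. Errors of this kind are what context-aware sequence models aim to fix.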