Natural Language Processing Slides
Kemal Oflazer
∗
Content mostly based on previous offerings of 11-411 by LTI Faculty at CMU-Pittsburgh.
1/31
What is NLP?
2/31
Why NLP?
I Answer questions using the Web
I Translate documents from one language to another
I Do library research; summarize
I Manage messages intelligently
I Help make informed decisions
I Follow directions given by any user
I Fix your spelling or grammar
I Grade exams
I Write poems or novels
I Listen and give advice
I Estimate public opinion
I Read everything and make predictions
I Interactively help people learn
I Help disabled people
I Help refugees/disaster victims
I Document or reinvigorate indigenous languages
3/31
What is NLP? More Detailed Answer
4/31
Levels of Linguistic Representation
5/31
Why it’s Hard
6/31
Complexity of Linguistic Representations
7/31
Complexity of Linguistic Representations
I Richness: there are many ways to express the same meaning, and immeasurably
many meanings to express. Lots of words/phrases.
I Each level interacts with the others.
I There is tremendous diversity in human languages.
I Languages express the same kind of meaning in different ways
I Some languages express some meanings more readily/often.
I We will study models.
8/31
What is a Model?
9/31
Using NLP Models and Tools
I This course is meant to introduce some formal tools that will help you navigate the
field of NLP.
I We focus on formalisms and algorithms.
I This is not a comprehensive overview; it’s a deep introduction to some key topics.
I We’ll focus mainly on analysis and mainly on English text (but will provide examples from
other languages whenever meaningful)
I The skills you develop will apply to any subfield of NLP
10/31
Applications / Challenges
11/31
Expectations from NLP Systems
12/31
Key Applications (2017)
I Computational linguistics (i.e., modeling the human capacity for language
computationally)
I Information extraction, especially “open” IE
I Question answering (e.g., Watson)
I Conversational Agents (e.g., Siri, OK Google)
I Machine translation
I Machine reading
I Summarization
I Opinion and sentiment analysis
I Social media analysis
I Fake news detection
I Essay evaluation
I Mining legal, medical, or scholarly literature
13/31
NLP vs Computational Linguistics
14/31
Let’s Look at Some of the Levels
15/31
Morphology
16/31
Morphology
17/31
Let’s Look at Some of the Levels
18/31
Lexical Processing
I Segmentation
I Normalize and disambiguate words
I Words with multiple meanings: bank, mean
I Extra challenge: domain-specific meanings (e.g., latex)
I Process multi-word expressions
I make . . . decision, take out, make up, kick the . . . bucket
I Part-of-speech tagging
I Assign a syntactic class to each word (verb, noun, adjective, etc.)
I Supersense tagging
I Assign a coarse semantic category to each content word (motion event, instrument,
foodstuff, etc.)
I Syntactic “supertagging”
I Assign a possible syntactic neighborhood tag to each word (e.g., subject of a verb)
19/31
Let’s Look at Some of the Levels
20/31
Syntax
21/31
Some of the Possible Syntactic Analyses
22/31
Morphology–Syntax
23/31
Let’s Look at Some of the Levels
24/31
Semantics
⇒ ∃d1, d2, d3 : doctor(d1) & doctor(d2) & doctor(d3) & (∀p : patient(p) ⇒ saw(d1, p) & saw(d2, p) & saw(d3, p))
I (TR) Her hastaya üç doktor baktı “Every patient three doctors saw”
25/31
Syntax–Semantics
26/31
Let’s Look at Some of the Levels
27/31
Pragmatics/Discourse
I Pragmatics
I Any non-local meaning phenomena
I “Can you pass the salt?”
I “Is he 21?” “Yes, he’s 25.”
I Discourse
I Structures and effects in related sequences of sentences
I “I said the black shoes.”
I “Oh, black.” (Is that a sentence?)
28/31
Course Logistics/Administrivia
29/31
Your Grade
30/31
Policies
31/31
11-411
Natural Language Processing
Applications of NLP
Kemal Oflazer
∗
Content mostly based on previous offerings of 11-411 by LTI Faculty at CMU-Pittsburgh.
1/19
Information Extraction – Bird’s Eye View
2/19
Named-Entity Recognition
I Input: text
I Output: text annotated with named-entities
3/19
Reference Resolution
4/19
Coreference Resolution
5/19
Relation Extraction
6/19
Encoding for Named-Entity Recognition
7/19
Encoding for Named-Entity Recognition
With/O that/O ,/O Edwards/B-PER '/O campaign/O will/O end/O the/O way/O
(the remaining tokens of the sentence are all tagged O)
8/19
NER as a Sequence Modeling Problem
9/19
Evaluation of NER Performance
I Recall: What percentage of the actual named-entities did you correctly label?
I Precision: What percentage of the named-entities you labeled were actually correct?
(Venn diagram: the set of correct NEs (C), the set of hypothesized NEs (H), and their overlap C ∩ H.)
Recall = |C ∩ H| / |C|    Precision = |C ∩ H| / |H|    (a small sketch follows below)
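Not from the slides: a minimal Python sketch of this evaluation, treating each named-entity as a (start, end, type) span and comparing a hypothesized set H against the correct set C. The example spans are made up.

def ner_prf(correct, hypothesized):
    """Precision/recall/F1 over sets of entity spans, e.g. (start, end, "PER")."""
    C, H = set(correct), set(hypothesized)
    overlap = len(C & H)                       # |C ∩ H|
    precision = overlap / len(H) if H else 0.0
    recall = overlap / len(C) if C else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: gold has two entities, the system finds one of them plus a spurious one.
gold = {(3, 3, "PER"), (9, 10, "ORG")}
pred = {(3, 3, "PER"), (5, 6, "LOC")}
print(ner_prf(gold, pred))   # (0.5, 0.5, 0.5)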
11/19
Relation Extraction
12/19
Seeding Tuples
13/19
Bootstrapping Relations
14/19
Information Retrieval – the Vector Space Model
I Each document Di is represented by a |V|-dimensional vector d⃗i (V is the vocabulary of words/tokens.)
I cosine similarity(d⃗i, q⃗) = (d⃗i · q⃗) / (‖d⃗i‖ × ‖q⃗‖)
I Twists: tf–idf (term frequency – inverse document frequency)
  x[j] = count(ωj) × log( # docs / # docs with ωj )
I Recall, Precision, Ranking
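A toy Python sketch (not from the slides) of the vector space model just described: documents become tf–idf vectors over the vocabulary, and documents are ranked by cosine similarity to the query. The three documents are invented.

import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats make good pets"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)
df = Counter(w for doc in tokenized for w in set(doc))       # document frequency
idf = {w: math.log(n_docs / df[w]) for w in vocab}           # log(# docs / # docs with w)

def tfidf_vector(tokens):
    counts = Counter(tokens)
    return [counts[w] * idf[w] for w in vocab]               # x[j] = count(w_j) * idf(w_j)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = tfidf_vector("cat and dog".split())
for text, doc in zip(docs, tokenized):
    print(round(cosine(tfidf_vector(doc), q), 3), text)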
15/19
Information Retrieval – Evaluation
I Recall?
  Recall = Number of Relevant Documents Retrieved / Number of Actual Relevant Documents in the Database
I Precision?
  Precision = Number of Relevant Documents Retrieved / Number of Documents Retrieved
16/19
Question Answering
17/19
Question Answering Evaluation
18/19
Some General Tools
I Supervised classification
I Feature vector representations
I Bootstrapping
I Evaluation:
I Precision and recall (and their curves)
I Mean reciprocal rank
19/19
8/28/17
11-411
Natural Language Processing
Words and
Computational Morphology
Kemal Oflazer
Carnegie Mellon University - Qatar
1
8/28/17
Morphology
n Languages differ widely in
¨ What information they encode in their words
¨ How they encode the information.
n I am swim-m+ing.
¨ (Presumably) we know what swim “means”
¨ The +ing portion tells us that this event is
taking place at the time the utterance is taking
place.
¨ What’s the deal with the extra m?
2
8/28/17
3
8/28/17
Dancing in Andalusia
n A poem by the early 20th century Turkish
poet Yahya Kemal Beyatlı.
ENDÜLÜSTE RAKS
Zil, şal ve gül, bu bahçede raksın bütün hızı
Şevk akşamında Endülüs, üç defa kırmızı
Aşkın sihirli şarkısı, yüzlerce dildedir
İspanya neşesiyle bu akşam bu zildedir
4
8/28/17
BAILE EN ANDALUCIA
Castañuela, mantilla y rosa. El baile veloz llena el jardín...
En esta noche de jarana, Andalucíá se ve tres veces carmesí...
Cientas de bocas recitan la canción mágica del amor.
La alegría española esta noche, está en las castañuelas.
DANCE IN ANDALUSIA
Castanets, shawl and rose. Here's the fervour of dance,
Andalusia is threefold red in this evening of trance.
Hundreds of tongues utter love's magic refrain,
In these castanets to-night survives the gay Spain,
castanets: Plural noun

Animated turns like a fan's fast flutterings,
Fascinating bendings, coverings, uncoverings.
We want to see no other color than carnation,
Spain does subsist in this shawl in undulation.

bewitching: gerund form of the verb bewitch
9
8/28/17
Spanischer Tanz
Zimbel, Schal und Rose- Tanz in diesem Garten loht.
In der Nacht der Lust ist Andalusien dreifach rot!
Und in tausend Zungen Liebeszauberlied erwacht-
Spaniens Frohsinn lebt in diesen Zimbeln heute Nacht!
Zimbeln: plural of the feminine noun "Zimbel"

Auf die Stirn die Ringellocken fallen lose ihr,
Auf der Brust erblüht Granadas schönste Rose ihr,
Goldpokal in jeder Hand, im Herzen Sonne lacht
Spanien lebt und webt in dieser Rose heute Nacht!

herzerfüllend: noun-verb/participle compound, "that which fulfills the heart"(?)
Aligned Verses
Zil, şal ve gül, bu bahçede raksın bütün hızı
Şevk akşamında Endülüs, üç defa kırmızı
Aşkın sihirli şarkısı, yüzlerce dildedir
İspanya neş'esiyle bu akşam bu zildedir
12
8/28/17
Computational Morphology
n Computational morphology deals with
¨ developing theories and techniques for
¨ computational analysis and synthesis of word
forms.
26
13
8/28/17
n books ⇒ book+Noun+Plural
        ⇒ book+Verb+Pres+3SG
n stopping ⇒ stop+Verb+Cont
n happiest ⇒ happy+Adj+Superlative
n went ⇒ go+Verb+Past
27
n stop+Past ⇒ stopped
n (T) dur+Past+1Pl ⇒ durduk
      dur+Past+2Pl ⇒ durdunuz
28
14
8/28/17
Computational Morphology-Analysis
n Input raw text; Segment / Tokenize                      } Pre-processing
n Analyze individual words
n Analyze multi-word constructs                           } Morphological Processing
n Disambiguate morphology
n Syntactically analyze sentences                         } Syntactic Processing
n ...
30
15
8/28/17
Text-to-speech
n I read the book.
¨ Can’t really decide what the pronunciation is
n Yesterday, I read the book.
¨ read must be a past tense verb.
n He read the book
n read must be a past tense verb.
¨ (T) oKU+ma (don’t read)
oku+MA (reading)
ok+uM+A (to my arrow)
32
16
8/28/17
Morphology
n Morphology is the study of the structure of
words.
¨ Words are formed by combining smaller units
of linguistic information called morphemes, the
building blocks of words.
¨ Morphemes in turn consist of phonemes and,
in abstract analyses, morphophonemes.
Often, we will deal with orthographical
symbols.
33
Morphemes
n Morphemes can be classified into two
groups:
¨ Free Morphemes: morphemes which can occur as words by themselves.
n e.g., go, book
¨ Bound Morphemes: morphemes which cannot occur by themselves and must attach to other morphemes.
n e.g., +ing, +est, un+
34
17
8/28/17
Dimensions of Morphology
n “Complexity” of Words
¨ How many morphemes?
n Morphological Processes
¨ What functions do morphemes perform?
n Morpheme combination
¨ How do we put the morphemes together to
form words?
35
¨ Inflectional Languages
¨ Agglutinative Languages
¨ Polysynthetic Languages
36
18
8/28/17
Isolating languages
n Isolating languages do not (usually) have any
bound morphemes
¨ Mandarin Chinese
¨ Gou bu ai chi qingcai (dog not like eat vegetable)
¨ This can mean one of the following (depending on the
context)
n The dog doesn’t like to eat vegetables
n The dog didn’t like to eat vegetables
n The dogs don’t like to eat vegetables
n The dogs didn’t like to eat vegetables.
n Dogs don’t like to eat vegetables.
37
Inflectional Languages
n A single bound morpheme conveys
multiple pieces of linguistic information
n (R) most+u: Noun, Sing, Dative
pros+u: Verb, Present, 1sg
38
19
8/28/17
Agglutinative Languages
n (Usually multiple) Bound morphemes are
attached to one (or more) free morphemes,
like beads on a string.
¨ Turkish/Turkic, Finnish, Hungarian
¨ Swahili, Aymara
n Each morpheme encodes one "piece" of
linguistic information.
¨ (T) gid+iyor+du+m: continuous, Past, 1sg (I
was going)
39
Agglutinative Languages
n Turkish
n Finlandiyalılaştıramadıklarımızdanmışsınızcasına
n (behaving) as if you have been one of those whom we could not
convert into a Finn(ish citizen)/someone from Finland
n Finlandiya+lı+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
¨ Finlandiya+Noun+Prop+A3sg+Pnon+Nom
n ^DB+Adj+With/From
n ^DB+Verb+Become
n ^DB+Verb+Caus
n ^DB+Verb+Able+Neg
n ^DB+Noun+PastPart+A3pl+P1pl+Abl
n ^DB+Verb+Zero+Narr+A2pl
n ^DB+Adverb+AsIf
40
20
8/28/17
Agglutinative Languages
n Aymara
¨ ch’uñüwinkaskirïyätwa
¨ ch’uñu +: +wi +na -ka +si -ka -iri +: +ya:t(a) +wa
n I was (one who was) always at the place for making ch’uñu ’
ch’uñu N ‘freeze-dried potatoes’
+: N>V be/make …
+wi V>N place-of
+na in (location)
-ka N>V be-in (location)
+si continuative
-ka imperfect
-iri V>N one who
+: N>V be
+ya:ta 1P recent past
+wa affirmative sentencial
41
Example Courtesy of Ken Beesley
Agglutinative Languages
n Finnish Numerals
¨ Finnish numerals are written as one word and
all components inflect and agree in all aspects
¨ Kahdensienkymmenensienkahdeksansien
42
Example Courtesy of Lauri Karttunen
21
8/28/17
Agglutinative Languages
n Hungarian
¨ szobáikban = szoba[N/room] + ik[PersPl-3-
PossPl] + ban[InessiveCase]
n In their rooms
¨ faházaikból = fa[N/wooden] + ház[N/house] +
aik[PersPl3-PossPl] +ból[ElativeCase]
n From their wooden houses
¨ olvashattam = olvas[V/read] +
hat[DeriV/is_able] + tam[Sg1-Past]
n I was able to read
Agglutinative Languages
n Swahili
¨ walichotusomea = wa[Subject
Pref]+li[Past]+cho[Rel Prefix]+tu[Obj Prefix
1PL]+som[read/Verb]+e[Prep Form]+a[]
n that (thing) which they read for us
¨ tulifika=tu[we]+li[Past]+fik[arrive/Verb]+a[]
n We arrived
¨ ninafika=ni[I]+na[Present]+fik[arrive/Verb]+a[]
n I am arriving
44
22
8/28/17
Polysynthetic Languages
n Use morphology to combine syntactically
related components (e.g. verbs and their
arguments) of a sentence together
¨ Certain Eskimo languages, e.g., Inuktikut
45
Polysynthetic Languages
n Use morphology to combine syntactically
related components (e.g. verbs and their
arguments) of a sentence together
• Parismunngaujumaniralauqsimanngittunga
Paris+mut+nngau+juma+niraq+lauq+si+ma+nn
git+jun
46
Example Courtesy of Ken Beesley
23
8/28/17
Arabic
n Arabic seems to have aspects of
¨ Inflecting languages
n wktbt (wa+katab+at “and she wrote …”)
¨ Agglutinative languages
n wsyktbunha (wa+sa+ya+ktub+ūn+ha “and will
(masc) they write her)
¨ Polysynthetic languages
47
Morphological Processes
n There are essentially 3 types of
morphological processes which determine
the functions of morphemes:
¨ Inflectional Morphology
¨ Derivational Morphology
¨ Compounding
48
24
8/28/17
Inflectional Morphology
n Inflectional morphology introduces relevant
information to a word so that it can be used
in the syntactic context properly.
¨ That is, it is often required in particular
syntactic contexts.
n Inflectional morphology does not change
the part-of-speech of a word.
n If a language marks a certain piece of
inflectional information, then it must mark
that on all appropriate words.
49
Inflectional Morphology
n Subject-verb agreement, tense, aspect
25
8/28/17
Inflectional Morphology
n Number, case, possession, gender, noun-
class for nouns
¨ (T)ev+ler+in+den (from your houses)
¨ Bantu marks noun class by a prefix.
n Humans: m+tu (person) wa+tu (persons)
n Thin-objects: m+ti (tree) mi+ti (trees)
51
Inflectional Morphology
n Gender and/or case marking may also
appear on adjectives in agreement with the
nouns they modify
(G) ein neuer Wagen
eine schöne Stadt
ein altes Auto
52
26
8/28/17
Inflectional Morphology
n Case/Gender agreement for determiners
53
Inflectional Morphology
n (A) Perfect verb subject conjugation (masc form
only)
Singular Dual Plural
katabtu katabnā
katabta katabtumā katabtum
kataba katabā katabtū
27
8/28/17
Derivational Morphology
n Derivational morphology produces a new
word with usually a different part-of-speech
category.
¨ e.g., make a verb from a noun.
n The new word is said to be derived from
the old word.
55
Derivational Morphology
¨ happy (Adj) ⇒ happi+ness (Noun)
56
28
8/28/17
Derivational Morphology
n Productive vs. unproductive derivational
morphology
Compounding
n Compounding is concatenation of two or
more free morphemes (usually nouns) to
form a new word (though the boundary between normal
words and compounds is not very clear in some languages)
¨ firefighter / fire-fighter
¨ (G)
Lebensversicherungsgesellschaftsangestellter
(life insurance company employee)
¨ (T) acemborusu ((lit.) Persian pipe – neither
Persian nor pipe, but a flower)
58
29
8/28/17
Combining Morphemes
n Morphemes can be combined in a variety of ways
to make up words:
¨ Concatenative
¨ Infixation
¨ Circumfixation
¨ Templatic Combination
¨ Reduplication
59
Concatenative Combination
n Bound morphemes are attached before or
after the free morpheme (or any other
intervening morphemes).
¨ Prefixation:
bound morphemes go before the
free morpheme
n un+happy
¨ Suffixation: bound morphemes go after the free
morpheme
n happi+ness
¨ Need to be careful about the order: [un+happi]+ness (not un+[happi+ness])
n el+ler+im+de+ki+ler
60
30
8/28/17
Concatenative Combination
n Such concatenation can trigger spelling
(orthographical) and/or phonological
changes at the concatenation boundary (or
even beyond)
¨ happi+ness
¨ (T) şarap (wine) ⇒ şarab+ı
¨ (T) burun (nose) ⇒ burn+a
¨ (G) der Mann (man) ⇒ die Männ+er (men)
61
Infixation
n The bound morpheme is inserted into free
morpheme stem.
62
31
8/28/17
Circumfixation
n Part of the morpheme goes before the
stem, part goes after the stem.
63
Templatic Combination
n The root is modulated with a template to generate a stem to which other morphemes can be
added by concatenation etc.
n Semitic Languages (e.g., Arabic)
¨ root ktb (the general concept of writing)
¨ template CVCCVC
¨ vocalism (a,a)
   root:      k   t t   b
   template:  C V C C V C
   stem:      k a t t a b
64
32
8/28/17
Templatic Combination
n More examples of templatic combination
65
Reduplication
n Some or all of a word is duplicated to mark a
morphological process
¨ Indonesian
n orang (man) ⇒ orangorang (men)
¨ Bambara
n wulu (dog) ⇒ wuluowulu (whichever dog)
¨ Turkish
n mavi (blue) ⇒ masmavi (very blue)
n kırmızı (red) ⇒ kıpkırmızı (very red)
66
33
8/28/17
Zero Morphology
n Derivation/inflection takes place without any
additional morpheme
¨ English
n second (ordinal) ⇒ (to) second (a motion)
n man (noun) ⇒ (to) man (a place)
67
Subtractive morphology
n Part of the stem is removed to mark a
morphological feature
68
34
8/28/17
69
Computational Morphology
All Possible
n Morphological analysis
Analyses
Sequence of characters
Word
70
35
8/28/17
Computational Morphology
stop+Verb+PresCont
n Morphological analysis
stopping
71
Computational Morphology
n Ideally we would like to be able to use the
same system “in reverse” to generate
words from a given sequence or
morphemes
¨ Take “analyses” as input
¨ Produce words.
72
36
8/28/17
Computational Morphology
n Morphological generation Analysis
Morphological
Generator
Word(s)
73
Computational Morphology
n Morphological generation stop+Verb+PresCont
Morphological
Generator
stopping
74
37
8/28/17
Computational Morphology
n What is in the box? Analyses
Morphological
Analyzer/
Generator
Word(s)
75
Computational Morphology
n What is in the box? Analyses
n Data
¨ Language Specific
n Engine
¨ Language Independent Data Engine
Word(s)
76
38
8/28/17
77
Some Terminology
n Lexicon is a structured collection of all the
morphemes
¨ Rootwords (free morphemes)
¨ Morphemes (bound morphemes)
n Morphotactics is a model of how and in
what order the morphemes combine.
n Morphographemics is a model of what/how
changes occur when morphemes are
combined.
78
39
8/28/17
80
40
8/28/17
82
41
8/28/17
Representation
n Lexical form: An underlying representation of
morphemes w/o any morphographemic changes
applied.
¨ easy+est
¨ shop+ed
¨ blemish+es
¨ vase+es
n Surface Form: The actual written form
¨ easiest
¨ shopped
¨ blemishes
¨ vases
83
Representation
n Lexical form: An underlying representation of
morphemes w/o any morphographemic changes
applied.
¨ ev+lAr A={a,e} Abstract meta-phonemes
¨ oda+lAr
¨ tarak+sH H={ı, i, u, ü}
¨ kese+sH
84
42
8/28/17
85
Morphological Ambiguity
n Morphological structure/interpretation is
usually ambiguous
¨ Part-of-speech ambiguity
n book (verb), book (noun)
¨ Morpheme ambiguity
n +s (plural) +s (present tense, 3rd singular)
n (T) +mA (infinitive), +mA (negation)
¨ Segmentation ambiguity
n Word can be legitimately divided into morphemes in
a number of ways
86
43
8/28/17
Morphological Ambiguity
n The same surface form is interpreted in
many possible ways in different syntactic
contexts.
(F) danse
danse+Verb+Subj+3sg (lest s/he dance)
danse+Verb+Subj+1sg (lest I dance)
danse+Verb+Imp+2sg ((you) dance!)
danse+Verb+Ind+3sg ((s/he) dances)
danse+Verb+Ind+1sg ((I) dance)
danse+Noun+Fem+Sg (dance)
(E) read
read+Verb+Pres+N3sg (VBP-I/you/we/they read)
read+Verb+Past (VBD - read past tense)
read+Verb+Participle+(VBN – participle form)
read+Verb (VB - infinitive form)
read+Noun+Sg (NN – singular noun)
87
Morphological Ambiguity
n The same morpheme can be interpreted
differently depending on its position in the
morpheme order:
88
44
8/28/17
Morphological Ambiguity
n The word can be segmented in different
ways leading to different interpretations,
e.g. (T) koyun:
¨ koyun+Noun+Sg+Pnon+Nom (koyun-sheep)
¨ koy+Noun+Sg+P2sg+Nom (koy+un-your bay)
¨ koy+Noun+Sg+Pnon+Gen (koy+[n]un – of the bay)
¨ koyu+Adj+^DB+Noun+Sg+P2sg+Nom
(koyu+[u]n – your dark (thing)
¨ koy+Verb+Imp+2sg (koy+un – put (it) down)
89
Morphological Ambiguity
n The word can be segmented in different
ways leading to different interpretations,
e.g.
(Sw) frukosten:
frukost + en ‘the breakfast’
frukost+en ‘breakfast juniper’
fru+kost+en ‘wife nutrition juniper’
fru+kost+en ‘the wife nutrition’
fru+ko+sten ‘wife cow stone’
(H) ebth:
e+bth ‘that field’
e+b+th ‘that in tea(?)’
ebt+h ‘her sitting’
e+bt+h ‘that her daughter’
90
45
8/28/17
Morphological Ambiguity
n Orthography could be ambiguous or
underspecified.
16 possible interpretations
91
Morphological Disambiguation
n Morphological Disambiguation or Tagging
is the process of choosing the "proper"
morphological interpretation of a token in a
given context.
92
46
8/28/17
Morphological Disambiguation
n He can can the can.
¨ Modal
¨ Infinitive
form
¨ Singular Noun
¨ Non-third person present tense verb
n We can tomatoes every summer.
93
Morphological Disambiguation
n These days standard statistical approaches
(e.g., Hidden Markov Models) can solve
this problem with quite high accuracy.
n The accuracy for languages with complex
morphology/ large number of tags is lower.
94
47
8/28/17
n Heuristic/Rule-based affix-stripping
95
48
8/28/17
Heuristic/Rule-based Affix-stripping
n Uses ad-hoc language-specific rules
¨ to split words into morphemes
¨ to “undo” morphographemic changes
¨ scarcity
n -ity looks like a noun-making suffix, let's strip it
n scarc is not a known root, so let's add e and see if we get an adjective
98
49
8/28/17
Heuristic/Rule-based Affix-stripping
n Uses ad-hoc language-specific rules
¨ to split words into morphemes
¨ to “undo” morphographemic changes
99
100
50
8/28/17
OVERVIEW
n Overview of Morphology
n Computational Morphology
n Overview of Finite State Machines
n Finite State Morphology
¨ Two-level
Morphology
¨ Cascade Rules
101
51
8/28/17
103
104
52
8/28/17
105
(Figure: the alphabet A = {a, b} is finite; the set A* of all strings over A is infinite; a language
L ⊆ A* can be finite or infinite.)
53
8/28/17
107
108
54
8/28/17
Languages
n Languages are sets. So we can do “set”
things with them
¨ Union
¨ Intersection
¨ Complement with respect to the universe set
A*.
109
110
55
8/28/17
Recognition Problem
n Given a language L and a string w
¨ Is w in L?
111
Classes of Languages
(Figure: a class of languages is a set of languages L1, L2, L3, . . . ⊆ A*; A is finite, A* is infinite.)
112
56
8/28/17
113
Regular Languages
n Regular languages are those that can be
recognized by a finite state recognizer.
114
57
8/28/17
Regular Languages
n Regular languages are those that can be
recognized by a finite state recognizer.
115
58
8/28/17
(Figure: a finite-state recognizer over the input symbols {a, b} with states q0 (the start state)
and q1; each state has a b self-loop, and a-arcs go back and forth between q0 and q1.)
59
8/28/17
(Animation over several slides: the recognizer reads the input a b a b one symbol at a time,
alternating between q0 and q1, and ends in q0.)
The state q0 remembers the fact that we have seen an even number of a’s
The state q1 remembers the fact that we have seen an odd number of a’s
124
62
8/28/17
125
126
63
8/28/17
127
128
64
8/28/17
129
130
65
8/28/17
A = {a, b}
Q = {q0, q1}
Next = {((q0, b), q0),
        ((q0, a), q1),   ← if the machine is in state q0 and the input is a, then the next state is q1
        ((q1, b), q1),
        ((q1, a), q0)}
Final = {q0}
131
n M accepts w ∈ A*, if
¨ starting in state q0, M proceeds by looking at each symbol in w, and
¨ ends up in one of the final states when the string w is exhausted.
(A small sketch of this recognizer in code follows below.)
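A minimal Python sketch of the recognizer defined above (states q0, q1; transition table Next; final states {q0}); it accepts exactly the strings over {a, b} containing an even number of a's.

def make_even_a_recognizer():
    start = "q0"
    final = {"q0"}
    # Next[(state, symbol)] -> next state
    nxt = {("q0", "b"): "q0", ("q0", "a"): "q1",
           ("q1", "b"): "q1", ("q1", "a"): "q0"}

    def accepts(w):
        state = start
        for ch in w:
            if (state, ch) not in nxt:   # convention: no arc => reject
                return False
            state = nxt[(state, ch)]
        return state in final

    return accepts

accepts = make_even_a_recognizer()
for w in ["", "abab", "a", "bbb", "abc"]:
    print(repr(w), accepts(w))   # True, True, False, True, False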
132
66
8/28/17
133
Another Example
(Figure, built up over several slides: a finite-state recognizer whose arcs spell out the letters of
save and its suffixed forms.)
IMPORTANT CONVENTION
If at some state, there is no transition for a symbol, we assume that the FSR rejects the string.
Accepts ..... save, saving, saved, saves
138
69
8/28/17
Regular Languages
n A language whose strings are accepted by
some finite state recognizer is a regular
language.
139
Regular Languages
n A language whose strings are accepted by
some finite state recognizer is a regular
language.
70
8/28/17
141
142
71
8/28/17
Regular Expressions
n A regular expression is a compact formula or metalanguage that describes a regular language.
143
144
72
8/28/17
a b [a | c] [d | e]
145
146
73
8/28/17
{ε} ∪ L ∪ LL ∪ LLL ∪ . . .
147
148
74
8/28/17
149
150
75
8/28/17
151
Regular Expressions
n Regular expression for set of strings with
an even number of a's.
[b* a b* a]* b*
¨ Any number of concatenations of strings of the
sort
n Any number of b's followed by an a followed by any
number of b's followed by another a
¨ Ending with any number of b's
152
76
8/28/17
Regular Expressions
n Regular expression for set of strings with
an even number of a's.
[b* a b* a]* b*
¨b b a b a b b a a b a a a b a b b b
b* a b* a b* a b* a b* a b* a b* a b* a b*
153
Regular Languages
n Regular languages are described by
regular expressions.
154
77
8/28/17
155
156
78
8/28/17
157
158
79
8/28/17
Regular Relations
n The set of upper-side strings in a regular
relation (upper language) is a regular
language.
¨{ cat+N , fly+N, fly+V, big+A}
160
80
8/28/17
Regular Relations
n A regular relation is a “mapping” between
two regular languages. Each string in one
of the languages is “related” to one or more
strings of the other language.
161
81
8/28/17
qi qj
163
Regular relation
{ <ac,ac>, <abc,adc>, <abbc,addc>, <abbbc,adddc>, ... }
Regular expression:  a:a [b:d]* c:c
Finite-state transducer: states 0 → 1 → 2, with an a:a arc from 0 to 1, a b:d self-loop on 1,
and a c:c arc from 1 to 2.
Slide courtesy of Lauri Karttunen
82
8/28/17
Regular relation
{ <ac,ac>, <abc,adc>, <abbc,addc>, <abbbc,adddc>, ... }
Regular expression:  a [b:d]* c
Convention: when the upper and lower symbols are the same, a single symbol is written
(a instead of a:a).
Slide courtesy of Lauri Karttunen
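A small Python sketch (an assumption of how one might code it directly, not Karttunen's implementation) of the transducer for a:a [b:d]* c:c: arcs carry upper:lower symbol pairs, and reading an upper string emits the related lower string.

# Arcs: (state, upper_symbol) -> (next_state, lower_symbol)
ARCS = {(0, "a"): (1, "a"),
        (1, "b"): (1, "d"),    # the b:d self-loop
        (1, "c"): (2, "c")}
FINAL = {2}

def transduce(upper):
    """Map an upper-language string to its lower-language counterpart (or None)."""
    state, out = 0, []
    for ch in upper:
        if (state, ch) not in ARCS:
            return None                      # no arc: the pair is not in the relation
        state, low = ARCS[(state, ch)]
        out.append(low)
    return "".join(out) if state in FINAL else None

for w in ["ac", "abc", "abbbc", "abd"]:
    print(w, "->", transduce(w))   # ac, adc, adddc, None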
q0 q1
b:a
166
83
8/28/17
A Linguistic Example
From now on we will use the symbol 0 (zero) to denote the empty string ε
167
84
8/28/17
Combining Transducers
n In algebra we write
¨ y=f(x) to indicate that function f maps x to y
¨ Similarly in z=g(y), g maps y to z
n We can combine these to
¨ z= g(f(x)) to map directly from x to z and write
this as z = (g · f) (x)
¨ g · f is the composed function
¨ If y=x2 and z = y3 then z = x6
169
Combining Transducers
n The same idea can be applied to
transducers – though they define relations
in general.
170
85
8/28/17
Composing Transducers
(Figure: transducer f maps an upper language U1 to a lower language L1; transducer g maps an
upper language U2 to a lower language L2. The composition f ∘ g relates
U1' = f⁻¹(L1 ∩ U2) directly to L2' = g(L1 ∩ U2).)

f ∘ g = {<x, z> : ∃y (<x, y> ∈ f and <y, z> ∈ g)}
where x, y, z are strings
86
8/28/17
Composing Transducers
n Composition is an operation that merges two
transducers “vertically”.
¨ Let X be a transducer that contains the single ordered
pair < “dog”, “chien”>.
¨ Let Y be a transducer that contains the single ordered
pair <“chien”, “Hund”>.
¨ The composition of X over Y, notated X o Y, is the
relation that contains the ordered pair <“dog”, “Hund”>.
173
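A Python sketch (assumed) of composition on relations represented as finite sets of string pairs, mirroring the <dog, chien> ∘ <chien, Hund> = <dog, Hund> example. Real FST composition operates on the machines themselves, but the relation it computes is exactly this one.

def compose(f, g):
    """f ∘ g = {<x, z> : there is a y with <x, y> in f and <y, z> in g}."""
    return {(x, z) for (x, y1) in f for (y2, z) in g if y1 == y2}

X = {("dog", "chien")}        # English -> French
Y = {("chien", "Hund")}       # French  -> German
print(compose(X, Y))          # {('dog', 'Hund')}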
Composing Transducers
n The crucial property is that the two finite
state transducers can be composed into a
single transducer.
¨ Details are hairy and not relevant.
174
87
8/28/17
English Numeral to
Number Transducer
n Take my word that this
can be done with a
finite state transducer.
175
Numbers to
Turkish Numerals n Again, take my word
Transducer
that this can be done
with a finite state
transducer.
1273
176
88
8/28/17
Number to
Turkish Numeral Transducer
1273
English Numeral to
Number Transducer
177
Number to
Turkish Numeral Transducer
Compose English Numeral
to
1273 Turkish Numeral
Transducer
English Numeral to
Number Transducer
One thousand two hundred seventy three One thousand two hundred seventy three
178
89
8/28/17
Number to
Finnish Numeral Transducer
Compose English Numeral
to
123 Finnish Numeral
Transducer
English Numeral to
Number Transducer
179
180
90
8/28/17
End of digression
n How does all this tie back to computational
morphology?
181
OVERVIEW
n Overview of Morphology
n Computational Morphology
n Overview of Finite State Machines
n Finite State Morphology
¨ Two-level
Morphology
¨ Cascade Rules
182
91
8/28/17
Morphological Analysis
n Morphological
happy+Adj+Sup
analysis can be seen
as a finite state
transduction
Finite State
Transducer
T
happiest Î English_Words
184
92
8/28/17
n Need to describe
¨ Lexicon (of free and bound morphemes)
¨ Spelling
change rules in a finite state
framework.
185
93
8/28/17
187
94
8/28/17
The Lexicon
n The lexicon structure can be refined to a
point so that all and only valid forms are
accepted and others rejected.
189
Describing Lexicons
n Current available systems for morphology provide
a simple scheme for describing finite state
lexicons.
¨ XeroxFinite State Tools
¨ PC-KIMMO
190
95
8/28/17
Describing Lexicons
LEXICON NOUNS
abacus NOUN-STEM; ;; same as abacus:abacus
car NOUN-STEM;
table NOUN-STEM;
…
information+Noun+Sg: information End;
…
zymurgy NOUN-STEM;
LEXICON NOUN-STEM
+Noun:0 NOUN-SUFFIXES
191
Describing Lexicons

LEXICON NOUN-SUFFIXES
+Sg:0  End;
+Pl:+s End;

LEXICON REGULAR-VERBS
admire REG-VERB-STEM;
head   REG-VERB-STEM;
..
zip    REG-VERB-STEM;

LEXICON IRREGULAR-VERBS
…..

LEXICON REG-VERB-STEM
+Verb:0 REG-VERB-SUFFIXES;

LEXICON REG-VERB-SUFFIXES
+Pres+3sg:+s End;
+Past:+ed    End;
+Part:+ed    End;
+Cont:+ing   End;
192
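A simplified Python sketch (assumed, loosely modeled on the lexc fragments above) of a lexicon with continuation classes: each sublexicon maps lexical:surface morpheme pairs to the class that may follow, and walking the classes enumerates the word forms the morphotactics licenses.

# "End" terminates a word; everything else names the next sublexicon.
LEXICON = {
    "ROOTS": {("admire", "admire"): "REG-VERB-STEM",
              ("car", "car"): "NOUN-STEM"},
    "NOUN-STEM": {("+Noun", ""): "NOUN-SUFFIXES"},
    "NOUN-SUFFIXES": {("+Sg", ""): "End", ("+Pl", "+s"): "End"},
    "REG-VERB-STEM": {("+Verb", ""): "REG-VERB-SUFFIXES"},
    "REG-VERB-SUFFIXES": {("+Pres+3sg", "+s"): "End",
                          ("+Past", "+ed"): "End",
                          ("+Cont", "+ing"): "End"},
}

def words(cls="ROOTS", lex="", srf=""):
    """Yield all (lexical form, surface form with boundaries) pairs the lexicon licenses."""
    if cls == "End":
        yield lex, srf
        return
    for (l, s), nxt in LEXICON[cls].items():
        yield from words(nxt, lex + l, srf + s)

for lexical, surface in words():
    print(lexical, "->", surface)
# e.g.  admire+Verb+Past -> admire+ed   (morphographemic rules then give "admired")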
96
8/28/17
193
Describing Lexicons
LEXICON ADJECTIVES happy+Adj+Sup
…
LEXICON ADVERBS
…
happy+est
194
97
8/28/17
Lexicon as a FS Transducer
(Figure: the lexicon as a letter-by-letter transducer. Lexical:surface paths include
 h a p p y +Adj +Sup 0 0 : h a p p y + e s t,
 s a v e +Verb +Past 0    : s a v e + e d,
 t a b l e +Noun +Pl      : t a b l e + s,
 with a shared +Verb +Pres +3sg : + s 0 branch.)
Morphotactics in Arabic
n As we saw earlier, words in Arabic are
based on a root and pattern scheme:
¨A root consisting of 3 consonants (radicals)
¨ A template and a vocalization.
which combine to give a stem.
n Further prefixes and suffixes can be
attached to the stem in a concatenative
fashion.
196
98
8/28/17
Morphotactics in Arabic
Pattern
CVCVC
Vocalization FormI+Perfect+Active
a a
Root
d r s learn/study
Prefix Suffix
wa+ +at
daras
wa+daras+at
Morphotactics in Arabic
Pattern
CVCVC
Vocalization FormI+Perfect+Passive
u i
Root
d r s learn/study
Prefix Suffix
wa+ +at
duris
wa+duris+at
99
8/28/17
Morphotactics in Arabic
Pattern
CVCVC
Vocalization FormI+Perfect+Active
a a
Root
k t b write
Prefix Suffix
wa+ +at
katab
wa+katab+at
Morphotactics in Arabic
Pattern
CVCCVC
Vocalization FormII+Perfect+Active
a a
Root
d r s learn/study
Prefix Suffix
wa+ +at
darras
wa+darras+at
100
8/28/17
Morphotactics in Arabic
wa+Conj+drs+FormI+Perfect+Passive+3rdPers+Fem+Sing
wa+drs+CVCVC+ui+at
wa+duris+at
201
Morphotactics in Arabic
wa+Conj+drs+FormI+Perfect+Passive+3rdPers+Fem+Sing
wa+drs+CVCVC+ui+at
wa+duris+at
wadurisat
202
101
8/28/17
Morphotactics in Arabic
wa+Conj+drs+FormI+Perfect+Passive+3rdPers+Fem+Sing
wa+drs+CVCVC+ui+at
wa+duris+at
wadurisat
wdrst
203
Morphotactics in Arabic
+drs+… +drs+… +drs+… +drs+… +drs+…
…
16 possible interpretations
204
102
8/28/17
205
Lexicon as a FS Transducer
(Same lexicon-as-transducer figure as before; the point here is its nondeterminism: several
lexical paths share surface prefixes.)
103
8/28/17
207
Lexicon Transducer
     ↕  happy+est
Morphographemic Transducer (????????)
     ↕  happiest
208
104
8/28/17
Lexicon Transducer
     ↕  happy+est              Compose ⇒   Morphological Analyzer/Generator
Morphographemic Transducer                      ↕  happiest
     ↕  happiest
209
210
105
8/28/17
211
212
106
8/28/17
214
107
8/28/17
215
216
108
8/28/17
217
Lexical form
   |  a set of parallel two-level rules, compiled into finite-state automata interpreted as
   |  transducers: fst 1, fst 2, ..., fst n (all applying simultaneously)
Surface form
109
8/28/17
Lexical form → fst 1 → Intermediate form → fst 2 → ... → fst n → Surface form

Spoiler
n At the end both approaches are equivalent.
(Figure: the parallel rule bank and the rule cascade, side by side, both mapping lexical forms
to surface forms.)
110
8/28/17
Two-Level Morphology
n Basic terminology and concepts
n Examples of morphographemic alternations
n Two-level rules
n Rule examples
222
111
8/28/17
Terminology
n Representation
Surface form/string : happiest
Lexical form/string: happy+est
Feasible Pairs
n Aligned correspondence:
happy+est
happi0est
224
112
8/28/17
Aligned Correspondence
n Aligned correspondence:
happy+est
happi0est
n The alignments can be seen as
¨ Strings in a regular language over the alphabet
of feasible pairs, (i.e., symbols that look like
“y:i”) or
225
Aligned Correspondence
n Aligned correspondence:
happy+est
happi0est
n The alignments can be seen as
¨ Strings in a regular language over the alphabet
of feasible pairs, (i.e., symbols that look like
“y:i”) or
¨ Transductions from surface strings to lexical
strings (analysis), or
226
113
8/28/17
Aligned Correspondence
n Aligned correspondence:
happy+est
happi0est
n The alignments can be seen as
¨ Strings in a regular language over the alphabet
of feasible pairs, (i.e., symbols that look like
“y:i”) or
¨ Transductions from surface strings to lexical
strings (analysis), or
¨ Transductions from lexical strings to surface
strings (generation)
227
228
114
8/28/17
229
230
115
8/28/17
231
232
116
8/28/17
233
234
117
8/28/17
235
236
118
8/28/17
237
238
119
8/28/17
qawul+a
qaA0l0a (qaAla – he said)
239
¨ Conditions
n Context
n Optional vs Obligatory Changes
240
120
8/28/17
Parallel Rules
n A well-established formalism for describing
morphographemic changes.
Lexical Form
Each rule describes
a constraint on legal
Lexical - Surface
R1 R2 R3 R4 ... Rn pairings.
Surface Form
241
Parallel Rules
n Each morphographemic constraint is
enforced by a finite state recognizer over
the alphabet of feasible-pairs.
t i e + i n g
R1 R2 R3 R4 ... Rn
121
8/28/17
Parallel Rules
n A lexical-surface string pair is "accepted" if
NONE of the rule recognizers reject it.
n Thus, all rules must put a good word in!
t i e + i n g
R1 R2 R3 R4 ... Rn
t y 0 0 i n g
243
Parallel Rules
n Each rule independently checks if it has
any problems with the pair of strings.
t i e + i n g
R1 R2 R3 R4 ... Rn
t y 0 0 i n g
244
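A Python sketch (assumed) of the parallel-rule idea: each rule is a recognizer over the sequence of lexical:surface symbol pairs, and an alignment is accepted only if no rule rejects it. The two toy rules below are stand-ins for compiled two-level rules, not an actual English rule set.

def rule_y_to_i(pairs):
    # y may be realized as i only when followed (on the lexical side) by a morpheme boundary.
    for k, (lex, srf) in enumerate(pairs):
        if lex == "y" and srf == "i":
            if k + 1 >= len(pairs) or pairs[k + 1][0] != "+":
                return False
    return True

def rule_boundary_deleted(pairs):
    # the lexical morpheme boundary '+' must be realized as 0 (the empty symbol) on the surface.
    return all(srf == "0" for lex, srf in pairs if lex == "+")

RULES = [rule_y_to_i, rule_boundary_deleted]

def accepted(pairs):
    return all(rule(pairs) for rule in RULES)   # a single rejection kills the pairing

happy_est = list(zip("happy+est", "happi0est"))   # h:h a:a p:p p:p y:i +:0 e:e s:s t:t
print(accepted(happy_est))                        # True
print(accepted(list(zip("yes", "ies"))))          # False: y:i not before a boundary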
122
8/28/17
Two-level Morphology
n Each recognizer sees that same pair of
symbols
R1 R2 R3 R4 ... Rn
t y 0 0 i n g
245
Two-level Morphology
n Each recognizer sees that same pair of
symbols
t i e + i n g
R1 R2 R3 R4 . . . Rn
t y 0 0 i n g
246
123
8/28/17
Two-level Morphology
n Each recognizer sees that same pair of
symbols
t i e + i n g
R1 R2 R3 R4 ... Rn
t y 0 0 i n g
247
Two-level Morphology
n Each recognizer sees that same pair of
symbols
R1 R2 R3 R4 ... Rn
t y 0 0 i n g
248
124
8/28/17
250
125
8/28/17
251
p ⇒ {p:p, p:m};  a denotes everything else.  Also remember: the FST rejects if no arc is found.
126
8/28/17
253
127
8/28/17
Rules in parallel
a t
Rules in parallel
a t
128
8/28/17
Rules in parallel
a t
Both FSRs see the N:m 1st one goes to state 3 2nd one goes to state 2
257
Rules in parallel
a t
Both FSRs see the p:m 1st one back goes to state 1 2nd one stays in state 2
258
129
8/28/17
Rules in parallel
a t
Both FSRs see the a:a 1st one stays in state 1 2nd one goes to state 1
259
Rules in parallel
a t
Both FSRs see the t:t 1st one stays in state 1 2nd one stays in state 1
260
130
8/28/17
Rules in parallel
a t
Rules in parallel
a t
262
131
8/28/17
Crucial Points
n Rules are implemented by recognizers over
strings of pairs of symbols.
263
Crucial Points
n Rules are implemented by recognizers over
strings of pairs of symbols.
n The set of strings accepted by a set of such
recognizers is the intersection of the languages
accepted by each!
¨ Because, all recognizers have to be in the accepting
state– the pairing is rejected if at least one rule rejects.
264
132
8/28/17
Crucial Points
n Rules are implemented by recognizers over
strings of pairs of symbols.
n The set of strings accepted by a set of such
recognizers is the intersection of the languages
accepted by each!
265
Crucial Points
n Rules are implemented by recognizers over
strings of pairs of symbols.
n The set of strings accepted by a set of such
recognizers is the intersection of the languages
accepted by each!
266
133
8/28/17
267
t i e + i n g
R1 ∩ R2 ∩ R3 ∩ R4 ∩ ... ∩ Rn
t y 0 0 i n g
268
134
8/28/17
t i e + i n g
t y 0 0 i n g
269
t i e + i n g
t y 0 0 i n g
270
135
8/28/17
Describing Phenomena
n Finite state transducers are too low level.
271
Two-level Rules
n Always remember the set of feasible
symbols = sets of legal correspondences.
n Rules are of the sort:
a:b op LC __ RC
Feasible
Pair
272
136
8/28/17
Two-level Rules
a:b op LC __ RC
Feasible Operator
Pair
273
Two-level Rules
a:b op LC __ RC
274
137
8/28/17
Two-level Rules
a:b op LC __ RC
276
138
8/28/17
Left Context: some consonant, possibly followed by an (optional) morpheme boundary
Right Context: a morpheme boundary
278
139
8/28/17
279
280
140
8/28/17
281
282
141
8/28/17
283
Rules to Transducers
n All the rule types can be compiled into finite
state transducers
¨ Rather hairy and not so gratifying :-)
284
142
8/28/17
Rules to Transducers
n Let’s think about a:b => LC _ RC
n If we see the a:b pair we want
to make sure
¨ It is preceded by a (sub)string
that matches LC, and
¨ It is followed by a (sub)string
that matches RC
n So we reject any input that
violates either or both of these
constraints
285
Rules to Transducers
n Let’s think about a:b => LC _ RC
n More formally
¨ It is not the case that we have a:b not preceded by LC, or not followed by RC
¨ ~[
[~ [?* LC ] a:b ?*] |
[ ~ ?* a:b ~[ RC ?* ] ]
]
(~ is the complementation operator)
286
143
8/28/17
Summary of Rules
n <= a:b <= c _ d
¨ a is always realized as b in the context c _ d.
n => a:b => c _ d
¨ a is realized as b only in the context c _ d.
n <=> a:b <=> c _ d
¨ a is realized as b in c _ d and nowhere else.
n /<= a:b /<= c _ d
¨ a is never realized as b in the context c _ d.
287
288
144
8/28/17
289
145
8/28/17
Two-level Morphology
n Beesley and Karttunen, Finite State Morphology,
CSLI Publications, 2004 (www.fsmbook.com)
n Karttunen and Beesley: Two-level rule compiler,
Xerox PARC Tech Report
n Sproat, Morphology and Computation, MIT Press
n Ritchie et al. Computational Morphology, MIT
Press
n Two-Level Rule Compiler
http://www.xrce.xerox.com/competencies/content-
analysis/fssoft/docs/twolc-92/twolc92.html
291
292
146
8/28/17
Turkish
n Turkish is an Altaic language with over 60
Million speakers ( > 150 M for Turkic
Languages: Azeri, Turkoman, Uzbek, Kirgiz,
Tatar, etc.)
n Agglutinative Morphology
¨ Morphemes glued together like "beads-on-a-
string"
¨ Morphophonemic processes (e.g.,vowel
harmony)
293
Turkish Morphology
n Productive inflectional and derivational
suffixation.
147
8/28/17
Turkish Morphology
n Too many word forms per root.
¨ Hankamer (1989), e.g., estimates a few million forms per verbal root (based on the generative
capacity of derivations).
¨ Nouns have about 100 different forms w/o any derivations.
¨ Verbs have thousands.
295
Word Structure
n A word can be seen as a sequence of inflectional
groups (IGs) of the form
Lemma+Infl1^DB+Infl2^DB+…^DB+Infln
296
148
8/28/17
Word Structure
n A word can be seen as a sequence of inflectional
groups (IGs) of the form
Lemma+Infl1^DB+Infl2^DB+…^DB+Infln
297
Word Structure
n A word can be seen as a sequence of inflectional
groups (IGs) of the form
Lemma+Infl1^DB+Infl2^DB+…^DB+Infln
298
149
8/28/17
Word Structure
n A word can be seen as a sequence of
inflectional groups (IGs) of the form
Lemma+Infl1^DB+Infl2^DB+…^DB+Infln
¨ evinizdekilerden (from the ones at your house)
¨ ev+iniz+de+ki+ler+den
¨ ev+HnHz+DA+ki+lAr+DAn
A = {a,e}, H={ı, i, u, ü}, D= {d,t}
¨ cf. odanızdakilerden
oda+[ı]nız+da+ki+ler+den
oda+[H]nHz+DA+ki+lAr+DAn
299
Word Structure
n A word can be seen as a sequence of inflectional
groups (IGs) of the form
Lemma+Infl1^DB+Infl2^DB+…^DB+Infln
300
150
8/28/17
Word Structure
n sağlamlaştırdığımızdaki ( (existing) at the time we caused
(something) to become strong. )
n Morphemes
¨ sağlam+lAş+DHr+DHk+HmHz+DA+ki
n Features
¨ sağlam(strong)
n +Adj
n ^DB+Verb+Become (+lAş)
n ^DB+Verb+Caus+Pos (+DHr)
n ^DB+Noun+PastPart+P1pl+Loc
(+DHk,+HmHz,+DA)
301
n ^DB+Adj (+ki)
302
151
8/28/17
Morphological Features
n Nominals
¨ Nouns
¨ Pronouns
¨ Participles
¨ Infinitives
inflect for
¨ Number, Person (2/6)
¨ Possessor (None, 1sg-3pl)
¨ Case
n Nom,Loc,Acc,Abl,Dat,Ins,Gen
303
Morphological Features
n Nominals
¨ Productive Derivations into
n Nouns (Diminutive)
¨ kitap(book), kitapçık (little book)
n Adjectives (With, Without….)
¨ renk (color), renkli (with color), renksiz (without color)
304
152
8/28/17
Morphological Features
n Verbs have markers for
¨ Voice:
n Reflexive/Reciprocal,Causative (0 or more),Passive
¨ Polarity (Neg)
¨ Tense-Aspect-Mood (2)
n Past, Narr,Future, Aorist,Pres
n Progressive (action/state)
305
Morphological Features
n öl-dür-ül-ecek-ti
(it) was going to be killed (caused to die)
¨ öl - die
¨ -dür: causative
¨ -ül: passive
¨ -ecek: future
¨ -ti: past
¨ -0: 3rd Sg person
306
153
8/28/17
Morphological Features
n Verbs also have markers for
¨ Modality:
n able to verb (can/may)
n verb repeatedly
n verb hastily
307
Morphological Features
n Productive derivations from Verb
¨ (e.g:Verb Þ Temp/Manner Adverb)
n after having verb-ed,
n by verbing
154
8/28/17
Morphographemic Processes
n Vowel Harmony
¨ Vowels in suffixes agree in certain
phonological features with the preceding
vowels.
High Vowels       H = {ı, i, u, ü}
Low Vowels          = {a, e, o, ö}
Front Vowels        = {e, i, ö, ü}
Back Vowels         = {a, ı, o, u}
Round Vowels        = {o, ö, u, ü}
Nonround Vowels     = {a, e, ı, i}
Nonround Low      A = {a, e}
Morphemes use A and H as underspecified meta symbols on the lexical side:
+lAr : Plural Marker
+nHn : Genitive Case Marker
309
Vowel Harmony
n Some data
masa+lAr okul+lAr ev+lAr gül+lAr
masa0lar okul0lar ev0ler gül0ler
¨ If the last surface vowel is a back vowel, A is paired with a on the surface, otherwise A is
paired with e. (A:a and A:e are feasible pairs.) A small sketch of harmony resolution in code
follows below.
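The sketch referred to above, in Python (assumed and deliberately simplified): each A or H in a suffix harmonizes with the closest preceding surface vowel, and the morpheme boundary + is dropped; consonant alternations and deletions are ignored.

BACK = set("aıou")
ROUND = set("oöuü")
VOWELS = set("aeıioöuü")

def last_vowel(chars):
    for ch in reversed(chars):
        if ch in VOWELS:
            return ch
    return "e"                                # arbitrary default for vowel-less stems

def harmonize(lexical):
    """masa+lAr -> masalar, ev+lAr -> evler, tarak+sH -> taraksı (simplified)."""
    out = []
    for ch in lexical:
        if ch == "+":
            continue                          # morpheme boundary deleted
        if ch == "A":                         # low unrounded: a or e
            ch = "a" if last_vowel(out) in BACK else "e"
        elif ch == "H":                       # high: ı, i, u or ü
            prev = last_vowel(out)
            ch = {(True, True): "u", (True, False): "ı",
                  (False, True): "ü", (False, False): "i"}[(prev in BACK, prev in ROUND)]
        out.append(ch)
    return "".join(out)

for w in ["masa+lAr", "ev+lAr", "gül+lAr", "okul+lAr", "tarak+sH", "kese+sH"]:
    print(w, "->", harmonize(w))   # masalar, evler, güller, okullar, taraksı, kesesi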
310
155
8/28/17
Vowel Harmony
n Some data
masa+lAr okul+lAr ev+lAr gül+lAr+yA
masa0lar okul0lar ev0ler gül0ler+0e
¨ If the last surface vowel is a back vowel. A is paired
with a on the surface, otherwise A is paired with e.
(A:a and A:e are feasible pairs)
n Note that this is chain effect
Vowel Harmony
n Some data
masa+nHn okul+nHn ev+nHn gül+Hn+DA
masa0nın okul00un ev00in gül0ün+de
312
156
8/28/17
Vowel Harmony
n Some data
masa+nHn okul+nHn ev+nHn gül+nHn
masa0nın okul00un ev00in gül+0ün
313
n Consonant Devoicing
kitab+DA tad+DHk tad+sH+nA kitab
kitap0ta tat0tık tad00ı0na kitap
kitapta tattık tadına
n Gemination
tıb0+yH üs0+sH şık0+yH
tıbb00ı üss00ü şıkk+0ı
tıbbı üssü şıkkı
314
157
8/28/17
315
Reality Check-1
n Real text contains phenomena that cause
nasty problems:
¨ Words of foreign origin - 1
alkol+sH kemal0+yA
alkol00ü kemal'00e
Use different lexical vowel symbols for these
¨ Words of foreign origin -2
Carter'a serverlar Bordeaux'yu
n This needs to be handled by a separate analyzer
using phonological encodings of foreign words, or
n Using Lenient morphology
316
158
8/28/17
Reality Check-2
n Real text contains phenomena that cause
nasty problems:
¨ Numbers, Numbers:
2'ye, 3'e, 4'ten, %6'dan, 20inci,100üncü
16:15 vs 3:4, 2/3'ü, 2/3'si
Reality Check-3
n Real text contains phenomena that cause
nasty problems:
¨ Acronyms
PTTye -- No written vowel to harmonize to!
318
159
8/28/17
Reality Check-4
n Interjections
¨ Aha!, Ahaaaaaaa!, Oh, Oooooooooh
¨ So the lexicon may have to encode lexical
representations as regular expresions
n ah[a]+, [o]+h
n Emphasis
¨ çok, çooooooook
319
Reality Check-5
n Lexicons have to be kept in check to
prevent overgeneration*
¨ Allomorph Selection
n Which causative morpheme you use depends on
the (phonological structure of the) verb, or the
previous causative morpheme
¨ ye+DHr+t oku+t+DHr
n Which case morpheme you use depends on the
previous morpheme.
oda+sH+nA            oda+yA
oda0sı0na            oda0ya
(to his room)        (to the room)
160
8/28/17
Reality Check-6
n Lexicons have to be kept in check to
prevent overgeneration
n the suffix +ki can only follow +Loc case marked nouns, or
n singular nouns in +Nom case denoting temporal entities (such as day, minute, etc.)
321
Taming Overgeneration
n All these can be specified as finite state
transducers.
Constraint Transducer 2
Constraint Transducer 1
Constrained
Lexicon Transducer
Lexicon Transducer
322
161
8/28/17
Morphographemic
TR1 TR2 TR3 TR4 ... TRn transducer
323
TC
Tlx-if
Tes-is
kütüğünden, Kütüğünden, KÜTÜĞÜNDEN
324
162
8/28/17
TC
Tlx-if
kütüğünden
Tes-is
kütüğünden, Kütüğünden, KÜTÜĞÜNDEN
325
TC
Tlx-if
kütük+sH+ndAn
Tis-lx = intersection of rule transducers
kütüğünden
Tes-is
kütüğünden, Kütüğünden, KÜTÜĞÜNDEN
326
163
8/28/17
TC
Tlx-if
kütük+sH+ndAn
Tis-lx = intersection of rule transducers
kütüğünden
Tes-is
kütüğünden, Kütüğünden, KÜTÜĞÜNDEN
327
TC
Tlx-if
kütük+sH+ndAn
Tis-lx = intersection of rule transducers
kütüğünden
Tes-is
kütüğünden, Kütüğünden, KÜTÜĞÜNDEN
328
164
8/28/17
Turkish Analyzer
(After all transducers
are intersected and composed)
(~1M States, 1.6M Transitions)
22K Nouns
4K Verbs
Turkish Analyzer 2K Adjective
(After all transducers 100K Proper Nouns
are intersected and composed)
(~1M States, 1.6M Transitions)
330
165
8/28/17
Pronunciation Generation
n gelebilecekken
¨ (gj e - l )gel+Verb+Pos(e - b i - l ) ^DB+Verb
+Able(e - "dZ e c ) +Fut(- c e n
)^DB+Adverb+While
n okuma
¨ (o - k )ok+Noun+A3sg(u - "m ) +P1sg(a )+Dat
¨ (o - "k u ) oku+Verb(- m a ) +Neg+Imp+A2sg
¨ (o - k u ) oku+Verb+Pos(- "m a )
^DB+Noun+Inf2+A3sg+Pnon+Nom
332
166
8/28/17
333
334
167
8/28/17
335
168
8/28/17
n Solution
¨ zaplıyordum
n zapla+Hyor+DHm (zapla+Verb+Pos+Pres1+A1sg)
n zapl +Hyor+DHm (zapl+Verb+Pos+Pres1+A1sg)
337
Systems Available
n Xerox Finite State Suite (lexc, twolc,xfst)
¨ Commercial (Education/research license available)
¨ Lexicon and rule compilers available
¨ Full power of finite state calculus (beyond two-level
morphology)
¨ Very fast (thousands of words/sec)
169
8/28/17
Systems Available
n Schmid’s SFST-- the Stuttgart Finite State
Transducer Tool
¨ SFST is a toolbox for the implementation of
morphological analysers and other tools which
are based on finite state transducer
technology.
¨ Available at http://www.ims.uni-
stuttgart.de/projekte/gramotron/SOFTWARE/S
FST.html
339
Systems Available
n AT&T FSM Toolkit
¨ Tools to manipulate (weighted) finite state
transducers
n Now an open source version available as OpenFST
n Carmel Toolkit
¨ http://www.isi.edu/licensed-sw/carmel/
n FSA Toolkit
¨ http://www-i6.informatik.rwth-
aachen.de/~kanthak/fsa.html
340
170
8/28/17
OVERVIEW
n Overview of Morphology
n Computational Morphology
n Overview of Finite State Machines
n Finite State Morphology
¨ Two-level
Morphology
¨ Cascade Rules
341
Lexicon Transducer
happy+est
Morphographemic ????????
Transducer
happiest
342
171
8/28/17
Lexical form
Set of parallel
of two-level rules
compiled into finite-state fst 1 fst 2 ... fst n
automata interpreted as
transducers
Surface form
fst 1
Intermediate form
fst n
172
8/28/17
fst 1
fst 1 fst 2 ... fst n
Intermediate form
fst 2
fst n
Surface form
173
8/28/17
347
348
174
8/28/17
...m p...
349
...m p...
...m m...
350
175
8/28/17
kampat Intermediate
kammat Surface
351
kammat Surface
352
176
8/28/17
kanmat Surface
353
354
177
8/28/17
N->m
transformation
kampat
N->n
transformation
kampat
p->m
transformation
kammat
355
N->m
transformation
kampat kaNtat
N->n
transformation
kampat kantat
p->m
transformation
kammat kantat
356
178
8/28/17
N->m
transformation
kampat kaNtat kammat
N->n
transformation
kampat kantat kammat
p->m
transformation
357
N->m
transformation
kampat kaNtat kammat
N->n
transformation
kampat kantat kammat
p->m
transformation
358
179
8/28/17
N->m
transformation
kampat kampat kammat
N->n
transformation
kampat kampat kammat
p->m
transformation
359
N->m
transformation
N->n
transformation
p->m
transformation
kammat
360
180
8/28/17
m:m
p:p ?
p->m
m:m
transformation
?
p:m
N->m
Transducer
Composition
N->n
Transducer
p->m
Transducer
181
8/28/17
Rewrite Rules
n Originally rewrite rules were proposed to
describe phonological changes
n u -> l / LC _ RC
¨ Change u to l if it is preceded by LC and
followed by RC.
364
182
8/28/17
Rewrite Rules
n These rules of the sort u -> l / LC _ RC
look like context sensitive grammar rules,
so can in general describe much more
complex languages.
Replace Rules
n Replace rules define regular relations
between two regular languages
A -> B LC _ RC
Replacement Context
The relation that replaces A by B between L and R leaving
everything else unchanged.
n In general A, B, LC and RC are regular
expressions.
366
183
8/28/17
Replace Rules
n Let us look at the simplest replace rule
¨a -> b
n The relation defined by this rule contains among
others
¨ {..<abb,
bbb>,<baaa, bbbb>, <cbc, cbc>, <caad,
cbbd>, …}
n A string in the upper language is related to a
string in the lower language which is exactly the
same, except all the a’s are replaced by b’s.
¨ The related strings are identical if the upper string does
not contain any a’s
367
Replace Rules
n Let us look at the simplest replace rule with
a context
¨ a->b || d _ e
n a’s are replaced by b, if they occur after a d
and before an e.
¨ <cdaed, cdbed> are related
n a appears in the appropriate context in the upper
string
¨ <caabd, cbbbd> are NOT related,
n a’s do not appear in the appropriate context.
¨ <caabd, caabd> are related (Why?)
368
184
8/28/17
Replace Rules
n Although replace rules define regular
relations, sometimes it may better to look at
them in a procedural way.
¨a -> b || d _ e
n What string do I get when I apply this rule
to the upper string bdaeccdaeb?
• bdaeccdaeb
• bdbeccdbeb
369
185
8/28/17
371
Lower String B
372
186
8/28/17
A -> B || LC _ RC   (both LC and RC are matched on the upper string; only B appears on the lower string)
A -> B // LC _ RC   (LC is matched on the lower string, RC on the upper string)
A -> B \\ LC _ RC   (LC is matched on the upper string, RC on the lower string)
A -> B \/ LC _ RC   (both LC and RC are matched on the lower string)
188
8/28/17
377
378
189
8/28/17
379
n N->m || _ p;
kampat kaNtat kammat
n N-> n;
kampat kantat kammat
n p -> m || m _
380
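A Python sketch (assumed) of this cascade, with ordinary string rewriting standing in for the three compiled transducers; composing the transducers computes the same lexical-to-surface function in a single pass.

import re

def cascade(lexical):
    s = re.sub(r"N(?=p)", "m", lexical)        # N -> m || _ p
    s = s.replace("N", "n")                    # N -> n  (elsewhere)
    s = re.sub(r"(?<=m)p", "m", s)             # p -> m || m _
    return s

for w in ["kaNpat", "kaNtat", "kammat"]:
    print(w, "->", cascade(w))   # kammat, kantat, kammat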
190
8/28/17
.o.
FST3
p -> m || m _
381
382
191
8/28/17
383
192
8/28/17
Vowel Harmony
n A bit tricky
¨ A->a or A->e
¨ H->ı, H->i, H->u, H->ü
n These two (groups of) rules are
interdependent
385
Vowel Harmony
n So we need
¨ Parallel
Rules
¨ Each checking its left context on the output (lower-
side)
n A->a // VBack Cons* “+” Cons* _ ,,
n A->e // VFront Cons* “+” Cons* _ ,,
n H->u // [o | u] Cons* “+” Cons* _ ,,
n H->ü // [ö | ü] Cons* “+” Cons* _ ,,
n H->ı // [a | ı] Cons* “+” Cons* _ ,,
n H->i // [e | i] Cons* “+” Cons* _
386
193
8/28/17
Consonant Resolution
n d is realized as t either at the end of a word
or after certain consonants
n b is realized as p either at the end of a
word or after certain consonants
n c is realized as ç either at the end of a word
or after certain consonants
n d-> t, b->p, c->ç // [h | ç | ş | k | p | t | f | s ] “+” _
387
Consonant Deletion
n Morpheme initial s, n, y is deleted if it is
preceded by a consonant
388
194
8/28/17
Cascade
Stem Final Vowel Deletion
Morpheme Initial
Vowel Deletion
Vowel Harmony
Consonant Devoicing
(Partial) Morphographemic
Transducer
Consonant Deletion
Boundary Deletion
389
Cascade
Stem Final Vowel Deletion Lexicon
Transducer
Morpheme Initial
Vowel Deletion
Vowel Harmony
(partial)
TMA
Consonant Devoicing (Partial)
Morphographemic
Consonant Deletion Transducer
Boundary Deletion
390
195
8/28/17
Some Observations
n We have not really seen all the nitty gritty
details of both approaches but rather the
fundamental ideas behind them.
¨ Rule conflicts in Two-level morphology
n Sometimes the rule compiler detects a conflict:
¨ Two rules sanction conflicting feasible pairs in a context
n Sometimes the compiler can resolve the conflict but
sometimes the developer has to fine tune the
contexts.
391
Some Observations
n We have not really seen all the nitty gritty
details of both approaches but rather the
fundamental ideas behind them.
¨ Unintended rule interactions in rule cascades
n When one has 10’s of replace rule one feeding into
the other, unintended/unexpected interactions are
hard to avoid
n Compilers can’t do much
392
196
8/28/17
Some Observations
n For a real morphological analyzer, my
experience is that developing an accurate
model of the lexicon is as hard as (if not
harder than) developing the
morphographemic rules.
¨ Taming overgeneration
¨ Enforcing “semantic” constraints
¨ Enforcing long distance co-occurrence
constraints
n This suffix can not occur with that prefix, etc.
¨ Special cases, irregular cases
393
197
11-411
Natural Language Processing
Language Modelling and Smoothing
Kemal Oflazer
1/46
What is a Language Model?
I A model that estimates how likely it is that a sequence of words belongs to a (natural)
language
I Intuition
I p(A tired athlete sleeps comfortably) ≫ p(Colorless green ideas sleep furiously)
I p(Colorless green ideas sleep furiously) ≫ p(Salad word sentence is this)
2/46
Let’s Check How Good Your Language Model is?
3/46
Where do we use a language model?
I Language models are typically used as components of larger systems.
I We’ll study how they are used later, but here’s some further motivation.
I Speech transcription:
I I want to learn how to wreck a nice beach.
I I want to learn how to recognize speech.
I Handwriting recognition:
I I have a gub!
I I have a gun!
I Spelling correction:
I We’re leaving in five minuets.
I We’re leaving in five minutes.
I Ranking machine translation system outputs
4/46
Very Quick Review of Probability
I Event space (e.g., X , Y), usually discrete for the purposes of this class.
I Random variables (e.g., X , Y )
I We say “Random variable X takes value x ∈ X with probability p(X = x)”
I We usually write p(X = x) as p(x).
I Joint probability: p(X = x, Y = y)
I Conditional probability: p(X = x | Y = y) = p(X = x, Y = y) / p(Y = y)
I This always holds: p(X = x, Y = y) = p(X = x | Y = y) × p(Y = y)
5/46
Language Models: Definitions
I V is a finite set of discrete symbols (characters, words, emoji symbols, . . . ), V = |V|.
I V+ is the infinite set of finite-length sequences of symbols from V whose final symbol is the
special stop symbol □.
I p : V+ → R such that
I For all x ∈ V+, p(x) ≥ 0
I p is a proper probability distribution: Σ_{x∈V+} p(x) = 1
I Language modeling: Estimate p from the training set examples x1:n = ⟨x1, x2, . . . , xn⟩
7/46
Motivation – Noisy Channel Models
I Noisy channel models are very suitable models for many NLP problems:
I Y is the plaintext, the true message, the missing information, the output
I X is the ciphertext, the garbled message, the observable evidence, the input
I Decoding: select the best y given X = x.
8/46
Noisy Channel Example – Speech Recognition
I Source model characterizes p(y), “What are possible sequences of words I can say?”
I Channel model characterizes p(Acoustics | y)
I It is hard to recognize speech
I It is hard to wreck a nice beach
I It is hard wreck an ice beach
I It is hard wreck a nice peach
I It is hard wreck an ice peach
I It is heart to wreck an ice peach
I ···
9/46
Noisy Channel Example – Machine Translation
10/46
Machine Transliteration
I Phonetic translation across language pairs with very different alphabets and sound
system is called transliteration.
I Golfbag in English is to be transliterated to Japanese.
I Japanese has no distinct l and r sounds - these in English collapse to the same sound.
Same for English h and f.
I Japanese uses alternating vowel-consonant syllable structure: lfb is impossible to
pronounce without any vowels.
I Katagana writing is based on syllabaries: different symbols for ga, gi, gu, etc.
I So Golfbag is transliterated as and pronounced as go-ru-hu-ba-ggu.
I So when you see a transliterated word in Japanese text, how can you find out what
the English is?
I nyuuyooko taimuzu → New York Times
I aisukuriimu → ice-cream (and not “I scream”)
I ranpu → lamp or ramp
I masutaazutoonamento → Master’s Tournament
11/46
Noisy Channel Model – Other Applications
I Spelling Correction
I Grammar Correction
I Optical Character Recognition
I Sentence Segmentation
I Part-of-speech Tagging
12/46
Is finite V realistic?
I NO!
I We will never see all possible words in a language, no matter how large a sample we look at.
13/46
The Language Modeling Problem
14/46
A Very Simple Language Model
I What happens when you want to assign a probability to some x that is not in the
training set?
I Is there a way out?
15/46
Chain Rule to the Rescue
I We break down p(x) mathematically
p(X = x) = p(X1 = x1 ) × p(X2 = x2 | X1 = x1 ) × p(X3 = x3 | X1:2 = x1:2 ) × · · · × p(Xℓ = stop | X1:ℓ−1 = x1:ℓ−1 )
         = ∏_{j=1}^{ℓ} p(Xj = xj | X1:j−1 = x1:j−1 )
16/46
Approximating the Chain Rule Expansion – The Unigram Model
p(X = x) = ∏_{j=1}^{ℓ} p(Xj = xj | X1:j−1 = x1:j−1 )
  (assumption) = ∏_{j=1}^{ℓ} pθ (Xj = xj ) = ∏_{j=1}^{ℓ} θ_{xj} ≈ ∏_{j=1}^{ℓ} θ̂_{xj}
Pros:
I Easy to understand
Cons:
I The "bag of words" assumption is not realistic
18/46
Approximating the Chain Rule Expansion – Markov Models
p(X = x) = ∏_{j=1}^{ℓ} p(Xj = xj | X1:j−1 = x1:j−1 )
  (assumption) = ∏_{j=1}^{ℓ} pθ (Xj = xj | Xj−n+1:j−1 = xj−n+1:j−1 )   (condition only on the last n − 1 words)
19/46
Estimating n-gram Models
unigram:        pθ (x) = ∏_{j=1}^{ℓ} θ_{xj}
bigram:         pθ (x) = ∏_{j=1}^{ℓ} θ_{xj | xj−1}
trigram:        pθ (x) = ∏_{j=1}^{ℓ} θ_{xj | xj−2 xj−1}
general n-gram: pθ (x) = ∏_{j=1}^{ℓ} θ_{xj | xj−n+1:j−1}
20/46
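To make the estimation concrete, here is a minimal sketch (my own illustration, not from the slides) of maximum-likelihood bigram estimation; the whitespace tokenization and the <s>/</s> boundary symbols are assumptions made for the example.

    from collections import Counter, defaultdict

    def train_bigram_mle(sentences):
        """Relative-frequency estimates: p(w | prev) = c(prev, w) / c(prev)."""
        unigrams, bigrams = Counter(), Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]   # assumed boundary symbols
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
        probs = defaultdict(dict)
        for (prev, w), c in bigrams.items():
            probs[prev][w] = c / unigrams[prev]
        return unigrams, bigrams, probs

    unigrams, bigrams, probs = train_bigram_mle(["a tired athlete sleeps", "a tired athlete runs"])
    print(probs["athlete"])   # {'sleeps': 0.5, 'runs': 0.5}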
The Problem with MLE
21/46
Engineering Issues – Log Probabilities
I Note that computation of pθ (x) involves multiplication of numbers, each of which is between 0 and 1.
I So multiplication hits underflow: computationally, the product cannot be represented or computed.
I In implementation, probabilities are represented by their logarithms (between −∞ and 0) and multiplication is replaced by addition.
22/46
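A small illustration of the point (illustrative, not from the slides): multiplying many small probabilities underflows to 0.0 in floating point, while summing their logarithms stays well behaved.

    import math

    probs = [1e-5] * 100                          # 100 tokens, each with probability 1e-5
    product = 1.0
    for p in probs:
        product *= p                              # underflows: 1e-500 is not representable
    log_prob = sum(math.log(p) for p in probs)    # stays finite
    print(product, log_prob)                      # 0.0 -1151.29...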
Dealing with Out-of-Vocabulary Words
23/46
Smoothing Language Models
I We cannot have 0-probability n-grams. So we should shave off some probability mass from seen n-grams to give to unseen n-grams.
I The Robin Hood approach – steal some probability from the haves to give to the have-nots.
I Simplest method: Laplace Smoothing
I Interpolation
I Stupid backoff.
I Long-standing best method: modified Kneser-Ney smoothing
24/46
Laplace Smoothing
I We add 1 to all counts! So words with 0 counts will be assumed to have count 1.
I Unigram probabilities: p(v) = (c(v) + 1) / (N + V)
I Bigram probabilities: p(v | v′) = (c(v′ v) + 1) / (c(v′) + V)
I One can also use add-k smoothing for some fractional k, 0 < k ≤ 1.
I It turns out this method is very simple but shaves off too much of the probability mass. (See book for an example.)
25/46
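Continuing the counting sketch above, a minimal add-1 smoothed bigram estimate might look as follows (the Counter inputs and the vocabulary size V are the assumed ingredients).

    def laplace_bigram_prob(w, prev, bigrams, unigrams, V):
        """Add-1 smoothing: p(w | prev) = (c(prev, w) + 1) / (c(prev) + V)."""
        return (bigrams.get((prev, w), 0) + 1) / (unigrams.get(prev, 0) + V)

Unseen bigrams now get a small nonzero probability instead of 0, at the cost of taking quite a lot of mass away from the seen ones.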
Interpolation
26/46
Stupid Backoff
I Gives up the idea of making the language model a true probability distribution.
I Works quite well with very large training data (e.g. web scale) and large language
models
I If a given n-gram has never been observed, just use the next lower gram’s estimate
scaled by a fixed weight λ (terminates when you reach the unigram)
27/46
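A rough sketch of stupid backoff scoring (my own illustration): counts is an assumed dict from n-gram tuples of every order to their counts, total is the corpus token count, and lam = 0.4 is a commonly used default weight. Note the result is a score, not a probability.

    def stupid_backoff(w, context, counts, total, lam=0.4):
        """S(w | context): relative frequency if the full n-gram was seen,
        otherwise lam * S(w | shorter context); bottoms out at the unigram."""
        if not context:
            return counts.get((w,), 0) / total
        ngram = tuple(context) + (w,)
        if counts.get(ngram, 0) > 0:
            return counts[ngram] / counts[tuple(context)]
        return lam * stupid_backoff(w, tuple(context)[1:], counts, total, lam)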
Kneser-Ney Smoothing
28/46
Toolkits
29/46
n-gram Models– Assessment
Pros:
I Easy to understand
Cons:
I The Markov assumption is not realistic
30/46
Evaluation – Language Model Perplexity
I Consider language model that assigns probabilities to a sequence of digits (in speech
recognition)
I Each digit occurs with the same probability p = 0.1
I Perplexity for a sequence of N digits D = d1 d2 · · · dN is
  PP(D) = p(d1 d2 · · · dN )^(−1/N)
        = ( 1 / p(d1 d2 · · · dN ) )^(1/N)
        = ( 1 / ∏_{i=1}^{N} p(di ) )^(1/N)
        = ( 1 / (1/10)^N )^(1/N)
        = 10
I How can we interpret this number?
31/46
Evaluation – Language Model Perplexity
I Intuitively, language models should assign high probability to “real language” they
have not seen before.
I Let x1:m be a sequence of m sentences, that we have not seen before (held-out or
test set)
I Probability of x1:m = ∏_{i=1}^{m} p(xi )  ⇒  Log probability of x1:m = Σ_{i=1}^{m} log2 p(xi )
I Average log probability per word of x1:m is:
  l = (1/M) Σ_{i=1}^{m} log2 p(xi )   where M = Σ_{i=1}^{m} |xi |
I Perplexity relative to x1:m is defined as 2^(−l)
I Intuitively, perplexity is average “confusion” after each word. Lower is better!
32/46
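A compact sketch of this evaluation (illustrative), assuming sent_logprob(x) is a hypothetical helper that returns the model's log2 probability of a tokenized held-out sentence x.

    def perplexity(sentences, sent_logprob):
        """PP = 2^(-l), where l is the average log2 probability per word."""
        total_logprob = sum(sent_logprob(x) for x in sentences)
        M = sum(len(x) for x in sentences)       # total number of word tokens
        l = total_logprob / M
        return 2 ** (-l)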
Understanding Perplexity
I 2^( −(1/M) Σ_{i=1}^{m} log2 p(xi ) ) is really a branching factor.
I Assign probability of 1 to the test data ⇒ perplexity = 1. No confusion.
I Assign probability of 1/V to each word ⇒ perplexity = V . Equal confusion after each word!
I Assign probability of 0 to anything ⇒ perplexity = ∞
I We really should have p(x) > 0 for any x ∈ V +
33/46
Entropy and Cross-entropy
34/46
Entropy and Cross-entropy
I Suppose the probabilities over the outcome of the race are not at all even.
I
Clinton 1/4 Huckabee 1/64
Edwards 1/16 McCain 1/8
Kucinich 1/64 Paul 1/64
Obama 1/2 Romney 1/64
I You can encode the winner using the following coding scheme
35/46
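As a check on the intuition (the arithmetic is mine, computed from the table above): an optimal code gives each outcome a codeword of length −log2 p, so the expected code length equals the entropy of the distribution,
    H = 1/2 · 1 + 1/4 · 2 + 1/8 · 3 + 1/16 · 4 + 4 · (1/64 · 6)
      = 0.5 + 0.5 + 0.375 + 0.25 + 0.375 = 2 bits,
so two bits suffice on average, versus three bits for a fixed-length code over the eight candidates.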
Another View
36/46
Bits vs Probabilities
37/46
Entropy
I Entropy of a distribution:
  H(p) = − Σ_{x∈X} p(x) log p(x)
I Always ≥ 0 and maximal when p is uniform:
  H(puniform ) = − Σ_{x∈X} (1/|X |) log (1/|X |) = log |X |
38/46
Cross-entropy
39/46
Cross-entropy and Betting
40/46
How does this Relate to Language Models?
41/46
What do n-gram Models Know?
42/46
Unigram Model Generation
first, from less the This different 2004), out which goal 19.2 Model
their It ˜(i?1), given 0.62 these (x0; match 1 schedule. x 60
1998. under by Notice we of stated CFG 120 be 100 a location accuracy
If models note 21.8 each 0 WP that the that Novak. to function; to
[0, to different values, model 65 cases. said -24.94 sentences not
that 2 In to clustering each K&M 100 Boldface X))] applied; In 104
S. grammar was (Section contrastive thesis, the machines table -5.66
trials: An the textual (family applications.We have for models 40.1 no
156 expected are neighborhood
43/46
Bigram Model Generation
44/46
Trigram Model Generation
45/46
The Trade-off
I As we increase n, the stuff the model generates looks better and better, and the
model gives better probabilities to the training data.
I But as n gets big, we tend toward the history model, which has a lot of zero counts
and therefore isn’t helpful for data we haven’t seen before.
I Generalizing vs. Memorizing
46/46
11-411
Natural Language Processing
Classification
Kemal Oflazer
1/36
Text Classification
I We have a set of documents (news items, emails, product reviews, movie reviews,
books, . . . )
I Classify this set of documents into a small set classes.
I Applications:
I Topic of a news article (classic example: finance, politics, sports, . . . )
I Sentiment of a movie or product review (good, bad, neutral)
I Email into spam or not or into a category (business, personal, bills, . . . )
I Reading level (K-12) of an article or essay
I Author of a document (Shakespeare, James Joyce, . . . )
I Genre of a document (report, editorial, advertisement, blog, . . . )
I Language identification
2/36
Notation and Setting
3/36
Evaluation
I Accuracy:
  A(classify) = Σ_{x∈V + , ℓ∈L : classify(x)=ℓ} p(x, ℓ)
where p is the true distribution over data. Error is 1 − A.
I This is estimated using a test set {(x1 , `1 ), (x2 , `2 ), · · · , (xm , `m )}
  Â(classify) = (1/m) Σ_{i=1}^{m} 1{classify(xi ) = ℓi }
4/36
Issues with Using Test Set Accuracy
I Class imbalance: if p(L = not spam) = 0.99, then you can get  ≈ 0.99 by always
guessing “not spam”
I Relative importance of classes or cost of error types.
I Variance due to the test data.
5/36
Evaluation in the Two-class case
I Suppose we have one of the classes t ∈ L as the target class.
I We would like to identify documents with label t in the test data.
I Like information retrieval
I We get (with A = number of documents actually labeled t, B = number of documents the classifier labeled t, and C = number of documents correctly labeled t):
I Precision P̂ = C / B (percentage of documents the classifier labeled as t that are correct)
I Recall R̂ = C / A (percentage of actual t documents correctly labeled as t)
I F1 = 2 P̂ R̂ / (P̂ + R̂)
6/36
A Different View – Contingency Tables
[2 × 2 contingency table with columns L = t and L ≠ t and rows for the classifier's decision; A, B, C above are totals over its cells (table omitted)]
7/36
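A small sketch of these metrics computed directly from decisions (my own illustration; the variable names are not from the slides).

    def precision_recall_f1(gold, predicted, target):
        """Precision, recall, and F1 for one target class from parallel label lists."""
        C = sum(1 for g, p in zip(gold, predicted) if p == target and g == target)
        B = sum(1 for p in predicted if p == target)     # labeled t by the classifier
        A = sum(1 for g in gold if g == target)          # actually labeled t
        prec = C / B if B else 0.0
        rec = C / A if A else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        return prec, rec, f1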
Evaluation with > 2 Classes
I Macroaveraged precision and recall: let each class be the target and report the
average P̂ and R̂ across all classes.
I Microaveraged precision and recall: pool all one-vs.-rest decisions into a single
contingency table, calculate P̂ and R̂ from that.
8/36
Cross-validation
I Remember that Â, P̂, R̂, and Fˆ1 are all estimates of the classifier’s quality under the
true data distribution.
I Estimates are noisy!
I K -fold cross validation
I Partition the training data into K nonoverlapping “folds”, x1 , x2 , . . . , xK
I For i ∈ {1, . . . , K}
I Train on x1:n \ xi , using xi as development data
I Estimate quality on the xi development set as Â^i
I Report average accuracy as Â = (1/K) Σ_{i=1}^{K} Â^i and perhaps also the standard deviation.
K i=1
9/36
Features in Text Classification
10/36
Spam Detection
11/36
Movie Ratings
12/36
Probabilistic Classification
classify(f ) = arg max_{ℓ∈L} p(ℓ | f ) = arg max_{ℓ∈L} p(ℓ, f ) / p(f )
13/36
Naive Bayes Classifier
p(L = ℓ, F1 = f1 , . . . , Fd = fd ) = p(ℓ) ∏_{j=1}^{d} p(Fj = fj | ℓ)
                                    = πℓ ∏_{j=1}^{d} θ_{fj | j,ℓ}
14/36
Generative vs Discriminative Classifier
I A discriminative classifier instead learns what features from the input are useful to
discriminate between possible classes.
15/36
The Most Basic Naive Bayes Classifier
16/36
The Most Basic Naive Bayes Classifier
17/36
The Most Basic Naive Bayes Classifier
classify(x) = arg max_{ℓ∈L} πℓ ∏_{j=1}^{|x|} p(xj | ℓ)
classify(x) = arg max_{ℓ∈L} log πℓ + Σ_{j=1}^{|x|} log p(xj | ℓ)
I All computations are done in log space to avoid underflow and increase speed.
I Class prediction is based on a linear combination of the inputs.
I Hence Naive Bayes is considered a linear classifier.
18/36
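A minimal sketch of this classifier with add-1 smoothing (my own illustration; docs is an assumed list of (tokens, label) pairs).

    import math
    from collections import Counter, defaultdict

    def train_nb(docs):
        """Log priors and add-1 smoothed log likelihoods from (tokens, label) pairs."""
        vocab = {w for tokens, _ in docs for w in tokens}
        label_counts = Counter(label for _, label in docs)
        word_counts = defaultdict(Counter)
        for tokens, label in docs:
            word_counts[label].update(tokens)
        log_prior = {l: math.log(c / len(docs)) for l, c in label_counts.items()}
        log_lik = {l: {w: math.log((word_counts[l][w] + 1) /
                                   (sum(word_counts[l].values()) + len(vocab)))
                       for w in vocab}
                   for l in label_counts}
        return log_prior, log_lik

    def classify_nb(tokens, log_prior, log_lik):
        # arg max over labels of log prior + sum of log likelihoods (OOV words ignored)
        return max(log_prior, key=lambda l: log_prior[l] +
                   sum(log_lik[l].get(w, 0.0) for w in tokens))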
An Example
p(“predictable” | −) = (1+1)/(14+20)      p(“predictable” | +) = (0+1)/(9+20)
p(“no” | −) = (1+1)/(14+20)               p(“no” | +) = (0+1)/(9+20)
p(“fun” | −) = (0+1)/(14+20)              p(“fun” | +) = (1+1)/(9+20)
I |V | = 20, N− = 14, N+ = 9
I π− = p(−) = 3/5, π+ = p(+) = 2/5
I With add-1 Laplace smoothing:
  p(−) p(s | −) = (3/5) × (2×2×1)/34³ = 6.1 × 10⁻⁵
  p(+) p(s | +) = (2/5) × (1×1×2)/29³ = 3.2 × 10⁻⁵
19/36
Other Optimizations for Sentiment Analysis
20/36
Formulation of a Discriminative Classifier
I A discriminative model computes p(` | x) to discriminate among different values of `,
using combinations of d features of x.
ℓ̂ = arg max_{ℓ∈L} p(ℓ | x)
I There is no obvious way to map features to probabilities.
I Assuming features are binary-valued and they are both functions of x and class ` we
can write
p(ℓ | x) = (1/Z) exp( Σ_{i=1}^{d} wi fi (ℓ, x) )
where Z is the normalization factor to make everything a probability and wi are
weights for features.
I p(` | x) can be then be formally defined with normalization as
p(ℓ | x) = exp( Σ_{i=1}^{d} wi fi (ℓ, x) ) / Σ_{ℓ′∈L} exp( Σ_{i=1}^{d} wi fi (ℓ′, x) )
21/36
Some Features
I Remember features are binary-valued and are both functions of x and class `.
I Suppose we are doing sentiment classification. Here are some sample feature
functions:
I f1 (ℓ, x) = 1 if “great” ∈ x and ℓ = +, 0 otherwise
I f2 (ℓ, x) = 1 if “second-rate” ∈ x and ℓ = −, 0 otherwise
I f3 (ℓ, x) = 1 if “no” ∈ x and ℓ = +, 0 otherwise
I f4 (ℓ, x) = 1 if “enjoy” ∈ x and ℓ = −, 0 otherwise
22/36
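A sketch of how such binary features feed the softmax above (illustration only; the four features mirror f1 to f4 from the slide, and the weight vector w is assumed to be learned elsewhere).

    import math

    def features(label, tokens):
        """Binary feature vector in the spirit of f1-f4."""
        return [
            1.0 if "great" in tokens and label == "+" else 0.0,
            1.0 if "second-rate" in tokens and label == "-" else 0.0,
            1.0 if "no" in tokens and label == "+" else 0.0,
            1.0 if "enjoy" in tokens and label == "-" else 0.0,
        ]

    def p_label_given_x(label, tokens, w, labels=("+", "-")):
        """exp(w . f(label, x)) normalized over all labels."""
        score = lambda l: math.exp(sum(wi * fi for wi, fi in zip(w, features(l, tokens))))
        return score(label) / sum(score(l) for l in labels)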
Mapping to a Linear Formulation
23/36
Two-class Classification with Linear Models
I Big idea: “map” a document x into a d-dimensional (feature) vector Φ(x), and learn a
hyperplane defined by vector w = [w1 , w2 , . . . , wd ].
I Linear decision rule:
I Decide on class 1 if w · Φ(x) > 0
I Decide on class 2 if w · Φ(x) ≤ 0
24/36
Two-class Classification with Linear Models
25/36
Two-class Classification with Linear Models
I There may not be a separation hyperplane. The data is not linearly separable!
26/36
Two-class Classification with Linear Models
27/36
The Perceptron Learning Algorithm for Two Classes
28/36
Linear Models for Classification
I Big idea: “map” a document x into a d-dimensional (feature) vector Φ(x, `), and learn
a hyperplane defined by vector w = [w1 , w2 , . . . , wd ].
I Linear decision rule
where Φ : V + × L → Rd
I Parameters are w ∈ Rd .
29/36
A Geometric View of Linear Classifiers
30/36
A Geometric View of Linear Classifiers
I Suppose we have an instance x and L = {y1 , y2 , y3 , y4 }.
I We have two simple binary features φ1 , and φ2
I Suppose w is such that w · Φ = w1 φ1 + w2 φ2
31/36
A Geometric View of Linear Classifiers
I Suppose we have an instance x and L = {y1 , y2 , y3 , y4 }.
I We have two simple binary features φ1 , and φ2
I Suppose w is such that w · Φ = w1 φ1 + w2 φ2
distance(w · Φ, Φ′ ) = |w · Φ′ | / ‖w‖2 ∝ |w · Φ′ |
I So w · Φ(x, y1 ) > w · Φ(x, y3 ) > w · Φ(x, y4 ) > w · Φ(x, y2 ) ≥ 0
32/36
A Geometric View of Linear Classifiers
33/36
Where do we get w? The Perceptron Learner
I Start with w = 0
I Go over the training samples and adjust w to minimize the deviation from correct
labels.
min_w Σ_{i=1}^{n} [ max_{ℓ′∈L} w · Φ(xi , ℓ′ ) − w · Φ(xi , ℓi ) ]
I The perceptron learning algorithm is a stochastic subgradient descent algorithm on
above.
I For t ∈ {1, . . . , T }
I Pick it uniformly at random from {1, . . . , n}
I ℓ̂it ← arg max_{ℓ∈L} w · Φ(xit , ℓ)
I w ← w − α ( Φ(xit , ℓ̂it ) − Φ(xit , ℓit ) )
I Return w
34/36
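A compact sketch of this training loop (illustrative; phi(x, label) is an assumed feature function returning a fixed-length list of floats, and data is a list of (x, gold label) pairs).

    import random

    def train_perceptron(data, labels, phi, dim, T=1000, alpha=1.0):
        """Stochastic multiclass perceptron: when the best-scoring label is wrong,
        move w toward the gold features and away from the predicted ones."""
        w = [0.0] * dim
        score = lambda x, l: sum(wi * fi for wi, fi in zip(w, phi(x, l)))
        for _ in range(T):
            x, gold = random.choice(data)
            pred = max(labels, key=lambda l: score(x, l))
            if pred != gold:
                f_gold, f_pred = phi(x, gold), phi(x, pred)
                w = [wi - alpha * (fp - fg) for wi, fp, fg in zip(w, f_pred, f_gold)]
        return w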
Gradient Descent
35/36
More Sophisticated Classification
I Take into account error costs if all mistakes are not equally bad (false positives vs. false negatives in spam detection).
I Use maximum-margin techniques (e.g., Support Vector Machines), which try to find the best separating hyperplane that is far from the training examples.
I Use kernel methods to map vectors into much higher-dimensional spaces, almost for free, where they may be linearly separable.
I Use feature selection to find the most important features and throw out the rest.
I Take the machine learning class if you are interested in these.
36/36
11-411
Natural Language Processing
Part-of-Speech Tagging
Kemal Oflazer
1/41
Motivation
2/41
What are Part-of-Speech Tags?
3/41
English Nouns
4/41
Why have Part-of-Speech Tags?
I It is an “abstraction” mechanism.
I There are too many words.
I You would need a lot of data to train models.
I Your model would be very specific.
I POS Tags allow for generalization and allow for useful reduction in model sizes.
I There are many different tagsets: You want the right one for your task
5/41
How do we know the class?
I Substitution test
I The ADJ cat sat on the mat.
I The blue NOUN sits on the NOUN.
I The blue cat VERB on the mat.
I The blue cat sat PP the mat.
6/41
What are the Classes?
7/41
Broad Classes
8/41
Finer-grained Classes
9/41
Hard Cases
10/41
Other Classes
11/41
Penn Treebank Tagset for English
12/41
Others Tagsets for English and for Other Languages
13/41
Some Tagged Text from The Penn Treebank Corpus
14/41
How Bad is Ambiguity?
Tags Token Tags Token Count POS/Token
7 down 5 run 317 RB/down
6 that 5 repurchase 200 RP/down
6 set 5 read 138 IN/down
6 put 5 present 10 JJ/down
6 open 5 out 1 VBP/down
6 hurt 5 many 1 RBR/down
6 cut 5 less 1 NN/down
6 bet 5 left
6 back
5 vs,
5 the
5 spread
5 split
5 say
5 ’s
15/41
Some Tags for “down”
One/CD hundred/CD and/CC ninety/CD two/CD former/JJ greats/NNS ,/, near/JJ
greats/NNS ,/, hardly/RB knowns/NNS and/CC unknowns/NNS begin/VBP a/DT 72-game/JJ
,/, three-month/JJ season/NN in/IN spring-training/NN stadiums/NNS up/RB and/CC
down/RB Florida/NNP ...
He/PRP will/MD keep/VB the/DT ball/NN down/RP ,/, move/VB it/PRP around/RB ...
As/IN the/DT judge/NN marched/VBD down/IN the/DT center/JJ aisle/NN in/IN his/PRP$
flowing/VBG black/JJ robe/NN ,/, he/PRP was/VBD heralded/VBN by/IN a/DT trumpet/NN
fanfare/NN ...
Other/JJ Senators/NNP want/VBP to/TO lower/VB the/DT down/JJ payments/NNS
required/VBN on/IN FHA-insured/JJ loans/NNS ...
Texas/NNP Instruments/NNP ,/, which/WDT had/VBD reported/VBN Friday/NNP that/IN
third-quarter/JJ earnings/NNS fell/VBD more/RBR than/IN 30/CD %/NN from/IN the/DT
year-ago/JJ level/NN ,/, went/VBD down/RBR 2/CD 1/8/CD to/TO 33/CD on/IN 1.1/CD
million/CD shares/NNS ....
Because/IN hurricanes/NNS can/MD change/VB course/NN rapidly/RB ,/, the/DT
company/NN sends/VBZ employees/NNS home/NN and/CC shuts/NNS down/VBP
operations/NNS in/IN stages/NNS : /: the/DT closer/RBR a/DT storm/NN gets/VBZ ,/,
the/DT more/RBR complete/JJ the/DT shutdown/NN ...
Jaguar/NNP ’s/POS American/JJ depositary/NN receipts/NNS were/VBD up/IN 3/8/CS
yesterday/NN in/IN a/DT down/NN market/NN ,/, closing/VBG at/IN 10/CD ...
16/41
Some Tags for “Japanese”
17/41
How we do POS Tagging?
18/41
Markov Models for POS Tagging
where t̂ is the tag sequence that maximizes the argument of the arg max .
19/41
Basic Equation and Assumptions for POS Tagging
t̂1:n = arg max_{t1:n} p(t1:n | w1:n ) ≈ arg max_{t1:n} ∏_{i=1}^{n} p(wi | ti ) p(ti | ti−1 ),
where p(wi | ti ) is the emission probability and p(ti | ti−1 ) is the transition probability.
21/41
Bird’s Eye View of p(ti | ti−1 )
22/41
Bird’s Eye View of p(wi | ti )
23/41
Estimating Probabilities
I We can estimate these probabilities from a tagged training corpus using maximum likelihood estimation.
I Transition probabilities: p(ti | ti−1 ) = c(ti−1 , ti ) / c(ti−1 )
I Emission probabilities: p(wi | ti ) = c(ti , wi ) / c(ti )
24/41
The Setting
25/41
The Forward Algorithm
26/41
The Forward Algorithm
I Computes αi ( j) = p(w1 , w2 , . . . , wi , qi = j | λ)
I The total probability of observing w1 , w2 , . . . , wi and landing in state j after emitting i words.
I Let’s define some short-cuts:
I αi−1 (k): the previous forward probability from the previous stage (word)
I akj = p(tj | tk )
I bj (wi ) = p(wi | tj )
I αi ( j) = Σ_{k=1}^{N} αi−1 (k) · akj · bj (wi )
I αn (F ) = p(w1 , w2 , . . . , wn , qn = F | λ) is the total probability of observing
w1 , w2 , . . . , wn .
I We really do not need αs. We just wanted to motivate the trellis.
I We are actually interested in the most likely sequence of states (tags) that we go through while “emitting” w1 , w2 , . . . , wi . These would be the most likely tags!
27/41
Viterbi Decoding
I Computes vi ( j) = max p(q0 , q1 , . . . qi−1 , w1 , w2 , . . . , wi , qi = j | λ)
q0 ,q1 ,...qi−1
I vi ( j) is the maximum probability of observing w1 , w2 , . . . , wi after emitting i words
while going through some sequence of states (tags) q0 , q1 , . . . qi−1 before landing in
state qi = j.
I We can recursively define
28/41
Viterbi Algorithm
I Initialization:
v1 ( j) = a0j · bj (w1 ) 1 ≤ j ≤ N
bt1 ( j) = 0
I Recursion:
  vi ( j) = max_{1≤k≤N} vi−1 (k) · akj · bj (wi )
  bti ( j) = arg max_{1≤k≤N} vi−1 (k) · akj
29/41
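A sketch of Viterbi decoding in this notation (illustration only; the transition table a[k][j] with state 0 as START, the emission table b[j] mapping words to probabilities, and the small floor for unseen words are all assumptions of the example).

    def viterbi(words, states, a, b):
        """v[i][j] = max_k v[i-1][k] * a[k][j] * b[j][w_i], with backpointers."""
        v = [{j: a[0][j] * b[j].get(words[0], 1e-10) for j in states}]
        back = [{j: None for j in states}]
        for i in range(1, len(words)):
            v.append({})
            back.append({})
            for j in states:
                best_k = max(states, key=lambda k: v[i - 1][k] * a[k][j])
                v[i][j] = v[i - 1][best_k] * a[best_k][j] * b[j].get(words[i], 1e-10)
                back[i][j] = best_k
        last = max(states, key=lambda j: v[-1][j])     # best final state
        tags = [last]
        for i in range(len(words) - 1, 0, -1):         # follow backpointers
            tags.append(back[i][tags[-1]])
        return list(reversed(tags))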
Viterbi Decoding
30/41
Viterbi Decoding
31/41
Viterbi Decoding
32/41
Viterbi Decoding
33/41
Viterbi Decoding
34/41
Viterbi Decoding
35/41
Viterbi Decoding
I Once you are at i = n, you have to land in the END state (F ), then use the backtrace
to find the previous state you came from and recursively trace backwards to find t̂1:n .
36/41
Viterbi Decoding Example
37/41
Viterbi Decoding Example
38/41
Viterbi Decoding Example
39/41
Unknown Words
I They are unlikely to be closed class words.
I They are most likely to be nouns or proper nouns, less likely, verbs.
I Exploit capitalization – most likely proper nouns.
I Exploit any morphological hints: -ed most likely past tense verb, -s, most likely plural
noun or present tense verb for 3rd person singular.
I Build separate models of the sort . . .
Kemal Oflazer
1/12
Syntax
2/12
Syntax vs. Morphology
3/12
Syntax vs. Semantics
4/12
Two Approaches to Syntactic Structure
I Dependency Grammar:
I The basic unit of syntactic structure is a binary relation between words called a
dependency.
5/12
Constituents
6/12
Constituents
7/12
Noun Phrases
8/12
Prepositional Phrases
I I arrived on Tuesday.
I I arrived in March.
I I arrived under the leaking roof.
I I arrived with the elephant I love to hate.
9/12
Sentences/Clauses
10/12
Recursion and Constituents
I This is the house.
I This is the house that Jack built.
I This is the cat that lives in the house that Jack built.
I This is the dog that chased the cat that lives in the house that Jack built.
I This is the flea that bit the dog that chased the cat that lives in the house that Jack built.
I This is the virus that infected the flea that bit the dog that chased the cat that lives in
the house that Jack built.
I Non-constituents
I If on a Winter’s Night a Traveler
I Nuclear and Radiochemistry
I The Fire Next Time
I A Tad Overweight, but Violet Eyes to Die For
I Sometimes a Great Notion
I [how can we know the] Dancer from the Dance
11/12
Describing Phrase Structure / Constituency Grammars
12/12
11-411
Natural Language Processing
Formal Languages and Chomsky Hierarchy
Kemal Oflazer
1/53
Brief Overview of Formal Language Concepts
2/53
Strings
3/53
Strings
I Alternatively, Σ∗ = {x1 . . . xn | n ≥ 0 and xi ∈ Σ for all i}
I Φ denotes the empty set of strings: Φ = {}
I but Φ∗ = {ε}
4/53
Sets of Languages
I The power set of Σ∗ , the set of all its subsets, is denoted as 2^{Σ∗}
5/53
Describing Languages
6/53
Describing Languages
7/53
Identifying Nonregular Languages
8/53
The Pigeonhole Principle
I If there are n pigeons and m holes and n > m, then at least one hole has more than one pigeon.
9/53
The Pigeonhole Principle
10/53
The Pigeonhole Principle
I When traversing the DFA with the string ω , if the number of transitions ≥ number of
states, some state q has to repeat!
I Transitions are pigeons, states are holes.
11/53
Pumping a String
I Consider a string ω = xyz with
I |y| ≥ 1
I |xy| ≤ m (m is the number of states)
I If ω = xyz ∈ L, then so are xyⁱz for all i ≥ 0
I The substring y can be pumped.
I So if a DFA accepts a sufficiently long string, then it accepts an infinite number of
strings!
12/53
There are Nonregular Languages
13/53
Is English Regular?
14/53
Grammars
15/53
Grammars - An Example
16/53
How does a grammar work?
17/53
Types of Grammars
18/53
Formal Definition of a Grammar
19/53
Types of Grammars
I Regular Grammars
I Left-linear: All rules are either like X → Ya or like X → a with X, Y ∈ V and a ∈ Σ∗
I Right-linear: All rules are either like X → aY or like X → a with X, Y ∈ V and a ∈ Σ∗
I Context-free Grammars
I All rules are like X → Y with X ∈ V and Y ∈ (Σ ∪ V)∗
I Context-sensitive Grammars
I All rules are like LXR → Y with X ∈ V and R, Y, L ∈ (Σ ∪ V)∗
I General Grammars
I All rules are like X → Y with X, Y ∈ (Σ ∪ V)∗
20/53
Chomsky Normal Form
I CFGs in certain standard forms are quite useful for some computational problems.
A → BC or A → a
21/53
Chomsky Hierarchy
22/53
Parse Trees
S
a S b
23/53
A Grammar for a Fragment of English
S → NP VP
NP → CN | CN PP
VP → CV | CV PP
PP → P NP
CN → DT N
CV → V | V NP
DT → a | the
N → boy | girl | flower | telescope
V → touches | likes | sees | gives
P → with | to
Nomenclature: S: Sentence, NP: Noun Phrase, VP: Verb Phrase, PP: Prepositional Phrase, P: Preposition, DT: Determiner, N: Noun, V: Verb
24/53
A Grammar for a Fragment of English
S → NP VP
NP → CN | CN PP
VP → CV | CV PP
PP → P NP
CN → DT N
CV → V | V NP
DT → a | the
N → boy | girl | flower | telescope
V → touches | likes | sees | gives
P → with | to

Derivation:
S ⇒ NP VP ⇒ CN PP VP ⇒ DT N PP VP ⇒ a N PP VP ⇒ · · · ⇒ a boy with a flower VP ⇒ a boy with a flower CV PP ⇒ · · · ⇒ a boy with a flower sees a girl with a telescope
25/53
English Parse Tree
[Parse tree for "a boy with a flower sees a girl with a telescope": the PP "with a telescope" attaches to the VP]
I This structure is for the interpretation where the boy is seeing with the telescope!
26/53
English Parse Tree
Alternate Structure
[Parse tree for the same sentence: the PP "with a telescope" attaches inside the object NP "a girl with a telescope"]
I This is for the interpretation where the girl is carrying a telescope.
27/53
Structural Ambiguity
28/53
Some NLP Considerations - Linguistic Grammaticality
29/53
Some NLP Considerations – Getting it Right
30/53
Some NLP Considerations – Why are we Building Grammars?
I Consider:
I Oswald shot Kennedy.
I Oswald, who had visited Russia recently, shot Kennedy.
I Oswald assassinated Kennedy
I Who shot Kennedy?
I Consider
I Oswald shot Kennedy.
I Kennedy was shot by Oswald.
I Oswald was shot by Ruby.
I Who shot Oswald?
I Active/Passive
I Oswald shot Kennedy.
I Kennedy was shot by Oswald.
I Relative clauses
I Oswald who shot Kennedy was shot by Ruby.
I Kennedy whom Oswald shot didn’t shoot anybody.
31/53
Language Myths: Subject
32/53
Subject and Object
33/53
Looking Forward
I CFGs may not be entirely adequate for capturing the syntax of natural languages
I They are almost adequate.
I They are computationally well-behaved (in that you can build relatively efficient parsers for
them, etc.)
I But they are not very convenient as a means for handcrafting a grammar.
I They are not probabilistic. But we will add probabilities to them soon.
34/53
Parsing Context-free Languages
35/53
The Cocke-Younger-Kasami (CYK) algorithm
36/53
The CYK Algorithm
I Consider w = a1 a2 · · · an , ai ∈ Σ
I Suppose we could cut up the string into two parts u = a1 a2 ..ai and
v = ai+1 ai+2 · · · an
∗ ∗
I Now suppose A ⇒ u and B ⇒ v and that S → AB is a rule.
A B
← u → ← v →
a1 ai ai+1 an
37/53
The CYK Algorithm
[Diagram: S at the root, with A deriving u = a1 · · · ai and B deriving v = ai+1 · · · an ; A and B are in turn built from smaller pieces (C, D and E, F)]
38/53
The CYK Algorithm
[Diagram: the same decomposition applied recursively, A from C and D, B from E and F]
39/53
DIGRESSION - Dynamic Programming
I An algorithmic paradigm
I Essentially like divide-and-conquer but subproblems overlap!
I Results of subproblem solutions are reusable.
I Subproblem results are computed once and then memoized
I Used in solutions to many problems
I Length of longest common subsequence
I Knapsack
I Optimal matrix chain multiplication
I Shortest paths in graphs with negative weights (Bellman-Ford Alg.)
40/53
(Back to) The CYK Algorithm
I Let w = a1 a2 · · · an .
I We define
I wi, j = ai · · · aj (substring between positions i and j)
∗
I Vi, j = {A ∈ V | A ⇒ wi, j }(j ≥ i) (all variables which derive wi, j )
I w ∈ L(G) iff S ∈ V1,n
I How do we compute Vi, j (j ≥ i)?
41/53
The CYK Algorithm
42/53
The CYK Algorithm
Vi,j = ⋃_{i≤k<j} { A : A → BC and B ∈ Vi,k and C ∈ Vk+1,j }
43/53
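A minimal CYK recognizer that follows this recurrence (my own sketch; the grammar is assumed to be given as a list of binary rules (A, (B, C)) and a list of lexical rules (A, a)).

    def cyk_recognize(words, binary_rules, lexical_rules, start="S"):
        """V[i][j] = set of nonterminals deriving words[i..j] (0-indexed, inclusive)."""
        n = len(words)
        V = [[set() for _ in range(n)] for _ in range(n)]
        for i, w in enumerate(words):
            V[i][i] = {A for A, a in lexical_rules if a == w}
        for span in range(2, n + 1):                  # substring length
            for i in range(n - span + 1):
                j = i + span - 1
                for k in range(i, j):                 # split point
                    for A, (B, C) in binary_rules:
                        if B in V[i][k] and C in V[k + 1][j]:
                            V[i][j].add(A)
        return start in V[0][n - 1]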
The CYK Algorithm
44/53
The CYK Algorithm in Action
45/53
The CYK Algorithm in Action
46/53
A CNF Grammar for a Fragment of English
Original grammar:
S → NP VP
NP → CN | CN PP
VP → CV | CV PP
PP → P NP
CN → DT N
CV → V | V NP
DT → a | the
N → boy | girl | flower | telescope
V → touches | likes | sees | gives
P → with | to

Grammar in Chomsky Normal Form:
S → NP VP
NP → CN PP
NP → DT N
VP → CV PP
VP → V NP
VP → touches | likes | sees | gives
PP → P NP
CN → DT N
CV → V NP
CV → touches | likes | sees | gives
DT → a | the
N → boy | girl | flower | telescope
V → touches | likes | sees | gives
P → with | to
47/53
English Parsing Example with CYK
Grammar (CNF):
S → NP VP
NP → CN PP
NP → DT N
VP → CV PP
VP → V NP
VP → touches | likes | sees | gives
PP → P NP
CN → DT N
CV → V NP
CV → touches | likes | sees | gives
DT → a | the
N → boy | girl | flower | telescope
V → touches | likes | sees | gives
P → with | to

Chart for "the boy sees a girl" (i → 1..5):
           the        boy   sees          a     girl
length 1:  {DT}       {N}   {V, CV, VP}   {DT}  {N}
length 2:  {CN, NP}   {}    {}            {CN, NP}
length 3:  {S}        {}    {CV, VP}
length 4:  {}         {}
length 5:  {S} ✓
48/53
Some Languages are NOT Context-free
I Jan säit das mer d’chind em Hans es huus haend wele laa hälfe aastriiche.
I Jan says that we the children Hans the house have wanted to let help paint.
I “Jan says that we have wanted to let the children help Hans paint the house.”
50/53
Is Swiss German Context-free?
I L1 = { Jan säit das mer (d’chind)∗ (em Hans)∗ es huus haend wele (laa)∗ (hälfe)∗
aastriiche.}
I L2 = { Swiss German }
I L1 ∩ L2 = { Jan säit das mer (d’chind)n (em Hans)m es huus haend wele (laa)n
(hälfe)m aastriiche.} ≡ L = {xan ybm zcn wdm u | n ≥ 0}
51/53
English “Respectively” Construct
I Alice, Bob and Carol will have a juice, a tea and a coffee, respectively.
I Again mildly context-sensitive!
52/53
Closing Remarks
53/53
11-411
Natural Language Processing
Treebanks and
Probabilistic Parsing
Kemal Oflazer
1/34
Probabilistic Parsing with CFGs
I The basic CYK Algorithm is not probabilistic: It builds a table from which all
(potentially exponential number of) parse trees can be extracted.
I Note that while computing the table needs O(n3 ) work, computing all trees could require
exponential work!
I Computing all trees is not necessarily useful either. How do you know which one is the
correct or best tree?
I We need to incorporate probabilities in some way.
I But where do we get them?
2/34
Probabilistic Context-free Grammars
3/34
PCFG Example
4/34
PCFG Example
Aux NP VP
5/34
PCFG Example
Aux NP VP
does
6/34
PCFG Example
Aux NP VP
does Det N
7/34
PCFG Example
Aux NP VP
does Det N
this
8/34
PCFG Example
Aux NP VP
does Det N
this flight
p(flight | N)
9/34
PCFG Example
Aux NP VP
this flight
10/34
PCFG Example
Aux NP VP
11/34
PCFG Example
Aux NP VP
12/34
PCFG Example
S
Aux NP VP
Aux NP VP
a meal
I “I have a tree of the sentence I want to utter in my mind; by the time I utter it, only the words come out.”
I The PCFG defines the source model.
I The channel is deterministic: it erases everything except the leaves!
I If I observe a sequence of words comprising a sentence, what is the best tree
structure it corresponds to?
I Find tree t̂ = arg max p(t | x)
Trees t
with yield x
I How do we set the probabilities p(right hand side | left hand side)?
I How do we decode/parse?
15/34
Probabilistic CYK
I Input
I a PCFG (V, S, Σ, R, p(∗ | ∗)) in Chomsky Normal Form.
I a sentence x of length n words.
I Output
I t̂ = arg max p(t | x) (if x is in the language of the grammar.)
t∈Tx
I Tx : all trees with yield x.
16/34
Probabilistic CYK
I Base case: si:i (V ) = p(xi | V )
I Inductive case: for each i, j with 1 ≤ i < j ≤ n and V ∈ V:
  si:j (V ) = max_{V → VL VR , i≤k<j} p(VL VR | V ) · si:k (VL ) · sk+1:j (VR )
I Solution: s1:n (S) = max_{t∈Tx} p(t)
17/34
Parse Chart
i→ 1 2 3 4 5
the boy sees a girl
s1:1 (∗) s2:2 (∗) s3:3 (∗) s4:4 (∗) s5:5 (∗)
s1:2 (∗) s2:3 (∗) s3:4 (∗) s4:5 (∗)
s1:3 (∗) s2:4 (∗) s3:5 (∗)
s1:4 (∗) s2:5 (∗)
s1:5 (∗)
I Again, each entry is a table, mapping each nonterminal V to si:j (V ), the maximum
probability for deriving the fragment . . . xi , . . . , xj . . . from the nonterminal V .
18/34
Remarks
19/34
More Refined Models
Starting Point
20/34
More Refined Models
Parent Annotation
i
22/34
More Refined Models
Lexicalization
24/34
Penn Treebank
25/34
Example Sentence from Penn Treebank
26/34
Example Sentence Encoding from Penn Treebank
27/34
More PTB Trees
28/34
More PTB Trees
29/34
Treebanks as Grammars
30/34
Interesting PTB Rules
I VP → VBP PP PP PP PP PP ADVP PP
I This mostly happens because we go from football in the fall to lifting in the winter to
football again in the spring.
I NP → DT JJ JJ VBG NN NNP NNP FW NNP
I The state-owned industrial holding company Instituto Nacional de Industria . . .
31/34
Some Penn Treebank Rules with Counts
32/34
Parser Evaluation
I Represent a parse tree as a collection of tuples
{(`1 , i1 , j1 ), (`2 , i2 , j2 ),. . . , (`m , im , jm )} where
I `k is the nonterminal labeling kth phrase.
I ik is the index of the first word in the kth phrase.
I jk is the index of the last word in the kth phrase.
I Convert gold-standard tree and system hypothesized tree into this representation,
then estimate precision, recall, and F1 .
33/34
Tree Comparison Example
I In both trees: {(NP, 1, 1), (S, 1, 7), (VP, 2, 7), (PP, 5, 7), (NP, 6, 7), (Nominal, 4, 4)}
I In the left (hypothesized) tree: {(NP, 3, 7), (Nominal, 4, 7)}
I In the right (gold) tree: {(VP, 2, 4), (NP, 3, 4)}
I P = 6/8, R = 6/8
34/34
11-411
Natural Language Processing
Earley Parsing
Kemal Oflazer
1/29
Earley Parsing
I Remember that CKY parsing works only for grammar in Chomsky Normal Form
(CNF)
I Need to convert grammar to CNF.
I The structure may not necessarily be “natural”.
I CKY is bottom-up – may be doing unnecessary work.
I Earley algorithm allows arbitrary CFGs.
I So no need to convert your grammar.
I Earley algorithm is a top-down algorithm.
2/29
Earley Parsing
I The Earley parser fills a table (sometimes called a chart) in a single sweep over the
input.
I For an n word sentence, the table is of size n + 1.
I Table entries represent
I In-progress constituents
I Predicted constituents.
I Completed constituents and their locations in the sentence
3/29
Table Entries
I Table entries are called states and are represented with dotted-rules.
I S → • VP (a VP is predicted)
4/29
States and Locations
5/29
The Early Table Layout
6/29
Earley – High-level Aspects
I As with most dynamic programming approaches, the answer is found by looking in the
table in the right place.
I In this case, there should be an S state in the final column that spans from 0 to n and
is complete. That is,
I S → α • [0, n]
I If that is the case, you are done!
I So sweep through the table from 0 to n
I New predicted states are created by starting top-down from S
I New incomplete states are created by advancing existing states as new constituents are
discovered.
I New complete states are created in the same way.
7/29
Earley – High-level Aspects
8/29
Earley – Main Functions: Predictor
9/29
Earley– Prediction
ROOT → •S[0, 0]
S → •NP VP[0, 0]
S → •VP[0, 0]
...
VP → •V NP[0, 0]
...
NP → •DT N[0, 0]
10/29
Earley – Main Functions: Scanner
11/29
Earley– Scanning
12/29
Earley – Main Functions: Completer
I If you have a completed state spanning [j, k] with B as the left hand side.
I then, for each state in chart position j (with some span [i, j], that is immediately
looking for a B),
I move the dot to after B,
I extend the span to [i, k]
I then enqueue the updated state in chart position k.
13/29
Earley– Completion
14/29
Earley – Main Functions: Enqueue
I Just enter the given state to the chart-entry if it is not already there.
15/29
The Earley Parser
16/29
Extended Earley Example
I
0 Book 1 that 2 flight 3
I We should find a completed state at chart position 3
I with left hand side S and is spanning [0, 3]
17/29
Extended Earley Example Grammar
18/29
Extended Earley Example
19/29
Extended Earley Example
20/29
Extended Earley Example
21/29
Extended Earley Example
22/29
Extended Earley Example
23/29
Final Earley Parse
24/29
Comments
25/29
Probabilistic Earley Parser
26/29
General Chart Parsing
27/29
Implementing Parsing as Search
Agenda = {state0}
while (Agenda not empty):
    s = pop a state from Agenda
    if s is a success-state: return s      // we have a parse
    else if s is not a failure-state:
        generate new states from s
        push new states onto Agenda
return nil                                 // no parse
I Fundamental Rule of Chart Parsing: if you can combine two contiguous edges to
make a bigger one, do it.
I Akin to the Completer function in Earley.
I How you interact with the agenda is called a strategy.
28/29
Is Ambiguity Solved?
29/29
11-411
Natural Language Processing
Dependency Parsing
Kemal Oflazer
1/47
Dependencies
I Turkish Treebank
2/47
Dependency Tree: Definition
Let x = [x1 , . . . , xn ] be a sentence. We add a special ROOT symbol as “x0 ”.
Different annotation schemes define different label sets L, and different constraints on the
set of tuples. Most commonly:
I The tuple is represented as a directed edge from xp to xc with label `.
I The directed edges form a directed tree with x0 as the root (sometimes denoted as ROOT).
3/47
Example
NP VP
Pronoun Verb NP
our cats
Phrase-structure tree
4/47
Example
NP VP
Pronoun Verb NP
our cats
5/47
Example
Swash
NPwe VPwash
our cats
6/47
Example
ROOT
7/47
Example
ROOT
8/47
Example
ROOT
9/47
Labels
ROOT
POBJ
I Direct Object
I Indirect Object
I Preposition Object
I Adjectival Modifier
I Adverbial Modifier
10/47
Problem: Coordination Structures
ROOT
11/47
Coordination Structures: Proposal 1
ROOT
12/47
Coordination Structures: Proposal 2
ROOT
13/47
Coordination Structures: Proposal 3
ROOT
14/47
Dependency Trees
[Example dependency trees omitted; each is rooted at ROOT]
16/47
Dependencies and Grammar
17/47
Three Approaches to Dependency Parsing
18/47
Transition-based Parsing
I Process x once, from left to right, making a sequence of greedy parsing decisions.
I Formally, the parser is a state machine (not a finite-state machine) whose state is
represented by a stack S and a buffer B.
I Initialize the buffer to contain x and the stack to contain the ROOT symbol.
Buffer B
we
Stack S vigorously
wash
ROOT our
cats
who
stink
Buffer B
we
Stack S vigorously
wash
ROOT our
cats
who
stink
Actions:
20/47
Transition-based Parsing Example
Buffer B
Stack S vigorously
wash
we our
ROOT cats
who
stink
Actions: SHIFT
21/47
Transition-based Parsing Example
Buffer B
Stack S
wash
vigorously our
we cats
ROOT who
stink
22/47
Transition-based Parsing Example
Stack S Buffer B
wash our
vigorously cats
we who
ROOT stink
23/47
Transition-based Parsing Example
Stack S
Buffer B
our
vigorously wash cats
who
we stink
ROOT
24/47
Transition-based Parsing Example
Stack S
Buffer B
our
cats
we vigorously wash who
stink
ROOT
25/47
Transition-based Parsing Example
Stack S
Buffer B
our
cats
who
we vigorously wash stink
ROOT
26/47
Transition-based Parsing Example
Stack S
cats
Buffer B
our
who
stink
we vigorously wash
ROOT
Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT
27/47
Transition-based Parsing Example
Stack S
who
stink
we vigorously wash
ROOT
Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT LEFT- ARC
28/47
Transition-based Parsing Example
Stack S
who
stink
we vigorously wash
ROOT
Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT LEFT- ARC SHIFT
29/47
Transition-based Parsing Example
Stack S
stink
who
we vigorously wash
ROOT
Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT LEFT- ARC SHIFT SHIFT
30/47
Transition-based Parsing Example
Stack S
who stink
Buffer B
our cats
we vigorously wash
ROOT
Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT LEFT- ARC SHIFT SHIFT
RIGHT- ARC
31/47
Transition-based Parsing Example
Stack S
we vigorously wash
ROOT
Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT LEFT- ARC SHIFT SHIFT
RIGHT- ARC RIGHT- ARC
32/47
Transition-based Parsing Example
Stack S
Buffer B
Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT LEFT- ARC SHIFT SHIFT
RIGHT- ARC RIGHT- ARC
33/47
Transition-based Parsing Example
Stack S
ROOT
Buffer B
Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT LEFT- ARC SHIFT SHIFT
RIGHT- ARC RIGHT- ARC RIGHT- ARC
34/47
The Core of Transition-based Parsing
35/47
Transition-based Parsing: Remarks
36/47
Dependency Parsing Evaluation
I Unlabeled attachment score: Did you identify the head and the dependent
correctly?
I Labeled attachment score: Did you identify the head and the dependent AND the
label correctly?
37/47
Dependency Examples from Other Languages
38/47
Dependency Examples from Other Languages
Subj
Poss Mod
Det Mod Mod Loc Adj Mod
Bu okul+da +ki öğrenci+ler+in en akıl +lı +sı şura+da dur +an küçük kız +dır
39/47
Dependency Examples from Other Languages
<S>
<W IX=1 LEM="bu" MORPH="bu" IG=[(1, "bu+Det")] REL=[(3,1,(DETERMINER)]>
Bu </W>
<W IX=2 LEM="eski"’ MORPH="eski" IG=[(1, "eski+Adj")]
REL=[3,1,(MODIFIER)]> eski> </W>
<W IX=3 LEM="bahçe" MORPH="bahçe+DA+ki" IG=[(1, "bahçe+A3sg+Pnon+Loc")
(2, "+Adj+Rel")] REL=[4,1,(MODIFIER)]> bahçedeki </W>
<W IX=4 LEM="gül" MORPH="gül+nHn" IG=[(1,"gül+Noun+A3sg+Pnon+Gen")]
REL=[6,1,(SUBJECT)]> gülün </W>
<W IX=5 LEM="böyle" MORPH="böyle" IG=[(1,"böyle+Adv")]
REL=[6,1,(MODIFIER)]> böyle </W>
<W IX=6 LEM="büyü" MORPH="büyü+mA+sH" IG=[(1,"büyü+Verb+Pos") (2,
"+Noun+Inf+A3sg+P3sg+Nom")] REL=[9,1,(SUBJECT)]> büyümesi </W>
<W IX=7 LEM="herkes" MORPH="herkes+yH"
IG=[(1,"herkes+Pron+A3sg+Pnon+Acc")] REL=[9,1,(OBJECT)]> herkesi </W>
<W IX=8 LEM="çok" MORPH="çok" IG=[(1,"çok+Adv’’)] REL=[9,1,(MODIFIER)]>
çok </W>
<W IX=9 LEM="etkile" MORPH="etkile+DH" IG=[(1,
"etkile+Verb+Pos+Past+A3sg")] REL=[]> etkiledi </W>
</S>
40/47
Universal Dependencies
I A very recent project that aims to use a small set of “universal” labels and annotation
guidelines (universaldependencies.org).
41/47
Universal Dependencies
I A very recent project that aims to use a small set of “universal” labels and annotation
guidelines (universaldependencies.org).
42/47
Universal Dependencies
I A very recent project that aims to use a small set of “universal” labels and annotation
guidelines (universaldependencies.org).
43/47
State-of-the-art Dependency Parsers
I Stanford Parser
I Detailed Information at
https://nlp.stanford.edu/software/lex-parser.shtml
I Demo at http://nlp.stanford.edu:8080/parser/
I MaltParser is the original transition-based dependency parser by Nivre.
I “MaltParser is a system for data-driven dependency parsing, which can be used to induce
a parsing model from treebank data and to parse new data using an induced model.”
I Available at http://maltparser.org/
44/47
State-of-the-art Dependency Parser Performance
CONLL Shared Task Results
45/47
State-of-the-art Dependency Parser Performance
CONLL Shared Task Results
46/47
State-of-the-art Dependency Parser Performance
CONLL Shared Task Results
47/47
11-411
Natural Language Processing
Lexical Semantics
Kemal Oflazer
1/47
Lexical Semantics
2/47
Decompositional Lexical Semantics
I Assume that woman has (semantic) components [female], [human], and [adult].
I Man might have the components [male], [human], and [adult].
I Such "semantic features" can be combined to form more complicated meanings.
I Although this looks appealing, there is a little bit of a chicken-and-egg situation.
I Scholars and language scientists have not yet developed a consensus about a common set of "semantic primitives."
I Such a representation probably has to involve more structure than just a flat set of features per word.
3/47
Ontological Semantics
I Antonymy
I Hyponymy/Hypernymy
I Meronymy/Holonymy
4/47
Terminology: Lemma and Wordform
Wordform Lemma
banks bank
sung sing
sang sing
went go
goes go
5/47
Lemmas have Senses
6/47
Homonymy
I Homonyms: words that share a form but have unrelated, distinct meanings:
I bank1 : financial institution, bank2 : sloping land
I bat1 : club for hitting a ball, bat2 : nocturnal flying mammal
I Homographs: Same spelling (bank/bank, bat/bat)
I Homophones: Same pronunciation
I write and right
I piece and peace
7/47
Homonymy causes problems for NLP applications
I Information retrieval
I “bat care”
I Machine Translation
I bat: murciélago (animal) or bate (for baseball)
I Text-to-Speech
I bass (stringed instrument) vs. bass (fish)
8/47
Polysemy
9/47
Metonymy/Systematic Polysemy
10/47
How do we know when a word has multiple senses?
1 A zeugma is an interesting device that can cause confusion in sentences, while also adding some flavor.
11/47
Synonymy
I Words a and b share an identical sense or have the same meaning in some or all
contexts.
I filbert / hazelnut
I couch / sofa
I big / large
I automobile / car
I vomit / throw up
I water / H2 O
I Synonyms can be substituted for each other in all situations.
I True synonymy is relatively rare compared to other lexical relations.
I may not preserve the acceptability based on notions of politeness, slang, register, genre,
etc.
I water / H2 O
I big / large
I bravery / courage
I Bravery is the ability to confront pain, danger, or attempts of intimidation without any feeling of
fear.
I Courage, on the other hand, is the ability to undertake an overwhelming difficulty or pain despite
the eminent and unavoidable presence of fear.
12/47
Synonymy
13/47
Antonymy
I Lexical items a and b have senses which are “opposite”, with respect to one feature of
meaning
I Otherwise they are similar
I dark/light
I short/long
I fast/slow
I rise/fall
I hot/cold
I up/down
I in/out
I More formally: antonyms can
I define a binary opposition or be at opposite ends of a scale (long/short, fast/slow)
I or be reversives (rise/fall, up/down)
I Antonymy is much more common than true synonymy.
I Antonymy is not always well defined, especially for nouns (but for other words as well).
14/47
Hyponymy/Hypernymy
15/47
Hyponymy more formally
I Extensional
I The class denoted by the superordinate (e.g., vehicle) extensionally includes the class
denoted by the hyponym (e.g. car).
I Entailment
I A sense A is a hyponym of sense B if being an A entails being a B (e.g. if it is car, it is a
vehicle)
I Hyponymy is usually transitive
I If A is a hyponym of B and B is a hyponym of C ⇒ A is a hyponym of C.
I Another name is the IS - A hierarchy
I A IS - A B
I B subsumes A
16/47
Hyponyms and Instances
17/47
Meronymy/Holonymy
18/47
A Lexical Mini-ontology
19/47
WordNet
I A hierarchically organizated database of (English) word senses.
I George A. Miller (1995). WordNet: A Lexical Database for English. Communications
of the ACM Vol. 38, No. 11: 39-41.
I Available at wordnet.princeton.edu
I Provides a set of three lexical databases:
I Nouns
I Verbs
I Adjectives and adverbs.
I Relations are between senses, not lexical items (words).
I Applications Program Interfaces (APIs) are available for many languages and toolkits
including a Python interface via NLTK.
I WordNet 3.0
Category Unique Strings
Noun 117,197
Verb 11,529
Adjective 22,429
Adverb 4,481
20/47
Synsets
21/47
Synsets for dog (n)
I S: (n) dog, domestic dog, Canis familiaris (a member of the genus Canis (probably
descended from the common wolf) that has been domesticated by man since
prehistoric times; occurs in many breeds) “the dog barked all night”
I S: (n) frump, dog (a dull unattractive unpleasant girl or woman) “she got a reputation
as a frump”, “she’s a real dog”
I S: (n) dog (informal term for a man) “you lucky dog”
I S: (n) cad, bounder, blackguard, dog, hound, heel (someone who is morally
reprehensible) “you dirty dog”
I S: (n) frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie (a
smooth-textured sausage of minced beef or pork, usually smoked; often served on a
bread roll)
I S: (n) pawl, detent, click, dog (a hinged catch that fits into a notch of a ratchet to
move a wheel forward or prevent it from moving backward)
I S: (n) andiron, firedog, dog, dog-iron (metal supports for logs in a fireplace) “the
andirons were too hot to touch”
22/47
Synsets for bass in WordNet
23/47
Hierarchy for bass3 in WordNet
24/47
The IS - A Hierarchy for fish (n)
I fish (any of various mostly cold-blooded aquatic vertebrates usually having scales and
breathing through gills)
I aquatic vertebrate (animal living wholly or chiefly in or on water)
I vertebrate, craniate (animals having a bony or cartilaginous skeleton with a segmented spinal
column and a large brain enclosed in a skull or cranium)
I chordate (any animal of the phylum Chordata having a notochord or spinal column)
I animal, animate being, beast, brute, creature, fauna (a living organism characterized by
voluntary movement)
I organism, being (a living thing that has (or can develop) the ability to act or function
independently)
I living thing, animate thing (a living (or once living) entity)
I whole, unit (an assemblage of parts that is regarded as a single entity)
I object, physical object (a tangible and visible entity; an entity that can cast a shadow)
I entity (that which is perceived or known or inferred to have its own distinct existence (living or
nonliving))
25/47
WordNet Noun Relations
26/47
WordNet Verb Relations
27/47
Other WordNet Hierarchy Fragment Examples
28/47
Other WordNet Hierarchy Fragment Examples
29/47
Other WordNet Hierarchy Fragment Examples
30/47
WordNet as as Graph
31/47
Supersenses in WordNet
32/47
WordNets for Other Languages
33/47
Word Similarity
34/47
Why Word Similarity?
35/47
Similarity and Relatedness
36/47
Two Classes of Similarity Algorithms
I WordNet/Thesaurus-based algorithms
I Are words “nearby” in hypernym hierarchy?
I Do words have similar glosses (definitions)?
I Distributional algorithms
I Do words have similar distributional contexts?
I Distributional (Vector) semantics.
37/47
Path-based Similarity
I Two concepts (senses/synsets) are similar if they are near each other in the hierarchy
I They have a short path between them
I Synsets have path 1 to themselves.
38/47
Refinements
I simpath (c1 , c2 ) = 1 / pathlen(c1 , c2 )   (ranges between 0 and 1)
39/47
Example for Path-based Similarity
40/47
Problem with Basic Path-based Similarity
41/47
Information Content Similarity Metrics
42/47
Information Content Similarity Metrics
43/47
Information Content: Definitions
44/47
The Resnik Method
45/47
The Dekang Lin Method
I The similarity between A and B is measured by the ratio between the amount of
information needed to state the commonality of A and B and the information needed
to fully describe what A and B are.
I simlin (c1 , c2 ) = 2 log p(LCS(c1 , c2 )) / ( log p(c1 ) + log p(c2 ) )
I simlin (hill, coast) = 2 log p(geological-formation) / ( log p(hill) + log p(coast) ) = 0.59
46/47
Evaluating Similarity
47/47
11-411
Natural Language Processing
Distributional/Vector Semantics
Kemal Oflazer
1/55
The Distributional Hypothesis
I Want to know the meaning of a word? Find what words occur with it.
I Leonard Bloomfield
I Edward Sapir
I Zellig Harris–first formalization
I “oculist and eye-doctor . . . occur in almost the same environments”
I “If A and B have almost identical environments we say that they are synonyms.”
I The best known formulation comes from J.R. Firth:
I “You shall know a word by the company it keeps.”
2/55
Contexts for Beef
4/55
Intuition for Distributional Word Similarity
I Consider
I A bottle of pocarisweat is on the table.
I Everybody likes pocarisweat.
I Pocarisweat makes you feel refreshed.
I They make pocarisweat out of ginger.
I From context words humans can guess pocarisweat means a beverage like coke.
I So the intuition is that two words are similar if they have similar word contexts.
5/55
Why Vector Models of Meaning?
6/55
Word Similarity for Plagiarism Detection
7/55
Vector Models
8/55
Shared Intuition
9/55
Term-document Matrix
10/55
Term-document Matrix
11/55
Term-document Matrix
12/55
Term-document Matrix
13/55
Term-context Matrix for Word Similarity
14/55
Word–Word or Word–Context Matrix
15/55
Sample Contexts of ±7 Words
16/55
The Word–Word Matrix
I We showed only a 4 × 6 matrix, but the real matrix is 50, 000 × 50, 000.
I So it is very sparse: Most values are 0.
I That’s OK, since there are lots of efficient algorithms for sparse matrices.
I The size of windows depends on the goals:
I The smaller the context (±1 − 3), the more syntactic the representation
I The larger the context (±4 − 10), the more semantic the representation
17/55
Types of Co-occurence between Two Words
18/55
Problem with Raw Counts
19/55
Pointwise Mutual Information
I Pointwise mutual information: do events x and y co-occur more than if they were independent?
  PMI(x, y) = log2 [ p(x, y) / (p(x) p(y)) ]
I PMI between two words: do target word w and context word c co-occur more than if they were independent?
  PMI(w, c) = log2 [ p(w, c) / (p(w) p(c)) ]
20/55
Positive Pointwise Mutual Information
21/55
Computing PPMI on a Term-Context Matrix
I We have matrix F with V rows (words) and C columns (contexts) (in general C = V )
I fij is how many times word wi co-occurs in the context of the word cj .
pij = fij / Σ_{i=1}^{V} Σ_{j=1}^{C} fij
pi∗ = Σ_{j=1}^{C} fij / Σ_{i=1}^{V} Σ_{j=1}^{C} fij          p∗j = Σ_{i=1}^{V} fij / Σ_{i=1}^{V} Σ_{j=1}^{C} fij
pmiij = log2 ( pij / (pi∗ p∗j ) )          ppmiij = max(pmiij , 0)
22/55
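These formulas translate almost line by line into code; here is a small sketch (illustrative; the count matrix f is assumed to be a V × C list of lists).

    import math

    def ppmi_matrix(f):
        """PPMI from a V x C co-occurrence count matrix."""
        total = sum(sum(row) for row in f)
        p_w = [sum(row) / total for row in f]                      # p_{i*}
        p_c = [sum(f[i][j] for i in range(len(f))) / total         # p_{*j}
               for j in range(len(f[0]))]
        ppmi = [[0.0] * len(f[0]) for _ in f]
        for i, row in enumerate(f):
            for j, count in enumerate(row):
                if count > 0:
                    pmi = math.log2((count / total) / (p_w[i] * p_c[j]))
                    ppmi[i][j] = max(pmi, 0.0)
        return ppmi

On the count table of the next slide, ppmi_matrix recovers values close to the PPMI table shown there (small differences come from rounding).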
Example
computer data pinch result sugar
apricot 0 0 1 0 1 2
pineapple 0 0 1 0 1 2
digital 2 1 0 1 0 4
information 1 6 0 4 0 11
3 7 2 5 2 19
p(w = information, c = data) = 6/19 = 0.32
p(w = information) = 11/19 = 0.58          p(c = data) = 7/19 = 0.37
p(w, c)
computer data pinch result sugar p(w)
apricot 0.00 0.00 0.05 0.00 0.05 0.11
pineapple 0.00 0.00 0.05 0.00 0.05 0.11
digital 0.11 0.05 0.00 0.05 0.00 0.21
information 0.05 0.32 0.00 0.21 0.00 0.58
23/55
Example
p(w, c)
computer data pinch result sugar p(w)
apricot 0.00 0.00 0.05 0.00 0.05 0.11
pineapple 0.00 0.00 0.05 0.00 0.05 0.11
digital 0.11 0.05 0.00 0.05 0.00 0.21
information 0.05 0.32 0.00 0.21 0.00 0.58
pmi(information, data) = log2 ( 0.32 / (0.58 · 0.37) ) ≈ 0.58
PPMI(w, c)
computer data pinch result sugar
apricot - - 2.25 - 2.25
pineapple - - 2.25 - 2.25
digital 1.66 0.00 - 0.00 -
information 0.00 0.32 - 0.47 -
24/55
Issues with PPMI
25/55
Issues with PPMI
PPMIα (w, c) = max( log2 [ p(w, c) / (p(w) pα (c)) ], 0 )
pα (c) = count(c)^α / Σ_c count(c)^α
I This helps because pα (c) > p(c) for rare c.
I Consider two context words p(a) = 0.99 and p(b) = 0.01
26/55
Using Laplace Smoothing
p(w, c) Add-2
computer data pinch result sugar p(w)
apricot 0.03 0.03 0.05 0.03 0.05 0.20
pineapple 0.03 0.03 0.05 0.03 0.05 0.20
digital 0.07 0.05 0.03 0.05 0.03 0.24
information 0.05 0.14 0.03 0.10 0.03 0.36
27/55
PPMI vs. add-2 Smoothed PPMI
PPMI(w, c)
computer data pinch result sugar
apricot - - 2.25 - 2.25
pineapple - - 2.25 - 2.25
digital 1.66 0.00 - 0.00 -
information 0.00 0.32 - 0.47 - -
PPMI(w, c)
computer data pinch result sugar
apricot 0.00 0.00 0.56 0.00 0.56
pineapple 0.00 0.00 0.56 0.00 0.56
digital 0.62 0.00 0.00 0.00 0.00
information 0.00 0.58 0.00 0.37 0.00
28/55
Measuring Similarity
v · w = Σ_{i=1}^{N} vi wi = v1 w1 + v2 w2 + · · · + vN wN = |v| |w| cos θ
I v · w is high when two vectors have large values in the same dimensions.
I v · w is low (in fact 0) with zeros in complementary distribution.
I We also do not want the similarity to be sensitive to word-frequency.
I So normalize by vector length and use the cosine as the similarity
cos(v, w) = (v · w) / (|v| |w|)
29/55
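A tiny sketch of the cosine computation over such vectors (illustration only).

    import math

    def cosine(v, w):
        """cos(v, w) = (v . w) / (|v| |w|); 0.0 if either vector is all zeros."""
        dot = sum(vi * wi for vi, wi in zip(v, w))
        norm_v = math.sqrt(sum(vi * vi for vi in v))
        norm_w = math.sqrt(sum(wi * wi for wi in w))
        return dot / (norm_v * norm_w) if norm_v and norm_w else 0.0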
Other Similarity Measures in the Literature
I simJaccard (v, w) = Σ_i min(vi , wi ) / Σ_i max(vi , wi )
I simDice (v, w) = 2 Σ_i min(vi , wi ) / Σ_i (vi + wi )
30/55
Using Syntax to Define Context
I “The meaning of entities, and the meaning of grammatical relations among them, is
related to the restriction of combinations of these entities relative to other entities.”
(Zelig Harris (1968))
I Two words are similar if they appear in similar syntactic contexts.
I duty and responsibility have similar syntactic distribution
I Modified by Adjectives: additional, administrative, assumed, collective, congressional,
constitutional, . . .
I Objects of Verbs: assert, assign, assume, attend to, avoid, become, breach, . . .
31/55
Co-occurence Vectors based on Syntactic Dependencies
32/55
Sparse vs. Dense Vectors
33/55
Why Dense Vectors?
I Short vectors may be easier to use as features in machine learning (less weights to
tune).
I Dense vectors may generalize better than storing explicit counts.
I They may do better at capturing synonymy:
I car and automobile are synonyms
I But they are represented as distinct dimensions
I This fails to capture similarity between a word with car as a neighbor and a word with
automobile as a neighbor
34/55
Methods for Getting Short Dense Vectors
35/55
Dense Vectors via SVD - Intuition
36/55
Dimensionality Reduction
37/55
Singular Value Decomposition
I Any square v × v matrix (of rank v) X equals the product of three matrices.
X = W S C   (each of X, W, S, C is v × v)
I v columns in W are orthogonal to each other and are ordered by the amount of
variance each new dimension accounts for.
I S is a diagonal matrix of eigenvalues expressing the importance of each dimension.
I C has v rows for the singular values and v columns corresponding to the original
contexts.
38/55
Reducing Dimensionality with Truncated SVD
Full SVD:       X = W S C        (all matrices v × v)
Truncated SVD:  X ≈ W_k S_k C_k  (W_k is v × k, S_k is k × k, C_k is k × v, keeping only the top k singular values)
39/55
Truncated SVD Produces Embeddings
[The v × k matrix W: each of its v rows (wi1 , . . . , wik ) is a k-dimensional embedding of one word]
40/55
Embeddings vs Sparse Vectors
I Dense SVD embeddings sometimes work better than sparse PPMI matrices at tasks
like word similarity
I Denoising: low-order dimensions may represent unimportant information
I Truncation may help the models generalize better to unseen data.
I Having a smaller number of dimensions may make it easier for classifiers to properly
weight the dimensions for the task.
I Dense models may do better at capturing higher order co-occurrence.
41/55
Embeddings Inspired by Neural Language Models
I Skip-gram and CBOW learn embeddings as part of the process of word prediction.
I Train a neural network to predict neighboring words
I Inspired by neural net language models.
I In so doing, learn dense embeddings for the words in the training corpus.
I Advantages:
I Fast, easy to train (much faster than SVD).
I Available online in the word2vec package.
I Including sets of pretrained embeddings!
42/55
Skip-grams
I From the current word wt , predict other words in a context window of 2C words.
I For example, we are given wt and we are predicting one of the words in [wt−C , . . . , wt−1 , wt+1 , . . . , wt+C ].
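A sketch of how (current word, context word) training pairs could be generated for a window of size C (toy corpus; this is not the word2vec implementation itself):

def skipgram_pairs(tokens, C=2):
    """Yield (current word, context word) pairs within a +/- C window."""
    for t, w in enumerate(tokens):
        for j in range(max(0, t - C), min(len(tokens), t + C + 1)):
            if j != t:
                yield (w, tokens[j])

sent = "the quick brown fox jumps over the lazy dog".split()
print(list(skipgram_pairs(sent, C=2))[:5])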
43/55
Compressing Words
44/55
One-hot Vector Representation
[0, 0, 0, 0, 1, 0, 0, 0, 0, . . . , 0]
45/55
Neural Network Architecture
46/55
Where are the Word Embeddings?
I The rows of the first (input) matrix are in fact the word embeddings.
I Multiplying the one-hot input vector by this matrix simply “selects” the relevant row as the input to the hidden layer.
47/55
Output Probabilities
I The output vector is likewise computed by a vector–matrix multiplication: the hidden-layer vector times the output matrix (the C matrix).
I The value computed for output unit k is ck · wj , where wj is the hidden-layer vector for word j.
I Except, the outputs are not probabilities!
I We use the same scaling idea we used earlier and then use softmax .
p(wk is in the context of wj ) = exp(ck · vj ) / Σ_i exp(ci · vj )
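A small numerical sketch of this computation (the hidden vector and output matrix here are made-up toy values):

import numpy as np

np.random.seed(0)
V, d = 6, 4                        # toy vocabulary size and embedding dimension
C = np.random.randn(V, d)          # output ("context") matrix: one row c_k per vocabulary word
w_j = np.random.randn(d)           # hidden-layer vector for the current word j

scores = C @ w_j                                 # c_k . w_j for every k
probs = np.exp(scores) / np.exp(scores).sum()    # softmax
print(probs.sum())                               # sums to 1: a distribution over context words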
48/55
Training for Embeddings
49/55
Training for Embeddings
I You have a huge network (say you have 1M words and embedding dimension of 300).
50/55
Properties of Embeddings
I Nearest words to some embeddings in the d-dimensional space.
I Relation meanings
I vector(king) − vector(man) + vector(woman) ≈ vector(queen)
I vector(Paris) − vector(France) + vector(Italy) ≈ vector(Rome)
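A sketch of the analogy computation, assuming a dict vec mapping words to NumPy vectors (the vectors themselves are not shown here and would come from a trained model):

import numpy as np

def most_similar(target, vec, exclude=()):
    """Return the vocabulary word whose embedding has the highest cosine with target."""
    best, best_sim = None, -1.0
    for word, v in vec.items():
        if word in exclude:
            continue
        sim = np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g. most_similar(vec["king"] - vec["man"] + vec["woman"], vec,
#                   exclude={"king", "man", "woman"})   # ideally returns "queen"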
51/55
Brown Clustering
52/55
Brown Clustering
53/55
Brown Clusters as Vectors
I By tracing the order in which clusters are merged, the model builds a binary tree from
bottom to top.
I Each word represented by binary string = path from root to leaf
I Each intermediate node is a cluster
I “Chairman” is represented by 0010, “months” by 01, and verbs by 1.
54/55
Class-based Language Model
55/55
11-411
Natural Language Processing
Word Sense Disambiguation
Kemal Oflazer
1/23
Homonymy and Polysemy
I As we have seen, multiple words can be spelled the same way (homonymy;
technically homography)
I The same word can also have different, related senses (polysemy)
I Various NLP tasks require resolving the ambiguities produced by homonymy and
polysemy.
I Word sense disambiguation (WSD)
2/23
Versions of the WSD Task
I Lexical sample
I Choose a sample of words.
I Choose a sample of senses for those words.
I Identify the right sense for each word in the sample.
I All-words
I Systems are given the entire text.
I Systems are given a lexicon with senses for every content word in the text.
I Identify the right sense for each content word in the text .
3/23
Supervised WSD
4/23
Sample SemCor Data
<wf cmd=done pos=PRP$ ot=notag>Your</wf>
<wf cmd=done pos=NN lemma=invitation wnsn=1 lexsn=1:10:00::>invitation</wf>
<wf cmd=ignore pos=TO>to</wf>
<wf cmd=done pos=VB lemma=write_about wnsn=1 lexsn=2:36:00::>write_about</wf>
<wf cmd=done rdf=person pos=NNP lemma=person wnsn=1 lexsn=1:03:00:: pn=person>Se
<wf cmd=ignore pos=TO>to</wf>
<wf cmd=done pos=VB lemma=honor wnsn=1 lexsn=2:41:00::>honor</wf>
<wf cmd=ignore pos=PRP$>his</wf>
<wf cmd=done pos=JJ lemma=70th wnsn=1 lexsn=5:00:00:ordinal:00>70_th</wf>
<wf cmd=done pos=NN lemma=anniversary wnsn=1 lexsn=1:28:00::>Anniversary</wf>
<wf cmd=ignore pos=IN>for</wf>
<wf cmd=ignore pos=DT>the</wf>
<wf cmd=done pos=NN lemma=april wnsn=1 lexsn=1:28:00::>April</wf>
<wf cmd=done pos=NN lemma=issue wnsn=2 lexsn=1:10:00::>issue</wf>
<wf cmd=ignore pos=IN>of</wf>
<wf cmd=done pos=NNP pn=other ot=notag>Sovietskaya_Muzyka</wf>
<wf cmd=done pos=VBZ ot=notag>is</wf>
<wf cmd=done pos=VB lemma=accept wnsn=6 lexsn=2:40:01::>accepted</wf>
<wf cmd=ignore pos=IN>with</wf>
<wf cmd=done pos=NN lemma=pleasure wnsn=1 lexsn=1:12:00::>pleasure</wf>
5/23
What Features Should One Use?
6/23
What Features Should One Use?
I Collocation features
I Encode information about specific positions located to the left or right of the target word
I For example [wi−2 , POSi−2 , wi−1 , POSi−1 , wi+1 , POSi+1 , wi+2 , POSi+2 ]
I For bass, e.g., [guitar, NN, and, CC, player, NN, stand, VB]
I Bag-of-words features
I Unordered set of words occurring in window
I Relative sequence is ignored
I Words are lemmatized
I Stop/Function words typically ignored.
I Used to capture domain
7/23
Naive Bayes for WSD
I Choose the most probable sense given the feature vector f , which can be formulated as
ŝ = argmax_{s∈S} p(s) ∏_{j=1..n} p(fj | s)
I Naive Bayes assumes features in f are independent (often not true)
I But usually Naive Bayes Classifiers perform well in practice.
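A minimal sketch of this decision rule over toy counts (smoothing and real feature extraction omitted; all counts are invented for illustration):

import math

def nb_disambiguate(features, sense_prior, feat_counts, sense_counts):
    """argmax_s p(s) * prod_j p(f_j | s), computed in log space."""
    best, best_score = None, float("-inf")
    for s, prior in sense_prior.items():
        score = math.log(prior)
        for f in features:
            score += math.log(feat_counts[s].get(f, 1e-6) / sense_counts[s])
        if score > best_score:
            best, best_score = s, score
    return best

# Toy model for "bass": counts of context words per sense (invented numbers)
sense_prior  = {"fish": 0.4, "music": 0.6}
sense_counts = {"fish": 100, "music": 150}
feat_counts  = {"fish": {"river": 30, "caught": 20},
                "music": {"guitar": 60, "player": 40}}
print(nb_disambiguate(["guitar", "player"], sense_prior, feat_counts, sense_counts))  # "music"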
8/23
Semisupervised WSD–Decision List Classifiers
I The decisions handed down by naive Bayes classifiers (and other similar ML
algorithms) are difficult to interpret.
I It is not always clear why, for example, a particular classification was made.
I For reasons like this, some researchers have looked to decision list classifiers, a highly
interpretable approach to WSD .
I We have a list of conditional statements.
I Item being classified falls through the cascade until a statement is true.
I The associated sense is then returned.
I Otherwise, a default sense is returned.
I Where does the list come from?
9/23
Decision List Features for WSD – Collocational Features
10/23
Learning a Decision List Classifier
weight(si , fj ) = log [ p(si | fj ) / p(¬si | fj ) ]
11/23
Example
I Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for
bank/2 (river sense)
I p(s1 ) = 1, 500/2, 000 = .75
I p(s2 ) = 500/2, 000 = .25
I Given “credit” occurs 200 times with bank/1 and 4 times with bank/2.
I p(credit) = 204/2, 000 = .102
I p(credit | s1 ) = 200/1, 500 = .133
I p(credit | s2 ) = 4/500 = .008
I From Bayes Rule
I p(s1 | credit) = .133 ∗ .75/.102 = .978
I p(s2 | credit) = .008 ∗ .25/.102 = .020
I Weights
I weight(s1 , credit) = log 49.8 = 3.89
I weight(s2 , credit) = log(1/49.8) = −3.89
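A quick check of these numbers in Python (same toy counts as above; the slide’s 3.89 comes from rounding the intermediate probabilities):

import math

p_s1_credit = (200 / 1500) * 0.75 / 0.102      # ~0.978
p_s2_credit = (4 / 500) * 0.25 / 0.102         # ~0.020
weight_s1 = math.log(p_s1_credit / p_s2_credit)    # ~ 3.9
weight_s2 = math.log(p_s2_credit / p_s1_credit)    # ~ -3.9
print(round(weight_s1, 2), round(weight_s2, 2))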
12/23
Using the Decision List
13/23
Evaluation of WSD
I Extrinsic Evaluation
I Also called task-based, end-to-end, and in vivo evaluation.
I Measures the contribution of a WSD (or other) component to a larger pipeline.
I Requires a large investment and is hard to generalize to other tasks.
I Intrinsic Evaluation
I Also called in vitro evaluation
I Measures the performance of the WSD (or other) component in isolation
I Does not necessarily tell you how well the component contributes to a real test – which is
in general what you are interested in.
14/23
Baselines
15/23
Simplified Lesk Algorithm
I The bank can guarantee deposits
will eventually cover future tuition
costs because it invests in
adjustable-rate mortgage securities.
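The algorithm itself is shown graphically on the slide; as a rough sketch, simplified Lesk picks the sense whose dictionary gloss overlaps most with the context words. The glosses below are paraphrased for illustration (not actual WordNet text), and stop-word removal is omitted:

def simplified_lesk(context_words, sense_glosses):
    """Return the sense whose gloss shares the most words with the context."""
    context = set(w.lower() for w in context_words)
    best, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

glosses = {"bank/1": "a financial institution that accepts deposits and channels money into lending",
           "bank/2": "sloping land beside a body of water"}
context = ("the bank can guarantee deposits will eventually cover future tuition costs "
           "because it invests in adjustable-rate mortgage securities").split()
print(simplified_lesk(context, glosses))   # bank/1 (overlap on "deposits")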
16/23
Bootstrapping Algorithms
I There are bootstrapping techniques that can be used to obtain reasonable WSD
results with minimal amounts of labelled data.
17/23
Bootstrapping Algorithms
18/23
Bootstrapping Example
19/23
State-of-the-art Results in WSD (2017)
20/23
State-of-the-art Results in WSD (2017)
21/23
Other Approaches – Ensembles
22/23
Unsupervised Methods for Word Sense Discrimination/Induction
23/23
11-411
Natural Language Processing
Semantic Roles
Semantic Parsing
Kemal Oflazer
1/41
Semantics vs Syntax
2/41
Motivating Example: Who did What to Whom?
3/41
Motivating Example: Who did What to Whom?
4/41
Motivating Example: Who did What to Whom?
5/41
Motivating Example: Who did What to Whom?
6/41
Motivating Example: Who did What to Whom?
I In this buying/purchasing event/situation, Warren played the role of the buyer, and
there was some stock that played the role of the thing purchased.
I Also, there was presumably a seller, only mentioned in one example.
I In some examples, a separate “event” involving surprise did not occur.
7/41
Semantic Roles: Breaking
8/41
Semantic Roles: Breaking
9/41
Semantic Roles: Eating
I Eat!
I We ate dinner.
I We already ate.
I The pies were eaten up quickly.
I Our gluttony was complete.
10/41
Semantic Roles: Eating
I Eat!(you, listener) ?
I We ate dinner.
I We already ate.
I The pies were eaten up quickly.
I Our gluttony was complete.
I An eating event has an EATER and a FOOD, neither of which needs to be mentioned explicitly.
11/41
Abstraction
BREAKER =? EATER
Both are actors that have some causal responsibility for changes in the world around them.
BREAKEE =? FOOD
Both are greatly affected by the event, which “happened to” them.
12/41
Thematic Roles
13/41
Verb Alternation Examples: Breaking and Giving
I Breaking:
I AGENT/subject; THEME/object; INSTRUMENT/PPwith
I INSTRUMENT/subject; THEME/object
I THEME/subject
I Giving:
I AGENT/subject; GOAL/object; THEME/second-object
I AGENT/subject; THEME/object; GOAL/PPto
I English verbs have been codified into classes that share patterns (e.g., verbs of
throwing: throw/kick/pass)
14/41
Semantic Role Labeling
I Input: a sentence x
I Output: A collection of predicates, each consisting of
I a label sometimes called the frame
I a span
I a set of arguments, each consisting of
I a label usually called the role
I a span
15/41
The Importance of Lexicons
16/41
PropBank
17/41
fall.01 (move downward)
18/41
fall.01 (move downward)
19/41
fall.01 (move downward)
20/41
fall.01 (move downward)
21/41
fall.08 (fall back, rely on in emergency)
I World Bank president Paul Wolfowitz has fallen back on his last resort.
22/41
fall.08 (fall back, rely on in emergency)
I World Bank president Paul Wolfowitz has fallen back on his last resort.
23/41
fall.08 (fall back, rely on in emergency)
I World Bank president Paul Wolfowitz has fallen back on his last resort.
24/41
fall.10 (fall for a trick; be fooled by)
I ARG0: the fool
I ARG1: the trick
I Many people keep falling for the idea that lowering taxes on the rich benefits
everyone.
25/41
fall.10 (fall for a trick; be fooled by)
I ARG0: the fool
I ARG1: the trick
I Many people keep falling for the idea that lowering taxes on the rich benefits
everyone.
26/41
fall.10 (fall for a trick; be fooled by)
I ARG0: the fool
I ARG1: the trick
I Many people keep falling for the idea that lowering taxes on the rich benefits
everyone.
27/41
FrameNet
28/41
change position on a scale
29/41
FrameNet Example
[Attacks on civilians]ITEM [decreased]change-position-on-a-scale [over the last four months]DURATION
I The ATTRIBUTE is left unfilled but is understood from context (e.g., “number” or
“frequency”).
30/41
change position on a scale
I Verbs: advance, climb, decline, decrease, diminish, dip, double, drop, dwindle, edge,
explode, fall, fluctuate, gain, grow, increase, jump, move, mushroom, plummet, reach,
rise, rocket, shift, skyrocket, slide, soar, swell, swing, triple, tumble
I Nouns: decline, decrease, escalation, explosion, fall, fluctuation, gain, growth, hike,
increase, rise, shift, tumble
I Adverb: increasingly
I Frame hierarchy
event
31/41
The Semantic Role Labeling Task
I Given a syntactic parse, identify the appropriate role for each noun phrase (according
to the scheme that you are using, e.g., PropBank, FrameNet or something else)
I Why is this useful?
I Why is it useful for some tasks that you cannot perform with just dependency parsing?
I What kind of semantic representation could you obtain if you had SRL?
I Why is this hard?
I Why is it harder than dependency parsing?
32/41
Semantic Role Labeling Methods
33/41
Example: Path Features
(figure: parse tree fragment — NP-SBJ and VP nodes, with the word “yesterday” — illustrating a path feature from a constituent to the predicate)
35/41
Additional Steps for Efficiency
I Pruning
I Only a small number of constituents should ultimately be labeled
I Use heuristics to eliminate some constituents from consideration
I Preliminary Identification:
I Label each node as ARG or NONE with a binary classifier
I Classification
I Only then, perform 1-of-N classification to label the remaining ARG nodes with roles
36/41
Additional Information
37/41
Methods: Beyond Features
38/41
Related Problems in “Relational” Semantics
I Coreference resolution: which mentions (within or across texts) refer to the same
entity or event?
I Entity linking: ground such mentions in a structured knowledge base (e.g.,
Wikipedia)
I Relation extraction: characterize the relation among specific mentions
39/41
General Remarks
I Semantic roles are just “syntax++” since they don’t allow much in the way of
reasoning (e.g., question answering).
I Lexicon building is slow and requires expensive expertise. Can we do this for every
(sub)language?
40/41
Snapshot
41/41
11-411
Natural Language Processing
Compositional Semantics
Kemal Oflazer
1/34
Semantics Road Map
I Lexical semantics
I Vector semantics
I Semantic roles, semantic parsing
I Meaning representation languages and Compositional semantics
I Discourse and pragmatics
2/34
Bridging the Gap between Language and the World
I Meaning representation is the interface between the language and the world.
I Answering essay question on an exam.
I Deciding what to order at a restaurant.
I Recognizing a joke.
I Executing a command.
I Responding to a request.
3/34
Desirable Qualities of Meaning Representation Languages (MRL)
4/34
Desirable Qualities of Meaning Representation Languages (MRL)
I Inputs that mean the same thing should have the same meaning representation.
I “Bukhara has vegetarian dishes.”
I “They have vegetarian food at Bukhara.”
I “Vegetarian dishes are served at Bukhara.”
I “Bukhara serves vegetarian fare.”
5/34
Variables and Expressiveness
6/34
Limitation
7/34
What do we Represent?
I Objects: people (John, Ali, Omar), cuisines (Thai, Indian), restaurants (Bukhara,
Chef’s Garden), . . .
I John, Ali, Omar, Thai, Indian, Chinese, Bukhara, Chefs Garden, . . .
I Properties of Objects: Ali is picky, Bukhara is noisy, Bukhara is cheap, Indian is
spicy, John, Ali and Omar are humans, Bukhara has long wait . . .
I picky={Ali}, noisy={Bukhara}, spicy={Indian}, human={Ali, John, Omar}. . .
I Relations between objects: Bukhara serves Indian, NY Steakhouse serves steak.
Omar likes Chinese.
I serves(Bukhara, Indian), serves(NY Steakhouse, steak), likes(Omar, Chinese) . . .
I Simple questions are easy:
I Is Bukhara noisy?
I Does Bukhara serve Chinese?
8/34
MRL: First-order Logic – A Quick Tour
9/34
First-order Logic: Meta Theory
10/34
Translating between First-order Logic and Natural Language
11/34
Logical Semantics (Montague Semantics)
I The denotation of a natural language sentence is the set of conditions that must hold
in the (model) world for the sentence to be true.
I “Every restaurant has a long wait or is disliked by Ali.”
is true if and only if
is true.
I This is sometimes called the logical form of the NL sentence.
12/34
The Principle of Compositionality
13/34
Lexicon Entries
14/34
λ-Calculus
15/34
Semantic Attachments to CFGs
16/34
Example
NP VP
NNP VBZ NP
expensive restaurants
17/34
Example
S : VP.sem(NP.sem)
NP : NNP.sem VP : VBZ.sem(NP.sem)
expensive restaurants
18/34
Example
S : VP.sem(NP.sem)
NP : NNP.sem VP : VBZ.sem(NP.sem)
expensive restaurants
19/34
Example
..
.
VP : VBZ.sem(NP.sem)
20/34
Example
..
.
S : VP.sem(NP.sem)
Ali
22/34
Example
S : ∀x expensive(x) ∧ restaurant(x) ⇒ likes(Ali, x)
(obtained by applying VP.sem = λy.∀x expensive(x) ∧ restaurant(x) ⇒ likes(y, x) to NP.sem = Ali)
23/34
Quantifier Scope Ambiguity
I NNP → Ali {Ali}
I VBZ → likes {λf .λy.∀x f (x) ⇒ likes(y, x)}
I JJ → expensive {λx.expensive(x)}
I NNS → restaurants {λx.restaurant(x)}
I NP → NNP {NNP.sem}
I NP → JJ NNS {λx.JJ.sem(x) ∧ NNS.sem(x)}
I VP → VBZ NP {VBZ.sem(NP.sem)}
I S → NP VP {VP.sem(NP.sem)}
I NP → Det NN {Det.sem(NN.sem)}
I Det → every {λf .λg.∀u f (u) ⇒ g(u)}
I Det → a {λm.λn.∃x m(x) ⇒ n(x)}
I NN → man {λv.man(v)}
I NN → woman {λv.woman(v)}
I VBZ → loves {λf .λy.∀x f (x) ⇒ loves(y, x)}
Example (the parse tree on the right of the slide): “Every man loves a woman” yields
∀u man(u) ⇒ ∃x woman(x) ∧ loves(u, x)
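The composition can be mimicked directly with Python lambdas (this is only an illustration, not part of the original slides; the “predicates” here simply build logical-form strings):

# Semantic attachments as Python lambdas; predicates build logical-form strings.
ali         = "Ali"
likes       = lambda f: lambda y: f"forall x. {f('x')} -> likes({y}, x)"
expensive   = lambda x: f"expensive({x})"
restaurants = lambda x: f"restaurant({x})"

np_sem = lambda x: f"{expensive(x)} & {restaurants(x)}"   # NP -> JJ NNS
vp_sem = likes(np_sem)                                    # VP -> VBZ NP
s_sem  = vp_sem(ali)                                      # S  -> NP VP
print(s_sem)   # forall x. expensive(x) & restaurant(x) -> likes(Ali, x)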
24/34
This is not Quite Right!
25/34
Other Meaning Representations: Abstract Meaning Representation
26/34
Combinatory Categorial Grammar
I CCG is a grammatical formalism that is well-suited for tying together syntax and
semantics.
I Formally, it is more powerful than CFG – it can represent some of the
context-sensitive languages.
I Instead of the set of non-terminals of CFGs, CCGs can have an infinitely large set of
structured categories (called types).
27/34
CCG Types and Combinators
28/34
Application Combinator
I Forward Combination: X /Y Y ⇒ X
I Backward Combination: Y X \Y ⇒ X
(example derivation: NP/N N ⇒ NP and (S\NP)/NP NP ⇒ S\NP by forward application; then NP S\NP ⇒ S by backward application)
29/34
Conjunction Combinator
I X and X ⇒ X
(example derivation: two (S\NP)/NP NP constituents are conjoined into a single S\NP, which then combines with the subject NP to give S)
30/34
Composition Combinator
I Forward (X /Y Y /Z ⇒ X /Z )
I Backward (Y \Z X \Y ⇒ X \Z )
(example derivation: “I would prefer . . . ” — the auxiliary and the verb compose into a single (S\NP)/NP, which takes the object NP and then the subject NP)
31/34
Type-raising Combinator
I Forward (X ⇒ Y /(Y \X ))
I Backward (X ⇒ Y \(Y /X ))
(example: subject NPs such as “I” and “Karen” are type-raised so that “I love” and “Karen hates” each become S/NP constituents)
32/34
Back to Semantics
33/34
CCG Lexicon
34/34
11-411
Natural Language Processing
Discourse and Pragmatics
Kemal Oflazer
1/61
What is Discourse?
2/61
Applications of Computational Discourse
3/61
Kinds of Discourse Analysis
I Monologue
I Human-human dialogue (conversation)
I Human-computer dialogue (conversational agents)
I “Longer-range” analysis (discourse) vs. “deeper” analysis (real semantics):
I John bought a car from Bill.
I Bill sold a car to John.
I They were both happy with the transaction.
4/61
Discourse in NLP
5/61
Coherence
6/61
Discourse Segmentation
I Many genres of text have particular conventional structures:
I Academic articles: Abstract, Introduction, Methodology, Results, Conclusion, etc.
I Newspaper stories:
8/61
Applications of Discourse Segmentation
9/61
Cohesion
10/61
Discourse Segmentation
I Intuition: If we can “measure” the cohesion between every neighboring pair of
sentences, we may expect a “dip” in cohesion at subtopic boundaries.
I The TextTiling algorithm uses lexical cohesion.
11/61
The TextTiling Algorithm
I Tokenization
I lowercase, remove stop words, morphologically stem inflected words
I stemmed words are (dynamically) grouped into pseudo-sentences of length 20 (equal
length and not real sentences!)
I Lexical score determination
I Boundary identification
12/61
TextTiling – Determining Lexical Cohesion Scores
I Remember:
I Count-based similarity vectors
I Cosine-similarity
simcosine (a, b) = (a · b) / (|a| |b|)
13/61
TextTiling – Determining Boundaries
I A gap position i is a valley if yi < yi−1 and yi < yi+1 .
I If i is a valley, find the depth score – the distance from the peaks on both sides: (yi−1 − yi ) + (yi+1 − yi ).
I A valley is selected as a boundary if it is deep enough — its depth score is at least s − σs , i.e., no more than one standard deviation below the average valley depth.
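A sketch of the valley/depth computation over a sequence of gap cohesion scores (the scores are made up, and this thresholds the depth scores at mean − stdev, one common variant of the cutoff):

import statistics

def boundaries(y):
    """Return gap indices whose depth score exceeds mean - stdev of all depth scores."""
    depths = {}
    for i in range(1, len(y) - 1):
        if y[i] < y[i - 1] and y[i] < y[i + 1]:            # a valley
            depths[i] = (y[i - 1] - y[i]) + (y[i + 1] - y[i])
    if not depths:
        return []
    cutoff = statistics.mean(depths.values()) - statistics.stdev(depths.values())
    return [i for i, d in depths.items() if d >= cutoff]

gap_scores = [0.8, 0.7, 0.2, 0.75, 0.7, 0.6, 0.72, 0.1, 0.8]   # toy cohesion scores
print(boundaries(gap_scores))   # [2, 7]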
14/61
TextTiling – Determining Boundaries
15/61
TextTiling – Determining Boundaries
16/61
Supervised Discourse Segmentation
17/61
Evaluating Discourse Segmentation
I We could do precision, recall and F-measure, but . . .
I These will not be sensitive to near misses!
I A commonly-used metric is WindowDiff.
I Slide a window of length k across the (correct) references and the hypothesized
segmentation.
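A sketch of one common formulation of WindowDiff, with the reference and hypothesis given as 0/1 boundary indicator sequences between sentences (k is the window size; the example segmentations are invented):

def window_diff(reference, hypothesis, k):
    """Fraction of length-k windows in which the two segmentations disagree
    on the number of boundaries (0 = perfect agreement)."""
    n = len(reference)
    disagreements = 0
    for i in range(n - k):
        if sum(reference[i:i + k]) != sum(hypothesis[i:i + k]):
            disagreements += 1
    return disagreements / (n - k)

ref = [0, 0, 1, 0, 0, 0, 1, 0, 0]
hyp = [0, 0, 0, 1, 0, 0, 1, 0, 0]   # near miss on the first boundary
print(window_diff(ref, hyp, k=3))   # penalizes the near miss only partially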
19/61
Coherence Relations
I Let S0 and S1 represent the “meanings” of two sentences being related.
I Result: Infer that state or event asserted by S0 causes or could cause the state or
event asserted by S1 .
I The Tin Woodman was caught in the rain. His joints rusted.
I Explanation: Infer that state or event asserted by S1 causes or could cause the state
or event asserted by S0 .
I John hid the car’s keys. He was drunk.
I Parallel: Infer p(a1 , a2 , . . .) from the assertion of S0 and p(b1 , b2 , . . .) from the
assertion of S1 , where ai and bi are similar for all i.
I The Scarecrow wanted some brains. The Tin Woodman wanted a heart.
I Elaboration: Infer the same proposition from the assertions S0 and S1 .
I Dorothy was from Kansas. She lived in the midst of the great Kansas prairies.
I Occasion:
I A change of state can be inferred from the assertion S0 whose final state can be inferred
from S1 , or
I A change of state can be inferred from the assertion S1 whose initial state can be inferred
from S0 .
I Dorothy picked up the oil can. She oiled the Tin Woodman’s joints.
20/61
Coherence Relations
I Consider
I S1: John went to the bank to deposit his paycheck.
I S2: He then took a bus to Bill’s car dealership.
I S3: He needed to buy a car.
I S4: The company he works for now isn’t near a bus line.
I S5: He also wanted to talk with Bill about their soccer league.
Occasion(e1 , e2 )
  S1 (e1 )
  Explanation(e2 )
    S2 (e2 )
    Parallel(e3 ; e5 )
      Explanation(e3 )
        S3 (e3 )
        S4 (e4 )
      S5 (e5 )
21/61
Rhetorical Structure Theory – RST
I Based on 23 rhetorical relations between two spans of text in a discourse.
I a nucleus – central to the writer’s purpose and interpretable independently
I a satellite – less central and generally only interpretable with respect to the nucleus
I Evidence relation: “Kevin must be here.” (nucleus) “His car is parked outside.” (satellite)
I An RST relation is defined by a set of constraints.
23/61
RST Coherence Relations
24/61
Automatic Coherence Assignment
25/61
Automatic Coherence Assignment
I Very difficult!
I One existing approach is to use cue phrases.
I John hid Bill’s car keys because he was drunk.
I The scarecrow came to ask for a brain. Similarly, the tin man wants a heart.
26/61
Reference Resolution
I To interpret the sentence in any discourse we need to know who or what entity is being talked about.
I Victoria Chen, CFO of Megabucks Banking Corp since 2004, saw her pay jump 20%,
to $1.3 million, as the 37-year-old also became the Denver-based company’s
president. It has been ten years since she came to Megabucks from rival Lotsaloot.
I Coreference chains:
I {Victoria Chen, CFO of Megabucks Banking Corp since 2004, her, the 37-year-old, the
Denver-based company’s president, she}
I {Megabucks Banking Corp, the Denver-based company, Megabucks}
I {her pay}
I {Lotsaloot}
27/61
Some Terminology
Victoria Chen, CFO of Megabucks Banking Corp since 2004, saw her pay jump 20%, to
$1.3 million, as the 37-year-old also became the Denver-based company’s president. It
has been ten years since she came to Megabucks from rival Lotsaloot.
I Referring expression
I Victoria Chen, the 37-year-old and she are referring expressions.
I Referent
I Victoria Chen is the referent.
I Two referring expressions referring to the same entity are said to corefer.
I A referring expression licenses the use of a subsequent expression.
I Victoria Chen allows Victoria Chen to be referred to as she.
I Victoria Chen is the antecedent of she.
I Reference to an earlier introduced entity is called anaphora.
I Such a reference is called anaphoric.
I the 37-year-old, her and she are anaphoric.
28/61
References and Context
29/61
Other Kinds of Referents
30/61
Types of Referring Expressions
31/61
Indefinite Noun Phrases
32/61
Definite Noun Phrases
33/61
Pronouns
I Pronouns usually refer to entities that were introduced no further than one or two sentences back.
I John went to Bob’s party and parked next to a classic Ford Falcon.
I He went inside and talked to Bob for more than an hour. (He = John)
I Bob told him that he recently got engaged. (him = John, he = Bob)
I He also said he bought it yesterday (He = Bob, it = ???)
I He also said he bought the Falcon yesterday (He = Bob)
I Pronouns can also participate in cataphora.
I Even before she saw it, Dorothy had been thinking about the statue.
I Pronouns also appear in quantified contexts, bound to the quantifier.
I Every dancer brought her left arm forward.
34/61
Demonstratives
35/61
Names
36/61
Reference Resolution
I Coreference resolution
I Pronomial anaphora resolution
37/61
Pronouns Reference Resolution: Filters
38/61
Pronouns Reference Resolution: Preferences
39/61
Pronoun Reference Resolution: The Hobbs Algorithm
40/61
Pronoun Reference Resolution: Centering Theory
41/61
Sentence Transitions
42/61
Pronoun Reference Resolution: Log-Linear Models
I Supervised: hand-labeled coreference corpus
I Rule-based filtering of non-referential pronouns:
I It was a dark and stormy night.
I It is raining.
I Needs positive and negative examples:
I Positive examples in the corpus.
I Negative examples are created by pairing pronouns with other noun phrases.
I Features are extracted for each training example.
I Classifier learns to predict 1 or 0.
I During testing:
I Classifier extracts all potential antecedents by parsing the current and previous sentences.
I Each NP is considered a potential antecedent for each following pronoun.
I Each pronoun – potential antecedent pair is then presented (through their features) to the
classifier.
I Classifier predicts 1 or 0.
43/61
Pronoun Reference Resolution: Log-Linear Models
I Example
I U1 : John saw a Ford at the dealership.
I U2 : He showed it to Bob.
I U3 : He bought it.
I Features for He in U3
44/61
General Reference Resolution
I Victoria Chen, CFO of Megabucks Banking Corp since 2004, saw her pay jump 20%,
to $1.3 million, as the 37-year-old also became the Denver-based company’s
president. It has been ten years since she came to Megabucks from rival Lotsaloot.
I Coreference chains:
I {Victoria Chen, CFO of Megabucks Banking Corp since 2004, her, the 37-year-old, the
Denver-based company’s president, she}
I {Megabucks Banking Corp, the Denver-based company, Megabucks}
I {her pay}
I {Lotsaloot}
45/61
High-level Recipe for Coreference Resolution
46/61
High-level Recipe for Coreference Resolution
47/61
Pragmatics
48/61
In Context?
I Social context
I Social identities, relationships, and setting
I Physical context
I Where? What objects are present? What actions?
I Linguistic context
I Conversation history
I Other forms of context
I Shared knowledge, etc.
49/61
Language as Action: Speech Acts
I The Mood of a sentence indicates relation between speaker and the concept
(proposition) defined by the LF
I There can be operators that represent these direct relations:
I ASSERT: the proposition is proposed as a fact
I YN-QUERY: the truth of the proposition is queried
I COMMAND: the proposition describes a requested action
I WH-QUERY: the proposition describes an object to be identified
I There are also indirect speech acts.
I Can you pass the salt?
I It is warm here.
50/61
“How to Do Things with Words.” J. L. Austin 1
1 http://en.wikipedia.org/wiki/J._L._Austin
51/61
Performative Sentences
I When uttered by the proper authority, such sentences have the effect of changing the
state of the world, just as any other action that can change the state of the world.
I These involve verbs like, name, second, declare, etc.
I “I name this ship the Titanic.” also causes the ship to be named Titanic.
I You can tell whether sentences are performative by adding “hereby”:
I I hereby name this ship the Queen Elizabeth.
I Non-performative sentences do not sound good with hereby:
I Birds hereby sing.
I There is hereby fighting in Syria.
52/61
Speech Acts Continued
53/61
Searle’s Speech Acts
I Assertives = speech acts that commit a speaker to the truth of the expressed
proposition
I Directives = speech acts that are to cause the hearer to take a particular action, e.g.
requests, commands and advice
I Can you pass the salt?
I Has the form of a question but the effect of a directive
I Commissives = speech acts that commit a speaker to some future action, e.g.
promises and oaths
I Expressives = speech acts that express the speaker’s attitudes and emotions
towards the proposition, e.g. congratulations, excuses
I Declarations = speech acts that change the reality in accord with the proposition of
the declaration, e.g. pronouncing someone guilty or pronouncing someone husband
and wife
54/61
Speech Acts in NLP
55/61
Task-oriented Dialogues
56/61
Ways of Asking for a Room
57/61
Examples of Task-oriented Speech Acts
I Identify self:
I This is David
I My name is David
I I’m David
I David here
I Sound check: Can you hear me?
I Meta dialogue act: There is a problem.
I Greet: Hello.
I Request-information:
I Where are you going?
I Tell me where you are going.
58/61
Examples of Task-oriented Speech Acts
I Backchannel – Sounds you make to indicate that you are still listening
I ok, m-hm
I Apologize/reply to apology
I Thank/reply to thanks
I Request verification/Verify
I So that’s 2:00? Yes. 2:00.
I Resume topic
I Back to the accommodations . . .
I Answer a yes/no question: yes, no.
59/61
Task-oriented Speech Acts in Negotiation
I Suggest
I I recommend this hotel
I Offer
I I can send some brochures.
I How about if I send some brochures.
I Accept
I Sure. That sounds fine.
I Reject
I No. I don’t like that one.
60/61
Negotiation
61/61
(Mostly Statistical)
Machine Translation
11-411
Fall 2017
2
The Rosetta Stone
• Decree from Ptolemy V
on repealing taxes and
erecting some statues
(196 BC)
• Written in three
languages
– Hieroglyphic
– Demotic
– Classical Greek
3
Overview
• History of Machine Translation
• Early Rule-based Approaches
• Introduction to Statistical Machine Translation
(SMT)
• Advanced Topics in SMT
• Evaluation of (S)MT output
4
Machine Translation
• Transform text (speech) in one language
(source) to text (speech) in a different
language (target) such that
– The “meaning” in the source language input is
(mostly) preserved, and
– The target language output is grammatical.
• Holy grail application in AI/NLP since middle of
20th century.
5
Translation
• Process
– Read the text in the source language
– Understand it
– Write it down in the target language
6
Machine Translation
Many possible legitimate translations!
7
Machine Translation
Rolls-Royce Merlin Engine (from German Wikipedia):
• Der Rolls-Royce Merlin ist ein 12-Zylinder-Flugmotor von Rolls-Royce in V-Bauweise, der vielen wichtigen britischen und US-amerikanischen Flugzeugmustern des Zweiten Weltkriegs als Antrieb diente. Ab 1941 wurde der Motor in Lizenz von der Packard Motor Car Company in den USA als Packard V-1650 gebaut.
• Nach dem Krieg wurden diverse Passagier- und Frachtflugzeuge mit diesem Motor ausgestattet, so z. B. Avro Lancastrian, Avro Tudor und Avro York, später noch einmal die Canadair C-4 (umgebaute Douglas C-54). Der zivile Einsatz des Merlin hielt sich jedoch in Grenzen, da er als robust, aber zu laut galt.
• Die Bezeichnung des Motors ist gemäß damaliger Rolls-Royce Tradition von einer Vogelart, dem Merlinfalken, übernommen und nicht, wie oft vermutet, von dem Zauberer Merlin.
English Translation (via Google Translate):
• The Rolls-Royce Merlin is a 12-cylinder aircraft engine from Rolls-Royce V-type, which served many important British and American aircraft designs of World War II as a drive. From 1941 the engine was built under license by the Packard Motor Car Company in the U.S. as a Packard V-1650th.
• After the war, several passenger and cargo aircraft have been equipped with this engine, such as Avro Lancastrian, Avro Tudor Avro York and, later, the Canadair C-4 (converted Douglas C-54). The civilian use of the Merlin was, however, limited as it remains robust, however, was too loud.
• The name of the motor is taken under the then Rolls-Royce tradition of one species, the Merlin falcon, and not, as often assumed, by the wizard Merlin.
8
Machine Translation
Rolls-Royce Merlin Engine (from German Wikipedia):
• Der Rolls-Royce Merlin ist ein 12-Zylinder-Flugmotor von Rolls-Royce in V-Bauweise, der vielen wichtigen britischen und US-amerikanischen Flugzeugmustern des Zweiten Weltkriegs als Antrieb diente. Ab 1941 wurde der Motor in Lizenz von der Packard Motor Car Company in den USA als Packard V-1650 gebaut.
• Nach dem Krieg wurden diverse Passagier- und Frachtflugzeuge mit diesem Motor ausgestattet, so z. B. Avro Lancastrian, Avro Tudor und Avro York, später noch einmal die Canadair C-4 (umgebaute Douglas C-54). Der zivile Einsatz des Merlin hielt sich jedoch in Grenzen, da er als robust, aber zu laut galt.
• Die Bezeichnung des Motors ist gemäß damaliger Rolls-Royce Tradition von einer Vogelart, dem Merlinfalken, übernommen und nicht, wie oft vermutet, von dem Zauberer Merlin.
Turkish Translation (via Google Translate):
• Rolls-Royce Merlin 12-den silindirli Rolls-Royce uçak motoru V tipi, bir sürücü olarak Dünya Savaşı'nın birçok önemli İngiliz ve Amerikan uçak tasarımları devam eder. 1.941 motor lisansı altında Packard Motor Car Company tarafından ABD'de Packard V olarak yaptırılmıştır Gönderen-1650
• Savaştan sonra, birkaç yolcu ve kargo uçakları ile Avro Lancastrian, Avro Avro York ve Tudor gibi bu motor, daha sonra, Canadair C-4 (Douglas C-54) dönüştürülür donatılmıştır. Olarak, ancak, çok yüksek oldu sağlam kalır Merlin sivil kullanıma Ancak sınırlıydı.
• Motor adı daha sonra Rolls altında bir türün, Merlin şahin, ve değil-Royce geleneği, sıklıkta kabul, Merlin sihirbaz tarafından alınır.
10
Machine Translation
Rolls-Royce Merlin Engine Arabic Translation
(from German Wikipedia) (via Google Translate – 2009)
• Der Rolls-Royce Merlin ist ein 12-Zylinder-
Flugmotor von Rolls-Royce in V-Bauweise,
der vielen wichtigen britischen und US-
amerikanischen Flugzeugmustern des
ZweitenWeltkriegs als Antrieb diente. Ab
1941 wurde der Motor in Lizenz von der
Packard Motor Car Company in den USA als
Packard V-1650 gebaut.
• Nach dem Krieg wurden diverse Passagier-
und Frachtflugzeuge mit diesem Motor
ausgestattet, so z. B. Avro Lancastrian, Avro
Tudor und Avro York, später noch einmal die
Canadair C-4 (umgebaute Douglas C-54). Der
zivile Einsatz des Merlin hielt sich jedoch in
Grenzen, da er als robust, aber zu laut galt.
• Die Bezeichnung des Motors ist gemäß
damaliger Rolls-Royce Tradition von einer
Vogelart, dem Merlinfalken, übernommen
und nicht, wie oft vermutet, von dem
Zauberer Merlin.
11
Machine Translation
Rolls-Royce Merlin Engine Arabic Translation
(from German Wikipedia) (via Google Translate – 2017)
• Der Rolls-Royce Merlin ist ein 12-Zylinder-
Flugmotor von Rolls-Royce in V-Bauweise,
der vielen wichtigen britischen und US-
amerikanischen Flugzeugmustern des
ZweitenWeltkriegs als Antrieb diente. Ab
1941 wurde der Motor in Lizenz von der
Packard Motor Car Company in den USA als
Packard V-1650 gebaut.
• Nach dem Krieg wurden diverse Passagier-
und Frachtflugzeuge mit diesem Motor
ausgestattet, so z. B. Avro Lancastrian, Avro
Tudor und Avro York, später noch einmal die
Canadair C-4 (umgebaute Douglas C-54). Der
zivile Einsatz des Merlin hielt sich jedoch in
Grenzen, da er als robust, aber zu laut galt.
• Die Bezeichnung des Motors ist gemäß
damaliger Rolls-Royce Tradition von einer
Vogelart, dem Merlinfalken, übernommen
und nicht, wie oft vermutet, von dem
Zauberer Merlin.
12
Machine Translation
• (Real-time speech-to-speech) Translation is a
very demanding task
– Simultaneous translators (in UN, or EU Parliament)
last about 30 minutes
– Time pressure
– Divergences between languages
• German: Subject ........................... Verb
• English: Subject Verb ……………………….
• Arabic: Verb Subject ..............
13
Brief History
• 1950’s: Intensive research activity in MT
– Translate Russian into English
• 1960’s: Direct word-for-word replacement
• 1966 (ALPAC): NRC Report on MT
– Conclusion: MT no longer worthy of serious scientific
investigation.
• 1966-1975: `Recovery period’
• 1975-1985: Resurgence (Europe, Japan)
• 1985-present: Resurgence (US)
– Mostly Statistical Machine Translation since 1990s
– Recently Neural Network/Deep Learning based machine
translation
14
Early Rule-based Approaches
• Expert system-like rewrite systems
• Interlingua methods (analyze and generate)
• Information used for translation are compiled
by humans
– Dictionaries
– Rules
15
Vauquois Triangle
16
Statistical Approaches
• Word-to-word translation
• Phrase-based translation
• Syntax-based translation (tree-to-tree, tree-to-
string)
– Trained on parallel corpora
– Mostly noisy-channel (at least in spirit)
17
Early Hints on the Noisy Channel
Intuition
• “One naturally wonders if the problem of
translation could conceivably be treated as a
problem in cryptography. When I look at an
article in Russian, I say: ‘This is really written
in English, but it has been coded in some
strange symbols. I will now proceed to
decode.’ ”
Warren Weaver
• (1955:18, quoting a letter he wrote in 1947)
18
Divergences between Languages
• Languages differ along many dimensions
– Concept – Lexicon alignment – Lexical Divergence
– Syntax – Structure Divergence
• Word-order differences
– English is Subject-Verb-Object
– Arabic is Verb-Subject-Object
– Turkish is Subject-Object-Verb
• Phrase order differences
• Structure-Semantics Divergences
19
Lexical Divergences
• English: wall
– German: Wand for walls inside, Mauer for walls
outside
• English: runway
– Dutch: Landingbaan for when you are landing;
startbaan for when you are taking off
• English: aunt
– Turkish: hala (father’s sister), teyze(mother’s sister)
• Turkish: o
– English: she, he, it
20
Lexical Divergences
How conceptual space is cut up
21
Lexical Gaps
• One language may not have a word for a
concept in another language
– Japanese: oyakoko
• Best English approximation: “filial piety”
– Turkish: gurbet
• Where you are when you are not “home”
– English: condiments
• Turkish: ??? (things like mustard, mayo and ketchup)
22
Local Phrasal Structure Divergences
• English: a blue house
– French: une maison bleu
• German: die ins Haus gehende Frau
– English: the lady walking into the house
23
Structural Divergences
• English: I have a book.
– Turkish: Benim kitabim var. (Lit: My book exists)
• French: Je m’appelle Jean (Lit: I call myself
Jean)
– English: My name is Jean.
• English: I like swimming.
– German: Ich schwimme gerne. (Lit: I swim
“likingly”.)
24
Major Rule-based MT
Systems/Projects
• Systran
– Major human effort to construct large translation
dictionaires + limited word-reordering rules
• Eurotra
– Major EU-funded project (1970s-1994) to translate
among (then) 12 EC languages.
• Bold technological framework
– Structural Interlingua
• Management failure
• Never delivered a working MT system
• Helped create critical mass of researchers
25
Major Rule-based MT
Systems/Projects
• METEO
– Successful system for French-English translation of
Canadian weather reports (1975-1977)
• PANGLOSS
– Large-scale MT project by CMU/USC-ISI/NMSU
– Interlingua-based Japanese-Spanish-English
translation
– Manually developed semantic lexicons
26
Rule-based MT
• Manually develop rules to analyze the source
language sentence (e.g., a parser)
– => some source structure representation
• Map source structure to a target structure
• Generate target sentence from the transferred
structure
27
Rule-based MT
Syntactic Transfer
(figure: syntactic transfer — the English parse tree for “I read scientific books” [Sentence → Noun Phrase (Pronoun) + Verb Phrase (Verb + Noun Phrase: Adj + Noun)] is mapped to a French tree generating “Je lire livres scientifiques”, with the adjective and noun swapped; source-language analysis on the left, target-language generation on the right)
28
Rules
• Rules to analyze the source sentences
– (Usually) Context-free grammar rules coupled with
linguistic features
• Sentence => Subject-NP Verb-Phrase
• Verb-Phrase => Verb Object …..
29
Rules
• Lexical transfer rules
– English: book (N) => French: livre (N, masculine)
– English: pound (N, monetary sense)=> French:
livre (N, feminine)
– English: book (V) => French: réserver (V)
• Quite tricky for
30
Rules
• Structure Transfer Rules
– English: S => NP VP  →  French: TR(S) => TR(NP) TR(VP)
– English: NP => Adj Noun  →  French: TR(NP) => TR(Noun) TR(Adj)
  but there are exceptions for Adj = grand, petit, . . .
31
Rules
Much more complex to deal with “real world” sentences.
32
Example-based MT (EBMT)
• Characterized by its use of a bilingual corpus
with parallel texts as its main knowledge base,
at run-time.
• Essentially translation by analogy and can be
viewed as an implementation of case-based
reasoning approach of machine learning.
• Find how (parts of) input are translated in the
examples
– Cut and paste to generate novel translations
33
Example-based MT (EBMT)
• Translation Memory
– Store many translations,
• source – target sentence pairs
– For new sentences, find the closest match
• use edit distance, POS match, other similarity techniques
– Do corrections,
• map insertions, deletions, substitutions onto target sentence
– Useful only when you expect same or similar sentence to
show up again, but then high quality
34
Example-based MT (EBMT)
English | Japanese
• How much is that red umbrella? | Ano akai kasa wa ikura desu ka?
• How much is that small camera? | Ano chiisai kamera wa ikura desu ka?
35
Hybrid Machine Translation
• Use multiple techniques (rule-based/
EBMT/Interlingua)
• Combine the outputs of different systems to
improve final translations
36
How do we evaluate MT output?
• Adequacy: Is the meaning of the source
sentence conveyed by the target sentence?
• Fluency: Is the sentence grammatical in the
target language?
• These are rated on a scale of 1 to 5
37
How do we evaluate MT output?
Je suis fatigué.
Adequacy Fluency
Tired is I. 5 2
I am tired. 5 5
38
How do we evaluate MT output?
• This in general is very labor intensive
– Read each source sentence
– Evaluate target sentence for adequacy and fluency
• Not easy to do if you improve your MT system
10 times a day, and need to evaluate!
– Could this be mechanized?
• Later
39
MT Strategies (1954-2004)
(figure: MT strategies laid out along two axes. Vertical axis — depth of the strategy, from shallow/simple word-based approaches through syntactic constituent structure and semantic analysis to interlingua and deep/complex knowledge representation. Horizontal axis — knowledge-acquisition strategy, from hand-built by experts and hand-built by non-experts (all manual) to learned from annotated data and learned from un-annotated data (fully automated). Electronic dictionaries, phrase tables, example-based MT, and statistical MT occupy the shallow end; the original direct approach, typical transfer systems, and classic interlingual systems occupy the hand-built columns; “New Research Goes Here!” marks deep analysis learned from data. Slide by Laurie Gerber)
40
Statistical Machine Translation
• How does statistics and probabilities come
into play?
– Often statistical and rule-based MT are seen as
alternatives, even opposing approaches – wrong
!!!
                  No Probabilities         Probabilities
Flat Structure    EBMT                     SMT
Deep Structure    Transfer, Interlingua    Holy Grail
– Goal: structurally rich probabilistic models
41
Rule-based MT vs SMT
(figure: Expert System vs. Statistical System)
Expert system: experts write manually coded rules (“If « … » then … If « … » then … Else ….”).
Expert system output for S: “Mais où sont les neiges d’antan?” — T: “But where are the snows of ?”
Statistical system: a bilingual parallel corpus of (S, T) pairs plus machine learning yields statistical rules such as P(but | mais)=0.7, P(however | mais)=0.3, P(where | où)=1.0, ……
Statistical system output: T1: “But where are the snows of yesteryear?” P = 0.41; T2: “However, where are yesterday’s snows?” P = 0.33; T3: “Hey - where did the old snow go?” P = 0.18; …
42
Data-Driven Machine Translation
Translated documents
43
Slide by Kevin Knight
Statistical Machine Translation
• The idea is to use lots of parallel texts to
model how translations are done.
– Observe how words or groups of words are
translated
– Observe how translated words are moved around
to make fluent sentences in the target sentences
44
Parallel Texts
1a. Garcia and associates . 7a. the clients and the associates are enemies .
1b. Garcia y asociados . 7b. los clients y los asociados son enemigos .
2a. Carlos Garcia has three associates . 8a. the company has three groups .
2b. Carlos Garcia tiene tres asociados . 8b. la empresa tiene tres grupos .
3a. his associates are not strong . 9a. its groups are in Europe .
3b. sus asociados no son fuertes . 9b. sus grupos estan en Europa .
4a. Garcia has a company also . 10a. the modern groups sell strong pharmaceuticals .
4b. Garcia tambien tiene una empresa . 10b. los grupos modernos venden medicinas fuertes .
5a. its clients are angry . 11a. the groups do not sell zenzanine .
5b. sus clientes estan enfadados . 11b. los grupos no venden zanzanina .
6a. the associates are also angry . 12a. the small groups are not modern .
6b. los asociados tambien estan enfadados . 12b. los grupos pequenos no son modernos .
45
Parallel Texts
Clients do not sell pharmaceuticals in Europe
Clientes no venden medicinas en Europa
1a. Garcia and associates . 7a. the clients and the associates are enemies .
1b. Garcia y asociados . 7b. los clients y los asociados son enemigos .
2a. Carlos Garcia has three associates . 8a. the company has three groups .
2b. Carlos Garcia tiene tres asociados . 8b. la empresa tiene tres grupos .
3a. his associates are not strong . 9a. its groups are in Europe .
3b. sus asociados no son fuertes . 9b. sus grupos estan en Europa .
4a. Garcia has a company also . 10a. the modern groups sell strong pharmaceuticals .
4b. Garcia tambien tiene una empresa . 10b. los grupos modernos venden medicinas fuertes .
5a. its clients are angry . 11a. the groups do not sell zenzanine .
5b. sus clientes estan enfadados . 11b. los grupos no venden zanzanina .
6a. the associates are also angry . 12a. the small groups are not modern .
6b. los asociados tambien estan enfadados . 12b. los grupos pequenos no son modernos .
46
Parallel Texts
English:
1. employment rates are very low , especially for women .
2. the overall employment rate in 2001 was 46. 8% .
3. the system covers insured employees who lose their jobs .
4. the resulting loss of income is covered in proportion to the premiums paid .
5. there has been no development in the field of disabled people .
6. overall assessment
7. no social dialogue exists in most private enterprises .
8. it should be reviewed together with all the social partners .
9. much remains to be done in the field of social protection .
Turkish:
1. istihdam oranları , özellikle kadınlar için çok düşüktür .
2. 2001 yılında genel istihdam oranı % 46,8' dir .
3. sistem , işini kaybeden sigortalı işsizleri kapsamaktadır .
4. ortaya çıkan gelir kaybı , ödenmiş primlerle orantılı olarak karşılanmaktadır .
5. engelli kişiler konusunda bir gelişme kaydedilmemiştir .
6. genel değerlendirme
7. özel işletmelerin çoğunda sosyal diyalog yoktur .
8. konseyin yapısı , sosyal taraflar ile birlikte yeniden gözden geçirilmelidir .
9. sosyal koruma alanında yapılması gereken çok şey vardır .
47
Available Parallel Data (2004)
Millions of
words
(English side)
49
Available Parallel Data (2017)
50
Available Parallel Text
• A book has a few 100,000s words
• An educated person may read 10,000 words a
day
– 3.5 million words a year
– 300 million words a lifetime
• Soon computers will have access to more
translated text than humans read in a lifetime
51
More data is better!
• Language Weaver Arabic to English Translation
52
Sample Learning Curves
(figure: BLEU-score learning curves vs. amount of training data for Swedish/English, French/English, German/English, and Finnish/English)
54
Sentence Alignment
English: The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.
Spanish: El viejo está feliz porque ha pescado muchos veces. Su mujer habla con él. Los tiburones esperan.
55
Sentence Alignment
English:
1. The old man is happy.
2. He has fished many times.
3. His wife talks to him.
4. The fish are jumping.
5. The sharks await.
Spanish:
1. El viejo está feliz porque ha pescado muchos veces.
2. Su mujer habla con él.
3. Los tiburones esperan.
56
Sentence Alignment
• 1-1 Alignment
– 1 sentence in one side aligns to 1 sentence in the
other side
• 0-n, n-0 Alignment
– A sentence in one side aligns to no sentences on the
other side
• n-m Alignment (n,m>0 but typically very small)
– n sentences on one side align to m sentences on the
other side
57
Sentence Alignment
• Sentence alignments are typically done by
dynamic programming algorithms
– Almost always, the alignments are monotonic.
– The lengths of sentences and their translations
(mostly) correlate.
– Tokens like numbers, dates, proper names,
cognates help anchor sentences..
58
Sentence Alignment
English:
1. The old man is happy.
2. He has fished many times.
3. His wife talks to him.
4. The fish are jumping.
5. The sharks await.
Spanish:
1. El viejo está feliz porque ha pescado muchos veces.
2. Su mujer habla con él.
3. Los tiburones esperan.
59
Sentence Alignment
English:
1. The old man is happy. He has fished many times.
2. His wife talks to him.
3. The sharks await.
Spanish:
1. El viejo está feliz porque ha pescado muchos veces.
2. Su mujer habla con él.
3. Los tiburones esperan.
– Output: 美国 关岛国 际机 场 及其 办公
室均接获 一名 自称 沙地 阿拉 伯
富 商拉登 等发 出 的 电子邮件。
61
The Basic Formulation of SMT
• Given a source language sentence s, what is the
target language text t, that maximizes
p(t | s)
• So, any target language sentence t is a “potential”
translation of the source sentence s
– But probabilities differ
– We need that t with the highest probability of being a
translation.
62
The Basic Formulation of SMT
• Given a source language sentence s, what is
the target language text t, that maximizes
p(t | s)
• We denote this computation as a search
t∗ = argmax_t p(t | s)
63
The Basic Formulation of SMT
• We need to compute t∗ = argmax_t p(t | s)
64
The Noisy Channel Model
(figure: the noisy channel — the target sentence T “Dün Ali’yi gördüm.” passes through a noisy channel and comes out as the source sentence S “I saw Ali yesterday”; the models are estimated from source/target bilingual text and from target text)
67
How do the models interact?
• Maximizing p(S | T) P(T)
– p(T) models “good” target sentences (Target Language Model)
– p(S|T) models whether words in source sentence are “good”
translation of words in the target sentence (Translation Model)
Example — source S: “I saw Ali yesterday”. Candidate target sentences T, each judged on “good target?” P(T), “good match to source?” P(S|T), and the overall product:
Bugün Ali’ye gittim
Okulda kalmışlar
Var gelmek ben
Dün Ali’yi gördüm
Gördüm ben dün Ali’yi
Dün Ali’ye gördüm
68
Three Problems for Statistical MT
• Language model
– Given a target sentence T, assigns p(T)
• good target sentence -> high p(T)
• word salad -> low p(T)
• Translation model
– Given a pair of strings <S,T>, assigns p(S | T)
• <S,T> look like translations -> high p(S | T)
• <S,T> don’t look like translations -> low p(S | T)
• Decoding algorithm
– Given a language model, a translation model, and a new
sentence S … find translation T maximizing p(T) * p(S|T)
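A toy sketch of the decoding objective over an explicitly enumerated candidate list (real decoders search rather than enumerate; every probability below is invented):

def best_translation(source, candidates, lm_prob, tm_prob):
    """Pick argmax_T  P(T) * P(S | T) from a finite candidate list."""
    return max(candidates, key=lambda t: lm_prob(t) * tm_prob(source, t))

# Invented toy scores for the running example
lm = {"Dün Ali'yi gördüm": 0.02, "Dün Ali'ye gördüm": 0.001, "Var gelmek ben": 1e-6}
tm = {"Dün Ali'yi gördüm": 0.3, "Dün Ali'ye gördüm": 0.3, "Var gelmek ben": 0.2}
print(best_translation("I saw Ali yesterday", list(lm),
                       lm_prob=lambda t: lm[t],
                       tm_prob=lambda s, t: tm[t]))   # "Dün Ali'yi gördüm"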
69
The Classic Language Model:
Word n-grams
• Helps us choose among sentences
– He is on the soccer field
– He is in the soccer field
– Rice shrine
– American shrine
– Rice company
– American company
70
The Classic Language Model
• What is a “good” target sentence? (HLT Workshop 3)
• T = t1 t2 t3 … tn;
• We want P(T) to be “high”
• A good approximation is by short n-grams
– P(T) ≈ P(t1 |START) • P(t2 |START, t1 ) • P(t3 |t1 , t2 ) • … • P(ti |ti−2 , ti−1 ) • … • P(tn |tn−2 , tn−1 )
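A sketch of this trigram approximation with maximum-likelihood estimates from counts (toy corpus; smoothing omitted):

from collections import defaultdict

def train_trigram(corpus):
    tri, bi = defaultdict(int), defaultdict(int)
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(toks)):
            tri[tuple(toks[i-2:i+1])] += 1
            bi[tuple(toks[i-2:i])] += 1
    # P(w | u, v) = count(u, v, w) / count(u, v)
    return lambda w, u, v: tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

p = train_trigram([["he", "is", "on", "the", "soccer", "field"],
                   ["he", "is", "happy"]])
print(p("on", "he", "is"))   # P(on | he, is) = 1/2 under these toy counts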
71
The Classic Language Model
72
The Classic Language Model
73
Translation Model?
Generative approach:
74
The Classic Translation Model
Word Substitution/Permutation [IBM Model 3, Brown et al., 1993]
Generative approach:
75
The Classic Translation Model
Word Substitution/Permutation [IBM Model 3, Brown et al., 1993]
Generative approach:
77
Basic Translation Model (IBM M-1)
p(t | s, m) = Σ_a p(a | s, m) × ∏_{i=1..m} p(ti | s_{a_i})
(summing over all possible alignments a of the m target words to source positions)
78
Parameters of the IBM 3 Model
• Fertility: How many words does a source word get
translated to?
– n(k | s): the probability that the source word s gets
translated as k target words
– Fertility depends solely on the source words in question
and not other source words in the sentence, or their
fertilities.
79
Parameters of the IBM 3 Model
• Translation: How do source words translate?
– tr(t|s): the probability that the source word s gets
translated as the target word t
– Once we fix n(k | s) we generate k target words
• Reordering: How do words move around in the
target sentence?
– d(j | i): distortion probability – the probability of word
at position i in a source sentence being translated as
the word at position j in target sentence.
• Very dubious!!
80
How IBM Model 3 works
1. For each source word si indexed by i = 1, 2, ..., m, choose fertility φi with probability n(φi | si ).
2. Choose the number φ0 of “spurious” target words to be generated from s0 = NULL
How IBM Model 3 works
3. Let q be the sum of fertilities for all words, including NULL.
4. For each i = 0, 1, 2, ..., m, and each k = 1, 2, ..., φi , choose a target word tik with probability tr(tik | si ).
5. For each i = 1, 2, ..., l, and each k = 1, 2, ..., φi , choose a target position πik with probability d(πik | i, l, m).
How IBM Model 3 works
6. For each k = 1, 2, ..., φ0 , choose a position π0k from the remaining vacant positions in 1, 2, ..., q, for a total probability of 1/φ0 .
7. Output the target sentence with words tik in positions πik (0 <= i <= m, 1 <= k <= φi ).
83
Example
• n-parameters
(figure: word-alignment diagram for the training pairs b c d → x y z and b d → x y)
• n(0,b)=0, n(1,b)=2/2=1
• n(0,c)=1/1=1, n(1,c)=0
• n(0,d)=0, n(1,d)=1/2=0.5, n(2,d)=1/2=0.5
84
Example
• t-parameters
(same alignment diagram as above)
• t(x|b)=1.0
• t(y|d)=2/3
• t(z|d)=1/3
85
Example
• d-parameters
(same alignment diagram as above)
• d(1|1,3,3)=1.0
• d(1|1,2,2)=1.0
• d(2|2,3,3)=0.0
• d(3|3,3,3)=1.0
• d(2|2,2,2)=1.0
86
Example
• p1
(same alignment diagram as above)
• No target words are generated by NULL, so p1 = 0.0
87
The Classic Translation Model
Word Substitution/Permutation [IBM Model 3, Brown et al., 1993]
Generative approach:
89
Word Alignments
• One source word can map
to 0 or more target words
– But not vice versa
• technical reasons
• Some words in the target
can magically be
generated from an
invisible NULL word
• A target word can only be
generated from one
source word
– technical reasons
90
Word Alignments
tr(oeuvre | worked) = c(oeuvre, worked) / c(worked)
91
How do we get these alignments?
• We only have aligned sentences and the
constraints:
– One source word can map to 0 or more target words
• But not vice versa
– Some words in the target can magically be generated
from an invisible NULL word
– A target word can only be generated from one source
word
• Expectation–Maximization (EM) Algorithm
– Mathematics is rather complicated
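As a compact sketch of the idea, here is EM for the simplest case (IBM Model 1: word-translation probabilities only — no fertility or distortion, and the NULL word is omitted). This is an illustration, not the full Model 3 training procedure:

from collections import defaultdict

def ibm1_em(pairs, iterations=10):
    """pairs: list of (source_tokens, target_tokens). Returns tr(t | s) estimates."""
    src_vocab = {s for src, _ in pairs for s in src}
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    tr = {(t, s): 1.0 / len(tgt_vocab) for s in src_vocab for t in tgt_vocab}  # uniform init
    for _ in range(iterations):
        count = defaultdict(float)          # expected counts c(t, s)
        total = defaultdict(float)          # expected counts c(s)
        for src, tgt in pairs:
            for t in tgt:                   # E-step: fractional alignment counts
                z = sum(tr[(t, s)] for s in src)
                for s in src:
                    count[(t, s)] += tr[(t, s)] / z
                    total[s] += tr[(t, s)] / z
        for (t, s) in tr:                   # M-step: renormalize
            tr[(t, s)] = count[(t, s)] / total[s] if total[s] else tr[(t, s)]
    return tr

pairs = [("la maison".split(), "the house".split()),
         ("la fleur".split(), "the flower".split())]
tr = ibm1_em(pairs)
print(round(tr[("the", "la")], 2))   # moves towards 1.0 as EM iterates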
92
How do we get these alignments?
93
How do we get these alignments?
94
How do we get these alignments?
(pigeonhole principle)
95
How do we get these alignments?
96
How do we get these alignments?
97
Decoding for “Classic” Models
• Of all conceivable English word strings, find the one
maximizing p(t) * p(s | t)
98
Dynamic Programming Beam Search
(figure: beam-search decoding lattice — hypotheses are extended one target word at a time, from the start state until all source words are covered)
102
Phrase-Based Statistical MT
Morgen fliege ich nach Kanada zur Konferenz
103
Advantages of Phrase-Based SMT
• Many-to-many mappings can handle non-
compositional phrases
• Local context is very useful for disambiguating
– “Interest rate” → …
– “Interest in” → …
• The more data, the longer the learned phrases
– Sometimes whole sentences
104
How to Learn the Phrase Translation Table?
(figure: the T→S best word alignment and the S→T best word alignment are MERGEd — by union or intersection — before phrases are extracted)
107
How to Learn the Phrase Translation Table?
(figure: word-alignment matrix between “Maria no dió una bofetada a la bruja verde” and “Mary did not slap the green witch”, with one example phrase pair highlighted)
108
Word Alignment Consistent Phrases
(figure: three alignment grids for “Maria no dió . . . ” / “Mary did not slap the green witch”, illustrating which phrase pairs are consistent with the word alignment)
(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)
110
Word Alignment Induced Phrases
(alignment grid: Maria no dió una bofetada a la bruja verde × Mary did not slap the green witch)
(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)
111
Word Alignment Induced Phrases
(alignment grid: Maria no dió una bofetada a la bruja verde × Mary did not slap the green witch)
(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap the)
(Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the)
(bruja verde, green witch)
112
Word Alignment Induced Phrases
Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch
(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap the)
(Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the)
(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)
(a la bruja verde, the green witch)
113
Word Alignment Induced Phrases
Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch
(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap the)
(Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the)
(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)
(a la bruja verde, the green witch) (Maria no dió una bofetada a la, Mary did not slap the)
(no dió una bofetada a la, did not slap the) (dió una bofetada a la bruja verde, slap the green witch)
114
Word Alignment Induced Phrases
Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch
(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap the)
(Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the)
(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)
(a la bruja verde, the green witch) (Maria no dió una bofetada a la, Mary did not slap the)
(no dió una bofetada a la, did not slap the) (dió una bofetada a la bruja verde, slap the green witch)
(Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
115
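A minimal sketch of the consistency-based phrase extraction illustrated above: a source span and the target span it links to form a phrase pair if no word inside either span is aligned outside the other, and the box contains at least one alignment link. The word alignment used in the demo is one plausible choice (an assumption here), and the sketch omits the usual extension of phrases over unaligned boundary words, so its output approximates rather than exactly reproduces the lists on the slides.

```python
def extract_phrases(src, tgt, alignment, max_len=7):
    """src, tgt: lists of words; alignment: set of (src_idx, tgt_idx) links.
       Returns the set of phrase pairs consistent with the word alignment."""
    phrases = set()
    n = len(src)
    for s_start in range(n):
        for s_end in range(s_start, min(n, s_start + max_len)):
            # Target positions linked to the chosen source span.
            linked = [j for (i, j) in alignment if s_start <= i <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency: no target word in the span may link outside the source span.
            if any(t_start <= j <= t_end and not (s_start <= i <= s_end)
                   for (i, j) in alignment):
                continue
            phrases.add((" ".join(src[s_start:s_end + 1]),
                         " ".join(tgt[t_start:t_end + 1])))
    return phrases

src = "Maria no dió una bofetada a la bruja verde".split()
tgt = "Mary did not slap the green witch".split()
# One plausible word alignment for the running example (an assumption here);
# the source word "a" is left unaligned.
alignment = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3), (6, 4), (7, 6), (8, 5)}
for pair in sorted(extract_phrases(src, tgt, alignment)):
    print(pair)
```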
Phrase Pair Probabilities
– We hope so!
116
Phrase-based SMT
• After doing this to millions of sentences
– For each phrase pair (t, s)
• Count how many times s occurs
• Count how many times s is translated to t
• Estimate p(t | s)
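A minimal sketch of this relative-frequency estimate, assuming extracted phrase pairs arrive as (source phrase, target phrase) tuples; the names are illustrative.

```python
from collections import Counter

pair_counts = Counter()
source_counts = Counter()

def observe(pairs):
    """pairs: phrase pairs extracted from one aligned sentence pair."""
    for s, t in pairs:
        pair_counts[(s, t)] += 1
        source_counts[s] += 1

def phrase_prob(t, s):
    """p(t | s) estimated by relative frequency."""
    return pair_counts[(s, t)] / source_counts[s]
```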
117
Decoding
• During decoding
– a sentence is segmented into “phrases” in all possible ways
– each such phrase is then “translated” to the target phrases
in all possible ways
– Translated phrases can also be reordered (moved around)
– Resulting target sentences are scored with the target
language model
• The decoder does NOT actually enumerate all possible translations or all
possible target sentences
– Pruning
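A much-simplified sketch of such a decoder: monotone (no phrase reordering), with a toy phrase table of log probabilities and a user-supplied language-model scoring function, and beam pruning over the number of covered source words. All names and scoring details are assumptions for illustration; real decoders add reordering, future-cost estimates, and hypothesis recombination.

```python
def decode(source, phrase_table, lm_logprob, beam_size=10, max_phrase_len=3):
    """source: list of words.
       phrase_table: dict source_phrase -> list of (target_phrase, log p(t|s)).
       lm_logprob: function(list of target words) -> language-model log probability.
       Returns the best target sentence under this simplified model (or None)."""
    # stacks[k] holds hypotheses covering the first k source words:
    # (translation-model score so far, target words so far)
    stacks = [[] for _ in range(len(source) + 1)]
    stacks[0] = [(0.0, [])]
    for k in range(len(source) + 1):
        # Beam pruning: keep only the most promising hypotheses in this stack,
        # scored by translation model plus language model of the prefix.
        stacks[k] = sorted(stacks[k], key=lambda h: h[0] + lm_logprob(h[1]),
                           reverse=True)[:beam_size]
        if k == len(source):
            break
        for tm_score, target in stacks[k]:
            for plen in range(1, min(max_phrase_len, len(source) - k) + 1):
                src_phrase = " ".join(source[k:k + plen])
                for tgt_phrase, logp in phrase_table.get(src_phrase, []):
                    stacks[k + plen].append((tm_score + logp,
                                             target + tgt_phrase.split()))
    if not stacks[len(source)]:
        return None
    best = max(stacks[len(source)], key=lambda h: h[0] + lm_logprob(h[1]))
    return " ".join(best[1])
```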
118
Decoding
119
Basic Model, Revisited
argmax_t P(t | s) = …
120
Basic Model, Revisited
argmax_t P(t | s) = …
121
Basic Model, Revisited
argmax_t P(t | s) = …
122
Basic Model, Revisited
123
Maximum BLEU Training
[Diagram: an automatic, trainable translation system combines several models as features (target language model, translation model, length model, other features #1, #2, ...). Source text goes in, MT output comes out; an automatic translation-quality evaluator compares the MT output against reference translations (sample "right answers") and produces a BLEU score, which is fed back to tune the system.]
125
BLEU Evaluation
Reference (human) translation: The US island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself Osama Bin Laden and threatening a biological/chemical attack against the airport.
Machine translation: The American [?] International airport and its the office a [?] receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out; The threat will be able after the maintenance at the airport.
• N-gram precision (score between 0 & 1): what % of machine n-grams (a sequence of words) can be found in the reference translation?
• Brevity Penalty: can't just type out the single word "the" (precision 1.0!)
• Extremely hard to trick the system, i.e., find a way to change MT output so that the BLEU score increases but quality doesn't.
126
More Reference Translations are Better
Reference translation 1: The US island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself Osama Bin Laden and threatening a biological/chemical attack against the airport.
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the rich Saudi Arabian businessman Osama Bin Laden and that threatened to launch a biological and chemical attack on the airport.
Machine translation: The American [?] International airport and its the office a [?] receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out; The threat will be able after the maintenance at the airport to start the biochemistry attack.
128
BLEU Formulation
BLEU = min(1, output-length / reference-length) × ∏_{i=1..4} precision_i
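A minimal sketch of this computation for one output sentence against one reference, using clipped ("modified") n-gram precision. Standard BLEU is computed over the whole test corpus and takes the geometric mean of the four precisions, but the structure is the same; the function names are illustrative.

```python
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def precision(output, reference, n):
    """Clipped n-gram precision of the output against one reference."""
    out_counts = Counter(ngrams(output, n))
    ref_counts = Counter(ngrams(reference, n))
    if not out_counts:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
    return matched / sum(out_counts.values())

def bleu(output, reference, max_n=4):
    output, reference = output.split(), reference.split()
    brevity = min(1.0, len(output) / len(reference))
    score = brevity
    for n in range(1, max_n + 1):
        score *= precision(output, reference, n)
    return score

print(bleu("the the the", "the cat is on the mat"))  # 0.0: clipping and brevity punish this output
```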
129
Correlation with Human Judgment
130
What About Morphology?
• Issue for handling morphologically complex
languages like Turkish, Hungarian, Finnish,
Arabic, etc.
– A word contains much more information than just
the root word
• Arabic: wsyktbunha (wa+sa+ya+ktub+ūn+ha “and they
will write her”)
– What are the alignments?
• Turkish: gelebilecekmişsin (gel+ebil+ecek+miş+sin: "(I heard) you would be
able to come")
– What are the alignments?
131
Morphology & SMT
• Finlandiyalılaştıramadıklarımızdanmışsınızcasına
• Finlandiya+lı+laş+tır+ama+dık+lar+ımız+dan+mış+sını
z+casına
• (behaving) as if you have been one of those whom
we could not convert into a Finn(ish
citizen)/someone from Finland
132
Morphology & SMT
• Most of the time, the morpheme order is the "reverse" of the corresponding English word order:
• yapabileceksek
– yap+abil+ecek+se+k
– if we will be able to do (something)
• yaptırtabildiğimizde
– yap+tır+t+abil+diğ+imiz+de
– when/at the time we could have (someone) have (someone else) do (something)
• görüntülenebilir
– görüntüle+n+ebil+ir
– it can be visualize+d
• sakarlıklarından
– sakar+lık+ları+ndan
– of/from/due-to their clumsi+ness
133
Morphology and Alignment
• Remember that the alignment needs to count co-occurring words
– If one side of the parallel text has little
morphology (e.g. English)
– The other side has lots of morphology
• Lots of words on the English side either don’t
align or align randomly
134
Morphology & SMT
• If we ignore morphology, every inflected form counts as a separate word:
Word Form        Count   Gloss
faaliyet           3     activity
faaliyetlerini     5
…                  1     to their activities
TOTAL             41
135
An Example E – T Translation
136
An Example E – T Translation
Biz siz+in Taksim +de +ki otel +iniz +e taksi +yle gid +iyor +uz
137
An Example E – T Translation
Biz siz+in Taksim +de +ki otel +iniz +e taksi +yle gid +iyor +uz
138
An Example E – T Translation
Biz siz+in Taksim +de +ki otel +iniz +e taksi +yle gid +iyor +uz
139
An Example E – T Translation
Biz siz+in Taksim +de +ki otel +iniz +e taksi +yle gid +iyor +uz
140
Morphology and Parallel Texts
• Use
– Morphological analyzers (HLT Workshop 2)
– Tagger/Disambiguators (HLT Workshop 3)
• to split both sides of the parallel corpus into
morphemes
141
Morphology and Parallel Texts
• A typical sentence pair in this corpus looks like
the following:
• Turkish:
– kat +hl +ma ortaklık +sh +nhn uygula +hn +ma +sh
, ortaklık anlaşma +sh çerçeve +sh +nda izle +hn
+yacak +dhr .
• English:
– the implementation of the accession partnership
will be monitor +ed in the framework of the
association agreement
142
Results
• Using morphology in Phrase-based SMT
certainly improves results compared to just
using words
• But
– Sentences get much longer and this hurts
alignment
– We now have an additional problem: getting the
morpheme order on each word right
143
Syntax and Morphology Interaction
• A completely different approach
– Instead of dividing up the Turkish side into morphemes
– Collect "stuff" on the English side to make up "words".
– What is the motivation?
144
Syntax and Morphology Interaction
Biz siz+in Taksim +de +ki otel +iniz +e taksi +yle gid +iyor +uz
• to your hotel
– to is the preposition related to hotel
– your is the possessor of hotel
• to your hotel => hotel +your+to
otel +iniz+e
– separate content from local syntax
146
Syntax and Morphology Interaction
we are go+ing to your hotel in Taksim by taxi
• we are go+ing
– we is the subject of go
– are is the auxiliary of go
– ing is the present tense marker for go
• we are go+ing => go +ing+are+we
gid +iyor+uz
– separate content from local syntax (see the sketch below)
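A minimal, illustrative sketch of this "collect local syntax onto the content word" idea, assuming the English side has already been dependency-analyzed; the relation labels and input format are assumptions for this example, not the actual system.

```python
def collect_onto_heads(tokens, deps):
    """tokens: list of words; deps: list of (dependent_idx, head_idx, relation).
       Function words in 'attachable' relations are glued onto their head as +suffixes."""
    attachable = {"prep", "poss", "aux", "subj", "tense"}   # treated as "local syntax"
    suffixes = {i: [] for i in range(len(tokens))}
    absorbed = set()
    for dep, head, rel in deps:
        if rel in attachable:
            suffixes[head].append(tokens[dep])
            absorbed.add(dep)
    out = []
    for i, tok in enumerate(tokens):
        if i in absorbed:
            continue
        out.append(tok + "".join("+" + s for s in suffixes[i]))
    return " ".join(out)

tokens = ["to", "your", "hotel"]
# Order the attachments closest-first to mirror the Turkish morpheme order.
deps = [(1, 2, "poss"), (0, 2, "prep")]
print(collect_onto_heads(tokens, deps))   # hotel+your+to  (cf. otel+iniz+e)
```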
147
Syntax and Morphology Interaction
we are go+ing to your hotel in Taksim by taxi
Now align only based on root words – the syntax alignments just follow that
148
Syntax and Morphology Interaction
149
Syntax and Morphology Interaction
• Transformations on the English side reduce
sentence length
• This helps alignment
– Morphemes and most function words never get
involved in alignment
• We can use factored phrase-based translation
– Phrase-based framework with morphology
support
150
Syntax and Morphology Interaction
[Chart: number of tokens (English vs. Turkish, roughly 800,000–1,300,000) and BLEU scores (roughly 15.00–25.00) across the experiments: Baseline-Factored, Adv, Verb, Verb+Adv, Noun+Adj, Noun+Adj+Verb, Noun+Adj+Verb+Adv, Noun+Adj+Verb+PostPC, Noun+Adj+Verb+Adv+PostPC.]
151
Syntax and Morphology Interaction
• She is reading.
– She is the subject of read
– is is the auxiliary of read
152
MT Strategies (1954–2004)
[Diagram (slide by Laurie Gerber): MT approaches arranged along two axes. Vertical axis — depth of representation, from shallow/simple (word-based: electronic dictionaries, phrase tables) through syntactic constituent structure and semantic analysis to interlingua / knowledge representation (deep/complex). Horizontal axis — knowledge acquisition strategy, from hand-built by experts (all manual) and hand-built by non-experts to learned from annotated data and learned from un-annotated data (fully automated). The original direct approach, the typical transfer system, and the classic interlingual system sit on the hand-built side; example-based MT and statistical MT learn from data; "New Research Goes Here!" marks deeper representations acquired automatically.]
153
Syntax in SMT
• Early approaches relied on high-performance
parsers for one or both languages
– Good applicability when English is the source
language
• Tree-to-tree or tree-to-string transductions
154
Tree-to-String Transformation
[Figure: tree-to-string transformation. Starting from the English parse tree (Parse Tree(E)) of "he adores listening to music", the children of each node are Reordered (PRP VB1 VB2 → PRP VB2 VB1, TO VB → VB TO, TO MN → MN TO), target-language function words (ha, ga, desu, ...) are Inserted, the leaves are translated (e.g., he → kare, adores → daisuki), and finally Take Leaves reads off the target sentence.]
156
Tree-to-String Transformation
• Each step is described by a statistical model
– Insert new sibling to the left or right of a node
probabilistically
– Translate source nodes probabilistically
157
Hierarchical phrase models
• Combines phrase-based models and tree
structures
• Extract synchronous grammars from parallel
text
• Uses a statistical chart-parsing algorithm
during decoding
– Parse and generate concurrently
158
For more info
• Proceedings of the Third Workshop on Syntax and
Structure in Statistical Translation (SSST-3) at NAACL
HLT 2009
– http://aclweb.org/anthology-new/W/W09/#2300
• Proceedings of the ACL-08: HLT Second Workshop on
Syntax and Structure in Statistical Translation (SSST-2)
– http://aclweb.org/anthology-new/W/W08/#0400
159
Acknowledgments
• Some of the tutorial material is based on
slides by
– Kevin Knight (USC/ISI)
– Philipp Koehn (Edinburgh)
– Reyyan Yeniterzi (CMU/LTI)
160
Important References
• Statistical Machine Translation (2010)
– Philipp Koehn
– Cambridge University Press
• SMT Workbook (1999)
– Kevin Knight
– Unpublished manuscript at http://www.isi.edu/~knight/
• http://www.statmt.org
• http://aclweb.org/anthology-new/
– Look for “Workshop on Statistical Machine Translation”
161
11-411
Natural Language Processing
Neural Networks and Deep Learning in NLP
Kemal Oflazer
1/60
Big Picture: Natural Language Analyzers
2/60
Big Picture: Natural Language Analyzers
3/60
Big Picture: Natural Language Analyzers
4/60
Linear Models
I y1 = w11 x1 + w21 x2 + w31 x3 + w41 x4 + w51 x5
5/60
Perceptrons
I Remember Perceptrons?
I A very simple algorithm guaranteed to eventually find a linear separator hyperplane
(determine w), if one exists.
I If one doesn’t, the perceptron will oscillate!
I Assume our classifier is
   classify(x) = 1 if w · Φ(x) > 0, and 0 if w · Φ(x) ≤ 0
I Start with w = 0
I for t = 1, . . . , T
I i = t mod N
I w ← w + α (ℓi − classify(xi)) Φ(xi)
I Return w
I α is the learning rate – determined by experimentation.
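A minimal sketch of this loop in code, assuming a feature function Φ(x) that returns a NumPy vector and labels ℓ in {0, 1}; the names are illustrative.

```python
import numpy as np

def perceptron(examples, phi, num_features, alpha=1.0, T=1000):
    """examples: list of (x, label) with label in {0, 1};
       phi: feature function mapping x to a vector of length num_features."""
    w = np.zeros(num_features)
    N = len(examples)
    for t in range(T):
        x, label = examples[t % N]
        prediction = 1 if w.dot(phi(x)) > 0 else 0
        # No change when the prediction is correct; otherwise move w toward
        # (label = 1) or away from (label = 0) the feature vector.
        w += alpha * (label - prediction) * phi(x)
    return w
```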
6/60
Perceptrons
I For classification we are basically computing
score(x) = W × f(x)^T = Σj wj · fj(x)
8/60
Multiple Layers
9/60
Adding Non-linearity
I Instead of computing a linear combination
score(x) = Σj wj · fj(x)
10/60
Deep Learning
11/60
What Depth Holds
12/60
Simple Neural Network
[Figure: a feed-forward network with two inputs x1, x2, two hidden units h1, h2, one output node, and bias units (fixed input 1). Weights: x1→h1 = 3.7, x2→h1 = 3.7, bias→h1 = −1.5; x1→h2 = 2.9, x2→h2 = 2.9, bias→h2 = −4.6; h1→output = 4.5, h2→output = −5.2, bias→output = −2.0.]
13/60
Sample Input
[Same network with inputs x1 = 1.0 and x2 = 0.0.]
sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.6) = sigmoid(−1.7) = 1 / (1 + e^1.7) = 0.15
14/60
Computed Hidden Layer Values
[Same network; computed hidden layer values h1 = 0.90 and h2 = 0.15.]
sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × −1.5) = sigmoid(2.2) = 1 / (1 + e^−2.2) = 0.90
sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.6) = sigmoid(−1.7) = 1 / (1 + e^1.7) = 0.15
15/60
Computed Output Value
[Same network; computed output value 0.78.]
sigmoid(0.90 × 4.5 + 0.15 × −5.2 + 1 × −2.0) = sigmoid(1.25) = 1 / (1 + e^−1.25) = 0.78
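The same forward computation in a few lines of NumPy, reproducing the values above (0.90, 0.15, 0.78); this is just a sketch of the arithmetic, not a general library.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Weights from the figure; the last column multiplies the bias input 1.
W_hidden = np.array([[3.7, 3.7, -1.5],    # x1, x2, bias -> h1
                     [2.9, 2.9, -4.6]])   # x1, x2, bias -> h2
W_output = np.array([4.5, -5.2, -2.0])    # h1, h2, bias -> output

x = np.array([1.0, 0.0, 1.0])              # inputs plus bias
h = sigmoid(W_hidden @ x)                  # [0.90, 0.15]
y = sigmoid(W_output @ np.append(h, 1.0))  # 0.78
print(np.round(h, 2), round(float(y), 2))
```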
16/60
Output for All Binary Inputs
17/60
The Brain vs. Artificial Neural Networks
I Similarities
I Neurons, connections between neurons
I Learning = change of connections, not change of neurons
I Massive parallel processing
I But artificial neural networks are much simpler
I computation within neuron vastly simplified
I discrete time steps
I typically some form of supervised learning with massive number of stimuli
18/60
Backpropagation Training
19/60
Backpropagation Training
[The same example network, with inputs (1.0, 0.0), hidden values (0.90, 0.15), and output 0.78, used to illustrate the weight updates derived on the following slides.]
20/60
Key Concepts
I Gradient Descent
I error is a function of the weights
I we want to reduce the error
I gradient descent: move towards the error minimum
I compute gradient → get direction to the error minimum
I adjust weights towards direction of lower error
I Backpropagation
I first adjust last set of weights
I propagate error back to each previous layer
I adjust their weights
21/60
Gradient Descent
22/60
Gradient Descent
23/60
Derivative of the Sigmoid
I Sigmoid: sigmoid(x) = 1 / (1 + e^−x)
I Reminder: quotient rule (f(x)/g(x))′ = (g(x) f′(x) − f(x) g′(x)) / g(x)²
I Derivative:
  d sigmoid(x)/dx = d/dx [1 / (1 + e^−x)]
                  = (0 × (1 + e^−x) − 1 × (−e^−x)) / (1 + e^−x)²
                  = (1 / (1 + e^−x)) × (e^−x / (1 + e^−x))
                  = (1 / (1 + e^−x)) × (1 − 1 / (1 + e^−x))
                  = sigmoid(x)(1 − sigmoid(x))
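A quick numerical sanity check of this identity (illustrative only):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, eps = 0.7, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # finite difference
analytic = sigmoid(x) * (1 - sigmoid(x))
print(abs(numeric - analytic) < 1e-8)   # True
```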
24/60
Final Layer Update (1)
I We have a linear combination of weights and hidden layer values: s = Σk wk hk
I Then we have the activation function: y = sigmoid(s)
I We have the error function E = ½ (t − y)².
I t is the target output.
I Derivative of error with regard to one weight wk (using the chain rule):
  dE/dwk = (dE/dy) (dy/ds) (ds/dwk)
I Error is already defined in terms of y, hence
  dE/dy = d/dy [½ (t − y)²] = −(t − y)
25/60
Final Layer Update (2)
I We have a linear combination of weights and hidden layer values: s = Σk wk hk
I Then we have the activation function: y = sigmoid(s)
I We have the error function E = ½ (t − y)².
I Derivative of error with regard to one weight wk (using the chain rule):
  dE/dwk = (dE/dy) (dy/ds) (ds/dwk)
I y with respect to s is sigmoid(s):
  dy/ds = d sigmoid(s)/ds = sigmoid(s)(1 − sigmoid(s)) = y(1 − y)
26/60
Final Layer Update (3)
I We have a linear combination of weights and hidden layer values: s = Σk wk hk
I Then we have the activation function: y = sigmoid(s)
I We have the error function E = ½ (t − y)².
I Derivative of error with regard to one weight wk (using the chain rule):
  dE/dwk = (dE/dy) (dy/ds) (ds/dwk)
I s is a weighted linear combination of hidden node values hk:
  ds/dwk = d/dwk (Σk wk hk) = hk
27/60
Putting it All Together
  dE/dwk = (dE/dy)(dy/ds)(ds/dwk) = −(t − y) y(1 − y) hk
I −(t − y) is the error term; y(1 − y) is the derivative of the sigmoid, y′
I We adjust the weight as follows
  ∆wk = µ (t − y) y′ hk
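I As a small worked example on the network from the earlier slides (assuming a target t = 1 and learning rate µ = 1 purely for illustration): with y = 0.78 and h1 = 0.90, ∆w1 = µ (t − y) y(1 − y) h1 = 1 × 0.22 × 0.78 × 0.22 × 0.90 ≈ 0.034, so the weight from h1 to the output increases slightly.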
28/60
Multiple Output Nodes
I Our example had one output node.
I Typically neural networks have multiple output nodes.
I Error is computed over all j output nodes:
  E = ½ Σj (tj − yj)²
I Weight wkj from hidden unit k to output unit j is adjusted according to node j
29/60
Hidden Layer Update
I In a hidden layer, we do not have a target output value.
I But we can compute how much each hidden node contributes to the downstream
error E.
I k refers to a hidden node
I j refers to a node in the next/output layer
I Remember the error term:
  δj = (tj − yj) y′j
I The error term associated with hidden node k is (skipping the multivariate math):
  δk = ( Σj wkj δj ) h′k
I So if uik is the weight between input unit xi and hidden unit k, then
  ∆uik = µ δk xi
I Compare with ∆wkj = µ δj hk.
30/60
An Example
[Figure: the same example network labeled with layer indices — input layer i (units 1, 2, 3, where unit 3 is the bias with value 1), hidden layer k, output layer j — and the notation x, u, h, w, y for the input vector, input-to-hidden weights, hidden values, hidden-to-output weights, and output.]
I Weights are initialized to small random values, e.g., uniformly from [−0.01, 0.01]
I For shallow networks there are suggestions for [−1/√n, 1/√n]
I For deep networks there are suggestions for [−√6/√(ni + ni+1), √6/√(ni + ni+1)]
  where ni and ni+1 are the sizes of the previous and next layers.
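The deep-network range above (often called Xavier/Glorot uniform initialization) as a short NumPy sketch:

```python
import numpy as np

def init_weights(n_in, n_out):
    """Uniform initialization in [-sqrt(6)/sqrt(n_in+n_out), +sqrt(6)/sqrt(n_in+n_out)]."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_out, n_in))
```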
34/60
Neural Networks for Classification
36/60
Problems with Gradient Descent Training
37/60
Problems with Gradient Descent Training
38/60
Speed-up: Momentum
I Remember the previous weight updates ∆wkj(n − 1)
I and add these to any new updates with a decay factor ρ
39/60
Dropout
I A general problem of machine learning: overfitting to training data (very good on train,
bad on unseen test)
I Solution: regularization, e.g., keeping weights from having extreme values
I Dropout: randomly remove some hidden units during training
I mask: set of hidden units dropped
I randomly generate, say, 10 – 20 masks
I alternate between the masks during training
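A minimal sketch of applying one dropout mask to a hidden layer, using the common "inverted dropout" rescaling so that expected activations stay comparable at test time; this is an illustration, not a particular toolkit's API.

```python
import numpy as np

def apply_dropout(h, drop_prob=0.5, training=True):
    """Randomly zero out hidden units during training; identity at test time."""
    if not training:
        return h
    mask = (np.random.rand(*h.shape) >= drop_prob).astype(h.dtype)
    return h * mask / (1.0 - drop_prob)   # "inverted dropout" rescaling
```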
40/60
Mini Batches
41/60
Matrix Vector Formulation
I Forward computation: s = W h
I Activation computation: y = sigmoid(s)
I Error term: δ = (t − y) · sigmoid′(s)
I Propagation of error term: δi = (W^T δi+1) · sigmoid′(s)
I Weight updates: ∆W = µ δ h^T
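A minimal NumPy sketch of one training step in exactly this matrix–vector form, for a single hidden layer without bias terms; the elementwise products and names are illustrative assumptions.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_step(x, t, U, W, mu=0.1):
    """One gradient step for a network with one hidden layer (updates U, W in place).
       x: input vector, t: target vector, U: input->hidden weights, W: hidden->output weights."""
    # Forward computation
    s_hidden = U @ x
    h = sigmoid(s_hidden)
    s_out = W @ h
    y = sigmoid(s_out)
    # Error terms (elementwise products)
    delta_out = (t - y) * y * (1 - y)               # delta = (t - y) . sigmoid'(s)
    delta_hidden = (W.T @ delta_out) * h * (1 - h)  # propagate the error through W^T
    # Weight updates: Delta W = mu * delta * h^T
    W += mu * np.outer(delta_out, h)
    U += mu * np.outer(delta_hidden, x)
    return y
```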
42/60
Toolkits
43/60
Neural Network V1.0: Linear Model
44/60
Neural Network v2.0: Representation Learning
I Big idea: induce low-dimensional dense feature representations of high-dimensional
objects
45/60
Neural Network v2.1: Representation Learning
46/60
Neural Network v3.0: Complex Functions
I y = W2 h1 = W2 a1(W1 x1)
47/60
Neural Network v3.0: Complex Functions
I Popular activation/transfer/non-linear functions
48/60
Neural Network v3.5: Deeper Networks
I y = W3 h2 = W3 a2(W2 a1(W1 x1))
49/60
Neural Network v3.5: Deeper Networks
50/60
Neural Network v4.0: Recurrent Neural Networks
I Big Idea: Use hidden layers to represent sequential state
51/60
Neural Network v4.0: Recurrent Neural Networks
52/60
Neural Network v4.1: Output Sequences
53/60
Neural Network v4.1: Output Sequences
I Character-level Language Models
54/60
Neural Network v4.2: Long Short-Term Memory
I Regular Recurrent Networks
I LSTMs
55/60
Neural Network v4.2: Long Short-Term Memory
56/60
Neural Network v4.3: Bidirectional RNNs
I Unidirectional RNNs
I Bidirectional RNNs
57/60
Neural Machine Translation
58/60
Neural Part-of-Speech Tagging
60/60