Natural Language Processing Slides

This document provides an overview of natural language processing (NLP). It defines NLP as automating the analysis, generation, and acquisition of human language. It discusses the complexity of NLP due to linguistic representations having many possible interpretations and the richness of human language. The document also outlines some common NLP applications and challenges, including question answering, machine translation, and conversational agents. It introduces several linguistic levels involved in NLP, such as morphology, lexical processing, and syntax.

11-411

Natural Language Processing


Overview

Kemal Oflazer

Carnegie Mellon University in Qatar


Content mostly based on previous offerings of 11-411 by LTI Faculty at CMU-Pittsburgh.

1/31
What is NLP?

I Automating the analysis, generation, and acquisition of human (“natural”) language


I Analysis (or “understanding” or “processing”, . . . )
I Generation
I Acquisition
I Some people use “NLP” to mean all of language technologies.
I Some people use it only to refer to analysis.

2/31
Why NLP?
I Answer questions using the Web
I Translate documents from one language to another
I Do library research; summarize
I Manage messages intelligently
I Help make informed decisions
I Follow directions given by any user
I Fix your spelling or grammar
I Grade exams
I Write poems or novels
I Listen and give advice
I Estimate public opinion
I Read everything and make predictions
I Interactively help people learn
I Help disabled people
I Help refugees/disaster victims
I Document or reinvigorate indigenous languages
3/31
What is NLP? More Detailed Answer

I Automating language analysis, generation, acquisition.


I Analysis (or “understanding” or “processing” ...): input is language, output is some
representation that supports useful action
I Generation: input is that representation, output is language
I Acquisition: obtaining the representation and necessary algorithms, from knowledge and
data
I Representation?

4/31
Levels of Linguistic Representation

5/31
Why it’s Hard

I The mappings between levels are extremely complex.


I Details and appropriateness of a representation depend on the application.

6/31
Complexity of Linguistic Representations

I Input is likely to be noisy.


I Linguistic representations are theorized constructs; we cannot observe them directly.
I Ambiguity: each linguistic input may have many possible interpretations at every
level.
I The correct resolution of the ambiguity will depend on the intended meaning, which is
often inferable from context.
I People are good at linguistic ambiguity resolution.
I Computers are not so good at it.
I How do we represent sets of possible alternatives?
I How do we represent context?

7/31
Complexity of Linguistic Representations

I Richness: there are many ways to express the same meaning, and immeasurably
many meanings to express. Lots of words/phrases.
I Each level interacts with the others.
I There is tremendous diversity in human languages.
I Languages express the same kind of meaning in different ways
I Some languages express some meanings more readily/often.
I We will study models.

8/31
What is a Model?

I An abstract, theoretical, predictive construct. Includes:


I a (partial) representation of the world
I a method for creating or recognizing worlds,
I a system for reasoning about worlds
I NLP uses many tools for modeling.
I Surprisingly, shallow models work fine for some applications.

9/31
Using NLP Models and Tools

I This course is meant to introduce some formal tools that will help you navigate the
field of NLP.
I We focus on formalisms and algorithms.
I This is not a comprehensive overview; it’s a deep introduction to some key topics.
I We’ll focus mainly on analysis and mainly on English text (but will provide examples from
other languages whenever meaningful)
I The skills you develop will apply to any subfield of NLP

10/31
Applications / Challenges

I Application tasks evolve and are often hard to define formally.


I Objective evaluations of system performance are always up for debate.
I This holds for NL analysis as well as application tasks.
I Different applications may require different kinds of representations at different levels.

11/31
Expectations from NLP Systems

I Sensitivity to a wide range of the phenomena and constraints in human language


I Generality across different languages, genres, styles, and modalities
I Computational efficiency at construction time and runtime
I Strong formal guarantees (e.g., convergence, statistical efficiency, consistency, etc.)
I High accuracy when judged against expert annotations and/or task-specific
performance

12/31
Key Applications (2017)
I Computational linguistics (i.e., modeling the human capacity for language
computationally)
I Information extraction, especially “open” IE
I Question answering (e.g., Watson)
I Conversational Agents (e.g., Siri, OK Google)
I Machine translation
I Machine reading
I Summarization
I Opinion and sentiment analysis
I Social media analysis
I Fake news detection
I Essay evaluation
I Mining legal, medical, or scholarly literature

13/31
NLP vs Computational Linguistics

I NLP is focused on the technology of processing language.


I Computational Linguistics is focused on using technology to support/implement
linguistics.
I The distinction is:
I Like “artificial intelligence” vs. “cognitive science”
I Like “building airplanes” vs. “understanding how birds fly”

14/31
Let’s Look at Some of the Levels

15/31
Morphology

I Analysis of words into meaningful components – morphemes.


I Spectrum of complexity across languages
I Isolating Languages: mostly one morpheme per word (e.g., Chinese/Mandarin)
I Inflectional Languages: mostly two morphemes per word (e.g., English, French; one
morpheme may mean many things)
I go+ing, habla+mos “we speak / we spoke” (SP)

16/31
Morphology

I Agglutinative Languages: Mostly many morphemes stacked like


“beads-on-a-string” (e.g., Turkish, Finnish, Hungarian, Swahili)
I uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
“(behaving) as if you are among those whom we could not civilize”
I Polysynthetic Languages: A word is a sentence! (e.g., Inuktitut)
I Parismunngaujumaniralauqsimanngittunga
Paris+mut+nngau+juma+niraq+lauq+si+ma+nngit+jun
“I never said that I wanted to go to Paris”
I Reasonably dynamic:
I unfriend, Obamacare

17/31
Let’s Look at Some of the Levels

18/31
Lexical Processing

I Segmentation
I Normalize and disambiguate words
I Words with multiple meanings: bank, mean
I Extra challenge: domain-specific meanings (e.g., latex)
I Process multi-word expressions
I make . . . decision, take out, make up, kick the . . . bucket
I Part-of-speech tagging
I Assign a syntactic class to each word (verb, noun, adjective, etc.)
I Supersense tagging
I Assign a coarse semantic category to each content word (motion event, instrument,
foodstuff, etc.)
I Syntactic “supertagging”
I Assign a possible syntactic neighborhood tag to each word (e.g., subject of a verb)

19/31
Let’s Look at Some of the Levels

20/31
Syntax

I Transform a sequence of symbols into a hierarchical or compositional structure.


I Some sequences are well-formed, others are not
✓ I want a flight to Tokyo.
✓ I want to fly to Tokyo.
✓ I found a flight to Tokyo.
× I found to fly to Tokyo.
✓ Colorless green ideas sleep furiously.
× Sleep colorless green furiously ideas.
I Ambiguities explode combinatorially
I Students hate annoying professors.
I John saw the woman with the telescope.
I John saw the woman with the telescope wrapped in paper.

21/31
Some of the Possible Syntactic Analyses

22/31
Morphology–Syntax

I A ship-shipping ship, shipping shipping-ships.

23/31
Let’s Look at Some of the Levels

24/31
Semantics

I Mapping of natural language sentences into domain representations.


I For example, a robot command language, a database query, or an expression in a formal
logic
I Scope ambiguities:
I In this country a woman gives birth every fifteen minutes.
I Every person on this island speaks three languages.
I (TR) Üç doktor her hastaya baktı “Three doctors every patient saw”

⇒ ∃d1, d2, d3, doctor(d1) & doctor(d2) & doctor(d3) (∀p, patient(p), saw(d1, p) & saw(d2, p) & saw(d3, p))

I (TR) Her hastaya üç doktor baktı “Every patient three doctors saw”

⇒ ∀p, patient(p) (∃d1, d2, d3, doctor(d1) & doctor(d2) & doctor(d3) & saw(d1, p) & saw(d2, p) & saw(d3, p))

I Going beyond specific domains is a goal of general artificial intelligence.

25/31
Syntax–Semantics

We saw the woman with the telescope wrapped in paper.

I Who has the telescope?


I Who or what is wrapped in paper?
I Is this an event of perception or an assault?

26/31
Let’s Look at Some of the Levels

27/31
Pragmatics/Discourse

I Pragmatics
I Any non-local meaning phenomena
I “Can you pass the salt?”
I “Is he 21?” “Yes, he’s 25.”
I Discourse
I Structures and effects in related sequences of sentences
I “I said the black shoes.”
I “Oh, black.” (Is that a sentence?)

28/31
Course Logistics/Administrivia

I Web page: piazza.com/qatar.cmu/fall2017/11411/home


I Course Material:
I Book: Speech and Language Processing, Jurafsky and Martin, 2nd ed.
I As needed, copies of papers, etc. will be provided.
I Lectures slides will be provided after each lecture.
I Instructor: Kemal Oflazer

29/31
Your Grade

I Class project, 30%


I In-class midterm (October 11, for the time being), 20%
I Final exam (date TBD), 20%
I Unpredictable in-class quizzes, 15%
I Homework assignments, 15%

30/31
Policies

I Everything you submit must be your own work


I Any outside resources (books, research papers, web sites, etc.) or collaboration
(students, professors, etc.) must be explicitly acknowledged.
I Project
I Collaboration is required (team size TBD)
I It’s okay to use existing tools, but you must acknowledge them.
I Grade is mostly shared.
I Programming language is up to you.
I Do people know Python? Perl?

31/31
11-411
Natural Language Processing
Applications of NLP

Kemal Oflazer

Carnegie Mellon University in Qatar


Content mostly based on previous offerings of 11-411 by LTI Faculty at CMU-Pittsburgh.

1/19
Information Extraction – Bird’s Eye View

I Input: text, empty relational database


I Output: populated relational database

Senator John Edwards is to drop out of the race to become the Democratic party’s presidential
candidate after consistently trailing in third place. In the latest primary, held in Florida yesterday,
Edwards gained only 14% of the vote, with Hillary Clinton polling 50% and Barack Obama on 33%.
A reported 1.5m voters turned out to vote.

⇒

State  Party  Cand.    %
FL     Dem.   Edwards  14
FL     Dem.   Clinton  50
FL     Dem.   Obama    33

2/19
Named-Entity Recognition

I Input: text
I Output: text annotated with named-entities

Senator John Edwards is to drop out of the race to become the Democratic party’s presidential
candidate after consistently trailing in third place. In the latest primary, held in Florida yesterday,
Edwards gained only 14% of the vote, with Hillary Clinton polling 50% and Barack Obama on 33%.
A reported 1.5m voters turned out to vote.

⇒

[PER Senator John Edwards] is to drop out of the race to become the [GPE Democratic party]’s
presidential candidate after consistently trailing in third place. In the latest primary, held in
[LOC Florida] yesterday, [PER Edwards] gained only 14% of the vote, with [PER Hillary Clinton]
polling 50% and [PER Barack Obama] on 33%. A reported 1.5m voters turned out to vote.

3/19
Reference Resolution

I Input: text possibly with annotated named-entities


I Output: text annotated with named-entities and the real-world entities they refer to.

[PER Senator John Edwards] is to drop out of the race to become the [GPE Democratic party]’s
presidential candidate after consistently trailing in third place. In the latest primary, held in
[LOC Florida] yesterday, [PER Edwards] gained only 14% of the vote, with [PER Hillary Clinton]
polling 50% and [PER Barack Obama] on 33%. A reported 1.5m voters turned out to vote.

⇒

[PER Senator John Edwards] refers to … (image of the real-world entity on the slide)
[PER Edwards] refers to … (image of the real-world entity on the slide)

4/19
Coreference Resolution

I Input: text possibly with annotated named-entities


I Output: text with annotations of coreference chains.

[PER Senator John Edwards] is to drop out of the race to become the [GPE Democratic party]’s
presidential candidate after consistently trailing in third place. In the latest primary, held in
[LOC Florida] yesterday, [PER Edwards] gained only 14% of the vote, with [PER Hillary Clinton]
polling 50% and [PER Barack Obama] on 33%. A reported 1.5m voters turned out to vote. This was
a huge setback for the Senator from [LOC North Carolina].

⇒

[PER Senator John Edwards], [PER Edwards], and the Senator from [LOC North Carolina]
refer to the same entity.

5/19
Relation Extraction

I Input: text annotated with named-entities


I Output: populated relational database with relations between entities.

Senator John Edwards is to drop out of the race to become the Democratic party’s presidential
candidate after consistently trailing in third place. In the latest primary, held in Florida yesterday,
Edwards gained only 14% of the vote, with Hillary Clinton polling 50% and Barack Obama on 33%.
A reported 1.5m voters turned out to vote.

⇒

Person           Member-of
John Edwards     Democrat Party
Hillary Clinton  Democrat Party
Barack Obama     Democrat Party

6/19
Encoding for Named-Entity Recognition

I Named-entity recognition is typically formulated as a sequence tagging problem.


I We somehow encode the boundaries and types of the named-entities.
I BIO Encoding
I B-type indicates the beginning token/word of a named-entity (of type type)
I I-type indicates (any) other tokens of a named-entity (length > 1)
I O indicates that a token is not a part of any named-entity.
I BIOLU Encoding
I BIO same as above
I L-type indicates last token of a named-entity (length > 1)
I U-type indicates a single token named-entity (length = 1)

7/19
Encoding for Named-Entity Recognition
With that , Edwards ’ campaign will end the way
O O O B-PER O O O O O O

it began 13 months ago – with the candidate pitching

O O O O O O O O O O

in to rebuild lives in a city still ravaged by

O O O O O O O O O O

Hurricane Katrina . Edwards embraced New Orleans as a glaring

B-NAT I-NAT O B-PER O B-LOC I-LOC O O O

symbol of what he described as a Washington that did


O O O O O O O B-GPE O O

n’t hear the cries of the downtrodden .


O O O O O O O O

8/19
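To make the BIO encoding above concrete, here is a small Python sketch (my own illustration, not code from the course) that recovers typed entity spans from a token/tag sequence like the one on this slide; the function name and the simplified handling of stray I- tags are my choices.

def decode_bio(tokens, tags):
    # return (entity_type, entity_text) pairs from parallel token/BIO-tag lists
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # a new entity starts here
            current = (tag[2:], [token])
            entities.append(current)
        elif tag.startswith("I-") and current:   # continue the currently open entity
            current[1].append(token)
        else:                                    # "O" (or a stray I- tag): outside any entity
            current = None
    return [(etype, " ".join(words)) for etype, words in entities]

tokens = "Hurricane Katrina . Edwards embraced New Orleans".split()
tags = ["B-NAT", "I-NAT", "O", "B-PER", "O", "B-LOC", "I-LOC"]
print(decode_bio(tokens, tags))
# [('NAT', 'Hurricane Katrina'), ('PER', 'Edwards'), ('LOC', 'New Orleans')]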
NER as a Sequence Modeling Problem

9/19
Evaluation of NER Performance
I Recall: What percentage of the actual named-entities did you correctly label?
I Precision: What percentage of the named-entities you labeled were actually correctly
labeled?

Correct NEs (C), Hypothesized NEs (H), and their overlap C∩H:

R = |C∩H| / |C|        P = |C∩H| / |H|        F1 = 2·R·P / (R+P)

I Actual: [Microsoft Corp.] CEO [Steve Ballmer] announced the


release of [Windows 7] today
I Tagged: [Microsoft Corp.] [CEO] [Steve] Ballmer announced the
release of Windows 7 [today]
I What is R, P, and F?
10/19
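One way to work out the answer to the question above is to score the two entity lists with strict string matching, as in this short sketch (mine, not the course's evaluation code); under that assumption the slide's example gives R = 1/3, P = 1/4, and F1 = 2/7 ≈ 0.29.

def ner_prf(correct, hypothesized):
    # strict matching: an entity counts only if it appears in both sets verbatim
    c, h = set(correct), set(hypothesized)
    overlap = len(c & h)
    recall, precision = overlap / len(c), overlap / len(h)
    f1 = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return recall, precision, f1

actual = {"Microsoft Corp.", "Steve Ballmer", "Windows 7"}
tagged = {"Microsoft Corp.", "CEO", "Steve", "today"}
print(ner_prf(actual, tagged))   # (0.333..., 0.25, 0.2857...)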
NER System Architecture

11/19
Relation Extraction

Some types of relations:

Relations Examples Type


Affiliations
Personal married to, mother of PER → PER
Organizational spokesman for, president of PER → ORG
Artifactual owns, invented, produces (PER | ORG) → ART
Geospatial
Proximity near, on outskirts of LOC → LOC
Directional southeast of LOC → LOC
Part-Of
Organizational unit of, parent-of ORG → ORG
Political annexed, acquired GPE → GPE

12/19
Seeding Tuples

I Provide some examples


I Brad is married to Angelina.
I Bill is married to Hillary.
I Hillary is married to Bill.
I Hillary is the wife of Bill.
I Induce/provide seed patterns.
I X is married to Y
I X is the wife of Y
I Find other examples of X and Y mentioned closely and generate new patterns
I Hillary and Bill wed in 1975 ⇒ X and Y wed

13/19
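A rough sketch (my own, not the course's algorithm) of the seed-pattern step described above: seed pairs are located in raw sentences, and the words between a closely occurring pair become a candidate pattern. The toy corpus and the window of at most four intervening words are illustrative assumptions.

import re

seed_pairs = [("Brad", "Angelina"), ("Bill", "Hillary"), ("Hillary", "Bill")]
corpus = ["Hillary and Bill wed in 1975", "Brad is married to Angelina"]

patterns = set()
for x, y in seed_pairs:
    for sentence in corpus:
        # if both seed entities occur close together, keep the text between them
        match = re.search(rf"{re.escape(x)}(\s+\w+){{0,4}}\s+{re.escape(y)}", sentence)
        if match:
            middle = sentence[match.start() + len(x): match.end() - len(y)].strip()
            patterns.add(f"X {middle} Y")

print(patterns)   # e.g. {'X and Y', 'X is married to Y'}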
Bootstrapping Relations

14/19
Information Retrieval – the Vector Space Model
I Each document Di is represented by a |V|-dimensional vector d_i (V is the vocabulary
of words/tokens):

d_i[j] = count of word ωj in document Di

I A query Q is represented the same way with the vector q:

q[j] = count of word ωj in query Q

I Vector Similarity ⇒ Relevance of Document Di to Query Q

cosine similarity(d_i, q) = (d_i · q) / (||d_i|| × ||q||)

I Twists: tf–idf (term frequency – inverse document frequency)

x[j] = count(ωj) × log(# docs / # docs with ωj)

I Recall, Precision, Ranking
15/19
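The model above fits in a few lines of Python; this is my own sketch (the documents, query, and whitespace tokenization are made-up assumptions), with raw counts reweighted by tf-idf and documents ranked by cosine similarity to the query.

import math
from collections import Counter

docs = ["cheap flight to Tokyo", "cheap hotel in Tokyo", "flight delay compensation"]
query = "cheap flight"

def tf_idf_vectors(texts):
    tokenized = [t.lower().split() for t in texts]
    n_docs = len(tokenized)
    df = Counter(w for tokens in tokenized for w in set(tokens))   # document frequency
    # x[j] = count(w_j) * log(#docs / #docs containing w_j), as on the slide
    vecs = [{w: c * math.log(n_docs / df[w]) for w, c in Counter(tokens).items()}
            for tokens in tokenized]
    return vecs, df, n_docs

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

doc_vecs, df, n_docs = tf_idf_vectors(docs)
q_vec = {w: c * math.log(n_docs / df.get(w, 1)) for w, c in Counter(query.lower().split()).items()}
for doc, vec in sorted(zip(docs, doc_vecs), key=lambda p: -cosine(p[1], q_vec)):
    print(f"{cosine(vec, q_vec):.3f}  {doc}")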
Information Retrieval – Evaluation

I Recall?

Recall = (Number of Relevant Documents Retrieved) / (Number of Actual Relevant Documents in the Database)

I Precision?

Precision = (Number of Relevant Documents Retrieved) / (Number of Documents Retrieved)

I Can you fool these?


I Are these useful? (Why did Google win the IR wars?)
I Ranking?
I Is the “best” document close to the top of the list?

16/19
Question Answering

17/19
Question Answering Evaluation

I We typically get a list of answers back.


I Higher ranked correct answers are more valued.
I Mean reciprocal rank

mean reciprocal rank = (1/T) × Σ_{i=1..T} 1 / (rank of the first correct answer to question i)

18/19
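A tiny sketch (mine, not from the slides) of the formula above; the ranks of the first correct answers are made-up example values, and a None rank (no correct answer returned) contributes nothing to the sum.

def mean_reciprocal_rank(first_correct_ranks):
    # first_correct_ranks[i] = rank of the first correct answer to question i (or None)
    return sum(1.0 / r for r in first_correct_ranks if r) / len(first_correct_ranks)

print(mean_reciprocal_rank([1, 3, 2]))   # (1 + 1/3 + 1/2) / 3 ≈ 0.611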
Some General Tools

I Supervised classification
I Feature vector representations
I Bootstrapping
I Evaluation:
I Precision and recall (and their curves)
I Mean reciprocal rank

19/19
8/28/17

11-411
Natural Language Processing

Words and
Computational Morphology
Kemal Oflazer
Carnegie Mellon University - Qatar

Words, Words, Words


n Words in natural languages usually encode
many pieces of information.
¨ What the word “means” in the real world
¨ What categories, if any, the word belongs to
¨ What the function of the word in the sentence
is

¨ Nouns: How many?, Do we already know what


they are?, How does it relate to the verb?, …

¨ Verbs: When, how, who,…


2

1
8/28/17

Morphology
n Languages differ widely in
¨ What information they encode in their words
¨ How they encode the information.

n I am swim-m+ing.
¨ (Presumably) we know what swim “means”
¨ The +ing portion tells us that this event is
taking place at the time the utterance is taking
place.
¨ What’s the deal with the extra m?

2
8/28/17

n (G) Die Frau schwimm+t.


¨ The schwimm(en) refers to the same meaning
¨ The +t portion tells us that this event is taking
place now and that a single entity other than
the speaker and the hearer is swimming (or
plural hearers are swimming)
n Ich schwimme, Du schwimmst, Er/Sie/Es schwimmt,
Wir schwimmen, Ihr schwimmt Sie schwimmen

n (T) Ben eve git+ti+m.


n I to-the-house I-went.

n (T) Sen evi gör+dü+n


n You the-house you-saw.

n What’s the deal with +ti vs +dü?


n What’s the deal with +m and +n?

3
8/28/17

Dancing in Andalusia
n A poem by the early 20th century Turkish
poet Yahya Kemal Beyatlı.

ENDÜLÜSTE RAKS
Zil, şal ve gül, bu bahçede raksın bütün hızı
Şevk akşamında Endülüs, üç defa kırmızı
Aşkın sihirli şarkısı, yüzlerce dildedir
İspanya neşesiyle bu akşam bu zildedir

Yelpaze gibi çevrilir birden dönüşleri


İşveyle devriliş, saçılış, örtünüşleri
Her rengi istemez gözümüz şimdi aldadır
İspanya dalga dalga bu akşam bu şaldadır

Alnında halka halkadır âşufte kâkülü


Göğsünde yosma Gırnata'nın en güzel gülü
Altın kadeh her elde, güneş her gönüldedir
İspanya varlığıyla bu akşam bu güldedir

Raks ortasında bir durup oynar, yürür gibi


Bir baş çevirmesiyle bakar öldürür gibi
Gül tenli, kor dudaklı, kömür gözlü, sürmeli
Şeytan diyor ki sarmalı, yüz kerre öpmeli

Gözler kamaştıran şala, meftûn eden güle


Her kalbi dolduran zile, her sineden ole!

4
8/28/17

ENDÜLÜSTE RAKS (the same poem, with notes on highlighted words)

zildedir: a verb derived from the locative case of the noun “zil” (castanet), “is at the castanet”
dönüşleri: plural infinitive and possessive form of the verb “dön” (rotate), “their (act of) rotating”
istemez: negative present form of the verb “iste” (want), “it does not want”
varlığıyla: singular possessive instrumental-case of the noun “varlık” (wealth), “with its wealth”
kamaştıran: present participle of the verb “kamaş” (blind), “that which blinds…”

BAILE EN ANDALUCIA
Castañuela, mantilla y rosa. El baile veloz llena el jardín...
En esta noche de jarana, Andalucíá se ve tres veces carmesí...
Cientas de bocas recitan la canción mágica del amor.
La alegría española esta noche, está en las castañuelas.

Como el revuelo de un abanico son sus vueltas súbitas,


Con súbitos gestos se abren y se cierran las faldas.
Ya no veo los demás colores, sólo el carmesí,
La mantilla esta noche ondea a españa entera en sí.

Con un encanto travieso, cae su pelo hacia su frente;


La mas bonita rosa de Granada en su pecho rebelde.
Se para y luego continúa como si caminara,
Vuelve la cara y mira como si apuntara y matara.

Labios ardientes, negros ojos y de rosa su tez!


Luzbel me susurra: ¡Ánda bésala mil veces!

¡Olé a la rosa que enamora! ¡Olé al mantilla que deslumbra!


¡Olé de todo corazón a la castañuela que al espíritu alumbra!”

5
8/28/17

BAILE EN ANDALUCIA (the same poem, with a note on one word)

castañuelas: either the plural feminine form of the adjective “castañuelo” (Castilian)
or the plural of the feminine noun “castañuela” (castanet)

BAILE EN ANDALUCIA (the same poem, with a note on one word)

vueltas: either the plural form of the feminine noun “vuelta” (spin?)
or the feminine plural past participle of the verb “volver”

6
8/28/17

BAILE EN ANDALUCIA (the same poem, with a note on one word)

veo: first person present indicative of the verb “ver” (see?)

BAILE EN ANDALUCIA (the same poem, with a note on one word)

enamora: either the 3rd person singular present indicative
or the 2nd person imperative of the verb “enamorar” (woo)

7
8/28/17

DANCE IN ANDALUSIA
Castanets, shawl and rose. Here's the fervour of dance,
Andalusia is threefold red in this evening of trance.
Hundreds of tongues utter love's magic refrain,
In these castanets to-night survives the gay Spain,

Animated turns like a fan's fast flutterings,


Fascinating bendings, coverings, uncoverings.
We want to see no other color than carnation,
Spain does subsist in this shawl in undulation.

Her bewitching locks on her forehead is overlaid,


On her chest is the fairest rose of Granada.
Golden cup in hand, sun in every mind,
Spain this evening in this shawl defined.

'Mid a step a halt, then dances as she loiters,


She turns her head round and looks daggers.
Rose-complexioned, fiery-lipped, black-eyed, painted,
To embracing and kissing her over one's tempted,

To dazzling shawl, to the rose charmingly gay


To castanets from every heart soars an "ole!".

DANCE IN ANDALUSIA (the same poem, with a note on one word)

castanets: plural noun

8
8/28/17

DANCE IN ANDALUSIA (the same poem, with a note on one word)

bewitching: gerund form of the verb bewitch

DANCE IN ANDALUSIA (the same poem, with a note on one word)

evening: either a noun or the present continuous form of the verb “even”

9
8/28/17

Spanischer Tanz
Zimbel, Schal und Rose- Tanz in diesem Garten loht.
In der Nacht der Lust ist Andalusien dreifach rot!
Und in tausend Zungen Liebeszauberlied erwacht-
Spaniens Frohsinn lebt in diesen Zimbeln heute Nacht!

Wie ein Fäscher: unvermutet das Sich-Wenden, Biegen,


Ihr kokettes Sich - Verhüllen, Sich - Entfalten, Wiegen -
Unser Auge, nichts sonst wünschend - sieht nur Rot voll Pracht:
Spanien wogt und wogt in diesem Schal ja heute Nacht!

Auf die Stirn die Ringellocken fallen lose ihr,


Auf der Brust erblüht Granadas schönste Rose ihr,
Goldpokal in jeder Hand, im Herzen Sonne lacht
Spanien lebt und webt in dieser Rose heute Nacht!

Jetzt im Tanz ein spielend Schreiten, jetz ein Steh’n, Zurück


Tötend, wenn den Kopf sie wendet, scheint ihr rascher Blick.
Rosenleib, geschminkt, rotlippig, schwarzer Augen Strahl
Der Verführer lockt: «Umarme, küsse sie hundertmal!»

Für den Schal so blendend, Zaubervoller Rose Lust,


Zimbel herzerfüllend, ein Ole aus jeder Brust!

Spanischer Tanz (the same poem, with a note on one word)

Zimbeln: plural of the feminine noun “Zimbel”

10
8/28/17

Spanischer Tanz (the same poem, with a note on one word)

Liebeszauberlied: compound noun, magic love song (?)

Spanischer Tanz (the same poem, with a note on one word)

Ringellocken: compound noun, convoluted curls (?)

11
8/28/17

Spanischer Tanz (the same poem, with a note on one word)

herzerfüllend: noun-verb/participle compound, “that which fulfills the heart” (?)

Aligned Verses
Zil, şal ve gül, bu bahçede raksın bütün hızı
Şevk akşamında Endülüs, üç defa kırmızı
Aşkın sihirli şarkısı, yüzlerce dildedir
İspanya neş'esiyle bu akşam bu zildedir

Castañuela, mantilla y rosa. El baile veloz llena el jardín...


En esta noche de jarana, Andalucíá se ve tres veces carmesí...
Cientas de bocas recitan la canción mágica del amor.
La alegría española esta noche, está en las castañuelas.

Castanets, shawl and rose. Here's the fervour of dance,


Andalusia is threefold red in this evening of trance.
Hundreds of tongues utter love's magic refrain,
In these castanets to-night survives the gay Spain,

Zimbel, Schal und Rose- Tanz in diesem Garten loht.


In der Nacht der Lust ist Andalusien dreifach rot!
Und in tausend Zungen Liebeszauberlied erwacht-
Spaniens Frohsinn lebt in diesen Zimbeln heute Nacht! 24

12
8/28/17

Why do we care about words?


n Many language processing applications need to
extract the information encoded in the words.
¨ Parsers which analyze sentence structure need to
know/check agreement between
n subjects and verbs
n Adjectives and nouns
n Determiners and nouns, etc.
¨ Information retrieval systems benefit from knowing what
the stem of a word is
¨ Machine translation systems need to analyze words into
their components and generate words with specific
features in the target language
25

Computational Morphology
n Computational morphology deals with
¨ developing theories and techniques for
¨ computational analysis and synthesis of word
forms.

26

13
8/28/17

Computational Morphology -Analysis


n Extract any information encoded in a word
and bring it out so that later layers of
processing can make use of it.

n books ⇒ book+Noun+Plural
⇒ book+Verb+Pres+3SG.
n stopping ⇒ stop+Verb+Cont
n happiest ⇒ happy+Adj+Superlative
n went ⇒ go+Verb+Past
27

Computational Morphology -Generation


n In a machine translation application, one
may have to generate the word
corresponding to a set of features

n stop+Past ⇒ stopped
n (T) dur+Past+1Pl ⇒ durduk
+2Pl ⇒ durdunuz

28

14
8/28/17

Computational Morphology-Analysis
Pre-processing:
n Input raw text
n Segment / Tokenize
Morphological Processing:
n Analyze individual words
n Analyze multi-word constructs
n Disambiguate Morphology
Syntactic Processing:
n Syntactically analyze sentences
n ….
29

Some other applications


n Spelling Checking
¨ Check if words in a text are all valid words
n Spelling Correction
¨ Find correct words “close” to a misspelled word.

n For both these applications, one needs to know


what constitutes a valid word in a language.
¨ Rather straightforward for English
¨ Not so for Turkish –
n cezalandırılacaklardan (ceza+lan+dır+ıl+acak+lar+dan)

30

15
8/28/17

Some other applications


n Grammar Checking
¨ Checks if a (local) sequence of words violate
some basic constraints of language (e.g.,
agreement)
n Text-to-speech
¨ Proper stress/prosody may depend on proper
identification of morphemes and their
properties.
n Machine Translation (especially between
closely related languages)
¨ E.g., Turkmen to Turkish translation 31

Text-to-speech
n I read the book.
¨ Can’t really decide what the pronunciation is
n Yesterday, I read the book.
¨ read must be a past tense verb.
n He read the book
n read must be a past tense verb.
¨ (T) oKU+ma (don’t read)
oku+MA (reading)
ok+uM+A (to my arrow)
32

16
8/28/17

Morphology
n Morphology is the study of the structure of
words.
¨ Words are formed by combining smaller units
of linguistic information called morphemes, the
building blocks of words.
¨ Morphemes in turn consist of phonemes and,
in abstract analyses, morphophonemes.
Often, we will deal with orthographical
symbols.

33

Morphemes
n Morphemes can be classified into two
groups:
¨ Free Morpheme: Morphemes which can
occur as a word by themselves.
n e.g., go, book,

¨ Bound Morphemes: Morphemes which are


not words in their own right, but have to be
attached in some way to a free morpheme.
n e.g., +ing, +s, +ness

34

17
8/28/17

Dimensions of Morphology
n “Complexity” of Words
¨ How many morphemes?
n Morphological Processes
¨ What functions do morphemes perform?
n Morpheme combination
¨ How do we put the morphemes together to
form words?

35

“Complexity” of Word Structure


n The kind and amount information that is
conveyed with morphology differs from
language to language.
¨ Isolating Languages

¨ Inflectional Languages

¨ Agglutinative Languages

¨ Polysynthetic Languages

36

18
8/28/17

Isolating languages
n Isolating languages do not (usually) have any
bound morphemes
¨ Mandarin Chinese
¨ Gou bu ai chi qingcai (dog not like eat vegetable)
¨ This can mean one of the following (depending on the
context)
n The dog doesn’t like to eat vegetables
n The dog didn’t like to eat vegetables
n The dogs don’t like to eat vegetables
n The dogs didn’t like to eat vegetables.
n Dogs don’t like to eat vegetables.

37

Inflectional Languages
n A single bound morpheme conveys
multiple pieces of linguistic information
n (R) most+u: Noun, Sing, Dative
pros+u: Verb, Present, 1sg

n (S) habla+mos: Verb, Perfect, 1pl


Verb, Pres.Ind., 1pl

38

19
8/28/17

Agglutinative Languages
n (Usually multiple) Bound morphemes are
attached to one (or more) free morphemes,
like beads on a string.
¨ Turkish/Turkic, Finnish, Hungarian
¨ Swahili, Aymara
n Each morpheme encodes one "piece" of
linguistic information.
¨ (T) gid+iyor+du+m: continuous, Past, 1sg (I
was going)

39

Agglutinative Languages
n Turkish
n Finlandiyalılaştıramadıklarımızdanmışsınızcasına
n (behaving) as if you have been one of those whom we could not
convert into a Finn(ish citizen)/someone from Finland
n Finlandiya+lı+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına

¨ Finlandiya+Noun+Prop+A3sg+Pnon+Nom
n ^DB+Adj+With/From
n ^DB+Verb+Become
n ^DB+Verb+Caus
n ^DB+Verb+Able+Neg
n ^DB+Noun+PastPart+A3pl+P1pl+Abl
n ^DB+Verb+Zero+Narr+A2pl
n ^DB+Adverb+AsIf

40

20
8/28/17

Agglutinative Languages
n Aymara
¨ ch’uñüwinkaskirïyätwa
¨ ch’uñu +: +wi +na -ka +si -ka -iri +: +ya:t(a) +wa
n I was (one who was) always at the place for making ch’uñu ’
ch’uñu N ‘freeze-dried potatoes’
+: N>V be/make …
+wi V>N place-of
+na in (location)
-ka N>V be-in (location)
+si continuative
-ka imperfect
-iri V>N one who
+: N>V be
+ya:ta 1P recent past
+wa affirmative sentencial
41
Example Courtesy of Ken Beesley

Agglutinative Languages
n Finnish Numerals
¨ Finnish numerals are written as one word and
all components inflect and agree in all aspects

¨ Kahdensienkymmenensienkahdeksansien

two ten eighth (28th)


kaksi+Ord+Pl+Gen kymmenen+Ord+Pl+Gen kahdeksan+Ord+Pl+Gen
kahde ns i en kymmene ns i en kahdeksa ns i en

42
Example Courtesy of Lauri Karttunen

21
8/28/17

Agglutinative Languages
n Hungarian
¨ szobáikban = szoba[N/room] + ik[PersPl-3-
PossPl] + ban[InessiveCase]
n In their rooms
¨ faházaikból = fa[N/wooden] + ház[N/house] +
aik[PersPl3-PossPl] +ból[ElativeCase]
n From their wooden houses
¨ olvashattam = olvas[V/read] +
hat[DeriV/is_able] + tam[Sg1-Past]
n I was able to read

Examples courtesy of Gabor Proszeky 43

Agglutinative Languages
n Swahili
¨ walichotusomea = wa[Subject
Pref]+li[Past]+cho[Rel Prefix]+tu[Obj Prefix
1PL]+som[read/Verb]+e[Prep Form]+a[]
n that (thing) which they read for us
¨ tulifika=tu[we]+li[Past]+fik[arrive/Verb]+a[]
n We arrived
¨ ninafika=ni[I]+na[Present]+fik[arrive/Verb]+a[]
n I am arriving

44

22
8/28/17

Polysynthetic Languages
n Use morphology to combine syntactically
related components (e.g. verbs and their
arguments) of a sentence together
¨ Certain Eskimo languages, e.g., Inuktitut

¨ qaya:liyu:lumi: he was excellent at making


kayaks

45

Polysynthetic Languages
n Use morphology to combine syntactically
related components (e.g. verbs and their
arguments) of a sentence together
• Parismunngaujumaniralauqsimanngittunga
Paris+mut+nngau+juma+niraq+lauq+si+ma+nngit+jun

¨ “I never said that I wanted to go to Paris”

46
Example Courtesy of Ken Beesley

23
8/28/17

Arabic
n Arabic seems to have aspects of
¨ Inflecting languages
n wktbt (wa+katab+at “and she wrote …”)
¨ Agglutinative languages
n wsyktbunha (wa+sa+ya+ktub+ūn+ha “and will
(masc) they write her”)
¨ Polysynthetic languages

47

Morphological Processes
n There are essentially 3 types of
morphological processes which determine
the functions of morphemes:

¨ Inflectional Morphology

¨ Derivational Morphology

¨ Compounding

48

24
8/28/17

Inflectional Morphology
n Inflectional morphology introduces relevant
information to a word so that it can be used
in the syntactic context properly.
¨ That is, it is often required in particular
syntactic contexts.
n Inflectional morphology does not change
the part-of-speech of a word.
n If a language marks a certain piece of
inflectional information, then it must mark
that on all appropriate words.
49

Inflectional Morphology
n Subject-verb agreement, tense, aspect

I/you/we/they go Ich gehe (Ben) gidiyorum


He/She/It goes Du gehst (Sen) gidiyorsun
Er/Sie/Es geht (O) gidiyor
Wir gehen (Biz) gidiyoruz
Ihr geht (Siz) gidiyorsunuz
Sie gehen (Onlar) gidiyorlar
n Constituent function (indicated by case marking)
Biz eve gittik – We went to the house.
Biz evi gördük – We saw the house.
Biz evden nefret ettik – We hated the house
Biz evde kaldık. – We stayed at the house.
50

25
8/28/17

Inflectional Morphology
n Number, case, possession, gender, noun-
class for nouns
¨ (T)ev+ler+in+den (from your houses)
¨ Bantu marks noun class by a prefix.
n Humans: m+tu (person) wa+tu (persons)
n Thin-objects: m+ti (tree) mi+ti (trees)

n Paired things: ji-cho (eye) ma+cho (eyes)

n Instrument: ki+tu (thing) vi+tu (things)

n Extended body parts: u+limi (tongue) n+dimi


(tongues)

51

Inflectional Morphology
n Gender and/or case marking may also
appear on adjectives in agreement with the
nouns they modify
(G) ein neuer Wagen
eine schöne Stadt
ein altes Auto

52

26
8/28/17

Inflectional Morphology
n Case/Gender agreement for determiners

n (G) Der Bleistift (the pencil)

Den Bleistift (the pencil (object/Acc))
Dem Bleistift (the pencil (Dative))
Des Bleistifts (of the pencil)

Die Frau (the woman)

Die Frau (the woman (object/Acc))
Der Frau (the woman (Dative))
Der Frau (of the woman)

53

Inflectional Morphology
n (A) Perfect verb subject conjugation (masc form
only)
Singular Dual Plural
katabtu katabnā
katabta katabtumā katabtum
kataba katabā katabtū

n (A) Imperfect verb subject conjugation


Singular Dual Plural
aktubu naktubu
taktubu taktubān taktubūn
yaktubu yaktubān yaktubūn
54

27
8/28/17

Derivational Morphology
n Derivational morphology produces a new
word with usually a different part-of-speech
category.
¨ e.g., make a verb from a noun.
n The new word is said to be derived from
the old word.

55

Derivational Morphology
¨ happy (Adj) ⇒ happi+ness (Noun)

¨ (T) elçi (Noun, ambassador) ⇒
elçi+lik (Noun, embassy)
¨ (G) Botschaft (Noun, embassy) ⇒
Botschaft+er (Noun, ambassador)
¨ (T) git (Verb, go) ⇒
gid+er+ken (Adverb, while going)

56

28
8/28/17

Derivational Morphology
n Productive vs. unproductive derivational
morphology

¨ Productive: can apply to almost all members of


a class of words

¨ Unproductive: applies to only a few members
of a class of words
n lexicalized derivations (e.g., application as in
“application program”)
57

Compounding
n Compounding is concatenation of two or
more free morphemes (usually nouns) to
form a new word (though the boundary between normal
words and compounds is not very clear in some languages)

¨ firefighter / fire-fighter
¨ (G)
Lebensversicherungsgesellschaftsangestellter
(life insurance company employee)
¨ (T) acemborusu ((lit.) Persian pipe – neither
Persian nor pipe, but a flower)
58

29
8/28/17

Combining Morphemes
n Morphemes can be combined in a variety of ways
to make up words:
¨ Concatenative

¨ Infixation

¨ Circumfixation

¨ Templatic Combination

¨ Reduplication

59

Concatenative Combination
n Bound morphemes are attached before or
after the free morpheme (or any other
intervening morphemes).
¨ Prefixation:
bound morphemes go before the
free morpheme
n un+happy
¨ Suffixation: bound morphemes go after the free
morpheme
n happi+ness
¨ Need to be careful about the order: [un+happi]+ness (not
un+[happi+ness])
n el+ler+im+de+ki+ler
60

30
8/28/17

Concatenative Combination
n Such concatenation can trigger spelling
(orthographical) and/or phonological
changes at the concatenation boundary (or
even beyond)
¨ happi+ness
¨ (T) şarap (wine) ⇒ şarab+ı
¨ (T) burun (nose) ⇒ burn+a
¨ (G) der Mann (man) ⇒ die Männ+er (men)

61

Infixation
n The bound morpheme is inserted into free
morpheme stem.

¨ Bontoc (due to Sproat)

n fikas (Adj, strong) ⇒ fumikas (Verb, to be strong)
¨ Tagalog
n pili ⇒ pumili, pinili

62

31
8/28/17

Circumfixation
n Part of the morpheme goes before the
stem, part goes after the stem.

n German past participle


¨ tauschen (Verb, to exchange) ⇒ getauscht (Participle)
¨ sagen (Verb, to say) ⇒ gesagt

63

Templatic Combination
n The root is modulated with a template to generate a
stem to which other morphemes can be added by
concatenation, etc.
n Semitic Languages (e.g., Arabic)
¨ root ktb (the general concept of writing)
¨ template CVCCVC
¨ vocalism (a,a)

k t b
k a t t a b
C V C C V C
64

32
8/28/17

Templatic Combination
n More examples of templatic combination

TEMPLATE     VOWEL PATTERN
             aa (active)    ui (passive)
CVCVC        katab          kutib          ‘write’
CVCCVC       kattab         kuttib         ‘cause to write’
CVVCVC       ka:tab         ku:tib         ‘correspond’
tVCVVCVC     taka:tab       tuku:tib       ‘write each other’
nCVVCVC      nka:tab        nku:tib        ‘subscribe’
CtVCVC       ktatab         ktutib         ‘write’
stVCCVC      staktab        stuktib        ‘dictate’

65
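The templatic combination above can be made concrete with a small sketch (my own illustration, not the slides' formalism): if a template is written as an explicit slot list, where C<i> picks the i-th root consonant and V<j> the j-th vowel of the vocalism, then the doubled middle consonant of kattab simply falls out of repeating the C2 slot.

def fill(template, root, vocalism):
    # each slot names a root consonant (C1, C2, C3) or a vocalism vowel (V1, V2)
    out = []
    for slot in template:
        kind, idx = slot[0], int(slot[1:]) - 1
        out.append(root[idx] if kind == "C" else vocalism[idx])
    return "".join(out)

ktb = "ktb"
print(fill(["C1", "V1", "C2", "C2", "V2", "C3"], ktb, "aa"))   # kattab (CVCCVC, active)
print(fill(["C1", "V1", "C2", "C2", "V2", "C3"], ktb, "ui"))   # kuttib (CVCCVC, passive)
print(fill(["C1", "V1", "C2", "V2", "C3"], ktb, "aa"))         # katab  (CVCVC, active)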

Reduplication
n Some or all of a word is duplicated to mark a
morphological process
¨ Indonesian
n orang (man) ⇒ orangorang (men)
¨ Bambara
n wulu (dog) ⇒ wuluowulu (whichever dog)
¨ Turkish
n mavi (blue) ⇒ masmavi (very blue)
n kırmızı (red) ⇒ kıpkırmızı (very red)

n koşa koşa (by running)

66

33
8/28/17

Zero Morphology
n Derivation/inflection takes place without any
additional morpheme
¨ English
n second (ordinal) ⇒ (to) second (a motion)
n man (noun) ⇒ (to) man (a place)

67

Subtractive morphology
n Part of the stem is removed to mark a
morphological feature

n Sproat (1992) gives Koasati as a language where


part of the word is removed to indicate a plural
subject agreement.

¨ obkahitiplin (go backwards, singular subject)


¨ obakhitlin (go backwards, plural subject)

68

34
8/28/17

(Back to) Computational Morphology


n Computational morphology deals with
¨ developing theories and techniques for
¨ computational analysis and synthesis of word
forms.
n Analysis: Separate and identify the constituent
morphemes and mark the information they encode
n Synthesis (Generation): Given a set of constituent
morphemes or the information to be encoded, produce the
corresponding word(s)

69

Computational Morphology
n Morphological analysis: break down a given word form into its
constituents and map them to features.

Word (sequence of characters) → Morphological Analyzer → All possible analyses
(Lemma/Root + features encoded in the word)
70

35
8/28/17

Computational Morphology
n Morphological analysis: break down a given word form into its
constituents and map them to features.

stopping → Morphological Analyzer → stop+Verb+PresCont
71

Computational Morphology
n Ideally we would like to be able to use the
same system “in reverse” to generate
words from a given sequence or
morphemes
¨ Take “analyses” as input
¨ Produce words.

72

36
8/28/17

Computational Morphology
n Morphological generation

Analysis → Morphological Generator → Word(s)
73

Computational Morphology
n Morphological generation

stop+Verb+PresCont → Morphological Generator → stopping
74

37
8/28/17

Computational Morphology
n What is in the box?

Word(s) ↔ Morphological Analyzer/Generator ↔ Analyses
75

Computational Morphology
n What is in the box?
n Data
¨ Language Specific
n Engine
¨ Language Independent
(between Word(s) and Analyses, the box contains the language-specific Data and the language-independent Engine)
76

38
8/28/17

Issues in Developing a Morphological


Analyzer

n What kind of data needs to be compiled?

n What kinds of ambiguity can occur?

n What are possible implementation


strategies?

77

Some Terminology
n Lexicon is a structured collection of all the
morphemes
¨ Root words (free morphemes)
¨ Morphemes (bound morphemes)
n Morphotactics is a model of how and in
what order the morphemes combine.
n Morphographemics is a model of what/how
changes occur when morphemes are
combined.
78

39
8/28/17

Data Needed - Lexicon


n A list of root words with parts-of-speech and any
other information (e.g. gender, animateness, etc.)
that may be needed by morphotactical and
morphographemic phenomena.
¨ (G) Kind (noun, neuter), Hund (noun, masculine)

n A list of morphemes along with the morphological


information/features they encode (using some
convention for naming)
¨ +s (plural, PL), +s (present tense, 3rd person singular,
A3SG)
79

Data Needed - Morphotactics


n How morphemes are combined
¨ Valid morpheme sequences
n Usually as “paradigms” based on (broad) classes of root words
¨ (T) In nouns, plural morpheme precedes possessive morpheme
which precedes case morpheme
¨ (Fin) In nouns, Plural morpheme precedes case morpheme
which precedes possessive morpheme
¨ (F,P) Certain classes of verbs follow certain “conjugation
paradigms”
¨ Exceptions
n go+ed is not a valid combination of morphemes
¨ go+ing is!

80

40
8/28/17

Data Needed - Morphotactics


n Co-occurrence/Long-distance Constraints
¨ This prefix only goes together with this suffix!
n (G) ge+tausch+t
¨ This prefix can not go with this suffix
n (A) The prefix bi+ can only appear with genitive
case
¨ bi+l+kaatib+u is not OK
¨ bi+l+kaatib+i is OK

n (A) Definite nouns can not have indefinite endings


¨ al+kaatib+an is not OK
¨ al+kaatib+a is OK
n Only subset of suffixes apply to a set of roots
81
¨ sheep does not have a plural form!

Data Needed - Morphographemics


n A (comprehensive) inventory of what
happens when morphemes are combined.
¨ Need a consistent representation of
morphemes
n look+ed ⇒ looked, save+ed ⇒ saved
¨ Morpheme is +ed but under certain “conditions” the e may
be deleted
n look+d ⇒ looked, save+d ⇒ saved
¨ Morpheme is +d but under certain “conditions” the e may
be inserted

82

41
8/28/17

Representation
n Lexical form: An underlying representation of
morphemes w/o any morphographemic changes
applied.
¨ easy+est
¨ shop+ed
¨ blemish+es
¨ vase+es
n Surface Form: The actual written form
¨ easiest
¨ shopped
¨ blemishes
¨ vases
83
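A small sketch (mine, not from the course) that maps the four lexical forms above to their surface forms with three toy English morphographemic rules; a real system would state such rules as two-level rules or cascaded transducers, which come up later in these slides.

import re

def surface(lexical):
    # toy rules, chosen only to cover the slide's four examples
    s = lexical
    s = re.sub(r"e\+e", "e", s)                  # e-deletion: vase+es -> vases
    s = re.sub(r"([^aeiou])y\+", r"\1i+", s)     # y -> i before a suffix: easy+est -> easiest
    s = re.sub(r"([^aeiou])([aeiou])([bcdfgklmnprtvz])\+([aeiou])",
               r"\1\2\3\3\4", s)                 # consonant doubling: shop+ed -> shopped
    return s.replace("+", "")                    # finally erase the morpheme boundary

for lex in ["easy+est", "shop+ed", "blemish+es", "vase+es"]:
    print(lex, "->", surface(lex))   # easiest, shopped, blemishes, vases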

Representation
n Lexical form: An underlying representation of
morphemes w/o any morphographemic changes
applied.
¨ ev+lAr A={a,e} Abstract meta-phonemes
¨ oda+lAr
¨ tarak+sH H={ı, i, u, ü}
¨ kese+sH

n Surface Form: The actual written form


¨ evler
¨ odalar
¨ tarağı
¨ kesesi

84

42
8/28/17

Data Needed – Word List


n A (large) list of words compiled from actual
text
¨ Test coverage
¨ See if all ambiguities are produced.

85

Morphological Ambiguity
n Morphological structure/interpretation is
usually ambiguous
¨ Part-of-speech ambiguity
n book (verb), book (noun)
¨ Morpheme ambiguity
n +s (plural) +s (present tense, 3rd singular)
n (T) +mA (infinitive), +mA (negation)

¨ Segmentation ambiguity
n Word can be legitimately divided into morphemes in
a number of ways

86

43
8/28/17

Morphological Ambiguity
n The same surface form is interpreted in
many possible ways in different syntactic
contexts.
(F) danse
danse+Verb+Subj+3sg (lest s/he dance)
danse+Verb+Subj+1sg (lest I dance)
danse+Verb+Imp+2sg ((you) dance!)
danse+Verb+Ind+3sg ((s/he) dances)
danse+Verb+Ind+1sg ((I) dance)
danse+Noun+Fem+Sg (dance)
(E) read
read+Verb+Pres+N3sg (VBP-I/you/we/they read)
read+Verb+Past (VBD - read past tense)
read+Verb+Participle+(VBN – participle form)
read+Verb (VB - infinitive form)
read+Noun+Sg (NN – singular noun)

87

Morphological Ambiguity
n The same morpheme can be interpreted
differently depending on its position in the
morpheme order:

¨ (T) git+me ((the act of) going),


¨ git+me (don’t go)

88

44
8/28/17

Morphological Ambiguity
n The word can be segmented in different
ways leading to different interpretations,
e.g. (T) koyun:
¨ koyun+Noun+Sg+Pnon+Nom (koyun-sheep)
¨ koy+Noun+Sg+P2sg+Nom (koy+un-your bay)
¨ koy+Noun+Sg+Pnon+Gen (koy+[n]un – of the bay)
¨ koyu+Adj+^DB+Noun+Sg+P2sg+Nom
(koyu+[u]n – your dark (thing)
¨ koy+Verb+Imp+2sg (koy+un – put (it) down)

89

Morphological Ambiguity
n The word can be segmented in different
ways leading to different interpretations,
e.g.
(Sw) frukosten:
frukost + en ‘the breakfast’
frukost+en ‘breakfast juniper’
fru+kost+en ‘wife nutrition juniper’
fru+kost+en ‘the wife nutrition’
fru+ko+sten ‘wife cow stone’

(H) ebth:
e+bth ‘that field’
e+b+th ‘that in tea(?)’
ebt+h ‘her sitting’
e+bt+h ‘that her daughter’
90

45
8/28/17

Morphological Ambiguity
n Orthography could be ambiguous or
underspecified.

16 possible interpretations

91

Morphological Disambiguation
n Morphological Disambiguation or Tagging
is the process of choosing the "proper"
morphological interpretation of a token in a
given context.

He can can the can.

92

46
8/28/17

Morphological Disambiguation
n He can can the can.
¨ Modal
¨ Infinitive form
¨ Singular Noun
¨ Non-third person present tense verb
n We can tomatoes every summer.

93

Morphological Disambiguation
n These days standard statistical approaches
(e.g., Hidden Markov Models) can solve
this problem with quite high accuracy.
n The accuracy for languages with complex
morphology/ large number of tags is lower.

94

47
8/28/17

Implementation Approaches for


Computational Morphology
n List all word-forms as a database

n Heuristic/Rule-based affix-stripping

n Finite State Approaches

95

Listing all Word Forms


n List of all word forms and corresponding
features in the analyses.
¨ Feasible if the word list is “small”
…..
harass harass V INF
harassed harass V PAST
harassed harass V PPART WK
harasser harasser N 3sg
harasser's harasser N 3sg GEN
harassers harasser N 3pl
harassers' harasser N 3pl GEN
harasses harass V 3sg PRES
harassing harass V PROG
harassingly harassingly Adv
harassment harassment N 3sg
harassment's harassment N 3sg GEN
harassments harassment N 3pl
harassments' harassment N 3pl GEN
harbinger harbinger N 3sg
harbinger harbinger V INF
harbinger's harbinger N 3sg GEN
…..
96

48
8/28/17

Listing all Word Forms


n No need for spelling rules
n Analysis becomes a simple search
procedure
¨ get the word form and search for matches,
¨ output the features for all matching entries.
n However, not very easy to create
¨ Labor intensive
n Not feasible for large / “infinite” vocabulary
languages
¨ Turkish, Aymara
97
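As a concrete illustration (not from the slides), a minimal Python sketch of the list-all-word-forms approach; the entries below are a tiny, illustrative slice of a full-form table like the one above:

  from collections import defaultdict

  # surface form -> list of (lemma, features); a toy slice of a full-form word list
  FULL_FORMS = defaultdict(list)
  for surface, lemma, feats in [
      ("harass",     "harass",     "V INF"),
      ("harassed",   "harass",     "V PAST"),
      ("harassed",   "harass",     "V PPART WK"),
      ("harasses",   "harass",     "V 3sg PRES"),
      ("harassment", "harassment", "N 3sg"),
  ]:
      FULL_FORMS[surface].append((lemma, feats))

  def analyze(word):
      """Analysis is just table lookup: return the entries for all matches."""
      return FULL_FORMS.get(word, [])

  print(analyze("harassed"))   # both the past-tense and the participle reading
  print(analyze("harassing"))  # [] -- not in this toy table, so no analysis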

Heuristic/Rule-based Affix-stripping
n Uses ad-hoc language-specific rules
¨ to split words into morphemes
¨ to “undo” morphographemic changes

¨ scarcity
n -ity looks like a noun-making suffix, let’s strip it
n scarc is not a known root, so let’s add e and see if we get an adjective

98
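A toy Python sketch of this kind of affix stripping for the scarcity example; the root list, suffix list and the “undo” heuristics below are made up purely for illustration:

  ROOTS    = {"scarce": "Adj", "happy": "Adj", "walk": "Verb"}
  SUFFIXES = {"ity": "Noun", "ness": "Noun", "ed": "Verb+Past"}

  def strip_analyze(word):
      analyses = []
      for suffix, category in SUFFIXES.items():
          if not word.endswith(suffix):
              continue
          stem = word[: -len(suffix)]
          # ad-hoc rules that try to "undo" possible spelling changes
          candidates = {stem, stem + "e", stem[:-1] + "y"}
          for root in candidates:
              if root in ROOTS:
                  analyses.append((root, ROOTS[root], "+" + suffix, category))
      return analyses

  print(strip_analyze("scarcity"))   # [('scarce', 'Adj', '+ity', 'Noun')]
  print(strip_analyze("happiness"))  # [('happy', 'Adj', '+ness', 'Noun')]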

49
8/28/17

Heuristic/Rule-based Affix-stripping
n Uses ad-hoc language-specific rules
¨ to split words into morphemes
¨ to “undo” morphographemic changes

¨ Lots of language-specific heuristics and rule development
¨ Practically impossible to use in reverse as a morphological generator

99

Finite State Approaches


n Finite State Morphology
¨ Represent
n lexicons and
n morphographemic rules

in one unified formalism of finite state


transducers.
n Two Level Morphology
n Cascade Rules

100

50
8/28/17

OVERVIEW
n Overview of Morphology
n Computational Morphology
n Overview of Finite State Machines
n Finite State Morphology
¨ Two-level
Morphology
¨ Cascade Rules

101

Why is the Finite State Approach


Interesting?
n Finite state systems are mathematically
well-understood, elegant, flexible.
n Finite state systems are computationally
efficient.
n For typical natural language processing
tasks, finite state systems provide compact
representations.
n Finite state systems are inherently
bidirectional
102

51
8/28/17

Finite State Concepts


n Alphabet (A): Finite set of symbols
A={a,b}
n String (w): concatenation of 0 or more
symbols.
abaab, ε (the empty string)

103

Finite State Concepts


n Alphabet (A): Finite set of symbols
n String (w): concatenation of 0 or more
symbols.
n A*: The set of all possible strings
A* = {ε, a, b, aa, ab, ba, bb, aaa, aab, … }

104
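As a concrete illustration (not from the slides), a minimal Python sketch that enumerates A* for A = {a, b} in order of length; the names are illustrative:

  from itertools import count, islice, product

  def all_strings(alphabet=("a", "b")):
      """Yield every string over the alphabet, shortest first (i.e., enumerate A*)."""
      for n in count(0):                          # n = 0 yields the empty string
          for tup in product(alphabet, repeat=n):
              yield "".join(tup)

  print(list(islice(all_strings(), 9)))  # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb', 'aaa', 'aab']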

52
8/28/17

Finite State Concepts


n Alphabet (A): Finite set of symbols
n String (w): concatenation of 0 or more
symbols.
n A*: The set of all possible strings
n (formal) Language (L): a subset of A*
¨ The set of strings with an even number of a's
n e.g. abbaaba but not abbaa
¨ The set of strings with equal number of a's and
b's
n e.g., ababab but not bbbaa

105

Alphabets and Languages


A* = {ε, a, b, aa, ab, ba, bb, aaa, aab, … }

[Diagram: the finite alphabet A = {a,b} generates the infinite set A*; a (formal) language L is a subset of A* and can itself be finite or infinite – e.g., the set of strings with an even number of a's.]


106

53
8/28/17

Describing (Formal) Languages


n L = {a, aa, aab} – description by enumeration
n L = {a^n b^n : n ≥ 0} = {ε, ab, aabb, aaabbb, ….}
n L = {w | w has an even number of a’s}
n L = {w | w = w^R} – all strings that are the same as their reverses, e.g., a, abba
n L = {w | w = xx} – all strings that are formed by duplicating a string once, e.g., abab
n L = {w | w is a syntactically correct Java program}

107

Describing (Formal) Languages


n L = {w : w is a valid English word}
n L = {w: w is a valid Turkish word}
n L = {w: w is a valid English sentence}

108

54
8/28/17

Languages
n Languages are sets. So we can do “set”
things with them
¨ Union
¨ Intersection
¨ Complement with respect to the universe set
A*.

109

Describing (Formal) Languages


n How do we describe languages?
¨ L = {a^n b^n : n ≥ 0} is fine but not that terribly useful
n We need finite descriptions (of usually
infinite sets)
n We need to be able to use these
descriptions in “mechanizable” procedures.
¨ e.g.,to check if a given string is in the
language or not

110

55
8/28/17

Recognition Problem
n Given a language L and a string w
¨ Is w in L?

¨ The problem is that interesting languages are


infinite!

¨ We need finite descriptions of (infinite)


languages.

111

Classes of Languages

[Diagram: the finite alphabet A generates A*, the infinite set of all possible strings; the set of all languages is the set of all subsets of A*; a class of languages (containing, e.g., L1, L2, L3) is a subset of that set.]
112

56
8/28/17

Classes of (Formal) Languages


n Chomsky Hierarchy
¨ Regular Languages (simplest)
¨ Context Free Languages
¨ Context Sensitive Languages
¨ Recursively Enumerable Languages

113

Regular Languages
n Regular languages are those that can be
recognized by a finite state recognizer.

114

57
8/28/17

Regular Languages
n Regular languages are those that can be
recognized by a finite state recognizer.

n What is a finite state recognizer?

115

Finite State Recognizers


n Consider the mechanization of “recognizing”
strings with even number of a’s.
¨ Input is a string consisting of a’s and b’s
n abaabbaaba
¨ Count a’s but modulo 2, i.e., when count
reaches 2 reset to 0. Ignore the b’s – we are
not interested in b’s.
n 0 a 1 b 1 a 0 a 1 b 1 b 1 a 0 a 1 b 1 a 0
¨ Since we reset to 0 after seeing two a’s, any
string that ends with a count 0 matches our
condition.
116

58
8/28/17

Finite State Recognizers


n 0 a 1 b 1 a 0 a 1 b 1 b 1 a 0 a 1 b 1 a 0

n The symbols shown in between input symbols


encode/remember some “interesting” property of
the string seen so far.
¨ e.g., a 0 shows that we have seen an even number of
a’s, while a 1, shows an odd number of a’s, so far.
n Let’s call this “what we remember from past” as
the state.
n A “finite state recognizer” can only remember a
finite number of distinct pieces of information from
the past.
117

Finite State Recognizers


n We picture FSRs with state graphs.

[Diagram: a two-state graph over the input symbols a and b; q0 is the start state and the accept state, q1 is the other state; b-arcs loop on q0 and on q1, and a-arcs go back and forth between q0 and q1. The circles are states, the labeled arrows are transitions.]

118

59
8/28/17

Finite State Recognizers


b a b

q0 q1

Is abab in the language?

abab
^

119

Finite State Recognizers


b a b

q0 q1

Is abab in the language?

abab
^

120

60
8/28/17

Finite State Recognizers


b a b

q0 q1

Is abab in the language?

abab
^

121

Finite State Recognizers


b a b

q0 q1

Is abab in the language?

abab
^

122

61
8/28/17

Finite State Recognizers


b a b

q0 q1

Is abab in the language? YES

abab
^

123

Finite State Recognizers


b a b

q0 q1

Is abab in the language? YES

abab
^
The state q0 remembers the fact that we have seen an even number of a’s
The state q1 remembers the fact that we have seen an odd number of a’s

124

62
8/28/17

Finite State Recognizers


n Abstract machines for regular languages:
M = {Q, A, q0, Next, Final}

125

Finite State Recognizers


n Abstract machines for regular languages:
M = {Q, A, q0, Next, Final}
¨ Q: Set of states

126

63
8/28/17

Finite State Recognizers


n Abstract machines for regular languages:
M = {Q, A, q0, Next, Final}
¨ Q: Set of states
¨ A: Alphabet

127

Finite State Recognizers


n Abstract machines for regular languages:
M = {Q, A, q0, Next, Final}
¨ Q: Set of states
¨ A: Alphabet
¨ q0: Initial state

128

64
8/28/17

Finite State Recognizers


n Abstract machines for regular languages:
M = {Q, A, q0, Next, Final}
¨ Q: Set of states
¨ A: Alphabet
¨ q0: Initial state
¨ Next: Next state function Q × A → Q

129

Finite State Recognizers


n Abstract machines for regular languages:
M = {Q, A, q0, Next, Final}
¨ Q: Set of states
¨ A: Alphabet
¨ q0: Initial state
¨ Next: Next state function Q × A → Q
¨ Final: a subset of the states called the final
states

130

65
8/28/17

Finite State Recognizers


b a b

q0 q1

A = {a,b}
Q = {q0, q1}
Next = {((q0,b),q0),
((q0,a),q1), If the machine is in state q0 and the input is a then
the next state is q1
((q1,b),q1),
((q1,a),q0))}
Final = {q0}
131
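A direct Python rendering of this recognizer (a sketch: the dictionaries below simply transcribe the Next function and the Final set from the slide):

  A     = {"a", "b"}
  Q     = {"q0", "q1"}
  q0    = "q0"
  NEXT  = {("q0", "b"): "q0", ("q0", "a"): "q1",
           ("q1", "b"): "q1", ("q1", "a"): "q0"}
  FINAL = {"q0"}

  def accepts(w):
      """True iff w (a string over {a,b}) contains an even number of a's."""
      state = q0
      for symbol in w:
          if (state, symbol) not in NEXT:   # convention: no arc means reject
              return False
          state = NEXT[(state, symbol)]
      return state in FINAL

  print(accepts("abab"))   # True  (two a's)
  print(accepts("abb"))    # False (one a)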

Finite State Recognizers


n Abstract machines for regular languages:
M = {Q, A, q0, Next, Final}

n M accepts w ∈ A*, if
¨ starting in state q0, M proceeds by looking at each symbol in w, and
¨ ends up in one of the final states when the string w is exhausted.

132

66
8/28/17

Finite State Recognizers


n Note that the description of the recognizer
itself is finite.
¨ Finite number of states
¨ Finite number of alphabet symbols
¨ Finite number of transitions

n Number of strings accepted can be infinite.

133

Another Example
[Diagram: an FSR that spells out s-l-e-e-p, with an optional continuation -i-n-g.]

IMPORTANT CONVENTION
If at some state, there is no transition for a symbol, we assume that the FSR rejects the string.

Accepts sleep, sleeping


134

67
8/28/17

Another Example
[Diagram: the FSR extended with additional branches so that it also accepts the forms listed below.]

Accepts sleep, sleeping, sleeps, slept


135

Another Example
[Diagram: the FSR extended again so that it also accepts the forms listed below.]

Accepts sleep, sleeping, sleeps, slept, sweep, ….


136

68
8/28/17

Another Example
[Diagram: the FSR extended once more so that it also accepts the forms listed below.]

Accepts sleep, sleeping, sleeps, slept, sweep, …., keep, …


137

Another Example
[Diagram: the FSR further extended with an s-a-v-e branch and its suffixed forms.]

Accepts ….. save, saving, saved, saves
138

69
8/28/17

Regular Languages
n A language whose strings are accepted by
some finite state recognizer is a regular
language.

Regular Finite State


Languages Recognizers

139

Regular Languages
n A language whose strings are accepted by
some finite state recognizer is a regular
language.

Regular Finite State


Languages Recognizers

n However, FSRs can get rather unwieldy; we need higher-level notations for describing regular languages.
140

70
8/28/17


71
8/28/17

Regular Expressions
n A regular expression is compact formula or
metalanguage that describes a regular
language.

143

Constructing Regular Expressions


Expression → Language it denotes
n ∅ → {} (the empty language)
n ε → {ε}
n a, for any a ∈ A → {a}
n R1 | R2 → L1 ∪ L2
n R1 R2 → {w1w2 : w1 ∈ L1 and w2 ∈ L2}
n [R] → L (brackets are just grouping)

144

72
8/28/17

Constructing Regular Expressions


Expression → Language it denotes
n ∅ → {} (the empty language)
n ε → {ε}
n a, for any a ∈ A → {a}
n R1 | R2 → L1 ∪ L2
n R1 R2 → {w1w2 : w1 ∈ L1 and w2 ∈ L2}
n [R] → L (brackets are just grouping)

a b [a | c] [d | e]
145

Constructing Regular Expressions


Expression → Language it denotes
n ∅ → {} (the empty language)
n ε → {ε}
n a, for any a ∈ A → {a}
n R1 | R2 → L1 ∪ L2
n R1 R2 → {w1w2 : w1 ∈ L1 and w2 ∈ L2}
n [R] → L (brackets are just grouping)

a b [a | c] [d | e] ⇒ {abad, abae, abcd, abce}

146

73
8/28/17

Constructing Regular Expressions


n This much is not that interesting; allows us
to describe only finite languages.
n The Kleene Star operator describes infinite
sets.
R*

{ε} ∪ L ∪ LL ∪ LLL ∪ ….

147

Kleene Star Operator


n a* ⇒ {ε, a, aa, aaa, aaaa, …}

148

74
8/28/17

Kleene Star Operator


n a* ⇒ {ε, a, aa, aaa, aaaa, …}
n [ab]* ⇒ {ε, ab, abab, ababab, … }

149

Kleene Star Operator


n a* ⇒ {ε, a, aa, aaa, aaaa, …}
n [ab]* ⇒ {ε, ab, abab, ababab, … }
n ab*a ⇒ {aa, aba, abba, abbba, …}

150

75
8/28/17

Kleene Star Operator


n a* ⇒ {ε, a, aa, aaa, aaaa, …}
n [ab]* ⇒ {ε, ab, abab, ababab, … }
n ab*a ⇒ {aa, aba, abba, abbba, …}
n [a|b]*[c|d]* ⇒ {ε, a, b, c, d, ac, ad, bc, bd, aac, abc, bac, bbc, …}

151

Regular Expressions
n Regular expression for set of strings with
an even number of a's.

[b* a b* a]* b*
¨ Any number of concatenations of strings of the
sort
n Any number of b's followed by an a followed by any
number of b's followed by another a
¨ Ending with any number of b's
152
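The same pattern can be checked with Python’s re module (a small sketch; Python writes grouping with parentheses rather than the square brackets used on the slide):

  import re

  even_as = re.compile(r"(b*ab*a)*b*")    # [b* a b* a]* b*

  for s in ["", "bb", "aa", "babab", "a", "abbaa"]:
      print(repr(s), bool(even_as.fullmatch(s)))
  # '' True, 'bb' True, 'aa' True, 'babab' True, 'a' False, 'abbaa' False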

76
8/28/17

Regular Expressions
n Regular expression for set of strings with
an even number of a's.

[b* a b* a]* b*
¨b b a b a b b a a b a a a b a b b b
b* a b* a b* a b* a b* a b* a b* a b* a b*

Any number of strings matching b*ab*a concatenated

Ending with any number of


b’s

153

Regular Languages
n Regular languages are described by
regular expressions.

n Regular languages are recognized by finite


state recognizers.

n Regular expressions define finite state


recognizers.

154

77
8/28/17

More Examples of Regular Expressions


n All strings ending in a
¨ [a|b]*a
n All strings in which the first and the last
symbols are the same
¨ [a [a|b]* a] | [b [a|b]* b] | a | b
n All strings in which the third symbol from
the end is a b
¨ [a|b]* b [a|b] [a|b]

155

More Examples of Regular Expressions


n Assume an alphabet A={a,b,…,z}
n All strings ending in ing
¨ A* ing
n All strings that start with an a and end with
a ed
¨a A* e d

156
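The same two patterns in Python, as a quick sanity check (a sketch; [a-z] plays the role of the alphabet A = {a, b, …, z}):

  import re

  words        = ["sing", "going", "aborted", "acted", "ended", "ring"]
  ends_in_ing  = re.compile(r"[a-z]*ing")    # A* i n g
  a_then_ed    = re.compile(r"a[a-z]*ed")    # a A* e d

  print([w for w in words if ends_in_ing.fullmatch(w)])  # ['sing', 'going', 'ring']
  print([w for w in words if a_then_ed.fullmatch(w)])    # ['aborted', 'acted']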

78
8/28/17

Regular Languages to Regular Relations


n A regular language is a set of strings, e.g.
{ “cat”, “fly”, “big” }.

157

Regular Languages to Regular Relations


n A regular language is a set of strings, e.g.
{ “cat”, “fly”, “big” }.
n An ordered pair of strings, notated
<“upper”, “lower”>, relates two strings, e.g.
<“wiggle+ing”, “wiggling”>.

158

79
8/28/17

Regular Languages to Regular Relations


n A regular language is a set of strings, e.g. { “cat”,
“fly”, “big” }.
n An ordered pair of strings, notated <“upper”,
“lower”>, relates two strings, e.g. <“wiggle+ing”,
“wiggling”>.
n A regular relation is a set of ordered pairs of
strings, e.g.
¨{ <“cat+N”, “cat”> , <“fly+N”, “fly”> , <“fly+V”, “fly”>,
<“big+A”, “big”> }

¨{ <“cat”, “cats”> , <“zebra”, “zebras”> , <“deer”, “deer”>,


<“ox”, “oxen”>, <“child”, “children”> }
159

Regular Relations
n The set of upper-side strings in a regular
relation (upper language) is a regular
language.
¨{ cat+N , fly+N, fly+V, big+A}

n The set of lower-side strings in a regular


relation (lower language) is a regular
language.
¨ {cat ,fly, big }

160

80
8/28/17

Regular Relations
n A regular relation is a “mapping” between two regular languages. Each string in one of the languages is “related” to one or more strings of the other language.

n A regular relation can be described in a


regular expression and encoded in a Finite-
State Transducer (FST).

161

Finite State Transducers


n Regular relations can be defined by regular
expressions over an alphabet of pairs of symbols
u:l (u for upper, l for lower)
n In general either u or l may be the empty string symbol ε.
¨ ε:l – a symbol on the lower side maps to “nothing” on the upper side
¨ u:ε – a symbol on the upper side maps to “nothing” on the lower side
n It is also convenient to view any recognizer as an
identity transducer.
162

81
8/28/17

Finite State Transducers


b/a

qi qj

n Now the automaton outputs a symbol at every


transition, e.g., if the lower input symbol is a, then
b is output as the upper symbol (or vice-versa).
n The output is defined iff the input is accepted (on
the input side), otherwise there is no output.

163

Relations and Transducers

Regular relation
{ <ac,ac>, <abc,adc>, <abbc,addc>, <abbbc,adddc>... }

between [a b* c] and [a d* c].


“upper language” “lower language”

Finite-state transducer
Regular expression a:a c:c
0 1 2
a:a [b:d]* c:c

b:d
Slide courtesy of Lauri Karttunen 164

82
8/28/17

Relations and Transducers

Regular relation
{ <ac,ac>, <abc,adc>, <abbc,addc>, <abbbc,adddc>... }

between [a b* c] and [a d* c].


“upper language” “lower language”

Finite-state transducer
Regular expression a c
0 1 2
a [b:d]* c
Convention: when both upper and lower
symbols are same b:d
Slide courtesy of Lauri Karttunen 165

Finite State Transducers


n If the input string has an even number of
a's then flip each symbol.

a:b b:a a:b

q0 q1

b:a

166
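A Python sketch of this transducer, reading the arcs off the figure (the state and arc names are mine); the output is defined only when the input is accepted:

  # (state, input symbol) -> (output symbol, next state)
  ARCS  = {("q0", "a"): ("b", "q1"), ("q0", "b"): ("a", "q0"),
           ("q1", "a"): ("b", "q0"), ("q1", "b"): ("a", "q1")}
  FINAL = {"q0"}

  def transduce(w):
      """Flip every symbol of w, but only if w has an even number of a's."""
      state, output = "q0", []
      for symbol in w:
          if (state, symbol) not in ARCS:
              return None
          out, state = ARCS[(state, symbol)]
          output.append(out)
      return "".join(output) if state in FINAL else None

  print(transduce("abab"))   # 'baba' (two a's: accepted and flipped)
  print(transduce("ab"))     # None   (one a: rejected, no output)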

83
8/28/17

A Linguistic Example

v o u l o i r +IndP +SG +P3


v e u 0 0 0 0 0 0 t

Maps the lower string veut to the upper string


vouloir+IndP+SG+P3 (and vice versa)

From now on we will use the symbol 0 (zero) to denote the empty string e

167

Finite State Transducers – Black Box View


n We say T transduces the lower string w1 into the upper string w2 ∈ L2 in the upward direction (lookup).
n We say T transduces the upper string w2 into the lower string w1 ∈ L1 in the downward direction (lookdown).
n A given string may map to zero or more strings.

[Diagram: a finite state transducer T with the upper string w2 ∈ L2 above it and the lower string w1 ∈ L1 below it.]
168

84
8/28/17

Combining Transducers
n In algebra we write
¨ y=f(x) to indicate that function f maps x to y
¨ Similarly in z=g(y), g maps y to z
n We can combine these to
¨ z= g(f(x)) to map directly from x to z and write
this as z = (g · f) (x)
¨ g · f is the composed function
¨ If y = x^2 and z = y^3 then z = x^6

169

Combining Transducers
n The same idea can be applied to
transducers – though they define relations
in general.

170

85
8/28/17

Composing Transducers

[Diagram: transducer f maps upper language U1 (string x) to lower language L1 (string y); transducer g maps upper language U2 (string y) to lower language L2 (string z); the composition f ∘ g maps x directly to z. The effective upper language is U1’ = f⁻¹(L1 ∩ U2) and the effective lower language is L2’ = g(L1 ∩ U2).]
171

Composing Transducers

U1 x
U1’= f-1(L1 Ç U2)
f
f
y
L1
f °g
g
U2 y
L2’ = g(L1 Ç U2)
g
f ° g = {<x,z>: $y (<x,y> Î f and <y,z> Î g)}
L2
z where x, y, z are strings

The y’s have to be equal, so there is an implicit intersection


172

86
8/28/17

Composing Transducers
n Composition is an operation that merges two
transducers “vertically”.
¨ Let X be a transducer that contains the single ordered
pair < “dog”, “chien”>.
¨ Let Y be a transducer that contains the single ordered
pair <“chien”, “Hund”>.
¨ The composition of X over Y, notated X o Y, is the
relation that contains the ordered pair <“dog”, “Hund”>.

n Composition merges any two transducers


“vertically”. If the shared middle level has a non-
empty intersection, then the result will be a non-
empty relation.

173
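Viewed extensionally, a finite regular relation is just a set of <upper, lower> pairs, and composition is a join on the shared middle level; a tiny Python sketch of the dog/chien/Hund example:

  X = {("dog", "chien")}      # <upper, lower> pairs of transducer X
  Y = {("chien", "Hund")}     # <upper, lower> pairs of transducer Y

  def compose(f, g):
      """f o g = { <x, z> : there is a y with <x, y> in f and <y, z> in g }."""
      return {(x, z) for (x, y1) in f for (y2, z) in g if y1 == y2}

  print(compose(X, Y))        # {('dog', 'Hund')}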

Composing Transducers
n The crucial property is that the two finite
state transducers can be composed into a
single transducer.
¨ Details are hairy and not relevant.

174

87
8/28/17

Composing Transducers - Example


1273
n Maps a (finite) set of
English numerals to
integers

English Numeral to
Number Transducer
n Take my word that this
can be done with a
finite state transducer.

One thousand two hundred seventy three

175

Composing Transducers - Example


Bin iki yüz yetmiş üç
n Maps a (finite) set of
numbers to Turkish
numerals

Numbers to
Turkish Numerals n Again, take my word
Transducer
that this can be done
with a finite state
transducer.

1273

176

88
8/28/17

Composing Transducers - Example


Bin iki yüz yetmiş üç

Number to
Turkish Numeral Transducer

1273

English Numeral to
Number Transducer

One thousand two hundred seventy three

177

Composing Transducers - Example


Bin iki yüz yetmiş üç Bin iki yüz yetmiş üç

Number to
Turkish Numeral Transducer
Compose English Numeral
to
1273 Turkish Numeral
Transducer
English Numeral to
Number Transducer

One thousand two hundred seventy three One thousand two hundred seventy three

178

89
8/28/17

Composing Transducers - Example


satakaksikymmentäkolme satakaksikymmentäkolme

Number to
Finnish Numeral Transducer
Compose English Numeral
to
123 Finnish Numeral
Transducer
English Numeral to
Number Transducer

One hundred twenty three One hundred twenty three

179

Morphology & Finite State Concepts


n Some references:
¨ Hopcroft,Motwani and Ullman, Formal Languages and
Automata Theory, Addison Wesley
¨ Beesley and Karttunen, Finite State Morphology, CSLI,
2003
¨ Kaplan and Kay. “Regular Models of Phonological
Rule Systems” in Computational Linguistics 20:3,
1994. Pp. 331-378.
¨ Karttunen et al., Regular Expressions for Language
Engineering, Natural Language Engineering, 1996

180

90
8/28/17

End of digression
n How does all this tie back to computational
morphology?

181

OVERVIEW
n Overview of Morphology
n Computational Morphology
n Overview of Finite State Machines
n Finite State Morphology
¨ Two-level
Morphology
¨ Cascade Rules

182

91
8/28/17

Morphological Analysis and Regular Sets


n Assumption:
The set of words in a (natural) language is a regular
(formal) language.

n Finite: Trivially true; though instead of listing all words, one


needs to abstract and describe phenomena.

n Infinite: Very accurate approximation.

n BUT: Processes like (unbounded) reduplication


are out of the finite state realm.
n Let’s also assume that we combine morphemes
by simple concatenation.
¨ Not a serious limitation
183

Morphological Analysis
n Morphological
happy+Adj+Sup
analysis can be seen
as a finite state
transduction

Finite State
Transducer
T

happiest Î English_Words
184

92
8/28/17

Morphological Analysis as FS Transduction


n First approximation

n Need to describe
¨ Lexicon (of free and bound morphemes)

¨ Spelling
change rules in a finite state
framework.

185

The Lexicon as a Finite State Transducer


n Assume words have the form
prefix+root+suffix where the prefix and the
suffix are optional.
So:
Prefix = [P1 | P2 | …| Pk]
Root = [ R1 | R2| … | Rm]
Suffix = [ S1 | S2 | … | Sn]
Lexicon = (Prefix) Root (Suffix)

(R) = [R | e], that is, R is optional.


186

93
8/28/17

The Lexicon as a Finite State Transducer


n Prefix = [[ u n +] | [ d i s +] | [i n +]]
Root = [ [t i e] | [ e m b a r k ] | [ h a p p y ]
| [ d e c e n t ] | [ f a s t e n]]
Suffix = [[+ s] | [ + i n g ] | [+ e r] | [+ e d] ]

Tie, embark, happy, un+tie, dis+embark+ing,


in+decent, un+happy √
un+embark, in+happy+ing …. X

187

The Lexicon as a Finite State Transducer


n Lexicon =
[ ([ u n +]) [ [t i e] | [ f a s t e n]] ([[+e d] | [ + i n g] | [+ s]]) ]
|
[ ([d i s +]) [ e m b a r k ] ([[+e d] | [ + i n g] | [+ s]])]
|
[ ([u n +]) [ h a p p y] ([+ e r])]
|
[ (i n +) [ d e c e n t] ]
Note that some patterns are now emerging:
tie, fasten, and embark are verbs, but differ in their prefixes;
happy and decent are adjectives, but behave differently.
188
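A small Python sketch that spells the same lexicon out by enumeration, to make the grouping explicit (the groups below simply transcribe the regular expression above; the empty string stands for an absent optional prefix or suffix):

  from itertools import product

  GROUPS = [   # (optional prefixes, roots, optional suffixes)
      (["", "un+"],  ["tie", "fasten"], ["", "+ed", "+ing", "+s"]),
      (["", "dis+"], ["embark"],        ["", "+ed", "+ing", "+s"]),
      (["", "un+"],  ["happy"],         ["", "+er"]),
      (["", "in+"],  ["decent"],        [""]),
  ]

  def lexicon():
      for prefixes, roots, suffixes in GROUPS:
          for p, r, s in product(prefixes, roots, suffixes):
              yield p + r + s

  WORDS = set(lexicon())
  print("un+tie+ing" in WORDS)   # True
  print("in+decent"  in WORDS)   # True
  print("un+embark"  in WORDS)   # False -- not sanctioned by any group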

94
8/28/17

The Lexicon
n The lexicon structure can be refined to a
point so that all and only valid forms are
accepted and others rejected.

n This is very painful to do manually for any


(natural) language.

189

Describing Lexicons
n Current available systems for morphology provide
a simple scheme for describing finite state
lexicons.
¨ XeroxFinite State Tools
¨ PC-KIMMO

n Roots and affixes are grouped and linked to each


other as required by the morphotactics.
n A compiler (lexc) converts this description to a
finite state transducer

190

95
8/28/17

Describing Lexicons
LEXICON NOUNS
abacus NOUN-STEM; ;; same as abacus:abacus
car NOUN-STEM;
table NOUN-STEM;

information+Noun+Sg: information End;

zymurgy NOUN-STEM;

LEXICON NOUN-STEM
+Noun:0 NOUN-SUFFIXES

Think of these as (roughly) the NEXT function of a transducer


upper:lower next-state (but strings of symbols)

191

Describing Lexicons
LEXICON NOUN-SUFFIXES
LEXICON REG-VERB-STEM
+Sg:0 End;
+Pl:+s End; +Verb:0 REG-VERB-SUFFIXES;

LEXICON REG-VERB-SUFFIXES
LEXICON REGULAR-VERBS +Pres+3sg:+s End;
admire REG-VERB-STEM;
+Past:+ed End;
head REG-VERB-STEM;
.. +Part:+ed End;
zip REG-VERB-STEM; +Cont:+ing End;

LEXICON IRREGULAR-VERBS
…..

192

96
8/28/17

How Does it Work?


n Suppose we have the string abacus+s
¨ The first segment matches abacus in the NOUNS
lexicon, “outputs” abacus and we next visit the NOUN-
STEM lexicon.
¨ Here, the only option is to match to the empty string
and “output” +Noun and visit the NOUN-SUFFIXES
lexicon.
¨ Here we have two options
n Match the empty string and “output” +Sg, but there is more input left, so this fails.
n Match the +s, “output” +Pl, and then finish.
¨ Output is abacus+Noun+Pl

193
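A toy Python rendering of this lookup over continuation classes (the lexicon below is a tiny illustrative fragment in the spirit of the lexc code above, not the lexc compiler itself):

  # lexicon name -> list of (upper string, lower string, continuation lexicon)
  LEXICONS = {
      "ROOT":          [("abacus", "abacus", "NOUN-STEM"),
                        ("car",    "car",    "NOUN-STEM")],
      "NOUN-STEM":     [("+Noun",  "",       "NOUN-SUFFIXES")],
      "NOUN-SUFFIXES": [("+Sg",    "",       "End"),
                        ("+Pl",    "+s",     "End")],
  }

  def lookup(lower, lexicon="ROOT", upper=""):
      """Return every upper (feature) string whose lower side matches the input."""
      if lexicon == "End":
          return [upper] if lower == "" else []
      analyses = []
      for up, lo, nxt in LEXICONS[lexicon]:
          if lower.startswith(lo):
              analyses += lookup(lower[len(lo):], nxt, upper + up)
      return analyses

  print(lookup("abacus+s"))   # ['abacus+Noun+Pl']
  print(lookup("abacus"))     # ['abacus+Noun+Sg']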

Describing Lexicons
LEXICON ADJECTIVES happy+Adj+Sup

LEXICON ADVERBS

LEXICON ROOT Lexicon Transducer


NOUNS;
REGULAR-VERBS;
….
ADJECTIVES;
….

happy+est

194

97
8/28/17

Lexicon as a FS Transducer
[Diagram: a fragment of the lexicon transducer drawn as a trie of upper:lower symbol pairs, encoding e.g. happy+Adj+Sup : happy+est, save+Verb+Past : save+ed, save+Verb+Pres+3sg : save+s, and table+Noun+Pl : table+s.]

A typical lexicon will be represented with 10^5 to 10^6 states. 195

Morphotactics in Arabic
n As we saw earlier, words in Arabic are
based on a root and pattern scheme:
¨A root consisting of 3 consonants (radicals)
¨ A template and a vocalization.
which combine to give a stem.
n Further prefixes and suffixes can be
attached to the stem in a concatenative
fashion.

196

98
8/28/17

Morphotactics in Arabic
Pattern

CVCVC
Vocalization FormI+Perfect+Active
a a
Root

d r s learn/study

Prefix Suffix

wa+ +at
daras

wa+daras+at

“and she learned/studied” 197

Morphotactics in Arabic
Pattern

CVCVC
Vocalization FormI+Perfect+Passive
u i
Root

d r s learn/study

Prefix Suffix

wa+ +at
duris

wa+duris+at

“and she was learned/studied” 198

99
8/28/17

Morphotactics in Arabic
Pattern

CVCVC
Vocalization FormI+Perfect+Active
a a
Root

k t b write

Prefix Suffix

wa+ +at
katab

wa+katab+at

“and she wrote” 199

Morphotactics in Arabic
Pattern

CVCCVC
Vocalization FormII+Perfect+Active
a a
Root

d r s learn/study

Prefix Suffix

wa+ +at
darras

wa+darras+at

“and *someone* taught/instructed her/it” 200

100
8/28/17

Morphotactics in Arabic
wa+Conj+drs+FormI+Perfect+Passive+3rdPers+Fem+Sing

wa+drs+CVCVC+ui+at

wa+duris+at

201

Morphotactics in Arabic
wa+Conj+drs+FormI+Perfect+Passive+3rdPers+Fem+Sing

wa+drs+CVCVC+ui+at

wa+duris+at

wadurisat

202

101
8/28/17

Morphotactics in Arabic
wa+Conj+drs+FormI+Perfect+Passive+3rdPers+Fem+Sing

wa+drs+CVCVC+ui+at

wa+duris+at

wadurisat

wdrst
203

Morphotactics in Arabic
[Diagram: the surface form wdrst fans out into many different analyses, all built on the root drs.]

16 possible interpretations
204

102
8/28/17

Arabic Morphology on Web


n Play with XRCE’s Arabic Analyzer at
¨ https://open.xerox.com/Services/arabic-
morphology

205

Lexicon as a FS Transducer
[Diagram: the same lexicon-transducer fragment as before; note that several branches share initial symbols.]

Nondeterminism 206

103
8/28/17

The Lexicon Transducer


n Note that the lexicon transducer solves part
of the problem.
¨ It
maps from a sequence of morphemes to root
and features.

¨ Where do we get the sequence of


morphemes?

207

Morphological Analyzer Structure


happy+Adj+Sup

Lexicon Transducer

happy+est

Morphographemic ????????
Transducer

happiest
208

104
8/28/17

Sneak Preview (of things to come)


happy+Adj+Sup happy+Adj+Sup

Lexicon Transducer

Morphological
happy+est Compose
Analyzer/Generator

Morphographemic
Transducer

happiest happiest
209

The Morphographemic Transducer


n The morphographemic transducer
generates
¨ allpossible ways the input word can be
segmented and “unmangled”
¨ As sanctioned by the alternation rules of the
language
n Graphemic conventions
n Morphophonological processes (reflected to the
orthography)

210

105
8/28/17

The Morphographemic Transducer


n The morphographemic transducer “thinks”:
¨ There may be a morpheme boundary between i and e, so let me mark that with a +.
¨ There is an i+e situation now, and
¨ there is a rule that says: change the i to a y in this context.
¨ So let me output happy+est.

[Diagram: the morphographemic transducer maps the surface form happiest up to happy+est.]

211

The Morphographemic Transducer



n However, the morphographemic transducer is oblivious to the lexicon,
¨ it does not really know about words and morphemes,
¨ but rather about what happens when you combine them.

[Diagram: from the surface form happiest the transducer produces many candidate segmentations (happy+est, h+ap+py+e+st, happiest, …). Only some of these will actually be sanctioned by the lexicon.]

212

106
8/28/17

The Morphographemic Transducer


n This obliviousness is actually a good thing
¨ Languages easily import, generate new words
¨ But not necessarily such rules! (and usually
there are a “small” number of rules.)
¨ brag ⇒ bragged, flog ⇒ flogged
n In a situation like vowel-g + vowel, insert another g
¨ And you want these rules to also apply to new words coming to the lexicon
n blog ⇒ blogged, vlog ⇒ vlogged, etc.
213

The Morphographemic Transducer


n Also, especially in languages that allow segmentation ambiguity, there may be multiple legitimate outputs of the transducer that are sanctioned by the lexicon transducer.

[Diagram: the surface form koyun maps up to several candidate segmentations: koyun, koy+Hn, koy+nHn, koyu+Hn, ….]

214

107
8/28/17

What kind of changes does the MG


Transducer handle?
n Agreement
¨ consonants agree in certain features under
certain contexts
n e.g., Voiced, unvoiced
¨ (T) at+tı (it was a horse, it threw)
¨ (T) ad+dı (it was a name)
¨ vowels agree in certain features under certain
contexts
n e.g., vowel harmony (may be long distance)
¨ (T) el+ler+im+de (in my hands)
¨ (T) masa+lar+ım+da (on my tables)

215

What kind of changes does the MG


Transducer handle?
n Insertions
¨ brag+ed Þ bragged
n Deletions
¨ (T) koy+nHn Þ koyun (of the bay)
¨ (T) alın+Hm+yA Þ alnıma (to my forehead)
n Changes
¨ happy+est Þ happiest
¨ (T) tarak+sH Þ tarağı (his comb)
¨ (G) Mann+er Þ Männer

216

108
8/28/17

Approaches to MG Transducer Architecture


n There are two main approaches to the
internal architecture of the
morphographemics transducers.
¨ The parallel constraints approach
n Two-level Morphology

¨ Sequential transduction approach


n Cascade Rules

217

The Parallel Constraints Approach


n Two-level Morphology
¨ Koskenniemi ’83, Karttunen et al. 1987,
Karttunen et al. 1992
Describe constraints

Lexical form
Set of parallel
of two-level rules
compiled into finite-state fst 1 fst 2 ... fst n
automata interpreted as
transducers
Surface form

Slide courtesy of Lauri Karttunen 218

109
8/28/17

Sequential Transduction Approach


n Cascaded ordered rewrite rules
¨ Johnson ’71 Kaplan & Kay ’81 based on
Chomsky & Halle ‘64 Lexical form

fst 1

Intermediate form

Ordered cascade fst 2


of rewrite rules
compiled into finite-state
...
transducers

fst n

Slide courtesy of Lauri Karttunen


Surface form
219

Spoiler
n At the end both approaches are equivalent
Lexical form Lexical form

fst 1
fst 1 fst 2 ... fst n
Intermediate form
fst 2

... Surface form

fst n

Surface form

Slide courtesy of Lauri Karttunen


FST 220

110
8/28/17

Sequential vs. Parallel


n Two different ways of decomposing the
complex relation between lexical and
surface forms into a set of simpler relations
that can be more easily understood and
manipulated.
¨ The sequential model is oriented towards
string-to-string relations,
¨ The parallel model is about symbol-to-symbol
relations.
n In some cases it may be advantageous to
combine the two approaches.
Slide courtesy of Lauri Karttunen 221

Two-Level Morphology
n Basic terminology and concepts
n Examples of morphographemic alternations
n Two-level rules
n Rule examples

222

111
8/28/17

Terminology
n Representation
Surface form/string : happiest
Lexical form/string: happy+est

n Aligned correspondence:
happy+est
happi0est
(You may think of these as the upper and lower symbols in an FST.)
n 0 will denote the empty string symbol, but we will pretend that it is an actual symbol!!
223

Feasible Pairs
n Aligned correspondence:
happy+est
happi0est

n Feasible Pairs: The set of possible lexical


and surface symbol correspondences.
¨ Allpossible pairs for insertions, deletions,
changes
¨ {h:h, a:a, p:p, y:i, +:0, e:e, s:s, t:t, y:y …}

224

112
8/28/17

Aligned Correspondence
n Aligned correspondence:
happy+est
happi0est
n The alignments can be seen as
¨ Strings in a regular language over the alphabet
of feasible pairs, (i.e., symbols that look like
“y:i”) or

225

Aligned Correspondence
n Aligned correspondence:
happy+est
happi0est
n The alignments can be seen as
¨ Strings in a regular language over the alphabet
of feasible pairs, (i.e., symbols that look like
“y:i”) or
¨ Transductions from surface strings to lexical
strings (analysis), or

226

113
8/28/17

Aligned Correspondence
n Aligned correspondence:
happy+est
happi0est
n The alignments can be seen as
¨ Strings in a regular language over the alphabet
of feasible pairs, (i.e., symbols that look like
“y:i”) or
¨ Transductions from surface strings to lexical
strings (analysis), or
¨ Transductions from lexical strings to surface
strings (generation)
227

The Morphographemic Component


n So how do we define this regular language
(over the alphabet of feasible pairs) or the
transductions?
¨ We want to accept the pairs
n stop0+ed try+ing try+ed
n stopp0ed try0ing tri0ed
¨ but reject
n stop+ed try+ing try+ed
n stop0ed tri0ing try0ed

228

114
8/28/17

The Morphographemic Component


n Basically we need to describe
¨ whatkind of changes can occur, and
¨ under what conditions
n contexts
n optional vs obligatory.

n Typically there will be few tens of different


kinds of “spelling change rules.”

229

The Morphographemic Component


¨ The first step is to create an inventory of
possible spelling changes.
n What are the changes? Þ Feasible pairs
n How can we describe the contexts it occurs?

n Is the change obligatory or not?

n Consistent representation is important

¨ The second step is to describe these changes


in a computational formalism.

230

115
8/28/17

Inventory of Spelling Changes


n In general there are many ways of looking at a
certain phenomena
¨ An insertion at the surface can be viewed as a deletion
on the surface
¨ Morpheme boundaries on the lexical side can be
matched with 0 or some real surface symbol
n For the following examples we will have +s and
+ed as the morphemes.
n Rules will be devised relative to these
representations.

231

Inventory of Spelling Changes


n An extra e is needed before +s after consonants
s, x, z, sh, or ch
box+0s kiss+0s blemish+0s preach+0s
box0es kiss0es blemish0es preach0es

232

116
8/28/17

Inventory of Spelling Changes


n Sometimes a y will be an i on the surface.
try+es spot0+y+ness
tri0es spott0i0ness

n But not always


try+ing country+'s spy+'s
try0ing country0's spy0's

233

Inventory of Spelling Changes


n Sometimes lexical e may be deleted
move+ed move+ing persuade+ed
mov00ed mov00ing persuad00ed

dye+ed queue+ing tie+ing


dy00ed queu00ing ty00ing

n but not always


trace+able change+able
trace0able change0able

234

117
8/28/17

Inventory of Spelling Changes


n The last consonant may be duplicated
following a stressed syllable.
re'fer0+ed 'travel+ed
re0ferr0ed 0travel0ed

n So if you want to account for this, stress will


have to be somehow represented and we
will have a feasible pair (':0)

235

Inventory of Spelling Changes


n Sometimes the s of the genitive marker (+’s)
drops on the surface
boy+s+'s dallas's
boy0s0'0 dallas'0

236

118
8/28/17

Inventory of Spelling Changes


n Sometimes an i will be a y on the surface.
tie+ing
ty00ing

237

Inventory of Spelling Changes


n In Zulu, the n of the basic prefix changes to m
before labial consonants b, p, f, v.
i+zin+philo
i0zim0p0ilo (izimpilo)

n Aspiration (h) is removed when n is followed by


kh, ph, th, bh
i+zin+philo
i0zim0p0ilo (izimpilo)

238

119
8/28/17

Inventory of Spelling Changes


n In Arabic “hollow” verbs, the middle radical w
between two vowels is replaced by vowel
lengthening
zawar+tu
z0U0r0tu (zUrtu – I visited)

qawul+a
qaA0l0a (qaAla – he said)

qaAla => qawul+a => qwl+CaCuC+a =>


qwl+FormI+Perfect+3Pers+Masc+Sing

239

The Computational Formalism


n So how do we (formally) describe all such
phenomena
¨ Representation
n lexical form
n surface form

¨ Conditions
n Context
n Optional vs Obligatory Changes

240

120
8/28/17

Parallel Rules
n A well-established formalism for describing
morphographemic changes.

Lexical Form
Each rule describes
a constraint on legal
Lexical - Surface
R1 R2 R3 R4 ... Rn pairings.

All rules operate in


parallel!

Surface Form
241

Parallel Rules
n Each morphographemic constraint is
enforced by a finite state recognizer over
the alphabet of feasible-pairs.
t i e + i n g

R1 R2 R3 R4 ... Rn

t y 0 0 i n g
(Assume that the 0’s are magically there; the reason and how are really technical.)
242

121
8/28/17

Parallel Rules
n A lexical-surface string pair is "accepted" if
NONE of the rule recognizers reject it.
n Thus, all rules must put a good word in!
t i e + i n g

R1 R2 R3 R4 ... Rn

t y 0 0 i n g
243

Parallel Rules
n Each rule independently checks if it has
any problems with the pair of strings.

t i e + i n g

R1 R2 R3 R4 ... Rn

t y 0 0 i n g
244

122
8/28/17

Two-level Morphology
n Each recognizer sees that same pair of
symbols

t i e + i n g Each Ri sees t:t


and makes a state
change accordingly

R1 R2 R3 R4 ... Rn

t y 0 0 i n g
245

Two-level Morphology
n Each recognizer sees that same pair of
symbols

t i e + i n g

R1 R2 R3 R4 . . . Rn

t y 0 0 i n g
246

123
8/28/17

Two-level Morphology
n Each recognizer sees that same pair of
symbols

t i e + i n g

R1 R2 R3 R4 ... Rn

t y 0 0 i n g
247

Two-level Morphology
n Each recognizer sees that same pair of
symbols

t i e + i n g At this point all


Ri should be in
an accepting state.

R1 R2 R3 R4 ... Rn

t y 0 0 i n g
248

124
8/28/17

The kaNpat Example


n Language X has a root kaN, including an
underspecified nasal morphophoneme N, that
can be followed by the suffix pat to produce the
well-formed, but still abstract, word kaNpat. We
may refer to this as the “underlying” or “lexical” or
“morphophonemic” form.
n The morphophonology of this language has an
alternation: the morphophoneme N becomes m
when it is followed by a p.
n In addition, a p becomes an m when it is
preceded by an underlying or derived m.
n The “surface” form or “realization” of this word
should therefore be kammat.
249

The kaNpat Example

The morphophonology of this language has an alternation: the morphophoneme N becomes m when it is followed by a p.
In addition, a p becomes an m when it is preceded by an underlying or derived m.

250

125
8/28/17

The kaNpat Example


n Ignore everything until
you see a N:m pair,
and make sure it is
followed with some
p:@ (some feasible
pair with p on the
lexical side)
n If you see a N:x (x≠m)
then make sure the
next pair is not p:@.

251

The kaNpat Example: N:m rule


n Ignore everything until
you see a N:m pair,
and make sure it is
followed with some
p:@ (some feasible
pair with p on the
lexical side)
n If you see a N:x (x≠m)
then make sure the
next pair is not p:@.

p ⇒ {p:p, p:m}; a denotes everything else. Also remember the FST rejects if no arc is found.
252

126
8/28/17

The kaNpat Example: p:m rule


n Make sure that a p:m
pairing is preceded by
a feasible pair like
@:m
n And, no p:x (x ≠ m)
follows a @:m

253

The kaNpat Example: p:m rule


n Make sure that a p:m
pairing is preceded by
a feasible pair like
@:m
n And, no p:x (x ≠ m)
follows a @:m

Also remember that if an unlisted input is encountered, the string is rejected.

M ⇒ {m:m, N:m}; a denotes everything else 254

127
8/28/17

Rules in parallel
a t

Both FSRs see the k:k pair and stay in state 1


255

Rules in parallel
a t

Both FSRs see the a:a pair and stay in state 1


256

128
8/28/17

Rules in parallel
a t

Both FSRs see the N:m 1st one goes to state 3 2nd one goes to state 2
257

Rules in parallel
a t

Both FSRs see the p:m 1st one back goes to state 1 2nd one stays in state 2
258

129
8/28/17

Rules in parallel
a t

Both FSRs see the a:a 1st one stays in state 1 2nd one goes to state 1
259

Rules in parallel
a t

Both FSRs see the t:t 1st one stays in state 1 2nd one stays in state 1
260

130
8/28/17

Rules in parallel
a t

Both machines are now in accepting states


261

Rules in parallel
a t

Try kaNpat and kampat pairing

262
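A Python sketch of the two kaNpat rules running in parallel over a string of feasible pairs; the state machines below are my reconstruction of the two recognizers pictured above, so treat the details as illustrative:

  def rule_N_m(pairs):                      # N:m <=> _ p:@
      state = "neutral"                     # other states: "saw N:n", "saw N:m"
      for lex, sur in pairs:
          if state == "saw N:m":
              if lex != "p":                # N:m must be followed by a lexical p
                  return False
              state = "neutral"
          elif state == "saw N:n" and lex == "p":
              return False                  # N:n must NOT be followed by a lexical p
          elif lex == "N":
              state = "saw N:m" if sur == "m" else "saw N:n"
          else:
              state = "neutral"
      return state != "saw N:m"             # a trailing N:m has no p after it

  def rule_p_m(pairs):                      # p:m <=> @:m _
      prev_surface_m = False
      for lex, sur in pairs:
          if lex == "p" and (sur == "m") != prev_surface_m:
              return False                  # p:m only after m; p:p never after m
          prev_surface_m = (sur == "m")
      return True

  def accepted(lexical, surface):
      pairs = list(zip(lexical, surface))   # the aligned feasible pairs
      return rule_N_m(pairs) and rule_p_m(pairs)

  print(accepted("kaNpat", "kammat"))   # True
  print(accepted("kaNpat", "kampat"))   # False (p:p right after a surface m)
  print(accepted("kaNpat", "kanpat"))   # False (N:n followed by a lexical p)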

131
8/28/17

Crucial Points
n Rules are implemented by recognizers over
strings of pairs of symbols.

263

Crucial Points
n Rules are implemented by recognizers over
strings of pairs of symbols.
n The set of strings accepted by a set of such
recognizers is the intersection of the languages
accepted by each!
¨ Because, all recognizers have to be in the accepting
state– the pairing is rejected if at least one rule rejects.

264

132
8/28/17

Crucial Points
n Rules are implemented by recognizers over
strings of pairs of symbols.
n The set of strings accepted by a set of such
recognizers is the intersection of the languages
accepted by each!

n Such recognizers can be viewed as transducers


between lexical and surface string languages.

265

Crucial Points
n Rules are implemented by recognizers over
strings of pairs of symbols.
n The set of strings accepted by a set of such
recognizers is the intersection of the languages
accepted by each!

n Such machines can be viewed as transducers


between lexical and surface string languages.
n Each transducer defines a regular relation or
transduction.

266

133
8/28/17

Some Technical Details


n Such machines can be viewed as transducers
between lexical and surface string languages.
n Each transducer defines a regular relation or
transduction.
n The intersection of the regular relations over
equal length strings is a regular relation (Kaplan
and Kay 1994).
¨ Ingeneral the intersection of regular relations need not
be regular!
n So, 0 is treated as a normal symbol internally, not as ε. So there are no transitions with real ε symbols.

267

Intersecting the Transducers


n The transducers for each rule can now be
intersected to give a single transducer.

t i e + i n g

R1 Ç R2 Ç R3 Ç R4 Ç ... Ç Rn

t y 0 0 i n g
268

134
8/28/17

The Intersected Transducer


n The resulting transducer can be used for
analysis

t i e + i n g

Morphographemic Rule Transducer

t y 0 0 i n g
269

The Intersected Transducer


n The resulting transducer can be used for
analysis, or generation

t i e + i n g

Morphographemic Rule Transducer

t y 0 0 i n g
270

135
8/28/17

Describing Phenomena
n Finite state transducers are too low level.

n Need high level notational tools for


describing morphographemic phenomena

n Computers can then compile such rules


into transducers; compilation is too much of
a hassle for humans!

271

Two-level Rules
n Always remember the set of feasible
symbols = sets of legal correspondences.
n Rules are of the sort:
a:b op LC __ RC

Feasible
Pair

272

136
8/28/17

Two-level Rules

n Rules are of the sort:

a:b op LC __ RC

Feasible Operator
Pair

273

Two-level Rules

n Rules are of the sort:

a:b op LC __ RC

Feasible Operator Left Context Right Context


Pair

274

137
8/28/17

Two-level Rules

n Rules are of the sort:

a:b op LC __ RC

Feasible Operator Left Context Right Context


Pair Regular expressions
that define what comes before
and after the pair.
275

The Context Restriction Rule


n a:b => LC _ RC;

¨a lexical a MAY be paired with a surface b only


in this context
¨ correspondence implies the context
¨ occurrence of a:b in any other context is not
allowed (but you can specify multiple contexts!)

276

138
8/28/17

The Context Restriction Rule

a:b => LC _ RC;

y:i => Consonant (+:0) _ +:0

Left Context: some consonant, possibly followed by an (optional) morpheme boundary
Right Context: a morpheme boundary

try+0s spot0+y+ness day+s


tri0es spott0i0ness dai0s
277

The Surface Coercion Rule


n a:b <= LC _ RC

¨A lexical a MUST be paired with a surface b in


the given context; no other pairs with a as its
lexical symbol can appear in this context.
¨ Context implies correspondence
¨ Note that there may be other contexts where
an a may be paired with b (optionally)

278

139
8/28/17

The Surface Coercion Rule


n The s of the genitive suffix MUST drop after
the plural suffix.
a:b <= LC __ RC

s:0 <= +:0 (0:e) s:s +:0 ':' _ ;

book+s+'s blemish+0s+'s book+s+'s


book0s0'0 blemish0es0'0 book0s0's

279

The Composite Rule


n a:b <=> LC _ RC

¨A lexical a must be paired with a surface b only


in this context and this correspondence is
valid only in this context.
¨ Combination of the previous two rules.
¨ Correspondence Û Context

280

140
8/28/17

The Composite Rule


n i:y is valid only before an e:0 correspondence
followed by a morpheme boundary and followed
by an i:i correspondence, and
n in this context i must be paired with a y.
a:b <=> LC _ RC
i:y <=> _ e:0 +:0 i

tie+ing tie+ing tie+ing


ty00ing ti00ing tye0ing

281

The Exclusion Rule


n a:b /<= LC _ RC

¨A lexical a CANNOT be paired with a b in this


context
¨ Typically used to constrain the other rules.

282

141
8/28/17

The Exclusion Rule


n The y:i correspondence formalized earlier can
not occur if the morpheme on the right hand side
starts with an i or the ' (the genitive marker)
a:b /<= LC _ RC
y:i /<= Consonant (+:0) _ +:0 [i:i | ’:’]

try+ing try+ing Note that the previous


context restriction rule
try0ing tri+ing
sanctions this case

283

Rules to Transducers
n All the rule types can be compiled into finite
state transducers
¨ Rather hairy and not so gratifying ☺

284

142
8/28/17

Rules to Transducers
n Let’s think about a:b => LC _ RC
n If we see the a:b pair we want
to make sure
¨ It is preceded by a (sub)string
that matches LC, and
¨ It is followed by a (sub)string
that matches RC
n So we reject any input that
violates either or both of these
constraints
285

Rules to Transducers
n Let’s think about a:b => LC _ RC
n More formally
¨ Itis not the case that we have a:b not
preceded by LC, or not followed by RC
¨ ~[ [~[?* LC] a:b ?*] | [?* a:b ~[RC ?*]] ]
(~ is the complementation operator)

286

143
8/28/17

Summary of Rules
n <= a:b <= c _ d
¨ a is always realized as b in the context c _ d.
n => a:b => c _ d
¨ a is realized as b only in the context c _ d.
n <=> a:b <=> c _ d
¨ a is realized as b in c _ d and nowhere else.
n /<= a:b /<= c _ d
¨ a is never realized as b in the context c _ d.

287

How does one select a rule?


                     Is a:b allowed        Is a:b only allowed     Must a always correspond
                     in this context?      in this context?        to b in this context?

a:b =>  LC _ RC      Yes                   Yes                     No
a:b <=  LC _ RC      Yes                   No                      Yes
a:b <=> LC _ RC      Yes                   Yes                     Yes
a:b /<= LC _ RC      No                    NA                      NA

288

144
8/28/17

More Rule Examples - kaNpat


n N:m <=> _ p:@;
¨ N:n is also given as a feasible pair.
n p:m <=> @:m _ ;

289

More Rule Examples


n A voiced consonant is devoiced word-finally
¨ b:p <=> _ #;
¨ d:t <=> _ #;
¨ c:ç <=> _ #;

n The XRCE formalism lets you write this as a single rule
n Cx:Cy <=> _ #: ;
where Cx in (b d c)
Cy in (p t ç) matched ;
290

145
8/28/17

Two-level Morphology
n Beesley and Karttunen, Finite State Morphology,
CSLI Publications, 2004 (www.fsmbook.com)
n Karttunen and Beesley: Two-level rule compiler,
Xerox PARC Tech Report
n Sproat, Morphology and Computation, MIT Press
n Ritchie et al. Computational Morphology, MIT
Press
n Two-Level Rule Compiler
http://www.xrce.xerox.com/competencies/content-
analysis/fssoft/docs/twolc-92/twolc92.html
291

Engineering a Real Morphological Analyzer

292

146
8/28/17

Turkish
n Turkish is an Altaic language with over 60
Million speakers ( > 150 M for Turkic
Languages: Azeri, Turkoman, Uzbek, Kirgiz,
Tatar, etc.)

n Agglutinative Morphology
¨ Morphemes glued together like "beads-on-a-
string"
¨ Morphophonemic processes (e.g.,vowel
harmony)
293

Turkish Morphology
n Productive inflectional and derivational
suffixation.

n No prefixation, and no productive


compounding.

n With minor exceptions, morphotactics, and


morphophonemic processes are very
"regular."
294

147
8/28/17

Turkish Morphology
n Too many word forms per root.
¨ Hankamer (1989) e.g., estimates few million
forms per verbal root (based on generative
capacity of derivations).
¨ Nouns have about 100 different forms w/o any
derivations
¨ Verbs have a thousands.

295

Word Structure
n A word can be seen as a sequence of inflectional
groups (IGs) of the form
Lemma+Infl1^DB+Infl2^DB+…^DB+Infln

¨ evinizdekilerden (from the ones at your house)

296

148
8/28/17

Word Structure
n A word can be seen as a sequence of inflectional
groups (IGs) of the form
Lemma+Infl1^DB+Infl2^DB+…^DB+Infln

¨ evinizdekilerden (from the ones at your house)


¨ ev+iniz+de+ki+ler+den

297

Word Structure
n A word can be seen as a sequence of inflectional
groups (IGs) of the form
Lemma+Infl1^DB+Infl2^DB+…^DB+Infln

¨ evinizdekilerden (from the ones at your house)


¨ ev+iniz+de+ki+ler+den
¨ ev+HnHz+DA+ki+lAr+DAn
A = {a,e}, H={ı, i, u, ü}, D= {d,t}

298

149
8/28/17

Word Structure
n A word can be seen as a sequence of
inflectional groups (IGs) of the form
Lemma+Infl1^DB+Infl2^DB+…^DB+Infln
¨ evinizdekilerden (from the ones at your house)
¨ ev+iniz+de+ki+ler+den
¨ ev+HnHz+DA+ki+lAr+DAn
A = {a,e}, H={ı, i, u, ü}, D= {d,t}

¨ cf. odanızdakilerden
oda+[ı]nız+da+ki+ler+den
oda+[H]nHz+DA+ki+lAr+DAn

299

Word Structure
n A word can be seen as a sequence of inflectional
groups (IGs) of the form
Lemma+Infl1^DB+Infl2^DB+…^DB+Infln

¨ evinizdekilerden (from the ones at your house)


¨ ev+iniz+de+ki+ler+den
¨ ev+HnHz+DA+ki+lAr+DAn
¨ ev+Noun+A3sg+P2pl+Loc ^DB+Adj
^DB+Noun+A3pl+Pnon+Abl

300

150
8/28/17

Word Structure
n sağlamlaştırdığımızdaki ( (existing) at the time we caused
(something) to become strong. )

n Morphemes
¨ sağlam+lAş+DHr+DHk+HmHz+DA+ki
n Features
¨ sağlam(strong)
n +Adj
n ^DB+Verb+Become (+lAş)
n ^DB+Verb+Caus+Pos (+DHr)
n ^DB+Noun+PastPart+P1pl+Loc

(+DHk,+HmHz,+DA)
301
n ^DB+Adj (+ki)

Lexicon – Major POS Categories


n +Noun (*) ( Common n +Adverbs
(Temporal, Spatial),
Proper, Derived) n +Postpositions (+)
n +Pronoun (*) (Personal,
(subcat)
Demons, Ques, n +Conjunctions
Quant,Ques, Derived) n +Determiners/Quant
n +Verb (*) (Lexical,
derived) n +Interjections
n +Adjectives(+) (Lexical, n +Question (*)
Derived) n +Punctuation
n +Number (+) (Cardinal, n +Dup (onomatopoeia)
Ordinal, Distributive, Perc,
Ratio, Real, Range)

302

151
8/28/17

Morphological Features
n Nominals
¨ Nouns
¨ Pronouns
¨ Participles
¨ Infinitives
inflect for
¨ Number, Person (2/6)
¨ Possessor (None, 1sg-3pl)
¨ Case
n Nom,Loc,Acc,Abl,Dat,Ins,Gen
303

Morphological Features
n Nominals
¨ Productive Derivations into
n Nouns (Diminutive)
¨ kitap(book), kitapçık (little book)
n Adjectives (With, Without….)
¨ renk (color), renkli (with color), renksiz (without color)

n Verbs (Become, Acquire)


¨ taş (stone), taşlaş (petrify)
¨ araba (car) arabalan (to acquire a car)

304

152
8/28/17

Morphological Features
n Verbs have markers for
¨ Voice:
n Reflexive/Reciprocal,Causative (0 or more),Passive
¨ Polarity (Neg)
¨ Tense-Aspect-Mood (2)
n Past, Narr,Future, Aorist,Pres
n Progressive (action/state)

n Conditional, Necessitative, Imperative, Desiderative,


Optative.
¨ Number/Person

305

Morphological Features
n öl-dür-ül-ecek-ti
(it) was going to be killed (caused to die)
¨ öl - die
¨ -dür: causative
¨ -ül: passive
¨ -ecek: future
¨ -ti: past
¨ -0: 3rd Sg person

306

153
8/28/17

Morphological Features
n Verbs also have markers for
¨ Modality:
n able to verb (can/may)
n verb repeatedly

n verb hastily

n have been verb-ing ever since

n almost verb-ed but didn't

n froze while verb-ing

n got on with verb-ing immediately

307

Morphological Features
n Productive derivations from Verb
¨ (e.g:Verb Þ Temp/Manner Adverb)
n after having verb-ed,

n since having verb-ed,

n when (s/he) verbs

n by verbing

n while (s/he is ) verbing

n as if (s/he is) verbing

n without having verb-ed

¨ (e.g. Verb Þ Nominals)


n Past/Pres./Fut. Participles
n 3 forms of infinitives 308

154
8/28/17

Morphographemic Processes
n Vowel Harmony
¨ Vowels in suffixes agree in certain
phonological features with the preceding
vowels.
High Vowels H = {ı, i, u, ü}
Low Vowels = {a, e, o, ö}
Front Vowels = {e, i, ö, ü}
Back Vowels = {a, ı, o, u}
Round Vowels = {o, ö, u, ü}
Nonround Vowels = {a, e, ı, i}
Nonround Low A = {a, e}

Morphemes use A and H as underspecified meta-symbols on the lexical side:
+lAr : Plural Marker
+nHn : Genitive Case Marker

309

Vowel Harmony
n Some data
masa+lAr okul+lAr ev+lAr gül+lAr
masa0lar okul0lar ev0ler gül0ler
¨ Ifthe last surface vowel is a back vowel, A is paired
with a on the surface, otherwise A is paired with e.
(A:a and A:e are feasible pairs)

310

155
8/28/17

Vowel Harmony
n Some data
masa+lAr okul+lAr ev+lAr gül+lAr+yA
masa0lar okul0lar ev0ler gül0ler+0e
¨ If the last surface vowel is a back vowel. A is paired
with a on the surface, otherwise A is paired with e.
(A:a and A:e are feasible pairs)
n Note that this is chain effect

A:a <=> @:Vback [@:Cons]* +:0 [@:Cons | Cons:@ | :0]* _ ;
A:e <=> @:Vfront [@:Cons]* +:0 [@:Cons | Cons:@ | :0]* _ ;
311

Vowel Harmony
n Some data
masa+nHn okul+nHn ev+nHn gül+Hn+DA
masa0nın okul00un ev00in gül0ün+de

¨ H is paired with ı if the previous surface vowel is a or ı


¨ H is paired with i if the previous surface vowel is e or i
¨ H is paired with u if the previous surface vowel is o or u
¨ H is paired with ü if the previous surface vowel is ö or ü

H:ü <=> [ @:ü | ö ] [@:Cons]* +:0 [@:Cons | Cons:@ | :0]* _ ;

312

156
8/28/17

Vowel Harmony
n Some data
masa+nHn okul+nHn ev+nHn gül+nHn
masa0nın okul00un ev00in gül+0ün

H:ü <=> [ @:ü | ö ] [@:Cons]* +:0 [@:Cons| Cons:@ | :0]* _ ;


¨ @:ü stands for both H:ü and ü:ü pairs

¨ ö stands for ö:ö


¨ @:Cons stands for any feasible pair where the surface symbol is a
consonant (e.g., k:ğ, k:k, 0:k)
¨ Cons:@ stands for any feasible pair where the lexical symbol is a
consonant (e.g., n:0, n:n, D:t)

313
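A toy Python generator that resolves the A and H meta-phonemes by vowel harmony when gluing suffixes onto a stem (purely illustrative: it ignores consonant alternations and vowel/consonant drops, so it only handles examples like the ones above):

  BACK, FRONT, ROUND = set("aıou"), set("eiöü"), set("oöuü")

  def last_vowel(s):
      for ch in reversed(s):
          if ch in BACK or ch in FRONT:
              return ch
      return None

  def realize(stem, *suffixes):
      out = stem
      for suffix in suffixes:
          v = last_vowel(out)
          for ch in suffix.lstrip("+"):
              if ch == "A":                          # A = {a, e}
                  ch = "a" if v in BACK else "e"
              elif ch == "H":                        # H = {ı, i, u, ü}
                  if v in BACK:
                      ch = "u" if v in ROUND else "ı"
                  else:
                      ch = "ü" if v in ROUND else "i"
              out += ch
              if ch in BACK or ch in FRONT:
                  v = ch                             # harmony chains off the newest vowel
      return out

  print(realize("masa", "+lAr"))          # masalar
  print(realize("gül",  "+lAr"))          # güller
  print(realize("masa", "+nHn"))          # masanın
  print(realize("masa", "+lAr", "+Hm"))   # masalarım  (the chain effect)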

Other interesting phenomena


n Vowel ellipsis
masa+Hm av$uc+yH+nA but kapa+Hyor
masa00m av00c00u0na kap00ıyor
masam avcuna kapıyor

n Consonant Devoicing
kitab+DA tad+DHk tad+sH+nA kitab
kitap0ta tat0tık tad00ı0na kitap
kitapta tattık tadına

n Gemination
tıb0+yH üs0+sH şık0+yH
tıbb00ı üss00ü şıkk+0ı
tıbbı üssü şıkkı
314

157
8/28/17

Other interesting phenomena


n Consonant ellipsis
ev+nHn kalem+sH ev+yH
ev00in kalem00i ev00i

n Other Consonant Changes


ayak+nHn renk+yH radyolog+sH
ayağ00ın reng00i radyoloğ00u

k:ğ <=> @:Vowel _ +:0 ([s:0|n:0|y:0]) @:Vowel

315

Reality Check-1
n Real text contains phenomena that causes
nasty problems:
¨ Words of foreign origin - 1
alkol+sH kemal0+yA
alkol00ü kemal'00e
Use different lexical vowel symbols for these
¨ Words of foreign origin -2
Carter'a serverlar Bordeaux'yu
n This needs to be handled by a separate analyzer
using phonological encodings of foreign words, or
n Using Lenient morphology
316

158
8/28/17

Reality Check-2
n Real text contains phenomena that cause
nasty problems:
¨ Numbers, Numbers:
2'ye, 3'e, 4'ten, %6'dan, 20inci,100üncü
16:15 vs 3:4, 2/3'ü, 2/3'si

Affixation proceeds according to how the number is


pronounced, not how it is written!

The number lexicon has to be coded so that a


representation of the last bits of its pronunciation is
available at the lexical level. 317

Reality Check-3
n Real text contains phenomena that causes
nasty problems:
¨ Acronyms
PTTye -- No written vowel to harmonize to!

¨ Stash an invisible symbol (E:0) into the lexical


representation so that you can force harmonization
pttE+yA
ptt00ye

318

159
8/28/17

Reality Check-4
n Interjections
¨ Aha!, Ahaaaaaaa!, Oh, Oooooooooh
¨ So the lexicon may have to encode lexical
representations as regular expresions
n ah[a]+, [o]+h
n Emphasis
¨ çok, çooooooook

319

Reality Check-5
n Lexicons have to be kept in check to
prevent overgeneration*
¨ Allomorph Selection
n Which causative morpheme you use depends on
the (phonological structure of the) verb, or the
previous causative morpheme
¨ ye+DHr+t oku+t+DHr
n Which case morpheme you use depends on the
previous morpheme.
oda+sH+nA oda+yA
oda0sı0na oda0ya
to his room to
(the) house

*Potentially illegitimate word structures. 320

160
8/28/17

Reality Check-6
n Lexicons have to be kept in check to
prevent overgeneration
n The suffix +ki can only attach after +Loc case-marked nouns, or
n after singular nouns in +Nom case denoting temporal entities (such as day, minute, etc.)

321

Taming Overgeneration
n All these can be specified as finite state
transducers.

Constraint Transducer 2

Constraint Transducer 1
Constrained
Lexicon Transducer

Lexicon Transducer

322

161
8/28/17

Turkish Analyzer Architecture


Tif-ef Transducer to generate
symbolic output

Transducers for morphotactic


TC constraints (300)

Root and morpheme


Tlx-if lexicon transducer

Tis-lx = intersection of rule transducers

Morphographemic
TR1 TR2 TR3 TR4 ... TRn transducer

Tes-is Transducer to normalize case.

323

Turkish Analyzer Architecture


Tif-ef

TC

Tlx-if

Tis-lx = intersection of rule transducers

TR1 TR2 TR3 TR4 ... TRn

Tes-is
kütüğünden, Kütüğünden, KÜTÜĞÜNDEN
324

162
8/28/17

Turkish Analyzer Architecture


Tif-ef

TC

Tlx-if

Tis-lx = intersection of rule transducers

TR1 TR2 TR3 TR4 ... TRn

kütüğünden
Tes-is
kütüğünden, Kütüğünden, KÜTÜĞÜNDEN
325

Turkish Analyzer Architecture


Tif-ef

TC

Tlx-if

kütük+sH+ndAn
Tis-lx = intersection of rule transducers

TR1 TR2 TR3 TR4 ... TRn

kütüğünden
Tes-is
kütüğünden, Kütüğünden, KÜTÜĞÜNDEN
326

163
8/28/17

Turkish Analyzer Architecture


Tif-ef
kütük+Noun+A3sg+P3sg+nAbl

TC

Tlx-if

kütük+sH+ndAn
Tis-lx = intersection of rule transducers

TR1 TR2 TR3 TR4 ... TRn

kütüğünden
Tes-is
kütüğünden, Kütüğünden, KÜTÜĞÜNDEN
327

Turkish Analyzer Architecture


kütük+Noun+A3sg+Pnon+Abl
Tif-ef
kütük+Noun+A3sg+P3sg+nAbl

TC

Tlx-if

kütük+sH+ndAn
Tis-lx = intersection of rule transducers

TR1 TR2 TR3 TR4 ... TRn

kütüğünden
Tes-is
kütüğünden, Kütüğünden, KÜTÜĞÜNDEN
328

164
8/28/17

Turkish Analyzer Architecture


kütük+Noun+A3sg+Pnon+Abl

Turkish Analyzer
(After all transducers
are intersected and composed)
(~1M States, 1.6M Transitions)

kütüğünden, Kütüğünden, KÜTÜĞÜNDEN


329

Turkish Analyzer Architecture

Turkish Analyzer
(After all transducers are intersected and composed)
(~1M States, 1.6M Transitions)

Root lexicons: 22K Nouns, 4K Verbs, 2K Adjectives, 100K Proper Nouns

330


Finite State Transducers Employed


(e - v )ev+Noun+A3sg(i )+P3sg(n - "d e )+Loc
(e - v )ev+Noun+A3sg(i n )+P2sg(- "d e )+Loc

Morphology+Pronunciation: the resulting FST has 10M states and 16M transitions;
a total of 750 xfst regular expressions plus 100K root words (mostly proper names)
over about 50 files.

[Figure: a cascade maps the surface form evinde to the surface morpheme sequences
ev+i+nde / ev+in+de, the lexical morpheme sequences ev+sH+ndA / ev+Hn+DA, and the
feature forms ev+Noun+A3sg+P3sg+Loc / ev+Noun+A3sg+P2sg+Loc. The components include
a stress computation transducer, a syllabification transducer, an exceptional
phonology transducer, a SAMPA mapping transducer, a modified inverse two-level rule
transducer, a (duplicating) lexicon and morphotactic constraints transducer, a
two-level rule transducer, and feature/form filters.]
331

Pronunciation Generation
n gelebilecekken
¨ (gj e - l )gel+Verb+Pos(e - b i - l ) ^DB+Verb
+Able(e - "dZ e c ) +Fut(- c e n
)^DB+Adverb+While

n okuma
¨ (o - k )ok+Noun+A3sg(u - "m ) +P1sg(a )+Dat
¨ (o - "k u ) oku+Verb(- m a ) +Neg+Imp+A2sg
¨ (o - k u ) oku+Verb+Pos(- "m a )
^DB+Noun+Inf2+A3sg+Pnon+Nom

332


Are we there yet?


n How about all these foreign words?
¨ serverlar, clientlar, etc.

333

Are we there yet?


n How about all these foreign words?
¨ serverlar, clientlar, etc.
n A solution is to use Lenient Morphology
¨ Find the two-level constraints violated
¨ Allow violation of certain constraints

334


Are we there yet?


n How about unknown words?
¨ zaplıyordum (I was zapping)

335

Are we there yet?


n How about unknown words?
¨ zaplıyordum (I was zapping)
n Solution
¨ Delete all lexicons except noun and verb root
lexicons.
¨ Replace both lexicons by
n [Alphabet Alphabet*]
¨ Thus the analyzer will parse any prefix string as a root, provided it can parse
the rest as a sequence of Turkish suffixes.
336


Are we there yet?


n How about unknown words?
¨ zaplıyordum (I was zapping)

n Solution
¨ zaplıyordum
n zapla+Hyor+DHm (zapla+Verb+Pos+Pres1+A1sg)
n zapl +Hyor+DHm (zapl+Verb+Pos+Pres1+A1sg)

337

Systems Available
n Xerox Finite State Suite (lexc, twolc,xfst)
¨ Commercial (Education/research license available)
¨ Lexicon and rule compilers available
¨ Full power of finite state calculus (beyond two-level
morphology)
¨ Very fast (thousands of words/sec)

n Beesley and Karttunen, Finite State Morphology,


CSLI Publications, 2004 (www.fsmbook.com)
comes with (a version of) this software
¨ New versions of the software are now available on the web via a click-through
license (with bonus C++ and Python APIs)
338


Systems Available
n Schmid’s SFST-- the Stuttgart Finite State
Transducer Tool
¨ SFST is a toolbox for the implementation of
morphological analysers and other tools which
are based on finite state transducer
technology.
¨ Available at http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html

339

Systems Available
n AT&T FSM Toolkit
¨ Tools to manipulate (weighted) finite state
transducers
n Now an open source version available as OpenFST
n Carmel Toolkit
¨ http://www.isi.edu/licensed-sw/carmel/
n FSA Toolkit
¨ http://www-i6.informatik.rwth-aachen.de/~kanthak/fsa.html

340


OVERVIEW
n Overview of Morphology
n Computational Morphology
n Overview of Finite State Machines
n Finite State Morphology
¨ Two-level
Morphology
¨ Cascade Rules

341

Morphological Analyzer Structure


happy+Adj+Sup

Lexicon Transducer

happy+est

Morphographemic ????????
Transducer

happiest
342


The Parallel Constraints Approach


n Two-level Morphology
¨ Koskenniemi '83, Karttunen et al. 1987, Karttunen et al. 1992

[Figure: the lexical form and the surface form are related by a set of parallel
two-level rules (fst 1, fst 2, ..., fst n); the rules describe constraints and are
compiled into finite-state automata interpreted as transducers.]

Slide courtesy of Lauri Karttunen 343

Sequential Transduction Approach


n Cascaded ordered rewrite rules
¨ Johnson '71, Kaplan & Kay '81, based on Chomsky & Halle '64

[Figure: the lexical form is mapped to the surface form through an ordered cascade
of rewrite rules (fst 1, fst 2, ..., fst n) compiled into finite-state transducers,
passing through intermediate forms.]

Slide courtesy of Lauri Karttunen
344


Sequential vs. Parallel


n At the end, both approaches are equivalent.

[Figure: on the left, an ordered cascade fst 1, fst 2, ..., fst n maps the lexical
form through intermediate forms to the surface form; on the right, parallel
transducers fst 1, fst 2, ..., fst n relate the lexical form directly to the surface
form. Either way, the result is a single FST.]

Slide courtesy of Lauri Karttunen 345

Sequential vs. Parallel


n Two different ways of decomposing the
complex relation between lexical and
surface forms into a set of simpler relations
that can be more easily understood and
manipulated.
n Sequential model is oriented towards
string-to-string relations, the parallel model
is about symbol-to-symbol relations.
n In some cases it may be advantageous to
combine the two approaches.
Slide courtesy of Lauri Karttunen 346


Cascaded Ordered Rewrite Rules


n In this architecture, one transforms the lexical form to the surface form
(or vice versa) through a series of transformations.
n In a sense, it is a procedural model.
¨ What do I do to a lexical string to convert it to a surface string?

[Figure: lexical form → fst 1 → intermediate form → fst 2 → ... → fst n → surface form]
347

The kaNpat Example (again)


n A lexical N before a lexical p is realized on
the surface as an m.
n A p after an m is realized as m on the
surface.

348


The kaNpat Example


n A lexical N before a lexical p is realized on
the surface as an m.
n A p after an m is realized as m on the
surface.
...N p...

...m p...

349

The kaNpat Example


n A lexical N before a lexical p is realized on
the surface as an m.
n A p after an m is realized as m on the
surface.
...N p...

...m p...

...m m...

350


The kaNpat Example


n A lexical N before a lexical p is realized on
the surface as an m.
n A p after an m is realized as m on the
surface.
kaNpat Lexical So we obtain the surface form after a
sequence of transformations

kampat Intermediate

kammat Surface

351

The kaNpat Example


n A lexical N before a lexical p is realized on
the surface as an m.
n A p after an m is realized as m on the
surface.
kaNpat Lexical There is a minor problem with this!

What happens if N is NOT followed by a p?


kampat Intermediate
It has to be realized as an n.

kammat Surface

352


The kaNpat Example


n A lexical N before a lexical p is realized on
the surface as an m.
n A p after an m is realized as m on the
surface.
kaNmat Lexical There is a minor problem with this!

What happens if N is NOT followed by a p?


kanmat Intermediate
It has to be realized as an n.

kanmat Surface

353

The kaNpat Example


n A lexical N before a lexical p is realized on
the surface as an m.
¨ Otherwise it has to be replaced by an n.
n A p after an m is realized as m on the
surface.

354


The kaNpat Example


kaNpat

N->m
transformation
kampat

N->n
transformation
kampat

p->m
transformation

kammat

355

The kaNpat Example


kaNpat kaNtat

N->m
transformation
kampat kaNtat

N->n
transformation
kampat kantat

p->m
transformation

kammat kantat

356


The kaNpat Example


kaNpat kaNtat kammat

N->m
transformation
kampat kaNtat kammat

N->n
transformation
kampat kantat kammat

p->m
transformation

kammat kantat kammat

357



The kaNpat Example


kaNpat kampat kammat

N->m
transformation
kampat kampat kammat

N->n
transformation
kampat kampat kammat

p->m
transformation

kammat kammat kammat

359

The kaNpat Example


{kaNpat, kammat, kampat}

N->m
transformation

N->n
transformation

p->m
transformation

kammat

360


Finite State Transducers


[Figure: the three rewrite steps as finite-state transducers — an N->m transducer,
an N->n transducer, and a p->m transducer — drawn as state diagrams with transitions
such as N:m, N:n, p:m, m:m, p:p, and ? (any other symbol pair).]

Courtesy of Ken Beesley 361

Finite State Transducers

N->m Transducer
   (composition)
N->n Transducer
   (composition)
p->m Transducer
   =
The Morphographemic Transducer

Courtesy of Lauri Karttunen 362


Finite State Transducers


n Again, describing transformations with
transducers is too low level and tedious.

n One needs to use a higher level formalism

n Xerox Finite State Tool Regular Expression language
http://www.stanford.edu/~laurik/fsmbook/online-documentation.html
363

Rewrite Rules
n Originally rewrite rules were proposed to
describe phonological changes
n u -> l / LC _ RC
¨ Change u to l if it is preceded by LC and
followed by RC.

364


Rewrite Rules
n These rules of the sort u -> l / LC _ RC
look like context sensitive grammar rules,
so can in general describe much more
complex languages.

n Johnson (1972) showed that such rewrite


rules are only finite-state in power, if some
constraints on how they apply to strings are
imposed.
365

Replace Rules
n Replace rules define regular relations
between two regular languages

A -> B LC _ RC
Replacement Context
The relation that replaces A by B between L and R leaving
everything else unchanged.
n In general A, B, LC and RC are regular
expressions.
366


Replace Rules
n Let us look at the simplest replace rule
¨a -> b
n The relation defined by this rule contains among
others
¨ {..<abb,
bbb>,<baaa, bbbb>, <cbc, cbc>, <caad,
cbbd>, …}
n A string in the upper language is related to a
string in the lower language which is exactly the
same, except all the a’s are replaced by b’s.
¨ The related strings are identical if the upper string does
not contain any a’s
367

Replace Rules
n Let us look at the simplest replace rule with
a context
¨ a->b || d _ e
n a’s are replaced by b, if they occur after a d
and before an e.
¨ <cdaed, cdbed> are related
n a appears in the appropriate context in the upper
string
¨ <caabd, cbbbd> are NOT related,
n a’s do not appear in the appropriate context.
¨ <caabd, caabd> are related (Why?)
368


Replace Rules
n Although replace rules define regular
relations, sometimes it may better to look at
them in a procedural way.
¨a -> b || d _ e
n What string do I get when I apply this rule
to the upper string bdaeccdaeb?
• bdaeccdaeb
• bdbeccdbeb

369

Replace Rules – More examples


n a -> 0 || b _ b;
¨ Replace an a between two b's with an epsilon (delete the a)
¨ abbabab ⇒ abbbb, so <abbabab, abbbb>
¨ ababbb ⇒ abbbb, so <ababbb, abbbb>
¨ In fact, all of abababab, abababb, ababbab, ababbb, abbabab, abbabb, abbbab,
and abbbb map to abbbb! 370


Replace Rules – More examples


n [..] -> a || b _ b
¨ Replace a single epsilon between two b’s with
an a (insert an a)
¨ abbbbe ⇒ ababababe, so <abbbbe, ababababe>
n 0 -> a || b _ b is a bit tricky.

371

Rules and Contexts


A -> B LC _ RC
Replacement Context

n Contexts around the replacement can be


specified in 4 ways.
Upper String A

Lower String B

372


Rules and Contexts

A -> B || LC _ RC

n Both LC and RC are checked on the upper


string.
Upper String LC A RC

Lower String B

373

Rules and Contexts

A -> B // LC _ RC

n LC is checked on the lower string, RC is


checked on the upper string
Upper String A RC

Lower String LC B

374


Rules and Contexts

A -> B \\ LC _ RC

n LC is checked on the upper string, RC is


checked on the lower string
Upper String LC A

Lower String B RC

375

Rules and Contexts

A -> B \/ LC _ RC

n LC is checked on the lower string, RC is


checked on the lower string
Upper String A

Lower String LC B RC

376


Replace Rules – More Variations


n Multiple parallel replacements
¨ A->B, C->D
n A is replaced by B and C is replaced by D

n Multiple parallel replacements in the same


context
¨ A->B, C->D || LC _ RC
n A is replaced by B and C is replaced by D in the
same context

377

Replace Rules – More Variations


n Multiple parallel replacements in multiple
contexts
¨ A->B, C->D || LC1 _ RC1, LC2 _ RC2, …
n A is replaced by B and C is replaced by D in any
one of the listed contexts
n Independent multiple replacements in
different contexts
¨ A->B || LC1 _ RC1 ,, C->D || LC2 _ RC2

378


Replace Rules for KaNpat


n N->m || _ p;
¨ replace N with m just before an upper p;
n N-> n;
¨ Replace N with n otherwise
n p -> m || m _;
¨ Replace p with an m just after an m (in the upper string)

379

Replace Rules for KaNpat


kaNpat kaNtat kammat

n N->m || _ p;
kampat kaNtat kammat

n N-> n;
kampat kantat kammat

n p -> m || m _

kammat kantat kammat

380


Replace Rules for KaNpat (combined)


N->m || _ p;     (FST1)
  .o.
N-> n;           (FST2)
  .o.
p -> m || m _    (FST3)
  = the kaNpat transducer

n Defines a single transducer that "composes" the three transducers.
381
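To make the rule ordering concrete, here is a minimal simulation of the three kaNpat rules as an ordered cascade, using Python regular expressions in place of compiled transducers. The function name and test forms are illustrative assumptions; a real system would compile and compose the rules into a single FST with a toolkit such as xfst.

import re

def kanpat_cascade(lexical):
    # Rule 1: N -> m just before a p (checked on the input of this step)
    s = re.sub(r"N(?=p)", "m", lexical)
    # Rule 2: N -> n otherwise
    s = s.replace("N", "n")
    # Rule 3: p -> m just after an m
    s = re.sub(r"(?<=m)p", "m", s)
    return s

for form in ["kaNpat", "kaNtat", "kammat"]:
    print(form, "->", kanpat_cascade(form))
# kaNpat -> kammat, kaNtat -> kantat, kammat -> kammat

Applying the rules in this order matters: if the N->n rule ran before the N->m rule, kaNpat would incorrectly come out as kanpat.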

A Cascade Rule Sequence for Turkish


n Remember the conventions
¨A = {a, e}, H= {ı, i, u, ü}
¨ VBack = [a | ı | o | u ] //the back vowels
¨ VFront = [e | i | ö | ü ] //the front vowels
¨ Vowel = VBack | VFront
¨ Consonant = [ …all consonants ..]

382


Stem Final Vowel Deletion


n A vowel ending a stem is deleted if it is
followed by the morpheme +Hyor
¨ ata+Hyor (is assigning)
¨ at+Hyor

n Vowel ->0 || _ “+” H y o r

383

Morpheme Initial Vowel Deletion


n A high vowel starting a morpheme is
deleted if it is attached to a segment ending
in a vowel.
¨ masa+Hm (my table)
¨ masa+m

n H ->0 || Vowel “+” _ ;


n Note that the previous rule handles a more specific case of this one.
384


Vowel Harmony
n A bit tricky
¨ A->a or A->e
¨ H->ı, H->i, H->u, H->ü
n These two (groups of) rules are
interdependent

n They have to apply concurrently and each


is dependent on the outputs of the others

385

Vowel Harmony
n So we need
¨ Parallel
Rules
¨ Each checking its left context on the output (lower-
side)
n A->a // VBack Cons* “+” Cons* _ ,,
n A->e // VFront Cons* “+” Cons* _ ,,
n H->u // [o | u] Cons* “+” Cons* _ ,,
n H->ü // [ö | ü] Cons* “+” Cons* _ ,,
n H->ı // [a | ı] Cons* “+” Cons* _ ,,
n H->i // [e | i] Cons* “+” Cons* _
386


Consonant Resolution
n d is realized as t either at the end of a word
or after certain consonants
n b is realized as p either at the end of a
word or after certain consonants
n c is realized as ç either at the end of a word
or after certain consonants
n d-> t, b->p, c->ç // [h | ç | ş | k | p | t | f | s ] “+” _

387

Consonant Deletion
n Morpheme initial s, n, y is deleted if it is
preceded by a consonant

n [s|n|y] -> 0 || Consonant “+” _ ;

388


Cascade
Stem Final Vowel Deletion
Morpheme Initial Vowel Deletion
Vowel Harmony
Consonant Devoicing
Consonant Deletion
Boundary Deletion
  = (Partial) Morphographemic Transducer
389

Cascade
[Figure: the same cascade — Stem Final Vowel Deletion, Morpheme Initial Vowel
Deletion, Vowel Harmony, Consonant Devoicing, Consonant Deletion, Boundary
Deletion — composed with the Lexicon Transducer and the (partial) TMA, giving
the (partial) Morphographemic Transducer.]
390


Some Observations
n We have not really seen all the nitty gritty
details of both approaches but rather the
fundamental ideas behind them.
¨ Rule conflicts in Two-level morphology
n Sometimes the rule compiler detects a conflict:
¨ Two rules sanction conflicting feasible pairs in a context
n Sometimes the compiler can resolve the conflict but
sometimes the developer has to fine tune the
contexts.

391

Some Observations
n We have not really seen all the nitty gritty
details of both approaches but rather the
fundamental ideas behind them.
¨ Unintended rule interactions in rule cascades
n When one has 10’s of replace rule one feeding into
the other, unintended/unexpected interactions are
hard to avoid
n Compilers can’t do much

n Careful incremental development with extensive


testing

392


Some Observations
n For a real morphological analyzer, my
experience is that developing an accurate
model of the lexicon is as hard as (if not
harder than) developing the
morphographemic rules.
¨ Taming overgeneration
¨ Enforcing “semantic” constraints
¨ Enforcing long-distance co-occurrence constraints
n This suffix can not occur with that prefix, etc.
¨ Special cases, irregular cases
393

11-411
Natural Language Processing
Language Modelling and Smoothing

Kemal Oflazer

Carnegie Mellon University in Qatar

1/46
What is a Language Model?

I A model that estimates how likely it is that a sequence of words belongs to a (natural)
language
I Intuition
I p(A tired athlete sleeps comfortably) ≫ p(Colorless green ideas sleep furiously)
I p(Colorless green ideas sleep furiously) ≫ p(Salad word sentence is this)

2/46
Let’s Check How Good Your Language Model is?

I Guess the most likely next word


I The prime of his life . . . {is, was, . . . }
I The prime minister gave an . . . {ultimatum, address, expensive, . . . }
I The prime minister gave a . . . {speech, book, cheap, . . . }
I The prime number after eleven . . . {is, does, could, has, had . . . }
I The prime rib was . . . {delicious, expensive, flavorful, . . . ,} but NOT green

3/46
Where do we use a language model?
I Language models are typically used as components of larger systems.
I We’ll study how they are used later, but here’s some further motivation.
I Speech transcription:
I I want to learn how to wreck a nice beach.
I I want to learn how to recognize speech.
I Handwriting recognition:
I I have a gub!
I I have a gun!
I Spelling correction:
I We’re leaving in five minuets.
I We’re leaving in five minutes.
I Ranking machine translation system outputs

4/46
Very Quick Review of Probability

I Event space (e.g., X , Y), usually discrete for the purposes of this class.
I Random variables (e.g., X , Y )
I We say “Random variable X takes value x ∈ X with probability p(X = x)”
I We usually write p(X = x) as p(x).
I Joint probability: p(X = x, Y = y)
I Conditional probability: p(X = x | Y = y) = p(X = x, Y = y) / p(Y = y)
I This always holds:
      p(X = x, Y = y) = p(X = x | Y = y) · p(Y = y) = p(Y = y | X = x) · p(X = x)
I This sometimes holds: p(X = x, Y = y) = p(X = x) · p(Y = y)
I True and estimated probability distributions.

5/46
Language Models: Definitions
I V is a finite set of discrete symbols (characters, words, emoji symbols, . . . ), V = |V|.
I V + is the infinite set of finite-length sequences of symbols from V whose final symbol is 2.
I p : V + → R such that
I For all x ∈ V + , p(x) ≥ 0
I p is a proper probability distribution: Σx∈V + p(x) = 1
I Language modeling: Estimate p from the training set examples

x1:n = hx1 , x2 , . . . , xn i

I Notation going forward:


I x a single symbol (word, character, etc.) from V
I x is a sequence of symbols in V + as defined above. xi is the ith symbol of x.
I x1:n denotes n sequences, hx1 , x2 , . . . , xn i.
I xi is the ith sequence in x1:n .
I [xi ]j is the jth symbol of the ith sequence in x1:n .
6/46
Issues

I Why would we want to do this?


I Are the nonnegativity and sum-to-one constraints really necessary?
I Is “finite V ” realistic?

7/46
Motivation – Noisy Channel Models
I Noisy channel models are very suitable models for many NLP problems:

Source (Generator) → Y → Channel (Blender) → X

I Y is the plaintext, the true message, the missing information, the output
I X is the ciphertext, the garbled message, the observable evidence, the input
I Decoding: select the best y given X = x.

      ŷ = arg maxy p(y | x)
        = arg maxy [ p(x | y) · p(y) ] / p(x)
        = arg maxy p(x | y) · p(y)
                   (Channel Model) (Source Model)

8/46
Noisy Channel Example – Speech Recognition

Source → Y (Seq. of words) → Vocal Tract → X (Acoustic Waves)

I “Mining a year of speech” →

I Source model characterizes p(y), “What are possible sequences of words I can say?”
I Channel model characterizes p(Acoustics | y)
I It is hard to recognize speech
I It is hard to wreck a nice beach
I It is hard wreck an ice beach
I It is hard wreck a nice peach
I It is hard wreck an ice peach
I It is heart to wreck an ice peach
I ···
9/46
Noisy Channel Example – Machine Translation

Source → Y (Seq. of Turkish words) → Translation → X (Seq. of English Words)

I p(x | y) models the translation process.


I Given an observed sequence of English words, presumably generated by translating
from Turkish, what is the most likely source Turkish sentence, that could have given
rise to this translation?

10/46
Machine Transliteration

I Phonetic translation across language pairs with very different alphabets and sound
system is called transliteration.
I Golfbag in English is to be transliterated to Japanese.
I Japanese has no distinct l and r sounds - these in English collapse to the same sound.
Same for English h and f.
I Japanese uses alternating vowel-consonant syllable structure: lfb is impossible to
pronounce without any vowels.
I Katakana writing is based on syllabaries: different symbols for ga, gi, gu, etc.
I So Golfbag is transliterated into Katakana and pronounced as go-ru-hu-ba-ggu.
I So when you see a transliterated word in Japanese text, how can you find out what
the English is?
I nyuuyooko taimuzu → New York Times
I aisukuriimu → ice-cream (and not “I scream”)
I ranpu → lamp or ramp
I masutaazutoonamento → Master’s Tournament

11/46
Noisy Channel Model – Other Applications

I Spelling Correction
I Grammar Correction
I Optical Character Recognition
I Sentence Segmentation
I Part-of-speech Tagging

12/46
Is finite V realistic?

I NO!
I We will never see all possible words in a language, no matter how large a sample we
look at.

13/46
The Language Modeling Problem

I Input: x1:n – the “training data”.


I Output: p : V + → R+

I p should be a “useful” measure of plausibility (not necessarily of grammaticality).

14/46
A Very Simple Language Model

I We are given x1:n as the training data


I Remember that each xi is a sequence of symbols, that is, a “sentence”
I So we have n sentences, and we count how many times the sentence x appears
I p(x) is estimated as
      p(x) = |{i : xi = x}| / n = cx1:n (x) / n
I So we only know about n sentences and nothing else!

I What happens when you want to assign a probability to some x that is not in the
training set?
I Is there a way out?

15/46
Chain Rule to the Rescue
I We break down p(x) mathematically

      p(X = x) = p(X1 = x1 )×
                 p(X2 = x2 | X1 = x1 )×
                 p(X3 = x3 | X1:2 = x1:2 )×
                 ...
                 p(Xℓ = 2 | X1:ℓ−1 = x1:ℓ−1 )

               = ∏j=1..ℓ p(Xj = xj | X1:j−1 = x1:j−1 )

I This is an exact formulation.


I Each word is conditioned on all the words coming before it!

16/46
Approximating the Chain Rule Expansion – The Unigram Model
      p(X = x) = ∏j=1..ℓ p(Xj = xj | X1:j−1 = x1:j−1 )
               ≈ ∏j=1..ℓ pθ (Xj = xj ) = ∏j=1..ℓ θxj ≈ ∏j=1..ℓ θ̂xj     (unigram assumption)

I θ̂'s are maximum likelihood estimates:

      ∀v ∈ V, θ̂v = |{(i, j) : [xi ]j = v}| / N = cx1:n (v) / N

I N = Σi=1..n |xi |
I This is also known as “relative frequency estimation”.
I The unigram model is also known as the “bag of words” model. Why?
17/46
Unigram Models – The Good and the Bad

Pros:
I Easy to understand
I Cheap
I Not many parameters
I Easy to compute
I Good enough for maybe information retrieval

Cons:
I "Bag of Words" assumption is not linguistically accurate.
I p(the the the the) ≫ p(I want to run)
I "Out of vocabulary" problem.
I What happens if you encounter a word you never saw before?

I Generative Process: keep on randomly picking words until you pick 2.


I We really never use unigram models!

18/46
Approximating the Chain Rule Expansion – Markov Models

Markov Models ≡ n-gram Models

      p(X = x) = ∏j=1..ℓ p(Xj = xj | X1:j−1 = x1:j−1 )
               ≈ ∏j=1..ℓ pθ (Xj = xj | Xj−n+1:j−1 = xj−n+1:j−1 )
                                       (last n − 1 words)

I n-gram models ≡ (n − 1)th -order Markov assumption.


I The unigram model is the special case when n = 1
I Trigram models (n = 3) are widely used.
I 5-gram models (n = 5) are quite common in statistical machine translation.

19/46
Estimating n-gram Models
              unigram           bigram             trigram                general n-gram

pθ (x) =      ∏j θxj            ∏j θxj |xj−1        ∏j θxj |xj−2 xj−1       ∏j θxj |xj−n+1:j−1

Parameters:   θv                θv|v′               θv|v″v′                 θv|h
              ∀v ∈ V            ∀v′ ∈ V,            ∀v′, v″ ∈ V,            ∀h ∈ V n−1 ,
                                ∀v ∈ V ∪ {2}        ∀v ∈ V ∪ {2}            ∀v ∈ V ∪ {2}

MLE:          c(v)/N            c(v′v)/c(v′)        c(v″v′v)/c(v″v′)        c(hv)/c(h)

20/46
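As a concrete illustration of the MLE row above, here is a small Python sketch that counts unigrams and bigrams from a toy corpus and computes relative-frequency estimates. The corpus and the <s>/</s> boundary symbols are assumptions made just for the example.

from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]   # toy training data
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]                # assumed boundary symbols
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    # theta_{v | v'} = c(v' v) / c(v')
    return bigrams[(prev, word)] / unigrams[prev]

print(p_mle("cat", "the"))   # 2/2 = 1.0
print(p_mle("sat", "cat"))   # 1/2 = 0.5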
The Problem with MLE

I The curse of dimensionality: the number of parameters grows exponentially in n.


I Data sparseness: most n-grams will never be observed – even when they are
linguistically plausible.
I What is the probability of unseen words? (0 ?)
I But that’s not what you want. Test set will usually include words not in training set.
I What is p(Nebuchadnezzur | son of) ?
I A single 0 probability will set the estimate to 0. Not acceptable!

21/46
Engineering Issues – Log Probabilities

I Note that computation of pθ (x) involves multiplication of numbers each of which are
between 0 and 1.
I So multiplication hits underflow: computationally the product can not be represented
or computed.
I In implementation, probabilities are represented by the logarithms (between −∞ and
0) and multiplication is replaced by addition.

22/46
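A tiny Python illustration of why log probabilities are used: multiplying many small probabilities underflows to 0.0, while summing their logarithms stays representable. The probability value 1e-5 is an arbitrary assumption.

import math

probs = [1e-5] * 100                       # 100 words, each with (assumed) probability 1e-5

product = 1.0
for p in probs:
    product *= p
print(product)                             # 0.0 -- the product underflows

log_prob = sum(math.log(p) for p in probs)
print(log_prob)                            # about -1151.3, perfectly representable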
Dealing with Out-of-Vocabulary Words

I Quick and dirty approach


I Decide what is in the vocabulary (e.g., all words with frequency > say 10).
I Add UNK to the vocabulary.
I Replace all unknown words with UNK
I Estimate as usual.
I Build a language model at the character level.
I What are advantages and disadvantages?

23/46
Smoothing Language Models

I We can not have 0 probability n-grams. So we should shave off some probability
mass from seen n-grams to give to unseen n-grams.
I The Robin Hood approach – take some probability from the haves and give it to the have-nots.
I Simplest method: Laplace Smoothing
I Interpolation
I Stupid backoff.
I Long-standing best method: modified Kneser-Ney smoothing

24/46
Laplace Smoothing

I We add 1 to all counts! So words with 0 counts will be assumed to have count 1.
I Unigram probabilities: p(v) = (c(v) + 1) / (N + V )
I Bigram probabilities: p(v | v′) = (c(v′v) + 1) / (c(v′) + V )
I One can also use Add-k smoothing (for some fractional k, 0 < k ≤ 1)
I It turns out this method is very simple but shaves off too much of the probability mass.
(See book for an example.)

25/46
Interpolation

I We estimate n-gram probabilities by combining count-based estimates from n- and


lower grams.

p̂(v | v00 v0 ) = λ1 p(v | v00 v0 ) + λ2 p(v | v0 ) + λ3 p(v)


I Σi λi = 1
I λ’s are estimated by maximizing the likelihood of a held-out data.

26/46
Stupid Backoff

I Gives up the idea of making the language model a true probability distribution.
I Works quite well with very large training data (e.g. web scale) and large language
models
I If a given n-gram has never been observed, just use the next lower gram’s estimate
scaled by a fixed weight λ (terminates when you reach the unigram)

27/46
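A hedged Python sketch of stupid backoff over plain n-gram counts. The recursive scoring function, the toy counts, and the backoff weight 0.4 (the value commonly cited for the method) are illustrative assumptions, and the scores are deliberately not normalized into a probability distribution.

def sb_score(word, context, counts, lam=0.4, n_tokens=1):
    # S(w | context) = c(context w)/c(context) if the n-gram was seen,
    # otherwise lam * S(w | shorter context); the base case is c(w)/N.
    if not context:
        return counts.get((word,), 0) / n_tokens
    ngram = tuple(context) + (word,)
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[tuple(context)]
    return lam * sb_score(word, context[1:], counts, lam, n_tokens)

# Toy counts over all n-gram orders (an assumption for illustration).
counts = {("the",): 2, ("cat",): 2, ("sat",): 1,
          ("the", "cat"): 2, ("cat", "sat"): 1}
print(sb_score("sat", ["the", "cat"], counts, n_tokens=5))   # 0.4 * 1/2 = 0.2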
Kneser-Ney Smoothing

I Kneser-Ney smoothing and its variants (interpolated Kneser-Ney or modified


Kneser-Ney) use absolute discounting.
I The math is a bit more involved. See the book if you are interested.

28/46
Toolkits

I These days people build language models using well-established toolkits:


I SRILM Toolkit (https://www.sri.com/engage/products-solutions/sri-language-modeling-toolkit)
I CMU Statistical Language Modeling Toolkit (http://www.speech.cs.cmu.edu/SLM_info.html)
I KenLM Language Model Toolkit (https://kheafield.com/code/kenlm/)
I Each toolkit provides executables and/or API and options to build, smooth, evaluate
and use language models. See their documentation.

29/46
n-gram Models– Assessment

Pros:
I Easy to understand
I Cheap (with modern hardware/memory)
I Good enough for machine translation, speech recognition, contextual spelling correction, etc.

Cons:
I Markov assumption is not linguistically accurate.
I but not as bad as unigram models
I "Out of vocabulary" problem.

30/46
Evaluation – Language Model Perplexity
I Consider language model that assigns probabilities to a sequence of digits (in speech
recognition)
I Each digit occurs with the same probability p = 0.1
I Perplexity for a sequence of N digits D = d1 d2 · · · dN is

      PP(D) =def p(d1 d2 · · · dN )^(−1/N)
             = ( 1 / p(d1 d2 · · · dN ) )^(1/N)
             = ( 1 / ∏i=1..N p(di ) )^(1/N)
             = ( 1 / (1/10)^N )^(1/N)
             = 10
I How can we interpret this number?
31/46
Evaluation – Language Model Perplexity
I Intuitively, language models should assign high probability to “real language” they
have not seen before.
I Let x1:m be a sequence of m sentences, that we have not seen before (held-out or
test set)
I Probability of x1:m = ∏i=1..m p(xi ) ⇒ Log probability of x1:m = Σi=1..m log2 p(xi )
I Average log probability per word of x1:m is:

      l = (1/M) Σi=1..m log2 p(xi )     where M = Σi=1..m |xi |
I Perplexity relative to x1:m =def 2^(−l)
I Intuitively, perplexity is average “confusion” after each word. Lower is better!
32/46
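A small Python sketch of the perplexity computation just defined, applied to the digit example: a stand-in model that assigns every word probability 0.1 yields perplexity 10. The helper names and the toy test set are assumptions.

import math

def perplexity(sentences, log2_prob):
    M = sum(len(s) for s in sentences)                # total number of words
    l = sum(log2_prob(s) for s in sentences) / M      # average log2 probability per word
    return 2 ** (-l)

# Stand-in model: every word has probability 0.1, as in the digit example.
digit_model = lambda s: len(s) * math.log2(0.1)
test = [["3", "1", "4"], ["1", "5", "9", "2"]]
print(perplexity(test, digit_model))                  # 10.0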
Understanding Perplexity

I 2^( −(1/M) Σi=1..m log2 p(xi ) ) is really a branching factor.
I Assign probability of 1 to the test data ⇒ perplexity = 1. No confusion.
I Assign probability of 1/V to each word ⇒ perplexity = V . Equal confusion after each
word!
I Assign probability of 0 to anything ⇒ perplexity = ∞
I We really should have for any x ∈ V + p(x) > 0

33/46
Entropy and Cross-entropy

I Suppose that there are eight horses running in an upcoming race.


I Your friend is on the moon.
I It’s really expensive to send a bit to the moon!
I You want to send him the results.
I
Clinton 000 Huckabee 100
Edwards 001 McCain 101
Kucinich 010 Paul 110
Obama 011 Romney 111
I Expected number of bits to convey a message is 3 bits.

34/46
Entropy and Cross-entropy

I Suppose the probabilities over the outcome of the race are not at all even.
I
Clinton 1/4 Huckabee 1/64
Edwards 1/16 McCain 1/8
Kucinich 1/64 Paul 1/64
Obama 1/2 Romney 1/64
I You can encode the winner using the following coding scheme

Clinton 10 Huckabee 111101


Edwards 1110 McCain 110
Kucinich 11110 Paul 111110
Obama 0 Romney 111111
I How did we get these codes?

35/46
Another View

36/46
Bits vs Probabilities

37/46
Entropy

I Entropy of a distribution:
      H (p) = − Σx∈X p(x) log p(x)
I Always ≥ 0 and maximal when p is uniform.
      H (puniform ) = − Σx∈X (1/|X |) log(1/|X |) = log |X |

38/46
Cross-entropy

I Cross-entropy uses one distribution to tell us something about another distribution.


      H (p; q) = − Σx∈X p(x) log q(x)
I The difference H (p; q) − H (p) tells us how many extra bits (on average) we waste by
using q instead of p.
I Extra bits make us sad; we can therefore think of this as a measure of regret.
I We want to choose q so that H (p; q) is small.
I Cross-entropy is an estimate of the average code-length (bits per message) when
using q as a proxy for p.

39/46
Cross-entropy and Betting

I Before the horse race, place your bet.


I Regret is how sad you feel after you find out who won.
I What’s your average score after you place your bets and test many times?
I Upper bound on regret: uniform betting
I Lower bound on regret: proportional betting on the true distribution for today’s race.
I The better our estimated distribution is, the closer we get to the lower bound (lower
regret)!

40/46
How does this Relate to Language Models?

I ptrain : training sample (which horses we have seen before)


I ptest : test sample (which horse will win today)
I q: our (estimated) model (or code)
I Real goal when training: make H (ptest ; q) small.
I We don’t know ptest ! The closest we have is ptrain .
I So make H (ptrain ; q)small.
I But that overfits and can lead to infinite regret.
I Smoothing hopefully makes q more like ptest .

41/46
What do n-gram Models Know?

I They (sort of) learn:


I Rare vs. common words and patterns
I Local syntax (an elephant, a rhinoceros)
I Words with related meanings (ate apples)
I Punctuation and spelling
I They have no idea about:
I Sentence structure
I Underlying rules of agreement/spelling/etc.
I Meaning
I The World

42/46
Unigram Model Generation

first, from less the This different 2004), out which goal 19.2 Model
their It ˜(i?1), given 0.62 these (x0; match 1 schedule. x 60
1998. under by Notice we of stated CFG 120 be 100 a location accuracy
If models note 21.8 each 0 WP that the that Novak. to function; to
[0, to different values, model 65 cases. said -24.94 sentences not
that 2 In to clustering each K&M 100 Boldface X))] applied; In 104
S. grammar was (Section contrastive thesis, the machines table -5.66
trials: An the textual (family applications.We have for models 40.1 no
156 expected are neighborhood

43/46
Bigram Model Generation

e. (A.33) (A.34) A.5 ModelS are also been completely surpassed in


performance on drafts of online algorithms can achieve far more so
while substantially improved using CE. 4.4.1 MLEasaCaseofCE 71 26.34
23.1 57.8 K&M 42.4 62.7 40.9 44 43 90.7 100.0 100.0 100.0 15.1 30.9
18.0 21.2 60.1 undirected evaluations directed DEL1 TRANS1
neighborhood. This continues, with supervised init., semisupervised
MLE with the METU-Sabanci Treebank 195 ADJA ADJD ADV APPR APPRART APPO
APZR ART CARD FM ITJ KOUI KOUS KON KOKOM NN NN NN IN JJ NN Their
problem is y x. The evaluation offers the hypothesized link grammar
with a Gaussian

44/46
Trigram Model Generation

top(xI ,right,B). (A.39) vine0(X, I) rconstit0(I 1, I). (A.40)


vine(n). (A.41) These equations were presented in both cases; these
scores u<AC>into a probability distribution is even smaller(r
=0.05). This is exactly fEM. During DA, is gradually relaxed. This
approach could be efficiently used in previous chapters) before
training (test) K&MZeroLocalrandom models Figure4.12: Directed
accuracy on all six languages. Importantly, these papers achieved
state-of-the-art results on their tasks and unlabeled data and the
verbs are allowed (for instance) to select the cardinality of discrete
structures, like matchings on weighted graphs (McDonald et al., 1993)
(35 tag types, 3.39 bits). The Bulgarian,

45/46
The Trade-off

I As we increase n, the stuff the model generates looks better and better, and the
model gives better probabilities to the training data.
I But as n gets big, we tend toward the history model, which has a lot of zero counts
and therefore isn’t helpful for data we haven’t seen before.
I Generalizing vs. Memorizing

46/46
11-411
Natural Language Processing
Classification

Kemal Oflazer

Carnegie Mellon University in Qatar

1/36
Text Classification

I We have a set of documents (news items, emails, product reviews, movie reviews,
books, . . . )
I Classify this set of documents into a small set classes.
I Applications:
I Topic of a news article (classic example: finance, politics, sports, . . . )
I Sentiment of a movie or product review (good, bad, neutral)
I Email into spam or not or into a category (business, personal, bills, . . . )
I Reading level (K-12) of an article or essay
I Author of a document (Shakespeare, James Joyce, . . . )
I Genre of a document (report, editorial, advertisement, blog, . . . )
I Language identification

2/36
Notation and Setting

I We have a set of n documents (texts) xi ∈ V +


I We assume the texts are segmented already.
I We have set L of labels, `i
I Human experts annotate documents with labels and give us
{(x1 , `1 ), (x2 , `2 ), · · · , (xn , `n )}
I We learn a classifier classify : V + → L with this labeled training data.
I Afterwards, we use classify to classify new documents into their classes.

3/36
Evaluation

I Accuracy:
      A(classify) = Σ{x∈V + , `∈L : classify(x)=`} p(x, `)
where p is the true distribution over data. Error is 1 − A.
I This is estimated using a test set {(x1 , `1 ), (x2 , `2 ), · · · , (xm , `m )}
      Â(classify) = (1/m) Σi=1..m 1{classify(xi ) = `i }

4/36
Issues with Using Test Set Accuracy

I Class imbalance: if p(L = not spam) = 0.99, then you can get Â ≈ 0.99 by always
guessing "not spam"
I Relative importance of classes or cost of error types.
I Variance due to the test data.

5/36
Evaluation in the Two-class case
I Suppose we have one of the classes t ∈ L as the target class.
I We would like to identify documents with label t in the test data.
I Like information retrieval
I We get (with A, B, C as in the contingency table on the next slide):
I Precision P̂ = C/B (percentage of the documents classify labels as t that are correctly labeled)
I Recall R̂ = C/A (percentage of the actual t-labeled documents correctly labeled as t)
I F1 = 2 · P̂ · R̂ / (P̂ + R̂)
6/36
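A minimal Python sketch of computing P̂, R̂, and F1 from predictions, using the A/B/C quantities just defined (C = true positives, B = documents labeled t by the classifier, A = documents whose true label is t). The toy labels are assumptions.

def prf1(true_labels, predicted_labels, target):
    C = sum(1 for t, p in zip(true_labels, predicted_labels)
            if p == target and t == target)           # true positives
    B = sum(1 for p in predicted_labels if p == target)
    A = sum(1 for t in true_labels if t == target)
    precision = C / B if B else 0.0
    recall = C / A if A else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf1(["t", "t", "o", "o"], ["t", "o", "t", "o"], "t"))   # (0.5, 0.5, 0.5)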
A Different View – Contingency Tables

                     L = t                    L ≠ t
classify(X) = t      C (true positives)       B\C (false positives)    row total: B
classify(X) ≠ t      A\C (false negatives)    (true negatives)
                     column total: A
7/36
Evaluation with > 2 Classes

I Macroaveraged precision and recall: let each class be the target and report the
average P̂ and R̂ across all classes.
I Microaveraged precision and recall: pool all one-vs.-rest decisions into a single
contingency table, calculate P̂ and R̂ from that.

8/36
Cross-validation

I Remember that Â, P̂, R̂, and Fˆ1 are all estimates of the classifier’s quality under the
true data distribution.
I Estimates are noisy!
I K -fold cross validation
I Partition the training data into K nonoverlapping "folds", x1 , x2 , . . . , xK
I For i ∈ {1, . . . , K}
I Train on x1:n \xi , using xi as development data
I Estimate quality on the xi development set as Âi
I Report average accuracy as Â = (1/K) Σi=1..K Âi and perhaps also the standard deviation.
K i=1

9/36
Features in Text Classification

I Running example x = “The spirit is willing but the flesh is weak”


I Feature random variables
I For j ∈ {1, . . . , d} Fj is a discrete random variable taking values in Fj
I Most of the time these can be frequencies of words or n-grams in a text.
I ff −spirit = 1, ff −is = 2, ff −the−flesh = 1, . . .
I They can be boolean “exists” features.
I fe−spirit = 1, fe−is = 1, ff −strong = 0, . . .

10/36
Spam Detection

I A training set of email messages (marked Spam or Not-Spam)


I A set of features for each message (considered as a bag of words)
I For each word: Number of occurrences
I Whether phrases such as “Nigerian Prince”, “email quota full”, “won ONE HUNDRED
MILLION DOLLARS” are in the message
I Whether it is from someone you know
I Whether it is reply to your message
I Whether it is from your domain (e.g., cmu.edu)

11/36
Movie Ratings

I A training set of movie reviews (with star ratings 1 - 5)


I A set of features for each message (considered as a bag of words)
I For each word: Number of occurrences
I Whether phrases such as Excellent, sucks, blockbuster, biggest, Star Wars, Disney, Adam
Sandler, . . . are in the review

12/36
Probabilistic Classification

I Documents are preprocessed: each document x is mapped to a d-dimensional


feature vector f .
I Classification rule

      classify(f ) = arg max`∈L p(` | f )
                   = arg max`∈L p(`, f ) / p(f )
                   = arg max`∈L p(`, f )    (Why?)

13/36
Naive Bayes Classifier

      p(L = `, F1 = f1 , . . . , Fd = fd ) = p(`) ∏j=1..d p(Fj = fj | `)
                                           = π` ∏j=1..d θfj |j,`

I Parameters: π` is the class or label prior.


I The probability that a document belongs to class ` – without considering any of its
features.
I They can be computed directly from the training data {(x1 , `1 ), (x2 , `2 ), · · · , (xn , `n )}. These
sum to 1.
I For each feature function j and label `, a distribution over values θ∗|j,`
I These sum to 1 for every (j, `) pair.

14/36
Generative vs Discriminative Classifier

I Naive Bayes is known as a Generative classifier.


I Generative Classifiers build a model of each class.
I Given an observation (document), they return the class most likely have generated
that observation.

I A discriminative classifier instead learns what features from the input are useful to
discriminate between possible classes.

15/36
The Most Basic Naive Bayes Classifier

16/36
The Most Basic Naive Bayes Classifier

I Features are just words xj in x


I Naive Assumption: Word positions do not matter – “bag of words”.
I Conditional Independence: Feature probabilities p(xi | `) are independent given the
class `.
I p(x | `) = ∏j=1..|x| p(xj | `)
I The probability that a word in a sports document is “soccer” is estimated as
p(soccer | sports) by counting “soccer” in all sports documents.
I So
      classify(x) = arg max`∈L π` ∏j=1..|x| p(xj | `)
I Smoothing is very important as any new document may have unseen words.

17/36
The Most Basic Naive Bayes Classifier

      classify(x) = arg max`∈L π` ∏j=1..|x| p(xj | `)

      classify(x) = arg max`∈L ( log π` + Σj=1..|x| log p(xj | `) )
I All computations are done in log space to avoid underflow and increase speed.
I Class prediction is based on a linear combination of the inputs.
I Hence Naive Bayes is considered a linear classifier.

18/36
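A hedged Python sketch of this most basic Naive Bayes classifier: bag-of-words counts, add-1 smoothing, log-space scoring, and unknown test words ignored. The toy training documents and function names are assumptions, not part of the slides.

import math
from collections import Counter, defaultdict

def train_nb(docs):                       # docs: list of (tokens, label) pairs
    labels = [y for _, y in docs]
    prior = {y: math.log(c / len(docs)) for y, c in Counter(labels).items()}
    word_counts = defaultdict(Counter)
    for tokens, y in docs:
        word_counts[y].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}
    return prior, word_counts, vocab

def classify_nb(tokens, prior, word_counts, vocab):
    def score(y):                         # log prior + sum of add-1 smoothed log likelihoods
        total = sum(word_counts[y].values())
        return prior[y] + sum(
            math.log((word_counts[y][w] + 1) / (total + len(vocab)))
            for w in tokens if w in vocab)            # unknown test words are ignored
    return max(prior, key=score)

docs = [(["fun", "great"], "+"), (["boring", "no", "fun"], "-")]
model = train_nb(docs)
print(classify_nb(["great", "movie"], *model))        # '+'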
An Example
p("predictable" | −) = (1+1)/(14+20)      p("predictable" | +) = (0+1)/(9+20)
p("no" | −) = (1+1)/(14+20)               p("no" | +) = (0+1)/(9+20)
p("fun" | −) = (0+1)/(14+20)              p("fun" | +) = (1+1)/(9+20)

I |V | = 20, N− = 14, N+ = 9, add-1 Laplace smoothing
I π− = p(−) = 3/5,  π+ = p(+) = 2/5
I p(+) p(s | +) = 2/5 × (1 × 1 × 2)/29³ = 3.2×10⁻⁵
I p(−) p(s | −) = 3/5 × (2 × 2 × 1)/34³ = 6.1×10⁻⁵

19/36
Other Optimizations for Sentiment Analysis

I Ignore unknown words in the test.


I Ignore stop words like the, a, with, etc.
I Remove most frequent 10-100 words from the training and test documents.
I Count vs existence of words: Binarized features.
I Negation Handling didnt like this movie , but I → didnt NOT like
NOT this NOT movie , but I

20/36
Formulation of a Discriminative Classifier
I A discriminative model computes p(` | x) to discriminate among different values of `,
using combinations of d features of x.
      `ˆ = arg max`∈L p(` | x)
I There is no obvious way to map features to probabilities.
I Assuming features are binary-valued and are functions of both x and the class `, we can write
      p(` | x) = (1/Z) exp( Σi=1..d wi fi (`, x) )
  where Z is the normalization factor that makes everything a probability and the wi are
  weights for the features.
I p(` | x) can then be formally defined with normalization as
      p(` | x) = exp( Σi=1..d wi fi (`, x) ) / Σ`′∈L exp( Σi=1..d wi fi (`′, x) )
21/36
Some Features

I Remember features are binary-valued and are both functions of x and class `.
I Suppose we are doing sentiment classification. Here are some sample feature functions:
I f1 (`, x) = 1 if "great" ∈ x & ` = +, 0 otherwise
I f2 (`, x) = 1 if "second-rate" ∈ x & ` = −, 0 otherwise
I f3 (`, x) = 1 if "no" ∈ x & ` = +, 0 otherwise
I f4 (`, x) = 1 if "enjoy" ∈ x & ` = −, 0 otherwise

22/36
Mapping to a Linear Formulation

I If the goal is just classification, the denominator can be ignored

      `ˆ = arg max`∈L p(` | x)
         = arg max`∈L exp( Σi=1..d wi fi (`, x) ) / Σ`′∈L exp( Σi=1..d wi fi (`′, x) )
         = arg max`∈L exp( Σi=1..d wi fi (`, x) )
      `ˆ = arg max`∈L Σi=1..d wi fi (`, x)
I Thus we have a linear combination of features for decision making.

23/36
Two-class Classification with Linear Models
I Big idea: “map” a document x into a d-dimensional (feature) vector Φ(x), and learn a
hyperplane defined by vector w = [w1 , w2 , . . . , wd ].
I Linear decision rule:
I Decide on class 1 if w · Φ(x) > 0
I Decide on class 2 if w · Φ(x) ≤ 0

I Parameters are w ∈ Rd . They determine the separation hyperplane.

24/36
Two-class Classification with Linear Models

I There may be more than one separation hyperplane.

25/36
Two-class Classification with Linear Models
I There may not be a separation hyperplane. The data is not linearly separable!

26/36
Two-class Classification with Linear Models

I Some features may not be actually relevant.

27/36
The Perceptron Learning Algorithm for Two Classes

I A very simple algorithm guaranteed to eventually find a linear separator hyperplane


(determine w), if one exists.
I If one doesn’t, the perceptron will oscillate!
I Assume our classifier is
      classify(x) = 1 if w · Φ(x) > 0,  0 if w · Φ(x) ≤ 0
I Start with w = 0
I for t = 1, . . . , T
I i = t mod N
I w ← w + α (`i − classify(xi )) Φ(xi )
I Return w
I α is the learning rate – determined by experimentation.

28/36
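A small Python sketch of the two-class perceptron update above, on an AND-like, linearly separable toy problem. The data, the number of iterations, and the constant bias feature are assumptions for illustration.

def classify(w, phi):
    return 1 if sum(wi * xi for wi, xi in zip(w, phi)) > 0 else 0

def train_perceptron(data, d, T=20, alpha=1.0):
    w = [0.0] * d
    for t in range(T):
        phi, label = data[t % len(data)]              # i = t mod N, as on the slide
        error = label - classify(w, phi)              # in {-1, 0, +1}
        w = [wi + alpha * error * xi for wi, xi in zip(w, phi)]
    return w

# Each feature vector starts with a constant 1 so the bias is learned as a weight.
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = train_perceptron(data, d=3)
print([classify(w, phi) for phi, _ in data])          # [0, 0, 0, 1]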
Linear Models for Classification

I Big idea: “map” a document x into a d-dimensional (feature) vector Φ(x, `), and learn
a hyperplane defined by vector w = [w1 , w2 , . . . , wd ].
I Linear decision rule

      classify(x) = `ˆ = arg max`∈L w · Φ(x, `)

where Φ : V + × L → Rd
I Parameters are w ∈ Rd .

29/36
A Geometric View of Linear Classifiers

I Suppose we have an instance of w and L = {y1 , y2 , y3 , y4 }.


I We have two simple binary features φ1 , and φ2
I Φ(x, `) are as follows:

30/36
A Geometric View of Linear Classifiers
I Suppose we have an instance w and L = {y1 , y2 , y3 , y4 }.
I We have two simple binary features φ1 , and φ2
I Suppose w is such that w · Φ = w1 φ1 + w2 φ2

31/36
A Geometric View of Linear Classifiers
I Suppose we have an instance w and L = {y1 , y2 , y3 , y4 }.
I We have two simple binary features φ1 , and φ2
I Suppose w is such that w · Φ = w1 φ1 + w2 φ2

      distance(w · Φ, Φ′) = |w · Φ′| / ‖w‖2 ∝ |w · Φ′|
I So w · Φ(x, y1 ) > w · Φ(x, y3 ) > w · Φ(x, y4 ) > w · Φ(x, y2 ) ≥ 0
32/36
A Geometric View of Linear Classifiers

I Suppose we have an instance w and L = {y1 , y2 , y3 , y4 }.


I We have two simple binary features φ1 , and φ2
I Suppose w is such that w · Φ = w1 φ1 + w2 φ2

I So w · Φ(x, y3 ) > w · Φ(x, y1 ) > w · Φ(x, y2 ) > w · Φ(x, y4 )

33/36
Where do we get w? The Perceptron Learner

I Start with w = 0
I Go over the training samples and adjust w to minimize the deviation from correct
labels.
      minw Σi=1..n [ max`′∈L w · Φ(xi , `′) − w · Φ(xi , `i ) ]
I The perceptron learning algorithm is a stochastic subgradient descent algorithm on
above.
I For t ∈ {1, . . . , T }
I Pick it uniformly at random from {1, . . . , n}
I `ˆit ← arg max`∈L w · Φ(xit , `)
I w ← w − α ( Φ(xit , `ˆit ) − Φ(xit , `it ) )

I Return w

34/36
Gradient Descent

35/36
More Sophisticated Classification

I Take into account error costs if all mistakes are not equally bad. (false positives vs.
false negatives in spam detection)
I Use maximum margin techniques (e.g., Support Vector Machines) try to find the best
separating hyperplane that’s far from the training examples.
I Use kernel methods to map vectors into much higher-dimensional spaces, almost for
free, where they may be linearly separable.
I Use feature selection to find the most important features and throw out the rest.
I Take the machine learning class if you are interested in these.

36/36
11-411
Natural Language Processing
Part-of-Speech Tagging

Kemal Oflazer

Carnegie Mellon University in Qatar

1/41
Motivation

I My cat, which lives dangerously, no longer has nine lives.


I The first lives is a present tense verb.
I The second lives is a plural noun.
I They are pronounced differently.
I How we pronounce the word depends on us knowing which is which.
I The two lives above are pronounced differently.
I “The minute issue took one minute to resolve.”
I They can be stressed differently. SUSpect (noun) vs. susPECT (verb)
I He can can the can.
I The first can is a modal.
I The second can is a(n untensed) verb.
I The third can is a singular noun.
I In fact, can has one more possible interpretation as a present tense verb as in “We can
tomatoes every summer."

2/41
What are Part-of-Speech Tags?

I A limited number of tags to denote words “classes”.


I Words in the same class
I Occur more or less in the same contexts
I Have more or less the same functions
I Morphologically, they (usually) take the same suffixes or prefixes.
I Part-of-Speech tags are not about meaning!
I Part-of-Speech tags are not necessarily about any grammatical function.

3/41
English Nouns

I Can be subjects and objects


I This book is about geography.
I I read a good book.
I Can be plural or singular (books, book)
I Can have determiners (the book)
I Can be modified by adjectives (blue book)
I Can have possessors (my book, John’s book)

4/41
Why have Part-of-Speech Tags?

I It is an “abstraction” mechanism.
I There are too many words.
I You would need a lot of data to train models.
I Your model would be very specific.
I POS Tags allow for generalization and allow for useful reduction in model sizes.
I There are many different tagsets: You want the right one for your task

5/41
How do we know the class?

I Substitution test
I The ADJ cat sat on the mat.
I The blue NOUN sits on the NOUN.
I The blue cat VERB on the mat.
I The blue cat sat PP the mat.

6/41
What are the Classes?

I Nouns, Verbs, Adjectives, . . .


I Lots of different values (open class)
I Determiners
I The, a, this, that, some, . . .
I Prepositions
I By, at, from, as, against, below, . . .
I Conjunctions
I And, or, neither, but, . . .
I Modals
I Will, may, could, can, . . .
I Some classes are well defined and closed, some are open.

7/41
Broad Classes

I Open Classes: nouns, verbs, adjectives, adverbs, numbers


I Closed Classes: prepositions, determiners, pronouns, conjunctions, auxiliary verbs,
particles, punctuation

8/41
Finer-grained Classes

I Nouns: Singular, Plural, Proper, Count, Mass


I Verbs: Untensed, Present 3rd Person, Present Non-3rd Person, Past , Past Participle
I Adjectives: Normal, Comparative, Superlative
I Adverbs: Comparative, Superlative, Directional, Temporal, Manner,
I Numbers: Cardinal, Ordinal

9/41
Hard Cases

I I will call up my friend.


I I will call my friend up.
I I will call my friend up in the treehouse.
I Gerunds
I I like walking.
I I like apples.
I His walking daily kept him fit.
I His apples kept him fit.
I Eating apples kept him fit.

10/41
Other Classes

I Interjections (Wow!, Oops, Hey)


I Politeness markers (Your Highness . . . )
I Greetings (Dear . . . )
I Existential there (there is . . . )
I Symbols, Money, Emoticons, URLs, Hashtags

11/41
Penn Treebank Tagset for English

12/41
Others Tagsets for English and for Other Languages

I The International Corpus of English (ICE) Tagset: 205 Tags


I London-Lund Corpus (LLC) Tagset: 211 Tags
I Arabic: Several tens of (composite tags)
I (Buckwalter: wsyktbwnhA “And they will write it”) is tagged as
CONJ + FUTURE PARTICLE + IMPERFECT VERB PREFIX + IMPERFECT VERB +
IMPERFECT VERB SUFFIX MASCULINE PLURAL 3RD PERSON +
OBJECT PRONOUN FEMININE SINGULAR
I Czech: Several hundred (composite) tags
I Vaclav is tagged as k1gMnSc1, indicating it is a noun, gender is male animate, number is
singular, and case is nominative
I Turkish: Potentially infinite set of (composite) tags.
I elmasında is tagged as elma+Noun+A3sg+P3sg+Loc indicating root is elma and the
word is singular noun belonging to a third singular person in locative case.

13/41
Some Tagged Text from The Penn Treebank Corpus

In/IN an/DT Oct./NNP 19/CD review/NN of/IN ‘‘/‘‘ The/DT Misanthrope/NN


’’/’’ at/IN Chicago/NNP ’s/POS Goodman/NNP Theatre/NNP ‘‘/‘‘
Revitalized/VBN Classics/NNS Take/VBP the/DT Stage/NN in/IN Windy/NNP
City/NNP ,/, ’’/’’ Leisure/NN &/CC Arts/NNS ,/, the/DT role/NN of/IN
Celimene/NNP ,/, played/VBN by/IN Kim/NNP Cattrall/NNP , , was/VBD
mistakenly/RB attributed/VBN to/TO Christina/NNP Haag/NNP ./.
Ms./NNP Haag/NNP plays/VBZ Elianti/NNP ./.
Rolls-Royce/NNP Motor/NNP Cars/NNPS Inc./NNP said/VBD it/PRP
expects/VBZ its/PRP$ U.S./NNP sales/NNS to/TO remain/VB steady/JJ
at/IN about/IN 1,200/CD cars/NNS in/IN 1990/CD ./.
The/DT luxury/NN auto/NN maker/NN last/JJ year/NN sold/VBD 1,214/CD
cars/NNS in/IN the/DT U.S./NNP

14/41
How Bad is Ambiguity?
Tags  Token      Tags  Token        Count  POS/Token
 7    down        5    run           317   RB/down
 6    that        5    repurchase    200   RP/down
 6    set         5    read          138   IN/down
 6    put         5    present        10   JJ/down
 6    open        5    out             1   VBP/down
 6    hurt        5    many            1   RBR/down
 6    cut         5    less            1   NN/down
 6    bet         5    left
 6    back        5    vs.
                  5    the
                  5    spread
                  5    split
                  5    say
                  5    ’s

15/41
Some Tags for “down”
One/CD hundred/CD and/CC ninety/CD two/CD former/JJ greats/NNS ,/, near/JJ
greats/NNS ,/, hardly/RB knowns/NNS and/CC unknowns/NNS begin/VBP a/DT 72-game/JJ
,/, three-month/JJ season/NN in/IN spring-training/NN stadiums/NNS up/RB and/CC
down/RB Florida/NNP ...
He/PRP will/MD keep/VB the/DT ball/NN down/RP ,/, move/VB it/PRP around/RB ...
As/IN the/DT judge/NN marched/VBD down/IN the/DT center/JJ aisle/NN in/IN his/PRP$
flowing/VBG black/JJ robe/NN ,/, he/PRP was/VBD heralded/VBN by/IN a/DT trumpet/NN
fanfare/NN ...
Other/JJ Senators/NNP want/VBP to/TO lower/VB the/DT down/JJ payments/NNS
required/VBN on/IN FHA-insured/JJ loans/NNS ...
Texas/NNP Instruments/NNP ,/, which/WDT had/VBD reported/VBN Friday/NNP that/IN
third-quarter/JJ earnings/NNS fell/VBD more/RBR than/IN 30/CD %/NN from/IN the/DT
year-ago/JJ level/NN ,/, went/VBD down/RBR 2/CD 1/8/CD to/TO 33/CD on/IN 1.1/CD
million/CD shares/NNS ....
Because/IN hurricanes/NNS can/MD change/VB course/NN rapidly/RB ,/, the/DT
company/NN sends/VBZ employees/NNS home/NN and/CC shuts/NNS down/VBP
operations/NNS in/IN stages/NNS : /: the/DT closer/RBR a/DT storm/NN gets/VBZ ,/,
the/DT more/RBR complete/JJ the/DT shutdown/NN ...
Jaguar/NNP ’s/POS American/JJ depositary/NN receipts/NNS were/VBD up/IN 3/8/CS
yesterday/NN in/IN a/DT down/NN market/NN ,/, closing/VBG at/IN 10/CD ...

16/41
Some Tags for “Japanese”

Meanwhile/RB ,/, Japanese/JJ bankers/NNS said/VBD they/PRP were/VBD still/RB


hesitant/JJ about/IN accepting/VBG Citicorp/NNP ’s/POS latest/JJS proposal/NN ...
And/CC the/DT Japanese/NNPS are/VBP likely/JJ to/TO keep/VB close/RB on/IN
Conner/NNP ’s/POS heels/NNS ...
The/DT issue/NN is/VBZ further/RB complicated/VBN because/IN although/IN the/DT
organizations/NNS represent/VBP Korean/JJ residents/NNS ,/, those/DT residents/NNS
were/VBD largely/RB born/VBN and/CC raised/VBN in/IN Japan/NNP and/CC many/JJ
speak/VBP only/RB Japanese/NNP ...
And/CC the/DT Japanese/NNP make/VBP far/RB more/JJR suggestions/NNS :/: 2,472/CS
per/IN 100/CD eligible/JJ employees/NNS vs./CC only/RB 13/CD per/IN 100/CD
employees/NNS in/IN the/DT ...
The/DT Japanese/NNS are/VBP in/IN the/DT early/JJ stage/NN right/RB now/RB ,/,
said/VBD Thomas/NNP Kenney/NNP ,/, a/DT onetime/JJ media/NN adviser/NN for/IN
First/NNP Boston/NNP Corp./NNP who/WP was/VBD recently/RB appointed/VBN
president/NN of/IN Reader/NNP ’s/POS Digest/NNP Association/NNP ’s/POS new/JJ
Magazine/NNP Publishing/NNP Group/NNP ...
In/IN 1991/CD ,/, the/DT Soviets/NNS will/MD take/VB a/DT Japanese/JJ into/NN
space/NN ,/, the/DT first/JJ Japanese/NN to/TO go/VB into/IN orbit/NN ...

17/41
How we do POS Tagging?

I Pick the most frequent tag for each type


I Gives about 92.34% accuracy (on a standard test set)
I Look at the context
I Preceding (and succeeding) words
I Preceding (and succeeding) tags
I the . . .
I to . . .
I John’s blue . . .

18/41
Markov Models for POS Tagging

I We use an already annotated training data to statistically model POS tagging.


I Again the problem can be cast as a noisy channel problem:
I “I have a sequence of tags of a proper sentence in my mind, t = ⟨t1 , t2 , . . . , tn ⟩”
I “By the time the tags are communicated, they have been turned into actual words,
w = ⟨w1 , w2 , . . . , wn ⟩, which are observed.”
I “What is the most likely tag sequence t̂ that gives rise to the observation w? ”
I The basic equation for tagging is then

t̂ = arg max p(t | w)


t

where t̂ is the tag sequence that maximizes the argument of the arg max .

19/41
Basic Equation and Assumptions for POS Tagging

      t̂ = arg maxt p(t | w) = arg maxt p(w | t) p(t) / p(w) = arg maxt p(w | t) p(t)
           (Bayes expansion)                                  (ignoring the denominator)
      where p(w | t) is the channel model and p(t) is the source model.

I The independence assumption: Probability of a word appearing depends only on its


own tag and is independent of neighboring words and tags:
      p(w | t) = p(w1:n | t1:n ) ≈ ∏i=1..n p(wi | ti )
I The bigram assumption: that probability of a tag is dependent only on the previous
tag.
      p(t) = p(t1:n ) ≈ ∏i=1..n p(ti | ti−1 )
20/41
Basic Approximation Model for Tagging

      t̂1:n = arg maxt1:n p(t1:n | w1:n ) ≈ arg maxt1:n ∏i=1..n p(wi | ti ) · p(ti | ti−1 )
                                                        (emission)    (transition)

21/41
Bird’s Eye View of p(ti | ti−1 )

22/41
Bird’s Eye View of p(wi | ti )

23/41
Estimating Probabilities
I We can estimate these probabilities from a tagged training corpus using maximum likelihood
estimation.
I Transition Probabilities: p(ti | ti−1 ) = c(ti−1 , ti ) / c(ti−1 )
I Emission Probabilities: p(wi | ti ) = c(ti , wi ) / c(ti )

I It is also possible to use a trigram approximation (with appropriate smoothing).


      p(t1:n ) ≈ ∏i=1..n p(ti | ti−2 ti−1 )

I You need to square the number of states!

24/41
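A brief Python sketch of estimating the transition and emission probabilities by maximum likelihood from a tiny tagged corpus, following the count ratios above. The corpus and the <s> start tag are assumptions.

from collections import Counter

tagged = [[("the", "DT"), ("can", "NN"), ("rusted", "VBD")],
          [("the", "DT"), ("can", "MD"), ("run", "VB")]]

context, trans, tags, emit = Counter(), Counter(), Counter(), Counter()
for sentence in tagged:
    prev = "<s>"                          # assumed start-of-sentence tag
    for word, tag in sentence:
        context[prev] += 1                # c(t_{i-1}), the transition denominator
        trans[(prev, tag)] += 1           # c(t_{i-1}, t_i)
        tags[tag] += 1                    # c(t_i), the emission denominator
        emit[(tag, word)] += 1            # c(t_i, w_i)
        prev = tag

def p_trans(tag, prev):  return trans[(prev, tag)] / context[prev]
def p_emit(word, tag):   return emit[(tag, word)] / tags[tag]

print(p_trans("NN", "DT"))                # 1/2 = 0.5
print(p_emit("can", "MD"))                # 1/1 = 1.0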
The Setting

I We have n words in w = hw1 w2 , . . . , wn i.


I We have total N tags which are the labels of the Markov Model states (excluding start
(0) and end (F ) states).
I qi is the label of the state after i words have been observed.
I We will also denote all the parameters of our HMM by λ = (A, B), the transition (A)
and emission (B) probabilities.
I In the next several slides
I i will range over word positions.
I j will range over states/tags
I k will range over states/tags.

25/41
The Forward Algorithm

I An efficient dynamic programming algorithm for finding the total probability of
observing w = ⟨w1 , w2 , . . . , wn ⟩, given the (Hidden) Markov Model λ.
I It creates an expanded directed acyclic graph, called a trellis, which specializes the model
graph to the specific sentence.

26/41
The Forward Algorithm

I Computes α_i(j) = p(w1 , w2 , . . . , wi , qi = j | λ)
I The total probability of observing w1 , w2 , . . . , wi and landing in state j after emitting i words.
I Let’s define some short-cuts:
I α_{i−1}(k): the forward probability from the previous stage (word)
I a_kj = p(t_j | t_k)
I b_j(w_i) = p(w_i | t_j)
I α_i(j) = ∑_{k=1}^{N} α_{i−1}(k) · a_kj · b_j(w_i)
I α_n(F) = p(w1 , w2 , . . . , wn , qn = F | λ) is the total probability of observing
w1 , w2 , . . . , wn .
I We really do not need the αs. We just wanted to motivate the trellis.
I We are actually interested in the most likely sequence of states (tags) that we go
through while “emitting” w1 , w2 , . . . , wn . These would be the most likely tags!

27/41
Viterbi Decoding
I Computes v_i(j) = max_{q0 ,q1 ,...,q_{i−1}} p(q0 , q1 , . . . , q_{i−1} , w1 , w2 , . . . , wi , qi = j | λ)
I v_i(j) is the maximum probability of observing w1 , w2 , . . . , wi after emitting i words
while going through some sequence of states (tags) q0 , q1 , . . . , q_{i−1} before landing in
state qi = j.
I We can recursively define

v_i(j) = max_{k=1...N} v_{i−1}(k) · a_kj · b_j(w_i)

I Let’s also define a backtrace pointer as

bt_i(j) = arg max_{k=1...N} v_{i−1}(k) · a_kj · b_j(w_i)

I These backtrace pointers will give us the tag sequence q0 = START, q1 , q2 , . . . , qn
which is the most likely tag sequence for ⟨w1 , w2 , . . . , wn ⟩.

28/41
Viterbi Algorithm
I Initialization:
v_1(j) = a_0j · b_j(w_1)        1 ≤ j ≤ N
bt_1(j) = 0
I Recursion:
v_i(j)  = max_{k=1...N} v_{i−1}(k) · a_kj · b_j(w_i)        1 ≤ j ≤ N, 1 < i ≤ n
bt_i(j) = arg max_{k=1...N} v_{i−1}(k) · a_kj · b_j(w_i)    1 ≤ j ≤ N, 1 < i ≤ n
I Termination:
p∗ = v_n(q_F) = max_{k=1...N} v_n(k) · a_kF             (the best score)
q_n∗ = bt_n(q_F) = arg max_{k=1...N} v_n(k) · a_kF      (the start of the backtrace)
I (A small Python sketch of this algorithm appears below.)
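A small Python sketch of the Viterbi recursion above, assuming the transition and emission tables estimated earlier (with <s> and </s> boundary pseudo-tags); working in log space to avoid underflow is an implementation choice, not something stated on the slide:

import math

def viterbi(words, tags, p_trans, p_emit, start="<s>", end="</s>"):
    # p_trans[(prev_tag, tag)] and p_emit[(tag, word)]; missing entries count as probability 0.
    def lp(table, key):
        return math.log(table[key]) if table.get(key, 0) > 0 else float("-inf")
    n = len(words)
    v = [{} for _ in range(n)]                        # v[i][tag]  = best log score
    bt = [{} for _ in range(n)]                       # bt[i][tag] = backtrace pointer
    for t in tags:                                    # initialization (i = 1)
        v[0][t] = lp(p_trans, (start, t)) + lp(p_emit, (t, words[0]))
        bt[0][t] = None
    for i in range(1, n):                             # recursion
        for t in tags:
            best_k = max(tags, key=lambda k: v[i - 1][k] + lp(p_trans, (k, t)))
            v[i][t] = v[i - 1][best_k] + lp(p_trans, (best_k, t)) + lp(p_emit, (t, words[i]))
            bt[i][t] = best_k
    last = max(tags, key=lambda k: v[n - 1][k] + lp(p_trans, (k, end)))   # termination
    seq = [last]
    for i in range(n - 1, 0, -1):                     # follow the backtrace
        seq.append(bt[i][seq[-1]])
    return list(reversed(seq))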

29/41
Viterbi Decoding

30/41
Viterbi Decoding

31/41
Viterbi Decoding

32/41
Viterbi Decoding

33/41
Viterbi Decoding

34/41
Viterbi Decoding

35/41
Viterbi Decoding

I Once you are at i = n, you have to land in the END state (F ), then use the backtrace
to find the previous state you came from and recursively trace backwards to find t̂1:n .
36/41
Viterbi Decoding Example

37/41
Viterbi Decoding Example

38/41
Viterbi Decoding Example

39/41
Unknown Words
I They are unlikely to be closed class words.
I They are most likely to be nouns or proper nouns, less likely, verbs.
I Exploit capitalization – most likely proper nouns.
I Exploit any morphological hints: -ed most likely a past-tense verb; -s most likely a plural
noun or a 3rd-person-singular present-tense verb.
I Build separate models of the sort

p(t_j | l_{n−i+1} . . . l_n)   and   p(l_{n−i+1} . . . l_n)

where l_{n−i+1} . . . l_n are the last i letters of a word.
I Then

p(l_{n−i+1} . . . l_n | t_j) = p(t_j | l_{n−i+1} . . . l_n) · p(l_{n−i+1} . . . l_n) / p(t_j)

I This can then be used in place of p(w_i | t_i) in the Viterbi algorithm.
I Only use low frequency words in these models (a small sketch follows below).
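A rough sketch of such a suffix model, assuming the same tagged training data as before; the suffix length, the low-frequency cutoff, and the crude probability floor used instead of proper smoothing are illustrative assumptions:

from collections import Counter

def estimate_suffix_model(tagged_sentences, max_suffix=3, max_word_freq=5):
    word_freq = Counter(w for sent in tagged_sentences for w, _ in sent)
    tag_count, suffix_count, suffix_tag, total = Counter(), Counter(), Counter(), 0
    for sent in tagged_sentences:
        for word, tag in sent:
            if word_freq[word] > max_word_freq:       # only use low-frequency words
                continue
            suffix = word[-max_suffix:].lower()
            tag_count[tag] += 1
            suffix_count[suffix] += 1
            suffix_tag[(suffix, tag)] += 1
            total += 1
    def p_suffix_given_tag(word, tag):
        suffix = word[-max_suffix:].lower()
        if suffix_tag[(suffix, tag)] == 0 or tag_count[tag] == 0:
            return 1e-10                              # crude floor instead of real smoothing
        p_tag_given_suffix = suffix_tag[(suffix, tag)] / suffix_count[suffix]
        p_suffix = suffix_count[suffix] / total
        p_tag = tag_count[tag] / total
        return p_tag_given_suffix * p_suffix / p_tag  # Bayes inversion, as on the slide
    return p_suffix_given_tag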
40/41
Closing Remarks
I Viterbi decoding takes O(n · N²) work. (Why?)
I HMM parameters (transition probabilities A and emission probabilities B) can
actually be estimated from an unannotated corpus.
I Given an unannotated corpus and the state labels, the forward-backward or Baum-Welch
algorithm, a special case of the Expectation-Maximization (EM) algorithm,
trains both the transition probabilities A and the emission probabilities B of the HMM.
I EM is an iterative algorithm. It works by computing an initial estimate for the
probabilities, then using those estimates to compute a better estimate, and so on,
iteratively improving the probabilities that it learns.
I There are many other more recent and usually better performing approaches to POS
tagging:
I Maximum Entropy Models (discriminative, uses features, computes t̂ = arg max_t p(t | w))
I Conditional Random Fields (discriminative, uses features, but feature functions can
also depend on the previous tag)
I Perceptrons (discriminative, uses features, trained with the perceptron algorithm)
I Accuracy for English is in the 97% to 98% range.
I In every hundred words, you have 2 errors on average and you do not know where they
are!
41/41
11-411
Natural Language Processing
Overview of (Mostly English) Syntax

Kemal Oflazer

Carnegie Mellon University in Qatar

1/12
Syntax

I The ordering of words and how they group into phrases


I [ [the old man] [is yawning] ]
I [ [the old] [man the boats] ]
I Syntax vs. Meaning
I “Colorless green ideas sleep furiously.”
I You can tell that the words are in the right order.
I and that “colorless” and “green” modify “ideas”
I and that ideas sleep
I that the sleeping is done furiously
I that it sounds like an English sentence
I even if you can’t imagine what it means
I and you know that it is better than “Sleep green furiously ideas colorless”

2/12
Syntax vs. Morphology

I Syntax is not morphology


I Morphology deals with the internal structure of words.
I Syntax deals with combinations of words – phrases and sentences.
I Syntax is mostly made up of general rules that apply across the board, with very few
irregularities.

3/12
Syntax vs. Semantics

I Syntax is not semantics.


I Semantics is about meaning; syntax is about structure alone.
I A sentence can be syntactically well-formed but semantically ill-formed. (e.g., “Colorless
green ideas sleep furiously.”)
I Some well-known linguistic theories attempt to “read” semantic representations off of
syntactic representations in a compositional fashion.
I We’ll talk about these in a later lecture

4/12
Two Approaches to Syntactic Structure

I Constituent Structure or Phrase Structure Grammar


I Syntactic structure is represented by trees generated by a context-free grammar.
I An important construct is the constituent (complete sub-tree).

I Dependency Grammar:
I The basic unit of syntactic structure is a binary relation between words called a
dependency.

5/12
Constituents

I One way of viewing the structure of a sentence is as a collection of nested


constituents:
I Constituent: a group of words that “go together” (or relate more closely to one another
than to other words in the sentence)
I Constituents larger than a word are called phrases.
I Phrases can contain other phrases.

6/12
Constituents

I Linguists characterize constituents in a number of ways, including:


I where they occur (e.g., “NPs can occur before verbs”)
I where they can move in variations of a sentence
I On September 17th, I’d like to fly from Atlanta to Denver.
I I’d like to fly on September 17th from Atlanta to Denver.
I I’d like to fly from Atlanta to Denver on September 17th.
I what parts can move and what parts can’t
I *On September I’d like to fly 17th from Atlanta to Denver.
I what they can be conjoined with
I I’d like to fly from Atlanta to Denver on September 17th and in the morning.

7/12
Noun Phrases

I The elephant arrived.


I It arrived.
I Elephants arrived.
I The big ugly elephant arrived.
I The elephant I love to hate arrived.

8/12
Prepositional Phrases

I Every prepositional phrase contains a preposition followed by a noun phrase.

I I arrived on Tuesday.
I I arrived in March.
I I arrived under the leaking roof.
I I arrived with the elephant I love to hate.

9/12
Sentences/Clauses

I John likes Mary.


I John likes the woman he thinks is Mary.
I John likes the woman (whom) he thinks (the woman) is Mary.
I Sometimes, John thinks he is Mary.
I Sometimes, John thinks (that) he/John is Mary.
I It is absolutely false that sometimes John thinks he is Mary.

10/12
Recursion and Constituents
I This is the house.
I This is the house that Jack built.
I This is the cat that lives in the house that Jack built.
I This is the dog that chased the cat that lives in the house that Jack built.
I This is the flea that bit the dog that chased the cat that lives in the house that Jack
built.
I This is the virus that infected the flea that bit the dog that chased the cat that lives in
the house that Jack built.
I Non-constituents
I If on a Winter’s Night a Traveler
I Nuclear and Radiochemistry
I The Fire Next Time
I A Tad Overweight, but Violet Eyes to Die For
I Sometimes a Great Notion
I [how can we know the] Dancer from the Dance

11/12
Describing Phrase Structure / Constituency Grammars

I Regular expressions were a convenient formalism for describing morphological


structure of words.
I Context-free grammars are a convenient formalism for describing context-free
languages.
I Context-free languages are a reasonable approximation for natural languages, while
regular languages are much less so!
I Although these depend on what the goal is.
I There is some linguistic evidence that natural languages are NOT context-free, but in
fact are mildly context-sensitive.
I This has not been a serious impediment.
I Other formalisms have been constructed over the years to deal with natural
languages.
I Unification-based grammars
I Tree-adjoining grammars
I Categorial grammars

12/12
11-411
Natural Language Processing
Formal Languages and Chomsky Hierarchy

Kemal Oflazer

Carnegie Mellon University in Qatar

1/53
Brief Overview of Formal Language Concepts

2/53
Strings

I An alphabet is any finite set of distinct symbols


I {0, 1}, {0,1,2,. . . ,9}, {a,b,c}
I We denote a generic alphabet by Σ
I A string is any finite-length sequence of elements of Σ.
I e.g., if Σ = {a, b} then a, aba, aaaa, ...., abababbaab are some strings over the
alphabet Σ

3/53
Strings

I The set of all possible strings over Σ is denoted by Σ∗ .
I We define Σ⁰ = {ε} and Σⁿ = Σⁿ⁻¹ · Σ
I with some abuse of the concatenation notation, now applying to sets of strings
I So Σⁿ = {ω | ω = xy and x ∈ Σⁿ⁻¹ and y ∈ Σ}
I Σ∗ = Σ⁰ ∪ Σ¹ ∪ Σ² ∪ · · · ∪ Σⁿ ∪ · · · = ⋃_{i≥0} Σⁱ
I Alternatively, Σ∗ = {x₁ · · · xₙ | n ≥ 0 and xᵢ ∈ Σ for all i}
I Φ denotes the empty set of strings: Φ = {},
I but Φ∗ = {ε}

4/53
Sets of Languages


I The power set of Σ∗ , the set of all its subsets, is denoted as 2^{Σ∗}

5/53
Describing Languages

I Interesting languages are infinite


I We need finite descriptions of infinite sets
I L = {aⁿbⁿ : n ≥ 0} is fine but not terribly useful!
I We need to be able to use these descriptions in mechanizable procedures

6/53
Describing Languages

I Regular Expressions/Finite State Recognizers ⇒ Regular Languages


I Context-free Grammars/Push-down Automata ⇒ Context-free Languages

7/53
Identifying Nonregular Languages

I Given a language L, how can we check that it is not a regular language?


I The answer is not obvious.
I Not being able to design a DFA does not constitute a proof!

8/53
The Pigeonhole Principle

I If there are n pigeons and m holes and n > m, then at least one hole has more than one pigeon.

I What do pigeons have to do with regular languages?

9/53
The Pigeonhole Principle

I Consider the DFA

I With strings a, aa or aab, no state is repeated


I With strings aabb, bbaa, abbabb or abbbabbabb, a state is repeated
I In fact, for any ω where |ω| ≥ 4, some state has to repeat. Why?

10/53
The Pigeonhole Principle

I When traversing the DFA with the string ω , if the number of transitions ≥ number of
states, some state q has to repeat!
I Transitions are pigeons, states are holes.

11/53
Pumping a String
I Consider a string ω = xyz with
I |y| ≥ 1
I |xy| ≤ m (m is the number of states)
I If ω = xyz ∈ L, then so are xyⁱz for all i ≥ 0
I The substring y can be pumped.
I So if a DFA accepts a sufficiently long string, then it accepts an infinite number of
strings!
12/53
There are Nonregular Languages

I Consider the language L = {aⁿbⁿ | n ≥ 0}
I Suppose L is regular and a DFA M with p states accepts L
I Consider δ∗ (q0 , aⁱ) for i = 0, 1, 2, . . .
I δ∗ (q, w) is the extended state transition function: what state do I land in, starting in state q
and stepping through the symbols in w.
I Since there are infinitely many i’s, but a finite number of states, the Pigeonhole Principle tells us
that there is some state q such that
I δ∗ (q0 , aⁿ) = q and δ∗ (q0 , aᵐ) = q, but n ≠ m
I Thus if M accepts aⁿbⁿ it must also accept aᵐbⁿ, since in state q it does not “remember”
whether there were n or m a’s.
I Thus M cannot exist and L is not regular.

13/53
Is English Regular?

I The cat likes tuna fish.


I The cat the dog chased likes tuna fish.
I The cat the dog the rat bit chased likes tuna fish.
I The cat the dog the rat the elephant admired bit chased likes tuna fish.
I L1 = (the cat | the dog | the mouse | . . . )∗ (chased | bit | ate | . . . )∗ likes tuna fish
I L2 = English
I L1 ∩ L2 = (the cat | the dog | the mouse | . . . )ⁿ (chased | bit | ate | . . . )ⁿ⁻¹ likes tuna
fish.
I Closure fact: If L1 and L2 are regular ⇒ L1 ∩ L2 is regular.
I L1 is regular, L1 ∩ L2 is NOT regular, hence L2 (English) can NOT be regular.

14/53
Grammars

I Grammars provide the generative mechanism to generate all strings in a language.


I A grammar is essentially a collection of substitution rules, called productions
I Each production rule has a left-hand-side and a right-hand-side.

15/53
Grammars - An Example

I Consider once again L = {aⁿbⁿ | n ≥ 0}
I Basis: ε is in the language
I Production: S → ε
I Recursion: If w is in the language, then so is the string awb.
I Production: S → aSb
I S is called a variable or a nonterminal symbol
I a, b etc., are called terminal symbols
I One variable is designated as the start variable or start symbol.

16/53
How does a grammar work?

I Consider the set of rules R = {S → ε, S → aSb}
I Start with the start variable S
I Apply the following until all remaining symbols are terminal.
I Choose a production in R whose left-hand side matches one of the variables.
I Replace the variable with the rule’s right-hand side.
I S ⇒ aSb ⇒ aaSbb ⇒ aaaSbbb ⇒ aaaaSbbbb ⇒ aaaabbbb
I The string aaaabbbb is in the language L
I The sequence of rule applications above is called a derivation (a tiny sketch that mechanizes
this procedure follows below).
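A tiny sketch that mechanizes this procedure for R = {S → ε, S → aSb}; always rewriting the leftmost S and fixing the number of recursive steps in advance are simplifications for illustration:

def derive(n):
    # Apply S -> aSb exactly n times, then S -> epsilon, recording each step.
    steps, current = ["S"], "S"
    for _ in range(n):
        current = current.replace("S", "aSb", 1)
        steps.append(current)
    current = current.replace("S", "", 1)             # the final S -> epsilon step
    steps.append(current if current else "epsilon")
    return " => ".join(steps)

print(derive(4))   # S => aSb => aaSbb => aaaSbbb => aaaaSbbbb => aaaabbbb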

17/53
Types of Grammars

I Regular Grammars describe regular languages.


I Context-free Grammars: describe context-free languages.
I Context-sensitive Grammars: describe context-sensitive languages.
I General Grammars: describe arbitrary Turing-recognizable languages.

18/53
Formal Definition of a Grammar

I A Grammar is a 4-tuple G = (V, Σ, R, S) where


I V is a finite set of variables
I Σ is a finite set of terminals, disjoint from V.
I R is a set of rules of the form X → Y
I S ∈ V is the start variable
I In general X ∈ (V ∪ Σ)+ and Y ∈ (V ∪ Σ)∗
I The type of a grammar (and hence the class of the languages described) depends on
the type of the left- and right-hand sides.
I The right-hand side of the rules can be any combination of variables and terminals,
including ε (hence Y ∈ (V ∪ Σ)∗ ).

19/53
Types of Grammars

I Regular Grammars
I Left-linear: All rules are either like X → Ya or like X → a with X, Y ∈ V and a ∈ Σ∗
I Right-linear: All rules are either like X → aY or like X → a with X, Y ∈ V and a ∈ Σ∗
I Context-free Grammars
I All rules are like X → Y with X ∈ V and Y ∈ (Σ ∪ V)∗
I Context-sensitive Grammars
I All rules are like L X R → L Y R with X ∈ V, L, R ∈ (Σ ∪ V)∗ , and Y ∈ (Σ ∪ V)+
I General Grammars
I All rules are like X → Y with X, Y ∈ (Σ ∪ V)∗

20/53
Chomsky Normal Form

I CFGs in certain standard forms are quite useful for some computational problems.

Chomsky Normal Form

A context-free grammar is in Chomsky normal form (CNF) if every
rule is either of the form

A → BC   or   A → a

where a is a terminal and A, B, C are variables – except B and C may
not be the start variable. In addition, we allow the rule S → ε if
necessary.
I Any CFG can be converted to a CFG in Chomsky Normal Form. The two grammars generate the
same language but may assign different tree structures to the same string.

21/53
Chomsky Hierarchy

22/53
Parse Trees
Parse tree for the derivation of aaaabbbb: the root S expands to a S b, each nested S again
expands to a S b, and the innermost S expands to ε.

I Derivations can also be represented with a parse tree.
I The leaves constitute the yield of the tree.
I Terminal symbols can occur only at the leaves.
I Variables can occur only at the internal nodes.
I The terminals concatenated from left to right give us the string.

23/53
A Grammar for a Fragment of English

Grammar:
S → NP VP
NP → CN | CN PP
VP → CV | CV PP
PP → P NP
CN → DT N
CV → V | V NP
DT → a | the
N → boy | girl | flower | telescope
V → touches | likes | sees | gives
P → with | to

Nomenclature:
I S: Sentence
I NP: Noun Phrase
I CN: Complex Noun
I PP: Prepositional Phrase
I VP: Verb Phrase
I CV: Complex Verb
I P: Preposition
I DT: Determiner
I N: Noun
I V: Verb

24/53
A Grammar for a Fragment of English

Grammar:
S → NP VP
NP → CN | CN PP
VP → CV | CV PP
PP → P NP
CN → DT N
CV → V | V NP
DT → a | the
N → boy | girl | flower | telescope
V → touches | likes | sees | gives
P → with | to

A derivation:
S ⇒ NP VP
  ⇒ CN PP VP
  ⇒ DT N PP VP
  ⇒ a N PP VP
  ⇒ ···
  ⇒ a boy with a flower VP
  ⇒ a boy with a flower CV PP
  ⇒ ···
  ⇒ a boy with a flower sees a girl with a telescope

25/53
English Parse Tree
[S [NP [CN [DT a] [N boy]] [PP [P with] [NP [CN [DT a] [N flower]]]]]
   [VP [CV [V sees] [NP [CN [DT a] [N girl]]]] [PP [P with] [NP [CN [DT a] [N telescope]]]]]]

I This structure is for the interpretation where the boy is seeing with the telescope!

26/53
English Parse Tree
Alternate Structure

[S [NP [CN [DT a] [N boy]] [PP [P with] [NP [CN [DT a] [N flower]]]]]
   [VP [CV [V sees] [NP [CN [DT a] [N girl]] [PP [P with] [NP [CN [DT a] [N telescope]]]]]]]]

I This is for the interpretation where the girl is carrying a telescope.

27/53
Structural Ambiguity

I A set of rules can assign multiple structures to the same string.


I Which rule one chooses determines the eventual structure.
I VP → CV | CV PP
I CV → V | V NP
I NP → CN | CN PP
I · · · [VP [CV sees [NP a girl]] [PP with a telescope]].
I · · · [VP [CV sees] [NP [CN a girl] [PP with a telescope]]].
I (Not all brackets are shown!)

28/53
Some NLP Considerations - Linguistic Grammaticality

I We need to address a wide-range of grammaticality.


I I’ll write the company.
I I’ll write to the company.
I It needs to be washed.
I It needs washed.
I They met Friday to discuss it.
I They met on Friday to discuss it.

29/53
Some NLP Considerations – Getting it Right

I CFGs provide you with a tool set for creating grammars


I Grammars that work well (for a given application)
I Grammars that work poorly (for a given application)
I There is nothing about the theory of CFGs that tells you, a priori, what a “correct”
grammar for a given application looks like
I A good grammar is generally one that:
I Doesn’t over-generate very much (high precision)
I A grammar over-generates when it accepts strings not in the language.
I Doesn’t under-generate very much (high recall)
I A grammar under-generates when it does not accept strings in the language.

30/53
Some NLP Considerations – Why are we Building Grammars?
I Consider:
I Oswald shot Kennedy.
I Oswald, who had visited Russia recently, shot Kennedy.
I Oswald assassinated Kennedy
I Who shot Kennedy?
I Consider
I Oswald shot Kennedy.
I Kennedy was shot by Oswald.
I Oswald was shot by Ruby.
I Who shot Oswald?
I Active/Passive
I Oswald shot Kennedy.
I Kennedy was shot by Oswald.
I Relative clauses
I Oswald who shot Kennedy was shot by Ruby.
I Kennedy whom Oswald shot didn’t shoot anybody.

31/53
Language Myths: Subject

I Myth I: the subject is the first noun phrase in a sentence


I Myth II: the subject is the actor in a sentence
I Myth III: the subject is what the sentence is about
I All of these are often true, but none of them is always true, or tells you what a subject
really is (or how to use it in NLP).

32/53
Subject and Object

I Syntactic (not semantic)


I The batter hit the ball. [subject is semantic agent]
I The ball was hit by the batter. [subject is semantic patient]
I The ball was given a whack by the batter. [subject is semantic recipient]
I George, the key, the wind opened the door.
I Subject ≠ topic (the most important information in the sentence)
I I just married the most beautiful woman in the world.
I Now beans, I like.
I As for democracy, I think it’s the best form of government.
I English subjects
I agree with the verb
I when pronouns, are in nominative case (I/she/he vs. me/her/him)
I English objects
I when pronouns, in accusative case (me, her, him)
I become subjects in passive sentences

33/53
Looking Forward

I CFGs may not be entirely adequate for capturing the syntax of natural languages
I They are almost adequate.
I They are computationally well-behaved (in that you can build relatively efficient parsers for
them, etc.)
I But they are not very convenient as a means for handcrafting a grammar.
I They are not probabilistic. But we will add probabilities to them soon.

34/53
Parsing Context-free Languages

I The Cocke-Younger-Kasami (CYK) algorithm:


I Grammar in Chomsky Normal Form (may not necessarily be linguistically meaningful)
I All trees sanctioned by the grammar can be computed.
I For an input of n words, requires O(n³) work (with a large constant factor dependent on the
grammar size), using a bottom-up dynamic programming approach.
I Earley Algorithm:
I Can handle arbitrary Context-free Grammars
I Parsing is top-down.
I Later.

35/53
The Cocke-Younger-Kasami (CYK) algorithm

I The CYK parsing algorithm determines if w ∈ L(G) for a grammar G in Chomsky


Normal Form
I with some extensions, it can also determine possible structures.
I Assume w ≠ ε (if w = ε, check whether the grammar has the rule S → ε)

36/53
The CYK Algorithm

I Consider w = a1 a2 · · · an , ai ∈ Σ
I Suppose we could cut up the string into two parts u = a1 a2 · · · ai and
v = ai+1 ai+2 · · · an
I Now suppose A ⇒∗ u and B ⇒∗ v and that S → AB is a rule.

A B

← u → ← v →
a1 ai ai+1 an

37/53
The CYK Algorithm
S

A B

← u → ← v →
a1 ai ai+1 an

I Now we apply the same idea to A and B recursively.

A B

C D E F

a1 aj aj+1 ai ai+1 ak ak+1 an


← u1 → ← v1 → ← u2 → ← v2 →

38/53
The CYK Algorithm

A B

C D E F

a1 aj aj+1 ai ai+1 ak ak+1 an


← u1 → ← v1 → ← u2 → ← v2 →

I What is the problem here?


I We do not know what i, j and k are!
I No problem! We can try all possible i’s, j’s and k’s.
I Dynamic programming to the rescue.

39/53
DIGRESSION - Dynamic Programming

I An algorithmic paradigm
I Essentially like divide-and-conquer but subproblems overlap!
I Results of subproblem solutions are reusable.
I Subproblem results are computed once and then memoized
I Used in solutions to many problems
I Length of longest common subsequence
I Knapsack
I Optimal matrix chain multiplication
I Shortest paths in graphs with negative weights (Bellman-Ford Alg.)

40/53
(Back to) The CYK Algorithm

I Let w = a1 a2 · · · an .
I We define
I wi,j = ai · · · aj (the substring between positions i and j)
I Vi,j = {A ∈ V | A ⇒∗ wi,j } (j ≥ i) (all variables which derive wi,j )
I w ∈ L(G) iff S ∈ V1,n
I How do we compute Vi, j (j ≥ i)?

41/53
The CYK Algorithm

I How do we compute Vi, j ?


I Observe that A ∈ Vi,i if A → ai is a rule.
I So Vi, i can easily be computed for 1 ≤ i ≤ n by an inspection of w and the grammar.

I A ⇒∗ wi,j if
I there is a production A → BC, and
I B ⇒∗ wi,k and C ⇒∗ wk+1,j for some k, i ≤ k < j.
I So
Vi,j = ⋃_{i≤k<j} {A | A → BC and B ∈ Vi,k and C ∈ Vk+1,j }

42/53
The CYK Algorithm
Vi,j = ⋃_{i≤k<j} {A | A → BC and B ∈ Vi,k and C ∈ Vk+1,j }

I Compute in the following order:



↓ V1,1 V2,2 V3,3 ··· ··· ··· Vn,n
V1,2 V2,3 V3,4 ··· ··· Vn−1,n
V1,3 V2,4 V3,5 ··· Vn−2,n
···
V1,n−1 V2,n
V1,n
I For example to compute V2,4 one needs V2,2 and V3,4 , and then V2,3 and V4,4 all of
which are computed earlier!

43/53
The CYK Algorithm

for i = 1 to n do                          // initialization: spans of length 1
    Vi,i = {A | A → a is a rule and wi,i = a}
for l = 2 to n do                          // l = length of the span
    for i = 1 to n − l + 1 do              // i = start of the span
        j = i + l − 1                      // j = end of the span
        Vi,j = {}
        for k = i to j − 1 do              // k = split point
            Vi,j = Vi,j ∪ {A | A → BC is a rule and B ∈ Vi,k and C ∈ Vk+1,j }
I This algorithm has 3 nested loops with the bound for each being O(n). So the overall time/work
is O(n³). (A runnable sketch follows below.)
I The size of the grammar factors in as a constant factor as it is independent of n – the length of
the string.
I Certain special CFGs have subcubic recognition algorithms.
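A runnable sketch of the recognizer version of this algorithm; representing the CNF grammar as two dictionaries (lexical rules and binary rules) is an assumption for illustration. The example grammar is the one used on the next slide:

def cyk_recognize(words, lexical, binary, start="S"):
    # lexical: terminal -> set of variables A with a rule A -> terminal
    # binary:  (B, C)   -> set of variables A with a rule A -> B C
    n = len(words)
    V = {}
    for i in range(1, n + 1):                         # initialization: spans of length 1
        V[(i, i)] = set(lexical.get(words[i - 1], set()))
    for length in range(2, n + 1):                    # longer spans, shortest first
        for i in range(1, n - length + 2):
            j = i + length - 1
            V[(i, j)] = set()
            for k in range(i, j):                     # split point
                for B in V[(i, k)]:
                    for C in V[(k + 1, j)]:
                        V[(i, j)] |= binary.get((B, C), set())
    return start in V[(1, n)]

# S -> AB, A -> BB | a, B -> AB | b
lexical = {"a": {"A"}, "b": {"B"}}
binary = {("A", "B"): {"S", "B"}, ("B", "B"): {"A"}}
print(cyk_recognize(list("aabbb"), lexical, binary))  # True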

44/53
The CYK Algorithm in Action

I Consider the following grammar in CNF


S → AB
A → BB | a
B → AB | b
I The input string is w = aabbb
I i→ 1 2 3 4 5
a a b b b
{A} {A} {B} {B} {B}
{} {S, B} {A} {A}
{S, B} {A} {S, B}
{A} {S, B}
{S, B}
I Since S ∈ V1,5 , this string is in L(G).

45/53
The CYK Algorithm in Action

I Consider the following grammar in CNF


S → AB
A → BB | a
B → AB | b
I Let us see how we compute V2,4
I We need to look at V2,2 and V3,4
I We need to look at V2,3 and V4,4
i→ 1 2 3 4 5
a a b b b
{A} {A} {B} {B} {B}
{} {S, B} {A} {A}
{S, B} {A} {S, B}
{A} {S, B}
{S, B}

46/53
A CNF Grammar for a Fragment of English
Original grammar:
S → NP VP
NP → CN | CN PP
VP → CV | CV PP
PP → P NP
CN → DT N
CV → V | V NP
DT → a | the
N → boy | girl | flower | telescope
V → touches | likes | sees | gives
P → with | to

Grammar in Chomsky Normal Form:
S → NP VP
NP → CN PP
NP → DT N
VP → CV PP
VP → V NP
VP → touches | likes | sees | gives
PP → P NP
CN → DT N
CV → V NP
CV → touches | likes | sees | gives
DT → a | the
N → boy | girl | flower | telescope
V → touches | likes | sees | gives
P → with | to

47/53
English Parsing Example with CYK

Grammar (CNF):
S → NP VP
NP → CN PP
NP → DT N
VP → CV PP
VP → V NP
VP → touches | likes | sees | gives
PP → P NP
CN → DT N
CV → V NP
CV → touches | likes | sees | gives
DT → a | the
N → boy | girl | flower | telescope
V → touches | likes | sees | gives
P → with | to

Parse table:
i→       1         2       3             4         5
         the       boy     sees          a         girl
         {DT}      {N}     {V, CV, VP}   {DT}      {N}
         {CN, NP}  {}      {}            {CN, NP}
         {S}       {}      {CV, VP}
         {}        {}
         {S} ✓

48/53
Some Languages are NOT Context-free

I L = {aⁿbⁿcⁿ | n ≥ 0} is not a context-free language.


I This can be shown with the Pumping Lemma for Context-free Languages.
I It is however a context-sensitive language.
I Cross-serial Dependencies1

I L = {aⁿbᵐcⁿdᵐ | n, m ≥ 0} is not a context-free language but is considered mildly
context-sensitive.
I So is L = {x aⁿ y bᵐ z cⁿ w dᵐ u | n, m ≥ 0}

1 Graphics by Christian Nassif-Haynes from commons.wikimedia.org/w/index.php?curid=28274222
49/53


Are CFGs enough to model natural languages?

I Swiss German has the following construct:


dative-NPp accusative-NPq dative-taking-Vp accusative-taking-Vq

I Jan säit das mer em Hans es huus hälfed aastriiche.


I Jan says that we Hans the house helped paint.
I “Jan says that we helped Hans paint the house.”

I Jan säit das mer d’chind em Hans es huus haend wele laa hälfe aastriiche.
I Jan says that we the children Hans the house have wanted to let help paint.
I “Jan says that we have wanted to let the children help Hans paint the house.”

50/53
Is Swiss German Context-free?

I L1 = { Jan säit das mer (d’chind)∗ (em Hans)∗ es huus haend wele (laa)∗ (hälfe)∗
aastriiche.}
I L2 = { Swiss German }
I L1 ∩ L2 = { Jan säit das mer (d’chind)ⁿ (em Hans)ᵐ es huus haend wele (laa)ⁿ
(hälfe)ᵐ aastriiche.} ≡ L = {x aⁿ y bᵐ z cⁿ w dᵐ u | n, m ≥ 0}

51/53
English “Respectively” Construct

I Alice, Bob and Carol will have a juice, a tea and a coffee, respectively.
I Again mildly context-sensitive!

52/53
Closing Remarks

I Natural languages are mildly context sensitive.


I But CFGs might be enough
I But even RGs might be enough
I if you have very big grammars and
I don’t really care about parsing.

53/53
11-411
Natural Language Processing
Treebanks and
Probabilistic Parsing

Kemal Oflazer

Carnegie Mellon University in Qatar

1/34
Probabilistic Parsing with CFGs

I The basic CYK Algorithm is not probabilistic: It builds a table from which all
(potentially exponential number of) parse trees can be extracted.
I Note that while computing the table needs O(n3 ) work, computing all trees could require
exponential work!
I Computing all trees is not necessarily useful either. How do you know which one is the
correct or best tree?
I We need to incorporate probabilities in some way.
I But where do we get them?

2/34
Probabilistic Context-free Grammars

I A probabilistic CFG (PCFG) is a CFG


I A set of nonterminal symbols V
I A set of terminal symbols Σ
I A set R of rules of the sort X → Y where X ∈ V and Y ∈ (V ∪ Σ)∗ .
I If you need to use CKY, Chomsky Normal Form is a special case with rules only like
I X → YZ
I X→a
where X, Y, Z ∈ V and a ∈ Σ
with a probability distribution over the rules:
I For each X ∈ V , there is a probability distribution p(X → Y) over the rules in R with X on the
left-hand side.
I For every X:     ∑_{X→Y ∈ R} p(X → Y) = 1

3/34
PCFG Example

Write down the start Symbol S


Score:

4/34
PCFG Example

Aux NP VP

Choose a rule from the S distribution. Here S → Aux NP VP


Score:
p(Aux NP VP | S)

5/34
PCFG Example

Aux NP VP

does

Choose a rule from the Aux distribution. Here Aux → does


Score:
p(Aux NP VP | S) · p(does | Aux)

6/34
PCFG Example

Aux NP VP

does Det N

Choose a rule from the NP distribution. Here NP → Det Noun


Score:
p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP)

7/34
PCFG Example

Aux NP VP

does Det N

this

Choose a rule from the Det distribution. Here Det → this


Score:

p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP) · p(this | Det)

8/34
PCFG Example

Aux NP VP

does Det N

this flight

Choose a rule from the N distribution. Here N → flight


Score:

p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP) · p(this | Det)·

p(flight | N)

9/34
PCFG Example

Aux NP VP

does Det N Verb NP

this flight

Choose a rule from the VP distribution. Here VP → Verb NP


Score:

p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP) · p(this | Det)·

p(flight | N) · p(Verb NP | VP)

10/34
PCFG Example

Aux NP VP

does Det N Verb NP

this flight include

Choose a rule from the Verb distribution. Here Verb → include


Score:

p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP) · p(this | Det)·

p(flight | N) · p(Verb NP | VP) · p(include | V)

11/34
PCFG Example

Aux NP VP

does Det N Verb NP

this flight include Det N

Choose a rule from the NP distribution. Here NP → Det N


Score:

p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP) · p(this | Det)·

p(flight | N) · p(Verb NP | VP) · p(include | V) · p(Det N | NP)

12/34
PCFG Example
S

Aux NP VP

does Det N Verb NP

this flight include Det N

Choose a rule from the Det distribution. Here Det → a


Score:

p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP) · p(this | Det)·

p(flight | N) · p(Verb NP | VP) · p(include | V) · p(Det N | NP) · p(a | Det)


13/34
PCFG Example
S

Aux NP VP

does Det N Verb NP

this flight include Det N

a meal

Choose a rule from the N distribution. Here N → meal


Score:

p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP) · p(this | Det) · p(flight | N)


p(Verb NP | VP) · p(include | V) · p(Det N | NP) · p(a | Det) · p(meal | N)
14/34
Noisy Channel Model of Parsing
 
 
 
Source → T → Vocal Tract/Typing → X     (A boy with a flower sees . . . )

I “I have a tree of the sentence I want to utter in my mind; by the time I utter it, only the
words come out.”
I The PCFG defines the source model.
I The channel is deterministic: it erases everything except the leaves!
I If I observe a sequence of words comprising a sentence, what is the best tree
structure it corresponds to?
I Find the tree t̂ = arg max_{t: trees with yield x} p(t | x)
I How do we set the probabilities p(right hand side | left hand side)?
I How do we decode/parse?
15/34
Probabilistic CYK

I Input
I a PCFG (V, S, Σ, R, p(∗ | ∗)) in Chomsky Normal Form.
I a sentence x of length n words.

I Output
I t̂ = arg max_{t∈Tx} p(t | x) (if x is in the language of the grammar)
I Tx : all trees with yield x.

16/34
Probabilistic CYK

I We define si:j (V ) as the maximum probability for deriving the fragment


. . . xi , . . . , xj . . . from the nonterminal V ∈ V .
I We use CYK dynamic programming to compute the best score s1:n (S).
I Base case: for i ∈ {1, . . . , n} and for each V ∈ V :

si:i (V ) = p(xi | V )
I Inductive case: For each i, j with 1 ≤ i < j ≤ n and each V ∈ V:

si:j (V ) = max_{L,R∈V, i≤k<j} p(L R | V ) · si:k (L) · s(k+1):j (R)

I Solution:
s1:n (S) = max_{t∈Tx} p(t)

17/34
Parse Chart

i→ 1 2 3 4 5
the boy sees a girl
s1:1 (∗) s2:2 (∗) s3:3 (∗) s4:4 (∗) s5:5 (∗)
s1:2 (∗) s2:3 (∗) s3:4 (∗) s4:5 (∗)
s1:3 (∗) s2:4 (∗) s3:5 (∗)
s1:4 (∗) s2:5 (∗)
s1:5 (∗)

I Again, each entry is a table, mapping each nonterminal V to si:j (V ), the maximum
probability for deriving the fragment . . . xi , . . . , xj . . . from the nonterminal V .

18/34
Remarks

I Work and Space requirements? O(|R|n³) work, O(|V|n²) space.


I Recovering the best tree? Use backpointers.
I Note that there may be an exponential number of possible trees, if you want to enumerate
some/all trees.
I Probabilistic Earley’s Algorithm does NOT require a Chomsky Normal Form grammar.

19/34
More Refined Models

Starting Point

20/34
More Refined Models
Parent Annotation

I Increase the “vertical” Markov Order

p(children | parent, grandparent)


21/34
More Refined Models
Headedness

I Suggests “horizontal” Markovization:


p(children | parent) = p(head | parent) · ∏_i p(ith sibling | head, parent)
22/34
More Refined Models
Lexicalization

I Each node shares a lexical head with its head child.


23/34
Where do the Probabilities Come from?

I Building a CFG for a natural language by hand is really hard.


I One needs lots of categories to make sure all and only grammatical sentences are
included.
I Categories tend to start exploding combinatorially.
I Alternative grammar formalisms are typically used for manual grammar construction;
these are often based on constraints and a powerful algorithmic tool called unification.
I Standard approach today is to build a large-scale treebank, a database manually
constructed parse-trees of real-world sentences.
I Extract rules from the treebank.
I Estimate probabilities from the treebank.
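A minimal sketch of extracting rule counts and probabilities from a toy treebank; the nested-tuple tree encoding and the tiny example tree are assumptions for illustration, not the actual Penn Treebank format:

from collections import Counter

def count_rules(tree, counts):
    # tree = (label, children), where children is a list of subtrees or a single word string
    label, children = tree
    if isinstance(children, str):                     # lexical rule, e.g. (NN, "light")
        counts[(label, (children,))] += 1
        return
    counts[(label, tuple(child[0] for child in children))] += 1
    for child in children:
        count_rules(child, counts)

def estimate_pcfg(treebank):
    counts = Counter()
    for tree in treebank:
        count_rules(tree, counts)
    lhs_totals = Counter()
    for (lhs, rhs), c in counts.items():
        lhs_totals[lhs] += c
    return {(lhs, rhs): c / lhs_totals[lhs] for (lhs, rhs), c in counts.items()}

# A toy one-tree "treebank", purely for illustration:
t = ("S", [("NP", [("DT", "the"), ("NN", "boy")]),
           ("VP", [("VB", "sees"), ("NP", [("DT", "a"), ("NN", "girl")])])])
print(estimate_pcfg([t])[("NP", ("DT", "NN"))])       # 1.0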

24/34
Penn Treebank

I Large database of hand-annotated parse trees of English.


I Mostly Wall Street Journal news text.
I About 42,500 sentences: typically about 40,000 used for statistical modeling and
training and 2500 for testing
I WSJ section has about ≈1M words, ≈ 1M non-lexical rules boiling down to 17,500
distinct rules.
I https://en.wikipedia.org/wiki/Treebank lists tens of treebanks built for
many different languages over the last two decades.
I http://universaldependencies.org/ lists tens of treebanks built using the
Universal Dependencies framework.

25/34
Example Sentence from Penn Treebank

26/34
Example Sentence Encoding from Penn Treebank

27/34
More PTB Trees

28/34
More PTB Trees

29/34
Treebanks as Grammars

I You can now compute rule


probabilities from the counts
of these rules e.g.,
I p(VBN PP | VP), or
I p(light | NN).

30/34
Interesting PTB Rules

I VP → VBP PP PP PP PP PP ADVP PP
I This mostly happens because we go from football in the fall to lifting in the winter to
football again in the spring.
I NP → DT JJ JJ VBG NN NNP NNP FW NNP
I The state-owned industrial holding company Instituto Nacional de Industria . . .

31/34
Some Penn Treebank Rules with Counts

32/34
Parser Evaluation
I Represent a parse tree as a collection of tuples
{(ℓ1 , i1 , j1 ), (ℓ2 , i2 , j2 ), . . . , (ℓm , im , jm )} where
I ℓk is the nonterminal labeling the kth phrase.
I ik is the index of the first word in the kth phrase.
I jk is the index of the last word in the kth phrase.
I For example, the tree
[S [Aux does] [NP [Det this] [N flight]] [VP [Verb include] [NP [Det a] [N meal]]]]
becomes
{(S, 1, 6), (NP, 2, 3), (VP, 4, 6), . . . , (Aux, 1, 1), . . . , (Noun, 6, 6)}
I Convert the gold-standard tree and the system-hypothesized tree into this representation,
then estimate precision, recall, and F1 (a small sketch follows below).
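A small sketch of this evaluation; the hypothesized and gold constituent sets below are the ones from the tree-comparison example on the next slide:

def evaluate(hypothesis, gold):
    # hypothesis, gold: sets of (label, i, j) tuples
    correct = len(hypothesis & gold)
    precision = correct / len(hypothesis)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

shared = {("NP", 1, 1), ("S", 1, 7), ("VP", 2, 7), ("PP", 5, 7), ("NP", 6, 7), ("Nominal", 4, 4)}
hyp = shared | {("NP", 3, 7), ("Nominal", 4, 7)}
gold = shared | {("VP", 2, 4), ("NP", 3, 4)}
print(evaluate(hyp, gold))                            # (0.75, 0.75, 0.75)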
33/34
Tree Comparison Example

I In both trees: {(NP, 1, 1), (S, 1, 7), (VP, 2, 7), (PP, 5, 7), (NP, 6, 7), (Nominal, 4, 4)}
I In the left (hypothesized) tree: {(NP, 3, 7), (Nominal, 4, 7)}
I In the right (gold) tree: {(VP, 2, 4), (NP, 3, 4)}
I P = 6/8, R = 6/8
34/34
11-411
Natural Language Processing
Earley Parsing

Kemal Oflazer

Carnegie Mellon University in Qatar

1/29
Earley Parsing

I Remember that CKY parsing works only for grammar in Chomsky Normal Form
(CNF)
I Need to convert grammar to CNF.
I The structure may not necessarily be “natural”.
I CKY is bottom-up – may be doing unnecessary work.
I Earley algorithm allows arbitrary CFGs.
I So no need to convert your grammar.
I Earley algorithm is a top-down algorithm.

2/29
Earley Parsing
I The Earley parser fills a table (sometimes called a chart) in a single sweep over the
input.
I For an n word sentence, the table is of size n + 1.
I Table entries represent
I In-progress constituents
I Predicted constituents.
I Completed constituents and their locations in the sentence

3/29
Table Entries

I Table entries are called states and are represented with dotted-rules.

I S → •VP a VP is predicted

NP → Det • Nominal an NP is in progress

VP → V NP• a VP has been found

4/29
States and Locations

S → •VP[0, 0] a VP is predicted at the start of the sentence

NP → Det • Nominal[1, 2] an NP is in progress; Det goes from 1 to 2

VP → V NP • [0, 3] a VP has been found starting at 0 and ending at 3

5/29
The Earley Table Layout

Column 0 | Column 1 | . . . | Column n
Each column i holds the states and their locations for position i.

I Words are positioned between columns.


I w1 is positioned between columns 0 and 1
I wn is positioned between columns n − 1 and n.

6/29
Earley – High-level Aspects

I As with most dynamic programming approaches, the answer is found by looking in the
table in the right place.
I In this case, there should be an S state in the final column that spans from 0 to n and
is complete. That is,
I S → α • [0, n]
I If that is the case, you are done!
I So sweep through the table from 0 to n
I New predicted states are created by starting top-down from S
I New incomplete states are created by advancing existing states as new constituents are
discovered.
I New complete states are created in the same way.

7/29
Earley – High-level Aspects

1. Predict all the states you can upfront


2. Read a word
2.1 Extend states based on matches
2.2 Generate new predictions
2.3 Go to step 2
3. When you are out of words, look at the chart to see if you have a winner.

8/29
Earley – Main Functions: Predictor

I If you have a state spanning [i, j]


I and is looking for a constituent B,
I then enqueue new states that will search for a B, starting at position j.

9/29
Earley– Prediction

I Given A → α • Bβ [i, j] (for example ROOT → •S [0, 0])


I and the rule B → γ (for example S → VP)
I create B → •γ [j, j] (for example, S → •VP [0, 0])

ROOT → •S[0, 0]
S → •NP VP[0, 0]
S → •VP[0, 0]
...
VP → •V NP[0, 0]
...
NP → •DT N[0, 0]

10/29
Earley – Main Functions: Scanner

I If you have a state spanning [i, j]
I and it is looking for a word with part-of-speech B,
I and one of the parts-of-speech of the next word w_{j+1} is B,
I then enqueue a state of the sort B → w_{j+1} • [j, j + 1] in chart position j + 1

11/29
Earley– Scanning

I Given A → α • Bβ [i, j] (for example VP → •V NP [0, 0])


I and the rule B → wj+1 (for example V → book)
I create B → wj+1 • [j, j + 1] (for example, V → book • [0, 1])

ROOT → •S[0, 0] V → book • [0, 1]


S → •NP VP[0, 0]
S → •VP[0, 0]
...
VP → •V NP[0, 0]
...
NP → •DT N[0, 0]

12/29
Earley – Main Functions: Completer

I If you have a completed state spanning [j, k] with B as the left hand side.
I then, for each state in chart position j (with some span [i, j], that is immediately
looking for a B),
I move the dot to after B,
I extend the span to [i, k]
I then enqueue the updated state in chart position k.

13/29
Earley– Completion

I Given A → α • Bβ [i, j] (for example VP → •V NP [0, 0])


I and B → γ • [j, k] (for example V → book • [0, 1])
I create A → αB • β [i, k] (for example, VP → V • NP [0, 1])

ROOT → •S[0, 0] V → book • [0, 1]


S → •NP VP[0, 0] VP → V • NP [0, 1]
S → •VP[0, 0]
...
VP → •V NP[0, 0]
...
NP → •DT N[0, 0]

14/29
Earley – Main Functions: Enqueue

I Just add the given state to the chart entry if it is not already there.

15/29
The Earley Parser

16/29
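The complete algorithm appears as a figure on the slide above (not reproduced here). As a rough stand-in, below is a minimal recognizer-only sketch of the three operations; the grammar/lexicon dictionaries are an assumed representation, epsilon rules are not handled, and no backpointers are kept for recovering trees:

from collections import namedtuple

State = namedtuple("State", "lhs rhs dot start")

def earley_recognize(words, grammar, lexicon, start="S"):
    # grammar: nonterminal -> list of right-hand sides (tuples of categories)
    # lexicon: word -> set of part-of-speech categories
    n = len(words)
    chart = [[] for _ in range(n + 1)]
    def enqueue(state, k):
        if state not in chart[k]:
            chart[k].append(state)
    enqueue(State("ROOT", (start,), 0, 0), 0)
    for k in range(n + 1):
        i = 0
        while i < len(chart[k]):
            st = chart[k][i]; i += 1
            if st.dot < len(st.rhs):
                nxt = st.rhs[st.dot]
                if nxt in grammar:                                        # PREDICTOR
                    for rhs in grammar[nxt]:
                        enqueue(State(nxt, rhs, 0, k), k)
                elif k < n and nxt in lexicon.get(words[k], set()):       # SCANNER
                    enqueue(State(nxt, (words[k],), 1, k), k + 1)
            else:                                                         # COMPLETER
                for st2 in chart[st.start]:
                    if st2.dot < len(st2.rhs) and st2.rhs[st2.dot] == st.lhs:
                        enqueue(State(st2.lhs, st2.rhs, st2.dot + 1, st2.start), k)
    return any(s.lhs == "ROOT" and s.dot == 1 and s.start == 0 for s in chart[n])

grammar = {"S": [("VP",)], "VP": [("V", "NP")], "NP": [("Det", "N")]}
lexicon = {"book": {"V"}, "that": {"Det"}, "flight": {"N"}}
print(earley_recognize(["book", "that", "flight"], grammar, lexicon))     # True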
Extended Earley Example

I The input: 0 Book 1 that 2 flight 3 (the numbers mark the positions between words)
I We should find a completed state at chart position 3
I with left hand side S and is spanning [0, 3]

17/29
Extended Earley Example Grammar

18/29
Extended Earley Example

19/29
Extended Earley Example

20/29
Extended Earley Example

21/29
Extended Earley Example

22/29
Extended Earley Example

S37 is due to S11 and S33!

23/29
Final Earley Parse

24/29
Comments

I Work is O(n³) – the analysis is a bit trickier than for CKY.
I Space is O(n²)
I Big grammar-related constant
I Backpointers can help recover trees.

25/29
Probabilistic Earley Parser

I So far we had no mention of any probabilities or choosing the best parse.


I Rule probabilities can be estimated from treebanks as usual.
I There are algorithms that resemble the Viterbi algorithm for HMMs for computing the
best parse incrementally from left to right in the chart.
I Stolcke, Andreas, “An efficient probabilistic context-free parsing algorithm that computes
prefix probabilities”, Computational Linguistics, Volume 21, No:2, 1995
I Beyond our scope!

26/29
General Chart Parsing

I CKY and Earley each statically determine order of events, in code


I CKY fills out triangle table starting with words on the diagonal
I Earley marches through instances of grammar rules
I Chart parsing puts edges on an agenda, allowing an arbitrary ordering policy,
separate from the code.
I Generalizes CKY, Earley, and others

27/29
Implementing Parsing as Search

Agenda = {state0}
while (Agenda not empty)
    s = pop a state from Agenda
    if s is a success-state: return s     // we have a parse
    else if s is not a failure-state:
        generate new states from s
        push the new states onto Agenda
return nil                                // no parse
I Fundamental Rule of Chart Parsing: if you can combine two contiguous edges to
make a bigger one, do it.
I Akin to the Completer function in Earley.
I How you interact with the agenda is called a strategy.

28/29
Is Ambiguity Solved?

I Time flies like an arrow.


I Fruit flies like a banana.

I Time/N flies/V like an arrow.


I Time/V flies/N like (you time) an arrow.
I Time/V flies/N like an arrow (times flies).
I Time/V flies/N (that are) like an arrow.
I [Time/N flies/N] like/V an arrow!
I ...

29/29
11-411
Natural Language Processing
Dependency Parsing

Kemal Oflazer

Carnegie Mellon University in Qatar

1/47
Dependencies

Informally, you can think of dependency structures as a transformation of


phrase-structures that
I maintains the word-to-word relationships induced by lexicalization,

I adds labels to these relationships, and

I eliminates the phrase categories

There are linguistic theories built on dependencies, as well as treebanks based on
dependencies:
I Czech Treebank

I Turkish Treebank

2/47
Dependency Tree: Definition
Let x = [x1 , . . . , xn ] be a sentence. We add a special ROOT symbol as “x0 ”.

A dependency tree consists of a set of tuples [p, c, `] where


I p ∈ {0, . . . , n} is the index of a parent.
I c ∈ {1, . . . , n} is the index of a child.
I ` ∈ L is a label.

Different annotation schemes define different label sets L, and different constraints on the
set of tuples. Most commonly:
I The tuple is represented as a directed edge from xp to xc with label `.

I The directed edges form a directed tree with x₀ as the root (sometimes denoted as
ROOT ).

3/47
Example

NP VP

Pronoun Verb NP

we wash Determiner Noun

our cats

Phrase-structure tree

4/47
Example

NP VP

Pronoun Verb NP

we wash Determiner Noun

our cats

Phrase-structure tree with phrase-heads

5/47
Example

Swash

NPwe VPwash

Pronounwe Verbwash NPcats

we wash Determinerour Nouncats

our cats

Phrase-structure tree with phrase-heads, lexicalized

6/47
Example

ROOT

we wash our cats

“Bare bones” dependency tree.

7/47
Example

ROOT

we wash our cats who stink

8/47
Example

ROOT

we vigorously wash our cats who stink

9/47
Labels

ROOT

POBJ

SUBJ DOBJ PREP

kids saw birds with fish

Key dependency relations captured in the labels include:


I Subject

I Direct Object

I Indirect Object

I Preposition Object

I Adjectival Modifier

I Adverbial Modifier

10/47
Problem: Coordination Structures

ROOT

we vigorously wash our cats and dogs who stink

This is probably the most important problem with dependency syntax.

11/47
Coordination Structures: Proposal 1

ROOT

we vigorously wash our cats and dogs who stink

Make the first conjunct head?

12/47
Coordination Structures: Proposal 2

ROOT

we vigorously wash our cats and dogs who stink

Make the coordinating conjunction the head?

13/47
Coordination Structures: Proposal 3

ROOT

we vigorously wash our cats and dogs who stink

Make the second conjunct the head?

14/47
Dependency Trees ROOT

we vigorously wash our cats and dogs who stink

ROOT

we vigorously wash our cats and dogs who stink

ROOT

we vigorously wash our cats and dogs who stink

What is a common property among these trees?
15/47


Discontinuous Constituents / Crossing Arcs

ROOT

A hearing is scheduled on this issue today

16/47
Dependencies and Grammar

I Context-free grammars can be used to encode dependency structures.


I For every head word and group of its dependent children:

Nhead → Nleftmost−sibling . . . Nhead . . . Nrightmost−sibling

I And for every v ∈ V : Nv → v and S → Nv


I Such a grammar can produce only projective trees, which are (informally) trees in
which the arcs don’t cross.

17/47
Three Approaches to Dependency Parsing

1. Dynamic Programming with bilexical dependency grammars


2. Transition-based parsing with a stack
3. Chu-Liu-Edmonds algorithm for the maximum spanning tree

18/47
Transition-based Parsing
I Process x once, from left to right, making a sequence of greedy parsing decisions.
I Formally, the parser is a state machine (not a finite-state machine) whose state is
represented by a stack S and a buffer B.
I Initialize the buffer to contain x and the stack to contain the ROOT symbol.
Buffer B

we
Stack S vigorously
wash
ROOT our
cats
who
stink

I We can take one of three actions:


I SHIFT the word at the front of the buffer B onto the stack S.
I RIGHT-ARC: u = pop(S); v = pop(S); push(S, v → u).
I LEFT-ARC: u = pop(S); v = pop(S); push(S, v ← u).
(For labeled parsing, add labels to the LEFT- ARC and RIGHT- ARC transitions; a small sketch of these transitions follows below.)
I During parsing, apply a classifier to decide which transition to take next, greedily. No
backtracking!
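A minimal sketch of the transition system itself (the classifier that chooses the actions is omitted); representing words by their indices, with 0 for ROOT, and feeding in the oracle action sequence for the running example are choices made for illustration:

def run_transitions(n_words, actions):
    # Returns a list of (head, child) pairs over word indices 1..n_words; 0 is ROOT.
    stack, buffer, arcs = [0], list(range(1, n_words + 1)), []
    for action in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":          # u = pop, v = pop, arc u -> v, push head u
            u, v = stack.pop(), stack.pop()
            arcs.append((u, v))
            stack.append(u)
        elif action == "RIGHT-ARC":         # u = pop, v = pop, arc v -> u, push head v
            u, v = stack.pop(), stack.pop()
            arcs.append((v, u))
            stack.append(v)
    return arcs

# we(1) vigorously(2) wash(3) our(4) cats(5) who(6) stink(7)
actions = ["SHIFT", "SHIFT", "SHIFT", "LEFT-ARC", "LEFT-ARC", "SHIFT", "SHIFT",
           "LEFT-ARC", "SHIFT", "SHIFT", "RIGHT-ARC", "RIGHT-ARC", "RIGHT-ARC", "RIGHT-ARC"]
print(run_transitions(7, actions))
# [(3, 2), (3, 1), (5, 4), (6, 7), (5, 6), (3, 5), (0, 3)]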
19/47
Transition-based Parsing Example

Buffer B

we
Stack S vigorously
wash
ROOT our
cats
who
stink

Actions:

20/47
Transition-based Parsing Example

Buffer B

Stack S vigorously
wash
we our
ROOT cats
who
stink

Actions: SHIFT

21/47
Transition-based Parsing Example

Buffer B
Stack S
wash
vigorously our
we cats
ROOT who
stink

Actions: SHIFT SHIFT

22/47
Transition-based Parsing Example

Stack S Buffer B

wash our
vigorously cats
we who
ROOT stink

Actions: SHIFT SHIFT SHIFT

23/47
Transition-based Parsing Example

Stack S
Buffer B

our
vigorously wash cats
who
we stink
ROOT

Actions: SHIFT SHIFT SHIFT LEFT- ARC

24/47
Transition-based Parsing Example

Stack S
Buffer B

our
cats
we vigorously wash who
stink
ROOT

Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC

25/47
Transition-based Parsing Example

Stack S
Buffer B
our
cats
who
we vigorously wash stink
ROOT

Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT

26/47
Transition-based Parsing Example

Stack S

cats
Buffer B
our
who
stink
we vigorously wash
ROOT

Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT

27/47
Transition-based Parsing Example

Stack S

our cats Buffer B

who
stink

we vigorously wash
ROOT

Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT LEFT- ARC

28/47
Transition-based Parsing Example
Stack S

who

our cats Buffer B

stink

we vigorously wash
ROOT

Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT LEFT- ARC SHIFT

29/47
Transition-based Parsing Example
Stack S

stink
who

our cats Buffer B

we vigorously wash
ROOT

Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT LEFT- ARC SHIFT SHIFT

30/47
Transition-based Parsing Example
Stack S

who stink

Buffer B
our cats

we vigorously wash
ROOT

Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT LEFT- ARC SHIFT SHIFT
RIGHT- ARC
31/47
Transition-based Parsing Example

Stack S

our cats who stink Buffer B

we vigorously wash
ROOT

Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT LEFT- ARC SHIFT SHIFT
RIGHT- ARC RIGHT- ARC

32/47
Transition-based Parsing Example

Stack S

Buffer B

we vigorously wash our cats who stink


ROOT

Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT LEFT- ARC SHIFT SHIFT
RIGHT- ARC RIGHT- ARC RIGHT- ARC

33/47
Transition-based Parsing Example

Stack S

ROOT

Buffer B

we vigorously wash our cats who stink

Actions: SHIFT SHIFT SHIFT LEFT- ARC LEFT- ARC SHIFT SHIFT LEFT- ARC SHIFT SHIFT
RIGHT- ARC RIGHT- ARC RIGHT- ARC RIGHT- ARC

34/47
The Core of Transition-based Parsing

I At each iteration, choose among {SHIFT, RIGHT- ARC, LEFT- ARC}.


I Actually, among all L-labeled variants of RIGHT- and LEFT- ARC.
I Features can come from S, B, and the history of past actions – usually there is no
decomposition into local structures.
I Training data: Dependency treebank trees converted into “oracle” transition
sequences.
I These transition sequences give the right tree,
I 2 · n pairs: ⟨state, correct transition⟩.
I Each word gets SHIFTed once and participates as a child in one ARC.

35/47
Transition-based Parsing: Remarks

I Can also be applied to phrase-structure parsing. Keyword: “shift-reduce” parsing.


I The algorithm for making decisions doesn’t need to be greedy; can maintain multiple
hypotheses.
I e.g., beam search
I Potential flaw: the classifier is typically trained under the assumption that previous
classification decisions were all correct. As yet, no principled solution to this problem,
but there are approximations based on “dynamic oracles”.

36/47
Dependency Parsing Evaluation

I Unlabeled attachment score: Did you identify the head and the dependent
correctly?
I Labeled attachment score: Did you identify the head and the dependent AND the
label correctly?
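A small sketch of computing these two scores, assuming each analysis is given as a mapping from a dependent's index to its (head index, label) pair; the toy numbers are purely illustrative:

def attachment_scores(gold, predicted):
    # gold, predicted: dicts child_index -> (head_index, label), one entry per word
    n = len(gold)
    uas = sum(predicted[c][0] == gold[c][0] for c in gold) / n   # unlabeled attachment score
    las = sum(predicted[c] == gold[c] for c in gold) / n         # labeled attachment score
    return uas, las

gold = {1: (3, "SUBJ"), 2: (3, "MOD"), 3: (0, "ROOT")}
pred = {1: (3, "SUBJ"), 2: (3, "OBJ"), 3: (0, "ROOT")}
print(attachment_scores(gold, pred))                  # (1.0, 0.666...)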

37/47
Dependency Examples from Other Languages

38/47
Dependency Examples from Other Languages

Subj

Poss Mod
Det Mod Mod Loc Adj Mod

Bu okul+da +ki öğrenci+ler+in en akıl +lı +sı şura+da dur +an küçük kız +dır

bu okul öğrenci en akıl şura dur küçük kız


+Det +Noun +Adj +Noun +Adv +Noun +Adj +Noun +Noun +Verb +Adj +Adj +Noun +Verb
+A3sg +A3pl +A3sg +With +Zero +A3sg +Pos +Prespart +A3sg +Zero
+Pnon +Pnon +Pnon +A3sg +Pnon +Pnon +Pres
+Loc +Gen +Nom +P3sg +Loc +Nom +Cop
+Nom +A3sg
This school-at+that-is student-s-' most intelligence+with+of there stand+ing little girl+is
The most intelligent of the students in this school is the little girl standing there.

39/47
Dependency Examples from Other Languages
<S>
<W IX=1 LEM="bu" MORPH="bu" IG=[(1, "bu+Det")] REL=[(3,1,(DETERMINER)]>
Bu </W>
<W IX=2 LEM="eski"’ MORPH="eski" IG=[(1, "eski+Adj")]
REL=[3,1,(MODIFIER)]> eski> </W>
<W IX=3 LEM="bahçe" MORPH="bahçe+DA+ki" IG=[(1, "bahçe+A3sg+Pnon+Loc")
(2, "+Adj+Rel")] REL=[4,1,(MODIFIER)]> bahçedeki </W>
<W IX=4 LEM="gül" MORPH="gül+nHn" IG=[(1,"gül+Noun+A3sg+Pnon+Gen")]
REL=[6,1,(SUBJECT)]> gülün </W>
<W IX=5 LEM="böyle" MORPH="böyle" IG=[(1,"böyle+Adv")]
REL=[6,1,(MODIFIER)]> böyle </W>
<W IX=6 LEM="büyü" MORPH="büyü+mA+sH" IG=[(1,"büyü+Verb+Pos") (2,
"+Noun+Inf+A3sg+P3sg+Nom")] REL=[9,1,(SUBJECT)]> büyümesi </W>
<W IX=7 LEM="herkes" MORPH="herkes+yH"
IG=[(1,"herkes+Pron+A3sg+Pnon+Acc")] REL=[9,1,(OBJECT)]> herkesi </W>
<W IX=8 LEM="çok" MORPH="çok" IG=[(1,"çok+Adv’’)] REL=[9,1,(MODIFIER)]>
çok </W>
<W IX=9 LEM="etkile" MORPH="etkile+DH" IG=[(1,
"etkile+Verb+Pos+Past+A3sg")] REL=[]> etkiledi </W>
</S>
40/47
Universal Dependencies
I A very recent project that aims to use a small set of “universal” labels and annotation
guidelines (universaldependencies.org).

41/47
Universal Dependencies
I A very recent project that aims to use a small set of “universal” labels and annotation
guidelines (universaldependencies.org).

42/47
Universal Dependencies

I A very recent project that aims to use a small set of “universal” labels and annotation
guidelines (universaldependencies.org).

43/47
State-of-the-art Dependency Parsers

I Stanford Parser
I Detailed Information at
https://nlp.stanford.edu/software/lex-parser.shtml
I Demo at http://nlp.stanford.edu:8080/parser/
I MaltParser is the original transition-based dependency parser by Nivre.
I “MaltParser is a system for data-driven dependency parsing, which can be used to induce
a parsing model from treebank data and to parse new data using an induced model.”
I Available at http://maltparser.org/

44/47
State-of-the-art Dependency Parser Performance
CONLL Shared Task Results

45/47
State-of-the-art Dependency Parser Performance
CONLL Shared Task Results

46/47
State-of-the-art Dependency Parser Performance
CONLL Shared Task Results

47/47
11-411
Natural Language Processing
Lexical Semantics

Kemal Oflazer

Carnegie Mellon University in Qatar

1/47
Lexical Semantics

The study of meanings of words:


I Decompositional
I Words have component meanings
I Total meanings are composed of these component meanings
I Ontological
I The meanings of words can be defined in relation to other words.
I Paradigmatic – Thesaurus-based
I Distributional
I The meanings of words can be defined in relation to their contexts among other words.
I Syntagmatic – meaning defined by syntactic context

2/47
Decompositional Lexical Semantics

I Assume that woman has (semantic) components [female], [human], and [adult].
I Man might have the components [male], [human], and [adult].
I Such “semantic features” can be combined to form more complicated meanings.
I Although this looks appealing, there is a bit of a chicken-and-egg situation.
I Scholars and language scientists have not yet developed a consensus about a
common set of “semantic primitives.”
I Such a representation probably has to involve more structure than just a flat set of
features per word.

3/47
Ontological Semantics

Relations between words/senses


I Synonymy

I Antonymy

I Hyponymy/Hypernymy

I Meronymy/Holonymy

Key resource: Wordnet

4/47
Terminology: Lemma and Wordform

I A lemma or citation form


I Common stem, part of speech, rough semantics
I A wordform
I The (inflected) word as it appears in text

Wordform Lemma
banks bank
sung sing
sang sing
went go
goes go

5/47
Lemmas have Senses

I One lemma “bank” can have many meanings:


I Sense 1: “. . . a bank1 can hold the investments in a custodial account . . . ”
I Sense 2: “. . . as agriculture burgeons on the east bank2 the river will shrink even more.”
I Sense (or word sense)
I A discrete representation of an aspect of a word’s meaning.
I The lemma bank here has two senses.

6/47
Homonymy

I Homonyms: words that share a form but have unrelated, distinct meanings:
I bank1 : financial institution, bank2 : sloping land
I bat1 : club for hitting a ball, bat2 : nocturnal flying mammal
I Homographs: Same spelling (bank/bank, bat/bat)
I Homophones: Same pronunciation
I write and right
I piece and peace

7/47
Homonymy causes problems for NLP applications

I Information retrieval
I “bat care”
I Machine Translation
I bat: murciélago (animal) or bate (for baseball)
I Text-to-Speech
I bass (stringed instrument) vs. bass (fish)

8/47
Polysemy

I The bank1 was constructed in 1875 out of local red brick.


I I withdrew the money from the bank2 .
I Are those the same sense?
I bank1 : “The building belonging to a financial institution”
I bank2 : “A financial institution”
I A polysemous word has multiple related meanings.
I Most non-rare words have multiple meanings

9/47
Metonymy/Systematic Polysemy

I Lots of types of polysemy are systematic


I school, university, hospital
I All can mean the institution or the building.
I A systematic relation
I Building ⇔ Organization
I Other examples of such polysemy
I Author (Jane Austen wrote Emma) ⇔ Works of Author (I love Jane Austen)
I Tree (Plums have beautiful blossoms) ⇔ Fruit (I ate a preserved plum)

10/47
How do we know when a word has multiple senses?

I The “zeugma”1 test: Two senses of serve?


I Which flights serve breakfast?
I Does Qatar Airways serve Philadelphia?
I ?Does Qatar Airways serve breakfast and Washington?
I Since this conjunction sounds weird, we say that these are two different senses of
“serve”
I “The farmers in the valley grew potatoes, peanuts, and bored.”
I “He lost his coat and his temper.”

1 A zeugma is an interesting device that can cause confusion in sentences, while also adding some flavor.
11/47
Synonymy
I Words a and b share an identical sense or have the same meaning in some or all
contexts.
I filbert / hazelnut
I couch / sofa
I big / large
I automobile / car
I vomit / throw up
I water / H2 O
I Ideally, synonyms can be substituted for each other in all situations without changing the meaning.
I True synonymy is relatively rare compared to other lexical relations.
I Substitution may not preserve acceptability, because of differences in politeness, slang, register, genre,
etc.
I water / H2 O
I big / large
I bravery / courage
I Bravery is the ability to confront pain, danger, or attempts at intimidation without any feeling of
fear.
I Courage, on the other hand, is the ability to undertake an overwhelming difficulty or pain despite
the imminent and unavoidable presence of fear.
12/47
Synonymy

Synonymy is a relation between senses rather than words.


I Consider the words big and large. Are they synonyms?
I How big is that plane?
I Would I be flying on a large or small plane?
I How about here:
I Miss Nelson became a kind of big sister to Benjamin.
I ?Miss Nelson became a kind of large sister to Benjamin.
I Why?
I big has a sense that means being older, or grown up
I large lacks this sense.

13/47
Antonymy

I Lexical items a and b have senses which are “opposite”, with respect to one feature of
meaning
I Otherwise they are similar
I dark/light
I short/long
I fast/slow
I rise/fall
I hot/cold
I up/down
I in/out
I More formally: antonyms can
I define a binary opposition or be at opposite ends of a scale (long/short, fast/slow)
I or be reversives (rise/fall, up/down)
I Antonymy is much more common than true synonymy.
I Antonymy is not always well defined, especially for nouns (but for other words as well).

14/47
Hyponymy/Hypernymy

I The “is-a” relations.


I Lexical item a is a hyponym of lexical item b if a is a kind of b (if a sense of b refers to
a superset of the referent of a sense of a).
I screwdriver is a hyponym of tool.
I screwdriver is also a hyponym of drink.
I car is a hyponym of vehicle
I mango is a hyponym of fruit
I Hypernymy is the converse of hyponymy.
I tool and drink are hypernyms of screwdriver.
I vehicle is a hypernym of car

15/47
Hyponymy more formally

I Extensional
I The class denoted by the superordinate (e.g., vehicle) extensionally includes the class
denoted by the hyponym (e.g. car).
I Entailment
I A sense A is a hyponym of sense B if being an A entails being a B (e.g., if it is a car, it is a
vehicle)
I Hyponymy is usually transitive
I If A is a hyponym of B and B is a hyponym of C ⇒ A is a hyponym of C.
I Another name is the IS-A hierarchy
I A IS-A B
I B subsumes A

16/47
Hyponyms and Instances

I An instance is an individual, a proper noun that is a unique entity


I Doha/San Francisco/London are instances of city.
I But city is a class
I city is a hyponym of municipality . . . location . . .

17/47
Meronymy/Holonymy

I The “part-of” relation


I Lexical item a is a meronym of lexical item b if a sense of a is a part of/a member of a
sense of b.
I hand is a meronym of body.
I congressperson is a meronym of congress.
I Holonymy (think: whole) is the converse of meronymy.
I body is a holonym of hand.

18/47
A Lexical Mini-ontology

19/47
WordNet
I A hierarchically organized database of (English) word senses.
I George A. Miller (1995). WordNet: A Lexical Database for English. Communications
of the ACM Vol. 38, No. 11: 39-41.
I Available at wordnet.princeton.edu
I Provides a set of three lexical databases:
I Nouns
I Verbs
I Adjectives and adverbs.
I Relations are between senses, not lexical items (words).
I Application Programming Interfaces (APIs) are available for many languages and toolkits,
including a Python interface via NLTK (see the sketch below).
I WordNet 3.0
Category Unique Strings
Noun 117,197
Verb 11,529
Adjective 22,429
Adverb 4,481
20/47
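Since WordNet is accessible from Python through NLTK, a minimal sketch of querying it is shown below. The word dog and the printed relations are only illustrative, and the exact synsets returned depend on the installed WordNet version.

```python
# A minimal sketch of querying WordNet through NLTK (assumes the nltk package
# and its 'wordnet' corpus are installed, e.g. via nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

# All synsets (senses) of the lemma "dog" used as a noun.
for syn in wn.synsets('dog', pos=wn.NOUN):
    print(syn.name(), '-', syn.definition())

# Relations hold between synsets, not words:
dog = wn.synset('dog.n.01')
print(dog.hypernyms())        # IS-A parents, e.g. canine.n.02, domestic_animal.n.01
print(dog.member_holonyms())  # what dog.n.01 is a member of, e.g. pack.n.06
```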
Synsets

I Primitive in WordNet: Synsets (roughly: synonym sets)


I Words that can be given the same gloss or definition.
I Does not require absolute synonymy
I For example: {chump, fool, gull, mark, patsy, fall guy, sucker, soft touch, mug}
I Other lexical relations, like antonymy, hyponymy, and meronymy are between
synsets, not between senses directly.

21/47
Synsets for dog (n)
I S: (n) dog, domestic dog, Canis familiaris (a member of the genus Canis (probably
descended from the common wolf) that has been domesticated by man since
prehistoric times; occurs in many breeds) “the dog barked all night”
I S: (n) frump, dog (a dull unattractive unpleasant girl or woman) “she got a reputation
as a frump”, “she’s a real dog”
I S: (n) dog (informal term for a man) “you lucky dog”
I S: (n) cad, bounder, blackguard, dog, hound, heel (someone who is morally
reprehensible) “you dirty dog”
I S: (n) frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie (a
smooth-textured sausage of minced beef or pork usually smoked; often served on a
bread roll)
I S: (n) pawl, detent, click, dog (a hinged catch that fits into a notch of a ratchet to
move a wheel forward or prevent it from moving backward)
I S: (n) andiron, firedog, dog, dog-iron (metal supports for logs in a fireplace) “the
andirons were too hot to touch”

22/47
Synsets for bass in WordNet

23/47
Hierarchy for bass3 in WordNet

24/47
The IS-A Hierarchy for fish (n)
I fish (any of various mostly cold-blooded aquatic vertebrates usually having scales and
breathing through gills)
I aquatic vertebrate (animal living wholly or chiefly in or on water)
I vertebrate, craniate (animals having a bony or cartilaginous skeleton with a segmented spinal
column and a large brain enclosed in a skull or cranium)
I chordate (any animal of the phylum Chordata having a notochord or spinal column)
I animal, animate being, beast, brute, creature, fauna (a living organism characterized by
voluntary movement)
I organism, being (a living thing that has (or can develop) the ability to act or function
independently)
I living thing, animate thing (a living (or once living) entity)
I whole, unit (an assemblage of parts that is regarded as a single entity)
I object, physical object (a tangible and visible entity; an entity that can cast a shadow)
I entity (that which is perceived or known or inferred to have its own distinct existence (living or
nonliving))

25/47
WordNet Noun Relations

26/47
WordNet Verb Relations

27/47
Other WordNet Hierarchy Fragment Examples

28/47
Other WordNet Hierarchy Fragment Examples

29/47
Other WordNet Hierarchy Fragment Examples

30/47
WordNet as as Graph

31/47
Supersenses in WordNet

Super senses are top-level hypernyms in the hierarchy.

32/47
WordNets for Other Languages

I globalwordnet.org/wordnets-in-the-world/ lists WordNets for tens of


languages.
I Many of these WordNets are linked through ILI – Interlingual Index numbers.

33/47
Word Similarity

I Synonymy: a binary relation


I Two words are either synonymous or not
I Similarity (or distance): a looser metric
I Two words are more similar if they share more features of meaning
I Similarity is properly a relation between senses
I The word “bank” is not similar to the word “slope”
I bank1 is similar to fund3
I bank2 is similar to slope5
I But we will compute similarity over both words and senses.

34/47
Why Word Similarity?

I A practical component in lots of NLP tasks


I Question answering
I Natural language generation
I Automatic essay grading
I Plagiarism detection
I A theoretical component in many linguistic and cognitive tasks
I Historical semantics
I Models of human word learning
I Morphology and grammar induction

35/47
Similarity and Relatedness

I We often distinguish word similarity from word relatedness


I Similar words: near-synonyms
I Related words: can be related in any way
I car, bicycle: similar
I car, gasoline: related, not similar

36/47
Two Classes of Similarity Algorithms

I WordNet/Thesaurus-based algorithms
I Are words “nearby” in hypernym hierarchy?
I Do words have similar glosses (definitions)?
I Distributional algorithms
I Do words have similar distributional contexts?
I Distributional (Vector) semantics.

37/47
Path-based Similarity
I Two concepts (senses/synsets) are similar if they are near each other in the hierarchy
I They have a short path between them
I Synsets have path 1 to themselves.

38/47
Refinements

I pathlen(c1 , c2 ) = 1 + number of edges in the shortest path in the hypernym graph
between sense nodes c1 and c2

I simpath(c1 , c2 ) = 1 / pathlen(c1 , c2 )      (ranges between 0 and 1)

I wordsim(w1 , w2 ) = max over c1 ∈ senses(w1 ), c2 ∈ senses(w2 ) of simpath(c1 , c2 )

39/47
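A minimal sketch of the word-level path similarity defined above, using NLTK's WordNet interface (its path_similarity method already computes 1 / pathlen); the words are just the ones from the example that follows.

```python
# Sketch: word similarity as the maximum path-based similarity over sense pairs.
from itertools import product
from nltk.corpus import wordnet as wn

def word_similarity(w1, w2, pos=wn.NOUN):
    scores = [c1.path_similarity(c2)
              for c1, c2 in product(wn.synsets(w1, pos=pos), wn.synsets(w2, pos=pos))]
    return max((s for s in scores if s is not None), default=None)

# Exact values depend on the WordNet version, so they may differ from the slides.
print(word_similarity('nickel', 'coin'))
print(word_similarity('nickel', 'currency'))
print(word_similarity('nickel', 'money'))
```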
Example for Path-based Similarity

I simpath(nickel, coin) = 1/2 = .5


I simpath(fund, budget) = 1/2 = .5
I simpath(nickel, currency) = 1/4 = .25
I simpath(nickel, money) = 1/6 = .17
I simpath(coinage, Richter scale) = 1/6 = .17

40/47
Problem with Basic Path-based Similarity

I Assumes each link represents a uniform distance


I But nickel to money seems to us to be closer than nickel to standard
I Nodes high in the hierarchy are very abstract
I We instead want a metric that
I Represents the cost of each edge independently
I Ranks words connected only through abstract nodes as less similar.

41/47
Information Content Similarity Metrics

I Define p(c) as the probability that a randomly selected word in a corpus is an


instance of concept c.
I Formally: there is a distinct random variable, ranging over words, associated with
each concept in the hierarchy. For a given concept, each observed word/lemma is
either
I a member of that concept with probability p(c)
I not a member of that concept with probability 1 − p(c)
I All words are members of the root node (Entity): p(root) = 1
I The lower a node in hierarchy, the lower its probability

42/47
Information Content Similarity Metrics

I Train by counting in a corpus


I Each instance of hill counts toward frequency of
natural elevation, geological-formation, entity, etc.
I Let words(c) be the set of all words that are
descendants of node c
I words(geological-formation) = {hill, ridge, grotto,
coast, cave, shore, natural elevation}
I words(natural elevation) = {hill, ridge}
I For N words in the corpus:

p(c) = ( Σw∈words(c) count(w) ) / N

43/47
Information Content: Definitions

I Information Content: IC(c) = − log p(c)


I Most informative subsumer (lowest common subsumer)
I LCS(c1 , c2 ) = The most informative (lowest) node in the hierarchy subsuming both c1 and
c2 .

44/47
The Resnik Method

I The similarity between two words is related to their common information


I The more two words have in common, the more similar they are
I Resnik: measure common information as:

simresnik (c1 , c2 ) = − log p(LCS(c1 , c2 ))


I simresnik (hill, coast) = − log p(LCS(hill, coast)) = − log p(geological-formation) =
6.34

45/47
The Dekang Lin Method

I The similarity between A and B is measured by the ratio between the amount of
information needed to state the commonality of A and B and the information needed
to fully describe what A and B are.
I simlin (c1 , c2 ) = 2 log p(LCS(c1 , c2 )) / ( log p(c1 ) + log p(c2 ) )

I simlin (hill, coast) = 2 log p(geological-formation) / ( log p(hill) + log p(coast) ) = 0.59

46/47
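Both information-content measures are available through NLTK, given a file of concept probabilities estimated from a corpus; a minimal sketch follows. The exact numbers depend on the corpus used to estimate p(c), so they will not match the slide values exactly.

```python
# Sketch: Resnik and Lin similarities with NLTK, using information-content
# counts from the Brown corpus (requires nltk.download('wordnet_ic')).
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')
hill, coast = wn.synset('hill.n.01'), wn.synset('coast.n.01')

print(hill.res_similarity(coast, brown_ic))  # Resnik: -log p(LCS(hill, coast))
print(hill.lin_similarity(coast, brown_ic))  # Lin: 2 log p(LCS) / (log p(hill) + log p(coast))
```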
Evaluating Similarity

I Extrinsic evaluation (Task-based)


I Question answering
I Essay grading
I Intrinsic evaluation: Correlation between algorithm and human word similarity ratings
I Wordsim353 task: 353 noun pairs rated 0-10. sim(plane,car) = 5.77
I Taking TOEFL multiple-choice vocabulary tests
I Levied is closest in meaning to: imposed, believed, requested, correlated.

47/47
11-411
Natural Language Processing
Distributional/Vector Semantics

Kemal Oflazer

Carnegie Mellon University in Qatar

1/55
The Distributional Hypothesis

I Want to know the meaning of a word? Find what words occur with it.
I Leonard Bloomfield
I Edward Sapir
I Zellig Harris–first formalization
I “oculist and eye-doctor . . . occur in almost the same environments”
I “If A and B have almost identical environments we say that they are synonyms.”
I The best known formulation comes from J.R. Firth:
I “You shall know a word by the company it keeps.”

2/55
Contexts for Beef

I This is called a concordance.


3/55
Contexts for Chicken

4/55
Intuition for Distributional Word Similarity

I Consider
I A bottle of pocarisweat is on the table.
I Everybody likes pocarisweat.
I Pocarisweat makes you feel refreshed.
I They make pocarisweat out of ginger.
I From context words humans can guess pocarisweat means a beverage like coke.
I So the intuition is that two words are similar if they have similar word contexts.

5/55
Why Vector Models of Meaning?

I Computing similarity between words:


I fast is similar to rapid
I tall is similar to height
I Application: Question answering:
I Question: “How tall is Mt. Everest?”
I Candidate A: “The official height of Mount Everest is 29029 feet.”

6/55
Word Similarity for Plagiarism Detection

7/55
Vector Models

I Sparse vector representations:


I Mutual-information weighted word co-occurrence matrices.
I Dense vector representations:
I Singular value decomposition (and Latent Semantic Analysis)
I Neural-network-inspired models (skip-grams, CBOW)
I Brown clusters

8/55
Shared Intuition

I Model the meaning of a word by “embedding” in a vector space.


I The meaning of a word is a vector of numbers:
I Vector models are also called embeddings.
I In contrast, word meaning is represented in many (early) NLP applications by a
vocabulary index (“word number 545”)

9/55
Term-document Matrix

I Each cell is the count of term t in a document d (tft,d ).


I Each document is a count vector in NV , a column below.

As You Like It Twelfth Night Julius Caesar Henry V


battle 1 1 8 15
soldier 2 2 12 36
fool 37 58 1 5
clown 6 117 0 0

10/55
Term-document Matrix

I Two documents are similar if their vectors are similar.


As You Like It Twelfth Night Julius Caesar Henry V
battle 1 1 8 15
soldier 2 2 12 36
fool 37 58 1 5
clown 6 117 0 0

11/55
Term-document Matrix

I Each word is a count vector in ND – a row below


As You Like It Twelfth Night Julius Caesar Henry V
battle 1 1 8 15
soldier 2 2 12 36
fool 37 58 1 5
clown 6 117 0 0

12/55
Term-document Matrix

I Two words are similar if their vectors are similar.


As You Like It Twelfth Night Julius Caesar Henry V
battle 1 1 8 15
soldier 2 2 12 36
fool 37 58 1 5
clown 6 117 0 0

13/55
Term-context Matrix for Word Similarity

I Two words are similar if their context vectors are similar.


aardvark computer data pinch result sugar . . .
apricot 0 0 0 1 0 1
pineapple 0 0 0 1 0 1
digital 0 2 1 0 1 0
information 0 1 6 0 4 0

14/55
Word–Word or Word–Context Matrix

I Instead of entire documents, use smaller contexts


I Paragraph
I Window of ±4 words
I A word is now defined by a vector over counts of words in context.
I If a word wj occurs in the context of wi , increase countij .
I Assuming we have V words,
I Each vector is now of length V.
I The word-word matrix is V × V.

15/55
Sample Contexts of ±7 Words

aardvark computer data pinch result sugar . . .


..
.
apricot 0 0 0 1 0 1
pineapple 0 0 0 1 0 1
digital 0 2 1 0 1 0
information 0 1 6 0 4 0
..
.

16/55
The Word–Word Matrix

I We showed only a 4 × 6 matrix, but the real matrix is 50, 000 × 50, 000.
I So it is very sparse: Most values are 0.
I That’s OK, since there are lots of efficient algorithms for sparse matrices.
I The size of windows depends on the goals:
I The smaller the context (±1 − 3), the more syntactic the representation
I The larger the context (±4 − 10), the more semantic the representation

17/55
Types of Co-occurence between Two Words

I First-order co-occurrence (syntagmatic association):


I They are typically nearby each other.
I wrote is a first-order associate of book or poem.
I Second-order co-occurrence (paradigmatic association):
I They have similar neighbors.
I wrote is a second-order associate of words like said or remarked.

18/55
Problem with Raw Counts

I Raw word frequency is not a great measure of association between words.


I It is very skewed: “the” and “of” are very frequent, but maybe not the most discriminative.
I We would rather have a measure that asks whether a context word is particularly
informative about the target word.
I Positive Pointwise Mutual Information (PPMI)

19/55
Pointwise Mutual Information

I Pointwise Mutual Information: Do events x and y co-occur more than if they were
independent?

PMI(x, y) = log2 [ p(x, y) / ( p(x) p(y) ) ]

I PMI between two words: Do target word w and context word c co-occur more than if
they were independent?

PMI(w, c) = log2 [ p(w, c) / ( p(w) p(c) ) ]

20/55
Positive Pointwise Mutual Information

I PMI ranges from −∞ to +∞


I But the negative values are problematic:
I Things are co-occurring less than we expect by chance
I Unreliable without enormous corpora
I Imagine w1 and w2 whose probability is each 10^−6 .
I Hard to be sure p(w1 , w2 ) is significantly different than 10^−12 .
I Furthermore it’s not clear people are good at “unrelatedness”.
I So we just replace negative PMI values by 0.

PPMI(w, c) = max( log2 [ p(w, c) / ( p(w) p(c) ) ] , 0 )

21/55
Computing PPMI on a Term-Context Matrix

I We have matrix F with V rows (words) and C columns (contexts) (in general C = V )
I fij is how many times word wi co-occurs in the context of the word cj .
pij = fij / ( Σi=1..V Σj=1..C fij )

pi∗ = ( Σj=1..C fij ) / ( Σi=1..V Σj=1..C fij )        p∗j = ( Σi=1..V fij ) / ( Σi=1..V Σj=1..C fij )

pmiij = log2 ( pij / (pi∗ p∗j ) )        ppmiij = max(pmiij , 0)

22/55
Example
count(w, c)    computer  data  pinch  result  sugar | count(w)
apricot               0     0      1       0      1 |        2
pineapple             0     0      1       0      1 |        2
digital               2     1      0       1      0 |        4
information           1     6      0       4      0 |       11
count(c)              3     7      2       5      2 | total 19

p(w = information, c = data) = 6/19 = 0.32
p(w = information) = 11/19 = 0.58        p(c = data) = 7/19 = 0.37

p(w, c)        computer  data  pinch  result  sugar   p(w)
apricot            0.00  0.00   0.05    0.00   0.05   0.11
pineapple          0.00  0.00   0.05    0.00   0.05   0.11
digital            0.11  0.05   0.00    0.05   0.00   0.21
information        0.05  0.32   0.00    0.21   0.00   0.58

23/55
Example
p(w, c)
computer data pinch result sugar p(w)
apricot 0.00 0.00 0.05 0.00 0.05 0.11
pineapple 0.00 0.00 0.05 0.00 0.05 0.11
digital 0.11 0.05 0.00 0.05 0.00 0.21
information 0.05 0.32 0.00 0.21 0.00 0.58

p(c) 0.16 0.37 0.11 0.26 0.11

pmi(information, data) = log2 ( 0.32 / (0.37 · 0.58) ) ≈ 0.58
PPMI(w, c)
computer data pinch result sugar
apricot - - 2.25 - 2.25
pineapple - - 2.25 - 2.25
digital 1.66 0.00 - 0.00 -
information 0.00 0.32 - 0.47 -

24/55
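The PPMI table above can be reproduced with a few lines of NumPy; the following is a minimal sketch using the toy counts from the example (NumPy itself is an assumption, it is not part of these notes).

```python
# Sketch: computing PPMI from a word-by-context count matrix.
import numpy as np

words = ['apricot', 'pineapple', 'digital', 'information']
contexts = ['computer', 'data', 'pinch', 'result', 'sugar']
F = np.array([[0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [2, 1, 0, 1, 0],
              [1, 6, 0, 4, 0]], dtype=float)

P = F / F.sum()                       # joint probabilities p(w, c)
pw = P.sum(axis=1, keepdims=True)     # marginals p(w)
pc = P.sum(axis=0, keepdims=True)     # marginals p(c)
with np.errstate(divide='ignore'):    # zero counts give log2(0) = -inf ...
    pmi = np.log2(P / (pw * pc))
ppmi = np.maximum(pmi, 0)             # ... which the max with 0 turns into 0
print(np.round(ppmi, 2))              # e.g. PPMI(information, data) ≈ 0.58
```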
Issues with PPMI

I PMI is biased toward infrequent events.


I Very rare words have very high PMI values.
I Two solutions:
I Give rare words slightly higher probabilities
I Use add-one smoothing (which has a similar effect)

25/55
Issues with PPMI

I Raise the context probabilities to α = 0.75:

PPMIα (w, c) = max( log2 [ p(w, c) / ( p(w) pα (c) ) ] , 0 )

pα (c) = count(c)^α / Σc count(c)^α

I This helps because pα (c) > p(c) for rare c.
I Consider two context words with p(a) = 0.99 and p(b) = 0.01

I pα (a) = 0.99^0.75 / (0.99^0.75 + 0.01^0.75 ) = 0.97
I pα (b) = 0.01^0.75 / (0.99^0.75 + 0.01^0.75 ) = 0.03

26/55
Using Laplace Smoothing

Add-2 Smoothed Count(w, c)


computer data pinch result sugar
apricot 2 2 3 2 3
pineapple 2 2 3 2 3
digital 4 3 2 3 2
information 3 8 2 6 2

p(w, c) Add-2
computer data pinch result sugar p(w)
apricot 0.03 0.03 0.05 0.03 0.05 0.20
pineapple 0.03 0.03 0.05 0.03 0.05 0.20
digital 0.07 0.05 0.03 0.05 0.03 0.24
information 0.05 0.14 0.03 0.10 0.03 0.36

p(c) 0.19 0.25 0.17 0.22 0.17

27/55
PPMI vs. add-2 Smoothed PPMI

PPMI(w, c)
computer data pinch result sugar
apricot - - 2.25 - 2.25
pineapple - - 2.25 - 2.25
digital 1.66 0.00 - 0.00 -
information 0.00 0.32 - 0.47 -

Add-2 Smoothed PPMI(w, c)
computer data pinch result sugar
apricot 0.00 0.00 0.56 0.00 0.56
pineapple 0.00 0.00 0.56 0.00 0.56
digital 0.62 0.00 0.00 0.00 0.00
information 0.00 0.58 0.00 0.37 0.00

28/55
Measuring Similarity

I Given two target words represented with vectors v and w.


I The dot product or inner product is usually used as the basis for similarity.

v · w = Σi=1..N vi wi = v1 w1 + v2 w2 + · · · + vN wN = |v||w| cos θ

I v · w is high when two vectors have large values in the same dimensions.
I v · w is low (in fact 0) when the vectors have zeros in complementary distribution.
I We also do not want the similarity to be sensitive to word frequency.
I So normalize by vector length and use the cosine as the similarity:

cos(v, w) = (v · w) / ( |v||w| )

29/55
Other Similarity Measures in the Literature

I simcosine (v, w) = (v · w) / ( |v||w| )

I simJaccard (v, w) = Σi min(vi , wi ) / Σi max(vi , wi )

I simDice (v, w) = 2 Σi min(vi , wi ) / Σi (vi + wi )

30/55
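A minimal sketch of these three measures, applied to rows of the toy term-context matrix from earlier (NumPy assumed).

```python
import numpy as np

def sim_cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

def sim_jaccard(v, w):
    return np.minimum(v, w).sum() / np.maximum(v, w).sum()

def sim_dice(v, w):
    return 2 * np.minimum(v, w).sum() / (v + w).sum()

apricot     = np.array([0, 0, 1, 0, 1], dtype=float)
digital     = np.array([2, 1, 0, 1, 0], dtype=float)
information = np.array([1, 6, 0, 4, 0], dtype=float)

print(sim_cosine(digital, information))   # relatively high: overlapping contexts
print(sim_cosine(apricot, information))   # 0: no shared context words
print(sim_jaccard(digital, information), sim_dice(digital, information))
```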
Using Syntax to Define Context

I “The meaning of entities, and the meaning of grammatical relations among them, is
related to the restriction of combinations of these entities relative to other entities.”
(Zellig Harris (1968))
I Two words are similar if they appear in similar syntactic contexts.
I duty and responsibility have similar syntactic distribution
I Modified by Adjectives: additional, administrative, assumed, collective, congressional,
constitutional, . . .
I Objects of Verbs: assert, assign, assume, attend to, avoid, become, breach, . . .

31/55
Co-occurence Vectors based on Syntactic Dependencies

I Each context dimension is a context word in one of R grammatical relations.


I Each word vector now has R · V entries.
I Variants have V dimensions with the count being total for R relations.

32/55
Sparse vs. Dense Vectors

I PPMI vectors are


I long (length in 10s of thousands)
I sparse (most elements are 0)
I Alternative: learn vectors which are
I short (length in several hundreds)
I dense (most elements are non-zero)

33/55
Why Dense Vectors?

I Short vectors may be easier to use as features in machine learning (fewer weights to
tune).
I Dense vectors may generalize better than storing explicit counts.
I They may do better at capturing synonymy:
I car and automobile are synonyms
I But they are represented as distinct dimensions
I This fails to capture similarity between a word with car as a neighbor and a word with
automobile as a neighbor

34/55
Methods for Getting Short Dense Vectors

I Singular Value Decomposition (SVD)


I A special case of this is called LSA - Latent Semantic Analysis
I “Neural Language Model”-inspired predictive models.
I skip-grams and continuous bag-of-words (CBOW)
I Brown clustering

35/55
Dense Vectors via SVD - Intuition

I Approximate an N-dimensional dataset using fewer dimensions


I By rotating the axes into a new space along the dimension with the most variance
I Then repeat: each new dimension in turn captures the next most variance, etc.
I Many such (related) methods:
I PCA - principal components analysis
I Factor Analysis
I SVD

36/55
Dimensionality Reduction

37/55
Singular Value Decomposition

I Any square v × v matrix (of rank v) X equals the product of three matrices.
    X     =     W     ×     S     ×     C
  (v × v)     (v × v)     (v × v)     (v × v)

I v columns in W are orthogonal to each other and are ordered by the amount of
variance each new dimension accounts for.
I S is a diagonal matrix of singular values expressing the importance of each dimension.
I C has v rows for the singular values and v columns corresponding to the original
contexts.

38/55
Reducing Dimensionality with Truncated SVD
Full SVD:
    X     =     W     ×     S     ×     C
  (v × v)     (v × v)     (v × v)     (v × v)

Truncated SVD (keep only the top k dimensions):
    X     ≈     W     ×     S     ×     C
  (v × v)     (v × k)     (k × k)     (k × v)

39/55
Truncated SVD Produces Embeddings

w11 . . . w1k
w21 . . . w2k
 ...
wv1 . . . wvk

I Each row of W matrix is a k-dimensional representation of each word w.


I k may range from 50 to 1000

40/55
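A minimal sketch of obtaining k-dimensional embeddings via truncated SVD; the matrix here is a random stand-in for a PPMI-weighted word-context matrix, and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((2000, 2000))      # stand-in for the V x V PPMI matrix
k = 50                            # number of dimensions to keep

W, S, C = np.linalg.svd(X, full_matrices=False)   # X = W @ np.diag(S) @ C
embeddings = W[:, :k]             # each row is a k-dimensional word vector
# A common variant weights the columns by the singular values: W[:, :k] * S[:k]
print(embeddings.shape)           # (2000, 50)
```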
Embeddings vs Sparse Vectors

I Dense SVD embeddings sometimes work better than sparse PPMI matrices at tasks
like word similarity
I Denoising: low-order dimensions may represent unimportant information
I Truncation may help the models generalize better to unseen data.
I Having a smaller number of dimensions may make it easier for classifiers to properly
weight the dimensions for the task.
I Dense models may do better at capturing higher order co-occurrence.

41/55
Embeddings Inspired by Neural Language Models

I Skip-gram and CBOW learn embeddings as part of the process of word prediction.
I Train a neural network to predict neighboring words
I Inspired by neural net language models.
I In so doing, learn dense embeddings for the words in the training corpus.
I Advantages:
I Fast, easy to train (much faster than SVD).
I Available online in the word2vec package.
I Including sets of pretrained embeddings!

42/55
Skip-grams

I From the current word wt , predict other words in a context window of 2C words.
I For example, we are given wt and we are predicting one of the words in

[wt−2 , wt−1 , wt+1 , wt+2 ]

43/55
Compressing Words

44/55
One-hot Vector Representation

I Each word in the vocabulary is represented with a vector of length |V |.


I 1 at the index of the target word wt and 0 for all other words.
I So if “popsicle” is vocabulary word 5, the one-hot vector is

[0, 0, 0, 0, 1, 0, 0, 0, 0, . . . , 0]

45/55
Neural Network Architecture

46/55
Where are the Word Embeddings?

I The rows of the first matrix actually are the word embeddings.
I Multiplying by the one-hot input vector “selects” the relevant row as the hidden-layer
vector.

47/55
Output Probabilities

I The output vector is computed from the hidden-layer vector by another matrix
multiplication (with the C matrix).
I The value computed for output unit k is ck · vj , where vj is the hidden-layer vector (for word
j).
I Except, the outputs are not probabilities!
I We use the same scaling idea we used earlier, i.e., the softmax:

p(wk is in the context of wj ) = exp(ck · vj ) / Σi exp(ci · vj )

48/55
Training for Embeddings

49/55
Training for Embeddings

I You have a huge network (say you have 1M words and embedding dimension of 300).

I You have 300M entries in each of the matrices.


I Running gradient descent (via backpropagation) is very slow.
I Some innovations used:
I Treat common phrases like “New York” as single vocabulary items.
I Reduce vocabulary and training samples by ignoring infrequent words.
I Instead of computing probabilities through the expensive scaling process, use negative
sampling to only update a small number of the weights each time.
I It turns out these also improve the quality of the vectors!

50/55
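The slides refer to the original word2vec package; as one readily available alternative (an assumption, not part of these notes), gensim's reimplementation can train skip-gram embeddings with negative sampling in a few lines. Parameter names below follow gensim 4.x.

```python
from gensim.models import Word2Vec

sentences = [["the", "dog", "barked", "all", "night"],
             ["the", "cat", "slept", "all", "day"]]      # toy corpus; use real tokenized text

model = Word2Vec(sentences,
                 vector_size=100,   # embedding dimension
                 window=2,          # context window C
                 sg=1,              # 1 = skip-gram, 0 = CBOW
                 negative=5,        # negative sampling instead of the full softmax
                 min_count=1)       # keep rare words in this toy example

print(model.wv["dog"][:5])              # first few dimensions of the embedding for "dog"
print(model.wv.most_similar("dog"))     # nearest neighbours in the embedding space
# Analogies use the same interface, e.g.:
#   model.wv.most_similar(positive=["king", "woman"], negative=["man"])
```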
Properties of Embeddings
I Nearest words to some embeddings in the d-dimensional space.

I Relation meanings
I vector(king) − vector(man) + vector(woman) ≈ vector(queen)
I vector(Paris) − vector(France) + vector(Italy) ≈ vector(Rome)

51/55
Brown Clustering

I An agglomerative clustering algorithm that clusters words based on which words


precede or follow them
I These word clusters can be turned into a kind of vector
I We will do a brief outline here.

52/55
Brown Clustering

I Each word is initially assigned to its own cluster.


I We now consider merging each pair of clusters. The highest quality merge is
chosen.
I Quality = merges two words that have similar probabilities of preceding and following
words.
I More technically quality = smallest decrease in the likelihood of the corpus according to a
class-based language model
I Clustering proceeds until all words are in one big cluster.

53/55
Brown Clusters as Vectors

I By tracing the order in which clusters are merged, the model builds a binary tree from
bottom to top.
I Each word represented by binary string = path from root to leaf
I Each intermediate node is a cluster
I Chairman represented by 0010, “months” by 01, and verbs by 1.

54/55
Class-based Language Model

I Suppose each word is in some class ci .

p(wi | wi−1 ) = p(ci | ci−1 ) p(wi | ci )

55/55
11-411
Natural Language Processing
Word Sense Disambiguation

Kemal Oflazer

Carnegie Mellon University in Qatar

1/23
Homonymy and Polysemy

I As we have seen, multiple words can be spelled the same way (homonymy;
technically homography)
I The same word can also have different, related senses (polysemy)
I Various NLP tasks require resolving the ambiguities produced by homonymy and
polysemy.
I Word sense disambiguation (WSD)

2/23
Versions of the WSD Task

I Lexical sample
I Choose a sample of words.
I Choose a sample of senses for those words.
I Identify the right sense for each word in the sample.
I All-words
I Systems are given the entire text.
I Systems are given a lexicon with senses for every content word in the text.
I Identify the right sense for each content word in the text .

3/23
Supervised WSD

I If we have hand-labelled data, we can do supervised WSD.


I Lexical sample tasks
I Line-hard-serve corpus
I SENSEVAL corpora
I All-word tasks
I Semantic concordance: SemCor – subset of Brown Corpus manually tagged with
WordNet senses.
I SENSEVAL-3
I Can be viewed as a classification task

4/23
Sample SemCor Data
<wf cmd=done pos=PRP$ ot=notag>Your</wf>
<wf cmd=done pos=NN lemma=invitation wnsn=1 lexsn=1:10:00::>invitation</wf>
<wf cmd=ignore pos=TO>to</wf>
<wf cmd=done pos=VB lemma=write_about wnsn=1 lexsn=2:36:00::>write_about</wf>
<wf cmd=done rdf=person pos=NNP lemma=person wnsn=1 lexsn=1:03:00:: pn=person>Se
<wf cmd=ignore pos=TO>to</wf>
<wf cmd=done pos=VB lemma=honor wnsn=1 lexsn=2:41:00::>honor</wf>
<wf cmd=ignore pos=PRP$>his</wf>
<wf cmd=done pos=JJ lemma=70th wnsn=1 lexsn=5:00:00:ordinal:00>70_th</wf>
<wf cmd=done pos=NN lemma=anniversary wnsn=1 lexsn=1:28:00::>Anniversary</wf>
<wf cmd=ignore pos=IN>for</wf>
<wf cmd=ignore pos=DT>the</wf>
<wf cmd=done pos=NN lemma=april wnsn=1 lexsn=1:28:00::>April</wf>
<wf cmd=done pos=NN lemma=issue wnsn=2 lexsn=1:10:00::>issue</wf>
<wf cmd=ignore pos=IN>of</wf>
<wf cmd=done pos=NNP pn=other ot=notag>Sovietskaya_Muzyka</wf>
<wf cmd=done pos=VBZ ot=notag>is</wf>
<wf cmd=done pos=VB lemma=accept wnsn=6 lexsn=2:40:01::>accepted</wf>
<wf cmd=ignore pos=IN>with</wf>
<wf cmd=done pos=NN lemma=pleasure wnsn=1 lexsn=1:12:00::>pleasure</wf>
5/23
What Features Should One Use?

I Warren Weaver commented in 1955


If one examines the words in a book, one at a time as through an opaque mask
with a hole in it one word wide, then it is obviously impossible to determine, one
at a time, the meaning of the words. [. . . ] But if one lengthens the slit in the
opaque mask, until one can see not only the central word in question but also say
N words on either side, then if N is large enough one can unambiguously decide
the meaning of the central word. [. . . ] The practical question is: “What minimum
value of N will, at least in a tolerable fraction of cases, lead to the correct choice
of meaning for the central word?”
I What information is available in that window of length N that allows us to do WSD?

6/23
What Features Should One Use?

I Collocation features
I Encode information about specific positions located to the left or right of the target word
I For example [wi−2 , POSi−2 , wi−1 , POSi−1 , wi+1 , POSi+1 , wi+2 , POSi+2 ]
I For bass, e.g., [guitar, NN, and, CC, player, NN, stand, VB]
I Bag-of-words features
I Unordered set of words occurring in window
I Relative sequence is ignored
I Words are lemmatized
I Stop/Function words typically ignored.
I Used to capture domain

7/23
Naive Bayes for WSD

I Choose the most probable sense given the feature vector f , which can be formulated as

ŝ = arg maxs∈S p(s) ∏j=1..n p(fj | s)
I Naive Bayes assumes features in f are independent (often not true)
I But usually Naive Bayes Classifiers perform well in practice.

8/23
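A minimal sketch of supervised WSD as text classification with a Naive Bayes classifier over bag-of-words features. scikit-learn is an assumption here, and the tiny "bass" lexical-sample corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

contexts = ["he played the bass guitar on stage",
            "caught a large bass in the lake",
            "the bass line drives the song",
            "fishermen catch bass with live bait"]
senses = ["music", "fish", "music", "fish"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(contexts, senses)
print(clf.predict(["fishermen caught a huge bass in the river"]))  # -> ['fish'] on this toy data
```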
Semisupervised WSD–Decision List Classifiers

I The decisions handed down by naive Bayes classifiers (and other similar ML
algorithms) are difficult to interpret.
I It is not always clear why, for example, a particular classification was made.
I For reasons like this, some researchers have looked to decision list classifiers, a highly
interpretable approach to WSD .
I We have a list of conditional statements.
I Item being classified falls through the cascade until a statement is true.
I The associated sense is then returned.
I Otherwise, a default sense is returned.
I Where does the list come from?

9/23
Decision List Features for WSD – Collocational Features

I Word immediately to the left or right of target:


I I have my bank1 statement.
I The river bank2 is muddy.
I Pair of words to immediate left or right of target:
I The world’s richest bank1 is here in New York.
I The river bank2 is muddy.
I Words found within k positions to left or right of target, where k is often 10-50 :
I My credit is just horrible because my bank1 has made several mistakes with my account
and the balance is very low.

10/23
Learning a Decision List Classifier

I Each individual feature-value is a test.


I Features exists in a small context around the word.
I How discriminative is a feature among the senses?
I For a given ambiguous word compute

weight(si , fj ) = log [ p(si | fj ) / p(¬si | fj ) ]

where ¬si denotes all senses of the word other than si .


I Order the tests in descending order of weight, ignoring values ≤ 0.
I When testing, the first test that matches gives the sense.

11/23
Example

I Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for
bank/2 (river sense)
I p(s1 ) = 1, 500/2, 000 = .75
I p(s2 ) = 500/2, 000 = .25
I Given “credit” occurs 200 times with bank/1 and 4 times with bank/2.
I p(credit) = 204/2, 000 = .102
I p(credit | s1 ) = 200/1, 500 = .133
I p(credit | s2 ) = 4/500 = .008
I From Bayes Rule
I p(s1 | credit) = .133 ∗ .75/.102 = .978
I p(s2 | credit) = .008 ∗ .25/.102 = .020
I Weights
I weight(s1 , credit) = log 49.8 = 3.89
I weight(s2 , credit) = log (1/49.8) = −3.89

12/23
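The weight computation in the example reduces to a log ratio of counts when there are only two senses; a minimal sketch follows (the function name is ad hoc).

```python
import math

def decision_list_weight(count_with_sense, count_with_other_senses):
    # log [ p(s | f) / p(~s | f) ] reduces to the log ratio of the two counts
    # when there are only two senses of the target word.
    return math.log(count_with_sense / count_with_other_senses)

# "credit" occurs 200 times with bank/1 (financial) and 4 times with bank/2 (river):
print(decision_list_weight(200, 4))   # ≈ 3.9: a strong test for the financial sense
print(decision_list_weight(4, 200))   # ≈ -3.9: non-positive, so it would be discarded
```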
Using the Decision List

13/23
Evaluation of WSD

I Extrinsic Evaluation
I Also called task-based, end-to-end, and in vivo evaluation.
I Measures the contribution of a WSD (or other) component to a larger pipeline.
I Requires a large investment and is hard to generalize to other tasks.
I Intrinsic Evaluation
I Also called in vitro evaluation
I Measures the performance of the WSD (or other) component in isolation
I Does not necessarily tell you how well the component contributes to a real test – which is
in general what you are interested in.

14/23
Baselines

I Most frequent sense


I Senses in WordNet are typically ordered from most to least frequent
I For each word, simply pick the most frequent
I Surprisingly accurate
I Lesk algorithm
I Really, a family of algorithms
I Measures overlap in words between gloss/examples and context

15/23
Simplified Lesk Algorithm
I The bank can guarantee deposits
will eventually cover future tuition
costs because it invests in
adjustable-rate mortgage securities.

I Sense bank1 has two


non-stopwords overlapping with
the context above,
I Sense bank2 has no overlaps.

16/23
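A rough sketch of the simplified Lesk algorithm: pick the WordNet sense whose gloss and example sentences overlap most with the context. NLTK is assumed, and the stopword list is a small ad-hoc one for illustration.

```python
from nltk.corpus import wordnet as wn

STOPWORDS = {"the", "a", "an", "of", "in", "it", "to", "and", "will", "can", "because"}

def simplified_lesk(word, sentence):
    context = {w.lower().strip(".,") for w in sentence.split()} - STOPWORDS
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature |= set(example.lower().split())
        overlap = len((signature - STOPWORDS) & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sent = ("The bank can guarantee deposits will eventually cover future tuition "
        "costs because it invests in adjustable-rate mortgage securities")
print(simplified_lesk("bank", sent))   # the financial-institution sense should win here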
Bootstrapping Algorithms
I There are bootstrapping techniques that can be used to obtain reasonable WSD
results with minimal amounts of labelled data.

17/23
Bootstrapping Algorithms

I Yarowsky’s Algorithm (1995), builds a classifier for each ambiguous word.


I The algorithm is given a small seed set Λ0 of labeled instances of each sense and a much
larger unlabeled corpus V0 .
I Trains an initial classifier and labels V0 along with confidence
I Add high-confidence labeled examples to the training set giving Λ1
I Trains a new classifier and labels V1 along with confidence.
I Add high-confidence labeled examples to the training set giving Λ2
I ...
I Until no new examples can be added or a sufficiently accurate labeling is reached.
I Assumptions/Observations:
I One sense per collocation: Nearby words provide strong and consistent clues as to the
sense of a target word
I One sense per discourse: The sense of a target word is highly consistent within a single
document

18/23
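A sketch of a Yarowsky-style bootstrapping loop is given below. The base classifier, the confidence threshold, and the stopping criteria are illustrative assumptions (scikit-learn assumed), not the original algorithm's exact choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def bootstrap(seed_texts, seed_labels, unlabeled_texts, threshold=0.95, max_rounds=10):
    texts, labels = list(seed_texts), list(seed_labels)   # the growing labelled set
    pool = list(unlabeled_texts)                          # the shrinking unlabeled set
    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    for _ in range(max_rounds):
        clf.fit(texts, labels)
        if not pool:
            break
        preds = clf.predict(pool)
        confidence = clf.predict_proba(pool).max(axis=1)
        keep = [i for i, c in enumerate(confidence) if c >= threshold]
        if not keep:
            break                                         # no new high-confidence examples
        texts += [pool[i] for i in keep]
        labels += [preds[i] for i in keep]
        pool = [t for i, t in enumerate(pool) if i not in set(keep)]
    return clf
```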
Bootstrapping Example

19/23
State-of-the-art Results in WSD (2017)

20/23
State-of-the-art Results in WSD (2017)

21/23
Other Approaches – Ensembles

I Classifier error has two components: Bias and Variance


I The bias is error from erroneous assumptions in the learning algorithm. High bias can
cause an algorithm to miss the relevant relations between features and target outputs
(underfitting).
I The variance is error from sensitivity to small fluctuations in the training set. High variance
can cause an algorithm to model the random noise in the training data, rather than the
intended outputs (overfitting).
I Some algorithms (e.g., decision trees) try and build a representation of the training
data - Low Bias/High Variance
I Others (e.g., Naive Bayes) assume a parametric form and don’t necessarily represent
the training data - High Bias/Low Variance
I Combining classifiers with different bias variance characteristics can lead to improved
overall accuracy.

22/23
Unsupervised Methods for Word Sense Discrimination/Induction

I Unsupervised learning identifies patterns in a large sample of data, without the


benefit of any manually labeled examples or external knowledge sources.
I These patterns are used to divide the data into clusters, where each member of a
cluster has more in common with the other members of its own cluster than any other.
I Important: If you remove manual labels from supervised data and cluster, you may
not discover the same classes as in supervised learning:
I Supervised Classification identifies features that trigger a sense tag
I Unsupervised Clustering finds similarity between contexts
I Recent approaches to this use embeddings.

23/23
11-411
Natural Language Processing
Semantic Roles
Semantic Parsing

Kemal Oflazer

Carnegie Mellon University in Qatar

1/41
Semantics vs Syntax

I Syntactic theories and representations focus on the question of which strings in V +


are in the language.
I Semantics is about “understanding” what a string in V + means.
I Sidestepping a lengthy and philosophical discussion of what “meaning” is, we will
consider two meaning representations:
I Predicate-argument structures, also known as event frames.
I Truth conditions represented in first-order logic.

2/41
Motivating Example: Who did What to Whom?

I Warren bought the stock.


I They sold the stock to Warren.
I The stock was bought by Warren.
I The purchase of the stock by Warren surprised no one.
I Warren’s stock purchase surprised no one.

3/41
Motivating Example: Who did What to Whom?

I Warren bought the stock.


I They sold the stock to Warren.
I The stock was bought by Warren.
I The purchase of the stock by Warren surprised no one.
I Warren’s stock purchase surprised no one.

4/41
Motivating Example: Who did What to Whom?

I Warren bought the stock.


I They sold the stock to Warren.
I The stock was bought by Warren.
I The purchase of the stock by Warren surprised no one.
I Warren’s stock purchase surprised no one.

5/41
Motivating Example: Who did What to Whom?

I Warren bought the stock.


I They sold the stock to Warren.
I The stock was bought by Warren.
I The purchase of the stock by Warren surprised no one.

I Warren’s stock purchase surprised no one.

6/41
Motivating Example: Who did What to Whom?

I Warren bought the stock.


I They sold the stock to Warren.
I The stock was bought by Warren.
I The purchase of the stock by Warren surprised no one.
I Warren’s stock purchase surprised no one.

I In this buying/purchasing event/situation, Warren played the role of the buyer, and
there was some stock that played the role of the thing purchased.
I Also, there was presumably a seller, only mentioned in one example.
I In some examples, a separate “event” involving surprise did not occur.

7/41
Semantic Roles: Breaking

I Ali broke the window.


I The window broke.
I Ali is always breaking things.
I The broken window testified to Ali’s malfeasance.

8/41
Semantic Roles: Breaking

I Ali broke the window.


I The window broke. (?)
I Ali is always breaking things.
I The broken window testified to Ali’s malfeasance.

I A breaking event has a BREAKER and a BREAKEE.

9/41
Semantic Roles: Eating

I Eat!
I We ate dinner.
I We already ate.
I The pies were eaten up quickly.
I Our gluttony was complete.

10/41
Semantic Roles: Eating

I Eat!(you, listener) ?
I We ate dinner.
I We already ate.
I The pies were eaten up quickly.
I Our gluttony was complete.

I An eating event has an EATER and a FOOD, neither of which needs to be mentioned
explicitly.

11/41
Abstraction

BREAKER =? EATER

Both are actors that have some causal responsibility for changes in the world around them.

BREAKEE =? FOOD

Both are greatly affected by the event, which “happened to” them.

12/41
Thematic Roles

AGENT          The waiter spilled the soup.
EXPERIENCER    Ali has a headache.
FORCE          The wind blows debris into our garden.
THEME          Ali broke the window.
RESULT         The city built a basketball court.
CONTENT        Omar asked, “You saw Ali playing soccer?”
INSTRUMENT     He broke the window with a hammer.
BENEFICIARY    Jane made reservations for me.
SOURCE         I flew in from New York.
GOAL           I drove to Boston.

13/41
Verb Alternation Examples: Breaking and Giving

I Breaking:
I AGENT/subject; THEME/object; INSTRUMENT/PPwith
I INSTRUMENT/subject; THEME/object
I THEME/subject
I Giving:
I AGENT/subject; GOAL/object; THEME/second-object
I AGENT/subject; THEME/object; GOAL/PPto
I English verbs have been codified into classes that share patterns (e.g., verbs of
throwing: throw/kick/pass)

14/41
Semantic Role Labeling

I Input: a sentence x
I Output: A collection of predicates, each consisting of
I a label sometimes called the frame
I a span
I a set of arguments, each consisting of
I a label usually called the role
I a span

15/41
The Importance of Lexicons

I Like syntax, any annotated dataset is the product of extensive development of


conventions.
I Many conventions are specific to particular words, and this information is codified in
structured objects called lexicons.
I You should think of every semantically annotated dataset as both the data and the
lexicon.
I We consider two examples.

16/41
PropBank

I Frames are verb senses (with some extensions)


I Lexicon maps verb-sense-specific roles onto a small set of abstract roles (e.g., ARG0,
ARG1, etc.)
I Annotated on top of the Penn Treebank, so that arguments are always constituents.

17/41
fall.01 (move downward)

I ARG1: logical subject, patient, thing falling
I ARG2: extent, amount fallen
I ARG3: starting point
I ARG4: ending point
I ARGM-LOC: medium

I Sales fell to $251.2 million from $278.8 million.


I The average junk bond fell by 4.2%.
I The meteor fell through the atmosphere, crashing into Palo Alto.

18/41
fall.01 (move downward)

I ARG1: logical subject, patient, thing falling
I ARG2: extent, amount fallen
I ARG3: starting point
I ARG4: ending point
I ARGM-LOC: medium

I Sales fell to $251.2 million from $278.8 million.


I The average junk bond fell by 4.2%.
I The meteor fell through the atmosphere, crashing into Palo Alto.

19/41
fall.01 (move downward)

I ARG1: logical subject, patient, thing falling
I ARG2: extent, amount fallen
I ARG3: starting point
I ARG4: ending point
I ARGM-LOC: medium

I Sales fell to $251.2 million from $278.8 million.


I The average junk bond fell by 4.2%.
I The meteor fell through the atmosphere, crashing into Palo Alto.

20/41
fall.01 (move downward)

I ARG1: logical subject, patient, thing falling
I ARG2: extent, amount fallen
I ARG3: starting point
I ARG4: ending point
I ARGM-LOC: medium

I Sales fell to $251.2 million from $278.8 million.


I The average junk bond fell by 4.2%.
I The meteor fell through the atmosphere, crashing into Palo Alto.

21/41
fall.08 (fall back, rely on in emergency)

I ARG0: thing falling back
I ARG1: thing fallen on

I World Bank president Paul Wolfowitz has fallen back on his last resort.

22/41
fall.08 (fall back, rely on in emergency)

I ARG0: thing falling back
I ARG1: thing fallen on

I World Bank president Paul Wolfowitz has fallen back on his last resort.

23/41
fall.08 (fall back, rely on in emergency)

I ARG0: thing falling back
I ARG1: thing fallen on

I World Bank president Paul Wolfowitz has fallen back on his last resort.

24/41
fall.10 (fall for a trick; be fooled by)

I ARG0: the fool
I ARG1: the trick

I Many people keep falling for the idea that lowering taxes on the rich benefits
everyone.

25/41
fall.10 (fall for a trick; be fooled by)

I ARG0: the fool
I ARG1: the trick

I Many people keep falling for the idea that lowering taxes on the rich benefits
everyone.

26/41
fall.10 (fall for a trick; be fooled by)

I ARG0: the fool
I ARG1: the trick

I Many people keep falling for the idea that lowering taxes on the rich benefits
everyone.

27/41
FrameNet

I Frames can be any content word (verb, noun, adjective, adverb)


I About 1,000 frames, each with its own roles
I Both frames and roles are hierarchically organized
I Annotated without syntax, so that arguments can be anything
I Different philosophy:
I Micro roles defined according to frame
I Verb is in the background and frame is in the foreground.
I When a verb is “in” a frame it is allowed to use the associated roles.

28/41
change position on a scale

I ITEM: entity that has a position on the scale
I ATTRIBUTE: scalar property that the ITEM possesses
I DIFFERENCE: distance by which an ITEM changes its position
I FINAL STATE: ITEM’s state after the change
I FINAL VALUE: position on the scale where ITEM ends up
I INITIAL STATE: ITEM’s state before the change
I INITIAL VALUE: position on the scale from which the ITEM moves
I VALUE RANGE: portion of the scale along which values of ATTRIBUTE fluctuate
I DURATION: length of time over which the change occurs
I SPEED: rate of change of the value

29/41
FrameNet Example

[ Attacks on civilians ]ITEM [ decreased ]change position on a scale [ over the last four months ]DURATION

I The ATTRIBUTE is left unfilled but is understood from context (e.g., “number” or
“frequency”).

30/41
change position on a scale

I Verbs: advance, climb, decline, decrease, diminish, dip, double, drop, dwindle, edge,
explode, fall, fluctuate, gain, grow, increase, jump, move, mushroom, plummet, reach,
rise, rocket, shift, skyrocket, slide, soar, swell, swing, triple, tumble
I Nouns: decline, decrease, escalation, explosion, fall, fluctuation, gain, growth, hike,
increase, rise, shift, tumble
I Adverb: increasingly
I Frame hierarchy
event

... change position on a scale ...

change of temperature proliferating in number

31/41
The Semantic Role Labeling Task

I Given a syntactic parse, identify the appropriate role for each noun phrase (according
to the scheme that you are using, e.g., PropBank, FrameNet or something else)
I Why is this useful?
I Why is it useful for some tasks that you cannot perform with just dependency parsing?
I What kind of semantic representation could you obtain if you had SRL?
I Why is this hard?
I Why is it harder than dependency parsing?

32/41
Semantic Role Labeling Methods

I Boils down to labeling spans with frames and role names.


I It is mostly about features.
I Some features for SRL
I The governing predicate (often the main verb)
I The phrase type of the constituent (NP, NP-SUBJ, etc)
I The headword of the constituent
I The part of speech of the headword
I The path from the constituent to the predicate
I The voice of the clause (active, passive, etc.)
I The binary linear position of the constituent with respect to the predicate (before or after)
I The subcategorization of the predicate
I Others: named entity tags, more complex path features, when particular nodes appear in
the path, rightmost and leftmost words in the constituent, etc.

33/41
Example: Path Features

(S (NP-SBJ (DT The) (NNP San) (NNP Francisco) (NNP Examiner))
   (VP (VBD issued)
       (NP (DT a) (JJ special) (NN edition))
       (PP-TMP (IN around) (NN noon) (NP-TMP (NN yesterday)))))

Path from “The San Francisco Examiner” to “issued”: NP↑S↓VP↓VBD


Path from “a special edition” to “issued”: NP↑VP↓VBD
34/41
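A minimal sketch of computing this path feature from the constituency parse above, represented as an nltk.Tree (NLTK is assumed; the helper function is ad hoc).

```python
from nltk import Tree

parse = Tree.fromstring(
    "(S (NP-SBJ (DT The) (NNP San) (NNP Francisco) (NNP Examiner))"
    "   (VP (VBD issued)"
    "       (NP (DT a) (JJ special) (NN edition))"
    "       (PP-TMP (IN around) (NN noon) (NP-TMP (NN yesterday)))))")

def path_feature(tree, const_pos, pred_pos):
    # The shared prefix of the two tree positions is the lowest common ancestor.
    common = 0
    while (common < len(const_pos) and common < len(pred_pos)
           and const_pos[common] == pred_pos[common]):
        common += 1
    up = [tree[const_pos[:i]].label() for i in range(len(const_pos), common - 1, -1)]
    down = [tree[pred_pos[:i]].label() for i in range(common + 1, len(pred_pos) + 1)]
    return "↑".join(up) + "↓" + "↓".join(down)

print(path_feature(parse, (0,), (1, 0)))    # NP-SBJ↑S↓VP↓VBD  (subject -> predicate)
print(path_feature(parse, (1, 1), (1, 0)))  # NP↑VP↓VBD        (object  -> predicate)
```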
Sketch of an SRL Algorithm

35/41
Additional Steps for Efficiency

I Pruning
I Only a small number of constituents should ultimately be labeled
I Use heuristics to eliminate some constituents from consideration
I Preliminary Identification:
I Label each node as ARG or NONE with a binary classifier
I Classification
I Only then, perform 1-of-N classification to label the remaining ARG nodes with roles

36/41
Additional Information

I See framenet.icsi.berkeley.edu/fndrupal/ for additional information about


the FrameNet Project.
I Semantic Parsing Demos at
I http://demo.ark.cs.cmu.edu/parse
I http://nlp.cs.lth.se/demonstrations/

37/41
Methods: Beyond Features

I The span-labeling decisions interact a lot!


I Presence of a frame increases the expectation of certain roles
I Roles for the same predicate should not overlap
I Some roles are mutually exclusive or require each other (e.g., “resemble”)
I Using syntax as a scaffold allows efficient prediction; you are essentially labeling the
parse tree.
I Other methods include: discrete optimization, greedy and joint syntactic and semantic
dependencies.

38/41
Related Problems in “Relational” Semantics

I Coreference resolution: which mentions (within or across texts) refer to the same
entity or event?
I Entity linking: ground such mentions in a structured knowledge base (e.g.,
Wikipedia)
I Relation extraction: characterize the relation among specific mentions

39/41
General Remarks

I Semantic roles are just “syntax++” since they don’t allow much in the way of
reasoning (e.g., question answering).
I Lexicon building is slow and requires expensive expertise. Can we do this for every
(sub)language?

40/41
Snapshot

I We have had a taste of two branches of semantics:


I Lexical semantics (e.g., supersense tagging, WSD)
I Relational semantics (e.g., semantic role labeling)
I Coming up:
I Compositional Semantics

41/41
11-411
Natural Language Processing
Compositional Semantics

Kemal Oflazer

Carnegie Mellon University in Qatar

1/34
Semantics Road Map

I Lexical semantics
I Vector semantics
I Semantic roles, semantic parsing
I Meaning representation languages and Compositional semantics
I Discourse and pragmatics

2/34
Bridging the Gap between Language and the World

I Meaning representation is the interface between the language and the world.
I Answering essay question on an exam.
I Deciding what to order at a restaurant.
I Recognizing a joke.
I Executing a command.
I Responding to a request.

3/34
Desirable Qualities of Meaning Representation Languages (MRL)

I Represent the state of the world, i.e., be a knowledge base


I Query the knowledge base to verify that a statement is true, or answers a question.
I “Does Bukhara serve vegetarian food?”
I Is serves(Bukhara, vegetarian food) in our knowledge base?
I Handle ambiguity, vagueness, and non-canonical forms
I “I want to eat someplace that’s close to the campus.”
I “something not too spicy”
I Support inference and reasoning.
I “Can vegetarians eat at Bukhara?”

4/34
Desirable Qualities of Meaning Representation Languages (MRL)

I Inputs that mean the same thing should have the same meaning representation.
I “Bukhara has vegetarian dishes.”
I “They have vegetarian food at Bukhara.”
I “Vegetarian dishes are served at Bukhara.”
I “Bukhara serves vegetarian fare.”

5/34
Variables and Expressiveness

I “ I would like to find a restaurant where I can get vegetarian food.”


I serves(x, vegetarian food)
I It should be possible to express all the relevant details
I “Qatar Airways flies Boeing 777s and Airbus 350s from Doha to the US”

6/34
Limitation

I We will focus on the basic requirements of meaning representation.


I These requirements do not include correctly interpreting statements like
I “Ford was hemorrhaging money.”
I “I could eat a horse.”

7/34
What do we Represent?

I Objects: people (John, Ali, Omar), cuisines (Thai, Indian), restaurants (Bukhara,
Chef’s Garden), . . .
I John, Ali, Omar, Thai, Indian, Chinese, Bukhara, Chefs Garden, . . .
I Properties of Objects: Ali is picky, Bukhara is noisy, Bukhara is cheap, Indian is
spicy, John, Ali and Omar are humans, Bukhara has long wait . . .
I picky={Ali}, noisy={Bukhara}, spicy={Indian}, human={Ali, John, Omar}. . .
I Relations between objects: Bukhara serves Indian, NY Steakhouse serves steak.
Omar likes Chinese.
I serves(Bukhara, Indian), serves(NY Steakhouse, steak), likes(Omar, Chinese) . . .
I Simple questions are easy:
I Is Bukhara noisy?
I Does Bukhara serve Chinese?

8/34
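A toy sketch of this world as a tiny knowledge base built from Python sets, just to make the "simple questions are easy" point concrete (this is only an illustration, not a first-order logic system).

```python
properties = {
    "picky": {"Ali"}, "noisy": {"Bukhara"}, "spicy": {"Indian"},
    "human": {"Ali", "John", "Omar"},
}
relations = {
    "serves": {("Bukhara", "Indian"), ("NY Steakhouse", "steak")},
    "likes":  {("Omar", "Chinese")},
}

print("Bukhara" in properties["noisy"])               # Is Bukhara noisy?           -> True
print(("Bukhara", "Chinese") in relations["serves"])  # Does Bukhara serve Chinese? -> False
```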
MRL: First-order Logic – A Quick Tour

I Term: any constant (e.g., Bukhara) or a variable


I Formula: defined inductively . . .
I if R is an n-ary relation and t1 , . . . , tn are terms, then R(t1 , . . . , tn ) is a formula.
I if φ is a formula, then its negation, ¬φ is a formula.
I if φ and ψ are formulas, then binary logical connectives can be used to create formulas:
I φ∧ψ
I φ∨ψ
I φ⇒ψ
I φ⊕ψ
I If φ is a formula and v is a variable, then quantifiers can be used to create formulas:
I Existential quantifier: ∃v : φ
I Universal quantifier: ∀v : φ

9/34
First-order Logic: Meta Theory

I Well-defined set-theoretic semantics


I Sound: You can’t prove false things.
I Complete: You can prove everything that logically follows from a set of axioms (e.g.
with a “resolution theorem prover.”)
I Well-behaved, well-understood.
I But there are issues:
I “Meanings” of sentences are truth values.
I Only first-order (no quantifying over predicates).
I Not very good for “fluents” (time-varying things, real-valued quantities, etc.)
I Brittle: anything follows from any contradiction (!)
I Gödel Incompleteness: “This statement has no proof.”
I Finite axiom sets are incomplete with respect to the real world.
I Most systems use its descriptive apparatus (with extensions) but not its inference
mechanisms.

10/34
Translating between First-order Logic and Natural Language

I Bukhara is not loud. ¬noisy(Bukhara)


I Some humans like Chinese. ∃x, human(x) ∧ likes(x, chinese)
I If a person likes Thai, then they are not friends with Ali.
∀x, human(x) ∧ likes(x, Thai) ⇒ ¬friends(x, Ali)
I ∀x, restaurant(x) ⇒ (longwait(x) ∨ ¬likes(Ali, x))
Every restaurant has a long wait or is disliked by Ali.
I ∀x, ∃y, ¬likes(x, y) Everybody has something they don’t like.
I ∃y, ∀x, ¬likes(x, y) There is something that nobody likes.

11/34
Logical Semantics (Montague Semantics)

I The denotation of a natural language sentence is the set of conditions that must hold
in the (model) world for the sentence to be true.
I “Every restaurant has a long wait or is disliked by Ali.”
is true if and only if

∀x, restaurant(x) ⇒ (longwait(x) ∨ ¬likes(Ali, x))

is true.
I This is sometimes called the logical form of the NL sentence.

12/34
The Principle of Compositionality

I The meaning of a natural language phrase is determined by the meanings of its


sub-phrases.
I There are obvious exceptions: e.g., hot dog, New York, etc.
I Semantics is derived from syntax.
I We need a way to express semantics of phrases, and compose them together!
I Little pieces of semantics are introduced by words, from the lexicon.
I Grammar rules include semantic attachments that describe how the semantics of
the children are combined to produce the semantics of the parent, bottom-up.

13/34
Lexicon Entries

I In real systems that do detailed semantics, lexicon entries contain


I Semantic attachments
I Morphological info
I Grammatical info (POS, etc.)
I Phonetic info, if speech system
I Additional comments, etc.

14/34
λ-Calculus

I λ-abstraction is a device to give “scope” to variables.


I If φ is a formula and v is a variable, then λv.φ is a λ-term: an unnamed function from
values (of v) to formulas (usually involving v)
I application of such functions: if we have λv.φ and ψ , then [λv.φ](ψ) is a formula.
I It can be reduced by substituting ψ for every instance of v in φ
I [λx.likes(x, Bukhara)](Ali) reduces to likes(Ali, Bukhara).
I [λx.λy.friends(x, y)](b) reduces to λy.friends(b, y)
I [[λx.λy.friends(x, y)](b)](a) reduces to [λy.friends(b, y)](a) which reduces to friends(b, a)

15/34
Semantic Attachments to CFGs

I NNP → Ali {Ali}


I VBZ → likes {λf .λy.∀x f (x) ⇒ likes(y, x)}
I JJ → expensive {λx.expensive(x)}
I NNS → restaurants {λx.restaurant(x)}
I NP → NNP {NNP.sem}
I NP → JJ NNS {λx.JJ.sem(x)∧ NNS.sem(x)}
I VP → VBZ NP {VBZ.sem(NP.sem)}
I S → NP VP {VP.sem(NP.sem)}

16/34
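A rough way to see these attachments at work is to encode each one as a Python lambda and compose them bottom-up, exactly as the derivation on the following slides does symbolically. Formulas are built as strings purely for illustration; this is a toy sketch, not a real semantic parser, and the rule/variable names are invented.

```python
# Lexicon
Ali         = "Ali"
expensive   = lambda x: f"expensive({x})"
restaurants = lambda x: f"restaurant({x})"
likes       = lambda f: (lambda y: f"∀x {f('x')} ⇒ likes({y}, x)")

# Grammar rules with their semantic attachments
NP_jj_nns = lambda jj, nns: (lambda v: f"{jj(v)} ∧ {nns(v)}")   # NP -> JJ NNS
VP_vbz_np = lambda vbz, np: vbz(np)                             # VP -> VBZ NP
S_np_vp   = lambda np, vp: vp(np)                               # S  -> NP VP

np_sem = NP_jj_nns(expensive, restaurants)  # λv. expensive(v) ∧ restaurant(v)
vp_sem = VP_vbz_np(likes, np_sem)           # λy. ∀x expensive(x) ∧ restaurant(x) ⇒ likes(y, x)
print(S_np_vp(Ali, vp_sem))
# ∀x expensive(x) ∧ restaurant(x) ⇒ likes(Ali, x)
```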
Example

NP VP

NNP VBZ NP

Ali likes JJ NNS

expensive restaurants

17/34
Example

S : VP.sem(NP.sem)

NP : NNP.sem VP : VBZ.sem(NP.sem)

NNP : Ali VBZ : λf .λy.∀ x, f (x) ⇒ likes(y, x) NP : λv. JJ.sem(v)∧NNS.sem(v)

Ali likes JJ : λz.expensive(z) NNS : λw.restaurant(w)

expensive restaurants

18/34
Example

S : VP.sem(NP.sem)

NP : NNP.sem VP : VBZ.sem(NP.sem)

NNP : Ali VBZ : λ f .λ y.∀ x f (x) ⇒ likes(y, x) NP : λv.expensive(v) ∧ restaurant(v)

Ali likes JJ : λz.expensive(z) NNS : λw.restaurant(w)

expensive restaurants
   

λv. [λz.expensive(z)](v) ∧ [λw.restaurant(w)](v)
         (JJ.sem)               (NNS.sem)

19/34
Example

..
.

VP : VBZ.sem(NP.sem)

VBZ : λ f .λ y.∀ x f (x) ⇒ likes(y, x) NP : λ v.expensive(v) ∧ restaurant(v)

likes expensive restaurants

20/34
Example
..
.

VP : λy.∀x, expensive(x) ∧ restaurant(x) ⇒ likes(y, x)

VBZ : λf .λy.∀x f (x) ⇒ likes(y, x) NP : λv.expensive(v) ∧ restaurant(v)

likes expensive restaurants


 
 
VP.sem = [λf.λy.∀x f(x) ⇒ likes(y, x)] (λv.expensive(v) ∧ restaurant(v))    (VBZ.sem applied to NP.sem)
       = λy.∀x [λv.expensive(v) ∧ restaurant(v)](x) ⇒ likes(y, x)
       = λy.∀x, expensive(x) ∧ restaurant(x) ⇒ likes(y, x)


21/34
Example

S : VP.sem(NP.sem)

NP : NNP.sem VP : λy.∀x, expensive(x) ∧ restaurant(x) ⇒ likes(y, x)

NNP : Ali likes expensive restaurants

Ali

22/34
Example
S : ∀x expensive(x) ∧ restaurant(x) ⇒ likes(Ali, x)

NP : Ali VP : λy.∀x, expensive(x) ∧ restaurant(x) ⇒ likes(y, x)

NNP : Ali likes expensive restaurants

Ali
 
 
S.sem = [λy.∀x expensive(x) ∧ restaurant(x) ⇒ likes(y, x)] (Ali)    (VP.sem applied to NP.sem)
      = ∀x expensive(x) ∧ restaurant(x) ⇒ likes(Ali, x)

23/34
Quantifier Scope Ambiguity
I NNP → Ali {Ali}
I VBZ → likes {λf .λy.∀x f (x) ⇒ likes(y, x)}
I JJ → expensive {λx.expensive(x)}
I NNS → restaurants {λx.restaurant(x)}
I NP → NNP {NNP.sem}
I NP → JJ NNS {λx.JJ.sem(x) ∧ NNS.sem(x)}
I VP → VBZ NP {VBZ.sem(NP.sem)}
I S → NP VP {VP.sem(NP.sem)}
I NP → Det NN {Det.sem(NN.sem)}
I Det → every {λf .λg.∀u f (u) ⇒ g(u)}
I Det → a {λm.λn.∃x m(x) ∧ n(x)}
I NN → man {λv.man(v)}
I NN → woman {λv.woman(v)}
I VBZ → loves {λf .λy.∀x f (x) ⇒ loves(y, x)}
I Parse of “Every man loves a woman”: S → NP VP, with NP → Det NN (every man), VP → VBZ NP, and NP → Det NN (a woman).
I Resulting logical form: ∀u man(u) ⇒ ∃x woman(x) ∧ loves(u, x)
24/34
This is not Quite Right!

I “Every man loves a woman.” really is ambiguous.


I ∀u (man(u) ⇒ ∃x woman(x) ∧ loves(u, x))
I ∃x (woman(x) ∧ ∀u man(u) ⇒ loves(u, x))
I We get only one of the two meanings.
I Extra ambiguity on top of syntactic ambiguity.
I One approach is to delay the quantifier processing until the end, then permit any
ordering.

25/34
Other Meaning Representations: Abstract Meaning Representation

I “The boy wants to visit New York City.”


I Designed mainly for annotation-ability and eventual use in machine translation.
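Roughly, the AMR for this sentence (a sketch; the exact annotation may differ in details) is written in PENMAN notation as follows, with the same boy filling the ARG0 role of both want-01 and visit-01:

    (w / want-01
       :ARG0 (b / boy)
       :ARG1 (v / visit-01
                :ARG0 b
                :ARG1 (c / city
                         :name (n / name :op1 "New" :op2 "York" :op3 "City"))))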

26/34
Combinatory Categorial Grammar

I CCG is a grammatical formalism that is well-suited for tying together syntax and
semantics.
I Formally, it is more powerful than CFG – it can represent some of the
context-sensitive languages.
I Instead of the set of non-terminals of CFGs, CCGs can have an infinitely large set of
structured categories (called types).

27/34
CCG Types and Combinators

I Primitive types: typically S, NP, N, and maybe more.


I Complex types: built with “slashes,” for example:
I S/NP is “an S, except it lacks an NP to the right”
I S\NP is “an S, except it lacks an NP to the left”
I (S\NP)/NP is “an S, except that it lacks an NP to its right and to its left”
I You can think of complex types as functions:
I S/NP maps NPs to Ss.
I CCG Combinators: Instead of the production rules of CFGs, CCGs have a very
small set of generic combinators that tell us how we can put types together.
I By convention, rules are written differently from CFG rules: X Y ⇒ Z means that X and Y
combine to form a Z (the “parent” in the tree).

28/34
Application Combinator

I Forward Combination: X /Y Y ⇒ X
I Backward Combination: Y X \Y ⇒ X

NP S\NP

NP/N N (S\NP)/NP NP

the dog bit John

29/34
Conjunction Combinator

I X and X ⇒ X
S

NP S\NP

John S\NP and S\NP

(S\NP)/NP NP (S\NP)/NP NP

ate anchovies drank Coke

30/34
Composition Combinator

I Forward (X /Y Y /Z ⇒ X /Z )
I Backward (Y \Z X \Y ⇒ X \Z )

NP S\NP

I (S\NP)/NP NP

(S\NP)/(S\NP) (S\NP)/NP olives

would prefer

31/34
Type-raising Combinator
I Forward (X ⇒ Y /(Y \X ))
I Backward (X ⇒ Y \(Y /X ))

For “I love and Karen hates chocolate”:
“I” := NP, type-raised to S/(S\NP); “love” := (S\NP)/NP; forward composition gives S/NP
“Karen” := NP, type-raised to S/(S\NP); “hates” := (S\NP)/NP; forward composition gives S/NP
S/NP and S/NP ⇒ S/NP (conjunction); finally S/NP NP ⇒ S, with “chocolate” := NP

32/34
Back to Semantics

I Each combinator also tells us what to do with the semantic attachments.


I Forward application: X/Y : f    Y : g ⇒ X : f (g)
I Forward composition: X/Y : f    Y/Z : g ⇒ X/Z : λx.f (g(x))
I Forward type-raising: X : g ⇒ Y/(Y\X) : λf .f (g)
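As a small illustration (a sketch under simplifying assumptions, not a full CCG parser), forward and backward application over (category, semantics) pairs can be coded directly; categories are plain strings, semantics are Python lambdas, and parenthesized argument categories are not handled:

    def forward_apply(left, right):
        # X/Y : f   Y : g  =>  X : f(g)
        (ltype, f), (rtype, g) = left, right
        assert ltype.endswith("/" + rtype)          # only simple argument categories
        x = ltype[: ltype.rfind("/")].strip("()")
        return (x, f(g))

    def backward_apply(left, right):
        # Y : g   X\Y : f  =>  X : f(g)
        (ltype, g), (rtype, f) = left, right
        assert rtype.endswith("\\" + ltype)
        x = rtype[: rtype.rfind("\\")].strip("()")
        return (x, f(g))

    # "the dog bit John", with toy semantics
    bit = ("(S\\NP)/NP", lambda obj: (lambda subj: f"bit({subj}, {obj})"))
    vp  = forward_apply(bit, ("NP", "John"))        # (S\NP, λsubj. bit(subj, John))
    s   = backward_apply(("NP", "the_dog"), vp)     # (S, bit(the_dog, John))
    print(s[1])                                     # bit(the_dog, John)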

33/34
CCG Lexicon

I Most of the work is done in the lexicon.


I Syntactic and semantic information is much more formal here.
I Slash categories define where all the syntactic arguments are expected to be
I λ-expressions define how the expected arguments get “used” to build up a FOL
expression.

34/34
11-411
Natural Language Processing
Discourse and Pragmatics

Kemal Oflazer

Carnegie Mellon University in Qatar

1/61
What is Discourse?

I Discourse is the coherent structure of language above the level of sentences or


clauses.
I A discourse is a coherent structured group of sentences.
I What makes a passage coherent?
I A practical answer: It has meaningful connections between its utterances.

2/61
Applications of Computational Discourse

I Automatic essay grading


I Automatic summarization
I Meeting understanding
I Dialogue systems

3/61
Kinds of Discourse Analysis

I Monologue
I Human-human dialogue (conversation)
I Human-computer dialogue (conversational agents)
I “Longer-range” analysis (discourse) vs. “deeper” analysis (real semantics):
I John bought a car from Bill.
I Bill sold a car to John.
I They were both happy with the transaction.

4/61
Discourse in NLP

5/61
Coherence

I Coherence relations: E XPLANATION, C AUSE


I John hid Bill’s car keys. He was drunk.
I John hid Bill’s car keys. He likes spinach.
I Consider:
I John went to the store to buy a piano.
I He had gone to the store for many years.
I He was excited that he could finally afford a piano.
I He arrived just as the store was closing for the day.
I Now consider this:
I John went to the store to buy a piano.
I It was a store he had gone to for many years.
I He was excited that he could finally afford a piano.
I It was closing for the day just as John arrived.
I First is “intuitively” more coherent than the second.
I Entity-based coherence (centering).

6/61
Discourse Segmentation
I Many genres of text have particular conventional structures:
I Academic articles: Abstract, Introduction, Methodology, Results, Conclusion, etc.
I Newspaper stories:

I Spoken patient reports by doctors (SOAP): Subjective, Objective, Assessment, Plan.


7/61
Discourse Segmentation

I Given raw text, separate a document into a linear sequence of subtopics.


1–3 Intro: The search for life in space
4–5 The moon’s chemical composition
6–8 How early earth-moon proximity
shaped the moon
9–12 How the moon helped life evolve on
earth
13 Improbability of the earth-moon system

14–16 Binary/trinary star systems make life
unlikely
17–18 The low probability of nonbinary/trinary
systems
19–20 Properties of earth’s sun that facilitate
life
21 Summary

8/61
Applications of Discourse Segmentation

I Summarization: Summarize each segment independently.


I News Transcription: Separate a steady stream of transcribed news to separate
stories.
I Information Extraction: First identify the relevant segment and then extract.

9/61
Cohesion

I To remind: Coherence refers to the “meaning” relation between two units. A


coherence relation explains how the meaning in different textual units can combine to
a meaningful discourse.
I On the other hand, cohesion is the use of linguistic devices to link or tie together
textual units. A cohesive relation is like a “glue” grouping two units into one.
I Common words used are cues for cohesion.
I Before winter, I built a chimney and shingled the sides of my house.
I I have thus a tight shingled and plastered house.
I Synonymy/hypernymy relations are cues for lexical cohesion.
I Peel, core and slice the pears and the apples.
I Add the fruit to the skillet.
I Use of anaphora are cues for lexical cohesion
I The Woodhouses were first in consequence there.
I All looked up to them.

10/61
Discourse Segmentation
I Intuition: If we can “measure” the cohesion between every neighboring pair of
sentences, we may expect a “dip” in cohesion at subtopic boundaries.
I The TextTiling algorithm uses lexical cohesion.

11/61
The TextTiling Algorithm

I Tokenization
I lowercase, remove stop words, morphologically stem inflected words
I stemmed words are (dynamically) grouped into pseudo-sentences of length 20 (equal
length and not real sentences!)
I Lexical score determination
I Boundary identification

12/61
TextTiling – Determining Lexical Cohesion Scores

I Remember:
I Count-based similarity vectors
I Cosine-similarity
sim_cosine(a, b) = (a · b) / (|a| |b|)

I Consider a gap position i between any two words.


I Consider k = 10 words before the gap (a) and 10 words after the gap (b), and
compute their similarity yi .
I So yi ’s are the lexical cohesion scores between the 10 words before the gap and 10
words after the gap.
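A toy sketch (illustrative only) of computing a gap score yi as the cosine similarity between word-count vectors built from the k tokens before and after the gap:

    from collections import Counter
    import math

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in set(a) | set(b))
        norm = lambda c: math.sqrt(sum(v * v for v in c.values()))
        return dot / (norm(a) * norm(b))

    def gap_score(tokens, i, k=10):
        # y_i: similarity of the k tokens before the gap to the k tokens after it
        return cosine(Counter(tokens[max(0, i - k):i]), Counter(tokens[i:i + k]))

    tokens = "the moon formed close to the earth the moon raised huge tides".split()
    print(gap_score(tokens, 6, k=6))   # about 0.43: some shared vocabulary across the gap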

13/61
TextTiling – Determining Boundaries
I A gap position i is a valley if yi < yi−1 and yi < yi+1 .
I If i is a valley, find the depth score – distance from the peaks on both sides
= (yi−1 − yi ) + (yi+1 − yi ).
I Any valley whose depth score is at least s̄ − σs (that is, within one standard deviation of the
average valley depth, or deeper) is selected as a boundary; a small sketch follows.
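A minimal sketch (assuming the gap scores yi have already been computed, e.g., with the cosine measure above) of the valley / depth-score boundary rule:

    import statistics

    def boundaries(gap_scores):
        depths = {}
        for i in range(1, len(gap_scores) - 1):
            y_prev, y, y_next = gap_scores[i - 1], gap_scores[i], gap_scores[i + 1]
            if y < y_prev and y < y_next:                    # i is a valley
                depths[i] = (y_prev - y) + (y_next - y)      # depth score
        if not depths:
            return []
        cutoff = statistics.mean(depths.values()) - statistics.pstdev(depths.values())
        return sorted(i for i, d in depths.items() if d >= cutoff)

    print(boundaries([0.5, 0.45, 0.1, 0.4, 0.42, 0.15, 0.5]))   # [2, 5]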

14/61
TextTiling – Determining Boundaries

15/61
TextTiling – Determining Boundaries

16/61
Supervised Discourse Segmentation

I Spoken news transcription segmentation task.


I Data sets with hand-labeled boundaries exist.
I Paragraph segmentation task of monologues (lectures/speeches).
I Plenty of data on the web with <p> markers.
I Treat the task as a binary decision classification problem.
I Use classifiers such as decision trees, Support Vector Machines to classify
boundaries.
I Use sequence models such as Hidden Markov Models or Conditional Random Fields
to incorporate sequential constraints.
I Additional features that could be used are discourse markers or cue words which
are typically domain dependent.
I Good Evening I am PERSON
I Joining us with the details is PERSON
I Coming up

17/61
Evaluating Discourse Segmentation
I We could do precision, recall and F-measure, but . . .
I These will not be sensitive to near misses!
I A commonly-used metric is WindowDiff.
I Slide a window of length k across the (correct) references and the hypothesized
segmentation.

I Count the number of segmentation boundaries in each.


I Compute the average difference in the number of boundaries in the sliding window.
I Assigns partial credit.
I Another metric is pk (ref , hyp) – the probability that a randomly chosen pair of words a
distance k words apart are inconsistently classified.
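One standard formulation of WindowDiff counts the windows in which the two segmentations disagree on the number of boundaries; the sketch below assumes segmentations are given as 0/1 boundary flags (1 = a boundary right after position i):

    def window_diff(ref, hyp, k):
        assert len(ref) == len(hyp)
        n = len(ref)
        disagreements = 0
        for i in range(n - k):
            r = sum(ref[i:i + k])          # boundaries in the reference window
            h = sum(hyp[i:i + k])          # boundaries in the hypothesis window
            disagreements += (r != h)
        return disagreements / (n - k)

    ref = [0, 0, 1, 0, 0, 0, 1, 0, 0]
    hyp = [0, 1, 0, 0, 0, 0, 1, 0, 0]      # a near-miss on the first boundary
    print(window_diff(ref, hyp, 3))        # partial credit instead of a full miss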
18/61
Coherence

I Cohesion does not necessarily imply coherence.


I We need a more detailed definition of coherence.
I We need computational mechanisms for determining coherence.

19/61
Coherence Relations
I Let S0 and S1 represent the “meanings” of two sentences being related.
I Result: Infer that state or event asserted by S0 causes or could cause the state or
event asserted by S1 .
I The Tin Woodman was caught in the rain. His joints rusted.
I Explanation: Infer that state or event asserted by S1 causes or could cause the state
or event asserted by S0 .
I John hid the car’s keys. He was drunk.
I Parallel: Infer p(a1 , a2 , . . .) from the assertion of S0 and p(b1 , b2 , . . .) from the
assertion of S1 , where ai and bi are similar for all i.
I The Scarecrow wanted some brains. The Tin Woodman wanted a heart.
I Elaboration: Infer the same proposition from the assertions S0 and S1 .
I Dorothy was from Kansas. She lived in the midst of the great Kansas prairies.
I Occasion:
I A change of state can be inferred from the assertion S0 whose final state can be inferred
from S1 , or
I A change of state can be inferred from the assertion S1 whose initial state can be inferred
from S0 .
I Dorothy picked up the oil can. She oiled the Tin Woodman’s joints.
20/61
Coherence Relations
I Consider
I S1: John went to the bank to deposit his paycheck.
I S2: He then took a bus to Bill’s car dealership.
I S3: He needed to buy a car.
I S4: The company he works for now isn’t near a bus line.
I S5: He also wanted to talk with Bill about their soccer league.

Occasion(e1 , e2 )

S1 (e1 ) Explanation(e2 )

S2 (e2 ) Parallel(e3 ; e5 )

Explanation(e3 ) S5 (e5 )

S3 (e3 ) S4 (e4 )
21/61
Rhetorical Structure Theory – RST
I Based on 23 rhetorical relations between two spans of text in a discourse.
I a nucleus – central to the writer’s purpose and interpretable independently
I a satellite – less central and generally only interpretable with respect to the nucleus
I Evidence relation: [Kevin must be here.] (nucleus)   [His car is parked outside.] (satellite)
I An RST relation is defined by a set of constraints.

Relation name: Evidence
Constraints on Nucleus: Reader might not believe Nucleus to a degree satisfactory to Writer.
Constraints on Satellite: Reader believes Satellite or will find it credible.
Constraints on Nucleus+Satellite: Reader’s comprehending Satellite increases Reader’s belief of Nucleus.
Effects: Reader’s belief in Nucleus is increased.


22/61
RST – Other Common Relations

I Elaboration: The satellite gives more information about the nucleus.


I [N The company wouldn’t elaborate,] [S citing competitive reasons.]
I Attribution: The satellite gives the source of attribution for the information in nucleus.
I [S Analysts estimated,] [N that sales at US stores declined in the quarter.]
I Contrast: Two or more nuclei contrast along some dimension.
I [N The man was in a bad temper,] [N but his dog was quite happy.]
I List: A series of nuclei are given without contrast or explicit comparison.
I [N John was the goalie;] [N Bob, he was the center forward.]
I Background: The satellite gives context for interpreting the nucleus.
I [S T is a pointer to the root of a binary tree.] [N Initialize T.]

23/61
RST Coherence Relations

24/61
Automatic Coherence Assignment

I Given a sequence of sentences or clauses, we want to automatically:


I determine coherence relations between them (coherence relation assignment)
I extract a tree or graph representing an entire discourse (discourse parsing)

25/61
Automatic Coherence Assignment

I Very difficult!
I One existing approach is to use cue phrases.
I John hid Bill’s car keys because he was drunk.
I The scarecrow came to ask for a brain. Similarly, the tin man wants a heart.

1. Identify cue phrases in the text.


2. Segment the text into discourse segments.
I Use cue phrases/discourse markers
I although, but, because, yet, with, and, . . .
I but often implicit, as in car key example
3. Classify the relationship between each consecutive discourse segment.

26/61
Reference Resolution

I To interpret the sentences in a discourse we need to know who or what entity is being
talked about.
I Victoria Chen, CFO of Megabucks Banking Corp since 2004, saw her pay jump 20%,
to $1.3 million, as the 37-year-old also became the Denver-based company’s
president. It has been ten years since she came to Megabucks from rival Lotsaloot.
I Coreference chains:
I {Victoria Chen, CFO of Megabucks Banking Corp since 2004, her, the 37-year-old, the
Denver-based company’s president, she}
I {Megabucks Banking Corp, the Denver-based company, Megabucks}
I {her pay}
I {Lotsaloot}

27/61
Some Terminology

Victoria Chen, CFO of Megabucks Banking Corp since 2004, saw her pay jump 20%, to
$1.3 million, as the 37-year-old also became the Denver-based company’s president. It
has been ten years since she came to Megabucks from rival Lotsaloot.
I Referring expression
I Victoria Chen, the 37-year-old and she are referring expressions.
I Referent
I Victoria Chen is the referent.
I Two referring expressions referring to the same entity are said to corefer.
I A referring expression licenses the use of a subsequent expression.
I Victoria Chen allows Victoria Chen to be referred to as she.
I Victoria Chen is the antecedent of she.
I Reference to an earlier introduced entity is called anaphora.
I Such a reference is called anaphoric.
I the 37-year-old, her and she are anaphoric.

28/61
References and Context

I Suppose your friend has a car, a 1961 Ford Falcon.


I You can refer to it in many ways: it, this, that, this car, the car, the Ford, the Falcon,
my friend’s car, . . .
I However you are not free to use any of these in any context!
I For example, you can not refer to it as it, or as the Falcon, if the hearer has no prior
knowledge of the car, or it has not been mentioned, etc.
I Coreference chains are part of cohesion.

29/61
Other Kinds of Referents

I You do not always refer to entities. Consider:


I According to Doug, Sue just bought the Ford Falcon.
I But that turned out to be a lie.
I But that was false.
I That struck me as a funny way to describe the situation.
I That caused a financial problem for Sue.

30/61
Types of Referring Expressions

I Indefinite Noun Phrases


I Definite Noun Phrases
I Pronouns
I Demonstratives
I Names

31/61
Indefinite Noun Phrases

I Introduce new entities to the discourse


I Usually with a, an, some, and even this.
I Mrs. Martin was so kind as to send Mrs. Goddard a beautiful goose.
I I brought him some good news.
I I saw this beautiful Ford Falcon today.
I Specific vs. non-specific ambiguity.
I The goose above is specific – it is the one Mrs. Martin sent.
I The goose in “I am going to the butcher to buy a goose.” is non-specific.

32/61
Definite Noun Phrases

I Refer to entities identifiable to the hearer.


I Entities are either previously mentioned:
I It concerns a white stallion which I have sold to another officer. But the pedigree of the
white stallion was not fully established.
I Or, they are part of the hearer’s beliefs about the world.
I I read it in the New York Times

33/61
Pronouns

I Pronouns usually refer to entities that were introduced no further than one or two
sentences back.
I John went to Bob’s party and parked next to a classic Ford Falcon.
I He went inside and talked to Bob for more than a hour. (He = John)
I Bob told him that he recently got engaged. (him = John, he = Bob)
I He also said he bought it yesterday (He = Bob, it = ???)
I He also said he bought the Falcon yesterday (He = Bob)
I Pronouns can also participate in cataphora.
I Even before she saw it, Dorothy had been thinking about the statue.
I Pronouns also appear in quantified contexts, bound to the quantifier.
I Every dancer brought her left arm forward.

34/61
Demonstratives

I This, that, these, those


I Can be both pronouns or determiners
I That came earlier.
I This car was parked on the left.
I Proximal demonstrative – this
I Distal demonstrative – that
I Note that this NP is ambiguous: can be both indefinite or definite.

35/61
Names

I Names can be used to refer to new or old entities in the discourse.


I These mostly refer to named-entities: people, organizations, locations, geographical
objects, products, nationalities, physical facilities, geopolitical entities, dates,
monetary instruments, plants, animals, . . . .
I They are not necessarily unique:
I Do you mean the Ali in the sophomore class or the Ali in the senior class?

36/61
Reference Resolution

I Goal: Determine which entities are referred to by which linguistic expressions.


I The discourse model contains our eligible set of referents.

I Coreference resolution
I Pronomial anaphora resolution

37/61
Pronouns Reference Resolution: Filters

I Number, person, gender agreement constraints.


I it can not refer to books
I she can not refer to John
I Binding theory constraints:
I John bought himself a new Ford. (himself=John)
I John bought him a new Ford. (him ≠ John)
I John said that Bill bought him a new Ford. (him ≠ Bill)
I John said that Bill bought himself a new Ford. (himself = Bill)
I He said that he bought John a new Ford. (both he ≠ John)

38/61
Pronouns Reference Resolution: Preferences

I Recency: preference for most recent referent


I Grammatical Role: subj>obj>others
I Billy went to the bar with Jim. He ordered rum.
I Repeated mention:
I Billy had been drinking for days. He went to the bar again today. Jim went with him. He
ordered rum.
I Parallelism:
I John went with Jim to one bar. Bill went with him to another.
I Verb semantics:
I John phoned/criticized Bill. He lost the laptop.
I Selectional restrictions:
I John parked his car in the garage after driving it around for hours.

39/61
Pronoun Reference Resolution: The Hobbs Algorithm

I Algorithm for walking through parses of current and preceding sentences.


I Simple, often used as baseline.
I Requires parser, morphological gender and number
I Uses rules to identify heads of NPs
I Uses WordNet for humanness and gender
I Is person a hypernym of an NP head?
I Is female a hypernym of an NP head?
I Implements binding theory, recency, and grammatical role preferences.

40/61
Pronoun Reference Resolution: Centering Theory

I Claim: A single entity is “centered” in each sentence.


I That entity is to be distinguished from all other entities that have been evoked.
I Also used in entity-based coherence.
I Let Un and Un+1 be two adjacent utterances.
I The backward-looking center of Un , denoted Cb (Un ), is the entity focused on in the
discourse after Un is interpreted.
I The forward-looking centers of Un , denoted Cf (Un ), are an ordered list of the
entities mentioned in Un , any of which could serve as the Cb of Un+1 .
I Cb (Un+1 ) is the highest ranked element of Cf (Un ) that is also mentioned in Un+1 .
I Entities in Cf (Un ) are ordered: subject > existential predicate nominal > object >
indirect object . . .
I Cp is the preferred center, the first element of Cf (Un ).

41/61
Sentence Transitions

I Rule 1: If any element of Cf (Un ) is realized as a pronoun in Un+1 , then Cb (Un+1 )
must also be realized as a pronoun.
I Rule 2: Transition states are ordered: Continue > Retain > Smooth-shift >
Rough-shift
I Algorithm:
I Generate possible Cb – Cf combinations for each possible set of reference assignments.
I Filter by constraints: agreements, selectional restrictions, centering rules and constraints
I Rank by transition orderings
I The most preferred relation defines the pronomial referents.

42/61
Pronoun Reference Resolution: Log-Linear Models
I Supervised: hand-labeled coreference corpus
I Rule-based filtering of non-referential pronouns:
I It was a dark and stormy night.
I It is raining.
I Needs positive and negative examples:
I Positive examples in the corpus.
I Negative examples are created by pairing pronouns with other noun phrases.
I Features are extracted for each training example.
I Classifier learns to predict 1 or 0.
I During testing:
I Classifier extracts all potential antecedents by parsing the current and previous sentences.
I Each NP is considered a potential antecedent for each following pronoun.
I Each pronoun – potential antecedent pair is then presented (through their features) to the
classifier.
I Classifier predicts 1 or 0.
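As an illustration only (the feature names and toy data below are invented, and a real system would use many more features and training pairs), the pairwise setup can be sketched with a log-linear classifier from scikit-learn:

    from sklearn.linear_model import LogisticRegression

    def features(pron, cand):
        return [
            int(pron["number"] == cand["number"]),   # number agreement
            int(pron["gender"] == cand["gender"]),   # gender agreement
            cand["sent_dist"],                       # recency (sentences back)
            int(cand["role"] == "subject"),          # grammatical-role preference
        ]

    he   = {"number": "sg", "gender": "m"}
    john = {"number": "sg", "gender": "m", "sent_dist": 1, "role": "subject"}
    ford = {"number": "sg", "gender": "n", "sent_dist": 1, "role": "object"}
    bob  = {"number": "sg", "gender": "m", "sent_dist": 0, "role": "object"}

    X = [features(he, john), features(he, ford), features(he, bob)]   # (pronoun, candidate) pairs
    y = [1, 0, 0]                                                     # 1 = coreferent
    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba([features(he, john)])[0, 1])              # P(coreferent)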

43/61
Pronoun Reference Resolution: Log-Linear Models

I Example
I U1 : John saw a Ford at the dealership.
I U2 : He showed it to Bob.
I U3 : He bought it.
I Features for He in U3

44/61
General Reference Resolution

I Victoria Chen, CFO of Megabucks Banking Corp since 2004, saw her pay jump 20%,
to $1.3 million, as the 37-year-old also became the Denver-based company’s
president. It has been ten years since she came to Megabucks from rival Lotsaloot.
I Coreference chains:
I {Victoria Chen, CFO of Megabucks Banking Corp since 2004, her, the 37-year-old, the
Denver-based company’s president, she}
I {Megabucks Banking Corp, the Denver-based company, Megabucks}
I {her pay}
I {Lotsaloot}

45/61
High-level Recipe for Coreference Resolution

I Parse the text and identify NPs; then


I For every pair of NPs, carry out binary classification: coreferential or not?
I Collect the results into coreference chains
I What do we need?
I A choice of classifier.
I Lots of labeled data.
I Features

46/61
High-level Recipe for Coreference Resolution

I Word-level edit distance between the two NPs


I Are the two NPs the same NER type?
I Appositive syntax
I “Alan Shepherd, the first American astronaut, . . . ”
I Proper/definite/indefinite/pronoun
I Gender
I Number
I Distance in sentences
I Number of NPs between
I Grammatical roles
I Any other relevant features, e.g., embeddings?

47/61
Pragmatics

I Pragmatics is a branch of linguistics dealing with language use in context.


I When a diplomat says yes, he means ‘perhaps’;
I When he says perhaps, he means ‘no’;
I When he says no, he is not a diplomat.
I (Variously attributed to Voltaire, H. L. Mencken, and Carl Jung)

48/61
In Context?

I Social context
I Social identities, relationships, and setting
I Physical context
I Where? What objects are present? What actions?
I Linguistic context
I Conversation history
I Other forms of context
I Shared knowledge, etc.

49/61
Language as Action: Speech Acts

I The Mood of a sentence indicates relation between speaker and the concept
(proposition) defined by the LF
I There can be operators that represent these direct relations:
I ASSERT: the proposition is proposed as a fact
I YN-QUERY: the truth of the proposition is queried
I COMMAND: the proposition describes a requested action
I WH-QUERY: the proposition describes an object to be identified
I There are also indirect speech acts.
I Can you pass the salt?
I It is warm here.

50/61
“How to Do Things with Words” (J. L. Austin)1

I In addition to just saying things, sentences perform actions.


I When these sentences are uttered, the important thing is not their truth value, but the
felicitousness of the action (e.g., do you have the authority to do it):
I I name this ship the Titanic.
I I take this man to be my husband.
I I bequeath this watch to my brother.
I I declare war.

1 http://en.wikipedia.org/wiki/J._L._Austin
51/61
Performative Sentences

I When uttered by the proper authority, such sentences have the effect of changing the
state of the world, just as any other action that can change the state of the world.
I These involve verbs like, name, second, declare, etc.
I “I name this ship the Titanic.” also causes the ship to be named Titanic.
I You can tell whether sentences are performative by adding “hereby”:
I I hereby name this ship the Queen Elizabeth.
I Non-performative sentences do not sound good with hereby:
I Birds hereby sing.
I There is hereby fighting in Syria.

52/61
Speech Acts Continued

I Locutionary Act: The utterance of a sentence with a particular meaning.


I Illocutionary Act: The act of asking, answering, promising, etc. in uttering a
sentence.
I I promise you that I will fix the problem.
I You can’t do that (protesting)
I By the way, I have a CD of Debussy; would you like to borrow it? (offering)
I Perlocutionary Act: The – often intentional – production of certain effects on the
addressee.
I You can’t do that. (stopping or annoying the addressee)
I By the way, I have a CD of Debussy; would you like to borrow it? (impressing the
addressee)

53/61
Searle’s Speech Acts

I Assertives = speech acts that commit a speaker to the truth of the expressed
proposition
I Directives = speech acts that are to cause the hearer to take a particular action, e.g.
requests, commands and advice
I Can you pass the salt?
I Has the form of a question but the effect of a directive
I Commissives = speech acts that commit a speaker to some future action, e.g.
promises and oaths
I Expressives = speech acts that express the speaker’s attitudes and emotions
towards the proposition, e.g. congratulations, excuses
I Declarations = speech acts that change the reality in accord with the proposition of
the declaration, e.g. pronouncing someone guilty or pronouncing someone husband
and wife

54/61
Speech Acts in NLP

I Speech acts (inventories) are mainly used in developing (task-oriented) dialog


systems.
I Speech acts are used as annotation guidelines for corpus annotation.
I An annotated corpus is then used for machine learning of dialog tasks.
I Such corpora are highly developed and checked for intercoder agreement.
I Annotation still takes a long time to learn.

55/61
Task-oriented Dialogues

I Making travel reservations (flight, hotel room, etc.)


I Scheduling a meeting.
I Finding out when the next bus is.
I Making a payment over the phone.

56/61
Ways of Asking for a Room

I I’d like to make a reservation


I I’m calling to make a reservation
I Do you have a vacancy on . . .
I Can I reserve a room?
I Is it possible to reserve a room?

57/61
Examples of Task-oriented Speech Acts

I Identify self:
I This is David
I My name is David
I I’m David
I David here
I Sound check: Can you hear me?
I Meta dialogue act: There is a problem.
I Greet: Hello.
I Request-information:
I Where are you going.
I Tell me where you are going.

58/61
Examples of Task-oriented Speech Acts

I Backchannel – Sounds you make to indicate that you are still listening
I ok, m-hm
I Apologize/reply to apology
I Thank/reply to thanks
I Request verification/Verify
I So that’s 2:00? Yes. 2:00.
I Resume topic
I Back to the accommodations . . .
I Answer a yes/no question: yes, no.

59/61
Task-oriented Speech Acts in Negotiation

I Suggest
I I recommend this hotel
I Offer
I I can send some brochures.
I How about if I send some brochures.
I Accept
I Sure. That sounds fine.
I Reject
I No. I don’t like that one.

60/61
Negotiation

61/61
(Mostly Statistical)
Machine Translation
11-411
Fall 2017
2
The Rosetta Stone
• Decree from Ptolemy V
on repealing taxes and
erecting some statues
(196 BC)
• Written in three
languages
– Hieroglyphic
– Demotic
– Classical Greek

3
Overview
• History of Machine Translation
• Early Rule-based Approaches
• Introduction to Statistical Machine Translation
(SMT)
• Advanced Topics in SMT
• Evaluation of (S)MT output

4
Machine Translation
• Transform text (speech) in one language
(source) to text (speech) in a different
language (target) such that
– The “meaning” in the source language input is
(mostly) preserved, and
– The target language output is grammatical.
• Holy grail application in AI/NLP since middle of
20th century.

5
Translation
• Process
– Read the text in the source language
– Understand it
– Write it down in the target language

• These are hard tasks for computers


– The human process is invisible, intangible

6
Machine Translation
Many possible legitimate translations!

7
Machine Translation
Rolls-Royce Merlin Engine (from German Wikipedia):
• Der Rolls-Royce Merlin ist ein 12-Zylinder-Flugmotor von Rolls-Royce in V-Bauweise, der vielen wichtigen britischen und US-amerikanischen Flugzeugmustern des Zweiten Weltkriegs als Antrieb diente. Ab 1941 wurde der Motor in Lizenz von der Packard Motor Car Company in den USA als Packard V-1650 gebaut.
• Nach dem Krieg wurden diverse Passagier- und Frachtflugzeuge mit diesem Motor ausgestattet, so z. B. Avro Lancastrian, Avro Tudor und Avro York, später noch einmal die Canadair C-4 (umgebaute Douglas C-54). Der zivile Einsatz des Merlin hielt sich jedoch in Grenzen, da er als robust, aber zu laut galt.
• Die Bezeichnung des Motors ist gemäß damaliger Rolls-Royce Tradition von einer Vogelart, dem Merlinfalken, übernommen und nicht, wie oft vermutet, von dem Zauberer Merlin.
English Translation (via Google Translate):
• The Rolls-Royce Merlin is a 12-cylinder aircraft engine from Rolls-Royce V-type, which served many important British and American aircraft designs of World War II as a drive. From 1941 the engine was built under license by the Packard Motor Car Company in the U.S. as a Packard V-1650th.
• After the war, several passenger and cargo aircraft have been equipped with this engine, such as Avro Lancastrian, Avro Tudor Avro York and, later, the Canadair C-4 (converted Douglas C-54). The civilian use of the Merlin was, however, limited as it remains robust, however, was too loud.
• The name of the motor is taken under the then Rolls-Royce tradition of one species, the Merlin falcon, and not, as often assumed, by the wizard Merlin.

8
Machine Translation
Rolls-Royce Merlin Engine English Translation
(from German Wikipedia) (via Google Translate)
• Der Rolls-Royce Merlin ist ein 12-Zylinder- • The Rolls-Royce Merlin is a 12-cylinder
Flugmotor von Rolls-Royce in V-Bauweise, aircraft engine from Rolls-Royce V-type,
der vielen wichtigen britischen und US- which served many important British and
amerikanischen Flugzeugmustern des American aircraft designs of World War II as
Zweitenweltkriegs als Antrieb diente. Ab a drive. From 1941 the engine was built
1941 wurde der Motor in Lizenz von der under license by the Packard Motor Car
Packard Motor Car Company in den USA als Company in the U.S. as a Packard V-1650th.
Packard V-1650 gebaut. • After the war, several passenger and cargo
• Nach dem Krieg wurden diverse Passagier- aircraft have been equipped with this engine,
und Frachtflugzeuge mit diesem Motor such as Avro Lancastrian, Avro Tudor Avro
ausgestattet, so z. B. Avro Lancastrian, Avro York and, later, the Canadair C-4 (converted
Tudor und Avro York, später noch einmal die Douglas C-54). The civilian use of the Merlin
Canadair C-4 (umgebaute Douglas C-54). Der was, however, limited as it remains robust,
zivile Einsatz des Merlin hielt sich jedoch in however, was too loud.
Grenzen, da er als robust, aber zu laut galt. • The name of the motor is taken under the
• Die Bezeichnung des Motors ist gemäß then Rolls-Royce tradition of one species, the
damaliger Rolls-Royce Tradition von einer Merlin falcon, and not, as often assumed, by
Vogelart, dem Merlinfalken, übernommen the wizard Merlin.
und nicht, wie oft vermutet, von dem
Zauberer Merlin.

9
Machine Translation
Rolls-Royce Merlin Engine Turkish Translation
(from German Wikipedia) (via Google Translate)
• Der Rolls-Royce Merlin ist ein 12-Zylinder- • Rolls-Royce Merlin 12-den silindirli Rolls-
Flugmotor von Rolls-Royce in V-Bauweise, Royce uçak motoru V tipi, bir sürücü olarak
der vielen wichtigen britischen und US- Dünya Savaşı'nın birçok önemli İngiliz ve
amerikanischen Flugzeugmustern des Amerikan uçak tasarımları devam eder. 1.941
ZweitenWeltkriegs als Antrieb diente. Ab motor lisansı altında Packard Motor Car
1941 wurde der Motor in Lizenz von der Company tarafından ABD'de Packard V olarak
Packard Motor Car Company in den USA als yaptırılmıştır Gönderen-1650
Packard V-1650 gebaut. • Savaştan sonra, birkaç yolcu ve kargo uçakları
• Nach dem Krieg wurden diverse Passagier- ile Avro Lancastrian, Avro Avro York ve Tudor
und Frachtflugzeuge mit diesem Motor gibi bu motor, daha sonra, Canadair C-4
ausgestattet, so z. B. Avro Lancastrian, Avro (Douglas C-54) dönüştürülür
Tudor und Avro York, später noch einmal die donatılmıştır. Olarak, ancak, çok yüksek oldu
Canadair C-4 (umgebaute Douglas C-54). Der sağlam kalır Merlin sivil kullanıma Ancak
zivile Einsatz des Merlin hielt sich jedoch in sınırlıydı.
Grenzen, da er als robust, aber zu laut galt. • Motor adı daha sonra Rolls altında bir türün,
• Die Bezeichnung des Motors ist gemäß Merlin şahin, ve değil-Royce geleneği, sıklıkta
damaliger Rolls-Royce Tradition von einer kabul, Merlin sihirbaz tarafından alınır.
Vogelart, dem Merlinfalken, übernommen
und nicht, wie oft vermutet, von dem
Zauberer Merlin.

10
Machine Translation
Rolls-Royce Merlin Engine Arabic Translation
(from German Wikipedia) (via Google Translate -- 2009
• Der Rolls-Royce Merlin ist ein 12-Zylinder-
Flugmotor von Rolls-Royce in V-Bauweise,
der vielen wichtigen britischen und US-
amerikanischen Flugzeugmustern des
ZweitenWeltkriegs als Antrieb diente. Ab
1941 wurde der Motor in Lizenz von der
Packard Motor Car Company in den USA als
Packard V-1650 gebaut.
• Nach dem Krieg wurden diverse Passagier-
und Frachtflugzeuge mit diesem Motor
ausgestattet, so z. B. Avro Lancastrian, Avro
Tudor und Avro York, später noch einmal die
Canadair C-4 (umgebaute Douglas C-54). Der
zivile Einsatz des Merlin hielt sich jedoch in
Grenzen, da er als robust, aber zu laut galt.
• Die Bezeichnung des Motors ist gemäß
damaliger Rolls-Royce Tradition von einer
Vogelart, dem Merlinfalken, übernommen
und nicht, wie oft vermutet, von dem
Zauberer Merlin.

11
Machine Translation
Rolls-Royce Merlin Engine Arabic Translation
(from German Wikipedia) (via Google Translate – 2017)
• Der Rolls-Royce Merlin ist ein 12-Zylinder-
Flugmotor von Rolls-Royce in V-Bauweise,
der vielen wichtigen britischen und US-
amerikanischen Flugzeugmustern des
ZweitenWeltkriegs als Antrieb diente. Ab
1941 wurde der Motor in Lizenz von der
Packard Motor Car Company in den USA als
Packard V-1650 gebaut.
• Nach dem Krieg wurden diverse Passagier-
und Frachtflugzeuge mit diesem Motor
ausgestattet, so z. B. Avro Lancastrian, Avro
Tudor und Avro York, später noch einmal die
Canadair C-4 (umgebaute Douglas C-54). Der
zivile Einsatz des Merlin hielt sich jedoch in
Grenzen, da er als robust, aber zu laut galt.
• Die Bezeichnung des Motors ist gemäß
damaliger Rolls-Royce Tradition von einer
Vogelart, dem Merlinfalken, übernommen
und nicht, wie oft vermutet, von dem
Zauberer Merlin.

12
Machine Translation
• (Real-time speech-to-speech) Translation is a
very demanding task
– Simultaneous translators (in UN, or EU Parliament)
last about 30 minutes
– Time pressure
– Divergences between languages
• German: Subject ........................... Verb
• English: Subject Verb ……………………….
• Arabic: Verb Subject ..............

13
Brief History
• 1950’s: Intensive research activity in MT
– Translate Russian into English
• 1960’s: Direct word-for-word replacement
• 1966 (ALPAC): NRC Report on MT
– Conclusion: MT no longer worthy of serious scientific
investigation.
• 1966-1975: `Recovery period’
• 1975-1985: Resurgence (Europe, Japan)
• 1985-present: Resurgence (US)
– Mostly Statistical Machine Translation since 1990s
– Recently Neural Network/Deep Learning based machine
translation

14
Early Rule-based Approaches
• Expert system-like rewrite systems
• Interlingua methods (analyze and generate)
• Information used for translation are compiled
by humans
– Dictionaries
– Rules

15
Vauquois Triangle

16
Statistical Approaches
• Word-to-word translation
• Phrase-based translation
• Syntax-based translation (tree-to-tree, tree-to-
string)
– Trained on parallel corpora
– Mostly noisy-channel (at least in spirit)

17
Early Hints on the Noisy Channel
Intuition
• “One naturally wonders if the problem of
translation could conceivably be treated as a
problem in cryptography. When I look at an
article in Russian, I say: ‘This is really written
in English, but it has been coded in some
strange symbols. I will now proceed to
decode.’ ”
Warren Weaver
• (1955:18, quoting a letter he wrote in 1947)

18
Divergences between Languages
• Languages differ along many dimensions
– Concept – Lexicon alignment – Lexical Divergence
– Syntax – Structure Divergence
• Word-order differences
– English is Subject-Verb-Object
– Arabic is Verb-Subject-Object
– Turkish is Subject-Object-Verb
• Phrase order differences
• Structure-Semantics Divergences

19
Lexical Divergences
• English: wall
– German: Wand for walls inside, Mauer for walls
outside
• English: runway
– Dutch: Landingbaan for when you are landing;
startbaan for when you are taking off
• English: aunt
– Turkish: hala (father’s sister), teyze(mother’s sister)
• Turkish: o
– English: she, he, it
20
Lexical Divergences
How conceptual space is cut up

21
Lexical Gaps
• One language may not have a word for a
concept in another language
– Japanese: oyakoko
• Best English approximation: “filial piety”
– Turkish: gurbet
• Where you are when you are not “home”
– English: condiments
• Turkish: ??? (things like mustard, mayo and ketchup)

22
Local Phrasal Structure Divergences
• English: a blue house
– French: une maison bleue
• German: die ins Haus gehende Frau
– English: the lady walking into the house

23
Structural Divergences
• English: I have a book.
– Turkish: Benim kitabim var. (Lit: My book exists)
• French: Je m’appelle Jean (Lit: I call myself
Jean)
– English: My name is Jean.
• English: I like swimming.
– German: Ich schwimme gerne. (Lit: I swim
“likingly”.)

24
Major Rule-based MT
Systems/Projects
• Systran
– Major human effort to construct large translation
dictionaires + limited word-reordering rules
• Eurotra
– Major EU-funded project (1970s-1994) to translate
among (then) 12 EC languages.
• Bold technological framework
– Structural Interlingua
• Management failure
• Never delivered a working MT system
• Helped create critical mass of researchers

25
Major Rule-based MT
Systems/Projects
• METEO
– Successful system for French-English translation of
Canadian weather reports (1975-1977)
• PANGLOSS
– Large-scale MT project by CMU/USC-ISI/NMSU
– Interlingua-based Japanese-Spanish-English
translation
– Manually developed semantic lexicons

26
Rule-based MT
• Manually develop rules to analyze the source
language sentence (e.g., a parser)
– => some source structure representation
• Map source structure to a target structure
• Generate target sentence from the transferred
structure

27
Rule-based MT
Syntactic Transfer
English analysis: Sentence → Noun Phrase (Pronoun: I) + Verb Phrase (Verb: read + Noun Phrase: Adj. scientific, Noun books)
⇒ transferred French structure: Sentence → Noun Phrase (Pronoun: Je) + Verb Phrase (Verb: lire + Noun Phrase: Noun livres, Adj. scientifiques)
The adjective and noun are swapped; source language analysis feeds target language generation.

28
Rules
• Rules to analyze the source sentences
– (Usually) Context-free grammar rules coupled with
linguistic features
• Sentence => Subject-NP Verb-Phrase
• Verb-Phrase => Verb Object …..

29
Rules
• Lexical transfer rules
– English: book (N) => French: livre (N, masculine)
– English: pound (N, monetary sense)=> French:
livre (N, feminine)
– English: book (V) => French: réserver (V)
• Quite tricky for

30
Rules
• Structure Transfer Rules
– English: S => NP VP è
French: TR(S) => TR(NP) TR(VP)
– English: NP => Adj Noun è
French: TR(NP) => Tr(Noun) Tr(Adj)
but there are exceptions for
Adj=grand, petit, ….

31
Rules
Much more complex to deal with “real world” sentences.

32
Example-based MT (EBMT)
• Characterized by its use of a bilingual corpus
with parallel texts as its main knowledge base,
at run-time.
• Essentially translation by analogy and can be
viewed as an implementation of case-based
reasoning approach of machine learning.
• Find how (parts of) input are translated in the
examples
– Cut and paste to generate novel translations

33
Example-based MT (EBMT)
• Translation Memory
– Store many translations,
• source – target sentence pairs
– For new sentences, find closes match
• use edit distance, POS match, other similarity techniques
– Do corrections,
• map insertions, deletions, substitutions onto target sentence
– Useful only when you expect same or similar sentence to
show up again, but then high quality

34
Example-based MT (EBMT)
English Japanese
• How much is that red • Ano akai kasa wa ikura desu
umbrella? ka?
• How much is that small • Ano chiisai kamera wa ikura
camera? desu ka?

• How much is that X? • Ano X wa ikura desu ka?

35
Hybrid Machine Translation
• Use multiple techniques (rule-based/
EBMT/Interlingua)
• Combine the outputs of different systems to
improve final translations

36
How do we evaluate MT output?
• Adequacy: Is the meaning of the source
sentence conveyed by the target sentence?
• Fluency: Is the sentence grammatical in the
target language?
• These are rated on a scale of 1 to 5

37
How do we evaluate MT output?
Je suis fatigué.

Adequacy Fluency

Tired is I. 5 2

Cookies taste good! 1 5

I am tired. 5 5

38
How do we evaluate MT output?
• This in general is very labor intensive
– Read each source sentence
– Evaluate target sentence for adequacy and fluency
• Not easy to do if you improve your MT system
10 times a day, and need to evaluate!
– Could this be mechanized?
• Later

39
MT Strategies (1954-2004)
A two-dimensional map of MT approaches (slide by Laurie Gerber):
• Knowledge acquisition strategy axis: hand-built by experts → hand-built by non-experts → learned from annotated data → learned from un-annotated data (all manual → fully automated).
• Knowledge representation axis: word-based (shallow/simple) → syntactic constituent structure → semantic analysis → interlingua (deep/complex).
• Electronic dictionaries and the original direct approach are shallow and hand-built; typical transfer systems use syntactic structure; classic interlingual systems are deep and hand-built; example-based MT (phrase tables) and statistical MT learn shallow representations from data; new research aims at deeper representations learned from data.
40
Statistical Machine Translation
• How do statistics and probabilities come
into play?
– Often statistical and rule-based MT are seen as
alternatives, even opposing approaches – wrong
!!!
                 No Probabilities        Probabilities
Flat Structure   EBMT                    SMT
Deep Structure   Transfer, Interlingua   Holy Grail
– Goal: structurally rich probabilistic models
41
Rule-based MT vs SMT
Expert system: experts write manually coded rules (If « … » then … / Else ….). For the input S: Mais où sont les neiges d’antan?, the expert system produces a single output T: But where are the snows of ? …
Statistical system: a bilingual parallel corpus of (S, T) pairs plus machine learning yields statistical rules, e.g. P(but | mais) = 0.7, P(however | mais) = 0.3, P(where | où) = 1.0, …
Statistical system output (ranked):
T1: But where are the snows of yesteryear? P = 0.41
T2: However, where are yesterday’s snows? P = 0.33
T3: Hey - where did the old snow go? P = 0.18
42
Data-Driven Machine Translation

Hmm, every time he sees “banco”, he either types “bank” or “bench” … but if he sees “banco de…”, he always types “bank”, never “bench”…
Man, this is so boring.
Translated documents

43
Slide by Kevin Knight
Statistical Machine Translation
• The idea is to use lots of parallel texts to
model how translations are done.
– Observe how words or groups of words are
translated
– Observe how translated words are moved around
to make fluent sentences in the target sentences

44
Parallel Texts
1a. Garcia and associates . 7a. the clients and the associates are enemies .
1b. Garcia y asociados . 7b. los clients y los asociados son enemigos .

2a. Carlos Garcia has three associates . 8a. the company has three groups .
2b. Carlos Garcia tiene tres asociados . 8b. la empresa tiene tres grupos .

3a. his associates are not strong . 9a. its groups are in Europe .
3b. sus asociados no son fuertes . 9b. sus grupos estan en Europa .
4a. Garcia has a company also . 10a. the modern groups sell strong pharmaceuticals .
4b. Garcia tambien tiene una empresa . 10b. los grupos modernos venden medicinas fuertes .

5a. its clients are angry . 11a. the groups do not sell zenzanine .
5b. sus clientes estan enfadados . 11b. los grupos no venden zanzanina .
6a. the associates are also angry . 12a. the small groups are not modern .
6b. los asociados tambien estan enfadados . 12b. los grupos pequenos no son modernos .
45
Parallel Texts
Clients do not sell pharmaceuticals in Europe
Clientes no venden medicinas en Europa
1a. Garcia and associates . 7a. the clients and the associates are enemies .
1b. Garcia y asociados . 7b. los clients y los asociados son enemigos .

2a. Carlos Garcia has three associates . 8a. the company has three groups .
2b. Carlos Garcia tiene tres asociados . 8b. la empresa tiene tres grupos .

3a. his associates are not strong . 9a. its groups are in Europe .
3b. sus asociados no son fuertes . 9b. sus grupos estan en Europa .
4a. Garcia has a company also . 10a. the modern groups sell strong pharmaceuticals .
4b. Garcia tambien tiene una empresa . 10b. los grupos modernos venden medicinas fuertes .

5a. its clients are angry . 11a. the groups do not sell zenzanine .
5b. sus clientes estan enfadados . 11b. los grupos no venden zanzanina .
6a. the associates are also angry . 12a. the small groups are not modern .
6b. los asociados tambien estan enfadados . 12b. los grupos pequenos no son modernos .
46
Parallel Texts

1. employment rates are very low , 1. istihdam oranları , özellikle kadınlar için
especially for women . çok düşüktür .
2. the overall employment rate in 2001 was 2. 2001 yılında genel istihdam oranı % 46,8'
46. 8% . dir .
3. the system covers insured employees 3. sistem , işini kaybeden sigortalı işsizleri
who lose their jobs . kapsamaktadır .
4. the resulting loss of income is covered 4. ortaya çıkan gelir kaybı , ödenmiş
in proportion to the premiums paid . primlerle orantılı olarak karşılanmaktadır .
5. there has been no development in the 5. engelli kişiler konusunda bir gelişme
field of disabled people . kaydedilmemiştir .
6. overall assessment 6. genel değerlendirme
7. no social dialogue exists in most private 7. özel işletmelerin çoğunda sosyal diyalog
enterprises . yoktur .
8. it should be reviewed together with all the 8. konseyin yapısı , sosyal taraflar ile birlikte
social partners . yeniden gözden geçirilmelidir .
9. much remains to be done in the field of 9. sosyal koruma alanında yapılması gereken
social protection . çok şey vardır .

47
Available Parallel Data (2004)

Millions of
words
(English side)

+ 1m-20m words for


many language pairs

(Data stripped of formatting, in sentence-pair format, available


from the Linguistic Data Consortium at UPenn).
48
Available Parallel Data (2008)
• Europarl: 30 million words in 11 languages
• Acquis Communitaire: 8-50 million words in 20 EU
languages
• Canadian Hansards: 20 million words from Canadian
Parlimentary Proceedings
• Chinese/Arabic - English: over 100 million words from
LDC
• Lots more French/English, Spanish/French/English from
LDC
• Smaller corpora for many other language pairs
– Usually English – Some other language.

49
Available Parallel Data (2017)

+ 1m-20m words for


many language pairs

50
Available Parallel Text
• A book has a few 100,000s words
• An educated person may read 10,000 words a
day
– 3.5 million words a year
– 300 million words a lifetime
• Soon computers will have access to more
translated text than humans read in a lifetime

51
More data is better!
• Language Weaver Arabic to English Translation

v.2.0 – October 2003 v.2.4 – October 2004

52
Sample Learning Curves

BLEU score vs. number of sentence pairs used in training, for Swedish/English, French/English, German/English, and Finnish/English (experiments by Philipp Koehn).
53
Preparing Data
• Sentence Alignment
• Tokenization/Segmentation

54
Sentence Alignment
The old man is happy. El viejo está feliz
He has fished many porque ha pescado
times. His wife talks muchos veces. Su
to him. The fish are mujer habla con él.
jumping. The sharks Los tiburones
await. esperan.

55
Sentence Alignment
1. The old man is 1. El viejo está feliz
happy. porque ha pescado
2. He has fished many muchos veces.
times. 2. Su mujer habla con
3. His wife talks to él.
him. 3. Los tiburones
4. The fish are esperan.
jumping.
5. The sharks await.

56
Sentence Alignment
• 1-1 Alignment
– 1 sentence in one side aligns to 1 sentence in the
other side
• 0-n, n-0 Alignment
– A sentence in one side aligns to no sentences on the
other side
• n-m Alignment (n,m>0 but typically very small)
– n sentences on one side align to m sentences on the
other side

57
Sentence Alignment
• Sentence alignments are typically done by
dynamic programming algorithms
– Almost always, the alignments are monotonic.
– The lengths of sentences and their translations
(mostly) correlate.
– Tokens like numbers, dates, proper names,
cognates help anchor sentences..
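A toy sketch of such a dynamic program (a rough illustration, not Gale & Church’s actual model: the costs and penalties below are made-up values, and real aligners use a probabilistic length model plus anchors):

    def align(src_lens, tgt_lens):
        INF = float("inf")
        n, m = len(src_lens), len(tgt_lens)
        moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]        # allowed bead shapes

        def move_cost(s, t):
            if not s or not t:                                  # 1-0 / 0-1: unaligned sentence
                return 10                                       # flat skip penalty (made up)
            pen = 0 if len(s) == len(t) == 1 else 4             # mild preference for 1-1 beads
            return abs(sum(s) - sum(t)) + pen                   # how well the lengths match

        cost = [[INF] * (m + 1) for _ in range(n + 1)]
        back = [[None] * (m + 1) for _ in range(n + 1)]
        cost[0][0] = 0.0
        for i in range(n + 1):
            for j in range(m + 1):
                if cost[i][j] == INF:
                    continue
                for di, dj in moves:
                    if i + di <= n and j + dj <= m:
                        c = cost[i][j] + move_cost(src_lens[i:i + di], tgt_lens[j:j + dj])
                        if c < cost[i + di][j + dj]:
                            cost[i + di][j + dj] = c
                            back[i + di][j + dj] = (i, j)

        beads, state = [], (n, m)                               # trace back the chosen beads
        while state != (0, 0):
            pi, pj = back[state[0]][state[1]]
            beads.append((list(range(pi, state[0])), list(range(pj, state[1]))))
            state = (pi, pj)
        return list(reversed(beads))

    # rough character lengths of the 5 English / 3 Spanish sentences in the example
    print(align([21, 25, 22, 21, 17], [52, 22, 22]))
    # note: with lengths alone the aligner may pair "The fish are jumping." with
    # "Los tiburones esperan." and drop "The sharks await."; exactly the kind of
    # mistake that anchors (numbers, dates, cognates) help avoid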

58
Sentence Alignment
1. The old man is 1. El viejo está feliz
happy. porque ha pescado
2. He has fished many muchos veces.
times. 2. Su mujer habla con
3. His wife talks to él.
him. 3. Los tiburones
4. The fish are esperan.
jumping.
5. The sharks await.

59
Sentence Alignment
1. The old man is 1. El viejo está feliz
happy. He has porque ha pescado
fished many times. muchos veces.
2. His wife talks to 2. Su mujer habla con
him. él.
3. The sharks await. 3. Los tiburones
esperan.

Unaligned sentences are thrown out, and


sentences are merged in n-to-m alignments (n, m > 0).
60
Tokenization (or Segmentation)
• English
– Input (some byte stream):
"There," said Bob.
– Output (7 “tokens” or “words”):
" There , " said Bob .
• Chinese
– Input (byte stream): 美国关岛国际机场及其办公室均接获
一名自称沙地阿拉伯富商拉登等发出
的电子邮件。

– Output: 美国 关岛国 际机 场 及其 办公
室均接获 一名 自称 沙地 阿拉 伯
富 商拉登 等发 出 的 电子邮件。
61
The Basic Formulation of SMT
• Given a source language sentence s, what is the
target language text t, that maximizes

p(t | s)
• So, any target language sentence t is a “potential”
translation of the source sentence s
– But probabilities differ
– We need that t with the highest probability of being a
translation.

62
The Basic Formulation of SMT
• Given a source language sentence s, what is
the target language text t, that maximizes
p(t | s)
• We denote this computation as a search

t* = argmax_t p(t | s)

63
The Basic Formulation of SMT
• We need to compute t* = argmax_t p(t | s)

• Using Bayes’ Rule we can “factorize” this into two
separate problems:

t* = argmax_t [ p(s | t) p(t) / p(s) ]
   = argmax_t p(s | t) p(t)
– Search over all possible target sentences t
• For a given s, p(s) is constant, so no need to consider it in the
maximization

64
The Noisy Channel Model
(Target) T: Dün Ali’yi gördüm.  →  Noisy Channel  →  (Source) S: I saw Ali yesterday

Models:
• P(T): what are likely sentences he could have said in the target language?
• P(S|T): how could the channel have “corrupted” the target into the source language?
Decoding: what was the target sentence he used?

65
Where do the probabilities come
from?

Source(s)/Target(t) bilingual text → statistical analysis → Translation Model P(S|T) (source ↔ “broken” target)
Target text → statistical analysis → Target Language Model P(T)
Source sentence → decoding algorithm: argmax_T P(T) * P(S|T) → target sentence

66
The Statistical Models
• Translation model p(S|T)
– Essentially models Adequacy without having to
worry about Fluency.
• P(S|T) is high for sentences S, if words in S are in
general translations of words in T.
• Target Language Model p(T)
– Essentially models Fluency without having to
worry about Adequacy
• P(T) is high if a sentence T is a fluent sentence in the
target language

67
How do the models interact?
• Maximizing p(S | T) P(T)
– p(T) models “good” target sentences (Target Language Model)
– p(S|T) models whether words in source sentence are “good”
translation of words in the target sentence (Translation Model)

Source: I saw Ali yesterday.  Each candidate T is scored on: Good target? P(T); Good match to source? P(S|T); Overall.
Candidates: Bugün Ali’ye gittim / Okulda kalmışlar / Var gelmek ben / Dün Ali’yi gördüm / Gördüm ben dün Ali’yi / Dün Ali’ye gördüm

68
Three Problems for Statistical MT
• Language model
– Given a target sentence T, assigns p(T)
• good target sentence -> high p(T)
• word salad -> low p(T)

• Translation model
– Given a pair of strings <S,T>, assigns p(S | T)
• <S,T> look like translations -> high p(S | T)
• <S,T> don’t look like translations -> low p(S | T)

• Decoding algorithm
– Given a language model, a translation model, and a new
sentence S … find translation T maximizing p(T) * p(S|T)
69
The Classic Language Model:
Word n-grams
• Helps us choose among sentences
– He is on the soccer field
– He is in the soccer field

– Is table the on cup the


– The cup is on the table

– Rice shrine
– American shrine
– Rice company
– American company

70
The Classic Language Model
• What is a “good” target sentence? (HLT Workshop 3)
• T = t1 t2 t3 … tn;
• We want P(T) to be “high”
• A good approximation is by short n-grams
– P(T) ≈ P(t1|START) • P(t2|START,t1) • P(t3|t1,t2) • … • P(ti|ti-2,ti-1) • … • P(tn|tn-2,tn-1)

– Estimate from large amounts of text


• Maximum-likelihood estimation
• Smoothing for unseen data
– You can never see all of language
• There is no data like more data (e.g., 10^9 words would be nice)
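A toy sketch of maximum-likelihood bigram estimation with add-one smoothing (illustrative only; real language models use vastly more data and better smoothing such as Kneser-Ney):

    from collections import Counter

    corpus = [["I", "saw", "water"], ["the", "cup", "is", "on", "the", "table"]]
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])                   # history counts
        bigrams.update(zip(toks[:-1], toks[1:]))     # bigram counts

    def p(w, prev):
        # add-one smoothed estimate of P(w | prev)
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + len(vocab))

    print(p("table", "the"), p("saw", "I"))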

71
The Classic Language Model

• If the target language is English. using 2-grams


P(I saw water on the table) ≈
P(I | START) *
P(saw | I) *
P(water | saw) *
P(on | water) *
P(the | on) *
P(table | the) *
P(END | table)

72
The Classic Language Model

• If the target language is English, using 3-grams


P(I saw water on the table) ≈
P(I | START, START) *
P(saw | START, I) *
P(water | I, saw) *
P(on | saw, water) *
P(the | water, on) *
P(table | on, the) *
P(END | the, table)

73
Translation Model?

Generative approach:

Mary did not slap the green witch

Morphological Analysis → source-language morphological structure
Parsing → source parse tree
Semantic Analysis → semantic representation
Generation → target structure
(What are all the possible moves and their associated probability tables?)

Maria no dió una botefada a la bruja verde

74
The Classic Translation Model
Word Substitution/Permutation [IBM Model 3, Brown et al., 1993]

Generative approach:

Mary did not slap the green witch


Predict count of target words

Mary not slap slap slap the green witch


Predict target words from NULL

Mary not slap slap slap NULL the green witch


Translate source to target words

Maria no dió una botefada a la verde bruja


Reorder target words

Maria no dió una botefada a la bruja verde

75
The Classic Translation Model
Word Substitution/Permutation [IBM Model 3, Brown et al., 1993]

Generative approach:

Mary did not slap the green witch


Predict count of target words

Mary not slap slap slap the green witch


Predict target words from NULL

Mary not slap slap slap NULL the green witch


Translate source to target words

Maria no dió una bofetada a la verde bruja


Reorder target words

Maria no dió una bofetada a la bruja verde

Selected as the most likely by P(T)


76
Basic Translation Model (IBM M-1)
• Model p(t | s, m)
– t = <t1, t2, …, tm>, s = <s1, s2, …, sn>
• Lexical translation makes the following
assumptions
– Each word ti in t is generated from exactly one word in
s.
– Thus, we have a latent alignment ai that indicates
which source word ti “came from.” Specifically, ti came
from s_ai.
– Given the alignments a, translation decisions are
conditionally independent of each other and depend
only on the aligned source word s_ai.

77
Basic Translation Model (IBM M-1)
p(t | s, m) = Σ_{a ∈ [0,n]^m} p(a | s, m) × Π_{i=1}^{m} p(ti | s_ai)

              p(alignment)         p(translation | alignment)

78
Parameters of the IBM 3 Model
• Fertility: How many words does a source word get
translated to?
– n(k | s): the probability that the source word s gets
translated as k target words
– Fertility depends solely on the source words in question
and not other source words in the sentence, or their
fertilities.

• Null Probability: What is the probability of a word


magically appearing in the target at some position,
without being the translation of any source word?
– P-null

79
Parameters of the IBM 3 Model
• Translation: How do source words translate?
– tr(t|s): the probability that the source word s gets
translated as the target word t
– Once we fix n(k | s) we generate k target words
• Reordering: How do words move around in the
target sentence?
– d(j | i): distortion probability – the probability of word
at position i in a source sentence being translated as
the word at position j in target sentence.
• Very dubious!!

80
How IBM Model 3 works
1. For each source word si indexed by i = 1, 2, ..., m,
choose a fertility φi with probability n(φi | si).
2. Choose the number φ0 of “spurious” target words to
be generated from s0 = NULL.

81
How IBM Model 3 works
3. Let q be the sum of fertilities for all words,
including NULL.
4. For each i = 0, 1, 2, ..., m, and each k = 1, 2, ..., φi,
choose a target word tik with probability tr(tik | si).
5. For each i = 1, 2, ..., m, and each k = 1, 2, ..., φi,
choose a target position πik with probability
d(πik | i, l, m).

82
How IBM Model 3 works
6. For each k = 1, 2, ..., φ0, choose a position π0k from
the remaining vacant positions in 1, 2, ..., q, for a
total probability of 1/φ0!.
7. Output the target sentence with words tik in
positions πik (0 ≤ i ≤ m, 1 ≤ k ≤ φi).

83
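The seven steps amount to a small sampling procedure. The sketch below walks through them in Python with invented toy tables (n, tr and p_null are not real estimates), and it replaces the distortion step d(j | i, l, m) with a plain shuffle, so it only mirrors the shape of the generative story rather than the real Model 3 distributions.

import random

n      = {"Mary": {1: 1.0}, "not": {1: 1.0}, "slap": {3: 0.6, 1: 0.4}}   # fertility n(k | s), toy values
tr     = {"Mary": {"Maria": 1.0}, "not": {"no": 1.0},
          "slap": {"dió": 0.4, "una": 0.3, "bofetada": 0.3},
          None:   {"a": 1.0}}                                            # translation tr(t | s), toy values
p_null = 0.02                                                            # chance of a spurious NULL word

def pick(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(source):
    # steps 1-2: fertilities for the real words, then the number of NULL-generated words
    fert = [pick(n[w]) for w in source]
    phi0 = sum(random.random() < p_null for _ in range(sum(fert)))
    # steps 3-4: translate every source word (and NULL) into target words
    target = [pick(tr[w]) for w, k in zip(source, fert) for _ in range(k)]
    target += [pick(tr[None]) for _ in range(phi0)]
    # steps 5-7: a real model places words with d(j | i, l, m); this sketch just shuffles
    random.shuffle(target)
    return " ".join(target)

print(generate(["Mary", "not", "slap"]))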
Example
• n-parameters
Aligned pair 1: b c d ↔ x y z (b–x, d–y, d–z; c unaligned)
Aligned pair 2: b d ↔ x y (b–x, d–y)

• n(0,b)=0, n(1,b)=2/2=1
• n(0,c)=1/1=1, n(1,c)=0
• n(0,d)=0, n(1,d)=1/2=0.5, n(2,d)=1/2=0.5

84
Example
• t-parameters
(same aligned pairs as above)

• t(x|b)=1.0
• t(y|d)=2/3
• t(z|d)=1/3

85
Example
• d-parameters
(same aligned pairs as above)

• d(1|1,3,3)=1.0
• d(1|1,2,2)=1.0
• d(2|2,3,3)=0.0
• d(3|3,3,3)=1.0
• d(2|2,2,2)=1.0

86
Example
• p1
(same aligned pairs as above)

• No target words are generated by NULL, so p1 = 0.0

87
The Classic Translation Model
Word Substitution/Permutation [IBM Model 3, Brown et al., 1993]

Generative approach:

Mary did not slap the green witch


n(3|slap)
Mary not slap slap slap the green witch
P-Null

Mary not slap slap slap NULL the green witch


tr(la | the)
Maria no dió una bofetada a la verde bruja
d(j | i)

Maria no dió una bofetada a la bruja verde

Selected as the most likely by P(T)


88
How do we get these parameters?
• Remember we had aligned parallel sentences

• Now we need to figure out how words align


with other words.
– Word alignment

89
Word Alignments
• One source word can map
to 0 or more target words
– But not vice versa
• technical reasons
• Some words in the target
can magically be
generated from an
invisible NULL word
• A target word can only be
generated from one
source word
– technical reasons

90
Word Alignments
tr(oeuvre | worked) = c(oeuvre, worked) / c(worked)

• Count over all aligned


sentences
• worked
– fonctionné(30), travaillé(20),
marché(27), oeuvré (13)
– tr(oeuvre|worked)=0.13
• Similarly, n(3, many) can
be computed.

91
How do we get these alignments?
• We only have aligned sentences and the
constraints:
– One source word can map to 0 or more target words
• But not vice versa
– Some words in the target can magically be generated
from an invisible NULL word
– A target word can only be generated from one source
word
• Expectation–Maximization (EM) Algorithm
– Mathematics is rather complicated

92
How do we get these alignments?

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

All word alignments equally likely

All p(french-word | english-word) equally likely

93
How do we get these alignments?

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

“la” and “the” observed to co-occur frequently,


so p(la | the) is increased.

94
How do we get these alignments?

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

“house” co-occurs with both “la” and “maison”, but


p(maison | house) can be raised without limit, to 1.0,
while p(la | house) is limited because of “the”

(pigeonhole principle)
95
How do we get these alignments?

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

settling down after another iteration

96
How do we get these alignments?

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

Inherent hidden structure revealed by EM training!


For further details, see:

• “A Statistical MT Tutorial Workbook” (Knight, 1999).


• “The Mathematics of Statistical Machine Translation” (Brown et al, 1993)
• Software: GIZA++

97
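For IBM Model 1 the EM procedure sketched above is short enough to write out. The Python sketch below runs it on the la maison / the house toy corpus and learns only the translation table tr(f | e); real training (e.g., with GIZA++) adds NULL words and the alignment, fertility and distortion models of the higher IBM models.

from collections import defaultdict

corpus = [("the house".split(),      "la maison".split()),
          ("the blue house".split(), "la maison bleue".split()),
          ("the flower".split(),     "la fleur".split())]

e_vocab = {e for es, _ in corpus for e in es}
f_vocab = {f for _, fs in corpus for f in fs}
tr = {f: {e: 1.0 / len(e_vocab) for e in e_vocab} for f in f_vocab}   # all equally likely at the start

for _ in range(50):                                   # EM iterations
    count = defaultdict(lambda: defaultdict(float))   # expected co-occurrence counts
    total = defaultdict(float)
    for es, fs in corpus:
        for f in fs:                                  # E-step: fractional counts from current tr
            norm = sum(tr[f][e] for e in es)
            for e in es:
                c = tr[f][e] / norm
                count[e][f] += c
                total[e] += c
    for e in e_vocab:                                 # M-step: renormalize
        for f in f_vocab:
            tr[f][e] = count[e][f] / total[e] if total[e] > 0 else 0.0

print(round(tr["maison"]["house"], 2))                # rises towards 1.0, as the slides describe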
Decoding for “Classic” Models
• Of all conceivable English word strings, find the one
maximizing p(t) * p(s | t)

• Decoding is an NP-complete challenge


– Reduction to Traveling Salesman problem (Knight, 1999)

• Several search strategies are available

• Each potential target output is called a hypothesis.

98
Dynamic Programming Beam Search
[Search lattice: partial hypotheses grow by choosing the 1st, 2nd, 3rd, ... target word, from
start to end, until all source words are covered]

Each partial translation hypothesis contains:
- Last English word chosen + source words covered by it
- Next-to-last English word chosen
- Entire coverage vector (so far) of source sentence
- Language model and translation model scores (so far)

[Jelinek, 1969; Brown et al., 1996 US Patent; Och, Ueffing, and Ney, 2001]
99
The Classic Results
• la politique de la haine . (Original Source)
• politics of hate . (Reference Translation)
• the policy of the hatred . (IBM4+N-grams+Stack)

• nous avons signé le protocole . (Original Source)


• we did sign the memorandum of agreement . (Reference Translation)
• we have signed the protocol . (IBM4+N-grams+Stack)

• où était le plan solide ? (Original Source)


• but where was the solid plan ? (Reference Translation)
• where was the economic base ? (IBM4+N-grams+Stack)

• the Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment
40.007 billion US dollars today provide data include that year to November china actually
using foreign 46.959 billion US dollars and
101
Flaws of Word-Based MT
• Multiple source words for one target word
– IBM models can do one-to-many (fertility) but not many-
to-one
• Phrasal Translation
– “real estate”, “note that”, “interest in”
• Syntactic Transformations
– Verb at the beginning in Arabic
– Translation model penalizes any proposed re-ordering
– Language model not strong enough to force the verb to
move to the right place

102
Phrase-Based Statistical MT
Morgen fliege ich nach Kanada zur Konferenz

Tomorrow I will fly to the conference in Canada

• Source input segmented into phrases


– “phrase” is any sequence of words
• Each phrase is probabilistically translated into target
– P(to the conference | zur Konferenz)
– P(into the meeting | zur Konferenz)
• Phrases are probabilistically re-ordered

103
Advantages of Phrase-Based SMT
• Many-to-many mappings can handle non-
compositional phrases
• Local context is very useful for disambiguating
– “Interest rate” → …
– “Interest in” → …
• The more data, the longer the learned phrases
– Sometimes whole sentences

104
How to Learn the Phrase Translation Table?

• One method: “alignment templates”


• Start with word alignment, build phrases from that.

[Word alignment grid: Maria no dió una bofetada a la bruja verde ↔ Mary did not slap the green witch]

This word-to-word alignment is a by-product of training a translation model like IBM-Model-3.
This is the best (or “Viterbi”) alignment.
105
How to Learn the Phrase Translation Table?

• One method: “alignment templates” (Och et al, 1999)


• Start with word alignment, build phrases from that.

[Word alignment grid: Maria no dió una bofetada a la bruja verde ↔ Mary did not slap the green witch]

This word-to-word alignment is a by-product of training a translation model like IBM-Model-3.
This is the best (or “Viterbi”) alignment.
106
IBM Models are 1-to-Many
• Run IBM-style aligner both directions, then
merge:

T→S best alignment
S→T best alignment

MERGE (Union or Intersection)

107
How to Learn the Phrase Translation Table?

• Collect all phrase pairs that are consistent with the


word alignment

[Word alignment grid: Maria no dió una bofetada a la bruja verde ↔ Mary did not slap the green witch,
with one example phrase pair highlighted]
108
Word Alignment Consistent Phrases
[Three panels over the sub-grid “Maria no dió” ↔ “Mary did not slap”: the first candidate phrase
pair is consistent; the other two are inconsistent because an alignment point (marked x) lies
outside the chosen phrase box]


Phrase alignment must contain all alignment points for all
the words in both phrases!
109
Word Alignment Induced Phrases
[Word alignment grid: Maria no dió una bofetada a la bruja verde ↔ Mary did not slap the green witch]

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)

110
Word Alignment Induced Phrases
[Word alignment grid: Maria no dió una bofetada a la bruja verde ↔ Mary did not slap the green witch]

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)

111
Word Alignment Induced Phrases
[Word alignment grid: Maria no dió una bofetada a la bruja verde ↔ Mary did not slap the green witch]

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap the)
(Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the)
(bruja verde, green witch)

112
Word Alignment Induced Phrases
[Word alignment grid: Maria no dió una bofetada a la bruja verde ↔ Mary did not slap the green witch]

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap the)
(Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the)
(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)
(a la bruja verde, the green witch)
113
Word Alignment Induced Phrases
[Word alignment grid: Maria no dió una bofetada a la bruja verde ↔ Mary did not slap the green witch]

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap the)
(Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the)
(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)
(a la bruja verde, the green witch) (Maria no dió una bofetada a la, Mary did not slap the)
(no dió una bofetada a la, did not slap the) (dió una bofetada a la bruja verde, slap the green witch)
114
Word Alignment Induced Phrases
[Word alignment grid: Maria no dió una bofetada a la bruja verde ↔ Mary did not slap the green witch]

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap the)
(Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the)
(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)
(a la bruja verde, the green witch) (Maria no dió una bofetada a la, Mary did not slap the)
(no dió una bofetada a la, did not slap the) (dió una bofetada a la bruja verde, slap the green witch)
(Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
115
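The extraction illustrated on the last few slides can be written down compactly. The Python sketch below is a simplified version of standard phrase extraction over one toy sentence pair; the alignment points are my reading of the grid, and the sketch omits the usual handling of unaligned words and length limits, so its output will not exactly reproduce the list on the slide.

src = "Maria no dió una bofetada a la bruja verde".split()
tgt = "Mary did not slap the green witch".split()
# (source index, target index) alignment points; "a" is left unaligned
align = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3), (6, 4), (7, 6), (8, 5)]

def extract_phrases(src, tgt, align):
    pairs = set()
    for s1 in range(len(src)):
        for s2 in range(s1, len(src)):
            # target span covered by the alignment points of src[s1..s2]
            points = [(s, t) for (s, t) in align if s1 <= s <= s2]
            if not points:
                continue
            t1, t2 = min(t for _, t in points), max(t for _, t in points)
            # consistency: no alignment point of tgt[t1..t2] may leave src[s1..s2]
            if all(s1 <= s <= s2 for (s, t) in align if t1 <= t <= t2):
                pairs.add((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs

for sp, tp in sorted(extract_phrases(src, tgt, align)):
    print(f"({sp}, {tp})")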
Phrase Pair Probabilities

• A certain phrase pair (s-s-s, t-t-t) may appear many


times across the bilingual corpus.

– We hope so!

• So, now we have a vast list of phrase pairs and their


frequencies – how to assign probabilities?

116
Phrase-based SMT
• After doing this to millions of sentences
– For each phrase pair (t, s)
• Count how many times s occurs
• Count how many times s is translated to t
• Estimate p(t | s)

117
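Turning those counts into probabilities is plain relative-frequency estimation. A minimal Python sketch, with invented counts for the zur Konferenz example from a few slides back:

from collections import Counter, defaultdict

# pretend these pairs came from running phrase extraction over the whole corpus
extracted = [("zur Konferenz", "to the conference")] * 80 + \
            [("zur Konferenz", "into the meeting")] * 20

pair_count = Counter(extracted)                    # how often s is translated as t
src_count = Counter(s for s, _ in extracted)       # how often s occurs

phrase_table = defaultdict(dict)
for (s, t), c in pair_count.items():
    phrase_table[s][t] = c / src_count[s]          # p(t | s)

print(phrase_table["zur Konferenz"])               # {'to the conference': 0.8, 'into the meeting': 0.2}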
Decoding
• During decoding
– a sentence is segmented into “phrases” in all possible ways
– each such phrase is then “translated” to the target phrases
in all possible ways
– Translations are also moved around
– Resulting target sentences are scored with the target
language model
• The decoder actually does NOT actually enumerate all
possible translations or all possible target sentences
– Pruning

118
Decoding

119
Basic Model, Revisited
argmax P(t | s) =
t

argmax P(t) x P(s | t) / P(s) =


t

argmax P(t) x P(s | t)


t

120
Basic Model, Revisited
argmax P(t | s) =
t

argmax P(t) x P(s | t) / P(s) =


t

argmax P(t)^2.4 x P(s | t) seems to work better


t

121
Basic Model, Revisited
argmax P(t | s) =
t

argmax P(t) x P(s | t) / P(s) =


t

argmax P(t)^2.4 x P(s | t) x length(t)^1.1


t Rewards longer hypotheses, since
these are unfairly punished by p(t)

122
Basic Model, Revisited

argmax P(t)^2.4 x P(s | t) x length(t)^1.1 x KS^3.7 …


t
Lots of knowledge sources vote on any given hypothesis.

“Knowledge source” = “feature function” = “score component”.

Feature function simply scores a hypothesis with a real value.

(May be binary, as in “e has a verb”).

Problem: How to set the exponent weights?

123
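In practice the weighted product is computed as a weighted sum of log feature scores. A small Python sketch; the feature names, values and weights below are invented for illustration, and real systems tune the weights automatically, as the next slide shows.

import math

weights = {"lm": 2.4, "tm": 1.0, "length": 1.1, "other_ks": 3.7}   # exponent weights

def score(features):
    # equivalent to multiplying value ** weight over all feature functions
    return sum(weights[name] * math.log(value) for name, value in features.items())

hyp_a = {"lm": 1e-12, "tm": 1e-9, "length": 6, "other_ks": 0.4}
hyp_b = {"lm": 1e-14, "tm": 1e-8, "length": 7, "other_ks": 0.4}
best = max([hyp_a, hyp_b], key=score)              # pick the higher-scoring hypothesis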
Maximum BLEU Training

[Diagram: a trainable translation system translates source text into MT output; an automatic BLEU
evaluator scores the output against reference translations (sample “right answers”); a learning
algorithm adjusts the weights of the component models (target language models, translation model,
length model, other features) to directly reduce translation error.]

Yields big improvements in quality.
124
Automatic Machine Translation
Evaluation
• Objective
• Inspired by the Word Error Rate metric used by ASR research
• Measuring the “closeness” between the MT hypothesis and
human reference translations
– Precision: n-gram precision
– Recall:
• Against the best matched reference
• Approximated by brevity penalty
• Cheap, fast
• Highly correlated with human evaluations
• MT research has greatly benefited from automatic evaluations
• Typical metrics: BLEU, NIST, F-Score, Meteor, TER

125
BLEU Evaluation
Reference (human) translation:
The US island of Guam is maintaining a high state of alert after the Guam airport and its
offices both received an e-mail from someone calling himself Osama Bin Laden and
threatening a biological/chemical attack against the airport.

Machine translation:
The American [?] International airport and its the office a [?] receives one calls self the sand
Arab rich business [?] and so on electronic mail, which sends out; The threat will be able after
the maintenance at the airport.

N-gram precision (score between 0 & 1)
• what % of machine n-grams (a sequence of words) can be found in the reference translation?

Brevity Penalty
• Can’t just type out single word “the” (precision 1.0!)

Extremely hard to trick the system, i.e. find a way to change MT output so that the
BLEU score increases, but quality doesn’t.

126
More Reference Translations are Better
Reference translation 1:
The US island of Guam is maintaining a high state of alert after the Guam airport and its
offices both received an e-mail from someone calling himself Osama Bin Laden and
threatening a biological/chemical attack against the airport.

Reference translation 2:
Guam International Airport and its offices are maintaining a high state of alert after receiving
an e-mail that was from a person claiming to be the rich Saudi Arabian businessman Osama
Bin Laden and that threatened to launch a biological and chemical attack on the airport.

Machine translation:
The American [?] International airport and its the office a [?] receives one calls self the sand
Arab rich business [?] and so on electronic mail, which sends out; The threat will be able after
the maintenance at the airport to start the biochemistry attack.

Reference translation 3:
The US International Airport of Guam and its office has received an email from a self-claimed
Arabian millionaire named Laden, which threatens to launch a biochemical attack on airport.
Guam authority has been on alert.

Reference translation 4:
US Guam International Airport and its offices received an email from Mr. Bin Laden and other
rich businessmen from Saudi Arabia. They said there would be biochemistry air raid to Guam
Airport. Guam needs to be in high precaution about this matter.
127
BLEU in Action

• Reference Translation: The gunman was shot to death by the police .

• The gunman was shot kill .


• Wounded police jaya of
• The gunman was shot dead by the police .
• The gunman arrested by police kill .
• The gunmen were killed .
• The gunman was shot to death by the police .
• The ringer is killed by the police .
• Police killed the gunman .

• Green = 4-gram match (good!) Red = unmatched word (bad!)

128
BLEU Formulation
BLEU = min(1, output-length / reference-length) × Π_{i=1}^{4} precision_i

precision_i: i-gram precision over the whole corpus

129
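A sentence-level version of the formula is easy to sketch in Python. Note that this is a simplification: real BLEU is computed over a whole corpus, clips n-gram counts against multiple references, and typically combines the precisions with a geometric mean.

from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    score = min(1.0, len(hyp) / len(ref))            # brevity penalty
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        matched = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        score *= matched / max(1, sum(hyp_ngrams.values()))   # i-gram precision
    return score

ref = "the gunman was shot to death by the police ."
print(round(bleu("the gunman was shot to death by the police .", ref), 2))   # 1.0
print(round(bleu("the gunman was shot kill .", ref), 2))                     # ~0.05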
Correlation with Human Judgment

130
What About Morphology?
• Issue for handling morphologically complex
languages like Turkish, Hungarian, Finnish,
Arabic, etc.
– A word contains much more information than just
the root word
• Arabic: wsyktbunha (wa+sa+ya+ktub+ūn+ha “and they
will write her”)
– What are the alignments?
• Turkish: gelebilecekmissin (gel+ebil+ecek+mis+sin “(I
heard) you would be coming”)
– What are the alignments?

131
Morphology & SMT
• Finlandiyalılaştıramadıklarımızdanmışsınızcasına

• Finlandiya+lı+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
• (behaving) as if you have been one of those whom
we could not convert into a Finn(ish
citizen)/someone from Finland

132
Morphology & SMT
• yapabileceksek Most of the time, the morpheme
– yap+abil+ecek+se+k order is “reverse” of the corresponding
– if we will be able to do (something) English word order

• yaptırtabildiğimizde
– yap+tır+t+tığ+ımız+da
– when/at the time we had (someone) have (someone else) do (something)
• görüntülenebilir
– görüntüle+n+ebil+ir
– it can be visualize+d
• sakarlıklarından
– sakar+lık+ları+ndan
– of/from/due-to their clumsi+ness

133
Morphology and Alignment
• Remember the alignment needs to count co-
occurring words
– If one side of the parallel text has little
morphology (e.g. English)
– The other side has lots of morphology
• Lots of words on the English side either don’t
align or align randomly

134
Morphology & SMT
• If we ignore morphology
– Large vocabulary size on the Turkish side
– Potentially noisy alignments
– The link activity-faaliyet is very “loose”

Word Form         Count  Gloss
faaliyet              3  activity
faaliyete             1  to the activity
faaliyetinde          1  in its activity
faaliyetler           3  activities
faaliyetlere          6  to the activities
faaliyetleri          7  their activities
faaliyetlerin         7  of the activities
faaliyetlerinde       1  in their activities
faaliyetlerine        5  to their activities
faaliyetlerini        1  their activities (accusative)
faaliyetlerinin       2  of their activities
faaliyetleriyle       1  with their activities
faaliyette            2  in (the) activity
faaliyetteki          1  that is in activity
TOTAL                41

135
An Example E – T Translation

we are going to your hotel in Taksim by taxi

we are go+ing to your hotel in Taksim by taxi

136
An Example E – T Translation

we are going to your hotel in Taksim by taxi

we are go+ing to your hotel in Taksim by taxi

Biz siz+in Taksim +de +ki otel +iniz +e taksi +yle gid +iyor +uz

137
Morphology and Parallel Texts
• Use
– Morphological analyzers (HLT Workshop 2)
– Tagger/Disambiguators (HLT Workshop 3)
• to split both sides of the parallel corpus into
morphemes

141
Morphology and Parallel Texts
• A typical sentence pair in this corpus looks like
the following:
• Turkish:
– kat +hl +ma ortaklık +sh +nhn uygula +hn +ma +sh
, ortaklık anlaşma +sh çerçeve +sh +nda izle +hn
+yacak +dhr .
• English:
– the implementation of the accession partnership
will be monitor +ed in the framework of the
association agreement
142
Results
• Using morphology in Phrase-based SMT
certainly improves results compared to just
using words
• But
– Sentences get much longer and this hurts
alignment
– We now have an additional problem: getting the
morpheme order on each word right

143
Syntax and Morphology Interaction
• A completely different approach
– Instead of dividing up the Turkish side into morphemes
– Collect “stuff” on the English side to make up
“words”.
– What is the motivation?

144
Syntax and Morphology Interaction

we are going to your hotel in Taksim by taxi

we are go+ing to your hotel in Taksim by taxi

Biz siz+in Taksim +de +ki otel +iniz +e taksi +yle gid +iyor +uz

Suppose we can do some syntactic analysis on the English side


145
Syntax and Morphology Interaction
we are go+ing to your hotel in Taksim by taxi

• to your hotel
– to is the preposition related to hotel
– your is the possessor of hotel
• to your hotel => hotel +your+to
otel +iniz+e
– separate content from local syntax

146
Syntax and Morphology Interaction
we are go+ing to your hotel in Taksim by taxi

• we are go+ing
– we is the subject of go
– are is the auxiliary of go
– ing is the present tense marker for go
• we are go+ing => go +ing+are+we
gid +iyor+uz
– separate content from local syntax

147
Syntax and Morphology Interaction
we are go+ing to your hotel in Taksim by taxi

go+ing+are+we hotel +your+to Taksim+in taxi+by

Biz siz+in Taksim+de+ki otel+iniz+e taksi+yle gid+iyor+uz

Now align only based on root words – the syntax alignments just follow that
148
Syntax and Morphology Interaction

149
Syntax and Morphology Interaction
• Transformations on the English side reduce
sentence length
• This helps alignment
– Morphemes and most function words never get
involved in alignment
• We can use factored phrase-based translation
– Phrased-based framework with morphology
support

150
Syntax and Morphology Interaction
[Chart: number of English and Turkish tokens (left axis, 800,000–1,300,000) and BLEU scores
(right axis, 15–25) for the experiments: Baseline-Factored, Adv, Verb, Verb+Adv, Noun+Adj,
Noun+Adj+Verb, Noun+Adj+Verb+Adv, Noun+Adj+Verb+PostPC, Noun+Adj+Verb+Adv+PostPC]
151
Syntax and Morphology Interaction
• She is reading.
– She is the subject of read
– is is the auxiliary of read

• She is read+ing => read +ing+is+she


taQrAA QrAA +*ta

152
MT Strategies (1954-2004)

[Chart, slide by Laurie Gerber: MT approaches plotted against two axes.
Knowledge acquisition strategy (all manual → fully automated): hand-built by experts,
hand-built by non-experts, learned from annotated data, learned from un-annotated data.
Knowledge representation strategy (shallow/simple → deep/complex): word-based
(electronic dictionaries, phrase tables, example-based MT, statistical MT), syntactic
constituent structure (original direct approach, typical transfer system), semantic analysis,
interlingua (classic interlingual system). “New Research Goes Here!” marks the region of
deep representations learned from data.]

153
Syntax in SMT
• Early approaches relied on high-performance
parsers for one or both languages
– Good applicability when English is the source
language
• Tree-to-tree or tree-to-string transductions

• Recent approaches induce synchronous


grammars during training
– Grammar that describe two languages at the same
time
• NP => ADJe1 NPe2 : NPf2 ADJf1

154
Tree-to-String Transformation
[Figure (English → Japanese): starting from Parse Tree(E) of “he adores listening to music”,
Reorder the children of each node, Insert function words (ha, ga, no, desu), Translate the
leaf words into Japanese, then Take Leaves to produce Sentence(J):
“Kare ha ongaku wo kiku no ga daisuki desu”.]


Tree-to-String Transformation
• Each step is described by a statistical model
– Reorder children on a node probabilistically
– R-table
– English – Japanese table
Original Order Reordering P(reorder|original)

PRP VB1 VB2 PRP VB1 VB2 0.074


PRP VB2 VB1 0.723
VB1 PRP VB2 0.061
VB1 VB2 PRP 0.037
VB2 PRP VB1 0.083
VB2 VB1 PRP 0.021
VB TO VB TO 0.107
TO VB 0.893
TO NN TO NN 0.251
NN TO 0.749

156
Tree-to-String Transformation
• Each step is described by a statistical model
– Insert new sibling to the left or right of a node
probabilitically
– Translate source nodes probabilistically

157
Hierarchical phrase models
• Combines phrase-based models and tree
structures
• Extract synchronous grammars from parallel
text
• Uses a statistical chart-parsing algorithm
during decoding
– Parse and generate concurrently

158
For more info
• Proceedings of the Third Workshop on Syntax and
Structure in Statistical Translation (SSST-3) at NAACL
HLT 2009
– http://aclweb.org/anthology-new/W/W09/#2300
• Proceedings of the ACL-08: HLT Second Workshop on
Syntax and Structure in Statistical Translation (SSST-2)
– http://aclweb.org/anthology-new/W/W08/#0400

159
Acknowledgments
• Some of the tutorial material is based on
slides by
– Kevin Knight (USC/ISI)
– Philipp Koehn (Edinburgh)
– Reyyan Yeniterzi (CMU/LTI)

160
Important References
• Statistical Machine Translation (2010)
– Philipp Koehn
– Cambridge University Press
• SMT Workbook (1999)
– Kevin Knight
– Unpublished manuscript at http://www.isi.edu/~knight/
• http://www.statmt.org
• http://aclweb.org/anthology-new/
– Look for “Workshop on Statistical Machine Translation”

161
11-411
Natural Language Processing
Neural Networks and Deep Learning in NLP

Kemal Oflazer

Carnegie Mellon University in Qatar

1/60
Big Picture: Natural Language Analyzers

2/60
Big Picture: Natural Language Analyzers

3/60
Big Picture: Natural Language Analyzers

4/60
Linear Models
I y1 = w11 x1 + w21 x2 + w31 x3 + w41 x4 + w51 x5

5/60
Perceptrons
I Remember Perceptrons?
I A very simple algorithm guaranteed to eventually find a linear separator hyperplane
(determine w), if one exists.
I If one doesn’t, the perceptron will oscillate!
I Assume our classifier is
classify(x) = 1 if w · Φ(x) > 0, and 0 if w · Φ(x) ≤ 0
I Start with w = 0
I for t = 1, . . . , T
I i = t mod N
I w ← w + α (ℓi − classify(xi)) Φ(xi)
I Return w
I α is the learning rate – determined by experimentation.

6/60
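The update rule is easy to run. A self-contained Python sketch on a linearly separable toy problem; the data, the added bias feature and the choice α = 1 are illustrative.

def phi(x):
    return [1.0] + list(x)                    # bias feature + raw inputs

def classify(w, x):
    return 1 if sum(wi * fi for wi, fi in zip(w, phi(x))) > 0 else 0

def train(data, labels, T=100, alpha=1.0):
    w = [0.0] * len(phi(data[0]))
    for t in range(T):
        i = t % len(data)
        error = labels[i] - classify(w, data[i])
        w = [wi + alpha * error * fi for wi, fi in zip(w, phi(data[i]))]
    return w

# linearly separable toy problem: is x1 > x2 ?
data = [(2, 1), (3, 0), (0, 2), (1, 4)]
labels = [1, 1, 0, 0]
w = train(data, labels)
print([classify(w, x) for x in data])         # [1, 1, 0, 0]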
Perceptrons
I For classification we are basically computing

score(x) = W × f(x)^T = Σ_j wj · fj(x)

I wj are the weights comprising W


I fj (x) are the feature functions.
I We are then deciding based on the value of score(x)
I Such a computation can be viewed as a “network”.

I Feature function values are provided by the nodes on the left.


I Edges have the weights wi. Each feature value is multiplied with the respective edge
weight.
I The node on the right sums up the incoming values and decides.
7/60
Perceptron
I While quite useful, such a model can only classify “linearly separable” classes.
I So it fails for a very simple problem such as the exclusive-or

8/60
Multiple Layers

I We can add an intermediate “hidden” layer.


I each arrow is a weight

I Have we gained anything?


I Not really. We have a linear combination of weights (input to hidden and hidden to output),
I Those two can be combined offline to a single weight matrix.

9/60
Adding Non-linearity
I Instead of computing a linear combination
score(x) = Σ_j wj · fj(x)

I We use a non-linear function F

score(x) = F( Σ_j wj · fj(x) )

I Some popular choices for F

10/60
Deep Learning

I More layers ⇒ “deep learning”

I The sigmoid is also called the “logistic function.”

11/60
What Depth Holds

I Each layer is a processing step


I Having multiple processing steps allows complex functions
I Metaphor: NN and computing circuits
I computer = sequence of Boolean gates
I neural computer = sequence of layers
I Deep neural networks can implement more complex functions

12/60
Simple Neural Network

3.7
2.9 4.5
3.7
2.9 -5.2

-1.5 -2.0
-4.6
1 1

I One innovation: bias units (no input, always value 1)

13/60
Sample Input
3.7
1.0
2.9 4.5
3.7
2.9 -5.2
0.0
-1.5 -2.0
-4.6
1 1

I Try out two input values


I Hidden unit computation
sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × −1.5) = sigmoid(2.2) = 1 / (1 + e^−2.2) = 0.90
sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.6) = sigmoid(−1.7) = 1 / (1 + e^1.7) = 0.15
14/60
Computed Hidden Layer Values
3.7
1.0 .90
2.9 4.5
3.7
2.9 -5.2
0.0 .15
-1.5 -2.0
-4.6
1 1

I Try out two input values


I Hidden unit computation
sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × −1.5) = sigmoid(2.2) = 1 / (1 + e^−2.2) = 0.90
sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.6) = sigmoid(−1.7) = 1 / (1 + e^1.7) = 0.15
15/60
Computed Output Value

3.7
1.0 .90
2.9 4.5
3.7
2.9 -5.2
0.0 .15 .78
-1.5 -2.0
-4.6
1 1

I Output unit computation
sigmoid(0.90 × 4.5 + 0.15 × −5.2 + 1 × −2.0) = sigmoid(1.25) = 1 / (1 + e^−1.25) = 0.78

16/60
Output for All Binary Inputs

Input x0 Input x1 Hidden h0 Hidden h1 Output y0


0 0 0.18 0.01 0.23 → 0
0 1 0.90 0.15 0.78 → 1
1 0 0.90 0.15 0.78 → 1
1 1 0.99 0.77 0.18 → 0

I Network implements the XOR


I hidden node h0 is OR
I hidden node h1 is AND
I final layer is (essentially) h0 − (h1 )

17/60
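The truth table can be reproduced by coding the forward pass directly with the weights from the network figure above (a small Python sketch):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x0, x1):
    h0 = sigmoid(3.7 * x0 + 3.7 * x1 + (-1.5))      # hidden unit h0 (acts like OR)
    h1 = sigmoid(2.9 * x0 + 2.9 * x1 + (-4.6))      # hidden unit h1 (acts like AND)
    y0 = sigmoid(4.5 * h0 + (-5.2) * h1 + (-2.0))   # output unit
    return y0

for x0 in (0, 1):
    for x1 in (0, 1):
        print(x0, x1, round(forward(x0, x1), 2))    # matches the table: 0.23, 0.78, 0.78, 0.18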
The Brain vs. Artificial Neural Networks

I Similarities
I Neurons, connections between neurons
I Learning = change of connections, not change of neurons
I Massive parallel processing
I But artificial neural networks are much simpler
I computation within neuron vastly simplified
I discrete time steps
I typically some form of supervised learning with massive number of stimuli

18/60
Backpropagation Training

I Lather – take an input and run it forward through the network


I Rinse – compare it to the expected output, and adjust weights if they differ
I Repeat – for the next input, until convergence or time-out

19/60
Backpropagation Training

3.7
1.0 .90
2.9 4.5
3.7
2.9 -5.2
0.0 .15 .78
-1.5 -2.0
-4.6
1 1

I Computed output is y = 0.78


I Correct output is t(arget) = 1.0
I How do we adjust the weights?

20/60
Key Concepts

I Gradient Descent
I error is a function of the weights
I we want to reduce the error
I gradient descent: move towards the error minimum
I compute gradient → get direction to the error minimum
I adjust weights towards direction of lower error
I Backpropagation
I first adjust last set of weights
I propagate error back to each previous layer
I adjust their weights

21/60
Gradient Descent

22/60
Gradient Descent

23/60
Derivative of the Sigmoid
I Sigmoid: sigmoid(x) = 1 / (1 + e^−x)
I Reminder: quotient rule (f(x)/g(x))′ = (g(x)f′(x) − f(x)g′(x)) / g(x)²
I Derivative:
d sigmoid(x)/dx = d/dx [ 1 / (1 + e^−x) ]
               = (0 × (1 + e^−x) − 1 × (−e^−x)) / (1 + e^−x)²
               = (1 / (1 + e^−x)) × (e^−x / (1 + e^−x))
               = (1 / (1 + e^−x)) × (1 − 1 / (1 + e^−x))
               = sigmoid(x)(1 − sigmoid(x))
24/60
Final Layer Update (1)
I We have a linear combination of weights and hidden layer values: s = Σ_k wk hk
I Then we have the activation function: y = sigmoid(s)
I We have the error function E = ½ (t − y)².
I t is the target output.
I Derivative of error with regard to one weight wk (using chain rule)

dE/dwk = (dE/dy) (dy/ds) (ds/dwk)

I Error is already defined in terms of y, hence

dE/dy = d/dy [ ½ (t − y)² ] = −(t − y)

25/60
Final Layer Update (2)
I We have a linear combination of weights and hidden layer values: s = Σ_k wk hk
I Then we have the activation function: y = sigmoid(s)
I We have the error function E = ½ (t − y)².
I Derivative of error with regards to one weight wk (using chain rule)

dE/dwk = (dE/dy) (dy/ds) (ds/dwk)

I y with respect to s is sigmoid(s)

dy/ds = d sigmoid(s)/ds = sigmoid(s)(1 − sigmoid(s)) = y(1 − y)

26/60
Final Layer Update (3)
I We have a linear combination of weights and hidden layer values: s = Σ_k wk hk
I Then we have the activation function: y = sigmoid(s)
I We have the error function E = ½ (t − y)².
I Derivative of error with regards to one weight wk (using chain rule)

dE/dwk = (dE/dy) (dy/ds) (ds/dwk)

I s is a weighted linear combination of hidden node values hk

ds/dwk = d/dwk ( Σ_k wk hk ) = hk

27/60
Putting it All Together

I Derivative of error with regard to one weight wk

dE/dwk = (dE/dy) (dy/ds) (ds/dwk) = −(t − y) y(1 − y) hk

I error term: (t − y)
I derivative of sigmoid: y′ = y(1 − y)
I We adjust the weight as follows

∆wk = µ (t − y) y′ hk

where µ is a fixed learning rate.

28/60
Multiple Output Nodes
I Our example had one output node.
I Typically neural networks have multiple output nodes.
I Error is computed over all j output nodes

E = ½ Σ_j (tj − yj)²

I Weight wkj from hidden unit k to output unit j is adjusted according to node j

∆wkj = µ (tj − yj) y′j hk

I We can also rewrite this as

∆wkj = µ δj hk

where δj is the error term for output unit j.

29/60
Hidden Layer Update
I In a hidden layer, we do not have a target output value.
I But we can compute how much each hidden node contributes to the downstream
error E.
I k refers to a hidden node
I j refers to a node in the next/output layer
I Remember the error term

δj = (tj − yj) y′j

I The error term associated with hidden node k is (skipping the multivariate math)

δk = ( Σ_j wkj δj ) h′k

I So if uik is the weight between input unit xi and hidden unit k, then

∆uik = µ δk xi

I Compare with ∆wkj = µ δj hk.
30/60
An Example
i k j
3.7
1 1.0 .90
2.9 4.5
3.7 G
2.9 -5.2
2 0.0 .15 .78
-1.5 -2.0
-4.6
3 1 1

For the output unit G


I Computed output = y1 = 0.78
I Correct output = t1 = 1.0
I Final layer weight updates (with learning rate µ = 10)
I δ1 = (t1 − y1) y′1 = (1 − 0.78) × 0.172 = 0.0378
I ∆w11 = µδ1 h1 = 10 × 0.0378 × 0.90 = 0.3402
I ∆w21 = µδ1 h2 = 10 × 0.0378 × 0.15 = 0.0567
I ∆w31 = µδ1 h3 = 10 × 0.0378 × 1 = 0.378
31/60
An Example
i k j
4.5+0.3402=4.8402
3.7
1 1.0 .90 4.8402 -5.2+0.0567=-5.1433
2.9
3.7 G -2.0+0.378=-1.622
2.9 -5.1433
2 0.0 .15 .78
-1.5
-4.6 -1.622
3 1 1

x u h w y

For the output unit G


I Computed output = y1 = 0.78
I Correct output = t1 = 1.0
I Final layer weight updates (with learning rate µ = 10)
I δ1 = (t1 − y1) y′1 = (1 − 0.78) × 0.172 = 0.0378
I ∆w11 = µδ1 h1 = 10 × 0.0378 × 0.90 = 0.3402
I ∆w21 = µδ1 h2 = 10 × 0.0378 × 0.15 = 0.0567
I ∆w31 = µδ1 h3 = 10 × 0.0378 × 1 = 0.378
32/60
Hidden Layer Updates
i k j
4.5+0.3402=4.8402
3.7
1 1.0 .90 4.8402 -5.2+0.0567=-5.1433
2.9
3.7 G -2.0+0.378=-1.622
2.9 -5.1433
2 0.0 .15 .78
-1.5
-4.6 -1.622
3 1 1

x u h w y

For hidden unit h1
I δ1 = ( Σ_j w1j δj^G ) h′1 = 4.5 × 0.0378 × 0.09 = 0.015
I ∆u11 = µδ1 x1 = 10 × 0.015 × 1.0 = 0.15
I ∆u21 = µδ1 x2 = 10 × 0.015 × 0.0 = 0
I ∆u31 = µδ1 x3 = 10 × 0.015 × 1.0 = 0.15
Repeat for hidden unit h2
I δ2 = ( Σ_j w2j δj^G ) h′2 = −5.2 × 0.0378 × 0.1275 = −0.025
I ∆u12 = . . .
I ∆u22 = . . .
I ∆u32 = . . .
33/60
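The worked example can be checked in code. The Python sketch below runs one forward and one backward pass on the same tiny network; the weights, inputs and learning rate are the ones from the slides, and the small numerical differences arise because the slides round h and y before computing the updates.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

mu = 10.0                                    # learning rate used on the slides
x = [1.0, 0.0, 1.0]                          # inputs x1, x2 and the bias input
u = [[3.7, 3.7, -1.5],                       # input -> hidden weights (rows: h1, h2)
     [2.9, 2.9, -4.6]]
w = [4.5, -5.2, -2.0]                        # hidden (+ bias) -> output weights
t = 1.0                                      # target output

# forward pass
h = [sigmoid(sum(ui * xi for ui, xi in zip(row, x))) for row in u] + [1.0]
y = sigmoid(sum(wi * hi for wi, hi in zip(w, h)))

# backward pass: output-layer error term and weight updates
delta_out = (t - y) * y * (1 - y)
dw = [mu * delta_out * hi for hi in h]
print(round(y, 2), [round(d, 3) for d in dw])     # 0.78, updates near the slides' 0.3402, 0.0567, 0.378

# hidden-layer error terms and input-weight updates (the ∆u values)
delta_h = [w[k] * delta_out * h[k] * (1 - h[k]) for k in range(2)]
du = [[mu * delta_h[k] * xi for xi in x] for k in range(2)]
print([round(d, 3) for d in delta_h])             # near the slides' 0.015 and -0.025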
Initialization of Weights

I Random initialization e.g., uniformly in the interval

[−0.01, 0.01]
I For shallow networks there are suggestions for

[ −1/√n , 1/√n ]

I For deep networks there are suggestions for

[ −√6 / √(ni + ni+1) , √6 / √(ni + ni+1) ]

where ni and ni+1 are sizes of the previous and next layers.

34/60
Neural Networks for Classification

I Predict Class: one output per class


I Training data output is a “one-hot-vector”, e.g., y = [0, 0, 1]T
I Prediction:
I predicted class is output node i with the highest value yi
I obtain posterior probability distribution by softmax
softmax(yi) = e^yi / Σ_j e^yj
35/60
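A small Python sketch of the softmax; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the definition above.

import math

def softmax(scores):
    m = max(scores)                               # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

print(softmax([2.0, 1.0, 0.1]))                   # ~[0.66, 0.24, 0.10]; predicted class = index 0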
Problems with Gradient Descent Training

36/60
Problems with Gradient Descent Training

37/60
Problems with Gradient Descent Training

38/60
Speed-up: Momentum

I Updates may move a weight slowly in one direction


I We can keep a memory of prior updates

∆wkj (n − 1)
I and add these to any new updates with a decay factor ρ

∆wkj (n) = µ δj hk + ρ∆wkj (n − 1)

39/60
Dropout

I A general problem of machine learning: overfitting to training data (very good on train,
bad on unseen test)
I Solution: regularization, e.g., keeping weights from having extreme values
I Dropout: randomly remove some hidden units during training
I mask: set of hidden units dropped
I randomly generate, say, 10 – 20 masks
I alternate between the masks during training

40/60
Mini Batches

I Each training example yields a set of weight updates ∆wij


I Batch up several training examples
I Accumulate their updates
I Apply sum to the model – one big step instead of many small steps
I Mostly done for speed reasons

41/60
Matrix Vector Formulation

I Forward computation s = W h
I Activation computation y = sigmoid(s)
I Error Term: δ = (t − y) · sigmoid′(s)
I Propagation of error term: δi = W δi+1 · sigmoid′(s)
I Weight updates: ∆W = µ δ h^T

42/60
Toolkits

I Theano (Python Library)


I Tensorflow (Python Library, Google)
I PyTorch (Python Library, Facebook)
I MXNet (Python Library, Amazon)
I DyNet (Python Library, A consortium of institutions including CMU)

43/60
Neural Network V1.0: Linear Model

44/60
Neural Network v2.0: Representation Learning
I Big idea: induce low-dimensional dense feature representations of high-dimensional
objects

45/60
Neural Network v2.1: Representation Learning

I Big idea: induce low-dimensional dense feature representations of high-dimensional


objects

I Did this really solve the problem?

46/60
Neural Network v3.0: Complex Functions

I Big idea: define complex functions by adding a hidden layer.

I y = W2 h1 = W2 a1 (W1 x1)

47/60
Neural Network v3.0: Complex Functions
I Popular activation/transfer/non-linear functions

48/60
Neural Network v3.5: Deeper Networks

I Add more layers!

I y = W3 h2 = W3 a2 (W2 a1 (W1 x1))

49/60
Neural Network v3.5: Deeper Networks

50/60
Neural Network v4.0: Recurrent Neural Networks
I Big Idea: Use hidden layers to represent sequential state

51/60
Neural Network v4.0: Recurrent Neural Networks

52/60
Neural Network v4.1: Output Sequences

Many to One Many to Many Many to Many

53/60
Neural Network v4.1: Output Sequences
I Character-level Language Models

54/60
Neural Network v4.2: Long-Short Term Memory
I Regular Recurrent Networks

I LSTMs

55/60
Neural Network v4.2: Long-Short Term Memory

56/60
Neural Network v4.3: Bidirectional RNNs
I Unidirectional RNNs

I Bidirectional RNNs

57/60
Neural Machine Translation

58/60
Neural Part-of-Speech Tagging

I wi is the one-hot representation of the current word.


I f(wi) encodes the case of wi: all caps, cap initial, lowercase.
59/60
Neural Parsing

60/60
