Unit 3
CS525PE: Natural Language Processing (Professional Elective – II)

Prerequisites:
1. Data structures and compiler design

Course Objectives:
Introduction to some of the problems and solutions of NLP and their relation to linguistics and statistics.

Course Outcomes:
1. Show sensitivity to linguistic phenomena and an ability to model them with formal grammars.
2. Understand and carry out proper experimental methodology for training and evaluating empirical NLP systems.
3. Manipulate probabilities, construct statistical models over strings and trees, and estimate parameters using supervised and unsupervised training methods.
4. Design, implement, and analyze NLP algorithms, and design different language modelling techniques.
UNIT - I
Finding the Structure of Words: Words and Their
Components, Issues and Challenges, Morphological Models
Finding the Structure of Documents: Introduction, Methods,
Complexity of the Approaches, Performances of the
Approaches, Features
UNIT - II
most likely syntactic structure for a given sentence. These models can be based on various neural architectures, such as recurrent neural networks (RNNs) or transformer models. For example, a neural network model might use an attention mechanism to learn which words in a sentence are most relevant for predicting the syntactic structure.

5. Ensemble models:

Ensemble models combine the predictions of multiple parsing models to achieve higher accuracy and robustness. These models can be based on various techniques, such as voting, weighting, or stacking. For example, an ensemble model might combine the predictions of a rule-based model, a statistical model, and a neural network model to improve the overall accuracy of the parsing system.
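To make the voting idea concrete, here is a minimal illustrative sketch (my own toy code, not part of the syllabus); the parser labels and the bracketed parse strings are invented for illustration:

from collections import Counter

def vote(predictions):
    """Pick the parse proposed by the most parsers (simple majority voting)."""
    counts = Counter(predictions)
    best_parse, _ = counts.most_common(1)[0]
    return best_parse

# Hypothetical outputs of three different parsers for the same sentence.
predictions = [
    "(S (NP the cat) (VP saw (NP the dog)))",   # rule-based parser
    "(S (NP the cat) (VP saw (NP the dog)))",   # statistical parser
    "(S (NP the cat saw) (VP (NP the dog)))",   # neural parser (disagrees)
]

print(vote(predictions))   # the parse chosen by two of the three parsers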
Overall, there are many models for ambiguity resolution in parsing, each with its own strengths and weaknesses. The choice of model depends on the specific application and the available resources, such as training data and computational power.

1.1 Probabilistic Context-Free Grammars:

Probabilistic Context-Free Grammars (PCFGs) are a popular model for ambiguity resolution in parsing. PCFGs extend context-free grammars (CFGs) by assigning probabilities to each production rule, representing the likelihood of generating a certain symbol given its parent symbol.

PCFGs can be used to compute the probability of a parse tree for a given sentence, which can then be used to select the most likely parse. The probability of a parse tree is computed by multiplying the probabilities of its constituent production rules, from the root symbol down to the leaves. The probability of a sentence is computed by summing the probabilities of all parse trees that generate the sentence.

Here is an example of a PCFG for the sentence "the cat saw the dog":

S -> NP VP [1.0]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.8]
VP -> V [0.2]
Det -> "the" [0.9]
Det -> "a" [0.1]
N -> "cat" [0.5]
N -> "dog" [0.5]
V -> "saw" [1.0]

In this PCFG, each production rule is annotated with a probability. For example, the rule NP -> Det N [0.6] has a probability of 0.6, indicating that a noun phrase can be generated
by first generating a determiner, followed by a noun, with a probability of 0.6.

To parse the sentence "the cat saw the dog" using this PCFG, we can use the CKY algorithm to generate all possible parse trees and compute their probabilities. The algorithm starts by filling in the table of all possible subtrees for each span of the sentence, and then combines these subtrees using the production rules of the PCFG. The final cell in the table represents the probability of the best parse tree for the entire sentence.

Using the probabilities from the PCFG, the CKY algorithm generates the following parse tree for the sentence "the cat saw the dog":

                S
             /     \
           NP       VP
          /  \     /  \
        Det   N   V    NP
         |    |   |   /  \
        the  cat saw Det   N
                      |    |
                     the  dog

The probability of this parse tree is computed as follows:

P(S -> NP VP) * P(NP -> Det N) * P(Det -> "the") * P(N -> "cat") * P(VP -> V NP) * P(V -> "saw") * P(NP -> Det N) * P(Det -> "the") * P(N -> "dog")
= 1.0 * 0.6 * 0.9 * 0.5 * 0.8 * 1.0 * 0.6 * 0.9 * 0.5
= 0.05832

Thus, the probability of the best parse tree for the sentence "the cat saw the dog" is 0.05832. This probability can be used to select the most likely parse among all possible parse trees for the sentence.
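A small sketch below reproduces this computation. The tree is written as nested (label, children) tuples, the rule probabilities are taken from the PCFG above, and the function name is mine rather than part of any library:

# Rule probabilities from the PCFG above, keyed by (lhs, rhs).
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("N",)): 0.4,
    ("VP", ("V", "NP")): 0.8,
    ("VP", ("V",)): 0.2,
    ("Det", ("the",)): 0.9,
    ("Det", ("a",)): 0.1,
    ("N", ("cat",)): 0.5,
    ("N", ("dog",)): 0.5,
    ("V", ("saw",)): 1.0,
}

def tree_probability(tree):
    """Multiply the probabilities of every production used in the tree."""
    label, children = tree
    if isinstance(children, str):                # leaf: a single word
        return RULE_PROB[(label, (children,))]
    rhs = tuple(child[0] for child in children)  # child labels form the rule's RHS
    prob = RULE_PROB[(label, rhs)]
    for child in children:
        prob *= tree_probability(child)
    return prob

# The parse tree shown above for "the cat saw the dog".
tree = ("S", [("NP", [("Det", "the"), ("N", "cat")]),
              ("VP", [("V", "saw"),
                      ("NP", [("Det", "the"), ("N", "dog")])])])

print(round(tree_probability(tree), 5))   # 0.05832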
1.2 Generative Models for Parsing:

Generative models for parsing are a family of models that generate a sentence's parse tree by generating each node in the tree according to a set of probabilistic rules. One such model is the probabilistic Earley parser.

The Earley parser uses a chart data structure to store all possible parse trees for a sentence. The parser starts with an empty chart, and then adds new parse trees to the chart as it progresses through the sentence. The parser consists of three main stages: prediction, scanning, and completion.

In the prediction stage, the parser generates new items in the chart by applying grammar rules that can generate non-terminal symbols. For example, if the grammar has a rule S -> NP VP, the parser would predict the presence of an S symbol in the current span of the sentence by adding a new item to the chart that indicates that an S symbol can be generated by an NP symbol followed by a VP symbol.

In the scanning stage, the parser checks whether a word in the sentence can be assigned to a non-terminal symbol in the chart.
For example, if the parser has predicted an NP symbol in the current span of the sentence, and the word "dog" appears in that span, the parser would add a new item to the chart that indicates that the NP symbol can be generated by the word "dog".

In the completion stage, the parser combines items in the chart that have the same end position and can be combined according to the grammar rules. For example, if the parser has added an item to the chart that indicates that an NP symbol can be generated by the word "dog", and another item that indicates that a VP symbol can be generated by the word "saw" and an NP symbol, the parser would add a new item to the chart that indicates that an S symbol can be generated by an NP symbol followed by a VP symbol.

Here is an example of a probabilistic Earley parser applied to the sentence "the cat saw the dog", using the same PCFG as above:

S -> NP VP [1.0]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.8]
VP -> V [0.2]
Det -> "the" [0.9]
Det -> "a" [0.1]
N -> "cat" [0.5]
N -> "dog" [0.5]
V -> "saw" [1.0]

Initial chart:

0: [S -> * NP VP [1.0], 0, 0]
0: [NP -> * Det N [0.6], 0, 0]
0: [NP -> * N [0.4], 0, 0]
0: [VP -> * V NP [0.8], 0, 0]
0: [VP -> * V [0.2], 0, 0]
0: [Det -> * "the" [0.9], 0, 0]
0: [Det -> * "a" [0.1], 0, 0]

Starting from this chart, the parser predicts ways of expanding S and then repeatedly applies the prediction, scanning, and completion steps, adding items until the chart contains a completed S item that spans the whole sentence; the most probable such item gives the best parse.
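As a concrete illustration of the three stages, here is a minimal sketch of an Earley recognizer for the grammar above (my own toy code; the rule probabilities are omitted for brevity, and a fully probabilistic version would carry a probability on each chart item):

from collections import namedtuple

# A chart item: a dotted rule (lhs -> rhs with a dot position) plus the start of its span.
Item = namedtuple("Item", "lhs rhs dot start")

GRAMMAR = {
    "S":   [("NP", "VP")],
    "NP":  [("Det", "N"), ("N",)],
    "VP":  [("V", "NP"), ("V",)],
    "Det": [("the",), ("a",)],
    "N":   [("cat",), ("dog",)],
    "V":   [("saw",)],
}

def earley_recognize(words, start="S"):
    """Return True if GRAMMAR derives the word sequence."""
    chart = [set() for _ in range(len(words) + 1)]
    chart[0] = {Item(start, rhs, 0, 0) for rhs in GRAMMAR[start]}
    for i in range(len(words) + 1):
        added = True
        while added:                                   # repeat until chart[i] stops growing
            added = False
            for item in list(chart[i]):
                if item.dot < len(item.rhs):
                    nxt = item.rhs[item.dot]
                    if nxt in GRAMMAR:                 # prediction: expand the non-terminal after the dot
                        for rhs in GRAMMAR[nxt]:
                            new = Item(nxt, rhs, 0, i)
                            if new not in chart[i]:
                                chart[i].add(new)
                                added = True
                    elif i < len(words) and words[i] == nxt:   # scanning: match the next word
                        chart[i + 1].add(Item(item.lhs, item.rhs, item.dot + 1, item.start))
                else:                                  # completion: advance items waiting on this category
                    for waiting in list(chart[item.start]):
                        if (waiting.dot < len(waiting.rhs)
                                and waiting.rhs[waiting.dot] == item.lhs):
                            new = Item(waiting.lhs, waiting.rhs,
                                       waiting.dot + 1, waiting.start)
                            if new not in chart[i]:
                                chart[i].add(new)
                                added = True
    return any(item.lhs == start and item.dot == len(item.rhs) and item.start == 0
               for item in chart[len(words)])

print(earley_recognize("the cat saw the dog".split()))   # True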
During testing, the maximum entropy Markov model (MEMM) uses the learned feature functions and weights to score each possible parse tree for the input sentence. The model then selects the parse tree with the highest score as the final parse tree for the sentence.

Here is an example of a MEMM applied to the sentence "the cat saw the dog":

Features:
- Det -> "the"
- N -> "cat"
- VP -> V NP
- V -> "saw"
- NP -> Det N
- Det -> "the"
- N -> "dog"

Feature scores:

F3: 0.9
F4: 1.1
F5: 0.8
F6: 0.6
F7: 0.7
F8: 0.9

Score: 5.7

In this example, the MEMM generates a score for each possible parse tree and selects the parse tree with the highest score as the final parse tree for the sentence. The selected parse tree corresponds to the correct parse for the sentence.
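A minimal sketch of this kind of feature-based scoring is shown below. It is only a weighted-feature scorer in the spirit of the description above, not a full MEMM, and the weights, helper names, and candidate parses are invented for illustration:

from collections import Counter

# Hypothetical learned weights for production-rule features.
WEIGHTS = {
    "S -> NP VP": 0.7,
    "NP -> Det N": 0.6,
    "VP -> V NP": 0.8,
    'Det -> "the"': 0.9,
    'N -> "cat"': 0.5,
    'N -> "dog"': 0.5,
    'V -> "saw"': 1.0,
}

def score_parse(rules_used):
    """Score a candidate parse as the weighted sum of the rule features it uses."""
    return sum(WEIGHTS.get(rule, 0.0) * count
               for rule, count in Counter(rules_used).items())

def best_parse(candidates):
    """Pick the candidate parse (given as its list of rules) with the highest score."""
    return max(candidates, key=score_parse)

candidate_a = ["S -> NP VP", "NP -> Det N", 'Det -> "the"', 'N -> "cat"',
               "VP -> V NP", 'V -> "saw"', "NP -> Det N", 'Det -> "the"', 'N -> "dog"']
candidate_b = ["S -> NP VP", "NP -> Det N", 'Det -> "the"', 'N -> "cat"',
               "VP -> V", 'V -> "saw"']

print(score_parse(candidate_a), score_parse(candidate_b))   # 6.5 3.7
print(best_parse([candidate_a, candidate_b]) is candidate_a)  # True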
2. Multilingual Issues:

In natural language processing (NLP), a token is a sequence of characters that represents a single unit of meaning. In other words, it is a word or a piece of a word that has a specific meaning within a language. The process of splitting a text into individual tokens is called tokenization.

However, the definition of what constitutes a token can vary depending on the language being analyzed. This is because different languages have different rules for how words are constructed, how they are written, and how they are used in context.

For example, in English, words are typically separated by spaces, making it relatively easy to tokenize a sentence into individual words. However, in some languages, such as Chinese or Japanese, there are no spaces between words, and the text must be segmented into individual units of meaning based on other cues, such as syntax or context.

Furthermore, even within a single language, there can be variation in how words are spelled or written. For example, in English, words can be spelled with or without hyphens or apostrophes, and there can be differences in spelling between American English and British English.

Multilingual issues in tokenization arise because different languages can have different character sets, which means that the same sequence of characters can represent different words in different languages. Additionally, some languages have complex morphology, which means that a single word can have many different forms that represent different grammatical features or meanings.

To address these issues, NLP researchers have developed multilingual tokenization techniques that take into account the specific linguistic features of different languages. These techniques can include using language-specific dictionaries, models, or rules to identify the boundaries between words or units of meaning in different languages.
2.1 Tokenization, Case, and Encoding:

Tokenization, case, and encoding are all important aspects of natural language processing (NLP) that are used to preprocess text data before it can be analyzed by machine learning algorithms. Here are some examples of each:

1. Tokenization:

Tokenization is the process of splitting a text into individual tokens or words. In English, this is typically done by splitting the text on whitespace and punctuation marks. For example, the sentence "The quick brown fox jumps over the lazy dog." would be tokenized into the following list of words: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."].

2. Case:

Case refers to the use of upper and lower case letters in text. In NLP, it is often important to standardize the case of words to avoid treating the same word as different simply because it appears in different case. For example, the words "apple" and "Apple" should be treated as the same word.

3. Encoding:

Encoding refers to the process of representing text data in a way that can be processed by machine learning algorithms. One common encoding method used in NLP is Unicode, which is a character encoding standard that can represent a wide range of characters from different languages.

Here is an example of how tokenization, case, and encoding might be applied to a sentence of text:

Text: "The quick brown fox jumps over the lazy dog."

Tokenization: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]

Case: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]

Encoding: [0x74, 0x68, 0x65, 0x20, 0x71, 0x75, 0x69, 0x63, 0x6b, 0x20, 0x62, 0x72, 0x6f, 0x77, 0x6e, 0x20, 0x66, 0x6f, 0x78, 0x20, 0x6a, 0x75, 0x6d, 0x70, 0x73, 0x20, 0x6f, 0x76, 0x65, 0x72, 0x20, 0x74, 0x68, 0x65, 0x20, 0x6c, 0x61, 0x7a, 0x79, 0x20, 0x64, 0x6f, 0x67, 0x2e]

Note that the encoding is represented in hexadecimal to show the underlying bytes that represent the text.
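A minimal sketch of this preprocessing pipeline follows; the regex-based tokenizer is my own simplification, and real systems often use more elaborate tokenizers:

import re

text = "The quick brown fox jumps over the lazy dog."

# Tokenization: split into word tokens and keep punctuation as separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)        # ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

# Case normalization: lower-case every token.
lowered = [t.lower() for t in tokens]
print(lowered)       # ['the', 'quick', ...]

# Encoding: the UTF-8 bytes of the lower-cased text, shown in hexadecimal.
encoded = " ".join(lowered[:-1]) + lowered[-1]      # rejoin the words and attach the final "."
print([hex(b) for b in encoded.encode("utf-8")])    # ['0x74', '0x68', '0x65', '0x20', ...]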
2.2 Word Segmentation:

Word segmentation is one of the most basic tasks in Natural Language Processing (NLP), and it involves identifying the boundaries between words in a sentence. However, in some languages, such as Chinese and Japanese, there is no clear spacing or punctuation between words, which makes word segmentation more challenging.
In Chinese, for example, a sentence like "我喜欢中文" (which means "I like Chinese") could be segmented in different ways, such as "我 / 喜欢 / 中文" or "我喜欢 / 中文". Similarly, in Japanese, a sentence like "私は日本語が好きです" (which means "I like Japanese") could be segmented in different ways, such as "私は / 日本語が / 好きです" or "私は日本語 / が好きです".

Here are some examples of the challenges of word segmentation in different languages:

Chinese: In addition to the lack of spacing between words, Chinese also has a large number of homophones, which are words that sound the same but have different meanings. For example, the words "你" (you) and "年" (year) sound similar in Mandarin, but they are written with different characters.

Japanese: Japanese also has a large number of homophones, but it also has different writing systems, including kanji (Chinese characters), hiragana, and katakana. Kanji can often have multiple readings, which makes word segmentation more complex.

Thai: Thai has no spaces between words, and it also has no capitalization or punctuation. In addition, Thai has a unique script with many consonants that can be combined with different vowel signs to form words.

Vietnamese: Vietnamese uses the Latin alphabet, but it also has many diacritics (accent marks) that can change the meaning of a word. In addition, Vietnamese words can be formed by combining smaller words, which makes word segmentation more complex.

To address these challenges, NLP researchers have developed various techniques for word segmentation, including rule-based approaches, statistical models, and neural networks. However, word segmentation is still an active area of research, especially for low-resource languages where large amounts of annotated data are not available.
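One classical rule-based technique is forward maximum matching against a dictionary. The sketch below is a toy illustration only; the tiny vocabulary is made up, and real segmenters use much larger lexicons or statistical and neural models:

# Toy vocabulary for illustration only.
VOCAB = {"我", "喜欢", "喜", "中文", "中", "文"}

def max_match(text, vocab, max_len=4):
    """Forward maximum matching: repeatedly take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:          # longest match wins
                words.append(text[i:j])
                i = j
                break
        else:                               # unknown character: emit it on its own
            words.append(text[i])
            i += 1
    return words

print(max_match("我喜欢中文", VOCAB))   # ['我', '喜欢', '中文']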
2.3 Morphology:

Morphology is the study of the structure of words and how they are formed from smaller units called morphemes. Morphological analysis is important in many natural language processing tasks, such as machine translation and speech recognition, because it helps to identify the underlying structure of words and to disambiguate their meanings.

Here are some examples of the challenges of morphology in different languages:

Turkish: Turkish has a rich morphology, with a complex system of affixes that can be added to words to convey different meanings. For example, the word "kitap" (book) can be modified with different suffixes to indicate things like possession, plurality, or tense.
Arabic: Arabic also has a rich morphology, with a complex system of prefixes, suffixes, and infixes that can be added to words to convey different meanings. For example, the root "k-t-b" (meaning "write") can be modified with different affixes to form words like "kitab" (book) and "kataba" (he wrote).

Finnish: Finnish has a complex morphology, with a large number of cases, suffixes, and vowel harmony rules that can affect the form of a word. For example, the word "käsi" (hand) can be modified with different suffixes to indicate things like possession, location, or movement.

Swahili: Swahili has a complex morphology, with a large number of prefixes and suffixes that can be added to words to convey different meanings. For example, the word "kutaka" (to want) can be modified with different prefixes and suffixes to indicate things like tense, negation, or subject agreement.

To address these challenges, NLP researchers have developed various techniques for morphological analysis, including rule-based approaches, statistical models, and neural networks. However, morphological analysis is still an active area of research, especially for low-resource languages where large amounts of annotated data are not available.
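As a toy illustration of the rule-based approach, the sketch below strips a couple of Turkish-style suffixes from a word. The suffix inventory is deliberately tiny and ignores vowel harmony and consonant changes that a real analyzer would have to handle:

# A deliberately tiny, illustrative suffix inventory (Turkish-like).
SUFFIXES = [
    ("lar", "plural"), ("ler", "plural"),
    ("da", "locative"), ("de", "locative"),
]

def analyze(word):
    """Greedily strip known suffixes from the end of the word."""
    tags = []
    changed = True
    while changed:
        changed = False
        for suffix, tag in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix):
                word = word[: -len(suffix)]
                tags.append(tag)
                changed = True
                break
    return word, list(reversed(tags))   # stem plus suffix tags in surface order

print(analyze("kitaplarda"))   # ('kitap', ['plural', 'locative']), i.e. "in the books"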
Semantic Parsing I:

1. Introduction
2. Semantic Interpretation
3. System Paradigms
4. Word Sense

1. Introduction to Semantic Parsing:

What is semantic parsing?

The process of understanding the meaning and interpretation of words, signs, and sentence structure is called semantic parsing.

Using semantic parsing, computers can understand natural language the way humans do.

It is the toughest phase of NLP and is not fully solved.

Semantic --------> study of meaning
Parsing ----------> identifying and relating pieces of information

2. Semantic Interpretation:

Semantic parsing is considered a part of a larger process, semantic interpretation.

Semantic interpretation is a kind of representation of text that can be fed into a computer to allow further
3. These nails are growing too fast.

4. He went for a manicure to remove his nails.

did what to whom, when, where, why and how.

Example:

Steve Jobs was the co-founder of Apple, which is headquartered in Cupertino.

Answer(x1, longest(x1, river(x1)))

3. System Paradigms:

Researchers from the linguistics community have examined meaning representations at different levels of granularity (the level of detail) and generality (how broad or general the information is).

In many of the potential experimental conditions, no hand-annotated data is available. Therefore, it is important to get a perspective on the various primary dimensions on which the problem of semantic interpretation has been tackled.

The historic approaches which are more prevalent and successful generally fall into three categories.

3.1 SYSTEM ARCHITECTURES:

• Knowledge based: These systems use a predefined set of rules or a knowledge base to obtain a solution to a new problem.

• Unsupervised: These systems tend to require minimal or no human intervention to be functional, by using existing resources that can be bootstrapped for a particular application or problem domain.

• Supervised: These systems require some manual annotation. Typically, researchers create feature functions. A model is trained to use these features to predict labels, and then it is applied to unseen data.

• Semi-supervised: Manual annotation is very expensive and does not yield enough data. In such instances, researchers can automatically expand the dataset on which their models are trained, either by employing machine-generated output directly or by bootstrapping off of an existing model by having a human correct its output. In many cases, a model from one domain is quickly adapted to a new domain.

3.2 Scope:

a) Domain Dependent: These systems are specific to certain domains.

b) Domain Independent: These systems are general enough that the techniques can be applicable to multiple domains.

3.3 Coverage:

Shallow: These systems tend to produce an intermediate representation that can then be converted to one that a machine can base its actions on.

Deep: These systems usually create a terminal representation that is directly consumed by a machine application.

4. Word Sense:

Researchers have explored various system architectures to address the sense disambiguation problem.

We classify these systems into four main categories:

1. Rule based or knowledge based
2. Supervised
3. Unsupervised
4. Semi-supervised

4.1 Rule based:

Rule-based systems for word sense disambiguation are among the earliest methods developed to tackle the problem of determining the correct meaning of a word based on its context. These systems rely heavily on dictionaries, thesauri, and handcrafted rules.

Algorithms and Techniques:

i) Lesk Algorithm:

One of the oldest and simplest dictionary-based algorithms. The algorithm assigns to a word the sense whose definition has the most overlap with the words in its context. For example, if the word "bank" appears in a context with words like "money" and "deposit", the financial sense of "bank" is chosen. (A small code sketch follows at the end of this subsection.)

ii) Enhanced Lesk Algorithm:

Banerjee and Pedersen extended the Lesk algorithm to include synonyms, hypernyms (more general terms), hyponyms (more specific terms) and meronyms (part-whole relationships). This increases the accuracy of overlap measurement and improves the disambiguation performance.

iii) Structural Semantic Interconnections (SSI):

Proposed by Navigli and Velardi. SSI constructs semantic graphs using resources like WordNet, domain labels, and annotated corpora. It uses an iterative algorithm to match the semantic graphs of context words with the target word until the best matching sense is identified.

Working of rule-based systems:

1. Context collection
2. Dictionary / thesaurus matching
3. Weight computation
4. Sense selection

Advantages:

1. Simple and intuitive approach.
2. Can be very effective when precise dictionary definitions or thesaurus categories are available.

Limitations:

1. Heavily reliant on the availability and quality of lexical resources.
2. Handcrafted rules can be labor-intensive and may not cover all possible contexts.
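Here is the promised sketch of a simplified Lesk-style disambiguator. The two glosses for "bank" are shortened, made-up definitions used only for illustration, and no stop-word filtering is applied, which a real implementation would normally add:

# Made-up, shortened glosses for two senses of "bank".
SENSES = {
    "bank(financial)": "an institution where you can deposit money and take loans",
    "bank(river)": "the sloping land alongside a river or stream",
}

def lesk(context_words, senses):
    """Pick the sense whose gloss shares the most words with the context."""
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

context = "I went to the bank to deposit my money".split()
print(lesk(context, SENSES))   # bank(financial)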
Page 51 of 76
R22 B.Tech. CSE NLP
The predicate-argument structure (PAS) for this sentence would be: chased(cat, mouse)
1.1 Resources:
These resources help computers understand the