Unit 2
CS525PE Natural Language Processing
Professional Elective – II

Prerequisites:
1. Data structures and compiler design

Course Objectives:
Introduction to some of the problems and solutions of NLP and their relation to linguistics and statistics.

Course Outcomes:
1. Show sensitivity to linguistic phenomena and an ability to model them with formal grammars.
2. Understand and carry out proper experimental methodology for training and evaluating empirical NLP systems.
3. Manipulate probabilities, construct statistical models over strings and trees, and estimate parameters using supervised and unsupervised training methods.
4. Design, implement, and analyze NLP algorithms; and design different language modelling techniques.
UNIT - I
Finding the Structure of Words: Words and Their Components, Issues and Challenges, Morphological Models
Finding the Structure of Documents: Introduction, Methods, Complexity of the Approaches, Performances of the Approaches, Features
UNIT - II
Syntax I: Parsing Natural Language, Treebanks: A Data-Driven Approach to Syntax, Representation of Syntactic Structure, Parsing Algorithms

Syntax analysis is important for many NLP tasks, such as named entity recognition, sentiment analysis, and machine translation. By understanding the syntactic structure of a sentence, NLP systems can better identify the relationships between words and the overall structure of the text, which can be used to extract meaning and perform various downstream tasks.

Syntax Analysis:

1. Parsing Natural Language:
Syntax analysis in natural language processing (NLP) refers to the process of identifying the structure of a sentence and its component parts, such as phrases and clauses, based on the rules of the language's syntax.

There are several approaches to syntax analysis in NLP, including:

1. Part-of-speech (POS) tagging: This involves identifying the syntactic category of each word in a sentence, such as noun, verb, adjective, etc. This can be done using machine learning algorithms trained on annotated corpora of text.

2. Dependency parsing: This involves identifying the relationships between words in a sentence, such as subject-verb or object-verb relationships. This can be done using a dependency parser, which generates a parse tree that represents the relationships between words.

3. Constituency parsing: This involves identifying the constituent parts of a sentence, such as phrases and clauses. This can be done using a constituency parser, which groups words into nested phrases.

Put another way, syntax analysis, also known as parsing, refers to the process of analyzing the grammatical structure of a sentence in order to determine its constituent parts, their relationships to each other, and their functions within the sentence. This involves breaking down the sentence into its individual components, such as nouns, verbs, adjectives, and phrases, and then analyzing how these components are related to each other.

There are two main approaches to syntax analysis in NLP: rule-based parsing and statistical parsing. Rule-based parsing involves the use of a set of pre-defined rules that dictate how the different parts of speech and phrases in a sentence should be structured and related to each other. Statistical parsing, on the other hand, uses machine learning algorithms to learn patterns and relationships in large corpora of text in order to generate parse trees for new sentences.
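As a concrete illustration of the first approach above, here is a minimal sketch of tokenization and POS tagging using the NLTK library (an assumption on our part; the notes do not prescribe a toolkit, and the tagger resources must be downloaded once):

import nltk

# One-time setup (assumption: network access is available):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "The cat sat on the mat."
tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
tagged = nltk.pos_tag(tokens)           # assign a part-of-speech tag to each token
print(tokens)   # e.g. ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(tagged)   # e.g. [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ...]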
Sentence: "The cat sat on the mat." Syntax analysis is a crucial component of many NLP tasks,
including machine translation, text-to-speech conversion, and
Step 1: Tokenization sentiment analysis. By understanding the grammatical structure of
The first step is to break the sentence down into its individual a sentence, NLP models can more accurately interpret its meaning
words, or tokens: and generate appropriate responses or translations.
"The", "cat", "sat", "on", "the", "mat", "." 2. Treebanks: A Data-Driven Approach to
Syntax:
Step 2: Part of Speech Tagging
Treebanks are a data-driven approach to syntax analysis in natural
Next, each token is assigned a part of speech tag, which indicates language processing (NLP). They consist of a large collection of
its grammatical function in the sentence: sentences, each of which has been manually annotated with a parse
"The" (determiner), "cat" (noun), "sat" (verb), "on" (preposition), tree that shows the syntactic structure of the sentence. Treebanks
"the" (determiner),"mat" (noun), "." (punctuation) are used to train statistical parsers, which can then automatically
analyze new sentences and generate their own parse trees.
Step 3: Dependency Parsing
A parse tree is a hierarchical structure that represents the syntactic
Finally, the relationships between the words in the sentence are structure of a sentence. Each node in the tree represents a
analyzed using a dependency parser to create a parse tree. In this constituent of the sentence, such as a noun phrase or a verb phrase.
example, the parse tree might look something like this: The edges of the tree represent the relationships between these
constituents, such as subject-verb or verb-object relationships.
Sat
Here's an example of a parse tree for the sentence "The cat sat on
/ \
the mat":
cat on
/ \ |
The mat the
This parse tree shows that "cat" is the subject of the verb "sat,"
and "mat" is the object of the preposition "on."
phrase ("The cat") and a verb phrase ("sat on the mat"), with the subject "cat," and the preposition "on" depends on the object
verb phrase consisting of a verb("sat") and a prepositional phrase "mat." Both constituency-based and dependency-based
("on the mat"), and the prepositional phrase consisting of a representations are used in a variety of NLP tasks, including
preposition ("on") and a noun phrase ("the mat"). machine translation, sentiment analysis, and information
extraction. The choice of representation depends on the specific
3.2 Dependency-Based Representations: task and the algorithms used to process the data.
Dependency-based representations represent the structure
of a sentence as a directed graph, with each word in the sentence
represented as a node in the graph, and the relationships between 3.2.1 Syntax Analysis Using Dependency Graphs:
the words represented as directed edges. The edges are labeled
with a grammatical function such as subject(SUBJ) or object Syntax analysis using dependency graphs is a popular
(OBJ), and the nodes are labeled with a part-of-speech tag such as approach in natural language processing (NLP). Dependency
noun (N) or verb (V). Dependency-based representations are often graphs represent the syntactic structure of a sentence as a directed
used in statistical approaches to parsing. graph, where each word is a node in the graph and the relationships
between words are represented as directed edges. The nodes in the
Here's an example of a dependency-based representation of the graph are labeled with the part of speech of the corresponding
sentence "The cat sat on the mat": word, and the edges are labeled with the grammatical relationship
between the two words.
sat-V
Here's an example of a dependency graph for the sentence "The
| cat sat on the mat":
cat-N
|
on-PREP
|
mat-N
This representation shows that the verb "sat" depends on the
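A dependency graph like this can be produced automatically; here is a sketch using spaCy (assumptions: spaCy is installed and the small English model has been fetched with "python -m spacy download en_core_web_sm"; the exact labels depend on the model):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")

# Each token is a node; each (head, relation, token) triple is a labelled edge.
for token in doc:
    print(f"{token.text:<5} {token.pos_:<6} <--{token.dep_}-- {token.head.text}")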
Dependency graphs are thus a flexible representation for syntax analysis in NLP. They can be used for a variety of tasks and are a key component of many state-of-the-art NLP models.
3.2.2 Syntax Analysis Using Phrase Structure Trees:
Syntax analysis, also known as parsing, is the process of
analyzing the grammatical structure of a sentence to identify its
constituent parts and the relationships between them. In natural
language processing (NLP), phrase structure trees are often used
to represent the syntactic structure of a sentence.
A phrase structure tree, also known as a parse tree or a syntax tree,
is a graphical representation of the syntactic structure of a
sentence. It consists of a hierarchical structure of nodes, where
each node represents a phrase or a constituent of the sentence.
Here's an example of a phrase structure tree for the sentence "The cat sat on the mat", written in bracketed form:

(S (NP (Det The) (N cat)) (VP (V sat) (PP (P on) (NP (Det the) (N mat)))))
In this tree, the top-level node represents the entire sentence (S),
which is divided into two subparts: the noun phrase (NP) "The cat"
and the verb phrase (VP) "sat on the mat". The NP is further
divided into a determiner (Det) "The" and a noun (N) "cat".
The VP is composed of a verb (V) "sat" and a prepositional phrase
(PP) "on the mat", which itself consists of a preposition (P) "on"
and another noun phrase (NP) "the mat".
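Phrase structure trees of exactly this kind are what treebanks (Section 2) store. A small sketch that loads a real tree from the Penn Treebank sample bundled with NLTK (assumption: the corpus has been fetched with nltk.download("treebank")):

from nltk.corpus import treebank

tree = treebank.parsed_sents("wsj_0001.mrg")[0]   # first annotated sentence
print(tree)          # bracketed phrase structure, as stored in the treebank
tree.pretty_print()  # rendered as an ASCII tree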
Here's another example of a phrase structure tree, for the sentence "John saw the man with the telescope", again in bracketed form:

(S (NP (N John)) (VP (V saw) (NP (Det the) (N man) (PP (P with) (NP (Det the) (N telescope))))))
In this tree, the top-level node represents the entire sentence (S), which is divided into a noun phrase (NP) "John" and a verb phrase (VP) "saw the man with the telescope". The NP is simply a single noun (N) "John". The VP is composed of a verb (V) "saw" and a noun phrase (NP) "the man with the telescope". The latter is composed of a determiner (Det) "the" and a noun (N) "man", which is modified by a prepositional phrase (PP) "with the telescope", consisting of a preposition (P) "with" and a noun phrase (NP) "the telescope".

Phrase structure trees can be used in NLP for a variety of tasks, such as machine translation, text-to-speech synthesis, and natural language understanding. By identifying the syntactic structure of a sentence, computers can more accurately understand its meaning and generate appropriate responses.

4. Parsing Algorithms:

There are several algorithms used in natural language processing (NLP) for syntax analysis or parsing, each with its own strengths and weaknesses. Here are some common parsing algorithms and their examples:

4.1 Recursive Descent Parsing:

This is a top-down parsing algorithm that starts with the top-level symbol (usually the sentence) and recursively applies production rules to derive the structure of the sentence. Each production rule corresponds to a non-terminal symbol in the grammar, which can be expanded into a sequence of other symbols. The algorithm selects the first production rule that matches the current input, and recursively applies it to its right-hand side symbols. This process continues until a match is found for every terminal symbol in the input.

Example: Consider the following context-free grammar for arithmetic expressions:

E -> E + T | E - T | T
T -> T * F | T / F | F
F -> ( E ) | number

Suppose we want to parse the expression "3 + 4 * (5 - 2)" using recursive descent parsing. The algorithm would start with the top-level symbol E and apply the first production rule, E -> E + T. It would then recursively apply the production rules for E, T, and F until it reaches the terminals "3", "+", "4", "*", "(", "5", "-", "2", and ")". In the resulting parse tree, "3" is the left operand of "+", while "4 * (5 - 2)" is grouped under a single T, reflecting the higher precedence of "*" over "+". (Note that in practice a recursive descent parser cannot apply left-recursive rules such as E -> E + T directly, since it would recurse forever; such rules are first rewritten, for example as a loop, as shown in the sketch below.)
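Here is a minimal hand-rolled recursive descent parser for this grammar (our own illustrative sketch, not from the notes). It implements the left-recursive rules E -> E + T and T -> T * F as loops, as described above:

import re

def tokenize(text):
    # numbers and the operators/parentheses used by the grammar
    return re.findall(r"\d+|[-+*/()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def parse_E(self):                    # E -> T (("+" | "-") T)*
        node = self.parse_T()
        while self.peek() in ("+", "-"):
            node = ("E", node, self.eat(), self.parse_T())
        return node

    def parse_T(self):                    # T -> F (("*" | "/") F)*
        node = self.parse_F()
        while self.peek() in ("*", "/"):
            node = ("T", node, self.eat(), self.parse_F())
        return node

    def parse_F(self):                    # F -> "(" E ")" | number
        if self.peek() == "(":
            self.eat("(")
            node = self.parse_E()
            self.eat(")")
            return ("F", "(", node, ")")
        return ("F", self.eat())

tree = Parser(tokenize("3 + 4 * (5 - 2)")).parse_E()
print(tree)   # "*" binds tighter than "+", so 4 * (5 - 2) forms one subtree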
Some parsing algorithms instead use dynamic programming to store partial parses in a chart, which can be combined to form complete parses (see Section 4.5).

4.4 Shift-Reduce Parsing:

Shift-reduce parsing is a bottom-up parsing algorithm commonly used in natural language processing (NLP) to generate parse trees from input sentences. It works by shifting input tokens onto a stack and reducing the symbols on top of the stack to a non-terminal symbol whenever they match the right-hand side of a production rule.

Here is an example of how shift-reduce parsing can be used to parse the sentence "the cat chased the mouse" using a simple grammar:

S -> NP VP
NP -> Det N
VP -> V NP
Det -> the
N -> cat | mouse
V -> chased

1. Initialization: We start by initializing an empty stack and an input buffer with the sentence tokens "the", "cat", "chased", "the", and "mouse".

2. Shifting: We shift the first token "the" onto the stack and reduce it to Det using the rule Det -> the. The stack now contains Det.

3. Shifting again: We shift the next token "cat" onto the stack and reduce it to N using the rule N -> cat. The stack now contains Det and N.

4. Reduction: The top of the stack, Det N, matches the right-hand side of the rule NP -> Det N, so we pop Det and N from the stack and push the non-terminal symbol NP onto the stack.

5. Shifting again: We shift the next token "chased" onto the stack and reduce it to V using the rule V -> chased. The stack now contains NP and V.

6. Shifting again: No reduction applies to NP V, so we shift "the" and reduce it to Det, then shift "mouse" and reduce it to N. The stack now contains NP, V, Det, and N.

7. Reduction again: Det N on top of the stack is reduced to NP using the rule NP -> Det N. The stack now contains NP, V, and NP.

8. Reduction again: V NP on top of the stack is reduced to VP using the rule VP -> V NP. The stack now contains NP and VP.

9. Completion: NP VP is reduced to S using the rule S -> NP VP. The stack now contains only the symbol S, which is the final parse of the input sentence. The final parse tree for the sentence is:

             S
           /   \
         NP     VP
        /  \   /  \
      Det   N V    NP
       |    | |   /  \
      the cat chased Det  N
                      |   |
                     the mouse

Note that this example uses a simple grammar and a straightforward parsing process, but more complex grammars and sentences may require additional machinery (such as a lookahead buffer, backtracking, or a chart of partial parses) to achieve a successful parse.
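NLTK ships a shift-reduce parser that performs exactly these stack operations; a minimal sketch using it with the grammar above (assumption: nltk is installed; note that NLTK's implementation does not backtrack):

import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N -> 'cat' | 'mouse'
    V -> 'chased'
""")

parser = nltk.ShiftReduceParser(grammar)
for tree in parser.parse("the cat chased the mouse".split()):
    tree.pretty_print()   # prints the S / NP VP tree derived above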
"the" and VP can be reduced to the non-terminal symbol S using achieve a successful parse.
the production rule S -> NP VP. We pop "the" and VP from the
stack and push the non-terminal symbol S onto the stack. 4.5 Hypergraphs and Chart Parsing:
9. Completion: The stack now contains only the symbol S, which Hypergraphs and chart parsing are two related concepts
is the final parse of the input sentence. We can also look at the used in natural language processing (NLP) for syntactic parsing.
parse chart to see all the possible partial parses that were Hypergraphs represent a generalization of traditional parse
considered during the parsing process. The final parse tree for the trees, allowing for more complex structures and more efficient
sentence is: parsing algorithms. A hypergraph consists of a set of nodes
S (representing words or phrases in the input sentence) and a set of
hyperedges, which connect nodes and represent higher-level
/ \ structures. A chart, on the other hand, is a data structure used in
chart parsing to efficiently store and manipulate all possible partial
NP VP parses of a sentence.
/ \ | Here is an example of how chart parsing can be used to parse the
/ chased sentence "the cat chased the mouse" using a simple grammar:
/ | S -> NP VP
| / \ VP -> V NP
Note that this example uses a simple grammar and a 1. Initialization: We start by initializing an empty chart with the
straightforward parsing process, but more complex grammars and length of the input sentence (5 words) and a set of empty cells
sentences may require additional steps or different strategies to representing all possible partial parses.
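Before walking through the steps by hand, here is a runnable sketch using NLTK's ChartParser with this grammar (an assumption on our part; any chart parser would illustrate the same idea). It stores every partial parse ("edge") in a chart indexed by span, exactly as the steps below describe:

import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N -> 'cat' | 'mouse'
    V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat chased the mouse".split()):
    print(tree)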
1. Initialization: We start by initializing an empty chart for the input sentence (5 words), with empty cells representing all possible partial parses.

2. Scanning: We scan each word in the input sentence and add a corresponding parse to the chart. For example, for the first word "the", we add a parse for the non-terminal symbol Det (Det -> the). We do this for each word in the sentence.

3. Predicting: We use the grammar rules to predict possible partial parses for each span of words in the sentence. For example, we can predict a partial parse for the span (1, 2) (i.e., the first two words "the cat") by applying the rule NP -> Det N to the parses for "the" and "cat". We add this partial parse to the chart cell for the span (1, 2).

4. Scanning again: We scan the chart looking for parses that match predicted partial parses. If we find a match, we can apply a grammar rule to combine partial parses into a larger one. For example, if the parses for "the" and "cat" match the predicted parse for NP -> Det N, we combine them into a parse for the span (1, 2) with the non-terminal symbol NP.

5. Combining: We continue to combine partial parses in the chart using grammar rules until we have a complete parse for the entire sentence.

6. Output: The final parse tree for the sentence is represented by the complete parse in the chart cell for the span (1, 5) and the non-terminal symbol S.

Chart parsing can be more efficient than other parsing algorithms, such as recursive descent or shift-reduce parsing, because it stores all possible partial parses in the chart and avoids redundant parsing of the same span multiple times. Hypergraphs can also be used in chart parsing to represent more complex structures and enable more efficient parsing algorithms.

4.6 Minimum Spanning Trees and Dependency Parsing:

Dependency parsing is a type of syntactic parsing that represents the grammatical structure of a sentence as a directed acyclic graph (DAG). The nodes of the graph represent the words of the sentence, and the edges represent the syntactic relationships between the words.

Minimum spanning tree (MST) algorithms are often used for dependency parsing, as they provide an efficient way to find the most likely parse for a sentence given a set of syntactic dependencies.

Here's an example of how an MST algorithm can be used for dependency parsing:

Consider the sentence "The cat chased the mouse". We can represent this sentence as a graph with nodes for each word and weighted edges representing candidate syntactic dependencies between them. We can then use an MST algorithm to find the most likely parse for this graph.

One popular algorithm for this is the Chu-Liu/Edmonds algorithm:

1. We first remove all self-loops and multiple edges in the graph. This is because a valid dependency tree must be acyclic and give each word exactly one head.
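This spanning-tree search can be sketched with the networkx implementation of Chu-Liu/Edmonds (assumptions: networkx is installed, and the arc scores below are invented purely for illustration; a real parser would score arcs with a trained model):

import networkx as nx

# Candidate head -> dependent arcs with made-up scores. "The" and "the"
# are written differently so each word of the sentence is a distinct node.
arcs = [
    ("ROOT", "chased", 10),
    ("chased", "cat", 8), ("chased", "mouse", 7),
    ("cat", "The", 5), ("mouse", "the", 5),
    ("mouse", "cat", 2), ("cat", "mouse", 1),   # weaker competing arcs
]
G = nx.DiGraph()
G.add_weighted_edges_from(arcs)

# Maximum spanning arborescence = highest-scoring dependency tree in which
# every word has exactly one head.
tree = nx.maximum_spanning_arborescence(G)
print(sorted(tree.edges()))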