
R22 B.Tech. CSE

CS525PE: Natural Language Processing (Professional Elective – II)

Prerequisites:
1. Data structures and compiler design

Course Objectives:
Introduction to some of the problems and solutions of NLP and their relation to linguistics and statistics.

Course Outcomes:
1. Show sensitivity to linguistic phenomena and an ability to model them with formal grammars.
2. Understand and carry out proper experimental methodology for training and evaluating empirical NLP systems.
3. Manipulate probabilities, construct statistical models over strings and trees, and estimate parameters using supervised and unsupervised training methods.
4. Design, implement, and analyze NLP algorithms, and design different language modelling techniques.
UNIT - I
Finding the Structure of Words: Words and Their Components, Issues and Challenges, Morphological Models
Finding the Structure of Documents: Introduction, Methods, Complexity of the Approaches, Performances of the Approaches, Features
UNIT - II

Syntax I: Parsing Natural Language, Treebanks: A Data-Driven Approach to Syntax, Representation of Syntactic Structure, Parsing Algorithms

UNIT - III
Syntax II: Models for Ambiguity Resolution in Parsing, Multilingual Issues
Semantic Parsing I: Introduction, Semantic Interpretation, System Paradigms, Word Sense

UNIT - IV
Semantic Parsing II: Predicate-Argument Structure, Meaning Representation Systems

UNIT - V
Language Modeling: Introduction, N-Gram Models, Language Model Evaluation, Bayesian Parameter Estimation, Language Model Adaptation, Language Models (Class-Based, Variable-Length, Bayesian Topic-Based), Multilingual and Cross-Lingual Language Modeling

TEXT BOOKS:
1. Multilingual Natural Language Processing Applications: From Theory to Practice – Daniel M. Bikel and Imed Zitouni, Pearson Publications.

REFERENCE BOOKS:
1. Speech and Language Processing – Daniel Jurafsky & James H. Martin, Pearson Publications.
2. Natural Language Processing and Information Retrieval – Tanveer Siddiqui, U.S. Tiwary.


Syntax I: Parsing Natural Language, Treebanks: A Data-Driven Approach to Syntax, Representation of Syntactic Structure, Parsing Algorithms

Syntax Analysis:

Syntax analysis in natural language processing (NLP) refers to the process of identifying the structure of a sentence and its component parts, such as phrases and clauses, based on the rules of the language's syntax.

There are several approaches to syntax analysis in NLP, including:

1. Part-of-speech (POS) tagging: This involves identifying the syntactic category of each word in a sentence, such as noun, verb, adjective, etc. This can be done using machine learning algorithms trained on annotated corpora of text.

2. Dependency parsing: This involves identifying the relationships between words in a sentence, such as subject-verb or verb-object relationships. This can be done using a dependency parser, which generates a parse tree that represents the relationships between words.

3. Constituency parsing: This involves identifying the constituent parts of a sentence, such as phrases and clauses. This can be done using a phrase-structure parser, which generates a parse tree that represents the structure of the sentence.

Syntax analysis is important for many NLP tasks, such as named entity recognition, sentiment analysis, and machine translation. By understanding the syntactic structure of a sentence, NLP systems can better identify the relationships between words and the overall structure of the text, which can be used to extract meaning and perform various downstream tasks.

1. Parsing Natural Language:

In natural language processing (NLP), syntax analysis, also known as parsing, refers to the process of analyzing the grammatical structure of a sentence in order to determine its constituent parts, their relationships to each other, and their functions within the sentence. This involves breaking the sentence down into its individual components, such as nouns, verbs, adjectives, and phrases, and then analyzing how these components are related to each other.

There are two main approaches to syntax analysis in NLP: rule-based parsing and statistical parsing. Rule-based parsing involves the use of a set of pre-defined rules that dictate how the different parts of speech and phrases in a sentence should be structured and related to each other. Statistical parsing, on the other hand, uses machine learning algorithms to learn patterns and relationships in large corpora of text in order to generate parse trees for new sentences.

Here's an example of how syntax analysis works using a simple sentence:

Sentence: "The cat sat on the mat." Syntax analysis is a crucial component of many NLP tasks,
including machine translation, text-to-speech conversion, and
Step 1: Tokenization sentiment analysis. By understanding the grammatical structure of
The first step is to break the sentence down into its individual a sentence, NLP models can more accurately interpret its meaning
words, or tokens: and generate appropriate responses or translations.

"The", "cat", "sat", "on", "the", "mat", "." 2. Treebanks: A Data-Driven Approach to
Syntax:
Step 2: Part of Speech Tagging
Treebanks are a data-driven approach to syntax analysis in natural
Next, each token is assigned a part of speech tag, which indicates language processing (NLP). They consist of a large collection of
its grammatical function in the sentence: sentences, each of which has been manually annotated with a parse
"The" (determiner), "cat" (noun), "sat" (verb), "on" (preposition), tree that shows the syntactic structure of the sentence. Treebanks
"the" (determiner),"mat" (noun), "." (punctuation) are used to train statistical parsers, which can then automatically
analyze new sentences and generate their own parse trees.
Step 3: Dependency Parsing
A parse tree is a hierarchical structure that represents the syntactic
Finally, the relationships between the words in the sentence are structure of a sentence. Each node in the tree represents a
analyzed using a dependency parser to create a parse tree. In this constituent of the sentence, such as a noun phrase or a verb phrase.
example, the parse tree might look something like this: The edges of the tree represent the relationships between these
constituents, such as subject-verb or verb-object relationships.
Sat
Here's an example of a parse tree for the sentence "The cat sat on
/ \
the mat":
cat on
/ \ |
The mat the
This parse tree shows that "cat" is the subject of the verb "sat,"
and "mat" is the object of the preposition "on."

Syntax analysis is a crucial component of many NLP tasks, including machine translation, text-to-speech conversion, and sentiment analysis. By understanding the grammatical structure of a sentence, NLP models can more accurately interpret its meaning and generate appropriate responses or translations.

2. Treebanks: A Data-Driven Approach to Syntax:

Treebanks are a data-driven approach to syntax analysis in natural language processing (NLP). They consist of a large collection of sentences, each of which has been manually annotated with a parse tree that shows the syntactic structure of the sentence. Treebanks are used to train statistical parsers, which can then automatically analyze new sentences and generate their own parse trees.

A parse tree is a hierarchical structure that represents the syntactic structure of a sentence. Each node in the tree represents a constituent of the sentence, such as a noun phrase or a verb phrase. The edges of the tree represent the relationships between these constituents, such as subject-verb or verb-object relationships.

Here's an example of a parse tree for the sentence "The cat sat on the mat":

        sat(V)
       /      \
   cat(N)    on(PREP)
     |          |
   The(D)    mat(N)
                |
             the(D)

This parse tree shows that the sentence is built around the verb ("sat"), with a noun phrase ("the cat") and a prepositional phrase ("on the mat") attached to it. The noun phrase consists of a determiner ("the") and a noun ("cat"), and the prepositional phrase consists of a preposition ("on") and a noun phrase ("the mat").

Treebanks can be used to train statistical parsers, which can then automatically analyze new sentences and generate their own parse trees. These parsers work by identifying patterns in the treebank data and using these patterns to make predictions about the structure of new sentences. For example, a statistical parser might learn that a noun phrase is usually followed by a verb phrase and use this pattern to generate a parse tree for a new sentence.

Treebanks are an important resource in NLP, as they allow researchers and developers to train and test statistical parsers and other models that rely on syntactic analysis. Some well-known treebanks include the Penn Treebank and the Universal Dependencies treebanks. These resources are publicly available and have been used in a wide range of NLP research and applications.
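As a concrete illustration, a sample of the Penn Treebank ships with NLTK. The sketch below (an illustrative choice of toolkit; any treebank reader would do) loads the first annotated parse tree; it assumes NLTK is installed and downloads the sample data on first run:

import nltk
from nltk.corpus import treebank

nltk.download("treebank", quiet=True)  # fetch the sample corpus once

# Each entry is a hand-annotated parse tree, as described above.
first_tree = treebank.parsed_sents()[0]
print(first_tree.label())       # S, the root constituent
print(first_tree.leaves()[:8])  # the first few words of the sentence
first_tree.pretty_print()       # draw the tree as ASCII art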

phrase ("The cat") and a verb phrase ("sat on the mat"), with the subject "cat," and the preposition "on" depends on the object
verb phrase consisting of a verb("sat") and a prepositional phrase "mat." Both constituency-based and dependency-based
("on the mat"), and the prepositional phrase consisting of a representations are used in a variety of NLP tasks, including
preposition ("on") and a noun phrase ("the mat"). machine translation, sentiment analysis, and information
extraction. The choice of representation depends on the specific
3.2 Dependency-Based Representations: task and the algorithms used to process the data.
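The bracketed representation above is machine-readable as written. As a minimal sketch (using NLTK's Tree class, an illustrative choice of toolkit), it can be loaded and inspected directly:

from nltk import Tree

t = Tree.fromstring(
    "(S (NP (DT The) (NN cat))"
    "   (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)
print(t.label())   # S
print(t.leaves())  # ['The', 'cat', 'sat', 'on', 'the', 'mat']
t.pretty_print()   # draws the phrase structure tree as ASCII art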
3.2 Dependency-Based Representations:

Dependency-based representations represent the structure of a sentence as a directed graph, with each word in the sentence represented as a node in the graph, and the relationships between the words represented as directed edges. The edges are labeled with a grammatical function such as subject (SUBJ) or object (OBJ), and the nodes are labeled with a part-of-speech tag such as noun (N) or verb (V). Dependency-based representations are often used in statistical approaches to parsing.

Here's an example of a dependency-based representation of the sentence "The cat sat on the mat":

sat-V
  |
cat-N
  |
on-PREP
  |
mat-N

This representation shows that the subject "cat" depends on the verb "sat," and the object "mat" depends on the preposition "on."

Both constituency-based and dependency-based representations are used in a variety of NLP tasks, including machine translation, sentiment analysis, and information extraction. The choice of representation depends on the specific task and the algorithms used to process the data.

3.2.1 Syntax Analysis Using Dependency Graphs:

Syntax analysis using dependency graphs is a popular approach in natural language processing (NLP). Dependency graphs represent the syntactic structure of a sentence as a directed graph, where each word is a node in the graph and the relationships between words are represented as directed edges. The nodes in the graph are labeled with the part of speech of the corresponding word, and the edges are labeled with the grammatical relationship between the two words.

Here's an example of a dependency graph for the sentence "The cat sat on the mat":

[Figure: dependency graph for "The cat sat on the mat"]

In this graph, the word "cat" depends on the word "sat" with a subject relationship, and the word "mat" depends on the word "on" with a prepositional relationship.

Dependency graphs are useful for a variety of NLP tasks, including named entity recognition, relation extraction, and sentiment analysis. They can also be used for parsing and syntactic analysis, as they provide a compact and expressive way to represent the structure of a sentence.

One advantage of dependency graphs is that they are simpler and more efficient than phrase structure trees, which can be computationally expensive to build and manipulate. Dependency graphs also provide a more flexible representation of syntactic structure, as they can easily capture non-projective dependencies and other complex relationships between words.

Here's another example of a dependency graph for the sentence "I saw the man with the telescope":

[Figure: dependency graph for "I saw the man with the telescope"]

This graph shows that the subject "I" depends on the verb "saw," and that the noun phrase "the man" depends on the verb "saw" with an object relationship. The prepositional phrase "with the telescope" modifies the noun phrase "the man," with the word "telescope" being the object of the preposition "with."

In summary, dependency graphs provide a flexible and efficient way to represent the syntactic structure of a sentence in

NLP. They can be used for a variety of tasks and are a key
component of many state-of-the-art NLP models.
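One simple machine-readable form for such a graph is a plain list of (head, dependent, relation) triples. The sketch below encodes the graph for "I saw the man with the telescope" described above, with the prepositional phrase attached to "man"; the exact relation labels are illustrative assumptions:

# Tokens are indexed so the two occurrences of "the" stay distinct.
TOKENS = ["I", "saw", "the", "man", "with", "the", "telescope"]

# Edges: (head index, dependent index, relation); head -1 marks the root.
EDGES = [
    (1, 0, "nsubj"),   # I <- saw
    (-1, 1, "root"),   # saw is the root of the sentence
    (3, 2, "det"),     # the <- man
    (1, 3, "obj"),     # man <- saw
    (3, 4, "prep"),    # with <- man (PP attached to the noun)
    (6, 5, "det"),     # the <- telescope
    (4, 6, "pobj"),    # telescope <- with
]

# Well-formedness check: every token has exactly one head.
assert sorted(dep for _, dep, _ in EDGES) == list(range(len(TOKENS)))

for head, dep, rel in EDGES:
    head_word = "ROOT" if head < 0 else TOKENS[head]
    print(f"{TOKENS[dep]:>10} --{rel}--> {head_word}")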
3.2.2 Syntax Analysis Using Phrase Structure Trees:
Syntax analysis, also known as parsing, is the process of
analyzing the grammatical structure of a sentence to identify its
constituent parts and the relationships between them. In natural
language processing (NLP), phrase structure trees are often used
to represent the syntactic structure of a sentence.
A phrase structure tree, also known as a parse tree or a syntax tree,
is a graphical representation of the syntactic structure of a
sentence. It consists of a hierarchical structure of nodes, where
each node represents a phrase or a constituent of the sentence.
Here's an example of a phrase structure tree for the sentence "The
cat sat on the mat":

[Figure: phrase structure tree for "The cat sat on the mat"]

In this tree, the top-level node represents the entire sentence (S),
which is divided into two subparts: the noun phrase (NP) "The cat"
and the verb phrase (VP) "sat on the mat". The NP is further
divided into a determiner (Det) "The" and a noun (N) "cat".
The VP is composed of a verb (V) "sat" and a prepositional phrase
(PP) "on the mat", which itself consists of a preposition (P) "on"
and another noun phrase (NP) "the mat".
Here's another example of a phrase structure tree for the sentence
"John saw the man with the telescope":

[Figure: phrase structure tree for "John saw the man with the telescope"]

In this tree, the top-level node represents the entire sentence (S), which is divided into a noun phrase (NP) "John" and a verb phrase (VP) "saw the man with the telescope". The NP is simply a single noun (N) "John". The VP is composed of a verb (V) "saw" and a noun phrase (NP) "the man with the telescope". The latter is composed of a determiner (Det) "the" and a noun (N) "man", which is modified by a prepositional phrase (PP) "with the telescope", consisting of a preposition (P) "with" and a noun phrase (NP) "the telescope".

Phrase structure trees can be used in NLP for a variety of tasks, such as machine translation, text-to-speech synthesis, and natural language understanding. By identifying the syntactic structure of a sentence, computers can more accurately understand its meaning and generate appropriate responses.
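This sentence is a classic case of prepositional-phrase attachment ambiguity: the PP "with the telescope" can attach to the noun phrase (the man who has the telescope) or to the verb phrase (seeing by means of the telescope), giving two distinct trees. The sketch below uses NLTK's chart parser with a toy grammar (both the library choice and the grammar are illustrative assumptions) to print both trees:

from nltk import CFG, ChartParser

grammar = CFG.fromstring("""
S  -> NP VP
NP -> Det N | NP PP | 'John'
VP -> V NP | VP PP
PP -> P NP
Det -> 'the'
N  -> 'man' | 'telescope'
V  -> 'saw'
P  -> 'with'
""")

parser = ChartParser(grammar)
# Two parses: one attaches the PP to the NP, the other to the VP.
for tree in parser.parse("John saw the man with the telescope".split()):
    tree.pretty_print()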
4. Parsing Algorithms:

There are several algorithms used in natural language processing (NLP) for syntax analysis or parsing, each with its own strengths and weaknesses. Here are some common parsing algorithms and their examples:

4.1 Recursive descent parsing:

This is a top-down parsing algorithm that starts with the top-level symbol (usually the sentence) and recursively applies production rules to derive the structure of the sentence. Each production rule corresponds to a non-terminal symbol in the grammar, which can be expanded into a sequence of other symbols. The algorithm selects the first production rule that matches the current input, and recursively applies it to its right-hand side symbols. This process continues until a match is found for every terminal symbol in the input.

Example: Consider the following context-free grammar for arithmetic expressions:

E -> E + T | E - T | T
T -> T * F | T / F | F
F -> ( E ) | number

Suppose we want to parse the expression "3 + 4 * (5 - 2)" using recursive descent parsing. The algorithm would start with the top-level symbol E and apply the production rule E -> E + T. It would then recursively apply the production rules for E, T, and F until it reaches the terminals "3", "+", "4", "*", "(", "5", "-", "2", and ")". The resulting parse tree would look like this:

[Figure: parse tree for "3 + 4 * (5 - 2)"]
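Below is a hand-written sketch of a recursive descent parser for this grammar. One caveat: a left-recursive rule such as E -> E + T cannot be used directly in recursive descent (the function for E would call itself forever on the same input), so each left-recursive rule is rewritten as a loop, a standard transformation:

import re

def tokenize(text):
    # Split the input into numbers, parentheses, and operators.
    return re.findall(r"\d+|[()+\-*/]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def parse_E(self):  # E -> T (('+' | '-') T)*
        node = self.parse_T()
        while self.peek() in ("+", "-"):
            node = (self.eat(), node, self.parse_T())
        return node

    def parse_T(self):  # T -> F (('*' | '/') F)*
        node = self.parse_F()
        while self.peek() in ("*", "/"):
            node = (self.eat(), node, self.parse_F())
        return node

    def parse_F(self):  # F -> '(' E ')' | number
        if self.peek() == "(":
            self.eat("(")
            node = self.parse_E()
            self.eat(")")
            return node
        tok = self.eat()
        if not tok.isdigit():
            raise SyntaxError(f"expected a number, got {tok!r}")
        return int(tok)

tree = Parser(tokenize("3 + 4 * (5 - 2)")).parse_E()
print(tree)  # ('+', 3, ('*', 4, ('-', 5, 2)))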

4.2 Shift-reduce parsing:

This is a bottom-up parsing algorithm that starts with the input tokens and constructs a parse tree by repeatedly shifting a token onto a stack and reducing a group of symbols on the stack to a single symbol based on the production rules. The algorithm maintains a parse table that specifies which actions to take based on the current state and the next input symbol.

Example: Consider the following grammar for simple English sentences:

S -> NP VP
NP -> Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> the | a
N -> man | ball | woman
V -> saw | liked
P -> with | in

Suppose we want to parse the sentence "the man saw a woman with a ball" using shift-reduce parsing. The algorithm would shift the tokens "the", "man", "saw", "a", "woman", "with", "a", and "ball" onto the stack one at a time, interleaving reductions as they become possible: "Det N" is reduced to NP, "P NP" to PP, "NP PP" to NP (attaching "with a ball" to "a woman"), "V NP" to VP, and finally "NP VP" to S. The resulting parse tree would look like this:

[Figure: parse tree for "the man saw a woman with a ball"]

4.3 Earley parsing:

This is a chart parsing algorithm that uses dynamic programming to store partial parses in a chart, which can be combined to form complete parses.
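As a sketch, NLTK provides an Earley chart parser that can be run on the grammar from section 4.2 (the use of NLTK here is an illustrative assumption); with trace=1 it prints chart edges as they are predicted, scanned, and completed:

from nltk import CFG
from nltk.parse import EarleyChartParser

grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'man' | 'ball' | 'woman'
V -> 'saw' | 'liked'
P -> 'with' | 'in'
""")

parser = EarleyChartParser(grammar, trace=1)
for tree in parser.parse("the man saw a woman with a ball".split()):
    print(tree)  # both PP attachments are found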
4.4 Shift-Reduce Parsing:

Shift-reduce parsing is a bottom-up parsing algorithm commonly used in natural language processing (NLP) to generate parse trees from input sentences. It works by shifting tokens from an input buffer onto a stack and incrementally reducing the symbols on top of the stack to a single non-terminal symbol whenever they match the right-hand side of a production rule.

Here is an example of how shift-reduce parsing can be used to parse the sentence "the cat chased the mouse" using a simple grammar:

S -> NP VP
NP -> Det N
VP -> V NP
Det -> the
N -> cat | mouse
V -> chased

1. Initialization: We start with an empty stack and an input buffer containing the sentence tokens "the", "cat", "chased", "the", and "mouse".

2. Shifting: We shift the first token "the" onto the stack and reduce it to Det using the rule Det -> the. The stack now contains only the symbol Det.

3. Shifting again: We shift the next token "cat" onto the stack and reduce it to N using the rule N -> cat. The stack now contains the symbols Det and N.

4. Reduction: The top of the stack matches the right-hand side of the rule NP -> Det N, so we pop Det and N from the stack and push the non-terminal symbol NP. The stack now contains only NP.

5. Shifting again: We shift the next token "chased" onto the stack and reduce it to V using the rule V -> chased. The stack now contains NP and V. No further reduction is possible yet, because VP -> V NP needs an NP to the right of the V, so we continue shifting.

6. Shifting and reducing: We shift the next tokens "the" and "mouse" onto the stack and reduce them to Det and N, and then to NP using NP -> Det N. The stack now contains NP, V, and NP.

7. Reduction again: The top of the stack matches the right-hand side of the rule VP -> V NP, so we pop V and NP and push VP. The stack now contains NP and VP.

"the" and VP can be reduced to the non-terminal symbol S using achieve a successful parse.
the production rule S -> NP VP. We pop "the" and VP from the
stack and push the non-terminal symbol S onto the stack. 4.5 Hypergraphs and Chart Parsing:

9. Completion: The stack now contains only the symbol S, which Hypergraphs and chart parsing are two related concepts
is the final parse of the input sentence. We can also look at the used in natural language processing (NLP) for syntactic parsing.
parse chart to see all the possible partial parses that were Hypergraphs represent a generalization of traditional parse
considered during the parsing process. The final parse tree for the trees, allowing for more complex structures and more efficient
sentence is: parsing algorithms. A hypergraph consists of a set of nodes
S (representing words or phrases in the input sentence) and a set of
hyperedges, which connect nodes and represent higher-level
/ \ structures. A chart, on the other hand, is a data structure used in
chart parsing to efficiently store and manipulate all possible partial
NP VP parses of a sentence.
/ \ | Here is an example of how chart parsing can be used to parse the
/ chased sentence "the cat chased the mouse" using a simple grammar:

/ | S -> NP VP

Det NP NP -> Det N

| / \ VP -> V NP

the Det N Det -> the

| | N -> cat | mouse

the mouse V -> chased

Note that this example uses a simple grammar and a 1. Initialization: We start by initializing an empty chart with the
straightforward parsing process, but more complex grammars and length of the input sentence (5 words) and a set of empty cells
sentences may require additional steps or different strategies to representing all possible partial parses.
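A simplified sketch of this shift-reduce loop is shown below. It reduces greedily whenever the top of the stack matches a rule's right-hand side, which is sufficient for this sentence; a real parser would consult a parse table or lookahead to decide between shifting and reducing:

RULES = [
    ("NP", ["Det", "N"]),
    ("VP", ["V", "NP"]),
    ("S",  ["NP", "VP"]),
]
LEXICON = {"the": "Det", "cat": "N", "mouse": "N", "chased": "V"}

def shift_reduce(words):
    stack, buffer = [], list(words)
    while buffer or len(stack) > 1:
        # Reduce while the top of the stack matches some right-hand side.
        reduced = True
        while reduced:
            reduced = False
            for head, rhs in RULES:
                if stack[-len(rhs):] == rhs:
                    del stack[-len(rhs):]
                    stack.append(head)
                    print(f"reduce -> {head}: {stack}")
                    reduced = True
        if buffer:
            word = buffer.pop(0)
            stack.append(LEXICON[word])  # shift the word and tag it
            print(f"shift {word!r}: {stack}")
        elif len(stack) > 1:
            raise ValueError("no parse found")
    return stack == ["S"]

print(shift_reduce("the cat chased the mouse".split()))  # True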

4.5 Hypergraphs and Chart Parsing:

Hypergraphs and chart parsing are two related concepts used in natural language processing (NLP) for syntactic parsing. Hypergraphs are a generalization of traditional parse trees, allowing for more complex structures and more efficient parsing algorithms. A hypergraph consists of a set of nodes (representing words or phrases in the input sentence) and a set of hyperedges, which connect nodes and represent higher-level structures. A chart, on the other hand, is a data structure used in chart parsing to efficiently store and manipulate all possible partial parses of a sentence.

Here is an example of how chart parsing can be used to parse the sentence "the cat chased the mouse" using a simple grammar:

S -> NP VP
NP -> Det N
VP -> V NP
Det -> the
N -> cat | mouse
V -> chased

1. Initialization: We start by initializing an empty chart for the input sentence (5 words), with a set of empty cells representing all possible partial parses.

2. Scanning: We scan each word in the input sentence and add a corresponding parse to the chart. For example, for the first word "the", we add a parse for the non-terminal symbol Det (Det -> the). We do this for each word in the sentence.

3. Predicting: We use the grammar rules to predict possible partial parses for each span of words in the sentence. For example, we can predict a partial parse for the span (1, 2) (i.e., the first two words "the cat") by applying the rule NP -> Det N to the parses for "the" and "cat". We add this partial parse to the chart cell for the span (1, 2).

4. Scanning again: We scan the chart again, looking for matches to predicted partial parses. If we find a match, we can apply a grammar rule to combine two partial parses into a larger parse. For example, if the parses for "the" and "cat" match the predicted parse for NP -> Det N, we combine them into a parse for the span (1, 2) with the non-terminal symbol NP.

5. Combining: We continue to combine partial parses in the chart using grammar rules until we have a complete parse for the entire sentence.

6. Output: The final parse tree for the sentence is represented by the complete parse in the chart cell for the span (1, 5) and the non-terminal symbol S.

Chart parsing can be more efficient than other parsing algorithms, such as recursive descent or shift-reduce parsing, because it stores all possible partial parses in the chart and avoids redundant parsing of the same span multiple times. Hypergraphs can also be used in chart parsing to represent more complex structures and enable more efficient parsing algorithms.
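One concrete way to realize a chart parser is the CKY algorithm, which fills chart cells indexed by span, much as in the steps above. Below is a minimal CKY-style recognizer for this grammar (a sketch; a full parser would also store backpointers in each cell so the parse tree can be recovered):

LEXICAL = {"the": {"Det"}, "cat": {"N"}, "mouse": {"N"}, "chased": {"V"}}
BINARY = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}

def cky_parse(words):
    n = len(words)
    # chart[i][j] holds every non-terminal that can derive words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    # Scanning: fill spans of length 1 from the lexicon.
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICAL.get(word, ()))
    # Combining: build longer spans from pairs of adjacent sub-spans.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # split point
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        parent = BINARY.get((left, right))
                        if parent:
                            chart[i][j].add(parent)
    return chart

words = "the cat chased the mouse".split()
chart = cky_parse(words)
print(chart[0][len(words)])  # {'S'}: the sentence is grammatical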

4.6 Minimum Spanning Trees and Dependency Parsing:

Dependency parsing is a type of syntactic parsing that represents the grammatical structure of a sentence as a directed acyclic graph (DAG). The nodes of the graph represent the words of the sentence, and the edges represent the syntactic relationships between the words.

Minimum spanning tree (MST) algorithms are often used for dependency parsing, as they provide an efficient way to find the most likely parse for a sentence given a set of scored candidate dependencies.

Here's an example of how an MST algorithm can be used for dependency parsing. Consider the sentence "The cat chased the mouse". We can represent this sentence as a graph with nodes for each word and edges representing the possible syntactic dependencies between them. We can then use an MST algorithm to find the most likely parse for this graph. One popular algorithm for this is the Chu-Liu/Edmonds algorithm:

1. We first remove all self-loops and multiple edges in the graph. This is because a valid dependency tree must be acyclic and have only one edge between any two nodes.

2. We then choose a node to be the root of the tree. In this example, we can choose "chased" to be the root since it is the main verb of the sentence.

3. We then compute the scores for each edge in the graph based on a scoring function that takes into account the probability of each edge being a valid dependency. The scoring function can be based on various linguistic features, such as part-of-speech tags or word embeddings.

4. We use the MST algorithm to find the tree that maximizes the total score of its edges. The algorithm starts with a set of edges that connect the root node to each of its immediate dependents, and iteratively adds edges that connect other nodes to the tree. At each iteration, we select the edge with the highest score that does not create a cycle in the tree.

5. Once the MST algorithm has constructed the tree, we can assign a label to each edge in the tree based on the type of dependency it represents (e.g., subject, object, etc.).

The resulting dependency tree for the example sentence is shown below:

[Figure: dependency tree for "The cat chased the mouse"]

In this tree, each node represents a word in the sentence, and each edge represents a syntactic dependency between two words.

Dependency parsing can be useful for many NLP tasks, such as information extraction, machine translation, and sentiment analysis.

One advantage of dependency parsing is that it captures more fine-grained syntactic information than phrase-structure parsing, as it represents the relationships between individual words rather than just the hierarchical structure of phrases. However, dependency parsing can be more difficult to perform accurately than phrase-structure parsing, as it requires more sophisticated algorithms and models to capture the nuances of syntactic dependencies.
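The sketch below runs this idea with networkx, whose maximum spanning arborescence routine implements the Chu-Liu/Edmonds algorithm. The candidate edges and their scores are invented for illustration; a real parser would score edges with a trained model as described in step 3:

import networkx as nx

G = nx.DiGraph()
# Candidate head -> dependent edges with scores (higher = more likely).
G.add_weighted_edges_from([
    ("ROOT", "chased", 10),
    ("chased", "cat", 8), ("chased", "mouse", 7),
    ("cat", "The", 5), ("mouse", "the", 5),
    ("mouse", "cat", 1), ("cat", "mouse", 1),  # low-scoring competitors
])

# Chu-Liu/Edmonds: the maximum-weight arborescence is the dependency tree.
tree = nx.maximum_spanning_arborescence(G, attr="weight")
for head, dep in sorted(tree.edges()):
    print(f"{dep} <- {head}")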
