NLP Unit II Notes
Prepared by
K SWAYAMPRABHA
Assistant Professor
UNIT - II
Syntax Analysis:
Parsing Algorithms
1 Shift-Reduce Parsing
2 Hypergraphs and Chart Parsing
3 Minimum Spanning Trees and Dependency Parsing
1. Rule-based systems,
2. Statistical models, and
TreeBank
For example, a simple sentence like "The cat chased the mouse" might be
represented as a tree in which "the cat" and "the mouse" are noun phrases,
"chased" is a verb, and the two occurrences of "the" are determiners.
Treebanks are created by linguists and other experts who manually annotate
the sentences with their syntactic structures. The process of creating a
treebank is time-consuming and requires a lot of expertise, but once a
treebank has been created, it can be used to train machine learning
algorithms to automatically parse new sentences.
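For instance, assuming the nltk library is available, a treebank-style analysis of the example sentence can be written down and inspected in Python as follows (the bracketed tree is typed by hand here, whereas a real treebank such as the Penn Treebank contains thousands of manually annotated trees):

from nltk import Tree

# A hand-written, treebank-style bracketed analysis of the example sentence
t = Tree.fromstring(
    "(S (NP (DT The) (NN cat)) (VP (VBD chased) (NP (DT the) (NN mouse))))"
)
t.pretty_print()     # draws the tree as ASCII art
print(t.leaves())    # ['The', 'cat', 'chased', 'the', 'mouse']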
In this example, the words "cat" and "mouse" are connected to the verb
"chased" by directed edges that indicate their syntactic relationships.
Specifically, "cat" is the subject of the verb and "mouse" is its direct object,
while "chased" is the head or root of the sentence. The determiners "the"
preceding both "cat" and "mouse" are also included in the dependency graph
as dependents of their respective nouns. This dependency graph captures
the syntactic structure of the sentence and can be used to perform a variety
of syntactic analysis tasks.
Here is an example of a phrase structure tree for the sentence "The cat
chased the mouse":
        S
        |
   +----+----+
   |         |
   NP        VP
   |         |
 +-+-+     +-+--+
 |   |     |    |
DET  N     V    NP
 |   |     |    |
the cat chased  the mouse
This phrase structure tree captures the syntactic structure of the sentence
and can be used to perform a variety of syntactic analysis tasks, including
parsing, translation, and text-to-speech synthesis.
Parsing Algorithms
A shift-reduce parser reads the input from left to right and uses a stack
together with two operations. Shift moves the next input token onto the stack,
while reduce applies a grammar rule to replace the symbols on top of the stack
with the rule's non-terminal symbol. The parser keeps applying these operations
until it reaches the end of the input string and the stack contains only the
start symbol of the grammar.
S -> E
E -> E + T | T
T -> T * F | F
F -> ( E ) | id
And we want to parse the input string "id * ( id + id )". We can use a shift-
reduce parser to build a parse tree for this string as follows:
1. Start with an empty stack and the input string "id * ( id + id )".
2. Shift the first token "id" onto the stack.
3. Reduce the top of the stack to "F" using the rule "F -> id".
4. Reduce the top of the stack to "T" using the rule "T -> F".
5. Shift the next token "*" onto the stack.
6. Shift the next token "(" onto the stack.
7. Shift the next token "id" onto the stack.
8. Reduce the top of the stack to "F" using the rule "F -> id".
9. Reduce the top of the stack to "T" using the rule "T -> F".
10. Reduce the top of the stack to "E" using the rule "E -> T".
11. Shift the next token "+" onto the stack.
12. Shift the next token "id" onto the stack.
13. Reduce the top of the stack to "F" using the rule "F -> id".
14. Reduce the top of the stack to "T" using the rule "T -> F".
15. Reduce "E + T" on top of the stack to "E" using the rule "E -> E + T".
16. Shift the next token ")" onto the stack.
17. Reduce "( E )" on top of the stack to "F" using the rule "F -> ( E )".
18. Reduce "T * F" on top of the stack to "T" using the rule "T -> T * F".
19. Reduce the top of the stack to "E" using the rule "E -> T".
20. Reduce the top of the stack to "S" using the rule "S -> E".
21. The stack now contains only the start symbol "S", indicating that the
input string has been successfully parsed.
The corresponding parse tree is:

S
|
E
|
T
|
+------+------+
|      |      |
T      *      F
|             |
F        +----+----+
|        |    |    |
id       (    E    )
              |
        +-----+-----+
        |     |     |
        E     +     T
        |           |
        T           F
        |           |
        F           id
        |
        id
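The same process can be sketched in code. The following Python sketch is a simple shift-reduce recognizer for the toy grammar above; a real parser consults an LR parse table to decide between shifting and reducing, whereas here two small lookahead checks are hard-coded so that this particular example string parses:

GRAMMAR = [            # (left-hand side, right-hand side)
    ("S", ["E"]),
    ("E", ["E", "+", "T"]),
    ("E", ["T"]),
    ("T", ["T", "*", "F"]),
    ("T", ["F"]),
    ("F", ["(", "E", ")"]),
    ("F", ["id"]),
]

def shift_reduce_parse(tokens):
    stack, pos = [], 0
    while True:
        # Apply every reduction whose right-hand side matches the top of the stack.
        reduced = True
        while reduced:
            reduced = False
            for lhs, rhs in GRAMMAR:
                if stack[-len(rhs):] == rhs:
                    # Lookahead checks: delay E -> T when '*' follows, and delay
                    # S -> E until the whole input has been consumed.
                    if lhs == "E" and rhs == ["T"] and pos < len(tokens) and tokens[pos] == "*":
                        continue
                    if lhs == "S" and pos < len(tokens):
                        continue
                    del stack[-len(rhs):]
                    stack.append(lhs)
                    print("reduce", lhs, "->", " ".join(rhs), "  stack:", stack)
                    reduced = True
                    break
        if pos == len(tokens):
            return stack == ["S"]        # success iff only the start symbol remains
        stack.append(tokens[pos])        # shift the next token
        print("shift", tokens[pos], "  stack:", stack)
        pos += 1

print(shift_reduce_parse("id * ( id + id )".split()))   # True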
One common type of chart parsing algorithm is the Earley parser, which
works top-down, predicting constituents from the start symbol and confirming
them against the input. Another common algorithm is the CYK parser, which
works bottom-up and requires the context-free grammar to be in Chomsky
normal form.
In this hypergraph, the words "the", "cat", "sat", "on", and "mat" are
represented as nodes, and the hyperedges represent the grammatical
relationships between those words. For example, the hyperedge connecting
"cat" and "sat" represents the fact that "cat" is the subject of the verb "sat".
We can then use chart parsing to build a chart that represents the different
possible syntactic and semantic structures for this sentence. Each cell in the
chart represents a combination of words or phrases, and the chart is filled in
with possible structures based on a set of grammar rules. Here's an example
chart for this sentence:
Span     Category   Words
[1,1]    D          the
[2,2]    N          cat
[3,3]    V          sat
[4,4]    P          on
[5,5]    D          the
[6,6]    N          mat
[1,2]    NP         the cat
[5,6]    NP         the mat
[4,6]    PP         on the mat
[3,6]    VP         sat on the mat
[1,6]    S          the cat sat on the mat
In this chart, each entry is indexed by the start and end positions of a span
of words, and records the categories that the grammar allows over that span
(D for determiner, N for noun, V for verb, P for preposition, plus the phrase
categories NP, PP, VP, and S). Single-word spans hold the parts of speech,
while longer spans hold the phrases built from them. For example, the entry
for span [2,2] represents the noun "cat", the entry for [4,6] represents the
prepositional phrase "on the mat", and the entry for [1,6] shows that the
entire word sequence can be analysed as a sentence S.
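The CYK algorithm mentioned above fills exactly this kind of span-based chart bottom-up. Here is a minimal Python sketch of a CYK recognizer, using a small hand-written grammar in Chomsky normal form for this sentence (the grammar itself is an illustrative assumption):

from itertools import product

UNARY = {    # lexical rules: word -> categories that can rewrite to it
    "the": {"D"}, "cat": {"N"}, "mat": {"N"}, "sat": {"V"}, "on": {"P"},
}
BINARY = {   # binary rules in Chomsky normal form: (B, C) combine into A
    ("D", "N"): {"NP"},
    ("P", "NP"): {"PP"},
    ("V", "PP"): {"VP"},
    ("NP", "VP"): {"S"},
}

def cyk(words):
    n = len(words)
    # chart[i][j] holds the categories that can span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(UNARY.get(w, ()))
    for span in range(2, n + 1):            # span length
        for i in range(n - span + 1):       # start position
            j = i + span
            for k in range(i + 1, j):       # split point
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY.get((b, c), set())
    return "S" in chart[0][n]

print(cyk("the cat sat on the mat".split()))   # True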
Consider the sentence "John gave Mary a book". We can use dependency
parsing to identify the syntactic dependencies between the words in the
sentence. The resulting dependency tree has "gave" as its root, with edges:
gave -> John   (nsubj)
gave -> Mary   (iobj)
gave -> book   (dobj)
book -> a      (det)
In this dependency tree, the nodes represent the words in the sentence, and
the edges represent the syntactic dependencies between those words. For
example, the "nsubj" edge connects "John" to "gave" and represents the fact
that "John" is the subject of the verb "gave". The "dobj" edge connects
"book" to "gave" and represents the fact that "book" is the direct object of
the verb "gave".
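A dependency parse like this can be produced automatically by an off-the-shelf parser. For example, with the spaCy library (assuming the small English model "en_core_web_sm" is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John gave Mary a book")

for token in doc:
    # each token reports its dependency label and the head it attaches to
    print(f"{token.text:<5} --{token.dep_}--> {token.head.text}")

# Typical output (exact labels can vary with the model version):
# John  --nsubj--> gave
# gave  --ROOT--> gave
# Mary  --dative--> gave
# a     --det--> book
# book  --dobj--> gave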
We can then build a graph that contains a candidate edge between every pair
of words, weight each edge by how plausible that head-dependent relation is,
and use an MST algorithm to find the spanning tree with the best total score
(equivalently, the minimum total weight once the scores are negated). The
resulting tree represents the most likely grammatical structure for the
sentence.
In this MST, the nodes still represent the words in the sentence, but the
edges represent the most likely grammatical relationships between those
words. For example, the "nsubj" and "dobj" edges are the same as in the
original dependency tree, but the "det" edge connecting "a" to "book"
represents the fact that "a" is a determiner for "book".
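Here is a deliberately simplified Python sketch of the idea: every word simply picks its highest-scoring head. The edge scores below are made-up numbers, and a real MST parser would run the Chu-Liu/Edmonds algorithm over such scores to guarantee that the selected edges form a single well-formed tree:

words = ["ROOT", "John", "gave", "Mary", "a", "book"]

# scores[(head, dependent)] = plausibility of that edge (hypothetical values)
scores = {
    ("ROOT", "gave"): 10,
    ("gave", "John"): 9,   # nsubj
    ("gave", "Mary"): 7,   # indirect object
    ("gave", "book"): 8,   # dobj
    ("book", "a"): 6,      # det
}

def greedy_heads(words, scores):
    heads = {}
    for dep in words[1:]:
        # choose the head with the highest edge score for this dependent
        heads[dep] = max(words, key=lambda h: scores.get((h, dep), float("-inf")))
    return heads

print(greedy_heads(words, scores))
# {'John': 'gave', 'gave': 'ROOT', 'Mary': 'gave', 'a': 'book', 'book': 'gave'}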
By analyzing the grammatical structure of sentences with minimum spanning
trees and dependency parsing, we can gain insights into the meaning and
structure of natural language text. These techniques are widely used in
applications such as machine translation, sentiment analysis, and text
classification.
S -> E [1.0]
E -> E + E [0.3]
E -> E - E [0.2]
E -> E * E [0.15]
E -> E / E [0.1]
E -> ( E ) [0.05]
E -> num [0.2]
In this grammar, S is the start symbol and E represents an arithmetic
expression. The production rules for E indicate that an arithmetic expression
can be generated by adding two expressions with probability 0.3, subtracting
two expressions with probability 0.2, multiplying two expressions with
probability 0.15, dividing two expressions with probability 0.1, enclosing an
expression in parentheses with probability 0.05, or consisting of a single
number (num) with probability 0.2. Note that the probabilities of all rules
sharing the same left-hand side sum to 1, so the grammar defines a proper
probability distribution over derivations.
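As a small worked example (using the probabilities above), the probability of a derivation is the product of the probabilities of the rules it uses. For the string "num + num":

# S -> E (1.0), E -> E + E (0.3), E -> num (0.2), E -> num (0.2)
p = 1.0 * 0.3 * 0.2 * 0.2
print(round(p, 4))   # 0.012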
We can also learn a PCFG from a corpus of example sentences. Suppose we
have a small training corpus of simple sentences about cats, dogs, and birds
(for example, sentences like "the cat sat", "the dog chased the cat", and
"the bird flew"). Here is a sample PCFG that could be estimated from such a
corpus:
S -> NP VP [1.0]
NP -> Det N [0.67] | N [0.33]
VP -> V NP [0.67] | V [0.33]
Det -> the [1.0]
N -> cat [0.33] | dog [0.33] | bird [0.33]
V -> sat [0.33] | chased [0.33] | flew [0.33]
This grammar allows us to generate new sentences that are similar to the
sentences in the training corpus. For example, we can use the following
parse tree to generate the sentence "The cat chased the bird":
        S
        |
   +----+----+
   |         |
   NP        VP
   |         |
 +-+-+     +-+--+
 |   |     |    |
Det  N     V    NP
 |   |     |    |
the cat chased +-+-+
               |   |
              Det  N
               |   |
              the bird
To generate this sentence, we start with the S symbol and apply the
production rule S -> NP VP. We then randomly choose expansions of NP and
VP according to their probabilities; here we choose NP -> Det N for the
subject, VP -> V NP for the verb phrase, and NP -> Det N again for the
object. Finally, we choose the expansions of Det, N, and V according to their
probabilities and combine the resulting words to get the sentence "The cat
chased the bird".
We can generate other sentences in the same way, by randomly choosing
expansion rules based on their probabilities. Note that this approach allows
us to generate sentences that may not have appeared in the training corpus,
but are still grammatically correct according to the PCFG.
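The following Python sketch implements exactly this generation procedure for the toy PCFG above, choosing each expansion at random according to the rule probabilities:

import random

PCFG = {
    "S":   [(["NP", "VP"], 1.0)],
    "NP":  [(["Det", "N"], 0.67), (["N"], 0.33)],
    "VP":  [(["V", "NP"], 0.67), (["V"], 0.33)],
    "Det": [(["the"], 1.0)],
    "N":   [(["cat"], 0.33), (["dog"], 0.33), (["bird"], 0.33)],
    "V":   [(["sat"], 0.33), (["chased"], 0.33), (["flew"], 0.33)],
}

def generate(symbol="S"):
    if symbol not in PCFG:                 # terminal symbol: return the word itself
        return [symbol]
    expansions = [rhs for rhs, _ in PCFG[symbol]]
    weights = [p for _, p in PCFG[symbol]]
    rhs = random.choices(expansions, weights=weights)[0]
    words = []
    for sym in rhs:
        words.extend(generate(sym))
    return words

for _ in range(3):
    print(" ".join(generate()))   # e.g. "the cat chased the bird"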
We can extract features from the input sentence (for example, the words
themselves, their part-of-speech tags, and simple combinations of
neighbouring words and tags) and use them as input to a linear SVM, which
learns to predict the correct parse tree from these features. The SVM is
trained on a set of annotated sentences, where each sentence is represented
by its features together with its correct parse tree.
During testing, the SVM scores candidate parse trees for a given input
sentence by computing a weighted sum of their features, and the highest-scoring
tree is selected as the prediction (for a simple yes/no decision, a threshold
can instead be applied to this score). The predicted parse tree can then be
converted into a more readable format, such as a bracketed string.
Here is an example parse tree that could be predicted by the SVM for
the input sentence "The cat sat on the mat":
(S
(NP (DT The) (NN cat))
(VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat))))
(. .))
Note that discriminative models can be trained on a variety of feature sets,
including hand-crafted features as shown in this example, or features
learned automatically from the input data using techniques such as neural
networks. Discriminative models can also incorporate additional information,
such as lexical semantic knowledge or discourse context, to improve their
accuracy.
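As a rough illustration of the discriminative idea, the following Python sketch scores candidate parse trees with a hand-specified linear model and picks the best one; the features and weights are illustrative assumptions, not the output of a real trained SVM:

def features(parse):
    # parse is a bracketed string; the features just count constituent labels
    return {"num_NP": parse.count("(NP"),
            "num_VP": parse.count("(VP"),
            "num_PP": parse.count("(PP")}

WEIGHTS = {"num_NP": 0.5, "num_VP": 0.8, "num_PP": 1.2}   # hypothetical weights

def score(parse):
    return sum(WEIGHTS[name] * value for name, value in features(parse).items())

candidates = [
    # analysis with the PP grouped under the VP
    "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))",
    # flat analysis with no PP constituent
    "(S (NP (DT The) (NN cat)) (VP (VBD sat) (IN on) (NP (DT the) (NN mat))))",
]
print(max(candidates, key=score))   # the first (PP) analysis scores higher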
Other encoding schemes include word embeddings, which map each token to
a low-dimensional vector that captures its semantic and syntactic properties,
and character-level encodings, which represent each character in a token as
a separate feature.
Overall, tokenization, case, and encoding are critical preprocessing steps
that help transform raw text data into a format that can be effectively
analyzed and modeled.
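The following Python sketch shows these steps on a toy scale: tokenization, case folding, and a simple integer encoding of the vocabulary (real pipelines typically use trained tokenizers and embedding lookups instead):

import re

def preprocess(text):
    # split into word and punctuation tokens, lowercasing everything
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # assign each distinct token a small integer id
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
    ids = [vocab[tok] for tok in tokens]
    return tokens, ids

print(preprocess("The cat sat on the mat."))
# (['the', 'cat', 'sat', 'on', 'the', 'mat', '.'], [5, 1, 4, 3, 5, 2, 0])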
2. Word Segmentation
Word segmentation refers to the process of identifying individual words in a
piece of text, especially in languages where words are not explicitly
separated by spaces or punctuation marks. Word segmentation is an
important task in natural language processing and can be challenging in
languages such as Chinese, Japanese, and Thai.
For example, the Chinese sentence 我爱北京天安门。 ("I love Beijing Tiananmen.")
is written without spaces between the words; a segmenter has to split it into
我 / 爱 / 北京 / 天安门 before further processing.
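A classic baseline for word segmentation is greedy longest-match ("maximum matching") against a dictionary. The following Python sketch illustrates it with a tiny hand-made dictionary (an illustrative assumption); practical segmenters use statistical or neural models:

DICTIONARY = {"我", "爱", "北京", "天安门"}   # I, love, Beijing, Tiananmen

def max_match(text, dictionary, max_len=4):
    words, i = [], 0
    while i < len(text):
        # try the longest dictionary entry starting at position i
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary or length == 1:
                words.append(candidate)
                i += length
                break
    return words

print(max_match("我爱北京天安门。", DICTIONARY))
# ['我', '爱', '北京', '天安门', '。']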
3. Morphology
Morphology refers to the study of the structure of words and the rules that
govern the formation of words from smaller units known as morphemes.
Morphemes are the smallest units of meaning in a language and can be
either free (can stand alone as words) or bound (must be attached to other
morphemes to form words).
For example, consider the word "unhappily." This word consists of three
morphemes: the prefix "un-" (meaning "not"), the root "happy", and the suffix
"-ly" (which turns an adjective into an adverb).
Each of these morphemes has a specific meaning and function, and their
combination in the word "unhappily" changes the meaning and grammatical
function of the root word "happy."
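The following Python sketch illustrates this kind of analysis by simple affix stripping; the affix lists and the spelling-repair rule are deliberate simplifications of what real morphological analyzers (for example, finite-state analyzers) do:

PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ily", "ly", "ness", "ing", "ed"]

def analyze(word):
    morphemes = []
    for prefix in PREFIXES:
        if word.startswith(prefix):
            morphemes.append(prefix + "-")
            word = word[len(prefix):]
            break
    suffix = next((s for s in SUFFIXES if word.endswith(s)), None)
    if suffix:
        word = word[:-len(suffix)]
        if suffix == "ily":          # undo the y -> i spelling change
            word, suffix = word + "y", "ly"
        morphemes += [word, "-" + suffix]
    else:
        morphemes.append(word)
    return morphemes

print(analyze("unhappily"))   # ['un-', 'happy', '-ly']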
1. Stemming: the process of reducing a word to its root or stem form, which
can help reduce the number of unique words in a text corpus and improve
efficiency in language modeling and information retrieval systems.
2. Morphological analysis: the process of breaking down words into their
constituent morphemes, which can help identify word meanings and
relationships, as well as identify errors or inconsistencies in text data.
3. Morphological generation: the process of creating new words from existing
morphemes, which can be useful in natural language generation tasks such
as machine translation or text summarization.