
Lexical Analysis and Parsing

Irfan Rasool
Language Processing System
Compiler Overview and Structural Phases of Compiler

1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. Intermediate Code Generation
5. Code Optimization
6. Code Generation
7. Linking and Assembly
Lexical Analysis
● Lexical Analysis is the first phase of the compiler, where the source code written in a
high-level programming language is converted into a sequence of tokens.
● The primary purpose of lexical analysis is to break down the input code into smaller,
manageable components that can be used by the parser for syntax analysis.
● This phase is responsible for reading the characters of the input program and grouping
them into meaningful sequences, or tokens, such as keywords, operators, identifiers,
constants, and punctuation.
● Identifiers among these tokens are recorded in the symbol table.
Key Functions of Lexical Analysis
1. Tokenization:
○ The main role of lexical analysis is to divide the input code into tokens.
○ A token is a sequence of characters that represents a basic unit in the source code, such
as:
■ Keywords (if, for, while, int, return)
■ Identifiers (variable names like x, sum)
■ Operators (+, -, *, /, ==)
■ Literals (constants like 123, 3.14, 'a')
■ Punctuation (brackets (), semicolons ;)
2. Removing Whitespaces and Comments:
○ Lexical analysis also eliminates unnecessary whitespaces and comments from the code,
which are not needed for parsing but help in making the code more readable.
Key Functions of Lexical Analysis
3. Error Handling:
● During lexical analysis, if an invalid token or sequence of characters is encountered, an error is
generated and handled by the lexer (the tool performing lexical analysis).
● Errors may include unrecognized characters or malformed tokens.

4. Symbol Table Generation:

● The lexical analyzer may generate or update a symbol table, which stores information about
identifiers (such as variable names) and their associated data (type, memory address, etc.).
● This table is crucial for semantic analysis and code generation phases later in the compilation
process.
What is a Symbol Table?
● A symbol table is one of the most important data structures within a compiler, where all the identifiers
used in a program are stored along with their type, scope, and memory locations.
● The symbol table is built during the earliest stages of compilation, primarily by the lexical and
syntax analysis phases.
● It helps ensure that every identifier is used correctly according to the language’s rules, supporting
efficient code generation and error checking.
● The information is collected by the analysis phases of the compiler and is used by the synthesis phases
of the compiler to generate code.
● It is used by the compiler to achieve compile-time efficiency.

It is used by the various phases of the compiler as follows:

● Lexical Analysis: Creates new entries in the table, for example, entries for tokens such as identifiers.
● Syntax Analysis: Adds information regarding attribute type, scope, dimension, line of reference, use, etc
in the table.
● Semantic Analysis: Uses available information in the table to check for semantics i.e. to verify that
expressions and assignments are semantically correct(type checking) and update it accordingly.
● Intermediate Code Generation: Refers to the symbol table to know how much and what type of run-time
storage is allocated, and uses it to record temporary variable information.
● Code Optimization: Uses information present in the symbol table for machine-dependent optimization.
● Target Code generation: Generates code by using the address information of the identifier present in the
table.
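As a concrete sketch, a symbol table can be modeled as a stack of per-scope dictionaries searched innermost-first. This is a minimal illustration (the class name, attribute scheme, and scope handling are assumptions of this sketch, not the layout of any particular compiler):

```python
# Minimal symbol-table sketch: a stack of dictionaries, one per scope.
# Lookups search from the innermost scope outward.
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                      # global scope

    def enter_scope(self):
        self.scopes.append({})                  # e.g. on entering a function body

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, **attrs):           # attrs: type, size, line, ...
        if name in self.scopes[-1]:
            raise KeyError(f'redeclaration of {name!r}')
        self.scopes[-1][name] = attrs

    def lookup(self, name):
        for scope in reversed(self.scopes):     # innermost declaration wins
            if name in scope:
                return scope[name]
        return None                             # undeclared identifier

st = SymbolTable()
st.declare('sum', type='int', line=1)
st.enter_scope()
st.declare('sum', type='float')                 # shadows the outer 'sum'
print(st.lookup('sum'))                         # inner entry
st.exit_scope()
print(st.lookup('sum'))                         # outer entry again
```

In this design, semantic analysis would consult `lookup` for type checking, and later phases would read the stored attributes (type, address, etc.) from the same entries.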
Process of Lexical Analysis
1. Scanning:
○ The source code is scanned character by character by the lexical analyzer.
○ The scanner uses a lookahead mechanism to identify the next token without consuming the
input.
○ This process is facilitated by regular expressions, which define the patterns for valid tokens.
2. Matching Tokens:
○ Each sequence of characters is matched against predefined token patterns.
○ These patterns are defined using regular expressions, and each successful match
corresponds to a valid token.

Example:
● if matches the regular expression for keywords.
● x matches the pattern for identifiers.
● 3.14 matches the pattern for floating-point literals.
Process of Lexical Analysis
3. Token Generation: Once a sequence of characters is recognized as a valid token, the
lexical analyzer outputs the token in the form of a pair: (token type, token value).

For example:
● if → (Keyword, "if")
● x → (Identifier, "x")
● 3.14 → (FloatLiteral, "3.14")

4. State Transition:

● The lexical analyzer operates like a finite state machine (FSM), where each state
corresponds to the partial recognition of a token.
● As the input characters are read, the FSM transitions between states until it reaches a
final state that corresponds to a valid token.
Example of Lexical Analysis
Code:
int sum = 0;
sum = sum + 10;
The lexical analyzer will produce the following tokens:

1. int → (Keyword, "int")
2. sum → (Identifier, "sum")
3. = → (AssignmentOperator, "=")
4. 0 → (IntLiteral, "0")
5. ; → (Punctuation, ";")
6. sum → (Identifier, "sum")
7. = → (AssignmentOperator, "=")
8. sum → (Identifier, "sum")
9. + → (Operator, "+")
10. 10 → (IntLiteral, "10")
11. ; → (Punctuation, ";")
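The token stream above can be reproduced with a small regex-based lexer. This is a sketch: the token names mirror the slide, and the pattern set covers only the constructs in this example, not a full C lexer:

```python
import re

# Token patterns, tried left to right at each position.  Keywords come
# before identifiers so that "int" is not classified as an Identifier.
TOKEN_SPEC = [
    ('Keyword',            r'\b(?:int|if|for|while|return)\b'),
    ('Identifier',         r'[A-Za-z_][A-Za-z0-9_]*'),
    ('FloatLiteral',       r'\d+\.\d+'),
    ('IntLiteral',         r'\d+'),
    ('Operator',           r'==|[+\-*/]'),      # '==' before '=' (longest match)
    ('AssignmentOperator', r'='),
    ('Punctuation',        r'[;(){}]'),
    ('Whitespace',         r'\s+'),
]
MASTER = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in TOKEN_SPEC))

def tokenize(code):
    tokens = []
    for m in MASTER.finditer(code):
        kind = m.lastgroup
        if kind != 'Whitespace':                # whitespace is discarded here
            tokens.append((kind, m.group()))
    return tokens

for tok in tokenize('int sum = 0;\nsum = sum + 10;'):
    print(tok)
```

Running this prints the eleven (token type, token value) pairs listed above, starting with `('Keyword', 'int')` and ending with `('Punctuation', ';')`.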
Tools Used for Lexical Analysis
Lex/Flex:

● Lexical analyzers like Lex (or its variant Flex) are commonly used tools to generate
lexical analyzers automatically from a set of regular expressions.
● The programmer provides patterns in the form of regular expressions, and Lex
generates the corresponding code to perform lexical analysis.

Finite Automata:

● The lexical analyzer uses finite automata to recognize tokens.
● A deterministic finite automaton (DFA) is usually used, which efficiently recognizes
tokens by transitioning through states based on input characters.
Finite State Automata
● An FSA is an acceptor or recognizer of regular
languages. An FSA is a 5-tuple, (Q, Σ, δ, q0, F ),
where
○ Q is a finite set of states
○ Σ is the input alphabet
○ δ is the transition function, δ : Q × Σ → Q;
that is, δ(q, a) is a state for each state q
and input symbol a
○ q0 ∈ Q is the start state
○ F is the set of final or accepting states
● In one move from some state q, an FSA reads
an input symbol, changes the state based on
δ, and gets ready to read the next input
symbol
● An FSA accepts its input string, if starting from
q0, it consumes the entire input string, and
reaches a final state
● If the last state reached is not a final state,
then the input string is rejected.
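The definition can be exercised with a toy DFA. The sketch below hardcodes a two-state recognizer for identifiers ([a-zA-Z_][a-zA-Z0-9_]*); the state names and the character-class encoding of Σ are assumptions made for readability:

```python
# DFA for identifiers: Q = {q0, q1}, start state q0, F = {q1}.
# The input alphabet is abstracted into character classes.
def char_class(ch):
    if ch.isalpha() or ch == '_':
        return 'letter'
    if ch.isdigit():
        return 'digit'
    return 'other'

DELTA = {                       # transition function δ : Q × Σ → Q
    ('q0', 'letter'): 'q1',
    ('q1', 'letter'): 'q1',
    ('q1', 'digit'):  'q1',
}
FINALS = {'q1'}

def accepts(s):
    q = 'q0'
    for ch in s:
        q = DELTA.get((q, char_class(ch)))
        if q is None:           # missing transition: reject immediately
            return False
    return q in FINALS          # accept iff the last state is final

print(accepts('_tmp1'))         # True: letter followed by letters/digits
print(accepts('9lives'))        # False: no transition from q0 on a digit
```

Note how rejection happens both ways described above: either no transition exists for the current input symbol, or the input is exhausted in a non-final state (as for the empty string here).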
Regular Expressions and Lexical Analysis
● Regular expressions play a key role in defining the patterns for valid tokens. For
instance:
○ An identifier might be defined as [a-zA-Z_][a-zA-Z0-9_]*, meaning it starts
with a letter or underscore and is followed by letters, digits, or underscores.
○ A number might be defined as [0-9]+, meaning a sequence of digits.
● The lexical analyzer matches the input code against these regular expressions to
classify tokens correctly.
Examples of Regular Expressions
NFA Construction for r = (a+b)*c
Transition Diagrams
Challenges in Lexical Analysis
● Ambiguities: Sometimes, a sequence of characters can match more than one token
type. For example, the string "if" could be recognized as both an identifier and a
keyword. The lexical analyzer resolves this ambiguity by giving precedence to
keywords.
● Longer Token Recognition: The lexical analyzer often prioritizes recognizing the longest
valid token. For example, if >= is a valid token (greater than or equal to), it should be
recognized as a single token rather than two separate tokens > and =.
● Handling Errors: If the lexical analyzer encounters an invalid sequence of characters, it
must generate an appropriate error message and possibly attempt to recover from the
error to continue processing the input code.
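The longest-match rule can be demonstrated with a naive maximal-munch scanner that, at each position, tries every pattern and keeps the longest match. The token names and patterns here are invented for this illustration:

```python
import re

PATTERNS = [
    ('GE',     r'>='),          # must win over GT when both match
    ('GT',     r'>'),
    ('ASSIGN', r'='),
    ('ID',     r'[a-z]+'),
    ('WS',     r'\s+'),
]

def scan(s):
    out, i = [], 0
    while i < len(s):
        best = None
        for name, pat in PATTERNS:
            m = re.match(pat, s[i:])
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())        # keep the longest match
        if best is None:
            raise SyntaxError(f'unexpected character {s[i]!r} at position {i}')
        if best[0] != 'WS':
            out.append(best)
        i += len(best[1])
    return out

print(scan('a >= b'))   # [('ID', 'a'), ('GE', '>='), ('ID', 'b')]
```

Because `>=` is longer than `>`, the scanner emits a single GE token rather than GT followed by ASSIGN.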
Output of Lexical Analysis
● The output of lexical analysis is a stream of tokens.
● These tokens are passed on to the next phase of the compiler, which is syntax analysis
or parsing.
● The tokens form the building blocks for the grammar rules that the parser uses to
check the syntactic correctness of the source code.
Parser (Syntax Analysis)
● A parser is a component in a compiler or interpreter that processes the sequence of
tokens generated by the lexical analyzer and arranges them in a structured format,
typically a parse tree or syntax tree.
● The primary role of a parser is to determine if the given input follows the rules of a
formal grammar.
● This process is known as syntax analysis.
● The parser checks whether the tokens conform to the grammar of the programming
language and reports any syntax errors.
● If the input is syntactically correct, the parser builds a syntax tree, which represents the
hierarchical structure of the source code.
Grammars
● Every programming language has precise grammar rules that describe the
syntactic structure of well-formed programs
○ In C, the rules state how functions are made out of parameter lists,
declarations, and statements; how statements are made of
expressions, etc.
● Grammars are easy to understand, and parsers for programming languages can
be constructed automatically from certain classes of grammars
● Parsers or syntax analyzers are generated for a particular grammar
● Context-free grammars are usually used for syntax specification of
programming languages
What is Parsing or Syntax Analysis?
● A parser for a grammar of a programming language
○ verifies that the string of tokens for a program in that language can indeed be
generated from that grammar
○ reports any syntax errors in the program
○ constructs a parse tree representation of the program (not necessarily
explicit)
○ usually calls the lexical analyzer to supply a token to it when necessary
○ could be hand-written or automatically generated
○ is based on context-free grammars
● Grammars are generative mechanisms like regular expressions
● Pushdown automata are machines recognizing context-free languages (like
FSA for Regular Language)
Context-free Grammars
A CFG is denoted as G = (N, T, P, S)
N: Finite set of non-terminals
T : Finite set of terminals
S ∈ N: The start symbol
P: Finite set of productions, each of the form A → α, where
A ∈ N and α ∈ (N ∪ T )∗
Usually, only P is specified and the first production corresponds to that of the start
symbol
Examples
(1) E → E + E | E ∗ E | (E) | id
(2) S → 0S0 | 1S1 | 0 | 1 | ϵ
(3) S → aSb | ϵ
(4) S → aB | bA, A → a | aS | bAA, B → b | bS | aBB
Derivations
Context-free Languages
Derivation Trees
● Derivations can be displayed as trees
● The internal nodes of the tree are all nonterminals and the leaves are all terminals
● Corresponding to each internal node A, there exists a production ∈ P, with the RHS of
the production being the list of children of A, read from left to right
● The yield of a derivation tree is the list of the labels of all the leaves read from left to
right
● If α is the yield of some derivation tree for a grammar G, then S ⇒∗ α and conversely
Derivation Tree Example
Leftmost and Rightmost Derivations
● If at each step in a derivation, a production is applied to the leftmost nonterminal, then
the derivation is said to be leftmost. Similarly rightmost derivation.
● If w ∈ L(G) for some G, then w has at least one parse tree and corresponding to a
parse tree, w has unique leftmost and rightmost derivations
● If some word w in L(G) has two or more parse trees, then G is said to be ambiguous
● A CFL for which every grammar G generating it is ambiguous is said to be an inherently ambiguous CFL
Leftmost and Rightmost Derivations: An Example
Ambiguous and Unambiguous Grammar Examples
Ambiguous Grammar: A grammar is ambiguous if there exists more than one parse tree or
derivation for a single input string.

Unambiguous Grammar: A grammar is unambiguous if there is exactly one parse tree for
every valid input string.

The grammar E → E + E | E ∗ E | (E) | id is ambiguous, but the following grammar for the
same language is unambiguous:
E → E + T | T, T → T ∗ F | F, F → (E) | id

The grammar
stmt → IF expr stmt | IF expr stmt ELSE stmt | other_stmt
is ambiguous, but the following equivalent grammar is not:
stmt → IF expr stmt | IF expr matched_stmt ELSE stmt
matched_stmt → IF expr matched_stmt ELSE matched_stmt | other_stmt

The language L = {aⁿbⁿcᵐdᵐ | n, m ≥ 1} ∪ {aⁿbᵐcᵐdⁿ | n, m ≥ 1} is inherently ambiguous.
Ambiguity Example 1
Equivalent Unambiguous Grammar
Ambiguity Example 2
Ambiguity Example 2 (contd.)
Pushdown Automata
A PDA M is a system (Q, Σ, Γ, δ, q0, z0, F ), where
Q is a finite set of states
Σ is the input alphabet
Γ is the stack alphabet
q0 ∈ Q is the start state
z0 ∈ Γ is the start symbol on stack (initialization)
F ⊆ Q is the set of final states
δ is the transition function, δ : Q × (Σ ∪ {ϵ}) × Γ → finite subsets of Q × Γ∗
A typical entry of δ is given by
δ(q, a, z) = {(p1, γ1), (p2, γ2), ..., (pm, γm)}
The PDA in state q, with input symbol a and top-of-stack symbol z, can enter any of
the states pi , replace the symbol z by the string γi , and advance the input head by
one symbol.
Pushdown Automata Contd…
● The leftmost symbol of γi will be the new top of stack
● a in the above function δ could be ϵ, in which case, the input symbol is not
used and the input head is not advanced
● For a PDA M, we define L(M), the language accepted by
M by final state, to be
L(M) = {w | (q0, w, z0) ⊢∗ (p, ϵ, γ), for some p ∈ F and γ ∈ Γ∗}
● We define N(M), the language accepted by M by empty stack, to be
N(M) = {w | (q0, w, z0) ⊢∗ (p, ϵ, ϵ), for some p ∈ Q}
● When acceptance is by empty stack, the set of final states is irrelevant, and
usually we set F = ∅
PDA - Examples
PDA - Examples Contd…
Nondeterministic and Deterministic PDA
● Just as in the case of NFA and DFA, PDA also have two versions: NPDA and
DPDA
● However, NPDA are strictly more powerful than the DPDA
● For example, the language L = {ww^R | w ∈ {a, b}+} can be recognized only by
an NPDA and not by any DPDA
● On the other hand, the language L = {wcw^R | w ∈ {a, b}+} can be recognized
by a DPDA
● In practice we prefer DPDA, since they have exactly one possible move at any
instant
● Our parsers are all DPDA
Parsing
● Parsing is the process of constructing a parse tree for a sentence generated by a given
grammar
● If there are no restrictions on the language and the form of grammar used, parsers for
context-free languages require O(n³) time (n being the length of the string parsed)
■ Cocke-Younger-Kasami’s algorithm
■ Earley’s algorithm
● Subsets of context-free languages typically require O(n) time
■ Predictive parsing using LL(1) grammars (top-down parsing method)
■ Shift-Reduce parsing using LR(1) grammars (bottom-up parsing method)
Types of Parsing
Pushdown Automata in Parsing
● Pushdown Automata (PDA) are computational models used for recognizing
context-free languages (CFLs), which are crucial in parsing programming languages.
● PDAs are an extension of finite automata that include a stack as an additional memory
structure.
● The stack allows PDAs to handle recursive patterns and nested structures, which are
common in programming languages and make them more powerful than finite
automata.
Pushdown Automata in Parsing
When used in parsing, Pushdown Automata play a fundamental role in recognizing and processing
the hierarchical structure of context-free grammars. They form the theoretical foundation for
top-down and bottom-up parsers like LL and LR parsers.

Key Features of a Pushdown Automaton:


● Input tape: Contains the string of tokens to be parsed (e.g., a source code file or an
expression).
● Finite control: The set of states that represent the current situation in parsing.
● Stack: Provides memory to handle nested or recursive structures (e.g., matching
parentheses, function calls).

How PDA Works in Parsing:

● Input consumption: The PDA reads the next symbol (or token) from the input.
● Stack manipulation: Based on the current state, input, and stack contents, the PDA either
pushes or pops symbols from the stack.
● State transitions: The automaton moves between states based on the current input and stack
contents, allowing it to recognize whether a string belongs to the language defined by the
context-free grammar.
PDAs in Top-Down Parsing
In Top-down parsing (like LL parsers), a PDA simulates the recursive descent by predicting the next
symbol and comparing it with the input. The stack stores non-terminals that are expanded according to the
grammar rules.

Example of how a PDA works for a simple grammar:

S → aSb | ε

● Input string: aabb
● The PDA pushes S onto the stack and expands it to aSb.
● It matches the input tokens with the right-hand side of the production, pushing and popping symbols
from the stack as needed to ensure the input conforms to the grammar.
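For this grammar, one token of lookahead makes the PDA's choices deterministic, so the expand/match behaviour can be simulated with an explicit stack. The small prediction table (expand S → aSb on a, S → ε otherwise) is the assumed encoding of this sketch:

```python
# PDA-style simulation for S -> aSb | ε.
# The stack holds grammar symbols; '$' marks the stack bottom / end of input.
TABLE = {
    ('S', 'a'): ['a', 'S', 'b'],    # predict S -> aSb
    ('S', 'b'): [],                 # predict S -> ε
    ('S', '$'): [],                 # predict S -> ε
}

def pda_accepts(inp):
    tokens = list(inp) + ['$']
    stack = ['$', 'S']              # bottom marker, then the start symbol
    i = 0
    while stack:
        top = stack.pop()
        look = tokens[i]
        if top == look == '$':
            return True             # input and stack exhausted together
        if top == 'S':
            rhs = TABLE.get(('S', look))
            if rhs is None:
                return False
            stack.extend(reversed(rhs))   # push RHS, leftmost symbol on top
        elif top == look:
            i += 1                  # match a terminal against the input
        else:
            return False
    return False

print(pda_accepts('aabb'))   # True
print(pda_accepts('aab'))    # False: an unmatched 'b' remains on the stack
```

Each expansion pushes a right-hand side and each match pops a terminal, which is exactly the push/pop discipline described above.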
PDAs in Bottom-Up Parsing:
In Bottom-up parsing (like LR parsers), the PDA works by shifting input tokens onto the stack
and reducing them to non-terminals based on the grammar rules. The PDA moves from the leaves
(tokens) to the root (start symbol) of the parse tree.

● The stack represents partially processed input.
● The automaton shifts tokens until it can apply a reduction based on the grammar.
● Reductions occur by popping items from the stack and replacing them with a non-terminal,
gradually building up to the start symbol.
Types of Parsing
Parsers are generally classified into two main categories based on the method of parsing:

● Top-Down Parsers
● Bottom-Up Parsers

1. Top-Down Parsers

● Top-down parsers construct the parse tree from the root (starting symbol) and attempt
to derive the input tokens by recursively expanding the non-terminal symbols.
● The parsing process proceeds from the top (starting symbol) of the grammar to the
bottom (terminals in the input).
Types of Top-Down Parsers
1.1 Recursive Descent Parser:
● A recursive descent parser is a type of top-down parser that uses recursive procedures to
process the input tokens.
● It directly implements the grammar rules by using a separate function for each non-terminal
in the grammar.
● Each function tries to match a part of the input with the corresponding grammar rule.
● Example: If you have a grammar rule E → T + E, a recursive function E() will call T() to
recognize T, and then look for the + operator and recursively call E() to match the remainder
of the input.
Advantages:
● Simple to implement for small grammars.
● Intuitive and easy to follow.
Disadvantages:
● Cannot handle left recursion, which occurs when a non-terminal symbol refers to itself on the
left-hand side of its production rule. For example, E → E + T is left-recursive and leads to
infinite recursion.
● Not suitable for large grammars.
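A recursive descent parser for the slide's rule E → T + E can be sketched with one method per nonterminal. T → int is assumed here to make the example self-contained, and tokens are whitespace-separated for simplicity:

```python
# Recursive descent for:  E -> T + E | T,   T -> int
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f'expected {tok!r}, got {self.peek()!r}')
        self.pos += 1
        return tok

    def E(self):                    # E -> T ('+' E)?
        left = self.T()
        if self.peek() == '+':
            self.eat('+')
            return ('+', left, self.E())   # recursive call parses the rest
        return left

    def T(self):                    # T -> int
        tok = self.peek()
        if tok is None or not tok.isdigit():
            raise SyntaxError(f'expected an integer, got {tok!r}')
        return int(self.eat(tok))

def parse(src):
    p = Parser(src.split())
    tree = p.E()
    if p.peek() is not None:
        raise SyntaxError('trailing input')
    return tree

print(parse('1 + 2 + 3'))   # ('+', 1, ('+', 2, 3)) -- right-associative
```

The right recursion in E → T + E makes + right-associative here; the left-recursive form E → E + T would send E() into infinite recursion, which is exactly the limitation of recursive descent noted above.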
Types of Top-Down Parsers
1.2 Predictive Parser (LL Parsers):

● A predictive parser is a type of top-down parser that uses lookahead to decide which
grammar rule to apply. It predicts which rule to use based on the next input token.
● The term LL parser stands for:
○ L: Left-to-right scanning of the input.
○ L: Leftmost derivation.
● LL(1) parsers are a subset of predictive parsers that use 1-token lookahead to make
parsing decisions.
LL(1) Parser
How LL(1) Works:

● The parser looks at the current input token and the top of the parsing stack.
● It uses a parsing table to decide which production to apply.
● It requires that the grammar be non-left-recursive and non-ambiguous.

Advantages:

● Efficient and fast as it requires only one lookahead token.
● Can handle large grammars.

Disadvantages:

● The grammar must be LL(1) compatible (no left recursion, no ambiguity).
● May not be able to parse complex or ambiguous grammars without manual transformation.
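The table-driven procedure can be sketched for the standard non-left-recursive expression grammar E → T E′, E′ → + T E′ | ε, T → id (E′ is written E2 below); the parsing table is hand-built for this illustration:

```python
# Hand-built LL(1) table for:  E -> T E2,  E2 -> + T E2 | ε,  T -> id
TABLE = {
    ('E',  'id'): ['T', 'E2'],
    ('E2', '+'):  ['+', 'T', 'E2'],
    ('E2', '$'):  [],               # ε-production when end of input follows
    ('T',  'id'): ['id'],
}
NONTERMS = {'E', 'E2', 'T'}

def ll1_parse(tokens):
    """Return the leftmost derivation, or raise SyntaxError."""
    tokens = tokens + ['$']
    stack = ['$', 'E']              # bottom marker, then the start symbol
    i, derivation = 0, []
    while True:
        top = stack.pop()
        look = tokens[i]
        if top == look == '$':
            return derivation       # accepted
        if top in NONTERMS:
            rhs = TABLE.get((top, look))
            if rhs is None:         # empty table slot: syntax error
                raise SyntaxError(f'no rule for ({top}, {look!r})')
            derivation.append(f'{top} -> {" ".join(rhs) or "ε"}')
            stack.extend(reversed(rhs))
        elif top == look:
            i += 1                  # match terminal, advance the lookahead
        else:
            raise SyntaxError(f'expected {top!r}, got {look!r}')

print(ll1_parse(['id', '+', 'id']))
```

Each loop iteration consults only the stack top and the single lookahead token, which is why one-token lookahead suffices for an LL(1) grammar.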
2. Bottom-Up Parsers
Bottom-up parsers build the parse tree from the leaves (the input tokens) and work their way up
to the root (the start symbol). The parsing process involves shifting input tokens onto a stack and
then reducing them into non-terminal symbols according to the grammar rules.

● Types of Bottom-Up Parsers:
○ 2.1 Shift-Reduce Parser:
■ Shift-reduce parsers operate by shifting tokens onto a stack and then reducing
the tokens on the stack into non-terminals based on the grammar rules.
○ Two key actions:
■ Shift: Move the next input token onto the stack.
■ Reduce: Replace a sequence of tokens on the stack that matches the right-hand
side of a grammar production with the corresponding non-terminal.
○ Advantages:
■ Can handle a wider class of grammars, including left-recursive grammars.
■ Simple implementation.
○ Disadvantages:
■ Can encounter conflicts like shift-reduce conflicts or reduce-reduce conflicts.
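The two actions can be illustrated with a deliberately naive shift-reduce loop for the grammar S → S + n | n, chosen because greedily reducing whenever a right-hand side appears on top of the stack never conflicts for this grammar (real parsers consult a parsing table instead):

```python
# Naive shift-reduce for:  S -> S + n | n   (longer rule tried first)
RULES = [('S', ['S', '+', 'n']), ('S', ['n'])]

def shift_reduce(tokens):
    stack, i, trace = [], 0, []
    while True:
        # Reduce while some rule's RHS matches the top of the stack.
        reduced = True
        while reduced:
            reduced = False
            for lhs, rhs in RULES:
                if stack[-len(rhs):] == rhs:
                    del stack[-len(rhs):]          # pop the handle
                    stack.append(lhs)              # push the non-terminal
                    trace.append(f'reduce {lhs} -> {" ".join(rhs)}')
                    reduced = True
                    break
        if i == len(tokens):
            break
        stack.append(tokens[i])                    # shift the next token
        trace.append(f'shift {tokens[i]}')
        i += 1
    return stack == ['S'], trace                   # accept iff only S remains

ok, trace = shift_reduce(['n', '+', 'n'])
print(ok)      # True
print(trace)   # shift n, reduce S -> n, shift +, shift n, reduce S -> S + n
```

A grammar where two rules match the same stack top, or where shifting and reducing are both plausible, would produce exactly the reduce-reduce and shift-reduce conflicts mentioned above.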
2.2 LR Parser
LR parsers are a more powerful and widely used form of bottom-up parsers. They are able to handle a wide
range of grammars efficiently. The term LR stands for:

● L: Left-to-right scanning of the input.
● R: Rightmost derivation in reverse.

Types of LR Parsers:

● SLR(1) (Simple LR): Uses a simplified version of the LR parsing table.
● LR(1): Uses one lookahead token and constructs a full LR parsing table. It is more powerful than SLR(1)
but generates larger tables.
● LALR(1) (LookAhead LR): A compromise between SLR(1) and LR(1). It uses lookahead tokens like LR(1)
but with smaller tables like SLR(1). LALR(1) parsers are the most commonly used in practice due to their
efficiency.

Advantages:

● Can handle a wide range of grammars, including all deterministic context-free grammars.
● Handles both left recursion and ambiguity in certain cases.

Disadvantages:

● Complex to implement.
● Can result in large parsing tables, especially for LR(1) parsers.
2.3 Canonical LR(1) Parser
● This is the most powerful LR parser and can handle the largest class of context-free
grammars.
● It uses 1 token lookahead to resolve shifts and reductions and builds a comprehensive
parsing table.

Advantages:

● Can parse any deterministic context-free grammar.

Disadvantages:

● The parsing tables can become very large and difficult to manage.
2.4 LALR Parser (LookAhead LR):
● The LALR(1) parser is a more practical variant of the LR(1) parser that reduces the size of the
parsing tables without losing too much of the power.
● It merges states in the LR(1) table that have the same core, minimizing the table size while keeping
the grammar's expressiveness.

Advantages:

● Combines the power of LR(1) with the efficiency of smaller parsing tables.
● Most programming language parsers, such as those for C, C++, and Java, use LALR(1) parsers.

Disadvantages:

● Some grammars may still cause conflicts that are harder to resolve compared to full LR(1) parsers.
Comparison of Parsing Methods
Top-Down Parsing using LL Grammars
● Top-down predictive parsing traces the leftmost derivation of the string
while constructing the parse tree
● Starts from the start symbol of the grammar, and “predicts” the next production used in
the derivation
● Such “prediction” is aided by parsing tables (constructed off-line)
● The next production to be used in the derivation is determined using the next input
symbol to lookup the parsing table (look-ahead symbol)
● Placing restrictions on the grammar ensures that no slot in the parsing table contains
more than one production
● During parsing table construction, if two productions become eligible to be
placed in the same slot of the parsing table, the grammar is declared unfit for
predictive parsing
Top-Down LL-Parsing Example
LL(1) Parsing Algorithm
LL(1) Parsing Algorithm Example
Strong LL(k) Grammars
Let the given grammar be G
● Input is extended with k symbols, $^k, where k is the lookahead of the grammar
● Introduce a new nonterminal S′, and a production S′ → S$^k, where S is the start
symbol of the given grammar
● Consider leftmost derivations only and assume that the grammar has no useless
symbols
● A production A → α in G is called a strong LL(k) production, if in G
○ S′ ⇒∗ wAγ ⇒ wαγ ⇒∗ wzy
○ S′ ⇒∗ w′Aδ ⇒ w′βδ ⇒∗ w′zx
○ with |z| = k, z ∈ Σ∗, and w, w′ ∈ Σ∗, together imply α = β
● A grammar (nonterminal) is strong LL(k) if all its productions are strong LL(k)
● Strong LL(k) grammars do not allow different productions of the same nonterminal to be
used even in two different derivations, if the first k symbols of the strings produced by
αγ and βδ are the same
Strong LL(k) Grammars
Example: S → Abc|aAcb, A → ϵ|b|c
S is a strong LL(1) nonterminal
● S′ ⇒ S$ ⇒ Abc$ ⇒ bc$, bbc$, and cbc$, on application of the productions A → ϵ, A → b, and
A → c, respectively; z = b, b, or c, respectively
● S′ ⇒ S$ ⇒ aAcb$ ⇒ acb$, abcb$, and accb$, on application of the productions A → ϵ, A → b,
and A → c, respectively; z = a in all three cases
● In this case, w = w′ = ϵ, α = Abc, β = aAcb, and z differs between the two derivations in all
the derived strings
● Hence the nonterminal S is strong LL(1)
Strong LL(k) Grammars
A is not strong LL(1)
● S′ ⇒∗ Abc$ ⇒ bc$, w = ϵ, z = b, α = ϵ (A → ϵ)
S′ ⇒∗ Abc$ ⇒ bbc$, w′ = ϵ, z = b, β = b (A → b)
● Even though the lookaheads are the same (z = b), α ≠ β, and therefore, the
grammar is not strong LL(1)
A is not strong LL(2)
● S′ ⇒∗ Abc$ ⇒ bc$, w = ϵ, z = bc, α = ϵ (A → ϵ)
S′ ⇒∗ aAcb$ ⇒ abcb$, w′ = a, z = bc, β = b (A → b)
● Even though the lookaheads are the same (z = bc), α ≠ β, and therefore, the
grammar is not strong LL(2)
A is strong LL(3) because all six strings (bc$, bbc, cbc, cb$, bcb, ccb) can be
distinguished using 3-symbol lookahead (details left as homework)
Testable Conditions for LL(1)
● We refer to strong LL(1) simply as LL(1) from now on, and we will not consider
lookaheads longer than 1
● The classical condition for the LL(1) property uses FIRST and FOLLOW sets
● If α is any string of grammar symbols (α ∈ (N ∪ T )∗), then
FIRST (α) = {a | a ∈ T, and α ⇒∗ ax, x ∈ T ∗}, and FIRST (ϵ) = {ϵ}
● If A is any nonterminal, then
FOLLOW (A) = {a | S ⇒∗ αAaβ, α, β ∈ (N ∪ T )∗, a ∈ T ∪ {$}}
● FIRST (α) is determined by α alone, but FOLLOW (A) is determined by the
“context” of A, i.e., the derivations in which A occurs
FIRST and FOLLOW Computation Example
● Consider the following grammar
S′ → S$, S → aAS | c, A → ba | SB, B → bA | S
● FIRST (S′) = FIRST (S) = {a, c} because
S′ ⇒ S$ ⇒ c$, and S′ ⇒ S$ ⇒ aAS$ ⇒ abaS$ ⇒ abac$
● FIRST (A) = {a, b, c} because
A ⇒ ba, and A ⇒ SB, and therefore all symbols in
FIRST (S) are in FIRST (A)
● FOLLOW (S) = {a, b, c, $} because
S′ ⇒ S$,
S′ ⇒∗ aAS$ ⇒ aSBS$ ⇒ aSbAS$,
S′ ⇒∗ aSBS$ ⇒ aSSS$ ⇒ aSaASS$,
S′ ⇒∗ aSSS$ ⇒ aScS$
● FOLLOW (A) = {a, c} because
S′ ⇒∗ aAS$ ⇒ aAaAS$,
S′ ⇒∗ aAS$ ⇒ aAc$
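Both sets can be computed mechanically by a fixpoint iteration over the productions. The sketch below encodes the grammar as a dict of alternative right-hand sides (an encoding assumed for this example) and reproduces the sets derived above:

```python
def first_follow(grammar, start):
    """Fixpoint computation of FIRST and FOLLOW.
    grammar: nonterminal -> list of productions (each a list of symbols);
    symbols not appearing as keys are treated as terminals."""
    nonterms = set(grammar)
    first = {A: set() for A in nonterms}
    nullable = set()
    changed = True
    while changed:                       # FIRST sets and nullability
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                all_nullable = True
                for X in prod:
                    if X in nonterms:
                        if not first[X] <= first[A]:
                            first[A] |= first[X]
                            changed = True
                        if X not in nullable:
                            all_nullable = False
                            break
                    else:                # terminal: FIRST gains X, stop here
                        if X not in first[A]:
                            first[A].add(X)
                            changed = True
                        all_nullable = False
                        break
                if all_nullable and A not in nullable:
                    nullable.add(A)
                    changed = True
    follow = {A: set() for A in nonterms}
    follow[start].add('$')               # end-marker follows the start symbol
    changed = True
    while changed:                       # FOLLOW sets
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                trailer = set(follow[A]) # what can follow the rest of the RHS
                for X in reversed(prod):
                    if X in nonterms:
                        if not trailer <= follow[X]:
                            follow[X] |= trailer
                            changed = True
                        trailer = trailer | first[X] if X in nullable else set(first[X])
                    else:
                        trailer = {X}
    return first, follow

g = {'S': [['a', 'A', 'S'], ['c']],
     'A': [['b', 'a'], ['S', 'B']],
     'B': [['b', 'A'], ['S']]}
first, follow = first_follow(g, 'S')
print(sorted(first['S']), sorted(first['A']))     # ['a', 'c'] ['a', 'b', 'c']
print(sorted(follow['S']), sorted(follow['A']))   # ['$', 'a', 'b', 'c'] ['a', 'c']
```

The iteration simply keeps adding symbols until nothing changes, which mirrors how the derivations above justify each membership.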
LL(1) Conditions
LL(1) Table Construction using FIRST and FOLLOW
Simple Example of LL(1) Grammar
LL(1) Table Construction Example 1
LL(1) Table Problem Example 1
LL(1) Table Construction Example 2
LL(1) Table Problem Example 2
LL(1) Table Construction Example 3
LL(1) Table Construction Example 4
Elimination of Useless Symbols
LR Parser
An LR parser is a type of bottom-up parser used for parsing a wide range of context-free
grammars. The term "LR" stands for:
● L: Left-to-right scanning of the input.
● R: Rightmost derivation in reverse.
Characteristics of LR Parsers:
● Efficient: LR parsers can handle a large class of grammars, including most
programming language grammars.
● Non-backtracking: LR parsers make deterministic parsing decisions by consulting a
parsing table, avoiding backtracking.
● Lookahead: They use lookahead tokens to decide whether to shift (read more input) or
reduce (apply a grammar rule).
● Handles Left Recursion: LR parsers can process grammars that contain left recursion,
which is a limitation in some top-down parsers.
Types of LR Parsers
LR(0) Parser:
● Simplest form of an LR parser.
● No lookahead tokens are used.
● Works well for simple grammars but can't handle many practical grammars due to conflicts in the
parsing table.
SLR(1) Parser (Simple LR Parser):
● Uses 1 token of lookahead.
● Resolves some conflicts by consulting the follow sets of non-terminals.
● More powerful than LR(0), but it can still fail on more complex grammars.
LR(1) Parser:
● Uses 1 token of lookahead to make more informed parsing decisions.
● Can handle all deterministic context-free grammars.
● The parsing table is larger but more powerful than SLR(1).
LALR(1) Parser (Look-Ahead LR):
● Merges states with identical cores in an LR(1) parser to reduce the size of the parsing table.
● More efficient and commonly used in practice (e.g., in tools like YACC).
● Retains most of the power of LR(1) but with a smaller table.
Structure of an LR Parser
An LR parser consists of:

● Input buffer: Holds the string to be parsed.
● Stack: Stores symbols and states during the parsing process.
● Parsing table: Contains two parts:
○ Action table: Dictates whether to shift, reduce, accept, or report an error.
○ Goto table: Determines the next state based on the non-terminal symbol.
● Parsing algorithm: The logic that combines the input, stack, and tables to produce the
parse tree or identify errors.
Components of the Parsing Table
● Action Table: Maps pairs of current state and input symbols to one of the following
actions:
○ Shift: Push the current input symbol onto the stack and move to a new state.
○ Reduce: Replace the top symbols on the stack with a non-terminal according to a
production rule.
○ Accept: Successfully complete the parse.
○ Error: Identify that parsing has failed.
● Goto Table: Maps pairs of current state and non-terminal symbols to the next state. It is
used after a reduction to determine where the parser should continue.
How an LR Parser Works
Step-by-Step Parsing Algorithm:

1. Initialization: The parser starts with an initial state (s0) on the stack and reads the first
token from the input buffer.
2. Action Determination: The parser consults the action table based on the current state
(from the top of the stack) and the current input symbol (from the buffer).
a. If the action is Shift, the input symbol is pushed onto the stack, and the parser transitions to
the new state.
b. If the action is Reduce, the parser pops symbols from the stack according to the right-hand
side of a production rule and pushes the left-hand side non-terminal. It then consults the Goto
table to determine the new state.
c. If the action is Accept, parsing is complete, and the input is successfully parsed.
d. If the action is Error, the input is rejected.
3. Continue: The parser repeats this process until either an error is found or the input is
successfully parsed.
Example of LR Parsing
S → E
E → E + T | T
T → int
Steps for Parsing Input: int + int

● Step 1: Initialization
○ Input buffer: int + int
○ Stack: [0] (initial state)
● Step 2: Shift int
○ Input buffer: + int
○ Stack: [0, int, 5] (Shifted int and transitioned to state 5)
● Step 3: Reduce T → int
○ Input buffer: + int
○ Stack: [0, T, 3] (Reduced int to T and transitioned to state 3)
● Step 4: Reduce E → T
○ Input buffer: + int
○ Stack: [0, E, 1] (Reduced T to E and transitioned to state 1)
Example of LR Parsing
● Step 5: Shift +
○ Input buffer: int
○ Stack: [0, E, 1, +, 6] (Shifted + and transitioned to state 6)
● Step 6: Shift int
○ Input buffer: (empty)
○ Stack: [0, E, 1, +, 6, int, 5] (Shifted int and transitioned to state 5)
● Step 7: Reduce T → int
○ Input buffer: (empty)
○ Stack: [0, E, 1, +, 6, T, 3] (Reduced int to T and transitioned to state 3)
● Step 8: Reduce E → E + T
○ Input buffer: (empty)
○ Stack: [0, E, 1] (Reduced E + T to E and transitioned to state 1)
● Step 9: Accept
○ The parser accepts the input as valid.
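The trace above can be replayed with a table-driven loop. The ACTION/GOTO tables below are hand-built for this grammar; state numbers follow the slide's trace except that the state reached after E + T is split into its own state 4, as a real LR table requires (the slide reuses state 3 for simplicity):

```python
# Hand-built LR tables for:  S -> E,  E -> E + T | T,  T -> int
# ('s', n) = shift to state n; ('r', (A, k)) = reduce by A -> (k RHS symbols).
ACTION = {
    (0, 'int'): ('s', 5),
    (5, '+'): ('r', ('T', 1)),  (5, '$'): ('r', ('T', 1)),   # T -> int
    (3, '+'): ('r', ('E', 1)),  (3, '$'): ('r', ('E', 1)),   # E -> T
    (1, '+'): ('s', 6),         (1, '$'): ('acc',),
    (6, 'int'): ('s', 5),
    (4, '+'): ('r', ('E', 3)),  (4, '$'): ('r', ('E', 3)),   # E -> E + T
}
GOTO = {(0, 'T'): 3, (0, 'E'): 1, (6, 'T'): 4}

def lr_parse(tokens):
    tokens = tokens + ['$']
    stack = [0]                           # state stack (symbols omitted)
    i = 0
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:
            return False                  # error entry in the table
        if act[0] == 's':
            stack.append(act[1])          # shift: consume token, push state
            i += 1
        elif act[0] == 'r':
            lhs, n = act[1]
            del stack[-n:]                # pop one state per RHS symbol
            stack.append(GOTO[(stack[-1], lhs)])   # goto on the non-terminal
        else:
            return True                   # accept

print(lr_parse(['int', '+', 'int']))   # True
print(lr_parse(['int', '+']))          # False
```

Stepping through `['int', '+', 'int']` visits the same shift/reduce sequence as the nine steps above.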
LR parser
Advantages of LR Parsers:

1. Handles Large Class of Grammars: LR parsers can handle nearly all programming language grammars,
including those with left recursion.
2. Deterministic: LR parsers are deterministic and never require backtracking.
3. Efficient Parsing: They are capable of parsing in linear time (O(n)) for the length of the input.

Disadvantages of LR Parsers:

1. Complexity: The construction of LR parsing tables can be complex, and LR(1) tables can become very
large.
2. Conflict Resolution: When grammars are ambiguous, LR parsers can face shift-reduce and reduce-reduce
conflicts, which may require grammar rewriting or conflict resolution.

LR Conflicts:

1. Shift-Reduce Conflict: Occurs when the parser can either shift the input or reduce it using a rule, and it's
unclear which action to take.
2. Reduce-Reduce Conflict: Occurs when the parser can apply more than one reduction, and it's unclear
which rule to apply.
