Lexical Analysis and Parsing CD
Irfan Rasool
Language Processing System
Compiler Overview and Structural Phases of Compiler
1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. Intermediate Code Generation
5. Code Optimization
6. Code Generation
7. Linking and Assembly
Lexical Analysis
● Lexical Analysis is the first phase of the compiler, where the source code written in a
high-level programming language is converted into a sequence of tokens.
● The primary purpose of lexical analysis is to break down the input code into smaller,
manageable components that can be used by the parser for syntax analysis.
● This phase is responsible for reading the characters of the input program and grouping
them into meaningful sequences, or tokens, such as keywords, operators, identifiers,
constants, and punctuation.
● Identifiers among these tokens are entered into the symbol table.
Key Functions of Lexical Analysis
1. Tokenization:
○ The main role of lexical analysis is to divide the input code into tokens.
○ A token is a sequence of characters that represents a basic unit in the source code, such
as:
■ Keywords (if, for, while, int, return)
■ Identifiers (variable names like x, sum)
■ Operators (+, -, *, /, ==)
■ Literals (constants like 123, 3.14, 'a')
■ Punctuation (brackets (), semicolons ;)
2. Removing Whitespaces and Comments:
○ Lexical analysis also eliminates unnecessary whitespaces and comments from the code,
which are not needed for parsing but help in making the code more readable.
Key Functions of Lexical Analysis
3. Error Handling:
● The lexical analyzer detects and reports lexical errors, such as illegal characters or
malformed tokens, so that later phases receive a clean token stream.
4. Symbol Table Management:
● The lexical analyzer may generate or update a symbol table, which stores information about
identifiers (such as variable names) and their associated data (type, memory address, etc.).
● This table is crucial for the semantic analysis and code generation phases later in the compilation
process.
What is a Symbol Table?
● A symbol table is one of the most important data structures within a compiler, where all the identifiers
used in a program are stored along with their type, scope, and memory locations.
● The symbol table is built during the earliest stages of compilation.
● It helps ensure the correct usage of identifiers according to the language’s rules, supporting
efficient code generation and error checking.
● It is built and updated during the lexical and syntax analysis phases.
● The information is collected by the analysis phases of the compiler and is used by the synthesis phases
of the compiler to generate code.
● It is used by the compiler to achieve compile-time efficiency.
● Lexical Analysis: Creates new entries in the table, for example, entries for identifier tokens.
● Syntax Analysis: Adds information regarding attribute type, scope, dimension, line of reference, use, etc
in the table.
● Semantic Analysis: Uses available information in the table to check semantics, i.e., to verify that
expressions and assignments are semantically correct (type checking), and updates it accordingly.
● Intermediate Code Generation: Refers to the symbol table to know how much run-time storage is
allocated and of what type; the table also helps in adding information about temporary variables.
● Code Optimization: Uses information present in the symbol table for machine-dependent optimization.
● Target Code generation: Generates code by using the address information of the identifier present in the
table.
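As a rough illustration of this interface, here is a minimal symbol-table sketch in Python (the class and method names are our own, not from any particular compiler):

# Minimal symbol-table sketch (hypothetical interface, for illustration).
class SymbolTable:
    def __init__(self):
        self.entries = {}                      # name -> attribute dictionary

    def insert(self, name, **attrs):
        # Lexical analysis creates the entry; later phases add attributes.
        self.entries.setdefault(name, {}).update(attrs)

    def lookup(self, name):
        # Used by semantic analysis and code generation; None if undeclared.
        return self.entries.get(name)

table = SymbolTable()
table.insert("sum", token="Identifier")            # entered by the lexer
table.insert("sum", type="int", address=0x1000)    # added by later phases
print(table.lookup("sum"))   # {'token': 'Identifier', 'type': 'int', 'address': 4096}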
Process of Lexical Analysis
1. Scanning:
○ The source code is scanned character by character by the lexical analyzer.
○ The scanner uses a lookahead mechanism to identify the next token without consuming the
input.
○ This process is facilitated by regular expressions, which define the patterns for valid tokens.
2. Matching Tokens:
○ Each sequence of characters is matched against predefined token patterns.
○ These patterns are defined using regular expressions, and each successful match
corresponds to a valid token.
Example:
● if matches the regular expression for keywords.
● x matches the pattern for identifiers.
● 3.14 matches the pattern for floating-point literals.
Process of Lexical Analysis
3. Token Generation: Once a sequence of characters is recognized as a valid token, the
lexical analyzer outputs the token in the form of a pair: (token type, token value).
For example:
● if → (Keyword, "if")
● x → (Identifier, "x")
● 3.14 → (FloatLiteral, "3.14")
4. State Transition:
● The lexical analyzer operates like a finite state machine (FSM), where each state
corresponds to the partial recognition of a token.
● As the input characters are read, the FSM transitions between states until it reaches a
final state that corresponds to a valid token.
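To make the FSM view concrete, here is a small Python sketch of a DFA that recognizes integer and floating-point literals (the state names are our own invention):

# DFA sketch for integer/float literals: INT and FRAC are final states.
def scan_number(s, i):
    state, start = "START", i
    while i < len(s):
        c = s[i]
        if state == "START" and c.isdigit():
            state = "INT"                     # first digit seen
        elif state == "INT" and c.isdigit():
            pass                              # stay in INT
        elif state == "INT" and c == ".":
            state = "DOT"                     # a digit must follow '.'
        elif state in ("DOT", "FRAC") and c.isdigit():
            state = "FRAC"
        else:
            break                             # no transition: stop scanning
        i += 1
    if state == "INT":
        return ("IntLiteral", s[start:i]), i
    if state == "FRAC":
        return ("FloatLiteral", s[start:i]), i
    return None, start                        # not a numeric literal

print(scan_number("3.14 + x", 0))   # (('FloatLiteral', '3.14'), 4)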
Example of Lexical Analysis
Code:
int sum = 0;
sum = sum + 10;
The lexical analyzer will produce the following tokens:
(Keyword, "int"), (Identifier, "sum"), (Operator, "="), (IntLiteral, "0"), (Punctuation, ";"),
(Identifier, "sum"), (Operator, "="), (Identifier, "sum"), (Operator, "+"), (IntLiteral, "10"), (Punctuation, ";")
● Lexical analyzers like Lex (or its variant Flex) are commonly used tools to generate
lexical analyzers automatically from a set of regular expressions.
● The programmer provides patterns in the form of regular expressions, and Lex
generates the corresponding code to perform lexical analysis.
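A Lex specification pairs regular expressions with actions. As a rough Python analogue (the token names and patterns below are our own choices, not Lex syntax), the same idea can be sketched with the standard re module:

import re

# Hypothetical token patterns, in priority order (keywords before identifiers).
TOKEN_SPEC = [
    ("Keyword",      r"\b(?:if|for|while|int|return)\b"),
    ("FloatLiteral", r"\d+\.\d+"),
    ("IntLiteral",   r"\d+"),
    ("Identifier",   r"[A-Za-z_]\w*"),
    ("Operator",     r"==|[+\-*/=]"),
    ("Punctuation",  r"[();{}]"),
    ("Skip",         r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(code):
    for m in MASTER.finditer(code):
        if m.lastgroup != "Skip":             # drop whitespace, as a lexer would
            yield (m.lastgroup, m.group())

print(list(tokenize("sum = sum + 10;")))
# [('Identifier', 'sum'), ('Operator', '='), ('Identifier', 'sum'),
#  ('Operator', '+'), ('IntLiteral', '10'), ('Punctuation', ';')]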
Finite Automata
Ambiguity
Example 1: The dangling-else grammar is ambiguous. An equivalent unambiguous grammar distinguishes matched statements:
stmt → IF expr stmt | IF expr matched_stmt ELSE stmt
matched_stmt → IF expr matched_stmt ELSE matched_stmt | other_stmt
Example 2: Some context-free languages are inherently ambiguous, i.e., every grammar generating them is ambiguous. For example, the language
L = {a^n b^n c^m d^m | n, m ≥ 1} ∪ {a^n b^m c^m d^n | n, m ≥ 1}
is inherently ambiguous.
Pushdown Automata
A PDA M is a system (Q, Σ, Γ, δ, q0, z0, F ), where
Q is a finite set of states
Σ is the input alphabet
Γ is the stack alphabet
q0 ∈ Q is the start state
z0 ∈ Γ is the start symbol on stack (initialization)
F ⊆ Q is the set of final states
δ is the transition function, from Q × (Σ ∪ {ε}) × Γ to finite subsets of Q × Γ*
A typical entry of δ is given by
δ(q, a, z) = {(p1, γ1), (p2, γ2), ..., (pm, γm)}
The PDA in state q, with input symbol a and top-of-stack symbol z, can enter any of
the states pi , replace the symbol z by the string γi , and advance the input head by
one symbol.
Pushdown Automata Contd…
● The leftmost symbol of γi will be the new top of stack
● a in the above function δ could be ϵ, in which case, the input symbol is not
used and the input head is not advanced
● For a PDA M, we define L(M), the language accepted by
M by final state, to be
L(M) = {w | (q0, w, z0) ⊢* (p, ε, γ), for some p ∈ F and γ ∈ Γ*}
● We define N(M), the language accepted by M by empty stack, to be
N(M) = {w | (q0, w, z0) ⊢* (p, ε, ε), for some p ∈ Q}
● When acceptance is by empty stack, the set of final states is irrelevant, and
usually we set F = ∅
PDA - Examples
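As one concrete example, a PDA for L = {a^n b^n | n ≥ 1} with acceptance by empty stack can be simulated deterministically; the following Python sketch uses our own encoding of the stack moves:

# PDA sketch for {a^n b^n | n >= 1}, acceptance by empty stack.
# Push an A for every 'a', pop an A for every 'b', then pop the
# bottom marker Z0 by an epsilon-move once the input is exhausted.
def accepts(w):
    stack, i = ["Z0"], 0
    while i < len(w) and w[i] == "a":                      # push phase
        stack.append("A")
        i += 1
    while i < len(w) and w[i] == "b" and stack[-1] == "A": # pop phase
        stack.pop()
        i += 1
    if i == len(w) and stack == ["Z0"] and len(w) > 0:
        stack.pop()                                        # epsilon-move: pop Z0
    return i == len(w) and not stack                       # empty stack = accept

print(accepts("aabb"), accepts("aab"), accepts(""))   # True False False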
Nondeterministic and Deterministic PDA
● Just as in the case of NFA and DFA, PDA also have two versions: NPDA and
DPDA
● However, NPDA are strictly more powerful than the DPDA
● For example, the language L = {ww^R | w ∈ {a, b}+} can be recognized only by
an NPDA and not by any DPDA
● In contrast, the language L = {wcw^R | w ∈ {a, b}+} can be recognized by a DPDA
● In practice we use DPDAs, since they have exactly one possible move at any
instant
● Our parsers are all DPDAs
Parsing
● Parsing is the process of constructing a parse tree for a sentence generated by a given
grammar
● If there are no restrictions on the language and the form of grammar used, parsers for
context-free languages require O(n³) time (n being the length of the string parsed)
■ Cocke-Younger-Kasami (CYK) algorithm
■ Earley’s algorithm
● Subsets of context-free languages typically require O(n) time
■ Predictive parsing using LL(1) grammars (top-down parsing method)
■ Shift-Reduce parsing using LR(1) grammars (bottom-up parsing method)
Types of Parsing
Pushdown Automata in Parsing
● Pushdown Automata (PDA) are computational models used for recognizing
context-free languages (CFLs), which are crucial in parsing programming languages.
● PDAs are an extension of finite automata that include a stack as an additional memory
structure.
● The stack allows PDAs to handle recursive patterns and nested structures, which are
common in programming languages and make them more powerful than finite
automata.
Pushdown Automata in Parsing
When used in parsing, Pushdown Automata play a fundamental role in recognizing and processing
the hierarchical structure of context-free grammars. They form the theoretical foundation for
top-down and bottom-up parsers like LL and LR parsers.
● Input consumption: The PDA reads the next symbol (or token) from the input.
● Stack manipulation: Based on the current state, input, and stack contents, the PDA either
pushes or pops symbols from the stack.
● State transitions: The automaton moves between states based on the current input and stack
contents, allowing it to recognize whether a string belongs to the language defined by the
context-free grammar.
PDAs in Top-Down Parsing
In Top-down parsing (like LL parsers), a PDA simulates the recursive descent by predicting the next
symbol and comparing it with the input. The stack stores non-terminals that are expanded according to the
grammar rules.
S → aSb | ε
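For the grammar above, the PDA’s behavior can be sketched as a stack-driven predictive recognizer in Python (our own minimal encoding: predict S → aSb when the lookahead is 'a', otherwise S → ε):

# Top-down (LL-style) recognizer for S -> a S b | epsilon.
def recognize(w):
    stack, i = ["S"], 0          # start symbol on the stack
    while stack:
        top = stack.pop()
        if top == "S":
            if i < len(w) and w[i] == "a":
                stack += ["b", "S", "a"]   # predict S -> a S b (pushed reversed)
            # else: predict S -> epsilon, i.e. push nothing
        elif i < len(w) and w[i] == top:
            i += 1                          # match a terminal
        else:
            return False                    # mismatch
    return i == len(w)                      # all input consumed

print(recognize("aabb"), recognize("abab"))   # True False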
Parsers are of two types:
● Top-Down Parsers
● Bottom-Up Parsers
1. Top-Down Parsers
● Top-down parsers construct the parse tree from the root (starting symbol) and attempt
to derive the input tokens by recursively expanding the non-terminal symbols.
● The parsing process proceeds from the top (starting symbol) of the grammar to the
bottom (terminals in the input).
Types of Top-Down Parsers
1.1 Recursive Descent Parser:
● A recursive descent parser is a type of top-down parser that uses recursive procedures to
process the input tokens.
● It directly implements the grammar rules by using a separate function for each non-terminal
in the grammar.
● Each function tries to match a part of the input with the corresponding grammar rule.
● Example: If you have a grammar rule E → T + E, a recursive function E() will call T() to
recognize T, and then look for the + operator and recursively call E() to match the remainder
of the input (see the sketch below).
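A minimal recursive-descent sketch for this rule (together with an assumed rule T → int; the helper names and token encoding are ours) might look like:

# Recursive-descent sketch for E -> T + E | T, with T -> int assumed.
tokens, pos = [], 0

def E():
    global pos
    T()                                    # E always begins with a T
    if pos < len(tokens) and tokens[pos] == "+":
        pos += 1                           # consume '+'
        E()                                # recursively match the rest of E

def T():
    global pos
    if pos < len(tokens) and tokens[pos] == "int":
        pos += 1                           # consume the int token
    else:
        raise SyntaxError(f"expected 'int' at position {pos}")

def parse(ts):
    global tokens, pos
    tokens, pos = ts, 0
    E()
    return pos == len(tokens)              # succeed only if all input is used

print(parse(["int", "+", "int"]))   # True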
Advantages:
● Simple to implement for small grammars.
● Intuitive and easy to follow.
Disadvantages:
● Cannot handle left recursion, which occurs when a non-terminal symbol refers to itself on the
left-hand side of its production rule. For example, E → E + T is left-recursive and leads to
infinite recursion.
● Not suitable for large grammars.
Types of Top-Down Parsers
1.2 Predictive Parser (LL Parsers):
● A predictive parser is a type of top-down parser that uses lookahead to decide which
grammar rule to apply. It predicts which rule to use based on the next input token.
● The term LL parser stands for:
○ L: Left-to-right scanning of the input.
○ L: Leftmost derivation.
● LL(1) parsers are a subset of predictive parsers that use 1-token lookahead to make
parsing decisions.
LL(1) Parser
How LL(1) Works:
● The parser looks at the current input token and the top of the parsing stack.
● It uses a parsing table to decide which production to apply (see the sketch below).
● It requires that the grammar be non-left-recursive and non-ambiguous.
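The driver loop can be sketched in a few lines of Python; as a toy example (our own choice, not from these notes) we use the grammar S → aSb | ε with its LL(1) table:

# Table-driven LL(1) sketch for S -> a S b | epsilon.
# TABLE[(nonterminal, lookahead)] = right-hand side to predict.
TABLE = {
    ("S", "a"): ["a", "S", "b"],   # a is in FIRST(aSb)
    ("S", "b"): [],                # epsilon, since b is in FOLLOW(S)
    ("S", "$"): [],                # epsilon, since $ is in FOLLOW(S)
}

def ll1_parse(w):
    input_ = list(w) + ["$"]                       # end marker
    stack, i = ["$", "S"], 0
    while stack[-1] != "$":
        top, a = stack[-1], input_[i]
        if top == a:
            stack.pop(); i += 1                    # match terminal
        elif (top, a) in TABLE:
            stack.pop()
            stack.extend(reversed(TABLE[(top, a)]))  # predict production
        else:
            return False                           # empty slot: syntax error
    return input_[i] == "$"

print(ll1_parse("aabb"), ll1_parse("aab"))   # True False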
Advantages:
● Parses in linear time with no backtracking; each decision is a simple table lookup.
Disadvantages:
● Applies only to a restricted class of grammars: left recursion must be eliminated, and the
grammar may need left factoring.
2. Bottom-Up Parsers (LR Parsers)
Advantages:
● Can handle a wide range of grammars, including all deterministic context-free grammars.
● Handles both left recursion and ambiguity in certain cases.
Disadvantages:
● Complex to implement.
● Can result in large parsing tables, especially for LR(1) parsers.
2.3 Canonical LR(1) Parser
● This is the most powerful LR parser and can handle the largest class of context-free
grammars.
● It uses 1 token lookahead to resolve shifts and reductions and builds a comprehensive
parsing table.
Advantages:
● The most powerful of the LR family; it can resolve conflicts that SLR(1) and LALR(1)
parsers cannot.
Disadvantages:
● The parsing tables can become very large and difficult to manage.
2.4 LALR Parser (LookAhead LR):
● The LALR(1) parser is a more practical variant of the LR(1) parser that reduces the size of the
parsing tables without losing too much of the power.
● It merges states in the LR(1) table that have the same core, minimizing the table size while keeping
the grammar's expressiveness.
Advantages:
● Combines the power of LR(1) with the efficiency of smaller parsing tables.
● Many programming language parsers, such as those for C, C++, and Java, have used LALR(1) parsers.
Disadvantages:
● Some grammars may still cause conflicts that are harder to resolve compared to full LR(1) parsers.
Comparison of Parsing Methods
Top-Down Parsing using LL Grammars
● Top-down parsing using predictive parsing traces the left-most derivation of the string
while constructing the parse tree
● Starts from the start symbol of the grammar, and “predicts” the next production used in
the derivation
● Such “prediction” is aided by parsing tables (constructed off-line)
● The next production to be used in the derivation is determined using the next input
symbol to lookup the parsing table (look-ahead symbol)
● Placing restrictions on the grammar ensures that no slot in the parsing table contains
more than one production
● During parsing table construction, if two productions become eligible to be
placed in the same slot of the parsing table, the grammar is declared unfit for
predictive parsing
Top-Down LL-Parsing Example
LL(1) Parsing Algorithm
LL(1) Parsing Algorithm Example
Strong LL(k) Grammars
Let the given grammar be G
● The input is extended with k end-marker symbols, $^k, where k is the lookahead of the grammar
● Introduce a new nonterminal S′ and a production S′ → S$^k, where S is the start
symbol of the given grammar
● Consider leftmost derivations only and assume that the grammar has no useless
symbols
● A production A → α in G is called a strong LL(k) production if, whenever
○ S′ ⇒* wAγ ⇒ wαγ ⇒* wzy, and
○ S′ ⇒* w′Aδ ⇒ w′βδ ⇒* w′zx,
○ with |z| = k, z ∈ Σ*, and w, w′ ∈ Σ*, then α = β
● A grammar (nonterminal) is strong LL(k) if all its productions are strong LL(k)
● Strong LL(k) grammars do not allow different productions of the same nonterminal to be
used even in two different derivations, if the first k symbols of the strings produced by
αγ and βδ are the same
Strong LL(k) Grammars
Example: S → Abc | aAcb, A → ε | b | c
S is a strong LL(1) nonterminal
● S′ ⇒ S$ ⇒ Abc$ ⇒ bc$, bbc$, and cbc$, on application of the productions A → ε, A → b, and
A → c, respectively; z = b, b, or c, respectively
● S′ ⇒ S$ ⇒ aAcb$ ⇒ acb$, abcb$, and accb$, on application of the productions A → ε,
A → b, and A → c, respectively; z = a in all three cases
● In this case, w = w′ = ε, α = Abc, β = aAcb, but z is different in the two derivations, in all the
derived strings
● Hence the nonterminal S is strong LL(1)
Strong LL(k) Grammars
A is not strong LL(1)
● S′ ⇒* Abc$ ⇒ bc$, with w = ε, z = b, α = ε (A → ε)
S′ ⇒* Abc$ ⇒ bbc$, with w′ = ε, z = b, β = b (A → b)
● Even though the lookaheads are the same (z = b), α ≠ β, and therefore the
grammar is not strong LL(1)
A is not strong LL(2)
● S′ ⇒* Abc$ ⇒ bc$, with w = ε, z = bc, α = ε (A → ε)
S′ ⇒* aAcb$ ⇒ abcb$, with w′ = a, z = bc, β = b (A → b)
● Even though the lookaheads are the same (z = bc), α ≠ β, and therefore the
grammar is not strong LL(2)
A is strong LL(3), because all six lookahead strings (bc$, bbc, cbc, cb$, bcb, ccb) can be
distinguished using a 3-symbol lookahead (details are left as homework)
Testable Conditions for LL(1)
● We refer to strong LL(1) simply as LL(1) from now on, and we will not consider
lookaheads longer than 1
● The classical condition for the LL(1) property uses FIRST and FOLLOW sets
● If α is any string of grammar symbols (α ∈ (N ∪ T)*), then
FIRST(α) = {a | a ∈ T, and α ⇒* ax, x ∈ T*}; FIRST(ε) = {ε}
● If A is any nonterminal, then
FOLLOW(A) = {a | S ⇒* αAaβ, α, β ∈ (N ∪ T)*, a ∈ T ∪ {$}}
● FIRST (α) is determined by α alone, but FOLLOW (A) is determined by the
“context” of A, i.e., the derivations in which A occurs
FIRST and FOLLOW Computation Example
● Consider the following grammar
S′ → S$, S → aAS | c, A → ba | SB, B → bA | S
● FIRST(S′) = FIRST(S) = {a, c} because
S′ ⇒ S$ ⇒ c$, and S′ ⇒ S$ ⇒ aAS$ ⇒ abaS$ ⇒ abac$
● FIRST(A) = {a, b, c} because
A ⇒ ba, and A ⇒ SB, and therefore all symbols in
FIRST(S) are in FIRST(A)
● FOLLOW(S) = {a, b, c, $} because
S′ ⇒ S$,
S′ ⇒* aAS$ ⇒ aSBS$ ⇒ aSbAS$,
S′ ⇒* aSBS$ ⇒ aSSS$ ⇒ aSaASS$,
S′ ⇒* aSSS$ ⇒ aScS$
● FOLLOW(A) = {a, c} because
S′ ⇒* aAS$ ⇒ aAaAS$,
S′ ⇒* aAS$ ⇒ aAc$
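These sets can also be computed mechanically by iterating to a fixpoint. The following Python sketch (our own encoding; it ignores ε-productions, which this grammar does not have) reproduces the sets derived above:

# Fixpoint computation of FIRST and FOLLOW for the example grammar.
GRAMMAR = {
    "S'": [["S", "$"]],
    "S":  [["a", "A", "S"], ["c"]],
    "A":  [["b", "a"], ["S", "B"]],
    "B":  [["b", "A"], ["S"]],
}
NONTERMINALS = set(GRAMMAR)

def first_of(symbol, FIRST):
    return FIRST[symbol] if symbol in NONTERMINALS else {symbol}

def compute_first():
    FIRST = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for n, prods in GRAMMAR.items():
            for rhs in prods:
                new = first_of(rhs[0], FIRST) - FIRST[n]  # no epsilon rules here
                if new:
                    FIRST[n] |= new
                    changed = True
    return FIRST

def compute_follow(FIRST):
    FOLLOW = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for n, prods in GRAMMAR.items():
            for rhs in prods:
                for i, sym in enumerate(rhs):
                    if sym not in NONTERMINALS:
                        continue
                    if i + 1 < len(rhs):              # something follows sym
                        new = first_of(rhs[i + 1], FIRST) - FOLLOW[sym]
                    else:                             # sym ends the production
                        new = FOLLOW[n] - FOLLOW[sym]
                    if new:
                        FOLLOW[sym] |= new
                        changed = True
    return FOLLOW

FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
print(FIRST["S"], FIRST["A"])    # {'a','c'} and {'a','b','c'} (set order varies)
print(FOLLOW["S"], FOLLOW["A"])  # {'a','b','c','$'} and {'a','c'}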
LL(1) Conditions
A grammar is LL(1) if, for every pair of productions A → α | β:
● FIRST(α) ∩ FIRST(β) = ∅, and
● if β ⇒* ε, then FIRST(α) ∩ FOLLOW(A) = ∅
LL(1) Table Construction using FIRST and FOLLOW
Simple Example of LL(1) Grammar
LL(1) Table Construction Example 1
LL(1) Table Problem Example 1
LL(1) Table Construction Example 2
LL(1) Table Problem Example 2
LL(1) Table Construction Example 3
LL(1) Table Construction Example 4
Elimination of Useless Symbols
A symbol is useless if it cannot derive any terminal string (non-generating) or cannot be reached
from the start symbol (unreachable); such symbols are removed from the grammar before parser
construction.
LR Parser
An LR parser is a type of bottom-up parser used for parsing a wide range of context-free
grammars. The term "LR" stands for:
● L: Left-to-right scanning of the input.
● R: Rightmost derivation in reverse.
Characteristics of LR Parsers:
● Efficient: LR parsers can handle a large class of grammars, including most
programming language grammars.
● Non-backtracking: LR parsers make deterministic parsing decisions by consulting a
parsing table, avoiding backtracking.
● Lookahead: They use lookahead tokens to decide whether to shift (read more input) or
reduce (apply a grammar rule).
● Handles Left Recursion: LR parsers can process grammars that contain left recursion,
which is a limitation in some top-down parsers.
Types of LR Parsers
LR(0) Parser:
● Simplest form of an LR parser.
● No lookahead tokens are used.
● Works well for simple grammars but can't handle many practical grammars due to conflicts in the
parsing table.
SLR(1) Parser (Simple LR Parser):
● Uses 1 token of lookahead.
● Resolves some conflicts by consulting the follow sets of non-terminals.
● More powerful than LR(0), but it can still fail on more complex grammars.
LR(1) Parser:
● Uses 1 token of lookahead to make more informed parsing decisions.
● Can handle all deterministic context-free grammars.
● The parsing table is larger but more powerful than SLR(1).
LALR(1) Parser (Look-Ahead LR):
● Merges states with identical cores in an LR(1) parser to reduce the size of the parsing table.
● More efficient and commonly used in practice (e.g., in tools like YACC).
● Retains most of the power of LR(1) but with a smaller table.
Structure of an LR Parser
An LR parser consists of a stack, an input buffer, and a parsing table with Action and Goto parts; it operates as follows:
1. Initialization: The parser starts with an initial state (s0) on the stack and reads the first
token from the input buffer.
2. Action Determination: The parser consults the action table based on the current state
(from the top of the stack) and the current input symbol (from the buffer).
a. If the action is Shift, the input symbol is pushed onto the stack, and the parser transitions to
the new state.
b. If the action is Reduce, the parser pops symbols from the stack according to the right-hand
side of a production rule and pushes the left-hand side non-terminal. It then consults the Goto
table to determine the new state.
c. If the action is Accept, parsing is complete, and the input is successfully parsed.
d. If the action is Error, the input is rejected.
3. Continue: The parser repeats this process until either an error is found or the input is
successfully parsed.
Example of LR Parsing
S → E
E → E + T | T
T → int
Steps for Parsing Input: int + int
● Step 1: Initialization
○ Input buffer: int + int
○ Stack: [0] (initial state)
● Step 2: Shift int
○ Input buffer: + int
○ Stack: [0, int, 5] (Shifted int and transitioned to state 5)
● Step 3: Reduce T → int
○ Input buffer: + int
○ Stack: [0, T, 3] (Reduced int to T and transitioned to state 3)
● Step 4: Reduce E → T
○ Input buffer: + int
○ Stack: [0, E, 1] (Reduced T to E and transitioned to state 1)
Example of LR Parsing
● Step 5: Shift +
○ Input buffer: int
○ Stack: [0, E, 1, +, 6] (Shifted + and transitioned to state 6)
● Step 6: Shift int
○ Input buffer: (empty)
○ Stack: [0, E, 1, +, 6, int, 5] (Shifted int and transitioned to state 5)
● Step 7: Reduce T → int
○ Input buffer: (empty)
○ Stack: [0, E, 1, +, 6, T, 7] (Reduced int to T; the Goto entry for state 6 on T gives state 7)
● Step 8: Reduce E → E + T
○ Input buffer: (empty)
○ Stack: [0, E, 1] (Reduced E + T to E and transitioned to state 1)
● Step 9: Accept
○ The parser accepts the input as valid.
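The same run can be reproduced with a small shift-reduce driver. The tables below are hand-built to match the trace (state 7 holds the item E → E + T·; everything else is our own encoding):

# Shift-reduce driver for S -> E, E -> E + T | T, T -> int.
ACTION = {
    (0, "int"): ("shift", 5), (6, "int"): ("shift", 5),
    (1, "+"): ("shift", 6),   (1, "$"): ("accept",),
    (3, "+"): ("reduce", "E", 1), (3, "$"): ("reduce", "E", 1),  # E -> T
    (5, "+"): ("reduce", "T", 1), (5, "$"): ("reduce", "T", 1),  # T -> int
    (7, "+"): ("reduce", "E", 3), (7, "$"): ("reduce", "E", 3),  # E -> E + T
}
GOTO = {(0, "E"): 1, (0, "T"): 3, (6, "T"): 7}

def lr_parse(tokens):
    input_ = tokens + ["$"]
    stack, i = [0], 0                        # stack of states only, for brevity
    while True:
        act = ACTION.get((stack[-1], input_[i]))
        if act is None:
            return False                     # error entry in the table
        if act[0] == "shift":
            stack.append(act[1]); i += 1
        elif act[0] == "reduce":
            _, lhs, rhs_len = act
            del stack[-rhs_len:]             # pop one state per RHS symbol
            stack.append(GOTO[(stack[-1], lhs)])   # consult the Goto table
        else:
            return True                      # accept

print(lr_parse(["int", "+", "int"]))   # True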
LR parser
Advantages of LR Parsers:
1. Handles Large Class of Grammars: LR parsers can handle nearly all programming language grammars,
including those with left recursion.
2. Deterministic: LR parsers are deterministic and never require backtracking.
3. Efficient Parsing: They are capable of parsing in linear time (O(n)) for the length of the input.
Disadvantages of LR Parsers:
1. Complexity: The construction of LR parsing tables can be complex, and LR(1) tables can become very
large.
2. Conflict Resolution: When grammars are ambiguous, LR parsers can face shift-reduce and reduce-reduce
conflicts, which may require grammar rewriting or conflict resolution.
LR Conflicts:
1. Shift-Reduce Conflict: Occurs when the parser can either shift the input or reduce it using a rule, and it's
unclear which action to take.
2. Reduce-Reduce Conflict: Occurs when the parser can apply more than one reduction, and it's unclear
which rule to apply.
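A standard illustration of a shift-reduce conflict (a textbook example, not from these notes) is the dangling-else grammar:
stmt → IF expr stmt | IF expr stmt ELSE stmt
After the parser has seen "IF expr stmt" with ELSE as the lookahead, it can either shift ELSE or reduce by the first production; parser generators typically resolve this in favor of shifting, which attaches the ELSE to the nearest IF.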