0% found this document useful (0 votes)
3 views

Ch04

Uploaded by

SARANYA M
Copyright
© © All Rights Reserved
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views

Ch04

Uploaded by

SARANYA M
Copyright
© © All Rights Reserved
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 46

Chapter 4

Lexical and Syntax


Analysis

ISBN 0-321-49362-1
Chapter 4 Topics

• Introduction
• Lexical Analysis
• The Parsing Problem
• Recursive-Descent Parsing
• Bottom-Up Parsing

Copyright © 2009 Addison-Wesley. All rights reserved. 1-2


Introduction

• Language implementation systems must


analyze source code, regardless of the
specific implementation approach
• Nearly all syntax analysis is based on a
formal description of the syntax of the
source language (BNF)

Copyright © 2009 Addison-Wesley. All rights reserved. 1-3


Syntax Analysis

• The syntax analysis portion of a language


processor nearly always consists of two
parts:
– A low-level part called a lexical analyzer
(mathematically, a finite automaton based on
a regular grammar)
– A high-level part called a syntax analyzer, or
parser (mathematically, a push-down
automaton based on a context-free grammar,
or BNF)

Copyright © 2009 Addison-Wesley. All rights reserved. 1-4


Advantages of Using BNF to Describe
Syntax

• Provides a clear and concise syntax


description
• The parser can be based directly on the
BNF
• Parsers based on BNF are easy to
maintain

Copyright © 2009 Addison-Wesley. All rights reserved. 1-5


Reasons to Separate Lexical and
Syntax Analysis

• Simplicity - less complex approaches can


be used for lexical analysis; separating
them simplifies the parser
• Efficiency - separation allows optimization
of the lexical analyzer
• Portability - parts of the lexical analyzer
may not be portable, but the parser
always is portable

Copyright © 2009 Addison-Wesley. All rights reserved. 1-6


Lexical Analysis

• A lexical analyzer is a pattern matcher for


character strings
• A lexical analyzer is a “front-end” for the
parser
• Identifies substrings of the source
program that belong together - lexemes
– Lexemes match a character pattern, which is
associated with a lexical category called a
token
– sum is a lexeme; its token may be IDENT

Copyright © 2009 Addison-Wesley. All rights reserved. 1-7


Lexical Analysis (continued)

• The lexical analyzer is usually a function that is


called by the parser when it needs the next token
• Three approaches to building a lexical analyzer:
– Write a formal description of the tokens and use a
software tool that constructs table-driven lexical
analyzers given such a description
– Design a state diagram that describes the tokens and
write a program that implements the state diagram
– Design a state diagram that describes the tokens and
hand-construct a table-driven implementation of the
state diagram

Copyright © 2009 Addison-Wesley. All rights reserved. 1-8


State Diagram Design

– A naïve state diagram would have a transition


from every state on every character in the
source language - such a diagram would be
very large!

Copyright © 2009 Addison-Wesley. All rights reserved. 1-9


Lexical Analysis (cont.)

• In many cases, transitions can be


combined to simplify the state diagram
– When recognizing an identifier, all uppercase
and lowercase letters are equivalent
• Use a character class that includes all letters
– When recognizing an integer literal, all digits
are equivalent - use a digit class

Copyright © 2009 Addison-Wesley. All rights reserved. 1-10


Lexical Analysis (cont.)

• Reserved words and identifiers can be


recognized together (rather than having a
part of the diagram for each reserved
word)
– Use a table lookup to determine whether a
possible identifier is in fact a reserved word

Copyright © 2009 Addison-Wesley. All rights reserved. 1-11


Lexical Analysis (cont.)

• Convenient utility subprograms:


– getChar - gets the next character of input,
puts it in nextChar, determines its class and
puts the class in charClass
– addChar - puts the character from nextChar
into the place the lexeme is being
accumulated, lexeme
– lookup - determines whether the string in
lexeme is a reserved word (returns a code)

Copyright © 2009 Addison-Wesley. All rights reserved. 1-12


State Diagram

Copyright © 2009 Addison-Wesley. All rights reserved. 1-13


Lexical Analyzer

Implementation:
 SHOW front.c (pp. 176-181)

- Following is the output of the lexical analyzer


of
front.c when used on (sum + 47) / total

Next token is: 25 Next lexeme is (


Next token is: 11 Next lexeme is sum
Next token is: 21 Next lexeme is +
Next token is: 10 Next lexeme is 47
Next token is: 26 Next lexeme is )
Next token is: 24 Next lexeme is /
Next token is: 11 Next lexeme is total
Next token is: -1 Next lexeme is EOF
Copyright © 2009 Addison-Wesley. All rights reserved. 1-14
The Parsing Problem

• Goals of the parser, given an input


program:
– Find all syntax errors; for each, produce an
appropriate diagnostic message and recover
quickly
– Produce the parse tree, or at least a trace of
the parse tree, for the program

Copyright © 2009 Addison-Wesley. All rights reserved. 1-15


The Parsing Problem (cont.)

• Two categories of parsers


– Top down - produce the parse tree, beginning
at the root
• Order is that of a leftmost derivation
• Traces or builds the parse tree in preorder
– Bottom up - produce the parse tree, beginning
at the leaves
• Order is that of the reverse of a rightmost
derivation
• Useful parsers look only one token ahead
in the input
Copyright © 2009 Addison-Wesley. All rights reserved. 1-16
The Parsing Problem (cont.)

• Top-down Parsers
– Given a sentential form, xA , the parser must
choose the correct A-rule to get the next
sentential form in the leftmost derivation,
using only the first token produced by A
• The most common top-down parsing
algorithms:
– Recursive descent - a coded implementation
– LL parsers - table driven implementation

Copyright © 2009 Addison-Wesley. All rights reserved. 1-17


The Parsing Problem (cont.)

• Bottom-up parsers
– Given a right sentential form, , determine
what substring of  is the right-hand side of
the rule in the grammar that must be reduced
to produce the previous sentential form in the
right derivation
– The most common bottom-up parsing
algorithms are in the LR family

Copyright © 2009 Addison-Wesley. All rights reserved. 1-18


The Parsing Problem (cont.)

• The Complexity of Parsing


– Parsers that work for any unambiguous
grammar are complex and inefficient ( O(n3),
where n is the length of the input )
– Compilers use parsers that only work for a
subset of all unambiguous grammars, but do it
in linear time ( O(n), where n is the length of
the input )

Copyright © 2009 Addison-Wesley. All rights reserved. 1-19


Recursive-Descent Parsing

• There is a subprogram for each


nonterminal in the grammar, which can
parse sentences that can be generated by
that nonterminal
• EBNF is ideally suited for being the basis
for a recursive-descent parser, because
EBNF minimizes the number of
nonterminals

Copyright © 2009 Addison-Wesley. All rights reserved. 1-20


Recursive-Descent Parsing (cont.)

• A grammar for simple expressions:

<expr>  <term> {(+ | -) <term>}


<term>  <factor> {(* | /) <factor>}
<factor>  id | int_constant | ( <expr> )

Copyright © 2009 Addison-Wesley. All rights reserved. 1-21


Recursive-Descent Parsing (cont.)

• Assume we have a lexical analyzer named


lex, which puts the next token code in
nextToken
• The coding process when there is only one
RHS:
– For each terminal symbol in the RHS, compare
it with the next input token; if they match,
continue, else there is an error
– For each nonterminal symbol in the RHS, call
its associated parsing subprogram

Copyright © 2009 Addison-Wesley. All rights reserved. 1-22


Recursive-Descent Parsing (cont.)
/* Function expr
Parses strings in the language
generated by the rule:
<expr> → <term> {(+ | -) <term>}
*/

void expr() {

/* Parse the first term */

term();
/* As long as the next token is + or -, call
lex to get the next token and parse the
next term */

while (nextToken == ADD_OP ||


nextToken == SUB_OP){
lex();
term();
}
}

Copyright © 2009 Addison-Wesley. All rights reserved. 1-23


Recursive-Descent Parsing (cont.)

• This particular routine does not detect errors


• Convention: Every parsing routine leaves the
next token in nextToken

Copyright © 2009 Addison-Wesley. All rights reserved. 1-24


Recursive-Descent Parsing (cont.)

• A nonterminal that has more than one


RHS requires an initial process to
determine which RHS it is to parse
– The correct RHS is chosen on the basis of the
next token of input (the lookahead)
– The next token is compared with the first
token that can be generated by each RHS until
a match is found
– If no match is found, it is a syntax error

Copyright © 2009 Addison-Wesley. All rights reserved. 1-25


Recursive-Descent Parsing (cont.)

/* term
Parses strings in the language generated by the rule:
<term> -> <factor> {(* | /) <factor>)
*/
void term() {
printf("Enter <term>\n");
/* Parse the first factor */
factor();
/* As long as the next token is * or /,
next token and parse the next factor */
while (nextToken == MULT_OP || nextToken == DIV_OP) {
lex();
factor();
}
printf("Exit <term>\n");
} /* End of function term */

Copyright © 2009 Addison-Wesley. All rights reserved. 1-26


Recursive-Descent Parsing (cont.)

/* Function factor
Parses strings in the language
generated by the rule:
<factor> -> id | (<expr>) */

void factor() {

/* Determine which RHS */


if (nextToken) == ID_CODE || nextToken == INT_CODE)

/* For the RHS id, just call lex */


lex();

/* If the RHS is (<expr>) – call lex to pass over the left parenthesis,
call expr, and check for the right parenthesis */
else if (nextToken == LP_CODE) {
lex();
expr();
if (nextToken == RP_CODE)
lex();
else
error();
} /* End of else if (nextToken == ... */

else error(); /* Neither RHS matches */


}

Copyright © 2009 Addison-Wesley. All rights reserved. 1-27


Recursive-Descent Parsing (cont.)
- Trace of the lexical and syntax analyzers on (sum + 47) / total

Next token is: 25 Next lexeme is ( Next token is: 11 Next lexeme is total
Enter <expr> Enter <factor>
Enter <term> Next token is: -1 Next lexeme is EOF
Enter <factor> Exit <factor>
Next token is: 11 Next lexeme is sum Exit <term>
Enter <expr> Exit <expr>
Enter <term>
Enter <factor>
Next token is: 21 Next lexeme is +
Exit <factor>
Exit <term>
Next token is: 10 Next lexeme is 47
Enter <term>
Enter <factor>
Next token is: 26 Next lexeme is )
Exit <factor>
Exit <term>
Exit <expr>
Next token is: 24 Next lexeme is /
Exit <factor>

Copyright © 2009 Addison-Wesley. All rights reserved. 1-28


Recursive-Descent Parsing (cont.)

• The LL Grammar Class


– The Left Recursion Problem
• If a grammar has left recursion, either direct or
indirect, it cannot be the basis for a top-down
parser
– A grammar can be modified to remove left recursion
For each nonterminal, A,
1. Group the A-rules as A → Aα1 | … | Aαm | β1 | β2 | … |
βn
where none of the β‘s begins with A
2. Replace the original A-rules with
A → β1A’ | β2A’ | … | βnA’
A’ → α1A’ | α2A’ | … | αmA’ | ε
Copyright © 2009 Addison-Wesley. All rights reserved. 1-29
Recursive-Descent Parsing (cont.)

• The other characteristic of grammars that


disallows top-down parsing is the lack of
pairwise disjointness
– The inability to determine the correct RHS on
the basis of one token of lookahead
– Def: FIRST() = {a |  =>* a }
(If  =>* ,  is in FIRST())

Copyright © 2009 Addison-Wesley. All rights reserved. 1-30


Recursive-Descent Parsing (cont.)

• Pairwise Disjointness Test:


– For each nonterminal, A, in the grammar that
has more than one RHS, for each pair of rules,
A  i and A  j, it must be true that
FIRST(i) ⋂ FIRST(j) = 
• Examples:
A  a | bB | cAb
A  a | aB

Copyright © 2009 Addison-Wesley. All rights reserved. 1-31


Recursive-Descent Parsing (cont.)

• Left factoring can resolve the problem


Replace
<variable>  identifier | identifier
[<expression>]
with
<variable>  identifier <new>
<new>   | [<expression>]
or
<variable>  identifier [[<expression>]]
(the outer brackets are metasymbols of EBNF)

Copyright © 2009 Addison-Wesley. All rights reserved. 1-32


Bottom-up Parsing

• The parsing problem is finding the correct


RHS in a right-sentential form to reduce to
get the previous right-sentential form in
the derivation

Copyright © 2009 Addison-Wesley. All rights reserved. 1-33


Bottom-up Parsing (cont.)

•Intuition about handles:


– Def:  is the handle of the right sentential form
 = w if and only if S =>*rm Aw =>rm w

– Def:  is a phrase of the right sentential form


 if and only if S =>*  = 1A2 =>+ 12

– Def:  is a simple phrase of the right sentential


form  if and only if S =>*  = 1A2 => 12

Copyright © 2009 Addison-Wesley. All rights reserved. 1-34


Bottom-up Parsing (cont.)

• Intuition about handles (continued):


– The handle of a right sentential form is its
leftmost simple phrase
– Given a parse tree, it is now easy to find the
handle
– Parsing can be thought of as handle pruning

Copyright © 2009 Addison-Wesley. All rights reserved. 1-35


Bottom-up Parsing (cont.)

• Shift-Reduce Algorithms
– Reduce is the action of replacing the handle
on the top of the parse stack with its
corresponding LHS
– Shift is the action of moving the next token to
the top of the parse stack

Copyright © 2009 Addison-Wesley. All rights reserved. 1-36


Bottom-up Parsing (cont.)

• Advantages of LR parsers:
– They will work for nearly all grammars that
describe programming languages.
– They work on a larger class of grammars than
other bottom-up algorithms, but are as
efficient as any other bottom-up parser.
– They can detect syntax errors as soon as it is
possible.
– The LR class of grammars is a superset of the
class parsable by LL parsers.

Copyright © 2009 Addison-Wesley. All rights reserved. 1-37


Bottom-up Parsing (cont.)

• LR parsers must be constructed with a


tool
• Knuth’s insight: A bottom-up parser could
use the entire history of the parse, up to
the current point, to make parsing
decisions
– There were only a finite and relatively small
number of different parse situations that could
have occurred, so the history could be stored
in a parser state, on the parse stack

Copyright © 2009 Addison-Wesley. All rights reserved. 1-38


Bottom-up Parsing (cont.)

• An LR configuration stores the state of an


LR parser

(S0X1S1X2S2…XmSm, aiai+1…an$)

Copyright © 2009 Addison-Wesley. All rights reserved. 1-39


Bottom-up Parsing (cont.)

• LR parsers are table driven, where the


table has two components, an ACTION
table and a GOTO table
– The ACTION table specifies the action of the
parser, given the parser state and the next
token
• Rows are state names; columns are terminals
– The GOTO table specifies which state to put
on top of the parse stack after a reduction
action is done
• Rows are state names; columns are
nonterminals
Copyright © 2009 Addison-Wesley. All rights reserved. 1-40
Structure of An LR Parser

Copyright © 2009 Addison-Wesley. All rights reserved. 1-41


Bottom-up Parsing (cont.)

• Initial configuration: (S0, a1…an$)


• Parser actions:
– If ACTION[Sm, ai] = Shift S, the next
configuration is:
(S0X1S1X2S2…XmSmaiS, ai+1…an$)
– If ACTION[Sm, ai] = Reduce A   and S =
GOTO[Sm-r, A], where r = the length of , the
next configuration is
(S0X1S1X2S2…Xm-rSm-rAS, aiai+1…an$)

Copyright © 2009 Addison-Wesley. All rights reserved. 1-42


Bottom-up Parsing (cont.)

• Parser actions (continued):


– If ACTION[Sm, ai] = Accept, the parse is
complete and no errors were found.
– If ACTION[Sm, ai] = Error, the parser calls an
error-handling routine.

Copyright © 2009 Addison-Wesley. All rights reserved. 1-43


LR Parsing Table

Copyright © 2009 Addison-Wesley. All rights reserved. 1-44


Bottom-up Parsing (cont.)

• A parser table can be generated from a


given grammar with a tool, e.g., yacc

Copyright © 2009 Addison-Wesley. All rights reserved. 1-45


Summary

• Syntax analysis is a common part of language


implementation
• A lexical analyzer is a pattern matcher that
isolates small-scale parts of a program
– Detects syntax errors
– Produces a parse tree
• A recursive-descent parser is an LL parser
– EBNF
• Parsing problem for bottom-up parsers: find the
substring of current sentential form
• The LR family of shift-reduce parsers is the most
common bottom-up parsing approach

Copyright © 2009 Addison-Wesley. All rights reserved. 1-46

You might also like