Lexical and Syntax Analysis

This document discusses lexical and syntax analysis in programming language compilers. It covers: 1) Lexical analysis breaks source code into tokens by matching patterns. A lexical analyzer assigns token codes and collects characters into lexemes. 2) Syntax analysis checks the syntax of a program and builds a parse tree. It is usually based on a context-free grammar. 3) Nearly all compilers separate lexical and syntax analysis for simplicity, efficiency, and portability. The lexical analyzer handles small language units like names and literals, while the syntax analyzer handles large constructs.

Uploaded by

Lashawn Brown

4.

LEXICAL AND SYNTAX ANALYSIS

CSc 4330/6330 4-1 9/15

Programming Language Concepts
Introduction

• Chapter 1 described three approaches to implementing programming languages:
compilation, pure interpretation, and hybrid implementation. All three use both a
lexical analyzer and a syntax analyzer.

• The job of a syntax analyzer is to check the syntax of a program and create a
parse tree from it.

• Syntax analyzers, or parsers, are nearly always based on a formal description of
the syntax of programs, usually in the form of a context-free grammar or BNF.

• Advantages of using BNF:

BNF descriptions are clear and concise, both for humans and software systems.
Syntax analyzers can be generated directly from BNF.
Implementations based on BNF are easy to maintain.

• Nearly all compilers separate the task of analyzing syntax into two distinct parts:
The lexical analyzer deals with small-scale language constructs, such as names
and numeric literals.
The syntax analyzer deals with large-scale constructs, such as expressions,
statements, and program units.

• Reasons for the separation:

Simplicity: Removing the details of lexical analysis from the syntax analyzer
makes it smaller and less complex.
Efficiency: It becomes easier to optimize the lexical analyzer.
Portability: The lexical analyzer reads source files, so it may be
platform-dependent.

Lexical Analysis

• A lexical analyzer collects input characters into groups (lexemes) and assigns an
internal code (a token) to each group.

• Lexemes are recognized by matching the input against patterns.

• Tokens are usually coded as integer values, but for the sake of readability, they
are often referenced through named constants.

• An example assignment statement:

result = oldsum - value / 100;

Tokens and lexemes of this statement:

Token        Lexeme
IDENT        result
ASSIGN_OP    =
IDENT        oldsum
SUB_OP       -
IDENT        value
DIV_OP       /
INT_LIT      100
SEMICOLON    ;

• In early compilers, lexical analyzers often processed an entire program file and
produced a file of tokens and lexemes. Now, most lexical analyzers are
subprograms that return the next lexeme and its associated token code when called.

• Other tasks performed by a lexical analyzer:

Skipping comments and white space between lexemes.
Inserting lexemes for user-defined names into the symbol table.
Detecting syntactic errors in tokens, such as ill-formed floating-point literals.
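
• A minimal sketch of a lexical analyzer written as a subprogram that returns the next lexeme and its token code on each call, applied to the assignment statement above. The function name next_token, the codes TOK_EOF and TOK_ERROR, and the value chosen for SEMICOLON are assumptions made for this illustration; only the token names come from the table.

```c
#include <ctype.h>
#include <stddef.h>

/* Token codes. The names follow the table above; the numeric values and
   the TOK_EOF / TOK_ERROR codes are assumptions for this sketch. */
enum { TOK_EOF = -1, INT_LIT = 10, IDENT = 11, ASSIGN_OP = 20,
       SUB_OP = 22, DIV_OP = 24, SEMICOLON = 27, TOK_ERROR = 99 };

/* Scans one lexeme starting at *src, copies it into lexeme (capacity
   cap, which must be at least 1), advances *src past the lexeme, and
   returns the token code. Characters that do not fit are dropped. */
int next_token(const char **src, char *lexeme, size_t cap) {
    const char *p = *src;
    size_t n = 0;
    int tok;

    while (isspace((unsigned char)*p))          /* skip white space */
        p++;

    if (*p == '\0') {
        tok = TOK_EOF;
    } else if (isalpha((unsigned char)*p)) {    /* name: letter {letter | digit} */
        while (isalnum((unsigned char)*p)) {
            if (n + 1 < cap) lexeme[n++] = *p;
            p++;
        }
        tok = IDENT;
    } else if (isdigit((unsigned char)*p)) {    /* integer literal: digit {digit} */
        while (isdigit((unsigned char)*p)) {
            if (n + 1 < cap) lexeme[n++] = *p;
            p++;
        }
        tok = INT_LIT;
    } else {                                    /* single-character tokens */
        switch (*p) {
        case '=': tok = ASSIGN_OP; break;
        case '-': tok = SUB_OP;    break;
        case '/': tok = DIV_OP;    break;
        case ';': tok = SEMICOLON; break;
        default:  tok = TOK_ERROR; break;
        }
        if (n + 1 < cap) lexeme[n++] = *p;
        p++;
    }
    lexeme[n] = '\0';
    *src = p;
    return tok;
}
```

Calling next_token repeatedly on "result = oldsum - value / 100;" should yield the eight token and lexeme pairs listed in the table, followed by TOK_EOF.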

Lexical Analysis (Continued)

• Approaches to building a lexical analyzer:

Write a formal description of the token patterns of the language and use a
software tool such as lex to automatically generate a lexical analyzer.
Design a state transition diagram that describes the token patterns of the
language and write a program that implements the diagram.
Design a state transition diagram that describes the token patterns of the
language and hand-construct a table-driven implementation of the state diagram.

• A state transition diagram, or state diagram, is a directed graph.

The nodes are labeled with state names.
The arcs are labeled with input characters.
An arc may also include actions to be done when the transition is taken.
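
• The third approach, a hand-constructed table-driven implementation, can be sketched as follows. This is an illustration under assumptions, not code from the course: the state names, the HALT marker, and the function scan are hypothetical, and the diagram covers only names and integer literals. Each row of the table is a node of the diagram, and each column is a character class labeling an arc.

```c
#include <ctype.h>

/* Character classes (arc labels) and states (nodes) of a small diagram. */
enum { C_LETTER, C_DIGIT, C_OTHER, NUM_CLASSES };
enum { S_START, S_NAME, S_INT, NUM_STATES };
#define HALT (-1)   /* no arc for this character: the lexeme ends here */

/* transition[state][class] gives the next state, or HALT. */
static const int transition[NUM_STATES][NUM_CLASSES] = {
    /*             LETTER   DIGIT    OTHER */
    /* S_START */ { S_NAME,  S_INT,   HALT },
    /* S_NAME  */ { S_NAME,  S_NAME,  HALT },
    /* S_INT   */ { HALT,    S_INT,   HALT },
};

static int char_class(int c) {
    if (isalpha(c)) return C_LETTER;
    if (isdigit(c)) return C_DIGIT;
    return C_OTHER;   /* includes the terminating '\0' */
}

/* Follows arcs from S_START over text until no arc applies.
   Returns the number of characters consumed and stores the state
   that was reached in *final_state. */
int scan(const char *text, int *final_state) {
    int state = S_START;
    int len = 0;

    for (;;) {
        int next = transition[state][char_class((unsigned char)text[len])];
        if (next == HALT)
            break;
        state = next;
        len++;
    }
    *final_state = state;
    return len;
}
```

With this table, scanning "sum47+x" consumes five characters and stops in S_NAME, while scanning "47abc" consumes two characters and stops in S_INT. Changing the language recognized means editing the table rather than the loop.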

Lexical Analysis (Continued)

• Consider the problem of building a lexical analyzer that recognizes lexemes that
appear in arithmetic expressions, including variable names and integer literals.
Names consist of uppercase letters, lowercase letters, and digits, but must begin
with a letter.
Names have no length limitations.

• To simplify the transition diagram, we can treat all letters the same way, having
one transition from a state instead of 52 transitions. In the lexical analyzer,
LETTER will represent the class of all 52 letters.

• Integer literals offer another opportunity to simplify the transition diagram.
Instead of having 10 transitions from a state, it is better to group digits into a
single character class (named DIGIT) and have one transition.

• The lexical analyzer will use a string (or character array) named lexeme to
store a lexeme as it is being read.

• Utility subprograms needed by the lexical analyzer:

getChar—Gets the next input character and puts it in a global variable named
nextChar. Also determines the character class of the input character and
puts it in the global variable charClass.
addChar—Adds the character in nextChar to the end of lexeme.
getNonBlank—Skips white space.
lookup—Computes the token code for single-character tokens (parentheses
and arithmetic operators).

Lexical Analysis (Continued)

• A state diagram that recognizes names, integer literals, parentheses, and
arithmetic operators:

The diagram includes the actions required on each transition.

Lexical Analysis (Continued)

• C code for a lexical analyzer that implements this state diagram:

/* front.c - a lexical analyzer system for simple
   arithmetic expressions */

#include <stdio.h>
#include <ctype.h>

/* Global declarations */
/* Variables */
int charClass;
char lexeme[100];
int nextChar;   /* int rather than char, so that it can hold EOF */
int lexLen;
int token;
int nextToken;
FILE *in_fp;

/* Function declarations */
int lookup(char ch);
void addChar(void);
void getChar(void);
void getNonBlank(void);
int lex(void);

/* Character classes */
#define LETTER 0
#define DIGIT 1
#define UNKNOWN 99

/* Token codes */
#define INT_LIT 10
#define IDENT 11
#define ASSIGN_OP 20
#define ADD_OP 21
#define SUB_OP 22
#define MULT_OP 23
#define DIV_OP 24
#define LEFT_PAREN 25
#define RIGHT_PAREN 26

/******************************************************/
/* main driver */
int main(void) {
  /* Open the input data file and process its contents */
  if ((in_fp = fopen("front.in", "r")) == NULL)
    printf("ERROR - cannot open front.in \n");
  else {
    getChar();
    do {
      lex();
    } while (nextToken != EOF);
  }
  return 0;
}

/******************************************************/
/* lookup - a function to look up operators and
   parentheses and return the token */
int lookup(char ch) {
  switch (ch) {
    case '(':
      addChar();
      nextToken = LEFT_PAREN;
      break;
    case ')':
      addChar();
      nextToken = RIGHT_PAREN;
      break;
    case '+':
      addChar();
      nextToken = ADD_OP;
      break;
    case '-':
      addChar();
      nextToken = SUB_OP;
      break;
    case '*':
      addChar();
      nextToken = MULT_OP;
      break;
    case '/':
      addChar();
      nextToken = DIV_OP;
      break;
    default:
      addChar();
      nextToken = EOF;
      break;
  }
  return nextToken;
}

/******************************************************/
/* addChar - a function to add nextChar to lexeme */
void addChar(void) {
  if (lexLen <= 98) {
    lexeme[lexLen++] = nextChar;
    lexeme[lexLen] = '\0';
  }
  else
    printf("Error - lexeme is too long \n");
}

/******************************************************/
/* getChar - a function to get the next character of
   input and determine its character class */
void getChar(void) {
  if ((nextChar = getc(in_fp)) != EOF) {
    if (isalpha(nextChar))
      charClass = LETTER;
    else if (isdigit(nextChar))
      charClass = DIGIT;
    else
      charClass = UNKNOWN;
  }
  else
    charClass = EOF;
}

/******************************************************/
/* getNonBlank - a function to call getChar until it
   returns a non-whitespace character */
void getNonBlank(void) {
  while (isspace(nextChar))
    getChar();
}

/******************************************************/
/* lex - a simple lexical analyzer for arithmetic
   expressions */
int lex(void) {
  lexLen = 0;
  getNonBlank();
  switch (charClass) {

    /* Identifiers */
    case LETTER:
      addChar();
      getChar();
      while (charClass == LETTER || charClass == DIGIT) {
        addChar();
        getChar();
      }
      nextToken = IDENT;
      break;

    /* Integer literals */
    case DIGIT:
      addChar();
      getChar();
      while (charClass == DIGIT) {
        addChar();
        getChar();
      }
      nextToken = INT_LIT;
      break;

    /* Parentheses and operators */
    case UNKNOWN:
      lookup(nextChar);
      getChar();
      break;

    /* EOF */
    case EOF:
      nextToken = EOF;
      lexeme[0] = 'E';
      lexeme[1] = 'O';
      lexeme[2] = 'F';
      lexeme[3] = '\0';
      break;
  } /* End of switch */

  printf("Next token is: %d, Next lexeme is %s\n",
         nextToken, lexeme);
  return nextToken;
} /* End of function lex */

Lexical Analysis (Continued)

• Sample input for front.c:

(sum + 47) / total

• Output of front.c:
Next token is: 25, Next lexeme is (
Next token is: 11, Next lexeme is sum
Next token is: 21, Next lexeme is +
Next token is: 10, Next lexeme is 47
Next token is: 26, Next lexeme is )
Next token is: 24, Next lexeme is /
Next token is: 11, Next lexeme is total
Next token is: -1, Next lexeme is EOF

Introduction to Parsing

• Syntax analysis is often referred to as parsing.

• Responsibilities of a syntax analyzer, or parser:

Determine whether the input program is syntactically correct.
Produce a parse tree. In some cases, the parse tree is only implicitly constructed.

• When an error is found, a parser must produce a diagnostic message and recover.
Recovery is required so that the compiler finds as many errors as possible.

• Parsers are categorized according to the direction in which they build parse
trees:
Top-down parsers build the tree from the root downward to the leaves.
Bottom-up parsers build the tree from the leaves upward to the root.

• Notational conventions for grammar symbols and strings:

Terminal symbols—Lowercase letters at the beginning of the alphabet (a, b, …)
Nonterminal symbols—Uppercase letters at the beginning of the alphabet (A, B,
…)
Terminals or nonterminals—Uppercase letters at the end of the alphabet (W, X,
Y, Z)
Strings of terminals—Lowercase letters at the end of the alphabet (w, x, y, z)
Mixed strings (terminals and/or nonterminals)—Lowercase Greek letters (α, β,
γ, δ)

Top-Down Parsers

• A top-down parser traces or builds the parse tree in preorder: each node is visited
before its branches are followed.

• The actions taken by a top-down parser correspond to a leftmost derivation.

• Given a sentential form xAα that is part of a leftmost derivation, a top-down
parser’s task is to find the next sentential form in that leftmost derivation.
Determining the next sentential form is a matter of choosing the correct
grammar rule that has A as its left-hand side (LHS).
If the A-rules are A → bB, A → cBb, and A → a, the next sentential form could
be xbBα, xcBbα, or xaα.
The most commonly used top-down parsing algorithms choose an A-rule based
on the token that would be the first generated by A.

• The most common top-down parsing algorithms are closely related.

A recursive-descent parser is coded directly from the BNF description of the
syntax of a language.
An alternative is to use a parsing table rather than code.

• Both are LL algorithms, and both are equally powerful. The first L in LL
specifies a left-to-right scan of the input; the second L specifies that a leftmost
derivation is generated.
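
• Choosing an A-rule from the first token it would generate can be sketched for the rules A → bB, A → cBb, and A → a given above. The text does not define B, so the rule B → d used here is a hypothetical stand-in; the function names and the convention that each input character is one token are also assumptions of this sketch.

```c
/* Each character of the input string is treated as one token. */
static const char *input;
static int pos;

/* Hypothetical rule B -> d (B is not defined in the text). */
static int parse_B(void) {
    if (input[pos] == 'd') {
        pos++;
        return 1;
    }
    return 0;
}

/* Picks the A-rule by looking only at the next token. */
static int parse_A(void) {
    switch (input[pos]) {
    case 'b':                       /* A -> bB */
        pos++;
        return parse_B();
    case 'c':                       /* A -> cBb */
        pos++;
        if (!parse_B())
            return 0;
        if (input[pos] != 'b')
            return 0;
        pos++;
        return 1;
    case 'a':                       /* A -> a */
        pos++;
        return 1;
    default:                        /* no A-rule starts with this token */
        return 0;
    }
}

/* Returns 1 if s is exactly one string derivable from A, else 0. */
int recognize_A(const char *s) {
    input = s;
    pos = 0;
    return parse_A() && input[pos] == '\0';
}
```

Under these assumptions recognize_A accepts "bd", "cdb", and "a" and rejects strings such as "cd". The choice is possible without backtracking because the three RHSs begin with three different terminals.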

Bottom-Up Parsers

• A bottom-up parser constructs a parse tree by beginning at the leaves and
progressing toward the root. This parse order corresponds to the reverse of a
rightmost derivation.

• Given a right sentential form α, a bottom-up parser must determine what
substring of α is the right-hand side (RHS) of the rule that must be reduced to its
LHS to produce the previous right sentential form.

• A given right sentential form may include more than one RHS from the
grammar. The correct RHS to reduce is called the handle.

• Consider the following grammar and derivation:

S → aAc
A → aA | b

S => aAc => aaAc => aabc

A bottom-up parser can easily find the first handle, b, because it is the only RHS
in the sentence aabc. After replacing b by the corresponding LHS, A, the parser
is left with the sentential form aaAc. Finding the next handle will be more
difficult because both aAc and aA are potential handles.

• A bottom-up parser finds the handle of a given right sentential form by
examining the symbols on one or both sides of a possible handle.

• The most common bottom-up parsing algorithms are in the LR family. The L
specifies a left-to-right scan and the R specifies that a rightmost derivation is
generated.

The Complexity of Parsing

• Parsing algorithms that work for any grammar are inefficient. The worst-case
complexity of common parsing algorithms is O(n³), where n is the length of the
input, making them impractical for use in compilers.

• Faster algorithms work for only a subset of all possible grammars. These
algorithms are acceptable as long as they can parse grammars that describe
programming languages.

• Parsing algorithms used in commercial compilers have complexity O(n).

The Recursive-Descent Parsing Process

• A recursive-descent parser consists of a collection of subprograms, many of
which are recursive; it produces a parse tree in top-down order.

• A recursive-descent parser has one subprogram for each nonterminal in the
grammar.

• EBNF is ideally suited for recursive-descent parsers.

• An EBNF description of simple arithmetic expressions:

<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
<factor> → id | int_constant | ( <expr> )

• These rules can be used to construct a recursive-descent function named expr
that parses arithmetic expressions.

• The lexical analyzer is assumed to be a function named lex. It reads a lexeme
and puts its token code in the global variable nextToken. Token codes are
defined as named constants.

The Recursive-Descent Parsing Process (Continued)

• Writing a recursive-descent subprogram for a rule with a single RHS is
relatively simple.
For each terminal symbol in the RHS, that terminal symbol is compared with
nextToken. If they do not match, it is a syntax error. If they match, the
lexical analyzer is called to get to the next input token.
For each nonterminal, the parsing subprogram for that nonterminal is called.

• A recursive-descent subprogram for <expr>, written in C:

/* expr
   Parses strings in the language generated by the rule:
   <expr> -> <term> {(+ | -) <term>}
*/
void expr(void) {
  printf("Enter <expr>\n");

  /* Parse the first term */
  term();

  /* As long as the next token is + or -, get
     the next token and parse the next term */
  while (nextToken == ADD_OP || nextToken == SUB_OP) {
    lex();
    term();
  }
  printf("Exit <expr>\n");
}

• Each recursive-descent subprogram, including expr, leaves the next input
token in nextToken.

• expr does not include any code for syntax error detection or recovery, because
there are no detectable errors associated with the rule for <expr>.

The Recursive-Descent Parsing Process (Continued)

• The subprogram for <term> is similar to that for <expr>:

/* term
   Parses strings in the language generated by the rule:
   <term> -> <factor> {(* | /) <factor>}
*/
void term(void) {
  printf("Enter <term>\n");

  /* Parse the first factor */
  factor();

  /* As long as the next token is * or /, get the
     next token and parse the next factor */
  while (nextToken == MULT_OP || nextToken == DIV_OP) {
    lex();
    factor();
  }
  printf("Exit <term>\n");
}

The Recursive-Descent Parsing Process (Continued)

• A recursive-descent parsing subprogram for a nonterminal whose rule has more
than one RHS must examine the value of nextToken to determine which RHS
is to be parsed.

• The recursive-descent subprogram for <factor> must choose among three RHSs:
/* factor
   Parses strings in the language generated by the rule:
   <factor> -> id | int_constant | ( <expr> )
*/
void factor(void) {
  printf("Enter <factor>\n");
  /* Determine which RHS */
  if (nextToken == IDENT || nextToken == INT_LIT)
    /* Get the next token */
    lex();
  /* If the RHS is ( <expr> ), call lex to pass over the
     left parenthesis, call expr, and check for the right
     parenthesis */
  else {
    if (nextToken == LEFT_PAREN) {
      lex();
      expr();
      if (nextToken == RIGHT_PAREN)
        lex();
      else
        error();
    }
    /* It was not an id, an integer literal, or a left
       parenthesis */
    else
      error();
  }

  printf("Exit <factor>\n");
}

• The error function is called when a syntax error is detected. A real parser
would produce a diagnostic message and attempt to recover from the error.

The Recursive-Descent Parsing Process (Continued)

• Trace of the parse of (sum + 47) / total:

Next token is: 25, Next lexeme is (
Enter <expr>
Enter <term>
Enter <factor>
Next token is: 11, Next lexeme is sum
Enter <expr>
Enter <term>
Enter <factor>
Next token is: 21, Next lexeme is +
Exit <factor>
Exit <term>
Next token is: 10, Next lexeme is 47
Enter <term>
Enter <factor>
Next token is: 26, Next lexeme is )
Exit <factor>
Exit <term>
Exit <expr>
Next token is: 24, Next lexeme is /
Exit <factor>
Next token is: 11, Next lexeme is total
Enter <factor>
Next token is: -1, Next lexeme is EOF
Exit <factor>
Exit <term>
Exit <expr>
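
• The error function is left undefined in the listings above. The following self-contained sketch shows one very simple possibility, assuming that counting errors is enough: error only increments a counter, and the parser reads from a string rather than from front.in. The token names with the T_ prefix, the errors counter, and parse_expression are hypothetical names introduced for this illustration.

```c
#include <ctype.h>

enum { T_END, T_IDENT, T_INT_LIT, T_ADD, T_SUB, T_MULT, T_DIV,
       T_LPAREN, T_RPAREN, T_BAD };

static const char *src;     /* remaining input text */
static int nextToken;       /* code of the current token */
static int errors;          /* number of syntax errors seen */

/* Reads the next lexeme from src and sets nextToken. */
static void lex(void) {
    while (isspace((unsigned char)*src)) src++;
    if (*src == '\0') { nextToken = T_END; return; }
    if (isalpha((unsigned char)*src)) {
        while (isalnum((unsigned char)*src)) src++;
        nextToken = T_IDENT;
        return;
    }
    if (isdigit((unsigned char)*src)) {
        while (isdigit((unsigned char)*src)) src++;
        nextToken = T_INT_LIT;
        return;
    }
    switch (*src++) {
    case '+': nextToken = T_ADD;    break;
    case '-': nextToken = T_SUB;    break;
    case '*': nextToken = T_MULT;   break;
    case '/': nextToken = T_DIV;    break;
    case '(': nextToken = T_LPAREN; break;
    case ')': nextToken = T_RPAREN; break;
    default:  nextToken = T_BAD;    break;
    }
}

/* Simplest possible error handling: count the error and continue. */
static void error(void) { errors++; }

static void expr(void);

/* <factor> -> id | int_constant | ( <expr> ) */
static void factor(void) {
    if (nextToken == T_IDENT || nextToken == T_INT_LIT) {
        lex();
    } else if (nextToken == T_LPAREN) {
        lex();
        expr();
        if (nextToken == T_RPAREN) lex();
        else error();
    } else {
        error();
    }
}

/* <term> -> <factor> {(* | /) <factor>} */
static void term(void) {
    factor();
    while (nextToken == T_MULT || nextToken == T_DIV) { lex(); factor(); }
}

/* <expr> -> <term> {(+ | -) <term>} */
static void expr(void) {
    term();
    while (nextToken == T_ADD || nextToken == T_SUB) { lex(); term(); }
}

/* Returns 1 if text is a complete, syntactically correct expression. */
int parse_expression(const char *text) {
    src = text;
    errors = 0;
    lex();
    expr();
    if (nextToken != T_END) error();   /* leftover input is also an error */
    return errors == 0;
}
```

With this sketch, parse_expression("(sum + 47) / total") returns 1 and parse_expression("(sum + 47") returns 0. A production parser would also report where the error occurred and try to resynchronize, as noted above.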

The Recursive-Descent Parsing Process (Continued)

• An EBNF description of the Java if statement:

<ifstmt> → if ( <boolexpr> ) <statement> [else <statement>]

• The recursive-descent subprogram for <ifstmt>:

/* ifstmt
   Parses strings in the language generated by the rule:
   <ifstmt> -> if (<boolexpr>) <statement>
               [else <statement>]
*/
void ifstmt(void) {
  if (nextToken != IF_CODE)
    error();
  else {
    lex();
    if (nextToken != LEFT_PAREN)
      error();
    else {
      lex();
      boolexpr();
      if (nextToken != RIGHT_PAREN)
        error();
      else {
        lex();
        statement();
        if (nextToken == ELSE_CODE) {
          lex();
          statement();
        }
      }
    }
  }
}

The LL Grammar Class

• Recursive-descent and other LL parsers can be used only with grammars that
meet certain restrictions.

• Left recursion causes a catastrophic problem for LL parsers.

• Calling the recursive-descent parsing subprogram for the following rule would
cause infinite recursion:
A → A + B

• The left recursion in the rule A → A + B is called direct left recursion, because
it occurs in one rule.

• An algorithm for eliminating direct left recursion from a grammar:

For each nonterminal A,
1. Group the A-rules as A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
where none of the β’s begins with A.
2. Replace the original A-rules with
A → β1A′ | β2A′ | … | βnA′
A′ → α1A′ | α2A′ | … | αmA′ | ε

• The symbol ε represents the empty string. A rule that has ε as its RHS is called
an erasure rule, because using it in a derivation effectively erases its LHS from
the sentential form.

The LL Grammar Class (Continued)

• Left recursion can easily be eliminated from the following grammar:

E → E + T | T
T → T * F | F
F → ( E ) | id
For the E-rules, we have α1 = + T and β1 = T, so we replace the E-rules with
E → T E′
E′ → + T E′ | ε
For the T-rules, we have α1 = * F and β1 = F, so we replace the T-rules with
T → F T′
T′ → * F T′ | ε
The F-rules remain the same.

• The grammar with left recursion removed:

E → T E′
E′ → + T E′ | ε
T → F T′
T′ → * F T′ | ε
F → ( E ) | id
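
• Once left recursion has been removed, each nonterminal can be coded as a function that terminates. The sketch below is an illustration under assumptions: each character of the input is one token, the letter i stands for id, and the function names (Eprime for E′, Tprime for T′, recognize) are hypothetical. An erasure rule is coded by returning success without consuming any input.

```c
/* Each character is one token; 'i' stands for the terminal id. */
static const char *tok;

static int E(void);

/* F -> ( E ) | id */
static int F(void) {
    if (*tok == 'i') {
        tok++;
        return 1;
    }
    if (*tok == '(') {
        tok++;
        if (!E())
            return 0;
        if (*tok != ')')
            return 0;
        tok++;
        return 1;
    }
    return 0;
}

/* T' -> * F T' | epsilon */
static int Tprime(void) {
    if (*tok == '*') {
        tok++;
        return F() && Tprime();
    }
    return 1;               /* erasure rule: consume nothing */
}

/* T -> F T' */
static int T(void) {
    return F() && Tprime();
}

/* E' -> + T E' | epsilon */
static int Eprime(void) {
    if (*tok == '+') {
        tok++;
        return T() && Eprime();
    }
    return 1;               /* erasure rule: consume nothing */
}

/* E -> T E' */
static int E(void) {
    return T() && Eprime();
}

/* Returns 1 if s is a sentence of the grammar, 0 otherwise. */
int recognize(const char *s) {
    tok = s;
    return E() && *tok == '\0';
}
```

Here recognize("i+i*i") returns 1 and recognize("i+") returns 0. Coding the original rule E → E + T the same way would make E call itself before consuming any token, which is the infinite recursion described earlier.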

• Indirect left recursion poses the same problem as direct left recursion:
A → BaA
B → Ab

• Algorithms exist that remove indirect left recursion from a grammar.

The LL Grammar Class (Continued)

• Left recursion is not the only grammar trait that disallows top-down parsing. A
top-down parser must always be able to choose the correct RHS on the basis of
the next token of input.

• The pairwise disjointness test is used to test a non-left-recursive grammar to
determine whether it can be parsed in a top-down fashion. This test requires
computing FIRST sets, where
FIRST(α) = {a | α =>* aβ}
The symbol =>* indicates a derivation of zero or more steps. If α =>* ε, then ε
is also in FIRST(α), where ε is the empty string.

• There are algorithms to compute FIRST for any mixed string. In simple cases,
FIRST can usually be computed by inspecting the grammar.

• The pairwise disjointness test:

For each nonterminal A that has more than one RHS, and for each pair of rules,
A → αi and A → αj, it must be true that FIRST(αi) ∩ FIRST(αj) = ∅.

• An example:
A → aB | bAb | Bb
B → cB | d
The FIRST sets for the RHSs of the A-rules are {a}, {b}, and {c, d}. These rules
pass the pairwise disjointness test.

• A second example:
A → aB | BAb
B → aB | b
The FIRST sets for the RHSs of the A-rules are {a} and {a, b}. These rules fail
the pairwise disjointness test.
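
• The test can be sketched in C for small grammars such as the two examples above. This is a simplified illustration, not a general algorithm: it assumes that nonterminals are single uppercase letters, terminals are single lowercase letters, and the grammar has no erasure rules and no left recursion, so that FIRST of a string is determined by its first symbol. The struct rule type and both function names are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* One grammar rule: lhs is an uppercase letter, rhs is a string of
   uppercase (nonterminal) and lowercase (terminal) letters. */
struct rule { char lhs; const char *rhs; };

/* The two example grammars from the text. */
const struct rule example1[] = {
    {'A', "aB"}, {'A', "bAb"}, {'A', "Bb"}, {'B', "cB"}, {'B', "d"}
};
const struct rule example2[] = {
    {'A', "aB"}, {'A', "BAb"}, {'B', "aB"}, {'B', "b"}
};

/* FIRST(alpha) as a bitmask: bit k is set when terminal 'a' + k is in
   the set. Assumes no erasure rules and no left recursion, so only the
   first symbol of alpha matters and the recursion terminates. */
static unsigned first_of(const struct rule *g, size_t n, const char *alpha) {
    unsigned set = 0;
    size_t i;

    if (islower((unsigned char)alpha[0]))
        return 1u << (alpha[0] - 'a');
    for (i = 0; i < n; i++)
        if (g[i].lhs == alpha[0])
            set |= first_of(g, n, g[i].rhs);
    return set;
}

/* Returns 1 if, for every pair of rules with the same LHS, the FIRST
   sets of the two RHSs are disjoint; returns 0 otherwise. */
int pairwise_disjoint(const struct rule *g, size_t n) {
    size_t i, j;

    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            if (g[i].lhs == g[j].lhs &&
                (first_of(g, n, g[i].rhs) & first_of(g, n, g[j].rhs)) != 0)
                return 0;
    return 1;
}
```

Under these assumptions pairwise_disjoint(example1, 5) returns 1 and pairwise_disjoint(example2, 4) returns 0, matching the two results stated above. Grammars with erasure rules need the fuller FIRST computation mentioned earlier.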

The LL Grammar Class (Continued)

• In many cases, a grammar that fails the pairwise disjointness test can be
modified so that it will pass the test.

• The following rules do not pass the pairwise disjointness test:

<variable> → identifier | identifier [ <expression> ]
(The square brackets are terminals, not metasymbols.) This problem can be solved
by left factoring.

• Using left factoring, these rules would be replaced by the following rules:
<variable> → identifier <new>
<new> → ε | [ <expression> ]

• Using EBNF can also help. The original rules for <variable> can be replaced by
the following EBNF rules:
<variable> → identifier [ [ <expression> ] ]
The outer brackets are metasymbols, and the inner brackets are terminals.

• Formal algorithms for left factoring exist.

• Left factoring cannot solve all pairwise disjointness problems. In some cases,
rules must be rewritten in other ways to eliminate the problem.
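
• A sketch of how the left-factored rules for <variable> might be coded. After factoring, the parser first consumes the common prefix (the identifier) and only then decides, from a single token, whether a subscript follows. The input encoding is an assumption: each character is one token, i stands for identifier, and n stands for an integer literal. The rule used for <expression> is a hypothetical stand-in, since the text does not define it.

```c
/* Each character is one token: 'i' = identifier, 'n' = integer literal. */
static const char *tok;

static int variable(void);

/* Hypothetical stand-in: <expression> -> int_literal | <variable> */
static int expression(void) {
    if (*tok == 'n') {
        tok++;
        return 1;
    }
    return variable();
}

/* <new> -> epsilon | [ <expression> ] */
static int new_part(void) {
    if (*tok != '[')
        return 1;           /* epsilon: no subscript follows */
    tok++;
    if (!expression())
        return 0;
    if (*tok != ']')
        return 0;
    tok++;
    return 1;
}

/* <variable> -> identifier <new> */
static int variable(void) {
    if (*tok != 'i')
        return 0;
    tok++;
    return new_part();
}

/* Returns 1 if s is exactly one <variable>, 0 otherwise. */
int recognize_variable(const char *s) {
    tok = s;
    return variable() && *tok == '\0';
}
```

With this encoding, recognize_variable accepts "i", "i[n]", and "i[i[n]]" and rejects "i[n".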

The Parsing Problem for Bottom-Up Parsers

• The following grammar for arithmetic expressions will be used to illustrate
bottom-up parsing:
E → E + T | T
T → T * F | F
F → ( E ) | id
This grammar is left recursive, which is acceptable to bottom-up parsers.

• Grammars for bottom-up parsers are normally written using ordinary BNF, not
EBNF.

• A rightmost derivation using this grammar:

E => E + T             (RHS: E + T)
  => E + T * F         (RHS: T * F)
  => E + T * id        (RHS: id)
  => E + F * id        (RHS: F)
  => E + id * id       (RHS: the id after +)
  => T + id * id       (RHS: T)
  => F + id * id       (RHS: F)
  => id + id * id      (RHS: the leftmost id)
The RHS noted beside each sentential form is the RHS of the rule that was
applied at that step.

• A bottom-up parser produces the reverse of a rightmost derivation by starting
with the last sentential form (the input sentence) and working back to the start
symbol.

• At each step, the parser’s task is to find the RHS in the current sentential form
that must be rewritten to get the previous sentential form.

The Parsing Problem for Bottom-Up Parsers (Continued)

• A right sentential form may include more than one RHS.

The right sentential form E + T * id includes three RHSs, E + T, T, and id.

• The task of a bottom-up parser is to find the unique handle of a given right
sentential form.

• Definition: β is the handle of the right sentential form γ = αβw if and only if
S =>*rm αAw =>rm αβw.
=>rm specifies a rightmost derivation step, and =>*rm specifies zero or more
rightmost derivation steps.

• Two other concepts are related to the idea of the handle.

• Definition: β is a phrase of the right sentential form γ = α1βα2 if and only if
S =>* α1Aα2 =>+ α1βα2.
=>+ means one or more derivation steps.

• Definition: β is a simple phrase of the right sentential form γ = α1βα2 if and
only if S =>* α1Aα2 => α1βα2.

The Parsing Problem for Bottom-Up Parsers (Continued)

• A phrase is a string consisting of all of the leaves of the partial parse tree that is
rooted at one particular internal node of the whole parse tree.

• A simple phrase is a phrase that is derived from a nonterminal in a single step.

• An example parse tree:

The leaves of the parse tree represent the sentential form E + T * id. Because
there are three internal nodes, there are three phrases: E + T * id, T * id, and id.

• The simple phrases are a subset of the phrases. In this example, the only simple
phrase is id.

• The handle of a right sentential form is the leftmost simple phrase.

• Once the handle has been found, it can be pruned from the parse tree and the
process repeated. Continuing to the root of the parse tree, the entire rightmost
derivation can be constructed.

Shift-Reduce Algorithms

• Bottom-up parsers are often called shift-reduce algorithms, because shift and
reduce are their two fundamental actions.
The shift action moves the next input token onto the parser’s stack.
A reduce action replaces a RHS (the handle) on top of the parser’s stack by its
corresponding LHS.

LR Parsers

• Most bottom-up parsing algorithms belong to the LR family. LR parsers use a
relatively small amount of code and a parsing table.

• The original LR algorithm was designed by Donald Knuth, who published it in
1965. His canonical LR algorithm was not widely used because producing the
parsing table required large amounts of computer time and memory.

• Later variations on the table construction process were more popular. These
variations require much less time and memory to produce the parsing table but
work for smaller classes of grammars.

• Advantages of LR parsers:
They can be built for all programming languages.
They can detect syntax errors as soon as possible in a left-to-right scan.
The LR class of grammars is a proper superset of the class parsable by LL
parsers.

• It is difficult to produce an LR parsing table by hand. However, there are many
programs available that take a grammar as input and produce the parsing table.

LR Parsers (Continued)

• Older parsing algorithms would find the handle by looking both to the left and to
the right of the substring that was suspected of being the handle.

• Knuth’s insight was that it was only necessary to look to the left of the suspected
handle (by examining the stack) to determine whether it was the handle.

• Even better, the parser can avoid examining the entire stack if it keeps a
summary of the stack contents in a “state” symbol on top of the stack.

• In general, each grammar symbol on the stack will be followed by a state symbol
(often written as a subscripted uppercase S).

• The structure of an LR parser:

LR Parsers (Continued)

• The contents of the parse stack for an LR parser have the following form, where
the Ss are state symbols and the Xs are grammar symbols:
S0X1S1X2S2…XmSm (top)

• An LR parser configuration is a pair of strings representing the stack and the
remaining input:
(S0X1S1X2S2…XmSm, aiai+1…an$)
The dollar sign is an end-of-input marker.

• The LR parsing process is based on the parsing table, which has two parts,
ACTION and GOTO.

• The ACTION part has state symbols as its row labels and terminal symbols as its
column labels.

• The parse table specifies what the parser should do, based on the state symbol on
top of the parse stack and the next input symbol.

• The two primary actions are shift (shift the next input symbol onto the stack) and
reduce (replace the handle on top of the stack by the LHS of the matching rule).

• Two other actions are possible: accept (parsing is complete) and error (a syntax
error has been detected).

• The values in the GOTO part of the table indicate which state symbol should be
pushed onto the parse stack after a reduction has been completed.
The row is determined by the state symbol on top of the parse stack after the
handle and its associated state symbols have been removed.
The column is determined by the LHS of the rule used in the reduction.

LR Parsers (Continued)

• Initial configuration of an LR parser:

(S0, a1…an$)

• Informal definition of parser actions:

Shift: The next input symbol is pushed onto the stack, along with the state
symbol specified in the ACTION table.
Reduce: First, the handle is removed from the stack. For every grammar symbol
on the stack there is a state symbol, so the number of symbols removed is twice
the number of symbols in the handle. Next, the LHS of the rule is pushed onto
the stack. Finally, the GOTO table is used to determine which state must be
pushed onto the stack.
Accept: The parse is complete and no errors were found.
Error: The parser calls an error-handling routine.

• All LR parsers use this parsing algorithm, although they may construct the
parsing table in different ways.

• The following grammar will be used to illustrate LR parsing:

1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → ( E )
6. F → id

LR Parsers (Continued)

• The LR parsing table for this grammar:

R4 means reduce using rule 4; S6 means shift the next input symbol onto the stack
and push state S6. Empty positions in the ACTION table indicate syntax errors.

• A trace of the parse of id + id * id using the LR parsing algorithm:

Stack            Input             Action
0                id + id * id $    Shift 5
0id5             + id * id $       Reduce 6 (use GOTO[0, F])
0F3              + id * id $       Reduce 4 (use GOTO[0, T])
0T2              + id * id $       Reduce 2 (use GOTO[0, E])
0E1              + id * id $       Shift 6
0E1+6            id * id $         Shift 5
0E1+6id5         * id $            Reduce 6 (use GOTO[6, F])
0E1+6F3          * id $            Reduce 4 (use GOTO[6, T])
0E1+6T9          * id $            Shift 7
0E1+6T9*7        id $              Shift 5
0E1+6T9*7id5     $                 Reduce 6 (use GOTO[7, F])
0E1+6T9*7F10     $                 Reduce 3 (use GOTO[6, T])
0E1+6T9          $                 Reduce 1 (use GOTO[0, E])
0E1              $                 Accept
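
• The table-driven LR algorithm can be sketched in C as follows. The parsing table itself does not appear in this text, so the ACTION and GOTO entries below are an assumption: they are the SLR table commonly given in compiler textbooks for this six-rule grammar, and they agree with every step of the trace above. The encoding (positive for shift, negative for reduce, ACC for accept, 0 for error) and all identifier names are choices made for this illustration. To keep the sketch short, the stack holds only state numbers; the grammar symbols shown in the trace are implied by the states.

```c
#include <stddef.h>

enum { ID, PLUS, STAR, LPAREN, RPAREN, END, NUM_TERMS };   /* terminals */
enum { NT_E, NT_T, NT_F, NUM_NONTERMS };                   /* nonterminals */

#define NUM_LR_STATES 12
#define STACK_MAX 256
#define ACC 100   /* accept; n > 0 = shift to state n; -n = reduce by rule n; 0 = error */

static const int action[NUM_LR_STATES][NUM_TERMS] = {
    /*          id    +    *    (    )    $  */
    /*  0 */ {   5,   0,   0,   4,   0,   0 },
    /*  1 */ {   0,   6,   0,   0,   0, ACC },
    /*  2 */ {   0,  -2,   7,   0,  -2,  -2 },
    /*  3 */ {   0,  -4,  -4,   0,  -4,  -4 },
    /*  4 */ {   5,   0,   0,   4,   0,   0 },
    /*  5 */ {   0,  -6,  -6,   0,  -6,  -6 },
    /*  6 */ {   5,   0,   0,   4,   0,   0 },
    /*  7 */ {   5,   0,   0,   4,   0,   0 },
    /*  8 */ {   0,   6,   0,   0,  11,   0 },
    /*  9 */ {   0,  -1,   7,   0,  -1,  -1 },
    /* 10 */ {   0,  -3,  -3,   0,  -3,  -3 },
    /* 11 */ {   0,  -5,  -5,   0,  -5,  -5 },
};

static const int go_to[NUM_LR_STATES][NUM_NONTERMS] = {
    /*          E   T   F */
    /*  0 */ {  1,  2,  3 },
    /*  1 */ {  0,  0,  0 },
    /*  2 */ {  0,  0,  0 },
    /*  3 */ {  0,  0,  0 },
    /*  4 */ {  8,  2,  3 },
    /*  5 */ {  0,  0,  0 },
    /*  6 */ {  0,  9,  3 },
    /*  7 */ {  0,  0, 10 },
    /*  8 */ {  0,  0,  0 },
    /*  9 */ {  0,  0,  0 },
    /* 10 */ {  0,  0,  0 },
    /* 11 */ {  0,  0,  0 },
};

/* LHS and RHS length of rules 1..6 (index 0 is unused). */
static const int rule_lhs[7] = { 0, NT_E, NT_E, NT_T, NT_T, NT_F, NT_F };
static const int rule_len[7] = { 0, 3, 1, 3, 1, 3, 1 };

/* Parses a token array that ends with END. Returns 1 on accept, 0 on error. */
int lr_parse(const int *tokens) {
    int stack[STACK_MAX];   /* state numbers only */
    int top = 0;
    size_t i = 0;

    stack[0] = 0;
    for (;;) {
        int act = action[stack[top]][tokens[i]];
        if (act == ACC)
            return 1;
        if (act == 0)
            return 0;                       /* empty table entry: syntax error */
        if (act > 0) {                      /* shift */
            if (top + 1 >= STACK_MAX)
                return 0;
            stack[++top] = act;
            i++;
        } else {                            /* reduce by rule -act */
            int r = -act;
            top -= rule_len[r];             /* pop one state per RHS symbol */
            stack[top + 1] = go_to[stack[top]][rule_lhs[r]];
            top++;
        }
    }
}
```

Given the tokens for id + id * id followed by END, lr_parse goes through the same shifts and reductions as the trace and returns 1. Given id + followed by END, it reaches an empty ACTION entry in state 6 and returns 0.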

No ratings yet
CD Manual
58 pages
Compiler Design Lexical Analysis
No ratings yet
Compiler Design Lexical Analysis
24 pages
2-Lexical Analysis Part1
No ratings yet
2-Lexical Analysis Part1
39 pages
Lecture-2-10022025-035804pm
No ratings yet
Lecture-2-10022025-035804pm
27 pages
Chapter 2 - Lexical Analysis
No ratings yet
Chapter 2 - Lexical Analysis
10 pages
Compiler Design Chapter 2
No ratings yet
Compiler Design Chapter 2
14 pages
Chapter 2 Lexical Analysis
No ratings yet
Chapter 2 Lexical Analysis
26 pages
R.V. College of Engineering
No ratings yet
R.V. College of Engineering
56 pages
CD Lab1
No ratings yet
CD Lab1
68 pages
ex 1 _ lexical analyser
No ratings yet
ex 1 _ lexical analyser
8 pages
CT3
No ratings yet
CT3
20 pages
17ACS42 Manual
No ratings yet
17ACS42 Manual
54 pages
L4 - Lexical Analysis (Introduction)
No ratings yet
L4 - Lexical Analysis (Introduction)
11 pages
CD Student Manual (1)
No ratings yet
CD Student Manual (1)
76 pages
Unit2
No ratings yet
Unit2
61 pages
Lecture 2.76
No ratings yet
Lecture 2.76
31 pages
Compiler Construction Lec 1b
No ratings yet
Compiler Construction Lec 1b
37 pages
Lecture 03
No ratings yet
Lecture 03
42 pages
04 Lexi Cal A Analysis
No ratings yet
04 Lexi Cal A Analysis
39 pages
M2 Session2
No ratings yet
M2 Session2
17 pages
No Bills of Attainder
No ratings yet
No Bills of Attainder
14 pages
Moorish Science Temple Compari
No ratings yet
Moorish Science Temple Compari
33 pages
America Is A Republic, Not A Democracy: Bernard Dobski
No ratings yet
America Is A Republic, Not A Democracy: Bernard Dobski
24 pages
US Doj-Title-Regulations ENRD LAS 202.305.0316
No ratings yet
US Doj-Title-Regulations ENRD LAS 202.305.0316
74 pages
Tribal Law Order Act 2010
No ratings yet
Tribal Law Order Act 2010
44 pages
3rd Scheme & Syllabus 2021
No ratings yet
3rd Scheme & Syllabus 2021
68 pages
Whole Number Y4 p2
No ratings yet
Whole Number Y4 p2
4 pages
Telephone Directory
No ratings yet
Telephone Directory
8 pages
Clean Architectures in Python
No ratings yet
Clean Architectures in Python
149 pages
Java Oops
No ratings yet
Java Oops
7 pages
Uncodemy's Python (Programming Language) Course Module
No ratings yet
Uncodemy's Python (Programming Language) Course Module
11 pages
F
No ratings yet
F
577 pages
COS3711 2024 Assignment 3
No ratings yet
COS3711 2024 Assignment 3
4 pages
Digital System Designs
No ratings yet
Digital System Designs
168 pages
Basic and Advance Formulas Examples - 2020
No ratings yet
Basic and Advance Formulas Examples - 2020
68 pages
Unit 4 - GRADIENT LEARNING
No ratings yet
Unit 4 - GRADIENT LEARNING
3 pages
Parsing Dates With Lubridate: Charlo e Wickham
No ratings yet
Parsing Dates With Lubridate: Charlo e Wickham
23 pages
SCSJ1023-201720181-Mid Term-Part A-Solution
No ratings yet
SCSJ1023-201720181-Mid Term-Part A-Solution
6 pages
ultimate ViSA Report
No ratings yet
ultimate ViSA Report
22 pages
4.2.chuong4 - FreeRTOS - QUAN LY BO NHO - 2
No ratings yet
4.2.chuong4 - FreeRTOS - QUAN LY BO NHO - 2
53 pages
Koch and Snowflake Curves
No ratings yet
Koch and Snowflake Curves
5 pages
Unit 5 Reinforcement Learning Notes
No ratings yet
Unit 5 Reinforcement Learning Notes
20 pages
Customer Churn Prediction
No ratings yet
Customer Churn Prediction
16 pages
Class 2
No ratings yet
Class 2
1 page
What Are Some Tricks To Learn Java Quickly
No ratings yet
What Are Some Tricks To Learn Java Quickly
377 pages
Mocha Pro Python Guide
No ratings yet
Mocha Pro Python Guide
67 pages
C programing ppt
No ratings yet
C programing ppt
65 pages
Example Questions Answers v2
No ratings yet
Example Questions Answers v2
9 pages
CLASS1 Fundamentals of Data Structures
No ratings yet
CLASS1 Fundamentals of Data Structures
21 pages
Hlasm Tmodelling
No ratings yet
Hlasm Tmodelling
22 pages
Python Ass 2
No ratings yet
Python Ass 2
7 pages
Modelling With UML in Software Engineering
No ratings yet
Modelling With UML in Software Engineering
64 pages
MCS-031 08
No ratings yet
MCS-031 08
4 pages
Lab 01
No ratings yet
Lab 01
6 pages
C-Notes Module 5
No ratings yet
C-Notes Module 5
7 pages