Lexical and Syntax Analysis
• The job of a syntax analyzer is to check the syntax of a program and create a
parse tree from it.
• Nearly all compilers separate the task of analyzing syntax into two distinct parts:
The lexical analyzer deals with small-scale language constructs, such as names
and numeric literals.
The syntax analyzer deals with large-scale constructs, such as expressions,
statements, and program units.
• A lexical analyzer collects input characters into groups (lexemes) and assigns an
internal code (a token) to each group.
• Tokens are usually coded as integer values, but for the sake of readability, they
are often referenced through named constants.
• In early compilers, lexical analyzers often processed an entire program file and
produced a file of tokens and lexemes. Now, most lexical analyzers are
subprograms that return the next lexeme and its associated token code when called.
• Consider the problem of building a lexical analyzer that recognizes lexemes that
appear in arithmetic expressions, including variable names and integer literals.
Names consist of uppercase letters, lowercase letters, and digits, but must begin
with a letter.
Names have no length limitations.
• To simplify the transition diagram, we can treat all letters the same way, having
one transition from a state instead of 52 transitions. In the lexical analyzer,
LETTER will represent the class of all 52 letters.
• The lexical analyzer will use a string (or character array) named lexeme to
store a lexeme as it is being read.
• A state diagram that recognizes names, integer literals, parentheses, and
arithmetic operators:
#include <stdio.h>
#include <ctype.h>
/* Global declarations */
/* Variables */
int charClass;
char lexeme[100];
int nextChar; /* int rather than char, so that it can hold EOF as returned by getc */
int lexLen;
int token;
int nextToken;
FILE *in_fp;
/* Function declarations */
int lookup(char ch);
void addChar(void);
void getChar(void);
void getNonBlank(void);
int lex(void);
/* Character classes */
#define LETTER 0
#define DIGIT 1
#define UNKNOWN 99
/* Token codes */
#define INT_LIT 10
#define IDENT 11
#define ASSIGN_OP 20
#define ADD_OP 21
#define SUB_OP 22
#define MULT_OP 23
#define DIV_OP 24
#define LEFT_PAREN 25
#define RIGHT_PAREN 26
/******************************************************/
/* getChar - a function to get the next character of
   input and determine its character class */
void getChar(void) {
    if ((nextChar = getc(in_fp)) != EOF) {
        if (isalpha(nextChar))
            charClass = LETTER;
        else if (isdigit(nextChar))
            charClass = DIGIT;
        else
            charClass = UNKNOWN;
    }
    else
        charClass = EOF;
}
/******************************************************/
/* getNonBlank - a function to call getChar until it
   returns a non-whitespace character */
void getNonBlank(void) {
    while (isspace(nextChar))
        getChar();
}
/* Fragment of the switch on charClass inside lex() */
    /* Identifiers */
    case LETTER:
        addChar();
        getChar();
        while (charClass == LETTER || charClass == DIGIT) {
            addChar();
            getChar();
        }
        nextToken = IDENT;
        break;
    /* Integer literals */
    case DIGIT:
        addChar();
        getChar();
        while (charClass == DIGIT) {
            addChar();
            getChar();
        }
        nextToken = INT_LIT;
        break;
    /* EOF */
    case EOF:
        nextToken = EOF;
        lexeme[0] = 'E';
        lexeme[1] = 'O';
        lexeme[2] = 'F';
        lexeme[3] = '\0';
        break;
    } /* End of switch */
• Output of front.c:
Next token is: 25, Next lexeme is (
Next token is: 11, Next lexeme is sum
Next token is: 21, Next lexeme is +
Next token is: 10, Next lexeme is 47
Next token is: 26, Next lexeme is )
Next token is: 24, Next lexeme is /
Next token is: 11, Next lexeme is total
Next token is: -1, Next lexeme is EOF
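• The fragments above can be assembled into a compact scanner. The following is
a minimal sketch, not the original front.c: it reads from a string instead of a
file, and the Scanner struct, next_token, UNKNOWN_TOK, and END_OF_INPUT are
hypothetical names introduced for illustration. Unlike the notes' statement that
names have no length limit, this sketch silently drops characters beyond the
buffer size.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Token codes, copied from the listing above */
#define INT_LIT 10
#define IDENT 11
#define ADD_OP 21
#define SUB_OP 22
#define MULT_OP 23
#define DIV_OP 24
#define LEFT_PAREN 25
#define RIGHT_PAREN 26
#define UNKNOWN_TOK 99      /* hypothetical code for unrecognized characters */
#define END_OF_INPUT (-1)   /* matches the -1 shown in the output above */
#define MAX_LEXEME 100

/* Hypothetical scanner state: a cursor into a NUL-terminated string */
typedef struct {
    const char *src;          /* next unread character */
    char lexeme[MAX_LEXEME];  /* text of the most recent lexeme */
    int token;                /* code of the most recent token */
} Scanner;

/* Map a single-character lexeme to its token code. */
static int lookup(char ch) {
    switch (ch) {
    case '(': return LEFT_PAREN;
    case ')': return RIGHT_PAREN;
    case '+': return ADD_OP;
    case '-': return SUB_OP;
    case '*': return MULT_OP;
    case '/': return DIV_OP;
    default:  return UNKNOWN_TOK;
    }
}

/* Append the current character to the lexeme (if room remains) and advance. */
static void add_char(Scanner *s, size_t *len) {
    if (*len < MAX_LEXEME - 1)
        s->lexeme[(*len)++] = *s->src;
    s->src++;
}

/* Scan the next lexeme, store it in s->lexeme, and return its token code. */
int next_token(Scanner *s) {
    size_t len = 0;
    assert(s != NULL && s->src != NULL);

    while (isspace((unsigned char)*s->src))   /* skip whitespace */
        s->src++;

    unsigned char c = (unsigned char)*s->src;
    if (c == '\0') {                          /* end of input */
        strcpy(s->lexeme, "EOF");
        return s->token = END_OF_INPUT;
    }
    if (isalpha(c)) {                         /* letter (letter | digit)* */
        while (isalnum((unsigned char)*s->src))
            add_char(s, &len);
        s->token = IDENT;
    } else if (isdigit(c)) {                  /* digit+ */
        while (isdigit((unsigned char)*s->src))
            add_char(s, &len);
        s->token = INT_LIT;
    } else {                                  /* single-character lexeme */
        add_char(s, &len);
        s->token = lookup((char)c);
    }
    s->lexeme[len] = '\0';
    return s->token;
}
```

Calling next_token repeatedly on a Scanner whose src points at an input such as
(sum + 47) / total, which is consistent with the output listed above, yields one
token code per call until END_OF_INPUT is returned.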
• When an error is found, a parser must produce a diagnostic message and recover.
Recovery is required so that the compiler finds as many errors as possible.
• Parsers are categorized according to the direction in which they build parse
trees:
Top-down parsers build the tree from the root downward to the leaves.
Bottom-up parsers build the tree from the leaves upward to the root.
• A top-down parser traces or builds the parse tree in preorder: each node is visited
before its branches are followed.
• The two most common top-down approaches, recursive-descent parsing and
table-driven parsing, are both LL algorithms, and both are equally powerful. The
first L in LL specifies a left-to-right scan of the input; the second L specifies
that a leftmost derivation is generated.
• A bottom-up parser constructs a parse tree by beginning at the leaves and
progressing toward the root. This parse order corresponds to the reverse of a
rightmost derivation.
• Given a right sentential form α, a bottom-up parser must determine what
substring of α is the right-hand side (RHS) of the rule that must be reduced to its
LHS to produce the previous right sentential form.
• A given right sentential form may include more than one RHS from the
grammar. The correct RHS to reduce is called the handle.
• A bottom-up parser finds the handle of a given right sentential form by
examining the symbols on one or both sides of a possible handle.
• The most common bottom-up parsing algorithms are in the LR family. The L
specifies a left-to-right scan and the R specifies that a rightmost derivation is
generated (in reverse).
• Parsing algorithms that work for any grammar are inefficient. The worst-case
complexity of such general algorithms is O(n^3), where n is the length of the
input, making them impractical for use in compilers.
• Faster algorithms work for only a subset of all possible grammars. These
algorithms are acceptable as long as they can parse grammars that describe
programming languages.
• expr does not include any code for syntax error detection or recovery, because
there are no detectable errors associated with the rule for <expr>.
• The recursive-descent subprogram for <factor> must choose among three RHSs:
/* factor
   Parses strings in the language generated by the rule:
   <factor> -> id | int_constant | ( <expr> )
*/
void factor(void) {
    printf("Enter <factor>\n");
    /* Determine which RHS */
    if (nextToken == IDENT || nextToken == INT_LIT)
        /* Get the next token */
        lex();
    /* If the RHS is ( <expr> ), call lex to pass over the
       left parenthesis, call expr, and check for the right
       parenthesis */
    else {
        if (nextToken == LEFT_PAREN) {
            lex();
            expr();
            if (nextToken == RIGHT_PAREN)
                lex();
            else
                error();
        }
        /* It was not an id, an integer literal, or a left
           parenthesis */
        else
            error();
    }
    printf("Exit <factor>\n");
}
• The error function is called when a syntax error is detected. A real parser
would produce a diagnostic message and attempt to recover from the error.
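• As a minimal sketch of how the pieces fit together, the recognizer below
pairs factor with expr and term subprograms and an error function that prints a
diagnostic and counts errors, without attempting recovery. Several things here
are assumptions made for illustration: the rules
<expr> -> <term> {(+ | -) <term>} and <term> -> <factor> {(* | /) <factor>},
the token array that stands in for a real lexical analyzer, and the names
parse, END_TOK, and errorCount.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Token codes from the lexical analyzer listing; END_TOK is hypothetical */
enum {
    END_TOK = -1, INT_LIT = 10, IDENT = 11,
    ADD_OP = 21, SUB_OP = 22, MULT_OP = 23, DIV_OP = 24,
    LEFT_PAREN = 25, RIGHT_PAREN = 26
};

static const int *tokens;   /* input: token codes terminated by END_TOK */
static int pos;             /* index of the current token */
static int nextToken;       /* code of the current token */
static int errorCount;      /* number of syntax errors reported */

/* Stand-in for lex(): advance to the next token, never past END_TOK. */
static void lex(void) {
    if (nextToken != END_TOK)
        nextToken = tokens[++pos];
}

/* Report a syntax error; a real parser would also try to recover. */
static void error(void) {
    errorCount++;
    fprintf(stderr, "syntax error at token %d (code %d)\n", pos, nextToken);
}

static void expr(void);

/* <factor> -> id | int_constant | ( <expr> ) */
static void factor(void) {
    if (nextToken == IDENT || nextToken == INT_LIT) {
        lex();
    } else if (nextToken == LEFT_PAREN) {
        lex();
        expr();
        if (nextToken == RIGHT_PAREN)
            lex();
        else
            error();
    } else {
        error();
    }
}

/* <term> -> <factor> {(* | /) <factor>}   (assumed rule) */
static void term(void) {
    factor();
    while (nextToken == MULT_OP || nextToken == DIV_OP) {
        lex();
        factor();
    }
}

/* <expr> -> <term> {(+ | -) <term>}   (assumed rule) */
static void expr(void) {
    term();
    while (nextToken == ADD_OP || nextToken == SUB_OP) {
        lex();
        term();
    }
}

/* Parse a token array ending in END_TOK; return the number of errors. */
int parse(const int *toks) {
    assert(toks != NULL);
    tokens = toks;
    pos = 0;
    nextToken = tokens[0];
    errorCount = 0;
    expr();
    if (nextToken != END_TOK)   /* leftover input is also an error */
        error();
    return errorCount;
}
```

parse returns 0 for the token sequence of (sum + 47) / total and a nonzero
count for a truncated input such as ( sum +.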
• Recursive-descent and other LL parsers can be used only with grammars that
meet certain restrictions.
• Calling the recursive-descent parsing subprogram for the following rule would
cause infinite recursion:
A → A + B
• The left recursion in the rule A → A + B is called direct left recursion, because
it occurs in one rule.
• The symbol ε represents the empty string. A rule that has ε as its RHS is called
an erasure rule, because using it in a derivation effectively erases its LHS from
the sentential form.
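• One standard remedy for direct left recursion can be sketched as follows. The
sketch assumes the full rule is A → A + B | B and that B derives the single
terminal b (both assumptions, since the notes show only A → A + B). The rule is
rewritten as A → B A' and A' → + B A' | ε, and the erasure rule corresponds to
parse_A_prime returning without consuming input. All function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static const char *p;   /* cursor into the input string */
static int ok;          /* cleared when a syntax error is seen */

/* B -> b   (assumed for illustration) */
static void parse_B(void) {
    if (*p == 'b')
        p++;
    else
        ok = 0;
}

/* A' -> + B A' | ε */
static void parse_A_prime(void) {
    if (*p == '+') {
        p++;
        parse_B();
        if (ok)
            parse_A_prime();
    }
    /* otherwise the erasure rule A' -> ε applies: consume nothing */
}

/* A -> B A'   (no longer left recursive, so the recursion terminates) */
static void parse_A(void) {
    parse_B();
    if (ok)
        parse_A_prime();
}

/* Return 1 if s is a sentence of the grammar, 0 otherwise. */
int recognizes_A(const char *s) {
    assert(s != NULL);
    p = s;
    ok = 1;
    parse_A();
    return ok && *p == '\0';
}
```

Each call to parse_A_prime either consumes a + or returns, so the parser always
makes progress, which is exactly what the left-recursive form could not
guarantee.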
• Indirect left recursion poses the same problem as direct left recursion:
A → B a A
B → A b
• Left recursion is not the only grammar trait that disallows top-down parsing. A
top-down parser must always be able to choose the correct RHS on the basis of
the next token of input.
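• The pairwise disjointness test checks this property using FIRST sets.
FIRST(α) is the set of terminals that can begin a string derived from α. For
each nonterminal with more than one RHS, the FIRST sets of every pair of its
RHSs must have an empty intersection.
• There are algorithms to compute FIRST for any mixed string. In simple cases,
FIRST can usually be computed by inspecting the grammar.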
• An example:
A → aB | bAb | Bb
B → cB | d
The FIRST sets for the RHSs of the A-rules are {a}, {b}, and {c, d}. These rules
pass the pairwise disjointness test.
• A second example:
A → aB | BAb
B → aB | b
The FIRST sets for the RHSs of the A-rules are {a} and {a, b}. These rules fail
the pairwise disjointness test.
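• The two examples above can be checked mechanically. The following is a
minimal sketch that assumes a grammar with no erasure rules, single-letter
symbols (uppercase for nonterminals, lowercase for terminals), and an ASCII-like
character set. It computes FIRST by fixed-point iteration and then compares the
RHSs of each nonterminal pairwise. The Rule type and all function names are
hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef struct {
    char lhs;          /* nonterminal, 'A'..'Z' */
    const char *rhs;   /* non-empty; uppercase = nonterminal, lowercase = terminal */
} Rule;

typedef unsigned long TermSet;   /* bit i set <=> terminal 'a' + i is in the set */

/* FIRST of an RHS: with no erasure rules, only its first symbol matters. */
static TermSet first_of_rhs(const char *rhs, const TermSet first[26]) {
    unsigned char c = (unsigned char)rhs[0];
    if (islower(c))
        return 1UL << (c - 'a');
    if (isupper(c))
        return first[c - 'A'];
    return 0;
}

/* Compute FIRST for every nonterminal by iterating until nothing changes. */
static void compute_first(const Rule *g, size_t n, TermSet first[26]) {
    int changed = 1;
    for (int i = 0; i < 26; i++)
        first[i] = 0;
    while (changed) {
        changed = 0;
        for (size_t r = 0; r < n; r++) {
            assert(isupper((unsigned char)g[r].lhs));
            TermSet *dst = &first[g[r].lhs - 'A'];
            TermSet merged = *dst | first_of_rhs(g[r].rhs, first);
            if (merged != *dst) {
                *dst = merged;
                changed = 1;
            }
        }
    }
}

/* Return 1 if every pair of RHSs of the same nonterminal has disjoint
   FIRST sets, 0 otherwise. */
int pairwise_disjoint(const Rule *g, size_t n) {
    TermSet first[26];
    assert(g != NULL);
    compute_first(g, n, first);
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (g[i].lhs == g[j].lhs &&
                (first_of_rhs(g[i].rhs, first) &
                 first_of_rhs(g[j].rhs, first)) != 0)
                return 0;
    return 1;
}
```

For the first example the A-rules yield {a}, {b}, and {c, d}, so the function
returns 1. For the second example the terminal a appears in both FIRST sets of
the A-rules, so it returns 0.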
• In many cases, a grammar that fails the pairwise disjointness test can be
modified so that it will pass the test.
• Using left factoring, the rules <variable> → identifier | identifier [ <expression> ]
would be replaced by the following rules:
<variable> → identifier <new>
<new> → ε | [ <expression> ]
• Using EBNF can also help. The original rules for <variable> can be replaced by
the following EBNF rules:
<variable> → identifier [ [ <expression> ] ]
The outer brackets are metasymbols, and the inner brackets are terminals.
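• A recursive-descent subprogram for the left-factored rules needs only one
token of lookahead after the identifier. The sketch below assumes a token array
terminated by TOK_END and, purely for illustration, treats <expression> as a
single integer literal. The token names and parse_variable are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes for this sketch */
enum { TOK_IDENT, TOK_LBRACKET, TOK_RBRACKET, TOK_INT, TOK_END };

/* Stand-in for <expression>: accepts one integer literal (an assumption). */
static int parse_expression(const int *t, size_t *i) {
    if (t[*i] == TOK_INT) {
        (*i)++;
        return 1;
    }
    return 0;
}

/* <variable> -> identifier <new>
   <new>      -> ε | [ <expression> ]
   Returns 1 if the whole token array is a <variable>, 0 otherwise. */
int parse_variable(const int *t) {
    size_t i = 0;
    assert(t != NULL);

    if (t[i] != TOK_IDENT)
        return 0;
    i++;

    /* Choose the RHS of <new> from the next token alone */
    if (t[i] == TOK_LBRACKET) {
        i++;
        if (!parse_expression(t, &i))
            return 0;
        if (t[i] != TOK_RBRACKET)
            return 0;
        i++;
    }
    /* otherwise <new> -> ε: nothing is consumed */

    return t[i] == TOK_END;
}
```

Before left factoring, both RHSs began with identifier, so the choice could not
be made from the next token. After factoring, the presence or absence of the
left bracket decides it.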
• Left factoring cannot solve all pairwise disjointness problems. In some cases,
rules must be rewritten in other ways to eliminate the problem.
• The following grammar for arithmetic expressions will be used to illustrate
bottom-up parsing:
E → E + T | T
T → T * F | F
F → ( E ) | id
This grammar is left recursive, which is acceptable to bottom-up parsers.
• Grammars for bottom-up parsers are normally written using ordinary BNF, not
EBNF.
• At each step, the parser’s task is to find the RHS in the current sentential form
that must be rewritten to get the previous sentential form.
• The task of a bottom-up parser is to find the unique handle of a given right
sentential form.
• Definition: β is the handle of the right sentential form γ = αβw if and only if
S =>*rm αAw =>rm αβw.
=>rm specifies a rightmost derivation step, and =>*rm specifies zero or more
rightmost derivation steps.
• A phrase is a string consisting of all of the leaves of the partial parse tree that is
rooted at one particular internal node of the whole parse tree.
The leaves of the parse tree represent the sentential form E + T * id. Because
there are three internal nodes, there are three phrases: E + T * id, T * id, and id.
• The simple phrases are a subset of the phrases. In this example, the only simple
phrase is id.
• Once the handle has been found, it can be pruned from the parse tree and the
process repeated. Continuing to the root of the parse tree, the entire rightmost
derivation can be constructed.
• Bottom-up parsers are often called shift-reduce algorithms, because shift and
reduce are their two fundamental actions.
The shift action moves the next input token onto the parser’s stack.
A reduce action replaces a RHS (the handle) on top of the parser’s stack by its
corresponding LHS.
• Later variations on the canonical LR table construction process became more
popular than Knuth's original algorithm. These variations require much less time
and memory to produce the parsing table but work for smaller classes of grammars.
• Advantages of LR parsers:
They can be built for all programming languages.
They can detect syntax errors as soon as possible in a left-to-right scan.
The LR class of grammars is a proper superset of the class parsable by LL
parsers.
• Older parsing algorithms would find the handle by looking both to the left and to
the right of the substring that was suspected of being the handle.
• Knuth’s insight was that it was only necessary to look to the left of the suspected
handle (by examining the stack) to determine whether it was the handle.
• Even better, the parser can avoid examining the entire stack if it keeps a
summary of the stack contents in a “state” symbol on top of the stack.
• In general, each grammar symbol on the stack will be followed by a state symbol
(often written as a subscripted uppercase S).
• The contents of the parse stack for an LR parser have the following form, where
the Ss are state symbols and the Xs are grammar symbols:
S0 X1 S1 X2 S2 … Xm Sm (top)
• The LR parsing process is based on the parsing table, which has two parts,
ACTION and GOTO.
• The ACTION part has state symbols as its row labels and terminal symbols as its
column labels.
• The parse table specifies what the parser should do, based on the state symbol on
top of the parse stack and the next input symbol.
• The two primary actions are shift (shift the next input symbol onto the stack) and
reduce (replace the handle on top of the stack by the LHS of the matching rule).
• Two other actions are possible: accept (parsing is complete) and error (a syntax
error has been detected).
• The values in the GOTO part of the table indicate which state symbol should be
pushed onto the parse stack after a reduction has been completed.
The row is determined by the state symbol on top of the parse stack after the
handle and its associated state symbols have been removed.
The column is determined by the LHS of the rule used in the reduction.
• All LR parsers use this parsing algorithm, although they may construct the
parsing table in different ways.
R4 means reduce using rule 4; S6 means shift the next input symbol onto the stack
and push state S6. Empty positions in the ACTION table indicate syntax errors.
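• The table-driven algorithm described above can be sketched in a few lines.
The ACTION and GOTO entries below are the SLR table commonly published for the
expression grammar E → E + T | T, T → T * F | F, F → ( E ) | id, with rules
numbered 1 to 6 in that order. It is assumed, not confirmed, that this matches
the table the notes refer to. The sketch keeps only state symbols on the stack,
since the grammar symbols are implied by the states. The encoding (positive =
shift, negative = reduce, 0 = error) and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { T_ID, T_PLUS, T_STAR, T_LPAREN, T_RPAREN, T_END, N_TERMS };
enum { NT_E, NT_T, NT_F, N_NONTERMS };
enum { N_STATES = 12, ACC = 100, MAX_STACK = 256 };

/* ACTION: n > 0 shift and push state n; n < 0 reduce by rule -n;
   ACC accept; 0 syntax error */
static const int action_tbl[N_STATES][N_TERMS] = {
    /*         id   +   *   (   )   $  */
    /*  0 */ {  5,  0,  0,  4,  0,  0 },
    /*  1 */ {  0,  6,  0,  0,  0, ACC },
    /*  2 */ {  0, -2,  7,  0, -2, -2 },
    /*  3 */ {  0, -4, -4,  0, -4, -4 },
    /*  4 */ {  5,  0,  0,  4,  0,  0 },
    /*  5 */ {  0, -6, -6,  0, -6, -6 },
    /*  6 */ {  5,  0,  0,  4,  0,  0 },
    /*  7 */ {  5,  0,  0,  4,  0,  0 },
    /*  8 */ {  0,  6,  0,  0, 11,  0 },
    /*  9 */ {  0, -1,  7,  0, -1, -1 },
    /* 10 */ {  0, -3, -3,  0, -3, -3 },
    /* 11 */ {  0, -5, -5,  0, -5, -5 },
};

/* GOTO: state to push after reducing to E, T, or F (0 = unused entry) */
static const int goto_tbl[N_STATES][N_NONTERMS] = {
    {1, 2, 3}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {8, 2, 3}, {0, 0, 0},
    {0, 9, 3}, {0, 0, 10}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0},
};

/* LHS and RHS length of rules 1..6 (index 0 unused) */
static const int rule_lhs[7] = { 0, NT_E, NT_E, NT_T, NT_T, NT_F, NT_F };
static const int rule_len[7] = { 0, 3, 1, 3, 1, 3, 1 };

/* Parse a token array terminated by T_END; return 1 on accept, 0 on error. */
int lr_parse(const int *input) {
    int stack[MAX_STACK];
    int top = 0;
    size_t ip = 0;

    assert(input != NULL);
    stack[0] = 0;                               /* initial state S0 */
    for (;;) {
        int tok = input[ip];
        if (tok < 0 || tok >= N_TERMS)
            return 0;
        int act = action_tbl[stack[top]][tok];
        if (act == ACC)
            return 1;
        if (act > 0) {                          /* shift */
            if (top + 1 >= MAX_STACK)
                return 0;
            stack[++top] = act;
            ip++;
        } else if (act < 0) {                   /* reduce */
            int rule = -act;
            top -= rule_len[rule];              /* pop the handle's states */
            int next = goto_tbl[stack[top]][rule_lhs[rule]];
            stack[++top] = next;                /* push the GOTO state */
        } else {
            return 0;                           /* empty ACTION entry */
        }
    }
}
```

For id + id * id the parser shifts, reduces, and finally accepts in state 1 on
the end marker. For id + followed directly by the end marker it reaches an empty
ACTION entry in state 6 and reports an error.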