Compiler Design
Lexical analysis
Outline
Introduction
Interaction of the Lexical Analyzer with the Parser
Token, pattern, lexeme
Specification of patterns using regular expressions
Regular expressions
Regular expressions for tokens
Introduction
The role of the lexical analyzer is:
• to read a sequence of characters from the source program,
• to group them into lexemes, and
• to produce as output a token for each lexeme in the source program.
The scanner can also perform the following secondary tasks:
• stripping out blanks, tabs, and newlines
• stripping out comments
• keeping track of line numbers (for error reporting)
Interaction of the Lexical Analyzer with the Parser
[Figure: interaction of the lexical analyzer with the parser; labels include the source program and the symbol table, which contains a record for each identifier.]
Token, pattern, lexeme
Lexemes are the specific character strings that make up a token.
– For example: abc and 123A
Patterns are rules describing the set of lexemes belonging to a token.
– For example: “letter followed by letters and digits”
Patterns are usually specified using regular expressions.
– For example: [a-zA-Z][a-zA-Z0-9]*
Token, pattern, lexeme…
Example: The following table shows some tokens and their lexemes in Pascal (a high-level, case-insensitive programming language).
Token    Some lexemes                     Pattern
begin    Begin, BEGIN, begin, beGin, …    "begin" in small or capital letters
if       If, IF, iF, if                   "if" in small or capital letters
ident    Distance, F1, x, Dist1, …        A letter followed by zero or more letters and/or digits
Attributes of tokens…
Errors
Very few errors are detected by the lexical analyzer.
For example, if the programmer mistypes begin as ebgin, the lexical analyzer cannot detect the error, since it will consider ebgin to be an identifier.
However, if a certain sequence of characters follows none of the specified patterns, the lexical analyzer can detect the error.
Errors…
When an error occurs, the lexical analyzer recovers by:
• skipping (deleting) successive characters from the remaining input until the lexical analyzer can find a well-formed token (panic-mode recovery)
• deleting one character from the remaining input
• inserting a missing character into the remaining input
• replacing an incorrect character by a correct character
• transposing two adjacent characters
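Panic-mode recovery can be sketched in a few lines. This is only an illustration, assuming a toy token set (identifiers, integers and a few operators); the names TOKEN and scan_with_panic_mode are hypothetical and do not come from the slides.

```python
import re

# Assumed toy token set: identifiers, integers, and a few one-character operators.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*|[0-9]+|[-+*/=();]")

def scan_with_panic_mode(text):
    """Return (tokens, errors); errors are (position, skipped text) pairs."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
            continue
        # Panic mode: delete characters until a well-formed token can start.
        start = pos
        while (pos < len(text) and not text[pos].isspace()
               and not TOKEN.match(text, pos)):
            pos += 1
        errors.append((start, text[start:pos]))
    return tokens, errors
```

For the input "x = 3 @# y" the sketch skips "@#" and continues with y. For "ebgin" it reports no error at all, which mirrors the point made earlier: a misspelled keyword still matches the identifier pattern.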
Input Buffering
Speed of lexical analysis is a concern.
Lexical analysis often needs to look ahead several characters before a match can be announced.
Two input-buffering schemes are useful when lookahead is necessary:
• Buffer pairs
• Sentinels
Implementation of a lexical analyzer
• Read the text character by character
• Replace every lexeme with its token
• Eliminate comments
• Eliminate white space
• Convert lexemes into tokens
• Detect errors and report them
Example:
Consider this expression in the
programming language C++:
if(y<=t)
y=y-3;
Tokenized and represented by the
following table:
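A minimal sketch of how such a token list could be produced mechanically. The token class names (KEYWORD, IDENTIFIER, NUMBER, OPERATOR, PUNCTUATOR) and the patterns are assumptions chosen for illustration, not the classification used on the slide.

```python
import re

# Hypothetical token classes; earlier entries take priority over later ones.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:if|else|while)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",     r"[0-9]+"),
    ("OPERATOR",   r"<=|>=|==|!=|[-+*/<>=]"),
    ("PUNCTUATOR", r"[();{}]"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token class, lexeme) pairs, dropping white space."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

for token, lexeme in tokenize("if(y<=t)\n  y=y-3;"):
    print(f"{lexeme:4} {token}")
```

Listing "<=" before the single-character operators makes the scanner prefer the longer lexeme, so y<=t yields three tokens rather than four.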
Exercise
Tokenize the following statements:
1. Result = Sum + 1;
2. int x, y, z;
Buffer Pairs
Because moving characters takes a large amount of time, specialized buffering techniques have been developed to reduce the overhead required to process an input character.
• The scheme consists of two buffers of N characters each, which are reloaded alternately.
• N is the number of characters in one disk block, e.g., 1024 or 4096.
Two-buffer scheme
The buffer has a 1st half and a 2nd half, each N characters long.
Two pointers are maintained:
• lexemeBegin points to the beginning of the current lexeme, whose extent is yet to be found.
• forward scans ahead until a match for a pattern is found.
Once a lexeme is found, forward is set to the character at its right end; after the lexeme has been processed, lexemeBegin is set to the character immediately after it.
The current lexeme is the string of characters between the two pointers.
Algorithm
if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;
Sentinel
A sentinel is a special character.
It cannot be part of the source program.
EOF is generally used as the sentinel.
In the previous scheme, each time the forward pointer is moved, a check is made that it has not moved off one half of the buffer. If it has, the other half must be reloaded.
Therefore each advance of the forward pointer requires two tests:
Test 1: for the end of a buffer half.
Test 2: to determine what character was read.
Using a sentinel reduces the two tests to one by extending each buffer half to hold a sentinel character at its end.
Algorithm
(forward↑ denotes the character that forward points to)
forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
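The sentinel scheme can be tried out with a small sketch. It makes several assumptions that are not on the slides: the class name TwoBufferInput, the use of "\0" as the sentinel, and a tiny default N so that reloads are easy to observe. The lexemeBegin pointer is left out; only the forward pointer is modelled.

```python
import io

EOF = "\0"   # assumed sentinel: a character that cannot occur in the source

class TwoBufferInput:
    """Buffer pair with sentinels; only the forward pointer is modelled."""

    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        # Two halves of n characters, each followed by one sentinel slot.
        self.buf = [EOF] * (2 * n + 2)
        self.forward = -1
        self._reload(0)

    def _reload(self, half):
        start = half * (self.n + 1)
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            # A short read leaves a sentinel inside the half: real end of input.
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Advance forward; return the next character, or None at end of input."""
        self.forward += 1
        c = self.buf[self.forward]
        if c != EOF:                      # the single test on the common path
            return c
        if self.forward == self.n:        # sentinel at end of first half
            self._reload(1)
            self.forward += 1
        elif self.forward == 2 * self.n + 1:   # sentinel at end of second half
            self._reload(0)
            self.forward = 0
        else:                             # sentinel within a half: end of input
            self.forward -= 1
            return None
        c = self.buf[self.forward]
        return c if c != EOF else None

def read_all(text, n=4):
    src, out = TwoBufferInput(io.StringIO(text), n), []
    while (c := src.next_char()) is not None:
        out.append(c)
    return "".join(out)
```

In the common case next_char performs one comparison (c != EOF); the tests for the end of a half run only when a sentinel is actually seen.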
Specification of Tokens
Regular expressions are an important notation for specifying lexeme patterns.
Specification of patterns using regular expressions
Regular expressions
Regular expressions for tokens
Regular expression: Definitions
Regular expressions…
A regular expression is one of the following:
Symbol: a basic regular expression consisting of a single character a, where a is:
• a character from an alphabet Σ of legal characters,
• the metacharacter ε, or
• the metacharacter ø.
In the first case, L(a) = {a};
in the second case, L(ε) = {ε};
in the third case, L(ø) = { }.
{ } contains no string at all.
{ε} contains the single string consisting of no characters.
Regular expressions…
Alternation: an expression of the form r|s, where r and s are regular expressions.
In this case, L(r|s) = L(r) ∪ L(s).
Regular expression: Language Operations
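Union of L and M
L ∪ M = {s | s ∈ L or s ∈ M}
Concatenation of L and M
LM = {xy | x ∈ L and y ∈ M}
Kleene closure of L
L* = L^0 ∪ L^1 ∪ L^2 ∪ …, where L^0 = {ε}
Positive closure of L
L+ = L^1 ∪ L^2 ∪ L^3 ∪ …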
Examples
L1 = {a,b,c,d}, L2 = {1,2}
L1 ∪ L2 = {a,b,c,d,1,2}
L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
L1* = the set of all strings of the letters a, b, c, d, including the empty string.
L1+ = the set of all strings of one or more of the letters a, b, c, d; the empty string is not included.
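These operations map directly onto set operations. The sketch below uses hypothetical helper names (union, concat, closure_up_to); since L* is infinite, the closure is cut off after k concatenations.

```python
def union(l, m):
    """L ∪ M"""
    return l | m

def concat(l, m):
    """LM = {xy | x in L and y in M}"""
    return {x + y for x in l for y in m}

def closure_up_to(l, k, positive=False):
    """Strings of L* (or L+ if positive) built from at most k members of L."""
    result = set() if positive else {""}   # "" plays the role of ε
    power = {""}                           # L^0
    for _ in range(k):
        power = concat(power, l)           # L^(i+1) = L^i L
        result |= power
    return result

L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}
print(sorted(union(L1, L2)))
print(sorted(concat(L1, L2)))
print(len(closure_up_to(L1, 2)))   # 1 + 4 + 16 strings
```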
Regular expressions…
Examples (more):
1- L(a | b) = {a, b}
2- L((a|b)a) = {aa, ba}
3- L((ab) | ε) = {ab, ε}
4- L(((a|b)a)*) = {ε, aa, ba, aaaa, baba, ...}
Further examples:
1 – Even binary numbers: (0|1)*0
2 – An alphabet consisting of just three alphabetic characters: Σ = {a, b, c}. Consider the set of all strings over this alphabet that contain exactly one b:
(a | c)*b(a|c)*
Some members: b, abc, abaca, baaaac, ccbaca, cccccb
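Small languages like these can be enumerated by brute force. The sketch below assumes Python's re module as a stand-in for the regular-expression notation of the slides; the helper name language is hypothetical, and the empty string "" plays the role of ε.

```python
import re
from itertools import product

def language(pattern, alphabet, max_len):
    """All strings over alphabet, up to max_len long, matched in full by pattern."""
    rx = re.compile(pattern)
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if rx.fullmatch(s):
                found.append(s)
    return found

print(language(r"((a|b)a)*", "ab", 4))
print(language(r"(0|1)*0", "01", 2))
print(language(r"(a|c)*b(a|c)*", "abc", 2))
```

Every string printed for the last pattern contains exactly one b, which is a quick sanity check of the expression.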
Regular expressions for tokens
Special symbols: these include arithmetic operators, assignment and equality, such as =, :=, +, -, *
Identifiers: these are defined to be a sequence of letters and digits beginning with a letter. We can express this in terms of regular definitions as follows:
letter = A|B|…|Z|a|b|…|z
digit = 0|1|…|9
or
letter = [a-zA-Z]
digit = [0-9]
identifiers = letter(letter|digit)*
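The regular definition for identifiers can be tried out directly. This sketch assumes Python's re syntax; the names LETTER, DIGIT and is_identifier are chosen here for illustration.

```python
import re

LETTER = "[a-zA-Z]"
DIGIT = "[0-9]"
# identifiers = letter(letter|digit)*
IDENTIFIER = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")

def is_identifier(s):
    """True if the whole string s matches the identifier definition."""
    return IDENTIFIER.fullmatch(s) is not None

for s in ["Distance", "F1", "x", "1x", "a_b"]:
    print(s, is_identifier(s))
```

Under this definition a_b is rejected, because the underscore is neither a letter nor a digit.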
Regular expressions for tokens…
Numbers: Numbers can be:
• a sequence of digits (natural numbers), or
• decimal numbers, or
• numbers with an exponent (indicated by an e or E).
Example: 2.71E-2 represents the number 0.0271.
We can write regular definitions for these numbers as follows:
nat = [0-9]+
signedNat = (+|-)? nat
number = signedNat(“.” nat)?(E signedNat)?
Literals or constants: these can include:
• numeric constants such as 42, and
• string literals such as “hello, world”.
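A sketch of the same definitions in Python's re syntax (an assumption; the slides use abstract notation). As written, the definition of number accepts only an uppercase E, so the sketch does the same.

```python
import re

NAT = r"[0-9]+"
SIGNED_NAT = rf"[+-]?{NAT}"
# number = signedNat("." nat)?(E signedNat)?
NUMBER = re.compile(rf"{SIGNED_NAT}(\.{NAT})?(E{SIGNED_NAT})?")

def is_number(s):
    """True if the whole string s matches the number definition."""
    return NUMBER.fullmatch(s) is not None

for s in ["42", "2.71E-2", "-3.5", "1E+10", "2.", ".5", "1e5"]:
    print(s, is_number(s))
```

The strings "2." and ".5" are rejected because nat requires at least one digit on both sides of the decimal point.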
Recognition of tokens
A grammar for branching statements and conditional expressions:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term | term
term → id | number
Patterns for tokens using regular expressions:
digit → [0-9]
nat → digit+
signednat → (+|-)?nat
number → signednat(“.”nat)?(E signednat)?
letter → [A-Za-z]
id → letter(letter|digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
For this language, the lexical analyzer will recognize:
• the keywords if, then, else
• lexemes that match the patterns for relop, id, number
Transition diagram that recognizes the lexemes
matching the token relop and id.
Example:
Source program: if x <= 10 then x else y
The lexical analyzer turns the source program into tokens and attributes:
Token    Attribute
If       Null
Id       Ptr to x in ST
Relop    LE
Num      Ptr to 10 in ST
then     Null
Id       Ptr to x in ST
else     Null
Id       Ptr to y in ST
Assignment 1
Draw a transition diagram for
1. unsigned integer numeric constant.
2. unsigned number.
3. whitespace.
Design of a Lexical Analyzer/Scanner
Finite Automata
Lex turns its input program into a lexical analyzer.
At the heart of the transition is the formalism known as finite automata.
Finite automata are graphs, like transition diagrams, with a few differences:
1. Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.
2. Finite automata come in two flavors:
a) Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges. ε, the empty string, is a possible label.
b) Deterministic finite automata (DFA) have, for each state and for each symbol of the input alphabet, exactly one edge with that symbol leaving that state.
The Whole Scanner Generator Process
Overview
1. Direct construction of a nondeterministic finite automaton (NFA) to recognize a given regular expression.
   • Easy to build in an algorithmic way
   • Requires ε-transitions to combine regular subexpressions
2. Construct a deterministic finite automaton (DFA) to simulate the NFA.
   • Use a set-of-states construction
3. Minimize the number of states in the DFA (optional).
4. Generate the scanner code.
Design of a Lexical Analyzer …
Token → Pattern
Pattern → Regular Expression
Regular Expression → NFA
NFA → DFA
DFAs or NFAs for all tokens → Lexical Analyzer
Non-Deterministic Finite Automata (NFA)
Definition
An NFA M is a 5-tuple (Σ, S, T, s0, F) consisting of:
• a set of input symbols Σ, the input alphabet,
• a finite set of states S,
• a transition function T: S × (Σ ∪ {ε}) → 2^S, which gives a set of next states,
• a start state s0 ∈ S, and
• a set of accepting states F ⊆ S.
Transition Graph
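The transition graph for an NFA recognizing the language of the regular expression (a|b)*abb, i.e. all strings of a's and b's ending in the particular string abb:
[Figure: states 0, 1, 2, 3; state 0 is the start state and loops to itself on a and on b; edges 0 -a-> 1, 1 -b-> 2, 2 -b-> 3.]
S = {0,1,2,3}
Σ = {a,b}
s0 = 0
F = {3}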
Transition Table
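The mapping T of an NFA can be represented in a transition table:
State   a       b      ε
0       {0,1}   {0}    ø
1       ø       {2}    ø
2       ø       {3}    ø
3       ø       ø      ø
For the input aabb, the path 0 -a-> 0 -a-> 1 -b-> 2 -b-> 3 ends in the accepting state: YES.
The path 0 -a-> 0 -a-> 0 -b-> 0 -b-> 0 does not: NO.
An NFA accepts a string if at least one path labeled by that string leads to an accepting state.
Exercise:
Is babb accepted by (a|b)*abb?
Is bbabb?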
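Because several paths may exist, a practical way to run an NFA is to track the set of all states it could be in. A minimal sketch for the NFA above, which has no ε-edges; the names NFA and nfa_accepts are chosen here for illustration.

```python
# Transition table of the NFA for (a|b)*abb; missing entries mean the empty set.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPTING = 0, {3}

def nfa_accepts(s):
    """Follow all paths at once by tracking the set of reachable states."""
    current = {START}
    for c in s:
        nxt = set()
        for state in current:
            nxt |= NFA.get((state, c), set())
        current = nxt
    return bool(current & ACCEPTING)

print(nfa_accepts("aabb"))   # the YES path shown above exists
print(nfa_accepts("abab"))
```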
Another NFA
[Figure: NFA for aa*|bb*]
Exercise:
Is aaa accepted by aa*|bb*?
Is bbb?
Deterministic Finite Automata (DFA)
DFA example
A DFA that accepts (a|b)*abb
Simulating a DFA: Algorithm
How to apply a DFA to a string.
INPUT:
• An input string x terminated by an end-of-file character eof.
• A DFA D with start state s0, accepting states F, and transition function move.
OUTPUT: Answer "yes" if D accepts x; "no" otherwise.
METHOD
• Apply the algorithm below to the input string x.
• The function move(s, c) gives the state to which there is an edge from state s on input c.
• The function nextChar() returns the next character of the input string x.
Simulating a DFA
    s = s0;
    c = nextChar();
    while ( c != eof ) {
        s = move(s, c);
        c = nextChar();
    }
    if ( s is in F )
        return "yes";
    else return "no";
[Figure: DFA accepting (a|b)*abb]
Given the input string ababb, this DFA enters the sequence of states 0,1,2,1,2,3 and returns "yes".
Exercise:
Is bbababb accepted by (a|b)*abb?
Is bbabab?
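A runnable sketch of the algorithm. The transition table MOVE is an assumption: it is the usual four-state DFA for (a|b)*abb, chosen because it reproduces the state sequence 0,1,2,1,2,3 quoted above; the figure itself is not available here.

```python
# Assumed transition function of a DFA for (a|b)*abb.
MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
START, ACCEPTING = 0, {3}

def simulate(x):
    """Return ("yes" or "no", list of states entered) for the input string x."""
    s = START
    trace = [s]
    for c in x:              # the loop ends where the pseudocode sees eof
        s = MOVE[(s, c)]
        trace.append(s)
    return ("yes" if s in ACCEPTING else "no"), trace

print(simulate("ababb"))
```

Unlike the NFA, the DFA is in exactly one state at a time, so each input character costs a single table lookup.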
Example
Draw DFAs for the strings matched by the following definitions:
digit = [0-9]
nat = digit+
signednat = (+|-)?nat
DFA: Exercise 1
Design of a Lexical Analyzer Generator
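Two algorithms:
1- Translate a regular expression into an NFA (Thompson’s construction)
2- Translate an NFA into a DFA (subset construction)
Rules for Thompson’s construction:
1- For ε or a symbol a, which are basic regular expressions, construct:
[Figure: a start state joined to an accepting state by a single edge, labeled ε or a respectively]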
From regular expression to an NFA…
2- For a composition of regular expressions:
Case 1: Alternation: regular expression (s|r); assume that NFAs equivalent to r and s have been constructed.
[Figure: NFA for s|r]
Case 2: Concatenation: regular expression sr
[Figure: the NFAs for s and r joined by an ε-edge]
Case 3: Repetition r*
[Figure: NFA for r*]
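The three cases can be sketched as a recursive function over a small syntax tree. Everything here is an assumption made for illustration: the tuple encoding ("sym", "cat", "alt", "star"), the function names, and the choice to join concatenated fragments with an ε-edge. It is a sketch of Thompson-style construction, not a definitive implementation.

```python
from itertools import count

EPS = None   # label used for ε-edges

def thompson(node, edges, ids):
    """Append the edges of an NFA fragment for node; return (start, accept)."""
    kind = node[0]
    if kind == "sym":                      # basic case: one edge labeled a
        s, f = next(ids), next(ids)
        edges.append((s, node[1], f))
    elif kind == "cat":                    # concatenation, joined by an ε-edge
        s, f1 = thompson(node[1], edges, ids)
        s2, f = thompson(node[2], edges, ids)
        edges.append((f1, EPS, s2))
    elif kind == "alt":                    # alternation: new start and accept
        s, f = next(ids), next(ids)
        for sub in node[1:]:
            si, fi = thompson(sub, edges, ids)
            edges += [(s, EPS, si), (fi, EPS, f)]
    elif kind == "star":                   # repetition: skip edge and loop edge
        s, f = next(ids), next(ids)
        si, fi = thompson(node[1], edges, ids)
        edges += [(s, EPS, si), (fi, EPS, f), (s, EPS, f), (fi, EPS, si)]
    else:
        raise ValueError(f"unknown node kind {kind!r}")
    return s, f

def build(regex):
    edges = []
    start, accept = thompson(regex, edges, count())
    return start, accept, edges

def accepts(start, accept, edges, text):
    """Simulate the NFA on text using ε-closures of state sets."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            t = stack.pop()
            for u, label, v in edges:
                if u == t and label is EPS and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen
    current = closure({start})
    for c in text:
        current = closure({v for u, label, v in edges
                           if u in current and label == c})
    return accept in current

A, B = ("sym", "a"), ("sym", "b")
STAR_AB = ("star", ("alt", A, B))                       # (a|b)*
REGEX = ("cat", ("cat", ("cat", STAR_AB, A), B), B)     # (a|b)*abb
print(accepts(*build(REGEX), "aabb"))
```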
Examples:
1. Draw an NFA for the RE (a|b)* using Thompson's construction.
Solution:
[Figure: NFA for (a|b)*]
Class work
1. Draw an NFA for the RE (a|b)*abb
Assignment 2
Draw an NFA for each of the following REs:
1. a(a|b)*
2. a(a|b)*b
3. (ab|ba)a*
4. letter(letter | digit)*
From an NFA to a DFA
(subset construction algorithm)
Rules:
The start state of D is ε-closure(s0), where s0 is the start state of N.
The start state of D is initially unmarked.
The Subset Construction
Algorithm
1. Create the start state of the DFA by taking the ε-closure of the start state of the NFA.
2. Perform the following for the new DFA state. For each possible input symbol:
a) Apply move to the newly created state and the input symbol; this returns a set of states.
b) Apply the ε-closure to this set of states, possibly resulting in a new set.
This set of NFA states will be a single state in the DFA.
3. Each time we generate a new DFA state, we must apply step 2 to it. The process is complete when applying step 2 does not yield any new states.
4. The accepting (final) states of the DFA are those which contain any of the accepting states of the NFA.
NFA to a DFA…
ε-closure
ε-closure(S’) is a set of states with the following characteristics:
1- Every state in S’ belongs to ε-closure(S’).
2- If t ∈ ε-closure(S’) and there is an edge labeled ε from t to v, then v ∈ ε-closure(S’).
3- Repeat step 2 until no more states can be added to ε-closure(S’).
E.g., convert the following RE to an NFA and then to a DFA by using ε-closure:
a) a(a|b)*b
Solution:
NFA:
[Figure: NFA for a(a|b)*b with states 0 to 9]
NFA to DFA using ε-closure
Find the ε-closure of the start state of the NFA (state 0 is the start state).
Start state:
ε-closure({0}) = {0} = A
move(A,a) = {1}; ε-closure({1}) = {1,2,3,5,8} = B
move(A,b) = ø, a dead configuration; it becomes the dead state D of the DFA, with move(D,a) = move(D,b) = D
move(B,a) = {4}; ε-closure({4}) = {4,7,2,3,5,8} = C
move(B,b) = {6,9}; ε-closure({6,9}) = {6,7,2,3,5,8,9} = E
move(C,a) = {4}, giving C
move(C,b) = {6,9}, giving E
move(E,a) = {4}, giving C
move(E,b) = {6,9}, giving E
Transition table (A is the start state; * marks the accepting state):
State   a   b
A       B   D
B       C   E
C       C   E
*E      C   E
D       D   D
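The steps above can be checked mechanically. The sketch below assumes an edge list for the NFA of a(a|b)*b that was reconstructed from the ε-closures and moves listed above (the NFA figure itself is not available), so the table NFA is an assumption; the function names are chosen for illustration.

```python
EPS = None   # label used for ε-edges

# Assumed NFA for a(a|b)*b, reconstructed from the closures listed above.
NFA = {
    (0, "a"): {1},
    (1, EPS): {2, 8},
    (2, EPS): {3, 5},
    (3, "a"): {4},
    (5, "b"): {6},
    (4, EPS): {7},
    (6, EPS): {7},
    (7, EPS): {2, 8},
    (8, "b"): {9},
}
NFA_START, NFA_ACCEPT = 0, {9}
ALPHABET = "ab"

def eps_closure(states):
    """All states reachable from states using ε-edges only."""
    stack, closure = list(states), set(states)
    while stack:
        t = stack.pop()
        for v in NFA.get((t, EPS), ()):
            if v not in closure:
                closure.add(v)
                stack.append(v)
    return frozenset(closure)

def move(states, symbol):
    """All states reachable from states by one edge labeled symbol."""
    return {v for t in states for v in NFA.get((t, symbol), ())}

def subset_construction():
    start = eps_closure({NFA_START})
    table, unmarked = {}, [start]
    while unmarked:                       # step 3: repeat until no new states
        d = unmarked.pop()
        if d in table:
            continue
        table[d] = {}
        for a in ALPHABET:                # step 2: move, then ε-closure
            u = eps_closure(move(d, a))
            table[d][a] = u
            if u not in table:
                unmarked.append(u)
    accepting = {d for d in table if d & NFA_ACCEPT}   # step 4
    return start, table, accepting

start, table, accepting = subset_construction()
for d, row in table.items():
    print(sorted(d), {a: sorted(u) for a, u in row.items()})
```

The empty set appears as one of the DFA states; it corresponds to the dead state D.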
DFA will be:
[Figure: transition diagram of the DFA given by the transition table above]
Example 2: Convert the following NFA to a DFA by using the second method (subset construction).
NFA:
[Figure: NFA with states q0, q1, q2]
Step 1: The states (Q) of the NFA are {q0, q1, q2}. List all subsets of the states (for n states we will have 2^n subsets):
subset(Q) = {ø, {q0}, {q1}, {q2}, {q0,q1}, {q0,q2}, {q1,q2}, {q0,q1,q2}}
Step 2: Draw the transition diagram.
Step 3: rename states:
[Table: renamed states; accepting states are marked with *]
Step 4: New transition table (DFA)
[Table: DFA transition table]
Step 5: eliminate unwanted states
State   a   b
B       E   B
E       E   F
*F      E   B
Assignment 3
1. Convert the following NFA to a DFA (using subset construction).
[Figure: NFA]
NFA for identifier: letter(letter|digit)*
[Figure: NFA with states 0 to 8: 0 -letter-> 1, followed by an ε-linked loop over the alternatives 3 -letter-> 4 and 5 -digit-> 6, ending in state 8]
NFA to a DFA…
Example: Convert the following NFA into the corresponding DFA: letter(letter|digit)*
A = {0}
B = {1,2,3,5,8}
C = {4,7,2,3,5,8}
D = {6,7,8,2,3,5}
[Figure: DFA with start state A; A -letter-> B; each of B, C and D goes to C on letter and to D on digit]
Exercise: convert the NFA of (a|b)*abb into a DFA.
Other Algorithms
The Lexical- Analyzer Generator: Lex
The first phase in a compiler reads the input source and converts strings in the source to tokens.
Lex generates a scanner (lexical analyzer or lexer) given a specification of the tokens using REs.
• The input notation for the Lex tool is referred to as the Lex language.
• The tool itself is the Lex compiler.
• The Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram.
General Compiler Infrastructure
[Figure: program source (stream of characters) -> Scanner (tokenizer) -> tokens -> Parser -> parse tree -> Semantic Routines -> annotated/decorated tree -> Analysis/Transformations/Optimizations -> IR (Intermediate Representation) -> Code Generator -> assembly code. The scanner is labeled Lex and the parser Yacc; symbol and literal tables sit alongside the phases.]
Scanner, Parser, Lex and Yacc
Generating a Lexical Analyzer using Lex
Lex is a scanner generator: it takes a lexical specification as input and produces a lexical analyzer written in C.
Lex source program (lex.l) -> Lex compiler -> lex.yy.c
lex.yy.c -> C compiler -> a.out
Input stream -> a.out (lexical analyzer) -> sequence of tokens
Lex specification
Program structure:
    ...declaration section...
    %%
    ...rule section...
    %%
    ...user defined functions...
Declaration section: variables and constants. C declarations go between %{ and %}.
Rules section: regular expression <--> action, for example:
    P1 { action1 }
    P2 { action2 }
• The actions are C code.
Skeleton of a lex specification (.l file)
x.l (*.c is generated after running ...)
    %{
    < C global variables, prototypes, comments >
    %}
    [DEFINITION SECTION]
The part between %{ and %} is copied as-is to the top of the generated C file.
In the definition section, substitutions simplify pattern matching.
Thompson’s construction
ε-closure({0}) = {0,1,3,7}
move({0,1,3,7},a) = {2,4,7}
ε-closure({2,4,7}) = {2,4,7}
move({2,4,7},a) = {7}
ε-closure({7}) = {7}
move({7},b) = {8}
ε-closure({8}) = {8}
move({8},a) = ∅
Combining and simulation of NFAs of a Set of
Regular Expressions: Example 2
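a     {action1}
abb   {action2}
a*b+  {action3}
[Figure: one NFA per pattern: 1 -a-> 2 for a; 3 -a-> 4 -b-> 5 -b-> 6 for abb; state 7 looping on a, 7 -b-> 8, state 8 looping on b for a*b+. A new start state 0 has ε-edges to 1, 3 and 7. The accepting states 2, 6 and 8 carry action1, action2 and action3.]
When two or more accepting states are reached, the action of the pattern listed first is executed.
[Figure: simulation on the input abb: {0,1,3,7} -a-> {2,4,7} -b-> {5,8} -b-> {6,8}. The last set contains the accepting states for action2 and action3.]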
DFA's for Lexical Analyzers
NFA → DFA. Transition table for the DFA:
State   a     b     Token found
0137    247   8     None
247     7     58    a
8       -     8     a*b+
7       7     8     None
58      -     68    a*b+
68      -     8     abb
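A lexical analyzer runs this DFA until it gets stuck and then falls back to the last accepting state it passed through, which yields the longest match. A minimal sketch driven by the table above; the dictionary layout and the function name scan are assumptions made for illustration.

```python
# The DFA from the table above; a missing entry means "no transition".
DFA = {
    "0137": {"a": "247", "b": "8"},
    "247":  {"a": "7",   "b": "58"},
    "8":    {"b": "8"},
    "7":    {"a": "7",   "b": "8"},
    "58":   {"b": "68"},
    "68":   {"b": "8"},
}
TOKEN = {"247": "a", "8": "a*b+", "58": "a*b+", "68": "abb"}

def scan(text):
    """Split text into (pattern, lexeme) pairs using the longest match."""
    tokens, pos = [], 0
    while pos < len(text):
        state, i = "0137", pos
        last = None                 # (end position, pattern) of last accepting state
        while i < len(text) and text[i] in DFA[state]:
            state = DFA[state][text[i]]
            i += 1
            if state in TOKEN:
                last = (i, TOKEN[state])
        if last is None:
            raise ValueError(f"no token matches at position {pos}")
        end, pattern = last
        tokens.append((pattern, text[pos:end]))
        pos = end                   # back up to just after the last match
    return tokens

print(scan("aaba"))
print(scan("abb"))
```

On aaba the DFA gets stuck after reading aab, so the first lexeme is aab (pattern a*b+) and scanning resumes at the final a. On abb both abb and a*b+ match the whole input, and state 68 reports abb because that pattern is listed first.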
Pattern matching examples
Meta-characters
Lex Regular Expression: Examples
• an integer: 12345
    [1-9][0-9]*
• a word: cat
    [a-zA-Z]+
• a (possibly) signed integer: 12345 or -12345
    [-+]?[1-9][0-9]*
• a floating point number: 1.2345
    [0-9]*"."[0-9]+
Regular Expression: Examples…
• a delimiter for an English sentence
    "."|"?"|!    or    [.?!]
• C++ comment: // call foo() here!!
    "//".*
• white space
    [ \t]+
• English sentence: Look at this!
    ([ \t]+|[a-zA-Z]+)+("."|"?"|!)
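The patterns above can be tried without running Lex. The sketch below uses Python's re module with assumed translations of the Lex patterns (a quoted "." in Lex becomes \. in Python); the dictionary PATTERNS and the helper matches are hypothetical names.

```python
import re

# Assumed Python translations of the Lex patterns above.
PATTERNS = {
    "integer":        r"[1-9][0-9]*",
    "word":           r"[a-zA-Z]+",
    "signed integer": r"[-+]?[1-9][0-9]*",
    "float":          r"[0-9]*\.[0-9]+",
    "delimiter":      r"[.?!]",
    "comment":        r"//.*",
    "white space":    r"[ \t]+",
    "sentence":       r"([ \t]+|[a-zA-Z]+)+[.?!]",
}

def matches(name, text):
    """True if the whole of text matches the named pattern."""
    return re.fullmatch(PATTERNS[name], text) is not None

print(matches("integer", "12345"))
print(matches("integer", "0"))
print(matches("comment", "// call foo() here!!"))
print(matches("sentence", "Look at this!"))
```

Note that the integer pattern requires a non-zero first digit, so it rejects 0 and 012.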
Two Rules
Lex variables
yyin: of type FILE*. Points to the current file being scanned by the lexer.
yyout: of type FILE*. Points to the location where the output of the lexer will be written.
• By default, yyin points to standard input and yyout to standard output.
yytext: a pointer (char *) to the matched string.
yyleng: gives the length of the matched string.
yylineno: provides the current line number.
Lex functions
Lex predefined variables
Let us run a lex program
Lex : programs
The first example is the shortest possible lex file:
%%
Input is copied to output, one character at a time.
The first %% is always required, as there must
always be a rules section.
However, if we don’t specify any rules, then the
default action is to match everything and copy it to
output.
Defaults for input and output are stdin and stdout,
respectively.
Here is the same example, with defaults explicitly
coded:
    %%
        /* rule section */
        /* match everything except newline */
    .   ECHO;
        /* match newline */
    \n  ECHO;
    %%
    /* user definition section */
    int yywrap(void) {
        return 1;
    }
    int main(void) {
        yylex();    /* invokes the lexical analyzer */
        return 0;
    }
Developing Lexical analyzer using
Lex : Linux (Fedora)
vi is used to edit lex and yacc source files. Its commands include:
:w – save
:q – quit
:w filename – save as
:wq – save and quit
:q! – quit, discarding changes
How to compile and run LEX programs...
4. Press esc
5. Press :wq
6. lex lab1.l
7. gcc lex.yy.c -ll
8. ./a.out <hello.c
Examples (more)
The default behavior, coded explicitly:
    %%
        /* match everything except newline */
    .   ECHO;
        /* match newline */
    \n  ECHO;
    %%
    int yywrap(void) {
        return 1;
    }
    int main(void) {
        yylex();
        return 0;
    }
A program with regular definitions and translation rules:
    %{
    #include <stdio.h>
    %}
        /* regular definitions */
    digit   [0-9]
    letter  [A-Za-z]
    id      {letter}({letter}|{digit})*
    %%
        /* translation rules */
    {digit}+  { printf("number: %s\n", yytext); }
    {id}      { printf("ident: %s\n", yytext); }
    .         { printf("other: %s\n", yytext); }
    %%
    int main(void) {
        yylex();
        return 0;
    }
Example: Finding the number of identifiers in a given program
    digit   [0-9]
    letter  [A-Za-z]
    %{
    #include <stdio.h>
    int count;
    %}
    %%
    {letter}({letter}|{digit})*   count++;
    %%
    int main(void) {
        yylex();
        printf("The number of identifiers is %4d\n", count);
        return 0;
    }
Example: Here is a scanner that counts the number of characters, words, and lines in a file.
    %{
    #include <stdio.h>
    int nchar, nword, nline;
    %}
    %%
    \n         { nline++; nchar++; }
    [^ \t\n]+  { nword++, nchar += yyleng; }
    .          { nchar++; }
    %%
    int main(void) {
        yylex();
        printf("%d\t%d\t%d\n", nchar, nword, nline);
        return 0;
    }
    %{
        /* definitions of manifest constants
           LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
    %}
        /* regular definitions */
    delim    [ \t\n]
    ws       {delim}+
    letter   [A-Za-z]
    digit    [0-9]
    id       {letter}({letter}|{digit})*
    number   {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
    %%
    {ws}       { /* no action and no return */ }
    if         { /* return token to parser */ return IF; }
    then       { return THEN; }
    else       { return ELSE; }
    {id}       { /* yylval holds the token attribute */ yylval = install_id(); return ID; }
    {number}   { yylval = install_num(); return NUMBER; }
    "<"        { yylval = LT; return RELOP; }
    "<="       { yylval = LE; return RELOP; }
    "="        { yylval = EQ; return RELOP; }
    "<>"       { yylval = NE; return RELOP; }
    ">"        { yylval = GT; return RELOP; }
    ">="       { yylval = GE; return RELOP; }
    %%
    int install_id() { /* install yytext as an identifier in the symbol table */ }
    int install_num() { /* install yytext as a number */ }
Assignment on Lexical Analyzer
1. Write a program in LEX to count the number of consonants and vowels in given C and C++ source programs.
2. Write a program in LEX to count the number of:
(i) positive and negative integers
(ii) positive and negative fractions
in C and C++ source programs.
3. Write a LEX program to recognize valid C and C++ programs.
END!