Chapter-3 Short

Chapter 3 discusses the role of the lexical analyzer in a compiler, which includes reading input characters, grouping them into lexemes, and producing tokens for syntax analysis. It highlights the interaction between the lexical analyzer and the parser, the distinction between scanning and lexical analysis, and the importance of tokens, patterns, and lexemes. Additionally, the chapter covers error handling, regular expressions, and finite automata as tools for recognizing tokens.


Chapter 3

Lexical Analysis

Muhammad Kamal Hossen


Associate Professor
Dept. of CSE, CUET
E-mail: khossen@[Link]
The Role of the Lexical Analyzer
• The role of the lexical analyzer is to read the input characters of
the source program, group them into lexemes, and produce as
output a sequence of tokens, one for each lexeme in the source
program.
• The stream of tokens is sent to the parser for syntax
analysis.
• It is common for the lexical analyzer to interact with the
symbol table as well.
• When the lexical analyzer discovers a lexeme constituting
an identifier, it needs to enter that lexeme into the symbol
table.
• In some cases, information regarding the kind of identifier
may be read from the symbol table by the lexical analyzer
to assist it in determining the proper token it must pass to
the parser.

CCE-4805 2
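The grouping of characters into lexemes and the production of tokens can be sketched in a few lines of Python. This is an illustrative sketch only: the token names and patterns below are assumptions for a toy language, not the chapter's definitions.

```python
import re

# Illustrative token specification: (token name, pattern) pairs.
# The master pattern tries alternatives left to right on each scan step.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("OP",     r"[+\-*/]"),
    ("WS",     r"[ \t\n]+"),   # whitespace: stripped, never sent to the parser
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokens(source):
    """Yield (token name, lexeme) pairs for each lexeme in the source."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "WS":      # whitespace only separates tokens
            yield (m.lastgroup, m.group())
```

For example, `list(tokens("count = count + 1"))` yields the pairs `('ID', 'count')`, `('ASSIGN', '=')`, `('ID', 'count')`, `('OP', '+')`, `('NUMBER', '1')` — the stream that would be sent to the parser.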
Interaction between the lexical
analyzer & the parser

The Role of the Lexical Analyzer
• Other tasks include:
Stripping out comments and whitespace
(blank, newline, tab, and perhaps other
characters that are used to separate tokens in
the input).
Correlating error messages generated by the
compiler with the source program.

Lexical analysis is sometimes divided into two processes
1. Scanning consists of the simple processes that
do not require tokenization of the input, such
as deletion of comments and compaction of
consecutive whitespace characters into one.
2. Lexical analysis proper is the more complex
portion, which produces tokens from the
output of the scanner.

Lexical Analysis Versus Parsing
• Why is the analysis portion of a compiler normally
separated into lexical analysis and parsing (syntax
analysis) phases?
1. Simplicity of design is the most important consideration.
 The separation of lexical & syntactic analysis often
allows us to simplify at least one of these tasks.
 If we are designing a new language, separating lexical
& syntactic concerns can lead to a cleaner overall
language design.
2. Compiler efficiency is improved.
A separate lexical analyzer allows us to
apply specialized techniques that serve
only the lexical task, not the job of parsing.
In addition, specialized buffering
techniques for reading input characters can
speed up the compiler significantly.
3. Compiler portability is enhanced. Input-
device-specific peculiarities can be restricted
to the lexical analyzer.
Tokens, Patterns & Lexemes
• Token
 A token name + an optional attribute
value.
 The token name is an abstract symbol
representing a kind of lexical unit, e.g., a
particular keyword, or a sequence of input
characters denoting an identifier.
 The token names are the input symbols
that the parser processes.
• Pattern
o A pattern is a description of the form that the
lexemes of a token may take.
o In the case of a keyword as a token, the pattern is
just the sequence of characters that form the
keyword.
o For identifiers & some other tokens, the pattern is
a more complex structure that is matched by
many strings.
• Lexemes
 A lexeme is a sequence of characters in the
source program that matches the pattern for a
token & is identified by the lexical analyzer as an
instance of that token.
Example: Patterns & Lexemes

• C statement
printf("Total = %d\n", score);
• Both printf and score are lexemes matching the
pattern for token id, and "Total = %d\n" is a lexeme
matching the pattern for token literal.
Covering most or all of the tokens
1. One token for each keyword. The pattern for a
keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in
classes.
3. One token representing all identifiers.
4. One or more tokens representing constants, such
as numbers & literal strings.
5. Tokens for each punctuation symbol, such as left
& right parentheses, comma, & semicolon.
Attributes for Tokens
• When more than one lexeme can match a pattern, the
lexical analyzer must provide the subsequent compiler
phases additional information about the particular lexeme
that matched.
• For example, the pattern for token number matches both
0 and 1, but it is extremely important for the code
generator to know which lexeme was found in the source
program.
• Thus, in many cases the lexical analyzer returns to the
parser not only a token name, but also an attribute value
that describes the lexeme represented by the token.
• The token name influences parsing decisions, while the
attribute value influences translation of tokens after the
parse.
Attributes for Tokens
• Tokens have at most one associated attribute,
although this attribute may have a structure
that combines several pieces of information.
• Normally, information about an identifier, e.g.,
its lexeme, its type, and the location at which
it is first found is kept in the symbol table.
• Thus, the appropriate attribute value for an
identifier is a pointer to the symbol-table
entry for that identifier.
Example 3.2: The token names & associated
attribute values for the Fortran statement
E = M * C ** 2
• Sequence of pairs:
<id, pointer to symbol-table entry for E>
<assign_op> (no attribute value needed)
<id, pointer to symbol-table entry for M>
<mult_op> (no attribute value needed)
<id, pointer to symbol-table entry for C>
<exp_op> (no attribute value needed)
<number, integer value 2>
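Example 3.2 can be reproduced with a small sketch: identifiers are entered into a symbol table and the attribute value is a "pointer" to the entry (here, simply its index). The helper names `lookup` and `tokenize` are illustrative, not from the text.

```python
import re

symtab = []  # the symbol table: a list of identifier lexemes

def lookup(lexeme):
    """Enter the lexeme into the symbol table if new; return its entry index."""
    if lexeme not in symtab:
        symtab.append(lexeme)
    return symtab.index(lexeme)

def tokenize(stmt):
    """Return the sequence of <token name, attribute value> pairs."""
    pairs = []
    # Match ** before * so the exponentiation operator wins.
    for lexeme in re.findall(r"\*\*|[A-Za-z]\w*|\d+|[=*]", stmt):
        if lexeme == "=":
            pairs.append(("assign_op", None))    # no attribute value needed
        elif lexeme == "**":
            pairs.append(("exp_op", None))
        elif lexeme == "*":
            pairs.append(("mult_op", None))
        elif lexeme.isdigit():
            pairs.append(("number", int(lexeme)))
        else:
            pairs.append(("id", lookup(lexeme))) # "pointer" = symtab index
    return pairs
```

Running `tokenize("E = M * C ** 2")` gives `[('id', 0), ('assign_op', None), ('id', 1), ('mult_op', None), ('id', 2), ('exp_op', None), ('number', 2)]`, with E, M, C entered into the symbol table.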
Lexical Errors
• It is hard for a lexical analyzer to tell, without the
aid of other components, that there is a source-
code error.
• For instance, if the string fi is encountered for the
first time in a C program in the context:
fi ( a == f (x) ) . . .
• A lexical analyzer cannot tell whether fi is a
misspelling of the keyword if or an undeclared
function identifier.
• Since fi is a valid lexeme for the token id, the
lexical analyzer must return the token id to the
parser.
Lexical Errors
• We may let the parser handle an error due to a
transposition of the letters.
• However, suppose a situation arises in which the lexical
analyzer is unable to proceed because none of the patterns
for tokens matches any prefix of the remaining input.
• Error Recovery
 The simplest recovery strategy is "panic mode" recovery.
 We delete successive characters from the remaining input,
until the lexical analyzer can find a well-formed token at the
beginning of what input is left.
 This recovery technique may confuse the parser, but in an
interactive computing environment it may be quite adequate.
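Panic-mode recovery amounts to one loop: delete characters from the front of the remaining input until some token pattern matches a prefix. A minimal sketch (the predicate parameter is an assumed stand-in for "some pattern matches a prefix"):

```python
def panic_mode(text, matches_some_pattern):
    """Delete successive characters from the remaining input until a
    well-formed token can start at the beginning of what is left."""
    while text and not matches_some_pattern(text):
        text = text[1:]   # delete one character and try again
    return text
```

For instance, with a toy predicate that accepts any input starting with a letter or digit, `panic_mode("@@x = 1", lambda s: s[0].isalnum())` skips the stray `@@` and returns `"x = 1"`.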
• Other possible error-recovery actions are:
1. Delete one character from the remaining
input.
2. Insert a missing character into the remaining
input.
3. Replace a character by another character.
4. Transpose two adjacent characters.

Regular Expressions
• In regular expressions for identifiers, the underscore is
included among the letters.
• If letter_ is established to stand for any letter
or the underscore, and digit is established to
stand for any digit, then we can describe the
language of C identifiers by:
letter_ ( letter_ | digit )*
Vertical bar: union
Parentheses: group subexpressions
Star: zero or more occurrences of
Juxtaposition of letter_ with the remainder of the expression signifies concatenation
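The expression letter_ ( letter_ | digit )* carries over directly to, e.g., Python's regex syntax, where the character class `[A-Za-z_]` plays the role of letter_. A small sketch (the function name is illustrative):

```python
import re

# letter_ ( letter_ | digit )*  in Python's re syntax
c_identifier = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_c_identifier(s):
    """True iff the whole string is a C identifier."""
    return c_identifier.fullmatch(s) is not None
```

Here `is_c_identifier("_count1")` is true, while `is_c_identifier("1count")` is false, since an identifier cannot begin with a digit.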
Regular Expressions
• Each regular expression r denotes a language
L(r) , which is also defined recursively from the
languages denoted by r's sub expressions.
• Here are the rules that define the regular
expressions over some alphabet Σ and the
languages that those expressions denote.
BASIS: There are two rules:

1. ε is a regular expression, and L(ε) is {ε},
 the language whose sole member is the empty
string.
2. If a is a symbol in Σ, then a is a regular
expression, and L(a) = {a},
 the language with one string, of length one,
with a in its one position.
INDUCTION: There are four parts to the induction whereby larger
regular expressions are built from smaller ones. Suppose r and s
are regular expressions denoting languages L(r) and L(s).

1. (r)|(s) is a regular expression denoting the language L(r) U L(s) .

2. (r) (s) is a regular expression denoting the language L(r)L(s).

3. (r)* is a regular expression denoting (L (r)) * .

4. (r) is a regular expression denoting L(r) .


 This last rule says that we can add additional pairs of parentheses
around expressions without changing the language they denote.

Regular Expressions
• We may drop certain pairs of parentheses
• Conventions:
a) The unary operator * has highest precedence & is left
associative.
b) Concatenation has second highest precedence and is left
associative.
c) | has lowest precedence and is left associative.
• For example,
(a)|((b)*(c)) = a|b*c.
• Both expressions denote the set of strings that are either
a single a or are zero or more b's followed by one c.
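These precedence conventions can be checked quickly, since Python's `re` applies the same precedence to |, concatenation, and *. Under them, a|b*c parses as (a)|((b*)(c)):

```python
import re

# a|b*c == (a)|((b*)(c)): a single a, or zero or more b's followed by c
pattern = re.compile(r"a|b*c")

assert pattern.fullmatch("a")        # the single a
assert pattern.fullmatch("c")        # zero b's, then c
assert pattern.fullmatch("bbc")      # two b's, then c
assert not pattern.fullmatch("ab")   # not in the language
```

If | bound tighter than concatenation, the same string would instead describe (a|b)*c, which matches ab-free prefixes differently — the conventions matter.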
Example 3.4: Let Σ = {a, b}.
1. The regular expression a|b denotes the language {a, b}.
2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all
strings of length two over the alphabet Σ. Another regular
expression for the same language: aa|ab|ba|bb.
3. a* denotes the language consisting of all strings of zero or
more a's: {ε, a, aa, aaa, ...}.
4. (a|b)* denotes the set of all strings consisting of zero or
more instances of a or b,
 i.e., all strings of a's and b's: {ε, a, b, aa, ab, ba, bb, aaa, ...}.
 Another regular expression for the same language: (a*b*)*.
5. a|a*b denotes the language {a, b, ab, aab, aaab, ...},
 i.e., the string a and all strings consisting of zero or more a's
ending in b.
Regular Expressions
• Regular set: A language that can be defined by
a regular expression.
• If two regular expressions r and s denote the
same regular set, we say that they are
equivalent and write r = s.
• For instance, (a|b) = (b|a).
Algebraic laws for regular expressions r, s, & t

Regular Definitions
• If Σ is an alphabet of basic symbols, then a regular
definition is a sequence of definitions of the form:
• d1 → r1
• d2 → r2
· · ·
• dn → rn
where:
1. Each di is a new symbol, not in Σ and not the same as any
other of the d's, and
2. Each ri is a regular expression over the alphabet
Σ ∪ {d1, d2, ..., di-1}.
• Example 3.5: C identifiers are strings of
letters, digits, and underscores. Here is a
regular definition for the language of C
identifiers. We shall conventionally use italics
for the symbols defined in regular definitions.
• letter_ → A | B | · · · | Z | a | b | · · · | z | _
• digit → 0 | 1 | · · · | 9
• id → letter_ ( letter_ | digit )*
Abbreviations
• The basic operations generate all possible regular
expressions, but there are common abbreviations
used for convenience. Typical examples are:

Abbr.     Meaning        Notes
r+        (rr*)          1 or more occurrences
r?        (r | ε)        0 or 1 occurrence
[a-z]     (a|b|…|z)      1 character in the given range
[abxyz]   (a|b|x|y|z)    1 of the given characters
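Each abbreviation has a direct counterpart in most regex tools. A quick illustration in Python's `re` syntax (the sample patterns and strings are illustrative):

```python
import re

assert re.fullmatch(r"ab+", "abbb")       # r+ : one or more occurrences of b
assert re.fullmatch(r"ab?", "a")          # r? : zero or one occurrence of b
assert re.fullmatch(r"[a-z]+", "lexeme")  # [a-z] : a character in the range
assert re.fullmatch(r"[abxyz]", "x")      # [abxyz] : one of the listed characters
assert not re.fullmatch(r"ab+", "a")      # + requires at least one b
```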
Extensions of Regular Expressions

Recognition of Tokens
• Build a piece of code that examines the input
string & finds a prefix that is a lexeme
matching one of the patterns.

Patterns
Example 3.8:
Terminals:
if, then, else,
relop, id,
number
- names of tokens

ws → ( blank | tab | newline )+


• blank, tab, and newline are abstract symbols.
• Token ws is different from the other tokens in
that, when we recognize it, we do not return it
to the parser, but rather restart the lexical
analysis from the character that follows the
whitespace.
Tokens, their patterns, and attribute values

• For each lexeme or family of lexemes, the table shows
which token name is returned to the parser and what
attribute value, if any, is returned with it.
 For the six relational operators, symbolic constants LT, LE,
and so on are used as the attribute value, in order to
indicate which instance of the token relop we have found.
Transition Diagrams
• Convert patterns into stylized flowcharts:
"transition diagrams"
Transition diagrams have a collection of
nodes or circles, called states.
Each state represents a condition that
could occur during the process of scanning
the input looking for a lexeme that matches
one of several patterns.

Transition Diagrams
• Edges are directed from one state to another.
• Each edge is labeled by a symbol or set of symbols.
• If we are in some state s , & the next input symbol is a, we
look for an edge out of state s labeled by a (and perhaps
by other symbols, as well).
• If such an edge is found, we advance the forward pointer
and enter the state of the transition diagram to which that
edge leads.
• Transition diagrams are deterministic
 There is never more than one edge out of a given state
with a given symbol among its labels.
• Example 3.9: Transition diagram that
recognizes the lexemes matching the token
relop.

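The relop transition diagram translates into straight-line code: try the two-character operators first, and "retract" (consume only one character, leaving the lookahead in the input) otherwise. A sketch with illustrative names; the state numbering of the diagram is implicit in the nesting:

```python
def relop(text):
    """Match a relational operator at the front of text.
    Return ((token name, attribute), characters consumed) or None."""
    if not text:
        return None
    c = text[0]
    if c == "<":
        if text[1:2] == "=":
            return (("relop", "LE"), 2)
        if text[1:2] == ">":
            return (("relop", "NE"), 2)
        return (("relop", "LT"), 1)      # retract: only '<' is consumed
    if c == "=":
        return (("relop", "EQ"), 1)
    if c == ">":
        if text[1:2] == "=":
            return (("relop", "GE"), 2)
        return (("relop", "GT"), 1)      # retract
    return None
```

For example, `relop("<= b")` returns `(('relop', 'LE'), 2)`, while `relop("<x")` returns `(('relop', 'LT'), 1)` after retracting past the lookahead `x`.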
Finite Automata
• Finite automata are essentially graphs, like transition
diagrams, with a few differences:
1. Finite automata are recognizers; they simply say "yes" or "no"
about each possible input string.
2. Finite automata come in two flavors:
 (a) Nondeterministic finite automata (NFA) have no
restrictions on the labels of their edges. A symbol can label
several edges out of the same state, and ε, the empty
string, is a possible label.
 (b) Deterministic finite automata (DFA) have, for each
state, and for each symbol of its input alphabet, exactly one
edge with that symbol leaving that state.
• Both NFA & DFA are capable of recognizing the same
languages (regular language)
Nondeterministic Finite Automata (NFA)
1. A finite set of states S.
2. A set of input symbols Σ, the input alphabet. The
empty string ε is never a member of Σ.
3. A transition function that gives, for each state,
and for each symbol in Σ ∪ {ε}, a set of next
states.
4. A state s0 from S that is distinguished as the start
state (or initial state).
5. A set of states F, a subset of S, that is
distinguished as the accepting states (or final
states).
 Any NFA/DFA can be represented by a transition graph, where the
nodes are states and the labeled edges represent the transition
function.
 There is an edge labeled a from state s to state t
 if and only if t is one of the next states for state s and
input a.

This graph is very much like a transition diagram, except:

a) The same symbol can label edges from one state to several
different states, and
b) An edge may be labeled by ε, the empty string, instead of, or
in addition to, symbols from the input alphabet.
• Example 3.14: r = (a|b)*abb
NFA

 The double circle around state 3 indicates that this state is
accepting.
 The only way to get from the start state 0 to the accepting state 3
is to follow some path that stays in state 0 for a while, then goes to
states 1, 2, and 3 by reading abb from the input.
 Thus, the only strings getting to the accepting state are those
that end in abb.
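The behavior of this NFA can be simulated by tracking the *set* of states reachable so far. The transitions below are an assumed encoding of the diagram for (a|b)*abb (state 0 on a can stay at 0 or advance to 1 — the nondeterminism); this NFA has no ε-moves, so no closure step is needed:

```python
# Transition function of the NFA for (a|b)*abb: (state, symbol) -> set of states
DELTA = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}

def nfa_accepts(s, start=0, accept=frozenset({3})):
    """Accept iff some path labeled by s leads from start to an accepting state."""
    states = {start}
    for c in s:
        # Union of next-state sets over every state we might be in.
        states = set().union(*(DELTA.get((q, c), set()) for q in states))
    return bool(states & accept)
```

Thus `nfa_accepts("aabb")` is true (one path stays in 0 on the first a, then reads abb through states 1, 2, 3), while `nfa_accepts("abab")` is false — only strings ending in abb are accepted.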
Transition Tables
• Rows: states.
• Columns: input symbols and ε.
• The entry for a given state & input is the
value of the transition function applied
to those arguments.
• If the transition function has no
information about that state-input pair,
put ∅.
• Adv.: easily find the transitions on a
given state and input.
• Disadv.: takes a lot of space when the
input alphabet is large.
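A transition table is naturally a nested mapping: rows are states, columns are symbols (here `"eps"` stands in for the ε column), and the empty set plays the role of ∅. A sketch for the NFA of (a|b)*abb:

```python
# Rows: states; columns: input symbols and eps; entries: sets of next states.
TABLE = {
    0: {"a": {0, 1}, "b": {0},   "eps": set()},
    1: {"a": set(),  "b": {2},   "eps": set()},
    2: {"a": set(),  "b": {3},   "eps": set()},
    3: {"a": set(),  "b": set(), "eps": set()},
}
```

Lookup is a constant-time double index — `TABLE[0]["a"]` gives `{0, 1}` — which is the advantage noted above; the cost is one entry per state-symbol pair even when most entries are ∅.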
Acceptance of Input Strings by
Automata
• An NFA accepts input string x if & only if there
is some path in the transition graph from the
start state to one of the accepting states.
• ε labels along the path are effectively ignored,
since the empty string does not contribute to
the string constructed along the path.
Example : The string aabb is accepted by the NFA

• Another path (not accepting)

 NFA accepts a string as long as some path


labeled by that string leads from the start
state to an accepting state.
NFA

L(aa* | bb* )

• String aaa accepted


• ε "disappears" in a
concatenation
Deterministic Finite Automata (DFA)
1. There are no moves on input ε.
2. For each state s & input symbol a, there is
exactly one edge out of s labeled a.
• If we are using a transition table to represent
a DFA, then each entry is a single state.
• Represent this state without the curly braces
that we use to form sets.
• Lexical Analyzer---DFA
Example 3.19 : (a|b)* abb

• Given the input string: ababb,
• Sequence of states: 0, 1, 2, 1, 2, 3, & returns "yes"
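This run can be reproduced with a small simulation of the DFA for (a|b)*abb. The transition entries below are an assumed encoding of the standard diagram for this language (state 0: no progress; 1: seen a; 2: seen ab; 3: seen abb):

```python
# DFA for (a|b)*abb: each (state, symbol) entry is a single state.
DFA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}

def dfa_run(s, start=0, accept=frozenset({3})):
    """Return (accepted?, the sequence of states visited)."""
    state, trace = start, [start]
    for c in s:
        state = DFA[state][c]   # exactly one next state: deterministic
        trace.append(state)
    return state in accept, trace
```

Here `dfa_run("ababb")` gives `(True, [0, 1, 2, 1, 2, 3])` — exactly the state sequence on the slide.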
DFA vs. NFA
◊ DFA:
 δ returns a single state.
 Every state of a DFA always has exactly one
exiting transition arrow for each symbol in the
alphabet.
 Labels on the transition arrows are symbols
from the alphabet.
 NFA:
 δ returns a set of states.
 An NFA may have arrows labeled with
members of the alphabet or ε.
 Zero, one, or many arrows may exit from
each state with label ε.
DFA vs. NFA
[Figure: an NFA's parallel computation tree, with branches ending in accept or reject.]
Functions Computed From the
Syntax Tree
• To construct a DFA directly from a regular
expression, we construct its syntax tree and
then compute four functions:
 nullable
 firstpos
 lastpos
 followpos
Four Functions
1. nullable(n) is true for a syntax-tree node n if & only if the sub
expression represented by n has ε in its language.
 Sub-expression can be "made null" or the empty string, even
though there may be other strings it can represent as well.
2. firstpos(n) is the set of positions in the subtree rooted at n
that correspond to the first symbol of at least one string in
the language of the sub expression rooted at n.
3. lastpos(n) is the set of positions in the subtree rooted at n
that correspond to the last symbol of at least one string in
the language of the sub expression rooted at n
4. followpos(p), for a position p, is the set of positions q in the
entire syntax tree such that there is some string x = a1a2 · · · an
in L((r)#) such that for some i, there is a way to explain the
membership of x in L((r)#) by matching ai to position p of the
syntax tree and ai+1 to position q.
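The first two functions can be sketched as recursive traversals of the syntax tree. This is a minimal illustrative version of nullable and firstpos only — the node encoding (tuples tagged "leaf", "eps", "or", "cat", "star") is an assumption, not the book's representation:

```python
def nullable(n):
    """True iff the subexpression at node n has the empty string in its language."""
    kind = n[0]
    if kind == "leaf": return False                       # a single symbol
    if kind == "eps":  return True                        # the node for epsilon
    if kind == "or":   return nullable(n[1]) or nullable(n[2])
    if kind == "cat":  return nullable(n[1]) and nullable(n[2])
    if kind == "star": return True                        # zero repetitions

def firstpos(n):
    """Positions that can match the first symbol of some string at node n."""
    kind = n[0]
    if kind == "leaf": return {n[1]}
    if kind == "eps":  return set()
    if kind == "or":   return firstpos(n[1]) | firstpos(n[2])
    if kind == "cat":
        # If the left operand can be empty, the right can also start the string.
        return (firstpos(n[1]) | firstpos(n[2])) if nullable(n[1]) else firstpos(n[1])
    if kind == "star": return firstpos(n[1])

# (a|b)*a with positions a:1, b:2, a:3
tree = ("cat", ("star", ("or", ("leaf", 1), ("leaf", 2))), ("leaf", 3))
```

For this tree, `nullable(tree)` is false (the trailing a is required), and `firstpos(tree)` is `{1, 2, 3}` because the starred part is nullable, so the final a can also begin a string; lastpos is the mirror image of firstpos, and followpos is then filled in from the cat and star nodes.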
Conclusion
• Tokens
• Lexemes
• Patterns
• Regular Expressions
• Regular Definitions
• Transition Diagrams
• Finite Automata
• DFA & NFA
