Lexical Analysis
HARSH RAJ
Galgotias University
Greater Noida
Overview
[Figure: source program → lexical analyzer (scanner) → tokens → syntax analyzer (parser), with a symbol table manager]
Overview (cont’d)
Input file (character stream):
    /* pgm.c */\nint main(int argc, char **argv) {\n\tint x, Y;\n\tfloat w;\n...

Token sequence (output of the lexical analyzer):
    keywd_int
    identifier: "main"
    left_paren
    keywd_int
    identifier: "argc"
    comma
    keywd_char
    star
    star
    identifier: "argv"
    right_paren
    left_brace
    keywd_int
    …
Implementing Lexical Analyzers
Different approaches:
• Using a scanner generator, e.g., lex or flex. This automatically generates a lexical analyzer from a high-level description of the tokens. (easiest to implement; least efficient)
• Programming it in a language such as C, using the I/O facilities of the language. (intermediate in ease, efficiency)
• Writing it in assembly language and explicitly managing the input. (hardest to implement, but most efficient)
Examples
Input: count = 123
Tokens:
    Token            Rule                      Lexeme
    identifier       "letter followed by …"    count
    assg_op          =                         =
    integer_const    "digit followed by …"     123
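
The example above can be reproduced with a small table-driven tokenizer, in the spirit of describing tokens by patterns. This is only a sketch under stated assumptions: the slide elides the full rules ("letter followed by …"), so the patterns below assume an identifier is a letter followed by letters or digits and an integer constant is one or more digits. The names TOKEN_SPEC, MASTER, and tokenize are hypothetical and do not come from the slides.

```python
import re

# Assumed completions of the elided rules on the slide (hypothetical):
#   identifier    : a letter followed by letters or digits
#   integer_const : one or more digits
TOKEN_SPEC = [
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("integer_const", r"[0-9]+"),
    ("assg_op", r"="),
    ("skip", r"\s+"),   # white space separates tokens but is not a token
    ("error", r"."),    # any other single character
]

# Combine all patterns into one regex with a named group per token kind.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token_kind, lexeme) pairs for the input text."""
    tokens = []
    for match in MASTER.finditer(text):
        kind = match.lastgroup  # name of the alternative that matched
        if kind != "skip":
            tokens.append((kind, match.group()))
    return tokens


print(tokenize("count = 123"))
# → [('identifier', 'count'), ('assg_op', '='), ('integer_const', '123')]
```

The order of entries in TOKEN_SPEC matters: alternatives are tried left to right, so the catch-all error pattern is listed last.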
Algorithm / Pseudocode
BEGIN
    Initialize character pointer to the start of the source code
    WHILE not end of source code
        Skip any white spaces and newlines
        IF end of source code
            Exit the loop
        END IF
        IF character is a letter
            // Begin identifier or keyword
            WHILE character is a letter or digit
                Add character to current token
                Advance character pointer
            END WHILE
            IF current token is a keyword
                Output keyword token
            ELSE
                Output identifier token
            END IF
        ELSE IF character is a digit
            // Begin number
            WHILE character is a digit
                Add character to current token
                Advance character pointer
            END WHILE
            Output number token
        ELSE
            Output error token
            Advance character pointer
        END IF
    END WHILE
END
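
The pseudocode above can be turned into a short hand-written scanner. The following is a minimal sketch, not a definitive implementation: the function name scan, the token kind names, and the KEYWORDS set are hypothetical (the slides do not list the keywords), and letters and digits are assumed to be ASCII.

```python
import string

# Hypothetical keyword set for illustration; the slides do not list one.
KEYWORDS = {"int", "char", "float"}

LETTERS = set(string.ascii_letters)  # assumption: ASCII letters only
DIGITS = set(string.digits)          # assumption: ASCII digits only


def scan(source):
    """Scan source text and return a list of (token_kind, lexeme) pairs."""
    tokens = []
    pos = 0  # character pointer
    while pos < len(source):
        ch = source[pos]
        if ch.isspace():
            # Skip white space and newlines.
            pos += 1
            continue
        start = pos
        if ch in LETTERS:
            # Identifier or keyword: a letter followed by letters or digits.
            while pos < len(source) and (source[pos] in LETTERS or source[pos] in DIGITS):
                pos += 1
            lexeme = source[start:pos]
            kind = "keyword" if lexeme in KEYWORDS else "identifier"
            tokens.append((kind, lexeme))
        elif ch in DIGITS:
            # Number: a run of digits.
            while pos < len(source) and source[pos] in DIGITS:
                pos += 1
            tokens.append(("number", source[start:pos]))
        else:
            # Any other character is reported as an error token.
            tokens.append(("error", ch))
            pos += 1
    return tokens


print(scan("int x1 = 42;"))
# → [('keyword', 'int'), ('identifier', 'x1'), ('error', '='), ('number', '42'), ('error', ';')]
```

Because the pseudocode only has rules for identifiers, keywords, and numbers, characters such as = and ; come out as error tokens here. A fuller scanner would add branches for operators and punctuation.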
Regular Expressions
A pattern notation for describing certain kinds of sets of strings:
Given an alphabet Σ:
• ε is a regular exp. (denotes the language {ε})
• for each a ∈ Σ, a is a regular exp. (denotes the language {a})
• if r and s are regular exps. denoting L(r) and L(s) respectively, then so are:
    (r) | (s)    (denotes the language L(r) ∪ L(s))
    (r)(s)       (denotes the language L(r)L(s))
    (r)*         (denotes the language L(r)*)
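
To make the three operations concrete, the sketch below models a language as a Python set of strings. The function names union, concat, and star are hypothetical. Since L(r)* is infinite whenever L(r) contains a non-empty string, star takes an assumed max_len bound and returns only the strings up to that length.

```python
from itertools import product


def union(l1, l2):
    """L(r) ∪ L(s): strings that belong to either language."""
    return l1 | l2


def concat(l1, l2):
    """L(r)L(s): every string of l1 followed by every string of l2."""
    return {x + y for x, y in product(l1, l2)}


def star(lang, max_len):
    """Strings of L(r)* with length <= max_len (the full set is infinite)."""
    result = {""}    # the empty string ε is always in L(r)*
    frontier = {""}  # strings added in the previous round
    while frontier:
        frontier = {w for w in concat(frontier, lang)
                    if len(w) <= max_len and w not in result}
        result |= frontier
    return result


a = {"a"}  # language of the regular expression a
b = {"b"}  # language of the regular expression b

print(sorted(union(a, b)))           # (a)|(b)  → ['a', 'b']
print(sorted(concat(a, b)))          # (a)(b)   → ['ab']
print(sorted(star(union(a, b), 2)))  # ((a)|(b))*, length <= 2
# → ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```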
Working of Lexical Analyzer
Conclusion
• Key Takeaways:
    • Definition
    • Purpose
    • Components
• Importance in Compiler Design:
    • Error Detection
    • Efficiency
    • Foundation for Parsing
• Practical Considerations:
    • Token Definitions
    • Handling Errors
    • Tools and Libraries