UNIT 3
3.0. Aims and Objectives
This unit discusses tools for generating scanners and the applications of scanners.
After you study this unit, you will be able to:
• Describe tools for generating scanners
• Design and construct scanners using these tools in your lab class
• Understand the roles and applications of scanners
• Know the relationship between scanners and compilers
3.1. INTRODUCTION TO TOOLS
Lexical analysis is the first phase of a compiler and is also known as scanning. It converts the input program into a sequence of tokens. The function of the lexical analyzer is to read the source program, one character at a time, and to translate it into a sequence of primitive units called "tokens". Normally a lexical analyzer does not return a list of tokens in one shot; it returns a token whenever the parser asks for one.
The scanner performs lexical analysis of a certain program (in our case, the Simple program). It reads the source program as a sequence of characters and recognizes "larger" textual units called tokens.
In this unit we are going to see some tools that will help us generate scanners and parsers.
FLEX (Fast LEXical analyzer generator) is a tool for generating scanners. Instead of writing a scanner from scratch, you only need to identify the vocabulary of a certain language (e.g. Simple) and write a specification of patterns using regular expressions (e.g. DIGIT [0-9]), and FLEX will construct a scanner for you. FLEX is generally used as follows. First, FLEX reads a specification of a scanner either from an input file *.lex or from standard input, and it generates as output a C source file lex.yy.c. Then, lex.yy.c is compiled and linked with the "-lfl" library to produce an executable a.out. Finally, a.out analyzes its input stream and transforms it into a sequence of tokens.
• *.lex is in the form of pairs of regular expressions and C code.
• lex.yy.c defines a routine yylex() that uses the specification to recognize tokens.
• a.out is actually the scanner.
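For example, assuming a specification file named scanner.lex and an input file input.txt (both names are placeholders), the typical workflow is:
flex scanner.lex (reads the specification and writes lex.yy.c)
cc lex.yy.c -lfl (compiles the scanner and links the flex library)
./a.out < input.txt (runs the scanner on the input stream)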
The flex input file consists of three sections, separated by a line with just `%%' in it:
Definitions
%%
Rules
%%
User code
The definitions section contains declarations of simple name definitions to simplify the scanner specification, and declarations of start conditions, which are explained in a later section. Name definitions have the form "name definition", where the name is a word beginning with a letter or an underscore ('_') followed by zero or more letters, digits, '_', or '-' (dash). The definition is taken to begin at the first non-white-space character following the name and to continue to the end of the line. The definition can subsequently be referred to using "{name}", which will expand to "(definition)". For example,
DIGIT [0-9]
ID [a-z][a-z0-9]*
defines "DIGIT" to be a regular expression which matches a single digit, and "ID" to be a regular expression which matches a letter followed by zero or more letters or digits. A subsequent reference to {DIGIT}+"."{DIGIT}* is identical to ([0-9])+"."([0-9])* and matches one or more digits followed by a '.' followed by zero or more digits.
The rules section of the flex input contains a series of rules of the form "pattern action", where the pattern must be unindented and the action must begin on the same line.
Finally, the user code section is simply copied to `lex.yy.c' verbatim. It is used for companion
routines which call or are called by the scanner. The presence of this section is optional; if it
is missing, the second `%%' in the input file may be skipped, too.
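To make the three-section layout concrete, here is a minimal flex specification in this style; it is an illustrative sketch (the patterns, messages, and driver are assumptions, not part of any particular language definition):
%{
/* definitions section: C declarations copied verbatim into lex.yy.c */
#include <stdio.h>
%}
DIGIT [0-9]
ID [a-z][a-z0-9]*
%%
{DIGIT}+ { printf("NUMBER: %s\n", yytext); }
{ID} { printf("IDENTIFIER: %s\n", yytext); }
[ \t\n]+ { /* skip whitespace */ }
. { printf("UNRECOGNIZED: %s\n", yytext); }
%%
/* user code section: a simple driver that calls the generated routine */
int main(void) {
    yylex(); /* scan the standard input until end of file */
    return 0;
}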
In the definitions and rules sections, any indented text or text enclosed in `%{' and `%}' is copied verbatim to the output (with the `%{' and `%}' removed). The `%{' and `%}' must appear unindented on lines by themselves.
In the rules section, any indented or %{} text appearing before the first rule may be used to
declare variables which are local to the scanning routine and (after the declarations) code
which is to be executed whenever the scanning routine is entered. Other indented or %{} text in the rules section is still copied to the output, but its meaning is not well-defined and it may well cause compile-time errors.
When the generated scanner is run, it analyzes its input looking for strings which match any
of its patterns. If it finds more than one match, it takes the one matching the most text (for
trailing context rules, this includes the length of the trailing part, even though it will then be
returned to the input). If it finds two or more matches of the same length, the rule listed first
in the flex input file is chosen.
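As a sketch of these two disambiguation rules, consider the following fragment, which reuses the ID definition from above (the token codes KEYWORD_IF and IDENTIFIER are assumed to be defined elsewhere, e.g. by a parser generator):
if { return KEYWORD_IF; }
{ID} { return IDENTIFIER; }
On the input "if" both patterns match the same two characters, so the rule listed first wins and KEYWORD_IF is returned; on the input "ifdef" the second rule matches more text, so the longest match wins and "ifdef" becomes a single IDENTIFIER.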
Once the match is determined, the text corresponding to the match (called the token) is made
available in the global character pointer yytext, and its length in the global integer yyleng.
The action corresponding to the matched pattern is then executed (a more detailed
description of actions follows), and then the remaining input is scanned for another match.
If no match is found, then the default rule is executed: the next character in the input is considered matched and copied to the standard output. Thus, the simplest legal flex input is just `%%', which generates a scanner that simply copies its input (one character at a time) to its output.
Note that yytext can be defined in two different ways: either as a character pointer or as a character array. You can control which definition flex uses by including one of the special directives `%pointer' or `%array' in the first (definitions) section of your flex input. The default is `%pointer', unless you use the `-l' lex compatibility option, in which case yytext will be an array. The advantage of using `%pointer' is substantially faster scanning and no buffer overflow when matching very large tokens (unless you run out of dynamic memory). The disadvantage is that you are restricted in how your actions can modify yytext, and calls to the `unput()' function destroy the present contents of yytext, which can be a considerable porting headache when moving between different lex versions.
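For example, to select the array representation explicitly, the directive is placed in the first (definitions) section; a minimal sketch:
%array
ID [a-z][a-z0-9]*
%%
{ID} { /* with %array, actions may modify yytext freely */ }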
Alex (a Lex-like tool) is a scanner generator which translates the lexical description of a
programming language into a scanner for that language. The scanner description language
is easy to use, as it is intentionally small and simple. Alex as well as the generated scanners
are written in Modula-2 and implemented on several microcomputers. The scanner
generator may be used in conjunction with a compiler-compiler.
Finally, we will explain another tool called ANTLR (ANother Tool for Language Recognition). ANTLR is a parser generator, a tool that helps you create parsers. A parser takes a piece of text and transforms it into an organized structure, such as an Abstract Syntax Tree (AST). You can think of the AST as a story describing the content of the code, or as its logical representation, created by putting together the various pieces.
3.2. ROLES AND APPLICATIONS OF SCANNERS
The lexical analyzer also interacts with the symbol table in two ways, as the sketch after this list shows:
1. If it thinks a lexeme constitutes an identifier, it stores that lexeme in the symbol table.
2. It consults the symbol table to get the type of identifier for a particular lexeme, so that it can generate a more relevant token for the parser.
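A hedged sketch of this interaction as a flex action; install_id() and token_for() are hypothetical helpers supplied by the rest of the compiler, not part of flex itself:
{ID} {
    /* 1. store the lexeme in the symbol table (install_id() is hypothetical) */
    struct sym_entry *e = install_id(yytext);
    /* 2. use the stored information to pick the token for the parser */
    return token_for(e); /* token_for() is also hypothetical */
}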
Lexical analyzers also have a role in removing whitespace (newlines, blanks, and tabs), comments, etc. They also associate error messages with the corresponding lines (based on newline characters or other delimiters) in the source program.
A lexical analyzer can be thought of as a combination of:
1. Scanning - no tokenization, only scanning - removing comments, etc.
2. Lexical analysis - the scanner produces a sequence of tokens as output
Since the lexical analyzer is the part of the compiler that reads the source text, it may perform
certain other tasks besides identification of lexemes. One such task is stripping out
comments and whitespace (blank, newline, tab, and perhaps other characters that are used
to separate tokens in the input). Another task is correlating error messages generated by the
compiler with the source program. For instance, the lexical analyzer may keep track of the
number of newline characters seen, so it can associate a line number with each error
message. In some compilers, the lexical analyzer makes a copy of the source program with
the error messages inserted at the appropriate positions. If the source program uses a
macro-preprocessor, the expansion of macros may also be performed by the lexical analyzer.
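One way to realize the whitespace-stripping and line-tracking tasks in a flex specification (a sketch: line_num is an assumed global, "//" comments are an assumption about the source language, and flex also offers a built-in yylineno):
%{
int line_num = 1; /* assumed global used when reporting errors */
%}
%%
\n { line_num++; } /* count newlines to correlate errors with lines */
[ \t]+ { /* strip blanks and tabs */ }
"//".* { /* strip a line comment */ }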
3.3. RELATIONSHIP OF SCANNERS AND COMPILERS
A lexical analyzer (or scanner) is a program that recognizes tokens (also called symbols) in an input source file (or source code). Each token is a meaningful character string, such as a number, an operator, or an identifier.
The analysis of a source program during compilation is often complex. The construction of a
compiler can often be made easier if the analysis of the source program is separated into two
parts, with one part identifying the low-level language constructs (tokens) such as variable
names, keywords, labels, and operators, and the second part determining the syntactic
organization of the program.
Two aspects of scanners concern us. First, we describe what the tokens of the language are. The class of regular grammars is one vehicle which can be used to describe tokens. Another description approach, which is briefly introduced, involves the use of regular expressions. Both description methods are equivalent in the sense that both describe the set of regular languages. The second aspect of scanners deals with the recognition of tokens. Finite-state acceptors are devices that are well suited to this recognition task, primarily because they can be specified pictorially by using transition diagrams.
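As an illustrative sketch of a finite-state acceptor (hand-written C, not the code a generator would emit), the following function recognizes the token class described by [a-z][a-z0-9]*; its two states correspond directly to the nodes of a transition diagram:
#include <ctype.h>

/* Returns 1 if s is a complete identifier: one lowercase letter
   followed by zero or more lowercase letters or digits. */
int is_identifier(const char *s) {
    if (!islower((unsigned char)*s)) /* start state: expect a letter */
        return 0;
    for (s++; *s; s++) /* accepting state: letters or digits */
        if (!islower((unsigned char)*s) && !isdigit((unsigned char)*s))
            return 0;
    return 1; /* input ended while in the accepting state */
}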
The scanner represents an interface between the source program and the syntactic analyzer
or parser. The scanner, through a character-by-character examination of the input text,
separates the source program into pieces called tokens which represent the variable names,
operators, labels, and so on that comprise the source program.
Separating scanning from syntactic analysis can also have other advantages. Scanning characters is typically slow in compilers, and by separating it from the parsing component of compilation, particular emphasis can be given to making the process efficient. Furthermore, more
information can be made available to the parser when it is needed. For example, it is easier
to parse tokens such as keywords, identifiers, and operators, rather than tokens which are
the terminal character set (i.e., A, B, C, etc.). If the first token for a DO WHILE statement is DO
rather than just ‘D’, the compiler can determine that a repetition loop is being parsed rather
than other possibilities such as an assignment statement.
The scanner usually interacts with the parser in one of two ways. The scanner may process
the source program in a separate pass before parsing begins. Thus the tokens are stored in a
file or large table. The second way involves an interaction between the parser and the
scanner. The scanner is called by the parser whenever the next token in the source program
is required. The latter approach is the preferred method of operation, since an internal form
of the complete source program does not need to be constructed and stored in memory
before parsing can begin. Another advantage of this method is that multiple scanners can be
written for the same language. These scanners vary depending on the input interfaces used
in the language. Throughout this topic, the scanner will be assumed to be implemented in
this manner. In most cases, however, it makes little difference how the scanner is linked to
the parser.
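A minimal sketch of this preferred, parser-driven interaction; yylex() and yytext come from the generated lex.yy.c, while the stand-in loop and the convention that 0 marks end of input are illustrative assumptions:
#include <stdio.h>

extern int yylex(void); /* the scanning routine generated by flex */
extern char *yytext; /* text of the most recently matched token */

int main(void) {
    int token;
    /* a real parser would sit here, requesting one token at a time */
    while ((token = yylex()) != 0) { /* 0 conventionally signals end of input */
        printf("token %d: %s\n", token, yytext);
    }
    return 0;
}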
3.4. SUMMARY
In this unit, we have seen some of the tools used to generate scanners and parsers. The scanner performs lexical analysis of a certain program (in our case, the Simple program). It reads the source program as a sequence of characters and recognizes "larger" textual units called tokens.
Lexical analyzers also have a role in removing whitespace (newlines, blanks, and tabs), comments, etc. They also associate error messages with the corresponding lines (based on newline characters or other delimiters) in the source program.
A lexical analyzer can be thought of as a combination of:
1. Scanning - no tokenization, only scanning - removing comments, etc.
2. Lexical analysis - the scanner produces a sequence of tokens as output
3.5. MODEL EXAMINATION QUESTIONS
I: True/False questions
1. The rules section of the flex input contains a series of rules of the form: pattern
action.
2. Alex (a Lex-like tool) is a scanner generator which translates the lexical
description of a programming language into a scanner for that language.
3. The scanner represents an interface between the source program and the
syntactic analyzer or parser.