Compiler Design Unit 1
UNIT – I
Introduction: Language Processing, Structure of a Compiler, the Evolution of Programming Languages,
the Science of Building a Compiler, Applications of Compiler Technology, Programming Language
Basics.
Lexical Analysis: The role of the lexical analyzer, input buffering, specification of tokens, recognition of tokens,
the lexical-analyzer generator Lex.
UNIT -1
TRANSLATOR
A translator is a program that takes as input a program written in one language and
produces as output a program in another language. Besides program translation, the
translator performs another very important role: error detection. Any violation of the HLL
specification is detected and reported to the programmer. The important roles of a
translator are:
1. Translating the HLL program input into an equivalent machine language program.
2. Providing diagnostic messages wherever the programmer violates the specification of
the HLL.
TYPES OF TRANSLATORS:-
a. Compiler
b. Interpreter
c. Preprocessor
Compiler
A compiler is a translator that reads a program written in a high-level source language and
translates it into an equivalent target program in machine language, reporting any error
messages found during translation.
[Figure: source program → Compiler → target program; error messages are produced as a side output.]
Interpreter
An interpreter translates and executes the source program statement by statement, rather than
producing a separate target program.
Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters. Java also
uses an interpreter. The process of interpretation can be carried out in the following phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct Execution
Advantages:
Modification of the user program can easily be made and implemented as execution
proceeds.
The type of object that a variable denotes may change dynamically.
Debugging a program and finding errors is a simplified task for a program run under
an interpreter.
The interpreter for the language makes it machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is more.
Preprocessor
A preprocessor produces input to compilers. It may perform the following functions:
1. Macro processing: A preprocessor may allow a user to define macros that are
shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessor: These preprocessors augment older languages with more
modern flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the
language by what amounts to built-in macros.
Loader
"A loader is a program that places programs into memory and prepares them for
execution." It would be more efficient if subroutines could be translated into an object form
that the loader could "relocate" directly behind the user's program. The task of adjusting
programs so they may be placed in arbitrary core locations is called relocation.
STRUCTURE OF A COMPILER
Lexical Analysis:-
The LA or scanner reads the source program one character at a time, carving the
source program into a sequence of atomic units called tokens.
Syntax Analysis:-
The second stage of translation is called Syntax analysis or parsing. In this
phase expressions, statements, declarations etc… are identified by using the results of lexical
analysis. Syntax analysis is aided by using techniques based on formal grammar of the
programming language.
Intermediate Code Generation:-
An intermediate representation of the final machine language code is produced.
This phase bridges the analysis and synthesis phases of translation.
Code Optimization :-
This is an optional phase designed to improve the intermediate code so that the
output runs faster and takes less space.
Code Generation:-
The last phase of translation is code generation. A number of optimizations to
reduce the length of machine language program are carried out during this phase. The
output of the code generator is the machine language program of the specified computer.
Syntax Analysis:-
The output of the LA is a stream of tokens, which is passed to the next phase, the
syntax analyzer or parser. The SA groups the tokens together into syntactic structures called
expressions. Expressions may further be combined to form statements. The syntactic
structure can be regarded as a tree whose leaves are tokens; such trees are called parse trees.
The parser has two functions. It checks whether the tokens from the lexical analyzer
occur in patterns that are permitted by the specification of the source language. It also
imposes on the tokens a tree-like structure that is used by the subsequent phases of the compiler.
For example, if a program contains the expression A+/B, after lexical analysis this
expression might appear to the syntax analyzer as the token sequence id+/id. On seeing the /,
the syntax analyzer should detect an error situation, because the presence of these two
adjacent binary operators violates the formation rules of an expression.
Code Optimization
This is an optional phase designed to improve the intermediate code so that the
output runs faster and takes less space. Its output is another intermediate code program that
does the same job as the original, but in a way that saves time and/or space.
1. Local Optimization:-
There are local transformations that can be applied to a program
to make an improvement. For example,
If A > B goto L2
Goto L3
L2 :
This can be replaced by the single statement
If A <= B goto L3
Another local transformation takes advantage of common sub-expressions. For example, the statements
A := B + C + D
E := B + C + F
might be evaluated as
T1 := B + C
A := T1 + D
E := T1 + F
taking advantage of the common sub-expression B + C.
2, Loop Optimization:-
Another important source of optimization concerns increasing
the speed of loops. A typical loop improvement is to move a
computation that produces the same result each time around the loop
to a point in the program just before the loop is entered.
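For illustration (an example added to these notes, with made-up variable names), the following
C++ fragment shows a loop-invariant computation being moved to a point just before the loop is entered:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Before optimization: std::sqrt(scale) gives the same result on every
    // iteration, yet it is recomputed each time around the loop.
    double sum_before(const std::vector<double>& v, double scale) {
        double sum = 0.0;
        for (std::size_t i = 0; i < v.size(); ++i)
            sum += v[i] * std::sqrt(scale);     // loop-invariant computation
        return sum;
    }

    // After optimization: the invariant computation is moved (hoisted) to a
    // point in the program just before the loop is entered.
    double sum_after(const std::vector<double>& v, double scale) {
        double k = std::sqrt(scale);            // computed once, outside the loop
        double sum = 0.0;
        for (std::size_t i = 0; i < v.size(); ++i)
            sum += v[i] * k;
        return sum;
    }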
Code Generator :-
Code Generator produces the object code by deciding on the memory locations
for data, selecting code to access each datum and selecting the registers in which each
computation is to be done. Many computers have only a few high speed registers in which
computations can be performed quickly. A good code generator would attempt to utilize
registers as efficiently as possible.
Table Management OR Book-keeping :-
A compiler needs to collect information about all the data objects that appear
in the source program. The information about data objects is collected by the early phases of
the compiler-lexical and syntactic analyzers. The data structure used to record this
information is called as Symbol Table.
Error Handling :-
One of the most important functions of a compiler is the detection and
reporting of errors in the source program. The error message should allow the programmer to
determine exactly where the errors have occurred. Errors may occur in any of the phases of a
compiler.
Whenever a phase of the compiler discovers an error, it must report the error to
the error handler, which issues an appropriate diagnostic message. Both the table-management
and error-handling routines interact with all phases of the compiler.
Example:
position := initial + rate * 60
[Figure: the statement is transformed phase by phase — Lexical Analyzer → Syntax Analyzer
(parse tree with := at the root: id1 on the left, id2 + (id3 * 60) on the right) →
Semantic Analyzer (the same tree, with an int-to-real conversion inserted for 60) →
Code Optimizer → Code Generator, which emits the target code shown below.]
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
THE EVOLUTION OF PROGRAMMING LANGUAGES
The history of programming languages spans from documentation of early mechanical computers to
modern tools for software development. Early programming languages were highly specialized, relying
on mathematical notation.
The first step towards more people-friendly programming languages was the development of mnemonic
assembly languages in the early 1950's. The instructions in assembly languages were just mnemonic
representations of machine instructions.
A major step towards higher-level languages was made in the latter half of the 1950's with the
development of Fortran for scientific computation, Cobol for business data processing, and Lisp for
symbolic computation.
In the following decades many more languages were created with innovative features to help
make programming easier, more natural, and more robust.
Languages can also be classified in a variety of ways.
Classification by generation: 1st-generation languages are the machine languages, 2nd-generation are the
assembly languages, 3rd-generation are the higher-level languages like Fortran, Cobol, Lisp, C,
etc. 4th-generation are languages designed for specific applications, like NOMAD, SQL, and Postscript.
The term fifth-generation language has been applied to logic- and constraint-based languages
like Prolog and OPS5.
Classification by use: imperative languages, in which your program specifies how a
computation is to be done, and declarative languages, in which your program specifies what
computation is to be done.
Examples:
Imperative languages: C, C++, C#, Java.
Declarative languages: ML, Haskell, Prolog.
Scripting languages are interpreted languages with high-level operators designed for
"gluing together" computations; such computations were originally called scripts.
THE SCIENCE OF BUILDING A COMPILER
A compiler must accept all source programs that conform to the specification of the language;
the set of source programs is infinite and any program can be very large, consisting of possibly millions
of lines of code. Any transformation performed by the compiler while translating a source program must
preserve the meaning of the program being compiled. Compiler writers thus have influence over not just
the compilers they create, but all the programs that their compilers compile. This leverage makes
writing compilers particularly rewarding; however, it also makes compiler development challenging.
Modelling in compiler design and implementation: The study of compilers is mainly a study of how
we design the right mathematical models and choose the right algorithms. Some of the most fundamental
models are finite-state machines and regular expressions. These models are useful for describing the
lexical units of programs (keywords, identifiers, and such) and for describing the algorithms used by the
compiler to recognize those units. Also among the most fundamental models are context-free grammars,
used to describe the syntactic structure of programming languages, such as the nesting of parentheses or
control constructs. Similarly, trees are an important model for representing the structure of programs
and their translation into object code.
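As a small illustration added to these notes, the finite-state-machine model for a lexical unit such as
id = letter (letter | digit)* can be coded directly; the function below is hypothetical and written
only for this example:

    #include <cctype>
    #include <string>

    // A two-state deterministic automaton for the pattern letter (letter | digit)*:
    // state 0 = start (expects a letter), state 1 = inside an identifier (accepting).
    bool is_identifier(const std::string& s) {
        int state = 0;
        for (char c : s) {
            if (state == 0) {
                if (std::isalpha(static_cast<unsigned char>(c))) state = 1;
                else return false;                 // no transition: reject
            } else {                               // state == 1
                if (!std::isalnum(static_cast<unsigned char>(c))) return false;
            }
        }
        return state == 1;                         // accept only in the accepting state
    }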
The science of code optimization: The term "optimization" in compiler design refers to the attempts
that a compiler makes to produce code that is more efficient than the obvious code. In modern times,
the optimization of code that a compiler performs has become both more important and more complex.
It is more complex because processor architectures have become more complex, yielding more
opportunities to improve the way code executes. It is more important because massively parallel
computers require substantial optimization, or their performance suffers by orders of magnitude.
Compiler optimizations must meet the following design objectives:
1. The optimization must be correct, that is, preserve the meaning of the compiled program,
2. The optimization must improve the performance of many programs,
3. The compilation time must be kept reasonable, and
4. The engineering effort required must be manageable.
Thus, in studying compilers, we learn not only how to build a compiler, but also the general
methodology of solving complex and open-ended problems.
APPLICATIONS OF COMPILER TECHNOLOGY
Specialized Architectures: Over the last three decades, many architectural concepts have been
proposed. They include data flow machines, vector machines, VLIW (Very Long Instruction Word)
machines, SIMD (Single Instruction, Multiple Data) arrays of processors, systolic arrays,
multiprocessors with shared memory, and multiprocessors with distributed memory. The development
of each of these architectural concepts was accompanied by the research and development of
corresponding compiler technology.
Program Translations: The following are some of the important applications of program-translation
techniques.
Binary Translation: Compiler technology can be used to translate the binary code for one machine to
that of another, allowing a machine to run programs originally compiled for another instruction set.
Binary translation technology has been used by various computer companies to increase the availability
of software for their machines.
Hardware Synthesis: Not only is most software written in high-level languages; even hardware
designs are mostly described in high-level hardware description languages like Verilog and VHDL.
Hardware designs are typically described at the register transfer level (RTL), where variables represent
registers and expressions represent combinational logic.
Database Query Interpreters: Besides specifying software and hardware, languages are useful in
many other applications. For example, query languages, especially SQL (Structured Query Language),
are used to search databases. Database queries consist of predicates containing relational and boolean
operators. They can be interpreted or compiled into commands to search a database for records
satisfying that predicate.
Programming Language Basics:
Most languages, including C and its family, use static scope. We consider static-scope rules for a
language with blocks, where a block is a grouping of declarations and statements. C uses braces { and }
to delimit a block; the alternative use of begin and end for the same purpose dates back to Algol.
A C program consists of a sequence of top-level declarations of variables and
functions. Functions may have variable declarations within them, where variables include local variables
and parameters. The scope of each such declaration is restricted to the function in which it appears.
The scope of a top-level declaration of a name x consists of the entire program that follows, with the
exception of those statements that lie within a function that also has a declaration of x.
A block is a sequence of declarations followed by a sequence of statements, all surrounded by
braces. A declaration D "belongs" to a block B if B is the most closely nested block containing D; that
is, D is located within B, but not within any block that is nested within B. The static-scope rule for
variable declarations in a block-structured language is as follows. If declaration D of name x belongs
to block B, then the scope of D is all of B, except for any blocks B' nested to any depth within B, in
which x is redeclared. Here, x is redeclared in B' if some other declaration D' of the same name x
belongs to B'.
An equivalent way to express this rule is to focus on a use of a name x. Let B1, B2, ..., Bk be
all the blocks that surround this use of x, with Bk the smallest, nested within Bk-1, which is nested
within Bk-2, and so on. Search for the largest i such that there is a declaration of x belonging to Bi.
This use of x refers to the declaration in Bi. Alternatively, this use of x is within the scope of the
declaration in Bi.
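A short C++ sketch of this rule, written for these notes (the values 1 and 2 are arbitrary):

    #include <iostream>

    int main() {
        int x = 1;                 // declaration D1 of x belongs to the outer block B1
        {                          // block B2 nested in B1
            int x = 2;             // declaration D2 redeclares x; it hides D1 inside B2
            {                      // block B3 nested in B2
                std::cout << x;    // prints 2: the smallest enclosing block with a
                                   // declaration of x is B2, so this use refers to D2
            }
        }
        std::cout << x;            // prints 1: back in B1, this use refers to D1
        return 0;
    }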
Explicit Access Control
Through the use of keywords like public, private, and protected, object-oriented languages such as
C++ or Java provide explicit control over access to member names in a superclass. These keywords
support encapsulation by restricting access. Thus, private names are purposely given a scope that
includes only the method declarations and definitions associated with that class and any "friend" classes
(the C++ term). Protected names are accessible to subclasses. Public names are accessible from outside
the class.
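A sketch of the three access levels in C++ (the class names Account and SavingsAccount are invented
for this example):

    #include <iostream>

    class Account {
    public:
        void deposit(double amount) { balance += amount; }   // accessible from anywhere
    protected:
        double balance = 0.0;       // accessible in Account and in its subclasses
    private:
        int auditCode = 42;         // accessible only inside Account (and friends)
    };

    class SavingsAccount : public Account {
    public:
        void addInterest(double rate) {
            balance += balance * rate;   // OK: protected member visible in a subclass
            // auditCode = 7;            // error: private member is not visible here
        }
    };

    int main() {
        SavingsAccount s;
        s.deposit(100.0);
        s.addInterest(0.05);
        std::cout << "ok\n";   // balance cannot be read here: it is protected
        return 0;
    }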
Dynamic Scope
Any scoping policy is dynamic if it is based on factor(s) that can be known only when the program
executes. The term dynamic scope, however, usually refers to the following policy: a use of a
name x refers to the declaration of x in the most recently called procedure with such a declaration.
Dynamic scoping of this type appears only in special situations; two common examples are macro
expansion in the C preprocessor and method resolution in object-oriented programming.
Declarations and Definitions
Declarations tell us about the types of things, while definitions tell us about their values. Thus, int i is a
declaration of i, while i = 1 is a definition of i.
The difference is more significant when we deal with methods or other procedures. In C++, a method
is declared in a class definition, by giving the types of the arguments and result of the method (often
called the signature for the method). The method is then defined, i.e., the code for executing the method
is given, in another place. Similarly, it is common to define a C function in one file and declare it in
other files where the function is used.
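A small sketch of the distinction (the function area is invented for this example; the definition would
commonly live in a separate file that is linked in):

    #include <iostream>

    // Declaration: gives the signature (types of argument and result) only.
    double area(double radius);

    int main() {
        // The call compiles against the declaration; the linker connects it
        // to the definition, wherever that definition is placed.
        std::cout << area(2.0) << '\n';
        return 0;
    }

    // Definition: the code that actually executes the function.
    double area(double radius) { return 3.14159265 * radius * radius; }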
Parameter Passing Mechanisms
In this section, we shall consider how the actual parameters (the parameters used in the call of a
procedure) are associated with the formal parameters (those used in the procedure definition). Which
mechanism is used determines how the calling-sequence code treats parameters. The great majority of
languages use either "call-by-value," or "call-by-reference," or both.
Call - by - Value
In call-by-value, the actual parameter is evaluated (if it is an expression) or copied (if it is a variable).
The value is placed in the location belonging to the corresponding formal parameter of the called
procedure. This method is used in C and Java, and is a common option in C++, as well as in most
other languages. Call-by-value has the effect that all computation involving the formal parameters done
by the called procedure is local to that procedure, and the actual parameters themselves cannot be
changed.
Note, however, that in C we can pass a pointer to a variable to allow that variable to be changed by the
callee. Likewise, array names passed as parameters in C, C++, or Java give the called procedure what
is in effect a pointer or reference to the array itself. Thus, if a is the name of an array of the calling
procedure, and it is passed by value to the corresponding formal parameter x, then an assignment such as
x[i] = 2 really changes the array element a[i]. The reason is that, although x gets a copy of the value
of a, that value is really a pointer to the beginning of the area of the store where the array named a is
located.
Similarly, in Java, many variables are really references, or pointers, to the things they stand for. This
observation applies to arrays, strings, and objects of all classes. Even though Java uses call-by-value
exclusively, whenever we pass the name of an object to a called procedure, the value received by that
procedure is in effect a pointer to the object. Thus, the called procedure is able to affect the value of the
object itself.
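A short sketch of these points (the function names are invented for this example):

    #include <iostream>

    void incr_copy(int n)   { n = n + 1; }     // call-by-value: only the local copy changes
    void incr_ptr(int* p)   { *p = *p + 1; }   // the pointer is passed by value, but it lets
                                               // the callee change the caller's variable
    void set_first(int x[]) { x[0] = 2; }      // an array name decays to a pointer, so this
                                               // changes the caller's array element a[0]

    int main() {
        int v = 10;
        incr_copy(v);   std::cout << v << '\n';     // still 10
        incr_ptr(&v);   std::cout << v << '\n';     // now 11
        int a[3] = {7, 8, 9};
        set_first(a);   std::cout << a[0] << '\n';  // now 2
        return 0;
    }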
Call - by - Reference
In call-by-reference, the address of the actual parameter is passed to the callee as the value of the
corresponding formal parameter. Uses of the formal parameter in the code of the callee are implemented
by following this pointer to the location indicated by the caller. Changes to the formal parameter thus
appear as changes to the actual parameter.
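As a sketch written for these notes, C++ reference parameters give call-by-reference directly; the
function add_bonus below is hypothetical:

    #include <iostream>

    // The formal parameter salary is bound to the address of the actual parameter,
    // so every use of it in the callee refers to the caller's variable.
    void add_bonus(double& salary, double bonus) {
        salary += bonus;            // changes the caller's variable directly
    }

    int main() {
        double pay = 1000.0;
        add_bonus(pay, 250.0);      // the address of pay is passed behind the scenes
        std::cout << pay << '\n';   // prints 1250
        return 0;
    }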
LEXICAL ANALYSIS
Upon receiving a "get next token" command from the parser, the lexical analyzer
reads input characters until it can identify the next token. The LA returns to the parser a
representation for the token it has found. The representation will be an integer code if the
token is a simple construct such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is
stripping out from the source program comments and white space in the form of blank,
tab and newline characters. Another is correlating error messages from the compiler with the
source program.
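A minimal sketch of this interaction, assuming a hypothetical Token type and getNextToken function
(neither is defined elsewhere in these notes):

    #include <cctype>
    #include <iostream>
    #include <string>

    // Hypothetical token representation: a small code plus the matched lexeme.
    enum class TokenKind { ID, NUMBER, LPAREN, RPAREN, COMMA, OTHER, END };
    struct Token { TokenKind kind; std::string lexeme; };

    // Each call scans just far enough in src to return the next token to the parser.
    Token getNextToken(const std::string& src, std::size_t& pos) {
        while (pos < src.size() && std::isspace(static_cast<unsigned char>(src[pos])))
            ++pos;                                  // secondary task: skip white space
        if (pos >= src.size()) return {TokenKind::END, ""};
        char c = src[pos];
        if (std::isalpha(static_cast<unsigned char>(c))) {          // identifier
            std::size_t start = pos;
            while (pos < src.size() && std::isalnum(static_cast<unsigned char>(src[pos]))) ++pos;
            return {TokenKind::ID, src.substr(start, pos - start)};
        }
        if (std::isdigit(static_cast<unsigned char>(c))) {          // number
            std::size_t start = pos;
            while (pos < src.size() && std::isdigit(static_cast<unsigned char>(src[pos]))) ++pos;
            return {TokenKind::NUMBER, src.substr(start, pos - start)};
        }
        ++pos;                                                       // one-character tokens
        switch (c) {
            case '(': return {TokenKind::LPAREN, "("};
            case ')': return {TokenKind::RPAREN, ")"};
            case ',': return {TokenKind::COMMA, ","};
            default:  return {TokenKind::OTHER, std::string(1, c)};
        }
    }

    int main() {                                                     // the parser's side
        std::string src = "area(radius, 60)";
        std::size_t pos = 0;
        for (Token t = getNextToken(src, pos); t.kind != TokenKind::END;
             t = getNextToken(src, pos))
            std::cout << t.lexeme << '\n';
        return 0;
    }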
The lexical analyzer (the "lexer") parses individual symbols from the source code file
into tokens. From there, the "parser" proper turns those whole tokens into sentences of
your grammar. A parser does not give the nodes any meaning beyond structural cohesion; the
next thing to do is extract meaning from this structure (sometimes called contextual
analysis).
Input Buffering
The lexical analyzer scans the characters of the source program one at a time to discover
tokens. Often, however, many characters beyond the next token may have to be examined
before the next token itself can be determined. For this and other reasons, it is desirable for
the lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into
two halves of, say, 100 characters each. One pointer marks the beginning of the token being
discovered. A lookahead pointer scans ahead of the beginning point, until the token is
discovered. We view the position of each pointer as being between the character last read and
the character next to be read. In practice each buffering scheme adopts one convention: either
a pointer is at the symbol last read or at the symbol it is ready to read.
[Figure: input buffer with a "token beginning" pointer and a "lookahead" pointer.]
The distance which the lookahead pointer may have to travel past the actual token may be
large. For example, in a PL/I program we may see
DECLARE (ARG1, ARG2, ..., ARGn)
without knowing whether DECLARE is a keyword or an array name until we see what follows
the right parenthesis.
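A rough sketch of the two-pointer buffering idea (the buffer size, the names, and the simplified reload
routine are assumptions made for this example; sentinel handling is omitted):

    #include <cstddef>
    #include <cstdio>
    #include <string>

    constexpr std::size_t HALF = 100;          // each buffer half holds 100 characters
    char buffer[2 * HALF];                     // the two halves, reloaded alternately

    std::size_t tokenBegin = 0;                // marks the start of the current lexeme
    std::size_t forward    = 0;                // lookahead pointer, scans past tokenBegin

    // Reload one half of the buffer from the source file when forward runs off its end.
    void fill_half(std::FILE* src, std::size_t half /* 0 or 1 */) {
        std::fread(buffer + half * HALF, 1, HALF, src);
    }

    // When a whole token has been recognized, the lexeme lies between the two pointers.
    std::string current_lexeme() {
        std::string lex(buffer + tokenBegin, buffer + forward);
        tokenBegin = forward;                  // the next token starts where this one ended
        return lex;
    }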
Token: Token is a sequence of characters that can be treated as a single logical entity.
Typical tokens are,
1) Identifiers 2) Keywords 3) Operators 4) Special symbols 5) Constants
Pattern: A set of strings in the input for which the same token is produced as output. This set
of strings is described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by
the pattern for a token.
Example:
Token       Sample Lexemes           Informal Description of Pattern
if          if                       if
relation    <, <=, =, <>, >=, >      < or <= or = or <> or >= or >
id          pi                       letter followed by letters & digits
num                                  any numeric constant
A pattern is a rule describing the set of lexemes that can represent a particular token
in the source program.
Lexical errors are the errors thrown by your lexer when it is unable to continue, which
means that there is no way to recognise a lexeme as a valid token for your lexer. Syntax
errors, on the other hand, will be thrown by your parser when a given set of already-
recognised valid tokens does not match any of the right-hand sides of your grammar rules.
A simple panic-mode error handling system requires that we return to a high-level
parsing function when a parsing or lexical error is detected.
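A sketch of panic-mode recovery on the lexical side (written for these notes; the set of restart
characters is chosen arbitrarily): on an unrecognised character, the lexer discards successive characters
until it reaches one from which scanning can safely restart.

    #include <cctype>
    #include <iostream>
    #include <string>

    // Panic mode in the lexer: skip characters that cannot start any token
    // until a character is found from which scanning can resume.
    std::size_t recover(const std::string& src, std::size_t pos) {
        while (pos < src.size() &&
               !std::isalnum(static_cast<unsigned char>(src[pos])) &&
               src[pos] != '(' && src[pos] != ')' && src[pos] != ';') {
            std::cerr << "lexical error: discarding '" << src[pos] << "'\n";
            ++pos;                      // delete the offending character and go on
        }
        return pos;                     // scanning resumes here
    }

    int main() {
        std::string src = "@@#x = 60";
        std::size_t pos = recover(src, 0);
        std::cout << "restart at: " << src.substr(pos) << '\n';   // "x = 60"
        return 0;
    }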
Specification of tokens: Regular expressions over an alphabet ∑ are built from the following rules.
o ε is a regular expression denoting { ε }, that is, the language containing only the
empty string.
o For each 'a' in ∑, a is a regular expression denoting { a }, the language with only
one string, consisting of the single symbol 'a'.
o If R and S are regular expressions, then (R)|(S), (R)(S), and (R)* are regular
expressions, denoting the union of L(R) and L(S), their concatenation, and the
Kleene closure of L(R), respectively.
Recognition of tokens:
We have learned how to express patterns using regular expressions. Now, we must study how to take
the patterns for all the needed tokens and build a piece of code that examines the input string
and finds a prefix that is a lexeme matching one of the patterns.
stmt --> if expr then stmt
      |  if expr then stmt else stmt
      |  ε
expr --> term relop term
      |  term
term --> id
      |  number
For relop, we use the comparison operators of languages like Pascal or SQL, where = is
"equals" and <> is "not equals", because it presents an interesting structure of lexemes. The
terminals of the grammar, which are if, then, else, relop, id and number, are the names of tokens
as far as the lexical analyzer is concerned; the patterns for these tokens are described using
regular definitions.
digit  --> [0-9]
digits --> digit+
number --> digits (. digits)? (E [+-]? digits)?
letter --> [A-Za-z]
id     --> letter (letter | digit)*
if     --> if
then   --> then
else   --> else
relop  --> < | > | <= | >= | = | <>
In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing
the "token" ws defined by:
ws --> (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII
characters of the same names. Token ws is different from the other tokens in that, when we
recognize it, we do not return it to the parser, but rather restart the lexical analysis from the
character that follows the white space. It is the following token that gets returned to the
parser.
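A sketch, added to these notes, of how these definitions turn into code: ws is consumed without
returning a token, and relop is recognized by a small transition diagram (the function names are
invented):

    #include <string>

    enum class Relop { LT, LE, EQ, NE, GT, GE, NONE };

    // Skip the ws "token": (blank | tab | newline)+ is consumed and nothing is returned.
    std::size_t skip_ws(const std::string& s, std::size_t pos) {
        while (pos < s.size() && (s[pos] == ' ' || s[pos] == '\t' || s[pos] == '\n'))
            ++pos;
        return pos;
    }

    // Transition diagram for relop: < | > | <= | >= | = | <>
    Relop scan_relop(const std::string& s, std::size_t& pos) {
        pos = skip_ws(s, pos);
        if (pos >= s.size()) return Relop::NONE;
        char c = s[pos];
        if (c == '=') { ++pos; return Relop::EQ; }
        if (c == '<') {
            ++pos;
            if (pos < s.size() && s[pos] == '=') { ++pos; return Relop::LE; }
            if (pos < s.size() && s[pos] == '>') { ++pos; return Relop::NE; }
            return Relop::LT;                    // lookahead character is not consumed
        }
        if (c == '>') {
            ++pos;
            if (pos < s.size() && s[pos] == '=') { ++pos; return Relop::GE; }
            return Relop::GT;
        }
        return Relop::NONE;
    }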
Lexeme        Token Name    Attribute Value
any ws        -             -
if            if            -
then          then          -
else          else          -
any id        id            pointer to table entry
any number    number        pointer to table entry
<             relop         LT
<=            relop         LE
=             relop         EQ
<>            relop         NE
>             relop         GT
>=            relop         GE
if    = if
then  = then
else  = else
relop = < | <= | = | > | >= | <>
id    = letter (letter | digit)*
num   = digit+ (. digit+)? (E (+ | -)? digit+)?
AUTOMATA
There are two kinds of finite automata:
1. Deterministic finite automata (DFA)
2. Non-deterministic finite automata (NFA)
A deterministic finite automaton has at most one transition from each state on any
input. A DFA is a special case of an NFA in which:
1. no state has an ε-transition, and
2. each input symbol has at most one transition from any state.
A regular expression is converted into a minimized DFA by the following sequence of steps:
regular expression → NFA → DFA → minimized DFA.
The finite automaton is called a DFA if there is only one path for a specific input from the
current state to the next state.
[Figure: a DFA with start state S0, an edge labeled a from S0 to S2, and an edge from S0 to S1.]
From state S0, for input 'a' there is only one path, going to S2. Similarly, from S0
there is only one path for the other input symbol, going to S1.
A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
A set of states S.
A set of input symbols ∑.
A transition function, move, that maps state–symbol pairs to sets of states.
A state s0 that is distinguished as the start (or initial) state.
A set of states F distinguished as accepting (or final) states.
An NFA may have a number of transitions from one state on a single input symbol.
This graph looks like a transition diagram, but the same character can label two or
more transitions out of one state, and edges can be labeled by the special symbol ε
as well as by input symbols.
The transition graph for an NFA that recognizes the language (a | b)* abb is
shown below.
[Figure: NFA with states 0, 1, 2, 3; state 0 has edges labeled a and b back to itself and an edge
labeled a to state 1; state 1 has an edge labeled b to state 2; state 2 has an edge labeled b to the
accepting state 3.]
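As an illustration added to these notes, the equivalent DFA for (a | b)* abb can be simulated directly
from its transition table:

    #include <iostream>
    #include <string>

    // DFA for (a|b)*abb: state 3 is accepting. table[state][0] is the move on 'a',
    // table[state][1] is the move on 'b'.
    bool matches_abb(const std::string& s) {
        static const int table[4][2] = {
            {1, 0},   // state 0: start
            {1, 2},   // state 1: has just seen "a"
            {1, 3},   // state 2: has just seen "ab"
            {1, 0}    // state 3: has just seen "abb" (accepting)
        };
        int state = 0;
        for (char c : s) {
            if (c != 'a' && c != 'b') return false;   // not in the alphabet
            state = table[state][c == 'b' ? 1 : 0];
        }
        return state == 3;
    }

    int main() {
        std::cout << matches_abb("aabb") << ' '       // 1: the string ends in abb
                  << matches_abb("abab") << '\n';     // 0: it does not
        return 0;
    }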
The Lexical-Analyzer Generator Lex
A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary procedures
1. The declarations section includes declarations of variables, manifest constants, and
regular definitions.
2. The translation rules each have the form: pattern { action }, where the pattern is a
regular expression and the action is a fragment of code to be executed when a lexeme
matches the pattern.
3. The third section holds whatever auxiliary procedures are needed by the
actions. Alternatively, these procedures can be compiled separately and loaded with the
lexical analyzer.
Note: You can refer to a sample lex program given in page no. 109 of chapter 3 of the book:
Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman for more clarity.