CS8602 Compiler Design Notes
ENGINEERING COLLEGE
Poonamallee – Tiruvallur Road, Chennai – 602025.
CS8602
Compiler Design
(Anna University - Regulation)
Ms.K.P.Revathi
Topics to be Covered
1.1 Translators:
The widely used translators that translate the code of a computer program into a machine code
are:
1. Assemblers
2. Interpreters
3. Compilers
Assembler:
An Assembler converts an assembly program into machine code.
A compiler is a program that reads a program written in one language – the source language –
and translates it into an equivalent program in another language – the target language.
As an important part of this translation process, the compiler reports to its user the presence of
errors in the source program.
If the target program is an executable machine-language program, it can then be called by the
user to process inputs and produce outputs.
Advantages of Compiler:
1. Fast in execution
2. The object/executable code produced by a compiler can be distributed or executed without
having to have the compiler present.
3. The object program can be used whenever required, without the need for recompilation.
Disadvantages of Compiler:
1. Debugging a program is much harder; a compiler is therefore not so good at finding errors.
2. When an error is found, the whole program has to be re-compiled.
1.2.2 Interpretation:
Interpretation is the process of translating high-level source code into executable code.
Interpreter:
An Interpreter is also a program that translates high-level source code into executable code.
However the difference between a compiler and an interpreter is that an interpreter translates
one line at a time and then executes it: no object code is produced, and so the program has to
be interpreted each time it is to be run. If the program performs a section of code 1000 times, then
that section is translated into machine code 1000 times, since each line is interpreted and then
executed.
Disadvantages of an Interpreter:
1. Rather slow
2. No object code is produced, so a translation has to be done every time the program is run.
3. For the program to run, the Interpreter must be present
Hybrid Compiler:
Hybrid compiler is a compiler which translates a human readable source code to an intermediate
byte code for later interpretation. So these languages do have both features of a compiler and an
interpreter. These types of compilers are commonly known as Just In-time Compilers (JIT).
Java is a good example of this type of compiler. Java language processors combine compilation
and interpretation: a Java source program is first compiled into an intermediate form called
bytecodes, which are then interpreted by a virtual machine.
[Diagram: the source program is fed to a translator; the translated program then runs on a machine, taking input and producing output]
Compilers are not only used to translate a source language into the assembly or machine
language but also used in other places.
A language processor is a program that processes programs written in a programming language
(the source language). A part of a language processor is a language translator, which translates the
program from the source language into machine code, assembly language or another language.
Examples of language processors include:
1. Pre Processor
The preprocessor is system software that processes the source program before it is fed to the
compiler. It may perform functions such as macro processing and file inclusion.
2. Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language.
The difference lies in the way they read the source code or input: a compiler reads the whole
source program before translating it, whereas an interpreter reads and executes it one statement at a time.
3. Assembler
An assembler translates assembly language programs into machine code. The output of an
assembler is called an object file, which contains a combination of machine instructions as well
as the data required to place these instructions in memory.
4. Linker
Linker is a computer program that links and merges various object files together in order to make
an executable file. All these files might have been compiled by separate assemblers. The major
task of a linker is to search and locate referenced modules/routines in a program and to determine
the memory locations where this code will be loaded, making the program instructions have
absolute references.
5. Loader
Loader is a part of the operating system and is responsible for loading executable files into memory
and executing them. It calculates the size of a program's instructions and data and creates memory
space for it. It initializes various registers to initiate execution.
Analysis and Synthesis:
1. Analysis:
The first three phases form the bulk of the analysis portion of a compiler. The analysis part
breaks up the source program into constituent pieces and creates an intermediate representation
of the source program. During analysis, the operations implied by the source program are
determined and recorded in a hierarchical structure called a syntax tree, in which each node
represents an operation and the children of a node represent the arguments of the operation.
Syntax tree for position := initial + rate * 60:

            :=
          /    \
   position     +
              /   \
        initial    *
                 /   \
             rate     60
2. Synthesis Part:
The synthesis part constructs the desired target program from the intermediate representation.
This part requires the most specialized techniques.
Lexical Analysis: The lexical analysis phase reads the characters in the source program and
groups them into a stream of tokens, in which each token represents a logically cohesive sequence
of characters, such as an identifier, a keyword (if, while, etc.), a punctuation character, or a
multi-character operator like :=. The character sequence forming a token is called the lexeme for
the token.
Certain tokens will be augmented by a "lexical value". For example, when an identifier rate is found,
the lexical analyzer generates the token id and also enters rate into the symbol table, if it does not
already exist. The lexical value associated with this id then points to the symbol-table entry for
rate.
Tokens: for position := initial + rate * 60 the lexical analyzer produces the stream id1 := id2 + id3 * 60, where id1, id2 and id3 point to the symbol-table entries for position, initial and rate.
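To make the token stream concrete, here is a minimal hand-written scanner sketch in C for exactly this statement (the token names and the next_token interface are illustrative, not part of any standard):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Illustrative token codes for the running example. */
enum token { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_PLUS, TOK_STAR, TOK_EOF };

static const char *p = "position := initial + rate * 60";
static char lexeme[64];                 /* text of the current token */

static enum token next_token(void) {
    int n = 0;
    while (*p == ' ') p++;              /* skip blanks */
    if (*p == '\0') return TOK_EOF;
    if (isalpha((unsigned char)*p)) {   /* identifier: letter(letter|digit)* */
        while (isalnum((unsigned char)*p)) lexeme[n++] = *p++;
        lexeme[n] = '\0';
        return TOK_ID;
    }
    if (isdigit((unsigned char)*p)) {   /* number: digit+ */
        while (isdigit((unsigned char)*p)) lexeme[n++] = *p++;
        lexeme[n] = '\0';
        return TOK_NUM;
    }
    if (p[0] == ':' && p[1] == '=') { strcpy(lexeme, ":="); p += 2; return TOK_ASSIGN; }
    lexeme[0] = *p; lexeme[1] = '\0';
    return (*p++ == '+') ? TOK_PLUS : TOK_STAR;   /* only + and * remain here */
}

int main(void) {
    static const char *name[] = { "id", "num", "assign_op", "add_op", "mul_op", "eof" };
    enum token t;
    while ((t = next_token()) != TOK_EOF)
        printf("<%s, \"%s\">\n", name[t], lexeme);
    return 0;
}

Running it prints one <token, lexeme> pair per line; a real lexical analyzer would additionally install each identifier into the symbol table.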
Syntax Analysis:
The next phase is called the syntax analysis or parsing. It takes the token produced by lexical
analysis as input and generates a parse tree or syntax tree. In this phase, token arrangements are
checked against the source code grammar, i.e. the parser checks if the expression made by the
tokens is syntactically correct.
It imposes a hierarchical structure on the token stream in the form of a parse tree or syntax tree.
The syntax tree can be represented by using suitable data structure.
[Two views of the tree for position := initial + rate * 60: the same := / + / * structure, shown once with the identifier lexemes at the leaves and once with the leaves replaced by symbol-table references id1, id2, id3]
Semantic Analysis:
Semantic analysis checks whether the parse tree constructed follows the rules of the language. For
example, it checks that values are assigned between compatible data types, and flags operations such as
adding a string to an integer.
Also, the semantic analyzer keeps track of identifiers, their types and expressions; whether
identifiers are declared before use or not etc. The semantic analyzer produces an annotated
syntax tree as an output.
This analysis inserts a conversion from integer to real in the above syntax tree.
[Annotated syntax tree: the leaf 60 is wrapped in an inttoreal node before being multiplied with rate]
After semantic analysis the compiler generates an intermediate code of the source code for the
target machine. It represents a program for some abstract machine. It is in between the high-level
language and the machine language. This intermediate code should be generated in such a way
that it makes it easier to be translated into the target machine code.
Intermediate code has two properties: it should be easy to produce and easy to translate into the target
program. An intermediate representation can have many forms. One of them is three-address
code, which is like the assembly language for a machine in which every memory
location can act like a register; each three-address instruction has at most three operands.
Example: The output of the semantic analysis can be represented in the following intermediate
form:
temp1 := inttoreal ( 60 )
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
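Internally, each three-address instruction is conveniently stored as a quadruple: an operator plus up to three operand fields. A minimal sketch in C (the struct layout and field names are illustrative):

#include <stdio.h>

/* A quadruple: operator, up to two sources, one result. */
struct quad {
    const char *op;       /* e.g. "inttoreal", "*", "+", ":=" */
    const char *arg1;
    const char *arg2;     /* NULL when the operator is unary */
    const char *result;
};

int main(void) {
    /* The intermediate code above for position := initial + rate * 60 */
    struct quad code[] = {
        { "inttoreal", "60",    NULL,    "temp1" },
        { "*",         "id3",   "temp1", "temp2" },
        { "+",         "id2",   "temp2", "temp3" },
        { ":=",        "temp3", NULL,    "id1"   },
    };
    for (int i = 0; i < 4; i++)
        printf("%-10s %-6s %-6s %s\n", code[i].op, code[i].arg1,
               code[i].arg2 ? code[i].arg2 : "-", code[i].result);
    return 0;
}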
Code Optimization:
The next phase performs code optimization of the intermediate code. Optimization removes
unnecessary code and rearranges the sequence of statements in order to speed up program
execution without wasting resources (CPU time, memory). In the following example the intermediate
code above is reduced to two instructions: the conversion of 60 is done once at compile time, and
temp3 is eliminated.
Example:
temp1 := id3 * 60.0
id1 := id2 + temp1
Code Generation:
This is the final phase of the compiler which generates the target code, consisting normally of
relocatable machine code or assembly code. Variables are assigned to the registers.
Example:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
The first and second operands of each instruction specify a source and a destination,
respectively. The F in each instruction denotes floating point. The # signifies that
60.0 is to be treated as a constant.
Activities of Compiler:
The symbol-table manager and the error handler are two other activities of the compiler, often also
referred to as phases. These two activities interact with all six phases of a compiler.
The symbol table is a data structure containing a record for each identifier, with fields for the
attributes of the identifier.
The attributes of the identifiers may provide information about the storage allocated for an
identifier, its type, its scope (where in the program it is valid), and, in the case of procedure
names, the number and types of the arguments and the type returned.
The symbol table allows us to find the record for each identifier quickly and to store or retrieve
data from that record quickly. Attributes of the identifiers cannot be determined during lexical
analysis phase. But it can be determined during the syntax and semantic analysis phases. The
other phase like code generators uses the symbol table to retrieve the details about the identifiers.
Each phase can encounter errors. After the detection of an error, a phase must somehow deal
with that error, so that the compilation can proceed, allowing further errors in the source program
to be detected.
Lexical Analysis Phase: If the characters remaining in the input do not form any token of the
language, then the lexical analysis phase detects the error.
Syntax Analysis Phase: The large fraction of errors is handled by syntax and semantic analysis
phases. If the token stream violates the structure rules (syntax) of the language, then this phase
detects the error.
Semantic Analysis Phase: If the constructs have right syntactic structure but no meaning to the
operation involved, then this phase detects the error. Ex. Adding two identifiers, one of which is
the name of the array, and the other the name of a procedure.
There are relatively few errors which can be detected during lexical analysis.
i. Strange characters: Some programming languages do not use all possible characters, so any
strange ones that appear can be reported. However, almost any character is allowed within a quoted
string.
ii. Unterminated strings: Many programming languages do not allow quoted strings to extend over
more than one line; in such cases a missing quote can be detected. If quoted strings can extend
over multiple lines, then a missing quote can cause quite a lot of text to be 'swallowed up' before
an error is detected.
For example:
fi ( a == 1) ....
Here fi is a valid identifier. But the open parenthesis following the identifier may suggest that fi is a
misspelling of the keyword if, or an undeclared function identifier.
During syntax analysis, the compiler is usually trying to decide what to do next on the basis of
expecting one of a small number of tokens. Hence in most cases it is possible to automatically
generate a useful error message just by listing the tokens which would be acceptable at that
point.
Source: A + * B
Error: | Found '*', expected one of: Identifier, Constant, '('
More specific hand-tailored error messages may be needed in cases of bracket mismatch.
A parser should be able to detect and report any error in the program. It is expected that when an
error is encountered, the parser should be able to handle it and carry on parsing the rest of the
input. Mostly it is expected from the parser to check for errors but errors may be encountered at
various stages of the compilation process. A program may have the following kinds of errors at
various stages:
Lexical : name of some identifier typed incorrectly
Syntactical : missing semicolon or unbalanced parenthesis
Semantical : incompatible value assignment
Logical : code not reachable, infinite loop
There are four common error-recovery strategies that can be implemented in the parser to deal
with errors in the code.
Panic mode
When a parser encounters an error anywhere in the statement, it ignores the rest of the statement,
discarding input from the point of error up to a delimiter such as a semicolon. This is the easiest
way of error recovery, and it prevents the parser from entering an infinite loop.
Statement mode
When a parser encounters an error, it tries to take corrective measures so that the rest of the
statement allows the parser to parse ahead, for example inserting a missing semicolon or replacing a
comma with a semicolon. Parser designers have to be careful here, because one wrong
correction may lead to an infinite loop.
Error productions
Some common errors are known to the compiler designers that may occur in the code. In
addition, the designers can create augmented grammar to be used, as productions that generate
erroneous constructs when these errors are encountered.
Syntax errors must be detected by a compiler and at least reported to the user (in a helpful way).
If possible, the compiler should make the appropriate correction(s). Semantic errors are much
harder and sometimes impossible for a computer to detect.
Front End:
Lexical Analysis
Syntactic Analysis
Creation of the symbol table
Semantic Analysis
Generation of the intermediate code
A part of code optimization
Error Handling that goes along with the above said phases
Back End:
The back end includes the phases of the compiler that depend on the target machine, and these
phases do not depend on the source language, but depend on the intermediate language. The
phases of back end are:
Code Optimization
Code Generation
Necessary Symbol table and error handling operations
Based on the grouping of phases, two types of compiler design are possible.
In order to automate the development of compilers, some general tools have been created. These
tools use specialized languages for specifying and implementing the components. The most
successful tools hide the details of the generation algorithm and produce components
which can be easily integrated into the remainder of the compiler. These tools are often referred to
as compiler-compilers, compiler-generators, or translator-writing systems.
Syntax-directed translation engines: Produce collections of routines for walking a parse tree
and generating intermediate code.
Scope Rules: The scope of a declaration of x is the context in which uses of x refer to this
declaration. A language uses static scope or lexical scope if the scope of a declaration can be
determined by looking only at the program text, i.e., by the compiler. Otherwise,
the language uses dynamic scope.
Example in Java:
public static int x;
The compiler can determine the location of integer x in memory.
The static-scope policy is as follows:
1. A C program consists of a sequence of top-level declarations of variables and functions.
2. Functions may have variable declarations within them, where variables include local
variables and parameters. The scope of each such declaration is restricted to the function
in which it appears.
3. The scope of a top-level declaration of a name x consists of the entire program that
follows, with the exception of those statements that lie within a function that also has a
declaration of x.
Block Structures:
Languages that allow blocks to be nested are said to have block structure. A name x in a nested
block B is in the scope of a declaration D of x in an enclosing block if there is no other
declaration of x in an intervening block.
5. Dynamic Scope:
Scope Rules: The scope of a declaration of x is the context in which uses of x refer to this
declaration.
A language uses static scope or lexical scope if it is possible to determine the scope of a
declaration by looking only at the program and can be determined by compiler.
Example in Java:
public static int x;
The compiler can determine the location of integer x in memory.
The language uses dynamic scope if it is not possible to determine the scope of a
declaration during compile time.
Example in Java:
public int x;
With dynamic scope, as the program runs, the same use of x could refer to any of several
different declarations of x.
6. Parameter Passing Mechanism: Parameters are passed from a calling procedure to the callee
either by value (call by value) or by reference (call by reference). Depending on the procedure
call, the actual parameters associated with formal parameters will differ.
Call-By-Reference:
In call-by-reference, the address of the actual parameter is passed to the callee as the value of the
corresponding formal parameter. Uses of the formal parameter in the code of the callee are
implemented by following this pointer to the location indicated by the caller. Changes to the
formal parameter thus appear as changes to the actual parameter.
Call-By-Name:
A third mechanism — call-by-name — was used in the early programming language Algol 60. It
requires that the callee execute as if the actual parameter were substituted literally for the formal
parameter in the code of the callee, as if the formal parameter were a macro standing for the
actual parameter (with renaming of local names in the called procedure, to keep them distinct).
When large objects are passed by value, the values passed are really references to the objects
themselves, resulting in an effective call-by-reference.
7. Aliasing: When parameters are (effectively) passed by reference, two formal parameters can
refer to the same object; this is called aliasing. This possibility allows a change in one variable to
change another.
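A short C illustration of the effect (pointer parameters play the role of reference parameters):

#include <stdio.h>

/* x and y may point to the same object: they are then aliases. */
static void bump(int *x, int *y) {
    *x = *x + 1;            /* through the alias, this also changes *y */
    printf("*y = %d\n", *y);
}

int main(void) {
    int a = 10;
    bump(&a, &a);           /* both formals alias the variable a */
    return 0;               /* prints *y = 11 */
}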
Topics to be Covered
Lexical Analysis
Its main task is to read the input characters and produce as output a sequence of tokens that
the parser uses for syntax analysis.
[Diagram: the parser issues get-next-token requests to the lexical analyzer, which returns tokens; both components access the symbol table]
The above diagram illustrates that the lexical analyzer is a subroutine or a coroutine of the
parser. Upon receiving a "get next token" command from the parser, the lexical analyzer
reads input characters until it can identify the next token.
Since Lexical analyzer is the part of the compiler that reads the source text, it may also
perform certain secondary tasks at the user interface.
Scanning – the scanner is responsible for simple tasks (for example, a Fortran
compiler may use a scanner to eliminate blanks from the input).
Lexical analysis – the lexical analyzer does the more complex operations.
There are several reasons for separating the analysis phase of compiling into lexical analysis
and parsing:
1. To make the design simpler. The separation of lexical analysis from syntax analysis
allows the other phases to be simpler. For example, a parser that had to deal with
comments and white space directly would be considerably more complex than one that can
assume they have already been removed.
2. To improve the efficiency of the compiler. A separate lexical analyzer allows us to
construct an efficient processor. A large amount of time is spent in reading the source
program and partitioning it into tokens. Specialized buffering techniques speed up the
performance.
3. To enhance the compiler portability. Input alphabets and device specific anomalies
can be restricted to the lexical analyzer.
Token: A token is an atomic unit represents a logically cohesive sequence of characters such
as an identifier, a keyword, an operator, constants, literal strings, punctuation symbols such as
parentheses, commas and semicolons.
+, -  –  operators
if    –  keyword
Pattern: A pattern is a rule used to describe lexeme. It is a set of strings in the input for
which the same token is produced as output.
TOKEN       SAMPLE LEXEMES           PATTERN
if          if                       if
relation    <, <=, =, <>, >, >=      < or <= or = or <> or > or >=
When more than one pattern matches a lexeme, the lexical analyzer must provide additional
information about the particular lexeme that matched to the subsequent phases of the
compiler.
For example, the pattern relation matches the operators like <, <=, >, >=, =, < >. It is
necessary to identify operator which is matched with the pattern.
The lexical analyzer collects other information about tokens as their attributes. In practice, a
token has a single attribute: a pointer to the symbol-table entry in which the
information about the token is kept.
For example: the tokens and associated attribute-values for the Fortran statement
X = Y * Z ** 4
are:
<id, pointer to the symbol-table entry for X>
<assign_op,>
<id, pointer to the symbol-table entry for Y>
<mult_op,>
<id, pointer to the symbol-table entry for Z>
<exp_op,>
<num, integer value 4>
For others, the compiler stores the character string that forms a value in a symbol table.
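A minimal symbol-table sketch in C showing how the pointer attribute can be realized (the install_id name echoes the transition-diagram actions used later; the linear search is purely illustrative):

#include <stdio.h>
#include <string.h>

#define MAXSYM 100

static char table[MAXSYM][32];   /* one lexeme per symbol-table entry */
static int  nsym = 0;

/* Return the index of lexeme, inserting it on first sight. */
static int install_id(const char *lexeme) {
    for (int i = 0; i < nsym; i++)
        if (strcmp(table[i], lexeme) == 0)
            return i;            /* already present: reuse the entry */
    strcpy(table[nsym], lexeme);
    return nsym++;
}

int main(void) {
    /* X = Y * Z ** 4 : each identifier gets one entry, reused on repeats */
    printf("<id,%d> <assign_op,> <id,%d> <mult_op,> <id,%d> <exp_op,> <num,4>\n",
           install_id("X"), install_id("Y"), install_id("Z"));
    return 0;
}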
Lexical Errors:
For example:
fi ( a == 1) ....
Here fi is a valid identifier. But the open parentheses followed by the identifier may tell fi is
misspelling of the keyword if or an undeclared function identifier.
INPUT BUFFERING:
Input buffering is a method used to read the source program and to identify tokens
efficiently. There are three general approaches to implementing a lexical analyzer: use a
lexical-analyzer generator such as Lex, write it in a conventional systems-programming language,
or write it in assembly language.
Since the lexical analyzer is the only phase of the compiler that reads the source program
character by character, it is possible to spend a considerable amount of time in the lexical
analysis phase. Thus the speed of lexical analysis is a concern in compiler design.
Buffer Pairs:
The lexical analyzer needs to look-ahead many characters beyond the lexeme for finding the
pattern. The lexical analyzer uses a function ungetc( ) to push the look-ahead characters back
into the input stream. In order to reduce the amount of overhead required to process an input
character, specialized buffering techniques have been developed.
A buffer is divided into N-character halves where N is the number of characters on one disk
block. Example 1024 or 4096
[Buffer pair holding the characters X := M * C * * 4 followed by eof, with the lexeme_beginning and forward pointers marked]
1. Read N input characters into each half of the buffer using one system read command,
instead of reading each input character with a separate call.
2. If fewer than N characters remain in the input, then eof marker is read into the buffer
after the input characters.
3. Two pointers to the input buffer are maintained. Initially both pointers point to the
first character of the next lexeme to be found.
a. The begin pointer points to the start of the lexeme.
b. The forward pointer is set to the character at its right end.
4. Once the lexeme is identified, both pointers are set to the character immediately past
the lexeme.
If the forward pointer moves past the halfway mark, the right half is filled with N new input
characters. If the forward pointer is about to move past the right end of the buffer, the left
half is filled with N new characters and the forward pointer wraps around to the beginning of
the buffer. The drawback is the number of tests required: each advance of the forward pointer
needs a check for a buffer boundary.
Sentinels:
In the scheme described above, a check must be made each time the forward pointer is moved
that we have not moved off one half of the buffer. Sentinels reduce the two tests to one by
placing an eof marker at the end of each half.
A sentinel is a special character which is not a part of the source program used to represent
the end of file. (eof)
Instead of testing the forward pointer each time with two tests, extend each buffer half to hold a
sentinel character at its end, reducing the number of tests to one.
[Buffer pair with an eof sentinel at the end of each half; lexeme_beginning and forward pointers marked]
forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
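The same scheme rendered in C, assuming a file-backed reload routine ('\0' plays the role of the eof sentinel; the buffer size and helper names are illustrative):

#include <stdio.h>

#define N 1024                        /* characters per buffer half */

static char  buf[2 * N + 2];          /* two halves, each ending in a sentinel slot */
static char *forward;
static FILE *in;

/* Fill one half with up to N characters and terminate it with the sentinel. */
static void reload(char *half) {
    size_t n = fread(half, 1, N, in);
    half[n] = '\0';
}

/* Advance forward by one character; only one test in the common case. */
static int next_char(void) {
    if (*forward != '\0')
        return *forward++;            /* the single, usual test */
    if (forward == buf + N) {         /* sentinel ending the first half */
        reload(buf + N + 1);
        forward = buf + N + 1;
        return next_char();
    }
    if (forward == buf + 2 * N + 1) { /* sentinel ending the second half */
        reload(buf);
        forward = buf;
        return next_char();
    }
    return EOF;                       /* '\0' inside a half: real end of input */
}

int main(void) {
    in = fopen("source.txt", "r");    /* hypothetical source file */
    if (!in) return 1;
    reload(buf);
    forward = buf;
    for (int c; (c = next_char()) != EOF; )
        putchar(c);
    return 0;
}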
SPECIFICATION OF TOKENS:
Regular expressions are an important notation for specifying patterns. Each pattern matches
a set of strings, so regular expressions will serve as names for set of strings.
Alphabet: An alphabet or character class denotes any finite set of symbols. For example,
Letters, Characters, ASCII characters, EBCDIC characters
String: A string over some alphabet is a finite sequence of symbols drawn from that alphabet.
For example, 1 0 1 0 1 1 is a string over {0, 1}; ε is the empty string over {0, 1}.
Length of the String: The length of a string S, denoted |S|, is the number of occurrences of
symbols in S; for example, |1 0 1| = 3.
Language: A language denotes any set of strings over some fixed alphabet Σ.
Operations on Languages:
There are several important operations that can be applied to languages. For lexical analysis
the following operations are applied:
OPERATION                               DEFINITION
union of L and M, written L ∪ M         L ∪ M = { s | s is in L or s is in M }
concatenation of L and M, written LM    LM = { st | s is in L and t is in M }
Kleene closure of L, written L*         L* = L0 ∪ L1 ∪ L2 ∪ . . .  ("zero or more concatenations of" L)
positive closure of L, written L+       L+ = L1 ∪ L2 ∪ . . .  ("one or more concatenations of" L)

Let L = {A, B, . . . , Z, a, b, . . . , z} and D = {0, 1, . . . , 9}.
By applying the operators defined above to these languages L and D, we get new languages such as:
L ∪ D (the set of letters and digits), LD (strings consisting of a letter followed by a digit),
L4 (all four-letter strings), L* (all strings of letters, including ε), L(L ∪ D)* (all strings of
letters and digits beginning with a letter, i.e., identifiers), and D+ (all strings of one or more digits).
Regular Expressions:
A regular expression is built out of simple regular expressions using a set of defining rules.
Each regular expression r denotes a language L(r).
Basis:
i) ε is a regular expression denoting { ε }, the language containing only the empty string.
ii) If a is a symbol in Σ, then a is a regular expression denoting { a }.
Induction:
iii) Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a. ( r ) | ( s ) is a regular expression denoting L ( r ) U L ( s ).
b. ( r ) ( s ) is a regular expression denoting L ( r ) L ( s ).
c. ( r )* is a regular expression denoting ( L ( r ) )*.
d. ( r ) is a regular expression denoting L ( r ).
1. the unary operator * has the highest precedence and is left associative.
2. concatenation has the second highest precedence and is left associative.
3. | has the lowest precedence and is left associative.
Unnecessary parentheses can be avoided in the regular expression if the above precedence is
adopted. For example the regular expression: (a) | ((b)* (c)) is equivalent to a | b*c.
Example: Let Σ = {a, b}. The regular expression a | b denotes {a, b}; (a | b)(a | b) denotes
{aa, ab, ba, bb}; a* denotes {ε, a, aa, aaa, . . .}; and (a | b)* denotes the set of all strings of
a's and b's.
If two regular expressions r and s denote the same language, then we say r and s are
equivalent and write r = s. For example, ( a | b ) = (b | a ).
There are number of algebraic laws obeyed by regular expressions and these laws can be used
to manipulate regular expressions into equivalent forms.
Let r, s and t be regular expressions. The following algebraic laws hold for these regular
expressions:
r | s = s | r                     (| is commutative)
r | ( s | t ) = ( r | s ) | t    (| is associative)
( r s ) t = r ( s t )            (concatenation is associative)
r ( s | t ) = r s | r t          (concatenation distributes over |)
ε r = r,  r ε = r                (ε is the identity for concatenation)
r* = ( r | ε )*
r** = r*                         (* is idempotent)
Regular Definitions:
The regular expressions can be given names, and defining regular expressions using these
names is called a regular definition. If Σ is an alphabet of basic symbols, then a regular
definition is a sequence of definitions of the form:
d1 -> r1
d2 -> r2
.......
d n -> rn
where each di is a distinct name, and each ri is a regular expression over the symbols in
Σ ∪ { d1, d2, . . . . , di-1 }, i.e., the basic symbols and the previously defined names.
Example: unsigned numbers such as 6336, 63.36 or 6.336E4 are described by the regular definition:
digit → 0 | 1 | . . . | 9
digits → digit digit*
optional_fraction → . digits | ε
optional_exponent → ( E ( + | - | ε ) digits ) | ε
num → digits optional_fraction optional_exponent
Notational Shorthands:
1. One or more instances( + ): The unary postfix operator + means “one or more
instances of”. Example – (r)+ - Set of all strings of one or more occurrences of r.
2. Zero or One Instance (?): The unary postfix operator ? means “ zero or one instance
of”. Example – (r)? – One or zero occurrence of r.
The regular definition for num can be written by using unary + and unary ? operator
as follows:
digit → 0 | 1 | . . . | 9
digits → digit+
optional_fraction → ( . digits ) ?
optional_exponent → ( E ( + | - ) ? digits ) ?
num → digits optional_fraction optional_exponent
3. Character Classes: The notation [abc], where a, b and c are alphabet symbols, denotes the
regular expression a | b | c. An abbreviated character class such as [a-z] denotes
the regular expression a | b | . . . | z.
Using character classes, identifiers can be described as the strings generated by the regular
expression: [A-Za-z] [A-Za-z0-9]*
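The num definition above can be tried out directly with the POSIX regex library, which supports the +, ? and character-class shorthands (the pattern string is a transliteration of the regular definition for num):

#include <regex.h>
#include <stdio.h>

int main(void) {
    /* digits ( . digits )? ( E (+|-)? digits )? , anchored at both ends */
    const char *pattern = "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$";
    const char *tests[] = { "6336", "63.36", "6.336E4", "1.89E-4", "E45" };
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0) return 1;
    for (int i = 0; i < 5; i++)
        printf("%-8s : %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "num" : "no match");
    regfree(&re);
    return 0;
}

All of the test strings except E45 match, since num requires digits before the optional fraction and exponent.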
Example: consider the following grammar fragment for branching statements:
stmt → if expr then stmt
       | if expr then stmt else stmt
       | ε
expr → term relop term
       | term
term → id
       | num
where the terminals if, then, else, relop, id and num generate sets of strings given by the
following regular definitions:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ ( . digit+ ) ? ( E ( + | - ) ? digit+ ) ?
letter → A | B | . . . | Z | a | b | . . . | z
digit → 0 | 1 | . . . | 9
delim → blank | tab | newline
ws → delim+
The goal of the lexical analyzer is to isolate the lexeme for the next token in the input buffer
and produce as output a pair consisting of the appropriate token and attribute value using the
table given below:
REGULAR EXPRESSION    TOKEN    ATTRIBUTE-VALUE
ws                    -        -
if                    if       -
then                  then     -
else                  else     -
id                    id       pointer to table entry
num                   num      pointer to table entry
<                     relop    LT
<=                    relop    LE
=                     relop    EQ
<>                    relop    NE
>                     relop    GT
>=                    relop    GE
Positions in a transition diagram are drawn as circles and are called states. The states are
connected by arrows, called edges. Edges leaving state s have labels indicating the input
characters that can next appear after the transition diagram has reached state s. The label
other refers to any character that is not indicated by any of the other edges leaving s.
One state is labeled as start state; it is the initial state of the transition diagram where control
resides when we begin to recognize token. Certain states may have actions that are executed
when the flow of control reaches that state. On entering a state we read the next input
character. If there is an edge from the current state whose label matches this input character,
then we go to the state pointed by the edge. Otherwise, we indicate failure.
The symbol * is used to indicate states on which the input retraction must take place.
There may be several transition diagrams, each specifying a group of tokens. If failure occurs
in one transition diagram, then the forward pointer is retracted to where it was in the start
state of this diagram, and the next transition diagram is activated. Since the lexeme_beginning
and forward pointers mark the same position in the start state of the diagram, the forward
pointer is retracted to the position marked by the lexeme_beginning pointer. If failure occurs
in all transition diagrams, then a lexical error has been detected and an error-recovery routine
is invoked.
[Transition diagram for > and >=: from start state 0, '>' leads to state 6; from state 6, '=' leads to state 7, which returns relop GE; any other character leads to state 8 (marked *, so the input is retracted), which returns relop GT]
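Each diagram translates mechanically into code: a state variable, tests on the next character, and a retraction in the starred states. A C sketch for the > / >= diagram above (the enum values are illustrative):

#include <stdio.h>

enum relop { LT, LE, EQ, NE, GT, GE };

/* Recognize > or >= starting at *s. In the starred state the lookahead
 * character is simply not consumed, which is the retraction. */
static enum relop relop_gt(const char **s) {
    (*s)++;                  /* consume '>': state 0 -> state 6 */
    if (**s == '=') {        /* state 6 -> state 7 */
        (*s)++;
        return GE;
    }
    return GT;               /* state 8*: retract (lookahead not consumed) */
}

int main(void) {
    const char *a = ">= x", *b = "> x";
    printf("%d %d\n", relop_gt(&a), relop_gt(&b));   /* prints 5 4 (GE GT) */
    return 0;
}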
[Transition diagram for identifiers: a letter leads from the start state to a state with a self-loop on letter or digit; any other character leads to an accepting state marked * (retract the input) that executes return(gettoken(), install_id())]
[Transition diagrams for unsigned numbers (states 17, 18, 19, . . .): digits, an optional fraction, and an optional exponent E with an optional + or - followed by digits; the accepting states are marked * (retract the input) and execute return(gettoken(), install_num()). A further diagram skips white space using a self-loop on delim]
Finite automata
A recognizer for a language is a program that takes as input a string x and answers yes if x is
a sentence of the language and no otherwise.
A finite automaton is a five-tuple (Q, Σ, δ, q0, F), where:
Q – a finite set of states
Σ – the input alphabet
δ – the transition function, δ : Q x Σ → Q
q0 – the start state
F – the set of final (accepting) states
o NFA (non-deterministic finite automaton): more than one transition may occur for an input symbol from a state.
o DFA (deterministic finite automaton): for each state and for each input symbol, exactly one transition occurs from that state.
An NFA is built from a regular expression by combining the machines for its sub-expressions:
• Union: r = r1 + r2
• Concatenation: r = r1 r2
• Closure: r = r1*
[NFA construction diagrams omitted]
Ɛ –closure
Ɛ - Closure is the set of states that are reachable from the state concerned on taking empty
string as input. It describes the path that consumes empty string (Ɛ) to reach some states of
NFA.
Sub-set Construction
Steps (a working sketch of the closure and move computations follows this list):
1. Convert the regular expression into an NFA using the above rules for the operators (union,
concatenation and closure) and their precedence.
2. Compute the ε-closure of the NFA start state; this set becomes the start state of the DFA.
3. For each DFA state S and each input symbol a, compute move(S, a), the set of NFA states
reachable from S on a.
4. Take the ε-closure of each set computed in step 3; each distinct closure is a DFA state.
5. Record the transitions found in steps 3 and 4 in the transition table Dtran.
6. If a new state is found, repeat step 4 and step 5 until no more new states are found.
7. A DFA state is final if it contains a final state of the NFA.
8. Draw the transition diagram with start state ε-closure(start state of NFA) and final states
those that contain the final state of the NFA.
Direct Method
Direct method is used to convert given regular expression directly into DFA.
Computation of followpos
The position of regular expression can follow another in the following ways:
If n is a cat node with left child c1 and right child c2, then for every position i in
lastpos(c1), all positions in firstpos(c2) are in followpos(i).
For cat node, for each position i in lastpos of its left child, the firstpos of its
right child will be in followpos(i).
If n is a star node and i is a position in lastpos(n), then all positions in firstpos(n) are
in followpos(i).
For a star node, the firstpos of that node is in followpos of all positions in lastpos of
that node.
Example: construct a DFA directly for the regular expression (a+b)*abb.
A = firstpos(n0) = {1, 2, 3}
Dtran[A, a] = followpos(1) ∪ followpos(3) = {1, 2, 3, 4} = B
Dtran[A, b] = followpos(2) = {1, 2, 3} = A
Dtran[B, a] = followpos(1) ∪ followpos(3) = B
Dtran[B, b] = followpos(2) ∪ followpos(4) = {1, 2, 3, 5} = C
….
Equivalent automata:
{A, C} = {1, 2, 3}
{B} = {1, 2, 3, 4}
{D} = {1, 2, 3, 5}
{E} = {1, 2, 3, 6}
There exists a minimum-state DFA.
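The resulting minimum-state DFA is small enough to hard-code as a transition table; a C sketch (the state numbering is illustrative):

#include <stdio.h>

/* Minimum-state DFA for (a+b)*abb.
 * States: 0 = {A,C}, 1 = B, 2 = D, 3 = E (accepting). */
static const int dtran[4][2] = {   /* column 0 = 'a', column 1 = 'b' */
    { 1, 0 },   /* state 0 */
    { 1, 2 },   /* state 1 */
    { 1, 3 },   /* state 2 */
    { 1, 0 },   /* state 3 */
};

static int matches(const char *s) {
    int state = 0;
    for (; *s; s++)
        state = dtran[state][*s == 'b'];
    return state == 3;             /* accept iff the walk ends in E */
}

int main(void) {
    const char *tests[] = { "abb", "aabb", "babb", "ab", "abba" };
    for (int i = 0; i < 5; i++)
        printf("%-5s %s\n", tests[i], matches(tests[i]) ? "yes" : "no");
    return 0;
}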
Lex is a computer program that generates lexical analyzers. Lex is commonly used with
the yacc parser generator.
Creating a lexical analyzer
First, a specification of a lexical analyzer is prepared by creating a program lex.l in
the Lex language. Then, lex.l is run through the Lex compiler to produce a C program
lex.yy.c.
Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which
is the lexical analyzer that transforms an input stream into a sequence of tokens.
Example: a Lex program that counts the vowels and consonants in its input:
%{
int v=0,c=0;
%}
%%
[aeiouAEIOU] {v++;}
[a-zA-Z]     {c++;}
%%
main()
{
    yylex();
    printf("Vowels=%d, Consonants=%d\n",v,c);
}
SYNTAX ANALYSIS
Need and Role of the Parser-Context Free Grammars -Top Down Parsing -General Strategies-
Recursive Descent Parser Predictive Parser-LL(1) Parser-Shift Reduce Parser-LR Parser-LR
(0) Item-Construction of SLR Parsing Table-Introduction to LALR Parser-Error Handling and
Recovery in Syntax Analyzer-YACC-Design of a Syntax Analyzer for a Sample Language.
SYNTAX ANALYSIS
Syntax analysis is the second phase of the compiler. It gets the input from the tokens and
generates a syntax tree or parse tree.
The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer and
verifies that the string can be generated by the grammar for the source language. It reports any
syntax errors in the program. It also recovers from commonly occurring errors so that it can
continue processing its input.
[Diagram: the parser obtains tokens from the lexical analyzer and produces a parse tree for the rest of the front end; both consult the symbol table]
CONTEXT-FREE GRAMMARS
Terminals : These are the basic symbols from which strings are formed.
Non-Terminals : These are the syntactic variables that denote a set of strings. These help to
define the language generated by the grammar.
Start Symbol : One non-terminal in the grammar is denoted as the “Start-symbol” and the set of
strings it denotes is the language defined by the grammar.
Productions : It specifies the manner in which terminals and non-terminals can be combined to
form strings. Each production consists of a non-terminal, followed by an arrow, followed by a
string of non-terminals and terminals.
Example grammar for arithmetic expressions:
expr → expr op expr | ( expr ) | - expr | id
op → + | - | * | / | ↑
In this grammar:
id + - * / ↑ ( ) are terminals.
expr, op are non-terminals.
expr is the start symbol.
Each line is a production.
Derivation is a process that generates a valid string with the help of grammar by replacing the
non-terminals on the left with the string on the right side of the production.
Types of derivations:
In leftmost derivations, the leftmost non-terminal in each sentential form is always chosen first for
replacement.
In rightmost derivations, the rightmost non-terminal in each sentential form is always chosen first
for replacement.
Example: derivations of - ( id + id ):
Leftmost:                     Rightmost:
E → - E                       E → - E
  → - ( E )                     → - ( E )
  → - ( E + E )                 → - ( E + E )
  → - ( id + E )                → - ( E + id )
  → - ( id + id )               → - ( id + id )
Strings that appear in a leftmost derivation are called left sentential forms.
Strings that appear in a rightmost derivation are called right sentential forms.
Sentential forms:
Each interior node of a parse tree is a non-terminal. The children of a node can be terminals
or non-terminals, read from left to right to form a sentential form. The sentential form
yielded by the parse tree is called the yield or frontier of the tree.
Ambiguity:
A grammar that produces more than one parse for some sentence is said to be ambiguous
grammar.
The sentence id+id*id has the following two distinct leftmost derivations:
E → E + E                     E → E * E
  → id + E                      → E + E * E
  → id + E * E                  → id + E * E
  → id + id * E                 → id + id * E
  → id + id * id                → id + id * id
[The two parse trees for id+id*id: in the first, + is at the root with the subtree id * id as its right child; in the second, * is at the root with the subtree id + id as its left child]
WRITING A GRAMMAR
Each parsing method can handle grammars only of a certain form hence, the initial grammar may
have to be rewritten to make it parsable.
REGULAR EXPRESSIONS                              CONTEXT-FREE GRAMMARS
The transition diagram has a set of states       The grammar has a set of productions.
and edges.
Useful for describing the structure of lexical   Useful for describing nested structures such
constructs such as identifiers, constants,       as balanced parentheses, matching
keywords, and so forth.                          begin-end's, and so on.
The lexical rules of a language are simple and RE is used to describe them.
Regular expressions provide a more concise and easier to understand notation for tokens
than grammars.
Separating the syntactic structure of a language into lexical and nonlexical parts provides
a convenient way of modularizing the front end into two manageable-sized components.
Eliminating ambiguity:
Ambiguity of a grammar that produces more than one parse tree for a leftmost or rightmost
derivation can be eliminated by rewriting the grammar.
Consider this example, G: stmt → if expr then stmt | if expr then stmt else stmt | other
This grammar is ambiguous since the string if E1 then if E2 then S1 else S2 has the following
two parse trees for leftmost derivation :
[Parse tree 1: the else is matched with the closest if, i.e., if E1 then (if E2 then S1 else S2). Parse tree 2: the else is matched with the outer if, i.e., if E1 then (if E2 then S1) else S2]
Left recursion elimination: the left-recursive productions A → Aα | β can be replaced by:
A → βA’
A’ → αA’ | ε
Consider the left-recursive grammar:
E → E+T | T
T → T*F | F
F → (E) | id
After eliminating the left recursion:
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
Left factoring: productions with a common prefix, A → αβ1 | αβ2, can be left-factored into:
A → αA’
A’ → β1 | β2
Example: the grammar S → iEtS | iEtSeS | a, E → b is left-factored into:
S → iEtSS’ | a
S’ → eS | ε
E → b
PARSING
Parse tree: a graphical representation of a derivation, in which each interior node is labeled by a non-terminal and the leaves are labeled by terminals.
Types of parsing:
Top–down parsing : A parser can start with the start symbol and try to transform it to the
input string.
Example : LL Parsers.
Bottom–up parsing : A parser can start with input and attempt to rewrite it into the start
symbol.
Example : LR Parsers.
TOP-DOWN PARSING
Recursive descent parsing is one of the top-down parsing techniques that uses a set of
recursive procedures to scan its input.
This parsing method may involve backtracking, that is, making repeated scans of the
input.
The parse tree can be constructed using the following top-down approach. Consider the grammar
S → cAd, A → ab | a and the input string w = cad.
Step 1:
Initially create a tree with a single node labeled S. An input pointer points to ‘c’, the first symbol
of w. Expand the tree with the production of S:
c A d
Step2:
The leftmost leaf ‘c’ matches the first symbol of w, so advance the input pointer to the second
symbol of w ‘a’ and consider the next leaf ‘A’. Expand A using the first alternative.
c A d
a b
Step3:
The second symbol ‘a’ of w also matches with second leaf of tree. So advance the input pointer
to third symbol of w ‘d’. But the third leaf of tree is b which does not match with the input
symbol d.
Step 4:
Report failure and go back to A: retract the input pointer to the second symbol of w, reset A’s
subtree, and expand A using the second alternative a. Now the leaf a matches the second symbol
of w and the leaf d matches the third; the parse succeeds.
A left-recursive grammar can cause a recursive-descent parser to go into an infinite loop. Hence,
elimination of left-recursion must be done before parsing.
E → E+T | T
T → T*F | F
F → (E) | id
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
Recursive procedures:

Procedure E( )
begin
    T( );
    EPRIME( );
end

Procedure EPRIME( )
begin
    If input_symbol = '+' then begin
        ADVANCE( );
        T( );
        EPRIME( );
    end
end

Procedure T( )
begin
    F( );
    TPRIME( );
end

Procedure TPRIME( )
begin
    If input_symbol = '*' then begin
        ADVANCE( );
        F( );
        TPRIME( );
    end
end

Procedure F( )
begin
    If input_symbol = 'id' then
        ADVANCE( )
    else if input_symbol = '(' then begin
        ADVANCE( );
        E( );
        if input_symbol = ')' then
            ADVANCE( )
        else
            ERROR( );
    end
    else
        ERROR( );
end
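The same procedures become a compact, runnable parser in C; a sketch in which id is a single letter and error handling is reduced to one message (the names follow the pseudocode above):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Recursive-descent parser for
 *   E -> T E'     E' -> + T E' | eps
 *   T -> F T'     T' -> * F T' | eps
 *   F -> ( E ) | id                     */
static const char *p;                    /* current input position */

static void error(void) { printf("syntax error at '%c'\n", *p); exit(1); }
static void advance(void) { p++; }
static void E(void);

static void F(void) {
    if (isalpha((unsigned char)*p)) advance();
    else if (*p == '(') { advance(); E(); if (*p == ')') advance(); else error(); }
    else error();
}
static void Tprime(void) { if (*p == '*') { advance(); F(); Tprime(); } }
static void T(void) { F(); Tprime(); }
static void Eprime(void) { if (*p == '+') { advance(); T(); Eprime(); } }
static void E(void) { T(); Eprime(); }

int main(void) {
    p = "a+b*c";
    E();
    if (*p == '\0') printf("parsed OK\n");
    else error();
    return 0;
}

Because the left recursion was eliminated first, each procedure consumes at least one token before recursing, so the parser cannot loop forever.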
Stack implementation:
E( ) id+id*id
T( ) id+id*id
F( ) id+id*id
ADVANCE( ) id+id*id
EPRIME( ) id+id*id
ADVANCE( ) id+id*id
T( ) id+id*id
F( ) id+id*id
ADVANCE( ) id+id*id
TPRIME( ) id+id*id
ADVANCE( ) id+id*id
F( ) id+id*id
ADVANCE( ) id+id*id
TPRIME( ) id+id*id
2. PREDICTIVE PARSING
The key problem of predictive parsing is to determine the production to be applied for a
non-terminal in case of alternatives.
[Model of a predictive parser: an input buffer holding a + b $, a stack holding X Y Z $ with X on top, the predictive parsing program, the parsing table M, and the output]
Input buffer:
It consists of strings to be parsed, followed by $ to indicate the end of the input string.
Stack:
It contains a sequence of grammar symbols preceded by $ to indicate the bottom of the stack.
Initially, the stack contains the start symbol on top of $.
Parsing table:
It is a two-dimensional array M[A, a], where ‘A’ is a non-terminal and ‘a’ is a terminal.
The parser is controlled by a program that considers X, the symbol on top of the stack, and a, the
current input symbol. These two symbols determine the parser action. There are three
possibilities:
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol.
3. If X is a non-terminal, the program consults entry M[X, a] of the parsing table M. If the entry is a
production, the parser replaces X on top of the stack by the right side of that production; if the
entry is an error, the parser calls an error recovery routine.
Method : Initially, the parser has $S on the stack with S, the start symbol of G on top, and w$ in
the input buffer. The program that utilizes the predictive parsing table M to produce a parse for
the input is as follows:
The construction of a predictive parser is aided by two functions associated with a grammar G :
1. FIRST
2. FOLLOW
Input: Grammar G.
Output: Parsing table M.
Method: For each production A → α of the grammar:
1. For each terminal a in FIRST(α), add A → α to M[A, a].
2. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A); if ε is in
FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].
3. Make each undefined entry of M an error.
Example: consider the grammar
E → E+T | T
T → T*F | F
F → (E) | id
After eliminating left recursion:
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
First( ) :
FIRST(E) = { ( , id}
FIRST(E’) ={+ , ε }
FIRST(T) = { ( , id}
FIRST(T’) = {*, ε }
FIRST(F) = { ( , id }
Follow( ):
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, $, ) }
FOLLOW(T’) = { +, $, ) }
FOLLOW(F) = {+, * , $ , ) }
Predictive parsing table M (blank entries are errors; a table-driven parser using this table is sketched below):

NON-TERMINAL | id       | +          | *          | (        | )       | $
E            | E → TE’  |            |            | E → TE’  |         |
E’           |          | E’ → +TE’  |            |          | E’ → ε  | E’ → ε
T            | T → FT’  |            |            | T → FT’  |         |
T’           |          | T’ → ε     | T’ → *FT’  |          | T’ → ε  | T’ → ε
F            | F → id   |            |            | F → (E)  |         |
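A table-driven parser for this table can be written in a few lines of C; in this sketch the table is encoded as a function, 'i' stands for id, and P and Q stand for E’ and T’ (all of these encodings are illustrative):

#include <stdio.h>
#include <string.h>

/* M[X, a] for the grammar above; NULL means error, "" means X -> eps. */
static const char *prod(char X, char a) {
    switch (X) {
    case 'E': return (a=='i'||a=='(') ? "TP" : NULL;
    case 'P': return a=='+' ? "+TP" : (a==')'||a=='$') ? "" : NULL;
    case 'T': return (a=='i'||a=='(') ? "FQ" : NULL;
    case 'Q': return a=='*' ? "*FQ" : (a=='+'||a==')'||a=='$') ? "" : NULL;
    case 'F': return a=='i' ? "i" : a=='(' ? "(E)" : NULL;
    }
    return NULL;
}

int main(void) {
    const char *a = "i+i*i$";          /* input, 'i' = id */
    char stack[64] = "$E";             /* $ at the bottom, start symbol on top */
    int top = 1;

    while (top >= 0) {
        char X = stack[top];
        if (X == '$' && *a == '$') { printf("accepted\n"); return 0; }
        if (X == *a) { top--; a++; continue; }     /* match a terminal */
        const char *rhs = prod(X, *a);
        if (!rhs) { printf("error at '%c'\n", *a); return 1; }
        top--;                                     /* pop X ... */
        for (int i = (int)strlen(rhs) - 1; i >= 0; i--)
            stack[++top] = rhs[i];                 /* ... push RHS reversed */
    }
    return 1;
}

The loop is exactly the three possibilities listed earlier: accept, match a terminal, or expand a non-terminal via M[X, a].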
LL(1) grammar:
Each entry in this parsing table is a single production, so no location has more than one entry.
A grammar whose predictive parsing table has no multiply defined entries is called an LL(1) grammar.
Example: consider the ambiguous grammar
S → iEtS | iEtSeS | a
E → b
After left factoring:
S → iEtSS’ | a
S’ → eS | ε
E → b
To construct a parsing table, we need FIRST() and FOLLOW() for all the non-terminals.
FIRST(S) = { i, a }
FIRST(S’) = {e, ε }
FIRST(E) = { b}
FOLLOW(S) = { $, e }
FOLLOW(S’) = { $, e }
FOLLOW(E) = { t }
Parsing table:
NON-TERMINAL | a     | b     | e               | i          | t | $
S            | S → a |       |                 | S → iEtSS’ |   |
S’           |       |       | S’ → eS, S’ → ε |            |   | S’ → ε
E            |       | E → b |                 |            |   |
Since M[S’, e] contains more than one production, the grammar is not LL(1).
A shift-reduce parser can perform four possible actions:
1. Shift
2. Reduce
3. Accept
4. Error
Constructing a parse tree for an input string beginning at the leaves and going towards the
root is called bottom-up parsing.
SHIFT-REDUCE PARSING
Shift-reduce parsing is a type of bottom-up parsing that attempts to construct a
parse tree for an input string beginning at the leaves (the bottom) and working up
towards the root (the top).
Example: consider the grammar
S → aABe
A → Abc | b
B → d
The sentence to be recognized is abbcde.
abbcde    (reduce by A → b)
aAbcde    (reduce by A → Abc)
aAde      (reduce by B → d)
aABe      (reduce by S → aABe)
S
The corresponding rightmost derivation is S → aABe → aAde → aAbcde → abbcde.
The reductions trace out the right-most derivation in reverse.
Handles:
A handle of a string is a substring that matches the right side of a production, and whose
reduction to the non-terminal on the left side of the production represents one step along the
reverse of a rightmost derivation.
Example:
E → E+E
E → E*E
E → (E)
E → id
The rightmost derivation of id1 + id2 * id3 is:
E → E+E
→ E+E*E
→ E+E*id3
→ E+id2*id3
→ id1+id2*id3
Handle pruning: obtaining the rightmost derivation in reverse by repeatedly locating and reducing
handles. For the input id1+id2*id3:

STACK        INPUT           ACTION
$            id1+id2*id3 $   shift
$ id1        +id2*id3 $      reduce by E → id
$ E          +id2*id3 $      shift
$ E+         id2*id3 $       shift
$ E+id2      *id3 $          reduce by E → id
$ E+E        *id3 $          shift
$ E+E*       id3 $           shift
$ E+E*id3    $               reduce by E → id
$ E+E*E      $               reduce by E → E*E
$ E+E        $               reduce by E → E+E
$ E          $               accept
Conflicts during shift-reduce parsing:
1. Shift-reduce conflict: The parser cannot decide whether to shift or to reduce.
2. Reduce-reduce conflict: The parser cannot decide which of several reductions to make.
Example (reduce-reduce conflict):
M → R+R | R+c | R
R → c
with input c+c. The parse proceeds:

STACK    INPUT    ACTION
$        c+c $    shift
$ c      +c $     reduce by R → c
$ R      +c $     shift
$ R+     c $      shift
$ R+c    $        reduce by R → c, or reduce by M → R+c?

With R+c on the stack, the parser cannot decide whether to reduce c by R → c (heading for
M → R+R) or to reduce R+c by M → R+c.
Viable prefixes:
α is a viable prefix of the grammar if there is a w such that αw is a right sentential form.
The set of prefixes of right sentential forms that can appear on the stack of a shift-reduce parser
are called viable prefixes.
The set of viable prefixes is a regular language.
Operator precedence parser can be constructed from a grammar called Operator-grammar. These
grammars have the property that no production on right side is ɛ or has two adjacent non -
terminals.
Example:
E → EAE | (E) | -E | id
A→+|-|*|/|↑
Since the right side EAE has adjacent non-terminals, the grammar is rewritten as follows:
E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id
The operator precedence relations for this grammar are given in a table, assuming ↑ has the
highest precedence and is right-associative, * and / have the next highest precedence, and
+ and - have the lowest precedence and are left-associative.
Example:
Consider the grammar E → E+E | E-E | E*E | E/E | E↑E | (E) | id and the input string id+id*id.
The implementation is as follows:
Advantages of LR parsing:
It recognizes virtually all programming language constructs for which CFG can be
written.
It is an efficient non-backtracking shift-reduce parsing method.
The class of grammars that can be parsed using the LR method is a proper superset of the class
of grammars that can be parsed with predictive parsers.
It detects a syntactic error as soon as possible.
Drawbacks of LR method:
It is too much work to construct an LR parser by hand for a typical programming-language
grammar. A specialized tool, called an LR parser generator, is needed. Example: YACC.
[Model of an LR parser: an input buffer a1 . . . ai . . . an $, a stack of alternating states and grammar symbols, the parsing program, and the parsing table with action and goto parts]
The parsing program reads characters from an input buffer one at a time.
The program uses a stack to store a string of the form s0 X1 s1 X2 s2 … Xm sm, where sm is on
top. Each Xi is a grammar symbol and each si is a state.
The parsing table consists of two parts : action and goto functions.
Action: The parsing program determines sm, the state currently on top of the stack, and ai, the
current input symbol. It then consults action[sm, ai] in the action table, which can have one of four
values:
1. shift s, where s is a state,
2. reduce by a grammar production A → β,
3. accept, and
4. error.
Goto : The function goto takes a state and grammar symbol as arguments and produces a state.
LR Parsing algorithm:
Input: An input string w and an LR parsing table with functions action and goto for grammar G.
Method: Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in the input
buffer. The parser then executes the following program :
LR(0) items:
An LR(0) item of a grammar G is a production of G with a dot at some position of the
right side. For example, production A → XYZ yields the four items :
A → . XYZ
A → X . YZ
A → XY . Z
A → XYZ .
Closure operation:
If I is a set of items for a grammar G, then closure(I) is the set of items constructed from I
by the two rules:
1. Initially, every item in I is added to closure(I).
2. If A → α . B β is in closure(I) and B → γ is a production, then add the item B → . γ to
closure(I), if it is not already there. Apply this rule until no more new items can be added.
Goto operation:
Goto(I, X) is defined to be the closure of the set of all items [A→ αX . β] such that
[A→ α . Xβ] is in I.
If any conflicting actions are generated by the above rules, we say grammar is not SLR(1).
I0 : E’ → . E
     E → . E+T
     E → . T
     T → . T*F
     T → . F
     F → . (E)
     F → . id
GOTO ( I0 , E ) = I1 : E’ → E .
                       E → E . + T

GOTO ( I4 , id ) = I5 : F → id .

GOTO ( I4 , T ) = I2 : E → T .
                       T → T . * F

GOTO ( I9 , * ) = I7 : T → T * . F
                       F → . (E)
                       F → . id

GOTO ( I4 , F ) = I3 : T → F .
FOLLOW(E) = { $, ), + }
FOLLOW(T) = { $, +, ), * }
FOLLOW(F) = { *, +, ), $ }
SLR parsing table (ACTION columns id + * ( ) $, GOTO columns E T F; blank entries are errors):

STATE |  id  |  +  |  *  |  (  |  )  |  $   ||  E |  T |  F
0     |  s5  |     |     | s4  |     |      ||  1 |  2 |  3
1     |      | s6  |     |     |     | acc  ||    |    |
2     |      | r2  | s7  |     | r2  | r2   ||    |    |
3     |      | r4  | r4  |     | r4  | r4   ||    |    |
4     |  s5  |     |     | s4  |     |      ||  8 |  2 |  3
5     |      | r6  | r6  |     | r6  | r6   ||    |    |
6     |  s5  |     |     | s4  |     |      ||    |  9 |  3
7     |  s5  |     |     | s4  |     |      ||    |    | 10
8     |      | s6  |     |     | s11 |      ||    |    |
9     |      | r1  | s7  |     | r1  | r1   ||    |    |
10    |      | r3  | r3  |     | r3  | r3   ||    |    |
11    |      | r5  | r5  |     | r5  | r5   ||    |    |
Stack implementation:
0 id + id * id $ GOTO ( I0 , id ) = s5 ; shift
0F3 + id * id $ GOTO ( I0 , F ) = 3
GOTO ( I3 , + ) = r4 ; reduce by T → F
0T2 + id * id $ GOTO ( I0 , T ) = 2
GOTO ( I2 , + ) = r2 ; reduce by E → T
0E1 + id * id $ GOTO ( I0 , E ) = 1
GOTO ( I1 , + ) = s6 ; shift
0 E 1 + 6 id 5 * id $ GOTO ( I5 , * ) = r6 ; reduce by F → id
0E1+6F3 * id $ GOTO ( I6 , F ) = 3
GOTO ( I3 , * ) = r4 ; reduce by T → F
0E1+6T9 * id $ GOTO ( I6 , T ) = 9
GOTO ( I9 , * ) = s7 ; shift
0 E 1 + 6 T 9 * 7 id 5 $ GOTO ( I5 , $ ) = r6 ; reduce by F → id
0 E 1 + 6 T 9 * 7 F 10 $ GOTO ( I7 , F ) = 10
GOTO ( I10 , $ ) = r3 ; reduce by T → T * F
0E1+6T9 $ GOTO ( I6 , T ) = 9
GOTO ( I9 , $ ) = r1 ; reduce by E → E + T
0E1 $ GOTO ( I0 , E ) = 1
GOTO ( I1 , $ ) = accept
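The trace above can be reproduced by a small driver that hard-codes the action and goto tables; a C sketch (the table encoding is illustrative, and 'i' stands for id):

#include <stdio.h>

/* Productions: 1: E->E+T  2: E->T  3: T->T*F  4: T->F  5: F->(E)  6: F->id */
enum { ER, SH, RE, AC };                 /* action kinds */
typedef struct { char kind, arg; } act;
#define S(n) {SH,n}
#define R(n) {RE,n}
#define E_ {ER,0}

static const act action[12][6] = {
 /*  id      +      *      (      )      $    */
 { S(5),  E_,    E_,    S(4),  E_,    E_     },  /* 0  */
 { E_,    S(6),  E_,    E_,    E_,    {AC,0} },  /* 1  */
 { E_,    R(2),  S(7),  E_,    R(2),  R(2)   },  /* 2  */
 { E_,    R(4),  R(4),  E_,    R(4),  R(4)   },  /* 3  */
 { S(5),  E_,    E_,    S(4),  E_,    E_     },  /* 4  */
 { E_,    R(6),  R(6),  E_,    R(6),  R(6)   },  /* 5  */
 { S(5),  E_,    E_,    S(4),  E_,    E_     },  /* 6  */
 { S(5),  E_,    E_,    S(4),  E_,    E_     },  /* 7  */
 { E_,    S(6),  E_,    E_,    S(11), E_     },  /* 8  */
 { E_,    R(1),  S(7),  E_,    R(1),  R(1)   },  /* 9  */
 { E_,    R(3),  R(3),  E_,    R(3),  R(3)   },  /* 10 */
 { E_,    R(5),  R(5),  E_,    R(5),  R(5)   },  /* 11 */
};
static const char goto_[12][3] = {       /* columns E, T, F */
 {1,2,3}, {0,0,0}, {0,0,0}, {0,0,0}, {8,2,3}, {0,0,0},
 {0,9,3}, {0,0,10}, {0,0,0}, {0,0,0}, {0,0,0}, {0,0,0},
};
static const int rlen[] = {0,3,1,3,1,3,1};   /* |RHS| of each production */
static const int rlhs[] = {0,0,0,1,1,2,2};   /* LHS index: E=0, T=1, F=2 */

static int col(char c) {
    return c=='i'?0 : c=='+'?1 : c=='*'?2 : c=='('?3 : c==')'?4 : 5;
}

int main(void) {
    const char *w = "i+i*i$";
    int stack[64] = {0}, top = 0;            /* state stack, s0 at the bottom */
    for (;;) {
        act a = action[stack[top]][col(*w)];
        if (a.kind == SH) { stack[++top] = a.arg; w++; }
        else if (a.kind == RE) {
            top -= rlen[(int)a.arg];         /* pop |RHS| states */
            stack[top + 1] = goto_[stack[top]][rlhs[(int)a.arg]];
            top++;
            printf("reduce by production %d\n", a.arg);
        }
        else if (a.kind == AC) { printf("accepted\n"); return 0; }
        else { printf("error\n"); return 1; }
    }
}

The reduce messages come out in the order 6, 4, 2, 6, 4, 6, 3, 1 followed by accepted, matching the stack trace above.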
A, S, X: non-terminals
x, y, α, β: strings of terminals and/or non-terminals
C: one terminal or one non-terminal
Start: [S' --> . S, $] is the item associated with the start state.
Read: starting a new state (reading one terminal or non-terminal C) from [A --> x . C y, w],
the new state includes [A --> x C . y, w].
Complete: if [A --> x . X α, u] is an item, then completing on X gives the item(s)
[X --> . β, z] where z ∈ FIRST(αu).
Consider the augmented grammar G’:
0. S’ --> S$
1. S --> CC
2. C --> eC
3. C --> d
The different strategies that a parser uses to recover from a syntactic error are:
1. Panic mode
2. Phrase level
3. Error productions
4. Global correction
Panic mode:
On discovering an error, the parser discards input symbols one at a time until a
synchronizing token is found. The synchronizing tokens are usually delimiters, such as
semicolon or end. It has the advantage of simplicity and does not go into an infinite loop. When
multiple errors in the same statement are rare, this method is quite useful.
Phrase level:
On discovering an error, the parser performs local correction on the remaining input that
allows it to continue. Example: Insert a missing semicolon or delete an extraneous semicolon etc.
Error productions:
The parser is constructed using augmented grammar with error productions. If an error
production is used by the parser, appropriate error diagnostics can be generated to indicate the
erroneous constructs recognized by the input.
Global correction:
Given an incorrect input string x and grammar G, certain algorithms can be used to find a
parse tree for a string y, such that the number of insertions, deletions and changes of tokens is as
small as possible. However, these methods are in general too costly in terms of time and space.
YACC (Yet Another Compiler-Compiler) is a parser generator: it reads a specification file that
codifies the grammar of a language and generates a parsing routine.
Elements of a CFG in a yacc specification:
Example: A -> Bc is written in yacc as a: b 'c';
A yacc specification has three parts separated by %%:
declarations
%%
translation rules
%%
supporting C routines
Eg: Yacc program to recognize L = { a^n b^n | n >= 0 }.
%{
#include<stdio.h>
int valid=1;
%}
%token A B
%%
str:S'\n' {return 0;}
S:A S B
|
;
%%
main()
{
    printf("Enter the string:\n");
    yyparse();
    if(valid==1)
        printf("\nvalid string");
}
int yyerror(char *s)
{
    valid=0;
    printf("\ninvalid string");
    return 0;
}
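The %token A B declarations expect a scanner to supply those token codes. A minimal companion Lex sketch, assuming the parser is built with yacc -d so that y.tab.h defines A and B (the file names are illustrative):

%{
#include "y.tab.h"
%}
%%
a     {return A;}
b     {return B;}
\n    {return '\n';}
%%

The pair can then be compiled with yacc -d followed by lex and cc, linking y.tab.c and lex.yy.c together.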
SEMANTIC ANALYSIS
➢ Semantic Analysis computes additional information related to the meaning of the
program once the syntactic structure is known.
➢ In typed languages such as C, semantic analysis involves adding information to the symbol
table and performing type checking.
➢ The information to be computed is beyond the capabilities of standard parsing
techniques, therefore it is not regarded as syntax.
➢ As for Lexical and Syntax analysis, also for Semantic Analysis we need both a
Representation Formalism and an Implementation Mechanism.
➢ As a representation formalism, this lecture illustrates what are called Syntax-Directed
Translations.
SYNTAX DIRECTED TRANSLATION
➢ The Principle of Syntax Directed Translation states that the meaning of an input
sentence is related to its syntactic structure, i.e., to its Parse-Tree.
➢ By Syntax Directed Translations we indicate those formalisms for specifying
translations for programming language constructs guided by context-free grammars.
o We associate Attributes to the grammar symbols representing the language
constructs.
o Values for attributes are computed by Semantic Rules associated with
grammar productions.
➢ Evaluation of Semantic Rules may:
o Generate Code;
o Insert information into the Symbol Table;
o Perform Semantic Check;
o Issue error messages;
o etc.
If the rules are evaluated during a post order traversal of the parse tree, or with reductions during
a bottom-up parse, then the sequence of steps shown below ends with p5 pointing to the root of
the constructed syntax tree.
Syntax tree for a-4+c using the above SDD: [figure omitted]
As depicted above, attributes in S-attributed SDTs are evaluated in bottom-up parsing, as
the values of the parent nodes depend upon the values of the child nodes.
INPUT     STACK   VAL     PRODUCTION USED
3*5+4$    -       -
*5+4$     3       3
*5+4$     F       3       F → digit
*5+4$     T       3       T → F
5+4$      T*      3
+4$       T*5     3 5
+4$       T*F     3 5     F → digit
+4$       T       15      T → T * F
+4$       E       15      E → T
4$        E+      15
$         E+4     15 4
$         E+F     15 4    F → digit
$         E+T     15 4    T → F
$         E       19      E → E + T
          E$      19
          L       19      L → E $
TYPE CHECKING
A compiler must check that the source program follows both syntactic and semantic
conventions of the source language.
This checking, called static checking, detects and reports programming errors before the program runs.
A type checker verifies that the type of a construct matches that expected by its
context. For example : arithmetic operator mod in Pascal requires integer operands, so
a type checker verifies that the operands of mod have type integer.
Type information gathered by a type checker may be needed when code is generated.
TYPE SYSTEMS
The design of a type checker for a language is based on information about the syntactic
constructs in the language, the notion of types, and the rules for assigning types to language
constructs.
For example : “ if both operands of the arithmetic operators of +,- and * are of type integer, then
the result is of type integer ”
Type Expressions
The type of a language construct is denoted by a type expression. A type expression is either a
basic type or is formed by applying an operator called a type constructor to other type expressions.
1. Basic types such as boolean, char, integer, real are type expressions.
A special basic type, type_error, will signal an error during type checking; void, denoting
"the absence of a value", allows statements to be checked.
Pointers : If T is a type expression, then pointer(T) is a type expression denoting the type
“pointer to an object of type T”.
For example, var p: ↑ row declares variable p to have type pointer(row).
4. Type expressions may contain variables whose values are type expressions.
Type systems
A type system is a collection of rules for assigning type expressions to the various parts
of a program.
Checking done by a compiler is said to be static, while checking done when the
target program runs is termed dynamic.
Any check can be done dynamically, if the target code carries the type of an element along with
the value of that element.
Error Recovery
Since type checking has the potential for catching errors in program, it is desirable
for type checker to recover from errors, so it can check the rest of the input.
Error handling has to be designed into the type system right from the start; the
type checking rules must be prepared to cope with errors.
Here, we specify a type checker for a simple language in which the type of each
identifier must be declared before the identifier is used. The type checker is a translation scheme
that synthesizes the type of each expression from the types of its subexpressions. The type
checker can handle arrays, pointers, statements and functions.
A Simple Language
P →D;E
D → D ; D | id : T
T → char | integer | array [ num ] of T | ↑ T
E → literal | num | id | E mod E | E [ E ] | E ↑
Translation scheme:
P→D;E
D→D;D
D → id : T { addtype (id.entry , T.type) }
T → char { T.type : = char }
T → integer { T.type : = integer }
T → ↑ T1 { T.type : = pointer(T1.type) }
T → array [ num ] of T1 { T.type : = array ( 1… num.val , T1.type) }
In the following rules, the attribute type for E gives the type expression assigned to the
expression generated by E.
The postfix operator ↑ yields the object pointed to by its operand. The type of E ↑ is the type t
of the object pointed to by the pointer E.
Statements do not have values; hence the basic type void can be assigned to them. If an error is
detected within a statement, then type_error is assigned.
1. Assignment statement:
S → id : = E { S.type : = if id.type = E.type then void
else type_error }
2. Conditional statement:
S → if E then S1 { S.type : = if E.type = boolean then S1.type
else type_error }
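These rules can be coded directly over a small type-expression representation; a C sketch restricted to basic types and the pointer constructor (all of the names are illustrative):

#include <stdio.h>

/* Type expressions: basic types plus the pointer constructor. */
enum kind { T_CHAR, T_INTEGER, T_BOOLEAN, T_VOID, T_ERROR, T_POINTER };

struct type {
    enum kind k;
    const struct type *of;        /* pointee, used when k == T_POINTER */
};

static int same_type(const struct type *a, const struct type *b) {
    if (a->k != b->k) return 0;
    return a->k == T_POINTER ? same_type(a->of, b->of) : 1;
}

/* S -> id := E : void if the two types agree, type_error otherwise. */
static enum kind check_assign(const struct type *id, const struct type *e) {
    return same_type(id, e) ? T_VOID : T_ERROR;
}

int main(void) {
    struct type integer = { T_INTEGER, NULL };
    struct type chr     = { T_CHAR,    NULL };
    struct type p_int   = { T_POINTER, &integer };
    printf("%d %d %d\n",
           check_assign(&integer, &integer),   /* void       */
           check_assign(&integer, &chr),       /* type_error */
           check_assign(&p_int,   &p_int));    /* void       */
    return 0;
}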
Procedures:
A procedure definition is a declaration that associates an identifier with a statement. The
identifier is the procedure name, and the statement is the procedure body.
procedure readarray;
var i : integer;
begin
for i : = 1 to 9 do read(a[i])
end;
When a procedure name appears within an executable statement, the procedure is said to be
called at that point.
Activation trees:
An activation tree is used to depict the way control enters and leaves activations. In an
activation tree,
1. Each node represents an activation of a procedure.
2. The root represents the activation of the main program.
3. The node for a is the parent of the node for b if and only if control flows from activation a to
b.
4. The node for a is to the left of the node for b if and only if the lifetime of a occurs before the
lifetime of b.
Control stack:
A control stack is used to keep track of live procedure activations. The idea is to push the
node for an activation onto the control stack as the activation begins and to pop the node
when the activation ends.
The contents of the control stack are related to paths to the root of the activation tree.
When node n is at the top of control stack, the stack contains the nodes along the path
from n to the root.
The portion of the program to which a declaration applies is called the scope of that declaration.
Binding of names:
Even if each name is declared once in a program, the same name may denote different
data objects at run time. “Data object” corresponds to a storage location that holds values.
The term environment refers to a function that maps a name to a storage location.
The term state refers to a function that maps a storage location to the value held there.
[Diagram: a name is mapped by the environment to a storage location, which is mapped by the state to a value]
When an environment associates storage location s with a name x, we say that x is bound
to s. This association is referred to as a binding of x.
STORAGE ORGANISATION
The executing target program runs in its own logical address space in which each
program value has a location.
The management and organization of this logical address space is shared between the
complier, operating system and target machine. The operating system maps the logical
address into physical addresses, which are usually spread throughout memory.
A typical subdivision of run-time memory:
Code
Static Data
Stack
(free memory)
Heap
Activation records:
Procedure calls and returns are usually managed by a run time stack called the control
stack.
Each live activation has an activation record on the control stack: the record for the root of the
activation tree is at the bottom, and the most recent activation has its record at the top of the stack.
The contents of the activation record vary with the language being implemented. The
diagram below shows the contents of activation record.
STATIC ALLOCATION
In static allocation, names are bound to storage as the program is compiled, so there is no
need for a run-time support package.
Since the bindings do not change at run time, every time a procedure is activated, its
names are bound to the same storage locations.
Therefore values of local names are retained across activations of a procedure. That is,
when control returns to a procedure the values of the locals are the same as they were
when control left the last time.
From the type of a name, the compiler decides the amount of storage for the name and
decides where the activation records go. At compile time, we can fill in the addresses at
which the target code can find the data it operates on.
STACK ALLOCATION
All compilers for languages that use procedures, functions or methods as units of user-
defined actions manage at least part of their run-time memory as a stack.
Each time a procedure is called, space for its local variables is pushed onto a stack, and
when the procedure terminates, that space is popped off the stack.
Calling sequences:
Procedure calls are implemented by what is known as a calling sequence, which consists
of code that allocates an activation record on the stack and enters information into its
fields.
A return sequence is similar code that restores the state of the machine so the calling
procedure can continue its execution after the call.
The code in calling sequence is often divided between the calling procedure (caller) and
the procedure it calls (callee).
When designing calling sequences and the layout of activation records, the following
principles are helpful:
Values communicated between caller and callee are generally placed at the
beginning of the callee’s activation record, so they are as close as possible to the
caller’s activation record.
Fixed length items are generally placed in the middle. Such items typically include
the control link, the access link, and the machine status fields.
Items whose size may not be known early enough are placed at the end of the
activation record. The most common example is dynamically sized array, where the
value of one of the callee’s parameters determines the length of the array.
We must locate the top-of-stack pointer judiciously. A common approach is to have
it point to the end of fixed-length fields in the activation record. Fixed-length data
can then be accessed by fixed offsets, known to the intermediate-code generator,
relative to the top-of-stack pointer.
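As a rough illustration of these principles (not a fixed format – every field name and size below is an assumption), the layout might be rendered in C as:

/* Sketch of a general activation record; actual contents vary with
   the language and the machine. */
struct activation_record {
    /* values communicated between caller and callee, placed first,
       closest to the caller's activation record */
    int actual_params[4];                    /* illustrative parameter area */
    int returned_value;
    /* fixed-length items in the middle */
    struct activation_record *control_link;  /* link to the caller's record */
    struct activation_record *access_link;   /* for non-local name access */
    unsigned saved_status;                   /* saved machine status (sketch) */
    /* items whose size is known late go at the end */
    int local_data[8];                       /* locals; dynamic arrays would follow */
};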
...
(Figure: general activation record layout – the parameters and returned values lie at the end nearest the caller's activation record, followed by the control link, the access link and saved machine-status fields, and then the local data and temporaries.)
The calling sequence and its division between caller and callee are as follows:
1. The caller evaluates the actual parameters.
2. The caller stores a return address and the old value of top-sp into the callee's activation record, and then increments top-sp.
3. The callee saves the register values and other status information.
4. The callee initializes its local data and begins execution.
The corresponding return sequence is: the callee places the return value next to the parameters; using the status information, it restores top-sp and the other registers and branches to the return address; the caller then copies the returned value into its own activation record.
(Figure: variable-length data on the stack – the activation record for p contains the control link and pointers to A, B and C; the arrays A, B and C of p follow p's record; then comes the activation record for a procedure q called by p, followed by the arrays of q, with top marking the top of stack.)
Procedure p has three local arrays, whose sizes cannot be determined at compile time.
The storage for these arrays is not part of the activation record for p.
Access to the data is through two pointers, top and top-sp. Here top marks the actual
top of the stack; it points to the position at which the next activation record will begin.
The second top-sp is used to find local, fixed-length fields of the top activation record.
The code to reposition top and top-sp can be generated at compile time, in terms of sizes
that will become known at run time.
HEAP ALLOCATION
Heap allocation parcels out pieces of contiguous storage, as needed for activation
records or other objects.
Pieces may be deallocated in any order, so over time the heap will consist of
alternating areas that are free and in use.
(Figure: activation records in a heap – the record for s, the retained record for r, and the record for the new activation q(1,9), each with its own control link.)
The record for an activation of procedure r is retained when the activation ends.
Therefore, the record for the new activation q(1 , 9) cannot follow that for s physically.
If the retained activation record for r is deallocated, there will be free space in the
heap between the activation records for s and q.
PARAMETER PASSING
A language has first-class functions if functions can be declared within any scope, passed
as arguments to other functions, and returned as results of functions. In a language with
first-class functions and static scope, a function value is generally represented by a
closure: a pair consisting of a pointer to the function code and a pointer to an activation
record. Passing functions as arguments is very useful in structuring systems using upcalls.
Call-by-Value
The actual parameters are evaluated and their r-values are passed to the called procedure.
A procedure called by value can affect its caller either through nonlocal names or
through pointers.
Parameters in C are always passed by value. Arrays are unusual: what is passed by
value is a pointer.
Pascal uses pass by value by default, but var parameters are passed by reference.
Call-by-Reference
The actual parameters are passed by their l-values (addresses), so the formal parameter becomes an alias for the actual parameter:
func(a,b) { a = b };
call func(3,4); print(3);
If even literals are passed by reference, the assignment a = b can overwrite the storage holding the constant 3, so a later print(3) may actually print 4.
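The contrast can be made concrete in C, where parameters are always passed by value and reference semantics is simulated with pointers. This sketch echoes the func(3,4) example above:

#include <stdio.h>

void set_by_value(int a, int b) { a = b; }   /* changes only the local copy */
void set_by_ref(int *a, int b) { *a = b; }   /* changes the caller's variable */

int main(void) {
    int x = 3;
    set_by_value(x, 4); printf("%d\n", x);   /* prints 3: x unchanged */
    set_by_ref(&x, 4);  printf("%d\n", x);   /* prints 4: x overwritten */
    return 0;
}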
Copy-Restore
A hybrid between call-by-value and call-by-reference.
The actual parameters are evaluated and their r-values are passed as in call-by-value.
In addition, l-values are determined before the call.
When control returns, the current r-values of the formal parameters are copied back
into the l-values of the actual parameters.
Call-by-Name
The actual parameters are literally substituted for the formals. This is like a macro
expansion or in-line expansion. Call-by-name is not used in practice. However, the
conceptually related technique of in-line expansion is commonly used. In-lining may
be one of the most effective optimizing transformations when it is guided by
execution profiles.
SYMBOL TABLES
A symbol table may serve the following purposes depending upon the language in hand:
To store the names of all entities in a structured form at one place.
To verify if a variable has been declared.
To implement type checking, by verifying that assignments and expressions in the source code are semantically correct.
To determine the scope of a name (scope resolution).
Implementation
If a compiler is to handle a small amount of data, the symbol table can be implemented
as an unordered list, which is easy to code but suitable only for small tables. A symbol
table can be implemented in one of the following ways:
Linear (sorted or unsorted) list
Binary search tree
Hash table
Among all, symbol tables are mostly implemented as hash tables, where the
source code symbol itself is treated as a key for the hash function and the return value is
the information about the symbol.
Operations
A symbol table, either linear or hash, should provide the following operations.
insert()
This operation is used more frequently in the analysis phase, i.e., the first half of
the compiler, where tokens are identified and names are stored in the table. The
operation adds information about a unique name occurring in the source code to the
symbol table. The format or structure in which the names are stored depends upon
the compiler in hand. For example, for the declaration int a; the compiler may execute:
insert(a, int);
lookup()
lookup(symbol)
This method returns 0 (zero) if the symbol does not exist in the symbol table. If
the symbol exists in the symbol table, it returns its attributes stored in the table.
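A minimal C sketch of such a chained hash table supporting insert() and lookup(); the bucket count, hash function and attribute representation are illustrative assumptions:

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211                /* illustrative table size */

typedef struct entry {
    char *name;                     /* key: the source-code symbol */
    char *attrs;                    /* e.g., type information */
    struct entry *next;             /* chaining resolves collisions */
} Entry;

static Entry *table[NBUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

void insert(const char *name, const char *attrs) {
    Entry *e = malloc(sizeof *e);
    e->name = strdup(name);
    e->attrs = strdup(attrs);
    e->next = table[hash(name)];    /* push onto the bucket's chain */
    table[hash(name)] = e;
}

char *lookup(const char *name) {    /* returns attributes, or 0 if absent */
    for (Entry *e = table[hash(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0) return e->attrs;
    return 0;
}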
Scope Management
A compiler maintains two types of symbol tables: a global symbol table which
can be accessed by all the procedures and scope symbol tables that are created for each
scope in the program.
. . .
int value = 10;

void pro_one()
{
    int one_1;
    int one_2;
    {                 /* inner scope 1 */
        int one_3;
        int one_4;
    }
    int one_5;
}

void pro_two()
{
    int two_1;
    int two_2;
    {                 /* inner scope 2 */
        int two_3;
        int two_4;
    }
    int two_5;
}
. . .
The global symbol table contains names for one global variable (int value) and
two procedure names, which should be available to all the child nodes shown above.
The names mentioned in the pro_one symbol table (and all its child tables) are not
available for pro_two symbols and its child tables.
This symbol table data structure hierarchy is stored in the semantic analyzer and
whenever a name needs to be searched in a symbol table, it is searched using the
following algorithm:
first, the symbol is searched in the current scope, i.e., the current symbol table;
if the name is not found there, the enclosing (parent) scope's table is searched, and so on,
until either the name is found or the global symbol table has been searched for
the name.
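The search rule can be sketched by linking each scope's table to its parent; lookup_in (the per-scope search) is assumed to be a flat lookup like the one above:

/* Sketch: each scope's symbol table points to the enclosing scope's
   table; the global table has parent == NULL. */
typedef struct scope {
    struct scope *parent;
    /* ... this scope's own hash table ... */
} Scope;

char *lookup_in(Scope *s, const char *name);  /* per-scope lookup (assumed) */

/* search the current scope first, then each enclosing scope, stopping
   when the name is found or the global table has been searched */
char *lookup_scoped(Scope *s, const char *name) {
    for (; s != NULL; s = s->parent) {
        char *attrs = lookup_in(s, name);
        if (attrs) return attrs;
    }
    return 0;   /* not found in any scope */
}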
Topics to be Covered
Principal Sources of Optimization-DAG- Optimization of Basic Blocks-Global Data Flow Analysis-
Efficient Data Flow Algorithms-Issues in Design of a Code Generator - A Simple Code Generator
Algorithm.
INTRODUCTION
➢ The code produced by straightforward compiling algorithms can often be made to run
faster or take less space, or both. This improvement is achieved by program transformations
that are traditionally called optimizations. Compilers that apply code-improving
transformations are called optimizing compilers.
Machine independent optimizations are program transformations that improve the target code
without taking into consideration any properties of the target machine.
Machine dependent optimizations are based on register allocation and utilization of special
machine-instruction sequences.
✓ Simply stated, the best program transformations are those that yield the most benefit for the
least effort.
✓ The transformation must preserve the meaning of programs. That is, the optimization must
not change the output produced by a program for a given input, or cause an error such as
division by zero, that was not present in the original source program. At all times we take the
“safe” approach of missing an opportunity to apply a transformation rather than risk
changing what the program does.
✓ The transformation must be worth the effort. It does not make sense for a compiler writer to
expend the intellectual effort to implement a code-improving transformation, and to have the
compiler expend the additional time compiling source programs, if this effort is not repaid
when the target programs are executed. Certain local or “peephole” transformations, however,
are simple enough and beneficial enough to be included in any compiler.
There are a number of ways in which a compiler can improve a program without
changing the function it computes.
The transformations commonly used are common sub-expression elimination, copy
propagation, dead-code elimination and constant folding; these preserve the function computed.
➢ Common Sub-expression Elimination:
An occurrence of an expression E is called a common sub-expression if E was previously
computed and the values of the variables in E have not changed since that computation.
Consider:
t1: = 4*i
t2: = a [t1]
t3: = 4*j
t4: = 4*i
t5: = n
t6: = b [t4] +t5
The above code can be optimized using common sub-expression elimination as
t1: = 4*i
t2: = a [t1]
t3: = 4*j
t5 : = n
t6: = b [t1] +t5
The common sub-expression t4: = 4*i is eliminated as its computation is already in t1, and
the value of i has not changed between its definition and this use.
➢ Copy Propagation:
Assignments of the form f : = g are called copy statements, or copies for short. The idea
behind the copy-propagation transformation is to use g for f, whenever possible after the
copy statement f: = g. Copy propagation means use of one variable instead of another.
This may not appear to be an improvement, but as we shall see it gives us an opportunity
to eliminate x.
For example:
x=Pi;
……
A=x*r*r;
The optimization using copy propagation can be done as follows:
A=Pi*r*r;
➢ Dead-Code Eliminations:
A variable is live at a point in a program if its value can be used subsequently; otherwise,
it is dead at that point. A related idea is dead or useless code – statements that compute values that never get used. For example:
i=0;
if(i==1)
{
a=b+5;
}
Here, the ‘if’ statement is dead code because this condition will never be satisfied.
➢ Constant folding:
We can eliminate both the test and printing from the object code. More generally,
deducing at compile time that the value of an expression is a constant and using the
constant instead is known as constant folding.
One advantage of copy propagation is that it often turns the copy statement into dead
code.
✓ For example,
a=3.14157/2 can be replaced by
a=1.570 thereby eliminating a division operation.
➢ Loop Optimizations:
We now give a brief introduction to a very important place for optimizations, namely
loops, especially the inner loops where programs tend to spend the bulk of their time. The
running time of a program may be improved if we decrease the number of instructions in
an inner loop, even if we increase the amount of code outside that loop.
Three techniques are important for loop optimization: code motion, which moves code outside a loop; induction-variable elimination, which removes redundant induction variables from the loop; and reduction in strength, which replaces an expensive operation by a cheaper one.
➢ Induction Variables :
Loops are usually processed inside out. For example consider the loop around B3.
Note that the values of j and t4 remain in lock-step; every time the value of j decreases by
1, that of t4 decreases by 4 because 4*j is assigned to t4. Such identifiers are called
induction variables.
When there are two or more induction variables in a loop, it may be possible to get rid of
all but one, by the process of induction-variable elimination. For the inner loop around
B3 in Fig. we cannot get rid of either j or t4 completely; t4 is used in B3 and j in B4.
However, we can illustrate reduction in strength and illustrate a part of the process of
induction-variable elimination. Eventually j will be eliminated when the outer loop of B2
- B5 is considered.
Example:
As the relationship t4:=4*j surely holds after such an assignment to t4 in Fig. and t4 is not
changed elsewhere in the inner loop around B3, it follows that just after the statement
j:=j-1 the relationship t4:= 4*j-4 must hold. We may therefore replace the assignment t4:=
4*j by t4:= t4-4. The only problem is that t4 does not have a value when we enter block B3
for the first time. Since we must maintain the relationship t4=4*j on entry to the block B3,
we place an initialization of t4 at the end of the block where j itself is initialized.
before (inside the loop): j: = j-1 followed by t4: = 4*j
after: j: = j-1 followed by t4: = t4-4, with t4: = 4*j placed once in the block that initializes j
THE DAG REPRESENTATION OF BASIC BLOCKS
Input: A basic block.
Output: A DAG for the basic block containing the following information:
1. A label for each node. For leaves, the label is an identifier. For interior nodes, an
operator symbol.
2. For each node, a list of attached identifiers to hold the computed values.
The statements of the block take one of the following three forms:
Case (i) x : = y OP z
Case (ii) x : = OP y
Case (iii) x : = y
Method:
Step 1: If node(y) is undefined, create a leaf labeled y, and let node(y) be this node. In case (i), if node(z) is undefined, create a leaf labeled z and let that leaf be node(z).
Step 2: For case (i), determine whether there is a node labeled OP whose left child is node(y) and whose right child is node(z); if not, create such a node. For case (ii), determine whether there is a node(OP) with one child node(y); if not, create such a node. For case (iii), the node is simply node(y).
Step 3: Delete x from the list of attached identifiers for node(x). Append x to the list of attached identifiers for the node found in Step 2 and set node(x) to that node.
Example: consider the following basic block of three-address statements:
1. t1 := 4* i
2. t2 := a[t1]
3. t3 := 4* i
4. t4 := b[t3]
5. t5 := t2*t4
6. t6 := prod+t5
7. prod := t6
8. t7 := i+1
9. i := t7
10. if i<=20 goto (1)
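As an illustration of the construction (not from the original notes), here is a minimal C sketch of case (i); the node representation, array sizes and helper names are assumptions:

#include <stdlib.h>
#include <string.h>

/* Sketch of a DAG node; sizes are arbitrary for illustration. */
typedef struct node {
    char op;                   /* operator for interior nodes, 0 for leaves */
    char label[8];             /* identifier label for leaves */
    struct node *left, *right;
    char attached[8][8];       /* list of attached identifiers */
    int  nattached;
} Node;

static Node *nodes[256];       /* all nodes created so far */
static int   nnodes;

Node *leaf(const char *id) {   /* node(y): find or create a leaf labeled y */
    for (int i = 0; i < nnodes; i++)
        if (nodes[i]->op == 0 && strcmp(nodes[i]->label, id) == 0)
            return nodes[i];
    Node *n = calloc(1, sizeof *n);
    strcpy(n->label, id);
    nodes[nnodes++] = n;
    return n;
}

/* Step 2, case (i): reuse node(OP) with children node(y), node(z) if present */
Node *interior(char op, Node *l, Node *r) {
    for (int i = 0; i < nnodes; i++)
        if (nodes[i]->op == op && nodes[i]->left == l && nodes[i]->right == r)
            return nodes[i];               /* common sub-expression detected */
    Node *n = calloc(1, sizeof *n);
    n->op = op; n->left = l; n->right = r;
    nodes[nnodes++] = n;
    return n;
}

void attach(Node *n, const char *x) {      /* Step 3: x now names node n */
    strcpy(n->attached[n->nattached++], x);
}

With this sketch, statements (1) and (3) above – t1 := 4*i and t3 := 4*i – would map to the same *-node, exposing the common sub-expression.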
The advantage of generating code for a basic block from its DAG representation is that
from a DAG we can more easily see how to rearrange the order of the final computation
sequence than we can starting from a linear sequence of three-address statements or
quadruples. Consider the block
t1 : = a + b
t2 : = c + d
t3 : = e – t2
t4 : = t1 – t3
with t4 live on exit. Generating code for the statements in this order, using two registers, gives:
MOV a , R0
ADD b , R0
MOV c , R1
ADD d , R1
MOV R0 , t1
MOV e , R0
SUB R1 , R0
MOV t1 , R1
SUB R0 , R1
MOV R1 , t4
If, instead, the statements are reordered as
t2 : = c + d
t3 : = e – t2
t1 : = a + b
t4 : = t1 – t3
the following code sequence results:
MOV c , R0
ADD d , R0
MOV e , R1
SUB R0 , R1
MOV a , R0
ADD b , R0
SUB R1 , R0
MOV R0 , t4
In this order, two instructions MOV R0 , t1 and MOV t1 , R1 have been saved.
The heuristic ordering algorithm attempts to make the evaluation of a node immediately follow
the evaluation of its leftmost argument.
Algorithm (node listing):
(1) while unlisted interior nodes remain do begin
(2) select an unlisted node n, all of whose parents have been listed;
(3) list n;
(4) while the leftmost child m of n has no unlisted parents and is not a leaf do begin
(5) list m;
(6) n : = m
(7) end
(8) end
(Figure: the DAG being listed – interior nodes 1: *, 2: +, 3: –, 4: *, 5: –, 6: +, 8: +; leaves c (7), a (9), b (10), d (11), e (12). Node 1 has children 2 and 3; node 2 has children 6 and 4; node 3 has children 4 and e; node 4 has children 5 and 8; node 5 has children 6 and c; node 6 has children a and b; node 8 has children d and e.)
Example:
Initially, the only node with no unlisted parents is 1 so set n=1 at line (2) and list 1 at line (3).
Now, the left argument of 1, which is 2, has its parents listed, so we list 2 and set n=2 at line (6).
Now, at line (4) we find the leftmost child of 2, which is 6, has an unlisted parent 5. Thus we
select a new n at line (2), and node 3 is the only candidate. We list 3 and proceed down its left
chain, listing 4, 5 and 6. This leaves only 8 among the interior nodes so we list that.
The resulting list is 1, 2, 3, 4, 5, 6, 8; evaluating the nodes in the reverse of this order yields the code:
t8 : = d + e
t6 : = a + b
t5 : = t6 – c
t4 : = t5 * t8
t3 : = t4 – e
t2 : = t6 + t4
t1 : = t2 * t3
This yields optimal code for the DAG on this machine, whatever the number of registers.
OPTIMIZATION OF BASIC BLOCKS
A number of transformations can be applied to a basic block without changing the set of
expressions it computes. Two important classes of local transformations are:
✓ Structure-Preserving Transformations
✓ Algebraic Transformations
Structure-Preserving Transformations:
The primary structure-preserving transformations on basic blocks are:
✓ Common sub-expression elimination
✓ Dead-code elimination
✓ Renaming of temporary variables
✓ Interchange of two independent adjacent statements
Example:
a: =b+c
b: =a-d
c: =b+c
d: =a-d
The 2nd and 4th statements compute the same expression, a-d (neither a nor d is changed
in between), so the basic block can be transformed to
a: = b+c
b: = a-d
c: = b+c
d: = b
Note that although the 1st and 3rd statements appear to share the expression b+c, they do
not compute the same value, because b is redefined by the 2nd statement.
It is possible that a large amount of dead (useless) code may exist in the program. This
might especially happen when variables and procedures are introduced during construction or
error-correction of a program – once declared and defined, one forgets to remove them even
when they serve no purpose. Eliminating these will definitely optimize the code.
Interchange of statements: suppose a block has the two adjacent statements
t1: = b + c
t2: = x + y
They can be interchanged or reordered in the computation of the basic block when the value
of t1 does not affect the value of t2.
Algebraic Transformations:
Algebraic identities represent another important class of optimizations on basic blocks.
This includes simplifying expressions or replacing expensive operation by cheaper ones
i.e. reduction in strength.
Another class of related optimizations is constant folding. Here we evaluate constant
expressions at compile time and replace the constant expressions by their values. Thus
the expression 2*3.14 would be replaced by 6.28.
The relational operators <=, >=, <, >, ≠ and = sometimes generate unexpected common
sub-expressions.
Associative laws may also be applied to expose common sub-expressions. For example, if
the source code has the assignments
a :=b+c
e :=c+d+b
the intermediate code may evaluate e as
t :=c+d
e :=t+b
but by associativity e = (b+c)+d, so the value already computed for a can be reused: e := a+d.
LOOPS IN FLOW GRAPHS
Dominators:
In a flow graph, a node d dominates node n, if every path from initial node of the flow
graph to n goes through d. This will be denoted by d dom n. The initial node dominates all the
remaining nodes in the flow graph, and the entry of a loop dominates all nodes in the loop.
Similarly, every node dominates itself.
Example:
D(1)={1} D(2)={1,2}
D(3)={1,3}
D(4)={1,3,4}
D(5)={1,3,4,5}
D(6)={1,3,4,6}
D(7)={1,3,4,7}
D(8)={1,3,4,7,8}
D(9)={1,3,4,7,8,9}
D(10)={1,3,4,7,8,10}
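These sets satisfy D(n) = {n} ∪ ⋂ D(p) over all predecessors p of n, which can be solved iteratively. A small C sketch using bitsets; the node numbering (0 as the initial node) and the predecessor representation are assumptions:

#define N 10
unsigned pred[N];   /* pred[n]: bitmask of the predecessors of node n */

void dominators(unsigned dom[N]) {
    dom[0] = 1u << 0;                         /* initial node: only itself */
    for (int n = 1; n < N; n++) dom[n] = ~0u; /* start from "all nodes" */
    int changed = 1;
    while (changed) {                         /* iterate to a fixed point */
        changed = 0;
        for (int n = 1; n < N; n++) {
            unsigned d = ~0u;
            for (int p = 0; p < N; p++)       /* intersect over predecessors */
                if (pred[n] & (1u << p)) d &= dom[p];
            d |= 1u << n;                     /* every node dominates itself */
            if (d != dom[n]) { dom[n] = d; changed = 1; }
        }
    }
}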
One application of dominator information is in determining the loops of a flow graph suitable
for improvement.
✓ A loop must have a single entry point, called the header. This entry point dominates all
nodes in the loop, or it would not be the sole entry to the loop.
✓ There must be at least one way to iterate the loop(i.e.)at least one path back to the header.
One way to find all the loops in a flow graph is to search for edges in the flow graph whose
heads dominate their tails. If a→b is an edge, b is the head and a is the tail. These types of
edges are called as back edges.
✓ Example:
7 → 4    4 DOM 7
10 → 7   7 DOM 10
4 → 3    3 DOM 4
8 → 3    3 DOM 8
9 → 1    1 DOM 9
Algorithm: construction of the natural loop of a back edge.
Input: A flow graph G and a back edge n → d.
Output: The set loop consisting of all nodes in the natural loop of n → d.
Method: Beginning with node n, we consider each node m ≠ d that we know is in loop, to make
sure that m's predecessors are also placed in loop. Each node in loop, except for d, is placed once
on stack, so its predecessors will be examined. Note that because d is put in the loop initially, we
never examine its predecessors, and thus find only those nodes that reach n without going
through d.
procedure insert(m);
if m is not in loop then begin
loop : = loop ∪ {m};
push m onto stack
end;
/* main program */
stack : = empty;
loop : = {d};
insert(n);
while stack is not empty do begin
pop m, the first element of stack, off stack;
for each predecessor p of m do insert(p)
end
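A compact C rendering of the same method, using a bitmask for loop; the predecessor representation is the same assumption as in the dominator sketch above:

#define N 10
unsigned pred[N];        /* pred[n]: bitmask of the predecessors of node n */
unsigned loop;           /* bitmask: the nodes of the natural loop */
int stack[N], sp;

static void insert(int m) {
    if (!(loop & (1u << m))) {     /* m not yet in loop */
        loop |= 1u << m;
        stack[sp++] = m;           /* its predecessors must be examined */
    }
}

unsigned natural_loop(int n, int d) {   /* natural loop of back edge n -> d */
    sp = 0;
    loop = 1u << d;                /* d goes in first: never look past it */
    insert(n);
    while (sp > 0) {
        int m = stack[--sp];
        for (int p = 0; p < N; p++)
            if (pred[m] & (1u << p)) insert(p);
    }
    return loop;
}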
Inner loop:
If we use the natural loops as “the loops”, then we have the useful property that unless
two loops have the same header, they are either disjoint or one is entirely contained in
the other. Thus, neglecting loops with the same header for the moment, we have a natural
notion of inner loop: one that contains no other loop.
When two natural loops have the same header, but neither is nested within the other, they
are combined and treated as a single loop.
Pre-Headers:
Several transformations require us to move statements “before the header”. Therefore we
begin treatment of a loop L by creating a new block, called the pre-header.
The pre-header has only the header as successor, and all edges which formerly entered
the header of L from outside L instead enter the pre-header.
Initially the pre-header is empty, but transformations on L may place statements in it.
(Figure: before – edges from outside loop L enter its header directly; after – a new pre-header receives those edges and falls through to the header of L.)
Reducible flow graphs are special flow graphs, for which several code optimization
transformations are especially easy to perform, loops are unambiguously defined,
dominators can be easily calculated, data flow analysis problems can also be solved
efficiently.
Definition:
A flow graph G is reducible if and only if we can partition the edges into two disjoint
groups, forward edges and back edges, with the following properties.
✓ The forward edges form an acyclic graph in which every node can be reached from the initial
node of G.
✓ The back edges consist only of edges whose heads dominate their tails.
If we know the relation DOM for a flow graph, we can find and remove all the back
edges.
If the forward edges form an acyclic graph, then we can say the flow graph is reducible.
In the above example remove the five back edges 4→3, 7→4, 8→3, 9→1 and 10→7
whose heads dominate their tails, the remaining graph is acyclic.
The key property of reducible flow graphs for loop analysis is that in such flow graphs
every set of nodes that we would informally regard as a loop must contain a back edge.
PEEPHOLE OPTIMIZATION
A statement-by-statement code-generation strategy often produces target code that contains
redundant instructions and suboptimal constructs. A simple but effective technique for locally
improving the target code is peephole optimization: examining a short sequence of target
instructions (called the peephole) and replacing these instructions by a shorter or faster
sequence, whenever possible. Characteristic peephole optimizations are:
✓ Redundant-instruction elimination
✓ Flow-of-control optimizations
✓ Algebraic simplifications
✓ Use of machine idioms
✓ Unreachable code elimination
Redundant Loads and Stores:
If we see the instruction sequence
(1) MOV R0 , a
(2) MOV a , R0
we can delete instruction (2), because whenever (2) is executed, (1) will ensure that the value of a is
already in register R0. If (2) had a label we could not be sure that (1) was always executed immediately
before (2), and so we could not remove (2).
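As a sketch (not part of the original notes), one such rule over a simplified instruction record might look like this in C; the Instr layout is an assumption:

#include <string.h>

typedef struct {
    char op[4];                   /* e.g. "MOV" */
    char src[8], dst[8];
    int  has_label;               /* a labelled instruction may be a jump target */
} Instr;

/* delete "MOV a,R0" that immediately follows "MOV R0,a";
   returns the new instruction count */
int drop_redundant_load(Instr *code, int n, int i) {
    if (i + 1 < n &&
        strcmp(code[i].op,  "MOV") == 0 &&
        strcmp(code[i+1].op, "MOV") == 0 &&
        strcmp(code[i].src, code[i+1].dst) == 0 &&
        strcmp(code[i].dst, code[i+1].src) == 0 &&
        !code[i+1].has_label) {   /* unsafe if (2) can be reached by a jump */
        memmove(&code[i+1], &code[i+2], (n - i - 2) * sizeof(Instr));
        return n - 1;
    }
    return n;
}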
Unreachable Code:
Another opportunity for peephole optimizations is the removal of unreachable instructions. An
unlabeled instruction immediately following an unconditional jump may be removed. This operation
can be repeated to eliminate a sequence of instructions. For example, for debugging purposes, a large
program may have within it certain segments that are executed only if a variable debug is 1. In C, the
source code might look like:
#define debug 0
….
If ( debug ) {
Print debugging information
}
If debug = 1 goto L1
goto L2
L1: print debugging information
L2: …………………………(a)
One obvious peephole optimization is to eliminate jumps over jumps. Thus, no matter what the value
of debug, (a) can be replaced by:
If debug ≠1 goto L2
Print debugging information
L2: ……………………………(b)
As debug is set to 0 at the beginning of the program, constant propagation replaces (b) by
If 0 ≠ 1 goto L2
Print debugging information
L2: ……………………………(c)
As the argument of the first statement of (c) evaluates to a constant true, it can be replaced by goto L2.
Then all the statements that print debugging aids are manifestly unreachable and can be eliminated
one at a time.
Flow-of-Control Optimizations:
Unnecessary jumps can be eliminated in either the intermediate code or the target code by
peephole optimizations. We can replace the sequence
goto L1
….
L1: goto L2 by the sequence
goto L2
….
L1: goto L2
If there are now no jumps to L1, then it may be possible to eliminate the statement L1: goto L2
provided it is preceded by an unconditional jump. Similarly, the sequence
if a < b goto L1
….
L1: goto L2 can be replaced by
If a < b goto L2
….
L1: goto L2
Finally, suppose there is only one jump to L1 and L1 is preceded by an unconditional goto.
Then the sequence
goto L1
……..
L1: if a < b goto L2
L3: …………………………………..(1)
May be replaced by
If a < b goto L2
goto L3
…….
L3: ………………………………….(2)
While the number of instructions in (1) and (2) is the same, we sometimes skip the unconditional jump
in (2), but never in (1). Thus (2) is superior to (1) in execution time.
Algebraic Simplification:
There is no end to the amount of algebraic simplification that can be attempted through peephole
optimization. Only a few algebraic identities occur frequently enough that it is worth considering
implementing them. For example, statements such as
x : = x + 0 or
x : = x * 1
are often produced by straightforward intermediate code-generation algorithms, and they can be
eliminated easily through peephole optimization.
Reduction in Strength:
Reduction in strength replaces expensive operations by equivalent cheaper ones on the target
machine. Certain machine instructions are considerably cheaper than others and can often be used
as special cases of more expensive operators.
For example, x² is invariably cheaper to implement as x*x than as a call to an exponentiation routine.
Fixed-point multiplication or division by a power of two is cheaper to implement as a shift. Floating-
point division by a constant can be implemented as multiplication by a constant, which may be
cheaper.
x² → x * x
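For instance, a peephole pass might detect a multiplication by a power of two and emit a shift instead. A small C sketch of the test (the surrounding instruction rewriting is assumed):

/* x * c  ==>  x << k  when c == 2^k, c > 0 */
int power_of_two_shift(unsigned c) {
    if (c == 0 || (c & (c - 1)) != 0) return -1;  /* not a power of two */
    int k = 0;
    while ((c >>= 1) != 0) k++;
    return k;          /* caller replaces MUL #c by a left shift of k bits */
}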
GLOBAL DATA FLOW ANALYSIS
A compiler could take advantage of “reaching definitions” – such as knowing where a variable like
debug was last defined before reaching a given block – in order to perform transformations like the
one above. Reaching definitions are just one example of the data-flow information that an optimizing
compiler collects by a process known as data-flow analysis.
Data-flow information can be collected by setting up and solving systems of equations of the form:
out [S] = gen [S] U ( in [S] – kill [S] )
This equation can be read as “the information at the end of a statement is either generated within
the statement, or enters at the beginning and is not killed as control flows through the statement.”
The details of how data-flow equations are set and solved depend on three factors.
✓ The notions of generating and killing depend on the desired information, i.e., on the data flow
analysis problem to be solved. Moreover, for some problems, instead of proceeding along with
flow of control and defining out[s] in terms of in[s], we need to proceed backwards and define
in[s] in terms of out[s].
✓ Since data flows along control paths, data-flow analysis is affected by the constructs in a program.
In fact, when we write out[s] we implicitly assume that there is unique end point where control
leaves the statement; in general, equations are set up at the level of basic blocks rather than
statements, because blocks do have unique end points.
✓ There are subtleties that go along with such statements as procedure calls, assignments through
pointer variables, and even assignments to array variables.
Within a basic block, we talk of the point between two adjacent statements, as well as the point
before the first statement and after the last. Thus, block B1 has four points: one before any of the
assignments and one after each of the three assignments.
(Figure: a flow graph of basic blocks B1–B6. B1 contains d1: i : = m-1, d2: j : = n, d3: a : = u1; B2 contains d4: i : = i+1; B3 contains d5: j : = j-1; a later block contains d6: a : = u2.)
Now let us take a global view and consider all the points in all the blocks. A path from p1
to pn is a sequence of points p1, p2, …, pn such that for each i between 1 and n-1, either
✓ pi is the point immediately preceding a statement and pi+1 is the point immediately following
that statement in the same block, or
✓ pi is the end of some block and pi+1 is the beginning of a successor block.
Reaching definitions:
A definition of variable x is a statement that assigns, or may assign, a value to x. The most
common forms of definition are assignments to x and statements that read a value from an i/o
device and store it in x.
These statements certainly define a value for x, and they are referred to as unambiguous
definitions of x. There are certain kinds of statements that may define a value for x; they are called
ambiguous definitions. The most usual forms of ambiguous definitions of x are:
✓ A call of a procedure with x as a parameter or a procedure that can access x because x is in the
scope of the procedure.
✓ An assignment through a pointer that could refer to x. For example, the assignment *q: = y is a
definition of x if it is possible that q points to x. Unless we know otherwise, we must assume that
an assignment through a pointer is a definition of every variable.
We say a definition d reaches a point p if there is a path from the point immediately following d
to p such that d is not “killed” along that path; only an unambiguous definition of the same
variable kills d. Thus a point can be reached both by an unambiguous definition and by an
ambiguous definition of the same variable appearing later along the path.
Flow graphs for control flow constructs such as do-while statements have a useful
property: there is a single beginning point at which control enters and a single end point
that control leaves from when execution of the statement is over. We exploit this property
when we talk of the definitions reaching the beginning and the end of statements with the
following syntax.
S → id : = E | S ; S | if E then S else S | do S while E
E → id + id | id
Expressions in this language are similar to those in the intermediate code, but the flow
graphs for statements have restricted forms.
(Figure: the flow graphs for S1 ; S2, if E then S1 else S2, and do S1 while E – each construct has a single entry point and a single exit point.)
We define a portion of a flow graph called a region to be a set of nodes N that includes a
header, which dominates all other nodes in the region. All edges between nodes in N are
in the region, except for some that enter the header.
The portion of flow graph corresponding to a statement S is a region that obeys the
further restriction that control can flow to just one outside block when it leaves the
region.
i) S → d : a : = b + c
gen [S] = { d }
kill [S] = Da – { d }
out [S] = gen [S] U ( in [S] – kill [S] )
Observe the rules for a single assignment of variable a. Surely that assignment is a
definition of a, say d. Thus
gen [S] = { d }
On the other hand, d “kills” all other definitions of a, so we write
kill [S] = Da – { d }
where Da is the set of all definitions in the program for variable a.
ii) S → S1 ; S2
gen [S] = gen [S2] U ( gen [S1] – kill [S2] )
kill [S] = kill [S2] U ( kill [S1] – gen [S2] )
in [S1] = in [S]
in [S2] = out [S1]
out [S] = out [S2]
There is a subtle miscalculation in the rules for gen and kill. We have made the
assumption that the conditional expression E in the if and do statements is
“uninterpreted”; that is, there exist inputs to the program that make their branches go
either way.
We assume that any graph-theoretic path in the flow graph is also an execution path, i.e.,
a path that is executed when the program is run with at least one possible input.
When we compare the computed gen with the “true” gen we discover that the true gen is
always a subset of the computed gen. On the other hand, the true kill is always a superset
of the computed kill.
These containments hold even after we consider the other rules. It is natural to wonder
whether these differences between the true and computed gen and kill sets present a
serious obstacle to data-flow analysis. The answer lies in the use intended for these data.
Overestimating the set of definitions reaching a point does not seem serious; it merely
stops us from doing an optimization that we could legitimately do. On the other hand,
underestimating the set of definitions is a fatal error; it could lead us into making a
change in the program that changes what the program computes. For the case of reaching
definitions, then, we call a set of definitions safe or conservative if the estimate is a
superset of the true set of reaching definitions. We call the estimate unsafe, if it is not
necessarily a superset of the truth.
Returning now to the implications of safety on the estimation of gen and kill for reaching
definitions, note that our discrepancies, supersets for gen and subsets for kill are both in
the safe direction. Intuitively, increasing gen adds to the set of definitions that can reach a
point, and cannot prevent a definition from reaching a place that it truly reached.
Decreasing kill can only increase the set of definitions reaching any given point.
For reaching definitions, it turns out that in is an inherited attribute and out is a
synthesized attribute depending on in, so information is propagated forward along the flow
of control. However, there are other kinds of data-flow information, such as live variables,
for which the equations proceed backwards, defining in[S] in terms of out[S].
The set out[S] is defined similarly for the end of S. It is important to note the distinction
between out[S] and gen[S]: the latter is the set of definitions that reach the end of S
without following paths outside S.
Consider the cascade of two statements S1 ; S2, as in the second case. We start by
observing in[S1] = in[S]. Then, we recursively compute out[S1], which gives us in[S2],
since a definition reaches the beginning of S2 if and only if it reaches the end of S1. Now
we can compute out[S2], and this set is equal to out[S].
Considering the if-statement, we have conservatively assumed that control can follow either
branch, so a definition reaches the beginning of S1 or S2 exactly when it reaches the
beginning of S.
A definition reaches the end of S if and only if it reaches the end of one or both sub-
statements; i.e.,
out [S] = out [S1] U out [S2]
Representation of sets:
Sets of definitions, such as gen[S] and kill[S], can be represented compactly using bit
vectors. We assign a number to each definition of interest in the flow graph. Then the bit
vector representing a set of definitions will have 1 in position i if and only if the
definition numbered i is in the set.
The number of a definition statement can be taken as the index of that statement in an array
holding pointers to statements. However, not all definitions may be of interest during
global data-flow analysis. Therefore the number of definitions of interest will typically be
recorded in a separate table.
A bit vector representation for sets also allows set operations to be implemented
efficiently. The union and intersection of two sets can be implemented by logical or and
logical and, respectively, basic operations in most systems-oriented programming
languages. The difference A – B of sets A and B can be implemented by taking the
complement of B and then using logical and to compute A – B, as in the sketch below.
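A brief C sketch of these operations on one machine word (here 32 definitions per word):

typedef unsigned word;        /* one word of a bit vector: 32 definitions */

word set_union(word a, word b)        { return a | b; }   /* A ∪ B */
word set_intersection(word a, word b) { return a & b; }   /* A ∩ B */
word set_difference(word a, word b)   { return a & ~b; }  /* A – B */

/* the data-flow equation out[S] = gen[S] ∪ (in[S] – kill[S]) */
word out_of(word gen, word in, word kill) {
    return gen | (in & ~kill);
}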
Space for data-flow information can be traded for time, by saving information only at
certain points and, as needed, recomputing information at intervening points. Basic
blocks are usually treated as a unit during global flow analysis, with attention restricted to
only those points that are the beginnings of blocks.
Since there are usually many more points than blocks, restricting our effort to blocks is a
significant savings. When needed, the reaching definitions for all points in a block can be
calculated from the reaching definitions for the beginning of a block.
Use-definition chains:
It is often convenient to store reaching-definition information as “use-definition chains” or
ud-chains: for each use of a variable, a list of all the definitions that reach that use.
Evaluation order:
The techniques for conserving space during attribute evaluation, also apply to the
computation of data-flow information using specifications. Specifically, the only
constraint on the evaluation order for the gen, kill, in and out sets for statements is that
imposed by dependencies between these sets. Having chosen an evaluation order, we are
free to release the space for a set after all uses of it have occurred.
Earlier circular dependencies between attributes were not allowed, but we have seen that
data-flow equations may have circular dependencies.
When programs can contain goto statements or even the more disciplined break and
continue statements, the approach we have taken must be modified to take the actual
control paths into account.
Several approaches may be taken. The iterative method works for arbitrary flow graphs.
Since the flow graphs obtained in the presence of break and continue statements are
reducible, such constructs can be handled systematically using the interval-based
methods.
ISSUES IN THE DESIGN OF A CODE GENERATOR
1. Input to the code generator:
The input to the code generator is the intermediate representation of the source program
produced by the front end, together with information in the symbol table.
2. Target program:
The output of the code generator is the target program. It may take the form of:
a. Absolute machine language
- It can be placed in a fixed memory location and executed immediately.
b. Relocatable machine language
- It allows subprograms to be compiled separately.
c. Assembly language
- Code generation is made easier.
3. Memory management:
Names in the source program are mapped to addresses of data objects in run-time
memory by the front end and code generator.
It makes use of the symbol table; that is, a name in a three-address statement refers to a
symbol-table entry for the name.
Labels in three-address statements have to be converted to addresses of instructions.
For example,
j : goto i generates jump instruction as follows :
➢ if i < j, a backward jump instruction with target address equal to location of
code for quadruple i is generated.
➢ if i > j, the jump is forward. We must store on a list for quadruple i the
location of the first machine instruction generated for quadruple j. When i is
processed, the machine locations for all instructions that jump forward to i
are filled in.
4. Instruction selection:
The instructions of the target machine should be complete and uniform.
Instruction speeds and machine idioms are important factors when efficiency of target
program is considered.
The quality of the generated code is determined by its speed and size.
For example, the three-address statement a : = a + 1 can be translated into
MOV a, R0
ADD #1, R0
MOV R0, a
but if the target machine has an increment instruction INC, the single instruction INC a is shorter and faster.
5. Register allocation
Instructions involving register operands are shorter and faster than those involving
operands in memory.
The use of registers is subdivided into two subproblems :
➢ Register allocation – the set of variables that will reside in registers at a point in
the program is selected.
➢ Register assignment – the specific register that each such variable will reside in is picked.
6. Evaluation order
The order in which the computations are performed can affect the efficiency of the
target code. Some computation orders require fewer registers to hold intermediate
results than others.
TARGET MACHINE
Familiarity with the target machine and its instruction set is a prerequisite for designing a
good code generator.
The target computer is a byte-addressable machine with 4 bytes to a word.
It has n general-purpose registers, R0, R1, . . . , Rn-1.
It has two-address instructions of the form:
op source, destination
where, op is an op-code, and source and destination are data fields.
The source and destination of an instruction are specified by combining registers and
memory locations with address modes.
Address modes with their assembly-language forms:

MODE                 FORM     ADDRESS                      ADDED COST
absolute             M        M                            1
register             R        R                            0
indexed              c(R)     c + contents(R)              1
indirect register    *R       contents(R)                  0
indirect indexed     *c(R)    contents(c + contents(R))    1
literal              #c       the constant c               1
Instruction costs :
Instruction cost = 1+cost for source and destination address modes. This cost corresponds
to the length of the instruction.
Address modes involving registers have cost zero.
Address modes involving memory location or literal have cost one.
Instruction length should be minimized if space is important. Doing so also minimizes the
time taken to fetch and perform the instruction.
For example : MOV R0, R1 copies the contents of register R0 into R1. It has cost one,
since it occupies only one word of memory.
The three-address statement a : = b + c can be implemented by many different instruction
sequences :
i) MOV b, R0
ADD c, R0 cost = 6
MOV R0, a
ii) MOV b, a
ADD c, a cost = 6
In order to generate good code for target machine, we must utilize its addressing
capabilities efficiently.
iii) Assuming registers Ri and Rj contain the values of b and c respectively, the statement
can be implemented by a single instruction:
ADD Rj, Ri cost = 1
A SIMPLE CODE GENERATOR
A register descriptor is used to keep track of what is currently in each register. The
register descriptors show that initially all the registers are empty.
An address descriptor stores the location where the current value of the name can be
found at run time.
A code-generation algorithm:
The algorithm takes as input a sequence of three-address statements constituting a basic block.
For each three-address statement of the form x : = y op z, perform the following actions:
1. Determine, from the address descriptors, the current locations of the operands y and z.
2. Invoke a function getreg to determine the location L where the result of the computation y op
z should be stored.
3. Consult the address descriptor for y to determine y‟, the current location of y. Prefer the
register for y‟ if the value of y is currently both in memory and a register. If the value of y is
not already in L, generate the instruction MOV y’ , L to place a copy of y in L.
4. Generate the instruction op z’ , L, where z’ is a current location of z (prefer a register if the
value of z is in both memory and a register). Update the address descriptor of x to indicate that
x is in location L; if L is a register, update its descriptor to indicate that it contains the value of x.
5. If the current values of y or z have no next uses, are not live on exit from the block, and are in
registers, alter the register descriptor to indicate that, after execution of x : = y op z , those
registers will no longer contain y or z.
The assignment d : = (a-b) + (a-c) + (a-c) might be translated into the following three-
address code sequence:
t:=a–b
u:=a–c
v:=t+u
d:=v+u
with d live at the end.
Code sequence for the example is:
Statement          Code generated
t : = a – b        MOV a, R0
                   SUB b, R0
u : = a – c        MOV a, R1
                   SUB c, R1
v : = t + u        ADD R1, R0
d : = v + u        ADD R1, R0
                   MOV R0, d
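The following self-contained C sketch mimics these actions for the example just shown; the two-register machine, the trivial getreg and the absence of liveness information are simplifying assumptions:

#include <stdio.h>
#include <string.h>

static char regdesc[2][8];     /* register descriptor: what each register holds */

static int getreg(void) {      /* trivial getreg: first empty register */
    for (int r = 0; r < 2; r++)
        if (regdesc[r][0] == '\0') return r;
    return 0;                  /* spilling is omitted in this sketch */
}

static void loc(char *out, const char *name) {   /* prefer a register location */
    strcpy(out, name);
    for (int r = 0; r < 2; r++)
        if (strcmp(regdesc[r], name) == 0) sprintf(out, "R%d", r);
}

static void gen(const char *x, const char *op, const char *y, const char *z) {
    char yloc[8], zloc[8];
    loc(yloc, y);
    loc(zloc, z);
    int L;
    if (yloc[0] == 'R') L = yloc[1] - '0';       /* y is already in a register */
    else { L = getreg(); printf("MOV %s, R%d\n", y, L); }
    printf("%s %s, R%d\n", op, zloc, L);         /* compute y op z into L */
    strcpy(regdesc[L], x);                       /* R_L now holds x */
}

int main(void) {
    gen("t", "SUB", "a", "b");   /* t := a - b */
    gen("u", "SUB", "a", "c");   /* u := a - c */
    gen("v", "ADD", "t", "u");   /* v := t + u */
    gen("d", "ADD", "v", "u");   /* d := v + u */
    puts("MOV R0, d");           /* d is live on exit and ended up in R0 */
    return 0;
}

Running the sketch prints exactly the seven-instruction sequence shown in the table above.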
The table shows the code sequences generated for the indexed assignment statements
a : = b [ i ] and a [ i ] : = b, assuming i is in register Ri:

Statement        Code              Cost
a : = b[i]       MOV b(Ri), R      2
a[i] : = b       MOV b, a(Ri)      3

The table shows the code sequences generated for the pointer assignments
a : = *p and *p : = a, assuming p is in register Rp:

Statement        Code              Cost
a : = *p         MOV *Rp, a        2
*p : = a         MOV a, *Rp        2
The table shows the code generated for a conditional-jump sequence:

Statement             Code
x : = y + z           MOV y, R0
if x < 0 goto z       ADD z, R0
                      MOV R0, x
                      CJ< z   (jump to z if the condition code is negative)