ECS-603 UT 13-14 Solutions
Note: All questions are compulsory. All questions carry equal marks.
Q1. Attempt any FOUR parts. (5X4=20)
a. Explain the necessary phases and passes of a compiler design.
Ans. A compiler is a computer program (or set of programs) that transforms source code written in
a programming language (the source language) into another computer language (the target language, often
having a binary form known as object code). The most common reason for wanting to transform source
code is to create an executable program.
A phase is a logical organization of the compiler which takes one representation of the program as input and
produces another representation as output. A pass is a grouping of one or more phases that processes the
program once. The compilation process has two main passes: 1. Analysis or front end (lexical analyzer,
syntax analyzer, semantic analyzer and intermediate code generator) and 2. Synthesis or back end (code
generator). Code optimization is optional.
Compilers bridge source programs in high-level languages with the underlying hardware. A compiler
must 1) determine the syntactic correctness of programs, 2) generate correct and efficient object
code, 3) handle the run-time organization, and 4) format output according to assembler and/or linker
conventions. A compiler consists of three main parts: the front end, the middle end, and the back end.
The front end checks whether the program is correctly written in terms of the programming language
syntax and semantics. Here legal and illegal programs are recognized. Errors are reported, if any, in a
useful way. Type checking is also performed by collecting type information. The front-end then generates
an intermediate representation or IR of the source code for processing by the middle-end.
The middle end is where optimization takes place. Typical transformations for optimization are removal of
useless or unreachable code, discovery and propagation of constant values, relocation of computation to a
less frequently executed place (e.g., out of a loop), or specialization of computation based on the context.
The middle-end generates another IR for the following backend. Most optimization efforts are focused on
this part.
The back end is responsible for translating the IR from the middle end into assembly code. The target
instruction(s) are chosen for each IR instruction, and registers are allocated to variables. The back end
also has to utilize the hardware well, figuring out, for example, how to keep parallel functional units busy
and how to fill delay slots. Although most optimization problems are NP-hard, heuristic techniques for
them are well developed.
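As a concrete illustration of the front end's first phase, the following is a minimal C sketch of a hand-written lexical analyzer (the input string and the name next_token are invented for illustration, not part of any particular compiler) that groups characters into id, num and operator tokens:

#include <ctype.h>
#include <stdio.h>

/* Minimal lexical-analyzer sketch: groups the characters of the
   source string into id, num and operator tokens. */
static const char *src = "pos = init + rate * 60";

static void next_token(void)
{
    while (*src == ' ')
        src++;                                 /* skip white space */
    if (*src == '\0')
        return;
    if (isalpha((unsigned char)*src)) {        /* pattern: letter(letter|digit)* */
        printf("id:  ");
        while (isalnum((unsigned char)*src))
            putchar(*src++);
    } else if (isdigit((unsigned char)*src)) { /* pattern: digit+ */
        printf("num: ");
        while (isdigit((unsigned char)*src))
            putchar(*src++);
    } else {
        printf("op:  %c", *src++);             /* single-character operator */
    }
    putchar('\n');
}

int main(void)
{
    while (*src != '\0')
        next_token();
    return 0;
}

Run on the string above, this prints the token stream id pos, op =, id init, op +, id rate, op *, num 60, which is exactly the representation the syntax analyzer would consume next.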
b. What is a cross compiler? How is boot-strapping of a compiler done to a second machine?
Ans. A cross compiler is a compiler that runs on one machine and produces object code for another
machine. In bootstrapping, a given compiler is used to produce further compilers, possibly for other
machines or languages.
To bootstrap a compiler for a language L onto a second machine B, starting from a compiler for L that
runs on machine A and produces A-code, we proceed in three steps:
1. Write a compiler from L to B-code in L itself.
2. Compile this source with the existing compiler on A. The result is a cross compiler: it runs on
machine A but produces object code for B.
3. Run the same L-to-B compiler source through this cross compiler. The result is a compiler for L
that both runs on machine B and produces object code for B, which completes the bootstrap.
c. Write short note on:
(i) Context free grammar
Ans. A context-free grammar (CFG) is a formal grammar in which every production rule is of the
form
A → α
where A is a single nonterminal and α is a (possibly empty) string of terminals and/or nonterminals,
i.e. α ∈ (V ∪ T)*. For example, S → aSb | ϵ is a CFG generating the language {a^n b^n : n ≥ 0}.
(ii) Yacc parser generator
Ans. Yacc (Yet Another Compiler-Compiler) is a parser generator: it takes a grammar specification,
with a semantic action in C attached to each production, and produces a C routine (yyparse) that
implements an LALR(1) parser for that grammar. The generated parser obtains its tokens by calling
a lexical analyzer (yylex), typically produced by Lex, and executes the attached action each time a
production is reduced. Yacc reports any LALR(1) conflicts in the grammar and resolves them by
default rules (shift preferred over reduce, and the earlier rule preferred in a reduce-reduce conflict).
d. Remove left recursion from the grammar
E → E(T) | T
T → T(F) | F
F → id
Sol. Left recursion:
If a grammar has a production of the form A → Aα | β, then after removing left recursion it is
converted into
A → βA′
A′ → αA′ | ϵ
The above grammar has two productions of this form, so after removing left recursion the
resulting grammar is
E → TE′
E′ → (T)E′ | ϵ
T → FT′
T′ → (F)T′ | ϵ
F → id
e. What do you mean by ambiguous grammar? Show that the following grammar is
ambiguous.
S → aSbS | bSaS | ϵ
Ans. An ambiguous grammar is a context-free grammar for which there exists a string that has
more than one leftmost derivation. For the above grammar, the string abab has the two leftmost
derivations
S ⇒ aSbS ⇒ abSaSbS ⇒ abaSbS ⇒ ababS ⇒ abab
S ⇒ aSbS ⇒ abS ⇒ abaSbS ⇒ ababS ⇒ abab
Since abab is derived by two different leftmost derivations, the above grammar is ambiguous.
f. Define boot-strapping with the help of an example.
Ans. In computer science, bootstrapping is the process of writing a compiler (or assembler) in the
source programming language which it is intended to compile. Applying this technique leads to a self-
hosting compiler. Bootstrapping a compiler has the following advantages:
it is a non-trivial test of the language being compiled;
compiler developers only need to know the language being compiled;
improvements to the compiler's back-end improve not only general purpose programs but also the
compiler itself; and
it is a comprehensive consistency check as it should be able to reproduce its own object code.
g. Explain the term token, lexeme and pattern.
Ans. The words generated by the linear analysis may be of different kinds:
a. identifier,
b. keyword (if, while, ...),
c. punctuation character,
d. multi-character operator (:=, ->, ...).
Such a kind is called a TOKEN and an element of a kind is called a LEXEME.
A word is recognized as a lexeme of a certain token by PATTERN matching. For instance,
"a letter followed by letters and digits" is a pattern that matches a word like x or y with the
token id (= identifier); the matched word itself is the lexeme.
The above table has no multiply-defined entries, so the grammar is LL(1).
c. Show that the following grammar
S → Aa | bAc | Bc | bBa
A → d
B → d
is LR(1) but not LALR(1).
Sol. Adding S′ → S makes the above grammar augmented. The complete resulting grammar is
S′ → S
S → Aa | bAc | Bc | bBa
A → d
B → d
LR(1) item set collection:
I0:
S′ → .S, $
S → .Aa, $
S → .bAc, $
S → .Bc, $
S → .bBa, $
A → .d, a
B → .d, c
I1: GOTO(I0, S)
S′ → S., $
I2: GOTO(I0, A)
S → A.a, $
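The construction continues in the same way with the remaining item sets:
I3: GOTO(I0, b)
S → b.Ac, $
S → b.Ba, $
A → .d, c
B → .d, a
I4: GOTO(I0, B)
S → B.c, $
I5: GOTO(I0, d)
A → d., a
B → d., c
I6: GOTO(I2, a)
S → Aa., $
I7: GOTO(I3, A)
S → bA.c, $
I8: GOTO(I3, B)
S → bB.a, $
I9: GOTO(I3, d)
A → d., c
B → d., a
I10: GOTO(I4, c)
S → Bc., $
I11: GOTO(I7, c)
S → bAc., $
I12: GOTO(I8, a)
S → bBa., $
No state has a shift-reduce or reduce-reduce conflict, so the grammar is LR(1). However, I5 and I9
have the same core {A → d., B → d.}, so the LALR(1) construction merges them into one state
containing A → d., {a, c} and B → d., {a, c}. The merged state has a reduce-reduce conflict on both
a and c (it cannot decide whether to reduce d to A or to B), so the grammar is not LALR(1).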
Q3 Attempt any TWO parts. (10X2=20)
a. Define Syntax Directed Translations. Construct an annotated parse tree for the expression
(4*7+2)*2, using simple desk calculator grammar.
Sol. Grammar symbols are associated with attributes to associate information with the programming
language constructs that they represent.
• Values of these attributes are evaluated by the semantic rules associated with the production rules.
• Evaluation of these semantic rules:
– may generate intermediate codes
– may put information into the symbol table
– may perform type checking
– may issue error messages
– may perform some other activities
– in fact, they may perform almost any activities.
• An attribute may hold almost any thing.
– a string, a number, a memory location, a complex record.
• When we associate semantic rules with productions, we use two notations:
– Syntax-Directed Definitions
– Translation Schemes
• Syntax-Directed Definitions:
– give high-level specifications for translations
– hide many implementation details such as order of evaluation of semantic actions.
– We associate a production rule with a set of semantic actions, and we do not say when they
will be evaluated.
• Translation Schemes:
– indicate the order of evaluation of the semantic actions associated with a production rule.
– In other words, translation schemes give some information about implementation
details.
Syntax-Directed Definition – Example
Production          Semantic Rules
L → E $             print(E.val)
E → E1 * E2         E.val = E1.val * E2.val
E → E1 + E2         E.val = E1.val + E2.val
E → ( E1 )          E.val = E1.val
E → I               E.val = I.val
I → I1 digit        I.val = 10 * I1.val + digit.lexval
I → digit           I.val = digit.lexval
• The symbols E and I are associated with a synthesized attribute val.
• The token digit has a synthesized attribute lexval (it is assumed to be supplied by the lexical
analyzer).
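The annotated parse tree for (4*7+2)*2 under this definition, with each node decorated with its val
attribute computed bottom-up (4*7 = 28, 28+2 = 30, 30*2 = 60), is:

L (prints 60)
+-- E.val = 60
|   +-- E.val = 30
|   |   +-- (
|   |   +-- E.val = 30
|   |   |   +-- E.val = 28
|   |   |   |   +-- E.val = 4   (I.val = 4, digit.lexval = 4)
|   |   |   |   +-- *
|   |   |   |   +-- E.val = 7   (I.val = 7, digit.lexval = 7)
|   |   |   +-- +
|   |   |   +-- E.val = 2   (I.val = 2, digit.lexval = 2)
|   |   +-- )
|   +-- *
|   +-- E.val = 2   (I.val = 2, digit.lexval = 2)
+-- $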
b. What are different ways to write three address code? Write the three address code for the
following code segment:
while A<C and B<D do
if A=1 then C=C+1
else while A<=D do A=A+2
Ans. Three-address code: Three-address code (TAC) is an intermediate representation that is essentially
a generic assembly language, falling at the lower end of the mid-level IRs. Some variant of 2-, 3- or
4-address code is fairly commonly used as an IR, since it maps well to most assembly languages.
E.g., a TAC instruction has the form:
x := y op z
where x, y and z are names, constants or compiler-generated temporaries, and op is any operator.
The implementations are of three types:
1) Quadruples
2) Triples
3) Indirect triples.
Consider the following expression:
-(a+b) + (c+d) – (a+b+c)
The TAC is:
T1:= a+b
T2:=-T1
T3:=c+d
T4:= T2+T3
T5:=T1+c
T6:=T4-T5
Quadruples: A quadruple has four fields: OP, ARG1, ARG2 and RESULT. The above TAC is
represented as quadruples in the following table.
    OP      ARG1  ARG2  RESULT
(0) +       a     b     T1
(1) uminus  T1          T2
(2) +       c     d     T3
(3) +       T2    T3    T4
(4) +       T1    c     T5
(5) -       T4    T5    T6
Triples: A triple has three fields, OP, ARG1 and ARG2; the result of an operation is referred to by
its position number instead of a temporary name. The above TAC is represented as the following
triples:
    OP      ARG1  ARG2
(0) +       a     b
(1) uminus  (0)
(2) +       c     d
(3) +       (1)   (2)
(4) +       (0)   c
(5) -       (3)   (4)
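For the code segment given in the question, one possible three-address code (statement numbering
chosen for illustration) is:

1:  if A < C goto 3
2:  goto 13
3:  if B < D goto 5
4:  goto 13
5:  if A = 1 goto 7
6:  goto 9
7:  C := C + 1
8:  goto 1
9:  if A <= D goto 11
10: goto 1
11: A := A + 2
12: goto 9
13: (first statement after the while)

Statements 1-4 implement the condition A<C and B<D with short-circuit jumps, 5-8 the then-branch,
and 9-12 the inner while loop, which returns control to the outer loop test at statement 1 when it
terminates.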
Linear lists:
- Search
Search linearly from beginning to end. Stop if found.
- Adding
Search (does it exist?). Add at the beginning if not found.
- Efficiency
To insert n names and search for m names the cost will be cn(n+m) comparisons. Inefficient.
- Positive
• Easy to implement
• Uses little space
• Easy to represent scoping.
- Negative
• Slow for large n and m.
Trees:
You can have the symbol table in the form of trees as:
• Each subprogram has a symbol table associated to its node in the abstract syntax tree.
• The main program has a similar table for globally declared objects.
• Quicker than linear lists.
• Easy to represent scoping.
Hash tables (with chaining)
-Search
Hash the name with a hash function
h(symbol) ∈ [0, k-1]
where k = table size.
If the entry is occupied, follow the link field.
-Insertion
Search + simple insertion at the end of the symbol table (use the sympos pointer).
-Efficiency
Search proportional to n/k and the number of comparisons is (m + n) n / k for n insertions and
m searches. k can be chosen arbitrarily large.
-Positive
• Very quick search
-Negative
• Relatively complicated
• Extra space required, k words for the hash table.
• More difficult to introduce scoping.
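A minimal C sketch of such a hash-table symbol table with chaining (the names Symbol, st_insert,
st_lookup and the table size K are invented for illustration, not taken from any particular compiler):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define K 211                          /* table size */

typedef struct Symbol {
    char *name;
    struct Symbol *next;               /* link field for chaining */
} Symbol;

static Symbol *bucket[K];

static unsigned hash(const char *s)    /* h(symbol) in [0, K-1] */
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % K;
}

Symbol *st_lookup(const char *name)    /* search: hash, then follow the chain */
{
    for (Symbol *p = bucket[hash(name)]; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

Symbol *st_insert(const char *name)    /* insertion = search + add to chain */
{
    Symbol *p = st_lookup(name);
    if (p != NULL)
        return p;                      /* already present */
    p = malloc(sizeof *p);
    p->name = malloc(strlen(name) + 1);
    strcpy(p->name, name);
    unsigned h = hash(name);
    p->next = bucket[h];               /* add at the head of the chain */
    bucket[h] = p;
    return p;
}

int main(void)
{
    st_insert("count");
    st_insert("rate");
    printf("%s\n", st_lookup("rate") ? "rate found" : "rate missing");
    return 0;
}

Insertion at the head of the chain keeps insertion itself O(1); only the search along one chain is
proportional to n/k, which is why k can be chosen large to make lookups fast.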
b. What are lexical phase errors, syntactic phase errors and semantic phase errors? Explain
with suitable example.
Ans. Lexical phase errors: There are not many errors that can be caught at the lexical level;
those to look for are:
Characters that cannot appear in any token in the source language, such as @ or #.
Integer constants out of bounds (range is 0 to 32767).
Identifier names that are too long (maximum length is 32 characters).
Text strings that are too long (maximum length is 256 characters).
Text strings that span more than one line.
Certain other errors, such as malformed identifiers, could be caught here or by the parser (the
"interpretation" of the error is affected by the stage at which it is caught); another example is an
unmatched right comment delimiter (*/).
Syntax errors:
• A syntax error occurs when the stream of tokens forms an invalid string.
• In LL(k) or LR(k) parsing tables, blank entries correspond to syntax errors.
On detecting a syntax error the parser can either 1. report the error and stop, or 2. report the error,
recover from it, and search for more errors ⇒ better.
Error recovery: the process of adjusting the input stream so that parsing can continue after a syntax
error has been reported. Techniques include:
• Deletion of token types from the input stream
• Insertion of token types
• Substitution of token types
Two classes of recovery:
1. Local recovery: adjust the input at the point where the error was detected.
2. Global recovery: adjust the input before the point where the error was detected.
These may be applied to both LL and LR parsing techniques.
Semantic phase errors: errors detected during semantic analysis, when the program is syntactically
well formed but meaningless. Examples are use of an undeclared identifier, a type mismatch such as
adding an integer to an array, and a wrong number or wrong types of arguments in a procedure call.
c. Why is run-time storage management required? How is a simple stack allocation
implemented?
Ans. Run-Time Environments:
• How do we allocate the space for the generated target code and the data object of our source
programs?
• The places of the data objects that can be determined at compile time will be allocated statically.
• But the places for the some of data objects will be allocated at run-time.
• The allocation and de-allocation of the data objects are managed by the run-time support package.
– the run-time support package is loaded together with the generated target code.
– the structure of the run-time support package depends on the semantics of the programming
language (especially the semantics of procedures in that language).
• Each execution of a procedure is called an activation of that procedure.
Stack Allocation: A function's prolog is responsible for allocating stack space for local variables,
saved registers, stack parameters, and register parameters.
• The parameter area is always at the bottom of the stack (even if alloca is used), so that it
will always be adjacent to the return address during any function call. It contains at least four
entries, but always enough space to hold all the parameters needed by any function that may be
called. Note that space is always allocated for the register parameters, even if the parameters
themselves are never homed to the stack; a callee is guaranteed that space has been allocated for all
its parameters. Home addresses are required for the register arguments so a contiguous area is
available in case the called function needs to take the address of the argument list (va_list) or an
individual argument. This area also provides a convenient place to save register arguments during
thunk execution and as a debugging option (for example, it makes the arguments easy to find during
debugging if they are stored at their home addresses in the prolog code). Even if the called function
has fewer than 4 parameters, these 4 stack locations are effectively owned by the called function,
and may be used by the called function for other purposes besides saving parameter register values.
Thus the caller may not save information in this region of stack across a function call.
• If space is dynamically allocated (alloca) in a function, then a nonvolatile register must be used as a
frame pointer to mark the base of the fixed part of the stack and that register must be saved and
initialized in the prolog. Note that when alloca is used, calls to the same callee from the same caller
may have different home addresses for their register parameters.
• The stack will always be maintained 16-byte aligned, except within the prolog (for example, after
the return address is pushed), and except where indicated in Function Types for a certain class of
frame functions.
• The following is an example of the stack layout where function A calls a non-leaf function B.
Function A's prolog has already allocated space for all the register and stack parameters required by
B at the bottom of the stack. The call pushes the return address and B's prolog allocates space for its
local variables, nonvolatile registers, and the space needed for it to call functions. If B uses alloca,
the space is allocated between the local variable/nonvolatile register save area and the parameter
stack area.
• When the function B calls another function, the return address is pushed just below the home
address for RCX.
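A small runnable C illustration (invented for this note, not part of any calling-convention document)
of the fact that each activation of a procedure gets its own space on the run-time stack: the local
variable of every recursive call lives at a different address.

#include <stdio.h>

/* Each recursive call pushes a new activation record, so the
   local variable of every activation has a distinct address. */
static void activation(int depth)
{
    int local = depth;        /* stored in this activation's frame */
    printf("depth %d: &local = %p\n", depth, (void *)&local);
    if (depth < 3)
        activation(depth + 1);
}

int main(void)
{
    activation(0);
    return 0;
}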
b. What are the different issues in code optimization? Explain with proper examples.
Ans. Code optimization is performed using different techniques:
(i) Loop optimizations: These act on the statements which make up a loop, such as a for loop (e.g.,
loop-invariant code motion). Loop optimizations can have a significant impact because many
programs spend a large percentage of their time inside loops.
Loop optimization
Some optimization techniques primarily designed to operate on loops include:
Induction variable analysis
Roughly, if a variable in a loop is a simple function of the index variable, such as j:= 4*i+1, it can
be updated appropriately each time the loop variable is changed. This is a strength reduction, and
also may allow the index variable's definitions to become dead code. This information is also useful
for bounds-checking elimination and dependence analysis, among other things.
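A C sketch of this transformation on the j := 4*i + 1 case mentioned above (the function names are
invented for illustration):

/* Before: j is recomputed from the index variable on every pass. */
void before(int a[], int n)
{
    for (int i = 0; i < n; i++) {
        int j = 4 * i + 1;    /* simple function of the index i */
        a[j] = 0;
    }
}

/* After strength reduction: the multiplication is replaced by an
   addition; if i has no other uses, its definition may now become
   dead code. */
void after(int a[], int n)
{
    int j = 1;
    for (int i = 0; i < n; i++, j += 4)
        a[j] = 0;
}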
Loop fusion or loop combining
Another technique which attempts to reduce loop overhead. When two adjacent loops would iterate
the same number of times (whether or not that number is known at compile time), their bodies can
be combined as long as they make no reference to each other's data.
Loop unrolling
Unrolling duplicates the body of the loop multiple times, in order to decrease the number of times
the loop condition is tested and the number of jumps, which hurt performance by impairing the
instruction pipeline. A "fewer jumps" optimization. Completely unrolling a loop eliminates all
overhead, but requires that the number of iterations be known at compile time.
(ii)
c. Write short notes (any two):
(i) Global Data Flow analysis
Ans. Data-flow analysis is a technique for gathering information about the possible set of values
calculated at various points in a computer program. A program's control flow graph (CFG) is used to
determine those parts of a program to which a particular value assigned to a variable might propagate. The
information gathered is often used by compilers when optimizing a program. A canonical example of a
data-flow analysis is reaching definitions.
It is the process of collecting information about the way the variables are used and defined in the program.
Data-flow analysis attempts to obtain particular information at each point in a procedure. Usually, it is
enough to obtain this information at the boundaries of basic blocks, since from that it is easy to compute
the information at points in the basic block. In forward flow analysis, the exit state of a block is a function
of the block's entry state. This function is the composition of the effects of the statements in the block. The
entry state of a block is a function of the exit states of its predecessors. This yields a set of data-flow
equations:
For each block b:
out_b = trans_b(in_b)
in_b = join({out_p : p ∈ pred_b})
In this, trans_b is the transfer function of the block b. It works on the entry state in_b, yielding the exit
state out_b. The join operation combines the exit states of the predecessors p ∈ pred_b of b, yielding
the entry state in_b of b.
After solving this set of equations, the entry and/or exit states of the blocks can be used to derive properties
of the program at the block boundaries. The transfer function of each statement separately can be applied to
get information at a point inside a basic block.
Each particular type of data-flow analysis has its own specific transfer function and join operation. Some
data-flow problems require backward flow analysis. This follows the same plan, except that the transfer
function is applied to the exit state yielding the entry state, and the join operation works on the entry states
of the successors to yield the exit state.
The entry point (in forward flow) plays an important role: Since it has no predecessors, its entry state is
well defined at the start of the analysis. For instance, the set of local variables with known values is empty.
If the control flow graph does not contain cycles (there were no explicit or implicit loops in the procedure)
solving the equations is straightforward. The control flow graph can then be topologically sorted; running
in the order of this sort, the entry states can be computed at the start of each block, since all predecessors of
that block have already been processed, so their exit states are available. If the control flow graph does
contain cycles, a more advanced algorithm is required.
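A minimal runnable C sketch of the iterative forward analysis described above, on an invented
three-block flow graph, with bit sets as states and out_b = gen_b | (in_b & ~kill_b) as the
transfer function (i.e. reaching definitions):

#include <stdio.h>

#define NBLOCKS 3

int main(void)
{
    /* Flow graph: block 0 -> 1, 1 -> 2, 2 -> 1 (a loop).
       Each bit stands for one definition. */
    unsigned gen[NBLOCKS]  = { 0x1, 0x2, 0x4 };
    unsigned kill[NBLOCKS] = { 0x2, 0x1, 0x0 };
    unsigned in[NBLOCKS]   = { 0 }, out[NBLOCKS] = { 0 };

    int changed = 1;
    while (changed) {                 /* iterate until a fixed point */
        changed = 0;
        for (int b = 0; b < NBLOCKS; b++) {
            unsigned newin = 0;       /* join over predecessors' exit states */
            if (b == 1) newin = out[0] | out[2];
            if (b == 2) newin = out[1];
            unsigned newout = gen[b] | (newin & ~kill[b]);
            if (newin != in[b] || newout != out[b])
                changed = 1;
            in[b]  = newin;
            out[b] = newout;
        }
    }
    for (int b = 0; b < NBLOCKS; b++)
        printf("block %d: in = %x, out = %x\n", b, in[b], out[b]);
    return 0;
}

Because the flow graph contains a cycle, the loop runs until no in/out set changes, which is exactly
the "more advanced algorithm" needed when a topological order does not exist.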
(ii) Loop unrolling
Ans. Static/manual loop unrolling
Manual (or static) loop unrolling involves the programmer analyzing the loop and rewriting the iterations
as a sequence of instructions that reduces the loop overhead. This is in contrast to dynamic
unrolling, which is accomplished by the compiler.
A simple manual example in C:
A procedure in a computer program is to delete 100 items from a collection. This is normally accomplished
by means of a for loop which calls the function delete(item_number). If this part of the program is to be
optimized, and the overhead of the loop requires significant resources compared to those of delete(x)
itself, loop unwinding can be used to speed it up.
Normal loop:
int x;
for (x = 0; x < 100; x++)
{
    delete(x);
}

After loop unrolling:
int x;
for (x = 0; x < 100; x += 5)
{
    delete(x);
    delete(x+1);
    delete(x+2);
    delete(x+3);
    delete(x+4);
}
As a result of this modification, the new program has to make only 20 iterations instead of 100.
Afterwards, only 20% of the jumps and conditional branches need to be taken, which represents, over many
iterations, a potentially significant decrease in the loop administration overhead. To produce the optimal
benefit, no variables should be specified in the unrolled code that require pointer arithmetic. This usually
requires "base plus offset" addressing, rather than indexed referencing.
(iii) Loop jamming
Ans. Loop fusion, also called loop jamming, is a compiler optimization, a loop transformation, which
replaces multiple loops with a single one.
Example in C
int i, a[100], b[100];
for (i = 0; i < 100; i++)
a[i] = 1;
for (i = 0; i < 100; i++)
b[i] = 2;
is equivalent to:
int i, a[100], b[100];
for (i = 0; i < 100; i++)
{
    a[i] = 1;
    b[i] = 2;
}