Compiler Note Book
TYPE CHECKING
The language design principle ensures that every expression has a type that is
known (at run time), and a type system has a set of rules for associating a type with an
expression. A type system allows one to determine whether the operators in an
expression are used appropriately.
Type Checking
Operations in a language may require operands belonging to a specific data type. For
example, indexes of array references should always be integers, conditional
expressions in control-flow statements should be Boolean, pointer values cannot be real
values, and so on.
Therefore, compilers can perform type checking based on the operations involved in the
source code. Some languages do not even permit mixed-mode expressions. In such
languages, expressions with operands of different data types are treated as semantic
errors. In other languages, if an expression has operands of different but
compatible data types, the compiler converts one of the operands to the data type of the
other operand. This type of conversion is known as coercion. For example,
in C, when an addition involves an integer and a double value, the
integer value is converted to double. However, if the values are of incompatible types,
such as an array and an integer, then conversion is not possible. Therefore, the
compiler has to identify whether coercion is possible. If not, the result is a semantic error.
Type Checking is the process of verifying the type correctness of the input
program by using logical rules to check the behaviour of a program either at
compile time or at run time. It allows the programmers to limit the types that can
be used for semantic aspects of compilation.
Errors can be checked dynamically (at run time) if the target program carries
both the type of an element and its value, but a sound type system eliminates the need
for dynamic checking of type errors by ensuring that these errors cannot arise when
the target program runs.
if expression “f” has a type s → t and expression “x” has a type s,
then expression f(x) will be of type t
Here, s → t represents a function from s to t. This rule can be applied to all functions
with one or more arguments. The rule treats the expression E1 + E2 as a function
application, add(E1, E2), and uses the types of E1 and E2 to build the type of E1 + E2.
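The application rule can be sketched in a few lines of Python. This is a minimal sketch, assuming basic types are plain strings and a function type s → t is a small record; the names Fn, check_apply and ADD are illustrative and do not come from these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fn:
    """Assumed representation of a function type s -> t."""
    param: object   # the type s (a tuple of types for several arguments)
    result: object  # the type t


def check_apply(f_type, arg_type):
    """If f has type s -> t and x has type s, then f(x) has type t."""
    if not isinstance(f_type, Fn):
        raise TypeError("applied expression is not a function")
    if f_type.param != arg_type:
        raise TypeError(f"expected {f_type.param}, got {arg_type}")
    return f_type.result


# E1 + E2 viewed as add(E1, E2); only integer addition is modelled here.
ADD = Fn(("int", "int"), "int")
print(check_apply(ADD, ("int", "int")))
```

A mismatched argument type, such as ("int", "double"), raises TypeError in this sketch, because no coercion is modelled.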
Type Inference:
Type inference is the analysis of a program to determine the types of some or all of the
expressions from the way they are used. For example,
public int add(int E1, int E2) {
    return E1 + E2;
}
Here, E1 and E2 are declared as integers, so type inference needs only the declarations of
E1 and E2. The expression E1 + E2 applies the “+” operation to two integers, so its result
is taken to be an integer. Therefore, the return type of add must be an integer. A typical
rule used to perform type inference has the following form:
if f(x) is an expression,
then for some type variables α and β, f is of type α → β and x is of type α
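One common way to apply such a rule is unification of type expressions. The following is a minimal sketch under stated assumptions: type variables are strings starting with an apostrophe, a function type α → β is the tuple ("->", α, β), and no occurs check is performed. The function names are illustrative.

```python
def resolve(t, subst):
    """Follow the bindings of a type variable until an unbound type is found."""
    while isinstance(t, str) and t.startswith("'") and t in subst:
        t = subst[t]
    return t


def unify(a, b, subst):
    """Make types a and b equal by binding type variables in subst."""
    a, b = resolve(a, subst), resolve(b, subst)
    if a == b:
        return
    if isinstance(a, str) and a.startswith("'"):
        subst[a] = b
    elif isinstance(b, str) and b.startswith("'"):
        subst[b] = a
    elif isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            unify(x, y, subst)
    else:
        raise TypeError(f"cannot unify {a} with {b}")


def infer_apply(f_type, x_type, subst, fresh):
    """For f(x): f has type alpha -> beta, x has type alpha, result is beta."""
    alpha, beta = f"'a{fresh}", f"'b{fresh}"
    unify(f_type, ("->", alpha, beta), subst)
    unify(x_type, alpha, subst)
    return resolve(beta, subst)


subst = {}
print(infer_apply(("->", "int", "int"), "int", subst, 0))
```

When the type of f is itself an unknown variable, the same call binds it to a function type whose argument type matches the type of x.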
To decide whether two type expressions are equivalent, we need a precise definition of
equivalence. When names are given to type expressions, and these names are then
used in subsequent type expressions, potential ambiguities may arise.
There are two schemes to check type equivalence of expressions:
Structural Equivalence:
When type expressions are represented by graphs, two types are structurally equivalent
if and only if one of the following conditions is true:
• They are the same basic type.
• They are formed by applying the same constructor to structurally equivalent
types.
• One is a type name that denotes the other.
Name Equivalence:
Two type expressions are name equivalent if and only if they are identical, that is if they
can be represented by the same syntax tree, with the same labels.
For example, consider the following few types and variable declarations,
Typedef double Value
….
….
Value var1, var2
Sum var3, var4
In these statements, var1 and var2 are name equivalent, and so are var3 and var4, because
their type names are the same. However, var1 and var3 are not name equivalent, because
their type names are different.
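The difference between the two schemes can be illustrated with a short sketch. It assumes that a basic type is a string, a constructed type is a tuple whose first element is the constructor, and named types are stored in a table. The definition of Sum is elided in the declarations above, so the entry for Sum below is purely an assumption made for illustration.

```python
# Table of type names. "Value" follows the typedef above;
# the entry for "Sum" is an assumption, since its definition is elided.
NAMES = {"Value": "double", "Sum": "double"}


def expand(t):
    """Replace a type name by the type expression it denotes."""
    while isinstance(t, str) and t in NAMES:
        t = NAMES[t]
    return t


def structurally_equivalent(a, b):
    a, b = expand(a), expand(b)
    if not isinstance(a, tuple) or not isinstance(b, tuple):
        return a == b                      # same basic type
    return (len(a) == len(b) and a[0] == b[0]   # same constructor
            and all(structurally_equivalent(x, y)
                    for x, y in zip(a[1:], b[1:])))


def name_equivalent(a, b):
    """Names are compared as written, without expanding them."""
    return a == b


print(name_equivalent("Value", "Sum"), structurally_equivalent("Value", "Sum"))
```

Under the assumed table, Value and Sum are structurally equivalent but not name equivalent, which matches the conclusion drawn for var1 and var3.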
Type conversion can be done implicitly or explicitly. A conversion from one type
to another is called implicit if it is done automatically by the compiler. Implicit
conversions of constants can usually be done at compile time, which improves
the execution time of the object program. Implicit type conversion is also known as
coercion.
A conversion is said to be explicit if the programmer must write something to cause the
conversion. Explicit conversions are also known as casts.
Conversion in languages can be considered as widening conversions (which are
intended to preserve information) and narrowing conversions (which can lose
information).
The widening rules are given by the hierarchy in Fig.(a): any type lower in the
hierarchy can be widened to a higher type. Thus, a char can be widened to an int or to
a float, but a char cannot be widened to a short.
The narrowing rules are illustrated by the graph in Fig.(b): a type s can be narrowed
to a type t if there is a path from s to t. Note that char, short, and byte are pairwise
convertible to each other.
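The widening rule amounts to reachability in the hierarchy. Fig.(a) is not reproduced in these notes, so the sketch below assumes the widening order of Java's primitive types; the table WIDENS_TO and the function can_widen are illustrative names.

```python
# Assumed widening edges (Java-style); Fig.(a) itself is not available here.
WIDENS_TO = {
    "byte": ["short"],
    "short": ["int"],
    "char": ["int"],
    "int": ["long"],
    "long": ["float"],
    "float": ["double"],
    "double": [],
}


def can_widen(src, dst):
    """True if dst can be reached from src by following widening edges."""
    if src == dst:
        return True
    return any(can_widen(nxt, dst) for nxt in WIDENS_TO[src])


print(can_widen("char", "float"), can_widen("char", "short"))
```

With this assumed table, a char can be widened to an int or a float but not to a short, in agreement with the text.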
Compiled by: Surajit Das Page 6|6
COMPILER DESIGN
INTERMEDIATE CODE GENERATION
[Figure: Source Program → High-level Intermediate Representation → ... → Low-level Intermediate Representation → Target (Object) Code]
Postfix Notation:
Generally, we use infix notation to represent an arithmetic expression such as
multiplication of two operands a and b (e.g. x=a*b). But in postfix notation the operator
is shifted to the right end, as x = ab*.
Process of evaluation of postfix expression:
• If the scanned symbol is an operand, it is pushed onto the stack, and scanning
continues.
• If the scanned symbol is a binary operator, the two topmost operands are
popped from the stack. The operator is applied to these operands, and the result
is pushed back onto the stack.
• If the scanned symbol is a unary operator, it is applied to the top of the stack and
the result is pushed back onto the stack.
▪ The result of a unary operator can be shown within parentheses.
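The evaluation steps above can be sketched as a small stack machine. This is a minimal sketch, assuming the expression is already split into tokens and that the token "neg" stands for unary minus; both the token name and the function name are illustrative.

```python
def eval_postfix(tokens, env):
    """Evaluate a list of postfix tokens; env maps variable names to values."""
    binary = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,
    }
    stack = []
    for tok in tokens:
        if tok in binary:
            right = stack.pop()          # the topmost operand is the right one
            left = stack.pop()
            stack.append(binary[tok](left, right))
        elif tok == "neg":               # unary operator: applied to the top
            stack.append(-stack.pop())
        else:                            # operand: variable or numeric literal
            stack.append(env[tok] if tok in env else float(tok))
    return stack.pop()


print(eval_postfix(["a", "b", "*"], {"a": 3, "b": 4}))
```

The call shown evaluates the postfix form ab* of the infix expression a*b.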
Param X1
Param X2
.
.
.
Param Xn
Call P, n
Here, the sequence of three-address statements is generated as a part of call of the
procedure P (X1, X2, ……Xn) and n in call P, n is defined as an integer specifying the total
number of actual parameters in the call.
Y = call P, n represents the function call.
Return Y, represents the return statement, where Y is a returned value.
Implementation of Three-Address Statements:
The three-address statement is an abstract form of intermediate code. Hence, the
actual implementation of the three-address statements can be done in the following
ways:
✓ Quadruples
✓ Triples
✓ Indirect triples
Quadruples
A quadruple is a record structure used to represent a three-address
statement. It consists of four fields. The first field contains the operator, the second and
third fields contain operand 1 and operand 2, respectively, and the last field
contains the result of that three-address statement. To illustrate the
quadruple representation,
consider the statement S = - z / a * (x+y)
To represent this statement as quadruples, we first construct the
three-address code as follows:
t1 := x + y
t2 := a * t1
t3 := - z
t4 := t3 / t2
S := t4
The quadruple representation of this three-address code is:
     Operator   Operand 1   Operand 2   Result
0    +          x           y           t1
1    *          a           t1          t2
2    -          z                       t3
3    /          t3          t2          t4
4    :=         t4                      S
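The quadruple table can be written down directly as a list of records. The sketch below is one possible encoding, not a prescribed one: the name "uminus" for unary minus, the use of None for an unused field, and the small interpreter run are all assumptions made for illustration.

```python
from collections import namedtuple

Quad = namedtuple("Quad", ["op", "arg1", "arg2", "result"])

# Quadruples for the statement S, in the same order as the table.
quads = [
    Quad("+", "x", "y", "t1"),
    Quad("*", "a", "t1", "t2"),
    Quad("uminus", "z", None, "t3"),
    Quad("/", "t3", "t2", "t4"),
    Quad(":=", "t4", None, "S"),
]


def run(quads, env):
    """Interpret the quadruples in order, storing each result in env."""
    ops = {"+": lambda a, b: a + b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: a / b}
    for q in quads:
        if q.op == "uminus":
            env[q.result] = -env[q.arg1]
        elif q.op == ":=":
            env[q.result] = env[q.arg1]
        else:
            env[q.result] = ops[q.op](env[q.arg1], env[q.arg2])
    return env


env = run(quads, {"x": 1, "y": 2, "a": 2, "z": 12})
print(env["S"])
```

Because every result has an explicit name, the quadruples can be executed or reordered without renumbering any operand.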
Triples
A triple is also a record structure used to represent a three-address
statement. A triple has three fields, namely, operator, operand 1 and operand 2,
where operand 1 and operand 2 are
pointers either to the symbol table or to records (for temporary
values) within the triple structure itself. In this representation, the result field
is removed to avoid entering temporary names into the symbol table.
Instead, we refer to results by their positions. Pointers into the triple structure are
represented by parenthesized numbers, whereas symbol-table pointers are represented by
the names themselves.
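A sketch of how quadruples could be turned into triples is given below. It assumes that every result other than the target of an assignment is a temporary, and it writes a pointer to triple number k as the one-element tuple (k,) to mimic the parenthesized notation; the function name is illustrative.

```python
def quads_to_triples(quads):
    """Convert (op, arg1, arg2, result) quadruples into (op, arg1, arg2) triples."""
    position = {}   # temporary name -> pointer to the triple that computes it
    triples = []
    for op, arg1, arg2, result in quads:
        a1 = position.get(arg1, arg1)
        a2 = position.get(arg2, arg2)
        if op == ":=":
            triples.append((op, result, a1))    # target first, as in S := (3)
        else:
            position[result] = (len(triples),)
            triples.append((op, a1, a2))
    return triples


quads = [
    ("+", "x", "y", "t1"),
    ("*", "a", "t1", "t2"),
    ("uminus", "z", None, "t3"),    # "uminus" is an assumed name for unary minus
    ("/", "t3", "t2", "t4"),
    (":=", "t4", None, "S"),
]
for number, triple in enumerate(quads_to_triples(quads)):
    print(number, triple)
```

The temporaries t1 to t4 disappear from the output; only positions and symbol-table names remain.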
Indirect Triples
An indirect triple representation consists of an additional array that contains the
pointers to the triples in the desired order. Let us define an array A that contains
pointers to triples in desired order. Indirect triple representation for the statement S
given in the previous example is:
A                    Operator   Operand 1   Operand 2
101   (0)       0    +          x           y
102   (1)       1    *          a           (0)
103   (2)       2    -          z
104   (3)       3    /          (2)         (1)
105   (4)       4    :=         S           (3)
The main advantage of indirect triple representation is that an optimizing compiler can
move an instruction by simply reordering the array A, without affecting the triples
themselves.
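The effect of reordering only the array A can be sketched as follows. The encoding is an assumption made for illustration: triples are tuples, a pointer to triple k is written (k,), "uminus" names unary minus, and A holds triple numbers rather than the slot addresses 101 to 105.

```python
triples = [
    ("+", "x", "y"),          # (0)
    ("*", "a", (0,)),         # (1)
    ("uminus", "z", None),    # (2)
    ("/", (2,), (1,)),        # (3)
    (":=", "S", (3,)),        # (4)
]

# A lists the triples in execution order.
A = [0, 1, 2, 3, 4]


def move(order, src, dst):
    """Return a new ordering with the entry at position src moved to dst."""
    order = list(order)
    order.insert(dst, order.pop(src))
    return order


# Compute -z first. Only A changes; operand pointers such as (0,) stay valid.
A2 = move(A, 2, 0)
print(A2)
```

With plain triples the same move would renumber the triples and force every pointer to be rewritten.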
Methods of Translating a Boolean Expression into Three-Address Code:
There are two methods available to translate a Boolean expression into three-address
code,
▪ Numerical Representation
▪ Control-Flow Representation
Numerical Representation:
The first method of translating a Boolean expression into three-address code
encodes true and false numerically and then evaluates the Boolean expression like
an arithmetic expression. True is often denoted by 1 and false by 0. Other
encodings are also possible: any non-zero quantity can indicate true and zero false, or
any non-negative quantity can indicate true and any negative number false. Expressions
are evaluated from left to right, like arithmetic expressions.
Consider a Boolean expression X and Y or Z, the translation of this expression
into three-address code is
t1 = X and Y
t2 = t1 or Z
Now, consider a relational expression if X > Y then 1 else 0, the three-address code
translation for this expression is as follows:
1. if X > Y goto (4)
2. t1 := 0
3. goto (5)
4. t1 := 1
5. Next
Here, t1 is a temporary variable that can have the value 1 or 0 depending on whether
the condition is evaluated to true or false. The label Next represents the statement
immediately following the else part.
Control-Flow Representation:
In the second method, the Boolean expression is translated into three-address code
based on the flow of control. In this method, the value of a Boolean expression is
represented by a position reached in a program. In case of evaluating the Boolean
expressions by their positions in program, we can avoid calculating the entire
expression. This method is useful in implementing the Boolean expressions in control-
flow statements such as if-then-else and while-do statements. For example, we
consider the Boolean expressions in context of conditional statements such as
• If X then S1 else S2
• While X do S
In the first statement, if X is true, the control jumps to the first statement of the
code for S1, and if X is false, the control jumps to the first statement of the code for
S2.
In case of second statement, when X is false, the control jumps to the statement immediately following
the while statement, and if X is true, the control jumps to the first statement of the code for S.
Example:
I. Generate the three-address code for the following program segment
while (x < z and y > s) do
If x = 1 then
z=z+1
else
while x <= s do
x = x + 10;
Ans: The three-address code for the given program segment is given below:
1. if x < z goto (3)
2. goto (16)
3. if y > s goto (5)
4. goto (16)
5. if x = 1 goto (7)
6. goto (10)
7. t1 := z + 1
8. z := t1
9. goto (1)
10. if x <= s goto (12)
11. goto (1)
12. t2 := x + 10
13. x := t2
14. goto (10)
15. goto (1)
16. Next
II. Consider the following code segment and generate the three-address code for
it.
Ans: The three-address code for the given program segment is given below:
1. k := 1
2. if k <= 12 goto (4)
3. goto (11)
4. if x < y goto (6)
5. goto (8)
6. t1 := b + c
7. a := t1
8. t2 := k + 1
9. k := t2
10. goto (2)
11. Next
III. Translate the following statement, which alters the flow of control of
expressions, and generate the three-address code for it.
While (P < Q) do
If (R < S) then a= b + c;
Ans: The three-address code for the given statement is as follows:
switch (a + b)
{
case 2: {x = y; break;}
case 5: switch x
{
case 0: {a = b + 1; break;}
case 1: {a = b + 3; break;}
default: {a = 2; break;}
}
break;
case 9: {x = y - 1; break;}
default: {a = 2; break;}
}
Ans: The three-address code for the given program segment is given below:
1. t1 := a + b
2. goto (23)
3. x := y
4. goto (27)
5. goto (14)
6. t3 := b + 1
7. a := t3
8. goto (27)
9. t4 := b + 3
10. a := t4
11. goto (27)
12. a := 2
13. goto (27)
14. if x = 0 goto (6)
15. if x = 1 goto (9)
16. goto (12)
17. goto (27)
18. t5 := y - 1
19. x := t5
20. goto (27)
21. a := 2
22. goto (27)
23. if t1 = 2 goto (3)
24. if t1 = 5 goto (5)
25. if t1 = 9 goto (18)
26. goto (21)
27. Next
V. Translate the following program segment into three-address statement:
Ans: The three-address code for the given program segment is given below:
1. s = 0
2. i = 0
3. if i > 10 goto (16)
4. t1 = addr(a)
5. t2 = i * 4
6. t3 = t1[t2]
7. t4 = addr(b)
8. t5 = i * 4
9. t6 = t4[t5]
10. t7 = t3 * t6
11. t8 = s + t7
12. s = t8
13. t9 = i + 1
14. i = t9
15. goto (3)
16. Next
Exercise:
1. Generate the three-address code for the following program segment:
main ()
{
int k = 1;
int a[5];
while (k <= 5)
{
a[k] = 0;
k++;
}
}
Note:
If the elements of a two-dimensional array X [m][n] are stored in a row-major form, the relative
address of an array element X [i][j] is calculated as follows:
base + (i * n + j) * w
On the other hand, if the elements are stored in a column-major form, the relative address of X
[i][j] is calculated as follows:
base + (i + j * m) * w
Row-Major              Column-Major
1. t1 = addr(X)        1. t1 = addr(X)
2. t2 = i * n          2. t2 = j * m
3. t3 = t2 + j         3. t3 = t2 + i
4. t4 = t3 * w         4. t4 = t3 * w
5. t5 = t1[t4]         5. t5 = t1[t4]
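The two address formulas can be checked with a short sketch. The function and parameter names are illustrative; base is the address of X[0][0], m and n are the numbers of rows and columns, and w is the width of one element.

```python
def row_major_address(base, i, j, n, w):
    """Address of X[i][j] when rows are stored one after another."""
    return base + (i * n + j) * w


def column_major_address(base, i, j, m, w):
    """Address of X[i][j] when columns are stored one after another."""
    return base + (i + j * m) * w


# X[1][2] in a 10 x 10 array of 4-byte elements starting at address 0
print(row_major_address(0, 1, 2, 10, 4))
print(column_major_address(0, 1, 2, 10, 4))
```

The element X[0][0] has the same address in both layouts, since both offsets are zero.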
2. Generate the three-address code for the following program segment where x, y
are arrays of size 10 *10, and there are 4 bytes/word.
begin
add = 0
a=1
b=1
do
begin
add = add + x [a, b] * y [a, b]
a=a+1
b=b+1
end
while a <=10 and b <=10
end
Backpatching:
Backpatching is the process of leaving blank entries for the goto instruction where the
target address is unknown in the forward transfer in the first pass and filling these
unknown addresses in the second pass.
In all control transfer statements such as if, if-else, for, while, do-while, and switch-
case, the transfer of control takes place from one place to another. In all the examples
given in previous sections on control statements, there are goto statements to transfer
the control of the instruction. These control statements are categorized into
unconditional and conditional goto statements, which transfer the control either in
the forward direction or in the backward direction.
These sets of quadruples involved both forward and backward control. From
quadruple 10, there is a forward control to quadruple 12, and from quadruple 11,
there is a backward control to quadruple 5. In case of backward control, the location
of the label is known at the point of its reference.
When a quadruple associated with the labels such as L1 and L2 is created during the
process of IC generation, a function such as
can be called in the semantic action. Here S is the set data type, Label is the string
(such as L1, L2), and quadapperance is the reference of the quadruple associated with
this label.
When L1 is referred in quadruple 11, the location of L1 is known as quadruple
number 5. However, when L2 is referred in quadruple 10, its location is not known at
the point of its reference. Hence the actual location associated with label L2 is deferred
and left blank till the location is found.
At the end of the generation of the quadruples by leaving blank for the unknown
quadruple reference, a function
backpatch (LabelList L)
can be called to fill the blank entries with the appropriate quadruple reference from
the Label-Quadruple appearance pair table.
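The two phases can be sketched as follows. This is a minimal sketch and not the notes' own interface: the class QuadList, its methods, and the label table are assumed names, and a blank target is represented by None.

```python
class QuadList:
    """Goto emission with blank targets that are filled in later."""

    def __init__(self):
        self.code = []      # each entry is [operation, target]
        self.labels = {}    # label -> index of the quadruple it marks
        self.pending = []   # (quadruple index, label) pairs left blank

    def define_label(self, label):
        self.labels[label] = len(self.code)

    def emit(self, op):
        self.code.append([op, None])

    def emit_goto(self, label):
        target = self.labels.get(label)     # known only for backward jumps
        if target is None:
            self.pending.append((len(self.code), label))
        self.code.append(["goto", target])

    def backpatch(self):
        """Fill every blank target from the label table."""
        for index, label in self.pending:
            self.code[index][1] = self.labels[label]
        self.pending.clear()


q = QuadList()
q.define_label("L1")
q.emit("x := x + 1")
q.emit_goto("L2")       # forward jump: target is left blank
q.emit_goto("L1")       # backward jump: target is already known
q.define_label("L2")
q.emit("next")
q.backpatch()
print(q.code)
```

The backward jump to L1 is resolved when it is emitted, while the forward jump to L2 stays blank until backpatch runs.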
COMPILER DESIGN
CODE OPTIMIZATION
In the above sequence of statements, the control enters from the first statement t1 =
b * c. The second and third statements are executed sequentially without any looping
or branching and the control leaves the block from the last statement. Hence the
above statements form a basic block.
Algorithm for partitioning of three-address instructions into basic block:
A sequence of three-address instructions is taken as input and the following steps are
performed to partition the three-address instructions into basic blocks:
Step 1: Determine the set of leaders (First statement in the basic block is called leader).
[The rules for finding leaders are:
1. The first statement in the intermediate code is leader.
2. The target statement of a conditional and unconditional jump is a leader.
3. The intermediate statement following an unconditional or conditional jump
is a leader.]
Step 2: Construct the basic block for each leader that consists of the leader and all the
instructions till the next leader (excluding the next leader) or the end of the program.
Instructions that are not included in any block can never be executed and may be
removed, if desired.
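The two steps can be sketched in a few lines. The encoding is an assumption made for illustration: each instruction is a pair (text, target), where target is the 0-based index that a jump goes to, or None for an instruction that is not a jump.

```python
def find_leaders(code):
    """Step 1: collect the indices of all leaders."""
    leaders = set()
    if code:
        leaders.add(0)                      # rule 1: first statement
    for i, (_, target) in enumerate(code):
        if target is not None:
            leaders.add(target)             # rule 2: target of a jump
            if i + 1 < len(code):
                leaders.add(i + 1)          # rule 3: statement after a jump
    return sorted(leaders)


def basic_blocks(code):
    """Step 2: each block runs from a leader up to the next leader."""
    leaders = find_leaders(code)
    ends = leaders[1:] + [len(code)]
    return [code[start:end] for start, end in zip(leaders, ends)]


code = [
    ("sum = 0", None), ("i = 0", None), ("average = 0", None),
    ("t1 = 4 * i", None), ("t2 = A[t1]", None), ("t3 = sum + t2", None),
    ("t4 = i + 1", None), ("i = t4", None), ("if i < n goto 3", 3),
    ("t5 = sum / n", None), ("average = t5", None),
]
print(find_leaders(code))
```

For this input the leaders are the statements at indices 0, 3 and 9, which gives three basic blocks.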
Consider a program that computes the average of a set of n numbers.
B1:
1. sum = 0
2. i = 0
3. average = 0
B2:
1. t1 = 4 * i
2. t2 = A[t1]
3. t3 = sum + t2
4. t4 = i + 1
5. i = t4
6. if i < n goto step (1)
B3:
1. t5 = sum/n
2. average = t5
[Flow graph: B1 → B2; B2 → B2 (when i < n); B2 → B3]
Find the basic blocks and draw the flow graph for the above code.
Reachable Definitions
In an IC, variables are defined and undefined (their previous definitions are killed)
many times. For example, consider the following quadruples:
X = 10;
X = 20;
X = 30;
In these statements (quadruples), the variable X is defined three times. In the first
statement, the variable is defined with the value 10. In the second statement, the
previous value is undefined and a new value 20 is defined for X. Similarly, it is extended
to the third statement. If the values are not used in any right-hand side of the
quadruple, these values can be dead values and the associated code is said to be dead
code. Such dead code can be eliminated. In this example, the last definition, i.e., X =
30, reaches the forthcoming statement. Such definitions are said to live in the basic
block. Hence, it is required to keep track of the set of quadruples that is available at
the input and output of a basic block.
The term spawn is used to define a variable, and the term destroy to kill the previous
definition. Hence, it is preferable to maintain a set of definitions generated out of each
statement and a set of definitions that are destroyed as a result of this statement.
Hence, the set of definitions that are available at the end of the statement S is
expressed by the following data-flow equation:
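A minimal sketch of such an equation is given below. It assumes the standard form out[S] = gen[S] ∪ (in[S] − kill[S]), where gen[S] is the set of definitions spawned by S and kill[S] is the set destroyed by S; the labels d1, d2 and d3 for the three definitions of X are illustrative.

```python
def out_set(in_set, gen, kill):
    """out[S] = gen[S] | (in[S] - kill[S]) for a single statement S."""
    return gen | (in_set - kill)


# d1: X = 10, d2: X = 20, d3: X = 30, executed in sequence.
defs_of_x = {"d1", "d2", "d3"}
reaching = set()
for d in ["d1", "d2", "d3"]:
    # Each statement spawns its own definition and destroys the others of X.
    reaching = out_set(reaching, gen={d}, kill=defs_of_x - {d})

print(reaching)
```

Only the last definition of X survives to the end of the sequence, as described for X = 30.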
DAG is a Directed Acyclic Graph that is used to represent the basic blocks and to
implement transformation on them. For rearranging the ICs in the basic block, DAG is
used. It represents the way in which the value computed by each statement in a basic
block is used in the subsequent statements in the block. Every node in a flow graph
can be represented by DAG. Each node of a DAG is associated with a label. The labels
are assigned by using three rules:
1. The variable names or constants are the leaves of the DAG and are labelled
uniquely. Leaves mostly represent r-values.
2. Interior nodes are labelled by the operator symbol.
3. Nodes are also labelled by a sequence of identifiers. The interior nodes are the
computed values and the identifiers labelling the node are said to possess the
value.
Note:
DAGs are different from a flow graph. A DAG represents a basic block, and a flow graph
represents the basic block and its relationship with other basic blocks.
Example:
Consider the following ICs representing the expression a + b * c + d + b * c
t1 = b * c
t2 = b * c
t3 = a + t1
t4 = d + t2
t5 = t3 + t4
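[Figure: DAG for the above ICs. The root is a + node labelled t5 with two children, a + node labelled t3 and a + node labelled t4. Node t3 has children a and a * node labelled t1, t2. Node t4 has children d and the same * node. The * node has the leaves b and c as children.]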
In this example, the first two leaf nodes corresponding to the variables b and c are
created and linked by the interior node, that is, the operator node *, and it is assigned
the label t1. In the next quadruple, a similar right-hand side b * c appears, for which
the nodes are already created and t2 is added to the label. This enables the catching
of the common subexpression b * c. This process is repeated for every quadruple.
Algorithm
// Input: A basic block B of ICs
// Output: A DAG
// Cases of ICs: (i) x = y op z, (ii) x = op y, (iii) x = y
for each statement IC of B
{
    if (node(y) is not defined)
        createLeafNode(y)
        defineNode(y)
    endif
    // Do the same for z, if the statement has a second operand
    Case (i): n = findNode(op, node(y), node(z))
        if (n is not found)
            n = createNode(op, node(y), node(z))
        endif
    Case (ii): n = findNode(op, node(y))
        if (n is not found)
            n = createNode(op, node(y))
        endif
    Case (iii): n = node(y)
    // In all three cases:
    Delete x from the identifiers attached to node(x)
    Append x to the identifiers attached to n
    Set node(x) to n
}
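The algorithm can be sketched in Python as follows. The data layout is an assumption made for illustration: a statement is a tuple (x, op, y, z), with z set to None for a unary operation and op set to None for a copy, and nodes are stored in a list of dictionaries.

```python
def build_dag(block):
    """Build a DAG for one basic block. Returns (nodes, node_of)."""
    nodes = []     # node id -> {"label": ..., "kids": (...), "ids": [...]}
    index = {}     # (operator, kids) -> node id, used to find existing nodes
    node_of = {}   # identifier -> id of the node that currently holds its value

    def leaf(name):
        if name not in node_of:
            nodes.append({"label": name, "kids": (), "ids": []})
            node_of[name] = len(nodes) - 1
        return node_of[name]

    for x, op, y, z in block:
        kids = tuple(leaf(v) for v in (y, z) if v is not None)
        if op is None:
            n = kids[0]                          # case (iii): x = y
        else:
            key = (op, kids)
            if key not in index:                 # cases (i) and (ii)
                nodes.append({"label": op, "kids": kids, "ids": []})
                index[key] = len(nodes) - 1
            n = index[key]
        old = node_of.get(x)
        if old is not None and x in nodes[old]["ids"]:
            nodes[old]["ids"].remove(x)          # x no longer names the old node
        nodes[n]["ids"].append(x)
        node_of[x] = n
    return nodes, node_of


block = [
    ("t1", "*", "b", "c"),
    ("t2", "*", "b", "c"),
    ("t3", "+", "a", "t1"),
    ("t4", "+", "d", "t2"),
    ("t5", "+", "t3", "t4"),
]
nodes, node_of = build_dag(block)
print(nodes[node_of["t1"]]["ids"])
```

For the ICs of the expression a + b * c + d + b * c, the identifiers t1 and t2 end up attached to the same * node, so the common subexpression b * c is represented only once.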
2015, 2017
1. Construct the DAG for the following basic block:
d: = b * c, e: = a + b, b: = b * c, a: = e – d
2014
2. Construct the DAG for the following expression:
A = B * -C + B * -C
2013
3. Construct the DAG for the following basic block:
a = b + c, b = a – d, c = b + c, d = a - d
The corresponding three-address code for the above code segment is:
1. product := 0
2. j := 1
3. t1 := 4 * j
4. t2 := x[t1]
5. t3 := 4 * j
6. t4 := y[t3]
7. t5 := t2 * t4
8. t6 := product + t5
9. product := t6
10. t7 := j + 1
11. j := t7
12. if j <= 20 goto (3)
The corresponding DAG for the above three-address code is given below:
2. Copy Propagation:
(a) c = d + e        (b) c = t
[Figure: flow graph; the node labels 1 to 5 are visible]
Dom (0) = {1,2,3,4,5,6}
Dom (1) = {2,3,4,5,6}
Dom (2) = {}
Dom (3) = {}
Dom (4) = {5,6}
Dom (5) = {6}
Dom (6) = {}
Peephole Optimization:
The machine code generated from the ICs is straightforward, and it is likely to
contain redundant instructions. Hence, there is a need
to review the generated code. This review is one of the optimization
techniques. However, there is no guarantee that the resulting code will be optimal.
Instead of optimizing the code as a whole, local
pieces of the code can be considered; this process is called peephole
optimization.
Peephole optimization is a technique in which a small portion of the code (known as the
peephole) is examined and replaced
by equivalent code that is shorter or faster. The statements
within the peephole need not be contiguous, although some implementations
require them to be contiguous. Each improvement in the code may expose
opportunities for further improvements. So, multiple passes over the code are
necessary to get the maximum benefit from peephole optimization. The possible
transformations are as follows:
1. Redundant instruction eliminations:
Consider these machine codes.
MOV R0, a
MOV a, R0
The second instruction can be eliminated, since the first instruction ensures that the
value of a is already in register R0. However, the second instruction cannot be deleted
if it has a label, because then it is not certain that the first
instruction is always executed immediately before the second. To ensure that this kind of
transformation of the target code is safe, the two instructions must be in the
same basic block.
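A sketch of this transformation is given below. It assumes that the input is a single basic block with no labels inside it, that instructions are tuples (op, src, dst) in the MOV src, dst order used above, and that a store followed immediately by a load of the same location makes the load redundant; the function name is illustrative.

```python
def remove_redundant_moves(block):
    """Drop 'MOV a, R0' when it directly follows 'MOV R0, a' in one basic block."""
    result = []
    for ins in block:
        if result:
            prev = result[-1]
            if (ins[0] == prev[0] == "MOV"
                    and ins[1] == prev[2] and ins[2] == prev[1]):
                continue    # the value is already in place; skip this move
        result.append(ins)
    return result


block = [("MOV", "R0", "a"), ("MOV", "a", "R0")]
print(remove_redundant_moves(block))
```

If any other instruction stands between the two moves, the sketch leaves both in place, since the register may have changed in between.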
2. Removal of unreachable code:
An unlabelled instruction that immediately follows an unconditional jump can be
removed. Repeating this operation eliminates a whole sequence of instructions. Consider
the following IC representation: