Unit - 1 Compiler Design
The high-level language is converted into binary language in various phases. A compiler is a
program that converts a high-level language into a low-level language (assembly or machine code).
Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language.
The difference lies in the way they read the source code. A compiler reads the whole source
code at once, creates tokens, checks semantics, generates intermediate code, translates the
whole program, and may involve many passes. In contrast, an interpreter reads a statement
from the input, converts it to an intermediate code, executes it, then takes the next statement in
sequence. If an error occurs, an interpreter stops execution and reports it, whereas a compiler
reads the whole program even if it encounters several errors.
Assembler
An assembler translates assembly language programs into machine code. The output of an
assembler is called an object file, which contains a combination of machine instructions as
well as the data required to place these instructions in memory.
Linker
A linker is a computer program that links and merges various object files together in order to
make an executable file. All these files might have been compiled by separate assemblers. The
major task of a linker is to search for and locate referenced modules/routines in a program and
to determine the memory locations where this code will be loaded, so that the program
instructions have absolute references.
Loader
Loader is a part of the operating system and is responsible for loading executable files into
memory and executing them. It calculates the size of a program (instructions and data) and
creates memory space for it. It initializes various registers to initiate execution.
Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for platform
(B) is called a cross-compiler.
Translators
A compiler that takes the source code of one programming language and translates it into the
source code of another programming language is called a source-to-source compiler.
Phases of Compiler
The compilation process is a sequence of various phases. Each phase takes input from its
previous stage, has its own representation of the source program, and feeds its output to the
next phase of the compiler. Let us understand the phases of a compiler.
Lexical Analysis
The first phase of the compiler, also known as the scanner, works as a text scanner. This phase
scans the source code as a stream of characters and converts it into meaningful lexemes. The
lexical analyzer represents these lexemes in the form of tokens as:
<token-name, attribute-value>
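As an illustration, a scanner over a tiny token set can be sketched as below. The token names (NUM, ID, ASSIGN, OP) and their regular expressions are assumptions chosen for this example, not part of any particular compiler:

```python
import re

# Token specification: each pair is (token-name, regex). Illustrative subset only.
TOKEN_SPEC = [
    ("NUM",    r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("ASSIGN", r":="),
    ("OP",     r"[+\-*/]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(source):
    """Scan the character stream and emit <token-name, attribute-value> pairs."""
    tokens = []
    for m in MASTER.finditer(source):
        kind = m.lastgroup
        if kind != "SKIP":                 # whitespace is discarded, not tokenized
            tokens.append((kind, m.group()))
    return tokens

print(tokenize("total := count + 10"))
# [('ID', 'total'), ('ASSIGN', ':='), ('ID', 'count'), ('OP', '+'), ('NUM', '10')]
```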
Syntax Analysis
The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical
analysis as input and generates a parse tree (or syntax tree). In this phase, token arrangements
are checked against the source code grammar, i.e. the parser checks if the expression made by
the tokens is syntactically correct.
Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the rules of the language.
For example, it checks that values are assigned between compatible data types and flags
operations such as adding a string to an integer. The semantic analyzer also keeps track of
identifiers, their types and expressions, and whether identifiers are declared before use. The
semantic analyzer produces an annotated syntax tree as output.
Intermediate Code Generation
After semantic analysis, the compiler generates an intermediate code of the source code for
the target machine. It represents a program for some abstract machine and lies between the
high-level language and the machine language. This intermediate code should be generated in
such a way that it is easy to translate into the target machine code.
Code Optimization
The next phase performs code optimization of the intermediate code. Optimization can be
seen as removing unnecessary code lines and arranging the sequence of statements in order
to speed up program execution without wasting resources (CPU, memory).
Code Generation
In this phase, the code generator takes the optimized representation of the intermediate code
and maps it to the target machine language. The code generator translates the intermediate
code into a sequence of (generally) relocatable machine code. This sequence of machine-code
instructions performs the same task as the intermediate code would.
Symbol Table
It is a data structure maintained throughout all the phases of a compiler. All the identifiers'
names, along with their types, are stored here. The symbol table makes it easier for the compiler
to quickly search the identifier record and retrieve it. The symbol table is also used for scope
management.
Parsing
A parser takes input in the form of a sequence of tokens and produces output in the form of a
parse tree.
Top down parsing
● In top-down parsing, the parsing starts from the start symbol and transforms it into the
input string.
Bottom up parsing
● In the bottom up parsing, the parsing starts with the input symbol and constructs the
parse tree up to the start symbol by tracing out the rightmost derivations of the string in
reverse.
Operator precedence parsing
Operator precedence can only be established between the terminals of the grammar;
non-terminals are ignored.
a ⋗ b means that terminal "a" has the higher precedence than terminal "b".
a ⋖ b means that terminal "a" has the lower precedence than terminal "b".
a ≐ b means that the terminal "a" and "b" both have same precedence.
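As an illustration, the precedence relations for the classic expression grammar over the terminals {+, *, id, $} can be stored in a table. The table below is a hypothetical sketch built from the relations above, not taken from the text:

```python
# Hypothetical operator-precedence table. PREC[a][b] is the relation
# between terminal a (on the stack) and terminal b (in the input).
PREC = {
    "+":  {"+": ">", "*": "<", "id": "<", "$": ">"},
    "*":  {"+": ">", "*": ">", "id": "<", "$": ">"},
    "id": {"+": ">", "*": ">",            "$": ">"},
    "$":  {"+": "<", "*": "<", "id": "<"},
}

def relation(a, b):
    """Return '<', '>' or '=' for terminals a, b, or None if no relation exists."""
    return PREC.get(a, {}).get(b)

print(relation("+", "*"))   # '<' : + yields precedence to *
print(relation("*", "+"))   # '>' : * takes precedence over +
```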
LR Parser
LR parsing is one type of bottom-up parsing. It is used to parse a large class of grammars. In
the name LR(k), "L" stands for left-to-right scanning of the input, "R" stands for constructing a
rightmost derivation in reverse, and "k" is the number of lookahead input symbols used to make
parsing decisions.
LR parsing is divided into four parts: LR (0) parsing, SLR parsing, CLR parsing and LALR parsing.
Canonical Collection of LR(0) items
An LR (0) item is a production of a grammar G with a dot at some position in the right side of the production.
LR(0) items are useful to indicate how much of the input has been scanned up to a given point
in the process of parsing.
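The canonical collection can be computed mechanically with two operations, closure and goto. The sketch below assumes a small illustrative grammar (S' → S, S → A A, A → a A | b); the encoding of items as (head, body, dot-position) triples is an assumption for this example:

```python
# Illustrative grammar: S' -> S, S -> A A, A -> a A | b.
GRAMMAR = {
    "S'": [("S",)],
    "S":  [("A", "A")],
    "A":  [("a", "A"), ("b",)],
}

def closure(items):
    """Add an item B -> .gamma for every nonterminal B that appears right after a dot."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(items):
            if dot < len(body) and body[dot] in GRAMMAR:   # dot before a nonterminal
                for prod in GRAMMAR[body[dot]]:
                    item = (body[dot], prod, 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

def goto(items, symbol):
    """Move the dot over `symbol` in every item where it applies, then take the closure."""
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == symbol}
    return closure(moved)

I0 = closure({("S'", ("S",), 0)})
print(len(I0))   # I0 holds S'->.S, S->.AA, A->.aA and A->.b, i.e. 4 items
```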
LR(0) Table
● If a state goes to some other state on a terminal, then it corresponds to a shift move.
● If a state goes to some other state on a variable (non-terminal), then it corresponds to a goto move.
● If a state contains a final item, then write the reduce entry in the complete row for that
state.
Explanation:
● I0 on S goes to I1, so write it as 1.
● I0 on A goes to I2, so write it as 2.
● I2 on A goes to I5, so write it as 5.
● I3 on A goes to I6, so write it as 6.
● I0, I2 and I3 on a go to I3, so write it as S3, which means shift 3.
● I4, I5 and I6 all contain a final item, because the • is at the rightmost end. So write the
reduce entry as the production number.
SLR (1) Table Construction
The steps used to construct the SLR (1) table are given below:
If a state (Ii) goes to some other state (Ij) on a terminal, then it corresponds to a shift move in
the action part.
If a state (Ii) goes to some other state (Ij) on a variable, then it corresponds to a goto move in
the Goto part.
If a state (Ii) contains a final item like A → ab•, which has no transition to a next state, then
the production is known as a reduce production. For every terminal X in FOLLOW (A), write the
reduce entry along with the production number.
LALR (1) Parsing
LALR refers to lookahead LR. To construct the LALR (1) parsing table, we use the canonical
collection of LR (1) items.
In LALR (1) parsing, the LR (1) items which have the same productions but different lookaheads
are combined to form a single set of items.
LALR (1) parsing is the same as CLR (1) parsing; the only difference is in the parsing table.
UNIT - 2
In syntax directed translation, along with the grammar we associate some informal notations,
called semantic rules.
● In a semantic rule, the attribute is VAL, and an attribute may hold anything: a string, a
number, a memory location or a complex record.
● In Syntax directed translation, whenever a construct is encountered in the programming
language then it is translated according to the semantic rules defined in that particular
programming language.
Example
● In the translation scheme, the semantic rules are embedded within the right side of the
productions.
Example
S → E $ { print E.VAL }
SDT is implemented by parsing the input and producing a parse tree as a result.
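As a sketch of how such a scheme can be realized, the evaluator below assumes the usual expression grammar E → E + T | T, T → T * F | F, F → num and computes the synthesized attribute VAL during recursive-descent parsing, printing it when the end marker $ is seen:

```python
def parse_and_print(tokens):
    """Parse a token list ending in '$' and print E.VAL, as in S -> E $ { print E.VAL }."""
    pos = 0

    def factor():              # F -> num          F.VAL = num
        nonlocal pos
        val = int(tokens[pos]); pos += 1
        return val

    def term():                # T -> F ('*' F)*   T.VAL = product of the factors
        nonlocal pos
        val = factor()
        while pos < len(tokens) and tokens[pos] == "*":
            pos += 1
            val *= factor()
        return val

    def expr():                # E -> T ('+' T)*   E.VAL = sum of the terms
        nonlocal pos
        val = term()
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            val += term()
        return val

    val = expr()
    assert tokens[pos] == "$"  # S -> E $ : the end marker triggers the action
    print(val)                 # semantic action { print E.VAL }
    return val

parse_and_print(["2", "+", "3", "*", "4", "$"])   # prints 14
```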
Intermediate Code
Intermediate code is used in translating the source code into machine code. Intermediate
code lies between the high-level language and the machine language.
● If the compiler directly translates source code into the machine code without generating
intermediate code then a full native compiler is required for each new machine.
● The intermediate code keeps the analysis portion the same for all compilers, which is
why a full native compiler is not needed for every unique machine.
● The intermediate code generator receives input from its predecessor phase, the
semantic analyzer, in the form of an annotated syntax tree.
● Using the intermediate code, only the synthesis phase of the compiler needs to be
changed according to the target machine.
Intermediate representation
High-level intermediate code is close to the source code. Source-level code modifications can
easily be applied to it to enhance performance, but it is less preferred for target-machine
optimization.
Low-level intermediate code is close to the target machine, which makes it suitable for register
and memory allocation etc. It is used for machine-dependent optimizations.
Postfix Notation
● Postfix notation is a useful form of intermediate code if the given language consists of
expressions.
● The ordinary (infix) way of writing the product of x and y places the operator in the
middle: x * y. In postfix notation, we place the operator at the right end, as xy *.
● In postfix notation, the operator follows the operand.
Example
Production              Semantic Rule
1. E → E1 op E2         E.code = E1.code || E2.code || op
2. E → (E1)             E.code = E1.code
3. E → id               E.code = id
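The translation above can also be carried out with an explicit operator stack. The converter below is a minimal shunting-yard-style sketch; it assumes only the operators + and *, single-character operands, and parentheses:

```python
PRECEDENCE = {"+": 1, "*": 2}

def to_postfix(infix):
    """Convert an infix string (single-char operands, + * and parentheses) to postfix."""
    out, ops = [], []
    for sym in infix:
        if sym in PRECEDENCE:
            # pop operators of higher or equal precedence before pushing sym
            while ops and ops[-1] != "(" and PRECEDENCE[ops[-1]] >= PRECEDENCE[sym]:
                out.append(ops.pop())
            ops.append(sym)
        elif sym == "(":
            ops.append(sym)
        elif sym == ")":
            while ops[-1] != "(":
                out.append(ops.pop())
            ops.pop()                      # discard the '('
        else:                              # an operand goes straight to the output
            out.append(sym)
    while ops:
        out.append(ops.pop())
    return "".join(out)

print(to_postfix("x*y"))        # xy*
print(to_postfix("(a+b)*c"))    # ab+c*
```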
When you create a parse tree, it contains more detail than is actually needed, so it is inefficient
for a compiler to work with the full parse tree.
● In the parse tree, most of the leaf nodes are single child to their parent nodes.
● Syntax tree is a variant of parse tree. In the syntax tree, interior nodes are operators and
leaves are operands.
Abstract syntax trees are more compact than a parse tree and can be easily used by a compiler.
● In three-address code, the given expression is broken down into several separate
instructions. These instructions can easily translate into assembly language.
● Each three-address code instruction has at most three operands. It is typically a
combination of an assignment and a binary operator.
Example
Given expression:
1. a := (-c * b) + (-c * d)
t1 := -c
t2 := b*t1
t3 := -c
t4 := d * t3
t5 := t2 + t4
a := t5
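A sketch of how such instructions can be produced from an expression tree is given below. The tuple encoding of expressions and the helper name gen_tac are assumptions for this example, and the operand order within each instruction may differ slightly from the listing above:

```python
# Expressions are encoded as: a variable name, ("neg", e), or (op, lhs, rhs).
def gen_tac(expr, code, counter):
    """Append three-address instructions for expr to code; return its address."""
    if isinstance(expr, str):              # a plain variable is already an address
        return expr
    if expr[0] == "neg":                   # unary minus
        arg = gen_tac(expr[1], code, counter)
        counter[0] += 1
        temp = f"t{counter[0]}"
        code.append(f"{temp} := -{arg}")
    else:                                  # binary operator (op, lhs, rhs)
        op, lhs, rhs = expr
        a = gen_tac(lhs, code, counter)
        b = gen_tac(rhs, code, counter)
        counter[0] += 1
        temp = f"t{counter[0]}"
        code.append(f"{temp} := {a} {op} {b}")
    return temp

# a := (-c * b) + (-c * d)
code = []
result = gen_tac(("+", ("*", ("neg", "c"), "b"), ("*", ("neg", "c"), "d")), code, [0])
code.append(f"a := {result}")
for line in code:
    print(line)   # t1 := -c, t2 := t1 * b, t3 := -c, t4 := t3 * d, t5 := t2 + t4, a := t5
```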
The three address codes can be represented in two forms: quadruples and triples.
Quadruples
Quadruples have four fields to implement the three-address code. The fields of a quadruple
contain the name of the operator, the first source operand, the second source operand and the
result, respectively.
Example
1. a := -b * c + d
t1 := -b
t2 := t1 * c
t3 := t2 + d
a := t3
     op       arg1   arg2   result
(0)  uminus   b      -      t1
(1)  *        t1     c      t2
(2)  +        t2     d      t3
(3)  :=       t3     -      a
Triples
Triples have three fields to implement the three-address code. The fields of a triple contain
the name of the operator, the first source operand and the second source operand.
In triples, the result of a sub-expression is denoted by the position (index) of the expression
that computes it. Triples are equivalent to a DAG when representing expressions.
Fig: Triples field
Example:
1. a := -b * c + d
t1 := -b
t2 := t1 * c
t3 := t2 + d
a := t3
     op       arg1   arg2
(0)  uminus   b      -
(1)  *        (0)    c
(2)  +        (1)    d
(3)  :=       (2)    a
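Both representations can be sketched as Python tuples. The encoding below is an illustrative assumption (an integer argument in a triple refers to the result of that earlier triple), using the statement a := -b * c + d evaluated as (-b * c) + d:

```python
# Quadruples: (op, arg1, arg2, result).
quadruples = [
    ("uminus", "b",  None, "t1"),
    ("*",      "t1", "c",  "t2"),
    ("+",      "t2", "d",  "t3"),
    (":=",     "t3", None, "a"),
]

# Triples: (op, arg1, arg2); an int argument is the index of an earlier triple.
triples = [
    ("uminus", "b", None),   # (0)  computes -b
    ("*",      0,   "c"),    # (1)  uses the result of triple (0)
    ("+",      1,   "d"),    # (2)
    (":=",     2,   "a"),    # (3)  a gets the value of (2)
]

def quad_to_string(op, arg1, arg2, result):
    """Render one quadruple back into a three-address statement."""
    if op == "uminus":
        return f"{result} := -{arg1}"
    if op == ":=":
        return f"{result} := {arg1}"
    return f"{result} := {arg1} {op} {arg2}"

for q in quadruples:
    print(quad_to_string(*q))
# t1 := -b
# t2 := t1 * c
# t3 := t2 + d
# a := t3
```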
Translation of Assignment Statements
In syntax directed translation, the assignment statement mainly deals with expressions. The
expression can be of type real, integer, array or record.
1. S → id := E
2. E → E1 + E2
3. E → E1 * E2
4. E → (E1)
5. E → id
S → id := E     { p = look_up(id.name);
                  if p ≠ nil then
                    Emit (p ':=' E.place)
                  else
                    error; }
E → E1 + E2     { E.place = newtemp();
                  Emit (E.place ':=' E1.place '+' E2.place) }
E → E1 * E2     { E.place = newtemp();
                  Emit (E.place ':=' E1.place '*' E2.place) }
E → (E1)        { E.place = E1.place }
E → id          { p = look_up(id.name);
                  if p ≠ nil then
                    E.place = p
                  else
                    error; }
● The Emit function appends a three-address instruction to the output. If a name is not
found in the symbol table, an error is reported.
Boolean expressions have two primary purposes. They are used for computing logical values,
and they are used as conditional expressions in constructs such as if-then-else or while-do.
1. E → E OR E
2. E → E AND E
3. E → NOT E
4. E → (E)
5. E → id relop id
6. E → TRUE
7. E → FALSE
AND and OR are left-associative. NOT has the highest precedence, then AND, and lastly OR.
E → E1 OR E2        { E.place = newtemp();
                      Emit (E.place ':=' E1.place 'OR' E2.place) }
E → E1 AND E2       { E.place = newtemp();
                      Emit (E.place ':=' E1.place 'AND' E2.place) }
E → NOT E1          { E.place = newtemp();
                      Emit (E.place ':=' 'NOT' E1.place) }
E → id1 relop id2   { E.place = newtemp();
                      Emit ('if' id1.place relop.op id2.place 'goto' nextstat + 3);
                      Emit (E.place ':=' '0');
                      Emit ('goto' nextstat + 2);
                      Emit (E.place ':=' '1') }
The Emit function is used to generate the three-address code and the newtemp( ) function is
used to generate the temporary variables.
The rule for E → id1 relop id2 uses nextstat, which gives the index of the next three-address
statement in the output sequence.
The goto statement alters the flow of control. If we implement goto statements then we need to
define a LABEL for a statement. A production can be added for this purpose:
1. S → LABEL : S
2. LABEL → id
A semantic action is attached to these productions to record the LABEL and its value in the
symbol table.
1. S → if E then S
2. S → if E then S else S
3. S → while E do S
4. S → begin L end
5. S→ A
6. L→ L ; S
7. L→ S
Symbol Table
The symbol table is used to store information about the occurrence of various entities such as
objects, classes, variable names, interfaces, function names etc. It is used by both the analysis
and synthesis phases.
● It is used to store the name of all entities in a structured form at one place.
A symbol table can be implemented either as a linear list or as a hash table. It maintains an
entry for each name in the following format:
Data structure for symbol table
● A compiler contains two types of symbol tables: a global symbol table and scope
symbol tables.
● The global symbol table can be accessed by all procedures, while a scope symbol table
is local to the procedure or block in which it is created.
The data-structure hierarchy of symbol tables is maintained by the semantic analyzer. To
search for a name in the symbol tables, the following approach is used:
● First search the symbol table of the current scope. If the name is found, the search is
complete; otherwise search the symbol table of the parent scope, and so on, until the
name is found or the global symbol table has been searched.
In the source program, every name possesses a region of validity, called the scope of that name.
1. A name declared inside a block is valid only within that block and the blocks nested
inside it.
2. If block B1 is nested within B2, then a name that is valid for block B2 is also valid for B1,
unless the name is re-declared in B1.
● These scope rules need a more complicated organization of symbol tables than a list of
associations between names and attributes.
● Tables are organized into a stack and each table contains the list of names and their
associated attributes.
● Whenever a new block is entered, a new table is pushed onto the stack. The new table
holds the names declared as local to this block.
● When a declaration is compiled, the topmost table is searched for the name.
● If the name is not found in the table, then the new name is inserted.
● When a reference to a name is translated, the tables are searched starting from the
topmost table on the stack and proceeding downwards.
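The stack-of-tables organization described above can be sketched as follows; the class name SymbolTableStack and its method names are assumptions for this example:

```python
class SymbolTableStack:
    """Entering a block pushes a fresh table; lookups search from the top down."""

    def __init__(self):
        self.stack = [{}]                    # the global table sits at the bottom

    def enter_block(self):
        self.stack.append({})                # new table for the new block

    def exit_block(self):
        self.stack.pop()                     # the block's locals go out of scope

    def declare(self, name, attrs):
        self.stack[-1][name] = attrs         # insert into the current block's table

    def lookup(self, name):
        for table in reversed(self.stack):   # innermost scope first
            if name in table:
                return table[name]
        return None                          # undeclared name

st = SymbolTableStack()
st.declare("x", "int")
st.enter_block()
st.declare("x", "float")                     # re-declaration shadows the outer x
print(st.lookup("x"))                        # float
st.exit_block()
print(st.lookup("x"))                        # int
```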
UNIT - 3
● When the target program executes, it runs in its own logical address space, in which
each program value has a location.
● The logical address space is shared among the compiler, operating system and target
machine for management and organization. The operating system maps the logical
addresses into physical addresses, which are usually spread throughout memory.
● Runtime storage is organized into blocks of contiguous bytes, where a byte is the
smallest unit of addressable memory. Four bytes typically form a machine word.
Multibyte objects are stored in consecutive bytes and are addressed by their first byte.
Storage Allocation
Static Allocation
● If memory is created at compile time, then it is created in a static area and only once.
● Static allocation does not support dynamic data structures: memory is created at
compile time and deallocated only after program completion.
● The drawback of static storage allocation is that the size and position of data objects
must be known at compile time.
Stack Allocation
● An activation record is pushed onto the stack when an activation begins, and it is
popped when the activation ends.
● The activation record contains the locals, so they are bound to fresh storage in each
activation. The values of locals are discarded when the activation ends.
● It works on a last-in-first-out (LIFO) basis, and this allocation supports recursion.
Heap Allocation
● Allocation and deallocation of memory can be done at any time and at any place,
depending on the user's requirement.
● Heap allocation is used to allocate memory to variables dynamically and to reclaim it
when the variables are no longer used.
Lexical Error
This type of error is detected during the lexical analysis phase.
A lexical error is a sequence of characters that does not match the pattern of any token. Lexical
errors are found at compile time, during scanning.
● Spelling error.
Syntax Error
This type of error appears during the syntax analysis phase. Syntax errors are detected at
compile time, when the parser checks the arrangement of tokens.
Some syntax error can be:
● Error in structure
● Missing operators
● Unbalanced parenthesis
When an invalid calculation enters into a calculator then a syntax error can also occur. This can
be caused by entering several decimal points in one number or by opening brackets without
closing them.
Semantic Error
During the semantic analysis phase, this type of error appears. These types of errors are
detected at compile time.
Most of the compile-time errors are scope and declaration errors, for example undeclared or
multiply declared identifiers. A type mismatch is another compile-time error.
A semantic error can arise from using the wrong variable, using the wrong operator, or
performing operations in the wrong order.
● Undeclared variable
UNIT - 4
Optimization is a program transformation technique which tries to improve the code by making
it consume fewer resources (i.e. CPU, memory) and deliver higher speed.
In optimization, high-level general programming constructs are replaced by very efficient
low-level programming codes. A code optimizing process must follow the three rules given
below:
● The output code must not, in any way, change the meaning of the program.
● Optimization should increase the speed of the program and, if possible, the program
should demand fewer resources.
● Optimization should itself be fast and should not delay the overall compiling process.
Efforts towards an optimized code can be made at various levels of the compilation process.
● At the beginning, users can change/rearrange the code or use better algorithms to write
the code.
● After generating intermediate code, the compiler can modify the intermediate code by
improving address calculations and loops.
● While producing the target machine code, the compiler can make use of memory
hierarchy and CPU registers.
Optimization can be categorized broadly into two types: machine-independent and
machine-dependent.
A DAG for a basic block is a directed acyclic graph with the following labels on nodes:
1. The leaves of the graph are labeled by unique identifiers, which can be variable names
or constants.
2. Interior nodes are labeled by an operator symbol.
3. Nodes are also given a sequence of identifiers as labels to store the computed value.
● It gives a picture representation of how the value computed by the statement is used in
subsequent statements.
Global data flow analysis
● To efficiently optimize the code, the compiler collects all the information about the
program and distributes this information to each block of the flow graph. This process is
known as data-flow analysis.
● Certain optimizations can only be achieved by examining the entire program; they
cannot be achieved by examining just a portion of the program.
● For this kind of optimization, use-definition (ud-) chaining is one particular problem.
● Here we try to find out which definition of a variable is applicable at a statement that
uses that variable.
Data flow analysis is used to discover this kind of property. The data flow analysis can be
performed on the program's control flow graph (CFG).
The control flow graph of a program is used to determine those parts of a program to which a
particular value assigned to a variable might propagate.
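A minimal reaching-definitions sketch over a hand-built control flow graph is shown below; the block names, the gen/kill sets and the iterative fixed-point loop are assumptions chosen to illustrate the idea:

```python
def reaching_definitions(blocks, preds):
    """Iterate IN[b] = union of OUT[p] over predecessors, OUT[b] = gen[b] | (IN[b] - kill[b])
    until a fixed point is reached."""
    IN = {b: set() for b in blocks}
    OUT = {b: set(blocks[b]["gen"]) for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            IN[b] = set().union(*(OUT[p] for p in preds[b])) if preds[b] else set()
            new_out = blocks[b]["gen"] | (IN[b] - blocks[b]["kill"])
            if new_out != OUT[b]:
                OUT[b] = new_out
                changed = True
    return IN, OUT

# B1 holds d1: i := 0;  B2 holds d2: i := i + 1 and loops back to itself.
blocks = {
    "B1": {"gen": {"d1"}, "kill": {"d2"}},
    "B2": {"gen": {"d2"}, "kill": {"d1"}},
}
preds = {"B1": [], "B2": ["B1", "B2"]}
IN, OUT = reaching_definitions(blocks, preds)
print(IN["B2"], OUT["B2"])   # both d1 and d2 reach B2's entry; only d2 leaves it
```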
Code Generator
Code generator is used to produce the target code for three-address statements. It uses
registers to store the operands of the three address statements.
Design Issues
1. Input to the code generator
2. Target program
3. Memory management
4. Instruction selection
5. Register allocation
6. Evaluation order
1. Input to the code generator
● The input to the code generator contains the intermediate representation of the source
program and the information of the symbol table. The source program is produced by the
front end.
● We assume the front end produces low-level intermediate representation i.e. values of
names in it can be directly manipulated by the machine instructions.
● The code generation phase requires complete, error-free intermediate code as input.
2. Target program:
The target program is the output of the code generator. The output can be:
a) Assembly language: It is easy to generate and makes the output of code generation
easier to check.
b) Relocatable machine language: Subprograms can be compiled separately and later
linked and loaded together by a linking loader.
c) Absolute machine language: It can be placed in a fixed location in memory and can be
executed immediately.
3. Memory management
● During the code generation process, symbol table entries have to be mapped to actual
data addresses, and labels have to be mapped to instruction addresses.
● Mapping names in the source program to addresses of data objects is done
cooperatively by the front end and the code generator.
● Local variables are stack-allocated in the activation record, while global variables are
kept in a static area.
4. Instruction selection:
● Nature of the instruction set of the target machine should be complete and uniform.
● When you consider the efficiency of the target machine then the instruction speed and
machine idioms are important factors.
● The quality of the generated code can be determined by its speed and size.
5. Register allocation
Register can be accessed faster than memory. The instructions involving operands in register
are shorter and faster than those involving memory operands.
Register allocation: In register allocation, we select the set of variables that will reside in the
register.
Register assignment: In register assignment, we pick the specific register in which each variable will reside.
Certain machines require even-odd pairs of registers for some operands and results.
6. Evaluation order
The efficiency of the target code can be affected by the order in which the computations are
performed. Some computation orders need fewer registers to hold intermediate results than
others.
Peephole Optimization
This optimization technique works locally on a small window of instructions to transform them
into optimized code. By locally, we mean the small portion of the code block at hand. These
methods can be applied to intermediate code as well as to target code. A bunch of statements
is analyzed and checked for the following possible optimizations:
● Unreachable code
● Strength reduction
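Two such checks can be sketched as rules over a small instruction window. The tuple instruction format and the two rules chosen here (dropping a redundant load that follows a store, and strength-reducing multiplication by 2 into an addition) are assumptions for this example:

```python
def peephole(code):
    """Apply simple window-of-one peephole rules to a list of (op, *args) tuples."""
    out = []
    for instr in code:
        op, *args = instr
        # Rule 1 (redundant load): STORE x, R followed by LOAD R, x does nothing.
        if (op == "LOAD" and out and out[-1][0] == "STORE"
                and list(out[-1][1:]) == args[::-1]):
            continue
        # Rule 2 (strength reduction): MUL R, 2 is replaced by the cheaper ADD R, R.
        if op == "MUL" and args[1] == 2:
            out.append(("ADD", args[0], args[0]))
            continue
        out.append(instr)
    return out

code = [("STORE", "x", "R0"), ("LOAD", "R0", "x"), ("MUL", "R1", 2)]
print(peephole(code))   # [('STORE', 'x', 'R0'), ('ADD', 'R1', 'R1')]
```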