UNIT I - CS8602 Compiler Design Notes
1.1 Translators:
The widely used translators that translate the code of a computer program into machine code
are:
1. Assemblers
2. Interpreters
3. Compilers
Assembler:
An Assembler converts an assembly program into machine code.
Compiler:
A compiler is a program that reads a program written in one language – the source language –
and translates it into an equivalent program in another language – the target language.
As an important part of this translation process, the compiler reports to its user the presence of
errors in the source program.
If the target program is an executable machine-language program, it can then be called by the
user to process inputs and produce outputs.
Advantages of Compiler:
1. Fast in execution
2. The object/executable code produced by a compiler can be distributed or executed without
having to have the compiler present.
3. The object program can be used whenever required, without the need for recompilation.
Disadvantages of Compiler:
1. Debugging a program is much harder; a compiler is therefore not so good at finding errors.
2. When an error is found, the whole program has to be re-compiled.
1.2.2 Interpretation:
Interpretation is the conceptual process of translating high-level source code into executable
code.
Interpreter:
An Interpreter is also a program that translates high-level source code into executable code.
However, the difference between a compiler and an interpreter is that an interpreter translates
one line at a time and then executes it: no object code is produced, so the program has to be
interpreted each time it is to be run. If the program executes a section of code 1000 times, then
that section is translated into machine code 1000 times, since each line is interpreted and then
executed.
Disadvantages of an Interpreter:
1. Rather slow
2. No object code is produced, so a translation has to be done every time the program is run.
3. For the program to run, the Interpreter must be present
Hybrid Compiler:
A hybrid compiler translates human-readable source code into an intermediate byte code for
later interpretation, so these languages have the features of both a compiler and an interpreter.
Such compilers are commonly associated with Just-In-Time (JIT) compilation. Java is a good
example of this approach: Java language processors combine compilation and interpretation. A
Java source program is first compiled into an intermediate form called bytecodes. The bytecodes
are then interpreted by a virtual machine.
(Figure: a source program is converted by a translator into an intermediate program; the
intermediate program and the input are then run on a virtual machine to produce output.)
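For instance, a minimal sketch (the file name, class name, and message are arbitrary) of the
two-step Java pipeline:

// HelloJIT.java -- a minimal example of Java's hybrid model.
// Step 1 (compile):   javac HelloJIT.java  -> produces HelloJIT.class (bytecode)
// Step 2 (interpret): java HelloJIT        -> the JVM interprets the bytecode,
//                     JIT-compiling frequently executed sections to machine code.
public class HelloJIT {
    public static void main(String[] args) {
        System.out.println("Compiled to bytecode, then interpreted by the JVM");
    }
}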
Compilers are not only used to translate a source language into assembly or machine language;
they are also used in other settings.
Language Processors:
A language processor is a program that processes the programs written in programming language
(source language). A part of a language processor is a language translator, which translates the
program from the source language into machine code, assembly language or other language.
1. Pre Processor
The preprocessor is system software that processes the source program before it is fed to the
compiler. It may perform functions such as macro processing, file inclusion, and language
extension.
2. Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language.
The difference lies in the way they read the source code or input: a compiler reads the whole
source program at once, whereas an interpreter reads, translates, and executes it one statement
at a time.
3. Assembler
An assembler translates assembly language programs into machine code. The output of an
assembler is called an object file, which contains a combination of machine instructions as well
as the data required to place these instructions in memory.
4. Linker
A linker is a computer program that links and merges various object files together in order to
make an executable file. All these files might have been compiled by separate assemblers. The
major task of a linker is to search and locate the referenced modules/routines in a program and
to determine the memory locations where these codes will be loaded, making the program
instructions have absolute references.
5. Loader
The loader is a part of the operating system and is responsible for loading executable files into
memory and executing them. It calculates the size of a program's instructions and data, creates
memory space for it, and initializes various registers to initiate execution.
Analysis and Synthesis:
1. Analysis:
The first three phases form the bulk of the analysis portion of a compiler. The analysis part
breaks up the source program into constituent pieces and creates an intermediate representation
of the source program. During analysis, the operations implied by the source program are
determined and recorded in a hierarchical structure called a syntax tree, in which each node
represents an operation and the children of a node represent the arguments of the operation.
Syntax tree for position := initial + rate * 60:

            :=
          /    \
   position     +
              /   \
       initial     *
                 /   \
             rate     60
2. Synthesis Part:
The synthesis part constructs the desired target program from the intermediate representation.
This part requires the most specialized techniques.
Lexical Analysis: The lexical analysis phase reads the characters in the source program and
groups them into a stream of tokens, in which each token represents a logically cohesive
sequence of characters, such as an identifier, a keyword (if, while, etc.), a punctuation
character, or a multi-character operator like :=. The character sequence forming a token is
called the lexeme of the token.
Certain tokens are augmented by a "lexical value". For example, when the identifier rate is
found, the lexical analyzer generates the token id and also enters rate into the symbol table, if
it does not already exist. The lexical value associated with this id then points to the
symbol-table entry for rate.
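As a minimal sketch of these two actions (the class and method names are illustrative, and
whitespace-separated input is assumed for simplicity), the following Java program tokenizes the
statement position := initial + rate * 60 and installs each identifier in a symbol table:

import java.util.*;

public class MiniLexer {
    private final List<String> symbolTable = new ArrayList<>();

    // Return the symbol-table index of a lexeme, inserting it if not already present.
    int install(String lexeme) {
        int i = symbolTable.indexOf(lexeme);
        if (i >= 0) return i;
        symbolTable.add(lexeme);
        return symbolTable.size() - 1;
    }

    List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        // Splitting on whitespace keeps the sketch short; a real lexer reads
        // character by character and recognizes lexemes with a finite automaton.
        for (String lexeme : input.trim().split("\\s+")) {
            if (lexeme.matches("[A-Za-z][A-Za-z0-9]*")) {
                tokens.add("id" + (install(lexeme) + 1)); // identifier + lexical value
            } else if (lexeme.matches("[0-9]+")) {
                tokens.add("num(" + lexeme + ")");        // numeric constant
            } else {
                tokens.add(lexeme);                       // operator or punctuation
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(new MiniLexer().tokenize("position := initial + rate * 60"));
        // prints: [id1, :=, id2, +, id3, *, num(60)]
    }
}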
Syntax Analysis:
The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical
analysis as input and generates a parse tree or syntax tree. In this phase, token arrangements
are checked against the grammar of the source language, i.e. the parser checks whether the
expression made by the tokens is syntactically correct.
It imposes a hierarchical structure on the token stream in the form of a parse tree or syntax
tree. The syntax tree can be represented using a suitable data structure, as in the sketch after
the trees below.
Syntax tree with identifier names:

            :=
          /    \
   position     +
              /   \
       initial     *
                 /   \
             rate     60

The same tree after lexical analysis, with identifiers replaced by tokens:

            :=
          /    \
        id1     +
              /   \
           id2     *
                 /   \
              id3     60
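One suitable data structure is a simple node class; the sketch below (all names illustrative)
builds the tree shown above:

// Each node holds an operator or operand label and references to its children,
// which represent the arguments of the operation.
class Node {
    final String label;
    final Node left, right;
    Node(String label, Node left, Node right) {      // interior node (operator)
        this.label = label; this.left = left; this.right = right;
    }
    Node(String label) { this(label, null, null); }  // leaf (identifier or constant)
}

class BuildTree {
    public static void main(String[] args) {
        // The tree for: position := initial + rate * 60
        Node tree = new Node(":=",
                new Node("position"),
                new Node("+",
                        new Node("initial"),
                        new Node("*", new Node("rate"), new Node("60"))));
        System.out.println(tree.label);  // prints :=
    }
}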
Semantic Analysis:
Semantic analysis checks whether the parse tree constructed follows the rules of the language:
for example, that assignment of values is between compatible data types, and that a string is
not added to an integer. The semantic analyzer also keeps track of identifiers, their types and
expressions, whether identifiers are declared before use, and so on. The semantic analyzer
produces an annotated syntax tree as its output.
This analysis inserts a conversion from integer to real into the syntax tree built above:
            :=
          /    \
   position     +
              /   \
       initial     *
                 /   \
             rate     inttoreal
                          |
                          60
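A sketch of how a semantic analyzer might insert this conversion, assuming a simplified type tag
on each node (all names here are illustrative, not a real compiler API):

// If an operand of a real-valued operation is an integer, wrap it in an
// "inttoreal" conversion node.
class TypedNode {
    String label, type;        // type is "int" or "real" (simplified)
    TypedNode left, right;
    TypedNode(String label, String type, TypedNode l, TypedNode r) {
        this.label = label; this.type = type; left = l; right = r;
    }
    static TypedNode coerce(TypedNode n) {
        return n.type.equals("int")
                ? new TypedNode("inttoreal", "real", n, null)  // insert conversion
                : n;
    }
    public static void main(String[] args) {
        TypedNode sixty = new TypedNode("60", "int", null, null);
        TypedNode rate  = new TypedNode("rate", "real", null, null);
        // rate is real, so 60 must be coerced before the multiplication:
        TypedNode mul = new TypedNode("*", "real", rate, coerce(sixty));
        System.out.println(mul.right.label);  // prints inttoreal
    }
}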
Intermediate Code Generation:
After semantic analysis, the compiler generates an intermediate code of the source program for
the target machine. It represents a program for some abstract machine and lies between the
high-level language and the machine language. This intermediate code should be generated in
such a way that it is easy to translate into the target machine code.
Intermediate code has two properties: it should be easy to produce and easy to translate into
the target program. An intermediate representation can have many forms. One such form is
three-address code, which is like the assembly language for a machine in which every memory
location can act like a register; each three-address instruction has at most three operands.
Example: The output of the semantic analysis can be represented in the following intermediate
form:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
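Three-address code of this form can be produced by a post-order walk of the syntax tree,
emitting one instruction per operator; the following is only a sketch with illustrative names,
not the notes' algorithm:

import java.util.*;

class Expr {
    String label; Expr left, right;
    Expr(String label, Expr l, Expr r) { this.label = label; left = l; right = r; }
    Expr(String label) { this(label, null, null); }
}

public class TacGen {
    private int temps = 0;
    private final List<String> code = new ArrayList<>();

    // Returns the name holding the value of e, emitting code as a side effect.
    String gen(Expr e) {
        if (e.left == null && e.right == null) return e.label;  // leaf: its own name
        if (e.label.equals(":=")) {                             // assignment statement
            code.add(e.left.label + " := " + gen(e.right));
            return e.left.label;
        }
        String l = gen(e.left);                                 // post-order: operands first
        String r = (e.right == null) ? null : gen(e.right);
        String t = "temp" + (++temps);                          // fresh temporary
        code.add(r == null
                ? t + " := " + e.label + "(" + l + ")"          // unary, e.g. inttoreal
                : t + " := " + l + " " + e.label + " " + r);
        return t;
    }

    public static void main(String[] args) {
        Expr tree = new Expr(":=", new Expr("id1"),
                new Expr("+", new Expr("id2"),
                        new Expr("*", new Expr("id3"),
                                new Expr("inttoreal", new Expr("60"), null))));
        TacGen g = new TacGen();
        g.gen(tree);
        g.code.forEach(System.out::println);  // prints the four lines shown above
    }
}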
Code Optimization:
The next phase performs code optimization of the intermediate code. Optimization can be viewed
as removing unnecessary code lines and arranging the sequence of statements so as to speed up
program execution without wasting resources (CPU, memory). In the following example, the
conversion of 60 from integer to real can be done once and for all at compile time, so the
inttoreal operation can be eliminated; temp3 is used only once, so it can also be eliminated.
Example:
temp1 := id3 * 60.0
id1 := id2 + temp1
Code Generation:
This is the final phase of the compiler; it generates the target code, consisting normally of
relocatable machine code or assembly code. Variables are assigned to registers.
Example:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
The first and second operands of each instruction specify a source and a destination,
respectively. The F in each instruction tells us that the instruction deals with floating-point
numbers. The # signifies that 60.0 is to be treated as a constant.
Activities of Compiler:
The symbol-table manager and the error handler are two other activities of the compiler, also
referred to as phases. These two activities interact with all six phases of a compiler.
The symbol table is a data structure containing a record for each identifier, with fields for the
attributes of the identifier.
The attributes of the identifiers may provide information about the storage allocated for an
identifier, its type, its scope (where in the program it is valid) and, in the case of procedure
names, the number and types of its arguments and the type returned.
The symbol table allows us to find the record for each identifier quickly and to store or
retrieve data from that record quickly. The attributes of identifiers cannot be determined
during the lexical analysis phase, but they can be determined during the syntax and semantic
analysis phases. Later phases, such as the code generator, use the symbol table to retrieve
details about the identifiers.
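A minimal sketch of such a symbol table, assuming a hash map keyed by the identifier's name (the
attribute fields shown are illustrative):

import java.util.*;

public class SymbolTable {
    static class Record {                  // one record per identifier
        String name, type, scope;          // attributes of the identifier
        Record(String name) { this.name = name; }
    }

    private final Map<String, Record> table = new HashMap<>();

    // The lexical analyzer enters the name; attributes are filled in later,
    // during the syntax and semantic analysis phases.
    void enter(String name) { table.putIfAbsent(name, new Record(name)); }
    void setType(String name, String type) { table.get(name).type = type; }
    Record lookup(String name) { return table.get(name); }

    public static void main(String[] args) {
        SymbolTable st = new SymbolTable();
        st.enter("rate");                            // during lexical analysis
        st.setType("rate", "real");                  // during semantic analysis
        System.out.println(st.lookup("rate").type);  // prints real
    }
}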
Each phase can encounter errors. After the detection of an error, a phase must somehow deal
with that error so that compilation can proceed, allowing further errors in the source program
to be detected.
Lexical Analysis Phase: If the characters remaining in the input do not form any token of the
language, then the lexical analysis phase detects the error.
Syntax Analysis Phase: A large fraction of errors is handled by the syntax and semantic analysis
phases. If the token stream violates the structure rules (syntax) of the language, then this
phase detects the error.
Semantic Analysis Phase: If the constructs have the right syntactic structure but no meaning for
the operation involved, then this phase detects the error. Ex. Adding two identifiers, one of
which is the name of an array and the other the name of a procedure.
There are relatively few errors which can be detected during lexical analysis.
i. Strange characters
Some programming languages do not use all possible characters, so any strange ones
which appear can be reported. However almost any character is allowed within a quoted
string.
ii. Missing quotes
Many programming languages do not allow quoted strings to extend over more than one
line; in such cases a missing quote can be detected.
If quoted strings can extend over multiple lines then a missing quote can cause quite a lot
of text to be 'swallowed up' before an error is detected.
Some errors, however, are beyond the lexical analyzer. For example:
fi ( a == 1) ....
Here fi is a valid identifier, but the open parenthesis that follows it suggests that fi is a
misspelling of the keyword if or an undeclared function identifier; the lexical analyzer cannot
tell which, and must leave the decision to a later phase.
During syntax analysis, the compiler is usually trying to decide what to do next on the basis of
expecting one of a small number of tokens. Hence in most cases it is possible to generate a
useful error message automatically, just by listing the tokens which would be acceptable at that
point.
Source: A + * B
Error: | Found '*', expect one of: Identifier, Constant, '('
More specific hand-tailored error messages may be needed in cases of bracket mismatch.
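A sketch of this idea (the helper below is hypothetical, not a real parser API): the parser
checks the current token against the set of acceptable ones and reports the set on a mismatch.

import java.util.*;

public class ExpectError {
    static void expect(String found, String... acceptable) {
        if (!Arrays.asList(acceptable).contains(found)) {
            System.out.println("Found '" + found + "', expect one of: "
                    + String.join(", ", acceptable));
        }
    }

    public static void main(String[] args) {
        // While parsing A + * B, the parser sees '*' where an operand must start:
        expect("*", "Identifier", "Constant", "(");
        // prints: Found '*', expect one of: Identifier, Constant, (
    }
}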
A parser should be able to detect and report any error in the program. It is expected that when
an error is encountered, the parser should be able to handle it and carry on parsing the rest of
the input. The parser is mostly expected to check for errors, but errors may be encountered at
various stages of the compilation process. A program may have the following kinds of errors at
the various stages:
Lexical : name of some identifier typed incorrectly
Syntactical : missing semicolon or unbalanced parenthesis
Semantical : incompatible value assignment
Logical : code not reachable, infinite loop
There are four common error-recovery strategies that can be implemented in the parser to deal
with errors in the code.
Panic mode
When a parser encounters an error anywhere in the statement, it ignores the rest of the
statement by not processing the input from the erroneous token up to a delimiter, such as a
semicolon. This is the easiest way of error recovery, and it also prevents the parser from
developing infinite loops.
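A sketch of panic-mode recovery over a token list, assuming the semicolon is the synchronizing
delimiter (names are illustrative):

import java.util.*;

public class PanicMode {
    // On an error at position pos, discard tokens until a ";" is found,
    // then resume parsing just past it.
    static int recover(List<String> tokens, int pos) {
        while (pos < tokens.size() && !tokens.get(pos).equals(";")) {
            pos++;                      // skip erroneous input
        }
        return pos + 1;                 // position of the next statement
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("x", "=", "@", "#", ";", "y", "=", "1", ";");
        int resumeAt = recover(tokens, 2);         // error detected at "@"
        System.out.println(tokens.get(resumeAt));  // prints y
    }
}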
Statement mode
When a parser encounters an error, it tries to take corrective measures so that the rest of the
statement allows the parser to parse ahead, for example by inserting a missing semicolon or
replacing a comma with a semicolon. Parser designers have to be careful here, because one wrong
correction may lead to an infinite loop.
Error productions
Some common errors that may occur in the code are known to the compiler designers. The designers
can create an augmented grammar to be used, with productions that generate the erroneous
constructs, so that these errors are recognized when encountered.
Global correction
The parser considers the program as a whole and tries to find the closest error-free match for
it: when an erroneous statement X is fed, it creates a parse tree for some closest error-free
statement Y. Global correction yields good diagnostics but is too costly to be practical, so it
is mainly of theoretical interest.
Syntax errors must be detected by a compiler and at least reported to the user (in a helpful way).
If possible, the compiler should make the appropriate correction(s). Semantic errors are much
harder and sometimes impossible for a computer to detect.
Front End:
The front end includes the phases of the compiler that depend primarily on the source language
and are largely independent of the target machine. The phases of the front end are:
Lexical Analysis
Syntactic Analysis
Creation of the symbol table
Semantic Analysis
Generation of the intermediate code
A part of code optimization
Error Handling that goes along with the above said phases
Back End:
The back end includes the phases of the compiler that depend on the target machine, and these
phases do not depend on the source language, but depend on the intermediate language. The
phases of back end are:
Code Optimization
Code Generation
Necessary Symbol table and error handling operations
Based on the grouping of phases, two types of compiler design are possible.
Compiler Construction Tools:
In order to automate the development of compilers, some general tools have been created. These
tools use specialized languages for specifying and implementing the components. The most
successful tools hide the details of the generation algorithm and produce components which can
be easily integrated into the remainder of the compiler. These tools are often referred to as
compiler-compilers, compiler-generators, or translator-writing systems. Examples include:
Parser generators: Automatically produce syntax analyzers from a grammatical description of a
programming language.
Scanner generators: Produce lexical analyzers from a regular-expression description of the
tokens of a language.
Syntax-directed translation engines: Produce collections of routines for walking a parse tree
and generating intermediate code.
Code-generator generators: Produce a code generator from a collection of rules for translating
each operation of the intermediate language into the machine language for a target machine.
Data-flow analysis engines: Facilitate the gathering of information about how values are
transmitted from one part of a program to each other part; data-flow analysis is a key part of
code optimization.
Compiler-construction toolkits: Provide an integrated set of routines for constructing various
phases of a compiler.
Scope Rules: The scope of a declaration of x is the context in which uses of x refer to this
declaration. A language uses static scope, or lexical scope, if it is possible to determine the
scope of a declaration by looking only at the program; the scope can then be determined by the
compiler. Otherwise, the language uses dynamic scope.
Example in Java:
public static int x;
The compiler can determine the location of integer x in memory.
The static-scope policy is as follows:
1. A C program consists of a sequence of top-level declarations of variables and functions.
2. Functions may have variable declarations within them, where variables include local
variables and parameters. The scope of each such declaration is restricted to the function
in which it appears.
3. The scope of a top-level declaration of a name x consists of the entire program that
follows, with the exception of those statements that lie within a function that also has a
declaration of x.
Block Structures:
Languages that allow blocks to be nested are said to have block structure. A name x in a nested
block B is in the scope of a declaration D of x in an enclosing block if there is no other
declaration of x in an intervening block.
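A small Java illustration of these rules (class and variable names are arbitrary): each use of x
is resolved at compile time to the nearest enclosing declaration.

public class ScopeDemo {
    static int x = 10;                    // top-level declaration of x

    static void f() {
        int x = 20;                       // local declaration shadows the outer x
        System.out.println(x);            // prints 20 -- nearest enclosing declaration
        System.out.println(ScopeDemo.x);  // prints 10 -- the outer x, named explicitly
    }

    public static void main(String[] args) {
        f();
        System.out.println(x);            // prints 10 -- the local x is out of scope here
    }
}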
5. Dynamic Scope:
The language uses dynamic scope if it is not possible to determine the scope of a
declaration during compile time.
Example in Java:
public int x;
With dynamic scope, as the program runs, the same use of x could refer to any of several
different declarations of x.
6. Parameter Passing Mechanisms: Parameters are passed from a calling procedure to the callee
either by value (call by value) or by reference (call by reference). Depending on the procedure
call, the actual parameters associated with the formal parameters will differ.
Call-By-Value:
In call-by-value, the actual parameter is evaluated (if it is an expression) or copied (if it is
a variable), and the value is placed in the location belonging to the corresponding formal
parameter of the called procedure. Changes to the formal parameter are therefore not reflected
in the actual parameter.
Call-By-Reference:
In call-by-reference, the address of the actual parameter is passed to the callee as the value of the
corresponding formal parameter. Uses of the formal parameter in the code of the callee are
implemented by following this pointer to the location indicated by the caller. Changes to the
formal parameter thus appear as changes to the actual parameter.
Call-By-Name:
A third mechanism — call-by-name — was used in the early programming language Algol 60. It
requires that the callee execute as if the actual parameter were substituted literally for the formal
parameter in the code of the callee, as if the formal parameter were a macro standing for the
actual parameter (with renaming of local names in the called procedure, to keep them distinct).
When large objects are passed by value, the values passed are really references to the objects
themselves, resulting in an effective call-by-reference.
7. Aliasing: When parameters are (effectively) passed by reference, two formal parameters can
refer to the same object; this is called aliasing. This possibility allows a change through one
variable to change another.
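A small Java illustration (names are arbitrary): both formal parameters are made to refer to the
same array, so an update through one is visible through the other.

public class AliasDemo {
    static void update(int[] a, int[] b) {
        a[0] = 99;                   // a and b alias the same array here,
        System.out.println(b[0]);    // so this prints 99
    }

    public static void main(String[] args) {
        int[] nums = {1, 2, 3};
        update(nums, nums);          // both parameters refer to nums
        System.out.println(nums[0]); // prints 99
    }
}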