Introduction to Compilers
It is easy to see that every concept in the design of computer software is inspired by life, and the same thought applies to the development of programming languages. Like natural languages (Hindi, English, and so on), programming languages are notations with which an ordinary person can communicate with a computer.
A programming language is a coded language used by programmers to write instructions that a computer can understand and carry out, so that it does what the programmer (or the computer user) wants. The most basic (low-level) computer language is machine language, which uses binary code ('1' and '0') that a computer can run (execute) very fast without any translator or interpreter program, but which is tedious and complex to write. High-level languages (such as BASIC, C and Java) are much simpler (more 'English-like') to use, but they need another program (a compiler or an interpreter) to convert the high-level code into machine code, and are therefore slower. There are dozens of programming languages, and new ones are continuously being developed. A programming language is also called a computer language.
According to Nico Habermann, "An ideal language allows us to express what is useful for the programming task and at the same time makes it difficult to write what leads to incomprehensible or incorrect programs."
According to Robert Harper, "A good language makes it easier to establish, verify and maintain the relationship between code and its properties."
In 1945, John von Neumann developed two important concepts that directly affected the path of computer programming languages. The first was known as the 'shared-program technique'. This technique states that the actual computer hardware should be simple and need not be hand-wired for each program. Instead, complex instructions should be used to control the hardware, allowing it to be reprogrammed much faster. The second concept was also extremely important for the development of programming languages. Von Neumann called it 'conditional control transfer'. This idea gave rise to the notion of subroutines, or small blocks of code that could be jumped to in any order, instead of a single set of chronologically ordered steps for the computer to take.
In 1949, a few years after von Neumann's work, the language Short Code appeared. It was the first computer language for electronic devices, and it required the programmer to change its statements into 0's and 1's by hand. In 1957, the first of the major languages appeared in the form of FORTRAN, whose name stands for FORmula TRANslating system. The language was designed at IBM for scientific computing. Though FORTRAN was good at handling numbers, it was not good at handling input and output, which mattered most to business computing. Business computing started to take off in 1959, and because of this COBOL was developed. COBOL statements have a very English-like grammar, making the language quite easy to learn. All of these features were designed to make it easier for the average business to learn and adopt it.
In 1958, John McCarthy of MIT created the LISt Processing (LISP) language. It was designed for Artificial Intelligence (AI) research. Because it was designed for such a highly specialized field, its syntax has rarely been seen before or since. The Algol language was created by a committee for scientific use in 1958. Its major contribution is being the root of the tree that has led to such languages as Pascal, C, C++ and Java. It was also the first language with a formal grammar, known as Backus-Naur Form or BNF. Pascal was begun in 1968 by Niklaus Wirth. Its development was mainly out of the necessity for a good teaching tool. Wirth later created a successor to Pascal, Modula-2, but by the time it appeared, C was gaining popularity and users at a rapid pace. C was developed in 1972 by Dennis Ritchie while working at Bell Labs in New Jersey. In the late 1970s and early 1980s, a new programming method, known as object-oriented programming (OOP), was being developed. Bjarne Stroustrup developed extensions to C known as "C with Classes". This set of extensions developed into the full-featured language C++, which was released in 1983.
In 1994, the Java project team changed their focus to the web. The next year, Netscape licensed Java for use in their Internet browser, Navigator. In 1964, BASIC was developed by John Kemeny and Thomas Kurtz; Microsoft has since extended BASIC in its Visual Basic (VB) product. In 1987, Perl was developed by Larry Wall; it has very strong text-matching functions, which make it ideal for designing gateways for web interfaces.
Programming languages have been under development for years and will remain so for many years to come. They got their start as lists of steps used to wire a computer to perform a task. These steps eventually found their way into software and began to acquire newer and better features. The first major languages were characterized by the simple fact that they were intended for one purpose only, while the languages of today are differentiated by the way they are programmed in, as they can be used for almost any purpose. And perhaps the language of tomorrow will be more natural with the invention of quantum and biological computers.
Programming languages are commonly classified into two broad families:
a. Imperative Languages
b. Functional Languages
The basic architecture of computers has a crucial effect on language design. Most of the popular
languages of the past 35 years have been designed around the prevalent computer architecture,
called the Von Neumann architecture, after one of its originators, John Von Neumann. These
languages are called imperative languages. Programmers using imperative languages must deal
with the management of variables and assignment of values to them. The results of this are
increased efficiency of execution but laborious construction of programs.
Procedure-oriented programming, which was the most popular software development paradigm in the 1970s, focuses on subprograms and subprogram libraries. Data are sent to subprograms for computation. Data-oriented programming focuses on abstract data types, and languages that support data-oriented programming are often called object-based languages. A language that is object-oriented must provide support for three key language features: abstract data types, inheritance and a particular kind of dynamic binding.
A high-level programming language is an advanced computer programming language that is not limited to one type of computer or one specific job and is more easily understood. In other words, a programming language designed to suit the requirements of the programmer, and independent of the internal machine code of any particular computer, is called a high-level programming language. For example, BASIC, FORTRAN, C++, Java and Visual Basic are high-level programming languages.
In contrast, low-level languages, such as assembly languages, closely reflect the machine codes of specific computers and are therefore described as machine-oriented languages. Unlike low-level languages, high-level languages are relatively easy to learn because the instructions bear a closer resemblance to everyday language, and because the programmer does not require detailed knowledge of the internal workings of the computer. Each instruction in a high-level language is equivalent to several machine-code instructions, so high-level programs are more compact than equivalent low-level programs. However, each high-level instruction must be translated into machine code, by either a compiler or an interpreter, before it can be executed by a computer. High-level languages are designed to be portable; that is, programs written in a high-level language can be run on any computer that has a compiler or interpreter for that particular language.
It is very important for any programming language that it is easily understandable to programmers and that they are able to determine how the expressions, statements and program modules of the language are formed. Programmers must be able to write software with the help of the manual of the programming language.
The syntax of a programming language is the form of its expressions, statements and program modules, and the semantics of a programming language is the meaning given to the various syntactic structures. Consider, for example, the skeleton of a conditional statement:
if (conditional expression)
{
    ...
    statements;
    ...
}
The meaning of this syntactic structure is that if the conditional expression evaluates to true, the statements of the block will be executed. We have seen that, like any other natural language, computer languages behave as communication media:
a. human to machine
b. machine to machine
c. machine to human
Unlike humans, machines are not fault tolerant; hence programming is often viewed as a complex task. Before moving on to the design of a compiler, it is essential to know about the structure of programs.
The lexical level of any program is the lowest level. At this level, computer programs are viewed as simple sequences of lexical items called tokens. In fact, at this level programs are considered as different groups of strings that make sense; such a group is called a token. A token may be composed of a single character or a sequence of characters. We can classify tokens as being one of the following:
Identifiers – In a particular language, the names chosen to represent data items, functions, procedures, etc. are called identifiers. Considerations for identifiers may differ from language to language; for example, some computer languages may support case sensitivity and others may not. The number of characters allowed in an identifier also depends on the design of the computer language.
Keywords – These are the names chosen by the language designer to represent particular language constructs; they cannot be used as identifiers and are sometimes referred to as reserved words. For example:
int id;
Here id is an identifier.
struct idl {
    int a;
    char b;
};
Here struct is a keyword and idl is an identifier.
Separators – These are punctuation marks used to group together sequences of tokens that have a "unit" meaning. When writing text it is often desirable to include punctuation; where such characters are also used (within the language) as separators, we must precede the punctuation character with what is called an escape character (usually a backslash '\').
Literals – Some languages support literals, which denote values directly. For example, 100, 3.14, 'a' and "hello" are literals denoting an integer, a real number, a character and a string respectively.
Comments – We know that a good program is one that is understandable. We can increase understandability by including meaningful comments in our code. Comments are omitted during processing. The start of a comment is typically indicated by a keyword (such as comment) or a separator, and it may also be ended by a separator.
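To make these token classes concrete, the following small C sketch (the token names, keyword list and classify_word function are illustrative assumptions, not part of any particular compiler) decides whether a word is a keyword, an identifier or a number:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Illustrative token classes for a tiny C-like language */
enum token_kind { TK_KEYWORD, TK_IDENTIFIER, TK_NUMBER, TK_UNKNOWN };

static const char *keywords[] = { "int", "char", "struct", "if", "else" };

/* Decide which class a single word belongs to */
enum token_kind classify_word(const char *word)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(word, keywords[i]) == 0)
            return TK_KEYWORD;
    if (isalpha((unsigned char)word[0]) || word[0] == '_')
        return TK_IDENTIFIER;
    if (isdigit((unsigned char)word[0]))
        return TK_NUMBER;
    return TK_UNKNOWN;
}

int main(void)
{
    /* struct is a keyword, idl is an identifier, 100 is a number */
    printf("%d %d %d\n", classify_word("struct"), classify_word("idl"), classify_word("100"));
    return 0;
}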
The syntactic level describes the way that program statements are constructed from tokens. This is always very precisely defined in terms of a context-free grammar. The best-known notations for this are BNF (Backus-Naur Form) and EBNF (Extended Backus-Naur Form). Syntax may also be described using a syntax tree.
Fig 1.1: A syntax tree for simple expressions built from <number> tokens and the operators '+' and '-'.
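The fragment of BNF below, for example, describes the kind of simple expressions whose syntax tree is sketched in Fig 1.1 (the rule names are chosen only for illustration):

<expression> ::= <number>
              | <expression> '+' <number>
              | <expression> '-' <number>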
The contextual level of analysis is concerned with the "context" in which program statements occur. Program statements usually contain identifiers whose values are dictated by earlier statements (especially in the case of the imperative or object-oriented paradigms). Consequently, the meaning of a statement depends on "what has gone before", that is, on its context. Context also determines whether a statement is "legal" or not (a typical context condition is that a data item must exist before it can be used).
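For instance, the following C fragment is well formed at the lexical and syntactic levels, but it violates a context condition because the identifier count is used before it has been declared (the fragment is purely illustrative):

int main(void)
{
    count = count + 1;   /* context error: count is used before it is declared */
    int count = 0;
    return count;
}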
Let us consider translation from French to English. When translating from French to English, the translator must comprehend each French sentence and produce a corresponding sentence in English. The human translator receives a text that is written in the source language (here French) and decomposes it into sentences and words using his knowledge of the language. This "knowledge" is stored in his "internal memory", that is, his brain, and it consists of the vocabulary of the language (used to recognize the words) and the grammar of the language (used to analyze the correctness of the sentences). Once the translator recognizes all the words and understands the meaning of the sentence, he starts the process of translating.
Tips – For any translator, the most difficult part comes from the fact that he or she cannot translate word by word, because the two languages do not have the same sentence patterns; in spite of this, the meaning of the sentence in both languages must be the same.
“Translators are software programs which can translate programs written in a source language into a target language.”
Fig. 1.2: A translator (a software program) converting a source language program into a target language program.
1.4.1 Assembler
An assembler is a system program that translates a program written in assembly language into its machine language equivalent (object code).
Fig. 1.3: An assembler translating an assembly language program into object code.
1.4.2 Compilers
Compilers are system programs that translate an input program in a high-level language into its machine language equivalent; they check for errors, optimize the code and prepare the code for execution. In other words, compilers are translators which translate a program written in a high-level language (like C++, FORTRAN, COBOL or Pascal) into machine code for some computer architecture (such as the Intel Pentium architecture). The generated code can later be executed many times against different data each time.
Fig 1.4: A compiler translating a source program into a target program.
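For example, a C source file might be compiled once and the resulting executable then run many times on different data (the file names here are only placeholders for the sake of illustration):

    gcc average.c -o average
    ./average data1.txt
    ./average data2.txt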
1.4.3 Interpreter
a. An interpreter is a system program that looks at and executes programs on a line-by-line basis rather than producing object code.
b. Whenever a program has to be executed repeatedly, the source code has to be interpreted every time. In contrast, a compiled program creates an object code, and it is this object code that is executed.
c. The entire source code needs to be present in memory during execution, as a result of which the memory space required is greater than that for compiled code.
1.4.4 Loaders
Loaders are system programs. A loader loads the binary code into memory ready for execution and transfers control to the first instruction. Loaders are responsible for locating the program in main memory every time it is executed.
a. Assemble-and-go loader – The assembler simply places the code into memory, and the loader executes a single instruction that transfers control to the starting instruction of the assembled program. In this scheme, some portion of the memory is used by the assembler itself, which would otherwise have been available for the object program.
b. Absolute loader – The object code must be loaded into absolute addresses in memory to run. If there are multiple subroutines, then each absolute address has to be specified explicitly.
c. Relocating loader – This loader modifies the actual instructions of the program during the process of loading, so that the effect of the load address is taken into account.
1.4.5 Linkers
Linking is the process of combining various pieces of code and data together to form a single executable that can be loaded in memory. Linking is of two main types:
a. Static Linking – All references are resolved during linking, before the program is loaded and run.
b. Dynamic Linking – References made to code in an external module are resolved at run time. This takes advantage of the full capabilities of virtual memory. The disadvantage is the considerable overhead and complexity incurred because these actions are postponed until run time.
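As a small, purely illustrative sketch of dynamic linking on a POSIX system, a C program can load a shared library and resolve one of its functions at run time through the dlopen interface (the library name libmath_helpers.so and the function cube are assumptions made up for this example):

#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    /* Resolve the library and the symbol at run time rather than at link time */
    void *lib = dlopen("./libmath_helpers.so", RTLD_LAZY);
    if (lib == NULL) {
        fprintf(stderr, "cannot load library: %s\n", dlerror());
        return 1;
    }
    double (*cube)(double) = (double (*)(double))dlsym(lib, "cube");
    if (cube != NULL)
        printf("%f\n", cube(3.0));
    dlclose(lib);
    return 0;
}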
You may never write a commercial compiler, but that’s not why we study compilers. We study
compiler construction for the following reasons:
a. Writing a compiler gives a student experience with large-scale application development. Your compiler program may be the largest program you write as a student. Experience working with really big data structures and complex interactions between algorithms will help you out on your next big programming project.
b. Compiler writing is one of the shining triumphs of Computer Science theory. It
demonstrates the value of theory over the impulse to just “hack up” a solution.
c. Compiler writing is a basic element of programming language research. Many language
researchers write compilers for the language they design.
d. Many applications have properties similar to one or more phases of a compiler, and compiler expertise and tools can help an application programmer working on projects besides compilers.
1.6 WHAT IS THE CHALLENGE IN COMPILER DESIGN?
Compiler writing is not an easy job. It is very challenging, and you must have a lot of knowledge about various fields of computer science. Let us discuss some of the challenges:
a. Many variations
b. The compiler itself must run fast (compilation time should be proportional to program size)
c. The generated code must work well with existing debuggers
d. Knowledge of algorithms and data structures (hash tables, graph algorithms, dynamic programming, etc.)
e. Knowledge of software engineering
f. Instruction parallelism
g. Parallel algorithms
In addition, any production compiler must balance several general requirements:
i. Correctness – Above all, the compiler must be correct: the code it generates must faithfully implement the meaning of the source program.
ii. Efficiency – In a production compiler, efficiency of the generated code and also efficiency of the compiler itself are important considerations. The early emphasis on correctness has consequences for your approach to the design of the implementation. Modularity and simplicity of the code are important for two reasons: first, your code is much more likely to be correct, and second, you will be able to respond to changes in the source language specification from lab to lab much more easily.
iii. Interoperability – Programs do not run in isolation, but are linked with library code
before they are executed, or will be called as a library from other code. This puts some
additional requirements on the compiler, which must respect certain interface
specifications. Your generated code will be required to execute correctly in the
environments on the lab machines. This means that you will have to respect calling
conventions early on (for example, properly save callee-save registers) and data layout
conventions later, when your code will be calling library functions.
iv. Usability – A compiler interacts with the programmer primarily when there are errors in the program. As such, it should give helpful error messages. Also, compilers may be instructed to generate debug information together with executable code in order to help users debug runtime errors in their programs.
v. Retargetability – At the outset, we think of a compiler as going from one source language to one target language. In practice, compilers may be required to generate more than one target from a given source (for example, x86-64 and ARM code), sometimes at very different levels of abstraction (for example, x86-64 assembly or LLVM intermediate code).
A compiler is composed of several components, also called phases, each performing one specific task. Each of these components works at a slightly higher language level than the previous one. The compiler must take an arbitrarily formed body of text and translate it into a binary stream that the computer can understand. The complete compilation procedure can be divided into six phases, and these phases can be grouped into two parts, as follows:
a. Analysis – In this part the source program is broken into its constituent pieces and an intermediate representation is created. Analysis is done in three phases:
i. Lexical Analysis
ii. Syntax Analysis
iii. Semantic Analysis
b. Synthesis – Synthesis constructs the desired target program from the intermediate representation. Synthesis is done in the following three phases:
i. Intermediate code generation
ii. Code optimization
iii. Code generation
Tips – Every phase of compilation can interact with a special data structure called the symbol table and with the error handler.
Fig. 1.5: Phases of a compiler. The language-dependent analysis phases (beginning with the lexical analyzer) and the machine-dependent synthesis phases transform the source program into the target program, interacting with the symbol table manager and the error handler.
Sometimes we divide synthesis as follows:
a. Intermediate code generation
b. Code optimization
c. Storage allocation
d. Code generation
Here, we are looking at storage allocation as a separate phase, whereas in the previous grouping it was part of code generation.
(Figure: the phases of compilation. The source program passes through the Lexical Analyzer, Syntax Analyzer, Semantic Analyzer, Intermediate Code Generator, Storage Allocation, Code Optimizer and Code Generator, with the Symbol Table Manager and Error Handler connected to every phase.)
In the past, compilers were divided into many passes to save space. When each pass is finished,
the compiler can free the space needed during the pass. Many modern compilers share a common
`two stage’ design. The first stage, the `compiler front end’ translates the source language into an
intermediate representation. The second stage, the `compiler back end’ works with the internal
representation to produce code in the output language.
The compiler front end consists of multiple phases itself, each informed by formal language
theory:
i. Lexical Analysis – breaking the source code text into small pieces ('tokens' or terminals), each representing a single atomic unit of the language. Token syntax is typically a regular language, so a finite state automaton constructed from a regular expression can be used to recognize it. This phase is also called lexing or scanning.
ii. Syntax Analysis – identifying the syntactic structure of the source code. It focuses only on the structure; in other words, it identifies the order of tokens and works out the hierarchical structure of the code. This phase is also called parsing.
iii. Semantic Analysis – recognizing the meaning of the program code and starting to prepare for output. Type checking is done in this phase, and this is where most compiler errors show up.
While there are applications where only the compiler front end is necessary, such as static language verification tools, a real compiler hands the intermediate representation generated by the front end to the back end, which produces a functionally equivalent program in the output language. This is done in multiple steps:
i. Optimization – the intermediate representation is transformed into functionally equivalent but faster (or smaller) forms.
ii. Code generation – the transformed intermediate language is translated into the output language, usually the native machine language of the system. This involves resource and storage decisions, such as deciding which variables to fit into registers and memory, and the selection and scheduling of appropriate machine instructions.
In order to perform this task, the lexical analyzer must know the keywords, identifiers, operators, delimiters and punctuation symbols of the language being implemented. At the same time as forming characters into symbols, the lexical analyser should deal with (multiple) spaces and remove comments and any other characters not relevant to the later stages of analysis; the design of the lexical analyzer must provide for all of this. To specify the tokens of the language, the concept of regular expressions from automata theory is used, and recognition of tokens is done by deterministic finite automata (we will see these concepts in a later chapter).
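As a small illustration of this idea, the C sketch below (only a sketch, with names chosen for this example) implements a deterministic finite automaton for the regular expression letter (letter | digit)*, which describes identifiers in many languages:

#include <stdio.h>
#include <ctype.h>

/* States of a tiny DFA for the pattern: letter (letter | digit)* */
enum state { START, IN_ID, REJECTED };

int is_identifier(const char *s)
{
    enum state st = START;
    for (; *s != '\0'; s++) {
        if (st == START)
            st = isalpha((unsigned char)*s) ? IN_ID : REJECTED;
        else if (st == IN_ID)
            st = isalnum((unsigned char)*s) ? IN_ID : REJECTED;
        else
            return 0;          /* once rejected, stay rejected */
    }
    return st == IN_ID;        /* accept only if we end in the accepting state */
}

int main(void)
{
    /* rate and id1 are identifiers; 1x is not, because it starts with a digit */
    printf("%d %d %d\n", is_identifier("rate"), is_identifier("id1"), is_identifier("1x"));
    return 0;
}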
The syntax analyzer is the module in which the overall structure of the program is identified; this involves an understanding of the order in which the symbols in a program may appear.
We know that it is the scanner's duty to recognize the individual words, or tokens, of the language. The lexical analyzer does not, however, recognize whether these words have been used correctly. The main task of the parser is to group the tokens into sentences, that is, to determine whether the sequence of tokens extracted by the lexical analyser is in the correct order. In other words, until the syntax analyser (parser) is reached, the tokens have been collected with no regard to the context of the program as a whole. The parser analyzes the context of each token and groups the tokens into declarations, statements and control statements.
Tips – In the process of analyzing each sentence, the parser builds abstract tree structures.
(Figure: a fragment of an abstract syntax tree for an assignment, with an assign op (:=) node and a leaf holding the value 100.)
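A parser commonly represents such abstract trees with a node structure along the following lines; this C sketch shows only the idea, not the representation used by any particular compiler:

/* One node of an abstract syntax tree */
enum node_kind { NODE_ASSIGN, NODE_ADD, NODE_IDENT, NODE_NUMBER };

struct ast_node {
    enum node_kind kind;
    const char *name;          /* identifier name, when kind == NODE_IDENT     */
    int value;                 /* literal value, when kind == NODE_NUMBER      */
    struct ast_node *left;     /* left child, e.g. the target of an assignment */
    struct ast_node *right;    /* right child, e.g. the assigned expression    */
};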
Some features of programming languages cannot be checked in a single left-to-right scan of the source code without setting up arbitrarily sized tables to provide access to information that may be an arbitrary distance away when it is required. For example, information concerning the type and scope of variables falls into this category.
The semantic analyzer gathers type information and checks the tree produced by the syntax analyzer for semantic errors. Let us consider a statement such as sum := oldsum + rate * 60, where rate has been declared real. Here the semantic analyzer might add a type conversion node, say inttoreal, to the syntax tree to convert the integer 60 into a real quantity.
Fig. 1.12: Syntax tree for the assignment after semantic analysis, showing the assign op and add op nodes, with an inttoreal node inserted above num (60) so that the integer is converted to a real value before being combined with rate.
Though the parser determines the correct usage of tokens, and whether or not they appear in the correct order, it still does not determine whether the program says anything that makes sense. This type of checking occurs at the semantic level. In order to perform this task, the compiler makes use of a detailed system of lists, known as the symbol table. The compiler needs a symbol table to record each identifier and collect information about it.
Tips – For a variable, the symbol table might record its type-expression (integer, float, etc.), its scope (where the variable can be used) and its location in run-time storage. For a procedure, the symbol table might record the type-expressions of its arguments and the type-expression of its returned value.
The symbol table manager has a FIND function that returns a pointer to the descriptor for an identifier (a descriptor is a record which contains all the information about the identifier) when given its lexeme. Compiler phases use this pointer to read and/or modify information about the identifier. FIND returns a NULL pointer if there is no record for a given lexeme. The INSERT function in the symbol table manager inserts a new record into the symbol table when given its lexeme.
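A very small sketch of such a symbol table manager in C might look as follows; the descriptor fields and the fixed-size array are simplifications assumed only for this illustration (real compilers usually use hash tables):

#include <stdio.h>
#include <string.h>

/* A descriptor records what the compiler knows about one identifier */
struct descriptor {
    char lexeme[32];
    char type[16];        /* e.g. "int" or "float"        */
    int  offset;          /* location in run-time storage */
};

static struct descriptor table[100];
static int count = 0;

/* FIND: return a pointer to the descriptor for a lexeme, or NULL if absent */
struct descriptor *find(const char *lexeme)
{
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].lexeme, lexeme) == 0)
            return &table[i];
    return NULL;
}

/* INSERT: add a new record for a lexeme and return its descriptor   */
/* (no overflow check on the table size; this is only a sketch)      */
struct descriptor *insert(const char *lexeme, const char *type, int offset)
{
    struct descriptor *d = &table[count++];
    strncpy(d->lexeme, lexeme, sizeof d->lexeme - 1);
    strncpy(d->type, type, sizeof d->type - 1);
    d->offset = offset;
    return d;
}

int main(void)
{
    insert("rate", "float", 8);
    struct descriptor *d = find("rate");
    if (d != NULL)
        printf("%s : %s at offset %d\n", d->lexeme, d->type, d->offset);
    return 0;
}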
After semantic analysis, many compilers generate an intermediate representation of the source program that is both easy to produce and easy to translate into the target program. A variety of forms are used for intermediate code. One such form is called three-address code; it looks like code for a memory-to-memory machine, where every operation reads operands from memory and writes results into memory. The intermediate code generator usually has to create temporary locations to hold intermediate results.
After the compiler has broken the stream of text down into tokens, parsed the tokens, created the tree and finally performed the necessary type checking, it might create a sequence of three-address instructions. Three-address instructions contain no more than three operands (registers) per instruction and, in addition to an assignment, contain only one other operator. Three-address instructions also assume an unlimited supply of registers.
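For the assignment discussed above (with id1, id2 and id3 standing for sum, oldsum and rate), the intermediate code generator might emit a sequence such as the following, where t1, t2 and t3 are compiler-created temporaries:

    t1 := inttoreal(60)
    t2 := id3 * t1
    t3 := id2 + t2
    id1 := t3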
The structure of the tree generated by the parser can be rearranged to suit the needs of the machine architecture, or the tree can be restructured in order to produce object code that runs faster. The result is a simplified tree that may be used to generate the object code.
1.8.8 Storage-allocation
Every constant and variable appearing in the program must have storage space allocated for its value during the storage allocation phase. This storage space may be one of the following types:
a. Static Storage – If the lifetime of a variable is the lifetime of the program, the space for its value, once allocated, cannot later be released. This kind of allocation is called static storage allocation.
b. Dynamic Storage – We follow dynamic storage allocation if the lifetime of a variable is a particular block, function or procedure, so that the space may be released when the block, function or procedure in which it is allocated is left.
c. Global Storage – This is used if the lifetime is unknown at compile time and the space has to be allocated and de-allocated at run time. The efficient control of such storage usually implies run-time overheads.
Tips – After space is allocated by the storage allocation phase, an address, containing as much as
is known about its location at compile time, is passed to the code generator for its use.
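In C, for instance, the three kinds of storage can be illustrated as follows (the variable and function names are arbitrary):

#include <stdlib.h>

static int total = 0;          /* static storage: lives for the whole run of the program  */

void count_item(void)
{
    int local = 1;             /* storage tied to this call: released when the call ends  */
    total = total + local;
}

int *make_buffer(int n)
{
    /* storage whose lifetime is unknown at compile time: it is allocated */
    /* and released explicitly at run time                                */
    int *buffer = malloc(n * sizeof *buffer);
    return buffer;             /* the caller must eventually call free(buffer) */
}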
1.8.9 Code Generation
This is the part of the compiler where native machine code is actually generated. If the target machine has registers, they are used to hold intermediate values in the target program. For example, for the statement considered earlier, recall that sum, oldsum and rate are identifiers; let us call them id1, id2 and id3. Part of the generated code (assuming that register R2 already holds the value of rate * 60, converted to real) might be:
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
During this step, the compiler has to map the names from the three-address intermediate code onto the very finite number of registers available on the machine. The final result is a piece of code that is (mostly) executable. None of the remaining steps prior to the actual execution of the program are part of the compiler's responsibility. What remains to be done is to link the program and build a binary executable. In this phase, various libraries as well as the main executable portion of the program are combined to form a fully-fledged application. There are various types of linking, most of which are done prior to run time. The final phase that a program must pass through prior to execution is being loaded into memory. In this phase, the addresses in the binary code are translated from logical addresses, and any final run-time bindings are performed.
The object code can be further optimized by a peephole optimizer; this step is present in more sophisticated compilers. The purpose is to produce a more efficient object program that can be executed in a shorter time. Optimization is performed at a local level and takes advantage of certain operator properties such as commutativity, associativity and distributivity; a small example is sketched below. To see a summary of the entire process of compilation from start to finish, refer to the figure that follows.
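For instance, a peephole optimizer that examines a small window of generated instructions can delete a redundant load (the fragment below uses the same illustrative MOVF notation as the code generation example):

    MOVF R1, id1
    MOVF id1, R1      ; redundant: R1 already holds the value of id1,
                      ; so this instruction can be removed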
Fig. 1.13: A summary of the compilation process, from the input character stream, through the token stream produced by the first phase (the scanner) and the later phases, to the optimizer and the peephole optimizer that produce the final optimized code.
Detection and reporting of errors in the source program is a main function of the compiler. Errors may occur in any phase of compilation. A good compiler must determine exactly the line of the program where an error has occurred. The following are the various errors which may occur at different levels of compilation:
a. The first of these are lexical (scanner) errors. Some of the most common types here are illegal or unrecognized characters, mainly caused by typing errors. A common way for this to happen is for the programmer to type a character that is illegal in any context in the language and is never used. Another type of error that the scanner quite commonly detects is an unterminated character or string constant. This happens whenever the programmer types something in quotes and forgets the trailing quote. Again, these are mostly typing errors.
b. The second class of errors is syntactic in nature and is caught by the parser. These errors are among the most common. The really difficult part is to decide from where to continue the syntactic analysis after an error has been found. If the parser is not carefully written, or if the error detection and recovery scheme is sloppy, the parser will hit one error, 'mess up' after that, and cascade spurious error messages throughout the rest of the program. In the case of an error, what one would like to see happen is for the compiler to skip any improper tokens and continue to detect errors without generating messages that are not really errors but merely consequences of the first error. This aspect is so important that some compilers are categorized based on how good their error detection and recovery scheme is.
c. The third type of error is semantic in nature. The semantics used in computer languages are by far simpler than the semantics used in spoken languages. This is the case because in computer languages everything is very precisely defined; there is no nonsense implied or used. The semantic errors that may occur in a program are related to the fact that some statements may be correct from the syntactic point of view, but they make no sense and no code can be generated to carry out the meaning of the statement.
d. The fourth type of error may be encountered during code optimization; for example, in control-flow analysis there may exist some statements which can never be reached.
e. The fifth type of error may occur during code generation, since in code generation the architecture of the computer also plays an important role. For example, there may exist a constant that is too large to fit in a word of the target machine.
f. The sixth type of error may be encountered when the compiler tries to make symbol table entries; for example, there may exist an identifier that has multiple declarations with contradictory attributes.
We have seen the structure of the compiler: it is divided into six phases, each of which has its own unique job to perform, and the output of one phase is used as input to the next. We have already seen that the phases can be grouped into an analysis part and a synthesis part. On the basis of how the phases are grouped into passes, compilers may be multi-pass or one-pass compilers. Note that:
a. Sometimes a single "pass" corresponds to several phases that are interleaved in time.
b. What and how many passes a compiler makes over the source program is an important design decision.
Consider a compiler that scans the input source once, produces a first modified form, then scans the first modified form and produces a second modified form, and so on, until the object form is produced. Such a compiler is called a multi-pass compiler. In the case of a multi-pass compiler, each function of the compiler can be performed by one pass of the compiler. For instance, the first pass can read the input source, scan and extract the tokens and store the result in an output file. The second pass can read the file which was produced in the first pass, do the syntactic analysis by building a syntax tree and associate with each node of the tree all the information relating to it. The output of the second pass is then a file containing the syntax tree. The third pass can read the output file produced by the second pass and perform the optimization by restructuring the tree.
A multi-pass compiler makes several passes over the program. The output of a preceding phase
is stored in a data structure and used by subsequent phases.
Fig. 1.14: Structure of a multi-pass compiler.
In a one-pass compiler, when a line of source is processed, it is scanned and the tokens are extracted. Then the syntax of the line is analyzed, and the tree structure and some tables containing information about each token are built. Finally, after the semantic part is checked for correctness, the code is generated. The same process is repeated for each line of code until the entire program is compiled. Usually, the entire compiler is built around the parser, which calls procedures that perform the different functions.
A single pass compiler makes a single pass over the source text, parsing, analyzing and
generating code all at once.
Dependency diagram of a typical single pass compiler:
Fig. 1.15: Dependency diagram of a typical single-pass compiler. The compiler driver calls the syntactic analyzer, which in turn calls the contextual analyzer and the code generator.
1.9.3 Advantages and Disadvantages for Both Single and Multipass Compilers
a. A one-pass compiler is fast, since all the compiler code is loaded into memory at once. It can process the source text without the overhead of the operating system having to shut down one process and start another. Also, the output of each pass of a multi-pass compiler is stored on disk and must be read in each time the next pass starts.
b. On the other hand, a one-pass compiler tends to impose some restrictions upon the program: constants, types, variables and procedures must be defined before they are used. A multi-pass compiler does not impose this type of restriction upon the user. An operation that cannot be performed because of a lack of information can be deferred to a later pass of the compiler, when the text has been traversed and the needed information made available.
c. The components of a one-pass compiler are much more closely inter-related than the components of a multi-pass compiler. This requires all the programmers working on the project to have knowledge about the entire project. A multi-pass compiler can be decomposed into passes that are relatively independent, hence a team of programmers can work on the project with little interaction among them. Each pass of the compiler can be regarded as a mini-compiler, having an input source written in one intermediary language and producing an output written in another intermediary language.
a. A one-pass compiler is a compiler that passes through the source code of each compilation unit only once, whereas a multi-pass compiler processes the source code or abstract syntax tree of a program several times.
b. Multi-pass compilers are sometimes called wide compilers, whereas one-pass compilers are sometimes called narrow compilers.
c. Not all programming languages can be handled by a single-pass compiler; for example, Pascal can be implemented with a single-pass compiler, whereas languages like Java require a multi-pass compiler.