Compiler Design unit-1
1.1 Introduction:
A compiler is a program that can read a program in one language (the source language) and translate it into an
equivalent program in another language (the target language).
If the target program is an executable machine-language program, it can then be called by the user to process inputs and
produce outputs.
An interpreter is another common kind of language processor. Instead of producing a target program as a translation,
an interpreter appears to directly execute the operations specified in the source program on inputs supplied by the user.
The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping
inputs to outputs. An interpreter, however, can usually give better error diagnostics than a compiler, because it executes
the source program statement by statement.
Example:
A Java source program may first be compiled into an intermediate form called bytecodes. The bytecodes are
then interpreted by a virtual machine. A benefit of this arrangement is that bytecodes compiled on one machine can be
interpreted on another machine
In order to achieve faster processing of inputs to outputs, some Java compilers, called just-in-time compilers
(JIT compilers), translate the bytecodes into machine language immediately before they run the intermediate program
to process the input.
A source program may be divided into modules stored in separate files. The task of collecting the source program
is sometimes entrusted to a separate program, called a preprocessor. The preprocessor may also expand shorthands,
called macros, into source language statements.
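For instance, a C preprocessor expands a macro before the compiler proper ever sees the code. A minimal illustration (the macro name CIRCLE_AREA is made up for this sketch):

    #include <stdio.h>

    /* A macro is a shorthand: the preprocessor textually replaces every use of
       CIRCLE_AREA(r) below before the compiler proper ever sees the code.
       (CIRCLE_AREA is an illustrative name, not from the notes.)              */
    #define CIRCLE_AREA(r) (3.14159 * (r) * (r))

    int main(void) {
        /* After preprocessing, the next line reads:
           printf("%f\n", (3.14159 * (2.0) * (2.0)));                          */
        printf("%f\n", CIRCLE_AREA(2.0));
        return 0;
    }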
The modified source program is then fed to a compiler. The compiler may produce an assembly-language
program as its output, because assembly language is easier to produce as output and is easier to debug. The assembly
language is then processed by a program called an assembler that produces relocatable machine code as its output.
Large programs are often compiled in pieces, so the relocatable machine code may have to be linked together
with other relocatable object files and library files into the code that actually runs on the machine.
The linker resolves external memory addresses, where the code in one file may refer to a location in another
file. The loader then puts together all of the executable object files into memory for execution.
1.3 The Structure of a Compiler:
There are two parts in the compiler: analysis and synthesis. The analysis part is often called the front end of the compiler;
the synthesis part is the back end.
1.3.1 The Analysis:
➢ The analysis part breaks up the source program into constituent pieces and imposes a grammatical
structure on them. It then uses this structure to create an intermediate representation of the source
program.
➢ If the analysis part detects that the source program is syntactically malformed or semantically unsound, then
it must provide informative messages, so the user can take corrective action.
➢ The analysis part also collects information about the source program and stores it in a data structure
called a symbol table, which is passed along with the intermediate representation to the synthesis part.
The compilation process is a sequence of phases, each of which transforms one representation of the source
program to another. Several phases may be grouped together, and the intermediate representations between the grouped
phases need not be constructed explicitly.
The symbol table, which stores information about the entire source program, is used by all phases of the compiler.
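As a rough sketch of what such a data structure can look like (the entry layout, fixed-size array, and function name are illustrative simplifications, not a prescribed design):

    #include <string.h>

    /* Illustrative symbol-table entry; real compilers store much more
       (scope, storage location, line numbers, ...).                    */
    struct SymEntry {
        char name[64];              /* lexeme of the identifier */
        char type[16];              /* e.g. "int", "float"      */
    };

    static struct SymEntry table[512];
    static int n_entries = 0;

    /* Return the index of name, inserting a new entry if it is absent. */
    int sym_lookup_or_insert(const char *name) {
        for (int i = 0; i < n_entries; i++)
            if (strcmp(table[i].name, name) == 0)
                return i;
        strncpy(table[n_entries].name, name, sizeof table[n_entries].name - 1);
        table[n_entries].type[0] = '\0';   /* type is filled in by later phases */
        return n_entries++;
    }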
Lexical Analysis:
The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the stream of
characters making up the source program and groups the characters into meaningful sequences called lexemes. For each
lexeme, the lexical analyzer produces as output a token of the form 〈token-name, attribute-value〉 that it passes on to the
subsequent phase, syntax analysis.
In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the
second component attribute-value points to an entry in the symbol table for this token. Information from the symbol-
table entry is needed for semantic analysis and code generation.
Example:
position = initial + rate * 60
1. position is a lexeme that would be mapped into a token 〈id,1〉, where id
is an abstract symbol standing for identifier and 1 points to the symbol
table entry for position. The symbol-table entry for an identifier holds
information about the identifier, such as its name and type.
2. The assignment symbol = is a lexeme that is mapped into the token 〈=〉. Since this token needs no
attribute-value, we have omitted the second component. We could have used any abstract symbol such
as assign for the token-name, but for notational convenience we have chosen to use the lexeme itself as
the name of the abstract symbol.
3. initial is a lexeme that is mapped into the token 〈id,2〉, where 2 points to the symbol-table entry for
initial.
4. + is a lexeme that is mapped into the token 〈+〉.
5. rate is a lexeme that is mapped into the token 〈id,3〉, where 3 points to the symbol-table entry for
rate.
6. * is a lexeme that is mapped into the token 〈*〉.
7. 60 is a lexeme that is mapped into the token 〈60〉.
After lexical analysis, the assignment statement is represented by the sequence of tokens 〈id,1〉 〈=〉 〈id,2〉 〈+〉 〈id,3〉 〈*〉 〈60〉
In this representation, the token names =, +, and * are abstract symbols for the assignment, addition, and
multiplication operators, respectively.
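A minimal sketch of how such 〈token-name, attribute-value〉 pairs might be represented in C (the enum values and field names are illustrative; real token sets are larger):

    /* Illustrative token representation for  position = initial + rate * 60  */
    enum TokenName { TOK_ID, TOK_ASSIGN, TOK_PLUS, TOK_TIMES, TOK_NUMBER };

    struct Token {
        enum TokenName name;   /* abstract symbol used during syntax analysis   */
        int attribute;         /* symbol-table index for ids, value for numbers */
    };

    /* The token stream of the example, with 1, 2 and 3 as the symbol-table
       indices of position, initial and rate:                                 */
    struct Token example[] = {
        { TOK_ID, 1 }, { TOK_ASSIGN, 0 }, { TOK_ID, 2 }, { TOK_PLUS, 0 },
        { TOK_ID, 3 }, { TOK_TIMES, 0 },  { TOK_NUMBER, 60 }
    };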
Syntax Analysis:
The second phase of the compiler is syntax analysis or parsing. The parser uses the first components of the
tokens produced by the lexical analyzer to create a tree-like intermediate representation that depicts the grammatical
structure of the token stream.
A typical representation is a syntax tree in which each interior node represents an operation and the children of
the node represent the arguments of the operation.
1. The tree has an interior node labelled * with 〈id,3〉as its left child and the integer 60 as its right child.
2. The node 〈id,3〉represents the identifier rate.
3. The node labelled * makes it explicit that we must first multiply the value of rate by 60.
4. The node labelled + indicates that we must add the result of this multiplication to the value of initial.
5. The root of the tree, labelled =, indicates that we must store the result of this addition into the location
for the identifier position.
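A small sketch of such a syntax tree in C, built for the example statement (node layout and helper names are illustrative assumptions):

    #include <stdlib.h>

    /* Illustrative syntax-tree node: either an interior operator node with
       two children, or a leaf holding a symbol-table index or a constant.  */
    struct Node {
        char op;                    /* '=', '+', '*'; 0 for a leaf      */
        int  value;                 /* symbol-table index or constant   */
        struct Node *left, *right;
    };

    static struct Node *leaf(int v) {
        struct Node *n = calloc(1, sizeof *n);
        n->value = v;
        return n;
    }

    static struct Node *interior(char op, struct Node *l, struct Node *r) {
        struct Node *n = calloc(1, sizeof *n);
        n->op = op; n->left = l; n->right = r;
        return n;
    }

    /* Tree for  position = initial + rate * 60  (ids 1, 2, 3 as before):
       the root is '=', its right child '+', whose right child is '*'.   */
    struct Node *build_example(void) {
        return interior('=', leaf(1),
               interior('+', leaf(2),
               interior('*', leaf(3), leaf(60))));
    }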
Semantic Analysis:
The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program
for semantic consistency with the language definition. It also gathers type information and saves it in either the syntax
tree or the symbol table, for subsequent use during intermediate-code generation.
An important part of semantic analysis is type checking, where the compiler checks that each operator has
matching operands. For example, many programming language definitions require an array index to be an integer; the
compiler must report an error if a floating-point number is used to index an array.
The language specification may permit some type conversions called coercions.
For example, a binary arithmetic operator may be applied to either a pair of integers or to a pair of floating-
point numbers. If the operator is applied to a floating-point number and an integer, the compiler may convert or coerce
the integer into a floating-point number.
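The same coercion can be seen in C itself, where the compiler converts the int operand before the multiplication:

    #include <stdio.h>

    int main(void) {
        float rate   = 2.5f;
        int   factor = 60;
        /* The int operand is coerced to float before the multiplication;
           conceptually the compiler treats this as  rate * (float)factor. */
        float result = rate * factor;
        printf("%f\n", result);
        return 0;
    }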
Code Generation:
The code generator takes as input an intermediate representation of the source program and maps it into the
target language. If the target language is machine code, registers or memory locations are selected for each of the
variables used by the program.
Then, the intermediate instructions are translated into sequences of machine instructions that perform the same
task. A crucial aspect of code generation is the judicious assignment of registers to hold variables.
Bootstrapping is the process of using an existing compiler for a programming language to compile a new or improved
version of the same compiler (refer to GeeksforGeeks for a fuller discussion).
Reducing the Number of Passes
The phases of a compiler are grouped into passes for practical reasons. Grouping several phases into a single pass
makes compilation faster, because the compiler need not write the intermediate representation produced by one phase to a
file and read it back in the next; the overhead of reading from and writing to intermediate files is eliminated. On the other
hand, when phases are grouped into one pass, the whole program (or a large part of its internal representation) must be
kept in memory to guarantee a proper flow of data to each phase, since one phase may require information in a different
order than the order in which the previous phase produces it. The internal representation of the source or target program
is typically larger than the input or the output, so grouping phases can increase the memory the compiler needs.
The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes,
and produce as output a sequence of tokens for each lexeme in the source program. The stream of tokens is sent to the
parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well.
When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol
table. In some cases, information regarding the kind of identifier may be read from the symbol table by the lexical
analyzer to assist it in determining the proper token it must pass to the parser.
The interaction is implemented by having the parser call the lexical analyzer. The call, suggested by the
getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme
and produce for it the next token, which it returns to the parser.
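A minimal sketch of this interaction, assuming a Token type and a getNextToken function with the shape implied above (both are illustrative, not a fixed API):

    /* Sketch of the interaction: the parser repeatedly asks the lexical
       analyzer for the next token until the input is exhausted.         */
    struct Token { int name; int attribute; };
    enum { TOK_EOF = 0 };

    struct Token getNextToken(void);    /* provided by the lexical analyzer */

    void parse(void) {
        struct Token tok = getNextToken();
        while (tok.name != TOK_EOF) {
            /* ... grammar-driven processing of tok happens here ... */
            tok = getNextToken();
        }
    }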
In many programming languages, the following classes cover most or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token comparison mentioned in Fig. 3.2.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
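One common way to encode these classes is a single enumeration of token names; the following is an illustrative subset for a small C-like language:

    /* Illustrative token names covering the five classes listed above. */
    enum TokenName {
        TOK_IF, TOK_ELSE, TOK_WHILE,       /* 1. one token per keyword         */
        TOK_ASSIGN, TOK_COMPARISON,        /* 2. operators, some as classes    */
        TOK_ID,                            /* 3. all identifiers               */
        TOK_NUMBER, TOK_LITERAL,           /* 4. constants and string literals */
        TOK_LPAREN, TOK_RPAREN,            /* 5. punctuation symbols           */
        TOK_COMMA, TOK_SEMICOLON
    };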
Example:
printf("Total = %d\ri", score)
both printf and score are lexemes matching the pattern for token id, and "Total = %d \n" is a lexeme matching literal.
1. Buffer Pairs:
Because of the amount of time taken to process characters and the large number of characters that must be
processed during the compilation of a large source program, specialized buffering techniques have been developed to
reduce the amount of overhead required to process a single input character. An important scheme involves two buffers
that are alternately reloaded.
Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using one system
read command we can read N characters into a buffer, rather than using one system call per character. If fewer than N
characters remain in the input file, then a special character, represented by eof, marks the end of the source file and is
different from any possible character of the source program.
Two pointers to the input are maintained:
1. Pointer lexemeBegin, marks the beginning of the current lexeme, whose extent we are attempting to
determine.
2. Pointer forward scans ahead until a pattern match is found; the exact strategy whereby this determination is
made will be covered in the balance of this chapter.
Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded
as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme
just found.
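A rough sketch of this pointer bookkeeping (the two buffers are modelled as the two halves of one array, buffer refilling is omitted, and emit_lexeme is an invented helper name):

    #include <string.h>

    #define N 4096                        /* buffer size, e.g. one disk block */

    /* The two buffers are modelled here as the two halves of one array.      */
    static char buffers[2 * N];
    static char *lexemeBegin = buffers;   /* start of the current lexeme       */
    static char *forward     = buffers;   /* scans ahead looking for a match   */

    /* Called once the end of a lexeme has been found (forward points just
       past its last character): copy the lexeme out so it can become the
       attribute of a token, then restart scanning immediately after it.      */
    static void emit_lexeme(char *out, size_t outsize) {
        size_t len = (size_t)(forward - lexemeBegin);
        if (len >= outsize)
            len = outsize - 1;
        memcpy(out, lexemeBegin, len);
        out[len] = '\0';
        lexemeBegin = forward;            /* next lexeme starts right after    */
    }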
2. Sentinels:
If we use the scheme of Buffer pairs, we must check, each time we advance forward, that we have not moved
off one of the buffers; if we do, then we must also reload the other buffer. Thus, for each character read, we make two
tests: one for the end of the buffer, and one to determine what character is read (the latter may be a multiway branch).
We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a
sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural
choice is the character eof.
Notice how the first test, which can be part of a multiway branch based on the character pointed to by forward,
is the only test we make, except in the case where we actually are at the end of a buffer or the end of the input.
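A sketch of the lookahead code with sentinels in the style described above, using '\0' to stand for the eof character (the names and the omitted reload calls are illustrative):

    #define N 4096
    static char buf1[N + 1], buf2[N + 1];   /* each buffer ends with a sentinel */
    static char *forward;                    /* the scanning pointer             */

    /* In the common case the switch on the character is the only test made;
       the end-of-buffer checks run only when a sentinel is actually seen.     */
    char next_char(void) {
        char c = *forward++;
        switch (c) {
        case '\0':                           /* sentinel: the rare, slow case   */
            if (forward == buf1 + N + 1) {
                /* end of buffer 1: a real scanner reloads buf2 from the file */
                forward = buf2;
                c = *forward++;
            } else if (forward == buf2 + N + 1) {
                /* end of buffer 2: reload buf1 and continue there            */
                forward = buf1;
                c = *forward++;
            }
            /* otherwise the sentinel lies inside a buffer: true end of input */
            break;
        default:                             /* ordinary character: no extra test */
            break;
        }
        return c;
    }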
A Lex program has three sections, separated by %%: declarations, translation rules, and auxiliary functions. Each
translation rule has the form
Pattern { Action }
Each pattern is a regular expression, which may use the regular definitions of the declaration section. The
actions are fragments of code, typically written in C, although many variants of Lex using other languages
have been created.
The third section holds whatever additional functions are used in the actions. Alternatively, these functions can
be compiled separately and loaded with the lexical analyzer.
The lexical analyzer created by Lex behaves in concert with the parser as follows:
1. When called by the parser, the lexical analyzer begins reading its remaining input, one character at a
time, until it finds the longest prefix of the input that matches one of the patterns Pi.
2. It then executes the associated action, Ai.
3. Typically, Ai will return to the parser, but if it does not (e.g., because Pi describes whitespace or
comments), then the lexical analyzer proceeds to find additional lexemes, until one of the
corresponding actions causes a return to the parser.
4. The lexical analyzer returns a single value, the token name, to the parser, but uses the shared, integer
variable yylval to pass additional information about the lexeme found, if needed (see the sketch below).
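On the parser side this convention looks roughly as follows in C; yylex and yylval are the standard Lex names, while the token codes are illustrative:

    #include <stdio.h>

    /* Token codes returned by the Lex-generated scanner (illustrative values). */
    enum { ID = 258, NUMBER = 259 };

    extern int yylex(void);    /* generated by Lex: returns the next token name */
    extern int yylval;         /* shared variable carrying the attribute value  */

    int main(void) {
        int tok;
        while ((tok = yylex()) != 0) {      /* yylex() returns 0 at end of input */
            if (tok == ID)
                printf("identifier, symbol-table index %d\n", yylval);
            else if (tok == NUMBER)
                printf("number with value %d\n", yylval);
        }
        return 0;
    }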
Example Lex program: counting the number of lines, words, uppercase and lowercase letters, digits, and special characters.
Lookahead Operator:
The lookahead operator is an additional operator that helps distinguish different patterns for a token. It lets the
lexical analyzer look at characters beyond the last character of the selected lexeme: in Lex, a pattern written r1/r2
matches only when the text matching r1 is followed by text matching r2, but the characters matching r2 are not made
part of the lexeme.
Lexical Errors:
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. For
instance, if the string fi is encountered for the first time in a C program in the context:
fi (a == f(x))
a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other
phase of the compiler - probably the parser in this case - handle an error due to transposition of the letters.
The simplest recovery strategy is "panic mode" recovery.
1. We delete successive characters from the remaining input, until the lexical analyzer can find a well-
formed token at the beginning of what input is left.
2. This recovery technique may confuse the parser, but in an interactive computing environment it may
be quite adequate.
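A minimal sketch of panic-mode recovery inside a scanner, assuming a simplified notion (the can_start_token predicate) of where a well-formed token can begin:

    #include <ctype.h>

    /* Illustrative predicate: can a well-formed token begin with this character? */
    static int can_start_token(int c) {
        return isalpha(c) || isdigit(c) || c == '_' ||
               c == '(' || c == ')' || c == ';' || c == '=' || c == '+';
    }

    /* Panic-mode recovery: delete successive characters from the remaining
       input until one is found at which a well-formed token can begin.      */
    static const char *panic_recover(const char *p) {
        while (*p != '\0' && !can_start_token((unsigned char)*p))
            p++;                           /* discard the offending characters */
        return p;
    }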
Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.