Phases of Compiler
Sometimes we divide the compiler into a front end and a back end, as follows:
The front end of a compiler includes all the analysis phases and the intermediate code generator phase,
while the back end includes the code optimization and final code generation phases.
The front end analyzes the source program and produces intermediate code,
while the back end synthesizes the target program from the intermediate code.
Example:
Find the output for the following expression after each phase of compilation
position = initial + rate * 60
SOLUTION:
(1) Lexical Analyzer (Scanner)

Lexeme      Token
position    Identifier
=           Assignment operator
initial     Identifier
+           Addition operator
rate        Identifier
*           Multiplication operator
60          Integer constant
(2) Syntax Analyzer (Parser)
The syntactic level describes the way program statements are constructed from tokens,
i.e. the syntax analyzer is the module in which the overall structure of the program is identified; it involves an
understanding of the order in which the symbols in a program may appear.
The main task of the parser is to group the tokens into sentences, i.e. to determine whether the sequence of
tokens extracted by the lexical analyzer is in the correct order or not.
For analyzing each sentence, the parser builds an abstract tree structure known as a syntax tree.
The syntax tree facilitates transformations of the program that may lead to a possible reduction in the number of
machine instructions.
Example:
[Figure: syntax tree for position = initial + rate * 60, with '=' at the root, 'position' as its left child and '+' as its right child; the '+' node has children 'initial' and a '*' node whose children are 'rate' and '60'.]

(3) Semantic Analyzer
The semantic analyzer checks the tree for type consistency. Here the integer constant 60 must be converted to real, so an int-to-real conversion node is inserted over it and 60 becomes 60.0 in the tree.
[Figure: the same syntax tree after semantic analysis, with an inttoreal node applied to the constant 60.]
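As a rough illustration (not taken from the notes), the following C sketch shows one way a parser might represent such a syntax tree internally; the Node type and the mk helper are hypothetical names chosen for the example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* A syntax-tree node: an operator with two children, or a leaf with no children. */
typedef struct Node {
    char label[16];               /* "=", "+", "*", "position", "60", ... */
    struct Node *left, *right;
} Node;

static Node *mk(const char *label, Node *left, Node *right) {
    Node *n = malloc(sizeof *n);
    strncpy(n->label, label, sizeof n->label - 1);
    n->label[sizeof n->label - 1] = '\0';
    n->left = left;
    n->right = right;
    return n;
}

int main(void) {
    /* Tree for: position = initial + rate * 60 */
    Node *tree = mk("=", mk("position", NULL, NULL),
                         mk("+", mk("initial", NULL, NULL),
                                 mk("*", mk("rate", NULL, NULL),
                                         mk("60", NULL, NULL))));
    printf("root: %s\n", tree->label);   /* prints: root: = */
    return 0;
}
```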
(4) Intermediate code generator
After semantic analysis the compiler generates an intermediate representation of the source program that is both
easy to produce and easy to translate into the target program.
There are a variety of forms used for intermediate code, such as two-address code, three-address
code & so on (i.e. it looks like code for a memory-to-memory machine where every operator reads its
operands from memory & writes its result into memory).
The intermediate code generator usually has to create temporary locations to hold intermediate results:
Temp1 = inttoreal(60)
Temp2 = rate * Temp1
Temp3 = initial + Temp2
position = Temp3
(After code optimization this could be reduced, for example, to Temp1 = rate * 60.0 followed by position = initial + Temp1.)
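A minimal sketch of how this three-address code could be held inside a compiler, assuming a simple quadruple representation (the Quad struct and its field names are illustrative, not from the notes):

```c
#include <stdio.h>

/* One three-address instruction: result = arg1 op arg2 */
typedef struct {
    const char *op, *arg1, *arg2, *result;
} Quad;

int main(void) {
    /* Intermediate code for: position = initial + rate * 60 */
    Quad code[] = {
        { "inttoreal", "60",      "",      "Temp1"    },
        { "*",         "rate",    "Temp1", "Temp2"    },
        { "+",         "initial", "Temp2", "Temp3"    },
        { "=",         "Temp3",   "",      "position" },
    };
    for (int i = 0; i < 4; i++)
        printf("%-10s %-8s %-8s -> %s\n",
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}
```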
(6) Code Generator: The main objective of this phase is to allocate storage and generate
machine/assembly code, for example an instruction such as
MOV Id1, R1
(*) Storage Allocation
Every constant & variable appearing in the program must have storage space allocated for its value during
the storage allocation phase. This storage space may be of three types (illustrated in the C sketch below):
a) Static storage: if the lifetime of the variable is the lifetime of the program, and the space for its value,
once allocated, cannot later be released, this kind of allocation is called static storage allocation.
b) Dynamic storage: if the lifetime is a particular block, function or procedure in which the variable is
allocated, so that the space may be released when that block, function or procedure is left,
this is called dynamic storage allocation.
c) Global storage: its lifetime is unknown at compile time & it has to be allocated & deallocated
at runtime. The efficient control of such storage usually implies runtime overheads.
After space is allocated by the storage allocation phase, an address containing as much as is known about
its location at compile time is passed to the code generator for its use.
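The three kinds of storage can be illustrated by analogy with ordinary C storage classes; this is only a sketch, in which heap allocation with malloc/free plays the role of what the notes call global storage:

```c
#include <stdlib.h>

static int counter = 0;   /* static storage: allocated for the whole run of the program */

void demo(void) {
    int local = 0;        /* dynamic (stack) storage: released when demo() returns */
    int *buf = malloc(100 * sizeof *buf);   /* allocated & deallocated at run time; */
                                            /* its lifetime is not known at compile time */
    counter++;
    local = counter;
    buf[0] = local;
    free(buf);            /* explicit deallocation implies some runtime overhead */
}
```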
(*) Symbol Table Management
The compiler maintains a symbol table, which is a data structure that stores/records the information about each
identifier, keyed by its lexeme.
It has facilities to manipulate (add/delete) elements in it.
It contains a FIND function which, given a lexeme, retrieves all the stored information describing that identifier.
The FIND function returns a pointer to the identifier's information.
If the FIND function returns a NULL pointer, the symbol table has no record of the identifier.
It also has an INSERT function to insert an identifier, given by its lexeme, into the symbol table.
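A minimal sketch of such a symbol table in C, assuming a simple linked-list representation; the Symbol struct and the find/insert function names stand in for the FIND and INSERT functions described above:

```c
#include <stdlib.h>
#include <string.h>

/* One symbol-table record: the lexeme plus whatever attributes the compiler keeps. */
typedef struct Symbol {
    char name[32];
    char type[16];                 /* e.g. "int", "real" */
    struct Symbol *next;
} Symbol;

static Symbol *table = NULL;       /* head of a simple linked-list symbol table */

/* FIND: return a pointer to the identifier's record, or NULL if it is not present. */
Symbol *find(const char *lexeme) {
    for (Symbol *s = table; s != NULL; s = s->next)
        if (strcmp(s->name, lexeme) == 0)
            return s;
    return NULL;
}

/* INSERT: add a new record for the given lexeme and return a pointer to it. */
Symbol *insert(const char *lexeme, const char *type) {
    Symbol *s = malloc(sizeof *s);
    strncpy(s->name, lexeme, sizeof s->name - 1);
    s->name[sizeof s->name - 1] = '\0';
    strncpy(s->type, type, sizeof s->type - 1);
    s->type[sizeof s->type - 1] = '\0';
    s->next = table;
    table = s;
    return s;
}
```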
Each phase of the compiler (and the symbol table routines) can detect errors of its own kind:
1. Lexical errors
2. Syntax errors
3. Semantic errors
4. Intermediate code generator errors
5. Code generator errors
6. Code optimizer errors
7. Symbol table entry errors
iii. Semantic Analyzer: detects errors that are semantic in nature, i.e. some statements may be correct from
the syntactic point of view but make no sense, & there is no code that can be generated to
carry out the meaning of the statement.
iv. The intermediate code generator may detect an operator whose operands have incompatible types.
v. The code optimizer may detect that certain statements can never be reached.
vi. The code generator may find a compiler-created constant that is too large to fit in a word of the target
machine.
vii. While entering information into the symbol table, it may discover a multiply-declared identifier with
contradictory attributes.
Cross Compiler
A cross compiler is a compiler that runs on one machine & produces object code for another machine. A
cross compiler is often used to implement a compiler, and it is characterized by three languages: the source
language it compiles, the target language it generates code for, and the language in which it is itself written.
Bootstrap: if a compiler has been implemented in its own language, this arrangement is called a
bootstrap arrangement.
Differentiate between one pass and multi pass compiler

One pass compiler:
1. It is faster, because the program is loaded into memory only once.
2. It places some restrictions upon the program, i.e. constants, types, variables & procedures must
be defined before they are used.
3. The components of a one pass compiler are inter-related. Ex.: all the programmers working on the
project must have knowledge about the entire project.

Multi pass compiler:
1. It is slower than a one pass compiler, because the output of each pass is stored in memory & must be
read in each time the next pass starts.
2. It has no such restriction, because the program is transformed step by step from one modified form
to another.
3. It can be decomposed into passes that can be relatively independent. Ex.: a team of programmers can
work on the project with little interaction among them.
Lexical Analysis
The main task of the lexical analyzer is to read the input characters of the source
program, group them into lexemes and produce as output a sequence of tokens, one for each
lexeme in the source program.
The stream of tokens is then sent to the parser for syntax analysis.
It is common for the lexical analyzer to interact with the symbol table as well. When the
lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that
lexeme into the symbol table.
In some cases, information regarding the kind of identifier may be read from the symbol
table by the lexical analyzer to assist it in determining the proper token that must be passed to
the parser.
[Figure: the source code enters the lexical analyzer; the parser requests tokens with getNextToken, and the lexical analyzer returns the next token; both the lexical analyzer and the parser consult the symbol table before the tokens reach semantic analysis.]
Here the interaction is implemented by having the parser call the lexical analyzer.
The call suggested by the getNextToken command causes the lexical analyzer
to read characters from its input until it can identify the next lexeme & produce
for it the next token, which it returns to the parser,
i.e. the lexical analyzer is usually implemented as a subroutine of the parser.
The main tasks of the lexical analyzer are (a C sketch follows the list):
1. Separation of the input source code into tokens, such as keywords, identifiers,
constants, operators.
2. Stripping out the unnecessary white space (blanks, newlines, tabs) and comments.
3. Keeping track of line numbers while scanning the newline characters. The line
numbers are used by the error handler to print the error messages.
4. Detecting lexical errors.
5. The output of the lexical analysis phase is the input to the syntax analysis phase.
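A minimal C sketch of a getNextToken-style routine covering tasks 1-3 above (the Token type, token names and buffer size are illustrative assumptions; comment handling is omitted for brevity):

```c
#include <ctype.h>
#include <stdio.h>

typedef enum { TOK_ID, TOK_NUM, TOK_OP, TOK_EOF } TokenType;

typedef struct {
    TokenType type;
    char text[64];
} Token;

static int line_no = 1;                            /* line counter for error messages (task 3) */

/* getNextToken: strip white space, track line numbers, and return the next token. */
Token getNextToken(FILE *src) {
    Token t = { TOK_EOF, "" };
    int c = fgetc(src);

    while (c == ' ' || c == '\t' || c == '\n') {   /* task 2: strip blanks, tabs, newlines */
        if (c == '\n') line_no++;                  /* task 3: keep track of line numbers   */
        c = fgetc(src);
    }
    if (c == EOF) return t;

    int i = 0;
    if (isalpha(c)) {                              /* identifier or keyword */
        while (isalnum(c) && i < 63) { t.text[i++] = (char)c; c = fgetc(src); }
        ungetc(c, src);
        t.type = TOK_ID;
    } else if (isdigit(c)) {                       /* integer constant */
        while (isdigit(c) && i < 63) { t.text[i++] = (char)c; c = fgetc(src); }
        ungetc(c, src);
        t.type = TOK_NUM;
    } else {                                       /* single-character operator such as = + * */
        t.text[i++] = (char)c;
        t.type = TOK_OP;
    }
    t.text[i] = '\0';
    return t;
}
```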
Lexical errors include:
I. Misspelled keywords
II. Numeric literals that are too long
III. Input characters that are not in the source language
IV. Identifiers that are too long (a warning is given)
1. Let us consider a statement "fi(a == f)". Here "fi" is a misspelled keyword. This error is not
detected in the lexical analysis phase; "fi" is taken as an identifier. The error is then detected
in the syntax analysis phase of compilation.
2. In some cases the lexical analyzer is not able to continue with the process of compilation, and it resorts
to panic mode of error recovery.
The lexical analyzer can perform the following actions to identify a token:
a) Deleting successive characters from the remaining input until a token is detected.
b) Deleting extraneous characters.
c) Inserting missing characters.
d) Replacing an incorrect character by a correct character.
e) Transposing two adjacent characters.
3) Minimum distance error correction is the strategy generally followed by the lexical
analyzer to correct errors in the lexemes.
It is nothing but the minimum number of corrections to be made to transform an
invalid lexeme into a valid lexeme.
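This minimum number of corrections between an invalid lexeme and a candidate valid lexeme can be computed as an edit distance. A small C sketch, assuming lexemes shorter than 64 characters and counting a transposition as two corrections:

```c
#include <string.h>

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Edit distance between an invalid lexeme a and a candidate valid lexeme b:
   the minimum number of insertions, deletions and replacements needed. */
int edit_distance(const char *a, const char *b) {
    int n = (int)strlen(a), m = (int)strlen(b);
    int d[64][64];                       /* assumes both lexemes are shorter than 64 chars */
    for (int i = 0; i <= n; i++) d[i][0] = i;
    for (int j = 0; j <= m; j++) d[0][j] = j;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            d[i][j] = min3(d[i-1][j] + 1,                      /* delete  */
                           d[i][j-1] + 1,                      /* insert  */
                           d[i-1][j-1] + (a[i-1] != b[j-1]));  /* replace */
    return d[n][m];
}

/* Example: edit_distance("fi", "if") is 2, since this simple version counts
   a transposition as two replacements. */
```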
Input Buffering:
Storing a block of input data in a buffer, to avoid a costly access to secondary storage each time a character is read, is
called input buffering.
Two pointers are kept into the buffer: 'lb' marks the beginning of the current lexeme and 'sp' scans forward over it.
[Figure: buffer containing "Begin I=I+1;J=J+1;…" with lb at the start of the current lexeme and sp scanning ahead to its end.]
When the end of the lexeme is identified, the token & the attribute corresponding to this
lexeme are returned.
'lb' & 'sp' are then made to point to the beginning of the next token.
[Figure: the same buffer with lb and sp both moved to the beginning of the next lexeme.]
Buffering methods
Reading the input character by character from secondary storage is costly. A block of data is
read first into a buffer & then scanned by the lexical analyzer.
[Figure: two input buffers; Buffer-1 holds "Begin I=I+1;J=…J+1;" followed by an EOF mark, with the scan pointer sp inside it; Buffer-2 holds the next block of input ("End;").]
When the end of the current buffer is reached, the other buffer is filled. Hence the
problem encountered in the previous method is solved.
In this scheme, the 2nd buffer is loaded when the end of the first buffer is reached.
Similarly, the first buffer is refilled when the end of the 2nd buffer is reached. Then the 'sp'
pointer is incremented.
Hence, for each increment of the 'sp' pointer, two tests have to be done, i.e.
(1) one for the end of the buffer
(2) another to determine what character is read.
This can be reduced to one test if we include a sentinel character (i.e. a special character
that is not part of the input program) at the end of each buffer.
Ex. such a character is EOF (end of file).
So only if the EOF character is encountered is a second check made as to which buffer
has to be refilled, & the refilling action is performed. Hence the average number of tests per input character
is 1.
Sentinel character: an extra character, other than the input characters, added at the end
of each input buffer to reduce the number of buffer tests. Ex. the EOF character.
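A C sketch of the two-buffer scheme with sentinels; the buffer-half size N and the names load, init_buffer and next_char are assumptions chosen for the example. In the common case only one test per character is made:

```c
#include <stdio.h>

#define N 4096                          /* size of each buffer half */

static char buf[2 * N + 2];             /* two halves, each terminated by an EOF sentinel */
static char *sp;                        /* scan pointer ('sp' in the text above) */
static FILE *src;

static void load(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = (char)EOF;                /* sentinel: marks the end of valid data */
}

void init_buffer(FILE *f) {
    src = f;
    load(buf);                          /* fill the first half */
    sp = buf;
}

/* Return the next input character. The common case is a single test per
   character; only when a sentinel is seen do we check which half to refill. */
int next_char(void) {
    for (;;) {
        char c = *sp++;
        if (c != (char)EOF)
            return (unsigned char)c;
        if (sp - 1 == buf + N) {                 /* sentinel ending the first half */
            load(buf + N + 1);
            sp = buf + N + 1;
        } else if (sp - 1 == buf + 2 * N + 1) {  /* sentinel ending the second half */
            load(buf);
            sp = buf;
        } else {
            return EOF;                          /* sentinel inside a half: real end of input */
        }
    }
}
```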
Regular expressions can be used to specify a set of strings. A set of strings that can be specified
by using regular expression notation is called a regular set.
The tokens of a programming language constitute a regular set. Hence this regular set can be
specified by using regular expression notation. Therefore, we write regular expressions for things
like operators, keywords, and identifiers.
Ex: regular expressions specifying a subset of the tokens of a typical programming language are as
follows:
Operator = + / - / * / mod
Digit = 0 / 1 / 2 / ... / 9
Regular expression notation is compact and precise, & for every regular expression there is a DFA that accepts the language specified by
the regular expression.
The DFA is used to recognize the language specified by the R.E. notation, making the
automatic construction of recognizers of tokens possible. So we need both R.E. & FA.
Building a lexical analyzer therefore involves:
1. Identifying the tokens of the language for which the lexical analyzer is to be built, & specifying
these tokens by using a suitable notation.
2. Constructing a suitable recognizer for these tokens.
Therefore the next step is the construction of a DFA from the R.E. But a DFA is only a flow-chart
(graphical) representation of the lexical analyzer.
Therefore, after constructing the DFA, the next step is to write a program in a suitable programming
language that will simulate the DFA.
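For example, here is a hand-written C simulation of a small DFA that recognizes identifiers (a letter followed by letters or digits); the state names and function names are illustrative:

```c
#include <ctype.h>
#include <stdio.h>

/* States of a small DFA recognizing identifiers: letter (letter | digit)* */
enum { START = 0, IN_ID = 1, DEAD = 2 };

/* next_state: the DFA's transition function, written out as code. */
static int next_state(int state, char c) {
    switch (state) {
    case START: return isalpha((unsigned char)c) ? IN_ID : DEAD;
    case IN_ID: return isalnum((unsigned char)c) ? IN_ID : DEAD;
    default:    return DEAD;
    }
}

/* Simulate the DFA over the whole string; accept if we end in the IN_ID state. */
int is_identifier(const char *s) {
    int state = START;
    for (; *s && state != DEAD; s++)
        state = next_state(state, *s);
    return state == IN_ID;
}

int main(void) {
    printf("%d\n", is_identifier("rate"));   /* 1: accepted */
    printf("%d\n", is_identifier("60"));     /* 0: rejected */
    return 0;
}
```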
Therefore it is possible to automate the procedure of obtaining the lexical analyzer from the R.E.s
specifying the tokens.
LEX is a compiler-writing tool that facilitates writing the lexical analyzer.
Its input is the set of R.E.s specifying the tokens to be recognized, & it generates as output a C program
that acts as a lexical analyzer for the tokens specified by the input R.E.s.
Syntax analysis verifies whether the tokens produced by the lexical analyzer are
properly sequenced in accordance with the grammar of the source language.
[Figure: the parser interacts with the symbol table and with the error handler.]
The parser should also report syntactical errors in a manner that is easily understood by
the user. It should also have procedures to recover from these errors & to continue the
parsing action.
Error detection
The compiler should detect & recover from errors. A simple compiler stops all activities except
lexical & syntactic analysis after detection of the first error.
Errors may occur in design, specification, algorithms, transcription, during compilation,
etc.
Parsing methods can be classified as:
1. Universal parsing
2. Top down (recursive descent, table driven)
3. Bottom up (SLR (LR(0)), LALR (LR(0)+LR(1)), CLR (LR(1)))
Universal parsing: these methods, such as the CYK (Cocke-Younger-Kasami) algorithm & Earley's
algorithm, can parse any grammar. These general methods are too inefficient to use in compilers.
The methods commonly used in compilers can be classified as either top down or bottom up.
Top-down methods build the parse tree from the top (root) to the bottom (leaves),
while bottom-up methods start from the leaves & work their way up to the root.
In either case the input to the parser is scanned from left to right, i.e. one symbol at a time.
In top-down parsers we use the left-most derivation, but in bottom-up parsers we use the
right-most derivation (in reverse).
Top down parsing
Basically, top-down parsing attempts to find the left-most derivation for the input string W: the
string W is scanned by the parser from left to right, one symbol/token at a time, & the left-most
derivation generates the leaves of the parse tree in left-to-right order, which matches the input
scan order.
[Figure: parse tree built top-down for an input such as 8 - 6 + 4 with the grammar expr → term rest, rest → + term rest | - term rest | ε: expr expands to term rest, each rest node consumes an operator and a term, and the last rest derives ε.]
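A sketch of a predictive recursive-descent parser in C for the grammar suggested by the figure (expr → term rest, rest → + term rest | - term rest | ε, term → digit); the function names and the sample input "8-6+4" are assumptions made for the example:

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *input;              /* e.g. "8-6+4" */
static int pos = 0;

static void error(void) { printf("syntax error at position %d\n", pos); exit(1); }

static void match(char expected) {
    if (input[pos] == expected) pos++;
    else error();
}

static void term(void) {               /* term -> digit */
    if (isdigit((unsigned char)input[pos])) pos++;
    else error();
}

static void rest(void) {               /* rest -> + term rest | - term rest | epsilon */
    if (input[pos] == '+')      { match('+'); term(); rest(); }
    else if (input[pos] == '-') { match('-'); term(); rest(); }
    /* else: epsilon, produce nothing */
}

static void expr(void) {               /* expr -> term rest */
    term();
    rest();
}

int main(void) {
    input = "8-6+4";
    expr();
    if (input[pos] == '\0') printf("accepted\n");
    else error();
    return 0;
}
```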
Backtracking parser
Basically, in the top-down mechanism every terminal symbol generated by some production of the
grammar (which is predicted) is matched against the input symbol pointed to by the string
marker (pointer).
Grammar:
S → aAb
A → cd | c
[Figure: three snapshots of the parse tree for an input such as acb. First S expands to a A b; then A is expanded using A → cd, which fails at the point of failure where 'd' does not match the input; the parser backtracks by resetting the input pointer and expands A using A → c, which succeeds.]
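A simplified C sketch of this backtracking behaviour for S → aAb, A → cd | c, assuming the input "acb"; parse_A first tries A → cd, fails, and then retries A → c from the same (reset) position:

```c
#include <stdio.h>

static const char *input;              /* e.g. "acb" */

/* Each function tries to match its nonterminal starting at position pos and
   returns the position just after the match, or -1 on failure. */

static int parse_A(int pos) {
    /* first alternative: A -> c d */
    if (input[pos] == 'c' && input[pos + 1] == 'd')
        return pos + 2;
    /* backtrack: reset to pos and try the second alternative A -> c */
    if (input[pos] == 'c')
        return pos + 1;
    return -1;
}

static int parse_S(int pos) {          /* S -> a A b */
    if (input[pos] != 'a') return -1;
    int after_A = parse_A(pos + 1);
    if (after_A < 0) return -1;
    if (input[after_A] != 'b') return -1;
    return after_A + 1;
}

int main(void) {
    input = "acb";
    int end = parse_S(0);
    if (end >= 0 && input[end] == '\0') printf("accepted\n");
    else printf("rejected\n");
    return 0;
}
```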