
Contents

Compiler Design
Unit 1: Overview of Compilation
    1.1 Introduction
    1.2 Language Processing System
    1.3 The Structure of a Compiler
    1.4 Compiler Construction Tools
    1.5 Lexical Analyzer
    1.6 Regular Grammar
    1.7 Lexical Analyzer Generator / LEX
Compiler Design

Unit 1: Overview of Compilation:

1.1 Introduction:
A compiler is a program that can read a program in one language (the source language) and translate it into an
equivalent program in another language (the target language).

If the target program is an executable machine-language program, it can then be called by the user to process inputs and
produce outputs.

An interpreter is another common kind of language processor. Instead of producing a target program as a translation,
an interpreter appears to directly execute the operations specified in the source program on inputs supplied by the user.

The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping
inputs to outputs. An interpreter, however, can usually give better error diagnostics than a compiler, because it executes
the source program statement by statement.
Example:
A Java source program may first be compiled into an intermediate form called bytecodes. The bytecodes are
then interpreted by a virtual machine. A benefit of this arrangement is that bytecodes compiled on one machine can be
interpreted on another machine.
In order to achieve faster processing of inputs to outputs, some Java compilers, called just-in-time compilers
(JIT compilers), translate the bytecodes into machine language immediately before they run the intermediate program
to process the input.

1.2 Language Processing System:

A source program may be divided into modules stored in separate files. The task of collecting the source program
is sometimes entrusted to a separate program, called a preprocessor. The preprocessor may also expand shorthands,
called macros, into source-language statements.
The modified source program is then fed to a compiler. The compiler may produce an assembly-language
program as its output, because assembly language is easier to produce as output and is easier to debug. The assembly
language is then processed by a program called an assembler that produces relocatable machine code as its output.
Large programs are often compiled in pieces, so the relocatable machine code may have to be linked together
with other relocatable object files and library files into the code that actually runs on the machine.
The linker resolves external memory addresses, where the code in one file may refer to a location in another
file. The loader then puts together all of the executable object files into memory for execution.
1.3 The Structure of a Compiler:

There are two parts in the compiler: analysis and synthesis. The analysis part is often called the front end of the compiler;
the synthesis part is the back end.
1.3.1 The Analysis:
➢ The analysis part breaks up the source program into constituent pieces and imposes a grammatical
structure on them. It then uses this structure to create an intermediate representation of the source
program.
➢ If the analysis part detects that the source program is syntactically erroneous or semantically unsound, then
it must provide informative messages, so the user can take corrective action.
➢ The analysis part also collects information about the source program and stores it in a data structure
called a symbol table, which is passed along with the intermediate representation to the synthesis part.

1.3.2 The Synthesis:


➢ The synthesis part constructs the desired target program from the intermediate representation and the
information in the symbol table.

The compilation process is a sequence of phases, each of which transforms one representation of the source
program into another. Several phases may be grouped together, and the intermediate representations between the grouped
phases need not be constructed explicitly.
The symbol table, which stores information about the entire source program, is used by all phases of the compiler.
Lexical Analysis:
The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the stream of
characters making up the source program and groups the characters into meaningful sequences called lexemes. For each
lexeme, the lexical analyzer produces as output a token of the form 〈token-name, attribute-value〉, which it passes on to
the subsequent phase, syntax analysis.

In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the
second component attribute-value points to an entry in the symbol table for this token. Information from the symbol-
table entry is needed for semantic analysis and code generation.

Example:
position = initial + rate * 60
1. position is a lexeme that would be mapped into a token 〈id,1〉, where id is an abstract symbol standing for
identifier and 1 points to the symbol-table entry for position. The symbol-table entry for an identifier holds
information about the identifier, such as its name and type.
2. The assignment symbol = is a lexeme that is mapped into the token 〈=〉. Since this token needs no
attribute-value, we have omitted the second component. We could have used any abstract symbol such
as assign for the token-name, but for notational convenience we have chosen to use the lexeme itself as
the name of the abstract symbol.
3. initial is a lexeme that is mapped into the token 〈id,2〉, where 2 points to the symbol-table entry for
initial.
4. + is a lexeme that is mapped into the token 〈+〉.
5. rate is a lexeme that is mapped into the token 〈id,3〉, where 3 points to the symbol-table entry for
rate.
6. * is a lexeme that is mapped into the token 〈*〉.
7. 60 is a lexeme that is mapped into the token 〈60〉.

Blanks separating the lexemes would be discarded by the lexical analyzer.


After lexical analysis, the assignment statement is represented as the following sequence of tokens:

〈id,1〉〈=〉〈id,2〉〈+〉〈id,3〉〈*〉〈60〉

In this representation, the token names =, +, and * are abstract symbols for the assignment, addition, and
multiplication operators, respectively.
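
As a rough illustration, this token stream could be represented with a small C structure; the type and enumerator names below are hypothetical, not taken from the text.

#include <stdio.h>

/* Hypothetical token names; id tokens carry a symbol-table index as their attribute. */
typedef enum { TK_ID, TK_ASSIGN, TK_PLUS, TK_MUL, TK_NUM } TokenName;

typedef struct {
    TokenName name;   /* abstract symbol used by the parser */
    int attr;         /* symbol-table index or numeric value; -1 if unused */
} Token;

int main(void) {
    /* position = initial + rate * 60  ->  <id,1> <=> <id,2> <+> <id,3> <*> <60> */
    Token stream[] = {
        { TK_ID, 1 }, { TK_ASSIGN, -1 }, { TK_ID, 2 }, { TK_PLUS, -1 },
        { TK_ID, 3 }, { TK_MUL, -1 },    { TK_NUM, 60 }
    };
    for (size_t i = 0; i < sizeof stream / sizeof stream[0]; i++)
        printf("<%d,%d> ", (int) stream[i].name, stream[i].attr);
    printf("\n");
    return 0;
}
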
Syntax Analysis:
The second phase of the compiler is syntax analysis or parsing. The parser uses the first components of the
tokens produced by the lexical analyzer to create a tree-like intermediate representation that depicts the grammatical
structure of the token stream.
A typical representation is a syntax tree in which each interior node represents an operation and the children of
the node represent the arguments of the operation.

1. The tree has an interior node labelled * with 〈id,3〉 as its left child and the integer 60 as its right child.
2. The node 〈id,3〉 represents the identifier rate.
3. The node labelled * makes it explicit that we must first multiply the value of rate by 60.
4. The node labelled + indicates that we must add the result of this multiplication to the value of initial.
5. The root of the tree, labelled =, indicates that we must store the result of this addition into the location
for the identifier position.
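
A minimal sketch of such a syntax-tree node in C is shown below; the node and function names are illustrative only, and the tree is built bottom-up starting from the multiplication.

#include <stdio.h>
#include <stdlib.h>

/* An interior node holds an operator label; a leaf holds an operand label. */
typedef struct SyntaxNode {
    const char *label;                 /* "=", "+", "*", "id,1", "60", ... */
    struct SyntaxNode *left, *right;   /* children are the operation's arguments */
} SyntaxNode;

static SyntaxNode *node(const char *label, SyntaxNode *l, SyntaxNode *r) {
    SyntaxNode *n = malloc(sizeof *n);
    n->label = label;
    n->left = l;
    n->right = r;
    return n;
}

static void print_preorder(const SyntaxNode *n) {
    if (n == NULL) return;
    printf("%s ", n->label);
    print_preorder(n->left);
    print_preorder(n->right);
}

int main(void) {
    /* position = initial + rate * 60 */
    SyntaxNode *mul = node("*", node("id,3", NULL, NULL), node("60", NULL, NULL));
    SyntaxNode *add = node("+", node("id,2", NULL, NULL), mul);
    SyntaxNode *assign = node("=", node("id,1", NULL, NULL), add);
    print_preorder(assign);   /* prints: = id,1 + id,2 * id,3 60 */
    printf("\n");
    return 0;
}
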

Semantic Analysis:
The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program
for semantic consistency with the language definition. It also gathers type information and saves it in either the syntax
tree or the symbol table, for subsequent use during intermediate-code generation.
An important part of semantic analysis is type checking, where the compiler checks that each operator has
matching operands. For example, many programming language definitions require an array index to be an integer; the
compiler must report an error if a floating-point number is used to index an array.
The language specification may permit some type conversions called coercions.
For example, a binary arithmetic operator may be applied to either a pair of integers or to a pair of floating-
point numbers. If the operator is applied to a floating-point number and an integer, the compiler may convert or coerce
the integer into a floating-point number.
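
The sketch below shows one way a semantic analyzer might insert such a coercion while type-checking a binary node; the node, type, and function names are hypothetical, not from the text.

#include <stdio.h>
#include <stdlib.h>

typedef enum { T_INT, T_FLOAT } Type;
typedef enum { OP_LEAF, OP_MUL, OP_INTTOFLOAT } Op;

typedef struct Node {
    Op op;
    Type type;
    struct Node *left, *right;
} Node;

static Node *mknode(Op op, Type type, Node *l, Node *r) {
    Node *n = malloc(sizeof *n);
    n->op = op; n->type = type; n->left = l; n->right = r;
    return n;
}

/* Wrap an integer-typed subtree in an inttofloat conversion node. */
static Node *coerce_to_float(Node *n) {
    return mknode(OP_INTTOFLOAT, T_FLOAT, n, NULL);
}

/* Equal operand types pass through; a mixed int/float pair has its
   integer operand coerced to floating point. */
static Type check_binary(Node *n) {
    if (n->left->type != n->right->type) {
        if (n->left->type == T_INT)  n->left  = coerce_to_float(n->left);
        if (n->right->type == T_INT) n->right = coerce_to_float(n->right);
        n->type = T_FLOAT;
    } else {
        n->type = n->left->type;
    }
    return n->type;
}

int main(void) {
    /* rate * 60, with rate of type float and 60 of type int */
    Node *mul = mknode(OP_MUL, T_FLOAT,
                       mknode(OP_LEAF, T_FLOAT, NULL, NULL),  /* rate */
                       mknode(OP_LEAF, T_INT,   NULL, NULL)); /* 60   */
    check_binary(mul);
    printf("right operand is now %s\n",
           mul->right->op == OP_INTTOFLOAT ? "inttofloat(60)" : "60");
    return 0;
}
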

Intermediate Code Generation:


In the process of translating a source program into target code, a compiler may construct one or more
intermediate representations, which can have a variety of forms. Syntax trees are a form of intermediate representation;
they are commonly used during syntax and semantic analysis.
After syntax and semantic analysis of the source program, many compilers generate an explicit low-level or
machine-like intermediate representation, which we can think of as a program for an abstract machine.
This intermediate representation should have two important properties:
1. It should be easy to produce
2. It should be easy to translate into the target machine.

Fig: Three-address code, an Intermediate Code Representation
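
For the running example position = initial + rate * 60, the three-address code usually given in the standard textbook treatment is:

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

Each three-address instruction has at most one operator on the right side, and the compiler generates temporary names (t1, t2, t3) to hold the intermediate values.
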


Code Optimization:
The machine-independent code-optimization phase attempts to improve the intermediate code so that better
target code will result. Usually better means faster, but other objectives may be desired, such as shorter code, or target
code that consumes less power.
A simple intermediate code generation algorithm followed by code optimization is a reasonable way to generate
good target code. The optimizer can deduce that the conversion of 60 from integer to floating point can be done once
and for all at compile time, so the inttofloat operation can be eliminated by replacing the integer 60 by the floating-point
number 60.0.
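
After this optimization, the intermediate code for the running example (again following the standard textbook treatment) shrinks to:

t1 = id3 * 60.0
id1 = id2 + t1
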

Code Generation:
The code generator takes as input an intermediate representation of the source program and maps it into the
target language. If the target language is machine code, registers or memory locations are selected for each of the
variables used by the program.
Then, the intermediate instructions are translated into sequences of machine instructions that perform the same
task. A crucial aspect of code generation is the judicious assignment of registers to hold variables.
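
For the running example, the generated target code might look like the following register-machine sketch, which follows the standard textbook rendering; R1 and R2 are registers, # marks an immediate constant, and the F suffix marks floating-point instructions.

LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1
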

(LD-Load, MUL-Multiply, ADD-Add, ST-store, F-Floating point)

Grouping Phases into Passes:


The discussion of phases deals with the logical organization of a compiler. In an implementation, activities
from several phases may be grouped together into a pass that reads an input file and writes an output file.
For example, the front-end phases of lexical analysis, syntax analysis, semantic analysis, and intermediate
code generation might be grouped together into one pass. Code optimization might be an optional pass. Then there
could be a back-end pass consisting of code generation for a particular target machine.
A pass is a component in which parts of one or more phases of the compiler are combined when the compiler is
implemented. A pass reads the source program, or the output produced by the previous pass, and makes the
transformations specified by its phases.
There are generally two types of passes:
1. One-pass (Single-pass): In a one-pass compiler, all the phases are grouped into a single pass; the six phases are
performed together in that one pass.
Purpose of a One-Pass Compiler:
A one-pass compiler generates machine instructions as it scans the source program. When an instruction refers to a
machine address that is not yet known, the instruction is added to a list to be backpatched once that address has been
generated. The source program is processed only once: as each source line is handled, it is checked and its tokens are
extracted.
2. Two-pass (Multi-pass): In a two-pass compiler, the phases are divided into two parts, the analysis (front-end)
part of the compiler and the synthesis (back-end) part of the compiler.
Purpose of a Two-Pass Compiler:
A two-pass compiler uses its first pass to enter into its symbol table a list of identifiers together with the memory
locations to which they refer. The second pass then replaces mnemonic operation codes with their machine-language
equivalents and replaces uses of identifiers with their machine addresses. In the second pass, the compiler reads the
output produced by the first pass, builds the syntax tree, and performs the syntactic analysis; the output of this stage
is a file containing the syntax tree.

1.4 Compiler Construction Tools:


1. Parser generators that automatically produce syntax analyzers from a grammatical description of a
programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression description of the tokens
of a language.
3. Syntax-directed translation engines that produce collections of routines for walking a parse tree and
generating intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for translating
each operation of the intermediate language into the machine language for a target machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values are transmitted
from one part of a program to each other part. Data-flow analysis is a key part of code optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for constructing various
phases of a compiler.

Bootstrapping is the process of using an existing compiler for a programming language to compile a new or improved
version of the same compiler. (Refer to GeeksforGeeks for details.)
Reducing the Number of Passes
The phases of a compiler are grouped into passes for practical reasons. Grouping phases into a single pass makes
compilation faster, because the compiler does not have to repeatedly write intermediate results to files and read them
back between phases; reducing the number of passes therefore improves time efficiency. However, when phases are
grouped into one pass, the whole program must be kept in memory to guarantee a proper flow of data to each phase,
since one phase may require its data in a different order from the order in which the previous phase produced it.
Because the internal representation of the source or target program differs from its external form, the internal
structures held in memory may be larger than the input and output themselves.

1.5 Lexical Analyzer:

The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes,
and produce as output a sequence of tokens for each lexeme in the source program. The stream of tokens is sent to the
parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well.
When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol
table. In some cases, information regarding the kind of identifier may be read from the symbol table by the lexical
analyzer to assist it in determining the proper token it must pass to the parser.
The interaction is implemented by having the parser call the lexical analyzer. The call, suggested by the
getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme
and produce for it the next token, which it returns to the parser.
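
A schematic C sketch of this calling convention is given below; the Token type, the token names, and the trivial stand-in scanner are all hypothetical, since the text does not define them.

typedef enum { TOK_ID, TOK_NUM, TOK_PLUS, TOK_EOF } TokenName;

typedef struct {
    TokenName name;   /* abstract symbol used during syntax analysis */
    int attr;         /* e.g., a symbol-table index, if the token has one */
} Token;

/* Stand-in scanner so the sketch compiles; a real lexical analyzer reads
   characters, groups them into lexemes, and returns one token per call. */
static Token getNextToken(void) {
    Token t = { TOK_EOF, 0 };
    return t;
}

/* The parser drives the interaction: it calls getNextToken whenever it
   needs the next token of the input. */
static void parse(void) {
    Token tok = getNextToken();
    while (tok.name != TOK_EOF) {
        /* ... grammar-driven processing of tok ... */
        tok = getNextToken();
    }
}

int main(void) {
    parse();
    return 0;
}
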

Lexical analyzers are divided into a cascade of two processes:


1. Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of
comments and compaction of consecutive whitespace characters into one.
2. Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens
as output.
Lexical Analysis Versus Parsing:
1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often
allows us to simplify at least one of these tasks.
For example, a parser that had to deal with comments and whitespace as syntactic units would be
considerably more complex than one that can assume comments and whitespace have already been removed by
the lexical analyzer. If we are designing a new language, separating lexical and syntactic concerns can lead to a
cleaner overall language design.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve
only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input
characters can speed up the compiler significantly.
3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.
Tokens, Patterns, and Lexemes:
1. A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract
symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting
an identifier. The token names are the input symbols that the parser processes. In what follows, we shall
generally write the name of a token in boldface. We will often refer to a token by its token name.
2. A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token,
the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the
pattern is a more complex structure that is matched by many strings.
3. A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified
by the lexical analyzer as an instance of that token.

1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token comparison mentioned in Fig. 3.2.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
Example:
printf("Total = %d\n", score);
both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching literal.

Attributes for Tokens:


When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler
phases additional information about the particular lexeme that matched. Thus, in many cases the lexical analyzer returns
to the parser not only a token name, but also an attribute value that describes the lexeme represented by the token.
The token name influences parsing decisions, while the attribute value influences the translation of tokens after the parse.
The most important example is the token id, where we need to associate with the token a great deal of
information. Normally, information about an identifier - e.g., its lexeme, its type, and the location at which it is first
found - is kept in the symbol table. Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table
entry for that identifier.
Input Buffering:

1.Buffer Pairs:
Because of the amount of time taken to process characters and the large number of characters that must be
processed during the compilation of a large source program, specialized buffering techniques have been developed to
reduce the amount of overhead required to process a single input character. An important scheme involves two buffers
that are alternately reloaded.

Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using one system
read command we can read N characters into a buffer, rather than using one system call per character. If fewer than N
characters remain in the input file, then a special character, represented by eof, marks the end of the source file and is
different from any possible character of the source program.
Two pointers to the input are maintained:
1. Pointer lexemeBegin, marks the beginning of the current lexeme, whose extent we are attempting to
determine.
2. Pointer forward scans ahead until a pattern match is found; the exact strategy whereby this determination is
made will be covered in the balance of this chapter.
Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded
as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme
just found.
2.Sentinels:
If we use the scheme of Buffer pairs, we must check, each time we advance forward, that we have not moved
off one of the buffers; if we do, then we must also reload the other buffer. Thus, for each character read, we make two
tests: one for the end of the buffer, and one to determine what character is read (the latter may be a multiway branch).
We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a
sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural
choice is the character eof.
Notice how the first test, which can be part of a multiway branch based on the character pointed to by forward,
is the only test we make, except in the case where we actually are at the end of a buffer or the end of the input.
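
A C sketch of the forward-pointer advance with buffer pairs and sentinels follows; the buffer layout, the reload helper, and the use of '\0' as the eof sentinel are assumptions made for the sketch, and lexemeBegin handling is omitted.

#include <stdio.h>

#define N 4096                 /* size of each buffer, typically one disk block */
#define SENTINEL '\0'          /* stands in for the special eof character */

static char buf[2 * N + 2];    /* two buffers, each followed by one sentinel slot */
static char *forward;          /* scanning pointer */
static FILE *src;

/* Fill one buffer half and terminate it with the sentinel. */
static void reload(char *start) {
    size_t n = fread(start, 1, N, src);
    start[n] = SENTINEL;       /* marks end of input if n < N, end of buffer otherwise */
}

static void init_buffers(FILE *f) {
    src = f;
    reload(buf);
    forward = buf;
}

/* Advance forward by one character; the sentinel test doubles as the buffer-end test. */
static int next_char(void) {
    int c = (unsigned char) *forward++;
    if (c != SENTINEL)
        return c;
    if (forward == buf + N + 1) {          /* hit the sentinel ending the first buffer */
        reload(buf + N + 1);               /* reload the second buffer and keep going  */
        return next_char();
    }
    if (forward == buf + 2 * N + 2) {      /* hit the sentinel ending the second buffer */
        reload(buf);
        forward = buf;
        return next_char();
    }
    return EOF;                            /* sentinel inside a buffer: real end of input */
}

int main(void) {
    long count = 0;
    init_buffers(stdin);
    while (next_char() != EOF)
        count++;
    printf("%ld characters read\n", count);
    return 0;
}
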

1.6 Regular Grammar


Refer to the textbook, page 116, topic "Specification of Tokens".

1.7 Lexical Analyzer Generator / LEX


Lex is a tool, or a computer program, that generates lexical analyzers (which convert the stream of characters into tokens).
The Lex tool itself is a compiler: the Lex compiler takes a specification of patterns as input and transforms it into a
lexical analyzer. It is commonly used with YACC (Yet Another Compiler Compiler). It was written by Mike Lesk and Eric Schmidt.
A lexical analyzer generator, also known as Lex, is a program that creates scanners, or tokenizers, to recognize
lexical patterns in text. Lexical analyzers are programs that break input into tokens, or lexemes, based on specified
patterns.

Structure of Lex Programs:


A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
The declarations section includes declarations of variables, manifest constants (identifiers declared to stand
for a constant, e.g., the name of a token), and regular definitions.

The translation rules each have the form:

Pattern { Action }
Each pattern is a regular expression, which may use the regular definitions of the declaration section. The
actions are fragments of code, typically written in C, although many variants of Lex using other languages
have been created.

The third section holds whatever additional functions are used in the actions. Alternatively, these functions can
be compiled separately and loaded with the lexical analyzer.
The lexical analyzer created by Lex behaves in concert with the parser as follows :
1. When called by the parser, the lexical analyzer begins reading its remaining input, one character at a
time, until it finds the longest prefix of the input that matches one of the patterns Pi.
2. It then executes the associated action, Ai.
3. Typically, Ai will return to the parser, but if it does not (e.g., because Pi describes whitespace or
comments), then the lexical analyzer proceeds to find additional lexemes, until one of the
corresponding actions causes a return to the parser.
4. The lexical analyzer returns a single value, the token name, to the parser, but uses the shared, integer
variable yylval to pass additional information about the lexeme found, if needed.

Example Lex Program: counting the number of lines, words, uppercase and lowercase letters, digits, and special characters.
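
One possible version of such a Lex program is sketched below; here a run of letters or of digits is counted as one word, and this interpretation, along with the output format, is an assumption of the sketch. Running Lex (or Flex) on this specification produces a C file that is then compiled and linked to obtain the counter.

%{
#include <stdio.h>
int lines = 0, words = 0, uppercase = 0, lowercase = 0, digits = 0, specials = 0;
%}

%%
\n              { lines++; }
[A-Za-z]+       { int i;
                  words++;
                  for (i = 0; i < yyleng; i++) {
                      if (yytext[i] >= 'A' && yytext[i] <= 'Z') uppercase++;
                      else lowercase++;
                  }
                }
[0-9]+          { words++; digits += yyleng; }
[ \t]+          { /* skip blanks and tabs */ }
.               { specials++; }
%%

int yywrap(void) { return 1; }

int main(void) {
    yylex();
    printf("lines=%d words=%d uppercase=%d lowercase=%d digits=%d specials=%d\n",
           lines, words, uppercase, lowercase, digits, specials);
    return 0;
}
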

Conflict Resolution in Lex or Rules to decide on the proper lexeme to Select:


1. Always prefer a longer prefix to a shorter prefix.
2. If the longest possible prefix matches two or more patterns, prefer the pattern listed first in the Lex program.
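
A small fragment illustrating both rules is shown below; the numeric token codes are hypothetical placeholders (in practice they often come from a Yacc-generated header).

%{
#include <stdio.h>
/* Hypothetical token codes for the sketch. */
#define LT 258
#define LE 259
#define IF 260
#define ID 261
%}

%%
"<"             { return LT; }
"<="            { return LE;  /* on input "<=", the longer prefix "<=" beats "<" (rule 1) */ }
"if"            { return IF;  /* "if" and the identifier pattern match equally long prefixes;
                                 the keyword wins because its rule is listed first (rule 2) */ }
[a-z][a-z0-9]*  { return ID; }
[ \t\n]+        { /* skip whitespace */ }
%%

int yywrap(void) { return 1; }

int main(void) {
    int tok;
    while ((tok = yylex()) != 0)
        printf("token code %d\n", tok);
    return 0;
}
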

Lookahead Operator:
The lookahead operator is an additional Lex operator used to distinguish different patterns for a token. It allows a rule
to read ahead of the last character of the selected lexeme, without making the extra characters part of that lexeme.
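
In Lex the lookahead operator is written / (trailing context). The small sketch below treats if as a keyword only when it is followed by a left parenthesis; the '(' is examined but left in the input rather than becoming part of the lexeme. The rule set itself is illustrative only.

%{
#include <stdio.h>
%}

%%
if/\(        { printf("keyword if (the '(' stays in the input)\n"); }
[a-z]+       { printf("identifier: %s\n", yytext); }
.|\n         { /* ignore everything else */ }
%%

int yywrap(void) { return 1; }

int main(void) {
    yylex();
    return 0;
}
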

Lexical Errors:
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. For
instance, if the string fi is encountered for the first time in a C program in the context
fi ( a == f(x))
a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an
undeclared function identifier.
Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other
phase of the compiler - probably the parser in this case - handle an error due to transposition of the letters.
The simplest recovery strategy is "panic mode" recovery.
1. We delete successive characters from the remaining input, until the lexical analyzer can find a well-
formed token at the beginning of what input is left.
2. This recovery technique may confuse the parser, but in an interactive computing environment it may
be quite adequate.
Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
