Introduction to Compilers
Introduction
Translator
A translator is a programming-language processor that converts a computer program from one language to another. It also detects and reports errors during translation.
Compiler
A compiler is a computer program that transforms source code written in a high-level language into low-level machine language. It translates code written in one programming language into another language without changing its meaning. The compiler also makes the target code efficient, optimizing it for execution time and memory space.
Interpreter
An interpreter is a computer program that directly executes program instructions written in one of the many high-level programming languages.
Comparison of Compiler and Interpreter

Compiler                                                    Interpreter
Takes the entire program as input                           Takes a single instruction as input
Intermediate object code is generated                       No intermediate object code is generated
Conditional control statements execute faster               Conditional control statements execute slower
Memory requirement is more                                  Memory requirement is less
The program need not be compiled every time                 The high-level program is translated into a low-level program every time it runs
Errors are displayed after the entire program is checked    Errors are displayed for every instruction interpreted (if any)
Example: C, C++ and Java                                    Example: BASIC and Python
Parts of the Compiler or Architecture of Compiler
A compiler can broadly be divided into two parts based on the way it compiles the program:
Analysis Phase
Known as the front end of the compiler.
The analysis phase reads the source program, divides it into core parts, and then checks for lexical, grammatical, and syntactic errors.
It generates an intermediate representation of the source program and a symbol table, which are fed to the synthesis phase as input.
Synthesis Phase
Known as the back-end of the compiler.
The synthesis phase generates the target program with the help of the intermediate code.
Language Processing Systems or Cousins of Compiler
Any computer system is made of hardware and software. The hardware understands a language that humans cannot easily understand, so we write programs in a high-level language, which is easier to understand and remember.
These programs are then fed into a series of tools and OS components to obtain the desired code that can be used by the machine. This is known as a language processing system.
Preprocessor: The preprocessor is considered a part of the compiler. It is a tool that produces input for the compiler. It deals with macro processing, augmentation, language extension, etc.
Compiler: A compiler is a computer program that transforms source code written in a high-level language into low-level machine language. It translates code written in one programming language into another without changing its meaning.
Assembler: It translates assembly-language code into machine-understandable language. The output of the assembler is known as an object file, which is a combination of machine instructions and the data required to store these instructions in memory.
Linker: The linker links and merges various object files to create an executable file. These files may have been compiled by separate assemblers. The main task of a linker is to search for the modules called in a program and to find the memory locations where all modules are stored.
Loader: The loader is a part of the OS which performs the task of loading executable files into memory and running them. It also calculates the size of a program and allocates the additional memory space it needs.
Compiler Construction Tools
Compiler construction tools were introduced as computer-related technologies spread all over the world. They are also known as compiler-compilers, compiler-generators, or translator-writing systems.
These tools use a specialized language or algorithm for specifying and implementing a component of the compiler.
Scanner generators: This tool takes regular expressions as input; for example, Lex for the Unix operating system.
Syntax-directed translation engines: These tools produce intermediate code by working over the parse tree. The goal is to associate one or more translations with each node of the parse tree.
Parser generators: A parser generator takes a grammar as input and automatically generates source code that can parse streams of characters with the help of the grammar.
Automatic code generators: These take intermediate code and convert it into machine language.
Data-flow engines: This tool is helpful for code optimization. Here, information supplied by the user and the intermediate code are compared to analyze relationships. This is also known as data-flow analysis. It helps you find out how values are transmitted from one part of the program to another.
Advantages of an Interpreter
Modification of the user program can easily be made and applied as execution proceeds.
The type of object that a variable denotes may change dynamically.
Debugging a program and finding errors is a simpler task for an interpreted program.
The interpreter makes the language machine independent.
Disadvantages of an Interpreter
Execution of the program is slower.
Memory consumption is higher.
Phases of a compiler
A compiler operates in phases.
A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation. The phases of a compiler are outlined below.
There are two phases of compilation.
a. Analysis (machine independent / language dependent)
b. Synthesis (machine dependent / language independent)
The compilation process is partitioned into a number of sub-processes called 'phases'.
In practice, several phases may be grouped together, and the intermediate representations between the grouped phases need not be constructed explicitly.
The symbol table, which stores information about the entire source program, is used by all phases of the compiler.
Lexical Analysis:
Lexical analysis, or scanning, forms the first phase of a compiler.
The lexical analyzer reads the stream of characters making up the source program and groups them into meaningful sequences called lexemes.
For each lexeme, the lexical analyzer produces tokens as output.
A token format is shown below.
<token-name, attribute-value>
These tokens pass on to the subsequent phase known as syntax analysis.
The token elements are listed below:
Token-name: an abstract symbol used during syntax analysis.
Attribute-value: points to an entry in the symbol table for the corresponding token.
Information from the symbol-table entry is needed for semantic analysis and code generation.
For example, let us analyze a simple arithmetic expression in a lexical context:
position = initial + rate * 60
The individual units in the above expression can be grouped into lexemes:

Lexeme     Token
position   Identifier
=          Assignment Symbol
initial    Identifier
+          Addition Operator
rate       Identifier
*          Multiplication Operator
60         Constant
The lexical analyzer sees the expression as:
<id,1> <=> <id,2> <+> <id,3> <*> <60>
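To make this concrete, here is a minimal sketch in C of a scanner for just this expression; the fixed symbol-table layout and the helper lookup() are illustrative assumptions, not part of the source.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* A toy scanner for: position = initial + rate * 60
   Identifiers are looked up in a tiny fixed symbol table; the index
   becomes the token's attribute value, as in <id,1> <id,2> <id,3>. */
static const char *symtab[] = { "", "position", "initial", "rate" };

static int lookup(const char *lexeme) {
    for (int i = 1; i <= 3; i++)
        if (strcmp(symtab[i], lexeme) == 0) return i;
    return 0;  /* a real scanner would insert a new entry here */
}

int main(void) {
    const char *p = "position = initial + rate * 60";
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* skip blanks */
        if (isalpha((unsigned char)*p)) {                    /* identifier lexeme */
            char buf[32]; int n = 0;
            while (isalnum((unsigned char)*p)) buf[n++] = *p++;
            buf[n] = '\0';
            printf("<id,%d> ", lookup(buf));
        } else if (isdigit((unsigned char)*p)) {             /* constant lexeme */
            int v = 0;
            while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
            printf("<%d> ", v);
        } else {                                             /* operator lexeme */
            printf("<%c> ", *p++);
        }
    }
    printf("\n");  /* prints: <id,1> <=> <id,2> <+> <id,3> <*> <60> */
    return 0;
}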
Syntax Analysis
Syntax analysis forms the second phase of the compiler.
The list of tokens produced by the lexical analysis phase forms the input, and the parser arranges them into a tree structure (called the syntax tree) that reflects the structure of the program. This phase is also called parsing.
The syntax tree consists of interior nodes representing operations and children of each node representing the arguments of those operations. A syntax tree for the token stream of the above example is built as follows: operators become the interior nodes of the tree. Here, = is the root with a left and a right child. The left child is <id,1>, and the right subtree is parsed further, with its immediate operator as its root.
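As an illustrative sketch (the node layout and the helper names leaf and interior are assumptions, not from the source), such a syntax-tree node for this example might look like this in C:

#include <stdio.h>
#include <stdlib.h>

/* An interior node holds an operator; a leaf holds a symbol-table
   index (for identifiers) or a constant's value. */
typedef struct Node {
    char op;                    /* '=', '+', '*', or 0 for a leaf */
    int  val;                   /* symbol-table index or constant value */
    struct Node *left, *right;
} Node;

static Node *leaf(int val) {
    Node *n = calloc(1, sizeof *n);
    n->val = val;
    return n;
}

static Node *interior(char op, Node *l, Node *r) {
    Node *n = calloc(1, sizeof *n);
    n->op = op; n->left = l; n->right = r;
    return n;
}

int main(void) {
    /* position = initial + rate * 60:
       '=' is the root; '*' binds tighter than '+', so it sits lower. */
    Node *t = interior('=', leaf(1),                       /* <id,1> */
                  interior('+', leaf(2),                   /* <id,2> */
                      interior('*', leaf(3), leaf(60))));  /* <id,3> * 60 */
    printf("root: %c, left leaf: <id,%d>\n", t->op, t->left->val);
    return 0;
}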
Semantic analysis
Semantic analysis forms the third phase of the compiler. This phase uses
the syntax tree and the information in the symbol table to check the
source program for consistency with the language definition.
This phase also collects type information and saves it in either the syntax tree or the symbol table, for later use during intermediate-code generation.
Input Buffering
To speed up scanning, the source is read into a buffer divided into two halves that are reloaded alternately, with the forward pointer scanning ahead to find the end of the current lexeme. Code to advance the forward pointer:
Procedure AdvanceForwardPointer
begin
    if forward at end of first half then
    begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then
    begin
        reload first half;
        move forward to beginning of first half
    end
    else
        forward := forward + 1;
    end if
end Procedure AdvanceForwardPointer
Disadvantages of this scheme
This scheme works well most of the time, but each advance of the forward pointer requires two tests: one for the end of a buffer half, and one to determine what character was read.
Sentinels
We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.
(Figure: the buffer pairs, with an eof sentinel at the end of each half.)
Note that eof retains its use as a marker for the end of the entire input: any eof that appears other than at the end of a buffer means that the input is at an end.
Code to advance forward pointer:
Procedure LookAheadwithSentinel
begin
    forward := forward + 1;
    if forward↑ = eof then
    begin
        if forward at end of first half then
        begin
            reload second half;
            forward := forward + 1
        end
        else if forward at end of second half then
        begin
            reload first half;
            move forward to beginning of first half
        end
        else
            terminate lexical analysis
    end if
end Procedure LookAheadwithSentinel
Advantages
Most of the time, the code performs only one test per character: whether the forward pointer points to an eof. Only when it reaches the end of a buffer half or the end of the input does it need to make more than one test per character.
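A minimal sketch in C of the sentinel idea on a single buffer (the buffer contents and the choice of '\0' as the sentinel are illustrative assumptions):

#include <stdio.h>
#include <string.h>

#define EOF_CHAR '\0'   /* sentinel: a character that cannot occur in the source */

int main(void) {
    char buf[16];
    strcpy(buf, "rate*60");          /* strcpy places the '\0' sentinel for us */
    const char *forward = buf;
    /* Common case: a single test per character, against the sentinel only. */
    while (*forward != EOF_CHAR) {
        putchar(*forward);
        forward++;
    }
    putchar('\n');
    /* Only here, on seeing the sentinel, would the extra buffer-reload
       (or end-of-input) checks of LookAheadwithSentinel be performed. */
    return 0;
}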
Specification of Tokens
To specify tokens, regular expressions are used: when a pattern is matched by some regular expression, the token can be recognized. Regular expressions are used to specify the patterns; each pattern matches a set of strings.
There are 3 specifications of tokens: 1) strings, 2) languages, 3) regular expressions.
Strings And Languages
An alphabet or character class is a finite set of symbols. Symbols are a collection of letters and characters.
A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
A language is any countable set of strings over some fixed alphabet.
The length of a string S, usually written |S|, is the number of occurrences of symbols in S.
The empty string, denoted ε, is the string of length zero.
For example, banana is a string of length six.
Operations on Strings
The following string-related terms are commonly used:
A prefix of string S: any string obtained by removing zero or more symbols from the end of S. For example, ban is a prefix of banana.
A suffix of string S: any string obtained by removing zero or more symbols from the beginning of S. For example, nana is a suffix of banana.
A substring of S: obtained by deleting any prefix and any suffix from S. For example, nan is a substring of banana.
The proper prefixes, suffixes, and substrings of a string S are those prefixes, suffixes, and substrings, respectively, of S that are not ε and not equal to S itself.
A subsequence of S is any string formed by deleting zero or more not
necessarily consecutive positions of S. For example, baan is a subsequence
of banana.
Operations on Languages
The following operations can be applied to languages:
Union of L and M: L ∪ M = {s | s is in L or s is in M}
Concatenation of L and M: LM = {st | s is in L and t is in M}
Kleene closure of L: L* = ⋃ L^i, i = 0, 1, 2, ...
Positive closure of L: L+ = ⋃ L^i, i = 1, 2, 3, ...
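A small worked example: let L = {a, b} and M = {0, 1}. Then:
L ∪ M = {a, b, 0, 1}
LM = {a0, a1, b0, b1}
L* = {ε, a, b, aa, ab, ba, bb, aaa, ...} (all strings of a's and b's, including the empty string)
L+ = L* without ε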
Regular Expressions
Regular expressions have the capability to define, precisely, the languages from which tokens are drawn: each regular expression r denotes a language L(r), built up from the languages denoted by its subexpressions.
Rules that define the regular expressions over some alphabet Σ:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.
2. If 'a' is a symbol in Σ, then 'a' is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with 'a' in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
i) (r)|(s) is a regular expression denoting the language L(r) ∪ L(s)
ii) (r).(s) is a regular expression denoting the language L(r).L(s)
iii) (r)* is a regular expression denoting (L(r))*
Example: letter ( letter | digit )*
Regular set
A language that can be defined by a regular expression is called a regular
set. If two regular expressions r and s denote the same regular set, we say
they are equivalent and write r = s.
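A quick worked instance of equivalence: b(ab)* and (ba)*b both denote the regular set {b, bab, babab, ...}, so they are equivalent and we write b(ab)* = (ba)*b.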
Algebraic Properties of Regular Expressions
There are a number of algebraic laws for regular expressions that can be used to manipulate them into equivalent forms:
i) | is commutative: r|s = s|r
ii) | is associative: r|(s|t) = (r|s)|t
iii) Concatenation is associative: (rs)t = r(st)
iv) Concatenation distributes over |: r(s|t) = rs|rt
v) ε is the identity element for concatenation: ε.r = r.ε = r
vi) Closure is idempotent: r** = r*
Regular Definitions
Giving names to regular expressions is referred to as a regular definition. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
d3 → r3
.........
dn → rn
where each di is a distinct name, and each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di-1}.
Example: Identifiers are the set of strings of letters and digits beginning with a letter. A regular definition for this set:
letter → A | B | .... | Z | a | b | .... | z
digit → 0 | 1 | .... | 9
id → letter ( letter | digit )*
Notations Of Regular Expression
Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them:
1. One or more instances (+): the unary postfix operator + means "one or more instances of" a regular expression. (r)+ is a regular expression that denotes (L(r))+.
2. Zero or more instances (*): the operator * denotes zero or more instances of a regular expression. (r)* is a regular expression that denotes (L(r))*.
3. Zero or one instance (?): the unary postfix operator ? means "zero or one instance of". (r)? is a regular expression that denotes the language L(r) ∪ {ε}.
4. Character classes: a character class such as [a-z] denotes the regular expression a|b|c|....|z.
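For instance, if digit stands for [0-9], then digit+ matches 7 and 42 but not the empty string, digit* additionally matches the empty string, and digit? matches either a single digit or the empty string.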
Non-regular Set
A language which cannot be described by any regular expression is a non-regular set.
Example: the set of all strings of balanced parentheses, and repeating strings.
Recognition of Tokens
Consider the following grammar fragment:
stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | num
The components G = {V, T, P, S} for the above grammar are:
Variables = {stmt, expr, term}
Terminals = {if, then, else, relop, id, num}
Start symbol = stmt
The terminals give rise to the following regular definitions:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
Transition diagrams
It is a diagrammatic representation to depict the action that will take place
when a lexical analyzer is called by the parser to get the next token. It is
used to keep track of information about the characters that are seen as
the forward pointer scans the input.
(Figure: transition diagram for identifiers; figure: transition diagram for relational operators.)
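As an illustration of how such a transition diagram turns into code, here is a minimal sketch in C of the identifier diagram (the helpers next_char, retract, and scan_id are hypothetical names, not from the source):

#include <ctype.h>
#include <stdio.h>

/* Sketch of the identifier transition diagram:
   start state: on a letter, move to the looping state;
   looping state: stay on letter|digit; on any other character,
   accept and retract the extra character (it belongs to the next token). */
static const char *input = "rate60 = 5";
static int pos = 0;

static int next_char(void) { return (unsigned char)input[pos++]; }
static void retract(void)  { pos--; }

/* Returns 1 and prints the lexeme if an identifier starts at pos, else 0. */
static int scan_id(void) {
    int start = pos;
    int c = next_char();
    if (!isalpha(c)) { retract(); return 0; }   /* start state: not a letter */
    while (isalnum(c = next_char()))            /* self-loop on letter|digit */
        ;
    retract();                                  /* the retraction on the accepting edge */
    printf("id lexeme: %.*s\n", pos - start, input + start);
    return 1;
}

int main(void) {
    scan_id();   /* prints: id lexeme: rate60 */
    return 0;
}

The while loop is exactly the self-loop of the middle state in the diagram; retract() implements the convention that the last character read is not part of the lexeme.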
Lex: The Lexical-Analyzer Generator
Lex, or in a more recent implementation Flex, is a tool that allows one to specify a lexical analyzer by giving regular expressions that describe the patterns for tokens. The input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler.
The Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram.
Creating a lexical analyzer with Lex
An input file, which we call lex.l, is written in the Lex language and
describes the lexical analyzer to be generated.
The Lex compiler transforms lex.l to a C program, in a file that is always
named lex.yy.c.
The latter file is compiled by the C compiler into a file called a.out, as
always.
The C compiler output is a working lexical analyzer that can take a stream
of input characters and produce a stream of tokens.
Structure of Lex Programs
A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
The declarations section includes declarations of variables, manifest constants (a manifest constant is an identifier declared to represent a constant, e.g. #define PI 3.14), and regular definitions.
The translation rules of a Lex program are statements of the form:
p1 {action 1}
p2 {action 2}
p3 {action 3}
……
where each p is a regular expression and each action is a program fragment
describing what action the lexical analyzer should take when a pattern p matches a
lexeme. In Lex the actions are written in C.
The third section holds whatever auxiliary procedures are needed by the actions.
Alternatively these procedures can be compiled separately and loaded with the
lexical analyzer.
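A minimal example of a complete Lex program in this form (the token codes ID and NUM and the use of yylval here are illustrative assumptions, not from the source):

%{
#include <stdlib.h>
#define ID  1          /* hypothetical token codes */
#define NUM 2
int yylval;            /* attribute value of the current token */
%}
digit   [0-9]
letter  [A-Za-z]
%%
[ \t\n]+                      { /* skip whitespace: no token returned */ }
{digit}+                      { yylval = atoi(yytext); return NUM; }
{letter}({letter}|{digit})*   { return ID; }
%%
int yywrap(void) { return 1; }   /* auxiliary function: signal end of input */

Running lex (or flex) on this file produces lex.yy.c, which is then compiled by the C compiler as described above.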
Design of a Lexical-Analyzer Generator
The following gives an overview of the architecture of a lexical analyzer generated by Lex.
(Figure: the Lex compiler turns the Lex program into a transition table and actions; at run time, a finite-automaton simulator reads the input buffer and uses the table to recognize lexemes and invoke the actions.)
Finite Automata
Finite automata are recognizers; they simply say "yes" or "no" about each possible
input string.
Finite automata come in two flavors:
(a)Nondeterministic finite automata (NFA) have no restrictions on the labels of their
edges. A symbol can label several edges out of the same state, and ε, the empty
string, is a possible label.
(b) Deterministic finite automata (DFA) have, for each state and for each symbol of the input alphabet, exactly one edge with that symbol leaving that state.
A finite automaton can be represented by a 5-tuple (Q, ∑, δ, q0, F).
Nondeterministic Finite Automata
A nondeterministic finite automaton (NFA) consists of:
1. A finite set of states Q.
2. A set of input symbols ∑, the input alphabet. We assume that ε, which stands for the empty string, is never a member of ∑.
3. A transition function δ that gives, for each state and for each symbol in ∑ ∪ {ε}, a set of next states.
4. A state q0 from Q that is distinguished as the start state (or initial state).
5. A set of states F, a subset of Q, distinguished as the accepting (or final) states.
An NFA's transition graph is very much like a transition diagram, except that:
a) the same symbol can label edges from one state to several different states, and
b) an edge may be labeled by ε, the empty string, instead of, or in addition to, symbols from the input alphabet.
Thompson's construction builds an NFA N(r) for a regular expression r by induction on the structure of r:
a) UNION
Suppose r = s|t. A new start state i has ε-transitions to the start states of N(s) and N(t), and the accepting states of N(s) and N(t) have ε-transitions to a new accepting state f. Thus N(r) accepts L(s) ∪ L(t).
b) CONCATENATION
Suppose r = st. The start state of N(s) becomes the start state of N(r), and the accepting state of N(t) is the only accepting state of N(r). Thus N(r) accepts exactly L(s)L(t) and is a correct NFA for r = st.
c) CLOSURE
Suppose r = s*. Here, i and f are new states: the start state and the lone accepting state of N(r). N(r) accepts ε, which takes care of the one string in L(s)^0, and accepts all strings in L(s)^1, L(s)^2, and so on.
Finally, suppose r = (s). Then L(r) = L(s), and we can use the NFA N(s) as N(r). The properties of N(r) are as follows:
1. N(r) has at most twice as many states as there are operators and operands in r.
2. N(r) has one start state and one accepting state.
3. Each state of N(r) other than the accepting state has either one outgoing transition on a symbol in Σ or two outgoing transitions, both on ε.
Problem
Construct an NFA for r = (a|b)*ab.
(Figure: the parse tree for r, and the NFAs for the subexpressions a, b, a|b, and (a|b)*, combined step by step into the NFA for (a|b)*ab.)
Optimization of DFA-Based Pattern Matchers
Important states of an NFA
To begin our discussion of how to go directly from a regular expression
to a DFA, we must first dissect the NFA construction and consider the
roles played by various states. We call a state of an NFA important if it
has a non-ε out-transition.
During the subset construction, two sets of NFA states can be identified
if they:
1.Have the same important states, and
2.Either both have accepting states or neither does.
By concatenating a unique right endmarker # to a regular expression r, we give the accepting state for r a transition on #, making it an important state of the NFA for (r)#. By using the augmented regular expression (r)#, any state with a transition on # must be an accepting state.
We represent the regular expression by its syntax tree, where the leaves correspond to operands and the interior nodes correspond to operators. An interior node is called a cat-node, or-node, or star-node if it is labeled by the concatenation operator, the union operator |, or the star operator *, respectively.
Functions Computed From the Syntax Tree
To construct a DFA directly from a regular expression, we construct its syntax tree and then compute four functions: nullable, firstpos, lastpos, and followpos.
Each definition refers to the syntax tree for a particular augmented regular expression (r)#.
1. nullable(n) is true for a syntax-tree node n if and only if the sub expression
represented by n has ε in its language.
2. firstpos(n) is the set of positions in the sub tree rooted at n that correspond to
the first symbol of at least one string in the language of the sub expression
rooted at n.
3. lastpos(n) is the set of positions in the sub tree rooted at n that correspond to
the last symbol of at least one string in the language of the sub expression
rooted at n.
4. followpos(p), for a position p, is the set of positions q in the entire syntax tree such that there is some string x = a1 a2 ... an in L((r)#) such that, for some i, there is a way to explain the membership of x in L((r)#) by matching ai to position p of the syntax tree and a(i+1) to position q.
Computing nullable, firstpos, and lastpos:
The basis and induction rules for nullable(n), firstpos(n), and lastpos(n) are as follows.
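Sketched from the standard textbook treatment, the rules can be tabulated as:

Node n                    nullable(n)                       firstpos(n)
leaf labeled ε            true                              ∅
leaf with position i      false                             {i}
or-node n = c1 | c2       nullable(c1) or nullable(c2)      firstpos(c1) ∪ firstpos(c2)
cat-node n = c1 c2        nullable(c1) and nullable(c2)     if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
star-node n = c1*         true                              firstpos(c1)

lastpos(n) is computed by the same rules as firstpos(n) with the roles of the children reversed at a cat-node: lastpos(n) = lastpos(c1) ∪ lastpos(c2) if nullable(c2), else lastpos(c2).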
Computing followpos:
There are only two ways that a position of a regular expression can be
made to follow another.
1. If n is a cat-node with left child c1 and right child c2, then for every position i in lastpos(c1), all positions in firstpos(c2) are in followpos(i).
2. If n is a star-node, and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i).
The followpos relation is almost an NFA without ε-transitions for the underlying regular expression; it becomes one if we:
1. Make all positions in firstpos of the root be initial states,
2. Label each arc from i to j by the symbol at position i, and
3. Make the position associated with the endmarker # be the only accepting state.
Example
Construct an optimized DFA for the regular expression (a|b)*abb.
Step 1: Construct the syntax tree for the augmented expression (a|b)*abb#.
Step 2: Compute nullable, firstpos, and lastpos for each node.
Step 3: Compute followpos for each position.
Step 4: Build the DFA's directed graph, starting from firstpos(root).
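Working the example through with the rules above (the state names A, B, C, D are arbitrary labels):

Number the leaf positions of (a|b)*abb# from left to right: a:1, b:2, a:3, b:4, b:5, #:6.
firstpos(root) = {1, 2, 3}, and the followpos sets come out as:
followpos(1) = {1, 2, 3}
followpos(2) = {1, 2, 3}
followpos(3) = {4}
followpos(4) = {5}
followpos(5) = {6}
followpos(6) = ∅

Start state A = firstpos(root) = {1, 2, 3}. For each state and symbol, take the union of followpos over the positions in that state labeled by that symbol:
Dtran[A, a] = followpos(1) ∪ followpos(3) = {1, 2, 3, 4} = B
Dtran[A, b] = followpos(2) = A
Dtran[B, a] = B
Dtran[B, b] = followpos(2) ∪ followpos(4) = {1, 2, 3, 5} = C
Dtran[C, a] = B
Dtran[C, b] = followpos(2) ∪ followpos(5) = {1, 2, 3, 6} = D
Dtran[D, a] = B
Dtran[D, b] = A

D contains position 6 (the endmarker #), so D is the accepting state. The result is the minimal four-state DFA for (a|b)*abb.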