Compiler Design Unit 1
We have learnt that any computer system is made of hardware and software. The
hardware understands a language that humans cannot easily read or write. So we write
programs in a high-level language, which is easier for us to understand and remember.
These programs are then fed into a series of tools and OS components to produce
code that the machine can use. This is known as a Language Processing System.
-------------------------------------------------------------------------------------------------------------------------------------
A compiler can broadly be divided into two phases based on the way it compiles.
i. Analysis Phase
Known as the front-end of the compiler, the analysis phase reads the
source program, divides it into core parts and then checks for lexical, grammar and
syntax errors. The analysis phase generates an intermediate representation of the source
program and a symbol table, which are fed to the synthesis phase as input.
-------------------------------------------------------------------------------------------------------------------------------------
1. Lexical Analysis
The first phase of the compiler works as a text scanner. This phase scans the source code
as a stream of characters and converts it into meaningful lexemes. The lexical analyzer
represents these lexemes in the form of tokens as:
-------------------------------------------------------------------------------------------------------------------------------------
A pass is a complete traversal of the source program. A compiler can traverse the source
program in one of two ways.
i. Multi-pass Compiler
o A multi-pass compiler processes the source code of a program several times.
o In the first pass, the compiler reads the source program, scans it, extracts the tokens
and stores the result in an output file.
o In the second pass, the compiler reads the output file produced by the first pass, builds
the syntax tree and performs syntax analysis. The output of this pass is a
file that contains the syntax tree.
o In the third pass, the compiler reads the output file produced by the second pass and
checks whether the tree follows the rules of the language. The output of the semantic
analysis pass is the annotated syntax tree.
o Passes continue in this way until the target output is produced.
ii. One-pass Compiler
o A one-pass compiler traverses the program only once. It passes through the parts
of each compilation unit a single time and translates each part into its final
machine code.
o In a one-pass compiler, as each source line is processed, it is scanned and its
tokens are extracted.
o Then the syntax of the line is analyzed and the tree structure is built. After the
semantic part, the code is generated.
o The same process is repeated for each line of code until the entire program is
compiled.
VI. BOOTSTRAPPING
-------------------------------------------------------------------------------------------------------------------------------------
1. Source Language
2. Target Language
3. Implementation Language
1. Create a compiler SCAA for a subset S of the desired language L, written in language A,
that runs on machine A.
3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language L
that runs on machine A and produces code for machine A.
-------------------------------------------------------------------------------------------------------------------------------------
Lexical analysis is the first phase of a compiler. It takes the modified source code produced
by language preprocessors, written in the form of sentences. The lexical analyzer breaks this
text into a series of tokens, removing any whitespace and comments in the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works
closely with the syntax analyzer. It reads character streams from the source code, checks for
legal tokens, and passes the data to the syntax analyzer on demand.
i. Input buffering: This stage involves cleaning up the input text and preparing it for
lexical analysis. This may include removing comments, whitespace, and other
non-essential characters from the input text.
ii. Tokenization: This is the process of breaking the input text into a sequence of tokens.
This is usually done by matching the characters in the input text against a set of patterns
or regular expressions that define the different types of tokens.
iii. Token classification: In this stage, the lexer determines the type of each token. For
example, in a programming language, the lexer might classify keywords, identifiers,
operators, and punctuation symbols as separate token types.
iv. Token validation: In this stage, the lexer checks that each token is valid according to
the rules of the programming language. For example, it might check that a variable
name is a valid identifier, or that an operator has the correct syntax.
v. Output generation: In this final stage, the lexer generates the output of the lexical
analysis process, which is typically a list of tokens. This list of tokens can then be
passed to the next stage of compilation or interpretation.
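
The stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token names and patterns in TOKEN_SPEC describe a made-up C-like language and are not taken from these notes, and the function name tokenize is hypothetical.

```python
import re

# Hypothetical token patterns for a tiny C-like language (an assumption).
# Order matters: comments before operators, keywords before identifiers.
TOKEN_SPEC = [
    ("COMMENT",     r"//[^\n]*"),
    ("WHITESPACE",  r"\s+"),
    ("KEYWORD",     r"\b(?:int|float|if|else|while|return)\b"),
    ("IDENTIFIER",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",      r"\d+(?:\.\d+)?"),
    ("OPERATOR",    r"==|!=|<=|>=|[+\-*/=<>]"),
    ("PUNCTUATION", r"[;,(){}]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token_type, lexeme) pairs for the source text."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)       # tokenization
        if match is None:                       # token validation
            raise SyntaxError(f"invalid character {source[pos]!r} at offset {pos}")
        kind = match.lastgroup                  # token classification
        if kind not in ("COMMENT", "WHITESPACE"):
            tokens.append((kind, match.group()))  # output generation
        pos = match.end()
    return tokens


print(tokenize("int x = 10; // init"))
```

Comments and whitespace are matched like any other pattern and then dropped, so the cleaning step and the tokenization step share one loop. A generated lexer would normally do the same thing with a state machine rather than with Python's re module.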
TOKENS
A lexeme is a sequence of (alphanumeric) characters that forms a token. There are some
predefined rules for every lexeme to be identified as a valid token. These rules are defined by
-------------------------------------------------------------------------------------------------------------------------------------
SPECIFICATIONS OF TOKENS
Let us see how language theory defines the following terms:
Alphabets
Any finite set of symbols is an alphabet: {0,1} is the binary alphabet,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and {a-z, A-Z} is the
alphabet of English letters.
Strings
Any finite sequence of symbols (characters) from an alphabet is called a string. The length of
a string is the total number of occurrences of symbols in it, e.g., the length of the string
tutorialspoint is 14, denoted by |tutorialspoint| = 14. A string with no symbols, i.e. a string
of zero length, is known as the empty string and is denoted by ε (epsilon).
Special symbols
A typical high-level language contains the following symbols:
Assignment =
-------------------------------------------------------------------------------------------------------------------------------------
Location Specifier &
Language
A language is a set of strings over some finite alphabet. Because languages are sets,
the usual set operations can be performed on them. Regular languages, which include
every finite language, can be described by means of regular expressions.
REGULAR EXPRESSIONS
The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes
that belong to the language in hand. It searches for the patterns defined by the language rules.
Regular expressions can express such languages by defining a pattern for
finite strings of symbols. The grammar defined by regular expressions is known as a regular
grammar. The language defined by a regular grammar is known as a regular language.
A regular expression is an important notation for specifying patterns. Each pattern matches a
set of strings, so regular expressions serve as names for sets of strings. Programming
language tokens can be described by regular languages. The specification of regular
expressions is an example of a recursive definition. Regular languages are easy to understand
and have efficient implementations.
There are a number of algebraic laws that are obeyed by regular expressions, which can be
used to manipulate regular expressions into equivalent forms.
OPERATIONS
-------------------------------------------------------------------------------------------------------------------------------------
NOTATIONS
If r and s are regular expressions denoting the languages L(r) and L(s), then
Union : (r)|(s) is a regular expression denoting L(r) ∪ L(s)
Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
Kleene closure : (r)* is a regular expression denoting (L(r))*
(r) is a regular expression denoting L(r)
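
The three operations can be tried out with Python's re module. This is a small sketch, not part of the notes: the sample expressions r = ab and s = c and the helper name matches are assumptions, and Python's regex syntax is a superset of the formal notation above.

```python
import re


def matches(pattern, text):
    """True if the whole text belongs to the language of the pattern."""
    return re.fullmatch(pattern, text) is not None


r, s = "ab", "c"             # sample regular expressions (assumed)

union = f"({r})|({s})"       # L(r) ∪ L(s) = {ab, c}
concat = f"({r})({s})"       # L(r)L(s)    = {abc}
closure = f"({r})*"          # (L(r))*     = {ε, ab, abab, ...}

print(matches(union, "ab"), matches(union, "c"))         # True True
print(matches(concat, "abc"))                            # True
print(matches(closure, ""), matches(closure, "ababab"))  # True True
print(matches(closure, "aba"))                           # False
```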
-------------------------------------------------------------------------------------------------------------------------------------
FINITE AUTOMATA
A finite automaton is a state machine that takes a string of symbols as input and changes its state
accordingly. A finite automaton is a recognizer for regular expressions. When an input
string is fed into a finite automaton, it changes its state for each symbol. If the input
string is successfully processed and the automaton reaches a final state, the string is accepted, i.e., the
string just fed in is a valid token of the language in hand.
The mathematical model of a finite automaton consists of:
-------------------------------------------------------------------------------------------------------------------------------------
i. DFA
DFA stands for Deterministic Finite Automaton. Deterministic refers to the uniqueness
of the computation. In a DFA, each input character leads to exactly one state. A DFA does
not accept the null move, which means it cannot change state without reading an input
character.
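
A DFA can be simulated with a simple table lookup. The sketch below is illustrative only: the machine shown (binary strings ending in 01, with states q0, q1, q2) is a hypothetical example chosen here, and the function name dfa_accepts is an assumption.

```python
def dfa_accepts(transitions, start, finals, text):
    """Run the DFA on text; every (state, symbol) pair has at most one target."""
    state = start
    for symbol in text:
        key = (state, symbol)
        if key not in transitions:   # symbol outside the alphabet: reject
            return False
        state = transitions[key]     # exactly one next state, no null moves
    return state in finals


# Hypothetical DFA accepting binary strings that end in "01".
ENDS_IN_01 = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q1", ("q2", "1"): "q0",
}

print(dfa_accepts(ENDS_IN_01, "q0", {"q2"}, "1101"))  # True
print(dfa_accepts(ENDS_IN_01, "q0", {"q2"}, "10"))    # False
```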
Example
-------------------------------------------------------------------------------------------------------------------------------------
ii. NDFA
NDFA stands for Non-Deterministic Finite Automaton. For a particular input it can move to any
number of states. An NDFA accepts the NULL move, which means it
can change state without reading a symbol.
An NDFA is also defined by a 5-tuple, the same as a DFA, but it has a different transition function:
δ: Q × ∑ → 2^Q
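
Because the transition function returns a set of states, an NDFA can be simulated by tracking every state it could be in at once. The following is a sketch under assumptions: the machine (strings over {a, b} ending in ab, with one null move added for illustration) is hypothetical, null moves are stored under the empty-string key EPS, and the function names are made up here.

```python
EPS = ""  # key used for null (ε) moves in this sketch


def epsilon_closure(delta, states):
    """All states reachable from `states` using only null moves."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for nxt in delta.get((q, EPS), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def nfa_accepts(delta, start, finals, text):
    """Track the set of possible states after each input symbol."""
    current = epsilon_closure(delta, {start})
    for symbol in text:
        moved = set()
        for q in current:
            moved |= delta.get((q, symbol), set())
        current = epsilon_closure(delta, moved)
    return bool(current & finals)


# Hypothetical NDFA: strings over {a, b} ending in "ab".
DELTA = {
    ("q0", "a"): {"q0", "q1"},   # two possible targets on the same input
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
    ("q2", EPS): {"q3"},         # null move into the final state
}

print(nfa_accepts(DELTA, "q0", {"q3"}, "aab"))  # True
print(nfa_accepts(DELTA, "q0", {"q3"}, "abb"))  # False
```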
Example
-------------------------------------------------------------------------------------------------------------------------------------
To optimize (minimize) the DFA, follow these steps:
Step 1: Remove all the states that are unreachable from the initial state via any sequence of
transitions of the DFA.
Step 3: Now split the transition table into two tables T1 and T2. T1 contains all final states and
T2 contains non-final states.
δ (q, a) = p
δ (r, a) = p
That means, find two states that have the same transitions on a and b, and remove one of them.
Step 5: Repeat step 3 until no similar rows remain in the transition table T1.
Step 7: Now combine the reduced T1 and T2 tables. The combined transition table is the
transition table of the minimized DFA.
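
The same idea can be sketched in code. Note that this sketch uses partition refinement (repeatedly splitting groups of states until no group changes) rather than the row-merging procedure on tables T1 and T2 described above; both start from the final/non-final split. The function name minimize_dfa and the sample DFA are assumptions made for illustration, and the transition function is assumed to be total.

```python
def minimize_dfa(alphabet, delta, start, finals):
    """Return the groups of equivalent states as a set of frozensets.

    Assumes delta[(state, symbol)] exists for every reachable state and
    every symbol of the (ordered) alphabet.
    """
    # Drop the states that are unreachable from the start state.
    reachable, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for a in alphabet:
            p = delta[(q, a)]
            if p not in reachable:
                reachable.add(p)
                stack.append(p)

    # Initial split: final states versus non-final states.
    final_states = reachable & set(finals)
    partition = [b for b in (final_states, reachable - final_states) if b]

    while True:
        block_of = {q: i for i, block in enumerate(partition) for q in block}
        refined = []
        for block in partition:
            groups = {}
            for q in block:
                # States stay together only if every symbol sends them
                # into the same group.
                signature = tuple(block_of[delta[(q, a)]] for a in alphabet)
                groups.setdefault(signature, set()).add(q)
            refined.extend(groups.values())
        if len(refined) == len(partition):   # nothing was split: done
            return {frozenset(b) for b in refined}
        partition = refined


# Hypothetical DFA: q4 is unreachable, q0/q1 and q2/q3 are equivalent.
SAMPLE = {
    ("q0", "0"): "q1", ("q0", "1"): "q2",
    ("q1", "0"): "q0", ("q1", "1"): "q3",
    ("q2", "0"): "q2", ("q2", "1"): "q2",
    ("q3", "0"): "q3", ("q3", "1"): "q2",
    ("q4", "0"): "q0", ("q4", "1"): "q4",
}
print(minimize_dfa("01", SAMPLE, "q0", {"q2", "q3"}))
```

Each returned group becomes one state of the minimized DFA; here the five original states collapse into two.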
Example
-------------------------------------------------------------------------------------------------------------------------------------
Step 1: In the given DFA, q2 and q4 are unreachable states, so remove them.
Step 3:
1. One set contains those rows which start from non-final states:
2. The other set contains those rows which start from final states.
Step 5: In set 2, row 1 and row 2 are similar, since q3 and q5 transition to the same state on 0 and
1. So skip q5 and then replace q5 by q3 in the rest.
-------------------------------------------------------------------------------------------------------------------------------------
IX. LEX
o Lex is a program that generates a lexical analyzer. It is used with the YACC parser generator.
o The lexical analyzer is a program that transforms an input stream into a sequence of tokens.
o Lex reads a specification of the lexical analyzer and produces, as output, C source code
that implements it.
-------------------------------------------------------------------------------------------------------------------------------------
A Lex program is separated into three sections by %% delimiters. The format of a Lex source
file is as follows:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
Where pi describes a regular expression and actioni describes the action
the lexical analyzer should take when pattern pi matches a lexeme.
User subroutines are auxiliary procedures needed by the actions. These subroutines can be
compiled separately and loaded with the lexical analyzer.
X. FORMAL GRAMMAR
o A formal grammar is a set of rules. It is used to identify correct or incorrect strings of tokens
in a language. The formal grammar is represented as G.
o A formal grammar is used to generate all possible strings over the alphabet that are
syntactically correct in the language.
-------------------------------------------------------------------------------------------------------------------------------------
G = <V, N, P, S>
Where:
Example:
Through this production we can produce some strings like: bab, baab, baaab etc.
BNF stands for Backus-Naur Form. It is used to write a formal representation of a
context-free grammar. It is also used to describe the syntax of a programming language.
-------------------------------------------------------------------------------------------------------------------------------------
Where leftside ∈ (Vn ∪ Vt)+ and definition ∈ (Vn ∪ Vt)*. In BNF, the leftside contains one
non-terminal.
We can define several productions with the same leftside. All the productions are separated
by a vertical bar symbol "|".
S → aSa
S → bSb
S → c
S → aSa | bSb | c
XII. AMBIGUITY
A grammar is said to be ambiguous if there exists more than one leftmost derivation or more
than one rightmost derivation or more than one parse tree for the given input string. If the
grammar is not ambiguous then it is called unambiguous.
Example:
S → aSb | SS
S → ε
For the string aabb, the above grammar generates two parse trees:
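
One way to check the ambiguity claim mechanically is to enumerate leftmost derivations, since each distinct leftmost derivation corresponds to a distinct parse tree. The sketch below is an illustration under assumptions: the function name leftmost_derivations and the step limit max_steps are introduced here (the limit is needed because S → SS and S → ε allow arbitrarily long derivations).

```python
def leftmost_derivations(target, max_steps):
    """Leftmost derivations of target in S -> aSb | SS | ε, up to max_steps steps."""
    bodies = ["aSb", "SS", ""]       # "" stands for ε
    found = []

    def expand(form, history):
        i = form.find("S")           # position of the leftmost non-terminal
        if i < 0:                    # only terminals left
            if form == target:
                found.append(history)
            return
        if len(history) - 1 >= max_steps:
            return
        # Prune: terminals to the left of S must already match the target,
        # and the form must not contain more terminals than the target.
        if not target.startswith(form[:i]):
            return
        if len(form.replace("S", "")) > len(target):
            return
        for body in bodies:
            new_form = form[:i] + body + form[i + 1:]
            expand(new_form, history + [new_form])

    expand("S", ["S"])
    return found


derivations = leftmost_derivations("aabb", max_steps=5)
for d in derivations:
    print(" => ".join(step if step else "ε" for step in d))
```

Finding more than one derivation within the step limit is enough to show that the grammar is ambiguous for aabb.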
-------------------------------------------------------------------------------------------------------------------------------------
XIII. YACC
-------------------------------------------------------------------------------------------------------------------------------------
C Compiler
G= (V, T, P, S)
-------------------------------------------------------------------------------------------------------------------------------------
In a CFG, the start symbol is used to derive the string. You can derive the string by repeatedly
replacing a non-terminal by the right-hand side of a production, until all non-terminals have
been replaced by terminal symbols.
Example:
Production rules:
S → aSa
S → bSb
S → c
Now check that the string abbcbba can be derived from the given CFG.
S ⇒ aSa
  ⇒ abSba
  ⇒ abbSbba
  ⇒ abbcbba
By applying the productions S → aSa and S → bSb recursively and finally applying the production
S → c, we get the string abbcbba.
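
The derivation above can be reproduced programmatically. This is a minimal sketch for this one grammar only (S → aSa | bSb | c); the function name derivation is an assumption.

```python
def derivation(target):
    """Return the sentential forms from S to target, or None if underivable."""
    forms = ["S"]
    prefix, suffix, rest = "", "", target
    while True:
        if rest == "c":                      # apply S -> c
            forms.append(prefix + "c" + suffix)
            return forms
        if len(rest) >= 3 and rest[0] == rest[-1] and rest[0] in "ab":
            ch = rest[0]                     # apply S -> aSa or S -> bSb
            prefix, suffix, rest = prefix + ch, ch + suffix, rest[1:-1]
            forms.append(prefix + "S" + suffix)
        else:
            return None                      # no production fits


print(" => ".join(derivation("abbcbba")))
# S => aSa => abSba => abbSbba => abbcbba
```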
XV. DERIVATION
A derivation is a sequence of applications of production rules. It is used to obtain the input
string through these production rules. During parsing we have to make two decisions. These are as follows:
-------------------------------------------------------------------------------------------------------------------------------------
We have two options for deciding which non-terminal is to be replaced with a production rule.
LEFT-MOST DERIVATION
In the left-most derivation, the input is scanned and replaced with production rules from left
to right: at each step the leftmost non-terminal is replaced. So in a left-most derivation we
read the input string from left to right.
Example:
Production rules:
S → S + S
S → S - S
S → a | b | c
Input:
a-b+c
The left-most derivation is:
S ⇒ S+S
  ⇒ S-S+S
  ⇒ a-S+S
  ⇒ a-b+S
  ⇒ a-b+c
RIGHT-MOST DERIVATION
In the right-most derivation, the input is scanned and replaced with production rules from
right to left: at each step the rightmost non-terminal is replaced. So in a right-most derivation
we read the input string from right to left.
Example:
S → S + S
S → S - S
S → a | b | c
Input:
a-b+c
The right-most derivation is:
S ⇒ S-S
  ⇒ S-S+S
  ⇒ S-S+c
  ⇒ S-b+c
  ⇒ a-b+c
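
Both derivations of a-b+c can be replayed with a few lines of code. This is a sketch under assumptions: the grammar has the single non-terminal S, each step is given as the right-hand side to substitute, and the function name apply_rules is made up here.

```python
def apply_rules(bodies, leftmost=True):
    """Replace the leftmost (or rightmost) S with each body in turn."""
    form = "S"
    steps = [form]
    for body in bodies:
        i = form.find("S") if leftmost else form.rfind("S")
        if i < 0:
            raise ValueError("no non-terminal left to replace")
        form = form[:i] + body + form[i + 1:]
        steps.append(form)
    return steps


# Left-most: S -> S+S, S -> S-S, then a, b, c from left to right.
print(" => ".join(apply_rules(["S+S", "S-S", "a", "b", "c"], leftmost=True)))
# S => S+S => S-S+S => a-S+S => a-b+S => a-b+c

# Right-most: S -> S-S, S -> S+S, then c, b, a from right to left.
print(" => ".join(apply_rules(["S-S", "S+S", "c", "b", "a"], leftmost=False)))
# S => S-S => S-S+S => S-S+c => S-b+c => a-b+c
```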
o A parse tree is a graphical representation of a derivation. Each node is labeled with a
symbol, which can be a terminal or a non-terminal.
o In parsing, the string is derived using the start symbol. The root of the parse tree is that
start symbol.
o The parse tree follows the precedence of operators. The deepest sub-tree is traversed first, so the
operator in a parent node has lower precedence than the operator in its sub-tree.
Example:
Production rules:
T → T + T | T * T
T → a | b | c
Input:
a*b+c
-------------------------------------------------------------------------------------------------------------------------------------
[Figures: Steps 2 to 4 of the parse tree construction for a*b+c]
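
The precedence rule can be illustrated with a small recursive-descent sketch. Two assumptions apply: the grammar T → T + T | T * T is ambiguous as written, so this sketch substitutes the usual layered form (an expression is terms joined by +, a term is operands joined by *), and it builds a simplified tree with operators as internal nodes rather than a full parse tree. The function name parse is hypothetical.

```python
def parse(expr):
    """Build a nested-tuple tree for single-letter operands joined by + and *."""
    tokens = list(expr.replace(" ", ""))
    pos = 0

    def operand():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def term():                        # '*' binds tighter, so it sits deeper
        nonlocal pos
        node = operand()
        while pos < len(tokens) and tokens[pos] == "*":
            pos += 1
            node = ("*", node, operand())
        return node

    def expression():                  # '+' ends up nearer the root
        nonlocal pos
        node = term()
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            node = ("+", node, term())
        return node

    tree = expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("a*b+c"))   # ('+', ('*', 'a', 'b'), 'c')
```

The * node is deeper than the + node, so a*b is evaluated first, which matches the rule that the operator in the parent node has lower precedence than the operator in its sub-tree.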
-------------------------------------------------------------------------------------------------------------------------------------