Compiler Design Note
Chapter-1 (INTRODUCTION TO COMPILER): Phases and passes, Bootstrapping, Finite state machines and regular expressions and
their applications to lexical analysis, Optimization of DFA-based pattern matchers, implementation of lexical analyzers, lexical-
analyzer generator, LEX compiler, Formal grammars and their application to syntax analysis, BNF notation, ambiguity, YACC. The
syntactic specification of programming languages: Context-free grammars, derivation and parse trees, capabilities of CFG.
Chapter-2 (BASIC PARSING TECHNIQUES): Parsers, Shift reduce parsing, operator precedence parsing, top down parsing,
predictive parsers Automatic Construction of efficient Parsers: LR parsers, the canonical Collection of LR(0) items, constructing
SLR parsing tables, constructing Canonical LR parsing tables, Constructing LALR parsing tables, using ambiguous grammars, an
automatic parser generator, implementation of LR parsing tables.
Chapter-3 (SYNTAX-DIRECTED TRANSLATION): Syntax-directed Translation schemes, Implementation of Syntax- directed
Translators, Intermediate code, postfix notation, Parse trees & syntax trees, three address code, quadruple & triples, translation
of assignment statements, Boolean expressions, statements that alter the flow of control, postfix translation, translation with a
top down parser. More about translation: Array references in arithmetic expressions, procedures call, declarations and case
statements.
Chapter-4 (SYMBOL TABLES): Data structure for symbols tables, representing scope information. Run-Time Administration:
Implementation of simple stack allocation scheme, storage allocation in block structured language. Error Detection & Recovery:
Lexical phase errors, syntactic phase errors, semantic errors.
Chapter-5 (CODE GENERATION): Design Issues, the Target Language. Addresses in the Target Code, Basic Blocks and Flow
Graphs,
Optimization of Basic Blocks, Code Generator. Code optimization: Machine-Independent Optimizations, Loop optimization, DAG
representation of basic blocks, value numbers and algebraic laws, Global Data-Flow analysis.
Chapter-1
(INTRODUCTION TO COMPILER): Phases and passes,
Bootstrapping, Finite state machines and regular expressions
and their applications to lexical analysis, Optimization of DFA-
Based Pattern Matchers, implementation of lexical analyzers,
lexical-analyzer generator, LEX compiler, Formal grammars and
their application to syntax analysis, BNF notation, ambiguity,
YACC. The syntactic specification of programming languages:
Context free grammars, derivation and parse trees, capabilities
of CFG.
Language Processing System
We tend to write programs in a high-level language, which is much easier for us
to understand and maintain. These programs go through a series of
transformations so that they can be executed by a machine. This is where
language processing systems come in handy.
High Level Language
• A program written with preprocessor directives such as #include or #define
is called a high-level language (HLL) program.
Compiler
• A compiler is a computer program that translates
computer code written in one programming language
(the source language) into another language (the target
language).
Linker and Loader
• The linker combines a variety of relocatable object files into a single executable file, converting the
relocatable code into absolute code.
• The loader then loads the executable into memory and runs it, resulting in a running program or an
error message.
• The first practical compiler was written by Corrado
Böhm, in 1951, for his PhD thesis.
Lexical analysis of the statement x = y + z * 60 produces the token stream:
• x — identifier token
• = — operator token
• y — identifier token
• + — operator token
• z — identifier token
• * — operator token
• 60 — constant token
The corresponding intermediate code:
t1 = z * 60
x = y + t1
Target Code Generator
• The main purpose of the target code generator is to produce code that the
machine can understand; it also performs register allocation, instruction
selection, etc.
MOV R1, Z
MUL R1, 60
ADD R1, Y
STORE X, R1
Symbol Table
• It is a data structure used and maintained by the compiler that
contains all the identifiers' names along with their types.
• In general Front end fill symbol table and back end uses it.
(1) int x = 10;
• Lexical analysis is the first phase to interact with the symbol table; the compiler begins generating
the symbol table during the lexical analysis phase.
• The compiler is responsible for providing memory for the symbol table. At every phase, if any new
variable occurs, it is stored in the symbol table.
• Every phase of the compiler interacts with the symbol table.
• In general, during the first two phases we store the information in the symbol table, and in the
later phases we make use of the information available in the symbol table.
• Information stored in the symbol table about identifier
• name
• type
• scope
• size
• offset
Language Complexity
• Single-pass compiler: suited for simpler languages with less complex syntax and semantics.
• Multi-pass compiler: better for complex languages, as it can handle intricate syntax and
semantics more effectively.
Bootstrapping
• Bootstrapping in compiler design is a fascinating and critical concept.
• Historical Context
• Early Stages: Initially, compilers were written in assembly language or a low-
level language specific to the hardware.
• Evolution: As programming languages evolved, the need for writing compilers
in a higher-level language became evident.
• Bootstrapping Emergence: Bootstrapping was introduced as a solution to this
need. It refers to writing a compiler in the same language it intends to
compile.
Concept
• Initial Step: A simple compiler is first written in a low-level language.
This is often termed as the "bootstrap compiler."
• Self-Compiling: The compiler is then rewritten in its own higher-level
language and compiled using the bootstrap compiler.
• Iteration: This process can be iterated, with each new version of the
compiler used to compile its next version.
• Bootstrapping is the process of writing a compiler for a programming
language using the language itself. In other words, it is the process of
using a compiler written in a particular programming language to compile
a new version of the compiler written in the same language.
• If an NFA with n states is converted into a DFA with m states, then the
relationship between n and m is
• 1 ≤ m ≤ 2^n
Procedure for Conversion
• There is a fixed algorithm for NFA to DFA conversion. The following points
must be considered:
• The initial state always remains the same.
• Start the construction of δ' with the initial state, continue for every new
state that appears under the input columns, and terminate the process
when no new state appears under the input columns.
• Every subset of states that contains a final state of the NFA is a final state in
the resulting DFA.
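A minimal sketch of this subset construction in C, under simplifying assumptions: NFA states are numbered 0..n-1 with n small enough that a subset fits in one 32-bit mask, the alphabet is {0, 1}, and the example transition table nfa[][] (an NFA for strings ending in "01") is hypothetical, not taken from the text.

#include <stdio.h>
#include <stdint.h>

#define NSTATES 3  /* NFA states 0..2; state 2 is final (assumed example) */
#define NSYM    2  /* alphabet {0, 1} */

/* nfa[s][a] = bitmask of NFA states reachable from s on symbol a */
static const uint32_t nfa[NSTATES][NSYM] = {
    /* state 0 */ { (1u<<0) | (1u<<1), (1u<<0) },
    /* state 1 */ { 0,                 (1u<<2) },
    /* state 2 */ { 0,                 0       },
};

int main(void) {
    uint32_t dstates[1u << NSTATES];   /* discovered DFA states (subsets) */
    int ndfa = 0;
    dstates[ndfa++] = 1u << 0;         /* initial state stays the same: {q0} */

    /* continue for every new state under the input columns; stop when none appear */
    for (int i = 0; i < ndfa; i++) {
        for (int a = 0; a < NSYM; a++) {
            uint32_t next = 0;
            for (int s = 0; s < NSTATES; s++)
                if (dstates[i] & (1u << s))
                    next |= nfa[s][a];           /* union of member states' moves */
            int j = 0;
            while (j < ndfa && dstates[j] != next) j++;
            if (j == ndfa) dstates[ndfa++] = next;   /* a new subset appeared */
            printf("delta'(D%d, %d) = D%d%s\n", i, a, j,
                   (next & (1u << 2)) ? "  [final: contains an NFA final state]" : "");
        }
    }
    printf("DFA states: %d (and indeed 1 <= %d <= 2^%d)\n", ndfa, ndfa, NSTATES);
    return 0;
}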
Determinism and Uniqueness
• DFA: each state has a unique transition for each input symbol.
• NFA: a state can have multiple transitions for the same input symbol.
• Token stream for the statement x = a + b * 2;
• [(identifier, x), (operator, =), (identifier, a), (operator, +), (identifier, b), (operator,
*), (literal, 2), (separator, ;)]
Q Identify the meaning of the rule implemented by a lexer using the following regular expression
for identifying tokens of a password.
Regular Expression: A(A+S+D)^3 (A+S+D)^4
Implementation of Lexical analyzer
A lexical analyzer can be implemented in the following steps:
• The input to the lexical analyzer is a source program.
• Using an input buffering scheme, it scans the source program.
• Regular expressions are used to represent the input patterns.
• Each input pattern is converted into an NFA using finite automata.
• The NFA is then converted into a DFA, and the DFA is minimized using a minimization
method.
• The minimized DFA is used to recognize the patterns and break the input into lexemes.
• Each minimized DFA is associated with a pattern of the programming language and evaluates
the lexemes that match the regular expression.
• The tool then constructs a state table for the appropriate finite state machine and creates
program code which contains the table, the evaluation phases, and a routine which uses them
appropriately.
Lexical Analyzer Generator
• For efficient design of a compiler, various tools are used to automate its phases.
The lexical analysis phase can be automated using a tool called LEX.
• LEX is a Unix utility which generates a lexical analyzer.
• The lexical analyzer is generated with the help of regular expressions.
• A LEX-generated lexer is very fast in finding tokens compared to a handwritten
lexical analyzer in C.
• LEX scans the source program to produce the stream of tokens, and these
tokens can be related together so that various programming structures such as
expressions, block statements, control structures, and procedures can be recognized.
LEX compiler
• Automatic generation of lexical analyzer is done using LEX programming language.
• The LEX specification file can be denoted using the extension .l (often pronounced as dot L).
• For example, let us consider specification file as x.l.
• This x.l file is then given to LEX compiler to produce lex.yy.c . This lex.yy.c is a C program which
is actually a lexical analyzer program.
• The LEX specification file stores the regular expressions for the token and the lex.yy.c file
consists of the tabular representation of the transition diagrams constructed for the regular
expression.
• In specification file, LEX actions are associated with every regular expression.
• These actions are simply the pieces of C code that are directly carried over to the lex.yy.c.
Generation of lexical analyzer using LEX
• Finally, the C compiler compiles this generated lex.yy.c and produces an object program a.out.
• When some input stream is given to a.out then sequence of tokens gets generated.
Components of LEX program
Declaration section :
• In the declaration section, declaration of variable constants can be done.
• Some regular definitions can also be written in this section.
• The regular definitions are basically components of regular expressions.
Components of LEX program
• Rule section :
• The rule section consists of regular expressions with associated actions. These
translation rules can be given in the form as :
• R1 {action1}
• R2 {action2}
•.
•.
• Rn {actionn}
• where each Ri is a regular expression and each actioni is a program fragment
describing what action is to be taken for the corresponding regular expression.
• These actions are specified as pieces of C code.
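A minimal sketch of a complete LEX specification (say x.l) combining the declaration and rule sections described above; the token set and the regular definitions are illustrative assumptions, not taken from the text.

%{
/* Declaration section: C declarations copied into lex.yy.c */
#include <stdio.h>
int id_count = 0;
%}

digit   [0-9]
letter  [a-zA-Z_]

%%
{digit}+                     { printf("NUMBER(%s)\n", yytext); }
{letter}({letter}|{digit})*  { id_count++; printf("ID(%s)\n", yytext); }
[ \t\n]                      { /* skip whitespace */ }
.                            { printf("SYMBOL(%s)\n", yytext); }
%%

int main(void)   { yylex(); printf("identifiers seen: %d\n", id_count); return 0; }
int yywrap(void) { return 1; }

Running the LEX compiler on x.l produces lex.yy.c, which the C compiler turns into a.out, exactly as described below.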
Formal Grammars
• S is a special variable (S ∈ VN) called the start symbol. Just as every
automaton has exactly one initial state, every grammar has exactly one start symbol.
• P is a finite set whose elements are of the form α → β, where α and β are strings over
VN ∪ Σ and α has at least one symbol from VN. The elements of P are called productions,
production rules, or rewriting rules. (Σ ∪ VN is referred to by some writers as the total alphabet.)
For a formally valid production α → β:
α ∈ (Σ ∪ VN)* VN (Σ ∪ VN)*
β ∈ (Σ ∪ VN)*
For a context-free grammar, |α| = 1 with α ∈ VN, and β ∈ (Σ ∪ VN)*.
• In other words, the L.H.S. has no left context or
right context.
• Production Rules: A language in BNF is defined by production rules. Each rule has a left-hand
side (a non-terminal symbol) and a right-hand side, which is a sequence of terminal and/or
non-terminal symbols. The rule shows how you can replace the non-terminal with that
sequence.
• Example: Let's consider a simple arithmetic expression grammar:
• Expression → Expression + Term | Term
• Term → Term * Factor | Factor
• Factor → ( Expression ) | Number
• Number → Digit | Number Digit
• Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• In this example, "Expression", "Term", "Factor", "Number", and "Digit" are non-
terminal symbols, while '+', '*', '(', ')', and the digits are terminal symbols.
• Use in Parsing: BNF is widely used in the design and implementation of
compilers and interpreters for programming languages. It helps in building the
syntax tree of a program by parsing the source code according to the grammar
rules defined in BNF.
• Readability: BNF provides a concise and readable way to specify the syntax of a
language. It's easier to understand and modify compared to directly writing
parser code.
• Extensions: There are also extensions to BNF, like Extended BNF (EBNF), which
provides more expressive power by adding more constructs like optional
elements, repetitions, and choices.
Ambiguous grammar: a CFG is said to be ambiguous if
there is more than one derivation tree for some string, i.e., if there exists
more than one derivation tree (LMDT or RMDT), the grammar is said to
be ambiguous.
S → aS / Sa / a
YACC (Yet Another Compiler Compiler)
• YACC (Yet Another Compiler Compiler) is a tool used in compiler construction to
generate a parser from a specified grammar. Here are the key aspects:
• Function: YACC is utilized for creating compilers or interpreters. It takes a
language's grammar (in Backus-Naur Form) and produces C code for a parser.
• Structure of YACC File:
• Declarations: Defines tokens and includes necessary C code.
• Rules: Contains grammar rules with associated C actions.
• User Subroutines: Additional C functions for the parser.
• Integration with Lex: YACC is commonly used with Lex, a lexical analyzer, for
tokenization.
• Parser Type: The parser generated is a LALR(1) parser, a type of efficient bottom-
up parser.
• Error Handling: YACC includes syntax error detection and recovery mechanisms.
• Advantages:
• Automates parser creation.
• Simplifies grammar modification.
• Separates syntax rules from actions.
• Applications and Alternatives: It's widely used in both academia and industry
for compiler and interpreter development. Tools like Bison offer similar
functionality with additional features.
• Understanding YACC is important for computer science students, particularly for
insights into compiler design and parsing techniques.
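A minimal sketch of a YACC specification with the three sections named above; the grammar (a tiny calculator) and the NUM token are illustrative assumptions, and a Lex scanner supplying yylval for NUM is assumed.

%{
/* Declarations: C code and token definitions */
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token NUM
%left '+'      /* precedence/associativity resolve the ambiguous grammar */
%left '*'

%%
/* Rules: grammar in BNF with C actions ($$ = result, $1, $3 = operands) */
expr : expr '+' expr   { $$ = $1 + $3; }
     | expr '*' expr   { $$ = $1 * $3; }
     | '(' expr ')'    { $$ = $2; }
     | NUM             { $$ = $1; }
     ;
%%
/* User subroutines */
int main(void) { return yyparse(); }

The %left declarations illustrate the point made above about using ambiguous grammars: YACC resolves the resulting shift/reduce conflicts using the declared precedence.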
Defining a language by grammar
• The concept of defining a language using grammar is, starting from a start symbol using the
production rules of the grammar any time, deriving the string. Here every time during
derivation a production is used as its LHS is replaced by its RHS, all the intermediate
stages(strings) are called sentential forms. The language formed by the grammar consists of all
distinct strings that can be generated in this manner.
L(G) = { w | w ∈ Σ*, S ⇒* w }
• ⇒* (the reflexive, transitive closure) means that from S we can derive w in zero or more steps.
• Derivation: - The process of deriving a string is known as derivation.
• Derivation/ Syntax/ Parse Tree: - The graphical representation of
derivation is known as derivation tree.
E → E + E / E * E / E = E / id
• Sentential form: an intermediate step involved in the derivation is known
as a sentential form.
Sentential Forms
E
E * E
E + E * E
id + E * E
id + id * E
id + id * id
• Left most derivation: the process of construction of a parse tree by expanding
the left most non-terminal is known as LMD, and the graphical representation of
LMD is known as LMDT (left most derivation tree).
LMD (using E → E + E / E * E / E = E / id):
E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
• Right most derivation: the process of construction of a parse tree by expanding
the right most non-terminal is known as RMD, and the graphical
representation of RMD is known as RMDT (right most derivation tree).
RMD (using E → E + E / E * E / E = E / id):
E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id
Capabilities of CFG
Context-Free Grammar (CFG) plays a vital role in computer science for describing programming
language syntax and creating efficient parsers:
• Programming Languages: CFGs effectively describe the syntax of most programming
languages, organizing elements like statements, expressions, and declarations.
• Efficient Parser Construction: A well-designed CFG enables the automatic generation of
efficient parsers, essential for code analysis and interpretation.
• Associativity and Precedence: CFGs handle associativity and precedence in expressions,
ensuring correct interpretation of operations and expression hierarchies.
• Nested Structures: CFGs excel in depicting nested structures common in programming
languages, like balanced parentheses, matching begin-end blocks, and nested if-then-else
statements.
Chapter-2
(BASIC PARSING TECHNIQUES): Parsers, Shift reduce
parsing, operator precedence parsing, top down parsing,
predictive parsers Automatic Construction of efficient
Parsers: LR parsers, the canonical Collection of LR(0)
items, constructing SLR parsing tables, constructing
Canonical LR parsing tables, Constructing LALR parsing
tables, using ambiguous grammars, an automatic parser
generator, implementation of LR parsing tables.
Syntax analysis
• In syntax analysis, the input is a stream of tokens and the output is a syntax tree. The
process of construction of the parse tree / syntax tree / derivation tree is called
parsing.
• For any input string which is given in the form of stream of tokens for the parser, if the
derivation tree exists, then the input string is syntactically or grammatically correct.
• If the parser cannot generate the derivation tree from the i/p string, then there must be some
grammatical mistakes in the string.
Classification of parser
• The program which performs parsing is known as a parser or syntax analyzer.
Parsers are classified as:
• Top-Down Parser (e.g., LL(1))
• Bottom-Up Parser
Top down parsing
• The process of construction of the parse tree, starting from the root and proceeding
towards the children, is known as top-down parsing; i.e., deriving the input string by
starting with the start symbol of the grammar is top-down parsing.
A → aA / ε
Top-Down Parser:
• With Backtracking: Brute Force
• Without Backtracking: Predictive Parser (LL(1))
Non-Deterministic Grammar
• A → αβ1 / αβ2
After left factoring the common prefix α:
• A → αA'
• A' → β1 / β2
Recursive production: a production which has the same variable on both the left- and right-hand
side is known as a recursive production.
S → aSb
S → aS
S → Sa
Recursive grammar: a grammar which contains at least one recursive production is known as a
recursive grammar.
S → aS / a
S → Sa / a
S → aSb / ab
Left Recursive Grammar: a grammar G is said to be left recursive if the leftmost
variable of the RHS is the same as the variable on the LHS.
Right Recursive Grammar: a grammar G is said to be right recursive if the rightmost
variable of the RHS is the same as the variable on the LHS.
General recursion: recursion which is neither left nor right is called general recursion. If a
CFG generates an infinite number of strings, then it must be a recursive grammar.
Non-recursive grammar: a grammar which is free from recursive productions is called a
non-recursive grammar.
S → AaB
A → a
B → b
• If a CFG contains left recursion, the compiler may go into an infinite loop; hence,
to avoid looping, we convert the left recursive grammar into its equivalent
right recursive form:
A → Aα / β1 / β2 / ... / βn  becomes  A → β1A' / β2A' / ... / βnA',  A' → αA' / ε
A → Aα1 / Aα2 / ... / Aαn / β  becomes  A → βA',  A' → α1A' / α2A' / ... / αnA' / ε
A → Aα1 / ... / Aαn / β1 / ... / βm  becomes  A → β1A' / ... / βmA',  A' → α1A' / ... / αnA' / ε
For example, E → E + T / T becomes E → TE', E' → +TE' / ε.
Ambiguous grammar: a CFG is said to be ambiguous if
there is more than one derivation tree for some string, i.e., if there exists
more than one derivation tree (LMDT or RMDT), the grammar is said to
be ambiguous.
S → aS / Sa / a
• Grammar which is both left and right recursive is always ambiguous,
but the ambiguous grammar need not be both left and right recursive.
• A top-down parser uses leftmost derivation.
• Left most derivation: the process of construction of a parse tree by expanding
the left most non-terminal is known as LMD, and the graphical representation of
LMD is known as LMDT (left most derivation tree).
A → AaA / ε
• A TDP can be constructed for a grammar only if it is free from left recursion.
Brute force technique
• Whenever a non-terminal is expanded for the first time, go with the first
alternative and compare with the input string. If it does not match, go for the
second alternative and compare with the input string; if it does not match, go with
the 3rd alternative, and continue with each and every alternative.
• If the matching occurs for at least one alternative, then the parsing is successful;
otherwise parsing fails.
S → cAd
A → ab / a
w1 = cad w2 = cada
S → aAc / aB
A → b / c
B → ccd / ddc
w = addc
• A TDP may be constructed for both left-factored and non-left-factored grammars.
• If the grammar is non-deterministic, we use the brute force technique; if the grammar
is deterministic, we go with the predictive parser.
• Worst case time complexity in TDP may go up to O(n^4), except for brute force.
• FIRST(α) is the set of all terminals that may appear at the beginning of any
sentential form derived from α.
• If α is a terminal, then
• FIRST(α) = {α}
S → a / b / ε
FIRST(S) = {a, b, ε}
FOLLOW(S) = {$}
S → aA / bB
A → ε
B → ε
FIRST(S) = {a, b}, FIRST(A) = {ε}, FIRST(B) = {ε}
FOLLOW(S) = {$}, FOLLOW(A) = {$}, FOLLOW(B) = {$}
S → AaB / BA
A → a / b
B → d / e
FIRST(S) = {a, b, d, e}, FIRST(A) = {a, b}, FIRST(B) = {d, e}
FOLLOW(S) = {$}, FOLLOW(A) = {a, $}, FOLLOW(B) = {a, b, $}
S → AaB
A → b / ε
B → c
FIRST(S) = {b, a}, FIRST(A) = {b, ε}, FIRST(B) = {c}
FOLLOW(S) = {$}, FOLLOW(A) = {a}, FOLLOW(B) = {$}
S → AB
A → a / ε
B → b / ε
FIRST(S) = {a, b, ε}, FIRST(A) = {a, ε}, FIRST(B) = {b, ε}
FOLLOW(S) = {$}, FOLLOW(A) = {b, $}, FOLLOW(B) = {$}
S → ABCDE
A → a / ε
B → b / ε
C → c / ε
D → d
E → e / ε
FIRST(S) = {a, b, c, d}, FIRST(A) = {a, ε}, FIRST(B) = {b, ε}, FIRST(C) = {c, ε}, FIRST(D) = {d}, FIRST(E) = {e, ε}
FOLLOW(S) = {$}, FOLLOW(A) = {b, c, d}, FOLLOW(B) = {c, d}, FOLLOW(C) = {d}, FOLLOW(D) = {e, $}, FOLLOW(E) = {$}
E → TE'
E' → +TE' / ε
T → FT'
T' → *FT' / ε
F → (E) / id
FIRST(E) = FIRST(T) = FIRST(F) = {(, id}, FIRST(E') = {+, ε}, FIRST(T') = {*, ε}
FOLLOW(E) = FOLLOW(E') = {), $}, FOLLOW(T) = FOLLOW(T') = {+, ), $}, FOLLOW(F) = {*, +, ), $}
S → aBDh
B → cC
C → bC / ε
D → EF
E → g / ε
F → f / ε
FIRST(S) = {a}, FIRST(B) = {c}, FIRST(C) = {b, ε}, FIRST(D) = {g, f, ε}, FIRST(E) = {g, ε}, FIRST(F) = {f, ε}
FOLLOW(S) = {$}, FOLLOW(B) = {g, f, h}, FOLLOW(C) = {g, f, h}, FOLLOW(D) = {h}, FOLLOW(E) = {f, h}, FOLLOW(F) = {h}
Grammar:
E → TE'
E' → +TE' / ε
T → FT'
T' → *FT' / ε
F → (E) / id
Input string: id + id * id $
LL(1) parse table (rows: non-terminals; columns: + * ( ) id $):
E : on ( and id, E → TE'
E' : on +, E' → +TE'; on ) and $, E' → ε
T : on ( and id, T → FT'
T' : on *, T' → *FT'; on +, ), and $, T' → ε
F : on (, F → (E); on id, F → id
Q Consider the given LL(1) grammar; design the parsing table and perform complete
parsing of the input id+id*id$.
Stack        i/p           Action
$            id+id*id$     push E
$E           id+id*id$     E → TE'
$E'T         id+id*id$     T → FT'
$E'T'F       id+id*id$     F → id
$E'T'id      id+id*id$     match id
$E'T'        +id*id$       T' → ε
$E'          +id*id$       E' → +TE'
$E'T+        +id*id$       match +
$E'T         id*id$        T → FT'
$E'T'F       id*id$        F → id
$E'T'id      id*id$        match id
$E'T'        *id$          T' → *FT'
$E'T'F*      *id$          match *
$E'T'F       id$           F → id
$E'T'id      id$           match id
$E'T'        $             T' → ε
$E'          $             E' → ε
$            $             accept
• If the LL(1) parse table of a grammar does not contain multiple entries in
the same cell, then the grammar is LL(1).
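A minimal sketch in C of the table-driven (non-recursive) predictive parser for the grammar above; the single-character encoding (e for E', t for T', i for id) and the hard-coded table are illustrative assumptions.

#include <stdio.h>
#include <string.h>

/* Grammar: E->Te, e->+Te|eps, T->Ft, t->*Ft|eps, F->(E)|i
   Terminals: + * ( ) i $ */
static const char *rhs(char nt, char term) {
    switch (nt) {
    case 'E': if (term=='('||term=='i') return "Te";  break;
    case 'e': if (term=='+') return "+Te";
              if (term==')'||term=='$') return "";    break; /* eps */
    case 'T': if (term=='('||term=='i') return "Ft";  break;
    case 't': if (term=='*') return "*Ft";
              if (term=='+'||term==')'||term=='$') return ""; break;
    case 'F': if (term=='(') return "(E)";
              if (term=='i') return "i";              break;
    }
    return NULL;                                   /* blank cell: error */
}

int main(void) {
    const char *input = "i+i*i$";                  /* id + id * id $ */
    char stack[64] = "$E";                         /* bottom marker $, start symbol E */
    int top = 1, ip = 0;

    while (stack[top] != '$') {
        char X = stack[top], a = input[ip];
        if (X == a) { top--; ip++; continue; }     /* match a terminal */
        const char *r = rhs(X, a);
        if (!r) { puts("error"); return 1; }
        printf("%c -> %s\n", X, *r ? r : "eps");
        top--;                                     /* pop X, push RHS reversed */
        for (int k = (int)strlen(r) - 1; k >= 0; k--) stack[++top] = r[k];
    }
    puts(input[ip] == '$' ? "accepted" : "error");
    return 0;
}

Running it reproduces the move sequence shown in the trace above and prints "accepted".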
Parser classification:
• Top-Down Parser
  • With Backtracking: Brute Force
  • Without Backtracking: Predictive
    • Recursive Descent Parser
    • Non-Recursive Descent Parser (LL(1))
• Bottom-Up Parser
  • LR Parser
  • Operator Precedence Parser
Bottom Up parser
• The process of constructing the parse tree in the bottom-up manner, i.e. starting
from the children and proceeding towards the root.
S → aABc
A → b / bc
B → d
w = abcdc
• LR parsers were invented by Donald Knuth in
1965 as an efficient generalization of precedence
parsers. Knuth proved that LR parsers were the
most general-purpose parsers possible.
• Stack: the stack contains grammar symbols; symbols are pushed onto or popped from
the stack using the shift and reduce operations. If a handle occurs at the top of the
stack, apply the reduce operation; if a handle does not occur at the top of the stack,
apply the shift operation.
• Parse table: the parse table is constructed using terminals, non-terminals, and LR(0)
items. It consists of two parts:
• Action
• Goto
• The Action part contains shift and reduce operations over the terminals.
• The Goto part consists of only shift operations over the non-terminals.
• Operations in a shift/reduce parser:
• shift
• reduce
• accept
• error
             Action            Goto
States       Terminals         Non-Terminals
I0 ... In-1  Shift / Reduce    Shift
• Shift: the shift operation is used when a handle does not occur at the top of the
stack; it moves the look-ahead symbol onto the stack.
• Reduce: the reduce operation is used whenever a handle occurs at the top of the
stack; it replaces the handle on top of the stack by the LHS non-terminal of the
matching production.
• Accept: after scanning the complete input string, if the stack contains only the
start symbol of the grammar as its topmost symbol, then the input string is
accepted and the parsing is successful.
• Error: after scanning the complete input string, if the stack contains any symbol
different from the start symbol as its topmost symbol, then the parsing is
unsuccessful and hence an error.
• LR(k)
• L stands for left-to-right scanning of the input.
• R stands for constructing a rightmost derivation in reverse.
• k is the number of look-ahead symbols.
• Procedure for the construction of the LR parser table:
1. Obtain the augmented grammar for the given grammar.
2. Create the canonical collection of LR(0) items.
3. Draw the DFA and prepare the table based on the LR items.
• Augmented grammar
• The grammar obtained by adding one more production that
generates the start symbol of the grammar is known as the augmented
grammar.
Grammar: S → AB, A → a, B → b
Augmented grammar: S' → S, S → AB, A → a, B → b
• LR(0) item
• A production which has a dot (.) anywhere on its RHS is known as an LR(0)
item.
• A → abc
• LR(0) items:
• A → .abc
• A → a.bc
• A → ab.c
• A → abc.  (final / completed item)
• Canonical Collection:
• The set C = {I0, I1, I2, I3, ..., In} is known as the canonical
collection of LR(0) items.
• Functions used to generate LR(0) items:
• Closure: input is a set of items, and output is also a set of items.
• Goto(I, X): for every item A → α.Xβ in I, add the item A → αX.β,
and take the closure of the result.
Procedure to construct LR parse Table
• The LR parse table consists of two parts:
• Action: consists of both shift and reduce operations, performed on
terminals.
• Goto: consists of shift operations performed on non-terminals.
• If Goto(Ii, X) = Ij and X is a terminal, place Sj (shift and go to state j) in
Action[Ii, X].
• If Goto(Ii, X) = Ij and X is a non-terminal, place j in Goto[Ii, X].
• If Ii contains a final item representing production number i,
place ri (reduce by production i) under all the terminal symbols in the
action part of that row:
      t1   t2   t3   ...  tn   $
Ii    ri   ri   ri   ...  ri   ri
Q Construct the LR(0) parse table (Action and Goto) for:
S → AA
A → aA / b
Q Construct the LR(0) parse table (Action and Goto) for:
E → T + E / T
T → id
SLR(1)
• The procedure for constructing the parse table is similar to LR(0), but there is a
restriction on the reduce entries.
• Whenever there is a final item, place the reduce entries only under the FOLLOW
symbols of the LHS non-terminal.
• If the SLR(1) parse table is free from multiple entries, then the grammar is an SLR(1) grammar.
• Every LR(0) grammar is SLR(1), but every SLR(1) grammar need not be LR(0).
• The SLR(1) parser is more powerful than the LR(0) parser.
Q Construct the CLR(1) parse table (Action and Goto) for:
S → CC
C → cC / d
CLR(1)
• LR(1) depends on one look-ahead symbol.
• Closure(I):
• Add everything from the input to the output.
• If [A → α.Bβ, $] is in Closure(I) and B → γ is a production in G, then add
[B → .γ, b] for every b ∈ FIRST(β$) to Closure(I).
• Repeat the previous step for every newly added item.
• Goto(I, X):
• There is no change in the production (core) part while finding the transition.
• There may be a change in the follow (look-ahead) part while finding the closure.
• LR(1) grammar: the grammar for which the LR(1) parser is constructed is known as LR(1) or
CLR(1).
• If the LR(1) parse table is free from multiple entries or conflicts, then it
is an LR(1) grammar.
• Every SLR(1) grammar is CLR(1), but every CLR(1) grammar need not be SLR(1).
• For this reason, the CLR(1) parse table contains more entries (states), and hence the
CLR(1) parser is the most costly.
LALR(1)
• In the CLR(1) parser, there can be more than one state having the same production part
but different follow parts. Combine the states whose production parts are
common and follow parts differ into a single state, and then construct the
parse table; if the parse table is free from multiple entries, then the grammar is
LALR(1).
S → Aa / bAc / dc / bda
A → d
• If in CLR(1) there are no states having the same production part but
different follow parts, the grammar is both CLR(1) and LALR(1).
S → Aa / bAc / Bc / bBa
A → d
B → d
Operator Precedence Grammar
• An operator precedence parser can be constructed for both ambiguous and
unambiguous grammars.
• In general, operator precedence grammars have less complexity.
• Every CFG is not an operator precedence grammar.
• Generally used for languages which are useful in scientific applications.
• An operator grammar is a context-free grammar that has the following properties:
o Does not contain ε-productions.
o No adjacent non-terminals on the RHS of any production.
• When both the stack and the input contain only $, parsing is done.
Algorithm for computing precedence function
Aspect: Direction of Analysis
• Top-Down Parsing: begins from the start symbol and works towards the leaves of the syntax tree.
• Bottom-Up Parsing: starts from the leaves (input symbols) and works towards the root of the syntax tree.
Aspect: Parse Tree Construction
• Top-Down Parsing: constructs the parse tree from the top (root) to the bottom (leaves).
• Bottom-Up Parsing: constructs the parse tree from the bottom (leaves) to the top (root).
Chapter-3
(SYNTAX-DIRECTED TRANSLATION): Syntax-directed Translation
schemes, Implementation of Syntax-directed Translators,
Intermediate code, postfix notation, Parse trees & syntax trees,
three address code, quadruple & triples, translation of assignment
statements, Boolean expressions, statements that alter the flow of
control, postfix translation, translation with a top down parser.
More about translation: Array references in arithmetic expressions,
procedure calls, declarations and case statements.
• With the grammar we attach meaningful (semantic) rules; apart from semantic analysis, SDT can
also be used to perform things like:
• Code generation
• Intermediate code generation
• Values in the symbol table
• Expression evaluation
• Converting infix to postfix
• These things can be done in parallel with parsing, so with semantic actions and rules
parsers become much more powerful.
Q Consider the following grammar along with translation rules, where # and % are operators and id is a
token that represents an integer, with id.val the corresponding integer value. The set of non-terminals
is {S, T, R, P}, and a subscripted non-terminal indicates an instance of the non-terminal. Using this
translation scheme, the computed value of S.val for the root of the parse tree for the expression
20 # 10 % 5 # 8 % 2 % 2 is ____.
Q Consider the grammar with the following translation rules and E as the start
symbol.
E → E1 # T { E.value = E1.value * T.value }
E → T { E.value = T.value }
T → T1 & F { T.value = T1.value + F.value }
T → F { T.value = F.value }
F → num { F.value = num.value }
Compute E.value for the root of the parse tree
for the expression: 2 # 3 & 5 # 6 & 4.
(Here & binds tighter than #, so the value is (2 * (3 + 5)) * (6 + 4) = 160.)
Q Consider the grammar with the following translation rules and E as the start
symbol.
E → E1 + T {print ('+’);}
E→T
T → T1 * F {print ('*’);}
T→F
F → num {print (num.val);}
Construct the parse tree for the
string 2 + 3 * 4, and find what will
be printed. (Output: 2 3 4 * +)
Q Consider the translation scheme shown below S → T R
R → + T {print ('+');} R / ε
T → num {print (num.val);}
Here num is a token that represents an integer and num.val represents the
corresponding integer value. For the input string ‘9 + 5 + 2ʹ, this translation scheme
will print 95+2+.
Q Consider the following translation scheme.
S → ER
R → *E{print(“*”);}R | ε
E → F + E {print(“+”);} | F
F → (S) | id {print(id.value);}
Here id is a token that represents an integer and id.value represents the
corresponding integer value. For the input ‘2 * 3 + 4ʹ, this translation scheme prints 2 3 4 + *.
Q Consider the following Syntax Directed Translation Scheme (SDTS),
with non-terminals {S, A} and terminals {a, b}. Using the above SDTS,
the output printed by a bottom-up parser for the input ‘aab’ is ____.
Attributes
• Attributes attach relevant information like strings, numbers, types, memory
locations, or code fragments to grammar symbols of a language, which are used
as labels for nodes in a parse tree.
• The value of each attribute at a parse tree node is determined by semantic rules
associated with the production applied at that node, defining the context-
specific information for the language construct.
Classification of Attributes
• Based on the process of Evaluation of the values, attributes are
classified into two types:
• Synthesised Attributes
• Inherited Attributes
• Synthesized attributes are derived from a node's children within a
parse tree, and a syntax-directed definition relying solely on
these attributes is termed S-attributed.
• S-attributed definitions allow parse trees to be annotated from the
leaves up to the root, enabling parsers to directly evaluate semantic
rules during the parsing process.
A → XYZ { A.S = f(X.S, Y.S, Z.S) }
• Inherited Attributes: The attribute whose values are evaluated in
terms of attribute value of parents & Left siblings is known as
inherited attributes.
• Inherited attributes are convenient for expressing
dependence of a programming language construct on the
context in which it appears.
Synthesized Attributes vs Inherited Attributes:
• Synthesized: computed from the attribute values of a node's children in the parse tree.
Inherited: computed from the attribute values of a node's siblings and parent.
• Synthesized: often associated with bottom-up parsing techniques like LALR or SLR.
Inherited: often associated with top-down parsing techniques such as LL parsers.
• Synthesized: do not need context from parent nodes, only from children and the node itself.
Inherited: require context from parent or surrounding nodes to be computed.
S-Attributed SDT vs L-Attributed SDT:
• S-attributed: uses only synthesized attributes.
L-attributed: uses both inherited and synthesized attributes; each inherited attribute is
restricted to inherit either from the parent or left siblings only.
• S-attributed: semantic actions are placed at the extreme right end of the production.
L-attributed: semantic actions may be placed anywhere on the right-hand side of the production.
• S-attributed: attributes are evaluated during bottom-up parsing.
L-attributed: attributes are evaluated by traversing the parse tree depth-first, left to right.
Intermediate Code Generation
• Intermediate code generation in compilers creates a machine-independent, low-level
representation of source code, facilitating optimization and making the compiler design more
modular. This abstraction layer allows for:
• Portability: Easier adaptation of the compiler to different machine architectures, as only
the code generation phase needs to be machine-specific.
• Optimization Opportunities: More efficient target code through optimizations performed
on the intermediate form rather than on high-level source or machine code.
• Ease of Compiler Construction: Simplifies the development and maintenance of the
compiler by decoupling the source language from the machine code generation.
x = y + z * 60
t1 = z * 60
t2 = y + t1
x = t2
Intermediate Code Generation
Forms of intermediate code:
• Postfix notation
• Syntax tree
• DAG
• Three-address code
Q Draw the syntax tree for the arithmetic expression a * (b + c) – d/2, and write the
expression in postfix notation. (Postfix: abc+*d2/-)
For (a+b) * (a + b + c):
Postfix: ab+ab+c+*
Three address code:
t1 = a+b
t2 = a+b
t3 = t2 + c
t4 = t1 * t3
(The syntax tree keeps both a+b sub-trees; in the DAG the common sub-expression a+b is shared.)
3 Address Code
• Three-address code is a type of intermediate code where each instruction can
have at most three operands and one operator, like a := b op c. It simplifies
complex operations into a sequence of simple statements, supporting various
operators for arithmetic, logic, or boolean operations.
Types of 3 address codes:
1) x = y operator z
2) x = operator y
3) x = y
4) goto L
5) A[i] = x and y = A[i]
6) x = *p and y = &x
• 3 address codes can be implemented in a number of ways
-(a + b) * (c + d) + (a + b + c)
1) t1 = a+b
2) t2 = -t1
3) t3 = c+d
4) t4 = t2 * t3
5) t5 = a+b
6) t6 = t5 + c
7) t7 = t4 + t6
Quadruples
Three address code:       Quadruples (Operator, Operand1, Operand2, Result):
1) t1 = a+b               1) (+, a, b, t1)
2) t2 = -t1               2) (-, t1, , t2)
3) t3 = c+d               3) (+, c, d, t3)
4) t4 = t2 * t3           4) (*, t2, t3, t4)
5) t5 = a+b               5) (+, a, b, t5)
6) t6 = t5 + c            6) (+, t5, c, t6)
7) t7 = t4 + t6           7) (+, t4, t6, t7)
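A minimal sketch, in C, of one of the "number of ways" mentioned above for storing a quadruple record; the struct layout and field names are illustrative assumptions.

struct Quad {
    char op[8];      /* operator, e.g. "+", "-", "*"        */
    char arg1[8];    /* first operand (name or temporary)   */
    char arg2[8];    /* second operand, empty for unary ops */
    char result[8];  /* result temporary or variable        */
};

/* Quadruples for t2 = -t1 and t4 = t2 * t3 */
struct Quad code[] = {
    { "-", "t1", "",   "t2" },
    { "*", "t2", "t3", "t4" },
};

A triple would simply drop the result field and refer to operands by their index in code[], which is why triples save space but cannot be moved, as noted below.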
Triples
Three address code:       Triples (Operator, Operand1, Operand2):
1) t1 = a+b               (1) (+, a, b)
2) t2 = -t1               (2) (-, (1), )
3) t3 = c+d               (3) (+, c, d)
4) t4 = t2 * t3           (4) (*, (2), (3))
5) t5 = a+b               (5) (+, a, b)
6) t6 = t5 + c            (6) (+, (5), c)
7) t7 = t4 + t6           (7) (+, (4), (6))
• Advantage
• Space is not wasted
• Disadvantage
• Statement cannot be moved
Indirect Triplet
Triple can be separated by order of execution and uses the pointers concepts
• Advantage
• Statement can be moved
• Disadvantage
• two memory access
Q Write the quadruples, triples, and indirect triples for the following
expression: (x+y) * (y+z) + (x+y+z).
Quadruples:
Index  Operator  Arg1  Arg2  Result
1      +         x     y     t1
2      +         y     z     t2
3      *         t1    t2    t3
4      +         x     y     t4
5      +         t4    z     t5
6      +         t3    t5    t6
Triples:
Index  Operator  Arg1  Arg2
0      +         x     y
1      +         y     z
2      *         (0)   (1)
3      +         x     y
4      +         (3)   z
5      +         (2)   (4)
Indirect triples: the triples (0-5) above, plus an execution-order list of pointers:
Pointer  Index
p0       0
p1       1
p2       2
p3       3
p4       4
p5       5
One Dimensional array
Address of the element at index k:
• a[k] = B + W * k (when the lower bound is 0)
• a[k] = B + W * (k – lower bound)
where:
• B is the base address of the array
• W is the size of each element
• k is the index of the element
• Lower bound: index of the first element of the array
• Upper bound: index of the last element of the array
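A quick worked example (numbers assumed for illustration): if B = 1000, W = 4 (4-byte integers), and the lower bound is 1, then the address of a[5] = 1000 + 4 * (5 - 1) = 1016.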
Boolean expressions (numerical representation):
if (a < b) then t = 1 else t = 0
Translation of a while loop (statement S controlled by the condition i < 10):
i)    i = 0
i+1)  if (i < 10) goto i+3
i+2)  goto i+6
i+3)  S
i+4)  i = i + 1
i+5)  goto i+1
i+6)  exit
Q Generate three address code for the following code:
switch (a + b) { ... }
101: t1 = a + b   goto 103
...
107: if t1 = 2 goto 109
108: goto 111
109: t3 = y + 2
110: y = t3
111: if t1 = 3 goto 113
112: goto 115
Q Consider the grammar with the following translation rules and E as the start symbol.
E → E1 + T { E.nptr = mknode(E1.nptr, +, T.nptr); }
E → T { E.nptr = T.nptr }
T → T1 * F { T.nptr = mknode(T1.nptr, *, F.nptr); }
T → F { T.nptr = F.nptr }
F → id { F.nptr = mknode(null, id.name, null); }
Assignment instruction x = op y
E → E1 AND E2 : E.place = newtemp(); Emit(E.place = E1.place ‘and’ E2.place);
E → NOT E1 : E.place = newtemp(); Emit(E.place = ‘not’ E1.place);
E → (E1) : E.place = E1.place;
E → TRUE : E.place = newtemp(); Emit(E.place = ‘1’);
E → FALSE : E.place = newtemp(); Emit(E.place = ‘0’);
Production rule: S → id := E
Semantic actions: { id_entry := look_up(id.name);
  if id_entry != nil then append(id_entry ‘:=’ E.place)
  else error; /* id not declared */ }
Production rule: E → E1 + E2
Semantic actions: { E.place := newtemp(); append(E.place ‘:=’ E1.place ‘+’ E2.place) }
Production rule: E → E1 * E2
Semantic actions: { E.place := newtemp(); append(E.place ‘:=’ E1.place ‘*’ E2.place) }
Production rule: E → – E1
Semantic actions: { E.place := newtemp(); append(E.place ‘:=’ ‘minus’ E1.place) }
Chapter-4
(SYMBOL TABLES): Data structure for symbols
tables, representing scope information. Run-
Time Administration: Implementation of simple
stack allocation scheme, storage allocation in
block structured language. Error Detection &
Recovery: Lexical Phase errors, syntactic phase
errors, semantic errors.
• Definition:- The symbol table is a data structure used by a compiler to store
information about the source program's variables, functions, constants, user-
defined types, and other identifiers.
• Need:- It helps the compiler track the scope, life, and attributes (like type, size,
value) of each identifier. It is essential for semantic analysis, type checking,
and code generation.
• The information is collected by the analysis phases of the compiler and used
by the synthesis phases of the compiler.
• Lexical Analysis:- Creates new table entries
• Syntax Analysis:- add information about attributes types, scope and use in
the table.
• Semantic Analysis:- to check expression are semantically correct and type
checking.
• Intermediate Code Generation:-symbol table helps in adding temporary
variable information in code.
• Code Optimization:- use symbol table in machine dependent
optimization.
• Code Generation:- uses address information of identifier present in the
table for code generation.
• Information used by compiler from symbol table
• Data type and name.
• Declaring procedure.
• Pointer to structure table or record.
• Parameter passing by value or by reference.
• Number and types of arguments passed to the function.
• Base address.
A symbol table should possess these essential functions:
• Lookup: This function checks if a specific name exists in the table.
• Insert: This allows the addition of a new name (or a new entry) into the table.
• Access: It enables retrieval of information associated with a specific name.
• Modify: This function is used to update or add additional information about an
already existing name.
• Delete: This capability is for removing a name or a set of names from the table.
Important symbol table requirements
• Adaptive Structure: Entries must be comprehensive, reflecting each identifier's
specific use.
• Quick Lookup/Search: High-speed search functionality is essential, dependent
on the symbol table's design.
• Space-Efficient: The table should dynamically adjust in size for optimal space
use.
• Language Feature Accommodation: It needs to support language-specific
aspects like scoping and implicit declarations.
Different data structures used in implementing a symbol table
• Unordered List:
• Easy to implement using arrays or linked lists.
• Linked lists allow dynamic growth, avoiding fixed size constraints.
• Quick insertion O(1) time, but slower lookup O(n) for larger tables.
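A minimal sketch in C of the unordered linked-list organization described above, with the lookup and insert functions listed earlier; the struct fields and fixed buffer sizes are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct Symbol {
    char name[32];          /* identifier name */
    char type[16];          /* e.g. "int"      */
    struct Symbol *next;    /* unordered list  */
};

static struct Symbol *table = NULL;

/* Lookup: O(n) scan of the list. */
struct Symbol *lookup(const char *name) {
    for (struct Symbol *s = table; s; s = s->next)
        if (strcmp(s->name, name) == 0) return s;
    return NULL;
}

/* Insert at the head: O(1); the list grows dynamically. */
struct Symbol *insert(const char *name, const char *type) {
    struct Symbol *s = malloc(sizeof *s);
    strncpy(s->name, name, sizeof s->name - 1); s->name[31] = '\0';
    strncpy(s->type, type, sizeof s->type - 1); s->type[15] = '\0';
    s->next = table;
    return table = s;
}

int main(void) {
    insert("x", "int");                 /* entry created for: int x = 10; */
    struct Symbol *s = lookup("x");
    if (s) printf("%s : %s\n", s->name, s->type);
    return 0;
}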
• The activation record is crucial for handling data necessary for a procedure's
single execution. When a procedure is called, this record is pushed onto
the stack, and it's removed once control returns to the calling function.
Return value
Actual parameters
Control link
Access link
Saved machine status
Local data
Temporaries
Activation record fields include:
• Return Value: allows the called procedure to return a value to the caller.
• Actual Parameters: used by the caller to provide parameters to the called procedure.
• Control Link: connects to the caller's activation record.
• Access Link: references non-local data in other activation records.
• Saved Machine Status: preserves the machine state prior to the procedure call.
• Local Data: contains data specific to the procedure's execution.
• Temporaries: stores interim values during expression evaluation.
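A small illustrative C example (assumed, not from the text): each recursive call to fact pushes a fresh activation record holding its own parameter n, saved machine status, and temporaries; the record is popped when the call returns.

int fact(int n) {
    if (n <= 1)               /* base case: this record is popped first */
        return 1;
    return n * fact(n - 1);   /* the callee's record is pushed above ours */
}
/* fact(3) stacks three activation records: n = 3, n = 2, n = 1. */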
Overview of Memory Allocation Methods in Compilation
• Code Storage:
• Contains fixed-size, unchanging executable target code
during compilation.
• Static Allocation:
• Allocates storage for all data objects at compile time.
• Object sizes are known at compile time.
• Object names are bound to storage locations during
compilation.
• The compiler calculates the required storage for each object,
simplifying address determination in the activation record.
• Compiler sets up addresses for the target code to access
data at compile time.
Overview of Memory Allocation Methods in
Compilation
• Heap Allocation Methods:
• Garbage Collection:
• Handles objects that persist after losing all access
paths.
• Reclaims object space for reuse.
• Garbage objects are identified by a 'garbage
collection bit' and returned to free space.
• Reference Counting:
• Reclaims heap storage elements as soon as they
become inaccessible.
• Each heap cell has a counter tracking the
number of references to it.
• The counter is adjusted as references are created or
destroyed; when it drops to zero, the cell is reclaimed.
Overview of Memory Allocation Methods in
Compilation
• Stack Allocation:
• Manages data structures known as activation records.
• Activation records are pushed onto the stack at call time
and popped when the call ends.
• Local variables for each procedure call are stored in the
respective activation record, ensuring fresh storage for
each call.
• Local values are removed when the procedure call
completes.
Overview of Error Recovery in Compilers
Error Recovery Significance:
• Essential for compilers to process and execute programs even with errors.
• Example:
• Consider the code snippet: int x; int y //Syntax error
• The error arises from the missing semicolon after int y.
Semantic Phase Errors
• Definition of Semantic Errors:
• Semantic errors relate to incorrect use of program statements, impacting the
program's meaning or logic.
• Typical Reasons for Semantic Errors:
• Using undeclared names.
• Type mismatches.
• Inconsistencies between actual and formal arguments in function calls.
• Example:
• In the code scanf(“%f%f”, a, b);, the error lies in not using the addresses of
variables a and b. The correct usage should be scanf(“%f%f”, &a, &b); to
provide the address locations.
Logical Errors(Run Time Error) in Programming
• Nature of Logical Errors:
• Logical errors are mistakes in a program's logic that the compiler does not
catch.
• These errors occur in programs that are syntactically correct but do not
function as intended.
• Example of a Logical Error:
• Consider the code snippet:
x = 4;
y = 5;
average = x + y / 2;
Because / has higher precedence than +, this computes x + (y / 2) rather than
the intended (x + y) / 2.
Basics of Panic Mode Recovery
• A straightforward and commonly used method in various parsing techniques.
• Upon detecting an error, the parser discards input symbols until it encounters a
synchronizing token from a predefined set.
• Characteristics:
• Panic mode may bypass large sections of input without further error
checking.
• This approach ensures that the parser does not enter an infinite
loop.
• a = b + c;
• d = e + f;
• In panic mode, the parser might skip over the entire line a = b + c; without
identifying specific errors in it.
Phrase-Level Recovery
• This method involves making localized corrections to the input when an error is
detected by the parser.
• It involves substituting a part of the input with a string that allows the parser to
proceed.
• Correction Mechanism:
• Common corrections include replacing a comma with a semicolon, removing
an unnecessary semicolon, or inserting a missing one.
• while(x>0)
• y=a+b;
• Phrase-level recovery might correct this by adding 'do' to form while(x>0) do
y=a+b;, enabling the parsing process to continue smoothly.
Error Production
• Overview of Error Production:
• This technique allows a parser to generate relevant error messages while
continuing the parsing process.
• Functionality:
• When the parser encounters an incorrect production, it issues an error message
and then resumes parsing.
• E→+E|-E|*A|/A
• A→E
• If the parser comes across the production '* A' and deems it incorrect, it can
alert the user with a message, perhaps querying whether '*' is intended as a
unary operator, before continuing with the parsing.
Global Correction in Parsing
• Error Types and Diagnostics: The parser can detect several types of errors:
• Errors Detected During Reduction:
• Missing operand.
• Missing operator.
• Absence of expression within parentheses.
• Errors Detected During Shift/Reduce Actions:
• Missing operand.
• Unbalanced or missing right parenthesis.
• Missing operators.
Chapter-5
(CODE GENERATION): Design Issues, the Target Language.
Addresses in the Target Code, Basic Blocks and Flow Graphs,
Optimization of Basic Blocks, Code Generator. Code
optimization: Machine-Independent Optimizations, Loop
optimization, DAG representation of basic blocks, value numbers
and algebraic laws, Global Data-Flow analysis.
Code Generation
• Code generation is the process of converting the intermediate representation (IR)
of source code into target code (assembly level), which is also optimized.
• It involves translating the syntax and semantics of the high-level language into
assembly-level code, typically after the source code has passed through
lexical analysis, syntax analysis, semantic analysis, and intermediate
code generation.
Source code:
int main() {
  int a = 5;
  int b = 3;
  int sum = a + b;
}
Intermediate code:
t1 = 5
t2 = 3
t3 = t1 + t2
Target code:
MOV R1, 5
MOV R2, 3
ADD R3, R1, R2
• Code Generator Input:
• Uses the source program's intermediate representation (IR) and symbol table data.
• Intermediate Representation (IR):
• Consists of three-address and graphical forms.
• Target Program:
• Influenced by the target machine's instruction set architecture (ISA).
• Common ISAs: RISC, CISC, and stack-based.
• Instruction Selection:
• Converts IR to executable code for the target machine.
• High-level IR may use code templates for translation.
• Register Allocation:
• Critical to decide which values to store in registers.
• Non-registered values stay in memory.
• Register use leads to shorter, faster instructions.
• Involves two steps: i. Choosing variables for register storage. ii. Assigning specific registers to
these variables.
• Evaluation Order:
• The sequence of computations impacts code efficiency.
• Some sequences minimize register usage for intermediate results.
• Optimization: During code generation, various optimization techniques are
applied to improve efficiency and performance. This could include optimizing
for speed, memory usage, or even power consumption.
Code optimization techniques:
Machine Independent:
• Loop Optimization: loop unrolling, loop jamming, code movement, flow-of-control optimization
• Constant Folding
• Constant Propagation
• Strength Reduction
• Redundancy Elimination
• Algebraic Simplification
Machine Dependent:
• Register Allocation
• Use of Addressing Modes
• Peephole Optimization: redundant load elimination, strength reduction, use of machine idioms
Definition
• Machine-Independent Optimization: optimizations that are not specific to any processor or
machine architecture.
• Machine-Dependent Optimization: optimizations tailored to the specifics of a particular
machine or processor architecture.
Stage of Application
• Machine-Independent: applied before the code is mapped to the target machine's instruction
set, often during the intermediate code generation phase.
• Machine-Dependent: applied during or after the generation of the target machine code,
tailoring the optimizations to the specifics of the machine's hardware.
Algorithm to partition a sequence of three-address statements into basic blocks
• Loop Optimization
• To apply loop optimization, we must first detect loops.
• For detecting loops, we use control flow analysis (CFA) using a program flow
graph (PFG).
• To find the PFG, we need to find basic blocks.
• A basic block is a sequence of three-address statements where control enters at
the beginning and leaves only at the end, without any jumps or halts.
• Blocks are identified with the help of leaders.
• Steps:
• Find the leaders.
• Find the blocks.
• Construct the PFG.
• To find the basic blocks, we need to find the leaders in the program; a basic block
then extends from one leader up to, but not including, the next leader.
• Identifying leaders:
• The first statement is a leader.
• A statement that is the target of a conditional or unconditional goto is a leader.
• A statement that immediately follows a conditional or unconditional goto
is a leader.
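A minimal sketch in C of this leader-marking pass, applied to the Fact example that follows; the statement representation (a jump flag and a jump target per statement) is an illustrative assumption.

#include <stdio.h>
#include <stdbool.h>

struct Stmt { bool is_jump; int target; };  /* target = 0 means "no in-code target" */

/* Mark leaders following the three rules above (statements are 1-indexed). */
void mark_leaders(const struct Stmt *code, int n, bool *leader) {
    for (int i = 1; i <= n; i++) leader[i] = false;
    leader[1] = true;                               /* rule 1: first statement   */
    for (int i = 1; i <= n; i++) {
        if (code[i].is_jump) {
            if (code[i].target) leader[code[i].target] = true;  /* rule 2 */
            if (i + 1 <= n)     leader[i + 1] = true;           /* rule 3 */
        }
    }
}

int main(void) {
    /* The nine statements of the Fact example below: 3) if(i>x) goto 9,
       8) goto 3, and 9) goto calling program are the jumps. */
    struct Stmt code[10] = {
        {0,0},{0,0},{0,0},{1,9},{0,0},{0,0},{0,0},{0,0},{1,3},{1,0}
    };
    bool leader[11];
    mark_leaders(code, 9, leader);
    for (int i = 1; i <= 9; i++)
        if (leader[i]) printf("statement %d is a leader\n", i);
    return 0;                       /* prints leaders 1, 3, 4, 9 */
}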
Fact(x)
{
  int f = 1;
  for (i = 2; i <= x; i++)
    f = f * i;
  return f;
}
Three-address code:
1) f = 1
2) i = 2
3) if (i > x) goto 9
4) t1 = f * i
5) f = t1
6) t2 = i + 1
7) i = t2
8) goto 3
9) goto calling program
Leaders: statements 1, 3, 4, and 9.
Basic blocks: B1 = {1, 2}, B2 = {3}, B3 = {4, 5, 6, 7, 8}, B4 = {9}.
1. i = 1
2. j = 1
3. t1 = 5 * i
4. t2 = t1 + j
5. t3 = 4 * t2
6. t4 = t3
7. a[t4] = –1
8. j = j + 1
9. if j <= 5 goto(3)
10. i = i + 1
11. if i < 5 goto(2)
Leaders: statements 1, 2, 3, and 10; basic blocks: {1}, {2}, {3-9}, {10-11}.
Q Consider the following sequence of three address codes:
1. Prod:=0
2. I:=1
3. T1:=4*I
4. T2:=addr(A)–4
5. T3:=T2[T1]
6. T4:=addr(B)–4
7. T5:=T4[T1]
8. T6:=T3*T5
9. Prod:=Prod+T6
10. I=I+1
11. If I <= 20 goto (3)
Leaders: statements 1 and 3; basic blocks: {1-2} and {3-11}.
Loop Jamming: combining the bodies of two loops whenever they share the same index and
the same number of iterations.
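A small illustrative sketch in C (the arrays and bounds are assumed for illustration):

#include <stdio.h>

int main(void) {
    int a[10], b[10], i;

    /* Before jamming this would be two loops over the same index range:
       for (i = 0; i < 10; i++) a[i] = 0;
       for (i = 0; i < 10; i++) b[i] = 1;          */

    /* After jamming: one loop, one set of loop overhead. */
    for (i = 0; i < 10; i++) {
        a[i] = 0;
        b[i] = 1;
    }
    printf("%d %d\n", a[0], b[0]);
    return 0;
}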
Loop Unrolling: getting the same output with a smaller number of iterations is called loop unrolling.
Before unrolling:
int i = 1;
while (i <= 100)
{
  print(i);
  i++;
}
After unrolling (each iteration does two steps, so only 50 iterations are needed):
int i = 1;
while (i <= 100)
{
  print(i);
  i++;
  print(i);
  i++;
}
Code movement (loop-invariant computation): moving code that does not depend on the loop
out of the loop.
Before:
int i = 1;
while (i <= 100)
{
  a = b + c;
  print(i);
  i++;
}
After (a = b + c is computed once, before the loop):
a = b + c;
int i = 1;
while (i <= 100)
{
  print(i);
  i++;
}
Optimization of basic blocks
• Algebraic Simplification
• Redundant code Elimination / Common subexpression elimination
• Strength reduction
• Constant Propagation
• Constant Folding
Constant Folding: replacing an expression consisting of constants by its computed
value at compile time is called constant folding.
x = a + b + 2*3 + 4
x = a + b + 10
Constant Propagation: replacing a variable by its known constant value at
compile time is called constant propagation.
pi = 3.1415
x = 360 / pi
x = 360 / 3.1415
Strength reduction: replacing the costly operator by cheaper
operator, this process is called strength reduction.
y=2*x
y=x+x
Redundant code Elimination / Common subexpression elimination:
Avoiding the evaluation of any expression more than once is redundant
code elimination.
x=a+b
y=b+a
x=a+b
y=x
Algebraic Simplification: Basic laws of math’s which can be solved
directly.
a = b*1
a=b
a=b+0
a=b
Directed Acyclic Graph (DAG)
• A directed acyclic graph is a graph that is directed and contains no cycles, so it is impossible
to start at any vertex v and follow a sequence of edges that eventually loops back to v again.
• By representing expressions and operations in a DAG, compilers can easily identify and
eliminate redundant calculations, thus optimizing the code.
• We automatically detect common sub-expressions with the help of the DAG algorithm.
• We can determine which identifiers have their values used in the block.
• We can determine which statements compute values which could be used outside the
block.
((a+a) + (a+a)) + ((a+a) + (a+a))
For this expression the DAG needs only one leaf node for a and three + nodes, since
every repeated sub-expression is shared.