
Compiler Notes Arv

The document provides an overview of compilers, interpreters, and assemblers, detailing their structures, functions, and differences. It explains the phases of compilation, including lexical analysis, syntax analysis, intermediate code generation, code optimization, and code generation. Additionally, it discusses the roles of loaders, error handling, and the importance of symbol tables in the compilation process.

Uploaded by

krunaljadhav900
Copyright
© All Rights Reserved

Overview of the Compiler and its Structure:

Language processor, Applications of language processors, Definition-Structure-Working of compiler, the
science of building compilers, Basic understanding of interpreter and assembler. Difference between
interpreter and compiler. Compilation of source code into target language, Cousins of compiler, Types of
compiler

Lexical Analysis:
The Role of the Lexical Analyzer, Specification of Tokens, Recognition of Tokens, Input Buffering,
elementary scanner design and its implementation (Lex), Applying concepts of Finite Automata for
recognition of tokens.

Introduction to Compiling:

INTRODUCTION OF LANGUAGE PROCESSING SYSTEM

Fig 1.1: Language Processing System
Preprocessor

A preprocessor produces input to compilers. It may perform the following functions.

1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer
constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessor: These preprocessors augment older languages with more modern flow-of-control
and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language by what
amounts to built-in macros.

COMPILER

A compiler is a translator program that takes a program written in a high-level language (HLL), the source
program, and translates it into an equivalent program in a machine-level language (MLL), the target program.
An important part of a compiler's job is reporting errors to the programmer.

Fig 1.2: Structure of Compiler
Executing a program written in an HLL basically consists of two parts. The source
program must first be compiled (translated) into an object program. Then the resulting object program is loaded
into memory and executed.

Fig 1.3: Execution process of source program in Compiler

ASSEMBLER
Programmers found it difficult to write or read programs in machine language. They began to use a
mnemonic (symbol) for each machine instruction, which they would subsequently translate into
machine language. Such a mnemonic machine language is now called an assembly language.
Programs known as assemblers were written to automate the translation of assembly language into
machine language. The input to an assembler program is called the source program; the output is a
machine language translation (object program).

INTERPRETER
An interpreter is a program that appears to execute a source program as if it were machine language.

Fig 1.4: Execution in Interpreter

Languages such as BASIC, SNOBOL and LISP can be translated using interpreters. JAVA also uses an interpreter. The
process of interpretation can be carried out in the following phases.
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct execution

Advantages:
Modification of the user program can be easily made and implemented as execution proceeds.
The type of object that a variable denotes may change dynamically.
Debugging a program and finding errors is a simplified task for a program used for interpretation.
The interpreter for the language makes it machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is more.

LOADER AND LINK-EDITOR:

Once the assembler produces an object program, that program must be placed into memory and
executed. The assembler could place the object program directly in memory and transfer control to it,
thereby causing the machine language program to be executed. This would waste core by leaving the
assembler in memory while the user's program was being executed. Also, the programmer would
have to retranslate his program with each execution, thus wasting translation time. To overcome these
problems of wasted translation time and memory, system programmers developed another
component called the loader.
"A loader is a program that places programs into memory and prepares them for execution." It would be more
efficient if subroutines could be translated into object form that the loader could "relocate" directly behind
the user's program. The task of adjusting programs so they may be placed in arbitrary core locations is
called relocation. Relocation loaders perform four functions.

TRANSLATOR
A translator is a program that takes as input a program written in one language and produces as
output a program in another language. Besides program translation, the translator performs another
very important role, error detection. Any violation of the HLL specification would be detected and
reported to the programmers. The important roles of a translator are:
1. Translating the HLL program input into an equivalent ML program.
2. Providing diagnostic messages wherever the programmer violates the specification of the HLL.

LIST OF COMPILERS
1. Ada compilers
2. ALGOL compilers
3. BASIC compilers
4. C# compilers
5. C compilers
6. C++ compilers
7. COBOL compilers
8. Common Lisp compilers
9. ECMAScript interpreters
10. Fortran compilers
11. Java compilers
12. Pascal compilers
13. PL/I compilers
14. Python compilers
15. Smalltalk compilers

STRUCTURE OF THE COMPILER DESIGN

Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation
that takes the source program in one representation and produces output in another representation. The
phases of a compiler are shown below.
There are two phases of compilation.
a. Analysis (Machine Independent / Language Dependent)
b. Synthesis (Machine Dependent / Language Independent)

The compilation process is partitioned into a number of sub-processes called 'phases'.

Lexical Analysis:-
The LA or scanner reads the source program one character at a time, carving the source program into a sequence of
atomic units called tokens.
Fig 1.5: Phases of Compiler

Syntax Analysis:-
The second stage of translation is called syntax analysis or parsing. In this phase expressions,
statements, declarations etc. are identified by using the results of lexical analysis. Syntax analysis is aided
by using techniques based on the formal grammar of the programming language.

Intermediate Code Generation:-
An intermediate representation of the final machine language code is produced. This phase bridges
the analysis and synthesis phases of translation.

Code Optimization:-
This is an optional phase designed to improve the intermediate code so that the output runs faster and
takes less space.

Code Generation:-
The last phase of translation is code generation. A number of optimizations to reduce the length of
the machine language program are carried out during this phase. The output of the code generator is the
machine language program of the specified computer.

Table Management (or) Book-keeping:-
This is the portion that keeps the names used by the program
and records essential information about each. The data structure used to record this information is
called a 'Symbol Table'.

Error Handlers:-
It is invoked when an error in the source program is detected. The output of LA is a stream of
tokens, which is passed to the next phase, the syntax analyzer or parser. The SA groups the tokens
together into syntactic structures called expressions. Expressions may further be combined to form
statements. The syntactic structure can be regarded as a tree whose leaves are the tokens; such trees are
called parse trees.

The parser has two functions. It checks whether the tokens from the lexical analyzer occur in patterns that are
permitted by the specification for the source language. It also imposes on tokens a tree-like structure
that is used by the subsequent phases of the compiler.

Example: if a program contains the expression A+/B, after lexical analysis this expression might
appear to the syntax analyzer as the token sequence id+/id. On seeing the /, the syntax analyzer
should detect an error situation, because the presence of these two adjacent binary operators violates
the formation rule of an expression. Syntax analysis is to make explicit the hierarchical structure of
the incoming token stream by identifying which parts of the token stream should be grouped.

Example: A/B*C has two possible interpretations.
1. divide A by B and then multiply by C, or
2. multiply B by C and then use the result to divide A.
Each of these two interpretations can be represented in terms of a parse tree.

Intermediate Code Generation:-
The intermediate code generation uses the structure produced by the syntax analyzer to create a
stream of simple instructions. Many styles of intermediate code are possible. One common style uses
instructions with one operator and a small number of operands. The output of the syntax analyzer is
some representation of a parse tree. The intermediate code generation phase transforms this parse tree
into an intermediate language representation of the source program.

Code Optimization
This is an optional phase designed to improve the intermediate code so that the output runs faster and
takes less space. Its output is another intermediate code program that does the same job as the
original, but in a way that saves time and/or space.
a. Local Optimization:-
There are local transformations that can be applied to a program to make an improvement. For
example,
If A > B goto L2
Goto L3
L2 :

This can be replaced by a single statement
If A <= B goto L3

Another important local optimization is the elimination of common sub-expressions
A := B + C + D
E := B + C + F
Might be evaluated as

T1 := B + C
A := T1 + D
E := T1 + F
This takes advantage of the common sub-expression B + C.

b. Loop Optimization:-
Another important source of optimization concerns increasing the speed of loops. A
typical loop improvement is to move a computation that produces the same result each time around
the loop to a point in the program just before the loop is entered.
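
The loop improvement described above can be illustrated at source level. The following is only a sketch: a real optimizer applies the transformation to intermediate code, and the function names and the `limit * scale` computation are hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Before: limit * scale is recomputed on every iteration,
   although neither operand changes inside the loop. */
long sum_scaled_naive(const int *v, size_t n, int limit, int scale)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < limit * scale)      /* loop-invariant computation */
            total += v[i];
    }
    return total;
}

/* After: the invariant product is computed once, just before the loop. */
long sum_scaled_hoisted(const int *v, size_t n, int limit, int scale)
{
    assert(v != NULL || n == 0);
    const int bound = limit * scale;   /* moved out of the loop */
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < bound)
            total += v[i];
    }
    return total;
}
```

Both functions return the same value for the same arguments; only the amount of work done per iteration differs.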

Code Generator:-
The code generator produces the object code by deciding on the memory locations for data, selecting
code to access each datum and selecting the registers in which each computation is to be done. Many
computers have only a few high speed registers in which computations can be performed quickly. A
good code generator would attempt to utilize registers as efficiently as possible.

Table Management OR Book-keeping:-
A compiler needs to collect information about all the data objects that appear in the source program.
The information about data objects is collected by the early phases of the compiler, the lexical and
syntactic analyzers. The data structure used to record this information is called the Symbol Table.

Error Handling:-
One of the most important functions of a compiler is the detection and reporting of errors in the
source program. The error message should allow the programmer to determine exactly where the
errors have occurred. Errors may occur in all of the phases of a compiler.

Whenever a phase of the compiler discovers an error, it must report the error to the error handler,
which issues an appropriate diagnostic message. Both the table-management and error-handling
routines interact with all phases of the compiler.
Example:

Fig 1.6: Compilation Process of a source code through phases
2. A simple One Pass Compiler:

INTRODUCTION: In computer programming, a one-pass compiler is a compiler
that passes through the parts of each compilation unit only once, immediately translating
each part into its final machine code. This is in contrast to a multi-pass compiler, which
converts the program into one or more intermediate representations in steps between
source code and machine code, and which reprocesses the entire compilation unit in each
sequential pass.

OVERVIEW

• Language Definition
o Appearance of programming language:
Vocabulary: Regular expression
Syntax: Backus-Naur Form (BNF) or Context Free Grammar (CFG)
o Semantics: Informal language or some examples

• Fig 2.1. Structure of our compiler front end

SYNTAX DEFINITION

• To specify the syntax of a language: CFG and BNF
o Example: the if-else statement in C has the form statement → if ( expression )
statement else statement
• An alphabet of a language is a set of symbols.
o Examples: {0,1} for a binary number system (language) = {0,1,100,101,...}
{a,b,c} for language = {a,b,c,ac,abcc..}
{if,(,),else ...} for if statements = {if(a==1)goto10, if--}
• A string over an alphabet
o is a sequence of zero or more symbols from the alphabet.
o Examples: 0, 1, 10, 00, 11, 111, 0101 ... are strings over the alphabet {0,1}
o The null string is a string which does not have any symbol of the alphabet.
• Language
o is a subset of all the strings over a given alphabet.
o Alphabets Ai              Languages Li for Ai
  A0 = {0,1}                L0 = {0,1,100,101,...}
  A1 = {a,b,c}              L1 = {a,b,c,ac,abcc..}
  A2 = {all of C tokens}    L2 = {all sentences of C program}
• Example 2.1. Grammar for expressions consisting of digits and plus and minus signs.
o Language of expressions L = {9-5+2, 3-1, ...}
o The productions of the grammar for this language L are:
list → list + digit
list → list - digit
list → digit
digit → 0|1|2|3|4|5|6|7|8|9
o list, digit: Grammar variables, Grammar symbols
o 0,1,2,3,4,5,6,7,8,9,-,+: Tokens, Terminal symbols
• Convention specifying grammar
o Terminal symbols: boldface strings if, num, id
o Nonterminal symbols, grammar symbols: italicized names, list, digit, A, B
• Grammar G = (N,T,P,S)
o N: a set of nonterminal symbols
o T: a set of terminal symbols, tokens
o P: a set of production rules
o S: a start symbol, S ∈ N
• Grammar G for a language L = {9-5+2, 3-1, ...}
o G = (N,T,P,S)
N = {list, digit}
T = {0,1,2,3,4,5,6,7,8,9,-,+}
P: list → list + digit
list → list - digit
list → digit
digit → 0|1|2|3|4|5|6|7|8|9
S = list

• Some definitions for a language L and its grammar G
• Derivation:
A sequence of replacements S ⇒ α1 ⇒ α2 ⇒ … ⇒ αn is a derivation of αn.
Example: a derivation of 1+9 from the grammar G
• leftmost derivation
list ⇒ list + digit ⇒ digit + digit ⇒ 1 + digit ⇒ 1 + 9
• rightmost derivation
list ⇒ list + digit ⇒ list + 9 ⇒ digit + 9 ⇒ 1 + 9
• Language of grammar L(G)
L(G) is the set of sentences that can be generated from the grammar G.
L(G) = { x | S ⇒* x } where x is a sequence of terminal symbols
• Example: Consider a grammar G = (N,T,P,S):
N = {S}  T = {a,b}
S = S  P = { S → aSb | ε }
• Is aabb a sentence of L(G)? (derivation of string aabb)
S ⇒ aSb ⇒ aaSbb ⇒ aaεbb ⇒ aabb (or S ⇒* aabb), so aabb ∈ L(G)
• There is no derivation for aa, so aa ∉ L(G)
• Note L(G) = { a^n b^n | n ≥ 0 } where a^n b^n means n a's followed by n b's.
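
Membership in L(G) for the grammar S → aSb | ε can be checked by a function that mirrors the two productions. This is a minimal sketch; the names `derive_S` and `in_language` are hypothetical and do not come from the notes.

```c
#include <assert.h>
#include <stddef.h>

/* Derive the longest prefix of s that S -> a S b | epsilon can match here.
   Returns a pointer just past the matched prefix. */
static const char *derive_S(const char *s)
{
    if (*s == 'a') {                         /* try S -> a S b */
        const char *after_S = derive_S(s + 1);
        if (*after_S == 'b')
            return after_S + 1;              /* matched a S b */
    }
    return s;                                /* S -> epsilon: match nothing */
}

/* Returns 1 if the whole string s is a sentence of L(G), 0 otherwise. */
int in_language(const char *s)
{
    assert(s != NULL);
    return *derive_S(s) == '\0';
}
```

With this sketch, `in_language("aabb")` yields 1 and `in_language("aa")` yields 0, matching the two cases discussed above.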

• Parse Tree
A derivation can be conveniently represented by a derivation tree (parse tree).
o The root is labeled by the start symbol.
o Each leaf is labeled by a token or ε.
o Each interior node is labeled by a nonterminal symbol.
o When a production A → x1 … xn is applied, nodes labeled by x1 … xn are made the
children of the node labeled by A.
• root: the start symbol
• internal nodes: nonterminals
• leaf nodes: terminals

o Example G:
list → list + digit | list - digit | digit
digit → 0|1|2|3|4|5|6|7|8|9
• leftmost derivation for 9-5+2,
list ⇒ list + digit ⇒ list - digit + digit ⇒ digit - digit + digit ⇒ 9 - digit + digit
⇒ 9 - 5 + digit ⇒ 9 - 5 + 2
• rightmost derivation for 9-5+2,
list ⇒ list + digit ⇒ list + 2 ⇒ list - digit + 2 ⇒ list - 5 + 2
⇒ digit - 5 + 2 ⇒ 9 - 5 + 2

Parse tree for 9-5+2

Fig 2.2. Parse tree for 9-5+2 according to the grammar in Example

Ambiguity
• A grammar is said to be ambiguous if the grammar has more than one parse tree for a given
string of tokens.
• Example 2.5. Suppose a grammar G that cannot distinguish between lists and digits as in
Example 2.1.
• G: string → string + string | string - string | 0|1|2|3|4|5|6|7|8|9
Fig 2.3. Two parse trees for 9-5+2
• 9-5+2 has 2 parse trees => Grammar G is ambiguous.

Associativity of operator
An operator is said to be left associative if an operand with operators on both sides of it is taken by the
operator to its left.
eg) 9+5+2 ≡ (9+5)+2 (left associative), a=b=c ≡ a=(b=c) (right associative)
• Left Associative Grammar:
list → list + digit | list - digit
digit → 0|1|…|9
• Right Associative Grammar:
right → letter = right | letter
letter → a|b|…|z

Fig 2.4. Parse trees for left- and right-associative operators.

Precedence of operators
We say that an operator (*) has higher precedence than another operator (+) if the operator (*) takes
its operands before the other operator (+) does.
• ex. 9+5*2 ≡ 9+(5*2), 9*5+2 ≡ (9*5)+2
• left associative operators: +, -, *, /
• right associative operators: =, **
• Syntax of full expressions
operator    associativity    precedence

+, -        left             1 (low)
*, /        left             2 (high)

• expr → expr + term | expr - term | term
term → term * factor | term / factor | factor
factor → digit | ( expr )
digit → 0|1|…|9

• Syntax of statements
o stmt → id = expr ;
| if ( expr ) stmt ;
| if ( expr ) stmt else stmt ;
| while ( expr ) stmt ;
expr → expr + term | expr - term | term
term → term * factor | term / factor | factor
factor → digit | ( expr )
digit → 0|1|…|9
SYNTAX-DIRECTED TRANSLATION (SDT)
A formalism for specifying translations for programming language constructs.
(attributes of a construct: type, string, location, etc)
• Syntax directed definition (SDD) for the translation of constructs
• Syntax directed translation scheme (SDTS) for specifying translation
Postfix notation for an expression E
• If E is a variable or constant, then the postfix notation for E is E itself (E.t ≡ E).
• If E is an expression of the form E1 op E2 where op is a binary operator
o E1' is the postfix of E1,
o E2' is the postfix of E2
o then E1' E2' op is the postfix for E1 op E2
• If E is (E1), and E1' is the postfix of E1,
then E1' is the postfix for E

eg) 9 - 5 + 2 ⇒ 9 5 - 2 +

9 - (5 + 2) ⇒ 9 5 2 + -

Syntax-Directed Definition (SDD) for translation
• An SDD is a set of semantic rules predefined for each production, respectively, for
translation.
• A translation is an input-output mapping. The procedure for translating an input X:
o construct a parse tree for X.
o synthesize attributes over the parse tree.
▪ Suppose a node n in the parse tree is labeled by X and X.a denotes the value of
attribute a of X at that node.
▪ compute X's attributes X.a using the semantic rules associated with X.

Example 2.6. SDD for infix to postfix translation

Fig 2.5. Syntax-directed definition for infix to postfix translation.

An example of synthesized attributes for input X = 9-5+2

Fig 2.6. Attribute values at nodes in a parse tree.

Syntax-directed Translation Schemes (SDTS)
• A translation scheme is a context-free grammar in which program fragments called
translation actions are embedded within the right sides of the productions.

productions           SDD for infix to postfix notation      SDTS
list → list + term    list.t = list.t || term.t || "+"       list → list + term {print("+")}

• {print("+");}: translation (semantic) action.
• An SDTS generates an output for each sentence x generated by the underlying grammar by
executing the actions in the order they appear during a depth-first traversal of a parse tree for x.
1. Design translation schemes (SDTS) for translation
2. Translate:
a) parse the input string x and
b) emit the action result encountered during the depth-first traversal of the parse tree.

Fig 2.7. Example of a depth-first traversal of a tree.
Fig 2.8. An extra leaf is constructed for a semantic action.

Example 2.8.
• SDD vs. SDTS for infix to postfix translation.

productions           SDD                                  SDTS
expr → list + term    expr.t = list.t || term.t || "+"     expr → list + term {printf("+")}
expr → list - term    expr.t = list.t || term.t || "-"     expr → list - term {printf("-")}
expr → term           expr.t = term.t                      expr → term
term → 0              term.t = "0"                         term → 0 {printf("0")}
term → 1              term.t = "1"                         term → 1 {printf("1")}
…                     …                                    …
term → 9              term.t = "9"                         term → 9 {printf("9")}

• Action translating for input 9-5+2

Fig 2.9. Actions translating 9-5+2 into 95-2+.
1) Parse.
2) Translate.
Do we have to maintain the whole parse tree?
No. Semantic actions are performed during parsing, and we don't need the nodes whose semantic
actions are already done.
PARSING
if token string x ∈ L(G), then parse tree
else error message
Top-Down parsing
1. At node n labeled with nonterminal A, select one of the productions whose left part is
A and construct children of node n with the symbols on the right side of that production.
2. Find the next node at which a sub-tree is to be constructed.
ex. G:
type → simple
| ↑id
| array [ simple ] of type
simple → integer
| char
| num dotdot num

Fig 2.10. Top-down parsing while scanning the input from left to right.
Fig 2.11. Steps in the top-down construction of a parse tree.
• The selection of a production for a nonterminal may involve trial and error. => backtracking

• G: { S → aSb | c | ab }
According to the top-down parsing procedure, are acb, aabb ∈ L(G)?
• S/acb ⇒ aSb/acb ⇒ aSb/acb ⇒ aaSbb/acb ⇒ X
(S→aSb) move (S→aSb) backtracking
⇒ aSb/acb ⇒ acb/acb ⇒ acb/acb ⇒ acb/acb
(S→c) move move
so, acb ∈ L(G)
It is finished in 7 steps including one backtracking.

• S/aabb ⇒ aSb/aabb ⇒ aSb/aabb ⇒ aaSbb/aabb ⇒ aaSbb/aabb ⇒ aaaSbbb/aabb ⇒ X
(S→aSb) move (S→aSb) move (S→aSb) backtracking
⇒ aaSbb/aabb ⇒ aacbb/aabb ⇒ X
(S→c) backtracking
⇒ aaSbb/aabb ⇒ aaabbb/aabb ⇒ X
(S→ab) backtracking
⇒ aaSbb/aabb ⇒ X
backtracking
⇒ aSb/aabb ⇒ acb/aabb
(S→c) backtracking
⇒ aSb/aabb ⇒ aabb/aabb ⇒ aabb/aabb ⇒ aabb/aabb ⇒ aabb/aabb
(S→ab) move move move
so, aabb ∈ L(G)
but the process is too difficult. It needs 18 steps including 5 backtrackings.
• Procedure of top-down parsing
Let the pointed grammar symbol and the pointed input symbol be g and a, respectively.
o if (g ∈ N), select and expand a production whose left part equals g, next to the current
production.
o else if (g = a), then make g and a be the symbols next to the current symbols.
o else if (g ≠ a), backtrack:
▪ let the pointed input symbol a be the symbol that moves back as many steps as
the number of current symbols of the underlying production
▪ eliminate the right side symbols of the current production and let the pointed
symbol g be the left side symbol of the current production.
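
The trial-and-error idea can be sketched in C for the grammar G: { S → aSb | c | ab }. The function names are hypothetical. The sketch tries the alternatives in order and commits to the first one that matches at a given position, which is enough for this grammar but is an assumption that does not hold for every grammar (fully general backtracking would also retry later alternatives of an already matched nonterminal).

```c
#include <assert.h>
#include <stddef.h>

/* Try to match S starting at s[i].
   Returns the index just past the match, or -1 if no alternative fits. */
static int parse_S(const char *s, int i)
{
    /* Alternative 1: S -> a S b */
    if (s[i] == 'a') {
        int j = parse_S(s, i + 1);
        if (j >= 0 && s[j] == 'b')
            return j + 1;
        /* failure: backtrack, the input position falls back to i */
    }
    /* Alternative 2: S -> c */
    if (s[i] == 'c')
        return i + 1;
    /* Alternative 3: S -> a b */
    if (s[i] == 'a' && s[i + 1] == 'b')
        return i + 2;
    return -1;
}

/* Returns 1 if the whole string is derivable from S, 0 otherwise. */
int accepts_with_backtracking(const char *s)
{
    assert(s != NULL);
    int end = parse_S(s, 0);
    return end >= 0 && s[end] == '\0';
}
```

For aabb the first alternative is tried and abandoned at the inner S before S → ab succeeds, which is the backtracking behaviour traced above.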

Predictive parsing (Recursive Descent Parsing, RDP)
• A strategy for the general top-down parsing
Guess a production, see if it matches, if not, backtrack and try another.

• It may fail to recognize a correct string in some grammar G and is tedious in processing.

• Predictive parsing
o is a kind of top-down parsing that predicts a production whose derived terminal
symbol is equal to the next input symbol while expanding in top-down parsing.
o without backtracking.
o A recursive descent parser is a kind of predictive parser that is implemented by
disjoint recursive procedures, one procedure for each nonterminal; the procedures are
patterned after the productions.
• procedure of predictive parsing (RDP)
Let the pointed grammar symbol and the pointed input symbol be g and a, respectively.
o if (g ∈ N)
▪ select the next production P whose left symbol equals g and whose set of first
terminal symbols of derivations from the right symbols of the production P
includes the input symbol a.
▪ expand the derivation with that production P.
o else if (g = a) then make g and a be the symbols next to the current symbols.
o else if (g ≠ a) error
• G : { S → aSb | c | ab } => G1 : { S → aS' | c   S' → Sb | b }

According to the predictive parsing procedure, are acb, aabb ∈ L(G)?
o S/acb ⇒ confused in { S → aSb, S → ab }
o so, a predictive parser requires some restriction on the grammar: among the productions
whose left part is A, the first terminal symbols must all be distinct, so that the next
input symbol selects only one production.
• Requirements for a grammar to be suitable for RDP: For each nonterminal either
1. A → Bα, or
2. A → a1α1 | a2α2 | … | anαn
1) for 1 ≤ i, j ≤ n and i ≠ j, ai ≠ aj
2) A → ε may also occur if none of the ai can follow A in a derivation
• If the grammar is suitable, we can parse efficiently without backtracking.
General top-down parser with backtracking

Recursive Descent Parser without backtracking

Picture Parsing (a kind of predictive parsing) without backtracking

Left Factoring
• If a grammar contains two productions of the form
S → aα and S → aβ
it is not suitable for top-down parsing without backtracking. Troubles of this form can sometimes
be removed from the grammar by a technique called left factoring.
• In left factoring, we replace { S → aα, S → aβ } by
{ S → aS', S' → α, S' → β }   cf. S → a(α|β)
(Hopefully α and β start with different symbols)
• left factoring for G { S → aSb | c | ab }
S → aS' | c    cf. S (= aSb | ab | c = a(Sb | b) | c) → aS' | c
S' → Sb | b
• A concrete example:
<stmt> → IF <boolean> THEN <stmt> |
IF <boolean> THEN <stmt> ELSE <stmt>
is transformed into
<stmt> → IF <boolean> THEN <stmt> S'
S' → ELSE <stmt> | ε

• Example,
o for G : { S → aSb | c | ab }
According to the predictive parsing procedure, are acb, aabb ∈ L(G)?
▪ S/aabb ⇒ unable to choose { S → aSb, S → ab ? }
o According to the left-factored grammar G1, are acb, aabb ∈ L(G)?
G1 : { S → aS' | c   S' → Sb | b } <= { S = a(Sb|b) | c }
o S/acb ⇒ aS'/acb ⇒ aS'/acb ⇒ aSb/acb ⇒ acb/acb ⇒ acb/acb ⇒ acb/acb
(S→aS') move (S'→Sb) (S→c) move move
so, acb ∈ L(G)
It needs only 6 steps without any backtracking.
cf. General top-down parsing needs 7 steps and 1 backtracking.
o S/aabb ⇒ aS'/aabb ⇒ aS'/aabb ⇒ aSb/aabb ⇒ aaS'b/aabb ⇒ aaS'b/aabb ⇒ aabb/aabb ⇒ aabb/aabb ⇒ aabb/aabb
(S→aS') move (S'→Sb) (S→aS') move (S'→b) move move
so, aabb ∈ L(G)
The process is finished in 8 steps without any backtracking.
cf. General top-down parsing needs 18 steps including 5 backtrackings.
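
A predictive parser for the left-factored grammar G1 : { S → aS' | c, S' → Sb | b } can be sketched as one C function per nonterminal, each choosing its production from the single lookahead character. The names (`S_prime`, `match`, `accepts`) and the use of file-scope state are assumptions made for brevity, not part of the notes.

```c
#include <assert.h>
#include <stddef.h>

static const char *input;   /* *input is the current lookahead symbol */
static int ok;              /* cleared on the first syntax error */

static void match(char t)
{
    if (*input == t) input++; else ok = 0;
}

static void S_prime(void);

static void S(void)
{
    if (!ok) return;
    switch (*input) {
    case 'a': match('a'); S_prime(); break;   /* S -> a S' */
    case 'c': match('c'); break;              /* S -> c    */
    default:  ok = 0;                         /* no production starts here */
    }
}

static void S_prime(void)
{
    if (!ok) return;
    switch (*input) {
    case 'a': case 'c': S(); match('b'); break;   /* S' -> S b */
    case 'b': match('b'); break;                  /* S' -> b   */
    default:  ok = 0;
    }
}

/* Returns 1 if s is in L(G1), 0 otherwise. No backtracking is needed. */
int accepts(const char *s)
{
    assert(s != NULL);
    input = s;
    ok = 1;
    S();
    return ok && *input == '\0';
}
```

Each call consumes input or fails immediately, so the parser never revisits a symbol; this is the property that left factoring was meant to provide.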

Left Recursion
• A grammar is left recursive iff it contains a nonterminal A such that
A ⇒+ Aα, where α is any string.
o Grammar { S → Sα | c } is left recursive because of S ⇒ Sα
o Grammar { S → Aα, A → Sb | c } is also left recursive because of S ⇒ Aα ⇒ Sbα
• If a grammar is left recursive, you cannot build a predictive top-down parser for it.
1) If a parser is trying to match S and S → Sα, it has no idea how many times the production must be applied
2) Given a left recursive grammar, it is always possible to find another grammar that
generates the same language and is not left recursive.
3) The resulting grammar might or might not be suitable for RDP.

• After this, if we need left factoring, it is not suitable for RDP.
• Right recursion: Special care / Harder than left recursion / SDT can handle.

Eliminating Left Recursion
Let G be S → SA | A
Note that a top-down parser cannot parse the grammar G, regardless of the order the productions are tried.
o The productions generate strings of the form AA…A
o They can be replaced by S → AS' and S' → AS' | ε

Example:
• A → Aα | β
=>
A → βR
R → αR | ε

Fig 2.12. Left- and right-recursive ways of generating a string.

• In general, the rule is that
o If A → Aα1 | Aα2 | … | Aαn and
A → β1 | β2 | … | βm (no βi's start with A),
then, replace by
A → β1R | β2R | … | βmR and
R → α1R | α2R | … | αnR | ε

Exercise: Remove the left recursion in the following grammar:
expr → expr + term | expr - term
expr → term
solution:
expr → term rest
rest → + term rest | - term rest | ε
A TRANSLATOR FOR SIMPLE EXPRESSIONS
• Convert infix into postfix (reverse Polish notation) using SDT.
• Abstract syntax tree (annotated parse tree) vs. concrete syntax tree

• Concrete syntax tree: parse tree.
• Abstract syntax tree: syntax tree
• Concrete syntax: underlying grammar

Adapting the Translation Scheme
• Embed the semantic action in the production
• Design a translation scheme
• Left recursion elimination and left factoring
• Example
3) Design a translation scheme and eliminate left recursion
E → E + T {'+'}              E → T {} R
E → E - T {'-'}              R → + T {'+'} R
E → T {}                     R → - T {'-'} R
T → 0 {'0'} | … | 9 {'9'}    R → ε
                             T → 0 {'0'} | … | 9 {'9'}
4) Translate an input string 9-5+2: parsing and SDT

Result: 95-2+
Example of translator design and execution
• A translation scheme with left recursion, and the same scheme with left recursion eliminated.

Initial specification for infix-to-postfix translator    With left recursion eliminated
expr → expr + term {printf("+")}                         expr → term rest
expr → expr - term {printf("-")}                         rest → + term {printf("+")} rest
expr → term                                              rest → - term {printf("-")} rest
term → 0 {printf("0")}                                   rest → ε
term → 1 {printf("1")}                                   term → 0 {printf("0")}
…                                                        term → 1 {printf("1")}
term → 9 {printf("9")}                                   …
                                                         term → 9 {printf("9")}

Fig 2.13. Translation of 9-5+2 into 95-2+.

Procedure for the Nonterminals expr, term, and rest

Fig 2.14. Functions for the nonterminals expr, rest, and term.
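
The figure itself is not reproduced in this text. As a stand-in, here is a hedged sketch of what functions for expr, rest, and term could look like for the scheme with left recursion eliminated. It writes into a caller-supplied buffer instead of calling printf so that the result can be inspected; the name `infix_to_postfix`, the buffer interface and the `failed` flag are assumptions of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *src;   /* next unread input character (the lookahead) */
static char *dst;         /* next free position in the output buffer */
static int failed;        /* set to 1 on a syntax error */

static void emit(char c) { *dst++ = c; }

/* term -> digit { print(digit) } */
static void term(void)
{
    if (isdigit((unsigned char)*src))
        emit(*src++);
    else
        failed = 1;
}

/* rest -> + term { print('+') } rest | - term { print('-') } rest | epsilon
   The tail recursion is written as a loop. */
static void rest(void)
{
    while (!failed && (*src == '+' || *src == '-')) {
        char op = *src++;
        term();
        emit(op);            /* the action runs after the operand is emitted */
    }
}

/* expr -> term rest
   Translates infix `in` to postfix in `out` (needs strlen(in) + 1 bytes).
   Returns 1 on success, 0 on a syntax error. */
int infix_to_postfix(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    src = in;
    dst = out;
    failed = 0;
    term();
    rest();
    *dst = '\0';
    return !failed && *src == '\0';
}
```

Calling `infix_to_postfix("9-5+2", buf)` leaves 95-2+ in `buf`, the translation shown in Fig 2.13.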
Optimizer and Translator

LEXICAL ANALYSIS
• reads and converts the input into a stream of tokens to be analyzed by the parser.
• lexeme: a sequence of characters which comprises a single token.
• Lexical Analyzer → Lexeme / Token → Parser
Removal of White Space and Comments
• Remove white space (blank, tab, newline etc.) and comments
Constants
• Constants: For a while, consider only integers
• eg) for input 31 + 28, output (token representation)?
input : 31 + 28
output : <num, 31> <+, > <num, 28>
num, + : tokens
31, 28 : attributes, values (or lexemes) of the integer token num
Recognizing
• Identifiers
o Identifiers are names of variables, arrays, functions...
o A grammar treats an identifier as a token.
o eg) input : count = count + increment;
output : <id,1> <=, > <id,1> <+, > <id,2> ;
Symbol table
     tokens    attributes (lexeme)
0
1    id        count
2    id        increment
3
• Keywords are reserved, i.e., they cannot be used as identifiers.
Then a character string forms an identifier only if it is not a keyword.
• punctuation symbols
o operators: + - * / := < > …
Interface to lexical analyzer

Fig 2.15. Inserting a lexical analyzer between the input and the parser

A Lexical Analyzer

Fig 2.16. Implementing the interactions in Fig. 2.15.

• c = getchar(); ungetc(c, stdin);
• token representation
o #define NUM 256
• Function lexan()
eg) input string 76 + a
input    output (returned value)
76       NUM, tokenval = 76 (integer)
+        +
a        id, tokenval = "a"

• A way that the parser handles the token NUM returned by lexan()
o consider a translation scheme
factor → ( expr )
| num {print(num.value)}
#define NUM 256
...
factor() {
    if (lookahead == '(') {
        match('('); expr(); match(')');
    } else if (lookahead == NUM) {
        printf("%d", tokenval); match(NUM);   /* tokenval is an int */
    } else error();
}
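• The implementation of function lexan
    #include <stdio.h>
    #include <ctype.h>

    #define NUM  256        /* token code for integer constants */
    #define NONE -1         /* tokenval when a token has no attribute */

    int lineno = 1;         /* current input line, for error messages */
    int tokenval = NONE;    /* attribute of the most recent token */

    int lexan(void) {
        int t;
        while (1) {
            t = getchar();
            if (t == ' ' || t == '\t')
                ;                           /* skip blanks and tabs */
            else if (t == '\n')
                lineno += 1;                /* count lines */
            else if (isdigit(t)) {          /* collect an integer constant */
                tokenval = t - '0';
                t = getchar();
                while (isdigit(t)) {
                    tokenval = tokenval * 10 + t - '0';
                    t = getchar();
                }
                ungetc(t, stdin);           /* push back the non-digit */
                return NUM;
            } else {                        /* any other character is its own token */
                tokenval = NONE;
                return t;
            }
        }
    }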

INCORPORATING A SYMBOL TABLE
• The symbol table interface; its operations are usually called by the parser.
o insert(s,t): input s: lexeme
t: token
output: index of the new entry
o lookup(s): input s: lexeme
output: index of the entry for string s, or 0 if s is not found in the symbol table.
• Handling reserved keywords
1. Insert all keywords in the symbol table in advance.
ex) insert("div", div)
insert("mod", mod)
2. while parsing
• whenever an identifier s is encountered:
if (lookup(s)'s token in {keywords}) s is a keyword; else s is an identifier;

• example
o preset
insert("div", div);
insert("mod", mod);
o while parsing
lookup("count") => 0    insert("count", id);
lookup("i") => 0    insert("i", id);
lookup("i") => 4, id
lookup("div") => 1, div

Fig 2.17. Symbol table and array for storing strings.
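
The insert/lookup interface can be sketched with a table of entries whose lexemes are stored back to back in one character array, in the spirit of Fig 2.17. The array sizes, the token codes `TOK_DIV`, `TOK_MOD`, `TOK_ID` and the helper `token_of` are hypothetical values chosen for this sketch.

```c
#include <assert.h>
#include <string.h>

#define STRMAX 999      /* size of the lexeme array (arbitrary in this sketch) */
#define SYMMAX 100      /* size of the symbol table (arbitrary in this sketch) */

/* Hypothetical token codes; any distinct values would do. */
enum { TOK_DIV = 257, TOK_MOD = 258, TOK_ID = 259 };

struct entry {
    const char *lexptr; /* points into lexemes[] */
    int token;
};

static char lexemes[STRMAX];
static int lastchar = -1;               /* last used position in lexemes[] */
static struct entry symtable[SYMMAX];
static int lastentry = 0;               /* entry 0 is unused: 0 means "not found" */

/* Returns the index of the entry for s, or 0 if s is not in the table. */
int lookup(const char *s)
{
    for (int p = lastentry; p > 0; p--)
        if (strcmp(symtable[p].lexptr, s) == 0)
            return p;
    return 0;
}

/* Adds lexeme s with token tok and returns the index of the new entry. */
int insert(const char *s, int tok)
{
    int len = (int)strlen(s);
    assert(lastentry + 1 < SYMMAX);         /* table not full */
    assert(lastchar + len + 1 < STRMAX);    /* lexeme array not full */
    lastentry++;
    symtable[lastentry].token = tok;
    symtable[lastentry].lexptr = &lexemes[lastchar + 1];
    memcpy(&lexemes[lastchar + 1], s, (size_t)len + 1);  /* copy with the '\0' */
    lastchar += len + 1;
    return lastentry;
}

/* Returns the token stored at a valid table index. */
int token_of(int index) { return symtable[index].token; }
```

Inserting "div" and "mod" first, then "count" and "i", gives the indices 1 to 4 used in the example above, so a later lookup of "i" returns 4 and a lookup of "div" returns 1.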

ABSTRACT STACK MACHINE
o An abstract machine is for intermediate code generation/execution.
o Instruction classes: arithmetic / stack manipulation / control flow
• 3 components of the abstract stack machine
1) Instruction memory: abstract machine code, intermediate code (instructions)
2) Stack
3) Data memory
• An example of stack machine operation.
o for an input (5+a)*b, intermediate codes: push 5 rvalue 2 ....
L-value and r-value
• l-value of a: address of location a
• r-value of a: if a is a location, then the content of location a; if a is a
constant, then the value a
• eg) a := 5 + b;
lvalue a ⇒ 2    rvalue 5 ⇒ 5    rvalue of b ⇒ 7

Stack Manipulation
• Some instructions for assignment operation
o push v: push v onto the stack.
o rvalue a: push the contents of data location a.
o lvalue a: push the address of data location a.
o pop: throw away the top element of the stack.
o :=: the r-value on top of the stack is stored in the l-value below it, and both are popped.
o copy: push a copy of the top element of the stack.
Translation of Expressions
• Infix expression (IE) → SDD/SDTS → abstract machine codes (ASC) of the postfix expression for
stack machine evaluation.
eg)
o IE: a + b (⇒ PE: a b +) ⇒ IC: rvalue a
rvalue b
+
o day := (1461 * y) div 4 + (153 * m + 2) div 5 + d
(⇒ day 1461 y * 4 div 153 m * 2 + 5 div + d + :=)
⇒ 1) lvalue day     7) push 153     13) div
   2) push 1461      8) rvalue m     14) +
   3) rvalue y       9) *            15) rvalue d
   4) *             10) push 2       16) +
   5) push 4        11) +            17) :=
   6) div           12) push 5
• A translation scheme for translating an assignment statement into abstract stack machine code can be
expressed formally in the form as follows:
stmt → id := expr
{ stmt.t := 'lvalue' || id.lexeme || expr.t || ':=' }
eg) day := a + b ⇒ lvalue day rvalue a rvalue b + :=
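
To make the instruction semantics concrete, here is a small interpreter sketch for the stack-manipulation and arithmetic instructions. The opcode names, the `struct instr` layout, the fixed stack size and the convention that data locations are indices into an int array are all assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

enum opcode { OP_PUSH, OP_RVALUE, OP_LVALUE, OP_POP, OP_COPY,
              OP_ASSIGN, OP_ADD, OP_MUL, OP_DIV };

struct instr { enum opcode op; int arg; };  /* arg: constant or data address */

#define STACK_MAX 64

/* Runs n instructions against data memory mem.
   Returns the number of elements left on the stack. */
int run(const struct instr *code, int n, int *mem)
{
    int stack[STACK_MAX];
    int top = 0;                            /* number of elements on the stack */

    for (int pc = 0; pc < n; pc++) {
        int arg = code[pc].arg;
        switch (code[pc].op) {
        case OP_PUSH:   assert(top < STACK_MAX); stack[top++] = arg;      break;
        case OP_RVALUE: assert(top < STACK_MAX); stack[top++] = mem[arg]; break;
        case OP_LVALUE: assert(top < STACK_MAX); stack[top++] = arg;      break;
        case OP_POP:    assert(top >= 1); top--;                          break;
        case OP_COPY:   assert(top >= 1 && top < STACK_MAX);
                        stack[top] = stack[top - 1]; top++;               break;
        case OP_ASSIGN: assert(top >= 2);   /* store r-value at the l-value below it */
                        mem[stack[top - 2]] = stack[top - 1]; top -= 2;   break;
        case OP_ADD:    assert(top >= 2);
                        stack[top - 2] += stack[top - 1]; top--;          break;
        case OP_MUL:    assert(top >= 2);
                        stack[top - 2] *= stack[top - 1]; top--;          break;
        case OP_DIV:    assert(top >= 2 && stack[top - 1] != 0);
                        stack[top - 2] /= stack[top - 1]; top--;          break;
        }
    }
    return top;
}
```

With day, a and b placed at addresses 0, 1 and 2, the sequence lvalue 0, rvalue 1, rvalue 2, +, := stores a + b into day and leaves the stack empty.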
Control Flow
• 3 types of jump instructions:
o Absolute target location
o Relative target location (distance: Current ↔ Target)
o Symbolic target location (i.e. the machine supports labels)
• Control-flow instructions:
o label a: the jump's target a
o goto a: the next instruction is taken from the statement labeled a
o gofalse a: pop the top & if it is 0 then jump to a
o gotrue a: pop the top & if it is nonzero then jump to a
o halt: stop execution

Translation of Statements
• Translation scheme for translating an if-statement into abstract machine code.
stmt → if expr then stmt1
{ out := newlabel;
stmt.t := expr.t || 'gofalse' out || stmt1.t || 'label' out }

Fig 2.18. Code layout for conditional and while statements.

• Translation scheme for while-statement?
Emitting a Translation
• Semantic Action (Translation Scheme):
1. stmt → if
expr { out := newlabel; emit('gofalse', out) }
then
stmt1 { emit('label', out) }
2. stmt → id { emit('lvalue', id.lexeme) }
:=
expr { emit(':=') }
3. stmt → if
expr { out := newlabel; emit('gofalse', out) }
then
stmt1 { out1 := newlabel; emit('goto', out1); emit('label', out); }
else
stmt2 { emit('label', out1); }

The code layout produced by scheme 3 is:
if (expr == false) goto out
stmt1
goto out1
out: stmt2
out1:

Implementation
procedure stmt()
var test, out: integer;
begin
    if lookahead = id then begin
        emit('lvalue', tokenval); match(id);
        match(':='); expr(); emit(':=')
    end
    else if lookahead = 'if' then begin
        match('if');
        expr();
        out := newlabel();
        emit('gofalse', out);
        match('then');
        stmt();
        emit('label', out)
    end
    else error();
end

Control Flow with Analysis
• if E1 or E2 then S vs if E1 and E2 then S
E1 or E2 = if E1 then true else E2
E1 and E2 = if E1 then E2 else false
• The code for E1 or E2:
o code for E1 (evaluation result: e1)
o copy
o gotrue OUT
o pop
o code for E2 (evaluation result: e2)
o label OUT

• The full code for if E1 or E2 then S;
o code for E1
o copy
o gotrue OUT1
o pop
o code for E2
o label OUT1
o gofalse OUT2
o code for S
o label OUT2
• Exercise: How about if E1 and E2 then S;
o if E1 and E2 then S1 else S2;

Putting the techniques together!
• infix expression ⇒ postfix expression
e.g.) id + (id - id) * num / id ⇒ id id id - num * id / +

Description of the Translator
• Syntax-directed translation scheme (SDTS) to translate infix expressions into postfix expressions.
Fig 2.19. Specification for infix-to-postfix translation

Structure of the translator

Fig 2.19. Modules of infix-to-postfix translator.

o global header file "header.h"

The Lexical Analysis Module lexer.c
o Description of tokens
+ - * / DIV MOD ( ) ID NUM DONE
Fig 2.20. Description of tokens.

The Parser Module parser.c

SDTS
|| ← left recursion elimination
New SDTS

Fig 2.20. Specification for infix-to-postfix translator & syntax-directed translation scheme after eliminating left-recursion.

The Emitter Module emitter.c
emit(t, tval)

The Symbol-Table Modules symbol.c and init.c
symbol.c
data structure of symbol table (Fig 2.29, p. 62)
insert(s, t)
lookup(s)

The Error Module error.c

Example of execution
input:  12 div 5 + 2
output: 12
        5
        div
        2
        +
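The translator outlined above can be sketched compactly in one file. The following is a minimal illustration in Python rather than the C modules named in these notes: the function names tokenize and translate are assumptions, the parser follows the scheme after left-recursion elimination (loops stand in for the tail-recursive primed non-terminals), and the output list plays the role of emit.

```python
import re

# One token per match: a number, a word (identifier, div, mod), or a single symbol.
TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z]\w*)|(.))")


def tokenize(text):
    """Return a list of (token kind, lexeme) pairs ending with DONE."""
    tokens = []
    for number, word, symbol in TOKEN.findall(text):
        if number:
            tokens.append(("NUM", number))
        elif word.lower() in ("div", "mod"):
            tokens.append((word.upper(), word.lower()))
        elif word:
            tokens.append(("ID", word))
        elif symbol.strip():
            tokens.append((symbol, symbol))
    tokens.append(("DONE", ""))
    return tokens


def translate(text):
    """Translate an infix expression into a list of postfix symbols."""
    tokens = tokenize(text)
    pos = 0
    out = []

    def lookahead():
        return tokens[pos][0]

    def match(kind):
        nonlocal pos
        if lookahead() != kind:
            raise SyntaxError(f"expected {kind}, found {lookahead()}")
        pos += 1

    def expr():                      # expr -> term { (+|-) term }
        term()
        while lookahead() in ("+", "-"):
            op = lookahead()
            match(op)
            term()
            out.append(op)           # emit the operator after both operands

    def term():                      # term -> factor { (*|/|div|mod) factor }
        factor()
        while lookahead() in ("*", "/", "DIV", "MOD"):
            kind, lexeme = tokens[pos]
            match(kind)
            factor()
            out.append(lexeme)

    def factor():                    # factor -> ( expr ) | id | num
        kind, lexeme = tokens[pos]
        if kind == "(":
            match("(")
            expr()
            match(")")
        elif kind in ("NUM", "ID"):
            match(kind)
            out.append(lexeme)
        else:
            raise SyntaxError(f"unexpected token {kind}")

    expr()
    match("DONE")
    return out
```

For the input shown in the example of execution, translate("12 div 5 + 2") yields the symbols 12, 5, div, 2, + in that order.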
3. Lexical Analysis:

OVERVIEW OF LEXICAL ANALYSIS
• To identify the tokens we need some method of describing the possible tokens that can appear in the input stream. For this purpose we introduce regular expressions, a notation that can be used to describe essentially all the tokens of a programming language.
• Secondly, having decided what the tokens are, we need some mechanism to recognize these in the input stream. This is done by the token recognizers, which are designed using transition diagrams and finite automata.

ROLE OF LEXICAL ANALYZER
The LA is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.

Fig. 3.1: Role of lexical analyzer

Upon receiving a ‘get next token’ command from the parser, the lexical analyzer reads input characters until it can identify the next token. The LA returns to the parser a representation of the token it has found. The representation will be an integer code if the token is a simple construct such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is stripping out from the source program the comments and white space in the form of blank, tab and newline characters. Another is correlating error messages from the compiler with the source program.

TOKEN, LEXEME, PATTERN:
Token: A token is a sequence of characters that can be treated as a single logical entity. Typical tokens are:
1) Identifiers 2) keywords 3) operators 4) special symbols 5) constants
Pattern: A set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
Fig. 3.2: Example of Token, Lexeme and Pattern

LEXICAL ERRORS:
Lexical errors are the errors reported by the lexer when it is unable to continue, that is, when there is no way to recognise a lexeme as a valid token. Syntax errors, on the other hand, are reported by the parser when a given sequence of already recognised valid tokens does not match any of the right sides of the grammar rules. A simple panic-mode error handling system requires that we return to a high-level parsing function when a parsing or lexical error is detected.

Error-recovery actions are:
i. Delete one character from the remaining input.
ii. Insert a missing character into the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.

REGULAR EXPRESSIONS
A regular expression is a formula that describes a possible set of strings. Components of a regular expression:
x        the character x
.        any character, usually except a newline
[xyz]    any of the characters x, y, z, ...
R?       an R or nothing (= optionally an R)
R*       zero or more occurrences of R
R+       one or more occurrences of R
R1R2     an R1 followed by an R2
R1|R2    either an R1 or an R2
A token is either a single string or one of a collection of strings of a certain type. If we view the set of strings in each token class as a language, we can use the regular-expression notation to describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or more letters or digits. In regular expression notation we would write:
identifier = letter (letter | digit)*
Here are the rules that define the regular expressions over an alphabet Σ:
• ε is a regular expression denoting {ε}, that is, the language containing only the empty string.
• For each ‘a’ in Σ, a is a regular expression denoting {a}, the language with only one string, consisting of the single symbol ‘a’.
• If R and S are regular expressions, then
(R)|(S) denotes L(R) ∪ L(S)
(R)(S) denotes L(R)L(S)
(R)* denotes (L(R))*

REGULAR DEFINITIONS
For notational convenience, we may wish to give names to regular expressions and to define regular expressions using these names as if they were symbols.
Identifiers are the set of strings of letters and digits beginning with a letter. The following regular definition provides a precise specification for this class of strings.
Example-1:
ab*|cd? is equivalent to (a(b*))|(c(d?))
Pascal identifier:
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | 2 | ... | 9
id → letter (letter | digit)*

Recognition of tokens:
We have learned how to express patterns using regular expressions. Now we must study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.
stmt → if expr then stmt
     | if expr then stmt else stmt

expr → term relop term
     | term
term → id
     | number
For relop, we use the comparison operators of languages like Pascal or SQL, where = is “equals” and <> is “not equals”, because it presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id and number, are the names of tokens as far as the lexical analyzer is concerned. The patterns for the tokens are described using regular definitions.
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id → letter (letter | digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | == | <>

In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing the “token” ws defined by:
ws → (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII characters of the same names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the white space. It is the following token that gets returned to the parser.
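The regular definitions above translate almost directly into a scanner built on a regular-expression library. The following is a minimal sketch, assuming Python's re module stands in for a generated lexer; the names TOKEN_SPEC, KEYWORDS, RELOP_ATTR and tokenize are illustrative and do not come from these notes. Because re tries alternatives left to right rather than picking the longest match, the two-character operators are listed before the one-character ones.

```python
import re

# Patterns mirror the regular definitions; order matters for re alternation.
TOKEN_SPEC = [
    ("ws", r"[ \t\n]+"),
    ("number", r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"),
    ("relop", r"<=|>=|<>|==|<|>"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
]
KEYWORDS = {"if", "then", "else"}
RELOP_ATTR = {"<": "LT", "<=": "LE", "==": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token name, attribute value) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"lexical error at position {pos}: {source[pos]!r}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                      # white space is never returned to the parser
        if kind == "id" and lexeme in KEYWORDS:
            tokens.append((lexeme, None)) # keywords are their own token names
        elif kind == "relop":
            tokens.append(("relop", RELOP_ATTR[lexeme]))
        else:
            tokens.append((kind, lexeme)) # lexeme stands in for a table pointer
    return tokens
```

Here the lexeme itself is used as the attribute of id and number tokens; a real scanner would return a pointer to a symbol-table entry instead.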

Lexeme        Token Name    Attribute Value
Any ws        -             -
if            if            -
then          then          -
else          else          -
Any id        id            Pointer to table entry
Any number    number        Pointer to table entry
<             relop         LT
<=            relop         LE
==            relop         EQ
<>            relop         NE
>             relop         GT
>=            relop         GE

TRANSITION DIAGRAM:
A transition diagram has a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols.
If we are in some state s, and the next input symbol is a, we look for an edge out of state s labeled by a. If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. We always indicate an accepting state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we shall additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge labeled “start” entering from nowhere. The transition diagram always begins in the start state before any input symbols have been used.
Fig. 3.3: Transition diagram of relational operators

As an intermediate step in the construction of an LA, we first produce a stylized flowchart, called a transition diagram. Positions in a transition diagram are drawn as circles and are called states.

Fig. 3.4: Transition diagram of identifier

The above TD is for an identifier, defined to be a letter followed by any number of letters or digits. A sequence of transition diagrams can be converted into a program to look for the tokens specified by the diagrams. Each state gets a segment of code.
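To make "each state gets a segment of code" concrete, here is a minimal sketch of a relational-operator diagram written as a state machine. The state numbers, the function name scan_relop, and the returned (attribute, forward) pair are assumptions for illustration and need not match Fig. 3.3; returning the unadvanced forward position models the retraction marked by * on an accepting state.

```python
def scan_relop(text, begin=0):
    """Return (attribute, forward) for a relop starting at begin, or None."""
    state = 0
    forward = begin
    while True:
        # An empty string stands for end of input.
        ch = text[forward] if forward < len(text) else ""
        if state == 0:                    # start state
            if ch == "<":
                state = 1
            elif ch == "=":
                state = 5
            elif ch == ">":
                state = 6
            else:
                return None               # not a relational operator
        elif state == 1:                  # seen '<'
            if ch == "=":
                return ("LE", forward + 1)
            if ch == ">":
                return ("NE", forward + 1)
            return ("LT", forward)        # starred state: do not consume ch
        elif state == 5:                  # seen '='
            if ch == "=":
                return ("EQ", forward + 1)
            return None
        elif state == 6:                  # seen '>'
            if ch == "=":
                return ("GE", forward + 1)
            return ("GT", forward)        # starred state: do not consume ch
        forward += 1
```

For example, scan_relop("<a") reports LT with forward left at position 1, so the character a is still available for the next token.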

FINITE AUTOMATON
• A recognizer for a language is a program that takes a string x, and answers “yes” if x is a sentence of that language, and “no” otherwise.
• We call the recognizer of the tokens a finite automaton.
• A finite automaton can be deterministic (DFA) or non-deterministic (NFA).
• This means that we may use a deterministic or non-deterministic automaton as a lexical analyzer.
• Both deterministic and non-deterministic finite automata recognize regular sets.
• Which one?
– deterministic: faster recognizer, but it may take more space
– non-deterministic: slower, but it may take less space
– Deterministic automata are widely used in lexical analyzers.
• First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer for our tokens.
Non-Deterministic Finite Automaton (NFA)
• A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
o S - a set of states
o Σ - a set of input symbols (alphabet)
o move - a transition function that maps state-symbol pairs to sets of states
o s0 - a start (initial) state
o F - a set of accepting states (final states)
• ε-transitions are allowed in NFAs. In other words, we can move from one state to another one without consuming any symbol.
• An NFA accepts a string x if and only if there is a path from the start state to one of the accepting states such that the edge labels along this path spell out x.
Example:

Deterministic Finite Automaton (DFA)

• A Deterministic Finite Automaton (DFA) is a special form of an NFA.
• No state has an ε-transition.
• For each symbol a and state s, there is at most one edge labeled a leaving s, i.e. the transition function maps a state-symbol pair to a state (not a set of states).

Example:
Converting RE to NFA
• This is one way to convert a regular expression into an NFA.
• There can be other (more efficient) ways to do the conversion.
• Thompson’s Construction is a simple and systematic method.
• It guarantees that the resulting NFA will have exactly one final state and one start state.
• Construction starts from the simplest parts (alphabet symbols).
• To create an NFA for a complex regular expression, the NFAs of its sub-expressions are combined to create its NFA.
• To recognize the empty string ε:

• To recognize a symbol a in the alphabet Σ:

• For regular expression r1 | r2:

N(r1) and N(r2) are NFAs for regular expressions r1 and r2.
• For regular expression r1 r2:

Here, the final state of N(r2) becomes the final state of N(r1r2).
• For regular expression r*:

Example:
For the RE (a|b)*a, the NFA construction is shown below.

Converting NFA to DFA (Subset Construction)
We merge together NFA states by looking at them from the point of view of the input characters:
• From the point of view of the input, any two states that are connected by an ε-transition may as well be the same, since we can move from one to the other without consuming any character. Thus states which are connected by an ε-transition will be represented by the same state in the DFA.
• If it is possible to have multiple transitions based on the same symbol, then we can regard a transition on a symbol as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol). Thus these states will be combined into a single DFA state.
To perform this operation, let us define two functions:
• The ε-closure function takes a state and returns the set of states reachable from it by zero or more ε-transitions. Note that this will always include the state itself. We should be able to get from a state to any state in its ε-closure without consuming any input.
• The function move takes a state and a character, and returns the set of states reachable by one transition on this character.
We can generalise both these functions to apply to sets of states by taking the union of the application to individual states.

For example, if A, B and C are states, move({A,B,C}, 'a') = move(A, 'a') ∪ move(B, 'a') ∪ move(C, 'a').
The Subset Construction Algorithm is as follows:

put ε-closure({s0}) as an unmarked state into the set of DFA states (DS)
while (there is an unmarked state S1 in DS) do
begin
    mark S1
    for each input symbol a do
    begin
        S2 ← ε-closure(move(S1, a))
        if (S2 is not in DS) then
            add S2 into DS as an unmarked state
        transfunc[S1, a] ← S2
    end
end

• a state S in DS is an accepting state of the DFA if a state in S is an accepting state of the NFA
• the start state of the DFA is ε-closure({s0})
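The algorithm above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the dictionary encoding of the NFA, the function names, and the small hand-written NFA for (a|b)*a (which is not the one Thompson's construction would produce) are all assumptions. Empty target sets are simply left out of the transition table instead of being kept as an explicit dead state.

```python
def epsilon_closure(states, eps):
    """All states reachable from `states` using zero or more ε-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(states, symbol, delta):
    """Union of the one-step transitions on `symbol` from every state in `states`."""
    result = set()
    for s in states:
        result |= delta.get((s, symbol), set())
    return result


def subset_construction(start, accepting, alphabet, delta, eps):
    """Return (DFA start state, DFA accepting states, DFA transition table)."""
    d_start = epsilon_closure({start}, eps)
    dstates = {d_start}
    unmarked = [d_start]
    dtran = {}
    while unmarked:
        s1 = unmarked.pop()                  # mark S1
        for a in alphabet:
            s2 = epsilon_closure(move(s1, a, delta), eps)
            if not s2:
                continue                     # no transition: dead state omitted
            if s2 not in dstates:
                dstates.add(s2)
                unmarked.append(s2)
            dtran[(s1, a)] = s2
    d_accepting = {s for s in dstates if s & accepting}
    return d_start, d_accepting, dtran


def dfa_accepts(dfa, text):
    """Run the constructed DFA on `text`."""
    state, accepting, dtran = dfa
    for ch in text:
        state = dtran.get((state, ch))
        if state is None:
            return False
    return state in accepting


# Hand-written NFA for (a|b)*a: 0 -ε-> 1, 1 loops on a and b, 1 -a-> 2 (accepting).
EPS = {0: {1}}
DELTA = {(1, "a"): {1, 2}, (1, "b"): {1}}
DFA = subset_construction(0, {2}, "ab", DELTA, EPS)
```

For this NFA the construction produces three DFA states, {0,1}, {1} and {1,2}, of which only {1,2} is accepting.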

Lexical Analyzer Generator

3.18. Lex specifications:

A Lex program (the .l file) consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures
1. The declarations section includes declarations of variables, manifest constants (a manifest constant is an identifier that is declared to represent a constant, e.g. #define PIE 3.14), and regular definitions.
2. The translation rules of a Lex program are statements of the form:

p1 {action1}
p2 {action2}
p3 {action3}
……
……
where each p is a regular expression and each action is a program fragment describing what action the lexical analyzer should take when a pattern p matches a lexeme. In Lex the actions are written in C.
3. The third section holds whatever auxiliary procedures are needed by the actions. Alternatively, these procedures can be compiled separately and loaded with the lexical analyzer.

Note: You can refer to a sample Lex program given on page no. 109 of chapter 3 of the book Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman for more clarity.

3.19. INPUT BUFFERING

The LA scans the characters of the source program one at a time to discover tokens. Because a large amount of time can be consumed scanning characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.
Buffering techniques:
1. Buffer pairs
2. Sentinels

Often, many characters beyond the next token may have to be examined before the next token itself can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves of, say, 100 characters each. One pointer marks the beginning of the token being discovered. A lookahead pointer scans ahead of the beginning point until the token is discovered. We view the position of each pointer as being between the character last read and the character next to be read. In practice each buffering scheme adopts one convention: a pointer is either at the symbol last read or at the symbol it is ready to read.

The distance which the lookahead pointer may have to travel past the actual token may be large. For example, in a PL/I program we may see
DECLARE (ARG1, ARG2, ..., ARGn)
without knowing whether DECLARE is a keyword or an array name until we see the character that follows the right parenthesis. In either case, the token itself ends at the second E. If the lookahead pointer travels beyond the buffer half in which it began, the other half must be loaded with the next characters from the source file. Since the buffer shown in the above figure is of limited size, there is an implied constraint on how much lookahead can be used before the next token is discovered. In the above example, if the lookahead traveled to the left half and all the way through the left half to the middle, we could not reload the right half, because we would lose characters that had not yet been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact that lookahead is limited.
SYNTAX ANALYSIS

ROLE OF THE PARSER:
A parser for a grammar is a program that takes as input a string w (obtained as a sequence of tokens from the lexical analyzer) and produces as output either a parse tree for w, if w is a valid sentence of the grammar, or an error message indicating that w is not a valid sentence of the given grammar. The goal of the parser is to determine the syntactic validity of a source string. If the string is valid, a tree is built for use by the subsequent phases of the compiler. The tree reflects the sequence of derivations or reductions used during parsing; hence, it is called a parse tree. If the string is invalid, the parser has to issue a diagnostic message identifying the nature and cause of the errors in the string. Every elementary subtree in the parse tree corresponds to a production of the grammar.
There are two ways of identifying an elementary subtree:

1. By deriving a string from a non-terminal, or
2. By reducing a string of symbols to a non-terminal.

The two types of parsers employed are:
a. Top-down parser: builds parse trees from the top (root) to the bottom (leaves).
b. Bottom-up parser: builds parse trees from the leaves and works up to the root.

Fig. 4.1: Position of parser in compiler model.
CONTEXT-FREE GRAMMARS
The inherently recursive structures of a programming language are defined by a context-free grammar. A context-free grammar is a 4-tuple G = (V, T, P, S), where
V is a finite set of non-terminals (syntactic variables),
T is a finite set of terminals (in our case, this will be the set of tokens),
P is a finite set of production rules of the form A → α, where A is a non-terminal and α is a string of terminals and non-terminals (including the empty string),
S is the start symbol (one of the non-terminal symbols).
L(G) is the language of G (the language generated by G), which is a set of sentences. A sentence of L(G) is a string of terminal symbols of G. If S is the start symbol of G, then ω is a sentence of L(G) iff S ⇒* ω, where ω is a string of terminals of G. If G is a context-free grammar, L(G) is a context-free language. Two grammars G1 and G2 are equivalent if they generate the same language.
Consider a derivation of the form S ⇒* α. If α contains non-terminals, it is called a sentential form of G. If α does not contain non-terminals, it is called a sentence of G.
Derivations
In general, a derivation step is
αAβ ⇒ αγβ
where αAβ is a sentential form, A → γ is a production rule in our grammar, and α and β are arbitrary strings of terminal and non-terminal symbols. For a sequence α1 ⇒ α2 ⇒ ... ⇒ αn, we say that α1 derives αn (αn is derived from α1). At each derivation step we can choose any of the non-terminals in the sentential form of G for the replacement. Two restricted orders are important:
1. If we always choose the left-most non-terminal in each derivation step, the derivation is called a left-most derivation.
2. If we always choose the right-most non-terminal in each derivation step, the derivation is called a right-most derivation.
Example:
E → E+E | E-E | E*E | E/E | -E
E → (E)
E → id
Leftmost derivation:
E ⇒ E+E ⇒ E*E+E ⇒ id*E+E ⇒ id*id+E ⇒ id*id+id
The string w = id*id+id, which consists entirely of terminal symbols, is derived from the grammar.
Rightmost derivation:
E ⇒ E+E ⇒ E+E*E ⇒ E+E*id ⇒ E+id*id ⇒ id+id*id
Given grammar G: E → E+E | E*E | (E) | -E | id
Sentence to be derived: -(id+id)
LEFTMOST DERIVATION      RIGHTMOST DERIVATION
E ⇒ -E                   E ⇒ -E
  ⇒ -(E)                   ⇒ -(E)
  ⇒ -(E+E)                 ⇒ -(E+E)
  ⇒ -(id+E)                ⇒ -(E+id)
  ⇒ -(id+id)               ⇒ -(id+id)
• Strings that appear in a leftmost derivation are called left-sentential forms.
• Strings that appear in a rightmost derivation are called right-sentential forms.
Sentential forms:
• Given a grammar G with start symbol S, if S ⇒* α, where α may contain non-terminals or terminals, then α is called a sentential form of G.
Yield or frontier of a tree:
• Each interior node of a parse tree is a non-terminal. The children of a node can be terminals or non-terminals. The leaves of the tree, read from left to right, form a sentential form that is called the yield or frontier of the tree.
PARSE TREE
• Inner nodes of a parse tree are non-terminal symbols.
• The leaves of a parse tree are terminal symbols.
• A parse tree can be seen as a graphical representation of a derivation.

Ambiguity:
A grammar that produces more than one parse tree for some sentence is said to be an ambiguous grammar.
Example: Given grammar G: E → E+E | E*E | (E) | -E | id

The sentence id+id*id has the following two distinct leftmost derivations:
E ⇒ E+E                  E ⇒ E*E
  ⇒ id+E                   ⇒ E+E*E
  ⇒ id+E*E                 ⇒ id+E*E
  ⇒ id+id*E                ⇒ id+id*E
  ⇒ id+id*id               ⇒ id+id*id

The two corresponding parse trees are:
Example:
To disambiguate the grammar E → E+E | E*E | E^E | id | (E), we can use the precedence of operators as follows:
^ (right to left)
/, * (left to right)
-, + (left to right)
We get the following unambiguous grammar:
E → E+T | T
T → T*F | F
F → G^F | G
G → id | (E)
Consider this example, G: stmt → if expr then stmt | if expr then stmt else stmt | other
This grammar is ambiguous, since the string if E1 then if E2 then S1 else S2 has the following two parse trees for leftmost derivation:

To eliminate ambiguity, the following grammar may be used:
stmt → matched_stmt | unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other
unmatched_stmt → if expr then stmt | if expr then matched_stmt else unmatched_stmt
Eliminating Left Recursion:
A grammar is said to be left recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some string α. Top-down parsing methods cannot handle left-recursive grammars. Hence, left recursion can be eliminated as follows:
If there is a production A → Aα | β, it can be replaced with a sequence of two productions
A → βA’
A’ → αA’ | ε
without changing the set of strings derivable from A.
Example: Consider the following grammar for arithmetic expressions:
E → E+T | T
T → T*F | F
F → (E) | id
First eliminate the left recursion for E as
E → TE’
E’ → +TE’ | ε
Then eliminate for T as
T → FT’
T’ → *FT’ | ε
Thus the grammar obtained after eliminating left recursion is
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
Algorithm to eliminate left recursion:

1. Arrange the non-terminals in some order A1, A2, ..., An.
2. for i := 1 to n do begin
       for j := 1 to i-1 do begin
           replace each production of the form Ai → Ajγ
           by the productions Ai → δ1γ | δ2γ | ... | δkγ
           where Aj → δ1 | δ2 | ... | δk are all the current Aj-productions;
       end
       eliminate the immediate left recursion among the Ai-productions
   end
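The innermost step of the algorithm, removing immediate left recursion from the productions of one non-terminal, can be sketched as follows. This is an illustration under stated assumptions: productions are encoded as lists of symbols, the empty list stands for ε, the new non-terminal is named by appending a plain apostrophe, and the function name is hypothetical.

```python
def eliminate_immediate_left_recursion(nonterminal, productions):
    """Rewrite A -> A alpha | beta as A -> beta A', A' -> alpha A' | epsilon."""
    # Bodies that start with A itself contribute their tails (the alphas).
    recursive = [p[1:] for p in productions if p and p[0] == nonterminal]
    # All other bodies are the betas.
    others = [p for p in productions if not p or p[0] != nonterminal]
    if not recursive:
        return {nonterminal: productions}    # nothing to eliminate
    new_name = nonterminal + "'"
    return {
        nonterminal: [beta + [new_name] for beta in others],
        new_name: [alpha + [new_name] for alpha in recursive] + [[]],
    }
```

Applied to E → E+T | T this gives E → TE' and E' → +TE' | ε, matching the example above; the non-recursive F-productions are returned unchanged.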
Left factoring:

Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive parsing. When it is not clear which of two alternative productions to use to expand a non-terminal A, we can rewrite the A-productions to defer the decision until we have seen enough of the input to make the right choice.
If there is any production A → αβ1 | αβ2, it can be rewritten as
A → αA’
A’ → β1 | β2
Consider the grammar G:
S → iEtS | iEtSeS | a
E → b
Left factored, this grammar becomes
S → iEtSS’ | a
S’ → eS | ε
E → b
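One pass of left factoring can be sketched as below. This is a simplified illustration, assuming productions are lists of symbols with the empty list for ε; it groups alternatives by their first symbol, factors out the longest prefix common to each group, and does not repeat the process on the newly created non-terminals, which a full implementation would. The function name and the apostrophe naming are hypothetical.

```python
def left_factor(nonterminal, productions):
    """Factor the longest common prefix out of alternatives sharing a first symbol."""
    groups = {}
    for prod in productions:
        groups.setdefault(prod[0] if prod else None, []).append(prod)
    result = {nonterminal: []}
    primes = 0
    for first_symbol, group in groups.items():
        if first_symbol is None or len(group) == 1:
            result[nonterminal].extend(group)    # nothing to factor
            continue
        # Longest prefix shared by every alternative in the group.
        prefix = group[0]
        for prod in group[1:]:
            n = 0
            while n < min(len(prefix), len(prod)) and prefix[n] == prod[n]:
                n += 1
            prefix = prefix[:n]
        primes += 1
        new_name = nonterminal + "'" * primes
        result[nonterminal].append(prefix + [new_name])
        result[new_name] = [prod[len(prefix):] for prod in group]
    return result
```

For S → iEtS | iEtSeS | a the common prefix is iEtS, so the result is S → iEtSS' | a with S' → ε | eS.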
TOP-DOWN PARSING
It can be viewed as an attempt to find a left-most derivation for an input string, or an attempt to construct a parse tree for the input starting from the root and working down to the leaves.
Types of top-down parsing:
1. Recursive descent parsing
2. Predictive parsing
1. RECURSIVE DESCENT PARSING
• Recursive descent parsing is one of the top-down parsing techniques; it uses a set of recursive procedures to scan its input.
• This parsing method may involve backtracking, that is, making repeated scans of the input.
Example for backtracking:
Consider the grammar G:
S → cAd
A → ab | a
and the input string w = cad.
The parse tree can be constructed using the following top-down approach:
Step 1:
Initially create a tree with a single node labeled S. An input pointer points to ‘c’, the first symbol of w. Expand the tree with the production of S.
Step 2:
The leftmost leaf ‘c’ matches the first symbol of w, so advance the input pointer to the second symbol of w, ‘a’, and consider the next leaf ‘A’. Expand A using the first alternative.

Step 3:
The second symbol ‘a’ of w also matches the second leaf of the tree, so advance the input pointer to the third symbol of w, ‘d’. But the third leaf of the tree is b, which does not match the input symbol d.
Hence discard the chosen production and reset the pointer to the second position. This is called backtracking.
Step 4:
Now try the second alternative for A.

Now we can halt and announce the successful completion of parsing.
Example for recursive descent parsing:
A left-recursive grammar can cause a recursive-descent parser to go into an infinite loop. Hence, elimination of left-recursion must be done before parsing.
Consider the grammar for arithmetic expressions
E → E+T | T
T → T*F | F
F → (E) | id
After eliminating the left-recursion the grammar becomes
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
Now we can write the procedures for the grammar as follows:
Recursive procedures:
Procedure E()
begin
    T();
    EPRIME();
end
Procedure EPRIME()
begin
    if input_symbol = ’+’ then begin
        ADVANCE();
        T();
        EPRIME();
    end
end
Procedure T()
begin
    F();
    TPRIME();
end
Procedure TPRIME()
begin
    if input_symbol = ’*’ then begin
        ADVANCE();
        F();
        TPRIME();
    end
end
Procedure F()
begin
    if input_symbol = ’id’ then
        ADVANCE();
    else if input_symbol = ’(’ then begin
        ADVANCE();
        E();
        if input_symbol = ’)’ then
            ADVANCE();
        else ERROR();
    end
    else ERROR();
end
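The procedures above carry over to a runnable recognizer almost line for line. The following is a minimal sketch, assuming the input has already been tokenized into a list of strings such as "id", "+", "*", "(" and ")"; the class name, the end marker "$", and the helper accepts are illustrative choices, not part of the notes.

```python
class RecursiveDescentParser:
    """Recognizer for E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id."""

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" marks the end of the input
        self.pos = 0

    @property
    def input_symbol(self):
        return self.tokens[self.pos]

    def advance(self):
        self.pos += 1

    def E(self):
        self.T()
        self.EPRIME()

    def EPRIME(self):
        if self.input_symbol == "+":
            self.advance()
            self.T()
            self.EPRIME()
        # otherwise E' -> ε: consume nothing

    def T(self):
        self.F()
        self.TPRIME()

    def TPRIME(self):
        if self.input_symbol == "*":
            self.advance()
            self.F()
            self.TPRIME()
        # otherwise T' -> ε: consume nothing

    def F(self):
        if self.input_symbol == "id":
            self.advance()
        elif self.input_symbol == "(":
            self.advance()
            self.E()
            if self.input_symbol != ")":
                raise SyntaxError("expected ')'")
            self.advance()
        else:
            raise SyntaxError(f"unexpected symbol {self.input_symbol!r}")


def accepts(tokens):
    """True if the whole token list is a sentence of the expression grammar."""
    parser = RecursiveDescentParser(tokens)
    try:
        parser.E()
    except SyntaxError:
        return False
    return parser.input_symbol == "$"
```

Because every alternative can be selected by looking at the current input symbol alone, this recognizer never backtracks.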
Stack implementation:
PROCEDURE      REMAINING INPUT
E()            id+id*id
T()            id+id*id
F()            id+id*id
ADVANCE()      +id*id
TPRIME()       +id*id
EPRIME()       +id*id
ADVANCE()      id*id
T()            id*id
F()            id*id
ADVANCE()      *id
TPRIME()       *id
ADVANCE()      id
F()            id
ADVANCE()      (empty)
TPRIME()       (empty)
2. PREDICTIVE PARSING
• Predictive parsing is a special case of recursive descent parsing where no backtracking is required.
• The key problem of predictive parsing is to determine the production to be applied for a non-terminal in case of alternatives.
Non-recursive predictive parser

The table-driven predictive parser has an input buffer, a stack, a parsing table and an output stream.
Input buffer:
It consists of the string to be parsed, followed by $ to indicate the end of the input string.
Stack:
It contains a sequence of grammar symbols preceded by $ to indicate the bottom of the stack. Initially, the stack contains the start symbol on top of $.
Parsing table:
It is a two-dimensional array M[A, a], where ‘A’ is a non-terminal and ‘a’ is a terminal.
Predictive parsing program:
The parser is controlled by a program that considers X, the symbol on top of the stack, and a, the current input symbol. These two symbols determine the parser action. There are three possibilities:
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol.
3. If X is a non-terminal, the program consults entry M[X, a] of the parsing table M. This entry will either be an X-production of the grammar or an error entry.
If M[X, a] = {X → UVW}, the parser replaces X on top of the stack by WVU (with U on top).
If M[X, a] = error, the parser calls an error recovery routine.
Algorithm for non-recursive predictive parsing:
Input: A string w and a parsing table M for grammar G.
Output: If w is in L(G), a leftmost derivation of w; otherwise, an error indication.
Method: Initially, the parser has $S on the stack with S, the start symbol of G, on top, and w$ in the input buffer. The program that utilizes the predictive parsing table M to produce a parse for the input is as follows:
set ip to point to the first symbol of w$;
repeat
    let X be the top stack symbol and a the symbol pointed to by ip;
    if X is a terminal or $ then
        if X = a then
            pop X from the stack and advance ip
        else error()
    else /* X is a non-terminal */
        if M[X, a] = X → Y1 Y2 … Yk then begin
            pop X from the stack;
            push Yk, Yk-1, …, Y1 onto the stack, with Y1 on top;
            output the production X → Y1 Y2 … Yk
        end
        else error()
until X = $
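The driver loop above is short enough to run as written. The sketch below is an illustration under assumptions: the parsing table is written out by hand for the left-recursion-free expression grammar used in these notes (it is what the table construction algorithm yields for that grammar), the input is a list of token strings, and the names TABLE, NONTERMINALS and predictive_parse are hypothetical.

```python
# M[X, a] for E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id.
# An empty list is an ε-production; a missing key is an error entry.
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}


def predictive_parse(tokens, start="E"):
    """Return the productions of a leftmost derivation, or raise SyntaxError."""
    buffer = list(tokens) + ["$"]
    stack = ["$", start]                 # top of the stack is the end of the list
    ip = 0
    output = []
    while True:
        x = stack[-1]
        a = buffer[ip]
        if x == "$" and a == "$":
            return output                # successful completion
        if x not in NONTERMINALS:        # X is a terminal or $
            if x != a:
                raise SyntaxError(f"expected {x!r}, found {a!r}")
            stack.pop()
            ip += 1
        else:
            body = TABLE.get((x, a))
            if body is None:
                raise SyntaxError(f"no entry M[{x}, {a}]")
            stack.pop()
            stack.extend(reversed(body)) # push Yk ... Y1 so that Y1 ends on top
            output.append((x, body))
```

For the input id + id * id the parser outputs eleven productions, beginning with E → TE' and ending with E' → ε.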
Predictive parsing table construction:
The construction of a predictive parser is aided by two functions associated with a grammar G:
1. FIRST
2. FOLLOW
Rules for FIRST():
1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a non-terminal and X → aα is a production, then add a to FIRST(X).
4. If X is a non-terminal and X → Y1 Y2 … Yk is a production, then place a in FIRST(X) if for some i, a is in FIRST(Yi), and ε is in all of FIRST(Y1), …, FIRST(Yi-1); that is, Y1 … Yi-1 ⇒* ε. If ε is in FIRST(Yj) for all j = 1, 2, …, k, then add ε to FIRST(X).
Rules for FOLLOW():
1. If S is the start symbol, then FOLLOW(S) contains $.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
Algorithm for construction of predictive parsing table:
Input: Grammar G
Output: Parsing table M
Method:
1. For each production A → α of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A, a].
3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].
4. Make each undefined entry of M be error.

Example:
Consider the following grammar:
E → E+T | T
T → T*F | F
F → (E) | id
After eliminating left-recursion the grammar is
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
FIRST():
FIRST(E) = { (, id }
FIRST(E’) = { +, ε }
FIRST(T) = { (, id }
FIRST(T’) = { *, ε }
FIRST(F) = { (, id }
FOLLOW():
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, $, ) }
FOLLOW(T’) = { +, $, ) }
FOLLOW(F) = { +, *, $, ) }

LL(1) grammar:
If every entry of the parsing table holds at most one production, the grammar is called an LL(1) grammar.
Consider the following grammar:
S → iEtS | iEtSeS | a
E → b
After left factoring, we have
S → iEtSS’ | a
S’ → eS | ε
E → b
To construct a parsing table, we need FIRST() and FOLLOW() for all the non-terminals.
FIRST(S) = { i, a }
FIRST(S’) = { e, ε }
FIRST(E) = { b }
FOLLOW(S) = { $, e }
FOLLOW(S’) = { $, e }
FOLLOW(E) = { t }

Since the entry M[S’, e] contains more than one production (S’ → eS and S’ → ε), the grammar is not LL(1).
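The FIRST and FOLLOW rules and the table construction can be checked mechanically against the two grammars above. The following is a minimal sketch using fixed-point iteration; the dictionary encoding of grammars (lists of symbols, empty list for ε, plain apostrophes for primes) and all function names are assumptions made for illustration.

```python
EPS = "ε"

EXPR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}
IF_ELSE = {
    "S": [["i", "E", "t", "S", "S'"], ["a"]],
    "S'": [["e", "S"], []],
    "E": [["b"]],
}


def first_of_string(symbols, first):
    """FIRST of a string of grammar symbols, given FIRST of each non-terminal."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}   # terminal: FIRST is itself
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)                      # every symbol can derive ε (or string is empty)
    return result


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                       # repeat until no set grows
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                new = first_of_string(prod, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym not in grammar:
                        continue         # FOLLOW is defined for non-terminals only
                    rest = first_of_string(prod[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


def build_table(grammar, first, follow):
    """M[A, a] as a list of productions; more than one entry means not LL(1)."""
    table = {}
    for nt, prods in grammar.items():
        for prod in prods:
            prod_first = first_of_string(prod, first)
            targets = prod_first - {EPS}
            if EPS in prod_first:
                targets |= follow[nt]    # includes $ when $ is in FOLLOW(A)
            for terminal in targets:
                table.setdefault((nt, terminal), []).append(prod)
    return table
```

Running these functions on EXPR reproduces the FIRST and FOLLOW sets listed earlier with single-entry table cells, while on IF_ELSE the cell for (S', e) receives both S' → eS and S' → ε.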
Actions performed in predictive parsing:
1. Shift
2. Reduce
3. Accept
4. Error
Implementation of predictive parser:
1. Elimination of left recursion, left factoring and ambiguity from the grammar.
2. Construct FIRST() and FOLLOW() for all non-terminals.
3. Construct the predictive parsing table.
4. Parse the given input string using the stack and the parsing table.

BOTTOM-UP PARSING
Constructing a parse tree for an input string beginning at the leaves and going towards the root is called bottom-up parsing.
A general type of bottom-up parser is a shift-reduce parser.

SHIFT-REDUCE PARSING
Shift-reduce parsing is a type of bottom-up parsing that attempts to construct a parse tree for an input string beginning at the leaves (the bottom) and working up towards the root (the top).
Example:
Consider the grammar:
S → aABe
A → Abc | b
B → d
The sentence to be recognized is abbcde.
REDUCTION (LEFTMOST)        RIGHTMOST DERIVATION
abbcde   (A → b)            S ⇒ aABe
aAbcde   (A → Abc)            ⇒ aAde
aAde     (B → d)              ⇒ aAbcde
aABe     (S → aABe)           ⇒ abbcde
S
The reductions trace out the right-most derivation in reverse.

Handles:

A handle of a string is a substring that matches the right side of a production, and whose reduction to the non-terminal on the left side of the production represents one step along the reverse of a rightmost derivation.

Example:

Consider the grammar:

E → E+E
E → E*E
E → (E)
E → id

and the input string id1+id2*id3.

The rightmost derivation, with the handle of each right-sentential form listed beside it, is:

E ⇒ E+E               (handle: E+E)
  ⇒ E+E*E             (handle: E*E)
  ⇒ E+E*id3           (handle: id3)
  ⇒ E+id2*id3         (handle: id2)
  ⇒ id1+id2*id3       (handle: id1)

Handle pruning:

A rightmost derivation in reverse can be obtained by “handle pruning”.
(i.e.) if w is a sentence or string of the grammar at hand, then w = γn, where γn is the nth right-sentential form of some rightmost derivation.
Stack implementation of shift-reduce parsing:

Stack         Input            Action
$             id1+id2*id3 $    shift
$ id1         +id2*id3 $       reduce by E → id
$ E           +id2*id3 $       shift
$ E+          id2*id3 $        shift
$ E+id2       *id3 $           reduce by E → id
$ E+E         *id3 $           shift
$ E+E*        id3 $            shift
$ E+E*id3     $                reduce by E → id
$ E+E*E       $                reduce by E → E*E
$ E+E         $                reduce by E → E+E
$ E           $                accept

Actions in shift-reduce parser:
• shift – The next input symbol is shifted onto the top of the stack.
• reduce – The parser replaces the handle within the stack with a non-terminal.
• accept – The parser announces successful completion of parsing.
• error – The parser discovers that a syntax error has occurred and calls an error recovery routine.

Conflicts in shift-reduce parsing:

There are two conflicts that occur in shift-reduce parsing:

1. Shift-reduce conflict: The parser cannot decide whether to shift or to reduce.

2. Reduce-reduce conflict: The parser cannot decide which of several reductions to make.

1. Shift-reduce conflict:

Example: Consider the grammar

E → E+E | E*E | id and input id+id*id

Stack     Input    Action                  Stack       Input    Action
$ E+E     *id $    Reduce by E → E+E       $ E+E       *id $    Shift
$ E       *id $    Shift                   $ E+E*      id $     Shift
$ E*      id $     Shift                   $ E+E*id    $        Reduce by E → id
$ E*id    $        Reduce by E → id        $ E+E*E     $        Reduce by E → E*E
$ E*E     $        Reduce by E → E*E       $ E+E       $        Reduce by E → E+E
$ E                                        $ E

2. Reduce-reduce conflict:

Consider the grammar:

M → R+R | R+c | R
R → c
and input c+c

Stack     Input    Action                  Stack     Input    Action
$         c+c $    Shift                   $         c+c $    Shift
$ c       +c $     Reduce by R → c         $ c       +c $     Reduce by R → c
$ R       +c $     Shift                   $ R       +c $     Shift
$ R+      c $      Shift                   $ R+      c $      Shift
$ R+c     $        Reduce by R → c         $ R+c     $        Reduce by M → R+c
$ R+R     $        Reduce by M → R+R       $ M       $
$ M       $
Viable prefixes:
➢ α is a viable prefix of the grammar if there is a w such that αw is a right-sentential form.
➢ The set of prefixes of right-sentential forms that can appear on the stack of a shift-reduce parser
are called viable prefixes.
➢ The set of viable prefixes is a regular language.

OPERATOR-PRECEDENCE PARSING

Operator-precedence parsing is an efficient way of constructing a shift-reduce parser.

An operator-precedence parser can be constructed from a grammar called an operator grammar. These
grammars have the property that no production right side is ε or has two adjacent non-terminals.

Example:

Consider the grammar:

E → EAE | (E) | -E | id
A → + | - | * | / | ↑

Since the right side EAE has three consecutive non-terminals, this is not an operator grammar. It can be rewritten as follows:

E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id

Operator precedence relations:
There are three disjoint precedence relations, namely
<.  - less than
=   - equal to
.>  - greater than
The relations have the following meaning:
a <. b – a yields precedence to b
a = b  – a has the same precedence as b
a .> b – a takes precedence over b

Rules for binary operations:
1. If operator θ1 has higher precedence than operator θ2, then make
   θ1 .> θ2 and θ2 <. θ1

2. If operators θ1 and θ2 are of equal precedence, then make
   θ1 .> θ2 and θ2 .> θ1 if the operators are left associative
   θ1 <. θ2 and θ2 <. θ1 if they are right associative

3. Make the following for all operators θ:
   θ <. id,   id .> θ
   θ <. (,    ( <. θ
   ) .> θ,    θ .> )
   θ .> $,    $ <. θ
Also make
   ( = ),  ( <. (,  ) .> ),  ( <. id,  id .> ),  $ <. id,  id .> $,  $ <. (,  ) .> $

Example:

Operator-precedence relations for the grammar

E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id are given in the following table, assuming

1. ↑ is of highest precedence and right-associative,
2. * and / are of next higher precedence and left-associative, and
3. + and - are of lowest precedence and left-associative.
Note that the blanks in the table denote error entries.

TABLE: Operator-precedence relations

       +     -     *     /     ↑     id    (     )     $
+      .>    .>    <.    <.    <.    <.    <.    .>    .>
-      .>    .>    <.    <.    <.    <.    <.    .>    .>
*      .>    .>    .>    .>    <.    <.    <.    .>    .>
/      .>    .>    .>    .>    <.    <.    <.    .>    .>
↑      .>    .>    .>    .>    <.    <.    <.    .>    .>
id     .>    .>    .>    .>    .>                .>    .>
(      <.    <.    <.    <.    <.    <.    <.    =
)      .>    .>    .>    .>    .>                .>    .>
$      <.    <.    <.    <.    <.    <.    <.

Operator precedence parsing algorithm:

Input : An input string w and a table of precedence relations.
Output : If w is well formed, a skeletal parse tree, with a placeholder non-terminal E labeling all
interior nodes; otherwise, an error indication.
Method : Initially the stack contains $ and the input buffer the string w$. To parse, we execute the
following program :

(1)  set ip to point to the first symbol of w$;
(2)  repeat forever
(3)      if $ is on top of the stack and ip points to $ then
(4)          return
         else begin
(5)          let a be the topmost terminal symbol on the stack
             and let b be the symbol pointed to by ip;
(6)          if a <. b or a = b then begin
(7)              push b onto the stack;
(8)              advance ip to the next input symbol;
             end;
(9)          else if a .> b then /* reduce */
(10)             repeat
(11)                 pop the stack
(12)             until the top stack terminal is related by <.
                 to the terminal most recently popped
(13)         else error()
         end
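
The loop above can be sketched in Python. This is only an illustrative sketch, not part of the original notes: the table PREC covers just the terminals +, *, id and $, the function name op_precedence_parse is made up for the example, and only terminals are kept on the stack (the placeholder non-terminal E is left implicit).

```python
# Precedence relations for the terminals +, *, id and $ (a subset of the table).
# '<' stands for <. (yields precedence), '>' for .> (takes precedence).
PREC = {
    "+":  {"+": ">", "*": "<", "id": "<", "$": ">"},
    "*":  {"+": ">", "*": ">", "id": "<", "$": ">"},
    "id": {"+": ">", "*": ">", "$": ">"},
    "$":  {"+": "<", "*": "<", "id": "<"},
}


def op_precedence_parse(tokens):
    """Return the list of parser moves, or raise SyntaxError on a blank entry."""
    stack = ["$"]              # terminals only
    tokens = tokens + ["$"]
    ip = 0
    moves = []
    while True:
        a, b = stack[-1], tokens[ip]
        if a == "$" and b == "$":
            moves.append("accept")
            return moves
        relation = PREC[a].get(b)
        if relation in ("<", "="):
            stack.append(b)                 # shift
            moves.append(f"shift {b}")
            ip += 1
        elif relation == ">":
            # Reduce: pop until the new top is related by <. to the
            # terminal most recently popped.
            while True:
                popped = stack.pop()
                moves.append(f"pop {popped}")
                if PREC[stack[-1]].get(popped) == "<":
                    break
        else:
            raise SyntaxError(f"no precedence relation between {a} and {b}")


print(op_precedence_parse(["id", "+", "id", "*", "id"]))
```

For the input id+id*id the moves are shift id, pop id, shift +, shift id, pop id, shift *, shift id, pop id, pop *, pop +, accept.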

Stack implementation of operator precedence parsing:
Operator precedence parsing uses a stack and a precedence relation table to implement
the above algorithm. It is a shift-reduce parser with all four actions: shift,
reduce, accept and error.
The initial configuration of an operator precedence parser is
STACK    INPUT
$        w$

where w is the input string to be parsed.

Example:

Consider the grammar E → E+E | E-E | E*E | E/E | E↑E | (E) | id. The input string is id+id*id. The implementation is as
follows:

STACK    RELATION    INPUT        COMMENT

$        <.          id+id*id$    shift id
$id      .>          +id*id$      pop the top of the stack, id
$        <.          +id*id$      shift +
$+       <.          id*id$       shift id
$+id     .>          *id$         pop id
$+       <.          *id$         shift *
$+*      <.          id$          shift id
$+*id    .>          $            pop id
$+*      .>          $            pop *
$+       .>          $            pop +
$                    $            accept

Advantages of operator precedence parsing:
1. It is easy to implement.
2. Once the operator precedence relations have been established between all pairs of terminals of a grammar, the
grammar can be ignored. The grammar is not referred to any more during parsing.

Disadvantages of operator precedence parsing:
1. It is hard to handle tokens like the minus sign (-), which has two different precedences (unary and binary).
2. Only a small class of grammars can be parsed using an operator-precedence parser.

LR PARSERS
An efficient bottom-up syntax analysis technique that can be used to parse a large class of
CFGs is called LR(k) parsing. The ‘L’ is for left-to-right scanning of the input, the ‘R’ for
constructing a rightmost derivation in reverse, and the ‘k’ for the number of input symbols of lookahead. When ‘k’ is
omitted, it is assumed to be 1.

Advantages of LR parsing:
✓ It recognizes virtually all programming language constructs for which a CFG can be written.
✓ It is an efficient non-backtracking shift-reduce parsing method.
✓ The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed
with predictive parsers.
✓ It detects a syntactic error as soon as possible.

Drawbacks of LR method:
It is too much work to construct an LR parser by hand for a programming language
grammar. A specialized tool, called an LR parser generator, is needed. Example: YACC.

Types of LR parsing method:
1. SLR - Simple LR
▪ Easiest to implement, least powerful.
2. CLR - Canonical LR
▪ Most powerful, most expensive.
3. LALR - Look-Ahead LR
▪ Intermediate in size and cost between the other two methods.

The LR parsing algorithm:

The schematic form of an LR parser is as follows:

[Figure: Model of an LR parser. The INPUT buffer holds a1 … ai … an $; the STACK holds s0 … Xm-1 sm-1 Xm sm with sm on top; the LR parsing program reads both, consults the parsing table (action and goto parts) and produces the OUTPUT.]

It consists of: an input, an output, a stack, a driver program, and a parsing table that has two parts (action and
goto).

➢ The driver program is the same for all LR parsers.

➢ The parsing program reads characters from an input buffer one at a time.

➢ The program uses a stack to store a string of the form s0 X1 s1 X2 s2 … Xm sm, where sm is on top. Each Xi is
a grammar symbol and each si is a state.

➢ The parsing table consists of two parts: action and goto functions.

Action: The parsing program determines sm, the state currently on top of the stack, and ai, the current input
symbol. It then consults action[sm, ai] in the action table, which can have one of four values :

1. shift s, where s is a state,
2. reduce by a grammar production A → β,
3. accept, and
4. error.

Goto: The function goto takes a state and a grammar symbol as arguments and produces a state.

LR parsing algorithm:

Input: An input string w and an LR parsing table with functions action and goto for grammar G.

Output: If w is in L(G), a bottom-up parse for w; otherwise, an error indication.

Method: Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in the input
buffer. The parser then executes the following program :

set ip to point to the first input symbol of w$;
repeat forever begin
    let s be the state on top of the stack and
        a the symbol pointed to by ip;
    if action[s, a] = shift s’ then begin
        push a then s’ on top of the stack;
        advance ip to the next input symbol
    end
    else if action[s, a] = reduce A → β then begin
        pop 2*|β| symbols off the stack;
        let s’ be the state now on top of the stack;
        push A then goto[s’, A] on top of the stack;
        output the production A → β
    end
    else if action[s, a] = accept then return
    else error()
end
CONSTRUCTING SLR(1) PARSING TABLE:

To perform SLR parsing, take the grammar as input and do the following:
1. Find the LR(0) items.
2. Complete the closure.
3. Compute goto(I, X), where I is a set of items and X is a grammar symbol.

LR(0) items:
An LR(0) item of a grammar G is a production of G with a dot at some position of the
right side. For example, production A → XYZ yields the four items :

A → . XYZ
A → X . YZ
A → XY . Z
A → XYZ .

Closure operation:
If I is a set of items for a grammar G, then closure(I) is the set of items constructed from I
by the two rules:

1. Initially, every item in I is added to closure(I).
2. If A → α . Bβ is in closure(I) and B → γ is a production, then add the item B → . γ to closure(I), if it
is not already there. We apply this rule until no more new items can be added to closure(I).

Goto operation:
Goto(I, X) is defined to be the closure of the set of all items [A → αX . β] such that
[A → α . Xβ] is in I.
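
The two operations can be sketched in Python as follows. This is an illustrative sketch under a few assumptions of its own: the augmented expression grammar E’ → E, E → E+T | T, T → T*F | F, F → (E) | id is hard-coded, an item is encoded as a pair (production index, dot position), and the helper canonical_collection is a hypothetical name for the usual worklist construction.

```python
# Augmented expression grammar; production 0 is E' -> E.
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]
NONTERMINALS = {head for head, _ in GRAMMAR}


def closure(items):
    """Closure of a set of LR(0) items, each a (production, dot) pair."""
    result = set(items)
    work = list(items)
    while work:
        prod, dot = work.pop()
        body = GRAMMAR[prod][1]
        # Rule 2: the dot stands before a non-terminal B, so add B -> . gamma.
        if dot < len(body) and body[dot] in NONTERMINALS:
            for index, (head, _) in enumerate(GRAMMAR):
                if head == body[dot] and (index, 0) not in result:
                    result.add((index, 0))
                    work.append((index, 0))
    return frozenset(result)


def goto(items, symbol):
    """Move the dot over `symbol` in every item that allows it, then close."""
    moved = {(p, d + 1) for p, d in items
             if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == symbol}
    return closure(moved)


def canonical_collection():
    """All sets of items reachable from closure({E' -> . E})."""
    symbols = sorted({s for _, body in GRAMMAR for s in body})
    states = [closure({(0, 0)})]
    for state in states:                 # the list grows while we iterate
        for symbol in symbols:
            target = goto(state, symbol)
            if target and target not in states:
                states.append(target)
    return states


print(len(canonical_collection()))
```

For this grammar the construction yields twelve sets of items, usually named I0 to I11.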

Steps to construct the SLR parsing table for grammar G are:

1. Augment G and produce G’.
2. Construct the canonical collection of sets of items C for G’.
3. Construct the parsing functions action and goto using the following algorithm,
which requires FOLLOW(A) for each non-terminal A of the grammar.

Algorithm for construction of SLR parsing table:

Input : An augmented grammar G’
Output: The SLR parsing table functions action and goto for G’
Method:
1. Construct C = {I0, I1, …, In}, the collection of sets of LR(0) items for G’.
2. State i is constructed from Ii. The parsing actions for state i are determined as follows:
(a) If [A → α·aβ] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to “shift j”. Here a must
be a terminal.
(b) If [A → α·] is in Ii, then set action[i, a] to “reduce A → α” for all a in FOLLOW(A); here A may not be S’.
(c) If [S’ → S·] is in Ii, then set action[i, $] to “accept”.

If any conflicting actions are generated by the above rules, we say the grammar is not SLR(1).
3. The goto transitions for state i are constructed for all non-terminals A using the rule: If goto(Ii, A) = Ij,
then goto[i, A] = j.
4. All entries not defined by rules (2) and (3) are made “error”.
5. The initial state of the parser is the one constructed from the set of items containing [S’ → ·S].

Example for SLR parsing:
Construct the SLR parsing table for the following grammar :
G: E → E+T | T
   T → T*F | F
   F → (E) | id

The given grammar is:
G: E → E+T ------ (1)
   E → T ------ (2)
   T → T*F ------ (3)
   T → F ------ (4)
   F → (E) ------ (5)
   F → id ------ (6)

Step 1: Convert the given grammar into an augmented grammar.
Augmented grammar:
E’ → E
E → E+T
E → T
T → T*F
T → F
F → (E)
F → id

Step 2: Find the LR(0) items.

I0 :
E’ → . E
E → . E+T
E → . T
T → . T*F
T → . F
F → . (E)
F → . id

GOTO(I0, E) = I1 :
E’ → E .
E → E . +T

GOTO(I0, T) = I2 :
E → T .
T → T . *F

GOTO(I0, F) = I3 :
T → F .

GOTO(I0, () = I4 :
F → ( . E)
E → . E+T
E → . T
T → . T*F
T → . F
F → . (E)
F → . id

GOTO(I0, id) = I5 :
F → id .

GOTO(I1, +) = I6 :
E → E+ . T
T → . T*F
T → . F
F → . (E)
F → . id

GOTO(I2, *) = I7 :
T → T* . F
F → . (E)
F → . id

GOTO(I4, E) = I8 :
F → (E . )
E → E . +T

GOTO(I4, T) = I2
GOTO(I4, F) = I3
GOTO(I4, () = I4
GOTO(I4, id) = I5

GOTO(I6, T) = I9 :
E → E+T .
T → T . *F

GOTO(I6, F) = I3
GOTO(I6, () = I4
GOTO(I6, id) = I5

GOTO(I7, F) = I10 :
T → T*F .

GOTO(I7, () = I4
GOTO(I7, id) = I5

GOTO(I8, )) = I11 :
F → (E) .

GOTO(I8, +) = I6
GOTO(I9, *) = I7

FOLLOW(E) = {$, ), +}
FOLLOW(T) = {$, +, ), *}
FOLLOW(F) = {*, +, ), $}

SLR parsing table:

              ACTION                                GOTO
State   id    +     *     (     )     $       E     T     F

0       s5                s4                  1     2     3
1             s6                      acc
2             r2    s7          r2    r2
3             r4    r4          r4    r4
4       s5                s4                  8     2     3
5             r6    r6          r6    r6
6       s5                s4                        9     3
7       s5                s4                              10
8             s6                s11
9             r1    s7          r1    r1
10            r3    r3          r3    r3
11            r5    r5          r5    r5

Blank entries are error entries.

Stack implementation:

Check whether the input id+id*id is valid or not. Shift and reduce entries come from the ACTION part of the table; after each reduction, the GOTO part gives the state to push.

STACK                      INPUT        ACTION

0                          id+id*id$    ACTION[0,id] = s5; shift
0 id 5                     +id*id$      ACTION[5,+] = r6; reduce by F→id; GOTO[0,F] = 3
0 F 3                      +id*id$      ACTION[3,+] = r4; reduce by T→F; GOTO[0,T] = 2
0 T 2                      +id*id$      ACTION[2,+] = r2; reduce by E→T; GOTO[0,E] = 1
0 E 1                      +id*id$      ACTION[1,+] = s6; shift
0 E 1 + 6                  id*id$       ACTION[6,id] = s5; shift
0 E 1 + 6 id 5             *id$         ACTION[5,*] = r6; reduce by F→id; GOTO[6,F] = 3
0 E 1 + 6 F 3              *id$         ACTION[3,*] = r4; reduce by T→F; GOTO[6,T] = 9
0 E 1 + 6 T 9              *id$         ACTION[9,*] = s7; shift
0 E 1 + 6 T 9 * 7          id$          ACTION[7,id] = s5; shift
0 E 1 + 6 T 9 * 7 id 5     $            ACTION[5,$] = r6; reduce by F→id; GOTO[7,F] = 10
0 E 1 + 6 T 9 * 7 F 10     $            ACTION[10,$] = r3; reduce by T→T*F; GOTO[6,T] = 9
0 E 1 + 6 T 9              $            ACTION[9,$] = r1; reduce by E→E+T; GOTO[0,E] = 1
0 E 1                      $            ACTION[1,$] = accept
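
The driver program can be sketched in Python with the SLR table for this grammar written out as dictionaries. This is a minimal sketch, not the notes' own code: the names ACTION, GOTO, PRODS and lr_parse are chosen for the example, and the stack holds states only, so a reduction by A → β pops |β| entries instead of 2*|β|.

```python
# Production number -> (left side, length of right side).
PRODS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

# ACTION part: 'sN' = shift and go to state N, 'rN' = reduce by production N.
ACTION = {
    0: {"id": "s5", "(": "s4"},
    1: {"+": "s6", "$": "acc"},
    2: {"+": "r2", "*": "s7", ")": "r2", "$": "r2"},
    3: {"+": "r4", "*": "r4", ")": "r4", "$": "r4"},
    4: {"id": "s5", "(": "s4"},
    5: {"+": "r6", "*": "r6", ")": "r6", "$": "r6"},
    6: {"id": "s5", "(": "s4"},
    7: {"id": "s5", "(": "s4"},
    8: {"+": "s6", ")": "s11"},
    9: {"+": "r1", "*": "s7", ")": "r1", "$": "r1"},
    10: {"+": "r3", "*": "r3", ")": "r3", "$": "r3"},
    11: {"+": "r5", "*": "r5", ")": "r5", "$": "r5"},
}

# GOTO part, indexed by state and non-terminal.
GOTO = {
    0: {"E": 1, "T": 2, "F": 3},
    4: {"E": 8, "T": 2, "F": 3},
    6: {"T": 9, "F": 3},
    7: {"F": 10},
}


def lr_parse(tokens):
    """Return the production numbers used, in order of reduction."""
    stack = [0]                      # states only
    tokens = tokens + ["$"]
    ip = 0
    output = []
    while True:
        state, symbol = stack[-1], tokens[ip]
        action = ACTION[state].get(symbol)
        if action is None:           # blank entry
            raise SyntaxError(f"unexpected {symbol!r} in state {state}")
        if action == "acc":
            return output
        if action[0] == "s":
            stack.append(int(action[1:]))
            ip += 1
        else:
            number = int(action[1:])
            head, length = PRODS[number]
            del stack[-length:]      # pop one state per right-side symbol
            stack.append(GOTO[stack[-1]][head])
            output.append(number)


print(lr_parse(["id", "+", "id", "*", "id"]))
```

For id+id*id the reductions come out as 6, 4, 2, 6, 4, 6, 3, 1, the same order as in the trace above.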
MODULE 2 - SYNTAX-DIRECTED TRANSLATION
MODULE 3 - TYPE CHECKING
MODULE 4 - RUN-TIME ENVIRONMENTS
MODULE 4 - INTERMEDIATE CODE GENERATION

INTRODUCTION

The front end translates a source program into an intermediate representation from which the
back end generates target code.

Benefits of using a machine-independent intermediate form are:

1. Retargeting is facilitated. That is, a compiler for a different machine can be created by
attaching a back end for the new machine to an existing front end.

2. A machine-independent code optimizer can be applied to the intermediate representation.

Position of intermediate code generator

parser → static checker → intermediate code generator → (intermediate code) → code generator

INTERMEDIATE LANGUAGES

Three ways of intermediate representation:

• Syntax tree

• Postfix notation

• Three-address code

The semantic rules for generating three-address code from common programming language constructs are
similar to those for constructing syntax trees or for generating postfix notation.

Graphical Representations:

Syntax tree:

A syntax tree depicts the natural hierarchical structure of a source program. A dag
(Directed Acyclic Graph) gives the same information but in a more compact way because
common subexpressions are identified. A syntax tree and dag for the assignment statement a : =
b * - c + b * - c are as follows:

[Figure: (a) Syntax tree: assign has children a and +; + has two separate * subtrees, each with children b and uminus c. (b) Dag: the same structure, but both operands of + point to a single shared * node with children b and uminus c.]

(a) Syntax tree (b) Dag

Postfix notation:

Postfix notation is a linearized representation of a syntax tree; it is a list of the nodes of
the tree in which a node appears immediately after its children. The postfix notation for the
syntax tree given above is

a b c uminus * b c uminus * + assign
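
A postorder walk produces this listing. The sketch below is illustrative only; it assumes a syntax tree encoded as nested Python tuples (operator, child, ...) with plain strings for leaves, and the function name postfix is made up for the example.

```python
def postfix(node):
    """List the nodes of a syntax tree so that each node follows its children."""
    if isinstance(node, str):          # a leaf: an identifier
        return [node]
    operator, *children = node
    listing = []
    for child in children:
        listing.extend(postfix(child))
    listing.append(operator)           # the node comes right after its children
    return listing


# Syntax tree for a := b * -c + b * -c
tree = ("assign", "a",
        ("+", ("*", "b", ("uminus", "c")),
              ("*", "b", ("uminus", "c"))))

print(" ".join(postfix(tree)))
```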

Syntax-directed definition:

Syntax trees for assignment statements are produced by the syntax-directed definition.
Non-terminal S generates an assignment statement. The two binary operators + and * are
examples of the full operator set in a typical language. Operator associativities and precedences
are the usual ones, even though they have not been put into the grammar. This definition
constructs the tree from the input a : = b * - c + b * - c.

PRODUCTION        SEMANTIC RULE

S → id := E       S.nptr := mknode(‘assign’, mkleaf(id, id.place), E.nptr)

E → E1 + E2       E.nptr := mknode(‘+’, E1.nptr, E2.nptr)

E → E1 * E2       E.nptr := mknode(‘*’, E1.nptr, E2.nptr)

E → - E1          E.nptr := mknode(‘uminus’, E1.nptr)

E → (E1)          E.nptr := E1.nptr

E → id            E.nptr := mkleaf(id, id.place)

Syntax-directed definition to produce syntax trees for assignment statements

The token id has an attribute place that points to the symbol-table entry for the identifier.
A symbol-table entry can be found from an attribute id.name, representing the lexeme associated with
that occurrence of id. If the lexical analyzer holds all lexemes in a single array of characters, then
attribute name might be the index of the first character of the lexeme.

Two representations of the syntax tree are as follows. In (a) each node is represented as a
record with a field for its operator and additional fields for pointers to its children. In (b), nodes
are allocated from an array of records and the index or position of the node serves as the pointer
to the node. All the nodes in the syntax tree can be visited by following pointers, starting from
the root at position 10.

Two representations of the syntax tree

(a) [Figure: the tree drawn as linked records: assign → (id a, +); + → (*, *); each * → (id b, uminus); each uminus → id c.]

(b) Array of records:

0     id        b
1     id        c
2     uminus    1
3     *         0     2
4     id        b
5     id        c
6     uminus    5
7     *         4     6
8     +         3     7
9     id        a
10    assign    9     8

Three-Address Code:

Three-address code is a sequence of statements of the general form

x : = y op z

where x, y and z are names, constants, or compiler-generated temporaries; op stands for any
operator, such as a fixed- or floating-point arithmetic operator, or a logical operator on boolean-
valued data. Thus a source language expression like x + y * z might be translated into a sequence

t1 := y * z
t2 := x + t1

where t1 and t2 are compiler-generated temporary names.

Advantages of three-address code:

• The unraveling of complicated arithmetic expressions and of nested flow-of-control
statements makes three-address code desirable for target code generation and
optimization.

• The use of names for the intermediate values computed by a program allows three-
address code to be easily rearranged, unlike postfix notation.

Three-address code is a linearized representation of a syntax tree or a dag in which
explicit names correspond to the interior nodes of the graph. The syntax tree and dag are
represented by the three-address code sequences. Variable names can appear directly in three-
address statements.

Three-address code corresponding to the syntax tree and dag given above

(a) Code for the syntax tree      (b) Code for the dag

t1 := -c                          t1 := -c
t2 := b * t1                      t2 := b * t1
t3 := -c                          t5 := t2 + t2
t4 := b * t3                      a := t5
t5 := t2 + t4
a := t5

The reason for the term “three-address code” is that each statement usually contains three addresses, two for the
operands and one for the result.

Types of Three-Address Statements:

The common three-address statements are:

1. Assignment statements of the form x : = y op z, where op is a binary arithmetic or logical
operation.

2. Assignment instructions of the form x : = op y, where op is a unary operation. Essential unary
operations include unary minus, logical negation, shift operators, and conversion operators
that, for example, convert a fixed-point number to a floating-point number.

3. Copy statements of the form x : = y, where the value of y is assigned to x.

4. The unconditional jump goto L. The three-address statement with label L is the next to be
executed.

5. Conditional jumps such as if x relop y goto L. This instruction applies a relational operator
(<, =, >=, etc.) to x and y, and executes the statement with label L next if x stands in relation
relop to y. If not, the three-address statement following if x relop y goto L is executed
next, as in the usual sequence.

6. param x and call p, n for procedure calls, and return y, where y representing a returned value is
optional. For example,
param x1
param x2
...
param xn
call p, n
is generated as part of a call of the procedure p(x1, x2, …, xn).

7. Indexed assignments of the form x : = y[i] and x[i] : = y.

8. Address and pointer assignments of the form x : = &y, x : = *y, and *x : = y.

Syntax-Directed Translation into Three-Address Code:

When three-address code is generated, temporary names are made up for the interior
nodes of a syntax tree. For example, id : = E consists of code to evaluate E into some temporary
t, followed by the assignment id.place : = t.

Given input a : = b * - c + b * - c, the three-address code is as shown above. The synthesized
attribute S.code represents the three-address code for the assignment S.
The nonterminal E has two attributes :
1. E.place, the name that will hold the value of E, and
2. E.code, the sequence of three-address statements evaluating E.

Syntax-directed definition to produce three-address code for assignments

PRODUCTION        SEMANTIC RULES

S → id := E       S.code := E.code || gen(id.place ‘:=’ E.place)

E → E1 + E2       E.place := newtemp;
                  E.code := E1.code || E2.code || gen(E.place ‘:=’ E1.place ‘+’ E2.place)

E → E1 * E2       E.place := newtemp;
                  E.code := E1.code || E2.code || gen(E.place ‘:=’ E1.place ‘*’ E2.place)

E → - E1          E.place := newtemp;
                  E.code := E1.code || gen(E.place ‘:=’ ‘uminus’ E1.place)

E → (E1)          E.place := E1.place;
                  E.code := E1.code

E → id            E.place := id.place;
                  E.code := ‘’
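
These rules can be sketched in Python as a recursive walk over a syntax tree. The sketch is illustrative and makes its own assumptions: the tree is encoded as nested tuples, three_address is a made-up name, statements are appended to one shared list instead of concatenating E.code strings, and unary minus is printed as -c rather than uminus c.

```python
import itertools


def three_address(tree):
    """Return the list of three-address statements for an assignment tree."""
    code = []
    counter = itertools.count(1)

    def newtemp():
        return f"t{next(counter)}"          # t1, t2, ... on successive calls

    def place_of(node):
        """Emit the code for node and return E.place."""
        if isinstance(node, str):           # E -> id: no code, place is the name
            return node
        operator, *children = node
        places = [place_of(child) for child in children]
        if operator == "assign":            # S -> id := E
            code.append(f"{places[0]} := {places[1]}")
            return places[0]
        temp = newtemp()
        if operator == "uminus":            # E -> - E1
            code.append(f"{temp} := -{places[0]}")
        else:                               # E -> E1 op E2
            code.append(f"{temp} := {places[0]} {operator} {places[1]}")
        return temp

    place_of(tree)
    return code


tree = ("assign", "a",
        ("+", ("*", "b", ("uminus", "c")),
              ("*", "b", ("uminus", "c"))))
print("\n".join(three_address(tree)))
```

The output is the six-statement sequence for the syntax tree: t1 := -c, t2 := b * t1, t3 := -c, t4 := b * t3, t5 := t2 + t4, a := t5.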
Semantic rules generating code for a while statement

Layout of the generated code:

S.begin:    E.code
            if E.place = 0 goto S.after
            S1.code
            goto S.begin
S.after:    ...

PRODUCTION            SEMANTIC RULES

S → while E do S1     S.begin := newlabel;
                      S.after := newlabel;
                      S.code := gen(S.begin ‘:’) ||
                                E.code ||
                                gen(‘if’ E.place ‘=’ ‘0’ ‘goto’ S.after) ||
                                S1.code ||
                                gen(‘goto’ S.begin) ||
                                gen(S.after ‘:’)

• The function newtemp returns a sequence of distinct names t1, t2, … in response to
successive calls.
• Notation gen(x ‘:=’ y ‘+’ z) is used to represent the three-address statement x := y + z.
Expressions appearing instead of variables like x, y and z are evaluated when passed to
gen, and quoted operators or operands, like ‘+’, are taken literally.
• Flow-of-control statements can be added to the language of assignments. The code for S →
while E do S1 is generated using new attributes S.begin and S.after to mark the first
statement in the code for E and the statement following the code for S, respectively.
• The function newlabel returns a new label every time it is called.
• We assume that a non-zero expression represents true; that is, when the value of E
becomes zero, control leaves the while statement.

Implementation of Three-Address Statements:

A three-address statement is an abstract form of intermediate code. In a compiler,
these statements can be implemented as records with fields for the operator and the operands.
Three such representations are:
• Quadruples

• Triples

• Indirect triples

Quadruples:

• A quadruple is a record structure with four fields, which are op, arg1, arg2 and result.

• The op field contains an internal code for the operator. The three-address statement x : =
y op z is represented by placing y in arg1, z in arg2 and x in result.

• The contents of fields arg1, arg2 and result are normally pointers to the symbol-table
entries for the names represented by these fields. If so, temporary names must be entered
into the symbol table as they are created.

Triples:

• To avoid entering temporary names into the symbol table, we might refer to a temporary
value by the position of the statement that computes it.

• If we do so, three-address statements can be represented by records with only three fields:
op, arg1 and arg2.

• The fields arg1 and arg2, for the arguments of op, are either pointers to the symbol table
or pointers into the triple structure (for temporary values).

• Since three fields are used, this intermediate code format is known as triples.

(a) Quadruples

       op        arg1    arg2    result
(0)    uminus    c               t1
(1)    *         b       t1      t2
(2)    uminus    c               t3
(3)    *         b       t3      t4
(4)    +         t2      t4      t5
(5)    :=        t5              a

(b) Triples

       op        arg1    arg2
(0)    uminus    c
(1)    *         b       (0)
(2)    uminus    c
(3)    *         b       (2)
(4)    +         (1)     (3)
(5)    assign    a       (4)

Quadruple and triple representation of three-address statements given above
A ternary operation like x[i] : = y requires two entries in the triple structure, as shown
below, while x : = y[i] is naturally represented as two operations.

(a) x[i] := y                        (b) x := y[i]

       op        arg1    arg2               op        arg1    arg2
(0)    []=       x       i           (0)    =[]       y       i
(1)    assign    (0)     y           (1)    assign    x       (0)

Indirect Triples:

• Another implementation of three-address code is that of listing pointers to triples, rather than
listing the triples themselves. This implementation is called indirect triples.

• For example, let us use an array statement to list pointers to triples in the desired order. Then the
triples shown above might be represented as follows:

       statement              op        arg1    arg2
(0)    (14)           (14)    uminus    c
(1)    (15)           (15)    *         b       (14)
(2)    (16)           (16)    uminus    c
(3)    (17)           (17)    *         b       (16)
(4)    (18)           (18)    +         (15)    (17)
(5)    (19)           (19)    assign    a       (18)

Indirect triples representation of three-address statements
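
The relationship between the three forms can be sketched in Python. This is an illustrative sketch with assumptions of its own: quadruples are namedtuples, a pointer into the triple structure is written as the pair ("ref", n), and the function name to_triples is made up for the example.

```python
from collections import namedtuple

Quad = namedtuple("Quad", "op arg1 arg2 result")

# Quadruples for a := b * -c + b * -c
QUADS = [
    Quad("uminus", "c", None, "t1"),
    Quad("*", "b", "t1", "t2"),
    Quad("uminus", "c", None, "t3"),
    Quad("*", "b", "t3", "t4"),
    Quad("+", "t2", "t4", "t5"),
    Quad(":=", "t5", None, "a"),
]


def to_triples(quads):
    """Replace each temporary by the position of the triple that computes it."""
    position = {}                     # temporary name -> triple index
    triples = []

    def ref(arg):
        return ("ref", position[arg]) if arg in position else arg

    for quad in quads:
        if quad.op == ":=":
            triples.append(("assign", quad.result, ref(quad.arg1)))
        else:
            triples.append((quad.op, ref(quad.arg1), ref(quad.arg2)))
            position[quad.result] = len(triples) - 1
    return triples


triples = to_triples(QUADS)
# Indirect triples: a separate list of pointers gives the execution order,
# so statements can be reordered without renumbering the triples.
statement = list(range(len(triples)))

for index in statement:
    print(index, triples[index])
```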

DECLARATIONS

As the sequence of declarations in a procedure or block is examined, we can lay out
storage for names local to the procedure. For each local name, we create a symbol-table entry
with information like the type and the relative address of the storage for the name. The relative
address consists of an offset from the base of the static data area or the field for local data in an
activation record.

Declarations in a Procedure:
The syntax of languages such as C, Pascal and Fortran allows all the declarations in a
single procedure to be processed as a group. In this case, a global variable, say offset, can keep track of the
next available relative address.

In the translation scheme shown below:

• Nonterminal P generates a sequence of declarations of the form id : T.

• Before the first declaration is considered, offset is set to 0. As each new name is seen,
that name is entered in the symbol table with offset equal to the current value of offset,
and offset is incremented by the width of the data object denoted by that name.

• The procedure enter(name, type, offset) creates a symbol-table entry for name, gives it
type type and relative address offset in its data area.

• Attribute type represents a type expression constructed from the basic types integer and
real by applying the type constructors pointer and array. If type expressions are
represented by graphs, then attribute type might be a pointer to the node representing a
type expression.

• The width of an array is obtained by multiplying the width of each element by the number
of elements in the array. The width of each pointer is assumed to be 4.

Computing the types and relative addresses of declared names

P → D                        { offset := 0 }

D → D ; D

D → id : T                   { enter(id.name, T.type, offset);
                               offset := offset + T.width }

T → integer                  { T.type := integer;
                               T.width := 4 }

T → real                     { T.type := real;
                               T.width := 8 }

T → array [ num ] of T1      { T.type := array(num.val, T1.type);
                               T.width := num.val × T1.width }

T → ↑ T1                     { T.type := pointer(T1.type);
                               T.width := 4 }
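
The offset bookkeeping can be sketched in Python. The sketch is illustrative only: type expressions are encoded as "integer", "real", ("array", n, element) or ("pointer", target), the symbol table is a plain dictionary, and the names width and lay_out are made up for the example.

```python
def width(type_expr):
    """Width in bytes of a type expression, using the widths assumed above."""
    if type_expr == "integer":
        return 4
    if type_expr == "real":
        return 8
    if type_expr[0] == "array":            # ("array", number of elements, element type)
        return type_expr[1] * width(type_expr[2])
    if type_expr[0] == "pointer":          # ("pointer", target type)
        return 4
    raise ValueError(f"unknown type {type_expr!r}")


def lay_out(declarations):
    """Enter each name with the current offset, then advance the offset."""
    symbol_table = {}
    offset = 0
    for name, type_expr in declarations:
        symbol_table[name] = (type_expr, offset)   # enter(name, type, offset)
        offset += width(type_expr)
    return symbol_table, offset


declarations = [
    ("i", "integer"),
    ("x", "real"),
    ("a", ("array", 10, "integer")),
    ("p", ("pointer", "real")),
]
table, total = lay_out(declarations)
for name, (type_expr, offset) in table.items():
    print(name, offset)
print("total width", total)
```

Here i, x, a and p receive the relative addresses 0, 4, 12 and 52, and the total width of the data area is 56.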
Keeping Track of Scope Information:

When a nested procedure is seen, processing of declarations in the enclosing procedure is
temporarily suspended. This approach will be illustrated by adding semantic rules to the
following language:

P → D

D → D ; D | id : T | proc id ; D ; S

One possible implementation of a symbol table is a linked list of entries for names.

A new symbol table is created when a procedure declaration D → proc id ; D1 ; S is seen,
and entries for the declarations in D1 are created in the new table. The new table points back to
the symbol table of the enclosing procedure; the name represented by id itself is local to the
enclosing procedure. The only change from the treatment of variable declarations is that the
procedure enter is told which symbol table to make an entry in.

For example, consider the symbol tables for procedures readarray, exchange, and
quicksort pointing back to that for the containing procedure sort, consisting of the entire
program. Since partition is declared within quicksort, its table points to that of quicksort.

Symbol tables for nested procedures

[Figure: The table for sort (header link nil) holds entries a, x, readarray, exchange and quicksort; the procedure entries point to the tables for readarray, exchange and quicksort, whose headers point back to sort. The table for quicksort holds k, v and partition; the table for partition holds i and j, and its header points back to quicksort.]
The semantic rules are defined in terms of the following operations:

1. mktable(previous) creates a new symbol table and returns a pointer to the new table. The
argument previous points to a previously created symbol table, presumably that for the
enclosing procedure.

2. enter(table, name, type, offset) creates a new entry for name name in the symbol table pointed
to by table. Again, enter places type type and relative address offset in fields within the entry.

3. addwidth(table, width) records the cumulative width of all the entries in table in the header
associated with this symbol table.

4. enterproc(table, name, newtable) creates a new entry for procedure name in the symbol table
pointed to by table. The argument newtable points to the symbol table for this procedure
name.

Syntax-directed translation scheme for nested procedures

P → M D                      { addwidth(top(tblptr), top(offset));
                               pop(tblptr); pop(offset) }

M → ɛ                        { t := mktable(nil);
                               push(t, tblptr); push(0, offset) }

D → D1 ; D2

D → proc id ; N D1 ; S       { t := top(tblptr);
                               addwidth(t, top(offset));
                               pop(tblptr); pop(offset);
                               enterproc(top(tblptr), id.name, t) }

D → id : T                   { enter(top(tblptr), id.name, T.type, top(offset));
                               top(offset) := top(offset) + T.width }

N → ɛ                        { t := mktable(top(tblptr));
                               push(t, tblptr); push(0, offset) }

• The stack tblptr is used to contain pointers to the tables for sort, quicksort, and partition
when the declarations in partition are considered.

• The top element of stack offset is the next available relative address for a local of the current
procedure.

• All semantic actions in the subtrees for B and C in

A → B C { actionA }

are done before actionA at the end of the production occurs. Hence, the action associated
with the marker M is the first to be done.
• The action for nonterminal M initializes stack tblptr with a symbol table for the outermost
scope, created by operation mktable(nil). The action also pushes relative address 0 onto
stack offset.

• Similarly, the nonterminal N uses the operation mktable(top(tblptr)) to create a new
symbol table. The argument top(tblptr) gives the enclosing scope for the new table.

• For each variable declaration id : T, an entry is created for id in the current symbol table.
The top of stack offset is incremented by T.width.

• When the action on the right side of D → proc id ; N D1 ; S occurs, the width of all
declarations generated by D1 is on the top of stack offset; it is recorded using addwidth. Stacks
tblptr and offset are then popped.
At this point, the name of the enclosed procedure is entered into the symbol table of its enclosing
procedure.

ASSIGNMENT STATEMENTS

Suppose that the context in which an assignment appears is given by the following grammar.

P → M D

M → ɛ

D → D ; D | id : T | proc id ; N D ; S

N → ɛ

Nonterminal P becomes the new start symbol when these productions are added to those in the
translation scheme shown below.

Translation scheme to produce three-address code for assignments

S → id := E       { p := lookup(id.name);
                    if p ≠ nil then
                        emit(p ‘:=’ E.place)
                    else error }

E → E1 + E2       { E.place := newtemp;
                    emit(E.place ‘:=’ E1.place ‘+’ E2.place) }

E → E1 * E2       { E.place := newtemp;
                    emit(E.place ‘:=’ E1.place ‘*’ E2.place) }

E → - E1          { E.place := newtemp;
                    emit(E.place ‘:=’ ‘uminus’ E1.place) }

E → (E1)          { E.place := E1.place }

E → id            { p := lookup(id.name);
                    if p ≠ nil then
                        E.place := p
                    else error }

Reusing Temporary Names

• The temporaries used to hold intermediate values in expression calculations tend to
clutter up the symbol table, and space has to be allocated to hold their values.

• Temporaries can be reused by changing newtemp. The code generated by the rules for E →
E1 + E2 has the general form:

evaluate E1 into t1
evaluate E2 into t2
t := t1 + t2

• The lifetimes of these temporaries are nested like matching pairs of balanced parentheses.

• Keep a count c, initialized to zero. Whenever a temporary name is used as an operand,
decrement c by 1. Whenever a new temporary name is generated, use $c and increase c
by 1.

• For example, consider the assignment x := a * b + c * d – e * f

Three-address code with stack temporaries

statement            value of c

                     0
$0 := a * b          1
$1 := c * d          2
$0 := $0 + $1        1
$1 := e * f          2
$0 := $0 - $1        1
x := $0              0
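
The counting scheme can be sketched in Python. This is an illustrative sketch under its own assumptions: expressions are nested tuples (operator, left, right) with strings for names, and the function name gen_with_stack_temps is made up for the example.

```python
def gen_with_stack_temps(target, expression):
    """Generate three-address code that reuses temporaries $0, $1, ..."""
    code = []
    count = 0                              # the counter c from the text

    def visit(node):
        nonlocal count
        if isinstance(node, str):
            return node
        operator, left, right = node
        left_place = visit(left)
        right_place = visit(right)
        # Every temporary used as an operand gives its slot back.
        count -= sum(1 for place in (left_place, right_place)
                     if place.startswith("$"))
        temp = f"${count}"                 # a new temporary uses $c ...
        count += 1                         # ... and c is increased by 1
        code.append(f"{temp} := {left_place} {operator} {right_place}")
        return temp

    result = visit(expression)
    if result.startswith("$"):
        count -= 1                         # the final assignment uses the result
    code.append(f"{target} := {result}")
    return code


expression = ("-", ("+", ("*", "a", "b"), ("*", "c", "d")), ("*", "e", "f"))
print("\n".join(gen_with_stack_temps("x", expression)))
```

The generated statements match the table above: only $0 and $1 are ever needed for this assignment.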

Addressing Array Elements:

Elements of an array can be accessed quickly if the elements are stored in a block of
consecutive locations. If the width of each array element is w, then the ith element of array A
begins in location

base + (i – low) × w

where low is the lower bound on the subscript and base is the relative address of the storage allocated for the array.
That is, base is the relative address of A[low].

The expression can be partially evaluated at compile time if it is rewritten as

i × w + (base – low × w)

The subexpression c = base – low × w can be evaluated when the declaration of the array is seen.
We assume that c is saved in the symbol table entry for A, so the relative address of A[i] is
obtained by simply adding i × w to c.

Address calculation of multi-dimensional arrays:

A two-dimensional array is stored in one of two forms:

• Row-major (row-by-row)

• Column-major (column-by-column)

Layouts for a 2 × 3 array

(a) ROW-MAJOR                    (b) COLUMN-MAJOR

A[1,1]    first row              A[1,1]    first column
A[1,2]                           A[2,1]
A[1,3]                           A[1,2]    second column
A[2,1]    second row             A[2,2]
A[2,2]                           A[1,3]    third column
A[2,3]                           A[2,3]

In the case of row-major form, the relative address of A[i1, i2] can be calculated by the formula

base + ((i1 – low1) × n2 + i2 – low2) × w

where low1 and low2 are the lower bounds on the values of i1 and i2, and n2 is the number of values
that i2 can take. That is, if high2 is the upper bound on the value of i2, then n2 = high2 – low2 + 1.

Assuming that i1 and i2 are the only values that are not known at compile time, we can rewrite the
above expression as

((i1 × n2) + i2) × w + (base – ((low1 × n2) + low2) × w)

Generalized formula:

The expression generalizes to the following expression for the relative address of A[i1, i2, …, ik]

((...((i1 n2 + i2) n3 + i3)...) nk + ik) × w + base – ((...((low1 n2 + low2) n3 + low3)...) nk + lowk) × w

where, for all j, nj = highj – lowj + 1.
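
The split into a compile-time constant and a run-time part can be sketched in Python. The sketch is illustrative only: the function names constant_part and address, and the sample values base = 100 and w = 4, are assumptions made for the example.

```python
def constant_part(base, lows, highs, w):
    """c = base - ((...(low1*n2 + low2)*n3 + ...)*nk + lowk) * w, known at compile time."""
    accumulated = 0
    for low, high, bound in zip(lows, highs, lows):
        n = high - low + 1                 # nj = highj - lowj + 1
        accumulated = accumulated * n + bound
    return base - accumulated * w


def address(indices, base, lows, highs, w):
    """Relative address of A[i1, ..., ik] in row-major order."""
    accumulated = 0
    for index, low, high in zip(indices, lows, highs):
        n = high - low + 1
        accumulated = accumulated * n + index   # (...(i1*n2 + i2)*n3 + ...)
    return accumulated * w + constant_part(base, lows, highs, w)


# A 2 x 3 array with subscripts 1..2 and 1..3, element width 4, base 100.
lows, highs = (1, 1), (2, 3)
for i1 in (1, 2):
    for i2 in (1, 2, 3):
        print((i1, i2), address((i1, i2), 100, lows, highs, 4))
```

With these values A[1,1] is at 100 and each following element in row-major order is 4 bytes further on, ending with A[2,3] at 120.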
The Translation Scheme for Addressing Array Elements:

Semantic actions will be added to the grammar :

(1) S → L := E
(2) E → E + E
(3) E → (E)
(4) E → L
(5) L → Elist ]
(6) L → id
(7) Elist → Elist , E
(8) Elist → id [ E

We generate a normal assignment if L is a simple name, and an indexed assignment into the location
denoted by L otherwise :

(1) S → L := E        { if L.offset = null then /* L is a simple id */
                            emit(L.place ‘:=’ E.place);
                        else
                            emit(L.place ‘[’ L.offset ‘]’ ‘:=’ E.place) }

(2) E → E1 + E2       { E.place := newtemp;
                        emit(E.place ‘:=’ E1.place ‘+’ E2.place) }

(3) E → (E1)          { E.place := E1.place }

When an array reference L is reduced to E, we want the r-value of L. Therefore we use indexing
to obtain the contents of the location L.place [ L.offset ] :

(4) E → L             { if L.offset = null then /* L is a simple id */
                            E.place := L.place
                        else begin
                            E.place := newtemp;
                            emit(E.place ‘:=’ L.place ‘[’ L.offset ‘]’)
                        end }

(5) L → Elist ]       { L.place := newtemp;
                        L.offset := newtemp;
                        emit(L.place ‘:=’ c(Elist.array));
                        emit(L.offset ‘:=’ Elist.place ‘*’ width(Elist.array)) }

(6) L → id            { L.place := id.place;
                        L.offset := null }

(7) Elist → Elist1 , E   { t := newtemp;
                           m := Elist1.ndim + 1;
                           emit(t ‘:=’ Elist1.place ‘*’ limit(Elist1.array, m));
                           emit(t ‘:=’ t ‘+’ E.place);
                           Elist.array := Elist1.array;
                           Elist.place := t;
                           Elist.ndim := m }

(8) Elist → id [ E       { Elist.array := id.place;
                           Elist.place := E.place;
                           Elist.ndim := 1 }

Type conversion within Assignments:

Consider the grammar for assignment statements as above, but suppose there are two
types, real and integer, with integers converted to reals when necessary. We have another
attribute E.type, whose value is either real or integer. The semantic rule for E.type associated
with the production E → E + E is:

E → E + E    { E.type :=
                   if E1.type = integer and
                      E2.type = integer then integer
                   else real }

The entire semantic rule for E → E + E and most of the other productions must be
modified to generate, when necessary, three-address statements of the form x := inttoreal y,
whose effect is to convert integer y to a real of equal value, called x.

Semantic action for E → E1 + E2:

E.place := newtemp;
if E1.type = integer and E2.type = integer then begin
    emit(E.place ‘:=’ E1.place ‘int+’ E2.place);
    E.type := integer
end
else if E1.type = real and E2.type = real then begin
    emit(E.place ‘:=’ E1.place ‘real+’ E2.place);
    E.type := real
end
else if E1.type = integer and E2.type = real then begin
    u := newtemp;
    emit(u ‘:=’ ‘inttoreal’ E1.place);
    emit(E.place ‘:=’ u ‘real+’ E2.place);
    E.type := real
end
else if E1.type = real and E2.type = integer then begin
    u := newtemp;
    emit(u ‘:=’ ‘inttoreal’ E2.place);
    emit(E.place ‘:=’ E1.place ‘real+’ u);
    E.type := real
end
else
    E.type := type_error;

For example, for the input x := y + i * j,
assuming x and y have type real, and i and j have type integer, the output would look like

t1 := i int* j
t3 := inttoreal t1
t2 := y real+ t3
x := t2
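The semantic action above can be sketched in Python as follows. This is a minimal illustration, not the notes' own implementation: make_translator and binary are hypothetical names, operands are modelled as (place, type) pairs, and the same rule is assumed to apply to * as well as +, which is what the example output suggests.

```python
import itertools


def make_translator():
    """Returns (code, binary): the emitted statements and the translation function."""
    code = []
    counter = itertools.count(1)

    def newtemp():
        return f"t{next(counter)}"

    def binary(op, left, right):
        """left and right are (place, type) pairs; returns the pair for the result."""
        (lplace, ltype), (rplace, rtype) = left, right
        if ltype not in ("integer", "real") or rtype not in ("integer", "real"):
            return (None, "type_error")
        place = newtemp()                       # E.place := newtemp
        if ltype == rtype:
            prefix = "int" if ltype == "integer" else "real"
            code.append(f"{place} := {lplace} {prefix}{op} {rplace}")
            return (place, ltype)
        u = newtemp()                           # temporary for the converted operand
        if ltype == "integer":
            code.append(f"{u} := inttoreal {lplace}")
            lplace = u
        else:
            code.append(f"{u} := inttoreal {rplace}")
            rplace = u
        code.append(f"{place} := {lplace} real{op} {rplace}")
        return (place, "real")

    return code, binary


# x := y + i * j with x, y real and i, j integer
code, binary = make_translator()
product = binary("*", ("i", "integer"), ("j", "integer"))
total = binary("+", ("y", "real"), product)
code.append(f"x := {total[0]}")
```

Because the result temporary is allocated before the conversion temporary, the sketch reproduces the numbering t1, t3, t2 seen in the example.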

BOOLEAN EXPRESSIONS

Boolean expressions have two primary purposes. They are used to compute logical
values, but more often they are used as conditional expressions in statements that alter the flow
of control, such as if-then-else or while-do statements.

Boolean expressions are composed of the boolean operators (and, or, and not) applied
to elements that are boolean variables or relational expressions. Relational expressions are of the
form E1 relop E2, where E1 and E2 are arithmetic expressions.

Here we consider boolean expressions generated by the following grammar:

E → E or E | E and E | not E | ( E ) | id relop id | true | false

Methods of Translating Boolean Expressions:

There are two principal methods of representing the value of a boolean expression. They are:

• To encode true and false numerically and to evaluate a boolean expression analogously to
  an arithmetic expression. Often, 1 is used to denote true and 0 to denote false.

• To implement boolean expressions by flow of control, that is, representing the value of a
  boolean expression by a position reached in a program. This method is particularly
  convenient in implementing the boolean expressions in flow-of-control statements, such
  as the if-then and while-do statements.

Numerical Representation

Here, 1 denotes true and 0 denotes false. Expressions will be evaluated completely from
left to right, in a manner similar to arithmetic expressions.

For example:

• The translation for
      a or b and not c
  is the three-address sequence
      t1 := not c
      t2 := b and t1
      t3 := a or t2

• A relational expression such as a < b is equivalent to the conditional statement
      if a < b then 1 else 0
  which can be translated into the three-address code sequence (again, we arbitrarily start
  statement numbers at 100):

      100 : if a < b goto 103
      101 : t := 0
      102 : goto 104
      103 : t := 1
      104 :

Translation scheme using a numerical representation for booleans

E → E1 or E2        { E.place := newtemp;
                      emit(E.place ‘:=’ E1.place ‘or’ E2.place) }
E → E1 and E2       { E.place := newtemp;
                      emit(E.place ‘:=’ E1.place ‘and’ E2.place) }
E → not E1          { E.place := newtemp;
                      emit(E.place ‘:=’ ‘not’ E1.place) }
E → ( E1 )          { E.place := E1.place }
E → id1 relop id2   { E.place := newtemp;
                      emit(‘if’ id1.place relop.op id2.place ‘goto’ nextstat + 3);
                      emit(E.place ‘:=’ ‘0’);
                      emit(‘goto’ nextstat + 2);
                      emit(E.place ‘:=’ ‘1’) }
E → true            { E.place := newtemp;
                      emit(E.place ‘:=’ ‘1’) }
E → false           { E.place := newtemp;
                      emit(E.place ‘:=’ ‘0’) }

Short-Circuit Code:

We can also translate a boolean expression into three-address code without generating
code for any of the boolean operators and without having the code necessarily evaluate the entire
expression. This style of evaluation is sometimes called “short-circuit” or “jumping” code. It is
possible because the value of an expression can be represented by a position in the code sequence
instead of by a temporary.

Translation of a < b or c < d and e < f (using the numerical representation scheme above)

100 : if a < b goto 103
101 : t1 := 0
102 : goto 104
103 : t1 := 1
104 : if c < d goto 107
105 : t2 := 0
106 : goto 108
107 : t2 := 1
108 : if e < f goto 111
109 : t3 := 0
110 : goto 112
111 : t3 := 1
112 : t4 := t2 and t3
113 : t5 := t1 or t4


Flow-of-Control Statements

We now consider the translation of boolean expressions into three-address code in the
context of if-then, if-then-else, and while-do statements such as those generated by the following
grammar:

S → if E then S1
  | if E then S1 else S2
  | while E do S1

In each of these productions, E is the boolean expression to be translated. In the translation, we
assume that a three-address statement can be symbolically labeled, and that the function newlabel
returns a new symbolic label each time it is called.

• E.true is the label to which control flows if E is true, and E.false is the label to which
  control flows if E is false.

• The semantic rules for translating a flow-of-control statement S allow control to flow
  from the translation S.code to the three-address instruction immediately following S.code.

• S.next is a label that is attached to the first three-address instruction to be executed after
  the code for S.

Code for if-then, if-then-else, and while-do statements

(a) if-then

             E.code          (jumps to E.true or to E.false)
   E.true:   S1.code
   E.false:  ...

(b) if-then-else

             E.code          (jumps to E.true or to E.false)
   E.true:   S1.code
             goto S.next
   E.false:  S2.code
   S.next:   ...

(c) while-do

   S.begin:  E.code          (jumps to E.true or to E.false)
   E.true:   S1.code
             goto S.begin
   E.false:  ...
Syntax-directed definition for flow-of-control statements

PRODUCTION                     SEMANTIC RULES

S → if E then S1               E.true := newlabel;
                               E.false := S.next;
                               S1.next := S.next;
                               S.code := E.code || gen(E.true ‘:’) || S1.code

S → if E then S1 else S2       E.true := newlabel;
                               E.false := newlabel;
                               S1.next := S.next;
                               S2.next := S.next;
                               S.code := E.code || gen(E.true ‘:’) || S1.code ||
                                         gen(‘goto’ S.next) ||
                                         gen(E.false ‘:’) || S2.code

S → while E do S1              S.begin := newlabel;
                               E.true := newlabel;
                               E.false := S.next;
                               S1.next := S.begin;
                               S.code := gen(S.begin ‘:’) || E.code ||
                                         gen(E.true ‘:’) || S1.code ||
                                         gen(‘goto’ S.begin)

Control-Flow Translation of Boolean Expressions:

Syntax-directed definition to produce three-address code for booleans

PRODUCTION               SEMANTIC RULES

E → E1 or E2             E1.true := E.true;
                         E1.false := newlabel;
                         E2.true := E.true;
                         E2.false := E.false;
                         E.code := E1.code || gen(E1.false ‘:’) || E2.code

E → E1 and E2            E1.true := newlabel;
                         E1.false := E.false;
                         E2.true := E.true;
                         E2.false := E.false;
                         E.code := E1.code || gen(E1.true ‘:’) || E2.code

E → not E1               E1.true := E.false;
                         E1.false := E.true;
                         E.code := E1.code

E → ( E1 )               E1.true := E.true;
                         E1.false := E.false;
                         E.code := E1.code

E → id1 relop id2        E.code := gen(‘if’ id1.place relop.op id2.place
                                   ‘goto’ E.true) || gen(‘goto’ E.false)

E → true                 E.code := gen(‘goto’ E.true)

E → false                E.code := gen(‘goto’ E.false)
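The two syntax-directed definitions above can be tried out with a small Python sketch. It is only an illustration under stated assumptions: expressions are nested tuples such as ("relop", "<", "a", "b"), the class JumpCode and its method names are hypothetical, and the inherited attributes E.true and E.false are passed as function arguments.

```python
import itertools


class JumpCode:
    """Generates jumping code; labels are numbered L1, L2, ... per instance."""

    def __init__(self):
        self._n = itertools.count(1)

    def newlabel(self):
        return f"L{next(self._n)}"

    def boolean(self, e, true, false):
        """E.code for expression e, given the labels E.true and E.false."""
        kind = e[0]
        if kind == "relop":                      # ("relop", op, id1, id2)
            _, op, a, b = e
            return [f"if {a} {op} {b} goto {true}", f"goto {false}"]
        if kind == "or":
            e1_false = self.newlabel()           # E1.false := newlabel
            return (self.boolean(e[1], true, e1_false)
                    + [f"{e1_false}:"]
                    + self.boolean(e[2], true, false))
        if kind == "and":
            e1_true = self.newlabel()            # E1.true := newlabel
            return (self.boolean(e[1], e1_true, false)
                    + [f"{e1_true}:"]
                    + self.boolean(e[2], true, false))
        if kind == "not":
            return self.boolean(e[1], false, true)
        if kind == "true":
            return [f"goto {true}"]
        if kind == "false":
            return [f"goto {false}"]
        raise ValueError(f"unknown expression kind: {kind}")

    def while_do(self, e, body, next_label):
        """S.code for 'while E do S1'; body is the already generated S1.code."""
        begin = self.newlabel()                  # S.begin
        e_true = self.newlabel()                 # E.true
        return ([f"{begin}:"] + self.boolean(e, e_true, next_label)
                + [f"{e_true}:"] + body + [f"goto {begin}"])


# a < b or c < d and e < f
example = ("or", ("relop", "<", "a", "b"),
           ("and", ("relop", "<", "c", "d"), ("relop", "<", "e", "f")))
code = JumpCode().boolean(example, "Ltrue", "Lfalse")
```

No code is generated for or, and, or not themselves; they only decide which labels the relational tests jump to.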

CASE STATEMENTS

The “switch” or “case” statement is available in a variety of languages. The switch-statement
syntax is as shown below:

Switch-statement syntax

switch expression
begin
    case value : statement
    case value : statement
    ...
    case value : statement
    default : statement
end

There is a selector expression, which is to be evaluated, followed by n constant values
that the expression might take, including a default “value” which always matches the expression
if no other value does. The intended translation of a switch is code to:

1. Evaluate the expression.
2. Find which value in the list of cases is the same as the value of the expression.
3. Execute the statement associated with the value found.

Step (2) can be implemented in one of several ways:

• By a sequence of conditional goto statements, if the number of cases is small.
• By creating a table of pairs, with each pair consisting of a value and a label for the code
  of the corresponding statement. The compiler generates a loop to compare the value of the
  expression with each value in the table. If no match is found, the default (last) entry is
  sure to match.
• If the number of cases is large, it is efficient to construct a hash table.
• There is a common special case in which an efficient implementation of the n-way branch exists.
  If the values all lie in some small range, say imin to imax, and the number of different values
  is a reasonable fraction of imax - imin, then we can construct an array of labels, with the
  label of the statement for value j in the entry of the table with offset j - imin and the label
  for the default in entries not filled otherwise. To perform the switch, evaluate the expression
  to obtain the value of j, check that the value is within range, and transfer to the table
  entry at offset j - imin.
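The array-of-labels special case in the last bullet can be sketched as follows. The helper names build_jump_table and dispatch, and the use of a dictionary from case values to labels, are assumptions made for this illustration.

```python
def build_jump_table(cases, default_label):
    """cases maps each case constant to the label of its statement.

    Returns (imin, table) where table[j - imin] is the label for value j.
    """
    imin, imax = min(cases), max(cases)
    table = [default_label] * (imax - imin + 1)   # unfilled entries go to default
    for value, label in cases.items():
        table[value - imin] = label
    return imin, table


def dispatch(j, imin, table, default_label):
    """Range check followed by an indexed jump, as in the last step above."""
    offset = j - imin
    if 0 <= offset < len(table):
        return table[offset]
    return default_label


imin, table = build_jump_table({3: "L1", 4: "L2", 6: "L3"}, "Ldefault")
target = dispatch(6, imin, table, "Ldefault")
```

The value 5 lies inside the range but has no case of its own, so its table entry holds the default label.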

Syntax-Directed Translation of Case Statements:

Consider the following switch statement:

switch E
begin
    case V1 : S1
    case V2 : S2
    ...
    case Vn-1 : Sn-1
    default : Sn
end

This case statement is translated into intermediate code that has the following form:

Translation of a case statement

        code to evaluate E into t
        goto test
L1:     code for S1
        goto next
L2:     code for S2
        goto next
        ...
Ln-1:   code for Sn-1
        goto next
Ln:     code for Sn
        goto next
test:   if t = V1 goto L1
        if t = V2 goto L2
        ...
        if t = Vn-1 goto Ln-1
        goto Ln
next:

To translate into the above form:

• When the keyword switch is seen, two new labels test and next, and a new temporary t are
  generated.

• As expression E is parsed, the code to evaluate E into t is generated. After processing E,
  the jump goto test is generated.

• As each case keyword occurs, a new label Li is created and entered into the symbol table. A
  pointer to this symbol-table entry and the value Vi of the case constant are placed on a stack
  (used only to store cases).

• Each statement case Vi : Si is processed by emitting the newly created label Li, followed
  by the code for Si, followed by the jump goto next.

• Then when the keyword end terminating the body of the switch is found, the code can be
  generated for the n-way branch. Reading the pointer-value pairs on the case stack from
  the bottom to the top, we can generate a sequence of three-address statements of the form

      case V1 L1
      case V2 L2
      ...
      case Vn-1 Ln-1
      case t Ln
      label next

  where t is the name holding the value of the selector expression E, and Ln is the label for
  the default statement.
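The steps above can be sketched in Python. The sketch emits the explicit "if t = Vi goto Li" form of the test sequence shown in the translation layout, not the compact "case Vi Li" statements; translate_switch and its argument layout are hypothetical, and the selector code and statement bodies are passed in as ready-made lists of three-address strings.

```python
def translate_switch(selector_code, cases, default_body):
    """cases is a list of (value, body) pairs; body is a list of statements."""
    code = list(selector_code) + ["goto test"]
    case_stack = []                              # (value, label) pairs, bottom first
    for k, (value, body) in enumerate(cases, start=1):
        label = f"L{k}"                          # new label for this case
        case_stack.append((value, label))
        code += [f"{label}:"] + list(body) + ["goto next"]
    default_label = f"L{len(cases) + 1}"
    code += [f"{default_label}:"] + list(default_body) + ["goto next"]
    code.append("test:")
    for value, label in case_stack:              # read the stack bottom to top
        code.append(f"if t = {value} goto {label}")
    code += [f"goto {default_label}", "next:"]
    return code


switch_code = translate_switch(
    ["t := x + 1"],
    [(1, ["a := 0"]), (2, ["a := 5"])],
    ["a := 9"],
)
```

Placing the tests after the statement bodies means every label is already known when the branch sequence is generated.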

BACKPATCHING

The easiest way to implement the syntax-directed definitions for boolean expressions is to
use two passes. First, construct a syntax tree for the input, and then walk the tree in depth-first
order, computing the translations. The main problem with generating code for boolean
expressions and flow-of-control statements in a single pass is that during one single pass we may
not know the labels that control must go to at the time the jump statements are generated. Hence,
a series of branching statements with the targets of the jumps left unspecified is generated. Each
statement will be put on a list of goto statements whose labels will be filled in when the proper
label can be determined. We call this subsequent filling in of labels backpatching.

To manipulate lists of labels, we use three functions:

1. makelist(i) creates a new list containing only i, an index into the array of quadruples;
   makelist returns a pointer to the list it has made.
2. merge(p1, p2) concatenates the lists pointed to by p1 and p2, and returns a pointer to the
   concatenated list.
3. backpatch(p, i) inserts i as the target label for each of the statements on the list pointed to
   by p.
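A minimal Python sketch of the three functions is given below. It assumes a particular representation that the notes do not prescribe: the quadruple array is a Python list whose entries are [instruction, target] pairs, a target of None marks an unfilled jump, and plain lists stand in for the pointer-linked lists.

```python
quads = []          # quadruple array; each entry is [instruction, target]


def emit(instruction, target=None):
    """Appends a quadruple and returns its index."""
    quads.append([instruction, target])
    return len(quads) - 1


def makelist(i):
    """A new list containing only the quadruple index i."""
    return [i]


def merge(p1, p2):
    """Concatenation of the two lists."""
    return p1 + p2


def backpatch(p, target):
    """Fills in target as the jump destination of every quadruple on list p."""
    for i in p:
        quads[i][1] = target


first = makelist(emit("if a < b goto"))
second = makelist(emit("goto"))
pending = merge(first, second)
backpatch(pending, 7)
```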

Boolean Expressions:

We now construct a translation scheme suitable for producing quadruples for boolean
expressions during bottom-up parsing. The grammar we use is the following:

(1) E → E1 or M E2
(2)   | E1 and M E2
(3)   | not E1
(4)   | ( E1 )
(5)   | id1 relop id2
(6)   | true
(7)   | false
(8) M → ɛ

Synthesized attributes truelist and falselist of nonterminal E are used to generate jumping code
for boolean expressions. Incomplete jumps with unfilled labels are placed on lists pointed to by
E.truelist and E.falselist.

Consider the production E → E1 and M E2. If E1 is false, then E is also false, so the statements on
E1.falselist become part of E.falselist. If E1 is true, then we must next test E2, so the target for
the statements on E1.truelist must be the beginning of the code generated for E2. This target is
obtained using the marker nonterminal M.

Attribute M.quad records the number of the first statement of E2.code. With the production
M → ɛ we associate the semantic action

{ M.quad := nextquad }

The variable nextquad holds the index of the next quadruple to follow. This value will be
backpatched onto E1.truelist when we have seen the remainder of the production E → E1 and M E2.
The translation scheme is as follows:

(1) E → E1 or M E2       { backpatch(E1.falselist, M.quad);
                           E.truelist := merge(E1.truelist, E2.truelist);
                           E.falselist := E2.falselist }

(2) E → E1 and M E2      { backpatch(E1.truelist, M.quad);
                           E.truelist := E2.truelist;
                           E.falselist := merge(E1.falselist, E2.falselist) }

(3) E → not E1           { E.truelist := E1.falselist;
                           E.falselist := E1.truelist }

(4) E → ( E1 )           { E.truelist := E1.truelist;
                           E.falselist := E1.falselist }

(5) E → id1 relop id2    { E.truelist := makelist(nextquad);
                           E.falselist := makelist(nextquad + 1);
                           emit(‘if’ id1.place relop.op id2.place ‘goto _’);
                           emit(‘goto _’) }

(6) E → true             { E.truelist := makelist(nextquad);
                           emit(‘goto _’) }

(7) E → false            { E.falselist := makelist(nextquad);
                           emit(‘goto _’) }

(8) M → ɛ                { M.quad := nextquad }
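The scheme can be exercised with a short Python sketch. It is a simplified illustration under stated assumptions: instead of a bottom-up parser, a recursive walk over a tuple-encoded expression performs the actions in the same order as the reductions would; the class Translator is hypothetical; quadruples are [instruction, target] pairs with None for an unfilled target; and numbering starts at 100.

```python
class Translator:
    def __init__(self, first_quad=100):
        self.first = first_quad
        self.quads = []                    # entries are [instruction, target]

    @property
    def nextquad(self):
        return self.first + len(self.quads)

    def emit(self, instruction):
        self.quads.append([instruction, None])

    def backpatch(self, lst, target):
        for i in lst:
            self.quads[i - self.first][1] = target

    def expr(self, e):
        """Returns (truelist, falselist) for expression e."""
        kind = e[0]
        if kind == "relop":                # rule (5)
            _, op, a, b = e
            truelist, falselist = [self.nextquad], [self.nextquad + 1]
            self.emit(f"if {a} {op} {b} goto")
            self.emit("goto")
            return truelist, falselist
        if kind in ("or", "and"):          # rules (1) and (2)
            t1, f1 = self.expr(e[1])
            m_quad = self.nextquad         # rule (8): M.quad := nextquad
            t2, f2 = self.expr(e[2])
            if kind == "or":
                self.backpatch(f1, m_quad)
                return t1 + t2, f2
            self.backpatch(t1, m_quad)
            return t2, f1 + f2
        if kind == "not":                  # rule (3)
            t1, f1 = self.expr(e[1])
            return f1, t1
        if kind in ("true", "false"):      # rules (6) and (7)
            jump = [self.nextquad]
            self.emit("goto")
            return (jump, []) if kind == "true" else ([], jump)
        raise ValueError(f"unknown expression kind: {kind}")


# a < b or c < d and e < f
tr = Translator()
truelist, falselist = tr.expr(
    ("or", ("relop", "<", "a", "b"),
     ("and", ("relop", "<", "c", "d"), ("relop", "<", "e", "f"))))
```

After the walk, only the jumps on truelist and falselist remain unfilled; they are backpatched later, once the enclosing statement knows where true and false exits should go.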


Flow-of-Control Statements:

A translation scheme is developed for statements generated by the following grammar:

(1) S → if E then S
(2)   | if E then S else S
(3)   | while E do S
(4)   | begin L end
(5)   | A
(6) L → L ; S
(7)   | S

Here S denotes a statement, L a statement list, A an assignment statement, and E a boolean
expression. We make the tacit assumption that the code that follows a given statement in
execution also follows it physically in the quadruple array. Otherwise, an explicit jump must be
provided.

Scheme to implement the Translation:

The nonterminal E has two attributes, E.truelist and E.falselist. L and S also need a list of
unfilled quadruples that must eventually be completed by backpatching. These lists are pointed to
by the attributes L.nextlist and S.nextlist. S.nextlist is a pointer to a list of all conditional and
unconditional jumps to the quadruple following the statement S in execution order, and L.nextlist
is defined similarly.

The semantic rules for the revised grammar are as follows:

(1) S → if E then M1 S1 N else M2 S2
        { backpatch(E.truelist, M1.quad);
          backpatch(E.falselist, M2.quad);
          S.nextlist := merge(S1.nextlist, merge(N.nextlist, S2.nextlist)) }

We backpatch the jumps when E is true to the quadruple M1.quad, which is the beginning of the
code for S1. Similarly, we backpatch jumps when E is false to go to the beginning of the code for
S2. The list S.nextlist includes all jumps out of S1 and S2, as well as the jump generated by N.

(2) N → ɛ                       { N.nextlist := makelist(nextquad);
                                  emit(‘goto _’) }

(3) M → ɛ                       { M.quad := nextquad }

(4) S → if E then M S1          { backpatch(E.truelist, M.quad);
                                  S.nextlist := merge(E.falselist, S1.nextlist) }

(5) S → while M1 E do M2 S1     { backpatch(S1.nextlist, M1.quad);
                                  backpatch(E.truelist, M2.quad);
                                  S.nextlist := E.falselist;
                                  emit(‘goto’ M1.quad) }

(6) S → begin L end             { S.nextlist := L.nextlist }

(7) S → A                       { S.nextlist := nil }

The assignment S.nextlist := nil initializes S.nextlist to an empty list.

(8) L → L1 ; M S                { backpatch(L1.nextlist, M.quad);
                                  L.nextlist := S.nextlist }

The statement following L1 in order of execution is the beginning of S. Thus the L1.nextlist list
is backpatched to the beginning of the code for S, which is given by M.quad.

(9) L → S                       { L.nextlist := S.nextlist }

PROCEDURE CALLS

The procedure is such an important and frequently used programming construct that it is
imperative for a compiler to generate good code for procedure calls and returns. The run-time
routines that handle procedure argument passing, calls and returns are part of the run-time
support package.

Let us consider a grammar for a simple procedure call statement:

(1) S → call id ( Elist )
(2) Elist → Elist , E
(3) Elist → E

Calling Sequences:

The translation for a call includes a calling sequence, a sequence of actions taken on entry
to and exit from each procedure. The following are the actions that take place in a calling sequence:

• When a procedure call occurs, space must be allocated for the activation record of the called
  procedure.

• The arguments of the called procedure must be evaluated and made available to the called
  procedure in a known place.

• Environment pointers must be established to enable the called procedure to access data in
  enclosing blocks.

• The state of the calling procedure must be saved so it can resume execution after the call.

• Also saved in a known place is the return address, the location to which the called routine must
  transfer after it is finished.

• Finally, a jump to the beginning of the code for the called procedure must be generated.

For example, consider the following syntax-directed translation:

(1) S → call id ( Elist )
        { for each item p on queue do
              emit(‘param’ p);
          emit(‘call’ id.place) }

(2) Elist → Elist , E
        { append E.place to the end of queue }

(3) Elist → E
        { initialize queue to contain only E.place }

• Here, the code for S is the code for Elist, which evaluates the arguments, followed by a
  param p statement for each argument, followed by a call statement.

• In rule (3), queue is emptied and then gets a single pointer to the symbol-table location for
  the name that denotes the value of E.
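The effect of the three rules can be sketched in a few lines of Python. The function translate_call is a hypothetical helper, and the argument places are assumed to have been computed already by the code for Elist.

```python
def translate_call(name, arg_places):
    """Emits one param statement per argument, then the call."""
    queue = []
    for k, place in enumerate(arg_places):
        if k == 0:
            queue = [place]            # rule (3): queue holds only E.place
        else:
            queue.append(place)        # rule (2): append E.place to the queue
    code = [f"param {p}" for p in queue]
    code.append(f"call {name}")        # rule (1)
    return code


call_code = translate_call("f", ["t1", "x", "t2"])
```

Collecting the places in a queue keeps all param statements together, after the code that evaluates every argument.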
MODULE-4 CODE GENERATION

The final phase in the compiler model is the code generator. It takes as input an intermediate
representation of the source program and produces as output an equivalent target program. The
code generation techniques presented below can be used whether or not an optimizing phase
occurs before code generation.

Position of code generator

[Figure: source program → front end → intermediate code → code optimizer → intermediate code →
code generator → target program, with the symbol table drawn beneath the phases]

ISSUES IN THE DESIGN OF A CODE GENERATOR

The following issues arise during the code generation phase:

1. Input to code generator
2. Target program
3. Memory management
4. Instruction selection
5. Register allocation
6. Evaluation order

1. Input to code generator:
• The input to the code generator consists of the intermediate representation of the source
  program produced by the front end, together with information in the symbol table that is used to
  determine run-time addresses of the data objects denoted by the names in the intermediate
  representation.

• Intermediate representation can be:
  a. Linear representation such as postfix notation
  b. Three-address representation such as quadruples
  c. Virtual machine representation such as stack machine code
  d. Graphical representations such as syntax trees and dags.

• Prior to code generation, the source program must have been scanned, parsed and translated
  into intermediate representation by the front end, along with the necessary type checking.
  Therefore, the input to code generation is assumed to be error-free.

2. Target program:
• The output of the code generator is the target program. The output may be:
  a. Absolute machine language
     - It can be placed in a fixed memory location and can be executed immediately.
  b. Relocatable machine language
     - It allows subprograms to be compiled separately.
  c. Assembly language
     - Code generation is made easier.

3. Memory management:
• Names in the source program are mapped to addresses of data objects in run-time memory by
  the front end and code generator.

• It makes use of the symbol table, that is, a name in a three-address statement refers to a
  symbol-table entry for the name.

• Labels in three-address statements have to be converted to addresses of instructions. For
  example, j : goto i generates a jump instruction as follows:
  - If i < j, a backward jump instruction with target address equal to the location of the code
    for quadruple i is generated.
  - If i > j, the jump is forward. We must store on a list for quadruple i the
    location of the first machine instruction generated for quadruple j. When i is
    processed, the machine locations for all instructions that are forward jumps to i
    are filled in.

4. Instruction selection:
• The instructions of the target machine should be complete and uniform.

• Instruction speeds and machine idioms are important factors when the efficiency of the target
  program is considered.

• The quality of the generated code is determined by its speed and size.

• The former statement can be translated into the latter statement as shown below:

5. Register allocation
• Instructions involving register operands are shorter and faster than those involving operands
  in memory.

• The use of registers is subdivided into two subproblems:
  - Register allocation: the set of variables that will reside in registers at a point in
    the program is selected.
  - Register assignment: the specific register that a variable will reside in is picked.

• Certain machines require even-odd register pairs for some operands and results. For
  example, consider the division instruction of the form:

      D x, y

  where x is the dividend, held in the even register of an even/odd register pair, and y is the
  divisor. After the division, the even register holds the remainder and the odd register holds
  the quotient.

6. Evaluation order
• The order in which the computations are performed can affect the efficiency of the
  target code. Some computation orders require fewer registers to hold intermediate
  results than others.

TARGET MACHINE

• Familiarity with the target machine and its instruction set is a prerequisite for designing a
  good code generator.
• The target computer is a byte-addressable machine with 4 bytes to a word.
• It has n general-purpose registers, R0, R1, ..., Rn-1.
• It has two-address instructions of the form:
      op source, destination
  where op is an op-code, and source and destination are data fields.

• It has the following op-codes:
      MOV (move source to destination)
      ADD (add source to destination)
      SUB (subtract source from destination)

• The source and destination of an instruction are specified by combining registers and
  memory locations with address modes.

Address modes with their assembly-language forms

MODE                FORM     ADDRESS                       ADDED COST

absolute            M        M                             1
register            R        R                             0
indexed             c(R)     c + contents(R)               1
indirect register   *R       contents(R)                   0
indirect indexed    *c(R)    contents(c + contents(R))     1
literal             #c       c                             1

• For example: MOV R0, M stores the contents of register R0 into memory location M; MOV
  4(R0), M stores the value contents(4 + contents(R0)) into M.

Instruction costs:

• Instruction cost = 1 + cost for source and destination address modes. This cost corresponds
  to the length of the instruction.
• Address modes involving registers have cost zero.
• Address modes involving a memory location or literal have cost one.
• Instruction length should be minimized if space is important. Doing so also minimizes the time
  taken to fetch and perform the instruction.
  For example: MOV R0, R1 copies the contents of register R0 into R1. It has cost
  one, since it occupies only one word of memory.
• The three-address statement a := b + c can be implemented by many different instruction
  sequences:

  i)   MOV b, R0
       ADD c, R0          cost = 6
       MOV R0, a

  ii)  MOV b, a
       ADD c, a           cost = 6

  iii) Assuming R0, R1 and R2 contain the addresses of a, b, and c:
       MOV *R1, *R0
       ADD *R2, *R0       cost = 2

• In order to generate good code for the target machine, we must utilize its addressing
  capabilities efficiently.
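The cost figures quoted for the three sequences can be recomputed with a small Python sketch. It relies on assumptions that are not in the notes: instructions are written as text in the form "OP source, destination", and any operand spelled R followed by digits is taken to be a register.

```python
import re


def operand_cost(operand):
    """Added cost of one operand, following the address-mode table."""
    if operand.startswith("*"):
        operand = operand[1:]              # indirection itself adds no cost
    if re.fullmatch(r"R\d+", operand):     # assumed register naming: R0, R1, ...
        return 0                           # register and indirect register modes
    return 1                               # absolute, indexed, indirect indexed, literal


def instruction_cost(instruction):
    """1 plus the added costs of the source and destination address modes."""
    _, operands = instruction.split(None, 1)
    source, destination = (part.strip() for part in operands.split(","))
    return 1 + operand_cost(source) + operand_cost(destination)


def sequence_cost(sequence):
    return sum(instruction_cost(i) for i in sequence)


cost_i = sequence_cost(["MOV b, R0", "ADD c, R0", "MOV R0, a"])
cost_ii = sequence_cost(["MOV b, a", "ADD c, a"])
cost_iii = sequence_cost(["MOV *R1, *R0", "ADD *R2, *R0"])
```

Sequence (iii) is cheapest only because the addresses are assumed to be in registers already; loading them would add cost.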

RUN-TIME STORAGE MANAGEMENT

• Information needed during an execution of a procedure is kept in a block of storage called an
  activation record, which includes storage for names local to the procedure.
• The two standard storage allocation strategies are:
  1. Static allocation
  2. Stack allocation
• In static allocation, the position of an activation record in memory is fixed at compile time.
• In stack allocation, a new activation record is pushed onto the stack for each execution of a
  procedure. The record is popped when the activation ends.
• The following three-address statements are associated with the run-time allocation and
  deallocation of activation records:
  1. Call,
  2. Return,
  3. Halt, and
  4. Action, a placeholder for other statements.
• We assume that the run-time memory is divided into areas for:
  1. Code
  2. Static data
  3. Stack

Static allocation

Implementation of call statement:

The code needed to implement static allocation is as follows:

MOV #here + 20, callee.static_area    /* It saves the return address */

GOTO callee.code_area                 /* It transfers control to the target code for the called procedure */

where,
callee.static_area - Address of the activation record
callee.code_area - Address of the first instruction for the called procedure
#here + 20 - Literal return address, which is the address of the instruction following GOTO.

Implementation of return statement:

A return from procedure callee is implemented by:

GOTO *callee.static_area

This transfers control to the address saved at the beginning of the activation record.

Implementation of action statement:

The instruction ACTION is used to implement the action statement.

Implementation of halt statement:

The statement HALT is the final instruction that returns control to the operating system.

Stack allocation

Static allocation can become stack allocation by using relative addresses for storage in
activation records. In stack allocation, the position of the activation record is stored in a register,
so words in activation records can be accessed as offsets from the value in this register.

The code needed to implement stack allocation is as follows:

Initialization of stack:

MOV #stackstart, SP           /* initializes stack */

Code for the first procedure

HALT                          /* terminate execution */

Implementation of Call statement:

ADD #caller.recordsize, SP    /* increment stack pointer */

MOV #here + 16, *SP           /* save return address */

GOTO callee.code_area

where,
caller.recordsize - size of the activation record
#here + 16 - address of the instruction following the GOTO

Implementation of Return statement:

In the called procedure:

GOTO *0(SP)                   /* return to the caller */

In the caller, immediately after the call sequence:

SUB #caller.recordsize, SP    /* decrement SP and restore it to its previous value */

BASIC BLOCKS AND FLOW GRAPHS

Basic Blocks

• A basic block is a sequence of consecutive statements in which flow of control enters at
  the beginning and leaves at the end without any halt or possibility of branching except at
  the end.
• The following sequence of three-address statements forms a basic block:
      t1 := a * a
      t2 := a * b
      t3 := 2 * t2
      t4 := t1 + t3
      t5 := b * b
      t6 := t4 + t5

Basic Block Construction:

Algorithm: Partition into basic blocks

Input: A sequence of three-address statements

Output: A list of basic blocks with each three-address statement in exactly one block

Method:

1. We first determine the set of leaders, the first statements of basic blocks. The rules we
   use are the following:
   a. The first statement is a leader.
   b. Any statement that is the target of a conditional or unconditional goto is a leader.
   c. Any statement that immediately follows a goto or conditional goto statement
      is a leader.
2. For each leader, its basic block consists of the leader and all statements up to but not
   including the next leader or the end of the program.
• Consider the following source code for the dot product of two vectors a and b of length 20:

      begin
          prod := 0;
          i := 1;
          do begin
              prod := prod + a[i] * b[i];
              i := i + 1;
          end
          while i <= 20
      end

• The three-address code for the above source program is given as:

      (1)  prod := 0
      (2)  i := 1
      (3)  t1 := 4 * i
      (4)  t2 := a[t1]          /* compute a[i] */
      (5)  t3 := 4 * i
      (6)  t4 := b[t3]          /* compute b[i] */
      (7)  t5 := t2 * t4
      (8)  t6 := prod + t5
      (9)  prod := t6
      (10) t7 := i + 1
      (11) i := t7
      (12) if i <= 20 goto (3)

Basic block 1: Statement (1) to (2)

Basic block 2: Statement (3) to (12)
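The leader rules can be applied mechanically, as in the following Python sketch. It assumes that statements are given as strings numbered from 1 and that every jump is written in the form "goto (n)", as in the listing above; the function name partition is hypothetical.

```python
import re


def partition(statements):
    """Returns the basic blocks as (first, last) statement-number pairs."""
    if not statements:
        return []
    n = len(statements)
    leaders = {1}                                   # rule (a)
    for number, text in enumerate(statements, start=1):
        jump = re.search(r"goto \((\d+)\)", text)
        if jump:
            leaders.add(int(jump.group(1)))         # rule (b): target of a goto
            if number < n:
                leaders.add(number + 1)             # rule (c): follows a goto
    ordered = sorted(leaders)
    # each block runs from its leader up to the statement before the next leader
    return [(start, end - 1) for start, end in zip(ordered, ordered[1:] + [n + 1])]


dot_product = [
    "prod := 0", "i := 1", "t1 := 4 * i", "t2 := a[t1]", "t3 := 4 * i",
    "t4 := b[t3]", "t5 := t2 * t4", "t6 := prod + t5", "prod := t6",
    "t7 := i + 1", "i := t7", "if i <= 20 goto (3)",
]
blocks = partition(dot_product)
```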
Transformations on Basic Blocks:

A number of transformations can be applied to a basic block without changing the set of
expressions computed by the block. Two important classes of transformation are:

• Structure-preserving transformations

• Algebraic transformations

1. Structure-preserving transformations:

a) Common subexpression elimination:

   Before              After

   a := b + c          a := b + c
   b := a - d          b := a - d
   c := b + c          c := b + c
   d := a - d          d := b

Since the second and fourth statements compute the same expression, the basic block can be
transformed as above.

b) Dead-code elimination:

Suppose x is dead, that is, never subsequently used, at the point where the statement x := y + z
appears in a basic block. Then this statement may be safely removed without changing the
value of the basic block.

c) Renaming temporary variables:

A statement t := b + c (t is a temporary) can be changed to u := b + c (u is a new
temporary) and all uses of this instance of t can be changed to u without changing the value of
the basic block. A block in which each statement that defines a temporary defines a new
temporary is called a normal-form block.

d) Interchange of statements:

Suppose a block has the following two adjacent statements:

   t1 := b + c
   t2 := x + y

We can interchange the two statements without affecting the value of the block if and only if
neither x nor y is t1 and neither b nor c is t2.

2. Algebraic transformations:

Algebraic transformations can be used to change the set of expressions computed by a basic
block into an algebraically equivalent set.

Examples:
i) x := x + 0 or x := x * 1 can be eliminated from a basic block without changing the set of
   expressions it computes.
ii) The exponential statement x := y ** 2 can be replaced by x := y * y.
Flow Graphs

• A flow graph is a directed graph containing the flow-of-control information for the set
  of basic blocks making up a program.
• The nodes of the flow graph are basic blocks. It has a distinguished initial node.
• E.g.: The flow graph for the vector dot product is given as follows:

  B1:  prod := 0
       i := 1

  B2:  t1 := 4 * i
       t2 := a[t1]
       t3 := 4 * i
       t4 := b[t3]
       t5 := t2 * t4
       t6 := prod + t5
       prod := t6
       t7 := i + 1
       i := t7
       if i <= 20 goto B2

• B1 is the initial node. B2 immediately follows B1, so there is an edge from B1 to B2. The
  target of the jump in the last statement of B2 is the first statement of B2, so there is also
  an edge from B2 to itself.
• B1 is the predecessor of B2, and B2 is a successor of B1.

Loops

• A loop is a collection of nodes in a flow graph such that
  1. All nodes in the collection are strongly connected.
  2. The collection of nodes has a unique entry.
• A loop that contains no other loops is called an inner loop.

NEXT-USE INFORMATION

• If the name in a register is no longer needed, then we remove the name from the register
  and the register can be used to store some other name.

Input: Basic block B of three-address statements

Output: At each statement i: x := y op z, we attach to i the liveness and next-use information
of x, y and z.

Method: We start at the last statement of B and scan backwards. For each statement
i: x := y op z we do the following, in this order:

1. Attach to statement i the information currently found in the symbol table regarding
   the next use and liveness of x, y and z.
2. In the symbol table, set x to “not live” and “no next use”.
3. In the symbol table, set y and z to “live”, and the next uses of y and z to i.

Step 2 must come before step 3 because x may be the same name as y or z.

Symbol Table:

Names     Liveness     Next-use

x         not live     no next-use
y         live         i
z         live         i
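The backward scan can be sketched in Python as follows. The representation is an assumption: each statement x := y op z is a tuple (x, y, z), statements are indexed from 0, the symbol-table entry of a name is a (live, next_use) pair, and the set of names live on exit from the block is supplied by the caller.

```python
def next_use(block, live_on_exit):
    """Returns, for each statement, the table entries of x, y and z at that point."""
    table = {}                                   # name -> (live, next use)
    for x, y, z in block:
        for name in (x, y, z):
            table[name] = (name in live_on_exit, None)
    info = [None] * len(block)
    for i in range(len(block) - 1, -1, -1):      # scan backwards
        x, y, z = block[i]
        info[i] = {name: table[name] for name in (x, y, z)}   # step 1
        table[x] = (False, None)                               # step 2
        table[y] = (True, i)                                   # step 3
        table[z] = (True, i)
    return info


# t := a - b; u := a - c; v := t + u; d := v + u, with only d live on exit
block = [("t", "a", "b"), ("u", "a", "c"), ("v", "t", "u"), ("d", "v", "u")]
info = next_use(block, {"d"})
```

For instance, the entry attached to the first statement says that t is live and next used by statement 2 (0-based), which is where v := t + u appears.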

A SIMPLE CODE GENERATOR

• A code generator generates target code for a sequence of three-address statements
  and effectively uses registers to store operands of the statements.

• For example: consider the three-address statement a := b + c.
  It can have the following sequences of code:

      ADD Rj, Ri      Cost = 1    // if Ri contains b and Rj contains c
  (or)
      ADD c, Ri       Cost = 2    // if c is in a memory location
  (or)
      MOV c, Rj       Cost = 3    // move c from memory to Rj and add
      ADD Rj, Ri

Register and Address Descriptors:

• A register descriptor is used to keep track of what is currently in each register. The register
  descriptors show that initially all the registers are empty.
• An address descriptor stores the location where the current value of the name can be found at
  run time.
A code-generation algorithm:

The algorithm takes as input a sequence of three-address statements constituting a basic block.
For each three-address statement of the form x := y op z, perform the following actions:

1. Invoke a function getreg to determine the location L where the result of the computation
   y op z should be stored.

2. Consult the address descriptor for y to determine y’, the current location of y. Prefer the
   register for y’ if the value of y is currently both in memory and a register. If the value of y
   is not already in L, generate the instruction MOV y’, L to place a copy of y in L.

3. Generate the instruction OP z’, L where z’ is a current location of z. Prefer a register to a
   memory location if z is in both. Update the address descriptor of x to indicate that x is in
   location L. If L is a register, update its descriptor to indicate that it contains the value of
   x, and remove x from all other register descriptors.

4. If the current values of y or z have no next uses, are not live on exit from the block, and
   are in registers, alter the register descriptor to indicate that, after execution of
   x := y op z, those registers will no longer contain y or z.

Generating Code for Assignment Statements:

• The assignment d := (a-b) + (a-c) + (a-c) might be translated into the following three-
  address code sequence:
      t := a - b
      u := a - c
      v := t + u
      d := v + u
  with d live at the end.

The code sequence for the example is:

Statements      Code Generated     Register descriptor     Address descriptor

                                   Registers empty

t := a - b      MOV a, R0          R0 contains t           t in R0
                SUB b, R0

u := a - c      MOV a, R1          R0 contains t           t in R0
                SUB c, R1          R1 contains u           u in R1

v := t + u      ADD R1, R0         R0 contains v           u in R1
                                   R1 contains u           v in R0

d := v + u      ADD R1, R0         R0 contains d           d in R0
                MOV R0, d                                  d in R0 and memory
Generating Code for Indexed Assignments

The table shows the code sequences generated for the indexed assignment statements
a := b[i] and a[i] := b

Statements      Code Generated      Cost

a := b[i]       MOV b(Ri), R        2

a[i] := b       MOV b, a(Ri)        3

Generating Code for Pointer Assignments

The table shows the code sequences generated for the pointer assignments
a := *p and *p := a

Statements      Code Generated      Cost

a := *p         MOV *Rp, a          2

*p := a         MOV a, *Rp          2

Generating Code for Conditional Statements

Statement             Code

if x < y goto z       CMP x, y
                      CJ< z         /* jump to z if condition code is negative */

x := y + z            MOV y, R0
if x < 0 goto z       ADD z, R0
                      MOV R0, x
                      CJ< z

THE DAG REPRESENTATION FOR BASIC BLOCKS

• A DAG for a basic block is a directed acyclic graph with the following labels on nodes:
1. Leaves are labeled by unique identifiers, either variable names or constants.
2. Interior nodes are labeled by an operator symbol.
3. Nodes are also optionally given a sequence of identifiers for labels to store the computed
values.
• DAGs are useful data structures for implementing transformations on basic blocks.
• It gives a picture of how the value computed by a statement is used in subsequent statements.
• It provides a good way of determining common sub-expressions.
Algorithm for construction of DAG

Input: A basic block

Output: A DAG for the basic block containing the following information:

1. A label for each node. For leaves, the label is an identifier. For interior nodes,
an operator symbol.
2. For each node, a list of attached identifiers to hold the computed values.

Each statement of the block has one of three forms:

Case (i) x := y OP z

Case (ii) x := OP y

Case (iii) x := y

Method:

Step 1: If node(y) is undefined, create a leaf labeled y and let node(y) be this leaf.

For case (i), if node(z) is undefined, create a leaf labeled z in the same way.

Step 2: For case (i), determine whether there is a node(OP) whose left child is node(y) and right
child is node(z) (this is the check for a common sub-expression). If not, create such a node.
Let n be this node.

For case (ii), determine whether there is a node(OP) with one child node(y). If not, create such a
node. Let n be this node.

For case (iii), node n will be node(y).

Step 3: Delete x from the list of identifiers for node(x). Append x to the list of attached
identifiers for the node n found in step 2 and set node(x) to n.
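
The three steps can be sketched in Python as below. The encoding is an assumption made for illustration: each statement is a tuple (x, op, y, z), with z set to None for cases (ii) and (iii) and op set to None for a copy, and array indexing is treated as a binary operator written "[]". The names build_dag, node_of and attached are hypothetical.

```python
def build_dag(block):
    """Return (nodes, attached) for statements encoded as (x, op, y, z)."""
    nodes = []      # nodes[i] = (label, left_child, right_child)
    node_of = {}    # name -> index of the node currently holding its value
    index = {}      # (op, left, right) -> node index, used to reuse nodes
    attached = {}   # node index -> list of attached identifiers

    def leaf(name):
        # Step 1: create a leaf the first time a name is seen.
        if name not in node_of:
            nodes.append((name, None, None))
            node_of[name] = len(nodes) - 1
        return node_of[name]

    for x, op, y, z in block:
        left = leaf(y)
        right = leaf(z) if z is not None else None
        if op is None:                  # case (iii): x := y
            n = left
        else:                           # cases (i) and (ii)
            key = (op, left, right)
            if key not in index:        # no common sub-expression found
                nodes.append(key)
                index[key] = len(nodes) - 1
            n = index[key]
        # Step 3: detach x from its old node and attach it to n.
        old = node_of.get(x)
        if old is not None and x in attached.get(old, []):
            attached[old].remove(x)
        attached.setdefault(n, []).append(x)
        node_of[x] = n
    return nodes, attached


block = [
    ("t1", "*", "4", "i"), ("t2", "[]", "a", "t1"),
    ("t3", "*", "4", "i"), ("t4", "[]", "b", "t3"),
    ("t5", "*", "t2", "t4"), ("t6", "+", "prod", "t5"),
    ("prod", None, "t6", None), ("t7", "+", "i", "1"),
    ("i", None, "t7", None),
]
nodes, attached = build_dag(block)
print(attached)
```

In this run t1 and t3 end up attached to the same node for 4 * i, which is how the DAG exposes the common sub-expression.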

Example: Consider the block of three-address statements:

1. t1 := 4 * i
2. t2 := a[t1]
3. t3 := 4 * i
4. t4 := b[t3]
5. t5 := t2 * t4
6. t6 := prod + t5
7. prod := t6
8. t7 := i + 1
9. i := t7
10. if i <= 20 goto (1)

[Figure: Stages in DAG construction]

Application of DAGs:

1. We can automatically detect common sub-expressions.
2. We can determine which identifiers have their values used in the block.
3. We can determine which statements compute values that could be used outside the block.
GENERATING CODE FROM DAGs

The advantage of generating code for a basic block from its DAG representation is that
from a DAG we can see how to rearrange the order of the final computation sequence more easily
than we can starting from a linear sequence of three-address statements or quadruples.

Rearranging the order
The order in which computations are done can affect the cost of the resulting object code.

For example, consider the following basic block:
t1 := a + b
t2 := c + d
t3 := e – t2
t4 := t1 – t3

Generated code sequence for basic block:

MOV a, R0
ADD b, R0
MOV c, R1
ADD d, R1
MOV R0, t1
MOV e, R0
SUB R1, R0
MOV t1, R1
SUB R0, R1
MOV R1, t4

Rearranged basic block:

Now t1 occurs immediately before t4.

t2 := c + d
t3 := e – t2
t1 := a + b
t4 := t1 – t3

Revised code sequence:

MOV c, R0
ADD d, R0
MOV e, R1
SUB R0, R1
MOV a, R0
ADD b, R0
SUB R1, R0
MOV R0, t4

In this order, two instructions, MOV R0, t1 and MOV t1, R1, have been saved.
A Heuristic Ordering for DAGs

The heuristic ordering algorithm attempts to make the evaluation of a node immediately follow the
evaluation of its leftmost argument.

The algorithm shown below produces the ordering in reverse.

Algorithm:

1) while unlisted interior nodes remain do begin
2)     select an unlisted node n, all of whose parents have been listed;
3)     list n;
4)     while the leftmost child m of n has no unlisted parents and is not a leaf do begin
5)         list m;
6)         n := m
       end
   end

Example: Consider the DAG shown below:

[Figure: DAG with interior nodes 1 (*), 2 (+), 3 (-), 4 (*), 5 (-), 6 (+), 8 (+) and leaves c (7), a (9), b (10), d (11), e (12)]

Initially, the only node with no unlisted parents is 1, so set n = 1 at line (2) and list 1 at line (3).

Now, the left argument of 1, which is 2, has its parents listed, so we list 2 and set n = 2 at line (6).

Now, at line (4) we find the leftmost child of 2, which is 6, has an unlisted parent 5. Thus we
select a new n at line (2), and node 3 is the only candidate. We list 3 and proceed down its left chain,
listing 4, 5 and 6. This leaves only 8 among the interior nodes, so we list that.

The resulting list is 1 2 3 4 5 6 8 and the order of evaluation is 8 6 5 4 3 2 1.

Code sequence:

t8 := d + e
t6 := a + b
t5 := t6 – c
t4 := t5 * t8
t3 := t4 – e
t2 := t6 + t4
t1 := t2 * t3

This yields optimal code for the DAG on our machine, whatever the number of registers.
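
The listing algorithm can be sketched in Python as follows. The child table below is an assumption read off from the code sequence above (for instance t1 := t2 * t3 gives node 1 the children 2 and 3), and the rule "pick the smallest-numbered ready node" at line (2) is one arbitrary way to make the selection deterministic; heuristic_order is a hypothetical name.

```python
def heuristic_order(children):
    """children maps each interior node to (leftmost child, other child)."""
    interior = set(children)
    parents = {}
    for p, kids in children.items():
        for k in kids:
            parents.setdefault(k, set()).add(p)

    listed = []

    def ready(n):
        # A node may be listed once all of its parents have been listed.
        return n not in listed and all(p in listed for p in parents.get(n, ()))

    while len(listed) < len(interior):
        n = min(m for m in interior if ready(m))    # line (2)
        listed.append(n)                            # line (3)
        while True:
            m = children[n][0]                      # leftmost child of n
            if m in interior and ready(m):          # line (4)
                listed.append(m)                    # line (5)
                n = m                               # line (6)
            else:
                break
    return listed


# Leaves: 7 = c, 9 = a, 10 = b, 11 = d, 12 = e.
dag = {1: (2, 3), 2: (6, 4), 3: (4, 12), 4: (5, 8),
       5: (6, 7), 6: (9, 10), 8: (11, 12)}
order = heuristic_order(dag)
print(order)                   # listing order
print(list(reversed(order)))   # evaluation order
```

Reversing the list gives the evaluation order, since the algorithm lists a node only after every node that uses it.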
MODULE-4 - CODE OPTIMIZATION

INTRODUCTION

• The code produced by straightforward compiling algorithms can often be made to run
faster or take less space, or both. This improvement is achieved by program transformations
that are traditionally called optimizations. Compilers that apply code-improving
transformations are called optimizing compilers.

• Optimizations are classified into two categories. They are
• Machine-independent optimizations
• Machine-dependent optimizations

Machine-independent optimizations:

• Machine-independent optimizations are program transformations that improve the target code
without taking into consideration any properties of the target machine.

Machine-dependent optimizations:

• Machine-dependent optimizations are based on register allocation and utilization of special
machine-instruction sequences.

The criteria for code improvement transformations:

• Simply stated, the best program transformations are those that yield the most benefit for the
least effort.

• The transformation must preserve the meaning of programs. That is, the optimization must
not change the output produced by a program for a given input, or cause an error, such as
division by zero, that was not present in the original source program. At all times we take the
“safe” approach of missing an opportunity to apply a transformation rather than risk changing
what the program does.

• A transformation must, on the average, speed up programs by a measurable amount. We are
also interested in reducing the size of the compiled code, although the size of the code has less
importance than it once had. Not every transformation succeeds in improving every program;
occasionally an “optimization” may slow down a program slightly.

• The transformation must be worth the effort. It does not make sense for a compiler writer to
expend the intellectual effort to implement a code-improving transformation and to have the
compiler expend the additional time compiling source programs if this effort is not repaid
when the target programs are executed. “Peephole” transformations are simple
enough and beneficial enough to be included in any compiler.
Organization for an Optimizing Compiler:

• Flow analysis is a fundamental prerequisite for many important types of code
improvement.
• Generally, control flow analysis precedes data flow analysis.
• Control flow analysis (CFA) represents the flow of control, usually in the form of graphs. CFA
constructs graphs such as the
• control flow graph
• call graph
• Data flow analysis (DFA) is the process of ascertaining and collecting information prior to
program execution about the possible modification, preservation, and use of certain
entities (such as values or attributes of variables) in a computer program.

PRINCIPAL SOURCES OF OPTIMISATION

• A transformation of a program is called local if it can be performed by looking only at the
statements in a basic block; otherwise, it is called global.
• Many transformations can be performed at both the local and global levels. Local
transformations are usually performed first.

Function-Preserving Transformations

• There are a number of ways in which a compiler can improve a program without
changing the function it computes.
• The transformations

• Common sub-expression elimination,
• Copy propagation,
• Dead-code elimination, and
• Constant folding

are common examples of such function-preserving transformations. The other
transformations come up primarily when global optimizations are performed.
• Frequently, a program will include several calculations of the same value, such as an
offset in an array. Some of the duplicate calculations cannot be avoided by the
programmer because they lie below the level of detail accessible within the source
language.

• Common Sub-expression Elimination:

• An occurrence of an expression E is called a common sub-expression if E was previously
computed, and the values of variables in E have not changed since the previous
computation. We can avoid recomputing the expression if we can use the previously
computed value.
• For example
t1 := 4*i
t2 := a[t1]
t3 := 4*j
t4 := 4*i
t5 := n
t6 := b[t4] + t5

The above code can be optimized using common sub-expression elimination as
t1 := 4*i
t2 := a[t1]
t3 := 4*j
t5 := n
t6 := b[t1] + t5

The common sub-expression t4 := 4*i is eliminated because its value is already available in t1 and
the value of i has not changed between the definition of t1 and the use.

• Copy Propagation:

• Assignments of the form f := g are called copy statements, or copies for short. The idea
behind the copy-propagation transformation is to use g for f, wherever possible after the
copy statement f := g. Copy propagation means the use of one variable instead of another.
This may not appear to be an improvement, but it gives us an opportunity
to eliminate the assignment to f.
• For example:

x = Pi;
……
A = x*r*r;

The optimization using copy propagation can be done as follows:

A = Pi*r*r;

Here the variable x is eliminated.

• Dead-Code Elimination:

• A variable is live at a point in a program if its value can be used subsequently; otherwise, it is
dead at that point. A related idea is dead or useless code, statements that compute
values that never get used. While the programmer is unlikely to introduce any dead code
intentionally, it may appear as the result of previous transformations. An optimization can be
done by eliminating dead code.
Example:

i = 0;
if (i == 1)
{
a = b + 5;
}

Here, the ‘if’ statement is dead code because its condition can never be satisfied.

• Constant Folding:

• Deducing at compile time that the value of an expression is a constant and using the
constant instead is known as constant folding. When the folded constant is the condition of a
test, such as a debug flag that guards a print statement, we can eliminate both the test and the
printing from the object code.

• One advantage of copy propagation is that it often turns the copy statement into dead code.
• For example,
a = 3.14157/2 can be replaced by
a = 1.570785, thereby eliminating a division operation.
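
As a rough illustration of constant folding, the sketch below folds constant sub-expressions of a small expression tree. The tree encoding (op, left, right) and the name fold are assumptions made for this example, and a division by a constant zero is deliberately left unfolded so that the transformation does not introduce an error at compile time.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def fold(expr):
    """expr is a number, a variable name, or a tuple (op, left, right)."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = fold(left), fold(right)      # fold the operands first
    both_const = (isinstance(left, (int, float))
                  and isinstance(right, (int, float)))
    if both_const and not (op == "/" and right == 0):
        return OPS[op](left, right)            # evaluate at "compile time"
    return (op, left, right)                   # keep the operation


print(fold(("+", ("*", 2, 3), "x")))   # the constant product is folded
print(fold(("/", 3.14157, 2)))         # the division from the example
```

Only sub-trees whose operands are all constants are replaced; anything involving a variable is rebuilt unchanged.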

• Loop Optimizations:

• We now give a brief introduction to a very important place for optimizations, namely
loops, especially the inner loops where programs tend to spend the bulk of their time. The running
time of a program may be improved if we decrease the number of instructions in an inner
loop, even if we increase the amount of code outside that loop.
• Three techniques are important for loop optimization:

• Code motion, which moves code outside a loop;
• Induction-variable elimination, which we apply to remove redundant induction variables from
an inner loop;
• Reduction in strength, which replaces an expensive operation by a cheaper one, such as a
multiplication by an addition.

• Code Motion:

• An important modification that decreases the amount of code in a loop is code motion.
This transformation takes an expression that yields the same result independent of the
number of times a loop is executed (a loop-invariant computation) and places the
expression before the loop. Note that the notion “before the loop” assumes the existence
of an entry for the loop. For example, evaluation of limit-2 is a loop-invariant
computation in the following while-statement:

while (i <= limit-2)    /* statement does not change limit */

Code motion will result in the equivalent of

t = limit-2;
while (i <= t)    /* statement does not change limit or t */

• Induction Variables:

• Loops are usually processed inside out. For example, consider the loop around B3.
• Note that the values of j and t4 remain in lock-step; every time the value of j decreases by
1, that of t4 decreases by 4 because 4*j is assigned to t4. Such identifiers are called
induction variables.
• When there are two or more induction variables in a loop, it may be possible to get rid of all but
one, by the process of induction-variable elimination. For the inner loop around B3 in
Fig. we cannot get rid of either j or t4 completely; t4 is used in B3 and j in B4. However,
we can illustrate reduction in strength and a part of the process of induction-
variable elimination. Eventually j will be eliminated when the outer loop of B2-B5 is considered.

Example:
As the relationship t4 = 4*j surely holds after such an assignment to t4 in Fig. and t4 is not
changed elsewhere in the inner loop around B3, it follows that just after the statement
j := j-1 the relationship t4 = 4*j + 4 must hold. We may therefore replace the assignment
t4 := 4*j by t4 := t4-4. The only problem is that t4 does not have a value when we enter block B3
for the first time. Since we must maintain the relationship t4 = 4*j on entry to the block B3, we
place an initialization of t4 at the end of the block where j itself is
initialized, shown by the dashed addition to block B1 in the second Fig.

[Figure: flow graph before and after the transformation]

• The replacement of a multiplication by a subtraction will speed up the object code if
multiplication takes more time than addition or subtraction, as is the case on many
machines.
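
The effect of replacing t4 := 4*j by t4 := t4-4 can be checked with a small Python sketch. The two functions below are hypothetical stand-ins for the loop body of B3 before and after the transformation; they record the value of t4 on each iteration so the two versions can be compared.

```python
def t4_by_multiplication(j_start, steps):
    """Original loop body: j := j-1 followed by t4 := 4*j."""
    j, values = j_start, []
    for _ in range(steps):
        j = j - 1
        t4 = 4 * j
        values.append(t4)
    return values


def t4_by_subtraction(j_start, steps):
    """Transformed loop body: j := j-1 followed by t4 := t4-4."""
    j, values = j_start, []
    t4 = 4 * j              # initialization placed where j is initialized
    for _ in range(steps):
        j = j - 1
        t4 = t4 - 4         # relies on t4 = 4*j + 4 holding at this point
        values.append(t4)
    return values


print(t4_by_multiplication(10, 5))
print(t4_by_subtraction(10, 5))
```

Both versions produce the same sequence of t4 values, but the second uses no multiplication inside the loop.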

• Reduction In Strength:

• Reduction in strength replaces expensive operations by equivalent cheaper ones on the
target machine. Certain machine instructions are considerably cheaper than others and can
often be used as special cases of more expensive operators.
• For example, x² is invariably cheaper to implement as x*x than as a call to an
exponentiation routine. Fixed-point multiplication or division by a power of two is
cheaper to implement as a shift. Floating-point division by a constant can be implemented as
multiplication by a constant, which may be cheaper.

OPTIMIZATION OF BASIC BLOCKS

There are two types of basic block optimizations. They are:

• Structure-Preserving Transformations
• Algebraic Transformations

Structure-Preserving Transformations:

The primary structure-preserving transformations on basic blocks are:

• Common sub-expression elimination
• Dead code elimination
• Renaming of temporary variables
• Interchange of two independent adjacent statements.

• Common sub-expression elimination:

Common sub-expressions need not be computed over and over again. Instead they can be
computed once and kept in store, from where the value is referenced when the expression is
encountered again, provided the values of the variables in the expression have not changed.

Example:

a := b+c
b := a-d
c := b+c
d := a-d

The 2nd and 4th statements compute the same expression, a-d. The 1st and 3rd statements both
have b+c on the right, but the 2nd statement redefines b in between, so they do not compute the
same value. The basic block can be transformed to

a := b+c
b := a-d
c := b+c
d := b
• Dead code elimination:

It is possible that a large amount of dead (useless) code exists in a program. This
often happens when variables and procedures are introduced during construction or
error-correction of a program and, once declared and defined, are not removed when
they no longer serve a purpose. Eliminating such code reduces the size of the program.

• Renaming of temporary variables:

• A statement t := b+c, where t is a temporary name, can be changed to u := b+c, where u is
another temporary name, provided all uses of t are changed to u.
• In this way we can transform a basic block to an equivalent block called a normal-form block.

• Interchange of two independent adjacent statements:

• Two statements

t1 := b+c

t2 := x+y

can be interchanged or reordered in the basic block when the value of t1
does not affect the value of t2, that is, when neither x nor y is t1 and neither b nor c is t2.

Algebraic Transformations:

• Algebraic identities represent another important class of optimizations on basic blocks.
This includes simplifying expressions or replacing expensive operations by cheaper ones,
i.e. reduction in strength.
• Another class of related optimizations is constant folding. Here we evaluate constant
expressions at compile time and replace the constant expressions by their values. Thus the
expression 2*3.14 would be replaced by 6.28.
• The relational operators <=, >=, <, >, ≠ and = sometimes generate unexpected common
sub-expressions.
• Associative laws may also be applied to expose common sub-expressions. For example, if the
source code has the assignments

a := b+c
e := c+d+b

the following intermediate code may be generated:

a := b+c
t := c+d
e := t+b

If t is not needed outside this block, associativity and commutativity of + allow the last two
statements to be replaced by e := a+d.

• Example:

x := x+0 can be removed

x := y**2 can be replaced by a cheaper statement x := y*y
• The compiler writer should examine the language carefully to determine what
rearrangements of computations are permitted, since computer arithmetic does not always obey
the algebraic identities of mathematics. Thus, a compiler may evaluate x*y-x*z as x*(y-
z) but it may not evaluate a+(b-c) as (a+b)-c.

LOOPS IN FLOW GRAPH

A graph representation of three-address statements, called a flow graph, is useful for
understanding code-generation algorithms, even if the graph is not explicitly constructed by a
code-generation algorithm. Nodes in the flow graph represent computations, and the edges
represent the flow of control.

Dominators:
In a flow graph, a node d dominates node n if every path from the initial node of the flow
graph to n goes through d. This is denoted by d dom n. The initial node dominates all the
other nodes in the flow graph, and the entry of a loop dominates all nodes in the loop.
Similarly, every node dominates itself.

Example:

* In the flow graph below,
* The initial node, node 1, dominates every node.
* Node 2 dominates only itself.
* Node 3 dominates all but 1 and 2.
* Node 4 dominates all but 1, 2 and 3.
* Nodes 5 and 6 dominate only themselves, since flow of control can skip around either by going
through the other.
* Node 7 dominates 7, 8, 9 and 10.
* Node 8 dominates 8, 9 and 10.
* Nodes 9 and 10 dominate only themselves.
• A useful way of presenting dominator information is in a tree, called the dominator tree, in
which the initial node is the root.
• The parent of each other node is its immediate dominator.
• Each node d dominates only its descendants in the tree.
• The existence of the dominator tree follows from a property of dominators: each node n has a
unique immediate dominator m that is the last dominator of n on any path from the initial
node to n.
• In terms of the dom relation, the immediate dominator m has the property that if d ≠ n and d
dom n, then d dom m.

D(1)={1}

D(2)={1,2}

D(3)={1,3}

D(4)={1,3,4}

D(5)={1,3,4,5}

D(6)={1,3,4,6}

D(7)={1,3,4,7}

D(8)={1,3,4,7,8}

D(9)={1,3,4,7,8,9}

D(10)={1,3,4,7,8,10}
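
The sets D(n) can be computed iteratively: D(entry) = {entry}, and for every other node D(n) is {n} together with the intersection of D(p) over all predecessors p of n. The Python sketch below does this for a 10-node graph. The edge list is an assumption, since the figure itself is not reproduced here; it was chosen to be consistent with the dominator sets above and the back edges listed in the next subsection. The names FLOW and dominators are hypothetical.

```python
def dominators(succ, entry):
    """Return {node: set of dominators}; assumes every node is reachable."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            pred[t].add(n)

    dom = {n: set(nodes) for n in nodes}   # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:                         # iterate until nothing shrinks
        changed = False
        for n in sorted(nodes - {entry}):
            new = {n} | set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


# Assumed successor lists for the 10-node example graph.
FLOW = {1: [2, 3], 2: [3], 3: [4], 4: [3, 5, 6], 5: [7], 6: [7],
        7: [4, 8], 8: [3, 9, 10], 9: [1], 10: [7]}

D = dominators(FLOW, 1)
for n in sorted(D):
    print(f"D({n}) = {sorted(D[n])}")
```

With this edge list the computed sets agree with D(1) through D(10) as listed above.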
Natural Loop:

• One application of dominator information is in determining the loops of a flow graph suitable for
improvement.

• The properties of loops are

• A loop must have a single entry point, called the header. This entry point dominates all
nodes in the loop, or it would not be the sole entry to the loop.
• There must be at least one way to iterate the loop, i.e. at least one path back to the header.

• One way to find all the loops in a flow graph is to search for edges in the flow graph whose
heads dominate their tails. If a→b is an edge, b is the head and a is the tail. These types of
edges are called back edges.

• Example:

In the above graph,

7 → 4      4 DOM 7

10 → 7     7 DOM 10

4 → 3      3 DOM 4

8 → 3      3 DOM 8

9 → 1      1 DOM 9

• The above edges form loops in the flow graph.
• Given a back edge n → d, we define the natural loop of the edge to be d plus the set of nodes that
can reach n without going through d. Node d is the header of the loop.

Algorithm: Constructing the natural loop of a back edge.

Input: A flow graph G and a back edge n → d.

Output: The set loop consisting of all nodes in the natural loop of n → d.

Method: Beginning with node n, we consider each node m ≠ d that we know is in loop, to make
sure that m's predecessors are also placed in loop. Each node in loop, except for d, is placed once
on stack, so its predecessors will be examined. Note that because d is put in the loop initially, we
never examine its predecessors, and thus find only those nodes that reach n without going through d.

procedure insert(m);
if m is not in loop then begin
    loop := loop U {m};
    push m onto stack
end;

stack := empty;
loop := {d};
insert(n);
while stack is not empty do begin
    pop m, the first element of stack, off stack;
    for each predecessor p of m do insert(p)
end
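
The procedure translates almost line for line into Python. In the sketch below the flow graph is given as successor lists, and the edge list FLOW is an assumption chosen to agree with the dominator sets and back edges of the running example; natural_loop is a hypothetical name.

```python
def natural_loop(succ, n, d):
    """Return the natural loop of the back edge n -> d."""
    pred = {}
    for a, targets in succ.items():
        for b in targets:
            pred.setdefault(b, set()).add(a)

    loop = {d}          # d is in the loop from the start, so it is never pushed
    stack = []

    def insert(m):
        if m not in loop:
            loop.add(m)
            stack.append(m)

    insert(n)
    while stack:
        m = stack.pop()
        for p in pred.get(m, ()):   # predecessors of m also belong to the loop
            insert(p)
    return loop


# Assumed successor lists for the 10-node example graph.
FLOW = {1: [2, 3], 2: [3], 3: [4], 4: [3, 5, 6], 5: [7], 6: [7],
        7: [4, 8], 8: [3, 9, 10], 9: [1], 10: [7]}

print(sorted(natural_loop(FLOW, 10, 7)))
print(sorted(natural_loop(FLOW, 7, 4)))
```

Under the assumed edges, the loop of 10 → 7 is nested inside the loop of 7 → 4, which matches the inner-loop property discussed next.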

Inner loop:

• If we use the natural loops as “the loops”, then we have the useful property that unless
two loops have the same header, they are either disjoint or one is entirely contained in
the other. Thus, neglecting loops with the same header for the moment, we have a natural notion
of inner loop: one that contains no other loop.
• When two natural loops have the same header, but neither is nested within the other, they are
combined and treated as a single loop.

Pre-Headers:

• Several transformations require us to move statements “before the header”. Therefore we
begin treatment of a loop L by creating a new block, called the pre-header.

• The pre-header has only the header as successor, and all edges which formerly entered the
header of L from outside L instead enter the pre-header.

• Edges from inside loop L to the header are not changed.

• Initially the pre-header is empty, but transformations on L may place statements in it.

[Figure: (a) Before: header of loop L. (b) After: pre-header followed by header of loop L.]

Reducible flow graphs:

• Reducible flow graphs are special flow graphs for which several code optimization
transformations are especially easy to perform: loops are unambiguously defined,
dominators can be easily calculated, and data flow analysis problems can be solved
efficiently.

• Exclusive use of structured flow-of-control statements such as if-then-else, while-do,
continue, and break statements produces programs whose flow graphs are always
reducible.
• The most important property of reducible flow graphs is that there are no jumps into
the middle of loops from outside; the only entry to a loop is through its header.

• Definition:

A flow graph G is reducible if and only if we can partition the edges into two disjoint groups,
forward edges and back edges, with the following properties.

• The forward edges form an acyclic graph in which every node can be reached from the initial
node of G.

• The back edges consist only of edges whose heads dominate their tails.

• Example: The above flow graph is reducible.

• If we know the relation DOM for a flow graph, we can find and remove all the back
edges.

• The remaining edges are forward edges.

• If the forward edges form an acyclic graph, then we can say the flow graph is reducible.

• In the above example, remove the five back edges 4→3, 7→4, 8→3, 9→1 and 10→7,
whose heads dominate their tails; the remaining graph is acyclic.

• The key property of reducible flow graphs for loop analysis is that in such flow graphs
every set of nodes that we would informally regard as a loop must contain a back edge.

PEEPHOLE OPTIMIZATION

• A statement-by-statement code-generation strategy often produces target code that
contains redundant instructions and suboptimal constructs. The quality of such target code
can be improved by applying “optimizing” transformations to the target program.
• A simple but effective technique for improving the target code is peephole optimization, a
method for trying to improve the performance of the target program by examining a
short sequence of target instructions (called the peephole) and replacing these
instructions by a shorter or faster sequence, whenever possible.
• The peephole is a small, moving window on the target program. The code in the peephole
need not be contiguous, although some implementations do require this. It is characteristic of
peephole optimization that each improvement may spawn opportunities for additional
improvements.
• We shall give the following examples of program transformations that are characteristic
of peephole optimizations:

• Redundant-instruction elimination
• Flow-of-control optimizations
• Algebraic simplifications
• Use of machine idioms
• Unreachable code
Redundant Loads And Stores:

If we see the instruction sequence

(1) MOV R0, a

(2) MOV a, R0

we can delete instruction (2), because whenever (2) is executed, (1) will ensure that the value of
a is already in register R0. If (2) had a label, we could not be sure that (1) was always executed
immediately before (2), and so we could not remove (2).
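
A peephole pass for this pattern can be sketched in Python as below. The representation is an assumption made for illustration: each instruction is a tuple (label, opcode, source, destination) with label set to None when the instruction is unlabeled, operands are plain register or variable names, and remove_redundant_loads is a hypothetical name.

```python
def remove_redundant_loads(code):
    """Drop an unlabeled MOV that only undoes the MOV just before it."""
    out = []
    for label, op, src, dst in code:
        if out and label is None and op == "MOV":
            _, prev_op, prev_src, prev_dst = out[-1]
            # Previous: MOV R0, a   Current: MOV a, R0  -> value already in R0.
            if prev_op == "MOV" and (prev_src, prev_dst) == (dst, src):
                continue
        out.append((label, op, src, dst))
    return out


unlabeled = [(None, "MOV", "R0", "a"),
             (None, "MOV", "a", "R0"),
             (None, "ADD", "b", "R0")]
labeled = [(None, "MOV", "R0", "a"),
           ("L1", "MOV", "a", "R0")]

print(remove_redundant_loads(unlabeled))
print(remove_redundant_loads(labeled))
```

The labeled variant is left untouched, because a jump to L1 could reach the second MOV without executing the first.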

Unreachable Code:

• Another opportunity for peephole optimization is the removal of unreachable instructions.
An unlabeled instruction immediately following an unconditional jump may be removed.
This operation can be repeated to eliminate a sequence of instructions. For example, for
debugging purposes, a large program may have within it certain segments that are executed
only if a variable debug is 1. In C, the source code might look like:

#define debug 0

….

if (debug) {

print debugging information

}

• In the intermediate representation the if-statement may be translated as:

if debug = 1 goto L1

goto L2

L1: print debugging information

L2: ......................................................................................... (a)

• One obvious peephole optimization is to eliminate jumps over jumps. Thus, no matter what
the value of debug, (a) can be replaced by:

if debug ≠ 1 goto L2

print debugging information

L2: ........................................................................................... (b)

• Since debug is defined as 0 at the beginning of the program, constant propagation replaces (b) by:

if 0 ≠ 1 goto L2

print debugging information

L2: .......................................................................................... (c)

• As the argument of the first statement of (c) evaluates to a constant true, it can be replaced by
goto L2. Then all the statements that print debugging aids are manifestly unreachable and can be
eliminated one at a time.

Flow-Of-Control Optimizations:

• Unnecessary jumps can be eliminated in either the intermediate code or the target code by
the following types of peephole optimizations. We can replace the jump sequence

goto L1

….

L1: goto L2

by the sequence

goto L2

….

L1: goto L2

• If there are now no jumps to L1, then it may be possible to eliminate the statement L1: goto
L2, provided it is preceded by an unconditional jump. Similarly, the sequence

if a < b goto L1

….

L1: goto L2

can be replaced by

if a < b goto L2

….

L1: goto L2

• Finally, suppose there is only one jump to L1 and L1 is preceded by an unconditional goto.
Then the sequence

goto L1

……..

L1: if a < b goto L2

L3: .......................................................................... (1)

• may be replaced by

if a < b goto L2

goto L3

…….

L3: ............................................................................ (2)

• While the number of instructions in (1) and (2) is the same, we sometimes skip the
unconditional jump in (2), but never in (1). Thus (2) is superior to (1) in execution time.

Algebraic Simplification:

• There is no end to the amount of algebraic simplification that can be attempted through
peephole optimization. Only a few algebraic identities occur frequently enough that it is
worth considering implementing them. For example, statements such as

x := x+0 or

x := x*1

are often produced by straightforward intermediate code-generation algorithms, and they can be
eliminated easily through peephole optimization.

Reduction in Strength:

• Reduction in strength replaces expensive operations by equivalent cheaper ones on the target
machine. Certain machine instructions are considerably cheaper than others and can often be
used as special cases of more expensive operators.
• For example, x² is invariably cheaper to implement as x*x than as a call to an exponentiation
routine. Fixed-point multiplication or division by a power of two is cheaper to implement as a
shift. Floating-point division by a constant can be implemented as multiplication by a
constant, which may be cheaper.

x² → x*x

Use of Machine Idioms:

• The target machine may have hardware instructions to implement certain specific operations
efficiently. For example, some machines have auto-increment and auto-decrement addressing
modes. These add or subtract one from an operand before or after using its value.
• The use of these modes greatly improves the quality of code when pushing or popping a
stack, as in parameter passing. These modes can also be used in code for statements like
i := i+1.

i := i+1 → i++

i := i-1 → i--

INTRODUCTION TO GLOBAL DATA FLOW ANALYSIS

• In order to do code optimization and a good job of code generation, a compiler needs to collect
information about the program as a whole and to distribute this information to each
block in the flow graph.

• A compiler could take advantage of “reaching definitions”, such as knowing where a
variable like debug was last defined before reaching a given block, in order to perform
transformations. Reaching definitions are just one example of the data-flow information that an
optimizing compiler collects by a process known as data-flow analysis.

• Data-flow information can be collected by setting up and solving systems of equations of the
form:

out[S] = gen[S] U (in[S] – kill[S])

This equation can be read as “the information at the end of a statement is either generated within
the statement, or enters at the beginning and is not killed as control flows through the
statement.”

• The details of how data-flow equations are set up and solved depend on three factors.

• The notions of generating and killing depend on the desired information, i.e., on the data
flow analysis problem to be solved. Moreover, for some problems, instead of proceeding along
with the flow of control and defining out[S] in terms of in[S], we need to proceed backwards
and define in[S] in terms of out[S].

• Since data flows along control paths, data-flow analysis is affected by the constructs in a
program. In fact, when we write out[S] we implicitly assume that there is a unique end
point where control leaves the statement; in general, equations are set up at the level of
basic blocks rather than statements, because blocks do have unique end points.

• There are subtleties that go along with such statements as procedure calls, assignments
through pointer variables, and even assignments to array variables.

Points and Paths:

• Within a basic block, we talk of the point between two adjacent statements, as well as the point
before the first statement and after the last. Thus, block B1 has four points: one before
any of the assignments and one after each of the three assignments.

[Figure: flow graph with the following blocks and definitions]
B1: d1: i := m-1
    d2: j := n
    d3: a := u1
B2: d4: i := i+1
B3: d5: j := j-1
B4
B5 / B6: d6: a := u2

• Now let us take a global view and consider all the points in all the blocks. A path from p1
to pn is a sequence of points p1, p2, …, pn such that for each i between 1 and n-1, either

• pi is the point immediately preceding a statement and pi+1 is the point immediately
following that statement in the same block, or

• pi is the end of some block and pi+1 is the beginning of a successor block.

Reaching definitions:

• A definition of variable x is a statement that assigns, or may assign, a value to x. The
most common forms of definition are assignments to x and statements that read a value from
an I/O device and store it in x.

• These statements certainly define a value for x, and they are referred to as unambiguous
definitions of x. There are certain kinds of statements that may define a value for x; they
are called ambiguous definitions. The most usual forms of ambiguous definitions of x
are:

• A call of a procedure with x as a parameter, or a procedure that can access x because x is in the
scope of the procedure.

• An assignment through a pointer that could refer to x. For example, the assignment *q :=
y is a definition of x if it is possible that q points to x. If nothing is known about q, we must
assume that an assignment through a pointer is a definition of every variable.

• We say a definition d reaches a point p if there is a path from the point immediately
following d to p, such that d is not “killed” along that path. Thus a point can be reached
by an unambiguous definition and by an ambiguous definition of the same variable appearing
later along one path.

Data-flow analysis of structured programs:

• Flow graphs for control flow constructs such as do-while statements have a useful
property: there is a single beginning point at which control enters and a single end point
that control leaves from when execution of the statement is over. We exploit this property
when we talk of the definitions reaching the beginning and the end of statements with the
following syntax.

S → id := E | S ; S | if E then S else S | do S while E
E → id + id | id

• Expressions in this language are similar to those in the intermediate code, but the flow graphs
for statements have restricted forms.

[Figure: flow graphs for the structured constructs S1 ; S2, if E then S1 else S2, and do S1 while E]

• We define a portion of a flow graph called a region to be a set of nodes N that includes a
header, which dominates all other nodes in the region. All edges between nodes in N are
in the region, except for some that enter the header.
• The portion of flow graph corresponding to a statement S is a region that obeys the
further restriction that control can flow to just one outside block when it leaves the region.
• We say that the beginning points of the dummy blocks at the entry and exit of a
statement's region are the beginning and end points, respectively, of the statement. The
equations are an inductive, or syntax-directed, definition of the sets in[S], out[S], gen[S], and
kill[S] for all statements S.
• gen[S] is the set of definitions “generated” by S, while kill[S] is the set of definitions
that never reach the end of S.
• Consider the following data-flow equations for reaching definitions:

i) S → d: a := b + c

gen[S] = {d}
kill[S] = Da - {d}
out[S] = gen[S] U (in[S] - kill[S])

• Observe the rules for a single assignment of variable a. Surely that assignment is a
definition of a, say d. Thus

gen[S] = {d}

• On the other hand, d “kills” all other definitions of a, so we write

kill[S] = Da - {d}

where Da is the set of all definitions in the program for variable a.

ii) S → S1 ; S2

gen[S] = gen[S2] U (gen[S1] - kill[S2])
kill[S] = kill[S2] U (kill[S1] - gen[S2])

in[S1] = in[S]
in[S2] = out[S1]
out[S] = out[S2]

• Under what circumstances is definition d generated by S = S1 ; S2? First of all, if it is
generated by S2, then it is surely generated by S. If d is generated by S1, it will reach the
end of S provided it is not killed by S2. Thus, we write

gen[S] = gen[S2] U (gen[S1] - kill[S2])

• Similar reasoning applies to the killing of a definition, so we have

kill[S] = kill[S2] U (kill[S1] - gen[S2])
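
The two rules given so far (single assignment and sequencing) can be tried out on a tiny
program. The following is only a minimal Python sketch: the names GenKill, assign, seq and
out, the definition labels d1..d3, and the table defs_of (variable -> all of its definitions,
playing the role of Da) are assumptions made for illustration and are not part of the notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenKill:
    gen: frozenset
    kill: frozenset


def assign(d, var, defs_of):
    """Rule (i): statement d: var := ... generates d and kills the other definitions of var."""
    return GenKill(gen=frozenset({d}), kill=frozenset(defs_of[var]) - {d})


def seq(s1, s2):
    """Rule (ii): S -> S1 ; S2."""
    return GenKill(
        gen=s2.gen | (s1.gen - s2.kill),
        kill=s2.kill | (s1.kill - s2.gen),
    )


def out(s, in_s):
    """out[S] = gen[S] U (in[S] - kill[S])."""
    return s.gen | (frozenset(in_s) - s.kill)


# Hypothetical program: d1: a := ... ; d2: b := ... ; d3: a := ...
defs_of = {"a": {"d1", "d3"}, "b": {"d2"}}
s = seq(seq(assign("d1", "a", defs_of), assign("d2", "b", defs_of)),
        assign("d3", "a", defs_of))
print(sorted(s.gen), sorted(s.kill))   # ['d2', 'd3'] ['d1']
```

Here d1 is generated by the first statement but killed by d3, so it is in kill[S] and not in
gen[S], exactly as the sequencing rule predicts.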

Conservative estimation of data-flow information:

• There is a subtle miscalculation in the rules for gen and kill. We have made the
assumption that the conditional expression E in the if and do statements is
“uninterpreted”; that is, there exist inputs to the program that make its branches go either
way.

• We assume that any graph-theoretic path in the flow graph is also an execution path, i.e., a
path that is executed when the program is run with at least one possible input.

• When we compare the computed gen with the “true” gen we discover that the true gen is
always a subset of the computed gen. On the other hand, the true kill is always a superset of
the computed kill.

• These containments hold even after we consider the other rules. It is natural to wonder
whether these differences between the true and computed gen and kill sets present a
serious obstacle to data-flow analysis. The answer lies in the use intended for these data.

• Overestimating the set of definitions reaching a point does not seem serious; it merely
stops us from doing an optimization that we could legitimately do. On the other hand,
underestimating the set of definitions is a fatal error; it could lead us into making a
change in the program that changes what the program computes. For the case of reaching
definitions, then, we call a set of definitions safe or conservative if the estimate is a
superset of the true set of reaching definitions. We call the estimate unsafe if it is not
necessarily a superset of the truth.

• Returning now to the implications of safety on the estimation of gen and kill for reaching
definitions, note that our discrepancies, supersets for gen and subsets for kill, are both in
the safe direction. Intuitively, increasing gen adds to the set of definitions that can reach a
point, and cannot prevent a definition from reaching a place that it truly reached. Decreasing
kill can only increase the set of definitions reaching any given point.

Computation of in and out:

• Many data-flow problems can be solved by synthesized translations similar to those used to
compute gen and kill. Such translations can be used, for example, to determine loop-invariant
computations.

• However, there are other kinds of data-flow information, such as that of the
reaching-definitions problem, for which we must also compute inherited attributes. It turns
out that in is an inherited attribute, and out is a synthesized attribute depending on in. We
intend that in[S] be the set of definitions reaching the beginning of S, taking into account
the flow of control throughout the entire program, including statements outside of S or
within which S is nested.

• The set out[S] is defined similarly for the end of S. It is important to note the distinction
between out[S] and gen[S]. The latter is the set of definitions that reach the end of S without
following paths outside S.

• Assuming we know in[S], we compute out[S] by the equation

out[S] = gen[S] U (in[S] - kill[S])

• Consider the cascade of two statements S1 ; S2, as in the second case. We start by
observing in[S1] = in[S]. Then, we recursively compute out[S1], which gives us in[S2],
since a definition reaches the beginning of S2 if and only if it reaches the end of S1. Now we
can compute out[S2], and this set is equal to out[S].

• Consider the if-statement. Since we have conservatively assumed that control can follow
either branch, a definition reaches the beginning of S1 or S2 exactly when it reaches the
beginning of S:

in[S1] = in[S2] = in[S]

• A definition reaches the end of S if and only if it reaches the end of one or both
substatements; i.e.,

out[S] = out[S1] U out[S2]
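
The way in is passed down (inherited) while out is passed back up (synthesized) can be seen
in a small recursive evaluator. This is a hedged Python sketch covering only the assignment,
sequencing and if rules stated above; the classes Assign, Seq and If, the function
compute_out, and the tables defs_of and in_of are illustrative names, not something defined
in these notes.

```python
from dataclasses import dataclass


@dataclass
class Assign:
    d: str        # label of the definition, e.g. "d1"
    var: str      # variable being defined


@dataclass
class Seq:
    s1: object
    s2: object


@dataclass
class If:
    s1: object    # then-branch
    s2: object    # else-branch


def compute_out(s, in_s, defs_of, in_of):
    """Return out[S] given in[S]; also record in[] for every assignment in in_of."""
    if isinstance(s, Assign):
        in_of[s.d] = set(in_s)
        kill = defs_of[s.var] - {s.d}
        return {s.d} | (in_s - kill)                 # gen U (in - kill)
    if isinstance(s, Seq):
        out_s1 = compute_out(s.s1, in_s, defs_of, in_of)     # in[S1] = in[S]
        return compute_out(s.s2, out_s1, defs_of, in_of)     # in[S2] = out[S1]
    if isinstance(s, If):
        return (compute_out(s.s1, in_s, defs_of, in_of)      # in[S1] = in[S2] = in[S]
                | compute_out(s.s2, in_s, defs_of, in_of))   # out[S] = out[S1] U out[S2]
    raise TypeError("unknown statement kind")


# Hypothetical program: d1: a := ... ; if E then d2: a := ... else d3: b := ...
defs_of = {"a": {"d1", "d2"}, "b": {"d3"}}
prog = Seq(Assign("d1", "a"), If(Assign("d2", "a"), Assign("d3", "b")))
in_of = {}
result = compute_out(prog, set(), defs_of, in_of)
print(sorted(result))   # ['d1', 'd2', 'd3']
```

Definition d1 survives in out[S] because the else-branch does not redefine a, which is the
conservative "either branch may be taken" assumption at work.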

Representation of sets:

• Sets of definitions, such as gen[S] and kill[S], can be represented compactly using bit
vectors. We assign a number to each definition of interest in the flow graph. Then the bit
vector representing a set of definitions will have 1 in position i if and only if the definition
numbered i is in the set.

• The number of a definition statement can be taken as the index of the statement in an array
holding pointers to statements. However, not all definitions may be of interest during
global data-flow analysis. Therefore the numbers of the definitions of interest will typically
be recorded in a separate table.

• A bit vector representation for sets also allows set operations to be implemented
efficiently. The union and intersection of two sets can be implemented by logical or and
logical and, respectively, basic operations in most systems-oriented programming
languages. The difference A - B of sets A and B can be implemented by taking the
complement of B and then using logical and to compute A and (not B).
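
As a small illustration of the bit-vector idea, Python integers can stand in for bit vectors.
This is only a sketch: the definition names d1..d4, the helper names to_bits and from_bits,
and the explicit mask (needed because Python integers are unbounded, so the complement must
be cut back to the number of definitions) are assumptions for the example.

```python
names = ["d1", "d2", "d3", "d4"]                 # definition numbered i is names[i]
index = {d: i for i, d in enumerate(names)}
mask = (1 << len(names)) - 1                     # keeps complements within len(names) bits


def to_bits(defs):
    """Set of definition names -> integer with bit i set iff definition i is in the set."""
    bits = 0
    for d in defs:
        bits |= 1 << index[d]
    return bits


def from_bits(bits):
    """Inverse of to_bits."""
    return {names[i] for i in range(len(names)) if (bits >> i) & 1}


a = to_bits({"d1", "d2", "d3"})
b = to_bits({"d2", "d4"})

union = a | b                    # logical or
intersection = a & b             # logical and
difference = a & (~b & mask)     # A and (not B)

print(sorted(from_bits(difference)))   # ['d1', 'd3']
```

With this encoding, out = gen U (in - kill) becomes the single expression
gen | (in_bits & ~kill & mask).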

Local reaching definitions:

• Space for data-flow information can be traded for time, by saving information only at
certain points and, as needed, recomputing information at intervening points. Basic
blocks are usually treated as a unit during global flow analysis, with attention restricted to
only those points that are the beginnings of blocks.

• Since there are usually many more points than blocks, restricting our effort to blocks is a
significant savings. When needed, the reaching definitions for all points in a block can be
calculated from the reaching definitions for the beginning of the block.

Use-definition chains:

• It is often convenient to store the reaching definition information as “use-definition
chains” or “ud-chains”, which are lists, for each use of a variable, of all the definitions
that reach that use. If a use of variable a in block B is preceded by no unambiguous
definition of a, then the ud-chain for that use of a is the set of definitions in in[B] that are
definitions of a. In addition, if there are ambiguous definitions of a, then all of these for
which no unambiguous definition of a lies between it and the use of a are on the ud-chain
for this use of a.
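
One way to picture ud-chains for a single block is to walk the block from in[B], updating the
reaching set statement by statement. The Python sketch below makes simplifying assumptions
that are not in the notes: every definition is unambiguous, a statement is a hypothetical
triple (label, target, operands), and defs_of maps each variable to all of its definitions.

```python
def ud_chains(in_b, block, defs_of):
    """Return {(label, operand): set of definitions reaching that use} for one block."""
    reach = set(in_b)                       # definitions reaching the current point
    chains = {}
    for label, target, operands in block:
        for v in operands:
            # the ud-chain of this use: reaching definitions that define v
            chains[(label, v)] = {d for d in reach if d in defs_of.get(v, set())}
        # the statement kills the other definitions of target and generates itself
        reach = (reach - defs_of[target]) | {label}
    return chains


# Hypothetical block B with in[B] = {d1, d2, d3}:
#   d5: b := a      (a is not defined earlier in B)
#   d4: a := b      (b was just defined by d5)
defs_of = {"a": {"d1", "d2", "d4"}, "b": {"d3", "d5"}}
chains = ud_chains({"d1", "d2", "d3"}, [("d5", "b", ["a"]), ("d4", "a", ["b"])], defs_of)
print(sorted(chains[("d5", "a")]), sorted(chains[("d4", "b")]))   # ['d1', 'd2'] ['d5']
```

The first use takes its chain from in[B], while the second is reached only by the definition
earlier in the same block. Handling ambiguous definitions would need an extra rule, as the
paragraph above describes.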

Evaluation order:

• The techniques for conserving space during attribute evaluation also apply to the
computation of data-flow information using specifications. Specifically, the only
constraint on the evaluation order for the gen, kill, in and out sets for statements is that
imposed by dependencies between these sets. Having chosen an evaluation order, we are
free to release the space for a set after all uses of it have occurred.

• Earlier, circular dependencies between attributes were not allowed, but we have seen that
data-flow equations may have circular dependencies.

General control flow:

• Data-flow analysis must take all control paths into account. If the control paths are
evident from the syntax, then data-flow equations can be set up and solved in a
syntax-directed manner.

• When programs can contain goto statements or even the more disciplined break and
continue statements, the approach we have taken must be modified to take the actual
control paths into account.

• Several approaches may be taken. The iterative method works for arbitrary flow graphs.
Since the flow graphs obtained in the presence of break and continue statements are
reducible, such constructs can be handled systematically using the interval-based
methods.
• However, the syntax-directed approach need not be abandoned when break and continue
statements are allowed.

CODE IMPROVING TRANSFORMATIONS
• Algorithms for performing the code improving transformations rely on data-flow
information. Here we consider common sub-expression elimination, copy propagation and
transformations for moving loop invariant computations out of loops and for eliminating
induction variables.

• Global transformations are not a substitute for local transformations; both must be
performed.

Elimination of global common subexpressions:

• The available expressions data-flow problem discussed in the last section allows us to
determine if an expression at point p in a flow graph is a common sub-expression. The
following algorithm formalizes the intuitive ideas presented for eliminating common
sub-expressions.

 ALGORITHM: Global common subexpression elimination.

INPUT: A flow graph with available expression information.

OUTPUT: A revised flow graph.

METHOD: For every statement s of the form x := y+z such that y+z is available at the
beginning of s’s block and neither y nor z is defined prior to statement s in that block,
do the following.

1. To discover the evaluations of y+z that reach s’s block, we follow flow graph edges,
searching backward from s’s block. However, we do not go through any
block that evaluates y+z. The last evaluation of y+z in each block
encountered is an evaluation of y+z that reaches s.

2. Create a new variable u.

3. Replace each statement w := y+z found in (1) by
u := y+z
w := u

4. Replace statement s by x := u.

• Some remarks about this algorithm are in order.

1. The search in step (1) of the algorithm for the evaluations of y+z that reach statement s
can also be formulated as a data-flow analysis problem. However, it does not make sense to
solve it for all expressions y+z and all statements or blocks because too much irrelevant
information is gathered.
2. Not all changes made by the algorithm are improvements. We might wish to limit the
number of different evaluations reaching s found in step (1), probably to one.

3. The algorithm will miss the fact that a*z and c*z must have the same value in

a := x+y          c := x+y
b := a*z    vs    d := c*z

because this simple approach to common subexpressions considers only the literal
expressions themselves, rather than the values computed by expressions.
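
The four steps of the global common subexpression algorithm can be sketched on a toy flow
graph. The Python below is an illustration under stated assumptions, not a full
implementation: statements are hypothetical 4-tuples (target, op, arg1, arg2), blocks and
preds are plain dictionaries, the caller has already established that y+z is available at
the start of s's block, the new variable name u is assumed to be unused, and a cycle leading
back into s's block is simply rejected.

```python
def eliminate_global_cse(blocks, preds, b, i, u="u"):
    """Apply steps 1-4 to statement s = blocks[b][i], which has the form x := y op z."""
    x, op, y, z = blocks[b][i]
    found = []                               # (block, index) of evaluations reaching s
    seen, work = set(), list(preds[b])
    while work:                              # step 1: search backward from s's block
        n = work.pop()
        if n in seen:
            continue
        seen.add(n)
        if n == b:
            raise NotImplementedError("sketch does not handle a cycle through s's block")
        hits = [k for k, st in enumerate(blocks[n]) if st[1:] == (op, y, z)]
        if hits:
            found.append((n, hits[-1]))      # last evaluation; do not search past this block
        else:
            work.extend(preds[n])
    for n, k in found:                       # step 3: w := y+z becomes u := y+z ; w := u
        w = blocks[n][k][0]
        blocks[n][k:k + 1] = [(u, op, y, z), (w, "copy", u, None)]
    blocks[b][i] = (x, "copy", u, None)      # step 4: s becomes x := u
    return found


# Hypothetical diamond: B0 -> B1, B0 -> B2, B1 -> B3, B2 -> B3
blocks = {
    "B0": [("y", "copy", "p", None)],
    "B1": [("a", "+", "y", "z")],
    "B2": [("b", "+", "y", "z")],
    "B3": [("x", "+", "y", "z")],
}
preds = {"B0": [], "B1": ["B0"], "B2": ["B0"], "B3": ["B1", "B2"]}
found = eliminate_global_cse(blocks, preds, "B3", 0)
print(blocks["B3"])   # [('x', 'copy', 'u', None)]
```

Because the match is on the literal triple (op, y, z), the sketch shares the limitation noted
in remark 3: it would not recognise a*z and c*z as equal even when a and c hold the same
value.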

Copy propagation:

• Various algorithms introduce copy statements such as x := y. Copies may also be generated
directly by the intermediate code generator, although most of these involve temporaries
local to one block and can be removed by the dag construction. Given a copy statement
s: x := y, we may substitute y for x in all the places where this definition of x is used,
provided the following conditions are met by every such use u of x.

1. Statement s must be the only definition of x reaching u.

2. On every path from s to u, including paths that go through u several times, there are no
assignments to y.

• Condition (1) can be checked using ud-chaining information. We shall set up a new
data-flow analysis problem in which in[B] is the set of copies s: x := y such that every path
from the initial node to the beginning of B contains the statement s, and subsequent to the
last occurrence of s, there are no assignments to y.

 ALGORITHM: Copy propagation.

INPUT: A flow graph G, with ud-chains giving the definitions reaching block B, and with
c_in[B] representing the solution to the equations, that is, the set of copies x := y that reach
block B along every path, with no assignment to x or y following the last occurrence
of x := y on the path. We also need du-chains giving the uses of each definition.

OUTPUT: A revised flow graph.

METHOD: For each copy s: x := y do the following:

1. Determine those uses of x that are reached by this definition of x, namely, s: x := y.

2. Determine whether for every use of x found in (1), s is in c_in[B], where B is the
block of this particular use, and moreover, no definitions of x or y occur prior to this use of
x within B. Recall that if s is in c_in[B] then s is the only definition of x that reaches B.
3. If s meets the conditions of (2), then remove s and replace all uses of x found in (1) by y.
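
Steps (2) and (3) of the copy propagation algorithm can be sketched as a single check-then-rewrite
function. This Python fragment is a rough illustration only: it assumes the uses of x reached
by s and the sets c_in[B] have already been computed elsewhere, identifies a statement by a
hypothetical (block, index) pair, and represents statements as 4-tuples
(target, op, arg1, arg2).

```python
def propagate_copy(blocks, s, uses, c_in):
    """Try to eliminate the copy s: x := y. Returns True if s was removed."""
    sb, si = s
    x, op, y, _ = blocks[sb][si]
    if op != "copy":
        raise ValueError("s must be a copy statement")
    for b, k in uses:                                    # step 2
        if s not in c_in[b]:
            return False
        if any(st[0] in (x, y) for st in blocks[b][:k]):
            return False                                 # x or y defined before the use in B
    for b, k in uses:                                    # step 3: replace uses of x by y
        t, o, a1, a2 = blocks[b][k]
        blocks[b][k] = (t, o, y if a1 == x else a1, y if a2 == x else a2)
    del blocks[sb][si]                                   # step 3: remove s
    return True


# Hypothetical graph: B1 flows to B2 and B3; the copy in B1 reaches both uses.
blocks = {
    "B1": [("x", "copy", "y", None)],
    "B2": [("a", "+", "x", 1)],
    "B3": [("b", "*", "x", "x")],
}
s = ("B1", 0)
uses = [("B2", 0), ("B3", 0)]                            # from the du-chain of s (assumed given)
c_in = {"B1": set(), "B2": {s}, "B3": {s}}               # assumed precomputed
removed = propagate_copy(blocks, s, uses, c_in)
print(removed, blocks["B2"], blocks["B3"])
```

All uses are validated before any statement is rewritten, so a failed check leaves the flow
graph untouched.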

Detection of loop-invariant computations:

• Ud-chains can be used to detect those computations in a loop that are loop-invariant, that
is, whose value does not change as long as control stays within the loop. A loop is a region
consisting of a set of blocks with a header that dominates all the other blocks, so the only
way to enter the loop is through the header.

• If an assignment x := y+z is at a position in the loop where all possible definitions of y
and z are outside the loop, then y+z is loop-invariant because its value will be the same
each time x := y+z is encountered. Having recognized that the value of x will not change,
consider v := x+w, where w could only have been defined outside the loop; then x+w is also
loop-invariant.

 ALGORITHM: Detection of loop-invariant computations.

INPUT: A loop L consisting of a set of basic blocks, each block containing a sequence of
three-address statements. We assume ud-chains are available for the individual
statements.

OUTPUT: The set of three-address statements that compute the same value each time
executed, from the time control enters the loop L until control next leaves L.

METHOD: We shall give a rather informal specification of the algorithm, trusting that
the principles will be clear.

1. Mark “invariant” those statements whose operands are all either constant or have all
their reaching definitions outside L.

2. Repeat step (3) until at some repetition no new statements are marked “invariant”.

3. Mark “invariant” all those statements not previously so marked all of whose
operands either are constant, have all their reaching definitions outside L, or have
exactly one reaching definition, and that definition is a statement in L marked
invariant.
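
The marking procedure is a small fixed-point iteration, which the following Python sketch
tries to capture. The data layout is an assumption made for the example: stmts maps a
statement label to (target, operands) for the statements of L, ud maps (label, operand) to
the set of definition labels reaching that use, and numeric Python values stand for constant
operands.

```python
def loop_invariants(stmts, ud):
    """Return the labels of loop-invariant statements, in the order they are marked."""
    in_loop = set(stmts)
    invariant = []

    def operand_ok(label, v):
        if isinstance(v, (int, float)):          # constant operand
            return True
        defs = ud[(label, v)]
        if defs.isdisjoint(in_loop):             # all reaching definitions outside L
            return True
        # exactly one reaching definition, and it is already marked invariant
        return len(defs) == 1 and next(iter(defs)) in invariant

    changed = True
    while changed:                               # repeat until nothing new is marked
        changed = False
        for label, (_, operands) in stmts.items():
            if label not in invariant and all(operand_ok(label, v) for v in operands):
                invariant.append(label)
                changed = True
    return invariant


# Hypothetical loop: s1: x := y+z   s2: v := x+w   s3: i := i+1
stmts = {"s1": ("x", ["y", "z"]), "s2": ("v", ["x", "w"]), "s3": ("i", ["i", 1])}
ud = {
    ("s1", "y"): {"d_y"}, ("s1", "z"): {"d_z"},      # defined outside L
    ("s2", "x"): {"s1"},  ("s2", "w"): {"d_w"},
    ("s3", "i"): {"d_i", "s3"},                      # reached from outside and from s3 itself
}
print(loop_invariants(stmts, ud))   # ['s1', 's2']
```

Steps (1) and (3) are folded into one loop here, since on the first pass no statement is yet
marked and the third alternative cannot apply until something has been found.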

Performing code motion:

• Having found the invariant statements within a loop, we can apply to some of them an
optimization known as code motion, in which the statements are moved to the pre-header of
the loop. The following three conditions ensure that code motion does not change what the
program computes. Consider s: x := y+z.

1. The block containing s dominates all exit nodes of the loop, where an exit of a loop is a
node with a successor not in the loop.

2. There is no other statement in the loop that assigns to x. Again, if x is a temporary
assigned only once, this condition is surely satisfied and need not be checked.
3. No use of x in the loop is reached by any definition of x other than s. This condition too
will be satisfied, normally, if x is a temporary.

 ALGORITHM: Code motion.

INPUT: A loop L with ud-chaining information and dominator information.

OUTPUT: A revised version of the loop with a pre-header and some statements moved to
the pre-header.

METHOD:

1. Use the loop-invariant computation algorithm to find loop-invariant statements.

2. For each statement s defining x found in step (1), check:

i) That it is in a block that dominates all exits of L,

ii) That x is not defined elsewhere in L, and

iii) That all uses in L of x can only be reached by the definition of x in statement s.

3. Move, in the order found by the loop-invariant algorithm, each statement s found in
(1) and meeting conditions (2i), (2ii), (2iii), to a newly created pre-header,
provided any operands of s that are defined in loop L have previously had their
definition statements moved to the pre-header.

• To understand why no change to what the program computes can occur, conditions (2i)
and (2ii) of this algorithm assure that the value of x computed at s must be the value of x
after any exit block of L. When we move s to a pre-header, s will still be the definition of x
that reaches the end of any exit block of L. Condition (2iii) assures that any uses of x within
L did, and will continue to, use the value of x computed by s.
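
The three checks of step (2) can be written down directly once the dominator sets and
ud-chains are at hand. The Python sketch below assumes a hypothetical layout: stmts maps a
label to (block, target, operands) for the statements of L, dom maps each block to the set of
blocks that dominate it, exits is the set of exit blocks, and ud maps (label, operand) to the
definitions reaching that use.

```python
def may_move(s, stmts, dom, exits, ud):
    """Check conditions (2i), (2ii) and (2iii) for moving statement s to the pre-header."""
    block, x, _ = stmts[s]
    dominates_exits = all(block in dom[e] for e in exits)                      # (2i)
    only_def = all(t != x for l, (_, t, _) in stmts.items() if l != s)         # (2ii)
    uses_ok = all(ud[(l, x)] == {s}
                  for l, (_, _, ops) in stmts.items() if x in ops)             # (2iii)
    return dominates_exits and only_def and uses_ok


# Hypothetical loop: header B2, a conditional block B3, and exit block B4.
dom = {"B2": {"B1", "B2"}, "B3": {"B1", "B2", "B3"}, "B4": {"B1", "B2", "B4"}}
exits = {"B4"}
stmts = {
    "s1": ("B2", "x", ["y", "z"]),     # x := y+z in the header
    "s2": ("B3", "t", ["a", "b"]),     # t := a+b in a block that may be skipped
    "s3": ("B4", "v", ["x", 1]),       # v := x+1 uses x
}
ud = {("s3", "x"): {"s1"}}
print(may_move("s1", stmts, dom, exits, ud), may_move("s2", stmts, dom, exits, ud))
```

In this example s1 passes all three checks, while s2 fails (2i) because B3 does not dominate
the exit B4.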

Alternative code motion strategies:

• Condition (1) can be relaxed if we are willing to take the risk that we may actually
increase the running time of the program a bit; of course, we never change what the program
computes. The relaxed version of code motion condition (1) is that we may move a
statement s assigning x only if:

1’. The block containing s either dominates all exits of the loop, or x is not used outside the
loop. For example, if x is a temporary variable, we can be sure that the value will be used
only in its own block.

• If the code motion algorithm is modified to use condition (1’), occasionally the running
time will increase, but we can expect to do reasonably well on the average. The modified
algorithm may move to the pre-header certain computations that may not be executed in the
loop. Not only does this risk slowing down the program significantly, it may also cause an
error in certain circumstances.

• Even if none of the conditions (2i), (2ii), (2iii) of the code motion algorithm is met by an
assignment x := y+z, we can still take the computation y+z outside a loop. Create a new
temporary t, and set t := y+z in the pre-header. Then replace x := y+z by x := t in the loop.
In many cases we can propagate out the copy statement x := t.

Maintaining data-flow information after code motion:

• The transformations of the code motion algorithm do not change ud-chaining information,
since by conditions (2i), (2ii), and (2iii), all uses of the variable assigned by a moved
statement s that were reached by s are still reached by s from its new position.

• Definitions of variables used by s are either outside L, in which case they reach the
pre-header, or they are inside L, in which case by step (3) they were moved to the pre-header
ahead of s.

• If the ud-chains are represented by lists of pointers to pointers to statements, we can
maintain ud-chains when we move statement s by simply changing the pointer to s when we
move it. That is, we create for each statement s a pointer ps, which always points to s.

• We put a pointer to ps on each ud-chain containing s. Then, no matter where we move s, we
have only to change ps, regardless of how many ud-chains s is on.

• The dominator information is changed slightly by code motion. The pre-header is now the
immediate dominator of the header, and the immediate dominator of the pre-header is
the node that formerly was the immediate dominator of the header. That is, the pre-header is
inserted into the dominator tree as the parent of the header.

Elimination of induction variables:

• A variable x is called an induction variable of a loop L if every time the variable x
changes value, it is incremented or decremented by some constant. Often, an induction
variable is incremented by the same constant each time around the loop, as in a loop
headed by for i := 1 to 10.

• However, our methods deal with variables that are incremented or decremented zero, one,
two, or more times as we go around a loop. The number of changes to an induction variable
may even differ at different iterations.

• A common situation is one in which an induction variable, say i, indexes an array, and
some other induction variable, say t, whose value is a linear function of i, is the actual
offset used to access the array. Often, the only use made of i is in the test for loop
termination. We can then get rid of i by replacing its test by one on t.

• We shall look for basic induction variables, which are those variables i whose only
assignments within loop L are of the form i := i+c or i := i-c, where c is a constant.
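
Finding basic induction variables amounts to one scan over the assignments in the loop. The
Python sketch below assumes statements are hypothetical 4-tuples (target, op, arg1, arg2) and
that integer Python values stand for constants; a variable qualifies only if every one of its
assignments in L has the form i := i+c or i := i-c.

```python
def basic_induction_variables(loop_stmts):
    """Return the variables whose only assignments in the loop are i := i +/- constant."""
    candidates, rejected = set(), set()
    for target, op, a1, a2 in loop_stmts:
        if op in ("+", "-") and a1 == target and isinstance(a2, int):
            candidates.add(target)
        else:
            rejected.add(target)         # some assignment to target has another form
    return candidates - rejected


# Hypothetical loop body
loop_stmts = [
    ("i", "+", "i", 1),      # i := i+1   basic induction variable
    ("t", "*", 4, "i"),      # t := 4*i   linear in i, but not basic
    ("j", "-", "j", 2),      # j := j-2
    ("j", "+", "j", "n"),    # j := j+n   n is not a constant, so j is rejected
]
print(basic_induction_variables(loop_stmts))   # {'i'}
```

The variable t in the example is the kind of derived induction variable the earlier paragraph
mentions: its value is a linear function of i, so it belongs to the family of i even though
it is not basic.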

 ALGORITHM: Elimination of induction variables.
INPUT: A loop L with reaching definition information, loop-invariant computation
information and live variable information.

OUTPUT: A revised loop.

METHOD:

1. Consider each basic induction variable i whose only uses are to compute other
induction variables in its family and in conditional branches. Take some j in i’s
family, preferably one such that c and d in its triple (i, c, d) are as simple as possible, and
modify each test that i appears in to use j instead. We assume in the following that c is
positive. A test of the form ‘if i relop x goto B’, where x is not an induction
variable, is replaced by

r := c*x /* r := x if c is 1 */
r := r+d /* omit if d is 0 */
if j relop r goto B

where r is a new temporary. The case ‘if x relop i goto B’ is handled
analogously. If there are two induction variables i1 and i2 in the test if i1 relop i2
goto B, then we check if both i1 and i2 can be replaced. The easy case is when we have j1
with triple (i1, c1, d1) and j2 with triple (i2, c2, d2), and c1 = c2 and d1 = d2. Then,
i1 relop i2 is equivalent to j1 relop j2.

2. Now, consider each induction variable j for which a statement j := s was
introduced. First check that there can be no assignment to s between the
introduced statement j := s and any use of j. In the usual situation, j is used in the block
in which it is defined, simplifying this check; otherwise, reaching definitions
information, plus some graph analysis, is needed to implement the check. Then
replace all uses of j by uses of s and delete statement j := s.
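
The test replacement in step (1) is mechanical enough to generate. The Python sketch below
emits the replacement three-address code as strings; it assumes that j's triple means
j = c*i + d with c positive, and the function name replace_test and the default temporary
name r are illustrative choices, not part of the algorithm as stated.

```python
def replace_test(relop, x, target, j, c, d, temp="r"):
    """Rewrite 'if i relop x goto target' as a test on j, where j = c*i + d and c > 0."""
    if c <= 0:
        raise ValueError("the rewrite assumes c is positive")
    code = []
    # r := c*x, or simply r := x when c is 1
    code.append(f"{temp} := {x}" if c == 1 else f"{temp} := {c}*{x}")
    if d != 0:
        code.append(f"{temp} := {temp}+{d}")      # omitted when d is 0
    code.append(f"if {j} {relop} {temp} goto {target}")
    return code


# 'if i < n goto B2' with j = 4*i
print(replace_test("<", "n", "B2", "j", 4, 0))
# 'if i < n goto B2' with j = i + 3
print(replace_test("<", "n", "B2", "j", 1, 3))
```

The requirement that c be positive matters because multiplying both sides of the comparison
by a negative constant would reverse the direction of relop.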
