Compiler Notes Arv
Lexical Analysis:
The Role of the Lexical Analyzer, Specification of Tokens, Recognition of Tokens, Input Buffering,
elementary scanner design and its implementation (Lex), Applying concepts of Finite Automata for
recognition of tokens.
Introduction to Compiling:
INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Fig 1.1: Language Processing System
Preprocessor
A preprocessor produces input to compilers. It may perform the following functions.
1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer
constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessor: These preprocessors augment older languages with more modern flow-of-control
and data structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language by what
amounts to built-in macros.
COMPILER
A compiler is a translator program that takes a program written in a high-level language (HLL), the source
program, and translates it into an equivalent program in a machine-level language (MLL), the target
program. An important part of a compiler is reporting errors to the programmer.
Fig 1.2: Structure of Compiler
Executing a program written in an HLL basically has two parts. The source
program must first be compiled (translated) into an object program. Then the resulting object program is loaded
into memory and executed.
Fig 1.3: Execution process of source program in Compiler
ASSEMBLER
Programmers found it difficult to write or read programs in machine language. They began to use a
mnemonic (symbol) for each machine instruction, which they would subsequently translate into
machine language. Such a mnemonic machine language is now called an assembly language.
Programs known as assemblers were written to automate the translation of assembly language into
machine language. The input to an assembler program is called the source program; the output is a
machine language translation (object program).
INTERPRETER
An interpreter is a program that appears to execute a source program as if it were machine language.
Fig 1.4: Execution in Interpreter
Advantages:
Modification of the user program can be easily made and implemented as execution proceeds.
The type of object that a variable denotes may change dynamically.
Debugging a program and finding errors is a simplified task for a program used for interpretation.
The interpreter for the language makes it machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is more.
LOADER AND LINK-EDITOR:
Once the assembler produces an object program, that program must be placed into memory and
executed. The assembler could place the object program directly in memory and transfer control to it,
thereby causing the machine language program to be executed. This would waste core by leaving the
assembler in memory while the user's program was being executed. Also, the programmer would
have to retranslate the program with each execution, thus wasting translation time. To overcome these
problems of wasted translation time and memory, system programmers developed another
component called a loader.
"A loader is a program that places programs into memory and prepares them for execution." It would be more
efficient if subroutines could be translated into object form that the loader could "relocate" directly behind
the user's program. The task of adjusting programs so they may be placed in arbitrary core locations is
called relocation. Relocation loaders perform four functions.
TRANSLATOR
A translator is a program that takes as input a program written in one language and produces as
output a program in another language. Besides program translation, the translator performs another
very important role, error detection. Any violation of the HLL specification would be detected and
reported to the programmer. The important roles of a translator are:
1. Translating the HLL program input into an equivalent ML program.
2. Providing diagnostic messages wherever the programmer violates the specification of the HLL.
LIST OF COMPILERS
1. Ada compilers
2. ALGOL compilers
3. BASIC compilers
4. C# compilers
5. C compilers
6. C++ compilers
7. COBOL compilers
8. Common Lisp compilers
9. ECMAScript interpreters
10. Fortran compilers
11. Java compilers
12. Pascal compilers
13. PL/I compilers
14. Python compilers
15. Smalltalk compilers
STRUCTURE OF THE COMPILER DESIGN
The compilation process is partitioned into a number of sub-processes called 'phases'.
Lexical Analysis:-
The LA, or scanner, reads the source program one character at a time, carving the source program into a sequence of
atomic units called tokens.
Fig 1.5: Phases of Compiler
Syntax Analysis:-
The second stage of translation is called syntax analysis or parsing. In this phase expressions,
statements, declarations etc. are identified by using the results of lexical analysis. Syntax analysis is aided
by using techniques based on the formal grammar of the programming language.
Intermediate Code Generation:-
An intermediate representation of the final machine language code is produced. This phase bridges
the analysis and synthesis phases of translation.
Code Optimization:-
This is an optional phase designed to improve the intermediate code so that the output runs faster and
takes less space.
Code Generation:-
The last phase of translation is code generation. A number of optimizations to reduce the length of the
machine language program are carried out during this phase. The output of the code generator is the
machine language program for the specified computer.
Table Management (or) Book-keeping:-
This portion keeps the names used by the program
and records essential information about each. The data structure used to record this information is
called a 'Symbol Table'.
Error Handlers:-
It is invoked when an error in the source program is detected.
The output of the LA is a stream of
tokens, which is passed to the next phase, the syntax analyzer or parser. The SA groups the tokens
together into syntactic structures called expressions. Expressions may further be combined to form
statements. The syntactic structure can be regarded as a tree whose leaves are the tokens; such trees are called
parse trees.
The parser has two functions. It checks whether the tokens from the lexical analyzer occur in patterns that are
permitted by the specification of the source language. It also imposes on the tokens a tree-like structure
that is used by the subsequent phases of the compiler.
For example, if a program contains the expression A+/B, then after lexical analysis this expression might
appear to the syntax analyzer as the token sequence id+/id. On seeing the /, the syntax analyzer
should detect an error situation, because the presence of these two adjacent binary operators violates
the formation rules of an expression. Syntax analysis also makes explicit the hierarchical structure of
the incoming token stream by identifying which parts of the token stream should be grouped.
For example, A/B*C has two possible interpretations:
1. divide A by B and then multiply by C, or
2. multiply B by C and then use the result to divide A.
Each of these two interpretations can be represented in terms of a parse tree.
Intermediate Code Generation:-
The intermediate code generator uses the structure produced by the syntax analyzer to create a
stream of simple instructions. Many styles of intermediate code are possible. One common style uses
instructions with one operator and a small number of operands. The output of the syntax analyzer is
some representation of a parse tree. The intermediate code generation phase transforms this parse tree
into an intermediate language representation of the source program.
Code Optimization
This is an optional phase designed to improve the intermediate code so that the output runs faster and
takes less space. Its output is another intermediate code program that does the same job as the
original, but in a way that saves time and/or space.
a. Local Optimization:-
There are local transformations that can be applied to a program to make an improvement. For
example,
    if A > B goto L2
    goto L3
    L2:
This can be replaced by a single statement:
    if A <= B goto L3
Another important local optimization is the elimination of common sub-expressions.
    A := B + C + D
    E := B + C + F
might be evaluated as
    T1 := B + C
    A := T1 + D
    E := T1 + F
This takes advantage of the common sub-expression B + C.
b. Loop Optimization:-
Another important source of optimization concerns increasing the speed of loops. A
typical loop improvement is to move a computation that produces the same result each time around
the loop to a point in the program just before the loop is entered.
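The loop improvement described above can be sketched in C. This is only an illustration of the idea, not what any particular compiler emits; the function names and the `limit * 2` computation are made up for the example.

```c
#include <stddef.h>

/* Before: limit * 2 is recomputed on every iteration even though
   neither operand changes inside the loop. */
long sum_scaled_naive(const int *v, size_t n, int limit) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        int scale = limit * 2;          /* loop-invariant computation */
        total += (long)v[i] * scale;
    }
    return total;
}

/* After: the invariant computation is moved to just before the loop. */
long sum_scaled_hoisted(const int *v, size_t n, int limit) {
    int scale = limit * 2;              /* computed once */
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += (long)v[i] * scale;
    }
    return total;
}
```

Both functions return the same value for the same arguments; the second simply does less work per iteration.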
Code Generator:-
Code Generator produces the object code by deciding on the memory locations for data, selecting
code to access each datum and selecting the registers in which each computation is to be done. Many
computers have only a few high speed registers in which computations can be performed quickly. A
good code generator would attempt to utilize registers as efficiently as possible.
Table Management OR Book-keeping:-
A compiler needs to collect information about all the data objects that appear in the source program.
The information about data objects is collected by the early phases of the compiler, the lexical and
syntactic analyzers. The data structure used to record this information is called the Symbol Table.
Error Handling:-
One of the most important functions of a compiler is the detection and reporting of errors in the
source program. The error message should allow the programmer to determine exactly where the
errors have occurred. Errors may occur in all of the phases of a compiler.
Whenever a phase of the compiler discovers an error, it must report the error to the error handler,
which issues an appropriate diagnostic message. Both the table-management and error-handling
routines interact with all phases of the compiler.
Example:
Fig 1.6: Compilation process of a source code through phases
2. A Simple One-Pass Compiler:
OVERVIEW
• Language Definition
o Appearance of programming language:
Vocabulary: Regular expression
Syntax: Backus-Naur Form (BNF) or Context-Free Grammar (CFG)
o Semantics: Informal language or some examples
• Fig 2.1. Structure of our compiler front end
SYNTAX DEFINITION
• To specify the syntax of a language: CFG and BNF
o Example: the if-else statement in C has the form  statement → if ( expression )
statement else statement
• An alphabet of a language is a set of symbols.
o Examples: {0,1} for a binary number system (language) = {0, 1, 100, 101, ...}
{a,b,c} for language = {a, b, c, ac, abcc, ...}
{if, (, ), else, ...} for if statements = {if (a==1) goto 10, if ...}
• A string over an alphabet
o is a sequence of zero or more symbols from the alphabet.
o Examples: 0, 1, 10, 00, 11, 111, ... are strings over the alphabet {0,1}
o The null string is the string that does not contain any symbol of the alphabet.
• Language
o is a subset of all the strings over a given alphabet.
o Alphabets Ai              Languages Li for Ai
  A0 = {0,1}                L0 = {0, 1, 100, 101, ...}
  A1 = {a,b,c}              L1 = {a, b, c, ac, abcc, ...}
  A2 = {all of C tokens}    L2 = {all sentences of C programs}
• Example 2.1. Grammar for expressions consisting of digits and plus and minus signs.
o Language of expressions L = {9-5+2, 3-1, ...}
o The productions of the grammar for this language L are:
list → list + digit
list → list - digit
list → digit
digit → 0|1|2|3|4|5|6|7|8|9
o list, digit: grammar variables, grammar symbols
o 0,1,2,3,4,5,6,7,8,9,-,+: tokens, terminal symbols
• Convention for specifying a grammar
o Terminal symbols: boldface strings such as if, num, id
o Nonterminal symbols, grammar symbols: italicized names such as list, digit, A, B
• Grammar G = (N, T, P, S)
o N: a set of nonterminal symbols
o T: a set of terminal symbols, tokens
o P: a set of production rules
o S: a start symbol, S ∈ N
• Grammar G for a language L = {9-5+2, 3-1, ...}
o G = (N, T, P, S)
N = {list, digit}
T = {0,1,2,3,4,5,6,7,8,9,-,+}
P: list → list + digit
list → list - digit
list → digit
digit → 0|1|2|3|4|5|6|7|8|9
S = list
• Some definitions for a language L and its grammar G
• Derivation:
A sequence of replacements S ⇒ α1 ⇒ α2 ⇒ … ⇒ αn is a derivation of αn.
Example: a derivation of 1+9 from the grammar G
• leftmost derivation
list ⇒ list + digit ⇒ digit + digit ⇒ 1 + digit ⇒ 1 + 9
• rightmost derivation
list ⇒ list + digit ⇒ list + 9 ⇒ digit + 9 ⇒ 1 + 9
• Language of grammar L(G)
L(G) is the set of sentences that can be generated from the grammar G.
L(G) = { x | S ⇒* x }, where x is a sequence of terminal symbols
• Example: Consider a grammar G = (N, T, P, S):
N = {S}   T = {a, b}
S = S   P = { S → aSb | ε }
• Is aabb a sentence of L(G)? (derivation of string aabb)
S ⇒ aSb ⇒ aaSbb ⇒ aaεbb ⇒ aabb (or S ⇒* aabb), so aabb ∈ L(G)
• There is no derivation for aa, so aa ∉ L(G)
• Note L(G) = { a^n b^n | n ≥ 0 }, where a^n b^n means n a's followed by n b's.
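Membership in L(G) for the grammar S → aSb | ε can be checked with a small recursive function that mirrors the two productions. The sketch below is one possible way to do it; the names `match_S` and `in_language` are made up for this illustration.

```c
#include <stddef.h>

/* Tries to match S → aSb | ε starting at s[*pos]; advances *pos past
   whatever S derived. Returns 1 on success, 0 on failure. */
static int match_S(const char *s, size_t *pos) {
    if (s[*pos] == 'a') {                 /* S → a S b */
        (*pos)++;
        if (!match_S(s, pos)) return 0;
        if (s[*pos] != 'b') return 0;     /* the matching b is missing */
        (*pos)++;
        return 1;
    }
    return 1;                             /* S → ε */
}

/* Returns 1 if the whole string is in L(G) = { a^n b^n | n >= 0 }. */
int in_language(const char *s) {
    size_t pos = 0;
    return match_S(s, &pos) && s[pos] == '\0';
}
```

With this sketch, in_language("aabb") succeeds, while in_language("aa") fails because no b follows the inner S.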
• Parse Tree
A derivation can be conveniently represented by a derivation tree (parse tree).
o The root is labeled by the start symbol.
o Each leaf is labeled by a token or ε.
o Each interior node is labeled by a nonterminal symbol.
o When a production A → x1 … xn is applied, nodes labeled by x1 … xn are made
children of the node labeled by A.
• root: the start symbol
• internal nodes: nonterminal
• leaf nodes: terminal
o Example G:
list → list + digit | list - digit | digit
digit → 0|1|2|3|4|5|6|7|8|9
• leftmost derivation for 9-5+2,
list ⇒ list + digit ⇒ list - digit + digit ⇒ digit - digit + digit ⇒ 9 - digit + digit
⇒ 9 - 5 + digit ⇒ 9 - 5 + 2
• rightmost derivation for 9-5+2,
list ⇒ list + digit ⇒ list + 2 ⇒ list - digit + 2 ⇒ list - 5 + 2
⇒ digit - 5 + 2 ⇒ 9 - 5 + 2
parse tree for 9-5+2
Fig 2.2. Parse tree for 9-5+2 according to the grammar in Example
Ambiguity
• A grammar is said to be ambiguous if the grammar has more than one parse tree for a given
string of tokens.
• Example 2.5. Suppose a grammar G that does not distinguish between lists and digits as in
Example 2.1.
• G: string → string + string | string - string | 0|1|2|3|4|5|6|7|8|9
Fig 2.3. Two parse trees for 9-5+2
• 9-5+2 has 2 parse trees => grammar G is ambiguous.
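To see how quickly ambiguity grows under this grammar, one can count parse trees by choosing which operator sits at the root. The function below is a hypothetical helper written for this note, not part of any compiler; it only counts trees and does not build them.

```c
/* Number of parse trees for a string with `operands` digits separated by
   + or - under  string → string + string | string - string | digit. */
unsigned long count_parse_trees(int operands) {
    if (operands <= 0) return 0;
    if (operands == 1) return 1;          /* string → digit */
    unsigned long total = 0;
    /* Choose the operator at the root: it splits the operands into a
       left group of `left` digits and a right group with the rest. */
    for (int left = 1; left < operands; left++)
        total += count_parse_trees(left) * count_parse_trees(operands - left);
    return total;
}
```

For 9-5+2 (three operands) the count is 2, matching the two trees (9-5)+2 and 9-(5+2).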
Associativity of operators
An operator is said to be left associative if an operand with operators on both sides of it is taken by the
operator to its left; it is right associative if the operand is taken by the operator to its right.
eg) left: 9+5+2 ≡ (9+5)+2; right: a=b=c ≡ a=(b=c)
• Left Associative Grammar:
list → list + digit | list - digit | digit
digit → 0|1|…|9
• Right Associative Grammar:
right → letter = right | letter
letter → a|b|…|z
Fig 2.4. Parse trees for left- and right-associative operators.
Precedence of operators
We say that an operator (*) has higher precedence than another operator (+) if the operator (*) takes
its operands before the other operator (+) does.
• ex. 9+5*2 ≡ 9+(5*2), 9*5+2 ≡ (9*5)+2
• left associative operators: +, -, *, /
• right associative operators: =, **
• Syntax of full expressions
operator    associativity    precedence
+, -        left             1 (low)
*, /        left             2 (high)
• Syntax of statements
o stmt → id = expr ;
| if ( expr ) stmt ;
| if ( expr ) stmt else stmt ;
| while ( expr ) stmt ;
expr → expr + term | expr - term | term
term → term * factor | term / factor | factor
factor → digit | ( expr )
digit → 0 | 1 | … | 9
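The expr/term/factor layering is what gives * and / higher precedence than + and -. A minimal evaluator sketch makes this visible; the left-recursive productions are written as loops (which keeps the operators left associative), operands are single digits, the input must contain no spaces, and error handling is deliberately left out. The names `evaluate` and the cursor `p` are made up for the example.

```c
#include <ctype.h>

static const char *p;            /* cursor into the input (hypothetical helper state) */

static int expr(void);

/* factor → digit | ( expr ) */
static int factor(void) {
    if (*p == '(') {
        p++;                     /* consume '(' */
        int v = expr();
        if (*p == ')') p++;      /* consume ')' */
        return v;
    }
    if (isdigit((unsigned char)*p)) return *p++ - '0';
    return 0;                    /* no error handling in this sketch */
}

/* term → term * factor | term / factor | factor */
static int term(void) {
    int v = factor();
    while (*p == '*' || *p == '/') {
        char op = *p++;
        int rhs = factor();
        if (op == '*') v *= rhs;
        else if (rhs != 0) v /= rhs;   /* division by zero is skipped here */
    }
    return v;
}

/* expr → expr + term | expr - term | term */
static int expr(void) {
    int v = term();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        int rhs = term();
        v = (op == '+') ? v + rhs : v - rhs;
    }
    return v;
}

int evaluate(const char *s) { p = s; return expr(); }
```

Because term is called from inside expr, a product is fully evaluated before the surrounding sum sees it, so evaluate("9+5*2") groups as 9+(5*2).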
SYNTAX-DIRECTED TRANSLATION (SDT)
A formalism for specifying translations for programming language constructs.
(attributes of a construct: type, string, location, etc.)
• Syntax-directed definition (SDD) for the translation of constructs
• Syntax-directed translation scheme (SDTS) for specifying translation
Postfix notation for an expression E
• If E is a variable or constant, then the postfix notation for E is E itself (E.t ≡ E).
• If E is an expression of the form E1 op E2, where op is a binary operator,
o E1' is the postfix of E1,
o E2' is the postfix of E2,
o then E1' E2' op is the postfix for E1 op E2.
• If E is (E1), and E1' is the postfix of E1,
then E1' is the postfix for E.
eg) 9-5+2 ⇒ 95-2+
9-(5+2) ⇒ 952+-
Syntax-Directed Definition (SDD) for translation
• An SDD is a set of semantic rules, one set defined for each production, used for
translation.
• A translation is an input-output mapping. To translate an input X,
o construct a parse tree for X.
o synthesize attributes over the parse tree.
Suppose a node n in the parse tree is labeled by X, and X.a denotes the value of
attribute a of X at that node.
Compute X's attributes X.a using the semantic rules associated with X.
Fig 2.5. Syntax-directed definition for infix to postfix translation.
An example of synthesized attributes for input X = 9-5+2
Fig 2.6. Attribute values at nodes in a parse tree.
Syntax-directed Translation Schemes (SDTS)
• A translation scheme is a context-free grammar in which program fragments called
translation actions are embedded within the right sides of the productions.
production            SDD for infix to postfix notation      SDTS
list → list + term    list.t = list.t || term.t || "+"       list → list + term {print("+")}
• {print("+");} : translation (semantic) action.
• An SDTS generates an output for each sentence x generated by the underlying grammar by
executing the actions in the order they appear during a depth-first traversal of a parse tree for x.
1. Design translation schemes (SDTS) for translation
2. Translate:
a) parse the input string x and
b) emit the action results encountered during the depth-first traversal of the parse tree.
Fig 2.7. Example of a depth-first traversal of a tree.
Fig 2.8. An extra leaf is constructed for a semantic action.
Example 2.8.
• SDD vs. SDTS for infix to postfix translation.
• Actions translating the input 9-5+2
Fig 2.9. Actions translating 9-5+2 into 95-2+.
1) Parse.
2) Translate.
Do we have to maintain the whole parse tree?
No. Semantic actions are performed during parsing, and we don't need the nodes whose semantic
actions are already done.
PARSING
if token string x ∈ L(G), then parse tree
else error message
Top-Down parsing
1. At node n labeled with nonterminal A, select one of the productions whose left part is
A and construct children of node n with the symbols on the right side of that production.
2. Find the next node at which a sub-tree is to be constructed.
ex. G:
type → simple
| ↑id
| array [ simple ] of type
simple → integer
| char
| num dotdot num
Fig 2.10. Top-down parsing while scanning the input from left to right.
Fig 2.11. Steps in the top-down construction of a parse tree.
• The selection of a production for a nonterminal may involve trial and
error. => backtracking
• G: { S → aSb | c | ab }
According to the top-down parsing procedure, are acb, aabb ∈ L(G)?
• S/acb ⇒ aSb/acb ⇒ aSb/acb ⇒ aaSbb/acb ⇒ X
(S→aSb) move (S→aSb) backtracking
⇒ aSb/acb ⇒ acb/acb ⇒ acb/acb ⇒ acb/acb
(S→c) move move
so, acb ∈ L(G)
It is finished in 7 steps including one backtracking.
• S/aabb ⇒ aSb/aabb ⇒ aSb/aabb ⇒ aaSbb/aabb ⇒ aaSbb/aabb ⇒ aaaSbbb/aabb ⇒ X
(S→aSb) move (S→aSb) move (S→aSb) backtracking
⇒ aaSbb/aabb ⇒ aacbb/aabb ⇒ X
(S→c) backtracking
⇒ aaSbb/aabb ⇒ aaabbb/aabb ⇒ X
(S→ab) backtracking
⇒ aaSbb/aabb ⇒ X
backtracking
⇒ aSb/aabb ⇒ acb/aabb
(S→c) backtracking
⇒ aSb/aabb ⇒ aabb/aabb ⇒ aabb/aabb ⇒ aabb/aabb ⇒ aabb/aabb
(S→ab) move move move
so, aabb ∈ L(G)
but the process is laborious. It needs 18 steps including 5 backtrackings.
• procedure of top-down parsing
Let the pointed grammar symbol and the pointed input symbol be g and a, respectively.
o if (g ∈ N) select and expand a production whose left part equals g, taking the one next after the current
production.
else if (g = a) then advance g and a to the next symbols.
else if (g ≠ a) backtrack:
move the input pointer a back by the number of symbols
already matched by the current production;
eliminate the right-side symbols of the current production and let the pointed
symbol g be the left-side symbol of the current production.
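The trial-and-error procedure can be sketched as a recursive function for the grammar G: { S → aSb | c | ab }. This is a simplified illustration: it returns the first alternative that matches, which is enough for this particular grammar because at most one alternative can succeed at any input position; a general backtracking parser would also have to retry later alternatives when the rest of the input fails. The names `parse_S` and `accepts` are assumptions of the example.

```c
/* Tries S → aSb | c | ab at s[pos], in that order. Returns the position
   just past the matched text, or -1 if no alternative matches. */
static long parse_S(const char *s, long pos) {
    /* Alternative 1: S → a S b */
    if (s[pos] == 'a') {
        long end = parse_S(s, pos + 1);
        if (end >= 0 && s[end] == 'b') return end + 1;
        /* otherwise backtrack to `pos` and try the next production */
    }
    /* Alternative 2: S → c */
    if (s[pos] == 'c') return pos + 1;
    /* Alternative 3: S → a b */
    if (s[pos] == 'a' && s[pos + 1] == 'b') return pos + 2;
    return -1;
}

/* Returns 1 if the whole string is derived from S, 0 otherwise. */
int accepts(const char *s) {
    long end = parse_S(s, 0);
    return end >= 0 && s[end] == '\0';
}
```

For aabb the inner call first tries S → aSb, fails, and only then falls back to S → ab, which is the backtracking shown in the trace above.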
Predictive parsing (Recursive Descent Parsing, RDP)
• A strategy for general top-down parsing:
guess a production, see if it matches, and if not, backtrack and try another.
• It may fail to recognize a correct string for some grammars G, and it is tedious in processing.
• Predictive parsing
o is a kind of top-down parsing that predicts a production whose derived first terminal
symbol is equal to the next input symbol while expanding in top-down parsing.
o works without backtracking.
o A recursive descent parser is a kind of predictive parser that is implemented by
disjoint recursive procedures, one procedure for each nonterminal; the procedures are
patterned after the productions.
• procedure of predictive parsing (RDP)
Let the pointed grammar symbol and the pointed input symbol be g and a, respectively.
o if (g ∈ N)
select the production P whose left symbol equals g and whose set of first
terminal symbols derivable from the right side of P
includes the input symbol a.
expand the derivation with that production P.
o else if (g = a) then advance g and a to the next symbols.
o else if (g ≠ a) error
Left Factoring
• If a grammar contains two productions of the form
S → aα and S → aβ
it is not suitable for top-down parsing without backtracking. Troubles of this form can sometimes
be removed from the grammar by a technique called left factoring.
• In left factoring, we replace { S → aα, S → aβ } by
{ S → aS', S' → α, S' → β }   cf. S → a(α|β)
(Hopefully α and β start with different symbols)
• left factoring for G: { S → aSb | c | ab }
S → aS' | c     cf. S (= aSb | ab | c = a(Sb | b) | c) → aS' | c
S' → Sb | b
• A concrete example:
<stmt> → IF <boolean> THEN <stmt> |
IF <boolean> THEN <stmt> ELSE <stmt>
is transformed into
<stmt> → IF <boolean> THEN <stmt> S'
S' → ELSE <stmt> | ε
• Example,
o for G1: { S → aSb | c | ab }
According to the predictive parsing procedure, are acb, aabb ∈ L(G)?
S/aabb ⇒ unable to choose between { S → aSb, S → ab }
o According to the left-factored grammar G1, are acb, aabb ∈ L(G)?
G1: { S → aS' | c, S' → Sb | b }  <=  { S = a(Sb | b) | c }
o S/acb ⇒ aS'/acb ⇒ aS'/acb ⇒ aSb/acb ⇒ acb/acb ⇒ acb/acb ⇒ acb/acb
(S→aS') move (S'→Sb) (S→c) move move
so, acb ∈ L(G)
It needs only 6 steps without any backtracking.
cf. General top-down parsing needs 7 steps and 1 backtracking.
o S/aabb ⇒ aS'/aabb ⇒ aS'/aabb ⇒ aSb/aabb ⇒ aaS'b/aabb ⇒ aaS'b/aabb ⇒ aabb/aabb ⇒ aabb/aabb
⇒ aabb/aabb
(S→aS') move (S'→Sb) (S→aS') move (S'→b) move move
so, aabb ∈ L(G)
The process is finished in 8 steps without any backtracking.
cf. General top-down parsing needs 18 steps including 5 backtrackings.
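For the left-factored grammar { S → aS' | c, S' → Sb | b }, one lookahead symbol is enough to pick a production, so the parser can be written as one procedure per nonterminal with no backtracking. The sketch below assumes single-character tokens; the names `lookahead`, `ok`, `S_prime` and `accepts_predictive` are illustrative only.

```c
static const char *lookahead;   /* next unread input symbol (hypothetical state) */
static int ok;                  /* cleared on the first mismatch */

static void match(char t) {
    if (ok && *lookahead == t) lookahead++;
    else ok = 0;
}

static void S(void);

/* S' → Sb | b : S can only start with a or c, so the lookahead decides. */
static void S_prime(void) {
    if (!ok) return;
    if (*lookahead == 'a' || *lookahead == 'c') { S(); match('b'); }
    else match('b');
}

/* S → aS' | c */
static void S(void) {
    if (!ok) return;
    if (*lookahead == 'a') { match('a'); S_prime(); }
    else match('c');
}

/* Returns 1 if the whole string is derived from S, 0 otherwise. */
int accepts_predictive(const char *s) {
    lookahead = s;
    ok = 1;
    S();
    return ok && *lookahead == '\0';
}
```

Every input symbol is examined once, which is why the traces above need no backtracking steps.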
Left Recursion
• A grammar is left recursive iff it contains a nonterminal A such that
A ⇒+ Aα, where α is any string.
o Grammar { S → Sα | c } is left recursive because of S ⇒ Sα
o Grammar { S → Aα, A → Sb | c } is also left recursive because of S ⇒ Aα ⇒ Sbα
• If a grammar is left recursive, you cannot build a predictive top-down parser for it.
1) If a parser is trying to match S and S → Sα, it has no idea how many times the production must be applied.
2) Given a left-recursive grammar, it is always possible to find another grammar that
generates the same language and is not left recursive.
3) The resulting grammar might or might not be suitable for RDP.
• After this, if the grammar still needs left factoring, it is not yet suitable for RDP.
• Right recursion: Special care / Harder than left recursion / SDT can handle.
Eliminating Left Recursion
Let G be S → SA | A
Note that a top-down parser cannot parse the grammar G, regardless of the order in which the productions are tried.
⇒ The productions generate strings of the form AA…A
⇒ They can be replaced by S → AS' and S' → AS' | ε
Example:
• A → Aα | β
=>
A → βR
R → αR | ε
Fig 2.12. Left- and right-recursive ways of generating a string.
• In general, the rule is that
o If A → Aα1 | Aα2 | … | Aαn and
A → β1 | β2 | … | βm (no βi starts with A),
then replace by
A → β1R | β2R | … | βmR and
R → α1R | α2R | … | αnR | ε
Exercise: Remove the left recursion in the following grammar:
expr → expr + term | expr - term
expr → term
solution:
expr → term rest
rest → + term rest | - term rest | ε
A TRANSLATOR FOR SIMPLE EXPRESSIONS
• Convert infix into postfix (reverse Polish notation) using SDT.
• Abstract syntax tree (annotated parse tree) vs. concrete syntax tree
• Concrete syntax tree: parse tree.
• Abstract syntax tree: syntax tree
• Concrete syntax: underlying grammar
Adapting the Translation Scheme
• Embed the semantic action in the production
• Design a translation scheme
• Left recursion elimination and left factoring
• Example
3) Design a translation scheme and eliminate left recursion
E → E + T {'+'}               E → T {} R
E → E - T {'-'}               R → + T {'+'} R
E → T {}                      R → - T {'-'} R
T → 0 {'0'} | … | 9 {'9'}     R → ε
                              T → 0 {'0'} | … | 9 {'9'}
4) Translation of an input string 9-5+2: parsing and SDT
Result: 95-2+
Example of translator design and execution
• A translation scheme with left recursion, and the same scheme with left recursion eliminated.
Initial specification for infix-to-postfix translator     With left recursion eliminated
expr → expr + term {printf("+")}                          expr → term rest
expr → expr - term {printf("-")}                          rest → + term {printf("+")} rest
expr → term                                               rest → - term {printf("-")} rest
term → 0 {printf("0")}                                    rest → ε
term → 1 {printf("1")}                                    term → 0 {printf("0")}
…                                                         term → 1 {printf("1")}
term → 9 {printf("9")}                                    …
                                                          term → 9 {printf("9")}
Fig 2.13. Translation of 9-5+2 into 95-2+.
Procedure for the Nonterminals expr, term, and rest
Fig 2.14. Functions for the nonterminals expr, rest, and term.
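Since the figure itself is not reproduced here, the following is a possible sketch of such functions for the scheme with left recursion eliminated. It is not the original figure: instead of calling printf it writes into a caller-supplied buffer so the result can be inspected, the tail call on rest is written as a loop, and the names `infix_to_postfix`, `emit` and `failed` are assumptions of this example.

```c
#include <ctype.h>

static const char *input;   /* next unread character (the lookahead) */
static char *out;           /* next free slot in the output buffer */
static int failed;          /* set when a digit was expected but not found */

static void emit(char c) { *out++ = c; }

/* term → 0 {print('0')} | ... | 9 {print('9')} */
static void term(void) {
    if (isdigit((unsigned char)*input)) emit(*input++);
    else failed = 1;
}

/* rest → + term {print('+')} rest | - term {print('-')} rest | ε */
static void rest(void) {
    while (!failed && (*input == '+' || *input == '-')) {
        char op = *input++;
        term();
        emit(op);           /* the action runs after term, giving postfix order */
    }
}

/* expr → term rest */
static void expr(void) { term(); rest(); }

/* Writes the postfix form of `infix` into `buf`, which must hold at least
   strlen(infix) + 1 bytes. Returns 1 on success, 0 on a syntax error. */
int infix_to_postfix(const char *infix, char *buf) {
    input = infix;
    out = buf;
    failed = 0;
    expr();
    *out = '\0';
    return !failed && *input == '\0';
}
```

Calling infix_to_postfix("9-5+2", buf) fills buf with 95-2+, the same output as the translation in Fig 2.13.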
Optimizer and Translator
LEXICAL ANALYSIS
• reads and converts the input into a stream of tokens to be analyzed by the parser.
• lexeme: a sequence of characters which comprises a single token.
• Lexical Analyzer → Lexeme/Token → Parser
Removal of White Space and Comments
• Remove white space (blank, tab, newline etc.) and comments
Constants
• Constants: For a while, consider only integers
• eg) for input 31 + 28, output (token representation)?
input : 31 + 28
output: <num, 31> <+, > <num, 28>
num, + : tokens
31, 28 : attribute, value (or lexeme) of the integer token num
Recognizing
• Identifiers
o Identifiers are names of variables, arrays, functions...
o A grammar treats an identifier as a token.
o eg) input : count = count + increment;
output: <id,1> <=, > <id,1> <+, > <id,2> ;
Symbol table
index   token   attribute (lexeme)
0
1       id      count
2       id      increment
3
• Keywords are reserved, i.e., they cannot be used as identifiers.
Then a character string forms an identifier only if it is not a keyword.
• punctuation symbols
o operators: + - * / := < > …
Interface to lexical analyzer
Fig 2.15. Inserting a lexical analyzer between the input and the parser
A Lexical Analyzer
Fig 2.16. Implementing the interactions in Fig. 2.15.
• c = getchar(); ungetc(c, stdin);
• token representation
o #define NUM 256
• Function lexan()
eg) input string 76 + a
input     output (returned value)
76        NUM, tokenval = 76 (integer)
+         +
a         id, tokenval = "a"
• A way that the parser handles the token NUM returned by lexan()
o consider a translation scheme
factor → ( expr )
| num { print(num.value) }
#define NUM 256
...
void factor(void) {
    if (lookahead == '(') {
        match('('); expr(); match(')');
    } else if (lookahead == NUM) {
        printf("%d", tokenval); match(NUM);   /* tokenval is an int */
    } else error();
}
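• The implementation of function lexan
#include <stdio.h>
#include <ctype.h>
#define NUM  256              /* token code for integer constants */
#define NONE (-1)             /* assumed value meaning "no attribute" */
int lineno = 1;               /* current input line, for error messages */
int tokenval = NONE;          /* attribute of the most recent token */
int lexan(void) {
    int t;
    while (1) {
        t = getchar();
        if (t == ' ' || t == '\t')
            ;                          /* skip blanks and tabs */
        else if (t == '\n')
            lineno += 1;
        else if (isdigit(t)) {         /* accumulate a decimal integer */
            tokenval = t - '0';
            t = getchar();
            while (isdigit(t)) {
                tokenval = tokenval * 10 + t - '0';
                t = getchar();
            }
            ungetc(t, stdin);          /* push back the lookahead character */
            return NUM;
        } else {                       /* any other character (or EOF) is its own token */
            tokenval = NONE;
            return t;
        }
    }
}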
INCORPORATING A SYMBOL TABLE
• The symbol table interface; operations are usually called by the parser.
o insert(s, t): input s: lexeme
t: token
output: index of the new entry
o lookup(s): input s: lexeme
output: index of the entry for string s, or 0 if s is not found in the symbol table.
• Handling reserved keywords
1. Insert all keywords in the symbol table in advance.
ex) insert("div", div)
insert("mod", mod)
2. while parsing
• whenever an identifier s is encountered:
if (lookup(s)'s token is in {keywords}) s is a keyword; else s is an identifier;
• example
o preset
insert("div", div);
insert("mod", mod);
o while parsing
lookup("count") => 0, so insert("count", id);
lookup("i") => 0, so insert("i", id);
lookup("i") => 4, id
lookup("div") => 1, div
Fig 2.17. Symbol table and array for storing strings.
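The insert/lookup interface can be sketched with a fixed-size array searched linearly. This is a simplified illustration: the token codes, the capacity SYMMAX and the helper token_of are assumptions made for the example, and the table stores the caller's string pointer instead of copying lexemes into a separate character array as in Fig 2.17.

```c
#include <string.h>

#define SYMMAX 100          /* hypothetical capacity */
#define DIV    257          /* hypothetical token codes */
#define MOD    258
#define ID     259

struct entry { const char *lexeme; int token; };

static struct entry symtable[SYMMAX];
static int lastentry = 0;   /* index 0 is reserved to mean "not found" */

/* Returns the index of the entry for s, or 0 if s is not in the table. */
int lookup(const char *s) {
    for (int p = lastentry; p > 0; p--)
        if (strcmp(symtable[p].lexeme, s) == 0) return p;
    return 0;
}

/* Adds (s, tok) and returns the index of the new entry, or 0 if the table is full. */
int insert(const char *s, int tok) {
    if (lastentry + 1 >= SYMMAX) return 0;
    lastentry++;
    symtable[lastentry].lexeme = s;   /* pointer only; the caller keeps s alive */
    symtable[lastentry].token = tok;
    return lastentry;
}

/* Token stored at a given index (no bounds checking in this sketch). */
int token_of(int index) { return symtable[index].token; }
```

Preloading div and mod puts them at indices 1 and 2, so a later lookup("div") returns 1 and its stored token tells the scanner that the lexeme is a keyword and not an identifier.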
ABSTRACT STACK MACHINE
o An abstract machine is for intermediate code generation/execution.
o Instruction classes: arithmetic / stack manipulation / control flow
• 3 components of the abstract stack machine
1) Instruction memory: abstract machine code, intermediate code (instructions)
2) Stack
3) Data memory
• An example of stack machine operation.
o for an input (5+a)*b, intermediate code: push 5, rvalue 2, ....
L-value and r-value
• l-value of a: address of location a
• r-value of a: if a is a location, then the content of location a;
if a is a constant, then the value a
• eg) a := 5 + b;
lvalue a ⇒ 2, rvalue 5 ⇒ 5, rvalue of b ⇒ 7
Stack Manipulation
• Some instructions for the assignment operation
o push v: push v onto the stack.
o rvalue a: push the contents of data location a.
o lvalue a: push the address of data location a.
o pop: throw away the top element of the stack.
o := : the r-value on top of the stack is placed in the l-value below it, and both are popped.
o copy: push a copy of the top element of the stack.
Translation of Expressions
• Infix expression (IE) → SDD/SDTS → abstract stack machine code (ASC) of the postfix expression for
stack machine evaluation.
eg)
o IE: a + b (⇒ PE: a b +) ⇒ IC: rvalue a
rvalue b
+
o day := (1461 * y) div 4 + (153 * m + 2) div 5 + d
(⇒ day 1461 y * 4 div 153 m * 2 + 5 div + d + :=)
⇒  1) lvalue day     7) push 153    13) div
    2) push 1461      8) rvalue m    14) +
    3) rvalue y       9) *           15) rvalue d
    4) *             10) push 2      16) +
    5) push 4        11) +           17) :=
    6) div           12) push 5
• A translation scheme translating an assignment statement into abstract stack machine code can be
expressed formally as follows:
stmt → id := expr
{ stmt.t := 'lvalue' || id.lexeme || expr.t || ':=' }
eg) day := a + b ⇒ lvalue day rvalue a rvalue b + :=
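To make the instruction set concrete, here is a minimal interpreter sketch for the arithmetic and assignment instructions, together with the code for the day := ... example. The encoding (an enum plus an integer argument), the explicit HALT at the end, the memory sizes and the addresses day=0, y=1, m=2, d=3 are all assumptions made for this illustration.

```c
enum op { PUSH, RVALUE, LVALUE, ADD, MUL, DIVI, ASSIGN, HALT };
struct instr { enum op op; int arg; };

#define STACK_MAX 64
#define DATA_MAX  16

/* Runs `code` against data memory `data`. Returns 0 on success, -1 on
   stack overflow/underflow, a bad address, or division by zero. */
int run(const struct instr *code, int *data) {
    int stack[STACK_MAX];
    int top = 0;                       /* number of elements on the stack */
    for (int pc = 0; ; pc++) {
        struct instr in = code[pc];
        switch (in.op) {
        case PUSH: case LVALUE:        /* push a constant or an address */
            if (top >= STACK_MAX) return -1;
            stack[top++] = in.arg;
            break;
        case RVALUE:                   /* push the contents of a location */
            if (top >= STACK_MAX || in.arg < 0 || in.arg >= DATA_MAX) return -1;
            stack[top++] = data[in.arg];
            break;
        case ADD: case MUL: case DIVI: {
            if (top < 2) return -1;
            int r = stack[--top], l = stack[--top];
            if (in.op == DIVI && r == 0) return -1;
            stack[top++] = in.op == ADD ? l + r : in.op == MUL ? l * r : l / r;
            break;
        }
        case ASSIGN: {                 /* r-value on top, l-value below it */
            if (top < 2) return -1;
            int value = stack[--top], addr = stack[--top];
            if (addr < 0 || addr >= DATA_MAX) return -1;
            data[addr] = value;
            break;
        }
        case HALT:
            return 0;
        }
    }
}

/* day := (1461 * y) div 4 + (153 * m + 2) div 5 + d */
const struct instr day_program[] = {
    {LVALUE, 0}, {PUSH, 1461}, {RVALUE, 1}, {MUL, 0}, {PUSH, 4}, {DIVI, 0},
    {PUSH, 153}, {RVALUE, 2}, {MUL, 0}, {PUSH, 2}, {ADD, 0}, {PUSH, 5},
    {DIVI, 0}, {ADD, 0}, {RVALUE, 3}, {ADD, 0}, {ASSIGN, 0}, {HALT, 0}
};
```

With y = 1, m = 3 and d = 5 in data memory, running day_program stores 365 + 92 + 5 = 462 in location 0.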
Control Flow
• 3 types of jump instructions:
o Absolute target location
o Relative target location (distance: Current ↔ Target)
o Symbolic target location (i.e. the machine supports labels)
• Control-flow instructions:
o label a: the jump's target a
o goto a: the next instruction is taken from the statement labeled a
o gofalse a: pop the top & if it is 0 then jump to a
o gotrue a: pop the top & if it is nonzero then jump to a
o halt: stop execution
Translation of Statements
• Translation scheme for translating an if-statement into abstract machine code.
stmt → if expr then stmt1
{ out := newlabel;
stmt.t := expr.t || 'gofalse' out || stmt1.t || 'label' out }
Fig 2.18. Code layout for conditional and while statements.
• Translation scheme for while-statement?
Emitting a Translation
• Semantic Action (Translation Scheme):
1. stmt → if
expr { out := newlabel; emit('gofalse', out) }
then
stmt1 { emit('label', out) }
2. stmt → id { emit('lvalue', id.lexeme) }
:=
expr { emit(':=') }
3. stmt → if
expr { out := newlabel; emit('gofalse', out) }
then
stmt1 { out1 := newlabel; emit('goto', out1); emit('label', out); }
else
stmt2 { emit('label', out1); }
The code emitted for rule 3 has the layout:
if (expr == false) goto out
stmt1
goto out1
out: stmt2
out1:
Implementation
procedure stmt()
var test, out: integer;
begin
    if lookahead = id then begin
        emit('lvalue', tokenval); match(id);
        match(':='); expr(); emit(':=');
    end
    else if lookahead = 'if' then begin
        match('if');
        expr();
        out := newlabel();
        emit('gofalse', out);
        match('then');
        stmt();
        emit('label', out)
    end
    else error();
end
Control Flow with Analysis
• if E1 or E2 then S vs if E1 and E2 then S
E1 or E2 = if E1 then true else E2
E1 and E2 = if E1 then E2 else false
• The code for E1 or E2.
o code for E1 (evaluation result: e1)
o copy
o gotrue OUT
o pop
o code for E2 (evaluation result: e2)
o label OUT
• The full code for if E1 or E2 then S;
o code for E1
o copy
o gotrue OUT1
o pop
o code for E2
o label OUT1
o gofalse OUT2
o code for S
o label OUT2
• Exercise: How about if E1 and E2 then S;
o if E1 and E2 then S1 else S2;
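The code sequence for E1 or E2 can be tried out on a small interpreter that supports copy, pop, gotrue, gofalse and label. The sketch below is only an illustration: labels are resolved by scanning the code for a matching label instruction (the scan stops at halt, so labels must come before it), and a single push stands in for the code of E1 and of E2. All names are assumptions of this example.

```c
enum op { PUSH, COPY, POP, GOTRUE, GOFALSE, LABEL, HALT };
struct instr { enum op op; int arg; };

#define STACK_MAX 32

/* Returns the index of `label n` in code, or -1 if it is absent. */
static int find_label(const struct instr *code, int n) {
    for (int i = 0; ; i++) {
        if (code[i].op == LABEL && code[i].arg == n) return i;
        if (code[i].op == HALT) return -1;
    }
}

/* Executes until HALT. On success stores the top of the stack in *result
   and returns 0; returns -1 on stack misuse or a missing label. */
int run(const struct instr *code, int *result) {
    int stack[STACK_MAX], top = 0;
    for (int pc = 0; ; pc++) {
        struct instr in = code[pc];
        switch (in.op) {
        case PUSH:
            if (top >= STACK_MAX) return -1;
            stack[top++] = in.arg;
            break;
        case COPY:
            if (top < 1 || top >= STACK_MAX) return -1;
            stack[top] = stack[top - 1];
            top++;
            break;
        case POP:
            if (top < 1) return -1;
            top--;
            break;
        case GOTRUE: case GOFALSE: {
            if (top < 1) return -1;
            int v = stack[--top];
            /* gotrue jumps on nonzero, gofalse jumps on zero */
            if ((in.op == GOTRUE) == (v != 0)) {
                pc = find_label(code, in.arg);
                if (pc < 0) return -1;
            }
            break;
        }
        case LABEL:
            break;                     /* a label is only a jump target */
        case HALT:
            if (top < 1) return -1;
            *result = stack[top - 1];
            return 0;
        }
    }
}

/* E1 or E2, with `push e1` and `push e2` standing in for the real code. */
int eval_or(int e1, int e2) {
    const struct instr code[] = {
        {PUSH, e1}, {COPY, 0}, {GOTRUE, 1}, {POP, 0},
        {PUSH, e2}, {LABEL, 1}, {HALT, 0}
    };
    int result = 0;
    return run(code, &result) == 0 ? result : -1;
}
```

When e1 is nonzero the gotrue jump skips the code for E2 entirely, which is the short-circuit behaviour described by "E1 or E2 = if E1 then true else E2".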
Putting the techniques together!
• infix expression ⇒ postfix expression
eg) id + (id - id) * num / id ⇒ id id id - num * id / +
Description of the Translator
• Syntax-directed translation scheme (SDTS) to translate infix expressions
into postfix expressions,
Fig 2.19. Specification for infix-to-postfix translation
Structure of the translator,
Fig 2.19. Modules of infix to postfix translator.
o global header file "header.h"
The Lexical Analysis Module lexer.c
o Description of tokens
+ - * / DIV MOD ( ) ID NUM DONE
Fig 2.20. Description of tokens.
The Parser Module parser.c
SDTS
|| ← left recursion elimination
New SDTS
The Symbol-Table Modules symbol.c and init.c
symbol.c
data structure of symbol table (Fig 2.29, p. 62)
insert(s, t)
lookup(s)
The Error Module error.c
Example of execution
input   12 div 5 + 2
output  12
        5
        div
        2
        +
3. Lexical Analysis:
OVERVIEW OF LEXICAL ANALYSIS
• To identify the tokens we need some method of describing the possible tokens that can
appear in the input stream. For this purpose we introduce regular expressions, a notation
that can be used to describe essentially all the tokens of a programming language.
• Secondly, having decided what the tokens are, we need some mechanism to recognize
these in the input stream. This is done by the token recognizers, which are designed using
transition diagrams and finite automata.
ROLE OF LEXICAL ANALYZER
The LA is the first phase of a compiler. Its main task is to read the input characters and produce as output a
sequence of tokens that the parser uses for syntax analysis.
Fig. 3.1: Role of Lexical analyzer
Upon receiving a 'get next token' command from the parser, the lexical analyzer reads the
input characters until it can identify the next token. The LA returns to the parser a representation for
the token it has found. The representation will be an integer code if the token is a simple
construct such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is
stripping out from the source program the comments and white space in the form of blank, tab
and newline characters. Another is correlating error messages from the compiler with the source
program.
TOKEN, LEXEME, PATTERN:
Token: A token is a sequence of characters that can be treated as a single logical entity. Typical
tokens are:
1) Identifiers 2) keywords 3) operators 4) special symbols 5) constants
Pattern: A set of strings in the input for which the same token is produced as output. This set of strings is
described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a
token.
Fig. 3.2: Example of Token, Lexeme and Pattern
LEXICAL ERRORS:
Lexical errors are the errors reported by the lexer when it is unable to continue, which means that
there is no way to recognise a lexeme as a valid token for the lexer. Syntax errors, on the other
hand, are reported by the parser when a given sequence of already recognised valid tokens does not
match any of the right sides of the grammar rules. A simple panic-mode error handling system
requires that we return to a high-level parsing function when a parsing or lexical error is detected.
Error-recovery actions are:
i. Delete one character from the remaining input.
ii. Insert a missing character into the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.
REGULAR EXPRESSIONS
A regular expression is a formula that describes a possible set of strings. Components of a regular expression:
x        the character x
.        any character, usually except a newline
[xyz]    any of the characters x, y, z, …
R?       an R or nothing (= optionally an R)
R*       zero or more occurrences of R
R+       one or more occurrences of R
R1R2     an R1 followed by an R2
R1|R2    either an R1 or an R2
A token is either a single string or one of a collection of strings of a certain type. If we view the
set of strings in each token class as a language, we can use the regular-expression notation to
describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or more letters or digits. In
regular expression notation we would write:
identifier = letter (letter | digit)*
Here are the rules that define the regular expressions over an alphabet Σ:
• ε is a regular expression denoting {ε}, that is, the language containing only the empty string.
• For each ‘a’ in Σ, a is a regular expression denoting {a}, the language with only one string,
consisting of the single symbol ‘a’.
• If R and S are regular expressions denoting L(R) and L(S), then
(R)|(S) denotes L(R) ∪ L(S)
(R)(S) denotes L(R)L(S)
(R)* denotes (L(R))*
REGULAR DEFINITIONS
For notational convenience, we may wish to give names to regular expressions and to define regular
expressions using these names as if they were symbols.
Identifiers are the set of strings of letters and digits beginning with a letter. The following regular
definition provides a precise specification for this class of strings.
Example 1:
ab*|cd? is equivalent to (a(b*))|(c(d?))
Pascal identifier:
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | 2 | … | 9
id → letter (letter | digit)*
Recognition of tokens:
We have learned how to express patterns using regular expressions. Now we must study how to take the
patterns for all the needed tokens and build a piece of code that examines the input string and finds
a prefix that is a lexeme matching one of the patterns.
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | number
For relop, we use the comparison operators of languages like Pascal or SQL, where = is “equals” and <> is
“not equals”, because this presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id and number, are the names of tokens
as far as the lexical analyzer is concerned. The patterns for the tokens are described using regular
definitions.
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id → letter (letter | digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing the
“token” ws defined by:
ws → (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII characters of
the same names. Token ws is different from the other tokens in that, when we recognize it, we do not
return it to the parser, but rather restart the lexical analysis from the character that follows the white
space. It is the following token that gets returned to the parser.
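The regular definitions above can be turned into a small scanner. The following is a minimal sketch in Python, assuming the standard `re` module as the matching engine; the token names (`IF`, `ID`, `RELOP`, ...) and the function name `tokenize` are illustrative choices, not part of these notes. Because Python tries alternatives in order rather than picking the longest match, keywords are listed before `ID` and two-character operators before one-character ones.

```python
import re

# Patterns mirroring the regular definitions for ws, if, then, else, number, id, relop.
# Names are illustrative; order matters because alternation takes the first match.
TOKEN_SPEC = [
    ("WS",     r"[ \t\n]+"),
    ("IF",     r"if\b"),
    ("THEN",   r"then\b"),
    ("ELSE",   r"else\b"),
    ("NUMBER", r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
    ("ID",     r"[A-Za-z][A-Za-z0-9]*"),
    ("RELOP",  r"<=|>=|<>|<|>|="),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token, lexeme) pairs; white space is recognized but not returned."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            # No pattern matches a prefix of the remaining input: a lexical error.
            raise SyntaxError(f"lexical error at position {pos}: {text[pos]!r}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()  # restart scanning right after the lexeme
    return tokens
```

For example, `tokenize("if x1 <= 10 then y")` yields the pairs for `if`, `x1`, `<=`, `10`, `then` and `y`, with the blanks silently dropped.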
TRANSITION DIAGRAM:
A transition diagram has a collection of nodes or circles, called states. Each state represents a
condition that could occur during the process of scanning the input looking for a lexeme that
matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled by a
symbol or set of symbols.
If we are in some state s, and the next input symbol is a, we look for an edge out of state s labeled
by a. If we find such an edge, we advance the forward pointer and enter the state of the transition
diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting or final. These states indicate that a lexeme has
been found, although the actual lexeme may not consist of all positions between the lexemeBegin
and forward pointers. We always indicate an accepting state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we
additionally place a * near that accepting state.
3. One state is designated the start state, or initial state. It is indicated by an edge labeled “start”
entering from nowhere. The transition diagram always begins in the start state before any input
symbols have been read.
Fig. 3.3: Transition diagram of relational operators
Fig. 3.4: Transition diagram of identifier
The above transition diagram is for an identifier, defined to be a letter followed by any number of letters or digits. A
sequence of transition diagrams can be converted into a program to look for the tokens specified by the
diagrams. Each state gets a segment of code.
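As an illustration of "each state gets a segment of code", here is a minimal Python sketch of a relational-operator diagram. Since the figure itself is not reproduced here, the state numbers (0, 1, 6) and the attribute names (`LT`, `LE`, `NE`, `EQ`, `GT`, `GE`) are assumptions made for this example. Returning `forward` without incrementing it plays the role of the starred (retracting) accepting states.

```python
def get_relop(text, begin):
    """Simulate a relop transition diagram starting at index `begin`.

    Returns (attribute, end) so that text[begin:end] is the lexeme,
    or None if no relational operator starts at `begin`.
    """
    state = 0
    forward = begin
    while True:
        # "" stands for end of input in this sketch
        c = text[forward] if forward < len(text) else ""
        if state == 0:                      # start state
            if c == "<":
                state = 1
            elif c == "=":
                return ("EQ", forward + 1)
            elif c == ">":
                state = 6
            else:
                return None                 # not a relop; another diagram must be tried
        elif state == 1:                    # have seen "<"
            if c == "=":
                return ("LE", forward + 1)
            if c == ">":
                return ("NE", forward + 1)
            return ("LT", forward)          # starred state: do not consume c
        elif state == 6:                    # have seen ">"
            if c == "=":
                return ("GE", forward + 1)
            return ("GT", forward)          # starred state: do not consume c
        forward += 1
```

Calling `get_relop("a<b", 1)` reads `<`, looks at `b`, and accepts `<` alone, leaving `b` for the next token.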
FINITE AUTOMATON
• A recognizer for a language is a program that takes a string x, and answers “yes” if x is a
sentence of that language, and “no” otherwise.
• We call the recognizer of the tokens a finite automaton.
• A finite automaton can be deterministic (DFA) or non-deterministic (NFA).
• This means that we may use a deterministic or non-deterministic automaton as a lexical
analyzer.
• Both deterministic and non-deterministic finite automata recognize regular sets.
• Which one?
– deterministic: faster recognizer, but it may take more space
– non-deterministic: slower, but it may take less space
– Deterministic automata are widely used in lexical analyzers.
• First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical
analyzer for our tokens.
Non-Deterministic Finite Automaton (NFA)
• A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
o S - a set of states
o Σ - a set of input symbols (alphabet)
o move - a transition function that maps state-symbol pairs to sets of states
o s0 - a start (initial) state
o F - a set of accepting states (final states)
• ε-transitions are allowed in NFAs. In other words, we can move from one state to another
one without consuming any symbol.
• An NFA accepts a string x if and only if there is a path from the start state to one of the accepting
states such that the edge labels along this path spell out x.
Example:
Deterministic Finite Automaton (DFA)
• A deterministic finite automaton (DFA) is a special form of an NFA.
• No state has an ε-transition.
• For each symbol a and state s, there is at most one edge labeled a leaving s, i.e. the transition function
maps a state-symbol pair to a state (not a set of states).
Example:
Converting RE to NFA
• Thompson’s construction is one way to convert a regular expression into an NFA.
• There are other (more efficient) ways to do the conversion.
• Thompson’s construction is a simple and systematic method.
• It guarantees that the resulting NFA will have exactly one final state and one start state.
• Construction starts from the simplest parts (alphabet symbols).
• To create an NFA for a complex regular expression, the NFAs of its sub-expressions are
combined to create its NFA.
• To recognize the empty string ε:
• To recognize a symbol a in the alphabet Σ:
• For regular expression r1 | r2:
N(r1) and N(r2) are NFAs for regular expressions r1 and r2.
• For regular expression r1 r2:
Here, the start state of N(r1) becomes the start state of N(r1r2), and the final state of N(r2) becomes the final state of N(r1r2).
• For regular expression r*:
Example:
For the RE (a|b)*a, the NFA construction is shown below.
Converting NFA to DFA (Subset Construction)
We merge together NFA states by looking at them from the point of view of the input characters:
• From the point of view of the input, any two states that are connected by an ε-transition
may as well be the same, since we can move from one to the other without consuming
any character. Thus states which are connected by an ε-transition will be represented by
the same state in the DFA.
• If it is possible to have multiple transitions based on the same symbol, then we can regard
a transition on a symbol as moving from a state to a set of states (i.e. the union of all those states
reachable by a transition on the current symbol). Thus these states will be combined into a
single DFA state.
To perform this operation, let us define two functions:
• The ε-closure function takes a state and returns the set of states reachable from it by
zero or more ε-transitions. Note that this will always include the state itself. We should be able to
get from a state to any state in its ε-closure without consuming any input.
• The function move takes a state and a character, and returns the set of states reachable by one
transition on this character.
We can generalise both these functions to apply to sets of states by taking the union of the application to
individual states.
For example, if A, B and C are states, move({A,B,C}, 'a') = move(A, 'a') ∪ move(B, 'a') ∪ move(C, 'a').
The subset construction algorithm is as follows:
put ε-closure({s0}) as an unmarked state into the set of DFA states (DS)
while (there is an unmarked state S1 in DS) do
begin
    mark S1
    for each input symbol a do
    begin
        S2 ← ε-closure(move(S1, a))
        if (S2 is not in DS) then
            add S2 into DS as an unmarked state
        transfunc[S1, a] ← S2
    end
end
• a state S in DS is an accepting state of the DFA if some state in S is an accepting state of the NFA
• the start state of the DFA is ε-closure({s0})
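The algorithm above can be sketched in Python as follows. This is an illustrative implementation under a few assumptions: the NFA is stored as a dictionary from (state, label) to a set of states, the empty string stands for ε, and transitions into the empty set are simply left out of `trans`. The example NFA for (a|b)*a uses a Thompson-style state numbering (0 to 8) chosen for this sketch, since the figure is not reproduced here.

```python
from collections import deque

EPS = ""  # label used for ε-transitions in this sketch


def eps_closure(nfa, states):
    """All NFA states reachable from `states` by zero or more ε-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(nfa, states, a):
    """Union of move(s, a) over every s in `states`."""
    return {t for s in states for t in nfa.get((s, a), ())}


def subset_construction(nfa, start, accepting, alphabet):
    d_start = eps_closure(nfa, {start})
    dstates = {d_start}
    unmarked = deque([d_start])
    trans = {}
    while unmarked:
        s1 = unmarked.popleft()              # "mark S1"
        for a in alphabet:
            s2 = eps_closure(nfa, move(nfa, s1, a))
            if not s2:
                continue                     # no transition; dead state omitted
            if s2 not in dstates:
                dstates.add(s2)
                unmarked.append(s2)
            trans[(s1, a)] = s2
    d_accepting = {s for s in dstates if s & accepting}
    return d_start, dstates, trans, d_accepting


def dfa_accepts(d_start, trans, d_accepting, text):
    state = d_start
    for ch in text:
        state = trans.get((state, ch))
        if state is None:
            return False
    return state in d_accepting


# Assumed Thompson-style NFA for (a|b)*a; state 8 is accepting.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4}, (2, "a"): {3}, (4, "b"): {5},
    (3, EPS): {6}, (5, EPS): {6}, (6, EPS): {1, 7}, (7, "a"): {8},
}
D_START, DSTATES, TRANS, D_ACCEPTING = subset_construction(NFA, 0, {8}, "ab")
```

For this NFA the construction produces three DFA states, one of which is accepting, and the resulting DFA accepts exactly the strings over {a, b} that end in a.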
Lexical Analyzer Generator
p1 {action1}
p2 {action2}
p3 {action3}
……
……
where each p is a regular expression and each action is a program fragment describing
what action the lexical analyzer should take when a pattern p matches a lexeme. In Lex
the actions are written in C.
3. The third section holds whatever auxiliary procedures are needed by the
actions. Alternatively, these procedures can be compiled separately and loaded with the
lexical analyzer.
Note: You can refer to a sample Lex program given on page no. 109 of chapter 3 of the book
Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman for more clarity.
3.19. INPUT BUFFERING
The LA scans the characters of the source program one at a time to discover tokens. Because a large
amount of time can be consumed scanning characters, specialized buffering techniques have been
developed to reduce the amount of overhead required to process an input character.
Buffering techniques:
1. Buffer pairs
2. Sentinels
Often, many characters beyond the next token may have to be examined before the
next token itself can be determined. For this and other reasons, it is desirable for the lexical
analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves of,
say, 100 characters each. One pointer marks the beginning of the token being discovered. A lookahead
pointer scans ahead of the beginning point, until the token is discovered. We view the
position of each pointer as being between the character last read and the character next to be read. In
practice each buffering scheme adopts one convention: either a pointer is at the symbol last read or
at the symbol it is ready to read.
(Figure: input buffer with token-beginning and lookahead pointers)
The distance which the lookahead pointer may have to
travel past the actual token may be large. For example, in a PL/I program we may see
DECLARE (ARG1, ARG2, …, ARGn) without knowing whether DECLARE is a keyword or an
array name until we see the character that follows the right parenthesis. In either case, the token
itself ends at the second E. If the lookahead pointer travels beyond the buffer half in which it
began, the other half must be loaded with the next characters from the source file. Since the
buffer shown in the figure is of limited size, there is an implied constraint on how much lookahead
can be used before the next token is discovered. In the above example, if the lookahead
traveled to the left half and all the way through the left half to the middle, we could not reload the
right half, because we would lose characters that had not yet been grouped into tokens. While we
can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact
that lookahead is limited.
SYNTAX ANALYSIS
ROLE OF THE PARSER:
A parser for a grammar is a program that takes as input a string w (obtained as a sequence of tokens
from the lexical analyzer) and produces as output either a parse tree for w, if w is a valid
sentence of the grammar, or an error message indicating that w is not a valid sentence of the given
grammar. The goal of the parser is to determine the syntactic validity of a source string. If the string is
valid, a tree is built for use by the subsequent phases of the compiler. The tree reflects the
sequence of derivations or reductions used during parsing. Hence, it is called a parse tree. If the
string is invalid, the parser has to issue a diagnostic message identifying the nature and cause of
the errors in the string. Every elementary subtree in the parse tree corresponds to a production of the
grammar.
There are two ways of identifying an elementary subtree:
1. By deriving a string from a non-terminal, or
2. By reducing a string of symbols to a non-terminal.
The two types of parsers employed are:
a. Top-down parser: builds parse trees from the top (root) to the
bottom (leaves).
b. Bottom-up parser: builds parse trees from the leaves and works up to the root.
Fig. 4.1: Position of parser in compiler model.
CONTEXT-FREE GRAMMARS
Inherently recursive structures of a programming language are defined by a context-free grammar. A
context-free grammar has four components, G = (V, T, P, S).
Here, V is a finite set of non-terminals (syntactic variables),
T is a finite set of terminals (in our case, this will be the set of tokens),
P is a finite set of production rules of the following form:
A → α where A is a non-terminal and α is a string of terminals and non-terminals
(including the empty string),
S is the start symbol (one of the non-terminal symbols).
L(G) is the language of G (the language generated by G), which is a set of sentences.
A sentence of L(G) is a string of terminal symbols of G. If S is the start symbol of G, then ω is a
sentence of L(G) iff S ⇒+ ω, where ω is a string of terminals of G. If G is a context-free
grammar, L(G) is a context-free language. Two grammars G1 and G2 are equivalent if
they generate the same language.
Consider a derivation of the form S ⇒* α. If α contains non-terminals, it is called a
sentential form of G. If α does not contain non-terminals, it is called a sentence of G.
Derivations
In general, a derivation step is
αAβ ⇒ αγβ, where αAβ is a sentential form, A → γ is a production rule in our grammar, and
α and β are arbitrary strings of terminal and non-terminal symbols. For a sequence α1 ⇒ α2 ⇒ ... ⇒ αn
we say that αn is derived from α1, or that α1 derives αn.
1. At each derivation step, we can choose any of the non-terminals in the sentential form of G for the
replacement.
2. If we always choose the left-most non-terminal in each derivation step, the derivation is
called a left-most derivation. If we always choose the right-most non-terminal, it is
called a right-most derivation.
Example:
E → E+E | E–E | E*E | E/E | -E
E → (E)
E → id
Leftmost derivation:
E ⇒ E+E
⇒ E*E+E ⇒ id*E+E ⇒ id*id+E ⇒ id*id+id
The string w = id*id+id is derived from the grammar and consists of terminal symbols only.
Rightmost derivation:
E ⇒ E+E
⇒ E+E*E ⇒ E+E*id ⇒ E+id*id ⇒ id+id*id
Given grammar G: E → E+E | E*E | (E) | -E | id
Sentence to be derived: –(id+id)
LEFTMOST DERIVATION      RIGHTMOST DERIVATION
E ⇒ -E                   E ⇒ -E
  ⇒ -(E)                   ⇒ -(E)
  ⇒ -(E+E)                 ⇒ -(E+E)
  ⇒ -(id+E)                ⇒ -(E+id)
  ⇒ -(id+id)               ⇒ -(id+id)
• Strings that appear in a leftmost derivation are called left-sentential forms.
• Strings that appear in a rightmost derivation are called right-sentential forms.
Sentential forms:
• Given a grammar G with start symbol S, if S ⇒* α, where α may contain non-terminals
or terminals, then α is called a sentential form of G.
Yield or frontier of a tree:
• Each interior node of a parse tree is a non-terminal. The children of a node can be
terminals or non-terminals. The leaves of the tree, read from left to right, form a
sentential form called the yield or frontier of the tree.
PARSE TREE
• Inner nodes of a parse tree are non-terminal symbols.
• The leaves of a parse tree are terminal symbols.
• A parse tree can be seen as a graphical representation of a derivation.
Ambiguity:
A grammar that produces more than one parse tree for some sentence is said to be an ambiguous
grammar.
Example: Given grammar G: E → E+E | E*E | (E) | -E | id
The sentence id+id*id has the following two distinct leftmost derivations:
E ⇒ E+E               E ⇒ E*E
  ⇒ id+E                ⇒ E+E*E
  ⇒ id+E*E              ⇒ id+E*E
  ⇒ id+id*E             ⇒ id+id*E
  ⇒ id+id*id            ⇒ id+id*id
The two corresponding parse trees are:
Example:
To disambiguate the grammar E → E+E | E*E | E^E | id | (E), we can use the precedence of operators
as follows:
^ (right to left)
/, * (left to right)
-, + (left to right)
We get the following unambiguous grammar:
E → E+T | T
T → T*F | F
F → G^F | G
G → id | (E)
Consider this example, G: stmt → if expr then stmt | if expr then stmt else stmt | other
This grammar is ambiguous, since the string if E1 then if E2 then S1 else S2 has
two parse trees.
To eliminate the ambiguity, the following grammar may be used:
stmt → matched_stmt | unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other
unmatched_stmt → if expr then stmt | if expr then matched_stmt else unmatched_stmt
Eliminating Left Recursion:
A grammar is said to be left recursive if it has a non-terminal A such that there is a derivation
A ⇒+ Aα for some string α. Top-down parsing methods cannot handle left-recursive grammars.
Hence, left recursion can be eliminated as follows:
If there is a production A → Aα | β, it can be replaced with a sequence of two productions
A → βA’
A’ → αA’ | ε
without changing the set of strings derivable from A.
Example: Consider the following grammar for arithmetic expressions:
E → E+T | T
T → T*F | F
F → (E) | id
First eliminate the left recursion for E:
E → TE’
E’ → +TE’ | ε
Then eliminate it for T:
T → FT’
T’ → *FT’ | ε
Thus the grammar obtained after eliminating left recursion is
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
Algorithm to eliminate left recursion:
1. Arrange the non-terminals in some order A1, A2, ..., An.
2. for i := 1 to n do begin
       for j := 1 to i-1 do begin
           replace each production of the form Ai → Ajγ
           by the productions Ai → δ1γ | δ2γ | ... | δkγ
           where Aj → δ1 | δ2 | ... | δk are all the current Aj-productions;
       end
       eliminate the immediate left recursion among the Ai-productions
   end
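The inner step of the algorithm, eliminating immediate left recursion for a single non-terminal, can be sketched in Python. This is only an illustration: the function name, the list-of-symbols representation of right-hand sides, and the convention of naming the new non-terminal by appending a prime are assumptions of this sketch. It covers the general immediate case A → Aα1 | ... | Aαm | β1 | ... | βn, not indirect left recursion.

```python
def eliminate_immediate_left_recursion(A, productions):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn as
         A  -> b1 A' | ... | bn A'
         A' -> a1 A' | ... | am A' | ε

    `productions` is a list of right-hand sides, each a list of symbols;
    the empty list [] stands for ε.
    """
    alphas = [rhs[1:] for rhs in productions if rhs and rhs[0] == A]
    betas = [rhs for rhs in productions if not rhs or rhs[0] != A]
    if not alphas:
        return {A: productions}          # nothing to do
    new = A + "'"                        # assumed naming convention for A'
    return {
        A: [beta + [new] for beta in betas],
        new: [alpha + [new] for alpha in alphas] + [[]],
    }
```

Applied to E with right-hand sides `[["E", "+", "T"], ["T"]]`, it returns E → T E' and E' → + T E' | ε, matching the worked example.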
Left factoring:
Step 3:
The second symbol ‘a’ of w also matches the second leaf of the tree. So advance the input
pointer to the third symbol of w, ‘d’. But the third leaf of the tree is b, which does not match
the input symbol d.
Hence discard the chosen production and reset the pointer to the second position. This is called
backtracking.
Step 4:
Now try the second alternative for A.
Now we can halt and announce the successful completion of parsing.
Example for recursive descent parsing:
A left-recursive grammar can cause a recursive-descent parser to go into an infinite loop. Hence,
elimination of left recursion must be done before parsing.
Consider the grammar for arithmetic expressions:
E → E+T | T
T → T*F | F
F → (E) | id
After eliminating the left recursion, the grammar becomes:
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
Now we can write the procedures for the grammar as follows:
Recursive procedures:
Procedure E()
begin
    T();
    EPRIME();
end
Procedure EPRIME()
begin
    if input_symbol = ‘+’ then
    begin
        ADVANCE();
        T();
        EPRIME();
    end
    /* otherwise E’ → ε: do nothing */
end
Procedure T()
begin
    F();
    TPRIME();
end
Procedure TPRIME()
begin
    if input_symbol = ‘*’ then
    begin
        ADVANCE();
        F();
        TPRIME();
    end
    /* otherwise T’ → ε: do nothing */
end
Procedure F()
begin
    if input_symbol = ‘id’ then
        ADVANCE();
    else if input_symbol = ‘(’ then
    begin
        ADVANCE();
        E();
        if input_symbol = ‘)’ then
            ADVANCE();
        else ERROR();
    end
    else ERROR();
end
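Stack implementation:
Trace of the procedure calls for the input id+id*id. The second column shows the input that remains to be read; for ADVANCE() it shows the input after the call.
PROCEDURE      REMAINING INPUT
E()            id+id*id
T()            id+id*id
F()            id+id*id
ADVANCE()      +id*id
TPRIME()       +id*id
EPRIME()       +id*id
ADVANCE()      id*id
T()            id*id
F()            id*id
ADVANCE()      *id
TPRIME()       *id
ADVANCE()      id
F()            id
ADVANCE()      (end of input)
TPRIME()       (end of input)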
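The procedures for E, E’, T, T’ and F translate almost line for line into a runnable program. The sketch below is one possible Python rendering, assuming the input has already been tokenized into a list of strings such as `"id"`, `"+"`, `"("`; the class name `Parser`, the end marker `"$"` and the use of `SyntaxError` for ERROR() are choices made for this example.

```python
class Parser:
    """Recursive-descent parser for
         E -> T E'    E' -> + T E' | ε
         T -> F T'    T' -> * F T' | ε
         F -> ( E ) | id
    """

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" marks end of input
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def advance(self):
        self.pos += 1

    def error(self):
        raise SyntaxError(f"unexpected {self.lookahead!r} at token {self.pos}")

    def E(self):
        self.T()
        self.Eprime()

    def Eprime(self):
        if self.lookahead == "+":
            self.advance()
            self.T()
            self.Eprime()
        # otherwise E' -> ε: consume nothing

    def T(self):
        self.F()
        self.Tprime()

    def Tprime(self):
        if self.lookahead == "*":
            self.advance()
            self.F()
            self.Tprime()
        # otherwise T' -> ε: consume nothing

    def F(self):
        if self.lookahead == "id":
            self.advance()
        elif self.lookahead == "(":
            self.advance()
            self.E()
            if self.lookahead == ")":
                self.advance()
            else:
                self.error()
        else:
            self.error()


def parse(tokens):
    """Return True if `tokens` is a sentence of the grammar, else raise SyntaxError."""
    p = Parser(tokens)
    p.E()
    if p.lookahead != "$":      # leftover input after a complete E
        p.error()
    return True
```

`parse(["id", "+", "id", "*", "id"])` succeeds, while `parse(["id", "+"])` raises `SyntaxError` inside `F`, because no factor follows the `+`.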
2. PREDICTIVE PARSING
Predictive parsing is a special case of recursive descent parsing where no backtracking is
required.
The key problem of predictive parsing is to determine the production to be applied for
a non-terminal in case of alternatives.
Non-recursive predictive parser
The table-driven predictive parser has an input buffer, a stack, a parsing table and an output stream.
Input buffer:
It consists of the string to be parsed, followed by $ to indicate the end of the input string.
Stack:
It contains a sequence of grammar symbols preceded by $ to indicate the bottom of the stack.
Initially, the stack contains the start symbol on top of $.
Parsing table:
It is a two-dimensional array M[A, a], where ‘A’ is a non-terminal and ‘a’ is a terminal.
Predictive parsing program:
The parser is controlled by a program that considers X, the symbol on top of the stack, and a, the
current input symbol. These two symbols determine the parser action. There are three
possibilities:
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to
the next input symbol.
3. If X is a non-terminal, the program consults entry M[X, a] of the parsing table
M. This entry will either be an X-production of the grammar or an error entry.
If M[X, a] = {X → UVW}, the parser replaces X on top of the stack by UVW (with U on top).
If M[X, a] = error, the parser calls an error recovery routine.
Algorithm for non-recursive predictive parsing:
Input: A string w and a parsing table M for grammar G.
Output: If w is in L(G), a leftmost derivation of w; otherwise, an error indication.
Method: Initially, the parser has $S on the stack with S, the start symbol of G, on top, and w$
in the input buffer. The program that utilizes the predictive parsing table M to produce a parse for the
input is as follows:
set ip to point to the first symbol of w$;
repeat
    let X be the top stack symbol and a the symbol pointed to by ip;
    if X is a terminal or $ then
        if X = a then
            pop X from the stack and advance ip
        else error()
    else /* X is a non-terminal */
        if M[X, a] = X → Y1 Y2 … Yk then begin
            pop X from the stack;
            push Yk, Yk-1, …, Y1 onto the stack, with Y1 on top;
            output the production X → Y1 Y2 … Yk
        end
        else error()
until X = $
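A minimal Python sketch of this driver is given below. The parsing table `TABLE` is not taken from these notes: it is written out by hand here, as an assumption, for the left-recursion-free expression grammar (E → TE’, E’ → +TE’ | ε, T → FT’, T’ → *FT’ | ε, F → (E) | id), with missing keys standing for error entries. The names `predictive_parse`, `TABLE` and `NONTERMINALS` are likewise illustrative.

```python
# Hand-written table M[X, a] for the expression grammar (an assumption of this sketch).
# Each entry is the right-hand side to push; [] stands for ε. Missing keys mean "error".
TABLE = {
    ("E", "id"): ["T", "E'"],   ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],   ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"],        ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}


def predictive_parse(tokens, table=TABLE, start="E"):
    """Return the list of productions used (a leftmost derivation) or raise SyntaxError."""
    stack = ["$", start]                 # start symbol on top of $
    inp = list(tokens) + ["$"]
    ip = 0
    output = []
    while stack[-1] != "$":
        X, a = stack[-1], inp[ip]
        if X not in NONTERMINALS:        # X is a terminal: it must match the input
            if X != a:
                raise SyntaxError(f"expected {X!r}, found {a!r}")
            stack.pop()
            ip += 1
        else:
            rhs = table.get((X, a))
            if rhs is None:
                raise SyntaxError(f"no entry M[{X}, {a}]")
            stack.pop()
            stack.extend(reversed(rhs))  # push Yk ... Y1 so that Y1 ends up on top
            output.append((X, rhs))
    if inp[ip] != "$":
        raise SyntaxError(f"unexpected {inp[ip]!r} after end of expression")
    return output
```

For the input `id + id * id` the driver outputs eleven productions, beginning with E → T E’ and ending with E’ → ε.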
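Predictive parsing table construction:
The construction of a predictive parser is aided by two functions associated with a grammar G:
1. FIRST
2. FOLLOW
Rules for FIRST():
1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a non-terminal and X → aα is a production, then add a to FIRST(X).
4. If X is a non-terminal and X → Y1 Y2 … Yk is a production, then place a in FIRST(X) if for some i, a is
in FIRST(Yi), and ε is in all of FIRST(Y1), …, FIRST(Yi-1); that is, Y1 … Yi-1
⇒* ε. If ε is in FIRST(Yj) for all j = 1, 2, .., k, then add ε to FIRST(X).
Rules for FOLLOW():
1. If S is the start symbol, then FOLLOW(S) contains $.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then
everything in FOLLOW(A) is in FOLLOW(B).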
Example:
Consider the following grammar:
E → E+T | T
T → T*F | F
F → (E) | id
After eliminating left recursion, the grammar is
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
FIRST():
FIRST(E) = { (, id }
FIRST(E’) = { +, ε }
FIRST(T) = { (, id }
FIRST(T’) = { *, ε }
FIRST(F) = { (, id }
FOLLOW():
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, $, ) }
FOLLOW(T’) = { +, $, ) }
FOLLOW(F) = { +, *, $, ) }
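These sets can be checked mechanically. The sketch below computes FIRST and FOLLOW by iterating to a fixed point, using the usual rules: FIRST of a string takes FIRST of each symbol in turn while the preceding symbols can derive ε; for a production A → αBβ, FOLLOW(B) receives FIRST(β) without ε, and also FOLLOW(A) when β can derive ε. The dictionary representation of the grammar and all function names are assumptions of this example.

```python
EPS = "ε"

# The left-recursion-free expression grammar; [] stands for an ε right-hand side.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}


def first_of_string(symbols, first):
    """FIRST of a sequence of grammar symbols, given FIRST sets of the non-terminals."""
    result = set()
    for X in symbols:
        fx = first[X] if X in first else {X}   # for a terminal, FIRST(X) = {X}
        result |= fx - {EPS}
        if EPS not in fx:
            return result
    result.add(EPS)                            # every symbol can derive ε
    return result


def compute_first(grammar):
    first = {A: set() for A in grammar}
    changed = True
    while changed:                             # repeat until no set grows
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                new = first_of_string(rhs, first)
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
    return first


def compute_follow(grammar, first, start):
    follow = {A: set() for A in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                for i, B in enumerate(rhs):
                    if B not in grammar:       # only non-terminals have FOLLOW sets
                        continue
                    beta_first = first_of_string(rhs[i + 1:], first)
                    new = beta_first - {EPS}
                    if EPS in beta_first:      # β is empty or can derive ε
                        new |= follow[A]
                    if not new <= follow[B]:
                        follow[B] |= new
                        changed = True
    return follow


FIRST = compute_first(GRAMMAR)
FOLLOW = compute_follow(GRAMMAR, FIRST, "E")
```

Running it reproduces the sets listed above, for instance `FOLLOW["F"]` is the set containing `+`, `*`, `$` and `)`.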
LL(1) grammar:
If every entry of the parsing table holds at most one production, the grammar is called an
LL(1) grammar.
Consider the following grammar:
S → iEtS | iEtSeS | a
E → b
After left factoring, we have
S → iEtSS’ | a
S’ → eS | ε
E → b
To construct a parsing table, we need FIRST() and FOLLOW() for all the non-terminals.
FIRST(S) = { i, a }
FIRST(S’) = { e, ε }
FIRST(E) = { b }
FOLLOW(S) = { $, e }
FOLLOW(S’) = { $, e }
FOLLOW(E) = { t }
Since the entry M[S’, e] contains more than one production (S’ → eS and S’ → ε), the grammar is not an LL(1) grammar.
Actions performed in predictive parsing:
1. Shift
2. Reduce
3. Accept
4. Error
Implementation of predictive parser:
1. Elimination of left recursion, left factoring and ambiguity in the grammar.
2. Construct FIRST() and FOLLOW() for all non-terminals.
3. Construct the predictive parsing table.
4. Parse the given input string using the stack and the parsing table.
BOTTOM-UP PARSING
Constructing a parse tree for an input string beginning at the leaves and going towards the
root is called bottom-up parsing.
A general type of bottom-up parser is a shift-reduce parser.
SHIFT-REDUCE PARSING
Shift-reduce parsing is a type of bottom-up parsing that attempts to construct a parse tree for
an input string beginning at the leaves (the bottom) and working up towards the root (the top).
Example:
Consider the grammar:
S → aABe
A → Abc | b
B → d
The sentence to be recognized is abbcde.
REDUCTION (LEFTMOST)      RIGHTMOST DERIVATION
Handles:
A handle of a string is a substring that matches the right side of a production, and whose
reduction to the non-terminal on the left side of the production represents one step along the
reverse of a rightmost derivation.
Example:
Consider the grammar:
E → E+E
E → E*E
E → (E)
E → id
and the rightmost derivation:
E ⇒ E+E
  ⇒ E+E*E
  ⇒ E+E*id3
  ⇒ E+id2*id3
  ⇒ id1+id2*id3
Handle pruning:
A rightmost derivation in reverse can be obtained by “handle pruning”,
i.e. if w is a sentence of the grammar at hand, then w = γn, where γn is the nth right-sentential form of some
rightmost derivation.
Stack implementation of shift-reduce parsing:
$E          +id2*id3$    shift
$E+E*id3    $            reduce by E → id
$E+E*E      $            reduce by E → E*E
$E+E        $            reduce by E → E+E
$E          $            accept
Actions in shift-reduce parser:
• shift – The next input symbol is shifted onto the top of the stack.
• reduce – The parser replaces the handle within the stack with a non-terminal.
• accept – The parser announces successful completion of parsing.
• error – The parser discovers that a syntax error has occurred and calls an error recovery routine.
Conflicts in shift-reduce parsing:
There are two conflicts that occur in shift-reduce parsing:
1. Shift-reduce conflict: The parser cannot decide whether to shift or to reduce.
2. Reduce-reduce conflict: The parser cannot decide which of several reductions to make.
1. Shift-reduce conflict:
Example:
Consider the grammar:
E → E+E | E*E | id and input id+id*id
Stack    Input    Action                 Stack    Input    Action
$E+E     *id$     Reduce by E → E+E      $E+E     *id$     Shift
$E                                       $E
2. Reduce-reduce conflict:
Consider the grammar:
M → R+R | R+c | R
R → c
and input c+c
OPERATOR-PRECEDENCE PARSING
An efficient way of constructing a shift-reduce parser is called operator-precedence parsing.
An operator-precedence parser can be constructed from a grammar called an operator grammar. These
grammars have the property that no production right side is ε or has two adjacent non-terminals.
Example:
Consider the grammar:
Since the right side EAE has three consecutive non-terminals, the grammar can be written as follows:
E → E+E | E-E | E*E | E/E | E↑E | -E | id
Operator precedence relations:
There are three disjoint precedence relations, namely
<.  less than
=   equal to
.>  greater than
The relations have the following meaning:
a <. b – a yields precedence to b
a = b – a has the same precedence as b
a .> b – a takes precedence over b
Rules for binary operations:
1. If operator θ1 has higher precedence than operator θ2, then make
θ1 .> θ2 and θ2 <. θ1
Example:
Operator-precedence relations for the grammar, assuming
1. ↑ is of highest precedence and right-associative,
2. * and / are of next higher precedence and left-associative, and
3. + and - are of lowest precedence and left-associative.
Note that the blanks in the table denote error entries.
TABLE: Operator-precedence relations
       +    -    *    /    ↑    id   (    )    $
+      .>   .>   <.   <.   <.   <.   <.   .>   .>
-      .>   .>   <.   <.   <.   <.   <.   .>   .>
*      .>   .>   .>   .>   <.   <.   <.   .>   .>
/      .>   .>   .>   .>   <.   <.   <.   .>   .>
↑      .>   .>   .>   .>   <.   <.   <.   .>   .>
id     .>   .>   .>   .>   .>             .>   .>
(      <.   <.   <.   <.   <.   <.   <.   =
)      .>   .>   .>   .>   .>             .>   .>
$      <.   <.   <.   <.   <.   <.   <.
Operator precedence parsing algorithm:
Input: An input string w and a table of precedence relations.
Output: If w is well formed, a skeletal parse tree, with a placeholder non-terminal E labeling all
interior nodes; otherwise, an error indication.
Method: Initially the stack contains $ and the input buffer the string w$. To parse, we execute the
following program:
(1) set ip to point to the first symbol of w$;
(2) repeat forever
(3)     if $ is on top of the stack and ip points to $ then
(4)         return
        else begin
(5)         let a be the topmost terminal symbol on the stack
            and let b be the symbol pointed to by ip;
(6)         if a <. b or a = b then begin
(7)             push b onto the stack;
(8)             advance ip to the next input symbol;
            end;
(9)         else if a .> b then /* reduce */
(10)            repeat
(11)                pop the stack
(12)            until the top stack terminal is related by <.
                to the terminal most recently popped
(13)        else error()
        end
Stack implementation of operator precedence parsing:
Operator precedence parsing uses a stack and a precedence relation table for its
implementation of the above algorithm. It is a shift-reduce parser containing all four actions: shift,
reduce, accept and error.
The initial configuration of an operator precedence parser is
STACK    INPUT
$        w$
Example:
Advantages of operator precedence parsing:
1. It is easy to implement.
2. Once the operator precedence relations are made between all pairs of terminals of a grammar, the
grammar can be ignored. The grammar is not referred to any more during implementation.
Disadvantages of operator precedence parsing:
1. It is hard to handle tokens like the minus sign (-), which has two different precedences.
2. Only a small class of grammars can be parsed using an operator-precedence parser.
LR PARSERS
An efficient bottom-up syntax analysis technique that can be used to parse a large class of
CFGs is called LR(k) parsing. The ‘L’ is for left-to-right scanning of the input, the ‘R’ for
constructing a rightmost derivation in reverse, and the ‘k’ for the number of input symbols of lookahead. When ‘k’ is
omitted, it is assumed to be 1.
Advantages of LR parsing:
✓ It recognizes virtually all programming language constructs for which a CFG can be written.
✓ It is an efficient non-backtracking shift-reduce parsing method.
✓ The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed
with predictive parsers.
✓ It detects a syntactic error as soon as possible.
Drawbacks of LR method:
It is too much work to construct an LR parser by hand for a programming language
grammar. A specialized tool, called an LR parser generator, is needed. Example: YACC.
Types of LR parsing method:
1. SLR - Simple LR
▪ Easiest to implement, least powerful.
2. CLR - Canonical LR
▪ Most powerful, most expensive.
3. LALR - Look-Ahead LR
▪ Intermediate in size and cost between the other two methods.
The LR parsing algorithm:
The schematic form of an LR parser is as follows:
(Figure: model of an LR parser, with INPUT a1 … ai … an $, a STACK holding Sm Xm Sm-1 Xm-1 … S0, the LR parsing program, the action and goto tables, and the OUTPUT)
It consists of: an input, an output, a stack, a driver program, and a parsing table that has two parts (action and
goto).
➢ The driver program is the same for all LR parsers.
➢ The parsing program reads characters from an input buffer one at a time.
➢ The parsing table consists of two parts: action and goto functions.
Action: The function action takes a state and an input symbol as arguments and yields one of four values:
1. shift s, where s is a state,
2. reduce by a grammar production A → β,
3. accept, and
4. error.
Goto: The function goto takes a state and a grammar symbol as arguments and produces a state.
LR parsing algorithm:
Input: An input string w and an LR parsing table with functions action and goto for grammar G.
Output: If w is in L(G), a bottom-up parse for w; otherwise, an error indication.
Method: Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in the input
buffer. The parser then executes the following program:
set ip to point to the first input symbol of w$;
repeat forever begin
    let s be the state on top of the stack and
        a the symbol pointed to by ip;
    if action[s, a] = shift s’ then begin
        push a then s’ on top of the stack;
        advance ip to the next input symbol
    end
    else if action[s, a] = reduce A → β then begin
        pop 2*|β| symbols off the stack;
        let s’ be the state now on top of the stack;
        push A then goto[s’, A] on top of the stack;
        output the production A → β
    end
    else if action[s, a] = accept then return
    else error()
end
CONSTRUCTING SLR(1) PARSING TABLE:
To perform SLR parsing, take the grammar as input and do the following:
1. Find the LR(0) items.
2. Complete the closure.
3. Compute goto(I, X), where I is a set of items and X is a grammar symbol.
LR(0) items:
An LR(0) item of a grammar G is a production of G with a dot at some position of the
right side. For example, production A → XYZ yields the four items:
A → . XYZ
A → X . YZ
A → XY . Z
A → XYZ .
Closure operation:
If I is a set of items for a grammar G, then closure(I) is the set of items constructed from I
by the two rules:
1. Initially, every item in I is added to closure(I).
2. If A → α . Bβ is in closure(I) and B → γ is a production, then add the item B → . γ to closure(I), if it
is not already there. We apply this rule until no more new items can be added to closure(I).
Goto operation:
Goto(I, X) is defined to be the closure of the set of all items [A → αX . β] such that
[A → α . Xβ] is in I.
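The closure and goto operations are short enough to sketch directly. In the Python example below an item is represented as a pair (production index, dot position); this encoding, the function names and the use of the augmented expression grammar E’ → E, E → E+T | T, T → T*F | F, F → (E) | id as test data are choices made for this illustration.

```python
# Augmented expression grammar; production 0 is E' -> E.
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def closure(items):
    """Closure of a set of LR(0) items, each item being (production index, dot position)."""
    result = set(items)
    work = list(items)
    while work:
        p, dot = work.pop()
        rhs = GRAMMAR[p][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            B = rhs[dot]                       # item has the form A -> α . B β
            for q, (lhs, _) in enumerate(GRAMMAR):
                if lhs == B and (q, 0) not in result:
                    result.add((q, 0))         # add B -> . γ
                    work.append((q, 0))
    return frozenset(result)


def goto(I, X):
    """Closure of all items A -> α X . β such that A -> α . X β is in I."""
    moved = {(p, dot + 1) for p, dot in I
             if dot < len(GRAMMAR[p][1]) and GRAMMAR[p][1][dot] == X}
    return closure(moved)


def canonical_collection():
    """All sets of LR(0) items reachable from closure({E' -> . E})."""
    C = [closure({(0, 0)})]
    symbols = sorted({s for _, rhs in GRAMMAR for s in rhs})
    for I in C:                                # C grows while we iterate over it
        for X in symbols:
            J = goto(I, X)
            if J and J not in C:
                C.append(J)
    return C
```

For this grammar `canonical_collection()` returns twelve item sets; the order in which they are discovered in this sketch need not match any particular hand numbering of the sets.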
Steps to construct the SLR parsing table for grammar G are:
1. Augment G and produce G’.
2. Construct the canonical collection of sets of items C for G’.
3. Construct the parsing action function action and goto using the following algorithm,
which requires FOLLOW(A) for each non-terminal of the grammar.
Algorithm for construction of SLR parsing table:
Input: An augmented grammar G’
Output: The SLR parsing table functions action and goto for G’
Method:
1. Construct C = {I0, I1, …, In}, the collection of sets of LR(0) items for G’.
2. State i is constructed from Ii. The parsing functions for state i are determined as follows:
(a) If [A → α·aβ] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to “shift j”. Here a must
be a terminal.
(b) If [A → α·] is in Ii, then set action[i, a] to “reduce A → α” for all a in FOLLOW(A); here A may not be S’.
(c) If [S’ → S·] is in Ii, then set action[i, $] to “accept”.
If any conflicting actions are generated by the above rules, we say the grammar is not SLR(1).
3. The goto transitions for state i are constructed for all non-terminals A using the rule: if goto(Ii, A) = Ij,
then goto[i, A] = j.
4. All entries not defined by rules (2) and (3) are made “error”.
5. The initial state of the parser is the one constructed from the set of items containing [S’ → ·S].
Example for SLR parsing:
Construct the SLR parsing table for the following grammar:
G: E → E+T | T
T → T*F | F
F → (E) | id
The given grammar is:
G: E → E+T ------ (1)
E → T ------ (2)
T → T*F ------ (3)
T → F ------ (4)
F → (E) ------ (5)
F → id ------ (6)
Step 1: Convert the given grammar into an augmented grammar.
Augmented grammar:
E’ → E
E → E+T
E → T
T → T*F
T → F
F → (E)
F → id
Step 2: Find the LR(0) items.
I0 :
E’ → . E
E → . E + T
E → . T
T → . T * F
T → . F
F → . (E)
F → . id
GOTO(I0, E) = I1 :
E’ → E .
E → E . + T
GOTO(I0, T) = I2 :
E → T .
T → T . * F
GOTO(I0, F) = I3 :
T → F .
GOTO(I0, ( ) = I4 :
F → ( . E)
E → . E + T
E → . T
T → . T * F
T → . F
F → . (E)
F → . id
GOTO(I0, id) = I5 :
F → id .
GOTO(I1, +) = I6 :
E → E + . T
T → . T * F
T → . F
F → . (E)
F → . id
GOTO(I2, *) = I7 :
T → T * . F
F → . (E)
F → . id
GOTO(I4, E) = I8 :
F → (E . )
E → E . + T
GOTO(I4, T) = I2
GOTO(I4, F) = I3
GOTO(I4, ( ) = I4
GOTO(I4, id) = I5
GOTO(I6, T) = I9 :
E → E + T .
T → T . * F
GOTO(I6, F) = I3
GOTO(I6, ( ) = I4
GOTO(I6, id) = I5
GOTO(I7, F) = I10 :
T → T * F .
GOTO(I7, ( ) = I4
GOTO(I7, id) = I5
GOTO(I8, ) ) = I11 :
F → (E) .
GOTO(I8, +) = I6
GOTO(I9, *) = I7
FOLLOW(E) = { $, ), + }
FOLLOW(T) = { $, +, ), * }
FOLLOW(F) = { *, +, ), $ }
SLR parsing table:
                    ACTION                      GOTO
STATE   id    +     *     (     )     $     E    T    F
0       s5                s4                1    2    3
1             s6                      acc
2             r2    s7          r2    r2
3             r4    r4          r4    r4
4       s5                s4                8    2    3
5             r6    r6          r6    r6
6       s5                s4                     9    3
7       s5                s4                          10
8             s6                s11
9             r1    s7          r1    r1
10            r3    r3          r3    r3
11            r5    r5          r5    r5
Blank entries are error entries.
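Stack implementation:
Check whether the input id + id * id is valid or not.
STACK                        INPUT        ACTION
0                            id+id*id$    action[0,id] = s5; shift
0 id 5                       +id*id$      action[5,+] = r6; reduce by F → id; goto[0,F] = 3
0 F 3                        +id*id$      action[3,+] = r4; reduce by T → F; goto[0,T] = 2
0 T 2                        +id*id$      action[2,+] = r2; reduce by E → T; goto[0,E] = 1
0 E 1                        +id*id$      action[1,+] = s6; shift
0 E 1 + 6                    id*id$       action[6,id] = s5; shift
0 E 1 + 6 id 5               *id$         action[5,*] = r6; reduce by F → id; goto[6,F] = 3
0 E 1 + 6 F 3                *id$         action[3,*] = r4; reduce by T → F; goto[6,T] = 9
0 E 1 + 6 T 9                *id$         action[9,*] = s7; shift
0 E 1 + 6 T 9 * 7            id$          action[7,id] = s5; shift
0 E 1 + 6 T 9 * 7 id 5       $            action[5,$] = r6; reduce by F → id; goto[7,F] = 10
0 E 1 + 6 T 9 * 7 F 10       $            action[10,$] = r3; reduce by T → T * F; goto[6,T] = 9
0 E 1 + 6 T 9                $            action[9,$] = r1; reduce by E → E + T; goto[0,E] = 1
0 E 1                        $            action[1,$] = accept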
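The shift-reduce moves of an SLR parser for this grammar can be replayed with a small table-driven driver. The sketch below is only an illustration: the dictionaries ACTION, GOTO and PRODS are an assumed encoding of the SLR table for the expression grammar (productions numbered 1 to 6 in the order E → E + T, E → T, T → T * F, T → F, F → (E), F → id), and the stack holds state numbers only.

```python
# PRODS[n] = (head, length of right-hand side) for production n.
PRODS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}
REDUCE_ON_FOLLOW_T = ("+", "*", ")", "$")   # FOLLOW(T) = FOLLOW(F)
ACTION = {
    0: {"id": "s5", "(": "s4"},
    1: {"+": "s6", "$": "acc"},
    2: {"+": "r2", "*": "s7", ")": "r2", "$": "r2"},
    3: {a: "r4" for a in REDUCE_ON_FOLLOW_T},
    4: {"id": "s5", "(": "s4"},
    5: {a: "r6" for a in REDUCE_ON_FOLLOW_T},
    6: {"id": "s5", "(": "s4"},
    7: {"id": "s5", "(": "s4"},
    8: {"+": "s6", ")": "s11"},
    9: {"+": "r1", "*": "s7", ")": "r1", "$": "r1"},
    10: {a: "r3" for a in REDUCE_ON_FOLLOW_T},
    11: {a: "r5" for a in REDUCE_ON_FOLLOW_T},
}
GOTO = {0: {"E": 1, "T": 2, "F": 3}, 4: {"E": 8, "T": 2, "F": 3},
        6: {"T": 9, "F": 3}, 7: {"F": 10}}

def parse(tokens):
    """Return (accepted, list of production numbers used in reductions)."""
    stack = [0]                       # stack of state numbers
    tokens = list(tokens) + ["$"]
    pos = 0
    reductions = []
    while True:
        act = ACTION[stack[-1]].get(tokens[pos])
        if act is None:               # blank entry: error
            return False, reductions
        if act == "acc":
            return True, reductions
        if act[0] == "s":             # shift: push the new state, consume input
            stack.append(int(act[1:]))
            pos += 1
        else:                         # reduce: pop |rhs| states, then goto
            number = int(act[1:])
            head, length = PRODS[number]
            del stack[-length:]
            stack.append(GOTO[stack[-1]][head])
            reductions.append(number)

accepted, used = parse(["id", "+", "id", "*", "id"])
```

For the input id + id * id the driver accepts, and the recorded reductions give the rightmost derivation in reverse.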
MODULE 2 - SYNTAX-DIRECTED TRANSLATION
MODULE 3 - TYPE CHECKING
MODULE 4 - RUN-TIME ENVIRONMENTS
MODULE 4 - INTERMEDIATE CODE GENERATION
INTRODUCTION
The front end translates a source program into an intermediate representation from which the
back end generates target code.
Benefits of using a machine-independent intermediate form are:
1. Retargeting is facilitated. That is, a compiler for a different machine can be created by
attaching a back end for the new machine to an existing front end.
2. A machine-independent code optimizer can be applied to the intermediate representation.
Position of intermediate code generator
INTERMEDIATE LANGUAGES
Three ways of intermediate representation:
• Syntax tree
• Postfix notation
• Three-address code
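Graphical Representations:
Syntax tree:
A syntax tree depicts the natural hierarchical structure of a source program. A dag
(Directed Acyclic Graph) gives the same information but in a more compact way because
common subexpressions are identified. A syntax tree and dag for the assignment statement
a := b * - c + b * - c are as follows:
[Figure: (a) Syntax tree and (b) Dag for a := b * - c + b * - c. In the syntax tree the subtree for b * uminus c appears twice under +; in the dag it appears once and both operands of + point to it.]
Postfix notation:
Postfix notation is a linearized representation of a syntax tree: a list of the nodes of the tree in which a node appears immediately after its children. The postfix notation for the syntax tree above is
a b c uminus * b c uminus * + assign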
Syntax-directed definition:
Syntax trees for assignment statements are produced by the syntax-directed definition.
Non-terminal S generates an assignment statement. The two binary operators + and * are
examples of the full operator set in a typical language. Operator associativities and precedences
are the usual ones, even though they have not been put into the grammar. This definition
constructs the tree from the input a := b * - c + b * - c.
PRODUCTION        SEMANTIC RULE
S → id := E       S.nptr := mknode(‘assign’, mkleaf(id, id.place), E.nptr)
E → E1 + E2       E.nptr := mknode(‘+’, E1.nptr, E2.nptr)
E → E1 * E2       E.nptr := mknode(‘*’, E1.nptr, E2.nptr)
E → - E1          E.nptr := mknode(‘uminus’, E1.nptr)
E → (E1)          E.nptr := E1.nptr
E → id            E.nptr := mkleaf(id, id.place)
Syntax-directed definition to produce syntax trees for assignment statements
The token id has an attribute place that points to the symbol-table entry for the identifier.
A symbol-table entry can be found from an attribute id.name, representing the lexeme associated with
that occurrence of id. If the lexical analyzer holds all lexemes in a single array of characters, then
attribute name might be the index of the first character of the lexeme.
Two representations of the syntax tree are as follows. In (a) each node is represented as a
record with a field for its operator and additional fields for pointers to its children. In (b), nodes
are allocated from an array of records and the index or position of the node serves as the pointer
to the node. All the nodes in the syntax tree can be visited by following pointers, starting from
the root at position 10.
Two representations of the syntax tree
(a) [Figure: the syntax tree drawn as linked records, with assign at the root, id a and + as its children, and two * subtrees each over id b and uminus (id c).]
(b)
POSITION   OPERATOR   FIELDS
0          id         b
1          id         c
2          uminus     1
3          *          0   2
4          id         b
5          id         c
6          uminus     5
7          *          4   6
8          +          3   7
9          id         a
10         assign     9   8
Three-Address Code:
Three-address code is a sequence of statements of the general form
x := y op z
where x, y and z are names, constants, or compiler-generated temporaries; op stands for any
operator, such as a fixed- or floating-point arithmetic operator, or a logical operator on boolean-
valued data. Thus a source language expression like x + y * z might be translated into a sequence
t1 := y * z
t2 := x + t1
where t1 and t2 are compiler-generated temporary names.
Advantages of three-address code:
The use of names for the intermediate values computed by a program allows three-
address code to be easily rearranged, unlike postfix notation.
Three-address code corresponding to the syntax tree and dag given above
(a) Code for the syntax tree      (b) Code for the dag
t1 := - c                         t1 := - c
t2 := b * t1                      t2 := b * t1
t3 := - c                         t5 := t2 + t2
t4 := b * t3                      a := t5
t5 := t2 + t4
a := t5
Types of Three-Address Statements:
The common three-address statements are:
1. Assignment statements of the form x := y op z, where op is a binary arithmetic or logical operation.
2. Assignment instructions of the form x := op y, where op is a unary operation.
3. Copy statements of the form x := y where the value of y is assigned to x.
4. The unconditional jump goto L. The three-address statement with label L is the next to be
executed.
5. Conditional jumps such as if x relop y goto L. This instruction applies a relational operator
(<, =, >=, etc.) to x and y, and executes the statement with label L next if x stands in relation
relop to y. If not, the three-address statement following if x relop y goto L is executed
next, as in the usual sequence.
6. param x and call p, n for procedure calls and return y, where y representing a returned value is
optional. For example,
param x1
param x2
...
param xn
call p, n
is generated as part of a call of the procedure p(x1, x2, …, xn).
7. Indexed assignments of the form x := y[i] and x[i] := y.
8. Address and pointer assignments of the form x := &y, x := *y, and *x := y.
Syntax-Directed Translation into Three-Address Code:
When three-address code is generated, temporary names are made up for the interior
nodes of a syntax tree. For example, id := E consists of code to evaluate E into some temporary
t, followed by the assignment id.place := t.
Syntax-directed definition to produce three-address code for assignments
PRODUCTION        SEMANTIC RULES
S → id := E       S.code := E.code || gen(id.place ‘:=’ E.place)
E → E1 + E2       E.place := newtemp;
                  E.code := E1.code || E2.code || gen(E.place ‘:=’ E1.place ‘+’ E2.place)
E → E1 * E2       E.place := newtemp;
                  E.code := E1.code || E2.code || gen(E.place ‘:=’ E1.place ‘*’ E2.place)
E → - E1          E.place := newtemp;
                  E.code := E1.code || gen(E.place ‘:=’ ‘uminus’ E1.place)
Code layout for the while statement:
S.begin:   E.code
           if E.place = 0 goto S.after
           S1.code
           goto S.begin
S.after:   ...
PRODUCTION             SEMANTIC RULES
S → while E do S1      S.begin := newlabel;
                       S.after := newlabel;
                       S.code := gen(S.begin ‘:’) ||
                                 E.code ||
                                 gen(‘if’ E.place ‘=’ ‘0’ ‘goto’ S.after) ||
                                 S1.code ||
                                 gen(‘goto’ S.begin) ||
                                 gen(S.after ‘:’)
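As a rough illustration of how the rules for expressions generate temporaries, the sketch below walks a small expression tree and emits three-address statements. The tuple encoding of the tree and the names translate_assignment, walk and newtemp are assumptions made for this example; only the behaviour of newtemp (return a fresh name on each call) comes from the text.

```python
import itertools

def translate_assignment(target, expr):
    """Emit three-address code for `target := expr`.

    expr is a name (str), ("uminus", operand) or (op, left, right).
    """
    counter = itertools.count(1)
    code = []

    def newtemp():
        # Returns a distinct temporary name t1, t2, ... on each call.
        return f"t{next(counter)}"

    def walk(node):
        if isinstance(node, str):          # E -> id: the place is the name itself
            return node
        if node[0] == "uminus":            # E -> - E1
            operand = walk(node[1])
            place = newtemp()
            code.append(f"{place} := uminus {operand}")
            return place
        op, left, right = node             # E -> E1 op E2
        left_place = walk(left)
        right_place = walk(right)
        place = newtemp()
        code.append(f"{place} := {left_place} {op} {right_place}")
        return place

    code.append(f"{target} := {walk(expr)}")   # S -> id := E
    return code

product = ("*", "b", ("uminus", "c"))
tac = translate_assignment("a", ("+", product, product))
```

Because the tree is walked without sharing, the common subexpression b * - c is computed twice, which corresponds to the code for the syntax tree rather than the dag.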
Implementation of Three-Address Statements:
Three-address statements can be implemented as records with fields for the operator and the operands. Three such representations are:
Quadruples
Triples
Indirect triples
Quadruples:
A quadruple is a record structure with four fields, which are op, arg1, arg2 and result.
The op field contains an internal code for the operator. The three-address statement x :=
y op z is represented by placing y in arg1, z in arg2 and x in result.
The contents of fields arg1, arg2 and result are normally pointers to the symbol-table
entries for the names represented by these fields. If so, temporary names must be entered
into the symbol table as they are created.
Triples:
To avoid entering temporary names into the symbol table, we might refer to a temporary
value by the position of the statement that computes it.
If we do so, three-address statements can be represented by records with only three fields:
op, arg1 and arg2.
The fields arg1 and arg2, for the arguments of op, are either pointers to the symbol table
or pointers into the triple structure (for temporary values).
Since three fields are used, this intermediate code format is known as triples.
(a) Quadruples (b) Triples
Quadruple and triple representation of three-address statements given above
A ternary operation like x[i] := y requires two entries in the triple structure, as shown
below, while x := y[i] is naturally represented as two operations.
(a) x[i] := y (b) x := y[i]
Indirect Triples:
Another implementation of three-address code is that of listing pointers to triples, rather than
listing the triples themselves. This implementation is called indirect triples.
Indirect triples representation of three-address statements
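The relationship between the three representations can be sketched in Python for the statement a := b * - c + b * - c. The tuple layouts below are assumptions chosen for this example: a quadruple is (op, arg1, arg2, result), a triple is (op, arg1, arg2), and an integer argument in a triple stands for a pointer to the triple at that position.

```python
# Quadruples: temporaries t1..t5 appear explicitly in the result field.
quads = [
    ("uminus", "c", None, "t1"),
    ("*", "b", "t1", "t2"),
    ("uminus", "c", None, "t3"),
    ("*", "b", "t3", "t4"),
    ("+", "t2", "t4", "t5"),
    (":=", "t5", None, "a"),
]

def to_triples(quadruples):
    """Replace each temporary by the position of the triple that computes it."""
    position = {}                      # temporary name -> triple index
    triples = []
    for op, arg1, arg2, result in quadruples:
        arg1 = position.get(arg1, arg1)
        arg2 = position.get(arg2, arg2)
        if op == ":=":                 # copy into a program name
            triples.append(("assign", result, arg1))
        else:
            position[result] = len(triples)
            triples.append((op, arg1, arg2))
    return triples

triples = to_triples(quads)

# Indirect triples: a separate statement list holds pointers (indices) into
# the triple array, so statements can be reordered without renumbering.
statements = list(range(len(triples)))
```

With triples no temporary names need to be entered in the symbol table; with indirect triples, moving a statement only changes the statements list.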
DECLARATIONS
In the translation scheme shown below:
Nonterminal P generates a sequence of declarations of the form id : T.
Before the first declaration is considered, offset is set to 0. As each new name is seen,
that name is entered in the symbol table with offset equal to the current value of offset,
and offset is incremented by the width of the data object denoted by that name.
The procedure enter(name, type, offset) creates a symbol-table entry for name, gives it
type type and relative address offset in its data area.
Attribute type represents a type expression constructed from the basic types integer and
real by applying the type constructors pointer and array. If type expressions are
represented by graphs, then attribute type might be a pointer to the node representing a
type expression.
The width of an array is obtained by multiplying the width of each element by the number
of elements in the array. The width of each pointer is assumed to be 4.
Computing the types and relative addresses of declared names
P → { offset := 0 } D
D → D ; D
D → id : T        { enter(id.name, T.type, offset);
                    offset := offset + T.width }
T → integer       { T.type := integer;
                    T.width := 4 }
T → real          { T.type := real;
                    T.width := 8 }
T → ↑ T1          { T.type := pointer(T1.type);
                    T.width := 4 }
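The offset computation can be sketched as follows. This is an assumed Python rendering, not the scheme itself: type expressions are encoded as the strings "integer" and "real" or as the tuples ("pointer", t) and ("array", n, t), and the helper names type_width and layout are made up for this example. The widths 4 for integer and pointer and 8 for real are the ones stated above.

```python
BASIC_WIDTH = {"integer": 4, "real": 8}

def type_width(type_expr):
    """Width in bytes of a type expression."""
    if isinstance(type_expr, str):
        return BASIC_WIDTH[type_expr]
    if type_expr[0] == "pointer":
        return 4
    if type_expr[0] == "array":        # number of elements times element width
        _, count, element = type_expr
        return count * type_width(element)
    raise ValueError(f"unknown type expression: {type_expr!r}")

def layout(declarations):
    """Assign relative addresses to a sequence of (name, type) declarations."""
    symbol_table = {}
    offset = 0                         # set to 0 before the first declaration
    for name, type_expr in declarations:
        symbol_table[name] = (type_expr, offset)   # enter(name, type, offset)
        offset += type_width(type_expr)
    return symbol_table, offset

table, total = layout([
    ("i", "integer"),
    ("x", "real"),
    ("p", ("pointer", "real")),
    ("a", ("array", 10, "integer")),
])
```

Each name receives the value of offset at the time it is declared, and the final offset is the total size of the data area.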
Keeping Track of Scope Information:
P → D
D → D ; D | id : T | proc id ; D ; S
One possible implementation of a symbol table is a linked list of entries for names.
A new symbol table is created when a procedure declaration D → proc id ; D1 ; S is seen,
and entries for the declarations in D1 are created in the new table. The new table points back to
the symbol table of the enclosing procedure; the name represented by id itself is local to the
enclosing procedure. The only change from the treatment of variable declarations is that the
procedure enter is told which symbol table to make an entry in.
For example, consider the symbol tables for procedures readarray, exchange, and
quicksort pointing back to that for the containing procedure sort, consisting of the entire
program. Since partition is declared within quicksort, its table points to that of quicksort.
Symbol tables for nested procedures
[Figure: the symbol table for sort (header with a nil back pointer) holds entries a, x, readarray, exchange and quicksort; the procedure entries point to the tables for readarray, exchange and quicksort; the table for quicksort holds an entry for partition, whose table holds i and j. Each table points back to the table of its enclosing procedure.]
The semantic rules are defined in terms of the following operations:
1. mktable(previous) creates a new symbol table and returns a pointer to the new table. The
argument previous points to a previously created symbol table, presumably that for the
enclosing procedure.
2. enter(table, name, type, offset) creates a new entry for name name in the symbol table
pointed to by table, and places type type and relative address offset in fields within the entry.
3. addwidth(table, width) records the cumulative width of all the entries in table in the header
associated with this symbol table.
4. enterproc(table, name, newtable) creates a new entry for procedure name in the symbol table
pointed to by table. The argument newtable points to the symbol table for this procedure
name.
Syntax-directed translation scheme for nested procedures
P → M D                    { addwidth(top(tblptr), top(offset));
                             pop(tblptr); pop(offset) }
M → ɛ                      { t := mktable(nil);
                             push(t, tblptr); push(0, offset) }
D → D1 ; D2
D → proc id ; N D1 ; S     { t := top(tblptr);
                             addwidth(t, top(offset));
                             pop(tblptr); pop(offset);
                             enterproc(top(tblptr), id.name, t) }
D → id : T                 { enter(top(tblptr), id.name, T.type, top(offset));
                             top(offset) := top(offset) + T.width }
N → ɛ                      { t := mktable(top(tblptr));
                             push(t, tblptr); push(0, offset) }
The stack tblptr is used to contain pointers to the tables for sort, quicksort, and partition
when the declarations in partition are considered.
All semantic actions in the subtrees for B and C in
A → B C { actionA }
are done before actionA at the end of the production occurs. Hence, the action associated
with the marker M is the first to be done.
The action for nonterminal M initializes stack tblptr with a symbol table for the outermost
scope, created by operation mktable(nil). The action also pushes relative address 0 onto
stack offset.
For each variable declaration id : T, an entry is created for id in the current symbol table.
The top of stack offset is incremented by T.width.
When the action on the right side of D → proc id ; N D1 ; S occurs, the width of all
declarations generated by D1 is on the top of stack offset; it is recorded using addwidth. Stacks
tblptr and offset are then popped.
At this point, the name of the enclosed procedure is entered into the symbol table of its enclosing
procedure.
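The interplay of the two stacks can be sketched in Python. The class SymbolTable and the driver functions begin_scope, declare, end_proc and end_program are hypothetical names introduced for this illustration; they stand for the actions of the markers M and N, of a variable declaration, of a procedure declaration, and of P → M D respectively.

```python
class SymbolTable:
    def __init__(self, previous=None):
        self.previous = previous       # table of the enclosing procedure
        self.entries = {}
        self.width = 0                 # cumulative width, kept in the header

def mktable(previous):
    return SymbolTable(previous)

def enter(table, name, type_, offset):
    table.entries[name] = ("var", type_, offset)

def addwidth(table, width):
    table.width = width

def enterproc(table, name, newtable):
    table.entries[name] = ("proc", newtable)

tblptr, offset = [], []                # the two stacks used by the scheme

def begin_scope():                     # actions of markers M and N
    previous = tblptr[-1] if tblptr else None
    tblptr.append(mktable(previous))
    offset.append(0)

def declare(name, type_, width):       # action of D -> id : T
    enter(tblptr[-1], name, type_, offset[-1])
    offset[-1] += width

def end_proc(name):                    # action of D -> proc id ; N D1 ; S
    table = tblptr.pop()
    addwidth(table, offset.pop())
    enterproc(tblptr[-1], name, table)

def end_program():                     # action of P -> M D
    table = tblptr.pop()
    addwidth(table, offset.pop())
    return table

begin_scope()                          # outermost scope (sort)
declare("a", "integer", 4)
declare("x", "real", 8)
begin_scope()                          # nested procedure quicksort
declare("i", "integer", 4)
end_proc("quicksort")
root = end_program()
```

After the run, the table for quicksort is reachable from its entry in the outer table and points back to that table through previous.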
ASSIGNMENT STATEMENTS
Suppose that the context in which an assignment appears is given by the following grammar.
P → M D
M → ɛ
D → D ; D | id : T | proc id ; N D ; S
N → ɛ
Nonterminal P becomes the new start symbol when these productions are added to those in the
translation scheme shown below.
Translation scheme to produce three-address code for assignments
S → id := E       { p := lookup(id.name);
                    if p ≠ nil then
                      emit(p ‘:=’ E.place)
                    else error }
E → E1 + E2       { E.place := newtemp;
                    emit(E.place ‘:=’ E1.place ‘+’ E2.place) }
E → - E1          { E.place := newtemp;
                    emit(E.place ‘:=’ ‘uminus’ E1.place) }
E → (E1)          { E.place := E1.place }
E → id            { p := lookup(id.name);
                    if p ≠ nil then
                      E.place := p
                    else error }
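Reusing Temporary Names
Temporaries can be reused by changing newtemp. The code generated by the rules for E →
E1 + E2 has the general form:
evaluate E1 into t1
evaluate E2 into t2
t := t1 + t2
The lifetimes of these temporaries are nested like matching pairs of balanced parentheses.
Keep a count c, initialized to zero. Whenever a temporary name is used as an operand, decrement c by 1. Whenever a new temporary name is generated, use $c and increase c by 1.
For example, consider the assignment x := a * b + c * d – e * f
Three-address code with stack temporaries
statement            value of c
                     0
$0 := a * b          1
$1 := c * d          2
$0 := $0 + $1        1
$1 := e * f          2
$0 := $0 - $1        1
x := $0              0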
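The counter discipline can be tried out with a short Python sketch. The class TempStack and the functions gen and translate are names assumed for this example; the only rule taken from the text is that a temporary used as an operand decrements c and a newly generated temporary is named $c before c is incremented.

```python
class TempStack:
    """Temporaries $0, $1, ... managed like a stack through the count c."""
    def __init__(self):
        self.c = 0

    def use(self, name):
        if name.startswith("$"):       # a temporary used as an operand is freed
            self.c -= 1

    def new(self):
        name = f"${self.c}"
        self.c += 1
        return name

def gen(node, temps, code):
    """node is a name (str) or a tuple (op, left, right)."""
    if isinstance(node, str):
        return node
    op, left, right = node
    left_place = gen(left, temps, code)
    right_place = gen(right, temps, code)
    temps.use(right_place)
    temps.use(left_place)
    place = temps.new()
    code.append(f"{place} := {left_place} {op} {right_place}")
    return place

def translate(target, expr):
    temps, code = TempStack(), []
    place = gen(expr, temps, code)
    temps.use(place)
    code.append(f"{target} := {place}")
    return code, temps.c

expr = ("-", ("+", ("*", "a", "b"), ("*", "c", "d")), ("*", "e", "f"))
code, final_c = translate("x", expr)
```

Only two temporaries, $0 and $1, are needed for x := a * b + c * d - e * f, and c returns to 0 at the end.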
Addressing Array Elements:
Elements of an array can be accessed quickly if the elements are stored in a block of
consecutive locations. If the width of each array element is w, then the ith element of array A
begins in location
base + (i – low) x w
where low is the lower bound on the subscript and base is the relative address of the storage allocated for the array. The expression can be partially evaluated at compile time if it is rewritten as
i x w + (base – low x w)
The subexpression c = base – low x w can be evaluated when the declaration of the array is seen.
We assume that c is saved in the symbol table entry for A, so the relative address of A[i] is
obtained by simply adding i x w to c.
Address calculation of multi-dimensional arrays:
A two-dimensional array is stored in one of the two forms:
Row-major (row-by-row)
Column-major (column-by-column)
Layouts for a 2 x 3 array
(a) ROW-MAJOR:    A[1,1], A[1,2], A[1,3] (first row), A[2,1], A[2,2], A[2,3] (second row)
(b) COLUMN-MAJOR: A[1,1], A[2,1] (first column), A[1,2], A[2,2] (second column), A[1,3], A[2,3] (third column)
In the case of row-major form, the relative address of A[i1, i2] can be calculated by the formula
base + ((i1 – low1) x n2 + i2 – low2) x w
where low1 and low2 are the lower bounds on the values of i1 and i2, and n2 is the number of values
that i2 can take. That is, if high2 is the upper bound on the value of i2, then n2 = high2 – low2 + 1.
Assuming that i1 and i2 are the only values that are not known at compile time, we can rewrite the
above expression as
((i1 x n2) + i2) x w + (base – ((low1 x n2) + low2) x w)
Generalized formula:
The expression generalizes to the following expression for the relative address of A[i1, i2, …, ik]:
((…((i1 x n2 + i2) x n3 + i3)…) x nk + ik) x w + base – ((…((low1 x n2 + low2) x n3 + low3)…) x nk + lowk) x w
where, for all j, nj = highj – lowj + 1
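The row-major formula can be checked numerically with a short Python sketch. The function name row_major_address and its parameter layout (a list of (low, high) bounds per dimension) are assumptions made for this example; the split into a part that depends on the subscripts and a part that can be precomputed follows the rewritten expression above.

```python
def row_major_address(base, w, indices, bounds):
    """Relative address of A[i1, ..., ik] in row-major order.

    bounds is a list of (low_j, high_j) pairs, one per dimension.
    """
    variable_part = 0    # ((i1 x n2 + i2) x n3 + ...) x nk + ik
    constant_part = 0    # same nesting with low_j in place of i_j
    for i, (low, high) in zip(indices, bounds):
        n = high - low + 1
        variable_part = variable_part * n + i
        constant_part = constant_part * n + low
    # base - constant_part * w is what a compiler could store as c.
    return variable_part * w + (base - constant_part * w)

# A 2 x 3 array A[1..2, 1..3] of 4-byte elements starting at address 100.
bounds = [(1, 2), (1, 3)]
first = row_major_address(100, 4, (1, 1), bounds)
last = row_major_address(100, 4, (2, 3), bounds)
```

With a single dimension the function reduces to i x w + (base – low x w).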
The Translation Scheme for Addressing Array Elements:
Semantic actions will be added to the grammar:
(1) S → L := E
(2) E → E + E
(3) E → (E)
(4) E → L
(5) L → Elist ]
(6) L → id
(7) Elist → Elist , E
(8) Elist → id [ E
We generate a normal assignment if L is a simple name, and an indexed assignment into the location
denoted by L otherwise:
Elist.place := E.place;
Elist.ndim := 1 }
Consider the grammar for assignment statements as above, but suppose there are two
types, real and integer, with integers converted to reals when necessary. We have another
attribute E.type, whose value is either real or integer. The semantic rule for E.type associated
with the production E → E + E is:
E → E + E         { E.type :=
                      if E1.type = integer and
                         E2.type = integer then integer
                      else real }
The entire semantic rule for E → E + E and most of the other productions must be
modified to generate, when necessary, three-address statements of the form x := inttoreal y,
whose effect is to convert integer y to a real of equal value, called x.
Semantic action for E → E1 + E2
E.place := newtemp;
if E1.type = integer and E2.type = integer then begin
    emit(E.place ‘:=’ E1.place ‘int+’ E2.place);
    E.type := integer
end
else if E1.type = real and E2.type = real then begin
    emit(E.place ‘:=’ E1.place ‘real+’ E2.place);
    E.type := real
end
else if E1.type = integer and E2.type = real then begin
    u := newtemp;
    emit(u ‘:=’ ‘inttoreal’ E1.place);
    emit(E.place ‘:=’ u ‘real+’ E2.place);
    E.type := real
end
else if E1.type = real and E2.type = integer then begin
    u := newtemp;
    emit(u ‘:=’ ‘inttoreal’ E2.place);
    emit(E.place ‘:=’ E1.place ‘real+’ u);
    E.type := real
end
else
    E.type := type_error;
For example, for the input x := y + i * j,
assuming x and y have type real, and i and j have type integer, the output would look like
t1 := i int* j
t3 := inttoreal t1
t2 := y real+ t3
x := t2
BOOLEAN EXPRESSIONS
Boolean expressions have two primary purposes. They are used to compute logical
values, but more often they are used as conditional expressions in statements that alter the flow
of control, such as if-then-else, or while-do statements.
Boolean expressions are composed of the boolean operators (and, or, and not) applied
to elements that are boolean variables or relational expressions. Relational expressions are of the
form E1 relop E2, where E1 and E2 are arithmetic expressions.
Here we consider boolean expressions generated by the following grammar:
E → E or E | E and E | not E | ( E ) | id relop id | true | false
Methods of Translating Boolean Expressions:
There are two principal methods of representing the value of a boolean expression. They are:
To encode true and false numerically and to evaluate a boolean expression analogously to
an arithmetic expression. Often, 1 is used to denote true and 0 to denote false.
To implement boolean expressions by flow of control, that is, representing the value of a
boolean expression by a position reached in a program. This method is particularly
convenient in implementing the boolean expressions in flow-of-control statements, such
as the if-then and while-do statements.
Numerical Representation
Here, 1 denotes true and 0 denotes false. Expressions will be evaluated completely from
left to right, in a manner similar to arithmetic expressions.
For example:
The translation for
a or b and not c
is the three-address sequence
t1 := not c
t2 := b and t1
t3 := a or t2
Translation scheme using a numerical representation for booleans
E → E1 or E2          { E.place := newtemp;
                        emit(E.place ‘:=’ E1.place ‘or’ E2.place) }
E → E1 and E2         { E.place := newtemp;
                        emit(E.place ‘:=’ E1.place ‘and’ E2.place) }
E → not E1            { E.place := newtemp;
                        emit(E.place ‘:=’ ‘not’ E1.place) }
E → ( E1 )            { E.place := E1.place }
E → id1 relop id2     { E.place := newtemp;
                        emit(‘if’ id1.place relop.op id2.place ‘goto’ nextstat + 3);
                        emit(E.place ‘:=’ ‘0’);
                        emit(‘goto’ nextstat + 2);
                        emit(E.place ‘:=’ ‘1’) }
E → true              { E.place := newtemp;
                        emit(E.place ‘:=’ ‘1’) }
E → false             { E.place := newtemp;
                        emit(E.place ‘:=’ ‘0’) }
Short-Circuit Code:
We can also translate a boolean expression into three-address code without generating
code for any of the boolean operators and without having the code necessarily evaluate the entire
expression. This style of evaluation is sometimes called “short-circuit” or “jumping” code. It is
possible to evaluate boolean expressions without generating code for the boolean operators and,
or, and not if we represent the value of an expression by a position in the code sequence.
Translation of a < b or c < d and e < f
We now consider the translation of boolean expressions into three-address code in the
context of if-then, if-then-else, and while-do statements such as those generated by the following
grammar:
S → if E then S1
  | if E then S1 else S2
  | while E do S1
E.true is the label to which control flows if E is true, and E.false is the label to which
control flows if E is false.
The semantic rules for translating a flow-of-control statement S allow control to flow
from the translation S.code to the three-address instruction immediately following S.code.
S.next is a label that is attached to the first three-address instruction to be executed after
the code for S.
Code for if-then, if-then-else, and while-do statements
(a) if-then
            E.code          (jumps to E.true or to E.false)
E.true:     S1.code
E.false:    ...
(b) if-then-else
            E.code          (jumps to E.true or to E.false)
E.true:     S1.code
            goto S.next
E.false:    S2.code
S.next:     ...
(c) while-do
S.begin:    E.code          (jumps to E.true or to E.false)
E.true:     S1.code
            goto S.begin
E.false:    ...
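Syntax-directed definition for flow-of-control statements
PRODUCTION                    SEMANTIC RULES
S → if E then S1              E.true := newlabel;
                              E.false := S.next;
                              S1.next := S.next;
                              S.code := E.code || gen(E.true ‘:’) || S1.code
S → if E then S1 else S2      E.true := newlabel;
                              E.false := newlabel;
                              S1.next := S.next;
                              S2.next := S.next;
                              S.code := E.code || gen(E.true ‘:’) || S1.code ||
                                        gen(‘goto’ S.next) ||
                                        gen(E.false ‘:’) || S2.code
S → while E do S1             S.begin := newlabel;
                              E.true := newlabel;
                              E.false := S.next;
                              S1.next := S.begin;
                              S.code := gen(S.begin ‘:’) || E.code ||
                                        gen(E.true ‘:’) || S1.code ||
                                        gen(‘goto’ S.begin)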
Control-Flow Translation of Boolean Expressions:
Syntax-directed definition to produce three-address code for booleans
PRODUCTION               SEMANTIC RULES
E → E1 or E2             E1.true := E.true;
                         E1.false := newlabel;
                         E2.true := E.true;
                         E2.false := E.false;
                         E.code := E1.code || gen(E1.false ‘:’) || E2.code
E → E1 and E2            E1.true := newlabel;
                         E1.false := E.false;
                         E2.true := E.true;
                         E2.false := E.false;
                         E.code := E1.code || gen(E1.true ‘:’) || E2.code
E → not E1               E1.true := E.false;
                         E1.false := E.true;
                         E.code := E1.code
E → ( E1 )               E1.true := E.true;
                         E1.false := E.false;
                         E.code := E1.code
E → id1 relop id2        E.code := gen(‘if’ id1.place relop.op id2.place
                                   ‘goto’ E.true) || gen(‘goto’ E.false)
E → true                 E.code := gen(‘goto’ E.true)
E → false                E.code := gen(‘goto’ E.false)
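A compact way to see the jumping code these rules produce is the Python sketch below. The tuple encoding of expressions and the names translate, walk and newlabel are assumptions for this example; the inherited attributes E.true and E.false are passed down as the parameters true and false.

```python
import itertools

def translate(expr, true="Ltrue", false="Lfalse"):
    """Return short-circuit three-address code for a boolean expression.

    expr is ("relop", left, op, right), ("not", e), ("and", e1, e2),
    ("or", e1, e2), ("true",) or ("false",).
    """
    counter = itertools.count(1)

    def newlabel():
        return f"L{next(counter)}"

    def walk(e, true, false):
        kind = e[0]
        if kind == "relop":
            _, left, op, right = e
            return [f"if {left} {op} {right} goto {true}", f"goto {false}"]
        if kind == "not":              # swap the two exits, no code of its own
            return walk(e[1], false, true)
        if kind == "and":              # E1 true falls into the code for E2
            e1_true = newlabel()
            return walk(e[1], e1_true, false) + [f"{e1_true}:"] + walk(e[2], true, false)
        if kind == "or":               # E1 false falls into the code for E2
            e1_false = newlabel()
            return walk(e[1], true, e1_false) + [f"{e1_false}:"] + walk(e[2], true, false)
        if kind == "true":
            return [f"goto {true}"]
        if kind == "false":
            return [f"goto {false}"]
        raise ValueError(f"unknown expression: {e!r}")

    return walk(expr, true, false)

expr = ("or", ("relop", "a", "<", "b"),
        ("and", ("relop", "c", "<", "d"), ("relop", "e", "<", "f")))
jumps = translate(expr)
```

No instruction for and, or or not is emitted; the value of the expression is represented by whether control reaches Ltrue or Lfalse.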
CASE STATEMENTS
The “switch” or “case” statement has the following syntax:
switch expression
begin
    case value : statement
    case value : statement
    ...
    case value : statement
    default : statement
end
The intended translation of a switch is code to:
1. Evaluate the expression.
2. Find which value in the list of cases is the same as the value of the expression.
3. Execute the statement associated with the value found.
Step (2) can be implemented in one of several ways:
By a sequence of conditional goto statements, if the number of cases is small.
By creating a table of pairs, with each pair consisting of a value and a label for the code
of the corresponding statement. The compiler generates a loop to compare the value of the
expression with each value in the table. If no match is found, the default (last) entry is
sure to match.
If the number of cases is large, it is efficient to construct a hash table.
There is a common special case in which an efficient implementation of the n-way branch exists.
If the values all lie in some small range, say imin to imax, and the number of different values
is a reasonable fraction of imax - imin, then we can construct an array of labels, with the
label of the statement for value j in the entry of the table with offset j - imin
and the label for the default in entries not filled otherwise. To perform the switch,
evaluate the expression to obtain the value of j, check that the value is within range and transfer to the
table entry at offset j - imin.
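The array-of-labels technique can be sketched in Python. The function names build_jump_table and dispatch, and the use of strings as labels, are assumptions made for this illustration.

```python
def build_jump_table(cases, default):
    """cases maps each case value to its label; returns (imin, table).

    table[j - imin] holds the label for value j, or the default label
    for values in the range that have no case of their own.
    """
    imin, imax = min(cases), max(cases)
    table = [cases.get(value, default) for value in range(imin, imax + 1)]
    return imin, table

def dispatch(j, imin, table, default):
    """Range check first, then index the table at offset j - imin."""
    if 0 <= j - imin < len(table):
        return table[j - imin]
    return default

imin, table = build_jump_table({1: "L1", 2: "L2", 4: "L4"}, default="Ldefault")
```

The table costs one entry per value between imin and imax, which is why the technique pays off only when the case values are reasonably dense in that range.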
Syntax-Directed Translation of Case Statements:
Consider the following switch statement:
switch E
begin
    case V1 : S1
    case V2 : S2
    ...
    case Vn-1 : Sn-1
    default : Sn
end
This case statement is translated into intermediate code that has the following form:
Translation of a case statement
        code to evaluate E into t
        goto test
L1:     code for S1
        goto next
L2:     code for S2
        goto next
        ...
Ln-1:   code for Sn-1
        goto next
Ln:     code for Sn
        goto next
test:   if t = V1 goto L1
        if t = V2 goto L2
        ...
        if t = Vn-1 goto Ln-1
        goto Ln
next:
To translate into the above form:
When keyword switch is seen, two new labels test and next, and a new temporary t are
generated.
As each case keyword occurs, a new label Li is created and entered into the symbol table. A
pointer to this symbol-table entry and the value Vi of the case constant are placed on a stack
(used only to store cases).
Each statement case Vi : Si is processed by emitting the newly created label Li, followed
by the code for Si, followed by the jump goto next.
Then when the keyword end terminating the body of the switch is found, the code can be
generated for the n-way branch. Reading the pointer-value pairs on the case stack from
the bottom to the top, we can generate a sequence of three-address statements of the form
case V1 L1
case V2 L2
...
case Vn-1 Ln-1
case t Ln
label next
where t is the name holding the value of the selector expression E, and Ln is the label for
the default statement.
BACKPATCHING
The easiest way to implement the syntax-directed definitions for boolean expressions is to
use two passes. First, construct a syntax tree for the input, and then walk the tree in depth-first
order, computing the translations. The main problem with generating code for boolean
expressions and flow-of-control statements in a single pass is that during one single pass we may not
know the labels that control must go to at the time the jump statements are generated. Hence, a series
of branching statements with the targets of the jumps left unspecified is generated. Each
statement will be put on a list of goto statements whose labels will be filled in when the proper
label can be determined. We call this subsequent filling in of labels backpatching.
To manipulate lists of labels, we use three functions:
1. makelist(i) creates a new list containing only i, an index into the array of quadruples;
makelist returns a pointer to the list it has made.
2. merge(p1, p2) concatenates the lists pointed to by p1 and p2, and returns a pointer to the
concatenated list.
3. backpatch(p, i) inserts i as the target label for each of the statements on the list pointed to
by p.
Boolean Expressions:
We now construct a translation scheme suitable for producing quadruples for boolean
expressions during bottom-up parsing. The grammar we use is the following:
(1) E → E1 or M E2
(2)   | E1 and M E2
(3)   | not E1
(4)   | ( E1 )
(5)   | id1 relop id2
(6)   | true
(7)   | false
(8) M → ɛ
Synthesized attributes truelist and falselist of nonterminal E are used to generate jumping code
for boolean expressions. Incomplete jumps with unfilled labels are placed on lists pointed to by
E.truelist and E.falselist.
Consider production E → E1 and M E2. If E1 is false, then E is also false, so the statements on
E1.falselist become part of E.falselist. If E1 is true, then we must next test E2, so the target for the
statements E1.truelist must be the beginning of the code generated for E2. This target is obtained
using marker nonterminal M.
Attribute M.quad records the number of the first statement of E2.code. With the production M →
ɛ we associate the semantic action
{ M.quad := nextquad }
The variable nextquad holds the index of the next quadruple to follow. This value will be
backpatched onto the E1.truelist when we have seen the remainder of the production E → E1 and M E2.
The translation scheme is as follows:
A translation scheme is developed for statements generated by the following grammar:
(1) S → if E then S
(2)   | if E then S else S
(3)   | while E do S
(4)   | begin L end
(5)   | A
(6) L → L ; S
(7)   | S
Scheme to implement the Translation:
The nonterminal E has two attributes E.truelist and E.falselist. L and S also need a list of
unfilled quadruples that must eventually be completed by backpatching. These lists are pointed to
by the attributes L.nextlist and S.nextlist. S.nextlist is a pointer to a list of all conditional and
unconditional jumps to the quadruple following the statement S in execution order, and L.nextlist is
defined similarly.
The semantic rules for the revised grammar are as follows:
(1) S → if E then M1 S1 N else M2 S2
{ backpatch(E.truelist, M1.quad);
  backpatch(E.falselist, M2.quad);
  S.nextlist := merge(S1.nextlist, merge(N.nextlist, S2.nextlist)) }
We backpatch the jumps when E is true to the quadruple M1.quad, which is the beginning of the
code for S1. Similarly, we backpatch jumps when E is false to go to the beginning of the code for S2. The
list S.nextlist includes all jumps out of S1 and S2, as well as the jump generated by N.
The assignment S.nextlist := nil initializes S.nextlist to an empty list.
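The three list functions and the use of the marker M can be sketched in Python for boolean expressions. This is an assumed rendering: quadruples are kept as two-element lists [text, target] numbered from 0, an unfilled target is None, and the helper names relop, and_ and or_ stand for the semantic actions of the corresponding productions.

```python
quads = []                               # each entry: [instruction text, jump target]

def nextquad():
    return len(quads)

def emit(text):
    quads.append([text, None])           # target left unspecified for now

def makelist(i):
    return [i]

def merge(p1, p2):
    return p1 + p2

def backpatch(p, i):
    for index in p:
        quads[index][1] = i

def relop(left, op, right):              # E -> id1 relop id2
    truelist = makelist(nextquad())
    falselist = makelist(nextquad() + 1)
    emit(f"if {left} {op} {right} goto")
    emit("goto")
    return truelist, falselist

def and_(e1, m_quad, e2):                # E -> E1 and M E2
    backpatch(e1[0], m_quad)             # E1 true: continue with E2
    return e2[0], merge(e1[1], e2[1])

def or_(e1, m_quad, e2):                 # E -> E1 or M E2
    backpatch(e1[1], m_quad)             # E1 false: continue with E2
    return merge(e1[0], e2[0]), e2[1]

# Bottom-up translation of: a < b or c < d and e < f
e1 = relop("a", "<", "b")
m1 = nextquad()                          # M.quad before the code for "c < d and e < f"
e2 = relop("c", "<", "d")
m2 = nextquad()                          # M.quad before the code for "e < f"
e3 = relop("e", "<", "f")
truelist, falselist = or_(e1, m1, and_(e2, m2, e3))
```

After the run, the jumps on truelist and falselist are still unfilled; they are backpatched later, once the enclosing statement knows where control should go.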
PROCEDURE CALLS
The procedure is such an important and frequently used programming construct that it is
imperative for a compiler to generate good code for procedure calls and returns. The run-time
routines that handle procedure argument passing, calls and returns are part of the run-time
support package.
Let us consider a grammar for a simple procedure call statement
(1) S → call id ( Elist )
(2) Elist → Elist , E
(3) Elist → E
Calling Sequences:
The translation for a call includes a calling sequence, a sequence of actions taken on entry
to and exit from each procedure. The following are the actions that take place in a calling sequence:
When a procedure call occurs, space must be allocated for the activation record of the called
procedure.
The arguments of the called procedure must be evaluated and made available to the called
procedure in a known place.
Environment pointers must be established to enable the called procedure to access data in
enclosing blocks.
The state of the calling procedure must be saved so it can resume execution after the call.
Finally a jump to the beginning of the code for the called procedure must be generated.
For example, consider the following syntax-directed translation
(1) S → call id ( Elist )
{ for each item p on queue do
      emit(‘param’ p);
  emit(‘call’ id.place) }
(2) Elist → Elist , E
{ append E.place to the end of queue }
(3) Elist → E
{ initialize queue to contain only E.place }
Here, the code for S is the code for Elist, which evaluates the arguments, followed by a
param p statement for each argument, followed by a call statement.
queue is emptied and then gets a single pointer to the symbol table location for the name
that denotes the value of E.
MODULE 4 - CODE GENERATION
The final phase in the compiler model is the code generator. It takes as input an intermediate
representation of the source program and produces as output an equivalent target program. The
code generation techniques presented below can be used whether or not an optimizing phase
occurs before code generation.
Position of code generator
ISSUES IN THE DESIGN OF A CODE GENERATOR
The following issues arise during the code generation phase:
1. Input to code generator
2. Target program
3. Memory management
4. Instruction selection
5. Register allocation
6. Evaluation order
1. Input to code generator:
• The input to the code generator consists of the intermediate representation of the source
program produced by the front end, together with information in the symbol table to
determine run-time addresses of the data objects denoted by the names in the intermediate
representation.
• Intermediate representation can be:
a. Linear representation such as postfix notation
b. Three-address representation such as quadruples
c. Virtual machine representation such as stack machine code
d. Graphical representations such as syntax trees and dags.
• Prior to code generation, the source program must have been scanned, parsed and translated by the front end into
an intermediate representation, along with the necessary type checking. Therefore, the input to code
generation is assumed to be error-free.
2. Target program:
• The output of the code generator is the target program. The output may be:
a. Absolute machine language
- It can be placed in a fixed memory location and can be executed immediately.
b. Relocatable machine language
- It allows subprograms to be compiled separately.
c. Assembly language
- Code generation is made easier.
3. Memory management:
• Names in the source program are mapped to addresses of data objects in run-time memory by
the front end and code generator.
• It makes use of the symbol table, that is, a name in a three-address statement refers to a
symbol-table entry for the name.
4. Instruction selection:
• The instructions of the target machine should be complete and uniform.
• The quality of the generated code is determined by its speed and size.
• The former statement can be translated into the latter statement as shown below:
5. Register allocation
• Instructions involving register operands are shorter and faster than those involving operands
in memory.
• The use of registers is subdivided into two subproblems:
Register allocation: the set of variables that will reside in registers at a point in
the program is selected.
Register assignment: the specific register that a variable will reside in is picked.
• Certain machines require even-odd register pairs for some operands and results. For
example, consider the division instruction of the form:
D x, y
where x, the dividend, is the even register of an even/odd register pair and y is the
divisor. After the division, the
even register holds the remainder and the
odd register holds the quotient.
6. Evaluation order
• The order in which the computations are performed can affect the efficiency of the
target code. Some computation orders require fewer registers to hold intermediate
results than others.
TARGET MACHINE
• It has the following op-codes:
MOV (move source to destination)
ADD (add source to destination)
SUB (subtract source from destination)
• The source and destination of an instruction are specified by combining registers and
memory locations with address modes.
Address modes with their assembly-language forms
MODE                FORM    ADDRESS        ADDED COST
absolute            M       M              1
register            R       R              0
indirect register   *R      contents(R)    0
literal             #c      c              1
• For example: MOV R0, M stores the contents of register R0 into memory location M; MOV
4(R0), M stores the value contents(4 + contents(R0)) into M.
Instruction costs:
The cost of an instruction is one plus the added costs of the address modes of its source and destination. For example, the statement a := b + c can be implemented by either of the following sequences:
i)  MOV b, R0
    ADD c, R0        cost = 6
    MOV R0, a
ii) MOV b, a
    ADD c, a         cost = 6
• In order to generate good code for the target machine, we must utilize its addressing
capabilities efficiently.
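The cost figures can be reproduced with a small Python sketch. The functions operand_cost, instruction_cost and sequence_cost are hypothetical helpers written for this illustration; they assume that register and indirect-register operands (R0, *R0) add 0 to the cost and that every other operand form (memory name, literal #c, indexed c(R)) adds 1.

```python
import re

def operand_cost(operand):
    """Added cost of one operand's address mode (assumed classification)."""
    if re.fullmatch(r"\*?R\d+", operand):   # register or indirect register
        return 0
    return 1                                 # absolute, literal, indexed forms

def instruction_cost(instruction):
    """One for the instruction itself plus the added cost of each operand."""
    _, operands = instruction.split(None, 1)
    return 1 + sum(operand_cost(o.strip()) for o in operands.split(","))

def sequence_cost(instructions):
    return sum(instruction_cost(i) for i in instructions)

cost_i = sequence_cost(["MOV b, R0", "ADD c, R0", "MOV R0, a"])
cost_ii = sequence_cost(["MOV b, a", "ADD c, a"])
```

Both sequences for a := b + c cost 6 under this model, even though the second uses fewer instructions, because each of its instructions touches two memory operands.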
RUN-TIME STORAGE MANAGEMENT
• Information needed during an execution of a procedure is kept in a block of storage called an
activation record, which includes storage for names local to the procedure.
• The two standard storage allocation strategies are:
1. Static allocation
2. Stack allocation
• In static allocation, the position of an activation record in memory is fixed at compile time.
• In stack allocation, a new activation record is pushed onto the stack for each execution of a
procedure. The record is popped when the activation ends.
• The following three-address statements are associated with the run-time allocation and
deallocation of activation records:
1. Call,
2. Return,
3. Halt, and
4. Action, a placeholder for other statements.
• We assume that the run-time memory is divided into areas for:
1. Code
2. Static data
3. Stack
Static allocation
Implementation of call statement:
The codes needed to implement static allocation are as follows:
MOV #here+20, callee.static_area /* It saves return address */
GOTO callee.code_area /* It transfers control to the called procedure */
where,
callee.static_area – Address of the activation record
callee.code_area – Address of the first instruction for called procedure
#here+20 – Literal return address which is the address of the instruction following GOTO.
Implementation of return statement:
A return from procedure callee is implemented by:
GOTO *callee.static_area
This transfers control to the address saved at the beginning of the activation record.
Implementation of action statement:
The instruction ACTION is used to implement action statement.
Implementation of halt statement:
The statement HALT is the final instruction that returns control to the operating system.
Stack allocation
Static allocation can become stack allocation by using relative addresses for storage in
activation records. In stack allocation, the position of the activation record is stored in a register, so
words in activation records can be accessed as offsets from the value in this register.
The codes needed to implement stack allocation are as follows:
Initialization of stack:
HALT /* terminate execution */
Implementation of Call statement:
GOTO callee.code_area
where,
caller.recordsize – size of the activation record
#here+16 – address of the instruction following the GOTO
Implementation of Return statement:
GOTO *0(SP) /* return to the caller */
SUB #caller.recordsize, SP /* decrement SP and restore to previous value */
BASIC BLOCKS AND FLOW GRAPHS
Basic Blocks
Basic Block Construction:
Algorithm: Partition into basic blocks
Input: A sequence of three-address statements
Output: A list of basic blocks with each three-address statement in exactly one block
Method:
1. We first determine the set of leaders, the first statements of basic blocks. The rules we
use are the following:
a. The first statement is a leader.
b. Any statement that is the target of a conditional or unconditional goto is a leader.
c. Any statement that immediately follows a goto or conditional goto statement
is a leader.
2. For each leader, its basic block consists of the leader and all statements up to but not
including the next leader or the end of the program.
• Consider the following source code for the dot product of two vectors a and b of length 20:
begin
prod := 0;
i := 1;
do begin
prod := prod + a[i] * b[i];
i := i + 1;
end
while i <= 20
end
• The three-address code for the above source program is given as:
(1) prod := 0
(2) i := 1
(3) t1 := 4 * i
(4) t2 := a[t1]
(5) t3 := 4 * i
(6) t4 := b[t3]
(7) t5 := t2 * t4
(8) t6 := prod + t5
(9) prod := t6
(10) t7 := i + 1
(11) i := t7
(12) if i <= 20 goto (3)
Basic block 1: Statement (1) to (2)
Basic block 2: Statement (3) to (12)
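The partitioning method can be sketched in a few lines of Python. This is only an illustration under simple assumptions: statements are stored in a dictionary keyed by statement number, and jump targets are written as "goto (n)". The input is the twelve-statement dot-product code, with statements (4), (6) and (10) written as they appear in the DAG example later in these notes.

```python
import re

def find_leaders(stmts):
    """stmts maps statement number -> text; returns the sorted leader numbers."""
    numbers = sorted(stmts)
    leaders = {numbers[0]}                          # rule (a)
    for pos, n in enumerate(numbers):
        target = re.search(r"goto\s*\((\d+)\)", stmts[n])
        if target:
            leaders.add(int(target.group(1)))       # rule (b)
            if pos + 1 < len(numbers):
                leaders.add(numbers[pos + 1])       # rule (c)
    return sorted(leaders)

def basic_blocks(stmts):
    """Each block runs from a leader up to, but not including, the next leader."""
    numbers = sorted(stmts)
    leaders = find_leaders(stmts)
    blocks = []
    for idx, leader in enumerate(leaders):
        end = leaders[idx + 1] if idx + 1 < len(leaders) else numbers[-1] + 1
        blocks.append([n for n in numbers if leader <= n < end])
    return blocks

dot_product = {
    1: "prod := 0", 2: "i := 1", 3: "t1 := 4 * i", 4: "t2 := a[t1]",
    5: "t3 := 4 * i", 6: "t4 := b[t3]", 7: "t5 := t2 * t4",
    8: "t6 := prod + t5", 9: "prod := t6", 10: "t7 := i + 1",
    11: "i := t7", 12: "if i <= 20 goto (3)",
}
print(basic_blocks(dot_product))
```

Statement (1) is a leader by rule (a) and statement (3) by rule (b), which gives the two blocks listed above.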
Transformations on Basic Blocks:
• Structure-preserving transformations
• Algebraic transformations
1. Structure preserving transformations:
a) Common subexpression elimination:
Before          After
a := b + c      a := b + c
b := a – d      b := a – d
c := b + c      c := b + c
d := a – d      d := b
Since the second and fourth statements compute the same expression, the basic block can be transformed as
above.
b) Dead-code elimination:
Suppose x is dead, that is, never subsequently used, at the point where the statement x := y + z
appears in a basic block. Then this statement may be safely removed without changing the
value of the basic block.
c) Renaming temporary variables:
d) Interchange of statements:
Suppose a block has the following two adjacent statements:
t1 := b + c
t2 := x + y
We can interchange the two statements without affecting the value of the block if and only if
neither x nor y is t1 and neither b nor c is t2.
2. Algebraic transformations:
• A flow graph is a directed graph containing the flow-of-control information for the set
of basic blocks making up a program.
• The nodes of the flow graph are basic blocks. It has a distinguished initial node.
• E.g.: Flow graph for the vector dot product is given as follows:
B1:
prod := 0
i := 1
B2:
t1 := 4 * i
t2 := a[t1]
t3 := 4 * i
t4 := b[t3]
t5 := t2 * t4
t6 := prod + t5
prod := t6
t7 := i + 1
i := t7
if i <= 20 goto B2
• B1 is the initial node. B2 immediately follows B1, so there is an edge from B1 to B2. The
target of the jump in the last statement of B2 is the first statement of B2, so there is also an edge from
B2 to itself.
• B1 is a predecessor of B2, and B2 is a successor of B1.
Loops
• A loop is a collection of nodes in a flow graph such that
1. All nodes in the collection are strongly connected.
2. The collection of nodes has a unique entry.
• A loop that contains no other loops is called an inner loop.
NEXT-USE INFORMATION
• If the name in a register is no longer needed, then we remove the name from the register
and the register can be used to store some other name.
Input: Basic block B of three-address statements
Method: We start at the last statement of B and scan backwards. For each statement i of the form x := y op z:
1. Attach to statement i the information currently found in the symbol table regarding
the next-use and liveness of x, y and z.
2. In the symbol table, set x to “not live” and “no next use”.
3. In the symbol table, set y and z to “live”, and the next-uses of y and z to i.
Symbol Table:
Name   Liveness   Next-use
x      not live   no next-use
y      live       i
z      live       i
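The backward scan can be sketched as follows. This is a minimal illustration, assuming each statement is given as a tuple (x, y, z) for x := y op z, with statements numbered from 0, and that the caller supplies the set of names live on exit from the block. The sample block is the sequence t := a – b, u := a – c, v := t + u, d := v + u used in the code-generation example of these notes, with only d assumed live at the end.

```python
def next_use_info(block, live_on_exit):
    """block: list of (x, y, z) tuples for x := y op z; z may be None.
    Returns one dict per statement: name -> (is_live, next_use_index)."""
    table = {}                                   # the "symbol table" of the method
    for name in live_on_exit:
        table[name] = (True, None)               # live, but no next use inside the block
    info = [None] * len(block)
    for i in range(len(block) - 1, -1, -1):      # scan backwards
        x, y, z = block[i]
        operands = [name for name in (y, z) if name is not None]
        # Step 1: attach what the table currently says about x, y and z.
        info[i] = {name: table.get(name, (False, None)) for name in [x] + operands}
        # Step 2: x is redefined here, so its old value is dead above this point.
        table[x] = (False, None)
        # Step 3: y and z are used here (done after step 2 so x := x op z works).
        for name in operands:
            table[name] = (True, i)
    return info

block = [("t", "a", "b"), ("u", "a", "c"), ("v", "t", "u"), ("d", "v", "u")]
for i, entry in enumerate(next_use_info(block, live_on_exit={"d"})):
    print(i, entry)
```

Steps 2 and 3 must be applied in that order: for a statement such as x := x + 1, the operand x has to end up marked live.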
A SIMPLE CODE GENERATOR
• A code generator generates target code for a sequence of three-address statements
and effectively uses registers to store operands of the statements.
• For example: consider the three-address statement a := b + c
It can have the following sequence of codes:
ADD Rj, Ri
Register and Address Descriptors:
• A register descriptor is used to keep track of what is currently in each register. The register
descriptors show that initially all the registers are empty.
• An address descriptor stores the location where the current value of the name can be found at run
time.
A code-generation algorithm:
The algorithm takes as input a sequence of three-address statements constituting a basic block.
For each three-address statement of the form x := y op z, perform the following actions:
1. Invoke a function getreg to determine the location L where the result of the computation
y op z should be stored.
2. Consult the address descriptor for y to determine y’, the current location of y. Prefer the
register for y’ if the value of y is currently both in memory and a register. If the value of y is not
already in L, generate the instruction MOV y’, L to place a copy of y in L.
3. Generate the instruction OP z’, L where z’ is a current location of z. Prefer a register to a
memory location if z is in both. Update the address descriptor of x to indicate that x is in
location L. If L is a register, update its descriptor to indicate that it contains the value of x,
and remove x from all other register descriptors.
Generating Code for Assignment Statements:
• The assignment d := (a-b) + (a-c) + (a-c) might be translated into the following three-
address code sequence:
t := a – b
u := a – c
v := t + u
d := v + u
with d live at the end.
Code sequence for the example is:
Registers empty
The table shows the code sequences generated for the indexed assignment statements
a := b[i] and a[i] := b
Statement    Code            Cost
a := b[i]    MOV b(Ri), R    2
a[i] := b    MOV b, a(Ri)    3
Generating Code for Pointer Assignments
The table shows the code sequences generated for the pointer assignments
a := *p and *p := a
Statement    Code            Cost
a := *p      MOV *Rp, a      2
*p := a      MOV a, *Rp      2
Generating Code for Conditional Statements
Statement          Code
if x < y goto z    CMP x, y
                   CJ< z /* jump to z if condition code is negative */
THE DAG REPRESENTATION FOR BASIC BLOCKS
• A DAG for a basic block is a directed acyclic graph with the following labels on nodes:
1. Leaves are labeled by unique identifiers, either variable names or constants.
2. Interior nodes are labeled by an operator symbol.
3. Nodes are also optionally given a sequence of identifiers for labels to store the computed
values.
• DAGs are useful data structures for implementing transformations on basic blocks.
• It gives a picture of how the value computed by a statement is used in subsequent statements.
• It provides a good way of determining common sub-expressions.
Algorithm for construction of DAG
Input: A basic block
Output: A DAG for the basic block containing the following information:
1. A label for each node. For leaves, the label is an identifier. For interior nodes,
an operator symbol.
2. For each node a list of attached identifiers to hold the computed values.
Case (i) x := y OP z
Case (ii) x := OP y
Case (iii) x := y
Method:
Step 1: If y is undefined then create node(y).
If z is undefined, create node(z) for case (i).
Step 2: For case (i), determine whether there is a node(OP) whose left child is node(y) and right child is
node(z); this check catches common subexpressions. If not, create such a node. Let n be the node found or created.
For case (ii), determine whether there is a node(OP) with one child node(y). If not, create such a
node.
For case (iii), node n will be node(y).
Step 3: Delete x from the list of attached identifiers for node(x). Append x to the list of attached
identifiers for the node n found in Step 2 and set node(x) to n.
Example: Consider the block of three-address statements:
1. t1 := 4 * i
2. t2 := a[t1]
3. t3 := 4 * i
4. t4 := b[t3]
5. t5 := t2 * t4
6. t6 := prod + t5
7. prod := t6
8. t7 := i + 1
9. i := t7
10. if i <= 20 goto (1)
Stages in DAG Construction
Application of DAGs:
1. We can automatically detect common subexpressions.
2. We can determine which identifiers have their values used in the block.
3. We can determine which statements compute values that could be used outside the block.
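A compact sketch of DAG construction for the three statement forms is given below. It is an illustration rather than a complete implementation: the class name DagBuilder and its fields are assumptions made here, array indexing is modelled as an ordinary binary operator "[]", and the final conditional jump of the example block is not handled.

```python
class DagBuilder:
    def __init__(self):
        self.nodes = []        # node id -> (label, left child id, right child id)
        self.attached = []     # node id -> identifiers attached to that node
        self.current = {}      # identifier -> node id holding its current value
        self.interior = {}     # (op, left, right) -> existing interior node id

    def _new_node(self, label, left=None, right=None):
        self.nodes.append((label, left, right))
        self.attached.append([])
        return len(self.nodes) - 1

    def node_for(self, name):
        """Create a leaf for a name that has no node yet."""
        if name not in self.current:
            self.current[name] = self._new_node(name)
        return self.current[name]

    def assign(self, x, op, y, z=None):
        """Handle x := y op z, x := op y (z is None), or x := y (op is None)."""
        left = self.node_for(y)
        right = self.node_for(z) if z is not None else None
        if op is None:
            n = left                                   # plain copy x := y
        else:
            key = (op, left, right)
            if key not in self.interior:               # reuse an existing node if present
                self.interior[key] = self._new_node(op, left, right)
            n = self.interior[key]
        if x in self.current and x in self.attached[self.current[x]]:
            self.attached[self.current[x]].remove(x)   # x no longer names its old node
        self.attached[n].append(x)
        self.current[x] = n
        return n

dag = DagBuilder()
n1 = dag.assign("t1", "*", "4", "i")
dag.assign("t2", "[]", "a", "t1")
n3 = dag.assign("t3", "*", "4", "i")
print(n1 == n3, dag.attached[n1])
```

Because t1 and t3 end up attached to the same node, the second computation of 4 * i is recognised as a common subexpression without any separate analysis.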
GENERATING CODE FROM DAGs
The advantage of generating code for a basic block from its dag representation is that
from a dag we can more easily see how to rearrange the order of the final computation sequence than
we can starting from a linear sequence of three-address statements or quadruples.
Rearranging the order
The order in which computations are done can affect the cost of the resulting object code.
For example, consider the following basic block:
t1 := a + b
t2 := c + d
t3 := e – t2
t4 := t1 – t3
Generated code sequence for basic block:
MOV a, R0
ADD b, R0
MOV c, R1
ADD d, R1
MOV R0, t1
MOV e, R0
SUB R1, R0
MOV t1, R1
SUB R0, R1
MOV R1, t4
Rearranged basic block:
t2 := c + d
t3 := e – t2
t1 := a + b
t4 := t1 – t3
Revised code sequence:
MOV c, R0
ADD d, R0
MOV e, R1
SUB R0, R1
MOV a, R0
ADD b, R0
SUB R1, R0
MOV R0, t4
In this order, two instructions MOV R0, t1 and MOV t1, R1 have been saved.
A Heuristic ordering for Dags
The heuristic ordering algorithm attempts to make the evaluation of a node immediately follow the
evaluation of its leftmost argument.
The algorithm shown below produces the ordering in reverse.
Algorithm:
1) while unlisted interior nodes remain do begin
2) select an unlisted node n, all of whose parents have been listed;
3) list n;
4) while the leftmost child m of n has no unlisted parents and is not a leaf do begin
5) list m;
6) n := m
end
end
Example: Consider the DAG shown below:
Node 1: * with children 2 and 3
Node 2: + with children 6 and 4
Node 3: - with children 4 and 12 (e)
Node 4: * with children 5 and 8
Node 5: - with children 6 and 7 (c)
Node 6: + with children 9 (a) and 10 (b)
Node 8: + with children 11 (d) and 12 (e)
Leaves: 7 = c, 9 = a, 10 = b, 11 = d, 12 = e
Initially, the only node with no unlisted parents is 1 so set n = 1 at line (2) and list 1 at line (3).
Now, the left argument of 1, which is 2, has its parents listed, so we list 2 and set n = 2 at line (6).
Now, at line (4) we find the leftmost child of 2, which is 6, has an unlisted parent 5. Thus we
select a new n at line (2), and node 3 is the only candidate. We list 3 and proceed down its left chain, listing
4, 5 and 6. This leaves only 8 among the interior nodes so we list that.
The resulting list is 1234568 and the order of evaluation is 8654321.
Code sequence:
t8 := d + e
t6 := a + b
t5 := t6 – c
t4 := t5 * t8
t3 := t4 – e
t2 := t6 + t4
t1 := t2 * t3
This will yield an optimal code for the DAG on the machine whatever be the number of registers.
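The listing algorithm can be traced mechanically. The sketch below assumes the DAG is given as a dictionary from each interior node to its (left, right) children, with the node numbering inferred from the code sequence above; any node that is not a key is treated as a leaf. The choice of the smallest-numbered candidate at line (2) is an arbitrary tie-break made for this sketch.

```python
def heuristic_order(children):
    """children: interior node -> (left child, right child); other nodes are leaves.
    Returns the evaluation order, i.e. the reverse of the listing order."""
    parents = {}
    for node, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(node)
    listed = []

    def ready(node):
        # True when every parent of the node has already been listed.
        return all(p in listed for p in parents.get(node, ()))

    while len(listed) < len(children):
        # Line (2): pick an unlisted interior node whose parents are all listed.
        n = min(x for x in children if x not in listed and ready(x))
        listed.append(n)                                         # line (3)
        m = children[n][0]
        while m in children and m not in listed and ready(m):    # line (4)
            listed.append(m)                                     # line (5)
            n, m = m, children[m][0]                             # line (6)
    return listed[::-1]

dag = {1: (2, 3), 2: (6, 4), 3: (4, 12), 4: (5, 8),
       5: (6, 7), 6: (9, 10), 8: (11, 12)}
print(heuristic_order(dag))
```

For this DAG the function lists the nodes as 1, 2, 3, 4, 5, 6, 8 and returns the reverse, matching the evaluation order 8654321 derived by hand.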
MODULE-4 - CODE OPTIMIZATION
INTRODUCTION
The code produced by the straightforward compiling algorithms can often be made to run
faster or take less space, or both. This improvement is achieved by program transformations
that are traditionally called optimizations. Compilers that apply code-improving
transformations are called optimizing compilers.
Optimizations are classified into two categories. They are
• Machine independent optimizations
• Machine dependent optimizations
Machine independent optimizations:
• Machine independent optimizations are program transformations that improve the target code
without taking into consideration any properties of the target machine.
Machine dependent optimizations:
• Machine dependent optimizations are based on register allocation and utilization of special
machine-instruction sequences.
The criteria for code improvement transformations:
Simply stated, the best program transformations are those that yield the most benefit for the
least effort.
The transformation must preserve the meaning of programs. That is, the optimization must
not change the output produced by a program for a given input, or cause an error such as
division by zero, that was not present in the original source program. At all times we take the
“safe” approach of missing an opportunity to apply a transformation rather than risk changing
what the program does.
The transformation must be worth the effort. It does not make sense for a compiler writer to
expend the intellectual effort to implement a code improving transformation and to have the
compiler expend the additional time compiling source programs if this effort is not repaid
when the target programs are executed. “Peephole” transformations of this kind are simple
enough and beneficial enough to be included in any compiler.
Organization for an Optimizing Compiler:
PRINCIPAL SOURCES OF OPTIMISATION
• A transformation of a program is called local if it can be performed by looking only at the
statements in a basic block; otherwise, it is called global.
• Many transformations can be performed at both the local and global levels. Local
transformations are usually performed first.
Function-Preserving Transformations
• There are a number of ways in which a compiler can improve a program without
changing the function it computes.
• The transformations
Common subexpression elimination,
Copy propagation,
Dead-code elimination, and
Constant folding
are common examples of such function-preserving transformations.
Common Subexpressions elimination:
• An occurrence of an expression E is called a common sub-expression if E was previously
computed, and the values of variables in E have not changed since the previous
computation. We can avoid recomputing the expression if we can use the previously
computed value.
• For example
t1 := 4*i
t2 := a[t1]
t3 := 4*j
t4 := 4*i
t5 := n
t6 := b[t4] + t5
The above code can be optimized using the common sub-expression elimination as
t1 := 4*i
t2 := a[t1]
t3 := 4*j
t5 := n
t6 := b[t1] + t5
Copy Propagation:
• Assignments of the form f := g are called copy statements, or copies for short. The idea
behind the copy-propagation transformation is to use g for f, whenever possible after the
copy statement f := g. Copy propagation means use of one variable instead of another.
This may not appear to be an improvement, but it gives us an opportunity
to eliminate the copy statement itself.
• For example:
x = Pi;
……
A = x*r*r;
The optimization using copy propagation can be done as follows:
A = Pi*r*r;
Here the variable x is eliminated.
Dead-Code Eliminations:
• A variable is live at a point in a program if its value can be used subsequently; otherwise, it is dead
at that point. A related idea is dead or useless code, statements that compute
values that never get used. While the programmer is unlikely to introduce any dead code
intentionally, it may appear as the result of previous transformations. An optimization can be
done by eliminating dead code.
Example:
i = 0;
if (i == 1)
{
a = b + 5;
}
Here, the body of the ‘if’ statement is dead code because this condition will never get satisfied.
Constant folding:
• Deducing at compile time that the value of an expression is a constant and using the
constant instead is known as constant folding. When a test on such a constant can be decided
at compile time, both the test and the code it guards can be eliminated from the object code.
• One advantage of copy propagation is that it often turns the copy statement into dead code.
For example,
a = 3.14157/2 can be replaced by
a = 1.570785, thereby eliminating a division operation.
Loop Optimizations:
• We now give a brief introduction to a very important place for optimizations, namely
loops, especially the inner loops where programs tend to spend the bulk of their time. The running
time of a program may be improved if we decrease the number of instructions in an inner
loop, even if we increase the amount of code outside that loop.
• Three techniques are important for loop optimization:
Code motion, which moves code outside a loop;
Induction-variable elimination, which we apply to replace variables from the inner loop;
Reduction in strength, which replaces an expensive operation by a cheaper one, such as a
multiplication by an addition.
Code Motion:
• An important modification that decreases the amount of code in a loop is code motion.
This transformation takes an expression that yields the same result independent of the
number of times a loop is executed (a loop-invariant computation) and places the
expression before the loop. Note that the notion “before the loop” assumes the existence
of an entry for the loop. For example, evaluation of limit-2 is a loop-invariant
computation in the following while-statement:
Induction Variables:
• Loops are usually processed inside out. For example consider the loop around B3.
• Note that the values of j and t4 remain in lock-step; every time the value of j decreases by
1, that of t4 decreases by 4 because 4*j is assigned to t4. Such identifiers are called
induction variables.
• When there are two or more induction variables in a loop, it may be possible to get rid of all but
one, by the process of induction-variable elimination. For the inner loop around B3 in
Fig. we cannot get rid of either j or t4 completely; t4 is used in B3 and j in B4. However,
we can illustrate reduction in strength and illustrate a part of the process of
induction-variable elimination. Eventually j will be eliminated when the outer loop of B2 - B5
is considered.
Example:
As the relationship t4 = 4*j surely holds after such an assignment to t4 in Fig. and t4 is not
changed elsewhere in the inner loop around B3, it follows that just after the statement
j := j-1 the relationship t4 = 4*j + 4 must hold. We may therefore replace the assignment t4 :=
4*j by t4 := t4-4. The only problem is that t4 does not have a value when we enter block B3
for the first time. Since we must maintain the relationship t4 = 4*j on entry to the block B3, we place
an initialization of t4 at the end of the block where j itself is
initialized, shown by the dashed addition to block B1 in second Fig.
• The replacement of a multiplication by a subtraction will speed up the object code if
multiplication takes more time than addition or subtraction, as is the case on many
machines.
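The equivalence of the two forms can be checked directly. The sketch below is a toy model, not the flow graph of the figure: it keeps only the statements j := j-1 and the assignment to t4, with a hypothetical starting value for j and a fixed iteration count.

```python
def loop_with_multiplication(j_start, iterations):
    """Original form: t4 is recomputed as 4*j on every iteration."""
    j, trace = j_start, []
    for _ in range(iterations):
        j = j - 1
        t4 = 4 * j
        trace.append(t4)
    return trace

def loop_with_subtraction(j_start, iterations):
    """Strength-reduced form: t4 is initialized once, then decremented by 4."""
    j = j_start
    t4 = 4 * j              # initialization placed where j itself is initialized
    trace = []
    for _ in range(iterations):
        j = j - 1
        t4 = t4 - 4         # replaces t4 := 4*j and restores the relationship t4 = 4*j
        trace.append(t4)
    return trace

print(loop_with_multiplication(10, 5) == loop_with_subtraction(10, 5))
```

Both loops produce the same sequence of t4 values, which is what makes the replacement safe.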
Reduction In Strength:
OPTIMIZATION OF BASIC BLOCKS
There are two types of basic block optimizations. They are:
Structure-Preserving Transformations
Algebraic Transformations
Structure-Preserving Transformations:
The primary Structure-Preserving Transformations on basic blocks are:
Common sub-expression elimination
Dead code elimination
Renaming of temporary variables
Interchange of two independent adjacent statements.
Common sub-expression elimination:
Common sub expressions need not be computed over and over again. Instead they can be
computed once and kept in store from where they are referenced when encountered again, provided
the variable values in the expression still remain constant.
Example:
Before:
a := b+c
b := a-d
c := b+c
d := a-d
After:
a := b+c
b := a-d
c := b+c
d := b
Here b+c in the third statement is not a common sub-expression, because b is redefined by the
second statement; only the second and fourth statements compute the same value.
Dead code elimination:
It’s possible that a large amount of dead (useless) code may exist in the program. This
might be especially caused when introducing variables and procedures as part of construction or
error-correction of a program: once declared and defined, one forgets to remove them in case
they serve no purpose. Eliminating these will optimize the code.
Renaming of temporary variables:
Interchange of two independent adjacent statements:
• Two statements
t1 := b+c
t2 := x+y
can be interchanged or reordered in the basic block when the value of t1
does not affect the value of t2.
Algebraic Transformations:
For example, if the source code has the assignments
a := b+c
e := c+d+b
the following intermediate code may be generated:
a := b+c
t := c+d
e := t+b
• Example:
x := x+0 can be removed
x := y**2 can be replaced by a cheaper statement x := y*y
• The compiler writer should examine the language carefully to determine what
rearrangements of computations are permitted, since computer arithmetic does not always obey
the algebraic identities of mathematics. Thus, a compiler may evaluate x*y-x*z as x*(y-z)
but it may not evaluate a+(b-c) as (a+b)-c.
LOOPS IN FLOW GRAPH
Dominators:
In a flow graph, a node d dominates node n, if every path from the initial node of the flow
graph to n goes through d. This will be denoted by d dom n. The initial node dominates all the
remaining nodes in the flow graph and the entry of a loop dominates all nodes in the loop.
Similarly every node dominates itself.
Example:
* In the flow graph below,
* Initial node, node 1 dominates every node.
* node 2 dominates only itself.
* node 3 dominates all but 1 and 2.
* node 4 dominates all but 1, 2 and 3.
* nodes 5 and 6 dominate only themselves, since flow of control can skip around either by going through
the other.
* node 7 dominates 7, 8, 9 and 10.
* node 8 dominates 8, 9 and 10.
* nodes 9 and 10 dominate only themselves.
• The way of presenting dominator information is in a tree, called the dominator tree, in which the
initial node is the root.
• The parent of each other node is its immediate dominator.
• Each node d dominates only its descendants in the tree.
• The existence of the dominator tree follows from a property of dominators: each node n has a
unique immediate dominator m that is the last dominator of n on any path from the initial
node to n.
• In terms of the dom relation, the immediate dominator m has the property that if d ≠ n and d
dom n, then d dom m.
D(1)={1}
D(2)={1,2}
D(3)={1,3}
D(4)={1,3,4}
D(5)={1,3,4,5}
D(6)={1,3,4,6}
D(7)={1,3,4,7}
D(8)={1,3,4,7,8}
D(9)={1,3,4,7,8,9}
D(10)={1,3,4,7,8,10}
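The D(n) sets can be reproduced with a simple iterative computation. The sketch below is an assumption-laden illustration: the flow-graph figure is not reproduced in these notes, so the edge list used here is a hypothetical one chosen to be consistent with the dominator sets above and with the back edges listed in the next section. It also assumes every node other than the initial node has at least one predecessor.

```python
def dominators(successors, initial):
    """Iteratively solve D(n) = {n} U (intersection of D(p) over predecessors p)."""
    nodes = set(successors)
    preds = {n: set() for n in nodes}
    for n, succs in successors.items():
        for s in succs:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}     # start from "everything dominates n"
    dom[initial] = {initial}
    changed = True
    while changed:
        changed = False
        for n in sorted(nodes - {initial}):
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

# Assumed edge list (node -> successors); not taken from the missing figure.
flow_graph = {1: [2, 3], 2: [3], 3: [4], 4: [3, 5, 6], 5: [7], 6: [7],
              7: [4, 8], 8: [3, 9, 10], 9: [1], 10: [7]}
dom = dominators(flow_graph, 1)
for n in sorted(dom):
    print(f"D({n}) = {sorted(dom[n])}")
```

Starting every set at "all nodes" and shrinking it is what guarantees the loop converges to the largest solution, which is the true dominator relation.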
Natural Loop:
• One application of dominator information is in determining the loops of a flow graph suitable for
improvement.
• The properties of loops are
A loop must have a single entry point, called the header. This entry point dominates all
nodes in the loop, or it would not be the sole entry to the loop.
There must be at least one way to iterate the loop, i.e., at least one path back to the header.
• One way to find all the loops in a flow graph is to search for edges in the flow graph whose
heads dominate their tails. If a→b is an edge, b is the head and a is the tail. These types of
edges are called back edges.
Example:
In the above graph,
7→4    4 DOM 7
10→7   7 DOM 10
4→3    3 DOM 4
8→3    3 DOM 8
9→1    1 DOM 9
• The above edges will form loops in the flow graph.
• Given a back edge n→d, we define the natural loop of the edge to be d plus the set of nodes that can
reach n without going through d. Node d is the header of the loop.
Algorithm: Constructing the natural loop of a back edge.
Input: A flow graph G and a back edge n→d.
Output: The set loop consisting of all nodes in the natural loop of n→d.
Method: Beginning with node n, we consider each node m ≠ d that we know is in loop, to make
sure that m’s predecessors are also placed in loop. Each node in loop, except for d, is placed once on stack,
so its predecessors will be examined. Note that because d is put in the loop initially, we never examine
its predecessors, and thus find only those nodes that reach n without going through d.
procedure insert(m);
if m is not in loop then begin
loop := loop U {m};
push m onto stack
end;
stack := empty;
loop := {d};
insert(n);
while stack is not empty do begin
pop m, the first element of stack, off stack;
for each predecessor p of m do insert(p)
end
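The procedure translates almost line for line into Python. As before, the edge list is an assumption made for illustration (the flow-graph figure is missing from these notes); it is chosen so that the five back edges listed above are exactly the edges whose heads dominate their tails.

```python
def natural_loop(successors, back_edge):
    """Return the natural loop of the back edge n -> d as a set of nodes."""
    n, d = back_edge
    preds = {node: set() for node in successors}
    for node, succs in successors.items():
        for s in succs:
            preds[s].add(node)
    loop, stack = {d}, []        # d is in the loop from the start, never on the stack

    def insert(m):
        if m not in loop:
            loop.add(m)
            stack.append(m)

    insert(n)
    while stack:
        m = stack.pop()
        for p in preds[m]:
            insert(p)
    return loop

# Assumed edge list (node -> successors); not taken from the missing figure.
flow_graph = {1: [2, 3], 2: [3], 3: [4], 4: [3, 5, 6], 5: [7], 6: [7],
              7: [4, 8], 8: [3, 9, 10], 9: [1], 10: [7]}
print(sorted(natural_loop(flow_graph, (10, 7))))
print(sorted(natural_loop(flow_graph, (7, 4))))
```

With this graph, the loop of 10→7 is nested inside the loop of 7→4, illustrating the containment property of natural loops with different headers.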
Inner loop:
• If we use the natural loops as “the loops”, then we have the useful property that unless
two loops have the same header, they are either disjoint or one is entirely contained in
the other. Thus, neglecting loops with the same header for the moment, we have a natural notion
of inner loop: one that contains no other loop.
• When two natural loops have the same header, but neither is nested within the other, they are
combined and treated as a single loop.
Pre-Headers:
• The pre-header has only the header as successor, and all edges which formerly entered the
header of L from outside L instead enter the pre-header.
• Edges from inside loop L to the header are not changed.
• Initially the pre-header is empty, but transformations on L may place statements in it.
Fig: Introduction of the pre-header. (a) Before: header, loop L. (b) After: pre-header, header, loop L.
Reducible flow graphs:
• Reducible flow graphs are special flow graphs, for which several code optimization
transformations are especially easy to perform, loops are unambiguously defined,
dominators can be easily calculated, data flow analysis problems can also be solved
efficiently.
• Definition:
A flow graph G is reducible if and only if we can partition the edges into two disjoint groups,
forward edges and back edges, with the following properties.
The forward edges form an acyclic graph in which every node can be reached from the initial node of G.
The back edges consist only of edges whose heads dominate their tails.
Example: The above flow graph is reducible.
• If we know the relation DOM for a flow graph, we can find and remove all the back
edges.
• The remaining edges are forward edges.
• If the forward edges form an acyclic graph, then we can say the flow graph is reducible.
• In the above example, remove the five back edges 4→3, 7→4, 8→3, 9→1 and 10→7
whose heads dominate their tails; the remaining graph is acyclic.
• The key property of reducible flow graphs for loop analysis is that in such flow graphs
every set of nodes that we would informally regard as a loop must contain a back edge.
PEEPHOLE OPTIMIZATION
Redundant-instructions elimination
Flow-of-control optimizations
Algebraic simplifications
Use of machine idioms
Unreachable Code
Redundant Loads And Stores:
If we see the instructions sequence
(1) MOV R0, a
(2) MOV a, R0
we can delete instruction (2) because whenever (2) is executed, (1) will ensure that the value of
a is already in register R0. If (2) had a label we could not be sure that (1) was always executed
immediately before (2) and so we could not remove (2).
Unreachable Code:
#define debug 0
….
if (debug) {
print debugging information
}
• In the intermediate representation the if-statement may be translated as:
if debug = 1 goto L1
goto L2
L1: print debugging information
L2: …. (a)
• One obvious peephole optimization is to eliminate jumps over jumps. Thus no matter what
the value of debug, (a) can be replaced by:
if debug ≠ 1 goto L2
print debugging information
L2: …. (b)
• Since debug is defined as 0, constant propagation replaces (b) by:
if 0 ≠ 1 goto L2
print debugging information
L2: …. (c)
• As the argument of the first statement of (c) evaluates to a constant true, it can be replaced by goto
L2. Then all the statements that print debugging aids are manifestly unreachable and can be
eliminated one at a time.
Flow-Of-Control Optimizations:
• The unnecessary jumps can be eliminated in either the intermediate code or the target code by
the following types of peephole optimizations. We can replace the jump sequence
goto L1
….
L1: goto L2
by the sequence
goto L2
….
L1: goto L2
• If there are now no jumps to L1, then it may be possible to eliminate the statement L1: goto
L2 provided it is preceded by an unconditional jump. Similarly, the sequence
if a < b goto L1
….
L1: goto L2
can be replaced by
if a < b goto L2
….
L1: goto L2
• Finally, suppose there is only one jump to L1 and L1 is preceded by an unconditional goto.
Then the sequence
goto L1
……..
L1: if a < b goto L2
L3: …….. (1)
• may be replaced by
if a < b goto L2
goto L3
…….
L3: …….. (2)
• While the number of instructions in (1) and (2) is the same, we sometimes skip the
unconditional jump in (2), but never in (1). Thus (2) is superior to (1) in execution time.
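The first two replacements (retargeting a jump whose destination is itself an unconditional goto) can be sketched as a small pass over a list of instructions. The representation is an assumption made for this illustration: each instruction is a (label, text) pair, and every jump ends with the words "goto <label>". The third replacement, which moves a conditional jump, is not covered.

```python
def thread_jumps(code):
    """code: list of (label, instruction) pairs; label may be None.
    Retargets every jump whose target is itself an unconditional goto."""
    # Map each label that sits on an unconditional goto to that goto's target.
    forwards = {}
    for label, ins in code:
        if label is not None and ins.startswith("goto "):
            forwards[label] = ins.split()[1]

    def final_target(label):
        seen = set()
        while label in forwards and label not in seen:   # guard against goto cycles
            seen.add(label)
            label = forwards[label]
        return label

    result = []
    for label, ins in code:
        words = ins.split()
        if "goto" in words:
            idx = words.index("goto")
            words[idx + 1] = final_target(words[idx + 1])
            ins = " ".join(words)
        result.append((label, ins))
    return result

program = [(None, "if a < b goto L1"), (None, "x := 1"),
           ("L1", "goto L2"), ("L2", "y := 2")]
for label, ins in thread_jumps(program):
    prefix = f"{label}:" if label else ""
    print(prefix.ljust(4) + ins)
```

After this pass nothing jumps to L1 any more, so a later pass could consider removing the statement at L1 under the condition described above.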
Algebraic Simplification:
• There is no end to the amount of algebraic simplification that can be attempted through
peephole optimization. Only a few algebraic identities occur frequently enough that it is
worth considering implementing them. For example, statements such as
x := x+0 or
x := x*1
are often produced by straightforward intermediate code-generation algorithms, and they can be
eliminated easily through peephole optimization.
Reduction in Strength:
x² → x*x
Use of Machine Idioms:
• The target machine may have hardware instructions to implement certain specific operations
efficiently. For example, some machines have auto-increment and auto-decrement addressing
modes. These add or subtract one from an operand before or after using its value.
• The use of these modes greatly improves the quality of code when pushing or popping a
stack, as in parameter passing. These modes can also be used in code for statements like
i := i+1.
i := i+1 → i++
i := i-1 → i--
INTRODUCTION TO GLOBAL DATA FLOW ANALYSIS
• In order to do code optimization and a good job of code generation, a compiler needs to collect
information about the program as a whole and to distribute this information to each
block in the flow graph.
• Data-flow information can be collected by setting up and solving systems of equations of the
form:
out[S] = gen[S] U (in[S] – kill[S])
This equation can be read as “the information at the end of a statement is either generated within
the statement, or enters at the beginning and is not killed as control flows through the
statement.”
• The details of how data-flow equations are set up and solved depend on three factors.
The notions of generating and killing depend on the desired information, i.e., on the data
flow analysis problem to be solved. Moreover, for some problems, instead of proceeding along
with flow of control and defining out[S] in terms of in[S], we need to proceed backwards
and define in[S] in terms of out[S].
Since data flows along control paths, data-flow analysis is affected by the constructs in a
program. In fact, when we write out[S] we implicitly assume that there is a unique end
point where control leaves the statement; in general, equations are set up at the level of
basic blocks rather than statements, because blocks do have unique end points.
There are subtleties that go along with such statements as procedure calls, assignments
through pointer variables, and even assignments to array variables.
Points and Paths:
• Within a basic block, we talk of the point between two adjacent statements, as well as the point
before the first statement and after the last. Thus, block B1 has four points: one before
any of the assignments and one after each of the three assignments.
Fig: Flow graph with blocks B1 to B6.
B1: d1: i := m-1
    d2: j := n
    d3: a := u1
B2: d4: i := i+1
B3: d5: j := j-1
B4:
B5: d6: a := u2
B6:
• Now let us take a global view and consider all the points in all the blocks. A path from p1
to pn is a sequence of points p1, p2, …, pn such that for each i between 1 and n-1, either
pi is the point immediately preceding a statement and pi+1 is the point immediately
following that statement in the same block, or
pi is the end of some block and pi+1 is the beginning of a successor block.
Reaching definitions:
• A definition of a variable x is a statement that assigns, or may assign, a value to x. The most
common forms of definition are assignments to x and statements that read a value into x.
• These statements certainly define a value for x, and they are referred to as unambiguous
definitions of x. There are certain kinds of statements that may define a value for x; they
are called ambiguous definitions. The most usual forms of ambiguous definitions of x
are:
A call of a procedure with x as a parameter or a procedure that can access x because x is in the
scope of the procedure.
An assignment through a pointer that could refer to x. For example, the assignment *q :=
y is a definition of x if it is possible that q points to x. We must assume that an assignment through a
pointer is a definition of every variable.
• We say a definition d reaches a point p if there is a path from the point immediately
following d to p, such that d is not “killed” along that path. Thus a point can be reached
by an unambiguous definition and an ambiguous definition of the same variable appearing
later along one path.
Data-flow analysis of structured programs:
• Flow graphs for control flow constructs such as do-while statements have a useful
property: there is a single beginning point at which control enters and a single end point
that control leaves from when execution of the statement is over. We exploit this property
when we talk of the definitions reaching the beginning and the end of statements with the
following syntax.
S → id := E | S ; S | if E then S else S | do S while E
E → id + id | id
• Expressions in this language are similar to those in the intermediate code, but the flow graphs
for statements have restricted forms.
Fig: Flow graphs for S1 ; S2, if E then S1 else S2, and do S1 while E.
• We define a portion of a flow graph called a region to be a set of nodes N that includes a
header, which dominates all other nodes in the region. All edges between nodes in N are
in the region, except for some that enter the header.
• The portion of flow graph corresponding to a statement S is a region that obeys the
further restriction that control can flow to just one outside block when it leaves the region.
• We say that the beginning points of the dummy blocks at the entry and exit of a
statement’s region are the beginning and end points, respectively, of the statement. The
equations are inductive, or syntax-directed, definitions of the sets in[S], out[S], gen[S], and
kill[S] for all statements S.
• gen[S] is the set of definitions “generated” by S while kill[S] is the set of definitions
that never reach the end of S.
• Consider the following data-flow equations for reaching definitions:
i) S → d: a := b+c
gen[S] = {d}
kill[S] = Da – {d}
out[S] = gen[S] U (in[S] – kill[S])
• Observe the rules for a single assignment of variable a. Surely that assignment is a
definition of a, say d. Thus
gen[S] = {d}
• On the other hand, d “kills” all other definitions of a, so we write
kill[S] = Da – {d}
where Da is the set of all definitions in the program for variable a.
ii) S → S1 ; S2
gen[S] = gen[S2] U (gen[S1] – kill[S2])
kill[S] = kill[S2] U (kill[S1] – gen[S2])
in[S1] = in[S]
in[S2] = out[S1]
out[S] = out[S2]
• Under what circumstances is definition d generated by S = S1 ; S2? First of all, if it is
generated by S2, then it is surely generated by S. If d is generated by S1, it will reach the
end of S provided it is not killed by S2. Thus, we write
gen[S] = gen[S2] U (gen[S1] – kill[S2])
• Similar reasoning applies to the killing of a definition, so we have
kill[S] = kill[S2] U (kill[S1] – gen[S2])
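The rules for a single assignment and for a sequence S1 ; S2 can be evaluated with ordinary set operations. The sketch below is illustrative only: the three-definition program and the names d1, d2, d3 are hypothetical, and the dictionary definitions_of plays the role of the sets Da.

```python
def assignment(d, variable, definitions_of):
    """gen/kill for S -> d: variable := ...; definitions_of[v] plays the role of D_v."""
    return {"gen": {d}, "kill": definitions_of[variable] - {d}}

def sequence(s1, s2):
    """gen/kill for S -> S1 ; S2."""
    return {"gen": s2["gen"] | (s1["gen"] - s2["kill"]),
            "kill": s2["kill"] | (s1["kill"] - s2["gen"])}

def out_set(s, in_set):
    """out[S] = gen[S] U (in[S] - kill[S])."""
    return s["gen"] | (in_set - s["kill"])

# Hypothetical program: d1: i := m-1 ; d2: j := n ; d3: i := i+1
definitions_of = {"i": {"d1", "d3"}, "j": {"d2"}}
s1 = assignment("d1", "i", definitions_of)
s2 = assignment("d2", "j", definitions_of)
s3 = assignment("d3", "i", definitions_of)
whole = sequence(sequence(s1, s2), s3)
print(sorted(whole["gen"]), sorted(whole["kill"]), sorted(out_set(whole, set())))
```

Definition d1 is generated by the first statement but killed by d3, so it appears in kill for the whole sequence and not in gen.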
Conservative estimation of data-flow information:
• There is a subtle miscalculation in the rules for gen and kill. We have made the
assumption that the conditional expression E in the if and do statements is
“uninterpreted”; that is, there exist inputs to the program that make their branches go either
way.
• We assume that any graph-theoretic path in the flow graph is also an execution path, i.e., a path
that is executed when the program is run with at least one possible input.
• When we compare the computed gen with the “true” gen, we discover that the true gen is
always a subset of the computed gen. On the other hand, the true kill is always a superset of the
computed kill.
• These containments hold even after we consider the other rules. It is natural to wonder
whether these differences between the true and computed gen and kill sets present a
serious obstacle to data-flow analysis. The answer lies in the use intended for these data.
• Overestimating the set of definitions reaching a point does not seem serious; it merely
stops us from doing an optimization that we could legitimately do. On the other hand,
underestimating the set of definitions is a fatal error; it could lead us into making a
change in the program that changes what the program computes. For the case of reaching
definitions, then, we call a set of definitions safe or conservative if the estimate is a
superset of the true set of reaching definitions. We call the estimate unsafe if it is not
necessarily a superset of the truth.
Computation of in and out:
• Many data-flow problems can be solved by synthesized translations similar to those used to
compute gen and kill. Such a translation can be used, for example, to determine loop-invariant
computations.
• However, other kinds of data-flow information, such as reaching definitions, also require
inherited attributes. It turns out that in is an inherited attribute, and out is a synthesized attribute
depending on in. We intend that in[S] be the set of definitions reaching the beginning of
S, taking into account the flow of control throughout the entire program, including
statements outside of S or within which S is nested.
• The set out[S] is defined similarly for the end of S. It is important to note the distinction between
out[S] and gen[S]. The latter is the set of definitions that reach the end of S without
following paths outside S.
• Assuming we know in[S], we compute out[S] by the equation
out[S] = gen[S] U (in[S] - kill[S])
• Consider the cascade of two statements S1; S2, as in the second case. We start by
observing in[S1] = in[S]. Then, we recursively compute out[S1], which gives us in[S2],
since a definition reaches the beginning of S2 if and only if it reaches the end of S1. Now we can
compute out[S2], and this set is equal to out[S].
• Considering the if-statement, we have conservatively assumed that control can follow either
branch, so a definition reaches the beginning of S1 or S2 exactly when it reaches the
beginning of S:
in[S1] = in[S2] = in[S]
• A definition reaches the end of S if and only if it reaches the end of one or both substatements;
i.e.,
out[S] = out[S1] U out[S2]
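The syntax-directed rules above can be tried out on a small example. The following is a minimal sketch in Python, not part of the original notes: the class names Assign, Seq and If and the helper collect_defs are hypothetical, and gen and kill are implemented only for the two cases whose equations are given above (a single assignment and the cascade S1; S2).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Assign:          # d: var := ...   (only the defined variable matters here)
    label: str
    var: str

@dataclass(frozen=True)
class Seq:             # S1; S2
    s1: object
    s2: object

@dataclass(frozen=True)
class If:              # if E then S1 else S2 (E is left uninterpreted)
    s1: object
    s2: object

def collect_defs(stmt, table=None):
    """Build Da for every variable a: variable -> set of definition labels."""
    table = {} if table is None else table
    if isinstance(stmt, Assign):
        table.setdefault(stmt.var, set()).add(stmt.label)
    else:
        collect_defs(stmt.s1, table)
        collect_defs(stmt.s2, table)
    return table

def gen_kill(stmt, defs):
    """gen[S] and kill[S] for an assignment or a cascade."""
    if isinstance(stmt, Assign):
        return {stmt.label}, defs[stmt.var] - {stmt.label}
    if isinstance(stmt, Seq):
        g1, k1 = gen_kill(stmt.s1, defs)
        g2, k2 = gen_kill(stmt.s2, defs)
        return g2 | (g1 - k2), k2 | (k1 - g2)
    raise NotImplementedError("gen/kill for this form is not covered in this sketch")

def out(stmt, in_set, defs):
    """out[S] computed from the inherited in[S]."""
    if isinstance(stmt, Assign):
        g, k = gen_kill(stmt, defs)
        return g | (in_set - k)
    if isinstance(stmt, Seq):          # in[S2] = out[S1], out[S] = out[S2]
        return out(stmt.s2, out(stmt.s1, in_set, defs), defs)
    if isinstance(stmt, If):           # in[S1] = in[S2] = in[S]
        return out(stmt.s1, in_set, defs) | out(stmt.s2, in_set, defs)
    raise TypeError(stmt)

# d1: a := ... ; if E then d2: b := ... else d3: a := ...
program = Seq(Assign("d1", "a"), If(Assign("d2", "b"), Assign("d3", "a")))
defs = collect_defs(program)
result = out(program, set(), defs)
```

Here result contains d1, d2 and d3: d1 survives along the then-branch, which is exactly the conservative "either branch may be taken" assumption.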
Representation of sets:
• Sets of definitions, such as gen[S] and kill[S], can be represented compactly using bit
vectors. We assign a number to each definition of interest in the flow graph. Then the bit vector
representing a set of definitions will have 1 in position i if and only if the definition
numbered i is in the set.
• The number of a definition statement can be taken as the index of the statement in an array
holding pointers to statements. However, not all definitions may be of interest during
global data-flow analysis. Therefore the numbers of the definitions of interest will typically be
recorded in a separate table.
• A bit vector representation for sets also allows set operations to be implemented
efficiently. The union and intersection of two sets can be implemented by logical or and logical
and, respectively, basic operations in most systems-oriented programming
languages. The difference A - B of sets A and B can be implemented by taking the
complement of B and then using logical and to compute A and (not B).
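As an illustration of the bit-vector representation, the sketch below uses Python integers as bit vectors; this choice and the helper names to_bits, from_bits and out_bits are assumptions made for the example, not something prescribed by the notes.

```python
def to_bits(definitions):
    """Encode a set of definition numbers as an integer bit vector."""
    bits = 0
    for i in definitions:
        bits |= 1 << i
    return bits

def from_bits(bits):
    """Decode a bit vector back into the set of definition numbers."""
    return {i for i in range(bits.bit_length()) if (bits >> i) & 1}

def out_bits(gen, kill, in_):
    """out = gen U (in - kill), with the difference done as in AND (NOT kill)."""
    return gen | (in_ & ~kill)

gen = to_bits({3})
kill = to_bits({1})
in_ = to_bits({1, 2})
reaching = from_bits(out_bits(gen, kill, in_))
```

With these inputs, definition 1 is killed, definition 2 passes through, and definition 3 is generated, so reaching is the set {2, 3}.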
Local reaching definitions:
• Space for data-flow information can be traded for time, by saving information only at
certain points and, as needed, recomputing information at intervening points. Basic
blocks are usually treated as a unit during global flow analysis, with attention restricted to only
those points that are the beginnings of blocks.
• Since there are usually many more points than blocks, restricting our effort to blocks is a
significant savings. When needed, the reaching definitions for all points in a block can be
calculated from the reaching definitions for the beginning of a block.
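The recomputation inside a block can be sketched as follows. This is an illustrative Python fragment under stated assumptions: a block is a list of (label, variable) pairs, defs_of_var maps each variable to all of its definitions in the program, and the function name reaching_within_block is hypothetical.

```python
def reaching_within_block(in_b, block, defs_of_var):
    """Return the reaching definitions before each statement of the block,
    plus the set reaching the end of the block, starting from in[B]."""
    points = []
    current = set(in_b)
    for label, var in block:
        points.append(set(current))
        # The new definition kills every other definition of the same variable.
        current = (current - defs_of_var[var]) | {label}
    return points, current

defs_of_var = {"a": {"d1", "d3"}, "b": {"d2", "d4"}}
block = [("d3", "a"), ("d4", "b")]
points, out_b = reaching_within_block({"d1", "d2"}, block, defs_of_var)
```

Only in[B] has to be stored; the per-statement sets in points are rebuilt on demand by a single pass over the block.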
Use-definition chains:
Evaluation order:
• The techniques for conserving space during attribute evaluation also apply to the
computation of data-flow information using specifications. Specifically, the only
constraint on the evaluation order for the gen, kill, in and out sets for statements is that
imposed by dependencies between these sets. Having chosen an evaluation order, we are free
to release the space for a set after all uses of it have occurred.
• Earlier, circular dependencies between attributes were not allowed, but we have seen that
data-flow equations may have circular dependencies.
General control flow:
• Data-flow analysis must take all control paths into account. If the control paths are
evident from the syntax, then data-flow equations can be set up and solved in a syntax-directed
manner.
• Several approaches may be taken. The iterative method works for arbitrary flow graphs.
Since the flow graphs obtained in the presence of break and continue statements are
reducible, such constructs can be handled systematically using the interval-based
methods.
• However, the syntax-directed approach need not be abandoned when break and continue
statements are allowed.
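The notes mention the iterative method without spelling it out. A common formulation for reaching definitions, given here as a hedged sketch rather than as the notes' own algorithm, repeatedly applies in[B] = union of out[P] over predecessors P and out[B] = gen[B] U (in[B] - kill[B]) until nothing changes. The function name and the dictionary-based graph encoding are assumptions.

```python
def reaching_definitions(gen, kill, preds):
    """Iterate the reaching-definitions equations to a fixed point.
    gen, kill: block name -> set of definitions; preds: block name -> list of blocks."""
    in_ = {b: set() for b in gen}
    out = {b: set(gen[b]) for b in gen}
    changed = True
    while changed:
        changed = False
        for b in gen:
            in_[b] = set().union(*(out[p] for p in preds[b]))
            new_out = gen[b] | (in_[b] - kill[b])
            if new_out != out[b]:
                out[b] = new_out
                changed = True
    return in_, out

# B1 -> B2 -> B3 -> B2 (a loop); d1 and d2 both define the same variable.
gen = {"B1": {"d1"}, "B2": set(), "B3": {"d2"}}
kill = {"B1": {"d2"}, "B2": set(), "B3": {"d1"}}
preds = {"B1": [], "B2": ["B1", "B3"], "B3": ["B2"]}
in_sets, out_sets = reaching_definitions(gen, kill, preds)
```

Because the sets only grow and are bounded by the set of all definitions, the loop terminates on any flow graph, reducible or not.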
CODE IMPROVING TRANSFORMATIONS
• Algorithms for performing the code improving transformations rely on data-flow
information. Here we consider common sub-expression elimination, copy propagation and
transformations for moving loop invariant computations out of loops and for eliminating
induction variables.
• Global transformations are not a substitute for local transformations; both must be performed.
Elimination of global common subexpressions:
• The available expressions data-flow problem discussed in the last section allows us to
determine if an expression at point p in a flow graph is a common sub-expression. The
following algorithm formalizes the intuitive ideas presented for eliminating common
sub-expressions.
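INPUT: A flow graph with available expression information.
OUTPUT: A revised flow graph.
METHOD: For every statement s of the form x := y+z such that y+z is available at the
beginning of s’s block, and neither y nor z is defined prior to s in that block, do the following:
1. To discover the evaluations of y+z that reach s’s block, we follow flow graph edges,
searching backward from s’s block. However, we do not go through any
block that evaluates y+z. The last evaluation of y+z in each block
encountered is an evaluation of y+z that reaches s.
2. Create a new variable u.
3. Replace each statement w := y+z found in (1) by
u := y+z
w := u
4. Replace statement s by x := u.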
• Some remarks about this algorithm are in order.
1. The search in step (1) of the algorithm for the evaluations of y+z that reach statement s
can also be formulated as a data-flow analysis problem. However, it does not make sense to solve
it for all expressions y+z and all statements or blocks because too much irrelevant
information is gathered.
2. Not all changes made by the algorithm are improvements. We might wish to limit the
number of different evaluations reaching s found in step (1), probably to one.
3. The algorithm will miss the fact that a*z and c*z must have the same value in
a := x+y        c := x+y
b := a*z   vs   d := c*z
because this simple approach to common subexpressions considers only the literal
expressions themselves, rather than the values computed by expressions.
Copy propagation:
• Various algorithms introduce copy statements such as x := y. Copies may also be generated
directly by the intermediate code generator, although most of these involve temporaries
local to one block and can be removed by the dag construction. Given a copy statement
s: x := y, we may substitute y for x in all the places where this definition of x is used,
provided the following conditions are met by every such use u of x:
1. Statement s must be the only definition of x reaching u.
2. On every path from s to u, including paths that go through u several times, there are no
assignments to y.
• Condition (1) can be checked using ud-chaining information. For condition (2), we shall set up
a new data-flow analysis problem in which in[B] is the set of copies s: x := y such that every path from
the initial node to the beginning of B contains the statement s, and subsequent to the last
occurrence of s, there are no assignments to y.
INPUT: A flow graph G, with ud-chains giving the definitions reaching block B, and with
c_in[B] representing the solution to the equations, that is, the set of copies x := y that reach
block B along every path, with no assignment to x or y following the last occurrence
of x := y on the path. We also need du-chains giving the uses of each definition.
OUTPUT: A revised flow graph.
METHOD: For each copy s: x := y do the following:
1. Determine those uses of x that are reached by this definition of x, namely, s: x := y.
2. Determine whether for every use of x found in (1), s is in c_in[B], where B is the
block of this particular use, and moreover, no definitions of x or y occur prior to this use of x
within B. Recall that if s is in c_in[B] then s is the only definition of x that reaches B.
3. If s meets the conditions of (2), then remove s and replace all uses of x found in (1) by y.
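The notes take c_in[B] as given. One way it might be computed, sketched here under assumptions, is an iterative "all paths" analysis: c_out[B] = c_gen[B] U (c_in[B] - c_kill[B]) and c_in[B] is the intersection of c_out[P] over the predecessors P of B, with c_in empty at the initial node. The statement encoding (label, target, operands) and the name copy_in_sets are hypothetical; a statement counts as a copy when its only operand is a variable name.

```python
def copy_in_sets(blocks, preds, entry):
    """blocks: name -> list of (label, target, operands); preds: name -> list of names.
    Returns c_in: name -> set of copy labels available on every path into the block."""
    def is_copy(stmt):
        ops = stmt[2]
        return len(ops) == 1 and isinstance(ops[0], str)

    # label -> (x, y) for every copy x := y in the program
    copies = {s[0]: (s[1], s[2][0]) for b in blocks.values() for s in b if is_copy(s)}

    c_gen, c_kill = {}, {}
    for name, stmts in blocks.items():
        gen, assigned = set(), set()
        for label, target, _ in stmts:
            # Assigning to target invalidates earlier copies in this block that mention it.
            gen = {c for c in gen if target not in copies[c]}
            assigned.add(target)
            if label in copies:
                gen.add(label)
        c_gen[name] = gen
        c_kill[name] = {c for c, (x, y) in copies.items() if x in assigned or y in assigned}

    c_in = {name: set() for name in blocks}
    c_out = {name: set(copies) for name in blocks}   # optimistic start for intersection
    c_out[entry] = set(c_gen[entry])
    changed = True
    while changed:
        changed = False
        for name in blocks:
            if name != entry and preds[name]:
                c_in[name] = set.intersection(*(c_out[p] for p in preds[name]))
            new_out = c_gen[name] | (c_in[name] - c_kill[name])
            if new_out != c_out[name]:
                c_out[name] = new_out
                changed = True
    return c_in

# Diamond: B1 (s1: x := y) -> B2, B3 -> B4; B3 assigns y, so s1 is not available in B4.
blocks = {
    "B1": [("s1", "x", ("y",))],
    "B2": [("s2", "a", ("x", "b"))],
    "B3": [("s3", "y", (1,))],
    "B4": [("s4", "c", ("x", "x"))],
}
preds = {"B1": [], "B2": ["B1"], "B3": ["B1"], "B4": ["B2", "B3"]}
c_in = copy_in_sets(blocks, preds, "B1")
```

In this example the use of x in B2 may be replaced by y, but the uses in B4 may not, so step (3) of the algorithm would leave s1 in place.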
Detection of loop-invariant computations:
• Ud-chains can be used to detect those computations in a loop that are loop-invariant, that
is, whose value does not change as long as control stays within the loop. A loop is a region
consisting of a set of blocks with a header that dominates all the other blocks, so the only
way to enter the loop is through the header.
ALGORITHM: Detection of loop-invariant computations.
INPUT: A loop L consisting of a set of basic blocks, each block containing a sequence of
three-address statements. We assume ud-chains are available for the individual
statements.
OUTPUT: The set of three-address statements that compute the same value each time executed,
from the time control enters the loop L until control next leaves L.
METHOD: We shall give a rather informal specification of the algorithm, trusting that
the principles will be clear.
1. Mark “invariant” those statements whose operands are all either constant or have all
their reaching definitions outside L.
2. Repeat step (3) until at some repetition no new statements are marked “invariant”.
3. Mark “invariant” all those statements not previously so marked all of whose
operands either are constant, have all their reaching definitions outside L, or have
exactly one reaching definition, and that definition is a statement in L marked
invariant.
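The marking procedure above can be sketched as a small fixed-point loop. In this illustrative Python version the data layout is an assumption: loop_stmts maps each statement id in L to its operands (strings are variables, anything else is a constant), and ud maps a (statement id, variable) pair to the set of statement ids whose definitions reach that use. Steps (1) and (3) are folded into one loop, which reaches the same final marking.

```python
def find_invariants(loop_stmts, ud):
    """Return the ids of loop-invariant statements, in the order they were marked."""
    in_loop = set(loop_stmts)
    marked = []

    def operand_ok(sid, op):
        if not isinstance(op, str):
            return True                      # constant operand
        reaching = ud[(sid, op)]
        if not (reaching & in_loop):
            return True                      # all reaching definitions are outside L
        if len(reaching) == 1:
            (d,) = reaching
            return d in marked               # single definition, already marked invariant
        return False

    changed = True
    while changed:
        changed = False
        for sid, operands in loop_stmts.items():
            if sid not in marked and all(operand_ok(sid, op) for op in operands):
                marked.append(sid)
                changed = True
    return marked

# Loop body: s1: t := n*4   s2: u := t+1   s3: i := i+1   s4: v := u+i
# n is defined only at s0 and i is initialised at s_init, both outside the loop.
loop_stmts = {"s1": ["n", 4], "s2": ["t", 1], "s3": ["i", 1], "s4": ["u", "i"]}
ud = {
    ("s1", "n"): {"s0"},
    ("s2", "t"): {"s1"},
    ("s3", "i"): {"s_init", "s3"},
    ("s4", "u"): {"s2"},
    ("s4", "i"): {"s3"},
}
invariant = find_invariants(loop_stmts, ud)
```

The order of the returned list matters later: code motion moves statements in the order in which they were marked, so that s1 reaches the pre-header before s2.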
Performing code motion:
• Having found the invariant statements within a loop, we can apply to some of them an
optimization known as code motion, in which the statements are moved to the pre-header of the
loop. The following three conditions ensure that code motion does not change what the
program computes. Consider s: x := y+z.
1. The block containing s dominates all exits of the loop.
2. There is no other statement in the loop that assigns to x. Again, if x is a temporary assigned
only once, this condition is surely satisfied and need not be checked.
3. No use of x in the loop is reached by any definition of x other than s. This condition too will be
satisfied, normally, if x is a temporary.
ALGORITHM: Code motion.
INPUT: A loop L with ud-chaining information and dominator information.
OUTPUT: A revised version of the loop with a pre-header and some statements moved to
the pre-header.
METHOD:
1. Use the loop-invariant computation algorithm to find loop-invariant statements.
2. For each statement s defining x found in step (1), check:
i) That it is in a block that dominates all exits of L,
ii) That x is not defined elsewhere in L, and
iii) That all uses in L of x can only be reached by the definition of x in statement s.
3. Move, in the order found by the loop-invariant algorithm, each statement s found in
(1) and meeting conditions (2i), (2ii), (2iii), to a newly created pre-header,
provided any operands of s that are defined in loop L have previously had their
definition statements moved to the pre-header.
• To understand why no change to what the program computes can occur, note that conditions (2i)
and (2ii) of this algorithm assure that the value of x computed at s must be the value of x
after any exit block of L. When we move s to a pre-header, s will still be the definition of x that
reaches the end of any exit block of L. Condition (2iii) assures that any uses of x within
L did, and will continue to, use the value of x computed by s.
Alternative code motion strategies:
• Condition (1) can be relaxed if we are willing to take the risk that we may actually increase
the running time of the program a bit; of course, we never change what the program
computes. The relaxed version of code motion condition (1) is that we may move a
statement s assigning x only if:
1’. The block containing s either dominates all exits of the loop, or x is not used outside the loop.
For example, if x is a temporary variable, we can be sure that the value will be used only
in its own block.
• If the code motion algorithm is modified to use condition (1’), occasionally the running time will
increase, but we can expect to do reasonably well on the average. The modified
algorithm may move to the pre-header certain computations that may not be executed in the
loop. Not only does this risk slowing down the program significantly, it may also cause an error
in certain circumstances.
• Even if none of the conditions (2i), (2ii), (2iii) of the code motion algorithm are met by an
assignment x := y+z, we can still take the computation y+z outside a loop. Create a new
temporary t, and set t := y+z in the pre-header. Then replace x := y+z by x := t in the loop.
In many cases we can propagate out the copy statement x := t.
Maintaining data-flow information after code motion:
• The transformations of the code motion algorithm do not change ud-chaining information, since
by conditions (2i), (2ii), and (2iii), all uses of the variable assigned by a moved
statement s that were reached by s are still reached by s from its new position.
• Definitions of variables used by s are either outside L, in which case they reach the pre-header,
or they are inside L, in which case by step (3) they were moved to the pre-header ahead of
s.
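• For each statement s we keep a pointer ps that always points to s, and we put ps, rather than
a direct pointer to s, on each ud-chain containing s. Then, no matter where we move s, we have
only to change ps, regardless of how many ud-chains s is on.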
• The dominator information is changed slightly by code motion. The pre-header is now the
immediate dominator of the header, and the immediate dominator of the pre-header is
the node that formerly was the immediate dominator of the header. That is, the pre-header is
inserted into the dominator tree as the parent of the header.
Elimination of induction variables:
• A variable x is called an induction variable of a loop L if every time the variable x
changes value, it is incremented or decremented by some constant. Often, an induction
variable is incremented by the same constant each time around the loop, as in a loop
headed by for i := 1 to 10.
• A common situation is one in which an induction variable, say i, indexes an array, and
some other induction variable, say t, whose value is a linear function of i, is the actual offset
used to access the array. Often, the only use made of i is in the test for loop
termination. We can then get rid of i by replacing its test by one on t.
• We shall look for basic induction variables, which are those variables i whose only
assignments within loop L are of the form i := i+c or i := i-c, where c is a constant.
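The definition of a basic induction variable translates directly into a scan over the loop's assignments. The sketch below is illustrative only: statements are assumed to be tuples (target, left, op, right) standing for target := left op right, constants are Python ints, and the function name is hypothetical.

```python
def basic_induction_variables(loop):
    """Return {variable: [signed increments]} for variables whose only
    assignments in the loop have the form i := i + c or i := i - c."""
    steps = {}
    rejected = set()
    for target, left, op, right in loop:
        if left == target and op in ("+", "-") and isinstance(right, int):
            steps.setdefault(target, []).append(right if op == "+" else -right)
        else:
            rejected.add(target)     # some other kind of assignment to target
    return {v: s for v, s in steps.items() if v not in rejected}

loop = [
    ("i", "i", "+", 1),    # i := i + 1   basic induction variable
    ("t", "i", "*", 4),    # t := i * 4   depends on i, but is not basic
    ("j", "j", "-", 2),    # j := j - 2
    ("j", "x", "+", 1),    # j := x + 1   disqualifies j
]
basic = basic_induction_variables(loop)
```

Here only i qualifies; t is a candidate for i's family (a linear function of i), which is what the elimination algorithm goes on to exploit.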
ALGORITHM: Elimination of induction variables.
INPUT: A loop L with reaching definition information, loop-invariant computation information
and live variable information.
OUTPUT: A revised loop.
METHOD:
Consider each basic induction variable i whose only uses are to compute other
induction variables in its family and in conditional branches. Take some j in i’s
family, preferably one such that c and d in its triple (i, c, d), meaning j = c*i + d, are as
simple as possible, and modify each test that i appears in to use j instead. We assume in the
following that c is positive. A test of the form ‘if i relop x goto B’, where x is not an induction
variable, is replaced by
r := c*x /* r := x if c is 1 */
r := r+d /* omit if d is 0 */
if j relop r goto B