Compiler Design - YesDee

The document provides an overview of compilers, interpreters, and assemblers, detailing their functions and differences. It explains the phases of a compiler, including lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, and code generation, along with error handling and symbol table management. Additionally, it discusses compiler construction tools and the distinction between static and dynamic policies in programming languages.

Introduction to Compilers

1.1 Compilation and Interpretation

1.1.1 Compiler

A compiler is a translator which is used to convert programs in a high-level language into a low-level language. It translates the entire program and also reports the errors encountered during the translation.

Figure 1.1 Compiler (source program → compiler → target program; error messages)

1.1.2 Interpreter

An interpreter is a translator which is used to convert programs in a high-level language into a low-level language. An interpreter translates line by line and reports an error as soon as it is encountered during the translation process. It directly executes the operations specified in the source program when the input is given by the user, and it gives better error diagnostics than a compiler.

Figure 1.2 Interpreter (source program + input → interpreter → output)

Table 1.1 Differences between compiler and interpreter

S.No. | Compiler | Interpreter
1. | Performs the translation of the program as a whole. | Performs statement-by-statement translation.
2. | Execution is faster. | Execution is slower.
3. | Requires more memory, as linking is needed for the generated intermediate object code. | Memory usage is efficient, as no intermediate object code is generated.
4. | Debugging is hard, as the error messages are generated only after scanning the entire program. | It stops translation when the first error is met. Hence, debugging is easy.
5. | Programming languages like C and C++ use compilers. | Programming languages like Python, BASIC and Ruby use interpreters.

1.1.3 Assembler

An assembler is a translator which is used to translate assembly language code into machine language code.

Figure 1.3 Assembler (assembly language code → assembler → machine language code)

1.2 Language Processing System

Figure 1.4 Language processing system (skeletal source program → preprocessor → source program → compiler → target assembly program → assembler → re-locatable machine code → loader/link-editor, together with library and re-locatable object files → absolute machine code)

Preprocessor
A source program may be divided into modules stored in separate files. The task of collecting the source program is entrusted to a separate program called the preprocessor. It may also expand macros into source language statements.

Compiler
A compiler is a program that takes the source program as input and produces an assembly language program as output.

Assembler
An assembler is a program that converts an assembly language program into a machine language program. It produces re-locatable machine code as its output.

Loader and link-editor
- The re-locatable machine code has to be linked together with other re-locatable object files and library files into the code that actually runs on the machine.
- The linker resolves external memory addresses, where the code in one file may refer to a location in another file.
- The loader puts the entire set of executable object files together into memory for execution.
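To make the data flow of Figure 1.4 concrete, the following is a minimal Python sketch of the pipeline. Every function here is a hypothetical stub standing in for the real tool of the same name, not an implementation of it; only the shape of the data passed between stages matters.

# A minimal sketch of the language-processing pipeline in Figure 1.4.
# Every function is a hypothetical stand-in, not a real tool.

def preprocess(skeletal_source: str) -> str:
    """Collect modules and expand macros (stub)."""
    return skeletal_source.replace("MAX", "100")   # toy macro expansion

def compile_to_assembly(source: str) -> list[str]:
    """Translate the whole source program to assembly (stub)."""
    return [f"; compiled from: {line}" for line in source.splitlines()]

def assemble(asm: list[str]) -> bytes:
    """Produce re-locatable machine code (stub)."""
    return "\n".join(asm).encode()

def link_and_load(obj: bytes, libraries: list[bytes]) -> bytes:
    """Resolve external references and lay out an absolute image (stub)."""
    return b"".join([obj, *libraries])

program = link_and_load(assemble(compile_to_assembly(preprocess("x = MAX"))), [])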
1.3 Structure of a Compiler

The structure of a compiler consists of two parts:

Analysis part
- The analysis part breaks the source program into its constituent pieces and imposes a grammatical structure on them, which it further uses to create an intermediate representation of the source program.
- It is also termed the front end of the compiler.
- Information about the source program is collected and stored in a data structure called the symbol table.

Figure 1.5 Analysis part (source program → analysis phase → intermediate representation)

Synthesis part
- The synthesis part takes the intermediate representation as input and transforms it into the target program.
- It is also termed the back end of the compiler.

Figure 1.6 Synthesis part (intermediate representation → synthesis phase → target program)

The compilation process can be divided into several phases, each of which converts one form of the program into another. The different phases of a compiler are as follows:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generation
5. Code optimization
6. Code generation

All of the aforementioned phases involve the following tasks:
- Symbol table management.
- Error handling.

Figure 1.7 Phases of a compiler (character stream → lexical analyzer → token stream → syntax analyzer → syntax tree → semantic analyzer → syntax tree → intermediate code generator → intermediate representation → machine-independent code optimizer → intermediate representation → code generator → target-machine code → machine-dependent code optimizer → target-machine code; every phase consults the symbol table)

1.3.1 Lexical Analysis
- Lexical analysis is the first phase of the compiler; it is also termed scanning.
- The source program is scanned to read the stream of characters, and those characters are grouped to form sequences called lexemes, which produce tokens as output.
- Token: a token is a sequence of characters that represents a lexical unit matching a pattern, such as a keyword, operator or identifier.
- Lexeme: a lexeme is an instance of a token, i.e., a group of characters forming a token.
- Pattern: a pattern describes the rule that the lexemes of a token follow. It is the structure that must be matched by strings.
- Once a token is generated, the corresponding entry is made in the symbol table.

Input: stream of characters
Output: tokens

e.g., c = a + b * 5;

Table 1.2 Lexemes and tokens

Lexemes | Tokens
c | identifier
= | assignment symbol
a | identifier
+ | + (addition symbol)
b | identifier
* | * (multiplication symbol)
5 | 5 (number)

Hence, the statement is tokenized as <id,1> <=> <id,2> <+> <id,3> <*> <5>.

1.3.2 Syntax Analysis
- Syntax analysis is the second phase of the compiler; it is also called parsing.
- The parser converts the tokens produced by the lexical analyzer into a tree-like representation called a parse tree.
- A parse tree describes the syntactic structure of the input.

Figure 1.8 Parse tree for the assignment statement c = a + b * 5

- A syntax tree is a compressed representation of the parse tree in which each interior node represents an operator and the children of the node represent the operands of that operator.

Input: tokens
Output: syntax tree

Figure 1.9 Syntax tree

1.3.3 Semantic Analysis
- Semantic analysis is the third phase of the compiler.
- It checks the program for semantic consistency.
- Type information is gathered and stored in the symbol table or in the syntax tree.
- It performs type checking, e.g., inserting an inttofloat conversion where an integer is used in a floating-point expression.

Figure 1.10 Semantic analysis

1.3.4 Intermediate Code Generation

Intermediate code generation produces intermediate representations of the source program, which may take the following forms:
- Postfix notation
- Three-address code
- Syntax tree

The most commonly used form is three-address code:

t1 = inttofloat(5)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

Properties of intermediate code
- It should be easy to produce.
- It should be easy to translate into the target program.
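The lexical-analysis step of this example can be sketched in a few lines of Python. This is an illustrative toy, not the book's scanner, but it reproduces the token stream <id,1> <=> <id,2> <+> <id,3> <*> <5> of Table 1.2, with identifiers entered into a small symbol table.

import re

# A minimal tokenizer sketch for statements like "c = a + b * 5;".
# Token names follow Table 1.2; this is an illustration, not a full lexer.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("OP",     r"[+\-*/]"),
    ("SEMI",   r";"),
    ("WS",     r"\s+"),      # stripped, never returned to the parser
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    symbol_table = {}                       # identifier -> entry number
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "WS":
            continue                        # whitespace is not a token
        if kind == "ID":
            entry = symbol_table.setdefault(lexeme, len(symbol_table) + 1)
            yield ("id", entry)             # <id, symbol-table entry>
        else:
            yield (kind, lexeme)

print(list(tokenize("c = a + b * 5;")))
# [('id', 1), ('ASSIGN', '='), ('id', 2), ('OP', '+'),
#  ('id', 3), ('OP', '*'), ('NUMBER', '5'), ('SEMI', ';')]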
1.3.5 Code Optimization
- The code optimization phase takes the intermediate code as input and produces optimized intermediate code as output.
- It results in faster-running machine code.
- It can be done by reducing the number of lines of code of a program.
- This phase reduces redundant code and attempts to improve the intermediate code so that faster-running machine code will result.
- During code optimization, the result of the program must not be affected.
- To improve the code generation, the optimization involves:
  - Detection and removal of dead (unreachable) code.
  - Calculation of constants in expressions and terms.
  - Collapsing of repeated expressions into temporaries.
  - Loop unrolling.
  - Moving code outside the loop.
  - Removal of unwanted temporary variables.

1.3.6 Code Generation
- Code generation is the final phase of a compiler.
- It gets its input from the code optimization phase and produces the target code or object code as its result.
- Intermediate instructions are translated into a sequence of machine instructions that perform the same task.
- Code generation involves:
  - Allocation of registers and memory.
  - Generation of correct references.
  - Generation of correct data types.
  - Generation of missing code.

For the running example, the generated target code is:

LDF  R2, id3
MULF R2, #5.0
LDF  R1, id2
ADDF R1, R2
STF  id1, R1
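A toy code generator shows how the optimized three-address code can be mapped onto the five target instructions above. The register allocation is hard-coded for this one example and the mnemonics follow the listing; this is a sketch of the idea, not a general algorithm.

# A toy translation of the three-address code for c = a + b * 5 into the
# target instructions shown above. Register allocation is hard-coded; this
# illustrates the final phase, it is not a real code generator.
three_address_code = [
    ("mul", "t1", "id3", "#5.0"),   # t1 = id3 * 5.0 (after constant folding)
    ("add", "t2", "id2", "t1"),     # t2 = id2 + t1
    ("store", "id1", "t2", None),   # id1 = t2
]

def generate(tac):
    asm, reg_of = [], {}
    free = ["R1", "R2"]                   # tiny hard-coded register pool
    for op, dst, a, b in tac:
        if op in ("mul", "add"):
            r = reg_of.get(a) or free.pop()
            if a not in reg_of:
                asm.append(f"LDF  {r}, {a}")        # load operand from memory
            asm.append(f"{'MULF' if op == 'mul' else 'ADDF'} {r}, {reg_of.get(b, b)}")
            reg_of[dst] = r                          # result stays in register
        elif op == "store":
            asm.append(f"STF  {dst}, {reg_of[a]}")   # write result back
    return asm

print("\n".join(generate(three_address_code)))
# LDF  R2, id3 / MULF R2, #5.0 / LDF  R1, id2 / ADDF R1, R2 / STF  id1, R1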
1.3.7 Symbol Table Management
- The symbol table is used to store all the information about the identifiers used in the program.
- It is a data structure containing a record for each identifier, with fields for the attributes of the identifier.
- It allows the compiler to find the record for each identifier quickly and to store or retrieve data from that record.
- Whenever an identifier is detected in any of the phases, it is stored in the symbol table.

1.3.8 Error Handling
- Each phase can encounter errors. After detecting an error, a phase must handle the error so that compilation can proceed.
- In lexical analysis, errors occur in the separation of tokens.
- In syntax analysis, errors occur during the construction of the syntax tree.
- In semantic analysis, errors may occur in the following cases: (i) when the compiler detects constructs that have the right syntactic structure but no meaning, and (ii) during type conversion.
- In code optimization, errors occur when the result is affected by the optimization. In code generation, an error is raised when code is missing, etc.

Figure 1.11 illustrates the translation of source code through each phase, considering the statement c = a + b * 5.

1.4 Errors Encountered in Different Phases

Each phase can encounter errors. After detecting an error, a phase must somehow deal with the error so that compilation can proceed. A program may have the following kinds of errors at various stages:

1.4.1 Lexical Errors

These include an incorrect or misspelled name of some identifier, i.e., identifiers typed incorrectly.

1.4.2 Syntactical Errors

These include a missing semicolon or unbalanced parentheses. Syntactic errors are handled by the syntax analyzer (parser). When an error is detected, it must be handled by the parser to enable the parsing of the rest of the input. In general, errors may be expected at various stages of compilation, but most errors are syntactic, and hence the parser should be able to detect and report those errors in the program.

The goals of the error handler in the parser are to:
- Report the presence of errors clearly and accurately.
- Recover from each error quickly enough to detect subsequent errors.
- Add minimal overhead to the processing of correct programs.

There are four common error-recovery strategies that can be implemented in the parser to deal with errors in the code:
- Panic mode.
- Statement level.
- Error productions.
- Global correction.

1.4.3 Semantical Errors

These errors occur as a result of incompatible value assignments. The semantic errors that the semantic analyzer is expected to recognize are:
- Type mismatch.
- Undeclared variable.
- Reserved identifier misuse.
- Multiple declaration of a variable in a scope.
- Accessing an out-of-scope variable.
- Actual and formal parameter mismatch.

Logical errors

These errors occur due to unreachable code or an infinite loop.

1.5 Grouping of Phases

The phases of a compiler can be grouped as:

Front end: the front end of a compiler consists of the phases
- Lexical analysis.
- Syntax analysis.
- Semantic analysis.
- Intermediate code generation.

Back end: the back end of a compiler contains
- Code optimization.
- Code generation.

1.5.1 Front End
- The front end comprises the phases which depend on the input (source language) and are independent of the target machine (target language).
- It includes lexical and syntactic analysis, symbol table management, semantic analysis and the generation of intermediate code.
- Some code optimization can also be done by the front end.
- It also includes error handling in the phases concerned.

Figure 1.12 Front end (lexical analysis, syntax analysis, semantic analysis, intermediate code generation)

1.5.2 Back End
- The back end comprises those phases of the compiler that depend on the target machine and are independent of the source language.
- This includes code optimization and code generation.
- In addition, it also encompasses error handling and symbol table management operations.

Figure 1.13 Back end (code optimization, code generation)

1.5.3 Passes
- The phases of a compiler can be implemented in a single pass by marking the primary actions, viz., reading of the input file and writing to the output file.
- Several phases of the compiler are grouped into one pass in such a way that the operations of each phase are incorporated during the pass.
- For example, lexical analysis, syntax analysis, semantic analysis and intermediate code generation might be grouped into one pass. If so, the token stream after lexical analysis may be translated directly into intermediate code.

1.5.4 Reducing the Number of Passes
- Minimizing the number of passes improves time efficiency, as reading from and writing to intermediate files can be reduced.
- When grouping phases into one pass, the entire program has to be kept in memory to ensure proper information flow to each phase, because one phase may need information in a different order than the order in which the previous phase produces it.
- The source program or target program differs from its internal representation, so the memory for the internal form may be larger than that of the input and output.
1.6 Compiler Construction Tools

Some commonly used compiler construction tools include:
1. Parser generators.
2. Scanner generators.
3. Syntax-directed translation engines.
4. Automatic code generators.
5. Data-flow analysis engines.
6. Compiler construction toolkits.

1.6.1 Parser Generators

Input: grammatical description of a programming language.
Output: syntax analyzer.

A parser generator takes the grammatical description of a programming language and produces a syntax analyzer.

1.6.2 Scanner Generators

Input: regular expression description of the tokens of a language.
Output: lexical analyzer.

A scanner generator generates lexical analyzers from a regular expression description of the tokens of a language.

1.6.3 Syntax-directed Translation Engines

Input: parse tree.
Output: intermediate code.

Syntax-directed translation engines produce collections of routines that walk a parse tree and generate intermediate code.

1.6.4 Automatic Code Generators

Input: intermediate language.
Output: machine language.

A code generator takes a collection of rules that define the translation of each operation of the intermediate language into the machine language for a target machine.

1.6.5 Data-flow Analysis Engines

A data-flow analysis engine gathers information about how values are transmitted from one part of a program to each of the other parts. Data-flow analysis is a key part of code optimization.

1.6.6 Compiler Construction Toolkits

Compiler construction toolkits provide an integrated set of routines for constructing the various phases of a compiler.

1.7 Programming Language Basics

1.7.1 Static/Dynamic Distinction

1.7.1.1 Policy

A policy defines what decisions are made by a compiler for a program.

Static policy: if the language enables the compiler to decide an issue at compile time, it is termed a static policy.

Dynamic policy: when the decision can be made only at run time, the language is said to follow a dynamic policy.

1.7.1.2 Scope

Scope defines the region of the program that can access a particular declaration.

Static scope: if the scope of a declaration can be determined by looking only at the program text, the language uses static scope. It is also termed lexical scope.

Dynamic scope: with dynamic scope, at run time the same use of a variable may refer to different declarations of that variable.

e.g., in Java,

public static int a;

requires only one copy of the variable a to be created, however many objects of its class are created, and the location of the variable can be determined in advance by the compiler. In contrast,

public int a;

makes each object of the class hold its own copy of the variable, and the locations cannot be determined by the compiler before running the program.

1.7.1.3 Environment and States
- These notions are used to determine whether the values of data elements are affected by changes that occur while the program runs.
- The environment describes the association of names with storage locations; the state then associates each location with its value at run time.
- For example, two fields s1.name and s2.name are aliases if they have the same l-value, i.e., refer to the same location in memory, while s1 and s2 themselves are not aliases.
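The Java contrast above can be mirrored in Python, where a class attribute plays the role of the single statically allocatable copy and an instance attribute the role of the per-object copy created at run time. The analogy is loose (Python resolves all attributes dynamically), and the class name is invented for the illustration.

# A rough Python analogue of the Java example above: one class-level copy
# versus one copy per object.
class Widget:
    shared = 0            # like "public static int a": a single copy whose
                          # storage can be laid out before the program runs
    def __init__(self):
        self.own = 0      # like "public int a": each object gets its own
                          # copy, created (and located) only at run time

u, v = Widget(), Widget()
Widget.shared += 1        # visible through every instance
v.own += 1                # affects only v
print(u.shared, v.shared, u.own, v.own)   # 1 1 0 1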
1.8 Lexical Analysis

Lexical analysis is the process of converting a sequence of characters from the source program into a sequence of tokens. A program which performs lexical analysis is termed a lexical analyzer (lexer), tokenizer or scanner.

Lexical analysis consists of two stages of processing, which are as follows:
- Scanning
- Tokenization

1.8.1 Token, Pattern and Lexeme

Token: a token is a valid sequence of characters given by a lexeme. In a programming language,
- keywords,
- constants,
- identifiers,
- numbers,
- operators and
- punctuation symbols
are possible tokens to be identified.

Pattern: a pattern describes a rule that must be matched by sequences of characters (lexemes) to form a token. It can be defined by regular expressions or grammar rules.

Lexeme: a lexeme is a sequence of characters that matches the pattern for a token, i.e., an instance of a token.

e.g., c = a + b * 5;

Lexemes | Tokens
c | identifier
= | assignment symbol
a | identifier
+ | + (addition symbol)
b | identifier
* | * (multiplication symbol)
5 | 5 (number)

The sequence of tokens produced by the lexical analyzer helps the parser in analyzing the syntax of the programming language.

1.8.2 Role of Lexical Analyzer

Figure 1.15 Interaction between lexical analyzer and parser (source program → lexical analyzer ⇄ parser via getNextToken()/token; both consult the symbol table; tokens flow on to semantic analysis)

The lexical analyzer performs the following tasks:
- Reads the source program, scans the input characters, groups them into lexemes and produces tokens as output.
- Enters the identified tokens into the symbol table.
- Strips out white space and comments from the source program.
- Correlates error messages with the source program, i.e., displays each error message with its occurrence by specifying the line number.
- Expands macros if they are found in the source program.

The tasks of the lexical analyzer can be divided into two processes:

Scanning: performs reading of input characters and removal of white space and comments.

Lexical analysis: produces tokens as the output.

1.8.2.1 Need of Lexical Analyzer
- Simplicity of design of the compiler: the removal of white space and comments lets the syntax analyzer work with clean syntactic constructs.
- Compiler efficiency is improved: specialized buffering techniques for reading characters speed up the compiler.
- Compiler portability is enhanced.

1.8.2.2 Issues in Lexical Analysis

Lexical analysis is the process of producing tokens from the source program. It has the following issues:
- Lookahead
- Ambiguities

Lookahead

Lookahead is required to decide when one token ends and the next token begins. Simple examples with lookahead issues are i vs. if, and = vs. ==. Therefore a way to describe the lexemes of each token is required, and a way is needed to resolve ambiguities such as:
- Is if two variables i and f, or the keyword if?
- Is == two equal signs = and =, or the operator ==?
- arr(5, 4) vs. fn(5, 4) in Ada, where array reference syntax and function call syntax are similar.

Hence, the number of lookahead characters to be considered and a way to describe the lexemes of each token are both needed. Regular expressions are one of the most popular ways of representing tokens.

Ambiguities

The lexical analysis programs written with lex accept ambiguous specifications and choose the longest match possible at each input point. When more than one expression can match the current input, lex chooses as follows:
- The longest match is preferred.
- Among rules which match the same number of characters, the rule given first is preferred.
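These two rules — longest match first, then order of listing — can be sketched directly. The rule set below is a hypothetical fragment chosen only to exercise both tie-breakers, not a real lex specification.

import re

# Lex-style disambiguation sketch: prefer the longest match; on a tie,
# prefer the rule listed first (as in a lex specification).
RULES = [                       # priority order
    ("IF", r"if"),
    ("ID", r"[a-z]+"),
    ("LE", r"<="),
    ("LT", r"<"),
]

def next_token(text, pos=0):
    best = None                                     # (key, name, lexeme)
    for index, (name, pattern) in enumerate(RULES):
        m = re.compile(pattern).match(text, pos)
        if m:
            key = (len(m.group()), -index)          # longer wins; then earlier rule
            if best is None or key > best[0]:
                best = (key, name, m.group())
    return best and best[1:]

print(next_token("ifx"))   # ('ID', 'ifx') -- longest match beats keyword 'if'
print(next_token("if"))    # ('IF', 'if')  -- tie on length, first rule wins
print(next_token("<=1"))   # ('LE', '<=')  -- longest match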
1.8.3 Lexical Errors
- A character sequence that cannot be scanned into any valid token is a lexical error.
- Lexical errors are uncommon, but they should still be handled by the scanner.
- Usually, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token; misspelled identifiers and keywords are typical examples.

1.8.4 Error Recovery Schemes
- Panic mode recovery.
- Local correction:
  - The source text is changed around the error point in order to get a correct text.
  - The analyzer is restarted with the resultant new text as input.
- Global correction:
  - It is an enhanced panic mode recovery.
  - Preferred when local correction fails.

Panic mode recovery

In panic mode recovery, unmatched patterns are deleted from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left.

e.g., suppose the string fi is encountered for the first time in a C program in the context

fi (a == f(x)) ...

A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer will return the token id to the parser.

Local correction

Local correction performs deletion/insertion and/or replacement of any number of symbols at the error detection point.

e.g., in Pascal, given c[i]'=';, the scanner deletes the first quote because it cannot legally follow the closing bracket, and the parser then treats the resulting '=' as an assignment.

Most errors are corrected by local correction.

e.g., the effects of lexical error recovery might well create a later syntax error, handled by the parser. Consider

... for$tnight ...

The $ terminates scanning of for. Since no valid token begins with $, it is deleted. Then tnight is scanned as an identifier. In effect, the result is

... for tnight ...

which will cause a syntax error. Such false errors are unavoidable, though a syntactic error-repair may help.

Lexical error handling approaches

Lexical errors can be handled by the following actions:
- Deleting one character from the remaining input.
- Inserting a missing character into the remaining input.
- Replacing a character by another character.
- Transposing two adjacent characters.
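The four repair actions can be sketched as a generator of candidate corrections: intersecting the candidates for a bad lexeme with the keyword set suggests a repair (here, that fi is a transposition of if). The helper below is illustrative, not a scheme any particular compiler uses verbatim.

# A sketch of the four single-edit repair actions listed above, generating
# candidate corrections for an unrecognized lexeme such as "fi".
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def single_edit_repairs(s):
    deletes    = [s[:i] + s[i+1:]     for i in range(len(s))]
    inserts    = [s[:i] + c + s[i:]   for i in range(len(s) + 1) for c in ALPHABET]
    replaces   = [s[:i] + c + s[i+1:] for i in range(len(s)) for c in ALPHABET]
    transposes = [s[:i] + s[i+1] + s[i] + s[i+2:] for i in range(len(s) - 1)]
    return set(deletes + inserts + replaces + transposes)

keywords = {"if", "for", "while"}
print(sorted(single_edit_repairs("fi") & keywords))   # ['if'] -- 'fi' transposed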
1.9 Input Buffering
- To ensure that the right lexeme is found, one or more characters have to be looked up beyond the next lexeme.
- Hence a two-buffer scheme is introduced to handle large lookaheads safely.
- Techniques for speeding up the lexical analyzer, such as the use of sentinels to mark the buffer end, have been adopted.

There are three general approaches to the implementation of a lexical analyzer:
(i) By using a lexical analyzer generator, such as the lex compiler, to produce the lexical analyzer from a regular-expression-based specification. In this case, the generator provides routines for reading and buffering the input.
(ii) By writing the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.
(iii) By writing the lexical analyzer in assembly language and explicitly managing the reading of input.

1.9.1 Buffer Pairs

Because of the large amount of time consumed in moving characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character. Figure 1.16 shows the buffer pairs which are used to hold the input data.

Figure 1.16 Buffer pairs (two buffer halves, with lexemeBegin and forward pointers)

- Two pointers, lexemeBegin and forward, are maintained.
- lexemeBegin points to the beginning of the current lexeme, which is yet to be found.
- forward scans ahead until a match for a pattern is found.
- Once a lexeme is found, lexemeBegin is set to the character immediately after the lexeme which has just been found, and forward is set to the character at its right end.
- The current lexeme is the set of characters between the two pointers.

Disadvantages of this scheme
- This scheme works well most of the time, but the amount of lookahead is limited.
- This limited lookahead may make it impossible to recognize tokens in situations where the distance that the forward pointer must travel is more than the length of the buffer.

e.g., DECLARE (ARG1, ARG2, ..., ARGn) in a PL/I program: the scanner cannot determine whether DECLARE is a keyword or an array name until it sees the character that follows the right parenthesis.

1.9.2 Sentinels
- In the previous scheme, each time the forward pointer is moved, a check must be done to ensure that one half of the buffer has not been moved off; if it has, then the other half must be reloaded.
- Therefore the ends of the buffer halves require two tests for each advance of the forward pointer:
  Test 1: for the end of the buffer.
  Test 2: to determine what character is read.
- The usage of a sentinel reduces the two tests to one by extending each buffer half to hold a sentinel character at the end.
- The sentinel is a special character that cannot be part of the source program (the eof character is used as the sentinel).

Figure 1.17 Sentinels at the end of each buffer half

Advantages
- Most of the time, only one test is performed: whether the forward pointer points to an eof.
- More tests are performed only when the end of a buffer half is reached.
- Since N input characters are encountered between eofs, the average number of tests per input character is very close to 1.
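A sketch of the sentinel idea in Python: each buffer half is extended with an EOF sentinel, so the inner loop performs the single common test per character, and the expensive reload logic runs only when the sentinel is actually seen. The buffer contents and the sentinel value are illustrative.

# A sketch of sentinel-based scanning: each buffer half ends with an EOF
# sentinel, so the inner loop needs only one test per character.
EOF = "\0"                      # stand-in for the eof sentinel character

def scan(buffer_halves):
    half = iter(buffer_halves)
    buf, forward = next(half) + EOF, 0
    while True:
        ch = buf[forward]
        forward += 1
        if ch != EOF:
            yield ch            # ordinary character: the single common test
        else:                   # rare case: sentinel reached
            nxt = next(half, None)
            if nxt is None:
                return          # true end of input
            buf, forward = nxt + EOF, 0   # reload the other half

print("".join(scan(["c = a", " + b * 5;"])))   # c = a + b * 5;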
1.10 Specification of Tokens
- Regular expressions are a notation to represent lexeme patterns for a token.
- They are used to represent the language for the lexical analyzer.
- They assist in finding the type of token that accounts for a particular lexeme.

1.10.1 Strings and Languages

Alphabets are finite, non-empty sets of input symbols.

Σ = {0, 1} — binary alphabet

A string represents a finite sequence of alphabet symbols.

w = {0, 1, 00, 01, 10, 11, 001, 010, ...}

w indicates the set of possible strings for the given binary alphabet Σ.

A language L is a collection of strings, e.g., the strings accepted by a finite automaton:

L = {0^n 1 | n >= 0}

The length of a string is the number of input symbols in the string, found by counting them:

w = 0101, |w| = 4

The empty string ε denotes zero occurrences of input symbols.

Concatenation of two strings p and q is denoted by pq.

p = 010 and q = 001
pq = 010001
qp = 001010
pq ≠ qp

The empty string is the identity under concatenation. Let x be a string:

εx = xε = x

Prefix: a prefix of a string s is obtained by removing zero or more symbols from the end of s.
e.g., s = balloon; possible prefixes include ball and balloon.

Suffix: a suffix of a string s is obtained by removing zero or more symbols from the beginning of s.
e.g., s = balloon; possible suffixes include loon and balloon.

Proper prefix: a proper prefix p of a string s satisfies s ≠ p and p ≠ ε.

Proper suffix: a proper suffix x of a string s satisfies s ≠ x and x ≠ ε.

Substring: a substring is a part of a string obtained by removing a prefix and a suffix from s.

1.10.2 Operations on Languages

The important operations on a language are:
- Union
- Concatenation
- Closure

Union

The union of two languages L and M produces the set of strings which are in L, in M, or in both. It can be denoted as:

L ∪ M = {p | p is in L or p is in M}

Concatenation

The concatenation of two languages L and M produces the set of strings formed by merging the strings in L with the strings in M (each string in L is followed by a string in M). It can be represented as:

LM = {pq | p is in L and q is in M}

Closure

(i) Kleene closure (L*)

Kleene closure refers to zero or more occurrences of input symbols in a string, i.e., it includes the empty string ε (the set of strings with 0 or more occurrences of input symbols):

L* = ∪ (i = 0 to ∞) L^i

(ii) Positive closure (L+)

Positive closure indicates one or more occurrences of input symbols in a string, i.e., it excludes the empty string ε (the set of strings with 1 or more occurrences of input symbols):

L+ = ∪ (i = 1 to ∞) L^i

L^3 denotes the set of strings each of length 3.

e.g., let Σ = {a, b}. Then

L* = {ε, a, b, aa, ab, ba, bb, aab, aba, ...}
L+ = {a, b, aa, ab, ba, bb, aab, aaba, ...}
L^3 = {aaa, aab, aba, abb, baa, bab, bba, bbb}

Precedence of operators
- The unary operator (*) has the highest precedence.
- The concatenation operator (·) is second highest and is left associative.
- The union operator ( | or ∪ ) has the least precedence and is left associative.

Based on this precedence, a regular expression is transformed to a finite automaton when implementing a lexical analyzer.
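The three operations are easy to play with on small finite languages. The following sketch computes union, concatenation, L^i and a finite approximation of the Kleene closure (the true L* is infinite, so it is truncated at a bound k).

# A sketch of the language operations on small finite languages.
from itertools import product

def union(L, M):
    return L | M

def concat(L, M):
    return {p + q for p, q in product(L, M)}

def power(L, i):
    result = {""}                 # L^0 = {epsilon}
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_upto(L, k):
    """Finite approximation of L*: the union of L^0 .. L^k."""
    return set().union(*(power(L, i) for i in range(k + 1)))

L = {"a", "b"}
print(sorted(power(L, 2)))        # ['aa', 'ab', 'ba', 'bb']
print(sorted(kleene_upto(L, 2)))  # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']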
In order recogni whitespace, it is defined as ws — (blank | tab | newline)t Generally, when a token is found it is returned to the parser. But when a lexical analy2 encounters ws, it restarts its process from the character that follows the whitespace. Table 1 Introduction to Compilers 31 ‘5 Token names with their attribute value “Lexemes | Tokenname | Attribute value Any ws - zm ae ‘i | phen then a [area else : = | Any id id Pointer to symbol table entry | any number | number Pointer to symbol table entry < relop LT ee relop LE e = relop EQ <> relop NE me Z | relop ct 2 >= relop GE é 7 1.11.1 Transition Diagrams Transition diagrams are pictorial representation of transition from one state to another on taking some input symbol. The patterns are converted into transition diagram while constructing the lexical analyzer. It comprises of states and edges, where states represent the condition that occurs in the process of scanning the input and edges indicate the transition. Edges are labelled by some input symbols. Forward pointer is advanced if an edge is found from some state with label of the input under consideration. Conventions ¥ Start state is indicated by an arrow labelled with start. —O Y Final states indicate that a lexeme has been found, It is represented by a double circle. O 32. Compiler Design 2e i “4? wi ~ed near the acceptii ¥ To indicate the retraction of forward pointer, a “* will be 2 eng oo pting state as the lexeme does not include the symbol that reaches the a i 5 (which is similar ition di i signed integers (whic! in The transition diagram for relation operators and unsi Sa most of the programming languages like Pascal, C) can be given as and 1.19. stat = ao @) retur(relop, LE) > © retum(relop, NE) ott C4) tit. 0 returm(relop, EQ) Ci ‘)——@) retur(rolop, SE) Figure 1.18 Tran: digt digit digit start Figure 1.19 Transition diagram for unsigned integers 1.11.2 Recognition of Reserved Words and Keywords Keyword patterns match with that of identifiers. But they should be recognized differently The transition diagram for identifiers is given in Fi hs pattern for keywords, is given in Figure 1.20. This figure also satisfies Introduction to Compilers 33 letter | digit start letter othe . “OO =o reluin(getTokeni)instalO() Figure 1.20 Transition diagram for identifiers and keywords To handle reserved words different from identi fiers, v Install the reserved words in the symbol table initially. Symbol table entry indicates the token they represent from a Separate field, © installID() When an identifier is found, this function places it in symbol table if itis not already there and returns pointer to the symbol table entry for the lexeme found. © gerToken() When a lexeme is found, this function examines the symbol table entry and returns the token name indicated in it ¥ Create separate transition diagrams for each keyword. start t h e a nonlet | dig CO ie ay ee i Figure 1.21 Transition diagram for keyword In Figure 1.21 ¥ Atest for ‘non-letter or digit’ is done to check the end of identifier. v If it reaches the accepting state it is recognised as keyword else identifier. ¥ This is done since lexeme such as the nextvalue has a proper prefix then. v When lexeme matches both patterns, priority is given to reserved words. 112 Lex Y Lexis a tool in lexical analysis phase to recognize tokens using regular expression. ¥ Lex tool itself is a lex compiler. 
1.12.1 Use of Lex ¥ lest is an a input file written in a language which describes the generation of lexical analyzer, The lex compiler transforms lex.! to a C program known as lex.yy.c. ¥ lex.yy.e is compiled by the C compiler to a file called a.out. Y The output of C compiler is the working lexical analyzer which takes stream of input Characters and produces a stream of tokens. 34 Compiler Design 2e ¥ yylval isa global variable which is shared by lexical analyzer and parser to return the name and an attribute value of token. ¥ The attribute value can be numeric code, pointer to symbo! Y. Another tool for lexical analyzer generation is flex. | table or nothing. fa ed cine Lexcompiter. |-——> lexyy.c —" lexyyc — Compiler. |-——> aout (rout stream oo pi [___» Sequence of tokens a Figure 1.22 Creating lexical analyzer 1.12.2 Structure of Lex Programs Lex program will be in following form declarations ss translation rules ae auxiliary functions Declarations This section includes declaration of variables, constants and regular defir Translation rules _\t contains regular expressions and code segments. Form : Pattern {action} Pattern is a regular expression or regular definition, Action refers to segments of code. Auxiliary functions This section holds additional functions which are used in actions These functions are compiled separately and loaded with lexical analyzer. Lexical analyzer produced by lex starts its process by reading one character at a time until a valid match for a pattern is found. Once a match is found, the associated action takes place to produce token. The token is then given to parser for further processing. Introduction to Compilers 35 4.12.3. Conflict Resolution in Lex Conflict arises when several Prefixes of input matches one or more patterns. This can be resolved by the following: ¥_ Always prefer a longer prefix than a shorter prefix. If two or more patterns are matched for the longest prefix, then the first pattern listed in lex program is preferred. 4.12.4 Lookahead Operator ¥ Lookahead operator is the additional operator that is read by lex in order to distinguish additional pattern for a token, ¥ Lexical analyzer is used to read one character ahead of valid lexeme and then retracts to produce token. ¥ At times, it is needed to have certain characters at the end of input to match with a pattern. In such cases, slash (/) is used to indicate end of part of pattern that matches the lexeme. eg. In some languages keywords are not reserved. So the statements IF (I, J) = 5 and IF(condition) THEN . results in conflict whether to produce IF as an array name or a keyword. To resolve this the lex rule for keyword IF can be written as, IF, /\, (, tab letter } 1.13 Design of Lexical Analyzer V Lexical analyzer can either be generated by NFA or by DEA. V DEA is preferable in the implementation of lex. 1.13.1 Structure of Generated Analyzer Architecture of lexical analyzer generated by lex is given in Figure 1.23 Lexical analyzer program includes: ¥ Program to simulate automata ¥ Components created from lex program by lex itself which are listed as follows: © A transition table for automaton. © Functions that are passed directly through lex to the output. © Actions from input program (fragments of code) which are invoked by automaton simulator when needed. 
36 Compiler Design 2¢ ‘Automaton simulator Transition table Lex program ors Figure 1.23 Lex program used by finite automaton simulator Steps to construct automaton Step 1: Convert each regular expression into NFA either by Thompson’s sub-set construction or direct method. Step 2: Combine all NFAs into one by introducing new start state with €-transitions to each of start states of NFAs N; for pattern pj. Step 2 is needed as the objective is to construct single automaton to recognize lexemes that matches with any of the patterns. Figure 1 :24 Construction of NFA from lex Program Introduction to Compilers. 37 eB @ {action A; for pattern p, } ab {action Ag for pattern pz} a°b* {action Ag for pattern p3} For string abd, pattern p2 and pattern py matches. But the pattern pp will be taken into account as it was listed first in lex program. For string aabbb ---, matches pattern p3 as it has many prefixes, Figure 1.25 shows NFAs for recognizing the above mentioned three patterns. The combined NFA for all three given patterns is shown in Figure 1.26 “OO ah Bh eye “GY Figure 1.25 NFA’s for a,abb,a*bt DO Nee Figure 1.26 Combined NFA 1.13.2 Pattern Matching based on NFAs Lexical analyzer reads input from input buffer from the beginning of lexeme pointed by the pointer lexemeBegin. Forward pointer is used to move ahead of input symbols, calculates the set of states it is in at each point, If NFA simulation has no next state for some input symbol, then there will be no longer prefix which reaches the accepting state exists. At such cases¥the decision will be made on the so seen longest prefix, i.e., lexeme matching some Patten. Process is repeated until one or more accepting states are reached. If there are several accepting states, then the pattern p; which appears earliest in the list of lex program is chosen. Compiler Design 2€ 38 CBs woe a’b* b a = a _———+ none osha E —|7 ae 4 Es =I Figure 1.27 Processing input aaba Explanation Process starts with e-closure of initial state 0. After processing all the input symbols, no state is found as there is no transition out of state 8 on input a. Hence, accepting state is looked by retracting to previous state. From Figure 1.27 state 2 which is an accepting state is reached after reading input symbol a and therefore the pattern a has been matched. At state 8, string aab has been matched with pattern a*b*. By lex rule, the longest matching prefix should be considered. Hence, action Ag corresponds to pattern ps will be executed for the string aab. 1.13.3 DFAs for Lexical Analyzers DFAs are also used to represent the output of lex. DFA is constructed from NFA, by converting all the patterns into equivalent DFA using sub-set construction algorithm. If there are one or more accepting NFA states, the first pattern whose accepting state is represented in each DFA state is determined and displayed as output of DFA state. Process of DFA is similar to that of NFA. Simulation of DEA is conti i ntinued until ni i retraction takes place to find the acce ‘0 next state is found. Then pling state of i i A for that state is executed. DFA. Action associated with the pattem 1.13.4 Implementing Lookahead Operator Lookahead operator 11 /r2 is need to describe some trailing context 12 in order to Correctly For the pattern r, /r2,'/' is treated as ¢ If some prefix ab, is recopn; » 18 recognized by NFA as a mato! Ismot ended as NFA reaches the accepting state, hohe ad Mia nesaueneetl led because the pattern r) for a particular token may need identify the actual lexeme. 
Introduction to Compilers 39 ‘The end of lexeme occurs when NFA enters a state p such that 1. phas an ¢-transition on /, 2. There is a path from start state to state p, that spells out a. 3, There is a path from state p to accepting state that spells out b. 4, aisas long as possible for any ab satisfying conditions 1 — 3. any ell) ( roe Figure 1.28 NFA for keyword IF ait letter Figure 1.28 shows the NFA for recognizing the keyword IF with lookahead. Transition from state 2 to state 3 represents the lookahead operator (¢-transition). Accepting state is state 6, which indicates the presence of keyword IF. Hence, the lexeme IF is found by looking backwards to the state 2, whenever accepting state (state 6) is reached. 1.14 Finite Automata A recognizer for a language is a program that takes as input a string © and answers yes if x is a sentence of the language and no otherwise. Aregular expression is compiled into a recognizer by constructing a generalized transition diagram called a Finite Automaton (FA). ‘A finite automata can be Non-deterministic Finite Automata (NFA) or Deterministic Finite Automata (DFA). Itis given by M = (Q, ©, go, F; 4). where Q — Set of states & — Set of input symbols qo — Start state F — Set of final states 6 — Transition function (mapping states to input symbol). 6:QxzZ-Q ¥ Non-deterministic Finite Automata (NFA) © More than one transition occurs for any input symbol from a state. © Transition can occur even on empty string (€). 40 Compiler Design 2e V Deterministic Finite Automata (DFA) f i S cl sition occurs fr s © For each state and for each input symbol, exactly one transitio TOM thay state. Regular expression can be converted into DFA by the following methods: (i) Thompson's sub-set construction V Given regular expression is converted into NFA V Resultant NFA is converted into DFA (ii) Direct method V Indirect method, given regular expression is converted directly into DFA. 1.15 Regular Expressions to DFA Regular expression is used to represent the language (lexeme) of finite automata (lexical analyzer). 1.15.1 Rules for Conversion of Regular Expression to NFA © Union * Concatenation Introduction to Compilers. 41 e Closure 1.15.2 e-closure e-closure is the set of states that are reachable from the state concemed on taking empty string as input. It describes the path that consumes empty string (€) to reach some states of NFA. @ Example 1.3 e-closure(qgo) = {¢o, 11, 42} e-closure(qi) = {a1 92} e-closure(qa) = {q2} @ Example 1.4 42. Compiler Design 2e 4, 6} e-closure(1) = {1,2,3 e-closure(2) = {2,3; 6} e-closure(3) = 13, 6} e-closure(4) = {4} --closure(5) = {5,7} closure(6) = {6} ) e-closure(7) = {7} 1.15.3 Sub-set Construction ¥. Given regular expression is converted into NFA. v Then, NFA is converted into DFA. Steps 1. Convert given RE into NFA using above rules for operators (union, concatenation and closure) and precedence. Find e-closure of all states. N Start with e-closure of start state of NFA. rw Apply the input symbols and find its ¢-closure. Diran{state, input symbol] = losure(move(state, input symbol)) where Dtran — transition function of DFA 5. Analyze the output state to find whether it is a new state. . If new state is found, repeat step 4 and step 5 until no more new states are found. . Construct the transition table for Diran function. oe ND - Draw the transition diagram with start state as the -closure (start state of NFA) and final state as the state that contains final state of NFA drawn for given RE. 
@ Example 1.5 RE = (a| b)*abb Step 1: Construct NFA Introduction to Compilers 43 Step 2: Start with finding ¢-closure of state 0 e-closure(0) = {0,1,2,4,7} =A Step 3: Apply input symbols a, bwoA Dtran{A, a] = ¢-closure(move(A, a)) e-closure(move({0, 1,2,4,7}.@)) e-closure(3, 8) 3,6, 7,1,2,4,8} {1,2,3, 4,6, 7,8} =B Dtran[A, a] = B e-closure(move(A, b)) e-closure(move({0,1,2,4,7},b)) = e-closure(5) = {5,6,7,1,2,4,7} = (1,2,4.5,6,7,=° Dtran[A, b] =C i Step 4: Apply input symbols to new state B Dtran[B, a] = e-closure(move(B, a)) = e-closure(move({1, 2,3, 4,6, 7,8} a)) = e-closure(3, 8) = {1,2,3,4,6, 7,8} = B 44 Compiler Design 2e Dtran[B, a] = B Dtran[B, }] = e-closure( move(B, ®)) 7 “ = e-closure(move({1,2,3:4 6,7, 8}; 5)) -closure(5, 9) = {1,2,4,5,6 7,9} =D Dtran[B, 6] =D Step 5: Apply input symbols to new state C Dtran{C, a] = e-closure(move(C, @)) = -closure(move({1, 2, 4,5, 6, 7}, 4)) = e-closure(3, 8) = {1,2,3,4,6,7,8} =B Dtran{C, a] = B Dtran{C, 6] = e-closure(move(C, b)) = e-closure(move({1, 2, 4,5, 6, 7}, >) = e-closure(5) = {1,2,4,5,6,7}=C Dtran{C, b| = C Step 6: Apply input symbols to new state D Dtran[D, a] = e-closure(move(D, a)) = e-closure(move({1, 2, 4,5, 6,7, 9}, a)) = e-closure(3, 8) = {1,2,3,4,6,7,8} =B Dtran[D, a] = B Drran[D, b] = e-closure(move(D, b)) = e-closure(move({1, 2, 4,5, 6, 7, 9}, b)) = e-closure(5, 10) = {1,2,4,5,6, 7,10} = Dtran(D, b] = E TO} =E Introduction to Compilers Step 7! Apply input symbols to new state E Dtran[E, a] = €-closure(move(E, a)) = €-closure(move({1, 2, 4, 5, 6,7, 10}, a)) = e-closure(3, 8) = {1, 2,3, 4,6, 7,8} =B Dtran{E, a] = B Dtran{E, b] = e-closure(move(E, b)) e-closure(move({1, 2, 4, 5,6, 7, 10}, 6)) = e-closure(5) = {1,2,4,5,6,7}=C Dtran[E, 6] = C I Step 8: Construct transition table a aAaljmialsojyalye= Di mi wmliwi we ¥E Note: ¥ Start state is the e-closure(0), i.e., A. i d ‘A. V Final state is the state that contains final state of drawn NF Step 9: Construct transition diagram 48 Compiler Design 2e 1.15.4 Direct Method .d to convert given regular expression directly into DFA. v Direct method is uses ¥ Uses augmented regular expression r#. ¥ Important states of NFA correspond to positions in reg! of the alphabet. ular expression that hold symbols Regular expression is represented as syntax tree where interior nodes correspond to operators representing union, concatenation and closure operations. v Leaf nodes corresponds to the input symbols ¥ Construct DFA directly from a regular expression by © nullable(n), firsipos(n), lastpos(n) and followpos(i) from the syntax tree. © nullable(n); Is true for « node and node labeled with ¢. For other nodes it is computing the functions false. © firstpos(n): Set of positions at node n that corresponds to the first symbol of the sub-expression rooted at 7. © lastpos(n): Set of positions at node n that corresponds to the last symbol of the sub-expression rooted at n. © followpos(i): Set of positions that follows given position by matching the first or last symbol of a string generated by sub-expression of the given regular expression. 
Table 1.6 Rules for computing nullable, firstpos and lastpos Noden nullable(n) firstpos(n) lastpos(n) A leaf labeled & True % : ¢ A leaf with position i False {i} {i} An or node nullable(c)) or | firstpos(c,) U lastpos(c,) U n=cy|Co nullable(c2) firstpos(c2) lastpos(c2) if (nullable(c1)) | if (nullable(e,)) nullable(c) and | ftstpos(c1) U | Iastpos(e1) U A cat node nna nullable(cy) ay lastpos(c2) Sisto) else irstpos(c, lastpost A star node Posten’ n=c True firstpos(c,) lastpos(cy) Introduction to Compilers 49 Computation of followpos ‘The position of regular expression can follow another in the following ways: V If nis a cat node with left child cy and right chi it ‘ 7 it child ae te lastpos(c1), all positions in firstpos(cy) - Gana ies every poeuen © For cat node, for each position i in k ‘| t ni ; right child will be in flscaty ‘pos of its le ft child, the firstpos of its ¥ If nis a star node and 7 is a position in | ; te : are in followpos(i). astpos(n), then all positions in firstpos(n) © For star node, the firstpos of that node i: ‘ i f is in foll it i lastpos of that node. Followpos of all ‘positions in Input A regular expression r Output A DFA D, that recognizes L(r) Method 1. Construct syntax tree T for the augmented regular expression r#. 2. Compute nullable, firstpos, lastpos and followpos for T. 3. Construct Dstates, the set of states for DFA and Dtran, the transition function for DFA. 3.1 Initially all states are unmarked. Make the state marked by considering its out-transitions. 3.2 Start state of D is firstpos(root of T) 3.3 Apply union of followpos(p) for all p corresponds to considered input symbol | et containing the 3.4 Accepting states are the s vr end marker # ion of regular expression into DFA Figure 1.29 Algorithm for conversi

You might also like