AT&CD DCM UNIT 4-1
Please read this disclaimer before proceeding:
This document is confidential and intended solely for the educational purpose of RMK Group of Educational Institutions. If you have received this document through email in error, please notify the system manager. This document contains proprietary information and is intended only for the respective group / learning community it is addressed to. If you are not the addressee, you should not disseminate, distribute or copy it through e-mail. Please notify the sender immediately by e-mail if you have received this document by mistake and delete it from your system. If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.
22AI602/22AM601
AUTOMATA THEORY AND COMPILER DESIGN
Table of Contents
1. Course Objectives
2. Pre-Requisites
3. Syllabus
4. Course Outcomes
5. CO - PO / PSO Mapping
6. Lecture Plan
7. Activity Based Learning
8. Lecture Notes
9. Assignments
10. Part A: Q & A
11. Part B Questions
12. Supportive Online Certification Courses
1. Course Objectives
2. PRE-REQUISITES
3. SYLLABUS
22AI602/22AM601   AUTOMATA THEORY AND COMPILER DESIGN   L T P C : 3 0 0 3

UNIT IV   LEXICAL AND SYNTAX ANALYSIS   9
Introduction: The structure of a compiler.
Lexical Analysis: The Role of the Lexical Analyzer, Input Buffering, Recognition of Tokens, The Lexical-Analyzer Generator Lex.
Syntax Analysis: Introduction, Context-Free Grammars, Writing a Grammar, Top-Down Parsing, Bottom-Up Parsing, Introduction to LR Parsing: Simple LR, More Powerful LR Parsers, Parser Generator YACC.
4. Course Outcomes

CO No.   Course Outcomes                                                      Highest Cognitive Level
CO1      Construct deterministic and non-deterministic finite automata.      K2
CO2      Design context-free grammars for formal languages using             K3
         regular expressions.
CO3      Use PDA and Turing Machines for recognizing context-free            K3
         languages.
CO4      Design a lexical analyzer.                                          K2
5. CO - PO / PSO MAPPING

PO/PSO levels (PO1-PO12, PSO1-PSO3): K3 K4 K5 K5 K3/K5 A2 A3 A3 A3 A3 A3 A2 K3 K3 K3

C202.1 (K3): 3 3 3 3 2 3 2 2
C202.2 (K3): 3 3 2 2 1 3 2 2
C202.3 (K3): 3 3 1 2 2 3 2 1
C202.4 (K3): 3 3 1 2 2 3 2 1
C202.5 (K3): 3 3 1 2 2 2 2 1
C202:        3 3 3 3 2 3 2 1
6. Lecture Plan
7. ACTIVITY BASED LEARNING
Kahoot Quiz - UNIT IV:
https://sites.google.com/site/wonsunahn/teaching/cs-0449-systems-software/kahoot-quiz
https://create.kahoot.it/share/cs8602-unit-1/2e5c742f-541a-4bd0-84b2-9e2ca3b47170
Join at www.kahoot.it or with the Kahoot! app and use the game PIN to play the quiz.

Hands-on Assignment:
1. Use JFLAP to demonstrate the construction of the LL(1) parsing table for the grammar
S -> aABb
A -> aAc | λ
B -> bB | c
8. LECTURE NOTES
UNIT – IV
INTRODUCTION TO COMPILERS
Preprocessor
1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
COMPILER
A compiler is a program that reads a program written in one language (the source program) and translates it into an equivalent program in another language (the target program), reporting any error messages detected during translation.
(Figure: source program → compiler → target program, with error messages as a side output.)
INTERPRETER: An interpreter is a program that appears to execute a source program as if it were machine language.
(Figure: source program and data → interpreter → output.)
An interpreter performs:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct execution
Advantages: an interpreter can usually give better error diagnostics than a compiler, because it executes the source program statement by statement.
Disadvantages: execution is usually much slower than that of the machine-language target program produced by a compiler.
Loader and Link-editor:
Once the assembler produces an object program, that program must be placed into memory and executed. The assembler could place the object program directly in memory and transfer control to it, thereby causing the machine-language program to be executed. However, this would waste core by leaving the assembler in memory while the user's program was being executed. Also, the programmer would have to retranslate the program with each execution, thus wasting translation time. To overcome this waste of translation time and memory, system programmers developed another component called a loader.
"A loader is a program that places programs into memory and prepares them for execution." It would be more efficient if subroutines could be translated into an object form that the loader could "relocate" directly behind the user's program. The task of adjusting programs so that they may be placed in arbitrary core locations is called relocation. Relocating loaders perform four functions.
TRANSLATOR
A translator is a program that takes as input a program written in one language and produces as output a program in another language. Besides program translation, the translator performs another very important role: error detection. Any violation of the HLL (High-Level Language) specification is detected and reported to the programmer. The important roles of a translator are:
• Translating the HLL program input into an equivalent machine-language (ML) program.
• Providing diagnostic messages wherever the programmer violates the specification of the HLL.
TYPES OF TRANSLATORS:
• Compiler
• Interpreter
• Preprocessor
1.2 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation. The phases of a compiler are shown below.
There are two parts to compilation:
a. Analysis (Machine Independent / Language Dependent)
b. Synthesis (Machine Dependent / Language Independent)
PHASES OF A COMPILER
The compilation process is partitioned into a number of sub-processes called 'phases'.
Lexical Analysis:
The lexical analyzer (LA), or scanner, reads the source program one character at a time, carving the source program into a sequence of atomic units called tokens.
Syntax Analysis:
The second stage of translation is called syntax analysis or parsing. In this phase, expressions, statements, declarations, etc. are identified by using the results of lexical analysis. Syntax analysis is aided by techniques based on the formal grammar of the programming language.
Intermediate Code Generation:
An intermediate representation of the final machine-language code is produced. This phase bridges the analysis and synthesis phases of translation.
Code Optimization:
This is an optional phase designed to improve the intermediate code so that the output runs faster and takes less space.
Code Generation:
The last phase of translation is code generation. A number of optimizations to reduce the length of the machine-language program are carried out during this phase. The output of the code generator is the machine-language program for the specified computer.
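For instance, a three-address statement such as a := b + c might be turned into code for a hypothetical register machine as follows (the instruction set shown is illustrative only, not that of any particular computer):

    MOV b, R0    ; load b into register R0
    ADD c, R0    ; add c to the contents of R0
    MOV R0, a    ; store the result into a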
The parser has two functions. It checks whether the tokens from the lexical analyzer occur in patterns that are permitted by the specification for the source language. It also imposes on the tokens a tree-like structure that is used by the subsequent phases of the compiler.
For example, if a program contains the expression A+/B, then after lexical analysis this expression might appear to the syntax analyzer as the token sequence id+/id. On seeing the /, the syntax analyzer should detect an error situation, because the presence of these two adjacent binary operators violates the formation rules of an expression.
Syntax analysis makes explicit the hierarchical structure of the incoming token stream by identifying which parts of the token stream should be grouped together.
The intermediate code generator uses the structure produced by the syntax analyzer to create a stream of simple instructions. Many styles of intermediate code are possible. One common style uses instructions with one operator and a small number of operands.
The output of the syntax analyzer is some representation of a parse tree. The intermediate code generation phase transforms this parse tree into an intermediate-language representation of the source program.
Code Optimization
This is an optional phase designed to improve the intermediate code so that the output runs faster and takes less space. Its output is another intermediate-code program that does the same job as the original, but in a way that saves time and/or space.
1. Local Optimization:
There are local transformations that can be applied to a program to make an improvement. For example,
If A > B goto L2
Goto L3
L2:
can be replaced by the single statement
If A <= B goto L3
Another important local optimization is the elimination of common sub-expressions:
A := B + C + D
E := B + C + F
might be evaluated as
T1 := B + C
A := T1 + D
E := T1 + F
taking advantage of the common sub-expression B + C.
2. Loop Optimization:
Another important source of optimization concerns increasing the speed of loops. A typical loop improvement is to move a computation that produces the same result each time around the loop to a point in the program just before the loop is entered.
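A hedged C sketch of this transformation (the function and variable names are hypothetical, chosen only to illustrate moving a loop-invariant computation out of the loop):

/* Before loop optimization: x * y is recomputed on every iteration. */
void before(int *a, int n, int x, int y) {
    for (int i = 0; i < n; i++)
        a[i] = x * y + i;
}

/* After loop optimization: the invariant product is computed once,
   just before the loop is entered. */
void after(int *a, int n, int x, int y) {
    int t = x * y;
    for (int i = 0; i < n; i++)
        a[i] = t + i;
}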
Code Generator:
The code generator (CG) produces the object code by deciding on the memory locations for data, selecting code to access each datum, and selecting the registers in which each computation is to be done. Many computers have only a few high-speed registers in which computations can be performed quickly. A good code generator attempts to utilize registers as efficiently as possible.
Error Handling:
One of the most important functions of a compiler is the detection and reporting of errors in the source program. The error message should allow the programmer to determine exactly where the errors have occurred. Errors may occur in any of the phases of a compiler.
Whenever a phase of the compiler discovers an error, it must report the error to the error handler, which issues an appropriate diagnostic message. Both the table-management and error-handling routines interact with all phases of the compiler.
1.3 LEXICAL ANALYSIS
Lexical analysis is the process of converting a sequence of characters into a sequence of
tokens. A program or function which performs lexical analysis is called a lexical analyzer
or scanner. A lexer often exists as a single function which is called by a parser or
another function.
THE ROLE OF THE LEXICAL ANALYZER
The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.
(Figure: source program → lexical analyzer → token → parser; the parser replies with getNextToken requests, and both components consult the symbol table.)
Upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token.
ISSUES OF LEXICAL ANALYZER
There are three reasons for separating lexical analysis from syntax analysis:
• To make the design simpler.
• To improve the efficiency of the compiler.
• To enhance compiler portability.
TOKENS
A token is a string of characters, categorized according to the rules as a symbol (e.g.,
IDENTIFIER, NUMBER, COMMA). The process of forming tokens from an input stream
of characters is called tokenization.
A token can look like anything that is useful for processing an input text stream or source-code file. For example, the statement sum = 3 + 2; is tokenized as:
sum   Identifier
=     Assignment operator
3     Number
+     Addition operator
2     Number
;     End of statement
LEXEME:
A collection or group of characters forming a token is called a lexeme.
PATTERN:
A pattern is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is just the sequence of characters
that form the keyword. For identifiers and some other tokens, the pattern is a more
complex structure that is matched by many strings.
Attributes for Tokens
Some tokens have attributes that can be passed back to the parser. The lexical analyzer collects information about tokens into their associated attributes. The attributes influence the translation of tokens:
a. Constant: value of the constant.
b. Identifier: pointer to the corresponding symbol-table entry.
ERROR RECOVERY STRATEGIES IN LEXICAL ANALYSIS:
The following are the error-recovery actions in lexical analysis:
1) Deleting an extraneous character.
2) Inserting a missing character.
3) Replacing an incorrect character by a correct character.
4) Transposing two adjacent characters.
5) Panic mode recovery: Deletion of successive characters from the
token until error is resolved.
INPUT BUFFERING
We often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme. As characters are read from left to right, each character is stored in the buffer to form a meaningful token, as shown below:
(Figure: buffer containing A = B + C, with the forward pointer scanning ahead of the beginning of the lexeme.)
A buffer divided into two halves (a buffer pair) is used. Each half is of the same size N, and N is usually the number of characters on one disk block, e.g., 1024 or 4096 bytes.
Using one system read command we can read N characters into a buffer.
If fewer than N characters remain in the input file, then a special character,
represented by eof, marks the end of the source file.
Two pointers to the input are maintained:
1. Pointer lexeme_beginning, marks the beginning of the current lexeme,
whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
Once the next lexeme is determined, forward is set to the character at its right end.
The string of characters between the two pointers is the current lexeme. After the
lexeme is recorded as an attribute value of a token returned to the parser,
lexeme_beginning is set to the character immediately after the lexeme just found.
Code to advance the forward pointer (without sentinels):
if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;
SENTINELS
For each character read, we make two tests: one for the end of the buffer, and one
to determine what character is read. We can combine the buffer-end test with the
test for the current character if we extend each buffer to hold a sentinel character
at the end.
The sentinel is a special character that cannot be part of the source program, and a
natural choice is the character eof.
The sentinel arrangement is as shown below:
Note that eof retains its use as a marker for the end of the entire input. Any eof
that appears other than at the end of a buffer means that the input is at an end.
Code to advance forward pointer:
forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
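A minimal C sketch of this sentinel scheme, assuming the '\0' character plays the role of eof and that fill_buffer() is a hypothetical helper (a real scanner would use a character that cannot occur in the source, and fill_buffer(buf) must be called once at startup):

#include <stdio.h>

#define N 4096                       /* size of each buffer half */
static char buf[2 * N + 2];          /* two halves, each followed by a sentinel slot */
static char *forward = buf;
static FILE *src;

/* Read up to N characters into the given half; the sentinel is placed
   immediately after the last character actually read. */
static void fill_buffer(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = '\0';                  /* sentinel standing in for eof */
}

/* Advance forward one character, reloading a half when a sentinel is hit. */
static int next_char(void) {
    for (;;) {
        char c = *forward++;
        if (c != '\0')
            return (unsigned char)c;
        if (forward == buf + N + 1)              /* sentinel at end of first half */
            fill_buffer(buf + N + 1);            /* reload second half */
        else if (forward == buf + 2 * N + 2) {   /* sentinel at end of second half */
            fill_buffer(buf);                    /* reload first half */
            forward = buf;                       /* wrap to its beginning */
        } else
            return EOF;                          /* eof inside a half: end of input */
    }
}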
1.4 RECOGNITION OF TOKENS
Consider the following grammar fragment:
stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | num
where the terminals if , then, else, relop, id and num generate sets of strings given
by the following regular definitions:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
For this language fragment the lexical analyzer will recognize the keywords if, then,
else,
as well as the lexemes denoted by relop, id, and num. To simplify matters, we
assume keywords are reserved; that is, they cannot be used as identifiers.
Transition diagrams
It is a diagrammatic representation to depict the action that will take place when a
lexical analyzer is called by the parser to get the next token. It is used to keep
track of information about the characters that are seen as the forward pointer
scans the input.
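As an illustration, the transition diagram for relop can be coded directly as a small state machine in C (the token codes and the function name are assumptions for this sketch; returning with *len = 1 after having examined s[1] corresponds to the "retract one character" actions of the diagram):

/* Illustrative token codes for the relational operators. */
enum { LT, LE, EQ, NE, GT, GE, NOT_RELOP = -1 };

/* Recognize a relop at the start of the NUL-terminated string s;
   *len receives the number of characters consumed. */
int relop(const char *s, int *len) {
    if (s[0] == '<') {
        if (s[1] == '=') { *len = 2; return LE; }
        if (s[1] == '>') { *len = 2; return NE; }
        *len = 1; return LT;        /* retract: only '<' belongs to the lexeme */
    }
    if (s[0] == '=') { *len = 1; return EQ; }
    if (s[0] == '>') {
        if (s[1] == '=') { *len = 2; return GE; }
        *len = 1; return GT;        /* retract */
    }
    return NOT_RELOP;
}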
1.9 A LANGUAGE FOR SPECIFYING LEXICAL ANALYZERS
There is a wide range of tools for constructing lexical analyzers:
· Lex
· YACC
LEX
Lex is a computer program that generates lexical analyzers. Lex is commonly used with the yacc parser generator.
Creating a lexical analyzer
First, a specification of a lexical analyzer is prepared by creating a program lex.l in the Lex language. Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c.
(Figure:
lex.l → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
input stream → a.out → sequence of tokens)
Lex Specification
A Lex program consists of three parts:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
Each rule has the form pi { actioni }, where pi is a regular expression and actioni describes what action the lexical analyzer should take when pattern pi matches a lexeme. Actions are written in C code.
User subroutines are auxiliary procedures needed by the actions. These can be compiled separately and loaded with the lexical analyzer.
YACC- YET ANOTHER COMPILER-COMPILER
Yacc provides a general tool for describing the input to a computer program.
The Yacc user specifies the structures of his input, together with code to be
invoked as each such structure is recognized. Yacc turns such a specification
into a subroutine that handles the input process; frequently, it is convenient
and appropriate to have most of the flow of control in the user's application
handled by this subroutine.
If the two sets are equal, we simply reuse the existing DFA state that we already constructed. This process is then repeated for each of the new DFA states (that is, sets of NFA states) until we run out of DFA states to process. Finally, every DFA state whose corresponding set of NFA states contains an accepting state is itself marked as an accepting state.
The Lexical-Analyzer Generator Lex
The lexical-analyzer tool is called Lex, or in a more recent implementation Flex; it allows one to specify a lexical analyzer by writing regular expressions to describe patterns for tokens.
The input notation for the Lex tool is referred to as the Lex language and the tool itself is the Lex compiler.
Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram.
Use of Lex
Figure 3.22 suggests how Lex is used. An input file, which we call lex.l, is written in the Lex language and describes the lexical analyzer to be generated. The Lex compiler transforms lex.l to a C program, in a file that is always named lex.yy.c. The latter file is compiled by the C compiler into a file called a.out. The C-compiler output is a working lexical analyzer that can take a stream of input characters and produce a stream of tokens.
The normal use of the compiled C program, referred to as a.out in Fig. 3.22, is as a subroutine of the parser. It is a C function that returns an integer, which is a code for one of the possible token names. The attribute value, whether it be another numeric code, a pointer to the symbol table, or nothing, is placed in a global variable yylval, which is shared between the lexical analyzer and parser, thereby making it simple to return both the name and an attribute value of a token.
Structure of Lex Programs
A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
· The declarations section includes declarations of variables, manifest constants
(identifiers declared to stand for a constant, e.g., the name of a token), and regular
definitions.
· The translation rules each have the form
Pattern { Action }
· Each pattern is a regular expression, which may use the regular definitions of
the declaration section. The actions are fragments of code, typically written in C,
although many variants of Lex using other languages have been created.
· The third section holds whatever additional functions are used in the actions.
· Alternatively, these functions can be compiled separately and loaded with the
lexical analyzer.
· When called by the parser, the lexical analyzer begins reading its remaining
input, one character at a time, until it finds the longest prefix of the input that
matches one of the patterns Pi. It then executes the associated action Ai. Typically,
Ai will return to the parser, but if it does not (e.g., because Pi describes whitespace
or comments), then the lexical analyzer proceeds to find additional lexemes, until
one of the corresponding actions causes a return to the parser.
The lexical analyzer returns a single value, the token name, to the parser, but uses the
shared, integer variable yylval to pass additional information about the lexeme found, if
needed.
Example: Figure 3.23 is a Lex program that recognizes the tokens of Fig. 3.12 and returns
the token found.
Declarations section:
In the declarations section we see a pair of special brackets, %{ and %}. Anything within these brackets is copied directly to the file lex.yy.c, and is not treated as a regular definition. It is common to place there the definitions of the manifest constants, using C #define statements to associate unique integer codes with each of the manifest constants.
Notice that in the definitions of id and number, parentheses are used as grouping metasymbols and do not stand for themselves. In contrast, E in the definition of number stands for itself. If we wish to use one of the Lex metasymbols, such as any of the parentheses, +, *, or ?, to stand for itself, we may precede it with a backslash. For instance, we see \. in the definition of number, to represent the dot, since the dot by itself is a metasymbol representing "any character," as usual in UNIX regular expressions.
Auxiliary-function section:
In the auxiliary-function section, we see two such functions, installID() and installNum(). Like the portion of the declarations section that appears between %{ ... %}, everything in the auxiliary section is copied directly to the file lex.yy.c, but may be used in the actions.
Translation rules:
Finally, let us examine some of the patterns and rules in the middle section of Fig. 3.23. First, ws, an identifier declared in the first section, has an associated empty action. If we find whitespace, we do not return to the parser, but look for another lexeme.
The second token has the simple regular expression pattern if. Should we see the two letters if on the input, and they are not followed by another letter or digit (which would cause the lexical analyzer to find a longer prefix of the input matching the pattern for id), then the lexical analyzer consumes these two letters from the input and returns the token name IF, that is, the integer for which the manifest constant IF stands. Keywords then and else are treated similarly.
%{
/* definitions of manifest constants, e.g.
   #define LT 260
   #define LE 261
   and similarly for EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}     {/* no action and no return */}
if       {return(IF);}
then     {return(THEN);}
else     {return(ELSE);}
{id}     {yylval = (int) installID(); return(ID);}
{number} {yylval = (int) installNum(); return(NUMBER);}
"<"      {yylval = LT; return(RELOP);}
"<="     {yylval = LE; return(RELOP);}
"="      {yylval = EQ; return(RELOP);}
"<>"     {yylval = NE; return(RELOP);}
">"      {yylval = GT; return(RELOP);}
">="     {yylval = GE; return(RELOP);}
%%
int installID() {/* function to install the lexeme, whose first
character is pointed to by yytext, and whose length is yyleng, into
the symbol table and return a pointer thereto */
}
int installNum() {/* similar to installID, but puts numerical
constants into a separate table */
}
Figure 3.23: Lex program for the tokens of Fig. 3.12
The fifth token has the pattern defined by id. Note that, although keywords like if match this pattern as well as an earlier pattern, Lex chooses whichever pattern is listed first in situations where the longest matching prefix matches two or more patterns. The action taken when id is matched is threefold:
1. Function installID() is called to place the lexeme found in the symbol table.
2. This function returns a pointer to the symbol table, which is placed in global variable yylval, where it can be used by the parser or a later component of the compiler. Note that installID() has available to it two variables that are set automatically by the lexical analyzer that Lex generates:
(a) yytext is a pointer to the beginning of the lexeme, analogous to lexemeBegin in Fig. 3.3.
(b) yyleng is the length of the lexeme found.
3. The token name ID is returned to the parser. The action taken when a lexeme matching the pattern number is found is similar, using the auxiliary function installNum().
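A minimal sketch of how installID() might be written, assuming a simple linear symbol table (the table layout, size limit, and the choice to return an index instead of a pointer are illustrative assumptions; yytext and yyleng are the variables Lex really does set):

#include <stdlib.h>
#include <string.h>

extern char *yytext;                 /* start of the current lexeme (set by Lex) */
extern int   yyleng;                 /* its length (set by Lex) */

#define MAXSYMS 1024
static char *symtab[MAXSYMS];        /* illustrative linear symbol table */
static int   nsyms;

int installID(void) {
    for (int i = 0; i < nsyms; i++)  /* reuse the entry if the lexeme is known */
        if (strlen(symtab[i]) == (size_t)yyleng &&
            strncmp(symtab[i], yytext, yyleng) == 0)
            return i;
    symtab[nsyms] = malloc(yyleng + 1);   /* otherwise copy the lexeme in */
    memcpy(symtab[nsyms], yytext, yyleng);
    symtab[nsyms][yyleng] = '\0';
    return nsyms++;
}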
Conflict Resolution in Lex
Two rules that Lex uses to decide on the proper lexeme to select, when several
prefixes of the input match one or more patterns:
1. Always prefer a longer prefix to a shorter prefix.
2. If the longest possible prefix matches two or more patterns, prefer the
pattern listed first in the Lex program.
Example:
• The first rule tells us to continue reading letters and digits to find the longest prefix of these characters to group as an identifier. It also tells us to treat <= as a single lexeme, rather than selecting < as one lexeme and = as the next lexeme.
• The second rule makes keywords reserved, if we list the keywords before id in the program.
For instance, if then is determined to be the longest prefix of the input that matches any pattern, and the pattern then precedes {id}, as it does in Fig. 3.23, then the token THEN is returned, rather than ID.
The Lookahead Operator
· Lex automatically reads one character ahead of the last character that
forms the selected lexeme, and then retracts the input so only the lexeme
itself is consumed from the input.
· Sometimes we want a certain pattern to be matched to the input only when it is followed by certain other characters. If so, we may use the slash in a pattern to indicate the end of the part of the pattern that matches the lexeme. What follows the / is additional pattern that must be matched before we can decide that the token in question was seen, but what matches this second pattern is not part of the lexeme.
Example: In Fortran and some other languages, keywords are not reserved. That situation creates problems, such as the statement
IF(I,J) = 3
where IF is the name of an array, not a keyword. This statement contrasts with statements of the form
IF( condition ) THEN ...
where IF is a keyword.
To recognize the keyword IF, we note that it is always followed by a left parenthesis, some text (the condition, which may contain parentheses), a right parenthesis, and a letter. Thus, we could write a Lex rule for the keyword IF like:
IF / \( .* \) {letter}
This rule says that the pattern the lexeme matches is just the two letters IF. The slash says that an additional pattern follows but does not match the lexeme.
In this pattern, the first character is the left parentheses. Since that character is a
Lex metasymbol, it must be preceded by a backslash to indicate that it has its literal
meaning. The dot and star match "any string without a newline." Note that the dot
is a Lex metasymbol meaning "any character except newline." It is followed by a
right parenthesis, again with a backslash to give that character its literal meaning.
The additional pattern is followed by the symbol letter, which is a regular definition
representing the character class of all letters.
For instance, suppose this pattern is asked to match a prefix of the input
IF(A<(B+C)*D)THEN...
The first two characters match IF, the next character matches \(, the next nine characters match .*, and the next two match \) and letter. Note that the fact that the first right parenthesis (after C) is not followed by a letter is irrelevant; we only need to find some way of matching the input to the pattern. We conclude that the letters IF constitute the lexeme, and they are an instance of token if.
SYNTAX ANALYSIS
CONTEXT-FREE GRAMMAR
Grammars were introduced to systematically describe the syntax of programming-language constructs like expressions and statements.
A context-free grammar (grammar for short) consists of terminals, nonterminals, a start symbol, and productions. For example:
E -> E A E | (E) | -E | id
A -> + | - | * | / | ^
WRITING A GRAMMAR
Grammars are capable of describing most, but not all, of the syntax of
programming languages.
For instance, the requirement that identifiers be declared before they are used,
cannot be described by a context-free grammar.
We consider several transformations that could be applied to get a grammar
more suitable for parsing.
One technique can eliminate ambiguity in the grammar, and other techniques —
left-recursion elimination and left factoring — are useful for rewriting grammars
so they become suitable for top-down parsing.
PARSING
Classification of Parsers
Parsers are broadly classified into:
• Top-Down Parsers
• Bottom-Up Parsers (Shift-Reduce Parsers), e.g., LALR(1)
RECURSIVE DESCENT PARSING
Typically, top-down parsers are implemented as a set of recursive functions that descend through a parse tree for a string. This approach is known as recursive-descent parsing, also known as LL(k) parsing, where the first L stands for left-to-right scanning, the second L stands for leftmost derivation, and k indicates k-symbol lookahead.
Therefore, a parser using the single-symbol lookahead method and top-down parsing without backtracking is called an LL(1) parser. In the following sections, we will also use an extended BNF notation in which some regular-expression operators are incorporated: a syntax of the form [X] defines zero or one occurrence of the form X, and a syntax of the form {X} defines zero or more occurrences of the form X.
A usual implementation of an LL(1) parser is: initialize its data structures, get the lookahead token by calling scanner routines, and call the routine that implements the start symbol. Here is an example:
proc syntaxAnalysis()
begin
    initialize();   // initialize global data and structures
    nextToken();    // get the lookahead token
    program();      // parser routine that implements the start symbol
end;
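To make this concrete, here is a hedged C sketch of a recursive-descent (LL(1)) parser for the left-factored expression grammar used later in this unit (E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id); for simplicity, the lookahead is a single character and 'i' stands for the token id:

#include <stdio.h>
#include <stdlib.h>

static const char *input;            /* remaining input */
static char lookahead;

static void nextToken(void) { lookahead = *input++; }

static void match(char t) {
    if (lookahead == t) nextToken();
    else { printf("syntax error\n"); exit(1); }
}

static void E(void); static void Ep(void);
static void T(void); static void Tp(void); static void F(void);

static void E(void)  { T(); Ep(); }
static void Ep(void) { if (lookahead == '+') { match('+'); T(); Ep(); } } /* else E' -> ε */
static void T(void)  { F(); Tp(); }
static void Tp(void) { if (lookahead == '*') { match('*'); F(); Tp(); } } /* else T' -> ε */
static void F(void) {
    if (lookahead == '(') { match('('); E(); match(')'); }
    else match('i');                  /* 'i' plays the role of id */
}

int main(void) {
    input = "i+i*i";                  /* i.e., id+id*id */
    nextToken();
    E();
    if (lookahead == '\0') printf("string accepted\n");
    else { printf("syntax error\n"); return 1; }
    return 0;
}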
Left Recursion:
A grammar is left-recursive if and only if there exists a nonterminal symbol that can derive a sentential form with itself as the leftmost symbol:
A → Aα | β
Elimination of Left Recursion
Given A → Aα | β, introduce a new nonterminal A' and rewrite the rule as
A → βA'
A' → αA' | ε
Eliminate left recursion for the following grammars:
1) E → E+T | T
Solution:
E → TE'
E' → +TE' | ε
2) T → T*F | F
Solution:
T → FT'
T' → *FT' | ε
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or ε can be added to any FIRST set:
1. If X is a terminal, then FIRST(X) = { X }.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a nonterminal and X → Y1 Y2 ... Yk is a production, then place a in FIRST(X) if, for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), ..., FIRST(Yi-1).
To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing can be added to any FOLLOW set:
1. Place $ in FOLLOW(S), where S is the start symbol.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
PROBLEM:
Consider the following example to understand the concept of FIRST and FOLLOW. Find the FIRST and FOLLOW sets of all nonterminals in the grammar
E -> TE'
E' -> +TE' | ε
T -> FT'
T' -> *FT' | ε
F -> (E) | id
Then:
FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }
FOLLOW(E) = FOLLOW(E') = { ), $ }
FOLLOW(T) = FOLLOW(T') = { +, ), $ }
FOLLOW(F) = { +, *, ), $ }
For example, id and left parenthesis are added to FIRST(F) by rule 3 in the definition of FIRST, with i = 1 in each case, since FIRST(id) = { id } and FIRST('(') = { ( } by rule 1. Then by rule 3 with i = 1, the production T -> FT' implies that id and left parenthesis belong to FIRST(T) also.
To compute FOLLOW, we put $ in FOLLOW(E) by rule 1 for FOLLOW. By rule 2 applied to production F -> (E), right parenthesis is also in FOLLOW(E). By rule 3 applied to production E -> TE', $ and right parenthesis are in FOLLOW(E').
Example 2:
S → aBDh
B → cC
C → bC | ε
D → EF
E → g | ε
F → f | ε
Solution:
First Functions:
First(S) = { a }
First(B) = { c }
First(C) = { b, ε }
First(D) = { First(E) – ε } ∪ First(F) = { g, f, ε }
First(E) = { g, ε }
First(F) = { f, ε }
Follow Functions:
Follow(S) = { $ }
Follow(B) = { First(D) – ε } ∪ Follow(D) = { g, f, h }
Follow(C) = Follow(B) = { g, f, h }
Follow(D) = First(h) = { h }
Follow(E) = { First(F) – ε } ∪ Follow(D) = { f, h }
Follow(F) = Follow(D) = { h }
CONSTRUCTION OF PREDICTIVE PARSING TABLES
For any grammar G, the following algorithm can be used to construct the predictive parsing table.
Input: Grammar G. Output: Parsing table M.
Method:
1. For each production A → α of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A, a].
3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].
4. Make each undefined entry of M be error.
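A hedged C sketch of the table-driven predictive parser that uses such a table M; the single-character symbol encoding ('e' for E', 't' for T', 'i' for id) and the hard-wired table are illustrative, matching the expression grammar worked out in the problem below:

#include <stdio.h>
#include <string.h>

/* M[X, a]: return the RHS to expand, "" for an ε-production, NULL for error. */
static const char *M(char X, char a) {
    switch (X) {
    case 'E': if (a == 'i' || a == '(') return "Te";  break;  /* E  -> T E'  */
    case 'e': if (a == '+') return "+Te";
              if (a == ')' || a == '$') return "";    break;  /* E' -> ε     */
    case 'T': if (a == 'i' || a == '(') return "Ft";  break;  /* T  -> F T'  */
    case 't': if (a == '*') return "*Ft";
              if (a == '+' || a == ')' || a == '$') return ""; break; /* T' -> ε */
    case 'F': if (a == 'i') return "i";
              if (a == '(') return "(E)";             break;  /* F -> id|(E) */
    }
    return NULL;                                              /* error entry */
}

int main(void) {
    const char *w = "i+i*i$";        /* id+id*id followed by the endmarker  */
    char stack[100] = "$E";          /* $ at the bottom, start symbol on top */
    int top = 1;
    while (top >= 0) {
        char X = stack[top];
        if (X == *w) { top--; w++; }             /* match terminal (or $)   */
        else if (strchr("EeTtF", X)) {           /* nonterminal: use M[X,a] */
            const char *rhs = M(X, *w);
            if (!rhs) { printf("error\n"); return 1; }
            top--;
            for (int k = (int)strlen(rhs) - 1; k >= 0; k--)
                stack[++top] = rhs[k];           /* push the RHS reversed   */
        } else { printf("error\n"); return 1; }
    }
    printf("string accepted\n");
    return 0;
}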
LL(1) GRAMMAR
The above algorithm can be applied to any grammar G to produce a parsing table M. For
some Grammars, for example if G is left recursive or ambiguous, then M will have at least
one multiply-defined entry. A grammar whose parsing table has no multiply defined
entries is said to be LL(1). It can be shown that the above algorithm can be used to
produce for every LL(1) grammar G a parsing table M that parses all and only the
sentences of G. LL(1) grammars have several distinctive properties. No ambiguous or left
recursive grammar can be LL(1).
There remains a question of what should be done in case of multiply defined entries. One easy solution is to eliminate all left recursion and left factoring, hoping to produce a grammar which will produce no multiply defined entries in the parse tables. Unfortunately, there are some grammars for which no amount of alteration will yield an LL(1) grammar. In general, there are no universal rules to convert multiply defined entries into single-valued entries without affecting the language recognized by the parser.
The main difficulty in using predictive parsing is in writing a grammar for the source language such that a predictive parser can be constructed from the grammar. Although left-recursion elimination and left factoring are easy to do, they make the resulting grammar hard to read and difficult to use for translation purposes. To alleviate some of this difficulty, a common organization for a parser in a compiler is to use a predictive parser for control constructs and to use operator precedence for expressions. However, if an LR parser generator is available, one can get all the benefits of predictive parsing and operator precedence automatically.
ERROR RECOVERY IN PREDICTIVE PARSING
The stack of a non-recursive predictive parser makes explicit the terminals and non-
terminals that the parser hopes to match with the remainder of the input. We shall
therefore refer to symbols on the parser stack in the following discussion. An error is
detected during predictive parsing when the terminal on top of the stack does not
match the next input symbol or when non-terminal A is on top of the stack, a is the
next input symbol, and the parsing table entry M[A,a] is empty.
Panic-mode error recovery is based on the idea of skipping symbols on the input
until a token in a selected set of synchronizing tokens appears. Its effectiveness
depends on the choice of synchronizing set. The sets should be chosen so that the
parser recovers quickly from errors that are likely to occur in practice. Some
heuristics are as follows
As a starting point, we can place all symbols in FOLLOW(A) into the synchronizing
set for non terminal A. If we skip tokens until an element of FOLLOW(A) is seen
and pop A from the stack, it is likely that parsing can continue.
It is not enough to use FOLLOW(A) as the synchronizing set for A. For example, if semicolons terminate statements, as in C, then keywords that begin statements may not appear in the FOLLOW set of the nonterminal generating expressions. A missing semicolon after an assignment may therefore result in the keyword beginning the next statement being skipped. Often, there is a hierarchical structure on constructs in a language; e.g., expressions appear within statements, which appear within blocks, and so on. We can add to the synchronizing set of a lower construct the symbols that begin higher constructs. For example, we might add keywords that begin statements to the synchronizing sets for the nonterminals generating expressions.
56
If we add symbols in FIRST(A) to the synchronizing set for non terminal A, then it
may be possible to resume parsing according to A if a symbol in FIRST(A) appears in
the input.
If a nonterminal can generate the empty string, then the production deriving ε can be used as a default. Doing so may postpone some error detection, but cannot cause an error to be missed. This approach reduces the number of nonterminals that have to be considered during error recovery.
If a terminal on top of the stack cannot be matched, a simple idea is to pop the
terminal, issue a message saying that the terminal was inserted, and continue
parsing. In effect, this approach takes the synchronizing set of a token to consist of
all other tokens.
LR PARSING
INTRODUCTION
The "L" is for left-to-right scanning of the input and the "R" is for constructing a
rightmost derivation in reverse.
WHY LR PARSING:
LR parsers can be constructed to recognize virtually all programming-language
constructs for which context-free grammars can be written.
The LR parsing method is the most general non-backtracking shift-reduce parsing
method known, yet it can be implemented as efficiently as other shift-reduce
methods.
The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive parsers.
An LR parser can detect a syntactic error as soon as it is possible to do so on a left-
to- right scan of the input. The disadvantage is that it takes too much work to
construct an LR parser by hand for a typical programming-language grammar.
But there are lots of LR parser generators available to make this task easy.
MODELS OF LR PARSERS
This configuration represents the right-sentential form X1 X2 ... Xm ai ai+1 ... an in essentially the same way a shift-reduce parser would; only the presence of the states on the stack is new. Recall the sample parse we did (see Example 1: Sample bottom-up parse) in which we assembled the right-sentential form by concatenating the remainder of the input buffer to the top of the stack. The next move of the parser is determined by reading ai and sm, and consulting the parsing action table entry action[sm, ai]. Note that we are just looking at the state here and no symbol below it. We'll see how this actually works later.
The configurations resulting after each of the four types of move are as follows:
If action[sm, ai] = shift s, the parser executes a shift move, entering the configuration
(s0 X1 s1 X2 s2 ... Xm sm ai s, ai+1 ... an $)
Here the parser has shifted both the current input symbol ai and the next state s, which is given by action[sm, ai], onto the stack.
If action[sm, ai] = reduce A → β, then the parser executes a reduce move, entering the configuration
(s0 X1 s1 X2 s2 ... Xm-r sm-r A s, ai ai+1 ... an $)
where s = goto[sm-r, A] and r is the length of β, the right side of the production. The parser first popped 2r symbols off the stack (r state symbols and r grammar symbols), exposing state sm-r. The parser then pushed both A, the left side of the production, and s, the entry for goto[sm-r, A], onto the stack. The current input symbol is not changed in a reduce move.
The output of an LR parser is generated after a reduce move by executing the
semantic action associated with the reducing production. For example, we might just
print out the production reduced.
This shift-reduce process continues until the parser terminates, reporting either
success or failure. It terminates with success when the input is legal and is
accepted by the parser. It terminates with failure if an error is detected in the
input. The parser is nothing but a stack automaton which may be in one of several
discrete states. A state is usually represented simply as an integer.
In reality, the parse stack contains states, rather than grammar symbols. However,
since each state corresponds to a unique grammar symbol, the state stack can be
mapped onto the grammar symbol stack mentioned earlier.
The program driving the LR parser behaves as follows: It determines sm, the state currently on top of the stack, and ai, the current input symbol. It then consults action[sm, ai], which can have one of four values:
§ shift s, where s is a state
§ reduce by a grammar production A → β
§ accept
§ error
The function goto takes a state and grammar symbol as arguments and produces a state.
For a parsing table constructed for a grammar G, the goto table is the transition function of
a deterministic finite automaton that recognizes the viable prefixes of G. Recall that the
viable prefixes of G are those prefixes of right-sentential forms that can appear on the
stack of a shift reduce parser because they do not extend past the rightmost handle.
A configuration of an LR parser is a pair whose first component is the stack contents and
whose second component is the unexpended input:
(s0 X1 s1 X2 s2... Xm sm, ai ai+1... an$)
ACTION TABLE
The action table is a table with rows indexed by states and columns indexed by terminal symbols. When the parser is in some state s and the current lookahead terminal is t, the action taken by the parser depends on the contents of action[s][t], which can contain four different kinds of entries:
· Shift s': shift state s' onto the parse stack.
· Reduce r: reduce by grammar rule r.
· Accept: terminate the parse with success.
· Error: signal a syntax error.
GOTO TABLE
The goto table is a table with rows indexed by states and columns indexed by non
terminal symbols. When the parser is in state s immediately after reducing by rule
N, then the next state to enter is given by goto[s][N].
The current state of a shift-reduce parser is the state on top of the state stack. The
detailed operation of such a parser is as follows:
1. Initialize the parse stack to contain a single state s0, where s0 is the
distinguished initial state of the parser.
2. Use the state s on top of the parse stack and the current lookahead t
to consult the action table entry action[s][t]:
· If the action table entry is shift s' then push state s' onto the stack and advance
the input so that the lookahead is set to the next token.
· If the action table entry is reduce r and rule r has m symbols in its RHS, then pop
m symbols off the parse stack. Let s' be the state now revealed on top of the parse
stack and N be the LHS nonterminal for rule r. Then consult the goto table and
push the state given by goto[s'][N] onto the stack. The lookahead token is not
changed by this step.
· If the action table entry is accept, then terminate the parse with success.
· If the action table entry is error, then signal an error.
3. Repeat step (2) until the parser terminates.
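A hedged C sketch of this driver loop; the table encodings and helper names (action_tab, goto_tab, rhs_len, lhs) are assumptions standing in for whatever a generator would emit, and the stack holds states only, as described above:

enum kind { SHIFT, REDUCE, ACCEPT, ERROR };
struct action { enum kind kind; int arg; };  /* arg: state to shift / rule to reduce */

/* Assumed to be generated elsewhere from the grammar: */
extern struct action action_tab(int state, int terminal);
extern int goto_tab(int state, int nonterminal);
extern int rhs_len(int rule);                /* number of symbols in the rule's RHS */
extern int lhs(int rule);                    /* the rule's LHS nonterminal          */

int parse(const int *input) {                /* token codes, ending with the endmarker */
    int stack[1000], top = 0;
    stack[0] = 0;                            /* initial state s0                    */
    for (;;) {
        struct action act = action_tab(stack[top], *input);
        switch (act.kind) {
        case SHIFT:                          /* push the new state, advance input   */
            stack[++top] = act.arg; input++; break;
        case REDUCE:                         /* pop |RHS| states, then consult goto */
            top -= rhs_len(act.arg);
            stack[top + 1] = goto_tab(stack[top], lhs(act.arg));
            top++; break;
        case ACCEPT:
            return 1;                        /* input parsed successfully           */
        default:
            return 0;                        /* error entry                         */
        }
    }
}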
For example, consider the following simple grammar:
0) $S: stmt <EOF>
1) stmt: ID ':=' expr
2) expr: expr '+' ID
3) expr: expr '-' ID
4) expr: ID
Parser Tables
(Figure: the action and goto tables for this grammar.)
SLR PARSER
An LR(0) item (or just item) of a grammar G is a production of G with a dot at
some position of the right side indicating how much of a production we have seen
up to a given point.
For example, for the production E -> E + T we would have the following items:
[E -> .E + T]
[E -> E. + T]
[E -> E +. T]
[E -> E + T.]
CONSTRUCTING THE SLR PARSING TABLE
To construct the parser table we must convert our NFA into a DFA. The states in the LR table will be the ε-closures of the states corresponding to the items I0, I1, .... The process of creating the LR state table parallels the process of constructing an equivalent DFA from a machine with ε-transitions. Been there, done that - this is essentially the subset construction algorithm, so we are in familiar territory here.
We need two operations: closure() and goto().
closure()
From our grammar above, if I is the set containing only the item [E' -> .E], then closure(I) is:
I0:
E' -> .E
E -> .E + T
E -> .T
T -> .T * F
T -> .F
F -> .(E)
F -> .id
goto()
goto(I, X) is defined to be the closure of the set of all items [A -> αX.β] such that [A -> α.Xβ] is in I, where I is a set of items and X is a grammar symbol. Intuitively, goto(I, X) is the state reached from the state for I on a transition labeled X.
SETS-OF-ITEMS CONSTRUCTION
C := { closure({ [S' -> .S] }) };
repeat
    for each set of items I in C and each grammar symbol X
        such that goto(I, X) is not empty and not in C do
            add goto(I, X) to C
until no more sets of items can be added to C
ALGORITHM FOR CONSTRUCTING AN SLR PARSING TABLE
Input: An augmented grammar G'
Output: SLR parsing table functions action and goto for G'
Method:
1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(0) items for G'.
2. The parsing actions for state i are determined as follows:
(a) If [A -> α.aβ] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j"; here a must be a terminal.
(b) If [A -> α.] is in Ii, then set action[i, a] to "reduce A -> α" for all a in FOLLOW(A); here A may not be S'.
(c) If [S' -> S.] is in Ii, then set action[i, $] to "accept".
If any conflicting actions are generated by these rules, the grammar is not SLR(1) and the algorithm fails to produce a parser.
3. The goto transitions for state i are constructed for all nonterminals A using the rule: if goto(Ii, A) = Ij, then goto[i, A] = j.
4. The initial state of the parser is the one constructed from the set of items containing [S' -> .S].
An Example
(1) E -> E * B
(2) E -> E + B
(3) E -> B
(4) B -> 0
(5) B -> 1
The Action and Goto Tables
(Figure: the two LR(0) parsing tables for this grammar.)
LALR PARSER:
We begin with two observations. First, some of the states generated for LR(1)
parsing have the same set of core (or first) components and differ only in their
second component, the lookahead symbol. Our intuition is that we should be able
to merge these states and reduce the number of states we have, getting close to
the number of states that would be generated for LR(0) parsing. This observation
suggests a hybrid approach: We can construct the canonical LR(1) sets of items
and then look for sets of items having the same core. We merge these sets with
common cores into one set of items. The merging of states with common cores
can never produce a shift/reduce conflict that was not present in one of the original
states because shift actions depend only on the core, not the lookahead. But it is
possible for the merger to produce a reduce/reduce conflict.
Our second observation is that we are really only interested in the lookahead symbol
in places where there is a problem. So our next thought is to take the LR(0) set of
items and add lookaheads only where they are needed. This leads to a more
efficient, but much more complicated method.
ALGORITHM FOR EASY CONSTRUCTION OF AN LALR TABLE
Input: G'
Output: LALR parsing table functions with action and goto for G'. Method:
1. Construct C = {I0, I1 , ..., In} the collection of sets of LR(1) items for G'.
2. For each core present among the set of LR(1) items, find all sets having that
core and replace these sets by the union.
3. Let C' = {J0, J1 , ..., Jm} be the resulting sets of LR(1) items. The parsing
actions for state i are constructed from Ji in the same manner as in the
construction of the canonical LR parsing table.
4. If there is a conflict, the grammar is not LALR(1) and the algorithm fails.
5. The goto table is constructed as follows: If J is the union of one or more sets of
LR(1) items, that is, J = I0U I1 U ... U Ik, then the cores of goto(I0, X), goto(I1, X),
..., goto(Ik, X) are the same, since I0, I1 , ..., Ik all have the same core. Let K be
the union of all sets of items having the same core as goto(I1, X).
6. Then goto(J, X) = K.
Consider the above example:
I36: C -> c.C, c/d/$
State    ACTION                    GOTO
         c      d      $          S     C
0        S36    S47               1     2
1                      accept
2        S36    S47                     5
36       S36    S47                     89
47       R3     R3     R3
5                      R1
89       R2     R2     R2
HANDLING ERRORS
The LALR parser may continue to do reductions after the LR parser would have
spotted an error, but the LALR parser will never do a shift after the point the LR
parser would have discovered the error and will eventually find the error.
LR ERROR RECOVERY
An LR parser will detect an error when it consults the parsing action table and find
a blank or error entry. Errors are never detected by consulting the goto table. An
LR parser will detect an error as soon as there is no valid continuation for the
portion of the input thus far scanned. A canonical LR parser will not make even a
single reduction before announcing the error. SLR and LALR parsers may make
several reductions before detecting an error, but they will never shift an erroneous
input symbol onto the stack.
PANIC-MODE ERROR RECOVERY
We can implement panic-mode error recovery by scanning down the stack until a state s with a goto on a particular nonterminal A is found. Zero or more input symbols are then discarded until a symbol a is found that can legitimately follow A.
The situation might exist where there is more than one choice for the nonterminal A. Normally these would be nonterminals representing major program pieces, e.g. an expression, a statement, or a block. For example, if A is the nonterminal stmt, a might be semicolon or }, which marks the end of a statement sequence. This method of error recovery attempts to eliminate the phrase containing the syntactic error. The parser determines that a string derivable from A contains an error. Part of that string has already been processed, and the result of this processing is a sequence of states on top of the stack.
The remainder of the string is still in the input, and the parser attempts to skip over the remainder of this string by looking for a symbol on the input that can legitimately follow A. By removing states from the stack, skipping over the input, and pushing GOTO(s, A) on the stack, the parser pretends that it has found an instance of A and resumes normal parsing.
PHRASE-LEVEL RECOVERY
Phrase-level recovery is implemented by examining each error entry in the LR parsing table and deciding, on the basis of language usage, the most likely programmer error that would give rise to it; an appropriate recovery action is then constructed. The actions may include insertion or deletion of symbols from the stack or the input or both, or alteration and transposition of input symbols. We must make our choices so that the LR parser will not get into an infinite loop. A safe strategy will assure that at least one input symbol will be removed or shifted eventually, or that the stack will eventually shrink if the end of the input has been reached. Popping a stack state that covers a nonterminal should be avoided, because this modification eliminates from the stack a construct that has already been successfully parsed.
Problems
1) Construct the predictive parsing table for the grammar E -> E+T | T, T -> T*F | F, F -> (E) | id and parse the string id+id*id.
Solution:
After eliminating left recursion, the grammar becomes:
E -> TE'
E' -> +TE' | ε
T -> FT'
T' -> *FT' | ε
F -> (E) | id
FIRST(E) = { (, id }
FIRST(T) = { (, id }
FIRST(F) = { (, id }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }
FOLLOW(E) = { ), $ }
FOLLOW(E') = { ), $ }
FOLLOW(T) = { +, ), $ }
FOLLOW(T') = { +, ), $ }
FOLLOW(F) = { +, *, ), $ }
Predictive parsing table:

Nonterminal   id          +            *            (           )          $
E             E -> TE'                              E -> TE'
E'                        E' -> +TE'                            E' -> ε    E' -> ε
T             T -> FT'                              T -> FT'
T'                        T' -> ε      T' -> *FT'               T' -> ε    T' -> ε
F             F -> id                               F -> (E)
Stack        Input string    Production used
$E           id+id*id$       E -> TE'
$E'T         id+id*id$       T -> FT'
$E'T'F       id+id*id$       F -> id
$E'T'id      id+id*id$       match id
$E'T'        +id*id$         T' -> ε
$E'          +id*id$         E' -> +TE'
$E'T+        +id*id$         match +
$E'T         id*id$          T -> FT'
$E'T'F       id*id$          F -> id
$E'T'id      id*id$          match id
$E'T'        *id$            T' -> *FT'
$E'T'F*      *id$            match *
$E'T'F       id$             F -> id
$E'T'id      id$             match id
$E'T'        $               T' -> ε
$E'          $               E' -> ε
$            $               String accepted
2) SLR Parser or LR(0)
Steps:
1. Create the augmented grammar
2. Generate kernel items
3. Find closure
4. Compute goto()
5. Construct the parsing table
6. Parse the string
Let us consider the grammar:
S -> a
S -> (L)
L -> S
L -> L,S
Step 1: Create the augmented grammar by adding the production S' -> S.
Step 2: Generate the kernel item:
S' -> .S
Step 3: Find closure
(Rule: for an item A -> α.Xβ, if there is a nonterminal X next to the dot, then include the X-productions.)
I0:
S' -> .S
S -> .a
S -> .(L)
Step 4: Compute goto()
(Figure: the goto computation and the resulting item sets I1-I8.)
Step 5: Construct the parsing table

State    ACTION                              GOTO
         a     (     )     ,     $           S     L
0        S2    S3                            1
1                                accept
2                    R1    R1    R1
3        S2    S3                            5     4
4                    S6    S7
5                    R3    R3
6                    R2    R2    R2
7        S2    S3                            8
8                    R4    R4
Goto(I0, S) = 1        Goto(I3, () = 3

R0: S' -> S.     (I1)    Follow(S') = { $ }
R1: S -> a.      (I2)    Follow(S) = { $, ), , }
R2: S -> (L).    (I6)    Follow(S) = { $, ), , }
R3: L -> S.      (I5)    Follow(L) = { ), , }
R4: L -> L,S.    (I8)    Follow(L) = { ), , }
Step 6: Parse the string (a,a)

Stack         Input     Action
$0            (a,a)$    S3
$0(3          a,a)$     S2
$0(3a2        ,a)$      R1
$0(3S5        ,a)$      R3
$0(3L4        ,a)$      S7
$0(3L4,7      a)$       S2
$0(3L4,7a2    )$        R1
$0(3L4,7S8    )$        R4
$0(3L4        )$        S6
$0(3L4)6      $         R2
$0S1          $         accept
CLR Parser or LR(1):
Steps:
1. Create the augmented grammar
2. Generate kernel items and add the 2nd component
3. Find closure
4. Compute goto()
5. Construct the parsing table
6. Parse the string
Consider the grammar:
S -> L=R
S -> R
L -> *R
L -> id
R -> L
Step 2: Generate kernel items and add the 2nd component
Introduce the dot in the RHS of the production and add $ as the 2nd component, separated by a comma:
S' -> .S, $
Step 3: Find closure
(Rule: for an item [A -> α.Bβ, a], if there is a nonterminal B next to the dot, include the B-productions, each with 2nd component FIRST(βa).)
I0:
S' -> .S, $
S -> .L=R, $
S -> .R, $
L -> .*R, =/$
L -> .id, =/$
R -> .L, $
To find the 2nd components, compare each item with [A -> α.Bβ, a]:
• S' -> .S, $ : there is no β, so $ is the 2nd component for the S-productions.
• S -> .L=R, $ : here β is =R, so = becomes a 2nd component for the L-productions.
• S -> .R, $ : there is no β, so $ is the 2nd component for the R-productions.
• R -> .L, $ : there is no β, so $ is also a 2nd component for the L-productions; hence the L-items carry =/$.
Step 4: Compute goto()
(Figure: the goto computation and the resulting CLR item sets.)
We notice that some states in the CLR parser have the same core items and differ only in the possible lookahead symbols, such as
I4 and I4'
I5 and I5'
I7 and I7'
I8 and I8'
So we shrink the obtained CLR parser by merging such states to form the LALR parser; the merged states yield the LALR parsing table.
PARSER GENERATOR - YACC
Each translation rule input to YACC has a string specification that resembles a production of a grammar: it has a nonterminal on the LHS and a few alternatives on the RHS. For simplicity, we will refer to a string specification as a production. YACC generates an LALR(1) parser for language L from the productions, which is a bottom-up parser. The parser operates as follows: for a shift action, it invokes the scanner to obtain the next token and continues the parse by using that token. While performing a reduce action in accordance with a production, it performs the semantic action associated with that production.
The semantic actions associated with productions achieve the building of an intermediate representation or target code as follows:
• Every nonterminal symbol in the parser has an attribute.
• The semantic action associated with a production can access the attributes of symbols used in that production: a symbol '$n' in the semantic action, where n is an integer, designates the attribute of the nth symbol in the RHS of the production, and the symbol '$$' designates the attribute of the LHS nonterminal symbol of the production.
• The semantic action uses the values of these attributes for building the intermediate representation or target code.
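For instance, in a desk-calculator style specification (a hypothetical fragment, separate from the bracket-matching example below), the attribute of the LHS could be computed from the attributes of the RHS symbols like this:

expr : expr '+' term   { $$ = $1 + $3; }   /* $$ : attribute of the LHS expr   */
     | term            { $$ = $1; }        /* $1 : attribute of the 1st symbol */
     ;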
A parser generator is a program that takes as input a specification of a syntax and
produces as output a procedure for recognizing that language. Historically, they are also
called compiler compilers. YACC (yet another compiler-compiler) is an LALR(1)
(LookAhead, Left-to-right, Rightmost derivation producer with 1 lookahead token)
parser generator. YACC was originally designed for being complemented by Lex.
Input File: YACC input file is divided into three parts.
/* definitions */
....
%%
/* rules */
....
%%
/* auxiliary routines */
....
Example
Yacc File (.y)
%{
#include <ctype.h>
#include <stdio.h>
#define YYSTYPE double /* double type for yacc stack */
%}
%%
Lines : Lines S '\n'  { printf("OK \n"); }
      | S '\n'
      | error '\n'    { yyerror("Error: reenter last line:"); yyerrok; }
      ;
S     : '(' S ')'
      | '[' S ']'
      | /* empty */
      ;
%%
#include "lex.yy.c"
void yyerror(char *s)
/* yacc error handler */
{
    fprintf(stderr, "%s\n", s);
}
int main(void) {
    return yyparse();
}
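Since the program includes lex.yy.c, it assumes a companion Lex specification; a minimal sketch that works here simply skips blanks and returns every other character (including '\n', '(', ')', '[' and ']') as its own token code:

%%
[ \t]     ;                          /* skip blanks and tabs */
\n|.      { return yytext[0]; }      /* every other character is its own token */
%%
int yywrap(void) { return 1; }

Running Lex on this file produces lex.yy.c, which the Yacc-generated parser then includes and compiles in directly.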
9. ASSIGNMENTS
1. Construct the predictive parsing table for the following grammar. (CO4, K3)
E -> E+T / T
T -> T & F / F
F -> !F / (E) / 1 / 0
4. Perform CLR for the given grammar. (CO4, K3)
i) S -> C C
   C -> Cc
   C -> d
ii) S -> Aa / bAc / Bc / bBa
    A -> d
    B -> d
Check whether the grammar is LR(1) but not LALR(1).
5. Perform LALR for the given grammar. (CO4, K3)
i) S -> C C
   C -> Cc
   C -> d
ii) S -> Aa / bAc / dc / bda
    A -> d
Check whether the grammar is LALR(1) but not SLR(1).
10. PART A : Q & A : UNIT – IV

1. What are the two parts of a compilation? Explain briefly. (CO1, K1)
Analysis and Synthesis are the two parts of compilation (front end and back end).
• The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program.
• The synthesis part constructs the desired target program from the intermediate representation.

2. List the various compiler construction tools. (CO1, K2)
• Parser generators
• Scanner generators
• Syntax-directed translation engines
• Automatic code generators
• Dataflow engines
• Compiler construction toolkits

3. Differentiate compiler and interpreter. (CO1, K1)
• The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping inputs to outputs.
• An interpreter, however, can usually give better error diagnostics than a compiler, because it executes the source program statement by statement.

4. Define tokens, patterns, and lexemes. (CO1, K2)
• Token: a sequence of characters that has a collective meaning. A token of a language is a category of its lexemes.
• Pattern: there is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.
• Lexeme: a sequence of characters in the source program that is matched by the pattern for a token.

5. Describe the possible error recovery actions in a lexical analyzer. (CO1, K1)
• Panic mode recovery
• Deleting an extraneous character
• Inserting a missing character
• Replacing an incorrect character by a correct character
• Transposing two adjacent characters
21. Define ambiguity. (K1)
A grammar G is said to be ambiguous if it produces more than one parse tree for the same sentence (yield w) derived from the start symbol.
28. Define handle. (K1)
A handle of a string is a substring that matches the right side of a production and whose reduction to the nonterminal on the left side of the production represents one step along the reverse of a rightmost derivation.
30. Define LR parser. (CO2, K2)
An LR parser can be used to parse a large class of context-free grammars. The technique is called LR(k) parsing:
• L - the input sequence is processed from left to right.
• R - a rightmost derivation is performed (in reverse).
• k - at most k symbols of the sequence are used to make a decision.
11. PART B QUESTIONS : UNIT – IV (CO4, K2)
1. Describe the various phases of a compiler and trace them with the program segment (t = b * -c + b * -c).
2. Explain in detail the process of compilation. Illustrate the output of each phase of the compilation for the input "a = (b+c) * (b+c) * 2".
3. Discuss various buffering techniques in detail.
4. Construct an NFA from the regular expression (a|b)*a.
5. Construct an NFA using the regular expression (a|b)*abb.
6. Construct the NFA from (a|b)*a(a|b) using Thompson's construction algorithm.
7. Construct the DFA for the augmented regular expression (a|b)* # directly using a syntax tree.
8. Write an algorithm for minimizing the number of states of a DFA.

PART C QUESTIONS (CO4, K3)
9. Consider the following CFG over the non-terminals {X, Y, Z} and terminals
{a, c, d}, with the productions below and start symbol Z. (CO4,K3)
X -> a
X -> Y
Z -> d
Z -> X Y Z
Y -> c
Y -> ε
Compute the FIRST and FOLLOW sets of every non-terminal and the set of
nullable non-terminals. Construct the predictive parsing table.
10. Construct the predictive parsing table for the grammar (CO4,K3)
E -> E+T | T, T -> T*F | F, F -> (E) | id
and use it to parse the string id+id*id.
11. Construct the predictive parsing table for the grammar (CO4,K3)
S -> (L) | a, L -> L,S | S
and show whether the string (a, (a, (a,a))) will be accepted or not.
12. Construct the SLR parsing table for the following grammar. Show the actions
of the parser for the input strings "abab" and "baab". (CO4,K3)
S -> AS | b, A -> SA | a
13. Write the LR parsing algorithm. Check whether the following grammar is
SLR(1) or not; justify the answer with reasons. (CO4,K3)
S -> L=R | R; L -> *R | id; R -> L
14. Construct a stack implementation of shift-reduce parsing for the grammar
E -> E+E | E*E | (E) | id and the input string id1 + id2 * id3. (CO4,K3)
97
15. Consider the CFG depicted below, where "begin", "end" and "x" are all terminal
symbols of the grammar and Stat is the start symbol. Productions are numbered
in parentheses, and you may abbreviate "begin" to "b" and "end" to "e". (CO4,K3)
Stat -> Block
Block -> begin Block end
Block -> Body
Body -> x
(i) Compute the set of LR(1) items for this grammar and draw the
corresponding DFA. Do not forget to augment the grammar with the initial
production S' -> Stat $ as production (0).
(ii) Construct the corresponding LR parsing table.
16. Show that the following grammar is LR(1) but not LALR(1):
S -> Aa | bAc | Bc | bBa
A -> d
B -> d
and parse the statements "bdc" and "dd".
17. Show that the following grammar is LALR(1) but not SLR(1):
S -> Aa | bAc | dc | bda
A -> d
18. Design a syntax analyzer using the YACC tool.
98
99
12. SUPPORTIVE ONLINE CERTIFICATION COURSES
NPTEL    : https://swayam.gov.in/nd1_noc19_cs79/preview
Swayam   : https://nptel.ac.in/courses/106/106/106106049/
Coursera : https://www.coursera.org/learn/cs-algorithms-theory-machines
Udemy    : https://www.udemy.com/course/theory-of-computation-toc/
MOOC List: https://www.mooc-list.com/tags/theory-computation
NPTEL    : https://nptel.ac.in/courses/106/105/106105190/
Swayam   : https://www.classcentral.com/course/swayam-compiler-design-12926
Coursera : https://www.coursera.org/learn/nand2tetris2
Udemy    : https://www.udemy.com/course/introduction-to-compiler-construction-and-design/
MOOC List: https://www.mooc-list.com/course/compilers-coursera
edX      : https://www.edx.org/course/compilers
Real-time Applications in Day-to-day Life and Industry
101
13. Real-time Applications in Day-to-day Life and Industry
102
Natural Language Processing - Syntactic Analysis
Syntactic analysis, or parsing, is the third phase of NLP. It analyzes strings of
symbols in natural language and checks whether they conform to the rules of a
formal grammar, recovering the grammatical structure of the text. Conformance to
the grammar does not guarantee meaningfulness: a phrase like "hot ice-cream" is
syntactically well formed but would be rejected by the semantic analyzer in the
next phase. The word 'parsing' originates from the Latin word 'pars', which
means 'part'.
104
14. CONTENTS BEYOND SYLLABUS : UNIT – IV
Text Parsing:
Text parsing is the technique of deriving a text string using the production
rules of a grammar, in order to check whether the string is acceptable to (i.e.,
generated by) that grammar.
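As a minimal sketch of this idea (illustrative code, not part of the prescribed
material), the C program below checks whether a string of parentheses is
acceptable to the grammar S -> ( S ) S | ε using recursive-descent parsing:

#include <stdio.h>

static const char *p;  /* cursor into the input string */

/* S -> ( S ) S | empty  -- LL(1): choose the first production on '(' */
static int S(void) {
    if (*p == '(') {
        p++;                      /* consume '(' */
        if (!S()) return 0;       /* inner S */
        if (*p != ')') return 0;  /* expect the matching ')' */
        p++;                      /* consume ')' */
        return S();               /* trailing S */
    }
    return 1;                     /* empty production always succeeds */
}

int main(void) {
    const char *input = "(()())";
    p = input;
    /* the string is acceptable only if S matches and consumes all input */
    printf("%s is %s\n", input, (S() && *p == '\0') ? "accepted" : "rejected");
    return 0;
}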
105
In computer-based language recognition, ANTLR (pronounced antler), or ANother Tool
for Language Recognition, is a parser generator that uses LL(*) for parsing. ANTLR is
the successor to the Purdue Compiler Construction Tool Set (PCCTS), first
developed in 1989, and is under active development. Its maintainer is Professor Terence
Parr of the University of San Francisco.
Example
In the following example, a parser described in ANTLR recognizes sums of
expressions of the form "1 + 2 + 3":
// Common options, for example, the target language
options
{
    language = "CSharp";
}

// Followed by the parser
class SumParser extends Parser;
options
{
    k = 1; // parser lookahead: 1 token
}

// Definition of a statement: a sum of integers, e.g. 1 + 2 + 3
statement: INTEGER (PLUS^ INTEGER)*;

// Here is the lexer
class SumLexer extends Lexer;
options
{
    k = 1; // lexer lookahead: 1 character
}

PLUS: '+';
DIGIT: ('0'..'9');
INTEGER: (DIGIT)+;
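Assuming the ANTLR 2 toolchain (whose syntax this grammar uses) and an
illustrative file name Sum.g, the parser and lexer are typically generated with:

java antlr.Tool Sum.g

This emits SumParser and SumLexer source files in the target language selected
by the language option (C# in this example).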
Revision Test:
108
TEXT BOOKS & REFERENCE BOOKS
109
16. Prescribed Text Books & Reference Books
TEXT BOOKS:
2. Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, "Compilers: Principles,
Techniques, and Tools", 2nd Edition, Pearson Education, 2007.
REFERENCES:
1. K.L.P. Mishra and N. Chandrasekaran, "Theory of Computer Science: Automata,
Languages and Computation", 3rd Edition, PHI, 2007.
2. Elaine Rich, "Automata, Computability and Complexity", 1st Edition, Pearson
Education, 2018.
3. Peter Linz, "An Introduction to Formal Languages and Automata", Jones and
Bartlett Publishers, 6th Edition, 2016.
110
111
17. MINI PROJECT SUGGESTION
• Objective:
Design of a lexical analyzer generator; design of an automaton for pattern matching.
• Planning:
• This method is mostly used to improve the ability of students in the application
domain and also to reinforce knowledge imparted during the lectures.
• Students are asked to prepare mini projects involving application of the concepts,
principles or laws learnt.
• The faculty guides the students at various stages of developing the project and
gives timely inputs for the development of the model.
• Students convert their ideas into real-time applications.
Projects:
Set - 1 (Toppers): Create a vending machine as an automated machine, using a
finite state automaton to control its functions. (CO4, K3)
Set - 2 (Above Average): Use Flex to create a lexical analyzer for C. (CO4, K3)
Set - 3 (Average): Regular expression matching – check whether two or more
regular expressions are similar to each other or not. (CO4, K3)
Set - 4 (Below Average): Design, develop and implement a YACC/C program to
demonstrate the shift-reduce parsing technique for the grammar rules
E → E+T | T, T → T*F | F, F → (E) | id, and parse the sentence id + id * id.
(CO4, K3)
Set - 5 (Slow Learners): Token Explorer – understanding lexical analysis in steps.
(CO4, K3)
112