
Please read this disclaimer before proceeding:
This document is confidential and intended solely for the educational purposes of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only for the respective group /
learning community for which it is intended. If you are not the addressee, you
should not disseminate, distribute or copy it through e-mail. Please notify the
sender immediately by e-mail if you have received this document by mistake and
delete it from your system. If you are not the intended recipient, you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.

22AI602/22AM601
AUTOMATA THEORY AND COMPILER DESIGN

Departments: ARTIFICIAL INTELLIGENCE AND DATA SCIENCE /
             ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Batch/Year: 2022 – 2026 / III
Created by:
Dr. R. SHEEJA, ASP/ADS, RMKEC
Dr. C. S. ANITA, Professor, AIML
Ms. A. AKILA, AP/ADS, RMKCET
Mr. KARTHIKEYAN M, AP/CSE-CS, RMKCET

Table of Contents

1. Course Objectives
2. Pre-Requisites
3. Syllabus
4. Course Outcomes
5. CO-PO/PSO Mapping
6. Lecture Plan
7. Activity Based Learning
8. Lecture Notes
9. Assignments
10. Part A Q & A
11. Part B Qs
12. Supportive Online Certification Courses
13. Real Time Applications in Day-to-Day Life and in Industry
14. Contents Beyond the Syllabus
15. Assessment Schedule
16. Prescribed Text Books & Reference Books
17. Mini Project Suggestions
COURSE OBJECTIVES

1. Course Objectives

The objectives of this course are:


To introduce the fundamental concepts of automata theory.

To understand deterministic and non-deterministic finite automata.

To introduce Push down Automata and Turing Machines.

To introduce the major concepts of language translation and compiler design.

To elaborate the code optimization and code generation in compiler design.

PRE REQUISITE

2. PRE-REQUISITES

AUTOMATA THEORY AND COMPILER DESIGN (VI Sem)

Prerequisite courses:

• Data Structures (II Sem)
• Discrete Mathematics (III Sem)
• Operating Systems (III & IV Sem)
• Computer Organization and Architecture (III Sem)
SYLLABUS

3. SYLLABUS
22AI602/22AM601  AUTOMATA THEORY AND COMPILER DESIGN  (L T P C: 3 0 0 3)

UNIT I  INTRODUCTION TO AUTOMATA THEORY  9

Introduction to Finite Automata: Structural Representations, Automata and Complexity,
the Central Concepts of Automata Theory – Alphabets, Strings, Languages, Problems.
Nondeterministic Finite Automata: Formal Definition, an Application, Text Search, Finite
Automata with Epsilon-Transitions. Deterministic Finite Automata: Definition of DFA, How
a DFA Processes Strings, the Language of a DFA, Conversion of NFA with ε-transitions to
NFA without ε-transitions, Conversion of NFA to DFA.

UNIT II  REGULAR EXPRESSIONS AND CONTEXT-FREE GRAMMARS  9

Regular Expressions: Finite Automata and Regular Expressions, Applications of Regular
Expressions, Algebraic Laws for Regular Expressions, Conversion of Finite Automata to
Regular Expressions. Pumping Lemma for Regular Languages: Statement of the Pumping
Lemma, Applications of the Pumping Lemma. Context-Free Grammars: Definition of
Context-Free Grammars, Derivations Using a Grammar, Leftmost and Rightmost
Derivations, the Language of a Grammar, Parse Trees, Ambiguity in Grammars and
Languages.

UNIT III  PDA AND TURING MACHINES  9

Push Down Automata: Definition of the Pushdown Automaton, the Languages of a PDA,
Equivalence of PDA and CFGs, Acceptance by Final State.
Turing Machines: Introduction to Turing Machine, Formal Description, Instantaneous
Description, the Language of a Turing Machine.

Syllabus
22AI602/22AM601  AUTOMATA THEORY AND COMPILER DESIGN  (L T P C: 3 0 0 3)

UNIT IV  LEXICAL AND SYNTAX ANALYSIS  9

Introduction: The Structure of a Compiler.
Lexical Analysis: The Role of the Lexical Analyzer, Input Buffering, Recognition of Tokens,
the Lexical-Analyzer Generator Lex.
Syntax Analysis: Introduction, Context-Free Grammars, Writing a Grammar, Top-Down
Parsing, Bottom-Up Parsing, Introduction to LR Parsing: Simple LR, More Powerful LR
Parsers, Parser Generators: YACC.

UNIT V  CODE GENERATION AND OPTIMIZATION  9

Code Generation and Optimization: Issues in the Design of a Code Generator, a Simple
Code Generator, Introduction to Code Optimization, Basic Blocks & Flow Graphs, DAG
Representation of Basic Blocks, Peephole Optimization, the Principal Sources of
Optimization.

TOTAL: 45 PERIODS

COURSE OUTCOMES

4. Course Outcomes

After successful completion of the course, the students should be able to

CO No.  Course Outcome (Highest Cognitive Level)
CO1     Construct deterministic and non-deterministic finite automata. (K2)
CO2     Design context-free grammars for formal languages using regular expressions. (K3)
CO3     Use PDA and Turing Machines for recognizing context-free languages. (K3)
CO4     Design a lexical analyzer. (K2)
CO5     Design a syntax analyzer. (K3)
CO6     Design a simple code generator and apply different code optimizations. (K3)

CO - PO / PSO MAPPING

5. CO - PO / PSO MAPPING

Course Outcomes (COs) versus Programme Outcomes (POs) and Programme Specific Outcomes (PSOs):

POs: PO1 (K3), PO2 (K4), PO3 (K5), PO4 (K5), PO5 (K3/K5), PO6 (A2), PO7 (A3), PO8 (A3),
PO9 (A3), PO10 (A3), PO11 (A3), PO12 (A2); PSOs: PSO1 (K3), PSO2 (K3), PSO3 (K3)

C202.1 (K3): 3 3 3 3 2 3 2 2
C202.2 (K3): 3 3 2 2 1 3 2 2
C202.3 (K3): 3 3 1 2 2 3 2 1
C202.4 (K3): 3 3 1 2 2 3 2 1
C202.5 (K3): 3 3 1 2 2 2 2 1
C202:        3 3 3 3 2 3 2 1
LECTURE PLAN

6. Lecture Plan

UNIT IV LEXICAL AND SYNTAX ANALYSIS

1. 17.02.2025 - Introduction to the Structure of a Compiler: overview of compiler phases; importance of each phase (CO4, K3, MD1)
2. 18.02.2025 - Lexical Analysis: role of the lexical analyzer; introduction to lexical analysis; input buffering (CO4, K3, MD1)
3. 19.02.2025 - Recognition of Tokens and the Lexical-Analyzer Generator Lex (CO4, K3, MD1)
4. 20.02.2025 - Syntax Analysis: context-free grammars (CFGs); role of syntax analysis; writing grammars for languages (CO4, K3, MD1)
5. 21.02.2025 - Top-Down Parsing Techniques: basics of top-down parsing; recursive descent parsing; predictive parsing (CO4, K3, MD2)
6. 22.02.2025 - Bottom-Up Parsing Techniques: basics of bottom-up parsing; shift-reduce parsing; parse tree construction (CO4, K3, MD1)
7. 24.02.2025 - Introduction to LR Parsing: Simple LR (SLR) parsers; fundamentals of LR parsing; constructing SLR parsing tables; advantages and limitations of SLR parsers (CO4, K3, MD2)
8. 25.02.2025 - More Powerful LR Parsers (CLR and LALR): CLR and LALR parsing techniques; comparison with SLR; practical applications of LR parsers (CO4, K3, MD2)
9. 26.02.2025 - Parser Generators and Wrap-Up: introduction to YACC and parser generators; review of all parsing techniques; summary and problem solving (CO4, K3, MD2)

(Dates are proposed lecture dates; CO = pertaining course outcome; K = highest cognitive level; MD = mode of delivery.)

ACTIVITY BASED LEARNING

UNIT – IV

• TO UNDERSTAND THE BASIC CONCEPTS OF COMPILERS, STUDENTS TAKE A QUIZ AS AN ACTIVITY.

• LINK for the quiz is given below.

https://sites.google.com/site/wonsunahn/teaching/cs-0449-systems-software/kahoot-quiz

• Hands On -Assignment:

1. Illustrate diagrammatically the output of each phase of the compiler.


2. Demonstrate the RE to DFA process using JFLAP:
   a. a(b+c)*a
   b. 10 + (0 + 11)0*1
   To run JFLAP on Windows, you may simply use one of the following methods:
   • In a command/console window, go to the directory containing JFLAP7.jar and execute: java -jar JFLAP7.jar
   • Right-click on the file and open it with "Java(TM) Platform SE binary"
   • Double-click on the file (how to run a JAR file in Windows by double-clicking)

3. Use an online LEX compiler and execute the program:
   a. To create a simple calculator with variables.
   b. To count the number of characters in a string.

4. Use the Regular Expression Tester at https://www.freeformatter.com/regex-tester.html, test the regular expressions ab* and (a|b)*abb, and show the results.
8. ACTIVITY BASED LEARNING : UNIT – IV

UNIT – IV

• TO UNDERSTAND THE BASIC CONCEPTS OF COMPILERS, STUDENTS TAKE A QUIZ AS AN ACTIVITY.

• THE LINK IS PROVIDED BELOW.

https://create.kahoot.it/share/cs8602-unit-1/2e5c742f-541a-4bd0-84b2-9e2ca3b47170
Join at www.kahoot.it or with the Kahoot! app and use the game PIN to play the quiz.

Hands-on Assignment:

1. Use JFLAP to demonstrate the construction of LL(1) parsing table for the
grammar
S -> aABb
A -> aAc | λ
B -> bB | c

2. Use the above grammar to demonstrate the construction of SLR parsing


table and parse the string aacbbcb.
3. Compute FIRST() and FOLLOW() for the following grammar and display the result in JFLAP:
   E -> E+T | T
   T -> T*F | F
   F -> (E) | id
   Construct the CLR parsing table and then optimize it to construct the LALR parsing table.

9. LECTURE NOTES
UNIT – IV

INTRODUCTION TO COMPILERS

1.1 OVERVIEW OF LANGUAGE PROCESSING SYSTEM

Preprocessor

A preprocessor produces input to compilers. It may perform the following functions:

1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.

2. File inclusion: A preprocessor may include header files into the program text.

3. Rational preprocessor: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.

4. Language extensions: These preprocessors attempt to add capabilities to the language in the form of built-in macros.
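
As a small illustration of the first two functions, the C fragment below shows a macro and a file inclusion that the preprocessor expands before the compiler proper runs (a minimal sketch; the macro name is ours):

#include <stdio.h>            /* file inclusion: the header text is spliced into the program */
#define SQUARE(x) ((x)*(x))   /* macro processing: a shorthand for a longer construct */

int main(void) {
    printf("%d\n", SQUARE(5));   /* the preprocessor expands this to ((5)*(5)) */
    return 0;
}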

COMPILER

A compiler is a translator program that takes a program written in a high-level language (HLL), the source program, and translates it into an equivalent program in machine-level language (MLL), the target program. An important part of a compiler is reporting errors to the programmer.

    Source Program --> COMPILER --> Target Program
                          |
                     Error Messages

Executing a program written in an HLL programming language basically has two parts: the source program must first be compiled (translated) into an object program; then the resulting object program is loaded into memory and executed.

    Source Program --> COMPILER --> Obj Program
    Input --> OBJECT PROGRAM --> Output

ASSEMBLER: Programmers found it difficult to write or read programs in machine
language. They began to use a mnemonic (symbolic) notation for each machine instruction,
which they would subsequently translate into machine language. Such a mnemonic
machine language is now called an assembly language. Programs known as
assemblers were written to automate the translation of assembly language into
machine language. The input to an assembler program is called the source program;
the output is a machine language translation (object program).

INTERPRETER: An interpreter is a program that appears to execute a source
program as if it were machine language.

    Source Program + Data --> INTERPRETER --> Program Output

Languages such as BASIC, SNOBOL and LISP can be translated using interpreters.
JAVA also uses an interpreter. The process of interpretation can be carried out in the
following phases:

1. Lexical analysis

2. Syntax analysis

3. Semantic analysis

4. Direct Execution

Advantages:

• Modification of the user program can easily be made and implemented as execution proceeds.
• The type of object that a variable denotes may change dynamically.
• Debugging a program and finding errors is a simplified task for a program used for interpretation.
• The interpreter for the language makes it machine independent.

Disadvantages:

• The execution of the program is slower.
• Memory consumption is more.

Loader and Link-editor:

Once the assembler produces an object program, that program must be placed
into memory and executed. The assembler could place the object program directly
in memory and transfer control to it, thereby causing the machine language
program to be executed. However, this would waste core by leaving the assembler in
memory while the user's program was being executed. Also, the programmer would have to
retranslate his program with each execution, thus wasting translation time. To
overcome this problem of wasted translation time and memory, system
programmers developed another component called a loader.
"A loader is a program that places programs into memory and prepares them for
execution." It would be more efficient if subroutines could be translated into object
form which the loader could "relocate" directly behind the user's program. The task of
adjusting programs so that they may be placed in arbitrary core locations is called
relocation. Relocating loaders perform four functions.

TRANSLATOR
A translator is a program that takes as input a program written in one language
and produces as output a program in another language. Besides program
translation, the translator performs another very important role: error
detection. Any violation of the HLL (High-Level Language) specification would be
detected and reported to the programmer. The important roles of a translator are:
• Translating the HLL program input into an equivalent ML (machine language) program.
• Providing diagnostic messages wherever the programmer violates the
specification of the HLL.

TYPES OF TRANSLATORS:

1) INTERPRETER  2) COMPILER  3) PREPROCESSOR

1.2 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler: A compiler operates in phases. A phase is a logically
interrelated operation that takes the source program in one representation and
produces output in another representation. There are two parts to compilation:
a. Analysis (Machine Independent / Language Dependent)
b. Synthesis (Machine Dependent / Language Independent)

PHASES OF A COMPILER
The compilation process is partitioned into a number of sub-processes called 'phases'.

Lexical Analysis:
The lexical analyzer (LA), or scanner, reads the source program one character at a time,
carving the source program into a sequence of atomic units called tokens.
Syntax Analysis:
The second stage of translation is called syntax analysis or parsing. In this phase,
expressions, statements, declarations, etc. are identified by using the results of
lexical analysis. Syntax analysis is aided by using techniques based on the formal
grammar of the programming language.
Intermediate Code Generation:
An intermediate representation of the final machine language code is produced.
This phase bridges the analysis and synthesis phases of translation.
Code Optimization:
This is an optional phase designed to improve the intermediate code so that the output
runs faster and takes less space.

Code Generation:
The last phase of translation is code generation. A number of optimizations to
reduce the length of the machine language program are carried out during this
phase. The output of the code generator is the machine language program of the
specified computer.

Table Management (or) Book-keeping:

This portion keeps the names used by the program and records essential
information about each. The data structure used to record this information is called a
"Symbol Table".
Error Handlers:
An error handler is invoked when a flaw (error) in the source program is detected.
The output of the LA is a stream of tokens, which is passed to the next phase, the
syntax analyzer or parser. The SA groups the tokens together into syntactic
structures such as expressions. Expressions may further be combined to form
statements. The syntactic structure can be regarded as a tree whose leaves are the
tokens; such trees are called parse trees.

The parser has two functions. It checks whether the tokens from the lexical analyzer
occur in patterns that are permitted by the specification for the source language. It
also imposes on the tokens a tree-like structure that is used by the subsequent phases
of the compiler.

For example, if a program contains the expression A+/B, then after lexical analysis this
expression might appear to the syntax analyzer as the token sequence id+/id. On
seeing the /, the syntax analyzer should detect an error situation, because the
presence of these two adjacent binary operators violates the formation rules of
an expression.
Syntax analysis makes explicit the hierarchical structure of the incoming token
stream by identifying which parts of the token stream should be grouped.

For example, A/B*C has two possible interpretations:
1. divide A by B and then multiply by C, or
2. multiply B by C and then use the result to divide A.

Each of these two interpretations can be represented in terms of a parse tree.

Intermediate Code Generation:

The intermediate code generator uses the structure produced by the syntax
analyzer to create a stream of simple instructions. Many styles of intermediate code
are possible. One common style uses instructions with one operator and a small
number of operands.
The output of the syntax analyzer is some representation of a parse tree. The
intermediate code generation phase transforms this parse tree into an intermediate-
language representation of the source program.

Code Optimization

This is an optional phase designed to improve the intermediate code so that the
output runs faster and takes less space. Its output is another intermediate-code
program that does the same job as the original, but in a way that saves time and/or
space.
1. Local Optimization:
There are local transformations that can be applied to a program to make an
improvement. For example,

    If A > B goto L2
    Goto L3

    L2:

This can be replaced by the single statement

    If A <= B goto L3

Another important local optimization is the elimination of common sub-expressions. The statements

    A := B + C + D
    E := B + C + F

might be evaluated as

    T1 := B + C
    A := T1 + D
    E := T1 + F

taking advantage of the common sub-expression B + C.
2. Loop Optimization:
Another important source of optimization concerns increasing the speed of
loops. A typical loop improvement is to move a computation that produces the
same result each time around the loop to a point in the program just before the
loop is entered, as sketched below.
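
Hand-applied at the source level, this "code motion" might look like the following C sketch (the function and variable names are illustrative, not from the text):

/* Before: limit * 2 is recomputed on every trip around the loop. */
void scale_before(int *x, const int *y, int n, int limit) {
    for (int i = 0; i < n; i++)
        x[i] = y[i] * (limit * 2);
}

/* After: the loop-invariant computation is moved to just before the loop. */
void scale_after(int *x, const int *y, int n, int limit) {
    int t = limit * 2;
    for (int i = 0; i < n; i++)
        x[i] = y[i] * t;
}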
Code Generator:
The code generator produces the object code by deciding on the memory locations for data,
selecting code to access each datum, and selecting the registers in which each
computation is to be done. Many computers have only a few high-speed registers in
which computations can be performed quickly. A good code generator would
attempt to utilize registers as efficiently as possible.

Symbol Table Management OR Book-keeping:

A compiler needs to collect information about all the data objects that appear in the
source program. The information about data objects is collected by the early phases
of the compiler (the lexical and syntactic analyzers). The data structure used to record
this information is called a Symbol Table.

Error Handling:
One of the most important functions of a compiler is the detection and reporting of
errors in the source program. The error messages should allow the programmer to
determine exactly where the errors have occurred. Errors may occur in any of the
phases of a compiler.
Whenever a phase of the compiler discovers an error, it must report the error to
the error handler, which issues an appropriate diagnostic message. Both the table-
management and error-handling routines interact with all phases of the compiler.

1.3 LEXICAL ANALYSIS
Lexical analysis is the process of converting a sequence of characters into a sequence of
tokens. A program or function which performs lexical analysis is called a lexical analyzer
or scanner. A lexer often exists as a single function which is called by a parser or
another function.
THE ROLE OF THE LEXICAL ANALYZER
The lexical analyzer is the first phase of a compiler. Its main task is to read the input
characters and produce as output a sequence of tokens that the parser uses for
syntax analysis.

    source program --> LEXICAL ANALYZER --(token)--> PARSER
                       LEXICAL ANALYZER <--(getNextToken)-- PARSER
                       (both consult the SYMBOL TABLE)

Upon receiving a 'get next token' command from the parser, the lexical analyzer reads
input characters until it can identify the next token.
ISSUES OF LEXICAL ANALYZER
There are three reasons for separating lexical analysis from syntax analysis:
• To make the design simpler.
• To improve the efficiency of the compiler.
• To enhance compiler portability.
TOKENS
A token is a string of characters, categorized according to the rules as a symbol (e.g.,
IDENTIFIER, NUMBER, COMMA). The process of forming tokens from an input stream
of characters is called tokenization.

A token can look like anything that is useful for processing an input text stream or
text file. Consider this expression in the C programming language:

    sum = 3 + 2;

    Lexeme   Token type
    sum      Identifier
    =        Assignment operator
    3        Number
    +        Addition operator
    2        Number
    ;        End of statement

LEXEME:
A collection or group of characters forming a token is called a lexeme.
PATTERN:
A pattern is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is just the sequence of characters
that form the keyword. For identifiers and some other tokens, the pattern is a more
complex structure that is matched by many strings.
Attributes for Tokens
Some tokens have attributes that can be passed back to the parser. The lexical
analyzer collects information about tokens into their associated attributes. The
attributes influence the translation of tokens.
a. Constants: the value of the constant.
b. Identifiers: a pointer to the corresponding symbol-table entry.
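
For instance, for the input E = M * C ** 2, the lexical analyzer might return the following sequence of token/attribute pairs (a worked illustration):

    <id, pointer to symbol-table entry for E>
    <assign_op>
    <id, pointer to symbol-table entry for M>
    <mult_op>
    <id, pointer to symbol-table entry for C>
    <exp_op>
    <number, integer value 2>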

ERROR RECOVERY STRATEGIES IN LEXICAL ANALYSIS:
The following are the error-recovery actions in lexical analysis:
1) Deleting an extraneous character.
2) Inserting a missing character.
3) Replacing an incorrect character by a correct character.
4) Transposing two adjacent characters.
5) Panic mode recovery: Deletion of successive characters from the
token until error is resolved.
INPUT BUFFERING
We often have to look one or more characters beyond the next lexeme before we
can be sure we have the right lexeme. As characters are read from left to right,
each character is stored in the buffer to form a meaningful token, as shown below:

    A = B + C
    ^       ^
    |       forward (look-ahead) pointer
    beginning of the token

We introduce a two-buffer scheme that handles large look-aheads safely. We then
consider an improvement involving "sentinels" that saves time checking for the
ends of buffers.
BUFFER PAIRS
A buffer is divided into two N-character halves, as shown below.

Each buffer is of the same size N, and N is usually the number of characters on one
disk block, e.g., 1024 or 4096 bytes.
Using one system read command we can read N characters into a buffer.
If fewer than N characters remain in the input file, then a special character,
represented by eof, marks the end of the source file.
Two pointers to the input are maintained:
1. Pointer lexeme_beginning marks the beginning of the current lexeme,
whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.

Once the next lexeme is determined, forward is set to the character at its right end.
The string of characters between the two pointers is the current lexeme. After the
lexeme is recorded as an attribute value of a token returned to the parser,
lexeme_beginning is set to the character immediately after the lexeme just found.

Advancing the forward pointer:

Advancing the forward pointer requires that we first test whether we have reached the
end of one of the buffers, and if so, we must reload the other buffer from the
input and move forward to the beginning of the newly loaded buffer. If the end of
the second buffer is reached, we must again reload the first buffer with input and the
pointer wraps to the beginning of the buffer.
Code to advance the forward pointer:

    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else forward := forward + 1;
SENTINELS
For each character read, we make two tests: one for the end of the buffer, and one
to determine what character is read. We can combine the buffer-end test with the
test for the current character if we extend each buffer to hold a sentinel character
at the end.
The sentinel is a special character that cannot be part of the source program, and a
natural choice is the character eof. (Figure: the sentinel arrangement, with an eof
sentinel at the end of each buffer half.)

Note that eof retains its use as a marker for the end of the entire input. Any eof
that appears other than at the end of a buffer means that the input is at an end.
Code to advance the forward pointer with sentinels:

    forward := forward + 1;
    if forward↑ = eof then begin
        if forward at end of first half then begin
            reload second half;
            forward := forward + 1
        end
        else if forward at end of second half then begin
            reload first half;
            move forward to beginning of first half
        end
        else /* eof within a buffer signifying end of input */
            terminate lexical analysis
    end
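
One possible C rendering of this two-buffer scheme is sketched below; the half size N, the reload() helper, and the use of the EOF byte as the sentinel are our assumptions for illustration, not details fixed by the text:

#include <stdio.h>

#define N 4096                          /* size of each buffer half (one disk block) */

static char buf[2 * N + 2];             /* two halves, each followed by a sentinel slot */
static char *lexeme_beginning;          /* marks the start of the current lexeme */
static char *forward;                   /* scanning pointer */

/* Assumed helper: read up to N characters into the half starting at p and
   plant the eof sentinel immediately after the data. */
static void reload(char *p, FILE *in) {
    size_t n = fread(p, 1, N, in);
    p[n] = (char)EOF;                   /* sentinel: ends the half, or the whole input */
}

static void init(FILE *in) {            /* load the first half and start scanning */
    reload(buf, in);
    lexeme_beginning = forward = buf;
}

/* Advance forward by one character using the combined sentinel test.
   Returns 0 when the real end of input is reached. */
static int advance(FILE *in) {
    forward++;
    if (*forward == (char)EOF) {
        if (forward == buf + N) {                /* at end of first half */
            reload(buf + N + 1, in);             /* reload second half */
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 1) { /* at end of second half */
            reload(buf, in);                     /* reload first half */
            forward = buf;                       /* wrap to the beginning */
        } else {
            return 0;                            /* eof within a buffer: end of input */
        }
    }
    return 1;
}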

1.4 RECOGNITION OF TOKENS
Consider the following grammar fragment:

    stmt → if expr then stmt | if expr then stmt else stmt | ε
    expr → term relop term | term
    term → id | num

where the terminals if, then, else, relop, id and num generate sets of strings given
by the following regular definitions:

    if    → if
    then  → then
    else  → else
    relop → < | <= | = | <> | > | >=
    id    → letter ( letter | digit )*
    num   → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

For this language fragment the lexical analyzer will recognize the keywords if, then,
else, as well as the lexemes denoted by relop, id, and num. To simplify matters, we
assume keywords are reserved; that is, they cannot be used as identifiers.
Transition diagrams
A transition diagram is a diagrammatic representation of the action that takes place when the
lexical analyzer is called by the parser to get the next token. It is used to keep
track of information about the characters that are seen as the forward pointer
scans the input. A coded rendering of one such diagram appears below.
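
As a sketch, the relop transition diagram can be coded directly as a C routine; the sample input, the next_char/retract helpers, and the token codes are our illustrative assumptions:

#include <stdio.h>

enum relop_tok { LT, LE, EQ, NE, GT, GE, NOT_RELOP };

static const char *src = "<=";                  /* sample input */
static int pos = 0;

static int  next_char(void) { return src[pos] ? src[pos++] : '\0'; }
static void retract(void)   { pos--; }          /* give back one character */

static enum relop_tok relop_token(void) {
    int c = next_char();
    switch (c) {
    case '<':
        c = next_char();
        if (c == '=') return LE;                /* lexeme "<=" */
        if (c == '>') return NE;                /* lexeme "<>" */
        if (c != '\0') retract();               /* looked one character too far */
        return LT;
    case '=':
        return EQ;
    case '>':
        c = next_char();
        if (c == '=') return GE;                /* lexeme ">=" */
        if (c != '\0') retract();
        return GT;
    default:
        if (c != '\0') retract();
        return NOT_RELOP;                       /* not a relational operator */
    }
}

int main(void) {
    printf("token code %d (LE = %d)\n", relop_token(), LE);
    return 0;
}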

1.9 A LANGUAGE FOR SPECIFYING LEXICAL ANALYZERS
There is a wide range of tools for constructing lexical analyzers:
· Lex
· YACC
LEX
Lex is a computer program that generates lexical analyzers. Lex is commonly used
with the yacc parser generator.
Creating a lexical analyzer
First, a specification of a lexical analyzer is prepared by creating a program lex.l in
the Lex language.
Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c.

Finally, lex.yy.c is run through the C compiler to produce an object program
a.out, which is the lexical analyzer that transforms an input stream into a
sequence of tokens.

    lex.l        --> [ Lex compiler ] --> lex.yy.c
    lex.yy.c     --> [ C compiler ]   --> a.out
    input stream --> [ a.out ]        --> sequence of tokens

Lex Specification
A Lex program consists of three parts:

    { definitions }
    %%
    { rules }
    %%
    { user subroutines }

Definitions include declarations of variables, constants, and regular definitions.
Rules are statements of the form

    p1 {action 1}
    p2 {action 2}
    ...
    pn {action n}

where each pi is a regular expression and each action i describes what action the lexical
analyzer should take when pattern pi matches a lexeme.
Actions are written in C code.
User subroutines are auxiliary procedures needed by the actions. These
can be compiled separately and loaded with the lexical analyzer.
YACC - YET ANOTHER COMPILER-COMPILER
Yacc provides a general tool for describing the input to a computer program.
The Yacc user specifies the structures of his input, together with code to be
invoked as each such structure is recognized. Yacc turns such a specification
into a subroutine that handles the input process; frequently, it is convenient
and appropriate to have most of the flow of control in the user's application
handled by this subroutine.
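
In the same spirit, a minimal Yacc specification has the three-part layout shown below; the grammar and the action are ours for illustration, with the tokens assumed to come from a Lex-built yylex():

%{
/* A tiny illustrative Yacc specification: it accepts statements of the
   form  id = id + id ;  (with any number of '+' terms). */
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token ID
%%
stmt : ID '=' expr ';'   { printf("assignment recognized\n"); }
     ;
expr : expr '+' ID       /* left recursion is fine for a bottom-up parser */
     | ID
     ;
%%
/* yylex() would normally be supplied by a Lex-generated scanner. */

Run through Yacc and linked with a scanner and a main() that calls yyparse(), this becomes the input-handling subroutine described above.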

In the subset construction used to convert an NFA to a DFA, when the set of NFA states
computed for a transition equals a set we have already seen, we simply reuse the existing
DFA state that we already constructed. This process is then repeated for each of the new
DFA states (that is, sets of NFA states) until we run out of DFA states to process. Finally,
every DFA state whose corresponding set of NFA states contains an accepting state is itself
marked as an accepting state.
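
A compact coded sketch of this process is shown below; the bitmask representation of state sets and the hard-coded NFA (for (a|b)*ab, with no ε-moves) are our illustrative assumptions:

#include <stdio.h>

#define NNFA 3                               /* NFA states 0, 1, 2; state 2 accepts */

static const unsigned delta[NNFA][2] = {     /* transitions on 'a' and 'b' */
    /* state 0 */ { (1u << 0) | (1u << 1), (1u << 0) },
    /* state 1 */ { 0u,                    (1u << 2) },
    /* state 2 */ { 0u,                    0u        },
};

/* move(S, c): union of the NFA transitions on c from every state in set S */
static unsigned move(unsigned S, int c) {
    unsigned R = 0;
    for (int s = 0; s < NNFA; s++)
        if (S & (1u << s)) R |= delta[s][c];
    return R;
}

int main(void) {
    unsigned dstates[1 << NNFA];
    int n = 0;
    dstates[n++] = 1u << 0;                  /* start DFA state = {s0} */
    for (int i = 0; i < n; i++) {            /* process each DFA state in turn */
        for (int c = 0; c < 2; c++) {
            unsigned T = move(dstates[i], c);
            int j;
            for (j = 0; j < n; j++)          /* equal set already built? reuse it */
                if (dstates[j] == T) break;
            if (j == n) dstates[n++] = T;    /* otherwise record a new DFA state */
            printf("D%d --%c--> D%d\n", i, "ab"[c], j);
        }
    }
    for (int i = 0; i < n; i++)              /* accepting if it contains NFA state 2 */
        if (dstates[i] & (1u << 2)) printf("D%d is accepting\n", i);
    return 0;
}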
The Lexical-Analyzer Generator Lex
The lexical-analyzer tool is called Lex, or in a more recent implementation Flex; it
allows one to specify a lexical analyzer by specifying regular expressions to describe
patterns for tokens.
The input notation for the Lex tool is referred to as the Lex language and the tool
itself is the Lex compiler.
Behind the scenes, the Lex compiler transforms the input patterns into a transition
diagram and generates code, in a file called lex.yy.c, that simulates this
transition diagram.
Use of Lex
Figure 3.22 suggests how Lex is used. An input file, which we call lex.l, is written in
the Lex language and describes the lexical analyzer to be generated. The Lex
compiler transforms lex.l to a C program, in a file that is always named lex.yy.c.
The latter file is compiled by the C compiler into a file called a.out. The C-compiler
output is a working lexical analyzer that can take a stream of input characters and
produce a stream of tokens.
The normal use of the compiled C program, referred to as a.out in Fig. 3.22, is as a
subroutine of the parser. It is a C function that returns an integer, which is a code
for one of the possible token names. The attribute value, whether it be another
numeric code, a pointer to the symbol table, or nothing, is placed in a global variable
yylval, which is shared between the lexical analyzer and parser, thereby making it
simple to return both the name and an attribute value of a token.

Structure of Lex Programs
A Lex program has the following form:

    declarations
    %%
    translation rules
    %%
    auxiliary functions

· The declarations section includes declarations of variables, manifest constants
(identifiers declared to stand for a constant, e.g., the name of a token), and regular
definitions.
· The translation rules each have the form

    Pattern { Action }

· Each pattern is a regular expression, which may use the regular definitions of
the declarations section. The actions are fragments of code, typically written in C,
although many variants of Lex using other languages have been created.
· The third section holds whatever additional functions are used in the actions.
Alternatively, these functions can be compiled separately and loaded with the
lexical analyzer.
· When called by the parser, the lexical analyzer begins reading its remaining
input, one character at a time, until it finds the longest prefix of the input that
matches one of the patterns Pi. It then executes the associated action Ai. Typically,
Ai will return to the parser, but if it does not (e.g., because Pi describes whitespace
or comments), then the lexical analyzer proceeds to find additional lexemes, until
one of the corresponding actions causes a return to the parser.

The lexical analyzer returns a single value, the token name, to the parser, but uses the
shared integer variable yylval to pass additional information about the lexeme found, if
needed.
Example: Figure 3.23 is a Lex program that recognizes the tokens of Fig. 3.12 and returns
the token found.

Declarations section:
In the declarations section we see a pair of special brackets, %{ and %}. Anything
within these brackets is copied directly to the file lex.yy.c, and is not treated as a regular
definition. It is common to place there the definitions of the manifest constants, using C
#define statements to associate unique integer codes with each of the manifest constants.

Also in the declarations section is a sequence of regular definitions. Regular definitions
that are used in later definitions or in the patterns of the translation rules are surrounded
by curly braces. Thus, for instance, delim is defined to be a shorthand for the character
class consisting of the blank, the tab, and the newline; the latter two are represented, as in
all UNIX commands, by backslash followed by t or n, respectively. Then, ws is defined to be
one or more delimiters, by the regular expression {delim}+.

Notice that in the definitions of id and number, parentheses are used as grouping
metasymbols and do not stand for themselves. In contrast, E in the definition of number
stands for itself. If we wish to use one of the Lex metasymbols, such as any of the
parentheses, +, *, or ?, to stand for itself, we may precede it with a backslash.
For instance, we see \. in the definition of number, to represent the dot, since that character
is a metasymbol representing "any character," as usual in UNIX regular expressions.

Auxiliary-function section:
In the auxiliary-function section, we see two such functions, installID() and
installNum(). Like the portion of the declarations section that appears between %{ ... %},
everything in the auxiliary section is copied directly to file lex.yy.c, but may be used in
the actions.

Translation rules:
Finally, let us examine some of the patterns and rules in the middle section of
Fig. 3.23. First, ws, an identifier declared in the first section, has an associated
empty action. If we find whitespace, we do not return to the parser, but look for
another lexeme.
The second token has the simple regular expression pattern if. Should we see the
two letters if on the input, and they are not followed by another letter or digit
(which would cause the lexical analyzer to find a longer prefix of the input matching
the pattern for id), then the lexical analyzer consumes these two letters from the
input and returns the token name IF, that is, the integer for which the manifest
constant IF stands. Keywords then and else are treated similarly.
%{
/* definitions of manifest constants */
#define LT 260
#define LE 261
/* ... and similarly EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}     {/* no action and no return */}
if       {return(IF);}
then     {return(THEN);}
else     {return(ELSE);}
{id}     {yylval = (int) installID(); return(ID);}
{number} {yylval = (int) installNum(); return(NUMBER);}
"<"      {yylval = LT; return(RELOP);}
"<="     {yylval = LE; return(RELOP);}
"="      {yylval = EQ; return(RELOP);}
"<>"     {yylval = NE; return(RELOP);}
">"      {yylval = GT; return(RELOP);}
">="     {yylval = GE; return(RELOP);}
%%

int installID() {/* function to install the lexeme, whose first
character is pointed to by yytext, and whose length is yyleng, into
the symbol table, and return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical
constants into a separate table */
}

Figure 3.23: Lex program for the tokens of Fig. 3.12
The fifth token has the pattern defined by id. Note that, although keywords like if
match this pattern as well as an earlier pattern, Lex chooses whichever pattern is
listed first in situations where the longest matching prefix matches two or more
patterns. The action taken when id is matched is threefold:
1. Function installID() is called to place the lexeme found in the symbol
table.
2. This function returns a pointer to the symbol table, which is placed in the global
variable yylval, where it can be used by the parser or a later component of the
compiler. Note that installID() has available to it two variables that are set
automatically by the lexical analyzer that Lex generates:
(a) yytext is a pointer to the beginning of the lexeme, analogous to lexeme_beginning
in Fig. 3.3.
(b) yyleng is the length of the lexeme found.
3. The token name ID is returned to the parser. The action taken when a lexeme
matching the pattern number is similar, using the auxiliary function installNum().
Conflict Resolution in Lex
Two rules that Lex uses to decide on the proper lexeme to select, when several
prefixes of the input match one or more patterns:
1. Always prefer a longer prefix to a shorter prefix.
2. If the longest possible prefix matches two or more patterns, prefer the
pattern listed first in the Lex program.

Example:
• The first rule tells us to continue reading letters and digits to find the longest
prefix of these characters to group as an identifier. It also tells us to treat <=
as a single lexeme, rather than selecting < as one lexeme and = as the next
lexeme.
• The second rule makes keywords reserved, if we list the keywords before id
in the program.
For instance, if then is determined to be the longest prefix of the input that
matches any pattern, and the pattern then precedes {id}, as it does in Fig. 3.23,
then the token THEN is returned, rather than ID.
The Lookahead Operator
· Lex automatically reads one character ahead of the last character that
forms the selected lexeme, and then retracts the input so only the lexeme
itself is consumed from the input.
· Sometimes we want a certain pattern to be matched to the input only when
it is followed by certain other characters. If so, we may use the slash in a
pattern to indicate the end of the part of the pattern that matches the
lexeme. What follows the / is additional pattern that must be matched before we
can decide that the token in question was seen, but what matches this
second pattern is not part of the lexeme.

Example: In Fortran and some other languages, keywords are not reserved. That
situation creates problems, such as in the statement
    IF(I,J) = 3
where IF is the name of an array, not a keyword. This statement contrasts with
statements of the form
    IF( condition ) THEN ...
where IF is a keyword.
The keyword IF is always followed by a left parenthesis, some text (the condition,
which may contain parentheses), a right parenthesis and a letter. To recognize it, we
could write a Lex rule for the keyword IF like:

    IF / \( .* \) {letter}

This rule says that the pattern the lexeme matches is just the two letters IF. The
slash says that additional pattern follows but does not match the lexeme.
In this pattern, the first character is the left parenthesis. Since that character is a
Lex metasymbol, it must be preceded by a backslash to indicate that it has its literal
meaning. The dot and star match "any string without a newline" (the dot is a Lex
metasymbol meaning "any character except newline"). This is followed by a
right parenthesis, again with a backslash to give that character its literal meaning.
The additional pattern is followed by the symbol letter, which is a regular definition
representing the character class of all letters.
For instance, suppose this pattern is asked to match a prefix of the input:
    IF(A<(B+C)*D)THEN...
The first two characters match IF, the next character matches \(, the next nine
characters match .*, and the next two match \) and letter. Note that the fact that the
first right parenthesis (after C) is not followed by a letter is irrelevant; we only need
to find some way of matching the input to the pattern. We conclude that the letters
IF constitute the lexeme, and they are an instance of the token if.

SYNTAX ANALYSIS

ROLE OF THE PARSER

The parser obtains a string of tokens from the lexical analyzer and verifies that it can be
generated by the grammar for the source program. The parser should report any
syntax errors in an intelligible fashion. The two types of parsers employed are:
Top-down parsers, which build parse trees from the top (root) to the bottom (leaves);
Bottom-up parsers, which build parse trees from the leaves and work up to the root.
Therefore there are two types of parsing methods: top-down parsing and bottom-up
parsing.

CONTEXT-FREE GRAMMAR
Grammars were introduced to systematically describe the syntax of programming
language constructs like expressions and statements.
A context-free grammar (grammar for short) consists of terminals, nonterminals, a
start symbol, and productions.

Grammar for simple arithmetic expressions:

    E -> E A E | ( E ) | - E | id
    A -> + | - | * | / | ^

WRITING A GRAMMAR
Grammars are capable of describing most, but not all, of the syntax of
programming languages.
For instance, the requirement that identifiers be declared before they are used
cannot be described by a context-free grammar.
We consider several transformations that could be applied to get a grammar
more suitable for parsing.
One technique can eliminate ambiguity in the grammar, and other techniques —
left-recursion elimination and left factoring — are useful for rewriting grammars
so they become suitable for top-down parsing.

PARSING

A program that performs syntax analysis is called a parser.

A syntax analyzer takes tokens as input and outputs error messages if the
program syntax is wrong.
The parser uses single-symbol lookahead and an approach called top-down parsing
without backtracking.
Top-down parsers check to see if a string can be generated by a grammar by
creating a parse tree starting from the initial symbol and working down.
Bottom-up parsers, however, check to see if a string can be generated from a
grammar by creating a parse tree from the leaves, and working up.
Early parser generators such as YACC create bottom-up parsers, whereas
many Java parser generators such as JavaCC create top-down parsers.

Classification of Parsers

Parsers
  - Top-Down Parsers
      - with full backtracking: Brute-Force Method
      - without backtracking: Recursive Descent; Non-recursive Descent, or LL(1)
  - Bottom-Up Parsers (Shift-Reduce Parsers)
      - Operator Precedence Parsers
      - LR Parsers: LR(0); Simple LR, or SLR(1); Canonical LR, or CLR(1) or LR(1); LALR(1)
RECURSIVE DESCENT PARSING
Typically, top-down parsers are implemented as a set of recursive functions that
descend through a parse tree for a string. This approach is known as recursive
descent parsing, also known as LL(k) parsing, where the first L stands for left-to-
right scanning, the second L stands for leftmost derivation, and k indicates k-symbol
lookahead.
Therefore, a parser using the single-symbol lookahead method and top-down
parsing without backtracking is called an LL(1) parser. In the following sections, we
will also use an extended BNF notation in which some regular-expression
operators are incorporated: alternatives are separated by |, a sequence of forms
denotes concatenation, an optional part denotes zero or one occurrence of a form,
and a repeated part denotes zero or more occurrences of a form.
A usual implementation of an LL(1) parser is: initialize its data structures, get the
lookahead token by calling scanner routines, and call the routine that implements
the start symbol. Here is an example.

    proc syntaxAnalysis()
    begin
        initialize();   // initialize global data and structures
        nextToken();    // get the lookahead token
        program();      // parser routine that implements the start symbol
    end;

Left Recursion:
A grammar is left-recursive if and only if there exists a nonterminal symbol that
can derive a sentential form with itself as the leftmost symbol:

    A → Aα | β

Elimination of Left Recursion:
Introduce a new nonterminal A' and rewrite the rule as

    A  → βA'
    A' → αA' | ε
Eliminate left recursion for the following grammars:

1) E → E + T | T

Solution:

    E  → TE'
    E' → +TE' | ε

2) T → T * F | F

Solution:

    T  → FT'
    T' → *FT' | ε
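
Putting the pieces together, a recursive-descent (LL(1)) parser for the left-recursion-free expression grammar just derived might look like the following C sketch; single characters serve as tokens and 'i' stands for id (our simplifications, not from the text):

#include <stdio.h>
#include <stdlib.h>

/* Grammar: E -> T E'   E' -> + T E' | eps
            T -> F T'   T' -> * F T' | eps
            F -> ( E ) | id                  */
static const char *in = "i+i*i";   /* sample input; '$' marks its end below */
static char look;                  /* the single lookahead token */

static void next(void)    { look = *in ? *in++ : '$'; }
static void error(void)   { puts("syntax error"); exit(1); }
static void match(char t) { if (look == t) next(); else error(); }

static void E(void), Ep(void), T(void), Tp(void), F(void);

static void E(void)  { T(); Ep(); }
static void Ep(void) { if (look == '+') { match('+'); T(); Ep(); } }  /* else eps */
static void T(void)  { F(); Tp(); }
static void Tp(void) { if (look == '*') { match('*'); F(); Tp(); } }  /* else eps */
static void F(void) {
    if (look == '(')      { match('('); E(); match(')'); }
    else if (look == 'i') { match('i'); }                  /* id */
    else                  { error(); }
}

int main(void) {
    next();                         /* prime the lookahead */
    E();
    if (look == '$') puts("accepted"); else error();
    return 0;
}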

FIRST AND FOLLOW

To compute FIRST(X) for all grammar symbols X, apply the following rules until no more
terminals or ε can be added to any FIRST set:

1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a nonterminal and X → Y1 Y2 ... Yk is a production, then place a in FIRST(X) if for
some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), ..., FIRST(Yi-1); that is,
Y1 ... Yi-1 =*> ε. If ε is in FIRST(Yj) for all j = 1, 2, ..., k, then add ε to FIRST(X). For example,
everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not derive ε, then we add nothing
more to FIRST(X), but if Y1 =*> ε, then we add FIRST(Y2), and so on.

To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing can
be added to any FOLLOW set:
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then
everything in FOLLOW(A) is in FOLLOW(B).

PROBLEM:

Consider the following example to understand the concept of FIRST and FOLLOW. Find the
FIRST and FOLLOW sets of all nonterminals in the grammar:

    E  → TE'
    E' → +TE' | ε
    T  → FT'
    T' → *FT' | ε
    F  → (E) | id
    FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
    FIRST(E') = { +, ε }
    FIRST(T') = { *, ε }
    FOLLOW(E) = FOLLOW(E') = { ), $ }
    FOLLOW(T) = FOLLOW(T') = { +, ), $ }
    FOLLOW(F) = { +, *, ), $ }

For example, id and left parenthesis are added to FIRST(F) by rule 3 in the definition of FIRST
with i = 1 in each case, since FIRST(id) = {id} and FIRST('(') = {(} by rule 1. Then by rule 3
with i = 1, the production T → FT' implies that id and left parenthesis belong to FIRST(T)
also.
To compute FOLLOW, we put $ in FOLLOW(E) by rule 1 for FOLLOW. By rule 2 applied to
production F → (E), the right parenthesis is also in FOLLOW(E). By rule 3 applied to production
E → TE', $ and the right parenthesis are in FOLLOW(E').
Example 2:
    S → a B D h
    B → c C
    C → b C | ε
    D → E F
    E → g | ε
    F → f | ε
Solution:
FIRST functions:
    FIRST(S) = { a }
    FIRST(B) = { c }
    FIRST(C) = { b, ε }
    FIRST(D) = { FIRST(E) – ε } ∪ FIRST(F) = { g, f, ε }
    FIRST(E) = { g, ε }
    FIRST(F) = { f, ε }
FOLLOW functions:
    FOLLOW(S) = { $ }
    FOLLOW(B) = { FIRST(D) – ε } ∪ FOLLOW(D) = { g, f, h }
    FOLLOW(C) = FOLLOW(B) = { g, f, h }
    FOLLOW(D) = FIRST(h) = { h }
    FOLLOW(E) = { FIRST(F) – ε } ∪ FOLLOW(F) = { f, h }
    FOLLOW(F) = FOLLOW(D) = { h }

CONSTRUCTION OF PREDICTIVE PARSING TABLES
For any grammar G, the following algorithm can be used to construct the predictive parsing
table.
Input: Grammar G. Output: Parsing table M. Method:
1. For each production A → α of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A, a].
3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If
ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].
4. Make each undefined entry of M an error.
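
Applying this algorithm to the expression grammar whose FIRST and FOLLOW sets were computed above yields the following predictive parsing table (a worked example; blank entries are errors):

    Nonterminal   id        +          *          (         )        $
    E             E→TE'                           E→TE'
    E'                      E'→+TE'                         E'→ε     E'→ε
    T             T→FT'                           T→FT'
    T'                      T'→ε       T'→*FT'              T'→ε     T'→ε
    F             F→id                            F→(E)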

LL(1) GRAMMAR

The above algorithm can be applied to any grammar G to produce a parsing table M. For
some Grammars, for example if G is left recursive or ambiguous, then M will have at least
one multiply-defined entry. A grammar whose parsing table has no multiply defined
entries is said to be LL(1). It can be shown that the above algorithm can be used to
produce for every LL(1) grammar G a parsing table M that parses all and only the
sentences of G. LL(1) grammars have several distinctive properties. No ambiguous or left
recursive grammar can be LL(1).

There remains a question of what should be done in case of multiply-defined entries. One
easy solution is to eliminate all left recursion and left factoring, hoping to produce a
grammar which will produce no multiply-defined entries in the parse tables. Unfortunately,
there are some grammars for which no amount of alteration will yield an LL(1) grammar. In
general, there are no universal rules to convert multiply-defined entries into single-valued
entries without affecting the language recognized by the parser.

The main difficulty in using predictive parsing is in writing a grammar for the source
language such that a predictive parser can be constructed from the grammar. Although left-
recursion elimination and left factoring are easy to do, they make the resulting grammar
hard to read and difficult to use for translation purposes. To alleviate some of this
difficulty, a common organization for a parser in a compiler is to use a predictive parser for
control constructs and to use operator precedence for expressions. However, if an LR
parser generator is available, one can get all the benefits of predictive parsing and operator
precedence automatically.

ERROR RECOVERY IN PREDICTIVE PARSING

The stack of a non-recursive predictive parser makes explicit the terminals and non-
terminals that the parser hopes to match with the remainder of the input. We shall
therefore refer to symbols on the parser stack in the following discussion. An error is
detected during predictive parsing when the terminal on top of the stack does not
match the next input symbol or when non-terminal A is on top of the stack, a is the
next input symbol, and the parsing table entry M[A,a] is empty.
Panic-mode error recovery is based on the idea of skipping symbols on the input
until a token in a selected set of synchronizing tokens appears. Its effectiveness
depends on the choice of synchronizing set. The sets should be chosen so that the
parser recovers quickly from errors that are likely to occur in practice. Some
heuristics are as follows
As a starting point, we can place all symbols in FOLLOW(A) into the synchronizing
set for non terminal A. If we skip tokens until an element of FOLLOW(A) is seen
and pop A from the stack, it is likely that parsing can continue.
It is not enough to use FOLLOW(A) as the synchronizing set for A. For example, if
semicolons terminate statements, as in C, then keywords that begin statements
may not appear in the FOLLOW set of the nonterminal generating expressions. A
missing semicolon after an assignment may therefore result in the keyword
beginning the next statement being skipped. Often, there is a hierarchical structure
on constructs in a language; e.g., expressions appear within statements, which
appear within blocks, and so on. We can add to the synchronizing set of a lower
construct the symbols that begin higher constructs. For example, we might add
keywords that begin statements to the synchronizing sets for the nonterminals
generating expressions.

If we add symbols in FIRST(A) to the synchronizing set for non terminal A, then it
may be possible to resume parsing according to A if a symbol in FIRST(A) appears in
the input.

If a non terminal can generate the empty string, then the production deriving e can
be used as a default. Doing so may postpone some error detection, but cannot
cause an error to be missed. This approach reduces the number of non terminals
that have to be considered during error recovery.
If a terminal on top of the stack cannot be matched, a simple idea is to pop the
terminal, issue a message saying that the terminal was inserted, and continue
parsing. In effect, this approach takes the synchronizing set of a token to consist of
all other tokens.

LR PARSING

INTRODUCTION

The "L" is for left-to-right scanning of the input and the "R" is for constructing a
rightmost derivation in reverse.

WHY LR PARSING:
LR parsers can be constructed to recognize virtually all programming-language
constructs for which context-free grammars can be written.
The LR parsing method is the most general non-backtracking shift-reduce parsing
method known, yet it can be implemented as efficiently as other shift-reduce
methods.
The class of grammars that can be parsed using LR methods is a proper superset of
the class of grammars that can be parsed with predictive parsers.
An LR parser can detect a syntactic error as soon as it is possible to do so on a left-
to-right scan of the input. The disadvantage is that it takes too much work to
construct an LR parser by hand for a typical programming-language grammar,
but there are many LR parser generators available to make this task easy.

MODELS OF LR PARSERS

The schematic form of an LR parser is shown below.


The program uses a stack to store a string of the form s0 X1 s1 X2 ... Xm sm, where
sm is on top. Each Xi is a grammar symbol and each si is a symbol representing a
state. Each state symbol summarizes the information contained in the stack below
it. The combination of the state symbol on top of the stack and the current input
symbol is used to index the parsing table and determine the shift-reduce parsing
decision. The parsing table consists of two parts: a parsing action function action
and a goto function goto.

This configuration represents the right-sentential form X1 X2 ... Xm ai ai+1 ... an
in essentially the same way a shift-reduce parser would; only the presence of the
states on the stack is new. Recall the sample parse we did (see Example 1: Sample
bottom-up parse) in which we assembled the right-sentential form by concatenating
the remainder of the input buffer to the top of the stack. The next move of the
parser is determined by reading ai and sm, and consulting the parsing action table
entry action[sm, ai]. Note that we are just looking at the state here and no symbol
below it. We'll see how this actually works later.

The configurations resulting after each of the four types of move are as follows:
If action[sm, ai] = shift s, the parser executes a shift move, entering the
configuration
    (s0 X1 s1 X2 s2 ... Xm sm ai s, ai+1 ... an$)
Here the parser has shifted both the current input symbol ai and the next state s
onto the stack; ai+1 becomes the current input symbol.
If action[sm, ai] = reduce A → β, then the parser executes a reduce move,
entering the configuration
    (s0 X1 s1 X2 s2 ... Xm-r sm-r A s, ai ai+1 ... an$)
where s = goto[sm-r, A] and r is the length of β, the right side of the production.
The parser first popped 2r symbols off the stack (r state symbols and r grammar
symbols), exposing state sm-r. The parser then pushed both A, the left side of the
production, and s, the entry for goto[sm-r, A], onto the stack. The current input
symbol is not changed in a reduce move.

The output of an LR parser is generated after a reduce move by executing the
semantic action associated with the reducing production. For example, we might just
print out the production reduced.

If action[sm, ai] = accept, parsing is completed.

SHIFT REDUCE PARSING

A shift-reduce parser uses a parse stack which (conceptually) contains grammar


symbols. During the operation of the parser, symbols from the input are shifted
onto the stack. If a prefix of the symbols on top of the stack matches the RHS of a
grammar rule which is the correct rule to use within the current context, then the
parser reduces the RHS of the rule to its LHS, replacing the RHS symbols on top of
the stack with the non terminal occurring on the LHS of the rule.

This shift-reduce process continues until the parser terminates, reporting either
success or failure. It terminates with success when the input is legal and is
accepted by the parser. It terminates with failure if an error is detected in the
input. The parser is nothing but a stack automaton which may be in one of several
discrete states. A state is usually represented simply as an integer.

In reality, the parse stack contains states, rather than grammar symbols. However,
since each state corresponds to a unique grammar symbol, the state stack can be
mapped onto the grammar symbol stack mentioned earlier.

The operation of the parser is controlled by a couple of tables:

The program driving the LR parser behaves as follows: It determines sm the state currently
on top of the stack and ai the current input symbol. It then consults action[sm, ai], which
can have one of four values:

· shift s, where s is a state

· reduce by a grammar production A -> β

· accept

· error

The function goto takes a state and grammar symbol as arguments and produces a state.

For a parsing table constructed for a grammar G, the goto table is the transition function of
a deterministic finite automaton that recognizes the viable prefixes of G. Recall that the
viable prefixes of G are those prefixes of right-sentential forms that can appear on the
stack of a shift reduce parser because they do not extend past the rightmost handle.
A configuration of an LR parser is a pair whose first component is the stack contents and
whose second component is the unexpended input:
(s0 X1 s1 X2 s2... Xm sm, ai ai+1... an$)

ACTION TABLE

The action table is a table with rows indexed by states and columns indexed by
terminal symbols. When the parser is in some state s and the current lookahead
terminal is t, the action taken by the parser depends on the contents of
action[s][t], which can contain four different kinds of entries:

Shift s' - shift state s' onto the parse stack.

Reduce r - reduce by rule r; this is explained in more detail below.

Accept - terminate the parse with success, accepting the input.

Error - signal a parse error.

GOTO TABLE

The goto table is a table with rows indexed by states and columns indexed by non
terminal symbols. When the parser is in state s immediately after reducing by rule
N, then the next state to enter is given by goto[s][N].
The current state of a shift-reduce parser is the state on top of the state stack. The
detailed operation of such a parser is as follows:
1. Initialize the parse stack to contain a single state s0, where s0 is the
distinguished initial state of the parser.
2. Use the state s on top of the parse stack and the current lookahead t
to consult the action table entry action[s][t]:
· If the action table entry is shift s' then push state s' onto the stack and advance
the input so that the lookahead is set to the next token.

· If the action table entry is reduce r and rule r has m symbols in its RHS, then pop
m symbols off the parse stack. Let s' be the state now revealed on top of the parse
stack and N be the LHS nonterminal for rule r. Then consult the goto table and
push the state given by goto[s'][N] onto the stack. The lookahead token is not
changed by this step.
· If the action table entry is accept, then terminate the parse with success.
· If the action table entry is error, then signal an error.
3. Repeat step (2) until the parser terminates.
For example, consider the following simple grammar
0) $S: stmt <EOF>
1) stmt: ID ':=' expr
2) expr: expr '+' ID
3) expr: expr '-' ID
4) expr: ID

which describes assignment statements like a := b + c - d. (Rule 0 is a special
augmenting production added to the grammar.)
One possible set of shift-reduce parsing tables is shown below (sn denotes shift n,
rn denotes reduce n, acc denotes accept and blank entries denote error entries):

Parser Tables (the table figure is omitted in this copy of the notes)

SLR PARSER
An LR(0) item (or just item) of a grammar G is a production of G with a dot at
some position of the right side indicating how much of a production we have seen
up to a given point.
For example, for the production E -> E + T we would have the following items:
[E -> .E + T]
[E -> E. + T]
[E -> E +. T]
[E -> E + T.]

CONSTRUCTING THE SLR PARSING TABLE

To construct the parser table we must convert our NFA into a DFA. The states in
the LR table will be the e-closures of the states corresponding to the items I0, I1, ....
The process of creating the LR state table parallels the process of constructing an
equivalent DFA from a machine with e-transitions. Been there, done that - this is
essentially the subset construction algorithm, so we are in familiar territory here.
We need two operations: closure() and goto().
closure()

If I is a set of items for a grammar G, then closure(I) is the set of items
constructed from I by two rules:

1. Initially, every item in I is added to closure(I).
2. If A -> α.Bβ is in closure(I) and B -> γ is a production, then add the initial
item [B -> .γ] to closure(I), if it is not already there. Apply this rule until no
more new items can be added to closure(I).

From our grammar above, if I is the set of one item {[E'-> .E]},

then closure(I) contains:

I0:

E' -> .E

E -> .E + T

E -> .T

T -> .T * F

T -> .F
F -> .(E)

F -> .id

goto()

goto(I, X), where I is a set of items and X is a grammar symbol, is defined to
be the closure of the set of all items [A -> αX.β] such that [A -> α.Xβ] is in I.
The idea here is fairly intuitive: if I is the set of items that are valid for some
viable prefix γ, then goto(I, X) is the set of items that are valid for the viable
prefix γX.
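
The two operations can be made concrete with a small program. The following is a
compilable C sketch that encodes the expression grammar above (E' -> E,
E -> E+T | T, T -> T*F | F, F -> (E) | id) and implements closure() and goto() over
LR(0) items represented as (production, dot-position) pairs. The integer encoding
of symbols is a choice made for this illustration, not something fixed by the notes.

/* LR(0) closure() and goto() - a minimal sketch */
#include <stdio.h>

enum { Ep, E, T, F, PLUS, STAR, LP, RP, ID };   /* Ep stands for E' */
#define IS_NONTERM(s) ((s) <= F)

typedef struct { int lhs; int rhs[3]; int len; } Prod;
static const Prod prods[] = {
    { Ep, { E },          1 },   /* 0: E' -> E      */
    { E,  { E, PLUS, T }, 3 },   /* 1: E  -> E + T  */
    { E,  { T },          1 },   /* 2: E  -> T      */
    { T,  { T, STAR, F }, 3 },   /* 3: T  -> T * F  */
    { T,  { F },          1 },   /* 4: T  -> F      */
    { F,  { LP, E, RP },  3 },   /* 5: F  -> ( E )  */
    { F,  { ID },         1 },   /* 6: F  -> id     */
};
enum { NPROD = 7, MAXITEMS = 64 };

typedef struct { int prod, dot; } Item;

static int has(Item *set, int n, Item it) {
    for (int i = 0; i < n; i++)
        if (set[i].prod == it.prod && set[i].dot == it.dot) return 1;
    return 0;
}

/* closure(I): repeatedly add [B -> .γ] for every nonterminal B right after a dot */
static int closure(Item *set, int n) {
    for (int i = 0; i < n; i++) {                 /* worklist-style scan */
        const Prod *p = &prods[set[i].prod];
        if (set[i].dot >= p->len) continue;       /* dot is at the right end */
        int B = p->rhs[set[i].dot];
        if (!IS_NONTERM(B)) continue;
        for (int j = 0; j < NPROD; j++)
            if (prods[j].lhs == B && !has(set, n, (Item){j, 0}))
                set[n++] = (Item){j, 0};
    }
    return n;
}

/* goto(I, X): advance the dot over X in every item that allows it, then close */
static int gotoset(Item *in, int n, int X, Item *out) {
    int m = 0;
    for (int i = 0; i < n; i++) {
        const Prod *p = &prods[in[i].prod];
        if (in[i].dot < p->len && p->rhs[in[i].dot] == X)
            out[m++] = (Item){in[i].prod, in[i].dot + 1};
    }
    return closure(out, m);
}

int main(void) {
    Item I0[MAXITEMS] = { {0, 0} };           /* start from [E' -> .E] */
    int n0 = closure(I0, 1);
    printf("I0 has %d items\n", n0);          /* prints 7, matching I0 above */
    Item I1[MAXITEMS];
    int n1 = gotoset(I0, n0, E, I1);
    printf("goto(I0, E) has %d items\n", n1); /* [E' -> E.] and [E -> E.+T] */
    return 0;
}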

SETS-OF-ITEMS-CONSTRUCTION

To construct the canonical collection of sets of LR(0) items for an augmented
grammar G':
procedure items(G');
begin
   C := {closure({[S' -> .S]})};
   repeat
      for each set of items I in C and each grammar symbol X
          such that goto(I, X) is not empty and not in C do
         add goto(I, X) to C
   until no more sets of items can be added to C
end;

ALGORITHM FOR CONSTRUCTING AN SLR PARSING TABLE

Input: augmented grammar G'

Output: SLR parsing table functions action and goto for G'

Method:

1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(0) items for G'.

2. State i is constructed from Ii:

(a) If [A -> α.aβ] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to
"shift j". Here a must be a terminal.

(b) If [A -> α.] is in Ii, then set action[i, a] to "reduce A -> α" for all a
in FOLLOW(A). Here A may not be S'.

(c) If [S' -> S.] is in Ii, then set action[i, $] to "accept".

If any conflicting actions are generated by these rules, the grammar is not SLR(1)
and the algorithm fails to produce a parser.

3. The goto transitions for state i are constructed for all nonterminals A using
the rule: if goto(Ii, A) = Ij, then goto[i, A] = j.

4. All entries not defined by rules 2 and 3 are made "error".

5. The initial state of the parser is the one constructed from the set of items
containing [S' -> .S].

Let's work an example to get a feel for what is going on.

An Example

(1) E -> E * B
(2) E -> E + B
(3) E -> B
(4) B -> 0
(5) B -> 1

The Action and Goto Table

The two LR(0) parsing tables for this grammar look as follows:
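
The table figure itself is missing from this copy of the notes. The following ACTION
and GOTO tables, obtained by applying the construction above to this grammar (with
one possible numbering of the item sets), are consistent with it. States 1, 2, 4, 7
and 8 each contain a completed item and therefore reduce on every input; state 3
contains [E' -> E.] and accepts on $:

state |  *    +    0    1    $   |  E   B
  0   |            s1   s2       |  3   4
  1   | r4   r4   r4   r4   r4   |
  2   | r5   r5   r5   r5   r5   |
  3   | s5   s6             acc  |
  4   | r3   r3   r3   r3   r3   |
  5   |            s1   s2       |      7
  6   |            s1   s2       |      8
  7   | r1   r1   r1   r1   r1   |
  8   | r2   r2   r2   r2   r2   |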

LALR PARSER:

We begin with two observations. First, some of the states generated for LR(1)
parsing have the same set of core (or first) components and differ only in their
second component, the lookahead symbol. Our intuition is that we should be able
to merge these states and reduce the number of states we have, getting close to
the number of states that would be generated for LR(0) parsing. This observation
suggests a hybrid approach: We can construct the canonical LR(1) sets of items
and then look for sets of items having the same core. We merge these sets with
common cores into one set of items. The merging of states with common cores
can never produce a shift/reduce conflict that was not present in one of the original
states because shift actions depend only on the core, not the lookahead. But it is
possible for the merger to produce a reduce/reduce conflict.

Our second observation is that we are really only interested in the lookahead symbol
in places where there is a problem. So our next thought is to take the LR(0) set of
items and add lookaheads only where they are needed. This leads to a more
efficient, but much more complicated method.

ALGORITHM FOR EASY CONSTRUCTION OF AN LALR TABLE

Input: G'

Output: LALR parsing table functions action and goto for G'.

Method:
1. Construct C = {I0, I1 , ..., In} the collection of sets of LR(1) items for G'.
2. For each core present among the set of LR(1) items, find all sets having that
core and replace these sets by the union.
3. Let C' = {J0, J1 , ..., Jm} be the resulting sets of LR(1) items. The parsing
actions for state i are constructed from Ji in the same manner as in the
construction of the canonical LR parsing table.
4. If there is a conflict, the grammar is not LALR(1) and the algorithm fails.
5. The goto table is constructed as follows: if J is the union of one or more sets of
LR(1) items, that is, J = I0 U I1 U ... U Ik, then the cores of goto(I0, X), goto(I1, X),
..., goto(Ik, X) are the same, since I0, I1, ..., Ik all have the same core. Let K be
the union of all sets of items having the same core as goto(I1, X).
6. Then goto(J, X) = K.

Consider the example from the CLR construction (grammar S -> CC, C -> cC | d):

I3 and I6 can be replaced by their union:

I36: C -> c.C, c/d/$
     C -> .cC, c/d/$
     C -> .d,  c/d/$

I47: C -> d., c/d/$

I89: C -> cC., c/d/$

Parsing Table

state |   c     d      $    |  S   C

  0   |  S36   S47          |  1   2

  1   |             Accept  |

  2   |  S36   S47          |      5

 36   |  S36   S47          |      89

 47   |  R3    R3     R3    |

  5   |               R1    |

 89   |  R2    R2     R2    |

HANDLING ERRORS

The LALR parser may continue to do reductions after the LR parser would have
spotted an error, but the LALR parser will never do a shift after the point the LR
parser would have discovered the error and will eventually find the error.

LR ERROR RECOVERY

An LR parser will detect an error when it consults the parsing action table and finds
a blank or error entry. Errors are never detected by consulting the goto table. An
LR parser will detect an error as soon as there is no valid continuation for the
portion of the input thus far scanned. A canonical LR parser will not make even a
single reduction before announcing the error. SLR and LALR parsers may make
several reductions before detecting an error, but they will never shift an erroneous
input symbol onto the stack.

PANIC-MODE ERROR RECOVERY

We can implement panic-mode error recovery by scanning down the stack until a
state s with a goto on a particular nonterminal A is found. Zero or more input
symbols are then discarded until a symbol a is found that can legitimately follow
A. The situation might exist where there is more than one choice for the
nonterminal A. Normally these would be nonterminals representing major program
pieces, e.g.
an expression, a statement, or a block. For example, if A is the nonterminal stmt, a
might be semicolon or }, which marks the end of a statement sequence. This
method of error recovery attempts to eliminate the phrase containing the syntactic
error. The parser determines that a string derivable from A contains an error. Part
of that string has already been processed, and the result of this processing is a
sequence of states on top of the stack.

The remainder of the string is still in the input, and the parser attempts to skip over
the remainder of this string by looking for a symbol on the input that can
legitimately follow A. By removing states from the stack, skipping over the input,
and pushing GOTO(s, A) on the stack, the parser pretends that it has found an
instance of A and resumes normal parsing.

PHRASE-LEVEL RECOVERY

Phrase-level recovery is implemented by examining each error entry in the LR
action table and deciding on the basis of language usage the most likely
programmer error that would give rise to that error. An appropriate recovery
procedure can then be constructed; presumably the top of the stack and/or first
input symbol would be modified in a way deemed appropriate for each error entry.
In designing specific error-handling routines for an LR parser, we can fill in each
blank entry in the action field with a pointer to an error routine that will take the
appropriate action selected by the compiler designer.

The actions may include insertion or deletion of symbols from the stack or the input
or both, or alteration and transposition of input symbols. We must make our
choices so that the LR parser will not get into an infinite loop. A safe strategy will
assure that at least one input symbol will be removed or shifted eventually, or that
the stack will eventually shrink if the end of the input has been reached. Popping a
stack state that covers a non terminal should be avoided, because this modification
eliminates from the stack a construct that has already been successfully parsed.

Problems

1) Predictive parser or LL(1) parser


E->E+T| T
T->T*F | F
F-> (E)| id

Solution:

Step 1: Eliminate left recursion

E -> TE'
E' -> +TE' | ε
T -> FT'
T' -> *FT' | ε
F -> (E) | id

Step 2: Find FIRST() and FOLLOW()

FIRST(E) = { ( , id}

FIRST(T) = { ( , id}

FIRST(F) = { ( , id}

FIRST(E’) = { +, Ɛ}

FIRST(T’) = { *, Ɛ}

FOLLOW(E) = { ) , $}

FOLLOW(E’) = { ) , $}

FOLLOW(T) = {+, ), $}

FOLLOW(T’) = {+, ), $}

FOLLOW(F) = {+, *, ), $}

Step 3: Construct the predictive parsing table

Non-
terminal    id          +             *             (            )           $

E        E -> TE'                                E -> TE'

E'                  E' -> +TE'                                E' -> ε     E' -> ε

T        T -> FT'                                T -> FT'

T'                  T' -> ε       T' -> *FT'                  T' -> ε     T' -> ε

F        F -> id                                 F -> (E)

Step 4: Parse the input string id+id*id$

Stack       Input string      Production used

$E id+id*id$ E->TE’

$E’T id+id*id$ T->FT’

$E’T’F id+id*id$ F->id

$E’T’id id+id*id$ Pop

$E’T’ +id*id$ T’->ε

$E’ +id*id$ E’->+TE’

$E’T+ +id*id$ Pop

$E’T id*id$ T->FT’

$E’T’F id*id$ F->id

$E’T’id id*id$ Pop

$E’T’ *id$ T’->*FT’

$E’T’F* *id$ Pop

$E’T’F id$ F->id

$E’T’id id$ Pop

$E’T’ $ T’->ε

$E’ $ E’->ε

$ $ String accepted
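
The same trace can be produced mechanically. Below is a compilable C sketch of
the table-driven predictive parser, with the table from Step 3 encoded in the
rule() function. The single-character encoding of symbols ('e' for E', 't' for T',
'i' for id) is an assumption made for this illustration, not part of the notes.

/* Table-driven LL(1) parser for the grammar above - a minimal sketch */
#include <stdio.h>
#include <string.h>

static const char *rule(char nt, char la) {   /* parsing table M[nt][la] */
    switch (nt) {
    case 'E': return (la=='i'||la=='(') ? "Te" : NULL;        /* E  -> TE'   */
    case 'e': if (la=='+') return "+Te";                      /* E' -> +TE'  */
              return (la==')'||la=='$') ? "" : NULL;          /* E' -> ε     */
    case 'T': return (la=='i'||la=='(') ? "Ft" : NULL;        /* T  -> FT'   */
    case 't': if (la=='*') return "*Ft";                      /* T' -> *FT'  */
              return (la=='+'||la==')'||la=='$') ? "" : NULL; /* T' -> ε     */
    case 'F': if (la=='i') return "i";                        /* F  -> id    */
              return (la=='(') ? "(E)" : NULL;                /* F  -> (E)   */
    }
    return NULL;
}
static int isnonterm(char c) { return strchr("EeTtF", c) != NULL; }

int main(void) {
    const char *input = "i+i*i$";           /* id + id * id */
    char stack[64] = "$E";                  /* $ at bottom, start symbol on top */
    int sp = 1, i = 0;
    while (sp >= 0) {
        char top = stack[sp];
        if (!isnonterm(top)) {              /* terminal (or $): must match input */
            if (top != input[i]) { puts("error"); return 1; }
            if (top == '$') { puts("string accepted"); return 0; }
            sp--; i++;                      /* pop and advance the input */
        } else {
            const char *r = rule(top, input[i]);
            if (!r) { puts("error"); return 1; }
            sp--;                           /* pop the nonterminal ...          */
            for (int k = (int)strlen(r) - 1; k >= 0; k--)
                stack[++sp] = r[k];         /* ... and push its RHS in reverse  */
        }
    }
    return 0;
}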

2) SLR Parser or LR(0)

Steps:
1.create augment grammar
2.generate kernel items
3.find closure
4.compute goto()
5.construct parsing table
6.parse the string
Let us consider grammar:

S->a

S->(L)

L->S

L->L,S

Step 1: Create augment grammar


S is start symbol in the given grammar
Augment grammar is S’-> S

Step 2: Generate kernel items

Introduce a dot in the RHS of the production

S’-> .S

Step 3: Find closure

(Rule: for an item A -> α.Xβ, if the symbol X next to the dot is a nonterminal,
include all the X-productions with the dot at the left end.)

I0:
S' -> .S
S  -> .a
S  -> .(L)

Step 4: Compute goto() (the transitions obtained, leading to item sets I1 through
I8, are tabulated after the parsing table below)
Step 5: construct parsing table

Label the rows of the table with the item set numbers.

Divide the columns into two parts:

• Action part: terminal symbols
• Goto part: nonterminal symbols

            ACTION                        GOTO
state    a     (     )      ,     $       S    L

  0     S2    S3                          1

  1                          accept

  2                 R1     R1    R1

  3     S2    S3                          5    4

  4                 S6     S7

  5                 R3     R3

  6                 R2     R2    R2

  7     S2    S3                          8

  8                 R4     R4

The goto transitions (Step 4):

goto(I0, S)   = 1      goto(I3, '(') = 3
goto(I0, a)   = 2      goto(I4, ')') = 6
goto(I0, '(') = 3      goto(I4, ',') = 7
goto(I3, L)   = 4      goto(I7, S)   = 8
goto(I3, S)   = 5      goto(I7, a)   = 2
goto(I3, a)   = 2      goto(I7, '(') = 3

label    Production     Ending iteration    FOLLOW(LHS)

R0       S' -> S.       I1                  FOLLOW(S') = { $ }

R1       S  -> a.       I2                  FOLLOW(S)  = { $, ), , }

R2       S  -> (L).     I6                  FOLLOW(S)  = { $, ), , }

R3       L  -> S.       I5                  FOLLOW(L)  = { ), , }

R4       L  -> L,S.     I8                  FOLLOW(L)  = { ), , }

Step 6: Parse the string (a,a)$

Stack          Input      Action

$0             (a,a)$     S3

$0(3           a,a)$      S2

$0(3a2         ,a)$       R1

$0(3S5         ,a)$       R3

$0(3L4         ,a)$       S7

$0(3L4,7       a)$        S2

$0(3L4,7a2     )$         R1

$0(3L4,7S8     )$         R4

$0(3L4         )$         S6

$0(3L4)6       $          R2

$0S1           $          accept

Hence the string is parsed successfully.
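
To see the algorithm end to end, the following compilable C sketch hard-codes the
ACTION/GOTO table from Step 5 and re-runs the parse of (a,a). The token codes and
the struct encoding of table entries are choices made for this illustration, not
part of the notes. Running it prints the reduce sequence R1, R3, R1, R4, R2
followed by accept, matching the trace above.

/* Table-driven SLR driver for the grammar S -> a | (L),  L -> S | L,S */
#include <stdio.h>

typedef struct { char kind; int n; } Act;   /* kind: 's','r','a','e' */
#define SH(n) { 's', n }                    /* shift to state n      */
#define RD(n) { 'r', n }                    /* reduce by rule n      */
#define ACC   { 'a', 0 }
#define ERR   { 'e', 0 }

enum { T_a, T_lp, T_rp, T_comma, T_end, NTERM };   /* a ( ) , $ */

/* Rules: R1: S -> a   R2: S -> (L)   R3: L -> S   R4: L -> L,S */
static const int lhs[]    = { -1, 0, 0, 1, 1 };    /* 0 = S, 1 = L   */
static const int rhslen[] = {  0, 1, 3, 1, 3 };    /* |RHS| per rule */

static const Act action[9][NTERM] = {
 /*          a      (      )       ,       $    */
 /* 0 */ { SH(2), SH(3), ERR,    ERR,    ERR   },
 /* 1 */ { ERR,   ERR,   ERR,    ERR,    ACC   },
 /* 2 */ { ERR,   ERR,   RD(1),  RD(1),  RD(1) },
 /* 3 */ { SH(2), SH(3), ERR,    ERR,    ERR   },
 /* 4 */ { ERR,   ERR,   SH(6),  SH(7),  ERR   },
 /* 5 */ { ERR,   ERR,   RD(3),  RD(3),  ERR   },
 /* 6 */ { ERR,   ERR,   RD(2),  RD(2),  RD(2) },
 /* 7 */ { SH(2), SH(3), ERR,    ERR,    ERR   },
 /* 8 */ { ERR,   ERR,   RD(4),  RD(4),  ERR   },
};
static const int gototab[9][2] = {   /* columns: S, L */
    {1,-1},{-1,-1},{-1,-1},{5,4},{-1,-1},{-1,-1},{-1,-1},{8,-1},{-1,-1}
};

static int tok(char c) {
    switch (c) { case 'a': return T_a;  case '(': return T_lp;
                 case ')': return T_rp; case ',': return T_comma;
                 default:  return T_end; }                /* '\0' acts as $ */
}

int main(void) {
    const char *input = "(a,a)";     /* the string parsed in the trace above */
    int stack[64], sp = 0, i = 0;
    stack[0] = 0;                     /* initial state s0 */
    for (;;) {
        Act act = action[stack[sp]][tok(input[i])];
        switch (act.kind) {
        case 's': stack[++sp] = act.n; i++;      break;   /* shift  */
        case 'r':                                         /* reduce */
            sp -= rhslen[act.n];                 /* pop |RHS| states */
            stack[sp + 1] = gototab[stack[sp]][lhs[act.n]];
            sp++;
            printf("R%d\n", act.n);
            break;
        case 'a': printf("accept\n"); return 0;
        default:  printf("error\n");  return 1;
        }
    }
}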

CLR Parser or LR(1):

Steps:
1.create augment grammar
2.generate kernel items and add 2nd component
3.find closure
4.compute goto()
5.construct parsing table
6.parse the string

Let us consider grammar:

S->L=R

S->R

L->*R

L->id

R->L

Rule to find the 2nd component:

Consider an item of the form A -> α.Bβ, a
Then the 2nd component (lookahead) of the B-items is:
   β,        if β is a terminal
   FIRST(β), if β is a nonterminal
   a,        if there is no β

Step 1: Create augment grammar


S is start symbol in the given grammar
Augment grammar is S’-> S

Step 2: Generate kernel items and add 2nd component
Introduce dot in RHS of the production
Add $ as 2nd component separated by comma
S’-> .S , $
Step 3: Find closure
(Rule: for an item A -> α.Xβ, if the symbol X next to the dot is a nonterminal,
include all the X-productions with the dot at the left end.)

I0 (before adding the remaining 2nd components):
S' -> .S, $
S  -> .L=R
S  -> .R
L  -> .*R
L  -> .id
R  -> .L
Next find the 2nd components, comparing each item with the pattern A -> α.Bβ, a:

S' -> .S, $      here there is no β, so $ becomes the 2nd component of the S-items
S  -> .L=R, $    here β is '=', a terminal, so '=' becomes a 2nd component of the L-items
S  -> .R, $      here there is no β, so $ becomes the 2nd component of the R-items
R  -> .L, $      here there is no β, so $ becomes a 2nd component of the L-items

The L-productions are reached from two items, so their lookahead set is =/$.
The complete I0 is therefore:

I0:
S' -> .S, $
S  -> .L=R, $
S  -> .R, $
L  -> .*R, =/$
L  -> .id, =/$
R  -> .L, $
Step 4: Compute goto()

(The goto() computation, which yields the item sets I1 through I9 together with
I4', I5', I7' and I8', appears as a figure in the original slides.)
We notice that some states in CLR parser have the same core items and differ only
in possible lookahead symbols.

Such as

I4 and I4’

I5 and I5’

I7 and I7’

I8 and I8’

So we shrink the obtained CLR parser by merging such states to form LALR Parser

Hence

CLR PARSER has 14 States (I0, I1,I2,I3,I4,I4’,I5,I5’,I6,I7,I7’,I8,I8’,I9)

LALR(1) PARSER has 10 states (I0, I1, I2, I3, I4, I5, I6, I7, I8, I9)

PARSER GENERATOR-YACC
Each translation rule input to YACC has a string specification that resembles a
production of a grammar - it has a nonterminal on the LHS and a few alternatives on
the RHS. For simplicity, we will refer to a string specification as a production. YACC
generates an LALR(1) parser for language L from the productions, which is a bottom-
up parser. The parser would operate as follows: for a shift action, it would invoke the
scanner to obtain the next token and continue the parse by using that token. While
performing a reduce action in accordance with a production, it would perform the
semantic action associated with that production.
The semantic actions associated with productions achieve the building of an
intermediate representation or target code as follows:
● Every nonterminal symbol in the parser has an attribute.
● The semantic action associated with a production can access the attributes of
the nonterminal symbols used in that production - a symbol '$n' in the semantic
action, where n is an integer, designates the attribute of the nth symbol in the
RHS of the production, and the symbol '$$' designates the attribute of the LHS
nonterminal symbol of the production (see the fragment below).
● The semantic action uses the values of these attributes for building the
intermediate representation or target code.
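
For instance, a rule in a .y file might use these attributes as follows (an
illustrative fragment, not taken from these notes; expr and term are assumed to be
declared elsewhere in the grammar):

expr : expr '+' term   { $$ = $1 + $3; }   /* LHS attribute from RHS attributes */
     | term            { $$ = $1; }
     ;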
A parser generator is a program that takes as input a specification of a syntax and
produces as output a procedure for recognizing that language. Historically, they are also
called compiler compilers. YACC (yet another compiler-compiler) is an LALR(1)
(LookAhead, Left-to-right, Rightmost derivation producer with 1 lookahead token)
parser generator. YACC was originally designed for being complemented by Lex.

For Compiling a YACC Program:

1. Write the lex program in a file file.l and the yacc program in a file file.y
2. Open a terminal and navigate to the directory where you have saved the files
3. type: lex file.l
4. type: yacc file.y
5. type: cc y.tab.c -ll   (the yacc output y.tab.c #includes lex.yy.c in the
   example below, so compiling it alone suffices)
6. type: ./a.out

Input File: YACC input file is divided into three parts.
/* definitions */
....

%%
/* rules */
....
%%

/* auxiliary routines */
....

Example
Yacc File (.y)

%{
#include <ctype.h>
#include <stdio.h>
#define YYSTYPE double /* double type for yacc stack */
%}

%%
Lines : Lines S '\n' { printf("OK \n"); }
      | S '\n'
      | error '\n' { yyerror("Error: reenter last line:");
                     yyerrok; };
S     : '(' S ')'
      | '[' S ']'
      | /* empty */ ;
%%

#include "lex.yy.c"
void yyerror(char *s)
/* yacc error handler */
{
    fprintf(stderr, "%s\n", s);
}
int main(void) {
    return yyparse();
}
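
Since the .y file above #includes lex.yy.c, a companion scanner is needed. A
minimal file.l that would work with it might look like this (an illustrative
sketch, not part of the original notes):

%{
/* file.l - returns each significant character as its own token */
%}
%%
[ \t]    ;                      /* skip blanks and tabs */
\n       { return '\n'; }       /* newline is a token in the grammar above */
.        { return yytext[0]; }  /* (, ), [, ] pass through as themselves */
%%
int yywrap(void) { return 1; }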

9. ASSIGNMENTS
1. Construct the predictive parsing table for the following grammar.(CO4, K3)
E -> E+T / T
T -> T & F / F
F -> ! F / (E) /1 / 0

2. Perform Shift-reduce parsing for the given grammars. .(CO4, K3)

i)  S -> a / ^ / (T)          ii) S -> a / ^ / (R)
    T -> T,S / S                  T -> S,T / S
                                  R -> T

Parse the input for both problems: (a,(a,a)) and (((a,a),^,(a)),(a))
3. Construct LR(0) for the given grammar .(CO4, K3)

E -> E sub E sup E


E-> E sub E
E -> E sup E
E -> {E} / c
4. Perform LR(1) for the given grammar .(CO4, K3)

i) S ->C C
C -> Cc
C ->d
ii) S -> Aa / bAc / Bc / bBa
A ->d
B ->d
Check whether the grammar is LR(1) but not LALR(1).
5. Perform LALR for the given grammar .(CO4, K3)

i) S ->C C
C -> Cc
C ->d
ii) S -> Aa / bAc / dc / bda
A -> d
Check whether the grammar is LALR(1) but not SLR(1).

10. PART A : Q & A : UNIT – IV
1. What are the two parts of a compilation? Explain briefly. (CO1, K1)
Analysis and Synthesis are the two parts of compilation (front-end and back-end).
● The analysis part breaks up the source program into constituent pieces and
creates an intermediate representation of the source program.
● The synthesis part constructs the desired target program from the
intermediate representation.

2. List the various compiler construction tools. (CO1, K2)
● Parser generators
● Scanner generators
● Syntax-directed translation engines
● Automatic code generators
● Data-flow engines
● Compiler construction tool kits

3. Differentiate compiler and interpreter. (CO1, K1)
● The machine-language target program produced by a compiler is usually much
faster than an interpreter at mapping inputs to outputs.
● An interpreter, however, can usually give better error diagnostics than a
compiler, because it executes the source program statement by statement.

4. Define tokens, patterns, and lexemes. (CO1, K2)
● Token: a sequence of characters that has a collective meaning. A token of a
language is a category of its lexemes.
● Pattern: there is a set of strings in the input for which the same token is
produced as output. This set of strings is described by a rule called a pattern
associated with the token.
● Lexeme: a sequence of characters in the source program that is matched by
the pattern for a token.

5. Describe the possible error recovery actions in a lexical analyzer. (CO1, K1)
● Panic mode recovery
● Deleting an extraneous character
● Inserting a missing character
● Replacing an incorrect character by a correct character
● Transposing two adjacent characters
6. What is a sentinel? (CO1, K1)
For each character read, we make two tests in the lexical analysis phase: one for
the end of the buffer, and one to determine what character is read (the latter may
be a multiway branch). We can combine the buffer-end test with the test for the
current character if we extend each buffer to hold a sentinel character at the end.
The sentinel is a special character that cannot be part of the source program, and
a natural choice is the character eof.

7. Define symbol table. (CO1, K2)
A symbol table is a data structure containing a record for each identifier, with
fields for the attributes of the identifier. The data structure allows us to find the
record for each identifier quickly and to store or retrieve data from that record
quickly. Whenever an identifier is detected by the lexical analyzer, it is entered
into the symbol table. The attributes of an identifier cannot be determined by the
lexical analyzer.

8. Define a preprocessor. (CO1, K2)
A source program may be divided into modules stored in separate files. The task
of collecting the source program is sometimes entrusted to a separate program,
called a preprocessor. The preprocessor may also expand shorthands, called
macros, into source language statements.

9. What are the functions of preprocessors? (CO1, K2)
The task of collecting the source program is sometimes entrusted to a separate
program, called a preprocessor. The preprocessor may also expand shorthands,
called macros, into source language statements.

10. What are the various parts of a LEX program? (CO1)
A LEX program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
11. How will you group the phases of the compiler? (CO1, K1)
The front-end phases of lexical analysis, syntax analysis, semantic analysis, and
intermediate code generation might be grouped together into one pass. Code
optimization might be an optional pass. Then there could be a back-end pass
consisting of code generation for a particular target machine.

12. Name some varieties of intermediate forms. (CO1, K1)
● Postfix notation or polish notation
● Syntax tree
● Three-address code
● Quadruple
● Triple

13. Write a regular expression to describe a language consisting of strings of
even length over a and b. (CO1, K1)
r = ((a|b)(a|b))*

14. Give the transition diagram for an identifier. (CO1, K1)
(The diagram is omitted here: from the start state, an edge labeled letter leads
to an accepting state that has a self-loop labeled letter or digit.)

15. Write a short note on LEX. (CO1)
A LEX source program is a specification of a lexical analyzer consisting of a set
of regular expressions together with an action for each regular expression. The
action is a piece of code which is to be executed whenever a token specified by
the corresponding regular expression is recognized. The output of LEX is a
lexical analyzer program constructed from the LEX source specification.
16. Define parser and its types. (CO2, K1)
A parser for grammar G is a program that takes a string w as input and produces
either a parse tree for w, if w is a sentence of G, or an error message indicating
that w is not a sentence of G. It obtains a string of tokens from the lexical
analyzer and verifies that the string can be generated by the grammar of the
source language. Types: 1. Top-down 2. Bottom-up

17. What are the different levels of syntax error handling? (CO2, K1)
● Lexical, such as misspelling an identifier, keyword or operator
● Syntactic, such as an arithmetic expression with unbalanced parentheses
● Semantic, such as an operator applied to an incompatible operand
● Logical, such as an infinitely recursive call

18. What are the goals of the error handler in a parser? (CO2, K1)
● It should report the presence of errors clearly and accurately.
● It should recover from each error quickly enough to be able to detect
subsequent errors.
● It should not significantly slow down the processing of correct programs.

19. What are the error recovery strategies in a parser? (CO2, K1)
● Panic mode
● Phrase level
● Error productions
● Global corrections

20. Define CFG. (CO2, K1)
A grammar defines rules that describe the syntactic structure of well-formed
programs. The syntax of programming language constructs can be described by a
CFG. It is denoted as G = (V, T, P, S).

21. Define ambiguity. (CO2, K1)
A grammar G is said to be ambiguous if it produces more than one parse tree for
the same sentence (yield) w derived from the same start symbol.
22. Define yield of a parse tree. (CO2, K1)
A string that is derived from the parse tree by concatenating all the leaves,
starting from the leftmost leaf to the rightmost leaf, is called the yield of the
tree.

23. Define left factoring. (CO2, K1)
When a production has more than one alternative with common prefixes, it is
necessary to make the right choice of production. This can be done by rewriting
the production until enough of the input has been seen. To perform left factoring
for the productions A -> αβ1 | αβ2, rewrite A -> αA' and A' -> β1 | β2.

24. What are the difficulties with top-down parsing? (CO2, K2)
● Left recursion
● Backtracking
● The order in which alternatives are tried can affect the language accepted.
● When failure is reported, there is very little idea where the error actually
occurred.

25. What is meant by a recursive-descent parser? (CO2, K1)
A parser which uses a set of recursive procedures to recognize its input with no
backtracking is called a recursive-descent parser.

26. What are the actions available in a shift-reduce parser? (CO2, K2)
● Shift
● Reduce
● Accept
● Error

27. Define viable prefixes. (CO2, K2)
The set of prefixes of right sentential forms that can appear on the stack of a
shift-reduce parser are called viable prefixes.
28. Define handle. (CO2, K1)
A handle of a string is a substring that matches the right side of a production
and whose reduction to the nonterminal on the left side of the production
represents one step along the reverse of a rightmost derivation.

29. Define LL(1) grammar. (CO2, K2)
● L - the input sequence is processed from left to right.
● L - a leftmost derivation is performed.
● 1 - at most 1 symbol of the sequence is used to make a decision.

30. Define LR parser. (CO2, K2)
An LR parser can be used to parse a large class of context-free grammars. The
technique is called LR(k) parsing.
● L - the input sequence is processed from left to right.
● R - a rightmost derivation (in reverse) is performed.
● k - at most k symbols of the sequence are used to make a decision.

31. Give the reasons for using an LR parser. (CO2, K2)
● An LR parser can handle a large class of context-free languages.
● An LR parser can detect syntax errors as soon as they occur.
● It is the most general non-backtracking shift-reduce parsing method.
● It can handle all languages recognizable by LL(1).

32. What are the drawbacks of LR parsers? (CO2, K2)
● An LR parser needs an automated parser generator, as the parsing tables are
too complicated to be generated by hand.
● It cannot handle ambiguous grammars without special tricks.
33. What are the parts of a YACC program? (CO2, K1)
There are 3 parts:
● Declarations
● Translation rules
● Supporting C routines

34. Define handle pruning. (CO2, K1)
Handle pruning is the process of reducing β to A in αβw; it means removing the
children of A from the parse tree.

35. Define LALR grammar. (CO2, K1)
An LALR (Lookahead LR) parser is an extension of the canonical LR parser. LALR
parsing tables are constructed by merging the item sets of the canonical LR
parser that possess the same LR(0) items (cores). Hence, the parsing table is
relatively smaller than that of CLR. If there are no action conflicts, then the
given grammar is said to be an LALR(1) grammar. The collection of sets of items
so constructed is called the LALR(1) collection.
11. PART B QUESTIONS : UNIT – IV

(CO4, K2)

1.Describe the various phases of compiler and trace it with the program segment
(t=b*-c+b*-c).
2.Explain in detail the process of compilation. Illustrate the output of each phase of
the compilation for the input “a = (b+c) * (b+c) *2”
3. Discuss various buffering techniques in detail.
4. Construct Regular expression to NFA for the sentence (a|b)*a.
5. Construct NFA using the regular expression (a/b)*abb.
6. Construct the NFA from the (a|b)*a(a|b) using Thompson’s construction
algorithm.
7. Construct the DFA for the augmented regular expression (a | b )* # directly
using syntax tree.
8. Write an algorithm for minimizing the number of states of a DFA.

PART C QUESTIONS
(CO4, K3)

1. Explain the structure of a LEX program with an example.


2. Illustrate how LEX works.
3.Consider the regular expression below which can be used as part of a
specification of the definition of exponents in floating-point numbers. Assume that
the alphabet consists of numeric digits (‘0’ through ‘9’) and alphanumeric
characters (‘a’ through ‘z’ and ‘A’ through ‘Z’) with the addition of a selected small
set of punctuation and special characters. (say in this example only the characters
‘+’ and ‘-‘ are relevant). Also, in this representation of regular expressions the
character ‘.’ Denotes concatenation.
Exponent = (+|-|€) . (E | e) . (digit)+
(i) Derive an NFA capable of recognizing its language using Thompsons’
construction.
(ii) Derive the DFA for the NFA found in i) above using subset construction.
(iii) Minimize the DFA found in (ii) above.
4. Write Lex program to identify the following tokens - relational operators ,
arithmetic operators and keywords (if, while, do, switch, for ).

11. PART B QUESTIONS : UNIT – IV (continued)

9 Consider the following CFG grammar over the non-terminals {X, Y, Z} and
terminals {a, b, d} with the productions below and start symbol Z. (CO4,K3)
X -> a
X -> Y
Z -> d
Z -> X Y Z
Y -> c
Y -> ε
Compute the FIRST and FOLLOW sets of every non-terminal and the set of
non-terminals that are nullable. Construct the predictive parsing table.
10. Construct predictive parsing table for the grammar (CO4,K3)
E->E+T | T, T->T*F | F, F->(E)|id
Using predictive parsing, parse the string id+id*id.
11.Construct predictive parsing table for the grammar . (CO4,K3)
S->(L) | a, L->L,S | S
and show whether the following string will be accepted or not. (a, (a, (a,a)))
12. Construct the SLR parsing table for the following grammar. Show the actions
for the parser for the input string “abab” & “baab” (CO4,K3)
S->AS | b , A->SA|a
13. Write the LR parsing algorithm. Check whether the grammar is SLR (1) or not.
Justify the answer with reasons. (CO4,K3)
S->L=R | R; L->*R | id; R->L
14. Construct the stack implementation of shift-reduce parsing for the grammar
(CO4, K3)
E -> E+E | E*E | (E) | id and the input string id1 + id2 * id3

15. Consider the CFG depicted below where “begin”, “end” and “x” are all terminal
symbols of the grammar and stat is considered the starting symbol for this
grammar. Productions are numbered in parenthesis and you can abbreviate
“begin” to “b” and “end” to “e” respectively.
Stat -> Block
Block -> begin Block end
Block -> Body
Body -> x
(i) Compute the set of LR (1) items for this grammar and draw the
corresponding DFA. Do not forget to augment the grammar with the initial
production S' -> Stat $ as production (0).
(ii)Construct the corresponding LR parsing table.
16. Show that the grammar
S -> Aa | bAc | Bc | bBa
A -> d
B -> d
is LR(1) but not LALR(1), and parse the statements "bdc" and "dd".
17. Show that the following grammar
S -> Aa | bAc | dc | bda
A -> d
is LALR(1) but not SLR(1).
18. Design the Syntax analyzer using YACC tool.

12. SUPPORTIVE ONLINE CERTIFICATION COURSES

NPTEL : https://swayam.gov.in/nd1_noc19_cs79/preview

Swayam : https://nptel.ac.in/courses/106/106/106106049/

coursera : https://www.coursera.org/learn/cs-algorithms-theory-machines

Udemy : https://www.udemy.com/course/theory-of-computation-toc/

Mooc :https://www.mooc-list.com/tags/theory-computation

NPTEL : https://nptel.ac.in/courses/106/105/106105190/
Swayam : https://www.classcentral.com/course/swayam-compiler-design-12926
coursera : https://www.coursera.org/learn/nand2tetris2
Udemy : https://www.udemy.com/course/introduction-to-compiler-construction-and-design/
Mooc : https://www.mooc-list.com/course/compilers-coursera
edx : https://www.edx.org/course/compilers
13. Real time Applications in day to day life and to Industry

1. Characterizing Tokens in the JavaCC Grammar File


2. Use the StreamTokenizer object to implement an interactive calculator
3. Regular expressions which can be useful in the real-world applications
○ Email validation
○ Password validation
○ Valid date format
○ Empty string validation
○ Phone number validation
○ Credit card number Validation

4. Regular expressions are useful in a wide variety of text processing tasks,
including data validation, data scraping (especially web scraping), data
wrangling, simple parsing, the production of syntax highlighting systems, and
many other tasks.
5. While regexps would be useful on Internet search engines,
processing them across the entire database could consume excessive
computer resources depending on the complexity and design of the
regex.
6. One salient example of a commercial, real-life application of
deterministic finite automata (DFAs) is Microsoft's historical reliance
on finite-state technology from Xerox PARC. Microsoft has used two-level
automata for spell-checking and other functions in various products for
three decades.
7. The various applications of automata include sequential machines and vending
machines. Simple video games and text matching can also be done with
automata.

Natural Language Processing - Syntactic Analysis
Syntactic analysis, or parsing, or syntax analysis is the third phase of NLP. The
purpose of this phase is to draw the exact meaning, or dictionary meaning, from the
text. Syntax analysis checks the text for meaningfulness against the rules of formal
grammar. For example, a sentence like "hot ice-cream" would be rejected by a
semantic analyzer. In this sense, syntactic analysis or parsing may be defined as
the process of analyzing strings of symbols in natural language conforming to the
rules of formal grammar. The origin of the word 'parsing' is from the Latin word
'pars', which means 'part'.

Detailed description is available :


https://www.tutorialspoint.com/natural_language_processing/natural_language_pro
cessing_syntactic_analysis.htm

Some of the most common real-world applications:

· Text summarization, to automatically provide the most important points of the
original document. This is particularly good for news summaries and for learning
semantic relations between named entities.
· POS, or parts-of-speech, tagging of the language. It is about finding parts of
speech such as nouns, adjectives and verbs based on context and relationship to
adjacent words.
· Coreference resolution. It is about understanding references to multiple
entities existing in the text and disambiguating those references, for example
the difference between a mouse (the animal) and a mouse (PC input device).
· Argument mining, which is the automatic extraction and identification of
argumentative structures in a piece of text, typically a journal article.
14. CONTENTS BEYOND SYLLABUS : UNIT – IV

Lexical analysis and Java:


Learn how to convert human readable text into machine readable data using
the StringTokenizer and StreamTokenizer classes

Java's lexical analyzers


The Java Language Specification, version 1.0.2, defines two lexical analyzer
classes, StringTokenizer and StreamTokenizer. From their names you can deduce
that StringTokenizer uses String objects as its input, and StreamTokenizer uses
InputStream objects.

As a lexical analyzer, StringTokenizer could be formally defined as shown below.

[~delim1,delim2,...,delimN] :: Token

Detailed description can be obtained from the link,


https://www.infoworld.com/article/2076874/lexical-analysis-and-java--part-1.html

Text Parsing:
Text parsing is a technique which is used to derive a text string using
the production rules of a grammar to check the acceptability of a string.

Regular Expression Matching:


It is a technique for checking whether two or more regular expressions are
similar to each other or not. A finite state machine is useful for checking
whether the expressions are acceptable by a machine.
Speech Recognition:
Speech recognition via machine is a technology capable of identifying words and
phrases in spoken language and converting them to a machine-readable format.
Receiving words and phrases from the real world and then converting them into a
machine-readable language automatically is effectively solved by using finite
state machines.
Detailed description can be obtained from the link,
https://yashindiasoni.medium.com/real-world-applications-of-automata-88c7ba254e80

In computer-based language recognition, ANTLR (pronounced antler), or ANother Tool
for Language Recognition, is a parser generator that uses LL(*) for parsing. ANTLR is
the successor to the Purdue Compiler Construction Tool Set (PCCTS), first
developed in 1989, and is under active development. Its maintainer is Professor Terence
Parr of the University of San Francisco.
Example
In the following example, an ANTLR grammar describes sums of expressions of the
form "1 + 2 + 3":
// Common options, for example, the target language
options
{
language = "CSharp";
}
// Followed by the parser
class SumParser extends Parser;
options
{
k = 1; // Parser Lookahead: 1 Token
}
// Definition of an expression
statement: INTEGER (PLUS^ INTEGER)*;
// Here is the Lexer
class SumLexer extends Lexer;
options
{
k = 1; // Lexer Lookahead: 1 character
}
PLUS: '+';
DIGIT: ('0'..'9');
INTEGER: (DIGIT)+;

Detailed description can be obtained from the link,


https://www.antlr.org/
https://en.wikipedia.org/wiki/ANTLR
15. ASSESSMENT SCHEDULE

Tentative dates for Assessment Tests and Model Exam:

Unit Test I                 : Jan 3 - Jan 11

Internal Assessment Test I  : Jan 28 - Feb 3

Unit Test II                : Feb 24 - Mar 1

Internal Assessment Test II : Mar 10 - Mar 15

Revision Test               :

Model Examination           : Apr 3 - Apr 17

16. Prescribed Text Books & Reference Books
TEXT BOOKS:

1. John E. Hopcroft, Rajeev Motwani, Jeffrey D. Ullman, Introduction to Automata Theory,


Languages, and Computation, 3rd Edition, Pearson Education, 2008.

2. Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, “Compilers Principles,

Techniques and Tools”, Second Edition, Pearson, 2013.

REFERENCES:
1.K.L.P Mishra and Chandrashekaran, Theory of Computer Science – Automata languages
and computation,3rd Edition, PHI, 2007.
2. Elain Rich, “Automata, Computability and complexity”, 1st Edition, Pearson Education,
2018.
3. Peter Linz, “An introduction to Formal Languages and Automata”, Jones and Bartlett
Publishers, 6th Edition, 2016.

4. K Muneeswaran, “Compiler Design”, Oxford University Press, 2013.


5. John C Martin, Introduction to Languages and The Theory of Computation, TMH, 4th
Edition, 2010.

17. MINI PROJECT SUGGESTION
• Objective:
Design of a lexical analyzer generator; design of an automaton for pattern matching.
• Planning:
• This method is mostly used to improve the ability of students in the application
domain and also to reinforce knowledge imparted during the lectures.
• Students are asked to prepare mini projects involving application of the concepts,
principles or laws learnt.
• The faculty guides the students at various stages of developing the project and
gives timely inputs for the development of the model.
• Students convert their ideas into real-time applications.

Projects:

Set - 1 (Toppers): Create a vending machine as an automated machine, using
finite state automata to control its functioning. (CO4, K3)

Set - 2 (Above Average): Use Flex to create a lexical analyzer for C. (CO4, K3)

Set - 3 (Average): Regular expression matching - check whether two or more
regular expressions are similar to each other or not. (CO4, K3)

Set - 4 (Below Average): Design, develop and implement a YACC/C program to
demonstrate the shift-reduce parsing technique for the grammar rules
E -> E+T | T, T -> T*F | F, F -> (E) | id, and parse the sentence id + id * id.
(CO4, K3)

Set - 5 (Slow Learners): Token Explorer: understanding lexical analysis in steps.
(CO4, K3)

