
COMPILER DESIGN

COMPUTER SCIENCE AND ENGINEERING

LECTURE NOTES

Course Name COMPILER DESIGN

Course Code CS602PC

Class III B. Tech II Semester

Branch CSE

Course Faculty    Mrs. Ch.Sravani, Asst. Prof., CSE

SNO  Course Outcomes

1    Demonstrate the ability to design a compiler given a set of language features
2    Demonstrate the knowledge of patterns, tokens & regular expressions for lexical analysis
3    Acquire skills in using the lex tool & yacc tool for developing a scanner and parser
4    Design and implement LL and LR parsers
5    Design algorithms to do code optimization in order to improve the performance of a program in terms of space and time complexity
6    Design algorithms to generate machine code
Computers are a balanced mix of software and hardware. Hardware is just a piece of mechanical device whose functions are controlled by compatible software. Hardware understands instructions in the form of electronic charge, which is the counterpart of binary language in software programming. Binary language has only two alphabets, 0 and 1. To instruct the hardware, codes must be written in binary format, which is simply a series of 1s and 0s. Writing such codes directly would be a difficult and cumbersome task for computer programmers, which is why we have compilers to translate human-readable programs into such codes.

Language Processing System


We have learnt that any computer system is made of hardware and software. The hardware understands a language which humans cannot understand. So we write programs in a high-level language, which is easier for us to understand and remember. These programs are then fed into a series of tools and OS components to get the desired code that can be used by the machine. This is known as the Language Processing System.


2. COMPILER ARCHITECTURE

The high-level language is converted into binary language in various phases. A compiler is a
program that converts high-level language to assembly language. Similarly, an assembler is a
program that converts the assembly language to machine-level language.
Let us first understand how a program, using a C compiler, is executed on a host machine.

□ User writes a program in C language (high-level language).

□ The C compiler compiles the program and translates it into an assembly program (low-level language).

□ An assembler then translates the assembly program into machine code (object).

□ A linker tool is used to link all the parts of the program together for execution
(executable machine code).

□ A loader loads all of them into memory and then the program is executed.
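For instance, with a typical GCC toolchain on a Unix-like system, these stages can be driven individually (a sketch; exact commands and flags vary by platform):

    cc -E hello.c -o hello.i    # preprocessor: expand #includes and macros
    cc -S hello.i -o hello.s    # compiler proper: C to assembly
    as hello.s -o hello.o       # assembler: assembly to object code
    cc hello.o -o hello         # driver invokes the linker: object file(s) to executable
    ./hello                     # the OS loader maps the executable into memory and runs it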

Before diving straight into the concepts of compilers, we should understand a few other tools that
work closely with compilers.

Preprocessor

A preprocessor, generally considered a part of the compiler, is a tool that produces input for compilers. It deals with macro-processing, augmentation, file inclusion, language extension, etc.

Interpreter

An interpreter, like a compiler, translates high-level language into low-level machine language. The difference lies in the way they read the source code or input. A compiler reads the whole source code at once, creates tokens, checks semantics, generates intermediate code, translates the whole program and may involve many passes. In contrast, an interpreter reads a statement from the input, converts it to an intermediate code, executes it, then takes the next statement in sequence. If an error occurs, an interpreter stops execution and reports it; whereas a compiler reads the whole program even if it encounters several errors.

Assembler

An assembler translates assembly language programs into machine code. The output of an
assembler is called an object file, which contains a combination of machine instructions as well as
the data required to place these instructions in memory.

Linker
Linker is a computer program that links and merges various object files together in order to make an executable file. All these files might have been compiled by separate assemblers. The major task of a linker is to search and locate referenced modules/routines in a program and to determine the memory location where these codes will be loaded, so that the program instructions have absolute references.

Loader

Loader is a part of the operating system and is responsible for loading executable files into memory and executing them. It calculates the size of a program (instructions and data) and creates memory space for it. It initializes various registers to initiate execution.

Cross-compiler

A compiler that runs on platform (A) and is capable of generating executable code for platform
(B) is called a cross-compiler.

Source-to-source Compiler

A compiler that takes the source code of one programming language and translates it into the
source code of another programming language is called a source-to-source compiler.
A compiler can broadly be divided into two phases based on the way it compiles.

Analysis Phase
Known as the front-end of the compiler, the analysis phase of the compiler reads the source
program, divides it into core parts, and then checks for lexical, grammar, and syntax errors. The
analysis phase generates an intermediate representation of the source program and symbol table,
which should be fed to the Synthesis phase as input.

Synthesis Phase
Known as the back-end of the compiler, the synthesis phase generates the target program with the
help of intermediate source code representation and symbol table.

A compiler can have many phases and passes.

□ Pass : A pass refers to the traversal of a compiler through the entire program.

□ Phase : A phase of a compiler is a distinguishable stage, which takes input from the
previous stage, processes and yields output that can be used as input for the next stage.
A pass can have more than one phase.
3. PHASES OF COMPILER

The compilation process is a sequence of various phases. Each phase takes input from its previous
stage, has its own representation of source program, and feeds its output to the next phase of the
compiler. Let us understand the phases of a compiler.

Lexical Analysis

The first phase of the compiler works as a text scanner. This phase scans the source code as a stream of characters and converts it into meaningful lexemes. The lexical analyzer represents these lexemes in the form of tokens as:

<token-name, attribute-value>

Syntax Analysis
The next phase is called the syntax analysis or parsing. It takes the token produced by lexical
analysis as input and generates a parse tree (or syntax tree). In this phase, token arrangements are
checked against the source code grammar, i.e., the parser checks if the expression made by the
tokens is syntactically correct.

Semantic Analysis

Semantic analysis checks whether the parse tree constructed follows the rules of the language. For example, it checks that values are assigned between compatible data types, and it flags errors such as adding a string to an integer. The semantic analyzer also keeps track of identifiers, their types and expressions, and whether identifiers are declared before use or not, etc. The semantic analyzer produces an annotated syntax tree as an output.

Intermediate Code Generation

After semantic analysis, the compiler generates an intermediate code of the source code for the
target machine. It represents a program for some abstract machine. It is in between the high-level
language and the machine language. This intermediate code should be generated in such a way that
it makes it easier to be translated into the target machine code.

Code Optimization

The next phase does code optimization of the intermediate code. Optimization can be assumed as something that removes unnecessary code lines and arranges the sequence of statements in order to speed up program execution without wasting resources (CPU, memory).

Code Generation

In this phase, the code generator takes the optimized representation of the intermediate code and
maps it to the target machine language. The code generator translates the intermediate code into
a sequence of (generally) re-locatable machine code. Sequence of instructions of machine code
performs the task as the intermediate code would do.
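Continuing the example above, after optimization (folding the conversion of 60 into the constant 60.0) a code generator might map the intermediate code to register-based target instructions such as the following (an illustrative sketch, not any specific instruction set):

    LDF  R2, rate        ; load rate into register R2
    MULF R2, R2, #60.0   ; R2 = rate * 60.0
    LDF  R1, initial     ; load initial into R1
    ADDF R1, R1, R2      ; R1 = initial + rate * 60.0
    STF  position, R1    ; store the result into position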

Symbol Table
It is a data-structure maintained throughout all the phases of a compiler. All the identifiers’ names
along with their types are stored here. The symbol table makes it easier for the compiler to quickly
search the identifier record and retrieve it. The symbol table is also used
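A minimal sketch of a symbol-table entry in C (the field and function names are illustrative; a real compiler records far more, such as storage layout and source locations):

    struct symbol {
        char *name;              /* the identifier's lexeme */
        int   type;              /* e.g., TYPE_INT, TYPE_FLOAT (hypothetical codes) */
        int   scope;             /* nesting depth at which it was declared */
        struct symbol *next;     /* chains entries within one hash bucket */
    };

    struct symbol *lookup(const char *name);                  /* search by lexeme */
    struct symbol *insert(const char *name, int type, int scope);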
LEXICAL ANALYZER
Lexical analysis is the first phase of a compiler. It takes the modified source code from language preprocessors, written in the form of sentences. The lexical analyzer breaks these sentences into a series of tokens, removing any whitespace or comments in the source code.

If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works closely
with the syntax analyzer. It reads character streams from the source code, checks for legal tokens,
and passes the data to the syntax analyzer when it demands.

Tokens
Lexemes are said to be a sequence of characters (alphanumeric) in a token. There are some
predefined rules for every lexeme to be identified as a valid token. These rules are defined by
grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns are
defined by means of regular expressions.

In programming language, keywords, constants, identifiers, strings, numbers, operators, and


punctuations symbols can be considered as tokens.
For example, in C language, the variable declaration line

int value = 100;

contains the tokens: int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).

Specifications of Tokens
Let us understand how language theory treats the following terms:

Alphabets

Any finite set of symbols is an alphabet: {0,1} is the set of binary alphabets, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the set of hexadecimal alphabets, and {a-z, A-Z} is the set of English language alphabets.

Strings

Any finite sequence of alphabets is called a string. The length of a string is the total number of occurrences of alphabets in it; e.g., the length of the string tutorialspoint is 14, denoted |tutorialspoint| = 14. A string having no alphabets, i.e. a string of zero length, is known as an empty string and is denoted by ε (epsilon).

Special Symbols
A typical high-level language contains the following symbols:

Arithmetic Symbols      Addition(+), Subtraction(-), Modulo(%), Multiplication(*), Division(/)
Punctuation             Comma(,), Semicolon(;), Dot(.), Arrow(->)
Assignment              =
Special Assignment      +=, /=, *=, -=
Comparison              ==, !=, <, <=, >, >=
Preprocessor            #
Location Specifier      &
Logical                 &, &&, |, ||, !
Shift Operator          >>, >>>, <<, <<<

Language

A language is considered as a finite set of strings over some finite set of alphabets. Computer languages are considered as finite sets, and mathematically set operations can be performed on them. Finite languages can be described by means of regular expressions.

The Role of the Lexical Analyzer

1. Lexical Analysis Versus Parsing
2. Tokens, Patterns, and Lexemes
3. Attributes for Tokens
4. Lexical Errors

As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well. When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table. In some cases, information regarding the kind of identifier may be read from the symbol table by the lexical analyzer to assist it in determining the proper token it must pass to the parser.

1. Lexical Analysis Versus Parsing

There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases.

1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and whitespace as syntactic units would be considerably more complex than one that can assume comments and whitespace have already been removed by the lexical analyzer. If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design.

2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.

3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

2. Tokens, Patterns, and Lexemes

When discussing lexical analysis, we use three related but distinct terms:

• A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. In what follows, we shall generally write the name of a token in boldface. We will often refer to a token by its token name.

• A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.

• A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Example 3.1: Figure 3.2 gives some typical tokens, their informally described patterns, and some sample lexemes. To see how these concepts are used in practice, in the C statement

printf("Total = %d\n", score);

both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching the pattern for token literal.

In many programming languages, the following classes cover most or all of the tokens: keywords, operators, identifiers, constants, literal strings, and punctuation symbols.

3. Attributes for Tokens


When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched. For example, the pattern for token number matches both 0 and 1, but it is extremely important for the code generator to know which lexeme was found in the source program. Thus, in many cases the lexical analyzer returns to the parser not only a token name, but an attribute value that describes the lexeme represented by the token; the token name influences parsing decisions, while the attribute value influences translation of tokens after the parse.


We shall assume that tokens have at most one associated attribute, although this attribute may have a structure that combines several pieces of information. The most important example is the token id, where we need to associate with the token a great deal of information. Normally, information about an identifier (e.g., its lexeme, its type, and the location at which it is first found, in case an error message about that identifier must be issued) is kept in the symbol table. Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
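A sketch of how a token with one attribute might be modeled in C (the names are illustrative, not from any particular compiler):

    struct symbol;                     /* symbol-table entry, defined elsewhere */

    enum token_name { ID, NUMBER, ASSIGN_OP, MULT_OP, EXP_OP };

    struct token {
        enum token_name name;          /* the abstract token name for the parser */
        union {
            struct symbol *sym;        /* for ID: pointer to the symbol-table entry */
            const char    *spelling;   /* for NUMBER: the constant's character string */
        } attr;                        /* at most one attribute, possibly unused */
    };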

Tricky Problems When Recognizing Tokens

Usually, given the pattern describing the lexemes of a token, it is relatively simple to recognize matching lexemes when they occur on the input. However, in some languages it is not immediately apparent when we have seen an instance of a lexeme corresponding to a token. The following example is taken from Fortran, in the fixed-format still allowed in Fortran 90. In the statement

DO 5 I = 1.25

it is not apparent that the first lexeme is DO5I, an instance of the identifier token, until we see the dot following the 1. Note that blanks in fixed-format Fortran are ignored (an archaic convention). Had we seen a comma instead of the dot, we would have had a do-statement

DO 5 I = 1,25

in which the first lexeme is the keyword DO.

Example 3.2: The token names and associated attribute values for the Fortran statement

E = M * C ** 2

are written below as a sequence of pairs.

<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>

Note that in certain pairs, especially operators, punctuation, and keywords, there is no need for an attribute value. In this example, the token number has been given an integer-valued attribute. In practice, a typical compiler would instead store a character string representing the constant and use as an attribute value for number a pointer to that string. •

4. Lexical Errors

It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. For instance, if the string fi is encountered for the first time in a C program in the context:

fi ( a == f(x) ) ...

a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler (probably the parser in this case) handle an error due to transposition of the letters.



However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is "panic mode" recovery. We delete successive characters from the remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is left. This recovery technique may confuse the parser, but in an interactive computing environment it may be quite adequate.
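A minimal sketch of panic-mode recovery in C, assuming a hypothetical matches_some_token() predicate that tries every token pattern at a given position:

    int matches_some_token(const char *p);   /* hypothetical scanner helper */

    /* Delete characters until some token pattern matches at *p. */
    const char *panic_mode_recover(const char *p, const char *end)
    {
        while (p < end && !matches_some_token(p))
            p++;                 /* delete one more character */
        return p;                /* scanning resumes here */
    }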

Other possible error-recovery actions are:

Delete one character from the remaining input.

Insert a missing character into the remaining input.

Replace a character by another character.

Transpose two adjacent characters.

Transformations like these may be tried in an attempt to repair the input. The simplest such strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme by a single transformation. This strategy makes sense, since in practice most lexical errors involve a single character. A more general correction strategy is to find the smallest number of transformations needed to convert the source program into one that consists only of valid lexemes, but this approach is considered too expensive in practice to be worth the effort.



INPUT BUFFERING

The lexical analyzer scans the input from left to right one character at a time. It uses two pointers, begin pointer (bp) and forward pointer (fp), to keep track of the portion of the input scanned.

Initially both pointers point to the first character of the input string, as shown below.

The forward pointer moves ahead in search of the end of the lexeme. As soon as a blank space is encountered, it indicates the end of the lexeme; in the example above, as soon as fp encounters a blank space, the lexeme "int" is identified. When fp encounters white space, it ignores it and moves ahead; then both the begin pointer (bp) and forward pointer (fp) are set at the next token.

The input characters are thus read from secondary storage, but reading from secondary storage character by character is costly; hence a buffering technique is used. A block of data is first read into a buffer and then scanned by the lexical analyzer. There are two methods used in this context: the One Buffer Scheme and the Two Buffer Scheme. These are explained below.

One Buffer Scheme:

In this scheme, only one buffer is used to store the input string. The problem with this scheme is that if a lexeme is very long, it crosses the buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.

Two Buffer Scheme:

To overcome the problem of the one buffer scheme, in this method two buffers are used to store the input string. The first buffer and second buffer are scanned alternately; when the end of the current buffer is reached, the other buffer is filled. The only problem with this method is that if the length of the lexeme is longer than the length of a buffer, then the input cannot be scanned completely.

Initially both bp and fp point to the first character of the first buffer. Then fp moves towards the right in search of the end of the lexeme. As soon as a blank character is recognized, the string between bp and fp is identified as the corresponding token. To identify the boundary of the first buffer, an end-of-buffer character should be placed at the end of the first buffer. Similarly, the end of the second buffer is recognized by the end-of-buffer mark present at its end.

When fp encounters the first eof, it recognizes the end of the first buffer, and filling of the second buffer is started. In the same way, when the second eof is obtained, it indicates the end of the second buffer. Alternately, both buffers can be filled up until the end of the input program, and the stream of tokens is identified. This eof character introduced at the end is called a sentinel, and it is used to identify the end of a buffer.
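The sentinel test can be folded into the normal character fetch. A sketch in C, assuming a half size N, a sentinel byte, and a hypothetical reload() helper that fills one half and plants the sentinel after the last byte read:

    #include <stdio.h>

    #define N        4096      /* size of each buffer half */
    #define SENTINEL '\0'      /* assumption: NUL never occurs in source text */

    static char  buf[2 * N + 2];   /* two halves, each followed by a sentinel slot */
    static char *forward = buf;

    void reload(char *half);       /* hypothetical: read up to N bytes into half */

    int advance(void)
    {
        char c = *forward++;
        if (c != SENTINEL)
            return c;                              /* common case: no boundary test */
        if (forward == buf + N + 1) {              /* crossed end of first half */
            reload(buf + N + 1);
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 2) {   /* crossed end of second half */
            reload(buf);
            forward = buf;
        } else {
            return EOF;                            /* sentinel inside a half: real end of input */
        }
        return advance();                          /* fetch first character of refilled half */
    }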

Recognition of Tokens
Tokens can be recognized by finite automata.

A finite automaton (FA) is a simple idealized machine used to recognize patterns within input taken from some character set (or alphabet) C. The job of an FA is to accept or reject an input depending on whether the pattern defined by the FA occurs in the input.

There are two notations for representing finite automata:

Transition Diagram
Transition Table

A transition diagram is a directed labeled graph that contains nodes and edges. Nodes represent states and edges represent transitions between states. Every transition diagram has exactly one initial state, represented by an arrow mark (-->), and zero or more final states, represented by double circles. In the figure, state "1" is the initial state and state 3 is a final state.



Finite Automata for recognizing identifiers

Finite Automata for recognizing keywords



Finite Automata for recognizing numbers

Finite Automata for relational operators


Finite Automata for recognizing white spaces

The Lexical-Analyzer Generator Lex


1. Use of Lex
2. Structure of Lex Programs
3. Conflict Resolution in Lex
4. The Lookahead Operator

In this section, we introduce a tool called Lex, or in a more recent implementation Flex, that allows one to specify a lexical analyzer by specifying regular expressions to describe patterns for tokens. The input notation for the Lex tool is referred to as the Lex language and the tool itself is the Lex compiler. Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram. The mechanics of how this translation from regular expressions to transition diagrams occurs is the subject of the next sections; here we only learn the Lex language.


1. Use of Lex

Figure 3.22 suggests how Lex is used. An input file, which we call lex.l, is written in the Lex language and describes the lexical analyzer to be generated. The Lex compiler transforms lex.l to a C program, in a file that is always named lex.yy.c. The latter file is compiled by the C compiler into a file called a.out, as always. The C-compiler output is a working lexical analyzer that can take a stream of input characters and produce a stream of tokens.

The normal use of the compiled C program, referred to as a.out in Fig. 3.22, is as a subroutine of the parser. It is a C function that returns an integer, which is a code for one of the possible token names. The attribute value, whether it be another numeric code, a pointer to the symbol table, or nothing, is placed in a global variable yylval, which is shared between the lexical analyzer and parser, thereby making it simple to return both the name and an attribute value of a token.


2. Structure of Lex Programs

A Lex program has the following form:

declarations
%%
translation rules
%%
auxiliary functions

The declarations section includes declarations of variables, manifest constants (identifiers declared to stand for a constant, e.g., the name of a token), and regular definitions, in the style of Section 3.3.4.

The translation rules each have the form

Pattern { Action }

Each pattern is a regular expression, which may use the regular definitions of the declarations section. The actions are fragments of code, typically written in C, although many variants of Lex using other languages have been created.

The third section holds whatever additional functions are used in the actions. Alternatively, these functions can be compiled separately and loaded with the lexical analyzer.
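A small illustrative Lex specification (the token codes and the install_id/install_num helpers are assumptions, typically supplied via a parser header and the auxiliary section):

    %{
    /* manifest constants for token names; in practice often from a Yacc header */
    #define IF     256
    #define ID     257
    #define NUMBER 258
    int yylval;                  /* attribute value shared with the parser */
    %}

    delim   [ \t\n]
    ws      {delim}+
    letter  [A-Za-z]
    digit   [0-9]
    id      {letter}({letter}|{digit})*
    number  {digit}+

    %%
    {ws}      { /* no action and no return: whitespace is discarded */ }
    if        { return IF; }
    {id}      { yylval = install_id();  return ID; }
    {number}  { yylval = install_num(); return NUMBER; }
    %%

    int install_id(void)  { /* enter yytext into the symbol table */ return 0; }
    int install_num(void) { /* enter the constant into a table */    return 0; }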

3. Conflict Resolution in Lex

We have alluded to the two rules that Lex uses to decide on the proper lexeme to select, when several prefixes of the input match one or more patterns:

1. Always prefer a longer prefix to a shorter prefix.

2. If the longest possible prefix matches two or more patterns, prefer the pattern listed first in the Lex program.
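For example, with the specification sketched above, on input ifx both the keyword rule (matching if) and {id} (matching ifx) apply; the longer match wins, so ifx is one identifier. On input if alone, both patterns match the same two characters, so the keyword rule wins simply because it is listed before {id}.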

4. The Lookahead Operator

Lex automatically reads one character ahead of the last character that forms the selected lexeme, and then retracts the input so only the lexeme itself is consumed from the input. However, sometimes we want a certain pattern to be matched to the input only when it is followed by certain other characters. If so, we may use the slash in a pattern to indicate the end of the part of the pattern that matches the lexeme. What follows the / is an additional pattern that must be matched before we can decide that the token in question was seen, but what matches this second pattern is not part of the lexeme.
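A classic illustration, drawn from fixed-format Fortran (where IF is not a reserved word), recognizes the keyword IF only when it begins an if-statement:

    IF / \( .* \) {letter}

Here only IF becomes the lexeme; the parenthesized condition and the following letter must be present on the input, but they are scanned again later as separate tokens.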



Finite Automata

Finite automata (FA) are the simplest machines for recognizing patterns. A finite automaton, or finite state machine, is an abstract machine that has five elements (a 5-tuple). It has a set of states and rules for moving from one state to another, depending upon the applied input symbol. Basically, it is an abstract model of a digital computer. The figure shows the following essential features of a general automaton:

1. Input
2. Output
3. States of automata
4. State relation
5. Output relation
A Finite Automata consists of the following :
Q : Finite set of states.
Σ : set of Input Symbols.
q : Initial state.
F : set of Final States.
δ : Transition Function.

Formal specification of machine is


{ Q, Σ, q, F, δ }.
FA is characterized into two types:

1) Deterministic Finite Automata (DFA)

DFA consists of 5 tuples {Q, Σ, q, F, δ}.


Q : set of all states.
Σ : set of input symbols. ( Symbols which machine takes as input )
q : Initial state. ( Starting state of a machine )
F : set of final state.
δ : Transition Function, defined as δ : Q X Σ --> Q.

In a DFA, for a particular input character, the machine goes to one state only. A transition function is defined on every state for every input symbol. Also, in a DFA, null (or ε) moves are not allowed, i.e., a DFA cannot change state without any input character.

For example, the DFA below, with Σ = {0, 1}, accepts all strings ending with 0.
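The diagram itself is not reproduced here, but the same two-state DFA can be simulated directly in C with a transition table (a sketch; state 1 is the final state):

    #include <stdio.h>

    /* DFA over {0,1} accepting strings that end with '0'.
       State 0: start / last symbol was '1'.  State 1: last symbol was '0' (final). */
    static int accepts(const char *s)
    {
        static const int delta[2][2] = {
            /*         '0' '1' */
            /* q0 */  { 1,   0 },
            /* q1 */  { 1,   0 },
        };
        int state = 0;
        for (; *s; s++)
            state = delta[state][*s - '0'];
        return state == 1;
    }

    int main(void)
    {
        printf("%d %d\n", accepts("110"), accepts("101"));   /* prints: 1 0 */
        return 0;
    }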

2) Nondeterministic Finite Automata (NFA)

NFA is similar to DFA except for the following additional features:

1. Null (or ε) moves are allowed, i.e., it can move forward without reading symbols.

2. It has the ability to transit to any number of states for a particular input.

However, these features do not add any power to the NFA; if we compare the two in terms of power, both are equivalent.

Due to the above additional features, an NFA has a different transition function; the rest is the same as a DFA.

δ : Q × (Σ ∪ ε) --> 2^Q

As the transition function shows, for any input, including null (or ε), an NFA can go to any number of states. For example, below is an NFA for the above problem.


Regular Expression:

The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes that belong to the language in hand. It searches for the pattern defined by the language rules. Regular expressions have the capability to express finite languages by defining a pattern for finite strings of symbols. The grammar defined by regular expressions is known as regular grammar. The language defined by regular grammar is known as regular language.

A regular expression is an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions serve as names for sets of strings. Programming language tokens can be described by regular languages. The specification of regular expressions is an example of a recursive definition. Regular languages are easy to understand and have efficient implementations.

There are a number of algebraic laws that are obeyed by regular expressions, which can be used to manipulate regular expressions into equivalent forms.

Operations

The various operations on languages are:

□ Union of two languages L and M is written as L U M = {s | s is in L or s is in M}

□ Concatenation of two languages L and M is written as LM = {st | s is in L and t is in M}

□ The Kleene closure of a language L is written as L* = zero or more occurrences of language L
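For instance, if L = {a, b} and M = {c, d}, then L U M = {a, b, c, d}, LM = {ac, ad, bc, bd}, and L* = {ε, a, b, aa, ab, ba, bb, ...}, i.e., every string over {a, b} including the empty string.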

Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then

□ Union: (r)|(s) is a regular expression denoting L(r) U L(s)

□ Concatenation: (r)(s) is a regular expression denoting L(r)L(s)

□ Kleene closure: (r)* is a regular expression denoting (L(r))*

□ (r) is a regular expression denoting L(r)

Precedence and Associativity


□ *, concatenation (.), and | (pipe sign) are left associative

□ * has the highest precedence

□ Concatenation (.) has the second highest precedence.

□ | (pipe sign) has the lowest precedence of all.

Representing valid tokens of a language in regular expression

If x is a regular expression, then:

□ x* means zero or more occurrences of x, i.e., it can generate { ε, x, xx, xxx, xxxx, ... }

□ x+ means one or more occurrences of x, i.e., it can generate { x, xx, xxx, xxxx, ... }, equivalently x.x*

□ x? means at most one occurrence of x, i.e., it can generate either {x} or {ε}

[a-z] is all lower-case alphabets of the English language, [A-Z] is all upper-case alphabets of the English language, and [0-9] is all natural digits used in mathematics.

Representing occurrence of symbols using regular expressions

letter = [a-z] or [A-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [+|-]
Representing language tokens using regular expressions

Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
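Under these definitions, Decimal matches strings such as 42, +7 and -125, while Identifier matches sum, x1 and rate2 but not 2rate, since an identifier must begin with a letter.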

The only problem left with the lexical analyzer is how to verify the validity of a regular expression used in specifying the patterns of keywords of a language. A well-accepted solution is to use finite automata for verification.

Regular Expression to Finite Automata

To convert the RE to FA, we are going to use a method called the subset method. This method is

used to obtain FA from the given regular expression. This method is given below:

Step 1: Design a transition diagram for given regular expression, using NFA with ε moves.

Step 2: Convert this NFA with ε to NFA without ε.

Step 3: Convert the obtained NFA to equivalent DFA.

Example 1:

Design a FA from given regular expression

10 + (0 + 11)0* 1.

Solution: First we will construct the transition diagram for a given regular expression.

Steps 1-5 (transition diagrams of the stepwise NFA construction, not reproduced here).

Now we have got the NFA without ε. We will convert it into the required DFA; for that, we first write a transition table for this NFA.

State       0        1
→q0         q3       {q1, q2}
q1          qf       ϕ
q2          ϕ        q3
q3          q3       qf
*qf         ϕ        ϕ

The equivalent DFA will be:

State       0        1
→[q0]       [q3]     [q1, q2]
[q1]        [qf]     ϕ
[q2]        ϕ        [q3]
[q3]        [q3]     [qf]
[q1, q2]    [qf]     [qf]
*[qf]       ϕ        ϕ
UNIT – 1 TUTORIAL QUESTIONS
(Week 1)

1. Define compiler? State various phases of a compiler and explain them in detail.

2. Explain the various phases of a compiler in detail. Also write down the output for the
following expression after each phase:
a. a := b*c - d

3. Explain the cousins of a Compiler? Explain them in detail.

4. Describe how various phases could be combined as a pass in a compiler? Also briefly
explain Compiler construction tools.
5. For the following expression Position := initial + rate*60
UNIT – 1 TUTORIAL QUESTIONS
(Week 2)

1. Write down the output after each phase

2. Explain the role Lexical Analyzer and issues of Lexical Analyzer.

3. Differentiate the pass and phase in compiler construction?

4. Explain single pass and multi pass compiler with example?

5. Define bootstrapping concept in brief?


UNIT – I UNIVERSITY QUESTIONS

1. Describe the output for the various phases of compiler with respect to the following
statements Total = count + rate * 10.
2. Describe the role of regular expression in lexical analyzer.

3. What are the error recovery actions in a lexical analyzer?

4. Differentiate between Loader and Linker.

5. Write differences between single pass and two pass translation.


UNIT – 1

DESCRIPTIVE QUESTIONS AND ANSWERS

1. Briefly describe the execution of a program.

2. Explain different properties of compiler.

3. Briefly describe the phases of a program.

4. Explain different types of compiler.

5. Briefly describe the parsing techniques

6. Explain compiler construction tools


7. Briefly describe the problems with top down parsing

8. Explain LL(1) parser with suitable example


UNIT – 1

OBJECTIVE QUESTIONS AND ANSWERS


1. The action of passing the source program into the proper syntactic classes is known as [ A]
A) Lexical analysis B) Syntax analysis C) Interpretation analysis D) Parsing

2. FOLLOW is applicable for [A ]


A) Only non-terminals B) Either terminals or non terminals
C) Only terminals D) Neither terminals nor non-terminals

3. A grammar will be meaningless [ A]


A) If the left hand side of the production is a single terminal
B) If the terminal set and non terminal sets are disjoint
C) If the left hand side production has more than two non terminals
D) If the left hand side production has non terminal

4. The minimum value of k in LR(k) is [ B]


A) 1 B) 0 C) 2 D) 3

5. Indirect triples are used to implement [ C]


A) Infix notation B) Postfix notation C) Three address code D) Prefix notation

6. System program such as compiler are designed so that they are [ C]


A) Re-entrable B) Serially usable C) Recursive D) non re-usable

7. A collection of syntactic rule is called a [ A]


A) Grammar B) NFA C) Sentence D) Language

8. Shift reduce parsers are [C ]


A) May be topdown or bottom up B) Topdown parsers
C) Bottom up parsers D) Predictive parsers
9. Which of the following is related to synthesis phase? [B ]
A) Syntax analysis B) Code generation
C) Lexical analysis D) Semantic analysis

10. Which of the following is true? [A ]


A) LALR(1) requires less space compare with LR(1)
B) LALR(1) is more powerful than LR(1)
C) LALR(1) requires more space compare with LR(1)
D) LALR(1) is a powerful as an LR(1)
NAME OF THE STUDENT: BRANCH: CSE REG NO:
SUBJECT: CD TEST NO: 1 MARKS:

SET - A
I. Answer the Following Objective Questions: Each carries 0.5 Mark
1. The action of passing the source program into the proper syntactic classes is known as lexical analysis

2. FOLLOW is applicable for only non-terminals

3. A grammar will be meaningless if the left hand side of the production is a single terminal

4. The minimum value of k in LR(k) is zero

5. Indirect triples are used to implement three address code

6. System programs such as compilers are designed so that they are recursive

7. A collection of syntactic rules is called a grammar

8. Shift reduce parsers are bottom-up parsers

9. Code generation is related to the synthesis phase

10. LALR(1) requires less space compared with LR(1)

II. Answer any one question: Each carries 5 Mark

1. Briefly describe the execution of a program.

Ref: page no1.1

2. Explain different properties of compiler.

Ref : page no 1.2


NAME OF THE STUDENT: BRANCH: CSE REG NO:
SUBJECT: CD TEST NO: 1 MARKS:

SET - B

I. Answer the Following Objective Questions: Each carries 0.5 Mark

1. Shift reduce parsers are bottom-up parsers

2. Code generation is related to the synthesis phase

3. LALR(1) requires less space compared with LR(1)

4. The action of passing the source program into the proper syntactic classes is known as lexical analysis

5. FOLLOW is applicable for only non-terminals

6. A grammar will be meaningless if the left hand side of the production is a single terminal

7. The minimum value of k in LR(k) is zero

8. Indirect triples are used to implement three address code

9. System programs such as compilers are designed so that they are recursive

10. A collection of syntactic rules is called a grammar

II. Answer any one question: Each carries 5 Mark


1. Briefly describe the phases of a program.

Ref: page no1.2


2. Explain different types of compiler.
Ref : page no 1.5
NAME OF THE STUDENT: BRANCH: CSE REG NO:
SUBJECT: CD TEST NO: 1 MARKS:

SET - C

I. Answer the Following Objective Questions: Each carries 0.5 Mark


1. Indirect triples are used to implement three address code

2. System programs such as compilers are designed so that they are recursive

3. The action of passing the source program into the proper syntactic classes is known as lexical analysis

4. Shift reduce parsers are bottom-up parsers

5. Code generation is related to the synthesis phase

6. LALR(1) requires less space compared with LR(1)

7. FOLLOW is applicable for only non-terminals

8. A grammar will be meaningless if the left hand side of the production is a single terminal

9. The minimum value of k in LR(k) is zero

10. A collection of syntactic rules is called a grammar

II. Answer any one question: Each carries 5 Mark


1. Briefly describe the parsing techniques

Ref: page no1.10


2. Explain compiler construction tools.
Ref : page no 1.7
NAME OF THE STUDENT: BRANCH: CSE REG NO:
SUBJECT: CD TEST NO: 1 MARKS:

SET - D
I. Answer the Following Objective Questions: Each carries 0.5 Mark
1. Code generation is related to the synthesis phase

2. FOLLOW is applicable for only non-terminals

3. A grammar will be meaningless if the left hand side of the production is a single terminal

4. The minimum value of k in LR(k) is zero

5. Indirect triples are used to implement three address code

6. System programs such as compilers are designed so that they are recursive

7. A collection of syntactic rules is called a grammar

8. Shift reduce parsers are bottom-up parsers

9. LALR(1) requires less space compared with LR(1)

10. The action of passing the source program into the proper syntactic classes is known as lexical analysis
II. Answer any one question: Each carries 5 Mark
1. Briefly describe the problems with top down parsing

Ref: page no1.11


2. Explain LL(1) parser with suitable example
Ref : page no 1.14
UNIT I
SEMINAR TOPICS

1. Phases of Compilation, Lexical Analysis

2. Regular Grammar, Regular Expression for Common Programming Language Features,

3. Pass & Phases of Translation, Interpretation, Boot Strapping, Data Structures in


Compilation.
4. LEX- lexical analyzer generator.

5. FA
UNIT - 1
ASSIGNMENT QUESTIONS
1. Define compiler briefly?

2. Explain the cousins of compiler?

3. Define the two main parts of compilation? What do they perform?

4. Explain how many phases analysis consists of?

5. Define and explain the Loader?

6. Explain about preprocessor?

7. Classify the general phases of a compiler?

8. Classify the rules and define regular expression?

9. Explain a lexeme and define regular sets?

10. Explain the issues of lexical analyzer?


UNIT – 1

REAL TIME APPLICATIONS

One application of compilers is to make custom hardware, i.e., hardware specific to the software that you want to run. Such custom hardware is energy efficient and offers better performance. This can vary from adding new custom instructions to an ASIP (Application Specific Instruction Set Processor) to designing a custom ASIC (Application Specific Integrated Circuit), or an FPGA-based design (high-level synthesis).

The compiler takes the program, analyses and optimizes it, and uses the characteristics of the program to build hardware. For example, if a program only consists of additions, then the custom hardware only needs to consist of adder units; it does not need multipliers. The custom hardware generated is also bitwidth aware: if all the additions are 8-bit, then it does not need a 32-bit adder. It also tries to exploit parallelism in the program.
UNIT - 1
NPTEL

• http://nptel.ac.in/courses/106108052/#
UNIT -1
Bloom'sTaxonomy
1. Remembering

Preprocessor
A preprocessor, generally considered as a part of compiler, is a tool that produces input for
compilers. It deals with macro-processing, augmentation, file inclusion, language extension, etc.

Interpreter

An interpreter, like a compiler, translates high-level language into low-level machine language. The difference lies in the way they read the source code or input. A compiler reads the whole source code at once, creates tokens, checks semantics, generates intermediate code, translates the whole program and may involve many passes. In contrast, an interpreter reads a statement from the input, converts it to an intermediate code, executes it, then takes the next statement in sequence. If an error occurs, an interpreter stops execution and reports it; whereas a compiler reads the whole program even if it encounters several errors.

Assembler

An assembler translates assembly language programs into machine code. The output of an
assembler is called an object file, which contains a combination of machine instructions as well
as the data required to place these instructions in memory.

Linker

Linker is a computer program that links and merges various object files together in order to make
an executable file. All these files might have been compiled by separate assemblers. The major
task of a linker is to search and locate referenced module/routines in a program and to
determine the memory location where these codes will be loaded, making the program instruction
to have absolute references.
Loader
Loader is a part of operating system and is responsible for loading executable files into memory
and execute them. It calculates the size of a program (instructions and data) and creates memory
space for it. It initializes various registers to initiate execution.

2. Understanding
Cross-compiler:
A compiler that runs on platform (A) and is capable of generating executable code for platform
(B) is called a cross-compiler.

Source-to-source Compiler:

A compiler that takes the source code of one programming language and translates it into the
source code of another programming language is called a source-to-source compiler.
UNIT – 1
GATE and PLACEMENT QUESTIONS

1. The action of passing the source program into the proper syntactic classes is known as [ A]
A) Lexical analysis B) Syntax analysis C) Interpretation analysis D) Parsing

2. FOLLOW is applicable for [A ]


A) Only non-terminals B) Either terminals or non terminals
C) Only terminals D) Neither terminals nor non-terminals

3. A grammar will be meaningless [ A]


A) If the left hand side of the production is a single terminal
B) If the terminal set and non terminal sets are disjoint
C) If the left hand side production has more than two non terminals
D) If the left hand side production has non terminal

4. The minimum value of k in LR(k) is [ B]


A) 1 B) 0 C) 2 D) 3

5. Indirect triples are used to implement [ C]


A) Infix notation B) Postfix notation C) Three address code D) Prefix notation

6. System program such as compiler are designed so that they are [ C]


A) Re-entrable B) Serially usable C) Recursive D) non re-usable

7. A collection of syntactic rule is called a [ A]


A) Grammar B) NFA C) Sentence D) Language

8. Shift reduce parsers are [C ]


A) May be topdown or bottom up B) Topdown parsers
C) Bottom up parsers D) Predictive parsers
9. Which of the following is related to synthesis phase? [B ]
A) Syntax analysis B) Code generation
C) Lexical analysis D) Semantic analysis

10. Which of the following is true? [A ]


A) LALR(1) requires less space compare with LR(1)
B) LALR(1) is more powerful than LR(1)
C) LALR(1) requires more space compare with LR(1)
D) LALR(1) is a powerful as an LR(1)

11. The context sensitive grammar can be identified by [D ]


a) TM b) PDA c) FA d) LBA

12. Necessary condition for construction of predictive parser is [B ]

a) Avoid ambiguous grammar b) left factor the recursive grammar


c) both a and b d) none

13. The action table in LR parser have ------------- value . [ A]


a) shift b) accept c) error d) all

14. ----------- are the constructed data types. [D ]


a) arrays b) structures c) pointer d) all

15. There are ------ representations used for three address code. [C ]
a) 1 b) 2 c) 3 d) 4

16. The ---------- file consists of a tabular representation of the transition diagrams constructed
for the expressions of the specification file. [A ]

a) lex.yy.c b) x.l c) a.out d) none

17. Interpreter is a type of [ D]


a) Preprocessor b) compiler c) translator d) all

18. In which parsing input symbols are placed at leaf nodes for successful parsing is [D ]
a) top down b) bottom up c) universal d) predictive
19. -------- converts high level language to machine level language [B ]
a) Compiler b) translator c) interpreter d) none

20. While performing type checking ------ special basic types are needed. [C ]
a) 1 b) 2 c) 3 d) 4
