CD Unit - 1 Lms Notes
LECTURE NOTES
Branch CSE
2 Demonstrate the knowledge of patterns, tokens & regular expressions for lexical analysis.
3 Acquire skills in using the lex tool & yacc tool for developing a scanner and parser.
4 Design and implement LL and LR parsers.
5 Design algorithms to do code optimization in order to improve the performance of a program in terms of space and time complexity.
6 Design algorithms to generate machine code.
Computers are a balanced mix of software and hardware. Hardware is just a piece of mechanical equipment whose functions are controlled by compatible software. Hardware understands instructions in the form of electronic charge, which is the counterpart of binary language in software programming. Binary language has only two alphabets, 0 and 1. To instruct the hardware, codes must be written in binary format, which is simply a series of 1s and 0s. It would be a difficult and cumbersome task for computer programmers to write such codes directly, which is why we have compilers and related tools. The hardware understands a language which humans cannot understand, so we write programs in a high-level language, which is easier for us to understand and remember. These programs are then fed into a series of tools and OS components to get the desired code that can be used by the machine. This is known as the language processing system.
The high-level language is converted into binary language in various phases. A compiler is a
program that converts high-level language to assembly language. Similarly, an assembler is a
program that converts the assembly language to machine-level language.
Let us first understand how a program, using C compiler, is executed on a host machine.
• The C compiler compiles the program and translates it into an assembly program (low-level language).
• An assembler then translates the assembly program into machine code (an object file).
• A linker tool is used to link all the parts of the program together for execution (executable machine code).
• A loader loads all of them into memory and then the program is executed.
Before diving straight into the concepts of compilers, we should understand a few other tools that
work closely with compilers.
Preprocessor
A preprocessor, generally considered as a part of compiler, is a tool that produces input for
compilers. It deals with macro-processing, augmentation, file inclusion, language extension, etc.
Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language. The difference lies in the way they read the source code or input. A compiler reads the whole source code at once, creates tokens, checks semantics, generates intermediate code, translates the whole program, and may involve many passes. In contrast, an interpreter reads a statement from the input, converts it to an intermediate code, executes it, then takes the next statement in sequence. If an error occurs, an interpreter stops execution and reports it, whereas a compiler reads the whole program even if it encounters several errors.
Assembler
An assembler translates assembly language programs into machine code. The output of an
assembler is called an object file, which contains a combination of machine instructions as well as
the data required to place these instructions in memory.
Linker
A linker is a computer program that links and merges various object files together in order to make an executable file. All these files might have been compiled by separate assemblers. The major task of a linker is to search and locate referenced modules/routines in a program and to determine the memory location where these codes will be loaded, so that the program instructions have absolute references.
Loader
A loader is a part of the operating system and is responsible for loading executable files into memory and executing them. It calculates the size of a program (instructions and data) and creates memory space for it. It initializes various registers to initiate execution.
Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for platform
(B) is called a cross-compiler.
Source-to-source Compiler
A compiler that takes the source code of one programming language and translates it into the
source code of another programming language is called a source-to-source compiler.
2. COMPILER ARCHITECTURE
A compiler can broadly be divided into two phases based on the way it compiles.
Analysis Phase
Known as the front-end of the compiler, the analysis phase of the compiler reads the source
program, divides it into core parts, and then checks for lexical, grammar, and syntax errors. The
analysis phase generates an intermediate representation of the source program and symbol table,
which should be fed to the Synthesis phase as input.
Synthesis Phase
Known as the back-end of the compiler, the synthesis phase generates the target program with the
help of intermediate source code representation and symbol table.
• Pass: A pass refers to the traversal of a compiler through the entire program.
• Phase: A phase of a compiler is a distinguishable stage, which takes input from the previous stage, processes it, and yields output that can be used as input for the next stage. A pass can have more than one phase.
3. PHASES OF COMPILER
The compilation process is a sequence of various phases. Each phase takes input from its previous
stage, has its own representation of source program, and feeds its output to the next phase of the
compiler. Let us understand the phases of a compiler.
Lexical Analysis
The first phase of the compiler works as a text scanner. This phase scans the source code as a stream of characters and converts it into meaningful lexemes. The lexical analyzer represents these lexemes in the form of tokens as:
<token-name, attribute-value>
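For example, the assignment a := b * c - d (the expression used again in the tutorial questions of this unit; the symbol-table indices shown are illustrative) is broken into the lexemes a, :=, b, *, c, - and d and handed to the parser as the token stream:
<id, 1> <assign> <id, 2> <mul> <id, 3> <sub> <id, 4>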
Syntax Analysis
The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical analysis as input and generates a parse tree (or syntax tree). In this phase, token arrangements are
checked against the source code grammar, i.e., the parser checks if the expression made by the
tokens is syntactically correct.
Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the rules of the language. For example, it checks that values are assigned between compatible data types and flags errors such as adding a string to an integer.
Also, the semantic analyzer keeps track of identifiers, their types and expressions; whether
identifiers are declared before use or not, etc. The semantic analyzer produces an annotated syntax
tree as an output.
Intermediate Code Generation
After semantic analysis, the compiler generates an intermediate code of the source code for the
target machine. It represents a program for some abstract machine. It is in between the high-level
language and the machine language. This intermediate code should be generated in such a way that
it makes it easier to be translated into the target machine code.
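As a sketch, if the intermediate form is three-address code (one common choice; the temporary names t1 and t2 are illustrative), the statement a := b * c - d from the example above could be translated into:
t1 = b * c
t2 = t1 - d
a = t2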
Code Optimization
The next phase performs code optimization on the intermediate code. Optimization can be thought of as something that removes unnecessary code lines and arranges the sequence of statements in order to speed up program execution without wasting resources (CPU, memory).
Code Generation
In this phase, the code generator takes the optimized representation of the intermediate code and
maps it to the target machine language. The code generator translates the intermediate code into
a sequence of (generally) re-locatable machine code. This sequence of machine-code instructions performs the same task as the intermediate code would.
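Continuing the sketch, the three-address code shown earlier could be mapped to re-locatable code for a generic register machine (the registers and mnemonics are illustrative, not those of any particular target):
MOV b, R1
MUL c, R1    ; R1 = R1 * c
SUB d, R1    ; R1 = R1 - d
MOV R1, a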
Symbol Table
It is a data-structure maintained throughout all the phases of a compiler. All the identifiers’ names
along with their types are stored here. The symbol table makes it easier for the compiler to quickly
search the identifier record and retrieve it. The symbol table is also used for scope management.
LEXICAL ANALYZER
Lexical analysis is the first phase of a compiler. It takes the modified source code from language
preprocessors, which is written in the form of sentences. The lexical analyzer breaks this source code into a series of tokens, removing any whitespace or comments in the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works closely
with the syntax analyzer. It reads character streams from the source code, checks for legal tokens,
and passes the data to the syntax analyzer when it demands.
Tokens
Lexemes are said to be a sequence of characters (alphanumeric) in a token. There are some
predefined rules for every lexeme to be identified as a valid token. These rules are defined by
grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns are
defined by means of regular expressions.
Specifications of Tokens
Let us understand how language theory defines the following terms:
Alphabets
Any finite set of symbols is an alphabet; for example, {0, 1} is the set of binary alphabets and {a-z, A-Z} is the set of English-language alphabets.
Strings
Any finite sequence of alphabet symbols is called a string. The length of a string is the total number of occurrences of symbols in it; e.g., the length of the string tutorialspoint is 14, denoted |tutorialspoint| = 14. A string having no symbols, i.e., a string of zero length, is known as the empty string and is denoted by ε (epsilon).
Special Symbols
A typical high-level language contains special symbols such as the following:
Assignment: =
Preprocessor: #
Language
A language is considered as a finite set of strings over some finite set of alphabet symbols. Computer languages are considered as finite sets, and mathematically, set operations can be performed on them. Finite languages can be described by means of regular expressions.
As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well. When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table. In some cases, information regarding the kind of identifier may be read from the symbol table by the lexical analyzer to assist it in determining the proper token it must pass to the parser.
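To make this interaction concrete, the following is a minimal sketch in C of a scanner that groups characters into lexemes and returns <token-name, attribute-value> pairs. The token names, the fixed input string, and the toy symbol table are assumptions made for the example, not part of these notes.

/* Minimal illustrative scanner: groups characters into lexemes and
 * returns <token-name, attribute-value> pairs to its caller. */
#include <ctype.h>
#include <stdio.h>

enum token_name { T_ID, T_NUMBER, T_EOF };

struct token { enum token_name name; int attribute; };

static const char *input = "rate 60 initial";   /* source text being scanned   */
static char lexemes[10][16];                    /* toy symbol table of lexemes */
static int  n_lexemes = 0;

/* Enter an identifier lexeme into the toy symbol table, returning its index. */
static int install_id(const char *start, int len) {
    snprintf(lexemes[n_lexemes], sizeof lexemes[0], "%.*s", len, start);
    return n_lexemes++;
}

/* Group input characters into the next lexeme and return its token. */
struct token get_next_token(void) {
    while (*input == ' ') input++;                    /* skip whitespace   */
    if (*input == '\0') return (struct token){T_EOF, 0};
    if (isdigit((unsigned char)*input)) {             /* number lexeme     */
        int value = 0;
        while (isdigit((unsigned char)*input)) value = value * 10 + (*input++ - '0');
        return (struct token){T_NUMBER, value};
    }
    const char *start = input;                        /* identifier lexeme */
    while (isalnum((unsigned char)*input)) input++;
    return (struct token){T_ID, install_id(start, (int)(input - start))};
}

int main(void) {
    struct token t;
    while ((t = get_next_token()).name != T_EOF)
        printf("<%d, %d>\n", t.name, t.attribute);
    return 0;
}

Running it prints one <name, attribute> pair per lexeme; for identifiers the attribute is the index of the symbol-table entry created by install_id.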
There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases. The most important is simplicity of design: separating lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and whitespace as syntactic units would be considerably more complex than one that can assume comments and whitespace have already been removed by the lexical analyzer. If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design. Compiler efficiency is also improved: a separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
When discussing lexical analysis, we use three related but distinct terms:
• A token is a pair consisting of a token name and an optional attribute value. The
token name is an abstract symbol representing a kind of lexical unit, e.g., a particular
keyword, or a sequence of input characters denoting an identifier. The token names are
the input symbols that the parser processes. In what follows, we shall generally write the token name in boldface. We will often refer to a token by its token name.
• A pattern is a description of the form that the lexemes of a token may take. In the case
of a keyword as a token, the pattern is just the sequence of characters that form the
keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.
• A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
To see how these concepts are used in practice, consider the C statement printf("Total = %d\n", score); in this statement both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching the pattern for token literal.
In many programming languages, the following classes cover most or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched. For example, the pattern for token number matches both 0 and 1, but it is extremely important for the code generator to know which lexeme was found in the source program. Thus, in many cases the lexical analyzer returns to the parser not only a token name, but an attribute value that describes the lexeme represented by the token; the token name influences parsing decisions, while the attribute value influences the translation of tokens after the parse.
We shall assume that tokens have at most one associated attribute, although this attribute may have a structure that combines several pieces of information. The most important example is the token id, where we need to associate with the token a great deal of information. Normally, information about an identifier (e.g., its lexeme, its type, and the location at which it is first found, in case an error message about that identifier must be issued) is kept in the symbol table. Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
Usually, given the pattern describing the lexemes of a token, it is relatively simple to recognize matching lexemes when they occur on the input. However, in some languages it is not immediately apparent when we have seen an instance of a lexeme corresponding to a token. The following example is taken from fixed-format Fortran. In the statement
DO 5 I = 1.25
it is not apparent that the first lexeme is DO5I, an instance of the identifier token, until we see the dot following the 1. Note that blanks in fixed-format Fortran are ignored (an archaic convention). Had we seen
DO 5 I = 1,25
instead, we would have had a do-statement in which the first lexeme is the keyword DO.
Example: The token names and associated attribute values for the Fortran statement
E = M * C ** 2
are written below as a sequence of pairs:
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
Note that in certain pairs, especially operators, punctuation, and keywords, there is no need for an attribute value. In this example, the token number has been given an integer-valued attribute. In practice, a typical compiler would instead store a character string representing the constant and use as the attribute value a pointer to that string.
4. Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. For instance, if the string fi is encountered for the first time in a C program in the context:
fi ( a == f(x) ) ...
a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler (probably the parser in this case) handle an error due to transposition of the letters.
However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is "panic mode" recovery: we delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left. This recovery technique may confuse the parser, but in an interactive computing environment it may be quite adequate.
Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
Transformations like these may be tried in an attempt to repair the input. The simplest such strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme by a single transformation. This strategy makes sense, since in practice most lexical errors involve a single character. A more general correction strategy is to find the smallest number of transformations needed to convert the source program into one that consists only of valid lexemes, but this approach is considered too expensive in practice.
INPUT BUFFERING
The lexical analyzer scans the input from left to right one character at a time. It uses two pointers, begin ptr (bp) and forward ptr (fp), to keep track of the portion of the input scanned.
Initially both pointers point to the first character of the input string, as shown below.
The forward pointer moves ahead to search for the end of the lexeme. As soon as a blank space is encountered, it indicates the end of the lexeme. In the above example, as soon as the forward ptr (fp) encounters a blank space, the lexeme "int" is identified.
The fp is moved ahead across white space: when fp encounters white space, it ignores it and moves ahead. Then both the begin ptr (bp) and forward ptr (fp) are set at the next token.
The input characters are thus read from secondary storage, but reading in this way from secondary storage is costly, hence a buffering technique is used. A block of data is first read into a buffer and then scanned by the lexical analyzer. There are two methods used in this context: the one-buffer scheme and the two-buffer scheme.
One Buffer Scheme:
In this scheme, only one buffer is used to store the input string. The problem with this scheme is that if a lexeme is very long it crosses the buffer boundary, and to scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.
Two Buffer Scheme:
To overcome the problem of the one-buffer scheme, in this method two buffers are used to store the input string. The first and second buffers are scanned alternately; when the end of the current buffer is reached, the other buffer is filled. The only problem with this method is that if the length of a lexeme is longer than the length of a buffer, then the input cannot be scanned completely.
Initially both bp and fp point to the first character of the first buffer. Then fp moves towards the right in search of the end of the lexeme. As soon as a blank character is recognized, the string between bp and fp is identified as the corresponding token. To identify the boundary of the first buffer, an end-of-buffer character should be placed at the end of the first buffer. Similarly, the end of the second buffer is recognized by the end-of-buffer mark present at the end of the second buffer. When fp encounters the first eof, one can recognize the end of the first buffer, and hence filling of the second buffer is started. In the same way, when the second eof is reached, it indicates the end of the second buffer. Alternately, both buffers can be filled up until the end of the input program, and the stream of tokens is identified. This eof character introduced at the end is called a sentinel, and it is used to identify the end of the buffer.
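The sentinel test can be folded into ordinary character reads. The following C sketch assumes a buffer-half size N, a helper fill(), and a routine next_char(); these names and the exact layout are illustrative assumptions, not part of these notes.

#include <stdio.h>

#define N 4096               /* size of each buffer half                       */
#define SENTINEL '\0'        /* eof mark written just past the valid data      */

static char buf[2 * N + 2];  /* two halves, each followed by a sentinel slot   */
static char *forward;        /* forward (scanning) pointer                     */
static FILE *src;            /* input file, assumed to be opened by the caller */

/* Fill one half with up to N characters and terminate it with the sentinel. */
static void fill(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;
}

void init_buffers(FILE *f) { src = f; fill(buf); forward = buf; }

/* Return the next input character, reloading the other half whenever the
 * sentinel that ends the current half is reached; return EOF at true end. */
int next_char(void) {
    char c = *forward++;
    if (c != SENTINEL)
        return (unsigned char)c;
    if (forward == buf + N + 1) {            /* sentinel ending the first half  */
        fill(buf + N + 1);
        forward = buf + N + 1;
        return next_char();
    }
    if (forward == buf + 2 * N + 2) {        /* sentinel ending the second half */
        fill(buf);
        forward = buf;
        return next_char();
    }
    return EOF;                              /* sentinel inside a half: real eof */
}

int main(void) {
    init_buffers(stdin);                     /* echo stdin as a simple demo     */
    int c;
    while ((c = next_char()) != EOF) putchar(c);
    return 0;
}

The point of the sentinel is that the common case (c != SENTINEL) needs only one comparison per character; the buffer-end checks run only when a sentinel is actually seen.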
Recognition of Tokens
Tokens can be recognized by Finite Automata
A finite automaton (FA) is a simple idealized machine used to recognize patterns within input taken from some character set (or alphabet) C. The job of an FA is to accept or reject an input depending on whether the pattern defined by the FA occurs in the input.
There are two notations for representing finite automata:
Transition Diagram
Transition Table
A transition diagram is a directed labeled graph that contains nodes and edges. Nodes represent the states and edges represent the transitions between states. Every transition diagram has exactly one initial state, represented by an arrow mark (-->), and zero or more final states, represented by double circles.
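As an illustration, a transition diagram can be implemented directly as a transition function (or an equivalent transition table). The sketch below recognizes identifiers matching letter(letter|digit)*; the state names and the pattern are assumptions chosen for this example.

#include <ctype.h>
#include <stdio.h>

/* States of a small transition diagram for identifiers:
 * START --letter--> IN_ID,  IN_ID --letter|digit--> IN_ID, anything else stops. */
enum { START = 0, IN_ID = 1, REJECT = -1 };

/* The transition function encodes the edges of the diagram. */
static int next_state(int state, int c) {
    switch (state) {
    case START: return isalpha(c) ? IN_ID : REJECT;
    case IN_ID: return isalnum(c) ? IN_ID : REJECT;
    default:    return REJECT;
    }
}

/* Returns the length of the identifier lexeme at the start of s, or 0 if none. */
int match_identifier(const char *s) {
    int state = START, i = 0;
    while (s[i] != '\0' && next_state(state, (unsigned char)s[i]) != REJECT) {
        state = next_state(state, (unsigned char)s[i]);
        i++;
    }
    return (state == IN_ID) ? i : 0;
}

int main(void) {
    printf("%d\n", match_identifier("count1 = 0"));   /* prints 6 */
    return 0;
}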
In this section, we introduce a tool called Lex, or in a more recent implementation Flex, that allows one to specify a lexical analyzer by specifying regular expressions to describe patterns for tokens. The input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler. Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram. The mechanics of how this translation from regular expressions to transition diagrams occurs is the subject of the next sections; here we only consider how Lex is used.
An input file, which we call lex.l, is written in the Lex language and describes the lexical analyzer to be generated. The Lex compiler transforms lex.l into a C program, in a file that is always named lex.yy.c. The latter file is compiled by the C compiler into a file called a.out, as always. The C-compiler output is a working lexical analyzer that can take a stream of input characters and produce a stream of tokens.
The normal use of the compiled C program, referred to as a.out, is as a subroutine of the parser. It is a C function that returns an integer, which is a code for one of the possible token names. The attribute value, whether it be another numeric code, a pointer to the symbol table, or nothing, is placed in a global variable yylval, which is shared between the lexical analyzer and parser, thereby making it simple to return both the name and an attribute value of a token.
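A sketch of this usage from the parser side is shown below; it assumes the classic Lex/Yacc convention in which yylex() returns 0 at end of input and yylval is an int, and it must be compiled and linked together with the generated lex.yy.c (for example, cc driver.c lex.yy.c -ll, or -lfl with Flex).

#include <stdio.h>

extern int yylex(void);   /* the scanner generated by Lex in lex.yy.c            */
int yylval;               /* attribute value shared with the scanner; with Yacc  */
                          /* this variable would instead be defined in y.tab.c   */

int main(void) {
    int tok;
    while ((tok = yylex()) != 0) {          /* yylex() returns 0 at end of input */
        printf("token code %d, attribute %d\n", tok, yylval);
    }
    return 0;
}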
declarations
%%
translation rules
%%
auxiliary functions
The declarations section includes declarations of variables, manifest constants (identifiers declared to
stand for a constant, e.g., the name of a token), and regular definitions, in the style of Section 3.3.4.
The translation rules each have the form
Pattern { Action }
Each pattern is a regular expression, which may use the regular definitions of the declaration section. The actions are fragments of code, typically written in C, although many variants of Lex using other languages have been created.
The third section holds whatever additional functions are used in the actions. Alternatively, these
functions can be compiled separately and loaded with the lexical analyzer.
Lex uses two rules to decide on the proper lexeme to select when several prefixes of the input match one or more patterns:
1. Always prefer a longer prefix to a shorter prefix.
2. If the longest possible prefix matches two or more patterns, prefer the pattern listed first in the Lex program.
Lex automatically reads one character ahead of the last character that forms the selected lexeme, and then retracts the input so only the lexeme itself is consumed. However, sometimes we want a certain pattern to be matched to the input only when it is followed by certain other characters. If so, we may use the slash in a pattern to indicate the end of the part of the pattern that matches the lexeme. What follows the / is an additional pattern that must be matched before we can decide that the token in question was seen, but what matches this second pattern is not part of the lexeme.
Finite Automata
A finite automaton (FA) is the simplest machine to recognize patterns. A finite automaton, or finite state machine, is an abstract machine defined by five elements (a 5-tuple). It has a set of states and rules for moving from one state to another, depending on the applied input symbol. Basically, it is an abstract model of a digital computer. The following figure shows the essential features of a general automaton.
DFA
In a DFA, for a particular input character, the machine goes to one state only. A transition function is defined on every state for every input symbol. Also, in a DFA a null (or ε) move is not allowed, i.e., a DFA cannot change state without consuming an input symbol.
For example, the DFA below with Σ = {0, 1} accepts all strings ending with 0.
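A small C simulation of this two-state DFA (the state numbering is an illustrative choice):

#include <stdbool.h>
#include <stdio.h>

/* State 0: string so far does not end in '0' (start state, rejecting).
 * State 1: string so far ends in '0' (accepting).                      */
static const int delta[2][2] = {
    /* on '0'  on '1' */
    {    1,      0 },   /* from state 0 */
    {    1,      0 },   /* from state 1 */
};

/* Returns true if the binary string s is accepted, i.e. ends with '0'. */
bool accepts(const char *s) {
    int state = 0;
    for (; *s != '\0'; s++)
        state = delta[state][*s == '1'];   /* column 0 for '0', column 1 for '1' */
    return state == 1;
}

int main(void) {
    printf("%d %d %d\n", accepts("10"), accepts("1101"), accepts("0"));  /* prints 1 0 1 */
    return 0;
}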
NFA
An NFA is similar to a DFA except for the following additional features:
1. A null (or ε) move is allowed, i.e., it can move forward without reading symbols.
2. It is able to transit to any number of states for a particular input.
However, the above features do not add any power to the NFA. If we compare both in terms of power, they are equivalent.
Due to the above additional features, an NFA has a different transition function; the rest is the same as for a DFA.
δ: Transition Function
δ: Q × (Σ ∪ {ε}) → 2^Q
As can be seen, the transition function is defined for any input including null (ε), and the NFA can go to any number of states for a given input.
The lexical analyzer needs to scan and identify only a finite set of valid string/token/lexeme that belong
to the language in hand. It searches for the pattern defined by the language rules. Regular expressions have the
capability to express finite languages by defining a pattern for finite strings of symbols. The grammar defined by
regular expressions is known as regular grammar. The language defined by regular grammar is known as
regular language.
Regular expression is an important notation for specifying patterns. Each pattern matches a set of
strings, so regular expressions serve as names for a set of strings. Programming language tokens can be
described by regular languages. The specification of regular expressions is an example of a recursive definition.
Regular languages are easy to understand and have efficient implementation.
There are a number of algebraic laws obeyed by regular expressions, which can be used to manipulate regular expressions into equivalent forms.
Operations
The various operations on languages are:
• Union of two languages L and M: L ∪ M = { s | s is in L or s is in M }
• Concatenation of languages L and M: LM = { st | s is in L and t is in M }
• Kleene closure of a language L: L*, i.e., zero or more occurrences of strings from L
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
• Union: (r)|(s) is a regular expression denoting L(r) ∪ L(s)
• Concatenation: (r)(s) is a regular expression denoting L(r)L(s)
• Kleene closure: (r)* is a regular expression denoting (L(r))*
letter = [a-z] or [A-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [+ | -]
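Using such regular definitions, typical token classes can be written down; the definitions below are an illustrative sketch, since the exact patterns depend on the language being compiled.
id = letter ( letter | digit )*
digits = digit digit*
number = digits | sign digits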
Representing language tokens using regular expressions
The only problem left with the lexical analyzer is how to verify the validity of a regular
expression used in specifying the patterns of keywords of a language. A well-accepted
solution is to use finite automata for verification.
To convert the RE to FA, we are going to use a method called the subset method. This method is
used to obtain FA from the given regular expression. This method is given below:
Step 1: Design a transition diagram for given regular expression, using NFA with ε moves.
Example 1:
10 + (0 + 11)0* 1.
Solution: First we will construct the transition diagram for a given regular expression.
Step-1
Step-2
Step-3
Step-4
Step-5
Now we have an NFA without ε-moves. To convert it into the required DFA, we first write a transition table for this NFA.
State 0 1
→q0 q3 {q1, q2}
q1 qf ϕ
q2 ϕ q3
q3 q3 qf
*qf ϕ ϕ
The equivalent DFA will be:
State 0 1
→[q0] [q3] [q1, q2]
[q1] [qf] ϕ
[q2] ϕ [q3]
[q3] [q3] [qf]
[q1, q2] [qf] [q3]
*[qf] ϕ ϕ
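As a check, the string 1101 (which is of the form 11 0* 1) is accepted: [q0] on 1 goes to [q1, q2], on 1 to [q3], on 0 to [q3], and on the final 1 to [qf], an accepting state. The string 11, by contrast, ends in the non-accepting state [q3] and is rejected.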
Rule priority also applies in a lexical analyzer: a reserved word (keyword) of the language is given priority over user input. That is, if the lexical analyzer finds a lexeme that matches any existing reserved word, it is recognized as that keyword rather than as a user-defined identifier.
UNIT – 1 TUTORIAL QUESTIONS
(Week 1)
1. Define compiler. State the various phases of a compiler and explain them in detail.
2. Explain the various phases of a compiler in detail. Also write down the output for the following expression after each phase:
a. a := b * c - d
4. Describe how various phases could be combined into a pass in a compiler. Also briefly explain compiler construction tools.
5. For the following expression: Position := initial + rate * 60
UNIT – 1 TUTORIAL QUESTIONS
(Week 2)
1. Describe the output for the various phases of compiler with respect to the following
statements Total = count + rate * 10.
2. Describe the role of regular expression in lexical analyzer.
SET - A
I. Answer the Following Objective Questions: Each carries 0.5 Mark
1. The action of passing the source program into the proper syntactic classes is known as
Lexical analysis
3. A grammar will be meaningless If the left hand side of the production is a single terminal
6. System program such as compiler are designed so that they are recursive
SET - B
4. The action of passing the source program into the proper syntactic classes is known as
Lexical analysis
6. A grammar will be meaningless If the left hand side of the production is a single terminal
9. System program such as compiler are designed so that they are recursive
SET - C
2. System program such as compiler are designed so that they are recursive
3. The action of passing the source program into the proper syntactic classes is known as
Lexical analysis
8. A grammar will be meaningless If the left hand side of the production is a single terminal
SET - D
I. Answer the Following Objective Questions: Each carries 0.5 Mark
1. code generation is related to synthesis phase?
3. A grammar will be meaningless If the left hand side of the production is a single terminal
6. System program such as compiler are designed so that they are recursive
10. The action of passing the source program into the proper syntactic classes is known as
Lexical analysis
II. Answer any one question: Each carries 5 Mark
1. Briefly describe the problems with top down parsing
5. FA
UNIT - 1
ASSIGNMENT QUESTIONS
1. Define compiler briefly.
One application of compilers is to build custom hardware, i.e., hardware specific to the software that you want to run. Such custom hardware is energy efficient and offers better performance. This can vary from adding new custom instructions to an ASIP (Application-Specific Instruction-set Processor) to designing a custom ASIC (Application-Specific Integrated Circuit) or an FPGA-based design (high-level synthesis).
The compiler takes the program, analyses and optimizes it, and uses the characteristics of the program to build the hardware. For example, if a program only consists of additions, then the custom hardware only needs adder units; it does not need multipliers. The custom hardware generated is also bitwidth aware: for example, if all the additions are 8-bit, then it does not need a 32-bit adder. It also tries to exploit parallelism in the program.
UNIT - 1
NPTEL
• http://nptel.ac.in/courses/106108052/#
UNIT -1
Bloom's Taxonomy
1. Remembering
Preprocessor
A preprocessor, generally considered as a part of compiler, is a tool that produces input for
compilers. It deals with macro-processing, augmentation, file inclusion, language extension, etc.
Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language. The difference lies in the way they read the source code or input. A compiler reads the whole source code at once, creates tokens, checks semantics, generates intermediate code, translates the whole program, and may involve many passes. In contrast, an interpreter reads a statement from the input, converts it to an intermediate code, executes it, then takes the next statement in sequence. If an error occurs, an interpreter stops execution and reports it, whereas a compiler reads the whole program even if it encounters several errors.
Assembler
An assembler translates assembly language programs into machine code. The output of an
assembler is called an object file, which contains a combination of machine instructions as well
as the data required to place these instructions in memory.
Linker
A linker is a computer program that links and merges various object files together in order to make an executable file. All these files might have been compiled by separate assemblers. The major task of a linker is to search and locate referenced modules/routines in a program and to determine the memory location where these codes will be loaded, so that the program instructions have absolute references.
Loader
A loader is a part of the operating system and is responsible for loading executable files into memory and executing them. It calculates the size of a program (instructions and data) and creates memory space for it. It initializes various registers to initiate execution.
2. Understanding
Cross-compiler:
A compiler that runs on platform (A) and is capable of generating executable code for platform (B) is called a cross-compiler.
Source-to-source Compiler:
A compiler that takes the source code of one programming language and translates it into the
source code of another programming language is called a source-to-source compiler.
UNIT – 1
GATE and PLACEMENT QUESTIONS
1. The action of passing the source program into the proper syntactic classes is known as [ A]
A) Lexical analysis B) Syntax analysis C) Interpretation analysis D) Parsing
15. There are ------ representations used for three-address code. [C]
a) 1 b) 2 c) 3 d) 4
18. In which parsing input symbols are placed at leaf nodes for successful parsing is [D ]
a) top down b) bottom up c) universal d) predictive
19. -------- converts high level language to machine level language [B ]
a) Compiler b) translator c) interpreter d) none
20. While performing type checking ------ special basic types are needed. [C ]
a) 1 b) 2 c) 3 d) 4