Chapter 2
COMPILER DESIGN
Phases of a Compiler
Figure: a typical decomposition (implementation) of a compiler into its phases.
2.1. The Role of the Lexical Analysis
The Lexical Analyzer (LA) is the first phase of a compiler; this phase is also called linear analysis or scanning.
The LA reads a stream of characters as input and produces a sequence of tokens.
Main Functions of the Lexical Analyzer
The first task is to read the input characters (the stream of characters) and
produce a sequence of tokens that the parser uses for syntax analysis.
The second task is to remove comments and white space from the source code,
in the form of blank, tab, and newline characters.
Another task is to generate an error message if it finds an invalid token in the
source program.
2.1. The Role of the Lexical Analysis…(Cont.)
Generally, the LA reads a stream of characters as input and produces a sequence of tokens.
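To make these tasks concrete, the following C sketch (all names are illustrative; this is not the implementation of any particular compiler) reads characters from standard input, skips white space, and groups the characters into identifier, number, or single-symbol tokens. A real lexical analyzer would also strip comments, distinguish keywords from identifiers, and report invalid characters as errors.

#include <ctype.h>
#include <stdio.h>

/* Illustrative token categories; a real compiler defines many more. */
enum TokenKind { TK_IDENTIFIER, TK_NUMBER, TK_SYMBOL, TK_EOF };

struct Token {
    enum TokenKind kind;
    char lexeme[64];                 /* the matched character sequence */
};

/* Read one token: skip white space, then collect an identifier,
 * a number, or a single punctuation symbol. */
struct Token next_token(void) {
    struct Token t = { TK_EOF, "" };
    int c = getchar();
    int n = 0;

    while (c == ' ' || c == '\t' || c == '\n')   /* drop white space */
        c = getchar();
    if (c == EOF)
        return t;

    if (isalpha(c)) {                            /* identifier (or keyword) */
        while (isalnum(c) && n < 63) { t.lexeme[n++] = (char)c; c = getchar(); }
        ungetc(c, stdin);
        t.kind = TK_IDENTIFIER;
    } else if (isdigit(c)) {                     /* numeric constant */
        while (isdigit(c) && n < 63) { t.lexeme[n++] = (char)c; c = getchar(); }
        ungetc(c, stdin);
        t.kind = TK_NUMBER;
    } else {                                     /* single-character symbol */
        t.lexeme[n++] = (char)c;
        t.kind = TK_SYMBOL;
    }
    t.lexeme[n] = '\0';
    return t;
}

int main(void) {
    for (struct Token t = next_token(); t.kind != TK_EOF; t = next_token())
        printf("token %d: %s\n", (int)t.kind, t.lexeme);
    return 0;
}

Compiled with a C compiler and given the line int value = 100; as input, it prints one line per token.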
2.2. Token, Pattern, Lexeme
An implementation of a lexical analyzer can:
remove all white space and comments,
identify tokens, and
return the lexeme of each found token.
Token: describes a category of input strings.
In a programming language, keywords, constants, identifiers, strings, numbers, white space,
operators, and punctuation symbols are considered tokens.
For example, in the C language, the variable declaration line
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (assignment operator), 100 (constant), and ; (semicolon symbol).
Attributes of Tokens
When more than one pattern matches a lexeme,
the lexical analyzer must provide additional information about the particular lexeme.
The lexical analyzer collects information about tokens into their associated attributes.
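A minimal sketch of how a token and its attribute can be paired (the type names and the attribute choices here are illustrative): the token names the category, while the attribute carries the extra information, such as the numeric value of a constant or a pointer to a symbol-table entry.

#include <stdio.h>

enum Kind { KEYWORD, IDENTIFIER, ASSIGN, NUMBER, SEMICOLON };

struct TokenAttr {
    enum Kind kind;        /* token category                        */
    const char *lexeme;    /* matched character sequence            */
    long value;            /* attribute: meaningful only for NUMBER */
};

int main(void) {
    /* Token stream produced for:  int value = 100;  */
    struct TokenAttr tokens[] = {
        { KEYWORD,    "int",   0   },
        { IDENTIFIER, "value", 0   },  /* in practice: symbol-table pointer */
        { ASSIGN,     "=",     0   },
        { NUMBER,     "100",   100 },  /* attribute: the numeric value      */
        { SEMICOLON,  ";",     0   },
    };
    for (int i = 0; i < 5; i++)
        printf("<%d, %s>\n", (int)tokens[i].kind, tokens[i].lexeme);
    return 0;
}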
2.2. Token, Pattern, Lexeme
A lexeme is a sequence of characters in the source program that
matches the pattern for a token.
Example: a = b + c;
Tokens: identifiers, operators, and punctuation.
Lexemes: a, b, c, +, =, ;
2.2. Token, Pattern, Lexeme
Program 1 (source code):
int max(int a, int b)
{
    if (a > b)
        return a;
    else
        return b;
}
Its lexemes include int, max, a, b, if, >, return, else, and the punctuation symbols ( ) { } , ;
2.2. Token, Pattern, Lexeme…(Cont.)
Specifications of Tokens
The following are used in specifying tokens:
a) Alphabets
b) Strings
c) Special Symbols
d) Language
e) Regular Expressions
Let us understand how language theory defines these terms:
a) Alphabets
An alphabet is any finite set of symbols.
{0, 1} is the set of binary alphabets (binary digits),
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F} is the set of hexadecimal alphabets,
{a-z, A-Z} is the set of English language alphabets (letters).
2.2. Token, Pattern, Lexeme…(Cont.)
b) Strings
Any finite sequence of alphabet symbols (characters) is called a string.
A string over some alphabet is a finite sequence of symbols drawn from that alphabet.
The length of a string S is the total number of occurrences of symbols in it, denoted by |S|.
E.g. the length of the string "compiler" is 8, denoted |compiler| = 8.
A string having no symbols, i.e. a string of zero length, is known as the empty string and is
denoted by ε (epsilon).
c) Special Symbols
A typical high-level language also contains special symbols such as arithmetic operators,
punctuation symbols, assignment and comparison operators, and logical operators.
2.2. Token, Pattern, Lexeme…(Cont.)
d) Language
A language is a set of strings over some finite set of fixed
alphabets.
Computer languages are considered as such sets, and the usual mathematical set operations
can be performed on them.
Finite languages can be described by means of regular expressions.
e) Regular Expressions
Regular expressions are an important notation for specifying the lexeme patterns of tokens.
Each pattern matches a set of strings, so a regular expression serves as a name for a set of
strings.
Regular expressions are used to represent the language for the lexical analyzer.
The lexical analyzer needs to scan and identify only the finite set of valid
strings/tokens/lexemes that belong to the language in hand.
It searches for the patterns defined by the language rules.
2.2. Token, Pattern, Lexeme…(Cont.)
e) Regular Expressions…(Cont.)
A grammar that can be defined by regular expressions is known as a regular grammar.
The language defined by a regular grammar is known as a regular language.
There are a number of algebraic laws obeyed by regular expressions; the underlying
operations on languages are union, concatenation, and closure.
In lexical analysis, by using regular expressions it is possible to represent:
valid tokens of a language,
occurrences of symbols, and
language tokens.
Representing Valid Tokens of a Language in regular expressions
If x is a regular expression, then:
x* means zero or more occurrences of x,
i.e. it can generate {ε, x, xx, xxx, xxxx, …}.
x+ means one or more occurrences of x,
i.e. it can generate {x, xx, xxx, xxxx, …}; equivalently, x+ = x.x*.
2.2. Token, Pattern, Lexeme…(Cont.)
e) Regular Expressions…(Cont.)
x? means at most one occurrence of x,
i.e. it can generate either {x} or {ε}.
[a-z] is all lower-case letters of the English language.
[A-Z] is all upper-case letters of the English language.
[0-9] is all the decimal digits.
Representation of Occurrences of Symbols using regular expressions
letter = [a-z] | [A-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9, i.e. [0-9]
sign = [+ | -]
Representation of Language Tokens using regular expressions
decimal = (sign)?(digit)+
identifier = (letter)(letter | digit)*
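As a sketch of how these patterns can be checked mechanically (using the POSIX regex.h API, which is an implementation choice not mentioned in the slides), the two token definitions above translate directly into extended regular expressions:

#include <regex.h>
#include <stdio.h>

/* Check one string against a POSIX extended regular expression. */
static int matches(const char *pattern, const char *text) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    int ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

int main(void) {
    /* decimal    = (sign)?(digit)+          ->  ^[+-]?[0-9]+$         */
    /* identifier = (letter)(letter|digit)*  ->  ^[a-zA-Z][a-zA-Z0-9]*$ */
    const char *decimal    = "^[+-]?[0-9]+$";
    const char *identifier = "^[a-zA-Z][a-zA-Z0-9]*$";

    printf("%d\n", matches(decimal, "-42"));       /* 1: valid decimal     */
    printf("%d\n", matches(identifier, "value1")); /* 1: valid identifier  */
    printf("%d\n", matches(identifier, "1value")); /* 0: starts with digit */
    return 0;
}

The program prints 1 for strings that match the pattern and 0 otherwise.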
2.2. Token, Pattern, Lexeme…(Cont.)
d) Regular Expressions…(Cont.)
Table 1: pattern matching examples of regular expression
12
2.2. Token, Pattern, Lexeme…(Cont.)
e) Regular Expressions…(Cont.)
Therefore, in lexical analysis, by using regular expressions it is possible to
represent:
valid tokens of a language,
occurrences of symbols, and
language tokens.
The only problem left for the lexical analyzer is how to verify the validity of a
regular expression used in specifying the patterns of the keywords of a
language.
A well-accepted solution to this problem is using Finite Automata for
verification.
2.3. Lexical Errors
Lexical Errors:
Lexical errors are not very common, but they should be managed by the scanner.
Misspelling of identifiers, operators, or keywords, and the appearance of illegal
characters, are considered lexical errors.
Some errors are beyond the power of the LA to recognize, because the LA has a very localized
view of the source program, e.g.:
fi (a == b)
whlie (a < b)
Such errors are recognized when no pattern for tokens matches a character
sequence, e.g. numeric constants which are ill-formed:
int i = 4567$91;   /* lexical error */
2.3. Lexical Errors…(Cont.)
Error Recovery
Deleting an extraneous character,
e.g. coutt -> cout
Inserting a missing character,
e.g. cot -> cout
Replacing an incorrect character by a correct character,
e.g. couf -> cout
Transposing two adjacent characters,
e.g. ocut -> cout
2.4. Automata
A Finite Automaton (FA) is a state machine that takes a string of symbols as input and
changes its state accordingly.
An FA is a recognizer for regular expressions.
When a regular expression string is fed into a finite automaton, it changes its state for each
literal.
If the input string is successfully processed and the automaton reaches its final state, the string is
accepted,
i.e. the string that was fed in is said to be a valid token of the language in hand.
Regular Expressions = Specification
Finite Automata = Implementation
FA representations:
Graphical (transition diagram),
Tabular (transition table), and
Mathematical (transition function or mapping function).
2.4. Automata….(Cont.)
Formal Definition of Finite Automata
An FA is a 5-tuple:
M = (Q, Σ, δ, q0, F)
Where:
Q is a finite set of states.
Σ is a finite set of input symbols (the alphabet).
δ : Q × Σ → Q is the transition function.
q0 ∈ Q is the start state, also called the initial state.
F ⊆ Q is the set of final (accepting) states.
2.4. Automata….(Cont.)
A) Transition Diagram (Transition Graph)
It is a directed graph in which the vertices correspond to the states of the
finite automaton and the edges, labeled with input symbols, show the transitions.
In a state graph over the inputs {0, 1}:
the initial state is marked with an incoming arrow,
intermediate states are drawn as single circles, and
the final state is drawn as a double circle.
2.4. Automata…..(Cont.)
B) Transition Table
It is a tabular representation of the transition function: it takes two
arguments (a state and a symbol) and returns a value (the "next state").
Rows correspond to states,
columns correspond to input symbols,
and entries give the next states.
The start state is marked with an arrow (->).
The accepting states are marked with a star (*).
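A minimal C sketch of how a transition table can drive a DFA (the state numbering, the table entries, and the accepting set here are illustrative; this particular table happens to accept binary strings ending in "01"):

#include <stdio.h>

#define NUM_STATES  3
#define NUM_SYMBOLS 2   /* input alphabet {0, 1}, encoded as columns 0 and 1 */

/* delta[state][symbol] = next state */
static const int delta[NUM_STATES][NUM_SYMBOLS] = {
    { 1, 0 },   /* state 0 (start):  on '0' -> 1, on '1' -> 0 */
    { 1, 2 },   /* state 1:          on '0' -> 1, on '1' -> 2 */
    { 1, 0 },   /* state 2 (accept): on '0' -> 1, on '1' -> 0 */
};

static const int start_state = 0;
static const int accepting[NUM_STATES] = { 0, 0, 1 };   /* only state 2 accepts */

/* Run the DFA over a string of '0'/'1' characters. */
static int dfa_accepts(const char *input) {
    int state = start_state;
    for (int i = 0; input[i] != '\0'; i++) {
        int symbol = input[i] - '0';          /* map '0'/'1' to column 0/1 */
        if (symbol < 0 || symbol >= NUM_SYMBOLS)
            return 0;                         /* character outside the alphabet */
        state = delta[state][symbol];
    }
    return accepting[state];
}

int main(void) {
    printf("%d\n", dfa_accepts("1101"));  /* 1: ends in "01", accepted */
    printf("%d\n", dfa_accepts("1010"));  /* 0: ends in "10", rejected */
    return 0;
}

The same skeleton works for any DFA in this chapter: only the table, the start state, and the accepting set change.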
2.4. Automata
C) Transition Function
The mapping function, or transition function, is denoted by δ.
Two parameters are passed to this transition function:
the current state and
an input symbol.
δ : Q × Σ → Q
The transition function returns a state, which is called the next state.
Example: δ(q0, a) = q1 means that if the automaton is in state q0 and reads the
input symbol a, it moves to state q1.
2.4. Automata….(Cont.)
Types of Finite Automata
1) Deterministic Finite Automata (DFA)
2) Non-Deterministic Finite Automata (NDFA or NFA)
A DFA can contain multiple final states. DFAs are used in lexical analysis in compilers.
2.4. Automata….(Cont.)
Formal Definition of DFA
A DFA is a 5-tuple (Q, Σ, δ, q0, F), the same as a general FA, where:
Q is a finite set of states.
Σ is a finite set of input symbols.
q0 is the initial state.
F is the set of final states.
δ is the transition function,
defined as:
δ : Q × Σ → Q
Acceptance of a Language
A string w is accepted by the machine M
if, after reading all of w, M reaches a final state in F.
The string is not accepted if M does not reach a final state.
2.4. Automata….(Cont.)
Generally, when constructing a DFA, the number of states depends on the length of the shortest accepted string:
Number of states = (length of the minimum string) + 1
L = { } means the language contains no strings.
2.4. Automata….(Cont.)
Example 1:
Let the DFA have Q = {a, b, c}
Σ = {0, 1} (input symbols)
a = the initial state
F = {c} (final state)
Transition table:
State | 0 | 1
-> a  | a | a
b     | c | a
* c   | b | c
2.4. Automata….(Cont.)
Example 2: design a DFA with Σ = {0, 1} that accepts those strings which start with 1 and end
with 0.
Solution:
L = {10, 100, 1010, …}
Minimum length = 2
Number of states = minimum length + 1
= 2 + 1
= 3
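As a sketch (the state names and the extra trap state are assumptions, not taken from the slides), one possible transition table for this DFA uses q0 as the start state, q1 for strings that start with 1 and currently end with 1, q2 (accepting) for strings that start with 1 and currently end with 0, plus a trap state qd that absorbs any string starting with 0:
State | 0  | 1
-> q0 | qd | q1
q1    | q2 | q1
* q2  | q2 | q1
qd    | qd | qd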
2.4. Automata….(Cont.)
Example 3: construct a DFA that accepts all strings over {a, b} ending with 'ab'.
Solution:
L = {ab, bab, …}
Minimum length = 2
Number of states = minimum length + 1
= 2 + 1
= 3
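As a sketch (state names are illustrative), one possible transition table for this DFA uses q0 as the start state, q1 for strings currently ending with 'a', and q2 (accepting) for strings currently ending with 'ab':
State | a  | b
-> q0 | q1 | q0
q1    | q1 | q2
* q2  | q1 | q0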
2.4. Automata….(Cont.)
2) NFA (Non-Deterministic Finite Automata)
In an NFA there may exist many paths for a specific input from the current state to the next state.
It is easier to construct an NFA than a DFA for a given regular language.
Not every NFA is a DFA, but every NFA can be translated into an equivalent DFA.
A DFA has only one path for a specific input.
An NFA may have many paths for a specific input.
An NFA is defined in the same way as a DFA, but with two exceptions:
it can have multiple next states for a given input, and
it can contain ε-transitions, which means the machine can move from one state to another without
reading any input.
2.4. Automata….(Cont.)
Formal Definition of NFA
An NFA also has the same five components as a DFA, but a different transition function:
δ : Q × (Σ ∪ {ε}) → 2^Q (the power set of Q)
Where:
Q is a finite set of states.
Σ is a finite set of input symbols.
q0 is the initial state.
F is the set of final states.
δ is the transition function.
2.4. Automata….(Cont.)
Graphical Representation of NFA
An NFA can be represented by a graph called a state diagram, in which:
1. States are represented by vertices.
2. Arcs labeled with input characters show the transitions.
3. The initial state is marked with an arrow.
4. The final state is denoted by a double circle.
2.4. Automata….(Cont.)
Example 1:
Let the NFA have Q = {q0, q1, q2}
Σ = {0, 1} (input symbols)
q0 = the initial state
F = {q2} (final state)
with its transitions given by a transition table over the inputs 0 and 1.
2.4. Automata….(Cont.)
Example 2: construct an NFA for L = {strings over {a, b} that start with 'a'}.
Σ = {a, b}
Solution:
L = {a, ab, abb, …}
Minimum length = 1
Number of states = minimum length + 1
= 1 + 1
= 2
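As a sketch (state names are illustrative), one possible transition table for this NFA uses q0 as the start state and q1 as the accepting state reached once a leading 'a' has been read; q1 then loops on both inputs:
State | a    | b
-> q0 | {q1} | { }
* q1  | {q1} | {q1}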
2.4. Automata….(Cont.)
Example 3: design the NFA corresponding to a given transition table with Q = {q0, q1, q2}
over the input symbols 0 and 1.
2.5. Lexical Analyzer Generator: LEX
Creating a Lexical Analyzer with Lex:
First, a lexical analyzer is prepared by creating a program lex.l in the Lex language.
Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c.
Finally, lex.yy.c is run through the C compiler to produce an object program a.out;
a.out is the lexical analyzer that transforms an input stream into a sequence of
tokens.
2.5. Lexical Analyzer Generator: LEX…..(Cont.)
Lex specification: a Lex program consists of three parts:
declarations
%%
translation rules
%%
auxiliary functions (user code)
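A minimal sketch of a lex.l specification showing the three parts (the token classes and the printed messages are illustrative, not from the slides):

%{
/* declarations: C includes and definitions copied into lex.yy.c */
#include <stdio.h>
%}

digit   [0-9]
letter  [a-zA-Z]

%%
{digit}+                    { printf("NUMBER: %s\n", yytext); }
{letter}({letter}|{digit})* { printf("IDENTIFIER: %s\n", yytext); }
[ \t\n]+                    { /* skip white space */ }
.                           { printf("ILLEGAL CHARACTER: %s\n", yytext); }
%%

/* auxiliary functions (user code) */
int yywrap(void) { return 1; }

int main(void) {
    yylex();        /* run the generated scanner on standard input */
    return 0;
}

Running lex (or flex) on this file and compiling the generated lex.yy.c with a C compiler produces the a.out scanner described above.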
THANK YOU!