Compiler 2

The document discusses lexical analysis, the first phase of a compiler, which involves breaking source code into tokens while removing whitespace and comments. It explains the concepts of tokens, lexemes, patterns, and the use of finite automata in recognizing these tokens, as well as error handling during compilation. Additionally, it covers different types of errors, their recovery mechanisms, and provides examples of tokenization in C language code.


Compiler Design

Chapter 2
Lexical Analysis
By Diriba Regasa (MSc)
Lexical analysis

• Lexical analysis is the first phase of a compiler; the component that
performs it is also known as the scanner.
• It takes modified source code from language preprocessors, written in
the form of sentences.
• The lexical analyzer breaks this text into a series of tokens,
removing any whitespace and comments in the source code.
• If the lexical analyzer finds an invalid token, it generates an error.
• The lexical analyzer works closely with the syntax analyzer.
• It reads the character stream from the source code, checks for legal
tokens, and passes the data to the syntax analyzer on demand.
• Lexical analysis can be implemented with deterministic finite automata.
2
Token, Lexeme and Pattern

• Token: A token is a pair consisting of a token name and an optional
attribute value.
– The token name is an abstract symbol representing a kind of lexical
unit.
• Lexeme: A lexeme is a sequence of characters in the source program that
matches the pattern for a token and is identified by the lexical analyzer as
an instance of that token.
• There are predefined rules for every lexeme to be identified as a
valid token.
• These rules are defined by grammar rules, by means of a pattern.
• Pattern: A pattern is a description of the form that the lexemes of a token
may take.
– A pattern specifies what can be a token, and these patterns are defined
by means of regular expressions.

3

• In a programming language, keywords, constants, identifiers,
strings, numbers, operators, literals and punctuation symbols
can be considered as tokens.
• For example, in the C language, the variable declaration line
• int value = 100; contains the tokens:
• int (keyword),
• value (identifier),
• = (operator),
• 100 (constant) and
• ; (symbol).

4
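As an illustration (not part of the original slides), the classification above can be sketched as a minimal tokenizer in Python. The token names and the tiny pattern list are assumptions chosen to match the example line, not a full C lexer:

```python
import re

# Ordered token patterns; earlier patterns win on ties.
# A deliberately tiny, hypothetical subset of C tokens.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|char|float|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("CONSTANT",   r"\d+"),
    ("OPERATOR",   r"="),
    ("SYMBOL",     r";"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return (token_name, lexeme) pairs, skipping whitespace."""
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(source)
            if m.lastgroup != "SKIP"]

print(tokenize("int value = 100;"))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'value'), ('OPERATOR', '='),
#  ('CONSTANT', '100'), ('SYMBOL', ';')]
```

Note how the keyword pattern is tried before the identifier pattern, so `int` is classified as a keyword rather than an identifier.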

• Special symbols: A typical high-level language contains the
following symbols:

Arithmetic symbols    Addition (+), Subtraction (-), Modulo (%),
                      Multiplication (*), Division (/)
Punctuation           Comma (,), Semicolon (;), Dot (.), Arrow (->)
Assignment            =
Special assignment    +=, /=, *=, -=
Comparison            ==, !=, <, <=, >, >=
Preprocessor          #
Location specifier    &
Logical               &, &&, |, ||, !
Shift operators       >>, >>>, <<
5
Token specification

A token specification is a formal definition that describes the valid types of
tokens that can be produced during lexical analysis (or tokenization) of source
code.

Let us understand how language theory defines the following terms:

• Alphabet: any finite set of symbols. {0,1} is the binary alphabet,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and {a-z,
A-Z} is the alphabet of English letters.

• String: any finite sequence of symbols from an alphabet. The length of a
string is the total number of symbol occurrences in it, e.g., the length of the
string tokens is 6, denoted |tokens| = 6.

• A string containing no symbols, i.e. a string of zero length, is known as the
empty string and is denoted by ε (epsilon).

6

• A language is a set of strings over some finite alphabet.
• Languages are sets, so mathematical set operations can be
performed on them.
• Regular languages can be described by means of regular
expressions.
• Regular expressions: have the capability to express regular
languages by defining a pattern for finite strings of symbols.
• The grammar defined by regular expressions is known
as regular grammar.
• The language defined by a regular grammar is known as a regular
language.

7

• A regular expression is an important notation for specifying
patterns.

• Each pattern matches a set of strings, so regular expressions serve as
names for sets of strings.

• Programming language tokens can be described by regular
languages.

• Regular languages are easy to understand and have efficient
implementations.

8

• Regular expressions obey a number of algebraic laws, which can be used to
manipulate regular expressions into equivalent forms.
• Operations on languages:
✓ The union of two languages L and M is written as
L U M = {s | s is in L or s is in M}
✓ The concatenation of two languages L and M is written as
LM = {st | s is in L and t is in M}
✓ The Kleene closure of a language L is written as
L* = zero or more occurrences of language L.

9
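These three operations can be sketched directly on small example languages in Python (an illustration added here, not from the slides). Since L* is infinite in general, the Kleene closure below is truncated to strings of bounded length:

```python
def union(L, M):
    """L U M = {s | s in L or s in M}"""
    return L | M

def concat(L, M):
    """LM = {st | s in L and t in M}"""
    return {s + t for s in L for t in M}

def kleene(L, max_len):
    """Approximate L*: all concatenations of strings from L,
    truncated to length <= max_len (L* itself is infinite)."""
    result = {""}            # ε is always in L*
    frontier = {""}
    while frontier:
        frontier = {s + t for s in frontier for t in L
                    if len(s + t) <= max_len} - result
        result |= frontier
    return result

L, M = {"a", "b"}, {"0", "1"}
print(union(L, M))                 # {'a', 'b', '0', '1'} (some order)
print(sorted(concat(L, M)))        # ['a0', 'a1', 'b0', 'b1']
print(sorted(kleene({"a"}, 3)))    # ['', 'a', 'aa', 'aaa']
```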

• Notation: If r and s are regular expressions denoting the languages
L(r) and L(s), then
• Union: (r)|(s) is a regular expression denoting L(r) U L(s)
• Concatenation: (r)(s) is a regular expression denoting L(r)L(s)
• Kleene closure: (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)

10

Precedence and Associativity:

• *, concatenation (.), and | (pipe) are left associative.
• * has the highest precedence.
• Concatenation (.) has the second highest precedence.
• | (pipe) has the lowest precedence of all.
Representing valid tokens of a language in regular expressions
• If y is a regular expression, then:
✓ y* means zero or more occurrences of y.
✓it can generate { ε, y, yy, yyy, yyyy, … }
✓ y+ means one or more occurrences of y.
✓it can generate { y, yy, yyy, yyyy, … }, i.e. y+ = y.y*

11

✓ x? means at most one occurrence of x.
✓it can generate either {x} or {ε}.
• [a-z] matches all lower-case letters of the English alphabet.
• [A-Z] matches all upper-case letters of the English alphabet.
• [0-9] matches all decimal digits.
12
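The operators above behave the same way in practical regex engines. The following Python snippet (added as an illustration, not from the slides) checks a few of the generated strings with the standard `re` module:

```python
import re

# y* : zero or more occurrences of y
assert re.fullmatch(r"y*", "") is not None       # ε is generated
assert re.fullmatch(r"y*", "yyy") is not None

# y+ : one or more occurrences of y
assert re.fullmatch(r"y+", "") is None           # ε is NOT generated
assert re.fullmatch(r"y+", "y") is not None

# x? : at most one occurrence of x
assert re.fullmatch(r"x?", "") is not None
assert re.fullmatch(r"x?", "x") is not None
assert re.fullmatch(r"x?", "xx") is None

# character classes: a typical identifier pattern built from them
identifier = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
assert identifier.fullmatch("value") is not None
assert identifier.fullmatch("9lives") is None    # must not start with a digit
print("all checks passed")
```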
Finite Automaton

• Finite automata are used to recognize patterns.
• An automaton takes a string of symbols as input and changes its state
accordingly.
• When the expected symbol is found, a transition occurs.
• On a transition, the automaton can either move to the next state or
stay in the same state.
• A run of a finite automaton has two possible outcomes:
– Accept, or
– Reject.
• When the input string has been processed completely and the
automaton has reached a final state, the string is accepted.

13

• An automaton can be represented by a 5-tuple (Q, ∑, δ, q0, F), where:
• Q is a finite set of states.
• ∑ is a finite set of symbols, called the alphabet of the automaton.
• δ is the transition function.
• q0 is the initial state from which any input is processed (q0 ∈ Q).
• F is the set of final states (F ⊆ Q).

14
Finite automata model

• A finite automaton can be modeled as an input tape plus a finite control.

• Input tape: a linear tape with some number of cells; each cell holds one
input symbol.
• Finite control: the finite control decides the next state on receiving an
input symbol from the tape.
• The tape reader reads the cells one by one from left to right; at any
moment only one input symbol is read.

[Figure: tape reader scanning the input tape, driven by the finite control]

15
Representation of Finite Automata

Transition diagram
• The transition diagram, also called a transition graph, is
represented by a digraph. A transition diagram consists of three
things:
• Arrow: The initial state in the transition diagram is marked
with an arrow.
• Circle: Each circle represents a state.
• Double circle: A double circle indicates a final (accepting)
state.

16

Transition Table
– It is the tabular representation of the transition function, which
takes two arguments, a state and an input symbol, and returns a
value, the new state of the automaton.
– It represents all the moves of the finite-state machine based on
the current state and input.
– In the transition table, the initial state is marked with an
arrow, and a final state is marked with a star (*).

17

• Formally, a transition table is a 2-dimensional array of rows and
columns where:
– The rows of the transition table represent the states.
– The columns contain the state to which the machine moves on each
input symbol.

18
Types of finite automata

• Deterministic
– On each input there is one and only one state to which the automaton
can transition from its current state
• Nondeterministic
– An automaton can be in several states at once.

19
Deterministic Finite Automaton (DFA)

• A finite set of states, often denoted Q
• A finite set of input symbols, often denoted Σ
• A transition function that takes as arguments a state and an input
symbol and returns a state.
• The transition function is commonly denoted δ
If q is a state and a is a symbol, then δ(q, a) is a state p (and in the
graph that represents the automaton there is an arc from q to p
labeled a)
• A start state, one of the states in Q
• A set of final or accepting states F (F ⊆ Q)
– Notation: a DFA A is a tuple A = (Q, Σ, δ, q0, F)

20

Example: Let a deterministic finite automaton be →
Q = {a, b, c},
∑ = {0, 1},
q0 = a,
F = {c}

21
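The example above omits the transition diagram, so the following Python sketch (added here as an illustration) supplies a hypothetical transition function δ for Q = {a, b, c}, Σ = {0, 1}, q0 = a, F = {c} and simulates the DFA; the δ chosen accepts exactly the strings containing the substring "01":

```python
# Hypothetical DFA: Q = {a,b,c}, Σ = {0,1}, q0 = a, F = {c}.
# This particular δ accepts every string containing the substring "01".
delta = {
    ("a", "0"): "b", ("a", "1"): "a",
    ("b", "0"): "b", ("b", "1"): "c",
    ("c", "0"): "c", ("c", "1"): "c",
}
start, final = "a", {"c"}

def accepts(word):
    """Run the DFA; accept iff the run ends in a final state."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]   # exactly one next state (DFA)
    return state in final

print(accepts("1101"))   # True  ("01" occurs)
print(accepts("110"))    # False
print(accepts(""))       # False (start state a is not final)
```

Because δ is a total function from (state, symbol) to a single state, the run is fully determined by the input, which is exactly the "deterministic" property.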
Non-Deterministic Finite Automaton (NDFA)

• An NFA has the power to be in several states at once.
• Every NFA accepts a language that is also accepted by some DFA.
• NFAs are often more succinct and easier to construct than DFAs.
• We can always convert an NFA to an equivalent DFA.
• The difference between a DFA and an NFA is the type of the
transition function δ:
– For an NFA, δ is a function that takes a state and an input symbol as
arguments (like the DFA transition function), but returns a set of
zero or more states (rather than exactly one state, as a
DFA must).

22

Example: Let a non-deterministic finite automaton be →
Q = {a, b, c},
∑ = {0, 1},
q0 = a,
F = {c}

23
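Because an NFA can be in several states at once, simulating it means tracking a set of current states. The sketch below is an illustration added here; the transition relation is hypothetical (the slide's diagram is not reproduced) and uses Q = {a, b, c}, Σ = {0, 1}, q0 = a, F = {c} for an NFA accepting strings that end in "01":

```python
# Hypothetical NFA accepting strings that end in "01": from state a we
# may stay in a on any symbol, or guess that the final "01" starts now.
delta = {
    ("a", "0"): {"a", "b"},   # nondeterministic: two possible next states
    ("a", "1"): {"a"},
    ("b", "1"): {"c"},        # missing entries mean "no move" (empty set)
}
start, final = "a", {"c"}

def accepts(word):
    """Subset simulation: track the set of all reachable states."""
    states = {start}
    for symbol in word:
        next_states = set()
        for q in states:
            next_states |= delta.get((q, symbol), set())
        states = next_states
    # accept iff at least one possible run ends in a final state
    return bool(states & final)

print(accepts("1101"))   # True  (ends in "01")
print(accepts("0110"))   # False
```

This set-of-states simulation is the same idea used by the subset construction when converting an NFA to a DFA.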
DFA vs NDFA

DFA:
• The transition from a state goes to a single next state for each
input symbol. Hence it is called deterministic.
• Empty-string (ε) transitions are not seen in a DFA.
• Backtracking is allowed in a DFA.
• Requires more space.
• A string is accepted by a DFA if it transits to a final state.

NDFA:
• The transition from a state can be to multiple next states for each
input symbol. Hence it is called non-deterministic.
• An NDFA permits empty-string (ε) transitions.
• In an NDFA, backtracking is not always possible.
• Requires less space.
• A string is accepted by an NDFA if at least one of all possible
transitions ends in a final state.

24
Error recovery

• An error is a user-initiated action that results in a program's
abnormal behavior or other issues.
• It is critical to recover from these errors as quickly as possible,
because if they are not handled in a timely manner,
– they will lead to a situation from which it is extremely
difficult to recover.
• All possible user errors are identified and reported to the user
in the form of error messages during this phase of compilation.
• The error handling process is a way of locating errors and
reporting them to users.

25

• In general, a compiler error occurs whenever the compiler fails to
compile a line of code or an algorithm, due to a bug in the code or a
bug in the compiler itself.
• The goals of the error handling process are to identify each error,
display it to the user, and then develop and apply a recovery strategy
to handle the error.
• Error handling should not significantly slow down the program's
processing time.
• Features of an error handler:
1. Error Detection
2. Error Reporting
3. Error Recovery
Error handler = Error Detection + Error Reporting + Error Recovery

26

• Blank entries in the symbol table are errors.
• The parser should be able to discover and report program
errors.
• The parser should handle any errors that occur and continue
parsing the rest of the input.
• Although the parser is primarily responsible for error
detection, faults can arise at any point during the compilation
process.
• There are three kinds of errors:
• Compile-time errors
• Runtime errors
• Logical errors
27
Compile-time error

• Compile-time errors appear during the compilation process,
before the program is executed.

• They can be due to a syntax error or a missing file reference
that stops the application from compiling properly.
Types of Compile-Time Errors
• The three major kinds of compile-time errors:
• Lexical phase errors
• Syntactic phase errors
• Semantic phase errors

28
Run-time error

• A run-time error occurs during the execution of a program
and is most often caused by incorrect system parameters or improper
input data.

• Causes include a lack of memory to run the application, a
memory conflict with another program, or a logical error.

• These errors can also arise when a user provides input the program
does not expect, or executes code whose behavior a typical compiler
cannot check at compile time.

29
Logical error

• Logic errors occur when a program behaves incorrectly yet does not
terminate abnormally.

• A logic error can result in unexpected or unwanted outputs or
other behavior, even if it is not immediately identified as such.

• Examples are errors that occur when the specified code is
unreachable or when an infinite loop is present.

30
Mode of Error recovery

• The compiler's simplest strategy is to simply stop, issue a
message, and halt compilation.
• To cope with problems in the code, the parser can implement
one of the following five typical error-recovery mechanisms, which
are among the most prevalent recovery strategies:
• Panic mode recovery
• Statement mode recovery
• Error productions
• Global correction
• Using the symbol table

31
Examples

• Find the total number of tokens in the C code below:

main()
{
int i, j;
}

The lexical analyzer treats “main” as one token,
rather than “main()”.
The tokens are: main, (, ), {, int, i, “,”, j, ;, }
Ans = 10 tokens

32

• Find the total number of tokens in the C code below:

int *abc;
int x;
x++;

“abc” and “*” are counted as different
tokens, even though together they
describe a single declaration.
The longest matching string is taken
as one token, so “++” is a single token.
The tokens are: int, *, abc, ;, int, x, ;, x, ++, ;
Ans = 10 tokens

33

• Find the number of tokens in the C code below:

printf("hello dirro%d%d", i, j)

Everything inside the double quotes
is considered one token.
The tokens are: printf, (, "hello dirro%d%d", “,”, i, “,”, j, )
Ans = 8 tokens

34

• Find the total number of tokens in the C code below:

char a = 'b';

Everything inside the single quotes
is considered one token.
The tokens are: char, a, =, 'b', ;
Ans = 5 tokens

35

• Find the total number of tokens in the C code below:

*++--+++***=++++&&

The lexical analyzer always takes the longest
match, so “*=” is a single token.
The tokens are: *, ++, --, ++, +, *, *, *=, ++, ++, &&
Ans = 11 tokens

36
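The longest-match ("maximal munch") rule used above can be sketched as a small Python function (added here as an illustration). The operator set is a hypothetical subset of C, just large enough for the example, and the sketch reproduces the count of 11:

```python
# A subset of C operators, enough for the example (hypothetical list).
OPERATORS = {"*", "*=", "+", "++", "-", "--", "=", "&", "&&"}
MAX_LEN = max(len(op) for op in OPERATORS)

def munch(text):
    """Split text into operator tokens, always taking the longest match."""
    tokens, i = [], 0
    while i < len(text):
        # try the longest candidate first (maximal munch)
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            if text[i:i + n] in OPERATORS:
                tokens.append(text[i:i + n])
                i += n
                break
        else:
            raise ValueError(f"no operator matches at position {i}")
    return tokens

toks = munch("*++--+++***=++++&&")
print(toks)       # ['*', '++', '--', '++', '+', '*', '*', '*=', '++', '++', '&&']
print(len(toks))  # 11
```

The inner loop tries a 2-character match before a 1-character one, which is why the third `*` combines with `=` into `*=` rather than standing alone.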

• Which one of the following strings can definitely be said to be a
token without looking at the next input character?
• Var
• Float
• Return
• ;
• *
• ++

37
The End
