Compiler 2

The document discusses lexical analysis, the first phase of a compiler, which involves breaking source code into tokens while removing whitespace and comments. It explains the concepts of tokens, lexemes, patterns, and the use of finite automata in recognizing these tokens, as well as error handling during compilation. Additionally, it covers different types of errors, their recovery mechanisms, and provides examples of tokenization in C language code.


Compiler Design

Chapter 2
Lexical Analysis
By Diriba Regasa (MSc)
Lexical analysis

• Lexical analysis is the first phase of a compiler; the component that
performs it is also known as the scanner.
• It takes modified source code from language preprocessors, written in
the form of sentences.
• The lexical analyzer breaks this text into a series of tokens,
removing any whitespace and comments in the source code.
• If the lexical analyzer finds an invalid token, it generates an error.
• The lexical analyzer works closely with the syntax analyzer.
• It reads the character stream from the source code, checks for legal
tokens, and passes the data to the syntax analyzer on demand.
• Lexical analysis can be implemented with deterministic finite automata.
2
Token, Lexeme and Pattern

• Token: A token is a pair consisting of a token name and an optional
attribute value.
– The token name is an abstract symbol representing a kind of lexical
unit.
• Lexeme: A lexeme is a sequence of characters in the source program that
matches the pattern for a token and is identified by the lexical analyzer as
an instance of that token.
• There are predefined rules for every lexeme to be identified as a
valid token.
• These rules are defined by grammar rules, by means of a pattern.
• Pattern: A pattern is a description of the form that the lexemes of a token
may take.
– A pattern specifies what can be a token, and these patterns are defined
by means of regular expressions.

3

• In a programming language, keywords, constants, identifiers,
strings, numbers, operators, literals and punctuation symbols
can be considered as tokens.
• For example, in the C language, the variable declaration line
• int value = 100; contains the tokens:
• int (keyword),
• value (identifier),
• = (operator),
• 100 (constant) and
• ; (symbol).

4
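As an illustration (not part of the original slides), the classification above can be sketched as a minimal tokenizer in Python. The token names and the tiny pattern list are assumptions chosen to match the example line, not a full C lexer:

```python
import re

# Ordered token patterns; earlier patterns win on ties.
# A deliberately tiny, hypothetical subset of C tokens.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|char|float|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("CONSTANT",   r"\d+"),
    ("OPERATOR",   r"="),
    ("SYMBOL",     r";"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return (token_name, lexeme) pairs, skipping whitespace."""
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(source)
            if m.lastgroup != "SKIP"]

print(tokenize("int value = 100;"))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'value'), ('OPERATOR', '='),
#  ('CONSTANT', '100'), ('SYMBOL', ';')]
```

Note how the keyword pattern is tried before the identifier pattern, so `int` is classified as a keyword rather than an identifier.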

• Special symbols: A typical high-level language contains the
following symbols:

Arithmetic symbols    Addition (+), Subtraction (-), Modulo (%),
                      Multiplication (*), Division (/)
Punctuation           Comma (,), Semicolon (;), Dot (.), Arrow (->)
Assignment            =
Special assignment    +=, /=, *=, -=
Comparison            ==, !=, <, <=, >, >=
Preprocessor          #
Location specifier    &
Logical               &, &&, |, ||, !
Shift operators       >>, >>>, <<
5
Token specification

A token specification is a formal definition that describes the valid types of
tokens that can be produced during lexical analysis (or tokenization) of source
code.

Let us understand how language theory defines the following terms:

• Alphabet: any finite set of symbols. {0,1} is the binary alphabet,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and {a-z,
A-Z} is the alphabet of English letters.

• String: any finite sequence of symbols from an alphabet. The length of a
string is the total number of symbol occurrences in it, e.g., the length of the
string tokens is 6, denoted |tokens| = 6.

• A string containing no symbols, i.e. a string of zero length, is known as the
empty string and is denoted by ε (epsilon).

6

• A language is a set of strings over some finite alphabet.
• Languages are sets, so mathematical set operations can be
performed on them.
• Regular languages can be described by means of regular
expressions.
• Regular expressions: have the capability to express regular
languages by defining a pattern for finite strings of symbols.
• The grammar defined by regular expressions is known
as regular grammar.
• The language defined by a regular grammar is known as a regular
language.

7

• A regular expression is an important notation for specifying
patterns.

• Each pattern matches a set of strings, so regular expressions serve as
names for sets of strings.

• Programming language tokens can be described by regular
languages.

• Regular languages are easy to understand and have efficient
implementations.

8

• Regular expressions obey a number of algebraic laws, which can be used to
manipulate regular expressions into equivalent forms.
• Operations on languages:
✓ The union of two languages L and M is written as
L U M = {s | s is in L or s is in M}
✓ The concatenation of two languages L and M is written as
LM = {st | s is in L and t is in M}
✓ The Kleene closure of a language L is written as
L* = zero or more occurrences of language L.

9
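These three operations can be sketched directly on small example languages in Python (an illustration added here, not from the slides). Since L* is infinite in general, the Kleene closure below is truncated to strings of bounded length:

```python
def union(L, M):
    """L U M = {s | s in L or s in M}"""
    return L | M

def concat(L, M):
    """LM = {st | s in L and t in M}"""
    return {s + t for s in L for t in M}

def kleene(L, max_len):
    """Approximate L*: all concatenations of strings from L,
    truncated to length <= max_len (L* itself is infinite)."""
    result = {""}            # ε is always in L*
    frontier = {""}
    while frontier:
        frontier = {s + t for s in frontier for t in L
                    if len(s + t) <= max_len} - result
        result |= frontier
    return result

L, M = {"a", "b"}, {"0", "1"}
print(union(L, M))                 # {'a', 'b', '0', '1'} (some order)
print(sorted(concat(L, M)))        # ['a0', 'a1', 'b0', 'b1']
print(sorted(kleene({"a"}, 3)))    # ['', 'a', 'aa', 'aaa']
```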

• Notation: If r and s are regular expressions denoting the languages
L(r) and L(s), then
• Union: (r)|(s) is a regular expression denoting L(r) U L(s)
• Concatenation: (r)(s) is a regular expression denoting L(r)L(s)
• Kleene closure: (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)

10

Precedence and Associativity:

• *, concatenation (.), and | (pipe) are left associative.
• * has the highest precedence.
• Concatenation (.) has the second highest precedence.
• | (pipe) has the lowest precedence of all.
Representing valid tokens of a language in regular expressions
• If y is a regular expression, then:
✓ y* means zero or more occurrences of y.
✓it can generate { ε, y, yy, yyy, yyyy, … }
✓ y+ means one or more occurrences of y.
✓it can generate { y, yy, yyy, yyyy, … }, i.e. y+ = y.y*

11

✓ x? means at most one occurrence of x.
✓it can generate either {x} or {ε}.
• [a-z] matches all lower-case letters of the English alphabet.
• [A-Z] matches all upper-case letters of the English alphabet.
• [0-9] matches all decimal digits.
12
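The operators above behave the same way in practical regex engines. The following Python snippet (added as an illustration, not from the slides) checks a few of the generated strings with the standard `re` module:

```python
import re

# y* : zero or more occurrences of y
assert re.fullmatch(r"y*", "") is not None       # ε is generated
assert re.fullmatch(r"y*", "yyy") is not None

# y+ : one or more occurrences of y
assert re.fullmatch(r"y+", "") is None           # ε is NOT generated
assert re.fullmatch(r"y+", "y") is not None

# x? : at most one occurrence of x
assert re.fullmatch(r"x?", "") is not None
assert re.fullmatch(r"x?", "x") is not None
assert re.fullmatch(r"x?", "xx") is None

# character classes: a typical identifier pattern built from them
identifier = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
assert identifier.fullmatch("value") is not None
assert identifier.fullmatch("9lives") is None    # must not start with a digit
print("all checks passed")
```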
Finite Automaton

• Finite automata are used to recognize patterns.
• An automaton takes a string of symbols as input and changes its state
accordingly.
• When the expected symbol is found, a transition occurs.
• On a transition, the automaton can either move to the next state or
stay in the same state.
• A run of a finite automaton has two possible outcomes:
– Accept, or
– Reject.
• When the input string has been processed completely and the
automaton has reached a final state, the string is accepted.

13

• An automaton can be represented by a 5-tuple (Q, ∑, δ, q0, F), where:
• Q is a finite set of states.
• ∑ is a finite set of symbols, called the alphabet of the automaton.
• δ is the transition function.
• q0 is the initial state from which any input is processed (q0 ∈ Q).
• F is the set of final states (F ⊆ Q).

14
Finite automata model

• A finite automaton can be modeled as an input tape plus a finite control.

• Input tape: a linear tape with some number of cells; each cell holds one
input symbol.
• Finite control: the finite control decides the next state on receiving an
input symbol from the tape.
• The tape reader reads the cells one by one from left to right; at any
moment only one input symbol is read.

[Figure: tape reader scanning the input tape, driven by the finite control]

15
Representation of Finite Automata

Transition diagram
• The transition diagram, also called a transition graph, is
represented by a digraph. A transition diagram consists of three
things:
• Arrow: The initial state in the transition diagram is marked
with an arrow.
• Circle: Each circle represents a state.
• Double circle: A double circle indicates a final (accepting)
state.

16

Transition Table
– It is the tabular representation of the transition function, which
takes two arguments, a state and an input symbol, and returns a
value, the new state of the automaton.
– It represents all the moves of the finite-state machine based on
the current state and input.
– In the transition table, the initial state is marked with an
arrow, and a final state is marked with a star (*).

17

• Formally, a transition table is a 2-dimensional array of rows and
columns where:
– The rows of the transition table represent the states.
– The columns contain the state to which the machine moves on each
input symbol.

18
Types of finite automata

• Deterministic
– On each input there is one and only one state to which the automaton
can transition from its current state
• Nondeterministic
– An automaton can be in several states at once.

19
Deterministic Finite Automaton (DFA)

• A finite set of states, often denoted Q
• A finite set of input symbols, often denoted Σ
• A transition function that takes as arguments a state and an input
symbol and returns a state.
• The transition function is commonly denoted δ
If q is a state and a is a symbol, then δ(q, a) is a state p (and in the
graph that represents the automaton there is an arc from q to p
labeled a)
• A start state, one of the states in Q
• A set of final or accepting states F (F ⊆ Q)
– Notation: a DFA A is a tuple A = (Q, Σ, δ, q0, F)

20

Example: Let a deterministic finite automaton be →
Q = {a, b, c},
∑ = {0, 1},
q0 = a,
F = {c}

21
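The example above omits the transition diagram, so the following Python sketch (added here as an illustration) supplies a hypothetical transition function δ for Q = {a, b, c}, Σ = {0, 1}, q0 = a, F = {c} and simulates the DFA; the δ chosen accepts exactly the strings containing the substring "01":

```python
# Hypothetical DFA: Q = {a,b,c}, Σ = {0,1}, q0 = a, F = {c}.
# This particular δ accepts every string containing the substring "01".
delta = {
    ("a", "0"): "b", ("a", "1"): "a",
    ("b", "0"): "b", ("b", "1"): "c",
    ("c", "0"): "c", ("c", "1"): "c",
}
start, final = "a", {"c"}

def accepts(word):
    """Run the DFA; accept iff the run ends in a final state."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]   # exactly one next state (DFA)
    return state in final

print(accepts("1101"))   # True  ("01" occurs)
print(accepts("110"))    # False
print(accepts(""))       # False (start state a is not final)
```

Because δ is a total function from (state, symbol) to a single state, the run is fully determined by the input, which is exactly the "deterministic" property.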
Non-Deterministic Finite Automaton (NDFA)

• An NFA has the power to be in several states at once.
• Every NFA accepts a language that is also accepted by some DFA.
• NFAs are often more succinct and easier to construct than DFAs.
• We can always convert an NFA to an equivalent DFA.
• The difference between a DFA and an NFA is the type of the
transition function δ:
– For an NFA, δ is a function that takes a state and an input symbol as
arguments (like the DFA transition function), but returns a set of
zero or more states (rather than exactly one state, as a
DFA must).

22

Example: Let a non-deterministic finite automaton be →
Q = {a, b, c},
∑ = {0, 1},
q0 = a,
F = {c}

23
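Because an NFA can be in several states at once, simulating it means tracking a set of current states. The sketch below is an illustration added here; the transition relation is hypothetical (the slide's diagram is not reproduced) and uses Q = {a, b, c}, Σ = {0, 1}, q0 = a, F = {c} for an NFA accepting strings that end in "01":

```python
# Hypothetical NFA accepting strings that end in "01": from state a we
# may stay in a on any symbol, or guess that the final "01" starts now.
delta = {
    ("a", "0"): {"a", "b"},   # nondeterministic: two possible next states
    ("a", "1"): {"a"},
    ("b", "1"): {"c"},        # missing entries mean "no move" (empty set)
}
start, final = "a", {"c"}

def accepts(word):
    """Subset simulation: track the set of all reachable states."""
    states = {start}
    for symbol in word:
        next_states = set()
        for q in states:
            next_states |= delta.get((q, symbol), set())
        states = next_states
    # accept iff at least one possible run ends in a final state
    return bool(states & final)

print(accepts("1101"))   # True  (ends in "01")
print(accepts("0110"))   # False
```

This set-of-states simulation is the same idea used by the subset construction when converting an NFA to a DFA.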
DFA vs NDFA

DFA:
• The transition from a state goes to a single next state for each
input symbol. Hence it is called deterministic.
• Empty-string (ε) transitions are not seen in a DFA.
• Backtracking is allowed in a DFA.
• Requires more space.
• A string is accepted by a DFA if it transits to a final state.

NDFA:
• The transition from a state can be to multiple next states for each
input symbol. Hence it is called non-deterministic.
• An NDFA permits empty-string (ε) transitions.
• In an NDFA, backtracking is not always possible.
• Requires less space.
• A string is accepted by an NDFA if at least one of all possible
transitions ends in a final state.

24
Error recovery

• An error is a user-initiated action that results in a program's
abnormal behavior or other issues.
• It is critical to recover from these errors as quickly as possible,
because if they are not handled in a timely manner,
– they will lead to a situation from which it is extremely
difficult to recover.
• All possible user errors are identified and reported to the user
in the form of error messages during this phase of compilation.
• The error handling process is a way of locating errors and
reporting them to users.

25

• In general, a compiler error occurs whenever the compiler fails to
compile a line of code or an algorithm, due to a bug in the code or a
bug in the compiler itself.
• The goals of the error handling process are to identify each error,
display it to the user, and then develop and apply a recovery strategy
to handle the error.
• Error handling should not significantly slow down the program's
processing time.
• Features of an error handler:
1. Error Detection
2. Error Reporting
3. Error Recovery
Error handler = Error Detection + Error Reporting + Error Recovery

26

• Blank entries in the symbol table are errors.
• The parser should be able to discover and report program
errors.
• The parser should handle any errors that occur and continue
parsing the rest of the input.
• Although the parser is primarily responsible for error
detection, faults can arise at any point during the compilation
process.
• There are three kinds of errors:
• Compile-time errors
• Runtime errors
• Logical errors
27
Compile-time error

• Compile-time errors appear during the compilation process,
before the program is executed.

• They can be due to a syntax error or a missing file reference
that stops the application from compiling properly.
Types of Compile-Time Errors
• The three major kinds of compile-time errors:
• Lexical phase errors
• Syntactic phase errors
• Semantic phase errors

28
Run-time error

• A run-time error occurs during the execution of a program
and is most often caused by incorrect system parameters or improper
input data.

• Causes include a lack of memory to run the application, a
memory conflict with another program, or a logical error.

• These errors can also arise when a user provides input the program
does not expect, or executes code whose behavior a typical compiler
cannot check at compile time.

29
Logical error

• Logic errors occur when a program behaves incorrectly yet does not
terminate abnormally.

• A logic error can result in unexpected or unwanted outputs or
other behavior, even if it is not immediately identified as such.

• Examples are errors that occur when the specified code is
unreachable or when an infinite loop is present.

30
Mode of Error recovery

• The compiler's simplest strategy is to simply stop, issue a
message, and halt compilation.
• To cope with problems in the code, the parser can implement
one of the following five typical error-recovery mechanisms, which
are among the most prevalent recovery strategies:
• Panic mode recovery
• Statement mode recovery
• Error productions
• Global correction
• Using the symbol table

31
Examples

• Find the total number of tokens in the C code below:

main()
{
int i, j;
}

The lexical analyzer treats “main” as one token,
rather than “main()”.
The tokens are: main, (, ), {, int, i, “,”, j, ;, }
Ans = 10 tokens

32

• Find the total number of tokens in the C code below:

int *abc;
int x;
x++;

“abc” and “*” are counted as different
tokens, even though together they
describe a single declaration.
The longest matching string is taken
as one token, so “++” is a single token.
The tokens are: int, *, abc, ;, int, x, ;, x, ++, ;
Ans = 10 tokens

33

• Find the number of tokens in the C code below:

printf("hello dirro%d%d", i, j)

Everything inside the double quotes
is considered one token.
The tokens are: printf, (, "hello dirro%d%d", “,”, i, “,”, j, )
Ans = 8 tokens

34

• Find the total number of tokens in the C code below:

char a = 'b';

Everything inside the single quotes
is considered one token.
The tokens are: char, a, =, 'b', ;
Ans = 5 tokens

35

• Find the total number of tokens in the C code below:

*++--+++***=++++&&

The lexical analyzer always takes the longest
match, so “*=” is a single token.
The tokens are: *, ++, --, ++, +, *, *, *=, ++, ++, &&
Ans = 11 tokens

36
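The longest-match ("maximal munch") rule used above can be sketched as a small Python function (added here as an illustration). The operator set is a hypothetical subset of C, just large enough for the example, and the sketch reproduces the count of 11:

```python
# A subset of C operators, enough for the example (hypothetical list).
OPERATORS = {"*", "*=", "+", "++", "-", "--", "=", "&", "&&"}
MAX_LEN = max(len(op) for op in OPERATORS)

def munch(text):
    """Split text into operator tokens, always taking the longest match."""
    tokens, i = [], 0
    while i < len(text):
        # try the longest candidate first (maximal munch)
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            if text[i:i + n] in OPERATORS:
                tokens.append(text[i:i + n])
                i += n
                break
        else:
            raise ValueError(f"no operator matches at position {i}")
    return tokens

toks = munch("*++--+++***=++++&&")
print(toks)       # ['*', '++', '--', '++', '+', '*', '*', '*=', '++', '++', '&&']
print(len(toks))  # 11
```

The inner loop tries a 2-character match before a 1-character one, which is why the third `*` combines with `=` into `*=` rather than standing alone.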

• Which one of the following strings can definitely be said to be a
token without looking at the next input character?
• Var
• Float
• Return
• ;
• *
• ++

37
The End
