
ICS 223: Compiler Design

Lecture – 1

Instructor–
Dr. Nilotpal Chakraborty
Department of Computer Science and Engineering
Indian Institute of Information Technology Kottayam
Operational information
● Attendance
● Continuous evaluation and feedback
○ Be interactive
● Evaluation components:
○ Quizzes
○ Mid Sem 1 and Mid Sem 2 exams
○ End Sem exam
○ Scribbles (form groups of 5)
● Book:
○ "Compilers: Principles, Techniques and Tools" by Alfred V Aho, Monica
S Lam, Ravi Sethi, and Jeffrey D Ullman
Objective of the course
● Open the lid of compilers and see inside
○ Understand what they do
○ Understand how they work
○ Understand how to build them

● Why this course?


○ Gain deeper understanding of programming
languages
○ Improve programming skills
○ Develop problem-solving skills
○ Career opportunities
Compiler, what’s that?
Compilers
● System software that converts a source language program into a target language program.
● Validates the input against the source language specification and produces error/warning messages.
● Correctness over performance
○ Correctness is essential in compilers
○ They must produce correct code
● Compiler design started with FORTRAN in the 1950s
Functionalities of compilers
Machine code generation:

● Converts a source language program into a machine-understandable one
● Takes care of the semantics of the various constructs of the source language
● Considers the limitations and specifications of the target machine language
● Checks the syntactic correctness of the program
Functionalities of compilers
Format converters:
● Act as interfaces between two or more software packages
● Convert programs written in older languages to newer languages
● Compatibility of input-output formats between tools coming from different
vendors
Query optimization:
● Optimize the query time in the domain of database query processing
● Generate proper sequence of operations suitable for faster query
processing
Functionalities of compilers
Hardware compilation:

● Automatically synthesize circuits from their behavioural description in VHDL, Verilog, etc.
● Analyse the complexity of digital circuits

Text formatting:

● Accepts as input an ordinary text file with formatting commands embedded in it
● Generates formatted text, e.g., LaTeX
ICS 223: Compiler Design
Lecture – 2

Instructor–
Dr. Nilotpal Chakraborty
Department of Computer Science and Engineering
Indian Institute of Information Technology Kottayam
Structure of a compiler
Phases of compiler
ICS 223: Compiler Design
Lecture – 3

Instructor–
Dr. Nilotpal Chakraborty
Department of Computer Science and Engineering
Indian Institute of Information Technology Kottayam
Phases of a compiler
Lexical Analysis
● Interface of the compiler to the outside world
● Scans input source program, identifies valid words of the language in it
● Also removes cosmetics, e.g., extra white spaces, comments, etc.
● Expands user defined macros
● Reports presence of foreign words
● May perform case-conversion
● Generates tokens to be passed to the syntax analysis phase
● Generally implemented as a finite automaton
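As a concrete illustration of token generation (a minimal sketch, not the course's reference lexer; the token names and patterns are assumptions of this example):

```python
import re

# A minimal, illustrative tokenizer. Whitespace ("SKIP") is removed as a
# cosmetic rather than turned into a token.
TOKEN_SPEC = [
    ("NUM",    r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("SEMI",   r";"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(src):
    for m in MASTER.finditer(src):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("position = initial + rate * 60;")))
# [('ID', 'position'), ('ASSIGN', '='), ('ID', 'initial'), ('PLUS', '+'),
#  ('ID', 'rate'), ('TIMES', '*'), ('NUM', '60'), ('SEMI', ';')]
```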
Syntax Analysis
● Takes words/tokens from lexical analyzer
● Works hand-in-hand with the lexical analyzer
● Checks syntactic (grammatical) correctness
● Identifies sequence of grammar rules to derive the input program from the
start symbol
● Constructs a parse tree
● Error/warning messages are issued for a syntactically incorrect or inappropriate program
Semantic Analysis
● The semantics of a program depend on the language
● A common check is for types of variables and expressions
● Applicability of operators to operands
● Scope rules of the language are applied to determine types - static scope,
global scope, etc.
Intermediate Code Generation
● An optional step towards target code generation
● Code corresponding to the input language program is generated in terms of some hypothetical machine instructions
● Helps to retarget the code from one processor to another
● A simple language, supported by most contemporary processors
● Powerful enough to express programming language constructs
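For a concrete picture (an illustrative assumption of this note, not a slide from the deck): the statement x = a + b * 60 might be lowered to three-address code, represented here as quadruples:

```python
# Hypothetical three-address code for: x = a + b * 60
# Each quadruple is (operator, arg1, arg2, result).
tac = [
    ("*", "b", "60", "t1"),   # t1 = b * 60
    ("+", "a", "t1", "t2"),   # t2 = a + t1
    ("=", "t2", None, "x"),   # x  = t2
]
for op, a1, a2, res in tac:
    print(res, "=", a1 if op == "=" else f"{a1} {op} {a2}")
```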
Code optimization
● The most vital step in code generation
● The automated steps of a compiler generate a lot of redundant code that can be eliminated
● Code is divided into basic blocks: sequences of statements with a single entry and a single exit
● Local optimizations are restricted to a single basic block
● Global optimizations span the boundaries of basic blocks
● The most important source of optimization is loops
● Algebraic simplifications and elimination of redundant loads and stores are common optimizations
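A minimal sketch of one such local optimization, algebraic simplification over quadruples (the rewrite rules shown are illustrative only):

```python
# Rewrite trivially simplifiable three-address instructions in place.
# Quadruples are (operator, arg1, arg2, result).
def simplify(quads):
    out = []
    for op, a1, a2, res in quads:
        if op == "*" and a2 == "1":      # x * 1  ->  x
            out.append(("=", a1, None, res))
        elif op == "+" and a2 == "0":    # x + 0  ->  x
            out.append(("=", a1, None, res))
        else:
            out.append((op, a1, a2, res))
    return out

print(simplify([("*", "a", "1", "t1"), ("+", "t1", "0", "t2")]))
# [('=', 'a', None, 't1'), ('=', 't1', None, 't2')]
```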
Symbol Table Management
● Symbol table is a data structure holding information about all symbols defined
in the source program
● Not part of the final code, but used as a reference by all phases of the compiler
● Typical information stored there includes the name, type, size, and relative offset of variables
● Generally created by the lexical analyzer and syntax analyzer
● Good data structures are needed to minimize search time
● The data structure may be flat or hierarchical
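A minimal hierarchical symbol table sketch; the stored fields follow the slide (name, type, size, offset), while the class and method names are assumptions:

```python
# Each block gets its own table, chained to the enclosing one.
class SymbolTable:
    def __init__(self, parent=None):
        self.parent = parent
        self.entries = {}

    def define(self, name, type_, size, offset):
        self.entries[name] = {"type": type_, "size": size, "offset": offset}

    def lookup(self, name):
        table = self
        while table is not None:          # search innermost scope first
            if name in table.entries:
                return table.entries[name]
            table = table.parent
        return None                       # undeclared name

globals_ = SymbolTable()
globals_.define("x", "int", 4, 0)
inner = SymbolTable(parent=globals_)
inner.define("x", "float", 8, 0)          # redeclaration shadows the outer x
print(inner.lookup("x")["type"])          # float
print(globals_.lookup("x")["type"])       # int
```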
Error handling and Recovery
● An important criterion for judging the quality of a compiler
● For a semantic error, the compiler can proceed
● For a syntax error, the parser enters an erroneous state
● Needs to undo some processing already carried out by the parser for recovery
● A few more tokens may need to be discarded to reach a decent state from which the parser may proceed
● Recovery is essential to report a batch of errors to the user, so that all of them may be corrected together instead of one by one.
ICS 223: Compiler Design
Lecture – 4

Instructor–
Dr. Nilotpal Chakraborty
Department of Computer Science and Engineering
Indian Institute of Information Technology Kottayam
Phases of a compiler
Intermediate Code Generation
● An optional step towards target code generation
● Code corresponding to the input language program is generated in terms of some hypothetical machine instructions
● Helps to retarget the code from one processor to another
● A simple language, supported by most contemporary processors
● Powerful enough to express programming language constructs
Code optimization
● The most vital step in code generation
● The automated steps of a compiler generate a lot of redundant code that can be eliminated
● Code is divided into basic blocks: sequences of statements with a single entry and a single exit
● Local optimizations are restricted to a single basic block
● Global optimizations span the boundaries of basic blocks
● The most important source of optimization is loops
● Algebraic simplifications and elimination of redundant loads and stores are common optimizations
Target Code Generation
● Uses template substitution on the intermediate code
● Predefined target language templates are used to generate the final code
● Machine instructions, addressing modes, and CPU registers are crucial for handling temporary variables.
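A minimal sketch of template substitution (the templates, register names, and placeholder syntax are all assumptions of this illustration):

```python
# Hypothetical template table: one target-code template per TAC operator.
TEMPLATES = {
    "+": "LOAD R1, {a1}\nADD R1, {a2}\nSTORE R1, {res}",
    "*": "LOAD R1, {a1}\nMUL R1, {a2}\nSTORE R1, {res}",
}

def emit(quads):
    for op, a1, a2, res in quads:
        print(TEMPLATES[op].format(a1=a1, a2=a2, res=res))

emit([("*", "b", "60", "t1"), ("+", "a", "t1", "t2")])   # code for a + b * 60
```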
Symbol Table Management
● Symbol table is a data structure holding information about all symbols defined
in the source program
● Not part of the final code, however used as a reference by all phases of a
compiler
● Typical information stored there include name, type, size, relative offset of
variables
● Generally created by lexical analyzer and syntax analyzer
● Good data structures needed to minimize searching time
● The data structure may be flat or hierarchical
Error handling and Recovery
● An important criterion for judging the quality of a compiler
● For a semantic error, the compiler can proceed
● For a syntax error, the parser enters an erroneous state
● Needs to undo some processing already carried out by the parser for recovery
● A few more tokens may need to be discarded to reach a decent state from which the parser may proceed
● Recovery is essential to report a batch of errors to the user, so that all of them may be corrected together instead of one by one.
Challenges in Compiler Design
● Language semantics
● Hardware platform
● Operating System and system software
● Error handling
● Aid in debugging
● Optimization
● Runtime environment
● Speed of compilation
Language semantics
● “case” – fall through or not
● “loop” – the index may or may not retain its last value
● “break” and “next” modify execution sequence of the program
Hardware platform
● The code generation strategy for an accumulator-based machine cannot be the same as for a stack-based machine
● CISC vs. RISC instruction sets
OS and System Software
● The format of the executable file is governed by the operating system (loader)
● Linking process can combine object files generated by different compilers into
one executable file
Error Handling
● Show appropriate error messages: detailed enough to pinpoint the error, but not so verbose as to confuse
● A missing semicolon may be reported as “line <n>: ‘;’ expected” rather than just “syntax error”
● If a variable is reported as undefined in one place, it should not be reported again and again
● The compiler designer has to anticipate the probable types of mistakes and design suitable detection and recovery mechanisms
● Some compilers even go to the extent of partially modifying the source program in order to correct it
Aid in Debugging
● Debugging helps in detecting logical mistakes in a program
● User needs to control execution of machine language program from the
source language level
● The compiler has to generate extra information about the correspondence between source and machine instructions
● Symbol table needs to be available to the debugger
● Extra debugging information embedded into the machine code
Optimization
● Need to identify a set of transformations that will be beneficial for most of the programs in a language
● Transformations should be safe
● Trade-off between the time spent optimizing a program and the improvement in execution time
● Often, several levels of optimizations are used
● Selecting a debugging option may disable any optimization that disturbs the
correspondence between the source program and object code
Runtime Environment
● Deals with creating space for parameters and local variables
● For older languages, it is static: fixed memory locations are created for them
● Not suitable for languages supporting recursion
● To support recursion, stack frames are used to hold variables and parameters
Speed of Compilation
● An important criterion for judging the acceptability of a compiler by the user community
● The initial phase of program development involves lots of bugs, so quick compilation matters more than optimized code
● Towards the end, execution efficiency becomes the prime concern, so more compilation time may be allowed to optimize the code significantly
ICS 223: Compiler Design
Lecture – 5

Instructor–
Dr. Nilotpal Chakraborty
Department of Computer Science and Engineering
Indian Institute of Information Technology Kottayam
Fundamentals of Programming
● Names, Identifiers, and Variables
● Procedures, Functions, and Methods
● Declarations and Definitions
● Static and Dynamic scoping
● Environment and states
● Scope and Block Structures
● Parameter Passing Mechanisms
Names, Identifiers, and Variables
● An identifier is a string of characters that refers to an entity, such as a data object, a procedure, a class, or a type.
● All identifiers are names, but not all names are identifiers.
● A variable refers to a particular location in the memory.
Procedures, Functions, and Methods
● These usually refer to the same thing, with explicit exceptions.
● C has ‘functions’; Java calls them ‘methods’.
● A function generally returns a value of some type (the return type), while a
procedure does not return any value.
Declarations and Definitions
● A declaration tells us the type of a thing; a definition tells us its value.
● The same holds for variables as well as methods/functions.
Distinction between Static and Dynamic
● If a language uses a policy that allows the compiler to decide an issue, then it is a static policy.
● A policy that allows a decision to be made when we execute the program is
called dynamic policy.
● Scope of a declaration of x is the region of the program in which uses of x refer
to this declaration.
Scope and Block Structures
● Scope of a declaration is determined by where the declaration appears in the
program.
● Some OOP languages provide explicit scope control mechanisms using
keywords such as public, private, protected.
● A block is a group of declarations and statements, usually surrounded by
delimiters.
A bit more….
● A declaration D belongs to a block B if B is the most closely nested block containing D; i.e., D is located within B, but not within any block that is nested within B.
● The static scope rule for variable declarations in block-structured languages:
○ If declaration D of name x belongs to block B, then the scope of D is all of B, except for any
blocks B’ nested to any depth within B, in which x is redeclared. Here, x is redeclared in B’ if
some other declaration D’ of the same name x belongs to B’.
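Python's lexically scoped nested functions give a compact, runnable illustration of this rule (an example added by this note, not from the slides):

```python
# Static (lexical) scoping: each use of x refers to the declaration in the
# most closely nested enclosing block.
x = "outer"                # declaration D of x in the outermost block B

def b1():
    x = "inner"            # D' redeclares x in the nested block B'
    def b2():
        return x           # refers to D' (innermost enclosing declaration)
    return b2()

print(b1())                # inner
print(x)                   # outer: D is unaffected outside B'
```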
Environment and States
● The association of names with locations in memory and their values can be
described using:
○ Environment: a mapping from names to locations in the memory.
○ State: a mapping from locations in memory to their values.
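A small sketch making the two mappings explicit (the names, locations, and values are illustrative):

```python
environment = {"x": 100}          # name -> memory location
state = {100: 42}                 # location -> value

def value_of(name):
    return state[environment[name]]

print(value_of("x"))              # 42
state[environment["x"]] = 43      # an assignment changes the state...
environment["x"] = 200            # ...while a new binding changes the environment
state[200] = 0
print(value_of("x"))              # 0
```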
Parameter passing
● Call by value

● Call by reference

● Call by name
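A sketch emulating the first two mechanisms (Python itself passes object references, so the explicit copy and cell below are purely illustrative; call by name would require a thunk and is omitted):

```python
def by_value(v):
    v = list(v)          # the callee works on a copy of the value
    v.append(99)

def by_reference(cell):
    cell[0].append(99)   # the callee reaches the caller's variable via the cell

xs = [1, 2]
by_value(xs)
print(xs)                # [1, 2]     -- caller unaffected
by_reference([xs])
print(xs)                # [1, 2, 99] -- caller's variable updated
```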
Chomsky Hierarchy
ICS 223: Compiler Design
Lecture – 6

Instructor–
Dr. Nilotpal Chakraborty
Department of Computer Science and Engineering
Indian Institute of Information Technology Kottayam
From the vault of ToC
● Chomsky Hierarchy
Lexical Analysis
Why separate LA and Parsing
● Simplicity of design
● Improving compiler efficiency
● Enhancing compiler portability
Tokens, Patterns, and Lexemes
● A token is a pair – a token name and an optional token value
● A pattern is a description of the form that the lexemes of a token may take.
● A lexeme is a sequence of characters in the source program that matches the
pattern of the token.
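For example (an illustration in the style of the textbook): in the input rate * 60, the lexeme rate matches the identifier pattern letter(letter|digit)*, producing a token such as <id, pointer to the symbol-table entry for rate>, while the lexeme 60 matches the pattern digit+, producing <number, 60>.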
Attributes for tokens
ICS 223: Compiler Design
Lecture – 7

Instructor–
Dr. Nilotpal Chakraborty
Department of Computer Science and Engineering
Indian Institute of Information Technology Kottayam
[Slide: scribble group assignments (student names and roll numbers)]
Lexical Errors
● Some errors are beyond the scope of Lexical Analysis

● Some are possible to detect

● Such errors are recognized when no pattern for tokens matches the sequence
Error Recovery
● Panic mode: successive characters are ignored until we reach a well-formed token
● Delete one character from the remaining input
● Insert a missing character in the remaining input
● Replace a character with another
● Transpose two adjacent characters
Input Buffering
● Looking ahead some symbols to decide which token to return.
○ In C, we need to look beyond -, =, <, etc. to decide which token to return.
● We need a two-buffer scheme to handle large lookaheads safely
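A runnable sketch of the two-buffer scheme with sentinels (the buffer size, names, and driver loop are assumptions for illustration):

```python
import io

N = 8                                  # tiny buffer halves, for demonstration
SENTINEL = "\0"                        # marks "reload the other half or stop"

def chars(stream):
    """Yield characters using two alternating sentinel-terminated buffers."""
    buffers = ["", ""]
    cur, i = 0, 0
    buffers[0] = stream.read(N) + SENTINEL
    while True:
        c = buffers[cur][i]
        if c == SENTINEL:
            if i < N:                  # sentinel before the buffer end: real EOF
                return
            cur = 1 - cur              # refill the other half, keep scanning
            buffers[cur] = stream.read(N) + SENTINEL
            i = 0
        else:
            yield c
            i += 1

print("".join(chars(io.StringIO("if x == 10 then"))))
```

The point of the sentinel is that the inner loop needs only one test per character (is this the sentinel?) instead of two (end of buffer? end of input?).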
ICS 223: Compiler Design
Lecture – 8

Instructor–
Dr. Nilotpal Chakraborty
Department of Computer Science and Engineering
Indian Institute of Information Technology Kottayam
So far in Lexical Analysis..
● A token is a pair – a token name and an optional token value
● A pattern is a description of the form that the lexemes of a token may take.
● A lexeme is a sequence of characters in the source program that matches the
pattern of the token.
Tasks in Lexical Analysis
● Define the set of tokens
● Define a pattern for each token
● Define an algorithm for cutting the source language program into lexemes and producing the tokens
Choosing the tokens
● Mostly dependent on the source language
● Typical token classes for programming languages:
○ One token for each keyword
○ One token for each punctuation symbol (parenthesis, comma, semicolon, etc.)
○ One token for identifiers
○ Several tokens for operators
○ Tokens for the constants
Describing the patterns
● A pattern defines the set of lexemes corresponding to a token.
● Since a lexeme is a string, a pattern can be considered a language
● Typically defined through Regular Expressions
Regular Expression
A Regular Expression can be recursively defined as follows:
1. ε is a Regular Expression denoting the language containing only the empty string. (L(ε) = {ε})
2. φ is a Regular Expression denoting the empty language. (L(φ) = { })
3. X is a Regular Expression denoting the language L(X) = {X}, for a symbol X of the alphabet.
4. If X is a Regular Expression denoting the language L(X) and Y is a Regular Expression denoting the language L(Y), then
i. X + Y is a Regular Expression corresponding to the language L(X) ∪ L(Y), where L(X + Y) = L(X) ∪ L(Y).
ii. X . Y is a Regular Expression corresponding to the language L(X) . L(Y), where L(X.Y) = L(X) . L(Y).
iii. R* is a Regular Expression corresponding to the language L(R*), where L(R*) = (L(R))*.
5. Any expression obtained by applying rules 1–4 a finite number of times is a Regular Expression.
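The operators in rule 4 can be realized directly on (finite) languages, mirroring the definition above; star() is truncated to a bound k because L(R*) is infinite in general (a sketch):

```python
def union(LX, LY):            # L(X + Y) = L(X) U L(Y)
    return LX | LY

def concat(LX, LY):           # L(X . Y) = { xy : x in L(X), y in L(Y) }
    return {x + y for x in LX for y in LY}

def star(L, k):               # finite approximation of L(R*) = (L(R))*
    result = {""}             # epsilon is always included
    level = {""}
    for _ in range(k):
        level = concat(level, L)
        result |= level
    return result

A, B = {"a"}, {"b"}
print(union(A, B))                   # {'a', 'b'}
print(concat(union(A, B), {"c"}))    # {'ac', 'bc'}  =  L((a + b) . c)
print(sorted(star(A, 3)))            # ['', 'a', 'aa', 'aaa']
```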
Extensions
Examples
ICS 223: Compiler Design
Lecture – 9

Instructor–
Dr. Nilotpal Chakraborty
Department of Computer Science and Engineering
Indian Institute of Information Technology Kottayam
Algorithm for Lexical Analysis
● How do we perform lexical analysis when tokens are specified through regular expressions?
● Regular expressions are equivalent to Finite State Machines (Finite Automata)
○ Deterministic Finite Automata (DFA)
○ Non-deterministic Finite Automata (NFA)
● Finite Automata can be easily converted to computer programs
● Two methods:
○ Convert RE to NFA and simulate NFA
○ Convert RE to NFA, then NFA to DFA, and simulate DFA
Formal definition
Simulating NFA
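A minimal NFA simulator based on subset tracking (the state numbering, transition-table encoding, and the example NFA for (a|b)*ac are assumptions of this sketch; "" labels ε-moves):

```python
def eps_closure(states, delta):
    """All states reachable from `states` via epsilon moves alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, ""), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(s, start, accept, delta):
    current = eps_closure({start}, delta)
    for c in s:
        # move on c from every current state, then close under epsilon
        moved = set().union(*(delta.get((q, c), set()) for q in current))
        current = eps_closure(moved, delta)
    return bool(current & accept)

# Example NFA (no epsilon moves needed here) for (a|b)*ac:
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "c"): {2}}
print(nfa_accepts("abac", 0, {2}, delta))   # True
print(nfa_accepts("abab", 0, {2}, delta))   # False
```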
Lexical Analysis, till now..
● What we have so far:
○ Regular expressions for each token
○ NFAs for each token that can recognize the corresponding lexemes
○ A way to simulate an NFA
● How to combine these to cut apart the input text and recognize tokens?
● Two ways:
○ Simulate all NFAs in turn (or in parallel) from the current position and output the
token of the first one to get to an accepting state
○ Merge all NFAs into a single one with labels of the tokens on the accepting states
ICS 223: Compiler Design
Lecture – 10

Instructor–
Dr. Nilotpal Chakraborty
Department of Computer Science and Engineering
Indian Institute of Information Technology Kottayam
Where are we
Dealing with Ambiguities

● Principle of the longest matching prefix: we choose the longest prefix of the
input that matches any token
● Following this principle, the input ifu26 = 60 will be split into:
○ <ID, ifu26>, <EQ>, <NUM, 60>

● How to implement?
○ Run all NFAs in parallel, keeping track of the last accepting state reached by any of the NFAs
○ When all automata get stuck, report the last match and restart the search at that point
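A sketch of this longest-match loop using one regex per token (the token names and patterns are assumptions; ties go to the earlier entry, so keywords listed first take precedence over identifiers):

```python
import re

TOKENS = [
    ("IF",  r"if"),
    ("ID",  r"[A-Za-z_]\w*"),
    ("EQ",  r"="),
    ("NUM", r"\d+"),
    ("WS",  r"\s+"),
]

def scan(src):
    pos = 0
    while pos < len(src):
        best = None
        for name, pat in TOKENS:          # try every pattern, keep the longest
            m = re.match(pat, src[pos:])
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise SyntaxError(f"no token matches at position {pos}")  # catch-all
        name, lexeme = best
        if name != "WS":                  # drop whitespace
            yield (name, lexeme)
        pos += len(lexeme)

print(list(scan("ifu26 = 60")))
# [('ID', 'ifu26'), ('EQ', '='), ('NUM', '60')] -- not IF followed by u26
```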
For NO matches..

● What if we cannot reach any accepting state given the current input?
● Add a “catch-all” rule that matches any character and reports an error
Merging of all FSMs
● In practice, all NFAs are merged and simulated as a single NFA
● Accepting states are labeled with the token name
Lexical Analysis with NFA.. So far

● Construct NFAs for all regular expressions


● Merge them into one automaton by adding a new start state
● Scan the input, keeping track of the last known match
● Break ties by choosing higher-precedence matches
● Have a catch-all rule to handle errors
Recap: NFA simulation
● Algorithm to check whether an input string is accepted by the NFA:

● In the worst case, an NFA with |Q| states takes O(|S|·|Q|²) time to match a string of length |S|
● Complexity thus depends on the number of states
● It is possible to reduce the complexity of matching to O(|S|) by transforming the
NFA into an equivalent deterministic finite automaton (DFA)
Deterministic Finite Automata
NFA to DFA

● DFA and NFA (and regular expressions) have the same expressive power
● An NFA can be converted into a DFA by the subset construction method
● Ex: (a|b)*ac
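A sketch of the subset construction on the (a|b)*ac NFA used earlier (no ε-moves here, so the closure step is omitted; each DFA state is a frozenset of NFA states and the empty set acts as the dead state):

```python
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "c"): {2}}   # NFA transitions

def subset_construction(start, delta, alphabet):
    dfa, worklist = {}, [frozenset(start)]
    while worklist:
        S = worklist.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for c in alphabet:
            T = frozenset(t for q in S for t in delta.get((q, c), set()))
            dfa[S][c] = T
            worklist.append(T)
    return dfa

dfa = subset_construction({0}, delta, {"a", "b", "c"})
for S, moves in dfa.items():
    # A DFA state is accepting iff it contains an accepting NFA state (2).
    print(sorted(S), {c: sorted(T) for c, T in moves.items()},
          "accepting" if 2 in S else "")
```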
Simulating DFA
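Simulating the resulting DFA is a single loop with one table lookup per input character, hence O(|S|) time (the state names A/B/C below correspond to the subsets {0}, {0,1}, {2}; a missing table entry plays the role of the dead state):

```python
dfa_delta = {("A", "a"): "B", ("A", "b"): "A",
             ("B", "a"): "B", ("B", "b"): "A", ("B", "c"): "C"}

def dfa_accepts(s, start="A", accept=frozenset({"C"})):
    state = start
    for c in s:
        state = dfa_delta.get((state, c))   # one table lookup per character
        if state is None:                   # dead state: reject immediately
            return False
    return state in accept

print(dfa_accepts("abac"))   # True
print(dfa_accepts("abab"))   # False
```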
Lexical Analysis with DFA

● Construct NFAs for all regular expressions


● Mark the accepting states of the NFAs by the name of the tokens they accept
● Merge them into one automaton by adding a new start state
● Convert the combined NFA to a DFA
● Carry over the accepting-state labeling of the NFAs to the DFA (taking precedence rules into account)
● Scanning is done like with an NFA
Keywords and Identifiers
● Having a separate regular expression for each keyword is not very efficient.
● In practice:
○ We define only one regular expression for both keywords and identifiers
○ All keywords are stored in a (hash) table
○ Once an identifier/keyword is read, a table lookup is performed to see
whether this is an identifier or a keyword
● Drastically reduces the size of the DFA
● Adding a keyword only requires adding one entry to the hash table.
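A sketch of this scheme: one identifier rule plus a keyword table lookup (the keyword set and token names are illustrative):

```python
import re

KEYWORDS = {"if", "else", "while", "return"}   # stored in a hash table (set)
ID = re.compile(r"[A-Za-z_]\w*")

def classify(lexeme):
    """Match the single identifier pattern, then decide keyword vs. identifier."""
    assert ID.fullmatch(lexeme)
    return (lexeme.upper(), None) if lexeme in KEYWORDS else ("ID", lexeme)

print(classify("while"))    # ('WHILE', None)  -- keyword token
print(classify("whilst"))   # ('ID', 'whilst') -- ordinary identifier
```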
Source Language Specifications
Combined NFA
Combined NFA
ICS 223: Compiler Design
Lecture – 11

Instructor–
Dr. Nilotpal Chakraborty
Department of Computer Science and Engineering
Indian Institute of Information Technology Kottayam
Keywords and Identifiers
● Having a separate regular expression for each keyword is not very efficient.
● In practice:
○ We define only one regular expression for both keywords and identifiers
○ All keywords are stored in a (hash) table
○ Once an identifier/keyword is read, a table lookup is performed to see
whether this is an identifier or a keyword
● Drastically reduces the size of the DFA
● Adding a keyword only requires adding one entry to the hash table.
Source Language Specifications
Combined NFA
Combined NFA
Reminder
Syntax Analysis
● Takes words/tokens from lexical analyzer
● Works hand-in-hand with the lexical analyzer
● Checks syntactic (grammatical) correctness
● Identifies sequence of grammar rules to derive the input program from the
start symbol
● Constructs a parse tree
● Error/warning messages are issued for a syntactically incorrect or inappropriate program
● The syntax analyzer is also known as the parser
Context Free Grammar
