Program Compilation Lec 7
Lexical Analysis
Lexical analysis is the first phase of the compiler; the component that
performs it is also known as a scanner. It converts the high-level input
program into a sequence of tokens.
Lexical analysis can be implemented with a deterministic finite
automaton (DFA).
The output is a sequence of tokens that is sent to the parser for
syntax analysis.
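As a minimal sketch of the DFA idea (the state names and helper functions here are illustrative assumptions, not part of the lecture), a small DFA in C that recognizes identifiers of the form letter-or-underscore followed by letters, digits, or underscores:

#include <ctype.h>
#include <stdio.h>

/* States of a small DFA for identifiers. */
enum { START, IN_ID, REJECT };

/* One transition step: returns the next state for character c. */
static int step(int state, int c) {
    switch (state) {
    case START: return (isalpha(c) || c == '_') ? IN_ID : REJECT;
    case IN_ID: return (isalnum(c) || c == '_') ? IN_ID : REJECT;
    default:    return REJECT;
    }
}

/* Accepts iff the whole string drives the DFA into the IN_ID state. */
int is_identifier(const char *s) {
    int state = START;
    for (; *s && state != REJECT; s++)
        state = step(state, (unsigned char)*s);
    return state == IN_ID;
}

int main(void) {
    printf("%d %d\n", is_identifier("abs_zero_Kelvin"), is_identifier("273"));
    return 0; /* prints: 1 0 */
}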
Example of tokens:
Keywords; examples: for, while, if, etc.
Identifiers; examples: variable names, function names, etc.
Operators; examples: '+', '++', '-', etc.
Separators; examples: ',', ';', etc.
Example of Non-Tokens:
Comments, preprocessor directives, blanks, tabs, newlines, etc.
Lexeme: the sequence of input characters matched by a pattern to form the
corresponding token is called a lexeme, e.g. “float”, “abs_zero_Kelvin”,
“=”, “-”, “273”, “;”.
How a Lexical Analyzer Works:
1. Input preprocessing: This stage involves cleaning up the input
text and preparing it for lexical analysis. This may include
removing comments, whitespace, and other non-essential
characters from the input text.
2. Tokenization: This is the process of breaking the input text into
a sequence of tokens. This is usually done by matching the
characters in the input text against a set of patterns or regular
expressions that define the different types of tokens (a code
sketch follows this list).
3. Token classification: In this stage, the lexer determines the type
of each token. For example, in a programming language, the
lexer might classify keywords, identifiers, operators, and
punctuation symbols as separate token types.
4. Token validation: In this stage, the lexer checks that each token
is valid according to the rules of the programming language. For
example, it might check that a variable name is a valid identifier,
or that an operator has the correct syntax.
5. Output generation: In this final stage, the lexer generates the
output of the lexical analysis process, which is typically a list of
tokens. This list of tokens can then be passed to the next stage of
compilation or interpretation.
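The following is a minimal sketch of steps 2 and 3 in C, assuming a toy language with only identifiers, integer literals, single-character operators, and separators (the function and type names are illustrative, not a fixed API):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

typedef enum { TOK_ID, TOK_INT, TOK_OP, TOK_SEP } TokenType;

/* Scan src, printing one classified token per line. */
void tokenize(const char *src) {
    const char *p = src;
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* non-token: skip */
        const char *start = p;
        TokenType type;
        if (isalpha((unsigned char)*p) || *p == '_') {        /* identifier */
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            type = TOK_ID;
        } else if (isdigit((unsigned char)*p)) {              /* integer literal */
            while (isdigit((unsigned char)*p)) p++;
            type = TOK_INT;
        } else if (strchr("+-*/=", *p)) {                     /* operator */
            p++; type = TOK_OP;
        } else {                                              /* separator: , ; ( ) ... */
            p++; type = TOK_SEP;
        }
        printf("%-4d '%.*s'\n", type, (int)(p - start), start);
    }
}

int main(void) {
    tokenize("a = b + c;");   /* classifies: id, =, id, +, id, ; */
    return 0;
}

Running it on a = b + c; produces the id = id + id ; token shape described below.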
The lexical analyzer identifies errors with the help of the automaton
and the grammar of the language it is based on, such as C or C++, and
reports the row number and column number of each error.
Suppose we pass the statement a = b + c; through the lexical analyzer.
It will generate a token sequence like this: id = id + id ;
where each id refers to its variable in the symbol table, which records
all of its details. For example, consider the program
int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}
All the valid tokens are:
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'
Above are the valid tokens; you can observe that the comment has been
omitted. As another example, here is how a lexer pairs each lexeme with
its token type:
Lexeme   Token type
(        LPAREN
=        ASSIGNMENT
A        IDENTIFIER
a        IDENTIFIER
B        IDENTIFIER
2        INTEGER
)        RPAREN
;        SEMICOLON
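Since each identifier token refers to a symbol-table entry, the following is a minimal sketch of such a table in C (a fixed-size linear list; the struct layout and the lookup helper are illustrative assumptions, not a prescribed design):

#include <stdio.h>
#include <string.h>

/* One symbol-table entry: the identifier and its recorded details. */
struct Symbol {
    char name[32];
    char type[16];   /* e.g. "int", "float" */
};

static struct Symbol table[128];
static int nsyms = 0;

/* Returns the entry index, inserting the name on first sight. */
int lookup(const char *name, const char *type) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    strcpy(table[nsyms].name, name);
    strcpy(table[nsyms].type, type);
    return nsyms++;
}

int main(void) {
    /* int a, b;  ->  two entries the id tokens can reference */
    printf("a -> entry %d\n", lookup("a", "int"));
    printf("b -> entry %d\n", lookup("b", "int"));
    printf("a -> entry %d\n", lookup("a", "int")); /* same entry again */
    return 0;
}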
Syntax Analysis
Syntax Analysis or Parsing is the second phase, i.e. after lexical analysis.
It checks the syntactical structure of the given input, i.e. whether the
given input is in the correct syntax (of the language in which the input
has been written) or not. It does so by building a data structure, called a
Parse tree or Syntax tree. The parse tree is constructed by using the pre-
defined Grammar of the language and the input string. If the given input
string can be produced with the help of the syntax tree (in the derivation
process), the input string is found to be in the correct syntax. If not,
an error is reported by the syntax analyzer. The main goal of syntax analysis
is to create a parse tree or abstract syntax tree (AST) of the source code,
which is a hierarchical representation of the source code that reflects the
grammatical structure of the program.
Example: consider the grammar
S -> cAd
A -> bc | a
and the input string “cad”. The parser attempts to construct a syntax
tree from this grammar for the given input string, applying the
production rules as needed. To generate the string “cad” it proceeds
through these steps:
(i) start with S
(ii) apply S -> cAd
(iii) expand A -> bc, producing “cbcd”
(iv) backtrack and apply A -> a, producing “cad”
In step (iii) above, the production rule A -> bc was not a suitable one
to apply, because the string produced is “cbcd”, not “cad”. Here the
parser needs to backtrack and apply the next production rule available
for A, which is shown in step (iv), where the string “cad” is produced.
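A minimal sketch of this behavior in C, as a hand-written recursive-descent parser for exactly this grammar (the function names are illustrative assumptions):

#include <stdio.h>
#include <string.h>

/* Grammar: S -> c A d,  A -> b c | a.  Each function returns the
 * number of characters it consumed, or -1 on failure. */
static int parse_A(const char *s) {
    if (s[0] == 'b' && s[1] == 'c') return 2;  /* first alternative: A -> bc */
    if (s[0] == 'a') return 1;                 /* fall back to A -> a */
    return -1;
}

static int parse_S(const char *s) {
    if (s[0] != 'c') return -1;                /* match leading 'c' */
    int n = parse_A(s + 1);                    /* then A */
    if (n < 0) return -1;
    if (s[1 + n] != 'd') return -1;            /* then trailing 'd' */
    return 2 + n;
}

int main(void) {
    const char *input = "cad";
    int n = parse_S(input);
    /* Accept only if S consumed the entire input. */
    printf("%s\n", (n == (int)strlen(input)) ? "accepted" : "rejected");
    return 0;
}

For input “cad”, the first alternative A -> bc fails, so the parser falls back to A -> a, mirroring the backtracking in step (iv) above.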
Semantic Analysis
is the third phase of the compiler. Semantic analysis makes sure that
the declarations and statements of the program are semantically correct.
It is a collection of procedures called by the parser as and when
required by the grammar. Both the syntax tree from the previous phase
and the symbol table are used to check the consistency of the given code.
Type checking is an important part of semantic analysis, where the
compiler makes sure that each operator has matching operands.
Semantic Analyzer:
It uses the syntax tree and the symbol table to check whether the given
program is semantically consistent with the language definition. It
gathers type information and stores it in either the syntax tree or the
symbol table. This type information is subsequently used by the compiler
during intermediate-code generation.
Semantic Errors:
Errors recognized by semantic analyzer are as follows:
Type mismatch
Undeclared variables
Reserved identifier misuse
Functions of Semantic Analysis:
1. Type Checking –
Ensures that data types are used in a way consistent with their
definition.
2. Label Checking –
Verifies that every label referenced in the program is actually defined.
3. Flow Control Check –
Checks that control structures are used in a proper manner
(for example, no break statement outside a loop).
Example:
float x = 10.1;
float y = x*30;
In the above example, the semantic analyzer will type-cast the integer
30 to the float 30.0 before the multiplication.
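A minimal sketch of this kind of type check and implicit promotion in C (the Type enum and check_binop are illustrative assumptions, not a fixed compiler API):

#include <stdio.h>

typedef enum { TY_INT, TY_FLOAT, TY_ERROR } Type;

/* Type-check a binary arithmetic operation: if either operand is
 * float, the int operand is implicitly promoted and the result is
 * float; otherwise the result stays int. */
Type check_binop(Type lhs, Type rhs) {
    if (lhs == TY_ERROR || rhs == TY_ERROR) return TY_ERROR;
    if (lhs == TY_FLOAT || rhs == TY_FLOAT) return TY_FLOAT; /* promote int */
    return TY_INT;
}

int main(void) {
    /* float y = x * 30;  ->  float * int  ->  float */
    Type t = check_binop(TY_FLOAT, TY_INT);
    printf("%s\n", t == TY_FLOAT ? "float" : "int"); /* prints: float */
    return 0;
}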
Register Allocation:
Register allocation is the process of assigning program variables to
registers and reducing the number of swaps into and out of the
registers. Moving variables between memory and the CPU is time-consuming,
and this is the main reason registers are used: they sit inside the
processor and are the fastest accessible storage locations.
Example 1: compute a = b + c * d, with the register mapping
R1 <- a
R2 <- b
R3 <- c
R4 <- d
MOV R3, c    ; load c into R3
MOV R4, d    ; load d into R4
MUL R3, R4   ; R3 = c * d
MOV R2, b    ; load b into R2
ADD R2, R3   ; R2 = b + c * d
MOV R1, R2   ; result into a's register
MOV a, R1    ; store back to a
Example 2: compute e = f - g / h, with the register mapping
R1 <- e
R2 <- f
R3 <- g
R4 <- h
MOV R3, g    ; load g into R3
MOV R4, h    ; load h into R4
DIV R3, R4   ; R3 = g / h
MOV R2, f    ; load f into R2
SUB R2, R3   ; R2 = f - g / h
MOV R1, R2   ; result into e's register
MOV e, R1    ; store back to e
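The assignment of variables to registers can itself be sketched in C. Below is a minimal greedy allocator over a fixed pool of four registers (all names and the spill behavior are illustrative assumptions; real compilers use techniques such as graph coloring or linear scan):

#include <stdio.h>
#include <string.h>

#define NREGS 4

/* regs[i] holds the name of the variable currently in register Ri+1,
 * or "" if that register is free. */
static char regs[NREGS][8];

/* Greedy allocation: reuse an existing assignment, else take a free
 * register, else report a spill (-1); a real allocator would pick a
 * victim register to evict here. */
int alloc_reg(const char *var) {
    for (int i = 0; i < NREGS; i++)
        if (strcmp(regs[i], var) == 0) return i;   /* already in a register */
    for (int i = 0; i < NREGS; i++)
        if (regs[i][0] == '\0') {                  /* free register */
            strcpy(regs[i], var);
            return i;
        }
    return -1;                                     /* would need a spill */
}

int main(void) {
    const char *vars[] = { "a", "b", "c", "d", "b" };
    for (int i = 0; i < 5; i++) {
        int r = alloc_reg(vars[i]);
        printf("%s -> R%d\n", vars[i], r + 1);     /* 'b' reuses its register */
    }
    return 0;
}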
Advantages:
Fast accessible storage
Operations can be performed directly on register contents
Reduces memory traffic
Reduces overall computation time
Disadvantages:
Registers are available only in small numbers (typically a few
dozen general-purpose registers per processor)
Register sizes are fixed and vary from one processor to another
Register allocation and management are complicated for the compiler
Register contents must be saved and restored across context
switches and procedure calls