0% found this document useful (0 votes)
8 views

Compiler Design

Uploaded by

Sameer verma
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views

Compiler Design

Uploaded by

Sameer verma
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 12

Compiler Design

Lexeme: In a compiler, a lexeme refers to the smallest meaningful unit in the source code of a
program. It is a sequence of characters in the source code that represents a single, indivisible
element, such as a keyword, identifier, operator, or literal. Lexemes are the building blocks of a
program's syntax and semantics, and they are used by the lexical analyzer (lexer) to generate tokens.

A lexer is a crucial component of the compiler that performs lexical analysis or scanning. Its primary
role is to read the source code character by character, identify lexemes, and categorize them into
tokens, associating each token with a specific type and, in some cases, additional attributes. Tokens
are then passed to the parser for further analysis and processing.

Here are some examples of lexemes and the corresponding tokens they may generate in a
programming language like C:

Lexeme: "while"

Token Type: Keyword

Description: The lexeme "while" is recognized as a keyword in C, indicating the start of a while loop.

Lexeme: "count"

Token Type: Identifier

Description: The lexeme "count" represents an identifier, which could be a variable or function name.

Lexeme: "+"

Token Type: Operator

Description: The lexeme "+" is recognized as an operator, indicating addition.

Lexeme: "42"

Token Type: Integer Literal

Description: The lexeme "42" represents an integer literal, a numeric constant.

Lexeme: "3.14"

Token Type: Floating-Point Literal

Description: The lexeme "3.14" represents a floating-point literal, a decimal number with a fractional
part.
Lexeme: "("

Token Type: Left Parenthesis

Description: The lexeme "(" is recognized as a left parenthesis, often used to group expressions.

In summary, a lexeme in a compiler is a sequence of characters in the source code that represents a
single, meaningful element of the program. Lexemes are identified and categorized into tokens
during the lexical analysis phase, and these tokens are then used by subsequent phases of the
compiler for parsing, semantic analysis, and code generation.

Type of grammar
CFG Simplification
[Video lecture-11: Knowledge Gate]

In a CFG, it may happen that all the production rules and symbols are not
needed for the derivation of strings. Besides, there may be some null
productions and unit productions. Elimination of these productions and
symbols is called simplification of CFGs. Simplification essentially comprises
of the following steps −

 Reduction of CFG
 Removal of Unit Productions
 Removal of Null Productions

Reduction of CFG
CFGs are reduced in two phases −

Phase 1 − Derivation of an equivalent grammar, G’, from the CFG, G, such


that each variable derives some terminal string.

Derivation Procedure −

Step 1 − Include all symbols, W1, that derive some terminal and
initialize i=1.

Step 2 − Include all symbols, Wi+1, that derive Wi.

Step 3 − Increment i and repeat Step 2, until Wi+1 = Wi.

Step 4 − Include all production rules that have Wi in it.

Phase 2 − Derivation of an equivalent grammar, G”, from the CFG, G’,


such that each symbol appears in a sentential form.

Derivation Procedure −

Step 1 − Include the start symbol in Y1 and initialize i = 1.

Step 2 − Include all symbols, Yi+1, that can be derived from Yi and include
all production rules that have been applied.

Step 3 − Increment i and repeat Step 2, until Yi+1 = Yi.

Problem
Find a reduced grammar equivalent to the grammar G, having production
rules, P: S → AC | B, A → a, C → c | BC, E → aA | e

Solution

Phase 1 −

T = { a, c, e }

W1 = { A, C, E } from rules A → a, C → c and E → aA

W2 = { A, C, E } U { S } from rule S → AC

W3 = { A, C, E, S } U ∅

Since W2 = W3, we can derive G’ as −

G’ = { { A, C, E, S }, { a, c, e }, P, {S}}

where P: S → AC, A → a, C → c , E → aA | e

Phase 2 −

Y1 = { S }

Y2 = { S, A, C } from rule S → AC

Y3 = { S, A, C, a, c } from rules A → a and C → c

Y4 = { S, A, C, a, c }

Since Y3 = Y4, we can derive G” as −

G” = { { A, C, S }, { a, c }, P, {S}}

where P: S → AC, A → a, C → c
Removal of Unit Productions
Any production rule in the form A → B where A, B ∈ Non-terminal is
called unit production..

Removal Procedure −

Step 1 − To remove A → B, add production A → x to the grammar rule


whenever B → x occurs in the grammar. [x ∈ Terminal, x can be Null]

Step 2 − Delete A → B from the grammar.

Step 3 − Repeat from step 1 until all unit productions are removed.

Problem

Remove unit production from the following −

S → XY, X → a, Y → Z | b, Z → M, M → N, N → a

Solution −

There are 3 unit productions in the grammar −

Y → Z, Z → M, and M → N

At first, we will remove M → N.

As N → a, we add M → a, and M → N is removed.


The production set becomes

S → XY, X → a, Y → Z | b, Z → M, M → a, N → a

Now we will remove Z → M.

As M → a, we add Z→ a, and Z → M is removed.

The production set becomes

S → XY, X → a, Y → Z | b, Z → a, M → a, N → a

Now we will remove Y → Z.

As Z → a, we add Y→ a, and Y → Z is removed.

The production set becomes

S → XY, X → a, Y → a | b, Z → a, M → a, N → a

Now Z, M, and N are unreachable, hence we can remove those.

The final CFG is unit production free −

S → XY, X → a, Y → a | b
Removal of Null Productions
In a CFG, a non-terminal symbol ‘A’ is a nullable variable if there is a
production A → ε or there is a derivation that starts at A and finally ends
up with

ε: A → .......… → ε

Removal Procedure

Step 1 − Find out nullable non-terminal variables which derive ε.

Step 2 − For each production A → a, construct all productions A →


x where x is obtained from ‘a’ by removing one or multiple non-terminals
from Step 1.

Step 3 − Combine the original productions with the result of step 2 and
remove ε - productions.

Problem

Remove null production from the following −

S → ASA | aB | b, A → B, B → b | ∈

Solution −

There are two nullable variables − A and B

At first, we will remove B → ε.

After removing B → ε, the production set becomes −

S→ASA | aB | b | a, A ε B| b | &epsilon, B → b
Now we will remove A → ε.

After removing A → ε, the production set becomes −

S→ASA | aB | b | a | SA | AS | S, A → B| b, B → b

This is the final production set without null transition.

Question : In general how many normal forms are for CFG (Context Free Grammar)
Ans. 2 Chomsky Normal Form and Greiback Normal Form

Important terms
The most general phase of structured grammar is?
Answer: a. Context Sensitive Grammar
Explanation: Context-sensitive grammar is the most general phase of
structured grammar because, in this grammar, the left-hand side and the right
side contain the terminals or non-terminals.
In the compiler, the function of using intermediate code is:

Answer: d. to increase the chances of re-using the machine-independent code


optimizer in other compilers.

Explanation: After semantic analysis, the intermediate code increases the


chances of reusing the machine-independent code optimizer in other compilers.
In how many types of optimization can be divided?
Answer: a. two types
Explanation: The code optimization technique is divided into machine-
dependent and machine-independent types.
Which language is accepted by the push-down automata?

Answer: c. Type 2 language

Explanation: According to the Chomsky hierarchy, push down automata accepts


the Type 2 language, which is used for context-free language.

Which of the following parser is a top-down parser?

Answer: d. Recursive descent parser

Explanation: Recursive descent parser is a type of top-down parser which


generates the parse tree from top to bottom and reads the input string from left
to right.

Which parser is most powerful in the following parsers?

a. Operator Precedence
b. SLR
c. Canonical LR
d. LALR

Answer: c. Canonical LR

Explanation: Canonical LR (CLR) is the most powerful parser than LALR and SLR.

The value of which variable is updated inside the loop by a loop-invariant value?

a. loop
b. strength
c. induction
d. invariable

Answer: c. induction

Explanation: The value of the induction variable is updated inside the loop by a
loop-invariant value.

Which of the following is not a characteristic of the compiler?

a. More execution time


b. Debugging process is slow
c. The execution takes place after the removal of all syntax errors
d. Firstly scans the entire program and then transforms it into machine-
understandable code

Answer: a. More execution time

Explanation: The compiler does not take more time to execute. So, more
execution time is not a characteristic of the compiler.
Which method merges the multiple loops into the single one?

a. Constant Folding
b. Loop rolling
c. Loop fusion or jamming
d. None of the above
Answer: c. Loop fusion or Loop jamming

Explanation: Loop fusion is an optimization technique which merges the multiple


bodies of loops into a single body. This programming technique may reduce the
runtime performance of the program.

Which parser is known as the shift-reduce parser?

a. Bottom-up parser
b. Top-down parser
c. Both Top-down and bottom-up
d. None of the Above

Answer: a. Bottom-up parser

Explanation: Bottom-up parser in the compiler is also called the shift-reduce


parser.

You might also like