CSE384 Compiler Design Laboratory Lab Manual
Aim:
To separate the tokens from the given source program
Theory:
Lexical analysis reads the characters of the source program and groups them into a stream of tokens, in which each token represents a logically cohesive sequence of characters, such as an identifier, a keyword, or a punctuation character. The character sequence forming a token is called the lexeme of the token.
Algorithm:
Step 1: Open the file that contains the source program.
Step 2: Read the program word by word until the end of file is reached.
Step 3: Compare each word against the keywords, operators and special symbols of C, and classify it as a keyword, identifier, constant, operator or punctuation character.
Step 4: Print each lexeme together with its token class.
Step 5: Stop the program execution.
Input:
#include<stdio.h>
void main()
{
int a;
double b;
char c;
printf("%d %b %c",a,b,c);
}
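A minimal C sketch of such a token separator is given below. It is illustrative only: the keyword list is abbreviated, and the input file name input.c is an assumption, not part of the exercise.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Abbreviated keyword list (an assumption; extend as needed). */
const char *keywords[] = { "int", "double", "char", "void", "return" };

int is_keyword(const char *s)
{
    for (int i = 0; i < 5; i++)
        if (strcmp(s, keywords[i]) == 0) return 1;
    return 0;
}

int main(void)
{
    FILE *fp = fopen("input.c", "r");   /* file holding the source program */
    char buf[64];
    int c, n;
    if (!fp) { perror("input.c"); return 1; }
    while ((c = fgetc(fp)) != EOF) {
        if (isalpha(c) || c == '_') {              /* identifier or keyword */
            n = 0;
            do { buf[n++] = c; c = fgetc(fp); } while (n < 63 && (isalnum(c) || c == '_'));
            buf[n] = '\0'; ungetc(c, fp);
            printf("%-10s : %s\n", buf, is_keyword(buf) ? "keyword" : "identifier");
        } else if (isdigit(c)) {                   /* numeric constant */
            n = 0;
            do { buf[n++] = c; c = fgetc(fp); } while (n < 63 && isdigit(c));
            buf[n] = '\0'; ungetc(c, fp);
            printf("%-10s : constant\n", buf);
        } else if (strchr("+-*/=<>", c)) {         /* operator */
            printf("%-10c : operator\n", c);
        } else if (strchr("(){};,#\"", c)) {       /* punctuation */
            printf("%-10c : special symbol\n", c);
        }                                          /* everything else skipped */
    }
    fclose(fp);
    return 0;
}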
Aim:
To convert the given regular expression to NFA
Theory
Regular expression:
For lexical analysis, specifications are traditionally written using regular expressions: an algebraic notation for describing sets of strings.
The figure below shows the constructions used to build regular expressions and the languages they describe:

Regular expression   Language (set of strings)   Informal description
a                    {"a"}                       The set consisting of the one-letter string "a".
NFA
Start: Apply the following rules to the current transition diagram until all the edges are labeled by characters from Σ or ε. The nodes on the left sides of the rules are identified with nodes in the current transition diagram. All newly occurring nodes on the right side of a rule correspond to newly created nodes and thus to new states.
Input:
Output:
[Figure: the resulting NFA. State 1 is the start state, with ε-transitions to states 2 and 4; state 2 goes to state 3 on a, and state 4 goes to state 5 on b; ε-transitions from states 3 and 5 lead into state 6, followed by the transitions 6 to 7 on a, 7 to 8 on b, and 8 to 9 on b, with state 9 accepting. This is the NFA that Thompson's construction produces for (a|b)abb.]
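To make the construction concrete, the following C fragment sketches Thompson's construction for the expression in the figure, building symbol, concatenation and union fragments. The Kleene-star rule is omitted for brevity, and all type and function names here are my own, not the manual's.

#include <stdio.h>
#include <stdlib.h>

#define EPS 0   /* label used for epsilon edges */

typedef struct State {
    int id;
    int label1, label2;            /* edge labels ('a', 'b', or EPS) */
    struct State *out1, *out2;     /* edge targets (NULL if unused)  */
} State;

typedef struct { State *start, *accept; } Frag;  /* an NFA fragment */

static int next_id = 1;

State *new_state(void)
{
    State *s = calloc(1, sizeof *s);
    s->id = next_id++;
    return s;
}

/* NFA for a single symbol: start --c--> accept */
Frag symbol(int c)
{
    Frag f = { new_state(), new_state() };
    f.start->label1 = c; f.start->out1 = f.accept;
    return f;
}

/* Concatenation: glue a's accepting state to b's start with an epsilon edge. */
Frag cat(Frag a, Frag b)
{
    a.accept->label1 = EPS; a.accept->out1 = b.start;
    return (Frag){ a.start, b.accept };
}

/* Union: a new start with epsilon edges into both alternatives,
   and epsilon edges from both accepting states into a new accept. */
Frag alt(Frag a, Frag b)
{
    Frag f = { new_state(), new_state() };
    f.start->label1 = EPS; f.start->out1 = a.start;
    f.start->label2 = EPS; f.start->out2 = b.start;
    a.accept->label1 = EPS; a.accept->out1 = f.accept;
    b.accept->label1 = EPS; b.accept->out1 = f.accept;
    return f;
}

int main(void)
{
    /* Build (a|b)abb, the expression the figure above illustrates. */
    Frag f = cat(cat(cat(alt(symbol('a'), symbol('b')),
                         symbol('a')), symbol('b')), symbol('b'));
    printf("start state %d, accepting state %d\n", f.start->id, f.accept->id);
    return 0;
}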
Aim
To convert a given regular expression to an NFA using the JFLAP tool.
Theory
The JFLAP (Java Formal Languages and Automata Package) is a visual tool used to create and simulate several types of automata, and to convert among different representations of languages, including:
DFA
NFA
Regular grammar
Regular expression
push-down automaton
context-free grammar
Definition
A regular expression describes a language over an alphabet (denoted Σ). The simplest regular expressions are λ (the empty string), ∅ (the empty set), and the individual symbols of Σ; larger expressions are built with the operators + (union), juxtaposition (concatenation), and * (Kleene star).
The following are a few examples of regular expressions and the languages generated using
these operators:
1. a+b+c = {a, b, 4. ab* = {a, ab, abb, abbb, ...} 7. a+b* = (a, λ, b, bb,
c} 5. (ab)* = (λ, ab, abab, ababab, bbb, ...)
2. abc = {abc} ...) 8. a+!* = (a, λ)
3. (!+a)bc = {bc, 6. (a+b)* = (λ, a, b, aa, ab, ba, bb, 9. (a+!)* = (λ, a, aa, aaa,
abc} aaa, ...) aaaa, ...)
Since every regular language is accepted by a finite automaton, every regular expression can be converted into an equivalent finite automaton. JFLAP provides a feature that converts a regular expression to an NFA.
Converting to a NFA
After typing in an expression, there is nothing else that can be done in this editor window
besides converting it to an NFA, so let's proceed to that. Click on the “Convert → Convert to
NFA” menu option. If one uses the example provided earlier, this screen should come up (after
resizing the window a little).
You're probably wondering what exactly you just did. Basically, you broke the regular expression into three sub-expressions, which are joined into one expression by the implicit concatenation operator. Whenever you click on an expression or sub-expression with this button, it is subdivided according to the last operation applied in the order of operations.
Let's continue. Click on the second button from the left, the “(T)ransition Creator” button. Now,
let's make our first transition. Create a transition from “q0” to “q2”. You will not be prompted
by a label, as in this mode only “λ” transitions are created. These types of transitions are all that
are needed to create a nondeterministic automaton. When finished, you should see a “λ”
transition between q0 and q2. Now, try to create a transition from q0 to q4. You will be notified
that such a transition is “invalid”. This is because, due to the concatenation operation between
sub-expressions “a*” and “b”, any “b” must first process the “a*” part of the NFA. While it is
possible to go to “q4” without processing any input, that will have to wait until the “a*”
expression is decomposed. Thus, in order to get to “q4”, we need to go through “q3”, the final
state of the “a*” expression. If you create a transition from “q3” to “q4”, it will be accepted.
Since “(a+b)” is the last sub-expression, we reach the final state after processing it. Because of this, try to establish a transition from “q7” to “q1”. You will be warned that, although the transition may be correct, the transitions must be created in the correct order. Thus, add the transition from “q5” to “q6” before adding the “q7” to “q1” transition. When done, your screen should resemble the one below.
Aim
Write a C program to find the FIRST and FOLLOW sets of a given context-free grammar.
Theory
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or ε can be added to any FIRST set:
1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X → Y1 Y2 ... Yk is a production, then add every non-ε symbol of FIRST(Y1) to FIRST(X); if ε is in FIRST(Y1), do the same with FIRST(Y2), and so on; if ε is in FIRST(Yi) for all i, add ε to FIRST(X).

To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing can be added to any FOLLOW set:
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
Example: Find the FIRST and FOLLOW sets for the following grammar (the same grammar is used in the next experiment):
P → bSe
S → AR
R → AR | ε
A → id = E ;
E → FT
T → +FT | ε
F → (E) | id | INT
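Applying the rules above to this grammar yields the following sets (a worked solution; the manual itself does not print them):

FIRST(P) = { b }            FOLLOW(P) = { $ }
FIRST(S) = { id }           FOLLOW(S) = { e }
FIRST(R) = { id, ε }        FOLLOW(R) = { e }
FIRST(A) = { id }           FOLLOW(A) = { id, e }
FIRST(E) = { (, id, INT }   FOLLOW(E) = { ;, ) }
FIRST(T) = { +, ε }         FOLLOW(T) = { ;, ) }
FIRST(F) = { (, id, INT }   FOLLOW(F) = { +, ;, ) }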
Aim
To construct a predictive parsing table for a given grammar and use it to parse an input string.
Theory
For each production A → α of the grammar:
1. For each terminal a in FIRST(α), add A → α to M[A, a].
2. If FIRST(α) contains ε, add A → α to M[A, b] for each terminal b in FOLLOW(A).
3. If FIRST(α) contains ε and $ is in FOLLOW(A), add A → α to M[A, $].
P → bSe
S → AR
R → AR | ε
A → id = E ;
E → FT
T → +FT | ε
F → (E) | id | INT
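The parsing table itself is not reproduced in the manual; applying the three rules above to this grammar gives the following nonempty entries, which the trace below assumes:

M[P, b]  = P → bSe
M[S, id] = S → AR
M[R, id] = R → AR        M[R, e] = R → ε
M[A, id] = A → id = E ;
M[E, (]  = M[E, id] = M[E, INT] = E → FT
M[T, +]  = T → +FT       M[T, ;] = M[T, )] = T → ε
M[F, (]  = F → (E)       M[F, id] = F → id      M[F, INT] = F → INT
All remaining entries are error entries.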
Using the above parsing table, the string b id = id + 1 ; e is parsed as follows:

Stack        Input                  Output
$P           b id = id + 1 ; e $
$eSb         b id = id + 1 ; e $    P → bSe
$eS          id = id + 1 ; e $
$eRA         id = id + 1 ; e $      S → AR
$eR;E=id     id = id + 1 ; e $      A → id = E ;
$eR;E=       = id + 1 ; e $
$eR;E        id + 1 ; e $
$eR;TF       id + 1 ; e $           E → FT
$eR;Tid      id + 1 ; e $           F → id
$eR;T        + 1 ; e $
$eR;TF+      + 1 ; e $              T → +FT
$eR;TF       1 ; e $
$eR;TINT     1 ; e $                F → INT
$eR;T        ; e $
$eR;         ; e $                  T → ε
$eR          e $
$e           e $                    R → ε
$            $
Aim
To construct operator precedence relations and precedence functions for an operator grammar.
Theory
Precedence Relations
Bottom-up parsers for a large class of context-free grammars can be easily developed using operator grammars.
Operator grammars have the property that no production right side is empty or has two adjacent nonterminals. This property enables the implementation of efficient operator-precedence parsers. These parsers rely on the following three precedence relations:
Relation Meaning
a <· b a yields precedence to b
a =· b a has the same precedence as b
a ·> b a takes precedence over b
These operator precedence relations allow delimiting the handles in the right sentential forms:
<· marks the left end, =· appears in the interior of the handle, and ·> marks the right end.
Let us assume that between any two symbols ai and ai+1 there is exactly one precedence relation, and that $ marks both ends of the string; then for every terminal b we can write $ <· b and b ·> $.
If we remove all nonterminals and place the correct precedence relation (<·, =·, or ·>) between the remaining terminals, the resulting strings can be analyzed by an easily developed parser.
For example, the following operator precedence relations can be introduced for simple expressions:

        id      +       *       $
id              ·>      ·>      ·>
+       <·      ·>      <·      ·>
*       <·      ·>      ·>      ·>
$       <·      <·      <·      ·>
The handle of a right sentential form is then found as follows:
- scan the string from the left end until the first ·> is encountered;
- scan backwards (from right to left) over any =· until a <· is seen;
- everything between the two relations <· and ·> forms the handle.
Note that not the entire sentential form is scanned to find the handle.
The operator precedence parsers usually do not store the precedence table with the relations,
rather they are implemented in a special way.
Operator precedence parsers use precedence functions that map terminal symbols to integers,
and so the precedence relations between the symbols are implemented by numerical
comparison.
1. Create functions fa and ga for each grammar terminal a and for the end-of-string symbol $.
2. Partition the symbols into groups so that fa and gb are in the same group if a =· b (there can be symbols in the same group even if they are not connected by this relation).
3. Create a directed graph whose nodes are the groups; then, for each pair of symbols a and b, place an edge from the group of gb to the group of fa if a <· b, and an edge from the group of fa to that of gb if a ·> b.
4. If the constructed graph has a cycle, then no precedence functions exist. When there are no cycles, let f(a) and g(b) be the lengths of the longest paths starting from the groups of fa and gb respectively.
As an example, consider again the precedence table for simple expressions:

        id      +       *       $
id              ·>      ·>      ·>
+       <·      ·>      <·      ·>
*       <·      ·>      ·>      ·>
$       <·      <·      <·      ·>

[Figure: the directed graph constructed from this table, with nodes for the groups containing fid, gid, f+, g+, f*, g*, f$ and g$.]

Taking the lengths of the longest paths from each group gives the precedence functions:

        id      +       *       $
f       4       2       4       0
g       5       1       3       0
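The parser then compares function values instead of consulting the table: f(a) <= g(b) corresponds to a <· b or a =· b (shift), and f(a) > g(b) corresponds to a ·> b (reduce). A small C sketch of this comparison, using token codes of my own choosing:

#include <stdio.h>

/* Token codes: 0 = id, 1 = '+', 2 = '*', 3 = '$' (an assumed encoding). */
enum { T_ID, T_PLUS, T_STAR, T_END };

/* Precedence functions from the table above. */
int f[] = { 4, 2, 4, 0 };
int g[] = { 5, 1, 3, 0 };

/* <· or =· means shift; ·> means reduce (pop a handle). */
const char *action(int top, int lookahead)
{
    return f[top] <= g[lookahead] ? "shift" : "reduce";
}

int main(void)
{
    printf("top=+, next=*  -> %s\n", action(T_PLUS, T_STAR)); /* f(+)=2 <= g(*)=3: shift  */
    printf("top=*, next=+  -> %s\n", action(T_STAR, T_PLUS)); /* f(*)=4 >  g(+)=1: reduce */
    return 0;
}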
Aim
To study the operation of a shift-reduce parser.
Theory
Shift-reduce parsing is a method of syntax analysis that constructs the parse tree for an input string beginning at the leaves (the bottom) and working up towards the root (the top).
At each step it attempts to reduce a substring of the input by replacing it with the left side of a grammar production whose right side it matches, thus working towards the start symbol of the grammar.
When the parser finishes, the reductions it has performed, read in reverse, give a rightmost derivation of the input string according to the grammar.
For example, given the grammar
S → aABe
A → Abc | b
B → d
the sentence abbcde can be reduced to S as follows:
abbcde
aAbcde
aAde
aABe
S
A handle of a string is a substring that matches the right side of a production whose reduction to
the nonterminal on the left represents one step along the reverse of a rightmost derivation.
A convenient way of implementing such a parser is to use a stack to hold the grammar symbols and an input buffer to hold the string to be parsed.
The shift-reduce parser operates by shifting zero or more input symbols onto the stack until a handle appears on top of the stack. The parser then reduces the handle to the left side of the corresponding production. This process continues until an error occurs or the start symbol remains alone on the stack. There are four possible actions:
shift - the next input symbol is shifted onto the top of the stack;
reduce - the parser knows the right end of the handle is at the top of the stack. It must then locate the left end of the handle within the stack and decide with what nonterminal to replace the handle;
accept - the parser announces successful completion of parsing;
error - the parser discovers that a syntax error has occurred.
Successive configurations of such a parser consist of the stack contents (with the top at the right) and the remaining input, for example:

Stack    Input
1) $       yz$
2) $B      yz$
3) $By     z$

Stack    Input
1) $       xyz$

Example: a shift-reduce parser for the grammar
E → E + E | E * E | ( E ) | id
performs the following steps when analyzing the input string id1 + id2 * id3.
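A possible sequence of moves is sketched below (a reconstruction; the shift/reduce choices are resolved so that * binds tighter than +):

Stack          Input               Action
$              id1 + id2 * id3 $   shift
$ id1          + id2 * id3 $       reduce by E → id
$ E            + id2 * id3 $       shift
$ E +          id2 * id3 $         shift
$ E + id2      * id3 $             reduce by E → id
$ E + E        * id3 $             shift
$ E + E *      id3 $               shift
$ E + E * id3  $                   reduce by E → id
$ E + E * E    $                   reduce by E → E * E
$ E + E        $                   reduce by E → E + E
$ E            $                   accept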
Aim
To create and print a symbol table that contains the name, type and size of the identifiers found in a C file.
Algorithm
Step 1: Declare the necessary variables.
Step 2: Open the file that contains the simple C program.
Step 3: Read word by word (using fscanf function) from the file and perform the following
operations until the end of file is reached.
a) Compare the word with the data types that are supported by C.
b) Then store the type, size (according to the data types in C) and name of the Identifier
in the symbol table.
c) Print the symbol table in a neat format.
Step 4: Stop the program execution.
Example Input
#include<stdio.h>
void main()
{
int first;
double second;
char c;
printf("%d %b %c",c);
}
Output
SYMBOL TABLE
Name       Type       Size (bytes)
first      int        4
second     double     8
c          char       1
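A compact sketch along the lines of this algorithm is shown below. The input file name test.c, the fixed type/size table, and the simplification of one identifier per declaration (a list such as "int a, b;" comes out as a single entry) are assumptions:

#include <stdio.h>
#include <string.h>

/* Data types recognized, with typical sizes in bytes (an assumption). */
const char *types[] = { "int", "double", "char", "float" };
const int  sizes[] = {  4,     8,        1,      4       };

int main(void)
{
    FILE *fp = fopen("test.c", "r");    /* the simple C program */
    char word[64], name[64];
    if (!fp) { perror("test.c"); return 1; }
    printf("SYMBOL TABLE\n%-10s %-10s %s\n", "Name", "Type", "Size");
    while (fscanf(fp, "%63s", word) == 1) {            /* read word by word */
        for (int i = 0; i < 4; i++) {
            if (strcmp(word, types[i]) == 0 &&
                fscanf(fp, " %63[^;]", name) == 1)     /* identifier up to ';' */
                printf("%-10s %-10s %d\n", name, types[i], sizes[i]);
        }
    }
    fclose(fp);
    return 0;
}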
Aim
To generate assembly language code for a given postfix expression.
Algorithm
Step 1: Declare the necessary variables.
Step 2: Read the expression from the user.
Step 3: Read character by character and do the following until end of the expression is
reached.
a) If the character is an alphabet, push it in the identifier stack.
b) Else if the character is an operator, temporarily store the corresponding instruction
based upon the operator.
c) Then pop the second operand from the stack and then the first, and store them in temporary variables, say a and b respectively.
d) Then print the load instruction "LDA" with the operand stored in variable b.
e) Then print the arithmetic instruction with the operand stored in variable a.
f) Then print the store instruction "STA" with a temporary operand.
g) Then push the temporary operand onto the stack for future use.
Step 4: Stop the program execution.
Example Output
Enter the Postfix Expression: ABC*+
LDA B
MUL C
STA T1
LDA A
ADD T1
STA T2
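A sketch of this algorithm in C follows (single-letter operands and the four operators + - * / are assumptions); running it on ABC*+ reproduces the output shown above:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

char stack[64][8];      /* operand stack: names such as "A" or "T1" */
int  top = -1;

const char *arith(char op)            /* instruction for each operator */
{
    switch (op) {
    case '+': return "ADD";
    case '-': return "SUB";
    case '*': return "MUL";
    default:  return "DIV";
    }
}

int main(void)
{
    char expr[64];
    int tno = 0;
    printf("Enter the Postfix Expression: ");
    scanf("%63s", expr);

    for (int i = 0; expr[i]; i++) {
        if (isalpha((unsigned char)expr[i])) {        /* operand: push */
            sprintf(stack[++top], "%c", expr[i]);
        } else {                                      /* operator */
            char a[8], b[8], t[8];
            strcpy(a, stack[top--]);   /* second operand (top of stack) */
            strcpy(b, stack[top--]);   /* first operand */
            sprintf(t, "T%d", ++tno);  /* fresh temporary */
            printf("LDA %s\n%s %s\nSTA %s\n", b, arith(expr[i]), a, t);
            strcpy(stack[++top], t);   /* push the temporary */
        }
    }
    return 0;
}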
Aim
Construct a scanner for a simple C-like language using the lex/flex tool. Given a legal input (a valid program in the language), the scanner should print a stream of 2-tuples in the following format: <tokenid, tokenname>.
For the following program segment
int main()
{
int first, second, third;
third=first+second;
}
<260, int>
<265, main>
<270, ident>
...
In the above example, the token-id for int is 260 and the token-id for ident ("identifier") is 270.
These numbers can be assigned according to your wish. Consider that the language only defines
main and supports the declaration part and statements involving arithmetic operations.
Theory
NAME
flex - fast lexical analyzer generator
SYNOPSIS
flex [-bcdfhilnpstvwBFILTV78+? -C[aefFmr] -ooutput -Pprefix -Sskeleton] [--help --version] [filename ...]
DESCRIPTION
flex is a tool for generating scanners: programs which recognize lexical patterns in text. flex reads the given input files, or its standard input if no file names are given, for a description of a scanner to generate. The description is in the form of pairs of regular expressions and C code, called rules. flex generates as output a C source file, lex.yy.c, which defines a routine yylex(). This file is compiled and linked with the -lfl library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions; whenever it finds one, it executes the corresponding C code.
First some simple examples to get the flavor of how one uses flex. The following flex
input specifies a scanner which whenever it encounters the string "username" will replace it
with the user's login name:
%%
username printf( "%s", getlogin() );
By default, any text not matched by a flex scanner is copied to the output, so the net effect of
this scanner is to copy its input file to its output with each occurrence of "username" expanded.
In this input, there is just one rule. "username" is the pattern and the "printf" is the action. The
"%%" marks the beginning of the rules.
int num_lines = 0, num_chars = 0;
%%
\n ++num_lines; ++num_chars;
. ++num_chars;
%%
main()
{
yylex();
printf( "# of lines = %d, # of chars = %d\n",
num_lines, num_chars );
}
This scanner counts the number of characters and the number of lines in its input (it produces
no output other than the final report on the counts). The first line declares two globals,
"num_lines" and “num_chars", which are accessible both inside yylex() and in the main()
routine declared after the second "%%". There are two rules, one which matches a newline
("\n") and increments both the line count and the character count, and one which matches any
character other than a newline (indicated by the "." regular expression).
The flex input file consists of three sections, separated by a line with just %% in it:
definitions
%%
rules
%%
user code
Name definitions have the form:
name definition
The "name" is a word beginning with a letter or an underscore ('_') followed by zero or more
letters, digits, '_', or '-' (dash). The definition is taken to begin at the first non white-space
character following the name and continuing to the end of the line. The definition can
subsequently be referred to using "{name}", which will expand to "(definition)". For example,
DIGIT [0-9]
ID [a-z][a-z0-9]*
defines "DIGIT" to be a regular expression which matches a single digit, and "ID" to be a
regular expression which matches a letter followed by zero-or-more letters-or-digits. A
subsequent reference to
{DIGIT}+"."{DIGIT}*
is identical to
([0-9])+"."([0-9])*
%{
#include <stdio.h>
#define INT 260
#define MAIN 265
#define IDENT 270
#define CONSTANT 275
#define OPERATOR 280
%}
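The rules section of this scanner is not reproduced in the manual. A sketch that emits tuples in the required format follows; the exact patterns, and the choice to skip whitespace and punctuation except the comma (which flex's default rule then echoes, as in the output below), are my assumptions:

%%
int                     { printf("<%d,int>\n", INT); }
main                    { printf("<%d,main>\n", MAIN); }
[a-zA-Z_][a-zA-Z0-9_]*  { printf("<%d,ident>\n", IDENT); }
[0-9]+                  { printf("<%d,constant>\n", CONSTANT); }
[+\-*/=]                { printf("<%d,operator>\n", OPERATOR); }
[ \t\n(){};]            ;   /* skip whitespace and punctuation */
%%
int main() { yylex(); return 0; }
int yywrap() { return 1; }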
Test.c
int main()
{
int first, second, third;
third = first + second ;
}
<260,int>
<265,main>
<260,int>
<270,ident>
, <270,ident>
, <270,ident>
<270,ident>
<270,ident>
<280,operator>
<270,ident>
Aim
Construct a parser for a C-like language. The parser should work with the lexer that has been
developed already. The parser should give warning messages when any error occurs, such as
when a variable is used without a preceding definition. Description of the language is given
below:
The keywords of the language are: else if int return void
Special symbols are: + - * / < > = ; , ( ) { } /*…*/
Other tokens are ID and NUM, defined by the following regular expressions:
ID = letter letter*
NUM = digit digit*
letter = a|…|z|A|…|Z
digit = 0|…|9
White spaces (blanks, newlines, tabs) must be ignored. Comments must be skipped by the lexer.
A BNF grammar for the language is as follows:
program → declaration-list
declaration-list → declaration-list declaration | declaration
declaration → var-declaration | fun-declaration
var-declaration → type ID ;
type → int | void
fun-declaration → type ID ( params ) compound-stmt
params → type ID
compound-stmt → { local-declarations statement-list }
local-declarations → local-declarations var-declaration | empty
statement-list → statement-list statement | empty
statement → expression-stmt | compound-stmt | selection-stmt | return-stmt
expression-stmt → expression ; | ;
selection-stmt → if ( expression ) statement | if ( expression ) statement else statement
return-stmt → return ;
expression → var = expression | simple-expression
var → ID
Theory
Introduction
Yacc provides a general tool for describing the input to a computer program. The Yacc user
specifies the structures of his input, together with code to be invoked as each such structure is
recognized. Yacc turns such a specification into a subroutine that handles the input process;
frequently, it is convenient and appropriate to have most of the flow of control in the user's
application handled by this subroutine.
The input subroutine produced by Yacc calls a user-supplied routine to return the next basic
input item. Thus, the user can specify his input in terms of individual input characters or in
terms of higher level constructs such as names and numbers. The user supplied routine may
also handle idiomatic features such as comment and continuation conventions, which typically
defy easy grammatical specification. Yacc is written in portable C.
Yacc provides a general tool for imposing structure on the input to a computer program. User
prepares a specification of the input process; this includes rules describing the input structure,
code to be invoked when these rules are recognized, and a low-level routine to do the basic
input. Yacc then generates a function to control the input process. This function, called a
parser, calls the user-supplied low-level input routine (the lexical analyzer) to pick up the basic
items (called tokens) from the input stream. These tokens are organized according to the
input structure rules, called grammar rules; when one of these rules has been recognized, then
user code supplied for this rule, an action, is invoked; actions have the ability to return
values and make use of the values of other actions.
The heart of the input specification is a collection of grammar rules. Each rule describes an
allowable structure and gives it a name. For example, one grammar rule might be
date : month_name day ',' year
Here, date, month_name, day, and year represent structures of interest in the input
process; presumably, month_name, day, and year are defined elsewhere. The comma ``,'' is
enclosed in single quotes; this implies that the comma is to appear literally in the input. The
colon and semicolon merely serve as punctuation in the rule, and have no significance in
controlling the input. Thus, with proper definitions, the input
July 4, 1776
might be matched by the above rule.
An important part of the input process is carried out by the lexical analyzer. This user routine
reads the input stream, recognizing the lower level structures, and communicates these tokens to
the parser. For historical reasons, a structure recognized by the lexical analyzer is called a
terminal symbol, while the structure recognized by the parser is called a nonterminal symbol.
To avoid confusion, terminal symbols will usually be referred to as tokens.
There is considerable leeway (flexibility) in deciding whether to recognize structures using the
lexical analyzer or grammar rules. For example, the rules
month_name : 'J' 'a' 'n' ;
month_name : 'F' 'e' 'b' ;
...
month_name : 'D' 'e' 'c' ;
might be used in the above example. The lexical analyzer would only need to recognize
individual letters, and month_name would be a nonterminal symbol. Such low-level rules tend
to waste time and space, and may complicate the specification beyond Yacc's ability to deal with them.
Literal characters such as ``,'' must also be passed through the lexical analyzer, and are also
considered tokens.
Specification files are very flexible. It is relatively easy to add to the above example the rule
date: month '/' day '/' year; allowing 7 / 4 / 1776 as a synonym for July 4, 1776. In most
cases, this new rule could be ``slipped in'' to a working system with minimal effort, and little
danger of disrupting existing input.
The input being read may not conform to the specifications. These input errors are detected as
early as is theoretically possible with a left-to-right scan; thus, not only is the chance of reading
and computing with bad input data substantially reduced, but the bad data can usually be
quickly found. Error handling, provided as part of the input specifications, permits the reentry
of bad data, or the continuation of the input process after skipping over the bad data.
In some cases, Yacc fails to produce a parser when given a set of specifications. For example,
the specifications may be self contradictory, or they may require a more powerful recognition
mechanism than that available to Yacc. The former cases represent design errors; the latter
cases can often be corrected by making the lexical analyzer more powerful, or by rewriting
some of the grammar rules. While Yacc cannot handle all possible specifications, its power
compares favorably with similar systems; moreover, the constructions which are difficult for
Yacc to handle are also frequently difficult for human beings to handle. Some users have
reported that the discipline of formulating valid Yacc specifications for their input revealed
errors of conception or design early in the program development.
1. Basic Specifications
Every specification file consists of three sections: the declarations, (grammar) rules, and
programs. The sections are separated by double percent ``%%'' marks. (The percent ``%'' is
generally used in Yacc specifications as an escape character.)
The declaration section may be empty. Moreover, if the programs section is omitted, the second
%% mark may be omitted also; thus, the smallest legal Yacc specification is
%%
rules
Blanks, tabs, and newlines are ignored except that they may not appear in names or multi-
character reserved symbols. Comments may appear wherever a name is legal; they are
enclosed in /* . . . */, as in C and PL/I.
Names may be of arbitrary length, and may be made up of letters, dot ``.'', underscore ``_'', and
non-initial digits. Upper and lower case letters are distinct. The names used in the body of a
grammar rule may represent tokens or nonterminal symbols.
A literal consists of a character enclosed in single quotes ``'''. As in C, the backslash ``\'' is an
escape character within literals, and all the C escapes are recognized. Thus
'\n' newline
'\r' return
'\'' single quote ``'''
'\\' backslash ``\''
'\t' tab
If there are several grammar rules with the same left hand side, the vertical bar ``|'' can be used
to avoid rewriting the left hand side. In addition, the semicolon at the end of a rule can be
dropped before a vertical bar. Thus the grammar rules
A : B C D ;
A : E F ;
A : G ;
can be given to Yacc as
A : B C D
| E F
| G
;
It is not necessary that all grammar rules with the same left side appear together in the
grammar rules section, although it makes the input much more readable, and easier to change.
If a nonterminal symbol matches the empty string, this can be indicated in the obvious way:
empty : ;
Names representing tokens must be declared; this is most simply done by writing
%token name1, name2 . . .
in the declarations section. Every name not defined in the declarations section is assumed to
represent a non-terminal symbol. Every non-terminal symbol must appear on the left side of at
least one rule.
Of all the nonterminal symbols, one, called the start symbol, has particular importance. The
parser is designed to recognize the start symbol; thus, this symbol represents the largest, most
general structure described by the grammar rules. By default, the start symbol is taken to be the left hand side of the first grammar rule in the rules section. It is possible, and in fact desirable, to declare the start symbol explicitly in the declarations section, using the %start keyword:
%start symbol
The end of the input to the parser is signaled by a special token, called the endmarker. If the
tokens up to, but not including, the endmarker form a structure which matches the start symbol,
the parser function returns to its caller after the end-marker is seen; it accepts the input. If the
endmarker is seen in any other context, it is an error.
It is the job of the user-supplied lexical analyzer to return the endmarker when
appropriate; see section 3, below. Usually the endmarker represents some reasonably obvious
I/O status, such as ``end-of-file'' or ``end-of-record''.
2. Actions
With each grammar rule, the user may associate actions to be performed each time the rule is recognized in the input process. These actions may return values, and may obtain the values returned by previous actions. Moreover, the lexical analyzer can return values for tokens, if desired.
An action is an arbitrary C statement, and as such can do input and output, call subprograms,
and alter external vectors and variables. An action is specified by one or more statements,
enclosed in curly braces ``{'' and ``}''. For example,
A : '(' B ')'
{ hello( 1, "abc" ); }
and
XXX : YYY ZZZ
{ printf("a message\n");
flag = 25; }
To facilitate easy communication between the actions and the parser, the action statements are altered slightly. The symbol ``dollar sign'' ``$'' is used as a signal to Yacc in this context. To return a value, the action normally sets the pseudo-variable ``$$'' to some value. For example, an action that does nothing but return the value 1 is
{ $$ = 1; }
To obtain the values returned by previous actions and the lexical analyzer, the action may use
the pseudo-variables $1, $2, . . ., which refer to the values returned by the components of the
right side of a rule, reading from left to right. Thus, if the rule is
A : BCD ;
for example, then $2 has the value returned by C, and $3 the value returned by D.
As a more concrete example, consider the rule
expr : '(' expr ')' ;
The value of this rule is usually the value of the expr in parentheses, which can be indicated by the action { $$ = $2; }.
By default, the value of a rule is the value of the first element in it ($1). Thus, grammar rules of the form
A : B ;
frequently need not have an explicit action. Yacc also permits an action to be written in the middle of a rule; internally, such an interior action is handled by manufacturing a new nonterminal matching the empty string, so that a rule with the action { $$ = 1; } written between B and C is treated as if it had been written:
$ACT : /* empty */
{ $$ = 1; }
;
A : B $ACT C
{ x = $2; y = $3; }
;
In many applications, output is not done directly by the actions; rather, a data structure, such as
a parse tree, is constructed in memory, and transformations are applied to it before output is
generated. Parse trees are particularly easy to construct, given routines to build and maintain
the tree structure desired. For example, suppose there is a C function node, written so that the
call
node( L, n1, n2 )
creates a node with label L and descendants n1 and n2, and returns the index of the newly created node. Then a parse tree can be built by supplying actions such as:
expr : expr '+' expr
{ $$ = node( '+', $1, $3 ); }
in the specification.
The user may define other variables to be used by the actions. Declarations and definitions can
appear in the declarations section, enclosed in the marks ``%{'' and ``%}''. These declarations
and definitions have global scope, so they are known to the action statements and the lexical
analyzer. For example,
%{ int variable = 0; %}
3. Lexical Analysis
The user must supply a lexical analyzer to read the input stream and communicate tokens (with values, if desired) to the parser. The lexical analyzer is an integer-valued function called yylex.
The parser and the lexical analyzer must agree on these token numbers in order for
communication between them to take place. The numbers may be chosen by Yacc, or chosen
by the user. In either case, the ``# define'' mechanism of C is used to allow the lexical analyzer
to return these numbers symbolically. For example, suppose that the token name DIGIT has
been defined in the declarations section of the Yacc specification file. The relevant portion of
the lexical analyzer might look like:
yylex(){
extern int yylval;
int c;
...
c = getchar();
...
switch( c ) {
...
case '0':
case '1':
...
case '9':
yylval = c-'0';
return( DIGIT );
...
}
...
}
The intent is to return a token number of DIGIT and a value equal to the numerical value of
the digit. Provided that the lexical analyzer code is placed in the programs section of the
specification file, the identifier DIGIT will be defined as the token number associated with the
token DIGIT.
This mechanism leads to clear, easily modified lexical analyzers; the only pitfall is the need
to avoid using any token names in the grammar that are reserved or significant in C or the
parser; for example, the use of token names ‘if’ or ‘while’ will almost certainly cause
severe difficulties when the lexical analyzer is compiled. The token name error is reserved
for error handling, and should not be used naively.
As mentioned above, the token numbers may be chosen by Yacc or by the user. In the default
situation, the numbers are chosen by Yacc. The default token number for a literal character is
the numerical value of the character in the local character set. Other names are assigned token
numbers starting at 257.
To assign a token number to a token (including literals), the first appearance of the token
name or literal in the declarations section can be immediately followed by a nonnegative
integer. This integer is taken to be the token number of the name or literal. Names and literals
not defined by this mechanism retain their default definition. It is important that all token
numbers be distinct.
For historical reasons, the end marker must have token number 0 or negative. This token
number cannot be redefined by the user; thus, all lexical analyzers should be prepared to return
0 or negative as a token number upon reaching the end of their input.
A very useful tool for constructing lexical analyzers is the Lex program developed by Mike Lesk. These lexical analyzers are designed to work in close harmony with Yacc parsers. The specifications for these lexical analyzers use regular expressions instead of grammar rules. Lex can be used to produce quite complicated lexical analyzers, but there remain some languages whose lexical analyzers must be crafted by hand. The Lex specification below is used together with the Yacc file Newyacc.y that follows it.
%{
#include <stdlib.h>
#include "y.tab.h"
void yyerror(char *);
extern int yylval;
%}
digit [0-9]
num {digit}{digit}*
%%
{num} { yylval=atoi(yytext);
return(NUM);}
"+"|"*"|"("|")"|"\n" {return yytext[0];}
[ \t] ;
/*Anything else is treated as error*/
. yyerror("Invalid character\n");
%%
Newyacc.y
%{
#include <ctype.h>
#include <stdio.h>
%}
%token NUM
%left '+'
%left '*'
%%
lines : lines exp '\n' {printf("value : %d \n",$2);}
    | lines '\n'
    |
    ;
/* exp rules (reconstructed): a minimal grammar for + and * */
exp : exp '+' exp {$$ = $1 + $3;}
    | exp '*' exp {$$ = $1 * $3;}
    | '(' exp ')' {$$ = $2;}
    | NUM
    ;
%%
int main(void) {
    yyparse();
    return 0;
}
void yyerror(char *s) {    /* declared in the lex file above */
    fprintf(stderr, "%s", s);
}
int yywrap()
{
    return(1);
}
LFile.l
%{
#include<stdio.h>
#include "y.tab.h"
#define TableSize 29
struct SYMTAB{
char symb[16];
int kind;
int type;
int empty;
}symbol[TableSize];
extern int yylval;
int full=0;
%}
/* regular definitions */
delim [ \t\n]
ws {delim}+
letter [A-Za-z]
digit [0-9]
id {letter}{letter}*
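The rules section of LFile.l is not shown. A sketch of what it might contain follows: it returns the tokens declared by the parser below and enters each identifier into the SYMTAB array through a hypothetical install helper, so that yylval carries the symbol-table index the parser's actions (symbol[$2]...) expect. strcmp and strcpy would additionally require <string.h> in the definitions block.

%%
{ws}        ;   /* skip white space */
if          { return IF; }
else        { return ELSE; }
return      { return RETURN; }
int         { return INT; }
void        { return VOID; }
{id}        { yylval = install(yytext); return ID; }
{digit}+    { return NUM; }
.           { return yytext[0]; }
%%
/* install (hypothetical): enter a lexeme into SYMTAB, return its index */
int install(char *s)
{
    int i;
    for (i = 0; i < TableSize && symbol[i].symb[0] != '\0'; i++)
        if (strcmp(symbol[i].symb, s) == 0) return i;  /* already present */
    if (i < TableSize) strcpy(symbol[i].symb, s);      /* new entry */
    else full = 1;                                     /* table overflow */
    return i;
}

The parser specification that uses these tokens follows: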
%{
#include<stdio.h>
#define TableSize 29
extern struct SYMTAB{
char symb[16];
int kind;
int type;
int empty;
}symbol[TableSize];
%}
%token ID
%token NUM IF ELSE RETURN VOID INT
%%
program : declist ;
declist : declist dec
| dec ;
dec : vardec
| fundec ;
vardec : type ID ';' {symbol[$2].kind=0; symbol[$2].type=$1;} ;
type : INT {$$ = $1;}
| VOID {$$ = $1;} ;
fundec : type ID '(' params ')' compoundstmt {symbol[$2].kind=1;
symbol[$2].type=$1;} ;
params : type ID {symbol[$2].kind=0; symbol[$2].type=$1;}
compoundstmt: '{' localdec statementlist '}' ;
localdec : localdec vardec
|;
statementlist : statementlist statement
|;
statement : expstmt
| compoundstmt
Aim
To implement a desk calculator using the lex and yacc tools.
Algorithm
1. Process the yacc grammar file using the -d optional flag (which informs the yacc
command to create a file that defines the tokens used in addition to the C language
source code):
yacc -d calc.yacc
2. Use the ls command to verify that the following files were created:
y.tab.c
The C language source file that the yacc command created for the parser
y.tab.h
A header file containing define statements for the tokens used by the parser
3. Process the lex specification file:
lex calc.lex
4. Use the ls command to verify that the following file was created:
lex.yy.c
The C language source file that the lex command created for the lexical analyzer
5. Compile and link the two C language source files:
cc y.tab.c lex.yy.c
6. Use the ls command to verify that the following files were created:
y.tab.o
The object file for the y.tab.c source file
lex.yy.o
The object file for the lex.yy.c source file
a.out
The executable program file
After you start the program (for example, by running a.out), the cursor moves to the line below the $ (command prompt). Then, enter numbers and operators as you would on a calculator. When you press the Enter key, the program displays the result of the operation. After you assign a value to a variable, as follows, the cursor moves to the next line.
m=4 <enter>
When you use the variable in subsequent calculations, it will have the assigned value:
m+5 <enter>
9
The following example shows the contents of the calc.yacc file. This file has entries in all three
sections of a yacc grammar file: declarations, rules, and programs.
%{
#include <stdio.h>
int regs[26];
int base;
%}
%start list
%token DIGIT LETTER
%left '|'
%left '&'
%left '+' '-'
%left '*' '/' '%'
%left UMINUS /* supplies precedence for unary minus */
%%
list: /*empty */
|
list stat '\n'
|
list error '\n'
{
yyerrok;
}
;
stat: expr
{
printf("%d\n",$1);
}
|
LETTER '=' expr
{
regs[$1] = $3;
}
;
expr: '(' expr ')'
{
$$ = $2;
}
|
expr '*' expr
{
$$ = $1 * $3;
}
|
expr '/' expr
{
$$ = $1 / $3;
}
|
expr '%' expr
{
$$ = $1 % $3;
}
|
expr '+' expr
{
$$ = $1 + $3;
}
|
expr '-' expr
{
$$ = $1 - $3;
}
|
expr '&' expr
{
$$ = $1 & $3;
}
|
expr '|' expr
{
$$ = $1 | $3;
}
|
'-' expr %prec UMINUS
{
$$ = -$2;
}
|
LETTER
{
$$ = regs[$1];
}
|
number
;
number: DIGIT
{
$$ = $1;
base = ($1==0) ? 8 : 10;
} |
number DIGIT
{
$$ = base * $1 + $2;
}
;
%%
main()
{
return(yyparse());
}
yyerror(s)
char *s;
{
fprintf(stderr, "%s\n",s);
}
yywrap()
{
return(1);
}
main
The required main program that calls the yyparse subroutine to start the program.
yyerror(s)
This error-handling subroutine only prints a syntax error message.
yywrap
The wrap-up subroutine that returns a value of 1 when the end of input occurs.
This file contains include statements for standard input and output, as well as for the y.tab.h
file. If you use the -d flag with the yacc command, the yacc program generates that file from the
yacc grammar file information. The y.tab.h file contains definitions for the tokens that the
parser program uses. In addition, the calc.lex file contains the rules to generate these tokens
from the input stream. The following are the contents of the calc.lex file.
%{
#include <stdio.h>
#include "y.tab.h"
int c;
extern int yylval;
%}
%%
"" ;
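The listing is cut off here. The remaining rules, which return LETTER for a lowercase letter, DIGIT for a digit, and any other character as itself (as the parser above expects), would look like this sketch:

[a-z]       {
            c = yytext[0];
            yylval = c - 'a';
            return(LETTER);
            }
[0-9]       {
            c = yytext[0];
            yylval = c - '0';
            return(DIGIT);
            }
[^a-z0-9\b] {
            c = yytext[0];
            return(c);
            }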