Object-And Pattern-Oriented Compiler Construction in C++
January 2006
Foreword
Acknowledgements
I Introduction 1
1 Introduction 3
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Who Should Read This Document? . . . . . . . . . . . . . . . . . 3
1.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
II Lexical Analysis 19
3.1.1 Language . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Regular Language . . . . . . . . . . . . . . . . . . . . . . 21
3.1.3 Regular Expressions . . . . . . . . . . . . . . . . . . . . . 21
3.2 Finite State Acceptors . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Building the NFA . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 Determinization: Building the DFA . . . . . . . . . . . . 22
3.2.3 Compacting the DFA . . . . . . . . . . . . . . . . . . . . . 24
3.3 Analysing an Input String . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Building Lexical Analysers . . . . . . . . . . . . . . . . . . . . . . 27
3.4.1 The construction process . . . . . . . . . . . . . . . . . . . 27
3.4.1.1 The NFA . . . . . . . . . . . . . . . . . . . . . . 27
3.4.1.2 The DFA and the minimized DFA . . . . . . . . 27
3.4.2 The Analysis Process and Backtracking . . . . . . . . . . 29
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
III Syntactic Analysis 35
7.2.1 The first part: definitions . . . . . . . . . . . . . . . . . . 49
7.2.1.1 External definitions and code blocks . . . . . . 49
7.2.1.2 Internal definitions . . . . . . . . . . . . . . . . 49
7.2.2 The second part: rules . . . . . . . . . . . . . . . . . . . . 53
7.2.2.1 Shifts and reduces . . . . . . . . . . . . . . . . . 53
7.2.2.2 Structure of a rule . . . . . . . . . . . . . . . . . 53
7.2.2.3 The grammar’s start symbol . . . . . . . . . . . 55
7.2.3 The third part: code . . . . . . . . . . . . . . . . . . . . . 55
7.3 Handling Conflicts . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.4 Pitfalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
IV Semantic Analysis 63
9.3 Visitors and Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
9.3.1 Basic interface . . . . . . . . . . . . . . . . . . . . . . . . . 67
9.3.2 Processing interface . . . . . . . . . . . . . . . . . . . . . 67
9.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
V Appendices 73
B.2.4 Arithmetic instructions . . . . . . . . . . . . . . . . . . . 79
B.2.5 Rotation and shift instructions . . . . . . . . . . . . . . . 80
B.2.6 Logical instructions . . . . . . . . . . . . . . . . . . . . . . 80
B.2.7 Integer comparison instructions . . . . . . . . . . . . . . 80
B.2.8 Other comparison instructions . . . . . . . . . . . . . . . 81
B.2.9 Type conversion instructions . . . . . . . . . . . . . . . . 81
B.2.10 Function definition instructions . . . . . . . . . . . . . . . 82
B.2.10.1 Function definitions . . . . . . . . . . . . . . . . 82
B.2.10.2 Function calls . . . . . . . . . . . . . . . . . . . . 83
B.2.11 Addressing instructions . . . . . . . . . . . . . . . . . . . 83
B.2.11.1 Absolute and relative addressing . . . . . . . . 83
B.2.11.2 Quick opcodes for addressing . . . . . . . . . . 84
B.2.11.3 Load instructions . . . . . . . . . . . . . . . . . 84
B.2.11.4 Store instructions . . . . . . . . . . . . . . . . . 85
B.2.12 Segments, values, and labels . . . . . . . . . . . . . . . . 85
B.2.12.1 Segments . . . . . . . . . . . . . . . . . . . . . . 85
B.2.12.2 Values . . . . . . . . . . . . . . . . . . . . . . . . 85
B.2.12.3 Labels . . . . . . . . . . . . . . . . . . . . . . . . 86
B.2.12.4 Types of global names . . . . . . . . . . . . . . . 87
B.2.13 Jump instructions . . . . . . . . . . . . . . . . . . . . . . . 87
B.2.13.1 Conditional jump instructions . . . . . . . . . . 87
B.2.13.2 Other jump instructions . . . . . . . . . . . . . . 88
B.2.14 Other instructions . . . . . . . . . . . . . . . . . . . . . . 88
B.3 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
B.3.1 NASM code generator . . . . . . . . . . . . . . . . . . . . 89
B.3.2 Debug-only “code” generator . . . . . . . . . . . . . . . . 89
B.3.3 Developing new generators . . . . . . . . . . . . . . . . . 89
B.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
D Glossary 93
List of Figures
2.1 CDK library’s class diagram . . . . . . . . . . . . . . . . . . . . . 6
2.2 CDK library’s main function sequence diagram . . . . . . . . . . 7
2.3 Abstract compiler factory base class. . . . . . . . . . . . . . . . . 8
2.4 Concrete compiler factory for the Compact compiler (class definition) . . 9
2.5 Concrete compiler factory for the Compact compiler (implementation) . . 9
2.6 Compact’s lexical analyser header . . . . . . . . . . . . . . . . . 10
2.7 Abstract CDK compiler class . . . . . . . . . . . . . . . . . . . . 12
2.8 Partial syntax specification for the Compact compiler . . . . . . 13
2.9 CDK node hierarchy class diagram . . . . . . . . . . . . . . . . . 14
2.10 Partial specification of the abstract semantic processor . . . . . . 15
2.11 CDK library’s sequence diagram for syntax evaluation . . . . . 16
2.12 CDK library’s main function (simplified code) . . . . . . . . . . 17
6.2 Graphical representation of the DFA showing each state’s item
set. Reduces are possible in states I1 , I2 , I3 , I5 , I9 , I10 , and I11 : it
will depend on the actual parser whether reduces actually occur. 44
6.3 Example of a parser table. Note the column for the end-of-
phrase symbol. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.4 Example of unit L actions . . . . . . . . . . . . . . . . . . . . . . . 45
6.5 Example of almost-unit L actions . . . . . . . . . . . . . . . . . . . 46
6.6 Example of conflicts and compression . . . . . . . . . . . . . . . . 46
9.1 Macro structure of the main function. Note especially the syntax and semantic processing phases (respectively, yyparse and evaluate). . . . . . 66
List of Tables
3.1 Primitive constructions in Thompson’s algorithm. . . . . . . . . 23
I Introduction

1 Introduction

1.1 Introduction

(Aho et al., 1986)
• C/C++ are typically compiled into object code for later linking (.o files);
• Java is typically compiled (Sun compiler) into binary class files (.class),
used by the virtual machine. The GCC Java compiler can, besides the
class files, also produce object files (like for C/C++).
In this document, the C++ language is used, but the rationale is valid for other OO languages as well.
Note, however, that C++ works with C tools almost without change, some-
thing that may not be true of other languages (although there may exist tools
similar to flex and yacc that support them).
The use of C++ is not motivated only by a “better” C, a claim some would deny. Rather, it is motivated by the advantages that can be gained from bringing OO design principles in contact with compiler construction problems. In this regard, C++ is a more obvious choice than C¹, and is not so far removed that traditional compiler development techniques and tools have to be abandoned.
Going beyond basic OO principles into the world of design patterns is
just a small step, but one that contributes much of the overall gains in this
change: indeed, effective use of a few choice design patterns – especially, but
not necessarily limited to, the composite and visitor design patterns – contributes
to a much more robust compiler and a much easier development process.
The document assumes basic knowledge of object-oriented design as well as abstract data type definition. Knowledge about design patterns is desirable, but not necessary: the patterns used in the text will be briefly presented. Nevertheless, useful insights can be gained from reading a patterns book such as (Gamma et al., 1995).
1.3 Organization
This text parallels both the structure and development process of a compiler. Thus, the second part deals with lexical analysis or, by a different name, with the morphological analysis of the language being recognized. The third part presents syntax analysis in general and LALR(1) parsers in particular. The fourth part is dedicated to semantic analysis and the deep structure of a program as represented by a linguistic structure. Semantic processing also covers code generation, translation, and interpretation, as well as the other processes that use similar development processes.
Regarding the appendices, they present the code used throughout the docu-
ment. In particular, detailed descriptions of each hierarchy are presented. Also
presented is the structure of the final compiler, in terms of code: both the code
developed by the compiler developer, and the support code for compiler de-
velopment and final program execution.
¹ Even though one could say that if you have mastered OO design, then you can do it in almost any language, C++ continues to be a better choice than C, simply because it offers direct support for those principles and a strict type system.
2 Using C++ and the CDK Library

2.1 Introduction
The Compiler Development Kit (CDK) is a library for building compilers by allowing the combination of advanced OO programming techniques and traditional compiler building tools such as Lex and YACC. The resulting compiler is clearly structured and easy to maintain.
The CDK version described here uses the GNU Flex lexical analyser and the Berkeley YACC LALR(1) parser generator. In this chapter the description will focus on OO aspects and not on compiler construction details. These will be covered in detail in later chapters.
The reader is encouraged to review object-oriented and design pattern
concepts, especially, but without limitation, the ones used by the CDK and the
compilers based on it: Abstract Factory, Composite, Strategy, Visitor (Gamma
et al., 1995, respectively, pp. 87, 163, 315, 331).
[Figure 2.1: CDK library's class diagram: Compact and Vector are two examples of concrete languages. The diagram shows the abstract CompilerFactory, Compiler, and SemanticProcessor classes and, for each language, the concrete factory, scanner, compiler, and evaluator subclasses (e.g., CompactCompilerFactory, CompactScanner, compact::CompilerImpl, compact::CEvaluator).]
The library defines the following items, each of which will be the subject
of one of the sections below.
• The abstract evaluator: evaluating semantics from the syntax tree (§2.3.6);
• The abstract semantic processor: syntax tree node visiting (§2.3.7);
• The code generators: production of final code (§2.3.8);
• The main function (§2.3.9).
These topics are arranged more or less in the order they are needed to produce a full-fledged compiler: everything starts in the main function with the creation of the compiler for the work language and helper objects (e.g., the scanner); then nodes are created and passed to a specific evaluator that, in turn, creates the appropriate visitors to handle tasks such as code generation or interpretation. Figure 2.2 presents the top-level interactions (main function – see §2.3.9).
[Figure 2.2: CDK library's main function sequence diagram: the main function obtains the CompilerFactory (here, a CompactCompilerFactory), which creates the scanner (a FlexLexer) and the Compiler; command line options are processed (step 3), parse (step 4) invokes yyparse (step 5), which calls yylex (step 6), and finally evaluate is called (step 7). Details from yyparse, yylex, and evaluate have been omitted in this diagram.]
As shown in figure 2.2, the abstract factory is responsible for creating both the scanner and the compiler itself. The parser is not an object itself, but rather a compiler method. This arrangement is due to the initial choice of support tools: the parser generator, BYACC, is (currently) only capable of creating a C function (yyparse). It is easier to transform this function into a method than to create a class just for encapsulating it. Note that these choices (BYACC and method vs. class) may change, and so may all the classes involved (both compiler and factory).
The process of creating a compiler is a simple one: the factory method for
creating the compiler object first creates a scanner that is, later, passed as an
argument of the compiler’s constructor. Client code may rely on the factory
for creating the appropriate scanner and needs only to ask for the creation of a
compiler.
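In client code this looks roughly as follows (a sketch rather than actual CDK code: the language name and input file name are illustrative placeholders, and error handling is omitted; compare with figure 2.12):

// The registry returns the factory registered for the requested language;
// the factory creates the matching scanner internally before building the
// compiler, so client code never touches the scanner directly.
cdk::CompilerFactory *fact =
    cdk::CompilerFactory::getImplementation("compact");
cdk::Compiler *compiler = fact->createCompiler("example.cpt");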
Figure 2.3 presents the superclass definition. This is the main factory ab-
stract class: it provides methods for creating the lexical analyser and the com-
piler itself. Instances of concrete subclasses will be obtained by the main func-
tion to provide instances of the scanner and compiler objects for a concrete lan-
guage. The factory provides a registry (_factories) of all the known instances
of its subclasses. Each instance is automatically registered by the correspond-
ing constructor. Also, notice that scanner creation is not available outside the
factory: this is to avoid mismatches between the scanner and the compiler ob-
ject (which contains the parser code that uses the scanner).
#include <map>
#include <string>

class FlexLexer;

namespace cdk {
  class Compiler;

  class CompilerFactory {
    // registry of all known concrete factories, keyed by language name
    static std::map<std::string, CompilerFactory *> _factories;

  protected:
    // each concrete factory registers itself upon construction
    CompilerFactory(const char *lang) {
      _factories[lang] = this;
    }

  public:
    static CompilerFactory *getImplementation(const char *lang) {
      return _factories[lang];
    }

  public:
    virtual ~CompilerFactory();

  protected:
    // scanner creation is deliberately not available to client code
    virtual FlexLexer *createScanner(const char *name) = 0;

  public:
    virtual Compiler *createCompiler(const char *name) = 0;
  }; // class CompilerFactory
} // namespace cdk
The programmer must provide a concrete subclass for each new compiler/language. Figure 2.4 shows the class definition for the concrete fac-
tory, part of the Compact compiler. Note that this class is a singleton object
that automatically registers itself with the abstract factory superclass. Classes
CompactScanner and CompilerImpl are implementations for Compact of the
abstract concepts, FlexLexer and Compiler, used in the CDK (see, respec-
tively, §2.3.2 and §2.3.3).
class CompactCompilerFactory : public cdk::CompilerFactory {
  // the singleton instance (see figure 2.5)
  static CompactCompilerFactory _thisFactory;

protected:
  CompactCompilerFactory(const char *language)
      : cdk::CompilerFactory(language) {}

  // scanner creation is kept protected (see §2.3.1)
  FlexLexer *createScanner(const char *name) {
    return new CompactScanner(name, NULL, NULL);
  }

  cdk::Compiler *createCompiler(const char *name) {
    FlexLexer *scanner = createScanner(name);
    return new CompilerImpl(name, scanner);
  }
}; // class CompactCompilerFactory
Figure 2.4: Concrete compiler factory class definition for the Compact compiler
(this code is not part of the CDK).
CompactCompilerFactory
CompactCompilerFactory::_thisFactory("compact");
Figure 2.5: Implementation of the concrete compiler factory for the Compact
compiler (this code is not part of the CDK).
The abstract scanner class, FlexLexer, is provided by the GNU Flex tool and is not part of the CDK proper. We include it here because the CDK depends on it and the compiler developer must provide a concrete subclass for each new compiler/language. Essentially, this class is a wrapper for the code implementing the automaton which will recognize the input language. The most relevant methods are lineno (for providing information on source line numbers) and yylex (the lexical analyser itself).
The Compact compiler defines the concrete class CompactScanner, for im-
plementing the lexical analyser. Figure 2.6 shows the header file. The rest of
the class is implemented automatically from the lexical analyser’s specifica-
tion (§3.4). Note that we also defined yyerror as a method, ensuring source
code compatibility with traditional C-based approaches.
class CompactScanner : public yyFlexLexer {
  std::string _filename; // current input file name

public: // constructors
  CompactScanner(const char *filename,
                 std::istream *yyin = NULL,
                 std::ostream *yyout = NULL)
      : yyFlexLexer(yyin, yyout), _filename(filename) {
    set_debug(1); // enable scanner debugging output
  }
};
Figure 2.6: Implementation of the concrete lexical analyser class for the Com-
pact compiler (this code is not part of the CDK).
The concrete class will be used by the concrete compiler factory (see §2.3.1)
to initialize the compiler (see §2.3.3). There is, in principle, no limit to the num-
ber of concrete lexical analyser classes that may be defined for a given compiler.
Normally, though, one should be enough to account for the whole lexicon.
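For illustration only, the scanner can also be driven by hand, as in the sketch below; in the CDK, the generated yyparse calls yylex through the macros shown in figure 2.8, so this loop never appears in real code. The file name is a hypothetical example.

#include <iostream>

int main() {
  // construct the Compact scanner reading from std::cin (see figure 2.6)
  CompactScanner scanner("example.cpt", &std::cin, &std::cout);
  int tok;
  while ((tok = scanner.yylex()) != 0) // yyFlexLexer::yylex returns 0 at EOF
    std::cout << "token " << tok << " at line " << scanner.lineno() << std::endl;
  return 0;
}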
The abstract compiler class, Compiler, represents the compiler as a single entity and is responsible for performing all high-level compiler actions, namely lexical, syntactic, and semantic analysis, and actions deriving from those analyses: e.g., interpretation or code generation.
To carry out those tasks, the compiler class depends on other classes and
tools to perform specific actions. Thus, it relies on the scanner class hierarchy
(see §2.3.2) to execute the lexical analysis phase and recognize the tokens cor-
responding to the input program text. As we saw, that code was generated by
a specialized tool for creating implementations for regular expression proces-
sors. Likewise, the compiler relies on another specialized tool, YACC, to create,
from a grammar specification, an LALR(1) parser. Currently, this parser is not
encapsulated as an object (as was the case with the Flex-created code), but is
simply a method of the compiler class itself: yyparse.
Besides the compilation-specific parts of the class, it defines a series of
flags for controlling how to execute the compilation process. These flags in-
clude behaviour flags and input and output processing variables. The follow-
ing table describes the compiler’s instance variables as well as their uses.
Variable     Description
_errors      Counts compilation errors.
_extension   Output file extension (defined by the target and output file options).
_ifile       Input file name.
_istream     Input file stream (default std::cin).
_name        Language name.
_optimize    Controls whether optimization should be performed.
_ofile       Output file name.
_ostream     Output file stream (default std::cout).
_scanner     Pointer to the scanner object to be used by the compiler.
_syntax      Syntax tree representation (nodes).
_trace       Controls compiler execution trace level.
_tree        Create only the syntax tree: this is the same as specifying an XML target (see _extension above).
As mentioned in the previous section, the parsing function is simply the result of processing an LALR(1) parser specification with a tool such as Berkeley YACC or GNU Bison. Such a function is usually called yyparse and is, in the case of BYACC and the default action for Bison, written in C. Bison can, however, produce C++ code and sophisticated reentrant parsers. In the current version of the CDK, it is assumed that the tool used to process the syntactic specification is BYACC and that the native code is C.
¹ The capability to change the scanner does not mean that the change is in any way a good idea.
#include <cstdlib>

namespace cdk {
  namespace node { class Node; }

  class Compiler {
    typedef cdk::semantics::Evaluator evaluator_type;

  protected:
    virtual int yyparse() = 0; // supplied by the YACC-generated code

  public:
    inline int parse() { return yyparse(); }

  public:
    virtual bool evaluate() {
      // pick the evaluator registered for the current target extension
      evaluator_type *evaluator =
          evaluator_type::getEvaluatorFor(_extension);
      if (evaluator) return evaluator->evaluate(this);
      else exit(1); // no evaluator defined for target
    }
  }; // class Compiler
} // namespace cdk
Figure 2.7: Abstract CDK compiler class (simplified view): note that yyparse
is pure virtual.
Besides the classes defining compiler architecture, the CDK framework also includes classes for representing language concepts, i.e., they represent compilation results.²
² Note that although the code is contained in a single file, there is no guarantee that the global variables it contains do not cause problems: for instance, if multiple parsers were to be present at a given moment.
%{
#define LINE scanner()->lineno()
#define yylex scanner()->yylex
#define yyparse CompilerImpl::yyparse
%}
%%
// ... rules...
%%
Figure 2.8: Partial syntax specification for the Compact compiler: note the
macros used to control code binding to the CompilerImpl class.
The classes that represent syntactic concepts are called nodes and they form tree structures that represent whole programs or parts of programs: the syntax tree.
Although it is difficult, if not outright impossible, to predict what concepts are defined by a given programming language, the CDK nevertheless tries to provide a small set of basic nodes for simple, potentially useful concepts. The reason is twofold: on the one hand, the nodes provide built-in support for recurrent concepts; on the other hand, they are useful examples for extending the CDK framework.
Figure 2.9 presents the UML diagram for the CDK node hierarchy. The nodes are fairly general in nature: general concepts for unary and binary operators, as well as particularizations for commonly used arithmetic and logical operations. In addition, terminal nodes for storing primitive types are also provided: a template class for storing any atomic type (Simple) and its instantiations for integers, doubles, strings, and identifiers (a special case of string). Other special nodes are also provided: Data, for opaque data types; Sequence, for representing data collections organized as vectors; Composite, for organizing data collections as linearized trees; and Nil, for representing empty nodes (null object).
When developing a new compiler, the programmer has to provide new
concrete subclasses, according to the concepts to be supported. In the Compact
compiler, for instance, nodes were defined for concepts such as while loops,
if-then-else instructions, and so on. See chapters 8 and 11 for detailed informa-
tion.
After the parser has performed the syntactic analysis, we have a syntax tree
representing the structure of the input program. This structure is formed
by instances of the node set described in §2.3.5. Representing the program,
though, is simply a step towards computing its true meaning or semantics.
This is the evaluator’s task: to take the syntax tree and extract from it the se-
mantics corresponding to the concepts modeled by the input programming
language.
[Figure 2.9: CDK node hierarchy class diagram. Node is the base class (with _lineno, lineno(), and accept(sp : SemanticProcessor, level : int)); UnaryOperator (with _argument) and BinaryOperator (with _left and _right) generalize operators such as NEG, ADD, SUB, NE, LT, and LE; Simple (with _value : StoredType) is the template terminal node, with Integer, Double, String, and Identifier among its instantiations; Data, Sequence, Composite, and Nil are the special nodes.]
The CDK provides an abstract evaluator class for defining the interface its subclasses must implement. The compiler class, when asked to evaluate the program, creates a concrete evaluator for the selected target (see figure 2.7). The programmer must provide a concrete subclass for each new compiler/language/target. Each such class will automatically register its instance with the superclass' registry for a given target. In the Compact compiler, four concrete subclasses are provided: for generating XML trees (XMLevaluator); for generating assembly code (PFevaluator); for generating C code (Cevaluator); and for interpreting the program, i.e., directly executing the program from the syntax tree (InterpretationEvaluator).
To do its task, the evaluator needs two entities: the syntax tree to be anal-
ysed and the code for deciding on the meaning of each node or set of nodes.
It would be possible to write this code as a set of classes, global functions, or
even as methods in the node classes. Nevertheless, all these solutions present
disadvantages: using multiple classes or multiple functions would mean that
some kind of selection would have to be made in the evaluation code, making it
more complex than necessary; using methods in the node classes would solve
the former problem, but would make it difficult, or even impossible, to reuse
the node classes for multiple purposes (such as generating code for different
targets).
The selected solution is to use the Visitor design pattern (described
in §2.3.7).
class SemanticProcessor {
  std::ostream &_os; // output stream

protected:
  SemanticProcessor(std::ostream &os = std::cout) : _os(os) {}
  inline std::ostream &os() { return _os; }

public:
  virtual ~SemanticProcessor() {}
};
Figure 2.10: Partial specification of the abstract semantic processor. Note that
the interface cannot be defined independently from the node set used by a
specific compiler.
Currently, the CDK provides an abstract stack machine for generating the final
code. Visitors for final code production will call the stack machine’s pseudo-
instructions while performing the evaluation of the syntax tree. The pseudo-
instructions will produce the final machine code. The stack machine is encap-
sulated by the Postfix abstract class. Figure 2.11 shows the sequence diagram
16 CHAPTER 2. USING C++ AND THE CDK LIBRARY
for the syntax tree evaluation process: this includes tree traversal and includ-
ing code generation.
[Figure 2.11: CDK library's sequence diagram for syntax evaluation: the compiler's evaluate call (step 1) is handled by a concrete evaluator (here, compact::PFEvaluator), which traverses the syntax tree with the help of a symbol table (symtab : SymbolTable).]
The main function is where every part, whether defined in the CDK or redefined/implemented in each compiler, comes together (see figure 2.2): the first step is to determine which language the compiler is for (step 1) and to get the corresponding factory, to create all the necessary objects (step 2). Steps 3 through 7 correspond to parsing the input to produce final code (currently, assembly, by default).
Note that the main function is already part of the CDK and there is, thus,
no need for the programmer to provide it.
The code corresponding to the above diagram (simplified version) is de-
picted in figure 2.12.
int main(int argc, char *argv[]) {
  // ... process command line options (language, input/output files)...

  cdk::CompilerFactory *fact =
      cdk::CompilerFactory::getImplementation(language.c_str());
  if (!fact) {
    // fatal error: no factory available for language
    return 1; // failure
  }

  cdk::Compiler *compiler = fact->createCompiler(ifile.c_str());
  if (compiler->parse() != 0) {
    // ... report syntax errors...
    return 1; // failure
  }

  if (!compiler->evaluate()) {
    // ... report semantic errors...
    return 1; // failure
  }
  return 0; // success
}
Figure 2.12: CDK library’s main function code sample corresponding to the
sequence diagram above.
2.4 Summary
In this chapter we presented the CDK library and how it can be used to build
compilers for specific languages. The CDK contains classes for representing
all concepts involved in a simple compiler: the scanner or lexical analyser, the
parser, and the semantic processor (including code generation and interpreta-
tion). Besides the simple classes, the library also includes factories for abstract-
ing compiler creation as well as creation of the corresponding evaluators for
specific targets. Evaluation is based on the Visitor design pattern: it allows for
specific functionality to be decoupled from the syntax tree, making it easy to
add or modify the functionality of the evaluation process.
The next chapters will cover some of the topics approached here concern-
ing lexical and syntactic analysis, as well as semantic processing and code
generation. The theoretical aspects, covered in chapters 3 and 6, will be sup-
plemented with support for specific tools, namely GNU Flex (chapter 4) and
Berkeley YACC (chapter 7). This does not mean that other similar tools cannot
be used: it means simply that the current implementation directly supports
those two. In addition to the description of each tool, the corresponding code
in use in the Compact compiler will also be presented, thus illustrating the full
process.
II Lexical Analysis

3 Theoretical Aspects of Lexical Analysis

3.1 What is Lexical Analysis?
Lexical analysis is the process of analysing the input text and recognizing ele-
ments of the language being processed. These elements are called tokens and
are associated with lexemes (bits of text associated with each token).
There are several ways of performing lexical analysis, one of the most common being finite-state approaches, i.e., those using a finite state machine to recognize valid language elements.
This chapter describes Thompson’s algorithm for building finite-state au-
tomata for recognizing/accepting regular expressions.
3.1.1 Language

Thompson devised the algorithm that carries his name; it describes how to build an acceptor for a given regular expression.
Created for Thompson’s implementation of the grep UNIX command, the
algorithm creates an NFA from a regular expression specification that can then
be converted into a DFA. It is this DFA that after minimization yields an au-
tomaton that is an acceptor for the original expression.
The following sections cover the algorithm's construction primitives and how to recognize a simple expression. Lexical analysis such as performed by flex is presented in §3.4. In this case, several expressions may be watched for, each one corresponding to a token. Such automata feature multiple final states, one or more for each recognized expression.
NFAs are not well suited for computers to work with, since each state may have multiple acceptable conditions for transitioning to another state. Thus, it is necessary to transform the automaton so that each state has a single transition for each possible condition. This process is called determinization. The algorithm for transforming an NFA into a DFA is a simple one and relies on two primitive functions, move and ε-closure.
The move function takes a set of NFA states and an input symbol and returns a set of NFA states: for each state and input symbol, it computes the set of reachable states. As an example, consider the NFA in figure 3.1.
The ε-closure function is defined for sets of states: the function computes the set of states reachable from the initial set by using only ε transitions (including each state itself), as well as the states reachable through ε transitions from those states. Thus, considering the NFA in figure 3.1, we could write:
Step 1: Compute the ε-closure of the NFA's start state. The resulting set will be the DFA's start state, I0. Add all pairs (I0, α) (∀α ∈ Σ, with Σ the input alphabet) to the agenda.
Step 2: For each unprocessed pair (In, α) in the agenda, remove it from the agenda and compute ε-closure(move(In, α)): if the resulting configuration, In+1, is not a known one (i.e., it is different from all Ik, k < n + 1), add the corresponding pairs to the agenda.
Step 3: Repeat step 2 until the agenda is empty.
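The three steps translate almost directly into code. The sketch below assumes an NFA represented as a plain transition table, with ε encoded as a reserved symbol; none of these names come from the CDK or from flex, and move is spelled moveStates to avoid clashing with std::move.

#include <algorithm>
#include <map>
#include <queue>
#include <set>
#include <vector>

typedef int State;
typedef char Symbol;
const Symbol EPSILON = '\0'; // reserved symbol standing for ε transitions

// NFA transition table: (state, symbol) -> set of target states.
typedef std::map<std::pair<State, Symbol>, std::set<State> > NFA;

// move(S, a): states reachable from any state in S by one a-transition.
std::set<State> moveStates(const NFA &nfa, const std::set<State> &S, Symbol a) {
  std::set<State> r;
  for (std::set<State>::const_iterator s = S.begin(); s != S.end(); ++s) {
    NFA::const_iterator t = nfa.find(std::make_pair(*s, a));
    if (t != nfa.end()) r.insert(t->second.begin(), t->second.end());
  }
  return r;
}

// ε-closure(S): S plus every state reachable using only ε transitions.
std::set<State> epsilonClosure(const NFA &nfa, std::set<State> S) {
  std::queue<State> work;
  for (std::set<State>::const_iterator s = S.begin(); s != S.end(); ++s)
    work.push(*s);
  while (!work.empty()) {
    State s = work.front(); work.pop();
    NFA::const_iterator t = nfa.find(std::make_pair(s, EPSILON));
    if (t == nfa.end()) continue;
    for (std::set<State>::const_iterator u = t->second.begin();
         u != t->second.end(); ++u)
      if (S.insert(*u).second) work.push(*u); // newly added: follow its ε arcs
  }
  return S;
}

// The agenda loop (steps 1-3): each DFA state Ii is a set of NFA states.
std::vector<std::set<State> > determinize(const NFA &nfa, State start,
                                          const std::set<Symbol> &sigma) {
  std::vector<std::set<State> > dfa;
  std::set<State> s0;
  s0.insert(start);
  dfa.push_back(epsilonClosure(nfa, s0)); // I0
  for (std::size_t i = 0; i < dfa.size(); i++) // dfa grows while we scan it
    for (std::set<Symbol>::const_iterator a = sigma.begin();
         a != sigma.end(); ++a) {
      std::set<State> next = epsilonClosure(nfa, moveStates(nfa, dfa[i], *a));
      if (!next.empty() && std::find(dfa.begin(), dfa.end(), next) == dfa.end())
        dfa.push_back(next); // a configuration not seen before
      // (a full implementation would also record the transition Ii -> next)
    }
  return dfa;
}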
The algorithm's steps can be tabled (see fig. 3.2): Σ = {a, b, c} is the input alphabet; α ∈ Σ is an input symbol; and In+1 = ε-closure(move(In, α)).
Figure 3.3 presents a graph representation of the DFA computed in accor-
dance with the determinization algorithm. The numbers correspond to DFA
states whose NFA state configurations are presented in figure 3.2.
The compaction process is simply a way of eliminating DFA states that are un-
necessary. This may happen because one or more states are indistinguishable
from each other, given the input symbols.
Figure 3.3: DFA graph for a(a|b)*|c: full configuration and simplified view (right).
A simple algorithm consists of starting with a set containing all states and progressively dividing it according to various criteria: final states and non-final states are fundamentally different, so the corresponding sets must be disjoint; states in a set that have transitions to different sets, when considering the same input symbol, are also different; states that have transitions on a given input symbol are also different from states that do not have those transitions. The algorithm must be applied until no further divisions are possible.
Regarding the above example, we would have the following sets:
• All states: A = {1, 2, 3, 4, 5}; separating final and non-final states we get
• Final states, F = {2, 3, 4, 5}; and non-final states, N F = {1};
• Considering a and F : 2, 4, and 5 present similar behavior (all have tran-
sitions ending in states in the same set, i.e., 4); 3 presents a different be-
havior (i.e., no a transition). Thus, we get two new sets: {2, 4, 5} and
{3};
• Considering b and {2, 4, 5} we reach a conclusion similar to the one for a,
i.e., all states have transitions to state 5 and cannot, thus, be distinguished
from each other;
• Since {2, 4, 5} has no c transitions, it remains as is. Since all other sets are singletons, the minimization process stops.
Figure 3.4 presents the process of minimizing the DFA (the starting point
is the one in figure 3.2), in the form of a minimization tree.
Figure 3.4: Minimal DFA graph for a(a|b)*|c: original DFA, minimized DFA, and minimization tree.
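The splitting procedure above is a plain partition refinement and can be sketched as follows (the representation is assumed for illustration and is not CDK code): each pass splits any set whose members reach different sets of the current partition on some symbol, and the loop stops when a full pass causes no split.

#include <map>
#include <set>
#include <vector>

typedef int State;
typedef char Symbol;
typedef std::map<std::pair<State, Symbol>, State> DFA; // partial function

// Index of the partition set containing s; -1 stands for "no transition".
static int setOf(const std::vector<std::set<State> > &part, State s) {
  for (std::size_t i = 0; i < part.size(); i++)
    if (part[i].count(s)) return (int)i;
  return -1;
}

std::vector<std::set<State> > minimize(const DFA &dfa,
                                       const std::set<State> &finals,
                                       const std::set<State> &nonFinals,
                                       const std::set<Symbol> &sigma) {
  std::vector<std::set<State> > part;
  if (!finals.empty()) part.push_back(finals);       // final and non-final
  if (!nonFinals.empty()) part.push_back(nonFinals); // sets start disjoint
  for (bool changed = true; changed; ) {
    changed = false;
    for (std::size_t i = 0; i < part.size() && !changed; i++)
      for (std::set<Symbol>::const_iterator a = sigma.begin();
           a != sigma.end() && !changed; ++a) {
        // group the members of part[i] by the set their a-transition reaches
        std::map<int, std::set<State> > buckets;
        for (std::set<State>::const_iterator s = part[i].begin();
             s != part[i].end(); ++s) {
          DFA::const_iterator t = dfa.find(std::make_pair(*s, *a));
          buckets[t == dfa.end() ? -1 : setOf(part, t->second)].insert(*s);
        }
        if (buckets.size() > 1) { // members disagree: split this set
          part.erase(part.begin() + i);
          for (std::map<int, std::set<State> >::iterator b = buckets.begin();
               b != buckets.end(); ++b)
            part.push_back(b->second);
          changed = true; // restart with the refined partition
        }
      }
  }
  return part; // each surviving set is one state of the minimal DFA
}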
3.3 Analysing an Input String

If the input is empty and the state is not final, then there was an error and the string is said to be rejected. If there is no possible transition for a given state and the current input symbol, then processing fails and the string is also rejected.
3.4 Building Lexical Analysers

The following example illustrates the construction process for a lexical analyser that identifies three expressions: G = {a*|b, a|b*, a*}. Thus, the recognized tokens are TOK1 = a*|b, TOK2 = a|b*, and TOK3 = a*. Note that the construction process handles ambiguity by selecting the token that consumes the most input characters and, if two or more tokens match, by selecting the first. It may happen that the lexical analyser never signals one of the expressions: in actual situations, this may be undesirable, but may be unavoidable. For instance, when recognizing identifiers and keywords, care must be exercised so as not to select an identifier when a keyword is desired.
As figure 3.6 clearly illustrates, all DFA states are final: each of them contains,
at least, one final NFA state. When several final NFA states are present, the first
is the one considered. In this way, we are able to select the first expression in
the list, when multiple matches would be possible. Note also that the third ex-
pression is never matched. This expression corresponds to state 20 in the NFA:
in the DFA this state never occurs by itself, meaning that the first expression is
always preferred (as expected).
The minimization process is as before, but now we have to take into account that states may differ only with respect to the expression they recognize. Thus, after splitting state sets into final and non-final, the set of final states should be
Figure 3.5: NFA for a lexical analyser for G = {a*|b, a|b*, a*}.
split according to the recognized expression. From this point on, the procedure
is as before.
Figure 3.7: DFA for a lexical analyser for G = {a*|b, a|b*, a*}: original (top left), minimized (bottom left), and minimization tree (right). Note that states 2 and 4 cannot be merged since they recognize different tokens.
Figure 3.8 shows the process of analysing the input string aababb. As can be
seen from the table, several tokens are recognized and, for each one, the anal-
yser returns to the initial state to process the remainder of the input.
In Input In+1
0 aababb$ 13
13 ababb$ 13
13 babb$ TOK1
0 babb$ 2
2 abb$ TOK1
0 abb$ 13
13 bb$ TOK1
0 bb$ 2
2 b$ 4
4 $ TOK2
Figure 3.8: Processing an input string and token identification. The input string aababb is split into aa (TOK1), b (TOK1), a (TOK1), and bb (TOK2).
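The loop just illustrated – advance while transitions exist, remember the last final state seen, emit its token, and return to the initial state – can be sketched as below (the table representation is assumed; this is not how flex actually implements it):

#include <map>
#include <string>
#include <vector>

typedef int State;
typedef std::map<std::pair<State, char>, State> DFA;

// finals maps each final DFA state to the token it signals (TOK1, TOK2, ...).
std::vector<int> tokenize(const DFA &dfa, const std::map<State, int> &finals,
                          const std::string &input) {
  std::vector<int> tokens;
  std::size_t pos = 0;
  while (pos < input.size()) {
    State s = 0;                 // back to the initial state
    std::size_t lastEnd = pos;   // end of the longest match so far
    int lastToken = -1;
    for (std::size_t i = pos; i < input.size(); i++) {
      DFA::const_iterator t = dfa.find(std::make_pair(s, input[i]));
      if (t == dfa.end()) break; // no transition: back up to the last match
      s = t->second;
      std::map<State, int>::const_iterator f = finals.find(s);
      if (f != finals.end()) { lastEnd = i + 1; lastToken = f->second; }
    }
    if (lastToken < 0) break;    // rejected: nothing matches at this position
    tokens.push_back(lastToken); // e.g., TOK1 for the leading aa above
    pos = lastEnd;               // continue after the consumed lexeme
  }
  return tokens;
}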
3.5 Summary
4 The GNU Flex Lexical Analyser

4.1 Introduction

4.3 Summary
5 Lexical Analysis Case
This chapter describes the application of the lexical processing theory and tools
to our test case, the Compact programming language.
5.3 Summary
III Syntactic Analysis

6 Theoretical Aspects of Syntax
Syntactic analysis can be carried out by a variety of methods, depending on the desired result and the type of grammar and analysis.
To do: discuss the different methods.
6.2 Grammars
6.2.1 Formal definition
In the current chapter, we will use a simple grammar for illustrating our expla-
nations with clear examples. We will choose a simple grammar but one which
will allow us to exercise a wide range of processing decisions. This grammar
is presented in 6.1: E (the start symbol) and F are non-terminals and id is a
terminal (token) that represents arbitrary identifiers (variables).
E → E + T | T
T → T ∗ F | F      (6.1)
F → (E) | id
In addition to the non-terminals and the token described above, four other
tokens, whose value is also the same as their corresponding lexemes, exist: (,
), +, and *.
6.3 LR Parsers
As mentioned in the introduction, this chapter is centered around the LR family
of parsers: these are bottom-up parsers that shift items from the input to the
stack and reduce symbols on the stack in accordance with available grammar
rules.
An LR parser has the following structure (Aho et al., 1986, fig. 4.29):
[Figure: LR parser model (input buffer, stack, and parse table with actions and gotos).]
The stack starts with no symbols, containing only the initial state (typically 0). Parsing consists of pushing new symbols and states onto the stack (shift operation) or removing groups of symbols from the stack – corresponding to the right hand side of a production – and pushing back the left hand side of that production (reduce operation). Parsing ends when the end-of-phrase symbol is seen in the appropriate state (see also §6.4.4).
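The loop just described can be written as a small table driver. The sketch below uses assumed encodings for actions and rules; it is an illustration of the algorithm, not the code produced by YACC.

#include <map>
#include <vector>

enum Kind { SHIFT, REDUCE, ACCEPT };     // action encodings (assumed)
struct Action { Kind kind; int arg; };   // arg: target state or rule number
struct Rule { int lhs; int rhsLength; }; // A -> alpha, with |alpha| symbols

typedef std::map<std::pair<int, int>, Action> Actions; // (state, terminal)
typedef std::map<std::pair<int, int>, int> Gotos;      // (state, non-terminal)

// input must end with the end-of-phrase symbol ($ in the text).
bool parse(const Actions &action, const Gotos &gotos,
           const std::vector<Rule> &rules, const std::vector<int> &input) {
  std::vector<int> stack(1, 0); // starts with the initial state only
  std::size_t pos = 0;
  for (;;) {
    Actions::const_iterator a =
        action.find(std::make_pair(stack.back(), input[pos]));
    if (a == action.end()) return false;   // empty cell: parse error
    if (a->second.kind == SHIFT) {         // consume input, push the new state
      stack.push_back(a->second.arg);
      pos++;
    } else if (a->second.kind == REDUCE) { // pop |alpha|, push goto(_, A)
      const Rule &r = rules[a->second.arg];
      stack.resize(stack.size() - r.rhsLength);
      Gotos::const_iterator g =
          gotos.find(std::make_pair(stack.back(), r.lhs));
      if (g == gotos.end()) return false;
      stack.push_back(g->second);
    } else {
      return true; // ACCEPT: end of phrase seen in the appropriate state
    }
  }
}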
The parse table is built according to the following algorithm: it takes as
input an augmented grammar (§6.3.1.1), and produces as output the parser
table (actions and gotos).
built (although the effort of computing the LR(1) items would be partially wasted).
2. Each state i is built from the DFA’s Ii state. Actions in state i are built
according to the following method (with terminal a):
a) If [A → α • aβ] is in Ii and goto(Ii, a) = Ij, then action[i, a] = shift j;
b) If [A → α•] is in Ii, then action[i, a] = reduce A → α, for all a ∈ FOLLOW(A) (with A distinct from S′);
c) If [S′ → S•] is in Ii, then action[i, $] = accept (with $ the end-of-phrase symbol).
3. Gotos to state i (with non-terminal A): if goto(Ii , A) = Ij , then goto[i, A] =
j.
4. The parser table cells not filled by the second and third steps correspond
to parse errors;
5. The parser's initial state corresponds to the set of items containing item [S′ → •S] (§6.3.1.4).
Informally, an LR(0) item is a “dotted rule”, i.e., a grammar rule and a dot
indicating which parts of the rule have been seen/recognized/accepted so far.
As an example, consider rule 6.2: the LR(0) items for this rule are presented
in 6.3.
E → ABC (6.2)
E → •ABC E → A • BC E → AB • C E → ABC• (6.3)
Only one LR(0) item (6.5) exists for empty rules (6.4).
E → ε (6.4)
E → • (6.5)
If S is the start symbol for a given grammar, then the corresponding augmented
grammar is defined by adding an additional rule S ′ → S and defining S ′ as
40 CHAPTER 6. THEORETICAL ASPECTS OF SYNTAX
the new start symbol. The idea behind the concept of augmented grammar
is to make it simple to decide when to stop processing input data. With the
extra production added to the augmented grammar, it is a simple matter of
keeping track of the reduction of the old start symbol when processing the
new production. Thus, and now in terms of LR(0) items, the entire processing
would correspond to navigating through the automaton, starting in the state
with [S ′ → •S] and ending at [S ′ → S•], with no input left to be processed.
YACC parser generators stop processing when the start symbol of the orig-
inal grammar is reduced. Bison, however, introduces a new transition to an ex-
tra state (even when in “YACC compatibility mode”). This transition (the only
difference from a YACC parser) corresponds to processing the end-of-phrase
symbol (see below). The parser automata are otherwise identical.1
The augmented grammar corresponding to the grammar to be processed.
The augmented grammar for the grammar presented in 6.1 is shown in 6.6.
This grammar has a new start symbol: E ′ .
Example 1

E′ → E
E → E + T | T
T → T ∗ F | F      (6.6)
F → (E) | id
For building a parser for this grammar, the NFA automaton would start in
state [E ′ → •E] and end in state [E ′ → E•]. After determinizing the NFA, the
above items would be part of the DFA’s initial and final states.
So, all there is to do to build a parser is to compute all possible LR(0)
items: this is nothing more than considering all possible positions for a “dot”
in all the possible productions and the causes of transition from one item to
another: if a terminal is after the dot, then the transition is labeled with that
terminal; otherwise, a new ε transition to all LR(0) items that can generate (by
reduction) the non-terminal after the dot will be produced. By following the
procedure until no more transitions or items are added to the set, the parser’s
NFA is finished. The final state is the one containing the [E ′ → E•] item.
Determinization proceeds as in the case of the lexical automata. When implementing programs for generating these parsers, starting with the NFA may seem like a good idea (after all, the algorithms may be reused), but when building the state machines by hand, first building the NFA, and afterwards the DFA, is very time consuming and error prone. Fortunately, it is quite straightforward to build the DFA directly from the LR(0) items. We will see how in a moment: first we will introduce two concepts that will help us do it.
¹ In fact, Bison does a better job at code generation for supporting the parser. In this document,
Definition 1 Closure.
Let I be a set of items. Then closure(I) is the smallest set of items such that: (i) every item in I is in closure(I); and (ii) if [A → α • Bβ] is in closure(I) and B → γ is a production, then [B → •γ] is in closure(I).

Example 2

I = {[E′ → •E]}
closure(I) = {
[E′ → •E], [E → •E + T], [E → •T], [T → •T ∗ F],      (6.7)
[T → •F], [F → •(E)], [F → •id]
}
Given a grammar symbol and a set of items, the goto function computes the
closure of the set of all possible transitions on the selected symbol (see defini-
tion 2).
Example 3

I = {[E′ → E•], [E → E • +T]}
goto(I, +) = {[E → E + •T], [T → •T ∗ F], [T → •F], [F → •(E)], [F → •id]}      (6.9)
Note that, in 6.9, only the second item in I contributes to the set formed by goto(I, +). This is because no other item has + in any viable prefix.
Now that we have defined the closure and goto functions, we can build the set of states C of the parser's DFA automaton (see def. 3). Initially, C contains only the initial state: C = {closure({[S′ → •S]})}. Then, for each I in C and for each grammar symbol X, add goto(I, X), if not empty, to C. Repeat until no changes occur to the set.
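Definitions 1 and 2 translate into code directly. The sketch below uses an assumed grammar representation (productions over integer-coded symbols) and belongs to no parser generator; it merely mirrors the two definitions.

#include <set>
#include <vector>

struct Production { int lhs; std::vector<int> rhs; }; // A -> X1 ... Xn

struct Item { // an LR(0) item: a production index plus a dot position
  int prod, dot;
  bool operator<(const Item &o) const {
    return prod < o.prod || (prod == o.prod && dot < o.dot);
  }
};

bool isNonTerminal(int symbol) { return symbol >= 0; } // assumed convention

// closure(I): repeatedly add [B -> .gamma] for each non-terminal B after a dot.
std::set<Item> closure(const std::vector<Production> &g, std::set<Item> I) {
  for (bool changed = true; changed; ) {
    changed = false;
    for (std::set<Item>::const_iterator it = I.begin(); it != I.end(); ++it) {
      const Production &p = g[it->prod];
      if (it->dot == (int)p.rhs.size()) continue; // dot at the end
      int B = p.rhs[it->dot];
      if (!isNonTerminal(B)) continue;            // terminal after the dot
      for (std::size_t k = 0; k < g.size(); k++)
        if (g[k].lhs == B) {
          Item ni = { (int)k, 0 };                 // the item [B -> .gamma]
          if (I.insert(ni).second) changed = true; // set iterators stay valid
        }
    }
  }
  return I;
}

// goto(I, X): advance the dot over X wherever possible, then take the closure.
std::set<Item> gotoSet(const std::vector<Production> &g,
                       const std::set<Item> &I, int X) {
  std::set<Item> J;
  for (std::set<Item>::const_iterator it = I.begin(); it != I.end(); ++it) {
    const Production &p = g[it->prod];
    if (it->dot < (int)p.rhs.size() && p.rhs[it->dot] == X) {
      Item ni = { it->prod, it->dot + 1 };
      J.insert(ni);
    }
  }
  return closure(g, J);
}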
Let us now consider our example and build the sets corresponding to the DFA for the parser: as defined in definition 3, we will build C = {I0, ..., In}, the set of DFA states. Each DFA state will contain a set of items, as defined by the closure and goto functions. The first state in C corresponds to the DFA's initial state, I0:
I0 = closure({[E′ → •E]})
I0 = {
[E′ → •E], [E → •E + T], [E → •T], [T → •T ∗ F],      (6.11)
[T → •F], [F → •(E)], [F → •id]
}
After computing I0 , the next step is to compute goto(I0 , X) for all symbols
X for which there will be viable prefixes: by inspection, we can see that these
symbols are E, T , F , (, and id. Each set resulting from the goto function will be
a new DFA state. We will consider them in the following order (but, of course,
this is arbitrary):
The next step is to compute the goto functions for each of the new states
I1 through I5 . For instance, from I1 , only one new state is defined:
Computing these states is left as an exercise for the reader (tip: use a
graphical approach). At first glance, it would seem that more states would be
produced when computing the goto functions. This does not necessarily hap-
pen because some of the transitions lead to states already seen, i.e., the DFA is
not acyclic. Figure 6.2 presents a graphical DFA representation: each state lists
its LR(0) items (i.e., NFA states).
If you look carefully at the diagram in figure 6.2, you will notice that in
each state some of the items have been “propagated” from other states and
others have yet to be processed (having been derived from the former). This
implies that what really characterizes each state are its “propagated” items.
These are called nuclear items and contain the actual information about the
state of processing by the parser.
A parse table defines how the parser behaves in the presence of input and
symbols at the top of the stack. The parse table decides the action and state of
the parser.
A parse table has two main zones: the left one, for dealing with terminal symbols; and the right one, for handling non-terminal symbols at the top of the stack. Figure 6.3 shows an example.
Figure 6.2: Graphical representation of the DFA showing each state’s item set.
Reduces are possible in states I1 , I2 , I3 , I5 , I9 , I10 , and I11 : it will depend on
the actual parser whether reduces actually occur.
Figure 6.3: Example of a parser table. Note the column for the end-of-phrase
symbol.
Note that if there are any conflicts in the parse table, it should only be
compressed after the conflicts have been eliminated.
6.6 Summary
7 Using Berkeley YACC
In the previous chapter we looked at various grammar types and suitable parsing algorithms. The last chapter also dealt with some semantic aspects, such as attributive grammars. While semantic aspects still have to be presented, some of them will be presented here, since they are important in syntax specifications.
In this chapter we consider grammar definitions for automatic parser generation. The parsers we consider here are LALR(1). Several tools are available for creating this type of parser from grammar definitions, e.g. the one presented in this chapter: Berkeley YACC.
Parser generator tools do not limit themselves to the grammar itself: they also consider semantic aspects (as mentioned above). It is in this semantic part that the syntax tree is built. It is also here that the particular programming language used for encoding meaning becomes relevant.
GNU bison is widely used and generates powerful parsers. In this document
we use byacc due to its simplicity. In future versions, we may yet consider
using bison.
Although other parser generators could have been considered, we chose the above for their simplicity and ease of use. Furthermore, since they were all initially developed for using C as the underlying programming language, it was a small step to make them work with C++. Note that we are not referring to the usual rhetoric that C++ is just a better C: we are interested in that aspect, but we are not limited to it. In particular, the use of C++ is a bridge for using sophisticated object- and pattern-oriented programming with these tools. As with the lexical analysis step, the current chapter presents the use of the CDK library for elegantly dealing with aspects of syntactic processing in a compiler.
Other parser generator tools for C++
Comparisons and why byacc...
Figure 7.1: General structure of a grammar definition file for a YACC-like tool.
The first part usually contains local (static) variables, include directives, and other non-function objects. Note that the first part may contain C code blocks. Thus, although functions usually are not present, this does not mean that they cannot be defined there. Having said this, we will assume that all functions are outside the grammar definition file and that only their declarations are present, and only when actually needed. Other than C code blocks, the first part contains miscellaneous definitions used throughout the grammar file.
The third part is another code section, one that usually contains only functions. Since we assume that all functions are defined outside the grammar definition file, the third part will be empty.
The second part is the most important: this is the rules section and it will
deserve the bulk of our attention, since it is the one that varies most from lan-
guage to language.
External definitions come between %{ and %} (these special braces must appear
at the first column). Several of these blocks may be present. They will be copied
verbatim to the output parser file.
Internal definitions use special syntax and cover specific aspects of the grammar's definition: (i) types for terminal (token) and non-terminal grammar symbols; (ii) definitions of terminals and non-terminals; and (iii) optionally, symbol precedences and associativities.
YACC grammars are attributive grammars and, thus, specific values can be associated with each symbol. Terminal symbols (tokens) can also have attributes: these are set by the lexical analyser and passed to the parser. These values are also known as lexemes. Values for non-terminal symbols are computed whenever a semantic rule (a code block embedded in a syntactic rule) is reduced. Regardless of their origin, i.e., lexical analyser or parser reductions, all values must have a type. These types are defined using a union (à la C), in which various types and variables are specified. Since this union will be realized as an actual C union, the corresponding restrictions apply: one of the most stringent is that no structured types may be used. Only atomic types (including pointers) are allowed.
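The restriction can be seen directly in C++ terms: under the C++03 rules that apply to the C code generated by byacc, a union member may not have a constructor or destructor, which is why structured values are always passed through pointers. The union below is purely illustrative and is not generated by any tool.

#include <string>

union IllustrativeYYSTYPE {
  int i;              // atomic type: allowed
  double d;           // atomic type: allowed
  std::string *s;     // pointer to a structured type: allowed
  // std::string str; // rejected: members with constructors are not allowed
};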
Types, variables, and the grammar symbols are associated through the
specific YACC directives %token (for terminals) and %type (for non-terminals).
%{
#include <some-decl-header-file.h>
#include "mine/other-decl.h"
%}
Figure 7.2: Various code blocks like the one shown here may be defined in the
definitions part of a grammar file: they are copied verbatim to the output file
in the order they appear.
%union {
char *s; // string values
int i; // integer values
double d; // floating point numbers
SomeComplexType *complex; // what this...?
SomethingElse *other; // hmmm...
}
Figure 7.3: The %union directive defines types for both terminal and non-
terminal symbols.
The token declarations also declare constants for use in the program (as C macros): as such, these constants are also available on the lexical analyser side. Figure 7.4 illustrates these declarations. Regarding constant declarations, STRING and ID are declared as character strings, i.e., they use the union's s entry as their attribute and, likewise, NUMBER and INTEGER are declared as integers, i.e., they use the union's i entry. Regarding non-terminal symbols, they are declared as complex and other, two structured types.
%token<s> STRING ID
%token<i> NUMBER
%type<complex> mynode anothernode
%type<other> strangestuff
² By default, this file is called y.tab.h, but can be renamed without harm. In fact, some YACC
Figure 7.5: YACC-generated parser C/C++ header file: note especially the spe-
cial type for symbol values, YYSTYPE, and the automatic declaration of the
global variable yylval. The code shown in the figure corresponds to actual
YACC output.
%prec can be used to make the precedence of a rule be the same as that of a
given token (see figure ??). The first rule of figure ?? states that, if in doubt, a
unary minus interpretation should be preferred to a binary interpretation (we
are assuming that precedences have been defined as in figure 7.6).
This section contains the grammar rules and corresponding semantic code sec-
tions. This is the section that will give rise to the parser’s main function:
yyparse.
In the LALR(1) parser algorithm we discussed before, the parser carries out two types of action: shift and reduce. Shifts correspond to input consumption, i.e., the parser reads more data from the input; reductions correspond to recognition of a particular pattern through a bottom-up process, i.e., the parser assembles a group of items and is able to tell that they are equivalent to a rule's left side (hence, the term reduce).
Regarding reductions: if a rule's head (left hand part) does not appear in any other rule's right hand side, and it is not the start symbol, then that rule will never be reduced. If a rule is never reduced, then the corresponding semantic block will never be executed.
A rule is composed of a head (the left hand side), corresponding to the construction to be recognized, and a right hand side corresponding to the items to be matched against the input before reducing the left hand side.
Reductions may be performed from different right hand side hypotheses, each corresponding to an alternative in the target language. In this case, each hypothesis is considered independent from the others and each will have its own semantic block. Figure 7.7 shows different types of rules: statement has several hypotheses, as does list. Note that list is recursive (a somewhat recurring construction).
The simple grammar in figure 7.7 uses common conventions for naming
the symbols: uppercase symbols represent terminals and lowercase ones non-
terminals. As any other convention, this one is mostly for human consumption
and useless for the parser generator. The grammar shown recognizes several
statements and also lists of statements. “Programs” recognized by the gram-
mar will consist of print statements and assignments to variables. An example
program is shown in figure 7.8.
Figure 7.7: Examples of rules and corresponding semantic blocks. The first
part of the figure shows a collection of statements; the second part shows an
example of recursive definition of a rule. Note that, contrary to what happens
in LL(1) grammars, there is no problem with left recursion in LALR(1) parsers.
print 1;
i = 34;
s = "this is a long string in a variable";
print s;
print i;
print "this string is printed directly, without a variable";
The top grammar symbol is, by default, the symbol on the left hand side of the first rule in the rules section. If this is not the desired behaviour, the programmer is able to select an arbitrary one using the directive %start. This directive should appear in the definitions section (first part). Figure 7.9 shows two scenarios: the first uses x as the top symbol (it is the head of the first rule); the second uses y as the top symbol (it has been explicitly selected).
First grammar (default start symbol):

%{
/* miscellaneous code */
%}
%union { /* symbol types */ }
%%
x : a b y | a ;
a : ’a’ ;
y : b a x | b ;
b : ’b’ ;
%%
/* miscellaneous code */

Second grammar (explicit start symbol):

%{
/* miscellaneous code */
%}
%union { /* symbol types */ }
%start y
%%
x : a b y | a ;
a : ’a’ ;
y : b a x | b ;
b : ’b’ ;
%%
/* miscellaneous code */
Figure 7.9: Start symbols and grammars. Assuming the two tokens ’a’ and
’b’, the same rules recognize different syntactic constructions depending on
the selection of the top symbol. Note that the non-terminal symbols a and b
are different from the tokens (terminals) ’a’ and ’b’.
The code part is useful mostly for writing functions dealing directly with the parser code or for exclusive use by the parser code. Of course, it is possible to write any code in this section. In fact, in some cases, the main function for tests is directly coded here.
In this document, however, we will not use this section for any code other than functions used only by the parser, i.e. static functions. The reason for this procedure is that it is best to organize the code in well-defined blocks, instead of grouping functions by their accidental proximity or joint use. This aspect assumes greater importance in object-oriented programming languages than in C (even though the remark is also valid for this language). Thus, we will try and write classes for all concepts we can identify and the parser will use objects created from those classes, instead of using private code.
7.4 Pitfalls
Syntactic analysis is like a minefield: on the surface it is innocently harmless, but a misstep may do great harm. This is true of syntactic rules and their associated meanings, i.e., how semantics is "grafted" onto syntactic tree nodes.
7.5 Summary
In this chapter we presented the YACC family of LALR(1) parser generators, covering:
• the syntax of a grammar definition, i.e., the structure of a YACC file;
• the handling of conflicts;
• common pitfalls.
8 Syntactic Analysis Case
This chapter describes the application of the syntactic processing theory and
tools to our test case, the compact programming language.
• A union for defining the types of terminal (token) and non-terminal gram-
mar symbols;
%union {
int i; /* integer value */
std::string *s; /* symbol name or string literal */
cdk::node::Node *node; /* node pointer */
ProgramNode *program; /* node pointer */
cdk::node::Sequence *sequence; /* sequence node pointer */
};
%nonassoc IFX
%nonassoc ELSE
%left CPT_GE CPT_LE CPT_EQ CPT_NE ’>’ ’<’
%left ’+’ ’-’
%left ’*’ ’/’ ’%’
%nonassoc UMINUS
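The declarations above define operator associativity and precedence (each
%left/%nonassoc line binds tighter than the ones before it). The IFX/ELSE pair
is the usual device for solving the dangling-else conflict: the else-less rule is
given lower precedence than ELSE, so the parser prefers to shift. A sketch of the
corresponding rule follows; tIF, instruction, and expression are assumed names,
not necessarily those of the actual grammar:

instruction : tIF '(' expression ')' instruction %prec IFX
            | tIF '(' expression ')' instruction ELSE instruction
            ;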
ProgramNode *syntax = 0; /* root of the syntax tree produced by the parser */
8.5 Summary
IV Semantic Analysis
9 The Syntax-Semantics Interface
The Visitor design pattern (Gamma et al., 1995) provides the appropriate
framework for processing syntactic trees. Using visitors, the programmer is
able to decouple the final code generation decisions from the objects that form
the syntactic description of a program.
In view of the above, it comes as no surprise that the bridge between syntax
and semantics is formed by the communication between visitors, which represent
semantic processing, and the node classes, which represent the syntactic structure.
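A minimal sketch of this double dispatch follows; the class and method names
are illustrative assumptions, not the CDK's actual interface:

class PrintNode; /* a hypothetical syntax node */

class Visitor { /* the semantic processing side */
public:
  virtual ~Visitor() {}
  virtual void processPrintNode(PrintNode *node) = 0;
};

class PrintNode { /* the syntax side */
public:
  /* double dispatch: the node selects the visitor method for its own type */
  void accept(Visitor *v) { v->processPrintNode(this); }
};

Each concrete visitor (a code generator, a type checker, etc.) implements the
process methods, while the node classes remain untouched.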
if (!evaluate()) {
std::cerr << "Semantic errors in " << g_ifile << std::endl;
return 1;
}
return 0;
}
Figure 9.1: Macro structure of the main function (only the final part is shown).
Note especially the syntactic and semantic processing phases (respectively,
yyparse and evaluate).
class SemanticProcessor {
//! The output stream
std::ostream &_os;
protected:
SemanticProcessor(std::ostream &os = std::cout) : _os(os) {}
inline std::ostream &os() { return _os; }
public:
virtual ~SemanticProcessor() {}
public:
// processing interface
};
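The processing interface, elided above, typically declares one virtual method
per node class. A hypothetical fragment, assuming the node classes sketched in
the previous section:

// processing interface: one method per syntax node class (sketch)
virtual void processPrintNode(PrintNode *node) = 0;
virtual void processAssignmentNode(AssignmentNode *node) = 0;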
9.4 Summary
10 Semantic Analysis and Code Generation
10.1 Introduction
10.2 Code Generation
10.3 Summary
11 Semantic Analysis Case
This chapter describes the application of the semantic processing theory and
tools to our test case, the compact programming language.
A.2.1 Interface
#ifndef ___parole_morphology___class_2176_drv_H__
#define ___parole_morphology___class_2176_drv_H__
#include <DTL.h>
#include <date_util.h>
#include <table.h> /* DTL header for macros */
#include <driver/driver.h> /* {lr:db} defs. */
#include <driver/Meta.h> /* access to metadata tables */
namespace cdk {
//... etc. etc. ...
}
#endif
A.2.2 Interface
#ifndef ___parole_morphology___class_2608_drv_H__
#define ___parole_morphology___class_2608_drv_H__
#include <DTL.h>
#include <date_util.h>
#include <table.h> /* DTL header for macros */
#include <driver/driver.h> /* {lr:db} defs. */
#include <driver/Meta.h> /* access to metadata tables */
namespace cdk {
namespace node {
//... etc. etc. ...
}
}
#endif
A.2.3 Interface
#ifndef ___parole_morphology___class_2621_drv_H__
#define ___parole_morphology___class_2621_drv_H__
#include <DTL.h>
#include <date_util.h>
#include <table.h> /* DTL header for macros */
#include <driver/driver.h> /* {lr:db} defs. */
#include <driver/Meta.h> /* access to metadata tables */
namespace parole {
namespace morphology {
}; // namespace morphology
}; // namespace parole
#endif
A.3.1 Capsule
#ifndef ___parole_morphology___class_2176_H__
#define ___parole_morphology___class_2176_H__
#include <driver/auto/dbdrv/parole/morphology/__class_2176_drv.h>
#endif
A.3.2 Capsule
#ifndef ___parole_morphology___class_2621_H__
#define ___parole_morphology___class_2621_H__
#include <driver/auto/dbdrv/parole/morphology/__class_2608.h>
#include <driver/auto/dbdrv/parole/morphology/__class_2621_drv.h>
#endif
#define ___parole_morphology_MorphologicalUnitSimple_CPP__
#include <parole/morphology/MorphologicalUnitSimple.h>
#undef ___parole_morphology_MorphologicalUnitSimple_CPP__
B Postfix Code Generator
This chapter documents the reimplementation of the postfix code generation
engine. The original was created by Santos (2004); it was composed of a set of
macros to be used with printf functions. Each macro would “take” as argument
either a number or a string.
The postfix code generator class maintains the same abstraction, but does
not rely on macros. Instead, it defines an interface to be used by semantic
analysers, as defined by a strategy pattern (Gamma et al., 1995). Specific im-
plementations will provide the realization of the postfix commands for a par-
ticular target machine.
In some of the following tables, the “Stack” column presents the actions
on the values at the top of the stack. Note that only elements relevant in a
given context, i.e., that of the postfix instruction being executed, are shown.
The notation #length represents a set of length consecutive bytes in the stack,
i.e., a vector. Consider the following example:
a #8 b : a b
The stack had at its top b, followed by eight bytes, followed by a. After
executing some postfix instruction using these elements, the stack has at its top
b, followed by a.
class Postfix {
protected:
std::ostream &_os;
inline Postfix(std::ostream &os) : _os(os) {}
inline std::ostream &os() { return _os; }
public:
virtual ~Postfix();
public: // miscellaneous
// rest of the class (mostly postfix instructions: see below)
};
Postfix instructions in the following tables have void return type unless
otherwise indicated.
The following instructions take one or two double precision floating point
values. The result is also a double precision floating point value.
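As an illustration, a visitor emitting code for a double-precision addition node
might proceed as follows; the node accessors, the _postfix member, and the
instruction name DADD are assumptions in the spirit of the interface, not
necessarily its actual names:

node->left()->accept(this);  /* leaves the left operand on the stack */
node->right()->accept(this); /* leaves the right operand on the stack */
_postfix->DADD();            /* replaces both values with their sum */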
Shift and rotation operations accept, as the maximum shift amount, the number
of bits of the underlying processor register (32 bits in an ix86-family processor).
Safe operation for larger values is not guaranteed.
These operations use two values from the stack: the value at the top spec-
ifies the number of bits to rotate/shift; the second from the top is the value to
be rotated/shifted, as specified by the following table.
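In the notation introduced earlier, a left shift instruction would transform the
stack as follows (the result notation v<<n is borrowed from C):

v n : v<<n

Here n, at the top of the stack, is the shift amount and v the value to be shifted;
both are consumed and replaced by the shifted value.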
The following operations perform logical operations on the elements at the
top of the stack. Arguments are taken from the stack and the result is put back
on the stack.
The comparison instructions are binary operations that leave at the top of the
stack 0 (zero) or 1 (one), depending on the result of the comparison:
respectively, false or true. The value may be directly used to perform condi-
tional jumps (e.g., JZ, JNZ), which use the value at the top of the stack instead
of relying on special processor registers (flags).
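As an illustration, this is how a comparison combines with a conditional jump
when generating code for an if statement; the label handling (lbl stands for a
freshly generated label string) and the method names are assumptions:

node->condition()->accept(this); /* leaves 0 or 1 at the top of the stack */
_postfix->JZ(lbl);               /* jump past the block when the value is 0 */
node->block()->accept(this);     /* code for the "then" block */
_postfix->LABEL(lbl);            /* target of the conditional jump */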
In a stack machine the arguments for a function call are already in the stack.
Thus, it is not necessary to put them there (it is enough not to remove them).
When building functions that conform to the C calling conventions, those
arguments are destroyed by the caller, after the return of the callee, using
TRASH, stating the total size (i.e., for all arguments). Regarding the callee, it
must create a distinct activation record (ENTER or START) and, when no longer
needed, destroy it (LEAVE). The latter action must be performed immediately
before returning control to the caller.
Similarly, to return values from a function, the callee must call POP to store
the return value in the accumulator register, so that it survives the destruction
of the invocation context. The caller must call PUSH, to put the accumulator
in the stack. An analogous procedure is valid for DPOP/DPUSH (for double
precision floating point return values).
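Putting the convention together, a caller/callee pair might be rendered as
follows; the sizes, the names, and the RET instruction are illustrative
assumptions:

/* callee */
_postfix->LABEL("_f");
_postfix->ENTER(8);    /* activation record with 8 bytes for locals */
/* ... body: compute the return value and leave it on the stack ... */
_postfix->POP();       /* move the return value to the accumulator */
_postfix->LEAVE();     /* destroy the activation record */
_postfix->RET();       /* return control to the caller */

/* caller: the arguments have already been evaluated onto the stack */
_postfix->CALL("_f");
_postfix->TRASH(4);    /* remove a single 4-byte argument */
_postfix->PUSH();      /* recover the return value from the accumulator */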
Note [*4*] that these operations (ADDR, LOCAL) put at the top of the stack
the symbol's address, independently of its origin. The address may later be
used as a pointer, either to fetch the value stored at that address (LOAD) or to
store a value at that address (STORE). However, in these last two situations,
given the frequency with which they occur and the number of clock cycles they
take to execute, it may be advantageous to replace them with the operations
described in [*10*].
“Quick opcodes” are shortcuts for groups of operations commonly used to-
gether. These opcodes may be made efficient by implementing them in dif-
ferent ways than the original set of high-level operations would suggest, i.e.,
the code generated by ADDRV may be more efficient than the code generated by
ADDR followed by LOAD. Nevertheless, the outcome is the same.
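That is, the two fragments below are expected to leave the stack in the same
state; the method names mirror the instruction names:

/* quick opcode: */
_postfix->ADDRV("x");

/* equivalent sequence: */
_postfix->ADDR("x");
_postfix->LOAD();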
The load instructions assume that the top of the stack contains an address
pointing to the data to be read. Each load instruction will replace the address
at the top of the stack with the contents of the position it points to. Load in-
structions differ only in what they load.
Store instructions assume the stack contains at the top the address where data
is to be stored. That data is in the stack, immediately after (below) the address.
Store instructions differ only in what they store.
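As an illustration, generating code for an assignment such as i = expression
could proceed as follows; the node accessors are assumptions, and STORE stands
for whichever store instruction matches the data size:

node->rvalue()->accept(this); /* the value to store ends up on the stack */
_postfix->ADDR("i");          /* then the destination address, at the top */
_postfix->STORE();            /* stores the value at the address */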
B.2.12.1 Segments
These instructions start the various segments. They do not affect the stack, nor
are they affected by its contents.
B.2.12.2 Values
These instructions declare values directly in segments. They do not affect the
stack, nor are they affected by its contents.
Note [*1*] that literal values, e.g., integers, may be used in their static form,
using memory space from a data segment (or the text segment, if the value is a
constant), via LIT. On the other hand, only integer literals and pointers can be
used in the instructions themselves as immediate values (INT, ADDR, etc.).
B.2.12.3 Labels
These instructions operate directly on symbols and their definition within some
segment. They do not affect the stack, nor are they affected by its contents.
B.3 Implementations
As should be expected, the classes described here provide concrete imple-
mentations for the abstract functions declared in the superclass. Although the
main objective is to produce the final (machine- and OS-specific) code, the gen-
erators are free to go about it as they see fit. In general, though, each instruction
of the stack machine (postfix) will produce a set of instructions belonging to the
target machine.
Two example generators are presented here and provided with the CDK
library: a NASM code generator (§B.3.1) and a debug-only generator (§B.3.2).
B.3.1 The NASM code generator
This code generator implements the postfix instructions, producing code to be
processed by NASM (NASM, n.d.), the Netwide Assembler. NASM is an
assembler for the x86 family of processors, designed for portability and modu-
larity. It supports a range of object file formats, including Linux a.out and ELF,
COFF, Microsoft 16-bit OBJ, and Win32. Its syntax is designed to be simple and
easy to understand, similar to Intel's but less complex.
The NASM code generator can be used in two basic modes: code-only, or
code and debug. The debug data provided here differs from that produced by
the debug-only generator (see §B.3.2) in that it describes groups of target
machine code using the names of postfix instructions as labels.
B.3.2 The debug-only generator
The debug-only generator does not produce executable code for any machine.
Instead, it provides a trace of the postfix instructions executed by the postfix
code generator associated with a particular syntax tree visitor. Needless to say,
although this code generator does not actually produce any code, it can be used
just like any other code generator.
B.4 Summary
Note that the code provided with the CDK library is written in standard C++
and will compile almost anywhere a C++ compiler is available. However, note
that while a working CDK is a guarantee for a working compiler, this does
not mean that the final program will run in that particular environment. For
final programs to work in a given environment, final code generators must be
provided for that environment. Consider the following example: the CDK and
the rest of the development tools exist in a Solaris environment running on a
SPARC machine. If we were to use the NASM generator in that environment,
it would work, i.e., it would produce the code it was supposed to produce, but
for an ix86-based machine. Further confusion could arise because NASM can
produce code for ix86-based machines on SPARC-based machines using the
same binary format: both Solaris and Linux – just to give an example – use the
ELF binary format.
C The Runtime Library
C.1 Introduction
The runtime support library is a set of functions for use by programs produced
by the compiler. The intent is to simplify code generation by providing library
functions for commonly needed operations, such as complex mathematical
functions, or input and output routines.
In principle, there are no restrictions regarding the programming style or
language, as long as the code is binary-compatible with the code produced by
the code generator used by the compiler. In the examples provided in this
document, and included in the CDK library, the function calling convention
adheres to the C calling convention. Thus, in principle, any C-based (or
compatible) library could be used. The library covers the following groups of
functions (a sketch of one such function follows the list):
• Input functions.
• Output functions.
• Floating-point functions.
• Operating system-related functions.
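As a sketch, and assuming the C calling convention mentioned above, an output
routine could be as simple as the following; the name prints is hypothetical:

#include <iostream>

/* callable from generated code: C linkage avoids C++ name mangling */
extern "C" void prints(const char *s) {
  std::cout << s;
}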
C.3 Summary
D Glossary
This chapter presents some of the terminology used in this dissertation. Some
of the terms presented result from translating terms used in the international
literature.
DOM Document Object Model (W3C, 2002). The Document Object Model is an
interface that is neutral with respect to particular platforms and languages.
This interface provides dynamic access to the content, structure, and style of
documents that conform to this standard.
Author Index
Aho, A. V., 3, 38, 95
Gamma, E., 5, 65, 77, 95
Helm, R., 95
ISO, 95
Johnson, R., 95
OMG, 95
Santos, P. R. dos, 77, 95
Sethi, R., 95
Ullman, J. D., 95
Vlissides, J., 95
W3C, 95
Index
GNU m4, see m4
ISO, 93
  8879, see SGML
  TC37, see TC37
Java Database Connectivity, see JDBC
lexical analysis, 21–33
m4, 93
Open Database Connectivity, see ODBC
semantic analysis, 65–71
SGML, 93
syntactic analysis, 37–61
XMI, 93
XML, 93
XSL, 93
XSLT, 93