Domain-Specific Program Generation
Editorial Board
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
University of Dortmund, Germany
Madhu Sudan
Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos
New York University, NY, USA
Doug Tygar
University of California, Berkeley, CA, USA
Moshe Y. Vardi
Rice University, Houston, TX, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany
Christian Lengauer · Don Batory · Charles Consel · Martin Odersky (Eds.)
Domain-Specific
Program Generation
International Seminar
Dagstuhl Castle, Germany, March 23-28, 2003
Revised Papers
Volume Editors
Christian Lengauer
Universität Passau, Fakultät für Mathematik und Informatik
94030 Passau, Germany
E-mail: [email protected]
Don Batory
The University of Texas at Austin, Department of Computer Sciences
Austin, TX 78712, USA
E-mail: [email protected]
Charles Consel
INRIA/LaBRI, ENSEIRB, 1, avenue du docteur Albert Schweitzer
Domaine universitaire, BP 99, 33402 Talence Cedex, France
E-mail: [email protected]
Martin Odersky
École Polytechnique Fédérale de Lausanne (EPFL)
1015 Lausanne, Switzerland
E-mail: martin.odersky@epfl.ch
ISSN 0302-9743
ISBN 3-540-22119-0 Springer-Verlag Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable to prosecution under the German Copyright Law.
Springer-Verlag is a part of Springer Science+Business Media
springeronline.com
© Springer-Verlag Berlin Heidelberg 2004
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Olgun Computergrafik
Printed on acid-free paper SPIN: 11011682 06/3142 543210
Preface
2. Consel makes the point that a domain is best defined as a set of existing
programs and sketches how one might derive a domain-specific language from
such a set, with which one can then specify other programs in this set.
3. Taha illustrates the technique of multi-stage programming on the example
of a staged interpreter.
4. Czarnecki et al. describe staged interpreters and templates as a suitable way
of extending a host language with domain-specific constructs. They evalu-
ate and compare three languages for template programming: MetaOCaml,
Template Haskell and C++.
5. One step beyond domain-specific program generation lies the goal of domain-
specific program optimization. Lengauer reviews different optimization tech-
niques in the domain of high-performance parallelism.
6. Smaragdakis offers a personal assessment of the approaches and attitudes in
the research community of generative programming.
2. Visser presents the Stratego language for the domain of rewriting program
transformations, and the corresponding toolset Stratego/XT.
3. Fischer and Visser work with AutoBayes, a fully automatic, schema-based
program synthesis system for applications in the analysis of statistical data.
It consists of a domain-specific schema library, implemented in Prolog. In
their contribution to this volume, they discuss the software engineering chal-
lenges in retro-fitting the system with a concrete, domain-specific syntax.
Domain-Specific Optimization. Finally, four contributions describe domain-
specific techniques of program optimization.
1. Kuchen works with a skeleton library for high-performance parallelism, sim-
ilar to the library DatTel of Bischof et al. However, Kuchen’s library is not
based on STL. Instead, it contains alternative C++ templates which are
higher-order, enable type polymorphism and allow for partial application.
In the second half of his treatise, Kuchen discusses ways of optimizing (i.e.,
“retuning”) sequences of calls of his skeletons.
2. Gorlatch addresses exactly the same problem: the optimization of sequences
of skeleton calls. His skeletons are basic patterns of communication and com-
putation – so-called collective operations, some of which are found in stan-
dard communication libraries like MPI. Gorlatch also discusses how to tune
compositions of skeleton calls for a distributed execution on the Grid.
3. Beckmann et al. describe the TaskGraph library: a further library for C++
which can be used to optimize, restructure and specialize parallel target code
at run time. They demonstrate that the effort spent on the context-sensitive
optimization can be heavily outweighed by the gains in performance.
4. Veldhuizen pursues the idea of a compiler for an extensible language, which
can give formal guarantees of the performance of its target code.
Each submission was reviewed by one person who was present at the corre-
sponding presentation at Schloss Dagstuhl and one person who did not attend
the Dagstuhl seminar.1 There were two rounds of reviews. Aside from the editors
themselves, the reviewers were:
Ira Baxter, Claus Brabrand, Krzysztof Czarnecki, Albert Cohen, Marco Danelutto,
Olivier Danvy, Prem Devanbu, Sergei Gorlatch, Kevin Hammond, Christoph A.
Herrmann, Zhenjiang Hu, Paul H. J. Kelly, Christoph M. Kirsch, Shriram
Krishnamurthi, Calvin Lin, Andrew Lumsdaine, Anne-Françoise Le Meur, Jim
Neighbors, John O'Donnell, Catuscia Palamidessi, Susanna Pelagatti, Simon
Peyton Jones, Frank Pfenning, Laurent Réveillère, Ulrik Schultz, Tim Sheard,
Satnam Singh, Yannis Smaragdakis, Jörg Striegnitz, Andrew Tolmach, Todd
Veldhuizen, Harrick Vin, David Wile, and Matthias Zenger.
1
An exception is the contribution by Cremet and Odersky, which was not presented
at Schloss Dagstuhl.
IFIP Working Group. At the Dagstuhl seminar, plans were made to form an
IFIP TC-2 Working Group: WG 2.11 on Program Generation. In the meantime,
IFIP has given permission to go ahead with the formation. The mission statement
of the group follows this preface.
Acknowledgements. The editors, who were also the organizers of the Dagstuhl
seminar, would like to thank the participants of the seminar for their contribu-
tions and the reviewers for their thoughtful reviews and rereviews. The first
editor is grateful to Johanna Bucur for her help in the final preparation of the
book.
We hope that this volume will be a good ambassador for the new IFIP WG.
Scope
The scope of this WG includes the design, analysis, generation, and quality
control of generative programs and the programs that they generate.
Specific research themes include (but are not limited to) the following areas:
– Foundations: language design, semantics, type systems, formal methods,
multi-stage and multi-level languages, validation and verification.
– Design: models of generative programming, domain engineering, domain
analysis and design, system family and product line engineering, model-
driven development, separation of concerns, aspect-oriented modeling, feature-
oriented modeling.
– Engineering: practices in the context of program generation, such as re-
quirements elicitation and management, software process engineering and
management, software maintenance, software estimation and measurement.
– Techniques: meta-programming, staging, templates, inlining, macro expan-
sion, reflection, partial evaluation, intentional programming, stepwise refine-
ment, software reuse, adaptive compilation, runtime code generation, com-
pilation, integration of domain-specific languages, testing.
– Tools: open compilers, extensible programming environments, active libraries,
frame processors, program transformation systems, program specializers, as-
pect weavers, tools for domain modeling.
– Applications: IT infrastructure, finance, telecom, automotive, aerospace, space
applications, scientific computing, health, life sciences, manufacturing, gov-
ernment, systems software and middle-ware, embedded and real-time sys-
tems, generation of non-code artifacts.
Objectives
– Foster collaboration and interaction between researchers from domain en-
gineering and those working on language design, meta-programming tech-
niques, and generative methodologies.
– Demonstrate concrete benefits in specific application areas.
– Develop techniques to assess productivity, reliability, and usability.
Table of Contents
Surveys
The Road to Utopia: A Future for Generative Programming . . . . . . . . . . . . . 1
Don Batory
Domain-Specific Languages
Generic Parallel Programming Using C++ Templates and Skeletons . . . . . . 107
Holger Bischof, Sergei Gorlatch, and Roman Leshchinskiy
Domain-Specific Optimization
Optimizing Sequences of Skeleton Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Herbert Kuchen
Domain-Specific Optimizations of Composed Parallel Components . . . . . . . 274
Sergei Gorlatch
Runtime Code Generation in C++ as a Foundation
for Domain-Specific Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
Olav Beckmann, Alastair Houghton, Michael Mellor,
and Paul H.J. Kelly
Guaranteed Optimization for Domain-Specific Programming . . . . . . . . . . . . . 307
Todd L. Veldhuizen
The Road to Utopia: A Future for Generative Programming
Don Batory
1 Introduction
So if this, too, can be replicated in other domains, would this be Utopia? No,
but it is on the road...
“Especially important are rules and methods for composing skeletons in large-
scale applications with reliable performance prediction.” – Sergei Gorlatch
The quality of performance estimates has long been known to be critical for
identifying good access plans in query optimization. Interestingly, cost estimates
used by query optimizers have historically been simple and crude, based on
averages. For n-way joins (for large n) estimates are known to be very poor
[15]. Further, performance estimates that are based on page caching – that is,
knowing which pages are on disk and which are in the cache – are highly unreliable.
Despite these limited capabilities, relational optimizers have done quite well. I
am sure there are domains other than query processing that require more precise
estimates.
In any case, if these problems are solved, would this be Utopia? No, it's on
the road...
“DSLs are best understood in terms of their ‘negative space’ – what they
don’t do is just as important as what they do... How to avoid ‘mission creep’
for languages?” – Shriram Krishnamurthi
Are domain algebras closed, meaning do they have a fixed set of operators, or
are domain algebras open, allowing new operators to be added subsequently?
This is a fundamental question whose answer is not immediately obvious. The
experience of database researchers is quite interesting with respect to this topic.
To appreciate the lessons that they learned (and the lessons that we should
learn), it is instructive to see what database researchers did “right”.
The success of relational query optimization hinged on the creation of a sci-
ence to specify and optimize query evaluation programs. Specifically researchers:
Given the fact that open algebras will be common, how has this impacted query
processing research? My guess is that the database community was lucky. The
original optimizers supported only the initial set of relational operators. This
constraint made query optimization amenable to a dynamic programming solu-
tion that admitted reasonable heuristics [22]. The end result was that database
1
Not “closed” in a mathematical sense, such as addition is closed in integers but not
in subranges. By “closed” I mean a social club: no new members were thought to be
needed.
people could legitimately claim that their optimization algorithms were guaran-
teed to find the “best” query evaluation program. And it was this guarantee that
was absolutely crucial for early acceptance. Prior work on query optimization
used only heuristics, and the results were both unsatisfying and unconvincing.
Providing hard guarantees made all the difference in the world to the acceptance
of relational optimizers.
Ironically, the most advanced databases today use rule-based optimizers that
offer many fewer guarantees. But by now, database people are willing to live with
this. So is this Utopia? No, it's on the road...
So what did database people do? They had two different outputs of domain
analysis. First, they defined relational algebra, which is the set of operators whose
compositions defined the domain of query evaluation programs. (So defining a
domain algebra is equivalent to defining the domain of programs to generate).
Another related analysis produced the SQL language, which defined declarative
specifications of data retrieval that hid its relational algebra underpinnings. So
database people created both and integrated both.
In general, however, these are separable tasks. You can define a DSL and
map it to a program directly, introducing optimizations along the way. Or you
can define a DSL and map it to an algebra whose expressions you can optimize.
This brings up a fundamental result on hierarchies of DSLs and optimiza-
tions. The first time I saw this result was in Jim Neighbors' 1980 thesis on
DRACO [19]. The idea is simple: programs are written in DSLs. You can trans-
form (map) an unoptimized DSL program to an optimized DSL program because
the domain abstractions are still visible. Stated another way, you cannot opti-
mize abstractions that have been compiled away.
Given an optimized DSL program, you translate it to an unoptimized program
in a lower-level DSL and repeat the same process until you get to machine code.
So it is this "zig-zag" series of translations that characterizes hierarchies of DSLs
(or hierarchies of languages, in general) and their optimizations (Figure 3a).
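As a rough illustration of this zig-zag, one can think of each language level as
providing an optimizer over its own abstractions and a translator down to the
next level. The sketch below uses illustrative names and a deliberately simplified
representation; it is not Batory's or DRACO's machinery.

(* Each level optimizes programs over its own abstractions, then translates
   them down one level.  For simplicity every level shares one program type. *)
type 'p level = { optimize : 'p -> 'p; translate : 'p -> 'p }

(* The zig-zag: optimize at the current level, translate down, and repeat
   until no levels remain (i.e., we have reached machine code). *)
let rec compile (levels : 'p level list) (program : 'p) : 'p =
  match levels with
  | [] -> program
  | l :: lower -> compile lower (l.translate (l.optimize program))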
These are fundamental problems of software design. I’ll sketch a common prob-
lem, and then show how a database approach solves it. But first, we need to
understand the relationship between operator compositions and layered designs.
Figure 4 depicts a layered design, where layer a is on the bottom, layer b
sits atop a, and layer c is atop b. Operator implementations often correspond to
layers or layers of abstraction. The design in Figure 4 corresponds to the com-
position c(b(a)), where layers a, b, and c implement their respective operators.
In general, systems are conceptually, but not physically, layered [12]. Inter-
faces delineate the boundaries of operators/layers a, b, and c. These interfaces
might be Java interfaces or they might be DSL specifications!
Now to an example. Suppose a program maintains a set of records of form
(age, tag) and these records are stored on an ascending age-ordered linked list
(Figure 5). Here the first record has age=1, tag=A, the next record age=10, tag=B
and so on.
Periodically, we want to count all records that satisfy the predicate age>n
and tag==t, for some n, t. What is the code for this retrieval? Here’s our first
try: we write an ordered list data type. Our retrieval is simple: we examine every
record, and apply the full predicate. This program is easy to write. Unfortunately
it is inefficient.
int count = 0;
Node node = container.first;
while (node != null) {
    if (node.tag == t && node.age > n)
        count++;
    node = node.next;
}
Our next try exploits a property of ordered lists. The observation is that we
can skip over records that don’t satisfy the key predicate, age>n. As soon as
we find the first record that satisfies the predicate, we know that all records
past this point also satisfy the predicate, so all we need to do is to apply the
residual predicate, tag==t, to the remaining records. This leads to the following
2-loop program. It takes longer to write, longer to debug, but the result is more
efficient.
int count = 0;
Node node = container.first;
while (node != null && node.age <= n)   // skip records that fail the key predicate age > n
    node = node.next;
while (node != null) {                  // apply only the residual predicate tag == t
    if (node.tag == t)
        count++;
    node = node.next;
}
There is yet another alternative: a Java programmer would ask: Why not use
the Java library? With library classes, you would have to write even less code! In
the J2SDK, TreeSet implements the SortedSet interface. With TreeSet, the desired
subcollection can be extracted in one operation (tailSet()). By iterating
over the extracted elements and applying the residual predicate as before we
can produce the desired result. The code is indeed shorter:
int count = 0;
// ts is a TreeSet of the records; Record and lower are placeholder names
for (Iterator i = ts.tailSet(lower).iterator(); i.hasNext(); )
    if (((Record) i.next()).tag == t) count++;
Unfortunately, the TreeSet code is much slower (maybe even slower than the original
list implementation). Why? The reason is that TreeSet creates an index over all
the records that it sorts. Applying the tailSet operation creates an index that
references the extracted elements. Index construction can be very slow. This
raises a classical dilemma: if you want execution speed, stay with customized
code. If you want to write the code fast, use libraries.
This problem has everything to do with selecting the right interfaces and
right abstractions and is a classical situation for the use of DSLs and GP.
Do algebraic approaches scale? Let’s face it, query evaluation programs and
data structures are tiny. Can an algebraic approach synthesize large systems? If
it can’t, then we should pack our bags and try something else.
Interestingly, it does scale. And this brings us to a key result about scaling.
Feature-Oriented Domain Analysis (FODA) is pioneering work by Kyo Kang
et al. [17]. Their work deals with product lines and producing a family of related
applications by composing primitives.
So the goal is to synthesize a large application from “primitives”. But what
are these primitives? Consider the following thought experiment. Suppose you
have programs that you want others to use. How would you describe them? Well,
you shouldn’t say what DLLs or object-oriented classes each uses. No one will
care. Instead, you are more likely to describe the program by the features it has.
(A feature is a characteristic that is useful in distinguishing programs within
a family of related programs [11]). For example, you might say Program1 has
features X, Y, and Z. But Program2 is better because it has features X, Q, and R.
The reason is that clients have an understanding of their requirements and can
see how features relate to requirements.
A common way to specify a product is by its set of features. While this is
almost unheard of in software, it is indeed common in many other engineering
disciplines. For example, go to the Dell Web site. You’ll find lots of web pages
that provide declarative DSL specifications (e.g. menu lists) from which you can
specify the features that you want on your customized PC. After completing this
specification, you can initiate its order. We want to do the same for software.
Here is a program synthesis vision that has evolved concurrently with FODA
[2]. Program P is a package of classes (class1-class4). P will have an algebraic
definition as a composition of features (or rather feature operators). Consider Fig-
ure 7. P starts with featureX, which encapsulates fragments of class1-class3.
featureY is added, which extends class1-class3 and introduces class4, and
featureZ extends all four classes. Thus, by composing features which encapsu-
late fragments of classes, a package of fully formed classes is synthesized.
This is step-wise refinement: a program is elaborated by adding details (in this
case, features), one at a time. This means that feature operators are implemented
by program refinements or program extensions2. Program P is created by starting
with a simple program, featureX. This
program is extended by featureY and then by featureZ – a classic example of
step-wise development.
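A minimal sketch of this algebraic view follows; the program representation and
the fragment names are illustrative only, not the author's notation: features are
operators on programs, and P is their composition.

(* A program is modeled here simply as a set of named class fragments. *)
type program = string list
type feature = program -> program

let featureX : feature = fun p -> "class1.x" :: "class2.x" :: "class3.x" :: p
let featureY : feature = fun p -> "class4.y" :: "class1.y" :: p   (* extends and introduces *)
let featureZ : feature = fun p -> "class1.z" :: "class4.z" :: p   (* extends existing classes *)

(* P = featureZ(featureY(featureX)): start simple, extend one feature at a time. *)
let p : program = featureZ (featureY (featureX []))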
Here is an example problem that illustrates the scalability of algebraic ap-
proaches. My students and I are now building customized tools for processing
programs written in extensible Java languages. These tools belong to Integrated
Development Environments (IDEs). The GUI shown in Figure 8 is a declarative
DSL. We allow architects to select the set of optional Java extensions that they
want (in the left-most panel), the set of optional tools that they want (in the
middle panel), and by pressing the Generate button, the selected tools will be
synthesized and will work for that specified dialect of Java. We are now gener-
ating over 200K Java LOC from such specifications, all of which are (internally)
driven by equations. So algebraic approaches do indeed scale, and there seems
to be no limit to the size of a system that can be produced [6].
Surprisingly, this simple approach has worked well for large applications.
However, for smaller programs or algorithm synthesis, much more is needed.
The work at Kestrel on synthesizing scheduling algorithms [7] and the synthesis
and optimization of orbital algorithms at NASA Ames [23] are really impressive.
They require nontrivial “domain theories” and a non-trivial programming infras-
tructure. Even the simple data structures domain requires more sophistication
than macros.
One reason for this is that domain-specific optimizations are below the level
of relational algebraic rewrites; one has to break encapsulation of abstractions
to achieve better performance. And the operations (functions) seem much more
complicated.
Oddly, the synthesis of large systems has different requirements. Architects
generally don’t care about low-level optimization issues. The problem is more
of gluing operators together; breaking encapsulations to optimize is rarely done.
For years I thought Lisp quote/unquote features were critical for all generative
tools. I now think that most domains don’t need such sophistication. Tools for
synthesis-in-the-large will be very different than those for synthesis-in-the-small.
2.11 Verification
“What kinds of verifications are appropriate/feasible for what kinds of lan-
guages, and what approaches are appropriate to carry out these verifications?
How can languages be designed to facilitate verification?” – Julia Lawall
Technology transfer is a tough issue indeed; by far, it is the hardest problem.
Education is the key. We must demonstrate over and over again where GP, DSLs,
and AP are relevant and beneficial. We must be constantly looking for new
applications to demonstrate their value. Sadly, I fear that until large companies
like Microsoft see the advantage, progress will be glacial. You have heard of the
17-year lag between the discovery of ideas and their adoption in practice; I think
the lag is even longer for software engineering simply because the inertia is so
great.
So is this Utopia? No, it's on the road...
There is no lack of other issues. Every issue raised above is indeed important.
Often the progress of a field hinges on economics. And until we understand the
economic ramifications (i.e., benefits), transfer of our ideas to industry will be
slow.
3 Epilog
So if we solved all of the previously mentioned problems, would this be Utopia?
It might be. But let’s put this in perspective: Did database people know they
were on the road to Utopia? Hardly. Let’s start with Codd’s 1970 seminal paper
on the Relational Model. Its first public review in Computing Surveys panned
the idea [20]. And it is easy to forget that the appreciation of the Relational
Model grew over time.
“It isn’t like someone sat down in the early 1980’s to do domain analysis. No
– we had trial and error as briefly outlined:
(1) CODASYL 1960s – every update type and every query type requires a
custom program to be written,
(2) Codd 1970 – relational algebra – no keys but in theory no custom pro-
grams,
(3) IBM & researchers (late) 1970s – compelling business issues press devel-
opment at business labs and universities. Query languages, schema lan-
guages, normal forms, keys, etc.
(4) Oracle early 1980s – and they are off...
Now, which domain analysis methodology shall we assert could have achieved
this in a shorter term? It takes time and experience on the road to a solution;
it also takes the patience to continually abstract from the problem at hand
until you recognize you already have a solution to the immediate problem.” –
Jim Neighbors
In short, it takes time and clarity of hindsight to find Utopia. Utopia is a small
place and is easy to miss.
“People of different backgrounds have very different opinions on fundamental
problems and principles of software engineering, amazingly.” – Stan Jarzabek
“Mindset is a very important issue. How can researchers find Utopia if they
are not trying to get there? How can they be trying to get there if they are
not solving a specific problem? Without a problem, they are on the road to
where?” – Jim Neighbors
My response: this is Science. The signs along the road to scientific advancement
are strange, if not obscure. But what did you expect? Some observations and
results will be difficult, if not impossible to explain. But eventually they will all
make sense. However, if you don’t look, you’ll just drive right past Utopia, never
knowing what you missed.
My parting message is simple: database researchers got it right; they un-
derstood the significance of generative programming, domain-specific languages,
automatic programming and lots of other concepts and their relationships, and
they made it all work.
Software engineering is about the challenges of designing and building large-
scale programs. The future of software engineering will require making programs
first-class objects and using algebras and operators to manipulate these pro-
grams. Until these ideas are in place, we are unlikely to reach Utopia. Our
challenge is to replicate the success of database researchers in other domains.
I believe that our respective communities – generative programming, metapro-
gramming, and the skeleton communities – represent the future of what software
engineering will become, not what it is today.
I hope to see you on the road!
Acknowledgements
I am grateful to Chris Lengauer and Jim Neighbors for their comments and
insights on an earlier draft of this paper.
References
1. R. Balzer, “A Fifteen-Year Perspective on Automatic Programming”, IEEE Trans-
actions on Software Engineering, November 1985.
2. D. Batory and S. O’Malley, “The Design and Implementation of Hierarchical Soft-
ware Systems with Reusable Components”, ACM TOSEM, October 1992.
3. D. Batory, V. Singhal, J. Thomas, and M. Sirkin, “Scalable Software Libraries”,
ACM SIGSOFT 1993.
4. D. Batory, G. Chen, E. Robertson, and T. Wang, “Design Wizards and Visual Pro-
gramming Environments for GenVoca Generators”, IEEE Transactions on Soft-
ware Engineering, May 2000, 441-452.
5. D. Batory, J.N. Sarvela, and A. Rauschmayer, “Scaling Step-Wise Refinement”,
International Conference on Software Engineering (ICSE-2003).
6. D. Batory, R. Lopez-Herrejon, J.P. Martin, “Generating Product-Lines of Product-
Families”, Automated Software Engineering Conference, 2002.
7. L. Blaine, et al., “Planware: Domain-Specific Synthesis of High-performance Sched-
ulers”, Automated Software Engineering Conference 1998, 270-280.
8. R.J. Brachman, “Systems That Know What They’re Doing”, IEEE Intelligent
Systems, Vol. 17#6, 67-71 (Nov. 2002).
9. D. DeWitt, et al., "The Gamma Database Machine Project", IEEE Transactions on
Knowledge and Data Engineering, March 1990.
10. J. Gray, et al. “Data Cube: A Relational Aggregation Operator Generalizing
Group-By, Cross-Tab, and Sub-Totals”, Data Mining and Knowledge Discovery
1, 29-53, 1997.
11. M. Griss, “Implementing Product-Line Features by Composing Component As-
pects”, Software Product-Line Conference, Denver, August 2000.
12. A.N. Habermann, L. Flon, and L. Cooprider, “Modularization and Hierarchy in a
Family of Operating Systems”, CACM, 19 #5, May 1976.
13. A. Hein, M. Schlick, R. Vinga-Martins, “Applying Feature Models in Industrial
Settings”, Software Product Line Conference (SPLC1), August 2000.
14. J. Hellerstein, “Predicate Migration: Optimizing Queries with Expensive Predi-
cates”, SIGMOD 1993.
15. Y.E. Ioannidis and S. Christodoulakis, “On the Propagation of Errors in the Size
of Join Results”, ACM SIGMOD 1991.
16. Y. E. Ioannidis, “On the Computation of the Transitive Closure of Relational
Operators”, Very Large Database Conference 1986, 403-411.
17. K. Kang, S. Cohen, J. Hess, W. Nowak, and S. Peterson. “Feature-Oriented Domain
Analysis (FODA) Feasibility Study”. Tech. Rep. CMU/SEI-90-TR-21, Soft. Eng.
Institute, Carnegie Mellon Univ., Pittsburgh, PA, Nov. 1990.
18. H.C. Li, S. Krishnamurthi, and K. Fisler, “Interfaces for Modular Feature Verifi-
cation”, Automated Software Engineering Conference 2002, 195-204.
19. J. Neighbors, “Software construction using components”. Ph. D. Thesis, (Technical
Report TR-160), University of California, Irvine, 1980.
20. J.E. Sammet and R.W. Rector, “In Recognition of the 25th Anniversary of Com-
puting Reviews: Selected Reviews 1960-1984”, Communications of the ACM, Jan-
uary 1985, 53-68.
21. U.P. Schultz, J.L. Lawall, and C. Consel, “Specialization Patterns”, Research Re-
port #3835, January 1999.
22. P. Selinger, M.M. Astrahan, D.D. Chamberlin, R.A. Lorie, and T.G. Price, “Access
Path Selection in a Relational Database System”, ACM SIGMOD 1979, 23-34.
23. M. Stickel, et al., “Deductive Composition of Astronomical Software from Subrou-
tine Libraries”, In Automated Deduction, A. Bundy, ed., Springer-Verlag Lecture
Notes in Computer Science, Vol. 814.
24. M. Stonebraker, “Inclusion of New Types in Relational Data Base Systems”, IEEE
Data Engineering Conference, Los Angeles, CA, February 1986.
From a Program Family
to a Domain-Specific Language
Charles Consel
INRIA/LaBRI
ENSEIRB – 1, avenue du docteur Albert Schweitzer,
Domaine universitaire – BP 99
33402 Talence Cedex, France
[email protected]
http://compose.labri.fr
1 Introduction
Domain-specific languages (DSLs) are being successfully designed, implemented
and used, both academically and industrially, in a variety of areas including
interactive voice menus [1], Web services [2], component configuration [3], fi-
nancial products [4], and communication services [5]. Yet, developing a DSL is
still a complicated process, often conducted in an ad hoc fashion, without much
assessment of the resulting language.
A step toward a methodology for DSL development is presented by Consel
and Marlet [6]. It is based on a denotational semantic approach and integrates
software architecture concerns. The development of a DSL is initiated with a
variety of ingredients, mainly consisting of a program family, technical literature
and documentation, knowledge of domain experts, and current and future re-
quirements. From these ingredients, domain operations and objects (i.e., values)
are determined, as well as the language requirements and notations. Notations
are further refined to produce a DSL syntax. Its semantics is informally defined:
it relates syntactic constructs to the objects and operations identified previ-
ously. Next, like any programming language, the DSL semantics is divided into
two stages: the static semantics, that can be viewed as program configuration
Overview
Section 2 defines the notion of a program family and discusses its importance
in the context of DSL development. Section 3 explains how a program family
naturally gives rise to a library. Section 4 shows how a library can be used as a
basis to develop a domain-specific abstract machine. Section 5 explains how to
introduce a DSL from an abstract machine. Section 6 discusses the importance
of a program family to assess a DSL.
2 Program Family
The design of a DSL can also take advantage of a specific analysis of its program
family. This analysis aims to identify repetitive program patterns. These
patterns can be abstracted over by introducing appropriate syntactic constructs.
For example, to optimize memory space, a device register often consists of
a concatenation of values. As a consequence, device drivers exhibit repetitive
sequences of code to either extract bit segments from a device register, or to
concatenate bit segments to form a register value. To account for these manipu-
lations, Devil includes concise syntax to designate and concatenate bit fragments.
In the XDR example, as in most session-oriented libraries, an invocation is com-
monly preceded and followed by state-related operations (e.g., updating the
pointer to the marshaled data) and various checks to test the consistency of
argument and return values. Given an XDR description (i.e., program), the
RPCGEN compiler directly includes these operations when generating the mar-
shaling code.
In the device driver case, optimization efforts mainly aim to minimize the
communications with the device because of their cost. As a result, programmers
try to update as many bit segments of a register as possible before issuing a write,
instead of updating bit segments individually. The Devil compiler pursues the
same goal and includes a dependency analysis to group device communications.
In the context of the marshaling process, one could conceive an optimization
aimed at eliminating buffer overflow checks. This would entail the definition of a
specific analysis to statically calculate values of buffer pointers in the case of
fixed-length data. Based on this information, a code generator could then select
the appropriate version of a marshaling instruction (i.e., with or without a buffer
overflow check). Such domain-specific optimization was automatically performed
by program specialization and produced a significant speedup [15].
6.1 Performance
In many areas, performance is a critical measure to assess the value of a DSL.
At the very least, one should show that using a DSL does not entail any perfor-
mance penalty. At most, it should demonstrate that its dedicated nature enables
optimizations that are beyond the reach of compilers for GPLs.
Device drivers represent a typical domain where efficiency is critical for the
overall system performance. As such, the recognition gained by Devil in the
operating systems community relied significantly on an extensive measurement
analysis that showed that Devil-generated code did not induce significant exe-
cution overhead [8].
6.2 Robustness
Robustness can also be an important criterion to assess a DSL. For critical parts
of a system, it is a significant contribution to demonstrate that a DSL enables
bugs to be detected as early as possible during the development process. An
interesting approach to assessing robustness is based on a mutation analysis [16].
In brief, this analysis consists of defining mutation rules to introduce errors in
programs, while preserving their syntactic correctness. Then mutated programs
are compiled and executed to measure how many errors are detected statically
and/or dynamically.
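As a small illustration of the idea (not taken from the paper), a mutation rule
can be phrased as a syntax-preserving transformation over an abstract syntax
tree; the toy expression type below is hypothetical.

(* A toy expression language and one mutation rule: swap the operands of
   every subtraction.  The mutant remains syntactically correct. *)
type exp = Int of int | Add of exp * exp | Sub of exp * exp

let rec mutate (e : exp) : exp =
  match e with
  | Int n -> Int n
  | Add (a, b) -> Add (mutate a, mutate b)
  | Sub (a, b) -> Sub (mutate b, mutate a)   (* the injected error *)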
6.3 Conciseness
Comparing the conciseness of DSL programs to equivalent programs written in
a GPL is not necessarily an easy task to do. The syntax of a DSL (e.g., declar-
ative) may drastically differ from that of a GPL (e.g., imperative). Counting
the number of lines or characters can thus be misleading. In our experience, a
meaningful measure is the number of words. Yet, a comparison needs to be com-
pleted with a close examination of both sources to make sure that no unforeseen
program aspects can distort measurements.
In practice, we observe that the narrower the program family of a DSL,
the more concise the DSL programs. This observation is illustrated by the device
driver domain. As mentioned earlier, our first study in this domain aimed to
develop a DSL, named GAL, specifically targeting graphic display adaptors for
the X11 server [13]. Then, a second study widened the scope of the target domain
by addressing the communication layer of a device driver with Devil. On the
one hand, the narrow scope of computations modeled by GAL leads to very
concise programs that could be seen as a rich form of structured parameters.
Experiments have shown that GAL programs are at least 9 times smaller than
existing equivalent C device drivers [13]. On the other hand, the more general
nature of Devil and the extra declarations provided by the programmer (e.g.,
types of bit segments) do not allow similar conciseness. In fact, the size of a
Devil program is similar to the size of its equivalent C program.
7 Conclusion
In practice, the design of a DSL is a very subjective process and has a lot of
similarities with a craft. This situation is a key obstacle to spreading the DSL
approach more widely. Methodologies to analyze a program family are quite
general and do not specifically target the design of a DSL.
This paper advocates an approach to designing a DSL that is tightly linked to
the notion of program family. A program family is a concrete basis from which
key properties can be extracted and fuel the design of a DSL. A program family
naturally gives rise to a library that, in turn, suggests domain-specific opera-
tions, domain-specific values and a domain-specific state. These elements con-
tribute to define a domain-specific abstract machine. This abstract machine,
combined with common program patterns from the program family, form the
main ingredients to design a DSL. Lastly, a program family can be used to assess
a DSL. In our experience, it produces objective measurements of the potential
benefits of a DSL.
Acknowledgment
This work has been partly supported by the Conseil Régional d’Aquitaine under
Contract 20030204003A.
References
1. Atkins, D., Ball, T., Baran, T., Benedikt, A., Cox, C., Ladd, D., Mataga, P.,
Puchol, C., Ramming, J., Rehor, K., Tuckey, C.: Integrated web and telephone
service creation. The Bell Labs Technical Journal 2 (1997) 18–35
2. Brabrand, C., Møller, A., Schwartzbach, M.: The <bigwig> project. ACM Trans-
actions on Internet Technology 2 (2002)
3. Czarnecki, K., Eisenecker, U.: Generative Programming. Addison-Wesley (2000)
4. Arnold, B., van Deursen, A., Res, M.: An algebraic specification of a language
describing financial products. In: IEEE Workshop on Formal Methods Application
in Software Engineering. (1995) 6–13
5. Consel, C., Réveillère, L.: A DSL paradigm for domains of services: A study of
communication services (2004) In this volume.
6. Consel, C., Marlet, R.: Architecturing software using a methodology for language
development. In Palamidessi, C., Glaser, H., Meinke, K., eds.: Proceedings of the
10th International Symposium on Programming Language Implementation and
Logic Programming. Number 1490 in Lecture Notes in Computer Science, Pisa,
Italy (1998) 170–194
7. Weiss, D.: Family-oriented abstraction specification and translation: the FAST
process. In: Proceedings of the 11th Annual Conference on Computer Assurance
(COMPASS), Gaithersburg, Maryland, IEEE Press, Piscataway, NJ (1996) 14–22
8. Mérillon, F., Réveillère, L., Consel, C., Marlet, R., Muller, G.: Devil: An IDL for
hardware programming. In: Proceedings of the Fourth Symposium on Operating
Systems Design and Implementation, San Diego, California (2000) 17–30
9. Réveillère, L., Mérillon, F., Consel, C., Marlet, R., Muller, G.: A DSL approach
to improve productivity and safety in device drivers development. In: Proceedings
of the 15th IEEE International Conference on Automated Software Engineering
(ASE 2000), Grenoble, France, IEEE Computer Society Press (2000) 101–109
10. Réveillère, L., Muller, G.: Improving driver robustness: an evaluation of the Devil
approach. In: The International Conference on Dependable Systems and Networks,
Göteborg, Sweden, IEEE Computer Society (2001) 131–140
11. Sun Microsystems: NFS: Network file system protocol specification. RFC 1094, Sun
Microsystems (1989)
12. Parnas, D.: On the design and development of program families. IEEE Transactions
on Software Engineering 2 (1976) 1–9
13. Thibault, S., Marlet, R., Consel, C.: Domain-specific languages: from design to im-
plementation – application to video device drivers generation. IEEE Transactions
on Software Engineering 25 (1999) 363–377
14. Schmidt, D.: Denotational Semantics: a Methodology for Language Development.
Allyn and Bacon, Inc. (1986)
15. Muller, G., Marlet, R., Volanschi, E., Consel, C., Pu, C., Goel, A.: Fast, optimized
Sun RPC using automatic program specialization. In: Proceedings of the 18th
International Conference on Distributed Computing Systems, Amsterdam, The
Netherlands, IEEE Computer Society Press (1998)
16. DeMillo, R.A., Lipton, R.J., Sayward, F.G.: Hints on test data selection: help for
the practicing programmer. Computer 11 (1978) 34–41
17. Wetherall, D.: Active network vision and reality: lessons from a capsule-based
system. In: Proceedings of the 17th ACM Symposium on Operating Systems Prin-
ciples, Kiawah Island, SC (1999)
18. Thibault, S., Consel, C., Muller, G.: Safe and efficient active network program-
ming. In: 17th IEEE Symposium on Reliable Distributed Systems, West Lafayette,
Indiana (1998) 135–143
A Gentle Introduction
to Multi-stage Programming
Walid Taha
1 Introduction
Although program generation has been shown to improve code reuse, product
reliability and maintainability, performance and resource utilization, and devel-
oper productivity, there is little support for writing generators in mainstream
languages such as C or Java. Yet a host of basic problems inherent in program
generation can be addressed effectively by a programming language designed
specifically to support writing generators.
With the data type encoding the situation is improved, but the best we can
do is ensure that any generated program is syntactically correct. We cannot
use data types to ensure that generated programs are well-typed. The reason
is that data types can represent context-free sets accurately, but usually not
context-sensitive sets. Type systems generally define context-sensitive sets (of
programs). Constructing data type values that represent trees can be a bit more
verbose, but a quasi-quotation mechanism [1] can alleviate this problem and
make the notation as concise as that of strings.
In contrast to the strings encoding, MSP languages statically ensure that
any generator only produces syntactically well-formed programs. Additionally,
statically typed MSP languages statically ensure that any generated program is
also well-typed.
Finally, with both string and data type representations, ensuring that there
are no name clashes or inadvertent variable captures in the generated program
is the responsibility of the programmer. This is essentially the same problem
that one encounters with the C macro system. MSP languages ensure that such
inadvertent capture is not possible. We will return to this issue when we have
seen one example of MSP.
Brackets (written .<...>.) can be inserted around any expression to delay its
execution. MetaOCaml implements delayed expressions by dynamically gener-
ating source code at runtime. While using the source code representation is not
the only way of implementing MSP languages, it is the simplest. The following
short interactive MetaOCaml session illustrates the behavior of Brackets1 :
# let a = 1+2;;
val a : int = 3
# let a = .<1+2>.;;
val a : int code = .<1+2>.
Lines that start with # are what is entered by the user, and the following line(s)
are what is printed back by the system. Without the Brackets around 1+2, the
addition is performed right away. With the Brackets, the result is a piece of code
representing the program 1+2. This code fragment can either be used as part of
another, bigger program, or it can be compiled and executed.
1
Some versions of MetaOCaml developed after December 2003 support environment
classifiers [21]. For these systems, the type int code is printed as (’a,int) code.
To follow the examples in this tutorial, the extra parameter ’a can be ignored.
Escape (written .~...) allows the combination of smaller delayed values to con-
struct larger ones. This combination is achieved by “splicing-in” the argument
of the Escape in the context of the surrounding Brackets:
# let b = .<.~a * .~a >. ;;
val b : int code = .<(1 + 2) * (1 + 2)>.
Run (written .!...) allows us to compile and execute the dynamically generated
code without going outside the language:
# let c = .! b;;
val c : int = 9
As an aside, there are two basic equivalences that hold for the three MSP con-
structs [18]:
.~ .<e>. = e
.! .<e>. = e
Here, a value v can include usual values such as integers, booleans, and lambdas,
as well as Bracketed terms of the form .<e>.. In the presentation above, we
use e for an expression where all Escapes are enclosed by a matching set of
Brackets. The rules for Escape and Run are identical. The distinction between
the two constructs is in the notion of values: the expression in the value .<e>.
cannot contain Escapes that are not locally surrounded by their own Brackets.
An expression e is unconstrained as to where Run can occur.
Whereas the source code only had fun x -> ... inside Brackets, this code
fragment was generated three times, and each time it produced a different
fun x_i -> ... where i is a different number each time. If we run the gen-
erated code above, we get 7 as the answer. We view it as a highly desirable
property that the results generated by staged programs are related to the re-
sults generated by the unstaged program. The reader can verify for herself that
if the xs were not renamed and we allowed variable capture, the answer of run-
ning the staged program would be different from 7. Thus, automatic renaming
of bound variables is not so much a feature; rather, it is the absence of renaming
that seems like a bug.
The hope is that having such support will allow programmers to write higher-level
and more reusable programs.
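The unstaged definitions that the next paragraphs refer to are not reproduced in
this excerpt; the following sketch is consistent with the surrounding discussion
(the exact formatting in the original may differ).

(* the generic, unstaged power function *)
let rec power (n, x) =
  match n with
    0 -> 1
  | n -> x * (power (n - 1, x));;

(* a specialized version for n = 2, with the formal parameter written as a fun *)
let power2 = fun x -> power (2, x);;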
Here, we have taken away the formal parameter x from the left-hand side of the
equality and replaced it by the equivalent “fun x ->” on the right-hand side.
To use the function power2, all we have to do is to apply it as follows:
let answer = power2 (3);;
The result of this computation is 9. But notice that every time we apply power2
to some value x it calls the power function with parameters (2,x). And even
though the first argument will always be 2, evaluating power (2,x) will always
involve calling the function recursively two times. This is an undesirable over-
head, because we know that the result can be more efficiently computed by
multiplying x by itself. Using only unfolding and the definition of power, we
know that the answer can be computed more efficiently by:
let power2 = fun x -> 1*x*x;;
We also do not want to write this by hand, as there may be many other special-
ized power functions that we wish to use. So, can we automatically build such a
program?
In an MSP language such as MetaOCaml, all we need to do is to stage the
power function by annotating it:
let rec power (n, x) =
  match n with
    0 -> .<1>. | n -> .<.~x * .~(power (n-1, x))>.;;
This function still takes two arguments. The second argument is no longer an
integer, but rather, a code of type integer. The return type is also changed.
Instead of returning an integer, this function will return a code of type integer.
To match this return type, we insert Brackets around 1 in the first branch on
the third line. By inserting Brackets around the multiplication expression, we
now return a code of integer instead of just an integer. The Escape around the
recursive call to power means that it is performed immediately.
The staging constructs can be viewed as “annotations” on the original pro-
gram, and are fairly unobtrusive. Also, we are able to type check the code both
outside and inside the Brackets in essentially the same way that we did before.
If we were using strings instead of Brackets, we would have to sacrifice static
type checking of delayed computations.
After annotating power, we have to annotate the uses of power. The decla-
ration of power2 is annotated as follows:
let power2 = .! .<fun x -> .~(power (2, .<x>.))>.;;
Evaluating the application of the Run construct will compile and then execute its
argument. Notice that this declaration is essentially the same as what we used
to define power2 before, except for the staging annotations. The annotations
say that we wish to construct the code for a function that takes one argument
(fun x ->). We also do not spell out what the function itself should do; rather,
we use the Escape construct (.~) to make a call to the staged power. The result
of this call is spliced into the enclosing Brackets, and Run then compiles and
executes the code that has been built.
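For concreteness, unfolding power (2, .<x>.) by hand shows what the Brackets
build; this is a sketch, and MetaOCaml's actual printed output may differ slightly
in parenthesization.

(* power (2, .<x>.) builds the code .<x * (x * 1)>., so the declaration amounts to *)
let power2 = .! .<fun x -> x * (x * 1)>.;;
(* power2 3 now evaluates to 9 without any calls to the generic power function *)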
3.1 Syntax
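The data type declarations referred to below are not reproduced in this excerpt;
a sketch consistent with the example program and the interpreters that follow is:

type exp = Int of int | Var of string
         | App of string * exp
         | Add of exp * exp | Sub of exp * exp
         | Mul of exp * exp | Div of exp * exp
         | Ifz of exp * exp * exp

type def = Declaration of string * string * exp

type prog = Program of def list * exp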
Using the above data types, a small program that defines the factorial function
and then applies it to 10 can be concisely represented as follows:
2
If we are using the MetaOCaml native code compiler. If we are using the bytecode
compiler, the composition produces bytecode.
Program ([Declaration
    ("fact", "x", Ifz (Var "x",
                       Int 1,
                       Mul (Var "x",
                            App ("fact", Sub (Var "x", Int 1)))))
  ],
  App ("fact", Int 10))
OCaml lex and yacc can be used to build parsers that take textual representa-
tions of such programs and produce abstract syntax trees such as the above. In
the rest of this section, we focus on what happens after such an abstract syntax
tree has been generated.
3.2 Environments
To associate variable and function names with their values, an interpreter for
this language will need a notion of an environment. Such an environment can be
conveniently implemented as a function from names to values. If we look up a
variable and it is not in the environment, we will raise an exception (let’s call it
Yikes). If we want to extend the environment (which is just a function) with an
association from the name x to a value v, we simply return a new environment
(a function) which first tests to see if its argument is the same as x. If so, it
returns v. Otherwise, it looks up its argument in the original environment. All
we need to implement such environments is the following:
exception Yikes
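The remaining definitions are missing from this excerpt; a sketch that matches
the description above and the types quoted below is:

(* initial environments: looking up any name raises an exception *)
let env0 = fun x -> raise Yikes
let fenv0 = env0

(* ext env x v: a new environment mapping x to v and every other name to env's answer *)
let ext env x v = fun y -> if x = y then v else env y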
The types of all three functions are polymorphic. Type variables such as ’a
and ’b are implicitly universally quantified. This means that they can be later
instantiated to more specific types. Polymorphism allows us, for example, to
define the initial function environment fenv0 as being exactly the same as the
initial variable environment env0. It will also allow us to use the same function
ext to extend both kinds of environments, even when their types are instantiated
to the more specific types such as:
env0 : string -> int
fenv0 : string -> (int -> int)
(* eval1 : exp -> (string -> int) -> (string -> int -> int) -> int *)
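The bodies of eval1 and of the program-level interpreter peval1 are not
reproduced in this excerpt; the following sketch is consistent with the type
above and with the description that follows.

let rec eval1 e env fenv =
  match e with
    Int i -> i
  | Var s -> env s
  | App (s, e2) -> (fenv s) (eval1 e2 env fenv)
  | Add (e1, e2) -> (eval1 e1 env fenv) + (eval1 e2 env fenv)
  | Sub (e1, e2) -> (eval1 e1 env fenv) - (eval1 e2 env fenv)
  | Mul (e1, e2) -> (eval1 e1 env fenv) * (eval1 e2 env fenv)
  | Div (e1, e2) -> (eval1 e1 env fenv) / (eval1 e2 env fenv)
  | Ifz (e1, e2, e3) ->
      if (eval1 e1 env fenv) = 0 then eval1 e2 env fenv else eval1 e3 env fenv

(* peval1 : prog -> (string -> int) -> (string -> int -> int) -> int *)
let rec peval1 p env fenv =
  match p with
    Program ([], e) -> eval1 e env fenv
  | Program (Declaration (s1, s2, e1) :: tl, e) ->
      let rec f x = eval1 e1 (ext env s2 x) (ext fenv s1 f)
      in peval1 (Program (tl, e)) env (ext fenv s1 f)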
In this function, when the list of declarations is empty ([]), we simply use eval1
to evaluate the body of the program. Otherwise, we recursively interpret the list
of declarations. Note that we also use eval1 to interpret the body of function
declarations. It is also instructive to note the three places where we use the
environment extension function ext on both variable and function environments.
The above interpreter is a complete and concise specification of what pro-
grams in this language should produce when they are executed. Additionally,
this style of writing interpreters follows quite closely what is called the denota-
tional style of specifying semantics, which can be used to specify a wide range
of programming languages. It is reasonable to expect that a software engineer
can develop such implementations in a short amount of time.
This genericity comes with an interpretive overhead that we would not have had
to pay for a hand-written implementation. The overhead is avoided by
staging the above function as follows:
(* eval2 : exp -> (string -> int code) -> (string -> (int -> int) code)
-> int code *)
(* peval2 : prog -> (string -> int code) -> (string -> (int -> int) code)
-> int code *)
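The staged bodies are likewise omitted here; a sketch of eval2, obtained from
eval1 by inserting Brackets and Escapes where the types above dictate, is:

let rec eval2 e env fenv =
  match e with
    Int i -> .<i>.
  | Var s -> env s
  | App (s, e2) -> .<.~(fenv s) .~(eval2 e2 env fenv)>.
  | Add (e1, e2) -> .<.~(eval2 e1 env fenv) + .~(eval2 e2 env fenv)>.
  | Sub (e1, e2) -> .<.~(eval2 e1 env fenv) - .~(eval2 e2 env fenv)>.
  | Mul (e1, e2) -> .<.~(eval2 e1 env fenv) * .~(eval2 e2 env fenv)>.
  | Div (e1, e2) -> .<.~(eval2 e1 env fenv) / .~(eval2 e2 env fenv)>.
  | Ifz (e1, e2, e3) -> .<if .~(eval2 e1 env fenv) = 0
                          then .~(eval2 e2 env fenv)
                          else .~(eval2 e3 env fenv)>.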
If we apply peval2 to the abstract syntax tree of the factorial example (given
above) and the empty environments env0 and fenv0, we get back the following
code fragment:
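The code fragment itself is missing from this excerpt; reconstructing it from the
analogous example later in the paper (modulo the exact printed variable names),
it is essentially:

.<let rec f = fun x -> if (x = 0) then 1 else (x * (f (x - 1)))
  in (f 10)>.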
This is exactly the same code that we would have written by hand for that
specific program. Running this program has exactly the same performance as if
we had written the program directly in OCaml.
The staged interpreter is a function that takes abstract syntax trees and
produces MetaOCaml programs. This gives rise to a simple but often overlooked
fact [19,20]:
.<let rec f =
    fun x ->
      (match (Some (x)) with
         Some (x) ->
           if (x = 0) then (Some (1))
           else
             (match
                ((Some (x)),
                 (match
                    (match ((Some (x)), (Some (1))) with
                       (Some (x), Some (y)) -> (Some ((x - y)))
                     | _ -> (None)) with
                    Some (x) -> (f x)
                  | None -> (None))) with
                (Some (x), Some (y)) -> (Some ((x * y)))
              | _ -> (None))
       | None -> (None)) in
  (match (Some (10)) with
     Some (x) -> (f x)
   | None -> (None))>.
The generated code is doing much more work than before, because at every
operation we are checking to see if the values we are operating with are proper
values or not. Which branch we take in every match is determined by the explicit
form of the value being matched.
The source of the problem is the if statement that appears in the interpretation
of Div. In particular, because y is bound inside Brackets, we cannot perform
the test y=0 while we are building the code inside these Brackets. As a result, we cannot
immediately determine if the function should return a None or a Some value.
This affects the type of the whole staged interpreter, and it affects the way we
interpret all programs even if they do not contain a use of the Div construct.
The problem can be avoided by what is called a binding-time improvement
in the partial evaluation literature [7]. It is essentially a transformation of the
program that we are staging. The goal of this transformation is to allow better
staging. In the case of the above example, one effective binding time improve-
ment is to rewrite the interpreter in continuation-passing style (CPS) [5], which
produces the following code:
(* eval5 : exp -> (string -> int) -> (string -> int -> int)
-> (int option -> ’a) -> ’a *)
(* pevalK5 : prog -> (string -> int) -> (string -> int -> int)
-> (int option -> int) -> int *)
exception Div_by_zero;;
(* peval5 : prog -> (string -> int) -> (string -> int -> int) -> int *)
(* peval6 : prog -> (string -> int code) -> (string -> (int -> int) code)
-> int code *)
The improvement can be seen at the level of the type of eval6: the option type
occurs outside the code type, which suggests that it can be eliminated in the
first stage. What we could not do before is to Escape the application of the
continuation to the branches of the if statement in the Div case. The extra ad-
vantage that we have when staging a CPS program is that we are applying the
continuation multiple times, which is essential for performing the computation
in the branches of an if statement. In the unstaged CPS interpreter, the contin-
uation is always applied exactly once. Note that this is the case even in the if
statement used to interpret Div: The continuation does occur twice (once in each
branch), but only one branch is ever taken when we evaluate this statement. But
in the staged interpreter, the continuation is indeed duplicated and applied
multiple times3.
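To make this concrete, here is a minimal illustration, not taken from the paper,
of applying a continuation in both branches of a dynamically tested if; the
duplication happens while the code is being generated.

(* k is applied once per branch, so its result appears twice in the generated code *)
let branch (test : int code) (k : int code -> int code) : int code =
  .<if .~test = 0 then .~(k .<0>.) else .~(k .<1>.)>.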
The staged CPS interpreter generates code that is exactly the same as what
we got from the first interpreter as long as the program we are interpreting
does not use the division operation. When we use division operations (say, if we
replace the code in the body of the fact example with fact (20/2)) we get the
following code:
.<let rec f =
fun x -> if (x = 0) then 1 else (x * (f (x - 1)))
in if (2 = 0) then (raise (Div_by_zero)) else (f (20 / 2))>.
3
This means that converting a program into CPS can have disadvantages (such as
a computationally expensive first stage, and possibly code duplication). Thus, CPS
conversion must be used judiciously [8].
This code can be expected to be faster than that produced by the first staged
interpreter, because only one function call is needed for every two iterations.
Avoiding Code Duplication. The last interpreter also points out an impor-
tant issue that the multi-stage programmer must pay attention to: code dupli-
cation. Notice that the term x-1 occurs three times in the generated code. In
the result of the first staged interpreter, the subtraction only occurred once.
The duplication of this term is a result of the inlining that we perform on the
body of the function. If the argument to a recursive call was even bigger, then
code duplication would have a more dramatic effect both on the time needed to
compile the program and the time needed to run it.
A simple solution to this problem comes from the partial evaluation community:
we can generate a let statement that binds the expression about to be duplicated
to a variable, so that the variable is used in its place. This is only a small change to the staged
interpreter presented above:
let rec eval8 e env fenv =
match e with
... same as eval7 except for
| App (s,e2) -> .<let x = .~(eval8 e2 env fenv)
in .~(fenv s .<x>.)>. ...
4 To Probe Further
– We want to minimize total cost of all stages for most inputs. This model ap-
plies, for example, to implementations of programming languages. The cost
of a simple compilation followed by execution is usually lower than the cost
of interpretation. For example, the program being executed usually contains
loops, which typically incur large overhead in an interpreted implementation.
– We want to minimize a weighted average of the cost of all stages. The weights
reflect the relative frequency at which the result of a stage can be reused.
This situation is relevant in many applications of symbolic computation.
Often, solving a problem symbolically, and then graphing the solution at
a thousand points can be cheaper than numerically solving the problem a
thousand times. This cost model can make a symbolic approach worthwhile
even when it is 100 times more expensive than a direct numerical one. (By
symbolic computation we simply mean computation where free variables are
values that will only become available at a later stage.)
– We want to minimize the cost of the last stage. Consider an embedded sys-
tem where the sin function may be implemented as a large look-up table (see
the sketch below). The cost of constructing the table is not relevant. Only the cost of comput-
ing the function at run-time is. The same applies to optimizing compilers,
which may spend an unusual amount of time to generate a high-performance
computational library. The cost of optimization is often not relevant to the
users of such libraries.
The last model seems to be the most commonly referenced one in the lit-
erature, and is often described as “there is ample time between the arrival of
different inputs”, “there is a significant difference between the frequency at which
the various inputs to a program change”, and “the performance of the program
matters only after the arrival of its last input”.
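As a small illustration of this last cost model, here is a sketch of a staged look-up-table implementation of sin in MetaOCaml; the function name staged_sin and the resolution parameter n are made up for the example, and the argument is assumed to lie in [0, 2*pi):
(* Sketch: the table is an ordinary first-stage value; cross-stage persistence
   lets the generated code refer to it directly.  All the trigonometry happens
   in the first stage, and only an array lookup remains in the last stage. *)
let staged_sin n =
  let pi = 4.0 *. atan 1.0 in
  let table =
    Array.init (n + 1)
      (fun i -> sin (2.0 *. pi *. float_of_int i /. float_of_int n)) in
  .<fun x -> table.(int_of_float (x /. (2.0 *. pi) *. float_of_int n))>.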
Detailed examples of MSP can be found in the literature, including term-
rewriting systems [17] and graph algorithms [12]. The most relevant example to
Acknowledgments
Stephan Jorg Ellner, John Garvin, Christoph Hermann, Samah Mahmeed, and
Kedar Swadi read and gave valuable comments on this tutorial. Thanks are also
due to the reviewers, whose detailed comments have greatly improved this work,
as well as to the four editors of this volume, who have done an extraordinary job
organizing the Dagstuhl meeting and organizing the preparation of this volume.
References
1 Introduction
A basic design choice when implementing a programming language is whether to
build an interpreter or a compiler. An interpreter realizes the actions of the pro-
gram by stepping through the source program. A compiler consists of a translator
and a runtime system. The translator maps the source program to a program
in an existing language, while the runtime environment provides the primitives
needed for the resulting program to be executable. Translation can be to a
lower-level language – as is the case in traditional compilers – or to a high-level
language for which we already have an implementation.
An interesting special case occurs when no translation is needed. This means
that the new language is both syntactically and semantically a subset of an
existing host language. In this case, all we need is to implement the runtime
system as a library in the host language. This approach of embedded languages
has recently gained significant popularity in the functional programming com-
munity [19]. Functional languages make it possible to implement DSLs that
are more sophisticated than is possible with traditional languages. For exam-
ple, lambda abstractions can be used conveniently to allow DSLs with binding
constructs, and nontrivial type systems for the DSLs can be encoded within
the sophisticated type systems provided by functional languages. From the DSL
implementer’s point of view, the benefits of this approach include reusing the
parser, type checker, and the compiler of an existing language to implement a
new one. Examples of embedded DSLs include parsing [32, 22, 37], pretty print-
ing [21], graphics [12, 11], functional reactive programming [20], computer music
[18], robotics [40], graphical user interfaces [4], and hardware description lan-
guages [1, 31, 36]. While this approach works surprisingly well for a wide range
of applications, embedding may not be appropriate if there is
– Mismatch in concrete syntax. A prerequisite for embedding is that the syn-
tax for the new language be a subset of the syntax for the host language.
This excludes many stylistic design choices for the new language, including
potentially useful forms of syntactic sugar and notational conventions. Fur-
thermore, as DSLs become easier to implement, we can expect an increase
in the use of graphical notations and languages, which often do not use the
same syntax as potential host languages.
– Mismatch in semantics. The embedding approach requires that the seman-
tics for the DSL and host languages coincide for the common subset. For ex-
ample, two languages may share the same syntax but not the same semantics
if one is call-by-value and the other is call-by-name. A more domain-specific
example arises in the Hydra language [36], which uses the same syntax as
Haskell, but which has a different semantics.
Even when these problems can be avoided, the resulting implementation may be
lacking in a number of ways:
2 MetaOCaml
MetaOCaml is a multi-stage extension of the OCaml programming language [2,
33]. Multi-stage programming languages [51, 47, 52] provide a small set of con-
structs for the construction, combination, and execution of program fragments.
The key novelty in multi-stage languages is that they can have static type sys-
tems that guarantee a priori that all programs generated using these constructs
will be well-typed. The basics of programming in MetaOCaml can be illustrated
with the following declarations:
let rec power n x = (* int -> .<int>. -> .<int>. *)
if n=0 then .<1>. else .<.~x * .~(power (n-1) x)>.
let power3 = (* int -> int *)
.! .<fun x -> .~(power 3 .<x>.)>.
Ignoring the code type constructor .<t>. and the three staging annotations
brackets .<e>., escapes .~e and run .!, the above code is a standard definition
of a function that computes x^n, which is then used to define the specialized
function power3 computing x^3. Without staging, the last step just produces a function that invokes the
power function every time it gets a value for x. The effect of staging is best under-
stood by starting at the end of the example. Whereas a term fun x -> e x is a
value, an annotated term .<fun x -> .˜(e .<x>.)>. is not. Brackets indicate
that we are constructing a future stage computation, and an escape indicates
that we must perform an immediate computation while building the bracketed
computation. The application e .<x>. has to be performed even though x is still
an uninstantiated symbol. In the power example, power 3 .<x>. is performed
immediately, once and for all, and not repeated every time we have a new value
for x. In the body of the definition of the power function, the recursive
application of power is also escaped to make sure that it is performed immediately.
The run .! on the last line invokes the compiler on the generated code fragment,
and incorporates the result of compilation into the runtime system.
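For illustration, the bracketed fragment to which .! is applied here has, up to the automatic renaming of bound variables, the following shape (a sketch of the expected result, not reproduced output):
.<fun x -> x * (x * (x * 1))>.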
The basic approach to implementing DSLs in MetaOCaml is the staged in-
terpreter approach [13, 26, 25, 43]. First, an interpreter is implemented and tested
for the DSL. Then, the three staging constructs are used to produce an imple-
mentation that performs the traversal of the DSL program in a stage earlier
Formulae such as ∀p.T ⇒ p can be represented using this datatype by the value
Forall ("p", Implies(True, Var "p")). Implementing this DSL would involve
implementing an interpreter that checks the validity of the formula. Such a
function is implemented concisely by the following (Meta)OCaml code:
exception VarNotFound;;
The first line declares an exception that may be raised (and caught) by the
code to follow. We implement an environment as a function that takes a name
and either returns a corresponding value or raises an exception if that value is
not found. The initial environment env0 always raises the exception because it
contains no proper bindings. We add proper bindings to an environment using
the function ext, which takes an environment env, a name x, and a value v.
It returns a new environment that is identical to env, except that it returns v
if we look up the name x. The evaluation function itself takes a formula b and
environment env and returns a boolean that represents the truth of the formula.
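A sketch of these functions, following the prose above rather than the paper's exact listing (only constructors of the formula datatype that are visible in the text are shown; the remaining cases are elided):
let env0 = fun x -> raise VarNotFound
let ext env x v = fun y -> if y = x then v else env y

let rec eval b env =
  match b with
    True -> true
  | Var x -> env x
  | Implies (b1, b2) -> not (eval b1 env) || eval b2 env
  | Forall (x, b1) -> eval b1 (ext env x true) && eval b1 (ext env x false)
  (* remaining constructors of the formula datatype handled analogously *)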
Note that here we are pattern matching on the datatype used to represent the
DSL programs, not on the MetaOCaml code type (which is the return type of
the staged interpreter above). Pattern matching on the code type is not encour-
aged1, primarily because it is currently not known how to allow it without either
allowing ill-typed code to be generated or making substantial changes to the
MetaOCaml type system [49]. This way the programmer is sure that once a par-
ticular code fragment is generated, the meaning of this code fragment will not
change during the course of a computation.
3 Template Haskell
Template Haskell is an extension of Haskell that supports compile-time prepro-
cessing of Haskell source programs [9, 44]. The goal of Template Haskell is to
enable the Haskell programmer to define new language features without having
to modify the compilers [28, 29, 16].
To illustrate some of the constructs of Template Haskell, we present an analog
of the power example above. The expand_power function builds an expression
that multiplies n copies of the expression x together, and mk_power yields a
function that computes x^n for any integer x.
1
Using coercions it is possible to expose the underlying representation of the MetaO-
Caml code type, which is the same as the datatype for parse trees used in the OCaml
compiler.
module A where
import Language.Haskell.THSyntax
A specialized function power3 is now defined using $ to splice in the code pro-
duced at compile time by mk_power 3.
In this example, quotations [|e|] and splice $e are analogous to the brack-
ets and escape of MetaOCaml. However, there are three key differences. First,
in place of MetaOCaml’s parameterized type constructor .<t>. the type for
quoted values in Template Haskell is always Q Exp. Second, in the definition of
module Main, the splice occurs without any explicit surrounding quotations. To
compare this with MetaOCaml, the module (apart from the import directive)
could be interpreted as having implicit quotations2 . Third, there is no explicit
run construct.
The staged interpreter approach can be used in Template Haskell. A Tem-
plate Haskell version of the QBF Interpreter example would be similar to the
MetaOCaml version. Programs can be staged using quasi-quotes and splices.
But the primary approach to implementing DSLs in Template Haskell is a vari-
ation of the embedding approach. In particular, Template Haskell allows the
programmer to alter the semantics of a program by transforming it into a dif-
ferent program before it reaches the compiler. This is possible because Template
Haskell allows the inspection of quoted values.
[| circ x y = (a,b)
where a = inv x
b = xor2 a y |]
The value of the quoted expression is a monadic computation that, when per-
formed, produces the abstract syntax tree:
Fun "circ"
[Clause
[Pvar "x’0", Pvar "y’1"]
(Normal (Tup [Var "a’2", Var "b’3"]))
[Val (Pvar "a’2") (Normal (App (Var "Signal:inv") (Var "x’0"))) [],
Val (Pvar "b’3") (Normal (App (App (Var "Signal:xor2")
(Var "a’2")) (Var "y’1"))) []]]
This abstract syntax tree consists of one clause corresponding to the single defin-
ing equation. Within the clause, there are three pieces of information: the list of
patterns, the expression to be returned, and a list of local definitions for a and
b. The variables have been renamed automatically in order to prevent acciden-
tal name capture; for example x in the original source has become x'0 in the
abstract syntax tree. The numbers attached to variable names are maintained
in a monad, and the DSL programmer can create new names using a monadic
gensym operation.
In general, it is easier to work with code inside [| |] brackets; this is more
concise and the automatic renaming of bound variables helps to maintain correct
lexical scoping. However, the ability to define abstract syntax trees directly offers
more flexibility, at the cost of added effort by the DSL implementor.
Yet another way to produce an abstract syntax tree is to reify an ordinary
Haskell definition. For example, suppose that an ordinary Haskell definition f x
= . . . appears in a program. Although this user code contains no quotations or
3
In Haskell, computations are implemented using an abstract type constructor called
a monad [34, 41].
splices, the DSL implementation can obtain its abstract syntax tree: ftree =
reifyDecl f. Values and types can be reified, as well as operator fixities (needed
for compile time code to understand expressions that contain infix operators) and
code locations (which enables the compile time code to produce error messages
that specify where in the source file a DSL error occurs).
Once a piece of Haskell code has been represented as an abstract syntax tree,
it can be analyzed or transformed using standard pattern matching in Haskell.
This makes it possible to transform the code into any Haskell program that we
wish to use as the DSL semantics for this expression.
As with MetaOCaml, the staged interpreter can be made available to the user
as a library function, along with a parser and a loader. Such functions can only
be used at compile time in Template Haskell. It is worth noting, however, that
Template Haskell does allow compile-time IO, which is necessary if we want to
allow loading DSL programs from files.
Additionally, Template Haskell’s quotations make it possible to use Haskell
syntax to represent a DSL program. Template Haskell also makes it possible to
quote declarations (and not just expressions) using the [d| . . . |] operator:
code = [d|
v1 = [1,2,3]
v2 = [1,2,3]
v3 = [1,2,3]
r = v1 `add` (v2 `mul` v3)
|]
Then, another module could apply a staged interpreter to this code in order to
transform the original code into an optimized version and splice in the result.
Template Haskell also allows templates to construct declarations that intro-
duce new names. For example, suppose we wish to provide a set of specialized
power functions, power2, power3, up to some maximum such as power8. We
can write a Template Haskell function that generates these functions, using the
mk_power defined above, along with the names, and then splices them into the
program:
$(generate_power_functions 8)
4 C++ Templates
C++ programs may contain both static code, which is evaluated at compile time, and dynamic
code, which is compiled and later executed at runtime. Static computations are
encoded in the C++ type system, and writing static programs is usually referred
to as Template Metaprogramming [57].
Before giving a short overview of what static code looks like, we recapitulate
some C++ template basics.
Here, the type parameter T in the original definition is simply replaced by float
and size by 10. Now we can use MyVector as a regular C++ class:
MyVector v;
v[3] = 2.2;
The function template swap can be called as if it were an ordinary C++ func-
tion. The C++ compiler will automatically infer and substitute the appropriate
parameter type for T:
int a = 5, b = 9;
swap(a,b);
typedef Cons<1, Cons<2, Nil> > list;          list = Cons 1 (Cons 2 Nil)
The C++ definitions for Nil and Cons have different dynamic and static
semantics. Dynamically, Nil is the name of a class, and Cons is the name of a
class template. Statically, both Nil and Cons can be viewed as data constructors
corresponding to the Haskell declaration on the right. All such constructors can
be viewed as belonging to a single “extensible datatype”, which such definitions
extend.
The C++ template instantiation mechanism [23] provides the semantics for
compile-time computations. Instantiating a class template corresponds to apply-
ing a function that computes a class. In contrast, class template specialization
allows us to provide different template implementations for different argument
values, which serves as a vehicle to support pattern matching, as common in
functional programming languages. Next, we compare how functions are imple-
mented at the C++ compile time and in Haskell:
C++:                                Haskell:
template <class List>               len :: List -> int;
struct Len;
template <>                         len Nil = 0
struct Len<Nil> {
  enum { RET = 0 }; };
MatrixGen takes a list of properties, checks their validity, completes the list of
properties by computing defaults for unspecified properties, and computes the
appropriate composition of elementary generic components such as containers,
shape adapters and bounds checkers. This configuration generator uses a specific
style for specifying properties that simulates keyword parameters: for example,
when we write ElemType<float> we have a parameter float wrapped in the
template ElemType to indicate its name [7]. Configuration generators can com-
pose mixins (that is, class templates whose superclasses are still parameters) and
thus generate whole class hierarchies [5]. They can also compose mixin layers
[45], which corresponds to generating object-oriented frameworks.
However, assume that our DSL should have parallel execution semantics, e.g., by
using OpenMP [3]. In order to achieve this, we need to transform the expression
into code like the following:
For example, the type of the expression (a+b)*c, where all variables are of
type Vector<int, 10>, would be:
BNode< OpMult,
BNode< OpPlus,
Vector<int, 10>,
Vector<int, 10> >,
Vector<int, 10> >
To create such an object, we overload all the operators used in our DSL for all
combinations of operand types (Vector/Vector, Vector/BNode, BNode/Vector,
and BNode/BNode):
Note that code inspection in C++ is limited. For a type-driven DSL all in-
formation is embedded in the type. For an expression-driven DSL, this is not
the case. For instance, we can only inspect the type of a vector-argument; its
address in memory remains unknown at the static level. Inspecting arbitrary
C++ code is possible to a very limited extent. For instance, different program-
ming idioms (such as traits members, classes, and templates [35, 30]) have been
developed to allow testing a wide range of properties of types. Examples of such
properties include whether a type is a pointer type or if type A is derived from
B. Many other properties (such as the names of variables, functions and classes;
and member functions of a class) must be hard-coded as traits.
Next, we define Eval, which is a transformation that takes a parse tree and
generates code to evaluate it. For instance, while visiting a leaf, we generate a
function that evaluates a vector at a certain position i:
template <class T,int size> struct Eval< Vector<T,size> >
{ static inline T evalAt(const Vector<T,size>& v, int i)
{ return v[i]; }
};
For a binary node we first generate two functions to evaluate the values of
the siblings. Once these functions have been generated (i.e., the corresponding
templates have been instantiated), we generate calls to those functions:
template <class L,class R> struct Eval< BNode<OpPlus,L,R> >
{ static inline T evalAt(const BNode<OpPlus,L,R>& b,int i)
{ return Eval<L>::evalAt // generate function
(b.left,i) // generate call to generated code
+ // Operation to perform
Eval<R>::evalAt // generate function
(b.right,i); } // generate call to generated code
};
Since all generated functions are eligible candidates for inlining and are tagged
with the inline keyword, a compiler should be able to remove any call overhead
so that the result is the same as if we spliced in code immediately.
Invoking the Eval transformation is somewhat tedious because we need to
pass the complete parse tree to it. Since type parameters to function templates
are inferred automatically, a function template solves this problem quite conve-
niently:
template <class T>
T inline eval(const T& expression,int i) {
return Eval<T>::evalAt(expression, i);
}
The code generated for the assignment statement d=((a+b)*c); will be the
same as the desired code given at the beginning of this section with the in-
ner comment replaced by d[i]=(a[i]+b[i])*c[i], and it will be executed in
parallel.
5 Comparison
In this section, we present some basic dimensions for variation among the three
different languages. The following table summarizes where each of the languages
falls along each of these dimensions:
Dimension            MetaOCaml       Template Haskell      C++ Templates
1.  Approach         Staged interp.  Templates             Templates
2.  How              Quotes          Quotes & abs. syn.    Templates
3.  When             Runtime         Compile-time          Compile-time
4.  Reuse            Compiler        Compiler (& parser)   Compiler & parser
5.  Guarantee        Well-typed      Syntax valid          Syntax valid
6.  Code inspection  No              Yes                   Limited
7.  IO               Always          Always                Runtime
8.  Homogeneous      Yes             Yes                   No
9.  CSP              Yes             Yes                   No
10. Encapsulation    Yes             No                    Yes
11. Modularity       Yes             Yes                   No
The definition (and explanation) of each of these dimensions is as follows:
1. Of the DSL implementation approaches discussed in the introduction, what is
the primary approach supported? To varying extents, both Template Haskell
and C++ can also support the staged interpreter approach (but with a
different notion of static typing). MetaOCaml staged interpreters can be
translated almost mechanically into Template Haskell. Staging in C++ is not
as straightforward due to its heterogeneity. Also, in C++, staged interpreters
must use C++ concrete syntax (see Reuse and IO dimensions below).
Acknowledgments
Anonymous reviewers, Van Bui, Simon Helsen, Roumen Kaiabachev, and Kedar
Swadi provided valuable comments on drafts of this paper.
References
1. Per Bjesse, Koen Claessen, Mary Sheeran, and Satnam Singh. Lava: Hardware
design in Haskell. ACM SIGPLAN Notices, 34(1):174–184, January 1999.
2. Cristiano Calcagno, Walid Taha, Liwen Huang, and Xavier Leroy. Implementing
multi-stage languages using asts, gensym, and reflection. In Frank Pfenning and
Yannis Smaragdakis, editors, Generative Programming and Component Engineer-
ing (GPCE), Lecture Notes in Computer Science. Springer-Verlag, 2003.
3. Rohit Chandra, Leonardo Dagum, and Dave Kohr. Parallel Programming in
OpenMP. Morgan Kaufmann, 2000.
4. Antony Courtney. Frappé: Functional reactive programming in Java. In Proceedings
of Symposium on Practical Aspects of Declarative Languages. ACM, 2001.
5. K. Czarnecki and U. W. Eisenecker. Synthesizing objects. In Proceedings of
ECOOP’99, LNCS 1628, pages 18–42. Springer-Verlag, 1999.
6. K. Czarnecki and U. W. Eisenecker. Generative Programming: Methods, Tools, and
Applications. Addison-Wesley, 2000.
7. K. Czarnecki and U. W. Eisenecker. Named parameters for configuration genera-
tors. http://www.generative-programming.org/namedparams/, 2000.
8. K. Czarnecki, U. W. Eisenecker, R. Glück, D. Vandevoorde, and T. Veldhuizen.
Generative programming and active libraries (extended abstract). In M. Jazayeri,
D. Musser, and R. Loos, editors, Generic Programming. Proceedings, volume 1766
of LNCS, pages 25–39. Springer-Verlag, 2000.
9. Simon Peyton Jones (ed.). Haskell 98 language and libraries. Journal of Functional
Programming, 13(1):1–255, January 2003.
10. Conal Elliott, Sigbjørn Finne, and Oege de Moor. Compiling embedded languages.
In [48], pages 9–27, 2000.
11. Conal Elliott and Paul Hudak. Functional reactive animation. In International
Conference on Functional Programming, pages 163–173, June 1997.
12. Sigbjorn Finne and Simon L. Peyton Jones. Pictures: A simple structured graphics
model. In Proceedings of Glasgow Functional Programming Workshop, July 1995.
13. Yoshihiko Futamura. Partial evaluation of computation: An approach to a
compiler-compiler. Systems, Computers, Controls, 2(5):45–50, 1971.
14. Steven Ganz, Amr Sabry, and Walid Taha. Macros as multi-stage computations:
Type-safe, generative, binding macros in MacroML. In the International Confer-
ence on Functional Programming (ICFP ’01), Florence, Italy, September 2001.
ACM.
15. Aleksey Gurtovoy. Boost MPL Library (Template metaprogramming framework).
http://www.boost.org/libs/mpl/doc/.
16. K. Hammond, R. Loogen, and J. Berthold. Automatic Skeletons in Template
Haskell. In Proceedings of 2003 Workshop on High Level Parallel Programming,
Paris, France, June 2003.
17. Scott Haney, James Crotinger, Steve Karmesin, and Stephen Smith. Pete: The
portable expression template engine. Dr. Dobbs Journal, October 1999.
18. P. Hudak, T. Makucevich, S. Gadde, and B. Whong. Haskore music notation – an
algebra of music. Journal of Functional Programming, 6(3):465–483, May 1996.
19. Paul Hudak. Building domain specific embedded languages. ACM Computing Sur-
veys, 28A:(electronic), December 1996.
20. Paul Hudak. The Haskell School of Expression – Learning Functional Programming
through Multimedia. Cambridge University Press, New York, 2000.
21. J. Hughes. Pretty-printing: an exercise in functional programming. In R. S. Bird,
C. C. Morgan, and J. C. P. Woodcock, editors, Mathematics of Program Construc-
tion; Second International Conference; Proceedings, pages 11–13, Berlin, Germany,
1993. Springer-Verlag.
22. G. Hutton. Combinator parsing. Journal of Functional Programming, 1993.
23. ISO/IEC. Programming languages – C++. ISO/IEC 14882 Standard, October
2003.
24. Jaakko Järvi and Gary Powell. The lambda library: Lambda abstraction in C++.
In Second Workshop on C++ Template Programming, Tampa Bay, Florida, USA,
October 2001.
25. Neil D. Jones. What not to do when writing an interpreter for specialisation. In
Olivier Danvy, Robert Glück, and Peter Thiemann, editors, Partial Evaluation, vol-
ume 1110 of Lecture Notes in Computer Science, pages 216–237. Springer-Verlag,
1996.
26. Neil D. Jones, Carsten K. Gomard, and Peter Sestoft. Partial Evaluation and
Automatic Program Generation. Prentice-Hall, 1993.
27. Shriram Krishnamurthi, Matthias Felleisen, and Daniel P. Friedman. Synthesizing
object-oriented and functional design to promote re-use. In Eric Jul, editor, Euro-
pean Conference in Object-Oriented Programming, volume 1445 of Lecture Notes
in Computer Science, pages 91–113. Springer Verlag, 1998.
28. Ian Lynagh. Template Haskell: A report from the field. http://web.comlab.ox.ac.uk
/oucl/work/ian.lynagh/papers/, May 2003.
29. Ian Lynagh. Unrolling and simplifying expressions with Template Haskell.
http://web.comlab.ox.ac.uk/oucl/work/ian.lynagh/papers/, May 2003.
30. John Maddock and Steve Cleary et al. Boost type traits library.
http://www.boost.org/libs/type_traits/.
31. John Matthews, Byron Cook, and John Launchbury. Microprocessor specification
in Hawk. In Proceedings of the 1998 International Conference on Computer Lan-
guages, pages 90–101. IEEE Computer Society Press, 1998.
32. M. Mauny. Parsers and printers as stream destructors embedded in functional
languages. In Proceedings of the Conference on Functional Programming Languages
and Computer Architecture, pages 360–370. ACM/IFIP, 1989.
33. MetaOCaml: A compiled, type-safe multi-stage programming language. Available
online from http://www.metaocaml.org/, 2003.
34. Eugenio Moggi. Notions of computation and monads. Information and Computa-
tion, 93(1), 1991.
35. N. C. Myers. Traits: a new and useful template technique. C++ Report, 7(5), June
1995.
36. John O’Donnell. Overview of Hydra: A concurrent language for synchronous digital
circuit design. In Proceedings 16th International Parallel & Distributed Processing
Symposium, page 234 (abstract). IEEE Computer Society, April 2002. Workshop on
Parallel and Distributed Scientific and Engineering Computing with Applications –
PDSECA.
37. Chris Okasaki. Even higher-order functions for parsing or why would anyone ever
want to use a sixth-order function? Journal of Functional Programming, 8(2):195–
199, March 1998.
38. Oregon Graduate Institute Technical Reports. P.O. Box 91000, Portland, OR
97291-1000,USA. Available online from
ftp://cse.ogi.edu/pub/tech-reports/README.html.
39. Emir Pašalić, Walid Taha, and Tim Sheard. Tagless staged interpreters for typed
languages. In the International Conference on Functional Programming (ICFP
’02), Pittsburgh, USA, October 2002. ACM.
40. J. Peterson, G. Hager, and P. Hudak. A language for declarative robotic program-
ming. In Proceedings of IEEE Conf. on Robotics and Automation, 1999.
41. Simon Peyton Jones and Philip Wadler. Imperative functional programming. In the
Symposium on Principles of Programming Languages (POPL ’93). ACM, January
1993. 71–84.
42. Rice Students. Multi-stage programming course projects.
http://www.cs.rice.edu/~taha/teaching/, 2000.
43. Tim Sheard, Zine El-Abidine Benaissa, and Emir Pašalić. DSL implementation
using staging and monads. In Second Conference on Domain-Specific Languages
(DSL’99), Austin, Texas, 1999. USENIX.
44. Tim Sheard and Simon Peyton Jones. Template metaprogramming for Haskell. In
Manuel M. T. Chakravarty, editor, ACM SIGPLAN Haskell Workshop 02, pages
1–16. ACM Press, October 2002.
45. Yannis Smaragdakis and Don Batory. Implementing layered designs with mixin
layers. In Proceedings of the European Conference on Object-Oriented Programming
(ECOOP), pages 550–570. Springer-Verlag LNCS 1445, 1998.
46. Jörg Striegnitz and Stephen Smith. An expression template aware lambda function.
In First Workshop on C++ Template Programming, Erfurt, Germany, October
2000.
47. Walid Taha. Multi-Stage Programming: Its Theory and Applications. PhD thesis,
Oregon Graduate Institute of Science and Technology, 1999. Available from [38].
48. Walid Taha, editor. Semantics, Applications, and Implementation of Program
Generation, volume 1924 of Lecture Notes in Computer Science, Montréal, 2000.
Springer-Verlag.
49. Walid Taha. A sound reduction semantics for untyped CBN multi-stage computa-
tion. Or, the theory of MetaML is non-trivial. In Proceedings of the Workshop on
Partial Evaluation and Semantics-Based Program Manipulation (PEPM), Boston,
2000. ACM Press.
50. Walid Taha and Patricia Johann. Staged notational definitions. In Frank Pfen-
ning and Yannis Smaragdakis, editors, Generative Programming and Component
Engineering (GPCE), Lecture Notes in Computer Science. Springer-Verlag, 2003.
51. Walid Taha and Tim Sheard. Multi-stage programming with explicit annotations.
In Proceedings of the Symposium on Partial Evaluation and Semantic-Based Pro-
gram Manipulation (PEPM), pages 203–217, Amsterdam, 1997. ACM Press.
52. Walid Taha and Tim Sheard. MetaML: Multi-stage programming with explicit
annotations. Theoretical Computer Science, 248(1-2), 2000.
53. Peter Thiemann. Programmable Type Systems for Domain Specific Languages.
In Marco Comini and Moreno Falaschi, editors, Electronic Notes in Theoretical
Computer Science, volume 76. Elsevier, 2002.
54. Ervin Unruh. Prime number computation. Internal document, ANSI X3J16-94-
0075/ISO WG21-462, 1994.
55. Todd Veldhuizen and Kumaraswamy Ponnambalam. Linear algebra with C++
template metaprograms. Dr. Dobb’s Journal of Software Tools, 21(8):38–44, Au-
gust 1996.
56. Todd L. Veldhuizen. Expression templates. C++ Report, 7(5):26–31, 1995.
57. Todd L. Veldhuizen. Template metaprograms. C++ Report, 7(4):36–43, 1995.
58. D. Wile. Popart: Producer of parsers and related tools. system builders’ manual.
Technical report, USC Information Sciences Institute, 1981.
Program Optimization in the Domain
of High-Performance Parallelism
Christian Lengauer
1 Introduction
the parallel program may be portable, its performance may not be but may differ
wildly on different platforms1 .
If we consider high-performance parallelism an application domain, it makes
sense to view the programming of high-performance parallelism as a domain-
specific activity. Thus, it is worthwhile to assess whether techniques used in
domain-specific program generation have been or can be applied, and what spe-
cific requirements this domain may have. To that end, I proceed as follows:
2 Domain-Specific Programming
What makes a language domain-specific is not clear cut and probably not worth
worrying about too much. A debate of this issue would start with the already
difficult question of what constitutes a domain. Is it a collection of users or a
collection of software techniques or a collection of programs...?
For the purpose of my explorations we only need to agree that there are
languages that have a large user community and those that have a much smaller
user base, by comparison. Here I mean “large” in the sense of influence, man-
power, money. I call a language with a large user community general-purpose.
Probably undisputed examples are C and C++, Java and Fortran but, with
this definition, I could also call the query language SQL general-purpose, which
demonstrates that the term is meant in a relative sense.
A large user community can invest a lot of effort and resources in developing
high-quality implementations of and programming environments for their lan-
guage. A small user community has much less opportunity to do so, but may
have a need for special programming features that are not provided by any
programming language which other communities support. I call such features
domain-specific. One real-world example of a language with a small user com-
munity, taken from Charles Consel’s list of domain-specific languages on the
Web, is the language Devil for the specification of device driver interfaces [1].
What if the small user community prefers some widely used, general-purpose
language as a base, but needs to enhance it with domain-specific constructs to
1
On present-day computers with their memory hierarchies, instruction-level paral-
lelism and speculation, a lack of performance portability can also be observed in
sequential programs.
Let us look more closely at the latter approach for our particular purpose: pro-
gram optimization.
5 Domain-Specific Libraries
5.1 Principle and Limitations
The easiest, and a common way of embedding domain-specific capabilities in a
general-purpose programming language is via a library of domain-specific pro-
gram modules (Fig. 1). Two common forms are libraries of subprograms and, in
2
This program structure is called single-program multiple-data (SPMD).
[Fig. 1 – figure labels: domain-specific library, general-purpose compiler, target compiler]
– Error messages generated in the library modules are often cryptic because
they have been issued by a compiler or run-time system which is ignorant of
the special domain.
– The caller of the module is responsible for setting the structure parameters
consistently. This limits the robustness of the approach.
– The implementer of the module has to predict all contexts in which the
module may be called and build a case analysis which selects the given
context. This limits the flexibility of the approach.
Collective Operations. For sequential programs, there are by now more ab-
stract programming models – first came structured programming, then func-
tional and object-oriented programming. For parallel programs, abstractions are
still being worked out. One small but rewarding step is to go from point-to-
point communication, which leads to unstructured communication code much as
the goto leads to unstructured control code in sequential programs [9], to patterns
of communications and distributed computations. One frequently occurring case is the reduction,
in which an associative binary operator is applied in a distributed fashion to
a collection of values to obtain a result value (e.g., the sum or product of a
sequence of numbers)4 . A number of these operations are already provided by
4
With a sequential loop, the execution time of a reduction is linear in the number of
operand values. With a naive parallel tree computation, the time complexity is log-
arithmic, for a linear number of processors. This is not cost-optimal. However, with
a commonly used trick, called Brent’s Theorem, the granularity of the parallelism
can be coarsened, i.e., the number of processors needed can be reduced to maintain
cost optimality in a shared-memory cost model [4].
MPI – programmers just need some encouragement to use them! The benefit
of an exclusive use of collective operations is that the more regular structure of
the program enables a better cost prediction. (BSP has a similar benefit.) With
architecture-specific implementations of collective operations, one can calibrate
program performance for a specific parallel computer [10].
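To make the cost comparison behind footnote 4 concrete, the standard figures for reducing n values with p processors are as follows (textbook facts, stated here only for illustration; T is parallel time and cost is p times T):
T_{\mathrm{seq}}   = \Theta(n)       \text{ with } p = 1,          \qquad \mathrm{cost} = \Theta(n)
T_{\mathrm{tree}}  = \Theta(\log n)  \text{ with } p = \Theta(n),  \qquad \mathrm{cost} = \Theta(n \log n) \ \text{(not cost-optimal)}
T_{\mathrm{Brent}} = \Theta(\log n)  \text{ with } p = n / \log n, \qquad \mathrm{cost} = \Theta(n) \ \text{(cost-optimal)}
In the last line, each processor first reduces a block of log n values sequentially, and the tree is applied only to the p partial results.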
One problem with libraries of high-performance modules is their lack of per-
formance compositionality: the sequential composition of two calls of highly
tuned implementations does, in general, not deliver the best performance. Better
performance can be obtained when one provides a specific implementation for
the composition of the two calls.
In general, it is hopeless to predict what compositions of calls users might
require. But, at a comparatively low level of abstraction, e.g., at the level of
collective operations, it is quite feasible to build a table of frequently occur-
ring compositions and their costs incurred on a variety of parallel architectures.
Gorlatch [11] and Kuchen [12] deal with this issue in their contributions to this
volume.
yields a multi-dimensional array, but these are not allocated contiguously and
need not be of equal length. Thus, the resulting structure is not necessarily
rectangular.
One reason why the multi-dimensional array is used so heavily in scientific
computing is that it can be mapped contiguously to memory and that array
elements can be referenced very efficiently by precomputing the constant part
of the index expression at compile time [18]. There are several ways to make
contiguous, rectangular, multi-dimensional arrays available in Java [19]. The
easiest and most portable is via a class library, in which a multi-dimensional
array is laid out one-dimensionally in row- or column-major order.
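As an added illustration (standard layout arithmetic, not specific to any of the cited libraries): in row-major order, an m x n array with element size s stores element (i, j) at a fixed offset from the base address,
\mathrm{address}(i,j) = \mathrm{base} + (i \cdot n + j) \cdot s , \qquad 0 \le i < m ,\; 0 \le j < n ,
so the parts of this expression that are constant in a given reference can be precomputed at compile time, which is the kind of address arithmetic referred to above [18].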
6 Preprocessors
6.1 Principle and Limitations
A preprocessor translates domain-specific language features into the general-
purpose host language and, in the process, can perform a context-sensitive anal-
ysis, follow up with context-sensitive optimizations and generate more appropri-
ate error messages (Fig. 2)5 . For example, the structure parameters mentioned
in the previous section could be generated more reliably by a preprocessor than
by the programmer.
Fig. 2. Preprocessor  [figure labels: domain-specific preprocessor, general-purpose compiler, target compiler]
While this approach allows for a more flexible optimization (since there is no
hard-wired case analysis), one remaining limitation is that no domain-specific
5
Compile-time error messages can still be cryptic if they are triggered by the general-
purpose code which the preprocessor generates.
optimizations can be performed below the level of the general-purpose host lan-
guage. Only the general-purpose compiler can optimize at that level.
Most HPF compilers are preprocessors for a compiler for Fortran 90, which
add calls to a library of domain-specific routines which, in turn, call MPI routines
to maintain portability. One example is the HPF compiler ADAPTOR with its
communications library DALIB [25].
There was justified hope for very efficient HPF programs. The general-purpose
source language, Fortran, is at a level of abstraction which is pleasingly familiar
to the community, yet comparatively close to the machine. Sequential Fortran is
supported by sophisticated compilers, which produce highly efficient code. And,
with the domain-specific run-time system for parallelism in form of the added
library routines, one could cater to the special needs in the domain – also with
regard to performance.
However, things did not quite work out as had been hoped. One requirement
of an HPF compiler is that it should be able to accept every legal Fortran pro-
gram. Since the HPF directives can appear anywhere and refer to any part of the
program, the compiler cannot be expected to react reasonably to all directives.
In principle, it can disregard any directive. The directives for data distributions
are quite inflexible and the compiler’s ability to deal with them depends on its
capabilities of analyzing the dependences in the program and transforming the
program to expose the optimal degree of parallelism and generate efficient code
for it. Both the dependence analysis and the code generation are still areas of
much research.
Existing HPF compilers deal well with fairly simple directives for scenarios
which occur quite commonly, like disjoint parallelism or blockwise and cyclic
data distributions. However, they are quite sensitive to less common or less
regular dependence patterns: even when they can handle them, they often do
not succeed in generating efficient code. Work on this continues in the area of
loop parallelization (see further on).
With some adjustments of the paradigm, e.g., a more explicit and realistic
commitment to what a compiler is expected to do and a larger emphasis on loop
parallelization, a data-parallel Fortran might still have a future.
A powerful geometric model for loop parallelization is the polytope model [27,
28], which lays out the steps of a loop nest, iterating over an array structure,
in a multi-dimensional, polyhedral coordinate space, with one dimension per
loop. The points in this space are connected by directed edges representing the
dependences between the loop iterations. With techniques of linear algebra and
linear programming, one can conduct an automatic, optimizing search for the
best mapping of the loop steps to time and space (processors) with respect to
some objective function like the number of parallel execution steps (the most
popular choice), the number of communications, a balanced processor load or
combinations of these or others.
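As a small illustration (a standard textbook example, not one taken from this chapter): the loop nest with body A[i][j] = A[i-1][j] + A[i][j-1] and bounds 1 <= i, j <= n has the dependence vectors (1,0) and (0,1), and one legal affine space-time mapping is
\theta(i,j) = i + j \quad \text{(time)} , \qquad \pi(i,j) = i \quad \text{(space)} ,
since \theta increases by exactly one along every dependence; all iterations on an anti-diagonal i + j = const can then execute in parallel, one per (virtual) processor.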
The polytope model comes with restrictions: the array indices and the loop
bounds must be affine, and the space-time mapping must be affine8. Recent
extensions allow mild violations of this affinity requirement – essentially, they
permit a constant number of breaks in the affinity9. This admits a larger set of
loop nests and yields better solutions.
Methods based on the polytope model are elegant and work well. The gran-
ularity of parallelism can also be chosen conveniently via a partitioning of the
iteration space, called tiling [29, 30]. The biggest challenge is to convert the solu-
tions found in the model into efficient code. Significant headway has been made
recently on how to avoid frequent run-time tests which guide control through
the various parts of the iteration space [31, 32].
Methods based on the polytope model have been implemented in various
prototypical preprocessors. One for C with MPI is LooPo [33]. These systems use
a number of well known schedulers (which compute temporal distributions) [34,
35] and allocators (which compute spatial distributions) [36, 37] for the optimized
search of a space-time mapping. Polytope methods still have to make their way
into production compilers.
7 Active Libraries
7.1 Principle and Limitations
The modules of an active library [39] are coded in two layers: the domain-specific
layer, whose language offers abstractions for the special implementation needs
of the domain, and the domain-independent layer in the host language. Thus,
pieces of host code can be combined with domain-specific combinators10 .
An active module can have several translations, each of which is optimized for
a particular call context, e.g., a specific set of call parameters. The preprocessor
is responsible for the analysis of the call context and for the corresponding
translation of the domain-specific module code. The general-purpose compiler
translates the pieces of host code which require no domain-specific treatment.
Arranging the program code in different layers, which are translated at dif-
ferent times or by different agents, is the principle of multi-stage programming
(see the contribution of Taha [42] to this volume).
There is the danger of code explosion if a module is called in many different
contexts.
[figure labels: domain-specific compiler, general-purpose compiler, target compiler]
parallel code from an abstract specification (a syntax tree), and the adaptation
of the target code to the context in which it appears, go on at run time. The
advantage is, of course, the wealth of information available at run time. With
an image filtering example, the authors demonstrate that the overhead incurred
by the run-time analysis and code generation can be recovered in just one pass
of an algorithm that iterates typically over many passes.
8 Two Compilers
8.1 Principle and Limitations
In order to allow context-sensitive, domain-specific optimizations below the level
of the host language, one needs two separate compilers which both translate to
the same target language; the two pieces of target code are then linked together
and translated further by the target language compiler (Fig. 3). Note that what
is composed in sequence in Fig. 2 is composed unordered here.
There needs to be some form of information exchange between the general-
purpose and the domain-specific side. This very important and most challenging
aspect is not depicted in the figure because it could take many forms. One
option is a (domain-specific) preprocessor which divides the source program into
domain-specific and general-purpose code and provides the linkage between both
sides.
The main challenge in this approach is to maintain a clean separation of the
responsibilities of the two compilers:
– The duplication of analysis or code generation effort by the two compilers
should be minimized. One would not want to reimplement significant parts
of the general-purpose compiler in the domain-specific compiler.
[figure labels: domain-specific compiler, target compiler]
digital signal processing, there are FFTW [46] and SPIRAL [47]. FFTW comes
close to the two-compiler idea and, since there is a parallel version of it (although
this seems to have been added as an afterthought), I include it here.
9 Conclusions
So far, the main aims of domain-specific program generation seem to have been
programming convenience and reliability. The perception of a need for domain-
specific program optimization is just emerging.
Even in high-performance parallelism, an area with much work on domain-
specific program optimization, most programmers favour programming at a low
level. The easiest approach for today’s programmer is to provide annotations of
the simplest kind, as in Cilk, GpH and JavaParty, or of a more elaborate kind
with HPF and OpenMP. Imposing more burden, but also offering more control
over distributed parallelism and communication is MPI.
The contributions on high-performance parallelism in this volume are doc-
umenting an increasing interest in abstraction. A first step of abstracting from
point-to-point communications in explicitly distributed programs is the use of
collective operations (as provided by MPI). The next abstraction is to go to
skeleton libraries as proposed by Bischof et al. [13] or Kuchen [12]. One step
further would be to develop an active library.
Acknowledgements
Thanks to Peter Faber, Martin Griebl and Christoph Herrmann for discussions.
Profound thanks to Don Batory, Albert Cohen and Paul Kelly for very useful
exchanges on content and presentation. Thanks also to Paul Feautrier for many
long discussions about domain-specific programming and program optimization.
The contact with Paul Feautrier and Albert Cohen has been funded by a Procope
exchange grant.
References
1. Réveillère, L., Mérillon, F., Consel, C., Marlet, R., Muller, G.: A DSL approach to
improve productivity and safety in device drivers development. In: Proc. Fifteenth
IEEE Int. Conf. on Automated Software Engineering (ASE 2000), IEEE Computer
Society Press (2000) 91–100
2. van Deursen, A., Klint, P., Visser, J.: Domain-specific languages: An annotated
bibliography. ACM SIGPLAN Notices 35 (2000) 26–36
3. Hammond, K., Michaelson, G.: The design of Hume: A high-level language for the
real-time embedded system domain (2004) In this volume.
4. Quinn, M.J.: Parallel Computing. McGraw-Hill (1994)
5. Robison, A.D.: Impact of economics on compiler optimization. In: Proc. ACM 2001
Java Grande/ISCOPE Conf., ACM Press (2001) 1–10
6. Pacheco, P.S.: Parallel Programming with MPI. Morgan Kaufmann (1997)
7. Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., Sunderam, V.:
PVM Parallel Virtual Machine, A User’s Guide and Tutorial for Networked Parallel
Computing. MIT Press (1994)
Project Web page: http://www.csm.ornl.gov/pvm/pvm_home.html.
8. Skillicorn, D.B., Hill, J.M.D., McColl, W.F.: Questions and answers about BSP.
Scientific Programming 6 (1997) 249–274
Project Web page: http://www.bsp-worldwide.org/.
9. Gorlatch, S.: Message passing without send-receive. Future Generation Computer
Systems 18 (2002) 797–805
10. Gorlatch, S.: Toward formally-based design of message passing programs. IEEE
Transactions on Software Engineering 26 (2000) 276–288
11. Gorlatch, S.: Optimizing compositions of components in parallel and distributed
programming (2004) In this volume.
12. Kuchen, H.: Optimizing sequences of skeleton calls (2004) In this volume.
13. Bischof, H., Gorlatch, S., Leshchinskiy, R.: Generic parallel programming using
C++ templates and skeletons (2004) In this volume.
14. Blackford, L.S., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Don-
garra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley,
R.C.: ScaLAPACK: A linear algebra library for message-passing computers. In:
Proc. Eighth SIAM Conf. on Parallel Processing for Scientific Computing, Society
for Industrial and Applied Mathematics (1997) 15 (electronic) Project Web page:
http://www.netlib.org/scalapack/.
15. van de Geijn, R.: Using PLAPACK: Parallel Linear Algebra Package. Scien-
tific and Engineering Computation Series. MIT Press (1997) Project Web page:
http://www.cs.utexas.edu/users/plapack/.
16. Herrmann, C.A.: The Skeleton-Based Parallelization of Divide-and-Conquer Re-
cursions. PhD thesis, Fakultät für Mathematik und Informatik, Universität Passau
(2001) Logos-Verlag.
17. Herrmann, C.A., Lengauer, C.: HDC: A higher-order language for divide-and-
conquer. Parallel Processing Letters 10 (2000) 239–250
18. Aho, A.V., Sethi, R., Ullman, J.D.: Compilers – Principles, Techniques, and Tools.
Addison-Wesley (1986)
19. Moreira, J.E., Midkiff, S.P., Gupta, M.: Supporting multidimensional arrays in
Java. Concurrency and Computation – Practice & Experience 13 (2003) 317–340
20. Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the Cilk-5 mul-
tithreaded language. ACM SIGPLAN Notices 33 (1998) 212–223 Proc. ACM SIG-
PLAN Conf. on Programming Language Design and Implementation (PLDI’98).
Project Web page: http://supertech.lcs.mit.edu/cilk/.
21. Trinder, P.W., Hammond, K., Loidl, H.W., Peyton Jones, S.L.: Algorithm + strat-
egy = parallelism. J. Functional Programming 8 (1998) 23–60 Project Web page:
http://www.cee.hw.ac.uk/~dsg/gph/.
22. Philippsen, M., Zenger, M.: JavaParty – transparent remote objects in Java.
Concurrency: Practice and Experience 9 (1997) 1225–1242 Project Web page:
http://www.ipd.uka.de/JavaParty/.
23. Koelbel, C.H., Loveman, D.B., Schreiber, R.S., Steele, Jr., G.L., Zosel, M.E.: The
High Performance Fortran Handbook. Scientific and Engineering Computation.
MIT Press (1994)
24. Foster, I.: Designing and Building Parallel Programs. Addison-Wesley (1995)
25. Brandes, T., Zimmermann, F.: ADAPTOR—a transformation tool for HPF pro-
grams. In Decker, K.M., Rehmann, R.M., eds.: Programming Environments for
Massively Distributed Systems. Birkhäuser (1994) 91–96
26. Dagum, L., Menon, R.: OpenMP: An industry-standard API for shared-memory
programming. IEEE Computational Science & Engineering 5 (1998) 46–55 Project
Web page: http://www.openmp.org/.
27. Lengauer, C.: Loop parallelization in the polytope model. In Best, E., ed.: CON-
CUR’93. LNCS 715, Springer-Verlag (1993) 398–416
28. Feautrier, P.: Automatic parallelization in the polytope model. In Perrin, G.R.,
Darte, A., eds.: The Data Parallel Programming Model. LNCS 1132. Springer-
Verlag (1996) 79–103
29. Andonov, R., Balev, S., Rajopadhye, S., Yanev, N.: Optimal semi-oblique tiling.
In: Proc. 13th Ann. ACM Symp. on Parallel Algorithms and Architectures (SPAA
2001), ACM Press (2001)
30. Griebl, M., Faber, P., Lengauer, C.: Space-time mapping and tiling – a helpful
combination. Concurrency and Computation: Practice and Experience 16 (2004)
221–246 Proc. 9th Workshop on Compilers for Parallel Computers (CPC 2001).
31. Quilleré, F., Rajopadhye, S., Wilde, D.: Generation of efficient nested loops from
polyhedra. Int. J. Parallel Programming 28 (2000) 469–498
32. Bastoul, C.: Generating loops for scanning polyhedra. Technical Re-
port 2002/23, PRiSM, Versailles University (2002) Project Web page:
http://www.prism.uvsq.fr/~cedb/bastools/cloog.html.
33. Griebl, M., Lengauer, C.: The loop parallelizer LooPo. In Gerndt, M., ed.: Proc.
Sixth Workshop on Compilers for Parallel Computers (CPC’96). Konferenzen des
Forschungszentrums Jülich 21, Forschungszentrum Jülich (1996) 311–320 Project
Web page: http://www.infosun.fmi.uni-passau.de/cl/loopo/.
34. Feautrier, P.: Some efficient solutions to the affine scheduling problem. Part I.
One-dimensional time. Int. J. Parallel Programming 21 (1992) 313–348
35. Feautrier, P.: Some efficient solutions to the affine scheduling problem. Part II.
Multidimensional time. Int. J. Parallel Programming 21 (1992) 389–420
36. Feautrier, P.: Toward automatic distribution. Parallel Processing Letters 4 (1994)
233–244
37. Dion, M., Robert, Y.: Mapping affine loop nests: New results. In Hertzberger, B.,
Serazzi, G., eds.: High-Performance Computing & Networking (HPCN’95). LNCS
919. Springer-Verlag (1995) 184–189
38. Guyer, S.Z., Lin, C.: Optimizing the use of high-performance software libraries.
In Midkiff, S.P., Moreira, J.E., Gupta, M., Chatterjee, S., Ferrante, J., Prins, J.,
Pugh, W., Tseng, C.W., eds.: 13th Workshop on Languages and Compilers for
Parallel Computing (LCPC 2000). LNCS 2017, Springer-Verlag (2001) 227–243
39. Czarnecki, K., Eisenecker, U., Glück, R., Vandevoorde, D., Veldhuizen, T.: Gener-
ative programming and active libraries (extended abstract). In Jazayeri, M., Loos,
R.G.K., Musser, D.R., eds.: Generic Programming. LNCS 1766, Springer-Verlag
(2000) 25–39
40. Hoare, C.A.R.: Communicating Sequential Processes. Series in Computer Science.
Prentice-Hall Int. (1985)
41. Herrmann, C.A., Lengauer, C.: Using metaprogramming to parallelize functional
specifications. Parallel Processing Letters 12 (2002) 193–210
42. Taha, W.: A gentle introduction to multi-stage programming (2004) In this volume.
43. Kennedy, K., Broom, B., Cooper, K., Dongarra, J., Fowler, R., Gannon, D., Johns-
son, L., Mellor-Crummey, J., Torczon, L.: Telescoping languages: A strategy for
automatic generation of scientific problem solving systems from annotated libraries.
J. Parallel and Distributed Computing 61 (2001) 1803–1826
44. Beckmann, O., Houghton, A., Mellor, M., Kelly, P.: Run-time code generation in
C++ as a foundation for domain-specific optimisation (2004) In this volume.
45. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimizations of
software and the ATLAS project. Parallel Computing 27 (2001) 3–35 Project Web
page: http://math-atlas.sourceforge.net/.
46. Frigo, M., Johnson, S.G.: FFTW: An adaptive software architecture for the
FFT. In: Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing
(ICASSP’98). Volume 3. (1998) 1381–1384
Project Web page: http://www.fftw.org/.
47. Püschel, M., Singer, B., Xiong, J., Moura, J.F.F., Johnson, J., Padua, D., Veloso,
M., Johnson, R.W.: SPIRAL: A generator for platform-adapted libraries of signal
processing algorithms. Int. J. High Performance Computing Applications (2003)
To appear. Project Web page: http://www.ece.cmu.edu/˜spiral/.
48. Frigo, M.: A fast Fourier transform compiler. ACM SIGPLAN Notices 34 (1999)
169–180 Proc. ACM SIGPLAN Conf. on Programming Language Design and Im-
plementation (PLDI’99).
49. Aldinucci, M., Gorlatch, S., Lengauer, C., Pelagatti, S.: Towards parallel program-
ming by transformation: The FAN skeleton framework. Parallel Algorithms and
Applications 16 (2001) 87–121
50. Kuchen, H., Cole, M.: The integration of task and data parallel skeletons. Parallel
Processing Letters 12 (2002) 141–155
A Personal Outlook on Generator Research
(A Position Paper)
Yannis Smaragdakis
1 Introduction
This chapter is a personal account of my past work and current thoughts on
research in software generators and the generators research community.
As an opinion piece, this chapter contains several unsubstantiated claims and
(hopefully) many opinions the reader will find provocative. It also makes liberal
use of the first person singular. At the same time, whenever I use the first person
plural, I try to not have it mean the “royal ‘we’ ” but instead to speak on behalf
of the community of generators researchers.
There are two ways to view the material of this chapter. The first is as a
threat-analysis for the area of domain-specific program generation. Indeed, a
lot of the discussion is explicitly critical. For instance, although I believe that
domain-specific program generation has tremendous potential, I also feel that the
domain-specificity of the area can limit the potential for knowledge transfer and
deep research. Another way to view this chapter, however, is as an opportunity-
analysis: based on a critical view of the area, I try to explicitly identify the
directions along which both research and community-building in software gen-
erators can have the maximum impact.
I will begin with a description of my background in generators research. This
is useful to the reader mainly as a point of reference for my angle and outlook.
2 My Work in Generators
A large part of my past and present research is related to software generators.
I have worked on two different transformation systems (Intentional Program-
ming and JTS), on the DiSTiL generator, on the Generation Scoping facility, on
C++ Template libraries and components, and on the GOTECH framework for
generation of EJB code.
2.2 DiSTiL
The DiSTiL domain-specific language [12] is an extension to C that allows the
user to compose data structure components to form very efficient combinations
of data structures. The language has a declarative syntax for specifying data
structure operations: the user can define traversals over data through a pred-
icate that the data need to satisfy. The DiSTiL generator can then perform
optimizations based on the static parts of the predicate. Optimizations include
the choice of an appropriate data structure, if multiple are available over the
same data.
The following (slightly simplified) DiSTiL source code fragment shows the
main elements of the language, including the data structure definition (typeq1
and cont1 definitions), cursor predicate (curs1 definition) and traversal key-
words (foreach, ref).
...
foreach(curs1)
... ref(curs1, name) ...
// DiSTiL operations mixed with C code
This example code shows a data structure organizing the same data in two
ways: using a hash table (the Hash component in the above code) on the “phone”
field and using a red-black tree (Tree) on the “name” field of the data records.
The data are stored transiently in memory (Transient) and allocated dynam-
ically on the heap (Malloc). The cursor shown in the example code is defined
using a predicate on the “name” field. Therefore, DiSTiL will generate efficient
code for all uses of this cursor by using the red-black tree structure. Should the
cursor predicate or the data structure specification change in the future, DiSTiL
will generate efficient code for the new requirements without needing to change
the data structure traversal code.
2.3 Generation Scoping
Consider the generation of code using code template operators quote (‘) and
unquote ($). The generator code may contain, for instance, the expression:
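An illustrative quoted fragment in this spirit – a sketch using the fp and fopen names discussed next, not the original example – might be:

‘( fp = fopen("data.txt", "r") )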
The question becomes, what are the bindings of free variables in this expres-
sion? What is the meaning of fp or even fopen? This is the scoping problem
for generated code and it has been studied extensively in the hygienic macros
literature [5, 8] for the case of pattern-based generation. Generation scoping [16]
is a mechanism that gives a similar solution for the case of programmatic (i.e.,
not pattern-based) generation. The mechanism adds a new type, Env, and a new
keyword, environment, that takes an expression of type Env as an argument.
environment works in conjunction with quote and unquote – all generated
code fragments under an environment(e) scope have their variable declarations
inserted in e and their identifiers bound to variables in e. For example, the
following code fragment demonstrates the generation scoping syntax:
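A hypothetical sketch of this syntax – only Env, environment, quote and unquote are taken from the description above; the variable and helper names are assumptions – might look like:

Env e;
environment (e)
  stmt = ‘{ FILE *fp;
            fp = fopen( $(fname), "r" ); };

Under environment(e), the generated declaration of fp is inserted into e, and later quoted uses of fp are bound in e.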
That is, a mixin layer in C++ is a class template, T, that inherits from its
template parameter, S, while it contains nested classes that inherit from the
corresponding nested classes of S. In this way, a mixin layer can inherit entire
classes from other layers, while by composing layers (e.g., T<A>) the programmer
can form inheritance hierarchies for a whole set of inter-related classes (like
T<A>::I1, T<A>::I2, etc.).
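A minimal sketch of this shape (T, S, I1, I2 and A follow the description above; the class bodies are placeholders):

template <class S>
class T : public S {
public:
  class I1 : public S::I1 { /* refinements of the inherited I1 */ };
  class I2 : public S::I2 { /* refinements of the inherited I2 */ };
};
// composing layers, e.g. T<A>, yields the hierarchies T<A>::I1, T<A>::I2, ...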
My C++ templates work also includes FC++ [9, 10]: a library for functional
programming in C++. FC++ offers much of the convenience of programming in
Haskell without needing to extend the C++ language. Although the novelty and
value of FC++ is mostly in its type system, the latest FC++ versions make use
of C++ template meta-programming to also extend C++ with a sub-language
for expressing lambdas and various monad-related syntactic conveniences (e.g.,
comprehensions).
2.5 GOTECH
The GOTECH system is a modular generator that transforms plain Java classes
into Enterprise Java Beans – i.e., classes conforming to a complex specification
(J2EE) for supporting distribution in a server-side setting. The purpose of the
transformation is to make these classes accessible from remote machines, i.e.,
to turn local communication into distributed. In GOTECH, the programmer
marks existing classes with unobtrusive annotations (inside Java comments).
The annotations contain simple settings, such as:
/**
*@ejb:bean name = "SimpleClass"
* type = "stateless"
* jndi-name = "ejb/test/simple"
* semantics = "by-copy"
*/
From such information, the generator creates several pieces of Java code and
meta-data conforming to the specifications for Enterprise Java Beans. These in-
clude remote and local interfaces, a deployment descriptor (meta-data describing
how the code is to be deployed), etc. Furthermore, all clients of the annotated
class are modified to now make remote calls to the new form of the class. The
modifications are done elegantly by producing code in the AspectJ language [7]
that takes care of performing the necessary redirection. (AspectJ is a system
that allows the separate specification of dispersed parts of an application’s code
and the subsequent composition of these parts with the main code body.) As a
result, GOTECH is a modular, template-based generator that is easy to change,
even for end-users.
In what follows, I explain why I think GPCE is important and why the generators community needs something
like what I imagine GPCE becoming. All of my observations concern not just
GPCE but the overall generators community. At the same time, all opinions
are, of course, only mine and not necessarily shared among GPCE organizers.
When I speak of “generators conferences”, the ones I have in mind are ICSR
(the International Conference on Software Reuse), GPCE, as well as the older
events DSL (the Domain-Specific Languages conference), SAIG (the workshop
on Semantics, Applications and Implementation of Program Generation), and
GCSE (the Generative and Component-based Software Engineering conference).
Of course, much of the work on generators also appears in broader conferences
(e.g., in the Object-Oriented Programming or Automated Software Engineering
communities) and my observations also apply to the generators-related parts of
these venues.
Is there something wrong with the current state of research in generators or
the current state of the generators scientific community? One can certainly argue
that the community is alive and well, good research is being produced, and one
cannot improve research quality with strategic decisions anyway. Nevertheless,
I will argue that there are some symptoms that suggest we can do a lot better.
These symptoms are to some extent shared by our neighboring and surrounding
research communities – those of object-oriented and functional programming,
as well as the broader Programming Languages and Software Engineering com-
munities. I do believe, however, that some of the symptoms outlined below are
unique and the ones that are shared are even more pronounced in the generators
community. By necessity, my comments on community building are specific to
current circumstances, but I hope that my comments on the inherent difficulties
of generator research are general.
3.1 Symptoms
Relying on Other Communities. The generators community is derivative, to
a larger extent than it should be. This means that we often expect technical
solutions from the outside. The solution of fundamental problems that have
a direct impact on the generators community is often not even considered our
responsibility. Perhaps this is an unfair characterization, but I often get the
impression that we delegate important conceptual problems to the programming
languages or systems communities. A lot of the interactions between members of
the generators community and researchers in, say, programming languages (but
outside generators) take the form of “what cool things did you guys invent lately
that we can use in generators?”.
Although I acknowledge that my symptom description is vague, I did want
to state this separately from the next symptom, which may be a cause as well
as an expression of this one.
No Prestigious Publication Venue. The publication outlets of the generators community are not considered prestigious by many of the researchers doing generators work. Most of us prefer to publish our best results elsewhere.
Of course, this is a chicken-and-egg problem: if the publication outlets are not
prestigious, people will not submit their best papers. But if people do not submit
their best papers, the publication outlets will remain non-prestigious. I don’t
know if GPCE will overcome this obstacle, but I think it has a chance to do
so. GPCE integrates both people who are interested in generators applications
(the Software Engineering side) and people who work on basic mechanisms for
generators (the Programming Languages side). GPCE is a research conference:
the results that it accepts have to be new contributions to knowledge and not
straightforward applications of existing knowledge. Nevertheless, research can be
both scientific research (i.e., research based on analysis) and engineering research
(i.e., research based on synthesis). Both kinds of work are valuable to GPCE. The
hope is that by bringing together the Software Engineering and the Programming
Languages part of the community, the result will be a community with both
strength in numbers but also a lively, intellectually stimulating exchange of ideas.
Limited Impact. A final, and possibly the most important, symptom of the
problems of our community has to do with the impact we have had in prac-
tice. There are hundreds of nice domain-specific languages out there. There are
several program generation tools. A well-known software engineering researcher
recently told me (upon finding out I work on generators) “You guys begin to
have impact! I have seen some very nice domain-specific languages for XYZ.” I
was embarrassed to admit that I could not in good conscience claim any credit.
Can we really claim such an impact? Or were all these useful tools developed in
complete isolation from research in software generators? If we do claim impact,
is it for ideas, for tools, or for methodologies? In the end, when a new generator
is designed, domain experts are indispensable. Does the same hold for research
results?
One can argue that this symptom is shared with the programming languages
research community. Nevertheless, I believe the problem is worse for us. The de-
signers of new general purpose programming languages (e.g., Java, C#, Python,
etc.) may not have known the latest related research for every aspect of their
design. Nevertheless, they have at least read some of the research results in lan-
guage design. In contrast, many people develop useful domain-specific languages
without ever having read a single research paper on generators.
3.2 Causes?
If we agree that the above observations are indeed symptoms of a problem,
then what is the cause of that problem? Put differently, what are the general
obstacles to having a fruitful and impactful research community in domain-
specific program generation? I believe there are two main causes of many of the
difficulties encountered by the generators community.
In the next two sections, I try to discuss in more detail these two causes.
By doing this, I also identify what I consider promising approaches to domain-
specific program generation research.
4 Domain-Specificity
4.1 Lessons That Transcend Domains
In generators conferences, one finds several papers that tell a similarly-structured
tale: “We made this wonderful generator for domain XYZ. We used these tools.”
Although this paper structure can certainly be very valuable, it often degenerates
into a “here’s what I did last summer” paper. A domain-specific implementation
may be valuable to other domain experts, but the question is, what is the value
to other generators researchers and developers who are not domain experts? Are
the authors only providing an example of the success of generators but without
offering any real research benefit to others? If so, isn’t this really not a research
community but a birds-of-a-feather gathering?
Indeed, I believe we need to be very vigilant in judging technical contributions
according to the value they offer to other researchers. In doing this, we could
establish some guidelines about what we expect to see in a good domain-specific
paper. Do we want an explicit “lessons learned” section? Do we want authors
to outline what part of their expertise is domain-independent? Do we want an
analysis of the difficulties of the domain, in a form that will be useful to future
generators’ implementors for the same domain? I believe it is worth selecting a
few good domain-specific papers and using them as examples of what we would
like future authors to address.
the generator writer. The other two reasons, however, are variables that good
generator infrastructure can change. In other words, good infrastructure can
result in more successful generators.
Given that our goal is to help generators impose fewer dependencies and fit
better with the rest of a program, an interesting question is whether a generator
should be regarded as a tool or as a language. To clarify the question, let’s char-
acterize the two views of a generator a little more precisely. Viewing a generator
as a language means treating it as a closed system, where little or no inspection
of the output is expected. Regarding a generator as a tool means to support a
quick-and-dirty implementation and shift some responsibility to the user: some-
times the user will need to understand the generated code, ensure good fit with
the rest of the application, and even maintain generated code.
The two viewpoints have different advantages and applicability ranges. For
example, when the generator user is not a programmer, the only viable op-
tion is the generator-as-a-language viewpoint. The generator-as-a-language ap-
proach is high-risk, however: it requires buy-in by generator users because it
adds the generator as a required link in the dependency chain. At the same
time, it implies commitment to the specific capabilities supported by the gen-
erator. The interconnectivity and debugging issues are also not trivial. In sum-
mary, the generator-as-a-language approach can only be valuable in the case of
well-developed generators for mature domains. Unfortunately, this case is almost
always in the irrelevant range for generator infrastructure. Research on generator
infrastructure will very rarely have any impact on generators that are developed
as languages. If such a generator is successful, its preconditions for success are
such that they make the choice of infrastructure irrelevant.
Therefore, I believe the greatest current promise for generator research with
impact is on infrastructure for generators that follow the generator-as-a-tool
viewpoint. Of course, even this approach has its problems: infrastructure for
such generators may be trivial – as, for instance, in the case of the “wizards”
in Microsoft tools that generate code skeletons using simple text templates.
Nonetheless, the “trivial” case is rare in practice. Most of the time a good
generator-as-a-tool needs some sophistication – at the very least to the level
of syntactic and simple semantic analysis.
How can good infrastructure help generators succeed? Recall that we want
to reduce dependencies on generator tools and increase the fit of generated code
to existing code. Based on these two requirements, I believe that a few good
principles for a generator-as-a-tool are the following:
For instance, recall the DiSTiL generator that I mentioned in Section 2. DiSTiL
is an extension to the C language. The DiSTiL keywords are obtrusive, however,
and the DiSTiL generated code is weaved through the C code of the application
for efficiency. I reproduce below the DiSTiL source code fragment shown earlier:
...
foreach(curs1)
... ref(curs1, name) ...
// DiSTiL operations mixed with C code
References
1. William Aitken, Brian Dickens, Paul Kwiatkowski, Oege de Moor, David Richter,
and Charles Simonyi, “Transformation in Intentional Programming”, in Prem De-
vanbu and Jeffrey Poulin (eds.), proc. 5th International Conference on Software
Reuse (ICSR ’98), 114-123, IEEE CS Press, 1998.
2. Paul Bassett, Framing Software Reuse: Lessons from the Real World, Yourdon Press,
Prentice Hall, 1997.
3. Don Batory, Bernie Lofaso, and Yannis Smaragdakis, “JTS: Tools for Implementing
Domain-Specific Languages”, in Prem Devanbu and Jeffrey Poulin (eds.), proc. 5th
International Conference on Software Reuse (ICSR ’98), 143-155, IEEE CS Press,
1998.
2 E.g., ((lambda (x) (list x (list 'quote x))) '(lambda (x) (list x (list 'quote x)))) in Lisp.
4. Don Batory, Gang Chen, Eric Robertson, and Tao Wang, “Design Wizards and
Visual Programming Environments for GenVoca Generators”, IEEE Transactions
on Software Engineering, 26(5), 441-452, May 2000.
5. William Clinger and Jonathan Rees, “Macros that work”, Eighteenth Annual ACM
Symposium on Principles of Programming Languages (PoPL ’91), 155-162, ACM
Press, 1991.
6. Krzysztof Czarnecki and Ulrich Eisenecker, Generative Programming: Methods,
Techniques, and Applications, Addison-Wesley, 2000.
7. Gregor Kiczales, Erik Hilsdale, Jim Hugunin, Mik Kersten, Jeffrey Palm, and
William G. Griswold, “An Overview of AspectJ”, in Jørgen Lindskov Knudsen
(ed.), proc. 15th European Conference on Object-Oriented Programming (ECOOP
’01). In Lecture Notes in Computer Science (LNCS) 2072, Springer-Verlag, 2001.
8. Eugene Kohlbecker, Daniel P. Friedman, Matthias Felleisen, and Bruce Duba, “Hy-
gienic macro expansion”, in Richard P. Gabriel (ed.), proc. ACM SIGPLAN ’86
Conference on Lisp and Functional Programming, 151-161, ACM Press, 1986.
9. Brian McNamara and Yannis Smaragdakis, “Functional programming in C++”,
in Philip Wadler (ed.), proc. ACM SIGPLAN 5th International Conference on
Functional Programming (ICFP ’00), 118-129, ACM Press, 2000.
10. Brian McNamara and Yannis Smaragdakis, “Functional Programming with the
FC++ Library”, Journal of Functional Programming (JFP), Cambridge University
Press, to appear.
11. Charles Simonyi, “The Death of Computer Languages, the Birth of Intentional
Programming”, NATO Science Committee Conference, 1995.
12. Yannis Smaragdakis and Don Batory, “DiSTiL: a Transformation Library for Data
Structures”, in J. Christopher Ramming (ed.), Conference on Domain-Specific Lan-
guages (DSL ’97), 257-269, Usenix Association, 1997.
13. Yannis Smaragdakis and Don Batory, “Implementing Reusable Object-Oriented
Components”, in Prem Devanbu and Jeffrey Poulin (eds.), proc. 5th International
Conference on Software Reuse (ICSR ’98), 36-45, IEEE CS Press, 1998.
14. Yannis Smaragdakis and Don Batory, “Implementing Layered Designs with Mixin
Layers”, in Eric Jul (ed.), 12th European Conference on Object-Oriented Program-
ming (ECOOP ’98), 550-570. In Lecture Notes in Computer Science (LNCS) 1445,
Springer-Verlag, 1998.
15. Yannis Smaragdakis and Don Batory, “Application Generators”, in J.G. Webster
(ed.), Encyclopedia of Electrical and Electronics Engineering, John Wiley and Sons
2000.
16. Yannis Smaragdakis and Don Batory, “Scoping Constructs for Program Genera-
tors”, in Krzysztof Czarnecki and Ulrich Eisenecker (eds.), First Symposium on
Generative and Component-Based Software Engineering (GCSE ’99), 65-78. In
Lecture Notes in Computer Science (LNCS) 1799, Springer-Verlag, 1999.
17. Yannis Smaragdakis and Don Batory, “Mixin Layers: an Object-Oriented Imple-
mentation Technique for Refinements and Collaboration-Based Designs”, ACM
Trans. Softw. Eng. and Methodology (TOSEM), 11(2), 215-255, April 2002.
18. Eli Tilevich, Stephan Urbanski, Yannis Smaragdakis and Marc Fleury, “Aspectiz-
ing Server-Side Distribution”, in proc. 18th IEEE Automated Software Engineering
Conference (ASE’03), 130-141, IEEE CS Press, 2003.
19. Todd Veldhuizen, “Scientific Computing in Object-Oriented Languages web page”,
http://www.oonumerics.org/
Generic Parallel Programming
Using C++ Templates and Skeletons

Holger Bischof, Sergei Gorlatch, and Roman Leshchinskiy
1 Introduction
The domain of parallel programming is known to be error-prone and strongly
performance-driven. A promising approach to cope with both problems is based
on skeletons. The term skeleton originates from the observation that many par-
allel applications share a common set of computation and interaction patterns
such as pipelines and data-parallel computations. The use of skeletons (as op-
posed to programming each application “from scratch”) has many advantages,
such as offering higher-level programming interfaces, opportunities for formal
analysis and transformation, and the potential for generic implementations that
are both portable and efficient.
Domain-specific features must be taken into account when using general-
purpose programming languages like C, C++, Java, etc. in the area of parallel
programming. Modern C++ programming makes extensive use of templates to
program generically, e. g. to design one program that can be simply adapted for
various particular data types without additional re-engineering. This concept
has been realized efficiently in the Standard Template Library (STL) [1], which
is part of the standard C++ library. Owing to its convenient generic features
and efficient implementation, the STL has been employed extensively in many
application fields by a broad user community.
Our objective is to enable an efficient execution of STL programs on parallel
machines while allowing C++ programmers to stay in the world of STL program-
ming. We concentrate on so-called function templates, which are parameterized
patterns of computations, and use them to implement skeletons.
We introduce the DatTeL, a new data-parallel library that provides the user
with parallelism options at two levels: (1) using annotations for sequential pro-
grams with function templates, and (2) calling parallel functions of the library.
The presentation is illustrated throughout the chapter by a single case study – carry-lookahead addition – for which we demonstrate the development from a sequential, STL-based solution to a parallel implementation.
The structure of the chapter, with its main contributions is as follows:
– We present the general concept of the DatTeL and the role distribution
between application programmers, systems programmers and skeleton pro-
grammers in the program development process (Sect. 2).
– We introduce skeletons as patterns of parallelism, and formulate our case
study – the carry-lookahead addition – in terms of skeletons (Sect. 3).
– We describe the concepts of the STL, show their correspondence to skeletons,
and express the case study using the STL (Sect. 4).
– We present the architecture of the DatTeL library and show how it is used
to parallelize the case study (Sect. 5).
– We discuss the main features of the DatTeL implementation (Sect. 6).
– We report measurements for our resulting DatTeL-program, which show
competitive performance as compared to low-level sequential code, and indi-
cate the potential for good parallel speedup in realistic applications on both
shared-memory and distributed-memory platforms (Sect. 7).
Finally, we compare our approach with related work, draw conclusions and dis-
cuss possibilities for future research.
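A minimal sketch of the class template definition discussed next, assuming the usual form of a generic container, is:

#include <cstddef>

template <class T>
class vector {
  T*          data;
  std::size_t length;
public:
  T& operator[](std::size_t i) { return data[i]; }
  // constructors, begin()/end() iterators, etc. omitted for brevity
};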
The above definition defines a family of classes parametrized over the type
variable T. In order to use vector it must be instantiated, i. e. applied to a
concrete type:
vector<int> intvector;
vector<float *> ptrvector;
What sets C++ apart from other languages with similar constructs is the
ability to define different implementations of generics depending on the types
they are applied to. For instance, we might use a specialized implementation for
vectors of pointers (a technique often used to avoid code bloat):
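A sketch of the usual idiom, assuming the vector template shown above, is:

// partial specialization for pointer element types: every vector<T*>
// reuses the single vector<void*> implementation underneath
template <class T>
class vector<T*> {
  vector<void*> impl;
public:
  T*& operator[](std::size_t i) {
    return reinterpret_cast<T*&>(impl[i]);
  }
  // the remaining interface is forwarded to impl in the same way
};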
With this definition, ptrvector declared before benefits from the more ef-
ficient implementation automatically and transparently to the user. Moreover,
no run-time overhead is incurred as this kind of polymorphism is resolved at
compile-time. The technique is not restricted to classes – algorithms can be
specialized in a similar manner.
This example demonstrates two key features of C++ templates: unification
on type parameters and statically evaluated computations on types. In fact,
the C++ type system is basically a Turing-complete functional programming
language interpreted by the compiler [2].
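A standard illustration of such compile-time computation (not taken from this chapter) is a template that the compiler evaluates recursively:

template <int N>
struct Factorial {
  static const int value = N * Factorial<N - 1>::value;
};

template <>                                  // specialization ends the recursion
struct Factorial<0> {
  static const int value = 1;
};

static const int f5 = Factorial<5>::value;   // 120, computed at compile time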
A wide range of template-based approaches to compile-time optimization and
code generation make use of this property, most notably traits [3] for specifying
and dispatching on properties of types and expression templates [4] for eliminat-
ing temporaries in vector and matrix expressions. These techniques have given
rise to the concept of active libraries [5], which not only provide preimplemented
components but also perform optimizations and generate code depending on how
these components are used. Frequently, such libraries achieve very good perfor-
mance, sometimes even surpassing hand-coded implementations – Blitz++ [6]
and MTL [7] are two notable examples. More importantly, however, they provide
an abstract, easy-to-use interface to complex algorithms without a large penalty
in performance.
Not surprisingly, these advantages have made C++ an interesting choice for
implementing libraries for parallel programming, a domain where both high per-
formance and a sufficiently high level of abstraction are of crucial importance. For
instance, templates are exploited in POOMA, a successful parallel framework for
scientific computing [8]. In the following, we investigate how C++ templates can
be used to implement a skeleton-based, data-parallel library which completely
hides the peculiarities of parallel programming from the user while still allowing
a high degree of control over the execution of the parallel program.
The STL separates containers (such as vectors and lists) from algorithms (such as finding, merging and sorting). The two are
connected through the use of iterators, which are classes that know how to read or
write particular containers, without exposing the actual type of those containers.
As an example we give the STL implementation to compute the partial sums of
a vector of integers.
vector<int> v;
partial_sum(v.begin(), v.end(), v.begin(), plus<int>());
The STL function partial_sum has four arguments, the first and second
describing the range of the first input vector, the third pointing to the begin-
ning of the result vector and the last pointing to the parameter function. The
two methods, begin() and end() of the STL class vector, return instances
of vector::iterator, marking the beginning and end of the vector. The STL
template class plus implements the binary operation addition.
As part of the standard C++ library, the STL provides a uniform interface
to a large set of data structures and algorithms. Moreover, it allows user-defined
components to be seamlessly integrated into the STL framework provided they
satisfy certain requirements or concepts. When implementing a data-parallel
library which provides parallel data structures and operations on them, adhering
to the STL interface decreases the learning time for the users and ultimately
results in programs which are easier to understand and maintain. However, the
STL interface has been designed for sequential programming and cannot be
directly used in a parallel setting. Our goal is to study how the STL interface
can be enhanced to support data parallelism.
2.3 Skeletons
The work on skeletons has been mostly associated with functional languages,
where skeletons are modeled as higher-order functions. Among the various skele-
ton-related projects are those concerned with the definition of relatively simple,
basic skeletons for parallel programming, and their implementation. For example,
the two well-known list processing operators map and reduce are basic skeletons
with inherent parallelism (map corresponds to an element-wise operation and
reduce to a reduction).
The main advantage of using well-defined basic skeletons is the availability of
a formal framework for program composition. This allows a rich set of program
transformations (e. g. transforming a given program in a semantically-equivalent,
more efficient one) to be applied. In addition, cost measures can be associated
with basic skeletons and their compositions.
Existing programming systems based on skeletons include P3L [9], HDC
[10], and others. Recently, two new libraries have been proposed. The skeleton
library [11] provides both task- and data-parallel skeletons in form of C++
function templates rather than within a new programming language. The eSkel
library [12] is an extension to MPI, providing some new collective operations
and auxiliary functions.
In the DatTeL, the STL partial-sums example shown earlier needs only its container type replaced by the parallel counterpart:

par::vector<int> v;
partial_sum(v.begin(), v.end(), v.begin(), plus<int>());
Like the STL, the DatTeL is extensible at every level. A programmer can
use the library to implement a parallel application, extend the library to provide
better support for the task at hand, or port it to a different parallel architecture.
– The application programmer uses a high-level library (in our case, DatTeL)
that extends the concepts and interfaces of the STL and consists of a number
of ready-to-use building blocks for implementing parallel programs.
– The skeleton programmer is responsible for implementing generic skeletons
and adding new ones. The need for new skeletons arises when implementing
a specific application; such skeletons can often be reused in later projects.
The DatTeL library supports the development of reusable algorithms by
providing well-defined concepts and interfaces similar to those of the STL.
When faced with the task of implementing a new application, the DatTeL
user can choose between two approaches:
1. Existing STL code can be reused and adapted to the parallel facilities pro-
vided by the DatTeL. The high degree of compatibility with the STL ensures
that this is possible in many cases with little or no change to the code.
2. DatTeL-based programs can be developed from scratch. In this case, users
can utilize their knowledge of STL programming styles and techniques and
rapidly prototype the code by testing it with sequential algorithms and data
structures provided by the STL and later replacing these with their parallel
counterparts. We believe this to be a time- and cost-efficient approach to
parallel programming.
Frontend. Like the STL, the DatTeL library provides the programmer with a
wide range of parallel data structures and algorithms. The DatTeL also adopts
the STL’s notion of concepts, i. e. abstract, generic interfaces implemented by
both predefined and user-defined components. Concepts such as iterators and
sequences are generalized and modified to work in a parallel setting and new
concepts such as data distributions are introduced.
The classical addition algorithm scans over the bits of the two input values
a and b from the least significant to the most significant bit. The i-th bit of
the result is computed by adding the i-th bits of both input values, ai and bi ,
to the overflow of the less significant bit, oi−1 . An overflow bit oi is set iff at
least two of the three bits ai , bi , oi−1 are set. Obviously, this algorithm could be
specified using the scanr skeleton. However, the binary operator of the scan is
non-associative in this case, so it prescribes a strictly sequential execution.
To enable a parallel execution, we must rely on skeletons with associative
base operators. Our next candidate is the carry-lookahead addition [14].
a                               0 1 0 0 1 0
b                               0 1 1 1 0 0
spg = zip(get spg)(a, b)        S G P P P S
spg = snoc(spg, S)              S G P P P S S
o = scanr (add op) spg          0 1 0 0 0 0 0
zip(add)(a, zip(add)(b, o))     0 1 0 1 1 1 0

where S = stop (ai = bi = 0), P = propagate (ai ≠ bi), and G = generate (ai = bi = 1).
The STL's algorithms can be viewed as skeletons. In fact, the STL provides many skeletons known from functional pro-
gramming, including reductions (accumulate), scans (partial sum), map and
zip (variants of transform). However, most of these algorithms work on cons
lists and provide only sequential access to the elements. Parallel functionality
is only supported by random access iterators as provided e. g. by the standard
vector container. Thus, data-parallel skeletons can be implemented in an STL-
like fashion as function templates taking pairs of random access iterators and
functors.
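A hypothetical sketch of the two operators of the case study as STL-style functors – the names and the integer encoding are assumptions – is:

// carry states encoded as plain ints
const int STOP = 0, PROPAGATE = 1, GENERATE = 2;

struct get_spg {                          // classify one bit position
  int operator()(int a, int b) const {
    if (a == 0 && b == 0) return STOP;
    if (a == 1 && b == 1) return GENERATE;
    return PROPAGATE;                     // a != b
  }
};

struct add_op {                           // associative combination of carry states:
  int operator()(int x, int y) const {    // a PROPAGATE defers to its neighbour
    return (x == PROPAGATE) ? y : x;      // (argument order is an assumption)
  }
};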
Here, integer constants are used to represent STOP, GENERATE and PROPAGATE.
All operator parameters and return values have the same type int, which enables
a compact and efficient implementation.
PL::init(NTHREADS);
par::vector<int,PL> a(N), b(N), c(N+1);
// ...
PL::finalize();
Note that the method call a.begin() returns a DatTeL iterator pointing to
the first element of the distributed container. The DatTeL overloads the STL
algorithms – transform and partial_sum in our example – for DatTeL iter-
ators. Thus, the function calls in our example automatically call the parallel
implementation of the skeletons.
Note that if we have to compute the prefix sums sequentially on a parallel
vector, e. g. if the base operation is not associative, then the parallel vector needs
to be copied into a temporary sequential one (std::vector) and then back after
the computation.
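A possible sketch of this fallback – assuming that copying between parallel and sequential iterators is supported and writing op for the non-associative operator – is:

std::vector<int> tmp(v.size());                             // temporary sequential vector
std::copy(v.begin(), v.end(), tmp.begin());                 // parallel -> sequential
std::partial_sum(tmp.begin(), tmp.end(), tmp.begin(), op);  // strictly sequential scan
std::copy(tmp.begin(), tmp.end(), v.begin());               // copy the result back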
While the DatTeL implements a number of primitive skeletons and includes sev-
eral high-level ones, our goal is not to support a large but fixed number of pre-
defined skeletons but to facilitate the development of new ones. This is achieved
by providing basic building blocks for implementing complex computations, by
making the backend interfaces sufficiently high-level and by including rigorous
specifications of interfaces between the library’s different layers.
There are two ways to implement new complex skeletons: (1) in terms of
simpler ones which are already available, or (2) by invoking backend operations
directly if the predefined algorithms do not provide the required functionality.
Obviously, the first way is preferable since it allows the programmer to work at a
more abstract level. We will now demonstrate briefly how to add a new skeleton
to the DatTeL. Let us analyze the last step of the carry-lookahead addition in
which the input vectors and the overflow vector are summed componentwise.
This was accomplished using the zip skeleton twice in (1). Alternatively, we can
define a new skeleton, zip3 , that takes three lists and a ternary operator:
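Informally, zip3 (f ) applied to [x1 , . . . , xn ], [y1 , . . . , yn ] and [z1 , . . . , zn ] yields [f (x1 , y1 , z1 ), . . . , f (xn , yn , zn )].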
Using a new operation add3 , which returns the sum of three values, the two zips
from (1) can be expressed as zip3 (add3 )(a, b, o) = zip(add )(a, zip(add )(b, o)).
Let us discuss how the new skeleton zip3 can be implemented as the template
function transform3 in the DatTeL. We start with a sequential implementation:
template< /* typenames */ >
Out transform3(In1 first1, In1 last1, In2 first2, In3 first3,
Out result, TerOp op) {
for ( ; first1 != last1; ++first1, ++first2, ++first3, ++result)
*result = op(*first1, *first2, *first3);
return result; }
In this example, Out, In1, In2 and In3 are type variables for iterator types and
TerOp for a ternary operator. We omit most typenames for the sake of brevity.
From the STL code, we proceed to the implementation in the DatTeL. The
DatTeL’s abstraction layer provides the function makepar that calls a given
function with parameters on all processors, so we simply call makepar with the
sequential version std::transform3 and the parameters specified by the user.
We can use the new DatTeL function in the implementation of our case study,
presented in Fig. 3. The latter two transforms can be substituted by
transform3(c.begin(), c.end()-1, a.begin(), b.begin(),
c.begin(), binadd3);
Using transform3 makes the implementation of our case study more obvious,
because its semantics coincides better with the description of the carry-lookahead
addition: “add all three vectors element-wise”. This more intuitive version is also
more compact and has better performance (see Sect. 7).
6.2 Marshalling
In the distributed memory setting, efficient marshaling, i. e. the packing of ob-
jects by the sender and their unpacking by the receiver during communication, is
of crucial importance. Implementing marshaling is a non-trivial task in any pro-
gramming language. Unfortunately, C++’s lack of introspection facilities com-
plicates a generic implementation of marshaling. In the DatTeL, we intend to
adopt an approach similar to that of the TPO++ library [16]. Here, a traits class
is used to specify the marshaling properties of a class. When communicating a
class instance, the user can choose between several strategies:
Since the decision about strategy can be made at compile time, this flexibil-
ity incurs no run-time overhead. Different container classes, such as the STL’s
vector and list, will use different strategies depending on the types of their
elements. At the moment, only the first strategy is supported.
7 Experimental Results
The critical questions about any new approach in the domain of parallel programming are: 1) whether the target performance can compete with hand-crafted solutions, and 2) how well it scales on parallel machines.
1. Our first goal is to assess the overhead caused by the genericity of the
STL and the DatTeL. We compare our STL and DatTeL implementations of the
carry-lookahead addition with a hand-crafted, non-generic one:
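A sketch of such a baseline, assuming plain arrays and a hand-written loop for the final step (the three-way sum add3 introduced earlier), is:

// non-generic final step: the two zips of (5) fused into a single pass
void add3_arrays(const int* a, const int* b, const int* o, int* c, int n) {
  for (int i = 0; i < n; ++i)
    c[i] = a[i] + b[i] + o[i];
}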
Note that if we are not restricted to the STL, the two zips in (5) can be
implemented in one forall loop. Table 3 compares the runtimes of three sequential
versions: 1) DatTeL on one processor, 2) STL, and 3) the non-generic version. We
measured the STL and DatTeL versions both with transform and transform3.
[Plots: speedup of the carry-lookahead addition ("carry lookahead" vs. "ideal speedup") over the number of threads and over the number of elements (100 to 10^8).]

Fig. 5. Speedups for carry-lookahead addition; left: on a SunFire 6800 with 16 processors (10^7 elements), used as a server at a university computing-center; right: on a Cray T3E (10^6 elements)
Two closely related approaches are (1) the Parallel Standard Template Library (PSTL) [24], which is part of the High Performance C++ Library, and (2) the Standard Template Adaptive Parallel Library (STAPL) [25]. Compared with PSTL and STAPL, the novel feature of the DatTeL is its use of an
extensible set of skeletons. The DatTeL also differs in offering the potential of
nested data parallelism, which was not covered here because of lack of space.
Our ongoing work on the DatTeL and future plans include:
– In addition to the available blockwise data distribution, we plan to imple-
ment also cyclic and block-cyclic distribution.
– Our current efforts include the implementation of matrix computations and
simulations such as the Barnes-Hut algorithm.
– The use of the MPI backend is currently restricted to containers consisting
of simple data types. Ongoing work adds marshalling and serialization of
user-defined data types, such as classes, to the DatTeL.
It would be desirable to combine the data parallelism provided by the DatTeL
with task parallelism. We plan to develop a task-parallel library that can be used
together with the DatTeL, rather than to extend the DatTeL in the direction of
task parallelism, because there are no close matches for it in the STL library.
An important question to be studied is performance portability: to describe
how the target performance behaves depending on the machine used, we are
developing a suitable cost model that is more precise than asymptotic estimates
used in the paper. We are working on a cost calculus and performance prediction
for DatTeL-based programs, based on the results achieved for skeletons and
nested data parallelism in the context of the Nepal project [26].
Acknowledgments
We are grateful to Chris Lengauer and two anonymous referees for many helpful
comments and suggestions, and to Phil Bacon and Julia Kaiser-Mariani who
assisted in improving the presentation.
References
1. Stepanov, A., Lee, M.: The Standard Template Library. Technical Report HPL-
95-11, Hewlett-Packard Laboratories (1995)
2. Veldhuizen, T.: Using C++ template metaprograms. C++ Report 7 (1995) 36–43
Reprinted in C++ Gems, ed. Stanley Lippman.
3. Myers, N.: Traits: a new and useful template technique. C++ Report (1995)
4. Veldhuizen, T.: Expression templates. C++ Report 7 (1995) 26–31
5. Veldhuizen, T.L., Gannon, D.: Active libraries: Rethinking the roles of compilers
and libraries. In: Proceedings of the SIAM Workshop on Object Oriented Methods
for Inter-operable Scientific and Engineering Computing (OO’98), SIAM Press
(1998)
6. Veldhuizen, T.L.: Arrays in Blitz++. In: Proceedings of the 2nd International Sci-
entific Computing in Object-Oriented Parallel Environments (ISCOPE’98). LNCS,
Springer-Verlag (1998)
7. Siek, J.G., Lumsdaine, A.: The Matrix Template Library: A generic programming
approach to high performance numerical linear algebra. In: ISCOPE. (1998) 59–70
8. Karmesin, S. et al.: Array design and expression evaluation in POOMA II. In
Caromel, D., Oldehoeft, R., Tholburn, M., eds.: Computing in Object-Oriented
Parallel Environments: Second International Symposium, ISCOPE 98. LNCS 1505,
Springer-Verlag (1998)
9. Danelutto, M., Pasqualetti, F., Pelagatti, S.: Skeletons for data parallelism in
P3L. In Lengauer, C., Griebl, M., Gorlatch, S., eds.: Euro-Par’97. Volume 1300 of
LNCS., Springer (1997) 619–628
10. Herrmann, C.A., Lengauer, C.: HDC: A higher-order language for divide-and-
conquer. Parallel Processing Letters 10 (2000) 239–250
11. Kuchen, H.: A skeleton library. In Monien, B., Feldmann, R., eds.: Euro-Par 2002.
Volume 2400 of LNCS., Springer (2002) 620–629
12. Cole, M.: eSkel library home page. (http://www.dcs.ed.ac.uk/home/mic/eSkel)
13. Blelloch, G.E.: Programming parallel algorithms. Communications of the ACM
39 (1996) 85–97
14. Leighton, F.T.: Introduction to Parallel Algorithms and Architectures: Arrays,
Trees, Hypercubes. Morgan Kaufmann Publ. (1992)
15. Koenig, A., Stroustrup, B.: As close as possible to C – but no closer. The C++
Report 1 (1989)
16. Grundmann, T., Ritt, M., Rosenstiel, W.: TPO++: An object-oriented message-
passing library in C++. In: International Conference on Parallel Processing. (2000)
43–50
17. Alexandrescu, A.: Modern C++ Design. Addison-Wesley (2001)
18. Chakravarty, M.M.T., Keller, G.: More types for nested data parallel program-
ming. In Wadler, P., ed.: Proceedings of the Fifth ACM SIGPLAN International
Conference on Functional Programming (ICFP’00), ACM Press (2000) 94–105
19. Blelloch, G.E., Chatterjee, S., Hardwick, J.C., Sipelstein, J., Zagha, M.: Imple-
mentation of a portable nested data-parallel language. Journal of Parallel and
Distributed Computing 21 (1994) 4–14
20. Pfannenstiel, W. et al.: Aspects of the compilation of nested parallel imperative
languages. In Werner, B., ed.: Third Working Conference on Programming Models
for Massively Parallel Computers, IEEE Computer Society (1998) 102–109
21. Wilson, G., Lu, P., eds.: Parallel Programming using C++. MIT press (1996)
22. Dabrowski, F., Loulergue, F.: Functional bulk synchronous programming in C++.
In: 21st IASTED International Multi-conference, AI 2003, Symposium on Parallel
and Distributed Computing and Networks, ACTA Press (2003) 462–467
23. Danelutto, M., Ratti, D.: Skeletons in MPI. In Aki, S., Gonzales, T., eds.: Pro-
ceedings of the 14th IASTED International Conference on Parallel and Distributed
Computing and Systems, ACTA Press (2002) 392–397
24. Johnson, E., Gannon, D.: Programming with the HPC++ Parallel Standard Tem-
plate Library. In: Proceedings of the 8th SIAM Conference on Parallel Processing
for Scientific Computing, PPSC 1997, Minneapolis, SIAM (1997)
25. Rauchwerger, L., Arzu, F., Ouchi, K.: Standard templates adaptive parallel library.
In O’Hallaron, D.R., ed.: 4th International Workshop on Languages, Compilers and
Run-Time Systems for Scalable Computers. LNCS 1511, Springer (1998) 402–410
26. Lechtchinsky, R., Chakravarty, M.M.T., Keller, G.: Costing nested array codes.
Parallel Processing Letters 12 (2002) 249–266
The Design of Hume: A High-Level Language
for the Real-Time Embedded Systems Domain

Kevin Hammond and Greg Michaelson
1 Introduction

Embedded systems have traditionally been programmed at a very low level of abstraction, but the need for productivity improvement means that there has been a transition
to higher-level general-purpose languages such as C/C++, Ada or Java. Despite
this, 80% of all embedded systems are delivered late [11], and massive amounts
are spent on bug fixes: according to Klocwork, for example, Nortel spends on
average $14,000 correcting each bug that is found once a system is deployed.
Many of these faults are caused by poor programmer management of memory
resources [29], exacerbated by programming at a relatively low level of abstrac-
tion. By adopting a domain-specific approach to language design rather than
adapting an existing general-purpose language, it is possible to allow low-level
system requirements to guide the design of the required high level language fea-
tures rather than being required to use existing general-purpose designs. This is
the approach we have taken in designing the Hume language, as described here.
Embedding a domain-specific language into a general purpose language such
as Haskell [25] to give an embedded domain-specific language has a number of ad-
vantages: it is possible to build on existing high-quality implementations and to
exploit the host language semantics, defining new features only where required.
The primary disadvantage of the approach is that there may be a poor fit be-
tween the semantics of the host language and the embedded language. This is
especially significant for the real-time domain: in order to ensure tight bounds
in practice as well as in theory, all language constructs must have a direct and
simple translation. Library-based approaches suffer from a similar problem. In
order to avoid complex and unwanted language interactions, we have therefore
designed Hume as a stand-alone language rather than embedding it into an ex-
isting design.
Moreover, the language design must incorporate constructs to allow the construction of embedded software.
The Hume language levels are:
– Full Hume: full recursion
– PR-Hume: primitive-recursive functions
– HO-Hume: non-recursive higher-order functions, non-recursive data structures
– FSM-Hume: non-recursive first-order functions, non-recursive data structures
– HW-Hume: no functions, non-recursive data structures
Although the body of a box is a single function, the process defined by a box will
iterate indefinitely, repeatedly matching inputs and producing the corresponding
outputs, in this case a single stream of bits representing a binary-and of the two
input streams. Since a box is stateless, information that is preserved between
box iterations must be passed explicitly between those iterations through some
wire (Section 2.1). This roughly corresponds to tail recursion over a stream in
a functional language [22], as recently exploited by E-FRP, for example [33]. In
the Hume context, this design allows a box to be implemented as an uninter-
ruptible thread, taking its inputs, computing some result values and producing
its outputs [15]. Moreover, if a bound on dynamic memory usage can be prede-
termined, a box can execute with a fixed size stack and heap without requiring
garbage collection [14].
2.1 Wiring
Boxes are connected using wiring declarations to form a static process network.
A wire provides a mapping between an output link and an input link, each of
which may be a named box input/output, a port, or a stream. Ports and streams
connect to external devices, such as parallel ports, files etc. For example, we
can wire two boxes band and bor into a static process network linked to the
corresponding input and output streams as follows:
box band in ( b1, b2 :: bit ) out ( b :: bit) ...
box bor in ( b1, b2 :: bit ) out ( b :: bit) ...
initial band.b = 1;
Note the use of an initialiser to specify that the initial value of the wire connected
to band.b is 1.
This matches two possible (polymorphic) input streams xs and ys, choosing
fairly between them to produce the single merged output xys. Variables x and
y match the corresponding input items in each of the two rules. The *-pattern
indicates that the corresponding input position should be ignored, that is the
pattern matches any input on the corresponding wire, without consuming it.
Such a pattern must appear at the top level. Section 3.4 includes an example
where an output is ignored.
3.1 Exceptions
Exceptions are raised in the expression layer and handled by the surrounding
box. In order to ensure tight bounds on exception handling costs, exception
handlers are not permitted within expressions. Consequently, we avoid the hard-
to-cost chain of dynamic exception handlers that can arise when using non-real-
time languages such as Java or Concurrent Haskell [26]. The following (trivial)
example shows how an exception that is raised in a function is handled by the
calling box. Since each Hume process instantiates one box, the exception handler
is fixed at the start of each process and can therefore be called directly. A static
analysis is used to ensure that all exceptions that could be raised are handled
by each box. We use Hume’s as construct to coerce the result of the function f
to a string containing 10 characters.
3.2 Timing
Five kinds of device are supported: buffered streams (files), unbuffered FIFO
streams, ports, memory-mapped devices, and interrupts. Each device has a di-
rectionality (for interrupts, this must be input), an operating system designator,
and an optional time specification. The latter can be used to enable periodic
scheduling or to ensure that critical events are not missed. Devices may be
wired to box inputs or outputs. For example, we can define an operation to read
a mouse periodically, returning true if the mouse button is down, as:
exception Timeout_mouseport;
port mouseport from "/dev/mouse" within 1ms raising Timeout_mouseport;
As in many other operating systems, Linux splits interrupt handlers into two halves: a top-
half which runs entirely in kernel space, and which must execute in minimal time,
and a bottom-half which may access user space memory, and which has more
relaxed time constraints. For the parallel port, the top-half simply schedules the
bottom-half for execution (using the trig wire), having first checked that the
handler has been properly initialised. The IRQ and other device information are
ignored.
The bottom-half handler receives a time record produced by the top-half trig-
ger output and the action generated by the parallel port. Based on the internal
buffer state (a non-circular buffer represented as a byte array plus head and tail
indexes), the internal state is modified and a byte will be read from/written to
the parallel port. In either case, a log message is generated.
We now define a box pp do write to write the output value and the necessary
two control bytes to strobe the parallel port. The control bytes are sequenced
internally using a state transition, and output is performed only if the parallel
port is not busy. We also define abstract ports (pp_action, . . . ) to manage the
interaction with the actual parallel port.
box pp_do_write -- Write to the parallel port
in ( cr, stat :: Byte, bout :: Byte, cr2 :: Byte )
out ( bout’, cr’ :: Byte, cr’’:: Byte )
match
( *, SP_SR_BUSY, *, *, * ) -> ( *, *, * ) -- wait until non-busy
| ( *, *, *, *, cr ) -> ( *, cr & ~SP_CR_STROBE, * )
| ( cr, _, bout, false, * ) -> ( bout, cr | SP_CR_STROBE, cr );
Finally, wiring definitions link the bottom-half boxes with the parallel port.
wire trigger.outp to pp_bottomhalf.td;
wire pp_bottomhalf.buffer’ to pp_bottomhalf.buffer;
The judgement form is
\[
\mathcal{E} \vdash_{space} exp \;\Rightarrow\; Cost,\ Cost
\]
and the rules include:
\[
\mathcal{E} \vdash n \;\Rightarrow\; H_{int32},\ 1 \tag{1}
\]
\[
\ldots
\]
\[
\frac{\mathcal{E}(var) = \langle h, s \rangle \qquad \forall i.\ 1 \le i \le n,\ \ \mathcal{E} \vdash exp_i \Rightarrow h_i,\ s_i}
     {\mathcal{E} \vdash var\ exp_1 \ldots exp_n \;\Rightarrow\; \displaystyle\sum_{i=1}^{n} h_i + h,\ \ \max_{i=1}^{n}\,(s_i + (i-1)) + s} \tag{2}
\]
\[
\frac{\forall i.\ 1 \le i \le n,\ \ \mathcal{E} \vdash exp_i \Rightarrow h_i,\ s_i}
     {\mathcal{E} \vdash con\ exp_1 \ldots exp_n \;\Rightarrow\; \displaystyle\sum_{i=1}^{n} h_i + n + H_{con},\ \ \max_{i=1}^{n}\,(s_i + (i-1))} \tag{3}
\]
\[
\frac{\mathcal{E} \vdash exp_1 \Rightarrow h_1,\ s_1 \qquad \mathcal{E} \vdash exp_2 \Rightarrow h_2,\ s_2 \qquad \mathcal{E} \vdash exp_3 \Rightarrow h_3,\ s_3}
     {\mathcal{E} \vdash \mathbf{if}\ exp_1\ \mathbf{then}\ exp_2\ \mathbf{else}\ exp_3 \;\Rightarrow\; h_1 + \max(h_2, h_3),\ \ \max(s_1, s_2, s_3)} \tag{4}
\]
\[
\frac{\mathcal{E} \vdash_{decl} decls \Rightarrow h_d,\ s_d,\ s'_d,\ \mathcal{E}' \qquad \mathcal{E}' \vdash exp \Rightarrow h_e,\ s_e}
     {\mathcal{E} \vdash \mathbf{let}\ decls\ \mathbf{in}\ exp \;\Rightarrow\; h_d + h_e,\ \ \max(s_d,\ s'_d + s_e)} \tag{5}
\]
We have defined a static analysis that predicts upper bounds on stack and heap usage with respect to the prototype
Hume Abstract Machine (pHAM) [13]. The stack and heap requirements for the
boxes and wires represent the only dynamically variable memory requirements:
all other memory costs can be fixed at compile-time based on the number of
wires, boxes, functions and the sizes of static strings. In the absence of recursion,
we can provide precise static memory bounds on rule evaluation. Predicting the
stack and heap requirements for an FSM-Hume program thus provides complete
static information about system memory requirements.
The rules above give the space cost axioms used in practice for FSM-Hume. Heap and stack costs are each integer values of
type Cost, labelled h and s, respectively. Each rule produces a pair of such
values representing independent upper bounds on the stack and heap usage.
The result is produced in the context of an environment, E, that maps function
names to the heap and stack requirements associated with executing the body of
the function, and which is derived from the top-level program declarations plus
standard prelude definitions. Rules for building the environment are omitted
here, except for local declarations, but can be trivially constructed.
The heap cost of a standard integer is given by Hint32 (rule 1), with other
scalar values costed similarly. The cost of a function application is the cost of
evaluating the body of the function plus the cost of each argument (rule 2). Each
evaluated argument is pushed on the stack before the function is applied, and this
must be taken into account when calculating the maximum stack usage. The cost
of building a new data constructor value such as a user-defined constructed type
(rule 3) is similar to a function application, except that pointers to the arguments
must be stored in the newly created closure (one word per argument), and fixed
costs Hcon are added to represent the costs of tag and size fields. The heap
usage of a conditional (rule 4) is the heap required by the condition part plus the
maximum heap used by either branch. The maximum stack requirement is simply
the maximum required by the condition and either branch. Case expressions
(omitted) are costed analogously. The cost of a let-expression (rule 5) is the
space required to evaluate the value definitions (including the stack required
to store the result of each new value definition) plus the cost of the enclosed
expression. The local declarations are used to derive a quadruple comprising
total heap usage, maximum stack required to evaluate any value definition, a
count of the value definitions in the declaration sequence (used to calculate the
size of the stack frame for the local declarations), and an environment mapping
function names to heap and stack usage. The body of the let-expression is costed
in the context of this extended environment.
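As a small worked illustration of rules (1) and (2) (our own example, not taken from the Hume papers), suppose the environment records $\mathcal{E}(f) = \langle h_f, s_f \rangle$ for a two-argument function f. Then the application f 1 2 is costed as
\[
\mathcal{E} \vdash f\ 1\ 2 \;\Rightarrow\; H_{int32} + H_{int32} + h_f,\ \ \max(1 + 0,\ 1 + 1) + s_f \;=\; 2\,H_{int32} + h_f,\ \ 2 + s_f
\]
since each literal needs one stack slot, the second literal is pushed while the first is still on the stack (the $(i-1)$ offset in rule (2)), and evaluating the body of f then needs a further $s_f$ stack words and $h_f$ heap words.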
5 Implementations of Hume
6 Related Work
Accurate time and space cost-modelling is an area of known difficulty for func-
tional language designs [28]. Hume is thus, as far as we are aware, unique in
being based on strong automatic cost models, and in being designed to allow
straightforward space- and time-bounded implementation for hard real-time sys-
tems. A number of functional languages have, however, looked at soft real-time
issues (e.g. Erlang [2] or E-FRP [33]), there has been work on using functional
notations for hardware design (essentially at the HW-Hume level) (e.g. Hydra [24]),
and there has been much recent theoretical interest both in the problems as-
sociated with costing functional languages (e.g. [28, 20, 21]) and in bounding
space/time usage (e.g. [34, 19]).
In a wider framework, two extreme approaches to real-time language de-
sign are exemplified by SPARK Ada [4] and the real-time specification for Java
(RTSJ) [7]. The former epitomises the idea of language design by elimination
of unwanted behaviour from a general-purpose language, including concurrency.
The remaining behaviour is guaranteed by strong formal models. In contrast,
the latter provides specialised runtime and library support for real-time sys-
tems work, but makes no absolute performance guarantees. Thus, SPARK Ada
provides a minimal, highly controlled environment for real-time programming
emphasising correctness by construction [1], whilst Real-Time Java provides a
much more expressive, but less controlled, environment without formal guar-
antees. Our objective with the Hume design is to maintain correctness whilst
providing high levels of expressiveness.
There has been much recent interest in applying static analysis to issues of
bounded time and space, but none is capable of dealing with higher-order, poly-
morphic and generally recursive function definitions as found in full Hume. For
example, region types [34] allow memory cells to be tagged with an allocation
region, whose scope can be determined statically. When the region is no longer
required, all memory associated with that region may be freed without invok-
ing a garbage collector. This is analogous to the use of Hume boxes to scope
memory allocations. Hofmann’s linearly-typed functional programming language
LFPL [18] uses linear types to determine resource usage patterns. First-order
LFPL definitions can be computed in bounded space, even in the presence of
general recursion.
References
1. P. Amey, “Correctness by Construction: Better can also be Cheaper”, CrossTalk:
the Journal of Defense Software Engineering, March 2002, pp. 24–28.
2. J. Armstrong, S.R. Virding, and M.C. Williams, Concurrent Programming in Er-
lang, Prentice-Hall, 1993.
3. M. Barabanov, A Linux-based Real-Time Operating System, M.S. Thesis, Dept. of
Comp. Sci., New Mexico Institute of Mining and Technology, June 1997.
4. J. Barnes, High Integrity Ada: the Spark Approach, Addison-Wesley, 1997.
5. A. Benveniste and P.L. Guernic, “Synchronous Programming with Events and
Relations: the Signal Language and its Semantics”, Science of Computer Program-
ming, 16, 1991, pp. 103–149.
6. G. Berry. “The Foundations of Esterel”, In Proof, Language, and Interaction. MIT
Press, 2000.
7. G. Bollela et al. The Real-Time Specification for Java, Addison-Wesley, 2000.
8. P. Caspi and M. Pouzet. “Synchronous Kahn Networks”, SIGPLAN Notices
31(6):226–238, 1996.
9. M. Chakravarty (ed.), S.O. Finne, F. Henderson, M. Kowalczyk, D. Leijen, S.
Marlow, E. Meijer, S. Panne, S.L. Peyton Jones, A. Reid, M. Wallace and M.
Weber, “The Haskell 98 Foreign Function Interface 1.0”, http://www.cse.unsw.
edu.au/˜chak/haskell/ffi, December, 2003.
10. J. Corbet and A. Rubini, “Linux Device Drivers”, 2nd Edition, O’Reilly, 2001.
11. The Ganssle Group. Perfecting the Art of Building Embedded Systems. http:
//www.ganssle.com, May 2003.
12. N. Halbwachs, D. Pilaud and F. Ouabdesselam, “Specifying, Programming and
Verifying Real-Time Systems using a Synchronous Declarative Language”, in Au-
tomatic Verification Methods for Finite State Systems, J. Sifakis (ed.), Springer-
Verlag, 1990, pp. 213–231.
13. K. Hammond. “An Abstract Machine Implementation for Embedded Systems Ap-
plications in Hume”, Submitted to 2003 Workshop on Implementations of Func-
tional Languages (IFL 2003), Edinburgh, 2003.
14. K. Hammond and G.J. Michaelson “Predictable Space Behaviour in FSM-Hume”,
Proc. 2002 Intl. Workshop on Impl. Functional Langs. (IFL ’02), Madrid, Spain,
Springer-Verlag LNCS 2670, 2003.
15. K. Hammond and G.J. Michaelson, “Hume: a Domain-Specific Language for Real-
Time Embedded Systems”, Proc. Conf. on Generative Programming and Compo-
nent Engineering (GPCE ’03), Springer-Verlag LNCS, 2003.
16. K. Hammond, H.-W. Loidl, A.J. Rebón Portillo and P. Vasconcelos, “A Type-and-
Effect System for Determining Time and Space Bounds of Recursive Functional
Programs”, In Preparation, 2003.
17. D. Harel, “Statecharts: a Visual Formalism for Complex Systems”, Science of Com-
puter Programming, 8, 1987, pp. 231–274.
18. M. Hofmann. A Type System for Bounded Space and Functional In-place Update.
Nordic Journal of Computing, 7(4):258–289, 2000.
19. M. Hofmann and S. Jost, “Static Prediction of Heap Space Usage for First-Order
Functional Programs”, Proc. POPL’03 — Symposium on Principles of Program-
ming Languages, New Orleans, LA, USA, January 2003. ACM Press.
20. R.J.M. Hughes, L. Pareto, and A. Sabry. “Proving the Correctness of Reactive
Systems Using Sized Types”, Proc. POPL’96 — ACM Symp. on Principles of
Programming Languages, St. Petersburg Beach, FL, Jan. 1996.
21. R.J.M. Hughes and L. Pareto, “Recursion and Dynamic Data Structures in
Bounded Space: Towards Embedded ML Programming”, Proc. 1999 ACM Intl.
Conf. on Functional Programming (ICFP ’99), Paris, France, pp. 70–81, 1999.
22. S.D. Johnson, Synthesis of Digital Designs from Recursive Equations, MIT Press,
1984, ISBN 0-262-10029-0.
23. J. McDermid, “Engineering Safety-Critical Systems”, I. Wand and R. Milner(eds),
Computing Tomorrow: Future Research Directions in Computer Science, Cam-
bridge University Press, 1996, pp. 217–245.
24. J.T. O’Donnell, “The Hydra Hardware Description Language”, This proc., 2003.
25. S.L. Peyton Jones (ed.), L. Augustsson, B. Boutel, F.W. Burton, J.H. Fasel, A.D.
Gordon, K. Hammond, R.J.M. Hughes, P. Hudak, T. Johnsson, M.P. Jones, J.C.
Peterson, A. Reid, and P.L. Wadler, Report on the Non-Strict Functional Language,
Haskell (Haskell 98), Yale University, 1999.
26. S.L. Peyton Jones, A.D. Gordon and S.O. Finne “Concurrent Haskell”, Proc. ACM
Symp. on Princ. of Prog. Langs., St Petersburg Beach, Fl., Jan. 1996, pp. 295–308.
27. R. Pointon, “A Rate Analysis for Hume”, In preparation, Heriot-Watt University,
2004.
28. A.J. Rebón Portillo, Kevin Hammond, H.-W. Loidl and P. Vasconcelos, “Automatic
Size and Time Inference”, Proc. Intl. Workshop on Impl. of Functional Langs. (IFL
2002), Madrid, Spain, Sept. 2002, Springer-Verlag LNCS 2670, 2003.
29. M. Sakkinen. “The Darker Side of C++ Revisited”, Technical Report 1993-I-13,
http://www.kcl.ac.uk/kis/support/cit//fortran/cpp/dark-cpl.ps, 1993.
30. T. Sayeed, N. Shaylor and A. Taivalsaari, “Connected, Limited Device Configura-
tion (CLDC) for the J2ME Platform and the K Virtual Machine (KVM)”, Proc.
JavaOne – Sun’s Worldwide 2000 Java Developers Conf., San Francisco, June 2000.
31. N. Shaylor, “A Just-In-Time Compiler for Memory Constrained Low-Power De-
vices”, Proc. 2nd Usenix Symposium on Java Virtual Machine Research and Tech-
nology (JVM ’02), San Francisco, August 2002.
32. E. Schoitsch. “Embedded Systems – Introduction”, ERCIM News, 52:10–11, 2003.
33. W. Taha, “Event-Driven FRP”, Proc. ACM Symp. on Practical Applications of
Declarative Languages (PADL ’02), 2002.
34. M. Tofte and J.-P. Talpin, “Region-based Memory Management”, Information and
Computation, 132(2), 1997, pp. 109–176.
35. P. Vasconcelos and K. Hammond. “Inferring Costs for Recursive, Polymorphic and
Higher-Order Functional Programs”, Submitted to 2003 Workshop on Implemen-
tations of Functional Languages (IFL 2003), Edinburgh, 2003.
Embedding a Hardware Description Language
in Template Haskell
John T. O’Donnell
1 Introduction
2 A Perfect Embedding
The central concept of functional programming is the mathematical function,
which takes an argument and produces a corresponding output. The central
building block of digital circuits is the component, which does the same. This
fundamental similarity lies at the heart of the Hydra embedding: a language that
is good at defining and using functions is also likely to be good at defining and
using digital circuits.
[Figure 1: the example circuit circ, with inputs a, b and c and output x]
In order to make the circuit executable, the basic logic gates need to be
defined. A simple approach is to treat signals as booleans, and to define the
logic gates as the corresponding boolean operators.
inv = not
and2 = (&&)
or2 = (||)
A useful building block circuit is the multiplexor mux1, whose output can be
described as “if c = 0 then a else b” (Figure 2):
mux1 c a b =
or2 (and2 (inv c) a) (and2 c b)
[Figure 2: the multiplexor mux1, with control input c, data inputs a and b, and output x]
The functional metaphor used above requires each circuit to act like a pure
mathematical function, yet real circuits often contain state. A crucial technique
for using streams (infinite lists) to model circuits with state was discovered by
Steven Johnson [2]. The idea is to treat a signal as a sequence of values, one
for each clock cycle. Instead of thinking of a signal as something that changes
over time, it is a representation of the entire history of values on a wire. This
approach is efficient because lazy evaluation and garbage collection combine to
keep only the necessary information in memory at any time.
The delay flip flop dff is a primitive component with state; at all times it
outputs the value of its state, and at each clock tick it overwrites its state with
the value of its input. Let us assume that the flip flop is initialized to 0 when
power is turned on; then the behavior of the component is defined simply as
dff x = False : x
Now we can define synchronous circuits with feedback. For example, the 1-bit
register circuit has a load control ld and a data input x. It continuously outputs
its state, and it updates its state with x at a clock tick if ld is true.
reg1 ld x = r
where r = dff (mux1 ld r x)
sim_reg1 = reg1 ld x
where ld = [True, False, False, True, False, False]
x = [True, False, False, False, False, False]
This is now executed using the interactive Haskell interpreter ghci, producing
the correct output:
*Main> sim_reg1
[False,True,True,True,False,False,False]
[Figure 3: the register circuit reg1, built from mux1 and dff with feedback wire r, load control ld and data input x]
[Figure 4: the ascanr pattern: inputs x0 ... x3 feed a row of f boxes, a enters at the right, z emerges at the left, and y0 ... y3 are the per-position outputs]
A ripple carry adder can now be defined using ascanr to handle the carry
propagation, and the map2 combinator (similar to zipWith) to compute the
sums.
add1 :: Signal a => a -> [(a,a)] -> (a,[a])
add1 c zs =
let (c’,cs) = ascanr bcarry c zs
ss = map2 bsum zs cs
in (c’,ss)
This specification operates correctly on all word sizes. In other words, it defines
the infinite class of n-bit adders, so the circuit designer doesn’t need to design
a 4-bit adder, and then an 8-bit one, and so on. Furthermore, we can reason
formally about the circuit using equational reasoning. A formal derivation of an
O(log n) time parallel adder, starting from this O(n) one, is presented in [9].
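The combinators assumed by add1 (ascanr, map2, bsum and bcarry) are not defined above. The following is a minimal sketch of plausible definitions, not the actual Hydra library code, written over plain booleans for concreteness (in Hydra they would be overloaded over the signal classes); words are taken most significant bit first, so the carry ripples from right to left.

xor2 :: Bool -> Bool -> Bool
xor2 x y = or2 (and2 x (inv y)) (and2 (inv x) y)

-- full-adder sum and carry for one bit position
bsum, bcarry :: (Bool,Bool) -> Bool -> Bool
bsum  (x,y) c = xor2 (xor2 x y) c
bcarry (x,y) c = or2 (and2 x y) (and2 c (xor2 x y))

-- combine two lists element by element
map2 :: (a -> b -> c) -> [a] -> [b] -> [c]
map2 = zipWith

-- thread a value through a list from right to left, returning the final
-- value and the value entering each position
ascanr :: (a -> b -> b) -> b -> [a] -> (b, [b])
ascanr f c []     = (c, [])
ascanr f c (z:zs) = let (c', cs) = ascanr f c zs
                    in (f z c', c' : cs)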
There are other static signal instances that support richer circuit models,
allowing techniques like tristate drivers, wired-or, and bidirectional buses to be
handled. A static signal can be lifted to a clocked one using streams:
instance Static a => Clocked (Stream a) where
zero = Srepeat zero
...
dff xs = Scons zero xs
inv xs = Smap inv xs
...
The following session with the Haskell interpreter ghci executes circ to perform
a boolean simulation, a clocked boolean simulation, and a netlist generation (see
Section 4).
*Main> test_circ_1
True
*Main> test_circ_2
[True,False,False,True]
*Main> test_circ_3
Or2 (And2 (Inport "a") (Inport "b")) (Inv (Inport "c"))
4.1 Netlists
Circuit manufacturers take a netlist and fabricate a physical circuit. In a sense, the whole point of
designing a digital circuit is to obtain the netlist; this is the starting point of the
manufacturing process.
A result of the perfect embedding is that the simulation of a Hydra specifica-
tion is faithful to the underlying model of hardware. The behavioral semantics
of the circuit is identical to the denotational semantics of the Haskell program
that describes it.
Sometimes, however, we don’t want the execution of a Hydra specification
to be faithful to the circuit. Examples of this situation include handling errors
in feedback loops, inserting logic probes into an existing circuit, and generating
netlists. The first two issues will be discussed briefly, and the rest of this section
will examine the problem of netlists in more detail.
In a digital circuit with feedback, the maximum clock speed is determined
by the critical path. The simulation speed is also affected strongly by the criti-
cal path depth (and in a parallel Hydra simulator, the simulation speed would
be roughly proportional to the critical path depth). If there is a purely com-
binational feedback loop, as the result of a design error, then the critical path
depth is infinite. But a faithful simulation may not produce an error message; it
may go into an infinite loop, faithfully simulating the circuit failing to converge.
It is possible that such an error will be detected and reported by the Haskell
runtime system as a “black hole error”, but there is no guarantee of this – and
such an error message gives little understanding of where the feedback error has
occurred.
A good solution to this problem is a circuit analysis that detects and reports
feedback errors. However, that requires the ability to traverse a netlist.
Circuit designers sometimes test a small scale circuit using a prototype board,
with physical chips plugged into slots and physical wires connecting them. The
circuit will typically have a modest number of inputs and outputs, and a far
larger number of internal wires. A helpful instrument for debugging the design
is the logic probe, which has a tip that can be placed in contact with any pin on
any of the chips. The logic probe has light emitting diodes that indicate the state
of the signal. This tool is the hardware analogue of inserting print statements
into an imperative program, in order to find out what is going on inside.
Logic probes are not supported by the basic versions of Hydra described
in the previous sections, because they are not faithful to the circuit, so it is
impossible to implement them in a perfect embedding. The Hydra simulation
computes exactly what the real circuit does, and in a real circuit there is no
such thing as a logic probe. Section 4.6 shows how metaprogramming solves this
problem, enabling us to have software logic probes while retaining the correct
semantics of the circuit.
Hydra generates netlists in two steps. First, a special instance of the signal class
causes the execution of a circuit to produce a graph which is isomorphic to the
circuit. Second, a variety of software tools traverse the graph in order to generate
the netlist, perform timing analyses, insert logic probes, and so on.
The idea is that a component with n inputs is represented by a graph node
tagged with the name of the component, with n pointers to the sources of the
component’s input signals. There are two kinds of signal source:
– An input to the circuit, represented as an Inport node with a String giving
the input signal name.
– An output of some component, represented by a pointer to that component.
The following algebraic data type defines graphs where nodes correspond to
primitive components, or inputs to the entire circuit. (This is not the represen-
tation actually used in Hydra, which is much more complex – all of the pieces
of Hydra implementation given here are simplified versions, which omit many of
the capabilities of the full system.)
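A minimal version of such a type, sufficient for the examples that follow (a sketch rather than the real Hydra representation), might be:

data Net
  = Inport String   -- an input to the circuit, carrying its name
  | Inv Net         -- inverter, pointing at the source of its input signal
  | And2 Net Net    -- two-input and gate
  | Or2 Net Net     -- two-input or gate
  | Dff Net         -- delay flip flop
  deriving Show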
The Net type can now be used to represent signals, so it is made an instance of
the Signal class.
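A sketch of that instance, assuming for simplicity a single class Signal whose methods are the gate and flip flop operations used in this paper, is:

class Signal a where
  inv       :: a -> a
  and2, or2 :: a -> a -> a
  dff       :: a -> a

instance Signal Net where
  inv  = Inv      -- building a gate builds the corresponding graph node
  and2 = And2
  or2  = Or2
  dff  = Dff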
[Figure 5: the circuit graph for circ: an Or2 node whose inputs are an And2 node (inputs a and b) and an Inv node (input c), giving output x]
The following session shows the application of circ to input ports, using the interactive ghci Haskell system. The output is
an Or2 node, which points to the nodes that define the input signals to the or2
gate.
*Main> circ (Inport "a") (Inport "b") (Inport "c")
Or2 (And2 (Inport "a") (Inport "b")) (Inv (Inport "c"))
Given a directed acyclic graph like the one in Figure 5, it is straightforward
to write traversal functions that build the netlist, count numbers of components,
and analyze the circuit in various ways.
[Figure 6: a sequence of feedback circuits containing one, two, ... delay flip flops (dff x, dff (dff x), ...)]
The first language in which Hydra was embedded was Daisy, a lazy functional
language with dynamic types (like Lisp). Daisy was implemented by an inter-
preter that traversed a tree representation of the program, and the language
offered constructors for building code that could then be executed. In short,
Daisy supported metaprogramming, and this was exploited in early research on
debugging tools, operating systems, and programming environments.
Daisy took the same approach as Lisp for performing comparisons of data
structures: there was a primitive function that returned true if two objects are
identical – that is, if the pointers representing them had the same value. This
was used in turn by the standard equality predicate.
In the true spirit of embedding, the Daisy pointer equality predicate was used
in the first version of Hydra to implement safe traversal of circuit graphs. This
is a standard technique: the traversal function keeps a list of nodes it has visited
before; when it is about to follow a pointer to a node, the traversal first checks
to see whether this node has already been processed. This technique for netlist
generation is described in detail in [5].
There is a serious and fundamental problem with the use of a pointer equal-
ity predicate in a functional language. Such a function violates referential trans-
parency, and this in turn makes equational reasoning unsound.
The severity of this loss is profound: the greatest advantage of a pure func-
tional language is surely the soundness of equational reasoning, which simplifies
the use of formal methods sufficiently to make them practical for problems of
real significance and complexity. For example, equational reasoning in Hydra can
be used to derive a subtle and efficient parallel adder circuit [9].
The choice to use pointer equality in the first version of Hydra was not an
oversight; the drawbacks of this approach were well understood from the outset,
but it was also clear that netlist generation was a serious problem that would
require further work. The simple embedding with pointer equality provided a
convenient temporary solution until a better approach could be found. The un-
safe pointer equality method was abandoned in Hydra around 1990, and replaced
by the labeling transformation discussed in the next section.
Now labels can be introduced into a circuit specification; for example, a labeled
version of the reg1 circuit might be written as follows:
reg1’ ld x = r
where r = label 100 (dff (mux1 ld r x))
The labeled circuit reg1’ can be simulated just like the previous version:
sim_reg’ = reg1’ ld x
where ld = [True, False, False, True, False, False]
x = [True, False, False, False, False, False]
In order to insert the labels into circuit graphs, a new Label node needs to be
defined, and the instance of the label function for the Net type uses the Label
constructor to create the node:
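A minimal sketch of what this involves, in terms of the simplified Net type used here, is:

-- Net gains one extra constructor:   Label Int Net
-- and label, at the Net instance, simply builds such a node
label :: Int -> Net -> Net
label n x = Label n x
-- (for the simulation instances, label n would just return its argument)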
Now we can define executions of the labeled specification for both simulation
and netlist types:
sim_reg’ = reg1’ ld x
where ld = [True, False, False, True, False, False]
x = [True, False, False, False, False, False]
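The netlist-building execution applies reg1' to Inport nodes; a plausible definition, consistent with the output shown below, is:

graph_reg' = reg1' (Inport "ld") (Inport "x")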
[Figure 7: the schematic diagram and circuit graph of the labelled register reg1’, showing the Dff, Or2, And2 and Inv nodes and the Inport nodes "ld" and "x"]
Figure 7 shows the schematic diagram and the corresponding circuit graph.
The results of executing these test cases using ghci are shown below. The sim-
ulation produces the correct outputs. The graph can be printed out, but it is a
circular graph so the output will be infinite (and is interrupted interactively by
typing Control-C).
*Main> sim_reg’
[False,True,True,True,False,False,False]
*Main> graph_reg’
Label 100 (Dff (Or2 (And2 (Inv (Inport "ld")) (Label 100 (Dff (Or2
(And2 (Inv (Inport "ld")) (Label 100 (Dff (Or2 (And2 (Inv (Inport
"ld")) (Label 100 (DfInterrupted.
Although it is not useful to print the graph directly, the presence of labels
makes it possible to traverse it in order to build a netlist. Rather than show a full
netlist traversal here, consider the equivalent but simpler problem of counting
the number of flip flops in a circuit. This is achieved by the following function
(there is an equation corresponding to each constructor in the Net type, but only
a few are shown here):
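A sketch of the kind of definition involved (the name countDff and the explicit threading of a visited-label list are our own choices) is:

countDff :: [Int] -> Net -> ([Int], Int)
countDff ls (Label n x)
  | n `elem` ls = (ls, 0)            -- this labelled node has already been counted
  | otherwise   = countDff (n:ls) x
countDff ls (Dff x)    = let (ls', k) = countDff ls x in (ls', k + 1)
countDff ls (Inv x)    = countDff ls x
countDff ls (And2 x y) = let (ls1, k1) = countDff ls  x
                             (ls2, k2) = countDff ls1 y
                         in  (ls2, k1 + k2)
countDff ls (Or2 x y)  = let (ls1, k1) = countDff ls  x
                             (ls2, k2) = countDff ls1 y
                         in  (ls2, k1 + k2)
countDff ls (Inport _) = (ls, 0)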
Using labels, the sequence of circuits shown in Figure 6 can be defined un-
ambiguously:
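A plausible sketch of such a family, using hypothetical names loop1 and loop2 for feedback circuits containing one and two flip flops respectively, is:

loop1 = x where x = label 1 (dff x)         -- one flip flop in the feedback loop
loop2 = x where x = label 2 (dff (dff x))   -- two flip flops in the feedback loop

Without the labels both definitions unfold to the same infinite tree of dff nodes; with them, a traversal can tell the two graphs apart.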
Executing the test cases produces the following results. First, the number
of flip flops in the labeled register circuit is counted correctly. The unlabeled
circuit results in a runtime exception, since the circuit graph is equivalent to an
infinitely deep directed acyclic graph. The labeled circuits are all distinguished
correctly.
The use of labeling solves the problem of traversing circuit graphs, at the cost
of introducing two new problems. It forces a notational burden onto the circuit
designer which has nothing to do with the hardware, but is merely an artifact of
the embedding technique. Even worse, the labeling must be done correctly and
yet cannot be checked by the traversal algorithms.
Suppose that a specification contains two different components that were
mistakenly given the same label. Simulation will not bring out this error, but the
netlist will actually describe a different circuit than the one that was simulated.
Later on the circuit will be fabricated using the erroneous netlist. No amount of
simulation or formal methods will help if the circuit that is built doesn’t match
the one that was designed.
When monads were introduced into Haskell, it was immediately apparent
that they had the potential to solve the labeling problem for Hydra. Monads are
often used to automate the passing of state from one computation to the next,
while avoiding the naming errors that are rife with ordinary let bindings. Thus
a circuit specification might be written in a form something like the following:
circ a b =
do p <- dff a
q <- dff b
x <- and2 p q
return x
The monad would be defined so that a unique label is generated for each opera-
tion; this is enough to guarantee that all feedback loops can be handled correctly.
However, there are two disadvantages of using monads for labeling in Hydra.
The first problem is that monads introduce new names one at a time, in a
sequence of nested scopes, while Hydra requires the labels to come into scope
recursively, all at once, so that they are all visible throughout the scope of
a circuit definition. In the above example, the definition of p cannot use the
signal q.
A more severe problem is that the circuit specification is no longer a system
of simultaneous equations, which can be manipulated formally just by “substi-
tuting equals for equals”. Instead, the specification is now a sequence of com-
putations that – when executed – will yield the desired circuit. It feels like
writing an imperative program to draw a circuit, instead of defining the circuit
directly. Equational reasoning would still be sound with the monadic approach,
but it would be far more difficult to use: the monadic circuit specification above
contains no equations at all, and if the monadic operations are expanded out
to their equational form, the specification becomes bloated, making it hard to
manipulate formally.
The monadic approach offers a way to overcome the mismatch between the
semantics of Hydra (which needs netlists) and Haskell (which disallows impure
features). However, this is a case where a compromise to the DSL in order to
make it fit within a general purpose language is not worthwhile; the loss of
easy equational reasoning is too great a drawback, so the monadic approach was
rejected for Hydra.
Instead of requiring the designer to insert labels by hand, or using monads, the
labels could be inserted automatically by a program transformation. If we were
restricted to the standard Haskell language, this would entail developing a parser
for Hydra, an algebraic data type for representing the language, and functions
to analyze the source and generate the target Haskell code, which can then be
compiled. A project along these lines was actually underway when Template
Haskell became available, offering a much more attractive approach.
Template Haskell [10] provides the ability for a Haskell program to perform
computations at compile time which generate new code that can then be spliced
into the program. It is similar in many ways to macros in Scheme, which have
long been used for implementing domain specific languages within a small and
powerful host.
Template Haskell defines a standard algebraic data type for representing the
abstract syntax of Haskell programs, and a set of monadic operations for con-
structing programs. These are expressible in pure Haskell. Two new syntactic
constructs are also introduced: a reification construct that gives the representa-
tion of a fragment of code, and a splicing construct that takes a code represen-
tation tree and effectively inserts it into a program.
The following definition uses the reification brackets [d| . . . |] to define
circ_defs_rep as an algebraic data type representing the code for a definition:
circ_defs_rep = [d|
reg1 :: Clocked a => a -> a -> a
reg1 ld x = r
where r = dff (mux1 ld r x)
|]
The value of circ_defs_rep is a list of abstract syntax trees for the definitions
within the brackets; this list contains a representation of the type declaration
and the equation defining reg1. All the syntactic constructs, including the top
level equation, the where clause, the equations within the where clause, all the
way down to individual applications, literals, and variables, are represented as
nodes in the abstract syntax tree.
Template Haskell is a homogeneous multistage language; that is, all the as-
pects of Haskell which the ordinary programmer can use are also available to
process the abstract syntax tree at program generation time. Thus the Hydra
transform_module function, which transforms the source code into the form
that will be executed, is just an ordinary Haskell function definition.
The transformation used by Hydra is a generalization of the labeling transfor-
mation (Section 4.5), but the details of the representation are more complex than
described there. The transformed circuit specification contains detailed informa-
tion about the form of the specification itself, represented in data structures
that can be used by software tools at runtime. For example, logic probes can be
implemented in software, because simulation time functions can probe the data
structures generated by the transformed code in order to find any signal the user
wants to see.
The transform_module function analyzes the code, checks for a variety of
errors and issues domain-specific error messages if necessary, inserts the labels
correctly, generates code that will build useful data structures for simulation
time, and also performs some other useful tasks. The result of this is a new code
tree. The $(. . .) syntax then splices the new code into the program, and resumes
the compilation.
$(transform_module circ_defs_rep)
It would be possible to ask the circuit designer using Hydra to write the final
transformed code directly, bypassing the need for metaprogramming. In princi-
ple, the effect of this automated transformation is the same as if the programmer
had written the final code in the first place. In practice, however, there are some
significant differences.
The algebraic data type used to represent the circuit graph contains more
information than the simple Net type used earlier in this paper; in addition
to supporting netlist generation, it also provides the information needed for
several other software tools, including software logic probes. The transformation
proceeds in several main steps.
Among them is a traversal that builds a renaming table for the signal variables.
The first case of findvars_exp handles constants, which are represented as a node Con s. Since
no variables have been found in this case, the existing renaming table rs is
returned. The next case, an expression Var s, represents a variable whose name
is the string value of s. The function performs a monadic operation gensym s to
create a fresh name, which will contain the string s along with a counter value.
This fresh name is bound to s1. An entry for the renaming table is calculated,
with the help of a functional argument f, and the result is attached to the head
of the rs list. Many parts of the Hydra transformation involve similar recursions
over abstract syntax trees in order to obtain the information needed to generate
the final code.
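A sketch of the shape being described, assuming the early Template Haskell representation in which expressions have constructors such as Con and Var carrying strings, and gensym creates fresh names in the quotation monad Q (the exact type is our guess), is:

findvars_exp :: (String -> String -> r) -> [r] -> Exp -> Q [r]
findvars_exp f rs (Con s) = return rs    -- a constant introduces no variables
findvars_exp f rs (Var s) =
  do s1 <- gensym s                      -- fresh name: the string s plus a counter
     return (f s s1 : rs)                -- a new renaming entry at the head of the table
-- ... further equations recurse over the remaining syntactic constructs ...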
The renaming table constructed by findvars_exp is used in a later stage of
the transformation to build a new expression; this is performed by update_exp,
which again recurses over the abstract syntax tree.
5 Conclusion
The design and implementation of Hydra is an example of embedding a domain
specific language in a general purpose language. The embedding has always been
fundamental to the whole conception of Hydra. At the beginning, in 1982, the
project was viewed as an experiment to see how expressive functional languages
really are, as well as a practical tool for studying digital circuits. It took several
years before Hydra came to be seen as a domain specific language, rather than
just a style for using functional languages to model hardware.
There are several criteria for judging whether an embedding is successful:
– Does the host language provide all the capabilities needed for the domain
specific language? It’s no good if essential services – such as netlist generation
– must be abandoned in order to make the DSL fit within the host language.
– Does the host language provide undesirable characteristics that would be in-
herited by the domain specific one? This criterion is arguably a compelling
reason not to use most imperative programming languages for hosting a hard-
ware description language, because the sequential assignment statement has
fundamentally different semantics than the state changes in a synchronous
circuit. But users of VHDL would probably disagree with that assessment,
and there is always a degree of opinion in assessing this criterion. For ex-
ample, many people find the punctuation-free syntax of Haskell, with the
layout rule, to be clean, readable, and elegant. Others consider it to be an
undesirable feature of Haskell which has been inherited by Hydra.
– Have the relevant advantages of the host language been retained? The most
uncompromising requirement of Hydra has always been support for equa-
tional reasoning, in order to make formal methods helpful during the design
process. This is a good reason for using a functional language like Haskell,
but it is also essential to avoid any techniques that make the formal reasoning
unsound, or unnecessarily difficult.
– Does the embedding provide a notation that is natural for the problem do-
main? A domain specific language really must use notation that is suitable
for the application domain. It is unacceptable to force users into a completely
foreign notation, simply in order to save the DSL implementor the effort of
writing a new compiler. Before Template Haskell became available, Hydra
failed badly on this criterion. If the hardware designer is forced to insert
labels manually, in order to solve the netlist problem, one could argue that
the embedding is causing serious problems and a special purpose compiler
would be more appropriate.
Based on these criteria, the embedding of Hydra in Template Haskell has
been successful. All the necessary features of Hydra – concise specification, sim-
ulation and other software tools, software logic probes, netlist generation – are
provided through embedding, by using Haskell’s features. Arguably, an undesir-
able characteristic of Haskell that might be inherited by Hydra is the lack of a
pointer equality predicate. However, we have avoided the need for pointer equal-
ity through metaprogramming – and this has enabled us to retain the crown-
ing advantage of Haskell, equational reasoning, which would have been lost if
pointer equality were used! Finally, the notation for writing circuits is concise
and lightweight; experience has shown that beginners are able to learn and use
Hydra effectively to design digital circuits, even if they do not know Haskell.
The criteria above have been used as guiding principles through the entire
project. Of course they are not simple binary success/failure measures. It is
always fair to ask, for example, whether the notation could be made more friendly
to a hardware designer if we did not need to employ the syntax of Haskell, and
opinions differ as to what notations are the best. This situation is not peculiar
to Hydra: all other hardware description languages also inherit syntactic clutter
from existing programming languages, even when they are just inspired by the
languages and are not embedded in them.
It is interesting to observe that most of the embedding of Hydra in Haskell
does not rely on metaprogramming; it is almost a perfect embedding in standard
Haskell. Although metaprogramming is not used throughout the entire imple-
mentation of Hydra, it is absolutely essential in the one place where it is used.
There is a continuum of possible embeddings of a DSL in a host language. At
one extreme, a perfect embedding treats the DSL as just a library to be used with
the host language, and no metaprogramming is needed. At the other extreme,
the host language is used to write a conventional compiler, and the DSL inherits
nothing directly from the host. Metaprogramming is also unnecessary here; one
just needs a language suitable for compiler construction. Metaprogramming is
most valuable for intermediate situations where the DSL can almost be imple-
mented in the host language with a library of ordinary function definitions, but
there is some mismatch between the semantics of the languages that causes it
not quite to work. Hydra illustrates exactly that situation.
Template Haskell offers a significant improvement to the ability of Haskell
to host domain specific languages. Wherever there is a mismatch between the
DSL and Haskell, a metaprogram has the opportunity to analyze the source and
translate it into equivalent Haskell; there is no longer a need to make the original
DSL code serve also as the executable Haskell code.
In effect, Template Haskell allows the designer of a domain specific language
to find the right mixture of embedding and compilation. It is no longer necessary
for an embedding to be perfect in order to be useful. Haskell has excellent facil-
ities for general purpose programming, as well as excellent properties for formal
reasoning. Now that the constraints on embedding have been relaxed, Haskell is
likely to find much wider use as the host for domain specific languages.
References
1. Simon Peyton Jones (ed.). Haskell 98 language and libraries. Journal of Functional
Programming, 13(1):1–255, January 2003.
2. Steven D. Johnson. Synthesis of Digital Designs from Recursion Equations. MIT
Press, 1984. The ACM Distinguished Dissertation Series.
A DSL Paradigm for Domains of Services
Charles Consel and Laurent Réveillère
INRIA – LaBRI
ENSEIRB – 1, avenue du docteur Albert Schweitzer
Domaine universitaire - BP 99
F-33402 Talence Cedex, France
{consel,reveillere}@labri.fr
http://compose.labri.fr
1 Introduction
Unpredictability and volatility make the supply of new services vital. In the
context of telecommunications, the platform owners should not constrain third-
party service providers. Instead, they should encourage them to participate in
the service creation process so as to increase and diversify the supply of new
services. Currently, platforms are often closed to potential service providers be-
cause conventional models rely on controlling the market from the platform to
the end-users. Beyond economic reasons, platform openness is prohibited be-
cause it compromises the robustness of the platform.
Our Approach
To address the key issues of service creation, we introduce a paradigm based on
domain-specific languages (DSLs). A DSL is developed for a domain of services.
It provides networking and telecommunication experts with dedicated syntax
and semantics to quickly develop robust variations of services within a particu-
lar domain. Because of a critical need for standardization, the networking and
telecommunication area critically relies on a protocol specification to define a
domain of services. Our approach aims to introduce variations in a domain of
services without requiring changes to the protocol. Furthermore, the DSL is re-
stricted so as to control the programmability of variations. Finally, we describe
how the most common software architecture of the networking and telecommu-
nication area, namely the client-server model, can be extended to support our
DSL paradigm.
We use our DSL paradigm to develop the Nova platform that consists of dif-
ferent domains of services, namely, telephony services, e-mail processing, remote-
document processing, stream and HTTP resource adapters. A DSL has been
developed for each domain of services; the definition of this domain of services
is itself derived from a specific protocol.
To illustrate our presentation, we consider services dedicated to access mail-
boxes remotely. Such services allow an end-user to access messages stored on a
remote server. In fact, there are many ways in which these services can be real-
ized. One domain of services is defined by the Internet Message Access Protocol
(IMAP) [4, 5]. We show that variations need to be introduced in this domain of
services to adapt to a variety of user requirements. We develop a DSL to enable
service variation to be introduced without compromising the robustness of the
e-mail server.
Overview
Section 2 identifies the key requirements of a platform for communication ser-
vices. Section 3 describes how to introduce and control programmability in such
a platform. Section 4 presents strategies to introduce programmability in the
client-server model. Section 5 shows how it scales up to a complete platform for
communication services. Section 6 discusses lessons learned during the develop-
ment of Nova. Section 7 gives concluding remarks.
2 Requirements
In this section, we present the key requirements that a platform should fulfill to
successfully address the rapidly evolving domain of communication services.
Openness. Most families of services are bound to evolve, often rapidly and
unpredictably in emerging domains such as multimedia communications. To ad-
dress this key issue, a platform for communication services has to be open. This
openness should enable a wide variety of services to be easily introduced. In fact,
each community of users ought to be offered specific services, applications and
ways of communicating that reflect their interests, cultural backgrounds, social
codes, etc.
available bandwidth. For example, one can filter out messages with respect to
some criteria to minimize the length of the summary of new messages received.
introducing service variations. In fact, each of these strategies has been used in
Nova as illustrated later in this section and in Section 5.
Although adaptations could either be done on the client side or the server
side, we concentrate on the latter to relieve the client terminal from adaptation
processing and the network from unnecessary data.
To assess our approach, we have used the DSL paradigm to develop a pro-
grammable platform for communication services, named Nova. It consists of a
programmable server and a DSL for each target application domain. Five appli-
cation domains are currently covered by Nova: e-mail processing, remote docu-
ment processing, telephony services, streams, and HTTP resource adapters. Let
us briefly present these different application areas.
Telephony services are executed over a signaling platform based on the Ses-
sion Initiation Protocol (SIP). We have designed a dialect of C to program call
processing services, named Call/C. In contrast with a prior language, called
CPL [20], our DSL is a full-fledged programming language based on familiar syn-
tax and semantics. Yet, it conforms with the features and requirements of a call
processing language as listed in the RFC 2824 [21]. In fact, our DSL goes even
further because it introduces domain-specific types and constructs that allow
verifications beyond the reach of both CPL and general-purpose languages. The
example shown in Figure 2 illustrates the use of the Call/C language to program
a call forwarding service. This service is introduced by defining a behavior for
the incoming request of the SIP protocol (strictly speaking, a call is initiated
with the request Invite of the SIP protocol). When a call is received, the incoming
entry point is invoked with information about the caller. In this call forwarding
service, the incoming call is redirected to sip:[email protected]. If this redirection
is itself redirected, then the location of the new call target is analyzed. If the lo-
cation is not a voice-mail address, then the redirection is performed. Otherwise,
the call is redirected to the callee’s voice-mail sip:[email protected].
filter graph that defines the intermediate steps and the filters used for the format
conversion of a document. Filters can be combined to perform a wide range of
transformations.
The ICAP protocol [22] was designed to facilitate better distribution and
caching for the Web. It distributes Internet-based content from the origin servers,
via proxy caches (ICAP clients), to dedicated ICAP servers. These ICAP servers
focus on specific value-added services such as access control, authentication, lan-
guage translation, content filtering, and virus scanning. Moreover, ICAP enables
adaptation of content in such a way that it becomes suitable for other less pow-
erful devices such as PDAs and mobile phones.
6 Assessment
In this section, we review the lessons learned from our experience in developing
Nova. We provide some insights obtained from the study of the different domains
of services supported by Nova. Some performance and robustness aspects are
discussed and related to existing works. Finally, we address some of the issues
raised by the introduction of domain-specific languages.
6.2 Performance
Traditional compilation techniques are applicable to DSLs. In fact, it has been
shown that DSL features can enable drastic optimizations, beyond the reach of
general-purpose languages [24].
In the IMAP case, we have conducted some experiments to assess the perfor-
mance and bandwidth usage of the programmable server approach. The results
of these experiments show that no significant performance overhead is introduced
in the programmable IMAP server, compared to its original version [18]. In the
Call/C example, we showed that DSL invariants enabled many target-specific
optimizations when generating code for a given programming layer [25].
6.3 Robustness
Our approach assumes that service developers may not be trusted by the owner
of the server. Furthermore, when the domains of services involve ordinary users,
as in the case of Nova, the developer may not be an experienced programmer.
Let us now present the key issues raised by using DSLs as a paradigm to address
domains of communication services.
Cost of DSL invention. In our approach, a DSL is developed for each target
domain of services. This systematic language invention introduces a cost in terms
of domain analysis, language design and implementation. Traditionally, domain
analysis and language design require significant efforts. In contrast, our approach
relies on a key existing component: a protocol in the target domain. This protocol
paves the way for the domain analysis by exposing the fundamental abstractions.
It also suggests variations in the domain of services, feeding the language design
process.
Learning overhead. Some effort is usually required for learning a new language.
However, unlike a general-purpose language, a DSL uses domain-specific nota-
tions and constructs rather than inventing new ones. This makes it easier for
domain experts to quickly adopt and use the language [14].
Programming interface. The five DSLs currently included in Nova have a textual
representation and a C-like syntax. Yet, programs in these DSLs could be written
using other representations. For example, one could use an existing framework such
as XML to reduce the learning curve for users familiar with these notations.
Also, textual forms could be abstracted by visual forms. That is, a DSL may
have a graphical representation and be supported by an appropriate graphic-
user interface. For example, we have developed a graphical front-end for the
development of Spidle programs.
Acknowledgment
This work has been partly supported by the Conseil Régional d’Aquitaine under
Contract 20030204003A.
References
1. Ghribi, B., Logrippo, L.: Understanding GPRS: the GSM packet radio service.
Computer Networks (Amsterdam, Netherlands: 1999) 34 (2000) 763–779
2. Mock, M., Nett, E., Schemmer, S.: Efficient reliable real-time group communication
for wireless local area networks. In Hlavicka, J., Maehle, E., Pataricza, A., eds.:
Dependable Computing - EDCC-3. Volume 1667 of Lecture Notes in Computer
Science., Springer-Verlag (1999) 380
3. O’Mahony, D.: Umts: The fusion of fixed and mobile networking. IEEE Internet
Computing 2 (1998) 49–56
4. IETF: Internet Message Access Protocol (IMAP) - version 4rev1 (1996) Request
for Comments 2060.
5. Mullet, D., Mullet, K.: Managing IMAP. O’REILLY (2000)
6. Althea: An IMAP e-mail client for X Windows. http://althea.sourceforge.net
(2002)
7. Microsoft: Microsoft Outlook. http://www.microsoft.com/outlook (2003)
8. Netscape: Netscape Messenger. http://wp.netscape.com (2003)
9. Parnas, D.: On the design and development of program families. IEEE Transactions
on Software Engineering 2 (1976) 1–9
10. Van Deursen, A., Klint, P., Visser, J.: Domain-specific languages: An annotated
bibliography. ACM SIGPLAN Notices 35 (2000) 26–36
11. McCain, R.: Reusable software component construction: A product-oriented
paradigm. In: Proceedings of the 5th AIAA/ACM/NASA/IEEE Computers in
Aerospace Conference, Long Beach, California (1985)
12. Neighbors, J.: Software Construction Using Components. PhD thesis, University
of California, Irvine (1980)
13. Weiss, D.: Family-oriented abstraction specification and translation: the FAST
process. In: Proceedings of the 11th Annual Conference on Computer Assurance
(COMPASS), Gaithersburg, Maryland, IEEE Press, Piscataway, NJ (1996) 14–22
14. Consel, C., Marlet, R.: Architecturing software using a methodology for language
development. In Palamidessi, C., Glaser, H., Meinke, K., eds.: Proceedings of the
10th International Symposium on Programming Language Implementation and
Logic Programming. Volume 1490 of Lecture Notes in Computer Science., Pisa,
Italy (1998) 170–194
15. Consel, C.: From a program family to a domain-specific language (2004) In this
volume.
16. IETF: The WWW common gateway interface version 1.1. http://cgi-
spec.golux.com/ncsa (1999) Work in progress.
17. Brabrand, C., Møller, A., Schwartzbach, M.I.: The <bigwig> project. ACM Trans-
actions on Internet Technology 2 (2002)
18. Consel, C., Réveillère, L.: A programmable client-server model: Robust extensi-
bility via dsls. In: Proceedings of the 18th IEEE International Conference on Au-
tomated Software Engineering (ASE 2003), Montréal, Canada, IEEE Computer
Society Press (2003) 70–79
19. University of Washington: Imap server. ftp://ftp.cac.washington.edu/imap/ (2004)
20. Rosenberg, J., Lennox, J., Schulzrinne, H.: Programming internet telephony ser-
vices. IEEE Network Magazine 13 (1999) 42–49
21. IETF: Call processing language framework and requirements (2000) Request for
Comments 2824.
22. IETF: Internet content adaptation protocol (icap) (2003) Request for Comments
3507.
23. Consel, C., Hamdi, H., Réveillère, L., Singaravelu, L., Yu, H., Pu, C.: Spidle: A DSL
approach to specifying streaming applications. In: Second International Conference
on Generative Programming and Component Engineering, Erfurt, Germany (2003)
24. Eide, E., Frei, K., Ford, B., Lepreau, J., Lindstrom, G.: Flick: A flexible, opti-
mizing IDL compiler. In: Proceedings of the ACM SIGPLAN ’97 Conference on
Programming Language Design and Implementation, Las Vegas, NV, USA (1997)
44–56
25. Brabrand, C., Consel, C.: Call/c: A domain-specific language for robust internet
telephony services. Research Report RR-1275-03, LaBRI, Bordeaux, France (2003)
PiLib: A Hosted Language
for Pi-Calculus Style Concurrency
1 Introduction
2.1 Syntax
There are various versions of the π-calculus. The variant we chose for PiLib
has very few constructs and is similar to Milner’s definition in [6]. The only
differences in our definition are the absence of an unobservable action τ and the
use of recursive agent definitions instead of the replication operator !.
Here is the inductive definition of π-calculus processes.
Processes:
\[
P, Q \;::=\; \sum_{i=0}^{n} G_i \;\;\big|\;\; \nu x.P \;\;\big|\;\; P \mid Q \;\;\big|\;\; A(y_1, \ldots, y_n)
\]
(sum, with $n \ge 0$; restriction; parallel composition; identifier, with $n \ge 0$ parameters)

Guarded processes:
\[
G \;::=\; x(y).P \;\;\big|\;\; \bar{x}\langle y\rangle.P
\]
(input; output)

Definitions:
\[
A(x_1, \ldots, x_n) \;\stackrel{def}{=}\; P \qquad (x_i\text{'s distinct},\ fn(P) \subseteq \{x_1, \ldots, x_n\})
\]
A process in the π-calculus is either a choice Σi Gi of several guarded pro-
cesses Gi , or a parallel composition P | Q of two processes P and Q, or a
restriction νx.P of a private fresh name x in some process P , or a process iden-
tifier A(y1 , . . . , yn ) which refers to a (potentially recursive) process definition.
In a sum, processes are guarded by an input action x(y) or an output action
x̄⟨y⟩, which receives (respectively sends) a channel y via the channel x before
the execution of the two communicating processes continues. In the process def-
inition, fn(P ) is the set of free names of the process P , i.e. the names that are
not bound through the parameter of an input guarded process or through the
restriction operator.
\[
\begin{array}{ll}
\textsc{Extr-par}: & P \mid \nu x.Q \;\equiv\; \nu x.(P \mid Q) \qquad \text{if } x \notin fn(P) \\[4pt]
\textsc{Extr-sum}: & P + \nu x.Q \;\equiv\; \nu x.(P + Q) \qquad \text{if } x \notin fn(P) \\[4pt]
\textsc{Unfold}:   & A(\tilde{y}) \;\equiv\; P[\tilde{y}/\tilde{x}] \qquad \text{if } A(\tilde{x}) \stackrel{def}{=} P
\end{array}
\]
In the third equality, $\tilde{x}$ represents a sequence $x_1, \ldots, x_n$ of names and $P[\tilde{y}/\tilde{x}]$
denotes the substitution of $y_i$ for every occurrence of $x_i$ in $P$.
Beside these three equalities, there are also rules that identify processes that
differ only in the names of bound variables, that declare + and | to be associative
and commutative, and that allow permuting restriction binders νx.
\[
\textsc{Par}\;\; \frac{P \rightarrow P'}{P \mid Q \;\rightarrow\; P' \mid Q}
\qquad\qquad
\textsc{Nu}\;\; \frac{P \rightarrow Q}{\nu x.P \;\rightarrow\; \nu x.Q}
\]
\[
\textsc{Struct}\;\; \frac{P \equiv P' \qquad P' \rightarrow Q' \qquad Q' \equiv Q}{P \;\rightarrow\; Q}
\]
\[
\textsc{Com}\;\; \bigl(a(x).P + P'\bigr) \;\mid\; \bigl(\bar{a}\langle y\rangle.Q + Q'\bigr) \;\longrightarrow\; P[y/x] \mid Q
\]
The rule Par tells us that a process can evolve independently of other processes
in parallel. The rule Nu is a contextual rule that makes it possible to reduce
under a restriction. The rule Struct allows equivalent processes to be identified
when considering reduction. Finally, rule Com is the key to understanding the
synchronization mechanism of the calculus: two parallel processes that can
perform complementary actions, i.e. send and receive the name of some channel
along the same channel, can communicate. They then proceed with the guarded
processes associated with the send and receive actions involved. Note that the
name x, potentially free in the continuation P of the input guarded process
a(x).P, is bound to the transmitted name y thanks to the substitution [y/x]
applied to P. The alternatives in each communicating sum that do not play a
role in the communication, namely P′ and Q′, are discarded; for this reason a
sum is also called a choice.
Consumer is a process that tirelessly gets an item from the buffer and discards
it.
The three processes are put in parallel using the operator | and are linked to-
gether through the sharing of the fresh channels put and get introduced by the
restriction operator ν.
In the example above the values added to the buffer by the producer and ex-
tracted from it by the consumer are π-calculus channels because there is nothing
else to transmit in the π-calculus. So both channels and the values they carry
are taken from the same domain. A typical way of typing such recursive channels
in PiLib is to use a recursive type definition:
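A minimal sketch of what such a definition can look like (the class name Channel is an assumption of this sketch, not necessarily the name used in the chapter):

  // A channel whose transmitted values are themselves channels of the same kind.
  class Channel extends Chan[Channel];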
Such a definition can be read: “A π-calculus channel is a channel that can carry
other π-calculus channels”.
Using this type definition, the π-calculus code above now has an exact coun-
terpart in PiLib:
New features of PiLib appear only in the way of creating new channels and
executing several processes in parallel using spawn.
As we have seen, the implementation using PiLib is very close to the specification
in π-calculus. It seems as if Scala had special syntax for π-calculus primitives;
however, PiLib is nothing more than a library implemented using the low-level
concurrency constructs of Scala inherited from Java. In the rest of the paper we
will try to demystify the magic.
Of course we can also write a two-place buffer using monitors as in Java
and the implementation would certainly be more efficient. But then relating the
implementation to the specification would be non-trivial. With PiLib we can
closely relate the specification language and the implementation language, and
thereby gain immediate confidence in our program.
class Monitor {
def synchronized[a](def p: a): a;
def wait(): unit;
def wait(timeout: long): unit;
def notify(): unit;
def notifyAll(): unit;
}
Now we present the grammar of PiLib. How this grammar is actually imple-
mented as a library in Scala is the topic of the next section. Here we just
present the syntax of PiLib constructs and give an informal description of their
associated semantics.
Channel creation. At the basis of the PiLib interface is the concept of channel
(also called “name” in the π-calculus). A channel represents a communication
medium. To get an object that represents a fresh channel that can carry objects
of type A, one simply writes new Chan[A].
object. The type of a guarded process whose continuation has result type B is
GP[B]. Note that instead of { x ⇒ c } we could have used any expression of
type A ⇒ B (the type of functions from A to B).
Similarly, an output guarded process is written a(v) ∗ c where v is the value
to be sent and c is any continuation expression. The type of the guarded process
is again GP[B], where B is the type of the continuation expression c.
As in the π-calculus on which it is based, communications in PiLib are
synchronous: when a thread tries to output (resp. input) a value on a given
channel it blocks as long as there is no other thread that tries to perform an
input (resp. output) on the same channel. When two threads are finally able to
communicate, through input and output guarded processes on the same channel,
the thread that performs the input proceeds with the continuation of the input
guarded process applied to the transmitted value, and the thread that performs
the output proceeds with the continuation of the output guarded process.
Summation. The next specific ingredient of PiLib is the function choice which
takes an arbitrary number of guarded processes as arguments and tries to estab-
lish communication with another thread using one of these guarded processes.
The choice function blocks until communication takes place, as explained above.
Once the communication takes place, the guarded processes among the argu-
ments to the function choice that do not take part in the communication are
discarded.
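As a small illustration (a hypothetical snippet assuming PiLib's Chan, ∗ and choice are in scope; the channels and values are made up):

  val a = new Chan[Int];
  val b = new Chan[Int];
  // Block until one of the two guards can communicate; the continuation
  // of the guard that fires yields the result of the whole choice.
  val result: Int = choice(
    a * { x => x + 1 },   // input guarded process: receive x on a
    b(42) * 0             // output guarded process: send 42 on b
  );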
The typing of the PiLib constructs is summarized by the following rules:

  Nu:       new Chan[A] : Chan[A]
  Input:    if a : Chan[A] and { x ⇒ c } : A ⇒ B, then a ∗ { x ⇒ c } : GP[B]
  Output:   if a : Chan[A], v : A and c : B, then a(v) ∗ c : GP[B]
  Choice:   if gi : GP[B] for every i ∈ {1, . . . , n} (n ≥ 0), then choice (g1, . . . , gn) : B
  Spawn:    if pi : unit for every i ∈ {1, . . . , n} (n ≥ 1), then spawn < p1 | . . . | pn > : unit
val x = a.read;
var x: A = null;
choice (a ∗ { y ⇒ x = y });
5 Desugarization
In this section we explain how it is possible to implement PiLib as a hosted
language, i.e. as a Scala library. We consider each PiLib construct in turn and
see how it is interpreted by the language.
Channel Creation
In PiLib, a fresh channel carrying objects of type A is created using the syntax
new Chan[A]. Indeed, this is the normal Scala syntax for creating instances of
the parameterized class Chan.
From Scala's point of view, an input guarded process
  a ∗ { x ⇒ c }
is nothing but the method call
  a.∗({ x ⇒ c }) ,
since an infix operator such as ∗ is an ordinary method of its left operand. The
closure { x ⇒ c } passed as argument is itself shorthand for an anonymous instance
of the function class:
  new Function1[A,B] {
    def apply(x: A): B = c;
  }
Guarded Process
The type of an input or output guarded process expression is GP[A] where A is
the result type of the continuation.
The class GP[A] encapsulates the elements that compose a guarded process:
the name of the channel and the continuation function for an input guarded
process, and the name of the channel, the transmitted value and the continuation
function for an output guarded process.
Summation
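The relevant declaration has roughly the following shape (a sketch in the style of the chapter's other declarations, not its literal code):

  // The trailing * makes gs a repeated parameter: choice accepts any
  // number of guarded processes, all yielding the same result type a.
  def choice[a](gs: GP[a]*): a;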
As specified by the star symbol at the end of the argument type, the argument
can be a sequence of guarded processes of arbitrary length.
The syntax to fork several threads is spawn < p | q | r >. As we have seen pre-
viously, this expression is recognized as spawn.<(p).|(q).|(r).>. Here is the im-
plementation of this construct:
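One possible implementation can be sketched as follows, written here in present-day Scala syntax (by-name parameters as => Unit) rather than the early Scala used in the chapter; the helper fork and the overall shape are assumptions of this sketch:

  object spawn {
    // Run one process body in its own thread; by-name passing defers evaluation.
    private def fork(p: => Unit): Unit = {
      val t = new Thread(new Runnable { def run(): Unit = p })
      t.start()
    }
    def <(p: => Unit): spawn.type = { fork(p); this }  // first forked process
    def |(p: => Unit): spawn.type = { fork(p); this }  // each further process
    def > : Unit = ()                                  // closes the spawn expression
  }

With definitions of this kind, each of p, q and r in spawn < p | q | r > starts running in its own thread as soon as the corresponding method call is made.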
Polymorphism
Type genericity allowed us to parameterize a channel by the type of the objects
it can carry. The Scala type system makes sure that a channel is always used
with an object of the right type.
In Scala, methods can also be polymorphic. As π-calculus agents are repre-
sented by methods in PiLib, such agents can be polymorphic, like the two-place
buffer in the example of Section 3.
Type parameters may have bounds in Scala, and these bounds can be re-
cursive. This feature known under the name of F-bounded polymorphism can be
used to express recursive channels. For instance to define a kind of channel that
can carry pairs of integers and channels of the same kind, we would write
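A sketch in current Scala syntax, with made-up names, of the kind of declaration meant here:

  // Stand-in so the sketch is self-contained; PiLib's real Chan differs.
  class Chan[A]

  object FBoundedSketch {
    // F-bounded type parameter: C is bounded by a type that mentions C itself,
    // so a C is a channel carrying pairs of an Int and a channel of the same kind.
    def forward[C <: Chan[(Int, C)]](c: C): C = c
  }

  // A concrete channel type satisfying the recursive bound:
  class IntPairChan extends Chan[(Int, IntPairChan)]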
Syntactic Sugar
It is convenient to write x(v) ∗ c instead of x.apply(v).∗(c). This is permitted
by the Scala parser, which performs simple but powerful transformations.
The fact that all functions are members of a class (i.e. methods) allows us
to overload already existing operators such as ∗ easily without introducing am-
biguity.
Call-by-Name
The parameter modifier def makes it possible in Scala to explicitly specify that
some parameters must be called by name. This feature of the language is used on
two occasions in PiLib. In an output guarded process x(v) ∗ c, the continuation
c must not be evaluated at the creation of the guarded process but only after
communication takes place with another matching guarded process; it is therefore
a perfect candidate to be passed by name. Also, in the construct spawn < p | q >,
the call-by-name strategy for the parameters of spawn avoids a sequential
evaluation of the arguments p and q.
An alternative way to delay the evaluation of an expression e is to manipulate
the closure () => e explicitly, but this is more cumbersome for the programmer.
Sequence Type
Another feature of Scala, which is used in the definition of the choice function,
is the possibility of passing an arbitrary number of arguments to a method.
Choice resolution. Now we will explain what happens when a thread calls the
function choice. The argument of the function choice is a sequence of guarded
processes. This sequence is first turned into a sum with an undefined continuation;
from now on this sum is designated as the arriving sum. The pending list
is scanned to find a complementary sum. If there is none, the arriving sum is
appended to the end of the pending list, and the choice function waits until the
sum's continuation gets a value and then executes this continuation. If there is a
complementary sum, it is extracted from the pending list, both communicating
sums get the value for their continuation, and their respective threads are woken up.
The implementation consists of about 150 lines of Scala code with a large
part dedicated to the syntactic sugar. The complete source code is available on
the web [24].
8 Conclusion
We have presented PiLib, a language for concurrent programming in the style
of the π-calculus. The language is hosted as a library in Scala. The hosting
provides “for free” a highly higher-order π-calculus, because all parts of a PiLib
construct are values that can be passed to methods and transmitted over chan-
nels. This is the case for channels themselves, output prefixes, continuations of
input and output guarded processes, and guarded processes.
A PiLib program corresponds closely to a π-calculus specification. Of course,
any run of that program will only perform one of the possible traces of the
specification. Moreover, this trace is not necessarily fair, because of PiLib’s
method of choosing a matching sum in the pending list. A fair implementation
remains a topic for future work.
There is a result of Palamidessi [25, 26] showing that it is not possible to
implement the mixed choice π-calculus (when a sum can contain at the same
time input and output guarded processes) into an asynchronous language in a
distributed (deterministic) way. Our current implementation is centralized and
in this case the mixed choice is not problematic. A possible continuation of this
work is to implement a distributed version of PiLib, which would force us to
abandon the mixed choice.
Our experience with PiLib in class has been positive. Because of the close
connections to the π-calculus, students quickly became familiar with the syntax
and programming methodology. The high-level process abstractions were a big
help in developing correct solutions to concurrency problems. Generally, it was
far easier for the students to develop a correct system using PiLib than using
Java’s native thread and monitor-based concurrency constructs.
References
1. Reppy, J.: CML: A higher-order concurrent language. In: Programming Language
Design and Implementation, SIGPLAN, ACM (1991) 293–305
2. INMOS Ltd.: OCCAM Programming Manual. Prentice-Hall International (1984)
3. Hoare, C.A.R.: Communicating sequential processes. Communications of the ACM
21 (1978) 666–677 Reprinted in “Distributed Computing: Concepts and Implemen-
tations” edited by McEntire, O’Reilly and Larson, IEEE, 1984.
4. Pierce, B.C., Turner, D.N.: Pict: A programming language based on the pi-calculus.
In Plotkin, G., Stirling, C., Tofte, M., eds.: Proof, Language and Interaction: Essays
in Honour of Robin Milner. MIT Press (2000) 455–494
5. Milner, R., Parrow, J., Walker, D.: A calculus of mobile processes (Parts I and II).
Information and Computation 100 (1992) 1–77
6. Milner, R.: Communicating and Mobile Systems: the Pi-Calculus. Cambridge
University Press (1999)
7. Giacalone, A., Mishra, P., Prasad, S.: Facile: A symmetric integration of concurrent
and functional programming. International Journal of Parallel Programming 18
(1989) 121–160
8. Conchon, S., Fessant, F.L.: Jocaml: Mobile agents for Objective-Caml. In: First
International Symposium on Agent Systems and Applications (ASA’99)/Third In-
ternational Symposium on Mobile Agents (MA’99), Palm Springs, CA, USA (1999)
9. Odersky, M.: Functional nets. In: Proc. European Symposium on Programming.
Number 1782 in LNCS, Springer Verlag (2000) 1–25
10. Benton, N., Cardelli, L., Fournet, C.: Modern concurrency abstractions for C#. In:
Proceedings of the 16th European Conference on Object-Oriented Programming,
Springer-Verlag (2002) 415–440
11. Fournet, C., Gonthier, G.: The reflexive chemical abstract machine and the join-
calculus. In: Principles of Programming Languages. (1996)
12. Armstrong, J., Virding, R., Wikström, C., Williams, M.: Concurrent Programming
in Erlang, Second Edition. Prentice-Hall (1996)
13. Smolka, G., Henz, M., Würtz, J.: Object-oriented concurrent constraint program-
ming in Oz. In van Hentenryck, P., Saraswat, V., eds.: Principles and Practice of
Constraint Programming. The MIT Press (1995) 29–48
14. Arvind, Gostelow, K., Plouffe, W.: The ID-Report: An Asynchronous Program-
ming Language and Computing Machine. Technical Report 114, University of
California, Irvine, California, USA (1978)
15. Peyton Jones, S., Gordon, A., Finne, S.: Concurrent Haskell. In ACM, ed.: Confer-
ence record of POPL ’96, 23rd ACM SIGPLAN-SIGACT Symposium on Principles
of Programming Languages: papers presented at the Symposium: St. Petersburg
Beach, Florida, 21–24 January 1996, New York, NY, USA, ACM Press (1996)
295–308
16. Barth, P.S., Nikhil, R.S., Arvind: M-structures: Extending a parallel, non-strict,
functional language with state. In Hughes, J., ed.: Proceedings Functional Pro-
gramming Languages and Computer Architecture, 5th ACM Conference, Cam-
bridge, MA, USA, Springer-Verlag (1991) 538–568 Lecture Notes in Computer
Science 523.
17. Gosling, J., Joy, B., Steele, G., Bracha, G.: The Java Language Specification,
Second Edition. Java Series, Sun Microsystems (2000) ISBN 0-201-31008-2.
18. Box, D.: Essential .NET, Volume I: The Common Language Runtime. Addison
Wesley (2002)
A Language and Tool for Generating Efficient Virtual Machine Interpreters
David Gregg and M. Anton Ertl
1 Introduction
3 Automation
When creating a VM interpreter, there are many repetitive pieces of code: The
code for executing one VM instruction has similarities with code for executing
other VM instructions (get arguments, store results, dispatch next instruction).
Similarly, when we optimise the source code for an interpreter, we apply sim-
ilar transformations to the code for each VM instruction. Applying those op-
timisations manually would be very time consuming and expensive, and would
inevitably lead us to exploring only a very small part of the design space for
interpreter optimisations. This would most likely limit the performance of the
resulting interpreter, because our experience is that the correct mix of interpreter
optimisations is usually found by experimentation.
Our system generates C source code, which is then fed into a compiler. With
respect to optimisation, there is a clear division of labour in our system. Vmgen
performs relatively high-level optimisations while generating the source code.
These are made possible by vmgen’s domain-specific knowledge of the structure
of interpreters, and particularly of the stack. Lower-level optimisations, on the other hand, are left to the C compiler.
4 A Domain-Specific Language
Our domain-specific language for describing VM instructions, vmIDL, is sim-
ple, but it allows a very large amount of routine code to be generated from a
very short specification. The most important feature of vmIDL is that each VM
instruction defines its effect on the stack. By describing the stack effect of each in-
struction at a high level, rather than as simply a sequence of low-level operations
on memory locations, it is possible to perform domain-specific optimisations on
accesses to the stack.
iadd ( i1 i2 -- i )
{
i = i1+i2;
}
The stack effect (which is described by the first line) contains the following
information: the number of items popped from and pushed onto the stacks, their
order, which stack they belong to (we support multiple stacks for implementing
VMs such as Forth, which has separate integer and floating point stacks), their
type, and by what name they are referred to in the C code. In our example, iadd
pops the two integers i1 and i2 from the data stack, executes the C code, and
then pushes the integer i onto the data stack.
A significant amount of C code can be generated automatically from this
simple stack effect description: declarations of C variables for each of the stack
items, code to load and store items from the stack, code to write out the operands
and results while tracing the execution of a program, code to write out the
immediate arguments when generating VM code from source code, and code for
disassembling VM code. Similarly, because the effect on the stack is described
at a high level, code for different low-level representations of the stack can be
generated. This feature enables many of the stack optimisations described in
section 7.
SET_IP This keyword sets the VM instruction pointer. It is used for implementing VM branches.
TAIL This keyword indicates that the execution of the current VM instruction
ends and the next one should be invoked. Using this keyword is only neces-
sary when there is an early exit out of the VM instruction from within the
user-supplied C code. Vmgen automatically appends code to invoke the next
VM instruction to the end of the generated C code for each VM instruction,
so TAIL is not needed for instructions that do not branch out early.
ifeq ( #aTarget i -- )
{
if ( i == 0 ) {
SET_IP(aTarget);
TAIL;
}
}
4.3 Types
The type of a stack item is specified through its prefix. In our example, all
stack items have the prefix i that indicates a 32-bit integer. The types and their
prefixes are specified at the start of the vmIDL file:
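That declaration has roughly the following form (reconstructed from the explanation below; the exact line in the chapter may differ slightly):

  \E s" int" single data-stack type-prefix i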
The s" int" indicates the C type of the prefix (int). In our current implemen-
tation, this line is executable Forth code, and the slightly unorthodox s"..."
syntax is used to manipulate the string "int". The qualifier single indicates
that this type takes only one slot on the stack, data-stack is the default stack
for stack items of that type, and i is the name of the prefix. If there are several
matching prefixes, the longest one is used.
type, such as void *). When a VM uses multiple stacks, a stack prefix can be
declared. Finally, type prefixes are used to identify how data on the stack should
be interpreted (such as whether the value at the top of the stack should be
interpreted as an integer or floating point number).
Note that the syntax for declarations is rather unusual for a programming
language. As with comments, the syntax originates with Forth. The current
version of our interpreter generator system is implemented in Forth, and \E
denotes that vmgen should escape to the Forth interpreter. Everything appearing
after the \E is actually executable Forth code. For example, the vmIDL keyword
stack is a Forth procedure, which is called by vmgen to declare a new stack.
Although it is intended that this escape facility will only be used for declarations,
it allows our vmgen to be enormously flexible, since any valid Forth code can be
inserted in an escape line.
A great deal of research on domain-specific languages is concerned with se-
mantic issues such as reasoning about the properties of the described system,
checking for consistency, and type systems [5]. Our work on vmIDL does not ad-
dress these issues at all. The burden of finding semantic errors in the instruction
definition falls entirely on the programmer, in much the same way as if the in-
terpreter were written entirely in C, without the help of a generator. In fact, our
current system is deliberately lightweight, with only just enough functionality
to automatically generate the C code that would normally be written manually.
Our experience is that this is sufficient for the purposes of building efficient VM
interpreters, although occasionally we must examine the generated C code to
identify errors. Our work on vmIDL operates under the same economics as many
other domain-specific languages; the user base is not sufficiently large to support
features that are not central to building the interpreter.
5 Generator Output
Given an instruction definition in vmIDL, our generator, vmgen, creates several
different files of code, which are included into wrapper C functions using the C
preprocessor #include feature. By generating this code from a single definition,
we avoid having to maintain these different sections of code manually.
LABEL(iadd) { /* label */
int i1; /* declarations of stack items */
int i2;
int i;
NEXT_P0; /* dispatch next instruction (part 0) */
i1 = vm_Cell2i(sp[1]); /* fetch argument stack items */
i2 = vm_Cell2i(spTOS);
sp += 1; /* stack pointer updates */
{ /* user-provided C code */
i = i1+i2;
}
NEXT_P1; /* dispatch next instruction (part 1) */
spTOS = vm_i2Cell(i); /* store result stack item(s) */
NEXT_P2; /* dispatch next instruction (part 2) */
}
Fig. 2. Simplified version of the code generated for the iadd VM instruction
the C code from the instruction specification. After that, apart from the dispatch
code there is only the stack access for the result of the instruction. The stack
accesses load and store values from the stack. The variable spTOS is used for top-
of-stack caching (see section 7.2), while vm_Cell2i and vm_i2Cell are macros
for changing the type of the stack item from the generic type to the type of the
actual stack item. Note that if the VM instruction uses the TAIL keyword to exit
an instruction early, then the outputted C code will contain an additional copy
of the code to write results to the stack and dispatch the next instruction at the
early exit point.
This C code looks long and inefficient (and the complete version is even
longer, since it includes trace-collecting and other code), but GCC optimises it
quite well and produces the assembly code we would have written ourselves on
most architectures we looked at, such as the Alpha code in figure 3. (Other
compilers, such as Intel’s compiler for Linux, usually produce similar assembly
code for the stack access. Our experience is that most mainstream compilers
perform copy propagation and register allocation at least as well as GCC. However,
instruction dispatch is more efficient with GCC, since GNU C’s labels-as-values
extension can be used to implement threaded dispatch, rather than switch dispatch [10].)
5.2 Tracing
LABEL(iadd) {
NAME("iadd") /* print VM inst. name and some VM registers */
... /* fetch stack items */
#ifdef VM_DEBUG
if (vm_debug) {
fputs(" i1=", vm_out); printarg_i(i1); /* print arguments */
fputs(" i2=", vm_out); printarg_i(i2);
}
#endif
... /* user-provided C code */
#ifdef VM_DEBUG
if (vm_debug) {
fputs(" -- ", vm_out); /* print result(s) */
fputs(" i=", vm_out); printarg_i(i);
fputc(’\n’, vm_out);
}
#endif
... /* store stack items; dispatch */
5.4 Disassembler
Having a VM disassembler is useful for debugging the front end of the inter-
pretive system. All the information necessary for VM disassembly is present in
the instruction descriptions, so vmgen generates the instruction-specific parts
automatically:
if (ip[0] == vm_inst[1]) {
fputs("ipush", vm_out);
fputc(’ ’, vm_out); printarg_i((int)ip[1]);
ip += 2;
}
This example shows the code generated for disassembling the VM instruction
ipush. The if condition tests whether the current instruction (ip[0]) is ipush
(vm_inst[1]). If so, it prints the name of the instruction and its arguments, and
sets ip to point to the next instruction. A similar piece of code is generated for
all the VM’s instruction set. The sequence of ifs results in a linear search of the
existing VM instructions; we chose this approach for its simplicity and because
the disassembler is not time-critical.
5.5 Profiling
Vmgen supports profiling at the VM level. The goal is to provide information
to the interpreter writer about frequently-occurring (both statically and dynam-
ically) sequences of VM instructions. The interpreter writer can then use this
information to select VM instructions to replicate and sequences to combine into
superinstructions.
The profiler counts the execution frequency of each basic block. At the end of
the run the basic blocks are disassembled, and output with attached frequencies.
There are scripts for aggregating this output into totals for static occurrences
and dynamic execution frequencies, and to process them into superinstruction
and instruction replication rules. The profiler overhead is low (around a factor
of 2), allowing long-running programs to be profiled.
6 Experience
We have used vmgen to implement three interpreters: Gforth, Cacao and CVM.
Our work on interpreter generators began with Forth and was later generalised
to deal with the more complicated Java VM. This section describes the three
implementations, and provides a discussion of integrating a vmIDL interpreter
into the rest of a sophisticated JVM with such features as dynamic class loading
and threads.
CVM is an implementation of the Java 2 Micro Edition (J2ME) standard, which provides a core set of class libraries, and is intended for
use on devices with up to 2MB of memory. It supports the full JVM instruction
set, as well as full system-level threads. Our new interpreter replaces the existing
interpreter in CVM. Our CVM interpreter is similar to the Cacao implementa-
tion, except that it follows the JVM standard fully, and it is stable
and runs all benchmark programs without modification. Experimental results
[15] show that on a Pentium 4 machine the Kaffe 1.0.6 interpreter is 5.76 times
slower than our base version of CVM without superinstructions on standard large
benchmarks. The original CVM is 31% slower, and the Hotspot interpreter, the
hand-written assembly language interpreter used by Sun’s Hotspot JVM, is 20.4%
faster than our interpreter. Finally, the Kaffe JIT compiler is just over twice as
fast as our version of CVM.
Our CVM implementation does not interpret original Java bytecode. Instead
we take Java bytecode, and produce direct-threaded code [16] using vmgen’s VM
code generation functions. These generated functions replace sequences of simple
VM instructions with superinstructions as the VM code is generated. However,
quick instructions make this process much more complicated, since the VM code
modifies itself after it is created. Our current version performs another (hand-
written) optimisation pass over the method each time an instruction is replaced
by a quick version. This solution effective, but makes poor use of vmgen’s features
for automatic VM code optimisation. It is not clear to us how vmgen can be
modified to better suit Java’s needs in this regard, while still remaining simple
and general.
CVM uses system-level threads to implement JVM threads. Several threads
can run in parallel, and in CVM these run as several different instances of the in-
terpreter. As long as no global variables are used in the interpreter, these different
instances will run independently. Implementing threads and monitors involves
many difficult issues, almost all of which are made neither simpler nor more
difficult by the use of vmIDL for the interpreter core. One exception to this was
with quick instructions. The same method may be executed simultaneously
by several different threads, so race conditions can arise with quick instructions
which modify the VM code. We eventually solved this problem using locks on
the VM code when quickening, but the solution was not easily found. If we were
to implement the system again, we would implement threading within a single
instance of the interpreter, which would perform its own thread switches period-
ically. Interacting with the operating system’s threading system is complicated,
and reduces the portability of the implementation.
A final complication with our CVM interpreter arose with garbage collection.
CVM implements precise garbage collection, using stack maps to identify point-
ers at each point where garbage collection is possible. In our implementation, at
every backward branch, and at every method call, a global variable is checked to
see whether some thread has requested that garbage collection should start. If it
has, then the current thread puts itself into a garbage collection safe-state and
waits for the collection to complete. The use of vmIDL neither helps nor hinders
the implementation of garbage collection. Entering a safe state involves saving
the stack pointer, stack cache and other variables in the same way as when a
method call occurs. It seems possible that in the future, vmgen’s knowledge of
the stack effect of each instruction could be used to help automatically generate
stack maps. However, the current version contains no such feature, and items
on the stack remain, essentially, untyped. The main thrust of our current vmgen
work is interpreter optimisation, as we show in the next section.
7 Optimisations
7.1 Prefetching
Perhaps the most expensive part of executing a VM instruction is dispatch (fetch-
ing and executing the next VM instruction). One way to help the dispatch branch
to be resolved earlier is to fetch the next instruction early. Therefore, vmgen gen-
erates three macro invocations for dispatch (NEXT_P0, NEXT_P1, NEXT_P2) and
distributes them through the code for a VM instruction (see figure 2).
These macros can be defined to take advantage of specific properties of real
machine architectures and microarchitectures, such as the number of registers,
the latency between the VM instruction load and the dispatch jump, and autoin-
crement addressing mode. This scheme even allows prefetching the next-but-one
VM instruction; Gforth uses this on the PowerPC architecture to good advantage
(about 20% speedup).
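As a concrete illustration, the three dispatch macros can be mapped onto GNU C's labels-as-values roughly as follows (a sketch of threaded dispatch, not vmgen's actual output; ip and next_inst are assumed to be locals of the interpreter function):

  typedef void *Inst;   /* a VM instruction is the address of its implementation label */

  #define NEXT_P0  do { next_inst = *ip; } while (0)  /* fetch the next VM instruction early */
  #define NEXT_P1  do { ip++;            } while (0)  /* advance the VM instruction pointer   */
  #define NEXT_P2  do { goto *next_inst; } while (0)  /* jump to its implementation           */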
For conditional branch VM instructions it is likely that the two possible next VM
instructions are different, so it is a good idea to use different indirect branches for
them. The vmIDL language supports this optimisation with the keyword TAIL.
Vmgen expands this macro into the whole end-part of the VM instruction.
We evaluated the effect of using different indirect jumps for the different
outcomes of VM conditional branches in GForth. We found speedups of 0%–9%,
with only small benefits for most programs. However, we found a small reduction
in the number of executed instructions (0.6%–1.7%); looking at the assembly
language code, we discovered that GCC performs some additional optimisations
if we use TAIL.
7.5 Superinstructions
all stack memory accesses and stack pointer updates can be eliminated. Overall,
adding superinstructions gives speedups of between 20% and 80% on Gforth [13].
As mentioned in section 7.2, keeping the topmost element of the stack in a register
can reduce memory traffic for stack accesses by around 50%. Further gains are
achievable by reserving two or more registers for stack items. The simplest way
to do this is to simply keep the topmost n items in registers. For example, the
local variable TOS_3 might store the top of stack, TOS_2 the second from top,
and TOS_1 the next item down. (By keeping stack cache items in local variables,
they become candidates to be allocated to registers. There is no guarantee that
the C compiler's register allocator will actually place those variables in registers,
but they are likely to be good candidates because the stack cache is frequently
used.) The problem with this approach can be seen if we consider an instruction
that pushes an item onto the stack. The value of this item will be placed in the
variable TOS_3. But first, the current value of TOS_3 will be copied to TOS_2,
since this is now the second from topmost item. The same applies to the value
in TOS_1, which must be stored to memory. Thus, any
operation that affects the height of the stack will result in a ripple of copies,
which usually outweigh the benefits of stack caching [17].
A better solution is to introduce multiple states into the interpreter, in which
the stack can be partially empty. For example, a scheme with three stack-cache
registers has four states, corresponding to zero, one, two or three stack items held in registers.
In this scheme, there will be four separate versions of each virtual machine
instruction – one for each state. Each version of each instruction will be cus-
tomised to use the correct variable names for the topmost stack items. All of
this code is generated by vmgen automatically from the vmIDL definition. Fig-
ure 5 shows the output of vmgen for one state of the iadd VM instruction. The
two operands are in the cache at the start of the instruction, and so are copied
from TOS_1 and TOS_2. The result is put into the new topmost stack location,
TOS_1, and the state is changed to state 1 before the dispatch of the next VM
instruction. (A common way to implement the state is to use a different dispatch
table or switch statement for each state.) Note that there is no stack update in
this instruction; the change in
the height of the stack is captured by the change in state.
Multiple-state stack caching is currently implemented in an experimental,
unreleased version of vmgen. Preliminary experiments show that memory traffic
for accessing the stack can be reduced by more than three quarters using a three
register cache.
LABEL(iadd_state2) { /* label */
int i1; /* declarations of stack items */
int i2;
int i;
NEXT_P0; /* dispatch next instruction (part 0) */
i1 = vm_Cell2i(TOS_1); /* fetch argument stack items */
i2 = vm_Cell2i(TOS_2);
{ /* user-provided C code */
i = i1+i2;
}
NEXT_P1; /* dispatch next instruction (part 1) */
TOS_1 = vm_i2Cell(i); /* store result stack item(s) */
CHANGE_STATE(1); /* switch to state 1 */
NEXT_P2; /* dispatch next instruction (part 2) */
}
Fig. 5. Simplified version of the code generated for state 2 of the iadd instruction with
multiple-state stack caching
8 Related Work
Our work on generating VM interpreters is ongoing. The best reference on the
current release version of vmgen is [13], which gives a detailed description of
vmgen output, and presents detailed experimental results on the performance
of the GForth and Cacao implementations. More recent work presents newer
results on superinstructions and instruction replication [18] and the CVM im-
plementation [15].
The C interpreter hti [19] is created using a tree parser generator and can
contain superoperators. The VM instructions are specified in a tree grammar;
superoperators correspond to non-trivial tree patterns. It uses a tree-based VM
(linearized into a stack-based form) derived from lcc’s intermediate representa-
tion. A variation of this scheme is used for automatically generating interpreters
of compressed bytecode [20, 21].
Many of the performance-enhancing techniques used by vmgen have been
used and published earlier: threaded code and decoding speed [16, 22], schedul-
ing and software pipelining the dispatch [11, 23, 24], stack caching [11, 17] and
combining VM instructions [19, 25, 24, 26]. Our main contribution is to automate
the implementation of these optimisations using a DSL and generator.
9 Conclusion
Virtual machine interpreters contain large amounts of repeated code, and opti-
misations require large numbers of similar changes to many parts of the source
code. We have presented an overview of our work on vmIDL, a domain-specific
language for describing the instruction sets of stack-based VMs. Given a vmIDL
description, our interpreter generator, vmgen, will automatically generate the
large amounts of C source code needed to implement a corresponding inter-
preter system complete with support for tracing, VM code generation, VM code
disassembly, and profiling. Furthermore, vmgen will, on request, apply a variety
of optimisations to the generated interpreter, such as prefetching the next VM
instruction, stack caching, instruction replication, having different instances of
the dispatch code for better branch prediction, and combining VM instructions
into superinstructions. Generating optimised C code from a simple specification
allows the programmer to experiment with optimisations and explore a much
greater part of the design space for interpreter optimisations than would be
feasible if the code were written manually.
Availability
The current release version of the vmgen generator can be downloaded from:
http://www.complang.tuwien.ac.at/anton/vmgen/.
Acknowledgments
We would like to thank the anonymous reviewers for their detailed comments,
which greatly improved the quality of this chapter.
References
1. Lindholm, T., Yellin, F.: The Java Virtual Machine Specification. Second edn.
Addison-Wesley, Reading, MA, USA (1999)
2. Aït-Kaci, H.: The WAM: A (real) tutorial. In: Warren’s Abstract Machine: A
Tutorial Reconstruction. MIT Press (1991)
3. Goldberg, A., Robson, D.: Smalltalk-80: The Language and its Implementation.
Addison-Wesley (1983)
4. Weiss, D.M.: Family-oriented abstraction specification and translation: the FAST
process. In: Proceedings of the 11th Annual Conference on Computer Assurance
(COMPASS), Gaithersburg, Maryland, IEEE Press (1996) 14–22
5. Czarnecki, K., Eisenecker, U.: Generative Programming: Methods, Tools, and Applications. Addison-Wesley (2000)
6. Lengauer, C.: Program optimization in the domain of high-performance parallelism
(2004) In this volume.
7. Grune, D., Bal, H., Jacobs, C., Langendoen, K.: Modern Compiler Design. Wiley
(2001)
8. Ertl, M.A.: Implementation of Stack-Based Languages on Register Machines. PhD
thesis, Technische Universität Wien, Austria (1996)
9. Moore, C.H., Leach, G.C.: Forth – a language for interactive computing. Technical
report, Mohasco Industries, Inc., Amsterdam, NY (1970)
10. Ertl, M.A., Gregg, D.: The behaviour of efficient virtual machine interpreters on
modern architectures. In: Euro-Par 2001, Springer LNCS 2150 (2001) 403–412
11. Ertl, M.A.: A portable Forth engine. In: EuroFORTH ’93 conference proceedings,
Mariánské Lázně (Marienbad) (1993)
12. Paysan, B.: Ein optimierender Forth-Compiler. Vierte Dimension 7 (1991) 22–25
13. Ertl, M.A., Gregg, D., Krall, A., Paysan, B.: vmgen – A generator of efficient virtual
machine interpreters. Software – Practice and Experience 32 (2002) 265–294
14. Krall, A., Grafl, R.: CACAO – a 64 bit JavaVM just-in-time compiler. Concurrency:
Practice and Experience 9 (1997) 1017–1030
15. Casey, K., Gregg, D., Ertl, M.A., Nisbet, A.: Towards superinstructions for Java
interpreters. In: 7th International Workshop on Software and Compilers for Em-
bedded Systems. LNCS 2826 (2003) 329 – 343
16. Bell, J.R.: Threaded code. Communications of the ACM 16 (1973) 370–372
17. Ertl, M.A.: Stack caching for interpreters. In: SIGPLAN ’95 Conference on Pro-
gramming Language Design and Implementation. (1995) 315–327
18. Ertl, M.A., Gregg, D.: Optimizing indirect branch prediction accuracy in virtual
machine interpreters. In: Proceedings of the ACM SIGPLAN 2003 Conference on
Programming Language Design and Implementation (PLDI 03), San Diego, Cali-
fornia, ACM (2003) 278–288
19. Proebsting, T.A.: Optimizing an ANSI C interpreter with superoperators. In: Prin-
ciples of Programming Languages (POPL ’95). (1995) 322–332
20. Ernst, J., Evans, W., Fraser, C.W., Lucco, S., Proebsting, T.A.: Code compression.
In: SIGPLAN ’97 Conference on Programming Language Design and Implementa-
tion. (1997) 358–365
21. Evans, W.S., Fraser, C.W.: Bytecode compression via profiled grammar rewriting.
In: Proceedings of the ACM SIGPLAN Conference on Programming Language
Design and Implementation. (2001) 148–155
22. Klint, P.: Interpretation techniques. Software – Practice and Experience 11 (1981)
963–973
23. Hoogerbrugge, J., Augusteijn, L.: Pipelined Java virtual machine interpreters. In:
Proceedings of the 9th International Conference on Compiler Construction (CC’
00), Springer LNCS (2000)
24. Hoogerbrugge, J., Augusteijn, L., Trum, J., van de Wiel, R.: A code compression
system based on pipelined interpreters. Software – Practice and Experience 29
(1999) 1005–1023
25. Piumarta, I., Riccardi, F.: Optimizing direct threaded code by selective inlining. In:
SIGPLAN ’98 Conference on Programming Language Design and Implementation.
(1998) 291–300
26. Clausen, L., Schultz, U.P., Consel, C., Muller, G.: Java bytecode compression for
low-end embedded systems. ACM Transactions on Programming Languages and
Systems 22 (2000) 471–489
Program Transformation with Stratego/XT
Rules, Strategies, Tools, and Systems in Stratego/XT 0.9
Eelco Visser
1 Introduction
Program transformation, the automatic manipulation of source programs, emerged in
the context of compilation for the implementation of components such as optimiz-
ers [28]. While compilers are rather specialized tools developed by few, transformation
systems are becoming widespread. In the paradigm of generative programming [13],
the generation of programs from specifications forms a key part of the software engi-
neering process. In refactoring [21], transformations are used to restructure a program
in order to improve its design. Other applications of program transformation include
migration and reverse engineering. The common goal of these transformations is to
increase programmer productivity by automating programming tasks.
With the advent of XML, transformation techniques are spreading beyond the area
of programming language processing, making transformation a necessary operation in
any scenario where structured data play a role. Techniques from program transformation
are applicable in document processing. In turn, applications such as Active Server Pages
(ASP) for the generation of web-pages in dynamic HTML have inspired the creation
of program generators such as Jostraca [31], where code templates specified in the
concrete syntax of the object language are instantiated with application data.
Stratego/XT is a framework for the development of transformation systems aiming
to support a wide range of program transformations. The framework consists of the
transformation language Stratego and the XT collection of transformation tools. Strat-
ego is based on the paradigm of rewriting under the control of programmable rewrit-
ing strategies. The XT tools provide facilities for the infrastructure of transformation
systems including parsing and pretty-printing. The framework addresses all aspects of
the construction of transformation systems; from the specification of transformations
to their composition into transformation systems. This chapter gives an overview of
the main ingredients involved in the composition of transformation systems with Strat-
ego/XT, where we distinguish the abstraction levels of rules, strategies, tools, and sys-
tems.
A transformation rule encodes a basic transformation step as a rewrite on an abstract
syntax tree (Section 3). Abstract syntax trees are represented by first-order prefix terms
(Section 2). To decrease the gap between the meta-program and the object program that
it transforms, syntax tree fragments can be described using the concrete syntax of the
object language (Section 4).
A transformation strategy combines a set of rules into a complete transformation
by ordering their application using control and traversal combinators (Section 5). An
essential element is the capability of defining traversals generically in order to avoid
the overhead of spelling out traversals for specific data types. The expressive set of
strategy combinators allows programmers to encode a wide range of transformation
idioms (Section 6). Rewrite rules are not the actual primitive actions of program trans-
formations. Rather these can be broken down into the more basic actions of matching,
building, and variable scope (Section 7). Standard rewrite rules are context-free, which
makes it difficult to propagate context information in a transformation. Scoped dynamic
rewrite rules allow the run-time generation of rewrite rules encapsulating context infor-
mation (Section 8).
A transformation tool wraps a composition of rules and strategies into a stand-alone,
deployable component, which can be called from the command-line or from other tools
to transform terms into terms (Section 10). The use of the ATerm format makes ex-
change of abstract syntax trees between tools transparent.
A transformation system is a composition of such tools performing a complete
source-to-source transformation. Such a system typically consists of a parser and a
pretty-printer combined with a number of transformation tools. Figure 1 illustrates
such a composition. The XTC transformation tool composition framework supports
the transparent construction of such systems (Section 10).
Stratego/XT is designed such that artifacts at each of these levels of abstraction can
be named and reused in several applications, making the construction of transforma-
tion systems an accumulative process. The chapter concludes with a brief overview of
typical applications created using the framework (Section 12). Throughout the chap-
ter relevant Stratego/XT publications are cited, thus providing a bibliography of the
project.
2 Program Representation
Program transformation systems require a representation for programs that allows easy
and correct manipulation. Programmers write programs as texts using text editors. Some
programming environments provide more graphical (visual) interfaces for programmers
to specify certain domain-specific ingredients (e.g., user interface components). But
ultimately, such environments have a textual interface for specifying the details. Even if
programs are written in a ‘structured format’ such as XML, the representation used by
programmers generally is text. So a program transformation system needs to manipulate
programs in text format.

Fig. 1. Composition of a transformation system from tools: the source program is parsed
by sglr (using the parse table tiger.tbl) into a parse tree, which implode-asfix maps to an
abstract syntax tree (ast); transformation tools such as tiger-desugar, tiger-typecheck and
tiger-partial-eval rewrite the ast, which pp-tiger finally renders back into program text.
However, for all but the most trivial transformations, a structured rather than a tex-
tual representation is needed. Bridging the gap between textual and structured repre-
sentation requires parsers and unparsers. XT provides formal syntax definition with the
syntax definition formalism SDF, parsing with the scannerless generalized-LR parser
SGLR, representation of trees as ATerms, mapping of parse trees to abstract syntax
trees, and pretty-printing using the target-independent Box language.
The lexical and context-free syntax of a language are described using context-free pro-
ductions of the form s1 ... sn -> s0 declaring that the concatenation of phrases of
sort s1 to sn forms a phrase of sort s0 . Since SDF is modular it is easy to make exten-
sions of a language.
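For illustration, productions for a few Tiger constructs could look as follows (a sketch, not quoted from the chapter); the cons attribute names the constructor used for the abstract syntax:

  Var ":=" Exp                     -> Exp {cons("Assign")}
  "if" Exp "then" Exp "else" Exp   -> Exp {cons("If")}
  "while" Exp "do" Exp             -> Exp {cons("While")}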
module Tiger-Statements
signature
  constructors
    Assign : Var * Exp -> Exp
    If     : Exp * Exp * Exp -> Exp
    While  : Exp * Exp -> Exp
    Var    : String -> Exp
    Call   : String * List(Exp) -> Exp
    Plus   : Exp * Exp -> Exp
    Minus  : Exp * Exp -> Exp

Fig. 2. Signature of a fragment of Tiger abstract syntax
Signatures can be derived automatically from syntax definitions. For each produc-
tion A1 ...An → A0 {cons(c)} in a syntax definition, the corresponding constructor
declaration is c : S1 *...*Sm → S0 , where the Si are the sorts corresponding to the
symbols Aj after leaving out literals and layout sorts. Thus, the signature in Figure 2
describes the abstract syntax trees derived from parse trees over the syntax definition
above.
2.4 Pretty-Printing
After transformation, an abstract syntax tree should be turned into text again to be use-
ful as a program. Mapping a tree into text is the inverse of parsing, and is thus called
unparsing. When an unparser makes an attempt at producing human readable, instead
of just compiler parsable, program text, an unparser is called a pretty-printer. Strat-
ego/XT uses the pretty-printing model as provided by the Generic Pretty-Printing pack-
age GPP [14]. In this model a tree is unparsed to a Box expression, which contains
text with markup for pretty-printing. A Box expression can be interpreted by different
back-ends to produce formatted output for different displaying devices such as plain
text, HTML, and LaTeX.
3 Transformation Rules
After parsing produces the abstract syntax tree of a program, the actual transformation
can be applied to it. The Stratego language is used to define transformations on terms.
In Stratego, rewrite rules express basic transformations on terms.
A rewrite rule has the form L : l -> r, where L is the label of the rule, and the term
patterns l and r are its left-hand side and right-hand side, respectively. A term pattern
is a term that may contain variables. For example, the rule

  EvalPlus : Plus(Int(i), Int(j)) -> Int(k) where <add> (i, j) => k

reduces the addition of two constants to a constant by calling the library function
add for adding integers. Another example is the rule
LetSplit : Let([d1, d2 | d*], e*) -> Let([d1], Let([d2 | d*], e*))
instead of the equivalent transformation on abstract syntax trees on the previous page.
The use of concrete syntax is indicated by quotation delimiters, e.g. the |[ and ]|
delimiters above. Note that not only the right-hand side of the rule, but also its matching
left-hand side can be written using concrete syntax.
In particular for larger program fragments the use of concrete syntax makes a big
difference. For example, consider the instrumentation rule
TraceFunction :
|[ function f(x*) : tid = e ]| ->
|[ function f(x*) : tid =
(enterfun(f); let var x : tid in x := e; exitfun(f); x end) ]|
where new => x
which adds calls to enterfun at the entry and exitfun at the exit of functions. Writ-
ing this rule using abstract syntax requires a thorough knowledge of the abstract syntax
and is likely to make the rule unreadable. Using concrete syntax the right-hand side can
be written as a normal program fragment with holes. Thus, specification of transforma-
tion rules in the concrete syntax of the object language closes the conceptual distance
between the programs that we write and their representation in a transformation system.
The implementation of concrete syntax for Stratego is completely generic; all as-
pects of the embedding of an object syntax in Stratego are user-definable including the
quotation and anti-quotation delimiters and the object language itself, of course. In-
deed in [40] a general schema is given for extending arbitrary languages with concrete
syntax, and in [20] the application of this schema to Prolog is discussed.
5 Transformation Strategies
In the normal interpretation of term rewriting, a term is normalized by exhaustively
applying rewrite rules to it and its subterms until no further applications are possible. But
because normalizing a term with respect to all rules in a specification is not always de-
sirable, and because rewrite systems need not be confluent or terminating, more careful
control is often necessary. A common solution is to introduce additional constructors
and use them to encode control by means of additional rules which specify where and
in what order the original rules are to be applied. The underlying problem is that the
rewriting strategy used by rewriting systems is fixed and implicit. In order to provide
full control over the application of rewrite rules, Stratego makes the rewriting strategy
explicit and programmable [27, 42, 41]. Therefore, the specification of a simplifier us-
ing innermost normalization in Section 3 required explicit indication of the rules and
the strategy to be used in this transformation.
There are many strategies that could potentially be used in program transformations,
including exhaustive innermost or outermost normalization, and single pass bottom-up
or topdown transformation. Instead of providing built-in implementations for each of
these strategies, Stratego provides basic combinators for the composition of strategies.
Such strategies can be defined in a highly generic manner. Strategies can be param-
eterized with the set of rules, or in general, the transformation, to be applied by the
strategy. Thus, the specification of rules can remain separate from the specification of
the strategy, and rules can be reused in many different transformations.
Formally, a strategy is an algorithm that transforms a term into another term or
fails at doing so. Strategies are composed using the following strategy combinators:
sequential composition (s1 ; s2), determistic choice (s1 <+ s2; first try s1, only if
that fails s2), non-deterministic choice (s1 + s2; same as <+, but the order of trying
is not defined1), guarded choice (s1 < s2 + s3; if s1 succeeds then commit to s2
else s3), testing (where(s); ignores the transformation achieved), negation (not(s);
succeeds if s fails), and recursion (rec x(s)).
Strategies composed using these combinators can be named using strategy defini-
tions. A strategy definition of the form f(x1 ,...,xn ) = s introduces a user-defined
operator f with n strategy arguments, which can be called by providing it n argument
strategies as f(s1 ,...,sn). For example, the definition
try(s) = s <+ id
defines the combinator try, which applies s to the current subject term. If that fails it
applies id, the identity strategy, to the term, which always succeeds with the original
term as result. Similarly the repeat strategy
repeat(s) = try(s; repeat(s))
repeats transformation s until it fails. Note that strategy definitions do not explicitly
mention the term to which they are applied; strategies combine term transformations,
i.e., functions from terms to terms, into term transformations.
Congruence operators provide data-type-specific traversal: for a constructor C, the
congruence C(s1, . . . , sn) applies only to terms of the form C(t1, . . . , tn) and
transforms each direct subterm ti with the corresponding strategy si. A congruence
over the list constructors can thus be used to define a strategy which applies a
transformation s to the elements of a list. Another example of the use
of congruences is the following control-flow strategy [29]
control-flow(s) = Assign(id, s) + If(s, id, id) + While(s, id)
which applies the argument strategy s, typically a (partial) evaluation strategy, only to
selected arguments in order to defer evaluation of the others.
While congruence operators support the definition of traversals that are specific to
a data type, Stratego also provides combinators for composing generic traversals. The
operator all(s) applies s to each of the direct subterms ti of a constructor application
C(t1 ,...,tn ). It succeeds if and only if the application of s to each direct subterm
succeeds. In this case the resulting term is the constructor application C(t1′, . . . , tn′),
where each term ti′ is obtained by applying s to ti. Note that all(s) is the identity on
constants, i.e., on constructor applications without children. An example of the use of
all is the definition of the strategy bottomup(s):
bottomup(s) = all(bottomup(s)); s
The strategy topdown(s) applies s throughout a term starting at the top. The
strategy alltd(s) applies s along a frontier of a term: it tries to apply s at the root;
if that succeeds, the transformation is complete; otherwise the transformation is
applied recursively to all direct subterms. The strategy oncetd(s) is similar, but
uses the one combinator to apply a transformation to exactly one direct subterm.
One-pass traversals such as these can be used in the definition of fixpoint traversals
such as innermost:
innermost(s) = bottomup(try(s; innermost(s)))
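For reference, the one-pass traversals just mentioned are defined in the Stratego library along the following lines (a sketch of the standard definitions, not quoted from this chapter):

  topdown(s) = s; all(topdown(s))
  alltd(s)   = s <+ all(alltd(s))
  oncetd(s)  = s <+ one(oncetd(s))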
6 Transformation Idioms
The explicit control over the rewriting strategy using strategy combinators admits a
wide variety of transformation idioms. In this section we discuss several such idioms to
illustrate the expressiveness of strategies.
However, other strategies are possible. For example, the GHC simplifier [30] applies
rules in a single traversal over a program tree in which rules are applied both on the
way down and on the way up. This is expressed in Stratego by the strategy
simplify =
downup(repeat(R1 <+ ... <+ Rn))
downup(s) =
s; all(downup(s)); s
The strategy alltd(s) descends into a term until a subterm is encountered for which
the transformation s succeeds. In the idiom sketched below, a strategy trigger-transformation
recognizes a program fragment that should be transformed. Thus, cascading transfor-
mations are applied locally to terms for which the transformation is triggered. Of course
more sophisticated strategies can be used for finding application locations, as well as
for applying the rules locally. Nevertheless, the key observation underlying this idiom
remains: Because the transformations to be applied are local, special knowledge about
the subject program at the point of application can be used. This allows the application
of rules that would not be otherwise applicable.
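The following sketch illustrates the shape of this idiom; the name transform-locally and the
rules R1, ..., Rn are placeholders rather than definitions from the paper:

transform-locally =
  alltd(trigger-transformation; innermost(R1 <+ ... <+ Rn))

Here alltd stops descending as soon as trigger-transformation recognizes a fragment, and the
rules are then applied exhaustively only within that fragment.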
So far we have implied that the basic actions applied by strategies are rewrite rules. How-
ever, the distinction between rules and strategies is methodological rather than semantic.
Rewrite rules are just syntactic sugar for strategies composed from more basic transfor-
mation actions, i.e., matching and building terms, and delimiting the scope of pattern
variables [42, 41]. Making these actions first-class citizens makes many interesting id-
ioms involving matching directly expressible.
To understand the idea, consider what happens when the following rewrite rule is
applied:
EvalPlus : Plus(Int(i), Int(j)) -> Int(k) where <add> (i, j) => k
First it matches the subject term against the pattern Plus(Int(i), Int(j)) in the
left-hand side. This means that a substitution for the variables i and j is sought that
makes the pattern equal to the subject term. If the match fails, the rule fails. If the match
succeeds, the condition strategy is evaluated and the result bound to the variable k. This
binding is then used to instantiate the right-hand side pattern Int(k). The instantiated
term then replaces the original subject term. Furthermore, the rule limits the scope of
the variables occurring in the rule. That is, the variables i, j, and k are local to this rule.
After the rule is applied the bindings to these variables are invisible again.
Using the primitive actions match (?pat), build (!pat) and scope ({x1 ,...,
xn :s}), this sequence of events can be expressed as
EvalPlus =
{i,j,k: ?Plus(Int(i), Int(j)); where(!(i,j); add; ?k); !Int(k)}
The action ?pat matches the current subject term against the pattern pat, binding all its
variables. The action !pat builds the instantiation of the pattern pat, using the current
bindings of variables in the pattern. The scope {x1 ,...,xn :s} delimits the scope of
the term variables xi to the strategy s. In fact, the Stratego compiler desugars rule
definitions in this way. In general, a labeled conditional rewrite rule
R : p1 -> p2 where s
is equivalent to a strategy definition
R = {x1,...,xn : ?p1; where(s); !p2}
with x1,...,xn the free variables of the patterns p1 and p2. Similarly, the strategy ap-
plication <s> pat1 => pat2 is desugared to the sequence !pat1; s; ?pat2. Many
other constructs, such as anonymous (unlabeled) rules \ p1 -> p2 where s \, application
of strategies in a build as in Int(<add>(i,j)), and contextual rules [35],
can be expressed using these basic actions.
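To illustrate the desugaring once more (this particular expansion is a sketch, not taken from
the paper), the anonymous rule \ x -> [x] \ used in collect-om below corresponds to

{x: ?x; ![x]}

that is, a scoped match of the subject term followed by the build of a singleton list.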
A fold over lists, used by the crush strategy below, can be defined as
foldr(z, c, f) =
[]; z
<+ \ [h | t] -> <c>(<f>h, <foldr(z, c, f)>t) \
Consider, for example, collecting the variables occurring in an expression. If the term is a variable, a singleton list containing the variable name x is produced.
Otherwise the list of subterms xs is obtained using generic term deconstruction (the
underscore in the pattern is a wildcard matching with any term); the variables for each
subterm are collected recursively; and the union of the resulting lists is produced. Since
this is a frequently occurring pattern, the collect-om strategy generically defines the
notion of collecting outermost occurrences of subterms:
exp-vars =
collect-om(?Var(_))
collect-om(s) =
s; \ x -> [x] \
<+ crush(![], union, collect-om(s))
crush(nul, sum, s) :
_#(xs) -> <foldr(nul, sum, s)> xs
Dynamic rules are rewrite rules that are generated at run time and inherit variable bindings
from the context in which they are generated. Consider function inlining: a rule InlineFun is
generated by DeclareFun in the context of the definition of the
function f, but applied at the call sites f(a*). This is achieved by declaring InlineFun
in the scope of the match to the function definition fdec (the second line of the declaring
strategy); the variables
bound in that match, i.e., fdec and f, are inherited by the InlineFun rule declared
within the rules(...) construct. Thus, the use of f in the left-hand side of the rule
and fdec in the condition refer to inherited bindings to these variables.
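As an illustration of the shape of such a declaration (the constructor names FunDec and Call
and the helper strategy inline-body are hypothetical placeholders, not definitions from the
paper):

DeclareFun =
  ?fdec@FunDec(f, _, _)       // second line: bind fdec and the function name f
  ; rules(
      InlineFun :             // inherits the bindings of fdec and f
        Call(f, args) -> e
        where <inline-body> (fdec, args) => e
    )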
Dynamic rules are first-class entities and can be applied as part of a global term
traversal. It is possible to restrict the application of dynamic rules to certain parts
of subject terms using rule scopes, which limit the live range of rules. For example,
DeclareFun and InlineFun as defined above, could be used in the following simple
inlining strategy:
inline = {| InlineFun
: try(DeclareFun)
; repeat(InlineFun + Simplify)
; all(inline)
; repeat(Simplify)
|}
This transformation performs a single traversal over an abstract syntax tree. First, in-
lining rules are generated (by DeclareFun) for each function definition encountered, function
calls are inlined using InlineFun, and expressions are simplified using some set of
Simplify rules. Then the tree is traversed using all with a recursive call of the in-
liner. Finally, on the way up, the simplification rules are applied again. The dynamic
rule scope {|L : s|} restricts the scope of a generated rule L to the strategy s. Of
course an actual inliner will be more sophisticated than the strategy shown above; most
importantly an inlining criterion should be added to DeclareFun and/or InlineFun
to determine whether a function should be inlined at all. However, the main idea will
be the same.
After generic traversal, dynamic rules are a second key innovation of Stratego; they
allow many more transformation problems to be addressed with the idiom of strategic
rewriting. Other applications of dynamic rules include bound variable renaming [37],
dead-code elimination [37], constant-propagation [29] and other data-flow optimiza-
tions, instruction selection [9], type checking, partial evaluation, and interpretation [19].
9 Term Annotations
Stratego uses terms to represent the abstract syntax of programs or documents. A term
consists of a constructor and a list of argument terms. Sometimes it is useful to record
additional information about a term without adapting its structure, i.e., without creating
a constructor with additional arguments. For this purpose terms can be annotated. Thus, the
results of a program analysis can be stored directly in the nodes of the tree.
In Stratego a term always has a list of annotations. This is the empty list if a term
does not have any annotations. A term with annotations has the form t{a1 ,...,am},
where t is a term as defined in Section 2, the ai are terms used as annotations, and
m ≥ 0. A term t{} with an empty list of annotations is equivalent to t. Since annotations
are terms, any transformations defined by rules and strategies can be applied to them.
The annotations of a term can be retrieved in a pattern match and attached in a
build. For example the build !Plus(1, 2){Int} will create a term Plus(1, 2) with
the term Int as the only annotation. Naturally, the annotation syntax can also be used
in a match: ?Plus(1, 2){Int}. Note however that this match only accepts Plus(1,
2) terms with just one annotation, which should be the empty constructor application
Int. This match will thus not allow other annotations. Because a rewrite rule is just
sugar for a strategy definition, the usage of annotations in rules is just as expected. For
example, the rule
TypeCheck : Plus(e1{Int}, e2{Int}) -> Plus(e1, e2){Int}
checks that the two subterms of the Plus have annotation Int and then attaches the
annotation Int to the whole term. Such a rule is typically part of a typechecker which
checks type correctness of the expressions in a program and annotates them with their
types. Similarly many other program analyses can be expressed as program transfor-
mation problems. Actual examples in which annotations were used include escaping
variables analysis in a compiler for an imperative language, strictness analysis for lazy
functional programs, and bound-unbound variables analysis for Stratego itself.
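As a sketch of how such annotation rules might be applied (the rule for integer literals and
the traversal below are illustrative, not taken from the paper), a bottom-up pass annotates
subexpressions before their parents:

TypeCheckInt : Int(i) -> Int(i){Int}

typecheck = bottomup(try(TypeCheckInt <+ TypeCheck))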
Annotations are useful to store information in trees without changing their signa-
ture. Since this information is part of the tree structure it is easily made persistent for
exchange with other transformation tools (Section 10). However, annotations also bring
their own problems. First of all, transformations are expected to preserve annotations
produced by different transformations. This requires that traversals preserve annota-
tions, which is the case for Stratego’s traversal operators. However, when transforming
a term it is difficult to preserve the annotations on the original term since this should be
done according to the semantics of the annotations. Secondly, it is no longer straightfor-
ward to determine the equality relation between terms. Equality can be computed with
or without (certain) annotations. These issues are inherent in any annotation framework
and preclude smooth integration of annotations with the other features discussed; fur-
ther research is needed in this area.
ATerm Exchange Format. The terms Stratego uses internally correspond exactly with
terms in the ATerm exchange format [7]. The Stratego run-time system is based on the
ATerm Library which provides support for internal term representation as well as their
persistent representation in files, making it trivial to provide input and output for terms
in Stratego, and to exchange terms between transformation tools. Thus, transformation
systems can be divided into small, reusable tools.
Foreign Function Interface. Stratego has a foreign function interface which makes it
possible to call C functions from Stratego functions. The operator prim(f, t1, ..., tn )
calls the C function f with term arguments ti . Via this mechanism functionality such as
arithmetic, string manipulation, hash tables, I/O, and process control are incorporated
in the library without having to include them as built-ins in the language. For example,
the definition
read-from-stream =
?Stream(stream)
; prim("SSL_read_term_from_stream", stream)
introduces an alias for a primitive reading a term from an input stream. In fact several
language features started their life as a collection of primitives before being elevated to
the level of language construct; examples are dynamic rules and annotations.
In a Stratego program, the main strategy represents the tool; it is typically defined using the io-wrap strategy, which
takes as arguments the non-default command-line options and the strategy to apply.
The wrapper strategy parses the command-line options, providing a standardized tool
interface with options such as -i for the input and -o for the output file. Furthermore,
it reads the input term, applies the transformation to it, and writes the resulting term to
output. Thus, all I/O complexities are hidden from the programmer.
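A minimal sketch of such a tool, here simply applying the identity strategy (the module name is
an assumption, and io-wrap is used in the variant without extra options):

module identity
imports lib
strategies
  // io-wrap parses the standard command-line options (such as -i and -o),
  // reads the input term, applies its argument strategy, and writes the result
  main = io-wrap(id)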
Tool Collections. Stratego’s use of the ATerm exchange format and its support for
interface implementation make it easy to build small reusable tools. In the spirit
of the Unix pipes and filters model, these tools can be mixed and matched in many dif-
ferent transformation systems. However, instead of transforming text files, these tools
transform structured data. This approach has enabled and encouraged the construction
of a large library of reusable tools. The core library is the XT bundle of transformation
tools [17], which provides some 100 more or less generic tools useful in the construction
and generation of transformation systems. This includes the implementation of pretty-
printing formatters of the generic pretty-printing package GPP [14], coupling of Strat-
ego transformation components with SDF parsers, tools for parsing and pretty-printing,
and generators for deriving components of transformation systems from a syntax defi-
nition. A collection of application-specific transformation components based on the XT
library is emerging (see Section 12).
io-tiger-pe =
xtc-io-wrap(tiger-pe-options,
parse-tiger
; tiger-desugar
; tiger-partial-eval
; if-switch(!"elim-dead", tiger-elim-dead)
; if-switch(!"ensugar", tiger-ensugar)
; if-switch(!"pp", pp-tiger)
)
tiger-partial-eval =
xtc-transform(!"Tiger-Partial-Eval", pass-verbose)
...
XTC, the XT component composition model, supports the composition of such tools into
transformation systems. Figure 3 illustrates the use of XTC in the composition of a partial
evaluator from transformation components, corresponding to the right branch of the
data-flow diagram in Figure 1.
11 Stratego/XT in Practice
The Stratego language is implemented by means of a compiler that translates Strat-
ego programs to C programs. Generated programs depend on the ATerm library and
a small Stratego-specific run-time system. The Stratego Standard Library provides a
large number of generic and data-type specific reusable rules and strategies. The com-
piler and the library, as well as a number of other packages from the XT collection,
are bundled in the Stratego/XT distribution, which is available from www.stratego-
language.org [43] under the LGPL license. The website also provides user documenta-
tion, pointers to publications and applications, and mailing lists for users and developers.
12 Applications
The original application area of Stratego is the specification of optimizers, in particular
for functional compilers [42]. Since then, Stratego has been applied in many areas of
language processing:
– Compilers: typechecking, translation, desugaring, instruction selection
– Optimization: data-flow optimizations, vectorization, GHC-style simplification, de-
forestation, domain-specific optimization, partial evaluation, specialization of dy-
namic typing
– Program generators: pretty-printer and signature generation from syntax defini-
tions, application generation from DSLs, language extension preprocessors
– Program migration: grammar conversion
– Program understanding: documentation generation
– Document generation and transformation: XML processing, web services
The rest of this section gives an overview of applications categorized by the type of the
source language.
Imperative Languages. Tiger is the example language of Andrew Appel’s textbook on
compiler construction. It has proven a fruitful basis for experimentation with all kinds
of transformations and for use in teaching [43]. Results include techniques for building
interpreters [19], implementing instruction selection (maximal munch and burg-style
dynamic programming) [9], and specifying optimizations such as function inlining [37]
and constant propagation [29].
These techniques are being applied to real imperative languages. CodeBoost [2] is a
transformation framework for the domain-specific optimization of C++ programs devel-
oped for the optimization of programs written in the Sophus style. Several application
generators have been developed for the generation of Java and C++ programs.
Other Languages. In a documentation generator for SDL [16], Stratego was used to
extract transition tables from SDL specifications.
XML and Meta-data for Software Deployment. The structured representation of data,
and its easy manipulation and external representation, make Stratego an attractive lan-
guage for processing XML documents and other structured data formats. For example,
the Autobundle system [15] computes compositions (bundles) of source packages by
analyzing the dependencies in package descriptions represented as terms and generates
an online package base from such descriptions. Application in other areas of software
deployment is underway. The generation of XHTML and other XML documents is also
well supported with concrete syntax for XML in Stratego and used for example in xDoc,
a documentation generator for Stratego and other languages.
13 Related Work
Term rewriting [33] is a long established branch of theoretical computer science. Sev-
eral systems for program transformation are based on term rewriting. The motivation
for and the design of Stratego were directly influenced by the ASF+SDF and ELAN lan-
guages. The algebraic specification formalism ASF+SDF [18] is based on pure rewrit-
ing with concrete syntax without strategies. Recently traversal functions were added to
ASF+SDF to reduce the overhead of traversal control [8]. The ELAN system [4] first in-
troduced the ideas of user-definable rewriting strategies in term rewriting systems. How-
ever, generic traversal is not provided in ELAN. The first ideas about programmable
rewriting strategies with generic term traversal were developed with ASF+SDF [27].
These ideas were further developed in the design of Stratego [42, 41]. Also the gener-
alization of concrete syntax [40], first-class pattern matching [35], generic term decon-
struction [36], scoped dynamic rewrite rules [37], annotations, and the XTC component
composition model are contributions of Stratego/XT. An earlier paper [38] gives a short
overview of version 0.5 of the Stratego language and system, before the addition of
concrete syntax, dynamic rules, and XTC.
Other systems based on term rewriting include TAMPR [5, 6] and Maude [11, 10].
There are also a large number of transformation systems not based (directly) on term
rewriting, including TXL [12] and JTS [3]. A more thorough discussion of the common-
alities between Stratego and other transformation systems is beyond the scope of this
paper. The papers about individual language concepts cited throughout this paper dis-
cuss related mechanisms in other languages. In addition, several papers survey aspects
of strategies and related mechanisms in programming languages. A survey of strategies
in program transformation systems is presented in [39], introducing the motivation for
programmable strategies and discussing a number of systems with (some) support for
definition of strategies. The essential ingredients of the paradigm of ‘strategic program-
ming’ and their incarnations in other paradigms, such as object-oriented and functional
programming, are discussed in [25]. A comparison of strategic programming with adap-
tive programming is presented in [26]. Finally, the program transformation wiki [1] lists
a large number of transformation systems.
14 Conclusion
This paper has presented a broad overview of the concepts and applications of the Strat-
ego/XT framework, a language and toolset supporting the high-level implementation of
program transformation systems. The framework is applicable to many kinds of trans-
formations, including compilation, generation, analysis, and migration. The framework
supports all aspects of program transformation, from the specification of transformation
rules, their composition using strategies, to the encapsulation of strategies in tools, and
composition of tools into systems.
An important design guideline in Stratego/XT is separation of concerns to achieve
reuse at all levels of abstraction. Thus, the separation of rules and strategies allows the
specification of rules separately from the strategy that applies them and a generic strat-
egy can be instantiated with different rules. Similarly a certain strategy can be used in
different tools, and a tool can be used in different transformation systems. This principle
supports reuse of transformations at different levels of granularity.
Another design guideline is that separation of concerns should not draw artificial
boundaries. Thus, there is no strict separation between abstraction levels. Rather, the
distinction between these levels is methodological and idiomatic rather than semantic.
For instance, a rule is really an idiom for a certain type of strategy. Thus, rules and
strategies can be interchanged. Similarly, XTC applies strategic control to tools and
allows calling an external tool as though it were a rule. In general, one can mix rules,
strategies, tools, and systems as is appropriate for the system under consideration, thus
making transformations compositional in practice. Of course one has to consider trade-
offs when doing this, e.g., the overhead of calling an external process versus the reuse
obtained, but there is no technical objection.
Finally, Stratego/XT is designed and developed as an open language and system.
The initial language based on rewriting of abstract syntax trees under the control of
strategies has been extended with first-class pattern matching, dynamic rules, concrete
syntax, and a tool composition model, in order to address new classes of problems. The
library has accumulated many generic transformation solutions. Also the compiler is
component-based, and more and more aspects are under the control of the programmer.
Certain aspects of the language could have been developed as a library in a general
purpose language. Such an approach, although interesting in its own right, meets with
the syntactic and semantic limitations of the host language. Building a domain-specific
language for the domain of program transformation has been fruitful. First of all, the
constructs that matter can be provided without (syntactic) overhead to the programmer.
The separation of concerns (e.g., rules as separately definable entities) that is provided
in Stratego is hard to achieve in general purpose languages. Furthermore, the use of the
ATerm library with its maximal sharing (hash consing) term model and easy persistence
provides a distinct run-time system not available in other languages. Rather than strug-
gling with a host language, the design of Stratego has been guided by the needs of the
transformation domain, striving to express transformations in a natural way.
Symbolic manipulation and generation of programs is increasingly important in
software engineering, and Stratego/XT is an expressive framework for its implemen-
tation. The ideas developed in the project can also be useful in other settings. For exam-
ple, the approach to generic traversal has been transposed to functional, object-oriented,
and logic programming [25]. This paper describes Stratego/XT at release 0.9, which is
not the final one. There is a host of ideas for improving and extending the language,
compiler, library, and support packages, and for new applications. For an overview, see
www.stratego-language.org.
Acknowledgments
Stratego and XT have been developed with contributions by many people. The initial
set of strategy combinators was designed with Bas Luttik. The first prototype language
design and compiler was developed with Zino Benaissa and Andrew Tolmach. The run-
time system of Stratego is based on the ATerm Library developed at the University of
Amsterdam by Pieter Olivier and Hayco de Jong. SDF is maintained and further de-
veloped at CWI by Mark van den Brand and Jurgen Vinju. The XT bundle was set up
and developed together with Merijn de Jonge and Joost Visser. Martin Bravenboer has
played an important role in modernizing XT, collaborated in the development of XTC,
and contributed several packages in the area of XML and Java processing. Eelco Dolstra
has been very resourceful when it came to compilation and porting issues. The approach
to data-flow optimization was developed with Karina Olmos. Rob Vermaas developed
the documentation software for Stratego. Many others developed applications or otherwise
provided valuable feedback, including Otto Skrove Bagge, Arne de Bruijn, Karl
Trygve Kalleberg, Dick Kieburtz, Patricia Johann, Lennart Swart, Hedzer Westra, and
Jonne van Wijngaarden. Finally, the anonymous referees provided useful feedback on
an earlier version of this paper.
References
1. http://www.program-transformation.org.
2. O. S. Bagge, K. T. Kalleberg, M. Haveraaen, and E. Visser. Design of the CodeBoost trans-
formation system for domain-specific optimisation of C++ programs. In D. Binkley and
P. Tonella, editors, Third IEEE International Workshop on Source Code Analysis and Ma-
nipulation (SCAM’03), pages 65–74, Amsterdam, September 2003. IEEE Computer Society
Press.
3. D. Batory, B. Lofaso, and Y. Smaragdakis. JTS: tools for implementing domain-specific
languages. In Proceedings Fifth International Conference on Software Reuse, pages 143–
153, Victoria, BC, Canada, 2–5 1998. IEEE.
4. P. Borovanský, H. Cirstea, H. Dubois, C. Kirchner, H. Kirchner, P.-E. Moreau, C. Ringeissen,
and M. Vittek. ELAN: User Manual. Loria, Nancy, France, v3.4 edition, January 27 2000.
5. J. M. Boyle. Abstract programming and program transformation—An approach to reusing
programs. In T. J. Biggerstaff and A. J. Perlis, editors, Software Reusability, volume 1, pages
361–413. ACM Press, 1989.
6. J. M. Boyle, T. J. Harmer, and V. L. Winter. The TAMPR program transforming system: Sim-
plifying the development of numerical software. In E. Arge, A. M. Bruaset, and H. P. Lang-
tangen, editors, Modern Software Tools in Scientific Computing, pages 353–372. Birkhäuser,
1997.
7. M. G. J. van den Brand, H. A. de Jong, P. Klint, and P. A. Olivier. Efficient annotated terms.
Software—Practice & Experience, 30:259–291, 2000.
8. M. G. J. van den Brand, P. Klint, and J. Vinju. Term rewriting with traversal functions. ACM
Transactions on Software Engineering and Methodology, 12(2):152–190, April 2003.
9. M. Bravenboer and E. Visser. Rewriting strategies for instruction selection. In S. Tison, edi-
tor, Rewriting Techniques and Applications (RTA’02), volume 2378 of Lecture Notes in Com-
puter Science, pages 237–251, Copenhagen, Denmark, July 2002. Springer-Verlag.
10. M. Clavel, F. Durán, S. Eker, P. Lincoln, N. Martí-Oliet, J. Meseguer, and J. F. Quesada.
Maude: specification and programming in rewriting logic. Theoretical Computer Science,
285(2):187–243, 2002.
11. M. Clavel, S. Eker, P. Lincoln, and J. Meseguer. Principles of Maude. In J. Meseguer, editor,
Proceedings of the First International Workshop on Rewriting Logic and its Applications,
volume 4 of Electronic Notes in Theoretical Computer Science, pages 65–89, Asilomar, Pa-
cific Grove, CA, September 1996. Elsevier.
12. J. R. Cordy, I. H. Carmichael, and R. Halliday. The TXL Programming Language, Version 8,
April 1995.
13. K. Czarnecki and U. W. Eisenecker. Generative Programming. Addison Wesley, 2000.
14. M. de Jonge. A pretty-printer for every occasion. In I. Ferguson, J. Gray, and L. Scott, edi-
tors, Proceedings of the 2nd International Symposium on Constructing Software Engineering
Tools (CoSET2000). University of Wollongong, Australia, 2000.
15. M. de Jonge. Source tree composition. In C. Gacek, editor, Proceedings: Seventh Interna-
tional Conference on Software Reuse, volume 2319 of LNCS, pages 17–32. Springer-Verlag,
April 2002.
16. M. de Jonge and R. Monajemi. Cost-effective maintenance tools for proprietary languages.
In Proceedings: International Conference on Software Maintenance (ICSM 2001), pages
240–249. IEEE Computer Society Press, November 2001.
17. M. de Jonge, E. Visser, and J. Visser. XT: A bundle of program transformation tools. In
M. G. J. van den Brand and D. Parigot, editors, Workshop on Language Descriptions, Tools
and Applications (LDTA’01), volume 44 of Electronic Notes in Theoretical Computer Sci-
ence. Elsevier Science Publishers, April 2001.
18. A. van Deursen, J. Heering, and P. Klint, editors. Language Prototyping. An Algebraic Spec-
ification Approach, volume 5 of AMAST Series in Computing. World Scientific, Singapore,
September 1996.
19. E. Dolstra and E. Visser. Building interpreters with rewriting strategies. In M. G. J. van den
Brand and R. Lämmel, editors, Workshop on Language Descriptions, Tools and Applications
(LDTA’02), volume 65/3 of Electronic Notes in Theoretical Computer Science, Grenoble,
France, April 2002. Elsevier Science Publishers.
20. B. Fischer and E. Visser. Retrofitting the AutoBayes program synthesis system with concrete
syntax. In this volume.
21. M. Fowler. Refactoring: Improving the Design of Existing Programs. Addison-Wesley, 1999.
22. J. Heering, P. R. H. Hendriks, P. Klint, and J. Rekers. The syntax definition formalism SDF
– reference manual. SIGPLAN Notices, 24(11):43–75, 1989.
23. P. Johann and E. Visser. Warm fusion in Stratego: A case study in the generation of program
transformation systems. Annals of Mathematics and Artificial Intelligence, 29(1–4):1–34,
2000.
24. P. Johann and E. Visser. Fusing logic and control with local transformations: An example
optimization. In B. Gramlich and S. Lucas, editors, Workshop on Reduction Strategies in
Rewriting and Programming (WRS’01), volume 57 of Electronic Notes in Theoretical Com-
puter Science, Utrecht, The Netherlands, May 2001. Elsevier Science Publishers.
25. R. Lämmel, E. Visser, and J. Visser. The essence of strategic programming, October 2002.
(Draft).
26. R. Lämmel, E. Visser, and J. Visser. Strategic Programming Meets Adaptive Program-
ming. In Proceedings of Aspect-Oriented Software Development (AOSD’03), pages 168–177,
Boston, USA, March 2003. ACM Press.
27. B. Luttik and E. Visser. Specification of rewriting strategies. In M. P. A. Sellink, edi-
tor, 2nd International Workshop on the Theory and Practice of Algebraic Specifications
(ASF+SDF’97), Electronic Workshops in Computing, Berlin, November 1997. Springer-
Verlag.
28. S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Pub-
lishers, 1997.
29. K. Olmos and E. Visser. Strategies for source-to-source constant propagation. In B. Gramlich
and S. Lucas, editors, Workshop on Reduction Strategies (WRS’02), volume 70 of Electronic
Notes in Theoretical Computer Science, page 20, Copenhagen, Denmark, July 2002. Elsevier
Science Publishers.
30. S. L. Peyton Jones and A. L. M. Santos. A transformation-based optimiser for Haskell. Sci-
ence of Computer Programming, 32(1–3):3–47, September 1998.
31. R. J. Rodger. Jostraca: a template engine for generative programming. European Conference
for Object-Oriented Programming, 2002.
32. L. Swart. Partial evaluation using rewrite rules. A specification of a partial evaluator for
Similix in Stratego. Master’s thesis, Utrecht University, Utrecht, The Netherlands, August
2002.
33. Terese. Term Rewriting Systems. Cambridge University Press, March 2003.
34. E. Visser. Syntax Definition for Language Prototyping. PhD thesis, University of Amsterdam,
September 1997.
35. E. Visser. Strategic pattern matching. In P. Narendran and M. Rusinowitch, editors, Rewriting
Techniques and Applications (RTA’99), volume 1631 of Lecture Notes in Computer Science,
pages 30–44, Trento, Italy, July 1999. Springer-Verlag.
36. E. Visser. Language independent traversals for program transformation. In J. Jeuring, editor,
Workshop on Generic Programming (WGP’00), Ponte de Lima, Portugal, July 2000. Tech-
nical Report UU-CS-2000-19, Department of Information and Computing Sciences, Univer-
siteit Utrecht.
37. E. Visser. Scoped dynamic rewrite rules. In M. van den Brand and R. Verma, editors, Rule
Based Programming (RULE’01), volume 59/4 of Electronic Notes in Theoretical Computer
Science. Elsevier Science Publishers, September 2001.
38. E. Visser. Stratego: A language for program transformation based on rewriting strategies.
System description of Stratego 0.5. In A. Middeldorp, editor, Rewriting Techniques and Ap-
plications (RTA’01), volume 2051 of Lecture Notes in Computer Science, pages 357–361.
Springer-Verlag, May 2001.
39. E. Visser. A survey of rewriting strategies in program transformation systems. In B. Gramlich
and S. Lucas, editors, Workshop on Reduction Strategies in Rewriting and Programming
(WRS’01), volume 57 of Electronic Notes in Theoretical Computer Science, Utrecht, The
Netherlands, May 2001. Elsevier Science Publishers.
40. E. Visser. Meta-programming with concrete object syntax. In D. Batory, C. Consel, and
W. Taha, editors, Generative Programming and Component Engineering (GPCE’02), vol-
ume 2487 of Lecture Notes in Computer Science, pages 299–315, Pittsburgh, PA, USA,
October 2002. Springer-Verlag.
41. E. Visser and Z.-e.-A. Benaissa. A core language for rewriting. In C. Kirchner and
H. Kirchner, editors, Second International Workshop on Rewriting Logic and its Applica-
tions (WRLA’98), volume 15 of Electronic Notes in Theoretical Computer Science, Pont-à-
Mousson, France, September 1998. Elsevier Science Publishers.
42. E. Visser, Z.-e.-A. Benaissa, and A. Tolmach. Building program optimizers with rewriting
strategies. In Proceedings of the third ACM SIGPLAN International Conference on Func-
tional Programming (ICFP’98), pages 13–26. ACM Press, September 1998.
43. http://www.stratego-language.org.
Retrofitting the AutoBayes Program Synthesis System
with Concrete Syntax
1 Introduction
Program synthesis and transformation systems work on two language levels, the object-
level (i.e., the language of the manipulated programs), and the meta-level (i.e., the im-
plementation language of the system itself). Conceptually, these two levels are unrelated
but in practice they have to be interfaced with each other. Often, the object-language
is simply embedded within the meta-language, using a data type to represent the ab-
stract syntax trees of the object-language. Although the actual implementation mecha-
nisms (e.g., records, objects, or algebraic data types) may vary, embeddings can be used
with essentially all meta-languages, making their full programming capabilities imme-
diately available for program manipulations. Meta-level representations of object-level
program fragments are then constructed in an essentially syntax-free fashion, i.e., not
using the notation of the object-language, but using the operations provided by the data
type.
However, syntax matters. The conceptual distance between the concrete programs
that we understand and the meta-level representations that we need to use grows with
the complexity of the object-language syntax and the size of the represented program
fragments, and the use of abstract syntax becomes less and less satisfactory. Languages
like Prolog and Haskell allow a rudimentary integration of concrete syntax via user-
defined operators. However, this is usually restricted to simple precedence grammars,
entailing that realistic object-languages cannot be represented well if at all. Tradition-
ally, a quotation/anti-quotation mechanism is thus used to interface languages: a quo-
tation denotes an object-level fragment, an anti-quotation denotes the result of a meta-
level computation which is spliced into the object-level fragment. If object-language
and meta-language coincide, the distinction between the language levels is purely con-
ceptual, and switching between the levels is easy; a single compiler can be used to
process both levels. If the object-language is user-definable, the mechanism becomes
more complicated to implement and usually requires specialized meta-languages such
as ASF+SDF [5], Maude [3], or TXL [4] which support syntax definition and reflection.
AUTO BAYES [9, 8] is a program synthesis system for the statistical data analysis do-
main. It is a large software system, implemented in Prolog, whose complexity
is comparable to that of a compiler. The synthesis process is based on schemas which are
written in Prolog and use abstract syntax representations of object-program fragments.
The introduction of concrete syntax would simplify the creation and maintenance of
these schemas. However, a complete migration of the implementation of AUTO BAYES
to a different meta-programming language requires a substantial effort and disrupts the
ongoing system development. To avoid this problem, we have chosen a different path.
In this chapter, we thus describe the first experiences with our ongoing work on
adding support for user-definable concrete syntax to AUTO BAYES. We follow the gen-
eral approach outlined in [15], which allows the extension of an arbitrary meta-language
with concrete object-language syntax by combining the syntax definitions of both lan-
guages. We show how the approach is instantiated for Prolog and describe the pro-
cessing steps required for a seamless interaction of concrete syntax fragments with the
remaining “legacy” meta-programming system based on abstract syntax—despite all
its idiosyncrasies. With this work we show that the approach of [15] can indeed be
applied to meta-languages other than Stratego. To reduce the effort of making such
instantiations we have constructed a generic tool encapsulating the process of parsing
a program using concrete object-syntax. Furthermore, we have extended the approach
with object-level comments, and object-language specific transformations for integrat-
ing object-level abstract syntax in the meta-language.
The original motivation for this specific path was purely pragmatic. We wanted to
realize the benefits of concrete syntax without forcing the disruptive migration of the en-
tire system to a different meta-programming language. Retrofitting Prolog with support
for concrete syntax allows a gradual migration. Our long-term goal, however, is more
ambitious: we want to support domain experts in creating and maintaining schemas. We
expect that the use of concrete syntax makes it easier to gradually “schematize” exist-
ing domain programs. We also plan to use different grammars to describe programs on
different levels of abstraction and thus to support domain engineering.
[Figure 1. The AutoBayes system architecture: a model specification is parsed by the input
parser into an internal representation and processed by the synthesis kernel with its schema
library, supported by a rewriting engine, an equation solver, a test-data generator, and system
utilities; the resulting intermediate code is passed through the optimizer and the code
generator, producing target code such as nested loops.]
AUTO BAYES is a fully automatic program synthesis system for data analysis problems.
It has been used to derive programs for applications like the analysis of planetary nebu-
lae images taken by the Hubble space telescope [7, 6] as well as research-level machine
learning algorithms [1]. It is implemented in SWI-Prolog [18] and currently comprises
about 75,000 lines of documented code; Figure 1 shows the system architecture.
AUTO BAYES derives code from a statistical model which describes the expected
properties of the data in a fully declarative fashion: for each problem variable (i.e.,
observation or parameter), properties and dependencies are specified via probability
distributions and constraints. The top box in Figure 1 shows the specification of a nebu-
lae analysis model. The last two clauses are the core of this specification; the remaining
clauses declare the model constants and variables, and impose constraints on them. The
distribution clause
states that, with an expected error sigma, the expected value of the observation x at
a given position (i, j) is a function of this position and the nebula’s unknown center
position (x0, y0), radius r, and overall intensity i0. The task clause
specifies the analysis task, which the synthesized program has to solve, i.e., to estimate
the parameter values which maximize the probability of actually observing the given
data and thus under the given model best explain the observations. In this case, the task
can be solved by a mean square error minimization due to the gaussian distribution of
the data and the specific form of the probability. Note, however, that (i) this is not im-
mediately clear from the model, (ii) the function to be minimized is not explicitly given
in the model, and (iii) even small modifications of the model may require completely
different algorithms.
AUTO BAYES derives the code following a schema-based approach. A program
schema consists of a parameterized code fragment (i.e., template) and a set of con-
straints. Code fragments are written in ABIR (AUTO BAYES Intermediate Representa-
tion), which is essentially a “sanitized” variant of C (e.g., neither pointers nor side ef-
fects in expressions) but also contains a number of domain-specific constructs (e.g., vec-
tor/matrix operations, finite sums, and convergence-loops). The fragments also contain
parameterized object-level comments which eventually become the documentation of
the synthesized programs. The parameters are instantiated either directly by the schema
or by AUTO BAYES calling itself recursively with a modified problem. The constraints
determine whether a schema is applicable and how the parameters can be instantiated.
They are formulated as conditions on the model, either directly on the specification, or
indirectly on a Bayesian network [2] extracted from the specification. Such networks
are directed, acyclic graphs whose nodes represent the variables specified in the model
and whose edges represent the probabilistic dependencies between them, as specified
by the distribution clauses: the variable on the left-hand side depends on all model vari-
ables occurring on the right-hand side. In the example, each x_ij thus depends on i0, x0,
y0, r, and sigma.
The schemas are organized hierarchically into a schema library. Its top layers con-
tain decomposition schemas based on independence theorems for Bayesian networks
which try to break down the problem into independent sub-problems. These are domain-
specific divide-and-conquer schemas: the emerging sub-problems are fed back into the
synthesis process and the resulting programs are composed to achieve a solution for the
original problem. Guided by the network structure, AUTO BAYES is thus able to synthe-
size larger programs by composition of different schemas. The core layer of the library
contains statistical algorithm schemas as for example expectation maximization (EM)
[10] and nearest neighbor clustering (k-Means); usually, these generate the skeleton of
the program. The final layer contains standard numeric optimization methods as for ex-
ample the simplex method or different conjugate gradient methods. These are applied
after the statistical problem has been transformed into an ordinary numeric optimiza-
tion problem and AUTO BAYES has failed to find a symbolic solution for that problem. The
schemas in the upper layers of the library are very similar to the underlying theorems
and thus contain only relatively small code fragments while the schemas in the lower
layers closely resemble “traditional” generic algorithm templates. Their code fragments
are much larger and make full use of ABIR’s language constructs. These schemas are
the focus of our migration approach.
The schemas are applied exhaustively until all maximization tasks are rewritten
into ABIR code. The schemas can explicitly trigger large-scale optimizations which
take into account information from the synthesis process. For example, all numeric
optimization routines restructure the goal expression using code motion, common sub-
expression elimination, and memoization. In a final step, AUTO BAYES translates the
ABIR code into code tailored for a specific run-time environment. Currently, it pro-
vides code generators for the Octave and Matlab environments; it can also produce
standalone C and Modula-2 code. The entire synthesis process is supported by a large
meta-programming kernel which includes the graphical reasoning routines, a symbolic-
algebraic subsystem based on a rewrite engine, and a symbolic equation solver.
The abstract syntax of ABIR is embedded in Prolog as a data type of terms. An ABIR program is
then represented as a term by using a functor for each construct in the language, for example:
assign : lvalue * expression * list(comment) -> statement
for : list(index) * statement * list(comment) -> statement
series : list(statement) * list(comment) -> statement
sum : list(index) * expression -> expression
In addition to goals composing program fragments by direct term formation, the schema
contains recursive schema invocations such as simplex_try, which produces code
for the Reflection fragment from a more constrained version of Formula. Further-
more, the schema calls a number of meta-programming predicates. For example, the
var_fresh(X) predicate generates a fresh object variable and binds it to its argument,
which is a meta-level variable. This prevents variable clashes in the generated program.
Similarly, the index_make predicate constructs an index expression.
The schema in Figure 2 also uses second-order terms to represent array accesses
or function calls where the names are either given as parameters or automatically re-
named apart from a meaningful root (cf. the model_gensym(simplex, Simplex)
goal). A fully abstract syntax would use additional functors for these constructs and
represent for example an access to the array simplex with subscripts pv0 and pv1 by
arraysub(simplex, [var(pv0), var(pv1)]). However, this makes the abstract
syntax rather unwieldy and much harder to read. Therefore, such constructs are abbre-
viated by means of simple functor applications, e.g., simplex(pv0, pv1). Unfortu-
nately, Prolog does not allow second-order term formation, i.e., terms with variables in
the functor-position. Instead, it is necessary to use the built-in =..-operator, which con-
structs a functor application from a list where the head element is used as functor name
and the rest of the list contains the arguments of the application. Hence, the schemas
generate array access expressions such as the one above by goals such as Simplex_ji
=.. [Simplex, J, I], where the meta-variables Simplex, J, and I are bound to
concrete names.
The excerpt shows why the simple abstract syntax approach quickly becomes cumber-
some as the schemas become larger. The code fragment is built up from many smaller
fragments by the introduction of new meta-variables (e.g., Loop) because the abstract
syntax would become unreadable otherwise. However, this makes it harder to follow
and understand the overall structure of the algorithm. The schema is sprinkled with a
large number of calls to small meta-programming predicates, which makes it harder
to write schemas because one needs to know not only the abstract syntax, but also a
large part of the meta-programming base. In our experience, these peculiarities make
the learning curve much steeper than it ought to be, which in turn makes it difficult for a
domain expert to gradually extend the system’s capabilities by adding a single schema.
In the following, we illustrate how this schema is migrated and refactored to make
use of concrete syntax, using the Centroid fragment as running example.
The first step of the migration is to replace terms representing program fragments in
abstract syntax by the equivalent fragments in the concrete syntax of ABIR. Thus, the
Centroid fragment becomes:
Centroid = |[
/* Calculate the center of gravity in the simplex */
for( Index_i:idx )
Center_i := sum( Index_j:idx ) Simplex_ji:exp
]|
4.2 Meta-variables
In the translation to concrete syntax, Prolog variables in a term are meta-variables,
i.e., variables ranging over ABIR code, rather than variables in ABIR code. In the
fragment |[ x := 3 + j ]|, x and j are ABIR variables, whereas in the fragment
|[ x := 3 + J:exp ]|, x is an ABIR variable, but J:exp is a meta-variable rang-
ing over expressions. For the embedding of ABIR in Prolog we use the convention that
meta-variables are distinguished by capitalization and can thus be used directly in the
concrete syntax without tags. In a few places, the meta-variables are tagged with their
syntactic category, e.g., Index i:idx. This allows the parser to resolve ambiguities
and to introduce the injection functions necessary to build well-formed syntax trees.
Incidentally, this also eliminates the need for the idx-tags because the syntactic cate-
gory is now determined by the source text.
Next, array-reference creation with the =.. operator is replaced with array access
notation in the program fragment:
Centroid = |[
/* Calculate the center of gravity in the simplex */
for( I := A_BASE .. Size0 )
Center[I] := sum( J := A_BASE .. Size1 ) Simplex[J, I]
]|
Finally, the explicit generation of fresh object-variables using var_fresh is expressed
in the code by tagging the corresponding meta-variable with @new, a special anti-
quotation operator which constructs fresh object-level variable names.
Centroid = |[
/* Calculate the center of gravity in the simplex */
for( I@new := A_BASE .. Size0 )
Center[I] := sum( J@new := A_BASE .. Size1 ) Simplex[J, I]
]|
Thus 10 lines of code have been reduced to 5 lines, which are more readable.
The final step of the migration consists of refactoring the schema by inlining program
fragments; the fragments are self-descriptive, do not depend on separate calls to meta-
programming predicates, and can be read as pieces of code. For example, the fragment
for Centroid above can be inlined in the fragment for Loop, which itself can be inlined
in the final Code fragment. After this refactoring, a schema consists of one, or a few,
large program patterns.
In the example schema the use of concrete syntax, @new, and inlining reduces
the overall size by approximately 30% and eliminates the need for explicit meta-pro-
gramming. The reduction ratio is more or less maintained over the entire schema. After
migration along the lines above, the schema size is reduced from 508 lines to 366 lines.
After white space removal, the original schema contains 7779 characters and the result-
ing schema with concrete syntax 5538, confirming a reduction of 30% in actual code
size. At the same time, the resulting fewer but larger code fragments give a better insight
into the structure of the generated code.
The extension of Prolog with concrete syntax as sketched in the previous section is
achieved using the syntax definition formalism SDF2 [14, 12] and the transformation
language Stratego [16, 13] following the approach described in [15]. SDF is used to
specify the syntax of ABIR and Prolog as well as the embedding of ABIR into Prolog.
Stratego is used to transform syntax trees over this combined language into a pure Pro-
log program. In this section we explain the syntactical embedding, and in the next two
sections we outline the transformations mapping Prolog with concrete syntax to pure
Prolog.
module Prolog
exports
context-free syntax
Head ":-" Body "." -> Clause {cons("nonunitclause")}
Goal -> Body {cons("bodygoal")}
Term -> Goal
Functor "(" {Term ","}+ ")" -> Term {cons("func")}
Term Op Term -> Term {cons("infix")}
Variable -> Term {cons("var")}
Atom -> Term {cons("atom")}
Name -> Functor {cons("functor")}
Name -> Op {cons("op")}
The {cons(c)} annotations in the productions declare the constructors to be used in ab-
stract syntax trees corresponding to the parse trees over the syntax definition. Similarly,
the following is a fragment from the syntax definition of ABIR:
module ABIR
exports
context-free syntax
LValue ":=" Exp -> Stat {cons("assign")}
"for" "(" IndexList ")" Stat -> Stat {cons("for")}
{Index ","}* -> IndexList {cons("indexlist")}
Id ":=" Exp ".." Exp -> Index {cons("index")}
module PrologABIR
imports Prolog ABIR
exports
context-free syntax
"|[" Exp "]|" -> Term {cons("toterm")}
"|[" Stat "]|" -> Term {cons("toterm")}
variables
[A-Z][A-Za-z0-9_]* -> Id {prefer}
[A-Z][A-Za-z0-9_]* ":exp" -> Exp
The module declares that ABIR Expressions and Statements can be used as Prolog
terms by quoting them with the |[ and ]| delimiters, as we have seen in the previous
section. The variables section declares schemas for meta-variables. Thus, a capital-
ized identifier can be used as a meta-variable for identifiers, and a capitalized identifier
tagged with :exp can be used as a meta-variable for expressions.
After parsing a schema with the combined syntax definition the resulting abstract syntax
tree is a mixture of Prolog and ABIR abstract syntax. For example, the Prolog-goal
Code = |[ X := Y:exp + z ]|
is parsed into the mixed tree
bodygoal(infix(var("Code"), op(symbol("=")),
toterm(assign(var(meta-var("X")),
plus(meta-var("Y:exp"),var("z"))))))
6.2 Exploding
A mixed syntax tree can be translated to a pure Prolog tree by “exploding” embedded
tree constructors to functor applications:
bodygoal(infix(var("Code"),op(symbol("=")),
func(functor(word("assign")),
[func(functor(word("var")),[var("X")]),
func(functor(word("plus")),
[var("Y:exp"),
func(functor(word("var")),
[atom(quotedname("’z’"))])])])))
Note how the meta-variables X and Y have become Prolog variables representing a vari-
able name and an expression, respectively, while the object variable z has become a
character literal. Also note that X is a meta-variable for an object-level identifier and
will eventually be instantiated with a character literal, while Y is a variable for an ex-
pression.
strategies
explode = alltd(?toterm(<trm-explode>))
trm-explode = trm-metavar <+ trm-op
rules
trm-metavar : meta-var(X) -> var(X)
trm-op : Op#([]) -> atom(word(<lower-case>Op))
trm-op : Op#([T | Ts]) -> func(functor(word(<lower-case>Op)),
<map(trm-explode)>[T | Ts])
Parsing and then exploding the final Centroid-fragment on page 247 then produces
the pure Prolog-goal
Centroid =
commented(
comment([’Calculate the center of gravity in the simplex ’]),
for(indexlist([index(newvar(I),var(A_BASE),var(Size0))]),
assign(arraysub(Center,[var(I)]),
sum(indexlist([index(newvar(J),
var(A_BASE),var(Size1))]),
call(Simplex,[var(J),var(I)])))))
Comparing the generated Centroid-goal above with the original in Figure 2 shows
that the abstract syntax underlying the concrete syntax fragments does not correspond
exactly to the original abstract syntax used in AutoBayes. That is, two different abstract
syntax formats are used for the ABIR language. The format used in AUTO BAYES (e.g.,
Figure 2) is less explicit since it uses Prolog functor applications to represent array
references and function calls, instead of the more verbose representation underlying the
concrete syntax fragments.
A few additional object-language-specific transformations therefore map the embedded concrete
syntax into exactly the form needed to interface it with the legacy system.
8 Conclusions
Program generation and transformation systems manipulate large, parameterized ob-
ject language fragments. Operating on such fragments using abstract-syntax trees or
string-based concrete syntax is possible, but has severe limitations in maintainability
and expressive power. Any serious program generator should thus provide support for
concrete object syntax together with the underlying abstract syntax.
In this chapter we have shown that the approach of [15] can indeed be generalized
to meta-languages other than Stratego and that it is thus possible to add such support
to systems implemented in a variety of meta-languages. We have applied this approach
to AutoBayes, a large program synthesis system that uses a simple embedding of its
object-language (ABIR) into its meta-language (Prolog). The introduction of concrete
syntax results in a considerable reduction of the schema size (≈ 30%), but even more
importantly, in an improved readability of the schemas. In particular, abstracting out
fresh-variable generation and second-order term construction allows the formulation of
larger continuous fragments and improves the locality in the schemas. Moreover, meta-
programming with concrete syntax is cheap: using Stratego and SDF, the overall effort
to develop all supporting tools was less than three weeks. Once the tools were in place,
the migration of a schema was a matter of a few hours. Finally, the experiment has also
demonstrated that it is possible to introduce concrete syntax support gradually, without
forcing a disruptive migration of the entire system to the extended meta-language. The
seamless integration with the “legacy” meta-programming kernel is achieved with a few
additional transformations, which can be implemented quickly in Stratego.
8.1 Contributions
The work described in this chapter makes three main contributions to domain-specific
program generation. First, we described an extension of Prolog with concrete object
syntax, which is a useful tool for all meta-programming systems using Prolog. The tools
that implement the mapping back into pure Prolog are available for embedding arbitrary
object languages into Prolog3 . Second, we demonstrated that the approach of [15] can
indeed be applied to meta-languages other than Stratego. We extended the approach
by incorporating concrete syntax for object-level comments and annotations, which
are required for documentation and certification of the generated code [17]. Third, we
also extended the approach with object-language-specific transformations to achieve a
seamless integration with the legacy meta-programming kernel. This allows a gradual
migration of existing systems, even if they were originally designed without support for
concrete syntax in mind. These transformations also lift meta-computations from object
code into the surrounding meta-code. This allows us to introduce abstractions for fresh
variable generation and second-order variables to Prolog.
Acknowledgements
We would like to thank the anonymous referees for their comments on a previous ver-
sion of this paper.
References
1. W. Buntine, B. Fischer, and A. G. Gray. Automatic derivation of the multinomial PCA al-
gorithm. Technical report, NASA/Ames, 2003. Available at http://ase.arc.nasa.gov/
people/fischer/.
2. W. L. Buntine. Operations for learning with graphical models. JAIR, 2:159–225, 1994.
3. M. Clavel, F. Durán, S. Eker, P. Lincoln, N. Martí-Oliet, J. Meseguer, and J. F. Quesada.
Maude: specification and programming in rewriting logic. Theoretical Computer Science,
285(2):187–243, 2002.
4. J. R. Cordy, I. H. Carmichael, and R. Halliday. The TXL Programming Language, Version 8,
April 1995.
5. A. van Deursen, J. Heering, and P. Klint, editors. Language Prototyping. An Algebraic Spec-
ification Approach, volume 5 of AMAST Series in Computing. World Scientific, Singapore,
September 1996.
6. B. Fischer, A. Hajian, K. Knuth, and J. Schumann. Automatic derivation of statistical data
analysis algorithms: Planetary nebulae and beyond. Technical report, NASA/Ames, 2003.
Available at http://ase.arc.nasa.gov/people/fischer/.
7. B. Fischer and J. Schumann. Applying AutoBayes to the analysis of planetary nebulae im-
ages. In J. Grundy and J. Penix, editors, Proc. 18th ASE, pages 337–342, Montreal, Canada,
October 6–10 2003. IEEE Comp. Soc. Press.
8. B. Fischer and J. Schumann. AutoBayes: A system for generating data analysis programs
from statistical models. JFP, 13(3):483–508, May 2003.
9. A. G. Gray, B. Fischer, J. Schumann, and W. Buntine. Automatic derivation of statistical
algorithms: The EM family and beyond. In S. Becker, S. Thrun, and K. Obermayer, editors,
NIPS 15, pages 689–696. MIT Press, 2003.
10. G. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley Series in Proba-
bility and Statistics. John Wiley & Sons, New York, 1997.
11. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C.
Cambridge Univ. Press, Cambridge, UK, 2nd. edition, 1992.
12. M. G. J. van den Brand, J. Scheerder, J. Vinju, and E. Visser. Disambiguation filters for
scannerless generalized LR parsers. In N. Horspool, editor, Compiler Construction (CC’02),
volume 2304 of LNCS, pages 143–158, Grenoble, France, April 2002. Springer-Verlag.
13. E. Visser. Program transformation with Stratego/XT. Rules, strategies, tools, and systems in
Stratego/XT 0.9. In this volume.
14. E. Visser. Syntax Definition for Language Prototyping. PhD thesis, University of Amsterdam,
September 1997.
15. E. Visser. Meta-programming with concrete object syntax. In D. Batory, C. Consel, and
W. Taha, editors, Generative Programming and Component Engineering (GPCE’02), vol-
ume 2487 of LNCS, pages 299–315, Pittsburgh, PA, USA, October 2002. Springer-Verlag.
16. E. Visser, Z.-e.-A. Benaissa, and A. Tolmach. Building program optimizers with rewriting
strategies. In Proceedings of the third ACM SIGPLAN International Conference on Func-
tional Programming (ICFP’98), pages 13–26. ACM Press, September 1998.
17. M. Whalen, J. Schumann, and B. Fischer. Synthesizing certified code. In L.-H. Eriksson and
P. A. Lindsay, editors, Proc. FME 2002: Formal Methods—Getting IT Right, volume 2391
of LNCS, pages 431–450, Copenhagen, Denmark, July 2002. Springer.
18. J. Wielemaker. SWI-Prolog 5.2.9 Reference Manual. Amsterdam, 2003.
Optimizing Sequences of Skeleton Calls
Herbert Kuchen
1 Introduction
overhead for providing them, but these details are just passed as arguments.
Another way of simulating higher-order functions in an object-oriented language
is to use the “command” design pattern [GH95]. Here, the “argument functions”
are encapsulated by command objects. But still this pattern causes substantial
overhead.
In our framework, a parallel computation consists of a sequence of calls to
skeletons, possibly interleaved by some local computations. The computation
is now seen from a global perspective. As explained in [Le04], skeletons can
be understood as a domain-specific language for parallel programming. Several
implementations of algorithmic skeletons are available. They differ in the kind
of host language used and in the particular set of skeletons offered. Since higher-
order functions are taken from functional languages, many approaches use such a
language as host language [Da93,KP94,Sk94]. In order to increase the efficiency,
imperative languages such as C and C++ have been extended by skeletons, too
[BK96,BK98,DP97,FO92].
Depending on the kind of parallelism used, skeletons can be classified into
task parallel and data parallel ones. In the first case, a skeleton (dynamically) cre-
ates a system of communicating processes by nesting predefined process topolo-
gies such as pipeline, farm, parallel composition, divide&conquer, and
branch&bound [DP97,Co89,Da93,KC02]. In the second case, a skeleton works
on a distributed data structure, performing the same operations on some or all
elements of this data structure. Data-parallel skeletons, such as map, fold or
rotate are used in [BK96,BK98,Da93,Da95,DP97,KP94].
Moreover, there are implementations offering skeletons as a library rather
than as part of a new programming language. The approach described in the
sequel is based on the skeleton library introduced in [Ku02,KC02,KS02] and
on the corresponding C++ language binding. As pointed out by Smaragdakis
[Sm04], C++ is particularly suited for domain-specific languages due to its meta-
programming abilities. Our library provides task as well as data parallel skele-
tons, which can be combined based on the two-tier model taken from P3L [DP97].
In general, a computation consists of nested task parallel constructs where an
atomic task parallel computation can be sequential or data parallel. Purely data
parallel and purely task parallel computations are special cases of this model.
An advantage of the C++ binding is that the three important features needed
for skeletons, namely higher-order functions (i.e. functions having functions as
arguments), partial applications (i.e. the possibility to apply a function to fewer
arguments than it needs and to supply the missing arguments later), and para-
metric polymorphism, can be implemented elegantly and efficiently in C++ using
operator overloading and templates [St00,KS02].
Skeletons provide a global view of the computation which enables certain op-
timizations. In the spirit of the well-known Bird-Meertens formalism [Bi88,Bi89]
[GL97], algebraic transformations on sequences of skeleton calls allow one to
replace a sequence by a semantically equivalent but more efficient sequence.
The investigation of such sequences of skeletons and their transformation will
be the core of the current chapter. Similar transformations can be found in
The skeleton library offers data parallel and task parallel skeletons. Data paral-
lelism is based on a distributed data structure, which is manipulated by opera-
tions that process it as a whole and which are implemented in parallel
internally. Task parallelism is established by setting up a system of processes
which communicate via streams of data. Such a system is not arbitrarily struc-
tured but constructed by nesting predefined process topologies such as farms
and pipelines. Moreover, it is possible to nest task and data parallelism accord-
ing to the mentioned two-tier model of P3L, which allows atomic task parallel
processes to use data parallelism inside. Here, we will focus on data parallelism
and on the optimization of sequences of data parallel skeletons. Details on task
parallel skeletons can be found in [KC02].
encapsulates the passing of low-level messages in a safe way. Currently, two main
distributed data structures are offered by the library, namely distributed arrays
(DistributedArray<E>) and distributed matrices, where E is the type of the
elements of the distributed data structure. Moreover,
there are variants for sparse arrays and matrices, which we will not consider
here. By instantiating the template parameter E, arbitrary element types can be
generated. This shows one of the major features of distributed data structures
and their operations in our framework. They are polymorphic. Moreover, a dis-
tributed data structure is split into several partitions, each of which is assigned
to one processor participating in the data parallel computation. Currently, only
block partitioning is supported. Future extensions by other partitioning schemes
are planned.
Roughly, two classes of data parallel skeletons can be distinguished: compu-
tation skeletons and communication skeletons. Computation skeletons process
the elements of a distributed data structure in parallel. Typical examples are
the following methods in class DistributedArray<E>:
Thus, the parameter f can not only be a C++ function of the mentioned type
but also a so-called function object, i.e. an object representing a function of the
corresponding type. In particular, such a function object can represent a partial
application as we will explain below.
Communication consists of the exchange of the partitions of a distributed
data structure between all processors participating in the data parallel compu-
tation. In order to avoid inefficiency, there is no implicit communication e.g. by
accessing elements of remote partitions like in HPF [Ko94] or Pooma [Ka98],
but the programmer has to control communication explicitly by using skeletons.
Since there are no individual messages but only coordinated exchanges of parti-
tions, deadlocks cannot occur. The most frequently used communication skeleton
is
for all elements A_{l,j} of the local partition. Note that pivotOp does not compute
A_{k,j}/A_{k,k} itself, but fetches it from the j-th element of the locally available row
of Pivot. Since all rows of matrix Pivot are identical after the broadcast, the
index of the row does not matter and we may take the first locally available
row (in fact the only one), namely the one with local index 0. Thus, we find
A_{k,j}/A_{k,k} in Pivot.getLocalGlobal(0,j). Note that 0 is a local index referring
to the local partition of Pivot, while j is a global one referring to matrix Pivot
as a whole.
as a whole. Our skeleton library provides operations which can access array and
matrix elements using local, global, and mixed indexing. The user may pick the
most convenient one. Note that parallel programming based on message passing
libraries such as MPI only provides local indexing.
It is also worth mentioning that for the above example the “non-standard”
map operation mapPartitionInPlace is clearly more efficient (but slightly less
elegant) than more classic map operations. mapPartitionInPlace manipulates
a whole partition at once rather than a single element. This allows one to ignore
such elements of a partition which need no processing. In the example, these
are the elements to the left of the pivot column. Using classic map operations,
one would have to apply the identity function to these elements, which causes
substantial overhead.
It offers the public methods shown below. For every method which has a C++
function as an argument, there is an additional variant of it which may instead
use a partial application with a corresponding type. As mentioned above, a par-
tial application is generated by applying the function curry to a C++ function
or by applying an existing partial application to additional arguments. For in-
stance, in addition to
void mapInPlace(E (*f)(E))
(explained below) there is some variant of type
template <class F>
void mapInPlace(const Fct1<E,E,F>& f)
which can take a partial application rather than a C++ function as argument.
Thus, if the C++ functions succ (successor function) and add (addition) have
been previously defined, all elements of a distributed array A can be incremented
by one either by using the first variant of mapInPlace
A.mapInPlace(succ)
or by the second variant
A.mapInPlace(curry(add)(1))
Of course, the second is more flexible, since the arguments of a partial applica-
tion are computed at runtime. Fct1<A,R,F> is the type of a partial application
representing a unary function with argument type A and result type R. The third
template parameter F determines the computation rule needed to evaluate the
application of the partial application to the missing arguments (see [KS02] for
details). The type Fct1<A,R,F> is of internal use only. The user does not need
to care about it. She only has to remember that skeletons may have partial
applications instead of C++ functions as arguments and that these partial ap-
plications are created by applying the special, predefined function curry to a
C++ function and by applying the result successively to additional arguments.
Of course, there are also types for partial applications representing functions of
arbitrary arity. In general, a partial application of type Fct_i<A_1,...,A_i,R,F>
represents an i-ary function with argument types A_1,...,A_i and result type R,
for i ∈ ℕ.
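As a small illustration (add3 is a hypothetical user function, not part of the library): applying curry to a ternary function and then supplying two arguments yields a unary partial application, which the template variant of mapInPlace described above accepts.

  // Illustration only: add3 is a hypothetical C++ function.
  int add3(int x, int y, int z) { return x + y + z; }
  // ... given some DistributedArray<int> A:
  A.mapInPlace( curry(add3)(1)(2) );   // the remaining argument is the array
                                       // element; adds 1 + 2 = 3 to every element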
Some of the following methods depend on some preconditions. An exception
is thrown, if they are not met. p denotes the number of processors collaborating
in the data parallel computation on the considered distributed array.
Constructors
DistributedArray(int size, E (*f)(int))
creates a distributed array (DA) with size elements. The i-th element is
initialized with f(i) for i = 0, . . . , size − 1. The DA is partitioned into
equally sized blocks, each of which is given to one of the p processors. More
precisely, the j-th participating processor gets the elements with indices
j · size/p, ..., (j + 1) · size/p − 1, where j = 0, ..., p − 1. We assume that p
divides size.
DistributedArray(int size, E initial)
This constructor works as the previous. However, every array element is
initialized with the same value initial. There are more constructors, which
are omitted here in order to save space.
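As a small illustration in plain C++ (not part of the library), the block partition bounds described for the first constructor can be computed as follows:

  // Illustration only: lower and upper global index owned by processor j,
  // assuming p divides size (as required above).
  int lowerBound(int j, int size, int p) { return j * (size / p); }
  int upperBound(int j, int size, int p) { return (j + 1) * (size / p) - 1; }
  // e.g. size = 12, p = 3: processor 1 owns the indices 4,...,7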
int getFirst()
delivers the index of the first locally available element of the DA.
int getSize()
returns the size (i.e. total number of elements) of the DA.
int getLocalSize()
returns the number of locally available elements.
bool isLocal(int i)
tells whether the i-th element is locally available.
void setLocal(int i, E v)
sets the value of the i-th locally available element to v. Note that the index
i is referring to the local partition and not to the DA as a whole.
Precondition: 0 ≤ i ≤ getLocalSize() − 1.
void set(int i, E v)
sets the value of the i-th element to v.
Precondition: j · size/p ≤ i < (j + 1) · size/p, if the local partition is the
j-th partition of the DA (0 ≤ j < p).
E getLocal(int i)
returns the value of the i-th locally available element.
Precondition: 0 ≤ i ≤ getLocalSize() − 1.
E get(int i)
delivers the value of the i-th element of the DA.
Precondition: j · size/p ≤ i < (j + 1) · size/p, where the local partition is
the j-th partition of the DA (0 ≤ j < p).
Obviously, there are auxiliary functions which access a distributed array via
a global index and others which access it via an index relative to the start
of the local partition. Using global indexing is usually more elegant, while local
indexing can be sometimes more efficient. MPI-based programs use local indexing
only. Our skeleton library offers both, and it even allows to mix them as in the
Gaussian elimination example (Fig. 1, lines 9 and 10). All these methods and
most skeletons are inlined. Thus, there is no overhead for function calls when
using them.
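To make the relation between the two indexing schemes explicit, the following illustrative helpers (not part of the library) use only the methods listed above:

  // Illustration only: converting between a global index and the index
  // relative to the local partition, using getFirst() described above.
  template <class E>
  int toLocal(DistributedArray<E>& A, int globalIndex) {
    return globalIndex - A.getFirst();
  }
  template <class E>
  int toGlobal(DistributedArray<E>& A, int localIndex) {
    return localIndex + A.getFirst();
  }
  // If A.isLocal(i) holds, then A.get(i) and A.getLocal(toLocal(A, i))
  // refer to the same element.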
Map Operations. The map function is the most well-known and most fre-
quently used higher-order function in functional languages. In its original for-
mulation, it is used to apply a unary argument function to every element
of a list and to construct a new list from the results. In the context of arrays
and imperative or object-oriented programming, this original formulation is of
limited use. The computation related to an array element A_i often not only de-
pends on A_i itself but also on its index i. In contrast to functional languages,
imperative and object-oriented languages also allow updating the element in
place. Thus, there is no need to construct a new array, and time and space can
be saved. Consequently, our library offers several variants of map, namely with
and without considering the index and with and without update in place. In
fact, mapIndexInPlace turned out to be the most useful one in many example
applications. Some of the following variants of map require a template parameter
R, which is the element type of the resulting distributed array.
template <class R> DistributedArray<R> map(R (*f)(E))
returns a new DA, where element i has value f(get(i))
(0 ≤ i ≤ getSize() − 1). It is partitioned as the considered DA.
template <class R> DistributedArray<R> mapIndex(R (*f)(int,E))
returns a new DA, where element i has value f(i,get(i))
(0 ≤ i ≤ getSize() − 1). It is partitioned as the considered DA.
void mapInPlace(E (*f)(E))
replaces the value of each DA element with index i by f(get(i)),
where 0 ≤ i ≤ getSize() − 1.
void mapIndexInPlace(E (*f)(int,E))
replaces the value of each DA element with index i by f(i,get(i)),
where 0 ≤ i ≤ getSize() − 1.
void mapPartitionInPlace(void (*f)(E*))
replaces each partition P of the DA by f(P ).
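A small usage sketch of these variants (dbl and addIndex are hypothetical element functions; size is assumed to be a multiple of p):

  // Illustration only: applying the map variants described above.
  int dbl(int x)             { return 2 * x; }
  int addIndex(int i, int x) { return x + i; }
  // ... inside some function:
  DistributedArray<int> A(size, 1);            // every element initialised to 1
  A.mapInPlace(dbl);                           // every element becomes 2
  A.mapIndexInPlace(addIndex);                 // element i becomes 2 + i
  DistributedArray<int> B = A.map<int>(dbl);   // new DA with elements 2 * (2 + i)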
Fold and Scan. fold (also known as reduction) and scan (often called parallel
prefix) are computation skeletons, which require some communication internally.
They combine the elements of a distributed array by an associative binary func-
tion.
E fold(E (*f)(E,E))
combines all elements of the DA by the associative binary operation f, i.e.
it computes f(get(0),f(get(1),...f(get(n-2),get(n-1))...))
where n = getSize(). Precondition: f is associative (this is not checked).
void scan(E (*f)(E,E))
replaces every element with index i by
f(get(0),f(get(1),...f(get(i-1),get(i))...)),
where 0 ≤ i ≤ getSize() − 1.
Precondition: f is associative (this is not checked).
Both fold and scan require Θ(log p) messages (of size size(E) and n/p ·
size(E), respectively) per processor [BK98].
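For illustration (add is a hypothetical, associative argument function; A is assumed to be a DistributedArray<int>):

  // Illustration only: total and prefix sums of a distributed array.
  int add(int x, int y) { return x + y; }   // associative, as required
  // ... inside some function:
  int total = A.fold(add);   // f(get(0), f(get(1), ...)) = sum of all elements
  A.scan(add);               // element i now holds the sum of elements 0,...,i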
Another communication skeleton, gather, is a bit special, since it does not rear-
range a distributed array, but copies it to an ordinary (non-distributed) array.
Since all non-distributed data structures and the corresponding computations
are replicated on each processor participating in a data parallel computation
[BK98], gather corresponds to the MPI operation MPI_Allgather rather than
to MPI_Gather.
permute and permutePartition require a single send and receive per processor,
while broadcast, broadcastPartition, and gather need Θ(log p) communica-
tion steps (with message sizes size(E) and n/p·size(E), respectively). allToAll
requires Θ(p) communication steps.
Other Operations
DistributedArray<E> copy()
generates a copy of the considered DA.
void show()
prints the DA on standard output.
The operations for distributed matrices (DM) are similar to those for distributed
arrays. Thus, we will sketch them only briefly here. The constructors specify the
number of rows and columns, tell how the elements shall be initialized, and give
the number of partitions in vertical and horizontal direction. E.g. the constructor
constructs a distributed n×m matrix, where every element is initialized with x0.
The matrix is split into v×h partitions. As for DAs, there are other constructors
which compute the value of each element depending on its position using a C++
function or function object.
The operations for accessing attributes of a DM now have to take both di-
mensions into account. Consequently, there are operations delivering the number
of (local or global) rows and columns, the index of the first locally available row
and column, and so on. As shown in the Gaussian-elimination example, the op-
erations for accessing an element of a DM can use any combination of local and
global indexing. For instance, M.get(i,j) fetches (global) element M_{i,j} of DM
  n \ p      4      8     16
    16     1.64   1.91   1.89
   128     1.71   1.92   1.70
  1024     1.48   1.81   1.72
[Plots not reproduced (presumably Fig. 2): speedup over the array size n of the combined skeletons perm o perm, map o map, map o fold, map o scan, and multiMap, for p = 2, 4, 8, and 16 processors.]
original                          combined
A.permute(f);                     A.permute(curry(compose)(g)(f));
A.permute(g);                     (analogously for permutePartition)

A.permute(f);                     A.broadcast(f^{-1}(k));
A.broadcast(k);                   (analogously for permutePartition)

A.broadcast(k);                   A.broadcast(k);
A.permute(f);                     (analogously for permutePartition)

A.broadcast(k);                   A.broadcast(k);
A.broadcast(i);

M.permutePartition(f,g);          M.permutePartition(curry(compose)(h')(f),g);
M.rotateRows(h);

M.rotateRows(h);                  M.permutePartition(curry(compose)(f)(h'),g);
M.permutePartition(f,g);

M.permutePartition(f,g);          M.permutePartition(f,curry(compose)(h')(g));
M.rotateCols(h);

M.rotateCols(h);                  M.permutePartition(f,curry(compose)(g)(h'));
M.permutePartition(f,g);

M.broadcast(i,j);                 M.broadcast(i,j);
M.rotateRows(h);                  (analogously for rotateCols)

M.rotateRows(h);                  M.broadcast(h'^{-1}(i),j);
M.broadcast(i,j);                 (analogously for rotateCols)

M.rotateRows(f);                  M.rotateRows(curry(compose)(g)(f));
M.rotateRows(g);                  (analogously for rotateCols)

M.rotateRows(f);                  M.permutePartition(f',g');
M.rotateCols(g);

M.rotateCols(f);                  M.permutePartition(g',f');
M.rotateRows(g);
requires not only a transformation of the application program, but the skeleton
library also has to be extended by the required combined skeletons. For instance,
A.mapIndexInPlace(f);
result = A.fold(g);
is replaced by
result = A.mapIndexInPlaceFold(f,g);
in the application program, and the new skeleton mapIndexInPlaceFold with
obvious meaning is added to the library. Combining map and scan is done anal-
ogously. As Fig. 2 shows, speedups of up to 1.1 can be obtained for big arrays,
and a bit less for smaller arrays. Again, a simple argument function was used
here. With more complex argument functions, the speedup decreases. The same
approach can be applied for fusing:
– two calls to zipWith (and variants),
– zipWith and map, fold, or scan,
– a communication skeleton (such as permute, broadcast and rotate) and a
computation skeleton (such as map, fold, and scan) (and vice versa).
Let us consider the latter in more detail. If implemented in a clever way, it
may (in addition to loop fusion) allow one to overlap communication and com-
putation and hence provide a new source for improvements. For the combination
of mapIndexInPlace and permutePartition, we observed a speedup of up to
1.25 (see Fig. 3). Here, the speedup heavily depends on a good balance between
computation and communication. If the arrays are too small and the amount
of computation performed by the argument function of mapIndexInPlace is not
sufficient, the overhead of sending more (but smaller) messages does not pay off,
and a slowdown may occur. Also if the amount of computation exceeds that of
communication, the speedup decreases again. With an optimal balance between
computation and communication (and a more sophisticated implementation of
the mapIndexInPlaceFold skeleton) speedups up to 2 are theoretically possible.
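To make the idea concrete, the following is a purely hypothetical sketch (not the library's implementation) of the local part of such a combined skeleton: the map step and the fold step share a single pass over the local partition, and the Θ(log p) combination of the partial results across processors is only indicated in a comment.

  // Hypothetical sketch of loop fusion inside a combined skeleton.
  template <class E>
  E mapIndexInPlaceFoldLocal(E* partition, int firstGlobalIndex, int localSize,
                             E (*f)(int, E), E (*g)(E, E)) {
    partition[0] = f(firstGlobalIndex, partition[0]);
    E acc = partition[0];
    for (int k = 1; k < localSize; ++k) {
      partition[k] = f(firstGlobalIndex + k, partition[k]);  // map step (in place)
      acc = g(acc, partition[k]);                            // fold step
    }
    return acc;  // partial result; the partial results of all p processors would
                 // still have to be combined with g (Θ(log p) messages)
  }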
We have investigated a few example applications (see [Ku03]) such as matrix
multiplication, FFT, Gaussian elimination, all pairs shortest paths, samplesort,
TSP (traveling salesperson problem), and bitonic sort in order to check which
of the optimizations discussed above can be applied more or less frequently in
[Plots not reproduced (presumably Fig. 3): speedup over the array size n of the combined mapIndexInPlace/permutePartition skeleton for p = 2 and p = 4, with three curves labelled 1, 100, and 10000.]
A.mapIndexInPlace(f);
B.mapIndexInPlace(g);
Acknowledgements
I would like to thank the anonymous referees for lots of helpful suggestions.
References
[BG02] Bischof, H., Gorlatch, S.: Double-Scan: Introducing and Implementing a New
Data-Parallel Skeleton. In Monien, B., Feldmann, R., eds.: Euro-Par’02. LNCS
2400, Springer-Verlag (2002) 640-647
[Bi88] Bird, R.: Lectures on Constructive Functional Programming, In Broy, M., ed.:
Constructive Methods in Computing Science, NATO ASI Series. Springer-
Verlag (1988) 151-216
[Bi89] Bird, R.: Algebraic identities for program calculation. The Computer Journal
32(2) (1989) 122-126
[BK96] Botorog, G.H., Kuchen, H.: Efficient Parallel Programming with Algorithmic
Skeletons. In Bougé, L. et al., eds.: Euro-Par’96. LNCS 1123, Springer-Verlag
(1996) 718-731
[BK98] Botorog, G.H., Kuchen, H.: Efficient High-Level Parallel Programming, The-
oretical Computer Science 196 (1998) 71-107
[Co89] Cole, M.: Algorithmic Skeletons: Structured Management of Parallel Compu-
tation. MIT Press (1989)
[Da93] Darlington, J., Field, A.J., Harrison T.G., et al: Parallel Programming Using
Skeleton Functions. PARLE’93. LNCS 694, Springer-Verlag (1993) 146-160
[Da95] Darlington, J., Guo, Y., To, H.W., Yang, J.: Functional Skeletons for Par-
allel Coordination. In Hardidi, S., Magnusson, P.: Euro-Par’95. LNCS 966,
Springer-Verlag (1995) 55-66
[DP97] Danelutto, M., Pasqualetti, F., Pelagatti S.: Skeletons for Data Parallelism
in p3l. In Lengauer, C., Griebl, M., Gorlatch, S.: Euro-Par’97. LNCS 1300,
Springer-Verlag (1997) 619-628
[FO92] Foster, I., Olson, R., Tuecke, S.: Productive Parallel Programming: The PCN
Approach. Scientific Programming 1(1) (1992) 51-66
[GH95] Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns. Addison-
Wesley (1995)
[GL99] Gropp, W., Lusk, E., Skjellum, A.: Using MPI. MIT Press (1999)
[GL97] Gorlatch, S., Lengauer, C.: (De)Composition Rules for Parallel Scan and Re-
duction, 3rd Int. Working Conf. on Massively Parallel Programming Models
(MPPM’97). IEEE (1997) 23-32
[Go04] Gorlatch, S.: Optimizing Compositions of Components in Parallel and Dis-
tributed Programming (2004) In this volume
[GW99] Gorlatch, S., Wedler, C., Lengauer, C.: Optimization Rules for Programming
with Collective Operations, 13th Int. Parallel Processing Symp. & 10th Symp.
on Parallel and Distributed Processing (1999) 492-499
[Ka98] Karmesin, S., et al.: Array Design and Expression Evaluation in POOMA II.
ISCOPE’98 (1998) 231-238
[KC02] Kuchen, H., Cole, M.: The Integration of Task and Data Parallel Skeletons.
Parallel Processing Letters 12(2) (2002) 141-155
[Ko94] Koelbel, C.H., et al.: The High Performance Fortran Handbook. Scientific and
Engineering Computation. MIT Press (1994)
[KP94] Kuchen, H., Plasmeijer, R., Stoltze, H.: Efficient Distributed Memory Imple-
mentation of a Data Parallel Functional Language. PARLE’94. LNCS 817,
Springer-Verlag (1994) 466-475
[KS02] Kuchen, H., Striegnitz, J.: Higher-Order Functions and Partial Applications
for a C++ Skeleton Library. Joint ACM Java Grande & ISCOPE Conference.
ACM (2002) 122-130
[Ku02] Kuchen, H.: A Skeleton Library. In Monien, B., Feldmann, R.: Euro-Par’02.
LNCS 2400, Springer-Verlag (2002) 620-629
[Ku03] Kuchen, H.: The Skeleton Library Web Pages.
http://danae.uni-muenster.de/lehre/kuchen/Skeletons/
[Le04] Lengauer, C.: Program Optimization in the Domain of High-Performance Par-
allelism. (2004) In this volume
[MC95] W. McColl: Scalable Computing. In van Leeuwen, J., ed.: Computer Science
Today, LNCS 1000, Springer-Verlag (1995) 46-61
[MS00] McNamara, B., Smaragdakis, Y.: Functional Programming in C++. ICFP’00.
ACM (2000) 118-129
[SD98] Skillicorn, D., Danelutto, M., Pelagatti, S., Zavanella, A.: Optimising Data-
Parallel Programs Using the BSP Cost Model. In Pritchard, D., Reeve, J.:
Euro-Par’98. LNCS 1470, Springer-Verlag (1998) 698-703
[SH97] Skillicorn, D., Hill, J.M.D., McColl, W.: Questions and Answers about BSP.
Scientific Programming 6(3) (1997) 249-274
[Sk94] Skillicorn, D.: Foundations of Parallel Programming. Cambridge U. Press
(1994)
[Sm04] Smaragdakis, Y.: A Personal Outlook of Generator Research (2004) In this
volume
[St00] Striegnitz, J.: Making C++ Ready for Algorithmic Skeletons. Tech. Report
IB-2000-08, http://www.fz-juelich.de/zam/docs/autoren/striegnitz.html
[ZC90] Zima, H.P., Chapman, B.M.: Supercompilers for Parallel and Vector Comput-
ers. ACM Press/Addison-Wesley (1990)
[ZC03] ZIV-Cluster: http://zivcluster.uni-muenster.de/
Domain-Specific Optimizations
of Composed Parallel Components
Sergei Gorlatch
1 Introduction
We present here an overview of our recent results covering quite a wide range
of program components used in the parallel and distributed setting. Our work
can be viewed as a step on the way to developing a mathematical science of
parallel components’ composition.
The structure of the paper, with its main contributions, is as follows:
The next three sections deal with compositions of parallel skeletons at the
three abstraction levels shown in Figure 1.
We call map, red and scan “skeletons” because each describes a whole class
of functions, obtainable by substituting domain-specific operators for ⊕ and f .
The typical structure of MPI programs is a sequential composition of compu-
tations and collective operations that are executed one after another. We model
this by functional composition ◦:
(f ◦ g) x =def f (g x)                                        (5)
MPI_Scan (op1);
MPI_Allreduce (op2);
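For illustration, such a pair of adjacent collective operations could be spelled out as follows; the buffers and the concrete reduction operators are arbitrary and only serve to show the sequential composition:

  /* Illustration only: MPI_SUM and MPI_MAX stand in for op1 and op2. */
  double x, after_scan, result;
  MPI_Scan(&x, &after_scan, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  MPI_Allreduce(&after_scan, &result, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);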
Here is the proof sketch: a) c_p ∈ Θ(n · max{log(n/p), log p}) ⊆ O(n · log n), and
b) c_p ∈ Θ(n · max{log(n/p), log p}) ⊆ Ω(n · log p).
Since sequential time complexity of the DS skeleton as a composition of two
scans is obviously linear, it follows from Theorem 4(b) that the generic DH-
implementation cannot be cost-optimal for the DS skeleton.
This general result is quite disappointing. However, since there are instances
of the DS skeleton that do have cost-optimal hand-coded implementations, we
can strive to find a generic, cost-optimal solution for all instances. This motivates
our further search for a better parallel implementation of DS.
Our further considerations are valid for arbitrary partitions but, in parallel
programs, one usually works with segments of approximately the same size.
– plist2list_k, the inverse of list2plist_k, transforms a k-plist into a list.
– pscanrl_p(⊕, ⊗) applies function scanrl(⊕, ⊗), defined by (12), to the list
  containing only the points of the argument plist.
The next theorem shows that the distributed version of the double-scan skeleton
can be implemented using our auxiliary skeletons. We use the following new def-
inition: a binary operation ⊙ is said to be associative modulo a second operation ⊗, iff for arbitrary
elements a, b, c we have: (a ⊙ b) ⊙ c = (a ⊙ b) ⊗ (b ⊙ c). The usual associativity is
the associativity modulo the operation first, which yields the first element of a pair.
Since pscanrl (➀, ➁) = pscanlr (➂, ➃), the equality (18) also holds for the
function pscanlr (➂, ➃). There is a constructive procedure for generating the
operator ➄, if (a➂) is bijective for arbitrary a and if ➂ distributes over ➀.
On a parallel machine, we split plists so that each segment and its right
“border point” are mapped to a processor. The time complexity analysis [9]
shows that the DS implementation on plists is cost-optimal.
The theorem ensures the cost optimality of the suggested parallel implementa-
tion in the practice-relevant case when the number of processors is not too big.
The estimate O(n/ log n) provides an upper bound for how fast the number of
processors is allowed to grow with the problem size. If the number of processors
grows slower than the bound, the implementation is cost optimal.
[Plots not reproduced: run time in seconds over the number of processors for the non-cost-optimal and the cost-optimal (measured and estimated) implementations, on a Cray T3E and on a Linux cluster, each with 5·10^5 elements.]
[Diagram not reproduced: a Client composing two remote skeleton calls on Server1 and Server2, with numbered message arrows; see the caption below.]
Fig. 5. Skeleton composition using plain RMI (left) and future-based RMI (right)
We have tested our approach on a linear system solver, which we implement using
the matrix library Jama [15]. For a given system Ax = b, we look for vector x̂ that
minimizes χ² = (Ax − b)². Our example uses the singular value decomposition
method (SVD): the SVD of a matrix is a decomposition A = U · Σ · V^T, U and
V being orthonormal matrices and Σ a diagonal matrix [16].
The solver is a composition of four steps. First the SVD for A is computed,
using a Jama library method. Then the inverse of A is computed, and A^{-1} is
multiplied by b to obtain x. Finally, the residual r = |A · x − b| is computed.
The version of the program, with the composition expressed using the future-
based RMI, is shown in Figure 6. The srv variable holds the RMI reference
to a remote object on the server, providing access to the Jama methods. The
[Plot not reproduced: time in ms over the matrix size (160-240) for the plain RMI version, the improved RMI version, and the lower bound.]
The plain RMI version is much slower (three to four times) than the “ideal”
version, which runs completely on the server side. The improved RMI version is
less than 10 % slower than the ideal version, so it eliminates most of the overhead
of plain RMI.
Acknowledgements
This work is based on results obtained in joint research with Christian Lengauer,
Holger Bischof, Martin Alt, Christoph Wedler and Emanuel Kitzelmann, to all
of whom the author is grateful for their cooperation. Anonymous referees and
Julia Kaiser-Mariani helped a lot to improve the presentation.
References
Runtime Code Generation in C++ as a Foundation for Domain-Specific Optimisation
Olav Beckmann, Alastair Houghton, Michael Mellor, and Paul H.J. Kelly
1 Introduction
The work we describe in this Chapter is part of a wider research programme at
Imperial College aimed at addressing the apparent conflict between the quality
of scientific software and its performance. The TaskGraph library, which is the
focus of this Chapter, is a key tool which we are developing in order to drive
this research programme. The library is written in C++ and is designed to
support a model of software components which can be composed dynamically,
and optimised, at runtime, to exploit execution context:
– Optimisation with Respect to Runtime Parameters
The TaskGraph library can be used for specialising software components
according to either their parameters or other runtime context information.
Later in this Chapter (Sec. 3), we show an example of specialising a generic
image filtering function to the particular convolution matrix being used.
– Optimisation with Respect to Platform
The TaskGraph library uses SUIF-1 [1], the Stanford University Intermediate
Format, as its internal representation for code. This makes a rich collection
of dependence analysis and restructuring passes available for our use in code
optimisation. In Sec. 5 of this Chapter we show an example of generating, at
runtime, a matrix multiply component which is optimally tiled with respect
to the host platform.
Background. Several earlier tools for dynamic code optimisation have been re-
ported in the literature [2, 3]. The key characteristics which distinguish our ap-
proach are as follows:
– Single-Language Design
The TaskGraph library is implemented in C++ and any TaskGraph pro-
gram can be compiled as C++ using widely-available compilers. This is in
contrast with approaches such as Tick-C [2] which rely on a special com-
piler for processing dynamic constructs. The TaskGraph library’s support
for manipulating code as data within one language was pioneered in Lisp [4].
– Explicit Specification of Dynamic Code
Like Tick-C [2], the TaskGraph library is an imperative system in which
the application programmer has to construct the code as an explicit data
structure. This is in contrast with ambitious partial evaluation approaches
such as DyC [3,5] which use declarative annotations of regular code to spec-
ify where specialisation should occur and which variables can be assumed
constant. Offline partial evaluation systems like these rely on binding-time
analysis (BTA) to find other, derived static variables [6].
– Simplified C-like Sub-language
Dynamic code is specified with the TaskGraph library via a small sub-
language which is very similar to standard C (see Sec. 2). This language has
been implemented through extensive use of macros and C++ operator over-
loading and consists of a small number of special control flow constructs, as
well as special types for dynamically bound variables. Binding times of vari-
ables are explicitly determined by their C++ types, while binding times for
intermediate expression values are derived using C++’s overloading mech-
anism (Sec. 4). The language has first-class arrays, unlike C and C++, to
facilitate dependence analysis.
In Sec. 6 we discuss the relationship with other approaches in more detail.
 1  #include <stdio.h>
 2  #include <stdlib.h>
 3  #include <TaskGraph>
 4  using namespace tg;
 5  int main( int argc, char *argv[] ) {
 6    TaskGraph T;
 7    int b = 1;
 8    int c = atoi( argv[1] );
 9    taskgraph( T ) {
10      tParameter( tVar( int, a ) );
11      a = a + c;
12    }
13    T.compile();
14    T.execute( "a", &b, NULL );
15    printf( "b = %d\n", b );
16  }
[Right-hand AST diagram not reproduced: a TaskGraph root with a progn node containing the declaration var a : int and the statement Assign(Var a, Add(Var a, 1)); the static value of c appears in the AST as the literal 1.]
Fig. 1. Left: Simple Example of using the TaskGraph library. Right: Abstract syntax
tree (AST) for the simple TaskGraph constructed by the piece of code shown on the left.
The int variable c is static at TaskGraph construction time, and appears in the AST
as a value (see Sec. 4). The (not type-safe) execute() call takes a NULL-terminated
list of parameter name/value pairs; it binds TaskGraph parameter “a” to the address
of the integer b, then invokes the compiled code.
A Simple Example. The simple C++ program shown in the left-hand part of
Fig. 1 is a complete example of using the TaskGraph library. When compiled with
g++, linked against the TaskGraph library and executed, this program dynami-
cally creates a piece of code for the statement a = a + c, binds the application
program variable b as a parameter and executes the code, printing b = 2 as the
result. This very simple example illustrates both that creation of dynamic code
is completely explicit in our approach and that the language for creating the
AST which a TaskGraph holds looks similar to ordinary C.
void convolution( const int IMGSZ, const float *image, float *new_image,
                  const int CSZ /* convolution matrix size */, const float *matrix ) {
  int i, j, ci, cj;  const int c_half = ( CSZ / 2 );
Fig. 2. Generic image filtering: C++ code. Because the size as well as the entries of
the convolution matrix are runtime parameters, the inner loops (for-ci and for-cj), with
typically very low trip-count, cannot be unrolled efficiently.
– The bounds of the inner loops over the convolution matrix are statically
unknown, hence these loops, with typically very low trip-count, cannot be
unrolled efficiently.
– Failure to unroll the inner loops leads to unnecessarily complicated control
flow and also blocks optimisations such as vectorisation on the outer loops.
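The loop body of Fig. 2 is not reproduced in this excerpt. As a rough sketch only (the indexing follows the TaskGraph version in Fig. 5; everything else is an assumption, not the authors' code), the generic version could look like this:

  // Sketch only (assumed, not the original body of Fig. 2): outer loops over
  // the image, inner loops over the convolution matrix whose bounds depend on
  // the runtime parameter CSZ, which is why they cannot be unrolled.
  void convolution_sketch( const int IMGSZ, const float *image, float *new_image,
                           const int CSZ, const float *matrix ) {
    const int c_half = CSZ / 2;
    for( int i = c_half; i < IMGSZ - c_half; ++i ) {
      for( int j = c_half; j < IMGSZ - c_half; ++j ) {
        new_image[ i * IMGSZ + j ] = 0.0f;
        for( int ci = -c_half; ci <= c_half; ++ci ) {      // for-ci
          for( int cj = -c_half; cj <= c_half; ++cj ) {    // for-cj
            new_image[ i * IMGSZ + j ] +=
              image[ (i + ci) * IMGSZ + j + cj ]
              * matrix[ (c_half + ci) * CSZ + c_half + cj ];
          }
        }
      }
    }
  }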
Fig. 3. Generic image filtering: function constructing the TaskGraph for a specific
convolution matrix. The size as well as the entries of the convolution matrix are static
at TaskGraph construction time. This facilitates complete unrolling of the inner two
loops. The outer loops (for-i and for-j) are entered as control flow nodes in the AST.
[Plots not reproduced. Top: total time over the image size (512 means the image size is 512x512 floats). Bottom: time in seconds, broken down into compile time and code runtime, for the generic and the TaskGraph (TG) versions compiled with gcc and icc at image sizes 1024 and 2048.]
Fig. 4. Performance of image filtering example. Top: Total execution time, including
runtime compilation, for one pass over image. Bottom: Breakdown of total execution
time into compilation time and execution time of the actual convolution code for two
specific image sizes: 1024 × 1024 (the break-even point) and 2048 × 2048.
we would have to pay the runtime compilation overhead only once and will get
higher overall speedups.
4 How It Works
Thus far, we have given examples of how the TaskGraph library is used, and
demonstrated that it can achieve significant performance gains. In this section
we now give a brief overview of TaskGraph syntax, together with an explanation
of how the library works.
TaskGraph Creation. The TaskGraph library can represent code as data – specif-
ically, it provides TaskGraphs as data structures holding the AST for a piece of
code. We can create, compile and execute different TaskGraphs independently.
Statements such as the assignment a = a + c in line 11 of Fig. 1 make use of
C++ operator overloading to add nodes (in this case an assignment statement)
to a TaskGraph. Figure 1 illustrates this by showing a graphical representation
of the complete AST which was created by the adjacent code. Note that the
variable c has static binding-time for this TaskGraph. Consequently, the AST
contains its value rather than a variable reference.
The taskgraph( T ){...} construct (see line 7 in Fig. 3) determines which
AST the statements in a block are attached to. This is necessary in order to
facilitate independent construction of different TaskGraphs.
TaskGraph Parameters. Both Fig. 1 (line 10) and Fig. 3 (lines 9 and 10) illustrate
that any TaskGraph variable can be declared to be a TaskGraph parameter using
the tParameter() construct. We require the application programmer to ensure
that TaskGraph parameters bound at execution time do not alias each other.
Control Flow Nodes. Inside a TaskGraph construction block, for loops and
if conditionals are executed at construction time. Therefore, the for loops on
lines 20 and 21 in Fig. 3 result in an unrolled inner loop. However, the TaskGraph
sub-language defines some constructs for adding control-flow nodes to an AST:
tFor(var,lower,upper) adds a loop node (see lines 15 and 16 in Fig. 3). The
loop bounds are inclusive. tIf() can be used to add a conditional node to the
AST. The TaskGraph embedded sublanguage also includes tWhile, tBreak and
tContinue. Function call nodes can be built, representing execution-time calls
to functions defined in the host C++ program.
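For illustration only, a small, hypothetical TaskGraph using these control-flow constructs; the bracketed block syntax for tIf is an assumption, since only tFor blocks are shown in this excerpt:

  // Illustrative sketch, not from the chapter: a dynamic loop with a
  // conditional node. tFor bounds are inclusive, as stated above.
  TaskGraph T;
  taskgraph( T ) {
    tParameter( tVar( int, n ) );
    tVar( int, i );
    tVar( int, sum );
    sum = 0;
    tFor( i, 0, 9 ) {        // adds a loop node for i = 0,...,9 to the AST
      tIf( i < n ) {         // adds a conditional node (syntax assumed)
        sum = sum + i;
      }
    }
  }
  T.compile();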
taskgraph( T ) {
  unsigned int dims[] = { IMGSZ * IMGSZ };
  tParameter( tArrayFromList( float, tgimg, 1, dims ) );
  tParameter( tArrayFromList( float, new_tgimg, 1, dims ) );
  tVar( int, i );
  tVar( int, j );

  // Loop iterating over image
  tFor( i, c_half, IMGSZ - (c_half + 1) ) {
    tFor( j, c_half, IMGSZ - (c_half + 1) ) {
      new_tgimg[ i * IMGSZ + j ] = 0.0;

      // Loop to apply convolution matrix
      for( ci = -c_half; ci <= c_half; ++ci ) {
        for( cj = -c_half; cj <= c_half; ++cj ) {
          new_tgimg[ i * IMGSZ + j ] +=
            tgimg[ (i + ci) * IMGSZ + j + cj ] * matrix[ (c_half + ci) * CSZ + c_half + cj ];
        }
      }
    }
  }
}
}
Fig. 5. Binding-Time Derivation. TaskGraph construction code for the image filtering
example from Fig. 2, with all dynamic variables marked by a boxed outline.
Fig. 6. The code on the top left is the standard C++ matrix multiply (ijk loop or-
der) code. The code on the top right constructs a TaskGraph for the standard ijk
matrix multiply loop. The code underneath shows an example of using the TaskGraph
representation for the ijk matrix multiply kernel, together with SUIF-1 passes for in-
terchanging and tiling loops to search for the optimal tilesize of the interchanged and
tiled kernel for a particular architecture and problem size.
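The code of Fig. 6 itself is not reproduced in this excerpt. As a rough, assumed sketch (based only on the constructs shown in Figs. 1 and 5, not on the actual Fig. 6), the TaskGraph construction for the ijk loop might look as follows:

  // Rough sketch only: TaskGraph for the standard ijk matrix multiply on
  // N x N matrices stored as 1-D arrays. N is assumed to be known at
  // TaskGraph construction time.
  TaskGraph MM;
  taskgraph( MM ) {
    unsigned int dims[] = { N * N };
    tParameter( tArrayFromList( float, A, 1, dims ) );
    tParameter( tArrayFromList( float, B, 1, dims ) );
    tParameter( tArrayFromList( float, C, 1, dims ) );
    tVar( int, i );  tVar( int, j );  tVar( int, k );
    tFor( i, 0, N - 1 ) {
      tFor( j, 0, N - 1 ) {
        C[ i * N + j ] = 0.0;
        tFor( k, 0, N - 1 ) {
          C[ i * N + j ] += A[ i * N + k ] * B[ k * N + j ];
        }
      }
    }
  }
  // Loop interchange and tiling would then be applied to MM via SUIF-1 passes
  // before MM.compile(), as described in the text (exact API not shown here).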
Figure 6 shows both the code for the standard C/C++ matrix multiply
loop (ijk loop order) and the code for constructing a TaskGraph representing
this loop, together with an example of how we can direct optimisations from
the application program: we can interchange the for-j and for-k loops before
compiling and executing the code. Further, we can perform loop tiling with a
runtime-selected tile size. This last application demonstrates in particular the
possibilities of using the TaskGraph library for domain-specific optimisation:
6 Related Work
In this section, we briefly discuss related work in the field of dynamic and multi-
stage code optimisation.
Language-Based Approaches
– Imperative
Tick-C or ’C [2], a superset of ANSI C, is a language for dynamic code gen-
eration. Like the TaskGraph library, ’C is explicit and imperative in nature;
however, a key difference in the underlying design is that ’C relies on a special
compiler (tcc). Dynamic code can be specified, composed and instantiated,
i.e. compiled, at runtime. The fact that ’C relies on a special compiler also
means that it is in some ways a more expressive and more powerful system
than the TaskGraph library. For example, ’C facilitates the construction of
dynamic function calls where the type and number of parameters is dynam-
ically determined. This is not possible in the TaskGraph library. Jak [10],
MetaML [11], MetaOCaml [12] and Template Haskell [13] are similar efforts,
all relying on changes to the host language’s syntax. Some of what we do
with the TaskGraph library can be done using template metaprogramming in
C++, which is used, for example, for loop fusion and temporary elimination
in the Blitz++ and POOMA libraries [14, 15].
[Plots not reproduced (presumably Fig. 7 and the plot belonging to Fig. 8): performance over the square root of the data size on the Athlon and Pentium 4-M machines, and the optimal tile size over the square root of the data size.]
Fig. 8. Optimal tile size on Athlon and Pentium 4-M processors, for each data point
from Fig. 7. These results are based on a straight-forward exhaustive search imple-
mented using the TaskGraph library’s runtime code restructuring capabilities (see code
in Fig. 6).
– Declarative
DyC [3,5] is a dynamic compilation system which specialised selected parts of
programs at runtime based on runtime information, such as values of certain
data structures. DyC relies on declarative user annotations to trigger spe-
cialisation. This means that a sophisticated binding-time analysis is required
which is both polyvariant (i.e. allowing specialisation of one piece of code
for different combinations of static and dynamic variables) and program-
point specific (i.e. allowing polyvariant specialisation to occur at arbitrary
program points). The result of BTA is a set of derived static variables in
addition to those variables which have been annotated as static. In order to
reduce runtime compilation time, DyC produces, at compile-time, a generat-
ing extension [6] for each specialisation point. This is effectively a dedicated
compiler which has been specialised to compile only the code which is being
dynamically optimised. This static pre-planning of dynamic optimisation is
referred to as staging.
Marlet et al. [16] present a proposal for making the specialisation process
itself more efficient. This is built using Tempo [17], an offline partial eval-
uator for C programs and also relies on an earlier proposal by Glück and
Jørgensen to extend two-level binding-time analysis to multiple levels [18],
i.e. to distinguish not just between dynamic and static variables but between
multiple stages. The main contribution of Marlet et al. is to show that multi-
level specialisation can be achieved more efficiently by repeated, incremental
application of a two-level specialiser.
Data-Flow Analysis. Our library performs runtime data flow analysis on loops
operating on arrays. A possible drawback with this solution could be high run-
time overheads. Sharma et al. present deferred data-flow analysis (DDFA) [19]
as a possible way of combining compile-time information with only limited run-
time analysis in order to get accurate results. This technique relies on compressing
the data flow information from regions of the control-flow graph into summary
functions, together with a runtime stitcher which selects the applicable summary
function and computes summary function compositions at runtime.
Transparent Dynamic Optimisation of Binaries. One category of work on dy-
namic optimisation which contrasts with ours comprises approaches which do not rely
on program source code but instead work in a transparent manner on running bi-
naries. Dynamo [20] is a transparent dynamic optimisation system, implemented
purely in software, which works on an executing stream of native instructions.
Dynamo interprets the instruction stream until a hot trace of instructions is
identified. This is then optimised, placed into a code cache and executed when
the starting-point is re-encountered. These techniques also perform runtime code
optimisation; however, as stated in Sec. 1, our objective is different: restructuring
optimisation of software components with respect to context at runtime.
Acknowledgements
This work was supported by the United Kingdom EPSRC-funded OSCAR project
(GR/R21486). We thank the referees for helpful and interesting comments.
References
1. Wilson, R.P., French, R.S., Wilson, C.S., Amarasinghe, S.P., Anderson, J.M.,
Tjiang, S.W.K., Liao, S.W., Tseng, C.W., Hall, M.W., Lam, M.S., Hennessy, J.L.:
SUIF: an infrastructure for research on parallelizing and optimizing compilers.
ACM SIGPLAN Notices 29 (1994) 31–37
2. Engler, D.R., Hsieh, W.C., Kaashoek, M.F.: ’C: a language for high-level, efficient,
and machine-independent dynamic code generation. In: POPL ’96: Principles of
Programming Languages. (1996) 131–144
3. Grant, B., Mock, M., Philipose, M., Chambers, C., Eggers, S.J.: DyC: An expressive
annotation-directed dynamic compiler for C. Theoretical Computer Science 248
(2000) 147–199
4. McCarthy, J.: History of LISP. In: The first ACM SIGPLAN Conference on History
of Programming Languages. Volume 13(8) of ACM SIGPLAN Notices. (1978) 217–
223
5. Grant, B., Philipose, M., Mock, M., Chambers, C., Eggers, S.J.: An evaluation
of staged run-time optimizations in DyC. In: PLDI ’99: Programming Language
Design and Implementation. (1999) 293–304
6. Jones, N.D.: Mix Ten Years Later. In: PEPM ’95: Partial Evaluation and Seman-
tics-Based Program Manipulation. (1995)
7. Intel Corporation: Integrated Performance Primitives for Intel Architecture. Ref-
erence Manual. Volume 2: Image and Video Processing. (2000–2001)
8. Intel Corporation: Intel Pentium 4 and Intel Xeon Processor Optimization Refer-
ence Manual. (1999–2002) Available via developer.intel.com.
9. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimizations of
software and the ATLAS project. Parallel Computing 27 (2001) 3–35
10. Batory, D., Lofaso, B., Smaragdakis, Y.: JTS: Tools for Implementing Domain-
Specific Languages. In: Fifth International Conference on Software Reuse, IEEE
Computer Society Press (1998) 143–153
11. Taha, W., Sheard, T.: MetaML and multi-stage programming with explicit anno-
tations. Theoretical Computer Science 248 (2000) 211–242
12. Taha, W.: A gentle introduction to multi-stage programming (2004) In this volume.
13. Sheard, T., Peyton-Jones, S.: Template meta-programming for Haskell. ACM SIG-
PLAN Notices 37 (2002) 60–75
14. Veldhuizen, T.L.: Arrays in Blitz++. In: ISCOPE’98: Proceedings of the 2nd Inter-
national Scientific Computing in Object-Oriented Parallel Environments. Number
1505 in LNCS, Springer-Verlag (1998) 223ff
15. Karmesin, S., Crotinger, J., Cummings, J., Haney, S., Humphrey, W.J., Reynders,
J., Smith, S., Williams, T.: Array design and expression evaluation in POOMA II.
In: ISCOPE’98: Proceedings of the 2nd International Scientific Computing in
Object-Oriented Parallel Environments. Number 1505 in LNCS (1998) 231–238
16. Marlet, R., Consel, C., Boinot, P.: Efficient incremental run-time specialization for
free. ACM SIGPLAN Notices 34 (1999) 281–292 Proceedings of PLDI’99.
17. Consel, C., Hornof, L., Marlet, R., Muller, G., Thibault, S., Volanschi, E.N.:
Tempo: Specializing systems applications and beyond. ACM Computing Surveys
30 (1998)
18. Glück, R., Jørgensen, J.: Fast binding-time analysis for multi-level specialization.
In: Perspectives of System Informatics. Number 1181 in LNCS (1996) 261–272
19. Sharma, S., Acharya, A., Saltz, J.: Deferred Data-Flow Analysis. Technical Report
TRCS98-38, University of California, Santa Barbara (1998)
20. Bala, V., Duesterwald, E., Banerjia, S.: Dynamo: A transparent dynamic optimiza-
tion system. In: PLDI ’00: Programming Language Design and Implementation.
(2000) 1–12
21. Fordham, P.: Transparent run-time cross-component loop fusion. MEng Thesis,
Department of Computing, Imperial College London (2002)
22. www.openmp.org: OpenMP C and C++ Application Program Interface, Version
2.0 (2002)
23. Liniker, P., Beckmann, O., Kelly, P.H.J.: Delayed evaluation self-optimising soft-
ware components as a programming model. In: Euro-Par 2002: Proceedings of the
8th International Euro-Par Conference. Number 2400 in LNCS (2002) 666–673
24. Subramanian, M.: A C++ library to manipulate parallel computation plans. Msc
thesis, Department of Computing, Imperial College London, U.K. (2001)
25. Lengauer, C.: Program optimization in the domain of high-performance parallelism
(2004) In this volume.
26. Veldhuizen, T.L.: C++ templates as partial evaluation. In: PEPM ’99: Partial
Evaluation and Semantic-Based Program Manipulation. (1999) 13–18
27. Czarnecki, K., O’Donnell, J., Striegnitz, J., Taha, W.: DSL Implementation in
MetaOCaml, Template Haskell, and C++ (2004) In this volume.
28. Visser, E.: Program Transformation with Stratego/XT: Rules, Strategies, Tools,
and Systems in Stratego/XT 0.9 (2004) In this volume.
Guaranteed Optimization
for Domain-Specific Programming
Todd L. Veldhuizen
1 Introduction
There are several competing strategies for providing domain-specific program-
ming environments: one can construct a wholly new language, extend an exist-
ing language, or work within an existing language. New languages are appealing
because they’re fun to design and allow radical departures in syntax and seman-
tics from existing languages; an example is logic-based program synthesis (e.g.,
[1]), in which purely declarative languages act as specifications for automatic
programming. For non-programmers, a simple language that does exactly what
they need can be less intimidating than general-purpose languages. However,
new languages have drawbacks, too:
– New languages are hard to get right — semantics is a surprisingly tricky
business.
– Users have to learn a new programming language, which can discourage
adoption.
– Compilers require ongoing support to keep up with changing operating sys-
tems and architectures. One-off languages requiring special compilers are
often research projects that founder when students graduate and professors
move on to other interests.
– You can’t use features of multiple DSLs in one source file. For example, there
currently exists a Fortran-like DSL that provides sparse arrays, and another
that provides interval arithmetic. However, if one wants both sparse arrays
and intervals, there is no compiler that supports both at once.
Structure of this paper. This is a paper in two parts. In the first we describe
hopes for general-purpose languages, which can be summarized as “pretty, fast,
safe.” This provides background for the second half of the paper in which we
propose technologies to make languages more general-purpose: guaranteed opti-
mization, which addresses some aspects of performance, and proof embeddings,
which address safety.
In this paper we are concerned mostly with the fast and safe aspects, the pretty
part — extensible syntax — having been studied in depth for decades, with
macro systems, extensible parsers and the like. The problem of how to provide
extensible languages with performance and safety, though, is a bit less well-
explored.
Fast. With current compilers there is often a tradeoff between the expressive-
ness of code and its performance: code written close to the machine model will
perform well, whereas code written at a high level of abstraction often performs
poorly. This loss in performance associated with writing high-level code is of-
ten called the abstraction penalty [4–6]. The abstraction penalty can arise both
from using a high level of abstraction, and also from syntax extensions (e.g.,
naive macro expansion). For this reason, compilers for fast extensible languages
need to minimize this abstraction penalty. Traditional compiler optimizations
are the obvious solution: disaggregation, virtual function elimination, and the
like can greatly reduce the abstraction penalty. One of the shortcomings of cur-
rent optimizing compilers is that they are unpredictable, with the result that
performance-tuning is a fickle and frustrating art rather than a science. In the
second half of this paper we propose a possible solution: optimizers that pro-
vide proven guarantees of what optimizations they will perform, thus making
performance more predictable.
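As a small illustration (a hedged sketch, not drawn from any particular benchmark): the two functions below compute the same dot product, but the second routes every operation through a thin wrapper class. A predictable optimizer should be obliged to compile both to the same code.

    // A thin abstraction over double: semantically free, but historically a
    // common source of abstraction penalty.
    struct Scalar {
        double v;
        explicit Scalar(double x) : v(x) {}
        Scalar operator*(Scalar o) const { return Scalar(v * o.v); }
        Scalar operator+(Scalar o) const { return Scalar(v + o.v); }
    };

    // Written close to the machine model.
    double dot_lowlevel(const double* a, const double* b, int n) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    }

    // Written against the abstraction; should cost nothing after optimization.
    double dot_abstract(const double* a, const double* b, int n) {
        Scalar s(0.0);
        for (int i = 0; i < n; ++i) s = s + Scalar(a[i]) * Scalar(b[i]);
        return s.v;
    }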
Beyond reducing the abstraction penalty, many problem areas admit domain-
specific optimizations which can be exploited to achieve high performance. For
example, operations on dense arrays admit an astonishing variety of performance
improvements (e.g., [7]). A fundamental issue is whether domain-specific opti-
mization is achieved by transformation or generation of code:
¹ “Some of the greatest advances in mathematics have been due to the invention of symbols, which it afterwards became necessary to explain; from the minus sign proceeded the whole theory of negative quantities.” – Aldous Huxley
optimizations. Staging has been used to great effect in C++ to provide high-
performance libraries for scientific computing, which we’ll discuss shortly. In
particular, staging and partial evaluation allow one to specialize code and
(more generally) do component generation, which can have an enormous
benefit to performance.
Safe. There are diverse (and firmly held!) ideas about what it means for a
program to be safe. The question is what makes programs safe enough, absolute
safety being unattainable in practice. The pragmatic approach is to recognize
the existence of a wide spectrum of safety levels, and that users should be free
to choose a level appropriate to their purpose, avoiding a “one size fits all”
mentality. At the low end of the spectrum are languages such as Matlab that
do almost no static checking, deferring even syntactic checking of functions until
they are invoked. At the other end is full-blown deductive verification that aims
to prove correctness with respect to a specification. To aim for universality, one
would like a single language capable of spanning this spectrum, and in particular
to allow different levels of safety for different aspects of a program. For instance,
it is common to check type-correctness statically, while deferring array bounds-
checking until run time; this clearly represents two different standards of safety.
Even program verification is typically applied only to some critical properties of
a program, rather than attempting to exhaustively verify every aspect. Thus it
is important to allow a mixture of safety levels within a single program.
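A small C++ sketch of such a mixture (illustrative only): the element type of the container is enforced at compile time by the type system, while the index is validated only when the program runs.

    #include <cstddef>
    #include <vector>

    // Static safety: only doubles can be stored or read back.
    // Dynamic safety: at() performs a run-time bounds check and throws
    // std::out_of_range on a bad index; operator[] would skip the check.
    double sample(const std::vector<double>& xs, std::size_t i) {
        return xs.at(i);
    }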
Safety checks may be dynamic, as in (say) Scheme type checks [24] or Eiffel
pre- and post-condition checking [25]. Dynamic checking can detect the presence
of bugs at run-time; static checking can prove the absence of some bugs at
compile-time. A notable difference between the two is that fully automatic static
checking must necessarily reject some correct programs, due to the undecidability
of most nontrivial safety conditions. Within static checking there is a wide range
of ambition, from opportunistic bug-finding as in LCLint [26] and Metal [27], to
lightweight verification of selected properties as in extended static checking [28]
or SLAM [29], to full-blown deductive verification (e.g., PVS [30], Z [31], VDM
[32]).
Systems that check programs using static analysis can be distinguished on
the style of analysis performed: whether it is flow-, path-, or context-sensitive,
for example. A recent trend has been to check static safety properties by shoe-
horning them into the type system, for example checking locks [33, 34], array
bounds [35], and security properties [36]. Type systems are generally flow- and
context-insensitive, whereas many of the static safety properties one would like
to check are not. Thus it’s not clear if type-based analyses are really the best
way to go, since the approximations one gets are probably so coarse as to be of
limited use in practice.
We can also distinguish between approaches in which safety checks are ex-
ternal to the program (as annotations, additional specifications, etc.) versus
approaches in which safety checks are part of the program code (for example,
run-time assertions, pre- and post-condition checking). Any artifact maintained
separately from the source code will tend to diverge from it, whether it be docu-
mentation, models or proofs. The “one-source” principle of software engineering
suggests that safety checks should be integrated with the source code rather than
separate from it. Some systems achieve this by placing annotations in the source
code (e.g., ESC/Java [37]). However, this approach does not appear to integrate
easily with staging; for example, can a stage produce customized safety checks
for a later stage, when such checks are embedded in comments? Comments are
not usually given staging semantics, so it’s unclear how this would work. Mak-
ing safety checks part of the language ensures that they will interact with other
language features (e.g., staging) in a sensible way.
Rather than relying solely on “external” tools such as model checkers or
verification systems, we’d like as much as possible to have safety checking inte-
grated with compilation. Why? First, for expedience: many checking tools rely
on the same analyses required by optimization (points-to, alias, congruence), so
it makes sense to combine these efforts. But our primary reason is that by inte-
grating safety checking with compilers, we can provide libraries with the ability
to perform their own static checks and emit customized diagnostics.
The approach we advocate is to define a general-purpose safety-checking sys-
tem which subsumes both type-checking and domain-specific safety checks. Thus
types are not given any special treatment, but rather treated the same as any
other “domain-specific” safety property. Ideally one would also have extensi-
ble type systems, since many abstractions for problem domains have their own
typing requirements. For example, deciding whether a tensor expression has a
meaningful interpretation requires a careful analysis of indices and ranks. In
scientific computing, dimension types (e.g., [38]) have been used to avoid mis-
takes such as assigning meters-per-second quantities to miles-per-hour variables.
Having type checking handled by a general-purpose safety-checking system will
likely open a way to extensible type systems.
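To make the dimension-type example concrete, here is a minimal sketch with hypothetical class names (real libraries in the spirit of [38] are considerably richer): quantities carry their dimension exponents in the type, so mismatched units surface as compile-time type errors.

    // Exponents of length (L) and time (T) are part of the type.
    template<int L, int T>
    struct Quantity {
        double value;
        explicit Quantity(double v) : value(v) {}
    };

    // Only like-dimensioned quantities may be added.
    template<int L, int T>
    Quantity<L, T> operator+(Quantity<L, T> a, Quantity<L, T> b) {
        return Quantity<L, T>(a.value + b.value);
    }

    typedef Quantity<1, 0>  Metres;
    typedef Quantity<1, -1> MetresPerSecond;

    // Metres d(3.0); MetresPerSecond v(2.0);
    // d + v;   // rejected at compile time: the dimensions do not match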
Yet another aspect of safety checking is whether it is fully automatic (static
analysis, model checking) or semi-automatic (e.g., proof assistants). There are
many intriguing uses for a compiler supporting semi-automatic checking. By this
we mean that a certain level of automated theorem proving takes place, but when
checking fails, users can provide supplemental proofs of a property in order to
proceed. This opens the way for libraries to issue proof obligations that must be
satisfied by users (for example, to remove bound checks on arrays). Interesting
safety properties are undecidable; this means any safety check must necessarily
reject some safe programs. Thus, proof obligations would let users go beyond
the limits of the compiler’s ability to automatically decide safety.
parameterized types: one can create template classes such as List<T>, where T is
a type parameter, and instantiate them to particular instances such as List<int> and
List<string>. (This idea was not new to C++ – a similar mechanism existed in
Ada, and of course parametric polymorphism is a near cousin.) This instantiation
involves duplicating the code and replacing the template parameters with their
argument values – similar to polyvariant specialization in partial evaluation (cf.
[39]). In the development of templates it became clear that allowing dependent
types such as Vector<3> would be useful (in C++ terminology, non-type
template parameters); to type-check such classes it became necessary to evaluate
expressions inside the angle brackets, so that Vector<1 + 2> is understood to be the
same as Vector<3>. The addition of this evaluation step turned C++ into a staged
language: arbitrary computations could be encoded as template expressions, which are
guaranteed to be evaluated at compile-time. This capability was the basis of
template metaprogramming and expression templates. A taste for these tech-
niques is given by this definition of a function pow to calculate xⁿ (attribution
unknown):
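A typical formulation of such a compile-time pow (given here as a sketch, since the attribution of the original is unknown) computes the exponent by recursive template instantiation:

    // pow<3>(x) unfolds at compile time to x * x * x * 1.0: the exponent is a
    // non-type template parameter, so the recursion is carried out by the compiler.
    template<int N>
    struct meta_pow {
        static double eval(double x) { return x * meta_pow<N - 1>::eval(x); }
    };

    template<>
    struct meta_pow<0> {
        static double eval(double) { return 1.0; }
    };

    template<int N>
    inline double pow(double x) { return meta_pow<N>::eval(x); }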
Template techniques of this kind are further used to check code
with respect to physical units (cf. SIUnits [46]); also, MTL and Blitz check conformance of matri-
ces and arrays at compile-time. And as a more general example, we point to the
ctassert<> template [47] which provides a compile-time analogue of dynamic
assert() statements.
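A minimal sketch of the idea behind such a compile-time assertion (our illustration, not the published ctassert<> code):

    // Only the true specialization is defined, so instantiating the template with
    // a false condition is a compile-time error: a static analogue of assert().
    template<bool Condition> struct ctassert;        // intentionally left undefined
    template<> struct ctassert<true> { };

    // Usage: this declaration compiles only if the asserted property holds.
    ctassert<(sizeof(long) >= 4)> long_has_at_least_32_bits;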
C++ was not intended to provide these features; they are largely serendip-
itous, the result of a flexible, general-purpose language design. Even now – a
good decade after the basic features of C++ were laid down – people are still
discovering novel (and useful!) techniques to accomplish things that were pre-
viously believed impossible; just recently, for instance, it was discovered that
compile-time reflection of arbitrary type properties is possible [48]. The lesson
to be taken is that programming languages can
have emergent properties; we mean “emergent properties” in the vague sense of
surprising capabilities whose existence could not be foretold from the basic rules
of the language. And here we must acknowledge Scheme, a small language that
has proven amazingly capable due to such emergent properties. Such languages
have an exuberance that makes them both fun and powerful; unanticipated fea-
tures are bursting out all over! That emergent properties have proven so useful
suggests that we try to foster languages with such potential. In the second half
of this paper we describe our efforts in this direction.
A starting point is the goal that optimizing compilers should undo transfor-
mations we might apply to programs to make them “more abstract” for software-
engineering purposes, for example replacing “1 + 2” with
x = new Integer(1);
y = new Integer(2);
x.plus(y).intValue();
Not that anyone would write that, exactly, but similar code is routinely written
for encapsulation and modularity. We can represent such transformations by
rewrite rules. Some trivial examples of rewrites are:
R1. x → x + 0
R2. x → car(cons(x, y))
R3. x → if true then x else y
. . .
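For instance, applied to a C++ function that returns the constant 3 (an illustrative sketch; the rules themselves are language-neutral), a chain of de-optimizations might look like this:

    int p0() { return 3; }                    // the original program
    int p1() { return 3 + 0; }                // p0 after rule R1
    int p2() { return true ? (3 + 0) : -1; }  // p1 after a C++ rendering of rule R3
    // The guarantee developed below: the optimizer maps p0, p1 and p2 to the
    // same optimized program.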
These rules are, of course, completely backward from the usual approach: we
work with rules that de-optimize a program. The obvious question is: why not
devise an optimizer by reversing the direction of the arrows, using (say) Knuth-
Bendix completion? The answer is that we work with an infinite number of such
rules, some of which are conditional; thus we violate two requirements of Knuth-
Bendix (which applies to a finite number of unconditional axioms). Moreover,
reversing the direction of the arrows can turn unconditional rewrites into condi-
tional rewrites with undecidable conditions; and sensible approximations to the
conditions require global analysis.
It turns out that the unusual approach of considering de-optimizing rules
leads to a usable proof technique: we can prove that certain compilers undo any
sequence of rule applications in a single application of the optimizer, yielding a
program that is “minimally abstract.” Figure 1 illustrates.
In the next section we sketch the proof technique to give a taste for the
methodology and results; for a detailed description see [49, 50].
p --(1)--> analysis equations --(2)--> solution --(3)--> transformed program
where the numbers (1), (2), (3) refer to the steps above. The essence of the
proof technique is to consider a de-optimizing rewrite p → p′, and compare what
happens to both p and p′ in each step of the optimizer:
p  --(1)--> analysis equations --(2)--> solution --(3)--> transformed program
     | rewrite
     v
p′ --(1)--> analysis equations --(2)--> solution --(3)--> transformed program
Here [ ] denotes a hole; the context is thus the portion of the program unchanged by
the rewrite.
For each rewrite rule R1, R2, . . . one proves a lemma, a simplified example of
which we give here for rule R1:

Lemma 1. If p → p′ by rule R1, then (i) the analysis equations for p′ have the
same solution as the equations of p at program points in the rewrite context,
and (ii) the transformed version of p′ is the same as the transformed version of
p.
One proves a lemma of the above form for each rewrite considered. The effect
of a rewrite on the fixpoint solution of a system of equations is reasoned about
using the theory of fixpoint-preserving transformations and bisimilarity [53, 54].
We use these techniques to prove that analysis equations for program points in
the context of p are bisimilar to those in p′, and thus the solution is unchanged
for analysis variables in the context; and we prove enough about the solution to
the p′ equations to show that the additional code introduced by the rewrite is
removed by the transformation step.
Proving these lemmas requires adjusting the design of the analysis and trans-
formation so that the proof can succeed; thus, the proof technique is really a
design technique for optimizing compilers. Once the lemmas are established, we
can show that any sequence of rewrites is undone by the optimizer by a straight-
forward induction over rewrites, which we illustrate by stacking the above di-
agrams (Figure 2). Thus for any sequence of rewrites p → p′ → p″ → . . .,
we have that the transformed versions of p, p′, p″, . . . are all equal. Writing
O : Program → Program for the optimizer, a guaranteed optimization proof
culminates in a theorem of the form:
Theorem 1 (Guaranteed Optimization). If p →* p′, then Op = Op′.
This theorem simply states that any sequence of “de-optimizing” rewrites is
undone by the optimizer, and it has some interesting implications. Recall
that the kernel of O is ker O = {(p1, p2) | Op1 = Op2}. If the optimizer is sound,
then (p1, p2) ∈ ker O implies that p1 and p2 are behaviourally equivalent. Thus,
one can view the optimizer O as computing a normal form of programs with
respect to a decidable subtheory of behavioural equivalence.
This notion of the kernel of an optimizer is generally useful, since it captures
what we might call the “staging power” of a compiler; if OA and OB are two
optimizing compilers and ker OA ⊆ ker OB, we can conclude that OB is a more
powerful optimizer.
Another useful property one can prove with guaranteed optimization is the
following: if one defines an abstraction level AL(p) of a program p as the length
of the longest chain p0 → . . . → p, then by imposing a few further requirements
on the optimizer one can attain AL(Op) = 0. Such optimizers find a minimal
program with respect to the metric AL. Thus with appropriate “de-optimizing”
rules →, guaranteed optimization can address the abstraction penalty: opti-
mized programs are guaranteed to be “minimally abstract” with respect to the
rewrites →.
Fig. 2. Stacking the diagrams:
p   --(1)--> analysis equations --(2)--> solution --(3)--> transformed program
     | rewrite
p′  --(1)--> analysis equations --(2)--> solution --(3)--> transformed program
     | rewrite
     . . .
Now, a good optimizer should clearly perform some of these evaluation steps at
compile time by constant folding. We can use Guaranteed Optimization to design
compilers that fully reduce a program with respect to the evaluation relation ⇝. Suppose the compiler
guarantees that applications of the “de-optimizing” rule R3 (x → if true then x else y) are undone.
Clearly the right-hand side of rule R3 matches the left-hand side of rule E1; so
the optimizer will fully reduce a program with respect to E1. By making the set
of “de-optimizing” rewrites → big enough, in principle one can prove that:
⇝  ⊆  ↔*                                   (1)
(evaluation relation)   (guaranteed optimization)
This becomes interesting and useful when the evaluation relation encompasses
a Turing-complete subset of the language; then we have functionality closely
related to that of staging. In staging (e.g. [21, 22]; see also the chapter by Taha,
this volume), one has evaluation relations such as:
x + ˜(1 + 2)  ⇝  x + 3
With guaranteed optimization there is less of a syntactic burden on users, who no longer have to escape and
defer obvious things; in this respect it is more like partial evaluation.
To summarize, guaranteed optimization can reduce the abstraction penalty;
its staging-like capabilities can be used to remove code introduced by syntax
extensions and to perform domain-specific optimizations.
Writing ∼ for behavioural equivalence of programs, soundness of the optimizer O means that

ker O ⊆ ∼                                   (2)
Thus one can view program optimizers as deciding a weaker behavioural equiv-
alence on programs. Guaranteed optimization is, in essence, a proof that ker O
satisfies certain closure properties related to the de-optimizing rewrites. From
Theorem 1 (Guaranteed Optimization) we have p →* p′ ⇒ Op = Op′. This
implies ↔* ⊆ ker O; together with soundness of O, we have:

↔* ⊆ ker O ⊆ ∼                              (3)
is an effective decision procedure for the theory generated by these axioms (or,
usually, a sound superset of the axioms). In related work [57] we describe a corre-
spondence between the problem of how to effectively combine optimizing passes
in a compiler, and how to combine decision procedures in an automated theo-
rem prover. Rewrite-based or pessimistic optimizers can decide combinations of
inductively defined theories in a manner similar to the Nelson-Oppen method
of combining decision procedures [58]. On the other hand, optimizers based on
optimistic superanalysis, of which guaranteed optimization is an example, can
decide combinations of (more powerful) coinductively defined theories such as
bisimulation.
This correspondence suggests using the compiler as a theorem prover, since
the optimizer can prove properties of run-time values and behaviour. The opti-
mizer fulfills a role similar to that of simplification engines in theorem provers:
it can decide simple theorems on its own, for example x + 3 = 1 + 2 + x and
car(cons(x, y)) = x. A number of interesting research directions are suggested:
– By including a language primitive check(·) that fails at compile-time if its
argument is not provably true, one can provide a simple but crude version of
domain-specific static checking that would, in principle and assuming Eqn.
(1), be at least as good as static checks implementable with staging (a rough
sketch follows this list).
– In principle, one can embed a proof system in the language by encoding
proof trees as objects, and rules as functions, such that a proof object is
constructible in the language only if the corresponding proof is. Such proofs
can be checked by the optimizer; and deductions made by the optimizer (such
as x + y = y + x) can be used as premises for proofs. This is similar in spirit
to the Curry-Howard isomorphism (e.g. [59]), although we embed proofs
in values rather than types; and to proof-carrying code [60], although our
approach is to intermingle proofs with the source code rather than having
them separate.
– Such proof embeddings would let domain-specific libraries require proof obli-
gations of users when automated static checks failed. For example, the ex-
pression check(x = y ∨ P.proves(equal(x, y))) would succeed only if x = y
were proven by the optimizer, or if a proof object P were provided that gave
a checkable proof of x = y.
– A reasonable test of a domain-specific safety checking system is whether
it is powerful enough to subsume the type system. That is, one regards
type checking as simply another kind of safety check to be implemented
by a “type system library,” an approach we’ve previously explored in [61].
Embedded type systems are attractive because they open a natural route to
user-extensible type systems, and hence domain-specific type systems.
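To give a flavour of the check(·) idea from the first item above, here is a rough C++ sketch under a strong restriction: templates can only discharge claims the compiler folds to constants, whereas the primitive envisioned here would also cover run-time facts the optimizer can prove (such as x + 3 = 1 + 2 + x).

    // Hypothetical check: a claim that folds to true compiles; a claim that
    // folds to false (or cannot be evaluated) is rejected at compile time.
    template<bool Claim> struct check;               // undefined for false claims
    template<> struct check<true> { };               // true claims compile

    template<int N>
    double* make_buffer() {
        check<(N > 0)> buffer_size_is_positive;      // library-issued static check
        (void)buffer_size_is_positive;
        return new double[N];
    }

    int main() {
        double* p = make_buffer<1 + 2>();            // fine: 1 + 2 > 0 folds to true
        // double* q = make_buffer<0>();             // would be rejected at compile time
        delete[] p;
        return 0;
    }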
4 A Summing Up
We have argued that making programming languages more general-purpose is a
useful research direction for domain-specific programming environments. In our
view, truly general-purpose languages should let libraries provide domain-specific
Staged languages (e.g., [21, 22]) are explicitly annotated with binding times,
and have the advantage of guaranteeing evaluation of static computations at
compile-time. The binding-time annotation is generally a congruent division,
which effectively prevents theorem-proving about dynamic values.
Partial evaluation (e.g., [23]) automatically evaluates parts of a program at
compile-time; in this respect it is closely related to guaranteed optimization.
Partial evaluation is not usually understood to include proven guarantees of
what will be statically evaluated; indeed, a lot of interesting research looks at
effective heuristics for deciding what to evaluate. General partial computation
[62] is an intriguing extension to partial evaluation in which theorem proving is
used to reason about dynamic values.
Guaranteed optimization is largely annotation-free, although one must intro-
duce some small annotations to control unfolding in the compile-time stage. It
provides proven guarantees of what optimizations it will perform, and has the
ability to prove theorems about run-time values.
Acknowledgments
We thank Andrew Lumsdaine and the referees for helpful discussions and com-
ments on drafts of this paper.
References
1. Fischer, B., Visser, E.: Retrofitting the AutoBayes program synthesis system with
concrete syntax (2004) In this volume.
2. Hudak, P.: Building domain-specific embedded languages. ACM Computing Sur-
veys 28 (1996) 196–196
3. Thomas, W.: Logic for computer science: The engineering challenge. Lecture Notes
in Computer Science 2000 (2001) 257–267
4. Stepanov, A.: Abstraction penalty benchmark (1994)
5. Robison, A.D.: The abstraction penalty for small objects in C++. In: POOMA’96:
The Parallel Object-Oriented Methods and Applications Conference. (1996) Santa
Fe, New Mexico.
42. Siek, J.G., Lumsdaine, A.: The Matrix Template Library: A generic programming
approach to high performance numerical linear algebra. In: International Sympo-
sium on Computing in Object-Oriented Parallel Environments. (1998)
43. Neubert, T.: Anwendung von generativen Programmiertechniken am Beispiel der
Matrixalgebra. Diplomarbeit (Master’s thesis), Technische Universität Chemnitz (1998)
44. McNamara, B., Smaragdakis, Y.: Static interfaces in C++. In: First Workshop on
C++ Template Programming, Erfurt, Germany. (2000)
45. Siek, J., Lumsdaine, A.: Concept checking: Binding parametric polymorphism in
C++. In: First Workshop on C++ Template Programming, Erfurt, Germany.
(2000)
46. Brown, W.E.: Applied template metaprogramming in SIUnits: The library of
unit-based computation. In: Second Workshop on C++ Template Programming.
(2001)
47. Horn, K.S.V.: Compile-time assertions in C++. C/C++ Users Journal (1997)
48. Järvi, J., Willcock, J., Hinnant, H., Lumsdaine, A.: Function overloading based on
arbitrary properties of types. C/C++ Users Journal 21 (2003) 25–32
49. Veldhuizen, T.L.: Active Libraries and Universal Languages. PhD thesis, Indiana
University Computer Science (2004) (forthcoming).
50. Veldhuizen, T.L., Lumsdaine, A.: Guaranteed optimization: Proving nullspace
properties of compilers. In: Proceedings of the 2002 Static Analysis Symposium
(SAS’02). Volume 2477 of Lecture Notes in Computer Science., Springer-Verlag
(2002) 263–277
51. Wegman, M.N., Zadeck, F.K.: Constant propagation with conditional branches.
ACM Transactions on Programming Languages and Systems 13 (1991) 181–210
52. Click, C., Cooper, K.D.: Combining analyses, combining optimizations. ACM
Transactions on Programming Languages and Systems 17 (1995) 181–196
53. Wei, J.: Correctness of fixpoint transformations. Theoretical Computer Science
129 (1994) 123–142
54. Courcelle, B., Kahn, G., Vuillemin, J.: Algorithmes d’équivalence et de réduction
à des expressions minimales dans une classe d’équations récursives simples. In
Loeckx, J., ed.: Automata, Languages and Programming. Volume 14 of Lecture
Notes in Computer Science., Springer Verlag (1974) 200–213
55. Milner, R.: Communication and Concurrency. International Series in Computer
Science. Prentice Hall (1989)
56. Rutten, J.J.M.M.: Universal coalgebra: a theory of systems. Theoretical Computer
Science 249 (2000) 3–80
57. Veldhuizen, T.L., Siek, J.G.: Combining optimizations, combining theories. Tech-
nical Report TR582, Indiana University Computer Science (2003)
58. Nelson, G., Oppen, D.C.: Simplification by cooperating decision procedures. ACM
Transactions on Programming Languages and Systems (TOPLAS) 1 (1979) 245–
257
59. Sørensen, M.H., Urzyczyn, P.: Lectures on the Curry-Howard isomorphism. Tech-
nical report TOPPS D-368, Univ. of Copenhagen (1998)
60. Necula, G.C.: Proof-carrying code. In: Proceedings of the 24th ACM Symposium
on Principles of Programming Languages, Paris, France (1997)
61. Veldhuizen, T.L.: Five compilation models for C++ templates. In: First Workshop
on C++ Template Programming, Erfurt, Germany. (2000)
62. Futamura, Y., Nogi, K.: Generalized partial computation. In Bjørner, D., Ershov,
A.P., Jones, N.D., eds.: Proceedings of the IFIP Workshop on Partial Evaluation
and Mixed Computation, North-Holland (1987)
Author Index