Algorithms with Julia

Optimization, Machine Learning, and Differential Equations Using the Julia Language

Clemens Heitzinger
Clemens Heitzinger
Center for Artificial Intelligence and Machine Learning (CAIML)
and Department of Mathematics and Geoinformation
Technische Universität Wien
Vienna, Austria
Mathematics Subject Classification: 65-XX, 34K28, 65Mxx, 65M06, 65M08, 65M60, 65Kxx, 65Yxx,
62M45, 68T05
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To M.S.
Foreword
Students of applied mathematics are often confronted with textbooks that either cover the mathematical principles and concepts of mathematical models or introduce the basic language structures of a programming language. Many authors fail to cover the underlying mathematical theory of models, which is crucial for understanding the applicability of models, and especially their limitations, to real-world problems. On the other hand, many textbooks and monographs fail to address the crucial step from the algorithmic formulation of an applied problem to the actual implementation and solution in the form of an executable program. This book brilliantly combines these two aspects using a high-level, open-source computer language and covers many continuum-model-based areas of the natural and social sciences, applied mathematics, and engineering. Julia is a high-level, high-performance, dynamic programming language which can be used to write any application in numerical analysis and computational science. Clemens Heitzinger has gone to great lengths to organize this book into sequences that make sense for the beginner as well as for the expert in one particular field.
The applied topics are carefully chosen, from the most relevant standard areas
like ordinary and partial differential equations and optimization to more recent
fields of interest like machine learning and neural networks. The chapters on
ordinary and partial differential equations include examples of how to use exist-
ing packages included in the Julia software. In the chapter about optimization
the methods for standard local optimization are nicely explained. However, this
book also contains a very relevant chapter about global optimization, including
methods such as simulated annealing and agent-based optimization algorithms.
All this is not something usually found in the same book. Again, the global op-
timization theory, as far as the general theory exists, is well presented and the
application examples (and, most importantly, the benchmark problems) are well
chosen. One chapter, concerned with what is currently perhaps the most relevant area, introduces practical problem solving in the field of machine learning. The author covers the basic approach of learning via artificial neural networks as well
as probabilistic methods based on Bayesian theory. Again, the topics and examples are well chosen, the underlying theory is well explained, and the solutions
of the chosen application problems are immediately implementable in Julia.
Clemens Heitzinger has been involved in some of the most impressive applications in engineering and the applied physical sciences, covering microelectronics, sensors, and biomedical applications. This book covers both the theoretical
and practical aspects of this part of modern science. The approach taken in this
book is novel in the sense that it goes into quite some detail in the theoretical
background, while at the same time being based on a modern computing platform. In a sense, this work serves the role of two books: it is not only a cookbook for "how to solve problems with Julia" but also a good introduction to the most relevant problems in continuum-model-based science and engineering. At the same time, it gives the novice in Julia programming a good introduction to using this high-level programming language. It can therefore be
used as a text for students in an advanced graduate level course as well as a mono-
graph by the researcher planning to solve actual problems by programming in
Julia.
Preface

Why computation? The middle of the last century marks the beginning of
a new era in the history of mathematics. Although calculators and computers
had been envisioned centuries before and mechanical calculators or calculating
machines were in widespread use already in the nineteenth century, only the in-
vention of purely electronic computers made it possible to perform calculations
on increasingly large scales. The reason is quite simple: mechanical and electro-
mechanical calculators (see Fig. 0.1) are severely limited by friction.
The advent of electronic computers and later the rise of the integrated cir-
cuit have resulted in portable devices of astounding computational power at tiny
power consumption (see Fig. 0.2). Computations that were unthinkable a few
decades ago can now be performed at low cost and at great speed. These devel-
opments in the physical realm have resulted in the birth of new mathematical
disciplines. Computer algebra, scientific computing, machine learning, artificial
intelligence, and related areas are concerned with solving abstract mathematical
problems as well as scientific and data-science problems correctly, precisely, and
efficiently.
Although vast computational power is available today, fundamental questions will always have to be answered. How should computations be structured? What are the advantages and disadvantages of various algorithms? How
accurate will the results be? How can we best take advantage of the computa-
tional resources available to us? These are fundamental questions that lead to
new and fascinating mathematical problems. In this sense, the invention of elec-
tronic computers has had and will have a twofold influence on mathematics:
computers are both an enabling technology and a source of new mathematical
problems.
There is no doubt that computers and mathematical algorithms have im-
pacted our lives in many ways. In many engineering disciplines, it has become
common to perform simulations for the rational design and the optimization
of all kinds of devices and processes. Simulations can be much cheaper than
performing many experiments, and they provide theoretical and quantitative insights. Examples are airplanes, combustion engines, antennas, and the construction of bridges and other buildings.

Fig. 0.2 Motorola 68000 CPU. (Photo by Pauli Rautakorpi, no changes, license CC BY 3.0, https://creativecommons.org/licenses/by/3.0/.)

Large-scale computations are also behind
search engines, financial services, and other data-intensive industries. There is
probably not a single hour in our daily lives in which we do not use a service or a device that has been made possible, or much improved, by computers and mathematical algorithms.
Using this book, you will learn a modern, general-purpose, and efficient pro-
gramming language, namely Julia, as well as some of the most important meth-
ods in optimization, machine learning, and differential equations and how they
work. These three fields, optimization, machine learning, and differential equa-
tions, have been chosen because they cover a wide range of computational tasks
in science, engineering, and industry.
Methods and algorithms in these areas will be discussed in sufficient detail to
arrive at a complete understanding. You will understand how the computational
approaches work starting from the basic mathematical theory. Important results
and proofs will be given in each chapter (and can be skipped on first reading).
Based on these foundations, you will be provided with the knowledge to imple-
ment the algorithms and your own variants. To this end, sample programs and
hints for implementation in Julia are provided.
The ultimate purpose of this book is to provide the reader with both a working knowledge of the Julia programming language and more than a
superficial understanding of modern topics in three important computational
fields. Using the algorithms and the sample codes for leading problems, you will
be able to translate the theory into working knowledge in order to solve your
scientific, engineering, mathematical, or industrial problems.
How is this book unique? This book strives to provide a modern, practical, and well-founded perspective on algorithms in optimization, machine learning, and differential equations. Hence there are two respects in which the present book differs from other books in this area.
First, the topics were selected with a modern view of computation in mind.
As mathematics and computation evolve, we are able to solve more and more
difficult problems. These advances are reflected in the material in this book. For
example, topics such as artificial neural networks, computational Bayesian esti-
mation, and partial differential equations are discussed, but numerically solving
systems of linear equations is not, since you will most likely not write your own
program to do so due to the availability of well-tested libraries (also immediately
available in Julia).
Optimization is of great value in almost all disciplines. Differential equations
are of great utility in many cases where fundamental relationships between the
known and unknown variables exist, for example in physics, chemistry, and
many engineering disciplines. Furthermore, machine learning in particular is
an area that has benefited from increases in computational power and available
memory and that is of utmost importance when large amounts of data are avail-
able, but fundamental relationships are unknown.
Second, the Julia language, a rather young language designed with scien-
tific and technical computing in mind, is used to implement the algorithms. Its
implementation includes a compiler and a type system that leads to fast com-
piled code. It builds on modern and general programming concepts so that it
is usable for many different purposes. It comes with linear-algebra algorithms,
sparse matrices, and a package system. It is open source and its syntax is easy
Second, the book teaches a modern programming language that is especially use-
ful in technical and scientific applications, while also providing high-level and
advanced programming concepts. Third, in addition to self-study, the book can
be used as a textbook for courses in these areas. By choosing the chapters of inter-
est, the course can be tailored to various needs. The exercises deepen the theory
and help practice translating the theory into useful programs.
Acknowledgments. Finally, it is my pleasure to acknowledge the interest and support of Klaus Stricker and his department. I would also like to acknowledge
the students in Vienna who helped improve the manuscript.
2 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Defining Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Argument Passing Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Multiple Return Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Functions as First-Class Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Anonymous Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6 Optional Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6 Control Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.1 Compound Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 Conditional Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3 Short-Circuit Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.4 Repeated Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.5 Exception Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.5.1 Built-in Exceptions and Defining Exceptions . . . . . . . . . . 109
6.5.2 Throwing and Catching Exceptions . . . . . . . . . . . . . . . . . . 109
6.5.3 Messages, Warnings, and Errors . . . . . . . . . . . . . . . . . . . . . 114
6.5.4 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.6 Tasks, Channels, and Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.7 Parallel Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.7.1 Starting Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.7.2 Data Movement and Processes . . . . . . . . . . . . . . . . . . . . . . 123
6.7.3 Parallel Loops and Parallel Mapping . . . . . . . . . . . . . . . . . 126
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7 Macros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.2 Macros in Common Lisp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.3 Macro Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.4 Two Examples: Repeating and Collecting . . . . . . . . . . . . . . . . . . . 139
7.5 Memoization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.6 Built-in Macros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.7 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
Part I
The Julia Language
Chapter 1
An Introduction to the Julia Language
generated code is often linked to the type system of the programming language. Therefore the type system and the programming language should be designed such that they support the compiler's task of generating fast code with minimal burden on the programmer.
In the past, programming languages were usually standardized and had sev-
eral implementations. Nowadays, the situation is different; many popular pro-
gramming languages are not standardized, and their single implementations
serve as their specifications. Therefore the choice of programming language of-
ten severely limits the choice of compiler. This means that the choices of pro-
gramming language and compiler are not independent, and one often has to
decide on a combination of both.
A final requirement or consideration is the availability of well-tested libraries
so that algorithmic and numerical wheels do not have to be reinvented.
Matlab is probably the best-known and most widely used programming language in scientific computing and engineering. It has gained this position mainly by being a more convenient and more productive alternative to Fortran. Matlab can be used interactively, many numerical algorithms are either built in or
available as packages, and plotting is easy. This is a large productivity gain com-
pared to writing a Fortran program from scratch or downloading and installing
libraries.
The programming language used in this book is Julia [2]. Julia is a high-
level, high-performance, and dynamic programming language that has been de-
veloped with scientific and technical computing in mind. It offers features that
make it very well suited for computing in science, engineering, and machine
learning in view of the requirements posed in Sect. 1.2, while some of the fea-
tures are unique for a programming language in this field. An overview of the
key features of the Julia language is given in the following.
Julia is open source and is distributed under the so-called MIT license. Apart from concerns about licensing costs, access to source code is essential for the re-
producibility of science. Reproducibility is one of the main principles of the sci-
entific method [10, 4] and hence also of artificial intelligence, machine learning,
scientific computing, and computational science [12].
Reproducibility is important whenever calculations are performed. By using
an open-source operating system and an open-source implementation of the pro-
gramming language, it is – at least in principle – possible to know precisely which
1.2.2 Compiler
Julia uses standard packages for numerical linear algebra such as BLAS (Basic Linear Algebra Subprograms), LAPACK (Linear Algebra Package), and SuiteSparse (a collection of sparse-matrix software). These libraries are standard
among other programming languages and software systems so that the perfor-
mance of many linear-algebra algorithms in Julia should be comparable if not
identical to the performance in these other languages.
1.2.4 Interactivity
The dichotomy between the Fortran and Lisp families of languages manifests
itself clearly in the question whether functions can be called interactively or not.
While Fortran programs follow a strict write-compile-execute cycle, Lisp sys-
tems allow the interactive execution of expressions as well as the interactive def-
inition and compilation of functions [11]. At the same time, they usually support
compilation, saving of memory-image files, and generation of binaries.
Interactivity has a large effect on productivity, especially in explorative prob-
lem solving and programming. It makes it possible to implement complicated
algorithms step by step and to immediately test them piece by piece. An interactive environment makes it possible to set up complicated test cases and keep them in
memory while defining new functions. Only the redefined functions need to be
compiled so that potentially long compilation times are avoided when develop-
ing programs interactively.
1.2.6 Interoperability
As mentioned in Sect. 1.2.3, Julia uses external libraries for numerical linear algebra. In addition to this built-in use of external libraries, external libraries can generally be called easily from Julia programs. External functions in C and Fortran shared libraries can be called without writing any wrapper code, and they
can even be called directly from Julia’s interactive prompt. Furthermore, the
PyCall package makes it possible to call Python code. Other operating-system
processes can also be invoked and managed from within Julia by using its shell-
like capabilities.
Vice versa, Julia itself can be built as a shared library so that users can call
their Julia functions from within their C or Fortran programs.
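As an illustrative sketch (not from the original text; it assumes a Unix-like system where the C standard library is already loaded into the process), a C function such as strlen can indeed be called without any wrapper code:

```julia
# Call the C library function strlen directly via ccall.
# The arguments are: the function name, the return type,
# a tuple of argument types, and the actual arguments.
len = ccall(:strlen, Csize_t, (Cstring,), "Hello, world!")
println(len)   # prints 13
```

The string is automatically converted to a NUL-terminated C string because of the Cstring argument type.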
Julia comes with a package system that is both easy to use and easy to con-
tribute to. It is straightforward to install packages and their dependencies, to
update them, and to remove them. Many packages in the package system build
on libraries written in other languages and provide interfaces to external func-
tionalities in a Julia-like style.
Julia comes with built-in functionality to run programs in parallel. The first
type of parallel execution is running on a single computer and harnessing the
power of the multiple cores of a CPU or of the multiple CPUs of a computer. The second type comprises clusters spanning multiple computers. For both types of paral-
lel execution, introspective features make adding, removing, and querying the
processes in a cluster straightforward.
Parallel function execution is achieved by just using the parallel version of a
mapping function or by a parallel for loop. For parallel algorithms that require
non-trivial communication, functions for moving data, for synchronization, for
scheduling, and for shared arrays are available as well.
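As a minimal sketch of this parallel-mapping style (not from the original text; it uses the standard Distributed library, and with no worker processes added, pmap simply runs on the current process):

```julia
using Distributed

# Worker processes could be added with addprocs(n); without any,
# pmap falls back to executing on the current process.
squares = pmap(x -> x^2, 1:5)
println(squares)   # prints [1, 4, 9, 16, 25]
```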
After installing Julia, you can start it by double-clicking the Julia icon (de-
pending on your system and installation) or by running the Julia executable
from the command line. If you just type
$ julia

at the command line, Julia starts, displays a banner, and prompts you for input with its own prompt julia>. You can quit the interactive session by typing exit() and pressing the return key or by typing control-d.
To run a Julia program saved in a file called program.jl non-interactively, type

$ julia program.jl
at the command line. If your program is supposed to take arguments from the
command line, you can simply pass them at the end of the command line and
they will be available in Julia in the global variable ARGS as an array of strings.

$ julia program.jl arg1 arg2 arg3
You can also pass code to be executed directly from the command line to Julia
by using the -e command-line option. Then the traditional example looks like this.

$ julia -e 'println("Hello, world!")'
Hello, world!
Note that the quotes around the Julia code depend on your command shell and
may differ from '.
When Julia code is executed via the -e command-line option, the arguments are stored in ARGS as well. We can try to retrieve the command-line arguments passed to Julia code like this.

$ julia -e 'ARGS' arg1 arg2 arg3
However, nothing is printed. Using -e, the expression is only evaluated, but the result is not printed. To print a value (followed by a newline), we can use the command-line option -E, which evaluates an expression and shows the result.

$ julia -E 'ARGS' arg1 arg2 arg3
["arg1", "arg2", "arg3"]
The output is the printed representation of an array containing the three strings
shown. To print each element of the ARGS array on a separate line, you can map the println function over the array stored in the variable ARGS.

$ julia -e 'map(println, ARGS)' arg1 arg2 arg3
arg1
arg2
arg3
Running julia --help from your command line gives an overview of the many options of the Julia executable. Command-line arguments for the Julia executable must be passed before the name of the file to be executed, as in this example.

$ julia --optimize program.jl arg1 arg2 arg3
Any Julia code you put into the file $HOME/.julia/config/startup.jl in your home directory will be executed every time Julia is started.
So far we have seen how to run your Julia programs from the command line.
Although this is the usual way in which Julia programs are run in production environments, it is only one way to run a Julia program. While developing programs,
interactive sessions connected to an editor are much preferred. The features of
interactive sessions are explained next.
Julia is often used interactively via its REPL. The abbreviation REPL is short for read-eval-print loop and has its roots in Lisp implementations. The three parts of the REPL are the following.
Read: An expression typed by the user is read. An error is raised if it is not syn-
tactically correct.
Eval: The expression is evaluated. The result is a value, unless an error was
raised.
Print: The value is printed or – if an error occurred – the error message is dis-
played. Finally, a new prompt is displayed and the loop is repeated.
This implies that each expression in Julia returns a value, just as in Lisp.
(Therefore it has been said about Lisp programmers that they know the value
of everything, but the (computational) cost of nothing.) There is no expression
that does not return a value. Displaying a value can, however, be suppressed by
appending a semicolon ; to the end of the input.
julia> "Hello, world!"
"Hello, world!"

julia> "Hello, everyone!";

julia> ans
"Hello, everyone!"
This example shows that the shortest hello-world program in Julia is just the string "Hello, world!". It also shows that the variable ans, short for answer, is bound to the value of the previous expression evaluated by the REPL, irrespective of whether it was printed or not. The variable ans is only bound in REPLs.
In addition to strings, numbers such as integers and floating-point numbers
also evaluate to themselves and can be entered as usual. Furthermore, it is pos-
sible to type additional underscores Ѫ in order to divide long numbers and make
them easier to read. The groups do not have to contain three digits.
julia> 1_000_000_000 * 0.000_000_001
1.0

julia> 1_2_3
123
You can load a Julia source file called "file.jl" and evaluate the expressions it contains using include("file.jl"). Since the function include works
recursively, this also makes it possible to split programs into various files. Dur-
ing development, however, smaller pieces of code are usually evaluated (see
Sect. 1.3.5). Larger Julia programs, on the other hand, such as packages whose
source code is distributed online, are usually installed and loaded as packages
much more easily than by working with single files (see Sect. 1.3.4).
When you are working in the Julia REPL, you can execute shell commands conveniently by switching to shell mode, which is entered by just typing a semicolon ;. Then the Julia prompt changes, and shell commands such as ls or ps can be executed. Typing backspace switches the prompt back to the usual Julia prompt.
You can save a lot of typing at the REPL by using autocompletion. If you press
the tab key, the symbol you started typing is completed, or – if the completion
is not unique – completions are suggested after pressing tab a second time. This
feature also works in shell mode, where it is convenient to complete names of
directories and files.
The REPL remembers the expressions it has evaluated previously. Expressions from previous interactive sessions are also stored in the file $HOME/.julia/logs/repl_history.jl. The straightforward way to access previous expressions is to use the up and down arrow keys. But you can also search
the history forwards and backwards with control-s and control-r, respectively.
Other keyboard commands analogous to the Emacs text editor are available as
well.
Similarly to shell mode, you can enter the help mode of the REPL by typing a question mark ? at the Julia prompt. The prompt changes, and you can enter
a string to search for. You can again use the tab key to complete the string you
typed as a symbol or to view possible completions. After pressing enter, the doc-
umentation is searched and you will be presented with the documentation for
the symbol you entered or further matches of the string you typed.
To search all documentation for a string, you can use the apropos function.

julia> apropos("Euler")
Base.MathConstants.gamma
Base.MathConstants.eulergamma
The @doc macro allows you to access the documentation string of any symbol and also to change it. Documentation is accessed by @doc(symbol) or more simply by @doc symbol (see Chap. 7 for more information about macros and how to use them).

julia> @doc @doc
Table 1.1 contains a list of functions and macros that are useful to inspect
the state of the Julia executable or to interact with the operating system. The
contexts in which several of these functions and macros are useful will become
clearer later, but they are collected here for reference.
The value of the constant VERSION is of type VersionNumber and can easily be compared to other values of the same type. This makes it possible to write programs that work with different versions of Julia.

julia> VERSION >= v"1.0"
true

julia> VERSION >= v"2.0"
false
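A small illustrative sketch (not from the original text) of guarding version-dependent code with such a comparison:

```julia
# Choose a code path depending on the running Julia version.
if VERSION >= v"1.0"
    println("running on Julia 1.0 or later")
else
    println("running on a pre-1.0 version of Julia")
end
```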
The plain way to handle packages is to use the functions provided by the Pkg package, which we must load first.

julia> import Pkg

The most important functions in this package are Pkg.add, Pkg.rm, and Pkg.update. For example, to install a package called CSV (for reading and writing comma-separated values), we can use the Pkg.add function.

julia> Pkg.add("CSV")
Another mode that can be entered from the REPL is the package mode for handling packages. It is entered by typing ] at the Julia prompt. The prompt changes to end in pkg>. Typing tab at the prompt shows a list of all commands available in package mode. For example, to install the CSV package, type add CSV at the package prompt; to remove it, type remove CSV; and to update all packages, type update at the package prompt.
If files in the current directory are of interest, the following function call is useful.
julia> IJulia.notebook(dir = pwd())

An alternative that has the same effect but saves some typing is the following pair of commands.

julia> using IJulia
julia> notebook()
Problems
1.2 Install an extension for dealing with Julia programs in your favorite text editor, or install the IJulia package.
References
1. American National Standards Institute (ANSI), Washington, DC, USA: Programming Lan-
guage Common Lisp, ANSI INCITS 226-1994 (R2004) (1994)
2. Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: The Julia programming language.
http://julialang.org
3. Durán, A., Pérez, M., Varona, J.: The misfortunes of a trio of mathematicians using com-
puter algebra systems. Can we trust in them? Notices of the AMS 61(10), 1249–1252 (2014)
4. Fisher, R.: The Design of Experiments. Oliver and Boyd, Edinburgh (1935)
5. Keene, S.: Object-Oriented Programming in Common Lisp: A Programmer’s Guide to CLOS.
Addison-Wesley Professional (1989)
6. The Mathlab Group, Laboratory for Computer Science, MIT, Cambridge, MA 02139: MAC-
SYMA Reference Manual, Version Nine, Second Printing (1977)
7. Maxima, a Computer Algebra System, version 5.43.0 (2019).
http://maxima.sourceforge.net
8. McCarthy, J.: Recursive functions of symbolic expressions and their computation by ma-
chine (part I). Comm. ACM 3, 184–195 (1960)
9. McCarthy, J.: LISP 1.5 Programmer’s Manual. The MIT Press (1962)
10. Popper, K.: Logik der Forschung. Zur Erkenntnistheorie der modernen Naturwissenschaft.
Verlag von Julius Springer, Wien (1935)
11. Sandewall, E.: Programming in an interactive environment: the “Lisp” experience. Com-
puting Surveys 10(1), 35–71 (1978)
12. Stodden, V., Borwein, J., Bailey, D.: Setting the default to reproducible in computational
science research. SIAM News 46(5), 4–6 (2013)
Chapter 2
Functions
Functions are one of the most important concepts in mathematics. The defini-
tion of a mathematical function 𝑓 ∶ 𝑋 → 𝑌, 𝑥 ↦ 𝑦 comprises three parts: the
domain 𝑋, i.e., the set where the function is defined; the codomain 𝑌, i.e., the
set that contains all function values 𝑦; and a rule 𝑥 ↦ 𝑦 describing how a unique
function value 𝑦 is assigned to each argument 𝑥 ∈ 𝑋.
The same three pieces of information are important when defining a function
in any programming language. The role of the domain is played by the types of
the arguments, the function values are calculated by the body of the function
definition, and the role of the codomain is played by the type of the calculated
value and can hopefully be inferred by the compiler. If it can be inferred, faster
code can be generated for the calling function that receives the output.
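As an illustrative sketch (not taken from the text), the three pieces of a mathematical function definition can be made explicit in Julia by annotating the argument and return types:

```julia
# The domain is Float64 (the argument type), the rule is the body,
# and the codomain is Float64 (the annotated return type).
square(x::Float64)::Float64 = x * x

println(square(3.0))   # prints 9.0
```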
The example we consider in this chapter is the definition of a function that calculates the 𝑛-th number 𝑥ₙ in the Fibonacci sequence defined by the recurrence relation 𝑥₀ = 0, 𝑥₁ = 1, and 𝑥ₙ = 𝑥ₙ₋₁ + 𝑥ₙ₋₂ for 𝑛 ≥ 2.
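The definition of fib1 itself appears to be missing from this extraction; judging from the discussion that follows (the starting values 0 and 1 and the operations <=, +, and -), it was presumably along these lines:

```julia
# If n is at most 1, n itself is the result (the starting values
# fib1(0) == 0 and fib1(1) == 1); otherwise the recurrence is used.
function fib1(n)
    if n <= 1
        n
    else
        fib1(n - 1) + fib1(n - 2)
    end
end
```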
The idea is to check if the argument ɪ is one of the starting values or not. If it is
not, then the recurrence relation is used. We note that in Julia everything is an
expression and hence returns a value, so that it is not necessary to use an explicit
ʝȕʲʼʝɪ statement here.
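The definition described here can be written with an if expression; the following is a sketch matching the description above, using the name fib1 that the rest of the section refers to.

```julia
# Recursive Fibonacci function; the base case returns the argument itself,
# since x_0 = 0 and x_1 = 1.
function fib1(n)
    if n <= 1
        n
    else
        fib1(n-1) + fib1(n-2)
    end
end
```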
An alternative but equivalent syntax for function definition is the following one-line form, which uses the ternary operator.

fib2(n) = (n <= 1) ? n : fib2(n-1) + fib2(n-2)

This example shows how short functions are often defined in Julia. Here the syntax

condition ? consequent : alternative

was used as an alternative for the if expression as well.
You can save this function definition in a file and load it into Julia or you can type it directly into the Julia repl. In the repl, Julia will answer with the following output.

fib1 (generic function with 1 method)
We note that this function definition does not capture two pieces of information that are part of a mathematical function definition: the domain and the codomain. We have neither specified the type of n nor the type of the possible function values 0, 1, and fib1(n-1) + fib1(n-2). This means that this function definition will work whenever the operations used in its definition, namely <=, +, and -, are defined for their arguments. For example, evaluating fib1(5//4) in the Julia repl yields -1//2; here 5//4 and -1//2 are rational numbers. We will learn all about the types of numbers available in Julia in Chap. 5. It is not obvious just by looking at the definitions of fib1 and fib2 which types can be used as arguments and whether Julia can generate efficient code or not.
2.1 Defining Functions 19
Therefore, we now use Julia's introspective features to find out more about the function we just defined. Julia told us after evaluating the function definition that we have defined a generic function with one method. The same information is obtained by typing fib1 into the repl, since functions evaluate to themselves in Julia. This means that functions in Julia are what are called generic functions in computer science. A generic function is a collection of (function) methods, where each method is responsible for a certain combination of argument types. For example, the generic function + comprises many methods.
julia> +
+ (generic function with 166 methods)
We can find the methods of a generic function using methods. It lists all the methods that were defined for various combinations of argument types and where they were defined.

julia> methods(fib1)
# 1 method for generic function "fib1":
[1] fib1(n) in Main at REPL[1]:2
If a new method is defined and a method already exists for this particular com-
bination of argument types, then the old method definition is superseded.
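A short sketch of this behavior (g is a throwaway name, not from the text):

```julia
g(n) = n + 1
g(n) = n + 2   # same combination of argument types: supersedes the first definition

println(g(0))                 # prints 2, not 1
println(length(methods(g)))  # still 1 method
```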
To find out more about the codomain of the function, we can query the types
of function values.
julia> typeof(fib1(0))
Int64
julia> typeof(0)
Int64
julia> typeof(fib1(1))
Int64
julia> typeof(1)
Int64
julia> typeof(fib1(2))
Int64
In the first two cases, the argument is simply returned. Therefore the type of the returned value is the same as the type of the argument. The last example implies that the type of the sum of two values of type Int64 is again Int64.
If you are using a 32-bit system, then integers are by default represented by the type Int32. The default integer type is called Int and it can be either Int32 or Int64 depending on your system. The variable Base.Sys.WORD_SIZE also indicates if Julia is running on a 32-bit or a 64-bit system.
We have just seen that literal integers such as 0 and 1 are parsed and then represented as values of type Int, which is an Int64 on this particular system. Int64 values are (positive or negative, i.e., signed) integers that can be stored within 64 bits. This is illustrated by the following calculations.
julia> 2^62
4611686018427387904
julia> (-2)^62
4611686018427387904
julia> 2^63
-9223372036854775808
julia> (-2)^63
-9223372036854775808
julia> 2^64
0
julia> (-2)^64
0
In addition to querying the type of a value using typeof, we can also query the minimum and maximum numbers a type can represent. This is informative when a type can only represent a finite number of values by design.
julia> typemin(Int64)
-9223372036854775808
julia> typemax(Int64)
9223372036854775807
julia> typemax(typeof(fib1(0)))
9223372036854775807
These bounds explain why 2^63 could not be calculated above as an Int64, but (−2)^63 (barely) could.
This implies that types such as Int, Int32, and Int64 can represent the mathematical structure of the ring (ℤ, +, ⋅) only if all operations during a calculation remain within the interval given by typemin(type) and typemax(type).
If this is not assured, then it is necessary to use the BigInt type, which can represent integers provided they fit into available memory and is thus a much better representation of the ring (ℤ, +, ⋅). To illustrate the efficiency of calculations using BigInts, we consider Mersenne prime numbers. The following function returns Mersenne prime numbers given an adequate exponent.

mersenne(n) = BigInt(2)^n - 1

The type of the return value is BigInt, since we base the calculation on 2 represented as a BigInt. In general, type(x) returns the representation of the value x in the type type; in other words, a new value converted to the type type is returned. It is instructive to inspect the BigInt type using methods(BigInt), typemin(BigInt), and typemax(BigInt).
The largest known Mersenne prime number to date is 2^82589933 − 1. The following interaction shows that it can be computed within a few thousandths of a second requiring less than 20 MB of memory, also showing that it has 24 862 048 digits.

digits(x) = floor(Integer, log10(x)+1)
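The interaction might look like the following sketch; the exact run time and memory figures depend on the machine, and mersenne and digits are the functions defined above.

```julia
mersenne(n) = BigInt(2)^n - 1             # as defined above
digits(x) = floor(Integer, log10(x) + 1)  # as defined above

m = @time mersenne(82_589_933)  # run time and allocations are printed here
println(digits(m))              # prints 24862048
```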
The @time macro yields the run time and the allocated memory. The @ character generally indicates macros; we will learn all about macros in Chap. 7.
Now we know how to define a function that can calculate arbitrarily large (only limited by the available memory) Fibonacci numbers. The next version of the Fibonacci function ensures that the codomain is the type BigInt.

function fib3(n)::BigInt
    if n <= 1
        n
    else
        fib3(n-1) + fib3(n-2)
    end
end
The syntax ::type after the argument list means that the return value will be converted to the specified type. Here the base case n <= 1 ensures that later on BigInts are added. If the return value cannot be converted to the specified type, then an error is raised. If ::type is not given, it is assumed to be ::Any. Every value in Julia is of type Any.
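A minimal sketch of this conversion behavior (f is a throwaway name):

```julia
f(x)::Int = x

println(f(3.0))   # prints 3; the Float64 value 3.0 is converted to Int
# f(3.5) would raise an InexactError, since 3.5 cannot be converted to Int
```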
We have seen how we can specify the codomain of our function. How can we
specify the domain of our function? We already know that a generic function con-
sists of methods. The various methods that constitute a generic function are re-
sponsible for different argument types. Whenever a function is called, the types
of the arguments are inspected and then the most specific matching method is
called; if none exists, an error is raised. The only method we have defined for the function fib3 is called for every argument type, since we did not specify any type for the argument n.
But this is not what we intend in the context of the Fibonacci sequence. It is unclear what fib1(5//4) or fib1(9.5) should be, although our implementation returns numbers in these cases. fib1("foo") clearly raises an error (only after trying to perform calculations), but fib1(5//4) and fib1(9.5) should raise errors as well.
How can we restrict the domain, i.e., how can we specify the types of the arguments of a method? The syntax is again argument::type. The domain of the next version of the Fibonacci function is the Integer type and the codomain is the BigInt type.

function fib4(n::Integer)::BigInt
    if n <= 1
        n
    else
        fib4(n-1) + fib4(n-2)
    end
end
The Integer type comprises BigInts and the finite integer types Int32 and Int64 among others. It is therefore the most natural domain for our function.
We can check whether a type is a subtype of another one using the subtype operator subtype <: supertype. The following example shows that the Integer type works as intended in the method definition above. It also shows that neither BigInt is a subtype of Int nor Int is a subtype of BigInt, confirming the usefulness of the Integer type. The Any type is a supertype of every type. An argument x without any specified type is equivalent to an argument x::Any, just as in the case of the return value.
julia> Int
Int64
julia> Int32 <: Integer
true
julia> Int64 <: Integer
true
julia> BigInt <: Integer
true
julia> BigInt <: Int
false
julia> Int <: BigInt
false
julia> Integer <: Any
true
To check whether a value has a certain type, the isa function, which also supports infix syntax, can be used.

julia> isa(0, Int64)
true
julia> 0 isa Int64
true
Calling the generic function fib4 with arguments that are not of type Integer results in an error explaining that there is no matching method. Sometimes it is useful to define a method that catches all other argument types. This is achieved by the following method.

fib4(x::Any) = error("Only defined for integer arguments.")
The general solution of the recurrence relation x_n = x_{n-1} + x_{n-2} has the form

x_n = c_1 y_1^n + c_2 y_2^n,

where y_1 = (1+√5)/2 and y_2 = (1−√5)/2 are the two roots of the characteristic equation y² = y + 1. The two starting values x_0 and x_1 determine the two constants c_1 and c_2 as c_1 := 1/√5 and c_2 := −1/√5. Therefore the Fibonacci sequence is given by

x_n = (1/√5) ((1+√5)/2)^n − (1/√5) ((1−√5)/2)^n
    = round((1/√5) ((1+√5)/2)^n)   for all n ∈ ℕ.

The last equality holds since y_2 ≈ −0.618 and |c_2| < 1/2.
The theory of difference equations hence leads to the next function definition.

fib5(n::Integer)::BigInt =
    round(BigInt, ((1+sqrt(5))/2)^n / sqrt(5))
The argument BigInt of round ensures not only that a BigInt is returned instead of a floating-point value, but also that no error is raised when the number to be rounded is large.
Although we can calculate Fibonacci numbers now very quickly, it unfortunately turns out that the return value of fib5(71) is not equal to 𝑥71 (while the preceding values are correct). The smallest example to demonstrate this deficiency is the following.

julia> @time fib5(69) + fib5(70) == fib5(71)
  0.000005 seconds (8 allocations: 168 bytes)
false
We can alleviate this limitation by using the BigFloat type and ensuring that all calculations are performed over this type. Then the BigFloat value is rounded to a BigInt value.

fib6(n::Integer)::BigInt =
    round(BigInt, ((1+sqrt(BigFloat(5)))/2)^n / sqrt(BigFloat(5)))
The following function checks the calculations. The output means that all Fibonacci numbers up to and including 𝑥358 are calculated correctly, while fib6(359) is not equal to 𝑥359.

function check_fib6(range)
    for i in range
        if fib6(i) + fib6(i+1) != fib6(i+2)
            print(i, " ")
        end
    end
end

julia> check_fib6(0:370)
357 362 366 367 368 369 370
We can again relate the number of bits used in this calculation to the number of decimal digits of 𝑥359. The following calculation shows that we can expect at most 78 decimal digits of precision, since BigFloats use 256 binary digits by default. (The precision of BigFloats can be changed using setprecision.) At the same time, representing fib6(359) requires 75 decimal digits. Therefore the calculation using the exponentiation is very precise in the sense that almost all digits of the result are correct.

julia> digits(BigInt(2)^256)
78
julia> digits(fib6(359))
75
The idea is to define a global variable that holds a dictionary containing the
previously calculated values. (Global variables are discussed in Chap. 3, and dic-
tionaries in Sect. 4.5.4.) The function checks if the function value has been calcu-
lated previously. If yes, it is simply returned; if not, the new value is calculated,
stored, and returned. This technique is known as memoization (see Sect. 7.5).
global fib_cache = Dict{BigInt, BigInt}(0 => 0, 1 => 1)

function fib7(n::Integer)
    global fib_cache
    if haskey(fib_cache, n)
        fib_cache[n]
    else
        fib_cache[n] = fib7(n-1) + fib7(n-2)
    end
end
This implementation can calculate 𝑥10000, which has 2090 digits, within a few thousandths of a second using about 6 MB of memory. We will see a more general approach to memoization in Sect. 7.5.
Additional properties of the Fibonacci sequence are useful to refine this ap-
proach. It can be shown that the equalities
x_{2n} = x_{n+1}^2 − x_{n-1}^2 = x_n (x_{n+1} + x_{n-1}),   (2.2)
x_{3n} = 2 x_n^3 + 3 x_n x_{n+1} x_{n-1} = 5 x_n^3 + 3 (−1)^n x_n,   (2.3)
x_{4n} = 4 x_n x_{n+1} (x_{n+1}^2 + 2 x_n^2) − 3 x_n^2 (x_n^2 + 2 x_{n+1}^2)   (2.4)

hold. The next version of the Fibonacci function uses the identity (2.4) to reduce the number of recursive calls.
function fib8(n::Integer)
    global fib_cache
    if haskey(fib_cache, n)
        fib_cache[n]
    else
        if mod(n, 4) == 0
            m = div(n, 4)
            fib_cache[n] = (4*fib8(m)*fib8(m+1)
                            *(fib8(m+1)^2 + 2*fib8(m)^2)
                            - 3*fib8(m)^2*(fib8(m)^2 + 2*fib8(m+1)^2))
        else
            # Fallback to the recurrence relation; this branch is a
            # reconstruction, since the original listing breaks off here.
            fib_cache[n] = fib8(n-1) + fib8(n-2)
        end
    end
end
In computer science, various ways to pass arguments from the caller to the called
function are known. The following approaches are commonly found in program-
ming languages.
Call by value: The arguments are evaluated and the resulting values are passed
to the function and bound to local variables. The passed values are often
copied into a new memory region. The function cannot make changes in
the scope of the caller, since it only receives a copy.
Call by reference: The function receives a reference to a variable used as the ar-
gument. Via this reference, the function can assign a new value to the vari-
able or modify it, and any changes are also seen by the caller of the function.
Call by sharing: If the values in a language are objects (carrying type information in contrast to primitive types), then call by sharing is possible. In call by
sharing, function arguments act as new variable bindings and assignments
to function arguments are not visible to the caller. No copies of the argu-
ments are made, however, and the values the new variable bindings refer to
are identical to the passed values. Therefore changes to a mutable object are
seen by the caller.
Call by value provides additional safety, since it is impossible for the called
function to effect any changes outside of its own scope. On the other hand, it
is inefficient to copy all arguments, especially when many small functions are
defined or the arguments occupy large memory regions such as large arrays.
Call by reference is more efficient, but a function may effect changes outside
of its scope, making reasoning about program behavior much more difficult and
possibly leading to subtle bugs. Call by reference is the most unsafe way of passing arguments.
Julia uses call by sharing. Assignments to function arguments only affect the
scope of the function. Still, mutable objects (such as the elements of vectors or
arrays) can be changed and these changes persist and are seen by the caller. Call
by sharing is a reasonable compromise between memory safety and efficiency
and it is found in other dynamic languages such as Lisp, Scheme, and Python.
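The difference between assigning to an argument and mutating the object it refers to can be sketched as follows (the function names are illustrative):

```julia
function rebind(v)
    v = [0, 0, 0]  # a new binding for the local variable v; the caller is unaffected
end

function mutate!(v)
    v[1] = 99      # mutates the shared object; the caller sees the change
end

a = [1, 2, 3]
rebind(a)
println(a)   # prints [1, 2, 3]
mutate!(a)
println(a)   # prints [99, 2, 3]
```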
Julia does not provide multiple return values per se, but uses tuples of values
to the same effect. Since tuples can be created and destructured also without
parentheses, the illusion of returning and receiving multiple values is created by
leaving out the parentheses.
Tuples can always be created with parentheses and in many circumstances
without parentheses. The same holds true when tuples are destructured in as-
signments. If there are fewer variables in the tuple on the left side than elements
in the tuple on the right side of the assignment, then only the first elements on
the right side are used. Conversely, if there are more variables in the tuple on the
left side than elements on the right side, an error is raised.
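These rules can be sketched with a small function that returns a tuple (extrema_of is a hypothetical helper, not from the text):

```julia
extrema_of(v) = minimum(v), maximum(v)  # returns the tuple (min, max); no parentheses needed

lo, hi = extrema_of([3, 1, 4, 1, 5])    # destructuring without parentheses
println((lo, hi))                       # prints (1, 5)

first_only, = extrema_of([3, 1, 4, 1, 5])  # fewer variables: only the first element is used
println(first_only)                     # prints 1

# lo, hi, extra = extrema_of([3, 1, 4, 1, 5]) would raise an error:
# there are more variables than elements on the right side.
```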
A tuple with a single element is created using the syntax (element,) so that it can be distinguished from parentheses that have no effect around an expression.

julia> (0,)
(0,)
julia> c = foo(1, 2, 3, 4)
(-5, 10)
julia> (c1, c2) = foo(1, 2, 3, 4)
(-5, 10)
julia> c, c1, c2
((-5, 10), -5, 10)
Functions are first-class objects in Julia, and the type of each function is a subtype of the type Function. This means that functions can be assigned and passed as arguments just as any other data type.

julia> +
+ (generic function with 166 methods)
julia> +(0, 1)
1
julia> isa(+, Function)
true
julia> foo = *
* (generic function with 357 methods)
julia> foo(0, 1)
0
The first element of the tuple is the estimated value of the integral, and the sec-
ond an estimated upper bound for the absolute error.
The identity function is called identity in Julia. Additionally, there are syntactic expressions in Julia, listed in Table 2.1, which are translated into function calls, but the names of the functions are not obvious.
Finally, the expression arg |> fun is the same as fun(arg). It makes it possible to reverse the order of function and argument and is easier to read in certain situations.
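For instance, a pipeline of calls reads in the order of application:

```julia
# Nested calls read inside out:
println(sqrt(sum([1, 2, 3])))      # prints 2.449489742783178

# The same computation with |> reads left to right:
println([1, 2, 3] |> sum |> sqrt)  # prints 2.449489742783178
```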
julia> x -> 2x
#3 (generic function with 1 method)
julia> isa(x -> 2x, Function)
true
julia> (x, y) -> 2x*y
#7 (generic function with 1 method)
julia> function (x)
           2x
       end
#9 (generic function with 1 method)

The map function applies a given function elementwise to one or more collections, as in this example.

julia> map(+, [10, 20, 30], [4, 5, 6])
3-element Array{Int64,1}:
 14
 25
 36
There are variants of the map function, namely map!, mapfoldl, mapfoldr, mapreduce, and mapslices. The function map! stores the result in its second argument, i.e., the first sequence argument (see Sect. 2.2). It follows the convention that functions with names that end in ! modify their arguments. This convention stems from the programming language Scheme.
The function reduce is another mainstay of functional programming. It takes an associative function as its first argument, a collection as its second, and an initial value as the (optional) keyword argument init. The initial value should be the neutral element for applying the function to an empty collection. reduce applies the function to two values from the collection (except for the initial value) repeatedly until the collection has been reduced to a single value. The functions foldl and foldr are similar, but guarantee left or right associativity.
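A short sketch of reduce, foldl, and foldr; the values in the comments follow from left- and right-associativity:

```julia
println(reduce(+, [1, 2, 3, 4]))     # prints 10
println(reduce(+, Int[]; init = 0))  # prints 0; init is the neutral element of +
println(foldl(-, [1, 2, 3]))         # prints -4, i.e. (1 - 2) - 3
println(foldr(-, [1, 2, 3]))         # prints 2, i.e. 1 - (2 - 3)
```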
The function mapreduce maps and reduces. It maps its first argument (a function) over the sequence given as the third argument and then reduces the result using the second argument (again a function) with the initial value (optionally) given as the keyword argument init. mapreduce is more efficient than using map and reduce, since the intermediary sequence is not stored. The example calculates the ℓp norm of a vector, where an anonymous function is used for x ↦ |x|^p.

norm(v::Vector, p::Number) = mapreduce(x -> abs(x)^p, +, v)^(1/p)
Arguments often have sensible default values. For example, the ℓ2 norm is the most popular among the ℓp norms. In these cases, it is convenient to declare these arguments as optional arguments. Optional arguments do not have to be specified in the argument list when the function is called.

norm(v::Vector, p::Number = 2) =
    mapreduce(x -> abs(x)^p, +, v)^(1/p)
The following definition of the Newton method illustrates keyword arguments.

# The signature below is reconstructed to match the calls shown later;
# the default values of tol and max_iterations are assumptions.
function newton(f, df, x0; typ = Float64, tol = 1e-10,
                max_iterations::Int = 100)
    @assert tol > 0
    local x = typ(x0)
    local i = 1
    while abs(f(x)) >= tol && i <= max_iterations
        x += -f(x)/df(x)
        @show i, x, f(x)
        i += 1
    end
    x
end
If you are not interested in printing the progress of the calculation, you can add a comment character # in front of the call of the @show macro. In this line, a tuple containing the three values i, x, and f(x) is created and passed to the @show macro. Also note that type is best avoided as an identifier in Julia (it was a reserved word in older versions), so that we use the name typ for the first keyword argument instead.
The first argument is the function whose zero is sought starting from the point specified as the third argument. In this implementation we require the derivative, a function, to be passed as the second argument. The typ keyword argument specifies the type of the result by converting the starting value to this type. The tol keyword argument specifies how large the absolute value of the function value at the final point may be at most. The last keyword argument specifies the maximum number of iterations calculated and ensures that the function always returns.
We note that it is possible to specify the type of keyword and optional arguments, as we have done here for the last keyword argument.
The @assert macro takes an expression as its argument and evaluates it. If the value is true, the function continues; if it is false, an error is raised. Such assertions are commonly used to check that the input is valid.
Two syntactic options are valid when calling the function. The semicolon that separates the positional arguments and the keyword arguments in the function definition is often not required when calling the function and can then be replaced by a comma.
The first example illustrates that the type of the result is given by the typ keyword argument. Here the requested tolerance exceeds the precision of the floating-point type Float32 so that the function returns after the maximum number of iterations.
julia> newton(sin, cos, 3.0, typ = Float32, tol = 1e-15,
              max_iterations = 5)
(i, x, f(x)) = (1, 3.1425467f0, -0.000954f0)
(i, x, f(x)) = (2, 3.1415927f0, -8.742278f-8)
(i, x, f(x)) = (3, 3.1415927f0, -8.742278f-8)
(i, x, f(x)) = (4, 3.1415927f0, -8.742278f-8)
(i, x, f(x)) = (5, 3.1415927f0, -8.742278f-8)
3.1415927f0
In the next example, the type is not specified so that the default type Float64 is used.

julia> newton(sin, cos, 3.0, tol = 5e-16)
(i, x, f(x)) = (1, 3.142546543074278, -0.0009538893398264409)
(i, x, f(x)) = (2, 3.141592653300477, 2.8931624907621843e-10)
(i, x, f(x)) = (3, 3.141592653589793, 1.2246467991473532e-16)
3.141592653589793
We will learn more about the Newton method and its convergence behavior in
Chap. 12.
Keyword arguments are ignored in method dispatch, i.e., when searching for
a matching method of the generic function. Keyword arguments are only pro-
cessed after the matching method has been found.
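This can be sketched as follows: since keyword arguments do not participate in dispatch, the two definitions below have the same signature, and the second one supersedes the first (h is a throwaway name):

```julia
h(x; scale = 1) = scale * x
h(x; offset = 0) = x + offset  # same positional signature: replaces the method above

println(h(10))              # prints 10; only the second definition survives
println(h(10, offset = 5))  # prints 15
# h(10, scale = 2) would raise an error: `scale` is no longer an accepted keyword
```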
Functions can also receive a variable number of keyword arguments at the end of the argument list using the syntax ... after the name of the variable that will receive all remaining keyword arguments as a collection. The function in this example just returns the collection containing all keyword arguments it has received.

foo(a; b = 0, c...) = c

When calling this function, a semicolon must be used after the keyword argument b so that the keyword arguments collected in c can be distinguished from the preceding keyword argument b.
julia> foo(1, b = 2)
Iterators.Pairs(::NamedTuple{(),Tuple{}}, ::Tuple{}) with 0 entries
julia> foo(1, b = 2; bar = 3, baz = 4)
pairs(::NamedTuple) with 2 entries:
  :bar => 3
  :baz => 4
julia> foo(1, b = 2; :bar => 3, :baz => 4)
pairs(::NamedTuple) with 2 entries:
  :bar => 3
  :baz => 4

Two syntactic options to pass the keyword arguments are shown here. The first is the usual keyword-argument syntax variable = value, and the second are pairs of the form :variable => value.
This general facility for passing keyword arguments is useful when the keyword names are computed at runtime or when a number of keyword arguments is assembled and passed through one or more function calls and the receiving function picks the keyword arguments it needs.
To summarize, keyword arguments are arguments after a semicolon ; in the argument list of a function definition, while optional arguments are listed before the semicolon.
Functions can also receive a variable number of positional arguments using the ellipsis syntax ... after the last argument name, as in the following definition.

foo(a, b, c, args...) = args

The variable args is bound to a tuple of all the trailing values passed to the function.

julia> foo(1, 2, 3)
()
julia> foo(1, 2, 3, 4, 5, 6)
(4, 5, 6)

Analogously, the ellipsis ... can be used in a function call to splice the values contained in an iterable collection (see Sect. 4.5.2) into a function call as individual arguments.

julia> foo((1, 2, 3, 4, 5, 6)...)
(4, 5, 6)
julia> foo([1, 2, 3, 4, 5, 6]...)
(4, 5, 6)
This example shows that the spliced arguments can also take the place of fixed
arguments. In fact, the function call taking a spliced argument list does not have
to take a variable number of arguments at all.
2.9 do blocks

Built-in functions that take a function as one of their arguments usually receive the function argument as the first argument, which is an idiomatic use of function arguments in Julia. A do block is a syntactic expression that supports this idiom. do blocks are useful for passing longer anonymous functions as first arguments to functions. The do block

function(arguments) do variables
    body
end

is equivalent to

function(variables -> body, arguments)
Continuing the above example, the norm function can equivalently also be defined as follows.

norm(v::Vector, p::Number) =
    mapreduce(+, v) do x
        abs(x)^p
    end^(1/p)
The function that operates on the input/output stream may be quite complicated. In such cases it is convenient to use a do block. In the following example, s is the stream on which the anonymous function body in the do block operates. While the end of the file has not been reached, a line is read and printed.
with_stream("/etc/passwd", "r") do s
    while !eof(s)
        print(readline(s))
    end
end
It should be noted in this context that Julia comes with the readlines function that returns the contents of a file as a vector of strings. readlines is often sufficient when the file to be processed is small.
Problems
2.1 Write a function that uses BigFloats and setprecision (in conjunction with a do block) to calculate large Fibonacci numbers. What is the largest Fibonacci number you can calculate in this manner and what is the limitation you eventually run into?
2.2 Write a function that records the number of calls of fib4(𝑚) for each 0 ≤ 𝑚 < 𝑛 when calculating fib4(𝑛).
2.3 Calculating larger and larger Fibonacci numbers using fib7 is limited by the stack size. However, if you gradually increase the size of the argument, you can circumvent this limitation. Explain why. What is the largest Fibonacci number you can calculate in this manner and what is the limitation you eventually run into?
2.4 The following function is a shorter alternative to fib7 because of its use of get!. However, it does not work. Explain why.

function fib(n::Integer)
    global fib_cache
    get!(fib_cache, BigInt(n), fib(n-1) + fib(n-2))
end
2.5 (Identities for Fibonacci numbers) Suppose 𝑥𝑛 is the 𝑛-th Fibonacci num-
ber.
(a) Prove d’Ocagne’s identity (2.2).
(b) Prove the identity (2.3).
(c) Prove the identity (2.4).
2.6 Suppose x_n is the n-th Fibonacci number. Prove the identity

x_{an+b} = ∑_{i=0}^{a} C(a, i) x_{b-i} x_n^i x_{n+1}^{a-i},

where C(a, i) denotes the binomial coefficient.
2.7 Use the macro @timed to plot the time and memory consumption of the various functions to calculate Fibonacci numbers.
2.8 (Ackermann function) The Ackermann function A is defined by

A(m, n) := n + 1                     if m = 0,
           A(m − 1, 1)               if m > 0 ∧ n = 0,
           A(m − 1, A(m, n − 1))     if m > 0 ∧ n > 0.
Implement this function and also implement a memoized version. Compare the
speed of both versions.
Chapter 3
Variables, Constants, Scopes, and Modules
Abstract This chapter discusses how to introduce global and local variables and
constants as well as their scopes or visibility. While discussing functions, we have
seen that function arguments become local variables in the function body, but
there are more ways to introduce variables. Local variables are only visible in
(small) parts of a program, which is an important property to structure a pro-
gram into small, understandable parts. Global variables are only visible in their
module. A module is a (usually large) part of a program that contains functions,
variables, and constants with a similar purpose. The scopes of variables follow
rules that are described in detail.
The scope of a variable is defined as the part of a program where the variable is visible. Modules are a fundamental data structure in this regard, as each module corresponds to a global scope. There is a one-to-one correspondence between modules and global scopes in Julia. The global scope of the repl is the module called Main.
A new module called Foo1 can be defined in a Julia program or at the repl like this.

module Foo1
global foo = 1
end
Modules usually have names that start with an uppercase letter. Program lines within a module are not indented, since a module almost always comprises a whole file and indenting the whole file would be superfluous. Here we have also defined a variable called foo, whose global scope is the module Foo1. Although it is good practice to use the keyword global to define a global variable, it is not necessary to do so.

module Foo2
foo = 2
end
Modules can be replaced by evaluating a module definition for the same name. Modules can be nested, and modules can be imported into other modules, as the next example shows. Here the lines are indented to illustrate the nesting.

module Foo3
    module Bar
        global bar = 0
    end
    global foo3 = Bar.bar  # module Bar is visible
    import ..Foo1          # make module Foo1 visible
    global foo4 = Foo1.foo # module Foo1 is visible
end

julia> Foo3.Bar.bar
0
julia> Foo3.foo3
0
julia> Foo3.foo4
1
We have defined the scope of a variable as the part of a program where the vari-
able is visible. In addition to the global scopes of modules, there are also local
scopes. For example, a function definition introduces a new local scope.
What happens if there are two variables with the same name within a program?
If the scopes of the variables do not overlap, there is no ambiguity. If the scopes
of the variables overlap, however, then Julia’s scope rules are applied in order
to resolve any ambiguities. This section discusses the various scopes of variables
and the scope rules.
The scope of a variable is a concept that is familiar from mathematics. An example is given by integrals. In the formula

f(x) = ∫_a^x f′(ξ) dξ,

the scope of the integration variable ξ is the integrand, i.e., ξ is only visible within the integrand, which is f′(ξ) here. But how should we interpret the formula

f(x) = ∫_a^x f′(x) dx?
Similarly, in the sum

ζ(s) = ∑_{n=1}^{∞} 1/n^s,

the summation index n is only visible within the summand 1/n^s, while the function argument s is visible in the whole right-hand side.
Returning from mathematics to computer science, there are two types of scop-
ing, namely dynamic scoping and lexical scoping. In lexical scoping, the scope of
a variable is the program text of the scope block where the variable is defined. In
dynamic scoping, the scope of a variable is the time period when the code block
where the variable is defined is executed. Julia uses lexical scoping.
function g()
    x
end

We could have left out the keyword global here, but it is good style to indicate the definition of variables using the keywords global or local.
The next example is concerned with the nesting of local scopes. The function inner is defined within the function outer.

function outer()
    function inner()
        y
    end
    local y = 0
    inner()
end
The inner function inherits all variables from its outer scope, i.e., the outer function, so that the variable y inside inner refers to the value of y in outer.

julia> outer()
0

Even defining a global variable y cannot change the return value of outer.

julia> y = 1; outer()
0
In Julia, there are eight types of local scope blocks, all listed in Table 3.1, which also lists the three ways to introduce global scope blocks for completeness. There are two kinds of local scope blocks, namely hard local scopes and soft local scopes. On the other hand, begin blocks and if blocks are not scope blocks and cannot introduce new variable bindings.
According to Table 3.1, functions (see Chap. 2), macros (see Chap. 7), and struct type definitions (see Sect. 5.4) introduce new hard local scopes. The scope rules for hard local scopes are these.
1. All variables are inherited from their parent scope with the following two
exceptions.
2. A variable is not inherited if an assignment would modify a global variable.
(A new binding is introduced instead.)
3. A variable is not inherited if it is marked with the keyword local. (A new binding is introduced instead.)
The second rule means that global variables are inherited only if they are read,
not if they are modified. This ensures that a local variable cannot unintentionally
modify a global variable with the same name.
global x = 0

function foo1()
    x = 1 # introduce new local variable
    x
end

function foo2()
    global x = 2 # assign to global variable
    x
end
In the first function, the assignment x = 1 would modify the global variable x, and therefore a new local variable is introduced. In the second function, the global keyword ensures that the assignment x = 2 refers to the global variable x, whose binding is modified.
julia> foo1()
1

julia> x
0

julia> foo2()
2

julia> x
2
According to Table 3.1, for loops, while loops, try catch finally blocks (see
Chap. 6), let blocks (see Sect. 3.4), and array comprehensions (see Sect. 3.5)
introduce new soft local scopes. The scope rules for soft local scopes are these.
1. All variables are inherited from their parent scope with the following excep-
   tion.
2. A variable is not inherited if it is marked with the keyword local. (A new
   binding is introduced instead.)
3. Additional rules for let blocks (see Sect. 3.4) and for loops and comprehen-
   sions (see Sect. 3.5) apply.
Hard and soft local scopes differ in their intended purposes and hence in their
scope rules. Hard local scopes, i.e., function, macro, and type definitions, are
usually independent entities that can be moved around freely within a program.
Modifying global variables within their scopes is possible, but should be done
with care, and therefore requires the global keyword. On the other hand, soft
local scopes such as loops are often used to modify variables that are defined in
their parent scopes. Hence the default is to modify variables unless the local
keyword is used.
The following two examples involving for loops illustrate soft local scopes.
function sum(args...)
    local sum = 0  # introduce new local variable
    for i in args
        sum = sum + i  # inherit
    end
    sum
end
Here the assignment sum = sum + i does not introduce a new binding for the
variable sum in the for loop, since it is inherited from the parent scope by the
first rule.
The situation is different in the following example.
function foo(args...)
    local x = 0  # introduce new local variable
    for i in args
        local x = i  # introduce new local variable
    end
    x
end
Here the local keyword always introduces a new variable in the scope of the for
loop. Therefore this function always returns 0.
Named functions (in contrast to anonymous functions) are stored as Function
objects in variables. Therefore a function f can be referred to in the definition of
a function g even if f has not been defined yet. An example is given by mutually
recursive functions. In Julia, function definitions can be ordered arbitrarily, and
no forward function declarations are required in such cases, as they are in some
other programming languages.
Whether a variable is defined can be checked using the macro @isdefined
and the function isdefined.
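As an illustration (a sketch of our own, not an example from the book), two mutually recursive functions can be defined in either order, because a function name is only looked up when the function is called:

```julia
# is_even calls is_odd although is_odd is defined later;
# no forward declaration is needed.
is_even(n::Integer) = n == 0 ? true : is_odd(n - 1)
is_odd(n::Integer)  = n == 0 ? false : is_even(n - 1)

# @isdefined reports whether a name has a binding.
defined = @isdefined(is_even)
```

After evaluating these definitions, is_even(10) returns true and is_odd(7) returns true.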
Closures are a concept in computer science that can be found in many modern
programming languages. In general, a closure is a function together with an en-
vironment (or set of bindings) of variables. Variables in the enclosing scope are
called free variables and can be accessed by the function even when the function
is called outside the scope. This behavior is consistent with lexical scoping.
In Julia, closures are based on let blocks. A let block has the syntax

let variable1 [= value1], variable2 [= value2], variable3 [= value3]
    body
end
The global declarations of the functions are necessary. Otherwise the function
definitions would only be accessible inside the let block (and not globally), as
function definitions are stored in variables. (In Common Lisp, the global dec-
laration would not be needed.)
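The closure-based counter discussed in this section can be sketched as follows; this is a reconstruction that assumes the two functions are named increase and get_counter, as in the surrounding transcript.

```julia
# The let-bound variable counter is a free variable shared by the
# two closures; global makes the function names visible outside
# the let block.
let counter = 0
    global increase() = (counter = counter + 1)
    global get_counter() = counter
end
```

Calling increase() three times and then get_counter() yields 3; re-evaluating the let block creates a fresh closure whose counter starts at 0 again.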
julia> get_counter()
0

julia> increase(), increase(), increase()
(1, 2, 3)

julia> get_counter()
3

After reevaluating the let block above, a new closure is created and the counter
is again equal to 0.
Array comprehensions are a convenient way to make (dense) arrays (see Sect. 8.1)
while initializing their elements. A multidimensional array can be constructed by
[expr for variable1 = value1, variable2 = value2], where an arbitrary number of
iteration variables can be used. The values on the right-hand sides must be iter-
able objects such as ranges. Then the expression expr is evaluated with freshly
allocated iteration variables. The dimensions of the resulting array are given by
the numbers of the values of the iteration variables in order.
This example shows how an array comprehension is used to make and initialize
a two-dimensional array.
julia> [10x + y for x in 1:2, y in 1:3]
2×3 Array{Int64,2}:
 11  12  13
 21  22  23

julia> size(ans)
(2, 3)
The iteration variables are freshly allocated for each iteration of the comprehension,
and hence any previously existing variable bindings with the same name are
not affected by the array comprehension.
julia> x = 0; y = 0
0

julia> [10x + y for x in 1:2, y in 1:3]
2×3 Array{Int64,2}:
 11  12  13
 21  22  23

julia> x, y
(0, 0)
The behavior of for loops is the same in this regard. We consider two examples.
In the first, the iteration variable has not been previously defined. In this
case, the iteration variable is local to the for loop.
julia> @isdefined i
false

julia> for i in 1:2 end

After the for loop, there is again no binding for the variable i.

julia> @isdefined i
false
In the second example, a variable with the same name as the iteration variable
already exists. After the ȯɴʝ loop has been evaluated, the value of the iteration
variable remains unchanged.
julia> j = 0
0

julia> for j in 1:2 end

julia> j
0
3.6 Constants
Both global and local variables can be declared constant by the const keyword.
Declaring global variables as constant helps the compiler to optimize code. Since
the types and values of global variables may change at any time, code involving
global variables can generally hardly be optimized by the compiler. If a global
variable is declared constant, however, the compiler can employ type inference
and the performance problem is solved.
The situation is different for local variables in this regard. The compiler can
determine whether a local variable is constant or not, and therefore declaring
local variables constant does not affect performance.
Finally, we note that declaring a variable constant only affects the variable
binding. If the value of a constant variable is a mutable object such as a set, an
array, or a dictionary (see Chap. 4), the elements of the mutable object may still
be modified as shown in this example.
julia> const A = [1, 2]
2-element Array{Int64,1}:
 1
 2

julia> A[1] = 0; A
2-element Array{Int64,1}:
 0
 2
3.7 Global and Local Variables in this Book
This book uses the global and local keywords to explicitly denote global and
local variables, although many programmers usually do not do so in practice.
There are two reasons for the use of these two keywords here. The first is simply
a didactic reason; the keywords clearly indicate where a new variable is defined
and of which kind it is.
The second reason is that writing global and local explicitly to define (and
to access) variables can be considered good practice, because it helps spot the
declarations of global and local variables at a glance and also where they are
used (in the case of global ones). Global variables are always noteworthy, and
therefore deserve to be spotted easily.
Using the global and local keywords is also a matter of style and personal
preference. Julia's syntax is heavily influenced by Pascal's syntax, and in Pascal
all variables are introduced by the var and const keywords. In this tradition,
the local keyword serves the role of var in Pascal. On the other hand,
with more experience in spotting variables and recognizing their scopes, the key-
words may appear superfluous.
Problems
3.1 Extend the example of a closure in Sect. 3.4 by writing functions for resetting
the counter to a given value and for decreasing the counter.
3.2 A Hilbert matrix is a square matrix H with the entries h_ij = 1/(i + j - 1).
Write a function that returns a Hilbert matrix of given size.
Chapter 4
Built-in Data Structures
Data dominates.
If you’ve chosen the right data structures and organized things well,
the algorithms will almost always be self-evident.
Data structures, not algorithms, are central to programming.
—Rob Pike
Bad programmers worry about the code.
Good programmers worry about data structures and their relationships.
—Linus Torvalds
Abstract Julia comes with many useful, built-in data structures that cover
many requirements of general-purpose programming. In this chapter, the most
important built-in data structures are discussed, including characters, strings,
regular expressions, symbols, expressions, and several types of collections. In
conjunction with the data structures, the operations on the data structures are
introduced as well and examples of their usage are given.
4.1 Characters
One of the simplest, but most fundamental data structures is the character, the
type Char. A character is created by 'char', i.e., using single quotes. Each character
corresponds to a Unicode code point, and a character can be converted to
its code point, which is an integer value, by calling Int.
julia> 'a', typeof('a'), Int('a'), typeof(Int('a'))
('a', Char, 97, Int64)

julia> Char(97)
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
However, not all integers are valid Unicode code points. You can check if an
integer is a valid code point by using isvalid(Char, integer).
Any Unicode character can be input in single quotes using \u followed by
up to four hexadecimal digits or using \U followed by up to eight hexadecimal
digits; for example, '\u61' is the letter a. Furthermore, some special characters
can be escaped using a backslash: the backslash character '\\', the single quote
'\'', newline (line feed) '\n', carriage return '\r', form feed '\f', backspace
'\b', horizontal tab '\t', vertical tab '\v', and alert (bell) '\a'. Additionally,
the character with octal value ooo (three octal digits) can be written as '\ooo',
and the character with hexadecimal value hh (two hexadecimal digits) can be
written as '\xhh'.
The standard comparison operators ==, !=, <, <=, >, and >= are available for
characters. Furthermore, the function - is defined for characters as well.

julia> '0' < 'A' < 'a'
true

julia> @which '0' < 'A'
<(x, y) in Base at operators.jl:268

julia> @which 'z' - 'a'
-(x::AbstractChar, y::AbstractChar) in Base at char.jl:221

julia> 'z' - 'a' + 1
26
4.2 Strings
Julia strings are immutable, i.e., once they have been created, they cannot be
changed anymore. Strings are delimited either by double quotes " or by triple
double quotes """. Characters can be entered as part of a string using the syntax
for Unicode characters and the one for special characters mentioned in Sect. 4.1.
A double quote needs to be escaped by a backslash if the string is delimited by
double quotes, while it can occur unaltered inside a string delimited by triple
double quotes.

julia> println("123\b\b\b456\r789")
789456

julia> """This is an "interesting" example."""
"This is an \"interesting\" example."
Triple double quotes are useful for creating longer blocks of text such as documentation
strings. White space receives special treatment, however, when this
syntax is used. When the opening triple quotes are immediately followed by a
newline, then the first newline is ignored. Trailing white space at the end of the
string remains unchanged, however. The indentation of a triple-quoted string is
also changed when it is read. The same amount of white space at the beginning
of each line is removed so that the input line with the least amount of indentation
is not indented at all in the final string. This behavior is useful when a string
occurs as part of indented code.
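The dedenting behavior can be illustrated with a small sketch of our own: the longest common leading white space of the lines is stripped, while extra indentation is kept.

```julia
# The four common leading spaces are removed; the two additional
# spaces of the middle line survive.
s = """
    first line
      indented line
    last line"""
```

Here s is equal to "first line\n  indented line\nlast line".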
The elements of a string are characters and can be accessed by their index.
Indices in Julia always start at 1 and end at end. The end index can be used in
computations as in this example.

julia> s = "Hello, world!"
"Hello, world!"

julia> s[1], s[2], s[floor(Int, end/2)], s[end-1], s[end]
('H', 'e', ',', 'd', '!')
However, strings can consist of arbitrary Unicode characters. Since the UTF-8
encoding is a variable-length encoding, i.e., not all characters are encoded
by the same number of bytes, accessing a string at an arbitrary byte position by
[] does not necessarily yield a valid Unicode character. The number of Unicode
characters in a string is returned by length. In this example, the string consists
of four Unicode characters, where only the second character occupies one byte.

julia> a = "\u2200x\u2208\u211d"
"∀x∈ℝ"
Trying to access a[2] and a[3] results in errors, while a[4] evaluates to 'x'.
It is possible to step through a string using nextind, which returns the next
valid index after a given index, as this example illustrates.

julia> nextind(a, 1)
4

julia> nextind(a, 4)
5
julia> nextind(a, 5)
8
julia> a[1], a[4], a[5], a[8]
('∀', 'x', '∈', 'ℝ')
However, the most convenient way to iterate through a string is using a for
loop.

julia> for char in a println(char) end
∀
x
∈
ℝ
A common string operation is sorting. Since the standard comparisons ==, !=, <,
<=, >, and >= implement lexicographical comparison of strings based on character
comparison, the sort function can be used to sort strings.
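For example (an illustration of our own), a vector of strings is sorted lexicographically:

```julia
words = ["pear", "apple", "orange"]
sorted = sort(words)  # lexicographic order based on character comparison
```

Here sorted is ["apple", "orange", "pear"].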
To check whether a string contains a character, the in function can be used,
which also supports infix syntax.
julia> in('\u2200', a)
true

julia> '\u2200' in a
true
The findfirst function returns more information. Its first argument can be a
character, a string, or a regular expression (see Sect. 4.2.5 below). The first argument
is searched for in the second argument, a string, and findfirst returns
the (byte) indices of the matching substring or nothing if there is no such occurrence.

julia> findfirst('x', a)
4

julia> findfirst('x', s)

julia> ans == nothing
true

This example shows that printing nothing prints nothing. The most recently returned
value is stored in the variable ans, however, and we could check that it is
equal to nothing.
The replace function replaces a substring of a string with another one. Since
strings are immutable, a new string is always returned. The second argument
indicates the substitution, and it can be a pair (see Sect. 4.5.4), a dictionary (see also
Sect. 4.5.4), or a regular expression (see Sect. 4.2.5). The replacement can be a
(constant) string or a function that is applied to the match and that yields a string.
The keyword argument count indicates how many occurrences are replaced.
julia> replace(s, "world" => "World")
"Hello, World!"

julia> replace(s, "w" => uppercase)
"Hello, World!"

julia> replace(s, r"[a-z]" => x -> Char(Int(Char(x[1])) + 1))
"Hfmmp, xpsme!"

Here the regular expression r"[a-z]" matches all lowercase characters. They
are replaced by the character that follows them in the character ordering.
Strings can be assembled from substrings using *, repeat, and join. The function
* (and not +) concatenates two strings. The repeat function concatenates
a given number of characters or strings. The join function concatenates an array
of strings, inserting a given delimiter string between adjacent strings. An
optional second delimiter may be given; it is then used as the delimiter between
the last two substrings.

julia> join(["integers", "rationals", "real numbers"], ", ", ", and ")
"integers, rationals, and real numbers"
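The functions * and repeat can be illustrated analogously (a small sketch of our own):

```julia
greeting = "Hello" * ", " * "world" * "!"  # * concatenates strings
laugh = repeat("ha", 3)                    # "ha" repeated three times
```

Here greeting is "Hello, world!" and laugh is "hahaha".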
Here the version number v"1.1-" indicates a version lower than any 1.1 release
or pre-release. It is good practice to append a trailing - to version numbers in
upper bounds unless there is a specific reason not to do so.
Often the version number should be checked not at run time, but already
when the program is parsed into expressions. The @static macro makes it possible
to perform such a check as in the following example.

julia> [@static if v"1" <= VERSION < v"2-" 1 else :unknown end]
1-element Array{Int64,1}:
 1
julia> s = """
       movie name: War Games
       release date: May 7, 1983
       researcher: Stephen Falken
       artificial intelligence: Joshua
       computer name: WOPR (War Operation Plan Response)
       games: Falken's maze, chess, poker, global thermonuclear war
       hero: David L. Lightman
       status: world saved
       phone numbers from: 311-936-0001
       phone numbers to: 311-936-9999""";

julia> match(r"2", s)

julia> match(r"9+", s)
RegexMatch("9")
To search for occurrences not only of single characters, but of more complicated
patterns, characters and patterns can be grouped by parentheses (pattern).
Such a group can also be named so that one can refer to it more conveniently than
by index, as we will see later. A named group looks like (?<name>pattern). A
group consisting of two alternatives pattern1 and pattern2 is written as
(pattern1|pattern2).
We see that a matched group can be accessed by its (numerical) index (here 2)
or by the name of the group (here "foo", using the string "foo" as the index).
Furthermore, the details of the match m can be inspected by dump(m). The fields
match, captures, offset, and offsets may be useful. In general, dump is extremely
useful for inspecting any value.
There are also characters and patterns that match only certain characters or
locations. A dot . matches any character. The characters ^ and $ match the beginning
and end of a string (or line, in multiline mode), respectively.
Character classes are patterns of the form [from-to]. The negation of such
a character class is [^from-to]. Predefined character classes include \d for any
decimal digit, \s for any white-space character, and \w for any "word" character. For
example, the patterns [0-9] and \d both match any decimal digit. The negations
of these classes are given by uppercase letters.
There are also named character classes such as [:alnum:] (letters and digits),
[:alpha:] (letters), [:blank:] (spaces and tabs), [:digit:] (digits), [:lower:]
(lowercase letters), [:space:] (white space), [:upper:] (uppercase letters), and
[:word:] ("word" characters).
Named groups and character classes help parse heterogeneous data. In this
example, we extract a date from the string using three named groups.
julia> m = match(r"(?<month>\w+)\s+(?<day>\d+),\s+(?<year>\d+)", s)
RegexMatch("May 7, 1983", month="May", day="7", year="1983")

julia> m["day"], m["month"], m["year"]
("7", "May", "1983")
The three parts of the first phone number can be extracted similarly. Groups
can even be nested, so that we can define a group named number to contain the
whole match.

julia> m = match(r"(?<number>(?<area>\d+)-(\d+)-(\d+))", s)
RegexMatch("311-936-0001", number="311-936-0001", area="311",
           3="936", 4="0001")

julia> m["number"], m["area"], m[1], m[2], m[3], m[4]
("311-936-0001", "311", "311-936-0001", "311", "936", "0001")
The following example shows how named character classes are used inside
brackets.

julia> match(r"artificial intelligence: (?<name>[[:alnum:]]+)",
             s)["name"]
"Joshua"
4.3 Symbols
Symbols are an important data structure in Lisp-like languages, because they
serve as variable names and because they are fundamental building blocks of
expressions (see Sect. 4.4). A symbol is essentially an interned string identifier.
Interning a string means that it is ensured that only one copy of each distinct
string is stored, and thus interned strings can be associated with values. This
implies that it is not possible for two symbols with the same name to exist simultaneously,
i.e., symbols are unique.
There are a few ways to create a symbol. We can call the parser, i.e., the function
parse in the Meta module, to directly create a symbol.

julia> Meta.parse("foo")
:foo

julia> typeof(:foo)
Symbol
The parser recognizes foo as a symbol and returns it (without evaluating it).
Another option to create a symbol is to enter a suitable expression. Entering
foo at the REPL and hence evaluating it yields the value of the variable called
foo, however. Therefore we have to protect the expression from evaluation. This
is achieved by prepending it with a colon :, which adds one layer of protection
against evaluation to the expression that follows it. Therefore the colon is also called
the quote character in Julia. Thus :foo evaluates to the symbol foo.

julia> :foo
:foo
A more direct way to create a symbol is to use the function Symbol, which
follows the theme in Julia that functions that have the same name as a type
create new values of this type. The function Symbol creates a new symbol by
concatenating the string representations of its arguments.

julia> :foo == Symbol("foo") == Symbol('f', "oo")
true
Symbols are used to access variables and evaluate to the values of the variables.
Expressions can be evaluated not only in the REPL, but also by calling the
function eval. In this example, we try to access the value of the undefined variable
named foo, which raises an error. After defining the variable, however, its
value is returned by evaluating :foo.

julia> eval(:foo)
ERROR: UndefVarError: foo not defined
...

julia> foo = 0
0

julia> eval(:foo)
0
We have just seen that Julia provides access to its parser and its evaluator
via Meta.parse and eval. These functions work not only with symbols, but also
with expressions, the building blocks of Julia programs.
4.4 Expressions
Reading a variable name using Meta.parse yields a symbol. The following example
shows what happens when we parse more complicated expressions.

julia> Meta.parse("0 + 1")
:(0 + 1)

julia> typeof(ans)
Expr
julia> Meta.parse("foo + bar")
:(foo + bar)

julia> typeof(ans)
Expr
The expressions are returned seemingly unchanged by the REPL, because they
have been quoted using the colon :. Behind the scenes, the situation is a bit more
involved, however. The quote absorbed the evaluation by the REPL, returning the
expression 0 + 1. This expression is printed as :(0 + 1) (and not as 1, which
would require another evaluation) so that it remains an expression.
More information about the parts of an expression can be obtained by using
the dump function.
julia> dump(:(0 + 1))
Expr
  head: Symbol call
  args: Array{Any}((3,))
    1: Symbol +
    2: Int64 0
    3: Int64 1

julia> dump(:(foo + 2*bar))
Expr
  head: Symbol call
  args: Array{Any}((3,))
    1: Symbol +
    2: Symbol foo
    3: Expr
      head: Symbol call
      args: Array{Any}((3,))
        1: Symbol *
        2: Int64 2
        3: Symbol bar
We find that an object of type Expr has two fields, namely head and args. In both
examples, the head is :call. The arguments are vectors, whose first element is
a symbol that names a function. Further arguments can be constants (such as
symbols) or further expressions.
In the next example, we deconstruct an expression into the parts we just observed
using dump and then we make another expression out of these parts. As
usual in Julia, the name of a type (here Expr) is also a function, and calling this
function makes a new object of this type.
julia> expr = Meta.parse("foo + 2*bar")
:(foo + 2bar)

julia> expr.head
:call

julia> expr.args
3-element Array{Any,1}:
 :+
 :foo
 :(2bar)

julia> Expr(expr.head, expr.args...)
:(foo + 2bar)

julia> Expr(expr.head, expr.args...) == :(foo + 2*bar)
true
Evaluating the expression expr in this example raises an error, since the variables
foo and bar are undefined. After defining them, however, we can evaluate
the expression.

julia> foo = 1; bar = 2;

julia> eval(expr)
5
We will learn much more about expressions and their evaluation in Chap. 7.
The salient point is that Julia code is represented in a canonical form as a Julia
data structure, namely as objects of type Expr. One could argue that any language
(that at least has a string data type) can represent programs in this language as a
string and therefore using a built-in data type. This is true, of course, but of very
limited use, since string data types do not provide the facilities of the Expr type,
of Meta.parse, and of eval.
4.5 Collections
A collection is the general term for a data type that contains elements in ordered,
unordered, indexable, or non-indexable form. Various types of collections are discussed
in this section. Data types that are collections are listed in Table 4.1. All
built-in abstract data types are listed in Table 4.2; several of them are collections,
but not all of them. It is not possible to create instances of abstract types, only of
concrete subtypes of abstract types.
Any collection can be queried whether it is empty or not. The following example
is an empty array. Arrays are discussed in detail in Chap. 8; for now, it suffices
to know that vectors and arrays are denoted by square brackets and that the data
type of their elements may be indicated before the opening square bracket.

julia> [], typeof([]), isempty([])
(Any[], Vector{Any}, true)
A for loop executes the expressions in its body for all elements of the iterable
collection bound to the iteration variable, in the order in which they are returned
by the iterate method. For built-in iterable collections, iterate methods have of
course already been defined. Furthermore, after defining iterate methods for your
own data structures, you can iterate over these data structures using for loops as well.
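As a sketch of the last point (this type and its methods are our own, not an example from the book), a data structure becomes iterable once Base.iterate is defined for it:

```julia
# A countdown type made iterable by defining Base.iterate.
struct Countdown
    start::Int
end

# First call: iterate(c); subsequent calls: iterate(c, state).
# Return (element, next_state), or nothing when the iteration is done.
Base.iterate(c::Countdown, state = c.start) =
    state < 1 ? nothing : (state, state - 1)

Base.length(c::Countdown) = c.start
```

With these definitions, for loops, collect, and reductions such as sum all work on Countdown values; collect(Countdown(3)) yields [3, 2, 1].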
There are many useful functions that can be applied to iterable collections;
these include functions to extract certain elements, to reduce the collection by
applying a function repeatedly, and to map a function over all elements. Hence
the functions discussed in this section are important building blocks for functional
programming, which often proceeds by combining functions to extract
the desired information after starting from a collection.
Table 4.4 gives an overview of basic functions that are defined for iterable
collections.
Several functions exist to find extrema in an iterable collection. They are listed
in Table 4.5. As usual, destructive versions end in an exclamation mark !.
An important set of functions is given in Table 4.6. The most general ones in
this table are the ones for reducing and folding an iterable collection. Reducing
a collection containing elements a_1, ..., a_n using a function or operation ⊕ means
calculating the expression a_1 ⊕ a_2 ⊕ ⋯ ⊕ a_n. The operation ⊕ must take two arguments
and it must be associative. An initial element init may also be specified as
a keyword argument. The operation is applied repeatedly until the whole expression
has been evaluated. If the collection is empty, the initial element must be
specified except in special cases where Julia knows the neutral element of the
operation. If the collection is non-empty, it is unspecified whether the initial element
is used. If the collection is an ordered one, the elements are not reordered;
otherwise the evaluation order is unspecified.
The function reduce provides the general form of reduction. Special, often
used cases come with the special implementations maximum, minimum, sum, prod,
any, and all (and their variants), which should be used instead.
The folding functions foldl and foldr come with more guarantees. They
guarantee left and right associativity, respectively, and use the given initial or neutral
element exactly once.
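The difference in association order is visible with a non-associative operation such as subtraction (an illustration of our own):

```julia
left  = foldl(-, [1, 2, 3])   # evaluates (1 - 2) - 3
right = foldr(-, [1, 2, 3])   # evaluates 1 - (2 - 3)
```

The results are -4 and 2, respectively, while reduce makes no promise about the association order for such an operation.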
In the first example, we have specified the type (Int) of the elements of the vector
by using Int[]. Since [] may contain elements of any type (which you can
check by evaluating eltype([])), reduce(+, []) raises an error because Julia
cannot determine the neutral element; therefore we have specified the initial
element as zero in the second example.
If the collection contains only one element, it is returned. If there are two or
more elements in the collection, the operation is applied at least once. In both
cases, it is unspecified whether the initial element is used.
A mathematical example that can be implemented using reduce is given by
Taylor series.

function my_exp(x, n::Integer)
    reduce(+, [x^i/factorial(i) for i in 0:n])
end
Here we have used array comprehensions (see Sect. 3.5) to conveniently construct
a collection with the appropriate elements.
Analogously, we can multiply all elements in a collection using reduce with
the operation * and the initial or neutral element 1. The factorial function can
be defined in one line in this manner.

my_factorial(n::Integer)::BigInt = reduce(*, BigInt(1):BigInt(n))

Here we have used BigInts in order to be able to compute large values. The
syntax

start:step:end

makes a range of values starting at start and ending at end with step size step
(a UnitRange when the step size is omitted and therefore equal to one, and a
StepRange otherwise). Ranges may be empty.
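A few range examples (a sketch of our own):

```julia
r = 1:2:7        # a range with step size 2: the elements 1, 3, 5, 7
u = 1:4          # a UnitRange with step size 1
e = 2:1          # an empty range
```

Ranges are lazy collections; collect(r) materializes the elements [1, 3, 5, 7], and isempty(e) returns true.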
julia> typeof(BigInt(1):BigInt(2))
UnitRange{BigInt}

This confirms that the collection which is reduced is indeed a range of BigInts.
Another example is the implementation of the maximum function by reduction.
While maximum acts on iterable collections, the max function takes one or more
arguments. We can implement maximum using max (called with two arguments)
as follows.

julia> reduce(max, [], init = -Inf), reduce(max, [1])
(-Inf, 1)
Analogously, the effect of minimum can be achieved by reducing min using a suitable
initial element.

julia> reduce(min, [], init = Inf), reduce(min, [1])
(Inf, 1)
Continuing the example of the Taylor series above, the implementation can
be made more practical by two improvements. The first improvement is, as indicated
above already, to define it even more succinctly using sum.

my_exp(x, n::Integer) = sum([x^i/factorial(i) for i in 0:n])
The functions any and all can also be viewed as reductions of collections. To
see this, we define the two helper functions and and or.

julia> and(a, b) = a && b
and (generic function with 1 method)

julia> or(a, b) = a || b
or (generic function with 1 method)
Reducing a collection using or has the same effect as applying any to the collection;
the initial or neutral element of the operation or is false. Analogously,
reducing a collection using and has the same effect as applying all to the collection;
the initial or neutral element of the operation and is true.

my_any(collection) = reduce(or, collection, init = false)
my_all(collection) = reduce(and, collection, init = true)

julia> my_any([false, true]), any([false, true])
(true, true)

julia> my_all([false, true]), all([false, true])
(false, false)
It is often convenient to map anonymous functions (see Sect. 2.5) like in these
examples.

julia> map(x -> log(10, x), [1, 10, 100])
3-element Vector{Float64}:
 0.0
 1.0
 2.0
The same effect can be achieved using an array comprehension (see Sect. 3.5). It
is often a matter of style whether map or a comprehension is used.

julia> [log(10, x) for x in [1, 10, 100]]
3-element Vector{Float64}:
 0.0
 1.0
 2.0
The function to be mapped may take more than one argument, and then a
corresponding number of collections must be supplied to map or its cousins, one
collection for each function argument.

julia> map((x, y) -> x^2 + y^2, [1, 2, 3], [10, 20, 30])
3-element Array{Int64,1}:
 101
 404
 909
The function foreach is the same as map, except that it discards the results
of applying the function and always returns nothing, the only value of type
Nothing. It should be used when the function calls are performed to produce
side effects only, e.g., to print values.
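For example (a sketch of our own), printing the elements of a collection for the side effect only:

```julia
# foreach calls println on each element and returns nothing.
result = foreach(println, ["a", "b", "c"])
```

Here result is nothing, while the three elements have been printed.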
The functions filter and filter! are also similar to map, but they are used
to return only a subset of the collection. Again, a function is applied to each
element of a collection. If it returns true, the element is kept in a copy of the
collection; otherwise it is ignored.
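For example (an illustration of our own):

```julia
evens = filter(iseven, [1, 2, 3, 4, 5, 6])  # keeps 2, 4, 6
odds  = filter(isodd, 1:6)                  # also works on ranges
```

Here evens is [2, 4, 6] and odds is [1, 3, 5].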
The function mapreduce and its variants combine map and reduce as the name
indicates. The function call

mapreduce(f, op, iterable...; init)

is equivalent to evaluating

reduce(op, map(f, iterable...); init = init)

except that the need to allocate any intermediate results is obviated; hence the
mapreduce version generally runs faster and generates less garbage.
Using mapreduce, Taylor series can be implemented perfectly in the spirit of
functional programming.

function my_exp(x, n::Integer)
    mapreduce(i -> x^i/factorial(BigInt(i)), +, 0:n)
end
In order to check the convergence speed for different arguments x, we use map
and an anonymous function to produce tuples that contain the values of x and
the corresponding residua. Then we filter the tuples to identify the tuples where
the residua are above a certain threshold.
julia> filter(x_res -> abs(x_res[2]) > 1e-6,
              map(x -> (x, my_exp(x, 20) - exp(x)), 1:5))
1-element Vector{Tuple{Int64, BigFloat}}:
 (5, -1.20351...e-05)
Indexable collections are collections whose elements are associated with an index
or key. A set (see Sect. 4.5.5) is an example of a collection that is iterable, but
not indexable. In Julia, the syntax

a[i...]

is just an abbreviation for the function call

getindex(a, i...).

In the case of Dicts, the index is called a key (see Sect. 4.5.4).
This example shows that multi-dimensional arrays require a corresponding
number of indices.

julia> foo = [1 2; 3 4]; setindex!(foo, 5, 1, 1); foo
2×2 Matrix{Int64}:
 5  2
 3  4
There are various ways to create a Dict. It can be created by passing Pair
objects to the Dict constructor.
We see that the types of the keys and values (i.e., String and Int64) are inferred
from the Pairs, but they can also be specified as parameters to the Dict function
in curly brackets (see Sect. 5.7) by writing Dict{key-type, value-type}(pairs...)
as in the following example.
julia> Dict{String, Int16}()
Dict{String, Int16}()

julia> Dict{String, Int16}("a" => 1, "b" => 2, "c" => 3)
Dict{String, Int16} with 3 entries:
  "c" => 3
  "b" => 2
  "a" => 1
d[key] = value

stores a key-value pair in the dictionary, possibly replacing any existing value for
the key. Furthermore, the expression d[key] returns the value of the given key
if it exists or throws an error if it does not. The function haskey tests whether a
collection contains a value associated with a given key and returns a Bool value.
Table 4.9 provides an overview of the operations available on associative
collections.
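A few of these operations in action (the dictionary contents here are arbitrary):

```julia
d = Dict("a" => 1, "b" => 2)
d["c"] = 3                 # store a new key-value pair
found = haskey(d, "a")     # true
default = get(d, "z", 0)   # 0: the default is returned for a missing key
delete!(d, "b")            # destructively remove the pair with key "b"
```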
4.5.5 Sets
Sets are among the most fundamental data structures in mathematics. As usual,
a set can be constructed by using the name of the data structure, i.e., Set, as
a function or constructor, where the type of the elements can be specified as a
type parameter in curly brackets. The initial elements of a set may be passed as
an argument that is an iterable object, e.g., a vector.
julia> Set()
Set{Any}()

julia> Set{Int64}()
Set{Int64}()

julia> Set([1, 2, 3]), typeof(Set([1, 2, 3]))
(Set([2, 3, 1]), Set{Int64})

julia> Set{Float16}([1, 2, 3]), typeof(Set{Float16}([1, 2, 3]))
(Set(Float16[2.0, 3.0, 1.0]), Set{Float16})
A BitSet is a sorted set of Ints implemented as a bit string. While Sets are
suitable for sparse integer sets and generally for arbitrary objects, BitSets are
especially suited for dense integer sets.
The usual set operations are available in both non-destructive and destructive
versions, the latter ending with an exclamation mark !. An overview is given in
Table 4.10.
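For example, the most common operations in both variants:

```julia
s1 = Set([1, 2, 3]); s2 = Set([3, 4])
u = union(s1, s2)       # non-destructive: s1 is unchanged
i = intersect(s1, s2)
d = setdiff(s1, s2)
union!(s1, s2)          # destructive: s1 now also contains 4
```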
Several of the functions in the table can also be applied to arrays with the
expected results. If the arguments are arrays, the order of the elements is
maintained and an array is returned. These methods of the generic functions are
easier to use and run faster when arrays are to be interpreted as sets.
julia> a1 = [1, 2]; a2 = [3];

julia> union(Set(a1), Set(a2)) == Set(union(a1, a2))
true
Vectors in the context of linear algebra are discussed in detail in Chap. 8. Here
we view vectors as deques (double-ended queues) and discuss the operations
that implement deques on top of vectors as the underlying data structure. This
notion of viewing vectors as collections of items (and not as elements of ℝ𝑑 ) and
performing operations on them fits well into the theme of the present section.
The operations on deques are summarized in Table 4.11. All of the functions
are destructive. The functions in Table 4.11 differ in their return values. Some
return the item or items in question, while others return the modified collection.
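Besides push! and pop!, discussed below, there are corresponding destructive operations on the front of a deque; a brief sketch:

```julia
v = [2, 3]
pushfirst!(v, 1)   # insert at the beginning: v is now [1, 2, 3]
last = pop!(v)     # removes and returns the last item, 3
first = popfirst!(v)   # removes and returns the first item, 1
```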
The two most iconic operations on deques are push! and pop! for inserting
an item or items at the end of a collection and for removing the last item,
respectively. These two functions operate on the end of the collection, usually a
vector, because the end is where a vector can be modified most easily. If the
functions were to operate on the beginning of the vector, then the vector would
have to be copied every time.
The two functions push! and pop! are inverses of one another.

julia> push!(v, pop!(v))
2-element Vector{Any}:
 1
 2

julia> pop!(push!(v, 3))
3

julia> v
2-element Vector{Any}:
 1
 2
While both push! and append! add elements to the end of a collection, push!
takes a variable number of arguments, while the second argument to append! is
already a collection. These two function calls have the same effect.

julia> push!([], 1, 2, 3) == append!([], [1, 2, 3])
true
The splice! method acts on a range of indices and can be used to insert new
elements without removing any elements by specifying an empty range.

julia> v = [1, 2, 3]; splice!(v, 2:1, [20, 30])
Int64[]

julia> v
5-element Vector{Int64}:
 1
 20
 30
 2
 3
Problems
4.2 Write a function that calculates the natural logarithm using its Taylor series.
Compare the speed and memory allocation of five versions using reduce, sum,
mapreduce, array comprehensions, and generators.
4.4 Compare the speed and memory allocation of three versions of the
my_factorial function: (a) using a range, (b) using an array comprehension,
and (c) using a generator.
4.5 Write iterative versions of my_any and my_all. Compare their speed and
memory allocation with the versions based on reduce and with the built-in
functions any and all.
Chapter 5
User Defined Data Structures and the Type
System
5.1 Introduction
All values in digital, binary computers are stored as vectors of two values, zeros
and ones. In order to make memory more useful and accessible to programmers,
these vectors or strings of zeros and ones are interpreted as more meaningful
objects such as signed and unsigned integers, rational numbers, floating-point
numbers as approximations of real numbers, characters, dictionaries, user de-
fined data structures, and many others. The information that enables the inter-
pretation of a vector of zeros and ones as a more meaningful object is its type.
In other words, both a vector of zeros and ones and the type information are
necessary to interpret a part of memory usefully and correctly.
In computer science, there are traditionally two approaches to implementing
type systems, namely static type systems and dynamic type systems. In static type
systems, each variable and expression in the program has a type that is known or
computable before the execution of the program. In dynamic type systems, the
types of variables and expressions are unknown until run time, when the values
stored in variables are available. This means that in a static type system, types
are associated with variables; variables always contain values of the same type,
and the type is known before the execution of the program. In a dynamic type
system, types are associated with values; variables may contain values of differ-
ent types, and the type is only known during the execution of the program and
may change.
The ability of programs or expressions to operate on different types is called
polymorphism. By definition, all programs written in a dynamically typed lan-
guage are polymorphic; restrictions to the types of values only occur when a type
is checked during run time or when an operation is not available for a certain
value at run time.
Dynamic and static typing both have their advantages and disadvantages.
While Julia’s type system is dynamic, it is also possible to specify the types of
certain variables like in a static type system. This helps generate efficient code
and allows method dispatch on the types of function arguments. In this manner,
Julia gains some properties and advantages of static type systems.
Since Julia’s type system is dynamic, variables in Julia may contain values
of any type by default. When desired, it is possible to add type annotations to
variables and expressions. These type annotations serve a few purposes. They
enable multiple dispatch on function arguments, they serve as documentation
and can hence make a program more readable and clarify its purpose, and they
can serve as safety measures and catch programmer errors.
The type system is an important part of any programming language. Some
important properties of Julia’s powerful and expressive type system are the fol-
lowing.
• As usual in a dynamic type system, types are associated with values, and
never with variables.
• Some programming languages, especially object oriented ones, discern be-
tween object (composite) and non-object (numbers etc.) values. This is not
the case in Julia, where all values are objects and each type is a first-class
type.
• It is possible to parameterize both abstract and concrete types, and type pa-
rameters are optional when they are not required or not restricted.
In this chapter, the basics of Julia’s type system needed to define one’s own type
hierarchies are summarized, and some finer points are discussed as well.
If a type annotates an expression in the form expr::type, the meaning of the ::
operator is that it asserts that the value of the expression must be of the
indicated type. If the assertion is true, the value is returned; otherwise an
exception is thrown.
julia> true::Int
ERROR: TypeError: in typeassert, expected Int64, got a value of
type Bool

julia> true::Bool
true

julia> typeof(foo())
Float64
Similarly, it is possible to declare the type of the return value of a function (see
Chap. 2). The annotation of a return value (returned by return or being the last
value in the function body) is treated just as the annotation of a local variable as
discussed above, and hence the assignment is performed using convert. As an
example, the following function will always raise an error at run time.

function foo()::Int
    "This conversion raises an error."
end
Abstract types are defined as types that cannot be instantiated, while concrete
types can be. In the type graph, the children or subtypes of abstract types are
abstract or concrete types, while concrete types cannot have children or subtypes
in the type graph. Abstract types are defined via abstract type Name end or
abstract type Name <: Supertype end. The first syntax is equivalent to abstract
type Name <: Any end, where the type Any is at the top of the type graph or
hierarchy. The names of types are usually capitalized, and the names of abstract
types usually start with Abstract.
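A minimal sketch of both forms, using hypothetical type names:

```julia
# An abstract type at the top of its own hierarchy (implicitly <: Any) ...
abstract type AbstractShape end

# ... and an abstract subtype declared with an explicit supertype.
abstract type AbstractPolygon <: AbstractShape end
```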
While the type Any is at the top of the type hierarchy or graph, the Union{}
type is at the bottom. While all objects are instances of Any and all types are
subtypes of Any, no object is an instance of Union{} and all types are supertypes
of Union{}.
The whole type graph or hierarchy can be probed easily using the <: operator
and the functions subtypes, supertype, and supertypes. The expression type1
<: type2 returns true if type1 is below type2 in the type hierarchy; it does not have
to be a child.

julia> Int8 <: Any
true
The function subtypes returns a vector with all types that are directly below the
given type in the hierarchy, i.e., with all children of the given type.

julia> subtypes(Integer)
3-element Vector{Any}:
 Bool
 Signed
 Unsigned
The function supertype returns the parent of the given type, and the function
supertypes returns a tuple with all types above the given type in the hierarchy,
starting with the type itself and always ending with Any.

julia> supertype(Int8)
Signed

julia> supertypes(Int8)
(Int8, Signed, Integer, Real, Number, Any)
In the next example, we locate real and complex numbers in the type hierar-
chy.
julia> supertypes(Real)
(Real, Number, Any)

julia> supertypes(typeof(1 + 2im))
(Complex{Int64}, Number, Any)

Irrational numbers are also part of Julia’s numerical tower, as we can see
here in the example of Euler’s number.
5.4 Composite Types
Composite types are the most common types defined by users. In other lan-
guages, composite types are also called structs, records, or objects. They consist
of named fields of arbitrary types, and therefore usually serve to collect quite dis-
tinct objects or values into ensembles, in contrast to vectors, which are usually
used to store values of the same type.
A composite type is defined by the keyword struct followed by the field
names, which may be annotated by their types using the usual :: syntax. If there
is no annotation, it defaults to the Any type. Again, the names of composite types
are capitalized in Julia by convention.
In the first example, there is no type annotation so that both fields can contain
values of Any type. (The types in the examples are numbered, because redefining
types is not allowed in Julia. Numbering the types saves us from restarting
Julia to evaluate the examples.)

struct Foo1
    a
    b
end
These two field specifications are equivalent to a::Any and b::Any. In order to
create an instance of the newly defined composite type Foo1, its name is used
as a function; this constructor function was defined by the expression above. By
default, a composite object is printed as its name followed by the values of its
fields in parentheses.

julia> Foo1(1, 2)
Foo1(1, 2)

julia> typeof(ans)
Foo1

julia> methods(Foo1)
# 1 method for type constructor:
[1] Foo1(a, b) in Main at REPL[1]:2
The methods call reveals that behind the scenes our type definition defined a
generic function of the same name. Its method takes the correct number of field
values as input.
The value of a field is accessed by a dot . followed by the field name.

julia> foo1 = Foo1(1, 2)
Foo1(1, 2)

julia> foo1.a
1

julia> foo1.b
2

julia> Foo1(1, 2).a
1
The fields of a composite object can also be accessed using the function
getfield, and all field names of a composite type can be obtained by the
function fieldnames as symbols. This provides a way to iterate over all fields of a
composite type, as shown in the next example.

julia> fieldnames(Foo1)
(:a, :b)

julia> for name in fieldnames(Foo1)
           @show name, getfield(foo1, name)
       end
(name, getfield(foo1, name)) = (:a, 1)
(name, getfield(foo1, name)) = (:b, 2)
For mutable composite types, field values can be assigned using the equal sign =
and the accessor on the left-hand side.
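The type Foo2 used in the following examples is not defined in this excerpt; a definition consistent with the output shown, mutable and with the field type b::Float64 that explains the conversion of 2 to 2.0, would be:

```julia
# Assumed definition of Foo2 (not shown in this excerpt):
# mutable, so that fields can be reassigned after construction.
mutable struct Foo2
    a
    b::Float64
end
```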
julia> foo2 = Foo2(1, 2)
Foo2(1, 2.0)

julia> foo2.a = 3
3

julia> foo2.a
3
In our examples so far, the initial field values have matched the field types
specified during struct definition. What happens if they do not match, though?
Julia tries to convert the given values to the types specified in the struct
definition whenever possible; if it is not possible, an error is raised. An example of
such a conversion can be seen above: the Int value 2 was converted to a Float64
value according to the field specification b::Float64. The error raised when the
conversion is not possible is shown in the following example.

julia> Foo2(1, "2")
ERROR: MethodError: Cannot `convert` an object of type String to an
object of type Float64
The same error is raised when trying to assign a value that cannot be converted
to the type indicated in the struct definition.

julia> Foo2(1, 2).b = "2"
ERROR: MethodError: Cannot `convert` an object of type String to an
object of type Float64
In this case, the vector with undefined elements consists of elements whose
fields contain random numbers. The numbers are random in the sense that they
contain whichever bits were present at their memory location when they were
allocated.

julia> Vector{Foo4}(undef, 3)
3-element Vector{Foo4}:
 Foo4(1, 2, 1.5e-323, 8.0e-323 + 2.0e-323im)
 Foo4(5, 6, 3.5e-323, 4.4e-323 + 5.0e-323im)
 Foo4(12, 13, 7.0e-323, 7.4e-323 + 5.4e-323im)
5.5 Constructors
Constructors are, in general, functions that create new objects. We have already
seen in the previous section that defining a new composite type automatically de-
fines a standard constructor for this type. The standard constructor is a method
for the generic function with the same name as the type, taking the initial values
for the fields as arguments. Sometimes, however, it is desirable to define custom
constructors, for example, to create complex objects in a consistent state, to en-
force invariants, or to construct self-referential objects.
There are two types of constructors: inner and outer ones. Outer constructors
are just additional methods to the generic function of the same name as the com-
posite type. They usually provide convenience such as constructing objects with
default values.
Here we consider the example of (real-valued) intervals. (Again, the types
have numbers in their names since Julia forbids redefining types and we want
to avoid restarting Julia for each example.)
struct Interval1
    left::Float64
    right::Float64
end

The following method for the generic function Interval1 is an outer constructor.
Its only purpose is to define a default interval.

function Interval1()
    Interval1(0, 1)
end
Outer constructors bear their name because they are defined outside the
scope of the struct definition. Inner constructors are defined inside the struct
definition and have an additional capability, namely that they can call a
function called new that creates objects of the composite type being defined. After a
composite type has been defined, it is not possible to add any inner constructors.
Also, if an inner constructor is defined, no default constructor is defined.
In the following example, the only constructor checks whether a valid interval
is being constructed.
struct Interval2
    left::Float64
    right::Float64

    function Interval2(left, right)
        @assert left <= right "left endpoint must be less than or equal to right endpoint"
        new(left, right)
    end
end
In the first call below, the endpoints are valid; in the second one, they are not.
julia> Interval2(0, 1)
Interval2(0.0, 1.0)

julia> Interval2(1, 0)
ERROR: AssertionError: left endpoint must be less than or equal to
right endpoint
A finer point of the new function defined in inner constructors is that it can be
called with fewer arguments than the number of fields the type has. This feature
makes the creation of instances of self-referential types possible. Although this
may sound like a situation that is seldom encountered, we do not have to look
far for such a data structure; a prime example is lists. In Lisp, they consist of a
data structure called a cons (which is short for construct). For example, the Lisp
expression

(cons 1 (cons 2 (cons 3 nil)))

evaluates to the list (1 2 3). Cons cells consist of two fields, which may contain
arbitrary values. When cons cells are used to build a list, the first field (traditionally
called car in Lisp, which is short for “contents of the address part of register
number” on the IBM 704 computer) holds a value and the second (traditionally
called cdr in Lisp, which is short for “contents of the decrement part of register
number” on the IBM 704 computer) holds another cons cell or nil. Because of
their structure, cons cells and hence lists are usually traversed recursively.
We start with a first try to define a cons cell in Julia.

struct Cons1
    car::Any
    cdr::Cons1
end
While this self-referential type definition can be evaluated without any error, we
encounter a problem when trying to create an instance. Just saying Cons1() or
Cons1(1, Cons1()) does not work, since all fields must be initialized and no
instance of this type exists yet. The problem is that it is not possible to create an
instance, because the second field must contain an instance of the same type.
The solution is to define an inner constructor that calls new with only one
argument that initializes the car field (in addition to a second, standard
constructor).
mutable struct Cons
    car::Any
    cdr::Cons

    function Cons(car::Any)::Cons
        new(car)
    end

    function Cons(car::Any, cdr::Cons)::Cons
        new(car, cdr)
    end
end

With this definition, we can mimic the list above using our Cons data type.

julia> Cons(1, Cons(2, Cons(3)))
Cons(1, Cons(2, Cons(3, #undef)))
The last function in this example shows how functions can operate safely on the
Cons composite type, namely by using isdefined.

function length(cons::Cons)::Int
    if isdefined(cons, :cdr)
        1 + length(cons.cdr)
    else
        1
    end
end
This recursive definition of the length returns the correct value when the Conses
are interpreted as a list.

julia> length(Cons(1))
1

julia> length(Cons(1, Cons(2)))
2

julia> length(Cons(1, Cons(2, Cons(3))))
3
A type union is an abstract type that consists of the union of all types given after
the Union keyword. A common example is a type that may take a value or not.
Such types are sometimes useful when passing arguments or as return values.

julia> MaybeInt = Union{Int, Nothing}
Union{Nothing, Int64}

julia> 0::MaybeInt, nothing::MaybeInt
(0, nothing)
struct Interval3{T}
    left::T
    right::T
end

The type Interval3 has type UnionAll. The type UnionAll represents the union
of all types over all values of the type parameter.

julia> typeof(Interval3)
UnionAll
This is due to the practical reason that composite types should be stored as
efficiently in memory as possible. For example, while Interval3{Int64} can be
stored as two adjacent 64-bit values, this is not true for Interval3{Real}, which
entails the allocation of two Real objects.
The above fact has ramifications for the definition of methods. Suppose we
want to define a function that calculates the midpoint of an interval of numbers.
function midpoint1(i::Interval3{Number})
    (i.left + i.right) / 2
end

This method does not work as intended, as the following function call shows.

julia> midpoint1(Interval3(-1, 1))
ERROR: MethodError: no method matching midpoint1(::Interval3{Int64})
There are three ways to define suitable methods, whose syntaxes differ slightly.

function midpoint2(i::Interval3{<:Number})
    (i.left + i.right) / 2
end
The second option is to use the default constructor of the underlying
parametric composite type (of type UnionAll), which is Interval3 in this example,
as long as the implied value of the parameter type T is unambiguous. In these
two examples, the underlying type is unambiguous.

julia> typeof(Interval3(0, 1))
Interval3{Int64}

julia> typeof(Interval3(1//2, 2//3))
Interval3{Rational{Int64}}
In the following two examples, the underlying type is ambiguous and errors
result.

julia> typeof(Interval3(0, 1.0))
ERROR: MethodError: no method matching Interval3(::Int64, ::Float64)

julia> typeof(Interval3(0, 1//2))
ERROR: MethodError: no method matching
Interval3(::Int64, ::Rational{Int64})
Just as in the case of parametric composite types, each concrete parametric
abstract type is a subtype of the underlying abstract type (of type UnionAll).
Furthermore, a concrete parametric abstract type is never a subtype of another
concrete parametric abstract type, even if one type parameter is a subtype of the
other.
The notation GeneralInterval{<:Real} denotes the set of all types
GeneralInterval{T} where T is a subtype of Real, and analogously
GeneralInterval{>:Real} denotes the set of all types GeneralInterval{T}
where T is a supertype of Real. This is illustrated in the following examples.

julia> typeof(GeneralInterval)
UnionAll

julia> typeof(GeneralInterval{<:Real})
UnionAll

julia> typeof(GeneralInterval{>:Real})
UnionAll

julia> GeneralInterval{Int} <: GeneralInterval{<:Real}
true

julia> GeneralInterval{Real} <: GeneralInterval{>:Int}
true
The purpose of abstract types is to create type hierarchies over concrete types.
This is exactly how the parametric abstract type is used in the following example.

struct Interval{T} <: GeneralInterval{T}
    left::T
    right::T
end
With these definitions, it is ensured that unit intervals always have length one.

julia> UnitInterval{Int}(0, 1)
UnitInterval{Int64}(0, 1)

julia> UnitInterval{Int}(0, 2)
ERROR: AssertionError: right - left == 1
Second, further inclusions can be realized using the notation <:type
explained above. In the first example here, Interval{Int} is not a subtype of
GeneralInterval{Real} by the general rule above. However, in the second
example, <:Real makes it possible to denote such a set of types; Interval{Int}
is a subtype of GeneralInterval{<:Real} because Interval is a subtype of the
abstract type GeneralInterval by its definition and because the parameter type
Int is a subtype of Real.
Another use of the notation <:type is to restrict the allowed types. In the
following example, we define intervals of characters and of integers.

struct CharInterval{T<:AbstractChar} <: GeneralInterval{T}
    left::T
    right::T
end
The purpose of tuples is to model the argument lists of functions. Tuple types
take multiple type parameters, each corresponding to the type of an argument
in order. Because of their purpose, they have the following special properties.
1. Tuple types may be parameterized by an arbitrary number of types.
2. Tuple types are only concrete if their type parameters are.
3. Tuple{𝑆1 , … , 𝑆𝑛 } is a subtype of Tuple{𝑇1 , … , 𝑇𝑛 } if each type 𝑆𝑖 is a subtype
of the corresponding type 𝑇𝑖 . This property is, of course, what is needed to
determine which methods of a generic function match an argument list.
4. In contrast to composite types, tuples do not have field names, and hence
their fields can only be accessed by their index. However, named tuples do
have field names.
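Property 3, the covariance of tuple types in their parameters, can be checked directly; note that other parametric types such as Vector are invariant:

```julia
# Tuple types are covariant in their parameters ...
covariant = Tuple{Int64, Float64} <: Tuple{Real, Real}   # true

# ... whereas parametric composite types are not.
invariant = Vector{Int64} <: Vector{Real}                # false
```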
Because tuples are used for passing arguments to functions and receiving
return values from functions and hence are a frequently used type, the syntax to
construct tuple values is very simple. In addition to the default constructors such
as Tuple{Int}(1), tuple values can be written in parentheses with commas in
between, and an appropriate type is automatically constructed as well. It is
important to note that a tuple with a single element is still written with a comma
at the end in order to make the syntax unambiguous, as seen here.
julia> typeof((1))
Int64

julia> typeof((1,))
Tuple{Int64}

julia> typeof((1, 2.0, 3//1, 4im))
Tuple{Int64, Float64, Rational{Int64}, Complex{Int64}}

julia> ans <: Tuple{Int, Real, Real, Complex}
true
While tuples do not have field names, named tuples do. The NamedTuple type
takes two parameters, namely a tuple of symbols indicating the field names and a
tuple with the field types. The corresponding parameterized type is constructed
automatically when a named tuple is constructed.

julia> typeof((arg1 = 1, arg2 = 2.0, arg3 = 3//1))
NamedTuple{(:arg1, :arg2, :arg3), Tuple{Int64, Float64,
Rational{Int64}}}
The first argument of the constructor NamedTuple specifies the names, and the
second, optional one specifies the types. If the types are specified, the arguments
are converted using convert; otherwise, their types are inferred automatically.
Note that the values are specified as tuples as well.
julia> NamedTuple{(:a, :b)}((1, 1.0))
(a = 1, b = 1.0)

julia> typeof(ans)
NamedTuple{(:a, :b), Tuple{Int64, Float64}}

julia> NamedTuple{(:a, :b), Tuple{Int8, Float32}}((1, 1))
(a = 1, b = 1.0f0)

julia> typeof(ans)
NamedTuple{(:a, :b), Tuple{Int8, Float32}}

In the second example, the types are indicated as well, and a NamedTuple type
is constructed accordingly. The arguments to the constructor are converted to
the indicated types when the object is created.
struct Interval
    left_open::Bool
    left::Number
    right_open::Bool
    right::Number
end

The default Base.show method prints values such that the resulting string yields
a valid object again after parsing.

julia> Interval(false, 0, true, 1)
Interval(false, 0, true, 1)
The generic function Base.show takes the output stream as its first argument
(usually called io::IO) and the object to be printed as its second. Note that it
is necessary to mention the module name Base to add a method to the correct
generic function.

function Base.show(io::IO, i::Interval)
    print(io,
          i.left_open ? "(" : "[",
          i.left, ", ", i.right,
          i.right_open ? ")" : "]")
end
Abstract, composite, and a few other types are instances of the type DataType, as
seen in this example.

julia> (typeof(Int), typeof(Any))
(DataType, DataType)
As we have seen already, ordinary functions can operate on types, since they are
objects of type DataType themselves. In this section, the operations on types are
briefly summarized.
The subtype operator <: determines whether the type on its left is a subtype
of the type on its right. The function isa determines whether its first argument
is an object of the second argument, a type. The function typeof returns the type
of its argument.

julia> typeof(DataType)
DataType
The function supertype returns the supertype of its argument, and supertypes
returns all supertypes of its argument. Finally, the function subtypes returns all
subtypes.
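A brief sketch of these operations side by side:

```julia
a = Int64 <: Real        # true: Int64 is below Real in the hierarchy
b = isa(1, Real)         # true: the value 1 is an instance of a subtype of Real
c = typeof(1.0)          # Float64
d = supertype(Float64)   # AbstractFloat
```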
Problems
5.1 (Numerical tower) Use the functions in Sect. 5.3 to obtain the numerical
tower, i.e., the whole hierarchy below the type Number. Draw the numerical tower
(by hand, but see Problem 5.2). How does the numerical tower correspond to the
sets ℕ, ℤ, ℚ, ℝ, and ℂ?
5.2 (Visualize type graph) * Write a function that obtains the full type graph.
Then write a function that writes an input file for graph visualization software
such as Graphviz to plot the type graph.
5.3 (Type hierarchy for intervals) Define a type hierarchy for intervals below
an abstract type called GeneralInterval. The types in the type hierarchy should
provide for intervals with finite and infinite numbers of elements. Define
intervals for Unicode characters, for integers, for rational numbers, and for
floating-point numbers.
5.4 (Interval arithmetic) Define generic functions for the addition, subtraction,
multiplication, and division of numeric intervals. The functions should take
numeric intervals of the same type as their inputs and return a newly
constructed interval of the same type (and a Boolean value to indicate whether a
division by zero occurred in the case of division).
References
1. Goldberg, D.: What every computer scientist should know about floating-point arithmetic.
ACM Computing Surveys 23(1), 5–48 (1991)
2. Pierce, B.: Types and Programming Languages. MIT Press (2002)
Chapter 6
Control Flow
Abstract The control flow in a function or program is not just linear, but is
determined by branches, loops, and non-local transfer of control. The standard
control-flow mechanisms usually found in high-level programming languages
such as compound expressions, conditional evaluation, short-circuit evaluation,
repeated evaluation, and exception handling are available in Julia and are
discussed in detail in this chapter. Additionally, tasks are a powerful mechanism
for non-local transfer of control and make it possible to switch between
computations. Parallel and distributed computing is discussed in detail as well,
presenting various techniques for distributing computations efficiently and conveniently.
Similar to progn in Common Lisp, begin blocks and semicolon chains evaluate
the constituent expressions in order and return the value of the last expression.
A begin block begins with the keyword begin and ends with the keyword
end. Instead of these keywords, parentheses can also be used to the same effect,
resulting in a semicolon chain. The expressions in a begin block are separated
by newlines, by semicolons, or by both. The expressions in a semicolon chain
are separated by semicolons ;.
The following examples show the various cases that can occur. Global
variables a and b are defined in each example, i.e., begin blocks do not introduce a
new scope.
julia> c = begin
           global a = 3
           global b = 2
           a^a // b^b
       end
27//4

julia> c = begin
           global a = 3; global b = 2;
           a^a // b^b;
       end
27//4

julia> c = begin global a = 3; global b = 2; a^a // b^b end
27//4

julia> c = (global a = 3; global b = 2; a^a // b^b)
27//4
Empty expressions such as begin end and (;) return nothing, which is of
type Nothing and which is not printed by the REPL.

julia> begin end

julia> ans == nothing
true

julia> typeof(begin end)
Nothing
All of the elseif clauses as well as the else clause are optional. The elseif
clauses elseif conditionᵢ expressionsᵢ can be repeated arbitrarily often.
If the Boolean expression condition₀ in the if clause is true, then the corresponding
expressions expressions₀ are evaluated; if it is false, then the condition
condition₁ in the first elseif clause is evaluated. Again, if it is true, then the
corresponding expressions expressions₁ are evaluated; if it is false, then the next
elseif clause is considered, etc. If none of the conditions is true, then the
expressions after the else clause are evaluated.
In other words, the expressions following the first true condition are evaluated,
and the rest of the conditions are not considered anymore. If none of the
conditions is true, the expressions in the else clause are evaluated if present.
function my_sign(x)
    if x < 0
        typeof(x)(-1)
    elseif x == 0
        typeof(x)(0)
    else
        typeof(x)(1)
    end
end

julia> typeof(my_sign(Int8(2)))
Int8
if blocks do not introduce a new scope, implying that local variables that are
defined or changed within an if block remain visible after the if block.
if expressions return a value after having been evaluated, namely the value
of the last expression evaluated.

julia> foo = if 1 > 0 "yes" else "no" end
"yes"
There is an additional syntax, the so-called ternary operator, for if else end
blocks whose expressions are single expressions. The ternary operator

condition₀ ? expression₀ : expression

is equivalent to

if condition₀
    expression₀
else
    expression
end.

julia> compare(0, 1)
"0 is less than 1"

Ternary operators can be chained, and to facilitate this, the ternary operator
associates from right to left.
function compare(a, b)
    string(a) * " is " *
        (a < b ? "less than " :
         a == b ? "equal to " : "greater than ") *
        string(b)
end
The Boolean operators && and || implement the logical “and” and “or” operations.
However, not all of their arguments are generally evaluated; only the minimum
number of arguments necessary to determine the value of the whole expression
is evaluated from left to right. This means that the evaluation is
short-circuited if the value of the whole expression can be known in advance.
For example, in the expression a && b, the second argument b is evaluated
only if a evaluates to true, since otherwise – if a is false – it is already obvious
that the whole expression must be false after evaluating a. Analogously, in the
expression a || b, the second argument b is evaluated only if a evaluates to
false.
The && operator has higher precedence than the || operator, which sometimes
makes it possible to leave out parentheses. However, it is preferable in most
cases to write out the parentheses in order to make the intent of the program
immediately clear.
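A minimal sketch illustrating this precedence rule (the particular Boolean values are arbitrary):

```julia
a = true || true && false    # parsed as true || (true && false), so a is true
b = (true || true) && false  # explicit parentheses change the result to false
```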
Short-circuit evaluation can also be used as an alternative short form for certain
short if expressions. For example, when checking the arguments of a function
and acting accordingly, checks may only occupy one line. In this example,
the two functions are equivalent. (In general, using the @assert macro to check
arguments is preferable.)
function fib1(n::Int)
    n >= 0 || error("n must be non-negative")
    0 <= n <= 1 && return n
    fib1(n-1) + fib1(n-2)
end

function fib2(n::Int)
    if n < 0 error("n must be non-negative") end
    if 0 <= n <= 1 return n end
    fib2(n-1) + fib2(n-2)
end
Whether saving a few characters is worth the terser and slightly obscured appear-
ance of the first version lies in the eye of the beholder.
The condition expressions in if expressions, in the ternary operator, and the
operands of the && and || operators must be Bool values. The only exceptions are
the last arguments in && and || chains, whose values may be returned.

julia> 0 || true
ERROR: TypeError: non-boolean (Int64) used in boolean context

julia> false || 0
0
Note that in contrast to the && and || operators, the functions & and | are
just generic functions without short-circuit behavior; they are only special in the
sense that they support infix syntax.

julia> true & true
true

julia> (&)(true, true)
true

julia> true | true
true

julia> |(true, true)
true
A while loop has the form

while condition
    expressions
end,

where the condition must be a Boolean expression. While the condition evaluates
to true, the expressions in the body of the while loop are evaluated. In contrast to
for loops, the programmer is responsible for defining and updating an iteration
variable if one is needed.
global i = 1
while i <= 3
    global i
    @show i
    i += 1
end

This example prints three lines as expected. Note that the global declaration is
needed here in order to change the variable i inside the loop. An equivalent version
of this loop is the following.

global j = 0
while j <= 2
    global j += 1
    @show j
end
Both loops are linked via the generic function iterate, which we have to access
as Base.iterate when defining additional methods. It must have two methods
for the type of the iterable, namely one taking one argument and one taking two
arguments.
We consider the example of iterating over the coordinates of a three-dimensional
point and first define a data structure called Point (see Sect. 5.4).

struct Point
    x::Float64; y::Float64; z::Float64
end

The first method takes one argument (as in the call of iterate before the while
loop above) and returns an iterate and a state.

function Base.iterate(p::Point)::Tuple
    (p.x, Symbol[:y, :z])
end
The state can be any object, but it should be defined in such a way that it is conducive
to iterating by the second method. The second method takes the iterable
data structure and a state as its two arguments and returns an iterate and a state
as well.

function Base.iterate(p::Point, state::Vector)::Union{Nothing, Tuple}
    if isempty(state)
        nothing
    else
        (getfield(p, state[1]), state[2:end])
    end
end

Having defined the two iterate methods, we can use the while loop above to
iterate over the coordinates of a Point.

global point = Point(1, 2, 3)
global next = Base.iterate(point)
while next !== nothing
    local (iterate, state) = next
    @show iterate, state
    global next = Base.iterate(point, state)
end

Much more interestingly, however, we have extended the built-in for loop by
defining these two iterate methods. This also means that iterable data structures
are those for which iterate methods have been defined.
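For example, the output shown below can be produced by a for loop over a Point; the loop itself is missing from this copy, so the following sketch is a reconstruction (the struct and the two iterate methods are repeated from above).

```julia
struct Point
    x::Float64; y::Float64; z::Float64
end

# first method: start the iteration at the x coordinate
function Base.iterate(p::Point)::Tuple
    (p.x, Symbol[:y, :z])
end

# second method: continue with the remaining field names as the state
function Base.iterate(p::Point, state::Vector)::Union{Nothing, Tuple}
    if isempty(state)
        nothing
    else
        (getfield(p, state[1]), state[2:end])
    end
end

for coord in Point(1, 2, 3)
    @show coord
end
```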
coord = 1.0
coord = 2.0
coord = 3.0
for loops can be nested, but nested loops can be written more succinctly using
the syntax

for i1 in iterable1, i2 in iterable2
    expressions
end,
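The example that produced the output below is not present in this copy; a sketch consistent with that output (the ranges 1:2 and 3:4 are inferred assumptions) is:

```julia
for i in 1:2, j in 3:4
    @show i, j
end
```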
(i, j) = (1, 3)
(i, j) = (1, 4)
(i, j) = (2, 3)
(i, j) = (2, 4)
As this example shows, the first iteration variable (i here) changes slowest and
the last iteration variable (j here) changes fastest.
Another extension of the basic syntax of for loops is destructuring of the iteration
variable. Destructuring of variables is an idea also found in Common Lisp
and Clojure. In Julia, it means that if the iteration variable is a tuple, then
its components are bound to the respective components of the elements of the
iterable data structure.

for (a, b) in ((1, 2), (3, 4))
    @show a, b
end
In the first iteration, the two components a and b of the iteration variable (a, b),
a tuple, are bound to the components of the first element (1, 2) of the iterable
data structure. This loop hence yields the following output.

(a, b) = (1, 2)
(a, b) = (3, 4)

The data structure to be iterated over may be any iterable data structure. In
the next example, it is a set.

for (a, b) in Set([(1, 2), (3, 4), (5, 6)])
    @show a, b
end

(a, b) = (1, 2)
(a, b) = (5, 6)
(a, b) = (3, 4)
It is only important that the elements of the iterable data structure are tuples that
are compatible with the iteration variable. If the iteration variable is a tuple, it is
compatible with the tuples in the iterable data structure if it has the same number
of elements or fewer. If the tuple acting as the iteration variable is too long,
an error is raised. If it is shorter than the data, then only the given elements are
bound as shown in the following example. (Recall that the syntax for a tuple with
a single element is (a,), which is necessary to distinguish it from (a), which is
the same as just a.)

for (a,) in Set([(1, 2), (3, 4), (5, 6)])
    @show a
end

a = 1
a = 5
a = 3
Here i is a counter starting at 1 and p iterates over the collection which is the
argument of enumerate.
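The enumerate loop itself is missing from this copy; a sketch consistent with the output below (the collection of primes is an assumption) is:

```julia
for (i, p) in enumerate([2, 3, 5, 7, 11])
    println("prime number no. $i: $p")
end
```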
prime number no. 1: 2
prime number no. 2: 3
prime number no. 3: 5
prime number no. 4: 7
prime number no. 5: 11
It is possible to stop while and for loops using the break keyword. This example
returns the smallest prime number greater than or equal to 2020.

import Primes

global i = 2020
while true
    if Primes.isprime(i)
        break
    else
        global i += 1
    end
end
i
Finally, the classical go to statement is available as the @goto macro and is used
in conjunction with the @label macro. Although go to statements have generally
fallen out of favor in modern programming style, they are very useful in
certain cases. A prime example is the implementation of finite-state machines,
where a state transition table and the @goto macro can be used to switch between
the states. Examples of finite-state machines are parsers and regular-expression engines.

import Primes

function find_first_prime_after(n::Integer)::Integer
    @assert n >= 2
    @label start
    if Primes.isprime(n)
        return n
    else
        n += 1
        @goto start
    end
end
The language constructs discussed so far result in local control flow; even condi-
tional and repeated evaluation cannot result in non-local transfer of control. By
considering the program text locally, it is clear which expression will be evalu-
ated next.
Throwing an exception results in non-local control flow. Exceptions are useful in
situations when an unexpected condition occurs and a function cannot compute
and return the value it is meant to return. In these cases, exceptions can be
thrown and caught. When the exception is caught, it is decided how best to proceed,
e.g., by terminating the program, printing an error message, or by taking a
corrective action such as retrying.
The built-in exceptions are subtypes of the abstract type Exception and can be
listed by @doc Exception. Using the type system, it is also possible to define
custom exceptions as in this example.
struct MyAwesomeException <: Exception
end

Exceptions are thrown by the throw function. This example throws the fitting
DomainError for negative arguments when Fibonacci numbers are calculated.

function fib(n::Integer)::BigInt
    if n < 0
        throw(DomainError("negative integer"))
    elseif n <= 1
        n
    else
        fib(n-2) + fib(n-1)
    end
end
The argument to throw must be an exception and not a type of exception. The
function call DomainError(arg) yields an exception, while DomainError is the
type.

julia> typeof(DomainError("foo")), typeof(DomainError)
(DomainError, DataType)

julia> typeof(DomainError("foo")) <: Exception
true

As usual, the built-in function <: tests whether the left-hand side is a subtype of
the right-hand side.
Exceptions take arguments to describe the situation as in this example.

julia> throw(UndefVarError(:foo))
ERROR: UndefVarError: foo not defined
User-defined exceptions can also take arguments, which are the fields of the
exception type (see Sect. 5.4). In the following example, we define an exception
type called DivisionByZero with a field called numerator, which holds additional
information.

struct DivisionByZero <: Exception
    numerator::Number
end
At this point, we know how to throw exceptions. For non-local control flow,
we must also be able to catch the exceptions. This facility is provided by the
following expression.

try
    body
catch [exception]
    handler
finally
    cleanup
end
In the second example, the exception is bound to the variable exc for closer
inspection.

function my_log(x)
    try
        log(x)
    catch exc
        if isa(exc, DomainError)
            log(Complex(x, 0))
        else
            @info "The next exception has been rethrown."
            rethrow(exc) # or just rethrow()
        end
    end
end
The syntax of the try expression requires some care regarding the variable
name after the catch keyword. It must be on the same line as the catch keyword.
Since any symbol after the catch keyword is interpreted as the variable name
for the exception unless it is written on a new line or separated by a semicolon,
one must also be careful when the intention is to return the value of another variable
and the whole expression is written on a single line. The following function
returns its argument x if an exception is raised.

function my_log(x)
    try
        log(x)
    catch # no variable name for the exception
        x
    end
end
If the try expression is written on a single line, it must look like this. Note the
semicolon; it ensures that x is not interpreted as the variable name for the exception,
but as the return value of the handler expressions.

my_log(x) = try log(x) catch; x end

julia> my_log(-1)
-1
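The variant that the next two REPL inputs presumably refer to is missing from this copy; it is the version without the semicolon, in which x is parsed as the variable name for the exception and the handler body is empty, so the call returns nothing:

```julia
# x after catch names the exception here; the handler body is empty
my_log(x) = try log(x) catch x end
```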
julia> my_log(-1)

julia> my_log(-1) == nothing
true
function my_log(x)
    try
        log(x)
    catch
        error("still something went wrong")
    finally
        println("cleaning up")
    end
end

julia> my_log(-1)
cleaning up
ERROR: still something went wrong

julia> unfortunate_reader("/etc/passwd")
stream open: false
ERROR: something went wrong, e.g., while parsing
The general facility to log messages is the @logmsg macro. Often the four standard
logging macros @debug, @info, @warn, and @error are used, which are based
on @logmsg and log messages at the four standard levels Debug, Info, Warn, and
Error. The first argument to these four macros should be an expression that evaluates
to a string that describes the situation. The string is formatted as Markdown
when printed. Further, optional arguments can be of the form key = value
or value and are attached to the log message.

julia> @info "the answer is" answer = 42
Info: the answer is
  answer = 42

julia> @warn "something unexpected occurred" answer = log(Complex(-1, 0))
Warning: something unexpected occurred
  answer = 0.0 + 3.141592653589793im

julia> @error "cannot divide by zero" denominator = 0
Error: cannot divide by zero
  denominator = 0

On the other hand, the error function (and not the macro) raises an exception
of type ErrorException.

julia> try error("foo") catch exc exc end
ErrorException("foo")
6.5.4 Assertions

Assertions are useful to ensure that certain conditions are always satisfied during
the evaluation of a program. For example, an expression may be known to be
invariant in a loop; these invariants may be conserved quantities such as energy
or angular momentum in a physical simulation. Assertions are also a convenient
way to check whether argument values are valid.
Assertions are written using the macro @assert, which takes a Bool expression
as its first argument and an informative message as its optional second argument.

julia> @assert 1 < 0
ERROR: AssertionError: 1 < 0

julia> @assert 1 < 0 "something is really wrong"
ERROR: AssertionError: something is really wrong
In the following example, we know that ∑_{k=1}^∞ 1/k² = π²/6. Since all terms
are positive, all partial sums must be less than π²/6, which is checked by an
assertion. The argument value is also checked by an assertion.
function summation(n::Integer)
    @assert n >= 1
    local s = BigFloat(0)
    for k in 1:n
        s += 1/BigFloat(k)^2
        @assert s < pi^2/6
    end
    s
end
import Primes

function my_producer(ch::Channel, n::Integer)
    @assert n >= 1
    for p in Primes.primes(1, n)
        put!(ch, p)
    end
end
The producer function must be scheduled to run in a new Task. The most convenient
way to do so is to use the Channel constructor that takes a function as
its argument and runs a Task associated with the new Channel. The function
given as the argument to the constructor must take one argument, namely the
Channel. In our example, a Channel and an associated Task can be created by
chan = Channel(ch -> my_producer(ch, 9)), for example.
Values can be consumed from a Channel by the function take!; in our example,
evaluating take!(chan) yields consecutive prime numbers. Values can also
be conveniently consumed in for loops by iterating over the Channel as in the
following example.
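For instance, consuming values with take! looks as follows; this is a self-contained sketch that inlines a small producer instead of my_producer.

```julia
chan = Channel() do ch
    for p in (2, 3, 5, 7)  # the first few primes, as my_producer would yield
        put!(ch, p)
    end
end

take!(chan)  # yields 2
take!(chan)  # yields 3
```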
function my_consumer(n::Integer)
    @assert n >= 1
    for i in Channel(ch -> my_producer(ch, n))
        println(i)
    end
end

In the for loop, values are consumed as long as they are available from the
Channel.

julia> my_consumer(9)
2
3
5
7
You may have expected to close the Channel. This is not necessary here, since the
Channel is associated with the Task, and therefore the Channel stays open only as
long as the Task exists. The Task terminates when the function
returns, at which point the Channel is closed automatically.
The operations on Channels are summarized in Table 6.2. When creating a
Channel, the type of the values to be passed may be specified. If no such type
argument is given, the general type Any is used by default. If a function argument
is given to the constructor, a Task is created and associated with the new Channel
as discussed above. However, a Channel may also be created without the function
argument. The size of the buffer of the Channel may be specified, where the
default size zero creates an unbuffered Channel. Channel(Inf) is equivalent to
Channel{Any}(typemax(Int)).
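A short sketch of these constructor variants and the basic operations (the element type Int and the buffer size 2 are arbitrary choices):

```julia
ch = Channel{Int}(2)  # buffered Channel for Int values, no associated Task
put!(ch, 1)
put!(ch, 2)           # a third put! would block until a take!
close(ch)
vals = collect(ch)    # draining the remaining buffered values yields [1, 2]
```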
Task states are described in Table 6.4. A newly created Task is initially not known
to the scheduler and is therefore not run.
In cooperative multitasking, most Task switches are the result of waiting for
events such as input or output requests. The generic function wait is the basic
way to wait and includes methods for several types of objects such as Tasks,
Channels, Distributed.RemoteChannels, Base.Events, and Base.Processes.
The function wait is usually called implicitly; for example, the function read
uses wait to wait for data to become available.
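A minimal sketch of creating, scheduling, and waiting for a Task:

```julia
t = @task begin
    sleep(0.1)
    42
end

istaskstarted(t)  # false; the new Task is not yet known to the scheduler
schedule(t)       # hand the Task to the scheduler
wait(t)           # block until the Task has finished
fetch(t)          # yields 42, the value of the Task's last expression
```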
We now discuss how jobs or workloads can be distributed to Tasks and how
Channels can be used for communication between these Tasks and to collect the
results. This is a running example, and these techniques are already useful
for sequential computing on a single processor. (Parallel computing is discussed
in Sect. 6.7 below.) For example, the jobs may be functions that mostly deal with
input/output operations and that hence may wait for a substantial amount of
time.
In the let expression below, we define two variables that are buffered
Channels and that hold the jobs and their results. The jobs and the results are
NamedTuples, and the buffer sizes of both Channels are only 5. The first function
we define in this example creates jobs. We just draw a random number between
zero and one, which represents the job. After all jobs have been written to the
Channel, it is closed.
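The definition of this first function, make_jobs, is missing from this copy; the following reconstruction is an assumption based on the description above and on the fields j.id and j.workload used by the worker function below.

```julia
# hedged reconstruction of make_jobs (the exact original body is not available)
function make_jobs(jobs, n::Integer)
    for i in 1:n
        # each job is a NamedTuple whose workload is a random number in [0, 1)
        put!(jobs, (id = i, workload = rand()))
    end
    close(jobs)
end
```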
The second function does the work. It uses a for loop to get all available jobs
and writes the results of performing the jobs into the other channel. The work is
trivial, as it is just sleeping, but when running the example, it illustrates nicely
how long it takes all jobs to finish.
function work(jobs::Channel, results::Channel, id::Integer)
    println("Worker $id started.")
    for j in jobs
        sleep(j.workload)
        put!(results, (id = j.id, worker = id, time = j.workload))
    end
    println("Worker $id finished.")
end
Next, we create ten jobs by wrapping the call to make_jobs in the @async macro.
The @async macro creates a Task and adds it to the queue of the scheduler. It
is expedient to use @async at this point, since creating the job descriptions may
take some time or the number of jobs may exceed the buffer size, but in this
way work can start immediately. Having created the jobs, we start three tasks by
wrapping calls to work in the @async macro. Then, we take the results from the
buffered Channel and print the total elapsed time.
let n = 10
    local jobs = Channel{NamedTuple}(5)
    local results = Channel{NamedTuple}(5)
    @async make_jobs(jobs, n)
    for i in 1:3
        @async work(jobs, results, i)
    end
Worker 1 started.
Worker 2 started.
Worker 3 started.
Job 2 performed by worker 2 took 0.224 seconds.
Job 1 performed by worker 1 took 0.229 seconds.
Job 3 performed by worker 3 took 0.533 seconds.
Job 4 performed by worker 2 took 0.755 seconds.
Job 5 performed by worker 1 took 0.818 seconds.
Job 6 performed by worker 3 took 0.936 seconds.
Job 8 performed by worker 1 took 0.552 seconds.
Worker 2 finished.
Job 7 performed by worker 2 took 0.95 seconds.
Worker 3 finished.
Job 9 performed by worker 3 took 0.749 seconds.
Worker 1 finished.
Job 10 performed by worker 1 took 0.796 seconds.
2.934174 seconds (4.15 M allocations: 180.415 MiB, 3.95% gc time)
function source(ch, n)
    for i in Primes.primes(1, n)
        sleep(rand())
        println("Putting $(i).")
        put!(ch, i)
    end
    close(ch)
end

function sink(ch)
    for i in ch
        println("Taking $(i).")
    end
end
let ch = Channel(0)
    @sync begin
        @async source(ch, 10)
        @async sink(ch)
    end
end
The let expression runs these two functions. First, an unbuffered Channel is
created. In the begin expression, Tasks for the calls to the source and sink
functions are created and run asynchronously by the macro @async. The @sync
macro outside the begin expression waits until both tasks are done; it will return
only when the for loop in the sink function has returned.
The final example in this section shows how Conditions can be used. The
function wait_a_bit is run in a new Task after having been started by schedule
and notifies the Condition when it is done. After notify has been called, control
flow continues after the call to wait, which has been waiting for the Condition.
This construct is more efficient than a loop that polls the Condition repeatedly.
let c = Condition()
    function wait_a_bit()
        println("Waiting for Godot.")
        sleep(1 + rand())
        notify(c)
    end
    schedule(@task wait_a_bit())
    wait(c)
    println("He has arrived.")
end
on a certain process and can be used from any process; there are two types of
remote references, namely Futures and RemoteChannels.
A remote call is the request by a process to call a certain function on certain
arguments on another (or the same) process. Every remote call returns immediately,
and the result of a remote call is a Future. The return value of the remote
call can be obtained using the function fetch, or the function wait can be called
on the Future to wait until the result is available.
The Julia system can use multiple processes. The process associated with the
REPL always has id 1, and additional processes with higher ids can be started and
are called workers. If there is only the process with id 1, it is considered the only
worker. The number of workers can be supplied when starting Julia using the
command-line arguments -p or --procs. The argument should be equal to the
number of available (logical) cores or to auto, which determines the number of
(logical) cores automatically. If the arguments -p or --procs are supplied on the
command line, then the built-in module Distributed is loaded automatically.

> julia -p auto
julia> length(workers())
24
Within Julia, the workers can be managed using the functions workers,
addprocs, and rmprocs. The module Distributed must be loaded on the process
with id 1 before using addprocs to add workers.
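A short sketch of managing workers from within Julia (the worker count 2 is an arbitrary choice):

```julia
using Distributed

addprocs(2)         # start two additional worker processes
length(workers())   # yields 2
rmprocs(workers())  # shut the workers down again
```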
Another option to start worker processes is to use the --machine-file
command-line option. Then Julia uses passwordless ssh login to start workers
on the machines specified in the supplied file.
Worker processes differ from the process with id 1 by not evaluating the
startup.jl startup file. Their global state, i.e., loaded modules, global variables,
and generic functions and methods, is not synchronized automatically between
processes. The common way to load modules or program files into all workers is
to use the @everywhere macro by writing

@everywhere import module

or

@everywhere include("filename").
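The example discussed next is not shown in this copy; presumably it resembled the following sketch (the variable name and the value are assumptions).

```julia
using Distributed

name = "Julia"                       # a global variable on process 1
f = @spawnat :any "Hello, $(name)!"  # name is copied to the chosen process
fetch(f)                             # yields "Hello, Julia!"
```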
The example shows that required data are copied to the worker process automatically;
here, the global variable name is accessed by the expression to be spawned
and therefore it is made available to the worker process. After the expression has
been evaluated on the worker process, fetch fetches its value from the worker
process and returns it.
The example could have been written more succinctly using @fetchfrom,
which is equivalent to fetch after @spawnat. The function remotecall_fetch
is equivalent to applying fetch to the result of remotecall, but it is more efficient.
The function remote_do evaluates a function on a worker with a given id,
but does not yield the return value of the function. It is also not possible to wait
for the completion of the function call.
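Minimal sketches of these two functions (the process id 1, i.e., the REPL process itself, and the arguments are arbitrary choices):

```julia
using Distributed

remotecall_fetch(+, 1, 1, 2)              # run + on process 1 and fetch the result, 3
remote_do(println, 1, "fire and forget")  # no return value; cannot be waited for
```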
To illustrate the use of worker processes and RemoteChannels, we rework
the jobs example in Sect. 6.6 to use remote workers and channels. The function
make_jobs remains essentially unchanged.
The function work to be run on the workers must be made available to all processes.
Therefore we wrap the function definition into the @everywhere macro.

@everywhere function work(jobs::RemoteChannel,
                          results::RemoteChannel)
    while true
        local j
        try
            j = take!(jobs)
        catch exc
            break
        end
        sleep(j.workload)
        put!(results, (id = j.id, worker = myid(),
                       time = j.workload))
    end
    println("Worker $(myid()) has no more jobs to do.")
end
Since for loops over RemoteChannels are not supported directly, we use a while
loop instead. We would like to check whether the jobs channel is still open and
then take a value from it. However, in the time between checking and taking a
value, another worker process may have snatched the last available value, resulting
in a race condition. Therefore we just try to take a value from the channel
and catch any possibly resulting exception. If there is an exception, we know
that the channel has been closed and we break the loop and hence end the
function.
With these function definitions, we can run remote workers and communicate
via RemoteChannels. After having defined the channels, we create the jobs
asynchronously. On each available worker, we execute the work function using
remote_do. In the final while loop, we collect all results.
let n = 10
    local jobs = RemoteChannel(() -> Channel{NamedTuple}(5))
    local results = RemoteChannel(() -> Channel{NamedTuple}(5))
    @async make_jobs(jobs, n)
    for w in workers()
        remote_do(work, w, jobs, results)
    end
In this particular example, there are fewer jobs than worker processes. Therefore
some of the ˞ɴʝɖ functions finish immediately. All jobs are finished after about
the time it takes the longest job to finish.
programmer is not involved in the low-level tasks of moving data and managing
worker processes.
We consider a simple mathematical example that can be parallelized in a
straightforward manner, namely the approximation of π using random numbers.
The area of the sector of the unit circle around the origin within the square
[0, 1] × [0, 1] is π/4. If we draw uniformly distributed random numbers X and Y
from the interval [0, 1], then the fraction of these pairs (X, Y) within this circle
sector (i.e., with X² + Y² ≤ 1) will thus be approximately π/4 of the number
of all pairs (X, Y) drawn. Hence we have found an algorithm that calculates a
Monte Carlo approximation of π/4.
In the first step, we save the following function pi in a file called pi.jl.

function pi(n::Int)::Float64
    local counter::Int = 0
    for i in 1:n
        if rand()^2 + rand()^2 <= 1
            counter = counter + 1
        end
    end
    4 * counter / n
end
Since we started Julia with the -p command-line option, the Distributed module
(and hence the macro Distributed.@spawnat) is already available; otherwise
we would need using Distributed.
The @spawnat macro takes two arguments, namely the id of the process to
be used and an expression. The expression is wrapped into a closure and run
asynchronously on the specified process, and a Future is returned. If the process
id is equal to :any, the scheduler picks the process to be used.
julia> pi1 = @spawnat 2 pi(1000)
Future(2, 1, 74, nothing)

julia> pi2 = @spawnat :any pi(1000)
Future(2, 1, 75, nothing)

julia> (fetch(pi1) + fetch(pi2)) / 2
3.144
Here we have used fetch to obtain the return value of the spawned function from
the Future. The average of the two approximations yields three correct digits of π
(at least in this run). This approach is still low level, as some programming work
is required to spawn the expressions and to collect their values.
Parallel for loops provide a convenient way to distribute expressions to processes
and to collect the results.
The @distributed macro turns the for loop into a parallel for loop. Its first
(optional) argument is a function that will process the values returned by each
iteration; here the argument is supplied as (+) (instead of just +) for a syntactic
reason. All iterations are performed on the worker processes, and each iteration
returns the value of its last expression. The postprocessing is performed on the
calling process.
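The function parallel_pi timed in the next listing is not defined anywhere in this copy; the following reconstruction is a sketch whose exact body is an assumption, inferred from the call parallel_pi(length(workers()), 10^9) and the description above. The Monte Carlo estimator pi(n) is repeated from the listing above; with real worker processes it must be made available to them with @everywhere.

```julia
using Distributed

function pi(n::Int)::Float64  # repeated from above
    local counter::Int = 0
    for i in 1:n
        if rand()^2 + rand()^2 <= 1
            counter = counter + 1
        end
    end
    4 * counter / n
end

# average k independent estimates, one per loop iteration (hedged reconstruction)
function parallel_pi(k::Int, n::Int)::Float64
    (@distributed (+) for i in 1:k
        pi(n)
    end) / k
end
```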
This is all that is needed to run this Monte Carlo algorithm in parallel.

julia> @time parallel_pi(length(workers()), 10^9)
 11.515523 seconds (62.33 k allocations: 3.142 MiB)
3.141595292500004
In this run, we obtained six correct decimal digits of π by running 24 ⋅ 10⁹ samples
in total. By using length(workers()) as the first argument, this number
of iterations is distributed to the same number of workers. If the argument is
10 * length(workers()), each worker receives ten loop iterations.
When one is interested in receiving the values calculated in all loop iterations,
the ˛ȆǤʲ function can be used to return a vector with all values.
ɔʼɜɃǤљ ЪȍɃʧʲʝɃȂʼʲȕȍ ˛ȆǤʲ ȯɴʝ Ƀ Ƀɪ ЖђɜȕɪȱʲȹФ˞ɴʝɖȕʝʧФХХ ʙɃФЖЕѭОХ ȕɪȍ
ЗЙвȕɜȕɦȕɪʲ ùȕȆʲɴʝШOɜɴǤʲЛЙЩђ
ИѐЖЙЖКОИМЗЙ
ѐѐѐ
All variables used inside a parallel for loop will be copied to each worker
process. On the other hand, any changes to these variables will not be visible
after the loop has finished, i.e., the values of the variables are not copied back. It
is also important to note that the order of the iterations is unspecified.
It is possible to omit the function that processes the values when calling
@distributed. Then the loop iterations are spawned on all available workers
and an array of Futures is immediately returned without waiting for the iterations
to finish. The functions wait and fetch can be applied to the Futures as
usual, or it is possible to wait for the completion of all iterations by calling @sync
on the result, i.e., by writing

@sync @distributed for
    ...
end.
Using a parallel for loop with vcat as the postprocessing function is equivalent
to using the pmap function, which is the parallel version of the mapping
function map. Using pmap, we can rewrite parallel_pi above succinctly as follows.
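The pmap version referred to here is missing from this copy; a sketch of what it presumably looked like (the exact form is an assumption; the Monte Carlo estimator pi(n) is repeated from above):

```julia
using Distributed

function pi(n::Int)::Float64  # repeated from above
    local counter::Int = 0
    for i in 1:n
        if rand()^2 + rand()^2 <= 1
            counter = counter + 1
        end
    end
    4 * counter / n
end

# one pi(n) evaluation per task, distributed by pmap, then averaged
parallel_pi(k::Int, n::Int)::Float64 = sum(pmap(_ -> pi(n), 1:k)) / k
```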
What is the difference between a parallel for loop and pmap? The pmap function is
meant to be used when evaluating the function is computationally expensive.
On the other hand, a parallel for loop can handle tiny computations in each
iteration well.
Problems
6.1 (Fizz-buzz)
Write a function that prints the numbers from 1 to 100. However, for multiples
of three, print “fizz” instead of the number; for multiples of five, print “buzz”;
and for multiples of both three and five, print “fizz-buzz”.
Lisp is now the second oldest programming language in present widespread use (after
Fortran and not counting APT, which isn’t used for programming per se). It owes its
longevity to two facts. First, its core occupies some kind of local optimum in the space
of programming languages given that static friction discourages purely notational
changes. Recursive use of conditional expressions, representation of symbolic
information externally by lists and internally by list structure, and representation of
program in the same way will probably have a very long life.
Second, Lisp still has operational features unmatched by other languages that make it a
convenient vehicle for higher level systems for symbolic computation and for artificial
intelligence. These include its run-time system that gives good access to the features of
the host machine and its operating system, its list structure internal language that
makes it a good target for compiling from yet higher level languages, its compatibility
with systems that produce binary or assembly level programs, and the availability of its
interpreter as a command language for driving other programs. (One can even
conjecture that Lisp owes its survival specifically to the fact that its programs are lists,
which everyone, including me, has regarded as a disadvantage. Proposed replacements
for Lisp [. . . ] abandoned this feature in favor of an Algol-like syntax leaving no target
language for higher level systems).
Lisp will become obsolete when someone makes a more comprehensive language that
dominates Lisp practically and also gives a clear mathematical semantics to a more
comprehensive set of features.
—John McCarthy, History of Lisp (12 February 1979)
Abstract For several decades, Lisp macros have been the state of the art in
metaprogramming. Macros are expanded when a program is read, and thus
provide a mechanism for defining new language constructs by rewriting expres-
sions at read time, before compile time and evaluation time. In this chapter, the
concept of macros is explained using the example of macros in Common Lisp,
whose uniform syntax makes it well suited for this purpose. Then Julia macros
and their building blocks are presented in detail. Finally, useful built-in Julia
macros are discussed.
7.1 Introduction
The simple example we consider is a macro called do-thrice that takes an ex-
pression and executes it three times. We start gently in Common Lisp.

7.2 Macros in Common Lisp

(in-package :cl-user)
The following expression prints a message three times using a dotimes loop.

(dotimes (i 3)
  (print "Hello, world!"))

This expression is a list that contains the three elements dotimes, (i 3), and
(print "Hello, world!"). The first element is the name of the function, macro,
or special form to be called. (In fact, dotimes is a macro, but it is indistinguish-
able from a function if all we know is its name.) The second element (i 3) de-
fines the iteration variable i and specifies how many times the following expres-
sion will be repeated. The third element is the expression to be repeated.
We are already halfway to defining the macro do-thrice. The macro
defmacro defines a new macro. Its first argument is the name of the macro to
be defined, its second argument is the argument list, and the remaining argu-
ments are the expressions to be returned by the new macro. Therefore the first
version of our simple macro is the following.

(defmacro do-thrice-1 (&body body)
  `(dotimes (i 3)
     ,@body))

Here the argument list just means that all expressions that are passed to the
new macro will be contained in the local variable body. The backquote ` is com-
monly used in macro definitions to protect its argument from evaluation, just as
quote does in Julia. The syntax ,@ within a backquote means that the elements
of its argument, here body, are spliced into the surrounding list. If we did not
want to use the backquote syntax, we could also construct the expression, i.e., a
list, explicitly, but the purpose of the backquote syntax is to facilitate writing
macros and therefore we use it.
Common Lisp offers a simple way to check that our macro does what it is
supposed to do: macroexpand-1 expands a macro only once.

(macroexpand-1 '(do-thrice-1 (print "Hello, Julia users!")))

Symbols are printed in uppercase letters by default. The value T is the second
return value; it stands for true. We see that the macro do-thrice-1 does what
we intend it to do: it takes an expression and puts it inside a dotimes loop.
You might expect that there is also a function called macroexpand, and you
would be right. The function macroexpand expands a macro including all nested
macro calls. Evaluating the following expression also expands the call of the
dotimes macro, but the final expression is implementation dependent.
Next, we call our macro, which means that the resulting macro expansion is
evaluated.
Three lines are printed as expected. Here NIL is the return value.

"Hello, Julia users!"
"Hello, Julia users!"
"Hello, Julia users!"
NIL
However, there is a problem. Our use of dotimes defines the local variable i
in the macro expansion, and we can access its value.

(do-thrice-1 (print i))

Printing the value is relatively harmless, but we can also – a bit more maliciously
– change the value of the iteration variable and hence change the behavior of the
macro. (setq is short for "set quoted".)

(do-thrice-1 (setq i 2) (print "Printed only once."))
In the first iteration of the dotimes loop, the iteration variable is increased, which
prevents any further iterations, and the print expression is evaluated.
Such macros are called unhygienic, since they pollute the name space of vari-
ables. A hygienic version of the macro is the second version shown here.

(defmacro do-thrice-2 (&body body)
  (let ((i (gensym)))
    `(dotimes (,i 3)
       ,@body)))
The let form assigns a new unique symbol returned by gensym to the local vari-
able i when the macro is expanded. The name of this new unique symbol is then
used as the name of the iteration variable in the dotimes loop by splicing it as
,i into the expression returned by the macro, preventing any unwanted variable
capture.

(macroexpand-1 '(do-thrice-2 (print "Hello, Julia users!")))

The macro expansion shows that a variable called G456 is used as the iteration
variable. (More precisely, #:G456 is an uninterned symbol.)
Finally, we try to change the iteration variable i again.

(do-thrice-2 (setq i 2) (print "Hello, Julia users!"))

Now we only receive a warning that a variable i was not declared or defined
previously, while the macro still works as intended, printing three strings.
This first example illustrates how expressions can be rewritten and what hy-
gienic macros are. We will encounter the same concepts in Julia, where only
the names are different.

7.3 Macro Definition

In Julia, macros are defined by macro, analogous to function. A macro must re-
turn an expression, which is easily achieved by wrapping an expression between
quote and end. Within such a quoted expression, the value of a variable can be
substituted by prepending its name with a dollar sign $. This syntax is analo-
gous to the syntax for string interpolation (see Sect. 4.2.2); the quoted expression
and the dollar sign thus play the roles of the backquote and the comma in Com-
mon Lisp.
macro unhygienic_dotimes(n::Integer, expr::Expr)
    @assert n >= 0
    quote
        let i = 0
            while i < 3
                $expr
                i += 1
            end
        end
    end
end
The assertion in the first line is evaluated at macro-expansion time, and the
quoted expression is returned. The value of the local variable expr is substituted
into the quoted expression because of the dollar sign in $expr.
Just like functions, macros are generic in Julia, which means that methods
with the same name but with different argument signatures can be defined. The
method that best matches the arguments of the function or macro call will be
chosen.
Macros are called just like functions, but the arguments are not evaluated at
macro-expansion time.
julia> @unhygienic_dotimes(3, println("Printed three times."))
Printed three times.
Printed three times.
Printed three times.
Whenever a macro takes only one argument, the parentheses around the argu-
ments can be left out. Expressions passed as arguments to macros are created
as usual, separating them by semicolons within parentheses or using begin and
end.
The next call illustrates that we can change the value of the iteration variable
within the expression passed as an argument, showing that the macro is unhy-
gienic.

julia> @unhygienic_dotimes(3, begin i = 2; println("Printed once.") end)
Printed once.
quote
    let $var = 0
        while $var < 3
            $expr
            $var += 1
        end
    end
end
The mechanism built into Julia to facilitate the definition of hygienic macros
is called esc. It is best illustrated by a simple example. We first define a global
variable and a local variable in the expression returned by the macro, which
both have the same name foo but different values. The macro returns foo and
$(esc(foo)).
global foo = 0
macro escaped()
    quote
        local foo = 1
        (foo, $(esc(foo)))
    end
end
After calling the macro, we observe that foo evaluates to 1 and the escaped vari-
able evaluates to 0. (When calling a macro with no arguments, no parentheses
are required.)

julia> @escaped
(1, 0)

This result shows that foo refers to the local variable of the same name as ex-
pected, while $(esc(foo)) escapes the quote block and refers to the global vari-
able.
Next, we have a look at the macro expansion, which is very instructive. (Com-
ments have been deleted.)

julia> @macroexpand @escaped
quote
    local var"#10#foo" = 1
    (var"#10#foo", 0)
end
The expansion shows Julia's hygiene mechanism at work. All local variables
are renamed to new unique names in order to prevent unintended variable cap-
ture. In escaped expressions, however, these substitutions are not performed.
Therefore $(esc(foo)) can escape the quote block, and the value 0 of the global
variable called foo at macro-expansion time is used. This explains the output
(1, 0).
In summary, esc is only valid in expressions returned from a macro and pre-
vents renaming embedded variables into hygienic variables generated by gensym.
Knowing esc, we return to the dotimes macro and present its idiomatic ver-
sion in Julia. (Using a for loop is possible as well, of course, but would not allow
us to explain variable capture.)
macro escaped_dotimes(n::Integer, expr::Expr)
    @assert n >= 0
    quote
        let i = 0
            while i < 3
                $(esc(expr))
                i += 1
            end
        end
    end
end
Expanding the macro illustrates the substitution of variables in the quoted ex-
pression by gensym versions, except for the escaped variables. (Comments have
been deleted.)
julia> @macroexpand @escaped_dotimes(3,
           begin
               i = 2
               println("Printed thrice.")
           end)
quote
    let var"#13#i" = 0
        while var"#13#i" < 3
            begin
                i = 2
                println("Printed thrice.")
            end
            var"#13#i" += 1
        end
    end
end
7.4 Two Examples: Repeating and Collecting

In this section, we discuss two more examples of macro definitions. The first
example is called @repeat. Its purpose is to take an expression and a condition
and to repeat the expression until the condition is satisfied, just like repeat state-
ments in other programming languages. Since Julia usually has more syntactic
sugar than Lisp, we require the second argument of our @repeat macro to be the
symbol until. This also allows us to illustrate that the corresponding check is
evaluated when the macro is expanded. We substitute the values of the variables
expr and condition into the right places in a while loop, and we employ the escape
mechanism to make the macro hygienic. Therefore our @repeat macro looks
like this.
macro repeat(expr::Expr, until::Symbol, condition::Expr)
    if until != :until
        error("malformed call of @repeat")
    end
    quote
        while true
            $(esc(expr))
            if $(esc(condition))
                break
            end
        end
    end
end
Again, it is instructive to inspect the macro expansion. When using the func-
tion macroexpand, we must specify the module (here Main) in whose context the
macro is evaluated and quote the expression we want to expand. When using
@macroexpand or @macroexpand1, the argument expression is not quoted. (Com-
ments in the macro expansion have been deleted.)
julia> macroexpand(Main, quote
           let i = 0
               @repeat begin
                   i += 1
                   @show i
               end until i >= 3
           end
       end)
quote
    let i = 0
        begin
            while true
                begin
                    i += 1
                    begin
                        Base.println("i = ", Base.repr(begin
                                    var"#1#value" = i
                                end))
                        var"#1#value"
                    end
                end
                if i >= 3
                    break
                end
            end
        end
    end
end
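To see @repeat run rather than merely expand, the following restates the macro as decoded from the listing above and drives it inside a let block, mirroring the macroexpand example; collecting the iterates into a vector is our addition for illustration.

```julia
# The chapter's @repeat macro, restated so this sketch is self-contained.
macro repeat(expr, until::Symbol, condition)
    if until != :until
        error("malformed call of @repeat")
    end
    quote
        while true
            $(esc(expr))
            if $(esc(condition))
                break
            end
        end
    end
end

# Drive the loop inside a let block, as in the macroexpand example above.
result = let i = 0, results = Int[]
    @repeat begin
        i += 1
        push!(results, i)
    end until i >= 3
    results
end
# result now holds the three iterates 1, 2, 3
```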
The second example is a macro called @collect. Within its argument expres-
sion, a local function remember can be called to collect values into a vector,
which is returned at the end.

macro collect(expr::Expr)
    quote
        let v = Vector()
            function $(esc(:remember))(x)
                push!(v, x)
            end
            $(esc(expr))
            v
        end
    end
end
Note that the function remember is accessible only within the argument expres-
sion that is passed to @collect; it is not a globally defined function and does not
pollute the global variable bindings.
It is instructive to macroexpand the example above once.
julia> @macroexpand1 @collect for i in 1:10
           if Primes.isprime(i)
               remember(i)
           end
       end
quote
    let var"#2#v" = Main.Vector()
        function remember(var"#4#x")
            Main.push!(var"#2#v", var"#4#x")
        end
        for i = 1:10
            if Primes.isprime(i)
                remember(i)
            end
        end
        var"#2#v"
    end
end
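A direct run complements the expansion. The following restates the @collect definition decoded from the listing above; the even-number filter replaces the Primes example so that the sketch needs no external package.

```julia
# The chapter's @collect macro, restated so this sketch is self-contained.
macro collect(expr::Expr)
    quote
        let v = Vector()
            function $(esc(:remember))(x)
                push!(v, x)
            end
            $(esc(expr))
            v
        end
    end
end

# Collect the even numbers up to 10; remember exists only inside this block.
evens = @collect for i in 1:10
    if i % 2 == 0
        remember(i)
    end
end
# evens is a Vector{Any} holding 2, 4, 6, 8, 10
```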
7.5 Memoization
As a more substantial example, we define a macro that memoizes a function
of one argument. The quoted expression that the macro returns is the following.

quote
    let cache = Dict{$(esc(arg1_type)), $(esc(return_type))}()
        global function $(esc(name))($(esc(arg1_name))::
                $(esc(arg1_type)))::$(esc(return_type))
            if haskey(cache, $(esc(arg1_name)))
                cache[$(esc(arg1_name))]
            else
                cache[$(esc(arg1_name))] = $(esc(body))
            end
        end
    end
end
About half of the work that the macro performs is spent on parsing the function
definition. We look for the name of the function, its first argument, the type of its
first argument, the return type, and the body of the function. You can use dump
to view all these expressions and make sense of the meaning of their parts.
The macro returns a quoted expression as usual. The definition of the memo-
ized function is encapsulated within a closure (see Sect. 3.4) created by let. The
cache variable contains a Dict with keys that have the type of the function argu-
ment and with values that have the type of the return value. Within this closure,
the memoized function is defined. We have to write global before the function
definition, because otherwise the function would only be defined locally within
the closure and thus be inaccessible and useless.
The function signature consists of the parsed function name, argument, ar-
gument type, and return type. The memoized function itself is simple. It checks
whether the cache contains the argument of the memoized function as a key.
If it does, the cached value is returned. If it does not, the escaped body of the
function is evaluated and the resulting value is stored in the cache. Since the
if expression is the last expression in the function, one of these two values is
returned.
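The macro's name and its parsing code are not shown in this copy, so the following is a hedged reconstruction: the name @memoize_1arg and the AST-parsing lines are our assumptions, while the quoted expression mirrors the listing above. As the text suggests, dump can be used to explore the pieces of the function definition.

```julia
# Hypothetical reconstruction; only the returned quote block is the book's.
macro memoize_1arg(fdef::Expr)
    # Parse `function name(arg::T)::R ... end` (use dump(fdef) to explore).
    sig = fdef.args[1]              # name(arg::T)::R
    return_type = sig.args[2]
    call = sig.args[1]              # name(arg::T)
    name = call.args[1]
    arg1_name = call.args[2].args[1]
    arg1_type = call.args[2].args[2]
    body = fdef.args[2]
    quote
        let cache = Dict{$(esc(arg1_type)), $(esc(return_type))}()
            global function $(esc(name))($(esc(arg1_name))::$(esc(arg1_type)))::$(esc(return_type))
                if haskey(cache, $(esc(arg1_name)))
                    cache[$(esc(arg1_name))]
                else
                    cache[$(esc(arg1_name))] = $(esc(body))
                end
            end
        end
    end
end

# The classic use case: the recursive calls now hit the cache.
@memoize_1arg function fib(n::Int)::Int
    n <= 1 ? n : fib(n - 1) + fib(n - 2)
end
```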
Some extensions of this macro are the subject of Problems 7.6, 7.7, and 7.8.

7.6 Built-in Macros

Tables 7.1 and 7.2 summarize the built-in macros. Since macros are code trans-
formations, some of the more advanced or extravagant features of the Julia lan-
guage can be found in these two tables, and some of them are explained in more
detail in the following.
The first group of macros to be discussed in more detail are the ones whose
names end in _str. These macros create string literals (already mentioned in
Sect. 4.2.4), which are a mechanism to create objects from a textual represen-
tation. The part of the name before _str indicates the type of the object to be
created. For example, the macros @int128_str and @uint128_str return an
Int128 and a UInt128, respectively, and the @big_str macro returns a BigFloat
or a BigInt depending on whether the string contains a decimal point or not. For
example, big"1.2" returns a BigFloat and big"1" returns a BigInt, as is easily
checked by typeof(big"1.2") and typeof(big"1").
The usefulness of these macros is much increased by the syntactic rule that
@name_str "..." is equivalent to name"...". For example, v"1.2.3" returns the
same VersionNumber object as @v_str "1.2.3".
Table 7.1 Built-in macros: parsing, documentation, output, profiling, tasks, metaprogram-
ming, and performance annotations.

Macro                  Description
@__DIR__               directory of the file containing the macro call
                       or the current working directory
@__FILE__              file containing the macro call or an empty string
@__LINE__              line number of the location of the macro call or 0
@__MODULE__            module of the toplevel eval
@cmd string            generate a Cmd object from string
@int128_str string     parse string into an Int128
@uint128_str string    parse string into a UInt128
@big_str string        parse string into a BigInt or a BigFloat
@b_str string          create an immutable UInt8 vector
@r_str string          create a Regex (regular expression)
@s_str string          create a substitution string for regular expressions
@v_str string          parse string into a VersionNumber
@raw_str string        create a raw string without interpolation and unescaping
@MIME_str string       parse string into a MIME type
@text_str string       parse string into a Text object
@html_str string       parse string into an HTML object
@doc                   retrieve documentation for a function, macro, or other object
@show expr             print and return the expression expr
@time expr             return the value of expr after printing timing and allocation
@timed expr            return the value of expr together with allocation information
@timev expr            verbose version of the @time macro
@elapsed expr          return the number of seconds it took to evaluate expr
@allocated expr        return the total number of bytes allocated while evaluating expr
@sync                  wait until all lexically enclosed Tasks have completed
@async                 wrap an expression in a Task and add it to the scheduler
@task expr             create a Task from expr
@threadcall            similar to ccall, but in a different thread
@macroexpand expr      fully (recursively) expand the macros in expr
@macroexpand1 expr     expand expr non-recursively (only once)
@generated             annotate a function which will be generated
@gensym                generate a symbol for a variable
@eval [mod] expr       evaluate expr (in module mod if given)
@deprecate old new     mark function as deprecated
@boundscheck expr      annotate the expression allowing it to be elided by @inbounds
@inbounds expr         eliminate checking of array bounds within expr
@fastmath expr         use fast math operations; strict IEEE semantics may be violated
@simd for ... end      annotate a for loop to allow more re-ordering
@inline                hint that the function is worth inlining
@noinline              prevent the compiler from inlining a function
@nospecialize          hint that the method should not be specialized for different types
@specialize            reset specialization hint for an argument back to the default
@polly                 tell the compiler to apply the optimizer Polly to a function
Table 7.2 Built-in macros: errors etc., compiler, and miscellaneous macros.

Macro                Description
@debug               create a log record with a debug message
@info                create a log record with an informational message
@warn                create a log record with a warning message
@error               create a log record with an error message
@logmsg              general way to create a log record
@code_llvm,          evaluate the arguments of the function or macro call,
@code_lowered,       determine their types, and
@code_native,        call the corresponding function
@code_typed, and     on the resulting expression
@code_warntype       (see text for more explanations)
@__dot__ expr and    convert every function call, operator, and assignment
@. expr              into a "dot call" (f into f. etc.)
@assert cond         throw an AssertionError if cond is false
@cfunction           generate a C-callable function pointer from a Julia function
@edit                call the edit function
@enum                create an Enum subtype
@evalpoly            evaluate a polynomial efficiently using Horner's method
@functionloc         return the location of a method definition
@goto name           unconditionally jump to the location denoted by @label name
@label name          label a destination for @goto
@isdefined var       test whether a variable var is defined in the current scope
@less                show source code (using less) for a function or macro call
@static expr         partially evaluate the expression expr at parse time
@view A[...]         create a SubArray from the indexing operation
@views expr          convert all array-slicing operations in expr to return a view
@which               return the Method that would be called
                     for a given function or macro call with given arguments
If you are familiar with the finer points of Common Lisp macros, you will
have noticed that this mechanism plays the role of reader macros in Common
Lisp. The mechanism in Julia that translates macro calls of the form name"..."
to macro calls of the form @name_str "..." is general in the sense that you can
define your own translations. This is useful when you want to construct objects
from a textual representation, for example while reading constants in a program
or while parsing data files. An example is the following. We first define a data
structure and then a macro to convert strings into such a data structure.
struct Interval
    a::Float64
    b::Float64
end

macro i_str(s::String)
    local comma = findfirst(",", s)[1]
    local a = parse(Float64, s[1:comma-1])
    local b = parse(Float64, s[comma+1:end])
    @assert a <= b
    Interval(a, b)
end
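The string-literal syntax this enables can be sketched as follows; the definitions are restated from the listing above so that the example is self-contained, and the concrete interval values are ours.

```julia
# Restating the chapter's Interval example, decoded from the listing above.
struct Interval
    a::Float64
    b::Float64
end

macro i_str(s::String)
    # All parsing happens at macro-expansion time, so i"..." is checked
    # when the program is read, not when it runs.
    local comma = findfirst(",", s)[1]
    local a = parse(Float64, s[1:comma-1])
    local b = parse(Float64, s[comma+1:end])
    @assert a <= b
    Interval(a, b)
end

iv = i"1.5,2.25"
# iv.a is 1.5 and iv.b is 2.25
```

A malformed literal such as i"2,1" would already fail the @assert while the program is being read.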
The group of macros @time, @timed, @timev, @allocated, and @elapsed re-
turn information about memory usage and evaluation time of an expression.
The @allocated macro discards the resulting value and returns the total num-
ber of bytes allocated during evaluation. Analogously, the @elapsed macro dis-
cards the resulting value and returns the number of seconds the evaluation took.
@time evaluates an expression and returns its value after printing the time it
took to evaluate, the number of allocations, and the total number of bytes allo-
cated. @timed returns multiple values (that can be used in the program instead
of just being printed): the return value of the expression, the elapsed time, the
total number of bytes allocated, the garbage-collection time, and an object with
various memory-allocation counters.
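These macros can be sketched on a simple computation as follows; the array size is arbitrary.

```julia
xs = rand(10^6)

t = @elapsed sum(xs)            # seconds as a Float64; the sum is discarded
bytes = @allocated zeros(10^3)  # total bytes allocated; the array is discarded
val, elapsed = @timed sum(xs)   # value and timing together, usable in code

# t and elapsed are small nonnegative numbers; bytes is at least 8 * 10^3,
# since a Vector{Float64} of length 1000 alone occupies 8000 bytes.
```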
The group of macros @async, @sync, and @task as well as @distributed,
@spawn, and @spawnat are discussed in Sect. 6.6 and Sect. 6.7.
The next two macros we discuss in more detail are @boundscheck and
@inbounds. The @inbounds macro skips range checks in its argument expres-
sion in order to improve performance when referencing array elements. The user
must guarantee that all bounds checks after a call to @inbounds are satisfied. The
canonical example of its usage is within a for loop where many array elements
are referenced. One should be careful when using it; if an illegal array reference
is made, incorrect results, corrupted memory, or program crashes may result.
The @boundscheck macro makes it possible to use @inbounds in your own
functions, but you can use @boundscheck only within inlined functions. The
@boundscheck macro marks the following expression as a bounds check, which
is elided when the inlined function is called after @inbounds.
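The canonical pattern can be sketched as follows; because the loop range comes from eachindex, every access is provably in bounds, which is what makes the annotation safe here.

```julia
# Sum a vector with bounds checks elided inside the loop.
function sum_all(xs::Vector{Float64})
    s = 0.0
    @inbounds for i in eachindex(xs)
        s += xs[i]
    end
    s
end

sum_all([1.0, 2.0, 3.0])  # 6.0
```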
The next family of macros consists of @code_llvm, @code_lowered,
@code_native, @code_typed, and @code_warntype. These five macros make it
easy to watch the compiler at work and are useful when you want to optimize
a function at the assembler level. The first macro, @code_llvm, shows the com-
piler output. It evaluates the arguments of a function or macro call, determines
the types of the arguments, and calls the function code_llvm on the resulting
expression. The @code_llvm macro also takes a few keyword arguments.
Maybe the simplest example is the following. What happens when we ask
Julia to evaluate 2+2?
The output shows the assembler code for the method specialized for two argu-
ments of type Int64 (i64) for the generic function +. In assembler, the method
for this particular argument signature consists of a call of add and a call of ret (re-
turn). This example shows that Julia generates highly efficient code for known
argument types.
It is often more interesting to disassemble your own function. We define a
simple (generic) function first without specifying any types.

function times_two(x)
    2*x
end
We see that the assembler code consists of two instructions for an Int64 (i64) ar-
gument. The first is a call to shl (shift left), the fastest way to multiply an integer
given in its binary representation by two. The second is a call to ret (return).
Next we apply @code_llvm to a call of our function with a Float64 (double)
argument.
julia> @code_llvm times_two(1.0)
;  @ REPL[2]:1 within `times_two`
define double @julia_times_two_139(double %0) #0 {
top:
;  @ REPL[2]:2 within `times_two`
; ┌ @ promotion.jl:380 within `*` @ float.jl:405
   %1 = fmul double %0, 2.000000e+00
; └
  ret double %1
}
The assembler code for this method again consists of two instructions, but
the multiplication instruction is different now. The instruction fmul (floating-
point multiplication) is applied to the argument and the constant Float64 value
2.000000e+00. The resulting value is returned by ret.
These two examples show that the generic function x -> 2x is compiled into
a single multiplication instruction in both cases and that the instruction specific
to the argument type is used. Therefore Julia is capable of compiling programs
into highly efficient code. The Julia compiler also inlines functions automatically
(see the macros @inline and @noinline below). The only drawback of generat-
ing specialized code for all argument signatures (and of inlining functions) is
increased code size, which makes cache misses more likely and thus slows down
modern processors. But in summary, it is very unlikely that you will need to re-
sort to lower-level languages than Julia for performance reasons.
The next macro, @code_lowered, returns arrays of CodeInfo objects contain-
ing the lowered forms for the methods matching the given method and its type
signature.
The third macro in this family, @code_native, is similar to @code_llvm, but
instead of showing the instructions used by the LLVM compiler framework, the
native instructions of the processor you are using are shown.
The next macro, @code_typed, is similar to @code_lowered, but shows type-
inferred information.
The last macro in this family, @code_warntype, prints lowered and type-in-
ferred abstract syntax trees for the given method and its type signature. The out-
put is annotated in color (if available) to give warnings of potential type insta-
bilities, i.e., variables whose types may change during evaluation are marked.
These annotations may be related to operations for which the generated code is
not optimal. This macro is especially useful for optimizing functions.
These five macros have sister functions: code_llvm, code_lowered,
code_native, code_typed, and code_warntype.
The four macros @debug, @info, @warn, and @error are the recommended way
to communicate debugging output, informational messages, warning messages,
and errors to users of your program (see Sect. 6.5.3).
The @doc macro, already mentioned in Sect. 1.3.3, is highly useful at the REPL
to retrieve documentation not only about built-in functions, macros, and types,
but also about user-defined ones if a documentation string was included.
The @enum macro makes it possible to define enumeration types.
The @generated macro, used before a function definition, defines so-called
generated functions. Generated functions are a generalization of the multiple
dispatch we know from generic functions. The body of a generated function has
access only to the types of the arguments, but not to their values, and a generated
function must return a quoted expression just like a macro. They differ from
macros, because generated functions are expanded after the types of the argu-
ments are known, but before the function is compiled, while macros are ex-
panded at read time and cannot access the types of their arguments. Generated
functions are a seldom-used feature.
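The mechanism can be sketched as follows; the function name and the dispatch-on-type logic are our illustration, not an example from the book. Inside the body, the parameter x is bound to the *type* of the argument, and the returned quoted expression is compiled once per argument type.

```julia
# A generated function: the body inspects the argument's type and returns
# a quoted expression that becomes the method body for that type.
@generated function square_if_integer(x)
    if x <: Integer       # x is the argument's type here, not its value
        return :(x * x)   # in the returned expression, x is the value again
    else
        return :(x)
    end
end

# square_if_integer(3) yields 9; square_if_integer(1.5) yields 1.5
```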
Hints about inlining functions can be given to the compiler using the two
macros @inline and @noinline. Inlining is a form of optimization that replaces
calls of the function to be inlined with the code of the function itself within its
caller. The advantage is that the overhead of passing and returning arguments is
eliminated; on the other hand, the disadvantage is increased code size. Usually,
only small and often-called functions are inlined by the compiler. The macros
@inline and @noinline are written just before function.
The @less macro is very useful to show the source code of a method. For
example, @less 2+2 shows the file int.jl, which is part of the implementation
of Julia.
The @simd macro annotates a for loop and allows the compiler to perform
more loop reordering, although the compiler is already able to automatically
vectorize inner for loops. The for loop must satisfy a few conditions when @simd
is to be used. SIMD (single instruction, multiple data) instructions are available
on most modern processors; a SIMD instruction is executed in parallel on mul-
tiple data as opposed to executing multiple instructions.
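A typical annotated loop can be sketched as follows; the reduction is ours for illustration. The annotation tells the compiler that reordering the floating-point accumulation is acceptable, which is one of the conditions @simd relaxes.

```julia
# An inner loop annotated with @simd; the accumulation order may change.
function simd_sum(xs::Vector{Float32})
    s = 0.0f0
    @simd for i in eachindex(xs)
        @inbounds s += xs[i]
    end
    s
end

simd_sum(Float32[1, 2, 3, 4])  # 10.0f0
```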
The macros @specialize and @nospecialize make it possible to exert some
control over whether the compiler should generate code for methods with certain
argument signatures or not. If @nospecialize appears in front of an argument in
a method, it gives a hint to the compiler that the method should not be special-
ized for different types of this argument, thus avoiding excess code generation.
The @nospecialize macro can also appear in the function body before any other
code. Furthermore, it can be used without arguments in the local scope of a func-
tion and then applies to all arguments of the function. It can also be used without
arguments in the global scope and then applies to all methods subsequently de-
fined in the module. The @specialize macro resets the hint back to the default
when used in the global scope.
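The per-argument form can be sketched as follows; the function is our illustration. Since the argument is only passed to typeof and string, specializing on its concrete type would generate extra code for no benefit.

```julia
# The argument is annotated so the compiler compiles one method body
# for all argument types instead of one per concrete type.
function describe(@nospecialize(x))
    string("value of type ", typeof(x))
end

describe(1.0)  # "value of type Float64"
```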
The @static macro partially evaluates the following expression at read time.
This is useful, for example, to define functions or values that are system specific.
A simple example is the following.

@static if Sys.isapple() || Sys.islinux()
    function isunix()
        true
    end
end

A more interesting example is calling functions (for example using ccall) that
only exist on certain systems.
The @view macro creates a SubArray from an array and an indexing expres-
sion into the array. A simple example of its usage is the following.
julia> A = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> b = @view A[1, :]
2-element view(::Matrix{Int64}, 1, :) with eltype Int64:
 1
 2

julia> b[1] = 0;

julia> A
2×2 Matrix{Int64}:
 0  2
 3  4
The macro @views applies @view to every array-indexing expression in the given
expression.
7.7 Bibliographical Remarks

Homoiconicity and macros have been part of Lisp [3] since its inception. The
metaprogramming and macro features of Common Lisp, the most modern and
standardized [1] Lisp dialect, are second to none and provided inspiration for
the metaprogramming facilities in Julia.
Problems
7.2 Modify the @collect macro such that the user can specify the element type
of the vector that is returned.
7.3 (Unless) Macros make it possible to define new control structures that are
indistinguishable from built-in ones apart from the at sign. Write a macro called
@unless which takes two arguments, namely a condition and an expression, and
which evaluates the expression only if the condition is false.
7.4 (Anaphoric macro) Anaphoric macros [2, Chapter 14] are macros that de-
liberately capture an argument of the macro which can later be referred to by an
anaphor. (In linguistics, an anaphor is an expression whose interpretation de-
pends on another, usually previous, expression.)
Write a macro called @if_let that takes four arguments, namely a Dict, a key
(of arbitrary type), and two expressions. If the Dict contains the key, its value is
bound to the local variable it within the first expression, which is evaluated; if
the key cannot be found, the second expression is evaluated.
7.5 (Case) Julia does not come with a switch expression, which is common
in other programming languages. Implement four macros modeled after their
counterparts in Common Lisp.
1. The macro @case takes a value and evaluates the clause that matches the
value. A default clause may be given.
2. The macro @ecase is analogous to @case, but it takes no default clause and
it raises an error if no clause matches.
3. The macro @typecase takes a value and evaluates the clause that matches
the type of the given value. A default clause may be given.
4. The macro @etypecase is analogous to @typecase, but it takes no default
clause and it raises an error if no clause matches.
References
1. American National Standards Institute (ANSI), Washington, DC, USA: Programming Lan-
guage Common Lisp, ANSI INCITS 226-1994 (R2004) (1994)
2. Graham, P.: On Lisp. Prentice Hall (1993)
3. McCarthy, J.: LISP 1.5 Programmer’s Manual. The MIT Press (1962)
Chapter 8
Arrays and Linear Algebra
8.1.1 Introduction
more efficient than passing by value, since the arrays passed as arguments do
not have to be copied. The disadvantage is that any modifications made by the
function called persist and are then seen by the caller.
It is important when designing programs that one keeps these advantages
and disadvantages in mind. Destructively modifying arrays often results in large
performance gains at the expense of programs whose control and data flow are harder to understand. The convention that the names of functions that destructively modify any of their arguments end in an exclamation mark ! is particularly
useful in this context. On the other hand, functions that never modify their (ar-
ray) arguments make it easier to reason about the data flow, but are generally
less efficient when large data structures must be copied.
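As a small illustration of this convention, the built-in functions sort and sort! differ exactly in this way:

```julia
v = [3, 1, 2]

# sort returns a new, sorted vector; the argument is left unchanged.
w = sort(v)

# sort! sorts destructively; the exclamation mark warns the caller
# that the argument itself is modified in place.
sort!(v)
```

After these calls, both v and w are sorted, but only sort! touched the original vector.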
The basic syntax to construct arrays is square brackets. The type of the array
elements may be specified before the opening square bracket; if it is not, it is
inferred from the elements given.
julia> [1]
1-element Vector{Int64}:
 1

julia> Int8[1]
1-element Vector{Int8}:
 1

julia> [1.0]
1-element Vector{Float64}:
 1.0

julia> [1, 2.0]
2-element Vector{Float64}:
 1.0
 2.0

julia> Float32[1]
1-element Vector{Float32}:
 1.0
The following examples illustrate the cases that may occur. One-dimensional Arrays, i.e., Vectors, are constructed if the elements are separated by commas only or by semicolons only.
julia> [1, 2]
2-element Vector{Int64}:
 1
 2

julia> isa([1, 2], Vector)
true

julia> [1; 2]
2-element Vector{Int64}:
 1
 2

julia> isa([1; 2], Vector)
true
The conjugate transpose can be written more conveniently using the postfix operator '.
julia> [1im, 2im]'
1×2 adjoint(::Vector{Complex{Int64}}) with eltype Complex{Int64}:
 0-1im  0-2im

julia> [1im, 2im]' * [1im, 2im]
5 + 0im
The conjugate transpose of the 1 × 2 array [1im 2im], i.e., a row vector, is a 2 × 1 array, i.e., a column vector, as expected.
On the other hand, trying to multiply the matrix by the 1 × 2 array [1 2] yields an error as expected.
In summary, the vector and matrix operations in Julia follow the conven-
tions of linear algebra, and one-dimensional arrays of length 𝑛 are interpreted
as column vectors of size 𝑛 × 1 for convenience.
Basic operations to query multi-dimensional arrays about their properties are
summarized in Table 8.1.
In addition to the syntax using square brackets to construct vectors and arrays,
there is a number of functions to construct, initialize, and fill multi-dimensional
arrays. Table 8.2 provides an overview of the functions to construct and initialize
arrays.
The function reshape is useful to interpret the elements of a given array as an array of a different shape, i.e., with different dimensions or with a different number of dimensions. The elements of the underlying array are not changed by
8.1 Dense Arrays 157
reshape itself; however, the two arrays share the same elements, so that changing the elements of one array also affects the elements of the other. The new dimensions are specified as a tuple, whereby one dimension may be specified as : to indicate that this dimension should be calculated to match the total number of elements. In this example, we define a magic square.
julia> D = [16, 5, 9, 4, 3, 10, 6, 15, 2, 11, 7, 14, 13, 8, 12, 1];

julia> M = reshape(D, (4, :))
4×4 Matrix{Int64}:
 16   3   2  13
  5  10  11   8
  9   6   7  12
  4  15  14   1

julia> D[1] = -16; sum(M)/4
26.0

julia> D[1] = 16; sum(M)/4
34.0
The functions and the syntax for the concatenation of arrays are summarized
in Table 8.3. The syntactic expressions in the second column are just a conve-
nient way to call the functions in the first column.
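As a small sketch of this correspondence, the function vcat and the syntax [a; b] concatenate vertically, while hcat and [a b] concatenate horizontally:

```julia
a = [1, 2]
b = [3, 4]

# Vertical concatenation: the function vcat and the syntax [a; b] agree.
vcat(a, b) == [a; b]    # a 4-element Vector

# Horizontal concatenation: hcat and [a b] agree and yield a 2×2 Matrix.
hcat(a, b) == [a b]
```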
Another way to construct arrays are comprehensions (cf. Sect. 3.5), whose syntax
type[expr for var1 in iterable1, var2 in iterable2, …]
is similar to the array syntax and to mathematical set notation. The dots indicate
an arbitrary number of iteration variables. The values of the iteration variables
may be given by any iterable collection (see Sect. 4.5.2) such as ranges or vectors.
The resulting array has the dimensions given by the dimensions of the collec-
tions specifying the iteration variables in order and the elements are found by
evaluating the expression expr, which often depends on the iteration variables.
Specifying the element type type of the resulting array by prepending it to the
array comprehension is optional. If the element type is not specified, it is deter-
mined automatically.
julia> sum([M[i, i] for i in 1:4])
34

julia> sum([M[5-i, i] for i in 1:4])
34

julia> sum([M[i, j] for i in 1:2, j in 1:2])
34
The following example shows how a generator expression is used inside ʧʼɦ.
Changing the number of terms in the sum does not change the amount of mem-
ory allocated when using a generator.
julia> @time sqrt(6*sum(1/i^2 for i in 1:100_000_000)) - pi
  0.139757 seconds (97.91 k allocations: 5.038 MiB)
-8.607403234606181e-9
On the other hand, the amount of allocated memory grows linearly with the
number of iterations when using a comprehension.
julia> @time sqrt(6*sum([1/i^2 for i in 1:100_000_000])) - pi
  0.300991 seconds (124.37 k allocations: 769.142 MiB, 3.04% gc time)
-9.549294688326881e-9
There are various ways to retrieve a certain element or certain elements from an
array by indexing or to assign elements of an array by indexing. One indexing
syntax is to supply 𝑛 indices in square brackets after an 𝑛-dimensional array.
Another option is to index a (multi-dimensional) array by a single index, which
is then interpreted as a linear index. Each index may be
• a positive integer,
• a range of the form from:to or from:step:to,
• a colon :, which is the same as Colon(), to select the whole dimension,
• an array of positive integers including the empty array [], or
• an array of Bools.
The indexing syntax denotes an array or a single element of an array if all the indices are scalar integers. As part of the indexing syntax, the last valid index of each dimension can be specified by the keyword end. Hence, a colon is equivalent to 1:end, as the first index is always 1.
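The end keyword can be illustrated with a small vector:

```julia
v = [10, 20, 30, 40]

v[end]             # the last element, 40
v[end-1]           # end can be part of an expression: 30
v[2:end]           # from the second to the last element
v[:] == v[1:end]   # a colon is equivalent to 1:end
```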
If any of the indices $I_j$, $j \in \{1, \dots, n\}$, is not a scalar integer, but an array, then B = A[I1, I2, …, In] becomes an array. The dimensions of the resulting array B are given by the dimensions of the indices $I_j$. Suppose that the index $I_j$ is a $d_j$-dimensional array and drop the empty, 0-dimensional arrays that correspond to the scalar indices. Then the dimensions of B are size(I1, 1), …, size(I1, $d_1$), size(I2, 1), …, size(I2, $d_2$), …, size(In, 1), …, size(In, $d_n$). The resulting element

$$B[i^1_1, \dots, i^1_{d_1},\ i^2_1, \dots, i^2_{d_2},\ \dots,\ i^n_1, \dots, i^n_{d_n}]$$

is the element

$$A[I_1[i^1_1, \dots, i^1_{d_1}],\ I_2[i^2_1, \dots, i^2_{d_2}],\ \dots,\ I_n[i^n_1, \dots, i^n_{d_n}]].$$
We can extract the two innermost elements from the first and fourth row in this
manner.
julia> M[[1, 4], [2, 3]]
2×2 Matrix{Int64}:
  3   2
 15  14

julia> sum(ans)
34
The shape of the resulting array is determined by the shape of the indices.
julia> M[3, [2 3; 1 4]]
2×2 Matrix{Int64}:
 6   7
 9  12
Indexing using a Boolean array B is also called logical indexing. If it is used, each Boolean array used as an index must have the same length as the dimension of A it corresponds to, or it must be the only index provided and have the same shape as A. In the second case, a one-dimensional array is returned. A logical index B acts as a mask and chooses the elements of A that correspond to true values in the index B.
This example uses two logical indices to select the four corner elements of the
magic square.
julia> M[[true, false, false, true], [true, false, false, true]]
2×2 Matrix{Int64}:
 16  13
  4   1

julia> sum(ans)
34
The next example illustrates using a single logical index that has the same shape
as the array.
ɔʼɜɃǤљ " ќ ЦȯǤɜʧȕ ʲʝʼȕ ʲʝʼȕ ȯǤɜʧȕѓ ʲʝʼȕ ȯǤɜʧȕ ȯǤɜʧȕ ʲʝʼȕѓ
ʲʝʼȕ ȯǤɜʧȕ ȯǤɜʧȕ ʲʝʼȕѓ ȯǤɜʧȕ ʲʝʼȕ ʲʝʼȕ ȯǤɜʧȕЧ
ЙѠЙ ǤʲʝɃ˦Ш"ɴɴɜЩђ
Е Ж Ж Е
Ж Е Е Ж
Ж Е Е Ж
Е Ж Ж Е
ɔʼɜɃǤљ Ц"Ч
Нвȕɜȕɦȕɪʲ ùȕȆʲɴʝШbɪʲЛЙЩђ
К
О
И
ЖК
З
ЖЙ
Н
ЖЗ
ɔʼɜɃǤљ ʧʼɦФ Ц"ЧХЭЗ
ИЙѐЕ
Logical indexing is also useful in expressions such as D[D .<= 8]. First, the index D .<= 8 is a BitArray, which is then used to extract the subset of the one-dimensional array D for which the condition holds.
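Using the vector D underlying the magic square from above, this reads as follows (a small sketch):

```julia
D = [16, 5, 9, 4, 3, 10, 6, 15, 2, 11, 7, 14, 13, 8, 12, 1]

mask = D .<= 8      # a BitVector with the same length as D
D[mask]             # the subset of D satisfying the condition

# equivalently, in one expression:
D[D .<= 8]
```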
The indexing syntax using square brackets is nearly equivalent to calling the function getindex. The only difference is that the end keyword, representing the last index in each dimension, can only be used inside square brackets. The end keyword can also be part of an expression as in this example.
Again, each index can be one of the five items mentioned at the beginning of this
section.
If the right-hand side B is an array, the number of elements on the left- and on the right-hand sides must match. Then the element

$$A[I_1[i^1_1, \dots, i^1_{d_1}],\ I_2[i^2_1, \dots, i^2_{d_2}],\ \dots,\ I_n[i^n_1, \dots, i^n_{d_n}]]$$

on the left-hand side is assigned the corresponding element of B on the right-hand side. If the right-hand side B is not an array, then its value is written to all elements of A referenced on the left-hand side.
Analogously to getindex, the assignment A[I1, I2, …, In] = B is equivalent to the function call setindex!(A, B, I1, I2, …, In).
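A few of these assignment forms, sketched on a small matrix (the variable names are illustrative):

```julia
A = reshape(collect(1:16), (4, 4))

A[1, 1] = -1              # assigning a single element
A[1:2, 2] = [100, 200]    # array right-hand side: element counts must match
A[3:4, 2] .= 0            # a scalar is broadcast with .= to all referenced elements
setindex!(A, 42, 4, 4)    # equivalent to A[4, 4] = 42
```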
Linear indexing means that the elements of an array are indexed by a single in-
dex that runs from one to the total number of elements in the array. As a linear
index increases, the first dimension (i.e., the row) changes faster than the sec-
ond dimension, and so forth. Fast linear indexing is generally available if the
elements of an array are contiguous in memory. Linear indexing into an array is
not always available, e.g., if the array is a view into another array.
julia> M[1], M[end]
(16, 1)
Here the index i is an Int if fast linear indexing is available for the type of A. If linear indexing is not available, the index i is, e.g., a CartesianIndex as in this example, which also shows how to create a view into an array (see Sect. 8.3). The row index changes faster than the column index.
julia> A = view(M, 3:4, 1:2)
2×2 view(::Matrix{Int64}, 3:4, 1:2) with eltype Int64:
 9   6
 4  15

julia> sum(A)
34

julia> for i in eachindex(A) @show (i, A[i]) end
(i, A[i]) = (CartesianIndex(1, 1), 9)
(i, A[i]) = (CartesianIndex(2, 1), 4)
(i, A[i]) = (CartesianIndex(1, 2), 6)
(i, A[i]) = (CartesianIndex(2, 2), 15)
8.1.6 Operators
Table 8.4 summarizes the most important operations on arrays. Operators without a dot . are operations on (whole) arrays or matrices, while operators with a dot . always act elementwise. For example, the equality operator == compares two arrays and returns a single Bool value, while the elementwise equality operator .== returns an array of the same shape as its arguments that contains the results of the elementwise comparisons.
In addition to this general rule, multiplication * acts elementwise when one argument is a scalar value, and the division operators / and \ act elementwise when the denominator is a scalar value.
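For example:

```julia
A = [1 2; 3 4]
B = [1 2; 3 5]

A == B      # false: a single Bool for the arrays as a whole
A .== B     # elementwise: a 2×2 matrix of Bools
2 * A       # multiplication by a scalar acts elementwise
A / 2       # division by a scalar denominator acts elementwise
```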
The left-division operator \ is popular for solving systems
𝐴𝐱 = 𝐛
The effect of the syntax f.(A) for vectorizing can also be achieved by defining a method such as

f(A::AbstractArray) = map(f, A)

but it is more convenient to use the built-in syntax for vectorizing than to define methods for each generic function to be vectorized.
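For example, the dot syntax applies a scalar function elementwise:

```julia
f(x) = x^2 + 1

A = [1, 2, 3]
f.(A)               # applies f to each element: [2, 5, 10]
map(f, A) == f.(A)  # the method above would compute the same result
```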
Sparse vectors and sparse matrices are important types of vectors and matrices.
Their defining characteristic is that sufficiently many elements are zero so that
storing them in a special data structure is advantageous regarding execution time
and memory consumption. Special data structures and algorithms for sparse ma-
trices make calculations possible that could not be performed within reasonable
time or space requirements using dense vectors or matrices. An important ex-
ample is given by discretizations of partial differential equations, especially in
higher spatial dimensions (see Chap. 10).
To use sparse vectors or matrices, the built-in module SparseArrays must be imported or used first.

julia> using SparseArrays
The two types SparseVector and SparseMatrixCSC have two parameters, namely the type of the (non-zero) elements and the integer type of column and row indices.
Sparse matrices are stored in the compressed-sparse-column (CSC) format.
This format is especially efficient for calculating matrix-vector products and col-
umn slicing. On the other hand, accessing a sparse matrix stored in this format
by rows is much slower. Furthermore, inserting non-zero values one at a time is
slow, since all elements beyond the insertion point must be moved over.
Many functions pertaining to sparse vectors or matrices start with the prefix sp added to the names of the functions dealing with their dense counterparts. The simplest example is spzeros for creating empty sparse vectors and matrices, where the type of the elements can optionally be supplied.
julia> spzeros(1000)
1000-element SparseVector{Float64, Int64} with 0 stored entries

julia> spzeros(1000, 1000)
1000×1000 SparseMatrixCSC{Float64, Int64} with 0 stored entries

julia> spzeros(BigInt, 1000, 1000)
1000×1000 SparseMatrixCSC{BigInt, Int64} with 0 stored entries
s = sparsevec(i, v)

such that s[i[k]] = v[k] holds for all indices k. Here the vectors i and j contain the row and column indices of the non-zero elements and the vector v contains the non-zero elements themselves.
julia> sparsevec([1, 10, 100], [1.0, 2.0, 3.0])
100-element SparseVector{Float64, Int64} with 3 stored entries:
  [1  ]  =  1.0
  [10 ]  =  2.0
  [100]  =  3.0

julia> S = sparse([1, 10, 100],
                  [1000, 10_000, 100_000],
                  [1.0, 2.0, 3.0])
100×100000 SparseMatrixCSC{Float64, Int64} with 3 stored entries:
  [1  ,   1000]  =  1.0
  [10 ,  10000]  =  2.0
  [100, 100000]  =  3.0

julia> findnz(S)
([1, 10, 100], [1000, 10000, 100000], [1.0, 2.0, 3.0])
As this example shows, the function findnz retrieves the indices and the non-zero elements of a SparseVector or a SparseMatrixCSC.
Another use of the function sparse is to create the sparse counterpart of a dense vector or matrix. The function issparse tests whether its argument is sparse or not.

julia> issparse(sparse([0, 1, 2]))
true

julia> sparse([0, 1, 2]) == [0, 1, 2]
true
Since many types of arrays and matrices occur in mathematics and in applica-
tions, the part of the type system that deals with arrays and matrices is quite
This prints the elements (and their indices) in the same linear order as above. The use of strides can be illustrated by iterating over a three-dimensional array in two different ways as well.

global A = rand(2, 2, 2)
for i in 1:prod(size(A))
    println(A[i])
end
The type Array is a subtype of DenseArray and ensures that elements are stored in column-major order.

julia> Array <: DenseArray
true

julia> isa(A, Array)
true
Vectors and matrices as we know them in mathematics are subtypes of the Array type: Vector is an alias for a one-dimensional Array and Matrix is an alias for a two-dimensional Array.

julia> Vector <: Array
true

julia> Matrix <: Array
true

julia> Vector
Vector (alias for Array{T, 1} where T)

julia> Matrix
Matrix (alias for Array{T, 2} where T)
In this section, major concepts from linear algebra are summarized and their
implementation in Julia is discussed.
holds. They span the whole vector space if every element 𝐮 of 𝑈 can be written
as a linear combination of the basis vectors 𝐁, i.e.,
$$\forall \mathbf{u} \in U : \exists u_1, \dots, u_l \in F : \mathbf{u} = \sum_{i=1}^{l} u_i \mathbf{b}_i.$$
The coefficients 𝑢𝑖 ∈ 𝐹 are the coordinates of the vector 𝐮 with respect to the
basis 𝐁 and they are uniquely determined because of the linear independence of
the basis vectors. Furthermore, the dimension dim 𝑈 of 𝑈 is 𝑙.
Therefore every vector 𝐮 ∈ 𝑈 can be represented by its coordinates 𝑢𝑖 written
in the form
8.4 Linear Algebra 173
⎛𝑢 1 ⎞ ⎛𝑢1 ⎞
𝐮 = ⎜ ⋮ ⎟ = ⎜ ⋮ ⎟. (8.1)
⎝ 𝑢𝑙 ⎠𝐁 ⎝ 𝑢𝑙 ⎠
The basis 𝐁 has been indicated here for the sake of completeness; in most cases,
it is known from the context and omitted. The coefficients 𝑢𝑖 are called the ele-
ments (of the representation) of the vector.
It is customary to write vectors as column vectors (and not as row vectors) for
most purposes in linear algebra for a reason that will become clear soon.
In Julia, vectors are of course represented by the data structure Vector{type}, where the type of the elements plays the role of the underlying field 𝐹 in mathematics.
The significance of matrices is that every linear function between two given
vector spaces can be represented as a matrix and, vice versa, every matrix gives
rise to a linear function (again between two given vector spaces). To see this, we
consider linear functions 𝑓 ∶ 𝑈 → 𝑉 between two vector spaces 𝑈 and 𝑉. We
also choose a basis 𝐁 ∶= {𝐛1 , … , 𝐛𝑙 } of the 𝑙-dimensional vector space 𝑈 and a
basis 𝐂 ∶= {𝐜1 , … , 𝐜𝑚 } of the 𝑚-dimensional vector space 𝑉. Since 𝑓 is linear,
i.e., it is compatible with the vector addition and scalar multiplication via
it suffices to know or to store the images of the basis vectors 𝐛𝑖 . This fact follows
immediately from the linearity of 𝑓, since
$$f(\mathbf{u}) = f\Bigl(\sum_{i=1}^{l} u_i \mathbf{b}_i\Bigr) = \sum_{i=1}^{l} u_i f(\mathbf{b}_i) \qquad (8.2)$$
$$A := (f(\mathbf{b}_1), \dots, f(\mathbf{b}_l)) = \begin{pmatrix} f(\mathbf{b}_1)_1 & \cdots & f(\mathbf{b}_l)_1 \\ \vdots & \ddots & \vdots \\ f(\mathbf{b}_1)_m & \cdots & f(\mathbf{b}_l)_m \end{pmatrix},$$
where the element $a_{ji} := f(\mathbf{b}_i)_j \in F$ of the matrix 𝐴 is the 𝑗-th element of the vector $f(\mathbf{b}_i) \in V$. Since 𝑈 is 𝑙-dimensional, 𝑖 runs from 1 to 𝑙, and since 𝑉 is 𝑚-
dimensional, 𝑗 runs from 1 to 𝑚. Therefore the matrix 𝐴 contains 𝑚 rows and
𝑙 columns, and we say it has dimension 𝑚 × 𝑙. We denote the set of all (𝑚 × 𝑙)-
dimensional matrices over the field 𝐹 by 𝐹 𝑚×𝑙 .
Since the matrix 𝐴 contains all the information about the function 𝑓, it is
certainly possible to calculate the image 𝑓(𝐮). How can we calculate it easily
𝑓(𝐮) = 𝐴𝐮.
$$g(f(\mathbf{u})) = B(A\mathbf{u}) = B \begin{pmatrix} \sum_{i=1}^{l} a_{1i} u_i \\ \vdots \\ \sum_{i=1}^{l} a_{mi} u_i \end{pmatrix} = \begin{pmatrix} \sum_{j=1}^{m} b_{1j} \sum_{i=1}^{l} a_{ji} u_i \\ \vdots \\ \sum_{j=1}^{m} b_{nj} \sum_{i=1}^{l} a_{ji} u_i \end{pmatrix}$$

$$= \begin{pmatrix} \sum_{j=1}^{m} \sum_{i=1}^{l} b_{1j} a_{ji} u_i \\ \vdots \\ \sum_{j=1}^{m} \sum_{i=1}^{l} b_{nj} a_{ji} u_i \end{pmatrix} = \underbrace{\begin{pmatrix} \sum_{j=1}^{m} b_{1j} a_{j1} & \cdots & \sum_{j=1}^{m} b_{1j} a_{jl} \\ \vdots & \ddots & \vdots \\ \sum_{j=1}^{m} b_{nj} a_{j1} & \cdots & \sum_{j=1}^{m} b_{nj} a_{jl} \end{pmatrix}}_{=: C} \begin{pmatrix} u_1 \\ \vdots \\ u_l \end{pmatrix} = C\mathbf{u}.$$
The last equation yields the matrix $C \in F^{n \times l}$, whose entries are

$$c_{ki} = \sum_{j=1}^{m} b_{kj} a_{ji}.$$
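This formula for the entries of a matrix product can be checked directly in Julia; a small sketch with two illustrative matrices:

```julia
B = [1 2; 3 4; 5 6]     # a 3×2 matrix, i.e., n = 3 and m = 2
A = [1 0 2; 0 1 3]      # a 2×3 matrix, i.e., m = 2 and l = 3
C = B * A               # a 3×3 matrix

# every entry satisfies c_ki = sum over j of b_kj * a_ji
entrywise = all(C[k, i] == sum(B[k, j] * A[j, i] for j in 1:2)
                for k in 1:3, i in 1:3)
```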
$$\mathbf{u}_{\mathbf{B}_2} = \begin{pmatrix} u_1 \\ \vdots \\ u_l \end{pmatrix}_{\mathbf{B}_2} = \sum_{i=1}^{l} u_i \mathbf{b}^2_i = \sum_{i=1}^{l} u_i g(\mathbf{b}^1_i) = \sum_{i=1}^{l} u_i G \mathbf{b}^1_i = G \sum_{i=1}^{l} u_i \mathbf{b}^1_i = G \begin{pmatrix} u_1 \\ \vdots \\ u_l \end{pmatrix}_{\mathbf{B}_1} = G \mathbf{u}_{\mathbf{B}_1}.$$
$$R(\phi) := \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix}.$$
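In Julia, this rotation matrix can be defined, e.g., as follows (a sketch, with the rows of the matrix literal on separate lines):

```julia
# rotation by the angle ϕ in the plane
R(ϕ) = [cos(ϕ) -sin(ϕ)
        sin(ϕ)  cos(ϕ)]
```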
Note that a newline character can be used instead of a semicolon to indicate the
start of another row.
So far we have seen how to change the basis over which a vector written as in (8.1) is to be understood. We can also change the bases over which a matrix written as in (8.3) is to be understood. This is useful in situations when a linear function is
known or more easily investigated in a certain basis. Linear functions also come
with basis vectors which are helpful to understand their action (see Sect. 8.4.9
and Sect. 8.4.10).
Suppose that 𝐴𝐁1 𝐂1 is the representation of a linear function 𝑓 in the old
bases 𝐁1 of 𝑈 and 𝐂1 of 𝑉. How can we find the representation 𝐴𝐁2 𝐂2 of 𝑓 in
the new bases 𝐁2 and 𝐂2 ? We start from the two basis changes
𝐮𝐁2 = 𝐺𝐮𝐁1 ,
𝐯𝐂2 = 𝐻𝐯𝐂1
and the representation of 𝐴 over the old bases, i.e., from the equation
Multiplying this equation from the left by 𝐻 and using $\mathbf{u}_{\mathbf{B}_1} = G^{-1} \mathbf{u}_{\mathbf{B}_2}$ yields

$$A_{\mathbf{B}_2 \mathbf{C}_2} = H A_{\mathbf{B}_1 \mathbf{C}_1} G^{-1}.$$
If 𝑈 = 𝑉, the two matrices 𝐴𝐁1 𝐂1 and 𝐴𝐁2 𝐂2 are called similar or conjugate (see
Definition 8.34). Similarity is an equivalence relation.
The last equation implies

$$A_{\mathbf{B}_1 \mathbf{C}_1} = H^{-1} A_{\mathbf{B}_2 \mathbf{C}_2} G,$$

which is easily interpreted in the commutative diagram Fig. 8.1 ($\mathbf{B}_2 \to \mathbf{C}_2$ via $A_{\mathbf{B}_2 \mathbf{C}_2}$ at the top, $\mathbf{B}_1 \to \mathbf{C}_1$ via $A_{\mathbf{B}_1 \mathbf{C}_1}$ at the bottom, $G$ pointing upward on the left, and $H^{-1}$ pointing downward on the right). In the old
bases 𝐁1 and 𝐂1 , 𝐴𝐁1 𝐂1 maps vectors represented using 𝐁1 to those represented
using 𝐂1 ; this is the left-hand side of the equation and the arrow at the bottom in
the diagram. The same effect is achieved by changing the argument vector from
the old basis 𝐁1 to the new basis 𝐁2 , then applying the linear function via its
new representation 𝐴𝐁2 𝐂2 , and finally changing from the new basis 𝐂2 back to
the old basis 𝐂1 ; this is the right-hand-side of the equation and the other three
arrows in the diagram.
Next, we consider an example. We seek the representation of a geometric
transformation in the canonical basis. The transformation is stretching the en-
tire two-dimensional plane by a factor of 2 only in the direction of the 𝑥-axis
rotated by 𝜋∕4. We define the two bases 𝐂1 ∶= 𝐁1 ∶= 𝐄 and the basis change
𝐻 ∶= 𝐺 ∶= 𝑅(−𝜋∕4) such that
$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}_{\mathbf{B}_2} = R(-\pi/4)\, \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}_{\mathbf{B}_1}.$$
In the new basis, the stretching is represented by the diagonal matrix

$$A_{\mathbf{B}_2 \mathbf{C}_2} = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}.$$

Therefore its representation in the canonical basis is

$$A_{\mathbf{E}\mathbf{E}} = A_{\mathbf{B}_1 \mathbf{C}_1} = H^{-1} A_{\mathbf{B}_2 \mathbf{C}_2} G = R(\pi/4) \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix} R(-\pi/4).$$
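Evaluating this product numerically in Julia might look as follows (a sketch; the definition of R is repeated so that the snippet is self-contained):

```julia
R(ϕ) = [cos(ϕ) -sin(ϕ); sin(ϕ) cos(ϕ)]

# representation of the stretching in the canonical basis
A = R(pi/4) * [2 0; 0 1] * R(-pi/4)
```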
Finally, we check that it computes the desired transformation. The vector (1, 1)⊤
should be stretched by a factor of two, and the vector (−1, 1)⊤ , which is orthog-
onal to it, should remain unchanged.
julia> A * [1, 1]
2-element Vector{Float64}:
 2.0
 2.0

julia> A * [-1, 1]
2-element Vector{Float64}:
 -1.0
  1.0
Many vector spaces can be equipped with an inner product. Inner products give
vector spaces geometric structure by making it possible to define lengths and
angles. An inner product ⟨., .⟩ of the vector space 𝑉 is a function
⟨., .⟩ ∶ 𝑉 × 𝑉 → 𝐹
that satisfies – for all vectors 𝐮, 𝐯, and 𝐰 ∈ 𝑉 and for all scalars 𝑎 ∈ 𝐹 – the three
conditions of conjugate symmetry ⟨𝐮, 𝐯⟩ = ⟨𝐯, 𝐮⟩, linearity in the first argument
⟨𝑎𝐮, 𝐯⟩ = 𝑎⟨𝐮, 𝐯⟩ and ⟨𝐮 + 𝐯, 𝐰⟩ = ⟨𝐮, 𝐰⟩ + ⟨𝐯, 𝐰⟩, and positive-definiteness
⟨𝐯, 𝐯⟩ ≥ 0 with equality if and only if 𝐯 = 0.
Every inner product induces a norm on its vector space 𝑉 by defining $\|\mathbf{v}\| := \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}$. For this norm, the Cauchy–Schwarz inequality $|\langle \mathbf{u}, \mathbf{v} \rangle| \le \|\mathbf{u}\| \, \|\mathbf{v}\|$ holds, while equality holds if and only if 𝐮 and 𝐯 are linearly dependent.
The cosine of the angle 𝜙(𝐮, 𝐯) between two vectors 𝐮 and 𝐯 is defined as

$$\cos \phi(\mathbf{u}, \mathbf{v}) := \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|},$$
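With the functions dot and norm from the LinearAlgebra standard library, the angle can be computed, e.g., as follows:

```julia
using LinearAlgebra

u = [1.0, 0.0]
v = [1.0, 1.0]

cosϕ = dot(u, v) / (norm(u) * norm(v))
ϕ = acos(cosϕ)    # the angle between u and v; here π/4
```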
$$\langle A\mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{u}, A^* \mathbf{v} \rangle$$

for all vectors 𝐮 and 𝐯 ∈ 𝑉. This means that the elements of $A^*$ are given by $\overline{a_{ji}}$, if the elements of 𝐴 are denoted by $a_{ij}$. If the underlying field of the vector space 𝑉 is the real numbers ℝ, then the conjugate transpose $A^*$ is the transpose of 𝐴 and denoted by $A^\top$; its elements are $a_{ji}$.
In Julia, the conjugate transpose or Hermitian conjugate of a matrix is calculated by the functions adjoint and adjoint! or the postfix operator '.

julia> A = [1+2im 3+4im; 5+6im 7+8im]; A'
2×2 adjoint(::Matrix{Complex{Int64}}) with eltype Complex{Int64}:
 1-2im  5-6im
 3-4im  7-8im
Before we can state the rank-nullity theorem, some definitions are required. The kernel or nullspace of a function 𝑓 ∶ 𝑈 → 𝑉 is the set of all elements 𝐮 ∈ 𝑈 whose image vanishes, i.e., for which 𝑓(𝐮) = 0 holds.
In the language of matrices, the theorem can be stated as follows. The nullity
and the rank of a matrix are the nullity and rank of the corresponding linear
function.
Theorem 8.3 (rank-nullity theorem for matrices) Let 𝐴 ∈ 𝐹 𝑛×𝑙 be an 𝑛 × 𝑙-
dimensional matrix. Then the equation
nul(𝐴) + rk(𝐴) = 𝑙
holds.
In Julia, the function LinearAlgebra.nullspace calculates a basis of the nullspace of a matrix and the function LinearAlgebra.rank computes its rank.
julia> A = [1 1 0 0; 0 1 1 0; 0 1 -1 0]
3×4 Matrix{Int64}:
 1  1   0  0
 0  1   1  0
 0  1  -1  0

julia> nullspace(A)
4×1 Matrix{Float64}:
 0.0
 0.0
 0.0
 1.0

julia> rank(A)
3

julia> size(nullspace(A), 2) + rank(A) == size(A, 2)
true
In applications, matrices with special structures often arise. The special prop-
erties of these matrices can often be exploited by specialized operations and al-
gorithms such as matrix factorizations (see Sect. 8.4.11). Therefore important
matrix types are discussed in the following.
The simplest matrix type are diagonal matrices which have the form
$$\begin{pmatrix} * & & \\ & \ddots & \\ & & * \end{pmatrix}.$$
Upper-bidiagonal and lower-bidiagonal matrices have the forms

$$\begin{pmatrix} * & * & & \\ & \ddots & \ddots & \\ & & \ddots & * \\ & & & * \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} * & & & \\ * & \ddots & & \\ & \ddots & \ddots & \\ & & * & * \end{pmatrix},$$

respectively, and tridiagonal matrices have the form

$$\begin{pmatrix} * & * & & \\ * & \ddots & \ddots & \\ & \ddots & \ddots & * \\ & & * & * \end{pmatrix},$$
Analogously, symmetric matrices, i.e., matrices over ℝ and with the property $A = A^\top$, are represented by the type Symmetric and are constructed by the function Symmetric.
Upper-triangular and lower-triangular matrices are matrices of the forms
$$\begin{pmatrix} * & * & * \\ & \ddots & * \\ & & * \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} * & & \\ * & \ddots & \\ * & * & * \end{pmatrix},$$
respectively. They are important for solving linear systems (see Sect. 8.4.8) and
in matrix factorizations (see Sect. 8.4.11). They are represented by the types
UpperTriangular and LowerTriangular, and they are constructed by functions of the same name.
The function UniformScaling returns a multiple of the identity matrix 𝐼,
which is generally sized so that it can be multiplied by any matrix.
julia> UniformScaling(2) * [1 2; 3 4]
2×2 Matrix{Int64}:
 2  4
 6  8
In general, matrices can be converted from the general AbstractMatrix type to a special type by calling the constructor of the special type on the matrix. Vice versa, a special type can be converted to the general Array type by calling the constructors Matrix or Array on the special matrix.
julia> Tridiagonal(M)
4×4 Tridiagonal{Int64, Vector{Int64}}:
 16   3   ⋅   ⋅
  5  10  11   ⋅
  ⋅   6   7  12
  ⋅   ⋅  14   1

julia> Matrix(Tridiagonal(M))
4×4 Matrix{Int64}:
 16   3   0   0
  5  10  11   0
  0   6   7  12
  0   0  14   1
It can be shown that these three defining properties imply the definition

$$\mathbf{a} \times \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \sin(\angle(\mathbf{a}, \mathbf{b})) \, \mathbf{n},$$

where the angle ∠(𝐚, 𝐛) ∈ [0, 𝜋] is the angle between 𝐚 and 𝐛 in the plane containing both and 𝐧 is a unit vector normal to the same plane. Its direction is given by the right-hand rule: if 𝐚 points along the thumb and 𝐛 along the forefinger of the right hand, then 𝐧 and thus 𝐜 point along the middle finger.
The cross product is anticommutative, i.e., 𝐚 × 𝐛 = −(𝐛 × 𝐚) holds for all
vectors 𝐚 and 𝐛 ∈ ℝ3 . It is also bilinear, i.e., (𝜆𝐚 + 𝜇𝐛) × 𝐜 = 𝜆(𝐚 × 𝐜) + 𝜇(𝐛 × 𝐜)
holds for all 𝜆 and 𝜇 ∈ ℝ and for all vectors 𝐚, 𝐛, and 𝐜 ∈ ℝ3 . Furthermore,
two vectors 𝐚 ≠ 0 and 𝐛 ≠ 0 are parallel if and only if 𝐚 × 𝐛 = 0. Finally, the
Lagrangian identity ‖𝐚 × 𝐛‖2 = ‖𝐚‖2 ‖𝐛‖2 − (𝐚 ⋅ 𝐛)2 holds for all vectors 𝐚 and
𝐛 ∈ ℝ3 . The cross product is not associative, i.e., in general 𝐚×(𝐛×𝐜) ≠ (𝐚×𝐛)×𝐜.
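These properties can be checked numerically; a small sketch with illustrative vectors:

```julia
using LinearAlgebra

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# anticommutativity
anticomm = cross(a, b) == -cross(b, a)

# Lagrangian identity
lagrange = isapprox(norm(cross(a, b))^2, norm(a)^2 * norm(b)^2 - dot(a, b)^2)

# non-associativity in general: the two triple products differ
c = [1.0, 0.0, 1.0]
nonassoc = cross(a, cross(b, c)) != cross(cross(a, b), c)
```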
We can use the three defining properties to determine the cross products 𝐞𝑖 ×
𝐞𝑗 for all 𝑖 and 𝑗 ∈ {1, 2, 3} of all combinations of vectors in the standard basis
{𝐞1 , 𝐞2 , 𝐞3 }. Then multiplying out the product (𝑎1 𝐞1 +𝑎2 𝐞2 +𝑎3 𝐞3 )×(𝑏1 𝐞1 +𝑏2 𝐞2 +
𝑏3 𝐞3 ) and using the bilinearity yields the formula
$$\mathbf{a} \times \mathbf{b} = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix} \times \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix} = \begin{pmatrix} a_2 b_3 - a_3 b_2 \\ a_3 b_1 - a_1 b_3 \\ a_1 b_2 - a_2 b_1 \end{pmatrix} = \begin{pmatrix} \begin{vmatrix} a_2 & b_2 \\ a_3 & b_3 \end{vmatrix} \\ -\begin{vmatrix} a_1 & b_1 \\ a_3 & b_3 \end{vmatrix} \\ \begin{vmatrix} a_1 & b_1 \\ a_2 & b_2 \end{vmatrix} \end{pmatrix}.$$
In Julia, the function cross calculates the cross product of two three-dimensional vectors.

julia> cross([1, 0, 0], [0, 1, 0])
3-element Vector{Int64}:
 0
 0
 1
𝐴𝐱 = 𝐛, (8.6)
where $A \in F^{n \times m}$ is the coefficient matrix and the vector 𝐱 contains the unknowns $x_i$. Linear systems
systems are among the most common equations to be solved. Linear systems
also arise from the linearization of systems of nonlinear equations; the linearized
system is then used as an approximation of the more complicated and usually
harder to solve nonlinear system. Because linear systems are ubiquitous, Julia
provides advanced algorithms for solving them.
If you only remember one Julia function from this section, it should be the backslash function \. It usually solves a linear system very reliably.
julia> A = [1 2; 4 5]; b = [15; 42];

julia> x = A \ b
2-element Vector{Float64}:
 3.0
 6.0

julia> A * x == b
true

julia> A * x
2-element Vector{Float64}:
 15.0
 42.0
The rest of this section is concerned with what happens under the hood.
8.4.8.1 Solvability
Cramer’s rule, an explicit formula for solving systems of linear equations with
an equal number 𝑛 of equations and unknowns, has been known since the mid-
18th century. It is named after Gabriel Cramer, who published the rule for sys-
tems of arbitrary size in 1750. Cramer’s rule is based on determinants and its
naive implementation has a time complexity of 𝑂((𝑛 + 1)𝑛!), although it can be
implemented with a time complexity of 𝑂(𝑛3 ) [2].
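For a 2 × 2 system, Cramer's rule replaces the 𝑖-th column of 𝐴 by 𝐛; a sketch using det from LinearAlgebra, on the same system as the backslash example above:

```julia
using LinearAlgebra

A = [1.0 2.0; 4.0 5.0]
b = [15.0, 42.0]

# x_i = det(A with its i-th column replaced by b) / det(A)
x1 = det([b A[:, 2]]) / det(A)
x2 = det([A[:, 1] b]) / det(A)
```

Both components agree with the solution computed by A \ b.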
However, before we solve linear systems, it is important to consider their solv-
ability. There may be no solution, a unique solution, or multiple solutions; if the
underlying field 𝐹 of the preimage vector space is infinite, the third case of mul-
tiple solutions means that there are infinitely many solutions. In the following,
we assume that the underlying field 𝐹 is ℝ or ℂ.
A system with fewer equations than unknowns is called an underdetermined
system. In general, such a system has infinitely many solutions, but it may have
no solution. A system with the same number of equations and unknowns usually
has a unique solution. A system with more equations than unknowns is called
an overdetermined system. In general, such a system has no solution.
The reasons why a certain system may behave differently from the general
case is that the equations may be linearly dependent, i.e., one or more equations
may be redundant, or that two or more of the equations may be inconsistent, i.e.,
contradictory.
Equations of the form (8.6) can be interpreted – as all of linear algebra – within
the context of linear functions 𝑓 and within the context of matrices 𝐴. There
is also a geometric interpretation: each linear equation (or row) in (8.6) deter-
mines a hyperplane in 𝐹 𝑚 and the set of solutions is the intersection of these
hyperplanes.
To characterize the three cases for the number of solutions that can occur, it is
useful to start with homogeneous systems. A system of linear equations is called
homogeneous if the constant terms in each equation vanish, i.e., if (8.6) has the
form
𝐴𝐱 = 0. (8.7)
Each homogeneous equation has at least one solution, namely the trivial solu-
tion 𝐱 = 0.
If the matrix 𝐴 is regular, which is equivalent to 𝑓 being a bijection, then the
trivial solution is the unique solution; the set of solutions is the kernel ker(𝐴) =
{0}.
If the matrix 𝐴 is singular, the set of solutions is the kernel ker(𝐴) and it
contains infinitely many solutions. It is straightforward to see that if 𝐱 and 𝐲 are
two solutions, then the linear combination 𝛼𝐱 + 𝛽𝐲 is a solution as well. This
implies that the set of solutions is a linear subspace of 𝐹 𝑚 .
A linear function 𝑓 can only be a bijection if the dimensions 𝑚 and 𝑛 of the
preimage and image spaces are equal. Therefore a regular matrix 𝐴 must be a
square matrix. A square matrix 𝐴 is regular if and only if det(𝐴) ≠ 0.
An example of a singular matrix is the following.
julia> A = [1 0; 1 0]; det(A)
0.0

julia> rank(A)
1

julia> nullspace(A)
2×1 Matrix{Float64}:
 0.0
 1.0
With this knowledge about the solution set of homogeneous systems (8.7), we
can now consider general, inhomogeneous systems (8.6). An inhomogeneous
system has (at least) one solution if the inhomogeneity 𝐛 lies in the image of 𝑓
or 𝐴, i.e., if 𝐛 ∈ im(𝐴). If 𝐳 ∈ 𝐹 𝑚 is any particular solution of (8.6), then all
solutions of (8.6) are given by the set
𝐳 + ker(𝐴) = {𝐳 + 𝐯 ∣ 𝐯 ∈ ker(𝐴)}.
In this section, we consider the important special case when the system matrix 𝐴
in (8.6) is square, i.e., 𝐴 ∈ 𝐹 𝑛×𝑛 . In other words, the preimage and the image
spaces have the same dimension 𝑛. A square linear system has the form
𝑈𝐱 = 𝐛,
$$A\mathbf{x} = P^\top L \underbrace{U\mathbf{x}}_{=:\mathbf{y}} = \mathbf{b}.$$
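Assuming the factorization object returned by lu from LinearAlgebra (with fields L, U, and the permutation vector p), solving via forward and back substitution might be sketched as follows:

```julia
using LinearAlgebra

A = [1.0 2.0; 4.0 5.0]
b = [15.0, 42.0]

F = lu(A)           # F.L, F.U, and F.p with A[F.p, :] == F.L * F.U
y = F.L \ b[F.p]    # forward substitution for L y = P b
x = F.U \ y         # back substitution for U x = y
```

The result agrees with A \ b, which performs essentially these steps internally.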
This element 𝑎𝑖𝑗 is called the pivot element. If the pivot element is equal
to zero, the matrix 𝐴 is singular and the algorithm stops.
Swap the 𝑖-th and the 𝑗-th rows such that the pivot element is now
located at 𝑎𝑗𝑗 in the 𝑗-th row. Record swapping the two rows as left-
multiplication by a permutation matrix 𝑃𝑗 .
b. For all rows 𝑖 ∈ {𝑗 + 1, … , 𝑛} below the pivot element, add the pivot
row multiplied by −𝑎𝑖𝑗 ∕𝑎𝑗𝑗 to the 𝑖-th row such that all matrix elements
below the pivot element vanish.
Record these row operations by left-multiplication with a lower-triang-
ular matrix 𝐿𝑗 .
c. Increase 𝑗 by one and repeat while 𝑗 ≤ 𝑛.
Swapping the rows in the second step is called row pivoting or partial pivoting.
Choosing the element with the largest absolute value improves the numerical
stability of the algorithm, although other variants are used as well.
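The elimination loop with partial pivoting can be sketched as follows; lu_pivot is a hypothetical name for this minimal, unoptimized implementation, and the test matrix is arbitrary.

```julia
using LinearAlgebra

# A minimal sketch of Gaussian elimination with partial pivoting;
# returns P, L, U with P*A ≈ L*U. Assumes A is regular.
function lu_pivot(A)
    n = size(A, 1)
    U = float(copy(A))
    L = zeros(n, n)
    P = Matrix(1.0I, n, n)
    for j in 1:n
        i = j - 1 + argmax(abs.(U[j:n, j]))   # pivot: largest absolute value
        U[[i, j], :] = U[[j, i], :]           # swap rows i and j
        P[[i, j], :] = P[[j, i], :]           # record the swap
        L[[i, j], 1:j-1] = L[[j, i], 1:j-1]
        L[j, j] = 1.0
        for k in j+1:n                        # eliminate below the pivot
            L[k, j] = U[k, j] / U[j, j]
            U[k, j] = 0.0
            U[k, j+1:n] .-= L[k, j] .* U[j, j+1:n]
        end
    end
    P, L, U
end

A = [2.0 1 1; 4 3 3; 8 7 9]
P, L, U = lu_pivot(A)
@assert P * A ≈ L * U
@assert istril(L) && istriu(U)
```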
If it is not possible to find a non-zero pivot element in the second step, we
have shown that the matrix is singular. In fact, Gaussian elimination is a decision
procedure for inverting a square matrix 𝐴: if it succeeds, the inverse matrix 𝐴−1 has
been calculated; if it does not succeed, it has constructed a proof that the matrix
is singular.
The following theorem states that Gaussian elimination indeed yields a fac-
torization of the desired shape.
8.4 Linear Algebra 191
𝐴 = 𝑃⊤ 𝐿𝑈.
which is indeed lower triangular. The only non-zero element of the product 𝐞𝑖𝐞𝑗⊤
is in row 𝑖 and column 𝑗.
The permutation matrices 𝑃𝑗 record swapping the 𝑖-th row with the 𝑗-th one,
where 𝑖 > 𝑗 again. It is straightforward to show that the inverse of a permutation
matrix 𝑃 is 𝑃−1 = 𝑃⊤ .
The goal is to reorder the terms in the product 𝐿𝑛 𝑃𝑛 ⋯ 𝐿1 𝑃1 such that it be-
comes 𝐾𝑛 ⋯ 𝐾1 𝑃𝑛 ⋯ 𝑃1 . This can be achieved by moving the matrices 𝑃𝑗 to the
right repeatedly by replacing the products 𝑃𝑘 𝐿𝑗 with 𝑘 > 𝑗 by products 𝐾𝑗 𝑃𝑘 .
We have 𝐾𝑗 = 𝑃𝑘 𝐿𝑗 𝑃𝑘⊤ . Since 𝑘 > 𝑗, multiplication of 𝐿𝑗 on the right by 𝑃𝑘⊤ only
swaps two columns whose only non-zero element is equal to one. Since 𝑘 > 𝑗,
multiplication of 𝐿𝑗 𝑃𝑘⊤ on the left by 𝑃𝑘 swaps these two ones back into the main
diagonal and leaves the structure of the matrix unchanged otherwise. Therefore
𝐾𝑗 has the same, lower-triangular structure as 𝐿𝑗 , and all the terms can be re-
ordered such that
𝐿𝑛 𝑃𝑛 ⋯ 𝐿1 𝑃1 𝐴 = 𝐾𝑛 ⋯ 𝐾1 𝑃𝑛 ⋯ 𝑃1 𝐴 = 𝑈.
The inverse of 𝐾𝑗 is simply

𝐾𝑗−1 = 𝐼 − ∑_{𝑖=𝑗+1}^{𝑛} 𝜅𝑖𝑗 𝐞𝑖𝐞𝑗⊤.
𝐾𝑗𝐾𝑗−1 = 𝐾𝑗−1𝐾𝑗
= 𝐼 + ∑_{𝑘=𝑗+1}^{𝑛} 𝜅𝑘𝑗 𝐞𝑘𝐞𝑗⊤ − ∑_{𝑙=𝑗+1}^{𝑛} 𝜅𝑙𝑗 𝐞𝑙𝐞𝑗⊤ − (∑_{𝑘=𝑗+1}^{𝑛} 𝜅𝑘𝑗 𝐞𝑘𝐞𝑗⊤)(∑_{𝑙=𝑗+1}^{𝑛} 𝜅𝑙𝑗 𝐞𝑙𝐞𝑗⊤) = 𝐼.

The last of the four terms vanishes because of the inner products 𝐞𝑗⊤𝐞𝑙 = 0 with 𝑗 ≠ 𝑙.
Therefore the inverses 𝐾𝑗−1 have the same lower-triangular structure as the
matrices 𝐿𝑗 . Since the product of lower-triangular matrices is again lower-triang-
ular, we have thus shown that
𝑃𝐴 = 𝐿𝑈,
det(𝐴) = (−1)𝑝 (∏_{𝑖=1}^{𝑛} 𝑙𝑖𝑖)(∏_{𝑖=1}^{𝑛} 𝑢𝑖𝑖),
Choosing the element with the maximal absolute value as the pivot element is
important for the precision of the algorithm. To see this, we consider the example
𝐴 ∶= ⎛𝜖   1⎞ ,   𝐛 ∶= ⎛1 + 𝜖⎞ .
     ⎝1  −1⎠          ⎝  0  ⎠

Solving this system without pivoting, i.e., with 𝜖 as the first pivot element, yields

𝑥1 = ((1 + 𝜖) − 1) ∕ 𝜖.
Symbolic evaluation of this expression results in the correct solution 𝐱 = (1, 1)⊤ .
Numerical evaluation, however, requires subtracting 1 from 1 + 𝜖, which are two
close numbers. This leads to cancellation and the loss of digits in floating-point
arithmetic. The problem is aggravated by the division by 𝜖. This effect makes the
solution unstable.
julia> e = 1.5*eps(1.0); @show e; @show (1+e)-1; @show ((1+e)-1)/e;
e = 3.3306690738754696e-16
(1 + e) - 1 = 4.440892098500626e-16
((1 + e) - 1) / e = 1.3333333333333333
The function eps returns the epsilon of the given floating-point type, which is
defined as the gap between 1 and the next-larger value representable by this
type.
On the other hand, row pivoting yields
⎛1   −1     0  ⎞
⎝0  1 + 𝜖  1 + 𝜖⎠

for the augmented system and hence, after back substitution,

𝑥1 = (0 − (−1)) ∕ 1.
In this quotient, no such problem occurs.
A special type of matrix of particular importance is the following.
Definition 8.9 (positive-definite matrix) A Hermitian matrix 𝐴 ∈ 𝐹 𝑛×𝑛 is
called positive definite if
𝐱∗ 𝐴𝐱 > 0 ∀𝐱 ∈ 𝐹 𝑛 ∖{𝟎}
holds.
It is easy to construct positive-definite matrices. If 𝐴 ∈ ℂ𝑛×𝑛 is regular,
then 𝐴∗ 𝐴 is a positive-definite matrix. 𝐴∗ 𝐴 is obviously Hermitian; furthermore,
𝐱∗ 𝐴∗ 𝐴𝐱 = ‖𝐴𝐱‖22 > 0, since 𝐴𝐱 ≠ 0 due to the regularity of 𝐴.
If 𝐴 is a positive-definite matrix, then Cholesky factorization, which is a
specialization of 𝐿𝑈 factorization for this type of matrices, can be used and is
roughly twice as efficient as 𝐿𝑈 factorization. It computes a factorization of the form

𝐴 = 𝐿𝐿∗,

where 𝐿 is lower triangular.
Overdetermined systems of linear equations are systems of the form (8.6) where
𝑛>𝑚
holds for the system matrix 𝐴 ∈ 𝐹 𝑛×𝑚 , i.e., the number 𝑛 of equations is larger
than the number 𝑚 of unknowns. The unknown vector 𝐱 is an element of 𝐹 𝑚 and
the inhomogeneity 𝐛 is an element of 𝐹 𝑛 . In general, overdetermined systems do
not have a solution, since the equations are contradictory.
Instead of solving the system, it is expedient to minimize a norm of the
residual 𝐛 − 𝐴𝐱, i.e., to find

min_{𝐱 ∈ 𝐹𝑚} ‖𝐛 − 𝐴𝐱‖.
Any choice of norm is possible. If the infinity norm is chosen, then this type
of problem is often called a minimax problem. In the case of the 2-norm, the
problem is called a linear least-squares problem. The name stems from the form

‖𝐛 − 𝐴𝐱‖₂² = ∑_{𝑖=1}^{𝑛} |𝑏𝑖 − (𝐴𝐱)𝑖|² (8.8)

of the square of the 2-norm of the residual. Least-squares problems 𝐴𝐱 ≈ 𝐛 arise,
for example, when given data are to be approximated by a linear combination
of certain 𝑚 functions (𝑓1(𝑡), …, 𝑓𝑚(𝑡)) with the coefficients (𝑥1, …, 𝑥𝑚). Since
we know the values 𝑏𝑖 = 𝑏(𝑡𝑖), we obtain the equations
⎛𝑓1(𝑡1) ⋯ 𝑓𝑚(𝑡1)⎞ ⎛𝑥1⎞   ⎛𝑏1⎞
⎜  ⋮         ⋮  ⎟ ⎜ ⋮ ⎟ = ⎜ ⋮ ⎟ .
⎝𝑓1(𝑡𝑛) ⋯ 𝑓𝑚(𝑡𝑛)⎠ ⎝𝑥𝑚⎠   ⎝𝑏𝑛⎠
After setting
𝑎𝑖𝑗 ∶= 𝑓𝑗 (𝑡𝑖 ),
we have thus found a linear system 𝐴𝐱 = 𝐛 or a linear least-squares problem
𝐴𝐱 ≈ 𝐛.
Furthermore, substitutions can be used to write nonlinear relationships be-
tween the dependent and independent variables in the form (8.9) of a linear
combination. For example, taking the logarithm of both sides of the power law
𝑐(𝑠) ∶= 𝛼𝑠𝛽 , we find ln 𝑐𝑖 = ln 𝛼 + 𝛽 ln 𝑠𝑖 and hence define 𝑡𝑖 ∶= ln 𝑠𝑖 and
𝑏𝑖 ∶= ln 𝑐𝑖 . This yields the linear relationship 𝑏𝑖 = ln 𝛼 + 𝛽𝑡𝑖 and the two func-
tions 𝑓1 (𝑡) ∶= 1 and 𝑓2 (𝑡) ∶= 𝑡. Then 𝑥1 = ln 𝛼 and 𝑥2 = 𝛽.
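This linearization can be sketched as a least-squares fit; the data below are synthetic with 𝛼 = 2 and 𝛽 = 1.5.

```julia
# Fitting the power law c(s) = α s^β via the substitutions
# t := ln s, b := ln c; synthetic, noise-free data with α = 2, β = 1.5.
s = [1.0, 2.0, 3.0, 4.0]
c = 2.0 .* s .^ 1.5
t = log.(s)
b = log.(c)
A = [ones(length(t)) t]   # columns f₁(t) = 1 and f₂(t) = t
x = A \ b                 # linear least-squares fit
α, β = exp(x[1]), x[2]
@assert α ≈ 2.0
@assert β ≈ 1.5
```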
The following theorem gives the already expected relationship between linear
least-squares problems and linear systems.
Theorem 8.11 (least-squares problem) If 𝐱 ∈ 𝐹𝑚 solves the linear system

𝐴∗(𝐴𝐱 − 𝐛) = 𝟎, (8.10)

then 𝐱 solves the linear least-squares problem.

Proof For every 𝐲 ∈ 𝐹𝑚, the equation

‖𝐛 − 𝐴(𝐱 + 𝐲)‖₂² = ‖𝐛 − 𝐴𝐱‖₂² + ‖𝐴𝐲‖₂² ≥ ‖𝐛 − 𝐴𝐱‖₂²

holds. Here we have used the fact that if 𝐮 and 𝐯 are two vectors, then the equa-
tion (𝐮 + 𝐯)∗(𝐮 + 𝐯) = 𝐮∗𝐮 + 2𝐯∗𝐮 + 𝐯∗𝐯 holds due to 𝐮∗𝐯 = 𝐯∗𝐮 being a
scalar; with 𝐮 ∶= 𝐛 − 𝐴𝐱 and 𝐯 ∶= −𝐴𝐲, the mixed term 2𝐯∗𝐮 = −2𝐲∗𝐴∗(𝐛 − 𝐴𝐱)
vanishes by (8.10).
The inequality shows that any other vector 𝐱 + 𝐲 cannot be a solution of the
least-squares problem. □
Proof (using calculus) The gradient of the expression to be minimized is
∇𝐱‖𝐛 − 𝐴𝐱‖₂² = ∇𝐱 ∑_{𝑖=1}^{𝑛} (𝑏𝑖 − ∑_{𝑗=1}^{𝑚} 𝑎𝑖𝑗𝑥𝑗)²,
which yields
∇𝐱 ‖𝐛 − 𝐴𝐱‖22 = 2𝐴∗ (𝐴𝐱 − 𝐛) = 𝟎.
The last equation holds due to (8.10). □
The condition (8.10) is equivalent to
𝐴∗ 𝐴𝐱 = 𝐴∗ 𝐛. (8.11)
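The normal equations (8.11) can be compared numerically with Julia's backslash operator, which solves least-squares problems for rectangular matrices; the data below are illustrative.

```julia
using LinearAlgebra

# Solving the normal equations A'Ax = A'b and comparing with the
# backslash operator for an overdetermined system.
A = [1.0 1; 1 2; 1 3]
b = [1.0, 2.0, 2.0]
x_normal = (A' * A) \ (A' * b)
x_lsq = A \ b            # QR-based least squares
@assert x_normal ≈ x_lsq
```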
𝐴 = 𝑄𝑅,
holds. If 𝐪∗𝑖 𝐪𝑖 = 1 additionally holds for all 𝑖 ∈ {1, … , 𝑛}, the vectors are called
orthonormal.
Proof By definition, 𝑄∗ 𝑄 = 𝐼 and 𝑄𝑄∗ = 𝐼. Therefore the left and right inverses
of 𝑄 are equal to 𝑄∗ . □
cannot be parallelized, but they are the simplest of the numerically stable 𝑄𝑅 fac-
torization algorithms. We therefore discuss Householder transformations, also
called Householder reflections, for 𝑄𝑅 factorization in the following.
The defining properties of a Householder transformation or reflection are that
it is represented by an orthogonal/unitary matrix and that it maps a vector 𝐱 to
a vector whose only non-zero element is the first one, i.e.,
       ⎛±‖𝐱‖2⎞
𝑃𝐱 = ⎜  0  ⎟ .
       ⎜  ⋮  ⎟
       ⎝  0  ⎠
The first element of 𝑃𝐱 must be ±‖𝐱‖2 because of Theorem 8.17. The House-
holder reflection 𝑃 reflects through the line bisecting the angle between 𝐱 and 𝐞1 .
The advantageous numerical property is that the maximum bisected angle is 45◦ .
On the other hand, orthogonal projection of the vector 𝐱 onto 𝐞1 as used in the
Gram–Schmidt algorithm is numerically unstable whenever 𝐱 and 𝐞1 are approximately orthogonal.
The following two theorems show how to construct Householder reflections
in the real case 𝐹 = ℝ and in the complex case 𝐹 = ℂ.
Theorem 8.18 (Householder reflection (𝐹 = ℝ)) Suppose 𝐱 ∈ ℝ𝑛 and choose
an 𝛼 ∈ ℝ such that |𝛼| = ‖𝐱‖2 . Define 𝐮 ∶= 𝐱 − 𝛼𝐞1 and
𝑃 ∶= ⎧ 𝐼 − (2 ∕ 𝐮⊤𝐮) 𝐮𝐮⊤,  𝐮 ≠ 𝟎,
     ⎩ 𝐼,                   𝐮 = 𝟎.
Then 𝑃 ∈ ℝ𝑛×𝑛 is symmetric and orthogonal, and the equation 𝑃𝐱 = 𝛼𝐞1 holds.
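A sketch of the construction in Theorem 8.18; householder is a hypothetical helper name, and the sign of 𝛼 is chosen opposite to 𝑥1 to avoid cancellation (see the remark below Theorem 8.19).

```julia
using LinearAlgebra

# Builds the real Householder reflection P with P*x = α e₁.
function householder(x)
    n = length(x)
    α = x[1] >= 0 ? -norm(x) : norm(x)   # sign opposite to x₁
    u = float(copy(x))
    u[1] -= α
    if iszero(u)
        return Matrix(1.0I, n, n), α
    end
    P = Matrix(1.0I, n, n) - (2 / dot(u, u)) * (u * u')
    return P, α
end

x = [3.0, 4.0]
P, α = householder(x)
@assert P * x ≈ [α, 0.0]                   # maps x to α e₁
@assert P ≈ P'                             # symmetric
@assert P' * P ≈ Matrix(1.0I, 2, 2)        # orthogonal
```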
Theorem 8.19 (Householder reflection (𝐹 = ℂ)) Suppose 𝐱 ∈ ℂ𝑛 and define
𝛼 ∶= −e^{i arg 𝑥1} ‖𝐱‖2. Define 𝐮 ∶= 𝐱 − 𝛼𝐞1, 𝐯 ∶= 𝐮∕‖𝐮‖2 unless 𝐮 = 𝟎, and

𝑃 ∶= ⎧ 𝐼 − (1 + 𝐱∗𝐯 ∕ 𝐯∗𝐱) 𝐯𝐯∗,  𝐮 ≠ 𝟎,
     ⎩ 𝐼,                        𝐮 = 𝟎.
Then 𝑃 ∈ 𝐹 𝑛×𝑛 is Hermitian and unitary, and the equation 𝑃𝐱 = 𝛼𝐞1 holds.
When using floating-point numbers, the scalar 𝛼 in the real case in Theo-
rem 8.18 should be chosen so that it has the opposite sign to the 𝑘-th coordinate
𝑢𝑘 of 𝐮, where 𝑢𝑘 is the pivot element in Theorem 8.21, in order to avoid cancel-
lation. The choice of 𝛼 in Theorem 8.19 also achieves this in the complex case
[3, Section 4.7].
We can now state the algorithm and prove the theorem for 𝑄𝑅 factorization.
Since several Householder reflections are used, we use a more precise notation
now and denote the Householder reflection for the vector 𝐱 by 𝑃(𝐱).
𝑄𝑁 ⋯ 𝑄1𝐴 = 𝑅𝑁 =∶ 𝑅,   𝑄 ∶= 𝑄1∗ ⋯ 𝑄𝑁∗,

𝐴 = 𝑄1∗ ⋯ 𝑄𝑁∗𝑅 = 𝑄𝑅.
This algorithm always computes the factorization, which implies the follow-
ing theorem.
Theorem 8.21 (𝑄𝑅 factorization) Every matrix 𝐴 ∈ 𝐹 𝑛×𝑚 , 𝐹 ∈ {ℝ, ℂ}, can be
factored as
𝐴 = 𝑄𝑅,
where 𝑄 ∈ 𝐹 𝑛×𝑛 is orthogonal/unitary and 𝑅 ∈ 𝐹 𝑛×𝑚 is upper triangular.
These two functions also work on sparse matrices. In this case, row and col-
umn permutations are provided in the fields prow and pcol such that the number
of non-zero entries is reduced.

julia> A = sprandn(4, 4, 1/2); f = qr(A);

julia> f.Q * f.R - A[f.prow, f.pcol]
4×4 SparseMatrixCSC{Float64, Int64} with 0 stored entries
where the columns 𝐪𝑗, 𝑗 ∈ {1, … , 𝑚}, of 𝑄̃ ∈ 𝐹𝑛×𝑚 are still orthonormal and
𝑅̃ ∈ 𝐹𝑚×𝑚 is upper triangular. While 𝑄̃∗𝑄̃ = 𝐼𝑚, in general 𝑄̃𝑄̃∗ ≠ 𝐼𝑛.
To solve the least-squares problem 𝐴x ≈ 𝐛, we solve the normal equations
(8.11). Using (8.13), they become
𝐴∗𝐴𝐱 = 𝐴∗𝐛,
𝑅̃∗𝑄̃∗𝑄̃𝑅̃𝐱 = 𝑅̃∗𝑄̃∗𝐛,
𝑅̃∗𝑅̃𝐱 = 𝑅̃∗𝑄̃∗𝐛.
If 𝐴 has full rank, then 𝑅̃ is regular. Therefore 𝑅̃ ∗ is regular as well and the last
equation becomes
𝑅̃𝐱 = 𝑄̃∗𝐛.
Thus we can find 𝐱 by calculating 𝑄̃ and 𝑅̃ and then using backward substitution
(see Sect. 8.4.8.2).
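This solution procedure can be sketched with the built-in qr function, whose thin factor 𝑄̃ is obtained by Matrix(F.Q); the data are illustrative.

```julia
using LinearAlgebra

# Solving A x ≈ b via the reduced factors: R̃ x = Q̃* b.
A = [1.0 0; 1 1; 1 2]
b = [1.0, 2.0, 4.0]
F = qr(A)
Qt = Matrix(F.Q)        # thin factor, n×m, orthonormal columns
Rt = F.R                # m×m, upper triangular
x = Rt \ (Qt' * b)      # backward substitution
@assert x ≈ A \ b
```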
𝐴𝐴+ 𝐴 = 𝐴,
𝐴+ 𝐴𝐴+ = 𝐴+ ,
(𝐴𝐴+ )∗ = 𝐴𝐴+ ,
(𝐴+ 𝐴)∗ = 𝐴+ 𝐴.
The first and second conditions mean that 𝐴 and 𝐴+ are weak inverses of 𝐴+
and 𝐴, respectively, in the multiplicative semigroup. The third and fourth condi-
tions mean that 𝐴𝐴+ and 𝐴+ 𝐴 are self-adjoint.
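The four conditions can be checked numerically with the built-in pinv function; the rectangular matrix below is an arbitrary example.

```julia
using LinearAlgebra

# Numerically checking the four Moore–Penrose conditions.
A = [1.0 0; 1 0; 0 2]
Ap = pinv(A)
@assert A * Ap * A ≈ A          # weak inverse
@assert Ap * A * Ap ≈ Ap        # weak inverse
@assert (A * Ap)' ≈ A * Ap      # self-adjoint
@assert (Ap * A)' ≈ Ap * A      # self-adjoint
```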
The pseudoinverse as defined by these four conditions exists uniquely. Impor-
tant properties of the pseudoinverse are the following. As expected, if the ma-
trix 𝐴 is regular, then the pseudoinverse is its inverse, i.e., 𝐴+ = 𝐴−1 . The pseu-
doinverse of the pseudoinverse is again the original matrix, i.e., (𝐴+ )+ = 𝐴. Pseu-
doinversion commutes with complex conjugation, taking the conjugate trans-
pose, and with transposition. The pseudoinverse of a scalar multiple of a ma-
trix 𝐴 is the reciprocal multiple of the pseudoinverse, i.e., (𝛼𝐴)+ = 𝛼−1𝐴+ for
𝛼 ≠ 0. Furthermore, the vector
𝐲 ∶= 𝐴+ 𝐛
satisfies
‖𝐴𝐳 − 𝐛‖2 ≥ ‖𝐴𝐲 − 𝐛‖2 ∀𝐳 ∈ 𝐹 𝑚 ,
i.e., the vector 𝐲 provides the smallest error in the least-squares sense. Equality
holds if and only if 𝐴𝐳 = 𝐴𝐲,
meaning that there are infinitely many minimizing solutions 𝐳 unless 𝐴 has full
rank, i.e., rk(𝐴) = 𝑚. If 𝐴 has full rank, then 𝐼 = 𝐴+ 𝐴 and thus 𝐳 = 𝐲 = 𝐴+ 𝐛.
In geometric terms, the point (0, 0)⊤ on the line 𝜆(1, 1)⊤ , 𝜆 ∈ ℝ, has the smallest
Euclidean distance from the point (−1, 1)⊤ .
If the linear system has multiple solutions, the vector with the minimal Eu-
clidean norm is found.
julia> A = [1 0; 1 0]; b = [1; 1]; pinv(A) * b
2-element Vector{Float64}:
 0.9999999999999997
 0.0

julia> pinv(A) * A
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  0.0
Since the zero vector always satisfies (8.15) trivially, it is excluded from the
definition of an eigenvector. The geometric interpretation of a pair of eigenvalues
and eigenvectors is that an eigenvector is a direction in which the transformation
stretches by a factor that is the eigenvalue.
How can we find the eigenvalues and eigenvectors of a square matrix? Deter-
mining the eigenvalues of a square matrix in theory is not hard. Equation (8.15)
has a non-zero solution if and only if the determinant of 𝐴 − 𝜆𝐼 is zero. This
observation yields the equation

det(𝐴 − 𝜆𝐼) = 0 (8.16)

for any eigenvalue 𝜆 ∈ ℂ. Expanding the determinant shows that the left-hand
side of this equation is a polynomial of degree 𝑛 in 𝜆 (and that the coefficient of
𝜆𝑛 is (−1)𝑛).

Definition 8.23 (characteristic polynomial) The polynomial 𝜒𝐴(𝜆) ∶= det(𝐴 − 𝜆𝐼)
is called the characteristic polynomial of the matrix 𝐴. The eigenvectors associated
with an eigenvalue 𝜆 are the non-zero solutions 𝐯 of the homogeneous linear system

(𝐴 − 𝜆𝐼)𝐯 = 𝟎. (8.17)
The constructor Matrix called with I as the first argument constructs matrices
with ones in the main diagonal such as identity matrices.
Eigenvalues may be repeated. There are two ways to define the multiplicity
of an eigenvalue: the first is called algebraic multiplicity and the second is called
geometric multiplicity. The algebraic multiplicity stems from the multiplicity of
the eigenvalue as a root of the characteristic polynomial, while the geometric
multiplicity stems from the number of corresponding eigenvectors or the size of
the solution space of the linear system (8.17).
Clearly, the inequality 1 ≤ 𝜇𝐴 (𝜆𝑖 ) ≤ 𝑛 holds for all 𝑖 ∈ {1, … , 𝑑}, and the sum of
all algebraic multiplicities is equal to the dimension of the vector space, i.e.,
∑_{𝑖=1}^{𝑑} 𝜇𝐴(𝜆𝑖) = 𝑛. (8.18)
The eigenspace of the eigenvalue 𝜆𝑖 is, of course, the kernel of the matrix
𝐴 − 𝜆𝑖 𝐼 and can be calculated using the function nullspace. It is spanned by all
eigenvectors associated with 𝜆𝑖 . It is straightforward to show that an eigenspace
is always a linear subspace (see Problem 8.16). Using the notion of an eigenspace,
we can now define the geometric multiplicity.
𝛾𝐴 (𝜆𝑖 ) ≥ 1.
How are the algebraic and the geometric multiplicities related? The answer
is given by the following theorem, whose proof is nontrivial (see Problem 8.17).
It means that the geometric multiplicity of an eigenvalue is never larger than
its algebraic one.
1 ≤ 𝛾𝐴 (𝜆𝑖 ) ≤ 𝜇𝐴 (𝜆𝑖 ) ≤ 𝑛
holds.
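The two multiplicities can be computed numerically; for the defective example below, the eigenvalue 1 has algebraic multiplicity 2 but geometric multiplicity 1.

```julia
using LinearAlgebra

# Algebraic multiplicity via eigvals, geometric multiplicity via
# the dimension of the eigenspace computed by nullspace.
A = [1.0 1; 0 1]
@assert eigvals(A) ≈ [1.0, 1.0]        # algebraic multiplicity 2
E = nullspace(A - 1.0I)                # eigenspace of λ = 1
@assert size(E, 2) == 1                # geometric multiplicity 1
```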
by summing over all eigenvalues and using (8.18). Here we have defined 𝛾𝐴 as
the sum of all geometric multiplicities.
If the geometric multiplicities of all eigenvalues are equal to their algebraic
multiplicities and thus are maximal, then the eigenvectors have the important
and useful property that a basis of ℂ𝑛 , the eigenbasis, can be chosen from the set
of eigenvectors. This property is recorded in the following theorem.
∑_{𝑖=1}^{𝑑} 𝛾𝐴(𝜆𝑖) = 𝑛 (8.19)

holds, then

span(⋃_{𝑖=1}^{𝑑} 𝐸(𝜆𝑖)) = ℂ𝑛
and a basis of ℂ𝑛 , the eigenbasis, can be formed from 𝑛 linearly independent eigen-
vectors.
2. The trace of 𝐴, which is defined as the sum of all diagonal elements, is equal
to the sum of all its eigenvalues, i.e.,
tr(𝐴) ∶= ∑_{𝑖=1}^{𝑛} 𝐴𝑖𝑖 = ∑_{𝑖=1}^{𝑛} 𝜆𝑖.
3. The matrix 𝐴 is regular if and only if all of its eigenvalues are non-zero.
4. If 𝐴 is regular, then the eigenvalues of the inverse 𝐴−1 are (1∕𝜆1 , … , 1∕𝜆𝑛 ) with
the same algebraic and geometric multiplicities.
5. The eigenvalues of 𝐴𝑘 , 𝑘 ∈ ℕ, are (𝜆1𝑘 , … , 𝜆𝑛𝑘 ).
6. If 𝐴 is unitary, then the absolute value of all its eigenvalues is 1, i.e., |𝜆𝑖 | = 1
for all 𝑖 ∈ {1, … , 𝑛}.
7. If 𝐴 is Hermitian, then all its eigenvalues are real.
8. If 𝐴 is Hermitian and positive definite, positive semidefinite, negative definite,
or negative semidefinite, then all its eigenvalues are positive, nonnegative, neg-
ative, or nonpositive, respectively.
𝐴 = 𝑄Λ𝑄−1 , (8.20)
where 𝑄 ∈ ℂ𝑛×𝑛 is a regular matrix and Λ is a diagonal matrix whose entries are
the eigenvalues of 𝐴.
⎛1  1⎞
⎝0  1⎠ (8.21)
𝐴∗ 𝐴 = 𝐴𝐴∗ .
𝐴 = 𝑈Λ𝑈 ∗ .
This factorization is especially useful, since the basis change given by a uni-
tary matrix is numerically stable and the representation as a diagonal matrix in
this basis is especially simple.
Theorem 8.30 shows that if and only if there are 𝑛 linearly independent eigen-
vectors, then the matrix has an eigenfactorization and hence can be represented
very simply by a diagonal matrix in an eigenbasis. This naturally leads to the
question what happens when there are fewer than 𝑛 independent eigenvectors.
Can we still find a factorization? We expect that such a factorization would be
more complicated than the representation by a diagonal matrix; it is hardly con-
ceivable that a simpler factorization exists.
The answer is given by the following theorem, which also draws the complete
picture. In order to formulate the theorem, we need a definition first.
𝐽 = ⎛𝐽1      ⎞
    ⎜   ⋱   ⎟ ,
    ⎝      𝐽𝑑⎠

where

𝐽𝑖 = ⎛𝜆𝑖  1       ⎞
     ⎜   𝜆𝑖  ⋱   ⎟ .
     ⎜       ⋱  1⎟
     ⎝         𝜆𝑖⎠
This means that a Jordan matrix 𝐽 is a square matrix whose only non-zero
entries are in the main diagonal and the first superdiagonal. The blocks 𝐽𝑖 are
called Jordan blocks. In each block, all entries in the first superdiagonal are equal
to one.
𝐴 = 𝑃𝐵𝑃−1 .
𝐴 = 𝑃𝐽𝑃−1 .
The theorem shows that matrices are not diagonalizable in general, but that
every matrix is similar to a Jordan matrix, which still contains the eigenvalues in
the main diagonal, but may contain additional ones in the first superdiagonal.
Is the Jordan normal form of a matrix unique? Of course, it is also possible
to rearrange the similarity matrix 𝑃 such that the Jordan blocks are reordered.
Apart from these rearrangements, however, the Jordan normal form is unique,
as the following theorem records.
Theorem 8.36 (uniqueness of Jordan normal form) The Jordan normal form
of a matrix 𝐴 ∈ ℂ𝑛×𝑛 is unique up to the order of the Jordan blocks.
The Jordan normal form 𝐽 and its blocks have the following properties. The
geometric multiplicity 𝛾𝐴 (𝜆𝑖 ) is the number of Jordan blocks corresponding to
the eigenvalue 𝜆𝑖 , and the sum of the sizes of all Jordan blocks corresponding
to 𝜆𝑖 is its algebraic multiplicity 𝜇𝐴 (𝜆𝑖 ). In terms of the sizes of the Jordan blocks,
we thus see that a matrix 𝐴 is diagonalizable if and only if the algebraic and
geometric multiplicities of every eigenvalue 𝜆𝑖 coincide.
The sizes of the Jordan blocks corresponding to an eigenvalue help to solve
the mystery of the missing eigenvectors whenever the geometric multiplicity is
smaller than the algebraic one. In this case, there are fewer than 𝑛 linearly inde-
pendent eigenvectors, and we would like to find more vectors in order to com-
plete the set of eigenvectors and obtain a basis of the whole vector space.
We consider a three-dimensional example and define
𝐽 ∶= ⎛𝜆1  0   0 ⎞
     ⎜0   𝜆2  1 ⎟
     ⎝0   0   𝜆2⎠
consisting of two Jordan blocks. The Jordan normal form 𝐴 = 𝑃𝐽𝑃−1 implies
𝐴𝑃 = 𝑃𝐽, and we denote the three columns of 𝑃 by 𝐩𝑖 , 𝑖 ∈ {1, 2, 3}. Then we
have the equation
𝐴 (𝐩1 𝐩2 𝐩3) = (𝐩1 𝐩2 𝐩3) ⎛𝜆1  0   0 ⎞ = (𝜆1𝐩1  𝜆2𝐩2  𝐩2 + 𝜆2𝐩3),
                           ⎜0   𝜆2  1 ⎟
                           ⎝0   0   𝜆2⎠
whose columns yield the equations
(𝐴 − 𝜆1 𝐼)𝐩1 = 𝟎,
(𝐴 − 𝜆2 𝐼)𝐩2 = 𝟎,
(𝐴 − 𝜆2 𝐼)𝐩3 = 𝐩2 .
The first equation means that 𝐩1 ∈ ker(𝐴 − 𝜆1 𝐼) is an eigenvector for the eigen-
value 𝜆1 and the second equation means that 𝐩2 ∈ ker(𝐴 −𝜆2 𝐼) is an eigenvector
for the eigenvalue 𝜆2 .
The second Jordan block
⎛𝜆2  1 ⎞
⎝0   𝜆2⎠
(𝐴 − 𝜆2 𝐼)2 𝐩3 = (𝐴 − 𝜆2 𝐼)𝐩2 = 𝟎,
(𝐴 − 𝜆𝐼)𝑘 𝐯 = 𝟎,
(𝐴 − 𝜆𝐼)(𝑘−1) 𝐯 ≠ 𝟎.
⎛1  1⎞
⎝𝜖  1⎠

with 𝜖 ≠ 0. Its eigenvalues are 1 ± √𝜖, and therefore its Jordan normal form is

⎛1 + √𝜖     0    ⎞ .
⎝   0    1 − √𝜖⎠
This means that a slight perturbation of a matrix with multiple eigenvalues can
completely change the structure of its Jordan normal form. The numerical prob-
lem of calculating the Jordan normal form of a matrix is therefore ill-conditioned
and depends critically on the criterion whether two eigenvalues are considered
equal. Hence, the Jordan normal form of a matrix is usually avoided in computa-
tions, although it is of great theoretical importance, and alternatives such as the
Schur decomposition are employed. The Schur factorization or Schur decompo-
sition always exists, as the following theorem shows.
𝐴 = 𝑄𝑇𝑄−1 .
The Schur factorization means that every complex, square matrix is similar
to an upper-triangular matrix. The Schur factorization is not unique. The advan-
tage of such a factorization is that the basis change 𝑄 is given by a unitary matrix,
and we already know that multiplication by an orthogonal or by a unitary matrix
is well-conditioned.
How does the Schur factorization relate to eigenvalues? If a Schur factoriza-
tion is known, then 𝐴 and 𝑇 have the same eigenvalues by Problem 8.25. The
eigenvalues of an upper-triangular matrix are just its diagonal elements by Prob-
lem 8.26. This means that a Schur factorization of a matrix 𝐴 immediately yields
its eigenvalues.
In Julia, Schur factorization is available as the two functions
LinearAlgebra.schur and LinearAlgebra.schur!. Just as the other func-
tions for matrix factorization in Julia, it returns an object. The information
in the Schur object can be accessed as the fields T or Schur for the quasi
upper-triangular matrix, as the fields Z or vectors for the unitary matrix, and
as the field values for the eigenvalues. This built-in implementation only
calculates a quasi upper-triangular matrix and not an upper-triangular matrix
as in Theorem 8.38.
julia> A = randn(3, 3); f = schur(A);

julia> A - f.Z * f.T * f.Z'
3×3 Matrix{Float64}:
 9.99201e-16  -6.38378e-16  -1.11022e-15
 0.0          -2.22045e-16   6.66134e-16
 1.63064e-16  -1.66533e-16  -3.33067e-16
𝑄𝑘 𝑅𝑘 ∶= 𝐴𝑘−1 ,
𝐴𝑘 ∶= 𝑅𝑘 𝑄𝑘 .
𝐴𝑘 = 𝑄𝑘∗𝑄𝑘−1∗ ⋯ 𝑄1∗𝐴𝑄1𝑄2 ⋯ 𝑄𝑘 = (𝑄1𝑄2 ⋯ 𝑄𝑘)∗𝐴(𝑄1𝑄2 ⋯ 𝑄𝑘) = 𝑄∗𝐴𝑄,
5×5 Matrix{Float64}:
 -0.531019    -0.470487      0.0271631    -0.325596    -0.808545
  0.061126     0.435922      0.0383569     0.321512    -1.41668
 -0.00602601  -0.0159307     0.296267      0.0870228    1.98122
 -1.26847e-6  -0.000144795  -0.000895731   0.199251     1.87909
  4.86472e-9  -2.29092e-8   -7.59125e-7   -6.64697e-5  -0.10042
julia> Ak = A;

julia> for i in 1:100 (Q, R) = qr(Ak); Ak = R*Q end; Ak
5×5 Matrix{Float64}:
 -0.5          -0.531455     -0.0335534    -0.345635     -0.701827
  1.18005e-11   0.4           0.0504886     0.28702      -1.70175
 -3.62799e-25  -3.91756e-15   0.3           0.126148      1.77161
 -1.89986e-46  -1.1034e-34   -2.15349e-21   0.2           1.89475
  5.94044e-79  -1.54211e-68  -1.51952e-54  -5.22028e-35  -0.1
We observe that the eigenvalue closest to zero is approximated in the lower right
corner. After 100 steps, an upper-triangular matrix with the sought eigenvalues
is obtained. Even after 10 steps, three correct digits of the eigenvalue closest to
zero are found in the lower right corner.
The convergence rate of 𝑄𝑅 iteration depends on the separation between the
eigenvalues. The Gershgorin circle theorem, a bound on the spectrum (i.e., the
set of all eigenvalues) of a square matrix, is useful for testing for convergence.
It is known that an eigenvalue close to zero improves convergence. How can
we move an eigenvalue closer to zero? It is straightforward to see that if 𝜆 is an
eigenvalue of 𝐴, then 𝜆 − 𝑠 is an eigenvalue of 𝐴 − 𝑠𝐼. Assuming that we can
find suitable shifts 𝑠𝑘 that approximate the eigenvalues closest to zero, we hence
define shifted 𝑄𝑅 iteration as
𝑄𝑘 𝑅𝑘 ∶= 𝐴𝑘−1 − 𝑠𝑘 𝐼, (8.23a)
𝐴𝑘 ∶= 𝑅𝑘 𝑄𝑘 + 𝑠𝑘 𝐼. (8.23b)
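A sketch of shifted 𝑄𝑅 iteration (8.23) with the simple shift 𝑠𝑘 given by the lower right entry of the iterate; shifted_qr is a hypothetical name, the symmetric matrix is an arbitrary example, and the iteration count is chosen generously.

```julia
using LinearAlgebra

# Shifted QR iteration (8.23) with the shift s = Ak[n, n].
function shifted_qr(A; steps = 20)
    Ak = copy(float(A))
    n = size(Ak, 1)
    for _ in 1:steps
        s = Ak[n, n]                      # shift (8.23a)
        F = qr(Ak - s * I)
        Ak = F.R * Matrix(F.Q) + s * I    # (8.23b)
    end
    Ak
end

A = [4.0 1; 1 3]
T = shifted_qr(A)
@assert abs(T[2, 1]) < 1e-8               # (nearly) triangular
@assert sort(diag(T)) ≈ sort(eigvals(A))  # diagonal holds the eigenvalues
```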
and the next eigenvalue is calculated. Deflation is based on the following theo-
rem.
𝐴 = ⎛𝐵   𝐮⎞ ,
    ⎝𝟎∗  𝜆⎠
The next theorem shows that it is always possible to find a matrix 𝐻 in upper-
Hessenberg form that is similar to a given matrix 𝐴. The fact that the basis
change 𝑄 is unitary is advantageous and again ensures numerical stability. We
use Householder reflections (see Sect. 8.4.8.3) in the algorithm and its proof.
𝐴 = 𝑄𝐻𝑄−1 .
𝐴2 = 𝑄1∗𝐴1𝑄1 = ⎛ 𝑎11    ∗  ∗  ⋯  ∗⎞
               ⎜±‖𝐱‖2   ∗  ∗  ⋯  ∗⎟
               ⎜  0     ∗  ∗  ⋯  ∗⎟ .
               ⎜  ⋮     ⋮  ⋮     ⋮⎟
               ⎝  0     ∗  ∗  ⋯  ∗⎠
In all later steps, analogous calculations show that zeros are created by
left-multiplication by 𝑄𝑗∗ and all zeros thus created, also in the previous
steps, remain after right-multiplication by 𝑄𝑗 .
We also set
𝑄 ∶= 𝑄1 ⋯ 𝑄𝑛−2
to obtain
𝐻 = 𝑄∗ 𝐴𝑄.
The matrix 𝑄 is unitary as a product of unitary matrices, and the matrix 𝐻
is upper Hessenberg.
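The factorization of the theorem is available as the built-in function hessenberg; the example matrix is arbitrary.

```julia
using LinearAlgebra

# hessenberg computes A = Q * H * Q' with unitary Q and
# upper-Hessenberg H.
A = [4.0 1 2 3; 1 3 1 2; 0 1 5 1; 3 2 1 4]
F = hessenberg(A)
H = Matrix(F.H)
Q = Matrix(F.Q)
@assert Q * H * Q' ≈ A
@assert all(H[i, j] == 0 for j in 1:4 for i in j+2:4)   # upper Hessenberg
```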
𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑠 ≥ 0.
which explains the names of the singular vectors. Geometrically, these equations
mean that the function represented by 𝐴 maps each right singular vector 𝐯𝑘 to
the corresponding left singular vector 𝐮𝑘 stretched by the corresponding singular
value 𝜎𝑘.
The svd and the eigenfactorization of a matrix are related. The svd 𝐴 =
𝑈Σ𝑉∗ of a matrix 𝐴 ∈ ℂ𝑛×𝑚 yields the two equations

𝐴∗𝐴 = 𝑉(Σ∗Σ)𝑉∗, (8.25a)
𝐴𝐴∗ = 𝑈(ΣΣ∗)𝑈∗. (8.25b)
The first equation means that the right singular vectors (i.e., the columns of 𝑉)
are eigenvectors of 𝐴∗ 𝐴, while the second equation means that the left singular
vectors (i.e., the columns of 𝑈) are eigenvectors of 𝐴𝐴∗ . Furthermore, the non-
zero singular values are the square roots of the non-zero eigenvalues of 𝐴∗ 𝐴 or
𝐴𝐴∗ . If 𝐴 is normal, then it can be diagonalized and written as 𝐴 = 𝑈Λ𝑈 ∗ by
Theorem 8.32. If 𝐴 is positive semidefinite in addition, then this factorization
𝐴 = 𝑈Λ𝑈 ∗ is also an svd.
This observation also yields a numerical algorithm for the calculation of the
singular values and singular vectors of a matrix 𝐴, namely to apply 𝑄𝑅 iteration
(see Sect. 8.4.9) to the matrix in (8.25a) to first find the singular values and the
right singular vectors of 𝐴 and then to use (8.24) to find its left singular vectors.
Practical methods, however, are based on the matrix
⎛0   𝐴∗⎞ .
⎝𝐴   0 ⎠
Theorem 8.45 (svd and rank, nullity) Suppose 𝐴 ∈ ℂ𝑛×𝑚. Then the left sin-
gular vectors corresponding to non-zero singular values of 𝐴 span the range of 𝐴
and the right singular vectors corresponding to zero singular values of 𝐴 span the
null space of 𝐴. Furthermore, the rank of 𝐴 equals the number of non-zero singular
values.
Knowing an svd of a matrix, its pseudoinverse (see Sect. 8.4.8.4) is easily found,
as the following theorem shows.
𝐴+ = 𝑉Σ+ 𝑈 ∗ .
The following theorem means that the principal singular value 𝜎1 is equal to
the operator 2-norm of the matrix 𝐴. The 𝑝-norm of a matrix is defined as
‖𝐴𝐱‖𝑝
‖𝐴‖𝑝 ∶= sup .
𝐱≠𝟎 ‖𝐱‖𝑝
Therefore the svd is the usual means for calculating the 2-norm of a matrix.
Theorem 8.47 (svd and norm) Suppose 𝐴 ∈ ℂ𝑛×𝑚 . Then ‖𝐴‖2 = 𝜎1 .
The following criterion for determining whether a square matrix is regular or
singular follows from Theorem 8.45.
Theorem 8.48 (svd and regularity) Suppose 𝐴 ∈ ℂ𝑛×𝑛 . Then 𝐴 is regular if
and only if 𝜎𝑛 ≠ 0.
Another important application of the svd is the approximation of a matrix 𝐴
by a simpler and – as the following theorem shows – truncated version of 𝐴. The
idea is to use only the first 𝑘 singular values, which are also the largest ones
by convention. Depending on how fast the singular values decrease, these first
singular values may already capture a significant portion of the behavior of the
linear function. As the approximation is of lower rank than the original matrix 𝐴,
it is called an low-rank approximation of 𝐴.
Theorem 8.49 (svd and low-rank approximation, Eckart–Young–Mirsky
Theorem) Suppose 𝐴 ∈ ℂ𝑛×𝑚 and define
𝐴𝑘 ∶= 𝑈Σ𝑘 𝑉 ∗ ,
where Σ𝑘 is the copy of Σ only containing the first 𝑘 singular values of Σ. Then the
equation
‖𝐴𝑘 − 𝐴‖2 = 𝜎𝑘+1 ∀𝑘 ∈ {1, … , 𝑛 − 1}
holds.
For computing the low-rank approximation 𝐴𝑘 of rank 𝑘, only the first 𝑘 left
and right singular vectors, i.e., the first 𝑘 columns of 𝑈 and 𝑉, are needed, as all
singular values after the first 𝑘 ones are replaced by zero in Σ𝑘.
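Theorem 8.49 can be checked numerically for a small matrix whose singular values are known by construction.

```julia
using LinearAlgebra

# Eckart–Young–Mirsky check: the rank-2 truncation of a matrix with
# singular values 4, 2, 1 has 2-norm error σ₃ = 1.
A = [4.0 0 0; 0 2 0; 0 0 1; 0 0 0]
F = svd(A)
k = 2
Ak = F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k, :]
@assert opnorm(A - Ak) ≈ F.S[k+1]
@assert rank(Ak) == k
```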
In Julia, the svd of a matrix is calculated by the two functions
LinearAlgebra.svd and LinearAlgebra.svd!. The singular values are com-
puted by LinearAlgebra.svdvals and LinearAlgebra.svdvals!. All these
functions follow the convention of sorting the singular values in descending or-
der. The functions svd and svd! return objects of type SVD with the fields U, S,
and Vt. There is also a field V, but since 𝑉∗ is calculated and accessible by Vt, it
is more efficient to use Vt than V.
julia> A = randn(5, 5); f = svd(A);

julia> f.V' == f.Vt
true
If svd or svd! are called with two matrix arguments, they compute the gener-
alized svd of two matrices.
Tables 8.7 and 8.8 give an overview of the vector and matrix operations available
in Julia excluding matrix factorizations.
A multitude of low-level functions are available as well; for example,
the blas (Basic Linear Algebra Subprograms) and lapack (Linear Algebra
Package) functions are available in the modules LinearAlgebra.BLAS and
LinearAlgebra.LAPACK.
Table 8.9 summarizes the various types of matrix factorizations available in
Julia. The functions in Table 8.9 whose names end with an exclamation mark
are destructive versions of their counterparts without the exclamation mark and
hence save memory. The function factorize acts as a general interface to the
various matrix factorizations. It recognizes the matrix types listed in Table 8.10,
determines the most specific type a given matrix has, and then calculates the
factorization indicated in the table. The return value can be used as an argument
to the left-division operator \.
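A small sketch of factorize at work; the symmetric positive-definite example is arbitrary.

```julia
using LinearAlgebra

# factorize picks the most specific factorization; for a dense
# positive-definite matrix it returns a Cholesky object (Table 8.10),
# which can be reused with the left-division operator \.
S = [4.0 1; 1 3]
F = factorize(S)
b = [1.0, 2.0]
@assert F isa Cholesky
@assert F \ b ≈ S \ b
```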
Problems
Table 8.7 Vector and matrix operations (in the module LinearAlgebra).
Function     Description
*            matrix multiplication
\            left division
/            right division
dot          compute the inner product
cross        compute the cross product
transpose    compute the transpose
transpose!   compute the transpose (destructive version)
adjoint      compute the conjugate transpose
adjoint!     compute the conjugate transpose (destructive version)
'            postfix operator, same as adjoint
det          compute the determinant
inv          compute the inverse (using left division)
kron         compute the Kronecker tensor product
logdet       compute the logarithm of determinant
logabsdet    compute the logarithm of absolute value of determinant
nullspace    compute a basis of the nullspace
rank         compute the rank by counting the non-zero singular values
pinv         compute the Moore–Penrose pseudoinverse
tr           compute the trace (sum of diagonal elements)
norm         compute the norm of a vector or the operator norm of a matrix
normalize    normalize so that the norm becomes one
normalize!   destructive version of normalize
diag         return the given diagonal of the given matrix as a vector
diagind      return an AbstractRange with the indices of the given diagonal
diagm        construct a matrix with the given vector as a diagonal
repeat       construct an array by repeating the elements of a given one
tril         return the lower triangle of a matrix
tril!        destructive version of tril
triu         return the upper triangle of a matrix
triu!        destructive version of triu
cond         compute the condition number of a matrix
condskeel    compute the Skeel condition number of a matrix
givens       compute a Givens rotation
lyap         solve a Lyapunov equation
sylvester    solve a Sylvester equation
peakflops    compute the peak flop rate of the computer using matrix multiplication
Table 8.8 Functions for checking properties of matrices (in the module LinearAlgebra).
Function     Description
isbanded     determine whether a matrix is banded
isdiag       determine whether a matrix is diagonal
ishermitian  determine whether a matrix is Hermitian
isposdef     determine whether a matrix is positive definite
isposdef!    destructive version of isposdef
issuccess    determine whether a matrix factorization succeeded
issymmetric  determine whether a matrix is symmetric
istril       determine whether a matrix is lower triangular
istriu       determine whether a matrix is upper triangular
Table 8.9 Functions for matrix factorizations (in the module LinearAlgebra).
Function       Description
factorize      compute a convenient factorization, general interface to factorizations
bunchkaufman   compute the Bunch–Kaufman fact. of a symmetric/Hermitian matrix
bunchkaufman!  destructive version of bunchkaufman
cholesky       compute Cholesky factorization of positive definite matrix
cholesky!      destructive version of cholesky
eigen          compute eigenfactorization
eigen!         destructive version of eigen
eigvals        compute the eigenvalues
eigvals!       destructive version of eigvals
eigvecs        compute the eigenvectors
eigmin         compute the smallest eigenvalue if all eigenvalues are real
eigmax         compute the largest eigenvalue if all eigenvalues are real
hessenberg     compute Hessenberg factorization
hessenberg!    destructive version of hessenberg
ldlt           compute 𝐿𝐷𝐿⊤ factorization
ldlt!          destructive version of ldlt
lq             compute 𝐿𝑄 factorization
lq!            destructive version of lq
lu             compute 𝐿𝑈 factorization
lu!            destructive version of lu
qr             compute 𝑄𝑅 factorization
qr!            destructive version of qr
schur          compute Schur factorization
schur!         destructive version of schur
ordschur       reorder Schur factorization
ordschur!      destructive version of ordschur
svd            compute svd
svd!           destructive version of svd
svdvals        compute singular values and return them in descending order
svdvals!       destructive version of svdvals
Table 8.10 Forms of matrices recognized by the function factorize and the factorization
function called for each form.
Form Function
Diagonal none
Bidiagonal none
Tridiagonal lu
Lower/upper triangular none
Positive definite cholesky
Dense symmetric/Hermitian bunchkaufman
Sparse symmetric/Hermitian ldlt
Symmetric real tridiagonal ldlt
General square lu
General non-square qr
8.5 (Volume of parallelepiped) Show that the signed volume 𝑉 of the paral-
lelepiped with the edges 𝐚, 𝐛, and 𝐜 is given by
𝑉 = 𝐚 ⋅ (𝐛 × 𝐜) = 𝐛 ⋅ (𝐜 × 𝐚) = 𝐜 ⋅ (𝐚 × 𝐛).
8.11 Prove that the product of orthogonal matrices is again an orthogonal ma-
trix.
8.16 Prove that every eigenspace is a linear subspace, i.e., it is closed under ad-
dition and scalar multiplication.
8.25 Show that two similar matrices 𝐴 and 𝐵 have the same eigenvalues. How
do the eigenvectors of 𝐵 relate to those of 𝐴?
8.31 Compare the convergence rates of standard 𝑄𝑅 iteration (Problem 8.27) and
shifted 𝑄𝑅 iteration (Problem 8.30).
𝐵 ∶= ( 0 𝐴∗ ; 𝐴 0 ),
where 𝐮𝑘 and 𝐯𝑘 are the 𝑘-th column of 𝑈 and 𝑉, respectively. Then use Theo-
rem 8.47.
8.42 Use Theorem 8.49 to compress an image. Choose a sample image, repre-
sent it by a matrix, and use different numbers of singular values to compress the
image.
8.43 Find an example of each of the types of matrices listed in Table 8.10, check
its type in Julia, and determine the type of the return value after factorization.
Part II
Algorithms for Differential Equations
Chapter 9
Ordinary Differential Equations
Differo, distuli, dilatum (Latin, from dis- (apart) and fero (carry, bear)):
to carry different ways, to spread, to scatter, to disperse, to separate
9.1 Introduction
where 𝐺 is a function of all derivatives that occur in the equation and 𝐼 is usually
an interval, possibly all of ℝ.
Since this is a very abstract way of writing an ode, we now derive an impor-
tant time dependent ode that models exponential growth. The example stems
from modeling bacterial growth. We denote the amount of bacteria in a Petri
dish by 𝑦(𝑡) and the known amount of bacteria at the initial time 𝑡 = 0 by 𝑦0 ,
i.e.,
𝑦(0) = 𝑦0 ∈ ℝ+ .
To derive a differential equation, we start with a finite, small time interval of
length Δ𝑡 ∈ ℝ+ . Our modeling assumption is that the equation
𝑦(𝑡 + Δ𝑡) = 𝑦(𝑡) + 𝛼 Δ𝑡 𝑦(𝑡)
holds, which is reasonable because it means that the number of bacteria at the
end of the small time interval is equal to their number at the beginning of the
interval plus a constant 𝛼 ∈ ℝ times the length of the interval times the number
of bacteria present (at the beginning of the interval). In other words, the change
in the number of bacteria is proportional to the length of the interval and the
number of bacteria provided that the time interval is small enough.
By considering the units of the terms in the equation, it becomes clear that
the last term must contain a constant factor, because if it did not, the units could
not match. More precisely, if we denote the unit of 𝑦 by [𝑦], comparing the three
terms yields [𝑦] = [𝛼][𝑡][𝑦], and thus the unit of the constant factor 𝛼 is [𝛼] =
1∕[𝑡]; it is a growth rate. Such considerations are a general principle and they are
very useful when assessing constants or parameters in differential equations.
Rearranging the terms in the equation and letting Δ𝑡 → 0 yields
𝑦 ′ (𝑡) = 𝛼𝑦(𝑡).
In order to fully specify the problem, we also have to give the initial value 𝑦(0) =
𝑦0 in addition to the equation that holds at all later times. Therefore we arrive at
the initial-value problem
𝑦 ′ (𝑡) = 𝛼𝑦(𝑡) ∀𝑡 ∈ ℝ+ , 𝑦(0) = 𝑦0 .
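The derivation can be checked numerically: iterating the modeling assumption 𝑦(𝑡 + Δ𝑡) = 𝑦(𝑡) + 𝛼 Δ𝑡 𝑦(𝑡) approaches the exact solution 𝑦0 e^(𝛼𝑡) as Δ𝑡 shrinks. The following sketch uses parameter values of our own choosing.

```julia
# Iterate the discrete growth model y(t+Δt) = y(t) + α Δt y(t) over N
# steps of length Δt = T/N and return the final value.
function grow(y0, α, T, N)
    Δt = T / N
    y = y0
    for _ in 1:N
        y += α * Δt * y
    end
    y
end

# The error relative to the exact solution y0 * exp(α T) shrinks with Δt.
for N in (10, 100, 1000)
    println(N, "  ", abs(grow(1.0, 0.5, 1.0, N) - exp(0.5)))
end
```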
Do solutions of a given ode exist? Is the solution unique? These questions are
not only of theoretical importance, but they are also valid questions to ask from
the modeling and numerical points of view. An ode that is supposed to model
a physical, chemical, biological, or engineering process gains credibility when
it is known that it has a unique solution. The existence and uniqueness of a
solution is also important whenever the solution of an ode is to be approximated
numerically. What should be approximated if there is no unique solution?
Here we answer the questions of existence and uniqueness for general first-
order initial-value problems under very general assumptions [2, Section 2.8]. We
can always write a first-order initial-value problem in the form
𝑦 ′ (𝑡) = 𝑓(𝑡, 𝑦(𝑡)), 𝑦(0) = 0. (9.2)
Here we have assumed that the initial point is the origin (0, 0), but this can al-
ways be achieved by a simple substitution. The main result is the following.
Theorem 9.1 (Picard’s existence and uniqueness theorem) Suppose
𝑓 ∶ 𝑅 → ℝ and 𝜕𝑓∕𝜕𝑦 are continuous functions in a rectangle 𝑅 ∶= [−𝑎, 𝑎] ×
[−𝑏, 𝑏] containing the origin. Then there exists a unique solution 𝑦 ∶ [−ℎ, ℎ] → ℝ
defined on the interval [−ℎ, ℎ] ⊂ [−𝑎, 𝑎] of the first-order initial-value problem
(9.2).
The solution referred to in this theorem is a classical solution, i.e., a function
that is differentiable, that thus can be substituted into equation (9.2), and that
satisfies it for every point in the solution interval.
Proof To be able to apply the method used in this proof, we transform the dif-
ferential equation into an integral equation. This can always be achieved by in-
tegrating the equation. If 𝑦 is a solution of (9.2), then 𝑓(𝑡, 𝑦(𝑡)) is a continuous
function by assumption and hence integrable. Integrating (9.2) from the initial
point 0 to an arbitrary point 𝑡 yields the integral equation
𝑦(𝑡) = ∫₀ᵗ 𝑓(𝑠, 𝑦(𝑠)) d𝑠, (9.3)
We construct a sequence of successive approximations, starting with
𝑦0 (𝑡) ∶= 0,
which satisfies the initial condition. Further approximations are found by using
the previous approximations on the right-hand side of the integral equation (9.3)
and using it as the definition of the next approximation, i.e., by defining
𝑦𝑛+1 (𝑡) ∶= ∫₀ᵗ 𝑓(𝑠, 𝑦𝑛 (𝑠)) d𝑠. (9.4)
Each function in the sequence ⟨𝑦𝑛 ⟩ satisfies the initial condition. If there is
an 𝑛 ∈ ℕ0 such that 𝑦𝑛 = 𝑦𝑛+1 , then 𝑦𝑛 is a solution of the differential equation
and hence the integral equation, but in general this does not happen.
We therefore consider the limit function of this sequence and will establish
that it solves the equation by proceeding in the following steps.
1. Are all elements of the sequence well-defined, differentiable functions?
2. If yes, does the sequence converge?
3. If yes, does the limit function satisfy the integral equation (9.3)?
4. If yes, is the solution unique?
If the last question can be answered positively, the proof is complete.
1. So far, the approximations 𝑦𝑛 have not been fully defined. In addition to
(9.4), the domains of definition (and the images) of the functions 𝑦𝑛 must also
be specified. Here, in particular, the domains of definition must be specified such
that 𝑓(𝑠, 𝑦𝑛 (𝑠)) in the integrand of 𝑦𝑛+1 can be evaluated. Since 𝑓 is only known
to be defined when its second argument is in the interval [−𝑏, 𝑏], the domains
of definition must be chosen sufficiently small such that 𝑦𝑛 lies in the interval
[−𝑏, 𝑏].
Since 𝑓 is a continuous function on a closed bounded domain, it is bounded,
i.e.,
∃𝑀 ∈ ℝ₀⁺ ∶ ∀(𝑡, 𝑦) ∈ 𝑅 ∶ |𝑓(𝑡, 𝑦)| ≤ 𝑀. (9.5)
Because 𝑦𝑛′ (𝑡) = 𝑓(𝑡, 𝑦𝑛−1 (𝑡)), the absolute slope of 𝑦𝑛 is also bounded by 𝑀.
Hence, because 𝑦𝑛 (0) = 0, we have |𝑦𝑛 (𝑡)| ≤ 𝑀|𝑡|. This consideration
implies that the condition that 𝑦𝑛 lies in the interval [−𝑏, 𝑏] is ensured if |𝑡| ≤
𝑏∕𝑀.
Therefore we define
ℎ ∶= min(𝑎, 𝑏∕𝑀)
and use the rectangle
𝐷 ∶= [−ℎ, ℎ] × [−𝑏, 𝑏]
as the region containing the graphs of the functions 𝑦𝑛 . The 𝑦𝑛 are thus
well-defined functions 𝑦𝑛 ∶ [−ℎ, ℎ] → [−𝑏, 𝑏].
2. The second question is whether the sequence ⟨𝑦𝑛 ⟩ converges. We start by
showing the estimate
|𝑦𝑛 (𝑡) − 𝑦𝑛−1 (𝑡)| ≤ 𝑀𝐿^(𝑛−1) |𝑡|^𝑛 ∕ 𝑛! ∀𝑡 ∈ [−ℎ, ℎ] ∀𝑛 ∈ ℕ (9.6)
by induction. If 𝑛 = 1, then |𝑦1 (𝑡)| ≤ 𝑀|𝑡| follows from the definition (9.4) of 𝑦1
and (9.5). If 𝑛 > 1, we use the Lipschitz condition (see Problem 9.2) to calculate
|𝑦𝑛+1 (𝑡) − 𝑦𝑛 (𝑡)| ≤ ∫₀ᵗ |𝑓(𝑠, 𝑦𝑛 (𝑠)) − 𝑓(𝑠, 𝑦𝑛−1 (𝑠))| d𝑠
≤ 𝐿 ∫₀ᵗ |𝑦𝑛 (𝑠) − 𝑦𝑛−1 (𝑠)| d𝑠
≤ 𝐿 ∫₀ᵗ 𝑀𝐿^(𝑛−1) |𝑠|^𝑛 ∕ 𝑛! d𝑠
= 𝑀𝐿^𝑛 |𝑡|^(𝑛+1) ∕ (𝑛 + 1)!.
Since |𝑡| ≤ ℎ, the estimate (9.6) yields
|𝑦𝑛 (𝑡) − 𝑦𝑛−1 (𝑡)| ≤ 𝑀𝐿^(𝑛−1) ℎ^𝑛 ∕ 𝑛! ∀𝑡 ∈ [−ℎ, ℎ] ∀𝑛 ∈ ℕ, (9.7)
whose right-hand side is independent of 𝑡.
Next, we write 𝑦𝑛 (𝑡) as the telescoping sum
𝑦𝑛 (𝑡) = 𝑦0 (𝑡) + (𝑦1 (𝑡) − 𝑦0 (𝑡)) + ⋯ + (𝑦𝑛 (𝑡) − 𝑦𝑛−1 (𝑡)),
which implies
|𝑦𝑛 (𝑡)| ≤ |𝑦0 (𝑡)| + |𝑦1 (𝑡) − 𝑦0 (𝑡)| + ⋯ + |𝑦𝑛 (𝑡) − 𝑦𝑛−1 (𝑡)|. (9.8)
Applying (9.7) to each term on the right-hand side yields
|𝑦𝑛 (𝑡)| ≤ 0 + ∑ₖ₌₁ⁿ 𝑀𝐿^(𝑘−1) ℎ^𝑘 ∕ 𝑘! = (𝑀∕𝐿) ∑ₖ₌₁ⁿ (𝐿ℎ)^𝑘 ∕ 𝑘! ∀𝑡 ∈ [−ℎ, ℎ] ∀𝑛 ∈ ℕ.
We have hence shown that the sum in (9.8) converges as 𝑛 → ∞. Therefore the
sequence ⟨𝑦𝑛 (𝑡)⟩ converges for all 𝑡 ∈ [−ℎ, ℎ] as it is a sequence of partial sums
of a convergent infinite series.
The bound in (9.7) does not depend on 𝑡 and hence the bounds in the inequal-
ities in the preceding argument also hold independently of 𝑡. Therefore the se-
quence ⟨𝑦𝑛 ⟩ even converges uniformly.
3. Having shown that the sequence ⟨𝑦𝑛 ⟩ converges uniformly, we denote its limit
by
𝑦(𝑡) ∶= lim 𝑦𝑛 (𝑡).
𝑛→∞
Since the sequence ⟨𝑦𝑛 ⟩ converges uniformly, we can interchange taking the
limit and integration (see Problem 9.3) to obtain
𝑦(𝑡) = ∫₀ᵗ lim𝑛→∞ 𝑓(𝑠, 𝑦𝑛 (𝑠)) d𝑠.
Since the function 𝑓 is continuous in its second argument, we can take the limit
inside its second argument to find
𝑦(𝑡) = ∫₀ᵗ 𝑓(𝑠, lim𝑛→∞ 𝑦𝑛 (𝑠)) d𝑠 = ∫₀ᵗ 𝑓(𝑠, 𝑦(𝑠)) d𝑠.
The last equation means that 𝑦 solves the integral equation and hence the differ-
ential equation by the discussion at the beginning of the proof.
4. Is the solution unique? Suppose there is another solution 𝑧. Then
𝑦(𝑡) − 𝑧(𝑡) = ∫₀ᵗ (𝑓(𝑠, 𝑦(𝑠)) − 𝑓(𝑠, 𝑧(𝑠))) d𝑠 ∀𝑡 ∈ [0, 𝑎].
Taking absolute values and using the Lipschitz continuity of 𝑓 (see Problem 9.2) yields
|𝑦(𝑡) − 𝑧(𝑡)| ≤ 𝐿 ∫₀ᵗ |𝑦(𝑠) − 𝑧(𝑠)| d𝑠 ∀𝑡 ∈ [0, 𝑎].
We denote the integral on the right-hand side by 𝑈(𝑡). The function 𝑈 is
differentiable, and we obviously have
𝑈(0) = 0 (9.9)
and
𝑈(𝑡) ≥ 0 ∀𝑡 ∈ [0, 𝑎]. (9.10)
Using 𝑈, the last inequality becomes
𝑈 ′ (𝑡) ≤ 𝐿𝑈(𝑡) ∀𝑡 ∈ [0, 𝑎], i.e., (e^(−𝐿𝑡) 𝑈(𝑡))′ ≤ 0.
Integrating this inequality from zero to 𝑡 and using (9.9), we find the inequality
e^(−𝐿𝑡) 𝑈(𝑡) ≤ 0 ∀𝑡 ∈ [0, 𝑎].
The last inequality and (9.10) imply 𝑈(𝑡) = 0 for all 𝑡 ∈ [0, 𝑎] and hence 𝑈 ′ (𝑡) =
|𝑦(𝑡) − 𝑧(𝑡)| = 0. In other words, any two solutions 𝑦 and 𝑧 are identical for
𝑡 ∈ [0, 𝑎]. An analogous argument shows that the solution is unique for all 𝑡 ∈
[−𝑎, 0].
This completes the proof. □
An alternative proof is based on observing that the operator given by the Pi-
card iteration (9.4) is a contraction and then using the Banach fixed-point theo-
rem (see Problem 9.4).
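The Picard iteration is also a practical method of computation. The following sketch (our own example; the integral in (9.4) is replaced by the cumulative trapezoid rule on a grid) applies the iteration to 𝑦 ′ (𝑡) = 𝑡 + 𝑦(𝑡), 𝑦(0) = 0, whose exact solution is eᵗ − 𝑡 − 1.

```julia
# Picard iteration y_{n+1}(t) := ∫₀ᵗ f(s, y_n(s)) ds on a grid ts,
# starting from y_0 ≡ 0; the integral is approximated by the
# cumulative trapezoid rule.
function picard(f, ts, iterations)
    y = zeros(length(ts))
    for _ in 1:iterations
        g = [f(t, v) for (t, v) in zip(ts, y)]
        z = similar(y)
        z[1] = 0.0
        for i in 2:length(ts)
            z[i] = z[i-1] + (ts[i] - ts[i-1]) * (g[i-1] + g[i]) / 2
        end
        y = z
    end
    y
end

ts = range(0.0, 1.0, length = 1001)
y = picard((t, y) -> t + y, ts, 12)
println(abs(y[end] - (exp(1.0) - 2.0)))   # small after a few iterations
```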
The result of the theorem is that the solution exists in a finite (and possibly
very small) time interval. Can we do better? In general, a stronger result cannot
be expected. The most prominent and simple counterexample is the equation
𝑦 ′ (𝑡) = 𝑦(𝑡)² with the initial condition 𝑦(0) = 𝑦0 . Separation of variables shows
that its solution is 𝑦(𝑡) = 1∕(1∕𝑦0 − 𝑡) if 𝑦0 ≠ 0 and 𝑦 = 0 if 𝑦0 = 0. If 𝑦0 > 0, then
the solution exists only in the interval 𝑡 ∈ [0, 1∕𝑦0 ) and becomes unbounded
within a finite amount of time.
A linear ode is a special case of the general form (9.1) and has the form
𝑎𝑛 (𝑡)𝑦 (𝑛) (𝑡) + 𝑎𝑛−1 (𝑡)𝑦 (𝑛−1) (𝑡) + ⋯ + 𝑎1 (𝑡)𝑦 ′ (𝑡) + 𝑎0 (𝑡)𝑦(𝑡) = 𝑏(𝑡) ∀𝑡 ∈ 𝐼,
whose defining feature is that all terms that contain the unknown 𝑦 are linear
in 𝑦.
𝑧0 ∶= 𝑦,
𝑧1 ∶= 𝑦 ′ ,
⋮
𝑧𝑛−1 ∶= 𝑦 (𝑛−1)
and can now write the linear equation as the linear system
𝑧0′ = 𝑧1 ,
𝑧1′ = 𝑧2 ,
⋮
′
𝑧𝑛−2 = 𝑧𝑛−1 ,
′
𝑎𝑛 𝑧𝑛−1 = 𝑏 − 𝑎𝑛−1 𝑧𝑛−1 − 𝑎𝑛−2 𝑧𝑛−2 − ⋯ − 𝑎1 𝑧1 − 𝑎0 𝑧0
in the interval 𝐼. There are 𝑛 equations for the 𝑛 variables 𝑧0 , … , 𝑧𝑛−1 . The last
equation stems from the original equation, while the other equations connect
the new variables.
This consideration underlines the importance of (linear) systems of first-order
odes. Any linear ode of order 𝑛 for a single unknown function can be written
in this form, and linear problems with more unknowns can also be written in
this form. Therefore most numerical programs for odes have been developed
for systems of first-order equations.
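As a small illustration (the example and names are ours), the second-order equation 𝑦″ = −𝑦 becomes the first-order system 𝑧0′ = 𝑧1 , 𝑧1′ = −𝑧0 , which can be advanced step by step using the difference quotient.

```julia
# Advance the system z0' = z1, z1' = -z0 (i.e., y'' = -y) with N steps
# of length h, starting from z_start = (y(0), y'(0)).
function oscillator(z_start, h, N)
    z0, z1 = float.(z_start)
    for _ in 1:N
        z0, z1 = z0 + h * z1, z1 - h * z0
    end
    (z0, z1)
end

y1, _ = oscillator((1.0, 0.0), 1e-4, 10_000)
println(abs(y1 - cos(1.0)))   # approximates y(1) = cos(1)
```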
In the rest of this chapter, numerical methods for the approximation of solutions
of odes are presented. Although many sophisticated methods for finding solu-
tions of odes in closed form have been developed, the solutions can generally not
be written in closed form. A simple counterexample is the ode 𝑦 ′ (𝑡) = 𝑓(𝑡) with
the initial condition 𝑦(0) = 0. Its solution is the integral 𝑦(𝑡) = ∫₀ᵗ 𝑓(𝑠)d𝑠. But
the antiderivative of an elementary function (in the sense of differential algebra
this is a function that can be written in closed algebraic form) is not necessarily
elementary; the most prominent example is the function
𝑓(𝑡) ∶= e^(−𝑡²).
The Risch algorithm is a decision procedure that answers the question whether
an elementary function has an elementary antiderivative or not [6, 7].
9.4 Euler Methods
The most straightforward idea to solve any differential equation is to use the
definition of the derivative and to replace it by its difference quotient. We start
from the general first-order equation
𝑦 ′ (𝑡) = 𝑓(𝑡, 𝑦(𝑡))
and assume that it has a unique solution (see Sect. 9.2). We also define a sequence
of points 𝑡𝑛 such that 𝑡0 < 𝑡1 < ⋯ < 𝑡𝑛 < 𝑡𝑛+1 < ⋯ and denote the approxima-
tion of 𝑦(𝑡𝑛 ) by 𝑦𝑛 . Replacing the derivative by its forward difference quotient
yields
(𝑦𝑛+1 − 𝑦𝑛 ) ∕ (𝑡𝑛+1 − 𝑡𝑛 ) ≈ 𝑦 ′ (𝑡𝑛 ) = 𝑓(𝑡𝑛 , 𝑦(𝑡𝑛 )),
necessitating that 𝑡𝑛+1 − 𝑡𝑛 is small. This motivates the definition
𝑦𝑛+1 ∶= 𝑦𝑛 + ℎ𝑓(𝑡𝑛 , 𝑦𝑛 ), ℎ ∶= 𝑡𝑛+1 − 𝑡𝑛 , (9.11)
which is the forward Euler method.
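A minimal implementation sketch of the forward Euler method (the function name and calling convention are ours):

```julia
# Forward Euler: y[n+1] = y[n] + h f(t[n], y[n]) on a uniform grid.
function forward_euler(f, t_start, t_end, y_start, N)
    h = (t_end - t_start) / N
    t = range(t_start, t_end, length = N + 1)
    y = Vector{Float64}(undef, N + 1)
    y[1] = y_start
    for n in 1:N
        y[n+1] = y[n] + h * f(t[n], y[n])
    end
    (t = t, y = y)
end

sol = forward_euler((t, y) -> y, 0.0, 1.0, 1.0, 10_000)
println(abs(sol.y[end] - exp(1.0)))   # first-order accurate
```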
Algorithm 9.3 (backward Euler method) Input: the right-hand side 𝑓, the
initial value 𝑦0 , and points 𝑡0 < 𝑡1 < ⋯ < 𝑡𝑁 or a step size ℎ.
1. Loop for 𝑛 from 0 to 𝑁 − 1: set 𝑦𝑛+1 to be the solution of the (algebraic) equation
𝑦𝑛+1 = 𝑦𝑛 + (𝑡𝑛+1 − 𝑡𝑛 )𝑓(𝑡𝑛+1 , 𝑦𝑛+1 ).
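A sketch of the backward Euler method; solving the implicit equation by a simple fixed-point iteration is our choice here, and it converges only for sufficiently small step sizes.

```julia
# Backward Euler: in each step the implicit equation
# y[n+1] = y[n] + h f(t[n+1], y[n+1]) is solved by fixed-point iteration.
function backward_euler(f, t_start, t_end, y_start, N; iterations = 50)
    h = (t_end - t_start) / N
    t = range(t_start, t_end, length = N + 1)
    y = Vector{Float64}(undef, N + 1)
    y[1] = y_start
    for n in 1:N
        z = y[n]                         # initial guess
        for _ in 1:iterations
            z = y[n] + h * f(t[n+1], z)  # fixed-point step
        end
        y[n+1] = z
    end
    (t = t, y = y)
end

sol = backward_euler((t, y) -> -y, 0.0, 1.0, 1.0, 1000)
println(abs(sol.y[end] - exp(-1.0)))     # first-order accurate
```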
The difference between the exact solution of the ode and its numerical approxi-
mation is called the global truncation error. It stems from two causes (ignoring
the round-off error). The first cause is the use of an approximate formula to cal-
culate 𝑦𝑛+1 from the previous approximation 𝑦𝑛 (assuming that the previous ap-
proximation was exact, i.e., 𝑦(𝑡𝑛 ) = 𝑦𝑛 ). This cause of errors is called the local
truncation error; it is the error due to the use of an approximate formula only.
The second cause is the fact that the input used in each step is only approximately
correct, since 𝑦(𝑡𝑛 ) is not equal to 𝑦𝑛 in general and because the previous
errors accumulate.
Another fundamental source of errors arises from performing the computa-
tions in arithmetic with only a finite number of digits. This error is called the
round-off error and is not considered here.
In the following, we focus on the local truncation error
𝑒𝑛+1 ∶= 𝑦𝑛+1 − 𝑦(𝑡𝑛+1 ),
i.e., the difference between the approximation 𝑦𝑛+1 at 𝑡𝑛+1 and the value 𝑦(𝑡𝑛+1 )
of the exact solution 𝑦 at 𝑡𝑛+1 while assuming that 𝑦(𝑡𝑛 ) = 𝑦𝑛 .
Theorem 9.4 (local truncation error of the forward Euler method) Sup-
pose the exact solution 𝑦 exists uniquely and that it is twice differentiable in the open
interval (𝑡𝑛 , 𝑡𝑛+1 ) and continuously differentiable in the closed interval [𝑡𝑛 , 𝑡𝑛+1 ].
Then the local truncation error 𝑒𝑛+1 of the forward Euler method is given by
𝑒𝑛+1 = −(1∕2) ℎ^2 𝑦 ′′ (𝑡̃𝑛 ) ∃𝑡̃𝑛 ∈ (𝑡𝑛 , 𝑡𝑛+1 ).
Proof Taylor expansion of the exact solution 𝑦 at 𝑡𝑛+1 = 𝑡𝑛 + ℎ around the point
𝑡𝑛 and using the Lagrange form of the remainder term yields
𝑦(𝑡𝑛+1 ) = 𝑦(𝑡𝑛 ) + ℎ𝑦 ′ (𝑡𝑛 ) + (ℎ^2∕2) 𝑦 ′′ (𝑡̃𝑛 ),
where 𝑡̃𝑛 ∈ (𝑡𝑛 , 𝑡𝑛+1 ). Subtracting the Taylor expansion from the forward Euler
method (9.11) yields
𝑒𝑛+1 = 𝑦𝑛 − 𝑦(𝑡𝑛 ) + ℎ(𝑓(𝑡𝑛 , 𝑦𝑛 ) − 𝑦 ′ (𝑡𝑛 )) − (ℎ^2∕2) 𝑦 ′′ (𝑡̃𝑛 ). (9.12)
Recalling that we assume that 𝑦(𝑡𝑛 ) = 𝑦𝑛 when considering the local truncation
error, we also have 𝑦 ′ (𝑡𝑛 ) = 𝑓(𝑡𝑛 , 𝑦(𝑡𝑛 )) = 𝑓(𝑡𝑛 , 𝑦𝑛 ). This simplifies the local
truncation error to
𝑒𝑛+1 = −(ℎ^2∕2) 𝑦 ′′ (𝑡̃𝑛 )
as claimed. □
Hence the local truncation error is proportional both to the square of the step
size ℎ and to the second derivative of the solution somewhere in the interval
[𝑡𝑛 , 𝑡𝑛+1 ]. If a bound 𝑀 of the absolute value of the second derivative is known
on the whole interval where the solution is approximated, we can write
|𝑒𝑛 | ≤ ℎ^2 𝑀 ∕ 2.
Hence it can be ensured that the local truncation error is less than or equal to 𝜖
if the inequality
ℎ ≤ √(2𝜖∕𝑀)
holds. It is also clear from the proof that a bound on the global truncation error
will require a bound on the second derivative of the solution.
We next analyze the global truncation error, which is defined as
𝐸𝑛 ∶= 𝑦𝑛 − 𝑦(𝑡𝑛 ).
Theorem 9.5 (global truncation error of the forward Euler method) Sup-
pose that 𝑡𝑛 ∶= 𝑡0 + 𝑛ℎ (ℎ ∈ ℝ+ ), that 𝑓 is continuous, that 𝑓 is Lipschitz con-
tinuous with respect to its second argument with Lipschitz constant 𝐿, and that the
exact solution 𝑦 is twice differentiable in the open interval (0, 𝑡𝑛 ) and continuously
differentiable in the closed interval [0, 𝑡𝑛 ]. Then the global truncation error 𝐸𝑛 of
the forward Euler method is bounded by
|𝐸𝑛 | ≤ ((e^((𝑡𝑛 −𝑡0 )𝐿) − 1) ∕ 𝐿) 𝛽ℎ,
where 𝛼 ∶= 1 + ℎ𝐿 and 𝛽 ∶= (1∕2) max𝑡∈(𝑡0 ,𝑡𝑛 ) |𝑦 ′′ (𝑡)|.
If 𝜕𝑓∕𝜕𝑡 is continuous in the interval [𝑡0 , 𝑡𝑛 ], then the solution 𝑦 has a con-
tinuous second derivative on this interval and hence the assumptions on the
smoothness of 𝑦 are satisfied.
Proof Equation (9.12) and the Lipschitz continuity of 𝑓 with respect to its sec-
ond argument imply that
|𝐸𝑛+1 | ≤ |𝐸𝑛 | + ℎ|𝑓(𝑡𝑛 , 𝑦𝑛 ) − 𝑓(𝑡𝑛 , 𝑦(𝑡𝑛 ))| + (ℎ^2∕2)|𝑦 ′′ (𝑡̃𝑛 )| ≤ 𝛼|𝐸𝑛 | + 𝛽ℎ^2 .
It is straightforward to show by induction that 𝐸0 = 0 and the last inequality
|𝐸𝑛+1 | ≤ 𝛼|𝐸𝑛 | + 𝛽ℎ^2 imply that
|𝐸𝑛 | ≤ ((𝛼^𝑛 − 1) ∕ (𝛼 − 1)) 𝛽ℎ^2 .
Note that 𝛼 > 1 can be assumed without loss of generality.
Substituting the definition of 𝛼 into the last estimate yields
|𝐸𝑛 | ≤ (((1 + ℎ𝐿)^𝑛 − 1) ∕ 𝐿) 𝛽ℎ.
The Taylor expansion of the exponential function shows that 1 + ℎ𝐿 ≤ eℎ𝐿 and
hence (1 + ℎ𝐿)𝑛 ≤ e𝑛ℎ𝐿 .
In summary, we find that
|𝐸𝑛 | ≤ ((e^(𝑛ℎ𝐿) − 1) ∕ 𝐿) 𝛽ℎ = ((e^((𝑡𝑛 −𝑡0 )𝐿) − 1) ∕ 𝐿) 𝛽ℎ,
which is the claimed bound. □
Since the global truncation error has order one in the step size ℎ, the forward
Euler method is called a first-order method. Much effort has been devoted to the
development of higher-order methods, and we discuss such methods in the rest
of this chapter.
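The first-order behavior can be checked empirically (our own sketch): halving the step size should roughly halve the global error.

```julia
# Global error of forward Euler for y' = y, y(0) = 1 on [0, 1].
function euler_error(N)
    h = 1.0 / N
    y = 1.0
    for _ in 1:N
        y += h * y
    end
    abs(y - exp(1.0))
end

e1, e2 = euler_error(1000), euler_error(2000)
println(e1 / e2)   # close to 2 for a first-order method
```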
We have replaced the integrand by its value 𝑓(𝑡𝑛 , 𝑦𝑛 ) on the left interval endpoint
in the case of the forward Euler formula (recall the forward difference) and by
its value 𝑓(𝑡𝑛+1 , 𝑦𝑛+1 ) on the right interval endpoint in the case of the backward
Euler formula (recall the backward difference).
Both choices seem to be one-sided and arbitrary. It is more prudent to use the
approximation
∫ₜₙ^(𝑡𝑛+1) 𝑓(𝑠, 𝑦(𝑠)) d𝑠 ≈ (ℎ∕2)(𝑓(𝑡𝑛 , 𝑦𝑛 ) + 𝑓(𝑡𝑛+1 , 𝑦𝑛+1 )),
which leads to the update formula
𝑦𝑛+1 = 𝑦𝑛 + (ℎ∕2)(𝑓(𝑡𝑛 , 𝑦𝑛 ) + 𝑓(𝑡𝑛+1 , 𝑦𝑛+1 )).
Unfortunately, this is only an implicit definition of 𝑦𝑛+1 . We can arrive at an
explicit formula if we replace the occurrence of 𝑦𝑛+1 in the last term by its ap-
proximation 𝑦𝑛 + ℎ𝑓(𝑡𝑛 , 𝑦𝑛 ) according to the forward Euler formula.
In summary, we define
𝑦𝑛+1 ∶= 𝑦𝑛 + (ℎ∕2)(𝑓(𝑡𝑛 , 𝑦𝑛 ) + 𝑓(𝑡𝑛+1 , 𝑦𝑛 + ℎ𝑓(𝑡𝑛 , 𝑦𝑛 ))), (9.13)
which is the improved Euler method. Its advantage is that its local truncation
error has order three as we will show next. Its disadvantage is that the evaluation
of 𝑓 on the right-hand side proceeds in two steps and requires two evaluations
of 𝑓, which is more costly.
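A sketch of the improved Euler method (9.13) (names are ours):

```julia
# Improved Euler: a forward Euler predictor followed by the trapezoid
# rule, see (9.13).
function improved_euler(f, t_start, t_end, y_start, N)
    h = (t_end - t_start) / N
    t = range(t_start, t_end, length = N + 1)
    y = Vector{Float64}(undef, N + 1)
    y[1] = y_start
    for n in 1:N
        k = f(t[n], y[n])
        y[n+1] = y[n] + h / 2 * (k + f(t[n+1], y[n] + h * k))
    end
    (t = t, y = y)
end

sol = improved_euler((t, y) -> y, 0.0, 1.0, 1.0, 1000)
println(abs(sol.y[end] - exp(1.0)))   # much smaller than for forward Euler
```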
𝑦(𝑡𝑛+1 ) = 𝑦(𝑡𝑛 ) + ℎ𝑦 ′ (𝑡𝑛 ) + (ℎ^2∕2!) 𝑦 ′′ (𝑡𝑛 ) + (ℎ^3∕3!) 𝑦 ′′′ (𝑡̃𝑛 ),
where 𝑡̃𝑛 ∈ (𝑡𝑛 , 𝑡𝑛+1 ). Subtracting the Taylor expansion from the improved Euler
method (9.13) yields
𝑒𝑛+1 = 𝑦𝑛 + (ℎ∕2)(𝑓(𝑡𝑛 , 𝑦𝑛 ) + 𝑓(𝑡𝑛+1 , 𝑦𝑛 + ℎ𝑓(𝑡𝑛 , 𝑦𝑛 )))
− (𝑦(𝑡𝑛 ) + ℎ𝑦 ′ (𝑡𝑛 ) + (ℎ^2∕2!) 𝑦 ′′ (𝑡𝑛 ) + (ℎ^3∕3!) 𝑦 ′′′ (𝑡̃𝑛 ))
= −(ℎ∕2) 𝑓(𝑡𝑛 , 𝑦𝑛 ) + (ℎ∕2) 𝑓(𝑡𝑛+1 , 𝑦𝑛 + ℎ𝑓(𝑡𝑛 , 𝑦𝑛 )) − (ℎ^2∕2!) 𝑦 ′′ (𝑡𝑛 ) − (ℎ^3∕3!) 𝑦 ′′′ (𝑡̃𝑛 ),
where 𝑦(𝑡𝑛 ) = 𝑦𝑛 and 𝑦 ′ (𝑡𝑛 ) = 𝑓(𝑡𝑛 , 𝑦𝑛 ) were used in the second step.
The two-dimensional Taylor expansion of the term 𝑓(𝑡𝑛+1 , 𝑦𝑛 + ℎ𝑓(𝑡𝑛 , 𝑦𝑛 ))
around the point (𝑡𝑛 , 𝑦𝑛 ) is
𝑓(𝑡𝑛+1 , 𝑦𝑛 + ℎ𝑓(𝑡𝑛 , 𝑦𝑛 )) = 𝑓(𝑡𝑛 , 𝑦𝑛 ) + ℎ𝑓𝑡 (𝑡𝑛 , 𝑦𝑛 ) + ℎ𝑓(𝑡𝑛 , 𝑦𝑛 )𝑓𝑦 (𝑡𝑛 , 𝑦𝑛 ) + 𝑂(ℎ^2).
Using the ode and the chain rule, the second derivative of 𝑦 can be written as
𝑦 ′′ (𝑡𝑛 ) = 𝑓𝑡 (𝑡𝑛 , 𝑦(𝑡𝑛 )) + 𝑓𝑦 (𝑡𝑛 , 𝑦(𝑡𝑛 ))𝑦 ′ (𝑡𝑛 ) = 𝑓𝑡 (𝑡𝑛 , 𝑦𝑛 ) + 𝑓𝑦 (𝑡𝑛 , 𝑦𝑛 )𝑓(𝑡𝑛 , 𝑦𝑛 ).
Substituting these last two equations into the expression for 𝑒𝑛+1 shows that
𝑒𝑛+1 = 𝑂(ℎ^3 ), which completes the proof. □
It is often conducive to adjust the step size in order to maintain the local trunca-
tion error at a nearly constant level. Not only can computational work be saved
in this manner, but it is also possible to control the accuracy of the approxima-
tion.
The most straightforward way to control the local truncation error would be
to calculate the difference between the approximation and the exact solution.
While this approach is a good idea in test problems, where the exact solution is
known, it is obviously not possible to do so in the general setting; if the exact
solution were known, we would not approximate it. Therefore we use a more
accurate numerical method as a substitute for the exact solution and compute
two different approximations using two different methods.
For example, we can use the forward Euler method (as the less accurate
method) and the improved Euler method (as the more accurate method). Then
the difference between the two approximate solutions is used as the estimate
𝑒𝑛+1^est ∶= |𝑦𝑛+1^Euler − 𝑦𝑛+1^improved |
of the local truncation error, and the step size is adapted so that this estimate
adjusts the local truncation error (up or down) to the given error tolerance 𝜖.
In this way, the local truncation error can be kept approximately constant
throughout the approximation of a solution. Small step sizes, which increase
computation time, are only used where needed so that the resulting algorithm
is both efficient and accurate (see Problem 9.5).
Two of the most often executed programs in the history of ordinary differential
equations are probably the functions called ode23 and ode45 in Matlab. These
two functions use adaptive Runge–Kutta methods [4, 1, 8], and therefore we
have a closer look at these methods in the rest of this chapter.
We still consider the initial-value problem
𝑦 ′ (𝑡) = 𝑓(𝑡, 𝑦(𝑡)), 𝑦(𝑡0 ) = 𝑦0 . (9.14)
The classical Runge–Kutta method is defined by the equations
𝑘1 ∶= 𝑓(𝑡𝑛 , 𝑦𝑛 ), (9.15a)
𝑘2 ∶= 𝑓(𝑡𝑛 + ℎ∕2, 𝑦𝑛 + (ℎ∕2)𝑘1 ), (9.15b)
𝑘3 ∶= 𝑓(𝑡𝑛 + ℎ∕2, 𝑦𝑛 + (ℎ∕2)𝑘2 ), (9.15c)
𝑘4 ∶= 𝑓(𝑡𝑛 + ℎ, 𝑦𝑛 + ℎ𝑘3 ), (9.15d)
𝑦𝑛+1 ∶= 𝑦𝑛 + (ℎ∕6)(𝑘1 + 2𝑘2 + 2𝑘3 + 𝑘4 ), (9.15e)
𝑡𝑛+1 ∶= 𝑡𝑛 + ℎ. (9.15f)
It is called a four-stage method, since the four stages 𝑘1 , 𝑘2 , 𝑘3 , and 𝑘4 are needed
to proceed from time 𝑡𝑛 to time 𝑡𝑛+1 . This classical Runge–Kutta method is
therefore often abbreviated as rk4.
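A direct implementation sketch of (9.15) for a scalar equation (names are ours); a generic implementation parametrized by the Butcher tableau follows in Sect. 9.9.

```julia
# The classical rk4 method (9.15) on a uniform grid.
function rk4(f, t_start, t_end, y_start, N)
    h = (t_end - t_start) / N
    t = range(t_start, t_end, length = N + 1)
    y = Vector{Float64}(undef, N + 1)
    y[1] = y_start
    for n in 1:N
        k1 = f(t[n], y[n])
        k2 = f(t[n] + h / 2, y[n] + h / 2 * k1)
        k3 = f(t[n] + h / 2, y[n] + h / 2 * k2)
        k4 = f(t[n] + h, y[n] + h * k3)
        y[n+1] = y[n] + h / 6 * (k1 + 2k2 + 2k3 + k4)
    end
    (t = t, y = y)
end

sol = rk4((t, y) -> y, 0.0, 10.0, 1.0, 1000)
println(abs(sol.y[end] - exp(10.0)))   # fourth-order accurate
```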
If 𝑓 does not depend on 𝑦, then the method (9.15) simplifies to
𝑦𝑛+1 = 𝑦𝑛 + (ℎ∕6)(𝑓(𝑡𝑛 ) + 4𝑓(𝑡𝑛 + ℎ∕2) + 𝑓(𝑡𝑛 + ℎ)),
which is Simpson’s rule for approximating the integral of 𝑦 ′ (𝑡) = 𝑓(𝑡). This con-
sideration is analogous to the interpretation of the improved Euler method as an
application of the trapezoid rule to an integral in Sect. 9.4.3.
Generalizing (9.15), Runge–Kutta methods with 𝑠 stages can be written in
the form
𝑘𝑖 ∶= 𝑓(𝑡𝑛 + ℎ𝑐𝑖 , 𝑦𝑛 + ℎ ∑ⱼ₌₁ˢ 𝑎𝑖𝑗 𝑘𝑗 ), 1 ≤ 𝑖 ≤ 𝑠, (9.16a)
𝑦𝑛+1 ∶= 𝑦𝑛 + ℎ ∑ᵢ₌₁ˢ 𝑏𝑖 𝑘𝑖 , (9.16b)
𝑡𝑛+1 ∶= 𝑡𝑛 + ℎ. (9.16c)
The following two theorems mean that the rk4 method has order four; more
precisely, its local truncation error has order five and its global truncation error
has order four.
Theorem 9.7 (local truncation error of the rk4 method) Suppose that the
fourth partial derivatives of 𝑓 in the ordinary differential equation (9.14) exist in
the open interval (𝑡𝑛 , 𝑡𝑛+1 ) and that its third partial derivatives exist and are con-
tinuous in the closed interval [𝑡𝑛 , 𝑡𝑛+1 ]. Then the local truncation error of the rk4
method (9.15) has order five, i.e., 𝑒𝑛+1 = 𝑂(ℎ^5 ).
Theorem 9.8 (global truncation error of the rk4 method) Under the as-
sumptions of Theorem 9.7, the global truncation error of the rk4 method (9.15)
has order four.
Explicit methods are widely used because calculating the stages 𝑘𝑖 is faster
compared to implicit methods and because explicit methods already enable a
large choice of coefficients.
The rk4 method in (9.15) is a four-stage method and has order four. How do
the number of stages 𝑠 and the order 𝑝 relate in explicit Runge–Kutta methods?
In general, it can be shown that the inequality
𝑝≤𝑠
holds for any explicit Runge–Kutta method; if 𝑝 ≥ 5, then the stronger in-
equality
𝑝<𝑠
holds [3, Paragraph 324].
It is not known, however, whether these inequalities are sharp. It is an open
problem what the minimum number of stages 𝑠 of an explicit Runge–Kutta
method with order 𝑝 is in the cases where no methods are already known that
satisfy 𝑝 + 1 = 𝑠. The following table summarizes what is known about the
known minimum number of stages for orders one to ten [3, Chapter 32].
Order 𝑝 1 2 3 4 5 6 7 8 9 10
Number 𝑠 of stages 1 2 3 4 6 7 9 11 ? 17
The coefficients of an explicit Runge–Kutta method are collected in a Butcher
tableau of the form
0
𝑐2 𝑎21
𝑐3 𝑎31 𝑎32
⋮ ⋮ ⋱
𝑐𝑠 𝑎𝑠1 𝑎𝑠2 ⋯ 𝑎𝑠,𝑠−1
𝑏1 𝑏2 ⋯ 𝑏𝑠−1 𝑏𝑠 ,
In the following, in this section and the next, the Butcher tableaux of impor-
tant Runge–Kutta methods are given. The forward Euler method (see Sect.
9.4.1) is the simplest Runge–Kutta method and has the Butcher tableau
0
1.
The classical rk4 method (9.15) has the Butcher tableau
0
1∕2 1∕2
1∕2 0 1∕2
1 0 0 1
1∕6 1∕3 1∕3 1∕6.
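The tableau can be stored directly as arrays (our own sketch): 𝐴 is strictly lower triangular for an explicit method, and one generic step then follows (9.16). The consistency condition 𝑐𝑖 = ∑ⱼ 𝑎𝑖𝑗 is the same one checked by the @assert statements in Sect. 9.9.

```julia
# rk4 coefficients: strictly lower triangular A, weights b, nodes c.
A = [0.0 0.0 0.0 0.0;
     0.5 0.0 0.0 0.0;
     0.0 0.5 0.0 0.0;
     0.0 0.0 1.0 0.0]
b = [1/6, 1/3, 1/3, 1/6]
c = [0.0, 0.5, 0.5, 1.0]

# One explicit Runge-Kutta step following (9.16).
function rk_step(f, t, y, h, A, b, c)
    s = length(b)
    k = zeros(s)
    k[1] = f(t, y)
    for i in 2:s
        k[i] = f(t + h * c[i], y + h * sum(A[i, j] * k[j] for j in 1:i-1))
    end
    y + h * sum(b[i] * k[i] for i in 1:s)
end

println(rk_step((t, y) -> y, 0.0, 1.0, 0.1, A, b, c))   # one rk4 step for y' = y
```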
where the asterisk indicates the method with order 𝑝 − 1 and the other method
has order 𝑝. Then the local truncation error 𝑒𝑛+1 is estimated by
𝑒𝑛+1 ≈ 𝑒𝑛+1^est ∶= 𝑦𝑛+1^∗ − 𝑦𝑛+1 = ℎ ∑ᵢ₌₁ˢ (𝑏𝑖∗ − 𝑏𝑖 )𝑘𝑖 = 𝑂(ℎ^𝑝 ).
The step size is then adapted in order to adjust the local truncation error (up or
down) to the given error tolerance 𝜖.
The Butcher tableau of such an embedded pair of Runge–Kutta methods has the form
0
𝑐2 𝑎21
𝑐3 𝑎31 𝑎32
⋮ ⋮ ⋱
𝑐𝑠 𝑎𝑠1 𝑎𝑠2 ⋯ 𝑎𝑠,𝑠−1
𝑏1 𝑏2 ⋯ 𝑏𝑠−1 𝑏𝑠
𝑏1∗ 𝑏2∗ ⋯ 𝑏𝑠−1∗ 𝑏𝑠∗
    @assert c[1] == 0
    @assert all(isapprox(c[i], sum(A[i, j] for j in 1:i-1))
                for i in 2:size(A, 1))
    local k = fill(NaN, s)
    local t = LinRange(t_start, t_start + (N-1)*h, N)
    local y = fill(NaN, N)
    y[1] = y_start
    for n in 1:N-1
        k[1] = f(t[n], y[n])
        for i in 2:s
            k[i] = f(t[n] + h * c[i],
                     y[n] + h * sum(A[i, j] * k[j] for j in 1:i-1))
        end
        y[n+1] = y[n] + h * sum(b[i] * k[i] for i in 1:s)
    end
    (t = t, y = y)
end
After some assertions to check the consistency of the input and after extracting
the matrix A and the vectors b and c from the Butcher tableau, the coefficient
vector k and the output vectors t and y are allocated and initialized. In the for
loop, the equation is solved using (9.16). (Unfortunately, sum does not work on
empty generators so that k[1] is defined separately.)
This implementation is a straightforward implementation of equations (9.16).
It is important to note, however, that the Butcher tableau and the number of
stages are known and constant. Therefore it seems wasteful in the inner loop
to use a ȯɴʝ loop to iterate over the stages, to access the coefficients stored in a
matrix and in vectors, and to use ʧʼɦ to sum a few terms instead of writing out
the expressions explicitly. But writing out the expressions explicitly would have
to be done for every Butcher tableau, and we still want a general, yet efficient
code.
The solution is to not write a program to solve the equation, but to write a
program that writes programs to solve the equation. In other words, we will write
a macro (see Chap. 7) that generates the code specialized for a given Butcher
tableau and right-hand side.
The first version of the @RK macro is more straightforward and easier to
understand, while the second version is an optimized one. We start with the first
version, called @RK0.
macro RK0(T, f, t_start::Float64, t_end::Float64, y_start::Float64,
          h::Float64)
    local TT = eval(T)
    @assert c[1] == 0
    @assert all(isapprox(c[i], sum(A[i, j] for j in 1:i-1))
                for i in 2:size(A, 1))
    local ks = :()
    for i in 1:s
        local sum = :(0)
        for j in 1:i-1
            sum = :($sum + $(h * A[i, j]) * $(esc(k[j])))
        end
        ks = :($ks; local $(esc(k[i])) = $f(t[n] + $(h * c[i]),
                                            y[n] + $sum))
    end
    quote
        local t = LinRange($t_start, $(t_start + (N-1)*h), $N)
        local y = fill(NaN, $N)
        y[1] = $y_start
        for n in 1:$(N-1)
            $ks
            y[n+1] = y[n] + $h * $y_update
        end
        ($(esc(:t)) = t, $(esc(:y)) = y)
    end
end
After some assertions and the definitions of local variables such as A, b, and c,
the first for loop builds the expression y_update that is used in the macro
expansion where y[n+1] is updated. The local variable y_update is initialized as
the expression 0 and then the terms 𝑏𝑖 𝑘𝑖 are added in the for loop. The vector k
already contains unique symbols generated by gensym, and therefore esc is used.
The expressions for the stages 𝑘𝑖 are built in a similar manner. The outer for
loop adds definitions of local variables to the initially empty expression ks. The
names of the local variables are the elements of the vector k. Each symbol stored
in k[i] has a unique name that starts with a k. The inner for loop adds terms to
the expression stored in sum, analogous to the generation of y_update.
All this work is performed during macro-expansion time. The advantage is
that the expressions contain the entries of the Butcher tableau and do not have
to access vectors or arrays during run time.
At the end of the macro, a quote expression returns the code that is executed.
First, the local variables t and y are initialized and will contain the results. Then,
in the for loop, the solution is calculated: the expression ks calculates the stages
and then the next element of y is calculated. Finally, t and y are returned. Because
of all the preparatory work before the quote expression, the whole quote
expression is rather short.
It is instructive to use @macroexpand1 to see what the code that solves the
equation looks like. You will notice that the entries of the Butcher tableau have
been substituted into the code.
The first version of the macro is already faster than the RK function, but some
improvements are still possible. This leads us to the second version, called @RK.
macro RK(T, f, t_start::Float64, t_end::Float64, y_start::Float64,
         h::Float64)
    local TT = eval(T)
    @assert c[1] == 0
    @assert all(isapprox(c[i], sum(A[i, j] for j in 1:i-1))
                for i in 2:size(A, 1))
    quote
        local t = LinRange($t_start, $(t_start + (N-1)*h), $N)
        local y = fill(NaN, $N)
        y[1] = $y_start
        for n in 1:$(N-1)
            $ks
            y[n+1] = y[n] + $h * $y_update
        end
        ($(esc(:t)) = t, $(esc(:y)) = y)
    end
end
    @show sol1[:y][end]
    @show sol2[:y][end]
    @show sol3[:y][end]
    @show sol4[:y][end]
    @show exp(10.0)
    nothing
end
julia> benchmark()
  0.176908 seconds (20.00 M allocations: 534.058 MiB, 15.12% gc time)
  0.027769 seconds (2 allocations: 76.294 MiB)
  0.749787 seconds (80.00 M allocations: 1.863 GiB, 21.85% gc time)
  0.091455 seconds (2 allocations: 76.294 MiB, 4.83% gc time)
(sol1[:y])[end] = 22026.355662833706
(sol2[:y])[end] = 22026.355662833706
(sol3[:y])[end] = 22026.465794806456
(sol4[:y])[end] = 22026.465794806456
exp(10.0) = 22026.465794806718
The first-order method yields five correct digits, while in the fourth-order method
all digits but the last three are correct.
The results obtained by the function and the macros are identical. In this par-
ticular, but typical run, the macro is six to eight times faster than the function.
The allocations are also in favor of the macro implementation; the macro allo-
cates memory only twice, while the function performs 20 million allocations
(first-order method) or 80 million allocations (fourth-order method) and thus
spends significant time in garbage collection.
Problems 9.8, 9.9, 9.10, and 9.11 are concerned with the implementation of
the numerical methods presented in this chapter.
These ideas are applicable and useful also when implementing other numeri-
cal methods in a generic way while emphasizing performance. For example, spe-
cialized code for graphics processing units (gpu) can be written in this manner.
This pattern of defining problem and solution objects can be followed for all
equation types that are supported by the package. Each supported equation has
a problem type and a solution type that are understood by the generic functions
solve and plot.
Finally, it is mentioned that the package OrdinaryDiffEq is a component
package of DifferentialEquations and holds the solvers and utilities for odes.
It is completely independent and usable on its own, which is expedient when a
light-weight package is sufficient.
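A minimal sketch of the problem/solve pattern with OrdinaryDiffEq (the package must be installed; Tsit5 is one of its adaptive Runge–Kutta solvers):

```julia
using OrdinaryDiffEq

f(y, p, t) = y                           # right-hand side of y' = y
prob = ODEProblem(f, 1.0, (0.0, 1.0))    # problem object: f, y0, time span
sol = solve(prob, Tsit5())               # solution object
println(sol(1.0))                        # interpolated value, ≈ exp(1.0)
```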
Both the theory of ordinary differential equations and their numerical methods
are large fields. A very accessible and comprehensive textbook on differential
equations is [2]. A detailed treatment of numerical methods for ordinary differ-
ential equations can be found in [3].
Problems
9.1 (Modeling) Find an ode in a subject of your interest and derive it similarly
to the example in Sect. 9.1.
holds.
Hint: The Lipschitz constant 𝐿 is the maximum value of |𝜕𝑓∕𝜕𝑦| in 𝐷. Apply
the mean-value theorem to 𝑓 as a function of 𝑦 only.
9.3 (Interchanging taking the limit and integration) * Suppose that the se-
quence ⟨𝑓𝑛 ⟩ of Riemann integrable functions defined on a compact interval 𝐼
converges uniformly to 𝑓. Show that then the limit function 𝑓 is Riemann integrable and that the equality
\[
\lim_{n \to \infty} \int_I f_n(x) \,\mathrm{d}x = \int_I f(x) \,\mathrm{d}x
\]
holds.
9.4 (Picard and Banach) * Show that the operator given by the Picard itera-
tion (9.4) is a contraction and use the Banach fixed-point theorem to show The-
orem 9.1.
9.6 (Local truncation error of the rk4 method) * Show Theorem 9.7 by fol-
lowing these steps.
\[
\begin{aligned}
y(t_n + h) &= y(t_n) + h \frac{\mathrm{d}y(t_n)}{\mathrm{d}t} + \frac{h^2}{2!} \frac{\mathrm{d}^2 y(t_n)}{\mathrm{d}t^2} + \cdots \\
&= y(t_n) + h f(t_n, y(t_n)) + \frac{h^2}{2!} \frac{\mathrm{d}f(t_n, y(t_n))}{\mathrm{d}t} + \frac{h^3}{3!} \frac{\mathrm{d}^2 f(t_n, y(t_n))}{\mathrm{d}t^2} \\
&\quad + \frac{h^4}{4!} \frac{\mathrm{d}^3 f(t_n, y(t_n))}{\mathrm{d}t^3} + O(h^5)
\end{aligned}
\]
of the solution 𝑦 of the differential equation in terms of partial derivatives
of 𝑓 (up to third order).
4. Compare the two Taylor expansions to find a system of algebraic equations
for the unknown coefficients in (9.16).
9.7 (Global truncation error of the rk4 method) * Show Theorem 9.8.
9.8 (Adaptive Runge–Kutta methods) * Extend (a) the function and (b) the
macro to implement adaptive Runge–Kutta methods as described in Sect. 9.8.
9.9 (Plot and compare) Choose the solution of an initial-value problem first
and then calculate the right-hand side 𝑓. Plot and compare the numerical so-
lutions with the exact solution for different Runge–Kutta methods, for differ-
ent step sizes, and using adaptive Runge–Kutta methods (building on Prob-
lem 9.8).
References
1. Bogacki, P., Shampine, L.: A 3(2) pair of Runge–Kutta formulas. Appl. Math. Lett. 2(4),
321–325 (1989)
2. Boyce, W., DiPrima, R.: Elementary Differential Equations and Boundary Value Problems,
9th edn. John Wiley and Sons, Inc. (2009)
3. Butcher, J.: Numerical Methods for Ordinary Differential Equations, 2nd edn. John Wiley &
Sons, Ltd., Chichester, England (2008)
4. Dormand, J., Prince, P.: A family of embedded Runge–Kutta formulae. J. Comp. Appl. Math.
6(1), 19–26 (1980)
10.1 Introduction
\[
u_x = \frac{\partial u}{\partial x}, \qquad
u_{xy} = \frac{\partial^2 u}{\partial x \partial y}, \qquad
u_{x_i x_j} = \frac{\partial^2 u}{\partial x_i \partial x_j}
\]
to simplify notation.
The order of a pde is the order of the highest derivative that occurs in the
equation. pdes of order higher than second are much rarer than those of first
and second order.
What do the elliptic, parabolic, and hyperbolic equations look like? Second-
order linear pdes are classified into three types: elliptic, parabolic, and hyper-
bolic equations. All second-order linear pdes in two independent variables 𝑥
and 𝑦 can be written in the form
\[
A u_{xx} + B u_{xy} + C u_{yy} + D u_x + E u_y + F u + G = 0,
\]
where we have already dropped the dependence of the unknown 𝑢 = 𝑢(𝑥, 𝑦), of
its partial derivatives, and of the coefficient functions 𝐴 = 𝐴(𝑥, 𝑦) to 𝐺 = 𝐺(𝑥, 𝑦)
on the independent variables 𝑥 and 𝑦 to shorten the notation. Note that the terms
that contain the unknown 𝑢 or its derivatives are all linear, as they must be in a
linear equation. The domain 𝑈 ⊂ ℝ2 is the domain where the equation holds.
To complete the specification of a pde that can be solved, it is also necessary to
provide boundary conditions, initial conditions, or both. The types and amounts
of such conditions depend on the equation type.
A second-order linear pde is called elliptic if the condition 𝐵² − 4𝐴𝐶 < 0 holds, parabolic if 𝐵² − 4𝐴𝐶 = 0 holds, and hyperbolic if 𝐵² − 4𝐴𝐶 > 0 holds.
This naming convention is an analogy to conic sections. If we replace 𝑢𝑥𝑥 by
𝑥2 , 𝑢𝑥𝑦 by 𝑥𝑦, and 𝑢𝑦𝑦 by 𝑦 2 , we see that the same three conditions hold for
ellipses 𝑥2 + 𝑦 2 = 𝑎2 , parabolas 𝑦 2 = 4𝑎𝑥, and hyperbolas 𝑥2 ∕𝑎2 − 𝑦 2 ∕𝑏2 = 1,
respectively. More precisely, after replacing 𝑢𝑥𝑥 by 𝑥 2 , 𝑢𝑥𝑦 by 𝑥𝑦, and 𝑢𝑦𝑦 by
𝑦 2 and only considering the first three terms, which are of second order in 𝑥
and 𝑦, we find the equation 𝐴𝑥² + 𝐵𝑥𝑦 + 𝐶𝑦² = 0. Dividing it by 𝑦² (or 𝑥²) and defining 𝑧 ∶= 𝑥∕𝑦 (or 𝑧 ∶= 𝑦∕𝑥) yields the second-order polynomial equation
𝐴𝑧2 + 𝐵𝑧 + 𝐶 = 0 (or 𝐶𝑧2 + 𝐵𝑧 + 𝐴 = 0), whose discriminant is the expression
𝐵2 − 4𝐴𝐶.
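The classification can be expressed as a small helper function (a hypothetical helper for illustration, not from the book):

```julia
# Classify a second-order linear pde by the coefficients A, B, C of its
# second-order terms, using the sign of the discriminant B^2 - 4AC.
function pde_type(A, B, C)
    disc = B^2 - 4A*C
    disc < 0 ? :elliptic : (disc == 0 ? :parabolic : :hyperbolic)
end
```

For example, the Laplace equation (𝐴 = 𝐶 = 1, 𝐵 = 0) is elliptic, the heat equation written as 𝑢𝑥𝑥 − 𝑢𝑡 = 0 (𝐴 = 1, 𝐵 = 𝐶 = 0) is parabolic, and the wave equation 𝑢𝑡𝑡 = 𝑐²𝑢𝑥𝑥 (𝐴 = 𝑐², 𝐵 = 0, 𝐶 = −1) is hyperbolic.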
For the purpose of classifying second-order pdes, only the second-order terms
responsible for the expression 𝐵2 − 4𝐴𝐶 are important.
More generally, in higher dimensions, when there are 𝑑 independent vari-
ables 𝑥1 , … , 𝑥𝑑 and 𝐷 ⊂ ℝ𝑑 , the general second-order linear pde has the form
\[
\sum_{i=1}^{d} \sum_{j=1}^{d} a_{ij} u_{x_i x_j} + \text{lower-order terms} = 0. \tag{10.1}
\]
Elliptic, parabolic, and hyperbolic equations are then characterized by the prop-
erties of the matrix 𝐴 whose entries are the coefficients 𝑎𝑖𝑗 .
Various methods for solving pdes analytically have been developed. They in-
clude separation of variables, the method of characteristics, integral transforms,
change of variables, use of fundamental solutions, the superposition principle,
and Lie groups. As we are interested in computational methods in this book, we
discuss the three main methods for solving pdes numerically: finite differences,
finite volumes, and finite elements.
But before doing so, we take a closer look at elliptic equations in the next
section including their theory, before we briefly discuss parabolic and hyperbolic
equations in Sections 10.3 and 10.4. We focus on elliptic equations in this chapter, since they are amenable to all three methods, finite differences, finite volumes, and finite elements, which are discussed in the subsequent sections, but the three methods are generally applicable to all kinds of pdes.
10.2 Elliptic Equations
In this section we take a closer look at elliptic equations. Three physical phenom-
ena, namely electrostatics, diffusion processes, and thermal conduction, and
how elliptic equations arise in these three applications are presented. The final
part of this section is more advanced and summarizes the theory of weak solu-
tions of elliptic equations.
We start with convenient notation first. When formulating pdes, the so-called
nabla operator
\[
\nabla := \begin{pmatrix} \dfrac{\partial}{\partial x_1} \\ \vdots \\ \dfrac{\partial}{\partial x_d} \end{pmatrix}
\]
is commonly used. It is used to write the gradient of a scalar multivariate func-
tion 𝑓 ∶ ℝ𝑑 → ℝ as
\[
\nabla f = \begin{pmatrix} \dfrac{\partial f}{\partial x_1} \\ \vdots \\ \dfrac{\partial f}{\partial x_d} \end{pmatrix},
\]
the divergence of a vector-valued function 𝐟 ∶ ℝ𝑑 → ℝ𝑑 as
\[
\nabla \cdot \mathbf{f} = \sum_{i=1}^{d} \frac{\partial f_i}{\partial x_i},
\]
\[
\nabla \cdot (A(\mathbf{x}) \nabla u(\mathbf{x})) = \sum_{i=1}^{d} \sum_{j=1}^{d} \left( a_{ij} u_{x_i x_j} + \frac{\partial a_{ij}}{\partial x_i} u_{x_j} \right) \tag{10.2}
\]
if the matrix-valued function 𝐴 is smooth enough (see Problem 10.1). With these
preliminaries, we can write elliptic equations in convenient and compact forms.
The first example of an elliptic equation is the derivation of the Poisson equa-
tion for electrostatic problems from the Maxwell equations, which are the fun-
damental equations for electromagnetism. An alternative, but more limited re-
lationship between Coulomb’s law and elliptic equations is also discussed.
The second and third example are diffusion and thermal conduction. Al-
though these are also elliptic model equations, additional modeling assumptions
are necessary, and hence these equations are not as fundamental as the first ex-
ample and variants are possible.
We derive the Poisson equation from the Maxwell equations. The Poisson equa-
tion is contained in the Maxwell equations and it is retrieved by considering the
electrostatic case. An electrostatic system is a system whose magnetic field does
not vary with time.
The Maxwell equations are the four pdes
\[
\nabla \cdot (\epsilon \mathbf{E}) = \rho, \qquad
\nabla \times \mathbf{E} = -\mu \frac{\partial \mathbf{H}}{\partial t}, \qquad
\nabla \cdot (\mu \mathbf{H}) = 0, \qquad
\nabla \times \mathbf{H} = \mathbf{J} + \epsilon \frac{\partial \mathbf{E}}{\partial t},
\]
where 𝜖(𝐱) is the permittivity and 𝜇(𝐱) is the permeability, which are both matrix-
valued functions from ℝ3 to ℝ3×3 .
The fields 𝐄 and 𝐇 satisfy the physical interface or jump conditions
[𝐄 × 𝐧] = 𝟎, [𝜖𝐄 ⋅ 𝐧] = 𝜌|Γ ,
[𝐇 × 𝐧] = 𝟎, [𝜇𝐇 ⋅ 𝐧] = 0
𝐄 = −∇𝜙,
𝐃 = −𝜖∇𝜙.
The minus sign only serves cosmetic purposes here. Substitution of the last equation into the first of the Maxwell equations, namely Gauss’s law, now yields the Poisson equation
−∇ ⋅ (𝜖∇𝜙) = 𝜌. (10.3)
We know that Coulomb’s law holds for electric fields in homogeneous materials
with no magnetic field present, and therefore the question naturally arises how
it relates to the Poisson equation under these two assumptions. It should be pos-
sible to arrive at the Poisson equation from Coulomb’s law, and we now discuss
how this is indeed possible. The derivation from Coulomb’s law governs only
the electrostatic case; no magnetic field ever enters the picture. A homogeneous
material means that the permittivity 𝜖0 ∈ ℝ is simply a real constant.
According to Coulomb’s law, the force 𝐅𝑖𝑗 ∈ ℝ³ that a particle at position 𝐫𝑗 ∈ ℝ³ with charge 𝑞𝑗 ∈ ℝ exerts on a particle at position 𝐫𝑖 with charge 𝑞𝑖 is given
by
\[
\mathbf{F}_{ij} = \frac{q_i q_j}{4 \pi \epsilon_0} \frac{\mathbf{r}_i - \mathbf{r}_j}{|\mathbf{r}_i - \mathbf{r}_j|^3}.
\]
It implies that the force is proportional to 1∕|𝐫𝑖 − 𝐫𝑗 |2 , one over the distance
between the particles squared. Coulomb’s law is an atomistic model, since the charges are point-like.
In a continuum model, on the other hand, all the point charges 𝑞𝑗 give rise to
a charge density 𝜌 ∶ ℝ3 → ℝ. The force 𝐅 that acts on a particle with charge 𝑞
at position 𝐱 can be written as
𝐅 = 𝑞𝐄
using the electric field 𝐄, which is obtained from Coulomb’s law by integration
over all other charges 𝑞𝑗 at positions 𝐲 as
\[
\mathbf{E}(\mathbf{x}) := \frac{1}{4 \pi \epsilon_0} \iiint \frac{\rho(\mathbf{y}) (\mathbf{x} - \mathbf{y})}{|\mathbf{x} - \mathbf{y}|^3} \,\mathrm{d}\mathbf{y}.
\]
Next we check by a simple calculation that the electric field is irrotational, i.e.,
that
∇×𝐄=𝟎 (10.4)
holds (see Problem 10.2). Hence the electric field 𝐄 is a gradient field again and
thus can be written as
𝐄 = −∇𝜙 (10.5)
as the gradient of minus a potential, where 𝜙 is called the electrostatic potential
and the minus sign serves a cosmetic purpose, as the next calculation reveals. It is
straightforward to check by differentiating that
\[
\phi(\mathbf{x}) := \frac{1}{4 \pi \epsilon_0} \iiint \frac{\rho(\mathbf{y})}{|\mathbf{x} - \mathbf{y}|} \,\mathrm{d}\mathbf{y} \tag{10.6}
\]
Δ ∶= ∇ ⋅ ∇
(see Problem 10.4), integration of the last equation against the charge density 𝜌
yields
and further
\[
\Delta \left( \iiint_{\mathbb{R}^3} G(\mathbf{x} - \mathbf{y}) \rho(\mathbf{y}) \,\mathrm{d}\mathbf{y} \right) = \Delta(-\epsilon_0 \phi(\mathbf{x})) = \rho(\mathbf{x}) \quad \forall \mathbf{x} \in \mathbb{R}^3.
\]
In other words, the potential 𝜙 given by (10.6) solves the Poisson equation
−𝜖0 Δ𝜙 = 𝜌.
This equation is a special case of (10.3), as here the permittivity is a real con-
stant 𝜖0 and could be pulled out of the divergence.
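This result can be checked numerically (a sanity check for illustration, not from the book; it assumes a unit point charge and 𝜖0 = 1): away from the origin the charge density vanishes, so the potential 𝜙(𝐱) = 1∕(4𝜋|𝐱|) must satisfy the Laplace equation Δ𝜙 = 0 there.

```julia
# Potential of a unit point charge at the origin (with eps0 = 1).
phi(x, y, z) = 1 / (4pi * sqrt(x^2 + y^2 + z^2))

# Second-order finite-difference approximation of the Laplacian.
function laplacian(f, x, y, z; h = 1.0e-3)
    (f(x+h, y, z) + f(x-h, y, z) + f(x, y+h, z) + f(x, y-h, z) +
     f(x, y, z+h) + f(x, y, z-h) - 6 * f(x, y, z)) / h^2
end
```

Evaluating the discrete Laplacian of phi at points away from the origin yields values near zero, up to the discretization error of order ℎ².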
10.2.1.3 Diffusion
We can describe any stationary diffusion process by a pde using two considera-
tions. The first is fundamental, while the second one requires a physical model.
Transient diffusion processes lead to parabolic equations (see Sect. 10.3).
The first step is to note that the total flux of particles out of any subdomain
Ω ⊂ 𝐷 of the domain 𝐷 equals the amount of particles produced by all sources
in Ω due to mass conservation. This yields
\[
\oiint_{\partial\Omega} \mathbf{n} \cdot \mathbf{J} \,\mathrm{d}S = \iiint_{\Omega} f \,\mathrm{d}V.
\]
In the surface integral on the left-hand side, the flux density of the particles is
denoted by 𝐉 and the 𝐧 are outward unit normal vectors. The volume integral on
the right-hand side is over the function 𝑓 that describes the sources.
Using the divergence theorem, the boundary integral on the left-hand side
becomes a volume integral, and hence the equation becomes
\[
\iiint_{\Omega} \nabla \cdot \mathbf{J} \,\mathrm{d}V = \iiint_{\Omega} f \,\mathrm{d}V.
\]
It holds true for all subdomains Ω. After assuming that the integrand is com-
pactly supported and smooth, we can therefore apply the fundamental lemma
of variational calculus to the equation
\[
\iiint_{\Omega} (\nabla \cdot \mathbf{J} - f) \,\mathrm{d}V = 0 \quad \forall \Omega \subset D
\]
to find
∇ ⋅ 𝐉(𝐱) = 𝑓(𝐱) ∀𝐱 ∈ 𝐷.
There are various versions of the fundamental lemma of variational calculus;
two are recorded in the following.
Theorem 10.1 (fundamental lemma of variational calculus) 1. Version for
continuous functions: Suppose that Ω ⊂ ℝ𝑑 is an open set and that a continuous
multivariate function 𝑓 ∶ Ω → ℝ satisfies the equation
\[
\iiint_{\Omega} f(\mathbf{x}) h(\mathbf{x}) \,\mathrm{d}\mathbf{x} = 0
\]
\[
\iiint_{\Omega} f(\mathbf{x}) h(\mathbf{x}) \,\mathrm{d}\mathbf{x} = 0
\]
−∇ ⋅ (𝐷∇𝑢) = 𝑓.
Many other physical models for the relationship between the flux density 𝐉 and
the unknown 𝑢 are known to be useful. For example, diffusion in porous media
is governed by
𝐉 ∶= −𝐷∇𝑢𝑚 ,
where 𝑚 > 0 is a real constant, usually with 𝑚 > 1.
The physical model for thermal conduction is the law of heat conduction or
Fourier’s law, which states that the rate of heat transfer through a material is
proportional to the negative gradient of the temperature and to the area orthog-
onal to the gradient through which the heat flows.
The derivation proceeds analogously to the modeling of diffusion processes
in Sect. 10.2.1.3. Now the vector 𝐉 is the heat flux density and it is, by the law of
heat conduction, equal to
𝐉 ∶= −𝑘∇𝑢,
where the thermal conductivity 𝑘 ∶ ℝ𝑑 → ℝ𝑑×𝑑 is generally a matrix-valued
function and 𝑢 denotes the unknown temperature. This results in the heat equa-
tion
−∇ ⋅ (𝑘∇𝑢) = 𝑓.
However, the thermal conductivity 𝑘 of a material generally varies with temper-
ature, which gives rise to nonlinear equations in which 𝑘 is a function of the
unknown temperature.
It is often instructive to check that the physical units of the variables and con-
stants in an equation and its derivation are consistent (see Problem 10.5). To
check the consistency of the units in the heat equation, we note that the un-
known temperature 𝑢 has unit [𝑢] = K, the thermal conductivity 𝑘 has unit
[𝑘] = W ⋅ m−1 ⋅ K−1 , and the source term 𝑓 has unit [𝑓] = W ⋅ m−3 . In the equa-
tion, we thus have [𝐉] = [𝑘∇𝑢] = [𝑘][∇𝑢] = W ⋅ m−2 and [∇ ⋅ (𝑘∇𝑢)] = W ⋅ m−3 .
The unit of [∇ ⋅ (𝑘∇𝑢)] on the left-hand side of the equation is consistent with
the unit of the source term 𝑓 on the right-hand side.
In general, equations may have any number of solutions, ranging from no solu-
tion at all, a unique solution, and a finite number of solutions to infinitely many
solutions. This is true, e.g., for linear systems of equations, for polynomial equa-
tions, and for Diophantine equations, and it is also true for differential equations.
It is usually desirable that a pde has a unique solution, because we expect
the equation to be a full and unique description of the problem or system under
consideration. Therefore the question naturally arises whether a given pde has
a unique solution. If we can answer this question positively, our confidence that
the pde is a useful model is much increased. This knowledge is also useful when
we aim to calculate a numerical approximation of a solution; it stands to reason
that the existence of a unique solution is beneficial for any numerical algorithm.
The existence and possibly uniqueness of a solution is not a property of just
the equation that holds for all points in the interior of the domain, but a full
problem description must be supplemented with initial and/or boundary con-
ditions and a specification of the set of functions from which the solution is
sought. Again, this should not come as a surprise; we know from algebra that,
e.g., the polynomial equation 𝑥 2 = 2 has no solution in the rational numbers ℚ
but a unique solution in the real numbers ℝ, and that the polynomial equation
𝑥2 = −1 has no solution in the real numbers ℝ but a unique solution in the
complex numbers ℂ.
Analogously, there are different types of solutions of pdes. A classical solution
is a solution that can be substituted into the equation and that then satisfies the
equation pointwise. But there are other, weaker, types of function-like objects
that can be interpreted as solutions of differential equations. A glimpse of the
theory of elliptic equations is given in the next section, Sect. 10.2.3, and many
textbooks have been written on these questions [1].
The two major types of conditions are initial conditions and boundary condi-
tions. In transient problems, initial conditions are usually specified, and bound-
ary conditions are usually specified in both stationary and transient problems.
Initial conditions give the solution at the beginning of the time interval, and
boundary conditions hold on the boundary of the spatial domain.
For elliptic equations, there are four major types of boundary conditions:
• Dirichlet boundary conditions specify the unknown 𝑢 on all points of the
boundary 𝜕𝑈 of a domain 𝑈 or only on the Dirichlet part 𝜕𝑈𝐷 ⊂ 𝜕𝑈 of the
boundary 𝜕𝑈 as in the example
\[
u(\mathbf{x}) = u_D(\mathbf{x}) \quad \forall \mathbf{x} \in \partial U_D,
\]
where 𝑢𝐷 is a given function.
• Neumann boundary conditions specify the normal derivative of the unknown 𝑢 on the boundary or on the Neumann part 𝜕𝑈𝑁 ⊂ 𝜕𝑈 of the boundary as in the example
\[
\frac{\partial u}{\partial \mathbf{n}} = \mathbf{n} \cdot \nabla u(\mathbf{x}) = u_N(\mathbf{x}) \quad \forall \mathbf{x} \in \partial U_N,
\]
where 𝑢𝑁 is a given function.
• Robin boundary conditions, which take the form
\[
a u + b \frac{\partial u}{\partial \mathbf{n}} = u_R(\mathbf{x}) \quad \forall \mathbf{x} \in \partial U,
\]
where 𝑢𝑅 is a given function, are linear combinations of Dirichlet and Neu-
mann boundary conditions.
They often appear in Sturm–Liouville problems. In convection-diffusion equations, they can act as insulating boundary conditions that ensure that the sum of the convective and diffusive fluxes vanishes. In electromagnetism, they are called impedance boundary conditions.
Boundary conditions whose given function on the right-hand side vanishes are
called homogeneous boundary conditions; otherwise they are called inhomoge-
neous.
Prescribing Dirichlet boundary conditions or mixed Dirichlet/Neumann
boundary conditions to an elliptic pde yields a unique solution.
However, prescribing Neumann boundary conditions to an elliptic pde re-
sults in no solutions or infinitely many solutions. This can be discussed vividly
using the electrostatic problem described by the elliptic boundary-value problem
−∇ ⋅ (𝐴∇𝑢) = 𝑓 in 𝑈, (10.9a)
𝐧 ⋅ (𝐴∇𝑢) = 𝑢𝑁 on 𝜕𝑈. (10.9b)
Here the right-hand side 𝑓 represents the charges and the Neumann boundary condition 𝑢𝑁 corresponds to a known electric field on the whole boundary 𝜕𝑈. Integrating the equation, using the divergence theorem, and using the Neumann boundary condition, we find
In other words, the Neumann boundary conditions must match the right-hand
side via the equation
\[
- \oiint_{\partial U} u_N \,\mathrm{d}S = \iiint_{U} f \,\mathrm{d}V,
\]
which hence is a necessary condition for the existence of a solution.
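The situation can be observed in one dimension (an illustration with hypothetical helper code, not from the book): discretizing −𝑢″ with pure Neumann boundary conditions by first-order one-sided differences yields a singular system matrix whose null space contains the constant vectors, reflecting the fact that solutions are determined only up to an additive constant.

```julia
import LinearAlgebra

# System matrix of the 1d Laplacian with pure Neumann boundary
# conditions, using one-sided differences at both boundaries.
function neumann_laplacian(N)
    A = zeros(N, N)
    for i in 2:N-1
        A[i, i-1] = -1.0
        A[i, i] = 2.0
        A[i, i+1] = -1.0
    end
    A[1, 1] = 1.0
    A[1, 2] = -1.0
    A[N, N] = 1.0
    A[N, N-1] = -1.0
    A
end
```

Multiplying this matrix by a constant vector gives the zero vector, and its determinant vanishes.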
−∇ ⋅ (𝐴∇𝑢) + 𝐛 ⋅ ∇𝑢 + 𝑐𝑢 = 𝑓 in 𝑈, (10.10a)
𝑢 = 𝑢𝐷 on 𝜕𝑈𝐷 , (10.10b)
𝐧 ⋅ (𝐴∇𝑢) = 0 on 𝜕𝑈𝑁 , (10.10c)
Theorem 10.2 (inverse trace theorem) Suppose that 𝑠 is a positive integer, that
the domain 𝑈 is of class 𝐶 𝑠 , and that 𝜕𝑈 is bounded. Then there is a bounded
trace operator 𝑇 ∶ 𝐻 𝑠 (𝑈) → 𝐻 𝑠−1∕2 (𝜕𝑈). Moreover, 𝑇 has a bounded right inverse
𝐸 ∶ 𝐻 𝑠−1∕2 (𝜕𝑈) → 𝐻 𝑠 (𝑈), i.e., there exists a positive constant 𝐶 such that
\[
\|E g\|_{H^s(U)} \le C \|g\|_{H^{s-1/2}(\partial U)} \quad \text{and} \quad T(E g) = g \quad \forall g \in H^{s-1/2}(\partial U)
\]
holds.
𝑤 ∶= 𝑢 − 𝑢̄ 𝐷 .
The integral ∮𝜕𝑈 (𝐴∇𝑢) ⋅ (𝑣𝐧)d𝑆 vanishes, because the test function 𝑣 ∈ 𝐻01 (𝑈)
vanishes on the boundary 𝜕𝑈𝐷 and because 𝐧 ⋅ (𝐴∇𝑢) = 0 holds on 𝜕𝑈𝑁 . There-
fore the weak formulation is to find a function 𝑢 with 𝑢 − 𝑢̄ 𝐷 ∈ 𝐻01 (𝑈) such
that
A bilinear form 𝑎 on a Hilbert space 𝐻 is called coercive with constant 𝛽 > 0 if the inequality
\[
\beta \|u\|_H^2 \le a(u, u) \quad \forall u \in H
\]
holds.
The main assumptions in the theorem are that the bilinear form 𝑎 is coercive
and bounded.
Theorem 10.5 (Lax–Milgram theorem [3]) Suppose that 𝐻 is a Hilbert space,
that 𝐹 ∈ 𝐻 ′ , and that 𝑎 is a bilinear form on 𝐻 that is bounded (with constant 𝛼)
and coercive (with constant 𝛽). Then there exists a unique solution 𝑢 ∈ 𝐻 of the
equation
𝑎(𝑢, 𝑣) = 𝐹(𝑣) ∀𝑣 ∈ 𝐻.
Furthermore, the estimate
1
‖𝑢‖𝐻 ≤ ‖𝐹‖𝐻 ′ (10.13)
𝛽
holds.
The proof, which mostly follows [1, Section 6.2.1], uses basic concepts from functional analysis such as the Riesz representation theorem, whose full explanation can be found in any textbook on functional analysis. Apart from an introduction to these basic concepts, the proof is complete.
Theorem 10.6 (Riesz representation theorem) Suppose 𝐻 is a Hilbert space.
Then the dual space 𝐻 ′ of 𝐻 can be canonically identified with 𝐻. More precisely,
for each 𝑢′ ∈ 𝐻 ′ there exists a unique element 𝑢 ∈ 𝐻 such that
𝑢′ (𝑣) = ⟨𝑢, 𝑣⟩ ∀𝑣 ∈ 𝐻
holds and the norms ‖𝑢′‖𝐻′ and ‖𝑢‖𝐻 agree.
Before giving the proof, we note that the special case of a symmetric bilinear
form 𝑎 leads the way to the general case. In the symmetric case, ⟨𝑢, 𝑣⟩ ∶= 𝑎(𝑢, 𝑣)
is an inner product on the Hilbert space 𝐻, and the Riesz representation theorem,
Theorem 10.6, can be directly applied, showing the existence of a unique solution
𝑢 ∈ 𝐻. A proof for the general case is the following.
Proof First, we consider any 𝑢 ∈ 𝐻 and note that 𝑣 ↦ 𝑎(𝑢, 𝑣) is a bounded
linear functional on 𝐻 by assumption. Then the Riesz representation theorem,
Theorem 10.6, implies that there is a unique element 𝑤 ∈ 𝐻 that satisfies
\[
a(u, v) = \langle w, v \rangle \quad \forall v \in H.
\]
Defining the operator 𝐴 ∶ 𝐻 → 𝐻 by 𝐴𝑢 ∶= 𝑤, we calculate
⟨𝐴(𝜆1 𝑢1 + 𝜆2 𝑢2 ), 𝑣⟩ = 𝑎(𝜆1 𝑢1 + 𝜆2 𝑢2 , 𝑣)
= 𝜆1 𝑎(𝑢1 , 𝑣) + 𝜆2 𝑎(𝑢2 , 𝑣)
= 𝜆1 ⟨𝐴𝑢1 , 𝑣⟩ + 𝜆2 ⟨𝐴𝑢2 , 𝑣⟩
= ⟨𝜆1 𝐴𝑢1 + 𝜆2 𝐴𝑢2 , 𝑣⟩
for all 𝜆1 and 𝜆2 ∈ ℝ, for all 𝑢1 and 𝑢2 ∈ 𝐻, and for all 𝑣 ∈ 𝐻. Since the equality
holds for all 𝑣 ∈ 𝐻, the operator 𝐴 is linear. To show that 𝐴 is bounded, we use
the assumption that the bilinear form 𝑎 is bounded to calculate
\[
\|Au\|_H^2 = \langle Au, Au \rangle = a(u, Au) \le \alpha \|u\|_H \|Au\|_H,
\]
i.e., ‖𝐴𝑢‖𝐻 ≤ 𝛼‖𝑢‖𝐻 for all 𝑢 ∈ 𝐻. Furthermore, the coercivity of 𝑎 yields
\[
\beta \|u\|_H^2 \le a(u, u) = \langle Au, u \rangle \le \|Au\|_H \|u\|_H,
\]
which implies
𝛽‖𝑢‖𝐻 ≤ ‖𝐴𝑢‖𝐻 ∀𝑢 ∈ 𝐻, (10.14)
i.e., the operator 𝐴 is bounded below. It is straightforward to see that it is there-
fore injective. Furthermore, its range is closed: if 𝐴𝑢𝑛 → 𝑦, then the inequality
(10.14) shows that ⟨𝑢𝑛⟩ is a Cauchy sequence and hence converges to a limit 𝑢 ∈ 𝐻, and the continuity of 𝐴 implies 𝐴𝑢 = 𝑦, i.e., 𝑦 ∈ 𝑅(𝐴).
Next, we show that the linear operator 𝐴 is surjective. Suppose it is not. Since
its range 𝑅(𝐴) is closed, there would exist a nonzero element 𝑦 ∈ 𝐻 with 0 ≠
𝑦 ∈ 𝑅(𝐴)⊥ . This would imply 𝛽‖𝑦‖2𝐻 ≤ 𝑎(𝑦, 𝑦) = ⟨𝐴𝑦, 𝑦⟩ = 0 and therefore
𝑦 = 0, a contradiction. Therefore the operator 𝐴 is surjective.
Since the operator 𝐴 ∶ 𝐻 → 𝐻 is both injective and surjective, it is bijective.
Next, the Riesz representation theorem, Theorem 10.6, implies that a unique
𝑧 ∈ 𝐻 exists such that
𝐹(𝑣) = ⟨𝑧, 𝑣⟩ ∀𝑣 ∈ 𝐻
and that the norms ‖𝐹‖𝐻 ′ and ‖𝑧‖𝐻 agree. Since 𝐴 is a bijection, there exists a
unique element 𝑢 ∈ 𝐻 such that 𝐴𝑢 = 𝑧.
In summary, we have shown that there exists a unique element 𝑢 ∈ 𝐻 such
that
𝑎(𝑢, 𝑣) = ⟨𝐴𝑢, 𝑣⟩ = ⟨𝑧, 𝑣⟩ = 𝐹(𝑣) ∀𝑣 ∈ 𝐻.
Inequality (10.13) follows from inequality (10.14) and recalling that 𝐴𝑢 = 𝑧
and ‖𝐹‖𝐻 ′ = ‖𝑧‖𝐻 . □
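A finite-dimensional analogue illustrates the theorem (an illustration under the assumption 𝐻 = ℝ², not part of the book's text): the bilinear form 𝑎(𝑢, 𝑣) = 𝑣ᵀ𝑀𝑢 is coercive whenever the symmetric part of 𝑀 is positive definite, and solving 𝑎(𝑢, 𝑣) = 𝐹(𝑣) for all 𝑣 amounts to solving 𝑀𝑢 = 𝑧 for the Riesz representer 𝑧 of 𝐹.

```julia
import LinearAlgebra

M = [2.0 1.0; -1.0 2.0]                    # nonsymmetric system matrix
beta = LinearAlgebra.eigmin((M + M') / 2)  # coercivity constant beta > 0
z = [1.0, 3.0]                             # Riesz representer of F
u = M \ z                                  # the unique solution
```

The estimate (10.13) becomes ‖𝑢‖ ≤ ‖𝑧‖∕𝛽 in this setting, which the computed solution satisfies.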
We can now state and prove the existence and uniqueness of the solution
of the elliptic boundary-value problem by applying the Lax–Milgram theorem.
To do so, it is customary to call coefficient matrices 𝐴 that give rise to coercive
bilinear forms uniformly elliptic.
holds.
holds.
Proof We will show that the bilinear form 𝑎 in (10.11) is coercive and bounded
and that the functional 𝐹 in (10.12) is in 𝐻 −1 (𝑈) so that Theorem 10.5 can be
applied.
Using the Cauchy–Bunyakovsky–Schwarz inequality, Theorem 8.1, it is straightforward to see that the bilinear form 𝑎 is bounded.
To see that the bilinear form 𝑎 is coercive, we first find a bound from above for
the second term ∭𝑈 (𝑢𝐛)⋅∇𝑢d𝑉 in 𝑎(𝑢, 𝑢). The Cauchy–Bunyakovsky–Schwarz
inequality yields
For the factor ‖𝑢‖𝐿2 (𝑈) ‖∇𝑢‖𝐿2 (𝑈) , we use the inequality
𝑥𝑦 ≤ 𝛼𝑥² + 𝛽𝑦² ∀𝑥, 𝑦 ∈ ℝ,
\[
\alpha := \frac{\inf_U c}{\|\mathbf{b}\|_{L^\infty(U)}} > 0
\]
so that the coefficient of ‖𝑢‖2𝐿2 (𝑈) vanishes. Then 𝛽 = ‖𝐛‖𝐿∞ (𝑈) ∕(4 inf 𝑈 𝑐) and
the coefficient of ‖∇𝑢‖2𝐿2 (𝑈) is positive due to the assumption. Therefore the bi-
linear form 𝑎 is again coercive.
Finally, 𝐹 belongs to 𝐻 −1 (𝑈) = 𝐻01 (𝑈)′ provided it is a bounded linear func-
tional on 𝐻01 (𝑈). 𝐹 is clearly linear and it is bounded due to the assumptions on
the data. □
The following theorem is a pointwise estimate for solutions of the linear Pois-
son equation. It is quite well-known and can be found, e.g., as [2, Theorem 3.7].
Many regularity results for elliptic equations are available to ensure the smooth-
ness of the solution based on assumptions on the coefficients and the domain. If
sufficient smoothness is assumed, the operator ∇ ⋅ (𝐴∇) in divergence form can
always be rewritten in the non-divergence form that occurs in the theorem.
where 𝑢 ∈ 𝐶 0 (𝑈)∩𝐶 2 (𝑈), the coefficient matrix 𝐴 is symmetric, and the inequality
𝑐(𝑥) ≤ 0 holds for all 𝑥 ∈ 𝑈. Furthermore, suppose that the operator 𝐿 is elliptic,
i.e., 0 < 𝜆(𝑥)|𝜉|2 ≤ 𝜉 ⊤ 𝐴(𝑥)𝜉 ≤ Λ(𝑥)|𝜉|2 holds for all 𝜉 ∈ ℝ𝑑 ∖{0} and for all 𝑥 ∈
𝑈, where 𝜆(𝑥) and Λ(𝑥) are the minimum and maximum eigenvalues, respectively.
Then the estimate
\[
\sup_U |u| \le \sup_{\partial U} |u| + C \sup_U \frac{|f|}{\lambda}
\]
holds, where 𝐶 is a constant depending only on diam(𝑈) and 𝛽 ∶= sup |𝑏|∕𝜆 < ∞.
In particular, if 𝑈 lies between two parallel planes a distance 𝑑 apart, then the
estimate holds with 𝐶 = e(𝛽+1)𝑑 − 1.
10.3 Parabolic Equations
𝑢𝑡 = 𝐷𝑢𝑥𝑥 + 𝑞,
where the positive coefficient function 𝐷 is the thermal conductivity. The same
equation can be interpreted as a transient one-dimensional diffusion equation;
then the positive coefficient 𝐷 is the diffusion constant. The function 𝑞 on the
right-hand side is a source of heat (in the case of the heat equation) or mass (in
the case of the diffusion equation).
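A minimal explicit scheme illustrates how such equations are solved numerically (a sketch under the assumptions of one dimension, fixed boundary values, and a stable step size; the function name heat_step! is made up):

```julia
# One explicit Euler time step for u_t = D*u_xx + q on a uniform grid
# with spacing h; the boundary values u[1] and u[end] are kept fixed.
function heat_step!(u, D, q, h, dt)
    lam = D * dt / h^2          # stability requires lam <= 1/2
    unew = copy(u)
    for i in 2:length(u)-1
        unew[i] = u[i] + lam * (u[i+1] - 2 * u[i] + u[i-1]) + dt * q[i]
    end
    u .= unew
    u
end
```

For the initial value sin(𝜋𝑥) on (0, 1) with homogeneous boundary values and 𝑞 = 0, the numerical solution decays like exp(−𝜋²𝑡), as the exact solution does.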
In higher dimensions, heat or diffusion equations have the form
𝑢𝑡 = 𝐷Δ𝑢 + 𝑞
10.4 Hyperbolic Equations
The solutions of hyperbolic equations behave like waves. More precisely, if the
initial condition for the solution at time 𝑡 = 0 is disturbed, it takes a finite
amount of time to observe this disturbance at other points of space, meaning
that the disturbance has a finite propagation speed in contrast to elliptic and
parabolic equations, where a disturbance is observed everywhere immediately and hence the propagation speed is infinite.
The simplest example of a hyperbolic equation is the one-dimensional (in
space) wave equation
𝑢𝑡𝑡 = 𝑐2 𝑢𝑥𝑥
equipped with an initial condition 𝑢(𝑡 = 0, 𝑥) = 𝑓(𝑥) and boundary conditions
such as 𝑢(𝑡, 𝑥 = 𝑥1 ) = 𝑔1 (𝑡) and 𝑢(𝑡, 𝑥 = 𝑥2 ) = 𝑔2 (𝑡). In the case of hyperbolic equations, the choice of boundary conditions is often a delicate matter. When the waves travel long enough, they may reach the boundary of a bounded domain, and therefore one often tries to make provisions that enable a wave to leave the domain through the boundary without any disturbance or reflection. One often also generates waves on the boundary that enter the domain.
It is easy to find an exact solution of this wave equation. Any function 𝑢(𝑡, 𝑥) ∶= 𝑓(𝑥 − 𝑐𝑡) is a solution – as long as the boundary conditions match
– as can be checked in a straightforward manner using the chain rule. Here the
function 𝑓 is the initial condition. It is clear that the constant 𝑐 ∈ ℝ is the speed
of the wave. This exact solution is useful for testing numerical methods.
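A quick numerical check of this fact (an illustration, not from the book; the profile 𝑓 and the evaluation point are arbitrary choices):

```julia
f(x) = exp(-x^2)                 # an arbitrary smooth wave profile
u(t, x; c = 2.0) = f(x - c*t)    # the traveling-wave solution

# Central second difference to approximate second derivatives.
d2(g, x; h = 1.0e-3) = (g(x + h) - 2 * g(x) + g(x - h)) / h^2

utt = d2(t -> u(t, 0.7), 0.3)    # u_tt at (t, x) = (0.3, 0.7)
uxx = d2(x -> u(0.3, x), 0.7)    # u_xx at the same point
```

Up to discretization error, 𝑢𝑡𝑡 = 𝑐²𝑢𝑥𝑥 holds with 𝑐 = 2.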
Another important class of hyperbolic equations are conservation laws. To
derive a conservation law, we start from the equation
\[
\frac{\mathrm{d}}{\mathrm{d}t} \iiint_{\Omega} u(\mathbf{x}) \,\mathrm{d}\mathbf{x} + \oiint_{\partial\Omega} \mathbf{n} \cdot \mathbf{f}(u) \,\mathrm{d}S = 0.
\]
The first term is the time rate of change of 𝑢 in the subdomain Ω ⊂ 𝐷, which
is arbitrary with a sufficiently smooth boundary in this equation. The second
integral is a surface integral and gives the flux 𝐟 of 𝑢 through the boundary 𝜕Ω
of Ω, where the 𝐧 are outward unit normal vectors. The equation just means that
the change of 𝑢 contained in Ω is equal to the negative total outflow of 𝑢 from Ω.
If 𝑢 and 𝐟 are sufficiently smooth functions, we can change the order of dif-
ferentation and integration in the first term and use the divergence theorem in
the second term to find
\[
\iiint_{\Omega} u_t(\mathbf{x}) \,\mathrm{d}\mathbf{x} + \iiint_{\Omega} \nabla \cdot \mathbf{f}(u(\mathbf{x})) \,\mathrm{d}\mathbf{x} = \iiint_{\Omega} \bigl( u_t(\mathbf{x}) + \nabla \cdot \mathbf{f}(u(\mathbf{x})) \bigr) \,\mathrm{d}\mathbf{x} = 0.
\]
𝑢𝑡 + ∇ ⋅ 𝐟 (𝑢) = 0 ∀𝐱 ∈ 𝐷.
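Numerical schemes for conservation laws are usually designed to conserve 𝑢 discretely as well. A first-order upwind scheme illustrates this (a hypothetical sketch for the linear flux 𝐟(𝑢) = 𝑎𝑢 with 𝑎 > 0 and periodic boundary conditions):

```julia
# One upwind step for u_t + (a*u)_x = 0 with a > 0 on a periodic grid
# with spacing h; the scheme is stable for a*dt/h <= 1.
function upwind_step(u, a, h, dt)
    n = length(u)
    [u[i] - a * dt / h * (u[i] - u[mod1(i - 1, n)]) for i in 1:n]
end
```

Summing over all cells, the flux differences telescope on the periodic grid, so the total amount of 𝑢 is conserved exactly (up to roundoff).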
Regarding the choice of grid points in the first step, it is customary to use
equidistant grid points. In one dimension, this means that the solution 𝑢 is cal-
culated at the points
𝑢𝑖 ∶= 𝑢(𝑎 + 𝑖ℎ),
where the domain is the interval 𝑈 ∶= (𝑎, 𝑏), ℎ ∈ ℝ+ is the grid spacing, and
the index 𝑖 ∈ {1, … , 𝑁} is related to the grid spacing ℎ by
\[
h = \frac{b - a}{N}.
\]
In two dimensions, we have
\[
u_{i,j} := u(a_1 + ih, a_2 + jh),
\]
where the domain is 𝑈 ∶= (𝑎1 , 𝑏1 ) × (𝑎2 , 𝑏2 ) and the indices are 𝑖 ∈ {1, … , 𝑁1 }
and 𝑗 ∈ {1, … , 𝑁2 }, and so forth in higher dimensions.
Finite differences are especially well suited for domains with simple bound-
aries, while the finite-element method (see Sect. 10.7 below) is especially well
suited for domains with complex boundaries that are to be resolved precisely.
Taylor’s theorem is used to approximate the derivatives in the second step.
Theorem 10.10 (Taylor’s theorem) Suppose that 𝑛 ∈ ℕ and that the function 𝑓 ∶ ℝ → ℝ is 𝑛 times differentiable at the point 𝑎 ∈ ℝ. Then there exists a function
ℎ ∶ ℝ → ℝ such that
\[
f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!} (x - a)^k + h(x) (x - a)^n
\]
and
\[
\lim_{x \to a} h(x) = 0.
\]
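Taylor's theorem also predicts the order of a finite-difference approximation, which can be verified numerically (an illustration, not from the book): for the central second difference, halving ℎ should reduce the error by a factor of about four.

```julia
# Central second difference approximating f''(x), with error O(h^2).
central2(f, x, h) = (f(x + h) - 2 * f(x) + f(x - h)) / h^2

# Error for f = sin at x = 1, where the exact second derivative is -sin(1).
err(h) = abs(central2(sin, 1.0, h) + sin(1.0))

# Observed order of convergence estimated from two step sizes.
order = log2(err(0.02) / err(0.01))
```

The observed order is close to two, as predicted by the Taylor expansion.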
The Taylor expansion must be truncated at some point. This gives rise to the local truncation error. Using more terms in the Taylor expansion generally results in smaller local truncation errors in the third step, and we will implement
an example below. However, more terms generally complicate the system of al-
gebraic equations, making it more time consuming to assemble and to solve. It
is therefore not obvious a priori if the accuracy of the solutions is increased by
increasing the number of terms in the Taylor expansion when the total compu-
tation time is the same.
Regarding the solution of the resulting system of algebraic equations, it is ob-
vious that a linear pde will result in a linear system of equations (see Sect. 8.4.8).
As the number of dimensions of the domain increases, the system matrices be-
come sparser, and it is imperative to use sparse matrices (see Sect. 8.2). We will
discuss the implementation in more detail below.
If the pde and thus the system of algebraic equations are nonlinear, Newton
methods (see Sect. 12.7) and fixed-point methods are the methods of choice.
As an example, we solve the boundary-value problem
\[
-u_{xx}(x) = f(x), \qquad u(a) = g_1, \quad u(b) = g_2
\]
in the interval (𝑎, 𝑏), where the function 𝑓 ∶ ℝ → ℝ and the constants 𝑔1 ∈ ℝ
and 𝑔2 ∈ ℝ are given.
In the first step, we use the equidistant grid 𝑎 + 𝑖ℎ defined above. The two
boundary conditions result in 𝑢0 = 𝑢(𝑎) = 𝑔1 and 𝑢𝑁 = 𝑢(𝑎 + 𝑁ℎ) = 𝑢(𝑏) = 𝑔2 .
In the second step, to approximate the derivative 𝑢𝑥𝑥 in the equation, we apply
Taylor’s theorem, Theorem 10.10, to find the two expansions
\[
\begin{aligned}
u_{i+1} &= u_i + h u_x(a + ih) + \frac{h^2}{2} u_{xx}(a + ih) + \frac{h^3}{6} u_{xxx}(a + ih) + O(h^4), \\
u_{i-1} &= u_i - h u_x(a + ih) + \frac{h^2}{2} u_{xx}(a + ih) - \frac{h^3}{6} u_{xxx}(a + ih) + O(h^4)
\end{aligned}
\]
for 𝑢𝑖+1 = 𝑢(𝑎 + (𝑖 + 1)ℎ) and 𝑢𝑖−1 = 𝑢(𝑎 + (𝑖 − 1)ℎ) around the point 𝑎 + 𝑖ℎ.
Here 𝑂(ℎ⁴) includes all terms of fourth order and higher in ℎ. More precisely, we write 𝑓(𝑥) = 𝑂(𝑔(𝑥)) as 𝑥 → 𝑥0 if and only if lim sup𝑥→𝑥0 |𝑓(𝑥)∕𝑔(𝑥)| < ∞.
Adding these two expansions yields
\[
u_{xx}(a + ih) = \frac{u_{i+1} - 2 u_i + u_{i-1}}{h^2} + O(h^2).
\]
In the third step, the system of linear equations
\[
\frac{-u_{i+1} + 2 u_i - u_{i-1}}{h^2} = f_i \quad \forall i \in \{1, \dots, N-1\}
\]
is
found by substituting the equation for 𝑢𝑥𝑥 (𝑎 + 𝑖ℎ) into the boundary-value prob-
lem. We observe that the local truncation error is a term 𝑂(ℎ2 ) of second order
in ℎ, rendering this finite-difference discretization a second-order one.
In order to solve this linear system of equations, we write a function that
records each equation in a row of a sparse matrix and then calls the standard
solver in Julia for this type of linear equation. The vector fs that contains the
right side is a dense vector. The system matrix is initialized as an empty, sparse
(𝑁 − 1) × (𝑁 − 1) matrix. In the loop, the coefficients of 𝑢𝑖+1 , 𝑢𝑖 , and 𝑢𝑖−1 are writ-
ten into the 𝑖-th row of the matrix. The two equations that contain the boundary
conditions, i.e., the ones for 𝑖 = 0 and 𝑖 = 𝑁, require special treatment, and the
constant terms 𝑢0 = 𝑔1 and 𝑢𝑁 = 𝑔2 go on the right side. For solving the linear system of equations, we use the built-in operator \.
import LinearAlgebra
import SparseArrays

function elliptic_FD_1D(f, g1, g2, a, b, N)
    h = (b - a) / N
    fs = [h^2 * f(a + i*h) for i in 1:N-1]
    A = SparseArrays.spzeros(N-1, N-1)
    ## interior
    for i in 2:N-2
        A[i, i+1] = -1.0
        A[i, i] = 2.0
        A[i, i-1] = -1.0
    end
    ## left boundary
    A[1, 1] = 2.0
    A[1, 2] = -1.0
    fs[1] += g1
    ## right boundary
    A[N-1, N-1] = 2.0
    A[N-1, N-2] = -1.0
    fs[N-1] += g2
    ## solve
    A \ fs
end
Now we can easily test our implementation. In the following test case, the
domain is the interval (0, 2𝜋) and 𝑢 ∶= sin yields 𝑓 = sin. The right-hand side
could also be obtained using symbolic computations, automating testing even
further. We calculate the error on four grids, namely with 101 , 102 , 103 , and 104
points.
import Printf
for i in 1:4
    local error = test_elliptic_FD_1D(sin, sin, 0.0, 2*pi, 10^i)
    Printf.@printf("N = 10^%1d: error = %.5e\n", i, error)
end
Not much imagination is required to see that many variations of the steps above are possible. Most importantly, many variations of how to apply Taylor’s theorem are possible, affecting the convergence speed of a finite-difference scheme. This is the question we investigate next.
in two and three dimensions 𝑑 are derived. In addition to being of fourth order,
the schemes still have the desirable property that only neighboring grid points
are used. This fact considerably simplifies the implementation at the grid points
near the boundary. Another advantage is the small bandwidth in the resulting
linear system of equations, meaning that it can be solved faster. Therefore these
two- and three-dimensional finite-difference discretizations combine fast con-
vergence as ℎ → 0 with ease of implementation, providing a good example of
how Taylor’s theorem can be applied to good effect.
$$\frac{1}{6}\bigl(u_{i+1,j+1} + u_{i+1,j-1} + u_{i-1,j+1} + u_{i-1,j-1}\bigr) + \frac{2}{3}\bigl(u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1}\bigr) - \frac{10}{3}\, u_{i,j} = h^2\Bigl(\frac{2}{3} f_{i,j} + \frac{1}{12}\bigl(f_{i+1,j} + f_{i-1,j} + f_{i,j+1} + f_{i,j-1}\bigr)\Bigr) + O(h^6) \quad (10.16)$$
where the matrix elements are the coefficients of the unknown 𝑢𝑖,𝑗 at the grid
point (𝑖, 𝑗) and its neighbors.
This discretization is a fourth-order compact finite-difference discretization
and is derived in the proof of the following theorem. Such a discretization is often
called compact, since it only involves the eight neighbors of the grid point (𝑖, 𝑗);
a general fourth-order discretization involves more grid points.
Δ𝑢 = 𝑓 in 𝐷 ⊂ ℝ2 (10.17)
Proof We use Taylor’s theorem, Theorem 10.10, to find the two expansions
$$u_{i+1,j} = u_{i,j} + h u_x + \frac{h^2}{2} u_{xx} + \frac{h^3}{6} u_{xxx} + \frac{h^4}{24} u_{xxxx} + \frac{h^5}{120} u_{xxxxx} + O(h^6),$$
$$u_{i-1,j} = u_{i,j} - h u_x + \frac{h^2}{2} u_{xx} - \frac{h^3}{6} u_{xxx} + \frac{h^4}{24} u_{xxxx} - \frac{h^5}{120} u_{xxxxx} + O(h^6)$$
for 𝑢𝑖+1,𝑗 = 𝑢(𝑎1 + (𝑖 + 1)ℎ, 𝑎2 + 𝑗ℎ) and 𝑢𝑖−1,𝑗 = 𝑢(𝑎1 + (𝑖 − 1)ℎ, 𝑎2 +
𝑗ℎ) with respect to 𝑥 around the point 𝑎1 + 𝑖ℎ. For convenience, the arguments
(𝑎1 + 𝑖ℎ, 𝑎2 + 𝑗ℎ) of the derivatives are dropped. Adding these two expansions
and rearranging terms shows that the equation
$$D_x^2 u_{i,j} := \frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{h^2} = u_{xx} + \frac{h^2}{12} u_{xxxx} + O(h^4) \quad (10.18)$$
holds for the central-difference operator 𝐷𝑥2 that acts with respect to 𝑥.
Therefore the local truncation error 𝜏𝑖,𝑗 of the initial (second-order) discretiza-
tion
𝐷𝑥2 𝑢𝑖,𝑗 + 𝐷𝑦2 𝑢𝑖,𝑗 = 𝑓𝑖,𝑗 + 𝜏𝑖,𝑗 (10.19)
of (10.17) equals
$$\tau_{i,j} = \frac{h^2}{12}\bigl(u_{xxxx} + u_{yyyy}\bigr) + O(h^4).$$
In order to obtain more information about the coefficient of ℎ2 , we differenti-
ate (10.17) twice with respect to 𝑥 and twice with respect to 𝑦 to find
$$\tau_{i,j} = \frac{h^2}{12}\bigl(D_x^2 f_{i,j} + D_y^2 f_{i,j}\bigr) - \frac{h^2}{6} D_x^2 D_y^2 u_{i,j} + O(h^4).$$
We substitute this form of 𝜏𝑖,𝑗 into the initial discretization (10.19).
In summary, the sought discretization is
$$D_x^2 u_{i,j} + D_y^2 u_{i,j} + \frac{h^2}{6} D_x^2 D_y^2 u_{i,j} = f_{i,j} + \frac{h^2}{12}\bigl(D_x^2 + D_y^2\bigr) f_{i,j} + O(h^4), \quad (10.21)$$
which yields (10.16) after expanding the central-difference operators 𝐷𝑥2 and 𝐷𝑦2
and multiplying by ℎ2 . The local truncation error 𝑂(ℎ4 ) of this discretization is
a factor ℎ4 apart from the other terms. □
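As a quick numerical sanity check (our own sketch, not from the text), one can evaluate the residual of the stencil (10.16) for a smooth function; since the omitted term is 𝑂(ℎ⁶), halving ℎ should shrink the residual by a factor of roughly 64.

```julia
# Residual of the compact stencil (10.16) for u(x, y) = sin(x)sin(y),
# for which f = Δu = -2u; the residual is O(h^6).
u(x, y) = sin(x) * sin(y)
f(x, y) = -2.0 * u(x, y)

function stencil_residual(h, x, y)
    lhs = (u(x+h, y+h) + u(x+h, y-h) + u(x-h, y+h) + u(x-h, y-h)) / 6 +
          2/3 * (u(x+h, y) + u(x-h, y) + u(x, y+h) + u(x, y-h)) -
          10/3 * u(x, y)
    rhs = h^2 * (2/3 * f(x, y) +
                 1/12 * (f(x+h, y) + f(x-h, y) + f(x, y+h) + f(x, y-h)))
    abs(lhs - rhs)
end

for h in (0.1, 0.05, 0.025)
    println(stencil_residual(h, 0.7, 0.3))  # shrinks by ≈ 64 per halving of h
end
```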
$$-4 u_{i,j,k} + \frac{1}{3}\bigl(u_{i+1,j,k} + u_{i-1,j,k} + u_{i,j+1,k} + u_{i,j-1,k} + u_{i,j,k+1} + u_{i,j,k-1}\bigr) + \frac{1}{6}\bigl(u_{i,j+1,k+1} + u_{i,j+1,k-1} + u_{i,j-1,k+1} + u_{i,j-1,k-1} + u_{i+1,j,k+1} + u_{i+1,j,k-1} + u_{i-1,j,k+1} + u_{i-1,j,k-1} + u_{i+1,j+1,k} + u_{i+1,j-1,k} + u_{i-1,j+1,k} + u_{i-1,j-1,k}\bigr) = h^2\Bigl(\frac{1}{2} f_{i,j,k} + \frac{1}{12}\bigl(f_{i+1,j,k} + f_{i-1,j,k} + f_{i,j+1,k} + f_{i,j-1,k} + f_{i,j,k+1} + f_{i,j,k-1}\bigr)\Bigr) + O(h^6) \quad (10.22)$$
Δ𝑢 = 𝑓 in 𝐷 ⊂ ℝ3 (10.23)
Proof Analogously to the proof of Theorem 10.11, the local truncation error
𝜏𝑖,𝑗,𝑘 of the initial (second-order) discretization
equals
$$\tau_{i,j,k} = \frac{h^2}{12}\bigl(u_{xxxx} + u_{yyyy} + u_{zzzz}\bigr) + O(h^4).$$
Differentiating (10.23) yields
$$\tau_{i,j,k} = \frac{h^2}{12}\bigl(f_{xx} + f_{yy} + f_{zz}\bigr) - \frac{h^2}{6}\bigl(u_{xxyy} + u_{xxzz} + u_{yyzz}\bigr) + O(h^4).$$
Using this expression for 𝜏𝑖,𝑗,𝑘 in the initial discretization yields
$$D_x^2 u_{i,j,k} + D_y^2 u_{i,j,k} + D_z^2 u_{i,j,k} + \frac{h^2}{6}\bigl(D_x^2 D_y^2 + D_x^2 D_z^2 + D_y^2 D_z^2\bigr) u_{i,j,k} = f_{i,j,k} + \frac{h^2}{12}\bigl(D_x^2 + D_y^2 + D_z^2\bigr) f_{i,j,k} + O(h^4), \quad (10.24)$$
which is of fourth order. Finally, expanding the central-difference operators and
multiplying by ℎ2 yields (10.22). □
a surface integral. These surface integrals are fluxes through the surfaces of all
finite volumes or control volumes. The fluxes are conserved by construction, as
the flux through any surface of a finite volume must equal the flux out of an
adjacent volume through the same surface.
The conservation of fluxes is an important feature of the finite-volume method.
The theoretical treatment of the finite-volume method such as convergence
proofs is more complicated compared to the finite-difference method, where Tay-
lor expansions are available, and to the finite-element method.
Here, we derive a finite-volume discretization of the prototypical elliptic pde
in divergence form, namely the two-dimensional Poisson equation
−∇ ⋅ (𝐴∇𝑢) = 𝑓 in 𝑈 ⊂ ℝ2 , (10.25)
for which Theorem 10.8 holds. We first choose a grid spacing ℎ and grid points
(𝑖ℎ, 𝑗ℎ) ∈ 𝑈, where the integers 𝑖 and 𝑗 are chosen such that the grid points lie
in the domain 𝑈. Next, we define the control volumes
$$V_{i,j} := \bigl((i - 1/2)h,\, (i + 1/2)h\bigr) \times \bigl((j - 1/2)h,\, (j + 1/2)h\bigr)$$
surrounding the grid points (see Fig. 10.1). Since the grid points we have defined
lie on an equidistant grid, the control volumes are rectangles. The sought values
are the values
𝑢𝑖,𝑗 ∶= 𝑢(𝑖ℎ, 𝑗ℎ)
of the solution at the grid points. We analogously write 𝐴𝑖,𝑗 ∶= 𝐴(𝑖ℎ, 𝑗ℎ) and
𝑓𝑖,𝑗 ∶= 𝑓(𝑖ℎ, 𝑗ℎ).
By applying the divergence theorem
$$\iint_V \nabla \cdot \mathbf{J} \, \mathrm{d}V = \oint_{\partial V} \mathbf{n} \cdot \mathbf{J} \, \mathrm{d}S,$$
where 𝐧 is the outward unit normal vector, to the control volume 𝑉𝑖,𝑗 , equation
(10.25) becomes
$$-\oint_{\partial V_{i,j}} \mathbf{n} \cdot (A \nabla u) \, \mathrm{d}S = \iint_{V_{i,j}} f \, \mathrm{d}V.$$
$$u_x\bigl((i + 1/2)h,\, jh\bigr) = \frac{u_{i+1,j} - u_{i,j}}{h} + O(h^2)$$
and analogously on the other edges. This equation follows from applying Tay-
lor’s theorem to 𝑢 around 𝑢𝑖+1∕2,𝑗 with steps ℎ∕2 and −ℎ∕2 to find
Fig. 10.1 The control volume 𝑉𝑖,𝑗 of a finite-volume discretization is shown in red in the center.
The fluxes 𝐹 = 𝐴∇𝑢 are shown as well and are assumed to be constant on the edges of the
control volume.
$$u_{i+1,j} = u_{i+1/2,j} + \frac{h}{2} u_x + \frac{h^2}{2 \cdot 4} u_{xx} + \frac{h^3}{6 \cdot 8} u_{xxx} + O(h^4),$$
$$u_{i,j} = u_{i+1/2,j} - \frac{h}{2} u_x + \frac{h^2}{2 \cdot 4} u_{xx} - \frac{h^3}{6 \cdot 8} u_{xxx} + O(h^4)$$
and then subtracting.
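This midpoint property is easy to confirm numerically; the following small check (our own, using u = exp as an arbitrary smooth test function) shows the error of the difference quotient at the midpoint dropping by a factor of about four when ℎ is halved.

```julia
# (u(x+h) - u(x)) / h approximates u_x at the midpoint x + h/2 with error O(h^2)
u(x) = exp(x)
midpoint_error(h) = abs((u(h) - u(0.0)) / h - exp(h / 2))  # u_x = exp

for h in (0.1, 0.05, 0.025)
    println(midpoint_error(h))  # shrinks by ≈ 4 per halving of h
end
```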
Assuming that the matrix-valued function 𝐴 is the diagonal matrix
$$\begin{pmatrix} a^{11}(x, y) & 0 \\ 0 & a^{22}(x, y) \end{pmatrix}$$
everywhere and that 𝐴 is constant on the edges of 𝜕𝑉𝑖,𝑗 , we hence obtain the
discretization
10.6 Finite Volumes 287
$$-\Bigl(a^{11}_{i+1/2,j} \, h \, \frac{u_{i+1,j} - u_{i,j}}{h} + a^{11}_{i-1/2,j} \, h \, \frac{u_{i-1,j} - u_{i,j}}{h} + a^{22}_{i,j+1/2} \, h \, \frac{u_{i,j+1} - u_{i,j}}{h} + a^{22}_{i,j-1/2} \, h \, \frac{u_{i,j-1} - u_{i,j}}{h}\Bigr) = h^2 f_{i,j} + O(h^3), \quad (10.26)$$
since the length of all edges is ℎ. If 𝑓 is not constant on the control volume,
integration formulas such as Simpson’s rule are useful.
The discretization simplifies to
$$-\Bigl(a^{11}_{i+1/2,j} u_{i+1,j} + a^{11}_{i-1/2,j} u_{i-1,j} + a^{22}_{i,j+1/2} u_{i,j+1} + a^{22}_{i,j-1/2} u_{i,j-1} - \bigl(a^{11}_{i+1/2,j} + a^{11}_{i-1/2,j} + a^{22}_{i,j+1/2} + a^{22}_{i,j-1/2}\bigr) u_{i,j}\Bigr) = h^2 f_{i,j} + O(h^3), \quad (10.27)$$
which is of first order, while it can be shown using the approach in Sect. 10.5
that the straightforward finite-difference discretization of 𝑎Δ𝑢 = 𝑓 is
(𝑢𝑖+1,𝑗 − 𝑢𝑖,𝑗 )∕ℎ etc. between grid points; the rest of the calculations involves
surface and volume integrals.
In summary, the main appeal of finite volumes is that they conserve the fluxes
by construction, which is especially important in physical problems such as dif-
fusion and electrostatic problems, where flux conservation is a physical prin-
ciple, i.e., mass conservation and Gauss’ law, respectively. Finite volumes can
also easily deal with coefficient functions 𝐴 that are not constant and with solu-
tions 𝑢 that are less smooth. Another difference between finite differences and
finite volumes is that finite volumes are amenable to better approximations of
the right-hand side via
∬ 𝑓d𝑉,
𝑉𝑖,𝑗
for i in 0:N
    for j in 0:N
        if i == 0 || i == N || j == 0 || j == N
            M[ind(i, j), ind(i, j)] = 1.0
            fs[ind(i, j)] = g(a1 + i*h, a2 + j*h)
        else
            M[ind(i, j), ind(i+1, j)] = - a11(a1 + (i+1/2)*h, a2 + j*h)
            M[ind(i, j), ind(i-1, j)] = - a11(a1 + (i-1/2)*h, a2 + j*h)
            M[ind(i, j), ind(i, j+1)] = - a22(a1 + i*h, a2 + (j+1/2)*h)
            M[ind(i, j), ind(i, j-1)] = - a22(a1 + i*h, a2 + (j-1/2)*h)
            ## diagonal entry and right-hand side according to (10.27)
            M[ind(i, j), ind(i, j)] = ( a11(a1 + (i+1/2)*h, a2 + j*h)
                                      + a11(a1 + (i-1/2)*h, a2 + j*h)
                                      + a22(a1 + i*h, a2 + (j+1/2)*h)
                                      + a22(a1 + i*h, a2 + (j-1/2)*h) )
            fs[ind(i, j)] = h^2 * f(h, a1 + i*h, a2 + j*h)
        end
    end
end
Here, the strategy for implementing the boundary conditions is to explicitly in-
clude the equations 𝑢𝑖,𝑗 = 𝑔𝑖,𝑗 for the unknown on the boundary, i.e., for 𝑖 = 0,
𝑖 = 𝑁, 𝑗 = 0, or 𝑗 = 𝑁. This strategy leads to shorter code than substituting the
values on the boundary into the system while considering all cases.
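The helper function ind is not shown in the listing above. A plausible definition (our assumption; a one-dimensional analog appears in the finite-element code in Sect. 10.7) maps a pair (𝑖, 𝑗) with 𝑖, 𝑗 ∈ {0, … , 𝑁} to a 1-based linear index:

```julia
const N = 10  # example grid size (an assumption for this sketch)

# map grid point (i, j), 0 <= i, j <= N, to a linear 1-based index,
# numbering the (N+1)^2 grid points column by column
ind(i::Int, j::Int)::Int = 1 + i + (N + 1) * j
```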
The assertion follows from (10.27) and is useful to check that the implemen-
tation is correct.
The next function is useful to assess the accuracy of the solution in examples
where the exact solution is known.
function test_elliptic_FV_2D(u_exact::Function,
                             a1::Float64, b1::Float64, a2::Float64,
                             a11::Function, a22::Function,
                             f::Function, N::Int)::Float64
    local h = (b1-a1) / N
    local u_num = elliptic_FV_2D(a1, b1, a2, a11, a22, f, u_exact, N)
    local u_ex = [u_exact(a1 + i*h, a2 + j*h) for i in 0:N, j in 0:N]
    ## maximum norm of the error
    maximum(abs.(u_num - u_ex))
end
In the first example, we set 𝑈 ∶= (−𝜋, 𝜋)², 𝑢(𝑥, 𝑦) ∶= cos 𝑥 cos 𝑦, and
$$A(x, y) := \begin{pmatrix} 2 + \cos y & 0 \\ 0 & 2 + \cos x \end{pmatrix},$$
$$u(x, y) := \begin{cases} x, & x < 0, \\ 2x, & x \ge 0, \end{cases}$$
and
$$A(x, y) := \begin{cases} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, & x < 0, \\[6pt] \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}, & x \ge 0, \end{cases}$$
which results in constant 𝐴∇𝑢 and 𝑓(𝑥, 𝑦) = 0. This example is covered by the
theory in Sect. 10.2.3. In electrostatics, this example corresponds to material with
a jump discontinuity in the permittivity, which results in a jump in the derivative
of the solution. We again calculate some solutions, each time multiplying ℎ by
1∕2.
for i in 0:5
    local N = 10 * 2^i
    local error =
        test_elliptic_FV_2D((x, y) -> if x <= 0; x else 2*x end,
                            -1.0, 1.0, -1.0,
                            (x, y) -> if x <= 0; 1 else 1/2 end,
                            (x, y) -> if x <= 0; 1 else 1/2 end,
                            (h, x, y) -> 0.0,
                            N)
    Printf.@printf("N = %3d: error = %.5e\n", N, error)
end
$$u(x, y) := \begin{cases} -x, & x < 0, \\ x, & x \ge 0, \end{cases}$$
and
$$A(x, y) := \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$
which results in 𝑓(𝑥, 𝑦) = −2𝛿(𝑥), where 𝛿 is the Dirac delta distribution. Its
defining characteristic is the equality
$$\int_{\mathbb{R}} \delta(x) \, \mathrm{d}x = 1, \quad (10.29)$$
As in the second example, the surprising accuracy is due to the linearity of the
solution and the perfect alignment of the control volumes with the line 𝑥 = 0,
where the derivative of the solution jumps. This numerical result validates our
implementation of the Dirac delta distribution.
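How a Dirac delta can be put onto the grid deserves a remark. One way (a sketch of ours) is to spread the mass of −2𝛿(𝑥) over the control volumes that the line 𝑥 = 0 intersects, which requires knowing the grid spacing ℎ:

```julia
# f(x, y) = -2δ(x) on a grid with spacing h: a control volume centered on
# the line x = 0 has width h, so the value -2/h there reproduces the
# integral of f over that control volume
delta_f(h, x, y) = abs(x) < h / 2 ? -2 / h : 0.0

# the discrete integral along a horizontal line recovers the total mass -2:
h = 0.01
total = sum(delta_f(h, i * h, 0.0) * h for i in -100:100)
println(total)  # ≈ -2
```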
Finally, we redefine the domain to 𝑈 ∶= (−1, 2)2 .
for i in 0:5
    local N = 10 * 2^i
    local error =
        test_elliptic_FV_2D((x, y) -> if x <= 0; -x else x end,
10.7 Finite Elements 293
In each step, the error is again multiplied by approximately 1∕2, which is again
consistent with the general case of first-order convergence as predicted in (10.27).
Finally, we note that the implementation of the boundary conditions may
require thought and effort. While Dirichlet boundary conditions are relatively
straightforward to implement, Neumann boundary conditions generally require
an approximation of the directional derivative. If this approximation is not good
enough, it may reduce the convergence order, although the discretization in the
interior would support a higher convergence order.
We have seen in the previous section that the use of integration in the finite-
volume method was advantageous compared to finite differences, as it made it
possible to relax the assumptions on the smoothness of the solution. Finite el-
ements take these considerations further. We have already laid the foundation
for finite elements above in Sect. 10.2.3; finite elements are just the numerical
implementation of weak solutions. (If you skipped Sect. 10.2.3, the main points
are explained in the following again for the purposes of finite elements.)
We again use the elliptic boundary-value problem
−∇ ⋅ (𝐴∇𝑢) = 𝑓 in 𝑈 ⊂ ℝ𝑑 , (10.30a)
𝑢=0 on 𝜕𝑈 (10.30b)
Can we infer (10.30) from this equality? The answer is positive and is provided
by the fundamental lemma of variational calculus, Theorem 10.1, which states
that since this equality holds for sufficiently many test functions 𝑣 (namely all
elements of 𝐻01 (𝑈)), the other factors −∇ ⋅ (𝐴∇𝑢) and 𝑓 in the integrands must
agree. The intuitive explanation is that if ∫𝑈 𝑤𝑣 d𝑉 = 0 holds for sufficiently
many functions 𝑣, especially ones with tiny support that allow us to zoom into
smaller and smaller regions, then the function 𝑤 must vanish.
Next, we use integration by parts – called Green’s first identity in this case
(see Problem 10.21) – to see that the problem is equivalent to finding solutions
𝑢 ∈ 𝐻01 (𝑈) such that
holds, since the boundary term vanishes because of 𝐻01 (𝑈) ∋ 𝑣 = 0 on the
boundary 𝜕𝑈. This formulation relaxes the requirements on the smoothness of 𝑢
and explains the choice of function spaces for both 𝑢 and 𝑣. The first derivatives
of both 𝑢 and 𝑣 are required to exist only in the weak sense (since they appear in
the integrand), whereas a classical solution 𝑢 of (10.30) must be twice differen-
tiable in the whole domain. Due to the symmetry of ∇𝑢 and ∇𝑣 appearing in the
integrand, we can choose 𝐻 1 (𝑈) as the function space for both the solution 𝑢
and the test function 𝑣. Moreover, the zero Dirichlet boundary conditions are
incorporated by choosing 𝐻01 (𝑈).
Such solutions 𝑢 that satisfy (10.31) are called weak solutions. In its most
general form, weak formulations are to find a function 𝑢 ∈ 𝑊, the weak solution,
such that
𝑎(𝑢, 𝑣) = 𝐹(𝑣) ∀𝑣 ∈ 𝑉 (10.32)
holds. Here the Hilbert spaces 𝑉 and 𝑊 are the space of test functions and the
solution space, respectively.
Up to now we have only explained the concept of a weak solution, but finite
elements are just around the corner. Since we cannot perform numerical calcu-
lations using elements of the infinite-dimensional function spaces 𝑉 and 𝑊, we
restrict both the test space 𝑉 and the solution space 𝑊 to much simpler, finite-
$$\{V_h \subset V \mid h \in \mathbb{R}^+\}$$
$$h := (b - a)/N$$
$$\phi_j^h(x_i) := \begin{cases} 1, & i = j, \\ 0, & i \neq j. \end{cases}$$
Make sure to visualize the triangular shape of these 𝑁 + 1 functions.
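For concreteness, here is a small sketch (ours, not from the text) of such a hat function on an equidistant grid, which can be used for plotting:

```julia
# hat function φ_j^h on the grid x_i = a + i*h with h = (b - a)/N
function hat(j::Int, x::Float64; a::Float64=0.0, b::Float64=1.0, N::Int=10)
    h = (b - a) / N
    max(0.0, 1.0 - abs(x - (a + j * h)) / h)
end

println(hat(3, 0.3))   # ≈ 1: φ_j^h(x_j) = 1
println(hat(3, 0.35))  # ≈ 1/2: linear between neighboring grid points
println(hat(3, 0.5))   # 0: vanishes outside (x_{j-1}, x_{j+1})
```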
Next, we define the function spaces
$$P_h(U) := \Bigl\{ \sum_{j=0}^{N} \alpha_j \phi_j^h \Bigm| \alpha_j \in \mathbb{R} \ \forall j \in \{0, \dots, N\} \Bigr\},$$
$$P_h^0(U) := \Bigl\{ \sum_{j=1}^{N-1} \alpha_j \phi_j^h \Bigm| \alpha_j \in \mathbb{R} \ \forall j \in \{1, \dots, N-1\} \Bigr\}$$
for any ℎ ∈ ℝ+ , which are both sets of all linear combinations of certain hat
functions. The difference between the two function spaces is that all functions
in 𝑃ℎ0 (𝑈) vanish on both endpoints 𝑎 and 𝑏 of the interval domain 𝑈 = (𝑎, 𝑏).
The function space 𝑃ℎ (𝑈) satisfies the required approximation property
(10.33) for 𝐻 1 (𝑈), and 𝑃ℎ0 (𝑈) satisfies it for 𝐻01 (𝑈). Therefore, we can choose
$$V_h := P_h^0(U)$$
$$W_h := P_h^0(U) = V_h$$
where the 𝑢𝑗 ∈ ℝ are unknown coefficients for 𝑗 ∈ {0, … , 𝑁}. In our example,
the basis functions are hat functions. Satisfying the weak formulation (10.34)
(for all test functions 𝑣ℎ ∈ 𝑉ℎ ) is equivalent to satisfying the equation for all 𝑀
basis functions 𝜓𝑖 , i.e., we seek solutions 𝑢ℎ ∈ 𝑊ℎ such that
$$a(u_h, \psi_i) = a\Bigl(\sum_{j=0}^{N} u_j \psi_j, \psi_i\Bigr) = \sum_{j=0}^{N} u_j \, a(\psi_j, \psi_i) = F(\psi_i) \qquad \forall i \in \{0, \dots, N\}.$$
𝑚𝑖𝑗 ∶= 𝑎(𝜓𝑗 , 𝜓𝑖 )
and a vector 𝐳 by setting 𝑧𝑖 ∶= 𝐹(𝜓𝑖 ), this condition becomes the linear system
of equations
𝑀𝐮 = 𝐳 (10.36)
for the unknown vector 𝐮 = (𝑢0 , … , 𝑢𝑁 ). This linear system of equations is the
finite-element discretization we will implement below.
Before we do so, we however pose the question whether the linear system
(10.36) has a unique solution. If we can answer this question positively, then
our confidence in the whole finite-element procedure will be much increased.
It is clear that we must ensure that our original equation (10.30) has a unique
solution 𝑢 to begin with. To do so, we assume that the assumptions of the Lax–
Milgram theorem, Theorem 10.5, are satisfied.
The first argument that a unique solution 𝐮 of (10.36) exists is an algebraic
one.
Theorem 10.13 Suppose that the assumptions of Theorem 10.5 hold for the
boundary-value problem (10.30). Then the matrix 𝑀 in (10.36) is positive definite.
(see Definition 8.9). For any vector $\mathbf{v} \in \mathbb{R}^{N+1}$, we define the function
$$v := \sum_{i=0}^{N} v_i \psi_i \in V_h$$
and calculate
$$\mathbf{v}^\top M \mathbf{v} = \sum_{i=0}^{N} \sum_{j=0}^{N} v_i v_j \, a(\psi_j, \psi_i) = a(v, v) \ge \beta \|v\|_V^2 > 0$$
unless 𝐯 = 𝟎. The first inequality holds since the bilinear form 𝑎 is coercive
(with constant 𝛽) by assumption. □
Since every positive definite matrix is invertible, the linear system (10.36) has
a unique solution.
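For instance, for the one-dimensional problem on (0, 1) with 𝐴 ≡ 1 and hat-function bases, the interior part of 𝑀 is tridiagonal with 2∕ℎ on the diagonal and −1∕ℎ next to it (cf. (10.38) below), and its positive definiteness can be confirmed numerically (a small check of ours):

```julia
import LinearAlgebra

N = 8
h = 1 / N
# stiffness matrix for -u'' on (0, 1) with a_i ≡ 1: entries 2/h and -1/h
M = Matrix(LinearAlgebra.SymTridiagonal(fill(2 / h, N - 1), fill(-1 / h, N - 2)))
println(LinearAlgebra.isposdef(M))  # true
```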
The second argument that a unique solution 𝐮 of (10.36) exists is the follow-
ing. For the elliptic equation (10.30), the solution space and the space of the test
functions are identical.
Theorem 10.14 (Céa’s lemma) Define 𝑉 ∶= 𝐻01 (𝑈) and suppose that the as-
sumptions of Theorem 10.5 hold for the boundary-value problem (10.30): the bilin-
ear form 𝑎 is bounded with constant 𝛼 and coercive with constant 𝛽. Then there
exists a unique solution 𝑢 ∈ 𝑉 of (10.30) and a unique solution 𝑢ℎ ∈ 𝑉ℎ of (10.34).
Furthermore, the inequalities
$$\|u_h\|_V \le \frac{1}{\beta} \|F\|_{V'}$$
(stability) and
$$\|u_h - u\|_V \le \frac{\alpha}{\beta} \inf_{v_h \in V_h} \|v_h - u\|_V \quad (10.37)$$
(convergence; Céa’s lemma) hold.
The first inequality means that the solutions 𝑢ℎ are stable for varying ℎ, as
the norm of every solution 𝑢ℎ is bounded by a constant that does not depend
on ℎ. The second inequality is called Céa’s lemma and will be important for the
convergence of the approximations 𝑢ℎ to the exact solution 𝑢 in Theorem 10.15.
Proof Since 𝑉ℎ ⊂ 𝑉 is a subspace of 𝑉, Theorem 10.5 implies that there exists
a unique solution 𝑢ℎ ∈ 𝑉ℎ of (10.34).
Since the bilinear form 𝑎 stemming from (10.30) is coercive with constant 𝛽,
we find
𝛽‖𝑢ℎ ‖2𝑉 ≤ 𝑎(𝑢ℎ , 𝑢ℎ ) = 𝐹(𝑢ℎ ) ≤ ‖𝐹‖𝑉 ′ ‖𝑢ℎ ‖𝑉 ,
which immediately implies the first inequality.
To show the second inequality, we first subtract (10.32) (for all 𝑣ℎ ∈ 𝑉ℎ ⊂ 𝑉)
from (10.34) to find
𝑎(𝑢ℎ − 𝑢, 𝑣ℎ ) = 0 ∀𝑣ℎ ∈ 𝑉ℎ .
Under the reasonable assumption (10.33) that the subspaces 𝑉ℎ become bet-
ter and better approximations of the function space 𝑉, Theorem 10.14 imme-
diately shows that the approximations 𝑢ℎ ∈ 𝑉ℎ converge to the exact solu-
tion 𝑢 ∈ 𝑉 as ℎ → 0. This is expressed by the following theorem.
holds.
Proof Inequality (10.37) and equation (10.33) yield
$$\lim_{h \to 0} \|u_h - u\|_V \le \frac{\alpha}{\beta} \lim_{h \to 0} \inf_{v_h \in V_h} \|v_h - u\|_V = 0,$$
showing convergence. □
After these theoretic considerations, we now implement the finite-element
discretization (10.36) in a one-dimensional example. We use the domain 𝑈 ∶=
(𝑎, 𝑏), the equidistant grid points defined above, and the linear hat functions 𝜙𝑖ℎ
defined above on these finite elements. We assume that the coefficient matrix 𝐴
in (10.30) is piecewise constant on the 𝑁 finite elements or intervals with the
value 𝑎𝑖 ∈ ℝ, 𝑖 ∈ {1, … , 𝑁}, on the interval (𝑎 + (𝑖 − 1)ℎ, 𝑎 + 𝑖ℎ).
Sketching the hat functions and calculating the integral in the bilinear form 𝑎
yields
$$m_{ij} = a(\phi_j^h, \phi_i^h) = \begin{cases} \dfrac{a_i + a_{i+1}}{h}, & i = j, \\[6pt] -\dfrac{a_{\max(i,j)}}{h}, & |i - j| = 1, \\[6pt] 0, & |i - j| \ge 2 \end{cases} \quad (10.38)$$
whenever 0 < 𝑖 < 𝑁 and 0 < 𝑗 < 𝑁. Obviously, the system matrix 𝑀 is sparse,
and thus it is important to use linear solvers that take advantage of its sparsity
in realistic applications.
The vector 𝐳 on the right side of (10.36) has the elements
$$z_i = F(\phi_i) = \int_a^b f \phi_i \, \mathrm{d}V.$$
Again assuming that the right-hand side 𝑓 is constant on the finite elements, we
set
$$z_i := \frac{h}{2}(f_i + f_{i+1}).$$
In general, good approximations of this integral are important.
Dirichlet boundary conditions can be implemented by noting that the form
(10.35) of the approximation 𝑢ℎ as a linear combination of hat functions yields
𝑢ℎ (𝑎) = 𝑢0 ,
𝑢ℎ (𝑏) = 𝑢𝑁 .
by the formulas
$$\mathbf{n} \cdot \nabla u_h(a) = \frac{1}{h}(u_0 - u_1), \qquad \mathbf{n} \cdot \nabla u_h(b) = \frac{1}{h}(u_N - u_{N-1}),$$
which follow immediately by differentiating (10.35).
The function elliptic_FE_1D implements the discretization (10.36). The pur-
pose of the function ind is to translate the index 𝑖 in the discussion above to a
linear index of Julia vectors and arrays, which always start at index one. (An-
other option is to use the package OffsetArrays.)
function elliptic_FE_1D(a::Float64, b::Float64, A::Function,
                        f::Function, g::Function,
                        N::Int)::Vector{Float64}
    local h = (b-a) / N
    local z = Vector{Float64}(undef, N+1)
    local M = SparseArrays.spzeros(N+1, N+1)
    function ind(i::Int)::Int
        1 + i
    end
    for i in 0:N
        if i == 0 || i == N
            M[ind(i), ind(i)] = 1.0
            z[ind(i)] = g(a + i*h)
        else
            M[ind(i), ind(i-1)] = - A(a + (i-1/2)*h)
            M[ind(i), ind(i)]   = A(a + (i-1/2)*h) + A(a + (i+1/2)*h)
            M[ind(i), ind(i+1)] = - A(a + (i+1/2)*h)
            ## right-hand side: z_i = (h/2)(f_i + f_{i+1}) scaled by h,
            ## with f evaluated at the element midpoints
            z[ind(i)] = h^2/2 * (f(a + (i-1/2)*h) + f(a + (i+1/2)*h))
        end
    end
    M \ z
end
function test_elliptic_FE_1D(u_exact::Function,
                             a::Float64, b::Float64,
                             A::Function, f::Function,
                             N::Int)::Float64
    local h = (b-a) / N
    local u_ex = Float64[u_exact(a + i*h) for i in 0:N]
    local u_num = elliptic_FE_1D(a, b, A, f, u_exact, N)
    ## maximum norm of the error
    maximum(abs.(u_num - u_ex))
end
Regarding the basic operations, the package ApproxFun can be used for approx-
imating functions. Regarding finite differences, the package DiffEqOperators
constructs finite-difference operators to discretize pdes, reducing the
equations to systems of odes which can be solved using the package
DifferentialEquations. Regarding finite volumes, the package VoronoiFVM
can solve coupled nonlinear pdes. Regarding finite elements, the JuliaFEM
project contains software and documentation for (nonlinear) equations and
distributed calculations. The package Gridap provides software for various
problem types, including linear, nonlinear, and multi-physics problems, and is
written in Julia. The package FEniCS is a wrapper for the FEniCS library for
finite-element discretizations.
A standard textbook on the theory of pdes is [1]. A few books [4, 5, 6, 8, 9] are
mentioned here among the multitude of textbooks on their numerical methods.
Problems
10.5 Choose a pde and its derivation and write down the units of each variable
and constant in the equation and its derivation. Also check that all equations in
the derivation and the pde itself have consistent units.
$$xy \le \alpha x^2 + \beta y^2 \qquad \forall x, y \in \mathbb{R}.$$
$$\Bigl(\prod_{i=1}^{n} x_i\Bigr)^{1/n} \le \frac{1}{n} \sum_{i=1}^{n} x_i.$$
10.12 Change the function elliptic_FD_1D in Sect. 10.5.1 to assemble the sys-
tem matrix using sparse (see Sect. 8.2). Is the new version faster? By how much?
Why?
10.13 Show that (10.16) follows from (10.21) by expanding the central-difference
operators.
10.15 Show that (10.22) follows from (10.24) by expanding the central-difference
operators.
10.19 Implement zero and general Neumann boundary conditions for the finite-
volume discretization in Sect. 10.6 on one edge of a square domain.
$$\iiint_U \nabla \cdot \mathbf{F} \, \mathrm{d}V = \oiint_{\partial U} \mathbf{n} \cdot \mathbf{F} \, \mathrm{d}S$$
References
1. Evans, L.: Partial Differential Equations, 1st edn. American Mathematical Society (1998)
2. Gilbarg, D., Trudinger, N.: Elliptic Partial Differential Equations of Second Order. Springer-
Verlag (2001)
3. Lax, P., Milgram, A.: Parabolic equations. Ann. Math. Stud. 33, 167–190 (1954)
4. LeVeque, R.: Numerical Methods for Conservation Laws, 2nd edn. Birkhäuser, Basel (1992)
5. LeVeque, R.: Finite Volume Methods for Hyperbolic Problems. Cambridge University Press
(2002)
6. LeVeque, R.: Finite Difference Methods for Ordinary and Partial Differential Equations:
Steady-State and Time-dependent Problems. Society for Industrial and Applied Mathemat-
ics (SIAM) (2007)
7. Renardy, M., Rogers, R.: An Introduction to Partial Differential Equations, 2nd edn. Springer-
Verlag, New York, NY (2004)
8. Strang, G., Fix, G.: An Analysis of the Finite Element Method, 2nd edn. Wellesley-Cambridge
Press (2008)
9. Toro, E.: Riemann Solvers and Numerical Methods for Fluid Dynamics, 3rd edn. Springer
(2009)
Part III
Algorithms for Optimization
Chapter 11
Global Optimization
But … TANSTAAFL.
“There ain’t no such thing as a free lunch,” in Bombay or in Luna.
—Robert A. Heinlein, The Moon is a Harsh Mistress (1966)
11.1 Introduction
Global optimization is a large field with many applications, and many deter-
ministic and stochastic optimization methods have been developed. Determinis-
tic methods include branch-and-bound methods, cutting-plane methods, inner-
and-outer approximation, and interval methods. Stochastic methods, on the
other hand, are methods whose results depend on random numbers drawn
while running the algorithm; the function to be optimized is still deterministic.
Stochastic methods include direct Monte Carlo sampling, stochastic tunneling,
and parallel tempering.
Heuristic methods are strategies for searching the domain in a – hopefully –
intelligent manner. They include differential evolution, evolutionary computa-
tion (including, e.g., evolution strategies, genetic algorithms, and genetic pro-
gramming), graduated optimization, simulated annealing, swarm based opti-
mization (including, e.g., particle-swarm optimization and ant-colony optimiza-
tion), and taboo search.
Other methods are Bayesian optimization and memetic algorithms. In the
next chapter, Chap. 12, local optimization methods are discussed. They are usu-
ally based on the gradient of the function to be optimized or on its second deriva-
tives. Global and local optimization methods can be combined by performing
global optimization before or while improving promising candidate points by lo-
cal optimization. This synergy between evolutionary and local optimization is
called memetic algorithms.
Two branches of optimization are continuous and discrete optimization. In
discrete optimization, some or all of the variables of the objective function are
discrete, i.e., they assume only values from a finite set of values. Two important
fields of discrete optimization are combinatorial optimization and integer pro-
gramming.
Furthermore, optimization problems can be categorized with respect to the
presence of constraints on the independent variables of the objective function.
The constraints can be hard constraints, which must be satisfied by the inde-
pendent variables, or soft constraints, where some values are penalized if the
constraints are not satisfied, based on the extent to which they are violated.
The choice of optimization method is also influenced by the computational
cost of evaluating the objective function. If the objective function is given as an
expression (see, e.g., the benchmark problems in Sect. 11.8), it is usually possible
to perform many function evaluations, thus rendering the optimization problem
more tractable. On the other hand, if the objective function is computationally
expensive such as a function of the solution of a partial differential equation,
the choice of optimization method is more limited and also more important. A
closely related question is whether first- or second-order derivatives of the objec-
tive function are available and what their computational cost is. The case when
no derivatives are available is called derivative-free optimization.
The long, but not exhaustive, lists of optimization methods above pose the
question of which one to use when presented with an optimization problem.
Unfortunately, there is no general answer to this question, as there is no best
optimization algorithm. It will always be the case that for any given class of optimization
problems, there is a best class of algorithms, and – vice versa – for any given class
of algorithms, there is a class of optimization problems for which the algorithms
work best. This notion is formalized in no-free-lunch theorems. Therefore, we
start by examining in the next section the question of what can generally be said
about the relationship between classes of optimization problems and classes of
optimization algorithms.
11.2 No Free Lunch 309
$$D_m := (X \times Y)^m$$
i.e., the conditional probability $P(d_m^y \mid f, m, a)$ summed over all objective func-
tions 𝑓, is independent of the algorithm 𝑎.
Theorem 11.1 (no free lunch) For any pair of algorithms 𝑎1 and 𝑎2 , the equation
$$\sum_{f \in F} P(d_m^y \mid f, m, a_1) = \sum_{f \in F} P(d_m^y \mid f, m, a_2)$$
holds.
A proof can be found in [17, Appendix A]. This theorem means that the sum
of such probabilities over all possible optimization problems 𝑓 is identical for all
optimization algorithms. In other words, the average of $P(d_m^y \mid f, m, a_1)$ over all
problems 𝑓 is independent of the algorithm 𝑎1 . This implies that any performance
gain that a certain algorithm provides on one class of problems must be offset by
a performance loss on the remaining problems.
These deliberations can be extended to time-dependent objective functions.
The initial cost function is called 𝑓1 and is used to sample the first 𝑋 value. Be-
fore the next iteration 𝑖 of the optimization algorithm, the cost function is trans-
formed to a new cost function by 𝑇 ∶ 𝐹 × ℕ → 𝐹, i.e., 𝑓𝑖+1 ∶= 𝑇(𝑓𝑖 , 𝑖). It is
assumed that all the transformations 𝑇(., 𝑖) are bijections on 𝐹. This assumption
is important, because otherwise a bias in a region of cost functions could be in-
troduced and exploited by some optimization algorithms.
Analogously to Theorem 11.1, the following theorem shows that the average
performance of any two algorithms is the same also in the case of time-dependent
objective functions. But now we average over all possible time dependencies of
cost functions, meaning the average is calculated over all transformations 𝑇
rather than over all objective functions 𝑓. These averages are given
by
$$\sum_{T} P(d_m^y \mid f_1, T, m, a),$$
where 𝑓1 is the initial objective function. The samples are redefined to drop their
first elements such that the transformations 𝑇 can take full effect, although the
initial object function 𝑓1 is fixed. Then the following result can be shown.
Theorem 11.2 (no free lunch for time-dependent problems) For all $d_m^y \in D_m^y$,
𝑚 > 1, algorithms 𝑎1 and 𝑎2 , and initial cost functions 𝑓1 , the equation
$$\sum_{T} P(d_m^y \mid f_1, T, m, a_1) = \sum_{T} P(d_m^y \mid f_1, T, m, a_2)$$
holds.
In the middle of the last century, Monte Carlo algorithms for calculating prop-
erties of materials consisting of interacting individual particles were developed
[11]. These Monte Carlo integrations over the configuration spaces of the parti-
cles turned out to be highly useful in statistical mechanics.
There are 𝑁 particles so that the phase space is 6𝑁-dimensional. The factor
$$\mathrm{e}^{-E/kT}$$
$$\bar{F} = \frac{1}{M} \sum_{i=1}^{M} F_i.$$
It can be shown that this algorithm does indeed choose configurations with
probabilities proportional to $\mathrm{e}^{-E/kT}$ [11] and therefore indeed calculates the
equilibrium value of the quantity of interest.
Another question concerns the convergence speed of this algorithm. Here we
can already observe a major effect that is common to many Monte Carlo pro-
cedures that construct sequences of states. The maximum displacement 𝛼 is of
great importance and should be chosen with care or – even better – should be ad-
justed automatically. If it is too small, the configuration changes only little and
sampling the whole space requires many iterations. On the other hand, if it is
too large, most moves are forbidden. In both cases, it takes longer to arrive at
the equilibrium than with a suitable maximum displacement.
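A single Metropolis step for one coordinate can be sketched as follows (a minimal version of ours; 𝐸 is the energy function, 𝛼 the maximum displacement, and kT the product of the Boltzmann constant and the temperature):

```julia
import Random

# one Metropolis step: propose a uniformly random displacement in [-α, α]
# and accept it with probability min(1, exp(-ΔE/kT))
function metropolis_step(x::Float64, E::Function, α::Float64, kT::Float64)
    x_proposed = x + α * (2 * rand() - 1)
    ΔE = E(x_proposed) - E(x)
    (ΔE <= 0 || rand() < exp(-ΔE / kT)) ? x_proposed : x
end

# sample a harmonic potential; the sample variance approaches kT = 1
Random.seed!(1)
E(x) = x^2 / 2
x = 0.0
xs = Float64[]
for _ in 1:20_000
    global x = metropolis_step(x, E, 1.0, 1.0)
    push!(xs, x)
end
println(sum(abs2, xs) / length(xs))  # ≈ 1
```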
The reason why we have discussed the classical Metropolis algorithm in de-
tail is not only that it is foundational in the Monte Carlo method, but also that it
motivates the optimization algorithm that is the subject of this section.
Simulated annealing was introduced in [9] and copies the way of sampling the
whole space in the Metropolis algorithm and applies it to global optimization.
Simulated annealing can be applied to both continuous and discrete optimiza-
tion and search problems. It only requires two simple operations, namely choos-
ing a random starting point and a unary operation that maps points to neigh-
boring points.
In order to formulate the simulated-annealing algorithm, the sampling
method of the Metropolis algorithm is applied to a single point or particle. The
energy 𝐸 of a configuration in the Metropolis algorithm now corresponds to the
objective function in the minimization problem, and therefore we denote it by 𝑓
in the algorithm. This analogy yields the simulated-annealing algorithm.
$$p := \begin{cases} \mathrm{e}^{-\Delta f / kT}, & \Delta f > 0, \\ 1, & \Delta f \le 0. \end{cases}$$
$$x_{t+1} := \begin{cases} \tilde{x}_{t+1}, & \rho \le p, \\ x_t, & \rho > p. \end{cases}$$
Therefore the temperature 𝑇 provides a way to adjust the behavior of the algo-
rithm. It is customary to start with a higher temperature and reduce it during the
iterations. In the beginning, while the temperature is high, simulated annealing
resembles a global random search. As the temperature falls, the probability of
accepting an uphill step decreases. Then points become more and more unlikely
to escape regions with local minima and hopefully converge to a global mini-
mum as the temperature goes to zero.
global search with local refinement.
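These ingredients combine into the following minimal sketch (ours, not from the text; the neighbor operation, the cooling schedule 𝑇𝑡 ∶= 𝑇1∕𝑡, and all parameter values are illustrative assumptions rather than recommendations):

```julia
import Random

# minimal simulated annealing for f: ℝ → ℝ
function simulated_annealing(f::Function, x1::Float64;
                             iters::Int=20_000, T1::Float64=1.0, α::Float64=0.1)
    x, fx = x1, f(x1)
    x_best, f_best = x, fx
    for t in 1:iters
        T = T1 / t                        # cooling strategy (an assumption)
        x_new = x + α * (2 * rand() - 1)  # random neighboring point
        Δf = f(x_new) - fx
        if Δf <= 0 || rand() < exp(-Δf / T)
            x, fx = x_new, fx + Δf
        end
        if fx < f_best
            x_best, f_best = x, fx
        end
    end
    x_best
end

Random.seed!(1)
println(simulated_annealing(x -> (x - 2.0)^2, 10.0))  # ≈ 2
```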
This consideration shows that reasonable cooling strategies have the follow-
ing three properties. From now on, we denote the temperature in iteration 𝑡 by
𝑇𝑡 .
1. The initial temperature 𝑇1 is greater than zero.
2. The temperature decreases, i.e., 𝑇𝑡+1 ≤ 𝑇𝑡 .
11.4 Particle-Swarm Optimization 315
where 𝑥𝑡∗ is the best point found up to iteration 𝑡 and 𝛿 ∈ (0, 1). You have
guessed it, there is no general theory to find the parameters 𝑠, 𝛾, and 𝛿. In
this cooling strategy, our rule that the cooling strategy should be decreasing
may be violated.
It can be shown that simulated annealing algorithms with appropriate cool-
ing strategies converge to the global optimum as the number of iterations goes
to infinity [10, 4, 13]. The domain can be searched faster if cooling is sped up,
but then convergence is not guaranteed anymore.
the speed of a particle is high, its search is explorative, and if it is low, the search
is exploitive. We briefly mention that it is possible to use genotype-phenotype
mapping, which will be discussed in the next section.
The positions and velocities of all particles are initialized randomly. In each
iteration, the velocity is updated first and then the position. Each particle 𝑘 also
keeps track of its history by remembering the best position 𝐛𝑘,𝑡 it has seen up to iteration 𝑡.
The social component of the swarm is realized by communication between
the particles. To that end, each particle updates in each iteration 𝑡 its set 𝑁𝑘,𝑡 of
neighbors. The set 𝑁𝑘,𝑡 of neighbors is usually defined as the set of all particles
within a certain distance from 𝐱𝑘 measured by a certain metric such as the Eu-
clidean distance in the simplest case. Knowing its neighbors, each particle then
communicates its best position 𝐛𝑘,𝑡 found so far to its neighbors in each iteration.
Therefore each particle knows the best point 𝐧𝑘,𝑡 found in its neighborhood so
far. Furthermore, the swarm records the best position 𝐛𝑡 found by all particles
in the swarm up to iteration 𝑡.
These best points that the particles communicate among themselves consti-
tute the social component of particle-swarm optimization and enter the calcula-
tions in the updates of the velocities of the particles. When updating the velocity
of a particle, we can use the best point 𝐧𝑘,𝑡 found in its neighborhood so far or
the best point 𝐛𝑡 found by all particles so far.
These two possibilities lead to local and global updates, respectively. In a local update, the velocity 𝐯𝑘,𝑡 of particle 𝑘 is updated by
\[
\mathbf{v}_{k,t+1} := \mathbf{v}_{k,t} + (\mathbf{b}_{k,t} - \mathbf{x}_{k,t})\,U(0, \mathbf{c}) + (\underbrace{\mathbf{n}_{k,t}}_{\text{best neighbor}} - \mathbf{x}_{k,t})\,U(0, \mathbf{d}), \tag{11.2}
\]
and in a global update it is updated by
\[
\mathbf{v}_{k,t+1} := \mathbf{v}_{k,t} + (\mathbf{b}_{k,t} - \mathbf{x}_{k,t})\,U(0, \mathbf{c}) + (\mathbf{b}_t - \mathbf{x}_{k,t})\,U(0, \mathbf{e}). \tag{11.3}
\]
For each particle, either a local or a global velocity update is chosen randomly. Here 𝑈(0, 𝐜), 𝑈(0, 𝐝), and 𝑈(0, 𝐞) are random vectors whose entries are uniformly distributed random variables in the intervals [0, 𝑐𝑖 ], [0, 𝑑𝑖 ], and [0, 𝑒𝑖 ], respectively. In other words, the vectors 𝐜, 𝐝, and 𝐞 are parameters of the algorithm.
Having updated the velocities of all particles, their positions 𝐱𝑘,𝑡 are updated
by
𝐱𝑘,𝑡+1 ∶= 𝐱𝑘,𝑡 + 𝐯𝑘,𝑡+1 (11.4)
using the new velocities 𝐯𝑘,𝑡+1 . Usually, the search space 𝑋 is bounded, and then it is necessary to ensure that particles do not move out of 𝑋 due to this update.
The updates mean that two steps are added to the previous velocity. In both
the local and the global updates, the term (𝐛𝑘,𝑡 − 𝐱𝑘,𝑡 )𝑈(0, 𝐜) points the particle
towards its best position so far. In the local update, the term (𝐧𝑘,𝑡 − 𝐱𝑘,𝑡 )𝑈(0, 𝐝)
points the particle towards its best neighbor found so far, and in the global up-
date, the term (𝐛𝑡 − 𝐱𝑘,𝑡 )𝑈(0, 𝐞) ensures that it points towards the best point
found by the whole swarm so far.
The learning-rate vectors 𝐜, 𝐝, and 𝐞 strongly influence convergence speed.
The components of these three vectors determine how vigorously the particles
move towards their best positions, their best neighbors, and the best position
in the whole swarm so far. The components can of course be different for all
directions in the search space.
If the components of 𝐞 are large and the update hence relies much on the
best position 𝐛𝑡 in the whole swarm, the algorithm converges faster, but is less
likely to find a global minimum. If the components of 𝐝 are large and the update hence relies much on the best neighbor 𝐧𝑘,𝑡 , convergence is slower, but a global minimum is more likely to be found.
Having discussed how the particles are updated, we can summarize particle-
swarm optimization as follows.
Algorithm 11.4 (particle-swarm optimization)
1. Choose a swarm size and initialize the particles with random positions and
velocities.
2. Set the iteration counter 𝑖 ∶= 1.
3. Loop over all particles, update their velocities choosing either the local up-
date (11.2) or the global update (11.3) and then update their positions using
(11.4).
4. While the termination criterion has not been met, increase 𝑖 by 1 and go to Step 3.
5. Finally, return the best position found.
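Algorithm 11.4 with only global velocity updates can be sketched as follows in Julia. This is an illustrative sketch, not the book's implementation; the inertia weight `w` is a common stabilizing variation that is not part of the plain update (11.2), and the vectors `c` and `e` play the roles of the learning-rate vectors 𝐜 and 𝐞.

```julia
# Minimal particle-swarm optimization using only global velocity updates.
function pso(f, lo, hi; nparticles=30, iters=200, w=0.7, c=1.5, e=1.5)
    d = length(lo)
    X = [lo .+ rand(d) .* (hi .- lo) for _ in 1:nparticles]   # random positions
    V = [zeros(d) for _ in 1:nparticles]                      # velocities
    B = [copy(x) for x in X]                                  # personal bests b_k
    fB = [f(x) for x in X]
    gbest = copy(B[argmin(fB)])                               # swarm best b_t
    for _ in 1:iters
        for k in 1:nparticles
            V[k] .= w .* V[k] .+
                    (B[k] .- X[k]) .* (c .* rand(d)) .+       # towards own best
                    (gbest .- X[k]) .* (e .* rand(d))         # towards swarm best
            X[k] .+= V[k]
            X[k] .= clamp.(X[k], lo, hi)   # keep particles inside the search space
            fx = f(X[k])
            if fx < fB[k]
                B[k] .= X[k]
                fB[k] = fx
            end
        end
        gbest .= B[argmin(fB)]
    end
    return gbest, minimum(fB)
end

fsph(x) = sum(abs2, x)               # sphere function as a simple test problem
xbest, fbest = pso(fsph, [-5.0, -5.0], [5.0, 5.0])
```

Clamping the positions is the simplest way to keep particles inside a bounded search space; returning the history of swarm bests, as suggested above, is a straightforward extension.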
In practice, it is useful to return the history of the best points 𝐛𝑡 so that the
progress of the algorithm can be plotted. It is also useful to return the whole last swarm so that the algorithm can be restarted (without random initialization) whenever the results are not satisfactory and there is still progress. Since
the global optimum or optima are unknown unless a test problem is considered,
practical termination criteria judge the progress in the past iterations. When it
has become very slow, the algorithm is stopped; but of course, long periods of
stagnation may be deceptive.
The search space on which the reproduction operations act is the genome, and
the elements of the genome are the genotypes. A genotype is the collection of
the genes of an individual. In biology, the concept of a gene has changed as new
discoveries, for example about gene regulation, have been made. In genetic al-
gorithms, we are free to choose the genes and genotypes. Beneficial choices of
course help in solving the optimization problem.
Genotypes are often vectors of genes in contrast to other, nonlinear data struc-
tures. In this leading case of vectors, the genotypes are usually referred to as
chromosomes. Chromosomes can either be vectors of fixed length or of variable
length.
In the case of chromosomes with a fixed-length vector 𝐚, the locus of each
gene is always the same element 𝑎𝑖 of the vector, implying that the competing
alleles of a gene are always positioned at the same element number 𝑖 of the vector.
The genes may have different data types so that the elements of the vector may
have different types.
In the case of chromosomes with a variable-length vector, the positions of the
genes may have been shifted after reproduction operations are applied. In this
case, the genes often have the same data type.
Leading examples of chromosomes are vectors that consist of bits, of integers,
or of real numbers.
The phenotype of an individual in the context of an optimization problem is an element 𝑥 ∈ 𝑋 in the domain 𝑋 of the objective function
\[
f \colon X \to \mathbb{R}.
\]
The genotype-phenotype mapping 𝑔 ∶ 𝐺 → 𝑋 maps each genotype to its phenotype, and instead of 𝑓, the composition
\[
f \circ g \colon G \to \mathbb{R}
\]
is optimized.
Therefore the significance of the genotype-phenotype mapping is that its
choice should facilitate the reproduction operations and optimization by sup-
porting advantageous genotypes. A good choice of genotype is, of course, highly
problem dependent.
To clarify these concepts, we mention the canonical choice of genotype for
objective functions 𝑓 ∶ ℝ𝑑 → ℝ. The canonical genome is 𝐺 ∶= ℝ𝑑 such that
any vector 𝐠 ∈ ℝ𝑑 is a genotype. Each element 𝑔𝑖 of 𝐠 is a gene and has locus 𝑖.
Because of this linear arrangement of the genes, the vector 𝐠 is a fixed-length
chromosome. The canonical choice for the genotype-phenotype mapping 𝑔 is
the identity 𝑔 ∶= id, and therefore 𝑓◦𝑔 = 𝑓 ∶ ℝ𝑑 → ℝ.
11.5.3 Fitness
In the next step of the algorithm, the objective function is evaluated for the phe-
notypes of all individuals. Based on these values, each individual is assigned a
fitness value. Then individuals with higher fitness are more likely to be selected
for reproduction.
It is obvious that the fitness of an individual should reflect how well it solves
the optimization problem. But it should also reflect the variety of the population
and incorporate information on population density and niches. By doing so, the
probability of finding the global optima can be increased significantly. Fitness
assignment is therefore in general a function that acts on the whole population.
The simplest way of assigning fitness is, however, to just use the value 𝑓(𝑥) of the objective function or – in multiobjective optimization – the weighted sum
\[
\sum_i w_i f_i(x),
\]
where the functions 𝑓𝑖 are the objectives and the constants 𝑤𝑖 are weights.
Another option is Pareto ranking, which also works for multiobjective opti-
mization. To explain it, we first consider the following fitness assignment. If an individual prevails over 𝑚 other individuals, it is assigned the fitness value 1∕(1 + 𝑚).
Its disadvantage especially in multiobjective optimization is, however, that individuals in a crowded part of the search space have the chance to prevail over many others, while individuals in a less explored part of the space are assigned a much
worse fitness value only because there are fewer neighbors, although they score
best with respect to the objective function.
An alternative is to assign to each individual the fitness value 𝑛, where 𝑛 is
the number of other individuals it is prevailed by. In multiobjective optimization,
this choice recognizes individuals on the Pareto frontier and avoids the disadvan-
tages of the fitness choice 1∕(1 + 𝑚) from the previous paragraph. The Pareto
fitness value or Pareto rank of an individual 𝑖 is found by looping over all other
individuals 𝑗. If individual 𝑖 is prevailed by individual 𝑗, the Pareto rank of 𝑖 is
increased by one.
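The loop over all pairs of individuals can be written directly in Julia. This is an illustrative sketch; the helper `dominates` encodes "prevails over" for minimization problems (no objective is worse and at least one is strictly better) and is an assumption of this sketch, not the book's code.

```julia
# Pareto rank of each individual: the number of other individuals by which
# it is prevailed. Objectives are minimized.
dominates(a, b) = all(a .<= b) && any(a .< b)

function pareto_ranks(objectives::Vector{<:Vector})
    n = length(objectives)
    ranks = zeros(Int, n)
    for i in 1:n, j in 1:n
        if i != j && dominates(objectives[j], objectives[i])
            ranks[i] += 1   # individual i is prevailed by individual j
        end
    end
    return ranks
end

# Example: three individuals with two objectives each.
objs = [[1.0, 2.0], [2.0, 3.0], [0.5, 4.0]]
ranks = pareto_ranks(objs)   # individuals 1 and 3 lie on the Pareto frontier
```

Individuals with rank zero are exactly the non-dominated ones, i.e., the current approximation of the Pareto frontier.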
Still, Pareto ranking does not exploit any information about population den-
sity or variety. In global optimization, crowding of the individuals is undesirable
and variety is beneficial to explore the whole space. Sharing is a method to in-
clude variety information into fitness assignment, and variety preserving rank-
ing is another method worth mentioning here.
11.5.4 Selection
Elitism means that the best 𝑛 individuals, where 𝑛 ≥ 1, get a free pass and
are placed in the mating pool in each generation.
The simplest selection method is truncation selection. The individuals are
ordered by their fitness and the desired number of the best individuals is placed
in the mating pool. When truncation selection is used, care should be taken to
combine it with a fitness assignment that ensures variety in order to prevent
premature convergence.
Another classical selection method is fitness proportionate selection. The
probability of the individual 𝑖 being placed in the mating pool is proportional
to its fitness, i.e., the probability is given by
\[
\frac{v_i}{\sum_j v_j},
\]
where 𝑣𝑗 denotes the fitness value of individual 𝑗.
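Fitness-proportionate selection is commonly implemented as a roulette wheel: a uniform random number is drawn on [0, Σ𝑗 𝑣𝑗) and the cumulative fitness determines the selected individual. The following is a minimal sketch of that idea, assuming nonnegative fitness values.

```julia
# Roulette-wheel selection: individual i is chosen with probability
# v_i / sum_j v_j, where v is the vector of (nonnegative) fitness values.
function roulette_select(v::Vector{Float64})
    r = rand() * sum(v)
    acc = 0.0
    for i in eachindex(v)
        acc += v[i]
        if r <= acc
            return i
        end
    end
    return lastindex(v)   # guard against floating-point round-off
end

# Example: fill a mating pool of 100 individuals from the fitness values v.
v = [1.0, 3.0, 6.0]
pool = [roulette_select(v) for _ in 1:100]
```

With these fitness values, individual 3 is expected to appear about six times as often in the mating pool as individual 1.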
11.5.5 Reproduction
In the final step of the algorithm, the individuals in the mating pool are repro-
duced and their offspring is placed in the next generation of the population. We
present four reproduction operations here.
In single-point crossover, a crossover point is chosen randomly, and the offspring consists of the first part of the first parent and the second part of the second parent.
More generally, in multipoint crossover, several crossover points are chosen ran-
domly and the offspring consists of the subsequences taken alternately from the
two parental chromosomes.
Crossover in variable-length chromosomes is analogous, but the loci where
the chromosomes are split are not necessarily the same anymore, and the off-
spring generally has a different length than the parents.
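For fixed-length chromosomes, both crossover variants described above can be sketched as follows; the function names are illustrative, not the book's.

```julia
using Random

# Single-point crossover: first part of the first parent,
# second part of the second parent.
function one_point_crossover(a::Vector, b::Vector)
    k = rand(1:length(a)-1)          # random crossover point
    return vcat(a[1:k], b[k+1:end])
end

# Multipoint crossover: subsequences are taken alternately from the parents.
function multipoint_crossover(a::Vector, b::Vector, npoints::Int)
    points = sort(randperm(length(a) - 1)[1:npoints])   # distinct crossover points
    child = similar(a)
    parents = (a, b)
    idx, start = 1, 1
    for k in vcat(points, length(a))
        child[start:k] = parents[idx][start:k]
        idx = 3 - idx                # alternate between the two parents
        start = k + 1
    end
    return child
end

a, b = collect(1:8), collect(11:18)
child1 = one_point_crossover(a, b)
child2 = multipoint_crossover(a, b, 3)
```

The variable-length case differs only in that the crossover points in the two parents need not coincide, so the offspring generally changes length.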
This discussion of reproduction completes the discussion of the operations
that occur in a genetic algorithm.
Many variations of the operations in a genetic algorithm have been devised.
If the genotypes or the phenotypes are elements of unusual spaces, it may be a
challenge to adapt these ideas such that the algorithm works well, i.e., that it
converges while still performing a global search.
abled and the performance of the most sophisticated algorithm with single vari-
ations disabled are assessed. Of course, more algorithms with some variations
enabled and some variations disabled can be assessed additionally. Finally, the
performances on the benchmark problems are compared, which usually gives a
good overview over the variations that are indeed beneficial.
1. The Ackley function
\[
f_{\mathrm{AC}}\colon [-32.768, 32.768]^d \to \mathbb{R},\quad \mathbf{x} \mapsto -\alpha \exp\Bigl(-\beta \sqrt{\tfrac{1}{d}\textstyle\sum_{i=1}^d x_i^2}\Bigr) - \exp\Bigl(\tfrac{1}{d}\textstyle\sum_{i=1}^d \cos(\gamma x_i)\Bigr) + \alpha + e
\]
has many local minima in the nearly flat outer region and a large hole at the center. Usually, the values 𝛼 ∶= 20, 𝛽 ∶= 0.2, and 𝛾 ∶= 2𝜋 are used. Its global minimum is 𝑓AC (𝟎) = 0.
2. The Bukin function no. 6
\[
f_{\mathrm{BU6}}\colon [-15, -5] \times [-3, 3] \to \mathbb{R},\quad \mathbf{x} \mapsto 100 \sqrt{|x_2 - 0.01 x_1^2|} + 0.01 |x_1 + 10|
\]
has many local minima, all of which lie in a narrow valley. Its global minimum is 𝑓BU6 (−10, 1) = 0.
3. The drop-wave function
\[
f_{\mathrm{DW}}\colon [-5.12, 5.12]^2 \to \mathbb{R},\quad \mathbf{x} \mapsto -\frac{1 + \cos\bigl(12\sqrt{x_1^2 + x_2^2}\bigr)}{(x_1^2 + x_2^2)/2 + 2}
\]
has a very complicated structure. Its global minimum is 𝑓DW (0, 0) = −1.
4. The Easom function
\[
f_{\mathrm{EA}}\colon [-100, 100]^2 \to \mathbb{R},\quad \mathbf{x} \mapsto -\cos(x_1)\cos(x_2)\exp\bigl(-(x_1-\pi)^2 - (x_2-\pi)^2\bigr)
\]
has several local minima, while the area near the global minimum is small relative to the search space. Its global minimum is 𝑓EA (𝜋, 𝜋) = −1.
5. The Gramacy–Lee function
\[
f_{\mathrm{GL}}\colon [0.5, 2.5] \to \mathbb{R},\quad x \mapsto \frac{\sin(10\pi x)}{2x} + (x-1)^4
\]
is a one-dimensional function that is simple to minimize. Its global minimum is attained near 𝑥 = 0.549.
6. The Griewank function
\[
f_{\mathrm{GR}}\colon [-600, 600]^d \to \mathbb{R},\quad \mathbf{x} \mapsto \sum_{i=1}^d \frac{x_i^2}{4000} - \prod_{i=1}^d \cos\Bigl(\frac{x_i}{\sqrt{i}}\Bigr) + 1
\]
has many regularly distributed, widespread local minima. Its global minimum is 𝑓GR (𝟎) = 0.
7. The Hölder table function
\[
f_{\mathrm{HT}}\colon [-10, 10]^2 \to \mathbb{R},\quad \mathbf{x} \mapsto -\Bigl|\sin(x_1)\cos(x_2)\exp\Bigl(\Bigl|1 - \frac{\sqrt{x_1^2 + x_2^2}}{\pi}\Bigr|\Bigr)\Bigr|
\]
has 𝑑! local minima. Its parameter 𝑚 determines the steepness of the slopes,
where larger 𝑚 makes the search more difficult. Usually, the value 𝑚 ∶= 10
is used.
10. The Rastrigin function
\[
f_{\mathrm{RA}}\colon [-5.12, 5.12]^d \to \mathbb{R},\quad \mathbf{x} \mapsto \alpha d + \sum_{i=1}^d \bigl(x_i^2 - \alpha \cos(2\pi x_i)\bigr), \qquad \alpha := 10,
\]
The six-hump camel function
\[
f_{\mathrm{SHC}}\colon [-3, 3] \times [-2, 2] \to \mathbb{R},\quad \mathbf{x} \mapsto \bigl(4 - 2.1 x_1^2 + x_1^4/3\bigr) x_1^2 + x_1 x_2 + \bigl(-4 + x_2^2\bigr) x_2^2
\]
has six local minima, two of which are global. The two global minima are 𝑓SHC (±0.0898, ∓0.7126) ≈ −1.0316.
15. The sphere function
\[
f_{\mathrm{SPH}}\colon \mathbb{R}^d \to \mathbb{R},\quad \mathbf{x} \mapsto \sum_{i=1}^d x_i^2
\]
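Several of these benchmark functions are one-liners in Julia, and the stated global minima can be checked directly. These definitions are a straightforward transcription for illustration, not the book's code.

```julia
# A few benchmark functions for global optimization.
griewank(x)  = sum(abs2, x) / 4000 - prod(cos(x[i] / sqrt(i)) for i in eachindex(x)) + 1
rastrigin(x; α=10) = α * length(x) + sum(xi^2 - α * cos(2π * xi) for xi in x)
sphere(x)    = sum(abs2, x)
dropwave(x)  = -(1 + cos(12 * sqrt(x[1]^2 + x[2]^2))) / ((x[1]^2 + x[2]^2) / 2 + 2)

griewank(zeros(5))    # 0 at the global minimum 𝟎
dropwave([0.0, 0.0])  # -1 at the global minimum
```

Evaluating a candidate algorithm on such functions with known minima makes it easy to measure how close the returned point comes to the true optimum.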
Many optimization packages are available under the JuliaOpt umbrella. Various capabilities for global optimization are provided, e.g., by the Alpine, BlackBoxOptim, Evolutionary, GeneticAlgorithms, and StochasticSearch packages.
Problems
References
1. Darwin, C.: On the Origin of Species by Means of Natural Selection, or the Preservation of
Favoured Races in the Struggle for Life. John Murray, London, UK (1859)
2. Eberhart, R., Kennedy, J.: A new optimizer using particle swarm theory. In: Proc. 6th Inter-
national Symposium on Micro Machine and Human Science, pp. 39–43. IEEE Press (1995)
3. Feoktistov, V.: Differential Evolution: in Search of Solutions. Springer (2006)
4. Hajek, B.: Cooling schedules for optimal annealing. Mathematics of Operations Research
13(2), 311–329 (1988)
5. Holland, J.: Adaptation in Natural and Artificial Systems. The University of Michigan
Press, Ann Arbor, MI, USA (1975)
6. Horst, R., Pardalos, P., Thoai, N.: Introduction to Global Optimization, 2nd edn. Kluwer
Academic Publishers (2000)
7. Horst, R., Tuy, H.: Global Optimization: Deterministic Approaches. Springer (1996)
8. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proc. IEEE International Con-
ference on Neural Networks, pp. 1942–1948. IEEE Press (1995)
9. Kirkpatrick, S., Gelatt Jr., C., Vecchi, M.: Optimization by simulated annealing. Science
220(4598), 671–680 (1983)
10. van Laarhoven, P., Aarts, E.: Simulated Annealing: Theory and Applications. Mathematics
and its Applications. Kluwer Academic Publishers (1987)
11. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., Teller, E.: Equation of state cal-
culations by fast computing machines. J. Chem. Phys. 21, 1087–1092 (1953)
12. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs, 3rd edn.
Springer (1996)
13. Nolte, A., Schrader, R.: A note on the finite time behaviour of simulated annealing. Math-
ematics of Operations Research 25(3), 476–484 (2000)
14. Pintér, J.: Global Optimization in Action – Continuous and Lipschitz Optimization: Algo-
rithms, Implementations and Applications, reprint edn. Springer (2010)
15. Price, K., Storn, R., Lampinen, J.: Differential Evolution: a Practical Approach to Global
Optimization. Springer (2005)
16. Strongin, R., Sergeyev, Y.: Global Optimization with Non-Convex Constraints: Sequential
and Parallel Algorithms. Kluwer Academic Publishers (2000)
17. Wolpert, D., Macready, W.: No free lunch theorems for optimization. IEEE Transactions
on Evolutionary Computation 1(1) (1997)
18. Zhigljavsky, A.: Theory of Global Random Search. Kluwer Academic Publishers (1991)
Chapter 12
Local Optimization
Abstract Optimization theory and algorithms can take advantage of the smooth-
ness of real-valued functions. The underlying assumption is that a reasonable
starting point sufficiently close to a local extremum is already known, for exam-
ple from performing a global optimization, and that (at least) the gradient of
the objective function is available. After a discussion of the convergence rates
of gradient descent, accelerated gradient descent, and the Newton method, the
bfgs method is presented in detail, as it is one of the most popular quasi-Newton
methods and highly effective in practice.
12.1 Introduction
The general assumption in this chapter is that the real scalar objective function
𝑓 ∶ ℝ𝑑 → ℝ
is smooth enough in the sense that all the derivatives we use in certain contexts
exist. The gradient ∇𝑓 provides us with valuable knowledge about how the function changes locally: since the gradient is the direction in which the function increases the most, it is reasonable to follow the gradient ∇𝑓 when maximizing the function and the negative gradient −∇𝑓 when minimizing it. But this is, a bit surprisingly, only a general rule, as we will see in Sect. 12.5.
Why is the gradient the direction of the largest change? It can be shown that the directional derivative satisfies
\[
\frac{\partial f}{\partial \mathbf{e}}(\mathbf{r}) = \mathbf{e} \cdot \nabla f,
\]
where the vector 𝐞 is a unit vector. The Cauchy–Bunyakovsky–Schwarz inequality, Theorem 8.1, states that
\[
|\mathbf{x} \cdot \mathbf{y}| \le \|\mathbf{x}\|\,\|\mathbf{y}\| \quad \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^d,
\]
and that equality holds if and only if there exists a 𝜆 ∈ ℝ such that 𝐱 = 𝜆𝐲, i.e., if the two vectors 𝐱 and 𝐲 are parallel. Here ‖.‖ denotes the Euclidean norm.
Applying the inequality to the directional derivative yields
\[
\Bigl|\frac{\partial f}{\partial \mathbf{e}}(\mathbf{r})\Bigr| \le \|\nabla f\|,
\]
since ‖𝐞‖ = 1, yielding an upper bound for the absolute value of the directional
derivative. The Cauchy–Bunyakovsky–Schwarz inequality, Theorem 8.1, also
tells us that this upper bound is achieved if and only if 𝐞 and ∇𝑓 are parallel, i.e.,
if and only if the directional derivative is taken in the direction 𝐞 ∶= ∇𝑓∕‖∇𝑓‖ of
the gradient. This is the reason for the importance of the gradient for optimiza-
tion.
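This property is easy to check numerically with finite differences. The following sketch (with an illustrative test function, not one from the book) compares the directional derivative along the normalized gradient with the directional derivative along a coordinate axis.

```julia
# Finite-difference check that the directional derivative ∂f/∂e(r) = e ⋅ ∇f(r)
# is largest when e points in the direction of the gradient.
f(x)  = x[1]^2 + 3 * x[2]^2
∇f(x) = [2 * x[1], 6 * x[2]]

r = [1.0, 0.5]
h = 1e-6
dirderiv(e) = (f(r .+ h .* e) - f(r)) / h        # ≈ e ⋅ ∇f(r) for a unit vector e

g = ∇f(r) / sqrt(sum(abs2, ∇f(r)))               # unit vector along the gradient
along_gradient = dirderiv(g)                     # ≈ ‖∇f(r)‖ = √13
along_axis     = dirderiv([0.0, 1.0])            # strictly smaller
```

Here ∇f(r) = (2, 3), so the directional derivative along the gradient equals ‖∇f(r)‖ = √13 ≈ 3.606, while the derivative along the second coordinate axis is only 3.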
In optimization we seek local and/or global extrema, whose definitions are
the following.
A function may have more than one global extremum; for example, consider
the function 𝑓 ∶ ℝ → ℝ, 𝑥 ↦ (𝑥 2 − 1)2 .
to repeat this point continuously. But it should be remembered that a global ex-
tremum may be located on the boundary of a closed domain, and the derivatives
of the function cannot help us to find it there.
Any point where the gradient of the function vanishes is called a stationary
or critical point. If a stationary point of a real-valued function 𝑓 ∶ ℝ → ℝ is
isolated, it can be classified into four kinds depending on the signs of the first
derivative to the left and to the right of the stationary point:
1. a local minimum is a stationary point 𝑥 where the first derivative 𝑓′ changes from negative to positive (and hence the second derivative 𝑓″(𝑥), if it exists, is nonnegative),
2. a local maximum is a stationary point 𝑥 where the first derivative 𝑓′ changes from positive to negative (and hence the second derivative 𝑓″(𝑥), if it exists, is nonpositive),
3. an increasing point of inflection is a stationary point 𝑥 where the first deriva-
tive 𝑓 ′ is positive on both sides of the stationary point, and
4. a decreasing point of inflection is a stationary point 𝑥 where the first deriva-
tive 𝑓 ′ is negative on both sides of the stationary point.
The first two categories are local extrema, and the other two cases are known
as inflection points or saddle points. Using this naming convention, Fermat’s
theorem (the one easier to prove) on interior extrema states that the condition
that the first derivative of a (smooth) real function defined on an open interval
vanishes is a necessary condition for a local extremum.
Theorem 12.3 (Fermat’s theorem, interior extremum theorem) Suppose
that 𝑓 ∶ (𝑎, 𝑏) → ℝ is a function on the open interval (𝑎, 𝑏) ⊂ ℝ and that 𝑥 ∗ is a
local extremum of 𝑓. If 𝑓 is differentiable at the point 𝑥 ∗ , then 𝑓 ′ (𝑥 ∗ ) = 0.
Proof Suppose that 𝑥∗ is a local minimum. (The proof in the case of a local
maximum proceeds analogously.) Then, by the definition of a local minimum,
there exists a 𝛿 ∈ ℝ+ such that (𝑥 ∗ − 𝛿, 𝑥 ∗ + 𝛿) ⊂ (𝑎, 𝑏) and such that 𝑓(𝑥 ∗ ) ≤
𝑓(𝑥) for all 𝑥 ∈ (𝑥∗ − 𝛿, 𝑥∗ + 𝛿). Dividing the nonnegative difference 𝑓(𝑥∗ + ℎ) − 𝑓(𝑥∗ ) by positive and negative ℎ implies
\[
\frac{f(x^* + h) - f(x^*)}{h} \ge 0 \quad \forall h \in (0, \delta),
\]
\[
\frac{f(x^* + h) - f(x^*)}{h} \le 0 \quad \forall h \in (-\delta, 0).
\]
Since 𝑓 is differentiable at 𝑥∗ by assumption, the limits of these two quotients as ℎ → 0 exist, and taking the limits implies both 𝑓′(𝑥∗ ) ≥ 0 and 𝑓′(𝑥∗ ) ≤ 0, and hence 𝑓′(𝑥∗ ) = 0. □
A counterexample showing that the condition in the theorem is not sufficient
and – at the same time – the simplest example of an inflection or saddle point is
the function 𝑓(𝑥) ∶= 𝑥3 on any interval that contains zero. Then 𝑓 ′ (0) = 0, but
zero is not a local extremum; it is a saddle point.
In higher dimensions, a stationary or critical point is a point where the gradi-
ent vanishes. Again, a vanishing gradient is not a sufficient condition for a local
extremum. The prototypical example is the function 𝑓(𝑥, 𝑦) ∶= 𝑥2 + 𝑦 3 at the
point (0, 0), where it resembles a saddle.
The relationships between the first and second derivatives of a multivariate func-
tion and its local extrema are summarized in Theorem 12.7 below, where the
Hessian matrix of the multivariate function 𝑓 plays a major role. Before stating
the theorem, we define the Hessian matrix of a function and see where it occurs
in multivariate Taylor expansions.
\[
h_{ij} := \frac{\partial^2 f(\mathbf{x})}{\partial x_i \, \partial x_j}.
\]
These considerations are often also called the first and second partial-derivative tests and provide a tool for identifying local extrema of a sufficiently smooth function in the interior of a domain. But the test may be inconclusive, as the statements in Theorem 12.7 do not cover all cases that may occur.
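The inconclusive case can be seen numerically for the example 𝑓(𝑥, 𝑦) ∶= 𝑥² + 𝑦³ at the stationary point (0, 0). The finite-difference `hessian` helper below is an ad-hoc sketch for illustration, not code from the book.

```julia
using LinearAlgebra

# Finite-difference Hessian and the second partial-derivative test for
# f(x, y) = x^2 + y^3 at the stationary point (0, 0).
function hessian(f, x; h=1e-4)
    d = length(x)
    H = zeros(d, d)
    for i in 1:d, j in 1:d
        ei, ej = zeros(d), zeros(d)
        ei[i] = h; ej[j] = h
        H[i, j] = (f(x + ei + ej) - f(x + ei) - f(x + ej) + f(x)) / h^2
    end
    return H
end

f(x) = x[1]^2 + x[2]^3
λ = eigvals(Symmetric(hessian(f, [0.0, 0.0])))
# One eigenvalue is 2, the other is ≈ 0: the Hessian is only positive
# semidefinite, so the test is inconclusive — and indeed (0, 0) is not
# a local extremum, since y^3 changes sign at 0.
```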
12.3 Convexity
Another question that naturally arises is whether a local extremum that we may
have found is already a (or the unique) global extremum. Are there properties of
the function or its domain that always ensure that we can draw such a conclu-
sion?
The answer is yes, such properties exist: they are the convexity of the domain
and the convexity of the function. The concept of convexity substantially simpli-
fies the search for global extrema. We start by defining convex sets and convex
functions.
A set 𝐶 ⊂ ℝ𝑑 is called convex if for all 𝐱, 𝐲 ∈ 𝐶 and all 𝑡 ∈ [0, 1] the inclusion
\[
t\mathbf{x} + (1-t)\mathbf{y} \in C
\]
holds. A function 𝑓 ∶ 𝐶 → ℝ defined on a convex set 𝐶 is called convex if for all 𝐱, 𝐲 ∈ 𝐶 and all 𝑡 ∈ (0, 1) the inequality
\[
f\bigl(t\mathbf{x} + (1-t)\mathbf{y}\bigr) \le t f(\mathbf{x}) + (1-t) f(\mathbf{y})
\]
holds, and it is called strictly convex if the inequality holds with < instead of ≤ whenever 𝐱 ≠ 𝐲.
Convexity has the useful property that it ensures that a local minimum is
already a global minimum (while disregarding the boundary as usual by suppos-
ing that the domain of the function is an open set). If the function is even strictly
convex, we can additionally conclude that this global minimum is unique.
The proof is indirect. Assuming that there exists a different point 𝐱0 ∈ 𝐶∖{𝐱∗ } such that 𝑓(𝐱0 ) < 𝑓(𝐱∗ ) yields
\[
f\bigl(t\mathbf{x}_0 + (1-t)\mathbf{x}^*\bigr) \le t f(\mathbf{x}_0) + (1-t) f(\mathbf{x}^*) < f(\mathbf{x}^*) \quad \forall t \in (0, 1)
\]
by convexity.
If a 𝑡 ∈ (0, 1) can be found such that ‖(𝑡𝐱0 + (1 − 𝑡)𝐱∗ ) − 𝐱∗ ‖ < 𝛿, we have found
a point 𝑡𝐱0 + (1 − 𝑡)𝐱∗ that contradicts the assumption (12.1) that 𝐱∗ is a local
minimum. We find
\[
\|(t\mathbf{x}_0 + (1-t)\mathbf{x}^*) - \mathbf{x}^*\| = t \,\|\mathbf{x}_0 - \mathbf{x}^*\| = \alpha\delta < \delta
\]
after defining
\[
0 < t := \frac{\alpha\delta}{\|\mathbf{x}_0 - \mathbf{x}^*\|} < 1
\]
and choosing any 𝛼 < 1 from the interval (0, ‖𝐱0 − 𝐱∗ ‖∕𝛿), which shows that such a 𝑡 exists. Hence the first part of the theorem follows.
To show the second part, we assume that there exists a different point 𝐱0 ∈ 𝐶∖{𝐱∗ } such that 𝑓(𝐱0 ) ≤ 𝑓(𝐱∗ ). Since 𝑓 is now even strictly convex, we can conclude that
\[
f\bigl(t\mathbf{x}_0 + (1-t)\mathbf{x}^*\bigr) < t f(\mathbf{x}_0) + (1-t) f(\mathbf{x}^*) \le f(\mathbf{x}^*) \quad \forall t \in (0, 1).
\]
Proceeding similarly, we again see that the existence of the point 𝑡𝐱0 + (1 − 𝑡)𝐱∗
contradicts (12.1), which concludes the indirect proof. □
𝐱𝑛+1 ∶= 𝐱𝑛 − ℎ∇𝑓(𝐱𝑛 ),
𝐱𝑛+1 ∶= 𝐱𝑛 + ℎ∇𝑓(𝐱𝑛 )
Another way to see why the choice (12.2) is expedient is the following argument, which is based on a basic inequality shown in Problem 12.4.
holds.
holds.
The following theorem shows that the convergence rate is linear (as a function
of the number of iterations) when an appropriate constant step size is used.
\[
\mathbf{x}_{n+1} := \mathbf{x}_n - h_n \nabla f(\mathbf{x}_n),
\]
\[
f(\mathbf{x}_n) - f(\mathbf{x}^*) \le \frac{\|\mathbf{x}_0 - \mathbf{x}^*\|^2}{\sum_{k=0}^{n} h_k \bigl(1 - L h_k/2\bigr)} \quad \forall n \in \mathbb{N}.
\]
Proof Lemma 12.17 below applied to the points 𝐱𝑘+1 = 𝐱𝑘 − ℎ𝑘 ∇𝑓(𝐱𝑘 ) and 𝐱𝑘
yields the inequality
\[
\begin{aligned}
f(\mathbf{x}_{k+1}) - f(\mathbf{x}_k)
&\le \nabla f(\mathbf{x}_k) \cdot (\mathbf{x}_{k+1} - \mathbf{x}_k) + \frac{L}{2} \|\mathbf{x}_{k+1} - \mathbf{x}_k\|^2 \\
&= -h_k \Bigl(1 - \frac{L h_k}{2}\Bigr) \|\nabla f(\mathbf{x}_k)\|^2 \quad \forall k \in \mathbb{N}_0.
\end{aligned}
\]
Next, we define
𝑒𝑘 ∶= 𝑓(𝐱𝑘 ) − 𝑓(𝐱∗ ) ≥ 0,
which is the error in the 𝑘-th step and greater than or equal to zero for all 𝑘, since
𝑥∗ is the global minimum. The last inequality implies
\[
e_{k+1} \le e_k - h_k \Bigl(1 - \frac{L h_k}{2}\Bigr) \|\nabla f(\mathbf{x}_k)\|^2 \quad \forall k \in \mathbb{N}_0. \tag{12.3}
\]
Since 𝑒𝑘 ≤ ∇𝑓(𝐱𝑘 ) ⋅ (𝐱𝑘 − 𝐱∗ ) ≤ ‖∇𝑓(𝐱𝑘 )‖ ‖𝐱𝑘 − 𝐱∗ ‖ by Lemma 12.14 and the Cauchy–Bunyakovsky–Schwarz inequality, we have ‖∇𝑓(𝐱𝑘 )‖² ≥ 𝑒𝑘²∕‖𝐱𝑘 − 𝐱∗ ‖² and hence
\[
e_{k+1} \le e_k - h_k \Bigl(1 - \frac{L h_k}{2}\Bigr) \frac{e_k^2}{\|\mathbf{x}_k - \mathbf{x}^*\|^2} \quad \forall k \in \mathbb{N}_0.
\]
Because ‖𝐱𝑘 − 𝐱∗ ‖ ≤ ‖𝐱0 − 𝐱∗ ‖ by Lemma 12.18, this implies
\[
e_{k+1} \le e_k - h_k \Bigl(1 - \frac{L h_k}{2}\Bigr) \frac{e_k^2}{\|\mathbf{x}_0 - \mathbf{x}^*\|^2} \quad \forall k \in \mathbb{N}_0
\]
and hence
\[
\frac{1}{e_{k+1}} - \frac{1}{e_k} \ge h_k \Bigl(1 - \frac{L h_k}{2}\Bigr) \frac{1}{\|\mathbf{x}_0 - \mathbf{x}^*\|^2} \frac{e_k}{e_{k+1}} \quad \forall k \in \mathbb{N}_0
\]
after division by 𝑒𝑘 𝑒𝑘+1 and rearranging terms. (If 𝑒𝑘 = 0 for any 𝑘, then 𝐱𝑘 = 𝐱∗
and ∇𝑓(𝐱𝑘 ) = 0, implying that all 𝐱𝑘 will be equal to 𝐱∗ from this point on and
trivially satisfying the asserted inequality.)
Because of 𝑒𝑘+1 ≤ 𝑒𝑘 due to (12.3), we find
\[
\frac{1}{e_{k+1}} - \frac{1}{e_k} \ge h_k \Bigl(1 - \frac{L h_k}{2}\Bigr) \frac{1}{\|\mathbf{x}_0 - \mathbf{x}^*\|^2} \quad \forall k \in \mathbb{N}_0.
\]
Summing all these inequalities for 𝑘 ∈ {0, … , 𝑛 − 1} yields a telescopic sum and
hence the estimate
\[
\frac{1}{e_n} - \frac{1}{e_0} \ge \frac{1}{\|\mathbf{x}_0 - \mathbf{x}^*\|^2} \sum_{k=0}^{n-1} h_k \Bigl(1 - \frac{L h_k}{2}\Bigr) \quad \forall n \in \mathbb{N}.
\]
The following two lemmata were used in the proof. Note that Lemma 12.14
provides a lower bound for 𝑓(𝐲) − 𝑓(𝐱) due to convexity and that Lemma 12.17
provides an upper bound for 𝑓(𝐲) − 𝑓(𝐱) due to 𝐿-smoothness.
Lemma 12.17 Suppose that the function 𝑓 ∶ ℝ𝑑 → ℝ is 𝐿-smooth. Then the inequality
\[
f(\mathbf{y}) - f(\mathbf{x}) \le \nabla f(\mathbf{x}) \cdot (\mathbf{y} - \mathbf{x}) + \frac{L}{2} \|\mathbf{y} - \mathbf{x}\|^2 \quad \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^d
\]
holds.
Proof We calculate
\[
\begin{aligned}
f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x}) \cdot (\mathbf{y} - \mathbf{x})
&= \int_0^1 \bigl(\nabla f(\mathbf{x} + t(\mathbf{y} - \mathbf{x})) - \nabla f(\mathbf{x})\bigr) \cdot (\mathbf{y} - \mathbf{x}) \,\mathrm{d}t \\
&\le \int_0^1 L t \,\|\mathbf{y} - \mathbf{x}\|^2 \,\mathrm{d}t \\
&= \frac{L}{2} \|\mathbf{y} - \mathbf{x}\|^2,
\end{aligned}
\]
where the inequality follows using the Cauchy–Bunyakovsky–Schwarz inequal-
ity, Theorem 8.1, and the 𝐿-smoothness of 𝑓. □
Lemma 12.18 Suppose the function 𝑓 and the sequences ⟨ℎ𝑛 ⟩ and ⟨𝐱𝑛 ⟩ are as in
Theorem 12.16. Then the sequence ⟨‖𝐱𝑛 − 𝐱∗ ‖⟩ decreases as 𝑛 increases.
In the estimate
\[
\begin{aligned}
f(\mathbf{x}) - f(\mathbf{y})
&= \bigl(f(\mathbf{x}) - f(\mathbf{z})\bigr) + \bigl(f(\mathbf{z}) - f(\mathbf{y})\bigr) \\
&\le \nabla f(\mathbf{x}) \cdot (\mathbf{x} - \mathbf{z}) + \nabla f(\mathbf{y}) \cdot (\mathbf{z} - \mathbf{y}) + \frac{L}{2} \|\mathbf{z} - \mathbf{y}\|^2 \\
&= \nabla f(\mathbf{x}) \cdot (\mathbf{x} - \mathbf{y}) + \bigl(\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\bigr) \cdot (\mathbf{y} - \mathbf{z}) + \frac{L}{2} \|\mathbf{z} - \mathbf{y}\|^2,
\end{aligned}
\]
the first term 𝑓(𝐱) − 𝑓(𝐳) is estimated using Lemma 12.14 and the second term 𝑓(𝐳) − 𝑓(𝐲) is estimated using Lemma 12.17. Then substituting
\[
\mathbf{z} := \mathbf{y} - \frac{1}{L} \bigl(\nabla f(\mathbf{y}) - \nabla f(\mathbf{x})\bigr)
\]
yields (12.5).
Inequality (12.5) applied to the situation in this lemma implies that
\[
0 \le f(\mathbf{x}_n) - f(\mathbf{x}^*) \le \nabla f(\mathbf{x}_n) \cdot (\mathbf{x}_n - \mathbf{x}^*) - \frac{1}{2L} \|\nabla f(\mathbf{x}_n)\|^2. \tag{12.6}
\]
In the last step, we combine (12.4) and (12.6) to find
\[
\begin{aligned}
\|\mathbf{x}_{n+1} - \mathbf{x}^*\|^2
&\le \|\mathbf{x}_n - \mathbf{x}^*\|^2 - \frac{h_n}{L} \|\nabla f(\mathbf{x}_n)\|^2 + h_n^2 \|\nabla f(\mathbf{x}_n)\|^2 \\
&= \|\mathbf{x}_n - \mathbf{x}^*\|^2 - h_n \Bigl(\frac{1}{L} - h_n\Bigr) \|\nabla f(\mathbf{x}_n)\|^2 \\
&\le \|\mathbf{x}_n - \mathbf{x}^*\|^2,
\end{aligned}
\]
which proves the assertion. □
\[
\sum_{k=0}^{n} h_k \Bigl(1 - \frac{L h_k}{2}\Bigr) = \sum_{k=0}^{n} h_k - \frac{L}{2} \sum_{k=0}^{n} h_k^2.
\]
This is indeed the usual requirement in stochastic optimization. The step sizes
ℎ𝑛 ∶= 𝑎∕(𝑛 +𝑏) in the second corollary are very common in stochastic optimiza-
tion and satisfy these two requirements.
Corollary 12.20 (diminishing step sizes) Suppose that the assumptions in The-
orem 12.16 hold and that the step sizes are defined such that
\[
h_n := \frac{a}{n+b} \le \frac{1}{L}, \qquad a, b \in \mathbb{R}^+.
\]
Then the estimate
\[
f(\mathbf{x}_n) - f(\mathbf{x}^*) \le \frac{\|\mathbf{x}_0 - \mathbf{x}^*\|^2}{a \ln\bigl(\frac{n+1+b}{b}\bigr) - \frac{L a^2 (n+1)}{2b(n+1+b)}} \quad \forall n \in \mathbb{N}
\]
holds.
Proof In general, if 𝑔 ∶ ℝ → ℝ is a monotonically decreasing Riemann-
integrable function, the estimates
\[
\int_{n_0}^{n_1+1} g(x) \,\mathrm{d}x \;\le\; \sum_{k=n_0}^{n_1} g(k) \;\le\; \int_{n_0-1}^{n_1} g(x) \,\mathrm{d}x
\]
hold. (To see this, the sum is interpreted as the Riemann sum of an integral; it is
useful to draw a sketch of a monotonically decreasing function and the rectan-
gles in the Riemann sum.)
In our case, we have
\[
\begin{aligned}
\sum_{k=0}^{n} h_k \Bigl(1 - \frac{L h_k}{2}\Bigr)
&\ge \int_0^{n+1} \frac{a}{x+b} \Bigl(1 - \frac{L}{2} \frac{a}{x+b}\Bigr) \mathrm{d}x \\
&= a \ln\Bigl(\frac{n+1+b}{b}\Bigr) - \frac{L a^2 (n+1)}{2b(n+1+b)},
\end{aligned}
\tag{12.7}
\]
which concludes the proof. □
We can also interpret the estimates in the theorem and its corollaries differ-
ently by asking the question how many iterations are necessary to ensure that
the error is smaller than a prescribed tolerance 𝜖, i.e., that
𝑓(𝐱𝑛 ) − 𝑓(𝐱∗ ) < 𝜖 holds. For the constant step size ℎ ∶= 1∕𝐿, the estimate in Theorem 12.16 shows that this is the case whenever
\[
n + 1 > \frac{2L \|\mathbf{x}_0 - \mathbf{x}^*\|^2}{\epsilon}.
\]
Therefore 𝑂(1∕𝜖) iterations are required in order to achieve an error 𝑓(𝐱𝑘 ) − 𝑓(𝐱∗ ) < 𝜖. Furthermore, the number of required iterations is proportional to the Lipschitz constant 𝐿 and to the squared distance of the starting point 𝐱0 from the minimum 𝐱∗ , and it is inversely proportional to the prescribed tolerance.
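The iteration with a constant step size ℎ = 1∕𝐿 can be sketched directly. The quadratic test function below is an illustrative choice (for it, 𝐿 equals the largest curvature), not an example from the book.

```julia
# Gradient descent with the constant step size h = 1/L on the convex,
# L-smooth function f(x) = (x₁² + 10 x₂²)/2, for which L = 10 and x* = 0.
function gradient_descent(∇f, x0, h, iters)
    x = copy(x0)
    trajectory = [copy(x)]
    for _ in 1:iters
        x = x - h * ∇f(x)          # x_{n+1} := x_n − h ∇f(x_n)
        push!(trajectory, copy(x))
    end
    return trajectory
end

f(x)  = (x[1]^2 + 10 * x[2]^2) / 2
∇f(x) = [x[1], 10 * x[2]]
L = 10.0

traj = gradient_descent(∇f, [4.0, -3.0], 1 / L, 100)
errs = [f(x) for x in traj[2:end]]   # f(x_n) − f(x*), since f(x*) = 0
```

On this problem the error decreases monotonically; the theorem's worst-case bound 2𝐿‖𝐱0 − 𝐱∗‖²∕(𝑛 + 1) is far from tight here, as the actual convergence is linear.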
There is also a connection with regularization (see Sect. 13.9.1) in the context
of neural networks (see Chap. 13). Regularization diminishes the Lipschitz con-
stant of the neural network, thus having the additional benefit that it speeds up
convergence.
In practice, the Lipschitz constant 𝐿 can be calculated using the Hessian ma-
trix if the function 𝑓 is given in a sufficiently explicit and differentiable form. But
if it is not represented in a straightforward manner, it may be hard or impossible
to determine the Lipschitz constant 𝐿. Also, intuition suggests that the step size
should become smaller as the minimum is approached. These considerations
motivate deliberations on the step size in the next sections.
Another practical consideration is that if the function 𝑓 is so smooth that
the Hessian matrix exists, then we can take advantage of the second derivatives
directly by using the bfgs method (see Sect. 12.8).
A question that suggests itself after having shown Theorem 12.16 is: what is the
best convergence rate that an optimization algorithm that only uses the gradient
of a function can achieve? In order to be able to answer this question, we have
to restrict the class of functions that the optimization is supposed to work on.
The reason is simply that if we demand the optimization algorithm to work on
functions that are not smooth at all, it is always possible to construct counterex-
amples because the function values may jump without restriction. Therefore we
require the functions to have the same smoothness as in the (only) convergence
results we already know, namely Theorem 12.16.
The following theorem states that any optimization algorithm that uses only
the gradient can achieve at most quadratic convergence on this class of functions.
Such iterative optimization algorithms are called first-order methods.
\[
f(\mathbf{x}_n) - f(\mathbf{x}^*) \ge \frac{3L \|\mathbf{x}_0 - \mathbf{x}^*\|^2}{32(n+1)^2} \quad \forall n \in \mathbb{N},
\]
\[
\|\mathbf{x}_n - \mathbf{x}^*\|^2 \ge \frac{1}{8} \|\mathbf{x}_0 - \mathbf{x}^*\|^2 \quad \forall n \in \mathbb{N}
\]
hold.
The answer to this question was found in 1983 [13] and is discussed in [14, Sec-
tion 2.2]. It turns out that the requirement 𝑓(𝐱𝑛+1 ) < 𝑓(𝐱𝑛 ) discussed at the be-
ginning of Sect. 12.4 is not conducive for the minimization of convex functions;
the requirement is a local statement, but convexity is a global property.
As the next theorem shows, quadratic convergence, i.e., the optimal conver-
gence rate, can indeed be achieved in the minimization of convex, smooth func-
tions. The iteration in accelerated gradient descent combines the gradient with
the difference 𝐱𝑛 − 𝐱𝑛−1 , which is the momentum of the trajectory of the se-
quence ⟨𝐱𝑛 ⟩.
\[
\begin{aligned}
\lambda_0 &:= 1,\\
\lambda_n &:= \frac{1 + \sqrt{4\lambda_{n-1}^2 + 1}}{2},\\
\gamma_n &:= \frac{\lambda_{n-1} - 1}{\lambda_n},\\
\mathbf{d}_n &:= \gamma_n (\mathbf{x}_n - \mathbf{x}_{n-1}),\\
\mathbf{y}_n &:= \mathbf{x}_n + \mathbf{d}_n,\\
\mathbf{g}_n &:= -\frac{1}{L} \nabla f(\mathbf{y}_n) = -\frac{1}{L} \nabla f(\mathbf{x}_n + \mathbf{d}_n),\\
\mathbf{x}_{n+1} &:= \mathbf{y}_n + \mathbf{g}_n = \mathbf{x}_n + \mathbf{d}_n + \mathbf{g}_n
\end{aligned}
\]
Furthermore, estimating the first two terms on the right-hand side using Lemma 12.17 yields
\[
f\bigl(\mathbf{x} - h \nabla f(\mathbf{x})\bigr) - f(\mathbf{y}) \le \Bigl(-h + \frac{L h^2}{2}\Bigr) \|\nabla f(\mathbf{x})\|^2 + \nabla f(\mathbf{x}) \cdot (\mathbf{x} - \mathbf{y}) \quad \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^d \ \forall h \in \mathbb{R}.
\]
In order to find the strongest estimate, we define ℎ ∶= arg min_{ℎ∈ℝ} (𝐿ℎ²∕2 − ℎ) = 1∕𝐿, which results in the inequality
\[
f\Bigl(\mathbf{x} - \frac{1}{L} \nabla f(\mathbf{x})\Bigr) - f(\mathbf{y}) \le -\frac{1}{2L} \|\nabla f(\mathbf{x})\|^2 + \nabla f(\mathbf{x}) \cdot (\mathbf{x} - \mathbf{y}) \quad \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^d.
\]
We define the error in the 𝑛-th iteration as
𝑒𝑛 ∶= 𝑓(𝐱𝑛 ) − 𝑓(𝐱∗ ).
Applying the last inequality with 𝐱 ∶= 𝐲𝑛 = 𝐱𝑛 + 𝐝𝑛 and with 𝐲 ∶= 𝐱𝑛 or 𝐲 ∶= 𝐱∗ yields
\[
e_{n+1} - e_n \le -\frac{L}{2} \bigl(\|\mathbf{g}_n\|^2 + 2\mathbf{g}_n \cdot \mathbf{d}_n\bigr) \quad \forall n \in \mathbb{N},
\]
\[
e_{n+1} \le -\frac{L}{2} \bigl(\|\mathbf{g}_n\|^2 + 2\mathbf{g}_n \cdot (\mathbf{x}_n + \mathbf{d}_n - \mathbf{x}^*)\bigr) \quad \forall n \in \mathbb{N}.
\]
Multiplying the first of these two inequalities by 𝜆𝑛² − 𝜆𝑛 ≥ 0, multiplying the second one by 𝜆𝑛, and adding them, we would like to use the identity ‖𝐮‖² + 2𝐮 ⋅ 𝐯 = ‖𝐮 + 𝐯‖² − ‖𝐯‖² in
𝜆𝑛²(𝑒𝑛+1 − 𝑒𝑛) + 𝜆𝑛𝑒𝑛 ≤ −(𝐿/2)(‖𝜆𝑛𝐠𝑛‖² + 2𝜆𝑛𝐠𝑛 ⋅ (𝜆𝑛𝐝𝑛 + 𝐱𝑛 − 𝐱∗))
= −(𝐿/2)(‖𝜆𝑛𝐠𝑛 + 𝜆𝑛𝐝𝑛 + 𝐱𝑛 − 𝐱∗‖² − ‖𝜆𝑛𝐝𝑛 + 𝐱𝑛 − 𝐱∗‖²) ∀𝑛 ∈ ℕ.
We denote the arguments of the norms by
𝐫𝑛 ∶= 𝜆𝑛𝐠𝑛 + 𝜆𝑛𝐝𝑛 + 𝐱𝑛 − 𝐱∗,
𝐬𝑛 ∶= 𝜆𝑛𝐝𝑛 + 𝐱𝑛 − 𝐱∗.
In order to obtain a telescoping structure on the right-hand side, we require that the condition 𝐫𝑛 = 𝐬𝑛+1 holds. Using the definitions of 𝐝𝑛+1 and 𝐱𝑛+1 on the right-hand side, this condition is furthermore equivalent to 𝜆𝑛+1𝛾𝑛+1 = 𝜆𝑛 − 1, which holds by the definition of 𝛾𝑛+1.
Therefore the last inequality becomes
𝜆𝑛²𝑒𝑛+1 − (𝜆𝑛² − 𝜆𝑛)𝑒𝑛 ≤ −(𝐿/2)(‖𝐬𝑛+1‖² − ‖𝐬𝑛‖²) ∀𝑛 ∈ ℕ.
To achieve a suitable telescoping structure on the left side as well, we would like
to meet the condition 𝑢𝑛 = 𝑣𝑛+1 for all 𝑛, where
𝑢𝑛 ∶= 𝜆𝑛2 𝑒𝑛+1 ,
𝑣𝑛 ∶= (𝜆𝑛2 − 𝜆𝑛 )𝑒𝑛 .
This condition is equivalent to requiring that
𝜆𝑛² = 𝜆𝑛+1² − 𝜆𝑛+1 ∀𝑛 ∈ ℕ (12.10)
holds.
Thus the last inequality becomes
𝑣𝑛+1 − 𝑣𝑛 ≤ −(𝐿/2)(‖𝐬𝑛+1‖² − ‖𝐬𝑛‖²) ∀𝑛 ∈ ℕ. (12.11)
The last step now involves inequalities with this telescoping structure. Sup-
pose there are two real sequences ⟨𝑎𝑛 ⟩ and ⟨𝑏𝑛 ⟩ such that 𝑎𝑛+1 − 𝑎𝑛 ≤ 𝑏𝑛 − 𝑏𝑛+1
holds for all 𝑛. Then the chain
𝑎𝑛+1 + 𝑏𝑛+1 ≤ 𝑎𝑛 + 𝑏𝑛 ≤ ⋯ ≤ 𝑎1 + 𝑏1
holds. Applying this to 𝑎𝑛 ∶= 𝑣𝑛 and 𝑏𝑛 ∶= (𝐿/2)‖𝐬𝑛‖² and using (12.10) again, we find
𝜆𝑛²𝑒𝑛+1 = 𝑣𝑛+1 ≤ 𝑣𝑛+1 + (𝐿/2)‖𝐬𝑛+1‖² ≤ 𝑣1 + (𝐿/2)‖𝐬1‖² = 𝜆0²𝑒1 + (𝐿/2)‖(𝜆0 − 1)(𝐱1 − 𝐱0) + 𝐱1 − 𝐱∗‖² ∀𝑛 ∈ ℕ.
After setting 𝜆0 ∶= 1, we have
𝑓(𝐱𝑛+1) − 𝑓(𝐱∗) ≤ (𝑓(𝐱1) − 𝑓(𝐱∗) + (𝐿/2)‖𝐱1 − 𝐱∗‖²) / 𝜆𝑛² ∀𝑛 ∈ ℕ.
Finally, we check that all three conditions for the 𝜆𝑛 and 𝛾𝑛 can be met and
that the 𝜆𝑛 grow (at least) linearly. The three conditions are (12.8), (12.9), and
(12.10), which yield
𝜆𝑛 ∶= (1 + √(4𝜆𝑛−1² + 1)) / 2 ≥ 1/2 + 𝜆𝑛−1,
𝛾𝑛 ∶= (𝜆𝑛−1 − 1) / 𝜆𝑛.
It is important to note that up to now we have only used fixed step sizes, i.e., the
factors of the gradients in the update formulas have been constants determined
by the Lipschitz constant of the objective function. We now lift this restriction
(although it has made the analysis easier), because the Lipschitz constant may
not be known or the function may not be Lipschitz continuous at all, but we still
wish to optimize such functions.
Since we search along the direction given by the gradient, such optimization
algorithms are called line-search algorithms. In general, they have the following
form.
Algorithm 12.24 (line-search algorithm)
1. Compute the search direction 𝐩𝑛 ∶= −∇𝑓(𝐱𝑛).
2. Determine the step size 𝛼𝑛 > 0 such that a sufficient-decrease condition is satisfied.
3. Set 𝐱𝑛+1 ∶= 𝐱𝑛 + 𝛼𝑛𝐩𝑛.
A simple-minded decrease condition such as 𝑓(𝐱𝑛+1 ) = 𝑓(𝐱𝑛 + 𝛼𝑛 𝐩𝑛 ) ≤
𝑓(𝐱𝑛 ), meaning that the function value at the new approximation point 𝐱𝑛+1
is smaller than before, is not a useful requirement, since the function value, a
real number, may decrease forever without getting close to a minimum.
Instead, we would ideally like to use a global minimizer
of the function
ℎ∶ ℝ+ → ℝ, 𝛼 ↦ 𝑓(𝐱𝑛 + 𝛼𝐩𝑛 )
as the step size. Finding an approximation of the minimizer may be quite an
elaborate task (but at least it is always a one-dimensional problem) and is usually
performed in two steps. In the first step, an interval of acceptable step sizes is
identified. In the second step, the function is interpolated and the interval is
bisected to find a good approximation of the best step size.
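As a minimal illustration of such a sufficient-decrease test, the following sketch shrinks a trial step size until the Armijo condition 𝑓(𝐱 + 𝛼𝐩) ≤ 𝑓(𝐱) + 𝑐1𝛼∇𝑓(𝐱)⋅𝐩 holds; the full two-step procedure with interval bracketing and interpolation described above is more elaborate, and the function name and parameters here are illustrative assumptions.

```julia
# Backtracking sketch: shrink α until the sufficient-decrease condition
# f(x + α p) ≤ f(x) + c₁ α ∇f(x)⋅p holds. p must be a descent direction,
# i.e., ∇f(x)⋅p < 0, for the loop to terminate.
function backtracking_step_size(f, grad_x, x, p; c1 = 1e-4, rho = 0.5)
    alpha = 1.0
    fx = f(x)
    slope = sum(grad_x .* p)       # ∇f(x)⋅p
    while f(x .+ alpha .* p) > fx + c1*alpha*slope
        alpha *= rho               # reject the trial step and shrink it
    end
    alpha
end
```

For 𝑓(𝐱) = ‖𝐱‖² at 𝐱 = (1) with the steepest-descent direction 𝐩 = −∇𝑓(𝐱), the first accepted step already reaches the minimizer.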
on the step size 𝛼𝑛 , where 𝑐2 ∈ (𝑐1 , 1) is a constant. The left-hand side is equal
to ℎ′ (𝛼𝑛 ), implying the interpretation that the slope ℎ′ (𝛼𝑛 ) of ℎ at 𝛼𝑛 must be
greater than or equal to the constant 𝑐2 times the slope ℎ′ (0).
If the slope ℎ′ (𝛼) is strongly negative for small 𝛼, then this second condition
requires that the step size 𝛼 cannot be chosen too small, which is reasonable,
since a strongly negative slope indicates that the function 𝑓 can be reduced sig-
nificantly by moving further along.
If line search is used in conjunction with a Newton or quasi-Newton method
(see the next sections), a typical value for 𝑐2 is 0.9.
Having defined the Wolfe conditions, the question arises naturally whether they can be satisfied. The answer to this question is always yes under reasonable assumptions on the objective function 𝑓, as the following theorem shows.
Proof Since the matrix 𝐵 is positive definite by assumption and since 𝛼 > 0 and 𝑐1 > 0, the function 𝑙 ∶ ℝ+ → ℝ, 𝛼 ↦ 𝑓(𝐱) + 𝛼𝑐1∇𝑓(𝐱) ⋅ 𝐩 is unbounded below and thus intersects the function ℎ, which is bounded below, at least once. We denote the smallest such value by 𝛼1 such that 𝛼1 > 0 and
𝑓(𝐱 + 𝛼1𝐩) = 𝑓(𝐱) + 𝛼1𝑐1∇𝑓(𝐱) ⋅ 𝐩.
Hence the strict inequality in the first Wolfe condition (12.14) is satisfied for all 𝛼 ∈ (0, 𝛼1).
Next, the mean-value theorem implies that
In this section, the Newton method for approximating the roots of functions,
i.e., for approximating points 𝑥 such that 𝑔(𝑥) = 0, is summarized in the one-
dimensional case, i.e., for functions 𝑔 ∶ ℝ ⊃ [𝑎, 𝑏] → ℝ. In addition to finding
roots of general, nonlinear functions, we can also use the Newton method to
find stationary points, which are candidates for local extrema, by approximating
the roots of the derivative of a given function. The second use is extended in
the following section, Sect. 12.8, where we will discuss a generalization of the
Newton method for nonlinear, multidimensional optimization.
The Newton method works iteratively and is best summarized as replacing
the function by its tangent at the current approximation 𝑥𝑛 of the root and then
using the root of the tangent, which is a linear function, as the next approxima-
tion 𝑥𝑛+1 .
In more detail, we start from a differentiable function 𝑔 ∶ [𝑎, 𝑏] → ℝ and an
initial approximation 𝑥𝑛 of a root. First, we write the tangent 𝑦 ∶ [𝑎, 𝑏] → ℝ of 𝑔
at the point (𝑥𝑛 , 𝑔(𝑥𝑛 )) as the linear function
This formula is easily checked by noting that 𝑦 is a linear function, that it has the correct slope 𝑔′(𝑥𝑛), and that it takes the correct value 𝑦(𝑥𝑛) = 𝑔(𝑥𝑛) of the tangent at the point (𝑥𝑛, 𝑔(𝑥𝑛)).
The next approximation 𝑥𝑛+1 is the root of the tangent 𝑦 and is hence found
by solving the equation
𝑥𝑛+1 ∶= 𝑥𝑛 − 𝑔(𝑥𝑛)/𝑔′(𝑥𝑛). (12.17)
For a root 𝜉 of 𝑔, Taylor expansion yields
𝜉 − 𝑥𝑛+1 = −(𝑔′′(𝜉𝑛)/(2𝑔′(𝑥𝑛)))(𝜉 − 𝑥𝑛)².
Defining the error in step 𝑛 as
𝑒𝑛 ∶= 𝑥𝑛 − 𝜉,
the relation
|𝑒𝑛+1| = (|𝑔′′(𝜉𝑛)|/(2|𝑔′(𝑥𝑛)|)) 𝑒𝑛²
holds between the errors 𝑒𝑛 and 𝑒𝑛+1 in steps 𝑛 and 𝑛+1, which implies quadratic
convergence due to the assumptions on 𝑔′ and 𝑔′′ if the starting point 𝑥0 is suffi-
ciently close to the root 𝜉. □
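The iteration (12.17) translates directly into Julia. In this sketch, 𝑔 and its derivative 𝑔′ (called g and dg) are assumed given, and the function name and stopping parameters are illustrative assumptions.

```julia
# Newton iteration (12.17): xₙ₊₁ := xₙ - g(xₙ)/g′(xₙ).
function newton(g, dg, x0; tol = 1e-12, max_iter = 100)
    x = x0
    for n in 1:max_iter
        step = g(x)/dg(x)
        x -= step
        abs(step) < tol && break    # stop when the update is tiny
    end
    x
end
```

For example, newton(x -> x^2 - 2, x -> 2x, 1.0) yields an approximation of √2 to machine precision after a handful of iterations.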
In the multidimensional case of vector-valued functions 𝐠 ∶ ℝ𝑑 → ℝ𝑑 , the
condition (12.16) that the tangent vanishes becomes
where 𝐽𝐠 (𝐱𝑛 ) is the Jacobi matrix of 𝐠 at the point 𝐱𝑛 . Although the next approx-
imation 𝐱𝑛+1 can be written concisely as
it is much more computationally efficient not to calculate the inverse 𝐽𝐠−1 (𝐱𝑛 ) of
the Jacobi matrix, but to solve the linear system
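In Julia, one such multidimensional Newton step can be sketched with the backslash operator, which solves the linear system without forming the inverse of the Jacobi matrix; the function name newton_step is an illustrative assumption.

```julia
# One multidimensional Newton step: solve J_g(xₙ) s = -g(xₙ) for the
# update s instead of inverting the Jacobi matrix.
function newton_step(g, Jg, x)
    s = Jg(x) \ (-g(x))    # solve the linear system
    x + s
end
```

Iterating newton_step on, e.g., 𝐠(𝐱) = (𝑥1² + 𝑥2² − 1, 𝑥1 − 𝑥2) converges quadratically to the root on the unit circle with equal coordinates.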
The bfgs method is one of the most popular quasi-Newton methods and is named after Charles G. Broyden, Roger Fletcher, Donald Goldfarb, and David Shanno [3, 4, 6, 8, 17, 18]. As a quasi-Newton method, it is iterative and suitable for nonlinear optimization problems. Although it does not handle constraints in its original form, a version for box constraints has been developed [5].
In this section, the problem is to minimize a real differentiable scalar function
𝑓 ∶ ℝ𝑑 → ℝ without constraints. We denote the starting point of the iteration
by 𝐱0 and the approximation in iteration 𝑛 by 𝐱𝑛 . If 𝐻𝑓 (𝐱𝑛 ) denotes the Hessian
𝐬𝑛 ∶= 𝐱𝑛+1 − 𝐱𝑛
𝐻𝑛 ≈ 𝐻𝑓 (𝐱𝑛 )−1
𝛼𝑛 𝐬𝑛 = 𝐱𝑛+1 − 𝐱𝑛 ,
because we only approximate the Hessian matrix, and we use line search to make
up for this approximation. Having thus determined the search direction 𝐬𝑛 , any
line-search method (see Sect. 12.6) can be employed to find the next approxima-
tion 𝐱𝑛+1 by minimizing the objective function 𝑓(𝐱𝑛+1 ) = 𝑓(𝐱𝑛 + 𝛼𝑛 𝐬𝑛 ) in the
search direction 𝐬𝑛 over the scalar 𝛼 ∈ ℝ+ , i.e.,
long as the function is sufficiently smooth) and that 𝐻𝑛 is positive definite (as
all Hessian matrices are at a local minimum, see Theorem 12.7). If 𝐻𝑛 is positive
definite, this property also implies the so-called descent property
(cf. Sect. 12.1). Equation (12.19) becomes 𝛼𝑛 𝐬𝑛 = −𝐻𝑛 𝐠(𝐱𝑛 ) for the new step
𝛼𝑛 𝐬𝑛 (instead of 𝐬𝑛 ), implying that
The higher order terms in 𝑂(‖𝐱𝑛+1 − 𝐱𝑛 ‖2 ) vanish for quadratic functions 𝑓 and
we will neglect them, meaning that the following construction will be exact for
quadratic functions. This yields
𝐝𝑛 ∶= 𝐱𝑛+1 − 𝐱𝑛 ,
𝐲𝑛 ∶= 𝐠(𝐱𝑛+1 ) − 𝐠(𝐱𝑛 ) = ∇𝑓(𝐱𝑛+1 ) − ∇𝑓(𝐱𝑛 ),
𝐻𝑛 𝐝𝑛 = 𝐲𝑛 ,
but the matrix 𝐻𝑛 usually does not satisfy this equation, since 𝐝𝑛 and hence 𝐲𝑛 are only known after the line search is complete. Therefore the next approximation 𝐻𝑛+1 is chosen such that it satisfies the updated equation
𝐻𝑛+1 𝐝𝑛 = 𝐲𝑛 , (12.21)
where the symmetric rank-one matrix 𝑎𝐮𝐮⊤ is added to the previous matrix. The quasi-Newton condition (12.21) can be satisfied by setting 𝐮 ∶= 𝐲𝑛 − 𝐻𝑛𝐝𝑛 and requiring that 𝑎𝐮⊤𝐝𝑛 = 1 (see Problem 12.13), although these updates are usually applied to approximate the inverses 𝐻𝑓(𝐱𝑛)⁻¹ directly. More importantly, however, they have serious disadvantages [7, Section 3.2]: positive definiteness of the approximate matrices cannot be guaranteed, and the denominator in the resulting update formula may become zero.
A better approach is rank-two updates, which have the form
𝐻𝑛+1 = 𝐻𝑛 + (𝐲𝑛𝐲𝑛⊤)/(𝐲𝑛⊤𝐝𝑛) − (𝐻𝑛𝐝𝑛𝐝𝑛⊤𝐻𝑛)/(𝐝𝑛⊤𝐻𝑛𝐝𝑛) (12.24)
for the approximate Hessian and, equivalently,
𝐻𝑛+1⁻¹ = (𝐼 − (𝐝𝑛𝐲𝑛⊤)/(𝐲𝑛⊤𝐝𝑛)) 𝐻𝑛⁻¹ (𝐼 − (𝐲𝑛𝐝𝑛⊤)/(𝐲𝑛⊤𝐝𝑛)) + (𝐝𝑛𝐝𝑛⊤)/(𝐲𝑛⊤𝐝𝑛) (12.25)
for its inverse
(see Problem 12.15). Expanding and noting that 𝐲𝑛⊤𝐻𝑛⁻¹𝐲𝑛 is a scalar yields the equivalent form
1. Initialize the starting point 𝐱0 and the initial approximation 𝐻0⁻¹ ∶= 𝐼 of the inverse of the Hessian matrix.
2. Repeat:
a. Calculate the search direction
𝐬𝑛 ∶= −𝐻𝑛⁻¹∇𝑓(𝐱𝑛).
b. Determine the step size 𝛼𝑛 by a line search (see Sect. 12.6).
c. Set
𝐱𝑛+1 ∶= 𝐱𝑛 + 𝛼𝑛𝐬𝑛,
𝐝𝑛 ∶= 𝐱𝑛+1 − 𝐱𝑛,
𝐲𝑛 ∶= ∇𝑓(𝐱𝑛+1) − ∇𝑓(𝐱𝑛).
d. Calculate 𝐻𝑛+1⁻¹ using (12.25).
e. Repeat until convergence, i.e., until a norm of ∇𝑓(𝐱𝑛+1 ) has become suf-
ficiently small or a norm of 𝐱𝑛+1 − 𝐱𝑛 has become sufficiently small.
3. Return the approximation 𝐱𝑛 of a minimum.
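The algorithm above can be sketched in Julia as follows. For brevity, this illustrative sketch replaces the Wolfe line search of step b by simple backtracking with a sufficient-decrease condition, so it is not a full-strength implementation; all names are assumptions.

```julia
using LinearAlgebra

# Sketch of the bfgs iteration with the inverse update (12.25). The
# Wolfe line search is replaced by simple backtracking for brevity.
function bfgs(f, grad, x0; max_iter = 100, tol = 1e-10)
    x = x0
    Hinv = Matrix{Float64}(I, length(x0), length(x0))   # H₀⁻¹ := I
    g = grad(x)
    for n in 1:max_iter
        norm(g) < tol && break
        s = -Hinv*g                       # search direction
        alpha = 1.0
        while f(x + alpha*s) > f(x) + 1e-4*alpha*dot(g, s)
            alpha /= 2                    # backtracking
        end
        d = alpha*s                       # dₙ := xₙ₊₁ - xₙ
        x = x + d
        g_new = grad(x)
        y = g_new - g                     # yₙ
        rho = 1/dot(y, d)
        V = I - rho*y*d'
        Hinv = V'*Hinv*V + rho*d*d'       # update (12.25)
        g = g_new
    end
    x
end
```

On a strictly convex quadratic such as 𝑓(𝐱) = (𝑥1 − 1)² + 10(𝑥2 + 2)², the condition 𝐲𝑛⊤𝐝𝑛 > 0 holds automatically and the iteration converges quickly to the minimizer.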
where the norm is a weighted Frobenius norm [16, Section 6.1] (see Problem
12.20). More precisely, it is a weighted Frobenius norm
whose weight matrix 𝑊 is any positive definite matrix that satisfies 𝑊𝐝𝑛 = 𝐲𝑛 ;
the Frobenius norm is defined as
‖𝐴‖F² ∶= ∑ᵢ₌₁ᵈ ∑ⱼ₌₁ᵈ 𝑎𝑖𝑗².
Quasi-Newton methods such as the bfgs algorithm in the previous section pro-
vide superlinear convergence while never calculating second derivatives explic-
itly. Their remaining disadvantage when applied to large optimization problems
is that the whole approximate inverse Hessian matrix 𝐻𝑛⁻¹ is stored during the iteration,
meaning that the memory requirement increases quadratically with the number
of variables.
In order to make the bfgs algorithm amenable to large optimization prob-
lems, the limited-memory bfgs (l-bfgs) method has been developed [15, 10, 11].
In the l-bfgs algorithm, not the whole matrix 𝐻𝑛⁻¹ is stored and manipulated, which would be prohibitive when the number of variables is large, but only a few vectors which suffice to calculate the recursion (12.25) only for products 𝐻𝑛⁻¹𝐪 [16, Section 7.2]. Furthermore, not the whole recursion starting from 𝐻0⁻¹
is calculated, but only a fixed number of previous steps is used. Thus the num-
ber of vectors stored is fixed and often relatively small, and hence the amount of
memory required is only linear in the number of variables.
The key to understanding the l-bfgs algorithm is the efficient way of calculating only the products 𝐻𝑛⁻¹𝐪 in Algorithm 12.30. We start by summarizing the recursion (12.25) for the approximations 𝐻𝑛⁻¹ of the inverses of the Hessian matrices as
𝐻𝑛+1⁻¹ ∶= 𝑉𝑛⊤𝐻𝑛⁻¹𝑉𝑛 + 𝜌𝑛𝐝𝑛𝐝𝑛⊤,
𝑉𝑛 ∶= 𝐼 − 𝜌𝑛𝐲𝑛𝐝𝑛⊤,
𝜌𝑛 ∶= 1/(𝐲𝑛⊤𝐝𝑛).
Expanding this recursion yields
𝐻𝑛⁻¹ = (𝑉𝑛−1⊤ ⋯ 𝑉0⊤)𝐻0⁻¹(𝑉0 ⋯ 𝑉𝑛−1)
+ 𝜌0(𝑉𝑛−1⊤ ⋯ 𝑉1⊤)𝐝0𝐝0⊤(𝑉1 ⋯ 𝑉𝑛−1)
+ 𝜌1(𝑉𝑛−1⊤ ⋯ 𝑉2⊤)𝐝1𝐝1⊤(𝑉2 ⋯ 𝑉𝑛−1)
+ ⋯
+ 𝜌𝑛−1𝐝𝑛−1𝐝𝑛−1⊤. (12.27)
The starting value is chosen as
𝐻𝑛,0⁻¹ ∶= (𝐝𝑛−1⊤𝐲𝑛−1)/(𝐲𝑛−1⊤𝐲𝑛−1) 𝐼. (12.28)
The scaling factor in front of the identity matrix estimates the size of the true
Hessian matrix along the most recent search direction, which helps to keep the
scale of the search direction correct and hence a step length of one is accepted
in most iterations. Based on the starting value and using only the last 𝑚 vectors,
the expanded iteration becomes
𝐻𝑛⁻¹ = (𝑉𝑛−1⊤ ⋯ 𝑉𝑛−𝑚⊤)𝐻𝑛,0⁻¹(𝑉𝑛−𝑚 ⋯ 𝑉𝑛−1)
+ 𝜌𝑛−𝑚(𝑉𝑛−1⊤ ⋯ 𝑉𝑛−𝑚+1⊤)𝐝𝑛−𝑚𝐝𝑛−𝑚⊤(𝑉𝑛−𝑚+1 ⋯ 𝑉𝑛−1)
+ 𝜌𝑛−𝑚+1(𝑉𝑛−1⊤ ⋯ 𝑉𝑛−𝑚+2⊤)𝐝𝑛−𝑚+1𝐝𝑛−𝑚+1⊤(𝑉𝑛−𝑚+2 ⋯ 𝑉𝑛−1)
+ ⋯
+ 𝜌𝑛−1𝐝𝑛−1𝐝𝑛−1⊤. (12.29)
1. Define 𝐪 ∶= 𝐯.
2. For 𝑖 ∶= 𝑛 − 1, 𝑛 − 2, … , 𝑛 − 𝑚:
a. Set 𝛼𝑖 ∶= 𝜌𝑖𝐝𝑖⊤𝐪.
b. Set 𝐪 ∶= 𝐪 − 𝛼𝑖𝐲𝑖.
3. Set 𝐫 ∶= 𝐻𝑛,0⁻¹𝐪.
4. For 𝑖 ∶= 𝑛 − 𝑚, 𝑛 − 𝑚 + 1, … , 𝑛 − 1:
a. Set 𝛽 ∶= 𝜌𝑖𝐲𝑖⊤𝐫.
b. Set 𝐫 ∶= 𝐫 + (𝛼𝑖 − 𝛽)𝐝𝑖.
5. Return 𝐫, which is equal to 𝐻𝑛⁻¹𝐯.
In the first loop, the factors to the right of 𝐻𝑛,0⁻¹ are calculated. In the second loop, the multiplications on the left are performed, and the terms are added. Note that this approach works only for the products 𝐻𝑛⁻¹𝐪, but not for calculating the whole matrices 𝐻𝑛⁻¹.
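The two-loop recursion of Algorithm 12.30 can be sketched in Julia as follows; ds and ys hold the 𝑚 stored vectors 𝐝𝑖 and 𝐲𝑖 (oldest first), and the function and variable names are illustrative assumptions.

```julia
using LinearAlgebra

# Two-loop recursion: computes Hₙ⁻¹ v from the m stored pairs (dᵢ, yᵢ)
# without ever forming a matrix; the scaling (12.28) defines H_{n,0}⁻¹.
function two_loop(v, ds, ys)
    m = length(ds)
    q = copy(v)
    alphas = zeros(m)
    for i in m:-1:1                  # first loop: newest to oldest
        rho = 1/dot(ys[i], ds[i])
        alphas[i] = rho*dot(ds[i], q)
        q -= alphas[i]*ys[i]
    end
    r = (dot(ds[m], ys[m])/dot(ys[m], ys[m]))*q   # apply (12.28)
    for i in 1:m                     # second loop: oldest to newest
        rho = 1/dot(ys[i], ds[i])
        beta = rho*dot(ys[i], r)
        r += (alphas[i] - beta)*ds[i]
    end
    r                                # equals Hₙ⁻¹ v
end
```

For 𝑚 = 1 the result agrees with forming the dense matrix 𝑉⊤𝐻𝑛,0⁻¹𝑉 + 𝜌𝐝𝐝⊤ explicitly and multiplying it by 𝐯, which is a convenient correctness check.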
Assembling all pieces, we arrive at the l-bfgs algorithm (cf. Algorithm 12.29).
𝐬𝑛 ∶= −𝐻𝑛⁻¹∇𝑓(𝐱𝑛)
𝐱𝑛+1 ∶= 𝐱𝑛 + 𝛼𝑛 𝐬𝑛 ,
𝐝𝑛 ∶= 𝐱𝑛+1 − 𝐱𝑛 ,
𝐲𝑛 ∶= ∇𝑓(𝐱𝑛+1 ) − ∇𝑓(𝐱𝑛 ).
The fact that l-bfgs only uses recent information and discards older itera-
tions in contrast to bfgs, which uses information from all previous iterations,
can be viewed as an advantage. While l-bfgs is significantly faster than bfgs
when the number of variables is large and can perform nearly as well, the opti-
mal choice of the parameter 𝑚 in the algorithm depends on the problem class,
which is a disadvantage of l-bfgs. When l-bfgs fails, one should therefore try
to increase the parameter 𝑚.
Many variants of the l-bfgs method have been developed. For example, the
limited-memory algorithm in [5, 21, 12] can solve optimization problems with
simple bounds.
Problems
12.5 Calculate the integral in (12.7) and plot the sum and its approximate upper
bound used in the proof of Corollary 12.20.
12.9 Apply Algorithm 12.22 to a benchmark problem in Sect. 11.8 using a start-
ing point you have found by global optimization.
12.10 Explain the meaning of the end points of the intervals 𝑐1 ∈ (0, 1) and
𝑐2 ∈ (𝑐1 , 1) in the Wolfe conditions (12.14) and (12.15).
12.11 Implement Algorithm 12.26, which satisfies the Wolfe conditions (12.14)
and (12.15).
12.12 Apply Algorithm 12.26 to a benchmark problem in Sect. 11.8 using a start-
ing point you have found by global optimization.
12.14 Prove the rank-two update formula (12.24) by substituting the definition
(12.23) of 𝐻𝑛+1 into the quasi-Newton condition (12.21) and following the text.
12.15 Prove (12.25) by showing that 𝐻𝑛+1⁻¹𝐻𝑛+1 = 𝐼 or by showing that 𝐻𝑛+1𝐻𝑛+1⁻¹ = 𝐼.
12.16 Prove: Suppose that 𝐻𝑛 is symmetric; then 𝐻𝑛+1 defined in (12.24) is sym-
metric as well.
12.17 * Prove: Suppose that 𝐻𝑛 is symmetric and positive definite and that
𝐲𝑛⊤ 𝐝𝑛 > 0; then 𝐻𝑛+1 defined in (12.24) is positive definite as well.
12.19 Apply Algorithm 12.29 to a benchmark problem in Sect. 11.8 using a start-
ing point you have found by global optimization.
12.20 * Prove that the unique solution of the minimization problem (12.26) is
the bfgs update (12.25).
12.25 Apply Algorithm 12.31 to a benchmark problem in Sect. 11.8 using a start-
ing point you have found by global optimization.
12.26 Compare Algorithm 12.29 and Algorithm 12.31 for different values of the
parameter 𝑚.
References
1. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse
problems. SIAM J. Imaging Sciences 2(1), 183–202 (2009)
2. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cam-
bridge, UK (2004)
3. Broyden, C.: The convergence of a class of double-rank minimization algorithms 1. General considerations. IMA Journal of Applied Mathematics 6(1), 76–90 (1970)
Abstract Artificial neural networks were first conceived decades ago and have
been an important part of machine learning and artificial intelligence ever since.
The basic idea behind neural networks is to combine linear combinations and
nonlinear functions into layers such that they can approximate arbitrary func-
tions. Neural networks have been used to great effect for example in image
classification and recognition once the computational power and suitable algo-
rithms to train large neural networks had become available. In this chapter, we
implement a neural network and train it using backpropagation in a fully self-
contained program. Using tens of thousands of scanned images, we train the
neural network to recognize handwritten digits and discuss important aspects
of training neural networks.
13.1 Introduction
(Figure 13.1: an input layer, three hidden layers, and an output layer; the output of each layer is a vector.)
Fig. 13.1 Schematic diagram of an artificial neural network. Not all dependencies between ad-
jacent layers are shown in order not to overload the figure; in general, each neuron influences
each neuron in the next layer.
plicated (and still mostly unknown) than the ones between adjacent layers indi-
cated in Fig. 13.1.
More general dependencies are of course possible in artificial neural networks
and have been investigated. The computational reason why an arrangement in
layers is preferable is that it makes training the neural network (see Sections 13.6
and 13.7) much easier than in the general case when circular dependencies are
allowed.
Neural networks have become a vast field with many applications in super-
vised machine learning. Supervised learning means that the input cases and
the output cases are known and that their (possibly highly complicated) func-
tional relationship is to be learned while the data may be noisy. Since the ar-
rangement in Fig. 13.1 resembles some structures in the visual cortex of the
brain, it may be unsurprising that large enough neural networks with a certain
number and structure of the hidden layers have been used very successfully in
Before we can evaluate the function given by a neural network, we still must
specify the neural network more precisely. We denote the vector-valued output
of layer 𝑘 − 1 by 𝐱(𝑘−1) . The action of layer 𝑘 is given by applying a linear (or
more precisely, an affine) function and then a nonlinear function elementwise.
The vector output 𝐱(𝑘) of layer 𝑘 can therefore be written as
𝐱(𝑘) = 𝜎(𝑊(𝑘)𝐱(𝑘−1) + 𝐛(𝑘)),
where the matrix 𝑊(𝑘) is called the weight matrix, the vector 𝐛(𝑘) contains the
so-called biases, and the activation function 𝜎 ∶ ℝ → ℝ is applied elementwise
to its argument.
It is an important feature of neural networks that the activation function is
nonlinear. Otherwise, if it were linear, the whole network would just be a linear
function as a composition of linear functions. There are two main considerations
important for the choice of the activation function: first, it should facilitate the
kind of functional relationship between the known input and output vectors of
the learning problem at hand, and secondly, it should be expedient for the com-
putations required for learning (see Sections 13.6 and 13.7).
The first requirement is generally hard to satisfy a priori and without any
experience or experimentation. The question of the shape of the output layer in
the handwriting-recognition problem is an example of such considerations and
is discussed below.
The second requirement is especially important for deep neural networks, i.e.,
for networks with many hidden layers, and is discussed using the example of the
choice of the activation function 𝜎 in the following.
Various popular choices of activation functions are the following. The first is
the sigmoid function
𝜎1(𝑥) ∶= 1/(1 + e⁻ˣ),
also called the logistic function. The second is the leaky rectifier
𝜎2(𝑥) ∶= 𝑥 if 𝑥 ≥ 0 and 𝜎2(𝑥) ∶= 𝛼𝑥 if 𝑥 < 0.
𝜎5 (𝑥) ∶= tanh(𝑥).
module NeuralNetwork  # the module name is illegible in the source scan

import GZip
import LinearAlgebra
import MLDatasets
import Printf
import PyPlot
import Random

struct Activation
    f::Function
    d::Function
end

function sigma(x)
    1/(1+exp(-x))
end
The next data type Cost contains the cost function to be used. In short, a
cost function measures the difference between the output of the neural network
and the correct output to be learned. We already define two cost functions at
this point so that we can evaluate our neural network later. Cost functions are
discussed in more detail in Sect. 13.5.
struct Cost
    f::Function
    delta::Function
end
function quadratic_cost(activation::Activation)::Cost
    Cost((a, y) -> 0.5 * LinearAlgebra.norm(a-y)^2,
         (z, a, y) -> (a-y) .* activation.d.(z))
end
The data structure Network contains the weights and biases of all layers with additional information. The weights are matrices and the biases are vectors. The vector sizes contains the numbers of neurons in each layer. The final four vectors record the progress during training as we will see later. The function Network is a custom constructor and only requires the sizes of the layers, but takes three keyword arguments. It calls the function new, only available in this context, to construct the instance (see Sect. 5.5).
The weights and biases are initialized with normally distributed random num-
bers. If a weight matrix is large, its product with the output of the previous layer
tends to be large as well. This hinders learning with activation functions whose
derivatives are small for large arguments. Therefore it is generally useful to scale
the weight matrices such that products of the weight matrices with columns of
all ones are still normally distributed with variance one; this is achieved by the
scaling factor sqrt(j), which is the square root of the size of the previous layer.
mutable struct Network
    activation::Activation
    cost::Cost
    n_layers::Int
    sizes::Vector{Int}
    weights::Vector{Array{Float64, 2}}
    biases::Vector{Vector{Float64}}
    training_cost::Vector{Float64}
    validation_cost::Vector{Float64}
    training_accuracy::Vector{Float64}
    validation_accuracy::Vector{Float64}
end
Having defined these data structures, we can construct our first neural network by evaluating Network([100, 10, 1]).
Evaluating a neural network is commonly referred to as feeding forward. We loop over all weight matrices and bias vectors simultaneously using zip. In each iteration, the activation a of the previous layer is transformed linearly and then the activation function nn.activation.f is applied elementwise.
function feed_forward(nn::Network, input::Vector)::Vector
    local a = input
    for (W, b) in zip(nn.weights, nn.biases)
        a = nn.activation.f.(W*a + b)
    end
    a
end
Before we train our neural network, the question arises whether neural networks such as the ones shown in Fig. 13.1 can approximate arbitrary functions to be learned or not. This question is fundamental: if arbitrary relationships between the input and output could not be represented by such functions, it would be absurd to try to train neural networks. Fortunately, the answer to this question is positive.
As the following theorem shows, neural networks 𝜙 without any hidden layer,
but whose output consists of a linear combination of a sufficiently large num-
ber 𝑛 of neurons, are already capable of approximating any given continuous
function 𝑓 arbitrarily well on compact subsets of ℝ𝑑 [1, 2]. The assumptions on
the activation function 𝜎 are lenient. The restriction that the function to be ap-
proximated must be continuous is understandable in view of the fact that neural
networks are continuous functions.
𝜙(𝐱) ∶= ∑ᵢ₌₁ⁿ 𝑣𝑖 𝜎(𝐰𝑖 ⋅ 𝐱 + 𝑏𝑖),
𝜓𝑗(𝑥) ∶= 𝑐𝑗 if 𝑥 ∈ [𝑎𝑗, 𝑏𝑗) and 𝜓𝑗(𝑥) ∶= 0 otherwise.
By using a hidden layer with 2𝑚 neurons, we can hence approximate any piece-
wise constant function
𝜓(𝑥) = ∑ⱼ₌₁ᵐ 𝜓𝑗(𝑥) ≈ (𝜎⁻¹∘𝑔)(𝑥)
𝜓1𝑗(𝑥1, 𝑥2) ∶= 𝑐1𝑗 if 𝑥1 ∈ [𝑎1𝑗, 𝑏1𝑗) and 0 otherwise,
𝜓2𝑘(𝑥1, 𝑥2) ∶= 𝑐2𝑘 if 𝑥2 ∈ [𝑎2𝑘, 𝑏2𝑘) and 0 otherwise
in the first hidden layer. These four neurons are combined in the second hidden
layer to approximate functions of the form
𝜓𝑗𝑘(𝑥1, 𝑥2) ∶= 𝑐𝑗𝑘 if (𝑥1, 𝑥2) ∈ [𝑎1𝑗, 𝑏1𝑗) × [𝑎2𝑘, 𝑏2𝑘) and 0 otherwise,
which are nonzero only on a rectangle. Using these functions 𝜓𝑗𝑘 , we can approx-
imate any continuous function 𝑓 ∈ 𝐶(ℝ2 , ℝ) by the piecewise (on rectangles)
constant function
𝜓(𝑥1, 𝑥2) ∶= ∑𝑗 ∑𝑘 𝜓𝑗𝑘(𝑥1, 𝑥2) ≈ (𝜎⁻¹∘𝑓)(𝑥1, 𝑥2)
and finally applying the activation function 𝜎 in the output layer. This argument
can be generalized from two to 𝑑 dimensions.
In the third and last step, we make the assumption that the neurons are approximated by step functions superfluous. In the one-dimensional case 𝐶(ℝ, ℝ), we choose intervals of the same length. The error due to the smoothed step functions occurs where the intervals meet. We can make this error arbitrarily small by adding a large number 𝑁 of shifted approximations of the function 𝑓(𝑥)/𝑁, because then each point 𝑥 is only affected by the error due to one shifted approximation, which decreases as 𝑁 increases. This idea can be generalized to the multidimensional case. This concludes the sketch of the proof. □
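The first step of the sketch can be illustrated numerically: a steep sigmoid approximates a step function, and the difference of two such neurons approximates the indicator function of an interval. The steepness factor k and the function names below are illustrative assumptions.

```julia
# A steep sigmoid σ(k(x - a)) approximates the step at a; the difference
# of two such neurons approximates the indicator function of [a, b).
sigma(x) = 1/(1 + exp(-x))
indicator_approx(x, a, b; k = 1000) = sigma(k*(x - a)) - sigma(k*(x - b))
```

For example, indicator_approx(0.5, 0.25, 0.75) is close to one, while indicator_approx(0.9, 0.25, 0.75) is close to zero, so linear combinations of such pairs approximate piecewise constant functions.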
Fig. 13.2 The digits from 0 to 9 from the beginning of the mnist training set.
function MNIST_read_images(filename::String)
    GZip.open(filename, "r") do s
        local magic_number = bswap(read(s, UInt32))
        local n_items = Int(bswap(read(s, UInt32)))
        global MNIST_n_rows = Int(bswap(read(s, UInt32)))
        global MNIST_n_cols = Int(bswap(read(s, UInt32)))
        [Vector{Float64}(read!(s, Array{UInt8}(undef,
                                               MNIST_n_rows*MNIST_n_cols))) ./
         typemax(UInt8)
         for i in 1:n_items]
    end
end
function MNIST_read_labels(filename::String)
    GZip.open(filename, "r") do s
        local magic_number = bswap(read(s, UInt32))
        local n_items = Int(bswap(read(s, UInt32)))
        Int.(read!(s, Array{UInt8}(undef, n_items)))
    end
end
The function vectorize takes a label, i.e., an integer 𝑛 between zero and nine, and yields a vector of length ten whose element with index 𝑛 + 1 is equal to one, while all other elements are equal to zero. The reason for expanding the label into a vector
was already discussed in Sect. 13.2: if the number of classes is large, then neural
networks learn much better when each class corresponds to a designated neuron
in the output layer. The class assigned to a given input is the one whose output
neuron has the largest value.
function vectorize(n::Integer)
    local result = zeros(10)
    result[n+1] = 1
    result
end
Finally, the function load_MNIST_data yields the input and output values in six vectors, and six global variables are defined.
    (train_x[1:50_000], vectorize.(train_y[1:50_000]),
     train_x[50_001:60_000], train_y[50_001:60_000],
     test_x, test_y)
end
The following function was used to plot the images in Fig. 13.2. The PyPlot package was already imported at the beginning of the module.
function plot_digit(n::Int, file = nothing)
    local v
    if 1 <= n <= 50_000
        v = training_data_x[n]
    elseif 50_001 <= n <= 60_000
        v = validation_data_x[n-50_000]
    elseif 60_001 <= n <= 70_000
        v = test_data_x[n-60_000]
    end
    if isa(file, String)
        PyPlot.savefig(file * string(n) * ".pdf",
                       bbox_inches = "tight", pad_inches = 0)
    end
end
In order to train a neural network, we must be able to measure how well its out-
put agrees with the given labels of the data. Functions that measure this differ-
ence are called cost functions, loss functions, or objective functions, and many
choices exist.
We denote the items in the given training data by vectors 𝐱 ∈ ℝ784 , which
in our application is 28 ⋅ 28 = 784 dimensional and represents an image. These
items serve as input to the neural network. The corresponding labels in the train-
ing data are denoted by 𝐲(𝐱) ∈ ℝ10, where each of the ten elements or neurons
corresponds to one digit. In other words, the function 𝐲 ∶ ℝ784 ⊃ 𝑇 → ℝ10 rep-
resents the given training data. Furthermore, we denote the neural network by
the function 𝐚 ∶ ℝ784 → ℝ10 , whose value is the activation of the output layer.
One of the most popular cost functions is the quadratic cost function
𝐶2(𝑊, 𝐛) ∶= (1/(2|𝑇|)) ∑𝐱∈𝑇 ‖𝐲(𝐱) − 𝐚(𝐱)‖₂²,
also called the mean squared error. The sum is over all |𝑇| elements 𝐱 of the
training set 𝑇. The cost function is a function of the parameters of the neural
network, which are denoted by 𝑊 and 𝐛 for the collection of all weights and
biases. The reason for the factor 1∕|𝑇| is that it makes the values of the cost
function comparable whenever the number |𝑇| of the training items changes.
The reason for the factor 1∕2 is that it removes the factor 2 in the derivative of
the cost function, which we will use shortly. The factors are, of course, irrelevant
when the cost function is minimized.
It goes without saying that other norms can be used instead of the Euclidean
norm. Different choices of cost functions generally lead to different neural net-
works, and an expedient choice generally depends on the problem at hand.
Another popular cost function is the cross-entropy cost function
𝐶CE(𝑊, 𝐛) ∶= −(1/|𝑇|) ∑𝐱∈𝑇 (𝐲(𝐱) ⋅ ln 𝐚(𝐱) + (𝟏 − 𝐲(𝐱)) ⋅ ln(𝟏 − 𝐚(𝐱))), (13.1)
where the logarithm is applied elementwise to its vector argument. The cost func-
tion, the activation function of the output layer, and how fast a network learns
are closely related. This relationship is discussed at the end of the next section,
where we will also see how the expression in 𝐶CE is obtained.
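For a single training pair, the two cost functions can be sketched as follows; the 1/|𝑇| averaging over the training set is omitted here, and the function names are illustrative assumptions.

```julia
# Cost of a single training pair: a is the network output, y the label
# vector; averaging over the training set T then gives C₂ and C_CE.
quadratic_cost(a, y) = 0.5*sum((y .- a).^2)
cross_entropy_cost(a, y) =
    -sum(y .* log.(a) .+ (1 .- y) .* log.(1 .- a))
```

For example, for the label y = (0, 1) and the output a = (0.2, 0.9), the quadratic cost is approximately 0.025 and the cross-entropy cost is −ln 0.8 − ln 0.9.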
In order to improve a neural network, we aim to find weights and biases that min-
imize the cost function. Since neural networks generally have a large number of
weights and biases, this is usually a high-dimensional optimization problem.
To minimize the cost function, we use a version of gradient descent called
stochastic gradient descent. Other optimization methods can of course be used
depending on the size of the optimization problem (see Chap. 11 and Chap. 12).
How does gradient descent work? Gradient descent was already discussed in
Chap. 12, but we recapitulate the main idea here using the current notation. To
simplify notation, we collect all parameters of the network, i.e., all weights and
biases, in the vector 𝐩. Hence the gradient of the cost function 𝐶 is
∇𝐶 = (𝜕𝐶/𝜕𝑝1, … , 𝜕𝐶/𝜕𝑝𝑛)⊤.
The directional derivative
𝜕𝐶/𝜕𝐞 (𝐩)
is the derivative of 𝐶 at 𝐩 in the direction of the unit vector 𝐞 and can be written
as
𝜕𝐶/𝜕𝐞 (𝐩) = ∇𝐶(𝐩) ⋅ 𝐞
using the gradient of 𝐶 at 𝐩.
Starting at the point 𝐩, we would like to take a (small) step in the direction that
minimizes 𝐶(𝐩) the most. How can we find a direction 𝐞 in which the function 𝐶
changes the most? The Cauchy–Bunyakovsky–Schwarz inequality, Theorem 8.1,
implies the inequality
|𝜕𝐶/𝜕𝐞 (𝐩)| = |∇𝐶(𝐩) ⋅ 𝐞| ≤ ‖∇𝐶(𝐩)‖,
since ‖𝐞‖ = 1; equality in the Cauchy–Bunyakovsky–Schwarz inequality holds if
and only if one vector is a multiple of the other. The inequality therefore means
that the multiples of the gradient are the directions in which the directional
derivative changes the most.
If the direction is 𝐞 = ∇𝐶(𝐩), then the directional derivative is
𝜕𝐶/𝜕𝐞 (𝐩) = ∇𝐶(𝐩) ⋅ ∇𝐶(𝐩) = ‖∇𝐶(𝐩)‖² ≥ 0,
and taking a step in this direction increases 𝐶; this is called gradient ascent. On
the other hand, if the direction is 𝐞 = −∇𝐶(𝐩), then the directional derivative is
𝜕𝐶/𝜕𝐞 (𝐩) = −∇𝐶(𝐩) ⋅ ∇𝐶(𝐩) = −‖∇𝐶(𝐩)‖² ≤ 0,
and taking a step in this direction decreases 𝐶; this is called gradient descent.
Since we aim to minimize the function 𝐶, we define the step Δ𝐩 to take at the
point 𝐩 as
Δ𝐩 ∶= −𝜂∇𝐶(𝐩),
where 𝜂 ∈ ℝ+ is called the learning rate. The directional derivative implies that
the change Δ𝐶 in the function value is approximately given by
Δ𝐶 ≈ ∇𝐶(𝐩) ⋅ Δ𝐩,
which yields
Δ𝐶 ≈ −𝜂∇𝐶(𝐩) ⋅ ∇𝐶(𝐩) = −𝜂‖∇𝐶(𝐩)‖2 ≤ 0
for our choice of the step Δ𝐩; the function value indeed decreases.
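These few lines illustrate the calculation on the simple cost 𝐶(𝐩) = ‖𝐩‖² with gradient 2𝐩; the function names and the learning rate are illustrative assumptions.

```julia
# Gradient descent steps Δp = -η ∇C(p) on C(p) = ‖p‖²; the cost
# decreases in every step for a small enough learning rate η.
C(p) = sum(p.^2)
grad_C(p) = 2 .* p
function descend(p, eta, n_steps)
    for n in 1:n_steps
        p = p .- eta .* grad_C(p)   # Δp := -η ∇C(p)
    end
    p
end
```

With η = 0.1, each step multiplies 𝐩 by 0.8, so descend([4.0, -2.0], 0.1, 50) is close to the minimizer 𝟎.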
Having discussed the methods of gradient ascent and descent, we now use a
variant called stochastic gradient descent to adjust the parameters of the neural
network. Stochastic gradient descent is the most common basic method to min-
imize the cost function. It just means that the training data are split randomly
into batches and that gradient descent is performed for each batch. A reason for
doing so is that in practice the number of training items is very large so that
working with batches is more manageable. Reasonably large batches are also
usually already sufficient to obtain a good approximation of the gradient, and
the stochastic nature of stochastic gradient descent helps escape from local min-
ima. Furthermore, the gradients can be computed in parallel for all batches.
Stochastic gradient descent is implemented by the function SGD. The number
of steps in stochastic gradient descent is commonly called the number of epochs.
The meaning of the parameter 𝜆 will be discussed in Sect. 13.9.1; we suppose
that it vanishes for now. The function can also monitor the cost function and the
accuracy (i.e., how many items are classified correctly) during the epochs. It can
do so using the training data that must be supplied, but it can also use optional
validation data. The reasons for this additional capability will be discussed in
Sect. 13.8.
function SGD(nn::Network,
             training_data_x::Vector{Vector{Float64}},
             training_data_y::Vector{Vector{Float64}},
             epochs::Int, batch_size::Int, eta::Float64,
             lambda::Float64 = 0.0;
             validation_data_x::Vector{Vector{Float64}} = [],
             validation_data_y::Union{Vector{Int64},
                                      Vector{Vector{Float64}}} = [],
             monitor_training_cost = true,
             monitor_validation_cost = true,
             monitor_training_accuracy = true,
             monitor_validation_accuracy = true)
    nn.training_cost = []
    nn.validation_cost = []
    nn.training_accuracy = []
    nn.validation_accuracy = []
    for epoch in 1:epochs   # this line and the next were lost in extraction
        local perm = randperm(length(training_data_x))   # and are reconstructed
        for k in 1:batch_size:length(training_data_x)
            update!(nn,
                    training_data_x[perm[k:min(k+batch_size-1, end)]],
                    training_data_y[perm[k:min(k+batch_size-1, end)]],
                    eta, lambda, length(training_data_x))
        end
        if monitor_training_cost
            push!(nn.training_cost,
                  total_cost(nn, training_data_x, training_data_y, lambda))
            @info @Printf.sprintf("cost on training data: %f",
                                  nn.training_cost[end])
        end
        if monitor_validation_cost
            push!(nn.validation_cost,
                  total_cost(nn, validation_data_x, validation_data_y, lambda))
            @info @Printf.sprintf("cost on validation data: %f",
                                  nn.validation_cost[end])
        end
        if monitor_training_accuracy
            local a = accuracy(nn, training_data_x, training_data_y)
            local l = length(training_data_x)
            local r = a/l
            @info @Printf.sprintf("accuracy on training data: %5d / %5d = %5.1f%% correct",
                                  a, l, 100*r)
            push!(nn.training_accuracy, r)
        end
        if monitor_validation_accuracy
            local a = accuracy(nn, validation_data_x, validation_data_y)
            local l = length(validation_data_x)
            local r = a/l
            @info @Printf.sprintf("accuracy on validation data: %5d / %5d = %5.1f%% correct",
                                  a, l, 100*r)
            push!(nn.validation_accuracy, r)
        end
    end
    nn
end
The next four functions calculate the cost and the accuracy of a neural net-
work. Each function has two methods, one for labels that are integers and one
for vectorized labels.
function total_cost(nn::Network,
                    data_x::Vector{Vector{Float64}},
                    data_y::Vector{Int64}, lambda::Float64)::Float64
    total_cost(nn, data_x, vectorize.(data_y), lambda)
end

function total_cost(nn::Network,
                    data_x::Vector{Vector{Float64}},
                    data_y::Vector{Vector{Float64}},
                    lambda::Float64)::Float64
    sum(map((x, y) -> nn.cost.f(feed_forward(nn, x), y),
            data_x, data_y)) / length(data_x) +
    0.5 * lambda * sum(LinearAlgebra.norm(w)^2 for w in
                       nn.weights) / length(data_x)
end
function accuracy(nn::Network,
                  data_x::Vector{Vector{Float64}},
                  data_y::Vector{Int64})::Integer
    count(map((x, y) -> y == argmax(feed_forward(nn, x)) - 1,
              data_x, data_y))
end

function accuracy(nn::Network,
                  data_x::Vector{Vector{Float64}},
                  data_y::Vector{Vector{Float64}})::Integer
    accuracy(nn, data_x, map(y -> argmax(y) - 1, data_y))
end
The function update! adjusts the weights and biases of the network by the
gradient multiplied by the learning rate 𝜂 for each batch of training items.
Again, we suppose that 𝜆 = 0 for now. The gradients are calculated by the
function propagate_back, which will be discussed in the next section.
function update!(nn::Network,
                 batch_x::Vector{Vector{Float64}},
                 batch_y::Vector{Vector{Float64}},
                 eta::Float64, lambda::Float64, n::Int)::Network
    # The rest of the signature and the gradient accumulation were lost to a
    # page break; they are reconstructed here from the call site and the text.
    local (grad_W, grad_b) = reduce((s, t) -> s .+ t,
        (propagate_back(nn, x, y) for (x, y) in zip(batch_x, batch_y)))
    nn.weights =
        (1-eta*lambda/n)*nn.weights - (eta/length(batch_x))*grad_W
    nn.biases -= (eta/length(batch_x)) * grad_b
    nn
end
13.7 Backpropagation
In the feed-forward pass, the activations are calculated layer by layer as
𝐚(𝑙) = 𝜎(𝑊(𝑙)𝐚(𝑙−1) + 𝐛(𝑙)),
where the function 𝜎 is applied elementwise to its vector argument, and we de-
note the weighted input to the neurons in layer 𝑙 by
𝐳(𝑙) ∶= 𝑊(𝑙)𝐚(𝑙−1) + 𝐛(𝑙).
The goal of backpropagation is to calculate all the partial derivatives
𝜕𝐶∕𝜕𝑤𝑖𝑗(𝑙) and 𝜕𝐶∕𝜕𝑏𝑖(𝑙)
of the cost function 𝐶 with respect to all elements of the weight matrices 𝑊(𝑙)
and with respect to all elements of the bias vectors 𝐛(𝑙).
We make two assumptions on the cost function 𝐶, which are usually satisfied,
namely that it can be written as an average
𝐶(𝑊, 𝐛) = (1∕|𝑇|) ∑𝐱∈𝑇 𝐾(𝑊, 𝐛, 𝐱)
over all training samples 𝐱 in the training set 𝑇 and that it is a function of the
activation of the output layer of the neural network only, i.e., 𝐶 = 𝐶(𝐚(𝐿)). Both
cost functions in Sect. 13.5, the quadratic cost function and cross-entropy cost
function, satisfy these two assumptions.
To help apply the chain rule, we denote the partial derivative of the cost func-
tion 𝐶 with respect to the weighted input 𝑧𝑖(𝑙) to neuron 𝑖 in layer 𝑙 by
𝛿𝑖(𝑙) ∶= 𝜕𝐶∕𝜕𝑧𝑖(𝑙), (13.2)
which is called the error of neuron 𝑖 in layer 𝑙. First, applying the chain rule in
the output layer 𝐿 yields
𝛿𝑖(𝐿) = ∑𝑘 (𝜕𝐶∕𝜕𝑎𝑘(𝐿)) (𝜕𝑎𝑘(𝐿)∕𝜕𝑧𝑖(𝐿)),
where the sum is over all neurons 𝑘 in the output layer 𝐿. Since the activation
function 𝜎 is applied elementwise, the partial derivative 𝜕𝑎𝑘(𝐿)∕𝜕𝑧𝑖(𝐿) is nonzero
only if 𝑘 = 𝑖. Therefore we have
𝛿𝑖(𝐿) = (𝜕𝐶∕𝜕𝑎𝑖(𝐿)) 𝜎′(𝑧𝑖(𝐿)) ∀𝑖
or
𝛅(𝐿) = (𝜕𝐶∕𝜕𝐚(𝐿)) ⊙ 𝜎′(𝐳(𝐿)), (13.3)
a formula for the error in the output layer, where ⊙ denotes elementwise multi-
plication and 𝜎′ is applied elementwise to its vector argument.
Second, we derive an equation for the error 𝛅(𝑙) in terms of the error 𝛅(𝑙+1) .
The chain rule yields
𝛿𝑖(𝑙) = 𝜕𝐶∕𝜕𝑧𝑖(𝑙) = ∑𝑘 (𝜕𝐶∕𝜕𝑧𝑘(𝑙+1)) (𝜕𝑧𝑘(𝑙+1)∕𝜕𝑧𝑖(𝑙)) = ∑𝑘 (𝜕𝑧𝑘(𝑙+1)∕𝜕𝑧𝑖(𝑙)) 𝛿𝑘(𝑙+1).
Since
𝐳(𝑙+1) = 𝑊(𝑙+1)𝐚(𝑙) + 𝐛(𝑙+1) = 𝑊(𝑙+1)𝜎(𝐳(𝑙)) + 𝐛(𝑙+1)
and hence
𝑧𝑘(𝑙+1) = ∑𝑖 𝑤𝑘𝑖(𝑙+1) 𝜎(𝑧𝑖(𝑙)) + 𝑏𝑘(𝑙+1),
we find 𝜕𝑧𝑘(𝑙+1)∕𝜕𝑧𝑖(𝑙) = 𝑤𝑘𝑖(𝑙+1) 𝜎′(𝑧𝑖(𝑙)) and therefore
𝛿𝑖(𝑙) = ∑𝑘 𝑤𝑘𝑖(𝑙+1) 𝜎′(𝑧𝑖(𝑙)) 𝛿𝑘(𝑙+1)
or
𝛅(𝑙) = ((𝑊(𝑙+1))⊤ 𝛅(𝑙+1)) ⊙ 𝜎′(𝐳(𝑙)). (13.4)
Third, we can find the partial derivatives of the cost function 𝐶 with respect
to all weights and biases using the errors 𝛅(𝑙) . The chain rule yields
𝜕𝐶∕𝜕𝑤𝑖𝑗(𝑙) = ∑𝑘 (𝜕𝐶∕𝜕𝑧𝑘(𝑙)) (𝜕𝑧𝑘(𝑙)∕𝜕𝑤𝑖𝑗(𝑙)),
𝜕𝐶∕𝜕𝑏𝑖(𝑙) = ∑𝑘 (𝜕𝐶∕𝜕𝑧𝑘(𝑙)) (𝜕𝑧𝑘(𝑙)∕𝜕𝑏𝑖(𝑙)).
Furthermore, the equations 𝜕𝐶∕𝜕𝑧𝑘(𝑙) = 𝛿𝑘(𝑙) and 𝑧𝑘(𝑙) = ∑𝑗 𝑤𝑘𝑗(𝑙) 𝑎𝑗(𝑙−1) + 𝑏𝑘(𝑙)
yield
𝜕𝐶∕𝜕𝑤𝑖𝑗(𝑙) = 𝛿𝑖(𝑙) 𝑎𝑗(𝑙−1), (13.5a)
𝜕𝐶∕𝜕𝑏𝑖(𝑙) = 𝛿𝑖(𝑙). (13.5b)
The four equations in (13.3), (13.4), and (13.5) constitute the four fundamen-
tal equations of backpropagation. All the partial derivatives of the cost func-
tion 𝐶 are calculated via the errors 𝛅(𝑙) as well as the weighted inputs 𝐳(𝑙) and
the activations 𝐚 (𝑙) . The weighted inputs and the activations are found by sim-
ply feeding the network forward. Then we calculate the errors 𝛅(𝑙) by starting
with the error 𝛅(𝐿) of the last layer 𝐿 given by (13.3) and working recursively
towards lower layers by using (13.4) in order to obtain 𝛅(𝑙) from 𝛅(𝑙+1). Finally,
(13.5) yields the partial derivatives 𝜕𝐶∕𝜕𝑤𝑖𝑗(𝑙) and 𝜕𝐶∕𝜕𝑏𝑖(𝑙), since the errors and
activations are now known.
The backpropagation algorithm is very efficient, since it comprises only two
passes over the layers to calculate all partial derivatives. As such, it is approx-
imately twice as costly as feeding the network forward in order to evaluate
it. In the forward pass, the weighted inputs and activations are calculated, while
in the backward pass the errors and the partial derivatives are calculated.
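Before turning to the implementation, the four equations can be checked numerically on a tiny network. The code below is this example's own sketch (sigmoid activations, quadratic cost), independent of the book's module; one backpropagated gradient entry is compared to a central finite difference.

```julia
# Backpropagation by (13.3)-(13.5) on a tiny two-layer sigmoid network.
sigma(z) = 1 / (1 + exp(-z))
dsigma(z) = sigma(z) * (1 - sigma(z))

function forward(W, b, x)
    a = [x]; z = Vector{Vector{Float64}}()
    for (Wl, bl) in zip(W, b)
        push!(z, Wl * a[end] + bl)     # weighted inputs z^(l)
        push!(a, sigma.(z[end]))       # activations a^(l)
    end
    a, z
end

cost(W, b, x, y) = 0.5 * sum(abs2, forward(W, b, x)[1][end] - y)

function backprop(W, b, x, y)
    a, z = forward(W, b, x)
    delta = (a[end] - y) .* dsigma.(z[end])   # (13.3) for the quadratic cost
    grad_W = [zero(Wl) for Wl in W]
    grad_b = [zero(bl) for bl in b]
    for l in length(W):-1:1
        grad_W[l] = delta * a[l]'             # (13.5a)
        grad_b[l] = delta                     # (13.5b)
        l > 1 && (delta = (W[l]' * delta) .* dsigma.(z[l-1]))   # (13.4)
    end
    grad_W, grad_b
end

W = [[0.5 -0.3; 0.8 0.2], [0.7 -0.6; 0.1 0.9]]
b = [[0.1, -0.2], [0.3, 0.0]]
x, y = [0.4, 0.9], [1.0, 0.0]

grad_W, grad_b = backprop(W, b, x, y)
h = 1e-6
Wp = deepcopy(W); Wp[1][1, 2] += h
Wm = deepcopy(W); Wm[1][1, 2] -= h
fd = (cost(Wp, b, x, y) - cost(Wm, b, x, y)) / (2h)
println(abs(fd - grad_W[1][1, 2]))   # ≈ 0: backpropagation matches
```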
Backpropagation is implemented by the function propagate_back. First, the
two variables grad_W and grad_b for the partial derivatives as well as the two
variables z and a for the weighted inputs and activations are allocated. Note that
a[l] is equal to 𝐚(𝑙−1) such that a[1] records the input to the neural network.
The first loop is the forward pass, where all weighted inputs and activations are
calculated. Before the second loop, the variable delta is initialized as 𝛅(𝐿) given
by (13.3). Since the expression depends on the cost function, it is calculated by
the function stored in the delta field of the type Cost. The results grad_W[end]
and grad_b[end] are given by (13.5) for 𝑙 = 𝐿.
Then, in the second loop, the variable delta is updated recursively using
(13.4), and the results grad_W[l] and grad_b[l] are calculated by (13.5). Again
note that a[l] is equal to 𝐚(𝑙−1).
function propagate_back(nn::Network, x::Vector{Float64},
                        y::Vector{Float64})::Tuple
    local grad_W = [fill(0.0, size(W)) for W in nn.weights]
    local grad_b = [fill(0.0, size(b)) for b in nn.biases]
    # The allocations of z and a were lost in extraction and are reconstructed:
    local z = Vector{Vector{Float64}}(undef, nn.n_layers - 1)
    local a = Vector{Vector{Float64}}(undef, nn.n_layers)
    a[1] = x
    for (i, (W, b)) in enumerate(zip(nn.weights, nn.biases))
        z[i] = W * a[i] + b
        a[i+1] = nn.activation.f.(z[i])
    end
    # The next lines were lost to a page break; they are reconstructed from the
    # text and (13.3), (13.5). The argument order of nn.cost.delta is an
    # assumption.
    local delta = nn.cost.delta(a[end], y, z[end])
    grad_W[end] = delta * a[end-1]'
    grad_b[end] = delta
    for l in nn.n_layers-2:-1:1
        delta = (nn.weights[l+1]' * delta) .* nn.activation.d.(z[l])
        grad_W[l] = delta * a[l]'   # (13.5a); reconstructed
        grad_b[l] = delta           # (13.5b); reconstructed
    end
    (grad_W, grad_b)
end
At this point, the code of the module that implements our neural network is
complete.
end # module
We use the package name as a prefix to access the symbols in this package.
After a few minutes, we observe that the algorithm classifies more than 95% of
the digits in the validation data correctly, while the accuracy in the training data
is higher. The cost function is also smaller on the training data than on the vali-
dation data. Typical accuracies and costs for one hundred iterations of training
are shown in Figures 13.3 and 13.4. We will discuss these curves thoroughly in
the next section.
13.8 Hyperparameters and Overfitting
Training algorithms contain various parameters such as the learning rate 𝜂 and
the parameter 𝜆 here. The parameters that pertain to the training algorithm are
called hyperparameters in order to distinguish them from the parameters of the
neural network itself such as its number of layers, its weights, and its biases. Thus
the question naturally arises how these parameters should be chosen.
Unfortunately, there is no general answer to this question. Much depends
on the specific problem and the task the neural network shall solve. Whenever
there are unknown (hyper-)parameters, the idea of optimizing these parameters
suggests itself (see Chap. 11 and Chap. 12 for inspirations for optimization algo-
rithms to be used). In addition to optimizing continuous parameters such as the
Fig. 13.3 Training and validation accuracies as functions of the iteration number. The training
accuracies are generally larger than the validation accuracies.
learning rate, there are also discrete optimization problems. These concern dis-
crete parameters such as the numbers of layers of the neural network and their
sizes, the activation functions used in each layer, and even the type of the layer
(e.g., in convolutional neural networks).
The hyperparameters are found either manually or automatically by an op-
timization algorithm using the validation dataset. The validation dataset is also
useful in another regard. When training the network using stochastic gradient
descent as in the example at the end of the previous section, the training ac-
curacy is generally higher than the accuracy observed on a validation dataset
(see Fig. 13.3). The same effect is observed in the values of the cost function: the
cost function is smaller on the training set than on the validation dataset (see
Fig. 13.4).
These observations are easily explained: the training data are used in gradient
descent and therefore the cost function decreases and the accuracy increases
on the training dataset, in this case up to the 100-th iteration. The situation is
different on the validation dataset. There the accuracy stops increasing after
about fifteen iterations, and the cost stops decreasing after the same number
of iterations. This means that any improvement on the training data after this
number of iterations is highly unlikely to translate into any improvement on
new data. This effect is called overfitting.
Fig. 13.4 Training and validation costs as functions of the iteration number. The training costs
are generally smaller than the validation costs.
After this point, training fits the neural network only to idiosyncrasies and
noise in the training data, which is not only a waste of computational resources,
but it is even more importantly also detrimental for generalization. Generaliza-
tion is the ultimate goal in machine learning; it means that the essence hidden in
the data is learned and that the idiosyncrasies of the training data are discarded.
Since neural networks contain many parameters while the amount of avail-
able data is always limited, they are a very versatile tool, but at the same time the
parameters are comparatively ill specified. Hence neural networks are prone to
overfitting, as the many parameters are usually easily fitted to the training data.
Therefore overfitting requires careful consideration.
Early stopping is a simple strategy to overcome overfitting; it means that
training is stopped as soon as the accuracy on the validation dataset stops
increasing. Unfortunately, it is not obvious when to stop, since stochastic gradient
descent is stochastic by its nature and because the accuracy of a neural network
may reach a plateau during training and then improve again.
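A minimal sketch of early stopping with a patience parameter (the function name and the patience rule are this example's assumptions, not the book's):

```julia
# Early stopping on a recorded validation-accuracy curve: stop when the
# accuracy has not improved for `patience` consecutive epochs and report the
# epoch of the best accuracy.
function early_stopping(val_accuracy::Vector{Float64}; patience::Int = 5)
    best, best_epoch = -Inf, 0
    for (epoch, acc) in enumerate(val_accuracy)
        if acc > best
            best, best_epoch = acc, epoch
        elseif epoch - best_epoch >= patience
            return best_epoch          # stop training; best epoch so far
        end
    end
    best_epoch                          # patience never exceeded
end

println(early_stopping([0.1, 0.2, 0.3, 0.29, 0.28, 0.28, 0.27, 0.26]))  # 3
```

The patience makes the rule robust against the plateaus mentioned above, at the price of a few extra epochs.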
Clearly, overfitting is reduced (and generalization is improved) by using more
training data, but the amount of training data is often given and cannot be
changed easily (but see Problem 13.12). We can also reduce the size of the neural
network to reduce the number of parameters to be determined. This is generally
a good approach, but care must be taken not to reduce the network so much that
it cannot learn the essence hidden in the data anymore.
After the hyperparameters have been chosen and training has been estab-
lished while avoiding overfitting, the success of the whole procedure must be
assessed on a third dataset never used before. This third dataset is called the test
dataset, and it is the arbiter for the accuracy that has been achieved.
In summary, it is prudent to split all the available data into three sets, namely
the training, the validation, and the test datasets, with the following purposes.
• The training dataset is used for minimizing the cost function.
• The validation dataset is used for finding hyperparameters and for prevent-
ing overfitting (e.g., using early stopping).
• The test dataset is used for assessing the success of the whole training pro-
cedure using untouched data.
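In code, the split can be sketched as follows (the function name and the 80/10/10 proportions are this example's assumptions):

```julia
using Random

# Split all available data randomly into training, validation, and test
# datasets in the proportions 80/10/10.
function split_data(xs::Vector, ys::Vector; rng = Random.default_rng())
    n = length(xs)
    perm = randperm(rng, n)
    n_train = floor(Int, 0.8 * n)
    n_val = floor(Int, 0.1 * n)
    train = perm[1:n_train]
    val = perm[n_train+1:n_train+n_val]
    test = perm[n_train+n_val+1:end]
    (xs[train], ys[train]), (xs[val], ys[val]), (xs[test], ys[test])
end

train, val, test = split_data(collect(1:100), collect(101:200))
println(length.((train[1], val[1], test[1])))   # (80, 10, 10)
```

Shuffling before splitting matters: sorted or grouped data would otherwise yield unrepresentative subsets.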
13.9 Improving Training
In this section, we discuss two methods for improving the training of neural
networks. The first is regularization, which we have implemented already, but
which still needs to be explained. The second is the choice of cost function and
how it affects training.
13.9.1 Regularization
In 𝓁2-regularization, a term penalizing large weights is added to the original
cost function 𝐶0, resulting in the regularized cost function
𝐶𝓁2(𝑊, 𝐛, 𝜆) ∶= 𝐶0(𝑊, 𝐛) + (𝜆∕(2|𝑇|)) ‖𝐰‖₂²
= 𝐶0(𝑊, 𝐛) + (𝜆∕(2|𝑇|)) ∑𝑘 𝑤𝑘²
= 𝐶0(𝑊, 𝐛) + (𝜆∕(2|𝑇|)) ∑𝑙,𝑖,𝑗 (𝑊𝑖𝑗(𝑙))²,
where 𝜆 ∈ ℝ₀⁺ is called the regularization parameter. The factor 1∕|𝑇| occurs
since it is also part of 𝐶0 . The biases are not affected by regularization as ex-
plained below.
Of course, any other norm of the weights such as the 𝓁𝑝 -norms can be used,
resulting in the more general definition
𝐶𝓁𝑝(𝑊, 𝐛, 𝜆) ∶= 𝐶0(𝑊, 𝐛) + (𝜆∕(𝑝|𝑇|)) ‖𝐰‖𝑝ᵖ = 𝐶0(𝑊, 𝐛) + (𝜆∕(𝑝|𝑇|)) ∑𝑘 |𝑤𝑘|ᵖ.
We first discuss how this modification of the cost function affects learning. It
means that smaller weights are preferred all other things being equal. The size of
the regularization parameter 𝜆 determines the relative importance of minimiz-
ing the original cost function 𝐶0 and minimizing the weights.
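The regularization terms themselves are cheap to evaluate; a sketch for a vector of weight matrices (the helper names are this example's own, not the book's):

```julia
using LinearAlgebra

# The ℓ2 regularization term and the general ℓp term of the cost function
# for a vector of weight matrices.
l2_term(weights, lambda, T) = lambda / (2T) * sum(norm(W)^2 for W in weights)
lp_term(weights, lambda, T, p) =
    lambda / (p * T) * sum(abs(w)^p for W in weights for w in W)

Ws = [[1.0 2.0; -2.0 0.0]]
println(l2_term(Ws, 0.1, 10))      # 0.1 / 20 * 9 = 0.045
println(lp_term(Ws, 0.1, 10, 2))   # coincides with the ℓ2 term for p = 2
```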
Why does regularization reduce overfitting? Regularized neural networks
tend to contain smaller weights, which implies that the output of the network
does not change much when small perturbations are added to the input in con-
trast to unregularized neural networks with larger weights. In other words, it is
more difficult for regularized neural networks to learn randomness in the train-
ing data; the smaller weights must learn the features present in the training data.
In other words, the larger weights of an unregularized network can adjust better
to noise and thus facilitate overfitting.
This explanation also justifies why the biases are not included in regulariza-
tion: large biases do not affect the sensitivity of the neural network to perturba-
tions or noise.
Regularization as modification of the cost function is implemented in a
straightforward manner in backpropagation. Because of the derivatives
𝜕𝐶𝓁2∕𝜕𝐰 = 𝜕𝐶0∕𝜕𝐰 + (𝜆∕|𝑇|) 𝐰,
𝜕𝐶𝓁2∕𝜕𝐛 = 𝜕𝐶0∕𝜕𝐛,
the steps in gradient descent become
Δ𝐰 ∶= −𝜂 𝜕𝐶0∕𝜕𝐰 − (𝜂𝜆∕|𝑇|) 𝐰,
Δ𝐛 ∶= −𝜂 𝜕𝐶0∕𝜕𝐛.
After adding the step Δ𝐰 to 𝐰, the new weight is
(1 − 𝜂𝜆∕|𝑇|) 𝐰 − 𝜂 𝜕𝐶0∕𝜕𝐰,
which is calculated at the end of the function update!. This concludes the expla-
nation of the implementation.
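The first term is pure weight decay: ignoring the gradient of 𝐶0, every weight shrinks by the constant factor 1 − 𝜂𝜆∕|𝑇| in each step. A one-line numerical illustration with hypothetical hyperparameter values:

```julia
# Pure weight decay: the weight after `steps` updates when the gradient of C0
# is ignored; hyperparameter values below are hypothetical.
weight_after(w0, eta, lambda, T, steps) = w0 * (1 - eta * lambda / T)^steps

println(weight_after(1.0, 0.5, 3.0, 50, 100))   # (1 - 0.03)^100 ≈ 0.0476
```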
Fig. 13.5 Training and validation accuracies as functions of the iteration number with and
without regularization.
Figures 13.5 and 13.6 show the accuracies and costs evaluated on training
and validation data with and without regularization. Fig. 13.5 shows that contin-
ued overfitting to the training data is much reduced by regularization and that
performance on the validation data has been improved as well. Fig. 13.6 again
shows that continued overfitting to the training data has been reduced. It also
shows that the cost does not increase on the validation data as training continues.
Therefore even this first choice of hyperparameters has beneficial effects.
Fig. 13.6 Training and validation costs as functions of the iteration number with and without
regularization.
The choice of the cost function can strongly affect how fast a neural network
learns. We discuss properties of the two cost functions in Sect. 13.5 that can now
be understood in terms of the properties of the activation functions and their
role in the backpropagation algorithm (see Sect. 13.7).
An effect due to the choice of the activation function is learning slowdown.
As seen from (13.4), the errors 𝛅(𝑙) and therefore the gradients used while mini-
mizing the cost function using stochastic gradient descent become small when
𝜎′ (𝐳(𝑙) ) becomes small. Such a factor 𝜎′ (𝐳(𝑙) ) occurs in every recursive use of
(13.4) in backpropagation, which implies that deep neural networks become
harder to train when the factors 𝜎′ (𝐳(𝑙) ) become small. Therefore the leaky rec-
tifier 𝜎2 is advantageous compared to the hyperbolic tangent 𝜎5 , especially in
deep neural networks.
The cost function can also be a cause of learning slowdown, again due to the
factor 𝜎′ (𝐳(𝑙) ) in the error 𝛿 (𝐿) in (13.3). Thus this interaction between the cost
function and the activation function in the output layer may be detrimental to
learning.
We can remedy the situation by choosing the cost function 𝐶 such that this fac-
tor disappears. Considering the derivatives with respect to the biases, we would
like the error in the output layer to be simply 𝑎 − 𝑦. Since the derivative of the
sigmoid function satisfies 𝜎′(𝑧) = 𝜎(𝑧)(1 − 𝜎(𝑧)) = 𝑎(1 − 𝑎), we obtain the
condition
𝜕𝐶∕𝜕𝑎 = (𝑎 − 𝑦)∕(𝑎(1 − 𝑎)) = −𝑦∕𝑎 + (1 − 𝑦)∕(1 − 𝑎)
for 𝜕𝐶∕𝜕𝑎. The last equation follows from a partial-fraction decomposition. In-
tegration yields
𝐶 = −𝑦 ln 𝑎 − (1 − 𝑦) ln(1 − 𝑎) + const.
for the cost function for a single training item, and hence the cross-entropy cost
function 𝐶CE defined in (13.1) for all training items.
We now check that this calculation for the partial derivatives with respect to
the biases yields the desired form for all partial derivatives in the output layer.
Starting from the cross-entropy cost function
𝐶CE(𝑊, 𝐛) = −(1∕|𝑇|) ∑𝐱∈𝑇 (𝐲(𝐱) ⋅ ln 𝐚(𝐿)(𝐱) + (𝟏 − 𝐲(𝐱)) ⋅ ln(𝟏 − 𝐚(𝐿)(𝐱))),
we find the error in the output layer by (13.3):
𝛅(𝐿) = (𝜕𝐶∕𝜕𝐚(𝐿)) ⊙ 𝜎′(𝐳(𝐿))
= −(1∕|𝑇|) ∑𝐱∈𝑇 (𝐲(𝐱) ⊘ 𝐚(𝐿)(𝐱) − (𝟏 − 𝐲(𝐱)) ⊘ (𝟏 − 𝐚(𝐿)(𝐱))) ⊙ 𝐚(𝐿)(𝐱) ⊙ (𝟏 − 𝐚(𝐿)(𝐱))
= (1∕|𝑇|) ∑𝐱∈𝑇 (𝐚(𝐿)(𝐱) − 𝐲(𝐱)),
where ⊘ denotes elementwise division of two vectors. Hence, by (13.5), the par-
tial derivatives are
𝜕𝐶CE∕𝜕𝐛(𝐿) = 𝛅(𝐿) = (1∕|𝑇|) ∑𝐱∈𝑇 (𝐚(𝐿)(𝐱) − 𝐲(𝐱)).
Hence there are no factors 𝜎′ (𝐳(𝐿) ) often responsible for learning slowdown.
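The slowdown is easy to see numerically for a single saturated sigmoid output neuron (this example's own code; the cost functions are evaluated for one training item):

```julia
# Gradient with respect to the bias of a single sigmoid output neuron:
# the quadratic cost carries the factor σ′(z), the cross-entropy cost does not.
sigma(z) = 1 / (1 + exp(-z))
dsigma(z) = sigma(z) * (1 - sigma(z))

z, y = 10.0, 0.0                       # saturated neuron: a = σ(10) ≈ 1, target 0
a = sigma(z)
grad_quadratic = (a - y) * dsigma(z)   # ≈ 4.5e-5: learning slowdown
grad_crossentropy = a - y              # ≈ 1.0: no slowdown
println((grad_quadratic, grad_crossentropy))
```

Although the neuron is badly wrong, the quadratic cost barely moves it, while the cross-entropy cost yields a gradient of order one.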
There is a similar interaction between the quadratic cost function 𝐶2 and so-
called linear neurons in the output layer, meaning that the activation function 𝜎
of the output layer 𝐿 is the identity function. Then 𝐚(𝐿) = 𝐳(𝐿) and 𝛅(𝐿) = 𝐚(𝐿) − 𝐲
by (13.3). Therefore (13.5) yields
𝜕𝐶2∕𝜕𝐛(𝐿) = (1∕|𝑇|) ∑𝐱∈𝑇 (𝐚(𝐿)(𝐱) − 𝐲(𝐱)),
showing no detrimental factor in the output layer 𝐿 for this choice of cost func-
tion and activation function.
These considerations imply that the cost function and the activation functions
should not be chosen independently. It is worthwhile to study their interactions
in order to arrive at efficient training algorithms.
The package MLDatasets used above contains sample datasets for machine
learning. The package Flux provides software for neural networks written in
pure Julia and includes GPU support.
13.11 Bibliographical Remarks
Artificial neural networks were first implemented in the middle of the twentieth
century and have become a standard method in machine learning and artificial
intelligence. Thus there is a vast body of literature on this topic. A historic per-
spective can be found in [4], and a very accessible introduction can be found in
[3]. The hyperparameters used in the examples are from [3].
Problems
13.1 Implement and plot the five activation functions 𝜎𝑖 , 𝑖 ∈ {1, 2, 3, 4, 5},
along with their derivatives using Julia. Furthermore, find the ten limits
lim𝑥→±∞ 𝜎𝑖 (𝑥) for 𝑖 ∈ {1, 2, 3, 4, 5}.
13.2 Find suitable hyperparameters for each of the five activation functions 𝜎𝑖 ,
𝑖 ∈ {1, 2, 3, 4, 5}, such that stochastic gradient descent works well for mnist
handwriting recognition.
13.3 Compare how well the neural network learns with and without scaling the
initial weights.
13.4 Compare the best classification performance you can achieve using each
of the five activation functions 𝜎𝑖, 𝑖 ∈ {1, 2, 3, 4, 5}, in a neural network.
13.6 Implement an adaptive strategy for choosing the learning rate 𝜂. Choose
𝜇 ∈ ℝ+, 𝜇 ≈ 1, and change the learning rate to 𝜇𝜂, 𝜂, or 𝜂∕𝜇 depending on the
improvement achieved by each of these three values.
13.7 Investigate further strategies for choosing the learning rate 𝜂 adaptively by
making it depend on the learning progress.
13.8 Parallelize the for loop over all batches in the function SGD.
13.9 Use the validation dataset to optimize the hyperparameters of the training
algorithm. Optimize single hyperparameters one after another in one-dimen-
sional optimization problems. Which hyperparameters should be considered on
a logarithmic scale?
13.10 Use the validation dataset to optimize the hyperparameters of the train-
ing algorithm. Optimize (as many as possible of) the hyperparameters simulta-
neously in a multidimensional optimization problem. Which hyperparameters
should be considered on a logarithmic scale?
13.12 More training data helps reduce overfitting. In certain problems, more
training data can be generated by simply perturbing the available training data
slightly. When dealing with image data, shifting and rotating the images slightly
suggests itself. Implement and evaluate this idea to reduce overfitting.
13.13 Derive the formulas for 𝓁1 -regularization and implement it (as an option).
13.14 Derive the formulas for 𝓁𝑝 -regularization and implement it (as an option).
13.16 After using all ideas in this chapter and optimizing the hyperparameters
using the validation data, what is the best accuracy you can achieve on the test
data?
13.17 Implement a convolutional neural network and train it with the mnist
database. Which classification error can you achieve?
References
No other formula in the alchemy of logic has exerted more astonishing powers.
For it has established the existence of God from the premiss of total ignorance;
and it has measured with numerical precision the probability
that the sun will rise to-morrow.
—John Maynard Keynes, A Treatise on Probability, Chapter VII (1921)
Abstract Frequentist and Bayesian statistics and inference differ in their fun-
damental assumptions on the nature of probabilities and models. After a short
discussion of the differences, we use the ideas of Bayesian inference to determine
model parameters. The motivation for these considerations is the fact that mod-
els usually contain parameters that are unknown and often cannot be measured
or determined directly. Thus they must be estimated by comparing the model to
data. In this chapter, the Bayesian approach to the estimation of model parame-
ters is developed, implemented, and applied to an example.
14.1 Introduction
Theorems in discrete and continuous probability are often stated separately us-
ing two different notions, namely the (discrete) probability 𝑃 of an event and
the (continuous) probability density 𝑓 of a random variable. In order to unify the
treatment of discrete and continuous probabilities, we use the Riemann–Stieltjes
integral as an elegant concept to cover both cases, the discrete and the continu-
ous one, simultaneously.
To define the Riemann–Stieltjes integral, we need the concept of a partition. A
partition of a set 𝐴 is a set of subsets 𝐴𝑖 ⊂ 𝐴 such that ⋃𝑖 𝐴𝑖 = 𝐴 and 𝐴𝑖 ∩ 𝐴𝑗 = ∅
for all indices 𝑖 ≠ 𝑗. The definition of the Riemann–Stieltjes integral is a gen-
eralization of the Riemann integral with the additional notion of an integrator
function. In a Riemann sum, each function value is multiplied by the subinterval
length, whereas in the Riemann–Stieltjes integral the function values are more
generally multiplied by the subintervals weighted by the integrator.
The Riemann–Stieltjes integral ∫[𝑎,𝑏] ℎ(𝑥)d𝑔(𝑥) is defined as the limit
lim|𝑃|→0 𝑆(𝑃, ℎ, 𝑔)
of the Riemann–Stieltjes sums 𝑆(𝑃, ℎ, 𝑔) ∶= ∑𝑖=0,…,𝑛−1 ℎ(𝜉𝑖)(𝑔(𝑥𝑖+1) − 𝑔(𝑥𝑖)).
Here
𝑃 ∶= {[𝑥0 ∶= 𝑎, 𝑥1), [𝑥1, 𝑥2), … , [𝑥𝑛−1, 𝑥𝑛 ∶= 𝑏]}
is a partition of the interval [𝑎, 𝑏], where 𝑥𝑖 < 𝑥𝑖+1 holds for all indices 𝑖 ∈
{0, … , 𝑛 − 1}; the fineness
|𝑃| ∶= max{𝑥𝑖+1 − 𝑥𝑖 ∣ 𝑖 ∈ {0, … , 𝑛 − 1}}
of a partition 𝑃 is the length of the longest of its subintervals; and the points 𝜉𝑖
are chosen as 𝜉𝑖 ∈ [𝑥𝑖 , 𝑥𝑖+1 ) for all indices 𝑖 ∈ {0, … , 𝑛 − 1}.
If the limit above exists, then ℎ is called Riemann–Stieltjes integrable with
respect to 𝑔 on the interval [𝑎, 𝑏].
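The definition translates directly into a numerical sketch (this example's own code): evaluate the Riemann–Stieltjes sum on a uniform partition, sampling at the midpoints 𝜉𝑖.

```julia
# Riemann–Stieltjes sum S(P, h, g) on a uniform partition of [a, b] with n
# subintervals, sampling h at the midpoints of the subintervals.
function rs_sum(h, g, a, b, n)
    xs = range(a, b; length = n + 1)
    sum(h((xs[i] + xs[i+1]) / 2) * (g(xs[i+1]) - g(xs[i])) for i in 1:n)
end

# Example: the integral of h(x) = x with respect to g(x) = x^2 over [0, 1]
# equals the Riemann integral of x * 2x, i.e., 2/3.
println(rs_sum(x -> x, x -> x^2, 0.0, 1.0, 10_000))
```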
If the integrator is the identity, then the Riemann–Stieltjes integral reduces to
the Riemann integral.
Theorem 14.2 Suppose that ℎ is continuous and that the integrator 𝑔 is continu-
ously differentiable on the interval [𝑎, 𝑏]. Then the equation
∫[𝑎,𝑏] ℎ(𝑥)d𝑔(𝑥) = ∫[𝑎,𝑏] ℎ(𝑥)𝑔′(𝑥)d𝑥
holds.
Theorem 14.3 Suppose that the integrator 𝑔 is a step function with the jumps
𝑔𝑖 ∶= 𝑔(𝑥𝑖+) − 𝑔(𝑥𝑖−)
at the points 𝑥𝑖 ∈ [𝑎, 𝑏], 𝑖 ∈ {1, … , 𝑛}, of discontinuity, where 𝑔(𝑎−) ∶= 𝑔(𝑎)
and 𝑔(𝑏+) ∶= 𝑔(𝑏). Suppose further that at all points 𝑥𝑖 not both ℎ and 𝑔 are
discontinuous from the left and not both ℎ and 𝑔 are discontinuous from the right.
Then the function ℎ is Riemann–Stieltjes integrable with respect to 𝑔 on the interval
[𝑎, 𝑏], and the integral has the value
∫[𝑎,𝑏] ℎ(𝑥)d𝑔(𝑥) = ∑𝑖=1,…,𝑛 ℎ(𝑥𝑖)𝑔𝑖.
Suppose that 𝑋 is a random variable. Then 𝐹𝑋(𝑥) ∶= 𝑃(𝑋 ≤ 𝑥) is its cumulative
probability distribution, and 𝑓𝑋 ∶= 𝐹𝑋′
is its probability density. In this setting, discrete random variables become special
cases of continuous random variables: they are just continuous ran-
dom variables with piecewise constant cumulative probability distributions 𝐹𝑋 ,
which are the integrators, implying that the probability densities 𝑓𝑋 are sums of
delta distributions.
If the random variable 𝑋 is continuous, then Theorem 14.2 yields the usual
integral
𝔼[ℎ(𝑋)] = ∫ℝ ℎ(𝑥)d𝐹𝑋(𝑥) = ∫ℝ ℎ(𝑥)𝑓𝑋(𝑥)d𝑥
of the expected value of a continuous random variable.
If the random variable 𝑋 is discrete, then 𝐹𝑋 is a step function with the points
𝑥𝑖 ∈ [𝑎, 𝑏], 𝑖 ∈ {1, … , 𝑛}, of discontinuity, and the jumps 𝑃(𝑋 = 𝑥𝑖 ) = 𝐹𝑋 (𝑥𝑖 +) −
𝐹𝑋 (𝑥𝑖 −). The expected value 𝔼[ℎ(𝑋)] simplifies to the sum
𝔼[ℎ(𝑋)] = ∫ℝ ℎ(𝑥)d𝐹𝑋(𝑥) = ∑𝑖=1,…,𝑛 ℎ(𝑥𝑖)𝑃(𝑋 = 𝑥𝑖)
by Theorem 14.3, which is the usual definition of the expected value of discrete
random variables. Furthermore, using delta distributions, we can write the prob-
ability density 𝑓𝑋 , i.e., the derivative of the step function 𝐹𝑋 , as
𝑓𝑋(𝑥) = ∑𝑖=1,…,𝑛 𝑃(𝑋 = 𝑥𝑖) 𝛿(𝑥 − 𝑥𝑖).
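For a step integrator, the expected value thus reduces to a finite sum; a small check (this example's own code) computes 𝔼[𝑋²] for a fair die from the jumps of its cumulative probability distribution:

```julia
# Cumulative distribution of a fair six-sided die and its jumps
# g_i = F(x_i+) - F(x_i-); the expected value of h(X) = X^2 follows as a sum.
F(x) = clamp(floor(x) / 6, 0.0, 1.0)
xs = 1:6
jumps = [F(x) - F(x - 0.5) for x in xs]     # each jump is 1/6

E = sum(x^2 * j for (x, j) in zip(xs, jumps))
println(E)   # 91/6 ≈ 15.1667
```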
𝑓𝑌(𝑦 ∣ 𝑋 = 𝑥) ∶= 𝑓𝑋,𝑌(𝑥, 𝑦)∕𝑓𝑋(𝑥)
assuming that 𝑓𝑋 (𝑥) > 0, where 𝑓𝑋,𝑌 (𝑥, 𝑦) is the joint density of the random
variables 𝑋 and 𝑌 and 𝑓𝑋 (𝑥) is the marginal density of 𝑋.
In the discrete case, the conditional probability 𝑃(𝐴|𝐵) of the event 𝐴 occur-
ring given that 𝐵 with 𝑃(𝐵) > 0 occurs is usually written as
𝑃(𝐴|𝐵) = 𝑃(𝐴 ∩ 𝐵)∕𝑃(𝐵)
Fig. 14.1 Here two events 𝐴 and 𝐵 are illustrated as part of a probability space Ω. The condi-
tional probability 𝑃(𝐴|𝐵) corresponds to the ratio of the areas of 𝐴 ∩ 𝐵 and 𝐵.
Theorem 14.6 (Bayes’ theorem) Suppose that 𝐴 and 𝐵 are two events and that
𝑃(𝐵) > 0. Then the equation
𝑃(𝐴|𝐵) = 𝑃(𝐵|𝐴)𝑃(𝐴)∕𝑃(𝐵)
holds.
Proof We start with the case 𝑃(𝐴) = 0. Then both sides of the equation vanish.
The remaining case is 𝑃(𝐴) > 0. By Definition 14.5, 𝑃(𝐴|𝐵) = 𝑃(𝐴 ∩ 𝐵)∕𝑃(𝐵)
and 𝑃(𝐵|𝐴) = 𝑃(𝐵∩𝐴)∕𝑃(𝐴) if 𝑃(𝐴) > 0 and 𝑃(𝐵) > 0. Since 𝑃(𝐴∩𝐵) = 𝑃(𝐵∩𝐴),
we find 𝑃(𝐴|𝐵)𝑃(𝐵) = 𝑃(𝐵|𝐴)𝑃(𝐴). Division by 𝑃(𝐵) > 0 yields the assertion.□
The law of total probability reads
𝑓𝑌(𝑦) = ∫ℝ 𝑓𝑌(𝑦 ∣ 𝑋 = 𝑥)𝑓𝑋(𝑥)d𝑥 (14.1)
for continuous random variables and follows easily from Definition 14.5. For
discrete random variables, it reads
𝑃(𝐵) = ∑𝑖 𝑃(𝐵|𝐴𝑖)𝑃(𝐴𝑖)
for a partition {𝐴𝑖} of the sample space.
Theorem 14.7 (extended form of Bayes’ theorem) Suppose that 𝑋 and 𝑌 are
two random variables and that 𝑓𝑌 > 0. Then the equations
𝑓𝑋(𝑥 ∣ 𝑌 = 𝑦) = 𝑓𝑌(𝑦 ∣ 𝑋 = 𝑥)𝑓𝑋(𝑥)∕𝑓𝑌(𝑦) = 𝑓𝑌(𝑦 ∣ 𝑋 = 𝑥)𝑓𝑋(𝑥) ∕ ∫ℝ 𝑓𝑌(𝑦 ∣ 𝑋 = 𝜉)𝑓𝑋(𝜉)d𝜉
hold.
Proof We start with the case 𝑓𝑋 (𝑥) = 0. Then all terms vanish.
The general case is 𝑓𝑋(𝑥) > 0. By Definition 14.5, the equations
𝑓𝑋(𝑥 ∣ 𝑌 = 𝑦) = 𝑓𝑋,𝑌(𝑥, 𝑦)∕𝑓𝑌(𝑦) and 𝑓𝑌(𝑦 ∣ 𝑋 = 𝑥) = 𝑓𝑌,𝑋(𝑦, 𝑥)∕𝑓𝑋(𝑥)
hold if 𝑓𝑋(𝑥) > 0 and 𝑓𝑌(𝑦) > 0. Since the two joint densities on the right-hand
sides are identical, the first equation follows. The second equation uses (14.1) in
the denominator. □
Corollary 14.8 (discrete, extended form of Bayes’ theorem) Suppose that
the events 𝐴𝑖 are a partition of the sample space and that 𝐵 is an event with nonzero
probability 𝑃(𝐵) ≠ 0. Then the equations
𝑃(𝐴𝑖|𝐵) = 𝑃(𝐵|𝐴𝑖)𝑃(𝐴𝑖)∕𝑃(𝐵) = 𝑃(𝐵|𝐴𝑖)𝑃(𝐴𝑖) ∕ ∑𝑗 𝑃(𝐵|𝐴𝑗)𝑃(𝐴𝑗)
hold.
14.4 Frequentist and Bayesian Inference
There are two perspectives in statistical inference, namely the frequentist and
the Bayesian perspectives.
In the frequentist perspective, probabilities are the frequencies of the occur-
rences of an event if experiments are repeated many times. This definition is
objective, since the frequencies are independent of the observer. The probabili-
ties are not updated during data acquisition. The parameters of a model are un-
known but deterministic. Estimators are constructed and confidence intervals
are calculated. A confidence interval for a parameter contains the true value
of the corresponding parameter in repeated sampling with a given probability
or frequency, namely the confidence level. Since the unknown parameters are
viewed as deterministic, parameter densities cannot be propagated through the
model in order to quantify model uncertainties.
In the Bayesian perspective, probabilities are subjective and can be updated to
incorporate new data or information. Probabilities are probability distributions
and not a single frequency value. Model parameters are considered to be random
variables. When a parameter is estimated, the probability density obtained is
called the posterior probability density. This viewpoint is a natural one when
uncertainties in model parameters are to be propagated through the models and
quantified. Instead of confidence intervals, credible intervals are calculated; a
credible interval contains the parameter with a given probability, namely the
confidence level.
In Bayesian inference, new information can be incorporated into the knowl-
edge of an observer, e.g., into the probabilities of parameters, as soon as it be-
comes available. In other words, it is a method for online learning. We denote
the (unknown) model parameters by the random variable 𝑄, which can be mul-
tidimensional in general. The data, measurements, or observations are denoted
by the random variable 𝐷 and can be multidimensional in general as well.
We rewrite Bayes’ formula in Theorem 14.7 in the form
𝜋(𝑞|𝑑) = 𝜋(𝑑|𝑞) 𝜋0(𝑞) ∕ 𝜋𝐷(𝑑), (14.2)
where
𝜋(𝑞|𝑑) ∶= 𝜋𝑄(𝑞 ∣ 𝐷 = 𝑑),
𝜋(𝑑|𝑞) ∶= 𝜋𝐷(𝑑 ∣ 𝑄 = 𝑞)
are defined to simplify notation. The indices referring to the random variables
are usually dropped in Bayesian inference, since they are clear from the context.
The probability density
𝜋0 (𝑞) ∶= 𝜋𝑄 (𝑞)
is called the prior probability density and contains the previous knowledge about
the random variable 𝑄 before the incorporation of new information. Further-
more, the probability density 𝜋(𝑞|𝑑) on the left-hand side is called the posterior
probability density and represents the updated knowledge after the realization
𝑑 = 𝐷(𝜔) has been observed. Finally, 𝜋(𝑑|𝑞) is called the likelihood, and the
marginal density 𝜋𝐷 (𝑑) is a normalization factor.
Therefore, new data informs the posterior density directly only through the
likelihood 𝜋(𝑑|𝑞). With this interpretation, equation (14.2) gives rise to an itera-
tive algorithm.
Algorithm 14.9 (Bayesian inference)
1. Initialize the prior density 𝜋0 (𝑞).
2. While data are available:
a. Calculate the posterior density 𝜋(𝑞|𝑑) using (14.2). Realizations 𝑑 =
𝐷(𝜔) of the data inform the likelihood 𝜋(𝑑|𝑞).
b. Set 𝜋0 (𝑞) to be the posterior density 𝜋(𝑞|𝑑) just calculated.
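The update loop of the algorithm can be sketched on a discrete parameter grid, where the integral in the denominator of Bayes' formula reduces to a sum. The grid, the coin-flip likelihood, and all names below are illustrative choices only.

```julia
# Bayesian inference (as in Algorithm 14.9) on a discrete grid: the
# posterior of one step becomes the prior of the next step.
qs = range(0.0, 1.0, length=101)            # grid for the parameter q
prior = fill(1.0/length(qs), length(qs))    # uniform prior pi_0(q)

# likelihood of one observation d (1 = heads) given the coin bias q
likelihood(d, q) = d == 1 ? q : 1 - q

# one update via Bayes' formula; the denominator becomes a sum
function update(prior, d)
    posterior = [likelihood(d, q)*p for (q, p) in zip(qs, prior)]
    posterior ./ sum(posterior)
end

for d in (1, 1, 0, 1)                       # incorporate data one by one
    global prior = update(prior, d)
end
```

After three heads and one tail, the grid posterior is proportional to q³(1 − q) and peaks at q = 3∕4.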
In the first step, the question immediately arises of which prior density should
be used before any data are available. If no prior information is available at all,
404 14 Bayesian Estimation
We see that after the first test, the posterior probability of having the disease
has increased from 1‰ to ≈ 9%. Why not more? This is explained by the rel-
atively small correctness 𝑃(pos.|dis.) = 99% of the test compared to the small
frequency of the disease in the population. Taking a sample of one thousand people, we expect one person to have the disease, while about ten healthy persons are expected to be tested
14.5 Parameter Estimation and Inverse Problems 405
positive. Out of these (approximately) eleven persons tested positive, only one
has the disease, resulting in the probability of ≈ 9% to have the disease after the
first test.
If the prior probability is zero, the posterior probability will always be zero as
well. This means that absolute confidence in the beginning remains unchanged
in the Bayesian setting, unless the likelihood is equal to one. In this case, both
the numerator and the denominator are zero and the quotient is undefined.
After a second, independent test, the posterior probability of having the dis-
ease increases to ≈ 91%, and so forth. The posterior probability converges quickly
to 1 as more information becomes available.
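The numbers of this example are quickly verified in Julia. The false-positive rate of the test is not stated explicitly above; the sketch assumes it to be 1%, consistent with the roughly ten positive tests per thousand healthy persons.

```julia
# Iterated Bayesian update for the disease-test example.
# Assumption: the false-positive rate P(pos. | no dis.) is 1%.
p_pos_dis = 0.99        # P(pos. | dis.)
p_pos_healthy = 0.01    # P(pos. | no dis.), an assumption

# posterior probability of having the disease after one positive test
bayes(prior) = p_pos_dis * prior /
    (p_pos_dis * prior + p_pos_healthy * (1 - prior))

p1 = bayes(0.001)       # after the first positive test: about 9%
p2 = bayes(p1)          # after the second positive test: about 91%
```

The posterior after the first test is ≈ 0.090 and after the second test ≈ 0.908, matching the values above.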
When applying Bayes’ theorem as in this example, a few questions arise.
What is the influence of the initial prior density? Can it change the result? And
does the posterior density converge?
These questions can be answered as follows. The Bernstein–von Mises theorem states that, under certain conditions, the posterior density converges to a normal distribution independently of the initial prior density in the limit of infinitely many independent and identically distributed realizations.
14.5 Parameter Estimation and Inverse Problems

Parameter estimation and inverse problems refer to the kind of problems where
model parameters are to be inferred given measurements or observations. The
advantage of the Bayesian approach is that it also shows how well such an in-
verse problem can be solved. If the resulting probability distribution for a sought
model parameter is spread out or multimodal, this parameter cannot be deter-
mined well. On the other hand, if the distribution is well localized around a
certain value, the parameter can be calculated precisely. Having calculated probability distributions, credible intervals are also easily found.
In contrast, classical methods that work by trying to find parameter values
such that the distance in a certain norm between measurement and model out-
put is minimized always yield a parameter value. This parameter value depends
on the choice of norm in the minimization problem, but more importantly, ad-
ditional considerations are necessary in order to assess how sensitive this mini-
mum is to perturbations. Without additional work, we do not know how reliable
the parameter values found are, which is especially important in the case of non-
linear or so-called ill-posed inverse problems.
Because parameter estimation and inverse modeling must always deal with
measurement noise and possibly with other uncertainties, it is expedient
to pose the problem within the context of probability theory from the beginning.
In order to apply Bayes’ theorem, we start by considering the general statisti-
cal model
𝐷𝑖 ∶= 𝑓(𝑡𝑖 , 𝑄) + 𝜖𝑖 , (14.3)
where the function 𝑓 represents the model; 𝐷 is a random vector representing
data, measurements, or observations; 𝑄 is a random vector representing param-
eters to be determined, i.e., the quantities of interest; and the random vector 𝜖
represents any unbiased, independent, and identically distributed errors such
as measurement errors. The independent variable 𝑡 of the model function 𝑓 will
represent time in the example below. The indices 𝑖 number the points where data
points 𝐷𝑖 are available for the values 𝑡𝑖 of the independent variable 𝑡 of 𝑓. The
errors are independent of each other and of the parameters 𝑄, and they are additive
here. This model equation applies to any problem with additive errors, but the
approach can of course also be formulated for multiplicative errors.
If we additionally assume that the errors are normally distributed, we can equivalently write

𝐷𝑖 ∼ 𝑁(𝑓(𝑡𝑖, 𝑄), 𝜎²),

i.e., the measurements are independent and normally distributed with mean
𝑓(𝑡𝑖, 𝑄) and variance 𝜎².
Before we apply Theorem 14.7 or (14.2), we define a model function in or-
der to make things concrete and to discuss a non-trivial example that we will
implement later.
The model function 𝑓 is the solution of the logistic equation

𝑓′(𝑡) = 𝑞1 𝑓(𝑡) (1 − 𝑓(𝑡)∕𝑞2), 𝑓(0) = 𝑞3 ≠ 0, (14.4)

whose quadratic term −(𝑞1∕𝑞2)𝑓(𝑡)² counteracts the growth term 𝑞1 𝑓(𝑡). (A linear second term
could just be absorbed into the first term.)
Equation (14.4) can be solved by separation of variables; a short calculation
shows that its solution is
𝑓(𝑡) = 𝑞2 𝑞3 ∕ (𝑞3 + (𝑞2 − 𝑞3) e^{−𝑞1 𝑡}), (14.5)

whose limiting behavior is

lim_{𝑡→∞} 𝑓(𝑡) = 𝑞2,
implying that the population size always tends to the parameter 𝑞2 , which is
hence usually called the carrying capacity. The larger the carrying capacity is,
the smaller the second term −(𝑞1 ∕𝑞2 )𝑓(𝑡)2 in (14.4) responsible for population
decrease is.
A realistic example of a population governed by the logistic equation is the
growth of a bacterial colony. We can measure the number of bacteria in a Petri
dish as time progresses or we can create synthetic measurements by defining
𝑑𝑖 ∶= 𝑓(𝑡𝑖 , 𝑞1 , 𝑞2 , 𝑞3 ) + 𝜖𝑖 , 𝜖𝑖 ∼ 𝑁(0, 𝜎2 ),
after recalling (14.3). Here the times 𝑡𝑖 ∶= 𝑖Δ𝑡 when measurements are taken
are equidistant, and the measurement errors 𝜖𝑖 are normally distributed with
variance 𝜎2 .
The following Julia function implements the model. It is simple, since we
could solve the underlying logistic equation explicitly, and its solution is given
by (14.5).
function f(i, dt, sigma, q1, q2, q3)
    q2*q3 / (q3 + (q2 - q3)*exp(-q1*i*dt)) + sigma*randn(Float64)
end
Fig. 14.2 shows synthetic measurements generated by this function. The popula-
tion starts with a small number of individuals and then approaches the carrying
capacity.
In order to generate the synthetic measurements, we had to choose values for
the three parameters 𝑞1 , 𝑞2 , and 𝑞3 of interest. After having generated the data,
we forget these three values before proceeding with the parameter estimation.
In order to apply Bayes’ theorem in the form (14.2) and to calculate the sought
posterior density 𝜋(𝑞|𝑑), we must know the likelihood 𝜋(𝑑|𝑞) (and the prior
density 𝜋0(𝑞)).

Fig. 14.2 Synthetic measurements generated using the logistic function as implemented by f
for 𝑖 ∈ {1, … , 100}, Δ𝑡 ∶= 1, 𝜎 = 0.05, 𝑞1 ∶= 0.1, 𝑞2 ∶= 2, 𝑞3 ∶= 0.05.

The likelihood function depends on the assumptions made on the errors. Since the errors in (14.3) are independent and normally distributed with variance 𝜎², the likelihood is
𝜋(𝑑|𝑞) ∶= 𝐿(𝑞, 𝜎² ∣ 𝑑) ∶= ∏_{𝑖=1}^{𝑁} (2𝜋𝜎²)^{−1∕2} e^{−(𝑑𝑖−𝑓(𝑡𝑖,𝑞))²∕(2𝜎²)} = (2𝜋𝜎²)^{−𝑁∕2} e^{−𝑆(𝑞)∕(2𝜎²)}, (14.6)

where 𝐷 is an 𝑁-dimensional random vector, meaning that there are 𝑁 data points
(𝑑1, … , 𝑑𝑁), and we have defined

𝑆(𝑞) ∶= ∑_{𝑖=1}^{𝑁} (𝑑𝑖 − 𝑓(𝑡𝑖, 𝑞))².
Furthermore, in order to apply Bayes’ Theorem in the form (14.2), the integral in
the denominator must be evaluated. This numerical integration remains a chal-
lenge if the number of parameters, i.e., the dimension of the random variable 𝑄,
is large, although many methods for high-dimensional numerical integration
have been developed. Therefore we follow an alternative approach here.
The alternative is to construct a Markov chain whose stationary distribution
is equal to the posterior density. To do so, we start by defining Markov chains.
Definition 14.10 (Markov chain, Markov property) A Markov chain is a se-
quence of random variables 𝑋𝑛 , 𝑛 ∈ ℕ, that satisfy the Markov property, namely
that 𝑋𝑛+1 depends only on its predecessor 𝑋𝑛 for all 𝑛 ∈ ℕ, i.e.,
𝑃(𝑋𝑛+1 = 𝑥𝑛+1 ∣ 𝑋1 = 𝑥1 , … , 𝑋𝑛 = 𝑥𝑛 )
= 𝑃(𝑋𝑛+1 = 𝑥𝑛+1 ∣ 𝑋𝑛 = 𝑥𝑛 ) ∀𝑛 ∈ ℕ.
The set of all possible realizations of the random variables 𝑋𝑛 is called the
state space of the Markov chain.
A Markov chain can be realized if three pieces of information are known:
1. its state space,
2. its initial distribution 𝑝0 , i.e., the distribution of 𝑋0 , and
3. its transition or Markov kernel, which gives the probability
𝑝𝑖𝑗 ∶= 𝑃(𝑋𝑛+1 = 𝑥𝑗 ∣ 𝑋𝑛 = 𝑥𝑖 )
of transitioning from state 𝑥𝑖 to 𝑥𝑗 and thus defines how the chain evolves.
If the state space is finite, the Markov chain is called finite and the entries of
the transition matrix 𝑃 are the probabilities 𝑝𝑖𝑗 .
Here we assume that the transition probabilities 𝑝𝑖𝑗 that constitute the transition
kernel are independent of time or iteration 𝑛. Markov chains with this property
are called homogeneous Markov chains.
Clearly, the entries of the initial distribution 𝑝0 and of the transition matrix 𝑃
are nonnegative, and the elements of 𝑝0 and the rows of 𝑃 sum to one. The dis-
tributions of the states as time progresses are given by
𝑝0 ,
𝑝1 ∶= 𝑝0 𝑃,
𝑝2 ∶= 𝑝1 𝑃 = 𝑝0 𝑃2 ,
⋮
𝑝𝑛 ∶= 𝑝𝑛−1 𝑃 = 𝑝0 𝑃𝑛 ,
If

lim_{𝑛→∞} 𝐹_{𝑋𝑛}(𝑥) = 𝐹_𝑋(𝑥)

holds at all points 𝑥 where 𝐹_𝑋 is continuous, then the sequence ⟨𝑋𝑛⟩𝑛∈ℕ is said to converge in distribution to the limiting
random variable 𝑋. Convergence in distribution is written as 𝑋𝑛 →𝐷 𝑋.
A distribution 𝜋 on the state space is called a stationary distribution of the Markov chain if

𝜋 = 𝜋𝑃.
Every homogeneous Markov chain on a finite state space has at least one sta-
tionary distribution. A stationary distribution, however, may not be unique and
it may not be equal to lim𝑛→∞ 𝑝𝑛 .
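For a small finite state space, these notions are easy to check numerically. The transition matrix below is an illustrative example, not from the text.

```julia
# A two-state homogeneous Markov chain; each row of the transition
# matrix P sums to one.
P = [0.9 0.1;
     0.5 0.5]

# iterate p_n = p_{n-1} P starting from p_0 = (1, 0)
p = [1.0 0.0]
for _ in 1:500
    global p = p * P
end
# p now approximates the stationary distribution, which satisfies
# pi = pi P; for this matrix it is (5/6, 1/6)
```

Since this chain is irreducible and aperiodic, the iterates converge to the unique stationary distribution regardless of the initial distribution.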
There are two kinds of homogeneous Markov chains that we must exclude to
ensure the existence of a unique stationary distribution. The first kind to be excluded are Markov chains in which not the whole state space remains reachable after
some time. This kind is excluded by stipulating that the Markov chain is irreducible.
The second kind to be excluded are Markov chains that cycle through parts of the state space periodically. The period of a state is the greatest common divisor of the numbers of steps in which the chain can return to it. A Markov chain is called aperiodic if its period is equal to one, and it is called periodic if its period is greater than one.
and

lim_{𝑛→∞} 𝑝𝑛 = 𝜋,

which implies

𝜋𝑃 = 𝜋,

which is the definition of stationarity. □
In our application, the desired stationary distribution is the posterior density

𝜔(𝑞) ∶= 𝜋(𝑞|𝑑) = 𝜋(𝑑|𝑞) 𝜋0(𝑞) ∕ ∫ 𝜋(𝑑|𝑞) 𝜋0(𝑞) d𝑞. (14.7)
But for now it is more convenient to derive the algorithm for a general distribu-
tion 𝜔 that should become the stationary distribution of the Markov chain.
We construct the transition probability 𝑃(𝑞 ′ |𝑞) for going from state 𝑞 to
state 𝑞′ such that it satisfies the detailed-balance condition (see Definition 14.16),
of course. Given 𝜔, this means that the transition probability 𝑃(𝑞′|𝑞) must satisfy

𝜔(𝑞) 𝑃(𝑞′|𝑞) = 𝜔(𝑞′) 𝑃(𝑞|𝑞′)

or, equivalently,

𝑃(𝑞′|𝑞) ∕ 𝑃(𝑞|𝑞′) = 𝜔(𝑞′) ∕ 𝜔(𝑞). (14.8)
The transition from state 𝑞 to state 𝑞 ′ happens in two steps: first, a new state is
proposed by a proposal or jumping distribution 𝐽(𝑞 ′ |𝑞), which is the conditional
probability of proposing the new state 𝑞′ given state 𝑞, and second the acceptance
probability 𝐴(𝑞 ′ |𝑞) is the probability of accepting the proposed state 𝑞 ′ . If it is
rejected, the old state 𝑞 is repeated. In summary, this means that we try to find 𝐽
and 𝐴 such that
𝑃(𝑞 ′ |𝑞) = 𝐽(𝑞 ′ |𝑞)𝐴(𝑞′ |𝑞).
Substituting this form of 𝑃(𝑞′|𝑞) into (14.8) yields the form

𝐽(𝑞′|𝑞) 𝐴(𝑞′|𝑞) ∕ (𝐽(𝑞|𝑞′) 𝐴(𝑞|𝑞′)) = 𝜔(𝑞′) ∕ 𝜔(𝑞) (14.9)

of the detailed-balance condition. At this point, we can choose any proposal dis-
tribution 𝐽. There are many choices, but the choice is important for the numer-
ical behavior of the Markov chain and will be discussed later. Having chosen
the proposal distribution, we must define a suitable acceptance probability. The
Metropolis acceptance probability
𝐴(𝑞′|𝑞) ∶= min (1, 𝜔(𝑞′) 𝐽(𝑞|𝑞′) ∕ (𝜔(𝑞) 𝐽(𝑞′|𝑞)))
is common.
We can check that it works by setting
𝑟 ∶= 𝜔(𝑞′) 𝐽(𝑞|𝑞′) ∕ (𝜔(𝑞) 𝐽(𝑞′|𝑞))

and calculating

𝐴(𝑞′|𝑞) ∕ 𝐴(𝑞|𝑞′) = min(1, 𝑟) ∕ min(1, 1∕𝑟) = { 1∕(1∕𝑟), 𝑟 ≥ 1; 𝑟∕1, 𝑟 < 1 } = 𝑟,
which shows that (14.9) and therefore the detailed-balance condition is satisfied.
Hence we can indeed construct a Markov chain whose stationary distribution is
the given, arbitrary distribution 𝜔.
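The detailed-balance calculation above can also be checked numerically for a concrete target distribution ω and an asymmetric proposal distribution J; both functions below are illustrative choices, not from the text.

```julia
# Numerical check of detailed balance for the acceptance probability;
# omega and J are arbitrary illustrative examples.
omega(q) = exp(-q^2)                    # unnormalized target density
J(qnew, q) = exp(-(qnew - q - 0.1)^2)   # asymmetric proposal density
A(qnew, q) = min(1, omega(qnew)*J(q, qnew) / (omega(q)*J(qnew, q)))

qa, qb = 0.3, 1.2
# detailed balance: omega(q) J(q'|q) A(q'|q) = omega(q') J(q|q') A(q|q')
lhs = omega(qa) * J(qb, qa) * A(qb, qa)
rhs = omega(qb) * J(qa, qb) * A(qa, qb)
```

The two sides agree up to rounding errors, as the derivation above predicts.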
The difference between the Metropolis and the Metropolis–Hastings algo-
rithm lies only in the proposal distribution 𝐽. If it is symmetric, i.e., if 𝐽(𝑞 ′ , 𝑞) =
𝐽(𝑞, 𝑞 ′ ), then the algorithm is called a Metropolis algorithm. If it is not symmet-
ric, it is called a Metropolis–Hastings algorithm, which is therefore slightly more
general.
We can now formulate the Metropolis–Hastings algorithm. The similarity to
simulated annealing (see Sect. 11.3) is not a coincidence, but due to their com-
mon root. The algorithm works for general distributions 𝜔, but in the formu-
lation of the algorithm we also note what happens when 𝜔 is given by (14.7),
because this is the application we are interested in.
𝑞′ ∶= 𝑞𝑛 + 𝑅𝑧 (14.10)
𝑞𝑛+1 ∶= 𝑞′ if 𝑢 ≤ 𝐴(𝑞′|𝑞𝑛), and 𝑞𝑛+1 ∶= 𝑞𝑛 if 𝑢 > 𝐴(𝑞′|𝑞𝑛).
In other words, the candidate value 𝑞′ is used as the next value 𝑞𝑛+1 in
the Markov chain with probability 𝐴(𝑞′|𝑞𝑛); otherwise it is
rejected and the old value 𝑞𝑛 is repeated.
d. Repeat until the chain is long enough to estimate the parameter 𝑞 after
discarding a sufficiently long burn-in period at the beginning. Compute
any statistic of interest from the Markov chain without the burn-in pe-
riod.
The derivation of the acceptance probability above showed that the values
in a Markov chain calculated by the Metropolis–Hastings algorithm satisfy the
detailed-balance condition by construction and thus the Markov chain is re-
versible. We have hence shown the following theorem.
The two common choices of the proposal distribution are

𝐽(𝑞′|𝑞𝑛) ∶= 𝑁(𝑞𝑛, 𝐷) or 𝐽(𝑞′|𝑞𝑛) ∶= 𝑁(𝑞𝑛, 𝑉), (14.12)

where 𝐷 is a diagonal matrix and 𝑉 is the covariance matrix for the parameter
vector 𝑞. In the first choice, the elements of the diagonal matrix reflect the scale
associated with each parameter. In the second choice, the scale of each param-
eter can depend on the other parameters via the covariance matrix. Considera-
tions regarding the choices of 𝐷 or 𝑉 will be discussed later.
In both these choices for the proposal distributions 𝐽, 𝐽 is symmetric as the
calculation
𝐽(𝑞′, 𝑞𝑛) = (1∕√((2𝜋)^𝑁 |𝑉|)) e^{−(𝑞′−𝑞𝑛) 𝑉^{−1} (𝑞′−𝑞𝑛)⊤ ∕2}
= (1∕√((2𝜋)^𝑁 |𝑉|)) e^{−(𝑞𝑛−𝑞′) 𝑉^{−1} (𝑞𝑛−𝑞′)⊤ ∕2} = 𝐽(𝑞𝑛, 𝑞′)
shows.
It is obvious that in a Metropolis algorithm (where the proposal distribution 𝐽
is symmetric by definition) the acceptance probability (14.11) simplifies to

𝐴(𝑞′|𝑞𝑛) = min (1, 𝜔(𝑞′) ∕ 𝜔(𝑞𝑛)).

Realizations of a multivariate normal proposal distribution with mean 𝜇 and covariance matrix 𝑉 can be generated from standard normally distributed random vectors 𝑍 as

𝑌 = 𝑅𝑍 + 𝜇,
where
𝑉 = 𝑅𝑅⊤
and 𝑅 is a lower triangular matrix.
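In Julia, the lower triangular factor 𝑅 of the Cholesky factorization 𝑉 = 𝑅𝑅⊤ is available as cholesky(V).L from the LinearAlgebra standard library; the covariance matrix and mean below are illustrative.

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(1)
V = [2.0 0.5;
     0.5 1.0]                 # an example covariance matrix
mu = [1.0, -1.0]              # an example mean
R = cholesky(V).L             # lower triangular factor with V = R*R'

# generate samples Y = R*Z + mu from standard normal vectors Z
samples = reduce(hcat, [mu + R*randn(2) for _ in 1:100_000])

# the empirical mean and covariance approximate mu and V
m = vec(mean(samples; dims=2))
C = cov(samples; dims=2)
```

With 10⁵ samples, the empirical mean and covariance reproduce 𝜇 and 𝑉 to within the expected sampling error.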
import StatsBase

# The name MH_1D, the signature, and the surrounding loop were lost in
# this listing; they are reconstructed here from the call site below.
function MH_1D(prior, likelihood, proposal,
               chain_length::Int, var::Float64,
               q_min::Float64, q_max::Float64, q_start::Float64)
    local q = Vector{Float64}(undef, chain_length)
    q[1] = q_start
    for n in 1:chain_length-1
        # propose a new state near q[n]
        local qq = q[n] + sqrt(var)*randn(Float64)
        # clip the proposal to the admissible interval
        if qq < q_min
            qq = q_min
        end
        if qq > q_max
            qq = q_max
        end
        # Metropolis-Hastings acceptance probability
        local A::Float64 =
            min(1, (likelihood(qq) * prior(qq)
                    * proposal(q[n], qq, var)) /
                (likelihood(q[n]) * prior(q[n])
                 * proposal(qq, q[n], var)))
        q[n+1] =
            if rand(Float64) <= A
                qq
            else
                q[n]
            end
    end
    q
end
The function f yields the exact value of the model, here the logistic equation,
at time i times dt for given parameter values q1, q2, and q3.
function f(i::Int, dt::Float64, q1::Float64, q2::Float64,
           q3::Float64)::Float64
    @assert i >= 0
    @assert dt > 0
    @assert q1 >= 0
    @assert q2 >= 0
    @assert q3 >= 0
    # solution (14.5) of the logistic equation
    q2*q3 / (q3 + (q2 - q3)*exp(-q1*i*dt))
end
The function model wraps evaluations of f for equidistant points in time and
returns a vector.
function model(q1::Float64, q2::Float64,
               q3::Float64)::Vector{Float64}
    # N and dt are the global variables defined below
    Float64[f(i, dt, q1, q2, q3) for i in 0:N-1]
end
You may want to experiment with different values for the number of points and
the time step by changing the global variables N and dt below.
Next we define some global variables. Since we use synthetic measurements,
we define the exact parameter values. We will estimate the parameter q2 and assume that it lies in the interval [0, 5]. We also know the standard deviation 𝜎 of
the additive noise. In a real-world example, it would correspond to the measure-
ment error. Based on these constants, we produce the synthetic data by evaluat-
ing the ɦɴȍȕɜ function and adding the noise.
global N = 50
global dt = 1.0
global q1_exact = 0.1
global q2_exact = 2.0
global q2_min = 0.0
global q2_max = 5.0
global q3_exact = 0.05
global sigma = 0.05
global data =
    model(q1_exact, q2_exact, q3_exact) + sigma * randn(Float64, N)
To complete the description of our inverse problem, we define the prior, the
likelihood, and the proposal distribution. We use a uniformly distributed prior
here.
function prior(q::Float64)::Float64
    1 / (q2_max - q2_min)
end
The ratio 𝜋0 (𝑞 ′ )∕𝜋0 (𝑞𝑛 ) in the acceptance probability (14.11) may simplify if the
prior distribution has a suitable form. In fact, the factor 𝜋0 (𝑞 ′ )∕𝜋0 (𝑞𝑛 ) in (14.11)
simplifies to one here.
The likelihood (14.6) uses the standard deviation 𝜎 defined above.
function likelihood(q::Float64)::Float64
    local S = sum((model(q1_exact, q, q3_exact) .- data).^2)
    # the normalization factor of (14.6) is omitted, since it
    # cancels in the acceptance probability
    exp(-S / (2 * sigma^2))
end
If the likelihood is a normal distribution (as is the case in (14.6) in Sect. 14.5.3),
a numerical improvement is possible. Then in the quotient 𝜋(𝑑|𝑞 ′ )∕𝜋(𝑑|𝑞𝑛 ) in
the acceptance probability (14.11), the normalization factor 1∕(2𝜋𝜎²)^{𝑁∕2} can be
cancelled so that we have
𝜋(𝑑|𝑞′) ∕ 𝜋(𝑑|𝑞𝑛) = e^{(𝑆(𝑞𝑛)−𝑆(𝑞′))∕(2𝜎²)}. (14.13)
This form of the ratio has the advantage that the division of two numbers possibly
very close to zero is avoided and thus numerical accuracy is improved. You may
want to implement this improvement when possible (see Problem 14.13).
The third function to be defined is the proposal distribution.
function proposal(q1::Float64, q2::Float64, var::Float64)::Float64
    @assert var > 0
    exp(-(q1 - q2)^2 / (2 * var)) / sqrt(2 * pi * var)
end
Now everything is in place to compute a Markov chain for the parameter ʜЗ.
The following function call returns a long vector, still containing the burn-in
period. The proposal distribution has variance 0.01, and the initial state is the
interval midpoint.
MH_1D(prior, likelihood, proposal,
      10^5, 0.01,
      q2_min, q2_max, (q2_min + q2_max) / 2)
In postprocessing (see Problem 14.14), the burn-in period is discarded and the
values of the Markov chain are sorted into bins. You may want to experiment
with different values for the arguments of MH_1D and for the various (global) variables to see how they affect the posterior distribution.
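A minimal sketch of such postprocessing (one possible direction for Problem 14.14) discards the burn-in period and bins the remaining values; the chain below is synthetic so that the sketch is self-contained.

```julia
using Random

Random.seed!(1)
# a synthetic stand-in for a computed Markov chain
chain = 2.0 .+ 0.05 .* randn(10^5)

burn_in = 10^3
rest = chain[burn_in+1:end]           # discard the burn-in period

# sort the values into bins of width 0.005 on the interval [0, 5]
width = 0.005
counts = zeros(Int, ceil(Int, 5.0/width))
for x in rest
    counts[clamp(floor(Int, x/width) + 1, 1, length(counts))] += 1
end

# the center of the most frequent bin is the MAP estimate
map_estimate = (argmax(counts) - 0.5) * width
```

For this synthetic chain, the MAP estimate lies in a bin centered very close to the true mode 2.0.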
The results from this function call for determining the parameter 𝑞2 in the lo-
gistic equation are shown in Figures 14.3, 14.4, and 14.5. Using a burn-in period
of length 10³ and a bin width of 0.005 for the final histogram, the maximum-a-posteriori (MAP) estimate is 2.0025, which is reasonably close to the exact value 2.
In other words, the most likely bin is centered around 2.0025. The mean value is
Fig. 14.3 Exact solution of the model equation and 50 synthetic measurements with additive
noise (𝜎 = 0.05).
Fig. 14.4 Beginning of a Markov chain of length 105 for the measurement data shown in
Fig. 14.3. The burn-in period discarded when plotting the next figure, Fig. 14.5, is shown in
black.
≈ 2.004 and the median is ≈ 2.004. In all three values, the first three digits agree
with the exact value.
In Problem 14.15, you are asked to extend the implementation to multiple
dimensions.
Fig. 14.5 Histogram of the parameter values found in the Markov chain shown in Fig. 14.4
after the burn-in period.
The posterior density 𝜋(𝑞|𝑑) provides complete information about the model
parameters calculated from the measurements or observations. From this den-
sity, point estimates such as the mean, the median, or a mode can be calculated.
Furthermore, credible intervals are easily calculated as well.
A mode of a continuous probability distribution is a local maximum of its
density. A mode of the posterior density 𝜋(𝑞|𝑑) is called a maximum-a-posteriori
(MAP) estimate and can be written as

𝑞_MAP ∶= arg max_𝑞 𝜋(𝑞|𝑑).
14.5.8 Convergence
Having shown that the Metropolis–Hastings algorithm yields the stationary dis-
tribution of the Markov chain and having implemented the basic algorithm, nu-
merical questions still remain. The two main questions are the following. How
should the proposal distribution be chosen? And how long should the Markov
chain be?
The variance of the proposal distribution affects the Markov chain in an im-
portant way. If the variance is too large, a large proportion of the candidate states
is rejected because they have smaller likelihoods and the chain stagnates for
many iterations. On the other hand, if the variance is too small, the acceptance
probability is large, but the chain explores the parameter space only slowly.
In multidimensional problems, the individual parameters should in general
be explored at different speeds or scales. This is the reason why covariance ma-
trices 𝐷 or 𝑉 are used in the proposal distributions in (14.12) instead of just a
multiple of the identity matrix, which would explore all parameters at the same
scale. We still do not know a good, automatic method to find such a matrix 𝐷 or 𝑉
beyond checking the resulting Markov chain, but we will return to this question
in the next section.
How long should the Markov chain be to ensure convergence and to ade-
quately sample the posterior distribution? This is a difficult question, as analytic
convergence and stopping criteria are lacking. Convergence of Markov-chain
Monte Carlo algorithms can be falsified, but not verified in general. We men-
tion some tests to instill confidence in the convergence of a Markov chain, while
more on this subject can be found, for example, in [2, 4].
The simplest and most straightforward method to assess the burn-in period
and the convergence behavior is to plot or to statistically monitor the marginal
paths of the unknown parameters as in Fig. 14.4. Unfortunately, the chain may
meander around a local mode for a long time before it transitions close to
another local mode or, hopefully, close to the global mode. Depending
on the problem and on the starting point, this may take a very, very long time.
Furthermore, it is possible – at least when the number of unknown param-
eters is sufficiently small – to compare the parameter values resulting from the
Markov chain with the parameter values that stem from applying Bayes’ formula
(14.2) by calculating the integral directly, e.g. by sparse-grid quadrature.
A more statistical test is to keep track of the ratio of accepted states to the total
number of states, which is called the acceptance ratio. Depending on the prob-
lem, a rather large range of acceptance ratios can be reasonable, but acceptance
ratios between 0.1 and 0.5 are usually considered reasonable (see Problem 14.17).
Knowing the acceptance ratio helps to tune the proposal density 𝐽.
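For a continuous proposal distribution, a repeated value in the stored chain almost surely indicates a rejection, so the acceptance ratio can be estimated from the chain itself; a short sketch:

```julia
# estimate the acceptance ratio from a stored chain by counting how
# often consecutive values differ (for continuous proposals, equality
# almost surely means the candidate was rejected)
acceptance_ratio(chain) =
    count(i -> chain[i] != chain[i-1], 2:length(chain)) /
    (length(chain) - 1)

chain = [0.0, 0.1, 0.1, 0.3, 0.3, 0.3, 0.7]
```

For this short example chain, three of the six transitions are acceptances, giving an acceptance ratio of 0.5.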
Another statistical test is to calculate the autocorrelation between subchains
of length 𝐿 of the Markov chain with lag ℎ. While adjacent subchains are likely
correlated because of the Markov property, low autocorrelation often indicates
fast convergence since in this case independent samples are produced and mix-
ing is good. The autocorrelation function
ACF(𝐿, ℎ) ∶= (∑_{𝑖=1}^{𝐿−ℎ} (𝑞𝑖 − 𝑞̄)(𝑞𝑖+ℎ − 𝑞̄)) ∕ (∑_{𝑖=1}^{𝐿} (𝑞𝑖 − 𝑞̄)²) (14.14)

is calculated, where 𝑞̄ denotes the mean of the subchain. While the unbiased estimate

(1∕(𝐿 − ℎ)) ∑_{𝑖=1}^{𝐿−ℎ} (𝑞𝑖 − 𝑞̄)(𝑞𝑖+ℎ − 𝑞̄)
of the autocovariance suggests itself, the estimate (14.14) with the factor 1∕𝐿
instead of 1∕(𝐿 − ℎ) is often used, since it can be shown to be biased with a bias
of (only) order 1∕𝐿 (thus being asymptotically unbiased) and it has the useful
property that its finite Fourier transform is nonnegative, among other properties
[3, Section 4.1].
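The estimate (14.14) is a few lines of Julia; the function below is a direct sketch of the formula.

```julia
using Statistics

# autocorrelation function (14.14) of a chain q for lag h
function acf(q::Vector{Float64}, h::Int)
    L = length(q)
    qbar = mean(q)
    sum((q[i] - qbar)*(q[i+h] - qbar) for i in 1:L-h) /
        sum((q[i] - qbar)^2 for i in 1:L)
end
```

For lag h = 0 the value is one, and for an alternating chain the autocorrelation at lag one is strongly negative, indicating (artificially) fast mixing.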
We now revisit the question how the proposal distribution can be chosen auto-
matically to ensure expedient parameter scaling as the learning progresses.
The answer is provided by adaptive Metropolis algorithms [1, 7, 8, 10] such as
the dram algorithm [6].
Since adaptive Metropolis algorithms change the proposal distribution using
the chain history, they violate the Markov property and no longer yield a Markov
process. Therefore establishing their convergence to the posterior distribution re-
quires further thought. Examples are criteria such as the diminishing-adaptation
condition and the bounded-convergence condition [1, 7, 8].
The Metropolis algorithm becomes adaptive in the dram algorithm in the
following manner. In the beginning, the covariance matrix is initialized as 𝑉1 =
𝐷 (diagonal) or 𝑉1 = 𝑉. Afterwards, the covariance matrix in the 𝑛-th step is
computed as
𝑉𝑛 ∶= 𝑠𝑝 (cov(𝑞1 , … , 𝑞𝑛 ) + 𝜖𝐼𝑝 ). (14.15)
Here 𝑝 is the dimension of the parameter space, and the parameter 𝑠𝑝 is commonly chosen to be 𝑠𝑝 ∶= 2.38²∕𝑝 [6]. The initial period without adaptation
should be chosen long enough that the sampled states are sufficiently diverse to make the covariance matrix nonsingular. The purpose of the second term 𝜖𝐼𝑝, where 𝐼𝑝 is the
𝑝-dimensional identity matrix and 𝜖 ≥ 0, is to ensure that 𝑉𝑛 is positive definite;
it is often possible to set 𝜖 ∶= 0.
The most straightforward way to calculate the covariance in the formula
above is to use the formula for the empirical covariance. However, this becomes
increasingly computationally expensive as 𝑛 increases. A much faster way is to
use recursive formulas.
First, the definition of and a recursive formula for the sample mean 𝑞̄𝑛 in the
𝑛-th step are
𝑞̄𝑛 ∶= (1∕𝑛) ∑_{𝑖=1}^{𝑛} 𝑞𝑖 = (1∕𝑛) 𝑞𝑛 + ((𝑛 − 1)∕𝑛) 𝑞̄𝑛−1, (14.16)
(see Problem 14.20). Second, the direct formula

cov(𝑞1, … , 𝑞𝑛) = (1∕(𝑛 − 1)) (∑_{𝑖=1}^{𝑛} 𝑞𝑖 𝑞𝑖⊤ − 𝑛 𝑞̄𝑛 𝑞̄𝑛⊤) (14.17)

holds for the empirical covariance. Based on this direct formula, the recursive formula
cov(𝑞1, … , 𝑞𝑛) = ((𝑛 − 2)∕(𝑛 − 1)) cov(𝑞1, … , 𝑞𝑛−1) + (1∕(𝑛 − 1)) (𝑞𝑛 𝑞𝑛⊤ − 𝑛 𝑞̄𝑛 𝑞̄𝑛⊤ + (𝑛 − 1) 𝑞̄𝑛−1 𝑞̄𝑛−1⊤) (14.18)
for the empirical covariance can be shown (see Problem 14.21).
Using this recursion for the empirical covariance cov(𝑞1 , … , 𝑞𝑛 ) occurring in
(14.15), we find the recursive formula
𝑉𝑛 = (1∕(𝑛 − 1)) ((𝑛 − 2) 𝑉𝑛−1 + 𝑠𝑝 (𝑞𝑛 𝑞𝑛⊤ − 𝑛 𝑞̄𝑛 𝑞̄𝑛⊤ + (𝑛 − 1) 𝑞̄𝑛−1 𝑞̄𝑛−1⊤ + 𝜖𝐼𝑝)) (14.19)
equivalent to (14.15) above (see Problem 14.22).
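The recursions (14.16) and (14.18) can be checked numerically against the direct computations; the random vectors below are illustrative test data.

```julia
using Statistics, Random

Random.seed!(1)
qs = [randn(3) for _ in 1:50]    # illustrative parameter vectors

qbar = copy(qs[1])               # running sample mean
C = zeros(3, 3)                  # running empirical covariance
for n in 2:length(qs)
    qbar_old = copy(qbar)
    global qbar = qs[n]/n + (n - 1)/n*qbar_old            # (14.16)
    global C = (n - 2)/(n - 1)*C +
        (qs[n]*qs[n]' - n*qbar*qbar' +
         (n - 1)*qbar_old*qbar_old')/(n - 1)              # (14.18)
end

# compare with the direct (non-recursive) computations
M = reduce(hcat, qs)
```

The recursively computed mean and covariance agree with mean and cov applied to all data at once, up to rounding errors.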
The efficiency of the algorithm is further improved if the proposal distribution
is adapted only from time to time as in Algorithm 14.21 below.
Delayed rejection is the second aspect of the dram algorithm. It means that
another candidate state 𝑞′′ is constructed and given a chance instead of retain-
ing the previous value whenever a candidate state 𝑞 ′ has been rejected. This so-
called second-stage candidate 𝑞′′ can be chosen using the proposal function
𝐽2 (𝑞 ′′ ∣ 𝑞𝑛 , 𝑞 ′ ) ∶= 𝑁(𝑞𝑛 , 𝛾2 𝑉𝑛 ),
where 𝑉𝑛 is the covariance matrix calculated in the adaptive part of the algo-
rithm above and 𝛾 ∈ (0, 1) is a constant [6]. Since the constant 𝛾 is smaller than
one, the proposal function 𝐽2 for the second state is narrower than the original
one, which increases mixing. A popular choice for 𝛾 is 1∕5.
This proposal function 𝐽2 must be accompanied by a matching acceptance
probability 𝐴2 in order to ensure that the detailed-balance condition is satisfied.
In Sect. 14.5.5, we used the detailed-balance condition to define a suitable accep-
tance probability after having decided which proposal distribution to use. If the
first proposed state 𝑞′ is accepted, the detailed-balance condition holds by the
calculations in Sect. 14.5.5. Otherwise, if it is rejected, the transition to the second-stage candidate consists of proposing 𝑞′, rejecting it, and proposing and accepting 𝑞′′. Substituting this transition probability into the detailed-balance condition yields the requirement

𝐴2(𝑞′′ ∣ 𝑞𝑛, 𝑞′) ∕ 𝐴2(𝑞𝑛 ∣ 𝑞′′, 𝑞′) = 𝜋(𝑞′′|𝑑) 𝐽(𝑞′|𝑞′′) (1 − 𝐴(𝑞′|𝑞′′)) 𝐽2(𝑞𝑛 ∣ 𝑞′′, 𝑞′) ∕ (𝜋(𝑞𝑛|𝑑) 𝐽(𝑞′|𝑞𝑛) (1 − 𝐴(𝑞′|𝑞𝑛)) 𝐽2(𝑞′′ ∣ 𝑞𝑛, 𝑞′)) =∶ 𝑟.
We can proceed similarly to the case with only one stage in Sect. 14.5.5 to find
such an 𝐴2. A suitable acceptance probability is

𝐴2(𝑞′′ ∣ 𝑞𝑛, 𝑞′) ∶= min(1, 𝑟),

since

𝐴2(𝑞′′ ∣ 𝑞𝑛, 𝑞′) ∕ 𝐴2(𝑞𝑛 ∣ 𝑞′′, 𝑞′) = min(1, 𝑟) ∕ min(1, 1∕𝑟) = 𝑟.
As you will have guessed, these ideas can be extended: third-stage, fourth-stage,
etc. candidate states can be constructed recursively, together with their
proposal densities and acceptance probabilities, as we just did.
Algorithm 14.21 (dram)
1. Initialization: choose the number 𝐾 of steps after which the proposal distribution is adapted; choose the parameter 𝜖; choose the initial covariance matrix 𝑉1 (diagonal or symmetric) in the proposal distribution; choose the factor 𝛾 (often 𝛾 ∶= 1∕5) for the second-stage proposal distribution; and set the iteration number 𝑛 ∶= 1.
If the likelihood function (14.6) is used, the variance 𝜎2 must be known. If
it is not known a priori as is often the case, there are two options:
a. use the empirical estimate
𝜎² ∶= (1∕(𝑁 − 𝑝)) ∑_{𝑖=1}^{𝑁} (𝑑𝑖 − 𝑓𝑖(𝑞))²,

b. or treat 𝜎² as a random parameter and sample it in each step (see the comment at the end of this section).
𝑞′ ∶= 𝑞𝑛 + 𝑅𝑛 𝑧
𝜎𝑛² ∼ InvGamma ((𝑛𝑠 + 𝑁)∕2, (𝑛𝑠 𝜎𝑠² + 𝑆(𝑞𝑛))∕2),
𝜋(𝑞′|𝑑) ∕ 𝜋(𝑞𝑛|𝑑) = e^{(𝑆(𝑞𝑛)−𝑆(𝑞′))∕(2𝜎𝑛²)}
as in (14.13) if the likelihood has the form (14.6). The second fraction
𝜋0 (𝑞 ′ )∕𝜋0 (𝑞𝑛 ) is equal to one in the case of a uniform prior distribution.
The third fraction 𝐽(𝑞𝑛 |𝑞′ )∕𝐽(𝑞 ′ |𝑞𝑛 ) is equal to one if the proposal distri-
bution is symmetric.
e. Accept or reject the first-stage candidate 𝑞′ by generating a uniformly
distributed random number 𝑢 ∼ 𝑈(0, 1) from the interval [0, 1]. If 𝑢 ≤
𝐴(𝑞′ |𝑞𝑛 ), the candidate 𝑞′ is accepted and we set 𝑞𝑛+1 ∶= 𝑞′ .
f. If the first-stage candidate was rejected, accept or reject a second-stage
candidate.
i. Generate a second-stage candidate
𝑞 ′′ ∶= 𝑞𝑛 + 𝛾𝑅𝑛 𝑧,
ii. Set 𝑞𝑛+1 ∶= 𝑞′′ if 𝑢 ≤ 𝐴2(𝑞′′ ∣ 𝑞𝑛, 𝑞′), and 𝑞𝑛+1 ∶= 𝑞𝑛 if 𝑢 > 𝐴2(𝑞′′ ∣ 𝑞𝑛, 𝑞′).
3. Iterate until the chain is long enough to estimate the parameter 𝑞 after dis-
carding a sufficiently long burn-in period at the beginning. Compute any
statistic of interest from the Markov chain without the burn-in period.
We close the discussion of the algorithm with a comment on how to treat the
variance 𝜎2 on the right side in (14.6) as a random parameter to be sampled by
the Markov chain. The likelihood

𝜋(𝑑, 𝑞 ∣ 𝜎²) ∶= (2𝜋𝜎²)^{−𝑁∕2} e^{−𝑆(𝑞)∕(2𝜎²)}

is conjugate to an inverse-gamma prior InvGamma(𝛼, 𝛽) for the variance 𝜎², implying that the posterior of 𝜎² is again an inverse-gamma distribution,

𝜎² ∣ 𝑑, 𝑞 ∼ InvGamma ((𝑛𝑠 + 𝑁)∕2, (𝑛𝑠 𝜎𝑠² + 𝑆(𝑞))∕2),
with the two new parameters 𝑛𝑠 ∶= 2𝛼 and 𝜎𝑠2 ∶= 𝛽∕𝛼. The parameter 𝑛𝑠 can
be interpreted as the number of observations used in the prior distribution, and
the parameter 𝜎𝑠2 is the mean squared error of the observations [4]. Usually 𝑛𝑠 is
chosen to be small, which corresponds to a noninformative prior distribution.
The packages under the Turing umbrella provide Bayesian inference with
general-purpose probabilistic programming. Similarly, the Mamba package implements Markov-chain Monte Carlo methods.
Problems
14.3 Consider the example of a fair die, all of whose six faces have the probability 1∕6. Write down and sketch the probability density and the cumulative probability distribution as discussed in Sect. 14.2. What are the points of discontinuity
and the probabilities 𝑃(𝑋 = 𝑥𝑖 )? At which points is 𝐹𝑋 continuous from the left?
At which points from the right? Furthermore, calculate the expected values 𝔼[𝑋]
and 𝔼[𝑋 2 ] as well as the variance 𝕍[𝑋] = 𝔼[(𝑋 − 𝔼(𝑋))2 ] as Riemann–Stieltjes
integrals.
14.4 For the example at the end of Sect. 14.4, plot the iterated posterior probabil-
ity for different values for the initial prior probability and the likelihood.
14.7 Find an example of a reducible Markov chain with two different stationary
distributions.
14.13 Implement the special form of the acceptance probability for the case of a
normally distributed likelihood.
14.14 Write a function that – given a Markov chain and the length of the burn-in
period – calculates a histogram (given the bin width), the maximum-a-posteriori
(map) estimate, and a (symmetric) confidence interval around the map estimate
based on the histogram and given the percentage of samples to be found in the
confidence interval.
14.24 Implement the dram algorithm (Algorithm 14.21) for parameter vectors.
14.26 Investigate how the parameters of the dram algorithm affect the results
using a parameter-estimation problem of your choice.
14.27 Compare the performance of the Metropolis–Hastings and the dram al-
gorithms. Which one is easier to use?
References
1. Andrieu, C., Thoms, J.: A tutorial on adaptive MCMC. Statistics and Computing 18, 343–
373 (2008)
2. Brooks, S., Roberts, G.: Convergence assessment techniques for Markov chain Monte
Carlo. Statistics and Computing 8(4), 319–335 (1998)
3. Chatfield, C.: The Analysis of Time Series, 6th edn. Chapman & Hall (2003)
4. Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., Rubin, D.: Bayesian Data Anal-
ysis, 3rd edn. Taylor & Francis Group, Boca Raton, FL (2013)
5. Golberg, M., Cho, H.: Introduction to Regression Analysis. WIT Press, Southampton, UK
(2004)
6. Haario, H., Laine, M., Mira, A., Saksman, E.: DRAM: efficient adaptive MCMC. Statistics
and Computing 16(4), 339–354 (2006)
7. Haario, H., Saksman, E., Tamminen, J.: An adaptive Metropolis algorithm. Bernoulli 7(2),
223–242 (2001)
8. Roberts, G., Rosenthal, J.: Examples of adaptive MCMC. Journal of Computational and
Graphical Statistics 18(2), 349–367 (2009)
9. Smith, R.C.: Uncertainty Quantification. SIAM, Philadelphia, PA (2014)
10. Vihola, M.: Robust adaptive Metropolis algorithm with coerced acceptance rate. Statistics
and Computing 22(5), 997–1008 (2012)