
Clemens Heitzinger

Algorithms
with JULIA
Optimization, Machine Learning, and
Differential Equations Using the JULIA
Language
Clemens Heitzinger
Center for Artificial Intelligence
and Machine Learning (CAIML)
and
Department of Mathematics
and Geoinformation
Technische Universität Wien
Vienna, Austria

ISBN 978-3-031-16559-7 ISBN 978-3-031-16560-3 (eBook)


https://doi.org/10.1007/978-3-031-16560-3

Mathematics Subject Classification: 65-XX, 34K28, 65Mxx, 65M06, 65M08, 65M60, 65Kxx, 65Yxx,
62M45, 68T05

© Springer Nature Switzerland AG 2022


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To M.S.
Foreword

Students of applied mathematics are often confronted with textbooks that either
cover the mathematical principles and concepts of mathematical models or
introduce the basic structures of a programming language. Many authors fail to
cover the underlying mathematical theory of models, which is crucial for
understanding the applicability of models, and especially their limitations, to
real-world problems. On the other hand, many textbooks and monographs fail to
address the crucial step from the algorithmic formulation of an applied problem
to the actual implementation and solution in the form of an executable program.
This book brilliantly combines these two aspects using a high-level open-source
computer language and covers many areas of continuum-model-based natural and
social sciences, applied mathematics, and engineering. Julia is a high-level,
high-performance, dynamic programming language which can be used to write any
application in numerical analysis and computational science. Clemens Heitzinger
has gone to great lengths to organize this book into sequences that make sense
for the beginner as well as for the expert in one particular field.
The applied topics are carefully chosen, from the most relevant standard areas
like ordinary and partial differential equations and optimization to more recent
fields of interest like machine learning and neural networks. The chapters on
ordinary and partial differential equations include examples of how to use exist-
ing packages included in the Julia software. In the chapter on optimization,
the methods for standard local optimization are nicely explained. However, this
book also contains a very relevant chapter on global optimization, including
methods such as simulated annealing and agent-based optimization algorithms.
All this is not something usually found in the same book. Again, the global op-
timization theory, as far as the general theory exists, is well presented and the
application examples (and, most importantly, the benchmark problems) are well
chosen. One chapter, concerned with what is currently perhaps the most relevant
area, introduces practical problem solving in the field of machine learning.
The author covers the basic approach of learning via artificial neural networks
as well as probabilistic methods based on Bayesian theory. Again, the topics and
examples are well chosen, the underlying theory is well explained, and the
solutions of the chosen application problems are immediately implementable in Julia.
Clemens Heitzinger has been involved in some of the most impressive applica-
tions in engineering and the applied physical sciences, covering microelectron-
ics, sensors and biomedical applications. This book covers both the theoretical
and practical aspects of this part of modern science. The approach taken in this
book is novel in the sense that it goes into quite some detail in the theoretical
background, while at the same time being based on a modern computing plat-
form. In a sense, this work serves the role of two books. It is much more than a
cookbook for "how to solve problems with Julia"; it is also a good introduction
to the most relevant problems in continuum-model-based science and engineering.
At the same time, it gives the novice in Julia programming a good introduction
to this higher-level programming language. It can therefore be
used as a text for students in an advanced graduate level course as well as a mono-
graph by the researcher planning to solve actual problems by programming in
Julia.

Tempe, April 2022 Christian Ringhofer


Preface

Why computation? The middle of the last century marks the beginning of
a new era in the history of mathematics. Although calculators and computers
had been envisioned centuries before and mechanical calculators and calculating
machines were already in widespread use in the nineteenth century, only the
invention of purely electronic computers made it possible to perform calculations
on increasingly large scales. The reason is quite simple: mechanical and electro-
mechanical calculators (see Fig. 0.1) are severely limited by friction.
The advent of electronic computers and later the rise of the integrated cir-
cuit have resulted in portable devices of astounding computational power at tiny
power consumption (see Fig. 0.2). Computations that were unthinkable a few
decades ago can now be performed at low cost and at great speed. These devel-
opments in the physical realm have resulted in the birth of new mathematical
disciplines. Computer algebra, scientific computing, machine learning, artificial
intelligence, and related areas are concerned with solving abstract mathematical
problems as well as scientific and data-science problems correctly, precisely, and
efficiently.
Although vast computational power is available today, fundamental ques-
tions will always have to be answered. How should the computations be struc-
tured? What are the advantages and disadvantages of various algorithms? How
accurate will the results be? How can we best take advantage of the computa-
tional resources available to us? These are fundamental questions that lead to
new and fascinating mathematical problems. In this sense, the invention of elec-
tronic computers has had and will have a twofold influence on mathematics:
computers are both an enabling technology and a source of new mathematical
problems.
There is no doubt that computers and mathematical algorithms have im-
pacted our lives in many ways. In many engineering disciplines, it has become
common to perform simulations for the rational design and the optimization
of all kinds of devices and processes. Simulations can be much cheaper than
performing many experiments and they provide theoretical and quantitative in-
sights. Examples are airplanes, combustion engines, antennas, and the construction
of bridges and other buildings. Large-scale computations are also behind
search engines, financial services, and other data-intensive industries. There is
probably not a single hour in our daily lives when we do not use a service or a
device that has become possible only through computers and mathematical algorithms
or that has been much improved by them.

Fig. 0.1 A Brunsviga 15 mechanical calculator, produced by Brunsviga-Maschinenwerke
Grimme, Natalis & Co. AG, Braunschweig, between 1934 and 1947. The shrouds are removed
to reveal the internal mechanism. (Photo by CEphoto, Uwe Aranas, no changes, license
CC-BY-SA-3.0, https://creativecommons.org/licenses/by-sa/3.0/.)

Fig. 0.2 Motorola 68000 CPU. (Photo by Pauli Rautakorpi, no changes, license CC-BY-3.0,
https://creativecommons.org/licenses/by/3.0/.)

Using this book, you will learn a modern, general-purpose, and efficient pro-
gramming language, namely Julia, as well as some of the most important meth-
ods in optimization, machine learning, and differential equations and how they
work. These three fields, optimization, machine learning, and differential equa-
tions, have been chosen because they cover a wide range of computational tasks
in science, engineering, and industry.
Methods and algorithms in these areas will be discussed in sufficient detail to
arrive at a complete understanding. You will understand how the computational
approaches work starting from the basic mathematical theory. Important results
and proofs will be given in each chapter (and can be skipped on first reading).
Based on these foundations, you will be provided with the knowledge to imple-
ment the algorithms and your own variants. To this end, sample programs and
hints for implementation in Julia are provided.
The ultimate purpose of this book is to provide the reader both with a working
knowledge of the Julia programming language and with more than a
superficial understanding of modern topics in three important computational
fields. Using the algorithms and the sample codes for leading problems, you will
be able to translate the theory into working knowledge in order to solve your
scientific, engineering, mathematical, or industrial problems.
How is this book unique? This book strives to provide a modern, practical,
and well-founded perspective on algorithms in optimization, machine learning,
and differential equations. Hence there are two main ways in which the present
book differs from other books in this area.
First, the topics were selected with a modern view of computation in mind.
As mathematics and computation evolve, we are able to solve more and more
difficult problems. These advances are reflected in the material in this book. For
example, topics such as artificial neural networks, computational Bayesian esti-
mation, and partial differential equations are discussed, but numerically solving
systems of linear equations is not, since you will most likely not write your own
program to do so due to the availability of well-tested libraries (also immediately
available in Julia).
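To illustrate that point, solving a linear system in Julia is a one-liner via the built-in backslash operator, so there is rarely a reason to implement such a solver yourself. A minimal sketch (the matrix and right-hand side are arbitrary examples):

```julia
using LinearAlgebra

# Solve A x = b with the built-in (LAPACK-backed) backslash operator.
A = [4.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
x = A \ b

println(x)                 # the solution vector
println(norm(A * x - b))   # residual, approximately 0.0
```

The same operator dispatches to specialized factorizations for triangular, symmetric, or sparse matrices without any change to the calling code.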
Optimization is of great value in almost all disciplines. Differential equations
are of great utility in many cases where fundamental relationships between the
known and unknown variables exist, for example in physics, chemistry, and
many engineering disciplines. Furthermore, machine learning in particular is
an area that has benefited from increases in computational power and available
memory and that is of utmost importance when large amounts of data are avail-
able, but fundamental relationships are unknown.
Second, the Julia language, a rather young language designed with scien-
tific and technical computing in mind, is used to implement the algorithms. Its
implementation includes a compiler and a type system that leads to fast com-
piled code. It builds on modern and general programming concepts so that it
is usable for many different purposes. It comes with linear-algebra algorithms,
sparse matrices, and a package system. It is open source and its syntax is easy
to learn. For these reasons, it has quickly gained popularity in scientific
computing and machine learning.
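To give a first impression of the features just listed, the following short sketch (the function name `double` is my own, purely illustrative choice) shows Julia's concise syntax, method dispatch on argument types, and the bundled sparse-matrix support:

```julia
using SparseArrays

# A generic function; the compiler generates specialized, fast code per argument type.
double(x) = 2x

# A second method dispatched on matrices, demonstrating multiple dispatch.
double(A::AbstractMatrix) = 2A

M = sparse([1.0 0.0; 0.0 3.0])   # sparse matrices ship with the standard library
println(double(21))              # 42
println(double(M))               # scaling preserves the sparse representation
```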
In a larger historic perspective, there has been a divide between the Fortran
and the Lisp families of languages since the early days of computing. Julia can
be seen as a successor of Fortran in the path of MATLAB, while it incorporates
several important concepts from languages in the Lisp family. From this point of
view, it can be seen as the intersection of two very different paths in the history
of programming languages.
For whom is this book useful? The book has been written with various audi-
ences in mind, namely
• mathematicians,
• computer scientists, and
• computational scientists and engineers of any kind.
I have tried to combine the mathematical, algorithmic, and computational as-
pects into one exposition, because this intersection is challenging, exciting, and
useful. These three aspects must be included in holistic views of the various sub-
jects in this book, and only a holistic view can provide deeper understanding and
appreciation of the results and algorithms and the paths that lead there.
The prerequisites are the usual ones for a book in these areas: a certain math-
ematical proficiency is required in the form of the basics of linear algebra and
calculus. Proficiency in programming is helpful, but not required, when you take
the next step and begin to implement algorithms. A prior knowledge of Julia is
not required, as its main as well as several of its advanced features are presented
in this book.
The book is perfectly readable and usable by skipping some details and sec-
tions marked as advanced material on first reading. In any case, stories as com-
plete as possible are told in each chapter in order to meet the intellectual demand
of a complete and self-contained treatment.
To provide a complete and self-contained exposition, proofs
of the main results are included. They can, however, be skipped altogether by
readers who are not interested in them or in classes where they are not
required. The level of exposition is mathematically rigorous, yet care was taken
that it remains accessible to a large audience with a background in linear algebra
and calculus.
Most readers will eventually be interested in quantitative solutions of their
mathematical problems. This means that the theoretical mathematical results
should be translated into correct and, hopefully, also efficient computer programs.
To this end, the theoretical background is accompanied not only by algorithms,
but in some cases also by sample programs that illustrate the best practice in im-
plementation. Various exercises have been added in order to assist exploring the
variants of the algorithms and related problems.
In summary, this book has been written with a few goals in mind. First, the
choice of mathematical subjects and computational problems is modern and rel-
evant in the practice of applied mathematics, computer science, and engineering.
Second, the book teaches a modern programming language that is especially use-
ful in technical and scientific applications, while also providing high-level and
advanced programming concepts. Third, in addition to self-study, the book can
be used as a textbook for courses in these areas. By choosing the chapters of inter-
est, the course can be tailored to various needs. The exercises deepen the theory
and help practice translating the theory into useful programs.
Acknowledgments. Finally, it is my pleasure to acknowledge the interest and
support by Klaus Stricker and his department. I would also like to acknowledge
the students in Vienna who helped improve the manuscript.

Wien, March 2022 Clemens Heitzinger


Contents

Part I The Julia Language

1 An Introduction to the Julia Language . . . . . . . . . . . . . . . . . . . . . . . 3


1.1 Brief Historic Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 An Overview of Julia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 The Reproducibility of Science and Open Source . . . . . . 5
1.2.2 Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 Libraries and Numerical Linear Algebra . . . . . . . . . . . . . . 6
1.2.4 Interactivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.5 High-Level Programming Concepts . . . . . . . . . . . . . . . . . . 7
1.2.6 Interoperability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.7 Package System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.8 Parallel and Distributed Computing . . . . . . . . . . . . . . . . . . 8
1.2.9 Availability on Common Operating Systems . . . . . . . . . . 8
1.3 Using Julia and Accessing Documentation . . . . . . . . . . . . . . . . . 9
1.3.1 Starting Julia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.2 The Read-Eval-Print Loop . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.3 Help and Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.4 Handling Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.5 Developing Julia Programs . . . . . . . . . . . . . . . . . . . . . . . . . 14
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Defining Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Argument Passing Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Multiple Return Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Functions as First-Class Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Anonymous Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6 Optional Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31


2.7 Keyword Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32


2.8 Functions with a Variable Number of Arguments . . . . . . . . . . . . 34
2.9 do Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3 Variables, Constants, Scopes, and Modules . . . . . . . . . . . . . . . . . . . . 39


3.1 Modules and Global Scopes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Dynamic and Lexical Scoping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Local Scope Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.1 Hard Local Scopes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.2 Soft Local Scopes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4 let Blocks and Closures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5 for Loops and Array Comprehensions . . . . . . . . . . . . . . . . . . . . . . 47
3.6 Constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.7 Global and Local Variables in this Book . . . . . . . . . . . . . . . . . . . . . 49
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4 Built-in Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


4.1 Characters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.1 Creating and Accessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.2 String Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.3 String Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.4 String Literals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.5 Regular Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5 Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5.1 General Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5.2 Iterable Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5.3 Indexable Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5.4 Associative Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5.5 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5.6 Deques (Double-Ended Queues) . . . . . . . . . . . . . . . . . . . . . 75
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5 User Defined Data Structures and the Type System . . . . . . . . . . . . 79


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2 Type Annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.1 Annotations of Expressions . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2.2 Declarations of Variables and Return Values . . . . . . . . . . 81
5.3 Abstract Types, Concrete Types, and the Type Hierarchy . . . . . . 82
5.4 Composite Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.5 Constructors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.6 Type Unions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89


5.7 Parametric Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.7.1 Parametric Composite Types . . . . . . . . . . . . . . . . . . . . . . . . 89
5.7.2 Parametric Abstract Types . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.8 Tuple Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.9 Pretty Printing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.10 Operations on Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.11 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

6 Control Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.1 Compound Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 Conditional Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3 Short-Circuit Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.4 Repeated Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.5 Exception Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.5.1 Built-in Exceptions and Defining Exceptions . . . . . . . . . . 109
6.5.2 Throwing and Catching Exceptions . . . . . . . . . . . . . . . . . . 109
6.5.3 Messages, Warnings, and Errors . . . . . . . . . . . . . . . . . . . . . 114
6.5.4 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.6 Tasks, Channels, and Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.7 Parallel Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.7.1 Starting Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.7.2 Data Movement and Processes . . . . . . . . . . . . . . . . . . . . . . 123
6.7.3 Parallel Loops and Parallel Mapping . . . . . . . . . . . . . . . . . 126
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

7 Macros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.2 Macros in Common Lisp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.3 Macro Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.4 Two Examples: Repeating and Collecting . . . . . . . . . . . . . . . . . . . 139
7.5 Memoization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.6 Built-in Macros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.7 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

8 Arrays and Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153


8.1 Dense Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
8.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
8.1.2 Construction, Initialization, and Concatenation . . . . . . . 154
8.1.3 Comprehensions and Generator Expressions . . . . . . . . . . 158

8.1.4 Indexing and Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . 159


8.1.5 Iteration and Linear Indexing . . . . . . . . . . . . . . . . . . . . . . . 162
8.1.6 Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.1.7 Broadcasting and Vectorizing Functions . . . . . . . . . . . . . . 164
8.2 Sparse Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.3 Array Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.4 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.4.1 Vector Spaces and Linear Functions . . . . . . . . . . . . . . . . . 171
8.4.2 Basis Change . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
8.4.3 Inner-Product Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
8.4.4 The Rank-Nullity Theorem . . . . . . . . . . . . . . . . . . . . . . . . . 181
8.4.5 Matrix Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
8.4.6 The Cross Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
8.4.7 The Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
8.4.8 Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
8.4.9 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . 204
8.4.10 Singular-Value Decomposition . . . . . . . . . . . . . . . . . . . . . . 218
8.4.11 Summary of Matrix Operations and Factorizations . . . . 221
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226

Part II Algorithms for Differential Equations

9 Ordinary Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229


9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
9.2 Existence and Uniqueness of Solutions * . . . . . . . . . . . . . . . . . . . . 231
9.3 Systems of Ordinary Differential Equations . . . . . . . . . . . . . . . . . . 235
9.4 Euler Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
9.4.1 The Forward and the Backward Euler Methods . . . . . . . . . . 237
9.4.2 Truncation Errors of the Forward Euler Method . . . . . . . 238
9.4.3 Improved Euler Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
9.5 Variation of Step Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
9.6 Runge–Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
9.7 Butcher Tableaux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
9.8 Adaptive Runge–Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . 246
9.9 Implementation of Runge–Kutta Methods . . . . . . . . . . . . . . . . 248
9.10 Julia Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
9.11 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255

10 Partial-Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257


10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
10.2 Elliptic Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259

10.2.1 Three Physical Phenomena . . . . . . . . . . . . . . . . . . . . . . . . . 260


10.2.2 Boundary Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
10.2.3 Existence, Uniqueness, and a Pointwise Estimate * . . . . 268
10.3 Parabolic Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
10.4 Hyperbolic Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
10.5 Finite Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
10.5.1 One-Dimensional Second-Order Discretization . . . . . . . 278
10.5.2 Compact Fourth-Order Finite-Difference Discretizations 281
10.6 Finite Volumes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
10.7 Finite Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
10.8 Julia Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
10.9 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304

Part III Algorithms for Optimization

11 Global Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307


11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
11.2 No Free Lunch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
11.3 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
11.3.1 The Metropolis Monte Carlo Algorithm . . . . . . . . . . . . . . 311
11.3.2 The Simulated-Annealing Algorithm . . . . . . . . . . . . . . . . . 313
11.3.3 Cooling Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
11.4 Particle-Swarm Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
11.5 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
11.5.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
11.5.2 Genotypes and Phenotypes . . . . . . . . . . . . . . . . . . . . . . . . . 318
11.5.3 Fitness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
11.5.4 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
11.5.5 Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
11.6 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
11.7 Random Restarting and Hybrid Algorithms . . . . . . . . . . . . . . . . . 323
11.8 Benchmark Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
11.9 Julia Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
11.10 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328

12 Local Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329


12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
12.2 The Hessian Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
12.3 Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
12.4 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335

12.5 Accelerated Gradient Descent * . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342


12.6 Line Search and the Wolfe Conditions . . . . . . . . . . . . . . . . . . . . . . 346
12.7 The Newton Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
12.8 The bfgs Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
12.9 The l-bfgs (Limited-Memory bfgs) Method . . . . . . . . . . . . . . . . 356
12.10 Julia Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
12.11 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360

Part IV Algorithms for Machine Learning

13 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365


13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
13.2 Feeding Forward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
13.3 The Approximation Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
13.4 Handwriting Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
13.5 Cost Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
13.6 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
13.7 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
13.8 Hyperparameters and Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . 386
13.9 Improving Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
13.9.1 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
13.9.2 Cost Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
13.10 Julia Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
13.11 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396

14 Bayesian Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397


14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
14.2 The Riemann–Stieltjes Integral . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
14.3 Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
14.4 Frequentist and Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . . 402
14.5 Parameter Estimation and Inverse Problems . . . . . . . . . . . . . . . . . 405
14.5.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
14.5.2 The Logistic Equation as an Example . . . . . . . . . . . . . . . . 406
14.5.3 The Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
14.5.4 Markov-Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . 409
14.5.5 The Metropolis–Hastings Algorithm . . . . . . . . . . . . . . . . . 412
14.5.6 Implementation of the Metropolis–Hastings Algorithm 416
14.5.7 Maximum-a-Posteriori Estimate and Maximum-
Likelihood Estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421

14.5.8 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422


14.5.9 The Delayed-Rejection Adaptive-Metropolis (dram)
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
14.6 Julia Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
14.7 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
Part I
The Julia Language
Chapter 1
An Introduction to the Julia Language

Dicebat Bernardus Carnotensis
nos esse quasi nanos gigantium humeris insidentes,
ut possimus plura eis et remotiora videre,
non utique proprii visus acumine aut eminentia corporis,
sed quia in altum subvehimur et extollimur magnitudine gigantea.
—John of Salisbury, Metalogicon (1159)

(Bernard of Chartres used to say that we are like dwarfs sitting on the shoulders
of giants, so that we can see more and farther than they, not by any sharpness of
our own sight or height of body, but because we are carried aloft and raised up
by their gigantic stature.)

Abstract This chapter provides a first introduction to the Julia language. A
brief historic overview of programming languages is given, which is important
to understand the dichotomy between static programming languages such as
Fortran and dynamic languages such as Lisp as well as their uses. Then an
overview of Julia discusses some important design choices and major attributes
of the language as well as why and how it is useful and meets today's
requirements. We then learn how to start Julia, how to interact with it, how to
run Julia programs, and how to install packages. Finally, various development
environments are presented, and commands for accessing help and documentation
from within the Julia system itself are explained.

1.1 Brief Historic Overview

The calculations an electronic computer performs are ultimately performed by
the movement of electrons. Therefore the execution of any computer program
that implements an algorithm depends on several layers of hard- and software.
Transistors, which gate electron flow, and memory are arranged into integrated
circuits whose task is to store and to execute machine code and to store data.
In the early days of electronic computers, programs were written directly in
machine code, but soon more abstract ways to specify programs than machine code
and assembly language were sought.

© Springer Nature Switzerland AG 2022
C. Heitzinger, Algorithms with JULIA,
https://doi.org/10.1007/978-3-031-16560-3_1
Fortran (the Formula Translating System) is generally considered the oldest
high-level programming language and dates back to the 1950s. It has been
standardized for most of its existence. In many applications, it has been largely
superseded by the commercial matlab software due to its convenient plotting
routines, built-in availability of standard numerical linear-algebra algorithms,
interactive use, large collection of additional software packages, and hence
increased productivity.
Lisp (the List Processor) is the second-oldest high-level programming language
and was originally specified in 1958 [8, 9]. It gave rise to a large and diverse
family of languages and it has influenced many other programming languages.
Common Lisp [1] is its most prominent member. Although Lisp has never been
nearly as popular as Fortran for numerical computations, it is ideally suited
for symbolic computations and computer algebra. The computer-algebra system
Macsyma [6] was written in Lisp and continues to be developed in the form of
Maxima [7]. Common Lisp has also been playing a major role in artificial
intelligence.
While Fortran emphasizes vectors and matrices as data structures, early
Lisp dialects emphasized data structures such as lists and symbols, which are
especially useful for symbolic computations. It may just be an unfortunate his-
toric coincidence that no single language in the early days of computing accom-
modated these two different needs in an efficient manner, and this fact may have
resulted in a historic divide. In any case, a programming language that unifies ef-
ficient symbolic and numerical computations is certainly highly useful in math-
ematics.
Large-scale numerical computations require – at least on today’s computing
architectures – various layers of hard- and software: transistors, integrated cir-
cuits, machine code, assembly code, operating system, compiler, programming
language, and algorithm. Of course, this book is not concerned with the layers
(and technological marvels) up to and including the operating system. However,
the top three layers, namely the choices of compiler, programming language, and
algorithm, may have a profound impact on the final result or may be instrumen-
tal in being able to obtain a result at all.
Why is the choice of programming language important? There are three times
in the life of a program:
• First, the program is written, i.e., an algorithm is implemented.
• Second, the program is compiled (or interpreted).
• Finally, the program is executed and produces output.
These three times result in requirements on programming languages suitable for
the task at hand. High-level programming languages are preferable, since the
time we have at our disposal to implement an algorithm or to find the most suit-
able numerical method is always limited. High-level programming languages
by definition offer features that reduce implementation time. At the same time,
the final programs should be efficient. Therefore the availability of state-of-the
art compiler technology is another important requirement. The efficiency of the
generated code is often linked to the type system of the programming language.
Therefore the type system and the programming language should have been de-
signed such that they support the task of the compiler to generate fast code with
minimal burden on the programmer.
In the past, programming languages were usually standardized and had sev-
eral implementations. Nowadays, the situation is different; many popular pro-
gramming languages are not standardized, and their single implementations
serve as their specifications. Therefore the choice of programming language of-
ten severely limits the choice of compiler. This means that the choices of pro-
gramming language and compiler are not independent, and one often has to
decide on a combination of both.
A final requirement or consideration is the availability of well-tested libraries
so that algorithmic and numerical wheels do not have to be reinvented.

1.2 An Overview of Julia

matlab is probably the most well-known and widely used programming lan-
guage in scientific computing and engineering. It has gained this position mainly
by being a more convenient and more productive alternative to Fortran. mat-
lab can be used interactively, many numerical algorithms are either built-in or
available as packages, and plotting is easy. This is a large productivity gain com-
pared to writing a Fortran program from scratch or downloading and installing
libraries.
The programming language used in this book is Julia [2]. Julia is a high-
level, high-performance, and dynamic programming language that has been de-
veloped with scientific and technical computing in mind. It offers features that
make it very well suited for computing in science, engineering, and machine
learning in view of the requirements posed in Sect. 1.1, while some of the fea-
tures are unique for a programming language in this field. An overview of the
key features of the Julia language is given in the following.

1.2.1 The Reproducibility of Science and Open Source

Julia is open source and distributed under the so-called mit license. Apart from
concerns about licensing costs, the access to source code is essential for the re-
producibility of science. Reproducibility is one of the main principles of the sci-
entific method [10, 4] and hence also of artificial intelligence, machine learning,
scientific computing, and computational science [12].
Reproducibility is important whenever calculations are performed. By using
an open-source operating system and an open-source implementation of the pro-
gramming language, it is – at least in principle – possible to know precisely which
software was executed and which instructions were performed. Furthermore, it
is also possible – again at least in principle – to check the correctness of all the
software involved.
Unfortunately, it is likely that any large piece of software contains errors; an
example is discussed in [3]. Therefore access to the source code is a reasonable
requirement whenever errors in the implementation of a programming language
are encountered or its behavior is to be verified.

1.2.2 Compiler

The implementation of the Julia language includes a native-code compiler.
More precisely, it includes a just-in-time compiler based on the llvm compiler
infrastructure. Employing a high-performance and well-supported compiler in
combination with Julia's design yields performance that is – as micro-benchmarks
show – within a small factor of C and Fortran and faster in some benchmarks.
By leveraging state-of-the-art compiler technology, Julia achieves performance
that is comparable to other popular programming languages not only for calls
of library functions, but also for user-defined functions.
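The effect of just-in-time compilation is easy to observe in a session: the first call of a function compiles it for the given argument types, while later calls reuse the generated native code. A minimal sketch (the function and the timings are illustrative only and depend on the machine):

```julia
# A tiny function; the first call triggers just-in-time compilation
# for the given argument type, later calls reuse the compiled code.
f(x) = 2x + 1

t1 = @elapsed f(1.0)   # includes compilation time
t2 = @elapsed f(2.0)   # runs already compiled native code

println("first call: ", t1, " s, second call: ", t2, " s")
```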

1.2.3 Libraries and Numerical Linear Algebra

Julia uses standard packages for numerical linear algebra such as blas (Basic
Linear Algebra Subprograms), lapack (Linear Algebra Package), and Suite-
Sparse (a collection of sparse-matrix software). These libraries are standard
among other programming languages and software systems so that the perfor-
mance of many linear-algebra algorithms in Julia should be comparable if not
identical to the performance in these other languages.
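As a brief illustration, Julia's LinearAlgebra standard library dispatches to these routines under the hood, so solving a linear system needs only the backslash operator. The matrix and right-hand side below are arbitrary example values:

```julia
using LinearAlgebra

A = [2.0 1.0; 1.0 3.0]     # small example matrix
b = [3.0, 5.0]

x = A \ b                  # solved via an LAPACK factorization
@show x
@show norm(A * x - b)      # the residual is numerically zero
```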

1.2.4 Interactivity

The dichotomy between the Fortran and Lisp families of languages manifests
itself clearly in the question whether functions can be called interactively or not.
While Fortran programs follow a strict write-compile-execute cycle, Lisp sys-
tems allow the interactive execution of expressions as well as the interactive def-
inition and compilation of functions [11]. At the same time, they usually support file
compilation, saving of memory-image files, and generation of binaries.
Interactivity has a large effect on productivity, especially in explorative prob-
lem solving and programming. It makes it possible to implement complicated
algorithms step by step and to immediately test them piece by piece. An interactive
environment makes it possible to set up complicated test cases and keep them in
memory while defining new functions. Only the redefined functions need to be
compiled so that potentially long compilation times are avoided when develop-
ing programs interactively.

1.2.5 High-Level Programming Concepts

Julia provides several high-level programming concepts and is heavily influenced
by the Lisp family of languages, although its syntax is much closer to
mathematical notation.
Julia supports object-oriented programming, provides a dynamic type system,
and uses types for documentation, optimization, and dispatch. Users can of
course define new types, and user-defined types are as fast and as compactly
represented in memory as the built-in ones.
A dispatch mechanism decides which version of a function to run depend-
ing on the types of its arguments. Julia uses the most general form of dispatch,
namely multiple dispatch, similar to clos (Common Lisp Object System) [5].
Multiple dispatch makes it possible to define function behavior depending on
many combinations of argument types and not only the type of the first argu-
ment. Julia also automatically generates efficient, specialized code for the differ-
ent combinations of argument types. In this way, it provides both a very general
approach to types, objects, and function dispatch, while sophisticated compila-
tion techniques ensure that efficient code is generated.
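The dispatch mechanism described above can be sketched in a few lines; the type and function names below are invented for illustration:

```julia
abstract type Shape end

struct Circle <: Shape
    r::Float64
end

struct Square <: Shape
    a::Float64
end

# Methods are selected based on the types of the arguments.
area(s::Circle) = pi * s.r^2
area(s::Square) = s.a^2

# Multiple dispatch: behavior depends on the combination of both argument types,
# not only on the type of the first argument.
combine(x::Circle, y::Circle) = "two circles"
combine(x::Circle, y::Square) = "a circle and a square"
combine(x::Square, y::Square) = "two squares"

println(combine(Circle(1.0), Square(2.0)))
```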
Julia is a homoiconic programming language, meaning that programs are
represented in a data structure in a built-in type of the language itself. In other
words, code is naturally represented in the language itself, and therefore code
is easily treated as data. This makes it possible to write programs that write pro-
grams in a straightforward manner. Such program-writing programs are called
macros in the Lisp family of languages, and macros can hence easily be defined
in Julia. As a homoiconic programming language, Julia also provides other
metaprogramming facilities.
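A short sketch of what homoiconicity means in practice: quoted code is an ordinary Julia data structure that can be inspected and transformed, and a macro is simply a function from expressions to expressions. The macro below is a toy example, not a built-in:

```julia
# Quoted code is data: an Expr object with a head and arguments.
ex = :(1 + 2 * 3)
@show ex.head        # :call
@show ex.args        # contains :+, 1, and :(2 * 3)

# A toy program-writing program: a macro that wraps an expression so that
# the expression itself is printed alongside its value.
macro traced(e)
    return :(println($(string(e)), " = ", $(esc(e))))
end

@traced 1 + 2 * 3    # prints: 1 + 2 * 3 = 7
```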
In summary, these state-of-the-art programming concepts render Julia not
only useful in certain application areas, but also suitable for general programming
tasks. As modern algorithms use a larger variety of data structures than scalars,
vectors, and arrays, the support of user-defined types and multiple dispatch is
welcome for implementing sophisticated algorithms.

1.2.6 Interoperability

As mentioned in Sect. 1.2.3, Julia uses external libraries for numerical linear al-
gebra. In addition to this built-in use of external libraries, external libraries can
generally be called easily from Julia programs. External functions in C and For-
tran shared libraries can be called without writing any wrapper code, and they
can even be called directly from Julia’s interactive prompt. Furthermore, the
PyCall package makes it possible to call Python code. Other operating-system
processes can also be invoked and managed from within Julia by using its shell-
like capabilities.
Vice versa, Julia itself can be built as a shared library so that users can call
their Julia functions from within their C or Fortran programs.
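Calling a C function from a shared library indeed requires no wrapper code; the `ccall` syntax takes the function name (and optionally the library), the return type, the argument types, and the arguments. A small sketch, assuming a POSIX-like system where the libc symbols are visible in the process:

```julia
# Call the C standard-library function strlen without any wrapper code.
n = ccall(:strlen, Csize_t, (Cstring,), "hello")
println(n)          # 5

# clock() returns the processor time used by the current process.
t = ccall(:clock, Clong, ())
println(t isa Integer)
```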

1.2.7 Package System

Julia comes with a package system that is both easy to use and easy to con-
tribute to. It is straightforward to install packages and their dependencies, to
update them, and to remove them. Many packages in the package system build
on libraries written in other languages and provide interfaces to external func-
tionalities in a Julia-like style.

1.2.8 Parallel and Distributed Computing

Julia comes with built-in functionality to run programs in parallel. The first
type of parallel execution is running on a single computer and harnessing the
power of the multiple cores of a cpu or of the multiple cpus of a computer. The
second type is clusters spanning multiple computers. For both types of paral-
lel execution, introspective features make adding, removing, and querying the
processes in a cluster straightforward.
Parallel function execution is achieved by just using the parallel version of a
mapping function or by a parallel for loop. For parallel algorithms that require
non-trivial communication, functions for moving data, for synchronization, for
scheduling, and for shared arrays are available as well.
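As a brief sketch of this built-in functionality: the Distributed standard library can add local worker processes and distribute work with `pmap` or a `@distributed` loop. The number of workers and the workloads below are arbitrary example values:

```julia
using Distributed

addprocs(2)                      # start two local worker processes

# pmap distributes the calls across the workers.
squares = pmap(x -> x^2, 1:10)
println(squares)

# A parallel for loop with a reduction (+) over the partial results.
s = @distributed (+) for i in 1:100
    i
end
println(s)                       # 5050
```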

1.2.9 Availability on Common Operating Systems

Julia is available on macOS, Linux, Windows, and FreeBSD. It can be down-


loaded as a binary distribution, it is available through various package managers,
and the user can download and build the Julia source code and its dependencies
with a few commands.
1.3 Using Julia and Accessing Documentation

1.3.1 Starting Julia

After installing Julia, you can start it by double-clicking the Julia icon (de-
pending on your system and installation) or by running the Julia executable
from the command line. If you just type
> julia

at the command line, Julia starts, displays a banner, and prompts you for input
with its own prompt julia>. You can quit the interactive session by typing
exit() and pressing the return key or by typing control-d.
To run a Julia program saved in a file called program.jl non-interactively,
type
> julia program.jl

at the command line. If your program is supposed to take arguments from the
command line, you can simply pass them at the end of the command line and
they will be available in Julia in the global variable ARGS as an array of strings.
> julia program.jl arg1 arg2 arg3

You can also pass code to be executed directly from the command line to Julia
by using the -e command-line option. Then the traditional example looks like
this.
> julia -e 'println("Hello, world!")'
Hello, world!

Note that the quotes around the Julia code depend on your command shell and
may differ from '.
When Julia code is executed via the -e command-line option, the arguments
are stored in ARGS as well. We can try to retrieve the command-line arguments
passed to Julia code like this.
> julia -e 'ARGS' arg1 arg2 arg3

However, nothing is printed. Using -e, the expression is only evaluated, but the
result is not printed. To print a value (followed by a newline), we can use the
command-line option -E, which evaluates an expression and shows the result.
> julia -E 'ARGS' arg1 arg2 arg3
["arg1", "arg2", "arg3"]

Another possibility is to use the println function.
> julia -e 'println(ARGS)' arg1 arg2 arg3
["arg1", "arg2", "arg3"]
The output is the printed representation of an array containing the three strings
shown. To print each element of the ARGS array on a separate line, you can map
the println function over the array stored in the variable ARGS.
> julia -e 'map(println, ARGS)' arg1 arg2 arg3
arg1
arg2
arg3

Running julia --help from your command line gives an overview of the
many options of the Julia executable. Command-line arguments for the Julia
executable must be passed before the name of the file to be executed as in this
example.
> julia --optimize program.jl arg1 arg2 arg3

Any Julia code you put into the file $HOME/.julia/config/startup.jl in
your home directory will be executed every time Julia is started.
So far we have seen how to run your Julia programs from the command line.
Although this is the usual way Julia programs are run in production envi-
ronments, it is only one way to run a Julia program. While developing programs,
interactive sessions connected to an editor are much preferred. The features of
interactive sessions are explained next.

1.3.2 The Read-Eval-Print Loop

Julia is often used interactively via its repl. The abbreviation repl is short for
read-eval-print loop and has its roots in Lisp implementations. The three parts
of the repl are the following.
Read: An expression typed by the user is read. An error is raised if it is not syn-
tactically correct.
Eval: The expression is evaluated. The result is a value, unless an error was
raised.
Print: The value is printed or – if an error occurred – the error message is dis-
played. Finally, a new prompt is displayed and the loop is repeated.
This implies that each expression in Julia returns a value, just as in Lisp.
(Therefore it has been said about Lisp programmers that they know the value
of everything, but the (computational) cost of nothing.) There is no expression
that does not return a value. Displaying a value can, however, be suppressed by
appending a semicolon ѓ at the end of the input.
julia> "Hello, world!"
"Hello, world!"
julia> "Hello, everyone!";
julia> ans
"Hello, everyone!"

This example shows that the shortest hello-world program in Julia is just the
string "Hello, world!". It also shows that the variable ans, short for answer, is
bound to the value of the previous expression evaluated by the repl irrespective
of whether it was printed or not. The variable ans is only bound in repls.
In addition to strings, numbers such as integers and floating-point numbers
also evaluate to themselves and can be entered as usual. Furthermore, it is
possible to type additional underscores _ in order to divide long numbers and
make them easier to read. The groups do not have to contain three digits.
julia> 1_000_000_000 * 0.000_000_001
1.0
julia> 1_2_3
123

You can load a Julia source file called "file.jl" and evaluate the expressions
it contains using include("file.jl"). Since the function include works
recursively, this also makes it possible to split programs into various files. Dur-
ing development, however, smaller pieces of code are usually evaluated (see
Sect. 1.3.5). Larger Julia programs, on the other hand, such as packages whose
source code is distributed online, are usually installed and loaded as packages
much more easily than by working with single files (see Sect. 1.3.4).
When you are working in the Julia repl, you can execute shell commands
conveniently by switching to shell mode, which is entered by just typing a
semicolon ;. Then the Julia prompt changes and shell commands such as ls or ps
can be executed. Typing backspace switches the prompt back to the usual Julia
prompt.
You can save lots of typing at the repl using autocompletion. If you press
the tab key, the symbol you started typing is completed, or – if the completion
is not unique – completions are suggested after pressing tab a second time. This
feature also works in shell mode, where it is convenient to complete names of
directories and files.
The repl remembers the expressions it has evaluated previously. Ex-
pressions from previous interactive sessions are also stored in the file
$HOME/.julia/logs/repl_history.jl. The straightforward way to access pre-
vious expressions is using the up and down arrow keys. But you can also search
the history forwards and backwards with control-s and control-r, respectively.
Other keyboard commands analogous to the Emacs text editor are available as
well.

1.3.3 Help and Documentation

Similarly to the shell mode, you can enter the help mode of the repl by typing
a question mark ? at the Julia prompt. The prompt changes and you can enter
a string to search for. You can again use the tab key to complete the string you
typed as a symbol or to view possible completions. After pressing enter, the doc-
umentation is searched and you will be presented with the documentation for
the symbol you entered or further matches of the string you typed.
To search all documentation for a string, you can use the apropos function.
julia> apropos("Euler")
Base.MathConstants.gamma
Base.MathConstants.eulergamma

The @doc macro allows you to access the documentation string of any symbol
and also to change it. Documentation is accessed by @doc(symbol) or more
simply by @doc symbol (see Chap. 7 for more information about macros and how
to use them).
julia> @doc @doc

Table 1.1 contains a list of functions and macros that are useful to inspect
the state of the Julia executable or to interact with the operating system. The
contexts in which several of these functions and macros are useful will become
clearer later, but they are collected here for reference.
The value of the constant VERSION is of type VersionNumber and can easily
be compared to other values of the same type. This makes it possible to write
programs that work with different versions of Julia.
julia> VERSION >= v"1.0"
true
julia> VERSION >= v"2.0"
false

1.3.4 Handling Packages

The plain way to handle packages is to use the functions provided by the Pkg
package, which we must load first.
julia> import Pkg

The most important functions in this package are Pkg.add, Pkg.rm, and
Pkg.update. For example, to install a package called CSV (for reading and
writing comma-separated values), we can use the Pkg.add function.
julia> Pkg.add("CSV")

To remove it, we can use the Pkg.rm function.
julia> Pkg.rm("CSV")
Table 1.1 Generally useful functions and macros.


Function or macro   Description
ans                 variable with the last evaluated expression
apropos             search documentation for a string
atexit              register a function to be called at exit
atreplinit          register a function to be called before starting a repl
clipboard           send a string to and receive a string from the clipboard
@doc                access and modify documentation strings
dump                show the user-visible structure of a value
edit a,b            edit a file or function definition
@edit b             edit a function definition
ENV                 operating-system environment variables
exit                exit the Julia executable, possibly supplying an exit code
fieldnames          return an array of the fields of a type
include             evaluate a source file, files are fetched from node 1
isinteractive       whether Julia is running interactively
less                show a file or function definition
@less               show a function definition
methods             return the methods of a function
methodswith         return methods with an argument of the given type
names               return an array of the names exported by a module
@show               show an expression and return it
summary             briefly describe a value
@time               print timing and allocation information, return value
@timed              return value as well as timing and allocation information
@timev              verbose version of @time
VERSION             the version number, a value of type VersionNumber
versioninfo         print version information about Julia, libraries, and packages
which a             return the applicable method
which a             return the module where a variable was defined
@which              return the applicable method

a The behavior depends on the types of the arguments.
b The editor called is the one given by ENV["EDITOR"].

To update all installed packages, we can use the Pkg.update function.

julia> Pkg.update()

Another mode that can be entered from the repl is the package mode for
handling packages. It is entered by typing ] at the Julia prompt. The prompt
changes to end in pkg>. Typing tab at the prompt shows a list of all commands
available in package mode. For example, to install the CSV package, type add CSV
at the package prompt; to remove it, type remove CSV; and to update all packages,
type update at the package prompt.

1.3.5 Developing Julia Programs

Because of the repl, developing Julia programs resembles developing Lisp,
Python, or Matlab programs in the sense that it may become a very interactive
process. A good practice is to compose larger programs out of functions that
are small enough to serve a single, well-defined purpose, to be understood with-
out unnecessary context, and to be tested exhaustively. In short, the programmer
should be able to prove the correctness of each function. Then larger programs
can be assembled from these building blocks and well-thought-out data struc-
tures.
When writing functions in an editor, the repl is useful to explore ideas and
to query the built-in documentation. Function definitions and source files can
quickly be loaded into the Julia executable. Reloading definitions is usually
supported by editors suitable for programming. Whenever a function definition
appears satisfactory, it can immediately be tested in the repl. Data for testing
functions can be saved in variables in the repl so that the cycle of exploring
ideas, implementing them, testing the functions on realistic data, and improv-
ing them does not have to be interrupted, but can be performed on the same input
repeatedly. In languages that cannot be used interactively, changing a program
necessitates the building of a new executable and loading test data anew, which
are potentially time consuming steps. In a dynamic language such as Julia, re-
defining and compiling single functions is a simple and often performed proce-
dure and much accelerates the loop consisting of exploration, implementation,
testing, and improvement.
The method of developing larger programs by abstracting their pieces into
functions is very effective. In the ideal case, it is possible to show the correctness
of each function at least informally. This development style should also be sup-
ported by the choice of editor and development environment. Two such environ-
ments are discussed briefly in the following.
The first development environment is Emacs in conjunction with its Julia
mode. Emacs (short for editor macros) is a Lisp programmable editor which is
convenient for many different text editing and programming tasks because of its
many modes that provide special key bindings and interfaces to other programs.
The Julia mode makes it straightforward to interact with a Julia repl, to eval-
uate Julia expressions and files, and to access documentation.
The second environment is IJulia. It is a popular way of writing shorter Ju-
lia programs and preparing reports. IJulia provides a graphical user interface
that runs within a web browser. It resembles a notebook, where you can input
Julia expressions and function definitions in cells. You can evaluate an input
cell by typing shift-enter, and then the output is collected below the input cell.
Text such as documentation or a narrative for the calculations can be saved in a
notebook file as well. Finally, the notebook can be exported to various formats
such as Julia code in a text file, pdf, html, and Markdown.
IJulia is in fact a Julia package and can be installed as described in
Sect. 1.3.4. Then it is loaded and started using these two commands.

julia> import IJulia
julia> IJulia.notebook()

If files in the current directory are of interest, the following function call is useful.

julia> IJulia.notebook(dir = pwd())

An alternative that has the same effect but saves some typing is the following
pair of commands.

julia> using IJulia
julia> notebook()

Problems

1.1 Install Julia on your computer.

1.2 Install an extension for dealing with Julia programs in your favorite text
editor or install the IJulia package.

References

1. American National Standards Institute (ANSI), Washington, DC, USA: Programming Lan-
guage Common Lisp, ANSI INCITS 226-1994 (R2004) (1994)
2. Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: The Julia programming language.
http://julialang.org
3. Durán, A., Pérez, M., Varona, J.: The misfortunes of a trio of mathematicians using com-
puter algebra systems. Can we trust in them? Notices of the AMS 61(10), 1249–1252 (2014)
4. Fisher, R.: The Design of Experiments. Oliver and Boyd, Edinburgh (1935)
5. Keene, S.: Object-Oriented Programming in Common Lisp: A Programmer’s Guide to CLOS.
Addison-Wesley Professional (1989)
6. The Mathlab Group, Laboratory for Computer Science, MIT, Cambridge, MA 02139: MAC-
SYMA Reference Manual, Version Nine, Second Printing (1977)
7. Maxima, a Computer Algebra System, version 5.43.0 (2019).
http://maxima.sourceforge.net
8. McCarthy, J.: Recursive functions of symbolic expressions and their computation by ma-
chine (part I). Comm. ACM 3, 184–195 (1960)
9. McCarthy, J.: LISP 1.5 Programmer’s Manual. The MIT Press (1962)
10. Popper, K.: Logik der Forschung. Zur Erkenntnistheorie der modernen Naturwissenschaft.
Verlag von Julius Springer, Wien (1935)
11. Sandewall, E.: Programming in an interactive environment: the “Lisp” experience. Com-
puting Surveys 10(1), 35–71 (1978)
12. Stodden, V., Borwein, J., Bailey, D.: Setting the default to reproducible in computational
science research. SIAM News 46(5), 4–6 (2013)
Chapter 2
Functions

A year spent in artificial intelligence is enough to make one believe in God.


—Alan Jay Perlis

Abstract  Functions are one of the most important abstractions in mathematics,
and they are one of the most important concepts in programming languages
as well. Functions are the main building blocks of programs. In this chapter,
we learn how functions are defined in Julia and we discuss generic functions
and methods. More details are presented as well, including argument passing be-
havior, multiple return values, functions as first-class objects, anonymous func-
tions, optional and keyword arguments, variable numbers of arguments, and the
scopes of variables. Interactions with data types are discussed throughout the
chapter to illustrate the interplay between functions, domains, and codomains.

2.1 Defining Functions

Functions are one of the most important concepts in mathematics. The defini-
tion of a mathematical function 𝑓 ∶ 𝑋 → 𝑌, 𝑥 ↦ 𝑦 comprises three parts: the
domain 𝑋, i.e., the set where the function is defined; the codomain 𝑌, i.e., the
set that contains all function values 𝑦; and a rule 𝑥 ↦ 𝑦 describing how a unique
function value 𝑦 is assigned to each argument 𝑥 ∈ 𝑋.
The same three pieces of information are important when defining a function
in any programming language. The role of the domain is played by the types of
the arguments, the function values are calculated by the body of the function
definition, and the role of the codomain is played by the type of the calculated
value and can hopefully be inferred by the compiler. If it can be inferred, faster
code can be generated for the calling function that receives the output.
The example we consider in this chapter is the definition of a function that
calculates the 𝑛-th number 𝑥𝑛 in the Fibonacci sequence defined by the recurrence
relation

𝑥0 ∶= 0, 𝑥1 ∶= 1, 𝑥𝑛 ∶= 𝑥𝑛−1 + 𝑥𝑛−2 . (2.1)

The mathematical function corresponding to this sequence is denoted by


fib ∶ ℕ0 → ℕ0 , 𝑛 ↦ 𝑥𝑛 .
A straightforward translation of this recurrence relation into a Julia function
is the following.
function fib1(n)
    if n <= 1
        n
    else
        fib1(n-1) + fib1(n-2)
    end
end

The idea is to check if the argument n is one of the starting values or not. If it is
not, then the recurrence relation is used. We note that in Julia everything is an
expression and hence returns a value, so that it is not necessary to use an explicit
return statement here.
An alternative, equivalent syntax for defining short functions is the following
one-line form, which here also uses the ternary operator.

fib2(n) = (n <= 1) ? n : fib2(n-1) + fib2(n-2)

This example shows how short functions are often defined in Julia. Here the
syntax
condition ? consequent : alternative
was used as an alternative for the Ƀȯ expression as well.
You can save this function definition in a file and load it into Julia or you
can type it directly into the Julia repl. In the repl, Julia will answer with the
following output.
fib1 (generic function with 1 method)

We note that this function definition does not capture two pieces of informa-
tion that are part of a mathematical function definition: the domain and the
codomain. We have neither specified the type of n nor the type of the possible
function values 0, 1, and fib1(n-1) + fib1(n-2). This means that this function
definition will work whenever the operations used in its definition, namely <=,
+, and -, are defined for their arguments. For example, evaluating fib1(5//4)
in the Julia repl yields -1//2; here 5//4 and -1//2 are rational numbers. We
will learn all about the types of numbers available in Julia in Chap. 5. It is not
obvious just by looking at the definitions of fib1 and fib2 which types can be
used as arguments and whether Julia can generate efficient code or not.

Therefore, we now use Julia’s introspective features to find out more about
the function we just defined. Julia told us after evaluating the function defi-
nition that we have defined a generic function with one method. The same in-
formation is obtained by typing fib1 into the repl, since functions evaluate to
themselves in Julia. This means that functions in Julia are what are called
generic functions in computer science. A generic function is a collection of (func-
tion) methods, where each method is responsible for a certain combination of
argument types. For example, the generic function + comprises many methods.

julia> +
+ (generic function with 166 methods)

We can find the methods of a generic function using methods. It lists all
the methods that were defined for various combinations of argument types and
where they were defined.

julia> methods(fib1)
# 1 method for generic function "fib1":
[1] fib1(n) in Main at REPL[1]:2

If a new method is defined and a method already exists for this particular com-
bination of argument types, then the old method definition is superseded.
To find out more about the codomain of the function, we can query the types
of function values.
julia> typeof(fib1(0))
Int64
julia> typeof(0)
Int64
julia> typeof(fib1(1))
Int64
julia> typeof(1)
Int64
julia> typeof(fib1(2))
Int64

In the first two cases, the argument is simply returned. Therefore the type of the
returned value is the same as the type of the argument. The last example implies
that the type of the sum of two values of type Int64 is again Int64.
If you are using a 32-bit system, then integers are by default represented by
the type Int32. The default integer type is called Int and it can be either Int32
or Int64 depending on your system. The variable Base.Sys.WORD_SIZE also in-
dicates if Julia is running on a 32-bit or a 64-bit system.
We have just seen that literal integers such as 0 and 1 are parsed and then
represented as values of type Int, which is an Int64 on this particular system.
Int64s are (positive or negative, i.e., signed) integers that can be stored within 64
bits. This is illustrated by the following calculations.

julia> 2^62
4611686018427387904
julia> (-2)^62
4611686018427387904
julia> 2^63
-9223372036854775808
julia> (-2)^63
-9223372036854775808
julia> 2^64
0
julia> (-2)^64
0

In addition to querying the type of a value using typeof, we can also ask a
type for the minimum and maximum numbers it can represent. This is
informative when a type can only represent a finite number of values by design.
julia> typemin(Int64)
-9223372036854775808
julia> typemax(Int64)
9223372036854775807
julia> typemax(typeof(fib1(0)))
9223372036854775807

These bounds explain why 2^63 could not be calculated above as an Int64, but
(−2)^63 (barely) could.
This implies that types such as Int, Int32, and Int64 can represent the math-
ematical structure of the ring (ℤ, +, ⋅) only if all operations during a calculation
remain within the interval given by typemin(type) and typemax(type).
If this is not assured, then it is necessary to use the BigInt type, which can rep-
resent integers provided they fit into available memory and is thus a much bet-
ter representation of the ring (ℤ, +, ⋅). To illustrate the efficiency of calculations
using BigInts, we consider Mersenne prime numbers. The following function
returns Mersenne prime numbers given an adequate exponent.

mersenne(n) = BigInt(2)^n - 1

The type of the return value is BigInt, since we base the calculation on 2 repre-
sented as a BigInt. In general, type(x) returns the representation of the value
x in the type type; in other words, a new value converted to the type type is
returned. It is instructive to inspect the BigInt type using methods(BigInt),
typemin(BigInt), and typemax(BigInt).
The largest known Mersenne prime number to date is 2^82 589 933 − 1. The fol-
lowing interaction shows that it can be computed within a few thousandths of a
second requiring less than 20 MB of memory, also showing that it has 24 862 048
digits.

digits(x) = floor(Integer, log10(x)+1)

julia> @time digits(mersenne(82_589_933))
  0.004653 seconds (23 allocations: 19.692 MiB)
24862048

The @time macro yields the run time and the allocated memory. The at sign @
generally indicates macros; we will learn all about macros in Chap. 7.
Now we know how to define a function that can calculate arbitrarily large
(only limited by the available memory) Fibonacci numbers. The next version of
the Fibonacci function ensures that the codomain is the type BigInt.

function fib3(n)::BigInt
    if n <= 1
        n
    else
        fib3(n-1) + fib3(n-2)
    end
end

The syntax ::type after the argument list means that the return value will be
converted to the specified type. Here the base case n <= 1 ensures that later on
BigInts are added. If the return value cannot be converted to the specified type,
then an error is raised. If ::type is not given, it is assumed to be ::Any. Every
value in Julia is of type Any.
We have seen how we can specify the codomain of our function. How can we
specify the domain of our function? We already know that a generic function con-
sists of methods. The various methods that constitute a generic function are re-
sponsible for different argument types. Whenever a function is called, the types
of the arguments are inspected and then the most specific matching method is
called; if none exists, an error is raised. The only method we have defined for
the function fib3 is called for every argument type, since we did not specify any
type for the argument n.
But this is not what we intend in the context of the Fibonacci sequence. It
is unclear what fib1(5//4) or fib1(9.5) should be, although our implementa-
tion returns numbers in these cases. fib1("foo") clearly raises an error (only af-
ter trying to perform calculations), but fib1(5//4) and fib1(9.5) should raise
errors as well.
How can we restrict the domain, i.e., how can we specify the types of the
arguments of a method? The syntax is again argument::type. The domain of the
next version of the Fibonacci function is the Integer type and the codomain is
the BigInt type.

function fib4(n::Integer)::BigInt
    if n <= 1
        n
    else
        fib4(n-1) + fib4(n-2)
    end
end

The Integer type comprises BigInts and the finite integer types Int32 and Int64
among others. It is therefore the most natural domain for our function.
We can check whether a type is a subtype of another one using the subtype
operator <:, written subtype <: supertype. The following example shows that the
Integer type works as intended in the method definition above. It also shows that
neither BigInt is a subtype of Int nor Int is a subtype of BigInt, confirming the
usefulness of the Integer type. The Any type is a supertype of every type. An argu-
ment x without any specified type is equivalent to an argument x::Any, just as
in the case of the return value.
julia> Int
Int64
julia> Int32 <: Integer
true
julia> Int64 <: Integer
true
julia> BigInt <: Integer
true
julia> BigInt <: Int
false
julia> Int <: BigInt
false
julia> Integer <: Any
true

To check whether a value has a certain type, the isa function, which also
supports infix syntax, can be used.

julia> isa(0, Int64)
true
julia> 0 isa Int64
true

Calling the generic function fib4 with arguments that are not of type Integer
results in an error explaining that there is no matching method. Sometimes it is
useful to define a method that catches all other argument types. This is achieved
by the following method.

fib4(x::Any) = error("Only defined for integer arguments.")

Evaluating methods(fib4) confirms that the generic function comprises two
methods.
So far we have learned how we can define the domains and codomains of
Julia functions, namely by specifying the types of the arguments and the re-
turn values. Next the question arises how efficiently our implementation actually
works. Calculating a few function values based on Ints or BigInts for relatively
small arguments and timing the calculations using the @time macro quickly
leads us to the conclusion that we will run out of patience before we run out
of integers.

We therefore examine the computational problem, namely solving the differ-
ence equation (2.1), in more detail. The difference equation

    x_n = x_{n-1} + x_{n-2}

in (2.1) can be solved using the ansatz x_n = y^n, substituting it into (2.1), find-
ing the two solutions y_1 := (1 + \sqrt{5})/2 and y_2 := (1 - \sqrt{5})/2 of the
resulting quadratic equation, and noting that the general solution of this differ-
ence equation is the linear combination

    x_n = c_1 y_1^n + c_2 y_2^n.

The two starting values x_0 and x_1 determine the two constants c_1 and c_2 as
c_1 := 1/\sqrt{5} and c_2 := -1/\sqrt{5}. Therefore the Fibonacci sequence is
given by

    x_n = \frac{1}{\sqrt{5}} \left(\frac{1+\sqrt{5}}{2}\right)^n
        - \frac{1}{\sqrt{5}} \left(\frac{1-\sqrt{5}}{2}\right)^n
        = \operatorname{round}\left(\frac{1}{\sqrt{5}} \left(\frac{1+\sqrt{5}}{2}\right)^n\right)
          \quad \forall n \in \mathbb{N}.

The last equality holds since y_2 \approx -0.618 and |c_2| < 1/2.
The theory of difference equations hence leads to the next function definition.

fib5(n::Integer)::BigInt =
    round(BigInt, ((1+sqrt(5))/2)^n / sqrt(5))

The argument BigInt of round ensures not only that a BigInt is returned instead
of a floating-point value, but also that no error is raised when the number to be
rounded is large.
Although we can calculate Fibonacci numbers now very quickly, it unfortu-
nately turns out that the return value of fib5(71) is not equal to 𝑥71 (while
the preceding values are correct). The smallest example to demonstrate this de-
ficiency is the following.

julia> @time fib5(69) + fib5(70) == fib5(71)
  0.000005 seconds (8 allocations: 168 bytes)
false

Evaluating typeof(sqrt(5)) yields a value of type Float64, which is the type
of IEEE 754 double-precision floating-point numbers. It is well-known that this
representation of the real numbers ℝ gives 15 to 17 significant decimal digits
of precision. Since 𝑥70 and 𝑥71 have 15 decimal digits, we conclude that expo-
nentiation over the Float64 type works very precisely within these limits. On
the other hand, we cannot calculate more Fibonacci numbers in this manner,
since the spacing between adjacent floating-point numbers becomes too large to
represent integers.
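A quick repl check (not from the book, but standard Julia) makes this spacing concrete: above 2^53, adjacent Float64 values are more than 1 apart, so integers can no longer be represented exactly.

```julia
# Above 2^53, the gap between adjacent Float64 values exceeds 1,
# so adding 1 is lost to rounding.
a = 2.0^53

println(a == a + 1)   # true: 2^53 + 1 rounds back to 2^53
println(eps(a))       # 2.0: the gap between a and the next Float64
```

The function eps returns the distance from its argument to the next larger representable floating-point number.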

We can alleviate this limitation by using the BigFloat type and ensuring that
all calculations are performed over this type. Then the BigFloat value is rounded
to a BigInt value.

fib6(n::Integer)::BigInt =
    round(BigInt,
          ((1+sqrt(BigFloat(5)))/2)^n / sqrt(BigFloat(5)))

The following function checks the calculations. The output means that all
Fibonacci numbers up to and including 𝑥358 are calculated correctly, while
fib6(359) is not equal to 𝑥359.

function check_fib6(range)
    for i in range
        if fib6(i) + fib6(i+1) != fib6(i+2)
            print(i, " ")
        end
    end
end

julia> check_fib6(0:370)
357 362 366 367 368 369 370

We can again relate the number of bits used in this calculation to the number
of decimal digits of 𝑥359. The following calculation shows that we can expect
at most 78 decimal digits of precision, since BigFloats use 256 binary digits by
default. (The precision of BigFloats can be changed using setprecision.) At
the same time, representing fib6(359) requires 75 decimal digits. Therefore the
calculation using the exponentiation is very precise in the sense that almost all
digits of the result are correct.

julia> digits(BigInt(2)^256)
78
julia> digits(fib6(359))
75

Although our detour taking advantage of an explicit formula for Fibonacci
numbers turned out to be successful in the sense that we can calculate more
Fibonacci numbers faster, it is not entirely satisfying for two reasons. First, the
theory of difference equations provided a new angle from which to attack the
problem, but such a new angle is not available in general. Second, the machinery
behind BigFloats and round is considerable.
Therefore we return to the recursive function fib4 now. The inefficiency of
this implementation stems from the fact that during the recursive calculation of
fib4(n) the number of times that fib4(m) is called becomes larger and larger as
m becomes smaller. Therefore a trade-off between computation time and mem-
ory consumption is reasonable.

The idea is to define a global variable that holds a dictionary containing the
previously calculated values. (Global variables are discussed in Chap. 3, and dic-
tionaries in Sect. 4.5.4.) The function checks if the function value has been calcu-
lated previously. If yes, it is simply returned; if not, the new value is calculated,
stored, and returned. This technique is known as memoization (see Sect. 7.5).

global fib_cache = Dict{BigInt, BigInt}(0 => 0, 1 => 1)

function fib7(n::Integer)
    global fib_cache

    if haskey(fib_cache, n)
        fib_cache[n]
    else
        fib_cache[n] = fib7(n-1) + fib7(n-2)
    end
end

This implementation can calculate 𝑥10 000 , which has 2090 digits, within a few
thousandths of a second using about 6 MB of memory. We will see a more general
approach to memoization in Sect. 7.5.
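As a sketch of a common variation (not from the book; the name fib7c is hypothetical), the cache can also be hidden inside a closure instead of a global variable, so that no other code can accidentally modify it.

```julia
# A let block creates a cache that is visible only to the closure,
# avoiding the global variable.
fib7c = let cache = Dict{BigInt,BigInt}(0 => 0, 1 => 1)
    function f(n::Integer)
        haskey(cache, n) && return cache[n]
        cache[n] = f(n-1) + f(n-2)   # store and return the new value
    end
end
```

Evaluating fib7c(10) then yields 55, and the cache persists between calls.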
Additional properties of the Fibonacci sequence are useful to refine this ap-
proach. It can be shown that the equalities

    x_{2n} = x_{n+1}^2 - x_{n-1}^2 = x_n (x_{n+1} + x_{n-1}),                      (2.2)
    x_{3n} = 2 x_n^3 + 3 x_n x_{n+1} x_{n-1} = 5 x_n^3 + 3 (-1)^n x_n,             (2.3)
    x_{4n} = 4 x_n x_{n+1} (x_{n+1}^2 + 2 x_n^2) - 3 x_n^2 (x_n^2 + 2 x_{n+1}^2)   (2.4)

hold for all 𝑛 ∈ ℕ.
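Such identities are easy to spot-check numerically; the small helper fib_iter below is not from the book, just a simple iterative implementation used to verify identity (2.2) for one value of n.

```julia
# Iterative Fibonacci, used only to verify identity (2.2) numerically.
function fib_iter(n::Integer)
    a, b = big(0), big(1)
    for _ in 1:n
        a, b = b, a + b
    end
    a
end

n = 10
lhs = fib_iter(2n)
rhs = fib_iter(n) * (fib_iter(n+1) + fib_iter(n-1))
println(lhs == rhs)   # true: the two sides of (2.2) agree
```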


The next version uses the last equality to reduce the number of function val-
ues that are memoized to approximately one quarter.

global fib_cache = Dict{BigInt, BigInt}(0 => 0, 1 => 1)

function fib8(n::Integer)
    global fib_cache

    if haskey(fib_cache, n)
        fib_cache[n]
    else
        if mod(n, 4) == 0
            m = div(n, 4)
            fib_cache[n] = (4*fib8(m)*fib8(m+1)
                            *(fib8(m+1)^2 + 2*fib8(m)^2)
                            - 3*fib8(m)^2*(fib8(m)^2 + 2*fib8(m+1)^2))
        else
            fib_cache[n] = fib8(n-1) + fib8(n-2)
        end
    end
end

To summarize, we have approached the problem of calculating Fibonacci
numbers, defined by maybe the simplest of all recursive formulas, from several
very different angles. When translating the definition of a mathematical function
into a Julia function, we have seen
• that the domain of the function corresponds to the types of the arguments,
• that the codomain corresponds to the type of the return value, and
• that the development of an efficient algorithm to calculate entities defined
by abstract mathematical definitions may be a highly complex task.
Mathematical functions correspond to generic functions in Julia. Generic
functions consist of (function) methods, and the method that is called depends
on the types of the arguments. The function applicable can be used to find out
if a given generic function has a method that is applicable to given arguments.
The function which returns the method of a given generic function that matches
given argument types. Given a generic function and argument types, you can
use invoke to call the method that matches the given argument types (after con-
verting the arguments). This allows invoking a method different from the most
specific one.
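A minimal sketch of the functions applicable and invoke in action; the generic function g and its two methods are hypothetical, not from the book.

```julia
# Two methods of a generic function g, distinguished by argument type.
g(x::Integer) = "integer method"
g(x::Number)  = "number method"

println(applicable(g, 1))              # true: a method matches an Int
println(applicable(g, "one"))          # false: no method for strings
println(g(1))                          # the Integer method is most specific
println(invoke(g, Tuple{Number}, 1))   # bypasses it, calls the Number method
```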
Another important consideration is how to translate infinite sets and abstract
concepts such as ℕ, ℤ, ℝ, and ℂ into a necessarily finite approximation in a com-
puter. Appropriate representations must be chosen depending on the problem to
be solved, and therefore we will learn much more about built-in and user-defined
data types in Chapters 4 and 5. Fortunately, Julia provides a rich set of numer-
ical data types of various degrees of precision and makes it easy to change the
types used in a program.
We have also seen in the example that it is prudent to implement tests and
checks whenever and as soon as possible. A standard procedure is to compare
numerical results with known exact solutions in order to check the correctness
of the implementation and in order to assess the size of the error necessarily
made by using approximations.

2.2 Argument Passing Behavior

In computer science, various ways to pass arguments from the caller to the called
function are known. The following approaches are commonly found in program-
ming languages.

Call by value: The arguments are evaluated and the resulting values are passed
to the function and bound to local variables. The passed values are often
copied into a new memory region. The function cannot make changes in
the scope of the caller, since it only receives a copy.
Call by reference: The function receives a reference to a variable used as the ar-
gument. Via this reference, the function can assign a new value to the vari-
able or modify it, and any changes are also seen by the caller of the function.
Call by sharing: If the values in a language are objects (carrying type informa-
tion in contrast to primitive types), then call by sharing is possible. In call by
sharing, function arguments act as new variable bindings and assignments
to function arguments are not visible to the caller. No copies of the argu-
ments are made, however, and the values the new variable bindings refer to
are identical to the passed values. Therefore changes to a mutable object are
seen by the caller.

Call by value provides additional safety, since it is impossible for the called
function to effect any changes outside of its own scope. On the other hand, it
is inefficient to copy all arguments, especially when many small functions are
defined or the arguments occupy large memory regions such as large arrays.
Call by reference is more efficient, but a function may effect changes outside
of its scope, making reasoning about program behavior much more difficult and
possibly leading to subtle bugs. Call by reference is the most unsafe way of pass-
ing arguments.
Julia uses call by sharing. Assignments to function arguments only affect the
scope of the function. Still, mutable objects (such as the elements of vectors or
arrays) can be changed and these changes persist and are seen by the caller. Call
by sharing is a reasonable compromise between memory safety and efficiency
and it is found in other dynamic languages such as Lisp, Scheme, and Python.
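The distinction can be sketched in a few lines (the function names below are hypothetical): rebinding an argument is invisible to the caller, while mutating the shared object is visible.

```julia
# Rebinding the parameter only changes the local binding.
function rebind(v)
    v = [0, 0, 0]
    nothing
end

# Mutating the object the parameter refers to is seen by the caller.
function mutate!(v)
    v[1] = 99
    nothing
end

a = [1, 2, 3]
rebind(a);  println(a)   # still [1, 2, 3]
mutate!(a); println(a)   # now [99, 2, 3]
```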

2.3 Multiple Return Values

Julia does not provide multiple return values per se, but uses tuples of values
to the same effect. Since tuples can be created and destructured also without
parentheses, the illusion of returning and receiving multiple values is created by
leaving out the parentheses.
Tuples can always be created with parentheses and in many circumstances
without parentheses. The same holds true when tuples are destructured in as-
signments. If there are fewer variables in the tuple on the left side than elements
in the tuple on the right side of the assignment, then only the first elements on
the right side are used. Conversely, if there are more variables in the tuple on the
left side than elements on the right side, an error is raised.

julia> foo, bar = 0, 1
(0, 1)
julia> (foo, bar) = (2, 3)
(2, 3)
julia> (foo,) = (0, 1)
(0, 1)
julia> foo
0

A tuple with a single element is created using the syntax (element,) so that it
can be distinguished from parentheses that have no effect around an expression.

julia> (0,)
(0,)

Using tuples, multiple values can be returned and received in a straightfor-
ward manner. When receiving the return values, a tuple can be assigned to a
variable or the tuple can be destructured and multiple variables can be assigned.

function foo(a1, a2, b1, b2)
    (a1*b1 - a2*b2, a1*b2 + a2*b1)
end

julia> c = foo(1, 2, 3, 4)
(-5, 10)
julia> (c1, c2) = foo(1, 2, 3, 4)
(-5, 10)
julia> c, c1, c2
((-5, 10), -5, 10)

2.4 Functions as First-Class Objects

Functions are first-class objects in Julia, and the type of each function is a sub-
type of the type Function. This means that functions can be assigned and passed
as arguments just as any other data type.

julia> +
+ (generic function with 166 methods)
julia> +(0, 1)
1
julia> isa(+, Function)
true
julia> foo = *
* (generic function with 357 methods)
julia> foo(0, 1)
0

Passing functions as function arguments is not uncommon. The built-in sort
function, for example, takes a keyword argument lt (short for less than) that
specifies the ordering to be used. In this example, the function > is passed as the
lt argument and then used to compare two values.

julia> sort([1, 3, 2], lt = >)
3-element Array{Int64,1}:
 3
 2
 1

Functions whose domains are functions are well-known in mathematics: they
are functionals. The following function calculates a simple approximation (a
Riemann sum using midpoints) of the functional I(f) := ∫_a^b f(x) dx.

function riemann(f::Function, a::Number, b::Number, n::Integer)
    local sum = 0
    local h = (b-a) / n
    for i in 1:n
        sum += f(a + (i-1/2) * h)
    end
    h * sum
end

julia> riemann(cos, 0, 2pi, 100)
1.3253900292432802e-16

The QuadGK package provides one-dimensional numerical integration using
adaptive Gauss–Kronrod quadrature.

julia> import QuadGK
julia> QuadGK.quadgk(cos, 0, 2pi)
(3.00032335826547e-16, 5.062846180836036e-24)

The first element of the tuple is the estimated value of the integral, and the sec-
ond an estimated upper bound for the absolute error.
The identity function is called identity in Julia. Additionally, there are syn-
tactic expressions in Julia, listed in Table 2.1, which are translated into function
calls, but the names of the functions are not obvious.
Finally, the expression arg |> fun is the same as fun(arg). It allows reversing
the order of function and argument and is easier to read in certain situations.
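For instance (a sketch, not from the book), chaining calls with the pipe operator reads in evaluation order from left to right.

```julia
# fun(arg) written as arg |> fun; pipes can be chained.
println([3, 1, 2] |> sort |> first)   # 1
println(100 |> sqrt |> x -> x + 1)    # 11.0
```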

2.5 Anonymous Functions

Anonymous functions can be created by either of two syntactic options, namely
using -> or a function definition without a function name. Especially the first
syntax is useful to create simple functions and to pass them to other functions.
Table 2.1 Syntactic expressions that correspond to function calls.

Syntactic expression           Function
[A B C ...]                    hcat (concatenate horizontally)
[A, B, C, ...]                 vcat (concatenate vertically)
[A B; C D; ...]                hvcat (concatenate horizontally and vertically)
A'                             adjoint (conjugate transpose)
start:stop, start:step:stop    (:)(start, [step,] stop)
A[i]                           getindex
A[i] = x                       setindex!

julia> x -> 2x
#3 (generic function with 1 method)
julia> isa(x -> 2x, Function)
true
julia> (x, y) -> 2x*y
#7 (generic function with 1 method)
julia> function (x)
           2x
       end
#9 (generic function with 1 method)

A popular example of using anonymous functions is passing a custom ordering to the sort function. Instead of defining a named function used only in one place, it is often more convenient to use a short anonymous function.
julia> sort([1.9, 1.8, 1.7], lt = (x,y) -> floor(x) < floor(y))
3-element Array{Float64,1}:
 1.9
 1.8
 1.7

A classical example, often found in functional programming, is the map function. It takes a function as its first argument and one or more collections as the remaining arguments. The number of arguments the function expects must match the number of collections passed. The function is then applied elementwise to the collections, and a collection of the results is returned.
julia> map(x -> 10x, [1, 2, 3])
3-element Array{Int64,1}:
 10
 20
 30
julia> map((x, y) -> 10x + y, [1, 2, 3], [4, 5, 6])
3-element Array{Int64,1}:
 14
 25
 36

There are variants of the map function, namely map!, mapfoldl, mapfoldr, mapreduce, and mapslices. The function map! stores the result in its second argument, i.e., the first sequence argument (see Sect. 2.2). It follows the convention that functions with names ending in ! modify their arguments. This convention stems from the programming language Scheme.
The function reduce is another mainstay of functional programming. It takes an associative function as its first argument, a collection as its second, and an initial value as the (optional) keyword argument init. The initial value should be the neutral element, so that the function can also be applied to an empty collection. reduce applies the function to two values at a time (starting from the initial value, if given) until the collection has been reduced to a single value. The functions foldl and foldr are similar, but guarantee left and right associativity, respectively.

julia> reduce(*, [1 2 3])
6
julia> reduce(*, [1 2 3], init = -1)
-6
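The difference between foldl and foldr only shows with a non-associative operation; a small sketch with subtraction:

```julia
# foldl groups from the left, foldr from the right.
left  = foldl(-, [1, 2, 3])   # (1 - 2) - 3
right = foldr(-, [1, 2, 3])   # 1 - (2 - 3)
```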

The function mapreduce maps and reduces. It maps its first argument (a function) over the sequence given as the third argument and then reduces the result using the second argument (again a function) with the initial value (optionally) given as the keyword argument init. mapreduce is more efficient than combining map and reduce, since the intermediate sequence is not stored. The example calculates the 𝓁𝑝 norm of a vector, where an anonymous function is used for 𝑥 ↦ |𝑥|^𝑝.
norm(v::Vector, p::Number) = mapreduce(x -> abs(x)^p, +, v)^(1/p)

2.6 Optional Arguments

Arguments often have sensible default values. For example, the 𝓁2 norm is the most popular among the 𝓁𝑝 norms. In these cases, it is convenient to declare such arguments as optional arguments. Optional arguments do not have to be specified in the argument list when the function is called.
norm(v::Vector, p::Number = 2) =
    mapreduce(x -> abs(x)^p, +, v)^(1/p)

julia> norm([-1, -2, -3])
3.7416573867739413
julia> norm([-1, -2, -3], 1)
6.0
Optional arguments are implemented as methods of the generic function. For example, a function with one optional argument is implemented as two methods (one that omits the optional argument and one that includes it), and a function with two optional arguments is implemented as three methods (one method each for the three cases of zero, one, or two of the optional arguments).
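This can be checked directly with the methods function; g here is a throwaway example name:

```julia
# One definition with two optional arguments...
g(a, b = 1, c = 2) = a + b + c

# ...creates three methods: g(a), g(a, b), and g(a, b, c).
number_of_methods = length(methods(g))
```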

2.7 Keyword Arguments

When functions have many arguments, or many optional arguments in particular, it is often clearer at the call site to use keyword arguments. Keyword arguments allow the arguments to be identified not by position, but by name. Keyword arguments follow a semicolon in the function signature, and default values must be provided. The example is an implementation of the Newton method for finding roots of nonlinear functions.
function newton(f::Function, df::Function, x0; typ = Float64,
                tol = 1e-6, max_iterations::Integer = 100)
    @assert tol > 0
    @assert max_iterations >= 1

    local x = typ(x0)
    local i = 1
    while abs(f(x)) >= tol && i <= max_iterations
        x += -f(x)/df(x)
        @show i, x, f(x)
        i += 1
    end
    x
end

If you are not interested in printing the progress of the calculation, you can add a comment character # in front of the call of the @show macro. In this line, a tuple containing the three values i, x, and f(x) is created and passed to the @show macro. Also note that type was a reserved word in earlier versions of Julia, which is why we use the name typ for the first keyword argument instead.
The first argument is the function whose zero is sought, starting from the point specified as the third argument. In this implementation we require the derivative, a function, to be passed as the second argument. The typ keyword argument specifies the type of the result by converting the starting value to this type. The tol keyword argument specifies how large the absolute value of the function value at the final point may at most be. The last keyword argument specifies the maximum number of iterations calculated and ensures that the function always returns.
We note that it is possible to specify the type of keyword and optional arguments, as we have done here for the last keyword argument.
The @assert macro takes an expression as its argument and evaluates it. If the value is true, the function continues; if it is false, an error is raised. Such assertions are commonly used to check that the input is valid.
Two syntactic options are valid when calling the function. The semicolon that separates the positional arguments and the keyword arguments in the function definition is often not required when calling the function and can then be replaced by a comma.
The first example illustrates that the type of the result is given by the typ keyword argument. Here the requested tolerance is smaller than the precision of the floating-point type Float32 allows, so that the function returns after the maximum number of iterations.
julia> newton(sin, cos, 3.0, typ = Float32, tol = 1e-15,
              max_iterations = 5)
(i, x, f(x)) = (1, 3.1425467f0, -0.000954f0)
(i, x, f(x)) = (2, 3.1415927f0, -8.742278f-8)
(i, x, f(x)) = (3, 3.1415927f0, -8.742278f-8)
(i, x, f(x)) = (4, 3.1415927f0, -8.742278f-8)
(i, x, f(x)) = (5, 3.1415927f0, -8.742278f-8)
3.1415927f0

In the next example, the type is not specified, so the default type Float64 is used.
julia> newton(sin, cos, 3.0, tol = 5e-16)
(i, x, f(x)) = (1, 3.142546543074278, -0.0009538893398264409)
(i, x, f(x)) = (2, 3.141592653300477, 2.8931624907621843e-10)
(i, x, f(x)) = (3, 3.141592653589793, 1.2246467991473532e-16)
3.141592653589793

We will learn more about the Newton method and its convergence behavior in
Chap. 12.
Keyword arguments are ignored in method dispatch, i.e., when searching for a matching method of the generic function; they are only processed after the matching method has been found.
Functions can also receive a variable number of keyword arguments at the end of the argument list using the syntax ... after the name of the variable that will receive all remaining keyword arguments as a collection. The function in this example just returns the collection containing all keyword arguments it has received.
foo(a; b = 0, c...) = c

When calling this function, a semicolon must be used after the keyword argument b so that the keyword arguments collected in c can be distinguished from the preceding keyword argument b.
julia> foo(1, b = 2)
Iterators.Pairs(::NamedTuple{(),Tuple{}}, ::Tuple{}) with 0 entries
julia> foo(1, b = 2; bar = 3, baz = 4)
pairs(::NamedTuple) with 2 entries:
  :bar => 3
  :baz => 4
julia> foo(1, b = 2; :bar => 3, :baz => 4)
pairs(::NamedTuple) with 2 entries:
  :bar => 3
  :baz => 4

Two syntactic options for passing the keyword arguments are shown here. The first is the usual keyword-argument syntax variable = value, and the second is pairs of the form :variable => value.
This general facility for passing keyword arguments is useful when the keyword names are computed at runtime or when a number of keyword arguments is assembled and passed through one or more function calls and the receiving function picks out the keyword arguments it needs.
To summarize, keyword arguments are arguments after a semicolon ; in the argument list of a function definition, while optional arguments are listed before the semicolon.
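A minimal sketch of passing keyword arguments through another function (inner and outer are hypothetical names, not library functions):

```julia
# inner has two keyword arguments with default values.
inner(; color = "black", width = 1) = (color, width)

# outer collects any keyword arguments it receives and
# forwards them to inner unchanged.
outer(; kwargs...) = inner(; kwargs...)
```

Calling outer(width = 3) then overrides only width, while color keeps its default.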

2.8 Functions with a Variable Number of Arguments

Sometimes, it is convenient for a function to take a variable number of arguments. The syntax to collect the trailing arguments into a tuple is that the last argument is followed by an ellipsis ....
foo(a, b, c, args...) = args

The variable args is bound to a tuple of all the trailing values passed to the function.
julia> foo(1, 2, 3)
()
julia> foo(1, 2, 3, 4, 5, 6)
(4, 5, 6)

Analogously, the ellipsis ... can be used in a function call to splice the values contained in an iterable collection (see Sect. 4.5.2) into the call as individual arguments.
julia> foo((1, 2, 3, 4, 5, 6)...)
(4, 5, 6)
julia> foo([1, 2, 3, 4, 5, 6]...)
(4, 5, 6)
This example shows that the spliced arguments can also take the place of fixed arguments. In fact, a function call taking a spliced argument list does not have to take a variable number of arguments at all.
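A short sketch of splatting into a function with a fixed argument list (add3 is a throwaway example name):

```julia
add3(a, b, c) = a + b + c

values = (1, 2, 3)
from_tuple = add3(values...)        # same as add3(1, 2, 3)
from_array = add3([10, 20, 30]...)  # arrays can be spliced, too
```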

2.9 do Blocks

Built-in functions that take a function as one of their arguments usually receive the function as the first argument, which is an idiomatic use of function arguments in Julia. A do block is a syntactic expression that supports this idiom. do blocks are useful for passing longer anonymous functions as first arguments to functions. The do block
function(arguments) do variables
    body
end

is equivalent to

function(variables -> body, arguments)

so that the possibly long function body body is written at the end of the do block. Continuing the above example, the norm function can equivalently be defined as follows.
norm(v::Vector, p::Number) =
    mapreduce(+, v) do x
        abs(x)^p
    end^(1/p)

A prime example where it is convenient to pass a long anonymous function body to a function is file input and output. The function in the next example ensures that the stream that has been opened for reading or writing is closed after use. The function takes a variable number of arguments (see Sect. 2.8).
function with_stream(fun::Function, args...)
    local stream = open(args...)
    try
        fun(stream)
    finally
        close(stream)
    end
end

The function that operates on the input/output stream may be quite complicated. It is therefore convenient to use a do block. In the following example, s is the stream on which the anonymous function body in the do block operates. While the end of the file has not been reached, a line is read and printed.
with_stream("/etc/passwd", "r") do s
    while !eof(s)
        print(readline(s))
    end
end

It should be noted in this context that Julia comes with the readlines function, which returns the contents of a file as a vector of strings. readlines is often sufficient when the file to be processed is small.
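To sketch this without depending on an existing file, we can first write a temporary file (mktemp is from Julia's standard library):

```julia
# Create a temporary file and write two lines to it.
path, io = mktemp()
println(io, "first line")
println(io, "second line")
close(io)

# readlines returns the contents as a vector of strings,
# with the trailing newlines stripped.
lines = readlines(path)
```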

Problems

2.1 Write a function that uses BigFloats and setprecision (in conjunction with a do block) to calculate large Fibonacci numbers. What is the largest Fibonacci number you can calculate in this manner, and what is the limitation you eventually run into?
2.2 Write a function that records the number of calls of fib4(𝑚) for each 0 ≤ 𝑚 < 𝑛 when calculating fib4(𝑛).
2.3 Calculating larger and larger Fibonacci numbers using fib7 is limited by the stack size. However, if you gradually increase the size of the argument, you can circumvent this limitation. Explain why. What is the largest Fibonacci number you can calculate in this manner, and what is the limitation you eventually run into?
2.4 The following function is a shorter alternative to fib7 because of its use of get!. However, it does not work. Explain why.

global fib_cache = Dict{BigInt, BigInt}(0 => 0, 1 => 1)

function fib(n::Integer)
    global fib_cache
    get!(fib_cache, BigInt(n), fib(n-1) + fib(n-2))
end

2.5 (Identities for Fibonacci numbers) Suppose 𝑥𝑛 is the 𝑛-th Fibonacci number.
(a) Prove d’Ocagne’s identity (2.2).
(b) Prove the identity (2.3).
(c) Prove the identity (2.4).
2.6 Suppose 𝑥𝑛 is the 𝑛-th Fibonacci number. Prove the identity

$$x_{an+b} = \sum_{i=0}^{a} \binom{a}{i}\, x_{b-i}\, x_n^{i}\, x_{n+1}^{a-i}$$
for all 𝑎 ∈ ℕ and 𝑏 ∈ ℕ. Then use it to implement an efficient memoized function to calculate Fibonacci numbers. What is the largest Fibonacci number you can calculate in this manner, and what is the limitation you eventually run into?

2.7 Use the macro @timed to plot the time and memory consumption of the various functions to calculate Fibonacci numbers.

2.8 (Ackermann function) The Ackermann function is defined for 𝑚 ∈ ℕ and 𝑛 ∈ ℕ as

⎧𝑛 + 1, 𝑚 = 0,
𝐴(𝑚, 𝑛) ∶= 𝐴(𝑚 − 1, 1), 𝑚 > 0 ∧ 𝑛 = 0,

⎩𝐴(𝑚 − 1, 𝐴(𝑚, 𝑛 − 1)), 𝑚 > 0 ∧ 𝑛 > 0.
Implement this function and also implement a memoized version. Compare the
speed of both versions.
Chapter 3
Variables, Constants, Scopes, and Modules

But I just told you, I don’t have a problem with closure.


—Sheldon Lee Cooper in The Big Bang Theory, Season 6, Episode 21
The Closure Alternative (2013)

Abstract This chapter discusses how to introduce global and local variables and constants as well as their scopes, i.e., their visibility. While discussing functions, we have seen that function arguments become local variables in the function body, but there are more ways to introduce variables. Local variables are only visible in (small) parts of a program, which is an important property for structuring a program into small, understandable parts. Global variables are only visible in their module. A module is a (usually large) part of a program that contains functions, variables, and constants with a similar purpose. The scopes of variables follow rules that are described in detail.

3.1 Modules and Global Scopes

The scope of a variable is defined as the part of a program where the variable is visible. Modules are a fundamental data structure in this regard, as each module corresponds to a global scope; there is a one-to-one correspondence between modules and global scopes in Julia. The global scope of the repl is the module called Main.
A new module called Foo1 can be defined in a Julia program or at the repl like this.
module Foo1
global foo = 1
end

© Springer Nature Switzerland AG 2022
C. Heitzinger, Algorithms with JULIA,
https://doi.org/10.1007/978-3-031-16560-3_3
Modules usually have names that start with an uppercase letter. Program lines within a module are not indented, since a module almost always comprises a whole file, and indenting the whole file would be superfluous. Here we have also defined a variable called foo, whose global scope is the module Foo1. Although it is good practice to use the keyword global to define a global variable, it is not necessary to do so.
module Foo2
foo = 2
end

A module evaluates to itself, and we can access variables within a module using a dot ., which signifies qualified access.
julia> Foo1
Main.Foo1
julia> Foo1.foo
1

Modules can be replaced by evaluating a module definition with the same name. Modules can be nested, and modules can be imported into other modules, as the next example shows. Here the lines are indented to illustrate the nesting.
module Foo3
    module Bar
        global bar = 0
    end
    global foo3 = Bar.bar  # module Bar is visible
    import ..Foo1          # make module Foo1 visible
    global foo4 = Foo1.foo # module Foo1 is visible
end

julia> Foo3.Bar.bar
0
julia> Foo3.foo3
0
julia> Foo3.foo4
1

As a safety measure, it is not possible to assign values to variables in the global scope of another module, as illustrated in this example, which raises an error.
module Foo4
import ..Foo1
Foo1.foo = 4
end
3.2 Dynamic and Lexical Scoping

We have defined the scope of a variable as the part of a program where the variable is visible. In addition to the global scopes of modules, there are also local scopes. For example, a function definition introduces a new local scope.
What happens if there are two variables with the same name within a program? If the scopes of the variables do not overlap, there is no ambiguity. If the scopes of the variables overlap, however, then Julia's scope rules are applied in order to resolve any ambiguities. This section discusses the various scopes of variables and the scope rules.
The scope of a variable is a concept that is familiar from mathematics. An example is given by integrals. In the formula

$$f(x) = \int_a^x f'(\xi)\,\mathrm{d}\xi,$$

the scope of the integration variable 𝜉 is the integrand, i.e., 𝜉 is only visible within the integrand, which is 𝑓′(𝜉) here. But how should we interpret the formula

$$f(x) = \int_a^x f'(x)\,\mathrm{d}x?$$

Is it ambiguous? From d𝑥 we know that the integration variable is called 𝑥. The integration variable 𝑥 is certainly different from the upper integration limit 𝑥, since the upper integration limit must always be outside of the scope of the integration variable. Since the left-hand side is a function of 𝑥, the right-hand side must be one as well, and therefore the upper integration limit 𝑥 is the function argument 𝑥 on the left-hand side. But is the 𝑥 in the integrand 𝑓′(𝑥) the function argument 𝑥 or the integration variable 𝑥? Their scopes overlap, and therefore we need a scope rule to resolve this ambiguity. A reasonable scope rule is to require that any 𝑥 in the integrand refers to the innermost definition of 𝑥, i.e., the integration variable 𝑥, and not to the outer 𝑥, i.e., the function argument 𝑥.
Another example is given by summation indices, which are only visible within their summands. For example, in the formula

$$\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s},$$

the summation index 𝑛 is only visible within the summand 1∕𝑛^𝑠, while the function argument 𝑠 is visible in the whole right-hand side.
Returning from mathematics to computer science, there are two types of scop-
ing, namely dynamic scoping and lexical scoping. In lexical scoping, the scope of
a variable is the program text of the scope block where the variable is defined. In
dynamic scoping, the scope of a variable is the time period when the code block
where the variable is defined is executed. Julia uses lexical scoping.
Nowadays, lexical scoping is more popular than dynamic scoping, since it makes it easier for the programmer to reason about the variables defined within a code block: the (local) program text is sufficient. Dynamic scoping, on the other hand, requires knowledge about the run-time behavior of the program, which can be arbitrarily complex. Common Lisp is an example of a programming language that provides both types of scoping for its variables. While lexical scoping is the default there, dynamic scoping is advantageous for certain purposes.
To illustrate the difference between dynamic and lexical scoping, suppose that a function called f calls another function called g and that these two code blocks do not overlap, i.e., neither of the two functions is defined within the other one. Under lexical scoping, the function g does not have access to the variables introduced by f, because its program text does not lie within f, whereas under dynamic scoping, g does have access to the variables introduced by f, since it is executed while f is executed.
This is the situation implemented in this example.
function f()
    local x = 0
    g()
end

function g()
    x
end

Calling f results in a call to g. Since Julia uses lexical scoping, evaluating x inside the scope block of the function g can only refer to variables defined in g and in the global scope outside of g (i.e., the module that contains g). Because x is defined in neither, an error saying that x is not defined is raised.
If we introduce a global variable x outside the scope of g (i.e., in the module that contains g), then evaluating x in g will refer to its value.
julia> global x = 1; f()
1

We could have left out the keyword global here, but it is good style to indicate the definition of variables using the keywords global or local.
The next example is concerned with the nesting of local scopes. The function inner is defined within the function outer.

function outer()
    function inner()
        y
    end

    local y = 0
    inner()
end
The inner function inherits all variables from its outer scope, i.e., the outer function, so that the variable y inside inner refers to the value of y in outer.
julia> outer()
0

Even defining a global variable y cannot change the return value of outer.
julia> global y = 1; outer()
0

Because of lexical scoping, the definition of the outer function is sufficient to determine its return value. It is impossible for a global variable to override a local variable with the same name unless this is the intended behavior, expressed by using the keyword global as we did in Sect. 2.1.

3.3 Local Scope Blocks

In Julia, there are eight types of local scope blocks, all listed in Table 3.1, which also lists the three ways to introduce global scope blocks for completeness. Local scope blocks come in two kinds, namely hard local scopes and soft local scopes. On the other hand, begin blocks and if blocks are not scope blocks and cannot introduce new variable bindings.

Table 3.1 All global and local scope blocks.

Scope block                      Scope type
module                           global
baremodule                       global
repl (Main module)               global
function bodies and do blocks    hard local
macro bodies                     hard local
struct blocks                    hard local
for loops                        soft local
while loops                      soft local
try catch finally blocks         soft local
let blocks                       soft local
array comprehensions             soft local
3.3.1 Hard Local Scopes

According to Table 3.1, functions (see Chap. 2), macros (see Chap. 7), and struct type definitions (see Sect. 5.4) introduce new hard local scopes. The scope rules for hard local scopes are these.
1. All variables are inherited from their parent scope with the following two exceptions.
2. A variable is not inherited if an assignment would modify a global variable. (A new binding is introduced instead.)
3. A variable is not inherited if it is marked with the keyword local. (A new binding is introduced instead.)
The second rule means that global variables are inherited only if they are read, not if they are modified. This ensures that a local variable cannot unintentionally modify a global variable with the same name.
global x = 0

function foo1()
    x = 1 # introduce new local variable
    x
end

function foo2()
    global x = 2 # assign to global variable
    x
end

In the first function, the assignment x = 1 would modify the global variable x, and therefore a new local variable is introduced. In the second function, the global keyword ensures that the assignment x = 2 refers to the global variable x, whose binding is modified.
julia> foo1()
1
julia> x
0
julia> foo2()
2
julia> x
2
3.3.2 Soft Local Scopes

According to Table 3.1, for loops, while loops, try catch finally blocks (see Chap. 6), let blocks (see Sect. 3.4), and array comprehensions (see Sect. 3.5) introduce new soft local scopes. The scope rules for soft local scopes are these.
1. All variables are inherited from their parent scope with the following exception.
2. A variable is not inherited if it is marked with the keyword local. (A new binding is introduced instead.)
3. Additional rules for let blocks (see Sect. 3.4) and for for loops and comprehensions (see Sect. 3.5) apply.
Hard and soft local scopes differ in their intended purposes and hence in their scope rules. Hard local scopes, i.e., function, macro, and type definitions, are usually independent entities that can be moved around freely within a program. Modifying global variables within their scopes is possible, but should be done with care, and therefore requires the global keyword. On the other hand, soft local scopes such as loops are often used to modify variables that are defined in their parent scopes. Hence the default is to modify those variables unless the local keyword is used.
The following two examples involving for loops illustrate soft local scopes.
function sum(args...)
    local sum = 0 # introduce new local variable
    for i in args
        sum = sum + i # inherit
    end
    sum
end

Here the assignment sum = sum + i does not introduce a new binding for the variable sum in the for loop, since sum is inherited from the parent scope by the first rule.
The situation is different in the following example.
function foo(args...)
    local x = 0 # introduce new local variable
    for i in args
        local x = i # introduce new local variable
    end
    x
end

Here the local keyword always introduces a new variable in the scope of the for loop. Therefore this function always returns 0.
Named functions (in contrast to anonymous functions) are stored as Function objects in variables. Therefore a function f can be referred to in the definition of
a function g even if f has not been defined yet. An example is given by mutually recursive functions. In Julia, function definitions can be ordered arbitrarily, and no forward function declarations are required in such cases, as they are in some other programming languages.
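A minimal sketch of mutually recursive functions; is_even refers to is_odd before is_odd has been defined, and no forward declaration is needed:

```julia
# Each function is defined in terms of the other.
is_even(n::Integer) = n == 0 ? true : is_odd(n - 1)
is_odd(n::Integer)  = n == 0 ? false : is_even(n - 1)
```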
Whether a variable is defined can be checked using the macro @isdefined and the function isdefined.
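A small sketch of @isdefined inside a local scope (check is a throwaway name); a local variable that has not yet been assigned counts as not defined:

```julia
function check()
    local before = @isdefined z   # z has not been assigned yet
    local z = 1
    local after = @isdefined z    # now z has a value
    (before, after)
end
```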

3.4 let Blocks and Closures

Closures are a concept in computer science that can be found in many modern programming languages. In general, a closure is a function together with an environment (a set of bindings) of variables. Variables in the enclosing scope are called free variables and can be accessed by the function even when the function is called outside that scope. This behavior is consistent with lexical scoping.
In Julia, closures are based on let blocks. A let block has the syntax
let variable1 [= value1], variable2 [= value2], variable3 [= value3]
    body
end

and accepts an arbitrary number of assignments separated by commas, where the values are optional. The assignments are evaluated in order. A let block always introduces a new scope block and new local variables each time it is evaluated; this is the additional scope rule for let blocks.
A closure is created by a let block containing one or more function definitions. In this example, the local variable counter is captured and visible only in the two functions get_counter and increase due to lexical scoping. Thus the closure serves to encapsulate the data, as the variable counter can only be accessed and modified by the functions defined inside the let block.
let counter = 0
    global function get_counter()
        counter
    end

    global function increase()
        counter += 1
    end
end

The global declarations of the functions are necessary. Otherwise the function definitions would only be accessible inside the let block (and not globally), as function definitions are stored in variables. (In Common Lisp, the global declaration would not be needed.)
julia> get_counter()
0
julia> increase(), increase(), increase()
(1, 2, 3)
julia> get_counter()
3

After reevaluating the let block above, a new closure is created and the counter is again equal to 0.
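Closures can also be created by returning an anonymous function from a function; each call evaluates the function body anew and therefore captures a fresh environment (make_counter is a hypothetical name, not part of Julia):

```julia
function make_counter()
    local counter = 0
    () -> (counter += 1)   # the anonymous function captures counter
end

c1 = make_counter()
c2 = make_counter()        # an independent counter
```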

3.5 for Loops and Array Comprehensions

Array comprehensions are a convenient way to make (dense) arrays (see Sect. 8.1) while initializing their elements. A multidimensional array can be constructed by [expr for variable1 = value1, variable2 = value2], where an arbitrary number of iteration variables can be used. The values on the right-hand sides must be iterable objects such as ranges. Then the expression expr is evaluated with freshly allocated iteration variables. The dimensions of the resulting array are given by the numbers of values of the iteration variables, in order.
This example shows how an array comprehension is used to make and initialize a two-dimensional array.
julia> [10x + y for x in 1:2, y in 1:3]
2×3 Array{Int64,2}:
 11  12  13
 21  22  23
julia> size(ans)
(2, 3)

The iteration variables are freshly allocated for each evaluation of the comprehension, and hence any previously existing variable bindings with the same names are not affected by the array comprehension.
julia> x = 0; y = 0
0
julia> [10x + y for x in 1:2, y in 1:3]
2×3 Array{Int64,2}:
 11  12  13
 21  22  23
julia> x, y
(0, 0)

The behavior of for loops is the same in this regard. We consider two examples. In the first, the iteration variable has not been previously defined. In this case, the iteration variable is local to the for loop.
julia> @isdefined i
false

This shows that initially there is no binding for the variable i.
julia> for i in 1:2 end

After the for loop, there is again no binding for the variable i.
julia> @isdefined i
false

In the second example, a variable with the same name as the iteration variable already exists. After the for loop has been evaluated, the value of that variable remains unchanged.
julia> j = 0
0
julia> for j in 1:2 end
julia> j
0

3.6 Constants

Both global and local variables can be declared constant with the const keyword. Declaring global variables constant helps the compiler optimize code. Since the types and values of global variables may change at any time, code involving global variables can generally hardly be optimized by the compiler. If a global variable is declared constant, however, the compiler can employ type inference and the performance problem is solved.
The situation is different for local variables in this regard. The compiler can determine whether a local variable is constant or not, and therefore declaring local variables constant does not affect performance.
Finally, we note that declaring a variable constant only affects the variable binding. If the value of a constant variable is a mutable object such as a set, an array, or a dictionary (see Chap. 4), the elements of the mutable object may still be modified, as shown in this example.
julia> const A = [1, 2]
2-element Array{Int64,1}:
 1
 2
julia> A[1] = 0; A
2-element Array{Int64,1}:
 0
 2
3.7 Global and Local Variables in this Book

This book uses the global and local keywords to explicitly denote global and local variables, although many programmers usually do not do so in practice. There are two reasons for the use of these two keywords here. The first is simply didactic; the keywords clearly indicate where a new variable is defined and of which kind it is.
The second reason is that writing global and local explicitly to define (and to access) variables can be considered good practice, because it helps spot the declarations of global and local variables at a glance and also where they are used (in the case of global ones). Global variables are always noteworthy and therefore deserve to be spotted easily.
Using the global and local keywords is also a matter of style and personal preference. Julia's syntax is heavily influenced by Pascal's syntax, and in Pascal all variables are introduced by the var and const keywords. In this tradition, the local keyword serves the role of var in Pascal. On the other hand, with more experience in spotting variables and recognizing their scopes, the keywords may appear superfluous.

Problems

3.1 Extend the example of a closure in Sect. 3.4 by writing functions for resetting
the counter to a given value and for decreasing the counter.

3.2 A Hilbert matrix is a square matrix 𝐻 with the entries ℎ𝑖𝑗 = 1∕(𝑖 + 𝑗 − 1).
Write a function that returns a Hilbert matrix of given size.
Chapter 4
Built-in Data Structures

Data dominates.
If you’ve chosen the right data structures and organized things well,
the algorithms will almost always be self-evident.
Data structures, not algorithms, are central to programming.
—Rob Pike
Bad programmers worry about the code.
Good programmers worry about data structures and their relationships.
—Linus Torvalds

Abstract Julia comes with many useful, built-in data structures that cover
many requirements of general-purpose programming. In this chapter, the most
important built-in data structures are discussed, including characters, strings,
regular expressions, symbols, expressions, and several types of collections. In
conjunction with the data structures, the operations on the data structures are
introduced as well and examples of their usage are given.

4.1 Characters

One of the simplest, but most fundamental data structures is the character, the
type Char. A character is created by 'char', i.e., using single quotes. Each char-
acter corresponds to a Unicode code point, and a character can be converted to
its code point, which is an integer value, by calling Int.
julia> 'a', typeof('a'), Int('a'), typeof(Int('a'))
('a', Char, 97, Int64)

The type of Int('a') depends on your system architecture.


Vice versa, a code point, i.e., an integer, can be converted to a character by
calling Char.

© Springer Nature Switzerland AG 2022 51


C. Heitzinger, Algorithms with JULIA,
https://doi.org/10.1007/978-3-031-16560-3_4

julia> Char(97)
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

However, not all integers are valid Unicode code points. You can check if an
integer is a valid code point by using isvalid(Char, integer).
Any Unicode character can be input in single quotes using \u followed by
up to four hexadecimal digits or using \U followed by up to eight hexadecimal
digits; for example, '\u61' is the letter a. Furthermore, some special characters
can be escaped using a backslash: the backslash character '\\', the single quote
'\'', newline (line feed) '\n', carriage return '\r', form feed '\f', backspace
'\b', horizontal tab '\t', vertical tab '\v', and alert (bell) '\a'. Additionally,
the character with octal value ooo (three octal digits) can be written as '\ooo',
and the character with hexadecimal value hh (two hexadecimal digits) can be
written as '\xhh'.
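The escape syntax can be checked directly by converting escaped characters to their code points, as in this short sketch.

```julia
# Each escaped character is an ordinary Char; Int yields its code point.
escapes = ['\u61', '\n', '\t', '\\']
codepoints = [Int(c) for c in escapes]
println(codepoints)   # the code points of a, newline, tab, and backslash
```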
The standard comparison operators ==, !=, <, <=, >, and >= are available for
characters. Furthermore, the function - is defined for characters as well.
julia> '0' < 'A' < 'a'
true
julia> @which '0' < 'A'
<(x, y) in Base at operators.jl:268
julia> @which 'z' - 'a'
-(x::AbstractChar, y::AbstractChar) in Base at char.jl:221
julia> 'z' - 'a' + 1
26

4.2 Strings

4.2.1 Creating and Accessing

Julia strings are immutable, i.e., once they have been created, they cannot be
changed anymore. Strings are delimited either by double quotes " or by triple
double quotes """. Characters can be entered as part of a string using the syntax
for Unicode characters and the one for special characters mentioned in Sect. 4.1.
A double quote needs to be escaped by a backslash if the string is delimited by
double quotes, while it can occur unaltered inside a string delimited by triple
double quotes.
julia> println("123\b\b\b456456\r789")
789456
julia> """This is an "interesting" example."""
"This is an \"interesting\" example."

Triple double quotes are useful for creating longer blocks of text such as doc-
umentation strings. White space receives special treatment, however, when this
syntax is used. When the opening triple quotes are immediately followed by a
newline, then the first newline is ignored. Trailing white space at the end of the
string remains unchanged, however. The indentation of a triple-quoted string is
also changed when it is read. The same amount of white space at the beginning
of each line is removed so that the input line with the least amount of indenta-
tion is not indented at all in the final string. This behavior is useful when a string
occurs as part of indented code.
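The dedenting behavior described above can be seen in a small sketch: the leading newline is dropped, and the indentation common to the two lines is removed.

```julia
# Both content lines are indented by the same amount, so that amount of
# white space is stripped from each line of the resulting string.
text = """
       first line
       second line"""
println(text)
```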
The elements of a string are characters and can be accessed by their index.
Indices in Julia always start at 1 and end at end. The end index can be used in
computations as in this example.
julia> s = "Hello, world!"
"Hello, world!"
julia> s[1], s[2], s[floor(Int, end/2)], s[end-1], s[end]
('H', 'e', ',', 'd', '!')

Substrings can be extracted using any range start:step:end. The step of a
range equals 1 by default, i.e., the two ranges start:1:end and start:end are equiv-
alent.
julia> s[8:12]
"world"
julia> s[1:2:length(s)]
"Hlo ol!"

Note the difference between accessing an element of a string (which returns a
character) and accessing a substring (which returns a string).
julia> typeof(s[1]), typeof(s[1:1])
(Char, String)

However, strings can consist of arbitrary Unicode characters. Since the Uni-
code encoding is a variable-length encoding, i.e., not all characters are encoded
by the same number of bytes, accessing a string at an arbitrary byte position by
[] does not necessarily yield a valid Unicode character. The number of Unicode
characters in a string is returned by length. In this example, the string consists
of four Unicode characters, where only the second character occupies one byte.
julia> a = "\u2200x\u2208\u211d"
"∀x∈ℝ"

Trying to access a[2] and a[3] results in errors, while a[4] evaluates to 'x'.
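The function isvalid can be used to check whether a byte position is the start of a character before indexing; the string a is redefined here so that the snippet stands alone.

```julia
# Byte positions 2 and 3 fall inside the three-byte encoding of '∀',
# so they are not valid character indices.
a = "\u2200x\u2208\u211d"          # the string "∀x∈ℝ"
valid = [isvalid(a, i) for i in 1:4]
println(valid)
```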
It is possible to step through a string using nextind, which returns the next
valid index after a given index, as this example illustrates.
julia> nextind(a, 1)
4
julia> nextind(a, 4)
5
julia> nextind(a, 5)
8
julia> a[1], a[4], a[5], a[8]
('∀', 'x', '∈', 'ℝ')

However, the most convenient way to iterate through a string is using a for
loop.
julia> for char in a println(char) end
∀
x
∈
ℝ
4.2.2 String Interpolation

Strings can be concatenated using the string function. Often it is convenient to
substitute a substring given by an expression into a mostly constant string. This
procedure is called string interpolation in Julia. In its most general form, the
shortest complete expression after a dollar sign $ is evaluated, and the result-
ing printed form of the expression is interpolated into the string. A syntax that
always works for an expression expr is $(expr) inside a string. In the simplest
form of string interpolation, the printed form of a variable is interpolated into
the string by using $var. Finally, a dollar character can be included in a string by
escaping it with a backslash like this \$.
julia> "Command-line arguments are: $ARGS"
"Command-line arguments are: String[]"
julia> "exp(im*pi) = $(exp(im*pi))"
"exp(im*pi) = -1.0 + 1.2246467991473532e-16im"

4.2.3 String Operations

A common string operation is sorting. Since the standard comparisons ==, !=, <,
<=, >, and >= implement lexicographical comparison of strings based on charac-
ter comparison, the sort function can be used to sort strings.
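Because the comparison is based on code points, all uppercase letters sort before all lowercase ones, as this small example shows.

```julia
# Lexicographic sorting via character comparison; 'A' (65) < 'a' (97),
# so "Apple" sorts before the lowercase words.
words = ["pear", "Apple", "apple", "banana"]
sorted = sort(words)
println(sorted)
```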
To check whether a string contains a character, the in function can be used,
which also supports infix syntax.
julia> in('\u2200', a)
true
julia> '\u2200' in a
true

To check whether a string contains another string, the occursin function,
which returns a Boolean value, can be used.
julia> occursin("world", s)
true

The findfirst function returns more information. Its first argument can be a
character, a string, or a regular expression (see Sect. 4.2.5 below). The first argu-
ment is searched for in the second argument, a string, and findfirst returns
the (byte) indices of the matching substring or nothing if there is no such occur-
rence.
julia> findfirst('x', a)
4
julia> findfirst('x', s)
julia> ans == nothing
true

This example shows that printing nothing prints nothing. The most recently re-
turned value is stored in the variable ans, however, and we could check that it is
equal to nothing.
The replace function replaces a substring of a string with another one. Since
strings are immutable, a new string is always returned. The second argument
indicates the substitution, and it can be a pair (see Sect. 4.5.4), a dictionary (see also
Sect. 4.5.4), or a regular expression (see Sect. 4.2.5). The replacement can be a
(constant) string or a function that is applied to the match and that yields a string.
The keyword argument count indicates how many occurrences are replaced.
julia> replace(s, "world" => "World")
"Hello, World!"
julia> replace(s, "w" => uppercase)
"Hello, World!"
julia> replace(s, r"[a-z]" => x -> Char(Int(Char(x[1])) + 1))
"Hfmmp, xpsme!"

Here the regular expression r"[a-z]" matches all lowercase characters. They
are replaced by the character that follows them in the character ordering.
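The count keyword mentioned above limits how many occurrences are replaced; the string s is redefined here so that the snippet stands alone.

```julia
# Only the first two occurrences of "l" are replaced.
s = "Hello, world!"
limited = replace(s, "l" => "L"; count = 2)
println(limited)
```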
Strings can be assembled from substrings using *, repeat, and join. The func-
tion * (and not +) concatenates two strings. The repeat function concatenates
a given number of characters or strings. The join function concatenates an ar-
ray of strings, inserting a given delimiter string between adjacent strings. An
optional second delimiter may be given; it is then used as the delimiter between
the last two substrings.
julia> join(["integers", "rationals", "real numbers"],
            ", ", ", and ")
"integers, rationals, and real numbers"
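The two remaining assembly functions, * and repeat, can be combined in one line.

```julia
# * concatenates two strings; repeat("ha", 3) yields "hahaha".
greeting = "Ha" * repeat("ha", 3) * "!"
println(greeting)
```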

4.2.4 String Literals

The syntax s"string" makes it possible to conveniently create certain objects
(whose type is determined by the string s) from their string representations. More
information about the general mechanism can be found in Sect. 7.6. Here we
discuss three built-in types of such string literals. Reading such expressions cre-
ates regular expressions using r (see Sect. 4.2.5), regular-expression substitution
strings using s (see also Sect. 4.2.5), and version numbers using v.
julia> typeof(b"foo")
Base.CodeUnits{UInt8, String}
julia> typeof(r"foo")
Regex
julia> typeof(s"foo")
SubstitutionString{String}
julia> typeof(v"1.2.3")
VersionNumber

Strings prefixed by r become regular expressions, and substitution strings are
prefixed by s. They are discussed in Sect. 4.2.5 below.
Strings prefixed by v are read as objects of type VersionNumber. Version num-
bers follow the specifications of semantic versioning and hence consist of a major
version, a minor version, and a patch version. Everything except the major ver-
sion number is optional. Version numbers are mostly useful for executing code
contingent on the Julia version and for specifying dependencies on certain ver-
sion numbers. When comparing two version numbers, a + or - may be appended
to indicate a version higher or lower than any of the given version numbers.
julia> if v"1.0" <= VERSION < v"1.1-"
           println("Code specific to the version 1.0 release series.")
       end
julia> if v"1" <= VERSION < v"2-"
           println("Code specific to the version 1 release series.")
       end
Code specific to the version 1 release series.

Here the version number v"1.1-" indicates a version lower than any 1.1 release
or pre-release. It is good practice to append a trailing - to version numbers in
upper bounds unless there is a specific reason not to do so.
Often the version number should be checked not at run time, but already
when the program is parsed into expressions. The @static macro makes it pos-
sible to perform such a check as in the following example.
julia> [@static if v"1" <= VERSION < v"2-" 1 else :unknown end]
1-element Array{Int64,1}:
 1

The converse string operation, namely splitting strings, is implemented by the
function split.

4.2.5 Regular Expressions

Regular expressions are a powerful way to check whether a string matches a
given regular pattern and to extract substrings from any matches. For exam-
ple, regular expressions are convenient for extracting information from unstruc-
tured data, for scraping websites, and for parsing large textual input. There are
many variants and implementations of regular expressions; Julia uses the pcre
(Perl compatible regular expressions) library, a fast implementation of a com-
mon form of regular expressions. This section gives an introduction to regular
expressions and how to use them in Julia by discussing a few examples, but
there are many more options.
Many characters have a special meaning inside a regular expression. Com-
mon characters with special meaning are ?, +, and *. They allow repetitions of
the preceding character or group. The question mark ? allows no occurrence or
exactly one occurrence. For example, the regular expression r"1?" matches the
empty string or "1". The plus sign + allows at least one occurrence. For example,
the regular expression r"1+" matches "1", "11", "111", and so forth. Third, the
asterisk * allows any number of occurrences including none. For example, the
regular expression r"1*" matches the empty string, "1", "11", and so forth.
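The three quantifiers can be compared on concrete inputs; the match function used here is introduced immediately below.

```julia
# ? matches at most one "1", + matches the whole greedy run of "1"s,
# and * also matches the empty string when no "1" is present.
m1 = match(r"1?", "111")
m2 = match(r"1+", "111")
m3 = match(r"1*", "abc")
println((m1.match, m2.match, m3.match))
```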
The match function takes a regular expression and a string as its arguments
as well as the index where to start the search as an optional argument. If there
is a match, a RegexMatch object with information about the first match is re-
turned; otherwise, nothing is returned. We have already seen above that nothing
is treated specially by the repl in the sense that nothing is printed if it is the re-
sult of an evaluation.
The example we consider here is extracting information from a text file or
the textual output of a program. The lines of a file can conveniently be read by
the functions open and readlines (if it is not too large). Calls to external pro-
grams are entered between backquotes, and string interpolation (see Sect. 4.2.2)
is performed between backquotes. The output of an external program can then
be captured in a string or in an array of strings as in these examples.
julia> read(`echo Hello, world!`, String)
"Hello, world!\n"
julia> readlines(`echo Hello, world!`)
1-element Array{String,1}:
 "Hello, world!"

We define the string s as an example and then check whether it contains at
least one digit two or at least two consecutive nines.

julia> s = """
       movie name: War Games
       release date: May 7, 1983
       researcher: Stephen Falken
       artificial intelligence: Joshua
       computer name: WOPR (War Operation Plan Response)
       games: Falken's maze, chess, poker, global thermonuclear war
       hero: David L. Lightman
       status: world saved
       phone numbers from: 311-936-0001
       phone numbers to: 311-936-9999""";
julia> match(r"2", s)
julia> match(r"9+", s)
RegexMatch("9")

To search for occurrences not only of single characters, but of more compli-
cated patterns, characters and patterns can be grouped by brackets (pattern).
Such a group can also be named so that one can refer to it more conveniently than
by index, as we will see later. A named group looks like (?<name>pattern). A
group consisting of two alternatives pattern1 and pattern2 is written as
(pattern1|pattern2).
julia> m = match(r"(?<foo>123|456|789)(a|b|c)", "123abc")
RegexMatch("123a", foo="123", 2="a")
julia> m["foo"], m[2]
("123", "a")

We see that a matched group can be accessed by its (numerical) index (here 2)
or by the name of the group (here "foo", using the string "foo" as the index).
Furthermore, the details of the match m can be inspected by dump(m). The fields
match, captures, offset, and offsets may be useful. In general, dump is ex-
tremely useful for inspecting any value.
There are also characters and patterns that match only certain characters or
locations. A dot . matches any character. The characters ^ and $ match the be-
ginning and end of a string (or line, in multiline mode), respectively.
Character classes are patterns of the form [from-to]. The negation of such
a character class is [^from-to]. Predefined character classes include \d for any
decimal digit, \s for any white-space character, and \w for any “word” character. For
example, the patterns [0-9] and \d both match any decimal digit. The negations
of these classes are given by uppercase letters.
There are also named character classes such as [:alnum:] (letters and digits),
[:alpha:] (letters), [:blank:] (spaces and tabs), [:digit:] (digits), [:lower:]
(lowercase letters), [:space:] (white space), [:upper:] (uppercase letters), and
[:word:] (“word” characters).
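As a quick check that \d and the named class [:digit:] match the same characters, both can be applied to the same input (named classes must appear inside brackets).

```julia
# Both patterns extract the run of digits from the string.
s1 = match(r"\d+", "room 101").match
s2 = match(r"[[:digit:]]+", "room 101").match
println((s1, s2))
```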
Named groups and character classes help parse heterogeneous data. In this
example, we extract a date from the string using three named groups.

julia> m = match(r"(?<month>\w+)\s+(?<day>\d+),\s+(?<year>\d+)", s)
RegexMatch("May 7, 1983", month="May", day="7", year="1983")
julia> m["day"], m["month"], m["year"]
("7", "May", "1983")

The three parts of the first phone number can be extracted similarly. Groups
can even be nested, so that we can define a group named number to contain the
whole match.
julia> m = match(r"(?<number>(?<area>\d+)-(\d+)-(\d+))", s)
RegexMatch("311-936-0001", number="311-936-0001", area="311",
3="936", 4="0001")
julia> m["number"], m["area"], m[1], m[2], m[3], m[4]
("311-936-0001", "311", "311-936-0001", "311", "936", "0001")

The following example shows how named character classes are used inside
brackets.
julia> match(r"artificial intelligence: (?<name>[[:alnum:]]+)",
             s)["name"]
"Joshua"

In addition to extracting matching substrings, regular expressions are also
useful to replace matching parts of a string, where the substituted string may
depend on the groups in the match. In the substitution string, which is usually
a string literal starting with s, \0 refers to the whole match, \integer refers to
the group with index integer, and \g<group> refers to a named group. In this
example, we rewrite a date, converting it from American to British format, first
using named groups and then using numbered groups.
julia> r = r"(?<month>\w+)\s+(?<day>\d+),\s+(?<year>\d+)"
r"(?<month>\w+)\s+(?<day>\d+),\s+(?<year>\d+)"
julia> replace(s, r => s"\g<day> \g<month> \g<year>")
...release date: 7 May 1983\n...
julia> replace(s, r => s"\2 \1 \3 (\0)")
...release date: 7 May 1983 (May 7, 1983)\n...

Another example is parsing floating-point numbers written with a decimal
comma, which means that the decimal comma must be replaced by a decimal
point. The following regular expression requires at least one digit to the left and
to the right of the decimal comma, and the substitution string takes these digits
and puts a decimal point between them. The function parse takes the desired
type as the first argument.

julia> replace("2,718281828", r"(\d+),(\d+)" => s"\1.\2")
"2.718281828"
julia> parse(Float64, ans)
2.718281828
julia> typeof(ans)
Float64

The behavior of regular expressions can be modified by combinations of the
flags i, m, s, and x after the closing double quote of its string literal. These flags
affect the behavior regarding case sensitivity, the treatment of multiple lines, and
white space.
Regular expressions have many more options, and the interested reader is
referred to the documentation of the pcre library for the details. In summary,
regular expressions are a fast tool to deal with large bodies of text with a hetero-
geneous structure.
Whenever you are interested in more than the first match of a regular expres-
sion, the function eachmatch is useful.

4.3 Symbols

Symbols are an important data structure in Lisp-like languages, because they
serve as variable names and because they are fundamental building blocks of
expressions (see Sect. 4.4). A symbol is essentially an interned string identifier.
Interning a string means that it is ensured that only one copy of each distinct
string is stored, and thus interned strings can be associated with values. This
implies that it is not possible for two symbols with the same name to exist simul-
taneously, i.e., symbols are unique.
There are a few ways to create a symbol. We can call the parser, i.e., the func-
tion parse in the Meta module, to directly create a symbol.
julia> Meta.parse("foo")
:foo
julia> typeof(:foo)
Symbol

The parser recognizes foo as a symbol and returns it (without evaluating it).
Another option to create a symbol is to enter a suitable expression. Entering
foo at the repl and hence evaluating it yields the value of the variable called
foo, however. Therefore we have to protect the expression from evaluation. This
is achieved by prepending it with a colon :, which adds one layer of protection
against evaluation to the expression that follows it. The colon is also called
the quote character in Julia. Thus :foo evaluates to the symbol foo.
julia> :foo
:foo

A more direct way to create a symbol is to use the function Symbol, which
follows the theme in Julia that functions that have the same name as a type
create new values of this type. The function Symbol creates a new symbol by
concatenating the string representations of its arguments.
julia> :foo == Symbol("foo") == Symbol('f', "oo")
true

This example also shows that symbols are indeed unique.
Sometimes it is necessary to create a new, uninterned symbol, ensuring that
its name does not conflict with other symbol names (see Chap. 7). This is
achieved by the function gensym, which returns a unique symbol whenever it
is called. A prefix may be supplied to become part of the symbol name.
julia> gensym(), gensym()
(Symbol("##253"), Symbol("##254"))
julia> gensym("foo"), gensym("foo")
(Symbol("##foo#255"), Symbol("##foo#256"))

Symbols are used to access variables and evaluate to the values of the vari-
ables. Expressions can be evaluated not only in the repl, but also by calling the
function eval. In this example, we try to access the value of the undefined vari-
able named foo, which raises an error. After defining the variable, however, its
value is returned by evaluating :foo.
julia> eval(:foo)
ERROR: UndefVarError: foo not defined
...
julia> foo = 0
0
julia> eval(:foo)
0

We have just seen that Julia provides access to its parser and its evaluator
via Meta.parse and eval. These functions work not only with symbols, but also
with expressions, the building blocks of Julia programs.

4.4 Expressions

Reading a variable name using Meta.parse yields a symbol. The following ex-
ample shows what happens when we parse more complicated expressions.
julia> Meta.parse("0 + 1")
:(0 + 1)
julia> typeof(ans)
Expr
julia> Meta.parse("foo + bar")
:(foo + bar)
julia> typeof(ans)
Expr

In both examples, an object of type Expr, i.e., an expression, is returned. Hence
expressions are first-class objects in Julia. They store Julia programs directly
in a suitable data structure and not just in a string.
We can evaluate expressions at the repl just like any other data structure.
julia> :(0 + 1)
:(0 + 1)
julia> :(foo + bar)
:(foo + bar)

The expressions are returned seemingly unchanged by the repl, because they
have been quoted using the colon :. Behind the scenes, the situation is a bit more
involved, however. The quote absorbed the evaluation by the repl, returning the
expression 0 + 1. This expression is printed as :(0 + 1) (and not as 1, which
would require another evaluation) so that it remains an expression.
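One more evaluation step, supplied explicitly by eval, turns the quoted expression into its value.

```julia
# The quoted expression is a data structure; eval performs the
# evaluation that the quote absorbed.
e = :(0 + 1)
println(e)        # prints the expression itself
println(eval(e))  # prints its value
```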
More information about the parts of an expression can be obtained by using
the dump function.
julia> dump(:(0 + 1))
Expr
  head: Symbol call
  args: Array{Any}((3,))
    1: Symbol +
    2: Int64 0
    3: Int64 1
julia> dump(:(foo + 2*bar))
Expr
  head: Symbol call
  args: Array{Any}((3,))
    1: Symbol +
    2: Symbol foo
    3: Expr
      head: Symbol call
      args: Array{Any}((3,))
        1: Symbol *
        2: Int64 2
        3: Symbol bar

We find that an object of type Expr has two fields, namely head and args. In both
examples, the head is :call. The arguments are vectors, whose first element is
a symbol that names a function. Further arguments can be constants (such as
symbols) or further expressions.
In the next example, we deconstruct an expression into the parts we just ob-
served using dump and then we make another expression out of these parts. As
usual in Julia, the name of a type (here Expr) is also a function, and calling this
function makes a new object of this type.
julia> expr = Meta.parse("foo + 2*bar")
:(foo + 2bar)
julia> expr.head
:call
julia> expr.args
3-element Array{Any,1}:
  :+
  :foo
  :(2bar)
julia> Expr(expr.head, expr.args...)
:(foo + 2bar)
julia> Expr(expr.head, expr.args...) == :(foo + 2*bar)
true

Evaluating the expression expr in this example raises an error, since the vari-
ables foo and bar are undefined. After defining them, however, we can evaluate
the expression.
julia> foo = 1; bar = 2;
julia> eval(expr)
5

We will learn much more about expressions and their evaluation in Chap. 7.
The salient point is that Julia code is represented in a canonical form as a Julia
data structure, namely as objects of type Expr. One could argue that any language
(that at least has a string data type) can represent programs in this language as a
string and therefore using a built-in data type. This is true, of course, but of very
limited use, since string data types do not provide the facilities of the Expr type,
of Meta.parse, and of eval.

4.5 Collections

A collection is the general term for a data type that contains elements in ordered,
unordered, indexable, or non-indexable form. Various types of collections are dis-
cussed in this section. Data types that are collections are listed in Table 4.1. All
built-in abstract data types are listed in Table 4.2; several of them are collections,
but not all of them. It is not possible to create instances of abstract types, only of
concrete subtypes of abstract types.

Table 4.1 Built-in data types that are collections.


Arrays: Array, AbstractArray, BitArray, DenseArray, StridedArray, SubArray;
Dictionaries: Dict, AbstractDict, WeakKeyDict;
Matrices: AbstractMatrix, StridedMatrix;
Pairs: Pair;
Ranges: AbstractRange, OrdinalRange, StepRange, UnitRange;
Sets: Set, BitSet;
Strings: String, AbstractString, SubString;
Tuples: Tuple, NTuple, NamedTuple, Vararg;
Vectors: Vector, AbstractVector, BitVector, StridedVector;
Vectors or matrices: VecOrMat, StridedVecOrMat.

Table 4.2 All built-in abstract data types.


AbstractArray, AbstractChannel, AbstractChar, AbstractDict,
AbstractDisplay, AbstractFloat, AbstractIrrational, AbstractMatrix,
AbstractRange, AbstractSet, AbstractString, AbstractUnitRange,
AbstractVecOrMat, AbstractVector.

4.5.1 General Collections

The most general operations on collections are summarized in Table 4.3.

Table 4.3 Operations on general collections.


Function                      Description
isempty(collection) → Bool    whether a collection is empty or not
empty!(collection)            destructively modify a collection to be empty
length(collection)            return the number of elements in a collection

Any collection can be queried whether it is empty or not. The following exam-
ple is an empty array. Arrays are discussed in detail in Chap. 8; for now, it suffices
to know that vectors and arrays are denoted by square brackets and that the data
type of their elements may be indicated before the opening square bracket.
julia> [], typeof([]), isempty([])
(Any[], Vector{Any}, true)

Any collection can also be destructively modified to be empty. The examples
here are emptying an array and a set (see also Sect. 4.5.5).
julia> empty!([1, 2, 3]), empty!(Set([1, 2, 3]))
(Int64[], Set{Int64}())

Finally, every collection can be queried how many elements it contains.
julia> length(empty!([1, 2, 3])), length(empty!(Set([1, 2, 3])))
(0, 0)

4.5.2 Iterable Collections

Iterating over all elements of an iterable collection is a common programming
task. Iterable collections are collections whose elements can be iterated over us-
ing for loops (see Sect. 6.4) or comprehensions (see Sect. 3.5).
Iteration in Julia is based on the generic function iterate. A for loop of the
form
for variable in iterable
    expressions
end

executes the expressions for all elements of the iterable collection iterable bound
to variable in the order in which they are returned by the iterate method. For
built-in iterable collections, iterate methods have of course already been de-
fined. Furthermore, after defining iterate methods for your own data struc-
tures, you can iterate over these data structures using for loops as well.
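As a minimal sketch of this mechanism, consider a hypothetical type Countdown (the name and design are illustrative, not from the text): after a single iterate method is defined, for loops work on its values.

```julia
# A value that iterates from c.from down to 1.
struct Countdown
    from::Int
end

# iterate(c) starts at c.from; iterate(c, state) returns the next
# element and state, or nothing when the iteration is exhausted.
Base.iterate(c::Countdown, state = c.from) =
    state < 1 ? nothing : (state, state - 1)

for i in Countdown(3)
    println(i)
end
```

This prints 3, 2, and 1. Defining Base.length as well would additionally allow functions such as collect to preallocate their result.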
There are many useful functions that can be applied to iterable collections;
these include functions to extract certain elements, to reduce the collection by
applying a function repeatedly, and to map a function over all elements. Hence
the functions discussed in this section are important building blocks for func-
tional programming, which often proceeds by combining functions to extract
the desired information after starting from a collection.
Table 4.4 gives an overview of basic functions that are defined for iterable
collections.
Several functions exist to find extrema in an iterable collection. They are listed
in Table 4.5. As usual, destructive versions end in an exclamation mark !.
An important set of functions is given in Table 4.6. The most general ones in
this table are the ones for reducing and folding an iterable collection. Reducing
a collection containing elements {𝑎𝑖 }𝑛𝑖=1 using a function or operation ⊕ means
calculating the expression 𝑎1 ⊕ 𝑎2 ⊕ ⋯ ⊕ 𝑎𝑛 . The operation ⊕ must take two argu-
ments and it must be associative. An initial element init may also be specified as
a keyword argument. The operation is applied repeatedly until the whole expres-
sion has been evaluated. If the collection is empty, the initial element must be
specified except in special cases where Julia knows the neutral element of the
operation. If the collection is non-empty, it is unspecified whether the initial ele-
ment is used. If the collection is an ordered one, the elements are not reordered;
otherwise the evaluation order is unspecified.
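A short example of reduce, including the empty-collection case where init is required:

```julia
# reduce applies the operation pairwise over the collection; with init
# the result is well defined even for an empty collection.
r1 = reduce(+, [1, 2, 3, 4])
r2 = reduce(*, Int[]; init = 1)
println((r1, r2))
```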

Table 4.4 Basic operations on iterable collections.


Function                           Description
eltype(iterable)                   return the type of the elements of iterable
in(item, collection) → Bool        determine whether item is in collection
item in collection → Bool          infix syntax for in
issubset(a, b) → Bool              test whether every element of a is also present in b
indexin(a, b)                      return an array containing the first index in b
                                   for each value in a which is a member of b
                                   or nothing otherwise
unique([f,] iterable)              return an array with the unique elements of iterable
                                   after applying the function f
unique(iterable; dims)             return an array with the unique elements of iterable
                                   along dimension dims
allunique(iterable) → Bool         determine whether all elements are distinct
first(iterable)                    return the first element
collect(collection)                return an Array containing all elements in collection
collect(type, collection)          return an Array with element type type
                                   containing all elements in collection
count([f,] iterable) → Integer     return the number of true elements
                                   after applying f

Table 4.5 Operations on iterable collections for finding extrema.


Function                          Description
maximum([f,] iterable)            return the maximum element in iterable
                                  after applying the function f
maximum(A; dims)                  return the maximum element of the array A over dims
maximum!(r, A)                    write the maximum element of the array A
                                  over the singleton dimensions of the array r into r
minimum([f,] iterable)            return the minimum element in iterable
                                  after applying the function f
minimum(A; dims)                  return the minimum element of the array A over dims
minimum!(r, A)                    write the minimum element of the array A
                                  over the singleton dimensions of the array r into r
extrema([f,] iterable)            return the minimum and maximum elements
  → Tuple                         computed in a single pass
extrema(A; dims)                  return the min. and max. elements of the array A
  → Array{Tuple}                  over the dimensions dims
findmax(iterable) → Tuple         return the maximum element of iterable and its index
findmax(A; dims)                  return the maximum element of the array A
  → Tuple                         and its index over dimensions dims
findmax!(vals, inds, A)           return the maximum element of the array A along the
  → Tuple                         dimensions of vals and inds and store them there
findmin(iterable) → Tuple         return the minimum element of iterable and its index
findmin(A; dims)                  return the minimum element of the array A
  → Tuple                         and its index over dimensions dims
findmin!(vals, inds, A)           return the minimum element of the array A along the
  → Tuple                         dimensions of vals and inds and store them there

The function reduce provides the general form of reduction. Frequently used special cases come with their own implementations maximum, minimum, sum, prod, any, and all (and their variants), which should be used instead.
The folding functions foldl and foldr come with more guarantees: they guarantee left and right associativity, respectively, and use the given initial or neutral element exactly once.
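The associativity guarantee matters whenever the operation is not associative. A small illustration using subtraction, where the grouping changes the result:

```julia
# foldl groups from the left:  ((10 - 1) - 2) - 3
# foldr groups from the right: 1 - (2 - (3 - 10))
println(foldl(-, [1, 2, 3]; init = 10))  # 4
println(foldr(-, [1, 2, 3]; init = 10))  # -8
```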

Table 4.6 Reduction and folding operations on iterable collections.

Function                   Description
reduce(f, iterable; init)  reduce iterable using f and the initial element init
foldl(f, iterable; init)   like reduce, but guarantee left associativity
                           and use the neutral element init exactly once
foldr(f, iterable; init)   like reduce, but guarantee right associativity
                           and use the neutral element init exactly once
sum(iterable)              return the sum of all elements
sum(f, iterable)           sum the results of calling f on each element
sum(A; dims)               return the sum of all elements over the dimensions dims
sum!(r, A)                 store the sum over the singleton dimensions of r in r
prod(iterable)             return the product of all elements
prod(f, iterable)          multiply the results of calling f on each element
prod(A; dims)              return the product of all elements over the dimensions dims
prod!(r, A)                store the product over the singleton dimensions of r in r
any(iterable) → Bool       test whether any element of a collection of Bools is true
any(f, iterable)           test whether f returns true for any element of iterable
any(A; dims)               test whether any element of an array of Bools
                           along the dimensions dims is true
any!(r, A)                 store whether any element of the array A along the
                           singleton dimensions of r is true in r
all(iterable) → Bool       test whether all elements of a collection of Bools are true
all(f, iterable)           test whether f returns true for all elements of iterable
all(A; dims)               test whether all elements of an array of Bools
                           along the dimensions dims are true
all!(r, A)                 store whether all elements of the array A along the
                           singleton dimensions of r are true in r

A simple example of reduction is the summation of the elements of a collection. (Note that in practice the preferred way to sum the elements of a vector is to use the function sum.)

julia> reduce(+, Int[])
0
julia> reduce(+, [], init = 0)
0
julia> reduce(+, [1])
1
julia> reduce(+, [1, 2])
3

julia> reduce(+, [1, 2, 3])
6

In the first example, we have specified the type (Int) of the elements of the vector by writing Int[]. Since [] may contain elements of any type (which you can check by evaluating eltype([])), reduce(+, []) raises an error because Julia cannot determine the neutral element; therefore we have specified the initial element as zero in the second example.
If the collection contains only one element, it is returned. If there are two or more elements in the collection, the operation is applied at least once. In both cases, it is unspecified whether the initial element is used.
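As a quick check of this behavior (whether a supplied init is combined with a non-empty collection is unspecified; in the Julia version tried here, it is):

```julia
println(reduce(+, [1, 2, 3]))         # 6
println(reduce(+, Int[]; init = 10))  # 10 — only the initial element remains
# With a non-empty collection, the init may be folded in as well,
# yielding 16 rather than 6 here:
println(reduce(+, [1, 2, 3]; init = 10))
```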
A mathematical example that can be implemented using reduce is given by Taylor series.

function my_exp(x, n::Integer)
    reduce(+, [x^i/factorial(i) for i in 0:n])
end

Here we have used array comprehensions (see Sect. 3.5) to conveniently construct a collection with the appropriate elements.
Analogously, we can multiply all elements in a collection using reduce with the operation * and the initial or neutral element 1. The factorial function can be defined in one line in this manner.

my_factorial(n::Integer)::BigInt = reduce(*, BigInt(1):BigInt(n))

Here we have used BigInts in order to be able to compute large values. The syntax

start:stop

makes a range, more precisely a UnitRange, of values starting at start and ending at stop with step size one. (The syntax start:step:stop yields a StepRange with the given step size.) Ranges may be empty.

julia> typeof(BigInt(1):BigInt(2))
UnitRange{BigInt}

This confirms that the collection which is reduced is indeed a range of BigInts.
Another example is the implementation of the maximum function by reduction. While maximum acts on iterable collections, the max function takes one or more arguments. We can implement maximum using max (called with two arguments) as follows.

julia> reduce(max, [], init = -Inf), reduce(max, [1])
(-Inf, 1)

Analogously, the effect of minimum can be achieved by reducing min using a suitable initial element.

julia> reduce(min, [], init = Inf), reduce(min, [1])
(Inf, 1)

Continuing the example of the Taylor series above, the implementation can be made more practical by two improvements. The first improvement is, as indicated above already, to define it even more succinctly using sum.

my_exp(x, n::Integer) = sum([x^i/factorial(i) for i in 0:n])

The second improvement is to replace the array comprehension by a generator. An array comprehension allocates an array (a costly operation if the array is large), but here the array is only used to be reduced or summed. Generators, on the other hand, generate values on demand, obviating the need for any allocation. Generators can be used wherever functions take them as arguments. Their syntax is the same as the syntax of array comprehensions (see Sect. 3.5), but without the brackets. Hence the final definition of my_exp is the following.

my_exp(x, n::Integer) = sum(x^i/factorial(i) for i in 0:n)

Similarly, the factorial function can be defined using prod.

my_factorial(n::Integer) = prod(BigInt(1):BigInt(n))

The functions any and all can also be viewed as reductions of collections. To see this, we define the two helper functions and and or.

julia> and(a, b) = a && b
and (generic function with 1 method)
julia> or(a, b) = a || b
or (generic function with 1 method)

Reducing a collection using or has the same effect as applying any to the collection; the initial or neutral element of the operation or is false. Analogously, reducing a collection using and has the same effect as applying all to the collection; the initial or neutral element of the operation and is true.

my_any(collection) = reduce(or, collection, init = false)
my_all(collection) = reduce(and, collection, init = true)

The following examples show that our definitions work as expected.

julia> my_any([]), any([])
(false, false)
julia> my_any([false]), any([false])
(false, false)
julia> my_any([true]), any([true])
(true, true)
julia> my_any([false, true]), any([false, true])
(true, true)
julia> my_all([]), all([])
(true, true)
julia> my_all([false]), all([false])
(false, false)
julia> my_all([true]), all([true])
(true, true)
julia> my_all([false, true]), all([false, true])
(false, false)

A common programming task is to apply a function to each element of a collection or of collections and to collect the results. This operation is called mapping a function over a collection or collections, and variants of mapping are summarized in Table 4.7. The type of the resulting collection is the same as the type of the given collection or collections.

Table 4.7 Mapping operations on iterable collections.

Function                             Description
map(f, iterable...) → iterable       return the result of applying f elementwise
map!(f, destination, iterable...)    like map, but store the result in destination
foreach(f, iterable...) → Nothing    like map, but discard the results
filter(f, iterable)                  return a copy of iterable without the elements
                                     for which f returns false
filter!(f, iterable)                 destructive version of filter
mapreduce(f, op, iterable...; init)  equivalent to
                                     reduce(op, map(f, iterable...); init=init)
mapfoldl                             like mapreduce, but use foldl instead of reduce
mapfoldr                             like mapreduce, but use foldr instead of reduce

It is often convenient to map anonymous functions (see Sect. 2.5) like in these examples.

julia> map(x -> log(10, x), [1, 10, 100])
3-element Vector{Float64}:
 0.0
 1.0
 2.0

The same effect can be achieved using an array comprehension (see Sect. 3.5). It is often a matter of style if map or a comprehension is used.

julia> [log(10, x) for x in [1, 10, 100]]
3-element Vector{Float64}:
 0.0
 1.0
 2.0

The function to be mapped may take more than one argument, and then a corresponding number of collections must be supplied to map or its cousins, one collection for each function argument.

julia> map((x, y) -> x^2 + y^2, [1, 2, 3], [10, 20, 30])
3-element Array{Int64,1}:
 101
 404
 909

The function foreach is the same as map, except that it discards the results of applying the function and always returns nothing, the only value of type Nothing. It should be used when the function calls are performed to produce side effects only, e.g., to print values.
The functions filter and filter! are also similar to map, but they are used to return only a subset of the collection. Again, a function is applied to each element of a collection. If it returns true, the element is kept in a copy of the collection; otherwise it is ignored.
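A short illustration of foreach and filter side by side:

```julia
# foreach is for side effects only and always returns nothing
result = foreach(x -> println(2x), [1, 2, 3])  # prints 2, 4, 6
println(result === nothing)                    # true

# filter keeps the elements for which the predicate returns true
println(filter(iseven, [1, 2, 3, 4, 5, 6]))    # [2, 4, 6]

# filter! modifies its (mutable) argument in place
v = collect(1:6)
filter!(x -> x > 3, v)
println(v)                                     # [4, 5, 6]
```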
The function mapreduce and its variants combine map and reduce, as the name indicates. The function call

mapreduce(f, op, iterable...; init)

is equivalent to evaluating

reduce(op, map(f, iterable...); init=init)

except that the need to allocate any intermediate results is obviated; hence the mapreduce version generally runs faster and generates less garbage.
Using mapreduce, Taylor series can be implemented perfectly in the spirit of functional programming.

function my_exp(x, n::Integer)
    mapreduce(i -> x^i/factorial(BigInt(i)), +, 0:n)
end

In order to check the convergence speed for different arguments x, we use map and an anonymous function to produce tuples that contain the values of x and the corresponding residua. Then we filter the tuples to identify the tuples where the residua are above a certain threshold.

julia> filter(x_res -> abs(x_res[2]) > 1e-6,
              map(x -> (x, my_exp(x, 20) - exp(x)), 1:5))
1-element Vector{Tuple{Int64,BigFloat}}:
 (5, -1.20351...e-05)

Another example of using mapreduce, namely the definition of norms, has already been discussed in Sect. 2.5 and Sect. 2.6.

4.5.3 Indexable Collections

Indexable collections are collections whose elements are associated with an index or key. A set (see Sect. 4.5.5) is an example of a collection that is iterable, but not indexable. In Julia, the syntax

a[i...]

is just an abbreviation for the function call

getindex(a, i...).

Similarly, assignments of the form

a[i...] = x

are equivalent to the expression

begin setindex!(a, x, i...); x end.
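These equivalences can be checked directly:

```julia
a = [10, 20, 30]

# a[2] is an abbreviation for getindex(a, 2)
println(a[2] == getindex(a, 2))  # true

# a[2] = 25 is equivalent to begin setindex!(a, 25, 2); 25 end;
# the assignment expression evaluates to the assigned value
setindex!(a, 25, 2)
println(a[2])                    # 25
println((a[3] = 99))             # 99
```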

In the case of Dicts, the index is called a key (see Sect. 4.5.4).
This example shows that multi-dimensional arrays require a corresponding number of indices.

julia> foo = [1 2; 3 4]; setindex!(foo, 5, 1, 1); foo
2×2 Matrix{Int64}:
 5  2
 3  4
The operations special to indexable collections are summarized in Table 4.8.

Table 4.8 Operations on indexable collections.

Function                            Description
getindex(collection, i...)          return the element stored at the given index or key i
setindex!(collection, value, i...)  store the value at the given index or key i

4.5.4 Associative Collections

An associative collection associates keys with values. The standard associative collection is the Dict, short for dictionary. The type of the keys in a Dict may be any type for which the hash function hash and the comparison function isequal are defined. The type of the values may be arbitrary.
A Dict contains Pairs of keys and values, and a Pair is made by the => function, where key=>value is a syntactic abbreviation for the function call =>(key, value).

julia> typeof(=>(1, 2)), typeof(1 => 2)
(Pair{Int64, Int64}, Pair{Int64, Int64})

There are various ways to create a Dict. It can be created by passing Pair objects to the Dict constructor.

julia> Dict("a" => 1, "b" => 2, "c" => 3)
Dict{String, Int64} with 3 entries:
  "c" => 3
  "b" => 2
  "a" => 1

We see that the types of the keys and values (i.e., String and Int64) are inferred from the Pairs, but they can also be specified as parameters to the Dict function in curly brackets (see Sect. 5.7) by writing Dict{key-type, value-type}(pairs...) as in the following example.

julia> Dict{String, Int16}()
Dict{String, Int16}()
julia> Dict{String, Int16}("a" => 1, "b" => 2, "c" => 3)
Dict{String, Int16} with 3 entries:
  "c" => 3
  "b" => 2
  "a" => 1

A Dict can also be created by a generator as shown here.

julia> Dict(i => Char(i+64) for i in 1:3)
Dict{Int64, Char} with 3 entries:
  2 => 'B'
  3 => 'C'
  1 => 'A'

Of course, the value associated with a key in a dictionary can be modified. If d is a Dict, then the expression

d[key] = value

stores a key-value pair in the dictionary, possibly replacing any existing value for the key. Furthermore, the expression d[key] returns the value of the given key if it exists or throws an error if it does not. The function haskey tests whether a collection contains a value associated with a given key and returns a Bool value.
Table 4.9 provides an overview of the operations available on associative collections.

4.5.5 Sets

Sets are among the most fundamental data structures in mathematics. As usual, a set can be constructed by using the name of the data structure, i.e., Set, as a function or constructor, where the type of the elements can be specified as a type parameter in curly brackets. The initial elements of a set may be passed as an argument that is an iterable object, e.g., a vector.

Table 4.9 Operations on associative collections.

Function                          Description
Dict(pairs...)                    create a dictionary with automatic types
Dict{t1, t2}(pairs...)            create a dictionary for the key and value types
keytype(collection)               return the type of the keys
valtype(collection)               return the type of the values
keys(collection)                  return an iterator over all keys in a collection,
                                  to be used in a for loop
values(collection)                return an iterator over all values in a collection,
                                  to be used in a for loop
d[key]                            return the value associated with key in d
d[key] = value                    associate value with key in d
haskey(collection, key)           determine whether a collection contains key
get(collection, key, default)     return the value for key if it exists in collection
                                  or the default value otherwise
get(f, collection, key)           return the value for key if it exists in collection
                                  or f() otherwise; to be used with a do block
get!(collection, key, default)    return the value for key if it exists in collection
                                  or otherwise store key=>default and return default
get!(f, collection, key)          destructive version of get(f, collection, key)
getkey(collection, key, default)  return key if it exists in collection or default otherwise
delete!(collection, key)          delete the key and its value from collection and
                                  return collection
pop!(collection, key[, default])  delete key and return its value
                                  if it exists in collection, otherwise
                                  return default or throw an error if unspecified
merge(collection, others...)      return a collection merged from all given collections;
                                  the value for each key is taken
                                  from the last collection that contains it
merge!(collection, others...)     destructive version of merge, modify collection
sizehint!(collection, n)          suggest that collection will contain at least n elements
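A few of these operations in action on a small dictionary:

```julia
d = Dict("a" => 1, "b" => 2)

println(haskey(d, "a"))    # true
println(get(d, "z", 0))    # 0 — default value; d is unchanged
println(get!(d, "z", 26))  # 26 — stores "z" => 26 and returns it
println(d["z"])            # 26

# merge: on duplicate keys, the last collection wins
m = merge(Dict("a" => 1), Dict("a" => 10, "b" => 2))
println(m["a"])            # 10
```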

julia> Set()
Set{Any}()
julia> Set{Int64}()
Set{Int64}()
julia> Set([1, 2, 3]), typeof(Set([1, 2, 3]))
(Set([2, 3, 1]), Set{Int64})
julia> Set{Float16}([1, 2, 3]), typeof(Set{Float16}([1, 2, 3]))
(Set(Float16[2.0, 3.0, 1.0]), Set{Float16})

A "ɃʲÆȕʲ is a sorted set of bɪʲs implemented as a bit string. While Æȕʲs are
suitable for sparse integer sets and generally for arbitrary objects, "ɃʲÆȕʲs are
especially suited for dense integer sets.
The usual set operations are available in both non-destructive and destructive
versions, the latter ending with an exclamation mark Р. An overview is given in
Table 4.10.

Table 4.10 Operations on set-like collections.

Function               Description
Set([iterable])        construct a set with the elements of iterable
Set{type}([iterable])  construct a set for elements of type with the elements of iterable
BitSet([iterable])     construct a (dense) set of Ints
issubset(s1, s2)       determine whether s1 is a subset of s2 using in
union(...)             return the union of the given sets
union!(set, ...)       destructively modify set to contain
                       the union of all given sets
intersect(...)         return the intersection of the given sets
intersect!(set, ...)   destructively modify set to contain
                       the intersection of all given sets
setdiff(s1, sets...)   return the set of elements which are in s1,
                       but not in any of the sets
setdiff!(s1, sets...)  destructively delete the elements in sets from s1
symdiff(...)           return the symmetric difference of all given sets
symdiff!(s1, sets...)  destructive version of symdiff, store result in s1
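The set operations behave as expected from mathematics; for example:

```julia
s1 = Set([1, 2, 3])
s2 = Set([3, 4])

println(union(s1, s2) == Set([1, 2, 3, 4]))  # true
println(intersect(s1, s2) == Set([3]))       # true
println(setdiff(s1, s2) == Set([1, 2]))      # true
println(symdiff(s1, s2) == Set([1, 2, 4]))   # true
println(issubset(Set([1, 2]), s1))           # true

# BitSets support the same operations on dense integer sets
println(union(BitSet([1, 2]), BitSet([2, 3])) == BitSet([1, 2, 3]))  # true
```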

Several of the functions in the table can also be applied to arrays with the expected results. If the arguments are arrays, the order of the elements is maintained and an array is returned. These methods of the generic functions are easier to use and run faster when arrays are to be interpreted as sets.

julia> a1 = [1, 2]; a2 = [3];
julia> union(Set(a1), Set(a2)) == Set(union(a1, a2))
true

4.5.6 Deques (Double-Ended Queues)

Vectors in the context of linear algebra are discussed in detail in Chap. 8. Here we view vectors as deques (double-ended queues) and discuss the operations that implement deques on top of vectors as the underlying data structure. This notion of viewing vectors as collections of items (and not as elements of ℝ𝑑) and performing operations on them fits well into the theme of the present section.
The operations on deques are summarized in Table 4.11. All of the functions are destructive. The functions in Table 4.11 differ in their return values. Some return the item or items in question, while others return the modified collection.
The two most iconic operations on deques are push! and pop! for inserting an item or items at the end of a collection and for removing the last item, respectively. These two functions operate on the end of the collection, usually a vector, because the end is where a vector can be modified most easily. If the functions were to operate on the beginning of the vector, then the vector would have to be copied every time.

Table 4.11 Operations on deques (double-ended queues).

Function                       Description
push!(collection, items...)    insert the items at the end of collection
pop!(collection)               remove the last item of collection and return it
insert!(vector, index, item)   insert item into vector at index
deleteat!(vector, index)       remove the item at index and
                               return the modified vector
deleteat!(vector, indices)     remove the items at the indices and
                               return the modified vector
splice!(vector, index[, new])  remove the item at index and return it;
                               if specified, the new value is inserted instead
splice!(vector, range[, new])  remove the items at the index range and return them;
                               if specified, the new values are inserted instead
append!(coll1, coll2)          append the elements of coll2 at the end of coll1
prepend!(vector, items)        insert the items at the beginning of vector
resize!(vector, n)             resize vector to contain n elements

julia> v = []; push!(v, 1, 2, 3)
3-element Vector{Any}:
 1
 2
 3
julia> pop!(v)
3
julia> v
2-element Vector{Any}:
 1
 2

The two functions push! and pop! are inverses of one another.

julia> push!(v, pop!(v))
2-element Vector{Any}:
 1
 2
julia> pop!(push!(v, 3))
3
julia> v
2-element Vector{Any}:
 1
 2

While both push! and append! add elements to the end of a collection, push! takes a variable number of arguments, while the second argument to append! is already a collection. These two function calls have the same effect.

julia> push!([], 1, 2, 3) == append!([], [1, 2, 3])
true

The splice! method acts on a range of indices and can be used to insert new elements without removing any elements by specifying an empty range.

julia> v = [1, 2, 3]; splice!(v, 2:1, [20, 30])
Int64[]
julia> v
5-element Vector{Int64}:
 1
 20
 30
 2
 3

The range 2:1 is indeed empty.

julia> length(2:1)
0
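The remaining deque operations from Table 4.11 can be illustrated in a few lines:

```julia
v = [1, 2, 3]
insert!(v, 2, 10)    # v == [1, 10, 2, 3]
deleteat!(v, 3)      # v == [1, 10, 3]
prepend!(v, [8, 9])  # v == [8, 9, 1, 10, 3]
resize!(v, 3)        # shrinking keeps only the first three elements
println(v)           # [8, 9, 1]
```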

Problems

4.1 Write your own REPL.

4.2 Write a function that calculates the natural logarithm using its Taylor series. Compare the speed and memory allocation of five versions using reduce, sum, mapreduce, array comprehensions, and generators.

4.3 Why does my_factorial(-1) return 1?

4.4 Compare the speed and memory allocation of three versions of the my_factorial function: (a) using a range, (b) using an array comprehension, and (c) using a generator.

4.5 Write iterative versions of my_any and my_all. Compare their speed and memory allocation with the versions based on reduce and with the built-in functions any and all.
Chapter 5
User Defined Data Structures and the Type System

It is better to have 100 functions operate on one data structure
than to have 10 functions operate on 10 data structures.
—Alan Jay Perlis

Abstract Data types defined by the programmer are indistinguishable from built-in types in Julia. In this chapter, it is first explained how variables and return values can be annotated with types and how Julia's introspective features can be used to inspect the type system. Types that can be defined by the programmer include abstract types, concrete types, composite types, type unions, and tuples. Custom constructors and pretty printers can be defined as well. Furthermore, abstract and composite types can be parameterized. Finally, the operations on types are summarized.

5.1 Introduction

All values in digital, binary computers are stored as vectors of two values, zeros
and ones. In order to make memory more useful and accessible to programmers,
these vectors or strings of zeros and ones are interpreted as more meaningful
objects such as signed and unsigned integers, rational numbers, floating-point
numbers as approximations of real numbers, characters, dictionaries, user de-
fined data structures, and many others. The information that enables the inter-
pretation of a vector of zeros and ones as a more meaningful object is its type.
In other words, both a vector of zeros and ones and the type information are
necessary to interpret a part of memory usefully and correctly.
In computer science, there are traditionally two approaches to implement
type systems, namely static type systems and dynamic type systems. In static type
systems, each variable and expression in the program has a type that is known or
computable before the execution of the program. In dynamic type systems, the


types of variables and expressions are unknown until run time, when the values
stored in variables are available. This means that in a static type system, types
are associated with variables; variables always contain values of the same type,
and the type is known before the execution of the program. In a dynamic type
system, types are associated with values; variables may contain values of differ-
ent types, and the type is only known during the execution of the program and
may change.
The ability of programs or expressions to operate on different types is called
polymorphism. By definition, all programs written in a dynamically typed lan-
guage are polymorphic; restrictions to the types of values only occur when a type
is checked during run time or when an operation is not available for a certain
value at run time.
Dynamic and static typing both have their advantages and disadvantages.
While Julia’s type system is dynamic, it is also possible to specify the types of
certain variables like in a static type system. This helps generate efficient code
and allows method dispatch on the types of function arguments. In this manner,
Julia gains some properties and advantages of static type systems.
Since Julia’s type system is dynamic, variables in Julia may contain values
of any type by default. When desired, it is possible to add type annotations to
variables and expressions. These type annotations serve a few purposes. They
enable multiple dispatch on function arguments, they serve as documentation
and can hence make a program more readable and clarify its purpose, and they
can serve as safety measures and catch programmer errors.
The type system is an important part of any programming language. Some
important properties of Julia’s powerful and expressive type system are the fol-
lowing.
• As usual in a dynamic type system, types are associated with values, and
never with variables.
• Some programming languages, especially object oriented ones, discern be-
tween object (composite) and non-object (numbers etc.) values. This is not
the case in Julia, where all values are objects and each type is a first-class
type.
• It is possible to parameterize both abstract and concrete types, and type pa-
rameters are optional when they are not required or not restricted.
In this chapter, the basics of Julia’s type system needed to define own type
hierarchies are summarized, and some finer points are discussed as well.

5.2 Type Annotations

The double-colon operator :: underlies all type annotations. Type annotations can be attached to any variable and expression. As mentioned above, a type annotation conveys type information to the compiler, usually to help it generate faster code, or to programmers, usually to help understand the program and document it. Type annotations also make method dispatch possible. Type annotations may appear in various syntactic locations with slightly different meanings, which are discussed in detail in the following.

5.2.1 Annotations of Expressions

If a type annotates an expression in the form expr::type, the meaning of the :: operator is that it asserts that the value of the expression must be of the indicated type. If the assertion is true, the value is returned; otherwise an exception is thrown.

julia> true::Int
ERROR: TypeError: in typeassert, expected Int64, got a value of type Bool
julia> true::Bool
true

5.2.2 Declarations of Variables and Return Values

If the :: operator appears in a local variable declaration in the form variable::type, it declares the variable to always have the indicated type. This is the behavior of a statically typed language. Recall from Chap. 3 that such local variable definitions do not require the keyword local to be written out. Any value assigned to such an annotated variable is converted to the indicated type using the function convert, which raises an error if the conversion is not possible.

function foo()
    local x::Float64 = 0
    x
end

julia> typeof(foo())
Float64

Similarly, it is possible to declare the type of the return value of a function (see Chap. 2). The annotation of a return value (returned by return or being the last value in the function body) is treated just as the annotation of a local variable as discussed above, and hence the assignment is performed using convert. As an example, the following function will always raise an error at run time.

function foo()::Int
    "This conversion raises an error."
end

5.3 Abstract Types, Concrete Types, and the Type Hierarchy

Abstract types are defined as types that cannot be instantiated, while concrete types can be. In the type graph, the children or subtypes of abstract types are abstract or concrete types, while concrete types cannot have children or subtypes in the type graph. Abstract types are defined via abstract type Name end or abstract type Name <: Supertype end. The first syntax is equivalent to abstract type Name <: Any end, where the type Any is at the top of the type graph or hierarchy. The names of types are usually capitalized, and the names of abstract types usually start with Abstract.
While the type Any is at the top of the type hierarchy or graph, the Union{} type is at the bottom. While all objects are instances of Any and all types are subtypes of Any, no object is an instance of Union{} and all types are supertypes of Union{}.
The whole type graph or hierarchy can be probed easily using the <: operator and the functions subtypes, supertype, and supertypes. The expression type1 <: type2 returns true if type1 is below type2 in the type hierarchy; it does not have to be a child.

julia> Int8 <: Any
true

The function subtypes returns a vector with all types that are directly below the given type in the hierarchy, i.e., with all children of the given type.

julia> subtypes(Integer)
3-element Vector{Any}:
 Bool
 Signed
 Unsigned

The function supertype returns the parent of the given type, and the function supertypes returns a tuple with all types above the given type in the hierarchy, starting with the type itself and always ending with Any.

julia> supertype(Int8)
Signed
julia> supertypes(Int8)
(Int8, Signed, Integer, Real, Number, Any)

In the next example, we locate real and complex numbers in the type hierarchy.

julia> supertypes(Real)
(Real, Number, Any)
julia> supertypes(typeof(1 + 2im))
(Complex{Int64}, Number, Any)

Irrational numbers are also part of Julia's numerical tower, as we can see here in the example of Euler's number.

ɔʼɜɃǤљ "Ǥʧȕѐ Ǥʲȹ&ɴɪʧʲǤɪʲʧѐȕ


ȕ ќ ЗѐМЖНЗНЖНЗНЙКОЕѐѐѐ
ɔʼɜɃǤљ ʲ˩ʙȕɴȯФ"Ǥʧȕѐ Ǥʲȹ&ɴɪʧʲǤɪʲʧѐȕХ
bʝʝǤʲɃɴɪǤɜШђȕЩ
ɔʼɜɃǤљ ʧʼʙȕʝʲ˩ʙȕʧФʲ˩ʙȕɴȯФ"Ǥʧȕѐ Ǥʲȹ&ɴɪʧʲǤɪʲʧѐȕХХ
ФbʝʝǤʲɃɴɪǤɜШђȕЩя ȂʧʲʝǤȆʲbʝʝǤʲɃɴɪǤɜя ¼ȕǤɜя ‰ʼɦȂȕʝя ɪ˩Х

As we have seen in these examples, the built-in functions above suffice to


obtain the full type graph consisting of both built-in and user defined types (see
Problems Problem 5.1 and Problem 5.2).
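As a minimal sketch of how such a type graph can be obtained (one possible approach to these problems; the function name print_type_tree is our own), the following walks the hierarchy recursively. Note that outside the REPL, subtypes must be imported from InteractiveUtils.

```julia
using InteractiveUtils  # provides subtypes outside the REPL

# Recursively print the type tree rooted at the given type;
# subtypes(T) is empty for concrete types, which terminates the recursion.
function print_type_tree(T::Type, depth::Int = 0)
    println("  "^depth, T)
    for S in subtypes(T)
        print_type_tree(S, depth + 1)
    end
end

print_type_tree(Integer)
```

Called on Integer, this prints Integer, then Bool, Signed (with Int8, Int16, and so on), and Unsigned, each indented by its depth in the hierarchy.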

5.4 Composite Types

Composite types are the most common types defined by users. In other languages, composite types are also called structs, records, or objects. They consist of named fields of arbitrary types, and therefore usually serve to collect quite distinct objects or values into ensembles, in contrast to vectors, which are usually used to store values of the same type.
A composite type is defined by the keyword struct followed by the field names, which may be annotated by their types using the usual :: syntax. If there is no annotation, it defaults to the Any type. Again, the names of composite types are capitalized in Julia by convention.
In the first example, there is no type annotation so that both fields can contain values of Any type. (The types in the examples are numbered, because redefining types is not allowed in Julia. Numbering the types saves us from restarting Julia to evaluate the examples.)

struct Foo1
    a
    b
end

These two field specifications are equivalent to a::Any and b::Any. In order to create an instance of the newly defined composite type Foo1, its name is used as a function; this constructor function was defined by the expression above. By default, a composite object is printed as its name followed by the values of its fields in parentheses.

julia> Foo1(1, 2)
Foo1(1, 2)
julia> typeof(ans)
Foo1
julia> methods(Foo1)
# 1 method for type constructor:
[1] Foo1(a, b) in Main at REPL[1]:2

The methods call reveals that behind the scenes our type definition defined a
generic function of the same name. Its method takes the correct number of field
values as input.
The value of a field is accessed by a dot . followed by the field name.
julia> foo1 = Foo1(1, 2)
Foo1(1, 2)
julia> foo1.a
1
julia> foo1.b
2
julia> Foo1(1, 2).a
1

The fields of a composite object can also be accessed using the function
getfield, and all field names of a composite type can be obtained by the func-
tion fieldnames as symbols. This provides a way to iterate over all fields of a
composite type, as shown in the next example.
julia> fieldnames(Foo1)
(:a, :b)
julia> for name in fieldnames(Foo1)
           @show name, getfield(foo1, name)
       end
(name, getfield(foo1, name)) = (:a, 1)
(name, getfield(foo1, name)) = (:b, 2)

Composite types (i.e., structs) are defined to be immutable by default, i.e.,
their field values cannot be changed after they have been constructed. Advan-
tages of immutable types include better efficiency and easier reasoning about
the code. However, when required, for example when copying of objects is to
be avoided, composite types can also be defined to be mutable by just writing
mutable before struct.

mutable struct Foo2
    a::Int
    b::Float64
end

Then field values can be assigned using the equal sign = and the accessor on the
left-hand side.
julia> foo2 = Foo2(1, 2)
Foo2(1, 2.0)
julia> foo2.a = 3
3
julia> foo2.a
3

In our examples so far, the initial field values have matched the field types
specified during struct definition. What happens if they do not match, though?
Julia tries to convert the given values to the types specified in the struct def-
inition whenever possible; if it is not possible, an error is raised. An example of
such a conversion can be seen above: the Int value 2 was converted to a Float64
value according to the field specification b::Float64. The error raised when the
conversion is not possible is shown in the following example.
julia> Foo2(1, "2")
ERROR: MethodError: Cannot `convert` an object of type String to an
object of type Float64

The same error is raised when trying to assign a value that cannot be converted
to the type indicated in the struct definition.
julia> Foo2(1, 2).b = "2"
ERROR: MethodError: Cannot `convert` an object of type String to an
object of type Float64

When an instance of a composite type is constructed using the standard con-
structor, the field values must usually be supplied. However, as an exception it is
also possible to construct instances with uninitialized fields as elements of vec-
tors. We consider two leading examples. In the first one, one of the fields has
type Any.
struct Foo3
    a
    b::Int
end

In this case, constructing a Vector consisting of elements of type Foo3 yields a
vector consisting of undefined elements.
julia> Vector{Foo3}(undef, 3)
3-element Vector{Foo3}:
 #undef
 #undef
 #undef

Trying to access any of these undefined elements yields an UndefRefError,
saying that an undefined reference was accessed.
The situation is different with numerical fields. We first define a new compos-
ite type.
struct Foo4
    a::Int8
    b::Int64
    c::Float64
    d::ComplexF64
end

In this case, the vector with undefined elements consists of elements whose
fields contain random numbers. The numbers are random in the sense that they
contain whichever bits were present at their memory location when they were
allocated.
julia> Vector{Foo4}(undef, 3)
3-element Vector{Foo4}:
 Foo4(1, 2, 1.5e-323, 8.0e-323 + 2.0e-323im)
 Foo4(5, 6, 3.5e-323, 4.4e-323 + 5.0e-323im)
 Foo4(12, 13, 7.0e-323, 7.4e-323 + 5.4e-323im)

Constructing a vector of undefined elements in this manner is useful when a
large vector is required; of course, care must be taken to never use the values of
the uninitialized fields.

5.5 Constructors

Constructors are, in general, functions that create new objects. We have already
seen in the previous section that defining a new composite type automatically de-
fines a standard constructor for this type. The standard constructor is a method
for the generic function with the same name as the type, taking the initial values
for the fields as arguments. Sometimes, however, it is desirable to define custom
constructors, for example, to create complex objects in a consistent state, to en-
force invariants, or to construct self-referential objects.
There are two types of constructors: inner and outer ones. Outer constructors
are just additional methods to the generic function of the same name as the com-
posite type. They usually provide convenience such as constructing objects with
default values.
Here we consider the example of (real-valued) intervals. (Again, the types
have numbers in their names since Julia forbids redefining types and we want
to avoid restarting Julia for each example.)
struct Interval1
    left::Float64
    right::Float64
end

The following method for the generic function Interval1 is an outer constructor.
Its only purpose is to define a default interval.
function Interval1()
    Interval1(0, 1)
end

Outer constructors bear their name because they are defined outside the
scope of the struct definition. Inner constructors are defined inside the struct

definition and have an additional capability, namely that they can call a func-
tion called new that creates objects of the composite type being defined. After a
composite type has been defined, it is not possible to add any inner constructors.
Also, if an inner constructor is defined, no default constructor is defined.
In the following example, the only constructor checks whether a valid interval
is being constructed.
struct Interval2
    left::Float64
    right::Float64

    function Interval2(l::Real, r::Real)::Interval2
        @assert l <= r "left endpoint must be less than
                        or equal to right endpoint"
        new(l, r)
    end
end

In the first call below, the endpoints are valid; in the second one, they are not.
julia> Interval2(0, 1)
Interval2(0.0, 1.0)
julia> Interval2(1, 0)
ERROR: AssertionError: left endpoint must be less than or equal to
right endpoint

A finer point of the new function defined in inner constructors is that it can be
called with fewer arguments than the number of fields the type has. This feature
makes the creation of instances of self-referential types possible. Although this
may sound like a situation that is seldom encountered, we do not have to look
far for such a data structure; a prime example is lists. In Lisp, they consist of a
data structure called a cons (which is short for construct). For example, the Lisp
expression
(cons 1 (cons 2 (cons 3 nil)))

evaluates to the list (1 2 3). Cons cells consist of two fields, which may contain
arbitrary values. When cons cells are used to build a list, the first field (tradition-
ally called car in Lisp, which is short for "contents of the address part of register
number" on the IBM 704 computer) holds a value and the second (traditionally
called cdr in Lisp, which is short for "contents of the decrement part of register
number" on the IBM 704 computer) holds another cons cell or nil. Because of
their structure, cons cells and hence lists are usually traversed recursively.
We start with a first try to define a cons cell in Julia.
struct Cons1
    car::Any
    cdr::Cons1
end

While this self-referential type definition can be evaluated without any error, we
encounter a problem when trying to create an instance. Just saying Cons1() or
Cons1(1, Cons1()) does not work, since all fields must be initialized and no
instance of this type exists yet. The problem is that it is not possible to create an
instance, because the second field must contain an instance of the same type.
The solution is to define an inner constructor that calls new with only one
argument that initializes the car field (in addition to a second, standard con-
structor).
mutable struct Cons
    car::Any
    cdr::Cons

    function Cons(car::Any)::Cons
        new(car)
    end

    function Cons(car::Any, cdr::Cons)::Cons
        new(car, cdr)
    end
end

An undefined second field cdr is printed as #undef, and accessing it yields an
error.
julia> Cons(1)
Cons(1, #undef)
julia> Cons(1).cdr
ERROR: UndefRefError: access to undefined reference

With this definition, we can mimic the list above using our Cons data type.
julia> Cons(1, Cons(2, Cons(3)))
Cons(1, Cons(2, Cons(3, #undef)))

Additionally, circular data structures can be defined. Julia's printer is smart
enough to detect them.
julia> circular = Cons(1)
Cons(1, #undef)
julia> circular.cdr = circular
Cons(1, Cons(#= circular reference @-1 =#))

The last function in this example shows how functions can operate safely on the
Cons composite type, namely by using isdefined.

function length(cons::Cons)::Int
    if isdefined(cons, :cdr)
        1 + length(cons.cdr)
    else
        1
    end
end

This recursive definition of the length returns the correct value when the Conses
are interpreted as a list.
julia> length(Cons(1))
1
julia> length(Cons(1, Cons(2)))
2
julia> length(Cons(1, Cons(2, Cons(3))))
3
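Other recursive operations on Cons lists follow the same pattern. As a small sketch, a helper (the name tovector is ours, not from Base) can collect the car fields into a vector; the Cons definition from above is restated here so that the example is self-contained.

```julia
# Same Cons type as defined above.
mutable struct Cons
    car::Any
    cdr::Cons

    function Cons(car::Any)::Cons
        new(car)
    end

    function Cons(car::Any, cdr::Cons)::Cons
        new(car, cdr)
    end
end

# Collect the car fields of a Cons list into a Vector, using isdefined
# to detect the end of the list (an undefined cdr field).
function tovector(cons::Cons)::Vector
    if isdefined(cons, :cdr)
        [cons.car; tovector(cons.cdr)]
    else
        [cons.car]
    end
end
```

For the list from above, tovector(Cons(1, Cons(2, Cons(3)))) returns [1, 2, 3].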

5.6 Type Unions

A type union is an abstract type that consists of the union of all types given after
the Union keyword. A common example is a type that may take a value or not.
Such types are sometimes useful when passing arguments or as return values.
julia> MaybeInt = Union{Int, Nothing}
Union{Nothing, Int64}
julia> 0::MaybeInt, nothing::MaybeInt
(0, nothing)
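As a sketch of how such a union is used in practice (the function index_of is our own example, not part of Base), a linear search can return either an index or nothing:

```julia
# Return the index of the first occurrence of x in v, or nothing
# if x does not occur; the return type is the union of both cases.
function index_of(x, v::Vector)::Union{Int, Nothing}
    for (i, y) in enumerate(v)
        y == x && return i
    end
    nothing
end
```

For example, index_of(2, [5, 2, 9]) yields 2, while index_of(7, [5, 2, 9]) yields nothing.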

5.7 Parametric Types

Types in Julia can be specified in a more fine-grained manner by parameters.
Such types are called parametric types, and the parameters follow the name of
the type surrounded by curly brackets. A prime example is vectors: an object of
type Vector{Int} is a vector that can contain any Int, and a Vector{Bool} can
contain only Boolean values, for example. In the following, parametric compos-
ite types and parametric abstract types are discussed in detail.

5.7.1 Parametric Composite Types

In a parametric composite type, the type parameter is used to further specify
the types of the fields. For example, suppose we want to define an interval type
whose endpoints have the given type T.
struct Interval3{T}
    left::T
    right::T
end

The type Interval3 has type UnionAll. The type UnionAll represents the union
of all types over all values of the type parameter.
julia> typeof(Interval3)
UnionAll

The type of the parametric type Interval3{BigInt} is DataType, as expected.
julia> typeof(Interval3{BigInt})
DataType

Concrete parametric composite types are subtypes of the underlying compos-
ite type (of type UnionAll).
julia> Interval3{BigInt} <: Interval3
true
julia> Interval3{Rational} <: Interval3
true

On the other hand, a concrete parametric composite type is never a subtype of
another concrete parametric composite type, even if one type parameter is a sub-
type of the other.
julia> Rational <: Number
true
julia> Interval3{Rational} <: Interval3{Number}
false

This is due to the practical reason that composite types should be stored as ef-
ficiently in memory as possible. For example, while Interval3{Int64} can be
stored as two adjacent 64-bit values, this is not true for Interval3{Real}, which
entails the allocation of two Real objects.
The above fact has ramifications for the definition of methods. Suppose we
want to define a function that calculates the midpoint of an interval of numbers.
function midpoint1(i::Interval3{Number})
    (i.left + i.right) / 2
end

This method does not work as intended, as the following function call shows.
julia> midpoint1(Interval3(-1, 1))
ERROR: MethodError: no method matching midpoint1(::Interval3{Int64})

The reason is that Interval3{Int64} is not a subtype of Interval3{Number} as
explained above.
julia> Interval3{Int64} <: Interval3{Number}
false

There are three ways to define suitable methods, whose syntaxes differ slightly.
function midpoint2(i::Interval3{<:Number})
    (i.left + i.right) / 2
end

function midpoint3(i::Interval3{T} where T<:Number)
    (i.left + i.right) / 2
end

function midpoint4(i::Interval3{T}) where T<:Number
    (i.left + i.right) / 2
end
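That such signatures indeed match concrete parameter types can be checked directly; the following self-contained sketch repeats the struct and the first of the three methods and applies it.

```julia
struct Interval3{T}
    left::T
    right::T
end

# Interval3{<:Number} matches Interval3{Int64}, Interval3{Float64},
# and every other Interval3 whose parameter is a subtype of Number.
function midpoint2(i::Interval3{<:Number})
    (i.left + i.right) / 2
end
```

With this signature, midpoint2(Interval3(-1, 1)) returns 0.0 instead of raising a MethodError.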

To construct objects of a parametric composite type, two default constructors
can be used (unless other constructors have been defined, see Sect. 5.5). The first
option is to use the default constructor of the concrete parametric composite type
as in the following example.
julia> Interval3{BigInt}(-1, 1)
Interval3{BigInt}(-1, 1)
julia> typeof(ans)
Interval3{BigInt}
julia> typeof(Interval3{BigInt}(-1, 1).left)
BigInt

The second option is to use the default constructor of the underlying para-
metric composite type (of type UnionAll), which is Interval3 in this example,
as long as the implied value of the parameter type T is unambiguous. In these
two examples, the underlying type is unambiguous.
julia> typeof(Interval3(0, 1))
Interval3{Int64}
julia> typeof(Interval3(1//2, 2//3))
Interval3{Rational{Int64}}

In the following two examples, the underlying type is ambiguous and errors re-
sult.
julia> typeof(Interval3(0, 1.0))
ERROR: MethodError: no method matching Interval3(::Int64, ::Float64)
julia> typeof(Interval3(0, 1//2))
ERROR: MethodError: no method matching
Interval3(::Int64, ::Rational{Int64})

5.7.2 Parametric Abstract Types

Similarly to parametric composite types, parametric abstract types declare a fam-
ily of abstract types. We continue the interval example.
abstract type GeneralInterval{T} end

Just as in the case of parametric composite types, each concrete parametric ab-
stract type is a subtype of the underlying abstract type (of type UnionAll). Fur-
thermore, a concrete parametric abstract type is never a subtype of another con-
crete parametric abstract type, even if one type parameter is a subtype of the
other.
The notation GeneralInterval{<:Real} denotes the set of all types
GeneralInterval{T} where T is a subtype of Real, and analogously
GeneralInterval{>:Real} denotes the set of all types GeneralInterval{T}
where T is a supertype of Real. This is illustrated in the following examples.
julia> typeof(GeneralInterval)
UnionAll
julia> typeof(GeneralInterval{<:Real})
UnionAll
julia> typeof(GeneralInterval{>:Real})
UnionAll
julia> GeneralInterval{Int} <: GeneralInterval{<:Real}
true
julia> GeneralInterval{Real} <: GeneralInterval{>:Int}
true

The purpose of abstract types is to create type hierarchies over concrete types.
This is exactly how the parametric abstract type is used in the following example.
struct Interval{T} <: GeneralInterval{T}
    left::T
    right::T
end

struct UnitInterval{T} <: GeneralInterval{T}
    left::T
    right::T

    function UnitInterval{T}(left::T, right::T) where T
        @assert right - left == 1
        new(left, right)
    end
end

With these definitions, it is ensured that unit intervals always have length one.
julia> UnitInterval{Int}(0, 1)
UnitInterval{Int64}(0, 1)
julia> UnitInterval{Int}(0, 2)
ERROR: AssertionError: right - left == 1

By introducing the parametric abstract type GeneralInterval, two kinds of
inclusions hold. First, each parametric concrete type is a subtype of the corre-
sponding parametric abstract type because their parameter types are the same.
These inclusions hold due to our definitions of the concrete types as subtypes of
the abstract type.
julia> Interval{Int} <: GeneralInterval{Int}
true
julia> Interval{Float64} <: GeneralInterval{Float64}
true
julia> UnitInterval{Int} <: GeneralInterval{Int}
true
julia> UnitInterval{Float64} <: GeneralInterval{Float64}
true

Second, further inclusions can be realized using the notation <:type ex-
plained above. In the first example here, Interval{Int} is not a subtype of
GeneralInterval{Real} by the general rule above. However, in the second ex-
ample, <:Real makes it possible to denote such a set of types: Interval{Int}
is a subtype of GeneralInterval{<:Real} because Interval is by definition a
subtype of the abstract type GeneralInterval and because the parameter type
Int is a subtype of Real.

julia> Interval{Int} <: GeneralInterval{Real}
false
julia> Interval{Int} <: GeneralInterval{<:Real}
true
julia> UnitInterval{Int} <: GeneralInterval{Real}
false
julia> UnitInterval{Int} <: GeneralInterval{<:Real}
true

Another use of the notation <:type is to restrict the allowed types. In the fol-
lowing example, we define intervals of characters and of integers.
struct CharInterval{T<:AbstractChar} <: GeneralInterval{T}
    left::T
    right::T
end

struct IntInterval{T<:Integer} <: GeneralInterval{T}
    left::T
    right::T
end

The parameter types are indeed restricted as shown here.
julia> IntInterval{Int8}(0, 100)
IntInterval{Int8}(0, 100)
julia> IntInterval{Rational}(0, 100)
ERROR: TypeError: in IntInterval, in T, expected T<:Integer,
got Type{Rational}

5.8 Tuple Types

The purpose of tuples is to model the argument lists of functions. Tuple types
take multiple type parameters, each corresponding to the type of an argument
in order. Because of their purpose, they have the following special properties.
1. Tuple types may be parameterized by an arbitrary number of types.
2. Tuple types are only concrete if their type parameters are.
3. Tuple{𝑆1 , … , 𝑆𝑛 } is a subtype of Tuple{𝑇1 , … , 𝑇𝑛 } if each type 𝑆𝑖 is a subtype
of the corresponding type 𝑇𝑖 . This property is, of course, what is needed to
determine which methods of a generic function match an argument list.
4. In contrast to composite types, tuples do not have field names, and hence
their fields can only be accessed by their index. However, named tuples do
have field names.
Because tuples are used for passing arguments to functions and receiving re-
turn values from functions and hence are a frequently used type, the syntax to con-
struct tuple values is very simple. In addition to the default constructors such
as Tuple{Int}(1), tuple values can be written in parentheses with commas in
between, and an appropriate type is automatically constructed as well. It is im-
portant to note that a tuple with a single element is still written with a comma
at the end in order to make the syntax unambiguous as seen here.
julia> typeof((1))
Int64
julia> typeof((1,))
Tuple{Int64}
julia> typeof((1, 2.0, 3//1, 4im))
Tuple{Int64, Float64, Rational{Int64}, Complex{Int64}}
julia> ans <: Tuple{Int, Real, Real, Complex}
true

The last line illustrates the third property above.
We already know from Sect. 2.8 that functions can take a variable number of
arguments. This corresponds to the case when the last parameter of a tuple type
has the type Vararg, which represents an arbitrary number (including zero) of
trailing elements of the given type. The type Vararg also takes an optional second
argument, which indicates the number of elements.

julia> varargs = Tuple{Float64, Vararg{Int}}
Tuple{Float64, Vararg{Int64}}
julia> isa((1.0,), varargs)
true
julia> isa((1.0, 2), varargs)
true
julia> isa((1.0, 2, 3), varargs)
true
julia> isa((1.0, 2, 3, 4//1), varargs)
false

While tuples do not have field names, named tuples do. The NamedTuple type
takes two parameters, namely a tuple of symbols indicating the field names and a
tuple with the field types. The corresponding parameterized type is constructed
automatically when a named tuple is constructed.
julia> typeof((arg1 = 1, arg2 = 2.0, arg3 = 3//1))
NamedTuple{(:arg1, :arg2, :arg3), Tuple{Int64, Float64,
Rational{Int64}}}

The first argument of the constructor NamedTuple specifies the names, and the
second, optional one specifies the types. If the types are specified, the arguments
are converted using convert; otherwise, their types are inferred automatically.
Note that the values are specified as tuples as well.
julia> NamedTuple{(:a, :b)}((1, 1.0))
(a = 1, b = 1.0)
julia> typeof(ans)
NamedTuple{(:a, :b), Tuple{Int64, Float64}}
julia> NamedTuple{(:a, :b), Tuple{Int8, Float32}}((1, 1))
(a = 1, b = 1.0f0)
julia> typeof(ans)
NamedTuple{(:a, :b), Tuple{Int8, Float32}}

Another way to construct an instance of NamedTuple is provided by the macro
@NamedTuple, which yields a type. Names and types can be indicated, and omit-
ted types default to Any. In the first example, no types, only names, are specified.
julia> @NamedTuple{a, b}
NamedTuple{(:a, :b), Tuple{Any, Any}}
julia> typeof(ans)
DataType
julia> @NamedTuple{a, b}((1, 1.0))
NamedTuple{(:a, :b), Tuple{Any, Any}}((1, 1.0))

In the second example, the types are indicated as well, and a NamedTuple type
is returned by the macro accordingly. The arguments to the constructor are con-
verted to the indicated types when the object is created.
julia> @NamedTuple{a::Int8, b::Float32}
NamedTuple{(:a, :b), Tuple{Int8, Float32}}
julia> typeof(ans)
DataType
julia> @NamedTuple{a::Int8, b::Float32}((1, 1))
(a = 1, b = 1.0f0)

5.9 Pretty Printing

After defining a custom composite type, it is often useful to customize how
the objects are printed, if only to make the output more readable, i.e., to pretty-
print it. This can be accomplished by defining methods for the generic function
Base.show. We consider the example of intervals again.

struct Interval
    left_open::Bool
    left::Number
    right_open::Bool
    right::Number
end

The default Base.show method prints values such that the resulting string yields
a valid object again after parsing.
julia> Interval(false, 0, true, 1)
Interval(false, 0, true, 1)

This property is illustrated by the following piece of code.
let io = IOBuffer()
    Base.show(io, Interval(false, 0, true, 1))
    dump(Meta.parse(String(take!(io))))
end

The generic function Base.show takes the output stream as its first argument
(usually called io::IO) and the object to be printed as its second. Note that it
is necessary to mention the module name Base to add a method to the correct
generic function.
function Base.show(io::IO, i::Interval)
    print(io,
          i.left_open ? "(" : "[",
          i.left, ", ", i.right,
          i.right_open ? ")" : "]")
end

With this method, a standard mathematical notation is achieved.
julia> Interval(false, 0, true, 1)
[0, 1)
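Beyond the two-argument method, the display in the REPL can be customized separately via the three-argument Base.show method for the MIME type text/plain. The following self-contained sketch repeats the definitions of this section and adds such a method; the "Interval: " prefix is merely our illustration, not a convention.

```julia
struct Interval
    left_open::Bool
    left::Number
    right_open::Bool
    right::Number
end

# Compact form, used e.g. when an Interval is printed inside a container.
function Base.show(io::IO, i::Interval)
    print(io,
          i.left_open ? "(" : "[",
          i.left, ", ", i.right,
          i.right_open ? ")" : "]")
end

# Verbose form, used by the REPL for standalone display.
function Base.show(io::IO, ::MIME"text/plain", i::Interval)
    print(io, "Interval: ")
    show(io, i)
end
```

Here sprint(show, Interval(false, 0, true, 1)) still yields "[0, 1)", while the REPL displays the value as Interval: [0, 1).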

5.10 Operations on Types

Abstract, composite, and a few other types are instances of the type DataType, as
seen in this example.
julia> (typeof(Int), typeof(Any))
(DataType, DataType)

As we have seen already, ordinary functions can operate on types, since they are
objects of type DataType themselves. In this section, the operations on types are
briefly summarized.
The subtype operator <: determines whether the type on its left is a subtype
of the type on its right. The function isa determines whether its first argument
is an object of the second argument, a type. The function typeof returns the type
of its argument.
julia> typeof(DataType)
DataType
The function supertype returns the supertype of its argument, and supertypes
returns all supertypes of its argument. Finally, the function subtypes returns all
subtypes.
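The following brief sketch exercises these operations on the built-in numerical types; note that subtypes lives in the InteractiveUtils standard library, which the REPL loads automatically.

```julia
using InteractiveUtils  # provides subtypes (preloaded in the REPL)

@assert Int64 <: Real                 # subtype operator
@assert isa(1, Integer)               # is the value an instance of the type?
@assert typeof(1.0) == Float64        # type of a value
@assert supertype(Int64) == Signed    # immediate supertype
@assert Signed in supertypes(Int64)   # chain up to Any, Int64 included
@assert Bool in subtypes(Integer)     # immediate subtypes
```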

5.11 Bibliographical Remarks

A comprehensive introduction to type systems and to the basic theory of pro-
gramming languages is [2]. An excellent review of the IEEE floating-point stan-
dard can be found in [1].

Problems

5.1 (Numerical tower) Use the functions in Sect. 5.3 to obtain the numerical
tower, i.e., the whole hierarchy below the type Number. Draw the numerical tower
(by hand, but see Problem 5.2). How does the numerical tower correspond to the
sets ℕ, ℤ, ℚ, ℝ, and ℂ?
5.2 (Visualize type graph) * Write a function that obtains the full type graph.
Then write a function that writes an input file for graph visualization software
such as Graphviz to plot the type graph.

5.3 (Type hierarchy for intervals) Define a type hierarchy for intervals below
an abstract type called GeneralInterval. The types in the type hierarchy should
provide for intervals with finite and infinite numbers of elements. Define inter-
vals for Unicode characters, for integers, for rational numbers, and for floating-
point numbers.

5.4 (Interval arithmetic) Define generic functions for the addition, subtrac-
tion, multiplication, and division of numeric intervals. The functions should
take numeric intervals of the same type as their inputs and return a newly con-
structed interval of the same type (and a Boolean value to indicate whether a
division by zero occurred in the case of division).

5.5 (Bisection) Bisection is a straightforward algorithm for calculating a root of
a continuous function on a given real-valued interval. Starting with an interval
on whose endpoints the function to bisect has opposite signs, the interval is split
at its midpoint. Depending on the sign of the function at the midpoint, bisection
continues recursively.
1. Implement bisection based on the data type for floating-point intervals in
Problem 5.3.
2. Find a suitable stopping criterion that works with floating-point numbers.
3. As an example, apply your bisection implementation to finding the root of
sin in the interval [3, 4].

References

1. Goldberg, D.: What every computer scientist should know about floating-point arithmetic.
ACM Computing Surveys 23(1), 5–48 (1991)
2. Pierce, B.: Types and Programming Languages. MIT Press (2002)
Chapter 6
Control Flow

Wenn das Wörtchen "wenn" nicht wär,
wär der Bettelmann Kaiser.
(If it weren't for the little word "if," the beggar would be an emperor.)
—Proverb

Abstract The control flow in a function or program is not just linear, but is de-
termined by branches, loops, and non-local transfer of control. The standard
control-flow mechanisms usually found in high-level programming languages
such as compound expressions, conditional evaluation, short-circuit evaluation,
repeated evaluation, and exception handling are available in Julia and are dis-
cussed in detail in this chapter. Additionally, tasks are a powerful mechanism
for non-local transfer of control and make it possible to switch between computa-
tions. Parallel and distributed computing is discussed in detail as well, presenting
various techniques for distributing computations efficiently and conveniently.

6.1 Compound Expressions

Similar to progn in Common Lisp, begin blocks and semicolon chains evaluate
the constituent expressions in order and return the value of the last expression.
A begin block begins with the keyword begin and ends with the keyword
end. Instead of these keywords, parentheses can also be used to the same effect,
resulting in a semicolon chain. The expressions in a begin block are separated
by newlines, by semicolons, or by both. The expressions in a semicolon chain
are separated by semicolons ;.
The following examples show the various cases that can occur. Global vari-
ables a and b are defined in each example, i.e., begin blocks do not introduce a
new scope.

© Springer Nature Switzerland AG 2022 99


C. Heitzinger, Algorithms with JULIA,
https://doi.org/10.1007/978-3-031-16560-3_6

julia> c = begin
           global a = 3
           global b = 2
           a^a // b^b
       end
27//4
julia> c = begin
           global a = 3; global b = 2;
           a^a // b^b;
       end
27//4
julia> c = begin global a = 3; global b = 2; a^a // b^b end
27//4
julia> c = (global a = 3; global b = 2; a^a // b^b)
27//4

Empty expressions such as begin end and (;) return nothing, which is of
type Nothing and which is not printed by the REPL.
julia> begin end
julia> ans == nothing
true
julia> typeof(begin end)
Nothing

6.2 Conditional Evaluation

Conditional evaluation means that expressions are evaluated or not depending
on the value of a Boolean expression. The syntax of if expressions is
if condition0
    expressions0
elseif condition1
    expressions1
else
    expressions
end.

All of the elseif clauses as well as the else clause are optional. The elseif
clauses elseif condition𝑖 expressions𝑖 can be repeated arbitrarily often.
If the Boolean expression condition0 in the if clause is true, then the corre-
sponding expressions expressions0 are evaluated; if it is false, then the condition
condition1 in the first elseif clause is evaluated. Again, if it is true, then the
corresponding expressions expressions1 are evaluated; if it is false, then the next
elseif clause is considered, etc. If none of the conditions is true, then the ex-
pressions expressions after the else clause are evaluated.
In other words, the expressions following the first true condition are evalu-
ated, and the rest of the conditions are not considered anymore. If none of the
conditions is true, the expressions in the else clause are evaluated if present.
function my_sign(x)
    if x < 0
        typeof(x)(-1)
    elseif x == 0
        typeof(x)(0)
    else
        typeof(x)(1)
    end
end

julia> typeof(my_sign(Int8(2)))
Int8

if blocks do not introduce a new scope, implying that local variables that are
defined or changed within an if block remain visible after the if block.
if expressions return a value after having been evaluated, namely the value
of the last expression evaluated.
julia> foo = if 1 > 0 "yes" else "no" end
"yes"

There is an additional syntax, the so-called ternary operator, for if else end
blocks whose expressions are single expressions. The ternary operator
condition0 ? expression0 : expression
is equivalent to
if condition0
    expression0
else
    expression
end.

It is used to squeeze if expressions into a single line.

function compare(a, b)
    string(a) * " is " *
        (a < b ? "less than " : "greater than or equal to ") *
        string(b)
end

julia> compare(0, 1)
"0 is less than 1"

Ternary operators can be chained, and to facilitate this, the ternary operator
associates from right to left.
function compare(a, b)
    string(a) * " is " *
        (a < b ? "less than " :
         a == b ? "equal to " : "greater than ") *
        string(b)
end

julia> compare(0, -1), compare(0, 0), compare(0, 1)
("0 is greater than -1", "0 is equal to 0", "0 is less than 1")

6.3 Short-Circuit Evaluation

The Boolean operators && and || implement the logical "and" and "or" opera-
tions. However, not all of their arguments are generally evaluated; only the min-
imum number of arguments necessary to determine the value of the whole ex-
pression is evaluated from the left to the right. This means that the evaluation is
short-circuited if the value of the whole expression can be known in advance.
For example, in the expression a && b, the second argument b is evaluated
only if a evaluates to true, since otherwise – if a is false – it is already obvious
that the whole expression must be false after evaluating a. Analogously, in the
expression a || b, the second argument b is evaluated only if a evaluates to
false.
The && operator has higher precedence than the || operator, which some-
times makes it possible to leave out parentheses. However, it is preferable in most
cases to write out the parentheses in order to make the intent of the program im-
mediately clear.
Short-circuit evaluation can also be used as an alternative short form for cer-
tain short if expressions. For example, when checking the arguments of a func-
tion and acting accordingly, checks may only occupy one line. In this example,
the two functions are equivalent. (In general, using the @assert macro to check
arguments is preferable.)
arguments is preferable.)
function fib1(n::Int)
    n >= 0 || error("n must be non-negative")
    0 <= n <= 1 && return n
    fib1(n-1) + fib1(n-2)
end

function fib2(n::Int)
    if n < 0 error("n must be non-negative") end
    if 0 <= n <= 1 return n end
    fib2(n-1) + fib2(n-2)
end

Whether saving a few characters is worth the terser and slightly obscured appear-
ance of the first version lies in the eye of the beholder.
The condition expressions in if expressions, in the ternary operator, and the
operands of the && and || operators must be Bool values. The only exception are
the last arguments in && and || chains, whose values may be returned.
julia> 0 || true
ERROR: TypeError: non-boolean (Int64) used in boolean context
julia> false || 0
0

Note that in contrast to the && and || operators, the functions & and | are
just generic functions without short-circuit behavior; they are only special in the
sense that they support infix syntax.
julia> true & true
true
julia> (&)(true, true)
true
julia> true | true
true
julia> |(true, true)
true

6.4 Repeated Evaluation

While branching using if expressions achieves nonlinear control flow, it is possible to evaluate expressions repeatedly using two constructs: while loops and for loops.
We discuss while loops first, as they are more general. The syntax is

while condition
    expressions
end,

where the condition must be a Boolean expression. While the condition evaluates to true, the expressions in the body of the while loop are evaluated. In contrast to for loops, the programmer is responsible for defining and updating an iteration variable if one is needed.

global i = 1
while i <= 3
    global i
    @show i
    i += 1
end

This example prints three lines as expected. Note that the global declaration is needed here in order to change the global variable i. An equivalent version of this loop is the following.

global j = 0
while j <= 2
    global j += 1
    @show j
end

Some programming languages contain do until statements. They are equivalent to while loops as the following example shows.

global k = 0
while true
    global k += 1
    @show k
    if k >= 3
        break
    end
end
When the number of iterations is known in advance or one needs to iterate over an iterable data structure, it is usually more convenient to use a for loop. The syntax of a simple for loop with a single iteration variable is

for i in iterable
    expressions
end.

The keyword in can be replaced by =. Here i is the iteration variable, iterable is the iterable data structure, and the expressions are the body to be evaluated.
This for loop is equivalent to the following while loop.

global next = Base.iterate(iterable) # e.g. iterable = 1:3
while next !== nothing
    local (iterate, state) = next
    # expressions
    global next = Base.iterate(iterable, state)
end

Both loops are linked via the generic function iterate, which we have to access as Base.iterate when defining additional methods. It must have two methods

for the type of iterable, namely one taking one argument and one taking two arguments.
We consider the example of iterating over the coordinates of a three-dimensional point and first define a data structure called Point (see Sect. 5.4).

struct Point
    x::Float64; y::Float64; z::Float64
end

The first method takes one argument (as in the call of iterate before the while loop above) and returns an iterate and a state.

function Base.iterate(p::Point)::Tuple
    (p.x, Symbol[:y, :z])
end

The state can be any object, but it should be defined in such a way that it is conducive for iterating by the second method. The second method takes the iterable data structure and a state as its two arguments and returns an iterate and a state as well.

function Base.iterate(p::Point, state::Vector)::Union{Nothing, Tuple}
    if isempty(state)
        nothing
    else
        (getfield(p, state[1]), state[2:end])
    end
end

Having defined the two iterate methods, we can use the while loop above to iterate over the coordinates of a Point.

global point = Point(1, 2, 3)
global next = Base.iterate(point)
while next !== nothing
    local (iterate, state) = next
    @show iterate, state
    global next = Base.iterate(point, state)
end

(iterate, state) = (1.0, [:y, :z])
(iterate, state) = (2.0, [:z])
(iterate, state) = (3.0, Symbol[])

Much more interestingly, however, we have extended the built-in for loop by defining these two iterate methods. This also means that iterable data structures are those for which iterate methods have been defined.

for coord in Point(1, 2, 3)
    @show coord
end

coord = 1.0
coord = 2.0
coord = 3.0

for loops can be nested, but nested loops can be written more succinctly using the syntax

for i1 in iterable1, i2 in iterable2
    expressions
end,

where an arbitrary number of iteration variables can be given. It is again possible to replace the keyword in by =.

for i in 1:2, j in 3:4
    @show i, j
end

(i, j) = (1, 3)
(i, j) = (1, 4)
(i, j) = (2, 3)
(i, j) = (2, 4)

As this example shows, the first iteration variable (i here) changes slowest and the last iteration variable (j here) changes fastest.
Another extension of the basic syntax of for loops is destructuring of the iteration variable. Destructuring of variables is an idea also found in Common Lisp and Clojure. In Julia, it means that if the iteration variable is a tuple, then its components are bound to the respective components of the elements of the iterable data structure.

for (a, b) in ((1, 2), (3, 4))
    @show a, b
end

In the first iteration, the two components a and b of the iteration variable (a, b), a tuple, are bound to the components of the first element (1, 2) of the iterable data structure. This loop hence yields the following output.

(a, b) = (1, 2)
(a, b) = (3, 4)

The data structure to be iterated over may be any iterable data structure. In the next example, it is a set.

for (a, b) in Set([(1, 2), (3, 4), (5, 6)])
    @show a, b
end

(a, b) = (1, 2)
(a, b) = (5, 6)
(a, b) = (3, 4)

It is only important that the elements of the iterable data structure are tuples that are compatible with the iteration variable. If the iteration variable is a tuple, it is compatible with the tuples in the iterable data structure if it has the same number of elements or fewer. If the tuple acting as the iteration variable is too long, an error is raised. If it is shorter than the data, then only the given elements are bound as shown in the following example. (Recall that the syntax for a tuple with a single element is (a,), which is necessary to distinguish it from (a), which is the same as just a.)

for (a,) in Set([(1,2), (3,4), (5,6)])
    @show a
end

a = 1
a = 5
a = 3

Tuples as iteration variables are especially convenient in conjunction with enumerate. The function enumerate takes an iterable data structure and yields iterates (i, x), a tuple, where i is a counter starting at one and x iterates over the given iterable data structure. This is useful in loops where the number of elements iterated over so far is needed.

import Primes
import Printf
for (i, p) in enumerate(Primes.PRIMES[1:5])
    Printf.@printf("prime number no. %1d: %2d\n", i, p)
end

Here i is a counter starting at 1 and p iterates over the collection which is the argument of enumerate.

prime number no. 1: 2
prime number no. 2: 3
prime number no. 3: 5
prime number no. 4: 7
prime number no. 5: 11

It is possible to stop while and for loops using the break keyword. This example returns the smallest prime number greater than or equal to 2020.

import Primes
global i = 2020
while true
    if Primes.isprime(i)
        break
    else
        global i += 1
    end
end
i

The continue keyword makes it possible to shortcut an iteration within a while or for loop and to proceed to the next iteration. This example prints the prime numbers between 2020 and 2050.

import Primes
for j in 2020:2050
    if mod(j, 2) == 0 || mod(j, 3) == 0
        continue
    elseif Primes.isprime(j)
        println(j)
    end
end

Finally, the classical go to statement is available as the @goto macro and used in conjunction with the @label macro. Although go to statements have generally fallen out of favor in modern programming style, they are very useful in certain cases. A prime example is the implementation of finite-state machines, where a state transition table and the @goto macro can be used to switch between the states. Examples of finite-state machines are parsers and regular expressions.

import Primes
function find_first_prime_after(n::Integer)::Integer
    @assert n >= 2

    @label start
    if Primes.isprime(n)
        return n
    else
        n += 1
        @goto start
    end
end

6.5 Exception Handling

The language constructs discussed so far result in local control flow; even condi-
tional and repeated evaluation cannot result in non-local transfer of control. By
considering the program text locally, it is clear which expression will be evalu-
ated next.
Throwing an exception is a non-local control flow. Exceptions are useful in
situations when an unexpected condition occurs and a function cannot com-
pute and return the value it is meant to return. In these cases, exceptions can be
thrown and caught. When the exception is caught, it is decided how to proceed
best, e.g., by terminating the program, printing an error message, or by taking a
corrective action such as retrying.

6.5.1 Built-in Exceptions and Defining Exceptions

The built-in exceptions are subtypes of the abstract type Exception and can be listed by @doc Exception. Using the type system, it is also possible to define custom exceptions as in this example.

struct MyAwesomeException <: Exception
end

Here <: indicates that the new type MyAwesomeException will be a subtype of Exception. The type system and how to define new types are explained in detail in Chap. 5.

6.5.2 Throwing and Catching Exceptions

Exceptions are thrown by the throw function. This example throws the fitting DomainError for negative arguments when Fibonacci numbers are calculated.

function fib(n::Integer)::BigInt
    if n < 0
        throw(DomainError("negative integer"))
    elseif n <= 1
        n
    else
        fib(n-2) + fib(n-1)
    end
end

The argument to throw must be an exception and not a type of exception. The function call DomainError(arg) yields an exception, while DomainError is the type.

julia> typeof(DomainError("foo")), typeof(DomainError)
(DomainError, DataType)
julia> typeof(DomainError("foo")) <: Exception
true

As usual, the built-in function <: tests whether the left-hand side is a subtype of the right-hand side.
Exceptions take arguments to describe the situation as in this example.

julia> throw(UndefVarError(:foo))
ERROR: UndefVarError: foo not defined

User-defined exceptions can also take arguments, which are the fields of the exception type (see Sect. 5.4). In the following example, we define an exception type called DivisionByZero with a field called numerator, which holds additional information.

struct DivisionByZero <: Exception
    numerator::Number
end

In order to provide an informative error message, we also define a method for the generic function Base.showerror, which is called by Julia to show an error. (As usual, you can find all methods defined for this generic function using methods(Base.showerror).)

function Base.showerror(io::IO, exc::DivisionByZero)
    print(io, exc.numerator, " cannot be divided by zero")
end

With these definitions, our new type of exception behaves as intended.

julia> throw(DivisionByZero(42))
ERROR: 42 cannot be divided by zero

At this point, we know how to throw exceptions. For non-local control flow, we must also be able to catch the exceptions. This facility is provided by the

try
    body
catch [exception]
    handler
finally
    cleanup
end

expression. It evaluates the expressions in body until an exception occurs. If an exception has occurred, the expressions in handler are evaluated, where the exception is bound to the optional variable named exception so that the exception can be inspected and handled. Finally, the expressions in cleanup are evaluated in any case.
The first, simple example is built around the log function. The built-in log function raises a DomainError when called with a negative real-valued argument and returns a complex number only when called with a Complex argument. In the example, we try the built-in version first; if it does not succeed and an exception is raised, it is retried with a Complex argument.

function my_log(x)
    try
        log(x)
    catch
        log(Complex(x, 0))
    end
end

julia> my_log(Base.MathConstants.e)
1
julia> my_log(-1)
0.0 + 3.141592653589793im

In the second example, the exception is bound to the variable exc for closer inspection.

function my_log(x)
    try
        log(x[2])
    catch exc
        if isa(exc, DomainError)
            log(Complex(x, 0))
        else
            @info "The next exception has been rethrown."
            rethrow(exc) # or just rethrow()
        end
    end
end

Whenever there is an exception that cannot or should not be handled in the handler after the catch keyword after all, it can be rethrown by the rethrow function. Within the handler expressions, it suffices to just call rethrow(); the exception is the default argument.

julia> my_log(-1)
[ Info: The next exception has been rethrown.
ERROR: BoundsError

The syntax of the try expression requires some care regarding the variable name after the catch keyword. It must be on the same line as the catch keyword. Since any symbol after the catch keyword is interpreted as the variable name for the exception unless it is written on a new line or separated by a semicolon, one must also be careful when the intention is to return the value of another variable and the whole expression is written on a single line. The following function returns its argument x if an exception is raised.

function my_log(x)
    try
        log(x)
    catch # no variable name for the exception
        x
    end
end

If the try expression is written on a single line, it must look like this. Note the semicolon; it ensures that x is not interpreted as the variable name for the exception, but as the return value of the handler expressions.

my_log(x) = try log(x) catch; x end

julia> my_log(-1)
-1

On the other hand, if there is no semicolon, the function is still syntactically correct. Then, however, x is interpreted as the variable name for the exception and the handler expressions are empty, resulting in nothing as the return value.

my_log(x) = try log(x) catch x end

julia> my_log(-1)
julia> my_log(-1) == nothing
true

The catch clause is optional in try expressions. If it is omitted, nothing is returned as usual for empty expressions.
The finally clause is always evaluated, even if an exception was raised in the body or in the handler expressions. If an exception is raised in the handler expressions, then it is re-raised after evaluating the cleanup expressions in the finally clause.

function my_log(x)
    try
        log(x)
    catch
        error("still something went wrong")
    finally
        println("cleaning up")
    end
end

julia> my_log(-1)
cleaning up
ERROR: still something went wrong

A prime example of the usefulness of the finally clause is ensuring that an operating-system resource such as a file descriptor or a pipe is freed or closed under all circumstances. In this example, the error is not caught, but the finally clause ensures that the stream is closed.

function unfortunate_reader(filename::String)
    local stream = open(filename)
    try
        error("something went wrong, e.g., while parsing")
    finally
        close(stream)
        println("stream open: ", isopen(stream))
    end
end

julia> unfortunate_reader("/etc/passwd")
stream open: false
ERROR: something went wrong, e.g., while parsing

In practice, however, the open function is often used in conjunction with a do block (see Sect. 2.9). According to the documentation of open, it always closes the file descriptor upon completion.
Finally, the functions for handling exceptions are summarized in Table 6.1. Assertions are discussed in Sect. 6.5.4 below.

Table 6.1 Exception handling.

Function                      Description
error(message)                raise an ErrorException with message
@assert condition message     throw an AssertionError if condition is false
throw(exception)              throw the exception
rethrow([exception])          rethrow exception from within a catch clause
retry                         retry a function repeatedly
backtrace                     return the current backtrace object
catch_backtrace               return the backtrace object of the current exception,
                              for use in a catch clause
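The retry entry in the table can be sketched as follows; the flaky function and the zero-second delays are our own illustrative choices, not from the text:

```julia
# Hypothetical flaky operation: fails on the first two calls, then succeeds.
attempts = Ref(0)
function flaky()
    attempts[] += 1
    attempts[] < 3 && error("transient failure")
    return :ok
end

# retry returns a new function that calls flaky again after a failure;
# zero-second delays keep the example fast (up to 5 attempts here).
reliable = retry(flaky; delays = fill(0.0, 4))
result = reliable()   # :ok after three attempts
```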

6.5.3 Messages, Warnings, and Errors

The general facility to log messages is the @logmsg macro. Often the four standard logging macros @debug, @info, @warn, and @error are used, which are based on @logmsg and log messages at the four standard levels Debug, Info, Warn, and Error. The first argument to these four macros should be an expression that evaluates to a string that describes the situation. The string is formatted as Markdown when printed. Further, optional arguments can be of the form key = value or value and are attached to the log message.

julia> @info "the answer is" answer = 42
┌ Info: the answer is
└   answer = 42
julia> @warn "something unexpected occurred" answer = log(Complex(-1, 0))
┌ Warning: something unexpected occurred
└   answer = 0.0 + 3.141592653589793im
julia> @error "cannot divide by zero" denominator = 0
┌ Error: cannot divide by zero
└   denominator = 0

On the other hand, the error function (and not the macro) raises an exception of type ErrorException.

julia> try error("foo") catch exc exc end
ErrorException("foo")

6.5.4 Assertions

Assertions are useful to ensure that certain conditions are always satisfied during the evaluation of a program. For example, an expression may be known to be invariant in a loop; these invariants may be conserved quantities such as energy or angular momentum in a physical simulation. Assertions are also a convenient way to check whether argument values are valid.
Assertions are written using the macro @assert, which takes a Bool expression as its first argument and an informative message as its optional second.

julia> @assert 1 < 0
ERROR: AssertionError: 1 < 0
julia> @assert 1 < 0 "something is really wrong"
ERROR: AssertionError: something is really wrong

In the following example, we know that ∑_{k=1}^∞ 1/k² = π²/6. Since all terms are positive, all partial sums must be less than π²/6, which is checked by an assertion. The argument value is also checked by an assertion.

function summation(n::Integer)
    @assert n >= 1

    local s = BigFloat(0)
    for k in 1:n
        s += 1/BigFloat(k)^2
        @assert s < pi^2/6
    end

    s
end

6.6 Tasks, Channels, and Events

Tasks (also known in computer science as symmetric coroutines or cooperative multitasking) are an advanced feature to control the flow of a program that makes it possible to start, suspend, and resume calculations. A Task can be viewed as a function call with the additional feature that it can be interrupted and that control is then transferred to another Task. The first Task can then be resumed where it was interrupted.
There are two main differences compared to usual function calls. First, in contrast to a function call, a Task switch does not use space on the call stack so that an arbitrary number of Task switches is possible without running out of stack space. Second, Tasks can be run in any order, again in contrast to function calls, which must be completed before control returns to the caller and another function can be called.
Tasks are very useful for certain problems without a clear caller-callee structure. A prime example is producer-consumer problems, where one function produces values and another one consumes them. In such problems, the consumer cannot simply call the producer, because calculating the values is a time-consuming task and the producer may not be ready to return a value. Using Tasks, the producer and the consumer can both run as long as required, and the values are passed between them when necessary.
Tasks typically communicate using Channels, which are first-in first-out queues. Multiple Tasks can put values into a Channel and take values from it. In the following example, we define a producer that put!s prime numbers into a channel.

import Primes
function my_producer(ch::Channel, n::Integer)
    @assert n >= 1
    for p in Primes.primes(1, n)
        put!(ch, p)
    end
end

The producer function must be scheduled to run in a new Task. The most convenient way to do so is to use the Channel constructor that takes a function as its argument and runs a Task associated with the new Channel. The function given as the argument to the constructor must take one argument, namely the Channel. In our example, a Channel and an associated Task can be created by chan = Channel(ch -> my_producer(ch, 9)), for example.
Values can be consumed from a Channel by the function take!; in our example, evaluating take!(chan) yields consecutive prime numbers. Values can also be conveniently consumed in for loops by iterating over the Channel as in the following example.

function my_consumer(n::Integer)
    @assert n >= 1
    for i in Channel(ch -> my_producer(ch, n))
        println(i)
    end
end

In the for loop, values are consumed as long as they are available from the Channel.

julia> my_consumer(9)
2
3
5
7

You may have expected to close the Channel. This is not necessary here, since the Channel is associated with the Task, and therefore the lifetime of the Channel being open is tied to the Task. The Task terminates when the function returns, at which point the Channel is closed automatically.
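A minimal sketch of this pattern with take! (our own example): the channel below is created with a producing function and closes itself when that function returns.

```julia
# The do-block form passes the new Channel to the anonymous function.
ch = Channel() do c
    for i in 1:3
        put!(c, i^2)   # blocks until a consumer takes the value
    end
end
xs = [take!(ch), take!(ch), take!(ch)]   # [1, 4, 9]
```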
The operations on Channels are summarized in Table 6.2. When creating a Channel, the type of the values to be passed may be specified. If no such type argument is given, the general type Any is used by default. If a function argument is given to the constructor, a Task is created and associated with the new Channel as discussed above. However, a Channel may also be created without the function argument. The size of the buffer of the Channel may be specified, where the default size zero creates an unbuffered Channel. Channel(Inf) is equivalent to Channel{Any}(typemax(Int)).
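Several of these operations can be combined in a small sketch (our own example) with a buffered channel:

```julia
ch = Channel{Int}(2)   # buffered channel for at most 2 Ints
put!(ch, 10)           # does not block: the buffer has room
ready = isready(ch)    # true: a value is available
peeked = fetch(ch)     # 10, without removing it
taken = take!(ch)      # 10, removing it
empty = !isready(ch)   # true: the buffer is empty again
```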

Table 6.2 Channel operations.

Function                      Description
Channel[{T}]([size])          create a Channel for at most size objects of type T
Channel[{T}](f[, size])       create a Task and a Channel for at most size objects of type T
put!(ch, value)               append value to the Channel ch and block if it becomes full
take!(ch)                     block until a value is available from Channel ch,
                              then remove and return it
isopen(ch)                    determine whether the Channel ch is open
isready(ch)                   determine whether a value is available from Channel ch
wait(ch)                      wait until a value becomes available from Channel ch
fetch(ch)                     wait for and return the first value from the Channel ch,
                              but do not remove it from the Channel
close(ch)                     close the Channel ch

Similarly to Channels whose constructor takes a function argument, the argument of the Task constructor must be a function with no arguments. The @task macro serves the same purpose. Furthermore, the function bind associates a Channel with a Task, and the function schedule adds a Task to the queue of the scheduler, causing the task to be run.
The basic building block of symmetric coroutines is the function yieldto. It suspends the current task and receives a value at the same time. More precisely, the function call yieldto(task, value) suspends the current task and switches to task; then the last call of yieldto in task returns the value. The first time a task is switched to, the function it is associated with (i.e., the argument that was given to the Task constructor when it was created) is called with no arguments. Therefore yieldto achieves both fundamental operations, i.e., suspending the current task on the one hand and transferring control to another task and receiving values on the other hand. Because of the symmetry between the Tasks – they all call yieldto – they are called symmetric coroutines.
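A direct task switch with yieldto can be sketched as follows (our own example; the names main and t are ours):

```julia
main = current_task()
t = @task begin
    # runs when main switches to t; immediately switch back with a value
    yieldto(main, :hello)
end
value = yieldto(t)   # starts t; returns :hello once t switches back
```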
While the function yieldto is the basic building block, it is not invoked directly in many use cases. Switching between tasks usually requires coordination between the tasks, which means that state must be maintained, e.g., to know which task is a producer and which is a consumer. Therefore it is often more convenient to use the functions put! and take!.
Table 6.3 summarizes operations on Tasks. A Task is created by calling the constructor Task on a function with no arguments or by calling the macro @task on an expression. The function schedule takes additional arguments that are useful in certain situations. We will see how to use the @async and @sync macros in an example below. Having started a Task with @async, the nearest enclosing @sync will wait for it to finish. In fact, the @sync macro will wait for all lexically enclosed uses of @async, @spawn, @spawnat, and @distributed to finish (see also Sect. 6.7).
The scheduler keeps track of the Tasks, executes an event loop, and maintains a queue of runnable tasks. To this end, Tasks have a field called state that contains one of three symbols and indicates the execution status of the Task. The

Table 6.3 Task operations.

Function                            Description
Task(f)                             create a Task to evaluate the function f
@task expr                          create a Task to evaluate the expression expr
yieldto(task, value)                switch to task and return value
                                    from the last call of yieldto in task
yield()                             allow another scheduled task to be run
                                    and remain :runnable
current_task()                      return the currently running Task
istaskstarted(task) → Bool          check whether task has been started
istaskdone(task)                    check whether task has finished
task_local_storage()                return the local storage of the current task
task_local_storage(key)             return the value of key in the task-local storage
task_local_storage(key, value)      assign value to key in the task-local storage
task_local_storage(f, key, value)   call the function f with a modified
                                    task-local storage, which is restored afterward
schedule(task)                      add task to the queue of the scheduler
sleep(seconds)                      block the current task for seconds (at least 0.001)
@async expression                   wrap expression in a Task and schedule it locally
@sync expression                    wait until all uses of @async etc. have completed

Task states are described in Table 6.4. A newly created Task is initially not known to the scheduler and therefore not run.

Table 6.4 Task states.

State       Description
:runnable   currently running or able to run
:done       successfully finished, i.e., the function has returned
:failed     unsuccessfully finished, i.e., an uncaught exception was thrown

In cooperative multitasking, most Task switches are the result of waiting for events such as input or output requests. The generic function wait is the basic way to wait and includes methods for several types of objects such as Tasks, Channels, Distributed.RemoteChannels, Base.Events, and Base.Processes. The function wait is usually called implicitly; for example, the function read uses wait to wait for data to become available.
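As a small sketch of waiting on a Base.Event (our own example; Base.Event is available since Julia 1.1):

```julia
ev = Base.Event()
t = @task begin
    wait(ev)   # suspends this task until the event is notified
    println("event received")
end
schedule(t)
yield()        # let t run up to its wait
notify(ev)     # wake all tasks waiting on the event
wait(t)        # block until t has finished
```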
We now discuss how jobs or workloads can be distributed to Tasks and how Channels can be used for communication between these Tasks and to collect the results. This example is a leading one, and these techniques are already useful for sequential computing on a single processor. (Parallel computing is discussed in Sect. 6.7 below.) For example, the jobs may be functions that mostly deal with input/output operations and that hence may wait for a substantial amount of time.

In the let expression below, we will define two variables that are buffered Channels and that hold the jobs and their results. The jobs and the results are NamedTuples, and the buffer sizes of both Channels are only 5. The first function we define in this example creates jobs. We just draw a random number between zero and one, which represents the job. After all jobs have been written to the Channel, it is closed.

function make_jobs(jobs::Channel, n::Integer)
    for i in 1:n
        put!(jobs, (id = i, workload = rand()))
    end
    close(jobs)
end

The second function does the work. It uses a for loop to get all available jobs and writes the results of performing the jobs into the other channel. The work is trivial, as it is just sleeping, but when running the example, it illustrates nicely how long it takes all jobs to finish.

function work(jobs::Channel, results::Channel, id::Integer)
    println("Worker $id started.")
    for j in jobs
        sleep(j.workload)
        put!(results, (id = j.id, worker = id, time = j.workload))
    end
    println("Worker $id finished.")
end

Next, we create ten jobs by wrapping the call to make_jobs in the @async macro. The @async macro creates a Task and adds it to the queue of the scheduler. It is expedient to use @async at this point, since creating the job descriptions may take some time or the number of jobs may exceed the buffer size, but in this way work can start immediately. Having created the jobs, we start three tasks by wrapping calls to work in the @async macro. Then, we take the results from the buffered Channel and print the total elapsed time.

let n = 10
    local jobs = Channel{NamedTuple}(5)
    local results = Channel{NamedTuple}(5)

    @async make_jobs(jobs, n)

    for i in 1:3
        @async work(jobs, results, i)
    end

    @time for i in 1:n
        local r = take!(results)

        println("Job $(r.id) performed by worker $(r.worker) " *
                "took $(round(r.time; digits = 3)) seconds.")
    end
end

Worker 1 started.
Worker 2 started.
Worker 3 started.
Job 2 performed by worker 2 took 0.224 seconds.
Job 1 performed by worker 1 took 0.229 seconds.
Job 3 performed by worker 3 took 0.533 seconds.
Job 4 performed by worker 2 took 0.755 seconds.
Job 5 performed by worker 1 took 0.818 seconds.
Job 6 performed by worker 3 took 0.936 seconds.
Job 8 performed by worker 1 took 0.552 seconds.
Worker 2 finished.
Job 7 performed by worker 2 took 0.95 seconds.
Worker 3 finished.
Job 9 performed by worker 3 took 0.749 seconds.
Worker 1 finished.
Job 10 performed by worker 1 took 0.796 seconds.
2.934174 seconds (4.15 M allocations: 180.415 MiB, 3.95% gc time)

In the next example, we use an unbuffered Channel for communication. The function source slowly writes values into a Channel. After all the values have been written, it is closed. The function sink receives the values. Since Channels are iterable, we use a for loop as usual; the for loop iterates until the channel is closed.

import Primes

function source(ch, n)
    for i in Primes.primes(1, n)
        sleep(rand())
        println("Putting $(i). ")
        put!(ch, i)
    end
    close(ch)
end

function sink(ch)
    for i in ch
        println("Taking $(i).")
    end
end

let ch = Channel(0)
    @sync begin
        @async source(ch, 10)
        @async sink(ch)
    end
end

The let expression runs these two functions. First, an unbuffered Channel is created. In the begin expression, Tasks for the calls to the source and sink functions are created and run asynchronously by the macro @async. The @sync macro outside the begin expression waits till both tasks are done; it will return only when the for loop in the sink function has returned.
The final example in this section shows how Conditions can be used. The function wait_a_bit is run in a new Task after having been started by schedule and notifies the Condition when it is done. After notify has been called, control flow continues after the call to wait, which has been waiting for the Condition. This construct is more effective than a loop that polls the Condition repeatedly.

let c = Condition()
    function wait_a_bit()
        println("Waiting for Godot.")
        sleep(1 + rand())
        notify(c)
    end

    schedule(@task wait_a_bit())

    wait(c)
    println("He has arrived.")
end

6.7 Parallel Computing

Parallel computing is an important topic on today’s and future hardware. As the


physical sizes of cmos transistors have been approaching their physical limits
and heat dissipation has become the main limiting factor, the performance of
single cpu cores has been stagnating. Therefore modern cpus consist of multi-
ple cores, single computers may possess multiple cpus, and computers may be
combined into clusters. Distributing the computations however comes with a
communication cost.
Julia’s implementation of parallel or distributed computing is based on mes-
sage passing, but provides higher-level operations than just sending and receiv-
ing of messages. Remote references and remote calls are the basic building
blocks for parallel computing in Julia. Remote references refer to objects stored
122 6 Control Flow

on a certain process and can be used from any process; there are two types of
remote references, namely Futures and RemoteChannels.
A remote call is a request by a process to call a certain function on certain
arguments on another (or the same) process. Every remote call returns immediately,
and the result of a remote call is a Future. The return value of the remote
call can be obtained using the function fetch, or the function wait can be called
on the Future to wait until the result is available.
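The interplay of a remote call, the returned Future, and fetch can be sketched as follows; this is a minimal sketch assuming that worker processes are available (e.g., Julia was started with `julia -p 2`):

```julia
using Distributed          # assumes workers exist, e.g. `julia -p 2`

w = first(workers())           # pick some worker id
f = remotecall(sqrt, w, 2.0)   # returns a Future immediately
println(fetch(f))              # blocks until the result is available
                               # 1.4142135623730951
```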

6.7.1 Starting Processes

The Julia system can use multiple processes. The process associated with the
REPL always has id 1, and additional processes with higher ids can be started and
are called workers. If there is only the process with id 1, it is considered the only
worker. The number of workers can be supplied when starting Julia using the
command-line arguments -p or --procs. The argument should be equal to the
number of available (logical) cores or to auto, which determines the number of
(logical) cores automatically. If the arguments -p or --procs are supplied on the
command line, then the built-in module Distributed is loaded automatically.
> julia -p auto
julia> length(workers())
24

Within Julia, the workers can be managed using the functions workers,
addprocs, and rmprocs. The module Distributed must be loaded on the process
with id 1 before using addprocs to add workers.
Another option to start worker processes is to use the --machine-file
command-line option. Then Julia uses passwordless ssh login to start workers
on the machines specified in the supplied file.
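A short interactive sketch of worker management looks like the following; the printed worker ids are examples and depend on how many processes were started before:

```julia
using Distributed

addprocs(2)            # start two additional local worker processes
println(workers())     # e.g. [2, 3]
rmprocs(workers())     # remove them again
println(workers())     # [1]: process 1 is now the only "worker"
```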
Worker processes differ from the process with id 1 in that they do not evaluate the
startup.jl startup file. Their global state, i.e., loaded modules, global variables,
and generic functions and methods, is not synchronized automatically between
processes. The common way to load modules or program files into all workers is
to use the @everywhere macro by writing
@everywhere import module

or
@everywhere include("filename").

Furthermore, it is possible to write customized ClusterManagers, but this
option is not discussed in detail here.

6.7.2 Data Movement and Processes

The disadvantage of parallel or distributed computing is that messages and data
must be exchanged between processes. Reducing the number of messages and the
amount of data sent is of paramount importance for performance and scalability.
The function remotecall and the macro @spawnat are the basic operations
for evaluating a function or an expression, respectively, on a worker process. The
first argument of remotecall is the function to be called, the second is the process
id of the worker to be used, and the rest of the arguments are passed to the
function to be called. The first argument of @spawnat is the id of the worker process
to be used or it is equal to :any, which lets the scheduler choose the worker
process, and the second argument is the expression to be evaluated on the worker
process. Both remotecall and @spawnat return immediately and yield a Future
as the result.
The generic function fetch is the basic operation for moving data. It (or more
precisely one of its methods) receives a Future as its argument, waits until the
worker process has finished evaluating, and returns the value.
Perhaps the simplest example is the following. Remember to use the module
Distributed first, if you have started Julia without the -p or --procs
command-line options.
julia> using Distributed
julia> name = "world"
julia> @time fetch(@spawnat :any begin sleep(3); "Hello, $(name)!"
       end)
  3.065816 seconds (74.02 k allocations: 3.622 MiB)
"Hello, world!"

The example shows that required data are copied to the worker process automatically;
here, the global variable name is accessed by the expression to be spawned
and therefore it is made available to the worker process. After the expression has
been evaluated on the worker process, fetch fetches its value from the worker
process and returns it.
The example could have been written more succinctly using @fetchfrom,
which is equivalent to fetch after @spawnat. The function remotecall_fetch
is equivalent to applying fetch to the result of remotecall, but it is more efficient.
The function remote_do evaluates a function on a worker with a given id,
but does not yield the return value of the function. It is also not possible to wait
for the completion of the function call.
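The differences between these operations can be sketched as follows; this assumes worker processes exist (e.g., Julia was started with `julia -p 2`):

```julia
using Distributed          # assumes workers exist, e.g. `julia -p 2`
w = first(workers())

# remotecall_fetch combines the call and the retrieval of the result
# in one round trip:
a = remotecall_fetch(+, w, 1, 2)    # evaluated on worker w
b = fetch(remotecall(+, w, 1, 2))   # same value, but two round trips
println(a == b)                     # true

# remote_do starts the computation but discards the return value:
remote_do(println, w, "fire and forget")
```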
To illustrate the use of worker processes and RemoteChannels, we rework
the jobs example in Sect. 6.6 to use remote workers and channels. The function
make_jobs remains essentially unchanged.

function make_jobs(jobs::RemoteChannel, n::Integer)
    for i in 1:n
        put!(jobs, (id = i, workload = rand()))
    end
    close(jobs)
end

The function work to be run on the workers must be made available to all
processes. Therefore we wrap the function definition into the @everywhere macro.
@everywhere function work(jobs::RemoteChannel,
                          results::RemoteChannel)
    while true
        local j
        try
            j = take!(jobs)
        catch exc
            break
        end
        sleep(j.workload)
        put!(results, (id = j.id, worker = myid(),
                       time = j.workload))
    end
    println("Worker $(myid()) has no more jobs to do.")
end

Since for loops over RemoteChannels are not supported directly, we use a while
loop instead. We would like to check whether the jobs channel is still open and
then take a value from it. However, in the time between checking and taking a
value, another worker process may have snatched the last available value, resulting
in a race condition. Therefore we just try to take a value from the channel
and catch any possibly resulting exception. If there is an exception, we know
that the channel has been closed and we break the loop and hence end the
function.
With these function definitions, we can run remote workers and communicate
via RemoteChannels. After having defined the channels, we create the jobs
asynchronously. On each available worker, we execute the work function using
remote_do. In the final while loop, we collect all results.

let n = 10
    local jobs = RemoteChannel(() -> Channel{NamedTuple}(5))
    local results = RemoteChannel(() -> Channel{NamedTuple}(5))

    @async make_jobs(jobs, n)

    for w in workers()
        remote_do(work, w, jobs, results)

    end

    @time for i in 1:n
        local r = take!(results)
        println("Job $(r.id) performed by worker $(r.worker) " *
                "took $(round(r.time; digits = 3)) seconds.")
    end
end

      From worker 11:   Worker 11 has no more jobs to do.
      From worker 9:    Worker 9 has no more jobs to do.
      From worker 4:    Worker 4 has no more jobs to do.
      From worker 19:   Worker 19 has no more jobs to do.
      From worker 23:   Worker 23 has no more jobs to do.
      From worker 25:   Worker 25 has no more jobs to do.
      From worker 6:    Worker 6 has no more jobs to do.
      From worker 14:   Worker 14 has no more jobs to do.
      From worker 21:   Worker 21 has no more jobs to do.
      From worker 24:   Worker 24 has no more jobs to do.
      From worker 15:   Worker 15 has no more jobs to do.
Job 1 performed by worker 2 took 0.052 seconds.
      From worker 2:    Worker 2 has no more jobs to do.
Job 4 performed by worker 8 took 0.045 seconds.
      From worker 8:    Worker 8 has no more jobs to do.
Job 7 performed by worker 17 took 0.069 seconds.
      From worker 17:   Worker 17 has no more jobs to do.
Job 6 performed by worker 12 took 0.137 seconds.
      From worker 12:   Worker 12 has no more jobs to do.
Job 5 performed by worker 22 took 0.312 seconds.
      From worker 22:   Worker 22 has no more jobs to do.
Job 2 performed by worker 3 took 0.338 seconds.
      From worker 3:    Worker 3 has no more jobs to do.
Job 10 performed by worker 16 took 0.628 seconds.
      From worker 16:   Worker 16 has no more jobs to do.
Job 3 performed by worker 10 took 0.781 seconds.
      From worker 10:   Worker 10 has no more jobs to do.
Job 9 performed by worker 18 took 0.809 seconds.
      From worker 18:   Worker 18 has no more jobs to do.
Job 8 performed by worker 5 took 0.969 seconds.
  0.988970 seconds (9.36 k allocations: 426.000 KiB)
      From worker 5:    Worker 5 has no more jobs to do.

In this particular example, there are fewer jobs than worker processes. Therefore
some of the work functions finish immediately. All jobs are finished after about
the time it takes the longest job to finish.

The operations for parallel or distributed computing are summarized in
Table 6.5. SharedArrays are available in the module SharedArrays, which must be
loaded on all workers, and make it possible for multiple processes to access an
entire array.

Table 6.5 Operations for parallel computing.

Function                       Description
workers()                      return an array of all worker process ids
addprocs()                     add Sys.CPU_THREADS workers
rmprocs(ids...)                remove the workers with the given process ids
myid()                         return the id of the current process
SharedArray                    create an array that can be accessed by multiple workers
RemoteChannel                  create a channel for communication between processes
remote_do(f, id, args...)      evaluate a function asynchronously
remotecall(f, id, args...)     evaluate a function asynchronously and return a Future
fetch(future)                  wait for and return the value of future
remotecall_fetch               equivalent to fetch(remotecall(...)), but faster
remotecall_wait                equivalent to wait(remotecall(...)), but faster
@async expression              wrap expression in a Task and schedule it locally
@sync expression               wait until all uses of @async etc. have completed
@spawnat id expr               evaluate expr asynchronously on process id (:any)
@distributed [f] for ... end   distributed, parallel version of a for loop
@fetchfrom id expr             equivalent to fetch(@spawnat id expr)
clear!(symbols, ids)           clear global bindings for symbols on processes ids
finalize(object)               finalize object, making it available for garbage collection

We close this section with a comment on garbage collection. Julia performs
garbage collection on local and remote processes as usual so that you are not
required to care about allocating and freeing memory. There is one effect, however,
that may require you to be more careful. As usual in garbage collection, the time
when an object is garbage collected is unspecified and depends on the size of
the object and the current memory pressure. Remote garbage collection requires
the use of remote references, which are very small, and hence there is little pressure
to collect them (often). However, the objects the remote references point to
may be quite large and cannot be collected while the remote references exist. To
help the garbage collector in such cases, one can explicitly call finalize on local
instances of a RemoteChannel and on unfetched Futures that are not needed
anymore.
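A minimal sketch of explicit finalization looks like this; it runs even with a single process, although the effect matters most when the backing store lives on a remote worker:

```julia
using Distributed

# A RemoteChannel works even with a single process; with real workers
# the backing channel may live on another process.
rc = RemoteChannel(() -> Channel{Int}(1))
put!(rc, 42)
println(take!(rc))   # 42

# Explicitly finalizing the local reference releases the remote
# backing store without waiting for the garbage collector.
finalize(rc)
```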

6.7.3 Parallel Loops and Parallel Mapping

Fortunately, many parallel computations can be implemented using parallel for
loops or parallel mapping of functions. When using these built-in facilities, the

programmer is not involved in the low-level tasks of moving data and managing
worker processes.
We consider a simple mathematical example that can be parallelized in a
straightforward manner, namely the approximation of 𝜋 using random numbers.
The area of the circle sector of the unit circle with radius one around the origin
within the square [0, 1] × [0, 1] is 𝜋∕4. If we draw uniformly distributed random
numbers 𝑋 and 𝑌 from the interval [0, 1], then the fraction of these pairs (𝑋, 𝑌)
within this circle sector (i.e., with 𝑋² + 𝑌² ≤ 1) will thus be 𝜋∕4 of the number
of all pairs (𝑋, 𝑌) drawn. Hence we have found an algorithm that calculates a
Monte Carlo approximation of 𝜋∕4.
In the first step, we save the following function pi in a file called pi.jl.
function pi(n::Int)::Float64
    local counter::Int = 0
    for i in 1:n
        if rand()^2 + rand()^2 <= 1
            counter = counter + 1
        end
    end
    4 * counter / n
end

To run pi on multiple processes (assuming your computer comes with
multiple (logical) cores), we start Julia using julia -p auto. Then we use the
@everywhere macro to make our program available to all workers.

julia> @everywhere include("pi.jl")

Since we started Julia with the -p command-line option, the Distributed module
(and hence the macro Distributed.@spawnat) is already available; otherwise
we would need using Distributed.
The @spawnat macro takes two arguments, namely the id of the process to
be used and an expression. The expression is wrapped into a closure and run
asynchronously on the specified process, and a Future is returned. If the process
id is equal to :any, the scheduler picks the process to be used.
julia> pi1 = @spawnat 2 pi(1000)
Future(2, 1, 74, nothing)
julia> pi2 = @spawnat :any pi(1000)
Future(2, 1, 75, nothing)
julia> (fetch(pi1) + fetch(pi2)) / 2
3.144

Here we have used fetch to obtain the return value of the spawned function from
the Future. The average of the two approximations yields three correct digits of 𝜋
(at least in this run). This approach is still low level, as some programming work
is required to spawn the expressions and to collect their values.
Parallel for loops provide a convenient way to distribute expressions to
processes and to collect the results.

function parallel_pi(m::Int, n::Int)::Float64
    (1/m) * @distributed (+) for i in 1:m
        pi(n)
    end
end

The @distributed macro turns the for loop into a parallel for loop. Its first
(optional) argument is a function that will process the values returned by each
iteration; here the argument is supplied as (+) (instead of just +) for a syntactic
reason. All iterations are performed on the worker processes, and each iteration
returns the value of its last expression. The postprocessing is performed on the
calling process.
This is all that is needed to run this Monte Carlo algorithm in parallel.
julia> @time parallel_pi(length(workers()), 10^9)
 11.515523 seconds (62.33 k allocations: 3.142 MiB)
3.1415952925000004

In this run, we obtained six correct decimal digits of 𝜋 by running 24 ⋅ 10⁹
samples in total. By using length(workers()) as the first argument, this number
of iterations is distributed to the same number of workers. If the argument is
10 * length(workers()), each worker receives ten loop iterations.
When one is interested in receiving the values calculated in all loop iterations,
the vcat function can be used to return a vector with all values.
julia> @distributed vcat for i in 1:length(workers()) pi(10^9) end
24-element Vector{Float64}:
 3.141593724
 ⋮

All variables used inside a parallel for loop will be copied to each worker
process. On the other hand, any changes to these variables will not be visible
after the loop has finished, i.e., the values of the variables are not copied back. It
is also important to note that the order of the iterations is unspecified.
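This copy semantics can be demonstrated with a short sketch; it assumes that worker processes exist (e.g., `julia -p 2`), since only then do the loop bodies run on copies of the captured variable:

```julia
using Distributed   # assumes workers exist, e.g. `julia -p 2`

let a = 0
    @sync @distributed for i in 1:10
        a += i          # modifies a copy of a on the worker
    end
    println(a)          # still 0: the changes are not copied back
end
```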
It is possible to omit the function that processes the values when calling
@distributed. Then the loop iterations are spawned on all available workers
and an array of Futures is immediately returned without waiting for the iterations
to finish. The functions wait and fetch can be applied to the Futures as
usual, or it is possible to wait for the completion of all iterations by calling @sync
on the result, i.e., by writing
@sync @distributed for
    ...
end.

Using a parallel for loop with vcat as the postprocessing function is equivalent
to using the pmap function, which is the parallel version of the mapping
function map. Using pmap, we can rewrite parallel_pi above succinctly as
follows.

function parallel_pi(m::Int, n::Int)::Float64
    (1/m) * sum(pmap(pi, repeat([n], m)))
end

The results are the same.

julia> @time parallel_pi(length(workers()), 10^9)
 11.433885 seconds (1.77 k allocations: 68.109 MiB)
3.141599547166666

What is the difference between a for loop and pmap? The pmap function is
meant to be used when evaluating the function is computationally expensive.
On the other hand, a parallel for loop can handle tiny computations in each
iteration well.

Problems

6.1 (Fizz-buzz)
Write a function that prints the numbers from 1 to 100. However, for multiples
of three, print “fizz” instead of the number; for multiples of five, print “buzz”;
and for multiples of both three and five, print “fizz-buzz”.

6.2 Define a data structure for a mathematical object as well as corresponding
iterate methods. Show two examples of iteration, one using a for loop and one
using a while loop.

6.3 (Sieve of Eratosthenes) Implement the sieve of Eratosthenes for calculating
prime numbers.
6.4 Write a parallel program to calculate 𝑓(𝑛) ∶= ∑_{𝑘=1}^{𝑛} 1∕𝑘² and determine
the speed-up of your program compared to a serial version as a function of
the number of workers. Hint: you can check your solution using the formula
∑_{𝑘=1}^{∞} 1∕𝑘² = 𝜋²∕6.

6.5 (How to shoot 𝜋, continued)
Compare the speed of both versions of the Monte Carlo approximation of 𝜋 in
Sect. 6.7.3 for different m and n. Which version and parameters yield the fastest
program?
Chapter 7
Macros

Lisp is now the second oldest programming language in present widespread use (after
Fortran and not counting APT, which isn't used for programming per se). It owes its
longevity to two facts. First, its core occupies some kind of local optimum in the space
of programming languages given that static friction discourages purely notational
changes. Recursive use of conditional expressions, representation of symbolic
information externally by lists and internally by list structure, and representation of
program in the same way will probably have a very long life.
Second, Lisp still has operational features unmatched by other languages that make it a
convenient vehicle for higher level systems for symbolic computation and for artificial
intelligence. These include its run-time system that give good access to the features of
the host machine and its operating system, its list structure internal language that
makes it a good target for compiling from yet higher level languages, its compatibility
with systems that produce binary or assembly level program, and the availability of its
interpreter as a command language for driving other programs. (One can even
conjecture that Lisp owes its survival specifically to the fact that its programs are lists,
which everyone, including me, has regarded as a disadvantage. Proposed replacements
for Lisp [. . . ] abandoned this feature in favor of an Algol-like syntax leaving no target
language for higher level systems).
Lisp will become obsolete when someone makes a more comprehensive language that
dominates Lisp practically and also gives a clear mathematical semantics to a more
comprehensive set of features.
—John McCarthy, History of Lisp (12 February 1979)

Abstract For several decades, Lisp macros have been the state of the art in
metaprogramming. Macros are expanded at the time when a program is read,
and thus provide a mechanism for defining new language constructs by rewrit-
ing expressions at read time and before compile and evaluation time. In this
chapter, the concept of macros is explained via the example of macros in Common
Lisp, which is conducive for this purpose due to its uniform syntax. Then
Julia macros and their building blocks are presented in detail. Finally, useful
built-in Julia macros are discussed.

© Springer Nature Switzerland AG 2022 131


C. Heitzinger, Algorithms with JULIA,
https://doi.org/10.1007/978-3-031-16560-3_7

7.1 Introduction

The lifetime of a program consists of several phases.


1. The first phase is the time when the program is written.
2. The second phase is the read time, when the program is read and translated
into the internal data structure the interpreter or compiler uses to perform
its task. In the Lisp family of languages, the internal representation is a list,
and in Julia, which has more syntax, it is an expression (Expr). It is a noteworthy
feature of these two languages and related ones that the internal representation
of the program is exposed to the programmer, much facilitating
the definition of macros.
At read time, any expression that is a macro is expanded and yields a new
expression that replaces the macro.
3. At compile time, the expressions are compiled.
4. Finally, at evaluation or run time, the expressions are evaluated and their
values are returned.
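The read-time representation in step 2 can be inspected directly in Julia; the following sketch shows how a parsed program is exposed as an Expr:

```julia
# At read time, the program text becomes an Expr that the programmer
# can inspect and manipulate:
ex = Meta.parse("1 + 2 * 3")
println(ex.head)    # :call
println(ex.args)    # Any[:+, 1, :(2 * 3)]
println(eval(ex))   # 7
```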
In Lisp and also in Julia, there are three categories of function-like expres-
sions: macros, special forms, and functions. The differences between these three
categories are due to their different purposes.
• The arguments of macros are not evaluated, because the whole purpose of a
macro is to translate an expression into a new expression.
• Special forms only evaluate some of their arguments. The canonical example
of a special form is the if expression, which evaluates its first argument and,
depending on the result, its second or third argument.
• Functions evaluate all of their arguments.
The discussion of the phases in the lifetime of a program shows that macros
are expanded after read time and before any evaluations of the expressions are
performed. Therefore macros make it possible to define new language constructs
that are syntactically indistinguishable from built-in language constructs in the
case of Lisp and nearly syntactically indistinguishable in the case of Julia. In
Julia, the names of macros always start with the at sign @.
In the next section, we illustrate the concept of macros via a simple example
implemented in Common Lisp, because its syntax is so uniform that the concept
of a macro can be illustrated in this language very easily. Afterwards, we proceed
with macros in Julia, which has more syntactic constructs.

7.2 Macros in Common Lisp

The simple example we consider is a macro called do-thrice that takes an
expression and executes it three times. We start gently in Common Lisp.
(in-package :cl-user)

The following expression prints a message three times using a dotimes loop.
(dotimes (i 3)
  (print "Hello, world!"))

This expression is a list that contains the three elements dotimes, (i 3), and
(print "Hello, world!"). The first element is the name of the function, macro,
or special form to be called. (In fact, dotimes is a macro, but it is indistinguishable
from a function if all we know is its name.) The second element (i 3) defines
the iteration variable i and specifies how many times the following expression
will be repeated. The third element is the expression to be repeated.
We are already halfway to defining the macro do-thrice. The macro
defmacro defines a new macro. Its first argument is the name of the macro to
be defined, its second argument is the argument list, and the remaining arguments
are the expressions to be returned by the new macro. Therefore the first
version of our simple macro is the following.
(defmacro do-thrice-1 (&body body)
  `(dotimes (i 3)
     ,@body))

Here the argument list just means that all expressions that will be passed to the
new macro will be contained in the local variable body. The backquote ` is commonly
used in macro definitions to protect its argument from evaluation, just as
quote in Julia. The syntax ,@ within a backquote means that the elements of its
argument, here body, are spliced into the surrounding list. If we did not want
to use the backquote syntax, we could also construct the expression, i.e., a list,
explicitly, but the purpose of the backquote syntax is to facilitate writing macros
and therefore we use it.
Common Lisp offers a simple way to check that our macro does what it is
supposed to do: macroexpand-1 expands a macro only once.
(macroexpand-1 '(do-thrice-1 (print "Hello, Julia users!")))

This results in the following output.

(DOTIMES (I 3) (PRINT "Hello, Julia users!"))
T

Symbols are printed in uppercase letters by default. The value T is the second
return value; it stands for true. We see that the macro do-thrice-1 does what
we intend it to do: it takes an expression and puts it inside a dotimes loop.
You might expect that there is also a function called macroexpand and you
would be right. The function macroexpand expands a macro including all nested
macro calls. Evaluating the following expression also expands the call of the
dotimes macro, but the final expression is implementation dependent.

(macroexpand '(do-thrice-1 (print "Hello, Julia users!")))

Next, we call our macro, which means that the resulting macro expansion is
evaluated.

(do-thrice-1 (print "Hello, Julia users!"))

Three lines are printed as expected. Here NIL is the return value.
"Hello, Julia users!"
"Hello, Julia users!"
"Hello, Julia users!"
NIL

However, there is a problem. Our use of dotimes defines the local variable i
in the macro expansion, and we can access its value.
(do-thrice-1 (print i))

Evaluating this expression prints the following output.

0
1
2
NIL

Printing the value is relatively harmless, but we can also – a bit more maliciously
– change the value of the iteration variable and hence change the behavior of the
macro. (setq is short for "set quoted".)
(do-thrice-1 (setq i 2) (print "Printed only once."))

Now only one line is printed.

"Printed only once."
NIL

The macro expansion shows why the message is printed once.

(macroexpand-1 '(do-thrice-1 (setq i 2) (print "Printed only once.")))

(DOTIMES (I 3) (SETQ I 2) (PRINT "Printed only once."))
T

In the first iteration of the dotimes loop, the iteration variable is set to 2, which
prevents any further iterations, and the print expression is evaluated.
Such macros are called unhygienic, since they pollute the name space of
variables. A hygienic version of the macro is the second version shown here.
(defmacro do-thrice-2 (&body body)
  (let ((i (gensym)))
    `(dotimes (,i 3)
       ,@body)))

The let form assigns a new unique symbol returned by gensym to the local
variable i when the macro is expanded. The name of this new unique symbol is then
used as the name of the iteration variable in the dotimes loop by splicing it as
,i into the expression returned by the macro, preventing any unwanted variable
capture.
(macroexpand-1 '(do-thrice-2 (print "Hello, Julia users!")))

(DOTIMES (#:G456 3) (PRINT "Hello, Julia users!"))
T

The macro expansion shows that a variable called G456 is used as the iteration
variable. (More precisely, #:G456 is an uninterned symbol.)
Finally, we try to change the iteration variable i again.
(do-thrice-2 (setq i 2) (print "Hello, Julia users!"))

; in: DO-THRICE-2 (SETQ I 2)
;     (SETQ I 2)
;
; caught WARNING:
;   undefined variable: COMMON-LISP-USER::I
;
; compilation unit finished
;   Undefined variable:
;     I
;   caught 1 WARNING condition

"Hello, Julia users!"
"Hello, Julia users!"
"Hello, Julia users!"
NIL

Now we only receive a warning that a variable i was not declared or defined
previously, while the macro still works as intended, printing three strings.
This first example illustrates how expressions can be rewritten and what
hygienic macros are. We will encounter the same concepts in Julia, where only
the names are different.

7.3 Macro Definition

In Julia, macros are defined by macro analogous to function. A macro must
return an expression, which is easily achieved by wrapping an expression between
quote and end. Within such a quoted expression, the value of a variable can be
substituted by prepending its name with a dollar sign $. This syntax is analogous
to the syntax for string interpolation (see Sect. 4.2.2). Hence the backquote

in Common Lisp corresponds to ʜʼɴʲȕ in Julia, and the comma corresponds


to the dollar sign.
The expansion of a macro is returned by the function macroexpand and the
macros @macroexpand and @macroexpand1, which behave slightly differently.
The function macroexpand takes the module in whose context the macro is
expanded as the first argument. The keyword argument recursive controls
whether it should be expanded recursively or not. The macros @macroexpand and
@macroexpand1 both do not take the module as an argument, and they differ in
whether the macro is always expanded recursively (in the case of @macroexpand)
or not (in the case of @macroexpand1).
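The different entry points can be compared in a small sketch; the macro @twice below is a hypothetical helper defined only for this illustration:

```julia
# A hypothetical two-line macro, defined only for this illustration:
macro twice(expr)
    quote
        $(esc(expr))
        $(esc(expr))
    end
end

# The function takes the module as its first argument:
println(macroexpand(Main, :(@twice println("hi")); recursive = false))

# The macro variants infer the module from the call site:
println(@macroexpand @twice println("hi"))
```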
To illustrate the intricacies of defining macros in Julia, we will write three
versions of a macro that evaluates an expression a given number of times, just
as the dotimes macro does in Common Lisp. The first version is the straightforward,
unhygienic version.
macro unhygienic_dotimes(n::Integer, expr::Expr)
    @assert n >= 0

    quote
        let i = 0
            while i < $n
                $expr
                i += 1
            end
        end
    end
end

The assertion in the first line is evaluated at macro-expansion time, and the
quoted expression is returned. The value of the local variable expr is substituted
into the quoted expression because of the dollar sign in $expr.
Just like functions, macros are generic in Julia, which means that methods
with the same name but with different argument signatures can be defined. The
method that best matches the arguments of the function or macro call will be
chosen.
Macros are called just like functions, but the arguments are not evaluated at
macro-expansion time.
julia> @unhygienic_dotimes(3, println("Printed three times."))
Printed three times.
Printed three times.
Printed three times.

Whenever a macro takes only one argument, the parentheses around the
arguments can be left out. Expressions passed as arguments to macros are created
as usual, separating them by semicolons within parentheses or using begin and
end.

The next call illustrates that we can change the value of the iteration variable
within the expression passed as an argument, showing that the macro is
unhygienic.
julia> @unhygienic_dotimes(3, begin i = 2; println("Printed once.")
       end)
Printed once.
It is very instructive to compare the expansions of the macros in this section
using the function macroexpand or the macro @macroexpand in order to
understand the details of macro expansion in Julia (see Problem 7.1).
Although the local variables within a quote block always receive unique
names in Julia in a hygiene pass, as is easily checked by inspecting the macro
expansion, this is not enough to make the macro hygienic.
Analogous to Common Lisp, we can use the function gensym to define a
hygienic version of the macro. Now a new variable name is generated at macro-
expansion time by gensym and used as the iteration variable in the quoted
expression.
macro hygienic_dotimes(n::Integer, expr::Expr)
    @assert n >= 0

    local var = gensym()

    quote
        let $var = 0
            while $var < $n
                $expr
                $var += 1
            end
        end
    end
end

We check that the expression is indeed evaluated three times.

julia> @hygienic_dotimes(3, begin i = 2; println("Printed thrice.")
       end)
Printed thrice.
Printed thrice.
Printed thrice.

The mechanism built into Julia to facilitate the definition of hygienic macros
is called esc. It is best illustrated by a simple example. We first define a global
variable and a local variable in the expression returned by the macro, which
both have the same name foo but different values. The macro returns foo and
$(esc(foo)).

global foo = 0

macro escaped()
    quote
        local foo = 1
        (foo, $(esc(foo)))
    end
end

After calling the macro, we observe that foo evaluates to 1 and the escaped
variable evaluates to 0. (When calling a macro with no arguments, no parentheses
are required.)
julia> @escaped
(1, 0)

This result shows that foo refers to the local variable of the same name as
expected, while $(esc(foo)) escapes the quote block and refers to the global
variable.
Next, we have a look at the macro expansion, which is very instructive. (Comments
have been deleted.)
julia> @macroexpand @escaped
quote
    local var"#10#foo" = 1
    (var"#10#foo", 0)
end

The expansion shows Julia's hygiene mechanism at work. All local variables
are renamed to new unique names in order to prevent unintended variable capture.
In escaped expressions, these substitutions are not performed, however.
Therefore $(esc(foo)) can escape the quote block, and the value 0 of the global
variable called foo at macro-expansion time is used. This explains the output
(1, 0).
In summary, esc is only valid in expressions returned from a macro and
prevents renaming embedded variables into hygienic variables generated by gensym.
Knowing esc, we return to the dotimes macro and present its idiomatic
version in Julia. (Using a for loop is possible as well, of course, but would not allow
us to explain variable capture.)
macro escaped_dotimes(n::Integer, expr::Expr)
    @assert n >= 0

    quote
        let i = 0
            while i < $n
                $(esc(expr))
                i += 1
            end
        end
    end
end

Calling the macro shows that it works correctly.


julia> @escaped_dotimes(3, begin i = 2; println("Printed thrice.")
       end)
Printed thrice.
Printed thrice.
Printed thrice.

Expanding the macro illustrates the substitution of variables in the quoted ex-
pression by gensym versions except for the escaped variables. (Comments have
been deleted.)

julia> @macroexpand @escaped_dotimes(3,
           begin
               i = 2
               println("Printed thrice.")
           end)
quote
    let var"#13#i" = 0
        while var"#13#i" < 3
            begin
                i = 2
                println("Printed thrice.")
            end
            var"#13#i" += 1
        end
    end
end

This is all there is to defining macros in Julia.

7.4 Two Examples: Repeating and Collecting

In this section, we discuss two more examples of macro definitions. The first
example is called @repeat. Its purpose is to take an expression and a condition
and to repeat the expression until the condition is satisfied, just as repeat state-
ments in other programming languages do. Since Julia usually has more syntactic
sugar than Lisp, we require the second argument of our @repeat macro to be the
symbol until. This also allows us to illustrate that the corresponding check is
evaluated when the macro is expanded. We substitute the values of the variables
expr and condition into the right places in a while loop, and we employ the escape
mechanism to make the macro hygienic. Therefore our @repeat macro looks
like this.
macro repeat(expr::Expr, until::Symbol, condition::Expr)
    if until != :until
        error("malformed call of @repeat")
    end

    quote
        while true
            $(esc(expr))
            if $(esc(condition))
                break
            end
        end
    end
end

Next we use the macro by defining a local variable to serve as an iteration
counter. Evaluating the following expression returns an error, since the check at
the beginning fails when the macro is expanded.

julia> let i = 0
           @repeat begin
               i += 1
               @show i
           end untill i >= 3
       end
ERROR: LoadError: malformed call of @repeat

It works as expected if we spell until correctly.


julia> let i = 0
           @repeat begin
               i += 1
               @show i
           end until i >= 3
       end
i = 1
i = 2
i = 3

Again, it is instructive to inspect the macro expansion. When using the func-
tion macroexpand, we must specify the module (here Main) in whose context the
macro is evaluated and quote the expression we want to expand. When using
@macroexpand or @macroexpand1, the argument expression is not quoted. (Com-
ments in the macro expansion have been deleted.)
julia> macroexpand(Main, quote
           let i = 0
               @repeat begin
                   i += 1
                   @show i
               end until i >= 3
           end
       end)
quote
    let i = 0
        begin
            while true
                begin
                    i += 1
                    begin
                        Base.println("i = ", Base.repr(begin
                                    var"#1#value" = i
                                end))
                        var"#1#value"
                    end
                end
                if i >= 3
                    break
                end
            end
        end
    end
end

The second example in this section is called @collect. It wraps an expression,
often a loop, in which remember can be used to collect values, which are returned
by @collect. Such a feature is available in the loop macro in Common Lisp and
is useful, for example, when the number of values collected in each iteration of
a loop is unknown beforehand. A simple example of its usage is the following.
julia> import Primes
julia> @collect for i in 1:10
           if Primes.isprime(i)
               remember(i)
           end
       end
4-element Vector{Any}:
 2
 3
 5
 7

The definition of @collect looks like this.
macro collect(expr::Expr)
    quote
        let v = Vector()
            function $(esc(:remember))(x)
                push!(v, x)
            end

            $(esc(expr))

            v
        end
    end
end

Note that the function remember is accessible only within the argument expres-
sion that is passed to @collect; it is not a globally defined function and does not
pollute the global variable bindings.
It is instructive to macroexpand the example above once.
julia> @macroexpand1 @collect for i in 1:10
           if Primes.isprime(i)
               remember(i)
           end
       end
quote
    let var"#2#v" = Main.Vector()
        function remember(var"#4#x")
            Main.push!(var"#2#v", var"#4#x")
        end
        for i = 1:10
            if Primes.isprime(i)
                remember(i)
            end
        end
        var"#2#v"
    end
end

7.5 Memoization

In this section, we implement the optimization technique of memoization as a
macro called @memoize. Memoization trades run time for memory by storing the
results of (computationally expensive) function calls and returning the cached
results whenever possible.
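Before automating the technique with a macro, the underlying pattern can be written by hand as a cache captured in a closure; the following is an illustrative sketch of that pattern, not the macro developed below.

```julia
# Hand-written memoization: a Dict cache captured in a closure.
# Each Fibonacci number is computed at most once and then reused.
memo_fib = let cache = Dict{Int,Int}()
    function fib(n::Int)
        haskey(cache, n) && return cache[n]
        cache[n] = n <= 1 ? n : fib(n - 2) + fib(n - 1)
    end
end

memo_fib(40)  # 102334155, computed with only about 40 recursive calls
```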

The @memoize macro is defined in such a way that it is straightforward to
use. In order to memoize a function, we only have to write @memoize in front
of its definition; in other words, the only argument of the @memoize macro is a
function definition (which is an expression).
macro memoize(fun::Expr)
    local call = fun.args[1]
    local name = call.args[1].args[1]
    local arg1 = call.args[1].args[2]
    local arg1_name = arg1.args[1]
    local arg1_type = arg1.args[2]
    local return_type = call.args[2]
    local body = fun.args[2]

    quote
        let cache = Dict{$(esc(arg1_type)), $(esc(return_type))}()
            global function $(esc(name))($(esc(arg1_name))::
                    $(esc(arg1_type)))::$(esc(return_type))
                if haskey(cache, $(esc(arg1_name)))
                    cache[$(esc(arg1_name))]
                else
                    cache[$(esc(arg1_name))] = $(esc(body))
                end
            end
        end
    end
end

About half of the work that the macro performs is spent on parsing the function
definition. We look for the name of the function, its first argument, the type of its
first argument, the return type, and the body of the function. You can use dump
to view all these expressions and make sense of the meaning of their parts.
The macro returns a quoted expression as usual. The definition of the memo-
ized function is encapsulated within a closure (see Sect. 3.4) created by let. The
cache variable contains a Dict with keys that have the type of the function argu-
ment and with values that have the type of the return value. Within this closure,
the memoized function is defined. We have to write global before the function
definition, because otherwise the function would only be defined locally within
the closure and thus inaccessible and useless.
The function signature consists of the parsed function name, argument, ar-
gument type, and return type. The memoized function itself is simple. It checks
whether the cache contains the argument of the memoized function as a key.
If it does, the cached value is returned. If it does not, the escaped body of the
function is evaluated and the resulting value is stored in the cache. Since the
if expression is the last expression in the function, one of these two values is
returned.

The Fibonacci sequence serves as a good example. We have already seen in
Sect. 2.1 that caching the return values is a key strategy for the fast calculation of
Fibonacci numbers, but now we can fully automate this idea by just prepending
the function definition with @memoize. The function definition we use here takes
a BigInt and returns a BigInt.
@memoize function fib(n::BigInt)::BigInt
    if n <= 1
        n
    else
        fib(n-2) + fib(n-1)
    end
end

Of course, the cache is reset to an empty one whenever this definition
is evaluated. It is very instructive to view the expansion of the macro using
@macroexpand1, for example.
Just after defining the memoized function, we can easily observe the speedup
by calculating the same Fibonacci number twice.

julia> @time fib(BigInt(10_000));
  0.007028 seconds (80.01 k allocations: 5.826 MiB)
julia> @time fib(BigInt(10_000));
  0.000004 seconds (2 allocations: 40 bytes)

Some extensions of this macro are the subject of Problems 7.6, 7.7, and 7.8.

7.6 Built-in Macros

Tables 7.1 and 7.2 summarize the built-in macros. Since macros are code trans-
formations, some of the more advanced or extravagant features of the Julia lan-
guage can be found in these two tables, and some of them are explained in the
following in more detail.
The first group of macros to be discussed in more detail are the ones whose
names end in _str. These macros create string literals (already mentioned in
Sect. 4.2.4), which are a mechanism to create objects from a textual represen-
tation. The part of the name before _str indicates the type of the object to be
created. For example, the macros @int128_str and @uint128_str return an
Int128 and a UInt128, respectively, and the @big_str macro returns a BigFloat
or BigInt depending on whether the string contains a decimal point or not. For
example, big"1.2" returns a BigFloat and big"1" returns a BigInt, as is easily
checked by typeof(big"1.2") and typeof(big"1").
The usefulness of these macros is much increased by the syntactic rule that
@name_str "…" is equivalent to name"…". For example, v"1.2.3" returns the
same VersionNumber object as @v_str "1.2.3".
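These rules are easy to verify at the REPL; a few quick checks:

```julia
# name"..." is rewritten by the parser into the call @name_str "...".
println(typeof(big"1.2"))            # BigFloat (string contains a decimal point)
println(typeof(big"1"))              # BigInt
println(r"\d+" isa Regex)            # true
println(v"1.2.3" isa VersionNumber)  # true
```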

Table 7.1 Built-in macros: parsing, documentation, output, profiling, tasks, metaprogram-
ming, and performance annotations.
Macro                   Description
@__DIR__                directory of the file containing the macro call
                        or the current working directory
@__FILE__               file containing the macro call or an empty string
@__LINE__               line number of the location of the macro call or 0
@__MODULE__             module of the toplevel eval
@cmd string             generate a Cmd object from string
@int128_str string      parse string into an Int128
@uint128_str string     parse string into a UInt128
@big_str string         parse string into a BigInt or a BigFloat
@b_str string           create an immutable UInt8 vector
@r_str string           create a Regex (regular expression)
@s_str string           create a substitution string for regular expressions
@v_str string           parse string into a VersionNumber
@raw_str string         create a raw string without interpolation and unescaping
@MIME_str string        parse string into a MIME type
@text_str string        parse string into a Text object
@html_str string        parse string into an HTML object
@doc                    retrieve documentation for a function, macro, or other object
@show expr              print and return the expression expr
@time expr              return the value of expr after printing timing and allocation
@timed expr             return the value of expr together with allocation information
@timev expr             verbose version of the @time macro
@elapsed expr           return the number of seconds it took to evaluate expr
@allocated expr         return the total number of bytes allocated while evaluating expr
@sync                   wait until all lexically enclosed Tasks have completed
@async                  wrap an expression in a Task and add it to the scheduler
@task expr              create a Task from expr
@threadcall             similar to ccall, but in a different thread
@macroexpand expr       fully (recursively) expand the macros in expr
@macroexpand1 expr      expand expr non-recursively (only once)
@generated              annotate a function which will be generated
@gensym                 generate a symbol for a variable
@eval [mod] expr        evaluate expr (in Module mod if given)
@deprecate old new      mark function as deprecated
@boundscheck expr       annotate the expression allowing it to be elided by @inbounds
@inbounds expr          eliminate checking of array bounds within expr
@fastmath expr          use fast math operations; strict IEEE semantics may be violated
@simd for ... end       annotate a for loop to allow more re-ordering
@inline                 hint that the function is worth inlining
@noinline               prevent the compiler from inlining a function
@nospecialize           hint that the method should not be specialized for different types
@specialize             reset specialization hint for an argument back to the default
@polly                  tell the compiler to apply the optimizer Polly to a function

Table 7.2 Built-in macros: errors etc., compiler, and miscellaneous macros.
Macro                   Description
@debug                  create a log record with a debug message
@info                   create a log record with an informational message
@warn                   create a log record with a warning message
@error                  create a log record with an error message
@logmsg                 general way to create a log record
@code_llvm,             evaluate the arguments of the function or macro call,
@code_lowered,          determine their types, and
@code_native,           call the corresponding function
@code_typed, and        on the resulting expression
@code_warntype          (see text for more explanations)
@__dot__ expr and       convert every function call, operator, and assignment
@. expr                 into a "dot call" (f into f. etc.)
@assert cond            throw an AssertionError if cond is false
@cfunction              generate a C-callable function pointer from a Julia function
@edit                   call the edit function
@enum                   create an enum subtype
@evalpoly               evaluate a polynomial efficiently using Horner's method
@functionloc            return the location of a method definition
@goto name              unconditionally jump to the location denoted by @label name
@label name             label a destination for @goto
@isdefined var          tests whether a variable var is defined in the current scope
@less                   shows source code (using less) for a function or macro call
@static expr            partially evaluate the expression expr at parse time
@view A[...]            create a SubArray from the indexing operation
@views expr             convert all array-slicing operations in expr to return a view
@which                  return the Method that would be called
                        for a given function or macro call with given arguments

If you are familiar with the finer points of Common Lisp macros, you will
have noticed that this mechanism plays the role of reader macros in Common
Lisp. The mechanism in Julia that translates macro calls of the form name"…"
to macro calls of the form @name_str "…" is general in the sense that you can
define your own translations. This is useful when you want to construct objects
from a textual representation, for example while reading constants in a program
or while parsing data files. An example is the following. We first define a data
structure and then a macro to convert strings into such a data structure.
struct Interval
    a::Float64
    b::Float64
end

macro i_str(s::String)
    local comma = findfirst(",", s)[1]
    local a = parse(Float64, s[1:comma-1])
    local b = parse(Float64, s[comma+1:end])

    @assert a <= b

    Interval(a, b)
end

Now we can create intervals easily from a straightforward string interpretation,
while the input is checked as well.

julia> i"1.0, 2.0"
Interval(1.0, 2.0)

The group of macros @time, @timed, @timev, @allocated, and @elapsed re-
turn information about memory usage and evaluation time of an expression.
The @allocated macro discards the resulting value and returns the total num-
ber of bytes allocated during evaluation. Analogously, the @elapsed macro dis-
cards the resulting value and returns the number of seconds the evaluation took.
@time evaluates an expression and returns its value after printing the time it
took to evaluate, the number of allocations, and the total number of bytes allo-
cated. @timed returns multiple values (that can be used in the program instead
of just being printed): the return value of the expression, the elapsed time, the
total number of bytes allocated, the garbage collection time, and an object with
various memory allocation counters. @timev is a more verbose version of @time.
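Since @timed returns its measurements as data rather than printing them, they can be used programmatically; a small sketch (in recent Julia versions, @timed returns a named tuple with these fields):

```julia
stats = @timed sum(1:1_000_000)

println(stats.value)  # 500000500000, the result of the timed expression
println(stats.time)   # elapsed time in seconds
println(stats.bytes)  # total number of bytes allocated
```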
The group of macros @async, @sync, and @task as well as @distributed,
@spawn, and @spawnat are discussed in Sect. 6.6 and Sect. 6.7.
The next two macros we discuss in more detail are @boundscheck and
@inbounds. The @inbounds macro skips range checks in its argument expres-
sion in order to improve performance when referencing array elements. The user
must guarantee that all bounds checks after a call to @inbounds are satisfied. The
canonical example of its usage is within a for loop when many array elements
are referenced. One should be careful when using it; if an illegal array reference
is made, incorrect results, corrupted memory, or program crashes may result.
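A typical use is a loop whose index range is known to be legal; a minimal sketch:

```julia
function sum_inbounds(v::Vector{Float64})
    s = 0.0
    for i in 1:length(v)
        # The loop runs over exactly the legal indices of v, so the
        # bounds check on v[i] can safely be skipped.
        @inbounds s += v[i]
    end
    s
end

sum_inbounds([1.0, 2.0, 3.0])  # 6.0
```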
The @boundscheck macro makes it possible to use @inbounds in your own
functions, but you can use @boundscheck only within inlined functions. The
@boundscheck macro marks the following expression as a bounds check, which
is elided when the inlined function is called after @inbounds.
The next family of macros consists of @code_llvm, @code_lowered,
@code_native, @code_typed, and @code_warntype. These five macros make it
easy to watch the compiler at work and are useful when you want to optimize
a function at the assembler level. The first macro, @code_llvm, shows the com-
piler output. It evaluates the arguments of a function or macro call, determines
the types of the arguments, and calls the function code_llvm on the resulting
expression. The @code_llvm macro also takes a few keyword arguments.
Maybe the simplest example is the following. What happens when we ask
Julia to evaluate 2+2?

julia> @code_llvm 2+2
;  @ int.jl:87 within `+`
define i64 @"julia_+_117"(i64 signext %0, i64 signext %1) #0 {
top:
  %2 = add i64 %1, %0
  ret i64 %2
}

The output shows the assembler code for the method specialized for two argu-
ments of type Int64 (i64) for the generic function +. In assembler, the method
for this particular argument signature consists of a call of add and a call of ret (re-
turn). This example shows that Julia generates highly efficient code for known
argument types.
It is often more interesting to disassemble your own function. We define a
simple (generic) function first without specifying any types.

function times_two(x)
    2*x
end

Then we apply @code_llvm to a call of our function with an Int64 argument.
This means that the Julia compiler generates code for a method specialized for
this particular argument type.

julia> @code_llvm times_two(1)
;  @ REPL[2]:1 within `times_two`
define i64 @julia_times_two_137(i64 signext %0) #0 {
top:
;  @ REPL[2]:2 within `times_two`
; ┌ @ int.jl:88 within `*`
   %1 = shl i64 %0, 1
; └
  ret i64 %1
}

We see that the assembler code consists of two instructions for an Int64 (i64) ar-
gument. The first is a call to shl (shift left), the fastest way to multiply an integer
given in its binary representation by two. The second is a call to ret (return).
Next we apply @code_llvm to a call of our function with a Float64 (double)
argument.

julia> @code_llvm times_two(1.0)
;  @ REPL[2]:1 within `times_two`
define double @julia_times_two_139(double %0) #0 {
top:
;  @ REPL[2]:2 within `times_two`
; ┌ @ promotion.jl:380 within `*` @ float.jl:405
   %1 = fmul double %0, 2.000000e+00
; └
  ret double %1
}

The assembler code for this method consists again of two instructions, but
the multiplication instruction is different now. The instruction fmul (floating-
point multiplication) is applied to the argument and the constant Float64 value
2.000000e+00. The resulting value is returned by ret.
These two examples show that the generic function x -> 2x is compiled into
a single assembly instruction in both cases and that the special instruction for the
argument type is used. Therefore Julia is capable of compiling programs into
highly efficient code. The Julia compiler also inlines functions automatically
(see the macros @inline and @noinline below). The only drawback of generat-
ing specialized code for all argument signatures (and of inlining functions) is
increased code size, which makes cache misses more likely, which slows down
modern processors. But in summary, it is very unlikely that you will need to re-
sort to lower-level languages than Julia for performance reasons.
The next macro, @code_lowered, returns arrays of CodeInfo objects contain-
ing the lowered forms for the methods matching the given method and its type
signature.
The third macro in this family, @code_native, is similar to @code_llvm, but
instead of showing the instructions used by the LLVM compiler framework, the
native instructions of the processor you are using are shown.
The next macro, @code_typed, is similar to @code_lowered, but shows type-
inferred information.
The last macro in this family, @code_warntype, prints lowered and type-in-
ferred abstract syntax trees for the given method and its type signature. The out-
put is annotated in color (if available) to give warnings of potential type insta-
bilities, i.e., variables whose types may change during evaluation are marked.
These annotations may be related to operations for which the generated code is
not optimal. This macro is especially useful for optimizing functions.
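A function whose type depends on a runtime value is a typical candidate; in the hypothetical example below, @code_warntype unstable(true) marks x and the return value as a Union of Int64 and Float64:

```julia
# Deliberately type-unstable: x is an Int64 on one branch and a
# Float64 on the other, so the compiler cannot infer a single type.
function unstable(flag::Bool)
    x = flag ? 1 : 1.0
    x + 1
end

unstable(true)   # 2
unstable(false)  # 2.0
# @code_warntype unstable(true)   # highlights the Union type in the output
```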
These five macros have sister functions: code_llvm, code_lowered,
code_native, code_typed, and code_warntype.
The four macros @debug, @info, @warn, and @error are the recommended way
to communicate debugging output, informational messages, warning messages,
and errors to users of your program (see Sect. 6.5.3).
The @doc macro, already mentioned in Sect. 1.3.3, is highly useful at the REPL
to retrieve documentation not only about built-in functions, macros, and types,
but also about user-defined ones if a documentation string was included.
The @enum macro makes it possible to define enumeration types.
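A minimal sketch of @enum (the type and instance names are chosen only for illustration):

```julia
# Define an enumeration type Direction with four instances;
# by default the instances are numbered 0, 1, 2, 3.
@enum Direction north east south west

println(Int(east))             # 1
println(north isa Direction)   # true
println(instances(Direction))  # (north, east, south, west)
```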
The @generated macro, used before a function definition, defines so-called
generated functions. Generated functions are a generalization of the multiple
dispatch we know from generic functions. The body of a generated function has
access only to the types of the arguments, but not to their values, and a generated
function must return a quoted expression like a macro. They differ from macros,
because generated functions are expanded after the types of the arguments are
known, but before the function is compiled, while macros are expanded at read
time and cannot access the types of their arguments. Generated functions are a
seldom-used feature.
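A small sketch illustrates the type-only dispatch: inside the body, the argument name is bound to the argument's type, and the returned quoted expression is compiled for that type.

```julia
@generated function twice(x)
    # Here x is the *type* of the argument, e.g. Int64 or String.
    if x <: Number
        return :(2x)     # in the returned expression, x is the runtime value
    else
        return :(x * x)  # for strings, * is concatenation
    end
end

twice(3)     # 6
twice("ab")  # "abab"
```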
Hints about inlining functions can be given to the compiler using the two
macros @inline and @noinline. Inlining is a form of optimization that replaces
calls of the function to be inlined with the code of the function itself within its
caller. The advantage is that the overhead of passing and returning arguments is
eliminated; on the other hand, the disadvantage is increased code size. Usually,
only small and often-called functions are inlined by the compiler. The macros
@inline and @noinline are written just before function.
The @less macro is very useful to show the source code of a method. For
example, @less 2+2 shows the file int.jl, which is part of the implementation
of Julia.
The @simd macro annotates a for loop and allows the compiler to perform
more loop reordering, although the compiler is already able to automatically
vectorize inner for loops. The for loop must satisfy a few conditions when @simd
is to be used. SIMD (single instruction, multiple data) instructions are available on
most modern processors; a SIMD instruction is executed in parallel on multiple
data as opposed to executing multiple instructions.
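A loop satisfying those conditions (no early exit, iterations that are independent apart from a reassociable reduction) can be annotated like this; a sketch:

```julia
function simd_sum(v::Vector{Float64})
    s = 0.0
    @simd for i in eachindex(v)
        # @inbounds removes the bounds checks, and @simd allows the
        # reduction to be reordered into vector instructions.
        @inbounds s += v[i]
    end
    s
end

simd_sum(collect(1.0:100.0))  # 5050.0
```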
The macros @specialize and @nospecialize make it possible to exert some
control over whether the compiler should generate code for methods with certain
argument signatures or not. If @nospecialize appears in front of an argument in
a method, it gives a hint to the compiler that the method should not be special-
ized for different types of this argument, thus avoiding excess code generation.
The @nospecialize macro can also appear in the function body before any other
code. Furthermore, it can be used without arguments in the local scope of a func-
tion and then applies to all arguments of the function. It can also be used without
arguments in the global scope and then applies to all methods subsequently de-
fined in the module. The @specialize macro resets the hint back to the default
when used in the global scope.
The @static macro partially evaluates the following expression at read time.
This is useful, for example, to define functions or values that are system specific.
A simple example is the following.

@static if Sys.isapple() || Sys.islinux()
    function isunix()
        true
    end
end

A more interesting example is calling functions (for example using ccall) that
only exist on certain systems.
The @view macro creates a SubArray from an array and an indexing expres-
sion into the array. A simple example of its usage is the following.

julia> A = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4
julia> b = @view A[1, :]
2-element view(::Matrix{Int64}, 1, :) with eltype Int64:
 1
 2
julia> b[1] = 0; A
2×2 Matrix{Int64}:
 0  2
 3  4

The macro @views applies @view to every array-indexing expression in the given
expression.

7.7 Bibliographical Remarks

Homoiconicity and macros have been part of Lisp [3] since its inception. The
metaprogramming and macro features of Common Lisp, the most modern and
standardized [1] Lisp dialect, are second to none and provided inspiration to the
metaprogramming facilities in Julia.

Problems

7.1 Use @macroexpand or macroexpand to expand all three versions of the
dotimes macro in Sect. 7.3 and discuss the significance of all local variables.

7.2 Modify the @collect macro such that the user can specify the element type
of the vector that is returned.

7.3 (Unless) Macros make it possible to define new control structures that are
indistinguishable from built-in ones apart from the at sign. Write a macro called
@unless which takes two arguments, namely a condition and an expression, and
that evaluates the expression only if the condition is false.

7.4 (Anaphoric macro) Anaphoric macros [2, Chapter 14] are macros that de-
liberately capture an argument of the macro which can later be referred to by an
anaphor. (In linguistics, an anaphor is the use of an expression whose interpre-
tation depends on another (usually previous) expression.)
Write a macro called @if_let that takes four arguments, namely a Dict, a key
(of arbitrary type), and two expressions. If the Dict contains the key, its value is
bound to the local variable it within the first expression, which is evaluated; if
the key cannot be found, the second expression is evaluated.

7.5 (Case) Julia does not come with a switch expression that is common
in other programming languages. Implement four macros modeled after their
counterparts in Common Lisp.
1. The macro @case takes a value and evaluates the clause that matches the
value. A default clause may be given.
2. The macro @ecase is analogous to @case, but it takes no default clause and
it raises an error if no clause matches.
3. The macro @typecase takes a value and evaluates the clause that matches
the type of the given value. A default clause may be given.
4. The macro @etypecase is analogous to @typecase, but it takes no default
clause and it raises an error if no clause matches.

7.6 (Memoization for unspecified types)
In the definition of the @memoize macro in Sect. 7.5, the argument and return
types are parsed in order to be able to use a Dict for these types.
Generalize the @memoize macro and make it more robust (still for functions
with a single argument) by checking whether the argument and return types
are specified or not. If they are not, use the Any type instead of the argument or
return type (or both) in the Dict that acts as the cache.

7.7 (Empty memoization cache)
Extend the @memoize macro such that it additionally defines a function called
name_empty_cache! (when defining a memoized function called name) that
empties the cache.

7.8 (Memoization for two arguments)
Generalize the @memoize macro to functions with two arguments.

7.9 (Quine) Write a quine in Julia.

References

1. American National Standards Institute (ANSI), Washington, DC, USA: Programming Lan-
guage Common Lisp, ANSI INCITS 226-1994 (R2004) (1994)
2. Graham, P.: On Lisp. Prentice Hall (1993)
3. McCarthy, J.: LISP 1.5 Programmer’s Manual. The MIT Press (1962)
Chapter 8
Arrays and Linear Algebra

Should array indices start at 0 or 1?
My compromise of 0.5 was rejected
without, I thought, proper consideration.
—Stan Kelly-Bootle

Abstract Arrays are a multi-dimensional data structure and hence encompass
the mathematical structures of vectors, matrices, and tensors. An important ap-
plication of arrays is the numerical implementation of linear algebra. In certain
applications, the majority of the entries of vectors or matrices are zero; in these
situations, the sparse versions should be used instead of the dense ones. Opera-
tions on dense and sparse arrays are discussed in detail in this chapter, including
those from linear algebra such as solving systems of linear equations, calculat-
ing the eigenvalues and eigenvectors of matrices, singular-value decomposition,
and matrix factorizations.

8.1 Dense Arrays

8.1.1 Introduction

An array is a collection of objects called elements that can be referenced accord-
ing to a rectilinear coordinate system. All elements have the same type, but the
type of the elements may be arbitrary; in the most general case, the elements are
of type Any. Usually, the type of the elements is more specific and better suited
for efficient numerical calculations. When performing calculations known from
linear algebra, the elements are usually floating-point numbers.
When arrays are passed as function arguments (see Sect. 2.2), they are passed
by reference. This is due to performance reasons, as passing by reference is much

© Springer Nature Switzerland AG 2022 153


C. Heitzinger, Algorithms with JULIA,
https://doi.org/10.1007/978-3-031-16560-3_8

more efficient than passing by value, since the arrays passed as arguments do
not have to be copied. The disadvantage is that any modifications made by the
function called persist and are then seen by the caller.
It is important when designing programs that one keeps these advantages
and disadvantages in mind. Destructively modifying arrays often results in large
performance gains at the expensive of programs whose control and data flow is
harder to understand. The convention that the names of functions that destruc-
tively modify any of their arguments end in an exclamation mark Р is particularly
useful in this context. On the other hand, functions that never modify their (ar-
ray) arguments make it easier to reason about the data flow, but are generally
less efficient when large data structures must be copied.
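The convention can be seen in a small sketch (the function name is chosen only for illustration): the caller's array is modified in place, and the trailing exclamation mark signals this.

```julia
# scale! destructively multiplies every element of v by c.
function scale!(v::Vector{Float64}, c::Float64)
    v .*= c      # in-place broadcast; no copy of the array is made
    return v
end

x = [1.0, 2.0, 3.0]
scale!(x, 2.0)
println(x)   # [2.0, 4.0, 6.0]: the caller sees the modification
```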

8.1.2 Construction, Initialization, and Concatenation

The basic syntax to construct arrays is square brackets. The type of the array
elements may be specified before the opening square bracket; if it is not, it is
inferred from the elements given.
julia> [1]
1-element Vector{Int64}:
 1
julia> Int8[1]
1-element Vector{Int8}:
 1
julia> [1.0]
1-element Vector{Float64}:
 1.0
julia> [1, 2.0]
2-element Vector{Float64}:
 1.0
 2.0
julia> Float32[1]
1-element Vector{Float32}:
 1.0

Furthermore, within the square brackets, the elements may be separated by
spaces, commas, or semicolons. These separators have different meanings and
result in different types of arrays being constructed.
In mathematics, there is a difference between vectors (elements of ℝ^n), col-
umn vectors (elements of ℝ^(n×1)), and row vectors (elements of ℝ^(1×n)). Mathemat-
ical vectors are implemented as Vectors, i.e., one-dimensional Arrays. Mathe-
matical column and row vectors are two-dimensional Arrays, whose second or
first dimension has length one.

The following examples illustrate the cases that may occur. One-dimensional
Arrays, i.e., Vectors, are constructed if the elements are separated by commas
only or by semicolons only.

julia> [1, 2]
2-element Vector{Int64}:
 1
 2
julia> isa([1, 2], Vector)
true
julia> [1; 2]
2-element Vector{Int64}:
 1
 2
julia> isa([1; 2], Vector)
true

If the elements are separated by spaces only, a row vector is constructed.

julia> [1 2]
1×2 Matrix{Int64}:
 1  2

If the elements are separated by spaces and semicolons, then a two-dimensional
array is constructed, whereby the semicolons separate the rows.

julia> [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

Vectors, i.e., one-dimensional Arrays, are treated as column vectors when-
ever an implicit conversion is expedient. The first example is taking the conju-
gate transpose (adjoint) of a vector, which yields a row vector.

julia> adjoint([1, 2])
1×2 adjoint(::Vector{Int64}) with eltype Int64:
 1  2

The conjugate transpose can be written more conveniently using the postfix op-
erator '.

julia> [1im, 2im]'
1×2 adjoint(::Vector{Complex{Int64}}) with eltype Complex{Int64}:
 0-1im  0-2im
julia> [1im, 2im]' * [1im, 2im]
5 + 0im

The conjugate transpose of the 1 × 2 array [1im 2im], i.e., a row vector, is a 2 × 1 array, i.e., a column vector, as expected.

julia> [1im 2im]'
2×1 adjoint(::Matrix{Complex{Int64}}) with eltype Complex{Int64}:
 0 - 1im
 0 - 2im

The usefulness of treating one-dimensional Arrays, i.e., Vectors, as column vectors is also seen in the second example, matrix-vector multiplication. Recall that [1, 2] and [1; 2] are Vectors.

julia> [1 2; 3 4] * [1, 2]
2-element Vector{Int64}:
  5
 11

julia> [1 2; 3 4] * [1; 2]
2-element Vector{Int64}:
  5
 11

On the other hand, trying to multiply the matrix by the 1 × 2 array [1 2] yields an error, as expected.
In summary, the vector and matrix operations in Julia follow the conven-
tions of linear algebra, and one-dimensional arrays of length 𝑛 are interpreted
as column vectors of size 𝑛 × 1 for convenience.
Basic operations to query multi-dimensional arrays about their properties are
summarized in Table 8.1.

Table 8.1 Basic operations on arrays.

Function       Description
eltype(A)      return the type of the elements of A
length(A)      return the number of elements in A
ndims(A)       return the number of dimensions of A
size(A)        return a tuple with all the dimensions of A
size(A, n)     return the size of A along dimension n
eachindex(A)   return an iterator for iterating over each element of A
stride(A, n)   return the number of elements (in memory)
               between adjacent elements along dimension n
strides(A)     return a tuple with the strides along all dimensions of A

In addition to the syntax using square brackets to construct vectors and arrays,
there is a number of functions to construct, initialize, and fill multi-dimensional
arrays. Table 8.2 provides an overview of the functions to construct and initialize
arrays.
The function reshape is useful to interpret the elements of a given array as an array of a different shape, i.e., with different dimensions or with a different number of dimensions. The elements of the underlying array are not changed by

Table 8.2 Operations for constructing and initializing dense arrays.

Function                     Description
Array{type}(undef, dims...)  construct an uninitialized n-dimensional array
                             containing elements of type
Matrix(I, n, m)              construct a matrix with ones in the main diagonal
zeros([type,] dims...)       construct an array of zeros of type (default Float64)
ones(type, dims...)          construct an array of ones of type (default Float64)
falses(A)                    construct a BitArray containing only false
trues(A)                     construct a BitArray containing only true
copy(A)                      create a shallow copy of A
deepcopy(A)                  create a deep copy of A
similar(A[, type][, dims])   construct an uninitialized array similar to A
                             with the specified element type and dimensions dims
reshape(A, dims) and         return the elements of A reshaped
reshape(A, dims...)          into the dimensions dims
reinterpret(type, A)         return an array with the same binary data as A,
                             but with element type type
fill(a, dims) and            construct an array with dimensions dims
fill(a, dims...)             filled with the values a
fill!(A, a)                  destructively fill the array A with the value a
rand([type,] dims) and       return an array with i.i.d., uniformly (in [0, 1))
rand([type,] dims...)        distributed random values of type (default Float64)
randn([type,] dims) and      return an array with i.i.d., standard normally
randn([type,] dims...)       distributed random values of type (default Float64)

reshape itself; however, the two arrays share the same elements, so that changing the elements of one array also affects the elements of the other. The new dimensions are specified as a tuple, whereby one dimension may be specified as : indicating that this dimension should be calculated to match the total number of the elements. In this example, we define a magic square.
julia> D = [16, 5, 9, 4, 3, 10, 6, 15, 2, 11, 7, 14, 13, 8, 12, 1];

julia> A = reshape(D, (4, :))
4×4 Matrix{Int64}:
 16   3   2  13
  5  10  11   8
  9   6   7  12
  4  15  14   1

julia> D[1] = -16; sum(A)/4
26.0

julia> D[1] = 16; sum(A)/4
34.0

The functions and the syntax for the concatenation of arrays are summarized
in Table 8.3. The syntactic expressions in the second column are just a conve-
nient way to call the functions in the first column.

Table 8.3 Operations for concatenating arrays.

Function               Syntax           Description
cat(A...; dims=dims)                    concatenate the arrays
                                        along the dimensions dims
vcat(A...)             [a; b; c; ...]   vertical concatenation
                                        (along the first dimension)
hcat(A...)             [a b c ...]      horizontal concatenation
                                        (along the second dimension)
hvcat(rows, elems...)  [a b; c d; ...]  horizontal and vertical concatenation
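A short sketch of these concatenation operations on two small Vectors:

```julia
a = [1, 2]
b = [3, 4]

vcat(a, b)         # the same as [a; b]: a 4-element Vector
hcat(a, b)         # the same as [a b]: a 2×2 Matrix
[a b; b a]         # hvcat: the blocks are arranged in rows, giving a 4×2 Matrix
cat(a, b; dims=2)  # the same as hcat(a, b)
```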

8.1.3 Comprehensions and Generator Expressions

Another way to construct arrays is comprehensions (cf. Sect. 3.5), whose syntax

type[expr for var1 in iterable1, var2 in iterable2, ...]

is similar to the array syntax and to mathematical set notation. The dots indicate an arbitrary number of iteration variables. The values of the iteration variables may be given by any iterable collection (see Sect. 4.5.2) such as ranges or vectors. The resulting array has the dimensions given by the dimensions of the collections specifying the iteration variables in order, and the elements are found by evaluating the expression expr, which often depends on the iteration variables. Specifying the element type type of the resulting array by prepending it to the array comprehension is optional. If the element type is not specified, it is determined automatically.
julia> sum([A[i, i] for i in 1:4])
34

julia> sum([A[5-i, i] for i in 1:4])
34

julia> sum([A[i, j] for i in 1:2, j in 1:2])
34

(This particular magic square has even more remarkable properties.)


Comprehensions are a convenient way to construct arrays with elements computed in a non-trivial manner. However, in situations when not the constructed array itself, but further computations on the array are of interest, generator expressions are advantageous, since they do not require the allocation of the intermediate array. Generator expressions are syntactically just comprehensions without the square brackets, although it is sometimes necessary to enclose the generator expression in parentheses in order to avoid syntactic ambiguity. Since all comma-separated expressions after the for keyword are interpreted as iteration variables, parentheses are sometimes required to distinguish the iteration variables from other arguments.
julia> (i for i in 1:10)
Base.Generator{UnitRange{Int64}, typeof(identity)}(identity, 1:10)

The following example shows how a generator expression is used inside sum. Changing the number of terms in the sum does not change the amount of memory allocated when using a generator.

julia> @time sqrt(6*sum(1/i^2 for i in 1:100_000_000)) - pi
  0.139757 seconds (97.91 k allocations: 5.038 MiB)
-8.607403234606181e-9

On the other hand, the amount of allocated memory grows linearly with the number of iterations when using a comprehension.

julia> @time sqrt(6*sum([1/i^2 for i in 1:100_000_000])) - pi
  0.300991 seconds (124.37 k allocations: 769.142 MiB, 3.04% gc time)
-9.549294688326881e-9

The iterations both in comprehensions and in generator expressions can be nested, i.e., the collections to be iterated over can depend on previous ones. If this is the case, the result is always a one-dimensional Array.

julia> [10i + j for i in 1:3 for j in 1:i]
6-element Vector{Int64}:
 11
 21
 22
 31
 32
 33

Another extension that is sometimes useful is filtering by using the if keyword. The expression is collected into the one-dimensional output array only if the condition after the if keyword is true.

julia> [10i + j for i in 1:3 for j in 1:i if mod(i + j, 2) == 0]
4-element Vector{Int64}:
 11
 22
 31
 33

8.1.4 Indexing and Assignment

There are various ways to retrieve a certain element or certain elements from an
array by indexing or to assign elements of an array by indexing. One indexing
syntax is to supply 𝑛 indices in square brackets after an 𝑛-dimensional array.
Another option is to index a (multi-dimensional) array by a single index, which
is then interpreted as a linear index. Each index may be

• a positive integer,
• a range of the form from:to or from:step:to,
• a colon :, which is the same as Colon(), to select the whole dimension,
• an array of positive integers including the empty array [], or
• an array of Bools.
The indexing syntax denotes an array or a single element of an array if all the indices are scalar integers. As part of the indexing syntax, the last valid index of each dimension can be specified by the keyword end. Hence, a colon is equivalent to 1:end, as the first index is always 1.
If any of the indices Ij, 𝑗 ∈ {1, … , 𝑛}, is not a scalar integer, but an array, then B = A[I1, I2, ...] becomes an array. The dimensions of the resulting array B are given by the dimensions of the indices Ij. Suppose that the index Ij is a 𝑑𝑗-dimensional array and drop the empty, 0-dimensional arrays that correspond to the scalar indices. Then the dimensions of B are size(I1, 1), ..., size(I1, 𝑑1), size(I2, 1), ..., size(I2, 𝑑2), ..., size(In, 1), ..., size(In, 𝑑𝑛). The resulting element

B[i1_1, ..., i1_𝑑1, i2_1, ..., i2_𝑑2, ..., in_1, ..., in_𝑑𝑛]

is the element

A[I1[i1_1, ..., i1_𝑑1], I2[i2_1, ..., i2_𝑑2], ..., In[in_1, ..., in_𝑑𝑛]]

of the original array A.


In this example, we extract the lower left subsquare from the magic square above.

julia> A[[3, 4], [1, 2]]
2×2 Matrix{Int64}:
 9   6
 4  15

julia> sum(ans)
34

We can extract the two innermost elements from the first and fourth row in this manner.

julia> A[[1, 4], [2, 3]]
2×2 Matrix{Int64}:
  3   2
 15  14

julia> sum(ans)
34

The shape of the resulting array is determined by the shape of the indices.

julia> A[3, [2 3; 1 4]]
2×2 Matrix{Int64}:
 6   7
 9  12

Indexing using a Boolean array B is also called logical indexing. If it is used, each Boolean array used as an index must have the same length as the dimension of A it corresponds to, or it must be the only index provided and have the same shape as A. In the second case, a one-dimensional array is returned. A logical index B acts as a mask and chooses the elements of A that correspond to true values in the index B.
This example uses two logical indices to select the four corner elements of the magic square.

julia> A[[true, false, false, true], [true, false, false, true]]
2×2 Matrix{Int64}:
 16  13
  4   1

julia> sum(ans)
34

The next example illustrates using a single logical index that has the same shape as the array.

julia> B = [false true true false; true false false true;
            true false false true; false true true false]
4×4 Matrix{Bool}:
 0  1  1  0
 1  0  0  1
 1  0  0  1
 0  1  1  0

julia> A[B]
8-element Vector{Int64}:
  5
  9
  3
 15
  2
 14
  8
 12

julia> sum(A[B])/2
34.0

Logical indexing is also useful in expressions such as D[D .<= 8]. First, the index D .<= 8 is a BitArray, which is then used to extract the subset of the one-dimensional array D for which the condition holds.
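A minimal sketch of this pattern, with the vector D defined inline:

```julia
D = [16, 5, 9, 4, 3, 10, 6, 15, 2, 11, 7, 14, 13, 8, 12, 1]

mask = D .<= 8   # a BitVector with the same length as D
D[mask]          # the elements of D satisfying the condition, in order
```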
The indexing syntax using square brackets is nearly equivalent to calling the function getindex. The only difference is that the end keyword, representing the last index in each dimension, can only be used inside square brackets. The end keyword can also be part of an expression as in this example.

julia> A[end-1:end, 1:div(end, 2)]
2×2 Matrix{Int64}:
 9   6
 4  15

The syntax for assigning values to elements of an 𝑛-dimensional array A is to use the indexing syntax on the left-hand side of an assignment, i.e.,

A[I1, I2, ..., In] = B.

Again, each index can be one of the five items mentioned at the beginning of this section.
If the right-hand side B is an array, the number of elements on the left- and on the right-hand sides must match. Then the element

A[I1[i1_1, ..., i1_𝑑1], I2[i2_1, ..., i2_𝑑2], ..., In[in_1, ..., in_𝑑𝑛]]

on the left-hand side is overwritten with the element

B[i1_1, ..., i1_𝑑1, i2_1, ..., i2_𝑑2, ..., in_1, ..., in_𝑑𝑛]

on the right-hand side. If the right-hand side B is not an array, then its value is written to all elements of A referenced on the left-hand side.
Analogously to getindex, the assignment A[I1, I2, ..., In] = B is equivalent to the function call setindex!(A, B, I1, I2, ..., In).
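A short sketch of assignment by indexing, illustrating both an array and a scalar right-hand side:

```julia
A = zeros(Int, 3, 3)

A[2:3, 2:3] = [1 2; 3 4]  # array right-hand side: element counts must match
A[1, :] .= 7              # scalar right-hand side: written to the whole row
setindex!(A, 9, 3, 1)     # equivalent to A[3, 1] = 9
```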

8.1.5 Iteration and Linear Indexing

Linear indexing means that the elements of an array are indexed by a single index that runs from one to the total number of elements in the array. As a linear index increases, the first dimension (i.e., the row) changes faster than the second dimension, and so forth. Fast linear indexing is generally available if the elements of an array are contiguous in memory. Linear indexing into an array is not always available, e.g., if the array is a view into another array.

julia> A[1], A[end]
(16, 1)

Iterating over all elements of an array A is simple.

for m in A
    @show m
end

It is also possible to use an index i to iterate over all elements of an array A.

for i in eachindex(A)
    @show i, A[i]
end

Here the index i is an Int if fast linear indexing is available for the type of A. If linear indexing is not available, the index i is, e.g., a CartesianIndex as in this example, which also shows how to create a view into an array (see Sect. 8.3). The row index changes faster than the column index.

julia> M = view(A, 3:4, 1:2)
2×2 view(::Matrix{Int64}, 3:4, 1:2) with eltype Int64:
 9   6
 4  15

julia> sum(M)
34

julia> for i in eachindex(M) @show (i, M[i]) end
(i, M[i]) = (CartesianIndex(1, 1), 9)
(i, M[i]) = (CartesianIndex(2, 1), 4)
(i, M[i]) = (CartesianIndex(1, 2), 6)
(i, M[i]) = (CartesianIndex(2, 2), 15)

8.1.6 Operators

Table 8.4 summarizes the most important operations on arrays. Operators without a dot . are operations on (whole) arrays or matrices, while operators with a dot . always act elementwise. For example, the equality operator == compares two arrays and returns a single Bool value, while the elementwise equality operator .== returns an array of the same shape as its arguments that contains the results of the elementwise comparisons.
In addition to this general rule, multiplication * acts elementwise when one argument is a scalar value, and the division operators / and \ act elementwise when the denominator is a scalar value.
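A short sketch of the difference between whole-array and elementwise operators:

```julia
x = [1, 2, 3]
y = [1, 2, 4]

x == y   # a single Bool comparing the whole arrays: false
x .== y  # elementwise comparison: a 3-element BitVector
2 * x    # multiplication by a scalar acts elementwise
x / 2    # division by a scalar denominator acts elementwise
```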
The left-division operator \ is popular for solving systems

𝐴𝐱 = 𝐛

of linear equations. Depending on the structure of the first argument, namely the matrix 𝐴, it chooses a suitable linear solver and returns the solution 𝐱.

julia> A = randn(2, 2); b = randn(2);

julia> A\b
2-element Vector{Float64}:
 3.6219606957004062
 2.2365962069070697

julia> A * (A\b) - b
2-element Vector{Float64}:
 -6.938893903907228e-17
  0.0

Table 8.4 Operations on arrays.

Type               Operator  Description
unary arithmetic   +, -      addition, subtraction
binary arithmetic  +, -      addition, subtraction
binary arithmetic  *         matrix multiplication
binary arithmetic  .*        elementwise multiplication
binary arithmetic  /         right division, 𝐴/𝐵 = 𝐴𝐵−1
binary arithmetic  \         left division, 𝐴\𝐵 = 𝐴−1 𝐵
binary arithmetic  ./        elementwise right division, (𝐴./𝐵)𝑖𝑗 = 𝑎𝑖𝑗 𝑏𝑖𝑗−1
binary arithmetic  .\        elementwise left division, (𝐴.\𝐵)𝑖𝑗 = 𝑎𝑖𝑗−1 𝑏𝑖𝑗
binary arithmetic  ^         matrix exponentiation
binary arithmetic  .^        elementwise exponentiation
assignment         =         assignment
assignment         .=        elementwise assignment
comparison         ==        equality
comparison         .==       elementwise equality
comparison         !=        inequality
comparison         .!=       elementwise inequality
comparison         .<, .>    elementwise <, >
comparison         .<=, .>=  elementwise ≤, ≥
unary Boolean      .!        elementwise Boolean not
binary Boolean     .&        elementwise Boolean and
binary Boolean     .|        elementwise Boolean or
unary bitwise      .~        elementwise bitwise not
binary bitwise     .&        elementwise bitwise and
binary bitwise     .|        elementwise bitwise or

8.1.7 Broadcasting and Vectorizing Functions

Vectorizing a function means to apply it to each element of an array to yield a new array. Vectorizing is thus just a synonym for applying the function elementwise or for mapping the function. The syntax for vectorizing any function f and applying it elementwise to any collection A is just f.(A).

julia> cos.((0:4) * pi/2)
5-element Vector{Float64}:
  1.0
  6.123233995736766e-17
 -1.0
 -1.8369701987210297e-16
  1.0

The effect of the syntax f.(A) for vectorizing can also be achieved by defining a method such as

f(A::AbstractArray) = map(f, A)

but it is more convenient to use the built-in syntax for vectorizing than to define methods for each generic function to be vectorized.

Broadcasting is a generalization of vectorization and is supported by the syntax f.(args...), which is equivalent to broadcast(f, args...). Broadcasting means that singleton dimensions in array arguments are expanded in order to match the corresponding dimensions in the other, larger array and that the function is then applied elementwise. No additional memory is required; eliminating the allocation of intermediary arrays is important for performance.
A leading example is adding a vector to the columns of a matrix, e.g., adding (17, 17, 17, 17)⊤ to minus the magic square A. Just using the subtraction - results in a dimension-mismatch error. Using the function broadcast is more convenient and efficient.

julia> broadcast(-, [17, 17, 17, 17], A)
4×4 Matrix{Int64}:
  1  14  15   4
 12   7   6   9
  8  11  10   5
 13   2   3  16

Broadcasting is performed by elementwise operations such as .+, .-, and .* automatically if necessary, so that we can also write [17, 17, 17, 17] .- A.
The broadcast function is even more general and also works on tuples. Any argument that is not an Array or a Tuple is treated as a scalar and broadcast.

julia> convert.(Int32, (0xf, 0xff, 0xfff, 0xffff, 0xff_fff))
(15, 255, 4095, 65535, 1048575)

Furthermore, the compiler guarantees to fuse nested vectorized function calls into a single broadcast loop, i.e., f.(g.(A)) is equivalent to

broadcast(a -> f(g(a)), A).

This implies that there is only a single loop iterating over the collection A and that only a single array is allocated for the result. The significance of this guarantee is that allocations of temporary arrays for intermediate results are avoided. Fusion of vectorized calls is not possible if non-vectorized function calls happen in between.
The destructive version of broadcast is called broadcast! as usual. Avoiding allocations of intermediary results is always important for performance, since it eliminates time for memory allocation and reduces the work load of the garbage collector. To avoid allocations, the output of a vectorized operation can be pre-allocated, which can be achieved by A .= RHS, which is equivalent to broadcast!(identity, A, RHS). Additionally, the outer call to broadcast! is fused with any vectorized calls in RHS if possible. The broadcast! function, i.e., the destructive version of broadcast, overwrites the first array with the result in place and thus eliminates an allocation for the result and possibly any intermediate results. The left-hand side of .= may also be an indexing expression; then broadcast! acts on a view.
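A brief sketch of pre-allocating the output of a fused, vectorized computation with .=:

```julia
x = collect(0.0:0.1:1.0)
out = similar(x)                 # allocate the output array once

out .= sin.(x).^2 .+ cos.(x).^2  # fused into a single loop; no temporary arrays
```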

8.2 Sparse Vectors and Matrices

Sparse vectors and sparse matrices are important types of vectors and matrices.
Their defining characteristic is that sufficiently many elements are zero so that
storing them in a special data structure is advantageous regarding execution time
and memory consumption. Special data structures and algorithms for sparse ma-
trices make calculations possible that could not be performed within reasonable
time or space requirements using dense vectors or matrices. An important example is given by discretizations of partial differential equations, especially in
higher spatial dimensions (see Chap. 10).
To use sparse vectors or matrices, the built-in module SparseArrays must be imported or used first.

julia> using SparseArrays

The two types SparseVector and SparseMatrixCSC have two parameters, namely the type of the (non-zero) elements and the integer type of column and row indices.
Sparse matrices are stored in the compressed-sparse-column (CSC) format. This format is especially efficient for calculating matrix-vector products and column slicing. On the other hand, accessing a sparse matrix stored in this format by rows is much slower. Furthermore, inserting non-zero values one at a time is slow, since all elements beyond the insertion point must be moved over.
Many functions pertaining to sparse vectors or matrices start with the prefix sp added to the names of the functions dealing with their dense counterparts. The simplest example is spzeros for creating empty sparse vectors and matrices, where the type of the elements can optionally be supplied.
where the type of the elements can optionally be supplied.
julia> spzeros(1000)
1000-element SparseVector{Float64, Int64} with 0 stored entries

julia> spzeros(1000, 1000)
1000×1000 SparseMatrixCSC{Float64, Int64} with 0 stored entries

julia> spzeros(BigInt, 1000, 1000)
1000×1000 SparseMatrixCSC{BigInt, Int64} with 0 stored entries

As mentioned above, inserting elements into a sparse vector or matrix one element at a time is slow due to the bookkeeping that must be performed. The recommended way to create a SparseVector or a SparseMatrixCSC with a sizeable number of non-zero elements is to use the functions sparsevec (to create a SparseVector) or sparse (to create a SparseMatrixCSC). Calling sparsevec as

s = sparsevec(i, v)

creates a SparseVector named s such that

s[i[k]] = v[k]

for all indices k. Analogously, calling sparse as

S = sparse(i, j, v)

creates a SparseMatrixCSC named S such that

S[i[k], j[k]] = v[k]

for all indices k. Here the vectors i and j contain the row and column indices of the non-zero elements, and the vector v contains the non-zero elements themselves.

julia> sparsevec([1, 10, 100], [1.0, 2.0, 3.0])
100-element SparseVector{Float64, Int64} with 3 stored entries:
  [1  ]  =  1.0
  [10 ]  =  2.0
  [100]  =  3.0

julia> S = sparse([1, 10, 100],
                  [1000, 10_000, 100_000],
                  [1.0, 2.0, 3.0])
100×100000 SparseMatrixCSC{Float64, Int64} with 3 stored entries:
  [1  ,   1000]  =  1.0
  [10 ,  10000]  =  2.0
  [100, 100000]  =  3.0

julia> findnz(S)
([1, 10, 100], [1000, 10000, 100000], [1.0, 2.0, 3.0])

As this example shows, the function findnz retrieves the indices and the non-zero elements of a SparseVector or a SparseMatrixCSC.
Another use of the function sparse is to create the sparse counterpart of a dense vector or matrix. The function issparse tests whether its argument is sparse or not.

julia> issparse(sparse([0, 1, 2]))
true

julia> sparse([0, 1, 2]) == [0, 1, 2]
true

The technique of using sparsevec or sparse and findnz to construct and decompose sparse vectors or matrices is critical for performance when the sizes of the vectors or matrices become large. An example is constructing the vectors and matrices for finite-difference, finite-volume, or finite-element calculations [1] (see Chap. 10).
Finally, Table 8.5 summarizes the functions related to sparse vectors and matrices.

8.3 Array Types

Since many types of arrays and matrices occur in mathematics and in applica-
tions, the part of the type system that deals with arrays and matrices is quite

Table 8.5 Functions that operate on sparse vectors or matrices.

Function         Description
SparseVector     type for sparse vectors
SparseMatrixCSC  type for sparse matrices
Array            create a dense version
issparse         check whether a vector or matrix is sparse
sparsevec        create a sparse vector
sparse           create a sparse matrix
spzeros          create an empty sparse vector or sparse matrix
spdiagm          create a sparse diagonal matrix
sprand           create a random sparse matrix of given density,𝑎
                 non-zero elements are sampled from given distribution
sprandn          create a random sparse matrix of given density,𝑎
                 non-zero elements are sampled from given distribution
nnz              return the number of stored elements
nonzeros         return a vector of the structural non-zero elements
findnz           return the indices and values of the non-zero elements
rowvals          return a vector with the row indices
nzrange          return the range of indices of the stored elements
                 in a given column, useful for iterating with a for loop
blockdiag        concatenate matrices block-diagonally
dropzeros        remove zero elements
dropzeros!       destructive version of dropzeros
droptol!         drop elements whose absolute value is smaller than a tolerance
permute          permute rows and columns
permute!         destructive version of permute

𝑎 The density of a sparse matrix is the probability that any element is non-zero.
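The functions rowvals, nonzeros, and nzrange can be combined to iterate over the stored entries of a SparseMatrixCSC column by column, which matches the CSC storage order. The following sketch sums all stored entries (the function name colwise_sum is our own illustrative choice):

```julia
using SparseArrays

function colwise_sum(S::SparseMatrixCSC)
    rows = rowvals(S)   # row indices of the stored entries
    vals = nonzeros(S)  # values of the stored entries
    total = 0.0
    for j in 1:size(S, 2)       # for each column j ...
        for k in nzrange(S, j)  # ... visit only the stored entries
            total += vals[k]    # rows[k] gives the corresponding row index
        end
    end
    return total
end

S = sparse([1, 10, 100], [1, 2, 3], [1.0, 2.0, 3.0], 100, 3)
colwise_sum(S)   # 6.0
```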

extensive in order to accommodate various types of arrays. The subtree below AbstractArray in the tree of types is discussed in this section to elucidate the relationship between the various subtypes and to indicate how special types of matrices or arrays can be implemented idiomatically.
The most general array type is the abstract type AbstractArray{T, n}. An abstract type is a type that cannot be instantiated, i.e., no objects of this type can be created; abstract types usually have subtypes that can be instantiated. The first parameter T is the element type and the second parameter is the number n of dimensions. The subtypes AbstractVector and AbstractMatrix are just aliases for the one- and two-dimensional cases.
julia> AbstractVector
AbstractVector (alias for AbstractArray{T, 1} where T)

julia> AbstractVector <: AbstractArray
true

julia> AbstractMatrix
AbstractMatrix (alias for AbstractArray{T, 2} where T)

julia> AbstractMatrix <: AbstractArray
true

The essential properties of a specific array type should be implemented by a concrete subtype of AbstractArray. A concrete type is a type that can be instantiated. Essential properties are size, getindex, and setindex! (in the case of a mutable array). These functions should have a computational complexity that is constant in time, since a defining feature of arrays is that the time to access or change an element is constant. (On the other hand, accessing and changing elements of lists, for example, is an operation whose complexity is usually linear as a function of the number of elements.) Concrete types should also implement the function similar, which is used by copy, for example. Furthermore, an object of the element type T must always be returned when indexing the array using integers (cf. Sect. 8.1.4), and the length of the Tuple returned by size must be the number n of dimensions of the array.
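As a sketch of how a concrete subtype might implement this interface (the type ConstArray is our own illustrative name, not from the text), consider a minimal read-only array that stores a single value together with its dimensions:

```julia
# A minimal read-only AbstractArray: every element is the same stored value.
struct ConstArray{T, n} <: AbstractArray{T, n}
    value::T
    dims::NTuple{n, Int}
end

Base.size(C::ConstArray) = C.dims
Base.IndexStyle(::Type{<:ConstArray}) = IndexLinear()
Base.getindex(C::ConstArray, i::Int) = C.value  # constant-time indexing

C = ConstArray(7, (2, 3))
size(C)   # (2, 3)
C[2, 2]   # 7; cartesian indices are converted to a linear index automatically
sum(C)    # 42; generic functions work through the AbstractArray interface
```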
The type DenseArray is an abstract subtype of AbstractArray.

julia> DenseArray <: AbstractArray
true

The defining characteristic of a DenseArray is that each element occupies memory (in contrast to a sparse vector or sparse array, see Sect. 8.2) and that the elements are laid out in a regular pattern in memory. The memory layout is compatible with C and Fortran so that arrays can easily be passed to external C and Fortran functions. Concrete subtypes should define a method for the function stride such that stride(A, k) returns the distance in the memory layout, i.e., the difference in linear indices, between two elements that are adjacent in dimension k.
This is illustrated by the following example. We can use a linear index to iterate over all elements of the square matrix A in the order in which they are laid out in memory. (The recommended and simpler way to iterate over all elements of an array is for a in A, see Sect. 8.1.5.)

for i in 1:prod(size(A))
    print(A[i], " ")
end
# 16 5 9 4 3 10 6 15 2 11 7 14 13 8 12 1

This prints the elements in the same order as for m in A.
Another way to iterate over all elements is to use strides. We can calculate a linear index from the indices over each dimension. In for loops, the iteration variable given last changes fastest, and we want the iteration variable over the first dimension to change fastest. Therefore the dimensions are listed in descending order when specifying the loop variables in this example.

for j in 1:size(A, 2), i in 1:size(A, 1)
    println((i, j, A[1 + (i-1)*stride(A, 1) + (j-1)*stride(A, 2)]))
end

This prints the elements (and their indices) in the same linear order as above.
The use of strides can also be illustrated by iterating over a three-dimensional array in two different ways.

global M = rand(2, 2, 2)

for i in 1:prod(size(M))
    println(M[i])
end

for k in 1:size(M, 3), j in 1:size(M, 2), i in 1:size(M, 1)
    println((i, j, k,
             M[1 + (i-1) * stride(M, 1) +
               (j-1) * stride(M, 2) +
               (k-1) * stride(M, 3)]))
end

The type Array is a subtype of DenseArray and ensures that elements are stored in column-major order.

julia> Array <: DenseArray
true

julia> isa(A, Array)
true

Vectors and matrices as we know them in mathematics are subtypes of the Array type: Vector is an alias for a one-dimensional Array and Matrix is an alias for a two-dimensional Array.

julia> Vector <: Array
true

julia> Matrix <: Array
true

julia> Vector
Vector (alias for Array{T, 1} where T)

julia> Matrix
Matrix (alias for Array{T, 2} where T)

A SubArray is a subtype of AbstractArray that performs indexing by reference, and not by copying, which can be useful for performance reasons. A SubArray is created by a call to the view function, which is similar to getindex, but returns a view into the parent array instead of copying the elements. SubArrays are a convenient way to reference a part of an array, which means that modifying the elements of a SubArray also modifies the elements of the original array as illustrated in this example.

julia> X = copy(A); column1 = view(X, :, 1)
4-element view(::Matrix{Int64}, :, 1) with eltype Int64:
 16
  5
  9
  4

julia> isa(column1, SubArray)
true

julia> column1 .= 0; X
4×4 Matrix{Int64}:
 0   3   2  13
 0  10  11   8
 0   6   7  12
 0  15  14   1

Two more subtypes of AbstractArray are the types StridedVector and StridedMatrix, whose purpose is to interface to BLAS and LAPACK functions efficiently regarding memory allocation and copying.
Finally, sparse vectors and sparse matrices are subtypes of their abstract supertypes AbstractSparseVector and AbstractSparseMatrix. They are (of course) not subtypes of DenseArray, and hence also not of Array, Vector, Matrix, and SubArray.

julia> using SparseArrays

julia> SparseVector <: AbstractSparseVector <: AbstractVector
true

julia> SparseMatrixCSC <: AbstractSparseMatrix <: AbstractMatrix
true

8.4 Linear Algebra

In this section, major concepts from linear algebra are summarized and their
implementation in Julia is discussed.

8.4.1 Vector Spaces and Linear Functions

Linear algebra is the branch of mathematics concerning linear functions and their properties as functions between vector spaces. Before we discuss functions
on vector spaces, it is useful to briefly recall the definition of a vector space. A
vector space 𝑉 over a field 𝐹 is a set equipped with two binary operations, namely
vector addition and scalar multiplication, satisfying the following axioms. The
elements of 𝑉 are called vectors and the elements of 𝐹 are called scalars. Vector
addition is the function + ∶ 𝑉 × 𝑉 → 𝑉, (𝐮, 𝐯) ↦ 𝐮 + 𝐯. Scalar multiplication is
the function ⋅ ∶ 𝐹 × 𝑉 → 𝑉, (𝑎, 𝐮) ↦ 𝑎𝐮.
The eight axioms defining a vector space are
1. the associativity 𝐮 + (𝐯 + 𝐰) = (𝐮 + 𝐯) + 𝐰 of addition,
2. the commutativity 𝐮 + 𝐯 = 𝐯 + 𝐮 of addition,

3. the existence of an identity element 𝟎 ∈ 𝑉 (the zero vector) of addition, i.e., 𝐯 + 𝟎 = 𝐯,
4. the existence of inverse elements −𝐯 ∈ 𝑉 of addition, i.e., 𝐯 + (−𝐯) = 𝟎,
5. the distributivity 𝑎(𝐮 + 𝐯) = 𝑎𝐮 + 𝑎𝐯 of scalar multiplication with respect
to vector addition,
6. the distributivity (𝑎 + 𝑏)𝐯 = 𝑎𝐯 + 𝑏𝐯 of scalar multiplication with respect
to field addition,
7. the compatibility 𝑎(𝑏𝐯) = (𝑎𝑏)𝐯 of scalar multiplication with field multipli-
cation, and
8. the multiplicative identity 1 of the field 𝐹 is the identity element of scalar
multiplication, i.e., 1𝐯 = 𝐯.
Here all equations must hold for all 𝐮 ∈ 𝑉, for all 𝐯 ∈ 𝑉, for all 𝐰 ∈ 𝑉, for all
𝑎 ∈ 𝐹, and for all 𝑏 ∈ 𝐹.
The first four axioms are equivalent to 𝑉 being an Abelian group under vector
addition. The last four axioms concern the interaction of vector addition and
scalar multiplication with the underlying field 𝐹.
The historically leading example of a vector space is of course the set of points
in a Euclidean space equipped with the well-known geometric operations. The
underlying field of a vector space is often the real numbers ℝ or the complex
numbers ℂ. (The reader may want to recall the definition of a field.) Further
examples of vector spaces are sequences, polynomials, functions, and matrices,
all equipped with suitable operations.
The significance of the vector data structure is that it provides a short and
convenient representation of the elements of a vector space. Suppose that 𝐁 ∶=
{𝐛1 , … , 𝐛𝑙 } ⊂ 𝑈 is a finite subset of the vector space 𝑈 over the field 𝐹. The set
𝐁 is called a basis of 𝑈 if the elements of 𝐁 are linearly independent and span
the whole vector space. The vectors in 𝐁 are defined to be linearly independent
if the condition
∀𝑎1 , … , 𝑎𝑙 ∈ 𝐹 ∶   ∑_{𝑖=1}^{𝑙} 𝑎𝑖 𝐛𝑖 = 0   ⟹   𝑎1 = ⋯ = 𝑎𝑙 = 0

holds. They span the whole vector space if every element 𝐮 of 𝑈 can be written
as a linear combination of the basis vectors 𝐁, i.e.,
∀𝐮 ∈ 𝑈 ∶ ∃𝑢1 , … , 𝑢𝑙 ∈ 𝐹 ∶   𝐮 = ∑_{𝑖=1}^{𝑙} 𝑢𝑖 𝐛𝑖 .

The coefficients 𝑢𝑖 ∈ 𝐹 are the coordinates of the vector 𝐮 with respect to the
basis 𝐁 and they are uniquely determined because of the linear independence of
the basis vectors. Furthermore, the dimension dim 𝑈 of 𝑈 is 𝑙.
Therefore every vector 𝐮 ∈ 𝑈 can be represented by its coordinates 𝑢𝑖 written
in the form

⎛𝑢 1 ⎞ ⎛𝑢1 ⎞
𝐮 = ⎜ ⋮ ⎟ = ⎜ ⋮ ⎟. (8.1)
⎝ 𝑢𝑙 ⎠𝐁 ⎝ 𝑢𝑙 ⎠
The basis 𝐁 has been indicated here for the sake of completeness; in most cases,
it is known from the context and omitted. The coefficients 𝑢𝑖 are called the ele-
ments (of the representation) of the vector.
It is customary to write vectors as column vectors (and not as row vectors) for
most purposes in linear algebra for a reason that will become clear soon.
In Julia, vectors are of course represented by the data structure Vector{type},
where the type of the elements plays the role of the underlying field 𝐹 in mathematics.
The significance of matrices is that every linear function between two given
vector spaces can be represented as a matrix and, vice versa, every matrix gives
rise to a linear function (again between two given vector spaces). To see this, we
consider linear functions 𝑓 ∶ 𝑈 → 𝑉 between two vector spaces 𝑈 and 𝑉. We
also choose a basis 𝐁 ∶= {𝐛1 , … , 𝐛𝑙 } of the 𝑙-dimensional vector space 𝑈 and a
basis 𝐂 ∶= {𝐜1 , … , 𝐜𝑚 } of the 𝑚-dimensional vector space 𝑉. Since 𝑓 is linear,
i.e., it is compatible with the vector addition and scalar multiplication via

𝑓(𝐮1 + 𝐮2 ) = 𝑓(𝐮1 ) + 𝑓(𝐮2 ) ∀𝐮1 ∈ 𝑈 ∀𝐮2 ∈ 𝑈,


𝑓(𝑎𝐮) = 𝑎𝑓(𝐮) ∀𝑎 ∈ 𝐹 ∀𝐮 ∈ 𝑈,

it suffices to know or to store the images of the basis vectors 𝐛𝑖 . This fact follows
immediately from the linearity of 𝑓, since
𝑓(𝐮) = 𝑓( ∑_{𝑖=1}^{𝑙} 𝑢𝑖 𝐛𝑖 ) = ∑_{𝑖=1}^{𝑙} 𝑢𝑖 𝑓(𝐛𝑖 )    (8.2)

holds. Representing the vector 𝐮 by the coefficients 𝑢𝑖 ∈ 𝐹 and knowing the


images 𝑓(𝐛𝑖 ), the right-hand side is calculated in a straightforward manner in
order to obtain the image 𝑓(𝐮) ∈ 𝑉 of 𝐮 ∈ 𝑈.
To record the images 𝑓(𝐛𝑖 ) ∈ 𝑉 in an orderly fashion, we arrange them as
column vectors in a two-dimensional array of numbers, the matrix 𝐴, and write

⎛ 𝑓(𝐛1 )1 ⋯ 𝑓(𝐛𝑙 )1 ⎞
𝐴 ∶= (𝑓(𝐛1 ), … , 𝑓(𝐛𝑙 )) = ⎜ ⋮ ⋱ ⋮ ⎟,
⎝ 𝑓(𝐛 1 ) 𝑚 ⋯ 𝑓(𝐛𝑙 )𝑚 ⎠

where the element 𝑎𝑗𝑖 ∶= 𝑓(𝐛𝑖 )𝑗 ∈ 𝐹 of the matrix 𝐴 is the 𝑗-th element of the
vector 𝑓(𝐛𝑖 ) ∈ 𝑉. Since 𝑈 is 𝑙-dimensional, 𝑖 runs from 1 to 𝑙, and since 𝑉 is 𝑚-
dimensional, 𝑗 runs from 1 to 𝑚. Therefore the matrix 𝐴 contains 𝑚 rows and
𝑙 columns, and we say it has dimension 𝑚 × 𝑙. We denote the set of all (𝑚 × 𝑙)-
dimensional matrices over the field 𝐹 by 𝐹 𝑚×𝑙 .
Since the matrix 𝐴 contains all the information about the function 𝑓, it is
certainly possible to calculate the image 𝑓(𝐮). How can we calculate it easily

using 𝐴? The answer is given by the matrix-vector multiplication defined by

        ⎛ 𝑎11  ⋯  𝑎1𝑙 ⎞     ⎛ 𝑢1 ⎞      ⎛ 𝑎11 𝑢1 + ⋯ + 𝑎1𝑙 𝑢𝑙 ⎞     ⎛ ∑_{𝑖=1}^{𝑙} 𝑎1𝑖 𝑢𝑖 ⎞
𝐴𝐮 = ⎜  ⋮    ⋱    ⋮  ⎟     ⎜  ⋮ ⎟  ∶= ⎜          ⋮           ⎟  =  ⎜          ⋮          ⎟ ,    (8.3)
        ⎝ 𝑎𝑚1  ⋯  𝑎𝑚𝑙 ⎠𝐁𝐂  ⎝ 𝑢𝑙 ⎠𝐁    ⎝ 𝑎𝑚1 𝑢1 + ⋯ + 𝑎𝑚𝑙 𝑢𝑙 ⎠𝐂    ⎝ ∑_{𝑖=1}^{𝑙} 𝑎𝑚𝑖 𝑢𝑖 ⎠𝐂

which is a function 𝐹 𝑚×𝑙 × 𝑈 → 𝑉. For the sake of completeness, the bases


over which the elements of the matrix and the vectors are to be understood are
indicated here, but are usually omitted.
The right-hand side is just the linear combination of the images 𝑓(𝐛𝑖 ) with
the coefficients 𝑢𝑖 , i.e.,
𝐴𝐮 = ∑_{𝑖=1}^{𝑙} 𝑢𝑖 𝑓(𝐛𝑖 ).

Recalling (8.2), this equation implies

𝑓(𝐮) = 𝐴𝐮.

We have seen that every linear function 𝑓 ∶ 𝑈 → 𝑉 gives rise to a matrix 𝐴


after fixing a basis 𝐁. This matrix contains all the information of the linear func-
tion 𝑓. Vice versa, any matrix 𝐴 gives rise to a linear function 𝑓 via matrix-vector
multiplication by defining 𝑓(𝐮) ∶= 𝐴𝐮. It is straightforward to see that the func-
tion 𝑓 defined in this manner is indeed linear by the definition of matrix-vector
multiplication.
Therefore we have constructed a bijection between the linear functions
𝑓 ∶ 𝑈 → 𝑉 from the 𝑙-dimensional vector space 𝑈 to the 𝑚-dimensional vec-
tor space 𝑉 (both over the field 𝐹) and the matrices in 𝐹 𝑚×𝑙 . In other words, the
matrix 𝐴 is a representation of the function 𝑓 in the basis 𝐁.
Now the reason why vectors are usually written as column vectors in linear al-
gebra becomes clear. If the matrix 𝐴 represents the linear function 𝑓, the matrix-
vector product 𝐴𝐮 is equivalent to the function evaluation 𝑓(𝐮) and both the
function and the matrix are written before the vector 𝐮. Being able to choose
a notation between matrix-vector products 𝐴𝐮 and column vectors on the one
hand and vector-matrix products 𝐮𝐴 and row vectors on the other hand, the
first choice resembles the function evaluation 𝑓(𝐮) and is therefore the more
natural choice (cf. Sect. 8.1.2).
Because of this bijection between linear functions 𝑓 and matrices 𝐴, there is
always a correspondence between the properties of linear functions and matri-
ces. For example, if 𝑓 is bijective, then 𝐴 is called regular. Furthermore, prop-
erties of and operations on linear functions and matrices correspond to types
and functions in Julia. These Julia types and functions are described in de-

tail in the following. For example, matrix-vector multiplication is performed by
the generic function *. Many symbols concerning linear algebra are placed in
the built-in module LinearAlgebra, and it is assumed in the following that you
have evaluated using LinearAlgebra.
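As a quick numerical check of (8.3) with arbitrary example data, A*u is exactly the linear combination of the columns of A (the images of the basis vectors) with the coefficients of u:

```julia
A = [1 2 3; 4 5 6]   # columns are the images f(b₁), f(b₂), f(b₃)
u = [7, 8, 9]        # coordinates of u with respect to the basis B
@assert A * u == 7 * A[:, 1] + 8 * A[:, 2] + 9 * A[:, 3]
```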
How can we represent the composition of linear functions using matrices?
Suppose we want to compose two linear functions 𝑓 ∶ 𝑈 → 𝑉 (represented by
the matrix 𝐴 ∈ 𝐹 𝑚×𝑙 ) and 𝑔 ∶ 𝑉 → 𝑊 (represented by the matrix 𝐵 ∈ 𝐹 𝑛×𝑚 ).
Here 𝑊 is an 𝑛-dimensional vector space. It is straightforward to show that the
resulting composition 𝑔◦𝑓 ∶ 𝑈 → 𝑊 is again a linear function, and we intend
to find its representation 𝐶 ∈ 𝐹 𝑛×𝑙 .
Using (8.3), we know that 𝑓(𝐮) = 𝐴𝐮. Using (8.3) again for multiplying the
matrix 𝐵 = (𝑏𝑘𝑗 ) by the vector 𝐴𝐮 ∈ 𝑉, we find that

                       ⎛ ∑_{𝑖=1}^{𝑙} 𝑎1𝑖 𝑢𝑖 ⎞   ⎛ ∑_{𝑗=1}^{𝑚} 𝑏1𝑗 ∑_{𝑖=1}^{𝑙} 𝑎𝑗𝑖 𝑢𝑖 ⎞
𝑔(𝑓(𝐮)) = 𝐵(𝐴𝐮) = 𝐵 ⎜          ⋮          ⎟ = ⎜                ⋮                  ⎟
                       ⎝ ∑_{𝑖=1}^{𝑙} 𝑎𝑚𝑖 𝑢𝑖 ⎠   ⎝ ∑_{𝑗=1}^{𝑚} 𝑏𝑛𝑗 ∑_{𝑖=1}^{𝑙} 𝑎𝑗𝑖 𝑢𝑖 ⎠

    ⎛ ∑_{𝑗=1}^{𝑚} ∑_{𝑖=1}^{𝑙} 𝑏1𝑗 𝑎𝑗𝑖 𝑢𝑖 ⎞   ⎛ ∑_{𝑗=1}^{𝑚} 𝑏1𝑗 𝑎𝑗1  ⋯  ∑_{𝑗=1}^{𝑚} 𝑏1𝑗 𝑎𝑗𝑙 ⎞ ⎛ 𝑢1 ⎞
  = ⎜                ⋮                 ⎟ = ⎜          ⋮         ⋱          ⋮          ⎟ ⎜  ⋮ ⎟ = 𝐶𝐮.
    ⎝ ∑_{𝑗=1}^{𝑚} ∑_{𝑖=1}^{𝑙} 𝑏𝑛𝑗 𝑎𝑗𝑖 𝑢𝑖 ⎠   ⎝ ∑_{𝑗=1}^{𝑚} 𝑏𝑛𝑗 𝑎𝑗1  ⋯  ∑_{𝑗=1}^{𝑚} 𝑏𝑛𝑗 𝑎𝑗𝑙 ⎠ ⎝ 𝑢𝑙 ⎠
                                            ⏟⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⎴⏟
                                                            𝐶 ∶=

The last equation yields the matrix 𝐶 ∈ 𝐹 𝑛×𝑙 , whose entries are

𝑐𝑘𝑖 = ∑_{𝑗=1}^{𝑚} 𝑏𝑘𝑗 𝑎𝑗𝑖 .

In this calculation, we have only used matrix-vector multiplication defined


above and we found that 𝐵(𝐴𝐮) = 𝐶𝐮 holds for all vectors 𝐮 ∈ 𝑈. This implies
that the composition 𝑔◦𝑓 is represented by the matrix 𝐶. The matrix-matrix mul-
tiplication defined by
𝐵𝐴 ∶= 𝐶
is a function 𝐹 𝑛×𝑚 × 𝐹 𝑚×𝑙 → 𝐹 𝑛×𝑙 and corresponds to the composition of two
linear functions.
Matrix-matrix multiplication is associative, i.e., (𝐶𝐵)𝐴 = 𝐶(𝐵𝐴) for all ma-
trices whose products exist. This can be checked directly using the definition
above. It also follows from the associative property (ℎ◦𝑔)◦𝑓 = ℎ◦(𝑔◦𝑓) of the
corresponding linear functions. Matrix-matrix multiplication is not commuta-
tive, since 𝑔◦𝑓 ≠ 𝑓◦𝑔 in general. In Julia, matrix-matrix multiplication is per-
formed by the generic function *.
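This correspondence is easy to verify numerically: with arbitrary square matrices, applying the two functions one after the other agrees with multiplying by the product matrix, while reversing the order generally gives a different result.

```julia
A = [1 2; 3 4]
B = [0 1; 1 0]
u = [1, 2]
@assert B * (A * u) == (B * A) * u   # composition g∘f corresponds to BA
@assert B * A != A * B               # matrix multiplication is not commutative
```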

8.4.2 Basis Change

In our discussion of the relationships between linear functions and matrices,


we have used fixed bases so far. The canonical basis or standard basis of an 𝑙-
dimensional vector space 𝑈 is the basis 𝐄 ∶= {𝐞1 , … 𝐞𝑙 }, where

⎛1⎞ ⎛0⎞ ⎛0⎞


⎜0⎟ ⎜1⎟ ⎜⋮ ⎟
𝐞1 ∶= ⎜0⎟ , 𝐞2 ∶= ⎜0⎟ , …, 𝐞𝑙 ∶= ⎜0⎟ ,
⎜⋮⎟ ⎜⋮⎟ ⎜0⎟
⎝0⎠ ⎝0⎠ ⎝1⎠
i.e., the 𝑖-th element of 𝐞𝑖 is equal to 1 ∈ 𝐹 and the other elements vanish. The
matrix
⎛1 0 ⋯ 0 0⎞
⎜0 1 0⎟
𝐼 ∶= (𝐞1 , … , 𝐞𝑙 ) = ⎜⋮ ⋱ ⋮⎟
⎜0 1 0⎟
⎝0 0 ⋯ 0 1⎠
containing all standard basis vectors as columns is called the identity matrix.
Obviously, 𝐴𝐼 = 𝐼𝐴 = 𝐴 holds for all square matrices 𝐴 ∈ 𝐹 𝑙×𝑙 . In Julia,
the identity matrix is available as the UniformScaling object I exported by LinearAlgebra; a dense identity matrix is constructed by, e.g., Matrix{Float64}(I, l, l).
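A short sketch (assuming Julia 1.x, where the identity is the UniformScaling object I from LinearAlgebra): I adapts its size automatically in products, and a dense identity matrix of a given size can be requested explicitly.

```julia
using LinearAlgebra
A = [1 2; 3 4]
@assert A * I == A && I * A == A            # I adapts to the size of A
@assert Matrix{Int}(I, 2, 2) == [1 0; 0 1]  # dense identity matrix
```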
Sometimes it is useful to change the basis in which a vector is represented in
(8.1). We call the old basis 𝐁1 ∶= {𝐛11 , … , 𝐛1𝑙 } and the new basis 𝐁2 ∶= {𝐛21 , … , 𝐛2𝑙 }.
Analogously to the identity matrix, we define 𝐵1 ∶= (𝐛11 , … , 𝐛1𝑙 ) to be the ma-
trix that contains all basis vectors of 𝐁1 as columns, and we also define 𝐵2 ∶=
(𝐛21 , … , 𝐛2𝑙 ).
We denote the function that performs a basis change from the old basis 𝐁1 to
the new basis 𝐁2 by
𝑔 ∶ 𝑈 → 𝑈, 𝐮𝐁1 ↦ 𝐮𝐁2 .
It maps the coefficients of the vector 𝐮 with respect to the old basis 𝐁1 to its coeffi-
cients with respect to the new basis 𝐁2 . Since 𝑔(𝐛1𝑖 ) = 𝐛2𝑖 holds for all 𝑖 ∈ {1, … , 𝑙},
it is obvious that 𝑔 is a linear function. We denote the matrix representation of 𝑔
by 𝐺. Since every basis change 𝑔 is a bijection, its inverse function 𝑔−1 exists and
is represented by the inverse matrix 𝐺 −1 of 𝐺 (see Sect. 8.4.8).
Therefore we have

        ⎛ 𝑢1 ⎞
𝐮𝐁2 = ⎜  ⋮ ⎟     = ∑_{𝑖=1}^{𝑙} 𝑢𝑖 𝐛2𝑖 = ∑_{𝑖=1}^{𝑙} 𝑢𝑖 𝑔(𝐛1𝑖 ) = ∑_{𝑖=1}^{𝑙} 𝑢𝑖 𝐺𝐛1𝑖 = 𝐺 ∑_{𝑖=1}^{𝑙} 𝑢𝑖 𝐛1𝑖
        ⎝ 𝑢𝑙 ⎠𝐁2

        ⎛ 𝑢1 ⎞
    = 𝐺 ⎜  ⋮ ⎟   = 𝐺𝐮𝐁1 .
        ⎝ 𝑢𝑙 ⎠𝐁1

An important example of a basis change is rotation. In two-dimensional Eu-


clidean geometry, counter-clockwise rotation by the angle 𝜙 is expressed by mul-
tiplication with the matrix

𝑅(𝜙) ∶= ⎛ cos 𝜙   − sin 𝜙 ⎞
         ⎝ sin 𝜙     cos 𝜙 ⎠ .

In Julia, this matrix is calculated as follows.


function rotate(phi)
    [cos(phi) -sin(phi)
     sin(phi) cos(phi)]
end

Note that a newline character can be used instead of a semicolon to indicate the
start of another row.
So far we have seen how to change the basis over which a vector as written in (8.1)
is to be understood. We can also change the bases over which a matrix as written
in (8.3) is to be understood. This is useful in situations when a linear function is
known or more easily investigated in a certain basis. Linear functions also come
with distinguished bases (for example, of eigenvectors or singular vectors) which
are helpful to understand their action (see Sect. 8.4.9 and Sect. 8.4.10).
Suppose that 𝐴𝐁1 𝐂1 is the representation of a linear function 𝑓 in the old
bases 𝐁1 of 𝑈 and 𝐂1 of 𝑉. How can we find the representation 𝐴𝐁2 𝐂2 of 𝑓 in
the new bases 𝐁2 and 𝐂2 ? We start from the two basis changes

𝐮𝐁2 = 𝐺𝐮𝐁1 ,
𝐯𝐂2 = 𝐻𝐯𝐂1

and the representation of 𝐴 over the old bases, i.e., from the equation

𝑉 ∋ 𝐯𝐂1 = 𝐴𝐁1 𝐂1 𝐮𝐁1 ∈ 𝑈.

Multiplying this equation from the left by 𝐻 and using 𝐮𝐁1 = 𝐺 −1 𝐮𝐁2 yields

𝐯𝐂2 = 𝐻𝐯𝐂1 = 𝐻𝐴𝐁1 𝐂1 𝐮𝐁1 = 𝐻𝐴𝐁1 𝐂1 𝐺 −1 𝐮𝐁2


⏟⎴⎴⏟⎴⎴⏟
𝐴𝐁2 𝐂2 ∶=

so that the sought matrix 𝐴𝐁2 𝐂2 is given by

𝐴𝐁2 𝐂2 = 𝐻𝐴𝐁1 𝐂1 𝐺 −1 . (8.4)

If 𝑈 = 𝑉, the two matrices 𝐴𝐁1 𝐂1 and 𝐴𝐁2 𝐂2 are called similar or conjugate (see
Definition 8.34). Similarity is an equivalence relation.
The last equation implies

𝐴𝐁1 𝐂1 = 𝐻 −1 𝐴𝐁2 𝐂2 𝐺, (8.5)



                𝐴𝐁2𝐂2
    𝐁2 ─────────────────→ 𝐂2
     ↑                     │
   𝐺 │                     │ 𝐻^{−1}
     │                     ↓
    𝐁1 ─────────────────→ 𝐂1
                𝐴𝐁1𝐂1

Fig. 8.1 Commutative diagram for changing the bases of a matrix.

which is easily interpreted in the commutative diagram Fig. 8.1. In the old
bases 𝐁1 and 𝐂1 , 𝐴𝐁1 𝐂1 maps vectors represented using 𝐁1 to those represented
using 𝐂1 ; this is the left-hand side of the equation and the arrow at the bottom in
the diagram. The same effect is achieved by changing the argument vector from
the old basis 𝐁1 to the new basis 𝐁2 , then applying the linear function via its
new representation 𝐴𝐁2 𝐂2 , and finally changing from the new basis 𝐂2 back to
the old basis 𝐂1 ; this is the right-hand side of the equation and the other three
arrows in the diagram.
Next, we consider an example. We seek the representation of a geometric
transformation in the canonical basis. The transformation is stretching the en-
tire two-dimensional plane by a factor of 2 only in the direction of the 𝑥-axis
rotated by 𝜋∕4. We define the two bases 𝐂1 ∶= 𝐁1 ∶= 𝐄 and the basis change
𝐻 ∶= 𝐺 ∶= 𝑅(−𝜋∕4) such that

⎛ 1  0 ⎞                     ⎛ 1  −1 ⎞
⎝ 0  1 ⎠𝐁2 = 𝑅(−𝜋∕4) (1∕√2) ⎝ 1   1 ⎠𝐁1 .

In the new bases 𝐁2 and 𝐂2 , the stretching transformation is easily expressed as

𝐴𝐁2𝐂2 = ⎛ 2  0 ⎞
         ⎝ 0  1 ⎠ .

The matrix representing the transformation in the canonical basis is therefore

𝐴𝐄𝐄 = 𝐴𝐁1𝐂1 = 𝐻^{−1} 𝐴𝐁2𝐂2 𝐺 = 𝑅(𝜋∕4) ⎛ 2  0 ⎞ 𝑅(−𝜋∕4).
                                         ⎝ 0  1 ⎠

In Julia, the transformation is


julia> A = rotate(pi/4) * [2 0; 0 1] * rotate(-pi/4)
2×2 Matrix{Float64}:
 1.5  0.5
 0.5  1.5

Finally, we check that it computes the desired transformation. The vector (1, 1)⊤
should be stretched by a factor of two, and the vector (−1, 1)⊤ , which is orthog-
onal to it, should remain unchanged.
julia> A * [1, 1]
2-element Vector{Float64}:
 2.0
 2.0
julia> A * [-1, 1]
2-element Vector{Float64}:
 -1.0
  1.0

8.4.3 Inner-Product Spaces

Many vector spaces can be equipped with an inner product. Inner products give
vector spaces geometric structure by making it possible to define lengths and
angles. An inner product ⟨., .⟩ of the vector space 𝑉 is a function

⟨., .⟩ ∶ 𝑉 × 𝑉 → 𝐹

that satisfies – for all vectors 𝐮, 𝐯, and 𝐰 ∈ 𝑉 and for all scalars 𝑎 ∈ 𝐹 – the three
conditions of conjugate symmetry ⟨𝐮, 𝐯⟩ = ⟨𝐯, 𝐮⟩, linearity in the first argument
⟨𝑎𝐮, 𝐯⟩ = 𝑎⟨𝐮, 𝐯⟩ and ⟨𝐮 + 𝐯, 𝐰⟩ = ⟨𝐮, 𝐰⟩ + ⟨𝐯, 𝐰⟩, and positive-definiteness
⟨𝐯, 𝐯⟩ ≥ 0 with equality if and only if 𝐯 = 0.
Every inner product induces a norm on its vector space 𝑉 by defining

‖𝐯‖2 ∶= ⟨𝐯, 𝐯⟩.

In Julia, the function LinearAlgebra.dot computes the canonical inner
product

𝐮 ⋅ 𝐯 ∶= ∑_{𝑖=1}^{dim 𝑉} 𝑢𝑖 𝑣𝑖 ,

where the first vector is conjugated.
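A short sketch with arbitrary complex vectors, showing that dot conjugates its first argument and induces the norm:

```julia
using LinearAlgebra
u = [1 + 2im, 3im]
v = [2 + 0im, 1 + 1im]
@assert dot(u, v) == conj(1 + 2im) * 2 + conj(3im) * (1 + 1im)
@assert norm(v)^2 ≈ real(dot(v, v))   # the induced norm: ‖v‖² = ⟨v, v⟩
```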


An important inequality that involves the inner product is the following.

Theorem 8.1 (Cauchy–Bunyakovsky–Schwarz inequality) The inequality

|⟨𝐮, 𝐯⟩|^2 ≤ ⟨𝐮, 𝐮⟩⟨𝐯, 𝐯⟩    ∀𝐮, 𝐯 ∈ ℝ𝑑

holds, with equality if and only if 𝐮 and 𝐯 are linearly dependent.

Equivalently, we can also write

|⟨𝐮, 𝐯⟩| ≤ ‖𝐮‖‖𝐯‖    ∀𝐮, 𝐯 ∈ ℝ𝑑 .

The cosine of the angle 𝜙(𝐮, 𝐯) between two vectors 𝐮 and 𝐯 is defined as

cos 𝜙(𝐮, 𝐯) ∶= (𝐮 ⋅ 𝐯) ∕ (‖𝐮‖‖𝐯‖),

and the inequality | cos 𝜙| ≤ 1 holds because of the Cauchy–Bunyakovsky–


Schwarz inequality, Theorem 8.1. Two vectors are called orthogonal if ⟨𝐮, 𝐯⟩ = 0,
which corresponds to 𝜙(𝐮, 𝐯) ∈ {𝜋∕2, 3𝜋∕2} in the case of the canonical inner
product.
A basis of a vector space is called orthogonal if all basis vectors are orthogonal
to one another. It is called orthonormal if all basis vectors are orthogonal and
have length one.
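A small sketch of these definitions (the helper angle_between is our own name, not a library function):

```julia
using LinearAlgebra
angle_between(u, v) = acos(dot(u, v) / (norm(u) * norm(v)))
@assert dot([1.0, 0.0], [0.0, 2.0]) == 0.0         # orthogonal vectors
@assert angle_between([1.0, 0.0], [0.0, 2.0]) ≈ pi / 2
```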
An inner product also gives rise to the conjugate transpose or Hermitian con-
jugate of a matrix 𝐴. It is denoted by 𝐴∗ and is defined as the matrix that satisfies

⟨𝐴𝐮, 𝐯⟩ = ⟨𝐮, 𝐴∗ 𝐯⟩

for all vectors 𝐮 and 𝐯 ∈ 𝑉. This means that the elements of 𝐴∗ are given by
𝑎̄𝑗𝑖 , if the elements of 𝐴 are denoted by 𝑎𝑖𝑗 . If the underlying field of the vector
space 𝑉 is the real numbers ℝ, then the conjugate transpose 𝐴∗ is the transpose
of 𝐴 and denoted by 𝐴⊤ ; its elements are 𝑎𝑗𝑖 .
In Julia, the conjugate transpose or Hermitian conjugate of a matrix is cal-
culated by the functions adjoint and adjoint! or the postfix operator '.

julia> A = [1+2im 3+4im; 5+6im 7+8im]; A'
2×2 adjoint(::Matrix{Complex{Int64}}) with eltype Complex{Int64}:
 1-2im  5-6im
 3-4im  7-8im

The transpose is calculated by transpose.

julia> transpose(A)
2×2 transpose(::Matrix{Complex{Int64}}) with eltype Complex{Int64}:
 1+2im  5+6im
 3+4im  7+8im

A matrix 𝐴 is called self-adjoint or Hermitian if 𝐴 = 𝐴∗ . If the underlying


field is ℝ and 𝐴 is self-adjoint, i.e., 𝐴 = 𝐴⊤ , then 𝐴 is called symmetric.
If 𝑉 is a vector space over the complex numbers ℂ, there is a bijection between
the inner products and the sesquilinear forms (𝐮, 𝐯) ↦ 𝐯∗ 𝐴𝐮, where 𝐴 is a self-
adjoint positive-definite matrix.

8.4.4 The Rank-Nullity Theorem

Before we can state the rank-nullity theorem, some definitions are required. The
kernel or nullspace of a function 𝑓 ∶ 𝑈 → 𝑉 is the set of all elements 𝐮 ∈ 𝑈
whose image vanishes, i.e.,

ker(𝑓) ∶= {𝐮 ∈ 𝑈 ∣ 𝑓(𝐮) = 0}.

It is straightforward to show that the kernel of a linear function is a linear sub-


space of the domain 𝑈. The nullity nul(𝑓) of a linear function 𝑓 or a matrix is
the dimension of its kernel. Furthermore, the rank rk(𝑓) of a linear function 𝑓
or a matrix is the dimension of its image 𝑓(𝑈), which can be shown to be a linear
subspace as well.
The rank-nullity theorem is an important relationship between the dimen-
sions of the kernel, the image, and the preimage space of a linear function or
matrix.
As we have seen, matrices are representations of linear functions. Therefore
we can state the rank-nullity theorem in both the language of linear functions
and in the one of matrices.
In the language of linear functions, the following theorem holds.
Theorem 8.2 (rank-nullity theorem for linear functions) Let 𝑓 ∶ 𝑈 → 𝑉
be a linear function. Then the equation

dim(ker 𝑓) + dim(im 𝑓) = dim(𝑈)

holds.
In the language of matrices, the theorem can be stated as follows. The nullity
and the rank of a matrix are the nullity and rank of the corresponding linear
function.
Theorem 8.3 (rank-nullity theorem for matrices) Let 𝐴 ∈ 𝐹 𝑛×𝑙 be an 𝑛 × 𝑙-
dimensional matrix. Then the equation

nul(𝐴) + rk(𝐴) = 𝑙

holds.
In Julia, the function LinearAlgebra.nullspace calculates a basis of the
nullspace of a matrix and the function LinearAlgebra.rank computes its rank.

julia> A = [1 1 0 0; 0 1 1 0; 0 1 -1 0]
3×4 Matrix{Int64}:
 1  1   0  0
 0  1   1  0
 0  1  -1  0
julia> nullspace(A)
4×1 Matrix{Float64}:
 0.0
 0.0
 0.0
 1.0
julia> rank(A)
3
julia> size(nullspace(A), 2) + rank(A) == size(A, 2)
true

8.4.5 Matrix Types

In applications, matrices with special structures often arise. The special prop-
erties of these matrices can often be exploited by specialized operations and al-
gorithms such as matrix factorizations (see Sect. 8.4.11). Therefore important
matrix types are discussed in the following.
The simplest matrix type are diagonal matrices which have the form

⎛∗ ⎞
⎜ ⋱ ⎟.
⎝ ∗⎠

We simplify notation by not showing zero entries. Furthermore, the symbol ∗


denotes any element of the underlying field, and two entries ∗ are not necessarily
equal.
The function LinearAlgebra.Diagonal constructs a diagonal matrix from a
vector by placing its elements in the diagonal or from a matrix taking only its
diagonal elements. The result is of type LinearAlgebra.Diagonal.

julia> Diagonal([1, 2])
2×2 Diagonal{Int64, Vector{Int64}}:
 1  ⋅
 ⋅  2
julia> Diagonal <: AbstractMatrix
true

Upper and lower bidiagonal matrices are of the forms

⎛∗ ∗ ⎞ ⎛∗ ⎞
⎜ ⋱ ⋱ ⎟ ⎜∗ ⋱ ⎟,
and
⎜ ⋱ ∗⎟ ⎜ ⋱ ⋱ ⎟
⎝ ∗⎠ ⎝ ∗ ∗⎠

respectively. They are constructed by the function Bidiagonal. Again,
Bidiagonal is a subtype of AbstractMatrix.
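A minimal sketch of constructing a bidiagonal matrix from its two diagonals (the data is arbitrary; the symbol :U selects the upper form and :L the lower one):

```julia
using LinearAlgebra
B = Bidiagonal([1, 2, 3], [4, 5], :U)   # main diagonal and super-diagonal
@assert B == [1 4 0; 0 2 5; 0 0 3]
```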

Tridiagonal matrices are of the form

⎛∗ ∗ ⎞
⎜∗ ⋱ ⋱ ⎟,
⎜ ⋱ ⋱ ∗⎟
⎝ ∗ ∗⎠

and their type is called Tridiagonal.
Symmetric tridiagonal matrices are represented by the type SymTridiagonal.
They are tridiagonal matrices whose first sub- and super-diagonals are equal.
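Both types are likewise built directly from their diagonals; a small sketch with arbitrary data:

```julia
using LinearAlgebra
T = Tridiagonal([1, 2], [3, 4, 5], [6, 7])   # sub-, main, and super-diagonal
@assert T == [3 6 0; 1 4 7; 0 2 5]
S = SymTridiagonal([1, 2, 3], [4, 5])        # main diagonal and off-diagonal
@assert S == transpose(S)                    # symmetric by construction
```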
Self-adjoint or Hermitian matrices, i.e., matrices over ℂ with the property 𝐴 =
𝐴∗ , are represented by the type Hermitian and are constructed by the function
Hermitian from the upper or lower triangle of a given array.

julia> A = [1 3+4im; 5+6im 7]
2×2 Matrix{Complex{Int64}}:
 1+0im  3+4im
 5+6im  7+0im
julia> Hermitian(A, :U)
2×2 Hermitian{Complex{Int64}, Matrix{Complex{Int64}}}:
 1+0im  3+4im
 3-4im  7+0im
julia> Hermitian(A, :L)
2×2 Hermitian{Complex{Int64}, Matrix{Complex{Int64}}}:
 1+0im  5-6im
 5+6im  7+0im

Analogously, symmetric matrices, i.e., matrices over ℝ and with the property
𝐴 = 𝐴⊤ , are represented by the type Symmetric and are constructed by the func-
tion Symmetric.
Upper-triangular and lower-triangular matrices are matrices of the forms

⎛∗ ∗ ∗⎞ ⎛∗ ⎞
⎜ ⋱ ∗⎟ and ⎜∗ ⋱ ⎟,
⎝ ∗⎠ ⎝∗ ∗ ∗⎠

respectively. They are important for solving linear systems (see Sect. 8.4.8) and
in matrix factorizations (see Sect. 8.4.11). They are represented by the types
UpperTriangular and LowerTriangular, and they are constructed by functions
of the same name.
The function UniformScaling returns a multiple of the identity matrix 𝐼,
which is generically sized so that it can be multiplied by any matrix.

julia> UniformScaling(2) * [1 2; 3 4]
2×2 Matrix{Int64}:
 2  4
 6  8

Table 8.6 gives an overview of the matrix types.

Table 8.6 Matrix types.


Function or type                  Description
LinearAlgebra.Diagonal            diagonal matrix
LinearAlgebra.Bidiagonal          bidiagonal matrix
LinearAlgebra.Tridiagonal         tridiagonal matrix
LinearAlgebra.SymTridiagonal      symmetric tridiagonal matrix
LinearAlgebra.Symmetric           symmetric matrix
LinearAlgebra.Hermitian           self-adjoint or Hermitian matrix
LinearAlgebra.UpperTriangular     upper-triangular matrix
LinearAlgebra.LowerTriangular     lower-triangular matrix
LinearAlgebra.UniformScaling      constant times identity matrix

In general, matrices can be converted from the general AbstractMatrix type
to a special type by calling the constructor of the special type on the matrix. Vice
versa, a special type can be converted to the general Array type by calling the
constructors Matrix or Array on the special matrix.

julia> Tridiagonal(A)
4×4 Tridiagonal{Int64, Vector{Int64}}:
 16   3   ⋅   ⋅
  5  10  11   ⋅
  ⋅   6   7  12
  ⋅   ⋅  14   1
julia> Matrix(Tridiagonal(A))
4×4 Matrix{Int64}:
 16   3   0   0
  5  10  11   0
  0   6   7  12
  0   0  14   1

8.4.6 The Cross Product

The cross product or vector product × ∶ ℝ3 × ℝ3 → ℝ3 is defined on the three-


dimensional real vector space ℝ3 and has useful geometric meaning. Its three
defining, geometric properties are that
1. 𝐚 × 𝐛 is orthogonal to 𝐚 and 𝐛,
2. the three vectors 𝐚, 𝐛, and 𝐚 × 𝐛 are a right-handed system, and
3. the (Euclidean) length of 𝐚 × 𝐛 is the area of the parallelogram spanned by 𝐚
and 𝐛.

It can be shown that these three defining properties imply the definition

𝐚 × 𝐛 ∶= ‖𝐚‖‖𝐛‖ sin(∠(𝐚, 𝐛))𝐧,

where the angle ∠(𝐚, 𝐛) ∈ [0, 𝜋] is the angle between 𝐚 and 𝐛 in the plane con-
taining both and 𝐧 is a unit vector normal to the same plane. Its direction is given
by the right-hand rule: if 𝐚 points along the thumb and 𝐛 along the forefinger of
the right hand, then 𝐧 and thus 𝐚 × 𝐛 point along the middle finger.
The cross product is anticommutative, i.e., 𝐚 × 𝐛 = −(𝐛 × 𝐚) holds for all
vectors 𝐚 and 𝐛 ∈ ℝ3 . It is also bilinear, i.e., (𝜆𝐚 + 𝜇𝐛) × 𝐜 = 𝜆(𝐚 × 𝐜) + 𝜇(𝐛 × 𝐜)
holds for all 𝜆 and 𝜇 ∈ ℝ and for all vectors 𝐚, 𝐛, and 𝐜 ∈ ℝ3 . Furthermore,
two vectors 𝐚 ≠ 0 and 𝐛 ≠ 0 are parallel if and only if 𝐚 × 𝐛 = 0. Finally, the
Lagrangian identity ‖𝐚 × 𝐛‖2 = ‖𝐚‖2 ‖𝐛‖2 − (𝐚 ⋅ 𝐛)2 holds for all vectors 𝐚 and
𝐛 ∈ ℝ3 . The cross product is not associative, i.e., in general 𝐚×(𝐛×𝐜) ≠ (𝐚×𝐛)×𝐜.
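All of these properties are easy to check numerically for arbitrary example vectors using cross and dot:

```julia
using LinearAlgebra
a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
c = cross(a, b)
@assert c == -cross(b, a)                                 # anticommutativity
@assert dot(c, a) == 0 && dot(c, b) == 0                  # orthogonality
@assert norm(c)^2 ≈ norm(a)^2 * norm(b)^2 - dot(a, b)^2   # Lagrangian identity
```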
We can use the three defining properties to determine the cross products 𝐞𝑖 ×
𝐞𝑗 for all 𝑖 and 𝑗 ∈ {1, 2, 3} of all combinations of vectors in the standard basis
{𝐞1 , 𝐞2 , 𝐞3 }. Then multiplying out the product (𝑎1 𝐞1 +𝑎2 𝐞2 +𝑎3 𝐞3 )×(𝑏1 𝐞1 +𝑏2 𝐞2 +
𝑏3 𝐞3 ) and using the bilinearity yields the formula

            ⎛ 𝑎1 ⎞   ⎛ 𝑏1 ⎞   ⎛ 𝑎2 𝑏3 − 𝑎3 𝑏2 ⎞   ⎛  | 𝑎2 𝑏2 ; 𝑎3 𝑏3 | ⎞
𝐚 × 𝐛 = ⎜ 𝑎2 ⎟ × ⎜ 𝑏2 ⎟ = ⎜ 𝑎3 𝑏1 − 𝑎1 𝑏3 ⎟ = ⎜ −| 𝑎1 𝑏1 ; 𝑎3 𝑏3 | ⎟ ,
            ⎝ 𝑎3 ⎠   ⎝ 𝑏3 ⎠   ⎝ 𝑎1 𝑏2 − 𝑎2 𝑏1 ⎠   ⎝  | 𝑎1 𝑏1 ; 𝑎2 𝑏2 | ⎠

where | 𝑎 𝑏 ; 𝑐 𝑑 | denotes the determinant 𝑎𝑑 − 𝑏𝑐 of the 2 × 2 matrix with rows (𝑎, 𝑏) and (𝑐, 𝑑).
In Julia, the function cross calculates the cross product of two three-dimensional vectors.

julia> cross([1, 0, 0], [0, 1, 0])
3-element Vector{Int64}:
 0
 0
 1

8.4.7 The Determinant

The geometric meaning of the determinant det(𝐴) of a square matrix 𝐴 ∈ 𝐹 𝑛×𝑛 is


that it is equal to the signed volume of the 𝑛-dimensional parallelepiped spanned
by the column or the row vectors of the matrix. It is positive when the corre-
sponding linear function preserves the orientation of the vector space and neg-
ative otherwise. A matrix is singular, i.e., it cannot be inverted, if and only if its
determinant is zero; conversely, a matrix is regular, i.e., it can be inverted, if and
only if its determinant is non-zero. Determinants occur in different contexts in

mathematics and have many important properties. In Julia, the determinant of
a matrix is calculated by the function LinearAlgebra.det.
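Two small examples of the geometric interpretation (arbitrary data): swapping two coordinate axes reverses orientation, and scaling all axes of ℝ3 by 2 multiplies volumes by 2^3 = 8.

```julia
using LinearAlgebra
@assert det([0 1; 1 0]) ≈ -1.0             # reflection reverses orientation
@assert det(2.0 * Matrix(I, 3, 3)) ≈ 8.0   # volume scaling of the unit cube
```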

8.4.8 Linear Systems

Systems of linear equations can always be written in the form

𝐴𝐱 = 𝐛, (8.6)

where 𝐴 ∈ 𝐹 𝑛×𝑚 is the system matrix and the vector 𝐱 contains the unknowns 𝑥𝑖 . Linear
systems are among the most common equations to be solved. Linear systems
also arise from the linearization of systems of nonlinear equations; the linearized
system is then used as an approximation of the more complicated and usually
harder to solve nonlinear system. Because linear systems are ubiquitous, Julia
provides advanced algorithms for solving them.
If you only remember one Julia function from this section, it should be the
backslash function \. It usually solves a linear system very reliably.

julia> A = [1 2; 4 5]; b = [15; 42];
julia> x = A \ b
2-element Vector{Float64}:
 3.0
 6.0
julia> A * x == b
true
julia> A * x
2-element Vector{Float64}:
 15.0
 42.0

The rest of this section is concerned with what happens under the hood.

8.4.8.1 Solvability

Cramer’s rule, an explicit formula for solving systems of linear equations with
an equal number 𝑛 of equations and unknowns, has been known since the mid-
18th century. It is named after Gabriel Cramer, who published the rule for sys-
tems of arbitrary size in 1750. Cramer’s rule is based on determinants and its
naive implementation has a time complexity of 𝑂((𝑛 + 1)𝑛!), although it can be
implemented with a time complexity of 𝑂(𝑛3 ) [2].
However, before we solve linear systems, it is important to consider their solv-
ability. There may be no solution, a unique solution, or multiple solutions; if the
underlying field 𝐹 of the preimage vector space is infinite, the third case of mul-

tiple solutions means that there are infinitely many solutions. In the following,
we assume that the underlying field 𝐹 is ℝ or ℂ.
A system with fewer equations than unknowns is called an underdetermined
system. In general, such a system has infinitely many solutions, but it may have
no solution. A system with the same number of equations and unknowns usually
has a unique solution. A system with more equations than unknowns is called
an overdetermined system. In general, such a system has no solution.
The reasons why a certain system may behave differently from the general
case is that the equations may be linearly dependent, i.e., one or more equations
may be redundant, or that two or more of the equations may be inconsistent, i.e.,
contradictory.
Equations of the form (8.6) can be interpreted – as all of linear algebra – within
the context of linear functions 𝑓 and within the context of matrices 𝐴. There
is also a geometric interpretation: each linear equation (or row) in (8.6) deter-
mines a hyperplane in 𝐹 𝑚 and the set of solutions is the intersection of these
hyperplanes.
To characterize the three cases for the number of solutions that can occur, it is
useful to start with homogeneous systems. A system of linear equations is called
homogeneous if the constant terms in each equation vanish, i.e., if (8.6) has the
form
𝐴𝐱 = 0. (8.7)
Each homogeneous equation has at least one solution, namely the trivial solu-
tion 𝐱 = 0.
If the matrix 𝐴 is regular, which is equivalent to 𝑓 being a bijection, then the
trivial solution is the unique solution; the set of solutions is the kernel ker(𝐴) =
{0}.
If the matrix 𝐴 is singular, the set of solutions is the kernel ker(𝐴) and it
contains infinitely many solutions. It is straightforward to see that if 𝐱 and 𝐲 are
two solutions, then the linear combination 𝛼𝐱 + 𝛽𝐲 is a solution as well. This
implies that the set of solutions is a linear subspace of 𝐹 𝑚 .
A linear function 𝑓 can only be a bijection if the dimensions 𝑚 and 𝑛 of the
preimage and image spaces are equal. Therefore a regular matrix 𝐴 must be a
square matrix. A square matrix 𝐴 is regular if and only if det(𝐴) ≠ 0.
An example of a singular matrix is the following.
julia> A = [1 0; 1 0]; det(A)
0.0
julia> rank(A)
1
julia> nullspace(A)
2×1 Matrix{Float64}:
 0.0
 1.0

An example of a regular matrix is the following.

julia> A = [1 2; 3 4]; det(A)
-2.0
julia> rank(A)
2
julia> nullspace(A)
2×0 Matrix{Float64}

With this knowledge about the solution set of homogeneous systems (8.7), we
can now consider general, inhomogeneous systems (8.6). An inhomogeneous
system has (at least) one solution if the inhomogeneity 𝐛 lies in the image of 𝑓
or 𝐴, i.e., if 𝐛 ∈ im(𝐴). If 𝐳 ∈ 𝐹 𝑚 is any particular solution of (8.6), then all
solutions of (8.6) are given by the set

𝐳 + ker(𝐴) = {𝐳 + 𝐯 ∣ 𝐯 ∈ ker(𝐴)}.

Geometrically, this means that the solution set of an inhomogeneous system is


a translation (by a particular solution 𝐳) of the solution set of the corresponding
homogeneous equation, which is ker(𝐴).
These facts underline the importance of the two formulations of the rank-
nullity theorem, Theorem 8.2 and Theorem 8.3. It is important to note that most
of the algorithms we will encounter in the following yield the rank or the nullity
of a matrix as an important byproduct so that it is often not necessary and even
inefficient to calculate them separately.
In the following, algorithms for solving two important types of linear systems
are discussed in some detail. The two types are square linear systems and overde-
termined linear systems. Underdetermined linear systems are of less practical
importance.

8.4.8.2 Square Linear Systems

In this section, we consider the important special case when the system matrix 𝐴
in (8.6) is square, i.e., 𝐴 ∈ 𝐹 𝑛×𝑛 . In other words, the preimage and the image
spaces have the same dimension 𝑛. A square linear system has the form

𝑎11 𝑥1 + 𝑎12 𝑥2 + ⋯ + 𝑎1𝑛 𝑥𝑛 = 𝑏1 ,
𝑎21 𝑥1 + 𝑎22 𝑥2 + ⋯ + 𝑎2𝑛 𝑥𝑛 = 𝑏2 ,
      ⋮
𝑎𝑛1 𝑥1 + 𝑎𝑛2 𝑥2 + ⋯ + 𝑎𝑛𝑛 𝑥𝑛 = 𝑏𝑛 .

The next theorem records basic facts about regular matrices.



Theorem 8.4 (regular matrices) The following statements about a matrix 𝐴 ∈


𝐹 𝑛×𝑛 are equivalent.
1. The linear function associated with 𝐴 is bijective.
2. 𝐴 is regular.
3. The inverse matrix 𝐴−1 of 𝐴 exists and (𝐴−1 )−1 = 𝐴 holds.
4. det(𝐴) ≠ 0.
5. The equation 𝐴𝐱 = 𝟎 implies that 𝐱 = 𝟎.
6. The equation 𝐴𝐱 = 𝐛 has a unique solution 𝐱 = 𝐴−1 𝐛 for all 𝐛 ∈ 𝐹 𝑛 .
Furthermore, the inverse matrix 𝐴−1 always commutes with 𝐴, i.e., 𝐴𝐴−1 =
𝐴−1 𝐴 = 𝐼 holds.
It is important to note that inverse matrices are usually not calculated explic-
itly when solving a system of linear equations – unless the system is very small
–, since there are much more efficient computational alternatives. On the other
hand, if we have a fast algorithm for solving a square linear system, we can
always calculate the inverse 𝐴−1 explicitly if desired by solving the equations
𝐴𝐱𝑖 = 𝐞𝑖 and collecting the solutions 𝐱𝑖 columnwise into a matrix because of
𝐴𝐴−1 = 𝐼.
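This column-by-column procedure can be sketched as follows; A is arbitrary example data, and the backslash operator with a matrix right-hand side solves one system per column:

```julia
using LinearAlgebra
A = [1.0 2.0; 4.0 5.0]
Ainv = A \ Matrix{Float64}(I, 2, 2)   # solve A xᵢ = eᵢ for every column eᵢ
@assert Ainv ≈ inv(A)
@assert A * Ainv ≈ Matrix{Float64}(I, 2, 2)
```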
We start with triangular matrices. If the square linear system has the special
form
𝐿𝐱 = 𝐛,
where 𝐿 is a lower-triangular matrix, then the elements of the solution 𝐱 are
found as
𝑥𝑖 ∶= (𝑏𝑖 − ∑_{𝑗=1}^{𝑖−1} 𝑙𝑖𝑗 𝑥𝑗) ∕ 𝑙𝑖𝑖    ∀𝑖 ∈ {1, … , 𝑛}
starting with the first equation. This algorithm is called forward substitution.
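The formula translates directly into a short routine; the following is a minimal sketch (the function name forward_substitution is ours, not from the text):

```julia
# Forward substitution for a lower-triangular system L*x = b,
# computing x_i = (b_i - sum_{j<i} l_ij x_j) / l_ii from the first row on.
function forward_substitution(L, b)
    n = length(b)
    x = zeros(n)
    for i in 1:n
        x[i] = (b[i] - sum(L[i, j] * x[j] for j in 1:i-1; init = 0.0)) / L[i, i]
    end
    return x
end
```

Division by a zero diagonal element signals the singular case described below.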
Analogously, if the system has the special form

𝑈𝐱 = 𝐛,

where 𝑈 is an upper-triangular matrix, then the elements of the solution 𝐱 are


found as

𝑥𝑖 ∶= (𝑏𝑖 − ∑_{𝑗=𝑖+1}^{𝑛} 𝑢𝑖𝑗 𝑥𝑗) ∕ 𝑢𝑖𝑖    ∀𝑖 ∈ {𝑛, … , 1}
starting with the last equation. This algorithm is called backward substitution.
Both forward and backward substitution fail if and only if one of the diagonal
elements is zero, but this is equivalent to the system matrix 𝐿 or 𝑈 being singular.
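Backward substitution is the mirror image; again a minimal sketch with a function name of our own choosing:

```julia
# Backward substitution for an upper-triangular system U*x = b,
# starting with the last equation and working upwards.
function backward_substitution(U, b)
    n = length(b)
    x = zeros(n)
    for i in n:-1:1
        x[i] = (b[i] - sum(U[i, j] * x[j] for j in i+1:n; init = 0.0)) / U[i, i]
    end
    return x
end
```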
In the next step we use Gaussian elimination to factor the matrix 𝐴 into 𝐴 =
𝑃⊤ 𝐿𝑈. This factorization is called 𝐿𝑈 factorization; the permutation matrix 𝑃 is
only a complication of the basic idea. 𝐿𝑈 factorization then immediately results
in an algorithm for solving the linear system

𝐴𝐱 = 𝑃⊤ 𝐿 (𝑈𝐱) = 𝐛,    where 𝐲 ∶= 𝑈𝐱.
190 8 Arrays and Linear Algebra

Algorithm 8.5 (solve linear system)


1. Factor 𝐴 into 𝐴 = 𝑃⊤ 𝐿𝑈 by Gaussian elimination.
2. Solve 𝑃⊤ 𝐿𝐲 = 𝐛, which is equivalent to 𝐿𝐲 = 𝑃𝐛, for 𝐲 using forward substi-
tution.
3. Solve 𝑈𝐱 = 𝐲 for 𝐱 using backward substitution.
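The three steps can be traced explicitly with Julia's built-in lu (discussed in detail below); a sketch with a small example system of our own:

```julia
using LinearAlgebra

A = [2.0 1.0 1.0; 4.0 3.0 3.0; 8.0 7.0 9.0]
b = [1.0, 2.0, 3.0]

f = lu(A)          # step 1: A = P'*L*U, with f.p the row-permutation vector
y = f.L \ b[f.p]   # step 2: forward substitution on L*y = P*b
x = f.U \ y        # step 3: backward substitution on U*x = y
```

The triangular solves in steps 2 and 3 are exactly the forward and backward substitutions above.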
But how can we factor the matrix 𝐴 into 𝐴 = 𝐿𝑈 using Gaussian elimina-
tion? The idea of Gaussian elimination is that row operations are used on the
augmented matrix 𝐴̂ ∶= (𝐴, 𝐛) such that the matrix 𝐴 becomes an upper trian-
gular matrix. The row operations that yield an equivalent system are operations
on the equations of the system: it is possible
1. to multiply a row or equation by a non-zero scalar,
2. to add a multiple of a row or equation to another one, and
3. to swap the position of two rows or equations.
Algorithm 8.6 (Gaussian elimination, 𝐿𝑈 factorization)
1. Set 𝑗 ∶= 1. The counter 𝑗 indicates the current column.
2. Repeat:
a. Find the row index 𝑖 of the element in {𝑎𝑗𝑗 , 𝑎𝑗+1,𝑗 , … , 𝑎𝑛𝑗 } with the
largest absolute value, i.e.,

𝑖 ∶= arg max_{𝑘∈{𝑗,…,𝑛}} |𝑎𝑘𝑗 |.

This element 𝑎𝑖𝑗 is called the pivot element. If the pivot element is equal
to zero, the matrix 𝐴 is singular and the algorithm stops.
Swap the 𝑖-th and the 𝑗-th rows such that the pivot element is now
located at 𝑎𝑗𝑗 in the 𝑗-th row. Record swapping the two rows as left-
multiplication by a permutation matrix 𝑃𝑗 .
b. For all rows 𝑖 ∈ {𝑗 + 1, … , 𝑛} below the pivot element, add the pivot
row multiplied by −𝑎𝑖𝑗 ∕𝑎𝑗𝑗 to the 𝑖-th row such that all matrix elements
below the pivot element vanish.
Record these row operations by left-multiplication with a lower-triang-
ular matrix 𝐿𝑗 .
c. Increase 𝑗 by one and repeat while 𝑗 ≤ 𝑛.
Swapping the rows in the second step is called row pivoting or partial pivoting.
Choosing the element with the largest absolute value improves the numerical
stability of the algorithm, although other variants are used as well.
If it is not possible to find a non-zero pivot element in the second step, we
have shown that the matrix is singular. In fact, Gaussian elimination is a decision
procedure for inverting a square matrix 𝐴: if it succeeds, the inverse matrix 𝐴−1 can
be calculated; if it does not succeed, it has constructed a proof that the matrix
is singular.
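Algorithm 8.6 can be transcribed quite directly; the following minimal implementation (our own sketch, without the blocking and scaling refinements of a library routine) records the permutation as a vector p such that A[p, :] ≈ L * U:

```julia
using LinearAlgebra

# LU factorization with partial (row) pivoting, Algorithm 8.6 in sketch form.
function lu_partial_pivoting(A)
    n = size(A, 1)
    U = float.(copy(A))
    L = Matrix(1.0I, n, n)
    p = collect(1:n)
    for j in 1:n-1
        i = j - 1 + argmax(abs.(U[j:n, j]))     # pivot row (largest |a_kj|)
        iszero(U[i, j]) && error("matrix is singular")
        # swap rows i and j of U, of the permutation, and of the computed part of L
        U[[i, j], :] = U[[j, i], :]
        p[[i, j]] = p[[j, i]]
        L[[i, j], 1:j-1] = L[[j, i], 1:j-1]
        for k in j+1:n                           # eliminate below the pivot
            L[k, j] = U[k, j] / U[j, j]
            U[k, :] -= L[k, j] * U[j, :]
        end
    end
    return L, U, p
end
```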
The following theorem states that Gaussian elimination indeed yields a fac-
torization of the desired shape.

Theorem 8.7 (𝐿𝑈 factorization) Gaussian elimination of a regular matrix 𝐴


yields a factorization that is of the form

𝐴 = 𝑃⊤ 𝐿𝑈.

Proof Gaussian elimination performed on a regular matrix 𝐴 yields the equa-


tion
𝐿𝑛 𝑃𝑛 ⋯ 𝐿1 𝑃1 𝐴 = 𝑈,
where the matrices 𝐿𝑗 and 𝑃𝑗 record the row operations performed and the right-
hand side is upper triangular by construction.
The lower-triangular matrices 𝐿𝑗 record adding the 𝑗-th row multiplied by 𝜆𝑖𝑗
to the 𝑖-th rows, where 𝑖 > 𝑗, and therefore they have the form

𝐿𝑗 = 𝐼 + ∑_{𝑖=𝑗+1}^{𝑛} 𝜆𝑖𝑗 𝐞𝑖 𝐞𝑗⊤ ,

which is indeed lower triangular. The only non-zero element of the product 𝐞𝑖 𝐞𝑗⊤
is in row 𝑖 and column 𝑗.
The permutation matrices 𝑃𝑗 record swapping the 𝑖-th row with the 𝑗-th one,
where 𝑖 > 𝑗 again. It is straightforward to show that the inverse of a permutation
matrix 𝑃 is 𝑃−1 = 𝑃⊤ .
The goal is to reorder the terms in the product 𝐿𝑛 𝑃𝑛 ⋯ 𝐿1 𝑃1 such that it be-
comes 𝐾𝑛 ⋯ 𝐾1 𝑃𝑛 ⋯ 𝑃1 . This can be achieved by moving the matrices 𝑃𝑗 to the
right repeatedly by replacing the products 𝑃𝑘 𝐿𝑗 with 𝑘 > 𝑗 by products 𝐾𝑗 𝑃𝑘 .
We have 𝐾𝑗 = 𝑃𝑘 𝐿𝑗 𝑃𝑘⊤ . Since 𝑘 > 𝑗, multiplication of 𝐿𝑗 on the right by 𝑃𝑘⊤ only
swaps two columns whose only non-zero element is equal to one. Since 𝑘 > 𝑗,
multiplication of 𝐿𝑗 𝑃𝑘⊤ on the left by 𝑃𝑘 swaps these two ones back into the main
diagonal and leaves the structure of the matrix unchanged otherwise. Therefore
𝐾𝑗 has the same, lower-triangular structure as 𝐿𝑗 , and all the terms can be re-
ordered such that

𝐿𝑛 𝑃𝑛 ⋯ 𝐿1 𝑃1 𝐴 = 𝐾𝑛 ⋯ 𝐾1 𝑃𝑛 ⋯ 𝑃1 𝐴 = 𝑈.

The last equation yields

𝑃𝐴 = (𝐾𝑛 ⋯ 𝐾1 )−1 𝑈 = 𝐾1−1 ⋯ 𝐾𝑛−1 𝑈.

We now show that the inverse of

𝐾𝑗 = 𝐼 + ∑_{𝑖=𝑗+1}^{𝑛} 𝜅𝑖𝑗 𝐞𝑖 𝐞𝑗⊤

is simply

𝐾𝑗−1 = 𝐼 − ∑_{𝑖=𝑗+1}^{𝑛} 𝜅𝑖𝑗 𝐞𝑖 𝐞𝑗⊤ .

Multiplying out the products 𝐾𝑗 𝐾𝑗−1 and 𝐾𝑗−1 𝐾𝑗 yields

𝐾𝑗 𝐾𝑗−1 = 𝐾𝑗−1 𝐾𝑗 = 𝐼 + ∑_{𝑘=𝑗+1}^{𝑛} 𝜅𝑘𝑗 𝐞𝑘 𝐞𝑗⊤ − ∑_{𝑙=𝑗+1}^{𝑛} 𝜅𝑙𝑗 𝐞𝑙 𝐞𝑗⊤
                      − (∑_{𝑘=𝑗+1}^{𝑛} 𝜅𝑘𝑗 𝐞𝑘 𝐞𝑗⊤)(∑_{𝑙=𝑗+1}^{𝑛} 𝜅𝑙𝑗 𝐞𝑙 𝐞𝑗⊤) = 𝐼.

The last of the four terms vanishes because of the inner products 𝐞𝑗⊤ 𝐞𝑙 with 𝑗 ≠ 𝑙.
Therefore the inverses 𝐾𝑗−1 have the same lower-triangular structure as the
matrices 𝐿𝑗 . Since the product of lower-triangular matrices is again lower-triang-
ular, we have thus shown that

𝑃𝐴 = 𝐿𝑈,

which completes the proof. □

𝐿𝑈 factorization is also useful to compute the determinant det(𝐴) of a square


matrix, as the next theorem shows.

Theorem 8.8 (determinant by 𝐿𝑈 factorization) The determinant of a regu-


lar square matrix 𝐴 with 𝐿𝑈 factorization 𝐴 = 𝑃⊤ 𝐿𝑈 is given by

det(𝐴) = (−1)𝑝 (∏_{𝑖=1}^{𝑛} 𝑙𝑖𝑖) (∏_{𝑖=1}^{𝑛} 𝑢𝑖𝑖),

where 𝑝 is the number of row exchanges in the factorization.

Proof The formula follows from det(𝐴) = det(𝑃⊤ ) det(𝐿) det(𝑈) together with det(𝑃⊤ ) = (−1)𝑝 . □
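The formula can be checked against the built-in determinant; note that Julia's lu returns a factor 𝐿 with unit diagonal, so its product contributes 1 (a small sketch of our own):

```julia
using LinearAlgebra

A = [2.0 1.0; 4.0 3.0]
f = lu(A)
# det(f.P) equals (-1)^p, the sign of the row permutation
d = det(f.P) * prod(diag(f.L)) * prod(diag(f.U))
```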

In Julia, 𝐿𝑈 factorization is implemented for dense and sparse matrices
by the functions LinearAlgebra.lu and LinearAlgebra.lu!. In the case of a
dense matrix, the components of the factorization are the fields L for the lower-
triangular matrix, U for the upper-triangular matrix, P for the right permutation
matrix, and p for the right permutation vector, which is a space-saving and con-
venient representation of the permutation.
julia> A = randn(4, 4); f = lu(A);
julia> f.L * f.U - A[f.p, :]
4×4 Matrix{Float64}:
 0.0          0.0          0.0           0.0
 0.0          0.0         -1.11022e-16   0.0
 2.22045e-16  0.0          8.32667e-17   0.0
 0.0          1.11022e-16 -1.66533e-16   0.0

Furthermore, the resulting factorization can be used as an argument to the
functions /, \, det, inv, logdet, logabsdet, and size.
julia> size(f)
(4, 4)
julia> det(f)
4.504801047286308
julia> A * inv(f)
4×4 Matrix{Float64}:
  1.0          7.05739e-17  1.25627e-17  -1.0213e-16
  3.81014e-17  1.0          1.23241e-16   1.6374e-17
 -2.50489e-16  3.05704e-17  1.0           1.67098e-16
  3.03655e-16  1.52831e-16  1.36405e-16   1.0

The functions LinearAlgebra.lu and LinearAlgebra.lu! also work on
sparse matrices. Then the additional fields are q for the left permutation vector
and Rs for the vector of scaling factors.
julia> A = sprandn(4, 4, 1/2); f = lu(A);
julia> f.L * f.U - (f.Rs .* A)[f.p, f.q]
4×4 SparseMatrixCSC{Float64, Int64} with 0 stored entries

Choosing the element with the maximal absolute value as the pivot element is
important for the precision of the algorithm. To see this, we consider the example

       ⎛𝜖    1⎞           ⎛1 + 𝜖⎞
𝐴 ∶= ⎝1  −1⎠ ,   𝐛 ∶= ⎝  0  ⎠

with 𝜖 ≪ 1. If we choose 𝑎11 = 𝜖 as the pivot element, already in place, then


Gaussian elimination results in
⎛𝜖       1          1 + 𝜖     ⎞
⎝0   −1 − 1∕𝜖   −(1 + 𝜖)∕𝜖 ⎠ .

The second row yields 𝑥2 = 1, and the first row yields

𝑥1 = ((1 + 𝜖) − 1) ∕ 𝜖.
Symbolic evaluation of this expression results in the correct solution 𝐱 = (1, 1)⊤ .
Numerical evaluation, however, requires to subtract 1 from 1 + 𝜖, which are two
close numbers. This leads to cancellation and the loss of digits in floating-point
arithmetic. The problem is aggravated by the division by 𝜖. This effect makes the
solution unstable.
julia> e = 1.5*eps(1.0); @show e; @show (1+e)-1; @show ((1+e)-1)/e;
e = 3.3306690738754696e-16
(1 + e) - 1 = 4.440892098500626e-16
((1 + e) - 1) / e = 1.3333333333333333

The function eps returns the epsilon of the given floating-point type, which is
defined as the gap between 1 and the next-largest value representable by this
type.
On the other hand, row pivoting yields

⎛1     −1        0    ⎞
⎝0   1 + 𝜖    1 + 𝜖 ⎠

and hence 𝑥2 = (1 + 𝜖)∕(1 + 𝜖) = 1 and

𝑥1 = (0 − (−1)) ∕ 1.
In this quotient, no such problem occurs.
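Both elimination orders can be replayed in Float64 arithmetic; a sketch of our own with e = 1.0e-17 standing in for 𝜖 ≪ 1, where the unpivoted order loses the solution 𝑥1 = 1 completely:

```julia
# Solve [e 1; 1 -1]*x = [1+e, 0] by hand in floating-point arithmetic.
e = 1.0e-17

# without pivoting: cancellation in (1+e) - x2 destroys x1
x2 = (-(1 + e) / e) / (-1 - 1 / e)
x1 = ((1 + e) - x2) / e

# with row pivoting: no close numbers are subtracted
y2 = (1 + e) / (1 + e)
y1 = (0 + y2) / 1
```

Here x1 evaluates to 0.0 instead of 1, while the pivoted y1 and y2 are both exactly 1.0.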
A special type of matrices of importance is the following.
Definition 8.9 (positive-definite matrix) A Hermitian matrix 𝐴 ∈ 𝐹 𝑛×𝑛 is
called positive definite if

𝐱∗ 𝐴𝐱 > 0 ∀𝐱 ∈ 𝐹 𝑛 ∖{𝟎}

holds.
It is easy to construct positive-definite matrices. If 𝐴 ∈ ℂ𝑛×𝑛 is regular,
then 𝐴∗ 𝐴 is a positive-definite matrix. 𝐴∗ 𝐴 is obviously Hermitian; furthermore,
𝐱∗ 𝐴∗ 𝐴𝐱 = ‖𝐴𝐱‖22 > 0, since 𝐴𝐱 ≠ 0 due to the regularity of 𝐴.
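This construction is easily checked in Julia; a small sketch with a fixed regular matrix of our own:

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]        # regular, since det(A) = -2 ≠ 0
B = A' * A                     # Hermitian and positive definite by the argument above
ishermitian(B), isposdef(B)
```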
If 𝐴 is a positive-definite matrix, then Cholesky factorization, which is a
specialization of 𝐿𝑈 factorization for this type of matrices, can be used and is
roughly twice as efficient as 𝐿𝑈 factorization.

Theorem 8.10 (Cholesky factorization) A matrix 𝐴 ∈ 𝐹 𝑛×𝑛 is positive definite


if and only if there exists a unique lower-triangular matrix 𝐿 with real and strictly
positive diagonal elements such that

𝐴 = 𝐿𝐿∗ ,

which is called the Cholesky factorization of 𝐴.

Cholesky factorization is a decision procedure for positive definiteness: it ei-


ther succeeds in constructing such a factorization or it shows that the matrix is
not positive definite.
In Julia, Cholesky factorization is implemented by the functions
LinearAlgebra.cholesky and LinearAlgebra.cholesky! analogously
to LinearAlgebra.lu. Testing whether a matrix is Hermitian is imple-
mented by LinearAlgebra.ishermitian, and testing for positive defi-
niteness is implemented by the functions LinearAlgebra.isposdef and
LinearAlgebra.isposdef!.
julia> A = [1 2; 2 1]; ishermitian(A), isposdef(A)
(true, false)
julia> A = [2 -1+1im; -1-1im 2]; ishermitian(A), isposdef(A)
(true, true)
julia> f = cholesky(A); f.L * f.U - A
2×2 Matrix{ComplexF64}:
 4.44089e-16+0.0im  0.0+0.0im
         0.0+0.0im  0.0+0.0im

8.4.8.3 Overdetermined Systems

Overdetermined systems of linear equations are systems of the form (8.6) where

𝑛>𝑚

holds for the system matrix 𝐴 ∈ 𝐹 𝑛×𝑚 , i.e., the number 𝑛 of equations is larger
than the number 𝑚 of unknowns. The unknown vector 𝐱 is an element of 𝐹 𝑚 and
the inhomogeneity 𝐛 is an element of 𝐹 𝑛 . In general, overdetermined systems do
not have a solution, since the equations are contradictory.
Instead of solving the system, it is expedient to minimize a norm of the
residuum 𝐛 − 𝐴𝐱, i.e., to find

arg min ‖𝐛 − 𝐴𝐱‖.


𝐱∈𝐹 𝑚

Any choice of norm is possible. If the infinity norm is chosen, then this type
of problem is often called a minimax problem. In the case of the 2-norm, the
problem is called a linear least-squares problem. The name stems from the form
‖𝐛 − 𝐴𝐱‖22 = ∑_{𝑖=1}^{𝑛} |𝑏𝑖 − (𝐴𝐱)𝑖 |2    (8.8)

of the 2-norm of the residuum.


This problem has a long history and is of great importance for regression or
fitting functions to data. The choice of the 2-norm is also distinguished by its
relationship to linear systems. Since the expression in (8.8) to be minimized is
quadratic, its derivative is linear and vanishes at a minimum. This fact results in
the close relationship between least-squares problems and linear systems, as we
will see in more detail.
From now on, we use the 2-norm and introduce the notation

𝐴𝐱 ≈ 𝐛

for the linear least-squares problem

arg min_{𝐱∈𝐹 𝑚 } ‖𝐛 − 𝐴𝐱‖2 .

Regression or data-fitting problems result in overdetermined linear systems


as follows. Suppose that there are 𝑛 measurements (𝑏1 , … , 𝑏𝑛 ) of a dependent
variable 𝑏(𝑡) at 𝑛 values (𝑡1 , … , 𝑡𝑛 ) of the independent variable 𝑡. We want to
model the dependent variable 𝑏(𝑡) as a linear combination

𝑏(𝑡) = 𝑥1 𝑓1 (𝑡) + ⋯ + 𝑥𝑚 𝑓𝑚 (𝑡) (8.9)

of certain 𝑚 functions (𝑓1 (𝑡), … , 𝑓𝑚 (𝑡)) with the coefficients (𝑥1 , … , 𝑥𝑚 ). Since
we know the values 𝑏𝑖 = 𝑏(𝑡𝑖 ), we obtain the equations

𝑏(𝑡𝑖 ) = 𝑥1 𝑓1 (𝑡𝑖 ) + ⋯ + 𝑥𝑚 𝑓𝑚 (𝑡𝑖 ) = 𝑏𝑖 ∀𝑖 ∈ {1, … , 𝑛},

which can be written as the system

⎛ 𝑓1 (𝑡1 )  ⋯  𝑓𝑚 (𝑡1 ) ⎞ ⎛ 𝑥1 ⎞   ⎛ 𝑏1 ⎞
⎜    ⋮             ⋮    ⎟ ⎜  ⋮ ⎟ = ⎜  ⋮ ⎟ .
⎝ 𝑓1 (𝑡𝑛 )  ⋯  𝑓𝑚 (𝑡𝑛 ) ⎠ ⎝ 𝑥𝑚 ⎠   ⎝ 𝑏𝑛 ⎠
After setting
𝑎𝑖𝑗 ∶= 𝑓𝑗 (𝑡𝑖 ),
we have thus found a linear system 𝐴𝐱 = 𝐛 or a linear least-squares problem
𝐴𝐱 ≈ 𝐛.
Furthermore, substitutions can be used to write nonlinear relationships be-
tween the dependent and independent variables in the form (8.9) of a linear
combination. For example, taking the logarithm of both sides of the power law
𝑐(𝑠) ∶= 𝛼𝑠𝛽 , we find ln 𝑐𝑖 = ln 𝛼 + 𝛽 ln 𝑠𝑖 and hence define 𝑡𝑖 ∶= ln 𝑠𝑖 and
𝑏𝑖 ∶= ln 𝑐𝑖 . This yields the linear relationship 𝑏𝑖 = ln 𝛼 + 𝛽𝑡𝑖 and the two func-
tions 𝑓1 (𝑡) ∶= 1 and 𝑓2 (𝑡) ∶= 𝑡. Then 𝑥1 = ln 𝛼 and 𝑥2 = 𝛽.
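As an illustration, fitting the power law 𝑐(𝑠) = 𝛼𝑠^𝛽 reduces after this substitution to an overdetermined system, which the operator \ solves in the least-squares sense; the synthetic data in this sketch are our own:

```julia
# Fit c(s) = α s^β by linearizing: ln c = ln α + β ln s.
s = [1.0, 2.0, 3.0, 4.0, 5.0]
c = 2.5 .* s .^ 1.7                # synthetic data with α = 2.5, β = 1.7
t = log.(s); b = log.(c)
A = [ones(length(t)) t]            # columns f1(t) = 1 and f2(t) = t
x = A \ b                          # least-squares solution (x1, x2)
alpha, beta = exp(x[1]), x[2]
```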
The following theorem gives the already expected relationship between linear
least-squares problems and linear systems.
Theorem 8.11 (least-squares problem) If 𝐱 ∈ 𝐹 𝑚 solves the linear system

𝐴∗ (𝐴𝐱 − 𝐛) = 𝟎, (8.10)

then 𝐱 solves the least-squares problem 𝐴𝐱 ≈ 𝐛.


Proof (elementary) Let 𝐲 ∈ 𝐹 𝑚 be an arbitrary vector. Then the inequality

‖𝐴(𝐱 + 𝐲) − 𝐛‖22 = ((𝐴𝐱 − 𝐛) + 𝐴𝐲)∗ ((𝐴𝐱 − 𝐛) + 𝐴𝐲)
                 = (𝐴𝐱 − 𝐛)∗ (𝐴𝐱 − 𝐛) + 2(𝐴𝐲)∗ (𝐴𝐱 − 𝐛) + (𝐴𝐲)∗ (𝐴𝐲)
                 = ‖𝐴𝐱 − 𝐛‖22 + 2𝐲 ∗ 𝐴∗ (𝐴𝐱 − 𝐛) + ‖𝐴𝐲‖22    (𝐴∗ (𝐴𝐱 − 𝐛) = 𝟎 by (8.10))
                 ≥ ‖𝐴𝐱 − 𝐛‖22

holds. Here we have used the fact that if 𝐮 and 𝐯 are two vectors, then the equa-
tion (𝐮 + 𝐯)∗ (𝐮 + 𝐯) = 𝐮∗ 𝐮 + 2𝐯 ∗ 𝐮 + 𝐯 ∗ 𝐯 holds due to 𝐮∗ 𝐯 = 𝐯 ∗ 𝐮 being a
scalar.
The inequality shows that any other vector 𝐱 + 𝐲 cannot be a solution of the
least-squares problem. □
Proof (using calculus) The gradient of the expression to be minimized is

∇𝐱 ‖𝐛 − 𝐴𝐱‖22 = ∇𝐱 ∑_{𝑖=1}^{𝑛} (𝑏𝑖 − ∑_{𝑗=1}^{𝑚} 𝑎𝑖𝑗 𝑥𝑗 )2 .

The 𝑘-th element of the gradient is

𝜕𝑥𝑘 ‖𝐛 − 𝐴𝐱‖22 = −2 ∑_{𝑖=1}^{𝑛} 𝑎𝑖𝑘 (𝑏𝑖 − ∑_{𝑗=1}^{𝑚} 𝑎𝑖𝑗 𝑥𝑗 ),

which yields
∇𝐱 ‖𝐛 − 𝐴𝐱‖22 = 2𝐴∗ (𝐴𝐱 − 𝐛) = 𝟎.
The last equation holds due to (8.10). □
The condition (8.10) is equivalent to

𝐴∗ 𝐴𝐱 = 𝐴∗ 𝐛. (8.11)

The matrix 𝐴∗ 𝐴 on the left-hand side is (𝑚 × 𝑚)-dimensional, and therefore this


system is a square linear system for 𝐱. These equations are called the normal
equations. This form of the condition (8.10) motivates the following definition.
Definition 8.12 (pseudoinverse) The matrix

𝐴+ ∶= (𝐴∗ 𝐴)−1 𝐴∗ ∈ 𝐹 𝑚×𝑛

is called the pseudoinverse of 𝐴 ∈ 𝐹 𝑛×𝑚 .


With this definition, we can at least formally write the solution of the least-
squares problem 𝐴𝐱 ≈ 𝐛 as
𝐱 = 𝐴+ 𝐛.
It is straightforward to see that 𝐴∗ 𝐴 is Hermitian. In general, 𝐴∗ 𝐴 is not regu-
lar, however. The properties of the pseudoinverse are studied in more detail in
Sect. 8.4.8.4.
Every matrix 𝐴 ∈ 𝐹 𝑛×𝑚 can also be factored as

𝐴 = 𝑄𝑅,

where 𝑄 ∈ 𝐹 𝑛×𝑛 is an orthogonal/unitary matrix and 𝑅 ∈ 𝐹 𝑛×𝑚 is an upper-triangular


matrix. This factorization is called 𝑄𝑅 factorization. Before discussing its details,
we define orthogonal vectors and matrices.

Definition 8.13 (orthogonal and orthonormal vectors) A set {𝐪1 , … , 𝐪𝑛 } of


vectors is called orthogonal, if

∀𝑖 ∈ {1, … , 𝑛} ∶ ∀𝑗 ∈ {1, … , 𝑛} ∶ 𝑖 ≠ 𝑗 ⟹ 𝐪∗𝑖 𝐪𝑗 = 0

holds. If 𝐪∗𝑖 𝐪𝑖 = 1 additionally holds for all 𝑖 ∈ {1, … , 𝑛}, the vectors are called
orthonormal.

Definition 8.14 (orthogonal matrix) A square matrix 𝑄 ∈ ℝ𝑛×𝑛 is called or-


thogonal if its columns are orthonormal vectors.

Definition 8.15 (unitary matrix) A square matrix 𝑄 ∈ ℂ𝑛×𝑛 is called unitary


if its columns are orthonormal vectors.

Theorem 8.16 (inverse of orthogonal/unitary matrix) The inverse of an or-


thogonal/unitary matrix 𝑄 is 𝑄−1 = 𝑄∗ .

Proof By definition, 𝑄∗ 𝑄 = 𝐼 and 𝑄𝑄∗ = 𝐼. Therefore the left and right inverses
of 𝑄 are equal to 𝑄∗ . □

Theorem 8.17 (orthogonal matrix) A matrix 𝑄 ∈ 𝐹 𝑛×𝑚 is orthogonal if and


only if
‖𝑄𝐱‖2 = ‖𝐱‖2 ∀𝐱 ∈ 𝐹 𝑚
holds.

Geometrically, this property means that an orthogonal matrix represents a lin-


ear isometry. An isometry is a function that does not change the lengths of its
arguments.
Orthogonal vectors are important both theoretically and computationally. In
theoretic calculations, they have the convenient property that their inner prod-
ucts 𝐪∗𝑖 𝐪𝑗 vanish (if 𝑖 ≠ 𝑗), which implies that norms of their difference are
calculated easily as
‖𝐪𝑖 − 𝐪𝑗 ‖22 = ‖𝐪𝑖 ‖22 + ‖𝐪𝑗 ‖22 (8.12)
if 𝑖 ≠ 𝑗. In computations, orthogonality has the advantageous property that it
avoids multidimensional subtractive cancellation. If 𝐱 ≈ 𝐲, then ‖𝐱 − 𝐲‖ ≪ ‖𝐱‖
and ‖𝐱 − 𝐲‖ ≪ ‖𝐲‖ and subtractive cancellation may occur. Orthogonal vectors
cannot suffer from this problem due to (8.12). In other words, a basis consisting
of orthogonal vectors is expedient for computations.
The importance of 𝑄𝑅 factorization stems from the fact that it yields an or-
thogonal basis 𝑄 and a basis change 𝑅 in convenient upper-triangular form.
𝑄𝑅 factorizations can be calculated by Gram–Schmidt orthogonalization, by
Householder transformations, and by Givens rotations. These methods have
their advantages and disadvantages. The advantage of the Gram–Schmidt algo-
rithm is that it is implemented easily, while its disadvantage is that it is numer-
ically unstable. Givens rotations can be parallelized better than the other meth-
ods, but their implementation is more involved. Householder transformations

cannot be parallelized, but they are the simplest of the numerically stable 𝑄𝑅 fac-
torization algorithms. We therefore discuss Householder transformations, also
called Householder reflections, for 𝑄𝑅 factorization in the following.
The defining properties of a Householder transformation or reflection are that
it is represented by an orthogonal/unitary matrix and that it maps a vector 𝐱 to
a vector whose only non-zero element is the first one, i.e.,

𝑃𝐱 = (±‖𝐱‖2 , 0, … , 0)⊤ .
The first element of 𝑃𝐱 must be ±‖𝐱‖2 because of Theorem 8.17. The House-
holder reflection 𝑃 reflects through the line bisecting the angle between 𝐱 and 𝐞1 .
The advantageous numerical property is that the maximum bisected angle is 45◦ .
On the other hand, orthogonal projection of the vector 𝐱 onto 𝐞1 as used in the
Gram–Schmidt algorithm is numerically unstable whenever 𝐱 and 𝐞1 are approx-
imately parallel.
The following two theorems show how to construct Householder reflections
in the real case 𝐹 = ℝ and in the complex case 𝐹 = ℂ.
Theorem 8.18 (Householder reflection (𝐹 = ℝ)) Suppose 𝐱 ∈ ℝ𝑛 and choose
an 𝛼 ∈ ℝ such that |𝛼| = ‖𝐱‖2 . Define 𝐮 ∶= 𝐱 − 𝛼𝐞1 and

𝑃 ∶= 𝐼 − (2 ∕ 𝐮⊤ 𝐮) 𝐮𝐮⊤   if 𝐮 ≠ 𝟎,    and    𝑃 ∶= 𝐼   if 𝐮 = 𝟎.

Then 𝑃 ∈ ℝ𝑛×𝑛 is symmetric and orthogonal, and the equation 𝑃𝐱 = 𝛼𝐞1 holds.
Theorem 8.19 (Householder reflection (𝐹 = ℂ)) Suppose 𝐱 ∈ ℂ𝑛 and define
𝛼 ∶= −e^{i arg 𝑥1 } ‖𝐱‖2 . Define 𝐮 ∶= 𝐱 − 𝛼𝐞1 , 𝐯 ∶= 𝐮∕‖𝐮‖2 unless 𝐮 = 𝟎, and

𝑃 ∶= 𝐼 − (1 + 𝐱∗ 𝐯 ∕ 𝐯 ∗ 𝐱) 𝐯𝐯 ∗   if 𝐮 ≠ 𝟎,    and    𝑃 ∶= 𝐼   if 𝐮 = 𝟎.

Then 𝑃 ∈ 𝐹 𝑛×𝑛 is Hermitian and unitary, and the equation 𝑃𝐱 = 𝛼𝐞1 holds.
When using floating-point numbers, the scalar 𝛼 in the real case in Theo-
rem 8.18 should be chosen so that it has the opposite sign to the first coordinate
𝑥1 of 𝐱, which is the pivot element in Algorithm 8.20, in order to avoid cancel-
lation. The choice of 𝛼 in Theorem 8.19 also achieves this in the complex case
[3, Section 4.7].
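Theorem 8.18 translates into a few lines of code; the following sketch (the function name is ours) applies the sign choice just described via copysign:

```julia
using LinearAlgebra

# Householder reflection for a real vector x: P is symmetric, orthogonal,
# and maps x to alpha*e1; the sign of alpha avoids cancellation in u.
function householder(x)
    n = length(x)
    alpha = -copysign(norm(x), x[1])
    u = x - alpha * [1.0; zeros(n - 1)]
    P = iszero(u) ? Matrix(1.0I, n, n) :
                    Matrix(1.0I, n, n) - (2 / dot(u, u)) * (u * u')
    return P, alpha
end
```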
We can now state the algorithm and prove the theorem for 𝑄𝑅 factorization.
Since several Householder reflections are used, we use a more precise notation
now and denote the Householder reflection for the vector 𝐱 by 𝑃(𝐱).

Algorithm 8.20 (𝑄𝑅 factorization)


1. Set 𝑗 ∶= 0 and 𝑅0 ∶= 𝐴.
2. Repeat:
a. Increase 𝑗.
b. Let 𝐱𝑗 be the 𝑗-th column of 𝑅𝑗−1 rows 𝑗 to 𝑛, i.e., 𝐱𝑗 ∶= 𝑅𝑗−1 [𝑗 ∶ 𝑛, 𝑗].
(Here the notation 𝑗 ∶ 𝑛 indicates the integers 𝑗, 𝑗 + 1, … , 𝑛.) Set 𝑃𝑗 ∶=
𝑃(𝐱𝑗 ), the Householder reflection constructed for 𝐱𝑗 as in Theorem 8.18
or 8.19.
c. Set
𝑄𝑗 ∶= ⎛ 𝐼𝑗−1   𝟎∗ ⎞
      ⎝  𝟎     𝑃𝑗 ⎠ .

The matrix 𝑃𝑗 has size (𝑛 + 1 − 𝑗) × (𝑛 + 1 − 𝑗), and 𝐼𝑗−1 is the identity


matrix of size (𝑗 − 1) × (𝑗 − 1). Therefore 𝑄𝑗 has size 𝑛 × 𝑛.
d. Set 𝑅𝑗 ∶= 𝑄𝑗 𝑅𝑗−1 . The elements of the matrix 𝑅𝑗 in the first 𝑗 columns
below the main diagonal are all zero because of the actions of the House-
holder reflections 𝑃𝑘 , 𝑘 ≤ 𝑗.
e. Repeat while 𝑗 < min(𝑛 − 1, 𝑚) =∶ 𝑁. If 𝑚 < 𝑛, i.e., the matrix 𝐴 is tall,
then a Householder reflection for each of the 𝑚 columns must be used,
resulting in 𝑚 steps. If 𝑚 ≥ 𝑛 on the other hand, then 𝑛 −1 Householder
reflections are necessary to zero all elements below the main diagonal,
which comprises 𝑛 elements.
3. After the loop, the product

𝑄𝑁 ⋯ 𝑄1 𝐴 = 𝑅𝑁 =∶ 𝑅

is upper triangular by construction. Set

𝑄 ∶= 𝑄1∗ ⋯ 𝑄𝑁∗ .

The matrix 𝑄 is orthogonal/unitary because of Problem 8.11 and Problem


8.12. Because of Theorem 8.16, we have

𝐴 = 𝑄1∗ ⋯ 𝑄𝑁∗ 𝑅 = 𝑄𝑅.

This algorithm always computes the factorization, which implies the follow-
ing theorem.

Theorem 8.21 (𝑄𝑅 factorization) Every matrix 𝐴 ∈ 𝐹 𝑛×𝑚 , 𝐹 ∈ {ℝ, ℂ}, can be
factored as
𝐴 = 𝑄𝑅,
where 𝑄 ∈ 𝐹 𝑛×𝑛 is orthogonal/unitary and 𝑅 ∈ 𝐹 𝑛×𝑚 is upper triangular.
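Algorithm 8.20 can be transcribed compactly for the real case; the following is our own minimal sketch, not the library implementation:

```julia
using LinearAlgebra

# QR factorization by Householder reflections (Algorithm 8.20, real case).
# Q is accumulated explicitly, which is fine for small matrices.
function qr_householder(A)
    n, m = size(A)
    R = float.(copy(A))
    Q = Matrix(1.0I, n, n)
    for j in 1:min(n - 1, m)
        x = R[j:n, j]                      # the j-th column from row j downwards
        alpha = -copysign(norm(x), x[1])   # sign choice avoids cancellation
        u = x - alpha * [1.0; zeros(n - j)]
        iszero(u) && continue              # column is already in triangular form
        P = Matrix(1.0I, n - j + 1, n - j + 1) - (2 / dot(u, u)) * (u * u')
        Qj = Matrix(1.0I, n, n)
        Qj[j:n, j:n] = P                   # embed the reflection as in step c
        R = Qj * R
        Q = Q * Qj'                        # Q = Q1' ⋯ QN'
    end
    return Q, R
end
```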

In Julia, 𝑄𝑅 factorization is implemented by the two functions
LinearAlgebra.qr and LinearAlgebra.qr!, whose optional argument pivot
indicates if pivoting is to be used; it is not used by default. These functions
behave similarly to LinearAlgebra.lu and LinearAlgebra.lu! and return an
object that contains the components of the factorization. The field Q contains
the orthogonal/unitary matrix 𝑄, R contains the upper triangular matrix 𝑅,
and p and P contain the permutation vector and matrix, respectively.
julia> A = randn(3, 4); f = qr(A);
julia> f.Q * f.R - A
3×4 Matrix{Float64}:
 -1.11022e-16  8.88178e-16  -4.44089e-16  -1.11022e-16
  0.0          2.22045e-16  -2.22045e-16   0.0
  0.0          2.22045e-16  -2.77556e-17   0.0

Furthermore, the resulting factorization can be used as an argument to the
functions inv, size, and \. If the matrix 𝐴 is not square, then the function \
returns the least-squares solution with minimal norm.
julia> A \ [1; 2; 3]
4-element Vector{Float64}:
 -0.9553736238383352
  1.4629912433582717
  0.4012402900419049
  0.016950787356409203

These two functions also work on sparse matrices. In this case, row and col-
umn permutations are provided in the fields prow and pcol such that the number
of non-zero entries is reduced.
julia> A = sprandn(4, 4, 1/2); f = qr(A);
julia> f.Q * f.R - A[f.prow, f.pcol]
4×4 SparseMatrixCSC{Float64, Int64} with 0 stored entries

𝑄𝑅 factorization is useful to solve least-squares problems 𝐴x ≈ 𝐛. In overde-


termined systems, i.e., when 𝑛 > 𝑚, 𝑄𝑅 factorization yields a factorization of
the form

𝐴 = (𝐪1  𝐪2  ⋯  𝐪𝑛−1  𝐪𝑛 )
    ⎛ 𝑟11  𝑟12  ⋯  𝑟1𝑚 ⎞
    ⎜  0   𝑟22  ⋯  𝑟2𝑚 ⎟
    ⎜  ⋮         ⋱   ⋮  ⎟
    ⎜  0    0   ⋯  𝑟𝑚𝑚 ⎟ ,
    ⎜  0    0   ⋯   0  ⎟
    ⎜  ⋮              ⋮  ⎟
    ⎝  0    0   ⋯   0  ⎠
where the columns of the matrix 𝑄 are denoted by the vectors 𝐪𝑗 . We notice that
the vectors 𝐪𝑚+1 , … , 𝐪𝑛 are always multiplied by zero. Neglecting the superflu-
ous parts, the product becomes

                            ⎛ 𝑟11  𝑟12  ⋯  𝑟1𝑚 ⎞
𝐴 = (𝐪1  𝐪2  ⋯  𝐪𝑚−1  𝐪𝑚 ) ⎜  0   𝑟22  ⋯  𝑟2𝑚 ⎟ =∶ 𝑄̃ 𝑅̃ ,    (8.13)
                            ⎜  ⋮         ⋱   ⋮  ⎟
                            ⎝  0    0   ⋯  𝑟𝑚𝑚 ⎠

where the columns 𝐪𝑗 , 𝑗 ∈ {1, … , 𝑚}, of 𝑄̃ ∈ 𝐹 𝑛×𝑚 are still orthonormal and
𝑅̃ ∈ 𝐹 𝑚×𝑚 is upper triangular. While 𝑄̃ ∗ 𝑄̃ = 𝐼𝑚 , in general 𝑄̃ 𝑄̃ ∗ ≠ 𝐼𝑛 .
To solve the least-squares problem 𝐴x ≈ 𝐛, we solve the normal equations
(8.11). Using (8.13), they become

𝐴∗ 𝐴𝐱 = 𝐴∗ 𝐛,
𝑅̃ ∗ 𝑄̃ ∗ 𝑄̃ 𝑅̃ 𝐱 = 𝑅̃ ∗ 𝑄̃ ∗ 𝐛,
𝑅̃ ∗ 𝑅̃ 𝐱 = 𝑅̃ ∗ 𝑄̃ ∗ 𝐛.
𝑅̃ ∗ 𝑅𝐱

If 𝐴 has full rank, then 𝑅̃ is regular. Therefore 𝑅̃ ∗ is regular as well and the last
equation becomes
𝑅̃ 𝐱 = 𝑄̃ ∗ 𝐛.
Thus we can find 𝐱 by calculating 𝑄̃ and 𝑅̃ and then using backward substitution
(see Sect. 8.4.8.2).
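With the built-in qr, the reduced factors 𝑄̃ and 𝑅̃ can be extracted and the triangular system solved directly; a sketch with data of our own choosing, agreeing with the least-squares solution from \:

```julia
using LinearAlgebra

A = [1.0 1.0; 1.0 2.0; 1.0 3.0]   # overdetermined: n = 3 equations, m = 2 unknowns
b = [1.0, 2.0, 2.0]

f = qr(A)
Qtilde = Matrix(f.Q)               # thin factor Q̃, here 3×2 with orthonormal columns
Rtilde = f.R                       # 2×2 upper-triangular factor R̃
x = Rtilde \ (Qtilde' * b)         # backward substitution on R̃*x = Q̃*b
```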

8.4.8.4 The Moore–Penrose Pseudoinverse

A pseudoinverse of a matrix 𝐴 is a generalization of the inverse matrix. The


most common pseudoinverse is the Moore–Penrose pseudoinverse, which we
will refer to as the pseudoinverse here. If 𝐴 ∈ 𝐹 𝑛×𝑚 , then the pseudoinverse
𝐴+ ∈ 𝐹 𝑚×𝑛 is defined as the matrix that satisfies the four conditions

𝐴𝐴+ 𝐴 = 𝐴,
𝐴+ 𝐴𝐴+ = 𝐴+ ,
(𝐴𝐴+ )∗ = 𝐴𝐴+ ,
(𝐴+ 𝐴)∗ = 𝐴+ 𝐴.

The first and second conditions mean that 𝐴 and 𝐴+ are weak inverses of 𝐴+
and 𝐴, respectively, in the multiplicative semigroup. The third and fourth condi-
tions mean that 𝐴𝐴+ and 𝐴+ 𝐴 are self-adjoint.
The pseudoinverse as defined by these four conditions exists uniquely. Impor-
tant properties of the pseudoinverse are the following. As expected, if the ma-
trix 𝐴 is regular, then the pseudoinverse is its inverse, i.e., 𝐴+ = 𝐴−1 . The pseu-
doinverse of the pseudoinverse is again the original matrix, i.e., (𝐴+ )+ = 𝐴. Pseu-
doinversion commutes with complex conjugation, taking the conjugate trans-
pose, and with transposition. The pseudoinverse of a scalar multiple of a ma-
trix 𝐴 is the reciprocal multiple of the pseudoinverse, i.e., (𝛼𝐴)+ = 𝛼 −1 𝐴+ for

all 𝛼 ≠ 0 in the underlying field. Finally, the pseudoinverse of a zero matrix is


its (conjugate) transpose.
In Julia, the pseudoinverse of a matrix is calculated using the function
LinearAlgebra.pinv, which uses the singular-value decomposition 𝐴 = 𝑈Σ𝑉 ∗
(see Sect. 8.4.10) to find 𝐴+ = 𝑉Σ+ 𝑈 ∗ in the general case, although there are spe-
cial methods for special matrix types as well. For example, the pseudoinverse Σ+
of the rectangular diagonal matrix Σ is simply found by replacing the elements
on the diagonal whose absolute value is above a certain tolerance by their re-
ciprocals, leaving the rest of the elements, close to zero, unchanged, and finally
transposing the matrix. The value of the tolerance that determines non-zero el-
ements is important for ill-conditioned matrices and can be supplied as an op-
tional second argument to pinv.
julia> pinv([1.0 0.0; 0.0 1.0e-17])
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  0.0
julia> pinv([1.0 0.0; 0.0 1.0e-17], 1e-18)
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0e17
julia> pinv(Diagonal([1.0, 1.0e-17]))
2×2 Diagonal{Float64, Vector{Float64}}:
 1.0   ⋅
  ⋅   1.0e17

There are applications of the pseudoinverse related to the solution of linear


systems where the pseudoinverse plays the role of a generalization of the inverse
matrix and earns its name.
As the solution of a system of linear equations 𝐴𝐱 = 𝐛, 𝐴 ∈ 𝐹 𝑛×𝑚 , does not
necessarily exist uniquely, the pseudoinverse still always solves such a system in
the least-squares sense. More precisely, this means that

𝐲 ∶= 𝐴+ 𝐛

satisfies
‖𝐴𝐳 − 𝐛‖2 ≥ ‖𝐴𝐲 − 𝐛‖2 ∀𝐳 ∈ 𝐹 𝑚 ,
i.e., the vector 𝐲 provides the smallest error in the least-squares sense. Equality
holds if and only if

𝐳 = 𝐴+ 𝐛 + (𝐼 − 𝐴+ 𝐴)𝐮   for some 𝐮 ∈ 𝐹 𝑚 ,    (8.14)

meaning that there are infinitely many minimizing solutions 𝐳 unless 𝐴 has full
rank, i.e., rk(𝐴) = 𝑚. If 𝐴 has full rank, then 𝐼 = 𝐴+ 𝐴 and thus 𝐳 = 𝐲 = 𝐴+ 𝐛.

Solutions of the system 𝐴𝐱 = 𝐛 exist if and only if 𝐴𝐲 = 𝐴𝐴+ 𝐛 = 𝐛. If


this last equation holds, the solution is unique if and only if 𝐴 has full rank, i.e.,
rk(𝐴) = 𝑚 and thus 𝐼 = 𝐴+ 𝐴. If the system has any solutions, they are all given
by 𝐳 in (8.14).
Furthermore, if the system 𝐴𝐱 = 𝐛 has multiple solutions, then the pseudoin-
verse yields the solution 𝐲 of minimal Euclidean norm, i.e., ‖𝐲‖2 ≤ ‖𝐱‖2 holds
for all solutions 𝐱.
We consider two short examples. If the linear system has no solution, the
vector with the smallest least-squares error is found.
julia> A = [1 0; 1 0]; b = [-1; 1]; pinv(A) * b
2-element Vector{Float64}:
 1.1102230246251565e-16
 0.0

In geometric terms, the point (0, 0)⊤ on the line 𝜆(1, 1)⊤ , 𝜆 ∈ ℝ, has the smallest
Euclidean distance from the point (−1, 1)⊤ .
If the linear system has multiple solutions, the vector with the minimal Eu-
clidean norm is found.
julia> A = [1 0; 1 0]; b = [1; 1]; pinv(A) * b
2-element Vector{Float64}:
 0.9999999999999997
 0.0
julia> pinv(A) * A
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  0.0

In this example, all solutions are given by (1, 𝑢2 )⊤ , 𝑢2 ∈ ℝ.

8.4.9 Eigenvalues and Eigenvectors

Eigenvalues and eigenvectors1 of matrices have important meaning in applica-


tions such as geometric transformations, differential equations, stability analy-
sis, quantum mechanics (the Schrödinger equation and molecular orbits), vibra-
tion analysis, principal-component analysis, and graph theory. They are defined
for linear transformations from a vector space into itself.
Definition 8.22 (eigenvalue and eigenvector) Suppose 𝐴 ∈ ℂ𝑛×𝑛 is a square
matrix. If
𝐴𝐯 = 𝜆𝐯 (8.15)
and 𝐯 ≠ 𝟎 hold, then 𝜆 ∈ ℂ is called an eigenvalue of 𝐴 and 𝐯 ∈ ℂ𝑛 is called an
eigenvector of 𝐴.
1 German “eigen-” means “self-”.

Since the zero vector always satisfies (8.15) trivially, it is excluded from the
definition of an eigenvector. The geometric interpretation of a pair of eigenvalues
and eigenvectors is that an eigenvector is a direction in which the transformation
stretches by a factor that is the eigenvalue.
How can we find the eigenvalues and eigenvectors of a square matrix? Deter-
mining the eigenvalues of a square matrix in theory is not hard. Equation (8.15)
has a non-zero solution if and only if the determinant of 𝐴 − 𝜆𝐼 is zero. This
observation yields the equation

det(𝐴 − 𝜆𝐼) = 0 (8.16)

for any eigenvalue 𝜆 ∈ ℂ. Expanding the determinant shows that the left-hand
side of this equation is a polynomial of degree 𝑛 in 𝜆 (and that the coefficient of
𝜆𝑛 is (−1)𝑛 ).
Definition 8.23 (characteristic polynomial) The polynomial

𝜒𝐴 (𝜆) ∶= det(𝐴 − 𝜆𝐼)

in 𝜆 is called the characteristic polynomial of the square matrix 𝐴.


The fundamental theorem of algebra implies that the characteristic polyno-
mial has 𝑛 roots over the complex numbers ℂ. This is the reason why the under-
lying field in Definition 8.22 is ℂ.
Once we have determined the eigenvalues, e.g., by finding all roots of the
characteristic polynomial, the eigenvector 𝐯 that corresponds to the eigenvalue 𝜆
can be found by solving the linear system

(𝐴 − 𝜆𝐼)𝐯 = 𝟎, (8.17)

that stems from (8.15), for 𝐯.


In Julia, the two functions LinearAlgebra.eigvals and
LinearAlgebra.eigvals! compute eigenvalues and the function
LinearAlgebra.eigvecs computes eigenvectors.

julia> A = randn(3, 3);
julia> lambda = eigvals(A)
3-element Vector{ComplexF64}:
 -0.8670151992707238 + 0.0im
  0.8351764501056504 - 0.578399928311388im
  0.8351764501056504 + 0.578399928311388im

The functions LinearAlgebra.eigen and LinearAlgebra.eigen! compute fac-
torization objects f of type Eigen, which contain the eigenvalues in the field
values and the eigenvectors in the field vectors such that the 𝑘-th eigenvector
is stored in the column f.vectors[:, k]. The functions det, inv, and isposdef
are also defined for Eigen objects.
julia> f = eigen(A);

We check that the results satisfy the definition.


julia> A * f.vectors - f.vectors * diagm(f.values)
3×3 Matrix{ComplexF64}:
  2.77556e-16+0.0im   1.11022e-15-4.996e-16im     1.11022e-15+4.996e-16im
 -1.66533e-16+0.0im  -3.88578e-16+5.55112e-17im  -3.88578e-16-5.55112e-17im
 -4.44089e-16+0.0im   1.11022e-16-1.11022e-16im   1.11022e-16+1.11022e-16im

Equation (8.16) is also satisfied.


julia> [det(A - f.values[i] * Matrix(I, 3, 3))
        for i in 1:size(A, 1)]
3-element Vector{ComplexF64}:
 -1.2470931052324467e-15 + 0.0im
 -1.1084634047340065e-15 - 3.8148117819699507e-16im
 -1.1084634047340065e-15 + 3.8148117819699507e-16im

The constructor Matrix called with I as the first argument constructs matrices
with ones in the main diagonal such as identity matrices.
Eigenvalues may be repeated. There are two ways to define the multiplicity
of an eigenvalue: the first is called algebraic multiplicity and the second is called
geometric multiplicity. The algebraic multiplicity stems from the multiplicity of
the eigenvalue as a root of the characteristic polynomial, while the geometric
multiplicity stems from the number of corresponding eigenvectors or the size of
the solution space of the linear system (8.17).

Definition 8.24 (algebraic multiplicity) Let 𝜆𝑖 be an eigenvalue of the square


matrix 𝐴. Then the algebraic multiplicity 𝜇𝐴 (𝜆𝑖 ) is the multiplicity of the eigen-
value 𝜆𝑖 as a root of the characteristic polynomial.

In other words, if 𝐴 ∈ ℂ𝑛×𝑛 has 𝑑 distinct eigenvalues, then the characteristic


polynomial can be written as the product
𝜒𝐴(𝜆) = ∏_{𝑖=1}^{𝑑} (𝜆𝑖 − 𝜆)^{𝜇𝐴(𝜆𝑖)} .

Clearly, the inequality 1 ≤ 𝜇𝐴 (𝜆𝑖 ) ≤ 𝑛 holds for all 𝑖 ∈ {1, … , 𝑑}, and the sum of
all algebraic multiplicities is equal to the dimension of the vector space, i.e.,

∑_{𝑖=1}^{𝑑} 𝜇𝐴(𝜆𝑖) = 𝑛. (8.18)

Before we can define the geometric multiplicity of an eigenvalue, we define


its eigenspace.

Definition 8.25 (eigenspace) Suppose 𝜆𝑖 is an eigenvalue of 𝐴 ∈ ℂ𝑛×𝑛 . Then


the eigenspace associated with 𝜆𝑖 is the set

𝐸(𝜆𝑖 ) ∶= {𝐯 ∈ ℂ𝑛 ∣ (𝐴 − 𝜆𝑖 𝐼)𝐯 = 𝟎}.


8.4 Linear Algebra 207

The eigenspace of the eigenvalue 𝜆𝑖 is, of course, the kernel of the matrix
𝐴 − 𝜆𝑖 𝐼 and can be calculated using the function nullspace. It is spanned by all
eigenvectors associated with 𝜆𝑖 . It is straightforward to show that an eigenspace
is always a linear subspace (see Problem 8.16). Using the notion of an eigenspace,
we can now define the geometric multiplicity.

Definition 8.26 (geometric multiplicity) Let 𝜆𝑖 be an eigenvalue of the square


matrix 𝐴. Then the geometric multiplicity 𝛾𝐴 (𝜆𝑖 ) is the dimension of its eigen-
space, i.e.,
𝛾𝐴 (𝜆𝑖 ) ∶= dim 𝐸(𝜆𝑖 ).

In other words, the geometric multiplicity of the eigenvalue 𝜆𝑖 is the nullity


of the matrix 𝐴 − 𝜆𝑖 𝐼. By Theorem 8.3, the geometric multiplicity is equal to

𝛾𝐴 (𝜆𝑖 ) = 𝑛 − rk(𝐴 − 𝜆𝑖 𝐼).

By equation (8.16), the inequality

rk(𝐴 − 𝜆𝑖 𝐼) < 𝑛

holds and therefore the geometric multiplicity is at least one, i.e.,

𝛾𝐴 (𝜆𝑖 ) ≥ 1.

How are the algebraic and the geometric multiplicities related? The answer
is given by the following theorem, whose proof is nontrivial (see Problem 8.17).
It means that the geometric multiplicity of an eigenvalue is never larger than
its algebraic one.
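These notions are easy to explore in Julia. The following sketch (with a hypothetical example matrix) computes both multiplicities of the eigenvalue 3, which is a double root of the characteristic polynomial but has only a one-dimensional eigenspace:

```julia
using LinearAlgebra

A = [3.0 1.0 0.0;
     0.0 3.0 0.0;
     0.0 0.0 5.0]

μ = count(λ -> λ ≈ 3, eigvals(A))      # algebraic multiplicity of 3
γ = size(nullspace(A - 3.0 * I), 2)    # geometric multiplicity of 3
(μ, γ)                                 # (2, 1)
```

Here nullspace returns a basis of the eigenspace 𝐸(3), so the number of its columns is the geometric multiplicity.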

Theorem 8.27 (geometric and algebraic multiplicities) Suppose 𝜆𝑖 is an


eigenvalue of 𝐴 ∈ ℂ𝑛×𝑛 . Then the inequality

1 ≤ 𝛾𝐴 (𝜆𝑖 ) ≤ 𝜇𝐴 (𝜆𝑖 ) ≤ 𝑛

holds.

This theorem also implies the inequality


𝑑 ≤ 𝛾𝐴 ∶= ∑_{𝑖=1}^{𝑑} 𝛾𝐴(𝜆𝑖) ≤ 𝑛

by summing over all eigenvalues and using (8.18). Here we have defined 𝛾𝐴 as
the sum of all geometric multiplicities.
If the geometric multiplicities of all eigenvalues are equal to their algebraic
multiplicities and thus are maximal, then the eigenvectors have the important
and useful property that a basis of ℂ𝑛 , the eigenbasis, can be chosen from the set
of eigenvectors. This property is recorded in the following theorem.

Theorem 8.28 (eigenbasis) Suppose 𝐴 ∈ ℂ𝑛×𝑛 . If the equation

∑_{𝑖=1}^{𝑑} 𝛾𝐴(𝜆𝑖) = 𝑛 (8.19)

holds, then
span(⋃_{𝑖=1}^{𝑑} 𝐸(𝜆𝑖)) = ℂ𝑛

and a basis of ℂ𝑛 , the eigenbasis, can be formed from 𝑛 linearly independent eigen-
vectors.

Further important properties of eigenvalues are collected in the following the-


orem.

Theorem 8.29 (properties of eigenvalues) Suppose 𝐴 ∈ ℂ𝑛×𝑛 is a square ma-


trix with the 𝑛 eigenvalues (𝜆1 , … , 𝜆𝑛 ). (Each eigenvalue 𝜆𝑖 appears 𝜇𝐴 (𝜆𝑖 ) (alge-
braic multiplicity) times in this vector.) Then the following statements hold true.
1. The determinant of 𝐴 is equal to the product of all its eigenvalues, i.e.,
det(𝐴) = ∏_{𝑖=1}^{𝑛} 𝜆𝑖 .

2. The trace of 𝐴, which is defined as the sum of all diagonal elements, is equal
to the sum of all its eigenvalues, i.e.,
tr(𝐴) ∶= ∑_{𝑖=1}^{𝑛} 𝐴𝑖𝑖 = ∑_{𝑖=1}^{𝑛} 𝜆𝑖 .

3. The matrix 𝐴 is regular if and only if all of its eigenvalues are non-zero.
4. If 𝐴 is regular, then the eigenvalues of the inverse 𝐴−1 are (1∕𝜆1 , … , 1∕𝜆𝑛 ) with
the same algebraic and geometric multiplicities.
5. The eigenvalues of 𝐴^𝑘, 𝑘 ∈ ℕ, are (𝜆1^𝑘, … , 𝜆𝑛^𝑘).
6. If 𝐴 is unitary, then the absolute value of all its eigenvalues is 1, i.e., |𝜆𝑖 | = 1
for all 𝑖 ∈ {1, … , 𝑛}.
7. If 𝐴 is Hermitian, then all its eigenvalues are real.
8. If 𝐴 is Hermitian and positive definite, positive semidefinite, negative definite,
or negative semidefinite, then all its eigenvalues are positive, nonnegative, neg-
ative, or nonpositive, respectively.
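Properties 1 and 2 can be checked numerically in a line each; the real part is taken only to discard the negligible imaginary parts left by rounding:

```julia
using LinearAlgebra

A = randn(4, 4)
λ = eigvals(A)

det(A) ≈ real(prod(λ))    # property 1: determinant = product of eigenvalues
tr(A) ≈ real(sum(λ))      # property 2: trace = sum of eigenvalues
```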

As we have seen in Theorem 8.28, a matrix 𝐴 ∈ ℂ𝑛×𝑛 has 𝑛 linearly indepen-


dent eigenvectors if the geometric multiplicities of all eigenvalues are maximal,
i.e., if (8.19) holds. In this case, it can be represented very simply by a diagonal
matrix in a suitable basis. The converse statement is also true, as the following
theorem shows.

Theorem 8.30 (diagonalization) A matrix 𝐴 ∈ ℂ𝑛×𝑛 has 𝑛 linearly indepen-


dent eigenvectors if and only if 𝐴 can be factorized as

𝐴 = 𝑄Λ𝑄−1 , (8.20)

where 𝑄 ∈ ℂ𝑛×𝑛 is a regular matrix and Λ is a diagonal matrix whose entries are
the eigenvalues of 𝐴.

Proof By definition and assumption, we can find eigenvalues and 𝑛 eigenvec-


tors such that
𝐴𝐯𝑖,𝑗𝑖 = 𝜆𝑖 𝐯𝑖,𝑗𝑖 ,
where the indices 𝑖 and 𝑗𝑖 enumerate the 𝑛 eigenvectors. There are 𝑑 eigenvalues
such that 𝑖 ∈ {1, … , 𝑑} and 𝑗𝑖 ∈ {1, … , 𝜇𝐴 (𝜆𝑖 )}. Collecting all the 𝑛 eigenvectors
𝐯𝑖,𝑗𝑖 as columns in a matrix 𝑄, we hence find 𝐴𝑄 = 𝑄Λ and therefore 𝐴 =
𝑄Λ𝑄−1 .
To prove the converse statement, the existence of a factorization 𝐴 = 𝑄Λ𝑄−1
(with 𝑄 being regular) implies 𝐴𝑄 = 𝑄Λ and therefore there are 𝑑 eigenvalues
and 𝑛 eigenvectors. It remains to show that the eigenvectors 𝐯𝑘 found in this
manner are linearly independent.
Suppose they are linearly dependent. Then, by definition, there exists a vector 𝐚 ≠ 𝟎 such that the linear combination ∑_{𝑘=1}^{𝑛} 𝑎𝑘 𝐯𝑘 is zero or, equivalently,
𝑄𝐚 = 𝟎. Therefore nul(𝑄) > 0 and, by Theorem 8.3, rk(𝑄) < 𝑛, which is a
contradiction to the regularity of the matrix 𝑄. □

A matrix is called diagonalizable if it admits such a factorization. If not, it is


called defective. Maybe the simplest example of a defective matrix is

⎛1 1⎞
⎝0 1⎠ (8.21)

(see Problem 8.20).
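This defectiveness is visible numerically: eigen returns the eigenvalue 1 twice, but the computed eigenvectors are linearly dependent, so no regular matrix 𝑄 as in (8.20) can be built from them.

```julia
using LinearAlgebra

A = [1.0 1.0;
     0.0 1.0]
f = eigen(A)

f.values           # both eigenvalues are 1
rank(f.vectors)    # 1 < 2: the eigenvector matrix is singular
```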


It is clear that if the characteristic polynomial has no repeated roots, then the
matrix is diagonalizable.
If it exists, the diagonalization of a matrix 𝐴 is very useful, since many impor-
tant properties of 𝐴 can immediately be seen from the factorization (8.20) (see
Theorem 8.29). In particular, if 𝐴 is regular in addition, then its inverse is imme-
diately given by 𝐴−1 = 𝑄Λ−1 𝑄−1 . The geometric action of the linear function
represented by the matrix 𝐴 is also obvious.
Certain matrices, called normal matrices, have especially simple factoriza-
tions, as the following definition and theorem show.

Definition 8.31 (normal matrix) A matrix 𝐴 ∈ ℂ𝑛×𝑛 is called normal if

𝐴∗ 𝐴 = 𝐴𝐴∗ .

Theorem 8.32 (normal matrices) A matrix 𝐴 ∈ ℂ𝑛×𝑛 is normal if and only if


there exist a diagonal matrix Λ and a unitary matrix 𝑈 such that

𝐴 = 𝑈Λ𝑈 ∗ .

This factorization is especially useful, since the basis change given by a uni-
tary matrix is numerically stable and the representation as a diagonal matrix in
this basis is especially simple.
Theorem 8.30 shows that if and only if there are 𝑛 linearly independent eigen-
vectors, then the matrix has an eigenfactorization and hence can be represented
very simply by a diagonal matrix in an eigenbasis. This naturally leads to the
question what happens when there are fewer than 𝑛 independent eigenvectors.
Can we still find a factorization? We expect that such a factorization would be
more complicated than the representation by a diagonal matrix; it is hardly con-
ceivable that a simpler factorization exists.
The answer is given by the following theorem, which also draws the complete
picture. In order to formulate the theorem, we need a definition first.

Definition 8.33 (Jordan matrix) A square, complex matrix 𝐴 ∈ ℂ𝑛×𝑛 , which


has the block diagonal form

⎛𝐽1 ⎞
𝐽=⎜ ⋱ ⎟,
⎝ 𝐽𝑑 ⎠

where each block 𝐽𝑖 is a square matrix of the form

⎛𝜆 𝑖 1 ⎞
𝜆𝑖 ⋱
𝐽𝑖 = ⎜ ⎟,
⎜ ⋱ 1⎟
⎝ 𝜆𝑖 ⎠

is called a Jordan matrix.

This means that a Jordan matrix 𝐽 is a square matrix whose only non-zero
entries are in the main diagonal and the first superdiagonal. The blocks 𝐽𝑖 are
called Jordan blocks. In each block, all entries in the first superdiagonal are equal
to one.

Definition 8.34 (similar matrix) Two square matrices 𝐴 ∈ ℂ𝑛×𝑛 and 𝐵 ∈


ℂ𝑛×𝑛 are called similar if there exists a regular matrix 𝑃 ∈ ℂ𝑛×𝑛 such that

𝐴 = 𝑃𝐵𝑃−1 .

Theorem 8.35 (eigenfactorization, Jordan normal form) Every square ma-


trix 𝐴 ∈ ℂ𝑛×𝑛 is similar to a Jordan matrix 𝐽, called the Jordan normal form of 𝐴,
i.e., there exists a regular matrix 𝑃 ∈ ℂ𝑛×𝑛 and a Jordan matrix 𝐽 such that

𝐴 = 𝑃𝐽𝑃−1 .

The theorem shows that matrices are not diagonalizable in general, but that
every matrix is similar to a Jordan matrix, which still contains the eigenvalues in
the main diagonal, but may contain additional ones in the first superdiagonal.
Is the Jordan normal form of a matrix unique? Of course, it is also possible
to rearrange the similarity matrix 𝑃 such that the Jordan blocks are reordered.
Apart from these rearrangements, however, the Jordan normal form is unique,
as the following theorem records.
Theorem 8.36 (uniqueness of Jordan normal form) The Jordan normal form
of a matrix 𝐴 ∈ ℂ𝑛×𝑛 is unique up to the order of the Jordan blocks.
The Jordan normal form 𝐽 and its blocks have the following properties. The
geometric multiplicity 𝛾𝐴 (𝜆𝑖 ) is the number of Jordan blocks corresponding to
the eigenvalue 𝜆𝑖 , and the sum of the sizes of all Jordan blocks corresponding
to 𝜆𝑖 is its algebraic multiplicity 𝜇𝐴 (𝜆𝑖 ). In terms of the sizes of the Jordan blocks,
we thus see that a matrix 𝐴 is diagonalizable if and only if the algebraic and
geometric multiplicities of every eigenvalue 𝜆𝑖 coincide.
The sizes of the Jordan blocks corresponding to an eigenvalue help to solve
the mystery of the missing eigenvectors whenever the geometric multiplicity is
smaller than the algebraic one. In this case, there are fewer than 𝑛 linearly inde-
pendent eigenvectors, and we would like to find more vectors in order to com-
plete the set of eigenvectors and obtain a basis of the whole vector space.
We consider a three-dimensional example and define

⎛𝜆1 0 0 ⎞
𝐽 ∶= ⎜ 0 𝜆2 1 ⎟
⎝ 0 0 𝜆2 ⎠
consisting of two Jordan blocks. The Jordan normal form 𝐴 = 𝑃𝐽𝑃−1 implies
𝐴𝑃 = 𝑃𝐽, and we denote the three columns of 𝑃 by 𝐩𝑖 , 𝑖 ∈ {1, 2, 3}. Then we
have the equation

                          ⎛𝜆1  0   0 ⎞
𝐴 (𝐩1 𝐩2 𝐩3) = (𝐩1 𝐩2 𝐩3) ⎜ 0  𝜆2  1 ⎟ = (𝜆1 𝐩1   𝜆2 𝐩2   𝐩2 + 𝜆2 𝐩3) ,
                          ⎝ 0  0   𝜆2 ⎠
whose columns yield the equations

(𝐴 − 𝜆1 𝐼)𝐩1 = 𝟎,
(𝐴 − 𝜆2 𝐼)𝐩2 = 𝟎,
(𝐴 − 𝜆2 𝐼)𝐩3 = 𝐩2 .

The first equation means that 𝐩1 ∈ ker(𝐴 − 𝜆1 𝐼) is an eigenvector for the eigen-
value 𝜆1 and the second equation means that 𝐩2 ∈ ker(𝐴 −𝜆2 𝐼) is an eigenvector
for the eigenvalue 𝜆2 .
The second Jordan block
⎛𝜆2  1 ⎞
⎝0   𝜆2⎠

reduces the geometric multiplicity 𝛾𝐴 (𝜆2 ), so that we would like to construct an


additional vector corresponding to 𝜆2 to adjoin to the eigenvectors to find a basis
of the whole vector space.
We can do so as follows. Multiplying the last equation by 𝐴 − 𝜆2 𝐼 yields

(𝐴 − 𝜆2 𝐼)² 𝐩3 = (𝐴 − 𝜆2 𝐼)𝐩2 = 𝟎,

meaning that 𝐩3 ∈ ker((𝐴 − 𝜆2 𝐼)2 ). Vectors such as 𝐩3 are called generalized


eigenvectors of 𝐴.
Definition 8.37 (generalized eigenvector) A vector 𝐯 ∈ ℂ𝑛 is called a gener-
alized eigenvector of rank 𝑘 of the matrix 𝐴 ∈ ℂ𝑛×𝑛 corresponding to the eigen-
value 𝜆 if

(𝐴 − 𝜆𝐼)^𝑘 𝐯 = 𝟎,
(𝐴 − 𝜆𝐼)^{𝑘−1} 𝐯 ≠ 𝟎.

An eigenvector is, of course, a generalized eigenvector of rank 1.
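For a single Jordan block, a Jordan chain can be written down by hand. In the following hypothetical 2 × 2 example with eigenvalue 2, the vector p is a generalized eigenvector of rank 2:

```julia
using LinearAlgebra

A = [2.0 1.0;
     0.0 2.0]
v = [1.0, 0.0]    # ordinary eigenvector: (A - 2I) v = 0
p = [0.0, 1.0]    # solves (A - 2I) p = v

(A - 2.0 * I) * p == v                 # (A - 2I) p ≠ 0, so the rank is 2
(A - 2.0 * I)^2 * p == [0.0, 0.0]      # but (A - 2I)² p = 0
```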


Just as in this example, it is generally possible to construct a Jordan chain of
generalized eigenvectors for a given Jordan block starting from an eigenvector.
The eigenvector is an element of ker(𝐴−𝜆𝑖 𝐼), while the generalized eigenvectors
are elements of ker((𝐴 − 𝜆𝑖 𝐼)^𝑘), 𝑘 ∈ {2, … , 𝐾}, where 𝐾 is the size of the Jordan
block for which the Jordan chain is calculated.
These generalized eigenvectors are the vectors needed in Theorem 8.35 to
complete the eigenvectors to find a basis of the whole vector space. It can be
shown that this construction is possible for all Jordan blocks and that the gen-
eralized eigenvectors are indeed linearly independent. The space spanned by all
generalized eigenvectors corresponding to an eigenvalue is called the general-
ized eigenspace of this eigenvalue.
It is now clear how the simple example of a defective matrix in (8.21) relates
to the general case and the Jordan normal form. This defective matrix is just a
single Jordan block for the eigenvalue 1.
We now perturb this matrix and consider the matrix

⎛1 1⎞
⎝𝜖 1⎠

with 𝜖 ≠ 0. Its eigenvalues are 1 ± √𝜖, and therefore its Jordan normal form is

⎛1 + √𝜖     0   ⎞
⎝   0    1 − √𝜖 ⎠ .

This means that a slight perturbation of a matrix with multiple eigenvalues can
completely change the structure of its Jordan normal form. The numerical prob-
lem of calculating the Jordan normal form of a matrix is therefore ill-conditioned
and depends critically on the criterion whether two eigenvalues are considered
equal. Hence, the Jordan normal form of a matrix is usually avoided in computa-
tions, although it is of great theoretical importance, and alternatives such as the
Schur decomposition are employed. The Schur factorization or Schur decompo-
sition always exists, as the following theorem shows.

Theorem 8.38 (Schur factorization) Suppose 𝐴 ∈ ℂ𝑛×𝑛 . Then there exist a


unitary matrix 𝑄 and an upper-triangular matrix 𝑇 such that

𝐴 = 𝑄𝑇𝑄−1 .

The Schur factorization means that every complex, square matrix is similar
to an upper-triangular matrix. The Schur factorization is not unique. The advan-
tage of such a factorization is that the basis change 𝑄 is given by a unitary matrix,
and we already know that multiplication by an orthogonal or by a unitary matrix
is well-conditioned.
How does the Schur factorization relate to eigenvalues? If a Schur factoriza-
tion is known, then 𝐴 and 𝑇 have the same eigenvalues by Problem 8.25. The
eigenvalues of an upper-triangular matrix are just its diagonal elements by Prob-
lem 8.26. This means that a Schur factorization of a matrix 𝐴 immediately yields
its eigenvalues.
In Julia, Schur factorization is available as the two functions
LinearAlgebra.schur and LinearAlgebra.schur!. Just as the other func-
tions for matrix factorization in Julia, they return an object. The information
in the Schur object can be accessed as the fields T or Schur for the quasi
upper-triangular matrix, as the fields Z or vectors for the unitary matrix, and
as the field values for the eigenvalues. This built-in implementation only
calculates a quasi upper-triangular matrix and not an upper-triangular matrix
as in Theorem 8.38.
julia> A = randn(3, 3); f = schur(A);
julia> A - f.Z * f.T * f.Z'
3×3 Matrix{Float64}:
 9.99201e-16  -6.38378e-16  -1.11022e-15
 0.0          -2.22045e-16   6.66134e-16
 1.63064e-16  -1.66533e-16  -3.33067e-16

How is the Schur factorization of a matrix 𝐴 ∈ ℂ𝑛×𝑛 calculated? The answer


is provided by 𝑄𝑅 iteration, also called the 𝑄𝑅 algorithm. The basic version is
the following.

Algorithm 8.39 (𝑄𝑅 iteration)


1. Set 𝑘 ∶= 1 and define 𝐴0 ∶= 𝐴.
2. Repeat:
a. Define

𝑄𝑘 𝑅𝑘 ∶= 𝐴𝑘−1 ,
𝐴𝑘 ∶= 𝑅𝑘 𝑄𝑘 .

The first definition means that 𝑄𝑘 and 𝑅𝑘 are obtained by a 𝑄𝑅 factor-
ization of 𝐴𝑘−1, whose factors are multiplied in the second definition in
reverse order.
b. Increase 𝑘 ∶= 𝑘 + 1 and repeat until a termination criterion is satisfied.
The orthogonal similarity transformations 𝑄𝑘 obtained by 𝑄𝑅 factorization
and employed in the iteration make 𝑄𝑅 iteration numerically stable.
We find that

𝐴𝑘 = 𝑅𝑘 𝑄𝑘 = 𝑄𝑘⁻¹ 𝑄𝑘 𝑅𝑘 𝑄𝑘 = 𝑄𝑘∗ 𝐴𝑘−1 𝑄𝑘 , (8.22)

since the factor 𝑄𝑘 in the 𝑄𝑅 factorization is unitary. This equation shows by


induction that all matrices 𝐴 = 𝐴0 , 𝐴1 , 𝐴2 , … are similar and hence have the
same eigenvalues (by Problem 8.25).
Furthermore, it can be shown that the matrices 𝐴𝑘 converge to an upper-
triangular matrix as 𝑘 tends to infinity. Then, by (8.22), we find that

𝐴𝑘 = 𝑄𝑘∗ 𝑄𝑘−1∗ ⋯ 𝑄1∗ 𝐴 𝑄1 𝑄2 ⋯ 𝑄𝑘 = (𝑄1 𝑄2 ⋯ 𝑄𝑘)∗ 𝐴 (𝑄1 𝑄2 ⋯ 𝑄𝑘) = 𝑄∗ 𝐴𝑄,

where we have defined


𝑄 ∶= 𝑄1 𝑄2 ⋯ 𝑄𝑘
and 𝑄−1 = 𝑄∗ holds. Since 𝐴𝑘 converges to an upper-triangular matrix 𝑈, we
obtain 𝑈 ≈ 𝑄∗ 𝐴𝑄 and hence
𝐴 ≈ 𝑄𝑈𝑄∗ ,
which is a Schur factorization as in Theorem 8.38. Since the eigenvalues of a tri-
angular matrix are just the diagonal elements (by Problem 8.26), the eigenvalues
have been approximated by calculating 𝑈.
We consider a numerical example next. In order to observe convergence, we
construct an eigenvalue problem with a known solution. We start from a ma-
trix 𝑈 with known eigenvalues, construct a similar matrix 𝐴, and then imple-
ment 𝑄𝑅 iteration.
julia> U = diagm([0.2, -0.1, 0.3, -0.5, 0.4]);
julia> S = randn(size(U));
julia> A = S * U / S;
julia> k = A;
julia> for i in 1:10 (Q, R) = qr(k); k = R*Q end; k
5×5 Matrix{Float64}:
 -0.531019    -0.470487      0.0271631    -0.325596    -0.808545
  0.061126     0.435922      0.0383569     0.321512    -1.41668
 -0.00602601  -0.0159307     0.296267      0.0870228    1.98122
 -1.26847e-6  -0.000144795  -0.000895731   0.199251     1.87909
  4.86472e-9  -2.29092e-8   -7.59125e-7   -6.64697e-5  -0.10042
julia> k = A;
julia> for i in 1:100 (Q, R) = qr(k); k = R*Q end; k
5×5 Matrix{Float64}:
 -0.5          -0.531455     -0.0335534    -0.345635     -0.701827
  1.18005e-11   0.4           0.0504886     0.28702      -1.70175
 -3.62799e-25  -3.91756e-15   0.3           0.126148      1.77161
 -1.89986e-46  -1.1034e-34   -2.15349e-21   0.2           1.89475
  5.94044e-79  -1.54211e-68  -1.51952e-54  -5.22028e-35  -0.1

We observe that the eigenvalue closest to zero is approximated in the lower right
corner. After 100 steps, an upper-triangular matrix with the sought eigenvalues
is obtained. Even after 10 steps, three correct digits of the eigenvalue closest to
zero are found in the lower right corner.
The convergence rate of 𝑄𝑅 iteration depends on the separation between the
eigenvalues. The Gershgorin circle theorem, a bound on the spectrum (i.e., the
set of all eigenvalues) of a square matrix, is useful for testing for convergence.
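As a minimal illustration (not a convergence test of the iteration itself), the following sketch checks the Gershgorin circle theorem: every eigenvalue lies in at least one disc centered at a diagonal entry 𝐴[𝑖, 𝑖] whose radius is the sum of the absolute off-diagonal entries of row 𝑖.

```julia
using LinearAlgebra

A = randn(5, 5)
λ = eigvals(A)
r = [sum(abs.(A[i, :])) - abs(A[i, i]) for i in 1:5]   # disc radii

all(any(abs(l - A[i, i]) <= r[i] for i in 1:5) for l in λ)
```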
It is known that an eigenvalue close to zero improves convergence. How can
we move an eigenvalue closer to zero? It is straightforward to see that if 𝜆 is an
eigenvalue of 𝐴, then 𝜆 − 𝑠 is an eigenvalue of 𝐴 − 𝑠𝐼. Assuming that we can
find suitable shifts 𝑠𝑘 that approximate the eigenvalues closest to zero, we hence
define shifted 𝑄𝑅 iteration as

𝑄𝑘 𝑅𝑘 ∶= 𝐴𝑘−1 − 𝑠𝑘 𝐼, (8.23a)
𝐴𝑘 ∶= 𝑅𝑘 𝑄𝑘 + 𝑠𝑘 𝐼. (8.23b)

We must check that (8.22) still holds, which is done by calculating

𝐴𝑘 = 𝑅𝑘 𝑄𝑘 + 𝑠𝑘 𝐼 = 𝑄𝑘⁻¹ (𝐴𝑘−1 − 𝑠𝑘 𝐼)𝑄𝑘 + 𝑠𝑘 𝐼 = 𝑄𝑘∗ 𝐴𝑘−1 𝑄𝑘 ,

and therefore all matrices 𝐴𝑘 are again similar to 𝐴.


We still have to find suitable shifts 𝑠𝑘 . Since approximations of the eigenval-
ues are calculated during the iteration, these approximations suggest themselves
for this task. The most obvious approximation would be to use the lower right
element of 𝐴𝑘−1 as the shift 𝑠𝑘 in (8.23). This choice, however, may cause the
iteration to fail. A much better choice is the eigenvalue of the 2 × 2 lower right
submatrix of 𝐴𝑘−1 that is closest to the lower right element of 𝐴𝑘−1 . This choice
is called the Wilkinson shift.
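A sketch of shifted 𝑄𝑅 iteration (8.23) with this shift follows; wilkinson_shift and shifted_qr are hypothetical helper names, not part of LinearAlgebra. Taking the real part of the shift keeps the iteration real for a real matrix; production implementations use implicit double shifts to handle complex conjugate eigenvalue pairs.

```julia
using LinearAlgebra

# Hypothetical helper: eigenvalue of the trailing 2×2 submatrix of A
# that is closest to the lower right element A[n, n].
function wilkinson_shift(A)
    n = size(A, 1)
    a, b, c, d = A[n-1, n-1], A[n-1, n], A[n, n-1], A[n, n]
    disc = sqrt(complex((a + d)^2 - 4 * (a * d - b * c)))
    λ1, λ2 = (a + d + disc) / 2, (a + d - disc) / 2
    abs(λ1 - d) <= abs(λ2 - d) ? λ1 : λ2
end

# Shifted QR iteration (8.23); every step is a similarity transformation.
function shifted_qr(A; steps = 50)
    Ak = copy(A)
    for _ in 1:steps
        s = real(wilkinson_shift(Ak))
        Q, R = qr(Ak - s * I)
        Ak = R * Q + s * I
    end
    Ak
end

A = randn(5, 5)
Ak = shifted_qr(A)
byreim(v) = sort(v, by = z -> (real(z), imag(z)))
byreim(eigvals(Ak)) ≈ byreim(eigvals(A))   # the spectrum is preserved
```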
Another technique to improve the convergence of 𝑄𝑅 iteration is called defla-
tion. Once the eigenvalue closest to zero has been approximated satisfactorily, a
smaller matrix without this eigenvalue is constructed, i.e., the matrix is deflated,

and the next eigenvalue is calculated. Deflation is based on the following theo-
rem.

Theorem 8.40 (deflation) Suppose 𝐴 ∈ ℂ𝑛×𝑛 has the form

    ⎛𝐵   𝐮⎞
𝐴 = ⎝𝟎∗  𝜆⎠ ,

where 𝐵 ∈ ℂ(𝑛−1)×(𝑛−1) , 𝐮 ∈ ℂ(𝑛−1)×1 , 𝟎∗ ∈ ℂ1×(𝑛−1) , and 𝜆 ∈ ℂ. Then the


eigenvalues of 𝐴 are 𝜆 and the eigenvalues of 𝐵.
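The theorem is easy to check with hypothetical random data:

```julia
using LinearAlgebra

B = randn(4, 4)
u = randn(4)
λ = 2.5
A = [B u; zeros(1, 4) λ]     # the block form of Theorem 8.40

byreim(v) = sort(v, by = z -> (real(z), imag(z)))
byreim(eigvals(A)) ≈ byreim([eigvals(B); λ])   # spectrum of A = spectrum of B plus λ
```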

Further improvements are still possible; a modern variant of 𝑄𝑅 iteration is


implicit 𝑄𝑅 iteration, which simplifies the use of multiple shifts.
In the 𝑄𝑅 iterations discussed so far, each iteration requires a 𝑄𝑅 factoriza-
tion, which requires 𝑂(𝑛3 ) floating-point operations in the general case, as can
be shown. In order to avoid this computational expense, practical algorithms
transform the matrix into almost triangular form at the beginning, which results
in large savings in each 𝑄𝑅 factorization.
A factorization which achieves this goal is Hessenberg factorization, and it is
the final factorization relating to eigenvalues we discuss here. Hessenberg factor-
ization can be shown to reduce the computational cost of both 𝐿𝑈 factorization
and 𝑄𝑅 factorization to 𝑂(𝑛2 ). Although the computational cost of Hessenberg
factorization is 𝑂(𝑛3 ), it is usually worth the effort in the beginning by reducing
the computational cost of the subsequent steps; the constants in the operation
count are important, not only its asymptotic behavior as 𝑛 tends to infinity.
We first define the form of the final result, which is almost triangular: in an
upper-Hessenberg matrix, all elements below the first subdiagonal are zero.

Definition 8.41 (upper-Hessenberg matrix) A matrix 𝐻 ∈ ℂ𝑛×𝑛 is called up-


per Hessenberg if ℎ𝑖𝑗 = 0 whenever 𝑖 > 𝑗 + 1.

The next theorem shows that it is always possible to find a matrix 𝐻 in upper-
Hessenberg form that is similar to a given matrix 𝐴. The fact that the basis
change 𝑄 is unitary is advantageous and again ensures numerical stability. We
use Householder reflections (see Sect. 8.4.8.3) in the algorithm and its proof.

Theorem 8.42 (Hessenberg factorization) Suppose 𝐴 ∈ ℂ𝑛×𝑛 . Then there ex-


ist a unitary matrix 𝑄 and an upper-Hessenberg matrix 𝐻 such that

𝐴 = 𝑄𝐻𝑄−1 .

Proof The proof is constructive. The following algorithm computes unitary


transformations such that the resulting matrix is upper Hessenberg. □

Algorithm 8.43 (Hessenberg factorization)


1. Set 𝐴0 ∶= 𝐴.
2. For 𝑗 from 1 to 𝑛 − 2, perform these steps:

a. Set 𝐱𝑗 to be rows 𝑗 + 1 to 𝑛 of the 𝑗-th column of 𝐴𝑗−1, i.e., 𝐱𝑗 ∶= 𝐴𝑗−1[(𝑗 + 1) ∶
𝑛, 𝑗]. Set 𝑃𝑗 ∶= 𝑃(𝐱𝑗 ), the Householder reflection constructed for 𝐱𝑗 as
in Theorem 8.19.
in Theorem 8.19.
b. Set
          ⎛𝐼𝑗  𝟎∗⎞
   𝑄𝑗∗ ∶= ⎝𝟎   𝑃𝑗⎠ .

The matrix 𝑃𝑗 has size (𝑛 − 𝑗) × (𝑛 − 𝑗), and 𝐼𝑗 is the identity matrix of


size 𝑗 × 𝑗. Therefore 𝑄𝑗∗ has size 𝑛 × 𝑛. The matrix 𝑄𝑗∗ is Hermitian and
unitary by Theorem 8.19.
c. Set
𝐴𝑗 ∶= 𝑄𝑗∗ 𝐴𝑗−1 𝑄𝑗 .
In the first step, the form of the product 𝑄1∗ 𝐴0 is

          ⎛ 𝑎11   𝑎12  𝑎13  ⋯  𝑎1𝑛 ⎞
          ⎜±‖𝐱‖2   ∗    ∗   ⋯   ∗  ⎟
𝑄1∗ 𝐴0 = ⎜  0     ∗    ∗   ⋯   ∗  ⎟,
          ⎜  ⋮     ⋮    ⋮        ⋮  ⎟
          ⎝  0     ∗    ∗   ⋯   ∗  ⎠

because of the definition of 𝑃1 as a Householder reflection. Furthermore,
the form of the product 𝑄1∗ 𝐴0 𝑄1 is

               ⎛ 𝑎11    ∗  ∗  ⋯  ∗⎞
               ⎜±‖𝐱‖2   ∗  ∗  ⋯  ∗⎟
𝐴1 = 𝑄1∗ 𝐴0 𝑄1 = ⎜  0     ∗  ∗  ⋯  ∗⎟ .
               ⎜  ⋮     ⋮  ⋮     ⋮⎟
               ⎝  0     ∗  ∗  ⋯  ∗⎠

In all later steps, analogous calculations show that zeros are created by
left-multiplication by 𝑄𝑗∗ and all zeros thus created, also in the previous
steps, remain after right-multiplication by 𝑄𝑗 .

3. After the 𝑛 − 2 steps of the loop, we set


𝐻 ∶= 𝐴𝑛−2 = 𝑄𝑛−2∗ 𝐴𝑛−3 𝑄𝑛−2 = 𝑄𝑛−2∗ ⋯ 𝑄1∗ 𝐴 𝑄1 ⋯ 𝑄𝑛−2 .

We also set
𝑄 ∶= 𝑄1 ⋯ 𝑄𝑛−2
to obtain
𝐻 = 𝑄∗ 𝐴𝑄.
The matrix 𝑄 is unitary as a product of unitary matrices, and the matrix 𝐻
is upper Hessenberg.
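Algorithm 8.43 can be transcribed almost literally. The following dense, unoptimized sketch (to_hessenberg is a hypothetical name) assumes the input is in general position so that no Householder vector vanishes; the sign choice in the Householder vector avoids cancellation.

```julia
using LinearAlgebra

function to_hessenberg(A)
    n = size(A, 1)
    H = float(copy(A))
    Q = Matrix{eltype(H)}(I, n, n)
    for j in 1:n-2
        x = H[j+1:n, j]
        v = copy(x)
        v[1] += (x[1] >= 0 ? 1 : -1) * norm(x)   # Householder vector for x_j
        v /= norm(v)
        P = I - 2 * v * v'                       # reflection P_j = P(x_j)
        H[j+1:n, :] = P * H[j+1:n, :]            # create zeros below the subdiagonal in column j
        H[:, j+1:n] = H[:, j+1:n] * P            # right-multiplication keeps the similarity
        Q[:, j+1:n] = Q[:, j+1:n] * P            # accumulate Q = Q_1 ⋯ Q_{n-2}
    end
    Q, H
end

A = randn(6, 6)
Q, H = to_hessenberg(A)
maximum(abs.(Q * H * Q' - A))    # on the order of machine precision
maximum(abs.(tril(H, -2)))       # the part below the first subdiagonal vanishes
```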

In Julia, Hessenberg factorization is implemented by the functions
LinearAlgebra.hessenberg and LinearAlgebra.hessenberg!. The unitary
matrix is stored in the resulting object of type Hessenberg in the field Q and the
upper-Hessenberg matrix in the field H.
julia> A = randn(5, 5); f = hessenberg(A);

Having calculated the Hessenberg factorization, we check the result.


julia> maximum(abs.(f.Q * f.Q' - Matrix(I, 5, 5)))
4.440892098500626e-16
julia> maximum(abs.(f.Q * f.H / f.Q - A))
8.881784197001252e-16
julia> maximum(abs.(f.Q * f.H * f.Q' - A))
1.5543122344752192e-15

8.4.10 Singular-Value Decomposition

The singular-value decomposition (svd) is another factorization that is as fun-


damental as the eigenfactorization in Theorem 8.35 or the Schur factorization
in Theorem 8.38. In contrast to these two factorizations, svd provides a factoriza-
tion of any complex 𝑛 × 𝑚 matrix and not only of square matrices. The following
theorem shows the form of the factorization.

Theorem 8.44 (singular-value decomposition) Suppose 𝐴 ∈ ℂ𝑛×𝑚 . Then 𝐴


can be factored as
𝐴 = 𝑈Σ𝑉 ∗ ,
where 𝑈 ∈ ℂ𝑛×𝑛 and 𝑉 ∈ ℂ𝑚×𝑚 are unitary matrices and Σ ∈ ℝ𝑛×𝑚 is a rect-
angular diagonal matrix with nonnegative elements on the diagonal. If 𝐴 is a real
matrix, then 𝑈 and 𝑉 can be chosen real as well.

The matrices 𝑈 and 𝑉 in an svd are not unique, as 𝑈Σ𝑉 ∗ = (−𝑈)Σ(−𝑉)∗ ,


for example. The
𝑠 ∶= min(𝑚, 𝑛)
diagonal entries of Σ are called the singular values of 𝐴 and denoted by 𝜎1 , … , 𝜎𝑠 .
The convention is to order the singular values such that

𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑠 ≥ 0.

With this convention, the matrix Σ is uniquely determined.


The columns of 𝑈 and 𝑉 are called the left and right singular vectors, respec-
tively. Writing the svd as 𝐴𝑉 = 𝑈Σ yields the equations

𝐴𝐯𝑘 = 𝜎𝑘 𝐮𝑘 ∀𝑘 ∈ {1, … , 𝑠}, (8.24)



which explains the names of the singular vectors. Geometrically, these equations
mean that the function represented by 𝐴 maps each right singular vector 𝐯𝑘 to
the corresponding left singular vector 𝐮𝑘 stretched by the corresponding singular
value 𝜎𝑘 .
The svd and the eigenfactorization of a matrix are related. The svd 𝐴 =
𝑈Σ𝑉 ∗ of a matrix 𝐴 ∈ ℂ𝑛×𝑚 yields the two equations

𝐴∗ 𝐴 = 𝑉Σ∗ 𝑈 ∗ 𝑈Σ𝑉 ∗ = 𝑉(Σ∗ Σ)𝑉 ∗ , (8.25a)


𝐴𝐴∗ = 𝑈Σ𝑉 ∗ 𝑉Σ∗ 𝑈 ∗ = 𝑈(ΣΣ∗ )𝑈 ∗ . (8.25b)

The first equation means that the right singular vectors (i.e., the columns of 𝑉)
are eigenvectors of 𝐴∗ 𝐴, while the second equation means that the left singular
vectors (i.e., the columns of 𝑈) are eigenvectors of 𝐴𝐴∗ . Furthermore, the non-
zero singular values are the square roots of the non-zero eigenvalues of 𝐴∗ 𝐴 or
𝐴𝐴∗ . If 𝐴 is normal, then it can be diagonalized and written as 𝐴 = 𝑈Λ𝑈 ∗ by
Theorem 8.32. If 𝐴 is positive semidefinite in addition, then this factorization
𝐴 = 𝑈Λ𝑈 ∗ is also an svd.
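Equation (8.25a) is easy to verify numerically. Note the opposite sorting conventions: eigvals of a Symmetric matrix returns the eigenvalues in ascending order, while the singular values are sorted in descending order.

```julia
using LinearAlgebra

A = randn(5, 3)
sv = svdvals(A)
ev = eigvals(Symmetric(A' * A))

sv ≈ sqrt.(reverse(ev))   # σ_k is the square root of the k-th largest eigenvalue of A'A
```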
This observation also yields a numerical algorithm for the calculation of the
singular values and singular vectors of a matrix 𝐴, namely to apply 𝑄𝑅 iteration
(see Sect. 8.4.9) to the matrix in (8.25a) to first find the singular values and the
right singular vectors of 𝐴 and then to use (8.24) to find its left singular vectors.
Practical methods, however, are based on the matrix

⎛0   𝐴∗⎞
⎝𝐴   0 ⎠

(see Problem 8.36).


The svd has many useful properties and can be used to calculate important
properties of a matrix, as we will see in the next few theorems. The first theorem
in this series shows that the svd gives a representation of the range and the
null space of a matrix.

Theorem 8.45 (svd and rank, nullity) Suppose 𝐴 ∈ ℂ𝑛×𝑚 . Then the left sin-
gular vectors corresponding to non-zero singular values of 𝐴 span the range of 𝐴
and the right singular vectors corresponding to zero singular values of 𝐴 span the
null space of 𝐴. Furthermore, the rank of 𝐴 equals the number of non-zero singular
values.
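The last statement can be observed with a hypothetical rank-deficient example: a product of random 5 × 2 and 2 × 5 matrices has rank two (with probability one), and rank counts exactly the non-negligible singular values.

```julia
using LinearAlgebra

A = randn(5, 2) * randn(2, 5)   # a rank-two 5×5 matrix

svdvals(A)                      # two values of order one, three near zero
rank(A)                         # 2, computed by counting the non-zero singular values
```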

Knowing an svd of a matrix, its pseudoinverse (see Sect. 8.4.8.4) is easily found,
as the following theorem shows.

Theorem 8.46 (svd and pseudoinverse) Suppose 𝐴 = 𝑈Σ𝑉 ∗ is an svd of a


matrix 𝐴 ∈ ℂ𝑛×𝑚 . Then its pseudoinverse is

𝐴+ = 𝑉Σ+ 𝑈 ∗ .
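For a matrix of full column rank, this can be checked against pinv directly. Note that svd returns a thin SVD here, so all singular values are positive and Σ⁺ is obtained by simply inverting them.

```julia
using LinearAlgebra

A = randn(5, 3)
f = svd(A)   # thin SVD: f.U is 5×3 and f.S has 3 positive entries

pinv(A) ≈ f.V * Diagonal(1 ./ f.S) * f.U'   # Theorem 8.46
```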

The following theorem means that the principal singular value 𝜎1 is equal to
the operator 2-norm of the matrix 𝐴. The 𝑝-norm of a matrix is defined as

‖𝐴‖𝑝 ∶= sup_{𝐱 ≠ 𝟎} ‖𝐴𝐱‖𝑝 ∕ ‖𝐱‖𝑝 .

Therefore the svd is the usual means for calculating the 2-norm of a matrix.
Theorem 8.47 (svd and norm) Suppose 𝐴 ∈ ℂ𝑛×𝑚 . Then ‖𝐴‖2 = 𝜎1 .
The following criterion for determining whether a square matrix is regular or
singular follows from Theorem 8.45.
Theorem 8.48 (svd and regularity) Suppose 𝐴 ∈ ℂ𝑛×𝑛 . Then 𝐴 is regular if
and only if 𝜎𝑛 ≠ 0.
Another important application of the svd is the approximation of a matrix 𝐴
by a simpler and – as the following theorem shows – truncated version of 𝐴. The
idea is to use only the first 𝑘 singular values, which are also the largest ones
by convention. Depending on how fast the singular values decrease, these first
singular values may already capture a significant portion of the behavior of the
linear function. As the approximation is of lower rank than the original matrix 𝐴,
it is called a low-rank approximation of 𝐴.
Theorem 8.49 (svd and low-rank approximation, Eckart–Young–Mirsky
Theorem) Suppose 𝐴 ∈ ℂ𝑛×𝑚 and define

𝐴𝑘 ∶= 𝑈Σ𝑘 𝑉 ∗ ,

where Σ𝑘 is the copy of Σ only containing the first 𝑘 singular values of Σ. Then the
equation
‖𝐴𝑘 − 𝐴‖2 = 𝜎𝑘+1 ∀𝑘 ∈ {1, … , 𝑛 − 1}
holds.
For computing the low-rank approximation 𝐴𝑘 of rank 𝑘, only the first 𝑘 left
and right singular vectors, i.e., the first 𝑘 columns of 𝑈 and 𝑉, are needed, as all
singular values after the first 𝑘 ones are replaced by zero in Σ𝑘 .
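The theorem can be observed numerically; note that the operator 2-norm ‖⋅‖2 is computed by opnorm, not norm, in Julia.

```julia
using LinearAlgebra

A = randn(6, 6)
f = svd(A)
k = 2
Ak = f.U[:, 1:k] * Diagonal(f.S[1:k]) * f.Vt[1:k, :]   # rank-k truncation

opnorm(Ak - A) ≈ f.S[k+1]   # the 2-norm error is the first discarded singular value
```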
In Julia, the svd of a matrix is calculated by the two functions
LinearAlgebra.svd and LinearAlgebra.svd!. The singular values are com-
puted by LinearAlgebra.svdvals and LinearAlgebra.svdvals!. All these
functions follow the convention of sorting the singular values in descending or-
der. The functions svd and svd! return objects of type SVD with the fields U, S,
and Vt. There is also a field V, but since 𝑉 ∗ is calculated and accessible by Vt, it
is more efficient to use Vt than V.
julia> A = randn(5, 5); f = svd(A);
julia> f.V' == f.Vt
true

We check that the resulting factorization is correct.


julia> norm(f.U * diagm(f.S) * f.Vt - A)
2.3308177396304765e-15

If svd or svd! are called with two matrix arguments, they compute the gener-
alized svd of two matrices.

8.4.11 Summary of Matrix Operations and Factorizations

Tables 8.7 and 8.8 give an overview of the vector and matrix operations available
in Julia excluding matrix factorizations.
A multitude of low-level functions are available as well; for example,
the blas (Basic Linear Algebra Subprograms) and lapack (Linear Algebra
Package) functions are available in the modules LinearAlgebra.BLAS and
LinearAlgebra.LAPACK.
Table 8.9 summarizes the various types of matrix factorizations available in
Julia. The functions in Table 8.9 whose names end with an exclamation mark
are destructive versions of their counterparts without the exclamation mark and
hence save memory. The function factorize acts as a general interface to the
various matrix factorizations. It recognizes the matrix types listed in Table 8.10,
determines the most specific type a given matrix has, and then calculates the
factorization indicated in the table. The return value can be used as an argument
to the left-division operator \.

Problems

8.1 Prove the Cauchy–Bunyakovsky–Schwarz inequality, Theorem 8.1.

8.2 Show that 𝐴∗ 𝐴 is a Hermitian matrix.

8.3 Prove Theorem 8.2 and Theorem 8.3.

8.4 (Properties of the cross product)


(a) Show that the cross product is anticommutative.
(b) Show that the cross product is bilinear.
(c) Show that two vectors 𝐚 ≠ 0 and 𝐛 ≠ 0 are parallel if and only if 𝐚 × 𝐛 = 0.
(d) Show that the Lagrangian identity

‖𝐚 × 𝐛‖² = ‖𝐚‖² ‖𝐛‖² − (𝐚 ⋅ 𝐛)²

holds for all vectors 𝐚 and 𝐛 ∈ ℝ3 .



Table 8.7 Vector and matrix operations (in the module LinearAlgebra).
Function Description
*            matrix multiplication
\            left division
/            right division
dot          compute the inner product
cross        compute the cross product
transpose    compute the transpose
transpose!   compute the transpose (destructive version)
adjoint      compute the conjugate transpose
adjoint!     compute the conjugate transpose (destructive version)
'            postfix operator, same as adjoint
det          compute the determinant
inv          compute the inverse (using left division)
kron         compute the Kronecker tensor product
logdet       compute the logarithm of the determinant
logabsdet    compute the logarithm of the absolute value of the determinant
nullspace    compute a basis of the nullspace
rank         compute the rank by counting the non-zero singular values
pinv         compute the Moore–Penrose pseudoinverse
tr           compute the trace (sum of diagonal elements)
norm         compute the norm of a vector or the Frobenius norm of a matrix (the operator norm is computed by opnorm)
normalize    normalize so that the norm becomes one
normalize!   destructive version of normalize
diag         return the given diagonal of the given matrix as a vector
diagind      return an AbstractRange with the indices of the given diagonal
diagm        construct a matrix with the given vector as a diagonal
repeat       construct an array by repeating the elements of a given one
tril         return the lower triangle of a matrix
tril!        destructive version of tril
triu         return the upper triangle of a matrix
triu!        destructive version of triu
cond         compute the condition number of a matrix
condskeel    compute the Skeel condition number of a matrix
givens       compute a Givens rotation
lyap         solve a Lyapunov equation
sylvester    solve a Sylvester equation
peakflops    compute the peak flop rate of the computer using matrix multiplication



8.4 Linear Algebra 223

Table 8.8 Functions for checking properties of matrices (in the module LinearAlgebra).
Function Description
isbanded     determine whether a matrix is banded
isdiag       determine whether a matrix is diagonal
ishermitian  determine whether a matrix is Hermitian
isposdef     determine whether a matrix is positive definite
isposdef!    destructive version of isposdef
issuccess    determine whether a matrix factorization succeeded
issymmetric  determine whether a matrix is symmetric
istril       determine whether a matrix is lower triangular
istriu       determine whether a matrix is upper triangular

Table 8.9 Functions for matrix factorizations (in the module LinearAlgebra).
Function Description
factorize     compute a convenient factorization, general interface to factorizations
bunchkaufman  compute the Bunch–Kaufman factorization of a symmetric/Hermitian matrix
bunchkaufman! destructive version of bunchkaufman
cholesky      compute the Cholesky factorization of a positive definite matrix
cholesky!     destructive version of cholesky
eigen         compute the eigenfactorization
eigen!        destructive version of eigen
eigvals       compute the eigenvalues
eigvals!      destructive version of eigvals
eigvecs       compute the eigenvectors
eigmin        compute the smallest eigenvalue if all eigenvalues are real
eigmax        compute the largest eigenvalue if all eigenvalues are real
hessenberg    compute the Hessenberg factorization
hessenberg!   destructive version of hessenberg
ldlt          compute the 𝐿𝐷𝐿⊤ factorization
ldlt!         destructive version of ldlt
lq            compute the 𝐿𝑄 factorization
lq!           destructive version of lq
lu            compute the 𝐿𝑈 factorization
lu!           destructive version of lu
qr            compute the 𝑄𝑅 factorization
qr!           destructive version of qr
schur         compute the Schur factorization
schur!        destructive version of schur
ordschur      reorder a Schur factorization
ordschur!     destructive version of ordschur
svd           compute the singular value decomposition (svd)
svd!          destructive version of svd
svdvals       compute the singular values and return them in descending order
svdvals!      destructive version of svdvals
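For example (with an arbitrarily chosen matrix), the factorization objects returned by these functions expose their factors as properties:

```julia
using LinearAlgebra

A = [2.0 1.0; 1.0 3.0]
F = lu(A)                    # LU factorization with row pivoting
F.L * F.U ≈ A[F.p, :]        # true: P A = L U

C = cholesky(A)              # A is positive definite
C.L * C.L' ≈ A               # true

svdvals(A)                   # singular values in descending order
```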

Table 8.10 Forms of matrices recognized by the function factorize. The factorization function
called for each form is shown in the second column.
Form Function
Diagonal none
Bidiagonal none
Tridiagonal lu
Lower/upper triangular none
Positive definite cholesky
Dense symmetric/Hermitian bunchkaufman
Sparse symmetric/Hermitian ldlt
Symmetric real tridiagonal ldlt
General square lu
General non-square qr

8.5 (Volume of parallelepiped) Show that the signed volume 𝑉 of the paral-
lelepiped with the edges 𝐚, 𝐛, and 𝐜 is given by

𝑉 = 𝐚 ⋅ (𝐛 × 𝐜) = 𝐛 ⋅ (𝐜 × 𝐚) = 𝐜 ⋅ (𝐚 × 𝐛).

8.6 Prove Theorem 8.4.

8.7 Implement 𝐿𝑈 factorization, Algorithm 8.6, without row pivoting.

8.8 Implement 𝐿𝑈 factorization, Algorithm 8.6, with row pivoting.

8.9 Prove Theorem 8.10.

8.10 Prove Theorem 8.17.

8.11 Prove that the product of orthogonal matrices is again an orthogonal ma-
trix.

8.12 Prove that 𝑄∗ is orthogonal if 𝑄 is an orthogonal matrix.

8.13 Prove Theorem 8.18.

8.14 Prove Theorem 8.19.

8.15 Implement 𝑄𝑅 factorization, Algorithm 8.20.

8.16 Prove that every eigenspace is a linear subspace, i.e., it is closed under ad-
dition and scalar multiplication.

8.17 Prove Theorem 8.27.

8.18 Prove Theorem 8.28.

8.19 Prove Theorem 8.29.



8.20 Show that the matrix

( 1 1 )
( 0 1 )

is defective by calculating (algebraically, not numerically) all eigenvalues, their
algebraic and geometric multiplicities, and all eigenvectors.

8.21 Prove Theorem 8.32.

8.22 Prove Theorem 8.35.

8.23 Prove Theorem 8.36.

8.24 Prove Theorem 8.38.

8.25 Show that two similar matrices 𝐴 and 𝐵 have the same eigenvalues. How
do the eigenvectors of 𝐵 relate to those of 𝐴?

8.26 Show that the eigenvalues of an upper-triangular or lower-triangular matrix
are its diagonal elements.

8.27 Implement 𝑄𝑅 iteration, Algorithm 8.39.

8.28 Solve several eigenvalue problems numerically by 𝑄𝑅 iteration after con-
structing problems with known eigenvalues. Use problems with different eigen-
values close to zero. Plot the error for the eigenvalue closest to zero. Which type
of plot is most useful? Which convergence rate do you observe?

8.29 Perform numerical experiments to investigate how convergence is influ-
enced by the absolute value of the eigenvalue closest to zero and by the sepa-
ration between the two eigenvalues closest to zero. What do you observe?

8.30 Implement shifted 𝑄𝑅 iteration using the Wilkinson shift. Use

norm(A[n, 1:n-1]) < sqrt(n) * eps(Float64)

as the stopping criterion, where n = size(A, 1).

8.31 Compare the convergence rates of standard 𝑄𝑅 iteration (Problem 8.27) and
shifted 𝑄𝑅 iteration (Problem 8.30).

8.32 Prove Theorem 8.40.

8.33 Implement shifted 𝑄𝑅 iteration with deflation to find all eigenvalues of a
given matrix. Use recursion.

8.34 Implement Hessenberg factorization, Algorithm 8.43.

8.35 Prove Theorem 8.44.



8.36 Suppose 𝐴 ∈ ℂ𝑛×𝑛 has an svd 𝐴 = 𝑈Σ𝑉 ∗ and define

𝐵 ∶= ( 0  𝐴∗ )
     ( 𝐴  0  ).

Show that the eigenvalues of 𝐵 are {±𝜎𝑘 ∣ 𝑘 ∈ {1, … , 𝑛}}.

8.37 Prove Theorem 8.45.

8.38 Prove Theorem 8.46.

8.39 Prove Theorem 8.47.

8.40 Prove Theorem 8.48.

8.41 Prove Theorem 8.49. Hint: Show that

𝐴𝑘 = ∑_{𝑖=1}^{𝑘} 𝜎𝑖 𝐮𝑖 𝐯𝑖∗ ,

where 𝐮𝑘 and 𝐯𝑘 are the 𝑘-th column of 𝑈 and 𝑉, respectively. Then use Theo-
rem 8.47.

8.42 Use Theorem 8.49 to compress an image. Choose a sample image, repre-
sent it by a matrix, and use different numbers of singular values to compress the
image.

8.43 Find an example of each of the types of matrices listed in Table 8.10, check
its type in Julia, and determine the type of the return value after factorization.

References

1. Cuvelier, F., Japhet, C., Scarella, G.: An efficient way to assemble finite element matrices in
vector languages (2014). URL http://arxiv.org/abs/1401.3301. arXiv:1401.3301 [cs.MS]
2. Habgood, K., Arel, I.: A condensation-based application of Cramer’s rule for solving large-
scale linear systems. Journal of Discrete Algorithms 10, 98–109 (2012)
3. Stoer, J., Bulirsch, R.: Introduction to Numerical Analysis, 3rd edn. Springer (2002)
Part II
Algorithms for Differential Equations
Chapter 9
Ordinary Differential Equations

Differo, distuli, dilatum (Latin, from dis- (apart) and fero (carry, bear)):
to carry different ways, to spread, to scatter, to disperse, to separate

Differens (present participle of differo):
carrying different ways, spreading, scattering, dispersing, separating

Abstract Differential equations are among the most successful models in
physics, chemistry, biology, engineering, and many other fields. This chapter is
concerned with solving systems of ordinary differential equations. Ordinary dif-
ferential equations are equations that contain derivatives with respect to only
one independent variable. The main result that a system of ordinary differen-
tial equations has a unique solution under certain reasonable assumptions is
presented. In order to solve the equations numerically, Runge–Kutta formu-
las are discussed in detail, since they yield excellent results for a wide range of
equation types. Finally, the formulas are implemented in an idiomatic manner
in Julia.

9.1 Introduction

The unknown in an ordinary differential equation (ode) is a function 𝑦 ∶ ℝ → ℝ
of a single independent variable, often of time 𝑡 or position 𝑥, in contrast to
partial differential equations, whose unknown functions depend on two or more
independent variables (see Chap. 10).
The order of an ode is the order of the highest derivative of the unknown
function that appears in the equation. In abstract terms, any ordinary differential
equation of order 𝑛 ∈ ℕ0 can be written in the form

𝐺(𝑦 (𝑛) (𝑡), 𝑦 (𝑛−1) (𝑡), … , 𝑦 ′ (𝑡), 𝑦(𝑡)) = 0 ∀𝑡 ∈ 𝐼, (9.1)

© Springer Nature Switzerland AG 2022
C. Heitzinger, Algorithms with JULIA,
https://doi.org/10.1007/978-3-031-16560-3_9

where 𝐺 is a function of all derivatives that occur in the equation and 𝐼 usually
is an interval and possibly all of ℝ.
Since this is a very abstract way of writing an ode, we now derive an impor-
tant time dependent ode that models exponential growth. The example stems
from modeling bacterial growth. We denote the amount of bacteria in a Petri
dish by 𝑦(𝑡) and the known amount of bacteria at the initial time 𝑡 = 0 by 𝑦0 ,
i.e.,
𝑦(0) = 𝑦0 ∈ ℝ+ .
To derive a differential equation, we start with a finite, small time interval of
length Δ𝑡 ∈ ℝ+ . Our modeling assumption is that the equation

𝑦(𝑡 + Δ𝑡) = 𝑦(𝑡) + 𝛼 Δ𝑡 𝑦(𝑡)

holds, which is reasonable because it means that the number of bacteria at the
end of the small time interval is equal to their number at the beginning of the
interval plus a constant 𝛼 ∈ ℝ times the length of the interval times the number
of bacteria present (at the beginning of the interval). In other words, the change
in the number of bacteria is proportional to the length of the interval and the
number of bacteria provided that the time interval is small enough.
By considering the units of the terms in the equation, it becomes clear that
the last term must contain a constant factor, because if it did not, the units could
not match. More precisely, if we denote the unit of 𝑦 by [𝑦], comparing the three
terms yields [𝑦] = [𝛼][𝑡][𝑦], and thus the unit of the constant factor 𝛼 is [𝛼] =
1∕[𝑡]; it is a growth rate. Such considerations are a general principle and they are
very useful when assessing constants or parameters in differential equations.
Rearranging the terms in the equation yields

(𝑦(𝑡 + Δ𝑡) − 𝑦(𝑡)) ∕ Δ𝑡 = 𝛼𝑦(𝑡),
which holds for all small time intervals. Therefore we can take the limit as Δ𝑡
goes to zero on both sides of the equation to obtain the first-order ode

𝑦 ′ (𝑡) = 𝛼𝑦(𝑡).

In order to fully specify the problem, we also have to give the initial value 𝑦(0) =
𝑦0 in addition to the equation that holds at all later times. Therefore we arrive at
the initial-value problem

𝑦 ′ (𝑡) = 𝛼𝑦(𝑡) ∀𝑡 ∈ (0, ∞),


𝑦(0) = 𝑦0 ,

where we have also noted the time interval.
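The discrete modeling assumption behind this derivation can be checked numerically: iterating 𝑦 ← 𝑦 + 𝛼 Δ𝑡 𝑦 with a small Δ𝑡 stays close to the exact solution 𝑦0 e^{𝛼𝑡} of the initial-value problem (the values of 𝛼, 𝑦0, and Δ𝑡 below are arbitrary choices for the sketch):

```julia
# Iterate the discrete growth model y(t + Δt) = y(t) + α Δt y(t).
function grow(α, y0, Δt, T)
    y, t = float(y0), 0.0
    while t < T - Δt/2
        y += α * Δt * y
        t += Δt
    end
    return y
end

y = grow(0.5, 1.0, 1e-4, 1.0)
y - exp(0.5)               # small discretization error, of order O(Δt)
```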



9.2 Existence and Uniqueness of Solutions *

Do solutions of a given ode exist? Is the solution unique? These questions are
not only of theoretical importance, but they are also valid questions to ask from
the modeling and numerical points of view. An ode that is supposed to model
a physical, chemical, biological, or engineering process gains credibility when
it is known that it has a unique solution. The existence and uniqueness of a
solution is also important whenever the solution of an ode is to be approximated
numerically. What should be approximated unless there is a unique solution?
Here we answer the questions of existence and uniqueness for general first-
order initial-value problems under very general assumptions [2, Section 2.8]. We
can always write a first-order initial-value problem in the form

𝑦 ′ (𝑡) = 𝑓(𝑡, 𝑦(𝑡)) ∀𝑡 ∈ (0, ∞), (9.2a)


𝑦(0) = 0. (9.2b)

Here we have assumed that the initial point is the origin (0, 0), but this can al-
ways be achieved by a simple substitution. The main result is the following.
Theorem 9.1 (Picard’s existence and uniqueness theorem) Suppose
𝑓 ∶ 𝑅 → ℝ and 𝜕𝑓∕𝜕𝑦 are continuous functions in a rectangle 𝑅 ∶= [−𝑎, 𝑎] ×
[−𝑏, 𝑏] containing the origin. Then there exists a unique solution 𝑦 ∶ [−ℎ, ℎ] → ℝ
defined on the interval [−ℎ, ℎ] ⊂ [−𝑎, 𝑎] of the first-order initial-value problem
(9.2).
The solution referred to in this theorem is a classical solution, i.e., a function
that is differentiable, that thus can be substituted into equation (9.2), and that
satisfies it for every point in the solution interval.
Proof To be able to apply the method used in this proof, we transform the dif-
ferential equation into an integral equation. This can always be achieved by in-
tegrating the equation. If 𝑦 is a solution of (9.2), then 𝑓(𝑡, 𝑦(𝑡)) is a continuous
function by assumption and hence integrable. Integrating (9.2) from the initial
point 0 to an arbitrary point 𝑡 yields the integral equation
𝑦(𝑡) = ∫_0^𝑡 𝑓(𝑠, 𝑦(𝑠)) d𝑠, (9.3)

where the initial condition 𝑦(0) = 0 was used.


Conversely, if 𝑦 is a solution of the integral equation, the integrand is contin-
uous, and therefore 𝑦 is differentiable by the fundamental theorem of calculus.
Differentiating the integral equation yields (9.2) again, and the initial condition
is also satisfied.
In summary, the differential equation (9.2) and the integral equation (9.3) are
equivalent.
The proof method used here is called the method of successive approximation
or Picard’s iteration method. We start with the initial approximation

𝑦0 (𝑡) ∶= 0,

which satisfies the initial condition. Further approximations are found by using
the previous approximations on the right-hand side of the integral equation (9.3)
and using it as the definition of the next approximation, i.e., by defining
𝑦𝑛+1 (𝑡) ∶= ∫_0^𝑡 𝑓(𝑠, 𝑦𝑛 (𝑠)) d𝑠. (9.4)

Each function in the sequence ⟨𝑦𝑛 ⟩ satisfies the initial condition. If there is
an 𝑛 ∈ ℕ0 such that 𝑦𝑛 = 𝑦𝑛+1 , then 𝑦𝑛 is a solution of the differential equation
and hence the integral equation, but in general this does not happen.
We therefore consider the limit function of this sequence and will establish
that it solves the equation by proceeding in the following steps.
1. Are all elements of the sequence well-defined, differentiable functions?
2. If yes, does the sequence converge?
3. If yes, does the limit function satisfy the integral equation (9.3)?
4. If yes, is the solution unique?
If the last question can be answered positively, the proof is complete.
1. So far, the approximations 𝑦𝑛 have not been fully defined. In addition to
(9.4), the domains of definition (and the images) of the functions 𝑦𝑛 must also
be specified. Here, in particular, the domains of definition must be specified such
that 𝑓(𝑠, 𝑦𝑛 (𝑠)) in the integrand of 𝑦𝑛+1 can be evaluated. Since 𝑓 is only known
to be defined when its second argument is in the interval [−𝑏, 𝑏], the domains
of definition must be chosen sufficiently small such that 𝑦𝑛 lies in the interval
[−𝑏, 𝑏].
Since 𝑓 is a continuous function on a closed bounded domain, it is bounded,
i.e.,
∃𝑀 ∈ ℝ₀⁺ ∶ ∀(𝑡, 𝑦) ∈ 𝑅 ∶ |𝑓(𝑡, 𝑦)| ≤ 𝑀. (9.5)
Because 𝑦𝑛′ (𝑡) = 𝑓(𝑡, 𝑦𝑛−1 (𝑡)), the absolute slope of 𝑦𝑛 is also bounded by 𝑀.
Hence, because 𝑦𝑛 (0) = 0, we have −𝑀𝑡 ≤ 𝑦𝑛 (𝑡) ≤ 𝑀𝑡. This consideration
implies that the condition that 𝑦𝑛 lies in the interval [−𝑏, 𝑏] is ensured if 𝑡 ≤
𝑏∕𝑀.
Therefore we define
ℎ ∶= min(𝑎, 𝑏∕𝑀)
and use the rectangle
𝐷 ∶= [−ℎ, ℎ] × [−𝑏, 𝑏]
as the domain of definition of the functions 𝑦𝑛 . The 𝑦𝑛 are thus functions
𝑦𝑛 ∶ 𝐷 → ℝ and well-defined.
2. The second question is whether the sequence ⟨𝑦𝑛 ⟩ converges. We start by
showing the estimate

|𝑦𝑛 (𝑡) − 𝑦𝑛−1 (𝑡)| ≤ 𝑀𝐿^{𝑛−1} |𝑡|^𝑛 ∕ 𝑛! ∀𝑡 ∈ [−ℎ, ℎ] ∀𝑛 ∈ ℕ (9.6)
by induction. If 𝑛 = 1, then |𝑦1 (𝑡)| ≤ 𝑀|𝑡| follows from the definition (9.4) of 𝑦1
and (9.5). If 𝑛 > 1, we use the Lipschitz condition (see Problem 9.2) to calculate
|𝑦𝑛+1 (𝑡) − 𝑦𝑛 (𝑡)| ≤ ∫_0^𝑡 |𝑓(𝑠, 𝑦𝑛 (𝑠)) − 𝑓(𝑠, 𝑦𝑛−1 (𝑠))| d𝑠
≤ 𝐿 ∫_0^𝑡 |𝑦𝑛 (𝑠) − 𝑦𝑛−1 (𝑠)| d𝑠
≤ 𝐿 ∫_0^𝑡 𝑀𝐿^{𝑛−1} |𝑠|^𝑛 ∕ 𝑛! d𝑠
= 𝑀𝐿^𝑛 |𝑡|^{𝑛+1} ∕ (𝑛 + 1)!.

Since 𝑡 ∈ [−ℎ, ℎ], the estimate (9.6) implies

|𝑦𝑛 (𝑡) − 𝑦𝑛−1 (𝑡)| ≤ 𝑀𝐿^{𝑛−1} ℎ^𝑛 ∕ 𝑛! ∀𝑡 ∈ [−ℎ, ℎ] ∀𝑛 ∈ ℕ, (9.7)
whose right-hand side is independent of 𝑡.
Next, we write 𝑦𝑛 (𝑡) as the telescoping sum

𝑦𝑛 (𝑡) = 𝑦0 (𝑡) + (𝑦1 (𝑡) − 𝑦0 (𝑡)) + ⋯ + (𝑦𝑛 (𝑡) − 𝑦𝑛−1 (𝑡)),

which implies

|𝑦𝑛 (𝑡)| ≤ |𝑦0 (𝑡)| + |𝑦1 (𝑡) − 𝑦0 (𝑡)| + ⋯ + |𝑦𝑛 (𝑡) − 𝑦𝑛−1 (𝑡)|. (9.8)

Using inequality (9.7), we thus find

|𝑦𝑛 (𝑡)| ≤ 0 + ∑_{𝑘=1}^{𝑛} 𝑀𝐿^{𝑘−1} ℎ^𝑘 ∕ 𝑘! = (𝑀∕𝐿) ∑_{𝑘=1}^{𝑛} (𝐿ℎ)^𝑘 ∕ 𝑘! ∀𝑡 ∈ [−ℎ, ℎ] ∀𝑛 ∈ ℕ.

The right-hand side converges to


lim_{𝑛→∞} (𝑀∕𝐿) ∑_{𝑘=1}^{𝑛} (𝐿ℎ)^𝑘 ∕ 𝑘! = (𝑀∕𝐿) (e^{𝐿ℎ} − 1)

from below, which implies that


|𝑦𝑛 (𝑡)| ≤ (𝑀∕𝐿) (e^{𝐿ℎ} − 1) ∀𝑡 ∈ [−ℎ, ℎ] ∀𝑛 ∈ ℕ.
𝐿

We have hence shown that the sum in (9.8) converges as 𝑛 → ∞. Therefore the
sequence ⟨𝑦𝑛 (𝑡)⟩ converges for all 𝑡 ∈ [−ℎ, ℎ] as it is a sequence of partial sums
of a convergent infinite series.
The bound in (9.7) does not depend on 𝑡 and hence the bounds in the inequal-
ities in the preceding argument also hold independently of 𝑡. Therefore the se-
quence ⟨𝑦𝑛 ⟩ even converges uniformly.
Having shown that the sequence ⟨𝑦𝑛 ⟩ converges uniformly, we denote its limit
by
𝑦(𝑡) ∶= lim_{𝑛→∞} 𝑦𝑛 (𝑡).

3. Does the limit function 𝑦 satisfy the integral equation (9.3)?


We start by taking the limit as 𝑛 goes to ∞ on both sides of the iteration (9.4),
yielding
𝑦(𝑡) = lim_{𝑛→∞} ∫_0^𝑡 𝑓(𝑠, 𝑦𝑛 (𝑠)) d𝑠.

Since the sequence ⟨𝑦𝑛 ⟩ converges uniformly, we can interchange taking the
limit and integration (see Problem 9.3) to obtain
𝑦(𝑡) = ∫_0^𝑡 lim_{𝑛→∞} 𝑓(𝑠, 𝑦𝑛 (𝑠)) d𝑠.

Since the function 𝑓 is continuous in its second argument, we can take the limit
inside its second argument to find
𝑦(𝑡) = ∫_0^𝑡 𝑓(𝑠, lim_{𝑛→∞} 𝑦𝑛 (𝑠)) d𝑠 = ∫_0^𝑡 𝑓(𝑠, 𝑦(𝑠)) d𝑠.

The last equation means that 𝑦 solves the integral equation and hence the differ-
ential equation by the discussion at the beginning of the proof.
4. Is the solution unique? Suppose there is another solution 𝑧. Then
𝑦(𝑡) − 𝑧(𝑡) = ∫_0^𝑡 (𝑓(𝑠, 𝑦(𝑠)) − 𝑓(𝑠, 𝑧(𝑠))) d𝑠 ∀𝑡 ∈ [0, 𝑎]

holds for their difference, and furthermore we have


|𝑦(𝑡) − 𝑧(𝑡)| ≤ ∫_0^𝑡 |𝑓(𝑠, 𝑦(𝑠)) − 𝑓(𝑠, 𝑧(𝑠))| d𝑠 ∀𝑡 ∈ [0, 𝑎].

Using Problem 9.2, the last inequality implies that


|𝑦(𝑡) − 𝑧(𝑡)| ≤ 𝐿 ∫_0^𝑡 |𝑦(𝑠) − 𝑧(𝑠)| d𝑠 ∀𝑡 ∈ [0, 𝑎].

We denote the integral on the right-hand side by 𝑈(𝑡). The function 𝑈 is dif-
ferentiable, and we obviously have

𝑈(0) = 0 (9.9)

and
𝑈(𝑡) ≥ 0 ∀𝑡 ∈ [0, 𝑎]. (9.10)
Using 𝑈, the last inequality becomes

𝑈 ′ (𝑡) − 𝐿𝑈(𝑡) ≤ 0 ∀𝑡 ∈ [0, 𝑎].

By multiplying this inequality by e−𝐿𝑡 , we find that it is equivalent to


(e^{−𝐿𝑡} 𝑈(𝑡))′ ≤ 0 ∀𝑡 ∈ [0, 𝑎].

Integrating this inequality from zero to 𝑡 and using (9.9), we find the inequality

e−𝐿𝑡 𝑈(𝑡) ≤ 0 ∀𝑡 ∈ [0, 𝑎].

The last inequality and (9.10) imply 𝑈(𝑡) = 0 for all 𝑡 ∈ [0, 𝑎] and hence 𝑈 ′ (𝑡) =
|𝑦(𝑡) − 𝑧(𝑡)| = 0. In other words, any two solutions 𝑦 and 𝑧 are identical for
𝑡 ∈ [0, 𝑎]. An analogous argument shows that the solution is unique for all 𝑡 ∈
[−𝑎, 0].
This completes the proof. □

An alternative proof is based on observing that the operator given by the Pi-
card iteration (9.4) is a contraction and then using the Banach fixed-point theo-
rem (see Problem 9.4).
The result of the theorem is that the solution exists in a finite (and possibly
very small) time interval. Can we do better? In general, a stronger result cannot
be expected. The most prominent and simple counterexample is the equation
𝑦 ′ (𝑡) = 𝑦(𝑡)2 with the initial condition 𝑦(0) = 𝑦0 . Separation of variables shows
that its solution is 𝑦(𝑡) = 𝑦0 ∕(1 − 𝑦0 𝑡) if 𝑦0 ≠ 0 and 𝑦 = 0 if 𝑦0 = 0. If 𝑦0 > 0, then
the solution exists only in the interval 𝑡 ∈ [0, 1∕𝑦0 ) and becomes unbounded
within a finite amount of time.
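Picard's iteration from the proof can also be carried out numerically. The sketch below uses the arbitrarily chosen right-hand side 𝑓(𝑡, 𝑦) = 𝑡 + 𝑦 (so the exact solution of (9.2) is e^𝑡 − 𝑡 − 1) and approximates each integral with the trapezoidal rule on a grid; this is an illustration, not part of the theory:

```julia
# Successive approximations y_{n+1}(t) := ∫₀ᵗ f(s, yₙ(s)) ds on a grid.
function picard(f, ts, iterations)
    h = step(ts)
    y = zeros(length(ts))            # y₀ ≡ 0 satisfies the initial condition
    for _ in 1:iterations
        g = [f(t, yt) for (t, yt) in zip(ts, y)]
        ynew = zero(y)
        for i in 2:length(ts)        # cumulative trapezoidal rule
            ynew[i] = ynew[i-1] + h * (g[i-1] + g[i]) / 2
        end
        y = ynew
    end
    return y
end

ts = range(0.0, 1.0; length = 1001)
y = picard((t, y) -> t + y, ts, 15)
maximum(abs.(y .- (exp.(ts) .- ts .- 1)))   # small: only quadrature error remains
```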

9.3 Systems of Ordinary Differential Equations

A linear ode is a special case of the general form (9.1) and has the form

𝑎𝑛 (𝑡)𝑦 (𝑛) (𝑡) + 𝑎𝑛−1 (𝑡)𝑦 (𝑛−1) (𝑡) + ⋯ + 𝑎1 (𝑡)𝑦 ′ (𝑡) + 𝑎0 (𝑡)𝑦(𝑡) = 𝑏(𝑡) ∀𝑡 ∈ 𝐼,

whose defining feature is that all terms that contain the unknown 𝑦 are linear
in 𝑦.

A linear ode of order 𝑛 can always be written as a linear system of 𝑛 first-order
odes. The idea is to introduce new variables for the higher-order derivatives. We
hence define

𝑧0 ∶= 𝑦,
𝑧1 ∶= 𝑦 ′ ,
⋮
𝑧𝑛−1 ∶= 𝑦 (𝑛−1)

and can now write the linear equation as the linear system

𝑧0′ = 𝑧1 ,
𝑧1′ = 𝑧2 ,
⋮
𝑧𝑛−2′ = 𝑧𝑛−1 ,
𝑎𝑛 𝑧𝑛−1′ = 𝑏 − 𝑎𝑛−1 𝑧𝑛−1 − 𝑎𝑛−2 𝑧𝑛−2 − ⋯ − 𝑎1 𝑧1 − 𝑎0 𝑧0

in the interval 𝐼. There are 𝑛 equations for the 𝑛 variables 𝑧0 , … , 𝑧𝑛−1 . The last
equation stems from the original equation, while the other equations connect
the new variables.
This consideration underlines the importance of (linear) systems of first-order
odes. Any linear ode of order 𝑛 for a single unknown function can be written
in this form, and linear problems with more unknowns can also be written in
this form. Therefore most numerical programs for odes have been developed
for systems of first-order equations.
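As a concrete (invented) example, the second-order equation 𝑦″(𝑡) + 𝑦(𝑡) = 0 becomes a first-order system with 𝑧0 = 𝑦 and 𝑧1 = 𝑦′, which in Julia can be expressed as a vector-valued right-hand side:

```julia
# Right-hand side of the first-order system for y'' + y = 0,
# where z[1] = y and z[2] = y'.
f(t, z) = [z[2], -z[1]]

z0 = [1.0, 0.0]      # y(0) = 1, y'(0) = 0; the exact solution is y = cos(t)
f(0.0, z0)           # the initial slope vector
```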

9.4 Euler Methods

In the rest of this chapter, numerical methods for the approximation of solutions
of odes are presented. Although many sophisticated methods for finding solu-
tions of odes in closed form have been developed, the solutions can generally not
be written in closed form. A simple counterexample is the ode 𝑦 ′ (𝑡) = 𝑓(𝑡) with
the initial condition 𝑦(0) = 0. Its solution is the integral 𝑦(𝑡) = ∫_0^𝑡 𝑓(𝑠) d𝑠. But
the antiderivative of an elementary function (in the sense of differential algebra
this is a function that can be written in closed algebraic form) is not necessarily
elementary; the most prominent example is the function
𝑓(𝑡) ∶= e^{−𝑡^2} .

The Risch algorithm is a decision procedure that answers the question whether
an elementary function has an elementary antiderivative or not [6, 7].

Therefore calculating precise approximations of the solutions in an efficient
manner is an important practical task.

9.4.1 Forward and the Backward Euler Methods

The most straightforward idea to solve any differential equation is to use the
definition of the derivative and to replace it by its difference quotient. We start
from the general first-order equation

𝑦 ′ (𝑡) = 𝑓(𝑡, 𝑦), 𝑦(𝑡0 ) = 𝑦0

and assume that it has a unique solution (see Sect. 9.2). We also define a sequence
of points 𝑡𝑛 such that 𝑡0 < 𝑡1 < ⋯ < 𝑡𝑛 < 𝑡𝑛+1 < ⋯ and denote the approxima-
tion of 𝑦(𝑡𝑛 ) by 𝑦𝑛 . Replacing the derivative by its forward difference quotient
yields
(𝑦𝑛+1 − 𝑦𝑛 ) ∕ (𝑡𝑛+1 − 𝑡𝑛 ) ≈ 𝑦 ′ (𝑡𝑛 ) = 𝑓(𝑡𝑛 , 𝑦(𝑡𝑛 )),
necessitating that 𝑡𝑛+1 − 𝑡𝑛 is small. This motivates the definition

𝑦𝑛+1 ∶= 𝑦𝑛 + (𝑡𝑛+1 − 𝑡𝑛 )𝑓(𝑡𝑛 , 𝑦𝑛 ), (9.11)

which is called the forward Euler method.


Algorithm 9.2 (forward Euler method) Input: the right-hand side 𝑓, the ini-
tial value 𝑦0 , and points 𝑡0 < 𝑡1 < ⋯ < 𝑡𝑁 or a step size ℎ. If the step size ℎ is
given, the points are 𝑡𝑛 ∶= 𝑡0 + 𝑛ℎ.
1. Loop for 𝑛 from 0 to 𝑁 − 1: set

𝑦𝑛+1 ∶= 𝑦𝑛 + (𝑡𝑛+1 − 𝑡𝑛 )𝑓(𝑡𝑛 , 𝑦𝑛 ).

2. Return the values 𝑦𝑛 .
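A minimal Julia sketch of Algorithm 9.2 (the function name and the test problem 𝑦′ = 𝑦 are choices made here, not the book's reference implementation):

```julia
# Forward Euler: y_{n+1} := y_n + (t_{n+1} - t_n) f(t_n, y_n).
function euler_forward(f, y0, ts)
    ys = [float(y0)]
    for n in 1:length(ts)-1
        push!(ys, ys[end] + (ts[n+1] - ts[n]) * f(ts[n], ys[end]))
    end
    return ys
end

ts = range(0.0, 1.0; length = 10_001)
ys = euler_forward((t, y) -> y, 1.0, ts)
ys[end] - exp(1.0)         # global error of order O(h)
```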


Alternatively, we can also use the backward difference quotient instead of the
forward difference quotient in the derivation above. Then we approximate the
derivative by
(𝑦𝑛 − 𝑦𝑛−1 ) ∕ (𝑡𝑛 − 𝑡𝑛−1 ) ≈ 𝑦 ′ (𝑡𝑛 ) = 𝑓(𝑡𝑛 , 𝑦(𝑡𝑛 )),
which motivates to define the approximation 𝑦𝑛+1 as the solution of the (alge-
braic) equation
𝑦𝑛+1 = 𝑦𝑛 + (𝑡𝑛+1 − 𝑡𝑛 )𝑓(𝑡𝑛+1 , 𝑦𝑛+1 )
after shifting the index 𝑛 by one. This is called the backward Euler method. In
contrast to the forward Euler method, this formula is not an explicit formula for
𝑦𝑛+1 , but defines 𝑦𝑛+1 only implicitly. Therefore an algebraic equation must be
solved in each time step.

Algorithm 9.3 (backward Euler method) Input: the right-hand side 𝑓, the
initial value 𝑦0 , and points 𝑡0 < 𝑡1 < ⋯ < 𝑡𝑁 or a step size ℎ.
1. Loop for 𝑛 from 0 to 𝑁 − 1: set 𝑦𝑛+1 to be the solution of the (algebraic) equation

𝑦𝑛+1 = 𝑦𝑛 + (𝑡𝑛+1 − 𝑡𝑛 )𝑓(𝑡𝑛+1 , 𝑦𝑛+1 ).

2. Return the values 𝑦𝑛 .
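A sketch of Algorithm 9.3; here the algebraic equation in each step is solved by a simple fixed-point iteration (an arbitrary choice for the illustration; Newton's method is a common alternative), which converges when ℎ times the Lipschitz constant of 𝑓 is small:

```julia
# Backward Euler: solve y_{n+1} = y_n + h f(t_{n+1}, y_{n+1}) in each step.
function euler_backward(f, y0, ts; iterations = 50)
    ys = [float(y0)]
    for n in 1:length(ts)-1
        h = ts[n+1] - ts[n]
        y = ys[end]                  # initial guess for the fixed-point iteration
        for _ in 1:iterations
            y = ys[end] + h * f(ts[n+1], y)
        end
        push!(ys, y)
    end
    return ys
end

ts = range(0.0, 1.0; length = 1001)
ys = euler_backward((t, y) -> -y, 1.0, ts)   # y' = -y, exact solution e^{-t}
ys[end] - exp(-1.0)        # global error of order O(h)
```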

9.4.2 Truncation Errors of the Forward Euler Method

The difference between the exact solution of the ode and its numerical approxi-
mation is called the global truncation error. It stems from two causes (ignoring
the round-off error). The first cause is the use of an approximate formula to cal-
culate 𝑦𝑛+1 from the previous approximation 𝑦𝑛 (assuming that the previous ap-
proximation was exact, i.e., 𝑦(𝑡𝑛 ) = 𝑦𝑛 ). This cause of errors is called the local
truncation error; it is the error due to the use of an approximate formula only.
The second cause is the fact that the input used in each step is only approxima-
tively correct since 𝑦(𝑡𝑛 ) is not equal to 𝑦𝑛 in general, also because the previous
errors accumulate.
Another fundamental source of errors arises from performing the computa-
tions in arithmetic with only a finite number of digits. This error is called the
round-off error and is not considered here.
In the following, we focus on the local truncation error

𝑒𝑛+1 ∶= 𝑦𝑛+1 − 𝑦(𝑡𝑛+1 ),

i.e., the difference between the approximation 𝑦𝑛+1 at 𝑡𝑛+1 and the value 𝑦(𝑡𝑛+1 )
of the exact solution 𝑦 at 𝑡𝑛+1 while assuming that 𝑦(𝑡𝑛 ) = 𝑦𝑛 .

Theorem 9.4 (local truncation error of the forward Euler method) Sup-
pose the exact solution 𝑦 exists uniquely and that it is twice differentiable in the open
interval (𝑡𝑛 , 𝑡𝑛+1 ) and continuously differentiable in the closed interval [𝑡𝑛 , 𝑡𝑛+1 ].
Then the local truncation error 𝑒𝑛+1 of the forward Euler method is given by
𝑒𝑛+1 = −(ℎ^2 ∕2) 𝑦 ′′ (𝑡̃𝑛 ) for some 𝑡̃𝑛 ∈ (𝑡𝑛 , 𝑡𝑛+1 ).
Proof Taylor expansion of the exact solution 𝑦 at 𝑡𝑛+1 = 𝑡𝑛 + ℎ around the point
𝑡𝑛 and using the Lagrange form of the remainder term yields

𝑦(𝑡𝑛+1 ) = 𝑦(𝑡𝑛 ) + ℎ𝑦 ′ (𝑡𝑛 ) + (ℎ^2 ∕2) 𝑦 ′′ (𝑡̃𝑛 ),
where 𝑡̃𝑛 ∈ (𝑡𝑛 , 𝑡𝑛+1 ). Subtracting the Taylor expansion from the forward Euler
method (9.11) yields

𝑒𝑛+1 = 𝑦𝑛 − 𝑦(𝑡𝑛 ) + ℎ(𝑓(𝑡𝑛 , 𝑦𝑛 ) − 𝑦 ′ (𝑡𝑛 )) − (ℎ^2 ∕2) 𝑦 ′′ (𝑡̃𝑛 ). (9.12)
Recalling that we assume that 𝑦(𝑡𝑛 ) = 𝑦𝑛 when considering the local truncation
error, we also have 𝑦 ′ (𝑡𝑛 ) = 𝑓(𝑡𝑛 , 𝑦(𝑡𝑛 )) = 𝑓(𝑡𝑛 , 𝑦𝑛 ). This simplifies the local
truncation error to
𝑒𝑛+1 = −(ℎ^2 ∕2) 𝑦 ′′ (𝑡̃𝑛 )
as claimed. □

Hence the local truncation error is proportional both to the square of the step
size ℎ and to the second derivative of the solution somewhere in the interval
[𝑡𝑛 , 𝑡𝑛+1 ]. If a bound 𝑀 of the absolute value of the second derivative is known
on the whole interval where the solution is approximated, we can write

|𝑒𝑛 | ≤ ℎ^2 𝑀∕2.
Hence it can be ensured that the local truncation error is less than or equal to 𝜖
if the inequality ℎ ≤ √(2𝜖∕𝑀) holds. It is also clear from the proof that a bound on the global truncation error
will require a bound on the second derivative of the solution.
Analyzing the global truncation error, which is defined as

𝐸𝑛 ∶= 𝑦𝑛 − 𝑦(𝑡𝑛 )

while not assuming that 𝑦(𝑡𝑛 ) = 𝑦𝑛 , is more involved.

Theorem 9.5 (global truncation error of the forward Euler method) Sup-
pose that 𝑡𝑛 ∶= 𝑡0 + 𝑛ℎ (ℎ ∈ ℝ+ ), that 𝑓 is continuous, that 𝑓 is Lipschitz con-
tinuous with respect to its second argument with Lipschitz constant 𝐿, and that the
exact solution 𝑦 is twice differentiable in the open interval (0, 𝑡𝑛 ) and continuously
differentiable in the closed interval [0, 𝑡𝑛 ]. Then the global truncation error 𝐸𝑛 of
the forward Euler method is bounded by

|𝐸𝑛 | ≤ ((e^{(𝑡𝑛 −𝑡0 )𝐿} − 1)∕𝐿) 𝛽ℎ,
where 𝛼 ∶= 1 + ℎ𝐿 and 𝛽 ∶= (1∕2) max_{𝑡∈(𝑡0 ,𝑡𝑛 )} |𝑦 ′′ (𝑡)|.

If 𝜕𝑓∕𝜕𝑡 is continuous in the interval [𝑡0 , 𝑡𝑛 ], then the solution 𝑦 has a con-
tinuous second derivative on this interval and hence the assumptions on the
smoothness of 𝑦 are satisfied.
Proof Equation (9.12) and the Lipschitz continuity of 𝑓 with respect to its sec-
ond argument imply that

|𝐸𝑛+1 | ≤ |𝐸𝑛 | + ℎ|𝑓(𝑡𝑛 , 𝑦𝑛 ) − 𝑓(𝑡𝑛 , 𝑦(𝑡𝑛 ))| + (ℎ^2 ∕2) |𝑦 ′′ (𝑡̃𝑛 )| ≤ 𝛼|𝐸𝑛 | + 𝛽ℎ^2 .
It is straightforward to show by induction that 𝐸0 = 0 and the last inequality
|𝐸𝑛+1 | ≤ 𝛼|𝐸𝑛 | + 𝛽ℎ2 imply that

|𝐸𝑛 | ≤ ((𝛼^𝑛 − 1)∕(𝛼 − 1)) 𝛽ℎ^2 .
Note that 𝛼 > 1 can be assumed without loss of generality.
Substituting the definition of 𝛼 into the last estimate yields

|𝐸𝑛 | ≤ (((1 + ℎ𝐿)^𝑛 − 1)∕𝐿) 𝛽ℎ.
The Taylor expansion of the exponential function shows that 1 + ℎ𝐿 ≤ e^{ℎ𝐿} and
hence (1 + ℎ𝐿)^𝑛 ≤ e^{𝑛ℎ𝐿} .
In summary, we find that

|𝐸𝑛 | ≤ ((e^{𝑛ℎ𝐿} − 1)∕𝐿) 𝛽ℎ = ((e^{(𝑡𝑛 −𝑡0 )𝐿} − 1)∕𝐿) 𝛽ℎ,
which concludes the proof. □

Since the global truncation error has order one in the step size ℎ, the forward
Euler method is called a first-order method. Much effort has been devoted to the
development of higher-order methods, and we discuss such methods in the rest
of this chapter.
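The first-order behavior can also be observed empirically: halving ℎ approximately halves the global error (the test problem 𝑦′ = 𝑦, 𝑦(0) = 1 on [0, 1] is an arbitrary choice for this sketch):

```julia
# Global error of forward Euler for y' = y, y(0) = 1, measured at t = 1.
function euler_error(h)
    y, t = 1.0, 0.0
    while t < 1.0 - h/2
        y += h * y
        t += h
    end
    return abs(y - exp(1.0))
end

euler_error(1e-3) / euler_error(5e-4)   # ≈ 2, as expected for a first-order method
```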

9.4.3 Improved Euler Method

Another view towards deriving formulas for numerical approximations is not to
approximate the derivative as we have done until now, but to approximate the
integral in the equivalent formulation as an integral equation (see Sect. 9.2). The
improved Euler method can be derived in this manner.
We start by considering the initial-value problem

𝑦 ′ (𝑡) = 𝑓(𝑡, 𝑦(𝑡)), 𝑦(𝑡0 ) = 𝑦0

and recall the equivalent integral-equation formulation


𝑦(𝑡𝑛+1 ) = 𝑦(𝑡𝑛 ) + ∫_{𝑡𝑛}^{𝑡𝑛+1} 𝑓(𝑠, 𝑦(𝑠)) d𝑠

from Sect. 9.2, now written for arbitrary initial points.



Next, we approximate the integral. We recover the forward Euler method by
using the approximation

∫_{𝑡𝑛}^{𝑡𝑛+1} 𝑓(𝑠, 𝑦(𝑠)) d𝑠 ≈ ℎ𝑓(𝑡𝑛 , 𝑦𝑛 )

and the backward Euler formula by using


∫_{𝑡𝑛}^{𝑡𝑛+1} 𝑓(𝑠, 𝑦(𝑠)) d𝑠 ≈ ℎ𝑓(𝑡𝑛+1 , 𝑦𝑛+1 ).

We have replaced the integrand by its value 𝑓(𝑡𝑛 , 𝑦𝑛 ) on the left interval endpoint
in the case of the forward Euler formula (recall the forward difference) and by
its value 𝑓(𝑡𝑛+1 , 𝑦𝑛+1 ) on the right interval endpoint in the case of the backward
Euler formula (recall the backward difference).
Both choices seem to be one-sided and arbitrary. It is more prudent to use the
approximation
∫_{𝑡𝑛}^{𝑡𝑛+1} 𝑓(𝑠, 𝑦(𝑠)) d𝑠 ≈ (ℎ∕2) (𝑓(𝑡𝑛 , 𝑦𝑛 ) + 𝑓(𝑡𝑛+1 , 𝑦𝑛+1 )),

approximating the integral by the area of a trapezoid, which motivates defining
𝑦𝑛+1 as the solution of the equation

𝑦𝑛+1 = 𝑦𝑛 + (ℎ∕2) (𝑓(𝑡𝑛 , 𝑦𝑛 ) + 𝑓(𝑡𝑛+1 , 𝑦𝑛+1 )).
Unfortunately, this is only an implicit definition of 𝑦𝑛+1 . We can arrive at an
explicit formula if we replace the occurrence of 𝑦𝑛+1 in the last term by its ap-
proximation 𝑦𝑛 + ℎ𝑓(𝑡𝑛 , 𝑦𝑛 ) according to the forward Euler formula.
In summary, we define

𝑦𝑛+1 ∶= 𝑦𝑛 + (ℎ∕2) (𝑓(𝑡𝑛 , 𝑦𝑛 ) + 𝑓(𝑡𝑛+1 , 𝑦𝑛 + ℎ𝑓(𝑡𝑛 , 𝑦𝑛 ))), (9.13)
which is the improved Euler method. Its advantage is that its local truncation
error has order three as we will show next. Its disadvantage is that the evaluation
of 𝑓 on the right-hand side proceeds in two steps and requires two evaluations
of 𝑓, which is more costly.
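A Julia sketch of the improved Euler method (9.13); the function name and the test problem are invented for the illustration:

```julia
# Improved Euler: average the slopes at both interval endpoints.
function euler_improved(f, y0, ts)
    ys = [float(y0)]
    for n in 1:length(ts)-1
        h = ts[n+1] - ts[n]
        k1 = f(ts[n], ys[end])
        k2 = f(ts[n+1], ys[end] + h * k1)
        push!(ys, ys[end] + h/2 * (k1 + k2))
    end
    return ys
end

ts = range(0.0, 1.0; length = 101)
ys = euler_improved((t, y) -> y, 1.0, ts)
ys[end] - exp(1.0)         # global error of order O(h²)
```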

Theorem 9.6 (local truncation error of improved Euler method) Suppose
the exact solution 𝑦 exists uniquely and that it is three times differentiable in the
open interval (𝑡𝑛 , 𝑡𝑛+1 ) and twice continuously differentiable in the closed interval
[𝑡𝑛 , 𝑡𝑛+1 ]. Then the local truncation error of the improved Euler method has order
three in the step size ℎ.

Proof Recall that the local truncation error is given by



𝑒𝑛+1 = 𝑦𝑛+1 − 𝑦(𝑡𝑛+1 )

while assuming that 𝑦𝑛 = 𝑦(𝑡𝑛 ).


Taylor expansion of the exact solution 𝑦 at 𝑡𝑛+1 = 𝑡𝑛 + ℎ around the point 𝑡𝑛
and using the Lagrange form of the remainder term yields

𝑦(𝑡𝑛+1 ) = 𝑦(𝑡𝑛 ) + ℎ𝑦 ′ (𝑡𝑛 ) + (ℎ^2 ∕2!) 𝑦 ′′ (𝑡𝑛 ) + (ℎ^3 ∕3!) 𝑦 ′′′ (𝑡̃𝑛 ),
where 𝑡̃𝑛 ∈ (𝑡𝑛 , 𝑡𝑛+1 ). Subtracting the Taylor expansion from the improved Euler
method (9.13) yields

𝑒𝑛+1 = 𝑦𝑛 + (ℎ∕2) (𝑓(𝑡𝑛 , 𝑦𝑛 ) + 𝑓(𝑡𝑛+1 , 𝑦𝑛 + ℎ𝑓(𝑡𝑛 , 𝑦𝑛 )))
− (𝑦(𝑡𝑛 ) + ℎ𝑦 ′ (𝑡𝑛 ) + (ℎ^2 ∕2!) 𝑦 ′′ (𝑡𝑛 ) + (ℎ^3 ∕3!) 𝑦 ′′′ (𝑡̃𝑛 ))
= −(ℎ∕2) 𝑓(𝑡𝑛 , 𝑦𝑛 ) + (ℎ∕2) 𝑓(𝑡𝑛+1 , 𝑦𝑛 + ℎ𝑓(𝑡𝑛 , 𝑦𝑛 )) − (ℎ^2 ∕2!) 𝑦 ′′ (𝑡𝑛 ) − (ℎ^3 ∕3!) 𝑦 ′′′ (𝑡̃𝑛 ).
2 2 2! 3!
The two-dimensional Taylor expansion of the term 𝑓(𝑡𝑛+1 , 𝑦𝑛 + ℎ𝑓(𝑡𝑛 , 𝑦𝑛 ))
around the point (𝑡𝑛 , 𝑦𝑛 ) is

𝑓(𝑡𝑛+1 , 𝑦𝑛 + ℎ𝑓(𝑡𝑛 , 𝑦𝑛 )) = 𝑓(𝑡𝑛 , 𝑦𝑛 ) + ℎ𝑓𝑡 (𝑡𝑛 , 𝑦𝑛 ) + ℎ𝑓(𝑡𝑛 , 𝑦𝑛 )𝑓𝑦 (𝑡𝑛 , 𝑦𝑛 ) + 𝑂(ℎ2 ).

Using the ode and the chain rule, the second derivative of 𝑦 can be written as

𝑦 ′′ (𝑡𝑛 ) = 𝑓𝑡 (𝑡𝑛 , 𝑦(𝑡𝑛 )) + 𝑓𝑦 (𝑡𝑛 , 𝑦(𝑡𝑛 ))𝑦 ′ (𝑡𝑛 ) = 𝑓𝑡 (𝑡𝑛 , 𝑦𝑛 ) + 𝑓𝑦 (𝑡𝑛 , 𝑦𝑛 )𝑓(𝑡𝑛 , 𝑦𝑛 ).

Substituting these last two equations into the expression for 𝑒𝑛+1 shows that

𝑒𝑛+1 = 𝑂(ℎ3 ),

which concludes the proof. □

9.5 Variation of Step Size

It is often advantageous to adjust the step size in order to keep the local truncation
error at a nearly constant level. Not only can computational work be saved
in this manner, but it is also possible to control the accuracy of the approximation.
The most straightforward way to control the local truncation error would be
to calculate the difference between the approximation and the exact solution.
While this approach is a good idea in test problems, where the exact solution is
known, it is obviously not possible to do so in the general setting; if the exact
solution were known, we would not approximate it. Therefore we use a more
accurate numerical method as a substitute for the exact solution and compute
two different approximations using two different methods.
For example, we can use the forward Euler method (as the less accurate
method) and the improved Euler method (as the more accurate method). Then
the difference between the two approximate solutions is used as the estimate
𝑒ᵉˢᵗ𝑛+1 ∶= |𝑦ᴱᵘˡᵉʳ𝑛+1 − 𝑦ⁱᵐᵖʳᵒᵛᵉᵈ𝑛+1 |

of the error of the less accurate method.


If the estimated error 𝑒ᵉˢᵗ𝑛+1 does not match a given error tolerance 𝜖, then the
step size ℎ is adjusted and the calculations are repeated. It is important to know
how the local truncation error 𝑒𝑛+1 depends on the step size ℎ so that it can be
adjusted efficiently. In the case of the Euler method, used as the less accurate
method in this example, the local truncation error 𝑒𝑛+1 is proportional to ℎ2 (see
Theorem 9.4), which means that multiplying the step size ℎ by

(𝜖 ∕ 𝑒ᵉˢᵗ𝑛+1 )^(1∕2)

adjusts the local truncation error (up or down) to the given error tolerance 𝜖.
In this way, the local truncation error can be kept approximately constant
throughout the approximation of a solution. Small step sizes, which increase
computation time, are only used where needed so that the resulting algorithm
is both efficient and accurate (see Problem 9.5).
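A single step of this error control can be sketched as follows; the function name and the returned named tuple are our choices, not the book's, and the square root in the step-size update reflects the proportionality of the local truncation error to ℎ².

```julia
# One adaptive step combining forward Euler (less accurate) and improved
# Euler (more accurate); a sketch, not the book's implementation.
function adaptive_step(f, t, y, h, eps)
    y_euler = y + h * f(t, y)                             # forward Euler
    y_improved = y + h/2 * (f(t, y) + f(t + h, y_euler))  # improved Euler
    e_est = abs(y_euler - y_improved)                     # error estimate
    h_new = h * sqrt(eps / e_est)                         # e ∝ h², hence the root
    (y = y_improved, e_est = e_est, h_new = h_new)
end
```

Repeating such steps, with the step accepted whenever the estimate meets the tolerance, keeps the local truncation error approximately constant.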

9.6 Runge–Kutta Methods

Two of the most often executed programs in the history of ordinary differential
equations are probably the functions called ode23 and ode45 in matlab. These
two functions use adaptive Runge–Kutta methods [4, 1, 8], and therefore we
have a closer look at these methods in the rest of this chapter.
We still consider the initial-value problem

𝑦 ′ (𝑡) = 𝑓(𝑡, 𝑦(𝑡)), 𝑦(𝑡0 ) = 𝑦0 (9.14)

and start by defining the (classical) Runge–Kutta method

𝑘1 ∶= 𝑓(𝑡𝑛 , 𝑦𝑛 ), (9.15a)
𝑘2 ∶= 𝑓(𝑡𝑛 + ℎ∕2, 𝑦𝑛 + (ℎ∕2)𝑘1 ), (9.15b)
𝑘3 ∶= 𝑓(𝑡𝑛 + ℎ∕2, 𝑦𝑛 + (ℎ∕2)𝑘2 ), (9.15c)
𝑘4 ∶= 𝑓(𝑡𝑛 + ℎ, 𝑦𝑛 + ℎ𝑘3 ), (9.15d)
𝑦𝑛+1 ∶= 𝑦𝑛 + (ℎ∕6)(𝑘1 + 2𝑘2 + 2𝑘3 + 𝑘4 ), (9.15e)
𝑡𝑛+1 ∶= 𝑡𝑛 + ℎ. (9.15f)

It is called a four-stage method, since the four stages 𝑘1 , 𝑘2 , 𝑘3 , and 𝑘4 are needed
to proceed from time 𝑡𝑛 to time 𝑡𝑛+1 . This classical Runge–Kutta method is
therefore often abbreviated as rk4.
If 𝑓 does not depend on 𝑦, then the method (9.15) simplifies to

𝑦𝑛+1 = 𝑦𝑛 + (ℎ∕6) ( 𝑓(𝑡𝑛 ) + 4𝑓(𝑡𝑛 + ℎ∕2) + 𝑓(𝑡𝑛 + ℎ) ),
which is Simpson’s rule for approximating the integral of 𝑦 ′ (𝑡) = 𝑓(𝑡). This con-
sideration is analogous to the interpretation of the improved Euler method as an
application of the trapezoid rule to an integral in Sect. 9.4.3.
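This connection can be checked numerically. In the following sketch (names ours), one rk4 step for a right-hand side that does not depend on 𝑦 reduces to Simpson's rule on [0, ℎ].

```julia
# For f independent of y, one rk4 step equals Simpson's rule applied to f.
f(t) = t^2
h = 1.0
simpson = h/6 * (f(0.0) + 4*f(h/2) + f(h))  # Simpson's rule on [0, h]
# simpson equals the integral of t² over [0, 1], i.e., 1/3
```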
Generalizing (9.15), Runge–Kutta methods with 𝑠 stages can be written in
the form
𝑘𝑖 ∶= 𝑓(𝑡𝑛 + ℎ𝑐𝑖 , 𝑦𝑛 + ℎ ∑_{𝑗=1}^{𝑠} 𝑎𝑖𝑗 𝑘𝑗 ), 1 ≤ 𝑖 ≤ 𝑠, (9.16a)
𝑦𝑛+1 ∶= 𝑦𝑛 + ℎ ∑_{𝑖=1}^{𝑠} 𝑏𝑖 𝑘𝑖 , (9.16b)
𝑡𝑛+1 ∶= 𝑡𝑛 + ℎ. (9.16c)

The following two theorems mean that the rk4 method has order four; more
precisely, its local truncation error has order five and its global truncation error
has order four.

Theorem 9.7 (local truncation error of the rk4 method) Suppose that the
fourth partial derivatives of 𝑓 in the ordinary differential equation (9.14) exist in
the open interval (𝑡𝑛 , 𝑡𝑛+1 ) and that its third partial derivatives exist and are con-
tinuous in the closed interval [𝑡𝑛 , 𝑡𝑛+1 ]. Then the local truncation error of the rk4
method (9.15) has order five, i.e.,

𝑒𝑛+1 = 𝑦𝑛+1 − 𝑦(𝑡𝑛+1 ) = 𝑂(ℎ5 ).

Theorem 9.8 (global truncation error of the rk4 method) Under the as-
sumptions of Theorem 9.7, the global truncation error of the rk4 method (9.15)
has order four.

These two theorems are shown in Problems 9.6 and 9.7.
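Theorem 9.7 can be made plausible numerically before it is proved. The following sketch (names ours) performs one rk4 step for 𝑦′ = 𝑦 and compares the one-step error for two step sizes; for a fifth-order local error, halving ℎ should shrink the error by a factor of about 2⁵ = 32.

```julia
# One rk4 step (9.15); a sketch, names ours.
function rk4_step(f, t, y, h)
    k1 = f(t, y)
    k2 = f(t + h/2, y + h/2*k1)
    k3 = f(t + h/2, y + h/2*k2)
    k4 = f(t + h, y + h*k3)
    y + h/6 * (k1 + 2k2 + 2k3 + k4)
end

f(t, y) = y
e(h) = abs(rk4_step(f, 0.0, 1.0, h) - exp(h))  # local error after one step
ratio = e(0.2) / e(0.1)  # ≈ 2^5 = 32 for a fifth-order local error
```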


Runge–Kutta methods can be classified into explicit and implicit methods.
The system (9.16) of equations is explicit if the coefficients 𝑎𝑖𝑗 vanish for 𝑗 ≥ 𝑖,
corresponding to an explicit method. In practice, explicit methods are used because
calculating the stages 𝑘𝑖 is faster compared to implicit methods and because
explicit methods already enable a large choice of coefficients.
The rk4 method in (9.15) is a four-stage method and has order four. How do
the number of stages 𝑠 and the order 𝑝 relate in explicit Runge–Kutta methods?
In general, it can be shown that the inequality

𝑝≤𝑠

holds for any explicit Runge–Kutta method; if 𝑝 ≥ 5, then the stronger in-
equality
𝑝<𝑠
holds [3, Paragraph 324].
It is not known, however, whether these inequalities are sharp. The minimum
number of stages 𝑠 of an explicit Runge–Kutta method of order 𝑝 is an open
problem in those cases where no method with 𝑠 = 𝑝 + 1 is already known. The
following table summarizes the known minimum numbers of stages for orders
one to ten [3, Chapter 32].
Order 𝑝 1 2 3 4 5 6 7 8 9 10
Number 𝑠 of stages 1 2 3 4 6 7 9 11 ? 17

9.7 Butcher Tableaux

The coefficients 𝑎𝑖𝑗 , 𝑏𝑖 , and 𝑐𝑖 in the general form (9.16) of a Runge–Kutta


method can be arranged in so-called Butcher tableaux. The Butcher tableau of
an explicit Runge–Kutta method has the form

0
𝑐2 𝑎21
𝑐3 𝑎31 𝑎32
⋮ ⋮ ⋱
𝑐𝑠 𝑎𝑠1 𝑎𝑠2 ⋯ 𝑎𝑠,𝑠−1
𝑏1 𝑏2 ⋯ 𝑏𝑠−1 𝑏𝑠 ,

while the tableau of an implicit Runge–Kutta method would have nonzero


𝑎𝑖𝑗 entries in or above the diagonal. In the following, we only consider explicit
Runge–Kutta methods.
Calculations such as the ones in Problem 9.6 show that an explicit Runge–
Kutta method is consistent if
∑_{𝑗=1}^{𝑖−1} 𝑎𝑖𝑗 = 𝑐𝑖 ∀𝑖 ∈ {2, … , 𝑠}

holds.
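For a concrete tableau this condition is easy to check. The following sketch (the variable names are ours) verifies it for the coefficients of the classical Runge–Kutta method.

```julia
# Consistency check Σ_{j=1}^{i-1} a_ij = c_i for the classical rk4 coefficients.
A = [0 0 0 0; 1/2 0 0 0; 0 1/2 0 0; 0 0 1 0]
c = [0, 1/2, 1/2, 1]
consistent = all(isapprox(c[i], sum(A[i, j] for j in 1:i-1)) for i in 2:4)
```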

In the following, in this section and the next, the Butcher tableaux of impor-
tant Runge–Kutta methods are given. The forward Euler method (see Sect.
9.4.1) is the simplest Runge–Kutta method and has the Butcher tableau

0
1.

It is the only consistent explicit one-stage Runge–Kutta method. (The backward
Euler method (see Sect. 9.4.1) is an implicit method and therefore not considered
here.)
A family of second-order two-stage Runge–Kutta methods has the Butcher
tableau
0
𝛼 𝛼
(1 − 1∕(2𝛼)) 1∕(2𝛼).
The case 𝛼 = 1∕2 is called the midpoint method. In the case 𝛼 = 1, we recover
the improved Euler method (see Sect. 9.4.3).
Finally, the classical Runge–Kutta or rk4 method has the Butcher tableau

0
1∕2 1∕2
1∕2 0 1∕2
1 0 0 1
1∕6 1∕3 1∕3 1∕6.

9.8 Adaptive Runge–Kutta Methods

The basic idea of adaptive Runge–Kutta methods was already discussed in


Sect. 9.5. In an adaptive Runge–Kutta method, two methods, one of order 𝑝
and one of order 𝑝 − 1, are used to obtain an estimate of the local truncation
error. In order to keep the computational cost as small as possible, the stages of
both Runge–Kutta methods are identical; only the linear combinations of the
𝑘𝑖 , i.e., the last lines in the Butcher tableaux, differ. The orders of the methods
are known and therefore the algorithm for adjusting the step size in Sect. 9.5 can
be used. In this way, the step size ℎ is always almost optimal.
More concretely, we can write the two Runge–Kutta methods as
𝑦∗𝑛+1 ∶= 𝑦𝑛 + ℎ ∑_{𝑖=1}^{𝑠} 𝑏𝑖∗ 𝑘𝑖 ,
𝑦𝑛+1 ∶= 𝑦𝑛 + ℎ ∑_{𝑖=1}^{𝑠} 𝑏𝑖 𝑘𝑖 ,
𝑡𝑛+1 ∶= 𝑡𝑛 + ℎ,

where the asterisk indicates the method with order 𝑝 − 1 and the other method
has order 𝑝. Then the local truncation error 𝑒𝑛+1 is estimated by
𝑒𝑛+1 ≈ 𝑒ᵉˢᵗ𝑛+1 ∶= 𝑦∗𝑛+1 − 𝑦𝑛+1 = ℎ ∑_{𝑖=1}^{𝑠} (𝑏𝑖∗ − 𝑏𝑖 )𝑘𝑖 = 𝑂(ℎ𝑝 ).

The local truncation error of the lower-order method is proportional to ℎ𝑝 , since


its order is 𝑝 − 1. Therefore the step size ℎ is multiplied by
(𝜖 ∕ 𝑒ᵉˢᵗ𝑛+1 )^(1∕𝑝)

in order to adjust the local truncation error (up or down) to the given error tol-
erance 𝜖.
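As a one-line sketch (the function name is ours), this step-size update rule reads:

```julia
# Step-size update: rescale h so that an error estimate e_est of a method
# whose local truncation error is O(h^p) is driven toward the tolerance eps.
new_step(h, e_est, eps, p) = h * (eps / e_est)^(1/p)
```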
The last two lines of the Butcher tableau
0
𝑐2 𝑎21
𝑐3 𝑎31 𝑎32
⋮ ⋮ ⋱
𝑐𝑠 𝑎𝑠1 𝑎𝑠2 ⋯ 𝑎𝑠,𝑠−1
𝑏1 𝑏2 ⋯ 𝑏𝑠−1 𝑏𝑠
𝑏1∗ 𝑏2∗ ⋯ 𝑏𝑠−1∗ 𝑏𝑠∗

of an adaptive Runge–Kutta method contain the coefficients 𝑏𝑖 and 𝑏𝑖∗ .


The simplest adaptive Runge–Kutta method combines the forward Euler
and the improved Euler methods as already discussed in Sect. 9.5. Its Butcher
tableau is
0
1 1
1∕2 1∕2
1 0.
A well-known example of an adaptive Runge–Kutta method is the so-called
Runge–Kutta–Fehlberg method [5], which uses two methods of orders four
and five. Its Butcher tableau is
0
1∕4 1∕4
3∕8 3∕32 9∕32
12∕13 1932∕2197 −7200∕2197 7296∕2197
1 439∕216 −8 3680∕513 −845∕4104
1∕2 −8∕27 2 −3544∕2565 1859∕4104 −11∕40
16∕135 0 6656∕12825 28561∕56430 −9∕50 2∕55
25∕216 0 1408∕2565 2197∕4104 −1∕5 0.

These ideas can immediately be applied to systems of ordinary differential equa-


tions (see Sect. 9.3, Problem 9.10, and Problem 9.11).

9.9 Implementation of Runge–Kutta Methods

After discussing the theory of Runge–Kutta methods, two different implementations
are presented in this section. Both implementations take a Butcher
tableau as input, but they differ in how they use it. NaN, short for “not a number,”
is a floating-point number, and we use it here for values that should never be
accessed.
const RK1 = [0 NaN;
             NaN 1]

const RK4 = [0   NaN NaN NaN NaN;
             1/2 1/2 NaN NaN NaN;
             1/2 0   1/2 NaN NaN;
             1   0   0   1   NaN;
             NaN 1/6 1/3 1/3 1/6]

The first implementation is a straightforward use of the general Runge–Kutta
equations (9.16) with a constant step size h.
function RK(T::Array{Float64, 2}, f::Function,
            t_start::Float64, t_end::Float64,
            y_start::Float64, h::Float64)::NamedTuple
    @assert T[1, 1] == 0
    @assert h > 0

    local A = T[1:end-1, 2:end-1]
    local b = T[end, 2:end]
    local c = T[1:end-1, 1]
    local s = size(b, 1)
    local N = ceil(Int, (t_end - t_start) / h) + 1

    @assert c[1] == 0
    @assert all(isapprox(c[i], sum(A[i, j] for j in 1:i-1))
                for i in 2:size(A, 1))

    local k = fill(NaN, s)
    local t = LinRange(t_start, t_start + (N-1)*h, N)
    local y = fill(NaN, N)
    y[1] = y_start

    for n in 1:N-1
        k[1] = f(t[n], y[n])
        for i in 2:s
            k[i] = f(t[n] + h * c[i],
                     y[n] + h * sum(A[i, j] * k[j] for j in 1:i-1))
        end
        y[n+1] = y[n] + h * sum(b[i] * k[i] for i in 1:s)
    end

    (t = t, y = y)
end

After some assertions to check the consistency of the input and after extracting
the matrix A and the vectors b and c from the Butcher tableau, the coefficient
vector k and the output vectors t and y are allocated and initialized. In the for
loop, the equation is solved using (9.16). (Unfortunately, sum does not work on
empty generators so that k[1] is defined separately.)
This implementation is a straightforward implementation of equations (9.16).
It is important to note, however, that the Butcher tableau and the number of
stages are known and constant. Therefore it seems wasteful in the inner loop
to use a for loop to iterate over the stages, to access the coefficients stored in a
matrix and in vectors, and to use sum to sum a few terms instead of writing out
the expressions explicitly. But writing out the expressions explicitly would have
to be done for every Butcher tableau, and we still want a general, yet efficient
code.
The solution is to not write a program to solve the equation, but to write a
program that writes programs to solve the equation. In other words, we will write
a macro (see Chap. 7) that generates the code specialized for a given Butcher
tableau and right-hand side.
The first version of the @RK macro is more straightforward and easier to understand,
while the second version is an optimized one. We start with the first
version, called @RK0.
macro RK0(T, f, t_start::Float64, t_end::Float64, y_start::Float64,
          h::Float64)
    local TT = eval(T)

    @assert isa(TT, Array{Float64, 2})
    @assert TT[1, 1] == 0
    @assert h > 0

    local A = TT[1:end-1, 2:end-1]
    local b = TT[end, 2:end]
    local c = TT[1:end-1, 1]
    local s = size(b, 1)
    local N = ceil(Int, (t_end - t_start) / h) + 1
    local k = [gensym("k") for i in 1:s]

    @assert c[1] == 0
    @assert all(isapprox(c[i], sum(A[i, j] for j in 1:i-1))
                for i in 2:size(A, 1))

    local y_update = :(0)
    for i in 1:s
        y_update = :($y_update + $(b[i]) * $(esc(k[i])))
    end

    local ks = :()
    for i in 1:s
        local sum = :(0)
        for j in 1:i-1
            sum = :($sum + $(h * A[i, j]) * $(esc(k[j])))
        end
        ks = :($ks; local $(esc(k[i])) = $f(t[n] + $(h * c[i]),
                                            y[n] + $sum))
    end

    quote
        local t = LinRange($t_start, $(t_start + (N-1)*h), $N)
        local y = fill(NaN, $N)
        y[1] = $y_start

        for n in 1:$(N-1)
            $ks
            y[n+1] = y[n] + $h * $y_update
        end

        ($(esc(:t)) = t, $(esc(:y)) = y)
    end
end

After some assertions and the definitions of local variables such as A, b, and c,
the first for loop builds the expression y_update that is used in the macro expansion
where y[n+1] is updated. The local variable y_update is initialized as
the expression 0 and then the terms 𝑏𝑖 𝑘𝑖 are added in the for loop. The vector k
already contains unique symbols generated by gensym, and therefore esc is used.
The expressions for the stages 𝑘𝑖 are built in a similar manner. The outer for
loop adds definitions of local variables to the initially empty expression ks. The
names of the local variables are the elements of the vector k. Each symbol stored
in k[i] has a unique name that starts with a k. The inner for loop adds terms to
the expression stored in sum, analogous to the generation of y_update.
All this work is performed during macro-expansion time. The advantage is
that the expressions contain the entries of the Butcher tableau and do not have
to access vectors or arrays during run time.
At the end of the macro, a quote expression returns the code that is executed.
First, the local variables t and y are initialized and will contain the results. Then,
in the for loop, the solution is calculated: the expression ks calculates the stages
and then the next element of y is calculated. Finally, t and y are returned. Because
of all the preparatory work before the quote expression, the whole quote
expression is rather short.
It is instructive to use @macroexpand1 to see what the code that solves the
equation looks like. You will notice that the entries of the Butcher tableau have
been substituted into the code.
The first version of the macro is already faster than the RK function, but some
improvements are still possible. This leads us to the second version, called @RK.
macro RK(T, f, t_start::Float64, t_end::Float64, y_start::Float64,
         h::Float64)
    local TT = eval(T)

    @assert isa(TT, Array{Float64, 2})
    @assert TT[1, 1] == 0
    @assert h > 0

    local A = TT[1:end-1, 2:end-1]
    local b = TT[end, 2:end]
    local c = TT[1:end-1, 1]
    local s = size(b, 1)
    local N = ceil(Int, (t_end - t_start) / h) + 1
    local k = [gensym("k") for i in 1:s]

    @assert c[1] == 0
    @assert all(isapprox(c[i], sum(A[i, j] for j in 1:i-1))
                for i in 2:size(A, 1))

    local y_update = :($(h * b[1]) * $(esc(k[1])))
    for i in 2:s
        y_update = :($y_update + $(h * b[i]) * $(esc(k[i])))
    end

    local ks = :(local $(esc(k[1])) = $f(t[n], y[n]))
    for i in 2:s
        local sum = :($(h * A[i, 1]) * $(esc(k[1])))
        for j in 2:i-1
            sum = :($sum + $(h * A[i, j]) * $(esc(k[j])))
        end
        ks = :($ks; local $(esc(k[i])) = $f(t[n] + $(h * c[i]),
                                            y[n] + $sum))
    end

    quote
        local t = LinRange($t_start, $(t_start + (N-1)*h), $N)
        local y = fill(NaN, $N)
        y[1] = $y_start

        for n in 1:$(N-1)
            $ks
            y[n+1] = y[n] + $y_update
        end

        ($(esc(:t)) = t, $(esc(:y)) = y)
    end
end

In this second version, subexpressions of the form 0 + … have been eliminated
and the value of ks has been streamlined. Furthermore, the multiplications
with h are already performed in the preamble of the macro and not at run
time. These changes result in faster code, and it is instructive to compare the two
macro expansions using @macroexpand1.
Finally in this section, we use a small benchmark to compare the speeds
and accuracies of the forward Euler method and the classical Runge–Kutta
method, both solved using the RK function and the @RK macro. The very simple
ode used here is the initial-value problem 𝑦 ′ (𝑡) = 𝑦(𝑡), 𝑦(0) = 1, whose solution
is the function 𝑦(𝑡) = e𝑡 , whose value at 𝑡 = 10 is e¹⁰.
function benchmark()
    local sol1 = @time RK(RK1, (t, y) -> y, 0.0, 10.0, 1.0, 1e-6)
    local sol2 = @time @RK(RK1, (t, y) -> y, 0.0, 10.0, 1.0, 1e-6)
    local sol3 = @time RK(RK4, (t, y) -> y, 0.0, 10.0, 1.0, 1e-6)
    local sol4 = @time @RK(RK4, (t, y) -> y, 0.0, 10.0, 1.0, 1e-6)

    @show sol1[:y][end]
    @show sol2[:y][end]
    @show sol3[:y][end]
    @show sol4[:y][end]
    @show exp(10.0)

    nothing
end

julia> benchmark()
  0.176908 seconds (20.00 M allocations: 534.058 MiB, 15.12% gc time)
  0.027769 seconds (2 allocations: 76.294 MiB)
  0.749787 seconds (80.00 M allocations: 1.863 GiB, 21.85% gc time)
  0.091455 seconds (2 allocations: 76.294 MiB, 4.83% gc time)
(sol1[:y])[end] = 22026.355662833706
(sol2[:y])[end] = 22026.355662833706
(sol3[:y])[end] = 22026.465794806456
(sol4[:y])[end] = 22026.465794806456
exp(10.0) = 22026.465794806718

The first-order method yields five correct digits, while in the fourth-order method
all digits but the last three are correct.
The results obtained by the function and the macros are identical. In this par-
ticular, but typical run, the macro is six to eight times faster than the function.
The allocations are also in favor of the macro implementation; the macro allo-
cates memory only twice, while the function performs 20 million allocations
(first-order method) or 80 million allocations (fourth-order method) and thus
spends significant time in garbage collection.
Problems 9.8, 9.9, 9.10, and 9.11 are concerned with the implementation of
the numerical methods presented in this chapter.
These ideas are applicable and useful also when implementing other numeri-
cal methods in a generic way while emphasizing performance. For example, spe-
cialized code for graphics processing units (gpu) can be written in this manner.

9.10 Julia Packages

The package DifferentialEquations contains a comprehensive suite for numerically
solving differential equations. It can solve odes, stochastic odes, random
odes, differential algebraic equations, delay differential equations, and discrete
stochastic equations.
The very simple ode used in the following example is the initial-value problem
𝑦 ′ (𝑡) = 𝑦(𝑡), 𝑦(0) = 1, whose solution is the function 𝑦(𝑡) = e𝑡 . We define
the right-hand side first and then the problem specifying the initial value and
the interval. An approximation is calculated by the solve function, which also
makes it possible to choose from many algorithms. Finally the solution is plotted.
julia> using DifferentialEquations
julia> f(u, p, t) = u
f (generic function with 1 method)
julia> problem = ODEProblem(f, 1.0, (0.0, 10.0))
...
julia> sol = solve(problem)
...
julia> using Plots
julia> plot(sol)

This pattern of defining problem and solution objects can be followed for all
equation types that are supported by the package. Each supported equation has
a problem type and a solution type that are understood by the generic functions
solve and plot.
Finally, it is mentioned that the package OrdinaryDiffEq is a component
package of DifferentialEquations and holds the solvers and utilities for odes.
It is completely independent and usable on its own, which is expedient when a
light-weight package is sufficient.

9.11 Bibliographical Remarks

Both the theory of ordinary differential equations and their numerical methods
are large fields. A very accessible and comprehensive text book on differential
equations is [2]. A detailed treatment of numerical methods for ordinary differ-
ential equations can be found in [3].

Problems

9.1 (Modeling) Find an ode in a subject of your interest and derive it similarly
to the example in Sect. 9.1.

9.2 (Lipschitz condition) Suppose that 𝜕𝑓∕𝜕𝑦 is continuous in the rectan-


gle 𝐷 ∶= [−𝑎, 𝑎] × [−𝑏, 𝑏]. Prove that

∃𝐿 ∈ ℝ+ ∶ ∀𝑡 ∈ [−𝑎, 𝑎] ∶ ∀𝑦1 , 𝑦2 ∈ [−𝑏, 𝑏] ∶ |𝑓(𝑡, 𝑦1 ) − 𝑓(𝑡, 𝑦2 )| ≤ 𝐿|𝑦1 − 𝑦2 |

holds.
Hint: The Lipschitz constant 𝐿 is the maximum value of |𝜕𝑓∕𝜕𝑦| in 𝐷. Apply
the mean-value theorem to 𝑓 as a function of 𝑦 only.

9.3 (Interchanging taking the limit and integration) * Suppose that the se-
quence ⟨𝑓𝑛 ⟩ of Riemann integrable functions defined on a compact interval 𝐼
converges uniformly to 𝑓. Show that then the limit function 𝑓 is Riemann inte-
grable and that the equality

lim_{𝑛→∞} ∫_𝐼 𝑓𝑛 = ∫_𝐼 lim_{𝑛→∞} 𝑓𝑛 = ∫_𝐼 𝑓

holds.

9.4 (Picard and Banach) * Show that the operator given by the Picard itera-
tion (9.4) is a contraction and use the Banach fixed-point theorem to show The-
orem 9.1.

9.5 (Variation of step size) Implement variation of step size as described in


Sect. 9.5 for the Euler and improved Euler methods. Compare the accuracies
and the computational expenses of the Euler method and of the method with
variable step size in an example whose exact solution is known.

9.6 (Local truncation error of the rk4 method) * Show Theorem 9.7 by fol-
lowing these steps.

1. Calculate the Taylor expansions of 𝑘2 , 𝑘3 , and 𝑘4 in (9.16) based on 𝑘1 up to


𝑂(ℎ4 ). The coefficients are treated as unknowns. (The expressions become
lengthy and contain partial derivatives of 𝑓 up to third order.)
2. Substitute the expressions for 𝑘1 , 𝑘2 , 𝑘3 , and 𝑘4 into the rk4 method (9.15).
3. Write the Taylor expansion

𝑦(𝑡𝑛 + ℎ) = 𝑦(𝑡𝑛 ) + ℎ d𝑦(𝑡𝑛 )∕d𝑡 + (ℎ²∕2!) d²𝑦(𝑡𝑛 )∕d𝑡² + ⋯
= 𝑦(𝑡𝑛 ) + ℎ𝑓(𝑡𝑛 , 𝑦(𝑡𝑛 )) + (ℎ²∕2!) d𝑓(𝑡𝑛 , 𝑦(𝑡𝑛 ))∕d𝑡 + (ℎ³∕3!) d²𝑓(𝑡𝑛 , 𝑦(𝑡𝑛 ))∕d𝑡²
+ (ℎ⁴∕4!) d³𝑓(𝑡𝑛 , 𝑦(𝑡𝑛 ))∕d𝑡³ + 𝑂(ℎ⁵)
of the solution 𝑦 of the differential equation in terms of partial derivatives
of 𝑓 (up to third order).
4. Compare the two Taylor expansions to find a system of algebraic equations
for the unknown coefficients in (9.16).

9.7 (Global truncation error of the rk4 method) * Show Theorem 9.8.

9.8 (Adaptive Runge–Kutta methods) * Extend (a) the function and (b) the
macro to implement adaptive Runge–Kutta methods as described in Sect. 9.8.

9.9 (Plot and compare) Choose the solution of an initial-value problem first
and then calculate the right-hand side 𝑓. Plot and compare the numerical so-
lutions with the exact solution for different Runge–Kutta methods, for differ-
ent step sizes, and using adaptive Runge–Kutta methods (building on Prob-
lem 9.8).

9.10 (Systems of equations) * Write a function to numerically solve systems of


first-order odes.

9.11 (Systems of equations) * Write a macro to numerically solve systems of


first-order odes.

References

1. Bogacki, P., Shampine, L.: A 3(2) pair of Runge–Kutta formulas. Appl. Math. Lett. 2(4),
321–325 (1989)
2. Boyce, W., DiPrima, R.: Elementary Differential Equations and Boundary Value Problems,
9th edn. John Wiley and Sons, Inc. (2009)
3. Butcher, J.: Numerical Methods for Ordinary Differential Equations, 2nd edn. John Wiley &
Sons, Ltd., Chichester, England (2008)
4. Dormand, J., Prince, P.: A family of embedded Runge–Kutta formulae. J. Comp. Appl. Math.
6(1), 19–26 (1980)
5. Fehlberg, E.: Klassische Runge-Kutta-Formeln vierter und niedrigerer Ordnung mit
Schrittweiten-Kontrolle und ihre Anwendung auf Wärmeleitungsprobleme. Computing
6(1–2), 61–71 (1970)
6. Risch, R.: The problem of integration in finite terms. Trans. Amer. Math. Soc. 139, 167–189
(1969)
7. Risch, R.: The solution of the problem of integration in finite terms. Bull. Amer. Math. Soc.
76(3), 605–608 (1970)
8. Shampine, L., Reichelt, M.: The MATLAB ODE suite. SIAM Journal on Scientific Computing
18(1), 1–22 (1997)
Chapter 10
Partial-Differential Equations

Hydrodynamics procreated complex analysis, partial differential equations,
Lie groups and algebra theory, cohomology theory, and scientific computing.
—Vladimir Arnold

Abstract Partial-differential equations are equations that contain partial derivatives
of the unknown function. The field of partial-differential equations is large
and diverse, as these equations describe many different phenomena and sys-
tems, and hence many theories for the existence and uniqueness of their solu-
tions as well as many numerical methods have been developed. In this chap-
ter, we explain fundamental concepts and numerical methods using the exam-
ple of three important classes of partial-differential equations, namely elliptic,
parabolic, and hyperbolic. These types of equations describe diverse physical
phenomena such as diffusion, thermal conduction, electromagnetism, and wave
propagation. Finite differences, finite volumes, and finite elements are used to
calculate approximations of the solutions.

10.1 Introduction

While ordinary differential equations are equations that contain derivatives of


the unknown univariate function 𝑦(𝑥) (or 𝑦(𝑡)) with respect to its only indepen-
dent variable 𝑥 (or 𝑡), the unknown function 𝑢 in a partial-differential equation
(pde) is multivariate, and a pde contains derivatives of the unknown with re-
spect to more than one independent variable. The unknown function is com-
monly denoted by 𝑢, and the independent variables are often called 𝑡, 𝑥, 𝑦, and 𝑧,
or 𝑡 and 𝐱 with 𝐱 ∶= (𝑥, 𝑦, 𝑧) or 𝐱 ∶= (𝑥1 , 𝑥2 , 𝑥3 ). Once the context has been
established, it is common to leave out the independent variables entirely to sim-
plify the notation.

© Springer Nature Switzerland AG 2022 257


C. Heitzinger, Algorithms with JULIA,
https://doi.org/10.1007/978-3-031-16560-3_10

In the derivatives, one often writes the independent variable as an index as in

𝑢𝑥 = 𝜕𝑢∕𝜕𝑥, 𝑢𝑥𝑦 = 𝜕²𝑢∕𝜕𝑥𝜕𝑦, 𝑢𝑥𝑖 𝑥𝑗 = 𝜕²𝑢∕𝜕𝑥𝑖 𝜕𝑥𝑗

to simplify notation.
The order of a pde is the order of the highest derivative that occurs in the
equation. pdes of order higher than second are much rarer than those of first
and second order.
What do the elliptic, parabolic, and hyperbolic equations look like? Second-
order linear pdes are classified into three types: elliptic, parabolic, and hyper-
bolic equations. All second-order linear pdes in two independent variables 𝑥
and 𝑦 can be written in the form

𝐴𝑢𝑥𝑥 + 𝐵𝑢𝑥𝑦 + 𝐶𝑢𝑦𝑦 + 𝐷𝑢𝑥 + 𝐸𝑢𝑦 + 𝐹𝑢 + 𝐺 = 0 ∀(𝑥, 𝑦) ∈ 𝑈,

where we have already dropped the dependence of the unknown 𝑢 = 𝑢(𝑥, 𝑦), of
its partial derivatives, and of the coefficient functions 𝐴 = 𝐴(𝑥, 𝑦) to 𝐺 = 𝐺(𝑥, 𝑦)
on the independent variables 𝑥 and 𝑦 to shorten the notation. Note that the terms
that contain the unknown 𝑢 or its derivatives are all linear, as they must be in a
linear equation. The domain 𝑈 ⊂ ℝ2 is the domain where the equation holds.
To complete the specification of a pde that can be solved, it is also necessary to
provide boundary conditions, initial conditions, or both. The types and amounts
of such conditions depend on the equation type.
A second-order linear pde is called elliptic if the condition

𝐵(𝑥, 𝑦)2 − 4𝐴(𝑥, 𝑦)𝐶(𝑥, 𝑦) < 0 ∀(𝑥, 𝑦) ∈ 𝑈

holds, it is called parabolic if the condition

𝐵(𝑥, 𝑦)2 − 4𝐴(𝑥, 𝑦)𝐶(𝑥, 𝑦) = 0 ∀(𝑥, 𝑦) ∈ 𝑈

holds, and it is called hyperbolic – you guessed it – if the condition

𝐵(𝑥, 𝑦)2 − 4𝐴(𝑥, 𝑦)𝐶(𝑥, 𝑦) > 0 ∀(𝑥, 𝑦) ∈ 𝑈

holds.
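These three conditions can be evaluated pointwise. The following sketch (the function name is ours) classifies a second-order linear pde at a point from the three leading coefficients.

```julia
# Classify a second-order linear PDE at a point by the sign of B² − 4AC.
function classify(A, B, C)
    d = B^2 - 4*A*C
    d < 0 ? :elliptic : (d == 0 ? :parabolic : :hyperbolic)
end
```

For the Laplace equation 𝑢𝑥𝑥 + 𝑢𝑦𝑦 = 0, for instance, we have 𝐴 = 𝐶 = 1 and 𝐵 = 0, and the function returns :elliptic.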
This naming convention is an analogy to conic sections. If we replace 𝑢𝑥𝑥 by
𝑥2 , 𝑢𝑥𝑦 by 𝑥𝑦, and 𝑢𝑦𝑦 by 𝑦 2 , we see that the same three conditions hold for
ellipses 𝑥2 + 𝑦 2 = 𝑎2 , parabolas 𝑦 2 = 4𝑎𝑥, and hyperbolas 𝑥2 ∕𝑎2 − 𝑦 2 ∕𝑏2 = 1,
respectively. More precisely, after replacing 𝑢𝑥𝑥 by 𝑥 2 , 𝑢𝑥𝑦 by 𝑥𝑦, and 𝑢𝑦𝑦 by
𝑦 2 and only considering the first three terms, which are of second order in 𝑥
and 𝑦, we find the equation 𝐴𝑥 2 + 𝐵𝑥𝑦 + 𝐶𝑦 2 = 0. Dividing it by 𝑦 (or 𝑥) and
defining 𝑧 ∶= 𝑥∕𝑦 (or 𝑧 ∶= 𝑦∕𝑥) yields the second-order polynomial equation
𝐴𝑧2 + 𝐵𝑧 + 𝐶 = 0 (or 𝐶𝑧2 + 𝐵𝑧 + 𝐴 = 0), whose discriminant is the expression
𝐵2 − 4𝐴𝐶.

For the purpose of classifying second-order pdes, only the second-order terms
responsible for the expression 𝐵2 − 4𝐴𝐶 are important.
More generally, in higher dimensions, when there are 𝑑 independent vari-
ables 𝑥1 , … , 𝑥𝑑 and 𝐷 ⊂ ℝ𝑑 , the general second-order linear pde has the form

∑_{𝑖=1}^{𝑑} ∑_{𝑗=1}^{𝑑} 𝑎𝑖𝑗 𝑢𝑥𝑖 𝑥𝑗 + lower-order terms = 0. (10.1)

Elliptic, parabolic, and hyperbolic equations are then characterized by the prop-
erties of the matrix 𝐴 whose entries are the coefficients 𝑎𝑖𝑗 .
Various methods for solving pdes analytically have been developed. They in-
clude separation of variables, the method of characteristics, integral transforms,
change of variables, use of fundamental solutions, the superposition principle,
and Lie groups. As we are interested in computational methods in this book, we
discuss the three main methods for solving pdes numerically: finite differences,
finite volumes, and finite elements.
But before doing so, we take a closer look at elliptic equations in the next
section, including their theory, before we briefly discuss parabolic and hyperbolic
equations in Sections 10.3 and 10.4. We focus on elliptic equations in this chapter,
since they are amenable to all three methods, finite differences, finite volumes,
and finite elements, which are discussed in the subsequent sections, but the
three methods are generally applicable to all kinds of pdes.

10.2 Elliptic Equations

In this section we take a closer look at elliptic equations. Three physical phenom-
ena, namely electrostatics, diffusion processes, and thermal conduction, and
how elliptic equations arise in these three applications are presented. The final
part of this section is more advanced and summarizes the theory of weak solu-
tions of elliptic equations.
We start with convenient notation first. When formulating pdes, the so-called
nabla operator
∇ ∶= (𝜕∕𝜕𝑥1 , … , 𝜕∕𝜕𝑥𝑑 )ᵀ
is commonly used. It is used to write the gradient of a scalar multivariate func-
tion 𝑓 ∶ ℝ𝑑 → ℝ as
∇𝑓 = (𝜕𝑓∕𝜕𝑥1 , … , 𝜕𝑓∕𝜕𝑥𝑑 )ᵀ ,

the divergence of a vector-valued multivariate function 𝐟 ∶ ℝ𝑑 → ℝ𝑑 as

∇ ⋅ 𝐟 = ∑_{𝑖=1}^{𝑑} 𝜕𝑓𝑖 ∕𝜕𝑥𝑖 ,

and the rotation of a vector-valued multivariate function 𝐟 ∶ ℝ3 → ℝ3 as


∇ × 𝐟 = (𝜕𝑓3 ∕𝜕𝑥2 − 𝜕𝑓2 ∕𝜕𝑥3 , 𝜕𝑓1 ∕𝜕𝑥3 − 𝜕𝑓3 ∕𝜕𝑥1 , 𝜕𝑓2 ∕𝜕𝑥1 − 𝜕𝑓1 ∕𝜕𝑥2 )ᵀ .

Using the gradient, the divergence, and a matrix-valued function 𝐴 ∶ ℝ𝑑 → ℝ𝑑×𝑑 ,
we can write second- and first-order derivatives of a function 𝑢 ∶ ℝ𝑑 → ℝ
that appear in the general form (10.1) in the compact form

∇ ⋅ (𝐴(𝐱)∇𝑢(𝐱)) = ∑_{𝑖=1}^{𝑑} ∑_{𝑗=1}^{𝑑} ( 𝑎𝑖𝑗 𝑢𝑥𝑖 𝑥𝑗 + (𝜕𝑎𝑖𝑗 ∕𝜕𝑥𝑖 ) 𝑢𝑥𝑗 ) (10.2)

if the matrix-valued function 𝐴 is smooth enough (see Problem 10.1). With these
preliminaries, we can write elliptic equations in convenient and compact forms.

10.2.1 Three Physical Phenomena

The first example of an elliptic equation is the derivation of the Poisson equa-
tion for electrostatic problems from the Maxwell equations, which are the fun-
damental equations for electromagnetism. An alternative, but more limited re-
lationship between Coulomb’s law and elliptic equations is also discussed.
The second and third example are diffusion and thermal conduction. Al-
though these are also elliptic model equations, additional modeling assumptions
are necessary, and hence these equations are not as fundamental as the first ex-
ample and variants are possible.

10.2.1.1 Electrostatics and Derivation from the Maxwell Equations

We derive the Poisson equation from the Maxwell equations. The Poisson equa-
tion is contained in the Maxwell equations and it is retrieved by considering the
electrostatic case. An electrostatic system is a system whose magnetic field does
not vary with time.
The Maxwell equations are the four pdes

∇⋅𝐃=𝜌 (Gauss’s law),


∇⋅𝐁=0 (Gauss’s law for magnetism),
𝜕
∇×𝐄=− 𝐁 (Faraday’s law of induction),
𝜕𝑡
𝜕
∇×𝐇=𝐉+ 𝐃 (Ampère’s circuital law)
𝜕𝑡
in three spatial dimensions and time, where 𝐄(𝑡, 𝐱) is the electric field, 𝐁(𝑡, 𝐱) is
the magnetic field, 𝐃(𝑡, 𝐱) is the electric displacement field, 𝐇(𝑡, 𝐱) is the magne-
tizing field, 𝜌(𝐱) the charge density, and 𝐉(𝑡, 𝐱) is the (applied) current density.
Here bold letters (whether upper- or lowercase) denote vector-valued functions. The electric displacement field 𝐃 and the magnetizing field 𝐇
satisfy the constitutive relations

𝐃(𝑡, 𝐱) = 𝜖(𝐱)𝐄(𝑡, 𝐱),


𝐇(𝑡, 𝐱) = 𝜇(𝐱)−1 𝐁(𝑡, 𝐱),

where 𝜖(𝐱) is the permittivity and 𝜇(𝐱) is the permeability, which are both matrix-
valued functions from ℝ3 to ℝ3×3 .
The fields 𝐄 and 𝐇 satisfy the physical interface or jump conditions

[𝐄 × 𝐧] = 𝟎, [𝜖𝐄 ⋅ 𝐧] = 𝜌|Γ ,
[𝐇 × 𝐧] = 𝟎, [𝜇𝐇 ⋅ 𝐧] = 0

across an interface Γ. The interface Γ is the interface between two domains Ω1


and Ω2 consisting of different materials, and the vector 𝐧 is the outward unit
normal vector of 𝜕Ω1 . The jump [𝑓] of any function 𝑓 across the interface Γ is
defined as
[𝑓] ∶= 𝑓|Γ∩Ω̄ 2 − 𝑓|Γ∩Ω̄ 1 ,
i.e., it is the difference between two one-sided limits of 𝑓, the first within Ω2 and
the second within Ω1 . Furthermore, 𝜌|Γ ∶ Γ → ℝ is the restriction of the charge
density 𝜌 to the interface Γ.
If the magnetic field 𝐁 is constant with respect to time, the rotation ∇ × 𝐄 vanishes by Faraday’s law of induction, and therefore the electric field 𝐄 is conservative and can be expressed as the negative gradient of a scalar potential 𝜙 ∶ ℝ3 → ℝ, i.e.,

𝐄 = −∇𝜙,
𝐃 = −𝜖∇𝜙.

The minus sign only serves cosmetic purposes here. Substitution of the last equation into the first of the Maxwell equations, namely Gauss’s law, now yields the Poisson equation
−∇ ⋅ (𝜖∇𝜙) = 𝜌. (10.3)

Here 𝜖 is a matrix-valued function and 𝜌 is a scalar function, as already noted


above, and it becomes clear why the gradient and the divergence are so conve-
nient to express these equations.

10.2.1.2 Electrostatics and Derivation from Coulomb’s Law

We know that Coulomb’s law holds for electric fields in homogeneous materials
with no magnetic field present, and therefore the question naturally arises how
it relates to the Poisson equation under these two assumptions. It should be pos-
sible to arrive at the Poisson equation from Coulomb’s law, and we now discuss
how this is indeed possible. The derivation from Coulomb’s law governs only
the electrostatic case; no magnetic field ever enters the picture. A homogeneous
material means that the permittivity 𝜖0 ∈ ℝ is simply a real constant.
According to Coulomb’s law, the force 𝐅𝑖𝑗 ∈ ℝ3 that a particle at position 𝐫𝑗 ∈ ℝ3 with charge 𝑞𝑗 ∈ ℝ exerts on a particle at position 𝐫𝑖 with charge 𝑞𝑖 is given by

𝐅𝑖𝑗 = ( 𝑞𝑖 𝑞𝑗 ∕(4𝜋𝜖0 ) ) (𝐫𝑖 − 𝐫𝑗 )∕|𝐫𝑖 − 𝐫𝑗 |3 .

It implies that the force is proportional to 1∕|𝐫𝑖 − 𝐫𝑗 |2 , i.e., to one over the squared distance between the particles. Coulomb’s law is an atomistic model, since the charges are point-like.
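In code, Coulomb’s law is a one-liner. The following sketch (SI units; the helper name `coulomb` is ours) evaluates the force and illustrates the inverse-square dependence.

```julia
using LinearAlgebra

# Force that a point charge qj at position rj exerts on a point charge
# qi at position ri, in SI units; ε0 is the vacuum permittivity.
const ε0 = 8.8541878128e-12   # F/m
coulomb(qi, qj, ri, rj) =
    qi * qj / (4π * ε0) * (ri .- rj) ./ norm(ri .- rj)^3

# two charges of 1 C at distance 1 m repel with ≈ 8.99e9 N
norm(coulomb(1.0, 1.0, [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
```

Doubling the distance divides the magnitude of the force by four, as the 1∕|𝐫𝑖 − 𝐫𝑗 |2 dependence requires.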
In a continuum model, on the other hand, all the point charges 𝑞𝑗 give rise to
a charge density 𝜌 ∶ ℝ3 → ℝ. The force 𝐅 that acts on a particle with charge 𝑞
at position 𝐱 can be written as
𝐅 = 𝑞𝐄
using the electric field 𝐄, which is obtained from Coulomb’s law by integration
over all other charges 𝑞𝑗 at positions 𝐲 as

𝐄(𝐱) ∶= (1∕4𝜋𝜖0 ) ∭ 𝜌(𝐲)(𝐱 − 𝐲)∕|𝐱 − 𝐲|3 d𝐲.

Next we check by a simple calculation that the electric field is irrotational, i.e.,
that
∇×𝐄=𝟎 (10.4)
holds (see Problem 10.2). Hence the electric field 𝐄 is a gradient field again and
thus can be written as
𝐄 = −∇𝜙 (10.5)
as the negative gradient of a potential, where 𝜙 is called the electrostatic potential, and the minus sign serves a cosmetic purpose that the next calculation reveals. It is
straightforward to check by differentiating that

𝜙(𝐱) ∶= (1∕4𝜋𝜖0 ) ∭ 𝜌(𝐲)∕|𝐱 − 𝐲| d𝐲 (10.6)

yields the field 𝐄 above (see Problem 10.3).


Since the function
𝐺(𝐱) ∶= −1∕(4𝜋|𝐱|) (10.7)
is a fundamental solution or a Green function of the Laplace operator

Δ ∶= ∇ ⋅ ∇

on the whole space ℝ3 , i.e.,

Δ𝐺(𝐱 − 𝐲) = 𝛿(𝐱 − 𝐲) ∀𝐱 ∈ ℝ3 ∀𝐲 ∈ ℝ3 (10.8)

(see Problem 10.4), integration of the last equation against the charge density 𝜌
yields

∭ℝ3 Δ𝐺(𝐱 − 𝐲)𝜌(𝐲)d𝐲 = ∭ℝ3 𝛿(𝐱 − 𝐲)𝜌(𝐲)d𝐲 = 𝜌(𝐱) ∀𝐱 ∈ ℝ3

and further
Δ ∭ℝ3 𝐺(𝐱 − 𝐲)𝜌(𝐲)d𝐲 = Δ(−𝜖0 𝜙(𝐱)) = 𝜌(𝐱) ∀𝐱 ∈ ℝ3 .

In other words, the potential 𝜙 given by (10.6) solves the Poisson equation

−𝜖0 Δ𝜙 = 𝜌.

This equation is a special case of (10.3), as here the permittivity is a real con-
stant 𝜖0 and could be pulled out of the divergence.
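The key property (10.8), that 𝐺 is harmonic away from the origin, can be checked numerically. The sketch below (our own helper, with an ad-hoc step size ℎ) applies a central-difference Laplacian to 𝐺 at a point 𝐱 ≠ 𝟎.

```julia
using LinearAlgebra

# Sanity check: the Green function G(x) = -1/(4π|x|) from (10.7) is
# harmonic away from the origin, i.e. ΔG(x) = 0 for x ≠ 0.
G(x) = -1 / (4π * norm(x))

# central-difference approximation of the Laplacian of g at x
function laplacian_fd(g, x; h=1e-4)
    s = 0.0
    for i in 1:length(x)
        e = zeros(length(x))
        e[i] = h
        s += (g(x .+ e) - 2 * g(x) + g(x .- e)) / h^2
    end
    s
end

laplacian_fd(G, [1.0, 0.5, -0.7])   # ≈ 0
```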

10.2.1.3 Diffusion

We can describe any stationary diffusion process by a pde using two considera-
tions. The first is fundamental, while the second one requires a physical model.
Transient diffusion processes lead to parabolic equations (see Sect. 10.3).
The first step is to note that the total flux of particles out of any subdomain
Ω ⊂ 𝐷 of the domain 𝐷 equals the amount of particles produced by all sources
in Ω due to mass conservation. This yields

∯𝜕Ω 𝐧 ⋅ 𝐉 d𝑆 = ∭Ω 𝑓 d𝑉.

In the surface integral on the left-hand side, the flux density of the particles is
denoted by 𝐉 and the 𝐧 are outward unit normal vectors. The volume integral on
the right-hand side is over the function 𝑓 that describes the sources.
Using the divergence theorem, the boundary integral on the left-hand side
becomes a volume integral, and hence the equation becomes

∭Ω ∇ ⋅ 𝐉 d𝑉 = ∭Ω 𝑓 d𝑉.

It holds true for all subdomains Ω. After assuming that the integrand is com-
pactly supported and smooth, we can therefore apply the fundamental lemma
of variational calculus to the equation

∭Ω (∇ ⋅ 𝐉 − 𝑓)d𝑉 = 0 ∀Ω ⊂ 𝐷

to find
∇ ⋅ 𝐉(𝐱) = 𝑓(𝐱) ∀𝐱 ∈ 𝐷.
There are various versions of the fundamental lemma of variational calculus;
two are recorded in the following.
Theorem 10.1 (fundamental lemma of variational calculus) 1. Version for
continuous functions: Suppose that Ω ⊂ ℝ𝑑 is an open set and that a continuous
multivariate function 𝑓 ∶ Ω → ℝ satisfies the equation

∭Ω 𝑓(𝐱)ℎ(𝐱)d𝐱 = 0

for all compactly supported smooth functions ℎ on Ω; then 𝑓 is identically equal to


zero.
2. Version for discontinuous functions: Suppose that Ω ⊂ ℝ𝑑 is an open set and
that a multivariate function 𝑓 ∈ 𝐿2 (Ω) satisfies the equation

∭Ω 𝑓(𝐱)ℎ(𝐱)d𝐱 = 0

for all compactly supported smooth functions ℎ on Ω; then 𝑓 = 0 in 𝐿2 , i.e., 𝑓 is


equal to zero almost everywhere.
In the second step, we must specify how the flux density 𝐉 relates to the un-
known concentration 𝑢. The most common and basic physical model is Fick’s
first law
𝐉 ∶= −𝐷∇𝑢,
where the diffusion coefficient 𝐷 ∶ ℝ𝑑 → ℝ𝑑×𝑑 is generally a matrix-valued func-
tion. This physical model results in the linear elliptic equation

−∇ ⋅ (𝐷∇𝑢) = 𝑓.

Many other physical models for the relationship between the flux density 𝐉 and
the unknown 𝑢 are known to be useful. For example, diffusion in porous media
is governed by
𝐉 ∶= −𝐷∇𝑢𝑚 ,
where 𝑚 ∈ ℝ is a constant with 𝑚 > 0, usually 𝑚 > 1.

10.2.1.4 Thermal Conduction

The physical model for thermal conduction is the law of heat conduction or
Fourier’s law, which states that the rate of heat transfer through a material is
proportional to the negative gradient of the temperature and to the area orthog-
onal to the gradient through which the heat flows.
The derivation proceeds analogously to the modeling of diffusion processes
in Sect. 10.2.1.3. Now the vector 𝐉 is the heat flux density and it is, by the law of
heat conduction, equal to
𝐉 ∶= −𝑘∇𝑢,
where the thermal conductivity 𝑘 ∶ ℝ𝑑 → ℝ𝑑×𝑑 is generally a matrix-valued
function and 𝑢 denotes the unknown temperature. This results in the heat equa-
tion
−∇ ⋅ (𝑘∇𝑢) = 𝑓.
However, the thermal conductivity 𝑘 of a material generally varies with temper-
ature, which gives rise to nonlinear equations in which 𝑘 is a function of the
unknown temperature.
It is often instructive to check that the physical units of the variables and con-
stants in an equation and its derivation are consistent (see Problem 10.5). To
check the consistency of the units in the heat equation, we note that the un-
known temperature 𝑢 has unit [𝑢] = K, the thermal conductivity 𝑘 has unit
[𝑘] = W ⋅ m−1 ⋅ K−1 , and the source term 𝑓 has unit [𝑓] = W ⋅ m−3 . In the equa-
tion, we thus have [𝐉] = [𝑘∇𝑢] = [𝑘][∇𝑢] = W ⋅ m−2 and [∇ ⋅ (𝑘∇𝑢)] = W ⋅ m−3 .
The unit of [∇ ⋅ (𝑘∇𝑢)] on the left-hand side of the equation is consistent with
the unit of the source term 𝑓 on the right-hand side.
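This unit bookkeeping can be mechanized. The toy sketch below (the helper names `unit`, `mul`, and `ddx` are ours, chosen just for this check) stores a unit as a tuple of exponents of (W, m, K); multiplying quantities adds exponents, and each spatial derivative ∇ or ∇ ⋅ multiplies the unit by m⁻¹.

```julia
# A toy unit check for the heat equation: a unit is a named tuple of
# exponents of (W, m, K); multiplication adds exponents, and each spatial
# derivative contributes a factor m^-1.
unit(W, m, K) = (W = W, m = m, K = K)
mul(a, b) = unit(a.W + b.W, a.m + b.m, a.K + b.K)
ddx(a) = mul(a, unit(0, -1, 0))        # one spatial derivative: m^-1

u_unit = unit(0, 0, 1)                 # temperature u: K
k_unit = unit(1, -1, -1)               # conductivity k: W·m^-1·K^-1
J_unit = mul(k_unit, ddx(u_unit))      # J = k∇u: W·m^-2
f_unit = unit(1, -3, 0)                # source f: W·m^-3
ddx(J_unit) == f_unit                  # true: ∇⋅(k∇u) and f have equal units
```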

10.2.2 Boundary Conditions

In general, equations may have any number of solutions, ranging from no solu-
tion at all, a unique solution, and a finite number of solutions to infinitely many
solutions. This is true, e.g., for linear systems of equations, for polynomial equa-
tions, and for Diophantine equations, and it is also true for differential equations.
It is usually desirable that a pde has a unique solution, because we expect
the equation to be a full and unique description of the problem or system under
consideration. Therefore the question naturally arises whether a given pde has

a unique solution. If we can answer this question positively, our confidence that
the pde is a useful model is much increased. This knowledge is also useful when
we aim to calculate a numerical approximation of a solution; it stands to reason
that the existence of a unique solution is beneficial for any numerical algorithm.
The existence and possibly uniqueness of a solution is not a property of just
the equation that holds for all points in the interior of the domain, but a full
problem description must be supplemented with initial and/or boundary con-
ditions and a specification of the set of functions from which the solution is
sought. Again, this should not come as a surprise; we know from algebra that,
e.g., the polynomial equation 𝑥 2 = 2 has no solution in the rational numbers ℚ
but a unique solution in the real numbers ℝ, and that the polynomial equation
𝑥2 = −1 has no solution in the real numbers ℝ but a unique solution in the
complex numbers ℂ.
Analogously, there are different types of solutions of pdes. A classical solution
is a solution that can be substituted into the equation and that then satisfies the
equation pointwise. But there are other, weaker, types of function like objects
that can be interpreted as solutions of differential equations. A glimpse of the
theory of elliptic equations is given in the next section, Sect. 10.2.3, and many
textbooks have been written on these questions [1].
The two major types of conditions are initial conditions and boundary condi-
tions. In transient problems, initial conditions are usually specified, and bound-
ary conditions are usually specified in both stationary and transient problems.
Initial conditions give the solution at the beginning of the time interval, and
boundary conditions hold on the boundary of the spatial domain.
For elliptic equations, there are four major types of boundary conditions:
• Dirichlet boundary conditions specify the unknown 𝑢 on all points of the
boundary 𝜕𝑈 of a domain 𝑈 or only on the Dirichlet part 𝜕𝑈𝐷 ⊂ 𝜕𝑈 of the
boundary 𝜕𝑈 as in the example

𝑢(𝐱) = 𝑢𝐷 (𝐱) ∀𝐱 ∈ 𝜕𝑈𝐷 ,

where 𝑢𝐷 is a given function.


For example, in electrostatics, this means that a contact is held at a fixed
voltage; in mechanics, this means that a beam is held at a fixed position;
in thermodynamics, this means that the surface of a body is held at a fixed
temperature; and in fluid dynamics, such a no-slip condition means that a
viscous fluid has zero velocity relative to a solid boundary.
• Neumann boundary conditions specify the directional derivative of the un-
known 𝑢 with respect to the outward unit normal vectors 𝐧 of the boundary
𝜕𝑈 on all points of the boundary 𝜕𝑈 of a domain 𝑈 or only on the Neumann
part 𝜕𝑈𝑁 ⊂ 𝜕𝑈 of the boundary 𝜕𝑈 as in the example

𝜕𝑢∕𝜕𝐧 = 𝐧 ⋅ ∇𝑢(𝐱) = 𝑢𝑁 (𝐱) ∀𝐱 ∈ 𝜕𝑈𝑁 ,
where 𝑢𝑁 is a given function.

In electrostatics, this means that a constant (often vanishing) field is ap-


plied; and in thermodynamics, this means that the heat flux from a surface
is known.
• Mixed boundary conditions are different boundary conditions prescribed
on disjoint parts of the boundary. The most common example is the pre-
scription of Dirichlet boundary conditions on the Dirichlet part 𝜕𝑈𝐷 and of
Neumann boundary conditions on the Neumann part 𝜕𝑈𝑁 of the boundary
𝜕𝑈 = 𝜕𝑈𝐷 ∪ 𝜕𝑈𝑁 .
• Robin boundary conditions

𝑎𝑢 + 𝑏 𝜕𝑢∕𝜕𝐧 = 𝑢𝑅 (𝐱) ∀𝐱 ∈ 𝜕𝑈,
where 𝑢𝑅 is a given function, are linear combinations of Dirichlet and Neu-
mann boundary conditions.
They often appear in Sturm–Liouville problems. In convection-diffusion
equations, they can act as insulating boundary conditions that ensure that
the sum of the convective and diffusive fluxes vanishes. In electromagnetism,
they are called impedance boundary conditions.
Boundary conditions whose given function on the right-hand side vanishes are
called homogeneous boundary conditions; otherwise they are called inhomoge-
neous.
Prescribing Dirichlet boundary conditions or mixed Dirichlet/Neumann
boundary conditions to an elliptic pde yields a unique solution.
However, prescribing Neumann boundary conditions to an elliptic pde re-
sults in no solutions or infinitely many solutions. This can be discussed vividly
using the electrostatic problem described by the elliptic boundary-value problem

−∇ ⋅ (𝐴∇𝑢) = 𝑓 in 𝑈, (10.9a)
𝐧 ⋅ (𝐴∇𝑢) = 𝑢𝑁 on 𝜕𝑈. (10.9b)

Here the right-hand side 𝑓 are the charges and the Neumann boundary condi-
tion 𝑢𝑁 corresponds to a known electric field on the whole boundary 𝜕𝑈. In-
tegrating the equation, using the divergence theorem, and using the Neumann
boundary condition, we find

− ∯𝜕𝑈 𝑢𝑁 d𝑆 = − ∯𝜕𝑈 𝐧 ⋅ (𝐴∇𝑢)d𝑆 = − ∭𝑈 ∇ ⋅ (𝐴∇𝑢)d𝑉 = ∭𝑈 𝑓d𝑉.

In other words, the Neumann boundary conditions must match the right-hand
side via the equation
− ∯𝜕𝑈 𝑢𝑁 d𝑆 = ∭𝑈 𝑓d𝑉,
which hence is a necessary condition for the existence of a solution.

Furthermore, it is obvious that if 𝑢 is a solution, then 𝐱 ↦ 𝑢(𝐱) + 𝐶 is also


a solution for all 𝐶 ∈ ℝ, as only the gradient ∇𝑢 of the solution 𝑢 appears in
(10.9).
Therefore this physical example illustrates how different boundary conditions render three cases possible, namely no solution (non-matching Neumann boundary conditions), a unique solution (Dirichlet or mixed Dirichlet/Neumann boundary conditions), and infinitely many solutions (matching Neumann boundary conditions).
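The same three cases show up in discretizations. The sketch below (a standard second-order finite-difference matrix for −𝑢″ on (0, 1) with homogeneous Neumann conditions; the construction is ours and anticipates Sect. 10.5) produces a singular matrix whose null space consists exactly of the constants.

```julia
using LinearAlgebra

# FD matrix for -u'' on (0, 1) with the Neumann conditions
# u'(0) = u'(1) = 0 (one-sided differences at the boundary).
# The matrix is singular; its null space is spanned by the constants.
function neumann_matrix(N)
    h = 1 / (N - 1)
    A = zeros(N, N)
    for i in 2:N-1
        A[i, i-1] = -1.0
        A[i, i] = 2.0
        A[i, i+1] = -1.0
    end
    A[1, 1] = 1.0
    A[1, 2] = -1.0
    A[N, N] = 1.0
    A[N, N-1] = -1.0
    A ./ h^2
end

N = 50
A = neumann_matrix(N)
rank(A)             # N - 1: a one-dimensional null space
norm(A * ones(N))   # 0: adding a constant to a solution gives a solution
```

A right-hand side is solvable exactly when it is orthogonal to this null space, which is the discrete counterpart of the matching condition above.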
In summary, the existence and uniqueness of solutions of pdes deserve inves-
tigations for all combinations of equations, boundary conditions, and function
spaces in which the solutions are sought. In the following, a short summary of
the modern theory of elliptic pdes is given.

10.2.3 Existence, Uniqueness, and a Pointwise Estimate *

We consider the elliptic boundary-value problem

−∇ ⋅ (𝐴∇𝑢) + 𝐛 ⋅ ∇𝑢 + 𝑐𝑢 = 𝑓 in 𝑈, (10.10a)
𝑢 = 𝑢𝐷 on 𝜕𝑈𝐷 , (10.10b)
𝐧 ⋅ (𝐴∇𝑢) = 0 on 𝜕𝑈𝑁 , (10.10c)

where 𝑈 ⊂ ℝ𝑑 is a domain and 𝑑 ∈ ℕ. The boundary 𝜕𝑈 is the complementary


union of its Dirichlet part 𝜕𝑈𝐷 and its Neumann part 𝜕𝑈𝑁 , and the 𝐧 in the Neumann boundary condition are outward pointing unit normal vectors on 𝜕𝑈𝑁 . The given functions are 𝐴 ∶ 𝑈 → ℝ𝑑×𝑑 , 𝐛 ∶ 𝑈 → ℝ𝑑 , 𝑐 ∶ 𝑈 → ℝ, 𝑓 ∶ 𝑈 → ℝ, and 𝑢𝐷 ∶ 𝜕𝑈𝐷 → ℝ.
The so-called weak formulation is generally obtained by multiplying the
strong, pointwise, or classical formulation by so-called test functions, integrat-
ing over the whole domain, using partial integration to distribute the derivatives,
and requiring that the resulting equation, i.e., the weak formulation, is satisfied
for all test functions. Then the fundamental lemma of variational calculus, The-
orem 10.1, ensures that the integrands in the weak formulation are equal, al-
though it may not be possible to substitute the solution of the weak formulation
into the original formulation because it may not be smooth enough. Hence the
names weak solution and weak formulation.
Here, the appropriate function space for the elliptic problem (10.10) is the
Hilbert space 𝐻 1 (𝑈), as the calculations below will show. In order to take care
of the inhomogeneous Dirichlet boundary conditions, we extend 𝑢𝐷 from the
boundary 𝜕𝑈𝐷 to the whole domain 𝑈 by using the following theorem with
𝑠 = 1.

Theorem 10.2 (inverse trace theorem) Suppose that 𝑠 is a positive integer, that
the domain 𝑈 is of class 𝐶 𝑠 , and that 𝜕𝑈 is bounded. Then there is a bounded
trace operator 𝑇 ∶ 𝐻 𝑠 (𝑈) → 𝐻 𝑠−1∕2 (𝜕𝑈). Moreover, 𝑇 has a bounded right inverse
𝐸 ∶ 𝐻 𝑠−1∕2 (𝜕𝑈) → 𝐻 𝑠 (𝑈), i.e., there exists a positive constant 𝐶 such that

‖𝐸𝑢‖𝐻 𝑠 (𝑈) ≤ 𝐶‖𝑢‖𝐻 𝑠−1∕2 (𝜕𝑈) ∀𝑢 ∈ 𝐻 𝑠−1∕2 (𝜕𝑈).

For a proof, see [7, Section 7.2.5].


We denote the extension of 𝑢𝐷 from the boundary 𝜕𝑈𝐷 to 𝑈 by 𝑢̄ 𝐷 and define

𝑤 ∶= 𝑢 − 𝑢̄ 𝐷 .

Then the problem (10.10) has homogeneous Dirichlet boundary conditions 𝑤 =


0 on 𝜕𝑈𝐷 , and solutions 𝑤 are sought in the function space 𝐻01 (𝑈).
To find the weak formulation, we first multiply (10.10a) by test functions
𝑣 ∈ 𝐻01 (𝑈). The test functions are chosen from the same function space as the
solution. Then partial integration of the first term yields

− ∭𝑈 ∇ ⋅ (𝐴∇𝑢)𝑣 d𝑉 = − ∯𝜕𝑈 (𝐴∇𝑢) ⋅ (𝑣𝐧)d𝑆 + ∭𝑈 (𝐴∇𝑢) ⋅ (∇𝑣)d𝑉.

The integral ∯𝜕𝑈 (𝐴∇𝑢) ⋅ (𝑣𝐧)d𝑆 vanishes, because the test function 𝑣 ∈ 𝐻01 (𝑈)
vanishes on the boundary 𝜕𝑈𝐷 and because 𝐧 ⋅ (𝐴∇𝑢) = 0 holds on 𝜕𝑈𝑁 . There-
fore the weak formulation is to find a function 𝑢 with 𝑢 − 𝑢̄ 𝐷 ∈ 𝐻01 (𝑈) such
that

∭𝑈 (𝐴∇𝑢) ⋅ ∇𝑣 + (𝑣𝐛) ⋅ ∇𝑢 + 𝑐𝑢𝑣 d𝑉 = ∭𝑈 𝑓𝑣 d𝑉 ∀𝑣 ∈ 𝐻01 (𝑈)

holds, or equivalently to find a function 𝑤 ∈ 𝐻01 (𝑈) such that

∭𝑈 (𝐴∇𝑤) ⋅ ∇𝑣 + (𝑣𝐛) ⋅ ∇𝑤 + 𝑐𝑤𝑣 d𝑉 = ∭𝑈 𝑓𝑣 − 𝐴∇𝑢̄𝐷 ⋅ ∇𝑣 − (𝑣𝐛) ⋅ ∇𝑢̄𝐷 − 𝑐𝑢̄𝐷 𝑣 d𝑉 ∀𝑣 ∈ 𝐻01 (𝑈)

holds. After defining the bilinear form

𝑎(𝑢, 𝑣) ∶= ∭𝑈 (𝐴∇𝑢) ⋅ ∇𝑣 + (𝑣𝐛) ⋅ ∇𝑢 + 𝑐𝑢𝑣 d𝑉 (10.11)

(the left-hand side of the last equation) and the functional

𝐹(𝑣) ∶= ∭𝑈 𝑓𝑣 − 𝐴∇𝑢̄𝐷 ⋅ ∇𝑣 − (𝑣𝐛) ⋅ ∇𝑢̄𝐷 − 𝑐𝑢̄𝐷 𝑣 d𝑉, (10.12)

(the right-hand side), the weak formulation of (10.10) is to find a function 𝑤 ∈ 𝐻01 (𝑈) such that

𝑎(𝑤, 𝑣) = 𝐹(𝑣) ∀𝑣 ∈ 𝐻01 (𝑈)
holds.
Does such a weak solution 𝑢 exist and is it unique? The existence and unique-
ness of the weak solution 𝑢 can be shown using the Lax–Milgram theorem. Be-
fore we can state it, we need a few definitions.
Definition 10.3 (bounded) Suppose 𝐻 is a Hilbert space. A bilinear form
𝑎 ∶ 𝐻 × 𝐻 → ℝ is called bounded, if there exists a positive constant 𝛼 such
that
|𝑎(𝑢, 𝑣)| ≤ 𝛼‖𝑢‖𝐻 ‖𝑣‖𝐻 ∀𝑢, 𝑣 ∈ 𝐻
holds.
Definition 10.4 (coercive) Suppose 𝐻 is a Hilbert space. A bilinear form
𝑎 ∶ 𝐻 × 𝐻 → ℝ is called coercive, if there exists a positive constant 𝛽 such that

𝛽‖𝑢‖2𝐻 ≤ 𝑎(𝑢, 𝑢) ∀𝑢 ∈ 𝐻

holds.
The main assumptions in the theorem are that the bilinear form 𝑎 is coercive
and bounded.
Theorem 10.5 (Lax–Milgram theorem [3]) Suppose that 𝐻 is a Hilbert space,
that 𝐹 ∈ 𝐻 ′ , and that 𝑎 is a bilinear form on 𝐻 that is bounded (with constant 𝛼)
and coercive (with constant 𝛽). Then there exists a unique solution 𝑢 ∈ 𝐻 of the
equation
𝑎(𝑢, 𝑣) = 𝐹(𝑣) ∀𝑣 ∈ 𝐻.
Furthermore, the estimate
‖𝑢‖𝐻 ≤ (1∕𝛽)‖𝐹‖𝐻 ′ (10.13)
holds.
The proof, which mostly follows [1, Section 6.2.1], uses basic concepts from functional analysis such as the Riesz representation theorem, whose full explanation can be found in any textbook on functional analysis. Apart from these basic concepts, however, the proof is self-contained.
Theorem 10.6 (Riesz representation theorem) Suppose 𝐻 is a Hilbert space.
Then the dual space 𝐻 ′ of 𝐻 can be canonically identified with 𝐻. More precisely,
for each 𝑢′ ∈ 𝐻 ′ there exists a unique element 𝑢 ∈ 𝐻 such that

𝑢′ (𝑣) = ⟨𝑢, 𝑣⟩ ∀𝑣 ∈ 𝐻

and the mapping 𝑢′ ↦ 𝑢 is a linear isomorphism of 𝐻 ′ into 𝐻.



Before giving the proof, we note that the special case of a symmetric bilinear
form 𝑎 leads the way to the general case. In the symmetric case, ⟨𝑢, 𝑣⟩ ∶= 𝑎(𝑢, 𝑣)
is an inner product on the Hilbert space 𝐻, and the Riesz representation theorem,
Theorem 10.6, can be directly applied, showing the existence of a unique solution
𝑢 ∈ 𝐻. A proof for the general case is the following.
Proof First, we consider any 𝑢 ∈ 𝐻 and note that 𝑣 ↦ 𝑎(𝑢, 𝑣) is a bounded
linear functional on 𝐻 by assumption. Then the Riesz representation theorem,
Theorem 10.6, implies that a unique element 𝑤 ∈ 𝐻 that satisfies

𝑎(𝑢, 𝑣) = ⟨𝑤, 𝑣⟩ ∀𝑣 ∈ 𝐻

exists. Since 𝑢 ∈ 𝐻 was arbitrary, we can thus define an operator 𝐴 ∶ 𝐻 → 𝐻


satisfying
𝑎(𝑢, 𝑣) = ⟨𝐴𝑢, 𝑣⟩ ∀𝑢, 𝑣 ∈ 𝐻.
We claim that the operator 𝐴 is linear and bounded. To show that it is linear,
the bilinearity of 𝑎 and the linearity of the inner product yield

⟨𝐴(𝜆1 𝑢1 + 𝜆2 𝑢2 ), 𝑣⟩ = 𝑎(𝜆1 𝑢1 + 𝜆2 𝑢2 , 𝑣)
= 𝜆1 𝑎(𝑢1 , 𝑣) + 𝜆2 𝑎(𝑢2 , 𝑣)
= 𝜆1 ⟨𝐴𝑢1 , 𝑣⟩ + 𝜆2 ⟨𝐴𝑢2 , 𝑣⟩
= ⟨𝜆1 𝐴𝑢1 + 𝜆2 𝐴𝑢2 , 𝑣⟩

for all 𝜆1 and 𝜆2 ∈ ℝ, for all 𝑢1 and 𝑢2 ∈ 𝐻, and for all 𝑣 ∈ 𝐻. Since the equality
holds for all 𝑣 ∈ 𝐻, the operator 𝐴 is linear. To show that 𝐴 is bounded, we use
the assumption that the bilinear form 𝑎 is bounded to calculate

‖𝐴𝑢‖2𝐻 = ⟨𝐴𝑢, 𝐴𝑢⟩ = 𝑎(𝑢, 𝐴𝑢) ≤ 𝛼‖𝑢‖𝐻 ‖𝐴𝑢‖𝐻 ∀𝑢 ∈ 𝐻,

which yields ‖𝐴𝑢‖ ≤ 𝛼‖𝑢‖ for all 𝑢 ∈ 𝐻, i.e., 𝐴 is bounded.


Having shown that the operator 𝐴 is linear and bounded, we next show
that it is injective and that it has closed range. Since the bilinear form 𝑎 is coer-
cive, we find

𝛽‖𝑢‖2𝐻 ≤ 𝑎(𝑢, 𝑢) = ⟨𝐴𝑢, 𝑢⟩ ≤ ‖𝐴𝑢‖𝐻 ‖𝑢‖𝐻 ∀𝑢 ∈ 𝐻,

which implies
𝛽‖𝑢‖𝐻 ≤ ‖𝐴𝑢‖𝐻 ∀𝑢 ∈ 𝐻, (10.14)
i.e., the operator 𝐴 is bounded below. It is straightforward to see that it is there-
fore injective. Furthermore, its range is closed: if 𝐴𝑢𝑛 → 𝑦, then the inequality

𝛽‖𝑢𝑛 − 𝑢𝑚 ‖𝐻 ≤ ‖𝐴(𝑢𝑛 − 𝑢𝑚 )‖𝐻 = ‖𝐴𝑢𝑛 − 𝐴𝑢𝑚 ‖𝐻

shows that {𝑢𝑛 } is a Cauchy sequence; if 𝑢𝑛 → 𝑢, then 𝐴𝑢𝑛 → 𝐴𝑢 by the conti-


nuity of 𝐴 and hence 𝐴𝑢 = 𝑦, which means that 𝐴 has closed range.

Next, we show that the linear operator 𝐴 is surjective. Suppose it is not. Since
its range 𝑅(𝐴) is closed, there would exist a nonzero element 𝑦 ∈ 𝐻 with 0 ≠
𝑦 ∈ 𝑅(𝐴)⊥ . This would imply 𝛽‖𝑦‖2𝐻 ≤ 𝑎(𝑦, 𝑦) = ⟨𝐴𝑦, 𝑦⟩ = 0 and therefore
𝑦 = 0, a contradiction. Therefore the operator 𝐴 is surjective.
Since the operator 𝐴 ∶ 𝐻 → 𝐻 is both injective and surjective, it is bijective.
Next, the Riesz representation theorem, Theorem 10.6, implies that a unique
𝑧 ∈ 𝐻 exists such that
𝐹(𝑣) = ⟨𝑧, 𝑣⟩ ∀𝑣 ∈ 𝐻
and that the norms ‖𝐹‖𝐻 ′ and ‖𝑧‖𝐻 agree. Since 𝐴 is a bijection, there exists a
unique element 𝑢 ∈ 𝐻 such that 𝐴𝑢 = 𝑧.
In summary, we have shown that there exists a unique element 𝑢 ∈ 𝐻 such
that
𝑎(𝑢, 𝑣) = ⟨𝐴𝑢, 𝑣⟩ = ⟨𝑧, 𝑣⟩ = 𝐹(𝑣) ∀𝑣 ∈ 𝐻.
Inequality (10.13) follows from inequality (10.14) and recalling that 𝐴𝑢 = 𝑧
and ‖𝐹‖𝐻 ′ = ‖𝑧‖𝐻 . □
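The theorem can be illustrated in the finite-dimensional Hilbert space 𝐻 = ℝⁿ, where every bilinear form is 𝑎(𝑢, 𝑣) = 𝑣⊤𝑀𝑢 and every functional is 𝐹(𝑣) = 𝑣⊤𝐛. The matrix below is our own example, chosen nonsymmetric but coercive; the coercivity constant 𝛽 is the smallest eigenvalue of the symmetric part of 𝑀.

```julia
using LinearAlgebra

# Finite-dimensional Lax–Milgram: a(u, v) = v'M*u with positive-definite
# symmetric part (so a is bounded and coercive) and F(v) = v'b.
# The unique solution is u = M \ b, and ‖u‖ ≤ (1/β)‖F‖ holds with β the
# smallest eigenvalue of (M + M')/2.
n = 4
M = [3.0 1.0 0.0 0.0;
    -1.0 3.0 1.0 0.0;
     0.0 -1.0 3.0 1.0;
     0.0 0.0 -1.0 3.0]                  # nonsymmetric but coercive
b = [1.0, 2.0, 3.0, 4.0]
u = M \ b                               # the unique solution of a(u, v) = F(v)
β = minimum(eigvals(Symmetric((M + M') / 2)))
norm(u) <= norm(b) / β                  # the estimate (10.13) holds: true
```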

We can now state and prove the existence and uniqueness of the solution
of the elliptic boundary-value problem by applying the Lax–Milgram theorem.
To do so, it is customary to call coefficient matrices 𝐴 that give rise to coercive
bilinear forms uniformly elliptic.

Definition 10.7 (uniformly elliptic) A matrix-valued function 𝐴 ∶ 𝑈 → ℝ𝑑×𝑑


is called uniformly elliptic, if there exists a positive constant 𝛽 such that

𝛽|𝐳|2 ≤ 𝐳⊤ 𝐴(𝐱)𝐳 ∀𝐳 ∈ ℝ𝑑 ∖{0} ∀𝐱 ∈ 𝑈

holds.

Theorem 10.8 (existence and uniqueness of weak solutions of elliptic


equations) Suppose that the domain 𝑈 ⊂ ℝ𝑑 , 𝑑 ∈ ℕ, is of class 𝐶 1 and
that its boundary 𝜕𝑈 is bounded. Suppose that the matrix-valued function 𝐴 ∈
𝐿∞ (𝑈, ℝ𝑑×𝑑 ) is uniformly elliptic with constant 𝐾, that 𝐛 ∈ 𝐿∞ (𝑈, ℝ𝑑 ), 𝑐 ∈
𝐿∞ (𝑈, ℝ), 𝑓 ∈ 𝐻 −1 (𝑈), and 𝑢𝐷 ∈ 𝐻 1∕2 (𝜕𝑈). Suppose further that 𝐛 = 0 and
𝑐 ≥ 0 for all 𝑥 ∈ 𝑈 or that ‖𝐛‖2𝐿∞ < 4𝐾 inf 𝑈 𝑐 holds.
Then the boundary-value problem (10.10) has a unique solution 𝑢 ∈ 𝐻 1 (𝑈).
Furthermore, there is a positive constant 𝐶 such that the estimate

‖𝑢‖𝐻 1 (𝑈) ≤ 𝐶‖𝐹‖𝐻 −1 (𝑈)

holds.

Proof We will show that the bilinear form 𝑎 in (10.11) is coercive and bounded
and that the functional 𝐹 in (10.12) is in 𝐻 −1 (𝑈) so that Theorem 10.5 can be
applied.
Using the Cauchy–Bunyakovsky–Schwarz inequality, Theorem 8.1, it is straightforward to see that the bilinear form 𝑎 is bounded.

To see that the bilinear form 𝑎 is coercive, we first find a bound from above for
the second term ∭𝑈 (𝑢𝐛)⋅∇𝑢d𝑉 in 𝑎(𝑢, 𝑢). The Cauchy–Bunyakovsky–Schwarz
inequality yields

∭𝑈 (𝑢𝐛) ⋅ ∇𝑢 d𝑉 ≤ ‖𝐛‖𝐿∞ (𝑈) ‖𝑢‖𝐿2 (𝑈) ‖∇𝑢‖𝐿2 (𝑈) .

For the factor ‖𝑢‖𝐿2 (𝑈) ‖∇𝑢‖𝐿2 (𝑈) , we use the inequality

𝑥𝑦 ≤ 𝛼𝑥2 + 𝛽𝑦2 ∀𝑥, 𝑦 ∈ ℝ,

where 𝛼 and 𝛽 ∈ ℝ+ with 𝛼𝛽 = 1∕4 (see Problem 10.8). These considerations


yield the bound from below
𝑎(𝑢, 𝑢) ≥ 𝐾‖∇𝑢‖2𝐿2 (𝑈) − ‖𝐛‖𝐿∞ (𝑈) ( 𝛼‖𝑢‖2𝐿2 (𝑈) + 𝛽‖∇𝑢‖2𝐿2 (𝑈) ) + (inf𝑈 𝑐)‖𝑢‖2𝐿2 (𝑈)
= (𝐾 − 𝛽‖𝐛‖𝐿∞ (𝑈) )‖∇𝑢‖2𝐿2 (𝑈) + (inf𝑈 𝑐 − 𝛼‖𝐛‖𝐿∞ (𝑈) )‖𝑢‖2𝐿2 (𝑈)

for the bilinear form 𝑎, where 𝛼𝛽 = 1∕4.


In the first case, i.e., if 𝐛 = 0 and 𝑐 ≥ 0 everywhere, the Poincaré inequality

∃𝐶 ∈ ℝ+ ∶ ∀𝑢 ∈ 𝐻01 (𝑈) ∶ ‖𝑢‖𝐿2 (𝑈) ≤ 𝐶‖∇𝑢‖𝐿2 (𝑈)

(see, e.g., [7, Theorem 7.32]) establishes that 𝑎 is coercive.


In the second case, i.e., if ‖𝐛‖2𝐿∞ < 4𝐾 inf 𝑈 𝑐, we set

𝛼 ∶= inf𝑈 𝑐 ∕ ‖𝐛‖𝐿∞ (𝑈) > 0

so that the coefficient of ‖𝑢‖2𝐿2 (𝑈) vanishes. Then 𝛽 = ‖𝐛‖𝐿∞ (𝑈) ∕(4 inf 𝑈 𝑐) and
the coefficient of ‖∇𝑢‖2𝐿2 (𝑈) is positive due to the assumption. Therefore the bi-
linear form 𝑎 is again coercive.
Finally, 𝐹 belongs to 𝐻 −1 (𝑈) = 𝐻01 (𝑈)′ provided it is a bounded linear func-
tional on 𝐻01 (𝑈). 𝐹 is clearly linear and it is bounded due to the assumptions on
the data. □

The following theorem is a pointwise estimate for solutions of the linear Pois-
son equation. It is quite well-known and can be found, e.g., as [2, Theorem 3.7].
Many regularity results for elliptic equations are available to ensure the smooth-
ness of the solution based on assumptions on the coefficients and the domain. If
sufficient smoothness is assumed, the operator ∇ ⋅ (𝐴∇) in divergence form can
always be rewritten in the non-divergence form that occurs in the theorem.

Theorem 10.9 (pointwise estimate for elliptic equations) Suppose 𝑈 ⊂ ℝ𝑑


is a bounded domain. Define the linear operator
𝐿𝑢 ∶= ∑_{𝑖,𝑗=1}^{𝑑} 𝑎𝑖𝑗 (𝑥)𝜕𝑥𝑖 𝑥𝑗 𝑢(𝑥) + 𝑏(𝑥) ⋅ ∇𝑢(𝑥) + 𝑐(𝑥)𝑢(𝑥) = 𝑓(𝑥),

where 𝑢 ∈ 𝐶 0 (𝑈)∩𝐶 2 (𝑈), the coefficient matrix 𝐴 is symmetric, and the inequality
𝑐(𝑥) ≤ 0 holds for all 𝑥 ∈ 𝑈. Furthermore, suppose that the operator 𝐿 is elliptic,
i.e., 0 < 𝜆(𝑥)|𝜉|2 ≤ 𝜉 ⊤ 𝐴(𝑥)𝜉 ≤ Λ(𝑥)|𝜉|2 holds for all 𝜉 ∈ ℝ𝑑 ∖{0} and for all 𝑥 ∈
𝑈, where 𝜆(𝑥) and Λ(𝑥) are the minimum and maximum eigenvalues, respectively.
Then the estimate
sup𝑈 |𝑢| ≤ sup𝜕𝑈 |𝑢| + 𝐶 sup𝑈 (|𝑓|∕𝜆)
holds, where 𝐶 is a constant depending only on diam(𝑈) and 𝛽 ∶= sup |𝑏|∕𝜆 < ∞.
In particular, if 𝑈 lies between two parallel planes a distance 𝑑 apart, then the
estimate holds with 𝐶 = e(𝛽+1)𝑑 − 1.

10.3 Parabolic Equations

Parabolic equations are frequently the transient versions of stationary, elliptic


equations. In the case of two independent variables, they are usually called 𝑡
and 𝑥 as in the example of the transient one-dimensional heat equation

𝑢𝑡 = 𝐷𝑢𝑥𝑥 + 𝑞,

where the positive coefficient function 𝐷 is the thermal conductivity. The same
equation can be interpreted as a transient one-dimensional diffusion equation;
then the positive coefficient 𝐷 is the diffusion constant. The function 𝑞 on the
right-hand side is a source of heat (in the case of the heat equation) or mass (in
the case of the diffusion equation).
In higher dimensions, heat or diffusion equations have the form

𝑢𝑡 = 𝐷Δ𝑢 + 𝑞

or, more generally,


𝑢𝑡 = ∇ ⋅ (𝐷∇𝑢) + 𝑞
with space dependent coefficient functions 𝐷. Here the Laplace operator Δ and
the nabla operators ∇ are applied to the unknown 𝑢(𝑡, 𝐱) with respect to the
spatial variables 𝐱 = (𝑥1 , … , 𝑥𝑑 ).
In the case of transient parabolic equations, one ensures that a unique solu-
tion exists by specifying initial conditions 𝑢(𝑡 = 0, 𝐱) = 𝑓(𝐱) as well as boundary
conditions 𝑢(𝑡, 𝐱) = 𝑔(𝑡) for all points 𝐱 on the boundary.

It is obvious that stationary solutions, i.e., when 𝑢𝑡 = 0, satisfy the corre-


sponding elliptic equation. However, it is not generally true that stationary and
transient descriptions of physical phenomena are corresponding pairs of ellip-
tic and parabolic equations. Counterexamples are the Poisson equation and the
Maxwell equations: the Poisson equation is stationary, elliptic, and governs elec-
trostatic problems, while the Maxwell equations are transient, hyperbolic, and
govern electromagnetism. This takes us to hyperbolic equations.

10.4 Hyperbolic Equations

The solutions of hyperbolic equations behave like waves. More precisely, if the
initial condition for the solution at time 𝑡 = 0 is disturbed, it takes a finite
amount of time to observe this disturbance at other points of space, meaning
that the disturbance has a finite propagation speed in contrast to elliptic and
parabolic equations, where a disturbance is observed everywhere immediately
and hence the propagation speed is infinite.
The simplest example of a hyperbolic equation is the one-dimensional (in
space) wave equation
𝑢𝑡𝑡 = 𝑐2 𝑢𝑥𝑥
equipped with an initial condition 𝑢(𝑡 = 0, 𝑥) = 𝑓(𝑥) and boundary conditions
such as 𝑢(𝑡, 𝑥 = 𝑥1 ) = 𝑔1 (𝑡) and 𝑢(𝑡, 𝑥 = 𝑥2 ) = 𝑔2 (𝑡). In the case of hyper-
bolic equations, the choice of boundary conditions is often a delicate matter. As
the waves travel long enough, the waves may leave a finite boundary, and there-
fore one often tries to make provisions to enable the wave to leave the boundary
without any disturbance or reflection. One often also generates waves on the
boundary to enter the domain.
It is easy to find an exact solution of the wave equation above. Any function
𝑢(𝑡, 𝑥) ∶= 𝑓(𝑥 − 𝑐𝑡) is a solution – as long as the boundary conditions match
– as can be checked in a straightforward manner using the chain rule. Here the
function 𝑓 is the initial condition. It is clear that the constant 𝑐 ∈ ℝ is the speed
of the wave. This exact solution is useful for testing numerical methods.
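A quick numerical check of this statement, with an ad-hoc Gaussian profile 𝑓 and central finite differences in both 𝑡 and 𝑥:

```julia
# Verify u_tt = c^2 u_xx for u(t, x) = f(x - c*t) with a smooth test
# profile f, using central finite differences in t and x.
f(s) = exp(-s^2)
c = 2.0
u(t, x) = f(x - c * t)

h = 1e-3
t0, x0 = 0.3, 0.7
u_tt = (u(t0 + h, x0) - 2 * u(t0, x0) + u(t0 - h, x0)) / h^2
u_xx = (u(t0, x0 + h) - 2 * u(t0, x0) + u(t0, x0 - h)) / h^2
abs(u_tt - c^2 * u_xx)   # ≈ 0 up to the discretization error
```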
Another important class of hyperbolic equations are conservation laws. To
derive a conservation law, we start from the equation

d∕d𝑡 ∭Ω 𝑢(𝐱)d𝐱 + ∯𝜕Ω 𝐧 ⋅ 𝐟 (𝑢)d𝑆 = 0.

The first term is the time rate of change of 𝑢 in the subdomain Ω ⊂ 𝐷, which
is arbitrary with a sufficiently smooth boundary in this equation. The second
integral is a surface integral and gives the flux 𝐟 of 𝑢 through the boundary 𝜕Ω
of Ω, where the 𝐧 are outward unit normal vectors. The equation just means that
the change of 𝑢 contained in Ω is equal to the negative total outflow of 𝑢 from Ω.

If 𝑢 and 𝐟 are sufficiently smooth functions, we can change the order of dif-
ferentation and integration in the first term and use the divergence theorem in
the second term to find
∭Ω 𝑢𝑡 (𝐱)d𝐱 + ∭Ω ∇ ⋅ 𝐟 (𝑢(𝐱))d𝐱 = ∭Ω ( 𝑢𝑡 (𝐱) + ∇ ⋅ 𝐟 (𝑢(𝐱)) ) d𝐱 = 0.

Since the subdomain Ω is arbitrary, the fundamental lemma of variational cal-


culus, Theorem 10.1, applied to the second equation yields that the integrand
vanishes everywhere, i.e.,

𝑢𝑡 + ∇ ⋅ 𝐟 (𝑢) = 0 ∀𝐱 ∈ 𝐷.

In higher dimensions, an equation for 𝑢 ∶ ℝ𝑑 → ℝ of this form is called hyperbolic if the Jacobian matrix

⎛ 𝜕𝑓1 ∕𝜕𝑢1 ⋯ 𝜕𝑓1 ∕𝜕𝑢𝑑 ⎞
⎜ ⋮ ⋱ ⋮ ⎟
⎝ 𝜕𝑓𝑑 ∕𝜕𝑢1 ⋯ 𝜕𝑓𝑑 ∕𝜕𝑢𝑑 ⎠

of the flux function 𝐟 ∶ ℝ𝑑 → ℝ𝑑 has only real eigenvalues and is diagonalizable.


If the Jacobian matrix even has distinct real eigenvalues, it is certainly diagonal-
izable, and then the equation is called strictly hyperbolic.
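As a concrete check, consider the one-dimensional shallow-water equations, a standard example that is not discussed in the text: the eigenvalues of the flux Jacobian are real and distinct whenever the water depth is positive, so the system is strictly hyperbolic.

```julia
using LinearAlgebra

# Strict hyperbolicity of the 1-D shallow-water equations:
# state u = (h, h*v), flux f(u) = (h*v, h*v^2 + g*h^2/2),
# with flux-Jacobian eigenvalues v ± sqrt(g*h).
const g = 9.81
flux(u) = [u[2], u[2]^2 / u[1] + g * u[1]^2 / 2]

# central-difference Jacobian of f at u
function jacobian_fd(f, u; h=1e-6)
    d = length(u)
    J = zeros(d, d)
    for j in 1:d
        e = zeros(d)
        e[j] = h
        J[:, j] = (f(u .+ e) .- f(u .- e)) ./ (2 * h)
    end
    J
end

u0 = [2.0, 3.0]                  # depth h = 2, velocity v = 1.5
eigvals(jacobian_fd(flux, u0))   # ≈ 1.5 ± sqrt(9.81 * 2): real and distinct
```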

10.5 Finite Differences

The finite-difference method is one of the most important numerical methods


for pdes. Solving a pde using the finite-difference method requires these steps.
1. Define grid points on the domain. The solution will be calculated on these
grid points.
2. Approximate the derivatives in the equation on the grid points using Taylor’s
theorem.
3. Substitute the approximations of the derivatives into the equation, which
yields a system of algebraic equations for the values of the solution at the
grid points.
4. Solve the system of algebraic equations.
5. Test the finite-difference discretization and its implementation by compar-
ing exact and numerical solutions for different grid sizes.
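The five steps can be sketched for the model problem −𝑢″ = 𝑓 on (0, 1) with homogeneous Dirichlet boundary conditions (a minimal preview in Julia, not the implementation developed below):

```julia
using LinearAlgebra

# Steps 1–5 for -u''(x) = f(x) on (0, 1) with u(0) = u(1) = 0:
# equidistant interior grid, central differences, a tridiagonal solve,
# and a comparison with a known exact solution.
function poisson_1d(f, N)
    h = 1 / (N + 1)
    x = [i * h for i in 1:N]                          # interior grid points
    A = SymTridiagonal(fill(2.0, N), fill(-1.0, N - 1)) / h^2
    u = A \ f.(x)                                     # linear system for the u_i
    x, u
end

# f(x) = π² sin(πx) has the exact solution u(x) = sin(πx)
x, u = poisson_1d(x -> π^2 * sin(π * x), 100)
maximum(abs.(u .- sin.(π .* x)))                      # ≈ 8e-5: second-order accurate
```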
Before implementing finite differences for a one-dimensional equation, we discuss these steps in more detail. It goes without saying that there are many variants of how the first, second, and fourth steps can be implemented.

Regarding the choice of grid points in the first step, it is customary to use
equidistant grid points. In one dimension, this means that the solution 𝑢 is cal-
culated at the points
𝑢𝑖 ∶= 𝑢(𝑎 + 𝑖ℎ),
where the domain is the interval 𝑈 ∶= (𝑎, 𝑏), ℎ ∈ ℝ+ is the grid spacing, the index 𝑖 ranges over {1, … , 𝑁}, and the number 𝑁 of grid intervals is related to the grid spacing ℎ by

ℎ = (𝑏 − 𝑎)∕𝑁.
In two dimensions, we have

𝑢𝑖,𝑗 = 𝑢(𝑎1 + 𝑖ℎ, 𝑎2 + 𝑗ℎ),

where the domain is 𝑈 ∶= (𝑎1 , 𝑏1 ) × (𝑎2 , 𝑏2 ) and the indices are 𝑖 ∈ {1, … , 𝑁1 }
and 𝑗 ∈ {1, … , 𝑁2 }, and so forth in higher dimensions.
Finite differences are especially well suited for domains with simple bound-
aries, while the finite-element method (see Sect. 10.7 below) is especially well
suited for domains with complex boundaries that are to be resolved precisely.
Taylor’s theorem is used to approximate the derivatives in the second step.

Theorem 10.10 (Taylor's theorem) Suppose that 𝑛 ∈ ℕ and that the function 𝑓 ∶ ℝ → ℝ is 𝑛 times differentiable at the point 𝑎 ∈ ℝ. Then there exists a function ℎ ∶ ℝ → ℝ such that

$$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}\,(x-a)^k + h(x)(x-a)^n$$

and

$$\lim_{x\to a} h(x) = 0.$$
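To see the theorem at work numerically (a small check of our own; the helper below is not from the text), the remainder of the degree-5 Taylor polynomial of sin around 𝑎 = 0, divided by 𝑥⁵, tends to 0 as 𝑥 → 0:

```julia
# Degree-5 Taylor polynomial of sin around a = 0 (hypothetical helper).
taylor_sin(x) = x - x^3/6 + x^5/120

# h(x) = (sin(x) - taylor_sin(x)) / x^5 must tend to 0 as x -> 0.
for x in (1.0, 0.5, 0.1)
    println(abs(sin(x) - taylor_sin(x)) / x^5)
end
```

The printed quotients decrease toward zero, which is exactly the statement lim_{𝑥→𝑎} ℎ(𝑥) = 0 for this example.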

The Taylor expansion must be truncated at one point. This gives rise to the
local truncation error. Using more terms in the Taylor expansion generally re-
sults in smaller local truncation errors in the third step, and we will implement
an example below. However, more terms generally complicate the system of al-
gebraic equations, making it more time consuming to assemble and to solve. It
is therefore not obvious a priori if the accuracy of the solutions is increased by
increasing the number of terms in the Taylor expansion when the total compu-
tation time is the same.
Regarding the solution of the resulting system of algebraic equations, it is ob-
vious that a linear pde will result in a linear system of equations (see Sect. 8.4.8).
As the number of dimensions of the domain increases, the system matrices be-
come sparser, and it is imperative to use sparse matrices (see Sect. 8.2). We will
discuss the implementation in more details below.
If the pde and thus the system of algebraic equations are nonlinear, Newton
methods (see Sect. 12.7) and fixed-point methods are the methods of choice.

10.5.1 One-Dimensional Second-Order Discretization

We now apply these ideas to the prototypical one-dimensional elliptic boundary-


value problem

−𝑢𝑥𝑥 (𝑥) = 𝑓(𝑥) ∀𝑥 ∈ (𝑎, 𝑏) ⊂ ℝ,


𝑢(𝑎) = 𝑔1 ,
𝑢(𝑏) = 𝑔2

in the interval (𝑎, 𝑏), where the function 𝑓 ∶ ℝ → ℝ and the constants 𝑔1 ∈ ℝ
and 𝑔2 ∈ ℝ are given.
In the first step, we use the equidistant grid 𝑎 + 𝑖ℎ defined above. The two
boundary conditions result in 𝑢0 = 𝑢(𝑎) = 𝑔1 and 𝑢𝑁 = 𝑢(𝑎 + 𝑁ℎ) = 𝑢(𝑏) = 𝑔2 .
In the second step, to approximate the derivative 𝑢𝑥𝑥 in the equation, we apply
Taylor’s theorem, Theorem 10.10, to find the two expansions

$$u_{i+1} = u_i + h u_x(a + ih) + \frac{h^2}{2} u_{xx}(a + ih) + \frac{h^3}{6} u_{xxx}(a + ih) + O(h^4),$$
$$u_{i-1} = u_i - h u_x(a + ih) + \frac{h^2}{2} u_{xx}(a + ih) - \frac{h^3}{6} u_{xxx}(a + ih) + O(h^4),$$

for 𝑢𝑖+1 = 𝑢(𝑎 + (𝑖 + 1)ℎ) and 𝑢𝑖−1 = 𝑢(𝑎 + (𝑖 − 1)ℎ) around the point 𝑎 + 𝑖ℎ. Here 𝑂(ℎ⁴) includes all terms of fourth order and higher in ℎ. More precisely, we write 𝑓(𝑥) = 𝑂(𝑔(𝑥)) as 𝑥 → 𝑥₀ if and only if lim sup_{𝑥→𝑥₀} |𝑓(𝑥)∕𝑔(𝑥)| < ∞.
Adding these two expansions yields

𝑢𝑖+1 + 𝑢𝑖−1 = 2𝑢𝑖 + ℎ2 𝑢𝑥𝑥 (𝑎 + 𝑖ℎ) + 𝑂(ℎ4 ),

and then dividing by ℎ2 and reordering yields


$$u_{xx}(a + ih) = \frac{u_{i+1} - 2u_i + u_{i-1}}{h^2} + O(h^2).$$
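As a quick numerical illustration (a check of our own, not from the text), applying this central difference quotient to 𝑢 = sin shows the expected second-order behavior: dividing ℎ by 10 divides the error by a factor of about 100.

```julia
# Central second difference quotient of u at x with spacing h.
d2(u, x, h) = (u(x + h) - 2u(x) + u(x - h)) / h^2

# For u = sin we know u_xx = -sin exactly, so the error is observable directly.
x = 1.0
for h in (1e-1, 1e-2, 1e-3)
    println(abs(d2(sin, x, h) - (-sin(x))))
end
```

The leading error term is h²/12 · u_xxxx, in agreement with the derivation above.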
In the third step, we write down the system of algebraic equations, which is
linear for this linear equation, and solve it using Julia. There are 𝑁 − 1 un-
knowns, namely the values 𝑢𝑖 for all 𝑖 ∈ {1, … , 𝑁 − 1}. For each unknown value
𝑢𝑖 , we have the algebraic equation
$$-\frac{u_{i+1} - 2u_i + u_{i-1}}{h^2} = f(a + ih) + O(h^2) \qquad \forall i \in \{1, \dots, N-1\}$$
or, equivalently,

−(𝑢𝑖+1 − 2𝑢𝑖 + 𝑢𝑖−1 ) = ℎ2 𝑓(𝑎 + 𝑖ℎ) + 𝑂(ℎ4 ) ∀𝑖 ∈ {1, … , 𝑁 − 1} (10.15)



found by substituting the equation for 𝑢𝑥𝑥 (𝑎 + 𝑖ℎ) into the boundary-value prob-
lem. We observe that the local truncation error is a term 𝑂(ℎ2 ) of second order
in ℎ, rendering this finite-difference discretization a second-order one.
In order to solve this linear system of equations, we write a function that records each equation in a row of a sparse matrix and then calls the standard solver in Julia for this type of linear equation. The vector fs that contains the right-hand side is a dense vector. The system matrix is initialized as an empty, sparse (𝑁 − 1) × (𝑁 − 1) matrix. In the loop, the coefficients of 𝑢𝑖+1, 𝑢𝑖, and 𝑢𝑖−1 are written into the 𝑖-th row of the matrix. The two equations that contain the boundary conditions, i.e., the ones for 𝑖 = 0 and 𝑖 = 𝑁, require special treatment, and the constant terms 𝑢0 = 𝑔1 and 𝑢𝑁 = 𝑔2 go on the right-hand side. For solving the linear system of equations, we use the built-in \ operator.
import LinearAlgebra
import SparseArrays

function elliptic_FD_1D(a::Float64, b::Float64, f::Function,
                        g1::Float64, g2::Float64,
                        N::Int)::Vector{Float64}
    local h = (b-a) / N
    local fs = Float64[h^2 * f(a + i*h) for i in 1:N-1]
    local A = SparseArrays.spzeros(N-1, N-1)

    ## interior
    for i in 2:N-2
        A[i, i+1] = -1.0
        A[i, i] = 2.0
        A[i, i-1] = -1.0
    end

    ## left boundary
    A[1, 1] = 2.0
    A[1, 2] = -1.0
    fs[1] += g1

    ## right boundary
    A[N-1, N-1] = 2.0
    A[N-1, N-2] = -1.0
    fs[N-1] += g2

    ## solve
    A \ fs
end

It is an unfortunate fact of life that the implementation of each boundary condition – which seems so unimportant, because it concerns only a few points compared to the number of interior points – requires as much care as the implementation of the equation for the interior points. If the finite-difference discretization (10.15) also contained 𝑢𝑖+2 or 𝑢𝑖−2, then we would have to implement more special cases close to the boundary.
Calculating a single solution of a pde without testing the implementation is not only useless, but may be outright dangerous depending on what the solution is used for. In the fifth and last step, we hence write a function to test our discretization and its implementation. We
can easily find the exact solution in our test cases by putting the cart before the
horse: we define the solution first, in our example 𝑢 ∶= sin, and only then obtain
the right-hand side 𝑓 and the boundary conditions after substituting the solution
into the equation. After calculating the numerical approximation of the solution
and evaluating the exact solution at the grid points, we calculate the error as
the maximum norm of the difference between the approximation and the exact
solution. This procedure is implemented in the following function.
function test_elliptic_FD_1D(u_exact::Function, f::Function,
                             a::Float64, b::Float64, N::Int)::Float64
    local h = (b-a) / N
    local u_ex = Float64[u_exact(a + i*h) for i in 1:N-1]
    local u_num = elliptic_FD_1D(a, b, f, u_exact(a), u_exact(b), N)

    LinearAlgebra.norm(u_num - u_ex, Inf)
end

Now we can easily test our implementation. In the following test case, the domain is the interval (0, 2𝜋) and 𝑢 ∶= sin yields 𝑓 = sin. The right-hand side could also be obtained using symbolic computations, automating testing even further. We calculate the error on four grids, namely with 10¹, 10², 10³, and 10⁴ points.
import Printf
for i in 1:4
    local error = test_elliptic_FD_1D(sin, sin, 0.0, 2*pi, 10^i)
    Printf.@printf("N = 10^%1d: error = %.5e\n", i, error)
end

N = 10^1: error = 3.19159e-02
N = 10^2: error = 3.29052e-04
N = 10^3: error = 3.29888e-06
N = 10^4: error = 3.29044e-08

We observe that dividing ℎ by 10 (or equivalently multiplying 𝑁 by 10) divides the error by a factor of 100. This is exactly what is expected, since the local truncation error in (10.15) is 𝑂(ℎ2) and the discretization is thus a second-order one.
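The observed order can also be computed from the errors themselves (a small helper of our own): if the error behaves like 𝐶ℎᵖ, then two errors 𝑒₁, 𝑒₂ measured with spacings ℎ₁, ℎ₂ give 𝑝 ≈ log(𝑒₁∕𝑒₂)∕log(ℎ₁∕ℎ₂).

```julia
# Estimate the convergence order p from errors e1, e2 measured on grids with
# spacings h1 and h2, assuming error ≈ C * h^p.
order(e1, e2, h1, h2) = log(e1 / e2) / log(h1 / h2)

# Using the first two errors printed above (only the ratio h1/h2 = 10 matters):
println(order(3.19159e-2, 3.29052e-4, 1/10, 1/100))   # ≈ 2
```

Applied to consecutive pairs of the errors above, the estimate stays close to 2, confirming the second-order convergence.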

Not much imagination is required to see that many variations of the steps above are possible. Most importantly, the convergence speed of a finite-difference scheme is affected by how Taylor's theorem is applied. This is the question we investigate next.

10.5.2 Compact Fourth-Order Finite-Difference Discretizations

In this section, fourth-order finite-difference discretizations of the Laplace operator

$$\Delta u = \nabla \cdot (\nabla u) = \sum_{i=1}^{d} u_{x_i x_i}$$

in two and three dimensions 𝑑 are derived. In addition to being of fourth order,
the schemes still have the desirable property that only neighboring grid points
are used. This fact considerably simplifies the implementation at the grid points
near the boundary. Another advantage is the small bandwidth in the resulting
linear system of equations, meaning that it can be solved faster. Therefore these
two- and three-dimensional finite-difference discretizations combine fast con-
vergence as ℎ → 0 with ease of implementation, providing a good example of
how Taylor’s theorem can be applied to good effect.

10.5.2.1 For Two-Dimensional Elliptic Equations

We consider the two-dimensional elliptic equation

Δ𝑢 = 𝑢𝑥𝑥 + 𝑢𝑦𝑦 = 𝑓 in 𝑈 ∶= (𝑎1 , 𝑏1 ) × (𝑎2 , 𝑏2 )

with twice differentiable right-hand side 𝑓 and discretize it on an equidistant


grid with spacing ℎ. The discretization

$$\frac{1}{6}\bigl(u_{i+1,j+1} + u_{i+1,j-1} + u_{i-1,j+1} + u_{i-1,j-1}\bigr) + \frac{2}{3}\bigl(u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1}\bigr) - \frac{10}{3}\, u_{i,j}$$
$$= h^2 \Bigl(\frac{2}{3} f_{i,j} + \frac{1}{12}\bigl(f_{i+1,j} + f_{i-1,j} + f_{i,j+1} + f_{i,j-1}\bigr)\Bigr) + O(h^6)$$

can be written in symbolic form neatly as

$$\begin{bmatrix} \frac{1}{6} & \frac{2}{3} & \frac{1}{6} \\ \frac{2}{3} & -\frac{10}{3} & \frac{2}{3} \\ \frac{1}{6} & \frac{2}{3} & \frac{1}{6} \end{bmatrix}_{i,j} u_{i,j} = h^2 \begin{bmatrix} 0 & \frac{1}{12} & 0 \\ \frac{1}{12} & \frac{2}{3} & \frac{1}{12} \\ 0 & \frac{1}{12} & 0 \end{bmatrix}_{i,j} f_{i,j} + O(h^6), \qquad (10.16)$$

where the matrix elements are the coefficients of the unknowns at the grid point (𝑖, 𝑗) and its neighbors.
This discretization is a fourth-order compact finite-difference one and is derived in the proof of the following theorem. Such a discretization is often called compact, since it only involves the eight neighbors of the grid point (𝑖, 𝑗); a general fourth-order discretization involves more grid points.

Theorem 10.11 (fourth-order compact finite-difference discretization of


two-dimensional elliptic equations) The local truncation error of the finite-
difference discretization (10.16) of the two-dimensional elliptic equation

Δ𝑢 = 𝑓 in 𝐷 ⊂ ℝ2 (10.17)

with twice differentiable 𝑓 is of fourth order.

Proof We use Taylor's theorem, Theorem 10.10, to find the two expansions

$$u_{i+1,j} = u_{i,j} + h u_x + \frac{h^2}{2} u_{xx} + \frac{h^3}{6} u_{xxx} + \frac{h^4}{24} u_{xxxx} + \frac{h^5}{120} u_{xxxxx} + O(h^6),$$
$$u_{i-1,j} = u_{i,j} - h u_x + \frac{h^2}{2} u_{xx} - \frac{h^3}{6} u_{xxx} + \frac{h^4}{24} u_{xxxx} - \frac{h^5}{120} u_{xxxxx} + O(h^6)$$

for 𝑢𝑖+1,𝑗 = 𝑢(𝑎1 + (𝑖 + 1)ℎ, 𝑎2 + 𝑗ℎ) and 𝑢𝑖−1,𝑗 = 𝑢(𝑎1 + (𝑖 − 1)ℎ, 𝑎2 + 𝑗ℎ) with respect to 𝑥 around the point 𝑎1 + 𝑖ℎ. For convenience, the arguments (𝑎1 + 𝑖ℎ, 𝑎2 + 𝑗ℎ) of the derivatives are dropped. Adding these two expansions and rearranging terms shows that the equation

$$D_x^2 u_{i,j} := \frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{h^2} = u_{xx} + \frac{h^2}{12} u_{xxxx} + O(h^4) \qquad (10.18)$$

holds for the central-difference operator 𝐷𝑥2 that acts with respect to 𝑥.
Therefore the local truncation error 𝜏𝑖,𝑗 of the initial (second-order) discretiza-
tion
𝐷𝑥2 𝑢𝑖,𝑗 + 𝐷𝑦2 𝑢𝑖,𝑗 = 𝑓𝑖,𝑗 + 𝜏𝑖,𝑗 (10.19)
of (10.17) equals
$$\tau_{i,j} = \frac{h^2}{12}\bigl(u_{xxxx} + u_{yyyy}\bigr) + O(h^4).$$
In order to obtain more information about the coefficient of ℎ2 , we differenti-
ate (10.17) twice with respect to 𝑥 and twice with respect to 𝑦 to find

𝑢𝑥𝑥𝑥𝑥 = 𝑓𝑥𝑥 − 𝑢𝑥𝑥𝑦𝑦 ,


𝑢𝑦𝑦𝑦𝑦 = 𝑓𝑦𝑦 − 𝑢𝑥𝑥𝑦𝑦 .

Hence we can rewrite the truncation error as


$$\tau_{i,j} = \frac{h^2}{12}\bigl(f_{xx} + f_{yy}\bigr) - \frac{h^2}{6}\, u_{xxyy} + O(h^4). \qquad (10.20)$$
Next, we approximate the terms in the coefficient of ℎ2 by

𝑓𝑥𝑥 = 𝐷𝑥2 𝑓𝑖,𝑗 + 𝑂(ℎ2 ),


𝑓𝑦𝑦 = 𝐷𝑦2 𝑓𝑖,𝑗 + 𝑂(ℎ2 ),
𝑢𝑥𝑥𝑦𝑦 = 𝐷𝑥2 𝐷𝑦2 𝑢𝑖,𝑗 + 𝑂(ℎ2 ),

which are second-order discretizations as we already know, which yields

$$\tau_{i,j} = \frac{h^2}{12}\bigl(D_x^2 f_{i,j} + D_y^2 f_{i,j}\bigr) - \frac{h^2}{6}\, D_x^2 D_y^2 u_{i,j} + O(h^4).$$
We substitute this form of 𝜏𝑖,𝑗 into the initial discretization (10.19).
In summary, the sought discretization is

$$\underbrace{D_x^2 u_{i,j}}_{h^{-2}} + \underbrace{D_y^2 u_{i,j}}_{h^{-2}} + \underbrace{\frac{h^2}{6} D_x^2 D_y^2 u_{i,j}}_{h^{-2}} = \underbrace{f_{i,j}}_{h^0} + \underbrace{\frac{h^2}{12}\bigl(D_x^2 + D_y^2\bigr) f_{i,j}}_{h^0} + \underbrace{O(h^4)}_{h^4}, \qquad (10.21)$$

which yields (10.16) after expanding the central-difference operators 𝐷𝑥2 and 𝐷𝑦2
and multiplying by ℎ2 . The local truncation error 𝑂(ℎ4 ) of this discretization is
a factor ℎ4 apart from the other terms. □
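The fourth order of the compact scheme can be checked pointwise without solving any system (a sketch of our own, using the smooth test function 𝑢(𝑥, 𝑦) = sin 𝑥 sin 𝑦 with 𝑓 = Δ𝑢 = −2 sin 𝑥 sin 𝑦): by Theorem 10.11, the residual of (10.16) at a fixed point shrinks like ℎ⁶, i.e., by a factor of 64 per halving of ℎ.

```julia
u(x, y) = sin(x) * sin(y)
f(x, y) = -2 * sin(x) * sin(y)        # f = Δu for this choice of u

# Residual of the compact discretization (10.16) at the grid point (x0, y0).
function residual(x0, y0, h)
    lhs = (u(x0+h, y0+h) + u(x0+h, y0-h) + u(x0-h, y0+h) + u(x0-h, y0-h)) / 6 +
          (u(x0+h, y0) + u(x0-h, y0) + u(x0, y0+h) + u(x0, y0-h)) * 2/3 -
          u(x0, y0) * 10/3
    rhs = h^2 * (f(x0, y0) * 2/3 +
                 (f(x0+h, y0) + f(x0-h, y0) + f(x0, y0+h) + f(x0, y0-h)) / 12)
    lhs - rhs
end

for h in (1e-1, 5e-2, 2.5e-2)
    # The residual should shrink by a factor of about 2^6 = 64 per halving.
    println(abs(residual(0.7, 0.3, h)))
end
```

The point (0.7, 0.3) is an arbitrary choice; any point where the sixth derivatives of 𝑢 do not vanish shows the same behavior.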

10.5.2.2 For Three-Dimensional Elliptic Equations

Analogously to Theorem 10.11, the following result holds for three-dimensional


elliptic equations with sufficiently smooth right-hand side.

Theorem 10.12 (fourth-order compact finite-difference discretization of


three-dimensional elliptic equations) The local truncation error of the finite-
difference discretization

$$-4u_{i,j,k} + \frac{1}{3}\bigl(u_{i+1,j,k} + u_{i-1,j,k} + u_{i,j+1,k} + u_{i,j-1,k} + u_{i,j,k+1} + u_{i,j,k-1}\bigr)$$
$$+ \frac{1}{6}\bigl(u_{i,j+1,k+1} + u_{i,j+1,k-1} + u_{i,j-1,k+1} + u_{i,j-1,k-1} + u_{i+1,j,k+1} + u_{i+1,j,k-1} + u_{i-1,j,k+1} + u_{i-1,j,k-1} + u_{i+1,j+1,k} + u_{i+1,j-1,k} + u_{i-1,j+1,k} + u_{i-1,j-1,k}\bigr)$$
$$= h^2 \Bigl(\frac{1}{2} f_{i,j,k} + \frac{1}{12}\bigl(f_{i+1,j,k} + f_{i-1,j,k} + f_{i,j+1,k} + f_{i,j-1,k} + f_{i,j,k+1} + f_{i,j,k-1}\bigr)\Bigr) + O(h^6) \qquad (10.22)$$

of the three-dimensional elliptic equation

Δ𝑢 = 𝑓 in 𝐷 ⊂ ℝ3 (10.23)

with twice differentiable 𝑓 is of fourth order.

Proof Analogously to the proof of Theorem 10.11, the local truncation error
𝜏𝑖,𝑗,𝑘 of the initial (second-order) discretization

𝐷𝑥2 𝑢𝑖,𝑗,𝑘 + 𝐷𝑦2 𝑢𝑖,𝑗,𝑘 + 𝐷𝑧2 𝑢𝑖,𝑗,𝑘 = 𝑓𝑖,𝑗,𝑘 + 𝜏𝑖,𝑗,𝑘

equals
$$\tau_{i,j,k} = \frac{h^2}{12}\bigl(u_{xxxx} + u_{yyyy} + u_{zzzz}\bigr) + O(h^4).$$
Differentiating (10.23) yields

𝑢𝑥𝑥𝑥𝑥 = 𝑓𝑥𝑥 − 𝑢𝑥𝑥𝑦𝑦 − 𝑢𝑥𝑥𝑧𝑧 ,


𝑢𝑦𝑦𝑦𝑦 = 𝑓𝑦𝑦 − 𝑢𝑥𝑥𝑦𝑦 − 𝑢𝑦𝑦𝑧𝑧 ,
𝑢𝑧𝑧𝑧𝑧 = 𝑓𝑧𝑧 − 𝑢𝑥𝑥𝑧𝑧 − 𝑢𝑦𝑦𝑧𝑧 .

Substituting into 𝜏𝑖,𝑗,𝑘 results in

$$\tau_{i,j,k} = \frac{h^2}{12}\bigl(f_{xx} + f_{yy} + f_{zz}\bigr) - \frac{h^2}{6}\bigl(u_{xxyy} + u_{xxzz} + u_{yyzz}\bigr) + O(h^4).$$
Using this expression for 𝜏𝑖,𝑗,𝑘 in the initial discretization yields

$$D_x^2 u_{i,j,k} + D_y^2 u_{i,j,k} + D_z^2 u_{i,j,k} + \frac{h^2}{6}\bigl(D_x^2 D_y^2 + D_x^2 D_z^2 + D_y^2 D_z^2\bigr) u_{i,j,k}$$
$$= f_{i,j,k} + \frac{h^2}{12}\bigl(D_x^2 + D_y^2 + D_z^2\bigr) f_{i,j,k} + O(h^4), \qquad (10.24)$$
which is of fourth order. Finally, expanding the central-difference operators and
multiplying by ℎ2 yields (10.22). □
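A quick consistency check of (10.22) (our own observation): the left-hand-side stencil must annihilate constant functions, so its 19 coefficients, the center, the 6 face neighbors, and the 12 edge neighbors, must sum to zero, while the right-hand-side weights must sum to one since they form a weighted average of 𝑓.

```julia
# Left-hand-side stencil of (10.22): center -4, six face neighbors 1/3 each,
# twelve edge neighbors 1/6 each.
coeffs = vcat(-4.0, fill(1/3, 6), fill(1/6, 12))
println(sum(coeffs))        # ≈ 0 up to rounding: constants are annihilated

# Right-hand-side weights of (10.22): center 1/2, six face neighbors 1/12 each.
rhs_coeffs = vcat(1/2, fill(1/12, 6))
println(sum(rhs_coeffs))    # ≈ 1 up to rounding
```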

10.6 Finite Volumes

Another approach to the numerical approximation of solutions of pdes is the


finite-volume method. It is suited for pdes in divergence form, i.e., when the
operator is the divergence of an expression involving the unknown, as is often
the case with elliptic and parabolic equations.
The general idea of the finite-volume method is to integrate the pde over fi-
nite volumes or control volumes surrounding each grid point and to employ the
divergence theorem in order to convert the divergence term in the equation into

a surface integral. These surface integrals are fluxes through the surfaces of all
finite volumes or control volumes. The fluxes are conserved by construction, as
the flux through any surface of a finite volume must equal the flux out of an
adjacent volume through the same surface.
The conservation of fluxes is an important feature of the finite-volume method.
The theoretical treatment of the finite-volume method such as convergence
proofs is more complicated compared to the finite-difference method, where Tay-
lor expansions are available, and to the finite-element method.
Here, we derive a finite-volume discretization of the prototypical elliptic pde
in divergence form, namely the two-dimensional Poisson equation

−∇ ⋅ (𝐴∇𝑢) = 𝑓 in 𝑈 ⊂ ℝ2 , (10.25)

for which Theorem 10.8 holds. We first choose a grid spacing ℎ and grid points
(𝑖ℎ, 𝑗ℎ) ∈ 𝑈, where the integers 𝑖 and 𝑗 are chosen such that the grid points lie
in the domain 𝑈. Next, we define the control volumes
$$V_{i,j} := \bigl((i - 1/2)h,\ (i + 1/2)h\bigr) \times \bigl((j - 1/2)h,\ (j + 1/2)h\bigr)$$

surrounding the grid points (see Fig. 10.1). Since the grid points we have defined
lie on an equidistant grid, the control volumes are rectangles. The sought values
are the values
𝑢𝑖,𝑗 ∶= 𝑢(𝑖ℎ, 𝑗ℎ)
of the solution at the grid points. We analogously write 𝐴𝑖,𝑗 ∶= 𝐴(𝑖ℎ, 𝑗ℎ) and
𝑓𝑖,𝑗 ∶= 𝑓(𝑖ℎ, 𝑗ℎ).
By applying the divergence theorem

$$\iint_V \nabla\cdot\mathbf{J}\,\mathrm{d}V = \oint_{\partial V} \mathbf{n}\cdot\mathbf{J}\,\mathrm{d}S,$$

where 𝐧 is the outward unit normal vector, to the control volume 𝑉𝑖,𝑗 , equation
(10.25) becomes
$$-\oint_{\partial V_{i,j}} \mathbf{n}\cdot(A\nabla u)\,\mathrm{d}S = \iint_{V_{i,j}} f\,\mathrm{d}V.$$

Next, we approximate 𝑢𝑥 at 𝑢𝑖+1∕2,𝑗 by

$$u_x\bigl((i + 1/2)h,\ jh\bigr) = \frac{u_{i+1,j} - u_{i,j}}{h} + O(h^2)$$

and analogously on the other edges. This equation follows from applying Tay-
lor’s theorem to 𝑢 around 𝑢𝑖+1∕2,𝑗 with steps ℎ∕2 and −ℎ∕2 to find


Fig. 10.1 The control volume 𝑉𝑖,𝑗 of a finite-volume discretization is shown in red in the center.
The fluxes 𝐹 = 𝐴∇𝑢 are shown as well and are assumed to be constant on the edges of the
control volume.

$$u_{i+1,j} = u_{i+1/2,j} + \frac{h}{2} u_x + \frac{h^2}{2\cdot 4} u_{xx} + \frac{h^3}{6\cdot 8} u_{xxx} + O(h^4),$$
$$u_{i,j} = u_{i+1/2,j} - \frac{h}{2} u_x + \frac{h^2}{2\cdot 4} u_{xx} - \frac{h^3}{6\cdot 8} u_{xxx} + O(h^4),$$
and then subtracting.
Assuming that the matrix-valued function 𝐴 is the diagonal matrix

$$\begin{pmatrix} a^{11}(x, y) & 0 \\ 0 & a^{22}(x, y) \end{pmatrix}$$

everywhere and that 𝐴 is constant on the edges of 𝜕𝑉𝑖,𝑗 , we hence obtain the
discretization

$$-\Bigl( a^{11}_{i+1/2,j}\, h\, \frac{u_{i+1,j} - u_{i,j}}{h} + a^{11}_{i-1/2,j}\, h\, \frac{u_{i-1,j} - u_{i,j}}{h} + a^{22}_{i,j+1/2}\, h\, \frac{u_{i,j+1} - u_{i,j}}{h} + a^{22}_{i,j-1/2}\, h\, \frac{u_{i,j-1} - u_{i,j}}{h} \Bigr) = h^2 f_{i,j} + O(h^3), \qquad (10.26)$$

since the length of all edges is ℎ. If 𝑓 is not constant on the control volume,
integration formulas such as Simpson’s rule are useful.
The discretization simplifies to
$$-\Bigl( a^{11}_{i+1/2,j} u_{i+1,j} + a^{11}_{i-1/2,j} u_{i-1,j} + a^{22}_{i,j+1/2} u_{i,j+1} + a^{22}_{i,j-1/2} u_{i,j-1} - \bigl(a^{11}_{i+1/2,j} + a^{11}_{i-1/2,j} + a^{22}_{i,j+1/2} + a^{22}_{i,j-1/2}\bigr) u_{i,j} \Bigr) = h^2 f_{i,j} + O(h^3), \qquad (10.27)$$

which is of first order.


It is interesting to compare the finite-volume discretization to its finite-differ-
ence cousin in Sect. 10.5.1. The finite-difference discretization (when general-
ized to two dimensions) is of second order, while the finite-volume discretiza-
tion here is of first order. The difference is apparently due to the fact that the
second derivative was approximated in the finite-difference method, but the first
derivative was approximated here in the finite-volume method. But the underly-
ing reason is that the equations are different; in the present section, we consider
(10.25), which contains the matrix-valued coefficient function 𝐴, about whose
smoothness we have only supposed that it is constant on the edges of the control
volumes. In particular, the coefficient function 𝐴 may be discontinuous.
What happens if the coefficient function is smoother? If the coefficient func-
tion 𝐴 is a constant 𝑎 ∈ ℝ+ , then (10.27) simplifies to the discretization

−𝑎(𝑢𝑖+1,𝑗 + 𝑢𝑖−1,𝑗 + 𝑢𝑖,𝑗+1 + 𝑢𝑖,𝑗−1 − 4𝑢𝑖,𝑗 ) = ℎ2 𝑓𝑖,𝑗 + 𝑂(ℎ3 ),

which is of first order, while it can be shown using the approach in Sect. 10.5
that the straightforward finite-difference discretization of 𝑎Δ𝑢 = 𝑓 is

−𝑎(𝑢𝑖+1,𝑗 + 𝑢𝑖−1,𝑗 + 𝑢𝑖,𝑗+1 + 𝑢𝑖,𝑗−1 − 4𝑢𝑖,𝑗 ) = ℎ2 𝑓𝑖,𝑗 + 𝑂(ℎ4 ), (10.28)

which is of second order. Thus, if 𝐴 is constant, the finite-difference and finite-


volume discretizations are identical, but the finite-difference method yields bet-
ter theoretical convergence behavior.
However, the better theoretical convergence prediction of the finite-difference
method is contingent upon sufficient smoothness of the solution 𝑢 (and the
coefficient 𝐴 if applicable) to allow the use of Taylor’s theorem. In the deriva-
tion of the finite-volume discretization, the only assumption on the smoothness
of the solution 𝑢 was that its derivatives 𝑢𝑥 and 𝑢𝑦 can be approximated by

(𝑢𝑖+1,𝑗 − 𝑢𝑖,𝑗 )∕ℎ etc. between grid points; the rest of the calculations involves
surface and volume integrals.
In summary, the main appeal of finite volumes is that they conserve the fluxes
by construction, which is especially important in physical problems such as dif-
fusion and electrostatic problems, where flux conservation is a physical prin-
ciple, i.e., mass conservation and Gauss’ law, respectively. Finite volumes can
also easily deal with coefficient functions 𝐴 that are not constant and with solu-
tions 𝑢 that are less smooth. Another difference between finite differences and
finite volumes is that finite volumes are amenable to better approximations of
the right-hand side via
∬ 𝑓d𝑉,
𝑉𝑖,𝑗

making it possible to conserve ∬𝑈 𝑓d𝑉 when implemented carefully.


It is instructive to implement the finite-volume discretization (10.27) and to discuss some examples. We represent the Dirichlet boundary conditions 𝑢 = 𝑔 on 𝜕𝑈 by a function g that is evaluated only at the boundary. Again, each row in the system matrix corresponds to one equation (10.27) for one unknown value 𝑢𝑖,𝑗. The helper function ind takes a two-dimensional index (𝑖, 𝑗) with 𝑖 ∈ {0, … , 𝑁} and 𝑗 ∈ {0, … , 𝑁} and turns it into a linear, one-dimensional index running from 1 to (𝑁 + 1)², which is useful for indexing the matrix and the vector of the discretized system of equations. Points with 𝑖 = 0, 𝑖 = 𝑁, 𝑗 = 0, or 𝑗 = 𝑁 lie on the boundary of the square domain 𝑈 ∶= (𝑎1, 𝑏1) × (𝑎2, 𝑏1 + 𝑎2 − 𝑎1).
import LinearAlgebra
import SparseArrays

function elliptic_FV_2D(a1::Float64, b1::Float64, a2::Float64,
                        a11::Function, a22::Function, f::Function,
                        g::Function, N::Int)::Array{Float64, 2}
    local h = (b1-a1) / N
    local fs = Vector{Float64}(undef, (N+1)^2)
    local A = SparseArrays.spzeros((N+1)^2, (N+1)^2)

    function ind(i::Int, j::Int)::Int
        1 + i + j*(N+1)
    end

    for i in 0:N
        for j in 0:N
            if i == 0 || i == N || j == 0 || j == N
                A[ind(i, j), ind(i, j)] = 1.0
                fs[ind(i, j)] = g(a1 + i*h, a2 + j*h)
            else
                A[ind(i, j), ind(i+1, j)] = - a11(a1 + (i+1/2)*h, a2 + j*h)
                A[ind(i, j), ind(i-1, j)] = - a11(a1 + (i-1/2)*h, a2 + j*h)
                A[ind(i, j), ind(i, j+1)] = - a22(a1 + i*h, a2 + (j+1/2)*h)
                A[ind(i, j), ind(i, j-1)] = - a22(a1 + i*h, a2 + (j-1/2)*h)

                A[ind(i, j), ind(i, j)] += a11(a1 + (i+1/2)*h, a2 + j*h)
                A[ind(i, j), ind(i, j)] += a11(a1 + (i-1/2)*h, a2 + j*h)
                A[ind(i, j), ind(i, j)] += a22(a1 + i*h, a2 + (j+1/2)*h)
                A[ind(i, j), ind(i, j)] += a22(a1 + i*h, a2 + (j-1/2)*h)

                fs[ind(i, j)] = h^2 * f(h, a1 + i*h, a2 + j*h)
            end

            local s = sum(A[ind(i, j), :])
            @assert isapprox(s, 0.0, atol = 5e-15) ||
                isapprox(s, 1.0, atol = 5e-15)
        end
    end

    reshape(A \ fs, N+1, N+1)
end

Here, the strategy for implementing the boundary conditions is to explicitly include the equations 𝑢𝑖,𝑗 = 𝑔𝑖,𝑗 for the unknowns on the boundary, i.e., for 𝑖 = 0, 𝑖 = 𝑁, 𝑗 = 0, or 𝑗 = 𝑁. This strategy leads to shorter code than substituting the values on the boundary into the system while considering all cases.
The assertion follows from (10.27) and is useful to check that the implemen-
tation is correct.
The next function is useful to assess the accuracy of the solution in examples
where the exact solution is known.
function test_elliptic_FV_2D(u_exact::Function,
                             a1::Float64, b1::Float64, a2::Float64,
                             a11::Function, a22::Function,
                             f::Function, N::Int)::Float64
    local h = (b1-a1) / N
    local u_num = elliptic_FV_2D(a1, b1, a2, a11, a22, f, u_exact, N)
    local u_ex = [u_exact(a1 + i*h, a2 + j*h) for i in 0:N, j in 0:N]

    LinearAlgebra.norm(u_num - u_ex, Inf)
end

In the first example, we set 𝑈 ∶= (−𝜋, 𝜋)², 𝑢(𝑥, 𝑦) ∶= cos 𝑥 cos 𝑦, and

$$A(x, y) := \begin{pmatrix} 2 + \cos y & 0 \\ 0 & 2 + \cos x \end{pmatrix},$$

which results in 𝑓(𝑥, 𝑦) = (4 + cos 𝑥 + cos 𝑦) cos 𝑥 cos 𝑦. We calculate some solutions, each time multiplying 𝑁 by 2 or ℎ by 1∕2.
for i in 0:5
    local N = 10 * 2^i
    local error =
        test_elliptic_FV_2D((x, y) -> cos(x) * cos(y),
                            -pi, Float64(pi), -pi,
                            (x, y) -> 2 + cos(y),
                            (x, y) -> 2 + cos(x),
                            (h, x, y) -> (4 + cos(x) + cos(y)) * cos(x) * cos(y),
                            N)
    Printf.@printf("N = %3d: error = %.5e\n", N, error)
end

N =  10: error = 5.13949e-02
N =  20: error = 1.25842e-02
N =  40: error = 3.12985e-03
N =  80: error = 7.81456e-04
N = 160: error = 1.95301e-04
N = 320: error = 4.88214e-05

We observe that the error is multiplied by a factor approximately equal to 1∕4 in each step, which implies a second-order convergence rate; this is consistent with the discussion above for smooth coefficient functions 𝐴.
In the second example, we consider a solution 𝑢 that is not differentiable. This example is one-dimensional, but we still use our two-dimensional implementation. We set 𝑈 ∶= (−1, 1)²,

$$u(x, y) := \begin{cases} x, & x < 0, \\ 2x, & x \ge 0, \end{cases}$$

and

$$A(x, y) := \begin{cases} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, & x < 0, \\ \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}, & x \ge 0, \end{cases}$$
which results in constant 𝐴∇𝑢 and 𝑓(𝑥, 𝑦) = 0. This example is covered by the
theory in Sect. 10.2.3. In electrostatics, this example corresponds to material with
a jump discontinuity in the permittivity, which results in a jump in the derivative
of the solution. We again calculate some solutions, each time multiplying ℎ by
1∕2.
for i in 0:5
    local N = 10 * 2^i
    local error =
        test_elliptic_FV_2D((x, y) -> if x <= 0; x else 2*x end,
                            -1.0, 1.0, -1.0,
                            (x, y) -> if x <= 0; 1 else 1/2 end,
                            (x, y) -> if x <= 0; 1 else 1/2 end,
                            (h, x, y) -> 0.0,
                            N)
    Printf.@printf("N = %3d: error = %.5e\n", N, error)
end

N =  10: error = 5.55112e-16
N =  20: error = 4.44089e-16
N =  40: error = 1.33227e-15
N =  80: error = 7.77156e-16
N = 160: error = 3.55271e-15
N = 320: error = 5.77316e-15
These surprisingly accurate results are explained by the simple structure of the solution 𝑢, which is linear except at the line 𝑥 = 0, and by the fact that the control volumes are perfectly aligned with the line 𝑥 = 0, implying that the discretization error indeed vanishes. The only sources of error are solving the linear system and floating-point errors.
Next, we redefine the domain to 𝑈 ∶= (−1, 2)2 .
for i in 0:5
    local N = 10 * 2^i
    local error =
        test_elliptic_FV_2D((x, y) -> if x <= 0; x else 2*x end,
                            -1.0, 2.0, -1.0,
                            (x, y) -> if x <= 0; 1 else 1/2 end,
                            (x, y) -> if x <= 0; 1 else 1/2 end,
                            (h, x, y) -> 0.0,
                            N)
    Printf.@printf("N = %3d: error = %.5e\n", N, error)
end

N =  10: error = 6.09231e-02
N =  20: error = 3.51955e-02
N =  40: error = 1.74043e-02
N =  80: error = 9.00100e-03
N = 160: error = 4.48787e-03
N = 320: error = 2.26274e-03
In each step, the error is multiplied by approximately 1∕2, and we finally observe
the first-order convergence we predicted in (10.27). This is the general case.
In the third example, we consider a solution 𝑢 that is not differentiable, but
now the jump in the derivative of 𝑢 is not offset by the coefficient 𝐴, but corre-
sponds to a Dirac delta distribution on the right-hand side. We set 𝑈 ∶= (−1, 1)²,

$$u(x, y) := \begin{cases} -x, & x < 0, \\ x, & x \ge 0, \end{cases}$$

and

$$A(x, y) := \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$

which results in 𝑓(𝑥, 𝑦) = −2𝛿(𝑥), where 𝛿 is the Dirac delta distribution. Its
defining characteristic is the equality

∫ 𝛿(𝑥)d𝑥 = 1, (10.29)

meaning that it can concentrate an integral of value 1 at a single point 𝑥 = 0.


In electrostatics, this example corresponds to two unit charges placed at 𝑥 = 0,
which results in a jump in the derivative of the solution. We again calculate some
solutions, each time multiplying ℎ by 1∕2.
In order to implement the Dirac delta distribution, we note that if it takes
the value 1∕ℎ on the control volume (better called control interval in this one-
dimensional case) of length ℎ that contains 0, then (10.29) is satisfied. We can
also arrive at the same conclusion from (10.26): in our one-dimensional ex-
ample, we have −2ℎ = ℎ2 𝑓𝑖,𝑗 + 𝑂(ℎ3 ) and hence 𝑓𝑖,𝑗 ∶= −2∕ℎ because of
(𝑢𝑖+1,𝑗 − 𝑢𝑖,𝑗 )∕ℎ = 1 and (𝑢𝑖−1,𝑗 − 𝑢𝑖,𝑗 )∕ℎ = 1.
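This construction can be checked in isolation (a small sketch of our own): placing the value 1∕ℎ on the single grid point whose control interval contains 0 makes the discrete integral of the delta approximation equal to 1, as (10.29) requires.

```julia
# Discrete Dirac delta on an equidistant grid with spacing h: the value 1/h on
# the control interval containing 0 and 0 elsewhere.
delta_h(x, h) = abs(x) < h/2 ? 1/h : 0.0

N = 100
h = 2.0 / N                          # grid on the interval (-1, 1)
xs = [-1 + i*h for i in 0:N]
println(sum(delta_h(x, h) * h for x in xs))   # ≈ 1
```

Exactly one grid point lies in the interval (−ℎ∕2, ℎ∕2), so the sum reduces to (1∕ℎ) · ℎ = 1.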
for i in 0:5
    local N = 10 * 2^i
    local error =
        test_elliptic_FV_2D((x, y) -> if x <= 0; -x else x end,
                            -1.0, 1.0, -1.0,
                            (x, y) -> 1,
                            (x, y) -> 1,
                            (h, x, y) -> if abs(x) < h/2 - sqrt(eps(Float64))
                                -2/h
                            else
                                0
                            end,
                            N)
    Printf.@printf("N = %3d: error = %.5e\n", N, error)
end

N =  10: error = 4.44089e-16
N =  20: error = 3.33067e-16
N =  40: error = 6.66134e-16
N =  80: error = 5.55112e-16
N = 160: error = 1.33227e-15
N = 320: error = 4.21885e-15

As in the second example, the surprising accuracy is due to the linearity of the
solution and the perfect alignment of the control volumes with the line 𝑥 = 0,
where the derivative of the solution jumps. This numerical result validates our
implementation of the Dirac delta distribution.
Finally, we redefine the domain to 𝑈 ∶= (−1, 2)2 .
for i in 0:5
    local N = 10 * 2^i
    local error =
        test_elliptic_FV_2D((x, y) -> if x <= 0; -x else x end,
                            -1.0, 2.0, -1.0,
                            (x, y) -> 1,
                            (x, y) -> 1,
                            (h, x, y) -> if abs(x) < h/2 - sqrt(eps(Float64))
                                -2/h
                            else
                                0
                            end,
                            N)
    Printf.@printf("N = %3d: error = %.5e\n", N, error)
end

N =  10: error = 9.91258e-02
N =  20: error = 5.40759e-02
N =  40: error = 2.75139e-02
N =  80: error = 1.40401e-02
N = 160: error = 7.04891e-03
N = 320: error = 3.54226e-03

In each step, the error is again multiplied by approximately 1∕2, which is again
consistent with the general case of first-order convergence as predicted in (10.27).
Finally, we note that the implementation of the boundary conditions may
require thought and effort. While Dirichlet boundary conditions are relatively
straightforward to implement, Neumann boundary conditions generally require
an approximation of the directional derivative. If this approximation is not good
enough, it may reduce the convergence order, although the discretization in the
interior would support a higher convergence order.

10.7 Finite Elements

We have seen in the previous section that the use of integration in the finite-
volume method was advantageous compared to finite differences, as it made it
possible to relax the assumptions on the smoothness of the solution. Finite el-
ements take these considerations further. We have already laid the foundation
for finite elements above in Sect. 10.2.3; finite elements are just the numerical
implementation of weak solutions. (If you skipped Sect. 10.2.3, the main points
are explained in the following again for the purposes of finite elements.)
We again use the elliptic boundary-value problem

−∇ ⋅ (𝐴∇𝑢) = 𝑓 in 𝑈 ⊂ ℝ𝑑 , (10.30a)
𝑢=0 on 𝜕𝑈 (10.30b)

in divergence form as the leading equation to show how finite-element dis-


cretizations follow from weak solutions. As will become apparent after the integration-by-parts step, we seek solutions 𝑢 in 𝐻₀¹(𝑈), which is – simply put – the function space of all locally summable functions whose weak first (indicated by the index 1) partial derivatives are square integrable and that vanish on the boundary 𝜕𝑈 (indicated by the index 0); the reader is referred to [1, Section 5.2] for details. We first multiply (10.30) by an arbitrary test function 𝑣 chosen from 𝐻₀¹(𝑈). The reason why we use the same function space for the test function will also become apparent after the step in which we integrate by parts.
Hence we seek solutions 𝑢 ∈ 𝐻₀¹(𝑈) such that

$$-\int_U \nabla\cdot(A\nabla u)\, v \,\mathrm{d}V = \int_U f v \,\mathrm{d}V \qquad \forall v \in H_0^1(U).$$

Can we infer (10.30) from this equality? The answer is positive and is provided by the fundamental lemma of variational calculus, Theorem 10.1, which states that since this equality holds for sufficiently many test functions 𝑣 (namely all elements of 𝐻₀¹(𝑈)), the other factors −∇ ⋅ (𝐴∇𝑢) and 𝑓 in the integrands must agree. The intuitive explanation is that if ∫𝑈 𝑤𝑣 d𝑉 = 0 holds for sufficiently many functions 𝑣, especially those with tiny support that allow one to zoom into smaller and smaller regions, then the function 𝑤 must vanish.
Next, we use integration by parts – called Green's first identity in this case (see Problem 10.21) – to see that the problem is equivalent to finding solutions 𝑢 ∈ 𝐻₀¹(𝑈) such that

$$\int_U (A\nabla u)\cdot\nabla v \,\mathrm{d}V = \int_U f v \,\mathrm{d}V \qquad \forall v \in H_0^1(U) \qquad (10.31)$$

holds, since the boundary term vanishes because of 𝐻₀¹(𝑈) ∋ 𝑣 = 0 on the boundary 𝜕𝑈. This formulation relaxes the requirements on the smoothness of 𝑢 and explains the choice of function spaces for both 𝑢 and 𝑣. The first derivatives of both 𝑢 and 𝑣 are required to exist only in the weak sense (since they appear in the integrand), whereas a classical solution 𝑢 of (10.30) must be twice differentiable in the whole domain. Due to the symmetry of ∇𝑢 and ∇𝑣 appearing in the integrand, we can choose 𝐻¹(𝑈) as the function space for both the solution 𝑢 and the test function 𝑣. Moreover, the zero Dirichlet boundary conditions are incorporated by choosing 𝐻₀¹(𝑈).
Solutions 𝑢 that satisfy (10.31) are called weak solutions. In its most
general form, a weak formulation is the task of finding a function 𝑢 ∈ 𝑊, the
weak solution, such that
𝑎(𝑢, 𝑣) = 𝐹(𝑣) ∀𝑣 ∈ 𝑉 (10.32)
holds. Here the Hilbert spaces 𝑉 and 𝑊 are the space of test functions and the
solution space, respectively.
Up to now we have only explained the concept of a weak solution, but finite
elements are just around the corner. Since we cannot perform numerical calcu-
lations using elements of the infinite-dimensional function spaces 𝑉 and 𝑊, we
restrict both the test space 𝑉 and the solution space 𝑊 to much simpler, finite-
dimensional function spaces 𝑉ℎ and 𝑊ℎ that are both amenable to calculations
and that will result in an algebraic system of equations with a unique solution at
the same time. The index ℎ indicates the fineness of these discretized function
spaces and will become concrete in the example below. For now, it is important
that a smaller ℎ implies more elements in 𝑉ℎ and 𝑊ℎ , i.e., a higher-dimensional
function space that can approximate functions in 𝑉 and 𝑊 better. The parame-
ter ℎ is inversely proportional to the dimension of 𝑉ℎ and 𝑊ℎ .
We can make this notion more precise by considering the family

{𝑉ℎ ⊂ 𝑉 ∣ ℎ ∈ ℝ+ }

of finite-dimensional subspaces 𝑉ℎ of 𝑉. We will choose the subspaces 𝑉ℎ such


that

    ∀𝑣 ∈ 𝑉 ∶ lim_{ℎ→0} inf_{𝑣ℎ∈𝑉ℎ} ‖𝑣ℎ − 𝑣‖_𝑉 = 0.   (10.33)

Since 𝑉ℎ ⊂ 𝑉 for all ℎ ∈ ℝ+ , an approximation using the subspaces 𝑉ℎ is called


an internal approximation.
After choosing the function spaces 𝑉ℎ and 𝑊ℎ , the weak formulation (10.32)
becomes the task to find a function 𝑢ℎ ∈ 𝑊ℎ such that

𝑎(𝑢ℎ , 𝑣ℎ ) = 𝐹(𝑣ℎ ) ∀𝑣ℎ ∈ 𝑉ℎ . (10.34)

The solution 𝑢ℎ ∈ 𝑊ℎ of this problem is called the finite-element or Galerkin


approximation of the solution 𝑢 of (10.32).
In general, many different choices for such discretized function spaces 𝑉ℎ and
𝑊ℎ are possible, and we use the choice that is prototypical for finite elements
here. We consider the one-dimensional case, denote the domain by 𝑈 ∶= (𝑎, 𝑏),
and divide it into 𝑁 intervals which form a partition of the whole domain and
are called the finite elements. We denote the length of the largest interval by ℎ.
A set of all such intervals or finite elements with a certain value ℎ is called a
triangulation 𝑇ℎ , and ℎ is called the fineness of the triangulation. (The name
triangulation stems of course from the two-dimensional case.)
The intervals can be chosen to be of the same length

ℎ ∶= (𝑏 − 𝑎)∕𝑁

after choosing 𝑁 in order to simplify the implementation. We can then define


equidistant points
𝑥𝑖 ∶= 𝑎 + 𝑖ℎ
for 𝑖 ∈ {0, … , 𝑁} such that there are 𝑁 intervals of length ℎ and 𝑁 + 1 points 𝑥𝑖 ,
but in general the points 𝑥𝑖 are not necessarily equidistant.
The so-called hat functions 𝜙𝑗ℎ, 𝑗 ∈ {0, … , 𝑁}, are piecewise linear functions
that have the value 1 at exactly one point, namely 𝑥𝑗, and that vanish at all
other points 𝑥𝑖, i.e.,

    𝜙𝑗ℎ(𝑥𝑖) ∶= { 1,  𝑖 = 𝑗,
               { 0,  𝑖 ≠ 𝑗.
Make sure to visualize the triangular shape of these 𝑁 + 1 functions.
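The hat functions are easy to write down in closed form on an equidistant grid. The following short sketch (not from the book; the helper name hat is ours) evaluates 𝜙𝑗ℎ at an arbitrary point 𝑥:

```julia
# Sketch (not the book's code): the hat function phi_j^h on the equidistant
# grid x_i = a + i*h, evaluated at an arbitrary point x. It equals 1 at x_j,
# decays linearly to 0 at x_{j-1} and x_{j+1}, and vanishes elsewhere.
function hat(j::Int, x::Float64, a::Float64, h::Float64)::Float64
    max(0.0, 1.0 - abs((x - (a + j*h)) / h))
end
```

Plotting hat(j, x, a, h) over x for a few values of j shows the triangular shapes mentioned above.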
Next, we define the function spaces
    𝑃ℎ(𝑈) ∶= { ∑_{𝑗=0}^{𝑁} 𝛼𝑗 𝜙𝑗ℎ | 𝛼𝑗 ∈ ℝ ∀𝑗 ∈ {0, … , 𝑁} },

    𝑃ℎ0(𝑈) ∶= { ∑_{𝑗=1}^{𝑁−1} 𝛼𝑗 𝜙𝑗ℎ | 𝛼𝑗 ∈ ℝ ∀𝑗 ∈ {1, … , 𝑁 − 1} }

for any ℎ ∈ ℝ+ , which are both sets of all linear combinations of certain hat
functions. The difference between the two function spaces is that all functions
in 𝑃ℎ0 (𝑈) vanish on both endpoints 𝑎 and 𝑏 of the interval domain 𝑈 = (𝑎, 𝑏).
The function space 𝑃ℎ(𝑈) satisfies the required approximation property
(10.33) for 𝐻¹(𝑈), and 𝑃ℎ0(𝑈) satisfies it for 𝐻₀¹(𝑈). Therefore, we can choose

𝑉ℎ ∶= 𝑃ℎ0 (𝑈)

as the discretized, finite-dimensional space of test functions and

𝑊ℎ ∶= 𝑃ℎ0 (𝑈) = 𝑉ℎ

as the discretized, finite-dimensional solution space for solving (10.30), which


requires that the solutions vanish on the boundary.
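To make the definition of 𝑃ℎ0(𝑈) concrete, the following sketch (ours, not from the book; all names are hypothetical) evaluates a linear combination of the interior hat functions:

```julia
# Sketch (not the book's code): evaluate u = sum_{j=1}^{N-1} alpha_j * phi_j^h,
# an element of P_h0(U) on U = (a, b) with N equal elements. alpha holds the
# interior coefficients alpha_1, ..., alpha_{N-1}, so u vanishes at a and b.
function eval_Ph0(alpha::Vector{Float64}, x::Float64,
                  a::Float64, b::Float64)::Float64
    N = length(alpha) + 1
    h = (b - a) / N
    phi(j) = max(0.0, 1.0 - abs((x - (a + j*h)) / h))  # hat function phi_j^h
    sum(alpha[j] * phi(j) for j in 1:N-1)
end
```

Evaluating at the grid points returns the coefficients themselves, and the endpoints 𝑎 and 𝑏 always yield zero, matching the zero Dirichlet boundary conditions.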
Now how can we calculate the finite-element approximation 𝑢ℎ in (10.34)?
We denote the basis functions of the 𝑀-dimensional space 𝑉ℎ by 𝜓𝑗 and can
hence write the solution as the linear combination

    𝑢ℎ = ∑_{𝑗=0}^{𝑁} 𝑢𝑗 𝜓𝑗 ,   (10.35)

where the 𝑢𝑗 ∈ ℝ are unknown coefficients for 𝑗 ∈ {0, … , 𝑁}. In our example,
the basis functions are hat functions. Satisfying the weak formulation (10.34)
(for all test functions 𝑣ℎ ∈ 𝑉ℎ) is equivalent to satisfying the equation for all 𝑀
basis functions 𝜓𝑖 , i.e., we seek solutions 𝑢ℎ ∈ 𝑊ℎ such that

    𝑎(𝑢ℎ , 𝜓𝑖 ) = 𝑎(∑_{𝑗=0}^{𝑁} 𝑢𝑗 𝜓𝑗 , 𝜓𝑖 ) = ∑_{𝑗=0}^{𝑁} 𝑢𝑗 𝑎(𝜓𝑗 , 𝜓𝑖 ) = 𝐹(𝜓𝑖 )   ∀𝑖 ∈ {0, … , 𝑁}.

After defining an (𝑁 + 1) × (𝑁 + 1) matrix 𝑀 by setting

    𝑚𝑖𝑗 ∶= 𝑎(𝜓𝑗 , 𝜓𝑖 )

and a vector 𝐳 by setting 𝑧𝑖 ∶= 𝐹(𝜓𝑖 ), this condition becomes the linear system
of equations

    𝑀𝐮 = 𝐳   (10.36)

for the unknown vector 𝐮 = (𝑢0 , … , 𝑢𝑁 ). This linear system of equations is the
finite-element discretization we will implement below.
Before we do so, we however pose the question whether the linear system
(10.36) has a unique solution. If we can answer this question positively, then
our confidence in the whole finite-element procedure will be much increased.
It is clear that we must ensure that our original equation (10.30) has a unique
solution 𝑢 to begin with. To do so, we assume that the assumptions of the Lax–
Milgram theorem, Theorem 10.5, are satisfied.
The first argument that a unique solution 𝐮 of (10.36) exists is an algebraic
one.

Theorem 10.13 Suppose that the assumptions of Theorem 10.5 hold for the
boundary-value problem (10.30). Then the matrix 𝑀 in (10.36) is positive definite.

Proof The matrix 𝑀 being positive definite is equivalent to

    𝐯⊤𝑀𝐯 > 0   ∀𝐯 ∈ ℝ^{𝑁+1}∖{𝟎}

(see Definition 8.9). For any vector 𝐯 ∈ ℝ^{𝑁+1}, we define the function

    𝑣 ∶= ∑_{𝑖=0}^{𝑁} 𝑣𝑖 𝜓𝑖 ∈ 𝑉ℎ

and calculate

    𝐯⊤𝑀𝐯 = ∑_{𝑖=0}^{𝑁} ∑_{𝑗=0}^{𝑁} 𝑣𝑖 𝑣𝑗 𝑎(𝜓𝑗 , 𝜓𝑖 ) = 𝑎(𝑣, 𝑣) ≥ 𝛽‖𝑣‖²_𝑉 > 0

unless 𝐯 = 𝟎. The first inequality holds since the bilinear form 𝑎 is coercive
(with constant 𝛽) by assumption. □

Since every positive definite matrix is invertible, the linear system (10.36) has
a unique solution.
The second argument that a unique solution 𝐮 of (10.36) exists is the follow-
ing. For the elliptic equation (10.30), the solution space and the space of the test
functions are identical.

Theorem 10.14 (Céa’s lemma) Define 𝑉 ∶= 𝐻₀¹(𝑈) and suppose that the as-
sumptions of Theorem 10.5 hold for the boundary-value problem (10.30): the bilin-
ear form 𝑎 is bounded with constant 𝛼 and coercive with constant 𝛽. Then there
exists a unique solution 𝑢 ∈ 𝑉 of (10.30) and a unique solution 𝑢ℎ ∈ 𝑉ℎ of (10.34).
Furthermore, the inequalities

    ‖𝑢ℎ‖_𝑉 ≤ (1∕𝛽) ‖𝐹‖_𝑉′

(stability) and

    ‖𝑢ℎ − 𝑢‖_𝑉 ≤ (𝛼∕𝛽) inf_{𝑣ℎ∈𝑉ℎ} ‖𝑣ℎ − 𝑢‖_𝑉   (10.37)

(convergence; Céa’s lemma) hold.

The first inequality means that the solutions 𝑢ℎ are stable for varying ℎ, as
the norm of every solution 𝑢ℎ is bounded by a constant that does not depend
on ℎ. The second inequality is called Céa’s lemma and will be important for the
convergence of the approximations 𝑢ℎ to the exact solution 𝑢 in Theorem 10.15.
Proof Since 𝑉ℎ ⊂ 𝑉 is a subspace of 𝑉, Theorem 10.5 implies that there exists
a unique solution 𝑢ℎ ∈ 𝑉ℎ of (10.34).
Since the bilinear form 𝑎 stemming from (10.30) is coercive with constant 𝛽,
we find
    𝛽‖𝑢ℎ‖²_𝑉 ≤ 𝑎(𝑢ℎ , 𝑢ℎ ) = 𝐹(𝑢ℎ ) ≤ ‖𝐹‖_𝑉′ ‖𝑢ℎ‖_𝑉 ,
which immediately implies the first inequality.
To show the second inequality, we first subtract (10.32) (for all 𝑣ℎ ∈ 𝑉ℎ ⊂ 𝑉)
from (10.34) to find

𝑎(𝑢ℎ − 𝑢, 𝑣ℎ ) = 0 ∀𝑣ℎ ∈ 𝑉ℎ .

Since the bilinear form 𝑎 is coercive, we can calculate

    𝛽‖𝑢ℎ − 𝑢‖²_𝑉 ≤ 𝑎(𝑢ℎ − 𝑢, 𝑢ℎ − 𝑢) = 𝑎(𝑢ℎ − 𝑢, 𝑢ℎ + 𝑣ℎ − 𝑢)   ∀𝑣ℎ ∈ 𝑉ℎ

using this equality. Next, we define 𝑤ℎ ∶= 𝑢ℎ + 𝑣ℎ and note that statements
that hold for all 𝑣ℎ ∈ 𝑉ℎ are equivalent to statements that hold for all 𝑤ℎ ∈ 𝑉ℎ
because 𝑢ℎ + 𝑉ℎ = 𝑉ℎ (and 𝑢ℎ is fixed). Therefore we have

    𝛽‖𝑢ℎ − 𝑢‖²_𝑉 ≤ 𝑎(𝑢ℎ − 𝑢, 𝑤ℎ − 𝑢)   ∀𝑤ℎ ∈ 𝑉ℎ .

Since the bilinear form 𝑎 is bounded, we furthermore find

    𝛽‖𝑢ℎ − 𝑢‖²_𝑉 ≤ 𝛼‖𝑢ℎ − 𝑢‖_𝑉 ‖𝑤ℎ − 𝑢‖_𝑉   ∀𝑤ℎ ∈ 𝑉ℎ ,

which yields the second inequality. □

Under the reasonable assumption (10.33) that the subspaces 𝑉ℎ become bet-
ter and better approximations of the function space 𝑉, Theorem 10.14 imme-
diately shows that the approximations 𝑢ℎ ∈ 𝑉ℎ converge to the exact solu-
tion 𝑢 ∈ 𝑉 as ℎ → 0. This is expressed by the following theorem.
Theorem 10.15 (convergence) Suppose 𝑉ℎ ⊂ 𝑉 are subspaces of 𝑉 such that
(10.33) holds and suppose that the assumptions of Theorem 10.14 hold. Then

    lim_{ℎ→0} ‖𝑢ℎ − 𝑢‖_𝑉 = 0

holds.

Proof Inequality (10.37) and equation (10.33) yield

    lim_{ℎ→0} ‖𝑢ℎ − 𝑢‖_𝑉 ≤ (𝛼∕𝛽) lim_{ℎ→0} inf_{𝑣ℎ∈𝑉ℎ} ‖𝑣ℎ − 𝑢‖_𝑉 = 0,

showing convergence. □
After these theoretic considerations, we now implement the finite-element
discretization (10.36) in a one-dimensional example. We use the domain 𝑈 ∶=
(𝑎, 𝑏), the equidistant grid points defined above, and the linear hat functions 𝜙𝑖ℎ
defined above on these finite elements. We assume that the coefficient matrix 𝐴
in (10.30) is piecewise constant on the 𝑁 finite elements or intervals with the
value 𝑎𝑖 ∈ ℝ, 𝑖 ∈ {1, … , 𝑁}, on the interval (𝑎 + (𝑖 − 1)ℎ, 𝑎 + 𝑖ℎ).
Sketching the hat functions and calculating the integral in the bilinear form 𝑎
yields

                          ⎧ (𝑎𝑖 + 𝑎𝑖+1)∕ℎ ,   𝑖 = 𝑗,
    𝑚𝑖𝑗 = 𝑎(𝜙𝑗ℎ , 𝜙𝑖ℎ ) = ⎨ −𝑎max(𝑖,𝑗)∕ℎ ,    |𝑖 − 𝑗| = 1,   (10.38)
                          ⎩ 0,                |𝑖 − 𝑗| ≥ 2

whenever 0 < 𝑖 < 𝑁 and 0 < 𝑗 < 𝑁. Obviously, the system matrix 𝑀 is sparse,
and thus it is instrumental to use linear solvers that take advantage of its sparsity
in realistic applications.
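As a quick plausibility check (our own sketch, not from the book), note that for the constant coefficient 𝑎𝑖 ≡ 1 the interior entries (10.38) reduce to the familiar tridiagonal stencil with 2∕ℎ on the diagonal and −1∕ℎ on the off-diagonals:

```julia
import LinearAlgebra

# Interior part of the matrix (10.38) for the constant coefficient a_i = 1:
# diagonal entries (1 + 1)/h = 2/h, off-diagonal entries -1/h; all entries
# with |i - j| >= 2 vanish, so a symmetric tridiagonal type suffices.
N = 4
h = 0.25
M_interior = LinearAlgebra.SymTridiagonal(fill(2/h, N-1), fill(-1/h, N-2))
```

Storing the matrix in such a banded or sparse type is what allows the specialized linear solvers mentioned above to exploit the sparsity.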
The vector 𝐳 on the right side of (10.36) has the elements

    𝑧𝑖 = 𝐹(𝜙𝑖 ) = ∫_𝑎^𝑏 𝑓𝜙𝑖 d𝑉.

Again assuming that the right-hand side 𝑓 is constant on the finite elements, we
set

    𝑧𝑖 ∶= (ℎ∕2) (𝑓𝑖 + 𝑓𝑖+1 ).
In general, good approximations of this integral are important.
Dirichlet boundary conditions can be implemented by noting that the form
(10.35) of the approximation 𝑢ℎ as a linear combination of hat functions yields

𝑢ℎ (𝑎) = 𝑢0 ,
𝑢ℎ (𝑏) = 𝑢𝑁 .

Regarding the implementation of Neumann boundary conditions (see Prob-
lem 10.24), we note that the outward unit normal derivatives of 𝑢ℎ are given
by the formulas

    𝐧 ⋅ ∇𝑢ℎ (𝑎) = (1∕ℎ) (𝑢0 − 𝑢1 ),
    𝐧 ⋅ ∇𝑢ℎ (𝑏) = (1∕ℎ) (𝑢𝑁 − 𝑢𝑁−1 ),

which follow immediately by differentiating (10.35).
The function elliptic_FE_1D implements the discretization (10.36). The pur-
pose of the function ind is to translate the index 𝑖 in the discussion above to a
linear index of Julia vectors and arrays, which always start at index one. (An-
other option is to use the package OffsetArrays.)

    import SparseArrays

    function elliptic_FE_1D(a::Float64, b::Float64, A::Function,
                            f::Function, g::Function,
                            N::Int)::Vector{Float64}
        local h = (b-a) / N
        local z = Vector{Float64}(undef, N+1)
        local M = SparseArrays.spzeros(N+1, N+1)

        function ind(i::Int)::Int
            1 + i
        end

        for i in 0:N
            if i == 0 || i == N
                M[ind(i), ind(i)] = 1.0
                z[ind(i)] = g(a + i*h)
            else
                M[ind(i), ind(i-1)] = -A(a + (i-1/2)*h)
                M[ind(i), ind(i)]   = A(a + (i-1/2)*h) + A(a + (i+1/2)*h)
                M[ind(i), ind(i+1)] = -A(a + (i+1/2)*h)

                z[ind(i)] = (h^2/2) * (f(a + (i-1/2)*h)
                                       + f(a + (i+1/2)*h))
            end
        end

        M \ z
    end

The next function test_elliptic_FE_1D calculates the maximum norm of
the difference between a known exact solution and its numerical approximation.

    import LinearAlgebra

    function test_elliptic_FE_1D(u_exact::Function,
                                 a::Float64, b::Float64,
                                 A::Function, f::Function,
                                 N::Int)::Float64
        local h = (b-a) / N
        local u_ex = Float64[u_exact(a + i*h) for i in 0:N]
        local u_num = elliptic_FE_1D(a, b, A, f, u_exact, N)

        LinearAlgebra.norm(u_num - u_ex, Inf)
    end

In the following numerical example, we set 𝑈 ∶= (−𝜋, 𝜋), 𝑢 ∶= cos, and
𝐴(𝑥) ∶= 2 + cos(𝑥), which results in 𝑓(𝑥) ∶= 2 cos²(𝑥) + 2 cos(𝑥) − 1.

    import Printf

    for i in 0:5
        local N = 10 * 2^i
        local error =
            test_elliptic_FE_1D(cos,
                                -pi, Float64(pi),
                                x -> 2 + cos(x),
                                x -> 2*cos(x)^2 + 2*cos(x) - 1,
                                N)
        Printf.@printf("N = %3d: error = %.5e\n", N, error)
    end

    N =  10: error = 2.78004e-02
    N =  20: error = 6.24532e-03
    N =  40: error = 1.52113e-03
    N =  80: error = 3.77824e-04
    N = 160: error = 9.43483e-05
    N = 320: error = 2.35835e-05

Since the error is multiplied by a factor of approximately 1∕4 when ℎ is multi-
plied by 1∕2, we observe second-order convergence in this example, where the
functions 𝐴, 𝑓, and 𝑢 are very smooth. Comparing the resulting discretization
(10.38) with the finite-difference and finite-volume discretizations immediately
shows that this finite-element discretization is of second order.
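The observed convergence order can be estimated directly from the errors in the table above (a short sketch of ours, not from the book):

```julia
# Sketch (not the book's code): estimate the convergence order from the
# errors above, obtained by successively halving h. For a second-order
# method, log2(e_k / e_{k+1}) should be close to 2 for each step.
errors = [2.78004e-2, 6.24532e-3, 1.52113e-3, 3.77824e-4, 9.43483e-5, 2.35835e-5]
orders = [log2(errors[k] / errors[k+1]) for k in 1:length(errors)-1]
```

The entries of orders approach 2 as the grid is refined, confirming the second-order convergence.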
The finite-element method is deeply rooted in the theory of weak solutions of
pdes. In fact, equations (10.32) and (10.34) make it clear that the numerical ap-
proximation stems from posing the problem on discretized, finite-dimensional
function spaces instead of on the Sobolev spaces generally used in the modern
theory of pdes, which are unsuitable for numerical approximations. A great ad-
vantage of finite elements especially in higher spatial dimensions is that geomet-
rically complex domains as they may occur in realistic problems can be approxi-
mated by finite elements very well. Lots of mesh generation software to partition
a given geometry into finite elements has been developed.
10.8 Julia Packages

Regarding the basic operations, the package ApproxFun can be used for approx-
imating functions. Regarding finite differences, the package DiffEqOperators
constructs finite-difference operators to discretize pdes, reducing the
equations to systems of odes which can be solved using the package
DifferentialEquations. Regarding finite volumes, the package VoronoiFVM
can solve coupled nonlinear pdes. Regarding finite elements, the JuliaFEM
project contains software and documentation for (nonlinear) equations and
distributed calculations. The package Gridap provides software for various
problem types, including linear, nonlinear, and multi-physics problems and is
written in Julia. The package FEniCS is a wrapper for the fenics library for
finite-element discretizations.

10.9 Bibliographical Remarks

A standard textbook on the theory of pdes is [1]. A few books [4, 5, 6, 8, 9] are
mentioned here among the multitude of textbooks on their numerical methods.

Problems

10.1 Prove (10.2).

10.2 Prove (10.4).

10.3 Prove that the potential defined in (10.6) satisfies (10.5).

10.4 * Prove (10.8) for the fundamental solution 𝐺 defined in (10.7).

10.5 Choose a pde and its derivation and write down the units of each variable
and constant in the equation and its derivation. Also check that all equations in
the derivation and the pde itself have consistent units.

10.6 * Prove Theorem 10.1.

10.7 * Prove Theorem 10.6.

10.8 (Mean inequalities)


1. Prove the inequality

       √(𝑥𝑦) ≤ (𝑥 + 𝑦)∕2   ∀𝑥, 𝑦 ∈ ℝ+ .

2. Suppose 𝛼 ∈ ℝ+ , 𝛽 ∈ ℝ+ , and 𝛼𝛽 = 1∕4. Prove the inequality

       𝑥𝑦 ≤ 𝛼𝑥² + 𝛽𝑦²   ∀𝑥, 𝑦 ∈ ℝ.

3. * Suppose 𝑛 ∈ ℕ and 𝑥𝑖 ∈ ℝ+ for all 𝑖 ∈ {1, … , 𝑛}. Prove the inequality

       (∏_{𝑖=1}^{𝑛} 𝑥𝑖)^{1∕𝑛} ≤ (1∕𝑛) ∑_{𝑖=1}^{𝑛} 𝑥𝑖 .

This inequality is known as the inequality of the arithmetic and geometric


means.

10.9 * Prove Theorem 10.9.

10.10 * Prove Theorem 10.10.

10.11 Write a function to plot the solution in Sect. 10.5.1.

10.12 Change the function elliptic_FD_1D in Sect. 10.5.1 to assemble the sys-
tem matrix using sparse (see Sect. 8.2). Is the new version faster? By how much?
Why?

10.13 Show that (10.16) follows from (10.21) by expanding the central-difference
operators.

10.14 Implement the fourth-order compact finite-difference discretization in


Theorem 10.11 and use an example to validate that it has order four.

10.15 Show that (10.22) follows from (10.24) by expanding the central-difference
operators.

10.16 Implement the fourth-order compact finite-difference discretization in


Theorem 10.12 and use an example to validate that it has order four.

10.17 Prove (10.28).

10.18 Implement Simpson’s rule to approximate ∬_{𝑉𝑖,𝑗} 𝑓 d𝑉 (instead of ℎ²𝑓𝑖,𝑗 ).
Compare the error using ℎ²𝑓𝑖,𝑗 and Simpson’s rule in a (well-chosen) example.

10.19 Implement zero and general Neumann boundary conditions for the finite-
volume discretization in Sect. 10.6 on one edge of a square domain.

10.20 Derive, implement, and test a three-dimensional version of the two-


dimensional finite-volume discretization in Sect. 10.6.

10.21 (Green’s first identity) * Suppose 𝑈 ⊂ ℝ𝑑 is a domain, 𝑢 ∶ 𝑈 → ℝ


a twice continuously differentiable function, 𝑣 ∶ 𝑈 → ℝ a once continuously
differentiable function, and 𝐴 ∶ 𝑈 → ℝ𝑑×𝑑 a once continuously differentiable
matrix-valued function. Prove that the identity

    ∭_𝑈 ∇ ⋅ (𝐴∇𝑢) 𝑣 d𝑉 = ∯_{𝜕𝑈} 𝑣𝐧 ⋅ (𝐴∇𝑢) d𝑆 − ∭_𝑈 (𝐴∇𝑢) ⋅ ∇𝑣 d𝑉

holds by applying the divergence theorem

    ∭_𝑈 ∇ ⋅ 𝐅 d𝑉 = ∯_{𝜕𝑈} 𝐧 ⋅ 𝐅 d𝑆

to the vector field 𝐅 ∶= 𝑣𝐴∇𝑢.


10.22 * Prove Theorem 10.14.
10.23 Prove (10.38).
10.24 Generalize the function elliptic_FE_1D by implementing Neumann
boundary conditions.
10.25 Implement a problem with mixed Dirichlet/Neumann boundary condi-
tions and validate the implementation using a test problem.
10.26 (Finite elements in two dimensions) The one-dimensional finite-
element discretization is generalized to two dimensions in the following steps.
1. Partition a general rectangular domain into rectangles and subdivide the
rectangles into two right-angled triangles. Use these triangles as the finite
elements and derive a finite-element discretization of the elliptic problem
(10.30) analogously to the one-dimensional case.
2. Implement the discretization and validate it using a test problem.
3. Implement mixed Dirichlet/Neumann boundary conditions and validate the
implementation using a test problem.

References

1. Evans, L.: Partial Differential Equations, 1st edn. American Mathematical Society (1998)
2. Gilbarg, D., Trudinger, N.: Elliptic Partial Differential Equations of Second Order. Springer-
Verlag (2001)
3. Lax, P., Milgram, A.: Parabolic equations. Ann. Math. Stud. 33, 167–190 (1954)
4. LeVeque, R.: Numerical Methods for Conservation Laws, 2nd edn. Birkhäuser, Basel (1992)
5. LeVeque, R.: Finite Volume Methods for Hyperbolic Problems. Cambridge University Press
(2002)
6. LeVeque, R.: Finite Difference Methods for Ordinary and Partial Differential Equations:
Steady-State and Time-dependent Problems. Society for Industrial and Applied Mathemat-
ics (SIAM) (2007)
7. Renardy, M., Rogers, R.: An introduction to partial differential equations, 2nd edn. Springer-
Verlag, New York, NY (2004)
8. Strang, G., Fix, G.: An Analysis of the Finite Element Method, 2nd edn. Wellesley-Cambridge
Press (2008)
9. Toro, E.: Riemann Solvers and Numerical Methods for Fluid Dynamics, 3rd edn. Springer
(2009)
Part III
Algorithms for Optimization
Chapter 11
Global Optimization

But. . . TANSTAAFL.
“There ain’t no such thing as a free lunch,” in Bombay or in Luna.
—Robert A. Heinlein, The Moon is a Harsh Mistress (1966)

Abstract The optimization of functions is a topic of both great theoretical and


practical importance. Optimization problems occur in many different contexts,
where a common setting is that a model of the quantity of interest is to be op-
timized with respect to its parameters. In this chapter, we present important
classes of algorithms for global optimization, i.e., for finding all global minima
or maxima of a given function on a given domain disregarding any local optima.
The methods described here are simulated annealing, particle-swarm optimiza-
tion, and genetic algorithms. A list of benchmark problems of varying difficulty
is also included, inviting the reader to experiment with the optimization algo-
rithms and their parameters. Local optimization methods, usually based on the
gradient of the function, are discussed in the next chapter.

11.1 Introduction

Global optimization is a large field with many applications, and many deter-
ministic and stochastic optimization methods have been developed. Determinis-
tic methods include branch-and-bound methods, cutting-plane methods, inner-
and-outer approximation, and interval methods. Stochastic methods, on the
other hand, are methods whose results depend on random numbers drawn
while running the algorithm; the function to be optimized is still deterministic.
Stochastic methods include direct Monte Carlo sampling, stochastic tunneling,
and parallel tempering.
Heuristic methods are strategies for searching the domain in a – hopefully –
intelligent manner. They include differential evolution, evolutionary computation
(including, e.g., evolution strategies, genetic algorithms, and genetic programming),
graduated optimization, simulated annealing, swarm-based optimization (including,
e.g., particle-swarm optimization and ant-colony optimization), and taboo search.
Other methods are Bayesian optimization and memetic algorithms. In the
next chapter, Chap. 12, local optimization methods are discussed. They are usu-
ally based on the gradient of the function to be optimized or on its second deriva-
tives. Global and local optimization methods can be combined by performing
global optimization before or while improving promising candidate points by lo-
cal optimization. This synergy between evolutionary and local optimization is
called memetic algorithms.
Two branches of optimization are continuous and discrete optimization. In
discrete optimization, some or all of the variables of the objective function are
discrete, i.e., they assume only values from a finite set of values. Two important
fields of discrete optimization are combinatorial optimization and integer pro-
gramming.
Furthermore, optimization problems can be categorized with respect to the
presence of constraints on the independent variables of the objective function.
The constraints can be hard constraints, which must be satisfied by the inde-
pendent variables, or soft constraints, where values violating the constraints are
penalized according to the extent of the violation.
The choice of optimization method is also influenced by the computational
cost of evaluating the objective function. If the objective function is given as an
expression (see, e.g., the benchmark problems in Sect. 11.8), it is usually possible
to perform many function evaluations thus rendering the optimization problem
more tractable. On the other hand, if the objective function is computationally
expensive such as a function of the solution of a partial differential equation,
the choice of optimization method is more limited and also more important. A
closely related question is whether first- or second-order derivatives of the objec-
tive function are available and what their computational cost is. The case when
no derivatives are available is called derivative-free optimization.
The long, but not exhaustive, lists of optimization methods above pose the
question which one to use when presented with an optimization problem. Unfor-
tunately, there is no general answer to this question, as there is no best optimiza-
tion algorithm. It will always be the case that for any given class of optimization
problems, there is a best class of algorithms, and – vice versa – for any given class
of algorithms, there is a class of optimization problems for which the algorithms
work best. This notion is formalized in no-free-lunch theorems. Therefore, we
start by examining in the next section the question what can generally be said
about the relationship between classes of optimization problems and classes of
optimization algorithms.
11.2 No Free Lunch

No-free-lunch (nfl) theorems are statements in optimization and computa-


tional complexity that imply that, for certain types of problems, the computa-
tional cost of finding a solution when averaged over all problems in a class is the
same for any solution method.
In other words, there is no optimization method that outperforms all other
methods on all problems. Any improved performance over one class of prob-
lems is always offset by decreased performance over another class of problems.
Specialized optimization methods have the best performance for solving a cer-
tain class of problems. There may still be algorithms yielding good results on
a variety of classes of problems, but they are still outperformed by specialized
methods in each of them.
These notions have been formalized in the nfl theorems [17] for search and
optimization problems. More precisely, the connection between algorithms and
their cost functions is analyzed. The finite search set is denoted by 𝑋 and the
finite set of all possible values of the objective function is denoted by 𝑌. (The as-
sumption that both 𝑋 and 𝑌 are finite is met when the elements are represented
by floating-point numbers.) An optimization problem is represented by its objec-
tive function 𝑓 ∶ 𝑋 → 𝑌, and the set of all optimization problems or objective
functions is denoted by 𝐹 ∶= 𝑌^𝑋 .
The points in 𝑋 × 𝑌 visited or evaluated while running an optimization algo-
rithm form a sample of size 𝑚, i.e., they are a (time ordered) set of 𝑚 distinct
visited points and are denoted by
    𝑑𝑚 ∶= {(𝑑𝑚^𝑥(1), 𝑑𝑚^𝑦(1)), … , (𝑑𝑚^𝑥(𝑚), 𝑑𝑚^𝑦(𝑚))}.

The element 𝑑𝑚^𝑥(𝑖) ∈ 𝑋 denotes the 𝑖-th element in a sample of size 𝑚, and
𝑑𝑚^𝑦(𝑖) ∶= 𝑓(𝑑𝑚^𝑥(𝑖)) ∈ 𝑌 is the corresponding value of the objective function.
The ordered set of values of the objective function is denoted by
𝑑𝑚^𝑦 ∶= {𝑑𝑚^𝑦(1), … , 𝑑𝑚^𝑦(𝑚)}.
The set of all samples of size 𝑚 is denoted by

𝐷𝑚 ∶= (𝑋 × 𝑌)𝑚

and the set of all samples (of arbitrary size) by

    𝐷 ∶= ⋃_{𝑚≥0} 𝐷𝑚 .

An optimization algorithm 𝑎 is represented by a function mapping a previ-
ously visited set of points to a new, previously unseen or unevaluated point in 𝑋,
i.e.,

    𝑎 ∶ 𝐷 → 𝑋,   𝑑 ↦ 𝑥 ∉ 𝑑^𝑥 .
These optimization algorithms are deterministic, as every sample is mapped to a


new point deterministically. The results below for deterministic algorithms can
be extended to stochastic algorithms [17].
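To make the formalism concrete, here is a toy sketch of ours (not from [17]; all names are hypothetical): with 𝑋 = {1, …, 5}, a deterministic "algorithm" in the above sense maps a sample of visited pairs to a previously unvisited point.

```julia
# Toy sketch (ours): an algorithm a: D -> X in the sense above, for X = 1:5.
# Given a sample of visited (x, y) pairs, it deterministically returns a
# previously unvisited point, here simply the smallest one.
function next_point(sample::Vector{Tuple{Int,Int}})::Int
    visited = Set(x for (x, _) in sample)
    minimum(x for x in 1:5 if !(x in visited))
end
```

Iterating next_point and appending the evaluated pair to the sample generates exactly the kind of time-ordered sample 𝑑𝑚 described above.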
We also define performance measures Φ(𝑑𝑚^𝑦) of algorithms after 𝑚 itera-
tions. The performance measures are functions of the sample 𝑑𝑚^𝑦. For example,
when minimizing the objective function 𝑓, a self-evident performance measure
is Φ(𝑑𝑚^𝑦) ∶= min_{𝑖∈{1,…,𝑚}} 𝑑𝑚^𝑦(𝑖).
nfl theorems are formulated within the framework of probability theory. The
conditional probability 𝑃(𝑑𝑚^𝑦|𝑓, 𝑚, 𝑎) is the probability of obtaining the sample
𝑑𝑚^𝑦 after iterating an algorithm 𝑎 for 𝑚 iterations on an objective function 𝑓.
Knowing this conditional probability, performance measures Φ(𝑑𝑚^𝑦) can be
calculated easily.
The first nfl theorem discussed here addresses the question of how the set 𝐹1 ⊂
𝐹 of problems for which an algorithm 𝑎1 performs better than an algorithm 𝑎2
compares to the set 𝐹2 ⊂ 𝐹 of problems for which the converse is true. This is
done by summing the conditional probabilities 𝑃(𝑑𝑚^𝑦|𝑓, 𝑚, 𝑎) over all objective
functions 𝑓 and comparing the sums obtained for 𝑎 = 𝑎1 and 𝑎 = 𝑎2 . A major
result of [17] is that the sum

    ∑_{𝑓∈𝐹} 𝑃(𝑑𝑚^𝑦|𝑓, 𝑚, 𝑎),

i.e., the conditional probability 𝑃(𝑑𝑚^𝑦|𝑓, 𝑚, 𝑎) summed over all objective func-
tions 𝑓, is independent of the algorithm 𝑎.

Theorem 11.1 (no free lunch) For any pair of algorithms 𝑎1 and 𝑎2 , the equation

    ∑_{𝑓∈𝐹} 𝑃(𝑑𝑚^𝑦|𝑓, 𝑚, 𝑎1 ) = ∑_{𝑓∈𝐹} 𝑃(𝑑𝑚^𝑦|𝑓, 𝑚, 𝑎2 )

holds under the above assumptions.

A proof can be found in [17, Appendix A]. This theorem means that the sum
of such probabilities over all possible optimization problems 𝑓 is identical for all
optimization algorithms. In other words, the average of 𝑃(𝑑𝑚^𝑦|𝑓, 𝑚, 𝑎) over all
problems 𝑓 is independent of the algorithm 𝑎. This implies that any performance
gain that a certain algorithm provides on one class of problems must be offset by
a performance loss on the remaining problems.
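This statement can be checked by brute force on a tiny example (our own sketch, not from [17]): with |𝑋| = 3 and |𝑌| = 2, two different deterministic search orders produce exactly the same multiset of value sequences 𝑑𝑚^𝑦 when all 2³ = 8 objective functions are enumerated.

```julia
# Toy sketch (ours): verify the NFL statement by enumeration for X = 1:3 and
# Y = {0, 1}. Over all |Y|^|X| = 8 objective functions f, two fixed search
# orders observe exactly the same multiset of y-value sequences.
X = 1:3
fs = vec([Dict(zip(X, vals)) for vals in Iterators.product(0:1, 0:1, 0:1)])
trace(order, f) = Tuple(f[x] for x in order)       # observed values d_m^y
h1 = sort([trace((1, 2, 3), f) for f in fs])
h2 = sort([trace((3, 1, 2), f) for f in fs])
# h1 == h2 holds: averaged over all f, both search orders perform identically.
```

Any performance measure Φ computed from these sequences therefore has the same average over all problems for both orders.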
These deliberations can be extended to time-dependent objective functions.
The initial cost function is called 𝑓1 and is used to sample the first 𝑋 value. Be-
fore the next iteration 𝑖 of the optimization algorithm, the cost function is trans-
formed to a new cost function by 𝑇 ∶ 𝐹 × ℕ → 𝐹, i.e., 𝑓𝑖+1 ∶= 𝑇(𝑓𝑖 , 𝑖). It is
assumed that all the transformations 𝑇(., 𝑖) are bijections on 𝐹. This assumption
is important, because otherwise a bias in a region of cost functions could be in-
troduced and exploited by some optimization algorithms.
Analogously to Theorem 11.1, the following theorem shows that the aver-
age performance of any two algorithms is the same also in the case of time-
dependent objective functions. But now we average over all possible time de-
pendencies of cost functions, meaning the average is calculated over all transfor-
mations 𝑇 rather than over all objective functions 𝑓. These averages are given by

    ∑_𝑇 𝑃(𝑑𝑚^𝑦|𝑓1 , 𝑇, 𝑚, 𝑎),

where 𝑓1 is the initial objective function. The samples are redefined to drop their
first elements such that the transformations 𝑇 can take full effect, although the
initial objective function 𝑓1 is fixed. Then the following result can be shown.
Theorem 11.2 (no free lunch for time-dependent problems) For all 𝑑𝑚^𝑦, 𝐷𝑚^𝑦,
𝑚 > 1, algorithms 𝑎1 and 𝑎2 , and initial cost functions 𝑓1 , the equation

    ∑_𝑇 𝑃(𝑑𝑚^𝑦|𝑓1 , 𝑇, 𝑚, 𝑎1 ) = ∑_𝑇 𝑃(𝑑𝑚^𝑦|𝑓1 , 𝑇, 𝑚, 𝑎2 )

holds.

A proof can be found in [17, Appendix B]. The interpretation is analogous to


the one of Theorem 11.1. The average performances of any two algorithms 𝑎1 and
𝑎2 over all cost-function dynamics are identical, meaning that any performance
gain over a class of problems must be offset by a performance loss over the rest
of the problems.
These nfl results tell us that unfortunately there is no ingenious optimiza-
tion algorithm that works better than all other algorithms on all problems. On
the other hand, the nfl theorems motivate specialized considerations and work
focused on well-defined classes of problems, for which specialized algorithms
that outperform generic ones exist.
Having discussed the theoretic limitations of optimization algorithms, mod-
ern methods for global optimization will be presented in the rest of this chapter.

11.3 Simulated Annealing

Simulated annealing is a very practical optimization method. We start by dis-


cussing its roots.

11.3.1 The Metropolis Monte Carlo Algorithm

In the middle of the last century, Monte Carlo algorithms for calculating prop-
erties of materials consisting of interacting individual particles were developed
[11]. These Monte Carlo integrations over the configuration spaces of the parti-
cles turned out to be highly useful in statistical mechanics.
In simple terms, the canonical ensemble is used to calculate the equilibrium
value 𝐹̄ of any quantity 𝐹 of interest as

    𝐹̄ = (∫ 𝐹 e^(−𝐸∕𝑘𝑇) d^(3𝑁)𝐩 d^(3𝑁)𝐪) ∕ (∫ e^(−𝐸∕𝑘𝑇) d^(3𝑁)𝐩 d^(3𝑁)𝐪).   (11.1)

There are 𝑁 particles so that the phase space is 6𝑁-dimensional. The factor
e^(−𝐸∕𝑘𝑇) stems from the Boltzmann probability distribution of the particles,
where 𝐸 is the energy of the state and the constant 𝑘𝑇 is the product of the
Boltzmann constant 𝑘 and the thermodynamic temperature 𝑇. The probability
of a state with energy 𝐸 is proportional to e^(−𝐸∕𝑘𝑇).
Since the forces between the particles are independent of velocity, the mo-
mentum integrals (d^(3𝑁)𝐩) cancel and only the integrals (d^(3𝑁)𝐪) over the 3𝑁-
dimensional configuration space must be computed. The method of choice to
do so is the Monte Carlo method, meaning that random samples of particles are
averaged.
The naive approach to perform these computations is to generate configura-
tions with 𝑁 particles each at random positions, each corresponding to a random
point in the 3𝑁-dimensional configuration space, then to calculate the energy 𝐸
of each configuration (depending on the forces considered), and finally to use
the weight e^(−𝐸∕𝑘𝑇) of each configuration in (11.1).
However, this naive approach is not practical when the particles are close-
packed, since the probability to choose a configuration with very small e^(−𝐸∕𝑘𝑇)
(and large 𝐸) is high. Therefore, in [11], a modification of the Monte Carlo
method was introduced. Instead of choosing random configurations and weigh-
ing them with the factor e^(−𝐸∕𝑘𝑇), configurations are chosen with probability
e^(−𝐸∕𝑘𝑇) and weighed evenly.
This alternative sampling can be achieved by starting from any configuration
and then moving each particle in succession by adding random displacements
chosen according to a uniform distribution on an interval [−𝛼, 𝛼]. Next, the en-
ergy change Δ𝐸 before and after the move is calculated. If Δ𝐸 ≤ 0, i.e., the new
configuration has lower energy, the move is allowed and the particle assumes
its new position. On the other hand, if Δ𝐸 > 0, the move is allowed and per-
formed only with probability e−Δ𝐸∕𝑘𝑇 . This can be achieved by drawing a ran-
dom number 𝜌 uniformly from the interval [0, 1] and moving the particle to its
new position if 𝜌 ≤ e−Δ𝐸∕𝑘𝑇 . Otherwise, it remains at its old position.
The new configuration is considered different from the old one for the pur-
pose of calculating the average in any case, irrespective of whether the move
was allowed (and hence performed) or not. In each iteration, a new value 𝐹𝑖 of
the quantity of interest is obtained. Finally, after 𝑀 iterations, the equilibrium
value of the quantity of interest is calculated as

\[ \bar F = \frac{1}{M} \sum_{i=1}^{M} F_i. \]

It can be shown that this algorithm does indeed choose configurations with
probabilities e−𝐸∕𝑘𝑇 [11] and therefore indeed calculates the equilibrium value
of the quantity of interest.
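In Julia, this sampling procedure can be sketched for a single one-dimensional particle as follows. The energy 𝐸(𝑞) = 𝑞²∕2, the observable 𝐹(𝑞) = 𝑞², the maximum displacement 𝛼, and the iteration count are illustrative choices for this sketch; for this energy and 𝑘𝑇 = 1, the exact equilibrium value of 𝐹 is 1.

```julia
using Random

# One-dimensional Metropolis Monte Carlo: sample positions q with
# probability proportional to exp(-E(q)/kT) and average F(q) evenly.
function metropolis_average(F, E; kT = 1.0, α = 1.0, iterations = 100_000)
    q = 0.0                                  # arbitrary starting configuration
    total = 0.0
    for _ in 1:iterations
        q_new = q + α * (2rand() - 1)        # uniform displacement in [-α, α]
        ΔE = E(q_new) - E(q)
        # Accept downhill moves always, uphill moves with prob. exp(-ΔE/kT).
        if ΔE <= 0 || rand() <= exp(-ΔE / kT)
            q = q_new
        end
        total += F(q)                        # count the configuration in any case
    end
    return total / iterations
end

Random.seed!(42)
Fbar = metropolis_average(q -> q^2, q -> q^2 / 2)   # exact equilibrium value: 1
```

Note that the current configuration contributes to the average in every iteration, whether or not the move was allowed, exactly as described above.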
Another question concerns the convergence speed of this algorithm. Here we
can already observe a major effect that is common to many Monte Carlo pro-
cedures that construct sequences of states. The maximum displacement 𝛼 is of
great importance and should be chosen with care or – even better – should be ad-
justed automatically. If it is too small, the configuration changes only little and
sampling the whole space requires many iterations. On the other hand, if it is
too large, most moves are forbidden. In both cases, it takes longer to arrive at
the equilibrium than with a suitable maximum displacement.
The reason why we have discussed the classical Metropolis algorithm in detail
is that it is not only foundational in the Monte Carlo method, but also that it
motivates the optimization algorithm that is the subject of this section.

11.3.2 The Simulated-Annealing Algorithm

Simulated annealing was introduced in [9] and copies the way of sampling the
whole space in the Metropolis algorithm and applies it to global optimization.
Simulated annealing can be applied to both continuous and discrete optimization
and search problems. It only requires two simple operations, namely choosing
a random starting point and a unary operation that maps points to neighboring
points.
In order to formulate the simulated-annealing algorithm, the sampling
method of the Metropolis algorithm is applied to a single point or particle. The
energy 𝐸 of a configuration in the Metropolis algorithm now corresponds to the
objective function in the minimization problem, and therefore we denote it by 𝑓
in the algorithm. This analogy yields the simulated-annealing algorithm.

Algorithm 11.3 (simulated annealing for minimization)


1. Choose a random starting point 𝑥1 and let 𝑡 ∶= 1.
2. Denote the current point by 𝑥𝑡 and generate a candidate point 𝑥̃ 𝑡+1 by adding
a random displacement, e.g., a normally distributed random variable with
zero mean such that
𝑥̃ 𝑡+1 ∶= 𝑥𝑡 + 𝑁(0, 𝜎).
3. Calculate the difference in the objective function by letting

Δ𝑓 ∶= 𝑓(𝑥̃ 𝑡+1 ) − 𝑓(𝑥𝑡 ).



4. Calculate the acceptance probability

   \[ p := \begin{cases} \mathrm e^{-\Delta f/kT}, & \Delta f > 0, \\ 1, & \Delta f \le 0. \end{cases} \]

   (Note that for implementation purposes simply setting 𝑝 ∶= e−Δ𝑓∕𝑘𝑇 suffices
   and yields the same 𝑥𝑡+1 in Step 6.)
5. Draw a uniformly distributed random number 𝜌 ∼ 𝑈(0, 1) (such that 𝜌 ∈
[0, 1]).
6. If 𝜌 ≤ 𝑝, accept the candidate point 𝑥̃ 𝑡+1 ; otherwise the point remains
   unchanged, i.e., the next point is

   \[ x_{t+1} := \begin{cases} \tilde x_{t+1}, & \rho \le p, \\ x_t, & \rho > p. \end{cases} \]

7. Set 𝑡 ∶= 𝑡 + 1 and go to Step 2 until a termination criterion has been reached.


8. Return the best of the points 𝑥𝑡 found.

In summary, candidate points that yield a better solution of the minimization


problem are always accepted (the case Δ𝑓 ≤ 0), while uphill steps are still per-
formed with probability e−Δ𝑓∕𝑘𝑇 (the case Δ𝑓 > 0). This probability depends
on the difference Δ𝑓 in the objective function as well as the factor 𝑘𝑇. If the
temperature 𝑇 is high, the acceptance probability e−Δ𝑓∕𝑘𝑇 is close to one, and
therefore uphill steps are likely to be accepted. Conversely, if the temperature is
low, the acceptance probability is close to zero, and uphill steps are unlikely to
be accepted.
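Algorithm 11.3 translates almost line by line into Julia. The following is a minimal sketch that uses a geometric cooling schedule and records the best point seen; the one-dimensional Rastrigin-type objective, the step size 𝜎, and the cooling parameters are illustrative choices, not part of the algorithm itself.

```julia
using Random

# Simulated annealing for minimizing f: ℝ → ℝ (cf. Algorithm 11.3).
function simulated_annealing(f, x0; T = 10.0, cooling = 0.999,
                             σ = 1.0, iterations = 20_000)
    x = x0
    best_x, best_f = x, f(x)
    for _ in 1:iterations
        x_new = x + σ * randn()              # candidate point (Step 2)
        Δf = f(x_new) - f(x)                 # Step 3
        p = exp(-Δf / T)                     # acceptance probability (Step 4)
        if rand() <= p                       # Steps 5 and 6
            x = x_new
        end
        if f(x) < best_f                     # track the best point (Step 8)
            best_x, best_f = x, f(x)
        end
        T *= cooling                         # cool down (cf. Sect. 11.3.3)
    end
    return best_x, best_f
end

Random.seed!(0)
# One-dimensional Rastrigin-type objective with global minimum 0 at x = 0.
f(x) = 10 + x^2 - 10cos(2π * x)
x_best, f_best = simulated_annealing(f, 4.0)
```

Since Δ𝑓 ≤ 0 yields 𝑝 ≥ 1, downhill moves are always accepted, matching the note in Step 4.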

11.3.3 Cooling Strategies

Therefore the temperature 𝑇 provides a way to adjust the behavior of the algo-
rithm. It is customary to start with a higher temperature and reduce it during the
iterations. In the beginning, while the temperature is high, simulated annealing
resembles a global random search. As the temperature falls, the probability of
accepting an uphill step decreases. Then points become more and more unlikely
to escape regions with local minima and hopefully converge to a global minimum
as the temperature goes to zero. Good cooling strategies can thus combine
global search with local refinement.
This consideration shows that reasonable cooling strategies have the follow-
ing three properties. From now on, we denote the temperature in iteration 𝑡 by
𝑇𝑡 .
1. The initial temperature 𝑇1 is greater than zero.
2. The temperature decreases, i.e., 𝑇𝑡+1 ≤ 𝑇𝑡 .

3. The temperature approaches zero as the number of iterations increases, i.e.,


lim𝑡→∞ 𝑇𝑡 = 0. If the search space is finite, the temperature is equal to zero
after a certain (large) number of iterations.
Many variants of cooling strategies are possible and have been investigated.
Here three cooling strategies are presented, and the reader is invited to
experiment with cooling strategies and the benchmark problems in Sect. 11.8
(see Problem 11.7).
1. Multiply the temperature 𝑇 by a factor 1 − 𝜖 every 𝑠 iterations, where 0 <
𝜖 ≪ 1. Suitable values for 𝜖 and 𝑠 are found by experimentation and depend
on the minimization problem.
2. Using an iteration budget of 𝑁 iterations in total, reduce the temperature to
   the new value

   \[ T_t := \Bigl(1 - \frac{t}{N}\Bigr)^{\beta} T_1 \]

   every 𝑠 iterations. Again, good values for the parameters 𝑁, 𝛽, and 𝑠 are
   found by experimentation.
3. Every 𝑠 iterations, set the temperature to the new value

   \[ T_t := \begin{cases} \gamma\bigl(f(x_t) - f(x_t^*)\bigr) & \text{if this value is nonzero}, \\ \delta T_{t-1} & \text{otherwise}, \end{cases} \]

   where 𝑥𝑡∗ is the best point found up to iteration 𝑡 and 𝛿 ∈ (0, 1). You have
   guessed it: there is no general theory for finding the parameters 𝑠, 𝛾, and 𝛿. In
   this cooling strategy, our rule that the temperature should be decreasing
   may be violated.
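The first two cooling strategies can be written as small closures that map the iteration counter 𝑡 to a temperature; the parameter values below are only examples.

```julia
# Strategy 1: multiply T by (1 - ϵ) every s iterations (geometric cooling).
function geometric_cooling(T1; ϵ = 0.01, s = 10)
    t -> T1 * (1 - ϵ)^(t ÷ s)
end

# Strategy 2: polynomial cooling with an iteration budget of N iterations;
# the temperature is updated every s iterations and reaches zero at t = N.
function budget_cooling(T1; N = 1000, β = 2, s = 10)
    t -> (1 - min(s * (t ÷ s), N) / N)^β * T1
end

T_geo = geometric_cooling(10.0)
T_bud = budget_cooling(10.0)
```

Both schedules start at 𝑇₁ > 0 and decrease; the second one reaches zero exactly when the iteration budget is exhausted.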
It can be shown that simulated annealing algorithms with appropriate cool-
ing strategies converge to the global optimum as the number of iterations goes
to infinity [10, 4, 13]. The domain can be searched faster if cooling is sped up,
but then convergence is not guaranteed anymore.

11.4 Particle-Swarm Optimization

Particle-swarm optimization [2, 8] can be viewed as an extension of simulated


annealing. Instead of using a single particle, a swarm of particles is moved from
iteration to iteration. The movement of the members of the swarm may be consid-
ered to mimic the “swarm intelligence” of social systems such as flocks of birds
or schools of fish. The members of the swarm spread out and move randomly
hoping to find a local optimum. As the swarm is social, the members announce
their success to their neighbors in order to attract them to promising regions and
to help in the search for an optimum.
Each particle is an element of an 𝑛-dimensional search space 𝑋 ⊂ ℝ𝑛 . Each
particle 𝑘 has a position 𝐱𝑘,𝑡 ∈ 𝑋 and a velocity 𝐯𝑘,𝑡 ∈ ℝ𝑛 in each iteration 𝑡. If

the speed of a particle is high, its search is explorative, and if it is low, the search
is exploitative. We briefly mention that it is possible to use genotype-phenotype
mapping, which will be discussed in the next section.
The positions and velocities of all particles are initialized randomly. In each
iteration, the velocity is updated first and then the position. Each particle 𝑘 also
keeps track of its history by remembering the best position 𝐛𝑘,𝑡 it has seen up
to iteration 𝑡.
The social component of the swarm is realized by communication between
the particles. To that end, each particle updates in each iteration 𝑡 its set 𝑁𝑘,𝑡 of
neighbors. The set 𝑁𝑘,𝑡 of neighbors is usually defined as the set of all particles
within a certain distance from 𝐱𝑘 measured by a certain metric such as the Eu-
clidean distance in the simplest case. Knowing its neighbors, each particle then
communicates its best position 𝐛𝑘,𝑡 found so far to its neighbors in each iteration.
Therefore each particle knows the best point 𝐧𝑘,𝑡 found in its neighborhood so
far. Furthermore, the swarm records the best position 𝐛𝑡 found by all particles
in the swarm up to iteration 𝑡.
These best points that the particles communicate among themselves consti-
tute the social component of particle-swarm optimization and enter the calcula-
tions in the updates of the velocities of the particles. When updating the velocity
of a particle, we can use the best point 𝐧𝑘,𝑡 found in its neighborhood so far or
the best point 𝐛𝑡 found by all particles so far.
These two possibilities lead to local and global updates, respectively. In a local
update, the velocity 𝐯𝑘,𝑡 of particle 𝑘 is updated by

\[ \mathbf v_{k,t+1} := \mathbf v_{k,t} + (\mathbf b_{k,t} - \mathbf x_{k,t})\, U(0, \mathbf c) + (\underbrace{\mathbf n_{k,t}}_{\text{best neighbor}} - \mathbf x_{k,t})\, U(0, \mathbf d). \tag{11.2} \]

In a global update, the velocity becomes

\[ \mathbf v_{k,t+1} := \mathbf v_{k,t} + (\mathbf b_{k,t} - \mathbf x_{k,t})\, U(0, \mathbf c) + (\underbrace{\mathbf b_{t}}_{\text{best in population}} - \mathbf x_{k,t})\, U(0, \mathbf e). \tag{11.3} \]

For each particle, either a local or a global velocity update is chosen randomly.
Here 𝑈(0, 𝐜), 𝑈(0, 𝐝), and 𝑈(0, 𝐞) are random vectors whose entries are uni-
formly distributed random variables in the intervals [0, 𝑐𝑖 ], [0, 𝑑𝑖 ], and [0, 𝑒𝑖 ], re-
spectively. In other words, the vectors 𝐜, 𝐝, and 𝐞 are parameters of the algorithm.
Having updated the velocities of all particles, their positions 𝐱𝑘,𝑡 are updated
by
𝐱𝑘,𝑡+1 ∶= 𝐱𝑘,𝑡 + 𝐯𝑘,𝑡+1 (11.4)
using the new velocities 𝐯𝑘,𝑡+1 . Usually, the size of the search space 𝑋 is finite,
and then it is necessary to ensure that particles do not move out of 𝑋 due to this
update.
The updates mean that two steps are added to the previous velocity. In both
the local and the global updates, the term (𝐛𝑘,𝑡 − 𝐱𝑘,𝑡 )𝑈(0, 𝐜) points the particle

towards its best position so far. In the local update, the term (𝐧𝑘,𝑡 − 𝐱𝑘,𝑡 )𝑈(0, 𝐝)
points the particle towards its best neighbor found so far, and in the global up-
date, the term (𝐛𝑡 − 𝐱𝑘,𝑡 )𝑈(0, 𝐞) ensures that it points towards the best point
found by the whole swarm so far.
The learning-rate vectors 𝐜, 𝐝, and 𝐞 strongly influence convergence speed.
The components of these three vectors determine how vigorously the particles
move towards their best positions, their best neighbors, and the best position
in the whole swarm so far. The components can of course be different for all
directions in the search space.
If the components of 𝐞 are large and the update hence relies much on the
best position 𝐛𝑡 in the whole swarm, the algorithm converges faster, but is less
likely to find a global minimum. If the components of 𝐝 are large and the update
hence relies much on the best neighbor 𝐧𝑘,𝑡 , convergence is slower, but a global
minimum is more likely to be found.
Having discussed how the particles are updated, we can summarize particle-
swarm optimization as follows.
Algorithm 11.4 (particle-swarm optimization)
1. Choose a swarm size and initialize the particles with random positions and
velocities.
2. Set the iteration counter 𝑖 ∶= 1.
3. Loop over all particles, update their velocities choosing either the local up-
date (11.2) or the global update (11.3) and then update their positions using
(11.4).
4. While the termination criterion has not been met, increase 𝑖 by 1 and go to
Step 3.
5. Finally, return the best position found.
In practice, it is useful to return the history of the best points 𝐛𝑡 so that the
progress of the algorithm can be plotted. It is also useful to return the whole last
swarm as well so that the algorithm can be restarted (without random initializa-
tion) whenever the results are not satisfactory and there is still progress. Since
the global optimum or optima are unknown unless a test problem is considered,
practical termination criteria judge the progress in the past iterations. When it
has become very slow, the algorithm is stopped; but of course, long periods of
stagnation may be deceptive.
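The updates (11.2)–(11.4) can be sketched in Julia as follows. For brevity, the sketch always uses the global update with the best point of the whole swarm instead of neighborhoods, and it clamps the velocities to the size of the search box; the clamping and all parameter values are pragmatic additions for this sketch, since the bare updates may otherwise diverge.

```julia
using Random

# Particle-swarm optimization with global velocity updates on a box [lo, hi]^d.
function pso(f, d; lo = -5.0, hi = 5.0, swarm = 30, iterations = 200,
             c = 1.0, e = 1.0, vmax = (hi - lo) / 2)
    x = [lo .+ (hi - lo) .* rand(d) for _ in 1:swarm]   # random positions
    v = [zeros(d) for _ in 1:swarm]                     # velocities
    b = copy.(x)                                        # personal bests 𝐛ₖ
    fb = f.(x)
    gbest = argmin(fb)                                  # index of swarm best 𝐛ₜ
    for _ in 1:iterations
        for k in 1:swarm
            # Global velocity update (11.3) with uniform random factors.
            v[k] = v[k] .+ (b[k] .- x[k]) .* (c .* rand(d)) .+
                          (b[gbest] .- x[k]) .* (e .* rand(d))
            v[k] = clamp.(v[k], -vmax, vmax)            # keep velocities bounded
            x[k] = clamp.(x[k] .+ v[k], lo, hi)         # position update (11.4)
            fx = f(x[k])
            if fx < fb[k]                               # update personal best
                b[k], fb[k] = copy(x[k]), fx
            end
        end
        gbest = argmin(fb)                              # update swarm best
    end
    return b[gbest], fb[gbest]
end

Random.seed!(0)
x_best, f_best = pso(x -> sum(abs2, x), 2)    # sphere function (Sect. 11.8)
```

Clamping the positions to the box also implements the requirement that particles must not move out of 𝑋 after the position update.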

11.5 Genetic Algorithms

Evolutionary computation is an umbrella term for genetic algorithms, genetic


programming, and evolution strategies. In this section, genetic algorithms are
discussed. Genetic algorithms share with particle-swarm optimization the idea
that there is a population of points, particles, or individuals that is updated from
iteration to iteration such that the individuals perform a global optimization.

Genetic algorithms follow biological evolution [1] by implementing natural


selection. We start by discussing the basic algorithm, of which many variations
have been developed. The concepts, the data structures, and the operations that
appear in the algorithm are discussed afterwards. A more recent development
closes this section.

11.5.1 The Algorithm

All algorithms in evolutionary computation and all genetic algorithms can be


summarized as follows.
Algorithm 11.5 (genetic algorithm)
1. Generate a population of individuals with random genomes.
2. Map the genome of each individual to its phenotype, and evaluate the objec-
tive function for the phenotype of each individual. (Evaluating the objective
function may be computationally expensive, but the evaluations can be par-
allelized easily.)
3. Map the value of the objective function for each individual to the fitness of
each individual.
4. Select individuals for reproduction such that individuals with higher fitness
have a higher probability of reproduction.
5. In reproduction, create offspring by varying or combining the genotypes of
the individuals to create new individuals, which are then added to the pop-
ulation.
6. Go to Step 2 unless the termination criterion has been met.
While formulating the algorithm, we have introduced a number of concepts
that must be made more precise. What are genotypes and phenotypes in the con-
text of optimization problems? What is fitness and how are values of the objective
function translated to fitness values? How are individuals selected for reproduc-
tion, and how do the reproduction operations work? These building blocks of the
algorithm are discussed next. Many variants of these operations are obviously
possible, which brings us back to Sect. 11.2.

11.5.2 Genotypes and Phenotypes

The search space on which the reproduction operations act is the genome, and
the elements of the genome are the genotypes. A genotype is the collection of
the genes of an individual. In biology, the concept of a gene has changed as new
discoveries, for example about gene regulation, have been made. In genetic al-
gorithms, we are free to choose the genes and genotypes. Beneficial choices of
course help in solving the optimization problem.

Genotypes are often vectors of genes in contrast to other, nonlinear data struc-
tures. In this leading case of vectors, the genotypes are usually referred to as
chromosomes. Chromosomes can either be vectors of fixed length or of variable
length.
In the case of chromosomes with a fixed-length vector 𝐚, the locus of each
gene is always the same element 𝑎𝑖 of the vector, implying that the competing
alleles of a gene are always positioned at the same element number 𝑖 of the vector.
The genes may have different data types so that the elements of the vector may
have different types.
In the case of chromosomes with a variable-length vector, the positions of the
genes may have been shifted after reproduction operations are applied. In this
case, the genes often have the same data type.
Leading examples of chromosomes are vectors that consist of bits, of integers,
or of real numbers.
The phenotype of an individual in the context of an optimization problem is
an element 𝑥 ∈ 𝑋 in the preimage 𝑋 of the objective function

𝑓 ∶ 𝑋 → ℝ.

The relationship between genotypes and phenotypes is given by the genotype-


phenotype mapping
𝑔∶ 𝐺 → 𝑋
from the set 𝐺 of all genotypes to the preimage 𝑋 of the objective function 𝑓.
This means that from the viewpoint of a genetic algorithm the composition

𝑓◦𝑔 ∶ 𝐺 → ℝ

is optimized.
Therefore the significance of the genotype-phenotype mapping is that its
choice should facilitate the reproduction operations and optimization by sup-
porting advantageous genotypes. A good choice of genotype is, of course, highly
problem dependent.
To clarify these concepts, we mention the canonical choice of genotype for
objective functions 𝑓 ∶ ℝ𝑑 → ℝ. The canonical genome is 𝐺 ∶= ℝ𝑑 such that
any vector 𝐠 ∈ ℝ𝑑 is a genotype. Each element 𝑔𝑖 of 𝐠 is a gene and has locus 𝑖.
Because of this linear arrangement of the genes, the vector 𝐠 is a fixed-length
chromosome. The canonical choice for the genotype-phenotype mapping 𝑔 is
the identity 𝑔 ∶= id, and therefore 𝑓◦𝑔 = 𝑓 ∶ ℝ𝑑 → ℝ.

11.5.3 Fitness

In the next step of the algorithm, the objective function is evaluated for the phe-
notypes of all individuals. Based on these values, each individual is assigned a

fitness value. Then individuals with higher fitness are more likely to be selected
for reproduction.
It is obvious that the fitness of an individual should reflect how well it solves
the optimization problem. But it should also reflect the variety of the population
and incorporate information on population density and niches. By doing so, the
probability of finding the global optima can be increased significantly. Fitness
assignment is therefore in general a function that acts on the whole population.
The simplest way of assigning fitness is, however, to just use the value 𝑓(𝑥) of
the objective function or – in multiobjective optimization – the weighted sum

\[ \sum_{i} w_i f_i(x), \]

where the functions 𝑓𝑖 are the objectives and the constants 𝑤𝑖 are weights.
Another option is Pareto ranking, which also works for multiobjective opti-
mization. To explain it, we first consider the following fitness assignment. If an
individual dominates 𝑚 other individuals, it is assigned the fitness value 1∕(1 + 𝑚).
Its disadvantage especially in multiobjective optimization is, however, that indi-
viduals in a crowded part of the search space have the chance to dominate many
others, while individuals in a less explored part of the space are assigned a much
worse fitness value only because there are fewer neighbors, although they score
best with respect to the objective function.
An alternative is to assign to each individual the fitness value 𝑛, where 𝑛 is
the number of other individuals it is dominated by. In multiobjective optimization,
this choice recognizes individuals on the Pareto frontier and avoids the disadvan-
tages of the fitness choice 1∕(1 + 𝑚) from the previous paragraph. The Pareto
fitness value or Pareto rank of an individual 𝑖 is found by looping over all other
individuals 𝑗. If individual 𝑖 is dominated by individual 𝑗, the Pareto rank of 𝑖 is
increased by one.
Still, Pareto ranking does not exploit any information about population den-
sity or variety. In global optimization, crowding of the individuals is undesirable
and variety is beneficial to explore the whole space. Sharing is a method to in-
clude variety information into fitness assignment, and variety preserving rank-
ing is another method worth mentioning here.

11.5.4 Selection

After having assigned fitness values, a sufficient number of individuals is chosen


from the whole population in the selection step according to their fitness values
and placed in the mating pool. In the following step, reproduction operations
are applied to the mating pool. Selection may be deterministic or stochastic.
Furthermore, it may proceed with replacement or without replacement, depending
on whether an individual may be placed in the mating pool multiple times (with
replacement) or at most once (without replacement).

Elitism means that the best 𝑛 individuals, where 𝑛 ≥ 1, get a free pass and
are placed in the mating pool in each generation.
The simplest selection method is truncation selection. The individuals are
ordered by their fitness and the desired number of the best individuals is placed
in the mating pool. When truncation selection is used, care should be taken to
combine it with a fitness assignment that ensures variety in order to prevent
premature convergence.
Another classical selection method is fitness proportionate selection. The
probability of the individual 𝑖 being placed in the mating pool is proportional
to its fitness, i.e., the probability is given by

\[ \frac{v_i}{\sum_j v_j}, \]

where 𝑣𝑘 is the fitness of individual 𝑘.


Tournament selection is one of the most popular and effective selection meth-
ods. In tournament selection, 𝑘 individuals are selected from the population at
random and compared with each other in a tournament. The winner of the tour-
nament is placed in the mating pool. Tournament selection can be used with and
without replacement.
We can estimate the probability of an individual being chosen for the mating
pool in deterministic tournament selection with replacement and with tourna-
ment size two. In each tournament, the individuals are chosen randomly accord-
ing to a uniform distribution. If the size of the mating pool is about the size of
the population, each individual will participate in about two tournaments on av-
erage. The individual with the highest fitness will win all tournaments and will
be placed twice in the mating pool; the individual with the median fitness will
be placed in the mating pool once; and the individual with the worst fitness can
only win against itself, but it is unlikely that it is chosen twice for a tournament.
This means that the expected number of copies of an individual in the mating
pool increases with its fitness, being two for the best fitness, one for the median
fitness, and close to zero for the worst fitness.
In stochastic tournament selection, the best individual in the tournament
is selected with probability 𝑝 and the 𝑖-th best individual with probability
𝑝(1 − 𝑝)^{𝑖−1}.
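Deterministic tournament selection with replacement can be sketched as follows; the tournament size and the fitness values in the example are illustrative.

```julia
using Random

# Deterministic tournament selection with replacement: pick k random
# competitors and return the index of the fittest one (higher is better).
function tournament_select(fitness; k = 2)
    competitors = rand(1:length(fitness), k)
    competitors[argmax(fitness[competitors])]
end

# Fill a mating pool of the same size as the population.
function mating_pool(fitness; k = 2)
    [tournament_select(fitness; k) for _ in 1:length(fitness)]
end

Random.seed!(0)
pool = mating_pool(collect(1.0:10.0))
```

Running many tournaments confirms the estimate above: the fittest individual wins roughly twice as often as an average one, and the least fit almost never.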

11.5.5 Reproduction

In the final step of the algorithm, the individuals in the mating pool are repro-
duced and their offspring is placed in the next generation of the population. We
present four reproduction operations here.

First, in creation, a new genotype is created with random genes. Creation is


usually applied only when generating the initial population. Creation is a nullary
operation.
Second, in duplication, an exact copy of a genotype is created. It is mostly
useful to increase the number of individuals of a certain type in a population.
Third, mutation introduces small random changes and is important for pre-
serving variety. In fixed-length chromosomes, one or more of the genes are ran-
domly changed to another allele. If the type of the gene is a bit, it is toggled. If the
gene is a real number, a random value from a normal distribution can be added.
Duplication and mutation are unary operations.
Mutation in variable-length chromosomes can mean two more operations,
namely insertion and deletion, in addition to changing a gene. Insertion means
that random genes are inserted, and deletion means that some of the genes are
deleted.
Fourth, crossover is a binary operation. In single-point crossover, the two
parental chromosomes 𝐚 and 𝐛 are split at a random crossover point, called 𝑘
here, and the offspring

(𝑎1 , 𝑎2 , … , 𝑎𝑘−1 , 𝑎𝑘 , 𝑏𝑘+1 , 𝑏𝑘+2 , … , 𝑏𝑑−1 , 𝑏𝑑 )

consists of the first part of the first parent and second part of the second parent.
More generally, in multipoint crossover, several crossover points are chosen ran-
domly and the offspring consists of the subsequences taken alternately from the
two parental chromosomes.
Crossover in variable-length chromosomes is analogous, but the loci where
the chromosomes are split are not necessarily the same anymore, and the off-
spring generally has a different length than the parents.
This discussion of reproduction completes the discussion of the operations
that occur in a genetic algorithm.
Many variations of the operations in a genetic algorithm have been devised.
If the genotypes or the phenotypes are elements of unusual spaces, it may be a
challenge to adapt these ideas such that the algorithm works well, i.e., that it
converges while still performing a global search.
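For fixed-length chromosomes stored as vectors, single-point crossover and a mutation for real-valued genes can be sketched as follows; the perturbation size 𝜎 is an illustrative choice.

```julia
using Random

# Single-point crossover: take genes 1..k from parent a and k+1..d from b.
crossover(a, b, k) = vcat(a[1:k], b[k+1:end])

# Mutation for real-valued genes: add a normal perturbation to one random gene.
function mutate(a; σ = 0.1)
    c = copy(a)
    i = rand(1:length(c))
    c[i] += σ * randn()
    return c
end

child = crossover([1, 2, 3, 4], [5, 6, 7, 8], 2)   # [1, 2, 7, 8]
```

For bit-valued genes, the mutation would instead toggle the chosen gene, as described above.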

11.6 Ablation Studies

A profound way to assess the performance of variants of an algorithm is an
ablation study. In an ablation study, we define a basic algorithm and a certain
number of variations of the basic algorithm, usually more complicated versions of
parts of the basic algorithm expected to yield improvements. The basic algorithm
is then tested on a set of benchmark problems, as well as the most sophisticated
algorithm with all variants enabled, which would be expected to perform best.
Furthermore, the performance of the basic algorithm with single variations
enabled and the performance of the most sophisticated algorithm with single
variations disabled are assessed. Of course, more algorithms with some variations
enabled and some variations disabled can be assessed additionally. Finally, the
performances on the benchmark problems are compared, which usually gives a
good overview over the variations that are indeed beneficial.

11.7 Random Restarting and Hybrid Algorithms

In global optimization, we usually face the problem that there is no guarantee
that all global optima, or even at least one of them, have indeed been found.
A simple but effective idea is to restart the optimization algorithm many times,
hoping that many runs with different, random initializations increase the
probability of finding global optima.
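Random restarting can be wrapped around any stochastic optimizer. The sketch below assumes an interface in which the optimizer is a function returning a pair (point, value); the trivial random-sampling "optimizer" in the example is only for illustration.

```julia
using Random

# Run a stochastic optimizer `optimize` (returning (x, f(x))) several times
# and keep the best result; all results are returned for later inspection.
function random_restarts(optimize, runs)
    results = [optimize() for _ in 1:runs]
    best = argmin(last.(results))
    return results[best], results
end

Random.seed!(0)
f(x) = (x - 3)^2
sample_optimizer() = (x = 10rand(); (x, f(x)))   # toy stand-in for a real run
best, results = random_restarts(sample_optimizer, 50)
```

Returning the full list of results makes it easy to plot the progress over the runs, as suggested in Problem 11.13.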
The global-optimization algorithms in this chapter can be highly effective
in searching globally, and they do not presuppose much about the smoothness of
the objective function. If approximations of global optima have been found and
the objective function is smooth enough that the first or even higher derivatives
exist, these approximations can be used as starting points in algorithms for local
optimization. Local-optimization algorithms are discussed in the next chapter.
Algorithms combining global and local optimization are called hybrid algorithms.
But before we continue with optimization algorithms, benchmark problems
are presented in the next section to guide experimentation with algorithms and
objective functions.

11.8 Benchmark Problems

Benchmark problems for multidimensional global optimization are presented


in the following. The independent variable is usually 𝐱 ∈ ℝ𝑑 . These functions
are useful for evaluating the performance of the optimization algorithms in this
chapter and the next. A few are simple test cases for quickly checking an
optimization algorithm, but most have properties that make them difficult to
minimize.
1. The Ackley function

   \[ f_{\mathrm{AC}} \colon [-32.768, 32.768]^d \to \mathbb R, \quad \mathbf x \mapsto -\alpha \exp\Biggl(-\beta \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2}\Biggr) - \exp\Biggl(\frac{1}{d} \sum_{i=1}^{d} \cos(\gamma x_i)\Biggr) + \alpha + \exp(1), \]

   has many local minima in the nearly flat outer region and a large hole at the
   center. Usually, the values 𝛼 ∶= 20, 𝛽 ∶= 0.2, and 𝛾 ∶= 2𝜋 are used. Its
   global minimum is 𝑓AC (𝟎) = 0.
2. The Bukin function no. 6

   \[ f_{\mathrm{BU6}} \colon [-15, -5] \times [-3, 3] \to \mathbb R, \quad \mathbf x \mapsto 100 \sqrt{\bigl| x_2 - 0.01 x_1^2 \bigr|} + 0.01 |x_1 + 10| \]

   has many local minima, all of which lie in a narrow valley. Its global mini-
   mum is 𝑓BU6 (−10, 1) = 0.
3. The drop-wave function

   \[ f_{\mathrm{DW}} \colon [-5.12, 5.12]^2 \to \mathbb R, \quad \mathbf x \mapsto -\frac{1 + \cos\bigl(12 \sqrt{x_1^2 + x_2^2}\bigr)}{(x_1^2 + x_2^2)/2 + 2} \]

   has a very complicated structure. Its global minimum is 𝑓DW (0, 0) = −1.
4. The Easom function

   \[ f_{\mathrm{EA}} \colon [-100, 100]^2 \to \mathbb R, \quad \mathbf x \mapsto -\cos(x_1) \cos(x_2) \exp\bigl(-(x_1 - \pi)^2 - (x_2 - \pi)^2\bigr) \]

   has several local minima, while the area near the global minimum is small
   relative to the search space. Its global minimum is 𝑓EA (𝜋, 𝜋) = −1.
5. The Gramacy–Lee function

   \[ f_{\mathrm{GL}} \colon [0.5, 2.5] \to \mathbb R, \quad x \mapsto \frac{\sin(10\pi x)}{2x} + (x - 1)^4 \]

   is a one-dimensional function that is simple to minimize. Its global minimizer
   is near 𝑥 ≈ 0.549.
6. The Griewank function

   \[ f_{\mathrm{GR}} \colon [-600, 600]^d \to \mathbb R, \quad \mathbf x \mapsto \sum_{i=1}^{d} \frac{x_i^2}{4000} - \prod_{i=1}^{d} \cos\Bigl(\frac{x_i}{\sqrt i}\Bigr) + 1 \]

   has many regularly distributed, widespread local minima. Its global mini-
   mum is 𝑓GR (𝟎) = 0.
7. The Hölder table function

   \[ f_{\mathrm{HT}} \colon [-10, 10]^2 \to \mathbb R, \quad \mathbf x \mapsto -\Biggl| \sin x_1 \cos x_2 \exp\Biggl(\biggl| 1 - \frac{\sqrt{x_1^2 + x_2^2}}{\pi} \biggr|\Biggr) \Biggr| \]

   has many local minima and the four global minima

   \[ f_{\mathrm{HT}}(\pm 8.05502, \pm 9.66459) \approx -19.2085. \]

8. The Levy function

   \[ f_{\mathrm{LE}} \colon [-10, 10]^d \to \mathbb R, \quad \mathbf x \mapsto \sin(\pi y_1)^2 + \sum_{i=1}^{d-1} (y_i - 1)^2 \bigl(1 + 10 \sin(\pi y_i + 1)^2\bigr) + (y_d - 1)^2 \bigl(1 + \sin(2\pi y_d)^2\bigr), \quad y_i := 1 + \frac{x_i - 1}{4}, \]

   has many local minima. Its global minimum is 𝑓LE (1, … , 1) = 0.
9. The Michalewicz function

   \[ f_{\mathrm{MI}} \colon [0, \pi]^d \to \mathbb R, \quad \mathbf x \mapsto -\sum_{i=1}^{d} \sin(x_i) \sin(i x_i^2/\pi)^{2m} \]

   has 𝑑! local minima. Its parameter 𝑚 determines the steepness of the slopes,
   where larger 𝑚 makes the search more difficult. Usually, the value 𝑚 ∶= 10
   is used.
10. The Rastrigin function

    \[ f_{\mathrm{RA}} \colon [-5.12, 5.12]^d \to \mathbb R, \quad \mathbf x \mapsto \alpha d + \sum_{i=1}^{d} \bigl(x_i^2 - \alpha \cos(2\pi x_i)\bigr), \qquad \alpha := 10, \]

    has many regularly distributed local minima. Its global minimum is
    𝑓RA (𝟎) = 0.
11. The two-dimensional Rosenbrock function

    \[ f_{\mathrm{RO2}} \colon [-5, 10]^2 \to \mathbb R, \quad \mathbf x \mapsto (\alpha - x_1)^2 + \beta (x_2 - x_1^2)^2, \]

    has the global minimum 𝑓RO2 (𝛼, 𝛼²) = 0. Usually the parameters 𝛼 ∶= 1
    and 𝛽 ∶= 100 are used. The global minimum lies in a narrow, parabolic
    valley. While finding this valley is easy, converging to the global minimum
    in the valley is difficult.
    The 𝑑-dimensional Rosenbrock function

    \[ f_{\mathrm{ROD}} \colon [-5, 10]^d \to \mathbb R, \quad \mathbf x \mapsto \sum_{i=1}^{d-1} \bigl((1 - x_i)^2 + 100 (x_{i+1} - x_i^2)^2\bigr) \]

    is nonconvex. In the case 𝑑 = 3, it has one global minimum 𝑓ROD (1, 1, 1) =
    0. In the case 4 ≤ 𝑑 ≤ 7, it has the global minimum 𝑓ROD (1, … , 1) = 0 and
    a local minimum near the point (−1, 1, … , 1). In all dimensions, the global
    minimum is 𝑓ROD (1, … , 1) = 0.
12. The Schaffer function no. 2

    \[ f_{\mathrm{SCH2}} \colon [-100, 100]^2 \to \mathbb R, \quad \mathbf x \mapsto 1/2 + \frac{\sin(x_1^2 - x_2^2)^2 - 1/2}{\bigl(1 + (x_1^2 + x_2^2)/1000\bigr)^2} \]

    has a complicated structure. Its global minimum is 𝑓SCH2 (0, 0) = 0.
13. The Shubert function

    \[ f_{\mathrm{SH}} \colon [-10, 10]^2 \to \mathbb R, \quad \mathbf x \mapsto \Biggl(\sum_{i=1}^{5} i \cos\bigl((i + 1) x_1 + i\bigr)\Biggr) \Biggl(\sum_{i=1}^{5} i \cos\bigl((i + 1) x_2 + i\bigr)\Biggr) \]

    has several local minima and many global minima. Its global minimum is
    approximately −186.731.
14. The six-hump camel function

    \[ f_{\mathrm{SHC}} \colon [-3, 3] \times [-2, 2] \to \mathbb R, \quad \mathbf x \mapsto (4 - 2.1 x_1^2 + x_1^4/3) x_1^2 + x_1 x_2 + (-4 + 4 x_2^2) x_2^2 \]

    has six local minima, two of which are global. The two global minima are
    𝑓SHC (±0.0898, ∓0.7126) ≈ −1.0316.
15. The sphere function

    \[ f_{\mathrm{SPH}} \colon \mathbb R^d \to \mathbb R, \quad \mathbf x \mapsto \sum_{i=1}^{d} x_i^2 \]

    is a simple, convex function to check the implementation of optimization
    algorithms. It has no local minima except for the global minimum 𝑓SPH (𝟎) =
    0.
16. The Zakharov function

    \[ f_{\mathrm{ZA}} \colon [-5, 10]^d \to \mathbb R, \quad \mathbf x \mapsto \sum_{i=1}^{d} x_i^2 + \Biggl(\sum_{i=1}^{d} 0.5 i x_i\Biggr)^2 + \Biggl(\sum_{i=1}^{d} 0.5 i x_i\Biggr)^4 \]

    has no local minima except the global minimum 𝑓ZA (𝟎) = 0.
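A few of the benchmark functions above can be implemented in Julia as follows (cf. Problem 11.1); the definitions mirror the formulas with the usual parameter values.

```julia
# Ackley function with the usual parameters α = 20, β = 0.2, γ = 2π.
function ackley(x; α = 20.0, β = 0.2, γ = 2π)
    d = length(x)
    -α * exp(-β * sqrt(sum(abs2, x) / d)) -
        exp(sum(cos, γ .* x) / d) + α + exp(1)
end

# Rastrigin function with the usual parameter α = 10.
rastrigin(x; α = 10.0) = α * length(x) + sum(xi^2 - α * cos(2π * xi) for xi in x)

# d-dimensional Rosenbrock function.
rosenbrock(x) = sum((1 - x[i])^2 + 100 * (x[i+1] - x[i]^2)^2
                    for i in 1:length(x)-1)

ackley(zeros(3)), rastrigin(zeros(3)), rosenbrock(ones(4))
```

Evaluating each function at its stated global minimizer is a quick sanity check of the implementation.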

11.9 Julia Packages

Many optimization packages are available under the JuliaOpt umbrella. Various
capabilities for global optimization are provided, e.g., by the Alpine,
BlackBoxOptim, Evolutionary, GeneticAlgorithms, and StochasticSearch packages.

11.10 Bibliographical Remarks

A classic book on evolutionary computation is [5], while a more recent one is


[12]. Two textbooks on differential evolution are [3, 15]. Good overviews of
global optimization are given in [6, 7, 14, 16, 18].

Problems

11.1 Implement the benchmark functions in Sect. 11.8.


11.2 Plot the benchmark functions in Sect. 11.8 and categorize them with respect
to properties that make their minimization difficult.
11.3 Show that the global minimum of the Rastrigin function 𝑓RA is 𝑓(𝟎) = 0.
11.4 Show the statement about the (local and global) minima of the 𝑑-dimens-
ional Rosenbrock function 𝑓ROD in Sect. 11.8.
11.5 Implement simulated annealing.
11.6 Apply simulated annealing to the benchmark problems.
11.7 Implement the cooling strategies in Sect. 11.3.3 for simulated annealing and
apply them to the benchmark problems.
11.8 Implement particle-swarm optimization.
11.9 Apply particle-swarm optimization to the benchmark problems.
11.10 Implement a genetic algorithm.
11.11 Apply a genetic algorithm to the benchmark problems. Can you find an
algorithm such that the same set of parameters works well for all benchmark
problems?
11.12 Design and perform an ablation study for a genetic algorithm.
11.13 Implement random restarting. The function should take an optimization
algorithm and take care of running it a given number of times and recording the
results. The overall results should be presented in a plot.
11.14 Write parallel versions of the algorithms in the previous exercises.

References

1. Darwin, C.: On the Origin of Species by Means of Natural Selection, or the Preservation of
Favoured Races in the Struggle for Life. John Murray, London, UK (1859)
2. Eberhart, R., Kennedy, J.: A new optimizer using particle swarm theory. In: Proc. 6th Inter-
national Symposium on Micro Machine and Human Science, pp. 39–43. IEEE Press (1995)
3. Feoktistov, V.: Differential Evolution: in Search of Solutions. Springer (2006)
4. Hajek, B.: Cooling schedules for optimal annealing. Mathematics of Operations Research
13(2), 311–329 (1988)
5. Holland, J.: Adaptation in Natural and Artificial Systems. The University of Michigan
Press, Ann Arbor, MI, USA (1975)
6. Horst, R., Pardalos, P., Thoai, N.: Introduction to Global Optimization, 2nd edn. Kluwer
Academic Publishers (2000)
7. Horst, R., Tuy, H.: Global Optimization: Deterministic Approaches. Springer (1996)
8. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proc. IEEE International Con-
ference on Neural Networks, pp. 1942–1948. IEEE Press (1995)
9. Kirkpatrick, S., Gelatt Jr., C., Vecchi, M.: Optimization by simulated annealing. Science
220(4598), 671–680 (1983)
10. van Laarhoven, P., Aarts, E.: Simulated Annealing: Theory and Applications. Mathematics
and its Applications. Kluwer Academic Publishers (1987)
11. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., Teller, E.: Equation of state cal-
culations by fast computing machines. J. Chem. Phys. 21, 1087–1092 (1953)
12. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs, 3rd edn.
Springer (1996)
13. Nolte, A., Schrader, R.: A note on the finite time behaviour of simulated annealing. Math-
ematics of Operations Research 25(3), 476–484 (2000)
14. Pintér, J.: Global Optimization in Action – Continuous and Lipschitz Optimization: Algo-
rithms, Implementations and Applications, reprint edn. Springer (2010)
15. Price, K., Storn, R., Lampinen, J.: Differential Evolution: a Practical Approach to Global
Optimization. Springer (2005)
16. Strongin, R., Sergeyev, Y.: Global Optimization with Non-Convex Constraints: Sequential
and Parallel Algorithms. Kluwer Academic Publishers (2000)
17. Wolpert, D., Macready, W.: No free lunch theorems for optimization. IEEE Transactions
on Evolutionary Computation 1(1), 67–82 (1997)
18. Zhigljavsky, A.: Theory of Global Random Search. Kluwer Academic Publishers (1991)
Chapter 12
Local Optimization

Gradior, gradi, gressus (Latin):


to walk, to step, to advance

Gradiens (present participle of gradior):


walking, stepping, advancing

Abstract Optimization theory and algorithms can take advantage of the smooth-
ness of real-valued functions. The underlying assumption is that a reasonable
starting point sufficiently close to a local extremum is already known, for exam-
ple from performing a global optimization, and that (at least) the gradient of
the objective function is available. After a discussion of the convergence rates
of gradient descent, accelerated gradient descent, and the Newton method, the
bfgs method is presented in detail, as it is one of the most popular quasi-Newton
methods and highly effective in practice.

12.1 Introduction

The general assumption in this chapter is that the real scalar objective function

𝑓 ∶ ℝ𝑑 → ℝ

is smooth enough in the sense that all the derivatives we use in certain contexts
exist. The gradient ∇𝑓 provides us with valuable knowledge of how the function
changes locally: since the gradient is the direction in which the function changes
the most, it is reasonable to follow the gradient ∇𝑓 when maximizing the function
and the negative gradient −∇𝑓 when minimizing it. But, a bit surprisingly,
this is only a general rule, as we will see in Sect. 12.5.
Why is the gradient the direction of the largest change? It can be shown that
the directional derivative


𝜕𝑓∕𝜕𝐞 (𝐫) ∶= lim_{ℎ→0} (𝑓(𝐫 + ℎ𝐞) − 𝑓(𝐫))∕ℎ

of 𝑓 at a point 𝐫 in direction 𝐞 is equal to

𝜕𝑓∕𝜕𝐞 (𝐫) = 𝐞 ⋅ ∇𝑓(𝐫),
where the vector 𝐞 is a unit vector. The Cauchy–Bunyakovsky–Schwarz inequal-
ity, Theorem 8.1, states that

|𝐱 ⋅ 𝐲| ≤ ‖𝐱‖‖𝐲‖ ∀𝐱, 𝐲 ∈ ℝᵈ,

and equality holds if and only if there exists a 𝜆 ∈ ℝ such that 𝐱 = 𝜆𝐲,
i.e., if the two vectors 𝐱 and 𝐲 are parallel. Here ‖.‖ denotes the Euclidean norm.
Applying the inequality to the directional derivative yields
|𝜕𝑓∕𝜕𝐞 (𝐫)| ≤ ‖∇𝑓(𝐫)‖,
since ‖𝐞‖ = 1, yielding an upper bound for the absolute value of the directional
derivative. The Cauchy–Bunyakovsky–Schwarz inequality, Theorem 8.1, also
tells us that this upper bound is achieved if and only if 𝐞 and ∇𝑓 are parallel, i.e.,
if and only if the directional derivative is taken in the direction 𝐞 ∶= ∇𝑓∕‖∇𝑓‖ of
the gradient. This is the reason for the importance of the gradient for optimization.
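This maximality of the gradient direction can be spot-checked numerically. The following sketch (the example function 𝑓, the point, and the step size are our choices) compares forward-difference directional derivatives:

```julia
# Numerical spot-check: among unit directions e, the directional derivative
# of f at r is largest along e = ∇f(r)/‖∇f(r)‖. The example f is ours.
f(x) = x[1]^2 + 3 * x[2]^2
gradf(x) = [2 * x[1], 6 * x[2]]            # exact gradient of the example

r = [1.0, -2.0]
h = 1e-6
dirderiv(e) = (f(r .+ h .* e) - f(r)) / h  # forward difference in direction e

g = gradf(r)
e_grad = g ./ sqrt(sum(abs2, g))           # unit vector along the gradient

along_grad = dirderiv(e_grad)              # approximately ‖∇f(r)‖
along_other = dirderiv([0.0, -1.0])        # strictly smaller
```

Such a check can of course only illustrate the inequality at sample directions, not prove it.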
In optimization we seek local and/or global extrema, whose definitions are
the following.

Definition 12.1 ((strict) global extremum) A function 𝑓 ∶ 𝑈 → 𝑉 has a


global minimum at a point 𝑥∗ ∈ 𝑈 if 𝑓(𝑥 ∗ ) ≤ 𝑓(𝑥) holds for all 𝑥 ∈ 𝑈. Analo-
gously, it has a global maximum at a point 𝑥 ∗ ∈ 𝑈 if 𝑓(𝑥 ∗ ) ≥ 𝑓(𝑥) holds for all
𝑥 ∈ 𝑈. Strict global extrema are ones where the conditions hold with < and >
instead of ≤ and ≥.

A function may have more than one global extremum; for example, consider
the function 𝑓 ∶ ℝ → ℝ, 𝑥 ↦ (𝑥 2 − 1)2 .

Definition 12.2 ((strict) local extremum) A function 𝑓 ∶ 𝑈 → 𝑉 has a local


minimum at a point 𝑥 ∗ ∈ 𝑈 if there exists a neighborhood 𝑊 such that 𝑓(𝑥 ∗ ) ≤
𝑓(𝑥) holds for all 𝑥 ∈ 𝑊. Analogously, it has a local maximum at a point 𝑥∗ ∈ 𝑈
if there exists a neighborhood 𝑊 such that 𝑓(𝑥 ∗ ) ≥ 𝑓(𝑥) holds for all 𝑥 ∈ 𝑊.
Strict local extrema are ones where the conditions hold with < and > instead of ≤
and ≥.

In order to fully specify an optimization problem, the domain where a global


extremum is sought must be specified. However, since we are interested in local
optimization here, we will refrain from doing so in the following in order not

to repeat this point continuously. But it should be remembered that a global ex-
tremum may be located on the boundary of a closed domain, and the derivatives
of the function cannot help us to find it there.
Any point where the gradient of the function vanishes is called a stationary
or critical point. If a stationary point of a real-valued function 𝑓 ∶ ℝ → ℝ is
isolated, it can be classified into four kinds depending on the signs of the first
derivative to the left and to the right of the stationary point:
1. a local minimum is a stationary point 𝑥 where the first derivative 𝑓 ′ changes
from negative to positive (and hence the second derivative 𝑓 ′′ (𝑥) is positive),
2. a local maximum is a stationary point 𝑥 where the first derivative 𝑓 ′ changes
from positive to negative (and hence the second derivative 𝑓 ′′ (𝑥) is nega-
tive),
3. an increasing point of inflection is a stationary point 𝑥 where the first deriva-
tive 𝑓 ′ is positive on both sides of the stationary point, and
4. a decreasing point of inflection is a stationary point 𝑥 where the first deriva-
tive 𝑓 ′ is negative on both sides of the stationary point.
The first two categories are local extrema, and the other two cases are known
as inflection points or saddle points. Using this naming convention, Fermat’s
theorem (the one easier to prove) on interior extrema states that the condition
that the first derivative of a (smooth) real function defined on an open interval
vanishes is a necessary condition for a local extremum.
Theorem 12.3 (Fermat’s theorem, interior extremum theorem) Suppose
that 𝑓 ∶ (𝑎, 𝑏) → ℝ is a function on the open interval (𝑎, 𝑏) ⊂ ℝ and that 𝑥 ∗ is a
local extremum of 𝑓. If 𝑓 is differentiable at the point 𝑥 ∗ , then 𝑓 ′ (𝑥 ∗ ) = 0.
Proof Suppose that 𝑥∗ is a local minimum. (The proof in the case of a local
maximum proceeds analogously.) Then, by the definition of a local minimum,
there exists a 𝛿 ∈ ℝ+ such that (𝑥 ∗ − 𝛿, 𝑥 ∗ + 𝛿) ⊂ (𝑎, 𝑏) and such that 𝑓(𝑥 ∗ ) ≤
𝑓(𝑥) for all 𝑥 ∈ (𝑥∗ − 𝛿, 𝑥 ∗ + 𝛿). Dividing by positive and negative ℎ implies

(𝑓(𝑥∗ + ℎ) − 𝑓(𝑥∗ ))∕ℎ ≥ 0 ∀ℎ ∈ (0, 𝛿),
(𝑓(𝑥∗ + ℎ) − 𝑓(𝑥∗ ))∕ℎ ≤ 0 ∀ℎ ∈ (−𝛿, 0).

Since 𝑓 is differentiable at 𝑥 ∗ by assumption, the limits of these two quotients as
ℎ → 0 exist and taking the limits implies both 𝑓 ′ (𝑥 ∗ ) ≥ 0 and 𝑓 ′ (𝑥 ∗ ) ≤ 0. □
A counterexample showing that the condition in the theorem is not sufficient
and – at the same time – the simplest example of an inflection or saddle point is
the function 𝑓(𝑥) ∶= 𝑥³ on any interval that contains zero. Then 𝑓 ′ (0) = 0, but
zero is not a local extremum; it is a saddle point.
In higher dimensions, a stationary or critical point is a point where the gradi-
ent vanishes. Again, a vanishing gradient is not a sufficient condition for a local
extremum. The prototypical example is the function 𝑓(𝑥, 𝑦) ∶= 𝑥² + 𝑦³ at the
point (0, 0), where it resembles a saddle.

12.2 The Hessian Matrix

The relationships between the first and second derivatives of a multivariate func-
tion and its local extrema are summarized in Theorem 12.7 below, where the
Hessian matrix of the multivariate function 𝑓 plays a major role. Before stating
the theorem, we define the Hessian matrix of a function and see where it occurs
in multivariate Taylor expansions.

Definition 12.4 (Hessian matrix) The Hessian matrix of a function 𝑓 ∶ ℝ𝑑 →


ℝ is the (𝑑 × 𝑑)-dimensional matrix 𝐻 with the entries

ℎ𝑖𝑗 ∶= 𝜕²𝑓(𝐱) ∕ (𝜕𝑥𝑖 𝜕𝑥𝑗 ).

The significance of the Hessian matrix is that it is the coefficient of the


quadratic term in multivariate Taylor expansions.

Theorem 12.5 (multivariate Taylor expansion) Suppose that 𝑓 ∶ ℝ𝑑 → ℝ is


an (𝑛 + 1)-times continuously differentiable function on an open, convex set 𝑆. For
any two points 𝐱 ∈ 𝑆 and 𝐱 + 𝐡 ∈ 𝑆, the Taylor expansion
𝑓(𝐱 + 𝐡) = ∑_{|𝛼|≤𝑛} (𝐡^𝛼 ∕𝛼!) 𝜕^𝛼 𝑓(𝐱) + 𝑅𝑛 (𝐱, 𝐡)

holds, where the integral form of the remainder term is given by


𝑅𝑛 (𝐱, 𝐡) = (𝑛 + 1) ∑_{|𝛼|=𝑛+1} (𝐡^𝛼 ∕𝛼!) ∫₀¹ (1 − 𝑡)ⁿ 𝜕^𝛼 𝑓(𝐱 + 𝑡𝐡) d𝑡

and the Lagrange form of the remainder term is given by


𝑅𝑛 (𝐱, 𝐡) = ∑_{|𝛼|=𝑛+1} (𝐡^𝛼 ∕𝛼!) 𝜕^𝛼 𝑓(𝐱 + 𝑡𝐡) for some 𝑡 ∈ (0, 1).

In particular, the expansion


𝑓(𝐱 + 𝐡) = 𝑓(𝐱) + ∇𝑓(𝐱)⊤ 𝐡 + (1∕2!) 𝐡⊤ 𝐻(𝐱)𝐡 + 𝑂(‖𝐡‖³)
holds whenever 𝑛 ≥ 3.

The classification of saddle points into degenerate and nondegenerate ones


in the following definition will also be needed in Theorem 12.7.

Definition 12.6 ((non-)degenerate saddle point) A saddle point 𝐱 is called


degenerate if the Hessian matrix 𝐻(𝐱) at 𝐱 is singular. Conversely, the saddle
point is called nondegenerate if the matrix 𝐻(𝐱) is regular.

Based on these definitions, we can formulate important relationships be-


tween local extrema and properties of the Hessian matrix.

Theorem 12.7 (Hessian and local extrema) Suppose that 𝐷 ⊂ ℝ𝑑 is a domain,


that 𝑓 ∶ ℝ𝑑 → ℝ a function with continuous partial derivatives of first and second
order, and that 𝐱 ∈ 𝐷 is a stationary point, i.e., ∇𝑓(𝐱) = 0. Then the following
statements hold true.
1. If 𝐻(𝐱) is positive definite, then 𝐱 is a strict local minimum of 𝑓.
2. If 𝐻(𝐱) is negative definite, then 𝐱 is a strict local maximum of 𝑓.
3. If 𝐱 is a local minimum of 𝑓, then 𝐻(𝐱) is positive semidefinite.
4. If 𝐱 is a local maximum of 𝑓, then 𝐻(𝐱) is negative semidefinite.
5. If 𝐻(𝐱) is indefinite (i.e., if there exist nonzero vectors 𝐲 and 𝐳 such that
𝐲 ⊤ 𝐻(𝐱)𝐲 < 0 < 𝐳⊤ 𝐻(𝐱)𝐳), then 𝐱 is a nondegenerate saddle point.

These considerations are often also called the first and second partial-derivative
tests and provide a tool for identifying local extrema of a sufficiently
smooth function in the interior of a domain. But the test may be inconclusive,
as the statements in Theorem 12.7 do not cover all cases that may occur.
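The second-derivative test can be explored numerically. The sketch below (the finite-difference scheme, the step size, and the example functions are our choices) approximates the Hessian and inspects its eigenvalues with the standard library's LinearAlgebra:

```julia
using LinearAlgebra

# Finite-difference approximation of the Hessian of f: R^d -> R at x.
function hessian(f, x; h = 1e-4)
    d = length(x)
    H = zeros(d, d)
    for i in 1:d, j in 1:d
        ei = zeros(d); ei[i] = h
        ej = zeros(d); ej[j] = h
        H[i, j] = (f(x .+ ei .+ ej) - f(x .+ ei) - f(x .+ ej) + f(x)) / h^2
    end
    return Symmetric((H + H') / 2)        # symmetrize numerically
end

# All eigenvalues positive at a stationary point: strict local minimum;
# eigenvalues of both signs: a saddle point.
fmin(x) = x[1]^2 + 2 * x[2]^2             # strict local (and global) minimum at 0
fsad(x) = x[1]^2 - x[2]^2                 # saddle point at 0

λ_min = eigvals(hessian(fmin, [0.0, 0.0]))
λ_sad = eigvals(hessian(fsad, [0.0, 0.0]))
```

For quadratic functions the central difference is exact up to roundoff, so the eigenvalue signs are reliable here; for general functions the step size `h` must be chosen with more care.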

12.3 Convexity

Another question that naturally arises is whether a local extremum that we may
have found is already a (or the unique) global extremum. Are there properties of
the function or its domain that always ensure that we can draw such a conclu-
sion?
The answer is yes, such properties exist: they are the convexity of the domain
and the convexity of the function. The concept of convexity substantially simpli-
fies the search for global extrema. We start by defining convex sets and convex
functions.

Definition 12.8 (convex set) A subset 𝐶 ⊂ ℝ𝑑 is called convex if

∀𝑡 ∈ [0, 1] ∶ ∀𝐱, 𝐲 ∈ 𝐶 ∶ 𝑡𝐱 + (1 − 𝑡)𝐲 ∈ 𝐶

holds.

Definition 12.9 ((strictly) convex function) A function 𝑓 ∶ ℝ𝑑 ⊃ 𝐶 → ℝ on


a convex set 𝐶 ⊂ ℝ𝑑 is called convex if the inequality

∀𝑡 ∈ (0, 1) ∶ ∀𝐱, 𝐲 ∈ 𝐶 ∶ 𝑓(𝑡𝐱 + (1 − 𝑡)𝐲) ≤ 𝑡𝑓(𝐱) + (1 − 𝑡)𝑓(𝐲)

holds and it is called strictly convex if the inequality holds with < instead of ≤.

Definition 12.10 ((strictly) concave function) A function 𝑓 ∶ ℝ𝑑 ⊃ 𝐶 → ℝ


on a convex set 𝐶 ⊂ ℝ𝑑 is called (strictly) concave if −𝑓 is (strictly) convex.
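Definition 12.9 can be spot-checked on sample points. A small sketch (the function, the points, and the tolerance are our choices):

```julia
# Spot-check of the convexity inequality for a convex example function.
f(x) = x[1]^2 + x[2]^2                    # the sphere function is convex
x, y = [1.0, -2.0], [3.0, 0.5]

# Check f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) on a grid of t values
# (a small tolerance absorbs floating-point roundoff).
convex_ok = all(f(t .* x .+ (1 - t) .* y) <= t * f(x) + (1 - t) * f(y) + 1e-12
                for t in 0:0.01:1)
```

Such a sampling test can only refute convexity, never prove it, but it is a useful sanity check when implementing objective functions.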

Convexity has the useful property that it ensures that a local minimum is
already a global minimum (while disregarding the boundary as usual by suppos-
ing that the domain of the function is an open set). If the function is even strictly
convex, we can additionally conclude that this global minimum is unique.

Theorem 12.11 (convexity and minima) Suppose that 𝐶 ⊂ ℝ𝑑 is a convex


open set, that 𝑓 ∶ 𝐶 → ℝ is a convex function, and that 𝐱∗ is a local minimum of 𝑓.
Then the local minimum 𝐱∗ is a global minimum. Furthermore, if the function 𝑓
is even strictly convex, then the local minimum 𝐱∗ is the unique global minimum.

Proof By assumption, 𝐱∗ is a local minimum of 𝑓, i.e.,

∃𝛿 ∈ ℝ+ ∶ ∀𝐱 ∈ {𝐱 ∈ 𝐶 ∣ ‖𝐱 − 𝐱∗ ‖ < 𝛿} ∶ 𝑓(𝐱) ≥ 𝑓(𝐱∗ ). (12.1)

The proof is indirect. Assuming that there exists a different point 𝐱0 ∈ 𝐶∖{𝐱∗ }
such that 𝑓(𝐱0 ) < 𝑓(𝐱∗ ) yields

𝑓(𝑡𝐱0 + (1 − 𝑡)𝐱∗ ) ≤ 𝑡𝑓(𝐱0 ) + (1 − 𝑡)𝑓(𝐱∗ )


< 𝑡𝑓(𝐱∗ ) + (1 − 𝑡)𝑓(𝐱∗ ) = 𝑓(𝐱∗ ) ∀𝑡 ∈ (0, 1).

If a 𝑡 ∈ (0, 1) can be found such that ‖(𝑡𝐱0 + (1 − 𝑡)𝐱∗ ) − 𝐱∗ ‖ < 𝛿, we have found
a point 𝑡𝐱0 + (1 − 𝑡)𝐱∗ that contradicts the assumption (12.1) that 𝐱∗ is a local
minimum. We find

‖(𝑡𝐱0 + (1 − 𝑡)𝐱∗ ) − 𝐱∗ ‖ = 𝑡‖𝐱0 − 𝐱∗ ‖ = 𝛼𝛿 < 𝛿

after defining
0 < 𝑡 ∶= 𝛼𝛿 ∕ ‖𝐱0 − 𝐱∗ ‖ < 1
and choosing any 𝛼 < 1 from the interval (0, ‖𝐱0 − 𝐱∗ ‖∕𝛿), which shows that
such a 𝑡 exists. Hence the first part of the theorem follows.
To show the second part, we assume that there exists a different point 𝐱0 ∈
𝐶∖{𝐱∗ } such that 𝑓(𝐱0 ) ≤ 𝑓(𝐱∗ ). Since 𝑓 is now even strictly convex, we can
conclude that

𝑓(𝑡𝐱0 + (1 − 𝑡)𝐱∗ ) < 𝑡𝑓(𝐱0 ) + (1 − 𝑡)𝑓(𝐱∗ )
≤ 𝑡𝑓(𝐱∗ ) + (1 − 𝑡)𝑓(𝐱∗ ) = 𝑓(𝐱∗ ) ∀𝑡 ∈ (0, 1).

Proceeding similarly, we again see that the existence of the point 𝑡𝐱0 + (1 − 𝑡)𝐱∗
contradicts (12.1), which concludes the indirect proof. □

In the case of maxima, the function must be (strictly) concave instead of


(strictly) convex.

Corollary 12.12 (convexity and maxima) Suppose that 𝐶 ⊂ ℝ𝑑 is a convex


open set, that 𝑓 ∶ 𝐶 → ℝ is a concave function, and that 𝐱∗ is a local maximum
of 𝑓. Then the local maximum 𝐱∗ is a global maximum. Furthermore, if the func-
tion 𝑓 is even strictly concave, then the local maximum 𝐱∗ is the unique global
maximum.
Because of Theorem 12.11 and Corollary 12.12, it is useful to try to partition
the domain of a given function that is not (strictly) convex (or concave) on the
whole domain such that the function is (strictly) convex (or concave) on as many
subdomains as possible. Then Theorem 12.11 and Corollary 12.12 may be ap-
plied to the subdomains, splitting the problem into more manageable parts.

12.4 Gradient Descent

The blueprint of any iterative optimization algorithm is the following.


Algorithm 12.13 (iterative minimization)
1. Choose an initial point 𝐱0 .
2. Repeat the iteration until the stopping criterion is satisfied.
a. Choose the descent direction Δ𝐱𝑛 .
b. Choose the step size ℎ𝑛 ∈ ℝ+ .
c. Update by defining
𝐱𝑛+1 ∶= 𝐱𝑛 + ℎ𝑛 Δ𝐱𝑛 .
A suitable starting point can be found by global optimization (see Chap. 11).
Two common choices for the stopping criterion are to stop when the change in the
points becomes smaller than a prescribed value, i.e., when |𝐱𝑛+1 −𝐱𝑛 | = |ℎ𝑛 Δ𝐱𝑛 |
becomes small, or when the change in the function values becomes small, i.e.,
when |𝑓(𝐱𝑛+1 ) − 𝑓(𝐱𝑛 )| becomes small.
We already know from Sect. 12.1 that the negative gradient is the direction of
steepest descent. Therefore, for minimizing a multivariate function 𝑓 ∶ ℝ𝑑 → ℝ,
setting
Δ𝐱𝑛 ∶= −∇𝑓(𝐱𝑛 ) (12.2)
and ℎ𝑛 ∶= ℎ ∈ ℝ+ suggests itself. These definitions yield the iteration

𝐱𝑛+1 ∶= 𝐱𝑛 − ℎ∇𝑓(𝐱𝑛 ),

which is called gradient descent. In this straightforward usage of the gradient,


the step size ℎ𝑛 = ℎ ∈ ℝ+ is constant; we will soon see other choices.
When the objective function is to be maximized, the update

𝐱𝑛+1 ∶= 𝐱𝑛 + ℎ∇𝑓(𝐱𝑛 )

is called gradient ascent.
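Algorithm 12.13 with the descent direction (12.2) and a constant step size fits into a few lines of Julia. In the sketch below, the stopping rule (a small update ‖ℎΔ𝐱ₙ‖) and the example function are our choices:

```julia
# Gradient descent with constant step size h (a sketch of Algorithm 12.13
# with the descent direction Δx_n := -∇f(x_n)).
function gradient_descent(gradf, x0; h = 0.1, tol = 1e-10, maxiter = 100_000)
    x = copy(x0)
    for _ in 1:maxiter
        step = h .* gradf(x)
        x = x .- step
        sqrt(sum(abs2, step)) < tol && break   # stop when the update is small
    end
    return x
end

# Example: minimize the convex quadratic f(x) = x1^2 + 3 x2^2.
gradf(x) = [2 * x[1], 6 * x[2]]
xmin = gradient_descent(gradf, [4.0, -3.0])    # converges to the origin
```

Passing the gradient as a function argument keeps the iteration independent of the particular objective, which mirrors the blueprint structure of Algorithm 12.13.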



Another way to see why the choice (12.2) is expedient is the following argument,
which is based on a basic inequality that is shown in Problem 12.4.

Lemma 12.14 (gradient inequality) A continuously differentiable function


𝑓 ∶ ℝ𝑑 → ℝ is convex if and only if the inequality

𝑓(𝐲) − 𝑓(𝐱) ≥ ∇𝑓(𝐱) ⋅ (𝐲 − 𝐱) ∀𝐱, 𝐲 ∈ ℝᵈ

holds.

By this inequality, we have

𝑓(𝐱𝑛+1 ) ≥ 𝑓(𝐱𝑛 ) + ∇𝑓(𝐱𝑛 ) ⋅ (𝐱𝑛+1 − 𝐱𝑛 ) = 𝑓(𝐱𝑛 ) + ∇𝑓(𝐱𝑛 ) ⋅ (ℎ𝑛 Δ𝐱𝑛 ).

Therefore, in order to see an improvement in every step, i.e., in order to have


𝑓(𝐱𝑛 ) > 𝑓(𝐱𝑛+1 ), the descent direction must satisfy

0 > ∇𝑓(𝐱𝑛 ) ⋅ Δ𝐱𝑛 .

This inequality is certainly satisfied by the choice (12.2).


The first theorem below will show how fast gradient descent converges if the
function 𝑓 is convex and sufficiently smooth. To state the theorems in this
section, the concept of an 𝐿-smooth function is useful.

Definition 12.15 (𝐿-smooth function) A function 𝑓 ∶ ℝ𝑑 ⊃ 𝐷 → ℝ is called


𝐿-smooth if its gradient ∇𝑓 exists and is 𝐿-Lipschitz continuous on its domain 𝐷,
i.e., if

∃𝐿 ∈ ℝ+ ∶ ∀𝐱, 𝐲 ∈ 𝐷 ∶ ‖∇𝑓(𝐲) − ∇𝑓(𝐱)‖ ≤ 𝐿‖𝐲 − 𝐱‖

holds.

The following theorem shows that the convergence rate is linear (as a function
of the number of iterations) when an appropriate constant step size is used.

Theorem 12.16 (convergence rate of gradient descent) Suppose that the


function 𝑓 ∶ ℝ𝑑 → ℝ is convex and 𝐿-smooth and that it has the unique global
minimum 𝐱∗ . Then the gradient-descent update

𝐱𝑛+1 ∶= 𝐱𝑛 − ℎ𝑛 ∇𝑓(𝐱𝑛 )

with step sizes ℎ𝑛 ≤ 1∕𝐿 satisfies the inequality

𝑓(𝐱𝑛 ) − 𝑓(𝐱∗ ) ≤ ‖𝐱0 − 𝐱∗ ‖² ∕ ∑_{𝑘=0}^{𝑛} ℎ𝑘 (1 − 𝐿ℎ𝑘 ∕2) ∀𝑛 ∈ ℕ.

Proof Lemma 12.17 below applied to the points 𝐱𝑘+1 = 𝐱𝑘 − ℎ𝑘 ∇𝑓(𝐱𝑘 ) and 𝐱𝑘
yields the inequality

𝑓(𝐱𝑘+1 ) − 𝑓(𝐱𝑘 ) ≤ ∇𝑓(𝐱𝑘 ) ⋅ (𝐱𝑘+1 − 𝐱𝑘 ) + (𝐿∕2)‖𝐱𝑘+1 − 𝐱𝑘 ‖²
= −ℎ𝑘 (1 − 𝐿ℎ𝑘 ∕2) ‖∇𝑓(𝐱𝑘 )‖² ∀𝑘 ∈ ℕ0 .

Next, we define
𝑒𝑘 ∶= 𝑓(𝐱𝑘 ) − 𝑓(𝐱∗ ) ≥ 0,
which is the error in the 𝑘-th step and greater than or equal to zero for all 𝑘, since
𝐱∗ is the global minimum. The last inequality implies

𝑒𝑘+1 ≤ 𝑒𝑘 − ℎ𝑘 (1 − 𝐿ℎ𝑘 ∕2) ‖∇𝑓(𝐱𝑘 )‖² ∀𝑘 ∈ ℕ0 . (12.3)

Since 𝑓 is convex (and continuously differentiable), we have −𝑒𝑘 = 𝑓(𝐱∗ ) −


𝑓(𝐱𝑘 ) ≥ ∇𝑓(𝐱𝑘 ) ⋅ (𝐱∗ − 𝐱𝑘 ) and hence 𝑒𝑘 ≤ ‖∇𝑓(𝐱𝑘 )‖‖𝐱𝑘 − 𝐱∗ ‖. These two
inequalities yield

𝑒𝑘+1 ≤ 𝑒𝑘 − ℎ𝑘 (1 − 𝐿ℎ𝑘 ∕2) 𝑒𝑘² ∕ ‖𝐱𝑘 − 𝐱∗ ‖² ∀𝑘 ∈ ℕ0 .

As Lemma 12.18 below shows, the sequence ‖𝐱𝑘 − 𝐱∗ ‖ decreases as 𝑘 increases;
this is where the assumption ℎ𝑘 ≤ 1∕𝐿, and not only ℎ𝑘 < 2∕𝐿, is used.
Therefore we have the estimate

𝑒𝑘+1 ≤ 𝑒𝑘 − ℎ𝑘 (1 − 𝐿ℎ𝑘 ∕2) 𝑒𝑘² ∕ ‖𝐱0 − 𝐱∗ ‖² ∀𝑘 ∈ ℕ0

and hence
1∕𝑒𝑘+1 − 1∕𝑒𝑘 ≥ ℎ𝑘 (1 − 𝐿ℎ𝑘 ∕2) 𝑒𝑘 ∕ (‖𝐱0 − 𝐱∗ ‖² 𝑒𝑘+1 ) ∀𝑘 ∈ ℕ0

after division by 𝑒𝑘 𝑒𝑘+1 and rearranging terms. (If 𝑒𝑘 = 0 for any 𝑘, then 𝐱𝑘 = 𝐱∗
and ∇𝑓(𝐱𝑘 ) = 0, implying that all 𝐱𝑘 will be equal to 𝐱∗ from this point on and
trivially satisfying the asserted inequality.)
Because of 𝑒𝑘+1 ≤ 𝑒𝑘 due to (12.3), we find

1∕𝑒𝑘+1 − 1∕𝑒𝑘 ≥ ℎ𝑘 (1 − 𝐿ℎ𝑘 ∕2) ∕ ‖𝐱0 − 𝐱∗ ‖² ∀𝑘 ∈ ℕ0 .

Summing all these inequalities for 𝑘 ∈ {0, … , 𝑛 − 1} yields a telescopic sum and
hence the estimate
1∕𝑒𝑛 − 1∕𝑒0 ≥ (1 ∕ ‖𝐱0 − 𝐱∗ ‖²) ∑_{𝑘=0}^{𝑛−1} ℎ𝑘 (1 − 𝐿ℎ𝑘 ∕2) ∀𝑛 ∈ ℕ.

Since 𝑒𝑛 > 0, we find


𝑒𝑛 ≤ ‖𝐱0 − 𝐱∗ ‖² ∕ ∑_{𝑘=0}^{𝑛−1} ℎ𝑘 (1 − 𝐿ℎ𝑘 ∕2),
which is the asserted inequality. □

The following two lemmata were used in the proof. Note that Lemma 12.14
provides a lower bound for 𝑓(𝐲) − 𝑓(𝐱) due to convexity and that Lemma 12.17
provides an upper bound for 𝑓(𝐲) − 𝑓(𝐱) due to 𝐿-smoothness.

Lemma 12.17 Suppose that the function 𝑓 ∶ ℝ𝑑 → ℝ is 𝐿-smooth. Then the in-
equality
𝑓(𝐲) − 𝑓(𝐱) ≤ ∇𝑓(𝐱) ⋅ (𝐲 − 𝐱) + (𝐿∕2)‖𝐲 − 𝐱‖² ∀𝐱, 𝐲 ∈ ℝᵈ
holds.

Proof We calculate

𝑓(𝐲) − 𝑓(𝐱) − ∇𝑓(𝐱) ⋅ (𝐲 − 𝐱) = ∫₀¹ (∇𝑓(𝐱 + 𝑡(𝐲 − 𝐱)) − ∇𝑓(𝐱)) ⋅ (𝐲 − 𝐱) d𝑡
≤ ∫₀¹ 𝐿𝑡‖𝐲 − 𝐱‖² d𝑡
= (𝐿∕2)‖𝐲 − 𝐱‖²,

where the inequality follows using the Cauchy–Bunyakovsky–Schwarz inequal-
ity, Theorem 8.1, and the 𝐿-smoothness of 𝑓. □

Lemma 12.18 Suppose the function 𝑓 and the sequences ⟨ℎ𝑛 ⟩ and ⟨𝐱𝑛 ⟩ are as in
Theorem 12.16. Then the sequence ⟨‖𝐱𝑛 − 𝐱∗ ‖⟩ decreases as 𝑛 increases.

Proof We start by calculating

‖𝐱𝑛+1 − 𝐱∗ ‖² = ‖𝐱𝑛 − ℎ𝑛 ∇𝑓(𝐱𝑛 ) − 𝐱∗ ‖²
= ‖𝐱𝑛 − 𝐱∗ ‖² − 2ℎ𝑛 ∇𝑓(𝐱𝑛 ) ⋅ (𝐱𝑛 − 𝐱∗ ) + ℎ𝑛² ‖∇𝑓(𝐱𝑛 )‖². (12.4)

In the second step, we show that


𝑓(𝐱) − 𝑓(𝐲) ≤ ∇𝑓(𝐱) ⋅ (𝐱 − 𝐲) − (1∕(2𝐿))‖∇𝑓(𝐱) − ∇𝑓(𝐲)‖² ∀𝐱, 𝐲 ∈ ℝᵈ . (12.5)

In the estimate

𝑓(𝐱) − 𝑓(𝐲) = (𝑓(𝐱) − 𝑓(𝐳)) + (𝑓(𝐳) − 𝑓(𝐲))
≤ ∇𝑓(𝐱) ⋅ (𝐱 − 𝐳) + ∇𝑓(𝐲) ⋅ (𝐳 − 𝐲) + (𝐿∕2)‖𝐳 − 𝐲‖²
= ∇𝑓(𝐱) ⋅ (𝐱 − 𝐲) + (∇𝑓(𝐱) − ∇𝑓(𝐲)) ⋅ (𝐲 − 𝐳) + (𝐿∕2)‖𝐳 − 𝐲‖²,

the first term 𝑓(𝐱) − 𝑓(𝐳) is estimated using Lemma 12.14 and the second term
𝑓(𝐳) − 𝑓(𝐲) is estimated using Lemma 12.17. Then substituting

𝐳 ∶= 𝐲 − (1∕𝐿)(∇𝑓(𝐲) − ∇𝑓(𝐱))

yields (12.5).
Inequality (12.5) applied to the situation in this lemma implies that
0 ≤ 𝑓(𝐱𝑛 ) − 𝑓(𝐱∗ ) ≤ ∇𝑓(𝐱𝑛 ) ⋅ (𝐱𝑛 − 𝐱∗ ) − (1∕(2𝐿))‖∇𝑓(𝐱𝑛 )‖². (12.6)
In the last step, we combine (12.4) and (12.6) to find

‖𝐱𝑛+1 − 𝐱∗ ‖² ≤ ‖𝐱𝑛 − 𝐱∗ ‖² − (ℎ𝑛 ∕𝐿)‖∇𝑓(𝐱𝑛 )‖² + ℎ𝑛² ‖∇𝑓(𝐱𝑛 )‖²
= ‖𝐱𝑛 − 𝐱∗ ‖² − ℎ𝑛 (1∕𝐿 − ℎ𝑛 ) ‖∇𝑓(𝐱𝑛 )‖²
≤ ‖𝐱𝑛 − 𝐱∗ ‖²,

which concludes the proof. □


Theorem 12.16 provides two interesting corollaries. The first is the answer
to the question of how to choose the step sizes ℎ𝑛 such that convergence is fastest
based on the estimate in the theorem.
Corollary 12.19 (optimal step sizes) The optimal step sizes ℎ𝑛 based on the es-
timate in Theorem 12.16 are
1
ℎ𝑛 ∶= ,
𝐿
which result in the estimate
𝑓(𝐱𝑛 ) − 𝑓(𝐱∗ ) ≤ 2𝐿‖𝐱0 − 𝐱∗ ‖² ∕ (𝑛 + 1) ∀𝑛 ∈ ℕ.
Proof We maximize the denominator in the inequality in Theorem 12.16. Since
all the summands in ∑_{𝑘=0}^{𝑛} ℎ𝑘 (1 − 𝐿ℎ𝑘 ∕2) are independent, the optimal ℎ𝑛 are
ℎ𝑛 ∶= arg maxℎ∈(0,1∕𝐿] ℎ(1 − 𝐿ℎ∕2) = 1∕𝐿. □
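The step size ℎ = 1∕𝐿 from the corollary can be tried on a concrete 𝐿-smooth convex quadratic; in the sketch below, the example function and the iteration count are our choices:

```julia
# For f(x) = x1^2 + 10 x2^2 the gradient is Lipschitz with constant L = 20,
# so the corollary suggests the constant step size h = 1/L.
f(x) = x[1]^2 + 10 * x[2]^2
gradf(x) = [2 * x[1], 20 * x[2]]
L = 20.0

x = [1.0, 1.0]
errs = Float64[]
for _ in 1:200
    global x = x .- (1 / L) .* gradf(x)
    push!(errs, f(x))            # f(x*) = 0, so f(x) is the error e_n
end
# errs decreases monotonically towards 0.
```

On this example the stiffest coordinate is eliminated in a single step, while the flatter coordinate contracts geometrically, so the recorded errors decrease monotonically.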
The second corollary concerns diminishing step sizes, which very commonly
occur in stochastic (gradient-descent) optimization (see Sect. 13.6). The denom-
inator in the estimate in Theorem 12.16 can be written as

∑_{𝑘=0}^{𝑛} ℎ𝑘 (1 − 𝐿ℎ𝑘 ∕2) = ∑_{𝑘=0}^{𝑛} ℎ𝑘 − (𝐿∕2) ∑_{𝑘=0}^{𝑛} ℎ𝑘².

In order to ensure convergence, we hence require that


∑_{𝑘=0}^{∞} ℎ𝑘 = ∞ and ∑_{𝑘=0}^{∞} ℎ𝑘² < ∞.

This is indeed the usual requirement in stochastic optimization. The step sizes
ℎ𝑛 ∶= 𝑎∕(𝑛 +𝑏) in the second corollary are very common in stochastic optimiza-
tion and satisfy these two requirements.
Corollary 12.20 (diminishing step sizes) Suppose that the assumptions in The-
orem 12.16 hold and that the step sizes are defined such that
ℎ𝑛 ∶= 𝑎∕(𝑛 + 𝑏) ≤ 1∕𝐿, 𝑎, 𝑏 ∈ ℝ+ .
Then the estimate
𝑓(𝐱𝑛 ) − 𝑓(𝐱∗ ) ≤ ‖𝐱0 − 𝐱∗ ‖² ∕ (𝑎 ln((𝑛 + 1 + 𝑏)∕𝑏) − 𝐿𝑎²(𝑛 + 1)∕(2𝑏(𝑛 + 1 + 𝑏))) ∀𝑛 ∈ ℕ

holds.
Proof In general, if 𝑔 ∶ ℝ → ℝ is a monotonically decreasing Riemann-
integrable function, the estimates
∫_{𝑛0}^{𝑛1+1} 𝑔(𝑥) d𝑥 ≤ ∑_{𝑘=𝑛0}^{𝑛1} 𝑔(𝑘) ≤ ∫_{𝑛0−1}^{𝑛1} 𝑔(𝑥) d𝑥

hold. (To see this, the sum is interpreted as the Riemann sum of an integral; it is
useful to draw a sketch of a monotonically decreasing function and the rectan-
gles in the Riemann sum.)
In our case, we have
∑_{𝑘=0}^{𝑛} ℎ𝑘 (1 − 𝐿ℎ𝑘 ∕2) ≥ ∫₀^{𝑛+1} (𝑎∕(𝑥 + 𝑏)) (1 − (𝐿∕2) ⋅ 𝑎∕(𝑥 + 𝑏)) d𝑥
= 𝑎 ln((𝑛 + 1 + 𝑏)∕𝑏) − 𝐿𝑎²(𝑛 + 1)∕(2𝑏(𝑛 + 1 + 𝑏)), (12.7)
which concludes the proof. □
We can also interpret the estimates in the theorem and its corollaries differently
by asking the question of how many iterations are necessary to ensure that
the error is smaller than a prescribed tolerance 𝜖, i.e., that

𝑓(𝐱𝑛 ) − 𝑓(𝐱∗ ) < 𝜖

holds. Based on the estimate in Corollary 12.19, we have

𝑛 + 1 > 2𝐿‖𝐱0 − 𝐱∗ ‖² ∕ 𝜖.
Therefore 𝑂(1∕𝜖) iterations are required in order to achieve an error 𝑓(𝐱𝑛 ) −
𝑓(𝐱∗ ) < 𝜖. Furthermore, the number of required iterations is proportional to the
Lipschitz constant 𝐿 and to the squared distance of the starting point 𝐱0 from
the minimum 𝐱∗ , and it is inversely proportional to the prescribed tolerance.
There is also a connection with regularization (see Sect. 13.9.1) in the context
of neural networks (see Chap. 13). Regularization diminishes the Lipschitz con-
stant of the neural network, thus having the additional benefit that it speeds up
convergence.
In practice, the Lipschitz constant 𝐿 can be calculated using the Hessian ma-
trix if the function 𝑓 is given in a sufficiently explicit and differentiable form. But
if it is not represented in a straightforward manner, it may be hard or impossible
to determine the Lipschitz constant 𝐿. Also, intuition suggests that the step size
should become smaller as the minimum is approached. These considerations
motivate deliberations on the step size in the next sections.
Another practical consideration is that if the function 𝑓 is so smooth that
the Hessian matrix exists, then we can take advantage of the second derivatives
directly by using the bfgs method (see Sect. 12.8).
A question that suggests itself after having shown Theorem 12.16 is: what is the
best convergence rate that an optimization algorithm that only uses the gradient
of a function can achieve? In order to be able to answer this question, we have
to restrict the class of functions that the optimization is supposed to work on.
The reason is simply that if we demand the optimization algorithm to work on
functions that are not smooth at all, it is always possible to construct counterex-
amples because the function values may jump without restriction. Therefore we
require the functions to have the same smoothness as in the (only) convergence
results we already know, namely Theorem 12.16.
The following theorem states that any optimization algorithm that uses only
the gradient can achieve at most quadratic convergence on this class of functions.
Such iterative optimization algorithms are called first-order methods.

Theorem 12.21 (lower bound for convergence rate of gradient descent)


Suppose that the sequence ⟨𝐱𝑘 ⟩ ⊂ ℝ𝑑 is generated by a first-order optimization
algorithm, i.e., the points in the sequence satisfy the inclusion

𝐱𝑛+1 ∈ 𝐱0 + span{∇𝑓(𝐱1 ), … , ∇𝑓(𝐱𝑛 )} ∀𝑛 ∈ ℕ0 .

Then for any 𝐱0 ∈ ℝ𝑑 and any 𝑘 ∈ ℕ with 1 ≤ 𝑘 ≤ (𝑑 − 1)∕2 there exists a
continuously differentiable, convex, and 𝐿-smooth function 𝑓 ∶ ℝ𝑑 → ℝ that has
the global minimum 𝐱∗ ∈ ℝ𝑑 such that the inequalities

𝑓(𝐱𝑘 ) − 𝑓(𝐱∗ ) ≥ 3𝐿‖𝐱0 − 𝐱∗ ‖² ∕ (32(𝑘 + 1)²),
‖𝐱𝑘 − 𝐱∗ ‖² ≥ ‖𝐱0 − 𝐱∗ ‖² ∕ 8
hold.

For proof, see [14, Theorem 2.1.7].


In summary, we now know that the gradient-descent algorithm in Theo-
rem 12.16 achieves linear convergence, while Theorem 12.21 means that the best
algorithm on this class of functions can achieve at most quadratic convergence.
Hence the next questions pose themselves. Does an algorithm that achieves
quadratic convergence exist? What have we missed?

12.5 Accelerated Gradient Descent *

The answer to this question was found in 1983 [13] and is discussed in [14, Sec-
tion 2.2]. It turns out that the requirement 𝑓(𝐱𝑛+1 ) < 𝑓(𝐱𝑛 ) discussed at the be-
ginning of Sect. 12.4 is not conducive for the minimization of convex functions;
the requirement is a local statement, but convexity is a global property.
As the next theorem shows, quadratic convergence, i.e., the optimal conver-
gence rate, can indeed be achieved in the minimization of convex, smooth func-
tions. The iteration in accelerated gradient descent combines the gradient with
the difference 𝐱𝑛 − 𝐱𝑛−1 , which is the momentum of the trajectory of the se-
quence ⟨𝐱𝑛 ⟩.

Algorithm 12.22 (accelerated gradient descent) Algorithm 12.13 with the


update

𝜆0 ∶= 1,
𝜆𝑛 ∶= (1 + √(4𝜆𝑛−1² + 1)) ∕ 2,
𝛾𝑛 ∶= (𝜆𝑛−1 − 1) ∕ 𝜆𝑛 ,
𝐝𝑛 ∶= 𝛾𝑛 (𝐱𝑛 − 𝐱𝑛−1 ),
𝐲𝑛 ∶= 𝐱𝑛 + 𝐝𝑛 ,
𝐠𝑛 ∶= −(1∕𝐿) ∇𝑓(𝐲𝑛 ) = −(1∕𝐿) ∇𝑓(𝐱𝑛 + 𝐝𝑛 ),
𝐱𝑛+1 ∶= 𝐲𝑛 + 𝐠𝑛 = 𝐱𝑛 + 𝐝𝑛 + 𝐠𝑛

is called accelerated gradient descent.
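Algorithm 12.22 translates almost line by line into Julia. A sketch follows; the test problem, the Lipschitz constant 𝐿, and the iteration count are our choices:

```julia
# Accelerated gradient descent (a sketch of Algorithm 12.22).
function accelerated_gd(gradf, x0, L; iters = 500)
    λprev = 1.0                          # λ_0 := 1
    x, xprev = copy(x0), copy(x0)
    for _ in 1:iters
        λ = (1 + sqrt(4 * λprev^2 + 1)) / 2
        γ = (λprev - 1) / λ
        d = γ .* (x .- xprev)            # momentum term d_n
        y = x .+ d                       # look-ahead point y_n
        g = -(1 / L) .* gradf(y)         # gradient step g_n
        xprev, x = x, y .+ g             # x_{n+1} := y_n + g_n
        λprev = λ
    end
    return x
end

# Convex quadratic example; its gradient is Lipschitz with L = 20.
gradf(x) = [2 * x[1], 20 * x[2]]
xacc = accelerated_gd(gradf, [5.0, 5.0], 20.0)   # converges to the origin
```

Note that the gradient is evaluated at the look-ahead point 𝐲ₙ, not at 𝐱ₙ; this is exactly the combination of gradient and momentum described above.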



Theorem 12.23 (accelerated gradient descent) Suppose that the function
𝑓 ∶ ℝ𝑑 → ℝ is convex and 𝐿-smooth and has the unique global minimum 𝐱∗ .
Then the accelerated gradient-descent algorithm, Algorithm 12.22, satisfies the in-
equality

𝑓(𝐱𝑛+1 ) − 𝑓(𝐱∗ ) ≤ (𝑓(𝐱1 ) − 𝑓(𝐱∗ ) + (𝐿∕2)‖𝐱1 − 𝐱∗ ‖²) ∕ (𝑛∕2 + 1)² ∀𝑛 ∈ ℕ.

The following proof partly follows [1].


Proof Lemma 12.14 implies the inequality

𝑓(𝐱 − ℎ∇𝑓(𝐱)) − 𝑓(𝐲) ≤ 𝑓(𝐱 − ℎ∇𝑓(𝐱)) − 𝑓(𝐱) + ∇𝑓(𝐱) ⋅ (𝐱 − 𝐲) ∀𝐱, 𝐲 ∈ ℝᵈ ∀ℎ ∈ ℝ.

Furthermore estimating the first two terms on the right-hand side using Lemma
12.17 yields

𝑓(𝐱 − ℎ∇𝑓(𝐱)) − 𝑓(𝐲) ≤ (−ℎ + 𝐿ℎ²∕2) ‖∇𝑓(𝐱)‖² + ∇𝑓(𝐱) ⋅ (𝐱 − 𝐲) ∀𝐱, 𝐲 ∈ ℝᵈ ∀ℎ ∈ ℝ.

In order to find the strongest estimate, we define ℎ ∶= arg minℎ∈ℝ (𝐿ℎ²∕2 − ℎ) =
1∕𝐿, which results in the inequality

𝑓(𝐱 − (1∕𝐿)∇𝑓(𝐱)) − 𝑓(𝐲) ≤ −(1∕(2𝐿)) ‖∇𝑓(𝐱)‖² + ∇𝑓(𝐱) ⋅ (𝐱 − 𝐲) ∀𝐱, 𝐲 ∈ ℝᵈ .
We define the error in the 𝑛-th iteration as

𝑒𝑛 ∶= 𝑓(𝐱𝑛 ) − 𝑓(𝐱∗ ).

Substituting 𝐱 ∶= 𝐱𝑛 + 𝐝𝑛 and 𝐲 ∶= 𝐱𝑛 into the last inequality yields

𝑒𝑛+1 − 𝑒𝑛 = 𝑓(𝐱𝑛+1 ) − 𝑓(𝐱𝑛 ) ≤ −(1∕(2𝐿)) ‖𝐿𝐠𝑛 ‖² − 𝐿𝐠𝑛 ⋅ 𝐝𝑛 ∀𝑛 ∈ ℕ.

Similarly substituting 𝐱 ∶= 𝐱𝑛 + 𝐝𝑛 and 𝐲 ∶= 𝐱∗ into the same inequality yields

𝑒𝑛+1 = 𝑓(𝐱𝑛+1 ) − 𝑓(𝐱∗ ) ≤ −(1∕(2𝐿)) ‖𝐿𝐠𝑛 ‖² − 𝐿𝐠𝑛 ⋅ (𝐱𝑛 + 𝐝𝑛 − 𝐱∗ ) ∀𝑛 ∈ ℕ.
The last two inequalities can be rewritten as

𝑒𝑛+1 − 𝑒𝑛 ≤ −(𝐿∕2) (‖𝐠𝑛 ‖² + 2𝐠𝑛 ⋅ 𝐝𝑛 ) ∀𝑛 ∈ ℕ,
𝑒𝑛+1 ≤ −(𝐿∕2) (‖𝐠𝑛 ‖² + 2𝐠𝑛 ⋅ (𝐱𝑛 + 𝐝𝑛 − 𝐱∗ )) ∀𝑛 ∈ ℕ.
We would like to use the identity

‖𝐚‖² + 2𝐚 ⋅ 𝐛 = ‖𝐚 + 𝐛‖² − ‖𝐛‖²

on the right-hand side in order to arrive at a telescoping structure. Therefore we
add (𝜆𝑛 − 1) times the first inequality to the second inequality, yielding

𝜆𝑛 (𝑒𝑛+1 − 𝑒𝑛 ) + 𝑒𝑛 ≤ −(𝐿∕2) (𝜆𝑛 ‖𝐠𝑛 ‖² + 2𝐠𝑛 ⋅ (𝜆𝑛 𝐝𝑛 + 𝐱𝑛 − 𝐱∗ )) ∀𝑛 ∈ ℕ,

where the factors

𝜆𝑛 ≥ 1 (12.8)
will be specified later. We multiply the inequality by 𝜆𝑛 , noting that the 𝜆𝑛 are
nonnegative, and apply the identity above to find

𝜆𝑛² (𝑒𝑛+1 − 𝑒𝑛 ) + 𝜆𝑛 𝑒𝑛 ≤ −(𝐿∕2) (‖𝜆𝑛 𝐠𝑛 ‖² + 2𝜆𝑛 𝐠𝑛 ⋅ (𝜆𝑛 𝐝𝑛 + 𝐱𝑛 − 𝐱∗ ))
= −(𝐿∕2) (‖𝜆𝑛 𝐠𝑛 + 𝜆𝑛 𝐝𝑛 + 𝐱𝑛 − 𝐱∗ ‖² − ‖𝜆𝑛 𝐝𝑛 + 𝐱𝑛 − 𝐱∗ ‖²) ∀𝑛 ∈ ℕ.
2
We call the arguments of the norms

𝐫𝑛 ∶= 𝜆𝑛 𝐠𝑛 + 𝜆𝑛 𝐝𝑛 + 𝐱𝑛 − 𝐱∗ ,
𝐬𝑛 ∶= 𝜆𝑛 𝐝𝑛 + 𝐱𝑛 − 𝐱∗ .

In order to create a telescoping structure, we would like to determine the 𝜆𝑛 and 𝛾𝑛 such that 𝐫𝑛 = 𝐬𝑛+1 , which is equivalent to

𝜆𝑛 (𝐠𝑛 + 𝐝𝑛 ) + 𝐱𝑛 − 𝐱∗ = 𝜆𝑛+1 𝐝𝑛+1 + 𝐱𝑛+1 − 𝐱∗ ∀𝑛 ∈ ℕ.

Using the definition of 𝐝𝑛+1 and 𝐱𝑛+1 on the right-hand side, this condition is
furthermore equivalent to

𝜆𝑛 (𝐠𝑛 + 𝐝𝑛 ) + 𝐱𝑛 − 𝐱∗ = 𝜆𝑛+1 𝛾𝑛+1 (𝐝𝑛 + 𝐠𝑛 ) + 𝐱𝑛 + 𝐝𝑛 + 𝐠𝑛 − 𝐱∗ ∀𝑛 ∈ ℕ.

This is always true if

𝜆𝑛 = 𝜆𝑛+1 𝛾𝑛+1 + 1 ∀𝑛 ∈ ℕ (12.9)

holds.
Therefore the last inequality becomes
𝜆𝑛² 𝑒𝑛+1 − (𝜆𝑛² − 𝜆𝑛 )𝑒𝑛 ≤ −(𝐿∕2)(‖𝐬𝑛+1 ‖² − ‖𝐬𝑛 ‖²) ∀𝑛 ∈ ℕ.

To achieve a suitable telescoping structure on the left side as well, we would like
to meet the condition 𝑢𝑛 = 𝑣𝑛+1 for all 𝑛, where

𝑢𝑛 ∶= 𝜆𝑛² 𝑒𝑛+1 ,
𝑣𝑛 ∶= (𝜆𝑛² − 𝜆𝑛 )𝑒𝑛 .

This condition is met if

𝜆𝑛² = 𝜆𝑛+1² − 𝜆𝑛+1 ∀𝑛 ∈ ℕ (12.10)

holds.
Thus the last inequality becomes
𝑣𝑛+1 − 𝑣𝑛 ≤ −(𝐿∕2)(‖𝐬𝑛+1 ‖² − ‖𝐬𝑛 ‖²) ∀𝑛 ∈ ℕ. (12.11)
The last step now involves inequalities with this telescoping structure. Sup-
pose there are two real sequences ⟨𝑎𝑛 ⟩ and ⟨𝑏𝑛 ⟩ such that 𝑎𝑛+1 − 𝑎𝑛 ≤ 𝑏𝑛 − 𝑏𝑛+1
holds for all 𝑛. Then the chain

𝑎𝑛+1 + 𝑏𝑛+1 ≤ 𝑎𝑛 + 𝑏𝑛 ≤ ⋯ ≤ 𝑎1 + 𝑏1

of inequalities clearly holds true. Applying this idea via 𝑎𝑛 ∶= 𝑣𝑛 and 𝑏𝑛 ∶= (𝐿∕2)‖𝐬𝑛 ‖² to inequality (12.11), we find

𝜆𝑛² 𝑒𝑛+1 = 𝑣𝑛+1 ≤ 𝑣𝑛+1 + (𝐿∕2)‖𝐬𝑛+1 ‖² ≤ 𝑣1 + (𝐿∕2)‖𝐬1 ‖²
= 𝜆0² 𝑒1 + (𝐿∕2)‖(𝜆0 − 1)(𝐱1 − 𝐱0 ) + 𝐱1 − 𝐱∗ ‖² ∀𝑛 ∈ ℕ.
After setting 𝜆0 ∶= 1, we have

𝑓(𝐱𝑛+1 ) − 𝑓(𝐱∗ ) ≤ (𝑓(𝐱1 ) − 𝑓(𝐱∗ ) + (𝐿∕2)‖𝐱1 − 𝐱∗ ‖²) ∕ 𝜆𝑛² ∀𝑛 ∈ ℕ.

Finally, we check that all three conditions for the 𝜆𝑛 and 𝛾𝑛 can be met and
that the 𝜆𝑛 grow (at least) linearly. The three conditions are (12.8), (12.9), and
(12.10), which yield

𝜆𝑛 ∶= (1 + √(1 + 4𝜆𝑛−1²))∕2 ≥ 1∕2 + 𝜆𝑛−1 ,
𝛾𝑛 ∶= (𝜆𝑛−1 − 1)∕𝜆𝑛 .

It is now straightforward to see by induction (and using 𝜆0 = 1) that


𝜆𝑛 ≥ 𝑛∕2 + 1 ∀𝑛 ∈ ℕ0 , (12.12)
which immediately implies lim𝑛→∞ 𝛾𝑛 = 1; furthermore, the 𝛾𝑛 are bounded by

0 < 𝛾𝑛 < 1. (12.13)

In summary, the (at least) linear growth of the 𝜆𝑛 implies the quadratic convergence rate 𝑂(1∕𝑛²). □


Accelerated gradient descent is implemented in Problem 12.8 and tested in
Problem 12.9.
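As a hedged illustration only, the following Python sketch (the book's own implementations use Julia) reconstructs the method from the update rules that appear in this proof (𝐠𝑛 ∶= −(1∕𝐿)∇𝑓(𝐱𝑛 + 𝐝𝑛 ), 𝐱𝑛+1 ∶= 𝐱𝑛 + 𝐝𝑛 + 𝐠𝑛 , and 𝐝𝑛+1 ∶= 𝛾𝑛+1 (𝐝𝑛 + 𝐠𝑛 )) together with the recursions for 𝜆𝑛 and 𝛾𝑛 derived above; all function and parameter names are ours, not the book's.

```python
import numpy as np

def accelerated_gd(grad, x0, L, iters):
    """Accelerated gradient descent as reconstructed from the proof above:
    g_n = -(1/L) grad f(x_n + d_n), x_{n+1} = x_n + d_n + g_n,
    d_{n+1} = gamma_{n+1} (d_n + g_n), with the lambda/gamma recursions."""
    x = np.asarray(x0, dtype=float)
    d = np.zeros_like(x)
    lam_prev = 1.0                              # lambda_0 = 1
    for _ in range(iters):
        g = -grad(x + d) / L                    # scaled gradient step
        x_next = x + d + g
        lam = (1.0 + np.sqrt(1.0 + 4.0 * lam_prev**2)) / 2.0
        gamma = (lam_prev - 1.0) / lam          # momentum coefficient
        d = gamma * (d + g)
        lam_prev = lam
        x = x_next
    return x
```

For an 𝐿-smooth convex objective, the iterates of such a sketch should obey the 𝑂(1∕𝑛²) bound of Theorem 12.23.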

12.6 Line Search and the Wolfe Conditions

It is important to note that up to now we have only used fixed step sizes, i.e., the
factors of the gradients in the update formulas have been constants determined by the Lipschitz constant of the gradient of the objective function. We now lift this restriction (although it has made the analysis easier), because this Lipschitz constant may not be known or the gradient may not be Lipschitz continuous at all, but we still
wish to optimize such functions.
Since we search along the direction given by the gradient, such optimization
algorithms are called line-search algorithms. In general, they have the following
form.
Algorithm 12.24 (line-search algorithm)
1. Compute the search direction 𝐩𝑛 ∶= −∇𝑓(𝐱𝑛 ).
2. Determine the step size 𝛼𝑛 > 0 such that a sufficient-decrease condition is satisfied.
3. Set 𝐱𝑛+1 ∶= 𝐱𝑛 + 𝛼𝑛 𝐩𝑛 .
A simple-minded decrease condition such as 𝑓(𝐱𝑛+1 ) = 𝑓(𝐱𝑛 + 𝛼𝑛 𝐩𝑛 ) ≤
𝑓(𝐱𝑛 ), meaning that the function value at the new approximation point 𝐱𝑛+1
is smaller than before, is not a useful requirement, since the function value, a
real number, may decrease forever without getting close to a minimum.
Instead, we would ideally like to use a global minimizer

𝛼𝑛 ∶= arg min𝛼∈ℝ+ ℎ(𝛼)

of the function
ℎ∶ ℝ+ → ℝ, 𝛼 ↦ 𝑓(𝐱𝑛 + 𝛼𝐩𝑛 )
as the step size. Finding an approximation of the minimizer may be quite an
elaborate task (but at least it is always a one-dimensional problem) and is usually
performed in two steps. In the first step, an interval of acceptable step sizes is
identified. In the second step, the function is interpolated and the interval is
bisected to find a good approximation of the best step size.

However we face a trade-off here: while we want to find a good approximation


of the ideal step size, spending too much computational effort on the search for
the step size is not conducive to the performance of the whole algorithm.
Useful and popular conditions that ensure a sufficient decrease of the func-
tion value are the (two) Wolfe conditions and the (two) strong Wolfe conditions
[19, 20] [16, Section 3.1], which we discuss in detail in the rest of this section.
The first Wolfe condition (or Armijo condition) is the inequality

𝑓(𝐱𝑛 + 𝛼𝑛 𝐩𝑛 ) ≤ 𝑓(𝐱𝑛 ) + 𝑐1 𝛼𝑛 ∇𝑓(𝐱𝑛 ) ⋅ 𝐩𝑛 , (12.14)

where 𝑐1 ∈ (0, 1) is a constant, which is in practice chosen to be quite small, e.g., 𝑐1 ∶= 10−4 . The inequality means that the reduction is required to be (at least) proportional to both the step size 𝛼𝑛 and the directional derivative ∇𝑓(𝐱𝑛 ) ⋅ 𝐩𝑛 .
The right-hand side of the inequality is a linear function of 𝛼, and we denote
it by 𝑙(𝛼). Using this notation, the condition becomes ℎ(𝛼) ≤ 𝑙(𝛼). The function 𝑙
has the negative derivative 𝑐1 ∇𝑓(𝐱𝑛 ) ⋅ 𝐩𝑛 with respect to 𝛼, and its graph lies above the tangent of ℎ at 𝛼 = 0 because 𝑐1 ∈ (0, 1). The larger 𝑐1 is, the more reduction
in the function value is required.
But this first condition alone does not guarantee good progress of the algo-
rithm, since it can always be satisfied for sufficiently small values of 𝛼. To exclude
this possibility, a second condition is needed.
The second Wolfe condition is called the curvature condition and it is the
requirement that

∇𝑓(𝐱𝑛 + 𝛼𝑛 𝐩𝑛 ) ⋅ 𝐩𝑛 ≥ 𝑐2 ∇𝑓(𝐱𝑛 ) ⋅ 𝐩𝑛 (12.15)

on the step size 𝛼𝑛 , where 𝑐2 ∈ (𝑐1 , 1) is a constant. The left-hand side is equal
to ℎ′ (𝛼𝑛 ), implying the interpretation that the slope ℎ′ (𝛼𝑛 ) of ℎ at 𝛼𝑛 must be
greater than or equal to the constant 𝑐2 times the slope ℎ′ (0).
If the slope ℎ′ (𝛼) is strongly negative for small 𝛼, then this second condition
requires that the step size 𝛼 cannot be chosen too small, which is reasonable,
since a strongly negative slope indicates that the function 𝑓 can be reduced sig-
nificantly by moving further along.
If line search is used in conjunction with a Newton or quasi-Newton method
(see the next sections), a typical value for 𝑐2 is 0.9.
Having defined the Wolfe conditions, the question whether they can be satis-
fied arises naturally. The answer to this question is always yes under reasonable
assumptions on the objective function 𝑓, as the following theorem shows.

Theorem 12.25 (Wolfe conditions) Suppose that 𝑓 ∶ ℝ𝑑 → ℝ is a continu-


ously differentiable function, that 𝐵 ∈ ℝ𝑑×𝑑 is a positive definite matrix, that
𝐩 ∶= −𝐵∇𝑓(𝐱), that 0 < 𝑐1 < 𝑐2 < 1, and that the function ℎ ∶ ℝ+ → ℝ,
𝛼 ↦ 𝑓(𝐱 + 𝛼𝐩) is bounded below. Then there exists an interval of step sizes 𝛼
satisfying the strict inequalities in both Wolfe conditions (12.14) and (12.15).

Proof Since the matrix 𝐵 is positive definite by assumption and since 𝛼 > 0 and
𝑐1 > 0, the function 𝑙 ∶ ℝ+ → ℝ, 𝛼 ↦ 𝑓(𝐱) + 𝛼𝑐1 ∇𝑓(𝐱) ⋅ 𝐩 is unbounded below
and thus intersects the function ℎ, which is bounded below, at least once. We
denote the smallest such value by 𝛼1 , which satisfies 𝛼1 > 0 and

𝑓(𝐱 + 𝛼1 𝐩) = 𝑓(𝐱) + 𝛼1 𝑐1 ∇𝑓(𝐱) ⋅ 𝐩.

Hence the strict inequality in the first Wolfe condition (12.14) is satisfied for all
𝛼 ∈ (0, 𝛼1 ).
Next, the mean-value theorem implies that

∃𝛼0 ∈ (0, 𝛼1 ) ∶ 𝑓(𝐱 + 𝛼1 𝐩) − 𝑓(𝐱) = 𝛼1 ∇𝑓(𝐱 + 𝛼0 𝐩) ⋅ 𝐩.

The last two equations yield

∇𝑓(𝐱 + 𝛼0 𝐩) ⋅ 𝐩 = 𝑐1 ∇𝑓(𝐱) ⋅ 𝐩 > 𝑐2 ∇𝑓(𝐱) ⋅ 𝐩

after noting that 𝑐1 < 𝑐2 and ∇𝑓(𝐱) ⋅ 𝐩 = −∇𝑓(𝐱)⊤ 𝐵∇𝑓(𝐱) < 0.


Finally, because 𝑓 is continuously differentiable, there exists an interval
around 𝛼0 satisfying the strict inequalities in both Wolfe conditions. □
Having seen that the Wolfe conditions can always be satisfied under reason-
able assumptions on the objective function, we next mention a bracketing al-
gorithm [9, Algorithm 4.6] that calculates such a permissible step size. It can be
shown [9, Theorem 4.7] that the output of the algorithm is always a step size that
satisfies the Wolfe conditions (12.14) and (12.15) if the algorithm terminates.

Algorithm 12.26 (Wolfe line search)


1. Initialize 𝛼 ∶= 0, 𝛽 ∶= ∞, and 𝑡 ∶= 1.
2. Repeat:
a. If the first Wolfe condition (12.14) is not satisfied
set 𝛽 ∶= 𝑡
else if the second Wolfe condition (12.15) is not satisfied
set 𝛼 ∶= 𝑡
else break the loop.
b. If 𝛽 < ∞
set 𝑡 ∶= (𝛼 + 𝛽)∕2
else
set 𝑡 ∶= 2𝛼.
3. Finally return the step size 𝑡.
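As one possible illustration of this bracketing scheme, consider the following Python sketch (the book's implementations are in Julia; the function and parameter names are ours): it bisects once an upper bound 𝛽 exists and doubles the trial step otherwise, exactly as described above.

```python
import numpy as np

def wolfe_line_search(f, grad, x, p, c1=1e-4, c2=0.9, max_iter=60):
    """Bracketing line search in the spirit of Algorithm 12.26: bisect when an
    upper bound beta exists, double the trial step otherwise."""
    alpha, beta, t = 0.0, np.inf, 1.0
    f0 = f(x)
    g0 = grad(x) @ p            # h'(0), negative for a descent direction
    for _ in range(max_iter):
        if f(x + t * p) > f0 + c1 * t * g0:      # first Wolfe condition violated
            beta = t
        elif grad(x + t * p) @ p < c2 * g0:      # curvature condition violated
            alpha = t
        else:
            break                                # t satisfies both conditions
        t = (alpha + beta) / 2.0 if beta < np.inf else 2.0 * alpha
    return t
```

The returned step size should satisfy both Wolfe conditions whenever the loop breaks before the iteration limit.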

The line-search algorithm Algorithm 12.26 is implemented in Problem 12.11


and tested in Problem 12.12.
While other conditions for line search can be used, the significance of the
Wolfe conditions is that they are beneficial for the convergence of the bfgs
method, which we will discuss in Sect. 12.8 below.

12.7 The Newton Method

In this section, the Newton method for approximating the roots of functions,
i.e., for approximating points 𝑥 such that 𝑔(𝑥) = 0, is summarized in the one-
dimensional case, i.e., for functions 𝑔 ∶ ℝ ⊃ [𝑎, 𝑏] → ℝ. In addition to finding
roots of general, nonlinear functions, we can also use the Newton method to
find stationary points, which are candidates for local extrema, by approximating
the roots of the derivative of a given function. The second use is extended in
the following section, Sect. 12.8, where we will discuss a generalization of the
Newton method for nonlinear, multidimensional optimization.
The Newton method works iteratively and is best summarized as replacing
the function by its tangent at the current approximation 𝑥𝑛 of the root and then
using the root of the tangent, which is a linear function, as the next approxima-
tion 𝑥𝑛+1 .
In more detail, we start from a differentiable function 𝑔 ∶ [𝑎, 𝑏] → ℝ and an
initial approximation 𝑥𝑛 of a root. First, we write the tangent 𝑦 ∶ [𝑎, 𝑏] → ℝ of 𝑔
at the point (𝑥𝑛 , 𝑔(𝑥𝑛 )) as the linear function

𝑦(𝑥) ∶= 𝑔′ (𝑥𝑛 )(𝑥 − 𝑥𝑛 ) + 𝑔(𝑥𝑛 ).

This formula is easily checked by noting that 𝑦 is a linear function, that it has
the correct slope 𝑔′ (𝑥𝑛 ), and that it takes the correct value 𝑦(𝑥𝑛 ) = 𝑔(𝑥𝑛 ) of the
tangent at the point (𝑥𝑛 , 𝑔(𝑥𝑛 )).
The next approximation 𝑥𝑛+1 is the root of the tangent 𝑦 and is hence found
by solving the equation

𝑔′ (𝑥𝑛 )(𝑥𝑛+1 − 𝑥𝑛 ) + 𝑔(𝑥𝑛 ) = 0 (12.16)

for 𝑥𝑛+1 . This yields the next approximation 𝑥𝑛+1 as

𝑥𝑛+1 ∶= 𝑥𝑛 − 𝑔(𝑥𝑛 )∕𝑔′ (𝑥𝑛 ). (12.17)

Unless 𝑔′ (𝑥𝑛 ) vanishes, this fraction is well-defined. These considerations


yield the following algorithm.

Algorithm 12.27 (Newton method, Newton iteration)


1. Initialize the counter 𝑛 ∶= 0 and choose a starting value 𝑥0 sufficiently close
to a root of the given differentiable function 𝑔 ∶ [𝑎, 𝑏] → ℝ (e.g. by perform-
ing a global optimization first).
2. Repeat until the difference |𝑥𝑛+1 − 𝑥𝑛 | or the residuum |𝑔(𝑥𝑛+1 ) − 𝑔(𝑥𝑛 )|
is sufficiently small (i.e., smaller than a prescribed value) or the maximum
number of iterations has been reached.
a. Define 𝑥𝑛+1 ∶= 𝑥𝑛 − 𝑔(𝑥𝑛 )∕𝑔′ (𝑥𝑛 ).
b. Increase the counter 𝑛 ∶= 𝑛 + 1.
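The one-dimensional iteration can be sketched in a few lines; the following Python version (the book works in Julia, and the names and tolerances here are our own choices) stops on a small difference |𝑥𝑛+1 − 𝑥𝑛 | or on the iteration limit.

```python
def newton(g, dg, x0, tol=1e-12, max_iter=50):
    """One-dimensional Newton iteration for a root of g; dg is the derivative g'."""
    x = x0
    for _ in range(max_iter):
        x_next = x - g(x) / dg(x)       # root of the tangent at (x, g(x))
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x
```

For example, applying the sketch to 𝑔(𝑥) = 𝑥² − 2 starting from 𝑥0 = 1 approximates √2.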
The main appeal of the Newton method is that it converges quadratically to a
root provided that the function is smooth enough and that the starting point is
sufficiently close to a root.
Theorem 12.28 (quadratic convergence of the Newton method) Suppose
𝜉 ∈ ℝ, 𝑥0 ∈ ℝ, and 𝑟 ≥ |𝑥0 − 𝜉|, and define the interval 𝐼 ∶= [𝜉 − 𝑟, 𝜉 + 𝑟] ⊂ ℝ.
Suppose further that 𝜉 ∈ ℝ is a root of a function 𝑔 ∶ 𝐼 → ℝ, that 𝑔′ (𝑥) ≠ 0 for
all 𝑥 ∈ 𝐼, and that 𝑔′′ (𝑥) is continuous for all 𝑥 ∈ 𝐼. If the starting point 𝑥0 ∈ 𝐼
of the Newton iteration is sufficiently close to the root 𝜉 ∈ 𝐼, then the sequence ⟨𝑥𝑛 ⟩
converges quadratically to the root 𝜉.
Proof The first-order Taylor expansion of 𝑔(𝜉) around 𝑥𝑛 exists by the assump-
tions on the function 𝑔 and can be written as
𝑔(𝜉) = 𝑔(𝑥𝑛 ) + 𝑔′ (𝑥𝑛 )(𝜉 − 𝑥𝑛 ) + (1∕2!) 𝑔′′ (𝜉𝑛 )(𝜉 − 𝑥𝑛 )²,
where the last term is the Lagrange form of the remainder and 𝜉𝑛 lies between
the root 𝜉 and 𝑥𝑛 .
Since 𝜉 is a root and the first derivative 𝑔′ does not vanish by assumption,
dividing the Taylor expansion by 𝑔′ (𝑥𝑛 ) yields

𝑔(𝑥𝑛 )∕𝑔′ (𝑥𝑛 ) + 𝜉 − 𝑥𝑛 = −(𝑔′′ (𝜉𝑛 )∕(2𝑔′ (𝑥𝑛 ))) (𝜉 − 𝑥𝑛 )².

Using the definition (12.17) of 𝑥𝑛+1 yields

𝜉 − 𝑥𝑛+1 = −(𝑔′′ (𝜉𝑛 )∕(2𝑔′ (𝑥𝑛 ))) (𝜉 − 𝑥𝑛 )².

We denote the error in the 𝑛-th step by

𝑒𝑛 ∶= 𝑥𝑛 − 𝜉.

Using this definition, we have just shown that the relationship

|𝑒𝑛+1 | = (|𝑔′′ (𝜉𝑛 )| ∕ (2|𝑔′ (𝑥𝑛 )|)) 𝑒𝑛²

holds between the errors 𝑒𝑛 and 𝑒𝑛+1 in steps 𝑛 and 𝑛 + 1, which implies quadratic convergence due to the assumptions on 𝑔′ and 𝑔′′ if the starting point 𝑥0 is sufficiently close to the root 𝜉. □
In the multidimensional case of vector-valued functions 𝐠 ∶ ℝ𝑑 → ℝ𝑑 , the
condition (12.16) that the tangent vanishes becomes

𝐽𝐠 (𝐱𝑛 )(𝐱𝑛+1 − 𝐱𝑛 ) + 𝐠(𝐱𝑛 ) = 𝟎,

where 𝐽𝐠 (𝐱𝑛 ) is the Jacobi matrix of 𝐠 at the point 𝐱𝑛 . Although the next approx-
imation 𝐱𝑛+1 can be written concisely as

𝐱𝑛+1 ∶= 𝐱𝑛 − 𝐽𝐠−1 (𝐱𝑛 )𝐠(𝐱𝑛 ),

it is much more computationally efficient not to calculate the inverse 𝐽𝐠−1 (𝐱𝑛 ) of
the Jacobi matrix, but to solve the linear system

𝐽𝐠 (𝐱𝑛 )(𝐱𝑛+1 − 𝐱𝑛 ) = −𝐠(𝐱𝑛 ) (12.18)

for 𝐱𝑛+1 − 𝐱𝑛 and hence for 𝐱𝑛+1 (see Sect. 8.4.8).
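The step above can be sketched as follows in Python (the book works in Julia); the sketch deliberately calls a library linear solver for (12.18) rather than forming 𝐽𝐠−1 , and the names and tolerances are ours.

```python
import numpy as np

def newton_system(g, jac, x0, tol=1e-10, max_iter=50):
    """Multidimensional Newton iteration: solve J_g(x) dx = -g(x) for the
    update dx instead of forming the inverse of the Jacobi matrix."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        dx = np.linalg.solve(jac(x), -g(x))
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    return x
```

Solving the linear system costs one factorization per step, which is cheaper and numerically more stable than computing and applying the inverse Jacobi matrix.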


Naturally, we would like to take advantage of the quadratic convergence of the
Newton method for optimizing functions as well. As mentioned above, we can
use the Newton method for approximating roots of the derivative of a function to
optimize. To find these stationary points, we hence need the second derivative.
In higher dimensions, i.e., when optimizing functions 𝑓 ∶ ℝ𝑑 → ℝ, this means
that we need the Hessian matrix of the function.
Obviously, calculating the 𝑑 × 𝑑 Hessian matrix is computationally expensive
in higher dimensions. In order to circumvent this problem, quasi-Newton meth-
ods have been developed. In quasi-Newton methods, the Hessian matrix of the
second derivatives is not evaluated directly, but only approximated. One of the
most prominent quasi-Newton methods is bfgs, discussed in the next section.

12.8 The bfgs Method

The bfgs method is one of the most popular quasi-Newton methods and named
after Charles G. Broyden, Roger Fletcher, Donald Goldfarb, and David Shanno
[3, 4, 6, 8, 17, 18]. As a quasi-Newton method it is iterative and suitable for non-
linear optimization problems. Although it does not handle constraints in its orig-
inal form, a version for box constraints has been developed [5].
In this section, the problem is to minimize a real differentiable scalar function
𝑓 ∶ ℝ𝑑 → ℝ without constraints. We denote the starting point of the iteration
by 𝐱0 and the approximation in iteration 𝑛 by 𝐱𝑛 . If 𝐻𝑓 (𝐱𝑛 ) denotes the Hessian

matrix of 𝑓 at the point 𝐱𝑛 , the step

𝐬𝑛 ∶= 𝐱𝑛+1 − 𝐱𝑛

in iteration 𝑛 is found by solving the linear system

𝐻𝑓 (𝐱𝑛 )𝐬𝑛 = −∇𝑓(𝐱𝑛 ) (12.19)

because of the multidimensional Newton iteration (12.18), where we now calculate roots of the function
𝐠 ∶= ∇𝑓.
Calculating roots of the gradient 𝐠 = ∇𝑓 of the function 𝑓 is – for the purposes
of this section – equivalent to minimizing the quadratic approximation
𝑓(𝐱𝑛+1 ) = 𝑓(𝐱𝑛 + 𝐬𝑛 ) ≈ 𝑓(𝐱𝑛 ) + ∇𝑓(𝐱𝑛 ) ⋅ 𝐬𝑛 + (1∕2!) 𝐬𝑛⊤ 𝐻𝑓 (𝐱𝑛 )𝐬𝑛
of 𝑓, which is a truncated Taylor expansion (see Theorem 12.5). Computing the
gradient of this approximation with respect to 𝐬𝑛 and setting it to zero yields
again (12.19).
But now we make two important changes. First, we do not calculate the Hes-
sian matrix 𝐻𝑓 (𝐱𝑛 ) in each step, which would be very computationally expen-
sive, but we will calculate approximations

𝐻𝑛 ≈ 𝐻𝑓 (𝐱𝑛 )

instead as discussed in detail below. The structure of the approximations 𝐻𝑛


will make it possible to find the inverses 𝐻𝑛−1 explicitly so that the linear system
resulting from (12.19) can be solved efficiently by just multiplying 𝐻𝑛−1 with a
vector. Another advantage is that no second derivatives are ever used.
Second, we do not use 𝐬𝑛 as the step 𝐬𝑛 = 𝐱𝑛+1 − 𝐱𝑛 , but we only consider 𝐬𝑛
as the search direction, i.e.,

𝛼𝑛 𝐬𝑛 = 𝐱𝑛+1 − 𝐱𝑛 ,

because we only approximate the Hessian matrix, and we use line search to make
up for this approximation. Having thus determined the search direction 𝐬𝑛 , any
line-search method (see Sect. 12.6) can be employed to find the next approxima-
tion 𝐱𝑛+1 by minimizing the objective function 𝑓(𝐱𝑛+1 ) = 𝑓(𝐱𝑛 + 𝛼𝑛 𝐬𝑛 ) in the
search direction 𝐬𝑛 over the scalar 𝛼 ∈ ℝ+ , i.e.,

𝛼𝑛 ∶= arg min𝛼∈ℝ+ 𝑓(𝐱𝑛 + 𝛼𝐬𝑛 ),
𝐱𝑛+1 ∶= 𝐱𝑛 + 𝛼𝑛 𝐬𝑛 .

How are the approximations 𝐻𝑛 of the Hessian matrices calculated? We im-


pose the two conditions that 𝐻𝑛 is symmetric (as all Hessian matrices are as

long as the function is sufficiently smooth) and that 𝐻𝑛 is positive definite (as
all Hessian matrices are at a local minimum, see Theorem 12.7). If 𝐻𝑛 is positive
definite, this property also implies the so-called descent property

𝐠(𝐱𝑛 ) ⋅ 𝐬𝑛 < 0 (12.20)

(cf. Sect. 12.1). Equation (12.19) becomes 𝐻𝑛 𝛼𝑛 𝐬𝑛 = −𝐠(𝐱𝑛 ) for the new step 𝛼𝑛 𝐬𝑛 (instead of 𝐬𝑛 ), implying that

𝛼𝑛 𝐠(𝐱𝑛 ) ⋅ 𝐬𝑛 = −𝐠(𝐱𝑛 )⊤ 𝐻𝑛−1 𝐠(𝐱𝑛 ) < 0

whenever 𝐠(𝐱𝑛 ) ≠ 𝟎, since 𝐻𝑛 and hence 𝐻𝑛−1 are positive definite. This inequality implies the descent property (12.20) because 𝛼𝑛 > 0.
A third condition we impose on the approximations 𝐻𝑛 stems from the fol-
lowing consideration. Taylor expansion of 𝐠 at 𝐱𝑛 yields

𝐠(𝐱𝑛+1 ) = 𝐠(𝐱𝑛 ) + 𝐻𝑓 (𝐱𝑛 )(𝐱𝑛+1 − 𝐱𝑛 ) + 𝑂(‖𝐱𝑛+1 − 𝐱𝑛 ‖2 ).

The higher order terms in 𝑂(‖𝐱𝑛+1 − 𝐱𝑛 ‖2 ) vanish for quadratic functions 𝑓 and
we will neglect them, meaning that the following construction will be exact for
quadratic functions. This yields

𝐝𝑛 ∶= 𝐱𝑛+1 − 𝐱𝑛 ,
𝐲𝑛 ∶= 𝐠(𝐱𝑛+1 ) − 𝐠(𝐱𝑛 ) = ∇𝑓(𝐱𝑛+1 ) − ∇𝑓(𝐱𝑛 ),
𝐻𝑛 𝐝𝑛 = 𝐲𝑛 ,

but the matrix 𝐻𝑛 usually does not satisfy this equation, since 𝐝𝑛 and hence 𝐲𝑛
are only known after the line search is complete. Therefore the next approxima-
tion 𝐻𝑛+1 is chosen such that it satisfies the updated equation

𝐻𝑛+1 𝐝𝑛 = 𝐲𝑛 , (12.21)

which is called the quasi-Newton condition.


Often simple initial values such as the identity matrix or multiples thereof are
used for 𝐻0 , and they clearly satisfy these two conditions.
The next value 𝐻𝑛+1 is calculated as an update of 𝐻𝑛 . At this point it is clear
that the updating formula, which enables 𝐻𝑛+1 to be calculated from 𝐻𝑛 , is cru-
cial. It has to incorporate as much information on the second derivatives into
𝐻𝑛+1 as possible, and its repeated application should change the arbitrary initial
matrix 𝐻0 into a close approximation of the Hessian, while only requiring few
computations. It should also satisfy the three conditions stated above.
Perhaps the simplest possible way is to define

𝐻𝑛+1 ∶= 𝐻𝑛 + 𝑎𝐮𝐮⊤ , 𝑎 ∈ ℝ, (12.22)



where the symmetric rank-one matrix 𝑎𝐮𝐮⊤ is added to the previous matrix. The
quasi-Newton condition (12.21) can be satisfied by setting 𝐮 ∶= 𝐲𝑛 − 𝐻𝑛 𝐝𝑛 and
requiring that 𝑎𝐮⊤ 𝐝𝑛 = 1 (see Problem 12.13), although these updates are usu-
ally applied to approximating the inverses 𝐻𝑓 (𝐱𝑛 )−1 directly. More importantly,
however, they have serious disadvantages [7, Section 3.2]: positive definiteness
of the approximate matrices cannot be guaranteed and the denominator in the
resulting update formula may become zero.
A better approach are rank-two updates, which have the form

𝐻𝑛+1 ∶= 𝐻𝑛 + 𝑎𝐮𝐮⊤ + 𝑏𝐯𝐯 ⊤ , 𝑎, 𝑏 ∈ ℝ, (12.23)

in general. Substituting this form of 𝐻𝑛+1 into the quasi-Newton condition (12.21), setting 𝐮 ∶= 𝐲𝑛 and 𝐯 ∶= 𝐻𝑛 𝐝𝑛 , and requiring that 𝑎𝐮⊤ 𝐝𝑛 = 1 and 𝑏𝐯 ⊤ 𝐝𝑛 = −1 yields the bfgs update

𝐻𝑛+1 = 𝐻𝑛 + 𝐲𝑛 𝐲𝑛⊤ ∕(𝐲𝑛⊤ 𝐝𝑛 ) − 𝐻𝑛 𝐝𝑛 𝐝𝑛⊤ 𝐻𝑛 ∕(𝐝𝑛⊤ 𝐻𝑛 𝐝𝑛 ) (12.24)

(see Problem 12.14).


Its inverse is

𝐻𝑛+1−1 = (𝐼 − 𝐝𝑛 𝐲𝑛⊤ ∕(𝐲𝑛⊤ 𝐝𝑛 )) 𝐻𝑛−1 (𝐼 − 𝐲𝑛 𝐝𝑛⊤ ∕(𝐲𝑛⊤ 𝐝𝑛 )) + 𝐝𝑛 𝐝𝑛⊤ ∕(𝐲𝑛⊤ 𝐝𝑛 ) (12.25)

(see Problem 12.15). Expanding and noting that 𝐲𝑛⊤ 𝐻𝑛−1 𝐲𝑛 is a scalar yields the equivalent form

𝐻𝑛+1−1 = 𝐻𝑛−1 − (𝐝𝑛 𝐲𝑛⊤ 𝐻𝑛−1 + 𝐻𝑛−1 𝐲𝑛 𝐝𝑛⊤ )∕(𝐲𝑛⊤ 𝐝𝑛 ) + (𝐲𝑛⊤ 𝐝𝑛 + 𝐲𝑛⊤ 𝐻𝑛−1 𝐲𝑛 ) 𝐝𝑛 𝐝𝑛⊤ ∕(𝐲𝑛⊤ 𝐝𝑛 )²,

which has the advantage that it can be evaluated more efficiently.
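The inverse update (12.25) can be written compactly in code; the following Python sketch (the book works in Julia, and this helper name is ours) is illustrative only and implements the product form rather than the expanded, more efficient variant.

```python
import numpy as np

def bfgs_update_inv(H_inv, d, y):
    """Inverse bfgs update (12.25): H_inv approximates the inverse Hessian,
    d is the step x_{n+1} - x_n, y the gradient difference; requires y . d > 0."""
    rho = 1.0 / (y @ d)
    I = np.eye(len(d))
    V = I - rho * np.outer(y, d)
    return V.T @ H_inv @ V + rho * np.outer(d, d)
```

By construction the updated matrix is symmetric and satisfies the quasi-Newton condition for the inverse, 𝐻𝑛+1−1 𝐲𝑛 = 𝐝𝑛 .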


We imposed three conditions on the approximations 𝐻𝑛 of the Hessian ma-
trices: that they are symmetric and positive definite and that they satisfy the
quasi-Newton condition (12.21). Now is a good time to check that these three
conditions are indeed satisfied: symmetry and positive definiteness are checked
in Problem 12.16 and Problem 12.17, and the quasi-Newton condition is obvi-
ously satisfied by the construction of 𝐻𝑛+1 . In summary, we have derived the
bfgs algorithm.

Algorithm 12.29 (bfgs)

1. Initialize the starting point 𝐱0 and the initial approximation 𝐻0−1 ∶= 𝐼 of the
inverse of the Hessian matrix.

2. Repeat:
a. Calculate the search direction

𝐬𝑛 ∶= −𝐻𝑛−1 ∇𝑓(𝐱𝑛 ).

b. Perform a line search to find the step size 𝛼𝑛 as

𝛼𝑛 ∶= arg min𝛼∈ℝ+ 𝑓(𝐱𝑛 + 𝛼𝐬𝑛 ),

where 𝛼𝑛 is chosen to satisfy the Wolfe conditions (12.14) and (12.15).


c. Calculate

𝐱𝑛+1 ∶= 𝐱𝑛 + 𝛼𝑛 𝐬𝑛 ,
𝐝𝑛 ∶= 𝐱𝑛+1 − 𝐱𝑛 ,
𝐲𝑛 ∶= ∇𝑓(𝐱𝑛+1 ) − ∇𝑓(𝐱𝑛 ).
d. Calculate 𝐻𝑛+1−1 using (12.25).
e. Repeat until convergence, i.e., until a norm of ∇𝑓(𝐱𝑛+1 ) has become suf-
ficiently small or a norm of 𝐱𝑛+1 − 𝐱𝑛 has become sufficiently small.
3. Return the approximation 𝐱𝑛 of a minimum.

The bfgs algorithm shows superlinear convergence. A convergence analysis


can be found, e.g., in [16, Section 6.4].
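A compact sketch of the algorithm might look as follows in Python (the book works in Julia). Note that, for brevity, this sketch deviates from Algorithm 12.29 in one respect: it uses simple Armijo backtracking instead of a full Wolfe line search, and it skips the update whenever 𝐲𝑛⊤ 𝐝𝑛 is not safely positive, so it is a simplification rather than a faithful transcription.

```python
import numpy as np

def bfgs_minimize(f, grad, x0, tol=1e-8, max_iter=100):
    """Sketch of the bfgs method (Algorithm 12.29) with Armijo backtracking
    in place of a full Wolfe line search."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H_inv = np.eye(n)                      # initial approximation H_0^{-1} = I
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        s = -H_inv @ g                     # search direction
        alpha = 1.0
        while f(x + alpha * s) > f(x) + 1e-4 * alpha * (g @ s):
            alpha *= 0.5                   # backtracking (sufficient decrease)
        d = alpha * s
        y = grad(x + d) - g
        if y @ d > 1e-12:                  # keep H_inv positive definite
            rho = 1.0 / (y @ d)
            V = np.eye(n) - rho * np.outer(y, d)
            H_inv = V.T @ H_inv @ V + rho * np.outer(d, d)
        x = x + d
    return x
```

The guard on 𝐲𝑛⊤ 𝐝𝑛 is the reason a Wolfe line search is preferred in practice: the curvature condition (12.15) guarantees 𝐲𝑛⊤ 𝐝𝑛 > 0, so no update need ever be skipped.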
Finally, we note that the same bfgs update formula for the inverse of the
Hessian matrix can also be found in a different way. We again require that the
next approximation of the matrix is symmetric and that it satisfies the quasi-
Newton condition (12.21). The new main requirement is that we choose the next
approximation 𝐻𝑛+1−1 as the (of course regular) matrix 𝐻 −1 that is closest to the previous approximation 𝐻𝑛−1 , i.e.,

𝐻𝑛+1−1 ∶= arg min𝐻 −1 ∈ℝ𝑑×𝑑 ‖𝐻 −1 − 𝐻𝑛−1 ‖𝑊²
subject to 𝐻 −1 𝐲𝑛 = 𝐝𝑛 and 𝐻 −1 = (𝐻 −1 )⊤ , (12.26)

where the norm is a weighted Frobenius norm [16, Section 6.1] (see Problem
12.20). More precisely, it is a weighted Frobenius norm

‖𝐴‖𝑊 ∶= ‖𝑊 1∕2 𝐴𝑊 1∕2 ‖F

whose weight matrix 𝑊 is any positive definite matrix that satisfies 𝑊𝐝𝑛 = 𝐲𝑛 ;
the Frobenius norm is defined as
‖𝐴‖F² ∶= ∑_{𝑖=1}^{𝑑} ∑_{𝑗=1}^{𝑑} 𝑎𝑖𝑗² .

The update for 𝐻𝑛+1 can be formulated as an analogous minimization problem.

12.9 The l-bfgs (Limited-Memory bfgs) Method

Quasi-Newton methods such as the bfgs algorithm in the previous section pro-
vide superlinear convergence while never calculating second derivatives explic-
itly. Their remaining disadvantage when applied to large optimization problems
is that the whole approximate Hessian matrix 𝐻𝑛−1 is stored during the iteration,
meaning that the memory requirement increases quadratically with the number
of variables.
In order to make the bfgs algorithm amenable to large optimization prob-
lems, the limited-memory bfgs (l-bfgs) method has been developed [15, 10, 11].
In the l-bfgs algorithm, not the whole matrix 𝐻𝑛−1 is stored and manipulated,
which would be prohibitive when the number of variables is large, but only a
few vectors which suffice to calculate the recursion (12.25) only for products
𝐻𝑛−1 𝐪 [16, Section 7.2]. Furthermore, not the whole recursion starting from 𝐻0−1
is calculated, but only a fixed number of previous steps is used. Thus the num-
ber of vectors stored is fixed and often relatively small, and hence the amount of
memory required is only linear in the number of variables.
The key to understanding the l-bfgs algorithm is the efficient way of calcu-
lating only the products 𝐻𝑛−1 𝐪 in Algorithm 12.30. We start by summarizing the
recursion (12.25) for the approximations 𝐻𝑛−1 of the inverses of the Hessian ma-
trices as
𝐻𝑛+1−1 ∶= 𝑉𝑛⊤ 𝐻𝑛−1 𝑉𝑛 + 𝜌𝑛 𝐝𝑛 𝐝𝑛⊤ ,
𝑉𝑛 ∶= 𝐼 − 𝜌𝑛 𝐲𝑛 𝐝𝑛⊤ ,
𝜌𝑛 ∶= 1∕(𝐲𝑛⊤ 𝐝𝑛 ).

Expanding the iteration yields the formula

𝐻𝑛−1 = (𝑉𝑛−1⊤ ⋯ 𝑉0⊤ )𝐻0−1 (𝑉0 ⋯ 𝑉𝑛−1 )
+ 𝜌0 (𝑉𝑛−1⊤ ⋯ 𝑉1⊤ )𝐝0 𝐝0⊤ (𝑉1 ⋯ 𝑉𝑛−1 )
+ 𝜌1 (𝑉𝑛−1⊤ ⋯ 𝑉2⊤ )𝐝1 𝐝1⊤ (𝑉2 ⋯ 𝑉𝑛−1 )
+ ⋯
+ 𝜌𝑛−1 𝐝𝑛−1 𝐝𝑛−1⊤ (12.27)

(see Problem 12.21).


How is this formula used? In each iteration 𝑛, we only store the last 𝑚 vectors
𝐝𝑖 and 𝐲𝑖 (𝑖 ∈ {𝑛 − 𝑚, … , 𝑛 − 1}). This leads to the linear memory requirement.
In each iteration 𝑛, the initial approximation 𝐻𝑛,0−1 of the inverse of the Hessian matrix is allowed to vary – in contrast to the bfgs algorithm – and is chosen anew. A formula for choosing 𝐻𝑛,0−1 that has proved effective is to define

𝐻𝑛,0−1 ∶= (𝐝𝑛−1⊤ 𝐲𝑛−1 ∕ 𝐲𝑛−1⊤ 𝐲𝑛−1 ) 𝐼. (12.28)

The scaling factor in front of the identity matrix estimates the size of the true
Hessian matrix along the most recent search direction, which helps to keep the
scale of the search direction correct and hence a step length of one is accepted
in most iterations. Based on the starting value and using only the last 𝑚 vectors,
the expanded iteration becomes

𝐻𝑛−1 = (𝑉𝑛−1⊤ ⋯ 𝑉𝑛−𝑚⊤ )𝐻𝑛,0−1 (𝑉𝑛−𝑚 ⋯ 𝑉𝑛−1 )
+ 𝜌𝑛−𝑚 (𝑉𝑛−1⊤ ⋯ 𝑉𝑛−𝑚+1⊤ )𝐝𝑛−𝑚 𝐝𝑛−𝑚⊤ (𝑉𝑛−𝑚+1 ⋯ 𝑉𝑛−1 )
+ 𝜌𝑛−𝑚+1 (𝑉𝑛−1⊤ ⋯ 𝑉𝑛−𝑚+2⊤ )𝐝𝑛−𝑚+1 𝐝𝑛−𝑚+1⊤ (𝑉𝑛−𝑚+2 ⋯ 𝑉𝑛−1 )
+ ⋯
+ 𝜌𝑛−1 𝐝𝑛−1 𝐝𝑛−1⊤ . (12.29)

This formula yields the following algorithm (see Problem 12.22).


Algorithm 12.30 (l-bfgs two-loop recursion) Input: 𝐯 ∈ ℝ𝑑 , initial matrix approximation 𝐻𝑛,0−1 (for example (12.28)), length 𝑚 ∈ ℕ of the history, vectors 𝐝𝑖 ∈ ℝ𝑑 and 𝐲𝑖 ∈ ℝ𝑑 (𝑖 ∈ {𝑛 − 𝑚, … , 𝑛 − 1}). Output: 𝐻𝑛−1 𝐯.
1. Define 𝐪 ∶= 𝐯.
2. For 𝑖 ∶= 𝑛 − 1, 𝑛 − 2, … , 𝑛 − 𝑚:
a. Set 𝛼𝑖 ∶= 𝜌𝑖 𝐝𝑖⊤ 𝐪.
b. Set 𝐪 ∶= 𝐪 − 𝛼𝑖 𝐲𝑖 .
3. Set 𝐫 ∶= 𝐻𝑛,0−1 𝐪.
4. For 𝑖 ∶= 𝑛 − 𝑚, 𝑛 − 𝑚 + 1, … , 𝑛 − 1:
a. Set 𝛽 ∶= 𝜌𝑖 𝐲𝑖⊤ 𝐫.
b. Set 𝐫 ∶= 𝐫 + (𝛼𝑖 − 𝛽)𝐝𝑖 .
5. Return 𝐫, which is equal to 𝐻𝑛−1 𝐯.
In the first loop, the factors to the right of 𝐻𝑛,0−1 are calculated. In the second loop, the multiplications on the left are performed, and the terms are added. Note that this approach works only for the products 𝐻𝑛−1 𝐯, but not for calculating the whole matrices 𝐻𝑛−1 .
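The two-loop recursion can be sketched as follows (in Python rather than the book's Julia; the scalar gamma plays the role of 𝐻𝑛,0−1 = 𝛾𝐼 and defaults to 1, and the names are ours).

```python
import numpy as np

def lbfgs_two_loop(v, d_list, y_list, gamma=1.0):
    """l-bfgs two-loop recursion (Algorithm 12.30): computes H_n^{-1} v from the
    stored pairs (d_i, y_i), oldest first; H_{n,0}^{-1} = gamma * I."""
    rho = [1.0 / (y @ d) for d, y in zip(d_list, y_list)]
    alpha = [0.0] * len(d_list)
    q = np.array(v, dtype=float)
    for i in reversed(range(len(d_list))):       # newest to oldest
        alpha[i] = rho[i] * (d_list[i] @ q)
        q = q - alpha[i] * y_list[i]
    r = gamma * q                                # multiply by H_{n,0}^{-1}
    for i in range(len(d_list)):                 # oldest to newest
        beta = rho[i] * (y_list[i] @ r)
        r = r + (alpha[i] - beta) * d_list[i]
    return r
```

With 𝛾 = 1, the result agrees with multiplying 𝐯 by the matrix built explicitly from the recursion (12.25), which can be used as a check.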
Assembling all pieces, we arrive at the l-bfgs algorithm (cf. the bfgs method, Algorithm 12.29).

Algorithm 12.31 (l-bfgs) Input: starting point 𝐱0 , length 𝑚 ∈ ℕ of history.


Output: approximation 𝐱𝑛 of minimum.
1. Initialize the starting point 𝐱0 .
2. Initialize the counter 𝑛 ∶= 0.
3. Repeat:
a. Choose the initial approximation 𝐻𝑛,0−1 of the inverse of the Hessian matrix, e.g., by using (12.28).
b. Calculate the search direction

𝐬𝑛 ∶= −𝐻𝑛−1 ∇𝑓(𝐱𝑛 )

using Algorithm 12.30.


c. Perform a line search to find the step size 𝛼𝑛 as

𝛼𝑛 ∶= arg min𝛼∈ℝ+ 𝑓(𝐱𝑛 + 𝛼𝐬𝑛 ),

where 𝛼𝑛 is chosen to satisfy the Wolfe conditions (12.14) and (12.15).


d. If 𝑛 > 𝑚, free the memory for 𝐝𝑛−𝑚 and 𝐲𝑛−𝑚 .
e. Calculate

𝐱𝑛+1 ∶= 𝐱𝑛 + 𝛼𝑛 𝐬𝑛 ,
𝐝𝑛 ∶= 𝐱𝑛+1 − 𝐱𝑛 ,
𝐲𝑛 ∶= ∇𝑓(𝐱𝑛+1 ) − ∇𝑓(𝐱𝑛 ).

f. Increase the counter 𝑛 ∶= 𝑛 + 1.


g. Repeat until convergence.
4. Return the approximation 𝐱𝑛 of a minimum.
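Putting the pieces together, a sketch of the algorithm might look as follows in Python (the book works in Julia). As in the bfgs sketch, Armijo backtracking replaces the Wolfe line search of step c, and updates with non-positive 𝐲𝑛⊤ 𝐝𝑛 are skipped, so this is a simplification, not a faithful transcription.

```python
import numpy as np

def lbfgs_minimize(f, grad, x0, m=5, tol=1e-8, max_iter=200):
    """Sketch of l-bfgs (Algorithm 12.31) with Armijo backtracking in place of
    a Wolfe line search; only the last m pairs (d_i, y_i) are stored."""
    x = np.asarray(x0, dtype=float)
    ds, ys = [], []
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # two-loop recursion computing r = H_n^{-1} g
        q = g.copy()
        alphas = []
        for d, y in zip(reversed(ds), reversed(ys)):     # newest to oldest
            a = (d @ q) / (y @ d)
            alphas.append(a)
            q = q - a * y
        gamma = (ds[-1] @ ys[-1]) / (ys[-1] @ ys[-1]) if ds else 1.0  # (12.28)
        r = gamma * q
        for d, y, a in zip(ds, ys, reversed(alphas)):    # oldest to newest
            b = (y @ r) / (y @ d)
            r = r + (a - b) * d
        s = -r                                           # search direction
        alpha = 1.0
        while f(x + alpha * s) > f(x) + 1e-4 * alpha * (g @ s):
            alpha *= 0.5
        d_new, y_new = alpha * s, grad(x + alpha * s) - g
        if y_new @ d_new > 1e-12:                        # keep curvature valid
            ds.append(d_new); ys.append(y_new)
            if len(ds) > m:
                ds.pop(0); ys.pop(0)                     # free oldest pair
        x = x + d_new
    return x
```

Only the 2𝑚 stored vectors grow with the problem size, so the memory requirement is linear in the number of variables, as discussed above.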

The fact that l-bfgs only uses recent information and discards older itera-
tions in contrast to bfgs, which uses information from all previous iterations,
can be viewed as an advantage. While l-bfgs is significantly faster than bfgs
when the number of variables is large and can perform nearly as well, the opti-
mal choice of the parameter 𝑚 in the algorithm depends on the problem class,
which is a disadvantage of l-bfgs. When l-bfgs fails, one should therefore try
to increase the parameter 𝑚.
Many variants of the l-bfgs method have been developed. For example, the
limited-memory algorithm in [5, 21, 12] can solve optimization problems with
simple bounds.

12.10 Julia Packages

Many optimization packages are available under the JuliaOpt and JuliaNLSolvers umbrellas. The Optim package provides software for univariate and multivariate optimization written in Julia with a focus on unconstrained optimization. The LBFGSB package is a Julia wrapper for the l-bfgs-b nonlinear optimization code.

12.11 Bibliographical Remarks

A well written introduction to convex optimization with many applications is [2].


An in-depth treatment of convergence analysis in convex optimization can be
found in [14]. Methods of unconstrained optimization (including quasi-Newton
like methods such as bfgs) and constrained optimization as well as their prac-
tical aspects are discussed in [7]. Another excellent overview of optimization
methods is [16].

Problems

12.1 Prove Theorem 12.5.

12.2 Prove Theorem 12.7.

12.3 Prove Corollary 12.12.

12.4 Prove Lemma 12.14.

12.5 Calculate the integral in (12.7) and plot the sum and its approximate upper
bound used in the proof of Corollary 12.20.

12.6 * Prove Theorem 12.21.

12.7 Prove (12.13).

12.8 Implement Algorithm 12.22.

12.9 Apply Algorithm 12.22 to a benchmark problem in Sect. 11.8 using a start-
ing point you have found by global optimization.

12.10 Explain the meaning of the end points of the intervals 𝑐1 ∈ (0, 1) and
𝑐2 ∈ (𝑐1 , 1) in the Wolfe conditions (12.14) and (12.15).

12.11 Implement Algorithm 12.26, which satisfies the Wolfe conditions (12.14)
and (12.15).

12.12 Apply Algorithm 12.26 to a benchmark problem in Sect. 11.8 using a start-
ing point you have found by global optimization.

12.13 Derive a rank-one update formula by substituting the definition (12.22) of


𝐻𝑛+1 into the quasi-Newton condition (12.21) and following the text.

12.14 Prove the rank-two update formula (12.24) by substituting the definition
(12.23) of 𝐻𝑛+1 into the quasi-Newton condition (12.21) and following the text.
12.15 Prove (12.25) by showing that 𝐻𝑛+1−1 𝐻𝑛+1 = 𝐼 or by showing that 𝐻𝑛+1 𝐻𝑛+1−1 = 𝐼.

12.16 Prove: Suppose that 𝐻𝑛 is symmetric; then 𝐻𝑛+1 defined in (12.24) is sym-
metric as well.

12.17 * Prove: Suppose that 𝐻𝑛 is symmetric and positive definite and that
𝐲𝑛⊤ 𝐝𝑛 > 0; then 𝐻𝑛+1 defined in (12.24) is positive definite as well.

12.18 Implement Algorithm 12.29.

12.19 Apply Algorithm 12.29 to a benchmark problem in Sect. 11.8 using a start-
ing point you have found by global optimization.

12.20 * Prove that the unique solution of the minimization problem (12.26) is
the bfgs update (12.25).

12.21 Prove that (12.27) holds for 𝐻𝑛−1 defined by (12.25).

12.22 Prove that (12.29) yields Algorithm 12.30.

12.23 Implement Algorithm 12.30.

12.24 Implement Algorithm 12.31.

12.25 Apply Algorithm 12.31 to a benchmark problem in Sect. 11.8 using a start-
ing point you have found by global optimization.

12.26 Compare Algorithm 12.29 and Algorithm 12.31 for different values of the
parameter 𝑚.

References

1. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse
problems. SIAM J. Imaging Sciences 2(1), 183–202 (2009)
2. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cam-
bridge, UK (2004)
3. Broyden, C.: The convergence of a class of double-rank minimization algorithms 1. general
considerations. IMA Journal of Applied Mathematics 6(1), 76–90 (1970)
4. Broyden, C.: The convergence of a class of double-rank minimization algorithms 2. the
5. Byrd, R., Lu, P., Nocedal, J., Zhu, C.: A limited memory algorithm for bound constrained
optimization. SIAM J. Sci. Comput. 16(5), 1190–1208 (1995)
6. Fletcher, R.: A new approach to variable metric algorithms. The Computer Journal 13(3),
317–322 (1970)
7. Fletcher, R.: Practical Methods of Optimization, 2nd edn. John Wiley and Sons (2000)
8. Goldfarb, D.: A family of variable-metric updates derived by variational means. Mathe-
matics of Computation 24(109), 23–26 (1970)
9. Lewis, A., Overton, M.: Nonsmooth optimization and quasi-Newton methods. Math. Pro-
gramming 141(1-2), 135–163 (2013)
10. Liu, D., Nocedal, J.: On the limited memory BFGS method for large scale optimization.
Math. Programming 45(1–3), 503–528 (1989)
11. Mokhtari, A., Ribeiro, A.: Global convergence of online limited memory BFGS. Journal
of Machine Learning Research 16(Dec), 3151–3181 (2015)
12. Morales, J., Nocedal, J.: Remark on “algorithm 778: L-BFGS-B: Fortran subroutines for
large-scale bound-constrained optimization”. ACM Transactions on Mathematical Soft-
ware (TOMS) 38(1), article no. 7 (2011)
13. Nesterov, Y.: A method for solving the convex programming problem with convergence
rate 𝑂(1∕𝑘 2 ). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)
14. Nesterov, Y.: Lectures on Convex Optimization, 2nd edn. Springer (2018)
15. Nocedal, J.: Updating quasi-Newton matrices with limited storage. Mathematics of Com-
putation 35(151), 773–782 (1980)
16. Nocedal, J., Wright, S.: Numerical Optimization, 2nd edn. Springer (2006)
17. Shanno, D.: Conditioning of quasi-Newton methods for function minimization. Mathe-
matics of Computation 24(111), 647–656 (1970)
18. Shanno, D., Kettler, P.: Optimal conditioning of quasi-Newton methods. Mathematics of
Computation 24(111), 657–664 (1970)
19. Wolfe, P.: Convergence conditions for ascent methods. SIAM Review 11(2), 226–235 (1969)
20. Wolfe, P.: Convergence conditions for ascent methods. II: some corrections. SIAM Review
13(2), 185–188 (1971)
21. Zhu, C., Byrd, R., Lu, P., Nocedal, J.: Algorithm 778: L-BFGS-B: Fortran subroutines for
large-scale bound-constrained optimization. ACM Transactions on Mathematical Software
(TOMS) 23(4), 550–560 (1997)
Part IV
Algorithms for Machine Learning
Chapter 13
Neural Networks

Mama always said “son, if you had a brain, you’d be dangerous,”
guess it pays to be brainless.
—Eminem, song “Brainless”, The Marshall Mathers LP 2 (2013)

Abstract Artificial neural networks were first conceived decades ago and have
been an important part of machine learning and artificial intelligence ever since.
The basic idea behind neural networks is to combine linear combinations and
nonlinear functions into layers such that they can approximate arbitrary func-
tions. Neural networks have been used to great effect for example in image
classification and recognition once the computational power and suitable algo-
rithms to train large neural networks had become available. In this chapter, we
implement a neural network and train it using backpropagation in a fully self-
contained program. Using tens of thousands of scanned images, we train the
neural network to recognize handwritten digits and discuss important aspects
of training neural networks.

13.1 Introduction

The idea underlying an artificial neural network is to define a function taking
a vector as the input and yielding a vector as the output as shown in Fig. 13.1.
The circles indicate real numbers, the arrows indicate dependencies, and the
rectangles indicate vectors. Between the input and output vectors (or layers in
neural-network parlance), hidden layers are arranged such that the input layer
influences only the first hidden layer, each hidden layer influences only the next
hidden layer, and only the last hidden layer influences the output layer.
Much of the analogy of artificial neural networks obviously stems from the
neurons and axons in the nervous system and the brain, although the depen-
dencies or connections implemented by axons are in general much more com-


(Figure: input layer, three hidden layers, and output layer, each layer a vector of neurons.)
Fig. 13.1 Schematic diagram of an artificial neural network. Not all dependencies between ad-
jacent layers are shown in order not to overload the figure; in general, each neuron influences
each neuron in the next layer.

plicated (and still mostly unknown) than the ones between adjacent layers indi-
cated in Fig. 13.1.
More general dependencies are of course possible in artificial neural networks
and have been investigated. The computational reason why an arrangement in
layers is preferable is that it makes training the neural network (see Sections 13.6
and 13.7) much easier than in the general case when circular dependencies are
allowed.
Neural networks have become a vast field with many applications in super-
vised machine learning. Supervised learning means that the input cases and
the output cases are known and that their (possibly highly complicated) func-
tional relationship is to be learned while the data may be noisy. Since the ar-
rangement in Fig. 13.1 resembles some structures in the visual cortex of the
brain, it may be unsurprising that large enough neural networks with a certain
number and structure of the hidden layers have been used very successfully in
image-classification and image-recognition tasks. The field of deep learning is
concerned with deep neural networks, i.e., those with many hidden layers.
In the following we study the structure of neural networks in more detail,
state an important result from the theory of neural networks, learn how it is pos-
sible to train neural networks, and consider the example of handwriting recog-
nition as a practical and nontrivial example. The Julia code in this chapter im-
plements neural networks of the form shown in Fig. 13.1 and is used to solve the
handwriting-recognition problem.

13.2 Feeding Forward

Before we can evaluate the function given by a neural network, we still must
specify the neural network more precisely. We denote the vector-valued output
of layer 𝑘 − 1 by 𝐱(𝑘−1) . The action of layer 𝑘 is given by applying a linear (or
more precisely, an affine) function and then a nonlinear function elementwise.
The vector output 𝐱(𝑘) of layer 𝑘 can therefore be written as

𝐱(𝑘) ∶= 𝜎(𝑊 (𝑘) 𝐱(𝑘−1) + 𝐛(𝑘) ),

where the matrix 𝑊 (𝑘) is called the weight matrix, the vector 𝐛(𝑘) contains the
so-called biases, and the activation function 𝜎 ∶ ℝ → ℝ is applied elementwise
to its argument.
It is an important feature of neural networks that the activation function is
nonlinear. Otherwise, if it were linear, the whole network would just be a linear
function as a composition of linear functions. There are two main considerations
important for the choice of the activation function: first, it should facilitate the
kind of functional relationship between the known input and output vectors of
the learning problem at hand, and secondly, it should be expedient for the com-
putations required for learning (see Sections 13.6 and 13.7).
The first requirement is generally hard to satisfy a priori and without any
experience or experimentation. The question of the shape of the output layer in
the handwriting-recognition problem is an example of such considerations and
is discussed below.
The second requirement is especially important for deep neural networks, i.e.,
for networks with many hidden layers, and is discussed using the example of the
choice of the activation function 𝜎 in the following.
Various popular choices of activation functions are the following. The first is
the sigmoid function
𝜎1 (𝑥) ∶= 1∕(1 + e^(−𝑥)),
also called the logistic function. The second is the leaky rectifier

𝜎2 (𝑥) ∶= { 𝑥,   𝑥 ≥ 0,
          { 𝛼𝑥,  𝑥 < 0,

where 𝛼 is a small parameter, e.g., 𝛼 = 1∕10 or 𝛼 = 1∕100. It is the leaky version
of the rectifier
𝜎3 (𝑥) ∶= max(𝑥, 0).
An example of a smooth rectifier is the function

𝜎4 (𝑥) ∶= ln(1 + exp(𝑥)).

A final example of an activation function is the hyperbolic tangent

𝜎5 (𝑥) ∶= tanh(𝑥).

Basic properties of these five functions are investigated in Problem 13.1. If an
activation function is differentiable only almost everywhere, this fact does not
pose a problem in practice; we can just use the limit from the left or from the
right instead at any points where the derivative is not defined.
As seen in Problem 13.1, these activation functions and their derivatives differ
substantially regarding their behavior away from zero. For example, the deriva-
tive of the sigmoid function becomes small away from zero, which impedes learn-
ing by gradient descent in deep networks, since small derivatives are multiplied,
resulting in tiny gradients and hence slow learning (see Sections 13.6 and 13.7).
The leaky rectifier 𝜎2 circumvents this problem and still enables learning for
𝑥 < 0 – in contrast to the rectifier 𝜎3 – unless the parameter 𝛼 is too small.
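These five activation functions and the derivatives of the first two can be written out directly in Julia; the names sigma1 to sigma5 follow the numbering above but are otherwise our own choice, not part of the chapter's module.

```julia
# The five activation functions from the text; the names sigma1..sigma5
# follow the numbering above but are otherwise our own.
sigma1(x) = 1 / (1 + exp(-x))                      # sigmoid (logistic function)
sigma2(x; alpha = 1/10) = x >= 0 ? x : alpha * x   # leaky rectifier
sigma3(x) = max(x, 0)                              # rectifier
sigma4(x) = log(1 + exp(x))                        # smooth rectifier
sigma5(x) = tanh(x)                                # hyperbolic tangent

# Derivatives of the first two, as used for gradient-based learning:
dsigma1(x) = sigma1(x) * (1 - sigma1(x))
dsigma2(x; alpha = 1/10) = x >= 0 ? 1.0 : alpha
```

Note how dsigma1 decays to zero for large |𝑥|, while dsigma2 is bounded away from zero everywhere.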
The output of all these activation functions is real-valued. Therefore, when
performing regression using a neural network, the output layer provides the re-
sult. In classification problems, however, the output of the network must be one
of the discrete classes, and hence the output must be discretized. If the number
of classes is small, it is convenient to discretize the output of the activation func-
tion of a single neuron in the output layer. For example, if the activation function
is 𝜎1 and there are two classes, then the discretization ([0, 1∕2], (1∕2, 1]) suggests
itself; if the activation function is tanh and there are three classes, then the dis-
cretization ([−1, −1∕3], (−1∕3, 1∕3], (1∕3, 1]) suggests itself.
On the other hand, if the number of classes is relatively large, then it is much
better to assign a class to each neuron in the output layer and to consider the
index of the neuron with the largest value of the activation function to be the
classification of the input vector. This is what the output layer in our example
of recognizing handwritten digits looks like: there are ten neurons in the output
layer and the index (minus one) of the neuron that fires most is the classification
of the input image.
Neural networks are differentiable functions. It is an important feature that
discrete problems such as classification problems are formulated such that learn-
ing becomes optimization of a differentiable function, which is much easier to
solve computationally than a discrete optimization problem.

It is of course also possible to use different activation functions on different
layers and to use layers with different connections between the neurons; these
possibilities have indeed been pursued to great advantage in designing neural
networks, e.g., in convolutional neural networks.
Having discussed these design considerations, we start to implement neural
networks. We define a module called NN and a data structure Activation. Its
field f contains the activation function, and its field d its derivative. The package
MLDatasets contains sample data sets for machine learning, the package Printf
provides facilities for printing numbers with a given precision, and the package
PyPlot is one of the most popular plotting packages.

module NN

import GZip
import LinearAlgebra
import MLDatasets
import Printf
import PyPlot
import Random

struct Activation
    f::Function
    d::Function
end

function sigma(x)
    1/(1+exp(-x))
end

const sigmoid = Activation(
    sigma,
    x -> sigma(x) * (1-sigma(x)))

The next data type Cost contains the cost function to be used. In short, a
cost function measures the difference between the output of the neural network
and the correct output to be learned. We already define two cost functions at
this point so that we can evaluate our neural network later. Cost functions are
discussed in more detail in Sect. 13.5.
struct Cost
    f::Function
    delta::Function
end

function quadratic_cost(activation::Activation)::Cost
    Cost((a, y) -> 0.5 * LinearAlgebra.norm(a-y)^2,
         (z, a, y) -> (a-y) .* activation.d.(z))
end

const cross_entropy_cost = Cost(
    (a, y) -> sum(-y .* log.(a) - (1 .- y) .* log.(1 .- a)),
    (z, a, y) -> a-y)

The data structure Network contains the weights and biases of all layers with
additional information. The weights are matrices and the biases are vectors.
The vector sizes contains the numbers of neurons in each layer. The final four
vectors record the progress during training as we will see later. The function
Network is a custom constructor and only requires the sizes of the layers, but
takes three keyword arguments. It calls the function new, only available in this
context, to construct the instance (see Sect. 5.5).
The weights and biases are initialized with normally distributed random num-
bers. If a weight matrix is large, its product with the output of the previous layer
tends to be large as well. This hinders learning with activation functions whose
derivatives are small for large arguments. Therefore it is generally useful to scale
the weight matrices such that products of the weight matrices with columns of
all ones are still normally distributed with variance one; this is achieved by the
scaling factor sqrt(j), which is the square root of the size of the previous layer.
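The effect of this scaling can be checked numerically. The following sketch (our own, independent of the module) compares the variance of the product of a weight matrix with a column of all ones, with and without the factor 1∕sqrt(j).

```julia
using Statistics, Random

Random.seed!(1)

j = 1_000                    # size of the previous layer
W = randn(10_000, j)         # unscaled weight matrix, many rows for statistics
ones_j = ones(j)             # column of all ones

var_unscaled = var(W * ones_j)             # approximately j
var_scaled = var((W ./ sqrt(j)) * ones_j)  # approximately one
```

With the scaling, the inputs to the next layer stay in the region where the sigmoid's derivative is not vanishingly small.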
mutable struct Network
    activation::Activation
    cost::Cost
    n_layers::Int
    sizes::Vector{Int}
    weights::Vector{Array{Float64, 2}}
    biases::Vector{Vector{Float64}}
    training_cost::Vector{Float64}
    validation_cost::Vector{Float64}
    training_accuracy::Vector{Float64}
    validation_accuracy::Vector{Float64}

    function Network(sizes; activation = sigmoid,
                     cost = quadratic_cost(activation),
                     scale_weights = true)::Network
        new(activation, cost, length(sizes), sizes,
            [randn(i, j) / if scale_weights sqrt(j) else 1 end
             for (i, j) in zip(sizes[2:end], sizes[1:end-1])],
            [randn(i) for i in sizes[2:end]],
            [], [], [], [])
    end
end

Having defined these data structures, we can construct our first neural net-
work by evaluating Network([100, 10, 1]).
Evaluating a neural network is commonly referred to as feeding forward. We
loop over all weight matrices and bias vectors simultaneously using zip. In each
iteration, the activation a of the previous layer is transformed linearly and then
the activation function nn.activation.f is applied elementwise.
function feed_forward(nn::Network, input::Vector)::Vector
    local a = input

    for (W, b) in zip(nn.weights, nn.biases)
        a = nn.activation.f.(W*a + b)
    end

    a
end
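As a quick check of the feed-forward rule, the following self-contained sketch builds random weights and biases for the hypothetical layer sizes [4, 3, 2] and propagates an input vector through them; the function feed mirrors the loop above but avoids the Network struct.

```julia
using Random

Random.seed!(0)

sigma(x) = 1 / (1 + exp(-x))

sizes = [4, 3, 2]   # hypothetical layer sizes: input, hidden, output
weights = [randn(sizes[k+1], sizes[k]) for k in 1:length(sizes)-1]
biases = [randn(sizes[k+1]) for k in 1:length(sizes)-1]

function feed(weights, biases, input)
    a = input
    for (W, b) in zip(weights, biases)
        a = sigma.(W * a + b)    # x^(k) = σ(W^(k) x^(k-1) + b^(k))
    end
    a
end

output = feed(weights, biases, randn(sizes[1]))
```

The output has the size of the last layer, and every component lies in (0, 1) because the sigmoid is applied elementwise.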

13.3 The Approximation Property

Before we train our neural network, the question arises whether neural networks
such as the ones shown in Fig. 13.1 can approximate arbitrary functions to be
learned. This question is fundamental: if arbitrary relationships between
the input and output could not be represented by such functions, it would be
absurd to try to train neural networks. Fortunately, the answer to this question
is positive.
As the following theorem shows, neural networks 𝜙 without any hidden layer,
but whose output consists of a linear combination of a sufficiently large num-
ber 𝑛 of neurons, are already capable of approximating any given continuous
function 𝑓 arbitrarily well on compact subsets of ℝ𝑑 [1, 2]. The assumptions on
the activation function 𝜎 are lenient. The restriction that the function to be ap-
proximated must be continuous is understandable in view of the fact that neural
networks are continuous functions.

Theorem 13.1 (approximation property) Suppose that 𝐾 is a compact subset
of ℝ𝑑 , that 𝑓 is an arbitrary function in 𝐶(𝐾, ℝ), that 𝜖 ∈ ℝ+ is arbitrary, and that
𝜎 is a nonconstant, bounded, and monotonically increasing continuous function.
Then there exist an integer 𝑛, constants 𝑏𝑖 ∈ ℝ and 𝑣𝑖 ∈ ℝ, and vectors 𝐰𝑖 ∈ ℝ𝑑
for 𝑖 ∈ {1, … , 𝑑} such that the inequality

max |𝜙(𝐱) − 𝑓(𝐱)| < 𝜖


𝐱∈𝐾

holds, where the approximation 𝜙 is defined as


𝜙(𝐱) ∶= ∑_{𝑖=1}^{𝑛} 𝑣𝑖 𝜎(𝐰𝑖 ⋅ 𝐱 + 𝑏𝑖 ),

i.e., the functions 𝜙 are dense in 𝐶(𝐾).

Functions in 𝐶(𝐾, ℝ𝐷 ) can be approximated by combining 𝐷 neural networks,
each neural network being responsible for one component.
Proof (sketch) Instead of giving a full proof of the theorem, we sketch a visual
and intuitive argument for a slightly weaker result in the following. The intuitive
argument works for neural networks with two hidden layers and for sigmoid
activation functions.
In the first step, we approximate functions 𝑔 ∈ 𝐶([𝑎, 𝑏], ℝ), i.e., continuous
functions with one-dimensional input and output. We start by considering a neu-
ral network with two neurons in a single hidden layer and one output neuron.
Increasing the weight of the first hidden neuron makes its output approximate
a step function, and changing the bias of the first hidden neuron changes the
position of the step accordingly. We can thus approximate the output of the first
hidden neuron by a step function with an adjustable step position. The same
holds true for the second hidden neuron.
We can then combine the outputs of the two neurons in the hidden layer,
which are two step functions, such that the neuron in the output layer is nonzero
only between the two step positions. Furthermore, the height of the step can also
be adjusted arbitrarily. This means that the neuron in the output layer before
applying the activation function can be made approximately equal to

𝜓𝑗 (𝑥) ∶= { 𝑐𝑗 , 𝑥 ∈ [𝑎𝑗 , 𝑏𝑗 ),
          { 0,  otherwise.

By using a hidden layer with 2𝑚 neurons, we can hence approximate any piece-
wise constant function
𝜓(𝑥) = ∑_{𝑗=1}^{𝑚} 𝜓𝑗 (𝑥) ≈ 𝜎^{−1} ◦ 𝑔(𝑥)

on the interval [𝑎, 𝑏]. These piecewise constant functions approximate 𝑔 ∈
𝐶([𝑎, 𝑏]) after applying the activation function 𝜎 in the output neuron as 𝑚 in-
creases. This concludes the first step, where we approximate continuous func-
tions with one-dimensional input and output.
In the second step, we generalize this idea to approximate functions 𝑓 ∈
𝐶(ℝ𝑑 , ℝ). In the two-dimensional case, we use two neurons for each of the two
dimensions to construct piecewise constant functions of the form

𝜓1𝑗 (𝑥1 , 𝑥2 ) ∶= { 𝑐1𝑗 , 𝑥1 ∈ [𝑎1𝑗 , 𝑏1𝑗 ),
                { 0,   otherwise,

𝜓2𝑘 (𝑥1 , 𝑥2 ) ∶= { 𝑐2𝑘 , 𝑥2 ∈ [𝑎2𝑘 , 𝑏2𝑘 ),
                { 0,   otherwise

in the first hidden layer. These four neurons are combined in the second hidden
layer to approximate functions of the form
𝜓𝑗𝑘 (𝑥1 , 𝑥2 ) ∶= { 𝑐𝑗𝑘 , (𝑥1 , 𝑥2 ) ∈ [𝑎1𝑗 , 𝑏1𝑗 ) × [𝑎2𝑘 , 𝑏2𝑘 ),
                { 0,   otherwise,

which are nonzero only on a rectangle. Using these functions 𝜓𝑗𝑘 , we can approx-
imate any continuous function 𝑓 ∈ 𝐶(ℝ2 , ℝ) by the piecewise (on rectangles)
constant function
𝑚𝑗 𝑚 𝑘
( ) ∑ ∑ ( ) ( )
𝜓 (𝑥1 , 𝑥2 ) ∶= 𝜓𝑗𝑘 (𝑥1 , 𝑥2 ) ≈ 𝜎−1 ◦𝑓 (𝑥1 , 𝑥2 )
𝑗=1 𝑘=1

and finally applying the activation function 𝜎 in the output layer. This argument
can be generalized from two to 𝑑 dimensions.
In the third and last step, we remove the assumption that the outputs of the
neurons are exact step functions. In the one-dimensional case 𝐶(ℝ, ℝ),
we choose intervals of same length. The error due to the smoothed step func-
tions occurs where the intervals meet. We can make this error arbitrarily small
by adding a large number 𝑁 of shifted approximations of the function 𝑓(𝑥)∕𝑁,
because then each point 𝑥 is only affected by the error due to one shifted approx-
imation, which decreases as 𝑁 increases. This idea can be generalized to the
multidimensional case. This concludes the sketch of the proof. □
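The first steps of this sketch can be reproduced numerically: a sigmoid neuron with a large weight 𝑤 approximates a step at position 𝑐 = −𝑏∕𝑤, and the difference of two such neurons is approximately nonzero only between the two step positions. The weight 1000 and the interval [0.3, 0.7] below are arbitrary illustrative choices of our own.

```julia
sigma(x) = 1 / (1 + exp(-x))

# A single hidden neuron with weight w and bias -w*c: a smoothed step at c.
step(x; w = 1000.0, c = 0.0) = sigma(w * (x - c))

# Two such neurons combined: approximately one on [0.3, 0.7), zero elsewhere.
bump(x) = step(x, c = 0.3) - step(x, c = 0.7)
```

Scaling bump by a constant 𝑐𝑗 gives the building block 𝜓𝑗 used in the sketch.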

The universal approximation property in Theorem 13.1 is of great theoretical
importance, since it motivates pursuing the goal of training neural networks to
learn arbitrary functions. However, it still does not tell us how the parameters of
the neural network, such as activation functions, the number of layers, and the
numbers of neurons in every layer should be chosen. It is also clear that different
classes of functions will require different parameter choices.
Theorem 13.1 may mislead us to believe that shallow neural networks are
sufficient in practice. The opposite is true: the advantage of deep networks with
many layers is their hierarchical structure. For example, in image recognition,
there is a hierarchy comprising the recognition of pixels, edges, simple geometric
shapes, larger objects consisting of the simple shapes, and even scenes contain-
ing multiple objects. In real-world problems, deep networks are far more useful
than shallow networks.

Fig. 13.2 The digits from 0 to 9 from the beginning of the mnist training set.

13.4 Handwriting Recognition

To illustrate how neural networks learn, we consider a real-world example,
namely how to recognize hand-written digits. Handwriting recognition is a clas-
sic, nontrivial task in artificial intelligence and machine learning and well-suited
for neural networks, yet recognizing hand-written digits is still feasible as a first
problem.
Fortunately, a large database of hand-written digits is available as the mnist
(modified National Institute of Standards and Technology) database, which con-
tains 60 000 training images and 10 000 test images. The gray-scale images have
a size of 28 × 28 pixels and were digitized from digits written by employees of
the United States Census Bureau and American high-school students. Fig. 13.2
shows some of the first images in the database; local influence is seen in the
shape of the digits 1 and 7. Even more fortunately, the mnist database can be
downloaded by simply installing the MLDatasets package and then evaluating
MLDatasets.MNIST.download().
Next, we read the whole database into six global variables, containing the im-
ages and labels of the training, validation, and test datasets. The mnist images
are split into these three datasets, and the reason for doing so will be discussed
below. The global variables MNIST_n_rows and MNIST_n_cols contain the num-
bers of rows and columns of the gray-scale images. After reading each image in
the function MNIST_read_images, its pixels are scaled to the interval [0, 1].
global MNIST_file_training_images =
    joinpath(MLDatasets.MNIST.datadep"MNIST",
             MLDatasets.MNIST.TRAINIMAGES)
global MNIST_file_training_labels =
    joinpath(MLDatasets.MNIST.datadep"MNIST",
             MLDatasets.MNIST.TRAINLABELS)
global MNIST_file_test_images =
    joinpath(MLDatasets.MNIST.datadep"MNIST",
             MLDatasets.MNIST.TESTIMAGES)
global MNIST_file_test_labels =
    joinpath(MLDatasets.MNIST.datadep"MNIST",
             MLDatasets.MNIST.TESTLABELS)

global MNIST_n_rows, MNIST_n_cols

function MNIST_read_images(filename::String)
    GZip.open(filename, "r") do s
        local magic_number = bswap(read(s, UInt32))
        local n_items = Int(bswap(read(s, UInt32)))
        global MNIST_n_rows = Int(bswap(read(s, UInt32)))
        global MNIST_n_cols = Int(bswap(read(s, UInt32)))

        [Vector{Float64}(read!(s, Array{UInt8}(undef,
                                MNIST_n_rows*MNIST_n_cols))) ./
         typemax(UInt8)
         for i in 1:n_items]
    end
end

function MNIST_read_labels(filename::String)
    GZip.open(filename, "r") do s
        local magic_number = bswap(read(s, UInt32))
        local n_items = Int(bswap(read(s, UInt32)))

        [Int(read(s, UInt8)) for i in 1:n_items]
    end
end

The function vectorize takes a label, i.e., an integer 𝑛 between zero and nine,
and yields a vector of length ten whose 𝑛-th element is equal to one, while all
other elements are equal to zero. The reason for expanding the label into a vector
was already discussed in Sect. 13.2: if the number of classes is large, then neural
networks learn much better when each class corresponds to a designated neuron
in the output layer. The class assigned to a given input is the one whose output
neuron has the largest value.
function vectorize(n::Integer)
    local result = zeros(10)
    result[n+1] = 1
    result
end

Finally, the function load_MNIST_data yields the input and output values in
six vectors, and six global variables are defined.

function load_MNIST_data()
    local train_x = MNIST_read_images(MNIST_file_training_images)
    local train_y = MNIST_read_labels(MNIST_file_training_labels)
    local test_x = MNIST_read_images(MNIST_file_test_images)
    local test_y = MNIST_read_labels(MNIST_file_test_labels)

    (train_x[1:50_000], vectorize.(train_y[1:50_000]),
     train_x[50_001:60_000], train_y[50_001:60_000],
     test_x, test_y)
end

global (training_data_x, training_data_y,
        validation_data_x, validation_data_y,
        test_data_x, test_data_y) = load_MNIST_data()

The following function was used to plot the images in Fig. 13.2. The PyPlot
package was already imported at the beginning of the module.
function plot_digit(n::Int, file = nothing)
    local v

    if 1 <= n <= 50_000
        v = training_data_x[n]
    elseif 50_001 <= n <= 60_000
        v = validation_data_x[n-50_000]
    elseif 60_001 <= n <= 70_000
        v = test_data_x[n-60_000]
    end

    PyPlot.matshow(reshape(v, (MNIST_n_rows, MNIST_n_cols))',
                   cmap = "Blues")
    PyPlot.axis("off")

    if isa(file, String)
        PyPlot.savefig(file * string(n) * ".pdf",
                       bbox_inches = "tight", pad_inches = 0)
    end
end

13.5 Cost Functions

In order to train a neural network, we must be able to measure how well its out-
put agrees with the given labels of the data. Functions that measure this differ-
ence are called cost functions, loss functions, or objective functions, and many
choices exist.
We denote the items in the given training data by vectors 𝐱 ∈ ℝ784 , which
in our application is 28 ⋅ 28 = 784 dimensional and represents an image. These
items serve as input to the neural network. The corresponding labels in the train-
ing data are denoted by 𝑦(𝐱) ∈ ℝ10 , where each of the ten elements or neurons
corresponds to one digit. In other words, the function 𝐲 ∶ ℝ784 ⊃ 𝑇 → ℝ10 rep-
resents the given training data. Furthermore, we denote the neural network by
the function 𝐚 ∶ ℝ784 → ℝ10 , whose value is the activation of the output layer.
One of the most popular cost functions is the quadratic cost function
𝐶2 (𝑊, 𝐛) ∶= 1∕(2|𝑇|) ∑_{𝐱∈𝑇} ‖𝐲(𝐱) − 𝐚(𝐱)‖₂² ,

also called the mean squared error. The sum is over all |𝑇| elements 𝐱 of the
training set 𝑇. The cost function is a function of the parameters of the neural
network, which are denoted by 𝑊 and 𝐛 for the collection of all weights and
biases. The reason for the factor 1∕|𝑇| is that it makes the values of the cost
function comparable whenever the number |𝑇| of the training items changes.
The reason for the factor 1∕2 is that it removes the factor 2 in the derivative of
the cost function, which we will use shortly. The factors are, of course, irrelevant
when the cost function is minimized.
It goes without saying that other norms can be used instead of the Euclidean
norm. Different choices of cost functions generally lead to different neural net-
works, and an expedient choice generally depends on the problem at hand.
Another popular cost function is the cross-entropy cost function
𝐶CE (𝑊, 𝐛) ∶= −1∕|𝑇| ∑_{𝐱∈𝑇} (𝐲(𝐱) ⋅ ln 𝐚(𝐱) + (𝟏 − 𝐲(𝐱)) ⋅ ln(𝟏 − 𝐚(𝐱))),   (13.1)

where the logarithm is applied elementwise to its vector argument. The cost func-
tion, the activation function of the output layer, and how fast a network learns
are closely related. This relationship is discussed at the end of the next section,
where we will also see how the expression in 𝐶CE is obtained.
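Written out for a single training item, the two cost functions compare as follows; the label vector and the two hypothetical network outputs are our own illustration.

```julia
using LinearAlgebra

quadratic(a, y) = 0.5 * norm(a - y)^2                        # quadratic cost
cross_entropy(a, y) = sum(-y .* log.(a) - (1 .- y) .* log.(1 .- a))

y = zeros(10); y[4] = 1.0                    # the label "3" as a vector
a_good = fill(0.01, 10); a_good[4] = 0.91    # confident, correct output
a_bad = fill(0.1, 10)                        # undecided output

# Both cost functions penalize the undecided output more heavily.
```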

13.6 Stochastic Gradient Descent

In order to improve a neural network, we aim to find weights and biases that min-
imize the cost function. Since neural networks generally have a large number of
weights and biases, this is usually a high-dimensional optimization problem.
To minimize the cost function, we use a version of gradient descent called
stochastic gradient descent. Other optimization methods can of course be used
depending on the size of the optimization problem (see Chap. 11 and Chap. 12).

How does gradient descent work? Gradient descent was already discussed in
Chap. 12, but we recapitulate the main idea here using the current notation. To
simplify notation, we collect all parameters of the network, i.e., all weights and
biases, in the vector 𝐩. Hence the gradient of the cost function 𝐶 is
∇𝐶 = (𝜕𝐶∕𝜕𝑝1 , … , 𝜕𝐶∕𝜕𝑝𝑛 )ᵀ .
The directional derivative (𝜕𝐶∕𝜕𝐞)(𝐩) is the derivative of 𝐶 at 𝐩 in the direction
of the unit vector 𝐞 and can be written as

(𝜕𝐶∕𝜕𝐞)(𝐩) = ∇𝐶(𝐩) ⋅ 𝐞

using the gradient of 𝐶 at 𝐩.
Starting at the point 𝐩, we would like to take a (small) step in the direction that
minimizes 𝐶(𝐩) the most. How can we find a direction 𝐞 in which the function 𝐶
changes the most? The Cauchy–Bunyakovsky–Schwarz inequality, Theorem 8.1,
implies the inequality
|(𝜕𝐶∕𝜕𝐞)(𝐩)| = |∇𝐶(𝐩) ⋅ 𝐞| ≤ ‖∇𝐶(𝐩)‖,
since ‖𝐞‖ = 1; equality in the Cauchy–Bunyakovsky–Schwarz inequality holds if
and only if one vector is a multiple of the other. The inequality therefore means
that the multiples of the gradient are the directions in which the directional
derivative changes the most.
If the direction is 𝐞 = ∇𝐶(𝐩), then the directional derivative is

(𝜕𝐶∕𝜕𝐞)(𝐩) = ∇𝐶(𝐩) ⋅ ∇𝐶(𝐩) = ‖∇𝐶(𝐩)‖² ≥ 0,
and taking a step in this direction increases 𝐶; this is called gradient ascent. On
the other hand, if the direction is 𝐞 = −∇𝐶(𝐩), then the directional derivative is

(𝜕𝐶∕𝜕𝐞)(𝐩) = −∇𝐶(𝐩) ⋅ ∇𝐶(𝐩) = −‖∇𝐶(𝐩)‖² ≤ 0,
and taking a step in this direction decreases 𝐶; this is called gradient descent.
Since we aim to minimize the function 𝐶, we define the step Δ𝐩 to take at the
point 𝐩 as
Δ𝐩 ∶= −𝜂∇𝐶(𝐩),
where 𝜂 ∈ ℝ+ is called the learning rate. The directional derivative implies that
the change Δ𝐶 in the function value is approximately given by

Δ𝐶 ≈ ∇𝐶(𝐩) ⋅ Δ𝐩,

which yields
Δ𝐶 ≈ −𝜂∇𝐶(𝐩) ⋅ ∇𝐶(𝐩) = −𝜂‖∇𝐶(𝐩)‖2 ≤ 0
for our choice of the step Δ𝐩; the function value indeed decreases.
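The decrease can be observed on a small example. The following sketch runs gradient descent with the step Δ𝐩 = −𝜂∇𝐶(𝐩) on the test function 𝐶(𝐩) = ‖𝐩‖²∕2 (our own choice), whose gradient ∇𝐶(𝐩) = 𝐩 is known in closed form.

```julia
using LinearAlgebra

C(p) = 0.5 * norm(p)^2   # a simple cost function with minimum at the origin
gradC(p) = p             # its gradient

function descend(p, eta, steps)
    costs = Float64[]
    for _ in 1:steps
        p -= eta * gradC(p)   # the step Δp = -η∇C(p)
        push!(costs, C(p))
    end
    costs
end

costs = descend([4.0, -3.0], 0.1, 50)
```

The recorded costs decrease monotonically toward the minimum, as the inequality Δ𝐶 ≤ 0 predicts.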
Having discussed the methods of gradient ascent and descent, we now use a
variant called stochastic gradient descent to adjust the parameters of the neural
network. Stochastic gradient descent is the most common basic method to min-
imize the cost function. It just means that the training data are split randomly
into batches and that gradient descent is performed for each batch. A reason for
doing so is that in practice the number of training items is very large so that
working with batches is more manageable. Reasonably large batches are also
usually already sufficient to obtain a good approximation of the gradient, and
the stochastic nature of stochastic gradient descent helps escape from local min-
ima. Furthermore, the gradients can be computed in parallel for all batches.
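The random splitting into batches can be sketched with randperm; this is the same indexing pattern as in the implementation discussed next, with the gradient step itself left out.

```julia
using Random

Random.seed!(0)

n_items, batch_size = 10, 3
perm = randperm(n_items)                       # random order of the items

# Indices of the batches; the last batch may be smaller.
batches = [perm[k:min(k + batch_size - 1, end)]
           for k in 1:batch_size:n_items]
```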
Stochastic gradient descent is implemented by the function SGD. The number
of steps in stochastic gradient descent is commonly called the number of epochs.
The meaning of the parameter 𝜆 will be discussed in Sect. 13.9.1; we suppose
that it vanishes for now. The function can also monitor the cost function and the
accuracy (i.e., how many items are classified correctly) during the epochs. It can
do so using the training data that must be supplied, but it can also use optional
validation data. The reasons for this additional capability will be discussed in
Sect. 13.8.
function SGD(nn::Network,
             training_data_x::Vector{Vector{Float64}},
             training_data_y::Vector{Vector{Float64}},
             epochs::Int, batch_size::Int, eta::Float64,
             lambda::Float64 = 0.0;
             validation_data_x::Vector{Vector{Float64}} = [],
             validation_data_y::Union{Vector{Int64},
                                      Vector{Vector{Float64}}} = [],
             monitor_training_cost = true,
             monitor_validation_cost = true,
             monitor_training_accuracy = true,
             monitor_validation_accuracy = true)
    nn.training_cost = []
    nn.validation_cost = []
    nn.training_accuracy = []
    nn.validation_accuracy = []

    for epoch in 1:epochs
        local perm = Random.randperm(length(training_data_x))

        for k in 1:batch_size:length(training_data_x)
            update!(nn,
                    training_data_x[perm[k:min(k+batch_size-1, end)]],
                    training_data_y[perm[k:min(k+batch_size-1, end)]],
                    eta, lambda, length(training_data_x))
        end

        @info @Printf.sprintf("epoch %d done", epoch)

        if monitor_training_cost
            push!(nn.training_cost,
                  total_cost(nn, training_data_x, training_data_y,
                             lambda))
            @info @Printf.sprintf("cost on training data: %f",
                                  nn.training_cost[end])
        end

        if monitor_validation_cost
            push!(nn.validation_cost,
                  total_cost(nn, validation_data_x,
                             validation_data_y, lambda))
            @info @Printf.sprintf("cost on validation data: %f",
                                  nn.validation_cost[end])
        end

        if monitor_training_accuracy
            local a = accuracy(nn, training_data_x, training_data_y)
            local l = length(training_data_x)
            local r = a/l
            @info @Printf.sprintf(
                "accuracy on training data: %5d / %5d = %5.1f%% correct",
                a, l, 100*r)
            push!(nn.training_accuracy, r)
        end

        if monitor_validation_accuracy
            a = accuracy(nn, validation_data_x, validation_data_y)
            l = length(validation_data_x)
            r = a/l
            @info @Printf.sprintf(
                "accuracy on validation data: %5d / %5d = %5.1f%% correct",
                a, l, 100*r)
            push!(nn.validation_accuracy, r)
        end
    end

    nn
end

The next four functions calculate the cost and the accuracy of a neural net-
work. Each function has two methods, one for labels that are integers and one
for vectorized labels.
function total_cost(nn::Network,
                    data_x::Vector{Vector{Float64}},
                    data_y::Vector{Int64}, lambda::Float64)::Float64
    total_cost(nn, data_x, vectorize.(data_y), lambda)
end

function total_cost(nn::Network,
                    data_x::Vector{Vector{Float64}},
                    data_y::Vector{Vector{Float64}},
                    lambda::Float64)::Float64
    sum(map((x, y) -> nn.cost.f(feed_forward(nn, x), y),
            data_x, data_y)) / length(data_x) +
        0.5 * lambda * sum(LinearAlgebra.norm(w)^2 for w in
                           nn.weights) / length(data_x)
end

function accuracy(nn::Network,
                  data_x::Vector{Vector{Float64}},
                  data_y::Vector{Int64})::Integer
    count(map((x, y) -> y == argmax(feed_forward(nn, x)) - 1,
              data_x, data_y))
end

function accuracy(nn::Network,
                  data_x::Vector{Vector{Float64}},
                  data_y::Vector{Vector{Float64}})::Integer
    accuracy(nn, data_x, map(y -> argmax(y) - 1, data_y))
end

The function update! performs a gradient-descent step for each batch of
training items: the gradient of the cost with respect to the weights and biases,
multiplied by the learning rate 𝜂, is subtracted from the weights and biases of
the network. Again, we suppose that 𝜆 = 0 for now. The gradients are calculated
by the function propagate_back, which will be discussed in the next section.
function update!(nn::Network,
                 batch_x::Vector{Vector{Float64}},
                 batch_y::Vector{Vector{Float64}},
                 eta::Float64, lambda::Float64, n::Int)::Network
    local grad_W = [fill(0.0, size(W)) for W in nn.weights]
    local grad_b = [fill(0.0, size(b)) for b in nn.biases]

    for (x, y) in zip(batch_x, batch_y)
        (delta_grad_W, delta_grad_b) = propagate_back(nn, x, y)
        grad_W += delta_grad_W
        grad_b += delta_grad_b
    end

    nn.weights =
        (1-eta*lambda/n)*nn.weights - (eta/length(batch_x))*grad_W
    nn.biases -= (eta/length(batch_x)) * grad_b

    nn
end

13.7 Backpropagation

In this section, we calculate the gradient of a neural network. The function
propagate_back, which yields the gradient, is thus the final missing piece be-
fore we can train our neural network. Since a neural network is the composition
of its layers, we use the chain rule to calculate the derivatives.
We start by fixing the notation. We denote the activation of the 𝑙-th layer by

𝐚^(l) := σ(W^(l) 𝐚^(l−1) + 𝐛^(l)),

where the function 𝜎 is applied elementwise to its vector argument, and we de-
note the weighted input to the neurons in layer 𝑙 by

𝐳^(l) := W^(l) 𝐚^(l−1) + 𝐛^(l).

Of course, the equation 𝐚^(l) = σ(𝐳^(l)) holds.


The backpropagation algorithm calculates all the partial derivatives

∂C/∂w_ij^(l)   and   ∂C/∂b_i^(l)

of the cost function 𝐶 with respect to all elements of the weight matrices 𝑊 (𝑙)
and with respect to all elements of the bias vectors 𝐛(𝑙) .
We make two assumptions on the cost function 𝐶, which are usually satisfied,
namely that it can be written as an average

C(W, 𝐛) = (1/|T|) ∑_{𝐱∈T} K(W, 𝐛, 𝐱)

over all training samples 𝐱 in the training set 𝑇 and that it is a function of the
activation of the output layer of the neural network only, i.e., 𝐶 = 𝐶(𝐚 (𝑙) ). Both
cost functions in Sect. 13.5, the quadratic cost function and cross-entropy cost
function, satisfy these two assumptions.
To help apply the chain rule, we denote the partial derivatives of the cost func-
tion 𝐶 with respect to the weighted input z_i^(l) to neuron 𝑖 in layer 𝑙 by

δ_i^(l) := ∂C/∂z_i^(l).   (13.2)

This value is often called the error of neuron 𝑖 in layer 𝑙.


We start with the output layer 𝐿. Applying the chain rule to the definition
(13.2) of the error yields

δ_i^(L) = ∑_k (∂C/∂a_k^(L)) (∂a_k^(L)/∂z_i^(L))   ∀𝑖,

where the sum is over all neurons 𝑘 in the output layer 𝐿. Since the activation
function 𝜎 is applied elementwise, the partial derivative ∂a_k^(L)/∂z_i^(L) is
nonzero only if 𝑘 = 𝑖. Therefore we have

δ_i^(L) = (∂C/∂a_i^(L)) σ′(z_i^(L))   ∀𝑖

or

𝛅^(L) = ∂C/∂𝐚^(L) ⊙ σ′(𝐳^(L)),   (13.3)
a formula for the error in the output layer, where ⊙ denotes elementwise multi-
plication and 𝜎′ is applied elementwise to its vector argument.
Second, we derive an equation for the error 𝛅^(l) in terms of the error 𝛅^(l+1).
The chain rule yields

δ_i^(l) = ∂C/∂z_i^(l) = ∑_k (∂C/∂z_k^(l+1)) (∂z_k^(l+1)/∂z_i^(l))
        = ∑_k (∂z_k^(l+1)/∂z_i^(l)) δ_k^(l+1).

Since

𝐳^(l+1) = W^(l+1) 𝐚^(l) + 𝐛^(l+1) = W^(l+1) σ(𝐳^(l)) + 𝐛^(l+1)

and hence

z_k^(l+1) = ∑_i w_ki^(l+1) σ(z_i^(l)) + b_k^(l+1)

hold by definition, we find


∂z_k^(l+1)/∂z_i^(l) = w_ki^(l+1) σ′(z_i^(l))

and therefore

δ_i^(l) = ∑_k w_ki^(l+1) σ′(z_i^(l)) δ_k^(l+1)

or

𝛅^(l) = ((W^(l+1))^⊤ 𝛅^(l+1)) ⊙ σ′(𝐳^(l)).   (13.4)
Third, we can find the partial derivatives of the cost function 𝐶 with respect
to all weights and biases using the errors 𝛅^(l). The chain rule yields

∂C/∂w_ij^(l) = ∑_k (∂C/∂z_k^(l)) (∂z_k^(l)/∂w_ij^(l)),
∂C/∂b_i^(l) = ∑_k (∂C/∂z_k^(l)) (∂z_k^(l)/∂b_i^(l)).

Furthermore, the equations ∂C/∂z_k^(l) = δ_k^(l) and

z_k^(l) = ∑_j w_kj^(l) a_j^(l−1) + b_k^(l)

hold by definition. Differentiating the last equation yields

∂z_k^(l)/∂w_ij^(l) = a_j^(l−1) if 𝑖 = 𝑘 and 0 if 𝑖 ≠ 𝑘,
∂z_k^(l)/∂b_i^(l) = 1 if 𝑖 = 𝑘 and 0 if 𝑖 ≠ 𝑘.

In summary, we have shown that

∂C/∂w_ij^(l) = δ_i^(l) a_j^(l−1),   (13.5a)
∂C/∂b_i^(l) = δ_i^(l).   (13.5b)

The four equations in (13.3), (13.4), and (13.5) constitute the four fundamental
equations of backpropagation. All the partial derivatives of the cost function 𝐶
are calculated via the errors 𝛅^(l) as well as the weighted inputs 𝐳^(l) and the
activations 𝐚^(l). The weighted inputs and the activations are found by simply
feeding the network forward. Then we calculate the errors 𝛅^(l) by starting
with the error 𝛅^(L) of the last layer 𝐿 given by (13.3) and working recursively
towards lower layers by using (13.4) in order to obtain 𝛅^(l) from 𝛅^(l+1). Finally,
(13.5) yields the partial derivatives ∂C/∂w_ij^(l) and ∂C/∂b_i^(l), since the errors
and activations are now known.
The backpropagation algorithm is very efficient, since it comprises only two
passes over the layers to calculate all partial derivatives. As such it is approx-
imately twice as costly as feeding the network forward in order to evaluate
it. In the forward pass, the weighted inputs and activations are calculated, while
in the backward pass the errors and the partial derivatives are calculated.
Backpropagation is implemented by the function propagate_back. First, the
two variables grad_W and grad_b for the partial derivatives as well as the two
variables z and a for the weighted inputs and activations are allocated. Note that
a[l] is equal to 𝐚^(l−1) such that a[1] records the input to the neural network.
The first loop is the forward pass, where all weighted inputs and activations are
calculated. Before the second loop, the variable delta is initialized as 𝛅^(L) given
by (13.3). Since the expression depends on the cost function, it is calculated by
the function stored in the delta field of the type Cost. The results grad_W[end]
and grad_b[end] are given by (13.5) for 𝑙 = 𝐿.
Then, in the second loop, the variable delta is updated recursively using
(13.4) and the results grad_W[l] and grad_b[l] are calculated by (13.5). Again
note that a[l] is equal to 𝐚^(l−1).
function propagate_back(nn::Network, x::Vector{Float64},
                        y::Vector{Float64})::Tuple
    local grad_W = [fill(0.0, size(W)) for W in nn.weights]
    local grad_b = [fill(0.0, size(b)) for b in nn.biases]

    local z = Vector(undef, nn.n_layers-1)
    local a = Vector(undef, nn.n_layers)

    a[1] = x
    for (i, (W, b)) in enumerate(zip(nn.weights, nn.biases))
        z[i] = W * a[i] + b
        a[i+1] = nn.activation.f.(z[i])
    end

    local delta = nn.cost.delta(z[end], a[end], y)

    grad_W[end] = delta * a[end-1]'
    grad_b[end] = delta

    for l in nn.n_layers-2:-1:1
        delta = (nn.weights[l+1]' * delta) .* nn.activation.d.(z[l])
        grad_W[l] = delta * a[l]'
        grad_b[l] = delta
    end

    (grad_W, grad_b)
end

At this point, the code of the module that implements our neural network is
complete.
end # module

Next, we run an example. Whenever a program uses random numbers, it can
be useful while developing and debugging to set the seed of the random-number
generator using Random.seed!. This ensures that the same sequence of random
numbers is generated every time.
import Random
Random.seed!(0)
NN.SGD(NN.Network([NN.MNIST_n_rows * NN.MNIST_n_cols, 30, 10]),
       NN.training_data_x, NN.training_data_y,
       100, 10, 3.0,
       validation_data_x = NN.test_data_x,
       validation_data_y = NN.test_data_y)

We use the package name NN as a prefix to access the symbols in this package.
After a few minutes, we observe that the algorithm classifies more than 95% of
the digits in the validation data correctly, while the accuracy in the training data
is higher. The cost function is also smaller on the training data than on the vali-
dation data. Typical accuracies and costs for one hundred iterations of training
are shown in Figures 13.3 and 13.4. We will discuss these curves thoroughly in
the next section.

13.8 Hyperparameters and Overfitting

Training algorithms contain various parameters such as the learning rate 𝜂 and
the parameter 𝜆 here. The parameters that pertain to the training algorithm are
called hyperparameters in order to distinguish them from the parameters of the
neural network itself such as its number of layers, its weights, and its biases. Thus
the question naturally arises how these parameters should be chosen.
Unfortunately, there is no general answer to this question. Much depends
on the specific problem and the task the neural network shall solve. Whenever
there are unknown (hyper-)parameters, the idea of optimizing these parameters
suggests itself (see Chap. 11 and Chap. 12 for inspirations for optimization algo-
rithms to be used). In addition to optimizing continuous parameters such as the

Fig. 13.3 Training and validation accuracies as functions of the iteration number. The training
accuracies are generally larger than the validation accuracies.

learning rate, there are also discrete optimization problems. These concern dis-
crete parameters such as the numbers of layers of the neural network and their
sizes, the activation functions used in each layer, and even the type of the layer
(e.g., in convolutional neural networks).
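A one-dimensional search over a single continuous hyperparameter can be sketched as follows. This helper is not part of the chapter's NN module: `train_and_validate` is a hypothetical stand-in for training the network with a given learning rate and returning the validation accuracy, and the logarithmic grid is an illustrative choice.

```julia
# Sketch (hypothetical helper): a one-dimensional search over the learning
# rate on a logarithmic grid, keeping the value with the best validation
# accuracy.
function best_learning_rate(train_and_validate, etas::Vector{Float64})
    local accs = [train_and_validate(eta) for eta in etas]
    etas[argmax(accs)]
end

# Toy stand-in for the real training run: accuracy peaks near eta = 1.
toy(eta) = 1 - log10(eta)^2 / 10
println(best_learning_rate(toy, [0.01, 0.1, 1.0, 10.0]))   # 1.0
```

In practice `train_and_validate` would call NN.SGD and evaluate the accuracy on the validation dataset; a logarithmic grid reflects that learning rates are best compared across orders of magnitude.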
The hyperparameters are found either manually or automatically by an op-
timization algorithm using the validation dataset. The validation dataset is also
useful in another regard. When training the network using stochastic gradient
descent as in the example at the end of the previous section, the training
accuracy is generally higher than the accuracy observed on a validation dataset
(see Fig. 13.3). The same effect is observed in the values of the cost function: the
cost function is smaller on the training set than on the validation dataset (see
Fig. 13.4).
These observations are easily explained: the training data are used in gradient
descent and therefore the cost function decreases and the accuracy increases
on the training dataset, in this case up to the 100-th iteration. The situation is
different on the validation dataset. There the accuracy stops to increase after
about fifteen iterations, and the cost stops to decrease after the same number
of iterations. This means that any improvement on the training data after this
number of iterations is highly unlikely to translate into any improvement on
new data. This effect is called overfitting.

Fig. 13.4 Training and validation costs as functions of the iteration number. The training costs
are generally smaller than the validation costs.

After this point, training fits the neural network only to idiosyncrasies and
noise in the training data, which is not only a waste of computational resources,
but it is even more importantly also detrimental for generalization. Generaliza-
tion is the ultimate goal in machine learning; it means that the essence hidden in
the data is learned and that the idiosyncrasies of the training data are discarded.
Since neural networks contain many parameters while the amount of avail-
able data is always limited, they are a very versatile tool, but at the same time the
parameters are comparatively ill specified. Hence neural networks are prone to
overfitting, as the many parameters are usually easily fitted to the training data.
Therefore overfitting requires careful consideration.
Early stopping is a simple strategy to overcome overfitting and means that
training is stopped as soon as the accuracy stops increasing on the validation
dataset. Unfortunately, it is not obvious when to stop, since stochastic gradient
descent is stochastic by its nature and because the accuracy of a neural network
may reach a plateau while training and then improve again.
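Because of such plateaus, early stopping is usually implemented with a "patience" window. The following helper is a sketch (not part of the chapter's module, and its name is hypothetical): it inspects the recorded validation accuracies and reports whether no improvement has occurred in the last `patience` epochs.

```julia
# A minimal sketch of early stopping with a patience window.
function stop_early(validation_accuracy::Vector{Float64},
                    patience::Int)::Bool
    length(validation_accuracy) <= patience && return false
    best_so_far = maximum(validation_accuracy[1:end-patience])
    # Stop if none of the last `patience` epochs improved on the best
    # accuracy seen before them.
    all(a -> a <= best_so_far, validation_accuracy[end-patience+1:end])
end

accs = [0.80, 0.90, 0.93, 0.94, 0.94, 0.93, 0.94]
println(stop_early(accs, 3))   # plateau in the last three epochs
println(stop_early(accs, 6))   # accuracy still improved recently
```

A larger patience makes the criterion more robust against the random fluctuations of stochastic gradient descent, at the price of some wasted epochs.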
Clearly, overfitting is reduced (and generalization is improved) by using more
training data, but the amount of training data is often given and cannot be
changed easily (but see Problem 13.12). We can also reduce the size of the neural
network to reduce the number of parameters to be determined. This is generally
a good approach, but care must be taken not to reduce it too much until it cannot
learn the essence hidden in the data anymore.
After the hyperparameters have been chosen and training has been estab-
lished while avoiding overfitting, the success of the whole procedure must be
assessed on a third dataset never used before. This third dataset is called the test
dataset, and it is the arbiter for the accuracy that has been achieved.
In summary, it is prudent to split all the available data into three sets, namely
the training, the validation, and the test datasets, with the following purposes.
• The training dataset is used for minimizing the cost function.
• The validation dataset is used for finding hyperparameters and for prevent-
ing overfitting (e.g. using early stopping).
• The test dataset is used for assessing the success of the whole training pro-
cedure using untouched data.
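The three-way split above can be sketched in a few lines. This helper is not from the book's NN module; the 80/10/10 ratios are illustrative choices, and the random permutation ensures that the three sets are disjoint samples of the available data.

```julia
import Random

# Sketch: split available data into training, validation, and test sets.
function split_data(data_x::Vector, data_y::Vector)
    local n = length(data_x)
    local perm = Random.randperm(n)
    local n_train = floor(Int, 0.8*n)
    local n_val = floor(Int, 0.1*n)
    local train = perm[1:n_train]
    local val = perm[n_train+1:n_train+n_val]
    local test = perm[n_train+n_val+1:end]
    ((data_x[train], data_y[train]),
     (data_x[val], data_y[val]),
     (data_x[test], data_y[test]))
end

(train, val, test) = split_data(collect(1.0:100.0), collect(1:100))
println(length(train[1]), " ", length(val[1]), " ", length(test[1]))  # 80 10 10
```

For MNIST the split is already given by the dataset; the sketch matters when all that is available is one undivided pool of labeled data.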

13.9 Improving Training

In this section, we discuss two methods for improving the training of neural
networks. The first is regularization, which we have implemented already, but
which still needs to be explained. The second is the choice of cost function and
how it affects training.

13.9.1 Regularization

Regularization is a method for reducing overfitting. The idea is to add a con-


straint to the parameters in order to keep them simple in a certain sense. In
weight-decay regularization, parameters that are large carry a penalty in order
to keep them simple, i.e., small. This has the added benefit of avoiding run-away
parameters. To implement the penalty, a regularization term, which is often a
multiple of a norm of the weights, is added to the cost function. If 𝐶0 is the
original cost function (see Sect. 13.5) and the vector 𝐰 contains all weights
w_ij^(l), then the ℓ2-regularized cost function is

C_ℓ2(W, 𝐛, λ) := C_0(W, 𝐛) + (λ/(2|T|)) ‖𝐰‖_2^2
              = C_0(W, 𝐛) + (λ/(2|T|)) ∑_k w_k^2
              = C_0(W, 𝐛) + (λ/(2|T|)) ∑_{l,i,j} (w_ij^(l))^2,

where λ ∈ ℝ_0^+ is called the regularization parameter. The factor 1/|T| occurs
since it is also part of 𝐶0 . The biases are not affected by regularization as ex-
plained below.
Of course, any other norm of the weights such as the ℓp-norms can be used,
resulting in the more general definition

C_ℓp(W, 𝐛, λ) := C_0(W, 𝐛) + (λ/(p|T|)) ‖𝐰‖_p^p = C_0(W, 𝐛) + (λ/(p|T|)) ∑_k |w_k|^p.

We first discuss how this modification of the cost function affects learning. It
means that smaller weights are preferred all other things being equal. The size of
the regularization parameter 𝜆 determines the relative importance of minimiz-
ing the original cost function 𝐶0 and minimizing the weights.
Why does regularization reduce overfitting? Regularized neural networks
tend to contain smaller weights, which implies that the output of the network
does not change much when small perturbations are added to the input in con-
trast to unregularized neural networks with larger weights. In other words, it is
more difficult for regularized neural networks to learn randomness in the train-
ing data; the smaller weights must learn the features present in the training data.
In other words, the larger weights of an unregularized network can adjust better
to noise and thus facilitate overfitting.
This explanation also justifies why the biases are not included in regulariza-
tion: large biases do not affect the sensitivity of the neural network to perturba-
tions or noise.
Regularization as a modification of the cost function is implemented in a
straightforward manner in backpropagation. Because of the derivatives

∂C_ℓ2/∂𝐰 = ∂C_0/∂𝐰 + (λ/|T|) 𝐰,
∂C_ℓ2/∂𝐛 = ∂C_0/∂𝐛,

the steps in gradient descent become

Δ𝐰 := −η ∂C_0/∂𝐰 − (ηλ/|T|) 𝐰,
Δ𝐛 := −η ∂C_0/∂𝐛.
After adding the step Δ𝐰 to 𝐰, the new weight is

(1 − ηλ/|T|) 𝐰 − η ∂C_0/∂𝐰,

which is calculated at the end of the function update!. This concludes the expla-
nation of the implementation.

Fig. 13.5 Training and validation accuracies as functions of the iteration number with and
without regularization.

Finally, we consider a numerical example. It is the same example as above,


but now we use regularization with 𝜆 = 1.
NN.SGD(NN.Network([NN.MNIST_n_rows * NN.MNIST_n_cols, 30, 10]),
       NN.training_data_x, NN.training_data_y,
       100, 10, 3.0, 1.0,
       validation_data_x = NN.test_data_x,
       validation_data_y = NN.test_data_y)

Figures 13.5 and 13.6 show the accuracies and costs evaluated on training
and validation data with and without regularization. Fig. 13.5 shows that contin-
ued overfitting to the training data is much reduced by regularization and that
performance on the validation data has been improved as well. Fig. 13.6 again
shows that continued overfitting to the training data has been reduced. It also
shows that the cost does not increase on the validation data as training continues.
Therefore even this first choice of hyperparameters has beneficial effects.

Fig. 13.6 Training and validation costs as functions of the iteration number with and without
regularization.

13.9.2 Cost Functions

The choice of the cost function can strongly affect how fast a neural network
learns. We discuss properties of the two cost functions in Sect. 13.5 that can now
be understood in terms of the properties of the activation functions and their
role in the backpropagation algorithm (see Sect. 13.7).
An effect due to the choice of the activation function is learning slowdown.
As seen from (13.4), the errors 𝛅(𝑙) and therefore the gradients used while mini-
mizing the cost function using stochastic gradient descent become small when
𝜎′ (𝐳(𝑙) ) becomes small. Such a factor 𝜎′ (𝐳(𝑙) ) occurs in every recursive use of
(13.4) in backpropagation, which implies that deep neural networks become
harder to train when the factors 𝜎′ (𝐳(𝑙) ) become small. Therefore the leaky rec-
tifier 𝜎2 is advantageous compared to the hyperbolic tangent 𝜎5 , especially in
deep neural networks.
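The size of the factor 𝜎′(𝐳^(l)) can be checked numerically. The formulas below are the standard ones for the logistic function, the leaky rectifier, and the hyperbolic tangent; the chapter's numbering 𝜎1, 𝜎2, 𝜎5 and the leaky slope 0.01 are assumed here.

```julia
# Sketch: why saturating activations slow learning. For a large weighted
# input z, the derivatives of the logistic function and of tanh are nearly
# zero, while the leaky rectifier keeps a constant slope.
sigma1_prime(z) = exp(-z) / (1 + exp(-z))^2          # logistic
sigma2_prime(z) = z >= 0 ? 1.0 : 0.01                # leaky rectifier
sigma5_prime(z) = 1 - tanh(z)^2                      # hyperbolic tangent

for z in (0.0, 5.0)
    println(sigma1_prime(z), " ", sigma2_prime(z), " ", sigma5_prime(z))
end
# At z = 5 the logistic and tanh derivatives fall below 0.01, so the factor
# sigma'(z) in (13.4) shrinks the backpropagated error in every layer.
```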
The cost function can also be a cause of learning slowdown, again due to the
factor σ′(𝐳^(L)) in the error 𝛅^(L) in (13.3). Thus this interaction between the cost
function and the activation function in the output layer may be detrimental to
learning.
We can remedy the situation by choosing the cost function 𝐶 such that this fac-
tor disappears. Considering the derivatives with respect to the biases, we would
hence like to achieve that

∂C/∂b = a − y.
To simplify notation, we drop the sum over all training items in the cost func-
tion 𝐶 for now. We also denote an element b_i^(L) of the bias vector 𝐛^(L) in the
output layer 𝐿 by just 𝑏, the element a_i^(L) by just 𝑎, and the element 𝑦𝑖 by just 𝑦.
The chain rule yields

∂C/∂b = (∂C/∂a) σ′(z).
For the activation function 𝜎1, we have σ1′(z) = σ1(z)(1 − σ1(z)) = a(1 − a).
Therefore we have

a − y = ∂C/∂b = (∂C/∂a) a(1 − a),

which is the ordinary differential equation

∂C/∂a = (a − y)/(a(1 − a)) = −y/a + (1 − y)/(1 − a)

for 𝜕𝐶∕𝜕𝑎. The last equation follows from a partial-fraction decomposition. In-
tegration yields

𝐶 = −𝑦 ln 𝑎 − (1 − 𝑦) ln(1 − 𝑎) + const.

for the cost function for a single training item, and hence the cross-entropy cost
function 𝐶CE defined in (13.1) for all training items.
We now check that this calculation for the partial derivatives with respect to
the biases yields the desired form for all partial derivatives in the output layer.
Starting from the cross-entropy cost function

C_CE(W, 𝐛) = −(1/|T|) ∑_{𝐱∈T} ( 𝐲(𝐱) ⋅ ln 𝐚^(L)(𝐱) + (𝟏 − 𝐲(𝐱)) ⋅ ln(𝟏 − 𝐚^(L)(𝐱)) ),

we find the error in the output layer 𝐿 by (13.3) to be

𝛅^(L) = ∂C/∂𝐚^(L) ⊙ σ′(𝐳^(L))
     = −(1/|T|) ∑_{𝐱∈T} ( 𝐲(𝐱) ⊘ 𝐚^(L)(𝐱) − (𝟏 − 𝐲(𝐱)) ⊘ (𝟏 − 𝐚^(L)(𝐱)) )
                         ⊙ 𝐚^(L)(𝐱) ⊙ (𝟏 − 𝐚^(L)(𝐱))
     = (1/|T|) ∑_{𝐱∈T} (𝐚^(L)(𝐱) − 𝐲(𝐱)),

where ⊘ denotes elementwise division of two vectors. Hence, by (13.5), the par-
tial derivatives are

∂C_CE/∂w_ij^(L) = δ_i^(L) a_j^(L−1) = (1/|T|) ∑_{𝐱∈T} (a_i^(L)(𝐱) − y_i(𝐱)) a_j^(L−1)(𝐱),
∂C_CE/∂𝐛^(L) = 𝛅^(L) = (1/|T|) ∑_{𝐱∈T} (𝐚^(L)(𝐱) − 𝐲(𝐱)).

Hence there are no factors 𝜎′ (𝐳(𝐿) ) often responsible for learning slowdown.
There is a similar interaction between the quadratic cost function 𝐶2 and so-
called linear neurons in the output layer, meaning that the activation function 𝜎
of the output layer 𝐿 is the identity function. Then 𝐚^(L) = 𝐳^(L) and 𝛅^(L) = 𝐚^(L) − 𝐲
by (13.3). Therefore (13.5) yields

∂C_2/∂w_ij^(L) = (1/|T|) ∑_{𝐱∈T} (a_i^(L)(𝐱) − y_i(𝐱)) a_j^(L−1)(𝐱),
∂C_2/∂𝐛^(L) = (1/|T|) ∑_{𝐱∈T} (𝐚^(L)(𝐱) − 𝐲(𝐱)),

showing no detrimental factor in the output layer 𝐿 for this choice of cost func-
tion and activation function.

showing no detrimental factor in the output layer 𝐿 for this choice of cost func-
tion and activation function.
These considerations imply that the cost function and the activation functions
should not be chosen independently. It is worthwhile to study their interactions
in order to arrive at efficient training algorithms.

13.10 Julia Packages

The package MLDatasets used above contains sample data sets for machine
learning. The package Flux provides software for neural networks written in
pure Julia and includes GPU support.

13.11 Bibliographical Remarks

Artificial neural networks were first implemented in the middle of the twentieth
century and have become a standard method in machine learning and artificial
intelligence. Thus there is a vast body of literature on this topic. A historic per-
spective can be found in [4], and a very accessible introduction can be found in
[3]. The hyperparameters used in the examples are from [3].

Problems

13.1 Implement and plot the five activation functions 𝜎𝑖 , 𝑖 ∈ {1, 2, 3, 4, 5},
along with their derivatives using Julia. Furthermore, find the ten limits
lim𝑥→±∞ 𝜎𝑖 (𝑥) for 𝑖 ∈ {1, 2, 3, 4, 5}.

13.2 Find suitable hyperparameters for each of the five activation functions 𝜎𝑖 ,
𝑖 ∈ {1, 2, 3, 4, 5}, such that stochastic gradient descent works well for MNIST
handwriting recognition.

13.3 Compare how well the neural network learns with and without scaling the
initial weights.

13.4 Compare the best classification performance you can achieve using each
of the five activation functions 𝜎𝑖 , 𝑖 ∈ {1, 2, 3, 4, 5}, in a neural network.

13.5 * Prove Theorem 13.1.

13.6 Implement an adaptive strategy for choosing the learning rate 𝜂. Choose
𝜇 ∈ ℝ+ , 𝜇 ≈ 1, and change the learning rate to 𝜇𝜂, 𝜂, or 𝜂∕𝜇 depending on the
improvement achieved by each of these three values.

13.7 Investigate further strategies for choosing the learning rate 𝜂 adaptively by
making it depend on the learning progress.

13.8 Parallelize the for loop over all the batches in the function SGD.

13.9 Use the validation dataset to optimize the hyperparameters of the training
algorithm. Optimize single hyperparameters one after another in one-dimens-
ional optimization problems. Which hyperparameters should be considered on
a logarithmic scale?

13.10 Use the validation dataset to optimize the hyperparameters of the train-
ing algorithm. Optimize (as many as possible of) the hyperparameters simulta-
neously in a multidimensional optimization problem. Which hyperparameters
should be considered on a logarithmic scale?

13.11 Implement early stopping.

13.12 More training data helps reduce overfitting. In certain problems, more
training data can be generated by simply perturbing the available training data
slightly. When dealing with image data, shifting and rotating the images slightly
suggests itself. Implement and evaluate this idea to reduce overfitting.

13.13 Derive the formulas for 𝓁1 -regularization and implement it (as an option).

13.14 Derive the formulas for 𝓁𝑝 -regularization and implement it (as an option).

13.15 Compare the performance of 𝓁1 - and 𝓁2 -regularization.



13.16 After using all ideas in this chapter and optimizing the hyperparameters
using the validation data, what is the best accuracy you can achieve on the test
data?

13.17 Implement a convolutional neural network and train it with the MNIST
database. Which classification error can you achieve?

13.18 Extend the code in Sect. 13.10 to use a GPU.

References

1. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Mathematics of
   Control, Signals, and Systems 2(4), 303–314 (1989)
2. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal
   approximators. Neural Networks 2(5), 359–366 (1989)
3. Nielsen, M.: Neural networks and deep learning.
   http://neuralnetworksanddeeplearning.com/
4. Nilsson, N.J.: The Quest for Artificial Intelligence. Cambridge University Press, Cambridge,
   UK (2010)
Chapter 14
Bayesian Estimation

No other formula in the alchemy of logic has exerted more astonishing powers,
For it has established the existence of God from the premiss of total ignorance;
and it has measured with numerical precision the probability
that the sun will rise to-morrow.
—John Maynard Keynes, A Treatise on Probability, Chapter VII (1921)

Abstract Frequentist and Bayesian statistics and inference differ in their fun-
damental assumptions on the nature of probabilities and models. After a short
discussion of the differences, we use the ideas of Bayesian inference to determine
model parameters. The motivation for these considerations is the fact that mod-
els usually contain parameters that are unknown and often cannot be measured
or determined directly. Thus they must be estimated by comparing the model to
data. In this chapter, the Bayesian approach to the estimation of model parame-
ters is developed, implemented, and applied to an example.

14.1 Introduction

Whenever we use a mathematical model to describe a realistic situation, it is very
likely that the model will contain at least one parameter. Dimensional analysis
offers a way to reduce the number of parameters and make equations dimension-
less, but the fact remains. Therefore the Bayesian approach to estimating model
parameters is discussed in this chapter, and one of the most popular numerical ap-
proaches, namely Markov-chain Monte Carlo, is developed and implemented. It
is assumed that the reader is familiar with the basics of probability theory, al-
though the exposition is self-contained.

© Springer Nature Switzerland AG 2022
C. Heitzinger, Algorithms with JULIA,
https://doi.org/10.1007/978-3-031-16560-3_14

14.2 The Riemann–Stieltjes Integral

Theorems in discrete and continuous probability are often stated separately us-
ing two different notions, namely the (discrete) probability 𝑃 of an event and
the (continuous) probability density 𝑓 of a random variable. In order to unify the
treatment of discrete and continuous probabilities, we use the Riemann–Stieltjes
integral as an elegant concept to cover both cases, the discrete and the continu-
ous one, simultaneously.
To define the Riemann–Stieltjes integral, we need the concept of a partition. A
partition of a set 𝐴 is a set of subsets 𝐴𝑖 ⊂ 𝐴 such that ⋃𝑖 𝐴𝑖 = 𝐴 and 𝐴𝑖 ∩ 𝐴𝑗 = ∅
for all indices 𝑖 ≠ 𝑗. The definition of the Riemann–Stieltjes integral is a gen-
eralization of the Riemann integral with the additional notion of an integrator
function. In a Riemann sum, each function value is multiplied by the subinterval
length, whereas in the Riemann–Stieltjes integral the function values are more
generally multiplied by the subintervals weighted by the integrator.

Definition 14.1 (Riemann–Stieltjes integral) The Riemann–Stieltjes integral

∫_a^b h(x) dg(x)

of a function ℎ ∶ [𝑎, 𝑏] → ℝ (the integrand) with respect to a function
𝑔 ∶ [𝑎, 𝑏] → ℝ (the integrator) is the limit

lim_{|P|→0} S(P, h, g)

of the Riemann–Stieltjes sum

S(P, h, g) := ∑_{i=0}^{n−1} h(ξ_i) (g(x_{i+1}) − g(x_i)).

Here

P := {[x_0 := a, x_1), [x_1, x_2), …, [x_{n−1}, x_n := b]}

is a partition of the interval [𝑎, 𝑏], where 𝑥𝑖 < 𝑥𝑖+1 holds for all indices 𝑖 ∈
{0, … , 𝑛 − 1}; the fineness

|P| := max{x_{i+1} − x_i ∣ i ∈ {0, …, n−1}}

of a partition 𝑃 is the length of the longest of its subintervals; and the points 𝜉𝑖
are chosen as 𝜉𝑖 ∈ [𝑥𝑖 , 𝑥𝑖+1 ) for all indices 𝑖 ∈ {0, … , 𝑛 − 1}.
If the limit above exists, then ℎ is called Riemann–Stieltjes integrable with
respect to 𝑔 on the interval [𝑎, 𝑏].
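A small numerical check of Definition 14.1 can be sketched as follows (this is an illustration, not part of the book's code): approximate ∫_0^1 h(x) dg(x) with h(x) = x and g(x) = x² by a Riemann–Stieltjes sum with ξ_i = x_i, and compare with ∫_0^1 h(x) g′(x) dx = ∫_0^1 2x² dx = 2/3, the value given by Theorem 14.2 below.

```julia
# Riemann-Stieltjes sum over a uniform partition with left sample points.
function stieltjes_sum(h, g, a, b, n)
    local xs = range(a, b, length = n + 1)
    sum(h(xs[i]) * (g(xs[i+1]) - g(xs[i])) for i in 1:n)
end

println(stieltjes_sum(x -> x, x -> x^2, 0.0, 1.0, 10_000))  # close to 2/3
```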

If the integrator is smooth enough, then the Riemann–Stieltjes integral
simplifies to the Riemann integral as the next theorem shows. Of course, if the
integrator is the identity, then the Riemann–Stieltjes integral reduces to the
Riemann integral.

Theorem 14.2 (Riemann–Stieltjes and Riemann integrals) Suppose that
the function ℎ ∶ [𝑎, 𝑏] → ℝ is continuous and that the function 𝑔 ∶ [𝑎, 𝑏] → ℝ
is continuously differentiable. Then

∫_a^b h(x) dg(x) = ∫_a^b h(x) g′(x) dx

holds.

The usefulness in probability theory becomes apparent in the following theo-
rem.

Theorem 14.3 (reduction of Riemann–Stieltjes integral to a finite sum)
Suppose that the function ℎ ∶ [𝑎, 𝑏] → ℝ is piecewise continuous and that the
function 𝑔 ∶ [𝑎, 𝑏] → ℝ is a step function with the jumps

𝑔𝑖 ∶= 𝑔(𝑥𝑖 +) − 𝑔(𝑥𝑖 −)

at the points 𝑥𝑖 ∈ [𝑎, 𝑏], 𝑖 ∈ {1, … , 𝑛}, of discontinuity, where 𝑔(𝑎−) ∶= 𝑔(𝑎)
and 𝑔(𝑏+) ∶= 𝑔(𝑏). Suppose further that at all points 𝑥𝑖 not both ℎ and 𝑔 are
discontinuous from the left and not both ℎ and 𝑔 are discontinuous from the right.
Then the function ℎ is Riemann–Stieltjes integrable with respect to 𝑔 on the interval
[𝑎, 𝑏], and the integral has the value
∫_a^b h(x) dg(x) = ∑_{i=1}^{n} h(x_i) g_i.

To apply the Riemann–Stieltjes integral to probability theory, we view the
integrator 𝑔 as the cumulative probability distribution function

𝐹𝑋 (𝑥) ∶= 𝑃(𝑋 ≤ 𝑥)

of a random variable 𝑋. The derivative

𝑓𝑋 ∶= 𝐹𝑋′

is its probability density. Then discrete random variables become special cases of
continuous random variables: discrete random variables are just continuous ran-
dom variables with piecewise constant cumulative probability distributions 𝐹𝑋 ,
which are the integrators, implying that the probability densities 𝑓𝑋 are sums of
delta distributions.

Definition 14.4 (expectance, expected value) The expectance or expected
value of a function ℎ of a random variable 𝑋 is defined as

𝔼[ℎ(𝑋)] ∶= ∫_{−∞}^{∞} ℎ(𝑥) d𝐹𝑋(𝑥).

If the random variable 𝑋 is continuous, then Theorem 14.2 yields the usual
integral

𝔼[ℎ(𝑋)] = ∫_{−∞}^{∞} ℎ(𝑥) d𝐹𝑋(𝑥) = ∫_{−∞}^{∞} ℎ(𝑥) 𝑓𝑋(𝑥) d𝑥
of the expected value of a continuous random variable.
If the random variable 𝑋 is discrete, then 𝐹𝑋 is a step function with the points
𝑥𝑖 ∈ [𝑎, 𝑏], 𝑖 ∈ {1, … , 𝑛}, of discontinuity, and the jumps 𝑃(𝑋 = 𝑥𝑖 ) = 𝐹𝑋 (𝑥𝑖 +) −
𝐹𝑋 (𝑥𝑖 −). The expected value 𝔼[ℎ(𝑋)] simplifies to the sum

𝔼[ℎ(𝑋)] = ∫_{−∞}^{∞} ℎ(𝑥) d𝐹𝑋(𝑥) = ∑_{i=1}^{n} ℎ(𝑥𝑖) 𝑃(𝑋 = 𝑥𝑖)

by Theorem 14.3, which is the usual definition of the expected value of discrete
random variables. Furthermore, using delta distributions, we can write the prob-
ability density 𝑓𝑋 , i.e., the derivative of the step function 𝐹𝑋 , as

𝑓𝑋(𝑥) = ∑_{i=1}^{n} 𝑃(𝑋 = 𝑥𝑖) 𝛿(𝑥 − 𝑥𝑖).
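As an illustration of the finite sum above (a sketch, not part of the book's code), the expected value of h(X) = X² for a fair die reduces to a sum over the six jumps of F_X:

```julia
# Discrete expectation via the jumps of the step function F_X.
xs = collect(1.0:6.0)          # points of discontinuity of F_X
ps = fill(1/6, 6)              # jumps P(X = x_i)
h(x) = x^2
E = sum(h(x) * p for (x, p) in zip(xs, ps))
println(E)   # 91/6 ≈ 15.1667
```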

14.3 Bayes’ Theorem

We start with a basic definition.

Definition 14.5 (conditional probability) The conditional probability density
function of 𝑌 given the occurrence of the value 𝑥 of the random variable 𝑋 is

𝑓𝑌(𝑦 ∣ 𝑋 = 𝑥) ∶= 𝑓𝑋,𝑌(𝑥, 𝑦) / 𝑓𝑋(𝑥),

assuming that 𝑓𝑋 (𝑥) > 0, where 𝑓𝑋,𝑌 (𝑥, 𝑦) is the joint density of the random
variables 𝑋 and 𝑌 and 𝑓𝑋 (𝑥) is the marginal density of 𝑋.

In the discrete case, the conditional probability 𝑃(𝐴|𝐵) of the event 𝐴 occur-
ring given that 𝐵 with 𝑃(𝐵) > 0 occurs is usually written as

𝑃(𝐴|𝐵) = 𝑃(𝐴 ∩ 𝐵) ∕ 𝑃(𝐵)


Fig. 14.1 Here two events 𝐴 and 𝐵 are illustrated as part of a probability space Ω. The condi-
tional probability 𝑃(𝐴|𝐵) corresponds to the ratio of the areas of 𝐴 ∩ 𝐵 and 𝐵.

(see Fig. 14.1).


The significance of Bayes’ theorem is that it makes it possible to calculate
𝑃(𝐴|𝐵) if 𝑃(𝐵|𝐴) is known, i.e., it relates the two conditional probabilities 𝑃(𝐴|𝐵)
and 𝑃(𝐵|𝐴). The basic form of Bayes’ theorem is stated and proved in the follow-
ing.

Theorem 14.6 (Bayes’ theorem) Suppose that 𝐴 and 𝐵 are two events and that
𝑃(𝐵) > 0. Then the equation

𝑃(𝐴|𝐵) = 𝑃(𝐵|𝐴)𝑃(𝐴) ∕ 𝑃(𝐵)
holds.

Proof We start with the case 𝑃(𝐴) = 0. Then both sides of the equation vanish.
The remaining case is 𝑃(𝐴) > 0. By Definition 14.5, 𝑃(𝐴|𝐵) = 𝑃(𝐴 ∩ 𝐵)∕𝑃(𝐵)
and 𝑃(𝐵|𝐴) = 𝑃(𝐵∩𝐴)∕𝑃(𝐴) if 𝑃(𝐴) > 0 and 𝑃(𝐵) > 0. Since 𝑃(𝐴∩𝐵) = 𝑃(𝐵∩𝐴),
we find 𝑃(𝐴|𝐵)𝑃(𝐵) = 𝑃(𝐵|𝐴)𝑃(𝐴). Division by 𝑃(𝐵) > 0 yields the assertion.□

There is an extended form of Bayes’ theorem incorporating the law of total probability. The law of total probability states that

𝑓𝑌 (𝑦) = ∫ 𝑓𝑌 (𝑦 ∣ 𝑋 = 𝑥)d𝐹𝑋 (𝑥) = ∫ 𝑓𝑌 (𝑦 ∣ 𝑋 = 𝑥)𝑓𝑋 (𝑥)d𝑥 (14.1)

and follows easily from Definition 14.5. For discrete random variables, it reads

𝑃(𝐵) = ∑𝑖 𝑃(𝐵|𝐴𝑖)𝑃(𝐴𝑖)

if the events 𝐴𝑖 are a partition of the whole sample space.
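Both the law of total probability and Bayes’ theorem (Theorem 14.6) can be checked numerically in a few lines; the two-event partition below is our own toy example.

```julia
# Toy partition (ours): events A_1, A_2 partition the sample space.
P_A = [0.3, 0.7]               # P(A_i)
P_B_given_A = [0.5, 0.2]       # P(B | A_i)

# Law of total probability: P(B) = sum_i P(B|A_i) P(A_i).
P_B = sum(P_B_given_A .* P_A)  # 0.5*0.3 + 0.2*0.7 = 0.29

# Bayes' theorem: the conditional probabilities P(A_i|B) sum to one.
P_A_given_B = P_B_given_A .* P_A ./ P_B
```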


Using the law of total probability, the proof of the extended form of Bayes’
theorem is similar to the proof of Theorem 14.6 above.

Theorem 14.7 (extended form of Bayes’ theorem) Suppose that 𝑋 and 𝑌 are
two random variables and that 𝑓𝑌 > 0. Then the equations

𝑓𝑋(𝑥 ∣ 𝑌 = 𝑦) = 𝑓𝑌(𝑦 ∣ 𝑋 = 𝑥)𝑓𝑋(𝑥) ∕ 𝑓𝑌(𝑦) = 𝑓𝑌(𝑦 ∣ 𝑋 = 𝑥)𝑓𝑋(𝑥) ∕ ∫ 𝑓𝑌(𝑦 ∣ 𝑋 = 𝑥) d𝐹𝑋(𝑥)

hold.
Proof We start with the case 𝑓𝑋 (𝑥) = 0. Then all terms vanish.
The general case is 𝑓𝑋 (𝑥) > 0. By Definition 14.5, the equations

𝑓𝑌 (𝑦 ∣ 𝑋 = 𝑥)𝑓𝑋 (𝑥) = 𝑓𝑋,𝑌 (𝑥, 𝑦)


𝑓𝑋 (𝑥 ∣ 𝑌 = 𝑦)𝑓𝑌 (𝑦) = 𝑓𝑌,𝑋 (𝑦, 𝑥)

hold if 𝑓𝑋 (𝑥) > 0 and 𝑓𝑌 (𝑦) > 0. Since the two joint densities on the right-hand
sides are identical, the first equation follows. The second equation uses (14.1) in
the denominator. □
Corollary 14.8 (discrete, extended form of Bayes’ theorem) Suppose that
the events 𝐴𝑖 are a partition of the sample space and that 𝐵 is an event with nonzero
probability 𝑃(𝐵) ≠ 0. Then the equations

𝑃(𝐴𝑖|𝐵) = 𝑃(𝐵|𝐴𝑖)𝑃(𝐴𝑖) ∕ 𝑃(𝐵) = 𝑃(𝐵|𝐴𝑖)𝑃(𝐴𝑖) ∕ ∑𝑗 𝑃(𝐵|𝐴𝑗)𝑃(𝐴𝑗)

hold.

14.4 Frequentist and Bayesian Inference

There are two perspectives in statistical inference, namely the frequentist and
the Bayesian perspectives.
In the frequentist perspective, probabilities are the frequencies of the occur-
rences of an event if experiments are repeated many times. This definition is
objective, since the frequencies are independent of the observer. The probabili-
ties are not updated during data acquisition. The parameters of a model are un-
known but deterministic. Estimators are constructed and confidence intervals
are calculated. A confidence interval for a parameter contains the true value
of the corresponding parameter in repeated sampling with a given probability
or frequency, namely the confidence level. Since the unknown parameters are
viewed as deterministic, parameter densities cannot be propagated through the
model in order to quantify model uncertainties.
In the Bayesian perspective, probabilities are subjective and can be updated to
incorporate new data or information. Probabilities are probability distributions
and not a single frequency value. Model parameters are considered to be random
variables. When a parameter is estimated, the probability density obtained is

called the posterior probability density. This viewpoint is a natural one when
uncertainties in model parameters are to be propagated through the models and
quantified. Instead of confidence intervals, credible intervals are calculated; a
credible interval contains the parameter with a given probability, namely the
confidence level.
In Bayesian inference, new information can be incorporated into the knowl-
edge of an observer, e.g., into the probabilities of parameters, as soon as it be-
comes available. In other words, it is a method for online learning. We denote
the (unknown) model parameters by the random variable 𝑄, which can be mul-
tidimensional in general. The data, measurements, or observations are denoted
by the random variable 𝐷 and can be multidimensional in general as well.
We rewrite Bayes’ formula in Theorem 14.7 in the form

𝜋(𝑞|𝑑) = 𝜋(𝑑|𝑞)𝜋0(𝑞) ∕ 𝜋𝐷(𝑑) = 𝜋(𝑑|𝑞)𝜋0(𝑞) ∕ ∫ 𝜋(𝑑|𝑞)𝜋0(𝑞) d𝑞, (14.2)

where

𝜋(𝑞|𝑑) ∶= 𝜋𝑄 (𝑞 ∣ 𝐷 = 𝑑),
𝜋(𝑑|𝑞) ∶= 𝜋𝐷 (𝑑 ∣ 𝑄 = 𝑞)

are defined to simplify notation. The indices referring to the random variables
are usually dropped in Bayesian inference, since they are clear from the context.
The probability density
𝜋0 (𝑞) ∶= 𝜋𝑄 (𝑞)
is called the prior probability density and contains the previous knowledge about
the random variable 𝑄 before the incorporation of new information. Further-
more, the probability density 𝜋(𝑞|𝑑) on the left-hand side is called the posterior
probability density and represents the updated knowledge after the realization
𝑑 = 𝐷(𝜔) has been observed. Finally, 𝜋(𝑑|𝑞) is called the likelihood, and the
marginal density 𝜋𝐷 (𝑑) is a normalization factor.
Therefore, new data informs the posterior density directly only through the
likelihood 𝜋(𝑑|𝑞). With this interpretation, equation (14.2) gives rise to an itera-
tive algorithm.
Algorithm 14.9 (Bayesian inference)
1. Initialize the prior density 𝜋0 (𝑞).
2. While data are available:
a. Calculate the posterior density 𝜋(𝑞|𝑑) using (14.2). Realizations 𝑑 =
𝐷(𝜔) of the data inform the likelihood 𝜋(𝑑|𝑞).
b. Set 𝜋0 (𝑞) to be the posterior density 𝜋(𝑞|𝑑) just calculated.
In the first step, the question which prior density should be used before any
data are available arises immediately. If no prior information is available at all,

one commonly uses a noninformative prior density or distribution. Straightforward choices are a uniform distribution if the parameter is known to lie in an
interval and an unnormalized uniform distribution if the parameter is known to
lie in an unbounded interval.
For example, if a parameter is known to be positive, then the unnormalized
uniform density is 𝜋0 (𝑞) ∶= [𝑞 > 0], where the notation means that [𝑆] = 1 if
the statement 𝑆 is true and [𝑆] = 0 if it is false. Of course, the integral ∫ℝ 𝜋0 (𝑞)d𝑞
is unbounded, and hence this prior distribution is called improper.
We now consider a classical and simple, yet very instructive example. Suppose
that you are tested positive for a rare disease that affects 1‰ of the population
while knowing that the test correctly identifies 99% of the people who have the
disease and incorrectly identifies only 1% of the people who do not have the
disease. The numbers look dire, so you turn to mathematics as a last resort.
Fortunately, the discrete version of Bayes’ theorem in Corollary 14.8 can im-
mediately be applied to the problem at hand by writing Bayes’ formula in the
form
𝑃(dis.|pos.) = 𝑃(pos.|dis.)𝑃(dis.) ∕ (𝑃(pos.|dis.)𝑃(dis.) + 𝑃(pos.|¬dis.)𝑃(¬dis.)),
where “dis.” means you have the disease and “pos.” means you were tested posi-
tive. This formula is easily implemented in Julia.
function posterior(prior, likelihood)
    likelihood * prior / (likelihood*prior
                          + (1-likelihood)*(1-prior))
end

The likelihood 𝑃(pos.|dis.) is given as 99% in this example. It is reasonable to


choose the frequency of the disease in the population as the initial prior proba-
bility. After a few iterations, we obtain these numbers.
global p = 0.001
for i in 1:5
    global p = posterior(p, 0.99)
    @show (i, p)
end

(i, p) = (1, 0.09016393442622944)
(i, p) = (2, 0.9074999999999999)
(i, p) = (3, 0.9989714794017902)
(i, p) = (4, 0.9999896003148012)
(i, p) = (5, 0.9999998949515932)

We see that after the first test, the posterior probability of having the disease
has increased from 1‰ to ≈ 9%. Why not more? This is explained by the rel-
atively small correctness 𝑃(pos.|dis.) = 99% of the test compared to the small
frequency of the disease in the population. Taking a sample of one thousand people, we expect one person to have the disease, while ten are expected to be tested

positive. Out of these (approximately) eleven persons tested positive, only one
has the disease, resulting in the probability of ≈ 9% to have the disease after the
first test.
If the prior probability is zero, the posterior probability will always be zero as
well. This means that absolute confidence in the beginning remains unchanged
in the Bayesian setting, unless the likelihood is equal to one. In this case, both
the numerator and the denominator are zero and the quotient is undefined.
After a second, independent test, the posterior probability of having the dis-
ease increases to ≈ 91%, and so forth. The posterior probability converges quickly
to 1 as more information becomes available.
When applying Bayes’ theorem as in this example, a few questions arise.
What is the influence of the initial prior density? Can it change the result? And
does the posterior density converge?
These questions can be answered as follows. The Bernstein–von Mises theorem states that the posterior density converges to a normal distribution independent of the initial prior density in the limit of infinitely many, independent, and
identically distributed realizations under certain conditions.

14.5 Parameter Estimation and Inverse Problems

Parameter-estimation or inverse problems are ubiquitous in science, technology,


engineering, and applications. They occur whenever a model contains parame-
ters which are a priori unknown.

14.5.1 Problem Statement

Parameter estimation and inverse problems refer to the kind of problems where
model parameters are to be inferred given measurements or observations. The
advantage of the Bayesian approach is that it also shows how well such an in-
verse problem can be solved. If the resulting probability distribution for a sought
model parameter is spread out or multimodal, this parameter cannot be deter-
mined well. On the other hand, if the distribution is well localized around a
certain value, the parameter can be calculated precisely. Having calculated prob-
ability distributions, confidence intervals are also easily found.
In contrast, classical methods that work by trying to find parameter values
such that the distance in a certain norm between measurement and model out-
put is minimized always yield a parameter value. This parameter value depends
on the choice of norm in the minimization problem, but more importantly, ad-
ditional considerations are necessary in order to assess how sensitive this mini-
mum is to perturbations. Without additional work, we do not know how reliable

the parameter values found are, which is especially important in the case of non-
linear or so-called ill-posed inverse problems.
Because parameter estimation and inverse modeling must always deal with measurement noise and possibly with other uncertainties, it is expedient to pose the problem within the context of probability theory from the beginning.
In order to apply Bayes’ theorem, we start by considering the general statisti-
cal model
𝐷𝑖 ∶= 𝑓(𝑡𝑖 , 𝑄) + 𝜖𝑖 , (14.3)
where the function 𝑓 represents the model; 𝐷 is a random vector representing
data, measurements, or observations; 𝑄 is a random vector representing param-
eters to be determined, i.e., the quantities of interest; and the random vector 𝜖
represents any unbiased, independent, and identically distributed errors such
as measurement errors. The independent variable 𝑡 of the model function 𝑓 will
represent time in the example below. The indices 𝑖 number the points where data
points 𝐷𝑖 are available for the values 𝑡𝑖 of the independent variable 𝑡 of 𝑓. The
errors are mutually independent of each other and of the parameters 𝑄, and they are additive
here. This model equation applies to any problem with additive errors, but the
approach can of course also be formulated for multiplicative errors.
Equivalently, we can write

𝐷𝑖 ∼ 𝑁(𝑓(𝑡𝑖 , 𝑄), 𝜎2 ),

i.e., the measurements are independent and normally distributed with mean
𝑓(𝑡𝑖 , 𝑄) and variance 𝜎2 .
Before we apply Theorem 14.7 or (14.2), we define a model function in or-
der to make things concrete and to discuss a non-trivial example that we will
implement later.

14.5.2 The Logistic Equation as an Example

The logistic equation is the ordinary differential equation

𝑓′(𝑡) = 𝑞1𝑓(𝑡) (1 − 𝑓(𝑡)∕𝑞2), 𝑓(0) = 𝑞3 ≠ 0, (14.4)

with the three positive parameters 𝑞1 ∈ ℝ+ , 𝑞2 ∈ ℝ+ , and 𝑞3 ∈ ℝ+ . The equation models the increase and decrease of a population of size 𝑓 as a function of
time. The parameter 𝑞3 is the initial population size at time 𝑡 = 0. If the param-
eter 𝑞2 is very large, then the second term on the right-hand side becomes negli-
gible and the equation simplifies to 𝑓′(𝑡) = 𝑞1𝑓(𝑡) with the solution 𝑓(𝑡) = 𝑞3e^(𝑞1𝑡),
implying that the parameter 𝑞1 is a growth rate. The meaning of the parameter 𝑞2
will become apparent briefly, but it is already clear that the second, quadratic

term −(𝑞1 ∕𝑞2 )𝑓(𝑡)2 counteracts the growth term 𝑞1 𝑓(𝑡). (A linear second term
could just be absorbed into the first term.)
Equation (14.4) can be solved by separation of variables; a short calculation
shows that its solution is
𝑓(𝑡) = 𝑞2𝑞3 ∕ (𝑞3 + (𝑞2 − 𝑞3)e^(−𝑞1𝑡)), (14.5)

called the logistic function. It is straightforward to see that

lim𝑡→∞ 𝑓(𝑡) = 𝑞2,

implying that the population size always tends to the parameter 𝑞2 , which is
hence usually called the carrying capacity. The larger the carrying capacity is,
the smaller the second term −(𝑞1 ∕𝑞2 )𝑓(𝑡)2 in (14.4) responsible for population
decrease is.
A realistic example of a population governed by the logistic equation is the
growth of a bacterial colony. We can measure the number of bacteria in a Petri
dish as time progresses or we can create synthetic measurements by defining

𝑑𝑖 ∶= 𝑓(𝑡𝑖 , 𝑞1 , 𝑞2 , 𝑞3 ) + 𝜖𝑖 , 𝜖𝑖 ∼ 𝑁(0, 𝜎2 ),

after recalling (14.3). Here the times 𝑡𝑖 ∶= 𝑖Δ𝑡 when measurements are taken
are equidistant, and the measurement errors 𝜖𝑖 are normally distributed with
variance 𝜎2 .
The following Julia function implements the model. It is simple, since we
could solve the underlying logistic equation explicitly, and its solution is given
by (14.5).
function f(i, dt, sigma, q1, q2, q3)
    q2*q3 / (q3 + (q2-q3)*exp(-q1*i*dt)) + sigma*randn(Float64)
end

Fig. 14.2 shows synthetic measurements generated by this function. The popula-
tion starts with a small number of individuals and then approaches the carrying
capacity.
In order to generate the synthetic measurements, we had to choose values for
the three parameters 𝑞1 , 𝑞2 , and 𝑞3 of interest. After having generated the data,
we forget these three values before proceeding with the parameter estimation.
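The generation of the synthetic data can be sketched as follows; the seed is arbitrary, and the parameter values are the ones given in the caption of Fig. 14.2.

```julia
import Random
Random.seed!(1)  # arbitrary seed for reproducible synthetic data

# The noisy logistic model as implemented above.
function f(i, dt, sigma, q1, q2, q3)
    q2*q3 / (q3 + (q2-q3)*exp(-q1*i*dt)) + sigma*randn(Float64)
end

N, dt, sigma = 100, 1.0, 0.05
q1, q2, q3 = 0.1, 2.0, 0.05
d = [f(i, dt, sigma, q1, q2, q3) for i in 0:N-1]
# The last measurements approach the carrying capacity q2 = 2.
```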

14.5.3 The Likelihood

In order to apply Bayes’ theorem in the form (14.2) and to calculate the sought
posterior density 𝜋(𝑞|𝑑), we must know the likelihood 𝜋(𝑑|𝑞) (and the prior
density 𝜋0 (𝑞)). The likelihood function depends on the assumptions made on the

Fig. 14.2 Synthetic measurements generated using the logistic function as implemented by f
for 𝑖 ∈ {1, … , 100}, Δ𝑡 ∶= 1, 𝜎 = 0.05, 𝑞1 ∶= 0.1, 𝑞2 ∶= 2, 𝑞3 ∶= 0.05.

distribution of the errors. In the case of independent and identically distributed


errors 𝜖𝑖 ∼ 𝑁(0, 𝜎2 ), which we assume in this example, the likelihood is

𝜋(𝑑|𝑞) ∶= 𝐿(𝑞, 𝜎² ∣ 𝑑) ∶= ∏ᵢ₌₁ᴺ (2𝜋𝜎²)^(−1∕2) e^(−(𝑑𝑖 − 𝑓(𝑡𝑖,𝑞))²∕(2𝜎²))
= (2𝜋𝜎²)^(−𝑁∕2) e^(−𝑆(𝑞)∕(2𝜎²)), (14.6)

where 𝐷 is an 𝑁-dimensional random vector, meaning that there are 𝑁 data points (𝑑1 , … , 𝑑𝑁 ), and we have defined
𝑆(𝑞) ∶= ∑ᵢ₌₁ᴺ (𝑑𝑖 − 𝑓(𝑡𝑖, 𝑞))².
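A direct Julia sketch of the likelihood (14.6) could look as follows; the helper names S and likelihood are ours, and for large 𝑁 one would evaluate the logarithm of (14.6) instead to avoid floating-point underflow.

```julia
# Sum of squared residuals S(q) between data d and model values f(t_i, q).
S(d, fq) = sum((d .- fq).^2)

# Likelihood (14.6) for i.i.d. errors eps_i ~ N(0, sigma^2).
function likelihood(d::Vector{Float64}, fq::Vector{Float64},
                    sigma::Float64)::Float64
    N = length(d)
    exp(-S(d, fq) / (2*sigma^2)) / (2*pi*sigma^2)^(N/2)
end
```

For d equal to the model values, S vanishes and the likelihood reduces to (2𝜋𝜎²)^(−𝑁∕2).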

14.5.4 Markov-Chain Monte Carlo

Furthermore, in order to apply Bayes’ Theorem in the form (14.2), the integral in
the denominator must be evaluated. This numerical integration remains a chal-
lenge if the number of parameters, i.e., the dimension of the random variable 𝑄,
is large, although many methods for high-dimensional numerical integration
have been developed. Therefore we follow an alternative approach here.
The alternative is to construct a Markov chain whose stationary distribution
is equal to the posterior density. To do so, we start by defining Markov chains.
Definition 14.10 (Markov chain, Markov property) A Markov chain is a se-
quence of random variables 𝑋𝑛 , 𝑛 ∈ ℕ, that satisfy the Markov property, namely
that 𝑋𝑛+1 depends only on its predecessor 𝑋𝑛 for all 𝑛 ∈ ℕ, i.e.,

𝑃(𝑋𝑛+1 = 𝑥𝑛+1 ∣ 𝑋1 = 𝑥1 , … , 𝑋𝑛 = 𝑥𝑛 )
= 𝑃(𝑋𝑛+1 = 𝑥𝑛+1 ∣ 𝑋𝑛 = 𝑥𝑛 ) ∀𝑛 ∈ ℕ.

The set of all possible realizations of the random variables 𝑋𝑛 is called the
state space of the Markov chain.
Markov chains can be realized if three pieces of information are known:
1. its state space,
2. its initial distribution 𝑝0 , i.e., the distribution of 𝑋0 , and
3. its transition or Markov kernel, which gives the probability

𝑝𝑖𝑗 ∶= 𝑃(𝑋𝑛+1 = 𝑥𝑗 ∣ 𝑋𝑛 = 𝑥𝑖 )

of transitioning from state 𝑥𝑖 to 𝑥𝑗 and thus defines how the chain evolves.
If the state space is finite, the Markov chain is called finite and the entries of
the transition matrix 𝑃 are the probabilities 𝑝𝑖𝑗 .
Here we assume that the transition probabilities 𝑝𝑖𝑗 that constitute the transition
kernel are independent of time or iteration 𝑛. Markov chains with this property
are called homogeneous Markov chains.
Clearly, the entries of the initial distribution 𝑝0 and of the transition matrix 𝑃
are nonnegative, and the elements of 𝑝0 and the rows of 𝑃 sum to one. The dis-
tributions of the states as time progresses are given by

𝑝0,
𝑝1 ∶= 𝑝0𝑃,
𝑝2 ∶= 𝑝1𝑃 = 𝑝0𝑃²,
⋮
𝑝𝑛 ∶= 𝑝𝑛−1𝑃 = 𝑝0𝑃ⁿ,

i.e., in each iteration the distribution 𝑝𝑛 is multiplied by the transition matrix 𝑃.
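This iteration is easy to try out; the two-state transition matrix below is our own toy example, not from the text.

```julia
# Toy two-state chain (ours): iterate p_n = p_{n-1} * P.
P = [0.9 0.1;
     0.5 0.5]      # each row sums to one
p = [1.0 0.0]      # initial distribution p_0 as a row vector
for n in 1:200
    global p = p * P
end
# p has converged to the stationary distribution (5/6, 1/6).
```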



A natural question to ask is whether the random variables of a Markov chain


converge and have a limit. It turns out that convergence in distribution is the
right notion of convergence to use to answer this question.
Definition 14.11 (convergence in distribution) Let 𝑋𝑛 , 𝑛 ∈ ℕ, be a sequence
of random variables with distributions 𝐹𝑋𝑛 . If 𝐹𝑋 is a distribution function, if 𝐶
is the set where 𝐹𝑋 is continuous, and if

lim𝑛→∞ 𝐹𝑋𝑛(𝑥) = 𝐹𝑋(𝑥) ∀𝑥 ∈ 𝐶

holds, then the sequence ⟨𝑋𝑛 ⟩𝑛∈ℕ is said to converge in distribution to the limiting
random variable 𝑋. Convergence in distribution is written as
𝑋𝑛 ⟶ᴰ 𝑋.

Applying convergence in distribution to a Markov chain therefore means considering the limit
𝜋 ∶= lim𝑛→∞ 𝑝𝑛.

We want to decide whether the limiting distribution 𝜋 exists and – if it does – to


calculate it. The calculation

𝜋 = lim𝑛→∞ 𝑝0𝑃ⁿ = lim𝑛→∞ 𝑝0𝑃ⁿ⁺¹ = (lim𝑛→∞ 𝑝0𝑃ⁿ) 𝑃 = 𝜋𝑃

implies that if the limit 𝜋 exists it must satisfy the equation

𝜋 = 𝜋𝑃.

This observation motivates the following definition.

Definition 14.12 (stationary or equilibrium distribution) If a Markov chain


has the transition matrix 𝑃, then distributions that satisfy the equation

𝜋 = 𝜋𝑃

are called stationary or equilibrium distributions of the Markov chain.

Every homogeneous Markov chain on a finite state space has at least one sta-
tionary distribution. A stationary distribution, however, may not be unique and
it may not be equal to lim𝑛→∞ 𝑝𝑛 .
There are two kinds of homogeneous Markov chain that we must exclude to
ensure the unique existence of a stationary distribution. The first kind to be ex-
cluded are Markov chains in which not the whole state space is reachable after
some time. This kind is excluded by stipulating that the Markov chain is irre-
ducible.

Definition 14.13 (irreducible Markov chain, reducible Markov chain) A


Markov chain is called irreducible if any state 𝑥𝑖 can be reached from any other
state 𝑥𝑗 in a finite number of steps with nonzero probability. A Markov chain that is not irreducible is called reducible.

The second kind of Markov chain to be excluded in order to obtain a station-


ary distribution are periodic ones. In a periodic Markov chain, parts of the state
space are visited at regular time intervals.

Definition 14.14 (aperiodic Markov chain, periodic Markov chain) The


period of a Markov chain is defined as

gcd({𝑚 ∈ ℕ ∣ 𝑃(𝑋𝑛+𝑚 = 𝑥𝑖 ∣ 𝑋𝑛 = 𝑥𝑖 ) > 0}).

A Markov chain is called aperiodic if its period is equal to one and it is called
periodic if its period is greater than one.

The following theorem answers the question which properties of a Markov


chain ensure that it has a unique stationary distribution 𝜋.

Theorem 14.15 (unique stationary distribution) Every finite, homogeneous,


irreducible, and aperiodic Markov chain has a unique stationary distribution 𝜋.
Furthermore, the Markov chain converges in distribution, i.e.,
𝑋𝑛 ⟶ᴰ 𝑋

and
lim𝑛→∞ 𝑝𝑛 = 𝜋,

to this limiting stationary distribution 𝜋 for every initial distribution 𝑝0 .

Knowing that a Markov chain has a unique stationary distribution 𝜋, we


could calculate it using the defining equation 𝜋 = 𝜋𝑃 in Definition 14.12 together with the condition ∑𝑖 𝜋𝑖 = 1. Unfortunately, this is often difficult. An
alternative that we discuss next is to use the detailed balance condition in the
following definition.

Definition 14.16 (reversible Markov chain, detailed balance) A Markov


chain with transition matrix 𝑃 and limiting distribution 𝜋 is called reversible if
the condition
𝜋𝑖𝑝𝑖𝑗 = 𝜋𝑗𝑝𝑗𝑖 ∀𝑖, 𝑗
of detailed balance is satisfied.

The next result is easy to prove.

Theorem 14.17 Every reversible Markov chain is stationary.



Proof By definition, the limiting distribution of a reversible Markov chain exists.


Using the detailed-balance condition, we calculate
(𝜋𝑃)𝑗 = ∑𝑖 𝜋𝑖𝑝𝑖𝑗 = ∑𝑖 𝜋𝑗𝑝𝑗𝑖 = 𝜋𝑗 ∑𝑖 𝑝𝑗𝑖 = 𝜋𝑗 ,

which implies
𝜋𝑃 = 𝜋,
which is the definition of stationarity. □

The detailed-balance condition helps calculate the stationary distribution.


Suppose a finite and homogeneous Markov chain is irreducible and aperiodic.
Then Theorem 14.15 implies that there exists a unique stationary distribution ir-
respective of the initial distribution. We can find the limiting stationary distribution by identifying a candidate distribution and checking that it is stationary using the detailed-balance condition in Definition 14.16 and Theorem 14.17. Then
the candidate distribution must be identical to the limiting distribution, since it
is unique and stationary.
In other words, the detailed-balance condition in Definition 14.16 and Theorem 14.17 is only a sufficient condition for the existence of a stationary distribution. The uniqueness of the stationary distribution is guaranteed by Theorem 14.15 when the important assumptions that the Markov chain is irreducible and aperiodic are satisfied.
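Checking a candidate distribution via detailed balance amounts to a one-line computation per pair of states; the two-state chain below is our own toy example.

```julia
# Candidate distribution pi = (5/6, 1/6) for a toy two-state chain (ours).
P = [0.9 0.1;
     0.5 0.5]
pi_ = [5/6, 1/6]
lhs = pi_[1] * P[1, 2]   # pi_1 p_12 = 1/12
rhs = pi_[2] * P[2, 1]   # pi_2 p_21 = 1/12
# Since lhs == rhs, detailed balance holds and pi is stationary
# (Theorem 14.17).
```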
The basic idea of Markov-chain Monte Carlo is to construct Markov chains
whose stationary distribution is the sought posterior density. Evaluating realiza-
tions of the Markov chain hence is the same as sampling the posterior and yields
densities for the parameter values.

14.5.5 The Metropolis–Hastings Algorithm

The Metropolis and Metropolis–Hastings algorithms make it possible to implement this idea. In parameter estimation and inverse problems, the random vari-
ables 𝑋𝑛 in the Markov chain are the parameters of the model, and therefore we
denote the random variables by 𝑄𝑛 and the realizations by 𝑞𝑛 from now on.
The Metropolis and Metropolis–Hastings algorithms compute Markov chains
whose unique stationary distributions are any given distribution 𝜔. For our pur-
poses, we will later define 𝜔 to be the posterior distribution, i.e., we will use

𝜔(𝑞) ∶= 𝜋(𝑞|𝑑) = 𝜋(𝑑|𝑞)𝜋0(𝑞) ∕ ∫ 𝜋(𝑑|𝑞)𝜋0(𝑞) d𝑞. (14.7)

But for now it is more convenient to derive the algorithm for a general distribu-
tion 𝜔 that should become the stationary distribution of the Markov chain.

We construct the transition probability 𝑃(𝑞 ′ |𝑞) for going from state 𝑞 to
state 𝑞′ such that it satisfies the detailed-balance condition (see Definition 14.16),
of course. Given 𝜔, this means that the transition probability 𝑃(𝑞 ′ |𝑞) must satisfy

𝑃(𝑞′ |𝑞)𝜔(𝑞) = 𝑃(𝑞|𝑞 ′ )𝜔(𝑞 ′ )

or, equivalently,
𝑃(𝑞′|𝑞) ∕ 𝑃(𝑞|𝑞′) = 𝜔(𝑞′) ∕ 𝜔(𝑞). (14.8)
The transition from state 𝑞 to state 𝑞 ′ happens in two steps: first, a new state is
proposed by a proposal or jumping distribution 𝐽(𝑞 ′ |𝑞), which is the conditional
probability of proposing the new state 𝑞′ given state 𝑞, and second the acceptance
probability 𝐴(𝑞 ′ |𝑞) is the probability of accepting the proposed state 𝑞 ′ . If it is
rejected, the old state 𝑞 is repeated. In summary, this means that we try to find 𝐽
and 𝐴 such that
𝑃(𝑞 ′ |𝑞) = 𝐽(𝑞 ′ |𝑞)𝐴(𝑞′ |𝑞).
Substituting this form of 𝑃(𝑞′|𝑞) into (14.8) yields the form

𝐴(𝑞′|𝑞) ∕ 𝐴(𝑞|𝑞′) = 𝜔(𝑞′)𝐽(𝑞|𝑞′) ∕ (𝜔(𝑞)𝐽(𝑞′|𝑞)) (14.9)

of the detailed-balance condition. At this point, we can choose any proposal dis-
tribution 𝐽. There are many choices, but the choice is important for the numer-
ical behavior of the Markov chain and will be discussed later. Having chosen
the proposal distribution, we must define a suitable acceptance probability. The
Metropolis acceptance probability

𝐴(𝑞′|𝑞) ∶= min (1, 𝜔(𝑞′)𝐽(𝑞|𝑞′) ∕ (𝜔(𝑞)𝐽(𝑞′|𝑞)))

is common.
We can check that it works by setting

𝑟 ∶= 𝜔(𝑞′)𝐽(𝑞|𝑞′) ∕ (𝜔(𝑞)𝐽(𝑞′|𝑞))

and calculating

𝐴(𝑞′|𝑞) ∕ 𝐴(𝑞|𝑞′) = min(1, 𝑟) ∕ min(1, 1∕𝑟) = { 1∕(1∕𝑟) if 𝑟 ≥ 1; 𝑟∕1 if 𝑟 < 1 } = 𝑟,

which shows that (14.9) and therefore the detailed-balance condition is satisfied.
Hence we can indeed construct a Markov chain whose stationary distribution is
the given, arbitrary distribution 𝜔.
The difference between the Metropolis and the Metropolis–Hastings algo-
rithm lies only in the proposal distribution 𝐽. If it is symmetric, i.e., if 𝐽(𝑞 ′ , 𝑞) =
𝐽(𝑞, 𝑞 ′ ), then the algorithm is called a Metropolis algorithm. If it is not symmet-
ric, it is called a Metropolis–Hastings algorithm, which is therefore slightly more
general.
We can now formulate the Metropolis–Hastings algorithm. The similarity to
simulated annealing (see Sect. 11.3) is not a coincidence, but due to their com-
mon root. The algorithm works for general distributions 𝜔, but in the formu-
lation of the algorithm we also note what happens when 𝜔 is given by (14.7),
because this is the application we are interested in.

Algorithm 14.18 (Metropolis–Hastings)


1. Initialization: choose an initial state 𝑞1 (such that 𝜋(𝑞1 |𝑑) > 0).
2. Repeat:

a. Generate a candidate state

𝑞′ ∶= 𝑞𝑛 + 𝑅𝑧 (14.10)

after choosing 𝑧 ∼ 𝑁(0, 𝐼), where 𝑅 is the Cholesky factor of 𝐷


or 𝑉 in (14.12) below. This definition of 𝑞 ′ ensures that

𝑞′ ∼ 𝑁(𝑞𝑛 , 𝐷) or 𝑞′ ∼ 𝑁(𝑞𝑛 , 𝑉),

respectively, because of Theorem 14.20 below.


b. Calculate the acceptance probability

𝐴(𝑞′|𝑞𝑛) ∶= min (1, 𝜔(𝑞′)𝐽(𝑞𝑛|𝑞′) ∕ (𝜔(𝑞𝑛)𝐽(𝑞′|𝑞𝑛))) = min (1, 𝜋(𝑞′|𝑑)𝐽(𝑞𝑛|𝑞′) ∕ (𝜋(𝑞𝑛|𝑑)𝐽(𝑞′|𝑞𝑛)))
= min (1, 𝜋(𝑑|𝑞′)𝜋0(𝑞′)𝐽(𝑞𝑛|𝑞′) ∕ (𝜋(𝑑|𝑞𝑛)𝜋0(𝑞𝑛)𝐽(𝑞′|𝑞𝑛))), (14.11)

since the normalization integral ∫ 𝜋(𝑑|𝑞)𝜋0(𝑞)d𝑞 cancels in the quotient.

c. Accept or reject the candidate by generating a uniformly distributed random number 𝑢 ∼ 𝑈(0, 1) from the interval [0, 1] and defining the next
value in the Markov chain as

𝑞𝑛+1 ∶= 𝑞′ if 𝑢 ≤ 𝐴(𝑞′|𝑞𝑛), and 𝑞𝑛+1 ∶= 𝑞𝑛 if 𝑢 > 𝐴(𝑞′|𝑞𝑛).

In other words, the candidate value 𝑞 ′ is used as the next value 𝑞𝑛+1 in
the Markov chain with the acceptance probability, and otherwise it is
rejected and the old value 𝑞𝑛 is repeated.
d. Repeat until the chain is long enough to estimate the parameter 𝑞 after
discarding a sufficiently long burn-in period at the beginning. Compute
any statistic of interest from the Markov chain without the burn-in pe-
riod.
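Before the implementation in Sect. 14.5.6, the algorithm can be illustrated by a minimal random-walk Metropolis sketch for a target whose statistics we know, namely 𝜔(𝑞) ∝ e^(−𝑞²∕2); the step size, chain length, and burn-in below are our own arbitrary choices.

```julia
import Random, Statistics
Random.seed!(0)

omega(q) = exp(-q^2/2)   # unnormalized standard normal target

M = 200_000
q = zeros(M)
for n in 1:M-1
    qq = q[n] + 0.5*randn()              # symmetric proposal, so J cancels
    A = min(1, omega(qq)/omega(q[n]))    # Metropolis acceptance probability
    q[n+1] = rand() <= A ? qq : q[n]
end

burnin = 10_000                          # discard the burn-in period
m = Statistics.mean(q[burnin:end])       # close to 0
s = Statistics.std(q[burnin:end])        # close to 1
```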

The derivation of the acceptance probability above showed that the values
in a Markov chain calculated by the Metropolis–Hastings algorithm satisfy the
detailed-balance condition by construction and thus the Markov chain is re-
versible. We have hence shown the following theorem.

Theorem 14.19 (detailed balance in Metropolis–Hastings algorithm) The


Markov chain constructed by Algorithm 14.18 satisfies the detailed-balance condi-
tion and is thus reversible.

If the proposal distribution used in the Metropolis–Hastings algorithm additionally renders the Markov chain irreducible and aperiodic, then Theorem 14.15
ensures that the stationary distribution is unique. Intuitively, this means that the
proposal distribution samples the whole space without prejudice.
Before we implement the Metropolis–Hastings algorithm, we discuss a few
important points.
Maybe the most important feature of the algorithm is that the integral ∫ 𝜋(𝑑|𝑞)𝜋0 (𝑞)d𝑞 cancels in the acceptance probability (14.11). The integral only
serves as a normalization factor in Bayes’ theorem, but in order to apply Bayes’
theorem directly it must be evaluated, which is time-consuming in multidimen-
sional problems. Rendering this integral over the whole parameter space super-
fluous is the main computational appeal of the Metropolis–Hastings algorithm
and other Markov-chain Monte Carlo methods. Instead of evaluating the inte-
gral, a sufficiently long Markov chain is computed.
Many choices for the proposal distribution 𝐽 are possible. Two common
choices that ensure that Theorem 14.15 can be applied are the normal distri-
butions

𝐽(𝑞 ′ |𝑞𝑛 ) ∶= 𝑁(𝑞𝑛 , 𝐷), (14.12a)


𝐽(𝑞 ′ |𝑞𝑛 ) ∶= 𝑁(𝑞𝑛 , 𝑉), (14.12b)

where 𝐷 is a diagonal matrix and 𝑉 is the covariance matrix for the parameter
vector 𝑞. In the first choice, the elements of the diagonal matrix reflect the scale
associated with each parameter. In the second choice, the scale of each param-
eter can depend on the other parameters via the covariance matrix. Considera-
tions regarding the choices of 𝐷 or 𝑉 will be discussed later.
In both of these choices for the proposal distribution, 𝐽 is symmetric, as the calculation
𝐽(𝑞′, 𝑞𝑛) = (2𝜋)^(−𝑁∕2) |𝑉|^(−1∕2) e^(−(1∕2)(𝑞′−𝑞𝑛)𝑉^(−1)(𝑞′−𝑞𝑛)⊤)
= (2𝜋)^(−𝑁∕2) |𝑉|^(−1∕2) e^(−(1∕2)(𝑞𝑛−𝑞′)𝑉^(−1)(𝑞𝑛−𝑞′)⊤) = 𝐽(𝑞𝑛, 𝑞′)

shows.
It is obvious that in a Metropolis algorithm (where the proposal distribution 𝐽
is symmetric by definition) the acceptance probability (14.11) simplifies to

𝐴(𝑞′|𝑞𝑛) = min (1, 𝜋(𝑑|𝑞′)𝜋0(𝑞′) ∕ (𝜋(𝑑|𝑞𝑛)𝜋0(𝑞𝑛))).

When generating a candidate state 𝑞′ (which may be a vector), equation


(14.10) shows how to calculate a state 𝑞′ distributed according to a general nor-
mal distribution 𝑁(𝑞𝑛 , 𝑉) using only a random-number generator that produces
values distributed according to 𝑁(0, 1). The definition in (14.10) indeed calcu-
lates a suitable candidate state from 𝑁(𝑞𝑛 , 𝑉) due to the following theorem.

Theorem 14.20 (construction of 𝑁(𝜇, 𝑉)) Suppose that 𝑌 ∼ 𝑁(𝜇, 𝑉) and 𝑍 ∼


𝑁(0, 𝐼) are normally distributed 𝑛-dimensional random vectors, where the matrix
𝑉 is positive definite and 𝐼 is the 𝑛 × 𝑛 identity matrix. Then

𝑌 = 𝑅𝑍 + 𝜇,

where
𝑉 = 𝑅𝑅⊤
and 𝑅 is a lower triangular matrix.

A proof can be found in [5]. The factorization 𝑉 = 𝑅𝑅⊤ can be efficiently


computed using a Cholesky factorization (see Theorem 8.10).
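In Julia, Theorem 14.20 corresponds to only a few lines; the concrete mean vector and covariance matrix below are our own toy values.

```julia
import LinearAlgebra, Random, Statistics
Random.seed!(0)

V = [2.0 0.5;
     0.5 1.0]                       # positive definite covariance (ours)
mu = [1.0, -1.0]
R = LinearAlgebra.cholesky(V).L     # lower triangular with V = R*R'
Z = randn(2, 100_000)               # columns are draws from N(0, I)
Y = R*Z .+ mu                       # columns are draws from N(mu, V)
# Statistics.cov(Y; dims = 2) approximates V up to sampling error.
```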
The likelihood 𝜋(𝑑|𝑞) used in Algorithm 14.18 was already discussed in
Sect. 14.5.3 and contains an assumption on the error distribution.
Finally, the particular choice of the initial parameter value should not have
any influence on any statistic of the Markov chain, since a sufficiently large burn-
in period should be discarded anyway.

14.5.6 Implementation of the Metropolis–Hastings Algorithm

Having shown that the Metropolis–Hastings algorithm can be used to construct


the posterior distribution, we discuss an implementation for one parameter. We
start by importing a few packages.
import ProgressMeter
import Statistics
import StatsBase

To run Algorithm 14.18, we must provide three arguments, the length of the Markov chain to be calculated, the variance var of the proposal distribution, and a specification of the parameter value. The macro @showprogress in the ProgressMeter package takes three arguments: the time in seconds between updates of the progress bar, a description, and the for loop whose progress is to be shown.
function MH_1D(prior::Function, likelihood::Function,
               proposal::Function,
               M::Int, var::Float64,
               q_min::Float64, q_max::Float64,
               q_init::Float64)::Vector{Float64}
    local q = fill(NaN, M)
    q[1] = q_init

    ProgressMeter.@showprogress 1 "Iterations: " for n in 1:M-1
        local qq::Float64 = q[n] + sqrt(var) * randn(Float64)

        if qq < q_min
            qq = q_min
        end
        if qq > q_max
            qq = q_max
        end

        local A::Float64 =
            min(1, (likelihood(qq) * prior(qq)
                    * proposal(q[n], qq, var)) /
                   (likelihood(q[n]) * prior(q[n])
                    * proposal(qq, q[n], var)))

        q[n+1] =
            if rand(Float64) <= A
                qq
            else
                q[n]
            end
    end

    q
ȕɪȍ

The function f yields the exact value of the model, here the solution of the logistic equation, at time i times dt for given parameter values q1, q2, and q3.
function f(i::Int, dt::Float64, q1::Float64, q2::Float64,
           q3::Float64)::Float64
    @assert i >= 0
    @assert dt > 0
    @assert q1 >= 0
    @assert q2 >= 0
    @assert q3 >= 0

    q2*q3 / (q3 + (q2-q3) * exp(-q1*i*dt))
end

The function model wraps evaluations of f for equidistant points in time and returns a vector.
function model(q1::Float64, q2::Float64,
               q3::Float64)::Vector{Float64}
    Float64[f(i, dt, q1, q2, q3) for i in 0:N-1]
end

You may want to experiment with different values for the number of points and the time step by changing the global variables N and dt below.
Next we define some global variables. Since we use synthetic measurements, we define the exact parameter values. We will estimate the parameter q2 and assume that it lies in the interval [0, 5]. We also know the standard deviation 𝜎 of the additive noise. In a real-world example, it would correspond to the measurement error. Based on these constants, we produce the synthetic data by evaluating the model function and adding the noise.
global N = 50
global dt = 1.0
global q1_exact = 0.1
global q2_exact = 2.0
global q2_min = 0.0
global q2_max = 5.0
global q3_exact = 0.05
global sigma = 0.05
global data =
    model(q1_exact, q2_exact, q3_exact) + sigma * randn(Float64, N)

To complete the description of our inverse problem, we define the prior, the
likelihood, and the proposal distribution. We use a uniformly distributed prior
here.
function prior(q::Float64)::Float64
    1 / (q2_max - q2_min)
end

The ratio 𝜋0 (𝑞 ′ )∕𝜋0 (𝑞𝑛 ) in the acceptance probability (14.11) may simplify if the
prior distribution has a suitable form. In fact, the factor 𝜋0 (𝑞 ′ )∕𝜋0 (𝑞𝑛 ) in (14.11)
simplifies to one here.
The likelihood (14.6) uses the standard deviation 𝜎 defined above.
function likelihood(q::Float64)::Float64
    local S = sum((model(q1_exact, q, q3_exact) .- data).^2)

    exp(-S / (2*sigma^2)) / (2*pi*sigma^2)^(N/2)
end

If the likelihood is a normal distribution (as is the case in (14.6) in Sect. 14.5.3),
a numerical improvement is possible. Then in the quotient 𝜋(𝑑|𝑞′)∕𝜋(𝑑|𝑞𝑛) in the acceptance probability (14.11), the normalization factor 1∕(2𝜋𝜎²)^{𝑁∕2} can be cancelled so that we have

    \frac{\pi(d|q')}{\pi(d|q_n)} = \mathrm{e}^{(S(q_n)-S(q'))/(2\sigma^2)}.    (14.13)

This form of the ratio has the advantage that the division of two numbers possibly
very close to zero is avoided and thus numerical accuracy is improved. You may
want to implement this improvement when possible (see Problem 14.13).
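A minimal sketch of this improvement follows; the sums of squares S_old and S_new and the standard deviation sigma are illustrative values. The acceptance probability is formed from the difference of the two sums of squares, so no near-zero likelihood values are ever divided.

```julia
# Sketch of (14.13): acceptance probability from the difference of the
# sums of squares instead of a quotient of two possibly tiny likelihoods.
sigma = 0.05   # standard deviation of the noise
S_old = 0.30   # S(qn), illustrative value
S_new = 0.28   # S(q'), illustrative value

A = min(1.0, exp((S_old - S_new) / (2 * sigma^2)))
```

Since the candidate here has the smaller sum of squares, the exponent is positive and the candidate is always accepted (A equals 1).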
The third function to be defined is the proposal distribution.
function proposal(q1::Float64, q2::Float64, var::Float64)::Float64
    @assert var > 0

    exp(-(q1-q2)^2 / (2*var)) / sqrt(2*pi*var)
end

Now everything is in place to compute a Markov chain for the parameter q2. The following function call returns a long vector, still containing the burn-in period. The proposal distribution has variance 0.01, and the initial state is the interval midpoint.
MH_1D(prior, likelihood, proposal,
      10^5, 0.01,
      q2_min, q2_max, (q2_min+q2_max)/2)

In postprocessing (see Problem 14.14), the burn-in period is discarded and the values of the Markov chain are sorted into bins. You may want to experiment with different values for the arguments of MH_1D and for the various (global) variables to see how they affect the posterior distribution.
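Such a postprocessing step might look as follows (cf. Problem 14.14). The chain below is a synthetic stand-in for the output of the Metropolis–Hastings implementation above, while the burn-in length and bin width are the values used in the text.

```julia
# Sketch of the postprocessing: discard the burn-in period, sort the
# remaining values into bins, and take the most frequent bin center as
# the maximum-a-posteriori estimate. The chain is a synthetic stand-in.
chain   = 2.0 .+ 0.05 .* randn(10^4)
burnin  = 10^3
width   = 0.005
samples = chain[burnin+1:end]

counts = Dict{Float64,Int}()
for q in samples
    b = round(q / width) * width       # center of the bin containing q
    counts[b] = get(counts, b, 0) + 1
end
q_map = findmax(counts)[2]             # key of the largest count
```

The same histogram can also be used for confidence intervals by summing bin counts outward from the mode until the desired fraction of samples is covered.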
The results from this function call for determining the parameter 𝑞2 in the logistic equation are shown in Figures 14.3, 14.4, and 14.5. Using a burn-in period of length 10^3 and a bin width of 0.005 for the final histogram, the maximum-a-posteriori (map) estimate is 2.0025, which is reasonably close to the exact value 2. In other words, the most likely bin is centered around 2.0025. The mean value is

Fig. 14.3 Exact solution of the model equation and 50 synthetic measurements with additive
noise (𝜎 = 0.05).

Fig. 14.4 Beginning of a Markov chain of length 105 for the measurement data shown in
Fig. 14.3. The burn-in period discarded when plotting the next figure, Fig. 14.5, is shown in
black.

≈ 2.004 and the median is ≈ 2.004. In all three values, the first three digits agree
with the exact value.
In Problem 14.15, you are asked to extend the implementation to multiple
dimensions.

Fig. 14.5 Histogram of the parameter values found in the Markov chain shown in Fig. 14.4
after the burn-in period.

14.5.7 Maximum-a-Posteriori Estimate and Maximum-Likelihood Estimate

The posterior density 𝜋(𝑞|𝑑) provides complete information about the model
parameters calculated from the measurements or observations. From this den-
sity, point estimates such as the mean, the median, or a mode can be calculated.
Furthermore, confidence intervals are easily calculated as well.
A mode of a continuous probability distribution is a local maximum of its density. A mode of the posterior density 𝜋(𝑞|𝑑) is called a maximum-a-posteriori (map) estimate and can be written as

    q_{\mathrm{MAP}} := \arg\max_q \pi(q|d) = \arg\max_q \pi(d|q)\,\pi_0(q)

by (14.2), since 𝜋𝐷 (𝑑) is constant with respect to 𝑞.


If the prior 𝜋0 is uniform, the map estimate 𝑞MAP is identical to the maximum-
likelihood (ml) estimate
    (q_{\mathrm{ML}}, \sigma^2_{\mathrm{ML}}) := \arg\max_{q \in Q,\ \sigma^2 \in \mathbb{R}^+} L(q, \sigma^2 \mid d),

since 𝜋(𝑑|𝑞) = 𝐿(𝑞, 𝜎2 ∣ 𝑑) by (14.6).



14.5.8 Convergence

Having shown that the Metropolis–Hastings algorithm yields the stationary dis-
tribution of the Markov chain and having implemented the basic algorithm, nu-
merical questions still remain. The two main questions are the following. How
should the proposal distribution be chosen? And how long should the Markov
chain be?
The variance of the proposal distribution affects the Markov chain in an im-
portant way. If the variance is too large, a large proportion of the candidate states
is rejected because they have smaller likelihoods and the chain stagnates for
many iterations. On the other hand, if the variance is too small, the acceptance
probability is large, but the chain explores the parameter space only slowly.
In multidimensional problems, the individual parameters should in general
be explored at different speeds or scales. This is the reason why covariance ma-
trices 𝐷 or 𝑉 are used in the proposal distributions in (14.12) instead of just a
multiple of the identity matrix, which would explore all parameters at the same
scale. We still do not know a good, automatic method to find such a matrix 𝐷 or 𝑉
beyond checking the resulting Markov chain, but we will return to this question
in the next section.
How long should the Markov chain be to ensure convergence and to ade-
quately sample the posterior distribution? This is a difficult question, as analytic
convergence and stopping criteria are lacking. Convergence of Markov-chain
Monte Carlo algorithms can be falsified, but not verified in general. We men-
tion some tests to instill confidence in the convergence of a Markov chain, while
more on this subject can be found, for example, in [2, 4].
The simplest and most straightforward method to assess the burn-in period
and the convergence behavior is to plot or to statistically monitor the marginal
paths of the unknown parameters as in Fig. 14.4. Unfortunately, the chain may
meander around a local minimum for a long time before it transitions close to
another local minimum or, hopefully, close to a global minimum. But depending
on the problem and on the starting point, this may take a very, very long time.
Furthermore, it is possible – at least when the number of unknown param-
eters is sufficiently small – to compare the parameter values resulting from the
Markov chain with the parameter values that stem from applying Bayes’ formula
(14.2) by calculating the integral directly, e.g. by sparse-grid quadrature.
A more statistical test is to keep track of the ratio of accepted states to the total number of states, which is called the acceptance ratio. Depending on the problem, a rather large range of acceptance ratios can work well, but values between 0.1 and 0.5 are usually considered reasonable (see Problem 14.17). Knowing the acceptance ratio helps in tuning the proposal density 𝐽.
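The acceptance ratio can be estimated from a finished chain by counting the steps where the state changed. This hedged sketch slightly undercounts acceptances in the unlikely event that an accepted candidate exactly equals the current state; with continuous proposals this happens with probability zero.

```julia
# Sketch: estimating the acceptance ratio of a finished Markov chain by
# counting the transitions where the state actually changed.
function acceptance_ratio(q::Vector{Float64})::Float64
    accepted = count(n -> q[n+1] != q[n], 1:length(q)-1)
    accepted / (length(q) - 1)
end

acceptance_ratio([1.0, 1.0, 2.0, 3.0, 3.0])  # two of four transitions accepted
```

If the ratio is far outside the usual range, the variance of the proposal distribution is a natural first parameter to adjust.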
Another statistical test is to calculate the autocorrelation between subchains
of length 𝐿 of the Markov chain with lag ℎ. While adjacent subchains are likely
correlated because of the Markov property, low autocorrelation often indicates
fast convergence since in this case independent samples are produced and mix-
ing is good. The autocorrelation function
    \mathrm{ACF}(L, h) := \frac{\sum_{i=1}^{L-h} (q_i - \bar{q})(q_{i+h} - \bar{q})}{\sum_{i=1}^{L} (q_i - \bar{q})^2}

is the ratio of the estimate

    \frac{1}{L} \sum_{i=1}^{L-h} (q_i - \bar{q})(q_{i+h} - \bar{q})    (14.14)

of the autocovariance and the estimate

    \frac{1}{L} \sum_{i=1}^{L} (q_i - \bar{q})^2

of the variance, where 𝑞̄ is the sample mean. Although the estimate

    \frac{1}{L-h} \sum_{i=1}^{L-h} (q_i - \bar{q})(q_{i+h} - \bar{q})

of the autocovariance suggests itself, the estimate (14.14) with the factor 1∕𝐿
instead of 1∕(𝐿 − ℎ) is often used, since it can be shown to be biased with a bias
of (only) order 1∕𝐿 (thus being asymptotically unbiased) and it has the useful
property that its finite Fourier transform is nonnegative, among other properties
[3, Section 4.1].
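The autocorrelation function translates directly into Julia; note that the two 1∕L normalization factors cancel in the ratio, so they are omitted here.

```julia
# Sketch of ACF(L, h): sample autocovariance at lag h divided by the
# sample variance; the common 1/L normalization cancels in the ratio.
function acf(q::Vector{Float64}, h::Int)::Float64
    L  = length(q)
    qm = sum(q) / L                                 # sample mean
    num = sum((q[i] - qm) * (q[i+h] - qm) for i in 1:L-h)
    den = sum((qi - qm)^2 for qi in q)
    num / den
end
```

At lag zero the function returns one by construction, and a quickly decaying acf over increasing lags suggests good mixing.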

14.5.9 The Delayed-Rejection Adaptive-Metropolis (dram) Algorithm

We now revisit the question how the proposal distribution can be chosen auto-
matically to ensure expedient parameter scaling during the learning progress.
The answer is provided by adaptive Metropolis algorithms [1, 7, 8, 10] such as
the dram algorithm [6].
Since adaptive Metropolis algorithms change the proposal distribution using
the chain history, they violate the Markov property and no longer yield a Markov
process. Therefore establishing their convergence to the posterior distribution re-
quires further thought. Examples are criteria such as the diminishing-adaptation
condition and the bounded-convergence condition [1, 7, 8].
The Metropolis algorithm becomes adaptive in the dram algorithm in the
following manner. In the beginning, the covariance matrix is initialized as 𝑉1 =
𝐷 (diagonal) or 𝑉1 = 𝑉. Afterwards, the covariance matrix in the 𝑛-th step is
computed as
𝑉𝑛 ∶= 𝑠𝑝 (cov(𝑞1 , … , 𝑞𝑛 ) + 𝜖𝐼𝑝 ). (14.15)

Here 𝑝 is the dimension of the parameter space, and the parameter 𝑠𝑝 is commonly chosen to be 𝑠𝑝 ∶= 2.38²∕𝑝 [6]. The initial period without adaptation should be chosen long enough for the chain to be sufficiently diverse to make the covariance matrix nonsingular. The purpose of the second term 𝜖𝐼𝑝, where 𝐼𝑝 is the 𝑝-dimensional identity matrix and 𝜖 ≥ 0, is to ensure that 𝑉𝑛 is positive definite; it is often possible to set 𝜖 ∶= 0.
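For instance, the adapted covariance (14.15) can be computed directly from the chain history. In this hedged sketch the chain is synthetic and the regularization parameter eps_reg is set to a small positive value instead of zero.

```julia
# Sketch of (14.15): adapted proposal covariance from the chain history.
# The chain is synthetic; eps_reg is a small regularization parameter.
using LinearAlgebra, Statistics

p       = 2
s_p     = 2.38^2 / p               # common scaling factor [6]
eps_reg = 1e-10
chain   = [randn(p) for _ in 1:200]  # synthetic chain history
Q       = reduce(hcat, chain)        # p × n matrix of states

Vn = s_p * (cov(Q; dims=2) + eps_reg * I(p))
```

Computed this way, the cost grows with the chain length; the recursions derived below avoid that.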
The most straightforward way to calculate the covariance in the formula
above is to use the formula for the empirical covariance. However, this becomes
increasingly computationally expensive as 𝑛 increases. A much faster way is to
use recursive formulas.
First, the definition of and a recursive formula for the sample mean 𝑞̄𝑛 in the
𝑛-th step are
    \bar{q}_n := \frac{1}{n} \sum_{i=1}^{n} q_i = \frac{1}{n} q_n + \frac{n-1}{n} \bar{q}_{n-1},    (14.16)

which can be interpreted as a weighted average (see Problem 14.19).


Using the sample mean, the definition and a direct formula for the empirical
covariance in the 𝑛-th step are
    \mathrm{cov}(q_1, \dots, q_n) := \frac{1}{n-1} \sum_{i=1}^{n} (q_i - \bar{q}_n)(q_i - \bar{q}_n)^\top    (14.17a)

                                   = \frac{1}{n-1} \left( \sum_{i=1}^{n} q_i q_i^\top - n \bar{q}_n \bar{q}_n^\top \right)    (14.17b)

(see Problem 14.20). Based on this direct formula, the recursive formula

    \mathrm{cov}(q_1, \dots, q_n) = \frac{n-2}{n-1} \mathrm{cov}(q_1, \dots, q_{n-1}) + \frac{1}{n-1} q_n q_n^\top - \frac{n}{n-1} \bar{q}_n \bar{q}_n^\top + \bar{q}_{n-1} \bar{q}_{n-1}^\top    (14.18)
for the empirical covariance can be shown (see Problem 14.21).
Using this recursion for the empirical covariance cov(𝑞1 , … , 𝑞𝑛 ) occurring in
(14.15), we find the recursive formula
    V_n = \frac{1}{n-1} \left( (n-2) V_{n-1} + s_p \left( q_n q_n^\top - n \bar{q}_n \bar{q}_n^\top + (n-1) \bar{q}_{n-1} \bar{q}_{n-1}^\top + \epsilon I_p \right) \right)    (14.19)
equivalent to (14.15) above (see Problem 14.22).
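The recursions (14.16) and (14.18) can be sketched and checked against the direct formula as follows; the function update and its variable names are ours, introduced only for illustration.

```julia
# Sketch of the recursive updates (14.16) and (14.18): qbar and C are the
# sample mean and covariance of q_1, ..., q_{n-1}; qn is the new state.
using Statistics

function update(n::Int, qbar, C, qn)
    qbar_new = qn / n + (n - 1) / n * qbar                       # (14.16)
    C_new = (n - 2) / (n - 1) * C + (qn * qn') / (n - 1) -
            n / (n - 1) * (qbar_new * qbar_new') + qbar * qbar'  # (14.18)
    qbar_new, C_new
end

function run_recursion(Q::Vector{Vector{Float64}})
    p = length(Q[1])
    qbar, C = Q[1], zeros(p, p)   # for n = 2 the old C drops out anyway
    for n in 2:length(Q)
        qbar, C = update(n, qbar, C, Q[n])
    end
    qbar, C
end

Q = [randn(3) for _ in 1:200]
qbar, C = run_recursion(Q)
```

After the loop, C agrees with the empirical covariance (14.17a) of the whole sample, at a constant cost per chain step.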
The efficiency of the algorithm is further improved if the proposal distribution
is adapted only from time to time as in Algorithm 14.21 below.
Delayed rejection is the second aspect of the dram algorithm. It means that
another candidate state 𝑞′′ is constructed and given a chance instead of retain-
ing the previous value whenever a candidate state 𝑞 ′ has been rejected. This so-
called second-stage candidate 𝑞′′ can be chosen using the proposal function

    J_2(q'' \mid q_n, q') := N(q_n, \gamma^2 V_n),

where 𝑉𝑛 is the covariance matrix calculated in the adaptive part of the algo-
rithm above and 𝛾 ∈ (0, 1) is a constant [6]. Since the constant 𝛾 is smaller than
one, the proposal function 𝐽2 for the second state is narrower than the original
one, which increases mixing. A popular choice for 𝛾 is 1∕5.
This proposal function 𝐽2 must be accompanied by a matching acceptance
probability 𝐴2 in order to ensure that the detailed-balance condition is satisfied.
In Sect. 14.5.5, we used the detailed-balance condition to define a suitable accep-
tance probability after having decided which proposal distribution to use. If the
first proposed state 𝑞′ is accepted, the detailed-balance condition holds by the
calculations in Sect. 14.5.5. Otherwise, if it is rejected, the transition probability
is

𝑃(𝑞 ′′ |𝑞𝑛 ) = 𝑃(𝑞′ proposed)𝑃(𝑞 ′ rejected)𝑃(𝑞 ′′ proposed)𝑃(𝑞′′ accepted)


= 𝐽(𝑞 ′ |𝑞𝑛 )(1 − 𝐴(𝑞 ′ |𝑞𝑛 ))𝐽2 (𝑞 ′′ ∣ 𝑞𝑛 , 𝑞 ′ )𝐴2 (𝑞′′ ∣ 𝑞𝑛 , 𝑞 ′ ),

where 𝐴2 (𝑞 ′′ ∣ 𝑞𝑛 , 𝑞 ′ ) is the second-stage acceptance probability for the proposed


state 𝑞′′ after being in state 𝑞𝑛 and having proposed the state 𝑞 ′ in the first stage.
Substituting this transition probability 𝑃(𝑞′′ |𝑞𝑛 ) and the desired stationary distri-
bution 𝜔(𝑞) ∶= 𝜋(𝑞|𝑑), i.e., the posterior distribution, into the detailed-balance
condition, which now reads

𝑃(𝑞 ′′ |𝑞𝑛 )𝜔(𝑞𝑛 ) = 𝑃(𝑞𝑛 |𝑞 ′′ )𝜔(𝑞 ′′ ),

yields

𝜋(𝑞𝑛 |𝑑)𝐽(𝑞 ′ |𝑞𝑛 )(1 − 𝐴(𝑞 ′ |𝑞𝑛 ))𝐽2 (𝑞 ′′ ∣ 𝑞𝑛 , 𝑞 ′ )𝐴2 (𝑞′′ ∣ 𝑞𝑛 , 𝑞 ′ )


= 𝜋(𝑞 ′′ |𝑑)𝐽(𝑞 ′ |𝑞 ′′ )(1 − 𝐴(𝑞 ′ |𝑞 ′′ ))𝐽2 (𝑞𝑛 ∣ 𝑞 ′′ , 𝑞 ′ )𝐴2 (𝑞𝑛 ∣ 𝑞′′ , 𝑞 ′ )

and hence
    \frac{A_2(q'' \mid q_n, q')}{A_2(q_n \mid q'', q')} = \frac{\pi(q''|d)\, J(q'|q'')\, (1 - A(q'|q''))\, J_2(q_n \mid q'', q')}{\pi(q_n|d)\, J(q'|q_n)\, (1 - A(q'|q_n))\, J_2(q'' \mid q_n, q')} =: r.

We can proceed similarly to the case with only one stage in Sect. 14.5.5 to find
such an 𝐴2 . A suitable acceptance probability is

𝐴2 (𝑞 ′′ ∣ 𝑞𝑛 , 𝑞 ′ ) ∶= min(1, 𝑟), (14.20)

since
    \frac{A_2(q'' \mid q_n, q')}{A_2(q_n \mid q'', q')} = \frac{\min(1, r)}{\min(1, 1/r)} = r.
As you will have guessed, these ideas can be extended: third-stage, fourth-stage, etc. candidate states can be constructed recursively, together with their proposal densities and acceptance probabilities, as we just did.

In summary, combining both adapting the proposal distribution and delaying rejection, the dram algorithm is the following. The adaptive mechanism ensures that information learned about the posterior distribution is remembered in the long term as the chain progresses. The delayed-rejection part acts in the short term to improve mixing and to avoid stagnation of the chain.

Algorithm 14.21 (delayed-rejection adaptive Metropolis (dram))


1. Initialization: choose an initial state 𝑞1 (such that 𝜋(𝑞1|𝑑) > 0), e.g., as

       q_1 := \arg\min_q S(q);

choose the number 𝐾 of steps after which the proposal distribution is adapted; choose the parameter 𝜖; choose the initial covariance matrix 𝑉1 (diagonal or symmetric) in the proposal distribution; choose the factor 𝛾 (often 𝛾 ∶= 1∕5) for the second-stage proposal distribution; and set the iteration number 𝑛 ∶= 1.
If the likelihood function (14.6) is used, the variance 𝜎2 must be known. If
it is not known a priori as is often the case, there are two options:
a. use the empirical estimate
    \sigma^2 := \frac{1}{N - p} \sum_{i=1}^{N} (d_i - f_i(q))^2

for 𝜎𝑛², where 𝑁 is the number of observations and 𝑝 is the dimension of the unknown parameter vector, throughout the iteration or
b. sample it through realizations of the Markov chain; then the parameters
𝜎𝑠 and 𝑛𝑠 (usually 𝑛𝑠 ∈ [0.01, 1]) must be chosen during initialization.
2. Iteration:

a. Every 𝐾 steps, update the covariance matrix 𝑉𝑛 of the proposal distribution using (14.19).
b. Generate a first-stage candidate state

𝑞′ ∶= 𝑞𝑛 + 𝑅𝑛 𝑧

after choosing 𝑧 ∼ 𝑁(0, 𝐼), where 𝑅𝑛 is the Cholesky factorization of 𝑉𝑛 (ensuring 𝑞′ ∼ 𝑁(𝑞𝑛, 𝑉𝑛) because of Theorem 14.20).
c. If 𝜎² is sampled by the Markov chain, update 𝜎𝑛² as

       \sigma_n^2 \sim \mathrm{InvGamma}\left( \frac{n_s + N}{2}, \frac{n_s \sigma_s^2 + S(q_n)}{2} \right),

where InvGamma is the inverse-gamma distribution.



d. Calculate the first-stage acceptance probability

       A(q' \mid q_n) := \min\left(1, \frac{\pi(d|q')\, \pi_0(q')\, J(q_n|q')}{\pi(d|q_n)\, \pi_0(q_n)\, J(q'|q_n)} \right).

The first fraction 𝜋(𝑑|𝑞′)∕𝜋(𝑑|𝑞𝑛) can be simplified to

       \frac{\pi(d|q')}{\pi(d|q_n)} = \mathrm{e}^{(S(q_n) - S(q'))/(2\sigma_n^2)}

as in (14.13) if the likelihood has the form (14.6). The second fraction
𝜋0 (𝑞 ′ )∕𝜋0 (𝑞𝑛 ) is equal to one in the case of a uniform prior distribution.
The third fraction 𝐽(𝑞𝑛 |𝑞′ )∕𝐽(𝑞 ′ |𝑞𝑛 ) is equal to one if the proposal distri-
bution is symmetric.
e. Accept or reject the first-stage candidate 𝑞′ by generating a uniformly
distributed random number 𝑢 ∼ 𝑈(0, 1) from the interval [0, 1]. If 𝑢 ≤
𝐴(𝑞′ |𝑞𝑛 ), the candidate 𝑞′ is accepted and we set 𝑞𝑛+1 ∶= 𝑞′ .
f. If the first-stage candidate was rejected, accept or reject a second-stage
candidate.
i. Generate a second-stage candidate

𝑞 ′′ ∶= 𝑞𝑛 + 𝛾𝑅𝑛 𝑧,

where 𝑧 ∼ 𝑁(0, 𝐼) and 𝑅𝑛 is the Cholesky factorization of 𝑉𝑛 as above.
ii. Calculate the second-stage acceptance probability 𝐴2 (𝑞 ′′ ∣ 𝑞𝑛 , 𝑞 ′ )
using (14.20), where the fraction 𝜋(𝑞′′ |𝑑)∕𝜋(𝑞𝑛 |𝑑) can be simplified
to
           \frac{\pi(q''|d)}{\pi(q_n|d)} = \mathrm{e}^{(S(q_n) - S(q''))/(2\sigma_n^2)}
if the likelihood is given by (14.6).
iii. Accept or reject the second-stage candidate 𝑞 ′′ by generating a uni-
formly distributed random number 𝑢 ∼ 𝑈(0, 1) from the interval
[0, 1]. The next state becomes

           q_{n+1} := \begin{cases} q'', & u \le A_2(q'' \mid q_n, q'), \\ q_n, & u > A_2(q'' \mid q_n, q'). \end{cases}

g. Increase the iteration number 𝑛 ∶= 𝑛 + 1.

3. Iterate until the chain is long enough to estimate the parameter 𝑞 after dis-
carding a sufficiently long burn-in period at the beginning. Compute any
statistic of interest from the Markov chain without the burn-in period.

We close the discussion of the algorithm with a comment on how to treat the
variance 𝜎2 on the right side in (14.6) as a random parameter to be sampled by
the Markov chain. The likelihood
    \pi(d, q \mid \sigma^2) := \frac{1}{(2\pi\sigma^2)^{N/2}} \, \mathrm{e}^{-S(q)/(2\sigma^2)}

is, regarded as a function of 𝜎², proportional to an inverse-gamma density. Note that 𝜋(𝑑, 𝑞 ∣ 𝜎²) is different from 𝜋(𝑑|𝑞). The conjugate prior is
    \pi_0(\sigma^2) \propto (\sigma^2)^{-(\alpha+1)} \, \mathrm{e}^{-\beta/\sigma^2}
with the two parameters 𝛼 and 𝛽. The posterior density is
    \pi(\sigma^2 \mid d, q) \propto (\sigma^2)^{-(\alpha+1+N/2)} \, \mathrm{e}^{-(\beta + S(q)/2)/\sigma^2},

implying that

    \sigma^2 \mid (d, q) \sim \mathrm{InvGamma}(\alpha + N/2,\ \beta + S(q)/2) = \mathrm{InvGamma}\left( \frac{n_s + N}{2}, \frac{n_s \sigma_s^2 + S(q)}{2} \right)

with the two new parameters 𝑛𝑠 ∶= 2𝛼 and 𝜎𝑠2 ∶= 𝛽∕𝛼. The parameter 𝑛𝑠 can
be interpreted as the number of observations used in the prior distribution, and
the parameter 𝜎𝑠2 is the mean squared error of the observations [4]. Usually 𝑛𝑠 is
chosen to be small, which corresponds to a noninformative prior distribution.
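One such update can be sketched with the external Distributions.jl package, whose InverseGamma(shape, scale) parameterization matches the formula above; all numeric values below are illustrative stand-ins, not values from the example in the text.

```julia
# Sketch of one σ² update as in step 2c of Algorithm 14.21.
# Requires the external package Distributions.jl; values are illustrative.
using Distributions

n_s      = 1.0     # prior "number of observations" (noninformative choice)
sigma_s2 = 0.05^2  # prior mean squared error
N        = 50      # number of observations
S_qn     = 0.12    # residual sum of squares S(qn)

sigma2 = rand(InverseGamma((n_s + N) / 2, (n_s * sigma_s2 + S_qn) / 2))
```

Repeating this draw in every iteration of the chain yields posterior samples of 𝜎² alongside the parameter samples.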

14.6 Julia Packages

The packages under the Turing umbrella provide Bayesian inference with general-purpose probabilistic programming. Similarly, the Mamba package implements Markov-chain Monte Carlo methods.

14.7 Bibliographical Remarks

A standard textbook on Bayesian inference is [4]. A very good introduction to


Bayesian techniques and uncertainty quantification in general can be found in
[9].

Problems

14.1 Prove Theorem 14.2.

14.2 Prove Theorem 14.3.

14.3 Consider the example of a fair die, all of whose six faces have the probability 1∕6. Write down and sketch the probability density and the cumulative probability distribution as discussed in Sect. 14.2. What are the points of discontinuity
and the probabilities 𝑃(𝑋 = 𝑥𝑖 )? At which points is 𝐹𝑋 continuous from the left?
At which points from the right? Furthermore, calculate the expected values 𝔼[𝑋]
and 𝔼[𝑋 2 ] as well as the variance 𝕍[𝑋] = 𝔼[(𝑋 − 𝔼(𝑋))2 ] as Riemann–Stieltjes
integrals.

14.4 For the example at the end of Sect. 14.4, plot the iterated posterior probabil-
ity for different values for the initial prior probability and the likelihood.

14.5 Use separation of variables to solve the logistic equation (14.4).

14.6 Find an example of an irreducible Markov chain and of a reducible one.

14.7 Find an example of a reducible Markov chain with two different stationary
distributions.

14.8 Find an example of an aperiodic Markov chain and of a periodic one.

14.9 Find an example of a periodic Markov chain whose limiting distribution


lim𝑛→∞ 𝑝𝑛 does not exist.

14.10 Prove Theorem 14.15.

14.11 Prove Theorem 14.19.

14.12 Prove Theorem 14.20.

14.13 Implement the special form of the acceptance probability for the case of a
normally distributed likelihood.

14.14 Write a function that – given a Markov chain and the length of the burn-in
period – calculates a histogram (given the bin width), the maximum-a-posteriori
(map) estimate, and a (symmetric) confidence interval around the map estimate
based on the histogram and given the percentage of samples to be found in the
confidence interval.

14.15 Implement a multidimensional version of the Metropolis–Hastings algo-


rithm in Sections 14.5.5 and 14.5.6.

14.16 Investigate how the parameters of the Metropolis–Hastings algorithm affect the results by means of a parameter-estimation problem of your choice.

14.17 Extend the implementations of the Markov-chain Monte Carlo algorithm


to calculate and return the acceptance ratio. Observe in an example how the
proposal distribution affects the acceptance ratio.

14.18 Extend the implementations of the Markov-chain Monte Carlo algorithms


to calculate and return the autocorrelation. Observe in an example how the pro-
posal distribution affects the autocorrelation.

14.19 Prove equation (14.16).

14.20 Prove equation (14.17b).

14.21 Prove equation (14.18).

14.22 Prove equation (14.19).

14.23 Implement the dram algorithm Algorithm 14.21 for a one-dimensional


parameter.

14.24 Implement the dram algorithm Algorithm 14.21 for parameter vectors.

14.25 Consider the example of estimating parameters in the logistic equation,


choose the value 𝜎2 in (14.6), and use both variants of the dram algorithm for
known and unknown 𝜎2 to estimate a parameter.
1. Do the parameter values found by both variants differ?
2. When using the variant for unknown 𝜎2 while pretending to not know the
value of 𝜎2 , is the true value of 𝜎2 approximated?
3. How do 𝑛𝑠 and 𝜎𝑠2 affect the results?

14.26 Investigate how the parameters of the dram algorithm affect the results by means of a parameter-estimation problem of your choice.

14.27 Compare the performance of the Metropolis–Hastings and the dram al-
gorithms. Which one is easier to use?

14.28 Second-order ordinary differential equations describe physical systems


such as spring-mass systems and electrical circuits.
1. Implement a numerical method in Chap. 9 to solve the (general form of the)
second-order ordinary differential equation

𝑦 ′′ (𝑡) + 𝑞1 𝑦 ′ (𝑡) + 𝑞2 𝑦(𝑡) = 0, 𝑦(0) = 𝑞3 , 𝑦 ′ (0) = 𝑞4 (14.21)

with the four unknown parameters 𝑞 = (𝑞1 , 𝑞2 , 𝑞3 , 𝑞4 ) and use it to generate


synthetic measurements as in Sect. 14.5.
2. Use the Metropolis–Hastings and dram algorithms
a. to estimate one of the parameters,
b. to estimate two of the parameters,

c. to estimate three of the parameters, and


d. to estimate all four parameters,
and compare their performance. Which one is easier to use?
3. Investigate how the parameters of the algorithms affect the results.
4. Compare the results for known and unknown 𝜎2 .

References

1. Andrieu, C., Thoms, J.: A tutorial on adaptive MCMC. Statistics and Computing 18, 343–
373 (2008)
2. Brooks, S., Roberts, G.: Convergence assessment techniques for Markov chain Monte
Carlo. Statistics and Computing 8(4), 319–335 (1998)
3. Chatfield, C.: The Analysis of Time Series, 6th edn. Chapman & Hall (2003)
4. Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., Rubin, D.: Bayesian Data Anal-
ysis, 3rd edn. Taylor & Francis Group, Boca Raton, FL (2013)
5. Golberg, M., Cho, H.: Introduction to Regression Analysis. WIT Press, Southampton, UK
(2004)
6. Haario, H., Laine, M., Mira, A., Saksman, E.: DRAM: efficient adaptive MCMC. Statistics
and Computing 16(4), 339–354 (2006)
7. Haario, H., Saksman, E., Tamminen, J.: An adaptive Metropolis algorithm. Bernoulli 7(2),
223–242 (2001)
8. Roberts, G., Rosenthal, J.: Examples of adaptive MCMC. Journal of Computational and
Graphical Statistics 18(2), 349–367 (2009)
9. Smith, R.C.: Uncertainty Quantification. SIAM, Philadelphia, PA (2014)
10. Vihola, M.: Robust adaptive Metropolis algorithm with coerced acceptance rate. Statistics
and Computing 22(5), 997–1008 (2012)
Index

ablation study, 322 blas, 6, 221


acceptance probability, 314, 413, 414, 425 Boltzmann probability distribution, 312
Metropolis, 413 Boltzmann constant, 312
acceptance ratio, 422 bounded, 270
accuracy, 379 bounds check, 147
Ackermann function, 37 burn-in period, 415, 416, 419, 422, 427
activation, 367 Butcher tableau, 245, 247, 248
adjoint, 155
angle, 180 call
approximation, 371 by reference, 27
low-rank, 220 by sharing, 27
argument by value, 27
keyword, 32 cancellation, 193, 198, 199
list, 94 canonical ensemble, 312
optional, 31 car, 87
splicing, 34 Cauchy–Bunyakovsky–Schwarz inequality,
variable number of, 34, 94 179, 180, 221, 272, 330, 338, 378
Armijo condition, 347 cdr, 87
assembler, 147 Céa’s lemma, 297
assertion, 33 channel
assignment, 72 buffered, 119
autocorrelation, 422 remote, 123
autocovariance, 423 unbuffered, 116, 120
characteristic polynomial, 205, 206
backpropagation, 382, 384 charge density, 261
basis, 172 classification, 368
change, 176 Clojure, 106
orthogonal, 180 clos, 7
orthonormal, 180 closure, 46
standard, 176 codomain, 17
batch, 379 coercive, 270
Bayesian, 402 collect, 141
Bayes’ theorem, 401 column-major order, 170
benchmark problems, 323 Common Lisp, 4, 42, 99, 106, 132, 141, 146,
Bernstein–von-Mises theorem, 405 151, 152
bias, 367 commutative diagram, 178
bijection, 174, 176 comparison, 72

© Springer Nature Switzerland AG 2022 433


C. Heitzinger, Algorithms with JULIA,
https://doi.org/10.1007/978-3-031-16560-3

compiler, 6, 147 validation, 387–389


compressed sparse column, 166 deep learning, 367
computing deflation, 215
distributed, 121, 123 defmacro, 133
parallel, 121, 123 delayed rejection, 424, 426
sequential, 118 destructuring, 27
concave function, 333 detailed balance, 411, 413, 415, 425
condition, 121 differential evolution, 327
boundary, 266, 274, 275 diffusion, 263, 274
Dirichlet, 266, 293 dimension, 172
homogeneous, 267 Dirac delta distribution, 291
inhomogeneous, 267 directional directive, 329, 378
mixed, 267 discontinuity, 399
Neumann, 266, 293 discretization, 279, 280
Robin, 267 compact, 282
initial, 266, 274, 275 divergence, 260
confidence interval, 402 form, 284
cons, 87 theorem, 264, 276, 284
conservation law, 275 do block, 113
conservation of fluxes, 285, 288 do until, 104
constructor, 86 documentation, 11
inner, 86 string, 52, 149
outer, 86 domain, 17
control flow, 115
local, 109 early stopping, 388
non-local, 109, 110 eigenbasis, 207
control volume, 284 eigenspace, 206
convergence, 298, 299, 339–341, 359 generalized, 212
in distribution, 410 eigenvalue, 204
rate, 341 electric displacement field, 261
gradient descent, 336 electric field, 261
linear, 336 element, 173
optimal, 342 elementwise, 163, 164
quadratic, 341, 342, 350 Emacs, 14
superlinear, 355 epoch, 379
convex equality, 163
function, 333 equation
set, 333 difference, 23
cooling, 314 elliptic, 258
coordinate, 172 hyperbolic, 258
Coulomb’s law, 262 strictly, 276
cpu core, 121 logistic, 406
Cramer’s rule, 186 Maxwell, 260
credible interval, 403 parabolic, 258
cross entropy, 377 Poisson, 260, 285
cross product, 184, 221 equivalence relation, 177
current density, 261 error, 406
curvature condition, 347 escape, 138, 139
Euclidean space, 172
data structure Euler method
circular, 88 backward, 237, 241, 246
dataset, 374 forward, 237, 241, 243, 246, 247, 252
test, 389 improved, 240, 241, 243, 246, 247
training, 377, 379, 389 evaluation, 61, 63

evaluation time, 147 hat, 295


evolutionary computation, 317 hyperbolic tangent, 368
existence, 231 kernel, 181
expectance, 400 linear, 171, 173
expected value, 400 logistic, 367, 407
exponential growth, 230 loss, 377
expression 𝐿-smooth, 336
quoted, 135 mathematical, 17
extremum method, 19, 26
global, 330 nullspace, 181
local, 330 objective, 377
rectifier, 368
factorization leaky, 367
Cholesky, 194, 414, 416, 426, 427 smooth, 368
diagonalization, 209 sigmoid, 367
eigenfactorization, 211 test, 268, 294, 296
Hessenberg, 216 vectorized, 164
𝐿𝑈, 189, 191, 216 functional, 29
𝑄𝑅, 197–199, 214, 216 fundamental lemma of variational calculus,
Schur, 213 264, 268, 276, 294
svd decomposition, 218
generalized, 221 garbage collection, 126
feed Gaussian elimination, 190
backward, 385 generalization, 388
forward, 371, 385 generator, 69, 73
Fermat’s theorem, 331 genetic algorithm, 318
Fibonacci sequence, 36, 144 Gershgorin circle theorem, 215
definition, 17 Givens rotations, 198
d’Ocagne’s identity, 36 go to, 108
identities, 36 gradient, 259, 329
identity, 36 ascent, 335, 378
recursive formula, 23, 25 descent, 335, 378
Fick’s first law, 264 stochastic, 377
finite volume, 284 neural network, 382
flux, 263, 275 Gram–Schmidt orthogonalization, 198
density, 264 Green’s first identity, 294
folding, 65, 67 grid point, 276
Fortran, 4
Fourier’s law, 265 handwriting, 374
frequentist, 402 hash function, 72
function heat conduction, 265, 274
affine, 367 help mode, 11
anonymous, 29 homoiconicity, 7, 151
argument, 94 Householder
composition, 175 reflection, 199
cost, 369, 377, 389, 392 transformation, 198
cross entropy, 377, 393 hybrid algorithm, 323
quadratic, 377, 394 hyperparameter, 386
definition, 18, 45
elementary, 236 identity matrix, 206
factorial, 68 IJulia, 14
generated, 149 image
generic, 19, 26 classification, 367
Green, 263 recognition, 367, 373, 374

immutable, 52 destructuring, 106


index enumerate, 107
linear, 159, 162 nested, 106
logical, 161
initialization, 85 machine learning, 366
inlining, 150 macro, 249
inner product, 179 anaphoric, 151
canonical, 179 hygienic, 134, 137, 140
inspection, 58 unhygienic, 134, 136
instance, 83 macroexpand, 133
integral macroexpand-1, 133
equation, 231 Macsyma, 4
one-dimensional, 29 magic square, 157
Riemann, 29, 398 magnetic field, 261
Riemann–Stieltjes, 398 magnetizing field, 261
integrand, 398 mapping, 30, 70, 128
integrator, 398 mapreduce, 31
interactivity, 6 Markov
inverse problem, 405 chain, 409, 412, 422
isometry, 198 aperiodic, 411
ɃʲȕʝǤʲȕ, 104 homogeneous, 409
iteration, 65 irreducible, 411
periodic, 411
Jacobian matrix, 276 reducible, 411
job, 118, 123 reversible, 411, 415
Julia, 5 kernel, 409
command-line arguments, 10 property, 409
configuration file, 10 matlab, 4, 5
exiting, 9 matrix, 173
starting, 9 bidiagonal, 182
version, 12 conjugate, 177
conjugate transpose, 155, 180
lapack, 6, 221 defective, 209, 212
law of total probability, 401 determinant, 192, 208
Lax–Milgram theorem, 270 diagonal, 182
layer diagonalizable, 209
hidden, 365, 367 Hermitian, 180, 183, 194, 208, 221
input, 365 Hermitian conjugate, 180
output, 365 Hessian, 332, 351, 352, 355, 356
learning rate, 378 Hilbert, 49
least-squares problem, 195, 196, 201 identity, 176
lifetime, 116 inverse, 176, 189, 208
likelihood, 403, 428 Jacobi, 351
line search, 346, 352 Jordan, 210
linear dependence, 187 Jordan normal form, 211, 212
linear independence, 172 negative definite, 208
Lipschitz negative semidefinite, 208
constant, 341 norm, 220
continuous, 336 normal, 209
Lisp, 4, 87, 132, 151 nullity, 181
list, 87 orthogonal, 198
loop positive definite, 180, 194, 208
break, 107 positive semidefinite, 208
continue, 108 rank, 181
  regular, 185, 187, 189, 208, 220
  self-adjoint, 180, 183
  similar, 177, 210, 211, 213–216, 225
  singular, 185, 187
  sparse, 279
  square, 185, 187
  symmetric, 180, 183
  symmetric tridiagonal, 183
  trace, 208
  transition, 409
  transpose, 180
  triangular, 183, 189
  tridiagonal, 183
  unitary, 198, 208
  upper-Hessenberg, 216
Maxima, 4
mean inequality, 302
measurement, 406
  synthetic, 407
memoization, 25, 142
memory usage, 147
Mersenne prime numbers, 20
metaprogramming, 151
Metropolis algorithm, 313, 414
  adaptive, 423, 426
Metropolis–Hastings algorithm, 414, 422
midpoint method, 246
mnist, 374
mode, 421
model, 406
  atomistic, 262
  continuum, 262
module, 39
  import, 40
  nest, 40
  replace, 40
Monte Carlo, 127, 311
multiple return values, 27
multiplication
  matrix-matrix, 175
  matrix-vector, 174
multiplicity
  algebraic, 206
  geometric, 206
Newton method, 32, 349
  quasi-, 351, 359
no-free-lunch theorem, 310, 311
noise, 406
norm, 179, 195
  Euclidean, 377
  Frobenius, 355
normal equations, 197, 202
number
  floating-point, 23
objective, 402
operator
  Laplace, 263, 274, 281
  nabla, 259, 274
optimization
  stochastic, 340
order, 229, 258
overfitting, 387, 390
packages, 12
parallelepiped, 185, 224
parallelogram, 184
parameter estimation, 405
Pareto ranking, 320
parser, 60, 61
particle-swarm optimization, 317
partition, 398, 401
𝜋, 127
Picard’s iteration method, 231
Picard’s theorem, 231
pivot element, 190, 193, 199
Poincaré inequality, 273
point
  critical, 331
  inflection, 331
  saddle, 331
    (non-)degenerate, 332
  stationary, 331
polymorphism, 80
pretty printing, 96
probability
  conditional, 400
  density, 399
    posterior, 403, 405, 412
    prior, 403, 405
  distribution, 399
    equilibrium, 410
    improper, 404
    noninformative, 404, 428
    proposal, 413, 415, 423
    stationary, 409, 410, 412
producer-consumer problem, 115
pseudoinverse, 197
  Moore–Penrose, 202
𝑄𝑅 iteration, 213
  implicit, 216
  shifted, 215
quantity of interest, 406
queue, 115
quine, 152
quote, 60, 62
race condition, 124
random variable, 399
rank-nullity theorem, 181, 188
reader macro, 146
reduction, 31, 65
regression, 195, 368
regularity, 273
regularization, 389
remote call, 121
remote reference, 121
repeat, 139
repl, 10
reproducibility, 5
residuum, 195
resource, 113
Riesz representation theorem, 270
Risch algorithm, 236
rotation, 177, 260
Runge–Kutta–Fehlberg method, 247
Runge–Kutta method, 243
  adaptive, 246
  explicit, 244
  four-stage, 244, 246, 252
  implementation, 248
  implicit, 244
scheduler, 117
scope, 39, 99, 101
  global, 39
  local, 41
    hard, 43
    soft, 43
  rules
    hard local, 44
    soft local, 45
scoping
  dynamic, 41
  lexical, 41
semantic versioning, 56
shell mode, 11
simd, 150
Simpson’s rule, 244, 287, 303
simulated annealing, 313
solution
  classical, 231, 266
  existence, 272
  fundamental, 263
  pointwise, 266
  uniqueness, 265, 272
  weak, 268, 293, 294
    existence, 270
    uniqueness, 270
sort, 54
sorting, 29, 30
span, 172
special form, 132
spectrum, 215
stability, 298
statistical mechanics, 311
step size, 346
stride, 169
string literal, 56, 144
subjective, 402
substitution
  backward, 189, 202
  forward, 189
subtype, 110
sufficient-decrease, 346
SuiteSparse, 6
summation, 67
supervised learning, 366
swarm, 315
symbol
  ampersand, 102, 103
  apostrophe, 155, 180
  asterisk, 57
  at sign, 132
  backquote, 57, 133
  backslash, 186
  colon, 60, 62, 68, 80, 157, 160
  comma, 94, 154
  curly bracket, 73, 89
  decimal point, 59
  decimal comma, 59
  dollar sign, 54, 135
  dot, 40, 84, 163
  double quote, 52
  ellipsis, 33, 34
  equal, 84
  exclamation mark, 31, 65, 74, 154
  newline character, 177
  number sign, 32
  parenthesis, 94
  semicolon, 32, 33, 99, 112, 154, 177
  single quote, 51
  space, 154
  square bracket, 64, 154, 159, 161
  triple double quote, 52
  uninterned, 135
  vertical bar, 102, 103
symmetric coroutine, 117
system
  32-bit, 19
  64-bit, 19
system of linear equations, 163
  homogeneous, 187
  inhomogeneous, 188
  overdetermined, 187, 195, 201
  square, 188
  underdetermined, 187
Taylor expansion, 277
  multivariate, 332, 352
Taylor series, 68, 71
Taylor’s theorem, 277, 278, 282, 285
ternary operator, 18, 101
test, 280
thermal conduction, 265
thermal conductivity, 265, 274
transition
  kernel, 409
  probability, 409, 413, 425
trapezoid rule, 241, 244
triangulation, 295
truncation error
  global, 238, 239, 244
  local, 238, 241, 244, 277, 282, 283
tuple, 94
  named, 95
tuples, 27
type
  abstract, 63, 80, 82, 109, 168
    parametric, 92
  annotation, 80
  composite, 83
    immutable, 84
    mutable, 84
    parametric, 89
  concrete, 80, 82, 169
  conversion, 20
  graph, 82
  hierarchy, 82
  of return value, 21
  of argument, 21
  operation, 97
  parameter, 80
  parametric, 89
  self-referential, 87
  subtype, 22, 82, 90, 92, 94, 97, 109
  supertype, 22, 82, 92, 97
  system
    dynamic, 79
    static, 79
  union, 89
uncertainty, 402, 406
Unicode, 51
uniformly elliptic, 272
uniqueness, 231
variable capture, 135, 138
vector, 154
  column, 154, 173
  eigenvector, 204
    generalized, 212
    Jordan chain, 212
  orthogonal, 180, 198
  orthonormal, 198
  product, 184
  row, 154
  singular, 219
  space, 171
version number, 56
view, 163, 165, 170
War Games, 58
wave, 275
weight, 367
white space, 52
Wilkinson shift, 215
Wolfe conditions, 347
worker, 122, 123
