
MATH-UA 9252: Numerical Analysis

Urbain Vaes
[email protected]

NYU Paris, Spring term 2022

Weekly schedule:

• Lectures on Tuesday and Thursday from 14:15 to 15:30 (Paris time) in room 406;

• Recitation on Thursday from 15:45 to 17:15 (Paris time) in room 406;

• Office hour on Tuesday, starting ten minutes after the lecture.


License

The copyright of these notes rests with the author and their contents are made available under
a Creative Commons “Attribution-ShareAlike 4.0 International” license. You are free to copy,
distribute, transform and build upon the course material under the following terms:

• Attribution. You must give appropriate credit, provide a link to the license, and indicate
if changes were made. You may do so in any reasonable manner, but not in any way that
suggests the licensor endorses you or your use.

• ShareAlike. If you remix, transform, or build upon the material, you must distribute
your contributions under the same license as the original.

Course syllabus

Course content. This course is aimed at giving a first introduction to classical topics in numer-
ical analysis, including floating point arithmetic and round-off errors, the numerical solution
of linear and nonlinear equations, iterative methods for eigenvalue problems, interpolation and
approximation of functions, and numerical quadrature. If time permits, we will also cover
numerical methods for solving ordinary differential equations.

Prerequisites. The course assumes a basic knowledge of linear algebra and calculus. Prior
programming experience in Julia, Python or a similar language is desirable but not required.

Study goals. After the course, the students will be familiar with the key concepts of stability,
convergence and computational complexity in the context of numerical algorithms. They will
have gained a broad understanding of the classical numerical methods available for performing
fundamental computational tasks, and be able to produce efficient computer implementations
of these methods.

Education method. The weekly schedule comprises two lectures (2× 1h15 per week) and
an exercise session (1h30 per week). The course material includes rigorous proofs as well as
illustrative numerical examples in the Julia programming language, and the weekly exercises
blend theoretical questions and practical computer implementation tasks.

Assessment. Computational homework will be handed out on a weekly or biweekly basis, each
of them focusing on one of the main topics covered in the course. The homework assignments
will count towards 70% of the final grade, and the final exam will count towards 30%.

Literature and study material. A comprehensive reference for this course is the following
textbook: A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics, volume 37 of
Texts in Applied Mathematics. Springer-Verlag, Berlin, 2007. Other pointers to the literature
will be given within each chapter.

Acknowledgments

I am grateful to Vincent Legat, Tony Lelièvre, Gabriel Stoltz and Paul Van Dooren for letting
me draw freely from their lecture notes in numerical analysis. I would also like to thank the
following students who found errors and typos in the lecture notes: Anthony Francis, Claire
Chen, Jinqiao Cheng, Marco He, Wenye Jiang, Tingche Lyu, Nikki Tai, Alice Wang and Anita
Ye.

List of examinable proofs

This list collects the examinable results for this course. It will grow and may be modified as
the course progresses. You don’t have to remember the statements of the results, but should
be able to prove them given the statements.

Chapter 2
• Proposition 2.1 (upper bound on the condition number with respect to perturbations of
the right-hand side);

• Proposition 2.2 (upper bound on the condition number with respect to perturbations of
the matrix), if you are given the preceding Lemma 2.3;

• Lemma 2.5 (explicit expression of the matrix L given the parameters of the Gaussian
transformations).

• Proposition A.9, in the case where A is diagonalizable (equivalence between ρ(A) < 1 and
the convergence ‖A^k‖ → 0 as k → ∞).

• Proposition 2.10, taking Gelfand's formula for granted (convergence of the general
splitting method).

• Derivation of the optimal ω in Richardson’s method.

• Proposition 2.11 (convergence of Jacobi's method in the case of a strictly diagonally
dominant matrix).

• Proposition 2.12 and Corollary 2.13 (convergence of the relaxation method for Hermitian
and positive definite A).

• Theorem 2.17 given the Kantorovich inequality (convergence of the steepest descent
method).

Chapter 3
• Theorem 3.2 (global exponential convergence of the fixed point iteration).

• Proposition 3.4 (local exponential convergence of the fixed point iteration under a local
Lipschitz condition).

• Proposition 3.5 (local exponential convergence given a bound on the Jacobian matrix).

• Proposition 3.6 (superlinear convergence of fixed point iteration when the Jacobian is zero
at the fixed point).

Chapter 4
• Proposition 4.1 (Convergence of the power iteration).

Chapter 5
• Theorem 5.2 (Interpolation error).

• Corollary 5.3 (Corollary following from Theorem 5.2).

• Theorem 5.4 (Monic polynomial with minimum ∞ norm).

• Corollary 5.5 (Derivation of Chebyshev nodes).

• Derivation of normal equations (in text).

Chapter 6
• Derivation of the Newton–Cotes integration rules (in text).

• Proof of the error estimates for the composite trapezoidal and Simpson rules.

Contents

1 Floating point arithmetic 4


1.1 Binary representation of real numbers . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Set of values representable in floating point formats . . . . . . . . . . . . . . . . . 7
1.3 Arithmetic operations between floating point formats . . . . . . . . . . . . . . . . 11
1.4 Encoding of floating point numbers � . . . . . . . . . . . . . . . . . . . . . . . . 17
1.5 Integer formats � . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.6 Discussion and bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2 Solution of linear systems of equations 21


2.1 Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Direct solution method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Iterative methods for linear systems . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4 Discussion and bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

3 Solution of nonlinear systems 63


3.1 The bisection method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2 Fixed point methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3 Convergence of fixed point methods . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.4 Examples of fixed point methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.6 Discussion and bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4 Numerical computation of eigenvalues 78


4.1 Numerical methods for eigenvalue problems: general remarks . . . . . . . . . . . 79
4.2 Simple vector iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3 Methods based on a subspace iteration . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4 Projection methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.6 Discussion and bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5 Interpolation and approximation 98


5.1 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121


5.4 Discussion and bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

6 Numerical integration 124


6.1 The Newton–Cotes method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.2 Composite methods with equidistant nodes . . . . . . . . . . . . . . . . . . . . . 126
6.3 Richardson extrapolation and Romberg’s method . . . . . . . . . . . . . . . . . . 130
6.4 Methods with non-equidistant nodes . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.6 Discussion and bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

A Linear algebra 138


A.1 Inner products and norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
A.2 Vector norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
A.3 Matrix norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
A.4 Diagonalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
A.5 Similarity transformation and Jordan normal form . . . . . . . . . . . . . . . . . 145
A.6 Oldenburger’s theorem and Gelfand’s formula . . . . . . . . . . . . . . . . . . . . 146

B Brief introduction to Julia 148

Introduction

Goals of computer simulation


In a wide variety of scientific disciplines, ranging from physics to biology and economics, the
phenomena under consideration are well described by mathematical equations. More often than
not, it is too difficult to solve these equations analytically, and so one has to resort to computer
calculations in order to obtain approximate solutions. Computer simulation makes it possible
to gain understanding of the phenomena examined, to explain observations and to make predictions.
It plays a crucial role in a number of practical applications including weather forecasting, drug
discovery through molecular modeling, flight simulation, and structural engineering, to mention
just a few.
Numerical simulation may also be employed in order to calibrate mathematical models of
physical phenomena, particularly when observation through experiment is impractical or too
costly. For example, it is frequently the case that the parameters in mathematical models for
turbulence are estimated not from real data, but from synthetic data generated by computer
simulation of the fundamental equations of fluid mechanics. Relying on “computer experiments”
is attractive in this context because they make it possible to perform accurate measurements
without disturbing the system being observed. Numerical simulation is also very useful for
understanding and building simplified models of physical phenomena at very small scales, when
direct observation is beyond the capabilities of experimental physics.

Sources of error in computational science


It is important for practitioners of computer simulation to be aware of the different sources
of error likely to affect numerical results obtained in applications, which may be classified as
follows:

• Modeling error. There may be a discrepancy between the mathematical model and the
underlying physical phenomenon.

• Data error. The data of the problem, such as the initial conditions or the parameters
entering the equations, are usually known only approximately.

• Discretization error. The discretization of mathematical equations, i.e. turning them
into finite-dimensional problems amenable to computer simulation, adds another source
of error.


• Discrete solver error. The method employed to solve the discretized equations, espe-
cially if it is of an iterative nature, may also introduce an error.

• Round-off errors. Finally, the limited accuracy of computer arithmetic causes addi-
tional errors.

Of these, only the last three are in the domain of numerical analysis, and in this course we
focus mainly on the solver and round-off errors. The order of magnitude of the overall error is
dictated by the largest among the above sources of error.

Aims of this course


The aim of this course is to present the standard numerical methods for performing the tasks
most commonly encountered in applications: the solution of linear and nonlinear systems of
equations, the solution of eigenvalue problems, interpolation and approximation of functions,
and numerical integration. For a given task, there are usually several numerical methods to
choose from, and these often include parameters which must be set appropriately in order to
guarantee good efficiency. In order to guide these choices, we study carefully the convergence
and stability of the various methods we present. Six topics will be covered in these lecture notes.

• Floating point arithmetic. In Chapter 1, we discuss how real numbers are represented,
manipulated and stored on a computer. There is an uncountable infinity of real numbers,
but only a finite subset of these can be represented exactly on a machine. This subset
is specified in the IEEE 754 standard, which is widely accepted today and employed in
most programming languages, including Julia.

• Solution of linear systems. In Chapter 2, we study the standard numerical methods
for solving linear systems. Linear systems are ubiquitous in science, often arising from
the discretization of linear elliptic partial differential equations, which themselves govern
a large number of physical phenomena including heat propagation, electromagnetism,
gravitation and the deformation of solids.

• Solution of nonlinear equations. In Chapter 3, we present widely used methods for
solving nonlinear equations. Like linear equations, nonlinear equations are omnipresent in
science, a prime example being the Navier–Stokes equation describing the motion of fluid
flows. They are usually much more difficult to solve and require dedicated techniques.

• Solution of eigenvalue problems. In Chapter 4, we present and study the standard
iterative methods for calculating the eigenvectors and eigenvalues of a matrix. Eigen-
value problems have a large number of applications, for instance in quantum physics and
vibration analysis. They are also at the root of the PageRank algorithm for ranking web
pages, which played a key role in the early success of Google search.

• Interpolation and approximation of functions. In Chapter 5, we focus on the topics
of interpolation and approximation. Interpolation is concerned with the construction of
a function within a given set, for example that of polynomials, that takes given values
when evaluated at a discrete set of points. The aim of approximation, on the other hand,
is usually to determine, within a class of simple functions, which one is closest to a given
function. Depending on the metric employed to measure closeness, this may or may not
be a well-defined problem.

• Numerical integration. In Chapter 6, we study numerical methods for computing
definite integrals. This chapter is strongly related to the previous one, as numerical
approximations of the integral of a function are often obtained by first approximating the
function, say by a polynomial, and then integrating this approximation exactly.

Why Julia?
Throughout the course, the Julia programming language is employed to exemplify some of the
methods and key concepts. In the author’s opinion, the Julia language has several advantages
compared to other popular languages in the context of scientific computing, such as Matlab
or Python.

• Its main advantage over Matlab is that it is free and open source, with the benefit that it
receives contributions from a large community of developers around the world. Addi-
tionally, Julia is a fully-fledged programming language that can be used for applications
unrelated to mathematics.

• Its main advantages over Python are significantly better performance and a more concise
syntax for mathematical operations, especially those involving vectors and matrices. It
should be recognized, however, that although use of Julia is rapidly increasing, Python
still enjoys a more mature ecosystem and is much more widely used.

Chapter 1

Floating point arithmetic

1.1 Binary representation of real numbers . . . . . . . . . . . . . . . . . . 5


1.1.1 Conversion between binary and decimal formats . . . . . . . . . . . . . 6
1.1.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Set of values representable in floating point formats . . . . . . . . . . 7
1.2.1 Denormalized floating point numbers . . . . . . . . . . . . . . . . . . . . 9
1.2.2 Relative error and machine epsilon . . . . . . . . . . . . . . . . . . . . . 9
1.2.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Arithmetic operations between floating point formats . . . . . . . . . 11
1.3.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Encoding of floating point numbers � . . . . . . . . . . . . . . . . . . . . . . 17
1.5 Integer formats � . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.6 Discussion and bibliography . . . . . . . . . . . . . . . . . . . . . . . . . 20

Introduction
When we study numerical algorithms in the next chapters, we assume implicitly that the op-
erations involved are performed exactly. On a computer, however, only a subset of the real
numbers can be stored and, consequently, many arithmetic operations are performed only ap-
proximately. This is the source of the so-called round-off errors. The rest of this chapter is
organized as follows.

• In Section 1.1, we discuss the binary representation of real numbers.

• In Section 1.2, we describe the set of floating point numbers that can be represented in
the usual floating point formats;

• In Section 1.3 we explain how arithmetic operations between floating point numbers be-
have. We insist in particular on the fact that, in a calculation involving several successive
arithmetic operations, the result of each intermediate operation is stored as a floating
point number, with a possible error.


• In Section 1.4, we briefly present how floating point numbers are encoded according to
the IEEE 754 standard, widely accepted today. We discuss also the encoding of special
values such as Inf, -Inf and NaN.

• Finally, in Section 1.5, we present the standard integer formats and their encoding.

In order to completely describe floating-point arithmetic, one would in principle need to also
discuss the conversion mechanisms between different number formats, as well as a number of
edge cases. Needless to say, a comprehensive discussion of the subject is beyond the scope of
this course; our aim in this chapter is only to introduce the key concepts.

1.1 Binary representation of real numbers


Given any integer number β > 1, called the base, a real number x can always be expressed as
a finite or infinite series of the form

$$x = \pm \sum_{k=-n}^{\infty} a_k \beta^{-k}, \qquad a_k \in \{0, \ldots, \beta - 1\}. \tag{1.1}$$

The number x may then be denoted as ±(a−n a−n+1 . . . a−1 a0 .a1 a2 . . . )β , where the subscript β
indicates the base. This numeral system is called the positional notation and is universally used
today, both by humans (usually with β = 10) and machines (usually with β = 2). If the base β
is omitted, it is always assumed in this course that β = 10 unless otherwise specified – this is
the decimal representation. The digits a−n , a−n+1 , . . . are also called bits if β = 2. In computer
science, several bases other than 10 are regularly employed, for example the following:

• Base 2 (binary) is the usual choice for storing numbers on a machine. The binary format
is convenient because the digits have only two possible values, 0 or 1, and so they can
be stored using simple electrical circuits with two states. We employ the binary notation
extensively in the rest of this chapter. Notice that, just like multiplying and dividing
by 10 is easy in base 10, multiplying and dividing by 2 is very simple in base 2: these
operations amount to shifting all the bits by one position to the left or right, respectively.

• Base 16 (hexadecimal) is sometimes convenient to represent numbers in a compact manner.


In order to represent the values 0-15 by a single digit, 16 different symbols are required,
which are conventionally denoted by {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F }. With this
notation, we have (F F )16 = (255)10 , for example.
The hexadecimal notation is often used in programming languages for describing colors
specified by a triplet (r, g, b) of values between 0 and 255, corresponding to the primary
colors red, green and blue. Since the number of possible values for the components is
256 = 162 , only 2 digits are required to represent these in the hexadecimal notation.
Hexadecimal numbers are also employed in IPv6 addresses.


1.1.1 Conversion between binary and decimal formats


Obtaining the decimal representation of a binary number can be achieved from (1.1), using the
decimal representations of the powers of 2. Since all the positive and negative powers of 2 have
a finite decimal representation, any real number with a finite representation in base 2 has a
finite representation also in base 10. For example, (0.01)2 = (0.25)10 and (0.111)2 = (0.875)10 .

Example 1.1 (Converting a binary number to decimal notation). Let us calculate the decimal
representation of $x = (0.\overline{10})_2$, where the horizontal line indicates repetition: $x = (0.101010\ldots)_2$.
By definition, it holds that
$$x = \sum_{k=0}^{\infty} a_k 2^{-k},$$
where $a_k = 0$ if $k$ is even and $1$ otherwise. Thus, the series may be rewritten as
$$x = \sum_{k=0}^{\infty} 2^{-(2k+1)} = \frac{1}{2} \sum_{k=0}^{\infty} (2^{-2})^k.$$
We recognize on the right-hand side a geometric series with common ratio $r = 2^{-2} = \frac{1}{4}$, and so
we obtain
$$x = \frac{1}{2} \left( \frac{1}{1-r} \right) = \frac{2}{3} = (0.\overline{6})_{10}.$$
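
The value obtained in Example 1.1 can be checked numerically. The short Julia snippet below
(an illustrative sketch, not part of the original notes) truncates the expansion after 60 bits and
compares the result with 2/3.

bits = [isodd(k) ? 1 : 0 for k in 1:60]             # a_1, a_2, ... with a_k = 1 for odd k
x = sum(bits[k] * 2.0^-k for k in eachindex(bits))  # truncated series (1.1) with β = 2
println(x ≈ 2/3)                                    # true: the truncation error is about 2^-61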

Obtaining the binary representation of a decimal number is more difficult, because negative
powers of 10 have infinite binary representations, as Exercise 1.4 demonstrates. There is,
however, a simple procedure to perform the conversion, which we present for the specific case of
a real number x with decimal representation of the form x = (0.a1 . . . an )10 . In this setting, the
bits (b1 , b2 , . . . ) in the binary representation of x = (0.b1 b2 b3 . . . )2 may be obtained as follows:

Algorithm 1 Conversion of a number to binary format


1: i ← 1
2: while x ≠ 0 do
3: x ← 2x
4: if x ≥ 1 then
5: bi ← 1
6: else
7: bi ← 0
8: end if
9: x ← x − bi
10: i←i+1
11: end while

Example 1.2 (Converting a decimal number to binary notation). Let us calculate the binary
representation of $x = \frac{1}{3} = (0.\overline{3})_{10}$. We apply Algorithm 1 and collate the values of i and x
obtained at the beginning of each iteration, i.e. just before Line 3, in the table below.

i    x      Result
1    1/3    (0.)2
2    2/3    (0.0)2
3    1/3    (0.01)2

Since x in the last row is again 1/3, successive bits alternate between 0 and 1, and the binary
representation of x is given by $(0.\overline{01})_2$. This is not surprising since $2x = (0.\overline{6})_{10} = (0.\overline{10})_2$, as
we saw in Example 1.1.
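
Algorithm 1 is straightforward to implement. The following Julia sketch is not part of the
original notes: it uses exact rational arithmetic and caps the number of bits, so that inputs
with an infinite binary expansion, such as 1//3, still terminate; the function name
to_binary_fraction is ours.

function to_binary_fraction(x::Rational; maxbits = 16)
    bits = Int[]
    while x != 0 && length(bits) < maxbits
        x *= 2                     # Line 3 of Algorithm 1
        b = x >= 1 ? 1 : 0         # Lines 4-8
        push!(bits, b)
        x -= b                     # Line 9
    end
    return bits
end

to_binary_fraction(1//3)           # [0, 1, 0, 1, ...], i.e. the expansion (0.010101...)_2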

1.1.2 Exercises
� Exercise 1.1. Show that if a number x ∈ R admits a finite representation (1.1) in base β,
then it also admits an infinite representation in the same base. Hint: You may have learned
before that (0.9)10 = 1.

� Exercise 1.2. How many digits does it take to represent all the integers from 0 to 1010 − 1
in decimal and binary format? What about the hexadecimal format?

� Exercise 1.3. Find the decimal representation of (0.0001100)2 .

� Exercise 1.4. Find the binary representation of (0.1)10 .

� Exercise 1.5. Implement Algorithm 1 on a computer and verify that it works. Your function
should take as argument an array of integers containing the digits after the decimal point; that
is, an array of the form [a_1, ..., a_n].

� Exercise 1.6. As mentioned above, Algorithm 1 works only for decimal numbers of the
specific form x = (0.a1 . . . an )10 . Find and implement a similar algorithm for integer numbers.
More precisely, write a function that takes an integer n as argument and returns an array
containing the bits of the binary expansion (bm . . . b0 )2 of n, from the least significant b0 to the
most significant bm . That is to say, your code should return [b_0, b_1, ...].

function to_binary(n)
# Your code comes here
end

# Check that it works


number = 123456789
bits = to_binary(number)
pows2 = 2 .^ range(0, length(bits) - 1)
@assert sum(bits'pows2) == number

1.2 Set of values representable in floating point formats


We mentioned in the introduction that, because of memory limitations, only a subset of the real
numbers can be stored exactly in a computer. Nowadays, the vast majority of programming

languages and software comply with the IEEE 754 standard, which requires that the set of
representable numbers be of the form
$$\mathbf{F}(p, E_{\min}, E_{\max}) = \Bigl\{ (-1)^s \, 2^E \, (b_0.b_1 b_2 \ldots b_{p-1})_2 : s \in \{0,1\},\ b_i \in \{0,1\} \text{ and } E_{\min} \le E \le E_{\max} \Bigr\}. \tag{1.2}$$

In addition to these, floating point formats provide the special entities Inf, -Inf and NaN,
the latter being an abbreviation for Not a Number. Three parameters appear in the set defini-
tion (1.2). The parameter p ∈ N>0 is the number of significant bits (also called the precision),
and (Emin , Emax ) ∈ Z2 are respectively the minimum and maximum exponents. From the preci-
sion, the machine epsilon is defined as εM = 2−(p−1) ; its significance is discussed in Section 1.2.2.
For a number x ∈ F(p, Emin , Emax ), s is called the sign, E is the exponent and b0 .b1 b2 . . . bp−1
is the significand. The latter can be divided into a leading bit b0 and the fraction b1 b2 . . . bp−1 ,
to the right of the binary point. The most widely used floating point formats are the single
and double precision formats, which are called respectively Float32 and Float64 in Julia.
Their parameters, together with those of the lesser-known half-precision format, are summarized
in Table 1.1. In the rest of this section we use the shorthand notation F16 , F32 and F64 . Note
that F16 ⊂ F32 ⊂ F64 .

Half precision Single precision Double precision


p 11 24 53
Emin -14 -126 -1022
Emax 15 127 1023

Table 1.1: Floating point formats. The first column corresponds to the half-precision format.
This format, which is available through the Float16 type in Julia, is more recent than the single
and double precision formats. It was introduced in the 2008 revision to the IEEE 754 standard
of 1985, a revision known as IEEE 754-2008.

Remark 1.1. Some definitions, notably that in [9, Section 2.5.2], include a general base β
instead of the base 2 as an additional parameter in the definition of the number format (1.2).
Since the binary format (β = 2) is always employed in practice, we focus on this case for
simplicity in most of this chapter.

Remark 1.2. Given a real number x ∈ F(p, Emin , Emax ), the exponent E and significand are
generally not uniquely defined. For example, the number 2.0 ∈ F64 may be expressed as
(−1)0 21 (1.00 . . . 00)2 or, equivalently, as (−1)0 22 (0.100 . . . 00)2 .

In Julia, non-integer numbers are interpreted as Float64 by default, which can be ver-
ified by using the typeof function. For example, the instruction “a = 0.1” is equivalent
to “a = Float64(0.1)”. In order to define a number of type Float32, the suffix f0 must be
appended to the decimal expansion. For instance, the instruction “a = 4.0f0” defines a float-
ing point number a of type Float32; it is equivalent to writing “a = Float32(4.0)”.
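
The following short REPL session (an illustrative addition, not part of the original notes)
checks these defaults:

julia> typeof(0.1)

Float64

julia> typeof(4.0f0)

Float32

julia> Float32(4.0) == 4.0f0

true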


1.2.1 Denormalized floating point numbers


We can decompose the set F(p, Emin , Emax ) into two disjoint parts:
$$\mathbf{F}(p, E_{\min}, E_{\max}) = \Bigl\{ (-1)^s \, 2^E \, (1.b_1 b_2 \ldots b_{p-1})_2 : s \in \{0,1\},\ b_i \in \{0,1\} \text{ and } E_{\min} \le E \le E_{\max} \Bigr\}$$
$$\cup \ \Bigl\{ (-1)^s \, 2^{E_{\min}} \, (0.b_1 b_2 \ldots b_{p-1})_2 : s \in \{0,1\},\ b_i \in \{0,1\} \Bigr\}.$$
The numbers in the second set are called subnormal or denormalized.
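
In Julia, the boundary between the two sets can be explored with the built-in functions
floatmin, issubnormal and nextfloat (an illustrative aside, not from the original notes):

floatmin(Float64)                    # smallest positive non-denormalized number, about 2.2e-308
issubnormal(floatmin(Float64) / 2)   # true: halving it lands in the denormalized range
nextfloat(0.0)                       # smallest positive denormalized number, 5.0e-324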

1.2.2 Relative error and machine epsilon


Let x be a nonzero real number and $\hat{x}$ an approximation of it. We define the absolute and relative
errors of the approximation as follows.

Definition 1.1 (Absolute and relative error). The absolute error is given by $|x - \hat{x}|$, whereas
the relative error is
$$\frac{|x - \hat{x}|}{|x|}.$$

The following result establishes a link between the machine epsilon εM and the relative error between
a real number and the closest member of a floating point format.

Proposition 1.1. Let xmin and xmax denote the smallest and largest non-denormalized pos-
itive numbers in a format F = F(p, Emin , Emax ). If x ∈ [−xmax , −xmin ] ∪ [xmin , xmax ], then
$$\min_{\hat{x} \in \mathbf{F}} \frac{|x - \hat{x}|}{|x|} \le \frac{1}{2} \, 2^{-(p-1)} = \frac{1}{2} \varepsilon_M. \tag{1.3}$$

Proof. For simplicity, we assume that x > 0. Let us introduce $n = \lfloor \log_2(x) \rfloor$ and $y := 2^{-n} x$.
Since y ∈ [1, 2), it has a binary representation of the form (1.b1 b2 . . . )2 , where the bits after
the binary point are not all equal to 1 ad infinitum. Thus $x = 2^n (1.b_1 b_2 \ldots)_2$, and from the
assumption that xmin ≤ x ≤ xmax we deduce that Emin ≤ n ≤ Emax . We now define the number
x− ∈ F by truncating the binary expansion of x as follows:
$$x^- = 2^n (1.b_1 \ldots b_{p-1})_2.$$
The distance between x− and its successor in F , which we denote by x+ , is given by $2^{n-p+1}$.
Consequently, it holds that
$$(x^+ - x) + (x - x^-) = x^+ - x^- = 2^{n-p+1}.$$
Since both summands on the left-hand side are positive, this implies that either $x^+ - x$ or $x - x^-$
is bounded from above by $\frac{1}{2} 2^{n-p+1} \le \frac{1}{2} 2^{-p+1} x$, which concludes the proof.

The machine epsilon, which was defined as εM = 2−(p−1) , coincides with the maximum
relative spacing between a non-denormalized floating point number x and its successor in the
floating point format, defined as the smallest number in the format that is strictly larger than x.

Figure 1.1: Density of the double-precision floating point numbers, measured here as 1/∆(x)
where, for x ∈ F64 , ∆(x) denotes the distance between x and its successor in F64 .

Figure 1.1 depicts the density of double-precision floating point numbers, i.e. the number
of F64 members per unit on the real line. The figure shows that the density decreases as
the absolute value of x increases. We also notice that the density is piecewise constant with
discontinuities at powers of 2. Figure 1.2 illustrates the relative spacing between successive
floating point numbers. Although the absolute spacing increases with the absolute value of x,
the relative spacing oscillates between εM /2 and εM .

Figure 1.2: Relative spacing between successive double-precision floating point numbers in the
“normal range”. The relative spacing oscillates between εM /2 and εM .

The picture of the relative spacing between successive floating point numbers looks quite
different for denormalized numbers. This is illustrated in Figure 1.3, which shows that the
relative spacing increases beyond the machine epsilon in the denormalized range. Fortunately,
in the usual F32 and F64 formats, the transition between non-denormalized and denormalized
numbers occurs at such a small value that it is rarely a cause for concern in practice.


Figure 1.3: Relative spacing between successive double-precision floating point numbers, over a
range which includes denormalized numbers. The vertical red line indicates the transition from
denormalized to non-denormalized numbers.

Remark 1.3. In Julia, the machine epsilon can be obtained using the eps function. For
example, the instruction eps(Float16) returns εM for the half-precision format.
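
The machine epsilon and the spacing between successive floating point numbers can be inspected
directly; the snippet below (an illustrative sketch, not from the original notes) relates eps to the
spacing returned by nextfloat:

eps(Float16), eps(Float32), eps(Float64)    # 2^-10, 2^-23 and 2^-52

# The spacing between 1.0 and its successor is exactly the machine epsilon,
# and the absolute spacing doubles after every power of 2:
nextfloat(1.0) - 1.0 == eps(Float64)        # true
nextfloat(2.0) - 2.0 == 2 * eps(Float64)    # true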

1.2.3 Exercises
� Exercise 1.7. Write down the values of the smallest and largest, in absolute value, positive
real numbers representable in the F32 and F64 formats.

� Exercise 1.8 (Relative error and machine epsilon). Prove that the inequality (1.3) is sharp.
To this end, find x ∈ R such that the inequality is an equality.

� Exercise 1.9 (Cardinality of the set of floating point numbers). Show that, if Emax ≥ Emin ,
then F(p, Emin , Emax ) contains exactly

$(E_{\max} - E_{\min}) \, 2^p + 2^{p+1} - 1$

distinct real numbers. (In particular, the special values Inf, -Inf and NaN are not counted.)
Hint: Count first the numbers with E > Emin and then those with E = Emin .

1.3 Arithmetic operations between floating point formats


Now that we have presented the set of values representable on a computer, we attempt in
this section to understand precisely how arithmetic operations between floating point formats
are performed. The key mechanism governing arithmetic operations on a computer is that of
rounding, the action of approximating a real number regarded as infinitely precise by a number
in a floating point format F(p, Emin , Emax ). The IEEE 754 standard stipulates that the default
mechanism for rounding a real number x, called round to nearest, should behave as follows:

• Standard case: The number x is rounded to the nearest representable number, if this
number is unique.


• Edge case: When there are two equally near representable numbers in the floating point
format, the one with the least significant bit equal to zero is delivered.

• Infinities: If the real number x is larger than or equal to the largest representable number
in the format, $x_{\max} = 2^{E_{\max}}(2 - 2^{-(p-1)})$, then there are two cases:

– If $x < 2^{E_{\max}}(2 - 2^{-p})$, then xmax is delivered;

– Otherwise, the special value Inf is delivered.

In other words, xmax is delivered if it would be delivered by following the rules of the first
two bullet points in a different floating point format with the same precision but a larger
exponent Emax . A similar rule applies for large negative numbers.
When a binary arithmetic operation (+, −, ×, /) is performed on floating point numbers
in format F, the result delivered by the computer is obtained by rounding the exact result of
the operation according to the rules given above. In other words, the arithmetic operation is
performed as if the computer first calculated an intermediate exact result, and then rounded
this intermediate result in order to provide a final result in F.
Mathematically, arithmetic operations between floating point numbers in a given format F
may be formalized by introducing the rounding operator fl : R → F and by defining, for any
binary operation ◦ ∈ {+, −, ×, /}, the corresponding machine operation
$$\hat{\circ} \colon \mathbf{F} \times \mathbf{F} \to \mathbf{F}; \qquad (x, y) \mapsto \mathrm{fl}(x \circ y).$$

We defined this operator for arguments in the same floating point format F. If the arguments
of a binary arithmetic operation are of different types, the format of the end result, known as
the destination format, depends on that of the arguments: as a rule of thumb, it is given by
the most precise among the formats of the arguments. In addition, recall that a floating point
literal whose format is not explicitly specified is rounded to double-precision format and so, for
example, the addition 0.1 + 0.1 produces the result fl64 fl64 (0.1) + fl64 (0.1) , where fl64 is the


rounding operator to the double-precision format.

Example 1.3. Using the typeof function, we check that the floating point literal 1.0 is indeed
interpreted as a double-precision number:

julia> a = 1.0; typeof(a)


Float64

When two numbers in different floating point formats are passed to a binary operation, the
result is in the more precise format.

julia> typeof(Float16(1) + Float32(1))


Float32

julia> typeof(Float32(1) + Float64(1))


Float64


If a mathematical expression contains several binary arithmetic operations to be performed
in succession, the result of each intermediate calculation is stored in a floating point format
dictated by the formats of its arguments, and this floating point number is employed in the next
binary operation. A consequence of this mechanism is that the machine operations $\hat{+}$ and $\hat{*}$ are
generally not associative. For example, in general
$$(x \,\hat{+}\, y) \,\hat{+}\, z \neq x \,\hat{+}\, (y \,\hat{+}\, z).$$

Example 1.4. Let $x = 1$ and $y = 3 \times 2^{-13}$. Both of these numbers belong to F16 and, denoting
by $\hat{+}$ machine addition in F16 , we have
$$(x \,\hat{+}\, y) \,\hat{+}\, y = 1 \tag{1.4}$$
but
$$x \,\hat{+}\, (y \,\hat{+}\, y) = 1 + 2^{-10}. \tag{1.5}$$
To explain this somewhat surprising result, we begin by writing the normalized representations
of x and y in the F16 format:
$$x = (-1)^0 \times 2^0 \times (1.0000000000)_2, \qquad y = (-1)^0 \times 2^{-12} \times (1.1000000000)_2.$$
The exact result of the addition $x + y$ is given by $r = 1 + 3 \times 2^{-13}$, which in binary notation is
$$r = (1.\underbrace{00000000000}_{11 \text{ zeros}}11)_2.$$
Since the length of the significand in the half-precision (F16 ) format is only p = 11, this number
is not part of F16 . The result of the machine addition $\hat{+}$ is therefore obtained by rounding r
to the nearest member of F16 , which is 1. This reasoning can then be repeated in order to
conclude that, indeed,
$$(x \,\hat{+}\, y) \,\hat{+}\, y = x \,\hat{+}\, y = 1.$$
In order to explain the result of (1.5), note that the exact result of the addition $y + y$ is
$r = 3 \times 2^{-12}$, which belongs to the floating point format, so it also holds that $y \,\hat{+}\, y = 3 \times 2^{-12}$.
Therefore,
$$x \,\hat{+}\, (y \,\hat{+}\, y) = 1 \,\hat{+}\, 3 \times 2^{-12} = \mathrm{fl}_{16}(1 + 3 \times 2^{-12}).$$
The argument of the F16 rounding operator does not belong to F16 , since its binary represen-
tation is given by
$$(1.\underbrace{0000000000}_{10 \text{ zeros}}11)_2.$$
This time the nearest member of F16 is given by $1 + 2^{-10}$.
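
This loss of associativity is easy to reproduce; the snippet below (illustrative, not part of the
original notes) evaluates both sides of (1.4) and (1.5) in Julia:

x = Float16(1)
y = Float16(3) * Float16(2)^-13    # y = 3 × 2^-13 belongs to F16

(x + y) + y                        # Float16(1.0)
x + (y + y)                        # Float16(1.001), i.e. 1 + 2^-10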

When a numerical computation unexpectedly returns Inf or -Inf, we say that an over-
flow error occurred. Similarly, underflow occurs when the result of a calculation is smaller in
magnitude than the smallest positive representable number in a floating point format.

1.3.1 Exercises
� Exercise 1.10. Calculate the machine epsilon ε16 for the F16 format. Write the results of
the arithmetic operations $1 \,\hat{+}\, \varepsilon_{16}$ and $1 \,\hat{-}\, \varepsilon_{16}$ in F16 normalized representation.

� Exercise 1.11 (Catastrophic cancellation). Let ε16 be the machine epsilon for the F16
format, and define $y = \frac{3}{4}\varepsilon_{16}$. What is the relative error between $\Delta = (1 + y) - 1$ and the
machine approximation $\hat{\Delta} = (1 \,\hat{+}\, y) \,\hat{-}\, 1$?

� Exercise 1.12 (Numerical differentiation). Let f (x) = exp(x). By definition, the derivative
of f at 0 is
$$f'(0) = \lim_{\delta \to 0} \left( \frac{f(\delta) - f(0)}{\delta} \right).$$
It is natural to use the expression within brackets on the right-hand side with a small but
nonzero δ as an approximation for f ′ (0). Implement this approach using double-precision
numbers and the same values for δ as in the table below. Explain the results you obtain.

δ          ε64 /4     ε64 /2     ε64
f ′ (0)    0          2          1

� Exercise 1.13 (Avoiding overflow). Write a code to calculate the weighted average
$$S := \frac{\sum_{j=0}^{J} w_j \, j}{\sum_{j=0}^{J} w_j}, \qquad w_j = \exp(j), \quad J = 1000.$$

� Exercise 1.14 (Calculating the sample variance). Assume that $(x_n)_{1 \le n \le N}$, with $N = 10^6$,
are independent random variables distributed according to the uniform distribution U(L, L + 1).
That is, each xn takes a random value uniformly distributed between L and L + 1, where $L = 10^9$.
In Julia, these samples can be generated with the following lines of code:

N, L = 10^6, 10^9
x = L .+ rand(N)

It is well known that the variance of $x_n \sim \mathcal{U}(L, L+1)$ is given by $\sigma^2 = \frac{1}{12}$. Numerically, the
variance can be estimated from the sample variance:
$$s^2 = \frac{1}{N-1} \left( \sum_{n=1}^{N} x_n^2 - N \bar{x}^2 \right), \qquad \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n. \tag{1.6}$$
Write a computer code to calculate s2 with the best possible accuracy. Can you find a formula
that enables better accuracy than (1.6)?


Remark 1.4. In order to estimate the true value of s2 for your samples, you can use the
BigFloat format, to which the array x can be converted by using the instruction x = BigFloat.(x).

� Exercise 1.15. Euler proved that
$$\frac{\pi^2}{6} = \lim_{N \to \infty} \sum_{n=1}^{N} \frac{1}{n^2}.$$
Using the default Float64 format, estimate the error obtained when the series on the right-hand
side is truncated after $10^{10}$ terms. Can you rearrange the sum for best accuracy?

� Exercise 1.16. Let x and y be positive real numbers in the interval $[2^{-10}, 2^{10}]$ (so that we
do not need to worry about denormalized numbers, assuming we are working in single or double
precision), and let us define the machine addition operator $\hat{+}$ for arguments in real numbers as
$$\hat{+} \colon \mathbf{R} \times \mathbf{R} \to \mathbf{R}; \qquad (x, y) \mapsto \mathrm{fl}\bigl(\mathrm{fl}(x) + \mathrm{fl}(y)\bigr).$$
Prove the following bound on the relative error between the sum x + y and its machine approx-
imation $x \,\hat{+}\, y$:
$$\frac{\bigl|(x + y) - (x \,\hat{+}\, y)\bigr|}{|x + y|} \le \frac{\varepsilon_M}{2} \left( 2 + \frac{\varepsilon_M}{2} \right).$$
Hint: decompose the numerator as
$$(x + y) - (x \,\hat{+}\, y) = \bigl(x - \mathrm{fl}(x)\bigr) + \bigl(y - \mathrm{fl}(y)\bigr) + \Bigl(\mathrm{fl}(x) + \mathrm{fl}(y) - (x \,\hat{+}\, y)\Bigr),$$
and then use Proposition 1.1.

� Exercise 1.17. Is Float32(0.1) * Float32(10) == 1 equal to true or false given the


default rounding rule defined by the IEEE standard? Explain.

Solution. By default, real numbers are rounded to the nearest floating point number. This can
be checked in Julia with the command rounding(Float32), which prints the default rounding
mode. The exact binary representation of the real number x = 0.1 is
$$x = (0.000\overline{1100})_2 = 2^{-4} \times (\underbrace{1.10011001100110011001100}_{24 \text{ bits}}\,\overline{1100})_2.$$
The first task is to determine the member of F32 that is nearest x. We have
$$x^- = \max \bigl\{ \widetilde{x} \in \mathbf{F}_{32} : \widetilde{x} \le x \bigr\} = 2^{-4} \times (1.10011001100110011001100)_2,$$
$$x^+ = \min \bigl\{ \widetilde{x} \in \mathbf{F}_{32} : \widetilde{x} \ge x \bigr\} = 2^{-4} \times (1.10011001100110011001101)_2.$$
Since the number $(0.\overline{1100})_2$ is closer to 1 than to 0, the number x is closer to x+ than to x− .
Therefore, the number obtained when writing Float32(0.1) is x+ . To conclude the exercise,
we need to calculate $\mathrm{fl}(10 \times x^+)$, and to this end we first write the exact binary representation
of the real number $10 \times x^+ = (1010)_2 \times x^+$. We have
$$(1010)_2 \times x^+ = (1000)_2 \times x^+ + (10)_2 \times x^+ = 2^{-4} \times (1100.11001100110011001101)_2 + 2^{-4} \times (11.0011001100110011001101)_2$$
$$= 2^{-4} \times (10000.\underbrace{000000000000000000000}_{21 \text{ zeros}}1)_2.$$
This can be checked in Julia by writing bitstring(Float32(0.1) * Float64(10.0)). Clearly,
when rounding to the nearest F32 number, the number $2^{-4} \times (10000)_2 = 1$ is obtained.

Remark 1.5. It should not be inferred from Exercise 1.17 that Float32(1/i) * i is always
exact in floating point arithmetic. For example Float32(1/41) * 41 does not evaluate to 1,
and neither do Float16(1/11) * 11 and Float64(1/49) * 49.

� Exercise 1.18. Explain why Float32(sqrt(2))^2 - 2 is not zero in Julia.



Solution. The exact binary representation of $x := \sqrt{2}$ is
$$x = (\underbrace{1.01101010000010011110011}_{24 \text{ bits}}\,001100\ldots)_2.$$
The first task is to determine the member of F32 that is nearest x. We have
$$x^- = \max \bigl\{ \widetilde{x} \in \mathbf{F}_{32} : \widetilde{x} \le x \bigr\} = (\underbrace{1.01101010000010011110011}_{24 \text{ bits}})_2,$$
$$x^+ = \min \bigl\{ \widetilde{x} \in \mathbf{F}_{32} : \widetilde{x} \ge x \bigr\} = (\underbrace{1.01101010000010011110100}_{24 \text{ bits}})_2,$$
and we calculate
$$x - x^- = 2^{-24} \, (0.01100\ldots)_2,$$
$$x^+ - x = 2^{-21} \bigl( 1 - (0.11001100\ldots)_2 \bigr) \ge 2^{-21} \bigl( 1 - (0.11001101)_2 \bigr) = 2^{-21} \, (0.00110011)_2.$$
We deduce that $x - x^- \le x^+ - x$, and so $\mathrm{fl}(x) = x^-$. To conclude the exercise, we need to show
that $\mathrm{fl}\bigl((x^-)^2\bigr)$ is not equal to 2. The exact binary expansion of $(x^-)^2$ is
$$(x^-)^2 = (\underbrace{1.11111111111111111111111}_{24 \text{ bits}}\,011011\ldots)_2.$$
The member of F32 nearest this number is
$$(1.11111111111111111111111)_2 = 2 - 2^{-23},$$
which is precisely the result returned by Julia.


1.4 Encoding of floating point numbers �

Once a number format is specified through parameters (p, Emin , Emax ), the choice of encoding,
i.e. the machine representation of numbers in this format, has no bearing on the magnitude and
propagation of round-off errors. Studying encodings is, therefore, not essential for our purposes
in this course, but we opted to cover the topic anyway in the hope that it will help the students
build intuition on floating point numbers. We focus mainly on the single precision format, but
the following discussion applies mutatis mutandis to the double and half-precision formats. The
material in this section is for information purposes only.
We already mentioned in Remark 1.2 that a number in a floating point format may have
several representations. On a computer, however, a floating point number is always stored in
the same manner (except for the number 0, see Remark 1.7). The values of the exponent and
significand which are selected by the computer, in the case where there are several possible
choices, are determined from the following rules:

• Either E > Emin and b0 = 1;

• Or E = Emin , in which case the leading bit may be 0.

The following result proves that these rules define the exponent and significand uniquely.

Proposition 1.2. Assume that
$$(-1)^s \, 2^E \, (b_0.b_1 \ldots b_{p-1})_2 = (-1)^{\widetilde{s}} \, 2^{\widetilde{E}} \, (\widetilde{b}_0.\widetilde{b}_1 \ldots \widetilde{b}_{p-1})_2, \tag{1.7}$$
where the parameter sets $(s, E, b_0, \ldots, b_{p-1})$ and $(\widetilde{s}, \widetilde{E}, \widetilde{b}_0, \ldots, \widetilde{b}_{p-1})$ both satisfy the above rules.
Then $E = \widetilde{E}$ and $b_i = \widetilde{b}_i$ for $i \in \{0, \ldots, p-1\}$.

Proof. We show that $E = \widetilde{E}$, after which the equality of the significands follows trivially. Let us
assume for contradiction that $E > \widetilde{E}$ and denote the left and right-hand sides of (1.7) by x
and $\widetilde{x}$, respectively. Then $E > E_{\min}$, implying that $b_0 = 1$ and so $2^E \le |x| < 2^{E+1}$. On the
other hand, it holds that $|\widetilde{x}| < 2^{\widetilde{E}+1}$ regardless of whether $\widetilde{E} = E_{\min}$ or not. Since $E \ge \widetilde{E} + 1$
by assumption, we deduce that $|\widetilde{x}| < 2^E \le |x|$, which contradicts the equality $x = \widetilde{x}$.

Now that we have explained how a unique set of parameters (sign, exponent, significand) can
be assigned to any floating point number, let us describe how these parameters are stored on the
computer in practice. As their names suggest, the Float16, Float32 and Float64 formats use 16,
32 and 64 bits of memory, respectively. A naive approach for encoding these number formats
would be to store the full binary representations of the sign, exponent and significand.
For the Float32 format, this approach would require 1 bit for the sign, 8 bits to cover the
254 possible values of the exponent, and 24 bits for the significand, i.e. for storing b0 , . . . , bp−1 .
This leads to a total of 33 bits, which is one more than is available, and this is without counting
the special values NaN, Inf and -Inf. So how are numbers in the F32 format actually stored?
To answer this question, we begin with two observations:


• If E > Emin , then necessarily b0 = 1 in the unique representation of the significand.
Consequently, the leading bit need not be explicitly specified in this case; it is said to be
implicit. As a consequence, we will see that p − 1 instead of p bits are in fact sufficient
for the significand.

• In the F32 format, 8 bits at minimum need to be reserved for the exponent, which enables
the representation of $2^8 = 256$ different values, but there are only 254 possible values for
the exponent. This suggests that 256 − 254 = 2 combinations of the 8 bits can be exploited
in order to represent the special values Inf, -Inf and NaN.

Simplifying a little, we may view a single precision floating point number as an array of 32
bits as illustrated below:

Sign Encoded exponent Encoded significand


1 bit 8 bits 23 bits

According to the IEEE 754 standard, the first bit is the sign s, the next 8 bits e0 e1 . . . e6 e7 encode
the exponent, and the last 23 bits b1 b2 . . . bp−2 bp−1 encode the significand. Let us introduce the
integer number e = (e0 e1 . . . e6 e7 )2 ; that is to say, $0 \le e \le 2^8 - 1$ is the integer number whose
binary representation is given by e0 e1 . . . e6 e7 . One may determine the exponent and significand
of a floating point number from the following rules.

• Denormalized numbers: If e = 0, then the implicit leading bit b0 is zero, the frac-
tion is b1 b2 . . . bp−2 bp−1 , and the exponent is E = Emin . In other words, using the
notation of Section 1.2, we have x = (−1)s 2Emin (0.b1 b2 . . . bp−2 bp−1 )2 . In particular, if
b1 b2 . . . bp−2 bp−1 = 00 . . . 00, then it holds that x = 0.

• Non-denormalized numbers: If 0 < e < 255, then the implicit leading bit b0 of the
significand is 1 and the fraction is given by b1 b2 . . . bp−2 bp−1 . The exponent is given by
$$E = e - \mathrm{bias} = E_{\min} + e - 1,$$
where the exponent bias for each format is given in Table 1.2. In this case
$x = (-1)^s \, 2^{\,e - \mathrm{bias}} \, (1.b_1 b_2 \ldots b_{p-2} b_{p-1})_2$. Notice that E = Emin if e = 1, as in the
case of subnormal numbers.

• Infinities: If e = 255 and b1 b2 . . . bp−2 bp−1 = 00 . . . 00, then x = Inf if s = 0 and -Inf
otherwise.

• Not a Number: If e = 255 and b1 b2 . . . bp−2 bp−1 6= 00 . . . 00, then x = NaN. Notice that
the special value NaN can be encoded in many different manners. These extra degrees of
freedom were reserved for passing information on the reason for the occurrence of NaN,
which is usually an indication that something has gone wrong in the calculation.


Half precision Single precision Double precision


Exponent bias (−Emin + 1) 15 127 1023
Exponent encoding (bits) 5 8 11
Significand encoding (bits) 10 23 52

Table 1.2: Encoding parameters for floating point formats
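
The bit layout can be inspected directly with bitstring. The following sketch (illustrative, not
from the original notes) decodes a Float32 by hand; the chosen value 0.15625 = (1.01)₂ × 2⁻³
is arbitrary.

s = bitstring(0.15625f0)                    # 32 characters: sign, encoded exponent, fraction
sign = s[1]                                 # '0'
E    = parse(Int, s[2:9]; base = 2) - 127   # encoded exponent minus the bias: -3
frac = s[10:end]                            # "01000000000000000000000"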

Remark 1.6 (Encoding efficiency). With 32 bits, at most $2^{32}$ different numbers could in
principle be represented. In practice, as we saw in Exercise 1.9, the Float32 format makes it
possible to represent
$$(E_{\max} - E_{\min}) 2^p + 2^{p+1} - 1 = 253 \times 2^{24} + 2^{25} - 1 = 2^{32} - 2^{24} - 1 \approx 99.6\% \times 2^{32}$$
different real numbers, which is a very good efficiency.

Remark 1.7 (Nonuniqueness of the floating point representation of 0.0). The sign s is clearly
unique for any number in a floating point format, except for 0.0, which could in principle be
represented as

(−1)0 2Emin (0.00 . . . 00)2 or (−1)1 2Emin (0.00 . . . 00)2 .

In practice, both representations of 0.0 are available on most machines, and these behave
slightly differently. For example 1/(0.0) = Inf but 1/(-0.0) = -Inf.

� Exercise 1.19. Determine the encoding of the following Float32 numbers:

• $x_1 = 2.0^{E_{\min}}$

• $x_2 = -2.0^{E_{\min} - p + 1} = -2.0^{-149}$

• $x_3 = 2.0^{E_{\max}} (2 - 2^{-p+1})$

Check your results using the Julia function bitstring.

1.5 Integer formats �

The machine representation of integer formats is much simpler than that of floating point
numbers. In this short section, we give a few orders of magnitude for common integer formats
and briefly discuss overflow issues. Programming languages typically provide integer formats
based on 16, 32 and 64 bits. In Julia, these correspond to the types Int16, Int32 and Int64, the
latter being the default for integer literals.
The most common encoding for integer numbers, which is used in Julia, is known as two’s
complement: a number encoded with p bits as $b_{p-1} b_{p-2} \ldots b_0$ corresponds to
$$x = -b_{p-1} 2^{p-1} + \sum_{i=0}^{p-2} b_i 2^i.$$


This encoding makes it possible to represent uniquely all the integers from $N_{\min} = -2^{p-1}$ to
$N_{\max} = 2^{p-1} - 1$. In contrast with floating point formats, integer formats do not provide special
values like Inf and NaN. What the machine delivers when a calculation exceeds the maximum
representable value in the format, called the overflow behavior, generally depends on the
programming language.
Since the overflow behavior of integer numbers is not universal across programming languages, a
detailed discussion is of little interest. We only mention that Julia uses a wraparound behavior,
where Nmax + 1 silently returns Nmin and, similarly, Nmin − 1 gives Nmax ; the numbers loop
back. This can lead to unexpected results, such as 2^64 evaluating to 0.
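
This wraparound behavior is easy to observe; the lines below (an illustrative check, not part of
the original notes) reproduce it in Julia:

typemax(Int64) + 1 == typemin(Int64)   # true: Nmax + 1 wraps around to Nmin
2^64                                   # 0, since the computation is carried out in Int64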

1.6 Discussion and bibliography


This chapter is mostly based on the original 1985 IEEE 754 standard [3] and the reference
book [9]. A significant revision to the 1985 IEEE standard was published in 2008 [4], adding for
example specifications for the half precision and quad precision formats, and a minor revision was
published in 2019 [5]. The original IEEE standard and its revisions constitute the authoritative
guide on floating point formats. It was intended to be widely disseminated and is written very
clearly and concisely, but is not available for free online. Another excellent source for learning
about floating point numbers and round-off errors is D. Goldberg’s paper “What every computer
scientist should know about floating-point arithmetic” [2], freely available online.

Chapter 2

Solution of linear systems of equations

2.1 Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Direct solution method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.1 LU decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.2 Backward and forward substitution . . . . . . . . . . . . . . . . . . . . . 30
2.2.3 Gaussian elimination with pivoting � . . . . . . . . . . . . . . . . . . . . 31
2.2.4 Direct method for symmetric positive definite matrices . . . . . . . . . . 34
2.2.5 Direct methods for banded matrices . . . . . . . . . . . . . . . . . . . . 34
2.2.6 Exercices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3 Iterative methods for linear systems . . . . . . . . . . . . . . . . . . . . 38
2.3.1 Basic iterative methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3.2 The conjugate gradient method . . . . . . . . . . . . . . . . . . . . . . . 46
2.3.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.4 Discussion and bibliography . . . . . . . . . . . . . . . . . . . . . . . . . 62

Introduction
This chapter is devoted to the numerical solution of linear problems of the following form:

Find x ∈ Rn such that Ax = b, A ∈ Rn×n , b ∈ Rn . (2.1)

Systems of this type appear in a variety of applications. They naturally arise in the context
of linear partial differential equations, which we use as main motivating example. Partial
differential equations govern a wide range of physical phenomena including heat propagation,
gravity, and electromagnetism, to mention just a few. Linear systems in this context often
have a particular structure: the matrix A is generally very sparse, which means that most of
the entries are equal to 0, and it is often symmetric and positive definite, provided that these
properties are satisfied by the underlying operator.
There are two main approaches for solving linear systems:


• Direct methods make it possible to calculate the exact solution to systems of linear equations,
up to round-off errors. Although this is an attractive property, direct methods are usually
too computationally costly for large systems: the cost of inverting a general n × n matrix,
measured in number of floating point operations, scales as $n^3$! (A short example of a direct
solve in Julia is given after this list.)

• Iterative methods, on the other hand, make it possible to calculate increasingly accu-
rate approximations of the solution. Iterations may be stopped once the residual is
sufficiently small. These methods are often preferable when the dimension n of the linear
system is very large.
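
As a point of reference, the sketch below (illustrative, not part of the original notes) solves a
small system directly with Julia's backslash operator; for the large sparse systems mentioned
above, the iterative methods of Section 2.3 are usually preferred.

A = [4.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
x = A \ b        # direct solution of Ax = b
A * x ≈ b        # true, up to round-off errors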

This chapter is organized as follows.

• In Section 2.1, we introduce the concept of conditioning. The condition number of a matrix
provides information on the sensitivity of the solution to perturbations of the right-hand
side b or matrix A. It is useful, for example, in order to determine the potential impact
of round-off errors.

• In Section 2.2, we present the direct method for solving systems of linear equations. We
study in particular the LU decomposition for an invertible matrix, as well as its variant
for symmetric positive definite matrices, which is called the Cholesky decomposition.

• In Section 2.3, we present iterative methods for solving linear systems. We focus in
particular on basic iterative methods based on a splitting, and on the conjugate gradient
method.

2.1 Conditioning
The condition number for a given problem measures the sensitivity of the solution to the
input data. In order to define this concept precisely, we consider a general problem of the
form F (x, d) = 0, with unknown x and data d. The linear system (2.1) can be recast in this
form, with the input data equal to b or A or both. We denote the solution corresponding to
perturbed input data d + ∆d by x + ∆x. The absolute and relative condition numbers are
defined as follows.

Definition 2.1 (Condition number for the problem F (x, d) = 0). The absolute and relative
condition numbers with respect to perturbations of d are defined as
K_abs(d) = lim_{ε→0} sup_{‖∆d‖≤ε} ‖∆x‖ / ‖∆d‖,        K(d) = lim_{ε→0} sup_{‖∆d‖≤ε} (‖∆x‖/‖x‖) / (‖∆d‖/‖d‖).

The short notation K is reserved for the relative condition number, which is often more useful
in applications.

In the rest of this section, we obtain an upper bound on the relative condition number for
the linear system (2.1) with respect to perturbations first of b, and then of A. We use the
notation k•k to denote both a vector norm on Rn and the induced operator norm on matrices.


Proposition 2.1 (Perturbation of the right-hand side). Let x + ∆x denote the solution to
the perturbed equation A(x + ∆x) = b + ∆b. Then it holds that

‖∆x‖/‖x‖ ≤ ‖A‖‖A^{-1}‖ ‖∆b‖/‖b‖.        (2.2)

Proof. It holds by definition of ∆x that A∆x = ∆b. Therefore, we have

‖∆x‖ = ‖A^{-1}∆b‖ ≤ ‖A^{-1}‖‖∆b‖ = ‖A^{-1}‖‖∆b‖ (‖Ax‖/‖b‖) ≤ ‖A^{-1}‖‖∆b‖ (‖A‖‖x‖/‖b‖).        (2.3)

Here we employed (A.7), proved in Appendix A, in the first and last inequalities. Rearranging
the inequality (2.3), we obtain (2.2).

Proposition 2.1 implies that the relative condition number of (2.1) with respect to perturbations
of the right-hand side is bounded from above by ‖A‖‖A^{-1}‖. Exercise 2.1 shows that there are
values of x and ∆b for which the inequality (2.2) is sharp.
Studying the impact of perturbations of the matrix A is slightly more difficult, because
this time the variation ∆x of the solution does not depend linearly on the perturbation.

Proposition 2.2 (Perturbation of the matrix). Let x+∆x denote the solution to the perturbed
equation (A + ∆A)(x + ∆x) = b. If A is invertible and ‖∆A‖ < ‖A^{-1}‖^{-1}, then

‖∆x‖/‖x‖ ≤ ‖A‖‖A^{-1}‖ (‖∆A‖/‖A‖) · 1/(1 − ‖A^{-1}∆A‖).        (2.4)

Before proving this result, we show an ancillary lemma.

Lemma 2.3. Let B ∈ R^{n×n} be such that ‖B‖ < 1. Then I − B is invertible and

‖(I − B)^{-1}‖ ≤ 1/(1 − ‖B‖),        (2.5)

where I ∈ Rn×n is the identity matrix.

Proof. It holds for any matrix B ∈ Rn×n that

I − B^{n+1} = (I − B)(I + B + ··· + B^n).

Since ‖B‖ < 1 in a submultiplicative matrix norm, both sides of the equation are convergent
in the limit as n → ∞, with the left-hand side converging to the identity matrix I. Equating the
limits, we obtain

I = (I − B) Σ_{i=0}^{∞} B^i.


This implies that (I − B) is invertible with inverse given by a so-called Neumann series:

(I − B)^{-1} = Σ_{i=0}^{∞} B^i.

Applying the triangle inequality repeatedly, and then using the submultiplicative property of
the norm, we obtain

∀n ∈ N,    ‖Σ_{i=0}^{n} B^i‖ ≤ Σ_{i=0}^{n} ‖B^i‖ ≤ Σ_{i=0}^{n} ‖B‖^i ≤ 1/(1 − ‖B‖),

where we used the summation formula for geometric series in the last inequality. Letting n → ∞
in this inequality and using the continuity of the norm concludes the proof.

Proof of Proposition 2.2. Left-multiplying both sides of the perturbed equation by A^{-1}, we obtain

(I + A^{-1}∆A)(x + ∆x) = x   ⇔   (I + A^{-1}∆A)∆x = −A^{-1}∆A x.        (2.6)

Since ‖A^{-1}∆A‖ ≤ ‖A^{-1}‖‖∆A‖ < 1 by assumption, we deduce from Lemma 2.3 that the matrix
on the left-hand side is invertible, with the norm of its inverse bounded as in (2.5). Consequently, using
in addition the submultiplicative property of the norm, we obtain that

‖∆x‖ = ‖(I + A^{-1}∆A)^{-1} A^{-1}∆A x‖ ≤ (‖A^{-1}∆A‖ / (1 − ‖A^{-1}∆A‖)) ‖x‖.

Dividing both sides by ‖x‖ and bounding ‖A^{-1}∆A‖ ≤ ‖A^{-1}‖‖∆A‖ in the numerator yields (2.4), which concludes the proof.

Using Proposition 2.2, we deduce that the relative condition number of (2.1) with respect
to perturbations of the matrix A is also bounded from above by ‖A‖‖A^{-1}‖, because the factor
1/(1 − ‖A^{-1}∆A‖) on the right-hand side of (2.4) converges to 1 as ‖∆A‖ → 0.
Propositions 2.1 and 2.2 show that the condition number, with respect to perturbations of
either b or A, depends only on A. This motivates the following definition.

Definition 2.2 (Condition number of a matrix). The condition number of a matrix A asso-
ciated to a vector norm ‖•‖ is defined as

κ(A) = ‖A‖‖A^{-1}‖.

The condition number for the p-norm, defined in Definition A.3, is denoted by κp (A).

Note that the condition number κ(A) associated with an induced norm is at least one. Indeed,
since the identity matrix has induced norm 1, it holds that

1 = ‖I‖ = ‖AA^{-1}‖ ≤ ‖A‖‖A^{-1}‖.

Since the 2-norm of a matrix A ∈ R^{n×n} coincides with the square root of the spectral radius ρ(A^T A),
the condition number κ_2 of an invertible matrix A corresponding to the 2-norm is equal to

κ_2(A) = √( λ_max(A^T A) / λ_min(A^T A) ),

where λ_max(A^T A) and λ_min(A^T A) are the maximal and minimal eigenvalues of A^T A, which are real and positive.

Example 2.1 (Perturbation of the matrix). Consider the following linear system with perturbed
matrix:

(A + ∆A) [x_1 ; x_2] = [0 ; 0.01],        A = [1  0 ;  0  0.01],        ∆A = [0  0 ;  0  ε],

where 0 < ε ≪ 0.01 and the matrices are written row by row. Here the eigenvalues of A are given by λ_1 = 1 and λ_2 = 0.01. The solution
when ε = 0 is given by (0, 1)^T, and the solution to the perturbed equation is

[x_1 + ∆x_1 ; x_2 + ∆x_2] = [0 ; 1/(1 + 100ε)].

Consequently, we deduce that, in the 2-norm,

‖∆x‖/‖x‖ = 100ε/(1 + 100ε) ≈ 100ε = 100 ‖∆A‖/‖A‖.

In this case, the relative impact of perturbations of the matrix is close to κ_2(A) = 100.
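The quantities in this example are easily checked numerically. The short Julia sketch below is only illustrative (the value of ε is an arbitrary choice); it uses cond and opnorm from the LinearAlgebra standard library, which return the 2-norm condition number and the operator 2-norm, respectively.

using LinearAlgebra

A  = [1.0 0.0; 0.0 0.01]
b  = [0.0, 0.01]
epsilon = 1e-6
dA = [0.0 0.0; 0.0 epsilon]

x      = A \ b               # unperturbed solution, equal to (0, 1)
x_pert = (A + dA) \ b        # perturbed solution

rel_error        = norm(x_pert - x) / norm(x)
rel_perturbation = opnorm(dA) / opnorm(A)

println(cond(A))                         # κ₂(A) = 100
println(rel_error / rel_perturbation)    # ≈ 100, consistent with (2.4)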

Exercise 2.1. In the simple case where A is symmetric, find values of x, b and ∆b for
which the inequality (2.2) is in fact an equality.

2.2 Direct solution method


In this section, we present the direct method for solving linear systems of the form (2.1) with a
general invertible matrix A ∈ Rn×n . The direct method can be decomposed into three steps:

• First calculate the so-called LU decomposition of A, i.e. find an upper triangular matrix U
and a unit lower triangular matrix L such that A = LU. A unit lower triangular matrix is
a lower triangular matrix with only ones on the diagonal.

• Then solve Ly = b using a method called forward substitution.

• Finally, solve Ux = y using a method called backward substitution.

By construction, the solution x thus obtained is a solution to (2.1). Indeed, we have that

Ax = LUx = Ly = b.
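As a preview, here is a minimal sketch of this three-step procedure using the built-in LU factorization of Julia's LinearAlgebra standard library. Note that the built-in lu uses partial pivoting (discussed in Section 2.2.3), so the right-hand side is permuted first; the test matrix is an arbitrary choice, and hand-written versions of each step are developed in the remainder of this section.

using LinearAlgebra

n = 5
A = rand(n, n) + n * I        # shifted so that A is safely invertible
b = rand(n)

F = lu(A)                     # built-in LU factorization with partial pivoting;
                              # F.L, F.U, F.p satisfy A[F.p, :] = F.L * F.U
y = LowerTriangular(F.L) \ b[F.p]   # forward substitution: solve Ly = Pb
x = UpperTriangular(F.U) \ y        # backward substitution: solve Ux = y

norm(A * x - b)               # small residual, up to round-off errors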

2.2.1 LU decomposition
In this section, we first discuss the existence and uniqueness of the LU factorization. We
then describe a numerical algorithm for calculating the factors L and U, based on Gaussian


elimination.

Existence and uniqueness of the decomposition

We present a necessary and sufficient condition for the existence of a unique LU decomposition
of a matrix. To this end, we define the principal submatrix of order i of a matrix A ∈ Rn×n as
the matrix Ai = A[1 : i, 1 : i], in Julia notation.

Proposition 2.4. The LU factorization of a matrix A ∈ Rn×n exists and is unique if and
only if the principal submatrices of A of all orders are nonsingular.

Proof. We prove only the “if” direction; see [9, Theorem 3.4] for the “only if” implication.
The statement is clear if n = 1. Reasoning by induction, we assume that the result is
proved up to n − 1. Since the matrix An−1 and all its principal submatrices are nonsingular
by assumption, it holds that An−1 = Ln−1 Un−1 for a unit lower triangular matrix Ln−1 and an
upper triangular matrix Un−1 . These two matrices are nonsingular, for if either of them were
singular then the product A_{n−1} = L_{n−1} U_{n−1} would be singular as well. Let us decompose A as
follows:

A = [ A_{n−1}  c ;  d^T  a_nn ],

where the blocks are listed row by row. Let ℓ and u denote the solutions to L_{n−1} u = c and U_{n−1}^T ℓ = d. These solutions exist and are
unique, because the matrices L_{n−1} and U_{n−1} are nonsingular. Letting u_nn = a_nn − ℓ^T u, we
check that A factorizes as

[ A_{n−1}  c ;  d^T  a_nn ] = [ L_{n−1}  0_{n−1} ;  ℓ^T  1 ] [ U_{n−1}  u ;  0_{n−1}^T  u_nn ].

This completes the proof of the existence of the decomposition. The uniqueness of the factors
follows from the uniqueness of ℓ, u and u_nn.

Proposition 2.4 raises the following question: are there classes of matrices whose principal
matrices are all nonsingular? The answer is positive, and we mention, as an important example,
the class of positive definite matrices. Proving this is the aim of Exercise 2.4.

Gaussian elimination algorithm for computing L and U


So far we have presented a condition under which the LU decomposition of a matrix exists and
is unique, but not a practical method for calculating the matrices L and U. We describe in this
section an algorithm, known as Gaussian elimination, for calculating the LU decomposition of
a matrix. We begin by introducing the concept of Gaussian transformation.

Definition 2.3. A Gaussian transformation is a matrix of the form Mk = I−c(k) eTk , where ek
is the column vector with entry at index k equal to 1 and all the other entries equal to zero,


and c^(k) is a column vector of the following form:

c^(k) = ( 0  0  ···  0  c^(k)_{k+1}  c^(k)_{k+2}  ···  c^(k)_n )^T.

The action of a Gaussian transformation Mk left-multiplying a matrix A ∈ Rn×n is to replace


the rows from index k + 1 to index n by a linear combination involving themselves and the k-th
row. To see this, let us denote by (r (i) )1≤i≤n the rows of a matrix T ∈ Rn×n . Then, we have

M_k T = (I − c^(k) e_k^T) T = [ r^(1) ; ··· ; r^(k) ; r^(k+1) − c^(k)_{k+1} r^(k) ; ··· ; r^(n) − c^(k)_n r^(k) ],

where the rows of the resulting matrix are listed from top to bottom.

We show in Exercise 2.2 that the inverse of a Gaussian transformation matrix is given by

(I − c(k) eTk )−1 = I + c(k) eTk . (2.7)

The idea of the Gaussian elimination algorithm is to successively left-multiply A with Gaussian
transformation matrices M1 , then M2 , etc. appropriately chosen in such a way that the ma-
trix A(k) , obtained after k iterations, is upper triangular up to column k. That is to say, the
Gaussian transformations are constructed so that all the entries in columns 1 to k under the
diagonal of the matrix A(k) are equal to zero. The resulting matrix A(n−1) after n − 1 iterations
is then upper triangular and satisfies

A(n−1) = Mn−1 . . . M1 A.

Rearranging this equation, we deduce that

A = (M_1^{-1} ··· M_{n−1}^{-1}) A^(n−1).

The first factor is lower triangular by (2.7) and Exercise 2.3. The product in the definition of
the matrix L admits a simple explicit expression.

Lemma 2.5. It holds that

M_1^{-1} ··· M_{n−1}^{-1} = (I + c^(1) e_1^T) ··· (I + c^(n−1) e_{n−1}^T) = I + Σ_{i=1}^{n−1} c^(i) e_i^T.


Proof. Notice that, for i < j,

c(i) eTi c(j) eTj = c(i) (eTi c(j) )eTj = c(i) 0eTj = 0.

The statement then follows easily by expanding the product.

A corollary of Lemma 2.5 is that all the diagonal entries of the lower triangular matrix L
are equal to 1; the matrix L is unit lower triangular. The full expression of the matrix L given
the Gaussian transformations is

L = I + [ c^(1)  c^(2)  ···  c^(n−1)  0_n ],        (2.8)

that is to say, L is the unit lower triangular matrix whose entry (i, j), for i > j, is equal to c^(j)_i.

Therefore, the Gaussian elimination algorithm, if all the steps are well-defined, correctly gives
the LU factorization of the matrix A. Of course, the success of the strategy outlined above
for the calculation of the LU factorization hinges on the existence of an appropriate Gaussian
transformation at each iteration. It is not difficult to show that, if the (k + 1)-th diagonal
entry of the matrix A(k) is nonzero for all k ∈ {1, . . . , n − 2}, then the Gaussian transformation
matrices exist and are uniquely defined.

Lemma 2.6. Assume that A^(k) is upper triangular up to column k included, with k ≤ n − 2.
If a^(k)_{k+1,k+1} ≠ 0, then there is a unique Gaussian transformation matrix M_{k+1} such that M_{k+1} A^(k)
is upper triangular up to column k + 1. It is given by I − c^(k+1) e_{k+1}^T where

c^(k+1) = ( 0  0  ···  0  a^(k)_{k+2,k+1}/a^(k)_{k+1,k+1}  a^(k)_{k+3,k+1}/a^(k)_{k+1,k+1}  ···  a^(k)_{n,k+1}/a^(k)_{k+1,k+1} )^T.

Proof. We perform the multiplication explicitly. Denoting by (r^(i))_{1≤i≤n} the rows
of A^(k), we have

M_{k+1} A^(k) = [ r^(1) ; ··· ; r^(k+1) ; r^(k+2) − c^(k+1)_{k+2} r^(k+1) ; ··· ; r^(n) − c^(k+1)_n r^(k+1) ],

where the rows of the resulting matrix are listed from top to bottom.

We need to show that the matrix on the right-hand side is upper triangular up to column k + 1


included. This is clear by definition of c(k+1) and from the fact that A(k) is upper triangular up
to column k by assumption.

The diagonal elements a^(k)_{k+1,k+1}, where k ∈ {0, . . . , n − 2}, are called the pivots. We now
prove that, if an invertible matrix A admits an LU factorization, then the pivots are necessarily
nonzero and the Gaussian elimination algorithm is successful.

Proposition 2.7 (Gaussian elimination works). If A is invertible and admits an LU factor-
ization, then the Gaussian elimination algorithm is well-defined and successfully terminates.

Proof. We denote by c^(1), . . . , c^(n−1) the columns of the matrix L − I. Then the matrices given
by M_k = I − c^(k) e_k^T, for k ∈ {1, . . . , n − 1}, are Gaussian transformations and it holds that

L = M_1^{-1} ··· M_{n−1}^{-1}

in view of Lemma 2.5. Since A = LU by assumption, the result of the product

M_{n−1} ··· M_1 A = U

is upper triangular. Let us use the notation A^(k) = M_k ··· M_1 A. Of all the Gaussian transforma-
tions, only M_1 acts on the second line of the matrix it multiplies. Therefore, the entry (2, 1) of U
coincides with the entry (2, 1) of A^(1), which implies that a^(1)_{2,1} = 0. Then notice that a^(k)_{3,1} = a^(1)_{3,1}
for all k ≥ 1, because the entry (3, 1) of the matrix M_2 A^(1) is given by a^(1)_{3,1} − c^(2)_3 a^(1)_{2,1} = a^(1)_{3,1}, and
the other transformation matrices M_3, . . . , M_{n−1} leave the third line invariant. Consequently, it
holds that a^(1)_{3,1} = u_{3,1} = 0. Continuing in this manner, we deduce that A^(1) is upper triangular
in the first column and that, since A is invertible by assumption, the first pivot a_{11} is nonzero.
Since this pivot is nonzero, the matrix M_1 is uniquely defined by Lemma 2.6.
The reasoning can then be repeated with the other columns, in order to deduce that A^(k) is
upper triangular up to column k and that all the pivots a^(k−1)_{kk} are nonzero.

Computer implementation
The Gaussian elimination procedure is summarized as follows.
A^(0) ← A,  L ← I
for i ∈ {1, . . . , n − 1} do
    Construct M_i as in Lemma 2.6.
    A^(i) ← M_i A^(i−1),   L ← L M_i^{-1}
end for
U ← A^(n−1).
Of course, in practice it is not necessary to explicitly create the Gaussian transformation
matrices, or to perform full matrix multiplications. A more realistic, but still very simplified,
version of the algorithm in Julia is given below. The code exploits the relation (2.8) between L
and the parameters of the Gaussian transformations.


1  # A is an invertible matrix of size n x n
2  L = [i == j ? 1.0 : 0.0 for i in 1:n, j in 1:n]
3  U = copy(A)
4  for i in 1:n-1
5      for r in i+1:n
6          U[i, i] == 0 && error("Pivotal entry is zero!")
7          ratio = U[r, i] / U[i, i]
8          L[r, i] = ratio
9          U[r, i:end] -= U[i, i:end] * ratio
10     end
11 end
12 # L is unit lower triangular and U is upper triangular

Computational cost

The computational cost of the algorithm, measured as the number of floating point operations
(flops) required, is dominated by the Gaussian transformations, in line 9 in the above code.
All the other operations amount to a computational cost scaling as O(n2 ), which is negligible
compared to the cost of the LU factorization when n is large. This factorization requires

Σ_{i=1}^{n−1} 2 (n − i)(n − i + 1) = (2/3) n^3 + O(n^2) flops,

where the factor 2 accounts for the subtraction and the multiplication in line 9, the factor (n − i)
corresponds to the inner loop "for r in i+1:n", the factor (n − i + 1) to the slice [i:end], and the
sum to the outer loop "for i in 1:n-1".

2.2.2 Backward and forward substitution


Once the LU factorization has been completed, the solution to the linear system can be obtained
by first using forward, and then backward substitution, which are just bespoke methods for
solving linear systems with lower and upper triangular matrices, respectively. Let us consider
the case of a lower triangular system:
Ly = b

Notice that the unknown y1 may be obtained from the first line of the system. Then, since y1
is known, the value of y2 can be obtained from the second line, etc. A simple implementation
of this algorithm is as follows:

# L is unit lower triangular
y = copy(b)
for i in 2:n
    for j in 1:i-1
        y[i] -= L[i, j] * y[j]
    end
end


2.2.3 Gaussian elimination with pivoting

The Gaussian elimination algorithm that we presented in Section 2.2.1 relies on the existence
of an LU factorization. In practice, this assumption may not be satisfied, and in this case a
modified algorithm, called Gaussian elimination with pivoting, is required.
In fact, pivoting is useful even if the usual LU decomposition of A exists, as it makes it possible to
reduce the condition number of the matrices L and U. There are two types of pivoting:
partial pivoting, where only the rows are rearranged through a permutation at each iteration,
and complete pivoting, where both the rows and the columns are rearranged at each iteration.
Showing rigorously why pivoting is useful requires a detailed analysis and is beyond the
scope of this course. In this section, we only present the partial pivoting method. Its influence
on the condition number of the factors L and U is studied empirically in Exercise 2.6. It is
useful at this point to introduce the concept of a row permutation matrix.

Row permutation matrix

Definition 2.4. Let σ : {1, . . . , n} → {1, . . . , n} be a permutation, i.e. a bijection on the


set {1, . . . , n}. The row permutation matrix associated with σ is the matrix with entries

p_ij = 1 if i = σ(j),    and    p_ij = 0 otherwise.

When a row permutation P left-multiplies a matrix B ∈ Rn×n , row i of matrix B is moved to


row index σ(i) in the resulting matrix, for all i ∈ {1, . . . , n}. A permutation matrix has a single
entry equal to 1 per row and per column, and its inverse coincides with its transpose: P−1 = PT .

Partial pivoting

Gaussian elimination with partial pivoting applies to any invertible matrix A, and it outputs three
matrices: a row permutation matrix P, a unit lower triangular matrix L, and an upper triangular matrix U.
These are related by the relation

PA = LU.

This is sometimes called a PLU decomposition of the matrix A. It is not unique in general but,
unlike the usual LU decomposition, it always exists provided that A is invertible. We take this
for granted in this course.
The idea of partial pivoting is to rearrange the rows at each iteration of the Gaussian
elimination procedure in such a manner that the pivotal entry is as large as possible in absolute
value. One step of the procedure reads

A(k+1) = Mk+1 Pk+1 A(k) . (2.9)

Here Pk+1 is a simple row permutation matrix which, when acting on A(k) , interchanges row k+1
and row `, for some index ` ≥ k + 1. The row index ` is selected in such a way that the absolute
value of the pivotal entry, in position (k + 1, k + 1) of the product Pk+1 A(k) , is maximum. The


matrix Mk+1 is then the unique Gaussian transformation matrix ensuring that A(k+1) is upper
triangular up to column k + 1, obtained as in Lemma 2.6. The resulting matrix A(n−1) after
n − 1 steps of the form (2.9) is upper triangular and satisfies

A(n−1) = Mn−1 Pn−1 · · · M1 P1 A ⇔ A = (Mn−1 Pn−1 · · · M1 P1 )−1 A(n−1) .

The first factor in the decomposition of A is not necessarily lower triangular. However, using
the notation M = Mn−1 Pn−1 · · · M1 P1 and P = Pn−1 · · · P1 , we have

PA = PM−1 U = (PM−1 )U =: LU. (2.10)

Lemma 2.8 below shows that, as the notation L suggests, the matrix L = (PM−1 ) on the right-
hand side is indeed lower triangular. Before stating and proving the lemma, we note that P is
a row permutation matrix, and so the solution to the linear system Ax = b can be obtained by
solving LUx = Pb by forward and backward substitution. Since P is a very sparse matrix, the
right-hand side Pb can be calculated very efficiently.

Lemma 2.8. The matrix L = PM−1 is unit lower triangular with all entries bounded in absolute
value from above by 1. It admits the expression

L = I + (Pn−1 · · · P2 c(1) )eT1 + (Pn−1 · · · P3 c(2) )eT2 + · · · + (Pn−1 c(n−2) )eTn−2 + c(n−1) eTn−1 .

Proof. Let M^(k) = M_k P_k ··· M_1 P_1 and P^(k) = P_k ··· P_1. It is sufficient to show that

P^(k) (M^(k))^{-1} = I + (P_k ··· P_2 c^(1)) e_1^T + (P_k ··· P_3 c^(2)) e_2^T + ··· + (P_k c^(k−1)) e_{k−1}^T + c^(k) e_k^T        (2.11)

for all k ∈ {1, . . . , n − 1}. The statement is clear for k = 1, and we assume by induction that it
is true up to k − 1. Then notice that

P^(k) (M^(k))^{-1} = P_k P^(k−1) (M^(k−1))^{-1} P_k^{-1} M_k^{-1}
                  = P_k ( I + (P_{k−1} ··· P_2 c^(1)) e_1^T + ··· + (P_{k−1} c^(k−2)) e_{k−2}^T + c^(k−1) e_{k−1}^T ) P_k^{-1} M_k^{-1}
                  = ( I + (P_k P_{k−1} ··· P_2 c^(1)) e_1^T + ··· + (P_k P_{k−1} c^(k−2)) e_{k−2}^T + (P_k c^(k−1)) e_{k−1}^T ) M_k^{-1}.

In the last equality, we used that e_i^T P_k^{-1} = (P_k e_i)^T = e_i^T for all i ∈ {1, . . . , k − 1}, because the
row permutation P_k does not affect rows 1 to k − 1. Using the expression M_k^{-1} = I + c^(k) e_k^T,
expanding the product and noting that e_j^T c^(k) = 0 if j ≤ k, we obtain (2.11). The statement
that the entries are bounded in absolute value from above by 1 follows from the choice of the
pivot at each iteration.

The expression of L in Lemma 2.8 suggests the iterative procedure given in Algorithm 2 for
performing the LU factorization with partial pivoting. A Julia implementation of this algorithm
is presented in Listing 2.1.


Algorithm 2 LU decomposition with partial pivoting


Assign A^(0) ← A and P ← I
for i ∈ {1, . . . , n − 1} do
    Find the row index k ≥ i such that A^(i−1)_{k,i} is maximum in absolute value.
    Interchange the rows i and k of the matrices A^(i−1) and P, and of the vectors c^(1), . . . , c^(i−1).
    Construct M_i with corresponding column vector c^(i) as in Lemma 2.6.
    Assign A^(i) ← M_i A^(i−1)
end for
Assign U ← A^(n−1).
Assign L ← I + [ c^(1)  ···  c^(n−1)  0_n ].


# Auxiliary function
function swap_rows!(i, j, matrices...)
    for M in matrices
        M_row_i = M[i, :]
        M[i, :] = M[j, :]
        M[j, :] = M_row_i
    end
end

n = size(A)[1]
L, U = zeros(n, 0), copy(A)
P = [i == j ? 1.0 : 0.0 for i in 1:n, j in 1:n]
for i in 1:n-1
    # Pivoting
    index_row_pivot = i - 1 + argmax(abs.(U[i:end, i]))
    swap_rows!(i, index_row_pivot, U, L, P)

    # Usual Gaussian transformation
    c = [zeros(i-1); 1.0; zeros(n-i)]
    for r in i+1:n
        ratio = U[r, i] / U[i, i]
        c[r] = ratio
        U[r, i:end] -= U[i, i:end] * ratio
    end
    L = [L c]
end
L = [L [zeros(n-1); 1.0]]
# It holds that P*A = L*U

Listing 2.1: LU factorization with partial pivoting.

Remark 2.1. It is possible to show that, if the matrix A is column diagonally dominant in the
sense that
∀j ∈ {1, . . . , n},    |a_jj| ≥ Σ_{i=1, i≠j}^{n} |a_ij|,

then pivoting does not have an effect: at each iteration, the best pivot is already on the
diagonal.


2.2.4 Direct method for symmetric positive definite matrices


The LU factorization with partial pivoting applies to any matrix A ∈ Rn×n that is invertible.
If A is symmetric positive definite, however, it is possible to compute a factorization into lower
and upper triangular matrices at half the computational cost, using the so-called Cholesky
decomposition.

Lemma 2.9 (Cholesky decomposition). If A is symmetric positive definite, then there exists a
lower-triangular matrix C ∈ Rn×n such that

A = CCT . (2.12)

Equation (2.12) is called the Cholesky factorization of A. The matrix C is unique if we require
that all its diagonal entries are positive.

Proof. Since A is positive definite, its LU decomposition exists and is unique by Propositions 2.4
and 2.7. Let D denote the diagonal matrix with the same diagonal as that of U. Then

A = LD(D−1 U).

Note that the matrix D−1 U is unit upper triangular. Since A is symmetric, we have

A = AT = (D−1 U)T (LD)T .

The first and second factors on the right-hand side are respectively unit lower triangular and
upper triangular, and so we deduce, by uniqueness of the LU decomposition, that L = (D−1 U)T
and U = (LD)T . But then
A = LU = LDL^T = (L√D)(L√D)^T.

Here √D denotes the diagonal matrix whose diagonal entries are obtained by taking the square
root of those of D, which are necessarily positive because A is positive definite. This implies the
existence of a Cholesky factorization with C = L√D.

Calculation of the Cholesky factor

The matrix C can be calculated from (2.12). For example, developing the matrix product gives
that a_{1,1} = c_{1,1}^2 and so c_{1,1} = √(a_{1,1}). It is then possible to calculate c_{2,1} from the equation a_{2,1} =
c_{2,1} c_{1,1}, and so on. Implementing the Cholesky factorization is the goal of Exercise 2.7.
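For reference, a hand-written implementation can be checked against the built-in cholesky routine of the LinearAlgebra standard library. The sketch below only illustrates this check on a small, arbitrarily constructed test matrix; it does not replace the implementation asked for in Exercise 2.7.

using LinearAlgebra

n = 4
M = randn(n, n)
A = Symmetric(M * M' + n * I)   # symmetric positive definite test matrix

C = cholesky(A).L               # lower triangular factor with positive diagonal
norm(C * C' - A)                # small, up to round-off errors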

2.2.5 Direct methods for banded matrices


In applications related to partial differential equations, the matrix A ∈ Rn×n very often has a
bandwidth which is small in comparison with n.


Definition 2.5. The bandwidth of a matrix A ∈ Rn×n is the smallest number k ∈ N such
that aij = 0 for all (i, j) ∈ {1, . . . , n}2 with |i − j| > k.

It is not difficult to show that, if A is a matrix with bandwidth k, then so are L and U in the
absence of pivoting. This can be proved by equating the entries of the product LU with those
of the matrix A. We emphasize, however, that the sparsity structure within the band of A may
be destroyed in L and U; this phenomenon is called fill-in.

Reducing the bandwidth: the Cuthill–McKee algorithm

The computational cost of calculating the LU or Cholesky decomposition of a matrix with


bandwidth k scales as O(nk^2), which is much better than the general scaling O(n^3) if k ≪ n.
In applications, the bandwidth k is often related to the matrix size n. For example, if A
arises from the discretization of the Laplacian operator, then k = O(√n) provided that a good
ordering of the vertices is employed. In this case, the computational cost scales as O(n^2).
Since a narrow band is associated with a lower computational cost of the LU decomposition,
it is natural to wonder whether the bandwidth of a matrix A can be reduced. A possible strategy
to this end is to use permutations. More precisely, is it possible to identify a row permutation
matrix P such that PAPT has minimal bandwidth? Given such a matrix, the solution to the
linear system (2.1) can be obtained by first solving (PAPT )y = Pb, and then letting x = PT y.
The Cuthill–McKee algorithm is a heuristic method for finding a good, but sometimes not
optimal, permutation matrix P in the particular case where A is a symmetric matrix. It is based
on the fact that, to a symmetric matrix A, we can associate a unique undirected graph whose
adjacency matrix A∗ has the same sparsity structure as that of A, i.e. zeros in the same places.
For any row permutation matrix Pσ with corresponding permutation σ : {1, . . . , n} → {1, . . . , n}
(see Definition 2.4), the matrices Pσ APTσ and Pσ A∗ PTσ also have the same sparsity structure.
Therefore, minimizing the bandwidth of Pσ APTσ is equivalent to minimizing the bandwidth
of Pσ A∗ PTσ . The key insight for understanding the Cuthill–McKee method is that Pσ A∗ PTσ
is the adjacency matrix of the graph obtained by renumbering the nodes according to the
permutation σ, i.e. by changing the number of the nodes from i to σ(i). Consider, for example,
the following graph and renumbering:

7 8 5 3
   
1 1 1 1 1 1
1 1 1 1 1 1
   
6 1 7 1
 
   

 1 1 1 

1
 1 1 

 1 1 1   1 1 1 
→
   
5 2 8 2
 

 1 1 1 


 1 1 1 

 1 1 1   1 1 1
4 3 6 4
   
1 1 1 1 1 1
   
 
1 1 1 1 1 1

Here we also wrote the adjacency matrices associated to the graphs. We assume that the nodes
are all self-connected, although this is not depicted, and so the diagonal entries of the adjacency


matrices are equal to 1. This renumbering corresponds to the permutation


i    : 1 2 3 4 5 6 7 8
σ(i) : 1 2 4 6 8 7 5 3,

and we may verify that the adjacency matrix of the renumbered graph can be obtained from
the associated row permutation matrix:
   
[Display: the product P A* P^T, where P is the row permutation matrix associated with σ, reproduces the adjacency matrix of the renumbered graph.]

In this example, renumbering the nodes of the graph enables a significant reduction of the
bandwidth, from 7 to 2. The Cuthill–McKee algorithm, which was employed to calculate the
permutation, is an iterative method that produces an ordered n-tuple R containing the nodes in
the new order; in other words, it returns σ −1 (1), . . . , σ −1 (n) . The first step of the algorithm


is to find the node i with the lowest degree, i.e. with the smallest number of connections to other
nodes, and to initialize R = (i). Then the following steps are repeated until R contains all the
nodes of the graph:

• Define Ai as the set containing all the nodes which are adjacent to a node in R but not
themselves in R;

• Sort the nodes in Ai according to the following rules: a node i ∈ Ai comes before j ∈ Ai if i
is connected to a node in R that comes before all the nodes in R to which j is connected.
As a tiebreak, precedence is given to the node with highest degree.

• Append the nodes in Ai to R, in the order determined in the previous item.


Figure 2.1: Illustration of the Cuthill–McKee algorithm. The new numbering of the nodes
is illustrated. The first node was chosen randomly since all the nodes have the same degree.
In this example, the ordered tuple R evolves as follows: (1) → (1, 2, 8) → (1, 2, 8, 3, 7) →
(1, 2, 8, 3, 7, 4, 6) → (1, 2, 8, 3, 7, 4, 6, 5).

The steps of the algorithm for the example above are depicted in Figure 2.1. Another
example, taken from the original paper by Cuthill and McKee [1], is presented in Figure 2.2.


Figure 2.2: Example from the original Cuthill–McKee paper [1].

2.2.6 Exercises

Exercise 2.2 (Inverse of Gaussian transformation). Prove the formula (2.7).

Exercise 2.3. Prove that the product of two lower triangular matrices is lower triangular.

Exercise 2.4. Assume that A ∈ R^{n×n} is positive definite, i.e. that

∀x ∈ R^n \ {0_n},    x^T A x > 0.

Show that all the principal submatrices of A are nonsingular.

Exercise 2.5. Implement the backward substitution algorithm for solving Ux = y. What is
the computational cost of the algorithm?

Exercise 2.6. Compare the condition number of the matrices L and U with and without
partial pivoting. For testing, use a matrix with pseudo-random entries generated as follows

import Random
# Set the seed so that the code is deterministic
Random.seed!(0)
n = 1000 # You can change this parameter
A = randn(n, n)

Solution. See the Jupyter notebook for this chapter.

Exercise 2.7. Write a code for calculating the Cholesky factorization of a symmetric positive
definite matrix A by comparing the entries of the product CC^T with those of the matrix A. What
is the associated computational cost, and how does it compare with that of the LU factorization?
Extra credit: ... if your code is able to exploit the potential banded structure of the matrix
passed as argument for better efficiency. Specifically, your code will be tested with a matrix of
the type BandedMatrix defined in the BandedMatrices.jl package, which you will need to
install. The following code can be useful for testing purposes.

import BandedMatrices
import LinearAlgebra

function cholesky(A)
    m, n = size(A)
    m != n && error("Matrix must be square")
    # Convert to banded matrix
    B = BandedMatrices.BandedMatrix(A)
    B.u != B.l && error("Matrix must be symmetric")
    # --> Your code comes here <--
end

n, u, l = 20000, 2, 2
A = BandedMatrices.brand(n, u, l)
A = A*A'
# so that A is symmetric and positive definite (with probability 1).
C = @time cholesky(A)
LinearAlgebra.norm(C*C' - A, Inf)

For information, my code takes about 1 second to run with the parameters given here.

Exercise 2.8 (Matrix square root). Let A be a symmetric positive definite matrix. Show
that A has a square root, i.e. that there exists a symmetric matrix B such that BB = A.

2.3 Iterative methods for linear systems


Iterative methods enjoy more flexibility than direct methods, because they can be stopped at
any point if the residual is deemed sufficiently small. This generally makes it possible to obtain a good
approximate solution at a computational cost that is significantly lower than that of direct methods. In this
section, we present and study two classes of iterative methods: basic iterative methods based
on a splitting of the matrix A, and the so-called Krylov subspace methods.

2.3.1 Basic iterative methods


The basic iterative methods are particular cases of a general splitting method. Given a splitting
of the matrix of the linear system as A = M − N, for a nonsingular matrix M ∈ Rn×n and a
matrix N ∈ Rn×n , together with an initial guess x(0) of the solution, one step of this general
method reads
Mx(k+1) = Nx(k) + b. (2.13)

For any choice of splitting, the exact solution x∗ to the linear system is a fixed point of this
iteration, in the sense that if x(0) = x∗ , then x(k) = x∗ for all k ≥ 0. Equation (2.13) is
a linear system with matrix M, unknown x(k+1) , and right-hand side Nx(k) + b. There is a
compromise between the cost of a single step and the speed of convergence of the method.
In the extreme case where M = A and N = 0, the method converges to the exact solution
in one step, but performing this step amounts to solving the initial problem. In practice, in
order for the method to be useful, the linear system (2.13) should be relatively simple to solve.
Concretely, this means that the matrix M should be diagonal, triangular, block diagonal, or
block triangular. The error e(k) and residual r (k) at iteration k are defined as follows:

e(k) = x(k) − x∗ , r (k) = Ax(k) − b.


Convergence of the splitting method

Before presenting concrete examples of splitting methods, we obtain a necessary and sufficient
condition for the convergence of (2.13) for any initial guess x(0) .

Proposition 2.10 (Convergence). The splitting method (2.13) converges for any initial
guess x^(0) if and only if ρ(M^{-1}N) < 1. In addition, for any ε > 0 there exists K > 0
such that

∀k ≥ K,    ‖e^(k)‖ ≤ (ρ(M^{-1}N) + ε)^k ‖e^(0)‖.

Proof. Let x_∗ denote the solution to the linear system. Since Mx_∗ − Nx_∗ = b, we have

M(x^(k+1) − x_∗) = N(x^(k) − x_∗).

Using the assumption that M is nonsingular, we obtain that the error satisfies the equation

e^(k+1) = (M^{-1}N) e^(k).

Applying this equality repeatedly, we deduce

e^(k+1) = (M^{-1}N)^2 e^(k−1) = ··· = (M^{-1}N)^{k+1} e^(0).

Therefore, it holds that

∀k ≥ 0,    ‖e^(k)‖ ≤ ‖(M^{-1}N)^k‖ ‖e^(0)‖.

In order to conclude the proof, we use Gelfand's formula, proved in Proposition A.10 of
Appendix A. This states that

lim_{k→∞} ‖(M^{-1}N)^k‖^{1/k} = ρ(M^{-1}N).

In particular, for all ε > 0 there is K ∈ N such that

∀k ≥ K,    ‖(M^{-1}N)^k‖^{1/k} ≤ ρ(M^{-1}N) + ε.

Rearranging this inequality gives ‖(M^{-1}N)^k‖ ≤ (ρ(M^{-1}N) + ε)^k for all k ≥ K, which implies the statement.

At this point, it is natural to wonder whether there exist sufficient conditions on the matrix A
such that the inequality ρ(M−1 N) < 1 is satisfied, which is best achieved on a case by case basis.
In the next sections, we present four instances of splitting methods. For each of them, we obtain
a sufficient condition for convergence. We are particularly interested in the case where the
matrix A is symmetric (or Hermitian) and positive definite, which often arises in applications,
and in the case where A is strictly row or column diagonally dominant. We recall that a matrix
A is said to be row or column diagonally dominant if, respectively,

|a_ii| ≥ Σ_{j≠i} |a_ij|  ∀i,        or        |a_jj| ≥ Σ_{i≠j} |a_ij|  ∀j.


Richardson’s method

Arguably the simplest splitting of the matrix A is given by A = (1/ω)I − ((1/ω)I − A), which leads to
Richardson's method:

x^(k+1) = x^(k) + ω(b − Ax^(k)).        (2.14)

In this case the spectral radius which enters in the asymptotic rate of convergence is given by

ρ(M^{-1}N) = ρ( ω((1/ω)I − A) ) = ρ(I − ωA).

The eigenvalues of the matrix I − ωA are given by 1 − ωλ_i, where (λ_i)_{1≤i≤n} are the eigenvalues
of A. Therefore, the spectral radius is given by

ρ(M^{-1}N) = max_{1≤i≤n} |1 − ωλ_i|.

Case of symmetric positive definite A. If the matrix A is symmetric and positive definite,
it is possible to explicitly calculate the optimal value of ω for convergence. In order to make
convergence as fast as possible, we want the spectral radius of M^{-1}N to be as small as possible,
in view of Proposition 2.10. Denoting by λ_min and λ_max the minimum and maximum eigenvalues
of A, it is not difficult to show that

ρ(M^{-1}N) = max_{1≤i≤n} |1 − ωλ_i| = max( |1 − ωλ_min|, |1 − ωλ_max| ).
1≤i≤L

The maximum is minimized when its two arguments are equal, i.e. when 1 − ωλmin = ωλmax − 1.
From this we deduce the optimal value of ω and the associated spectral radius:

ω_opt = 2/(λ_max + λ_min),        ρ_opt = 1 − 2λ_min/(λ_max + λ_min) = (λ_max − λ_min)/(λ_max + λ_min) = (κ_2(A) − 1)/(κ_2(A) + 1).

We observe that the smaller the condition number of the matrix A, the better the asymptotic
rate of convergence.
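A minimal Julia sketch of Richardson's iteration (2.14) is given below. The symmetric positive definite test matrix and the tolerance are arbitrary choices for illustration, and the optimal step ω_opt is computed from the extreme eigenvalues only for this small example (in practice one would use cheaper estimates).

using LinearAlgebra

# Richardson iteration (2.14): x ← x + ω (b − A x), stopped on the relative residual.
function richardson(A, b; ω, tol=1e-8, maxiter=10_000)
    x = zero(b)
    for k in 1:maxiter
        r = b - A * x
        norm(r) ≤ tol * norm(b) && break
        x += ω * r
    end
    return x
end

n = 100
M = randn(n, n)
A = Symmetric(M' * M + n * I)    # symmetric positive definite test matrix
b = randn(n)

λ = eigvals(A)                   # real eigenvalues, in increasing order
ω_opt = 2 / (λ[1] + λ[end])      # optimal step derived above

x = richardson(A, b; ω=ω_opt)
norm(A * x - b) / norm(b)        # small relative residual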

Remark 2.2 (Link to optimization). In the case where A is symmetric and positive definite,
the Richardson update (2.14) may be viewed as a step of the steepest descent algorithm for
the function f(x) = (1/2) x^T A x − b^T x:

x(k+1) = x(k) − ω∇f (x(k) ). (2.15)

The gradient of this function is ∇f (x) = Ax − b, and its Hessian matrix is A. Since the
Hessian matrix is positive definite, the function is convex and attains its global minimum
when ∇f is zero, i.e. when Ax = b.

Jacobi’s method

In Jacobi’s method, the matrix M in the splitting is the diagonal matrix D with the same entries
as those of A on the diagonal. We denote by L and U the lower and upper triangular parts of


A, without the diagonal. One step of the method reads

Dx(k+1) = (D − A)x(k) + b = −(L + U)x(k) + b (2.16)

Since the matrix D on the left-hand side is diagonal, this linear system with unknown x(k+1) is
very simple to solve. The equation (2.16) can be rewritten equivalently as

a_11 x_1^(k+1) + a_12 x_2^(k)   + ··· + a_1n x_n^(k)   = b_1
a_21 x_1^(k)   + a_22 x_2^(k+1) + ··· + a_2n x_n^(k)   = b_2
    ⋮
a_n1 x_1^(k)   + a_n2 x_2^(k)   + ··· + a_nn x_n^(k+1) = b_n.

The updates for each of the entries of x(k+1) are independent, and so the Jacobi method lends
itself well to parallel implementation. The computational cost of one iteration, measured in
number of floating point operations required, scales as O(n2 ) if A is a full matrix, or O(nk)
if A is a sparse matrix with k nonzero elements per row on average. It is simple to prove the
convergence of Jacobi's method in the case where A is strictly diagonally dominant.
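A short Julia sketch of Jacobi's method is given below; the strictly row diagonally dominant test matrix and the tolerance are arbitrary choices made only so that Proposition 2.11 below applies.

using LinearAlgebra

# Jacobi iteration: x ← x + D⁻¹(b − A x), which is equivalent to (2.16).
function jacobi(A, b; tol=1e-8, maxiter=10_000)
    d = diag(A)                   # diagonal of A, i.e. the matrix D
    x = zero(b)
    for k in 1:maxiter
        r = b - A * x
        norm(r) ≤ tol * norm(b) && break
        x += r ./ d               # same as solving D x⁽ᵏ⁺¹⁾ = (D − A) x⁽ᵏ⁾ + b
    end
    return x
end

# Strictly row diagonally dominant test matrix
n = 100
B = randn(n, n)
A = B + Diagonal(vec(sum(abs.(B), dims=2)) .+ 1.0)
b = randn(n)
x = jacobi(A, b)
norm(A * x - b) / norm(b)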

Proposition 2.11. Assume that A is strictly (row or column) diagonally dominant. Then it
holds that ρ(M−1 N) < 1 for the Jacobi splitting.

Proof. Assume that λ is an eigenvalue of M^{-1}N and v is the associated unit eigenvector. Then

M^{-1}N v = λv   ⇔   Nv = λMv   ⇔   (N − λM)v = 0.

In the case of Jacobi's splitting, this is equivalent to

−(L + λD + U)v = 0.

If |λ| ≥ 1, then the matrix on the left-hand side of this equation is strictly diagonally dominant and
thus invertible (see Exercise 2.9). Therefore v = 0, but this is a contradiction because v is
a vector of unit norm. Consequently, all the eigenvalues are bounded from above strictly by 1
in modulus.

Gauss–Seidel’s method

In the Gauss–Seidel method, the matrix M in the splitting is the lower triangular part of A,
including the diagonal. One step of the method then reads

(L + D)x(k+1) = −Ux(k) + b (2.17)


The system can be solved by forward substitution. The equation (2.17) can be rewritten equiv-
alently as
a_11 x_1^(k+1) + a_12 x_2^(k)   + a_13 x_3^(k)   + ··· + a_1n x_n^(k)   = b_1
a_21 x_1^(k+1) + a_22 x_2^(k+1) + a_23 x_3^(k)   + ··· + a_2n x_n^(k)   = b_2
a_31 x_1^(k+1) + a_32 x_2^(k+1) + a_33 x_3^(k+1) + ··· + a_3n x_n^(k)   = b_3
    ⋮
a_n1 x_1^(k+1) + a_n2 x_2^(k+1) + a_n3 x_3^(k+1) + ··· + a_nn x_n^(k+1) = b_n.

Given x(k) , the first entry of x(k+1) can be obtained from the first equation. Then the value of
the second entry can be obtained from the second equation, etc. Unlike Jacobi’s method, the
Gauss–Seidel method is sequential and the entries of x(k+1) cannot be updated in parallel.
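The sketch below illustrates this sequential sweep in Julia; as before, the strictly diagonally dominant test matrix and the tolerance are arbitrary choices for illustration.

using LinearAlgebra

# Gauss–Seidel iteration (2.17): sweep through the rows, reusing the entries of
# x⁽ᵏ⁺¹⁾ that have already been updated within the current sweep.
function gauss_seidel(A, b; tol=1e-8, maxiter=10_000)
    n = length(b)
    x = zero(b)
    for k in 1:maxiter
        norm(A * x - b) ≤ tol * norm(b) && break
        for i in 1:n
            s = b[i]
            for j in 1:n
                j == i && continue
                s -= A[i, j] * x[j]   # uses the already updated entries for j < i
            end
            x[i] = s / A[i, i]
        end
    end
    return x
end

B = randn(100, 100)
A = B + Diagonal(vec(sum(abs.(B), dims=2)) .+ 1.0)   # strictly diagonally dominant
b = randn(100)
x = gauss_seidel(A, b)
norm(A * x - b) / norm(b)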
It is possible to prove the convergence of the Gauss–Seidel method in particular cases. For
example, the method converges if A is strictly diagonally dominant. Proving this, using an
approach similar to that in the proof of Proposition 2.11, is the goal of Exercise 2.19. It is also
possible to prove convergence when A is Hermitian and positive definite. We show this in the
next section for the relaxation method, which generalizes the Gauss–Seidel method.

Relaxation method

The relaxation method generalizes the Gauss–Seidel method. It corresponds to the splitting

A = (D/ω + L) − ((1 − ω)/ω · D − U).        (2.18)

When ω = 1, this is simply the Gauss–Seidel splitting. The idea is that, by letting ω be a
parameter that can differ from 1, faster convergence can be achieved. This intuition will be
verified later. The equation (2.13) for this splitting can be rewritten equivalently as
  
a_11 (x_1^(k+1) − x_1^(k)) = −ω ( a_11 x_1^(k)   + a_12 x_2^(k) + ··· + a_1n x_n^(k) − b_1 )
a_22 (x_2^(k+1) − x_2^(k)) = −ω ( a_21 x_1^(k+1) + a_22 x_2^(k) + ··· + a_2n x_n^(k) − b_2 )
    ⋮
a_nn (x_n^(k+1) − x_n^(k)) = −ω ( a_n1 x_1^(k+1) + a_n2 x_2^(k+1) + ··· + a_nn x_n^(k) − b_n ).

The coefficient on the right-hand side is larger than in the Gauss–Seidel method if ω > 1, and
smaller if ω < 1. These regimes are called over-relaxation and under-relaxation, respectively.
To conclude this section, we establish a sufficient condition for the convergence of the re-
laxation method, and also of the Gauss–Seidel method as a particular case when ω = 1, when
the matrix A is Hermitian and positive definite. To this end, we begin by showing the following
preparatory result, which concerns a general splitting A = M − N.

Proposition 2.12. Let A be Hermitian and positive definite. If the Hermitian matrix M∗ + N
is positive definite, then ρ(M−1 N) < 1.


Proof. First, notice that M + N∗ is indeed Hermitian because

(M + N∗ )∗ = M∗ + N = A∗ + N∗ + N = A + N∗ + N.

We will show that kM−1 NkA < 1, where k•kA is the matrix norm induced by the following norm
on vectors:

‖x‖_A := √(x^* A x).

Showing that this indeed defines a vector norm is the goal of Exercise 2.11. Since N = M − A,
it holds that kM−1 NkA = kI − M−1 AkA , and so

‖M^{-1}N‖_A = sup{ ‖x − M^{-1}Ax‖_A : ‖x‖_A ≤ 1 }.




Letting y = M−1 Ax, we calculate

∀x ∈ Rn with kxkA ≤ 1, kx − M−1 Axk2A = x∗ Ax − y ∗ Ax − x∗ Ay + y ∗ Ay


= x∗ Ax − y ∗ MM−1 Ax − (M−1 Ax)∗ M∗ y + y ∗ Ay
= x∗ Ax − y ∗ My − y ∗ M∗ y + y ∗ (M − N)y
= x∗ Ax − y ∗ (M∗ + N)y ≤ 1 − y ∗ (M∗ + N)y < 1,

where we used in the last inequality the assumption that M∗ + N is positive definite. This shows
that kM−1 NkA < 1, and so ρ(M−1 N) < 1.

As a corollary, we obtain a sufficient condition for the convergence of the relaxation method.

Corollary 2.13. Assume that A is Hermitian and positive definite. Then the relaxation method
converges if ω ∈ (0, 2).

Proof. For the relaxation method, we have

M + N^* = (D/ω + L) + ((1 − ω)/ω · D − U)^*.

Since A is Hermitian, it holds that D∗ = D and U∗ = L. Therefore,

M + N^* = ((2 − ω)/ω) D.

The diagonal elements of D are all positive, because A is positive definite. (Indeed, if there
was an index i such that dii ≤ 0, then it would hold that eTi Aei = dii ≤ 0, contradicting the
assumption that A is positive definite.) We deduce that M + N∗ is positive definite if and only
if ω ∈ (0, 2). We can then conclude the proof by using Proposition 2.12.

Note that Corollary 2.13 implies as a particular case the convergence of the Gauss–Seidel
method when A is Hermitian and positive definite. The condition ω ∈ (0, 2) is in fact necessary for


the convergence of the relaxation method, not only in the case of a Hermitian positive definite
matrix A but in general.

Proposition 2.14 (Necessary condition for the convergence of the relaxation method). Let
A ∈ Cn×n be an invertible matrix, and let A = Mω − Nω denote the splitting of the relaxation
method with parameter ω. It holds that

∀ω ≠ 0,    ρ(M_ω^{-1} N_ω) ≥ |ω − 1|.

Proof. We recall the following facts:

• the determinant of a product of matrices is equal to the product of the determinants.

• the determinant of a triangular matrix is equal to the product of its diagonal entries;

• the determinant of a matrix is equal to the product of its eigenvalues, to the power of
their algebraic multiplicity. This can be shown from the previous two items, by passing
to the Jordan normal form.

Therefore, we have that

det(M_ω^{-1} N_ω) = det(M_ω)^{-1} det(N_ω) = det((1 − ω)/ω · D − U) / det(D/ω + L) = (1 − ω)^n.


Since the determinant on the left-hand side is the product of the eigenvalues of M_ω^{-1} N_ω, it is
bounded from above in modulus by ρ(M_ω^{-1} N_ω)^n, and so we deduce ρ(M_ω^{-1} N_ω)^n ≥ |1 − ω|^n. The
statement then follows by taking the n-th root.

Comparison between Jacobi and Gauss–Seidel for tridiagonal matrices

For tridiagonal matrices, the convergence rate of the Jacobi and Gauss–Seidel methods satisfy
an explicit relation, which we prove in this section. We denote the Jacobi and Gauss–Seidel
splittings by MJ − NJ and MG − NG , respectively, and use the following notation for the entries
of the matrix A: the diagonal entries of A are denoted by a_1, . . . , a_n, its superdiagonal entries
by b_1, . . . , b_{n−1}, and its subdiagonal entries by c_1, . . . , c_{n−1}.
Before presenting and proving the result, notice that for any µ ≠ 0 it holds that

diag(µ, µ^2, . . . , µ^n) A diag(µ^{-1}, µ^{-2}, . . . , µ^{-n})        (2.19)

is again tridiagonal, with the same diagonal entries a_i, superdiagonal entries µ^{-1} b_i, and
subdiagonal entries µ c_i.


Proposition 2.15. Assume that A is tridiagonal with nonzero diagonal elements, so that
both MJ = D and MG = L + D are invertible. Then

ρ(M_G^{-1} N_G) = ρ(M_J^{-1} N_J)^2.

Proof. If λ is an eigenvalue of M_G^{-1} N_G with associated unit eigenvector v, then

M_G^{-1} N_G v = λv   ⇔   N_G v = λ M_G v   ⇔   (N_G − λ M_G)v = 0.

For fixed λ, there exists a nontrivial solution v to the last equation if and only if

p_G(λ) := det(N_G − λ M_G) = − det(λL + λD + U) = 0.

Likewise, λ is an eigenvalue of M_J^{-1} N_J if and only if

p_J(λ) := det(N_J − λ M_J) = − det(L + λD + U) = 0.

Now notice that

p_G(λ^2) = − det(λ^2 L + λ^2 D + U) = −λ^n det(λL + λD + λ^{-1} U).

Applying (2.19) with µ = λ ≠ 0, we deduce

p_G(λ^2) = −λ^n det(L + λD + U) = λ^n p_J(λ).

It is clear that this relation is true also if λ = 0. Consequently, it holds that if λ is an eigenvalue
of the matrix M_J^{-1} N_J then λ^2 is an eigenvalue of M_G^{-1} N_G. Conversely, if λ is an eigenvalue
of M_G^{-1} N_G, then the two square roots of λ are eigenvalues of M_J^{-1} N_J.
−1

If a matrix A is tridiagonal and Toeplitz, i.e. if all its diagonal entries are equal to a, all its
superdiagonal entries are equal to b, and all its subdiagonal entries are equal to c, then it is
possible to prove that the eigenvalues of A are given by

λ_k = a + 2√(bc) cos( kπ/(n + 1) ),        k = 1, . . . , n.        (2.20)

In this case, the spectral radius of M_J^{-1} N_J can be determined explicitly.
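The formula (2.20) is easily checked numerically in the symmetric case bc > 0. The sketch below uses the arbitrary values a = 2, b = c = −1 (the familiar 1D Laplacian stencil) and compares the formula with the eigenvalues returned by LinearAlgebra.

using LinearAlgebra

# Numerical check of (2.20) for a small symmetric tridiagonal Toeplitz matrix.
n, a, b, c = 8, 2.0, -1.0, -1.0
A = diagm(0 => fill(a, n), 1 => fill(b, n - 1), -1 => fill(c, n - 1))

λ_formula = [a + 2 * sqrt(b * c) * cos(k * π / (n + 1)) for k in 1:n]
λ_numeric = eigvals(Symmetric(A))               # sorted in increasing order

maximum(abs.(sort(λ_formula) .- λ_numeric))     # ≈ 0 up to round-off errors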

Monitoring the convergence

In practice, we have access to the residual r (k) = Ax(k) − b at each iteration, but not to the
error e(k) = x(k) − x∗ , as calculating the latter would require knowing the exact solution of the


problem. Nevertheless, the two are related by the equation

r (k) = Ae(k) ⇔ e(k) = A−1 r (k) .

Therefore, it holds that ‖e^(k)‖ ≤ ‖A^{-1}‖‖r^(k)‖. Likewise, the relative error satisfies

‖e^(k)‖/‖x_∗‖ = ‖A^{-1}r^(k)‖/‖A^{-1}b‖,

and since ‖b‖ = ‖AA^{-1}b‖ ≤ ‖A‖‖A^{-1}b‖, we deduce

‖e^(k)‖/‖x_∗‖ ≤ κ(A) ‖r^(k)‖/‖b‖.

The fraction on the right-hand side is the relative residual. If the system is well conditioned,
that is if κ(A) is close to one, then controlling the residual enables a good control of the error.

Stopping criterion

In practice, several criteria can be employed in order to decide when to stop iterating. Given a
small number ε (unrelated to the machine epsilon in Chapter 1), the following alternatives are
available:
• Stop when ‖r^(k)‖ ≤ ε. The downside of this approach is that it is not scaling invariant:
when used for solving the following rescaled system

kAx = kb,        k ≠ 1,

a splitting method with rescaled initial guess kx^(0) will require a number of iterations
that depends on k: fewer if k ≪ 1 and more if k ≫ 1. In practice, controlling the relative
residual and the relative error is often preferable.

• Stop when ‖r^(k)‖/‖r^(0)‖ ≤ ε. This criterion is scaling invariant, but the number of
iterations is dependent on the quality of the initial guess x^(0).

• Stop when ‖r^(k)‖/‖b‖ ≤ ε. This criterion is generally the best, because it is scaling
invariant and the quality of the final iterate is independent of that of the initial guess.

2.3.2 The conjugate gradient method


As already mentioned in Remark 2.2, when the matrix A ∈ Rn×n in the linear system Ax = b
is symmetric and positive definite, the system can be interpreted as a minimization problem for
the function

f(x) = (1/2) x^T A x − b^T x.        (2.21)

The fact that the exact solution x_∗ to the linear system is the unique minimizer of this function
appears clearly when rewriting f as follows:

f(x) = (1/2) (x − x_∗)^T A (x − x_∗) − (1/2) x_∗^T A x_∗.        (2.22)


The second term is constant with x, and the first term is strictly positive if x−x∗ 6= 0, because A
is positive definite. We saw that Richardson’s method can be interpreted as a steepest descent
with fixed step size,
x(k+1) = x(k) − ω∇f (x(k) ).

In this section, we will present and study other methods for solving the linear system (2.1)
which can be viewed as optimization methods. Since A is symmetric, it is diagonalizable and
the function f can be rewritten as

f(x) = (1/2) (x − x_∗)^T Q D Q^T (x − x_∗) − (1/2) x_∗^T A x_∗
     = (1/2) (Q^T e)^T D (Q^T e) − (1/2) x_∗^T A x_∗,        e = x − x_∗.

Therefore, we have that

f(x) = (1/2) Σ_{i=1}^{n} λ_i η_i^2 − (1/2) x_∗^T A x_∗,        η = Q^T (x − x_∗),

where (λi )1≤i≤n are the diagonal entries of D. This shows that f is a paraboloid after a change
of coordinates.

Steepest descent method

The steepest descent method is more general than Richardson’s method in the sense that the
step size changes from iteration to iteration and the method is not restricted to quadratic
functions of the form (2.21). Each iteration is of the form

x(k+1) = x(k) − ωk ∇f (x(k) ).

It is natural to wonder whether the step size ωk can be fixed in such a way that f (x(k+1) )
is as small as possible. For the case of the quadratic function (2.21), this value of ωk can be
calculated explicitly for a general search direction d, and in particular also when d = ∇f (x(k) ).
We calculate that

f(x^(k+1)) = f(x^(k) − ω_k d) = (1/2) (x^(k) − ω_k d)^T A (x^(k) − ω_k d) − (x^(k) − ω_k d)^T b
           = f(x^(k)) + (ω_k^2/2) d^T A d − ω_k d^T r^(k).        (2.23)

When viewed as a function of the real parameter ωk , the right-hand side is a convex quadratic
function. It is minimized when its derivative is equal to zero, i.e. when

ω_k d^T A d − d^T (A x^(k) − b) = 0   ⇒   ω_k = d^T r^(k) / (d^T A d).        (2.24)

The steepest descent algorithm with step size obtained from this equation is summarized in Al-
gorithm 3 below. By construction, the function value f (x(k) ) is nonincreasing with k, which is

equivalent to saying that the error x^(k) − x_∗ is nonincreasing in the norm x ↦ √(x^T A x). In order


to quantify more precisely the decrease of the error in this norm, we introduce the notation

E_k = ‖x^(k) − x_∗‖_A^2 := (x^(k) − x_∗)^T A (x^(k) − x_∗) = (A x^(k) − b)^T A^{-1} (A x^(k) − b).

We begin by showing the following auxiliary lemma.

Lemma 2.16 (Kantorovich inequality). Let A ∈ Rn×n be a symmetric and positive definite
matrix, and let λ_1 ≤ ··· ≤ λ_n denote its eigenvalues. Then for all nonzero z ∈ R^n it holds that

(z^T z)^2 / ((z^T A z)(z^T A^{-1} z)) ≥ 4 λ_1 λ_n / (λ_1 + λ_n)^2.

Proof. By the AM-GM (arithmetic mean-geometric mean) inequality, it holds for all t > 0 that
 
√((z^T A z)(z^T A^{-1} z)) = √((t z^T A z)(t^{-1} z^T A^{-1} z)) ≤ (1/2) ( t z^T A z + (1/t) z^T A^{-1} z )
                           = (1/2) z^T ( t A + (1/t) A^{-1} ) z.

The matrix on the right-hand side is also symmetric and positive definite, with eigenvalues
equal to t λ_i + (t λ_i)^{-1}. Therefore, we deduce

∀t > 0,    √((z^T A z)(z^T A^{-1} z)) ≤ (1/2) ( max_{i∈{1,...,n}} ( t λ_i + (t λ_i)^{-1} ) ) z^T z.        (2.25)

The function x 7→ x + x−1 is convex, and so over any closed interval [xmin , xmax ] it attains its
maximum either at x_min or at x_max. Consequently, it holds that

max_{i∈{1,...,n}} ( t λ_i + (t λ_i)^{-1} ) = max( t λ_1 + 1/(t λ_1),  t λ_n + 1/(t λ_n) ).

In order to obtain the best possible bound from (2.25), we should let t be such that the maximum
is minimized, which occurs when the two arguments of the maximum are equal:

t λ_1 + 1/(t λ_1) = t λ_n + 1/(t λ_n)   ⇒   t = 1/√(λ_1 λ_n).

For this value of t, the maximum in (2.25) is equal to


√(λ_1/λ_n) + √(λ_n/λ_1).

By substituting this expression in (2.25) and rearranging, we obtain the statement.

We are now able to prove the convergence of the steepest descent method.


Theorem 2.17 (Convergence of the steepest descent method). It holds that


E_{k+1} ≤ ((κ_2(A) − 1)/(κ_2(A) + 1))^2 E_k.

Proof. Substituting x^(k+1) = x^(k) − ω_k d in the expression for E_{k+1}, with d = r^(k) and ω_k given by (2.24), we obtain

E_{k+1} = (x^(k) − ω_k d − x_∗)^T A (x^(k) − ω_k d − x_∗)
        = E_k − 2 ω_k d^T r^(k) + ω_k^2 d^T A d
        = E_k − (d^T d)^2 / (d^T A d) = ( 1 − (d^T d)^2 / ((d^T A d)(d^T A^{-1} d)) ) E_k.

Using the Kantorovich inequality, we have

E_{k+1} ≤ ( 1 − 4 λ_1 λ_n/(λ_1 + λ_n)^2 ) E_k = ((λ_n − λ_1)/(λ_1 + λ_n))^2 E_k = ((κ_2(A) − 1)/(κ_2(A) + 1))^2 E_k.

We immediately deduce the statement from this inequality.
We immediately deduce the statement from this inequality.

Algorithm 3 Steepest descent method


1: Pick ε and initial x
2: r ← Ax − b
3: while ‖r‖ ≥ ε‖b‖ do
4:     d ← r
5:     ω ← d^T r / (d^T A d)
6:     x ← x − ωd
7:     r ← Ax − b
8: end while
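A direct Julia transcription of Algorithm 3 is sketched below for a symmetric positive definite matrix; the test matrix, tolerance and iteration cap are arbitrary choices for illustration.

using LinearAlgebra

# Steepest descent with exact line search (Algorithm 3).
function steepest_descent(A, b; tol=1e-8, maxiter=100_000)
    x = zero(b)
    for k in 1:maxiter
        r = A * x - b                  # residual r⁽ᵏ⁾
        norm(r) ≤ tol * norm(b) && break
        d = r                          # steepest descent direction
        ω = dot(d, r) / dot(d, A * d)  # optimal step size (2.24)
        x -= ω * d
    end
    return x
end

n = 100
M = randn(n, n)
A = Symmetric(M' * M + n * I)
b = randn(n)
x = steepest_descent(A, b)
norm(A * x - b) / norm(b)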

Preconditioned steepest descent

We observe from Theorem 2.17 that the convergence of the steepest descent method is faster
when the condition number of the matrix A is low. This naturally leads to the following question:
can we reformulate the minimization of f (x) in (2.21) as another optimization problem which
is of the same form but involves a matrix with a lower condition number, thereby providing
scope for faster convergence? In order to answer this question, we consider a linear change of
coordinates y = T−1 x, where T is an invertible matrix, and we define

f̃(y) = f(Ty) = (1/2) y^T (T^T A T) y − (T^T b)^T y.        (2.26)

This function is of the same form as f in (2.21), with the matrix Ã := T^T A T instead of A
and the vector b̃ := T^T b instead of b. Its minimizer is y_∗ = T^{-1} x_∗. The steepest descent
algorithm can be applied to (2.26) and, from an approximation y (k) of the minimizer y ∗ , an


approximation x(k) of x∗ is obtained by the change of variable x(k) = Ty (k) . This approach
is called preconditioning. By Theorem 2.17, the steepest descent method satisfies the following
error estimate when applied to the function (2.26):

E_{k+1} ≤ ((κ_2(T^T A T) − 1)/(κ_2(T^T A T) + 1))^2 E_k,        E_k = (y^(k) − y_∗)^T Ã (y^(k) − y_∗) = (x^(k) − x_∗)^T A (x^(k) − x_∗).

Consequently, the convergence is faster than that of the usual steepest descent method if
κ2 (TT AT) < κ2 (A). The optimal change of coordinates is given by T = C−T , where C is
the factor of the Cholesky factorization of A as CCT . Indeed, in this case

TT AT = C−1 CCT C−T = I ⇒ κ2 (TT AT) = 1,

and the method converges in a single iteration! However, this iteration amounts to solving the
linear system by direct Cholesky factorization of A. In practice, it is usual to define T from an
approximation of the Cholesky factorization, such as the incomplete Cholesky factorization.
To conclude this section, we demonstrate that the change of variable from x to y need not be performed explicitly in practice. Indeed, one step of the steepest descent algorithm applied to the function f̃ reads

\[
y^{(k+1)} = y^{(k)} - \tilde\omega_k (\tilde A y^{(k)} - \tilde b),
\qquad
\tilde\omega_k = \frac{(\tilde A y^{(k)} - \tilde b)^T (\tilde A y^{(k)} - \tilde b)}{(\tilde A y^{(k)} - \tilde b)^T \tilde A (\tilde A y^{(k)} - \tilde b)}.
\]

Letting x(k) = Ty(k), this equation can be rewritten as the following iteration:

\[
x^{(k+1)} = x^{(k)} - \tilde\omega_k d_k,
\qquad
\tilde\omega_k = \frac{d_k^T r^{(k)}}{d_k^T A d_k},
\qquad
d_k = T T^T (A x^{(k)} - b).
\]

A comparison with (2.24) shows that the step size ω̃k is such that f(x(k+1)) is minimized. This reasoning shows that the preconditioned steepest descent method amounts to choosing the direction d(k) = TTᵀr(k) at each iteration, instead of just r(k), as is apparent in Algorithm 4. It is simple to check that −d(k) is a descent direction for f:

\[
-\nabla f(x)^T \, T T^T (A x - b) = -\bigl(T^T(Ax - b)\bigr)^T \bigl(T^T (Ax - b)\bigr) \leq 0.
\]

Algorithm 4 Preconditioned steepest descent method


1: Pick ε, invertible T and initial x
2: r ← Ax − b
3: while krk ≥ εkbk do
4: d ← TT Tr
5: ω ← dT r/dT Ad
6: x ← x − ωd
7: r ← Ax − b
8: end while
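A possible Julia sketch of Algorithm 4 is given below. The choice of preconditioner is ours: for simplicity we take T diagonal, namely T = diag(A)^(-1/2) (a Jacobi-type preconditioner), whereas in practice one would rather build T from an incomplete Cholesky factorization as mentioned above.

using LinearAlgebra

# Preconditioned steepest descent with a simple diagonal (Jacobi-type) preconditioner.
function preconditioned_steepest_descent(A, b, x; ε=1e-8, maxiter=10_000)
    T = Diagonal(1 ./ sqrt.(diag(A)))   # invertible T; illustrative choice only
    r = A*x - b
    for k in 1:maxiter
        if norm(r) < ε*norm(b)
            return x, k
        end
        d = T*(T'*r)                    # d = T Tᵀ r
        ω = dot(d, r) / dot(d, A*d)
        x -= ω*d
        r = A*x - b
    end
    return x, maxiter
end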


Conjugate directions method

Definition 2.6 (Conjugate directions). Let A be a symmetric positive definite matrix. Two vectors d1 and d2 are called A-orthogonal or conjugate with respect to A if d1ᵀAd2 = 0, i.e. if they are orthogonal for the inner product ⟨x, y⟩_A = xᵀAy.

Assume that d0 , . . . , dn−1 are n pairwise A-orthogonal nonzero directions. By Exercise 2.20,
these vectors are linearly independent, and so they form a basis of Rn . Consequently, for any
initial guess x(0) , the vector x(0) − x∗ , where x∗ is the solution to the linear system Ax = b,
can be decomposed as
x(0) − x∗ = α0 d0 + · · · + αn−1 dn−1 .

Taking the ⟨•, •⟩_A inner product of both sides with dk, for k ∈ {0, . . . , n − 1}, we obtain an expression for the scalar coefficient αk:

\[
\alpha_k = \frac{d_k^T A (x^{(0)} - x_*)}{d_k^T A d_k} = \frac{d_k^T (A x^{(0)} - b)}{d_k^T A d_k}.
\]

Therefore, calculating this coefficient does not require knowledge of the exact solution x∗. Given conjugate directions, the exact solution can be obtained as

\[
x_* = x^{(0)} - \sum_{k=0}^{n-1} \alpha_k d_k,
\qquad
\alpha_k = \frac{d_k^T r^{(0)}}{d_k^T A d_k}. \tag{2.27}
\]

From this equation we deduce, denoting by D the matrix (d0 · · · dn−1), that the inverse of A can be expressed as

\[
A^{-1} = D (D^T A D)^{-1} D^T = \sum_{k=0}^{n-1} \frac{d_k d_k^T}{\nu_k},
\qquad
\nu_k = d_k^T A d_k.
\]

In this equation, νk is the k-th diagonal entry of the matrix DT AD. The conjugate directions
method is illustrated in Algorithm 5. Its implementation is very similar to the steepest descent
method, the only difference being that the descent direction at iteration k is given by dk instead
of r (k) . In particular, the step size at each iteration is such that f (x(k+1) ) is minimized.

Algorithm 5 Conjugate directions method


1: Assuming d0 , . . . , dn−1 are given.
2: Pick initial x(0)
3: for k in {0, . . . , n − 1} do
4: r (k) = Ax(k) − b
5: ωk = dTk r (k) /dTk Adk
6: x(k+1) = x(k) − ωk dk
7: end for
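Since Algorithm 5 presupposes that A-orthogonal directions are available, a simple way to test it is to use the eigenvectors of a symmetric positive definite matrix, which are pairwise A-orthogonal. The Julia sketch below is our own illustration (function name and test matrix are assumptions); it checks that the exact solution is recovered after n steps, up to round-off.

using LinearAlgebra

# Conjugate directions method (Algorithm 5) with the columns of D as directions.
function conjugate_directions(A, b, D)
    x = zeros(length(b))
    for k in 1:size(D, 2)
        d = D[:, k]
        r = A*x - b
        ω = dot(d, r) / dot(d, A*d)
        x -= ω*d
    end
    return x
end

n = 5
B = randn(n, n); A = B' * B + n * I      # random SPD matrix (illustrative)
b = randn(n)
D = eigen(Symmetric(A)).vectors          # eigenvectors are pairwise A-orthogonal
x = conjugate_directions(A, b, D)
norm(A*x - b)                            # ≈ 0 up to round-off after n steps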

Let us now establish the connection between Algorithm 5 and (2.27), which is not immediately apparent because (2.27) involves only the initial residual Ax(0) − b, while the residual at the current iteration r(k) is used in the algorithm.


Proposition 2.18 (Convergence of the conjugate directions method). The vector x(k) obtained after k iterations of the conjugate directions method is given by

\[
x^{(k)} = x^{(0)} - \sum_{i=0}^{k-1} \alpha_i d_i, \qquad \alpha_i = \frac{d_i^T r^{(0)}}{d_i^T A d_i}. \tag{2.28}
\]

In particular, the method converges in at most n iterations.

Proof. Let us denote by y(k) the solution obtained after k steps of Algorithm 5. Our goal is to show that y(k) coincides with x(k) defined in (2.28). The result is trivial for k = 0. Reasoning by induction, we assume that it holds true up to k. Then performing step k + 1 of the algorithm gives

\[
y^{(k+1)} = y^{(k)} - \omega_k d_k, \qquad \omega_k = \frac{d_k^T r^{(k)}}{d_k^T A d_k}.
\]

On the other hand, it holds from (2.28) that

\[
x^{(k+1)} = x^{(k)} - \alpha_k d_k, \qquad \alpha_k = \frac{d_k^T r^{(0)}}{d_k^T A d_k}.
\]

By the induction hypothesis it holds that y(k) = x(k) and so, in order to prove the equality y(k+1) = x(k+1), it is sufficient to show that ωk = αk, i.e. that

\[
d_k^T r^{(k)} = d_k^T r^{(0)}
\quad\Leftrightarrow\quad
d_k^T \bigl(r^{(k)} - r^{(0)}\bigr) = 0
\quad\Leftrightarrow\quad
d_k^T A \bigl(x^{(k)} - x^{(0)}\bigr) = 0.
\]

The latter equality follows from the A-orthogonality of the directions, since x(k) − x(0) is a linear combination of d0, . . . , dk−1 by (2.28).

Since ωk in Algorithm 5 coincides with the expression in (2.24), the conjugate directions algorithm satisfies the following “local optimization” property: the iterate x(k+1) minimizes f on the straight line ω ↦ x(k) − ωdk. In fact, it also satisfies the following stronger property.

Proposition 2.19 (Optimality of the conjugate directions method). The iterate x(k) is the
minimizer of f over the set x(0) + Bk , where Bk = span{d0 , . . . , dk−1 }.

Proof. By (2.27), it holds that

\[
x_* = x^{(0)} - \sum_{i=0}^{n-1} \alpha_i d_i, \qquad \alpha_i = \frac{d_i^T r^{(0)}}{d_i^T A d_i}.
\]

On the other hand, any vector y ∈ x(0) + Bk can be expanded as

\[
y = x^{(0)} - \beta_0 d_0 - \dots - \beta_{k-1} d_{k-1}.
\]

Employing these two expressions, the formula for f in (2.22), and the A-orthogonality of the directions, we obtain

\[
f(y) = \frac{1}{2} (y - x_*)^T A (y - x_*) - \frac{1}{2} x_*^T A x_*
= \frac{1}{2} \sum_{i=0}^{k-1} (\beta_i - \alpha_i)^2 \, d_i^T A d_i
+ \frac{1}{2} \sum_{i=k}^{n-1} \alpha_i^2 \, d_i^T A d_i
- \frac{1}{2} x_*^T A x_*.
\]

This is minimized when βi = αi for all i ∈ {0, . . . , k − 1}, in which case y coincides with the k-th iterate x(k) of the conjugate directions method in view of Proposition 2.18.

Remark 2.3. Let ‖•‖_A denote the norm induced by the inner product (x, y) ↦ xᵀAy. Since

\[
\|x^{(k)} - x_*\|_A = \sqrt{2 f(x^{(k)}) + x_*^T A x_*},
\]

Proposition 2.19 shows that x(k) minimizes the norm ‖x − x∗‖_A over x ∈ x(0) + Bk.

A corollary of Proposition 2.19 is that the gradient of f at x(k), i.e. the residual r(k) = Ax(k) − b, is orthogonal to any vector in span{d0, . . . , dk−1} for the usual Euclidean inner product. This can also be checked directly from the formula

\[
x^{(k)} - x_* = \sum_{i=k}^{n-1} \alpha_i d_i, \qquad \alpha_i = \frac{d_i^T r^{(0)}}{d_i^T A d_i},
\]

which follows directly from Proposition 2.18. Indeed, it holds that

\[
\forall j \in \{0, \dots, k-1\}, \qquad
d_j^T r^{(k)} = d_j^T A \bigl(x^{(k)} - x_*\bigr) = \sum_{i=k}^{n-1} \alpha_i \, d_j^T A d_i = 0. \tag{2.29}
\]

The conjugate gradient method

In the previous section, we showed that, given n conjugate directions, the solution to the
linear system Ax = b can be obtained in a finite number of iterations using Algorithm 5. The
conjugate gradient method can be viewed as a particular case of the conjugate directions method.
Instead of assuming that the conjugate directions are given, they are constructed iteratively
as part of the algorithm. Given an initial guess x(0) , the first direction is the residual r (0) ,
which coincides with the gradient of f at x(0) . The directions employed for the next iterations
are obtained by applying the Gram–Schmidt process to the residuals. More precisely, given conjugate directions d0, . . . , dk−1, and letting x(k) denote the k-th iterate of the conjugate directions method, the direction dk is obtained by

\[
d_k = r^{(k)} - \sum_{i=0}^{k-1} \frac{d_i^T A r^{(k)}}{d_i^T A d_i} \, d_i,
\qquad
r^{(k)} = A x^{(k)} - b. \tag{2.30}
\]

It is simple to check that dk is indeed A-orthogonal to di for i ∈ {0, . . . , k − 1}, and that dk is nonzero if r(k) is nonzero. To prove the latter claim, we can take the Euclidean inner product of both sides with r(k) and use Proposition 2.19 to deduce that

\[
d_k^T r^{(k)} = (r^{(k)})^T r^{(k)} > 0. \tag{2.31}
\]

It appears from (2.30) that the cost of calculating a new direction grows linearly with the
iteration index. In fact, it turns out that only the last term in the sum is nonzero, and so
the cost of computing a new direction does not grow with the iteration index k. In order to
explain why only the last term of the sum in (2.30) is nonzero, we begin by making a couple of
observations:

• Since the directions are obtained from the residuals, it holds that

∀k ∈ {0, . . . , n − 1}, span{d0 , . . . , dk } = span{r (0) , . . . , r (k) }. (2.32)

• A calculation gives that

r (i+1) = Ax(i+1) − b = A(x(i) − ωi di ) − b = r (i) − ωi Adi . (2.33)

Note that ωi > 0 if r(i) ≠ 0, by (2.31). Rearranging this equation and noting that the difference r(i+1) − r(i) is a linear combination of the first i + 1 conjugate directions by (2.32), we deduce that there exist scalar coefficients (α_{i,j})_{0≤j≤i+1} such that

\[
A d_i = \frac{1}{\omega_i}\bigl(r^{(i)} - r^{(i+1)}\bigr) = \sum_{j=0}^{i+1} \alpha_{i,j} d_j.
\]

These two observations imply that

\[
d_i^T A r^{(k)} = (A d_i)^T r^{(k)} = \Biggl(\sum_{j=0}^{i+1} \alpha_{i,j} d_j\Biggr)^{T} r^{(k)}.
\]

Using the orthogonality property (2.29) for the residual of the conjugate directions method, we deduce that the right-hand side of this equation is zero for all i ∈ {0, . . . , k − 2}, implying that only the last term in the sum of (2.30) is nonzero. This leads to Algorithm 6.
The subspace spanned by the descent directions of the conjugate gradient method can be
characterized precisely as follows.

Proposition 2.20. Assume that ‖r(k)‖ ≠ 0 for all k < m ≤ n. Then it holds that

\[
\forall k \in \{0, \dots, m\}, \qquad
\operatorname{span}\{r^{(0)}, r^{(1)}, \dots, r^{(k)}\} = \operatorname{span}\{r^{(0)}, A r^{(0)}, \dots, A^k r^{(0)}\}.
\]

The subspace on the right-hand side is called a Krylov subspace and denoted by B_{k+1}.

Algorithm 6 Conjugate gradient method

1: Pick initial x(0)
2: d0 = r(0) = Ax(0) − b
3: for k in {0, . . . , n − 1} do
4:   if ‖r(k)‖ = 0 then
5:     Stop
6:   end if
7:   ωk = dTk r(k) / dTk Adk
8:   x(k+1) = x(k) − ωk dk
9:   r(k+1) = Ax(k+1) − b
10:  βk = dTk Ar(k+1) / dTk Adk
11:  dk+1 = r(k+1) − βk dk
12: end for

Proof. The result is clear for k = 0. Reasoning by induction, we prove that if the result is true up to k < m, then it is also true for k + 1. From (2.33) we have

r (k+1) = r (k) − ωk Adk . (2.34)

By the induction hypothesis and (2.32), there exist scalar coefficients (γi )0≤i≤k such that

dk = γ0 r (0) + γ1 Ar (0) + · · · + γk Ak r (0)

The coefficient γk is necessarily nonzero, otherwise dk would be a linear combination of the


previous conjugate directions. The residual r (k) admits a decomposition of the same form:

r (k) = β0 r (0) + β1 Ar (0) + · · · + βk Ak r (0)

Substituting these expressions in (2.34), we conclude that

\[
r^{(k+1)} = \beta_0 r^{(0)} + (\beta_1 - \omega_k \gamma_0) A r^{(0)} + \dots + (\beta_k - \omega_k \gamma_{k-1}) A^k r^{(0)} - \omega_k \gamma_k A^{k+1} r^{(0)}.
\]

Since both ωk and γk are nonzero, the coefficient of A^{k+1} r(0) is nonzero, and the proof is complete.

Although the conjugate gradient method converges in a finite number of iterations, perform-
ing n iterations for very large systems would require an excessive computational cost, and so it
is sometimes desirable to stop iterating when the residual is sufficiently small. To conclude this
section, we study the convergence of the method.

Theorem 2.21 (Convergence of the conjugate gradient method). The error for the conjugate gradient method, measured as E_{k+1} := (x(k+1) − x∗)ᵀA(x(k+1) − x∗), satisfies the following inequality:

\[
\forall q_k \in P(k), \qquad
E_{k+1} \leq \max_{1 \leq i \leq n} \bigl(1 + \lambda_i q_k(\lambda_i)\bigr)^2 \, E_0. \tag{2.35}
\]

Here P(k) is the vector space of polynomials of degree less than or equal to k.


Proof. In view of Proposition 2.20, the iterate x(k+1) can be written as

\[
x^{(k+1)} = x^{(0)} + \sum_{i=0}^{k} \alpha_i A^i r^{(0)} = x^{(0)} + p_k(A) r^{(0)},
\]

where pk is a polynomial of degree k. By Proposition 2.19, pk is in fact the polynomial of degree k such that f(x(k+1)), and thus also E_{k+1} by (2.22), is minimized. Noting that

\[
x^{(k+1)} - x_* = x^{(0)} - x_* + p_k(A) r^{(0)} = x^{(0)} - x_* + p_k(A) A (x^{(0)} - x_*)
= \bigl(I + A p_k(A)\bigr)(x^{(0)} - x_*),
\]

we deduce that

\[
\forall q_k \in P(k), \qquad
E_{k+1} \leq (x^{(0)} - x_*)^T A \bigl(I + A q_k(A)\bigr)^2 (x^{(0)} - x_*).
\]

In order to exploit this inequality, it is useful to diagonalize A as A = QDQᵀ, for an orthogonal matrix Q and a diagonal matrix D. Since qk(A) = Q qk(D) Qᵀ for all qk ∈ P(k), it holds that

\[
\forall q_k \in P(k), \qquad
E_{k+1} \leq \bigl(Q^T (x^{(0)} - x_*)\bigr)^T D \bigl(I + D q_k(D)\bigr)^2 \bigl(Q^T (x^{(0)} - x_*)\bigr)
\leq \max_{1 \leq i \leq n} \bigl(1 + \lambda_i q_k(\lambda_i)\bigr)^2
\underbrace{\bigl(Q^T (x^{(0)} - x_*)\bigr)^T D \bigl(Q^T (x^{(0)} - x_*)\bigr)}_{E_0},
\]

which completes the proof.

A corollary of Theorem 2.21 is that, if A has m ≤ n distinct eigenvalues λ1, . . . , λm, then the conjugate gradient method converges in at most m iterations. Indeed, in this case we can take k = m − 1 and

\[
q_k(\lambda) = \frac{1}{\lambda}\left(\frac{(\lambda_1 - \lambda) \cdots (\lambda_m - \lambda)}{\lambda_1 \cdots \lambda_m} - 1\right).
\]

It is simple to check that the right-hand side is indeed a polynomial of degree m − 1, and that 1 + λi qk(λi) = 0 for all eigenvalues of A, so that the right-hand side of (2.35) vanishes.
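This corollary is easy to illustrate numerically. The Julia sketch below implements Algorithm 6 and applies it to a matrix constructed, by our own choice, to have only three distinct eigenvalues; up to round-off, the iteration should stop after about three steps.

using LinearAlgebra

# Conjugate gradient method (Algorithm 6); names and tolerances are our own choices.
function conjugate_gradient(A, b; ε=1e-12, maxiter=length(b))
    x = zeros(length(b))
    r = A*x - b
    d = copy(r)
    niter = 0
    for k in 1:maxiter
        norm(r) ≤ ε*norm(b) && break
        ω = dot(d, r) / dot(d, A*d)
        x -= ω*d
        r = A*x - b
        β = dot(d, A*r) / dot(d, A*d)
        d = r - β*d
        niter = k
    end
    return x, niter
end

n = 100
Q = Matrix(qr(randn(n, n)).Q)              # random orthogonal matrix
Λ = Diagonal(rand([1.0, 2.0, 5.0], n))     # only 3 distinct eigenvalues
A = Symmetric(Q*Λ*Q')
b = randn(n)
x, niter = conjugate_gradient(A, b)        # niter should be about 3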
In general, finding the polynomial that minimizes the right-hand side of (2.35) is not possible,
because the eigenvalues of the matrix A are unknown. However, it is possible to derive from
this equation an error estimate with a dependence on the condition number κ2(A).

Theorem 2.22. It holds that

\[
\forall k \geq 0, \qquad E_{k+1} \leq 4 \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{2(k+1)} E_0,
\qquad \kappa := \kappa_2(A).
\]

Sketch of the proof. Theorem 2.21 implies that

\[
\forall q_k \in P(k), \qquad
E_{k+1} \leq \max_{\lambda \in [\lambda_1, \lambda_n]} \bigl(1 + \lambda q_k(\lambda)\bigr)^2 \, E_0,
\]
where λ1 and λn are the minimum and maximum eigenvalues of A. We show at the end of the proof that the right-hand side is minimized when

\[
1 + \lambda q_k(\lambda) = \frac{T_{k+1}\!\left(\frac{\lambda_n + \lambda_1 - 2\lambda}{\lambda_n - \lambda_1}\right)}{T_{k+1}\!\left(\frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1}\right)}, \tag{2.36}
\]

where T_{k+1} is the Chebyshev polynomial of degree k + 1. These polynomials may be defined from the formula

\[
T_i(x) =
\begin{cases}
\cos(i \arccos x) & \text{for } |x| \leq 1, \\
\frac{1}{2}\bigl(x - \sqrt{x^2 - 1}\bigr)^i + \frac{1}{2}\bigl(x + \sqrt{x^2 - 1}\bigr)^i & \text{for } |x| \geq 1.
\end{cases} \tag{2.37}
\]

It is clear from this definition that |T_i(x)| ≤ 1 for all x ∈ [−1, 1]. Consequently, the following inequality holds true for all λ ∈ [λ1, λn]:

\[
\bigl|1 + \lambda q_k(\lambda)\bigr|
\leq \frac{1}{T_{k+1}\!\left(\frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1}\right)}
= 2 \left( \bigl(r - \sqrt{r^2 - 1}\bigr)^{k+1} + \bigl(r + \sqrt{r^2 - 1}\bigr)^{k+1} \right)^{-1}
= 2 \left( \left(\frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1}\right)^{k+1} + \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k+1} \right)^{-1},
\]

where r = (λn + λ1)/(λn − λ1). Since the second term in the bracket converges to zero as k → ∞, it is natural to bound this expression by keeping only the first term, which after simple algebraic manipulations leads to

\[
\forall \lambda \in [\lambda_1, \lambda_n], \qquad
\bigl|1 + \lambda q_k(\lambda)\bigr| \leq 2 \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{k+1}.
\]

From this inequality, the statement of the theorem follows immediately.


Let us now prove the optimality of (2.36). To this end, we first observe that, since

\[
\forall j \in \{0, \dots, k+1\}, \qquad
T_{k+1}\!\left(\cos\!\left(\frac{j\pi}{k+1}\right)\right) = \cos(j\pi) = (-1)^j,
\]

it holds for all j ∈ {0, . . . , k + 1} that

\[
1 + \mu_j q_k(\mu_j) = (-1)^j \, T_{k+1}\!\left(\frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1}\right)^{-1},
\qquad
\mu_j = \frac{\lambda_n + \lambda_1 - (\lambda_n - \lambda_1)\cos\!\left(\frac{j\pi}{k+1}\right)}{2}. \tag{2.38}
\]

Note that the points (μj)_{0≤j≤k+1} are all in the interval [λ1, λn]. Reasoning by contradiction, we assume that there is another polynomial q̃k ∈ P(k) such that

\[
\max_{\lambda \in [\lambda_1, \lambda_n]} \bigl|1 + \lambda \tilde q_k(\lambda)\bigr| < T_{k+1}\!\left(\frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1}\right)^{-1}. \tag{2.39}
\]

Then t_{k+1}(λ) = (1 + λ qk(λ)) − (1 + λ q̃k(λ)) is a polynomial of degree k + 1 with a zero at λ = 0.
In addition, by (2.38) and (2.39) it holds that

\[
\forall j \in \{0, \dots, k+1\}, \qquad
t_{k+1}(\mu_j) \text{ is }
\begin{cases}
\text{strictly positive} & \text{if } j \text{ is even}, \\
\text{strictly negative} & \text{if } j \text{ is odd}.
\end{cases}
\]

This implies, by the intermediate value theorem, that t_{k+1} has k + 1 roots in the interval [λ1, λn], so k + 2 roots in total, but this is impossible for a nonzero polynomial of degree k + 1.

2.3.3 Exercises
Exercise 2.9. Show that if A is row or column strictly diagonally dominant, then A is invertible.

Exercise 2.10. Let T be a nonsingular matrix. Show that

‖A‖_T := ‖T⁻¹AT‖₂

defines a matrix norm induced by a vector norm.

Exercise 2.11. Let A ∈ R^{n×n} be a symmetric positive definite matrix. Show that the functional

‖•‖_A : x ↦ √(xᵀAx)

defines a norm on R^n.

Exercise 2.12. Show that the residual satisfies the equation

r(k+1) = NM⁻¹r(k) = (I − AM⁻¹)r(k).

Exercise 2.13. Show that, if A and B are two square matrices, then ρ(AB) = ρ(BA).

Exercise 2.14. Is ρ(•) a norm? Prove or disprove.

Exercise 2.15. Prove that, if A is a diagonal matrix, then

‖A‖₁ = ‖A‖₂ = ‖A‖_∞ = ρ(A).

Exercise 2.16. Show that, for any matrix norm ‖•‖ induced by a vector norm,

ρ(A) ≤ ‖A‖.

Exercise 2.17. Let ‖•‖ denote the Euclidean vector norm on R^n. We define in Appendix A the induced matrix norm as

‖A‖ = sup{‖Ax‖ : ‖x‖ ≤ 1}.

Show from this definition that, if A is symmetric and positive definite, then

‖A‖ = ‖A‖_* := sup{|xᵀAx| : ‖x‖ ≤ 1}.

Solution. By the Cauchy–Schwarz inequality and the definition of ‖A‖, it holds that

\[
\forall x \in \mathbb{R}^n \text{ with } \|x\| \leq 1, \qquad
|x^T A x| \leq \|x\| \|Ax\| \leq \|x\| \|A\| \|x\| \leq \|A\|.
\]

This shows that ‖A‖_* ≤ ‖A‖. Conversely, letting B denote a symmetric matrix square root of A (see Exercise 2.8), we have for all x ∈ R^n with ‖x‖ ≤ 1 and Bx ≠ 0 that

\[
\|Ax\| = \sqrt{x^T A^T A x} = \sqrt{(Bx)^T B B (Bx)} = \sqrt{(Bx)^T A (Bx)}
= \|Bx\| \sqrt{y^T A y},
\qquad
y = \frac{Bx}{\|Bx\|}.
\]

It holds that ‖Bx‖ = √(xᵀAx) ≤ √(‖A‖_*). In addition ‖y‖ = 1, so the expression inside the square root is bounded from above by ‖A‖_*, which enables us to conclude the proof.

Exercise 2.18. Implement an iterative method based on a splitting for finding a solution to the following linear system on R^n:

\[
\frac{1}{h^2}
\begin{pmatrix}
2 & -1 & & & \\
-1 & 2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
 & & & -1 & 2
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{pmatrix}
=
\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \\ 1 \end{pmatrix},
\qquad
h = \frac{1}{n}.
\]

Plot the norm of the residual as a function of the iteration index. Use as stopping criterion the condition

\[
\|r^{(k)}\| \leq \varepsilon \|b\|, \qquad \varepsilon = 10^{-8}.
\]

As initial guess, use a vector of zeros. The code will be tested with n = 5000.

Extra credit: Find a formula for the optimal value of ω in the relaxation method given n. The proof of Proposition 2.15, as well as the formula (2.20) for the eigenvalues of a tridiagonal matrix, are useful to this end.

Solution. Corollary 2.13 and Proposition 2.14 imply that a necessary and sufficient condition for convergence, when A is Hermitian and positive definite, is that ω ∈ (0, 2). Let Mω = (1/ω)D + L and Nω = ((1 − ω)/ω)D − U. A nonzero scalar λ ∈ C is an eigenvalue of Mω⁻¹Nω if and only if

\[
\det(M_\omega^{-1} N_\omega - \lambda I) = 0
\quad\Leftrightarrow\quad
\det(M_\omega^{-1}) \det(N_\omega - \lambda M_\omega) = 0
\quad\Leftrightarrow\quad
\det(\lambda M_\omega - N_\omega) = 0.
\]

Substituting the expressions of Mω and Nω, we obtain that this condition can be equivalently rewritten as

\[
\det\!\left(\lambda L + \frac{\lambda + \omega - 1}{\omega} D + U\right) = 0
\quad\Leftrightarrow\quad
\det\!\left(\sqrt{\lambda}\, L + \frac{\lambda + \omega - 1}{\omega} D + \sqrt{\lambda}\, U\right) = 0,
\]

where we used (2.19) for the last equivalence. The equality of the determinants in these two equations is valid for √λ denoting either of the two complex square roots of λ. This condition is equivalent to

\[
\det\!\left(L + \frac{\lambda + \omega - 1}{\sqrt{\lambda}\,\omega} D + U\right) = 0.
\]
We recognize from the proof of Proposition 2.15 that this condition is equivalent to

\[
\frac{\lambda + \omega - 1}{\sqrt{\lambda}\,\omega} \in \operatorname{spectrum}(M_J^{-1} N_J).
\]

In other words, for any (λ, μ) ∈ C² such that

\[
\frac{(\lambda + \omega - 1)^2}{\lambda \omega^2} = \mu^2, \tag{2.40}
\]

it holds that μ ∈ spectrum(M_J⁻¹N_J) if and only if λ ∈ spectrum(Mω⁻¹Nω). By (2.20), the eigenvalues of M_J⁻¹N_J are real and given by

\[
\mu_j = \cos\!\left(\frac{j\pi}{n+1}\right), \qquad 1 \leq j \leq n. \tag{2.41}
\]

Rearranging (2.40), we find

\[
\lambda^2 + \lambda\bigl(2(\omega - 1) - \omega^2 \mu^2\bigr) + (\omega - 1)^2 = 0.
\]

For given ω ∈ (0, 2) and μ ∈ R, this is a quadratic equation for λ with solutions

\[
\lambda_\pm = \left(\frac{\omega^2 \mu^2}{2} + 1 - \omega\right) \pm \omega\mu \sqrt{\frac{\omega^2 \mu^2}{4} + 1 - \omega}.
\]

Since the first bracket is positive when the argument of the square root is positive, it is clear that

\[
\max\bigl(|\lambda_-|, |\lambda_+|\bigr) = \frac{\omega^2 \mu^2}{2} + 1 - \omega + \omega|\mu| \sqrt{\frac{\omega^2 \mu^2}{4} + 1 - \omega}.
\]

Combining this with (2.41), we deduce that the spectral radius of Mω⁻¹Nω is given by

\[
\rho(M_\omega^{-1} N_\omega) = \max_{j \in \{1,\dots,n\}} \left( \frac{\omega^2 \mu_j^2}{2} + 1 - \omega + \omega|\mu_j| \sqrt{\frac{\omega^2 \mu_j^2}{4} + 1 - \omega} \right). \tag{2.42}
\]

We wish to minimize this expression over the interval ω ∈ (0, 2). While this can be achieved
by algebraic manipulations, we content ourselves here with graphical exploration. Figure 2.3
depicts the amplitude of the modulus in (2.42) for different values of µ. It is apparent that, for
given ω, the modulus increases as µ increases, which suggests that
\[
\rho(M_\omega^{-1} N_\omega) = \frac{\omega^2 \mu_*^2}{2} + 1 - \omega + \omega|\mu_*| \sqrt{\frac{\omega^2 \mu_*^2}{4} + 1 - \omega},
\qquad
\mu_* = \rho(M_J^{-1} N_J). \tag{2.43}
\]

The figure also suggests that for a given value of μ, the modulus is minimized at the discontinuity of the first derivative, which occurs when the argument of the square root is zero. We conclude that the optimal ω satisfies

\[
\frac{\omega_{\mathrm{opt}}^2 \mu_*^2}{4} + 1 - \omega_{\mathrm{opt}} = 0
\quad\overset{\omega < 2}{\Longrightarrow}\quad
\omega_{\mathrm{opt}} = \frac{2\bigl(1 - \sqrt{1 - \mu_*^2}\bigr)}{\mu_*^2}
= \frac{2}{1 + \sqrt{1 - \mu_*^2}}
= \frac{2}{1 + \sin\!\left(\frac{\pi}{n+1}\right)}.
\]

Figure 2.3: Modulus of |λ₊| as a function of ω, for different values of μ.
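For reference, a possible Julia sketch for the implementation part of Exercise 2.18 is given below; it is not the official solution. It uses the relaxation method with the optimal ω derived above; the function name and the smaller test value of n are our own choices.

using LinearAlgebra, SparseArrays

# Relaxation (SOR) splitting A = Mω − Nω applied to the tridiagonal system of Exercise 2.18.
function sor_solve(n; ε=1e-8, maxiter=100_000)
    h = 1/n
    A = (1/h^2) * spdiagm(-1 => fill(-1.0, n-1), 0 => fill(2.0, n), 1 => fill(-1.0, n-1))
    b = ones(n)
    ω = 2 / (1 + sin(π/(n+1)))       # ω_opt derived above
    M = Diagonal(diag(A))/ω + tril(A, -1)
    x = zeros(n)
    residuals = Float64[]
    for k in 1:maxiter
        r = A*x - b
        push!(residuals, norm(r))
        norm(r) ≤ ε*norm(b) && break
        x = x - M \ r                # splitting update x ← x − Mω⁻¹(Ax − b)
    end
    return x, residuals
end

x, residuals = sor_solve(500)        # plot(residuals, yscale=:log10) with Plots.jl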

Exercise 2.19. Prove that, if the matrix A is strictly diagonally dominant (by rows or columns), then the Gauss–Seidel method converges, i.e. ρ(M⁻¹N) < 1. You can use the same approach as in the proof of Proposition 2.11.

Exercise 2.20. Let A ∈ R^{n×n} denote a symmetric positive definite matrix, and assume that the vectors d1, . . . , dn are pairwise A-orthogonal directions. Show that d1, . . . , dn are linearly independent.

Exercise 2.21 (Steepest descent algorithm). Consider the linear system

\[
A x :=
\begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 1 \end{pmatrix}
=: b. \tag{2.44}
\]

• Show that A is positive definite.

• Draw the contour lines of the function f(x) = ½ xᵀAx − bᵀx.

• Plot the contour lines of f in Julia using the function contourf from the package Plots.

• Using Theorem 2.17, estimate the number K of iterations of the steepest descent algorithm
required in order to guarantee that EK ≤ 10−8 , when starting from the vector x(0) = (2 3)T .

• Implement the steepest descent method for finding the solution to (2.44), and plot the
iterates as linked dots over the filled contour of f .


• Plot the error Ek as a function of the iteration index, using a linear scale for the x axis
and a logarithmic scale for the y axis.

Exercise 2.22. Compute the number of floating point operations required for performing one iteration of the conjugate gradient method, assuming that the matrix A contains α ≪ n nonzero elements per row.

Exercise 2.23 (Solving the Poisson equation over a rectangle). We consider in this exercise Poisson's equation in the domain Ω = (0, 2) × (0, 1), equipped with homogeneous Dirichlet boundary conditions:

−Δf(x, y) = b(x, y),   (x, y) ∈ Ω,
f(x, y) = 0,           (x, y) ∈ ∂Ω.

The right-hand side is b(x, y) = sin(4πx) + sin(2πy).

A number of methods can be employed in order to discretize this partial differential equation. After discretization, a finite-dimensional linear system of the form Ax = b is obtained. A Julia function for calculating the matrix A and the vector b using the finite difference method is given to you on the course website, as well as a function to plot the solution. The goal of this exercise is to solve the linear system using the conjugate gradient method. Use the same stopping criterion as in Exercise 2.18.

Exercise 2.24. Show that (2.37) indeed defines a polynomial, and find its expression in the usual polynomial notation.

Exercise 2.25. Show that if A ∈ R^{n×n} is nonsingular, then the solution to the equation Ax = b belongs to the Krylov subspace

K_n(A, b) = span{b, Ab, A²b, . . . , A^{n−1}b}.

2.4 Discussion and bibliography


In this chapter, we presented direct methods and some of the standard iterative methods for
solving linear systems. We focused particularly on linear systems with a symmetric positive
definite matrix. Section 2.2 is based on [9, 15] and Section 2.3 roughly follows [13, Chapter 2].
The book [10] is a very detailed reference on iterative methods for solving sparse linear systems.
The reference [12] is an excellent introduction to the conjugate gradient method.

Chapter 3

Solution of nonlinear systems

3.1 The bisection method
3.2 Fixed point methods
3.3 Convergence of fixed point methods
3.4 Examples of fixed point methods
    3.4.1 The Newton–Raphson method
    3.4.2 The secant method
3.5 Exercises
3.6 Discussion and bibliography

Introduction
This chapter concerns the numerical solution of nonlinear equations of the general form

f (x) = 0, f : Rn → Rn . (3.1)

A solution to this equation is called a zero of the function f . Except in particular cases (for
example linear systems), there does not exist a numerical method for solving (3.1) in a finite
number of operations, so iterative methods are required.
In contrast with the previous chapter, it may not be the case that (3.1) admits one and
only one solution. For example, the equation 1 + x2 = 0 does not have a (real) solution, and
the equation cos(x) = 0 has infinitely many. Therefore, convergence results usually contain
assumptions on the function f that guarantee the existence and uniqueness of a solution in Rn
or a subset of Rn .
For an iterative method generating approximations (xk)_{k≥0} of a root x∗, we define the error as ek = xk − x∗. If the sequence (xk)_{k≥0} converges to x∗ in the limit as k → ∞ and if

\[
\lim_{k \to \infty} \frac{\|e_{k+1}\|}{\|e_k\|^q} = r, \tag{3.2}
\]
then we say that (xk)_{k≥0} converges with order of convergence q and rate of convergence r. In addition, we say that the convergence is linear if q = 1, and quadratic if q = 2. The convergence is said to be superlinear if

\[
\lim_{k \to \infty} \frac{\|e_{k+1}\|}{\|e_k\|} = 0. \tag{3.3}
\]

In particular, the convergence is superlinear if the order of convergence is q > 1.

Remark 3.1. The definition (3.2) for the order and rate of convergence is not entirely satisfactory, as the limit may not exist. A more general definition for the order of convergence of a sequence (xk)_{k≥0} is the following:

\[
q(x_0) = \inf\left\{ p \in [1, \infty) : \limsup_{k \to \infty} \frac{\|e_{k+1}\|}{\|e_k\|^p} = \infty \right\},
\]

or q(x0) = ∞ if the numerator and denominator of the fraction are zero for sufficiently large k. It is possible to define similarly the order of convergence of an iterative method for an initial guess in a neighborhood V of x∗:

\[
q = \inf\left\{ p \in [1, \infty) : \sup_{x_0 \in V} \left( \limsup_{k \to \infty} \frac{\|e_{k+1}\|}{\|e_k\|^p} \right) = \infty \right\},
\]

where the fraction should be interpreted as 0 if the numerator and denominator are zero. A more detailed discussion of this subject is beyond the scope of this course.

The rest of the chapter is organized as follows:

• In Section 3.1, by way of introduction to the subject of numerical methods for nonlinear
equations, we present and analyze the bisection method.

• In Section 3.2, we present a general method based on a fixed point iteration for solv-
ing (3.1). The convergence of this method is analyzed in Section 3.3.

• In Section 3.4, two concrete examples of fixed point methods are studied: the chord
method and the Newton–Raphson method.

3.1 The bisection method


As an introduction to numerical methods for solving nonlinear equations, we present the bisec-
tion method. This method applies only in the case of a real-valued function f : R → R, and
relies on the knowledge of two points a < b such that f (a) and f (b) have different signs. By
the intermediate value theorem, there necessarily exists x∗ ∈ (a, b) such that f (x∗ ) = 0. The
idea of the bisection method is to successively divide the interval into two equal parts, and to
retain, based on the sign of f at the midpoint x1/2 , the one that necessarily contains a root.
If f (x1/2 )f (a) ≥ 0, then f (x1/2 )f (b) ≤ 0 and so there necessarily exists a root of f in the
interval [x1/2 , b) by the intermediate value theorem. In contrast, if f (x1/2 )f (a) < 0, then there
necessarily is a root in the interval (a, x1/2 ). The algorithm is presented in Algorithm 7.
The following result establishes the convergence of the method.


Algorithm 7 Bisection method


Assume that f (a)f (b) < 0 with a < b.
Pick ε > 0.
x ← a/2 + b/2
while |b − a| ≥ ε do
if f (x)f (a) ≥ 0 then
a←x
else
b←x
end if
x ← a/2 + b/2
end while
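A direct Julia transcription of Algorithm 7 reads as follows; the function name, the tolerance and the test equation x² − 2 = 0 are our own illustrative choices.

# Bisection method (Algorithm 7) for a continuous f with f(a)f(b) < 0.
function bisection(f, a, b; ε=1e-10)
    @assert f(a)*f(b) < 0 "f(a) and f(b) must have opposite signs"
    x = a/2 + b/2
    while abs(b - a) ≥ ε
        if f(x)*f(a) ≥ 0
            a = x
        else
            b = x
        end
        x = a/2 + b/2
    end
    return x
end

bisection(x -> x^2 - 2, 0.0, 2.0)   # root ≈ √2 ≈ 1.41421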

Proposition 3.1. Assume that f : R → R is a continuous function. Let [aj , bj ] denote the
interval obtained after j iterations of the bisection method, and let xj denote the midpoint
(aj + bj )/2. Then there exists a root x∗ of f such that

|xj − x∗ | ≤ (b0 − a0 )2−(j+1) . (3.4)

Proof. By construction, f(aj)f(bj) ≤ 0 and f(bj) ≠ 0. Therefore, by the intermediate value theorem, there exists a root of f in the interval [aj, bj), implying that

\[
|x_j - x_*| \leq \frac{b_j - a_j}{2}.
\]

Since bj − aj = 2^{−j}(b0 − a0), the statement follows.

Although the limit in (3.2) may not be well-defined (for example, x1 may be a root of f), the error xj − x∗ is bounded in absolute value by the sequence (ẽj)_{j≥0}, where ẽj = (b0 − a0)2^{−(j+1)}, by Proposition 3.1. Since the latter sequence exhibits linear convergence to 0, the convergence of the bisection method is said to be linear, by a slight abuse of terminology.

3.2 Fixed point methods


Let x∗ denote a zero of the function f . The idea of iterative methods for (3.1) is to construct,
starting from an initial guess x0 , a sequence (xk )k=0,1,... approaching x∗ . A number of iterative
methods for solving (3.1) are based on an iteration of the form

xk+1 = F (xk ), (3.5)

for an appropriate continuous function F . Assume that xk converges to some point x∗ ∈ Rn in


the limit as k → ∞. Then, taking the limit k → ∞ in (3.5), we find that x∗ satisfies

F (x∗ ) = x∗ .

Such a point x∗ is called a fixed point of the function F. Several definitions of the function F can be employed in order to ensure that a fixed point of F coincides with a zero of f. One may, for example, define F(x) = x − α⁻¹f(x), for some nonzero scalar coefficient α. Then
F (x∗ ) = x∗ if and only if f (x∗ ) = 0. Later in this chapter, in Section 3.4, we study two
instances of numerical methods which can be recast in the form (3.5). Before this, we study the
convergence of the iteration (3.5) for a general function F .
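In code, the iteration (3.5) takes only a few lines. The following Julia sketch is illustrative; the stopping criterion based on the distance between successive iterates and the choice of test function are our own assumptions.

# Generic fixed point iteration (3.5), here for a scalar map F.
function fixed_point(F, x0; tol=1e-12, maxiter=1000)
    x = x0
    for k in 1:maxiter
        xnew = F(x)
        if abs(xnew - x) < tol      # use norm(xnew - x) for vector-valued F
            return xnew, k
        end
        x = xnew
    end
    return x, maxiter
end

# Example: F(x) = x − f(x)/α with f(x) = x − cos(x) and α = 2.
fixed_point(x -> x - (x - cos(x))/2, 1.0)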

3.3 Convergence of fixed point methods


Equation (3.5) may be viewed as a discrete-time dynamical system. In order to study the
behavior of the system as k → ∞, it is important to understand the concept of stability of a
fixed point. The concept of stability appears also in the field of ordinary differential equations,
which are continuous-time dynamical systems. Before we define this concept, we introduce the
following notation for the open ball of radius δ around x ∈ Rn :

B_δ(x) := {y ∈ R^n : ‖y − x‖ < δ}.


Definition 3.1 (Stability of fixed points). Let (xk )k≥0 denote iterates obtained from (3.5)
when starting from an initial vector x0 . Then we say that a fixed point x∗ is

• an attractor if there exists a neighborhood V of x∗ such that

∀x0 ∈ V, xk −−−→ x∗ . (3.6)


k→∞

The largest neighborhood for which this is true, i.e. the set of values of x0 such that (3.6)
holds true, is called the basin of attraction of x∗ .

• stable (in the sense of Lyapunov) if for all ε > 0, there exists δ > 0 such that

∀x0 ∈ Bδ (x∗ ), kxk − x∗ k < ε.

• asymptotically stable if it is stable and an attractor.

• exponentially stable if there exists C > 0, α ∈ (0, 1), and δ > 0 such that

∀x0 ∈ Bδ (x∗ ), ∀k ∈ N, kxk − x∗ k ≤ Cαk kx0 − x∗ k.

• globally exponentially stable if there exists C > 0, α ∈ (0, 1) such that

∀x0 ∈ Rn , ∀k ∈ N, kxk − x∗ k ≤ Cαk kx0 − x∗ k.

• unstable if it is not stable.

Clearly, global exponential stability implies exponential stability, which itself implies asymp-
totic stability and stability. If x∗ is globally exponentially stable, then x∗ is the unique fixed
point of F ; showing this is the aim of Exercise 3.3. If x∗ is an attractor, then the dynamical
system (3.5) is said to be locally convergent to x∗. The larger the basin of attraction of x∗, the less careful we need to be when picking the initial guess x0. Global exponential stability of a
fixed point can sometimes be shown provided that F satisfies a strong hypothesis.

Definition 3.2 (Lipschitz continuity). A function F : Rn → Rn is said to be Lipschitz


continuous with constant L if

∀(x, y) ∈ Rn × Rn , kF (y) − F (x)k ≤ Lky − xk.

A function F : Rn → Rn that is Lipschitz continuous with a constant L < 1 is called a contrac-


tion. For such a function, it is possible to prove that (3.5) has a unique globally exponentially
stable fixed point.

Theorem 3.2. Assume that F is a contraction. Then there exists a unique fixed point of (3.5),
and it holds that
∀x0 ∈ Rn , ∀k ∈ N, kxk − x∗ k ≤ Lk kx0 − x∗ k.

Proof. It holds that

kxk+2 − xk+1 k = kF (xk+1 ) − F (xk )k ≤ Lkxk+1 − xk k ≤ · · · ≤ Lk+1 kx1 − x0 k.

Therefore, for any n ≥ m, we have by the triangle inequality

\[
\|x_n - x_m\| \leq \|x_n - x_{n-1}\| + \dots + \|x_{m+1} - x_m\|
\leq (L^{n-1} + \dots + L^m)\|x_1 - x_0\|
\leq L^m (1 + L + \dots)\|x_1 - x_0\|
= \frac{L^m}{1 - L}\|x_1 - x_0\|.
\]

It follows that the sequence (xk )k≥0 is Cauchy in Rn , implying by completeness that xk → x∗ in
the limit as k → ∞, for some limit x∗ ∈ Rn . Being a contraction, the function F is continuous,
and so taking the limit k → ∞ in (3.5) we obtain that x∗ is a fixed point of F . Then

kxk − x∗ k = kF (xk−1 ) − F (x∗ )k ≤ Lkxk−1 − x∗ k ≤ · · · ≤ Lk kx0 − x∗ k, (3.7)

proving the statement. To show the uniqueness of the fixed point, assume y ∗ is another fixed
point. Then,
ky ∗ − x∗ k = kF (y ∗ ) − F (x∗ )k ≤ Lky ∗ − x∗ k,

which implies that y ∗ = x∗ since L < 1.

It is possible to prove a weaker, local result under less restrictive assumptions on the function F.

Theorem 3.3. Assume that x∗ is a fixed point of (3.5) and that F : Rn → Rn satisfies the
local Lipschitz condition

\[
\forall x \in B_\delta(x_*), \qquad \|F(x) - F(x_*)\| \leq L\|x - x_*\|, \tag{3.8}
\]

with 0 ≤ L < 1 and δ > 0. Then x∗ is the unique fixed point of F in Bδ(x∗) and, for all
x0 ∈ Bδ (x∗ ), it holds that

• All the iterates (xk )k∈N belong to Bδ (x∗ ).

• The sequence (xk )k∈N converges exponentially to x∗ .

Proof. See Exercise 3.4.

It is possible to guarantee that condition (3.8) holds provided that we have sufficiently good
control of the derivatives of the function F . The function F is differentiable at x (in the sense
of Fréchet) if there exists a linear operator DF x : Rn → Rn such that

\[
\lim_{h \to 0} \frac{\|F(x + h) - F(x) - DF_x(h)\|}{\|h\|} = 0. \tag{3.9}
\]

If F is differentiable, then all its first partial derivatives exist. The Jacobian matrix of F at x is defined as

\[
J_F(x) =
\begin{pmatrix}
\partial_1 F_1(x) & \dots & \partial_n F_1(x) \\
\vdots & \ddots & \vdots \\
\partial_1 F_n(x) & \dots & \partial_n F_n(x)
\end{pmatrix},
\]

and it holds that DF_x(h) = J_F(x)h.

Proposition 3.4. Let x∗ be a fixed point of (3.5), and assume that there exists δ and a
subordinate matrix norm such that F is differentiable everywhere in Bδ (x∗ ) and

∀x ∈ Bδ (x∗ ), kJF (x)k ≤ L < 1.

Then condition (3.8) is satisfied in the associated vector norm, and so the fixed point x∗ is
locally exponentially stable.

Proof. Let x ∈ B_δ(x∗). By the fundamental theorem of calculus and the chain rule, we have

\[
F(x) - F(x_*) = \int_0^1 \frac{\mathrm{d}}{\mathrm{d}t} F\bigl(x_* + t(x - x_*)\bigr) \, \mathrm{d}t
= \int_0^1 J_F\bigl(x_* + t(x - x_*)\bigr)(x - x_*) \, \mathrm{d}t.
\]

Therefore, it holds that

\[
\|F(x) - F(x_*)\|
\leq \left(\int_0^1 \bigl\|J_F\bigl(x_* + t(x - x_*)\bigr)\bigr\| \, \mathrm{d}t\right) \|x - x_*\|
\leq \left(\int_0^1 L \, \mathrm{d}t\right) \|x - x_*\| = L \|x - x_*\|,
\]

which is the statement.

In fact, it is possible to prove that a fixed point x∗ is exponentially locally stable under an
even weaker condition, involving only the derivative of F at x∗ .


Proposition 3.5. Let x∗ be a fixed point of (3.5) and assume that F is differentiable at x∗ with

\[
\|J_F(x_*)\| = L < 1
\]

in a subordinate matrix norm. Then there exists δ > 0 such that condition (3.8) is satisfied in the associated vector norm, and so the fixed point x∗ is locally exponentially stable.

Proof. Let ε = ½(1 − L) > 0. By the definition of differentiability (3.9), there exists δ > 0 such that

\[
\forall h \in B_\delta(0) \setminus \{0\}, \qquad
\frac{\|F(x_* + h) - F(x_*) - J_F(x_*) h\|}{\|h\|} \leq \varepsilon.
\]

By the triangle inequality, this implies

\[
\forall h \in B_\delta(0), \qquad
\|F(x_* + h) - F(x_*)\| \leq \|F(x_* + h) - F(x_*) - J_F(x_*) h\| + \|J_F(x_*) h\|
\leq \varepsilon \|h\| + \|J_F(x_*)\| \|h\| = (\varepsilon + L)\|h\| = (1 - \varepsilon)\|h\|.
\]

This shows that F satisfies the condition (3.8) in the neighborhood B_δ(x∗).

The estimate in Theorem 3.2 suggests that when the fixed point iteration (3.5) converges, the
convergence is linear. While this is usually the case, the convergence is superlinear if JF (x∗ ) = 0.

Proposition 3.6. Assume that x∗ is a fixed point of (3.5) and that JF (x∗ ) = 0. Then the
convergence to x∗ is superlinear, in the sense that if xk → x∗ as k → ∞, then

kxk+1 − x∗ k
lim = 0.
k→∞ kxk − x∗ k

Proof. By Proposition 3.5, there exists δ > 0 such that (xk)_{k≥0} is a sequence converging to x∗ for all x0 ∈ B_δ(x∗). It holds that

\[
\frac{\|x_{k+1} - x_*\|}{\|x_k - x_*\|}
= \frac{\|F(x_k) - F(x_*)\|}{\|x_k - x_*\|}
= \frac{\|F(x_k) - F(x_*) - J_F(x_*)(x_k - x_*)\|}{\|x_k - x_*\|}.
\]

Since xk − x∗ → 0 as k → ∞, the right-hand side converges to 0 by (3.9).

Similarly, if there exist δ > 0, C > 0 and q ∈ (1, ∞) such that

∀x ∈ Bδ (x∗ ), kF (x) − F (x∗ )k ≤ Ckx − x∗ kq ,

then assuming that (xk)_{k≥0} converges to x∗, it holds for sufficiently large k that

\[
\frac{\|x_{k+1} - x_*\|}{\|x_k - x_*\|^q}
= \frac{\|F(x_k) - F(x_*)\|}{\|x_k - x_*\|^q} \leq C.
\]

In this case, the order of convergence is at least q.


3.4 Examples of fixed point methods


As we mentioned in Section 3.2, there are several choices for the function F that guarantee the
equivalence F (x) = x ⇔ f (x) = 0. In the case where f is a function from R to R, the simplest
approach, sometimes called the chord method, is to define

F (x) = x − α−1 f (x).

The fixed point iteration (3.5) in this case admits a simple geometric interpretation: at each step, the function f is approximated by the affine function x ↦ f(xk) + α(x − xk), and the new iterate is defined as the zero of this affine function, i.e.

\[
x_{k+1} = x_k - \alpha^{-1} f(x_k) = F(x_k). \tag{3.10}
\]

By Proposition 3.5, a sufficient condition to ensure local convergence is that

|F 0 (x∗ )| = |1 − α−1 f 0 (x∗ )| < 1. (3.11)

In order for this condition to hold true, the slope α must be of the same sign as f′(x∗) and the inequality |α| > |f′(x∗)|/2 must be satisfied. If f′(x∗) = 0, then the sufficient condition (3.11) is never satisfied; in this case, the convergence must be studied on a case-by-case basis. By Proposition 3.6, the convergence of the chord method is superlinear if α = f′(x∗). In practice, the solution x∗ is unknown, and so this choice is not realistic. Nevertheless, the above reasoning suggests that, by letting the slope α vary from iteration to iteration in such a manner that αk approaches f′(x∗) as k → ∞, fast convergence can be obtained. This is precisely what the Newton–Raphson method aims to achieve; see Section 3.4.1.
When f is a function from Rn to Rn , the above approach generalizes to

xk+1 = F (xk ), F (x) = x − A−1 f (x),

where A is an invertible matrix. The geometric interpretation of the method in this case is the following: at each step, the function f is approximated by the affine function x ↦ f(xk) + A(x − xk), and the next iterate is given by the unique zero of the latter function. Superlinear convergence is achieved when A = Jf(x∗). Notice that each iteration requires calculating y := A⁻¹f(xk), which is generally achieved by solving the linear system Ay = f(xk).
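Before turning to the Newton–Raphson method, here is a quick Julia illustration of the scalar chord iteration (3.10); the test function f(x) = x³ − 2, the slope α = 5 (close to f′(x∗) = 3·2^(2/3) ≈ 4.76) and the number of iterations are our own choices.

f(x) = x^3 - 2
α = 5.0
x = 1.0
errors = Float64[]
for k in 1:15
    global x = x - f(x)/α            # chord update with constant slope α
    push!(errors, abs(x - cbrt(2)))
end
errors                                # decreases geometrically until round-off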

3.4.1 The Newton–Raphson method


Let us first consider the case of a function from R to R. The Newton–Raphson method applies
when f is continuously differentiable. At each step, the function f is approximated by the affine
function x 7→ f (xk ) + f 0 (xk )(x − xk ) and the unique zero of this function is returned. In other
words, one iteration of the Newton–Raphson method reads

xk+1 = xk − f 0 (xk )−1 f (xk ). (3.12)


For this iteration to be well-defined, it is necessary that f′(xk) ≠ 0. The Newton–Raphson method may be viewed as a variation on (3.10) where the slope α is adapted as the iteration progresses. If the method converges, then f′(xk) → f′(x∗) in the limit as k → ∞, which indicates that superlinear convergence could occur. Equation (3.12) may be recast as a fixed point iteration of the form (3.5) with

\[
F(x) = x - \frac{f(x)}{f'(x)}.
\]

Clearly, if x∗ is a simple root of f, that is if f(x∗) = 0 and f′(x∗) ≠ 0, then x∗ is a fixed point of F. If the function f is twice continuously differentiable, then the convergence of the Newton–Raphson method is superlinear by Proposition 3.6, because then

\[
F'(x_*) = \frac{f(x_*) f''(x_*)}{f'(x_*)^2} = 0.
\]
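A minimal Julia sketch of the one-dimensional iteration (3.12) is given below; the function name, the stopping criterion and the test equation x = cos(x) are our own choices.

# Newton–Raphson iteration (3.12) for a scalar equation f(x) = 0, with derivative df.
function newton_raphson(f, df, x; tol=1e-14, maxiter=50)
    for k in 1:maxiter
        fx = f(x)
        if abs(fx) < tol
            return x, k
        end
        x -= fx / df(x)              # Newton update; assumes df(x) ≠ 0
    end
    return x, maxiter
end

newton_raphson(x -> x - cos(x), x -> 1 + sin(x), 1.0)   # root ≈ 0.739085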

The Newton–Raphson method generalizes to nonlinear equations in Rn of the form (3.1). In


this case F (x) = x − Jf (x)−1 f (x), and so an iteration of the method reads

xk+1 = xk − Jf (xk )−1 f (xk ). (3.13)

In the rest of this section, we show that the iteration (3.13) is well-defined in a small neigh-
borhood of a root of f under appropriate assumptions, and we demonstrate the second order
convergence of the method. We begin by proving the following preparatory lemma, which we
will then employ in the particular case where the matrix-valued function A is equal to Jf .

Lemma 3.7. Let A : Rn → Rn×n denote a matrix-valued function on Rn that is both continuous
and nonsingular at x∗ , and let f be a function that is differentiable at x∗ where f (x∗ ) = 0.
Then the function
G(x) = x − A(x)−1 f (x)

is well-defined in a neighborhood Bδ (x∗ ) of x∗ . In addition, G is differentiable at x∗ with

JG (x∗ ) = I − A(x∗ )−1 Jf (x∗ ). (3.14)

Proof. It holds that

\[
A(x) = A(x_*) - \bigl(A(x_*) - A(x)\bigr) = A(x_*) \Bigl(I - A(x_*)^{-1}\bigl(A(x_*) - A(x)\bigr)\Bigr). \tag{3.15}
\]

Let β = kA(x∗ )−1 k and ε = (2β)−1 . By continuity of the matrix-valued function A, there
exists δ > 0 such that
∀x ∈ Bδ (x∗ ), kA(x) − A(x∗ )k ≤ ε.

For x ∈ B_δ(x∗) we have ‖A(x∗)⁻¹(A(x∗) − A(x))‖ ≤ ‖A(x∗)⁻¹‖ ‖A(x∗) − A(x)‖ ≤ βε = ½, and so Lemma 2.3 implies that the second factor on the right-hand side of (3.15) is invertible with a norm bounded from above by 2. Therefore, we deduce that A(x) is invertible with

∀x ∈ Bδ (x∗ ), kA(x)−1 k ≤ 2kA(x∗ )−1 k = 2β.

In order to prove (3.14), we need to show that

\[
\lim_{\|h\| \to 0} \frac{\bigl\|G(x_* + h) - G(x_*) - \bigl(I - A(x_*)^{-1} J_f(x_*)\bigr) h\bigr\|}{\|h\|} = 0.
\]

By definition of G, the argument of the norm in the numerator is equal to

\[
A(x_*)^{-1} f(x_*) - A(x_* + h)^{-1} f(x_* + h) + A(x_*)^{-1} J_f(x_*) h
= \underbrace{\bigl(A(x_*)^{-1} - A(x_* + h)^{-1}\bigr) J_f(x_*) h}_{=:v_1}
- \underbrace{A(x_* + h)^{-1} \bigl(f(x_* + h) - f(x_*) - J_f(x_*) h\bigr)}_{=:v_2}.
\]

Noting that A(x∗)⁻¹ − A(x∗ + h)⁻¹ = A(x∗)⁻¹(A(x∗ + h) − A(x∗))A(x∗ + h)⁻¹, we bound the norm of the first term on the right-hand side as follows:

\[
\forall h \in B_\delta(0), \qquad
\|v_1\| \leq 2\beta^2 \, \|A(x_* + h) - A(x_*)\| \, \|J_f(x_*)\| \, \|h\|.
\]

Clearly ‖v1‖/‖h‖ → 0 in the limit as h → 0, by continuity of the matrix-valued function A. It also holds that ‖v2‖/‖h‖ → 0 by differentiability of f at x∗, which concludes the proof.

Using this lemma, we can show the following result on the convergence of the multi-
dimensional Newton–Raphson method.

Theorem 3.8 (Convergence of Newton–Raphson). Let f : R^n → R^n denote a function that is differentiable in a neighborhood B_δ(x∗) of a point x∗ where f(x∗) = 0. Assume that Jf(x) is nonsingular and continuous at x∗. Then x∗ is an attractor of the Newton–Raphson iteration (3.13) and the convergence is superlinear.
If, in addition, there is α > 0 such that the Lipschitz condition

\[
\forall x \in B_\delta(x_*), \qquad \|J_f(x) - J_f(x_*)\| \leq \alpha \|x - x_*\|
\]

is satisfied, then the convergence is at least quadratic: there exist d ∈ (0, δ) and C > 0 such that

\[
\forall x_k \in B_d(x_*), \qquad \|x_{k+1} - x_*\| \leq C \|x_k - x_*\|^2.
\]

Proof. Using Lemma 3.7, we obtain that the Newton–Raphson update

F (x) = x − Jf (x)−1 f (x),

is well-defined in a neighborhood B_δ(x∗) of x∗ for sufficiently small δ. In addition, the second statement in Lemma 3.7, applied with A = Jf, gives that J_F(x∗) = I − Jf(x∗)⁻¹Jf(x∗) = 0, which establishes the superlinear convergence by Proposition 3.6.


In order to show that the convergence is quadratic, we begin by noticing that, since

\[
f(x_k) = \int_0^1 \frac{\mathrm{d}}{\mathrm{d}t} f\bigl(x_* + t(x_k - x_*)\bigr) \, \mathrm{d}t
= \int_0^1 J_f\bigl(x_* + t(x_k - x_*)\bigr)(x_k - x_*) \, \mathrm{d}t,
\]

it holds for all xk ∈ B_δ(x∗) that

\[
\|f(x_k) - J_f(x_*)(x_k - x_*)\|
= \left\| \int_0^1 \Bigl(J_f\bigl(x_* + t(x_k - x_*)\bigr) - J_f(x_*)\Bigr)(x_k - x_*) \, \mathrm{d}t \right\|
\leq \int_0^1 \bigl\|J_f\bigl(x_* + t(x_k - x_*)\bigr) - J_f(x_*)\bigr\| \, \|x_k - x_*\| \, \mathrm{d}t
\leq \int_0^1 \alpha t \|x_k - x_*\|^2 \, \mathrm{d}t \leq \frac{\alpha}{2} \|x_k - x_*\|^2. \tag{3.16}
\]

Let d ∈ (0, δ) be sufficiently small to ensure that

\[
\forall x \in B_d(x_*), \qquad \|J_f(x)^{-1}\| \leq 2\|J_f(x_*)^{-1}\|.
\]

Using the inequality (3.16), we have that for all xk ∈ B_d(x∗),

\[
\|x_{k+1} - x_*\| = \|F(x_k) - x_*\| = \|x_k - x_* - J_f(x_k)^{-1} f(x_k)\|
= \bigl\|J_f(x_k)^{-1}\bigl(f(x_k) - J_f(x_k)(x_k - x_*)\bigr)\bigr\|
\leq \bigl\|J_f(x_k)^{-1}\bigr\| \, \|f(x_k) - J_f(x_k)(x_k - x_*)\|
\leq \bigl\|J_f(x_k)^{-1}\bigr\| \Bigl(\|f(x_k) - J_f(x_*)(x_k - x_*)\| + \|J_f(x_*) - J_f(x_k)\| \, \|x_k - x_*\|\Bigr)
\leq \bigl\|J_f(x_k)^{-1}\bigr\| \, \frac{3\alpha}{2} \|x_k - x_*\|^2
\leq 3\alpha \|J_f(x_*)^{-1}\| \, \|x_k - x_*\|^2,
\]

which concludes the proof.

3.4.2 The secant method

The Newton–Raphson method exhibits very fast convergence, but it requires the knowledge of
the derivatives of the function f . To conclude this chapter, we describe a root-finding algorithm,
known as the secant method, that enjoys superlinear convergence but does not require the
derivatives of f . This method applies only when f is a function from R to R, and so we drop
the vector notation in the rest of this section.
Unlike the other methods presented so far in Section 3.2, the secant method cannot be recast as a fixed point iteration of the form x_{k+1} = F(xk). Instead, it is of the more general form x_{k+2} = F(xk, x_{k+1}). The geometric intuition behind the method is the following: given xk and x_{k+1}, the function f is approximated by the unique affine function that passes through (xk, f(xk)) and (x_{k+1}, f(x_{k+1})), and the iterate x_{k+2} is defined as the root of this affine function. In other words, f is approximated as follows:

\[
\tilde f(x) = f(x_k) + \frac{f(x_{k+1}) - f(x_k)}{x_{k+1} - x_k} (x - x_k).
\]


Solving f̃(x) = 0 gives the following expression for x_{k+2}:

\[
x_{k+2} = \frac{f(x_{k+1}) x_k - f(x_k) x_{k+1}}{f(x_{k+1}) - f(x_k)}. \tag{3.17}
\]
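A minimal Julia sketch of the iteration (3.17) is given below; the function name, the stopping criterion and the test equation x² − 2 = 0 are our own choices.

# Secant method based on formula (3.17), starting from two initial points x0 and x1.
function secant(f, x0, x1; tol=1e-14, maxiter=100)
    for k in 1:maxiter
        fx0, fx1 = f(x0), f(x1)
        x2 = (fx1*x0 - fx0*x1) / (fx1 - fx0)   # formula (3.17)
        if abs(f(x2)) < tol
            return x2, k
        end
        x0, x1 = x1, x2
    end
    return x1, maxiter
end

secant(x -> x^2 - 2, 1.0, 2.0)    # root ≈ 1.4142135623730951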

Showing the convergence of the secant method rigorously under general assumptions is tedious,
so in this course we restrict our attention to the case where f is a quadratic function. Extending
the proof of convergence to a more general smooth function can be achieved by using a quadratic
Taylor approximation of f around the root x∗ , which is accurate in a close neighborhood of x∗ .

Theorem 3.9 (Convergence of the secant method). Assume that f is a convex quadratic polynomial with a simple root at x∗ and that the secant method converges: lim_{k→∞} xk = x∗. Then the order of convergence is given by the golden ratio

\[
\varphi = \frac{1 + \sqrt{5}}{2}.
\]

More precisely, there exists a positive real number y_∞ such that

\[
\lim_{k \to \infty} \frac{|x_{k+1} - x_*|}{|x_k - x_*|^\varphi} = y_\infty. \tag{3.18}
\]

Proof. Equation (3.17) implies that

\[
x_{k+2} - x_* = \frac{f(x_{k+1})(x_k - x_*) - f(x_k)(x_{k+1} - x_*)}{f(x_{k+1}) - f(x_k)}.
\]

By assumption, the function f may be expressed as

\[
f(x) = \lambda(x - x_*) + \mu(x - x_*)^2, \qquad \lambda \neq 0.
\]

Substituting this expression in the equation above and letting ek = xk − x∗, we obtain

\[
e_{k+2} = \frac{\mu e_k e_{k+1}(e_{k+1} - e_k)}{\lambda(e_{k+1} - e_k) + \mu(e_{k+1}^2 - e_k^2)}
= \frac{\mu e_k e_{k+1}}{\lambda + \mu(e_{k+1} + e_k)}.
\]

Rearranging this equation, we have

\[
\frac{e_{k+2}}{e_{k+1}} = \frac{\mu e_k}{\lambda + \mu(e_{k+1} + e_k)}. \tag{3.19}
\]

By assumption, the right-hand side converges to zero, and so the left-hand side must also converge to zero; the convergence is superlinear. In order to find the order of convergence, we take absolute values in the previous equation to obtain, after rearranging,

\[
\frac{|e_{k+2}|}{|e_{k+1}|^\varphi}
= \left(\frac{|e_{k+1}|}{|e_k|^{\frac{1}{\varphi - 1}}}\right)^{1-\varphi} \frac{|\mu|}{|\lambda + \mu(e_{k+1} + e_k)|}
= \left(\frac{|e_{k+1}|}{|e_k|^{\varphi}}\right)^{1-\varphi} \frac{|\mu|}{|\lambda + \mu(e_{k+1} + e_k)|},
\]

where we used that φ = 1/(φ − 1), since φ is a root of the equation φ² − φ − 1 = 0. Thus, introducing the ratio yk = |e_{k+1}|/|ek|^φ, we have

\[
y_{k+1} = y_k^{1-\varphi} \, \frac{|\mu|}{|\lambda + \mu(e_{k+1} + e_k)|}.
\]

Taking logarithms in this equation, we deduce

\[
\log(y_{k+1}) = (1 - \varphi)\log(y_k) + c_k,
\qquad
c_k := \log\!\left(\frac{|\mu|}{|\lambda + \mu(e_{k+1} + e_k)|}\right).
\]

This is a recurrence equation for log(yk), whose explicit solution can be obtained from the variation-of-constants formula:

\[
\log(y_k) = (1 - \varphi)^{k-1}\log(y_1) + \sum_{i=1}^{k-1}(1 - \varphi)^{k-1-i} c_i.
\]

Since (ck)_{k≥0} converges to the constant c_∞ = log|μ/λ| by the assumption that ek → 0, the sequence (log(yk))_{k≥0} converges to c_∞/φ (prove this!). Therefore, by continuity of the exponential function, it holds that

\[
y_k = \exp\bigl(\log(y_k)\bigr) \xrightarrow[k \to \infty]{} \exp\!\left(\frac{c_\infty}{\varphi}\right),
\]

and so we deduce (3.18).

3.5 Exercises
Exercise 3.1. Implement the bisection method for finding the solution(s) to the equation x = cos(x).

Exercise 3.2. Find a discrete-time dynamical system over R of the form

x_{k+1} = F(x_k)

for which 0 is an attractor but is not stable.

Hint: Use a function F that is discontinuous.
Exercise 3.3. Show that if x∗ is a globally exponentially stable fixed point of F, then F does not have any other fixed point: x∗ is the unique fixed point.

Exercise 3.4. Prove Theorem 3.3.

Exercise 3.5. Let x∗ be a fixed point of (3.5). Show that if ρ(J_F(x∗)) < 1, then x∗ is locally exponentially stable. It is sufficient by Proposition 3.5 to find a subordinate matrix norm such that ‖J_F(x∗)‖ < 1. In other words, this exercise amounts to showing that for any matrix A ∈ R^{n×n} with ρ(A) < 1, there exists a matrix norm such that ‖A‖ < 1.


Hint: One may employ a matrix norm of the form kAkT := kT−1 ATk2 , which is a subordinate
norm by Exercise 2.10. The Jordan normal form is useful for constructing the matrix T, and
equation (2.19) is also useful.

Solution. Let J = P⁻¹AP denote the Jordan normal form of A, and let

\[
E_\varepsilon =
\begin{pmatrix}
\varepsilon & & & \\
& \varepsilon^2 & & \\
& & \ddots & \\
& & & \varepsilon^n
\end{pmatrix}.
\]

By Eq. (2.19), the matrix J_ε := E_ε⁻¹JE_ε coincides with J, except that the first superdiagonal is multiplied by ε. Let D denote the diagonal part of J_ε. We have that

\[
\|J_\varepsilon - D\|_2 = \sqrt{\lambda_{\max}\bigl((J_\varepsilon - D)^T (J_\varepsilon - D)\bigr)}.
\]

The matrix (J_ε − D)ᵀ(J_ε − D) is diagonal with entries equal to either 0 or ε², and so ‖J_ε − D‖₂ ≤ ε. By the triangle inequality, we have

\[
\|J_\varepsilon\|_2 \leq \|D\|_2 + \|J_\varepsilon - D\|_2 \leq \rho(A) + \varepsilon. \tag{3.20}
\]

Let ‖A‖_ε := ‖E_ε⁻¹P⁻¹APE_ε‖₂. By Exercise 2.10 with T = PE_ε, this is indeed a subordinate matrix norm. By (3.20) and the assumption that ρ(A) < 1, it is clear that ‖A‖_ε < 1 provided that ε is sufficiently small.
Exercise 3.6. Calculate x = \sqrt[3]{3 + \sqrt[3]{3 + \sqrt[3]{3 + \cdots}}} using the bisection method.

Exercise 3.7. Solve the equation f(x) = e^x − 2 = 0 using a fixed point iteration of the form

x_{k+1} = F(x_k),   F(x) = x − α⁻¹f(x).

Using your knowledge of the exact solution x∗ = log 2, write a sufficient condition on α to guarantee that x∗ is locally exponentially stable. Verify your findings numerically and plot, using a logarithmic scale for the y axis, the error in absolute value as a function of k.

Exercise 3.8. Implement the Newton–Raphson method for solving f(x) = e^x − 2 = 0, and plot the error in absolute value as a function of the iteration index k.

Exercise 3.9. Find the point (x, y) on the parabola y = x² that is closest to the point (3, 1).

Exercise 3.10. Consider the nonlinear system

y = (x − 1)²,
x² + y² = 4.

By drawing these two constraints in the xy plane, find an approximation of the solution(s). Then calculate the solution(s) using a fixed-point method.


Exercise 3.11. Find solutions (ψ, λ), with λ > 0, to the following eigenvalue problem:

ψ′′ = −λ²ψ,   ψ(0) = 0,   ψ′(1) = ψ(1).

Exercise 3.12. Suppose that we have n data points (x_i, y_i) of an unknown function y = f(x). We wish to approximate f by a function of the form

\[
\tilde f(x) = \frac{a}{b + x}
\]

by minimizing the sum of squares

\[
\sum_{i=1}^{n} |\tilde f(x_i) - y_i|^2.
\]

Write a system of nonlinear equations that the minimizer (a, b) must satisfy, and solve this
system using the Newton–Raphson method starting from (1, 1). The data is given below:

x = [0.0; 0.1; 0.2; 0.3; 0.4; 0.5; 0.6; 0.7; 0.8; 0.9; 1.0]
y = [0.6761488864859304; 0.6345697680852508; 0.6396283580587062; 0.6132010027973919;
0.5906142598705267; 0.5718728461471725; 0.5524549902830562; 0.538938885654085;
0.5373495476994958; 0.514904589752926; 0.49243437874655027]

Plot the data points together with the function f̃ over the interval [0, 1]. Your plot should look like Figure 3.1.

Figure 3.1: Solution to Exercise 3.12.

3.6 Discussion and bibliography


The content of this chapter is largely based on the lecture notes [13]. Several of the exercises are
taken or inspired from [6]. The proof of convergence of the secant method is inspired from the
general proof presented in the short paper [14]. For a detailed treatment of iterative methods
for nonlinear equations, see the book [8].

Chapter 4

Numerical computation of eigenvalues

4.1 Numerical methods for eigenvalue problems: general remarks
4.2 Simple vector iterations
    4.2.1 The power iteration
    4.2.2 Inverse iteration
    4.2.3 Rayleigh quotient iteration
4.3 Methods based on a subspace iteration
    4.3.1 Simultaneous iteration
    4.3.2 The QR algorithm
4.4 Projection methods
    4.4.1 Projection method in a Krylov subspace
    4.4.2 The Arnoldi iteration
    4.4.3 The Lanczos iteration
4.5 Exercises
4.6 Discussion and bibliography

Introduction
Calculating the eigenvalues and eigenvectors of a matrix is a task often encountered in scientific
and engineering applications. Eigenvalue problems naturally arise in quantum physics, solid
mechanics, structural engineering and molecular dynamics, to name just a few applications.
The aim of this chapter is to present an overview of the standard methods for calculating
eigenvalues and eigenvectors numerically. We focus predominantly on the case of a Hermitian
matrix A ∈ Cn×n , which is technically simpler and arises in many applications. The reader is
invited to go through the background material in Appendix A.4 before reading this chapter.
The rest of this chapter is organized as follows

• In Section 4.1, we make general remarks concerning the calculation of eigenvalues.

• In Section 4.2, we present standard methods based on a simple vector iteration.


• In Section 4.3, we present a method for calculating several eigenvectors simultaneously,


based on iterating a subspace.

• In Section 4.4, we present methods for constructing an approximation of the eigenvectors in a given subspace of C^n.

4.1 Numerical methods for eigenvalue problems: general remarks


As mentioned in Appendix A.4, a complex number λ ∈ C is an eigenvalue of A ∈ Cn×n if and
only if λ is a root of the characteristic polynomial of A:

pA : C → C : λ 7→ det(A − λI).

One may, therefore, calculate the eigenvalues of A by calculating the roots of the polynomial pA
using, for example, one of the methods presented in Chapter 3. While feasible for small matrices,
this approach is not viable for large matrices, because the number of floating point operations required for calculating the coefficients of the characteristic polynomial scales as n!.
In view of the prohibitive computational cost required for calculating the characteristic
polynomial, other methods are required for solving large eigenvalue problems numerically. All
the methods that we study in this chapter are iterative in nature. While some of them are
aimed at calculating all the eigenpairs of the matrix A, other methods enable to calculate only
a small number of eigenpairs at a lower computational cost, which is often desirable. Indeed,
calculating all the eigenvalues of a large matrix is computationally expensive; on a personal
computer, the following Julia code takes well over a second to terminate:
import LinearAlgebra
LinearAlgebra.eigen(rand(2000, 2000))
In many applications, the matrix A is sparse, and in this case it is important that algorithms
for eigenvalue problems do not destroy the sparsity structure. Note that the eigenvectors of a
sparse matrix are generally not sparse.
To conclude this section, we introduce some notation used throughout this chapter. For a
diagonalizable matrix A, we denote the eigenvalues by λ1 , . . . , λn , with |λ1 | ≥ |λ2 | ≥ . . . ≥ |λn |.
The associated eigenvectors are denoted by v 1 , . . . , v n . Therefore, it holds that

V−1 AV = D = diag(λ1 , . . . , λn ), where V = v 1 . . . v n .


 

4.2 Simple vector iterations


In this section, we present simple iterative methods aimed at calculating just one eigenvector
of the matrix A, which we assume to be diagonalizable for simplicity.

4.2.1 The power iteration


The power iteration is the simplest method for calculating the eigenpair associated with the
eigenvalue of A with largest modulus. Since the eigenvectors of A span Cn , any vector x0 may


be decomposed as

x 0 = α1 v 1 + · · · + αn v n . (4.1)

The idea of the power iteration is to repeatedly left-multiply this vector by the matrix A, in
order to amplify the coefficient of v_1 relative to the other ones. Indeed, notice that

A^k x_0 = λ_1^k α_1 v_1 + · · · + λ_n^k α_n v_n.

If λ_1 is strictly larger in modulus than the other eigenvalues, and if α_1 ≠ 0, then for large k the
vector Ak x0 is approximately aligned, in a sense made precise below, with the eigenvector v 1 .
In order to avoid overflow errors at the numerical level, the iterates are normalized at each
iteration. The power iteration is presented in Algorithm 8.

Algorithm 8 Power iteration


x ← x0
for i ∈ {1, 2, . . . } do
x ← Ax
x ← x/‖x‖
end for
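In Julia, Algorithm 8 can be sketched in a few lines; the function name, the fixed iteration count and the random test matrix below are illustrative choices rather than part of the algorithm, and the eigenvalue estimate uses the Rayleigh quotient discussed later in this section.

using LinearAlgebra

# Minimal power iteration (Algorithm 8): repeatedly apply A and normalize.
function power_iteration(A, x0; niter=100)
    x = x0 / norm(x0)
    for _ in 1:niter
        x = A * x           # amplify the component along v_1
        x = x / norm(x)     # normalize to avoid overflow
    end
    λ = dot(x, A * x) / dot(x, x)   # Rayleigh quotient estimate of λ_1
    return x, λ
end

# Example usage on a random symmetric matrix (illustrative only)
A = rand(100, 100); A = (A + A') / 2
x, λ = power_iteration(A, randn(100))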

To precisely quantify the convergence of the power method, we introduce the notion of angle
between vectors of Cn .

∠(x, y) = arccos( |x^* y| / (√(x^* x) √(y^* y)) ) = arcsin( ‖(I − P_y) x‖ / ‖x‖ ),    where P_y := (y y^*)/(y^* y).

This definition generalizes the familiar notion of angle for vectors in R2 or R3 , and we note
that the angle function satisfies ∠(eiθ1 x, eiθ2 y) = ∠(x, y). We can then prove the following
convergence result.

Proposition 4.1 (Convergence of the power iteration). Suppose that A is diagonalizable and
that |λ_1| > |λ_2|. Then, for every initial guess with α_1 ≠ 0, the sequence (x_k)_{k≥0} generated by
the power iteration satisfies

lim_{k→∞} ∠(x_k, v_1) = 0.

Proof. By construction, it holds that

x_k = (λ_1^k α_1 v_1 + · · · + λ_n^k α_n v_n) / ‖λ_1^k α_1 v_1 + · · · + λ_n^k α_n v_n‖
    = e^{iθ_k} ( v_1 + (λ_2^k α_2)/(λ_1^k α_1) v_2 + · · · + (λ_n^k α_n)/(λ_1^k α_1) v_n ) / ‖ v_1 + (λ_2^k α_2)/(λ_1^k α_1) v_2 + · · · + (λ_n^k α_n)/(λ_1^k α_1) v_n ‖,    (4.2)

where

e^{iθ_k} := (λ_1^k α_1)/|λ_1^k α_1|.
Clearly, it holds that x_k − e^{iθ_k} v_1/‖v_1‖ → 0. Using the definition of the angle between two vectors


in Cn , and the continuity of the Cn Euclidean inner product and of the arccos function, we
calculate
∠(x_k, v_1) = arccos( |x_k^* v_1| / (√(x_k^* x_k) √(v_1^* v_1)) ) = arccos( |x_k^* v_1| / √(v_1^* v_1) ) −→ arccos(1) = 0    as k → ∞,

which concludes the proof.

Since v 1 is normalized, equation (4.2) implies that e−iθk xk → v 1 in the limit as k → ∞. We


say that xk converges essentially to v 1 .
An inspection of the proof also reveals that the dominant term in the error, asymptotically
in the limit as k → ∞, is the one with coefficient (λ_2^k α_2)/(λ_1^k α_1). Therefore, we deduce that

∠(x_k, v_1) = O(|λ_2/λ_1|^k).

The convergence is slow if |λ_2/λ_1| is close to one, and fast if |λ_2| ≪ |λ_1|. Once an approximation
of the eigenvector v_1 has been calculated, the corresponding eigenvalue λ_1 can be estimated
from the Rayleigh quotient:

ρ_A : C^n \ {0} → C : x ↦ (x^* A x)/(x^* x).    (4.3)
For any eigenvector v of A, the corresponding eigenvalue is equal to ρA (v). In order to study the
error on the eigenvalue λ1 for the power iteration, we assume for simplicity that A is Hermitian
and that the eigenvectors v 1 , . . . , v n are orthonormal. The convergence of the eigenvalue in this
particular case is faster than for a general matrix in Cn×n . Substituting (4.2) in the Rayleigh
quotient (4.3), we obtain

ρ_A(x_k) = ( λ_1 + |λ_2^k α_2/(λ_1^k α_1)|² λ_2 + · · · + |λ_n^k α_n/(λ_1^k α_1)|² λ_n ) / ( 1 + |λ_2^k α_2/(λ_1^k α_1)|² + · · · + |λ_n^k α_n/(λ_1^k α_1)|² ).

Therefore, we deduce

|ρ_A(x_k) − λ_1| ≤ |λ_2^k α_2/(λ_1^k α_1)|² |λ_2 − λ_1| + · · · + |λ_n^k α_n/(λ_1^k α_1)|² |λ_n − λ_1| = O(|λ_2/λ_1|^{2k}).

For general matrices, it is possible to show using a similar argument that the error is of order O(|λ_2/λ_1|^k) in the limit as k → ∞.


4.2.2 Inverse iteration


The power iteration is simple, but it only gives access to the dominant eigenvalue of the
matrix A, i.e. the eigenvalue of largest modulus. In addition, the convergence of the method is
slow when |λ2 | ≈ |λ1 |.
The inverse iteration enables a more efficient calculation of not only the dominant eigenvalue
but also the other eigenvalues of A. It is based on applying the power iteration to (A − µI)−1 ,


where µ ∈ C is a shift. The eigenvalues of (A−µI)−1 are given by (λ1 −µ)−1 , . . . , (λn −µ)−1 , with
associated eigenvectors v_1, . . . , v_n. If 0 < |λ_J − µ| < |λ_j − µ| for all j ≠ J, then the dominant
eigenvalue of the matrix (A − µI)^{−1} is (λ_J − µ)^{−1}, and so the power iteration applied to this
matrix yields an approximation of the eigenvector v_J. In other words, the inverse iteration
with shift µ yields an approximation of the eigenvector of A corresponding to the
eigenvalue nearest µ. The inverse iteration is presented in Algorithm 9. In practice, the inverse
matrix (A − µI)−1 need not be calculated, and it is often preferable to solve a linear system at
each iteration.
Algorithm 9 Inverse iteration
x ← x0
for i ∈ {1, 2, . . . } do
Solve (A − µI)y = x
x ← y/‖y‖
end for
λ ← x∗ Ax/x∗ x
return x, λ
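A possible Julia transcription of Algorithm 9 is sketched below; the shift µ, the iteration count and the use of an LU factorization (computed once and reused for every solve) are illustrative choices, not prescriptions.

using LinearAlgebra

# Inverse iteration (Algorithm 9) with shift μ: apply the power iteration
# to (A - μI)^{-1} by solving a linear system at each step.
function inverse_iteration(A, μ, x0; niter=50)
    F = lu(A - μ * I)       # factorize once and reuse the factorization
    x = x0 / norm(x0)
    for _ in 1:niter
        y = F \ x           # solve (A - μI) y = x instead of inverting
        x = y / norm(y)
    end
    λ = dot(x, A * x) / dot(x, x)
    return x, λ
end

# Approximates the eigenpair whose eigenvalue is nearest μ (illustrative values)
A = rand(200, 200); A = (A + A') / 2
x, λ = inverse_iteration(A, 0.5, randn(200))

Replacing the fixed shift by the current Rayleigh quotient at each step, and refactorizing accordingly, turns this sketch into the Rayleigh quotient iteration of the next subsection.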

An application of Proposition 4.1 immediately gives the following convergence result for the
inverse iteration.

Proposition 4.2 (Convergence of the inverse iteration). Assume that A ∈ C^{n×n} is diagonalizable and that there exist J and K such that

0 < |λ_J − µ| < |λ_K − µ| ≤ |λ_j − µ|    ∀j ≠ J.

Assume also that α_J ≠ 0, where α_J is the coefficient of v_J in the expansion of x_0 given


in (4.1). Then the iterates of the inverse iteration satisfy

lim_{k→∞} ∠(x_k, v_J) = 0.

More precisely,

∠(x_k, v_J) = O(|(λ_J − µ)/(λ_K − µ)|^k).

Notice that the closer µ is to λ_J, the faster the inverse iteration converges. With µ = 0, the
inverse iteration yields the eigenvalue of A of smallest modulus.

4.2.3 Rayleigh quotient iteration


Since the inverse iteration is fast when µ is close to an eigenvalue λi , it is natural to wonder
whether the method can be improved by progressively updating µ as the simulation progresses.
Specifically, an approximation of the eigenvalue associated with the current vector may be
employed in place of µ. This leads to the Rayleigh quotient iteration, presented in Algorithm 10.

It is possible to show that, when A is Hermitian, the Rayleigh quotient iteration converges to
an eigenvector for almost every initial guess x0 . Furthermore, if convergence to an eigenvector


Algorithm 10 Rayleigh quotient iteration


x ← x0
for i ∈ {1, 2, . . . } do
µ ← x∗ Ax/x∗ x
Solve (A − µI)y = x
x ← y/‖y‖
end for
λ ← x∗ Ax/x∗ x
return x, λ

occurs, then µ converges cubically to the corresponding eigenvalue. See [11] and the references
therein for more details.

4.3 Methods based on a subspace iteration


The subspace iteration resembles the power iteration but it is more general: not just one but
several vectors are updated at each iteration.

4.3.1 Simultaneous iteration

Let X_0 = (x_1 · · · x_p) denote an initial set of linearly independent vectors. Before we present
the simultaneous iteration, we recall a statement concerning the QR decomposition of a matrix,
which is related to the Gram–Schmidt orthonormalization process. We recall that the Gram–Schmidt
method constructs, starting from an ordered set of p vectors {x_1, . . . , x_p} in C^n, a new set of
vectors {q_1, . . . , q_p} which are orthonormal and span the same subspace of C^n as the original vectors.

Proposition 4.3 (Reduced QR decomposition). Assume that X ∈ Cn×p has linearly inde-
pendent columns. Then there exist a matrix Q ∈ Cn×p with orthonormal columns and an
upper triangular matrix R such that the following factorization holds:

X = QR.

This decomposition is known as a reduced QR decomposition if p < n, or simply QR de-


composition if p = n, in which case X is a square matrix and Q is a unitary matrix. The
decomposition is unique if we require that the diagonal elements of R are real and positive.

Sketch of the proof. Let us denote the columns of Q by q 1 , . . . , q p and the entries of R by

        ⎛ α_11  α_12  · · ·  α_1p ⎞
        ⎜       α_22  · · ·  α_2p ⎟
    R = ⎜              ⋱      ⋮   ⎟ .
        ⎝                    α_pp ⎠


Expanding the matrix product and identifying columns of both sides, we find that

x_1 = α_11 q_1
x_2 = α_12 q_1 + α_22 q_2
  ⋮
x_p = α_1p q_1 + α_2p q_2 + · · · + α_pp q_p.

Under the restriction that the diagonal elements of R are real and positive, the first equation
uniquely determines q_1 and α_11 = ‖x_1‖. Then, projecting both sides of the second
equation onto span{q_1} and span{q_1}^⊥, we obtain

q_1^* x_2 = α_12,    x_2 − (q_1 q_1^*) x_2 = α_22 q_2.

The first equation gives α_12, and the second equation yields q_2 and α_22. This
process can be iterated p times in order to fully construct the matrices Q and R.

Note that the columns of the matrix Q of the decomposition coincide with the vectors that
would be obtained by applying the Gram–Schmidt method to the columns of the matrix X. In
fact, the Gram–Schmidt process is one of several methods by which the QR decomposition can
be calculated in practice.

Algorithm 11 Simultaneous iteration


X ← X0
for k ∈ {1, 2, . . . } do
Qk Rk = AXk−1 (QR decomposition).
Xk ← Qk .
end for
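A minimal Julia sketch of Algorithm 11, relying on the qr function of the LinearAlgebra standard library, could read as follows; the block size and iteration count are hypothetical choices.

using LinearAlgebra

# Simultaneous iteration (Algorithm 11): orthonormalize A*X at every step.
function simultaneous_iteration(A, X0; niter=200)
    X = Matrix(qr(X0).Q)          # orthonormal starting block
    for _ in 1:niter
        X = Matrix(qr(A * X).Q)   # QR decomposition of A X_{k-1}; keep the Q factor
    end
    return X                      # columns approximate v_1, ..., v_p
end

A = rand(300, 300); A = (A + A') / 2
X = simultaneous_iteration(A, randn(300, 4))
# diag(X' * A * X) then approximates the 4 eigenvalues of largest modulus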

The simultaneous iteration method is presented in Algorithm 11. Like the normalization
in the power iteration (Algorithm 8), the QR decomposition performed at each step serves to
avoid overflow errors. Notice that when p = 1, the simultaneous iteration reduces to the power
iteration. We emphasize that the factorization step at each iteration does not influence the
subspace spanned by the columns of X. Therefore, this subspace after k iterations coincides
with that spanned by the columns of the matrix Ak X0 . In fact, in exact arithmetic, it would
be equivalent to perform the QR decomposition only once as a final step, after the for loop.
Indeed, denoting by Qk Rk the QR decomposition of AXk−1 , we have

X_k = A X_{k−1} R_k^{−1} = A² X_{k−2} R_{k−1}^{−1} R_k^{−1} = · · · = A^k X_0 R_1^{−1} · · · R_k^{−1}
    ⇔    X_k (R_k · · · R_1) = A^k X_0.

Since Xk has orthonormal columns and Rk . . . R1 is an upper triangular matrix (see Exercise 2.3)
with real positive elements on the diagonal (check this!), it follows that Xk can be obtained by QR
factorization of Ak X0 . In order to show the convergence of the simultaneous iteration, we begin
by proving the following preparatory lemma.


Lemma 4.4 (Continuity of the QR decomposition). If Qk Rk → QR, where Q is orthogonal


and R is upper triangular with positive real entries on the diagonal, then Qk → Q.

Proof. We reason by contradiction and assume there is ε > 0 and a subsequence (Qkn )n≥0 such
that kQkn −Qk ≥ ε for all n. Since the set of unitary matrices is a compact subset of Cn×n , there
exists a further subsequence (Qknm )m≥0 that converges to a limit Q∞ which is also a unitary
matrix and at least ε away in norm from Q. But then

R_{k_{n_m}} = Q_{k_{n_m}}^{−1} (Q_{k_{n_m}} R_{k_{n_m}}) = Q_{k_{n_m}}^* (Q_{k_{n_m}} R_{k_{n_m}}) −→ Q_∞^* (QR) =: R_∞    as m → ∞.

Since Rk is upper triangular with positive diagonal elements for all k, clearly R∞ is also upper
triangular with positive diagonal elements. But then Q∞ R∞ = QR, and by uniqueness of the
decomposition we deduce that Q = Q∞ , which is a contradiction.

Before presenting the convergence theorem, we introduce the following terminology: we say
that Xk ∈ Cn×p converges essentially to a matrix X∞ if each column of Xk converges essentially to
the corresponding column of X∞ . We recall that a vector sequence (xk )k≥1 converges essentially
to x∞ if there exists (θk )k≥1 such that (eiθk xk )k≥1 converges to x∞ . We prove the convergence
in the Hermitian case for simplicity. In the general case of A ∈ Cn×n , it cannot be expected
that Xk converges essentially to V, because the columns of Xk are orthogonal but eigenvectors
may not be orthogonal. In this case, the columns of Xk converge not to the eigenvectors but to
the so-called Schur vectors of A; see [11] for more information.

Theorem 4.5 (Convergence of the simultaneous iteration � ). Assume that A ∈ C^{n×n} is Hermitian,
and that X_0 ∈ C^{n×p} has linearly independent columns. Assume also that the subspace
spanned by the columns of X_0 satisfies

col(X_0) ∩ span{v_{p+1}, . . . , v_n} = {0}.    (4.4)

If it holds that

λ_1 > λ_2 > · · · > λ_p > λ_{p+1} ≥ λ_{p+2} ≥ . . . ≥ λ_n,    (4.5)

then X_k converges essentially to V_1 := (v_1 · · · v_p).


 

Proof. Let B = V−1 X0 ∈ Cn×p , so that X0 = VB, and note that Ak X0 = VDk B. We denote
by B1 ∈ Cp×p and B2 ∈ C(n−p)×p the upper p × p and lower (n − p) × p blocks of B, respectively.
The matrix B1 is nonsingular, otherwise the assumption (4.4) would not hold. Indeed, if there
was a nonzero vector z ∈ Cp such that B1 z = 0, then

X_0 z = V B z = ( V_1  V_2 ) ⎛ B_1 z ⎞ = ( V_1  V_2 ) ⎛   0   ⎞ = V_2 B_2 z,
                             ⎝ B_2 z ⎠                ⎝ B_2 z ⎠


implying that X_0 z ∈ col(X_0) is a linear combination of the vectors in V_2 = (v_{p+1} · · · v_n),
which contradicts the assumption.
which contradicts the assumption. We also denote by D1 and D2 the p × p upper-left and
the (n − p) × (n − p) lower-right blocks of D, respectively. From the expression of Ak X0 , we have

A^k X_0 = ( V_1  V_2 ) ⎛ D_1^k B_1 ⎞ = V_1 D_1^k B_1 + V_2 D_2^k B_2
                       ⎝ D_2^k B_2 ⎠
        = ( V_1 + V_2 D_2^k B_2 B_1^{−1} D_1^{−k} ) D_1^k B_1.    (4.6)

The second term in the bracket on the right-hand side converges to zero in the limit as k → ∞
by (4.5). Let Q̃_k R̃_k denote the reduced QR decomposition of the bracketed term. By Lemma 4.4,
which we proved for the standard QR decomposition but also holds for the reduced one, we
deduce from Q̃_k R̃_k → V_1 that Q̃_k → V_1. Rearranging (4.6), we have

A^k X_0 = Q̃_k (R̃_k D_1^k B_1).

Since the matrix between brackets is a p × p square invertible matrix, this equation implies
that col(A^k X_0) = col(Q̃_k). Denoting by Q_k R_k the QR decomposition of A^k X_0, we therefore
have col(Q_k) = col(Q̃_k), and so the projectors on these subspaces are equal. We recall that, for
a set of orthonormal vectors r_1, . . . , r_p gathered in a matrix R = (r_1 · · · r_p), the projector
on col(R) = span{r_1, . . . , r_p} ⊂ C^n is the square n × n matrix

R R^* = r_1 r_1^* + · · · + r_p r_p^*.

Consequently, the equality of the projectors implies Q_k Q_k^* = Q̃_k Q̃_k^*. Now, we want to establish
the essential convergence of Q_k to V_1. To this end, we reason by induction, relying on the
fact that, for every j ≤ p, the first j columns of X_0 undergo a simultaneous iteration independent
of the other columns. For example, the first column simply undergoes a power iteration, and so it
converges essentially to v_1. Assume now that the columns 1 to p − 1 of Q_k converge essentially to
v_1, . . . , v_{p−1}. Then the p-th column of Q_k at iteration k, which we denote by q_p^{(k)}, satisfies

q_p^{(k)} (q_p^{(k)})^* = Q_k Q_k^* − q_1^{(k)} (q_1^{(k)})^* − · · · − q_{p−1}^{(k)} (q_{p−1}^{(k)})^*
                        = Q̃_k Q̃_k^* − q_1^{(k)} (q_1^{(k)})^* − · · · − q_{p−1}^{(k)} (q_{p−1}^{(k)})^*
                        −→ V_1 V_1^* − v_1 v_1^* − · · · − v_{p−1} v_{p−1}^* = v_p v_p^*    as k → ∞.


Therefore, noting that |a| = √(a ā) for every a ∈ C, we deduce

|v_p^* q_p^{(k)}| = √( v_p^* q_p^{(k)} (q_p^{(k)})^* v_p ) −→ √( v_p^* v_p v_p^* v_p ) = 1    as k → ∞,

which shows that ∠(q_p^{(k)}, v_p) converges to 0. Finally, observing that

‖e^{−iθ_k} q_p^{(k)} − v_p‖² = 2 − 2 |v_p^* q_p^{(k)}| −→ 0    as k → ∞,    where e^{iθ_k} := (v_p^* q_p^{(k)})/|v_p^* q_p^{(k)}|,

we conclude that q_p^{(k)} converges essentially to v_p.


In addition to this convergence result, it is possible to show that the error satisfies

∠(col(X_k), col(V_1)) = O(|λ_{p+1}/λ_p|^k).

Here, the angle between two subspaces A and B of C^n is defined as

∠(A, B) = max_{a ∈ A\{0}} ( min_{b ∈ B\{0}} ∠(a, b) ).

4.3.2 The QR algorithm


The QR algorithm, which is based on the QR decomposition, is one of the most famous algo-
rithms for calculating all the eigenpairs of a matrix. We first present the algorithm and then
relate it to the simultaneous iteration in Section 4.3.1. The method is presented in Algorithm 12.

Algorithm 12 QR algorithm
X0 = A
for k ∈ {1, 2, . . . } do
Qk Rk = Xk−1 (QR decomposition)
Xk = Rk Qk
end for
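A direct, unoptimized Julia transcription of Algorithm 12 is sketched below; practical implementations first reduce A to Hessenberg form and use shifts, which this sketch does not do, and the iteration count is a placeholder.

using LinearAlgebra

# QR algorithm (Algorithm 12), without shifts or Hessenberg reduction.
function qr_algorithm(A; niter=500)
    X = copy(A)
    for _ in 1:niter
        Q, R = qr(X)    # X_{k-1} = Q_k R_k
        X = R * Q       # X_k = R_k Q_k is unitarily similar to A
    end
    return X
end

A = rand(50, 50); A = (A + A') / 2
X = qr_algorithm(A)
# For this Hermitian A, diag(X) approximates the spectrum; compare with eigvals(A)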

Successive iterates of the QR algorithm are related by the equation

X_k = Q_k^{−1} X_{k−1} Q_k = Q_k^* X_{k−1} Q_k = · · · = (Q_1 · · · Q_k)^* X_0 (Q_1 · · · Q_k).    (4.7)

Therefore, all the iterates are related by a unitary similarity transformation, and so they all
have the same eigenvalues as X0 = A. Rearranging (4.7), we have

(Q1 . . . Qk )Xk = A(Q1 . . . Qk ),

and so, introducing Q̃_k = Q_1 · · · Q_k and noting that X_k = Q_{k+1} R_{k+1}, we deduce

Q̃_{k+1} R_{k+1} = A Q̃_k.

This reveals that the matrix sequence (Q̃_k)_{k≥1} undergoes a simultaneous iteration and so, assuming
that A is Hermitian with n distinct nonzero eigenvalues, we deduce that Q̃_k → V
essentially in the limit as k → ∞, by Theorem 4.5. As a consequence, by (4.7), it holds
that X_k → V^* X_0 V = D; in other words, the matrix X_k converges to a diagonal matrix with the
eigenvalues of A on the diagonal.

4.4 Projection methods


In this section, we begin by presenting a general method for constructing an approximation of
the eigenvectors of A in a given subspace U of Cn . We then discuss a particular choice for the


subspace U as a Krylov subspace, which is very useful in practice.


Assume that {u1 , . . . , up } is an orthonormal basis of U. Then for any vector v ∈ Cn , the
vector of U that is closest to v in the Euclidean distance is given by the orthogonal projection

PU v := UU∗ v = (u1 u∗1 + · · · + up u∗p )v.

In practice, the eigenvectors of A are unknown, and so it is impossible to calculate approx-


imations using this formula. The Rayleigh–Ritz method, which we present hereafter, is an
alternative and practical method for constructing approximations of the eigenvectors and eigen-
values. In general, the subspace U does not contain any eigenvector of A, and so the problem

Av = λv, v∈U (4.8)

does not admit a solution. Let us denote by U the matrix with columns u1 , . . . , up . Since any
vector v ∈ U is equal to Uz for some vector z ∈ Cp , equation (4.8) is equivalent to the problem

AUz = λUz,

which is a system of n equations with p < n unknowns. The Rayleigh–Ritz method is based on
the idea that, in order to obtain a problem with as many unknowns as there are equations, we
can multiply this equation by U∗ , which leads to the problem

Bz := (U∗ AU)z = λz. (4.9)

This is a standard eigenvalue problem for the matrix U^* A U ∈ C^{p×p}, which is much easier to solve
than the original problem if p ≪ n. Equivalently, equation (4.9) may be formulated as follows:
find v ∈ U such that
u^* (A v − λ v) = 0    ∀u ∈ U.    (4.10)

The solutions to (4.9) and (4.10) are related by the equation v = Uz. Of course, the eigenvalues
of B in problem (4.9), which are called the Ritz values of A relative to U, are in general different
from those of A. Once an eigenvector y of B has been calculated, an approximate eigenvector
of A, called a Ritz vector of A relative to U, is obtained from the equation v
b = Uy. The
Rayleigh–Ritz algorithm is presented in full in Algorithm 13.

Algorithm 13 Rayleigh–Ritz
Choose U ⊂ C^n
Construct a matrix U whose columns are orthonormal and span U
Find the eigenvalues λ̂_i and eigenvectors y_i ∈ C^p of B := U^* A U
Calculate the corresponding Ritz vectors v̂_i = U y_i ∈ C^n.
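A compact Julia sketch of Algorithm 13 is given below, assuming as in the surrounding discussion that A is Hermitian; the subspace is passed as a matrix whose columns span U, and the orthonormalization via a QR decomposition is an illustrative choice.

using LinearAlgebra

# Rayleigh–Ritz (Algorithm 13), assuming A is Hermitian.
function rayleigh_ritz(A, V)
    U = Matrix(qr(V).Q)         # orthonormal basis of U = col(V)
    B = Hermitian(U' * A * U)   # reduced p × p matrix
    λ, Y = eigen(B)             # eigenpairs of B (Ritz values)
    return λ, U * Y             # Ritz values and Ritz vectors of A
end

A = rand(500, 500); A = (A + A') / 2
ritz_values, ritz_vectors = rayleigh_ritz(A, randn(500, 10))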

It is clear that if v i ∈ U , then λi is an eigenvalue of B in (4.9). In fact, we can show the


following more general statement.


Proposition 4.6. If U is an invariant subspace of A, meaning that AU ⊂ U, then each Ritz


vector of A relative to U is an eigenvector of A.

Proof. Let U ∈ Cn×p and W ∈ Cn×(n−p) be matrices whose columns form orthonormal bases
of U and U ⊥ , respectively. Here U ⊥ denotes the orthogonal complement of U with respect to
the Euclidean inner product. Then, since W∗ AU = 0 by assumption, it holds that

Q^* A Q = ⎛ U^* A U   U^* A W ⎞ = ⎛ U^* A U   U^* A W ⎞ ,    where Q = ( U  W ).
          ⎝ W^* A U   W^* A W ⎠   ⎝    0      W^* A W ⎠

If (y, λ̂) is an eigenpair of U^* A U, then

Q^* A Q ⎛ y ⎞ = ⎛ (U^* A U) y ⎞ = λ̂ ⎛ y ⎞ =: λ̂ x,
        ⎝ 0 ⎠   ⎝      0      ⎠      ⎝ 0 ⎠

and so (x, λ̂) is an eigenpair of Q^* A Q. But then (Q x, λ̂) = (U y, λ̂) is an eigenpair of A, which
proves the statement.

If U is close to being an invariant subspace of A, then it is expected that the Ritz vectors
and Ritz values of A relative to U will provide good approximations of some of the eigenpairs
of A. Quantifying this approximation is difficult, so we only present without proof the following
error bound. See [10] for more information.

Proposition 4.7. Let A be a full rank Hermitian matrix and U a p-dimensional subspace
of C^n. Then there exist eigenvalues λ_{i_1}, . . . , λ_{i_p} of A which satisfy

∀j ∈ {1, . . . , p},    |λ_{i_j} − λ̂_j| ≤ ‖(I − P_U) A P_U‖_2.

In the case where A is Hermitian, it is possible to show that the Ritz values are bounded from
above by the eigenvalues of A. The proof of this result relies on the Courant–Fischer theorem
for characterizing the eigenvalues of a Hermitian matrix, which is recalled in Theorem A.6 in
the appendix.

Proposition 4.8. If A ∈ Cn×n is Hermitian, then

∀i ∈ {1, . . . , p},    λ̂_i ≤ λ_i.

Proof. By the Courant–Fischer theorem, it holds that

λ̂_i = max_{S ⊂ C^p, dim(S)=i} ( min_{x ∈ S\{0}} (x^* B x)/(x^* x) ).


Letting y = U x and then R = U S, we deduce that

λ̂_i = max_{S ⊂ C^p, dim(S)=i} ( min_{y ∈ US\{0}} (y^* A y)/(y^* y) )
    = max_{R ⊂ U, dim(R)=i} ( min_{y ∈ R\{0}} (y^* A y)/(y^* y) ) ≤ max_{R ⊂ C^n, dim(R)=i} ( min_{y ∈ R\{0}} (y^* A y)/(y^* y) ) = λ_i,

where we used the Courant–Fischer theorem for the matrix A in the last equality.

This projection approach is sometimes combined with a simultaneous subspace iteration: an
approximation X_k of the first p eigenvectors is first calculated using Algorithm 11, and then the
matrix Xk is used in place of U in Algorithm 13.

4.4.1 Projection method in a Krylov subspace


The power iteration constructs at iteration k an approximation of v 1 in the one-dimensional
subspace spanned by the vector Ak x0 , and only the previous iteration xk is employed to con-
struct xk+1 . One may wonder whether, by employing all the previous iterates rather than only
the previous one, a better approximation of v 1 can be constructed. More precisely, instead of
looking for an approximation in the subspace span{Ak x0 }, would it be useful to extend the
search area to the Krylov subspace

K_{k+1}(A, x_0) := span{x_0, A x_0, . . . , A^k x_0}?




The answer to this question is positive, and the resulting method is often much faster than
the power iteration. This is achieved by employing the Rayleigh–Ritz projection method Al-
gorithm 13 with the choice U = K_{k+1}(A, x_0). Applying this method requires calculating an
orthonormal basis of the Krylov subspace and the reduced matrix U^* A U. The
Arnoldi method achieves these two goals simultaneously.

4.4.2 The Arnoldi iteration


The Arnoldi iteration is based on the Gram–Schmidt process and is presented in Algorithm 14.
The iteration breaks down if hj+1,j = 0, which indicates that Auj belongs to the Krylov

Algorithm 14 Arnoldi iteration for constructing an orthonormal basis of Kp (A, u1 )


Choose u1 with unit norm.
for j ∈ {1, . . . p} do
uj+1 ← Auj
for i ∈ {1, . . . , j} do
hi,j ← u∗i uj+1
uj+1 ← uj+1 − hi,j ui
end for
hj+1,j ← ‖uj+1‖
uj+1 ← uj+1 /hj+1,j
end for

subspace span{u1 , . . . , uj } = Kj (A, u1 ), implying that Kj+1 (A, u1 ) = Kj (A, u1 ). In this case,


the subspace Kj (A, u1 ) is an invariant subspace of A because, by Exercise 4.2, we have

AKj (A, u1 ) ⊂ Kj+1 (A, u1 ) = Kj (A, u1 ).

Therefore, applying the Rayleigh–Ritz method with U = span{u_1, . . . , u_j} yields exact eigenpairs
by Proposition 4.6. If the iteration does not break down then, by construction, the vectors {u_1, . . . , u_p}
at the end of the algorithm are orthonormal. It is also simple to show by induction that they
form a basis of K_p(A, u_1). The scalar coefficients h_{i,j} can be collected in a square p × p matrix

        ⎛ h_{1,1}  h_{1,2}  h_{1,3}   · · ·   h_{1,p}  ⎞
        ⎜ h_{2,1}  h_{2,2}  h_{2,3}   · · ·   h_{2,p}  ⎟
    H = ⎜    0     h_{3,2}  h_{3,3}   · · ·   h_{3,p}  ⎟ .
        ⎜    ⋮        ⋱        ⋱        ⋱        ⋮     ⎟
        ⎝    0      · · ·     0     h_{p,p−1} h_{p,p}  ⎠
This matrix contains only zeros under the first subdiagonal; such a matrix is called a Hessenberg
matrix. Inspecting the algorithm, we notice that the j-th column contains the coefficients of
the projection of the vector Auj onto the basis {u1 , . . . , up }. In other words,

U^* A U = H.    (4.11)

We have thus shown that the Arnoldi algorithm constructs both an orthonormal basis
of a Krylov subspace and the associated reduced matrix. In fact, we have the following equation:

A U = U H + h_{p+1,p} (v_{p+1} e_p^*),    e_p = (0, . . . , 0, 1)^T ∈ C^p.    (4.12)
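The following Julia sketch of Algorithm 14 returns the basis U and the matrix H of (4.11), together with the quantities appearing in (4.12); breakdown (h_{p+1,p} = 0) is not handled, and the matrix size and subspace dimension are illustrative.

using LinearAlgebra

# Arnoldi iteration (Algorithm 14): orthonormal basis of K_p(A, u1)
# and the corresponding Hessenberg matrix H = U* A U.
function arnoldi(A, u1, p)
    n = length(u1)
    U = zeros(eltype(A), n, p + 1)
    H = zeros(eltype(A), p + 1, p)
    U[:, 1] = u1 / norm(u1)
    for j in 1:p
        w = A * U[:, j]
        for i in 1:j                 # Gram–Schmidt against u_1, ..., u_j
            H[i, j] = U[:, i]' * w
            w -= H[i, j] * U[:, i]
        end
        H[j+1, j] = norm(w)          # zero here means breakdown (not handled)
        U[:, j+1] = w / H[j+1, j]
    end
    return U[:, 1:p], H[1:p, 1:p], U[:, p+1], H[p+1, p]
end

A = rand(1000, 1000)
U, H, u_next, h = arnoldi(A, randn(1000), 30)
# Ritz pairs of A relative to the Krylov subspace: eigen(H)
# Relation (4.12): A*U ≈ U*H + h * u_next * [zeros(29); 1]'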

The Arnoldi algorithm, coupled with the Rayleigh–Ritz method, has very good convergence
properties in the limit as p → ∞, in particular for eigenvalues with a large modulus. The
following result shows that the residual r = A v̂ − λ̂ v̂ associated with a Ritz vector can be
estimated inexpensively. Specifically, the norm of the residual is equal to the modulus of the last
component of the associated eigenvector of H multiplied by |h_{p+1,p}|.

Proposition 4.9 (Formula for the residual � ). Let y_i be an eigenvector of H associated with
the eigenvalue λ̂_i, and let v̂_i = U y_i denote the corresponding Ritz vector. Then

A v̂_i − λ̂_i v̂_i = h_{p+1,p} (y_i)_p v_{p+1}.

Consequently, it holds that

‖A v̂_i − λ̂_i v̂_i‖ = |h_{p+1,p} (y_i)_p|.

Proof. Multiplying both sides of (4.12) by y i , we obtain

A U y_i = U H y_i + h_{p+1,p} (v_{p+1} e_p^*) y_i.


Using the definition of v̂_i and rearranging the equation, we have

A v̂_i − λ̂_i v̂_i = h_{p+1,p} (v_{p+1} e_p^*) y_i,

which immediately gives the result.

In practice, the larger the dimension p of the subspace U employed in the Rayleigh–Ritz
method, the more memory is required for storing an orthonormal basis of U. In addition, for
large values of p, computing the reduced matrix (4.11) and its eigenpairs becomes computation-
ally expensive; the computational cost of computing the matrix H scales as O(p2 ). To remedy
these potential issues, the algorithm can be restarted periodically. For example, Algorithm 15
can be employed as an alternative to the power iteration in order to find the eigenvector asso-
ciated with the eigenvalue with largest modulus.

Algorithm 15 Restarted Arnoldi iteration


Choose u_1 ∈ C^n and p ≪ n
for i ∈ {1, 2, . . . } do
Perform p iterations of the Arnoldi iteration and construct U;
Calculate the Ritz vector v̂_1 associated with the largest Ritz value relative to U;
If this vector is sufficiently accurate, then stop. Otherwise, restart with u_1 = v̂_1.
end for

4.4.3 The Lanczos iteration


The Lanczos iteration may be viewed as a simplified version of the Arnoldi iteration in the
case where the matrix A is Hermitian. Let us denote by {u1 , . . . , up } the orthonormal vectors
generated by the Arnoldi iteration. When A is Hermitian, it holds that

h_{i,j} = u_i^* (A u_j) = (A u_i)^* u_j = \overline{u_j^* (A u_i)} = \overline{h_{j,i}}.

Therefore, the matrix H is Hermitian. This is not surprising, since we showed that H = U∗ AU
and the matrix A is Hermitian. Since H is also of Hessenberg form, we deduce that H is
tridiagonal. An inspection of Algorithm 14 shows that the subdiagonal entries of H are real.
Since A is Hermitian, the diagonal entries h_{i,i} = u_i^* (A u_i) are also real, and so we conclude that
all the entries of the matrix H are in fact real. This matrix is of the form
 
        ⎛ α_1  β_2                  ⎞
        ⎜ β_2  α_2  β_3             ⎟
    H = ⎜      β_3   ⋱    ⋱         ⎟ .
        ⎜            ⋱    ⋱    β_p  ⎟
        ⎝                 β_p  α_p  ⎠

Adapting the Arnoldi iteration to this setting leads to Algorithm 16.


Algorithm 16 Lanczos iteration for constructing an orthonormal basis of Kp (A, u1 )


Choose u1 with unit norm.
β1 ← 0, u0 ← 0 ∈ Cn
for j ∈ {1, . . . p} do
uj+1 ← Auj − βj uj−1
αj ← u∗j uj+1
uj+1 ← uj+1 − αj uj
βj+1 ← ‖uj+1‖
uj+1 ← uj+1 /βj+1
end for
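A Julia sketch of Algorithm 16 is given below; it stores the coefficients α_j and β_j in a SymTridiagonal matrix, does not handle breakdown, and uses illustrative dimensions.

using LinearAlgebra

# Lanczos iteration (Algorithm 16) for a Hermitian (here real symmetric) A.
function lanczos(A, u1, p)
    n = length(u1)
    α = zeros(p); β = zeros(p)       # diagonal and subdiagonal coefficients
    U = zeros(n, p + 1)
    U[:, 1] = u1 / norm(u1)
    uprev = zeros(n); βprev = 0.0
    for j in 1:p
        w = A * U[:, j] - βprev * uprev
        α[j] = dot(U[:, j], w)
        w -= α[j] * U[:, j]
        β[j] = norm(w)               # zero here means breakdown (not handled)
        uprev = U[:, j]; βprev = β[j]
        U[:, j+1] = w / β[j]
    end
    H = SymTridiagonal(α, β[1:p-1])  # reduced matrix U* A U
    return U[:, 1:p], H
end

A = rand(2000, 2000); A = (A + A') / 2
U, H = lanczos(A, randn(2000), 40)
# eigvals(H) approximates the extremal eigenvalues of A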

4.5 Exercises
� Exercise 4.1. PageRank is an algorithm for assigning a rank to the vertices of a directed
graph. It is used by many search engines, notably Google, for sorting search results. In this
context, the directed graph encodes the links between pages of the World Wide Web: the vertices
of the directed graph are webpages, and there is an edge going from page i to page j if page i
contains a hyperlink to page j.
Let us consider a directed graph G(V, E) with vertices V = {1, . . . , n} and edges E. The
graph can be represented by its adjacency matrix A ∈ {0, 1}n×n , whose entries are given by

1 if there is an edge from i to j,
aij =
0 otherwise.

Let ri denote the “value” assigned to vertex i. The idea of PageRank, in its simplest form, is
to assign values to the vertices by solving the following system of equations:

∀i ∈ V,    r_i = Σ_{j ∈ neighbors(i)} r_j / o_j,    (4.13)

where oj is the outdegree of vertex j, i.e. the number of edges leaving from j. The sum is over
all the “incoming” neighbors of i, which have an edge towards node i.

• Read the Wikipedia page on PageRank to familiarize yourself with the algorithm.
• Let r = (r_1 . . . r_n)^T. Show using (4.13) that r satisfies

r = A^T diag(1/o_1, . . . , 1/o_n) r =: A^T O^{−1} r.

In other words, r is an eigenvector with eigenvalue 1 of the matrix M = AT O−1 .

• Show that M is a left-stochastic matrix, i.e. that each column sums to 1.

• Prove that the eigenvalues of any matrix B ∈ Rn×n coincide with those of BT . You may
use the fact that det(B) = det(BT ).


• Show that 1 is an eigenvalue and that ρ(M) = 1. For the second part, find a subordinate
matrix norm such that ‖M‖ = 1.

• Implement PageRank in order to rank pages from a 2013 snapshot of English Wikipedia.
You can use either the simplified version of the algorithm given in (4.13) or the improved
version with a damping factor described on Wikipedia. In the former case, the following
are both sensible stopping criteria:

‖M r̂ − r̂‖_1 / ‖r̂‖_1 < 10^{−15}    or    ‖M r̂ − λ̂ r̂‖_1 / ‖r̂‖_1 < 10^{−15},    λ̂ = (r̂^T M r̂)/(r̂^T r̂),

where r̂ is an approximation of the eigenvector corresponding to the dominant eigenvalue.
A dataset is available on the course website to complete this part. This dataset contains
a subset of the data publicly available here, and was generated from the full dataset by
retaining only the 5% best rated articles. After decompressing the archive, you can load
the dataset into Julia by using the following commands:

import CSV
import DataFrames

# Data (nodes and edges)


nodes = CSV.read("names.csv", DataFrames.DataFrame)
edges = CSV.read("edges.csv", DataFrames.DataFrame)

# Convert data to matrices


nodes = Matrix(nodes)
edges = Matrix(edges)

After you have assigned a rank to all the pages, print the 10 pages with the highest ranks.
My code returns the following entries:

1. United States
2. United Kingdom
3. World War II
4. Latin
5. France
6. Germany
7. English language
8. China
9. Canada
10. India

• Extra credit: Write a function search(keyword) that can be employed for searching the
database. Here is an example of what it could return:

julia> search("New York")


481-element Vector{String}:
"New York City"
"New York"
"The New York Times"
"New York Stock Exchange"


"New York University"


� Exercise 4.2. Show the following properties of the Krylov subspace Kp (A, x).

• Kp (A, x) ⊂ Kp+1 (A, x).

• AKp (A, x) ⊂ Kp+1 (A, x).

• The Krylov subspace Kp (A, x) is invariant under rescaling: for all α ∈ C,

Kp (A, x) = Kp (αA, x) = Kp (A, αx).

• The Krylov subspace Kp (A, x) is invariant under shift of the matrix A: for all α ∈ C,

Kp (A, x) = Kp (A − αI, x).

• Similarity transformation: If T ∈ Cn×n is nonsingular, then

Kp (T−1 AT, T−1 x) = T−1 Kp (A, x).

� Exercise 4.3. The minimal polynomial of a matrix A ∈ Cn×n is the monic polynomial p of
lowest degree such that p(A) = 0. Prove that, if A is Hermitian with m ≤ n distinct eigenvalues,
then the minimal polynomial is given by
p(t) = ∏_{i=1}^{m} (t − λ_i).

� Exercise 4.4. The minimal polynomial for a general matrix A ∈ Cn×n is given by
p(t) = ∏_{i=1}^{m} (t − λ_i)^{s_i},

where s_i is the size of the largest Jordan block associated with the eigenvalue λ_i in the Jordan
normal form of A. Verify that p(A) = 0.

� Exercise 4.5. Let d denote the degree of the minimal polynomial of A. Show that

∀p ≥ d, Kp+1 (A, x) = Kp (A, x).

Deduce that, for p ≥ n, the subspace Kp (A, x) is an invariant subspace of A.

� Exercise 4.6. Let A ∈ Cn×n . Show that Kn (A, x) is the smallest invariant subspace of A
that contains x.

� Exercise 4.7 (A posteriori error bound). Assume that A ∈ C^{n×n} is Hermitian, and that v̂
is a normalized approximation of an eigenvector which satisfies

‖ẑ‖ := ‖A v̂ − λ̂ v̂‖ = δ,    λ̂ = (v̂^* A v̂)/(v̂^* v̂).


Prove that there is an eigenvalue λ of A such that

|λ̂ − λ| ≤ δ.

Hint: Consider first the case where A is diagonal.

� Exercise 4.8 (Bauer–Fike theorem). Assume that A ∈ C^{n×n} is diagonalizable: AV = VD.
Show that, if v̂ is a normalized approximation of an eigenvector which satisfies

‖r‖ := ‖A v̂ − λ̂ v̂‖ = δ

for some λ̂ ∈ C, then there is an eigenvalue λ of A such that

|λ̂ − λ| ≤ κ_2(V) δ.

Hint: Rewrite

‖v̂‖ = ‖(A − λ̂ I)^{−1} r‖ = ‖V (D − λ̂ I)^{−1} V^{−1} r‖.

� Exercise 4.9. In Exercise 4.7 and Exercise 4.8, we saw examples of a posteriori error estimates
which guarantee the existence of an eigenvalue of A within a certain distance of the
approximation λ̂. In this exercise, our aim is to provide an answer to the following different
question: given an approximate eigenpair (v̂, λ̂), what is the smallest perturbation E that we
need to apply to A in order to guarantee that (v̂, λ̂) is an exact eigenpair, i.e. that

(A + E) v̂ = λ̂ v̂?

Assume that v̂ is normalized and let 𝓔 = {E ∈ C^{n×n} : (A + E) v̂ = λ̂ v̂}. Prove that

min_{E ∈ 𝓔} ‖E‖_2 = ‖r‖_2 := ‖A v̂ − λ̂ v̂‖_2.    (4.14)

To this end, you may proceed as follows:

• Show first that any E ∈ 𝓔 satisfies E v̂ = −r.

• Deduce from the first item that

inf_{E ∈ 𝓔} ‖E‖_2 ≥ ‖r‖_2.

• Find a rank one matrix E_* such that ‖E_*‖_2 = ‖r‖_2, and then conclude. Recall that any
rank 1 matrix can be written in the form E_* = u w^*, with norm ‖u‖_2 ‖w‖_2.

Equation (4.14) is a simplified version of the Kahan–Parlett–Jiang theorem and is an example


of a backward error estimate. Whereas forward error estimates quantify the distance between an
approximation and the exact solution, backward error estimates give the size of the perturbation
that must be applied to a problem so that an approximation is exact.

� Exercise 4.10. Assume that (xk )k≥0 is a sequence of normalized vectors in Cn . Show that
the following statements are equivalent:


• (xk )k≥0 converges essentially to x∞ in the limit as k → ∞.

• The angle ∠(xk , x∞ ) converges to zero in the limit as k → ∞.

� Exercise 4.11. Assume that A ∈ Cn×n is skew-Hermitian. Derive a Lanczos-like algorithm


for constructing an orthonormal basis of Kp (A, x) and calculating the reduced matrix

U∗ AU,

where U ∈ Cn×p contains the vectors of the basis as columns. Implement your algorithm
with p = 20 in order to approximate the dominant eigenvalue of the matrix A constructed by the
following piece of code:

n = 5000
A = rand(n, n) + im * randn(n, n)
A = A - A' # A is now skew-Hermitian

� Exercise 4.12. Assume that {u_1, . . . , u_p} and {w_1, . . . , w_p} are orthonormal bases of the
same subspace U ⊂ C^n. Show that the projectors U U^* and W W^* are equal.

� Exercise 4.13. Assume that A ∈ Cn×n is a Hermitian matrix with distinct eigenvalues, and
suppose that we know the dominant eigenpair (v 1 , λ1 ), with v 1 normalized. Let

B = A − λ1 v 1 v ∗1 .

If we apply the power iteration to this matrix, what convergence can we expect?

� Exercise 4.14. Assume that v̂_1 and v̂_2 are two Ritz vectors of a Hermitian matrix A relative
to a subspace U ⊂ C^n. Show that, if the associated Ritz values are distinct, then v̂_1^* v̂_2 = 0.

4.6 Discussion and bibliography


The content of this chapter is inspired mainly from [13] and also from [11]. The latter volume
contains a detailed coverage of the standard methods for eigenvalue problems. Some of the
exercises are taken from [15], and others are inspired from [11].

Chapter 5

Interpolation and approximation

5.1 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1.1 Vandermonde matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1.2 Lagrange interpolation formula . . . . . . . . . . . . . . . . . . . . . . . 100
5.1.3 Gregory–Newton interpolation . . . . . . . . . . . . . . . . . . . . . . . 100
5.1.4 Interpolation error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.1.5 Interpolation at Chebyshev nodes . . . . . . . . . . . . . . . . . . . . . . 107
5.1.6 Hermite interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.1.7 Piecewise interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.2 Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.2.1 Least squares approximation of data points . . . . . . . . . . . . . . . . 113
5.2.2 Mean square approximation of functions . . . . . . . . . . . . . . . . . . 115
5.2.3 Orthogonal polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.2.4 Orthogonal polynomials and numerical integration: an introduction � . . . 118
5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.4 Discussion and bibliography . . . . . . . . . . . . . . . . . . . . . . . . . 123

Introduction
In this chapter, we study numerical methods for interpolating and approximating functions.
The Cambridge dictionary defines interpolation as the addition of something different in the
middle of a text, piece of music, etc. or the thing that is added. The concept of interpolation
in mathematics is consistent with this definition; interpolation consists in finding, given a set
of points (xi , yi ), a function f in a finite-dimensional space that goes through these points.
Throughout this course, you have used the plot function in Julia, which performs piecewise
linear interpolation for drawing functions, but there are a number of other standard interpolation
methods. Our first goal in this chapter is to present an overview of these methods and the
associated error estimates.


In the second part of this chapter, we focus on function approximation, which is closely
related to the subject of mathematical interpolation. Indeed, a simple manner for approximating
a general function by another one in a finite-dimensional space is to select a set of real numbers
on the x axis, called nodes, and find the associated interpolant. As we shall demonstrate, not
all sets of interpolation nodes are equal, and special care is required in order to avoid undesired
oscillations. The field of function approximation is vast, so our aim in this chapter is to present
only an introduction to the subject. In order to quantify the quality of an approximation, a
metric on the space of functions, or a subset thereof, must be specified in order to measure
errors. Without a metric, saying that two functions are close is almost meaningless!

5.1 Interpolation
Assume that we are given n+1 nodes x0 , . . . , xn on the x axis, together with the values u0 , . . . , un
taken by an unknown function u(x) when evaluated at these points, and suppose that we are
looking for an interpolant û(x) in a subspace span{ϕ_0, . . . , ϕ_n} of the space of continuous
functions, i.e. an interpolating function of the form

û(x) = α_0 ϕ_0(x) + · · · + α_n ϕ_n(x),

where α_0, . . . , α_n are real coefficients. In order for û(x) to be an interpolating function, we must
require that

∀i ∈ {0, . . . , n},    û(x_i) = u_i.

This leads to a linear system of n + 1 equations and n + 1 unknowns, the latter being the
coefficients α0 , . . . , αn . This system of equations in matrix form reads
    
⎛ ϕ_0(x_0)  ϕ_1(x_0)  · · ·  ϕ_n(x_0) ⎞ ⎛ α_0 ⎞   ⎛ u_0 ⎞
⎜ ϕ_0(x_1)  ϕ_1(x_1)  · · ·  ϕ_n(x_1) ⎟ ⎜ α_1 ⎟ = ⎜ u_1 ⎟ .    (5.1)
⎜     ⋮         ⋮       ⋱        ⋮    ⎟ ⎜  ⋮  ⎟   ⎜  ⋮  ⎟
⎝ ϕ_0(x_n)  ϕ_1(x_n)  · · ·  ϕ_n(x_n) ⎠ ⎝ α_n ⎠   ⎝ u_n ⎠

5.1.1 Vandermonde matrix


Since polynomials are very convenient for evaluation, integration, and differentiation, they are
a natural choice for interpolation purposes. The simplest basis of the subspace of polynomials
of degree less than or equal to n is given by the monomials:

ϕ0 (x) = 1, ϕ1 (x) = x, ..., ϕn (x) = xn .


In this case, the linear system (5.1) for determining the coefficients of the interpolant reads
    
⎛ 1  x_0  · · ·  x_0^n ⎞ ⎛ α_0 ⎞   ⎛ u_0 ⎞
⎜ 1  x_1  · · ·  x_1^n ⎟ ⎜ α_1 ⎟ = ⎜ u_1 ⎟ .    (5.2)
⎜ ⋮   ⋮     ⋱     ⋮    ⎟ ⎜  ⋮  ⎟   ⎜  ⋮  ⎟
⎝ 1  x_n  · · ·  x_n^n ⎠ ⎝ α_n ⎠   ⎝ u_n ⎠

The matrix on the left-hand side is called a Vandermonde matrix. If the abscissae x_0, . . . , x_n
are distinct, then this is a full rank matrix, and so (5.2) admits a unique solution, implying
as a corollary that the interpolating polynomial exists and is unique. It is possible to show
that the condition number of the Vandermonde matrix increases dramatically with n. Consequently,
solving (5.2) is not a viable method in practice for calculating the interpolating polynomial.
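This growth of the condition number is easy to observe numerically; the short Julia experiment below, with equidistant nodes in [0, 1] and a few illustrative values of n, is one way to do so.

using LinearAlgebra

# Condition number of the Vandermonde matrix for equidistant nodes in [0, 1]
for n in (5, 10, 20)
    x = range(0, 1, length=n+1)
    V = [x[i+1]^j for i in 0:n, j in 0:n]   # V[i, j] = x_i^j
    println("n = $n: cond(V) = ", cond(V))
end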

5.1.2 Lagrange interpolation formula


One may wonder whether polynomial basis functions ϕ0 , . . . , ϕn can be defined in such a manner
that the matrix in (5.1) is the identity matrix. The answer to this question is positive; it suffices
to take as a basis the Lagrange polynomials, which are given by

ϕ_i(x) = [(x − x_0)(x − x_1) · · · (x − x_{i−1})(x − x_{i+1}) · · · (x − x_n)] / [(x_i − x_0)(x_i − x_1) · · · (x_i − x_{i−1})(x_i − x_{i+1}) · · · (x_i − x_n)].

It is simple to check that ϕ_i(x_j) = δ_{i,j}, which equals 1 if i = j and 0 otherwise.

Finding the interpolant in this basis is immediate:

û(x) = u_0 ϕ_0(x) + · · · + u_n ϕ_n(x).

While simple, this approach to polynomial interpolation has a couple of disadvantages: first,
evaluating û(x) is computationally costly when n is large and, second, all the basis functions
change when adding new interpolation nodes. In addition, Lagrange interpolation is numerically
unstable because of cancellations between large terms. Indeed, it is often the case that Lagrange
polynomials take very large values over the interpolation intervals; this occurs, for example,
when many equidistant interpolation nodes are employed, as illustrated in Figure 5.1.

5.1.3 Gregory–Newton interpolation


By Taylor’s formula, any polynomial p of degree n may be expressed as

p(x) = p(0) + p′(0) x + (p″(0)/2) x² + · · · + (p^{(n)}(0)/n!) x^n.    (5.3)

The constant coefficient can be obtained by evaluating the polynomial at 0, the linear coefficient
can be identified by evaluating the first derivative at 0, and so on. Assume now that we are given
the values taken by p when evaluated at the integer numbers {0, . . . , n}. We ask the following


Figure 5.1: Lagrange polynomials associated with equidistant nodes over the (0, 1) interval.

question: can we find a formula similar in spirit to (5.3), but including only evaluations of p and
not of its derivatives? To answer this question, we introduce the difference operator ∆ which
acts on functions as follows:
∆f (x) = f (x + 1) − f (x).

The operator ∆ is a linear operator on the space of continuous functions. It maps constant
functions to 0, and the linear function x to the constant function 1, suggesting a resemblance
with the differentiation operator. In order to further understand this connection, let us define
the falling power of a real number x as

x^{\underline{k}} := x(x − 1)(x − 2) · · · (x − k + 1).

We then have that

∆x^{\underline{k}} = (x + 1)x(x − 1) · · · (x − k + 2) − x(x − 1)(x − 2) · · · (x − k + 1)
               = ((x + 1) − (x − k + 1)) x(x − 1) · · · (x − k + 2) = k x^{\underline{k−1}}.
 

In other words, the action of the difference operator on falling powers mirrors that of the differ-
entiation operator on monomials. The falling powers form a basis of the space of polynomials,


and so any polynomial in P(n), i.e. of degree less than or equal to n, can be expressed as

p(x) = α_0 + α_1 x^{\underline{1}} + α_2 x^{\underline{2}} + · · · + α_n x^{\underline{n}}.    (5.4)

It is not difficult to show that α_i = ∆^i p(0)/i!, where ∆^i p denotes the function obtained after i
applications of the operator ∆. Therefore, any polynomial of degree less than or equal to n may
be expressed as

p(x) = p(0) + ∆p(0) x^{\underline{1}} + (∆²p(0)/2) x^{\underline{2}} + · · · + (∆^n p(0)/n!) x^{\underline{n}}.    (5.5)
An expansion of the form (5.5) is called a Newton series, which is the discrete analog of the
continuous Taylor series. From the definition of ∆, it is clear that the coefficients in (5.5) depend
only on p(0), . . . , p(n). We conclude that, given n + 1 points (i, u_i) for i ∈ {0, . . . , n}, the
unique interpolating polynomial is given by (5.5), after replacing p(i) by u_i.

Example 5.1. Let us use (5.4) in order to calculate the value of


S(n) := Σ_{i=0}^{n} i².

Since ∆S(n) = (n + 1)2 , which is a second degree polynomial in n, we deduce that S(n) is a
polynomial of degree 3. Let us now determine its coefficients.

n 0 1 2 3
∆0 S(n) 0 1 5 14
∆1 S(n) 1 4 9
∆2 S(n) 3 5
∆3 S(n) 2

We conclude that

S(n) = 1·n + (3/2!) n(n − 1) + (2/3!) n(n − 1)(n − 2) = n(2n + 1)(n + 1)/6.

Notice that when falling powers are employed as polynomial basis, the matrix in (5.1) is
lower triangular, and so the algorithm described in Example 5.1 could be replaced by the
forward substitution method. Whereas the coefficients of the Lagrange interpolant can be
obtained immediately from the values of u at the nodes, calculating the coefficients of the
expansion in (5.4) requires O(n2 ) operations. However, Gregory–Newton interpolation has
several advantages over Lagrange interpolation:

• If a point (n + 1, pn+1 ) is added to the set of interpolation points, only one additional
term, corresponding to the falling power x^{\underline{n+1}}, needs to be calculated in (5.5). All the
other coefficients are unchanged. Therefore, the Gregory–Newton approach is well-suited
for incremental interpolation.

• The Gregory–Newton interpolation method is more numerically stable than Lagrange


interpolation, because the basis functions do not take very large values.

• A polynomial in the form of a Newton series can be evaluated efficiently using Horner’s
method, which is based on rewriting the polynomial as
  

p(x) = α_0 + x( α_1 + (x − 1)( α_2 + (x − 2)( α_3 + (x − 3)( · · · ) ) ) ).

Evaluating this expression starting from the innermost bracket leads to an algorithm with
a cost scaling linearly with the degree of the polynomial.
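As an illustration of the last point, here is a short Julia sketch of Horner's rule for a Newton series with nodes 0, 1, . . . , n; the function name is illustrative and the example coefficients are those obtained in Example 5.1.

# Evaluate p(x) = α0 + α1*x + α2*x(x-1) + ... (Newton form with nodes 0, 1, 2, ...)
# by Horner's rule, starting from the innermost bracket.
function horner_newton(α, x)
    n = length(α) - 1
    result = α[end]
    for k in n-1:-1:0
        result = α[k+1] + (x - k) * result
    end
    return result
end

α = [0, 1, 3/2, 1/3]    # Newton-series coefficients of S(n) = Σ_{i=0}^n i²
horner_newton(α, 4)      # returns 30.0 = 0² + 1² + 2² + 3² + 4²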

Non-equidistant nodes

So far, we have described the Gregory–Newton method in the simple setting where interpolation
nodes are just a sequence of successive natural numbers. The method can be generalized to the
setting of pairwise distinct nodes x_0, . . . , x_n which are not necessarily equidistant. In this case, we take as
basis the following functions instead of the falling powers:

ϕi (x) = (x − x0 )(x − x1 ) . . . (x − xi−1 ), (5.6)

with the convention that the empty product is 1. By (5.1), the coefficients of the interpolating
polynomial in this basis solve the following linear system:
    
1 ... 0 α0 u0
1 x1 − x0
    
  α1   u1 
 ..    
. (5.7)
 α  u 
1 x2 − x0 (x2 − x0 )(x2 − x1 )   2 =  2 .

 .   . 
 .. .. ..   ..   .. 

. . .    
Qn−1
1 xn − x0 ... ... j=0 (xn − xj )
αn un

This system could be solved using, for example, forward substitution. Clearly α0 = u0 , and
then from the second equation we obtain

α_1 = (u_1 − u_0)/(x_1 − x_0) =: [u_0, u_1],

which may be viewed as an approximation of the slope of u at x0 . The right-hand side of


this equation is an example of a divided difference. In general, divided differences are defined
recursively as follows:

[u_0, u_1, . . . , u_d] := ([u_1, . . . , u_d] − [u_0, . . . , u_{d−1}]) / (x_d − x_0),    [u_i] := u_i.

It is possible to find an expression for the coefficients of the interpolating polynomial in terms
of these divided differences.

Proposition 5.1. Assume that (x0 , u0 ), . . . , (xn , un ) are n+1 points in the plane with distinct


abscissae. Then the interpolating polynomial of degree n may be expressed as

p(x) = Σ_{i=0}^{n} [u_0, . . . , u_i] ϕ_i(x),

where ϕi (x), for i = 0, . . . , n, are the basis functions defined in (5.6).

Proof. The statement is true for n = 0. Reasoning by induction, we assume that it holds true
for polynomials of degree up to n − 1. Let p1 (x) and p2 (x) be the interpolating polynomials at
the points x0 , x1 , . . . , xn−2 , xn−1 and x0 , x1 , . . . , xn−2 , xn , respectively. Then

x − xn−1
(5.8)

p(x) = p1 (x) + p2 (x) − p1 (x)
xn − xn−1

is a polynomial of degree n that interpolates all the data points. By the induction hypothesis,
it holds that
n−2
p1 (x) = u0 + [u0 , u1 ](x − x0 ) + . . . + [u0 , u1 , . . . , un−2 , un−1 ]
Y
(x − xi ),
i=0
n−2
p2 (x) = u0 + [u0 , u1 ](x − x0 ) + . . . + [u0 , u1 , . . . , un−2 , un ]
Y
(x − xi ).
i=0

The two expressions differ only in the last divided difference. Substituting them in (5.8), we obtain

p(x) = u_0 + [u_0, u_1](x − x_0) + · · · + [u_0, . . . , u_{n−1}] ∏_{i=0}^{n−2} (x − x_i)
       + ( ([u_0, u_1, . . . , u_{n−2}, u_n] − [u_0, u_1, . . . , u_{n−2}, u_{n−1}]) / (x_n − x_{n−1}) ) ∏_{i=0}^{n−1} (x − x_i).

In Exercise 5.4, we show that divided differences are invariant under permutations of the data
points, and so we have that

([u_0, u_1, . . . , u_{n−2}, u_n] − [u_0, u_1, . . . , u_{n−2}, u_{n−1}]) / (x_n − x_{n−1}) = [u_0, . . . , u_n],

which concludes the proof.

Example 5.2. Assume that we are looking for the third degree polynomial going through the
points
(−1, 10), (0, 4), (2, −2), (4, −40).

We have to calculate αi = [u0 , . . . , ui ] for i ∈ {0, 1, 2, 3}. It is convenient to use a table in order
to calculate the divided differences:


i 0 1 2 3
[ui ] 10 4 −2 −40
xi+1 − xi 1 2 2
[ui , ui+1 ] −6 −3 −19
xi+2 − xi 3 4
[ui , ui+1 , ui+2 ] 1 −4
xi+3 − xi 5
[ui , ui+1 , ui+2 , ui+3 ] −1

We deduce that the expression of the interpolating polynomial is

p(x) = 10 + (−6)(x + 1) + 1·(x + 1)x + (−1)(x + 1)x(x − 2) = −x³ + 2x² − 3x + 4.
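The divided-difference table and the evaluation of the resulting Newton form can be implemented in a few lines of Julia; the sketch below, which reproduces the data of Example 5.2, is one possible way to do so.

# Coefficients [u0], [u0,u1], ..., [u0,...,un] via the divided-difference recursion
function divided_differences(x, u)
    n = length(x)
    c = float.(u)
    for d in 1:n-1, i in n:-1:d+1
        c[i] = (c[i] - c[i-1]) / (x[i] - x[i-d])
    end
    return c
end

# Evaluate p(t) = Σ c_i (t - x_1)...(t - x_{i-1}) by Horner's rule
function newton_eval(x, c, t)
    p = c[end]
    for i in length(c)-1:-1:1
        p = c[i] + (t - x[i]) * p
    end
    return p
end

x = [-1, 0, 2, 4]; u = [10, 4, -2, -40]
c = divided_differences(x, u)   # returns [10, -6, 1, -1], as in the table above
newton_eval(x, c, 2)            # returns -2.0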

5.1.4 Interpolation error


Assume that u(x) is a continuous function and denote by û(x) its interpolation through the
points (xi , ui ), for i = 0, . . . , n. In this section, we study the behavior of the error in the limit
as n → ∞.

Theorem 5.2 (Interpolation error). Assume that u : [a, b] → R is a function in C n+1 ([a, b])
and let x0 , . . . , xn denote n + 1 distinct interpolation nodes. Then for all x ∈ [a, b], there
exists ξ = ξ(x) in the interval [a, b] such that

e_n(x) := u(x) − û(x) = (u^{(n+1)}(ξ) / (n + 1)!) (x − x_0) · · · (x − x_n).

Proof. The statement is obvious if x ∈ {x0 , . . . , xn }, so we assume from now on that x does not
coincide with an interpolation node. Let us use the compact notation ω_n(t) = ∏_{i=0}^{n} (t − x_i) and
introduce the function

g(t) = e_n(t) ω_n(x) − e_n(x) ω_n(t).    (5.9)

The function g is smooth and takes the value 0 when evaluated at x_0, . . . , x_n, x. Since g has
n + 2 roots in the interval [a, b], Rolle's theorem implies that g′ has n + 1 distinct roots in (a, b).
Another application of Rolle's theorem yields that g″ has n distinct roots in (a, b). Iterating
this reasoning, we deduce that g^{(n+1)} has one root t_* in (a, b). From (5.9), we calculate that

g^{(n+1)}(t) = e_n^{(n+1)}(t) ω_n(x) − e_n(x) ω_n^{(n+1)}(t) = u^{(n+1)}(t) ω_n(x) − e_n(x)(n + 1)!.    (5.10)

Here we employed the fact that û^{(n+1)}(t) = 0, because û is a polynomial of degree at most n.


Evaluating (5.10) at t∗ and rearranging, we obtain that

e_n(x) = (u^{(n+1)}(t_*) / (n + 1)!) ω_n(x),

which completes the proof.

As a corollary to Theorem 5.2, we deduce the following error bound.

Corollary 5.3 (Upper bound on the interpolation error). Assume that u is smooth in [a, b] and
that
sup_{x∈[a,b]} |u^{(n+1)}(x)| ≤ C_{n+1}.

Then

∀x ∈ [a, b],    |e_n(x)| ≤ (C_{n+1} / (4(n + 1))) h^{n+1},    (5.11)

where h is the maximum spacing between two successive interpolation nodes.

Proof. By Theorem 5.2, we have

∀x ∈ [a, b],    |e_n(x)| ≤ (C_{n+1} / (n + 1)!) |(x − x_0) · · · (x − x_n)|.    (5.12)

The product on the right-hand side is bounded from above by

(h²/4) × 2h × 3h × 4h × · · · × nh = n! h^{n+1} / 4.    (5.13)

The first factor comes from the fact that, if x ∈ [xi , xi+1 ], then

|(x − x_i)(x − x_{i+1})| ≤ (x_{i+1} − x_i)²/4,

because the left-hand side is maximized when x is the midpoint of the interval [xi , xi+1 ]. Sub-
stituting (5.13) into (5.12), we deduce the statement.

Let us now introduce the supremum norm of the error over the interval [a, b]:

E_n = sup_{x∈[a,b]} |e_n(x)|.

We ask the following natural question: does En tend to zero as the maximum spacing between
successive nodes tends to 0? By Corollary 5.3, the answer to this question is positive when Cn
does not grow too quickly as n → ∞. For example, as illustrated in Figure 5.2, the interpolation
error for the function u(x) = sin(x), when using equidistant interpolation nodes, decreases very
quickly as n → ∞.
In some cases, however, the constant Cn grows quickly with n, to the extent that En may
increase with n; in this case, the maximum interpolation error grows as we add nodes! The


Figure 5.2: Interpolation (in orange) of the function u(x) = sin(x) (in blue) using 3, 4, 6, and
8 equidistant nodes.

classic example, in order to illustrate this potential issue, is that of the Runge function:

u(x) = 1/(1 + 25x²).    (5.14)

It is possible to show that, for this function, the upper bound in Corollary 5.3 tends to ∞ in the
limit as the number of interpolation nodes increases. We emphasize that this does not prove
that En → ∞ in the limit as n → ∞, because (5.11) provides only an upper bound on the error.
In fact, the interpolation error for the Runge function can either grow or decrease, depending
on the choice of interpolation nodes. With equidistant nodes, it turns out that En → ∞, as
illustrated in Figure 5.3.

5.1.5 Interpolation at Chebyshev nodes


Sometimes, interpolation is employed as a technique for approximating functions. The spectral
collocation method, for example, is a technique for solving partial differential equations where
a discrete solution is first calculated, and then a continuous solution is constructed using poly-
nomial or Fourier interpolation. When the interpolation nodes are not given a priori as data, it
is natural to wonder whether these can be picked in such a manner that the interpolation error,
measured in a function norm, is minimized. For example, given a continuous function u(x) and
a number of nodes n, is it possible to choose nodes x_0, . . . , x_n such that

E := sup_{x∈[a,b]} |u(x) − û(x)|

is minimized? Here û is the polynomial interpolating u at the nodes. Achieving this goal in
general is a difficult task, essentially because ξ = ξ(x) is unknown in the expression of the
general is a difficult task, essentially because ξ = ξ(x) is unknown in the expression of the


Figure 5.3: Interpolation (in orange) of the Runge function (5.14) (in blue) using 6, 10, 14, and
20 equidistant nodes.

interpolation error from Theorem 5.2:

e_n(x) = (u^{(n+1)}(ξ) / (n + 1)!) (x − x_0) · · · (x − x_n).

In view of this difficulty, we will focus on the simpler problem of finding interpolation nodes
such that the product (x − x0 ) . . . (x − xn ) is minimized in the supremum norm. This problem
amounts to finding the optimal interpolation nodes, in the sense that E is minimized, in the
particular case where u is a polynomial of degree n + 1, because in this case u^{(n+1)}(ξ) is a constant
factor. It turns out that this problem admits an explicit solution, which we will deduce from
the following theorem.

Theorem 5.4 (Minimum ∞ norm). Assume that p is a monic polynomial of degree n:

p(x) = α_0 + α_1 x + · · · + α_{n−1} x^{n−1} + x^n.

Then it holds that

sup_{x∈[−1,1]} |p(x)| ≥ 1/2^{n−1} =: E.    (5.15)

In addition, the lower bound is achieved for p_*(x) = 2^{−(n−1)} T_n(x), where T_n is the Chebyshev
polynomial of degree n:

T_n(x) = cos(n arccos x)    (−1 ≤ x ≤ 1).    (5.16)

Proof. By Exercise 2.24, the polynomial x 7→ 2−(n−1) Tn (x) is indeed monic, and it is clear that
it achieves the lower bound (5.15) since |Tn (x)| ≤ 1.


In order to prove (5.15), we assume by contradiction that there is a different monic polynomial p̃ of degree n such that

sup_{x∈[−1,1]} |p̃(x)| < E.    (5.17)

Let us introduce x_i = cos(iπ/n), for i = 0, . . . , n, and observe that

p_*(x_i) = 2^{−(n−1)} T_n(x_i) = (−1)^i E.

The function q(x) := p_*(x) − p̃(x) is a polynomial of degree at most n − 1 which, by the
assumption (5.17), is strictly positive at x_0, x_2, x_4, . . . and strictly negative at x_1, x_3, x_5, . . . .
Therefore, the polynomial q(x) changes sign at least n times and so, by the intermediate value
theorem, it has at least n roots. But this is impossible, because q is not identically zero and is of degree
at most n − 1.

Remark 5.1 (Derivation of Chebyshev polynomials). The polynomial p∗ achieving the lower
bound in (5.15) oscillates between the values −E and E, which are respectively its minimum
and maximum values over the interval [−1, 1]. It attains the values E or −E at n + 1 points
x_0, . . . , x_n, with x_0 = −1 and x_n = 1. It turns out that these properties, which can be shown
to hold a priori using Chebyshev's equioscillation theorem, are sufficient to derive an explicit
expression for the polynomial p∗, as we formally demonstrate hereafter.
The points x_1, . . . , x_{n−1} are local extrema of p∗, and so p∗'(x) = 0 at these nodes. We
therefore deduce that p∗ satisfies the differential equation

n²(E² − p∗(x)²) = p∗'(x)²(1 − x²).    (5.18)

Indeed, both sides are polynomials of degree 2n with simple roots at −1 and 1, double roots
at x_1, . . . , x_{n−1}, and the same leading coefficient. In order to solve (5.18),
we rearrange the equation and take the square root:

(p∗'(x)/E) / √(1 − p∗(x)²/E²) = ± n / √(1 − x²)   ⇔   (d/dx) arccos(p∗(x)/E) = ± n (d/dx) arccos(x).

Integrating both sides and taking the cosine, we obtain

p∗(x) = E cos(C + n arccos(x)).

Requiring that |p∗(−1)| = E, we deduce that we may take C = 0.

From Theorem 5.4, we deduce the following corollary.

Corollary 5.5 (Chebyshev nodes). Assume that x0 < x1 < . . . < xn are in the interval [a, b].


The supremum norm of the product ω(x) := (x − x_0) · · · (x − x_n) over [a, b] is minimized when

x_i = a + (b − a) (1 + cos((2i + 1)π / (2(n + 1)))) / 2,   i = 0, . . . , n.    (5.19)

Proof. We consider the affine change of variable

ζ : [−1, 1] → [a, b];   y ↦ (a + b + y(b − a)) / 2.

The function

p(y) := (2^(n+1) / (b − a)^(n+1)) ω(ζ(y)) = (2^(n+1) / (b − a)^(n+1)) (ζ(y) − x_0) · · · (ζ(y) − x_n) = (y − y_0) · · · (y − y_n),   y_i := ζ^(−1)(x_i),

is a monic polynomial of degree n + 1 such that

sup_{y ∈ [−1,1]} |p(y)| = (2^(n+1) / (b − a)^(n+1)) sup_{x ∈ [a,b]} |(x − x_0) . . . (x − x_n)|.    (5.20)

By Theorem 5.4, the left-hand side is minimized when p is equal to 2^(−n) T_{n+1}(y), i.e. when the
roots of p coincide with the roots of T_{n+1}. This occurs when

y_i = ζ^(−1)(x_i) = cos((2i + 1)π / (2(n + 1))).

Applying the change of variable x_i = ζ(y_i), we deduce the result.

Corollary 5.5 is useful for interpolation. The nodes

x_i = a + (b − a) (1 + cos((2i + 1)π / (2(n + 1)))) / 2,   i = 0, . . . , n,    (5.21)

are known as Chebyshev nodes and, more often than not, employing these nodes for interpolation
produces much better results than using equidistant nodes, both in the case where u is a
polynomial of degree n + 1, as we just proved, and for general u. As an example, we plot in
Section 5.1.5 the interpolation of the Runge function using Chebyshev nodes. In this case, the
interpolating polynomial converges uniformly to the Runge function as we increase the number
of interpolation nodes!
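
To illustrate the difference in practice, the following minimal Julia sketch (not part of the original notes) interpolates the Runge function at 21 equidistant and 21 Chebyshev nodes and estimates the maximum error on a fine grid; the helper lagrange_eval is a hypothetical implementation based on the Lagrange formula.

# A minimal sketch: equidistant versus Chebyshev nodes for the Runge function.
u(x) = 1 / (1 + 25x^2)

# Evaluate at x the polynomial interpolating the data (nodes, values)
function lagrange_eval(nodes, values, x)
    result = 0.0
    for i in eachindex(nodes)
        ℓ = 1.0
        for j in eachindex(nodes)
            j == i && continue
            ℓ *= (x - nodes[j]) / (nodes[i] - nodes[j])
        end
        result += values[i] * ℓ
    end
    return result
end

n = 20
equidistant = [-1 + 2i/n for i in 0:n]
chebyshev = [cos((2i + 1)π / (2(n + 1))) for i in 0:n]

# Estimate the maximum interpolation error on a fine grid
xfine = LinRange(-1, 1, 1000)
for nodes in (equidistant, chebyshev)
    error = maximum(abs(u(x) - lagrange_eval(nodes, u.(nodes), x)) for x in xfine)
    println("Maximum error: ", error)
end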

5.1.6 Hermite interpolation


Hermite interpolation, sometimes also called Hermite–Birkhoff interpolation, generalizes La-
grange interpolation to the case where, in addition to the function values u0 , . . . , un , the values
of some of the derivatives are given at the interpolation nodes. For simplicity, we assume in this


section that only the first derivative is specified. In this case, the aim of Hermite interpolation
is to find, given data (x_i, u_i, u_i') for i ∈ {0, . . . , n}, a polynomial û of degree at most 2n + 1 such
that

∀i ∈ {0, . . . , n},   û(x_i) = u_i,   û'(x_i) = u_i'.

In order to construct the interpolating polynomial, it is useful to define the functions

ϕ_i(x) = ∏_{j=0, j≠i}^{n} ((x − x_j) / (x_i − x_j))²,   i = 0, . . . , n.

The function ϕ_i is the square of the usual Lagrange polynomial associated with x_i, and it satisfies

ϕ_i(x_i) = 1,   ϕ_i'(x_i) = ∑_{j=0, j≠i}^{n} 2/(x_i − x_j),   ∀j ≠ i,  ϕ_i(x_j) = ϕ_i'(x_j) = 0.

We consider the following ansatz for û:

û(x) = ∑_{i=0}^{n} ϕ_i(x) q_i(x),

where the q_i are polynomials of degree at most one, to be determined, so that û is of degree at
most 2n + 1. We then require

û(x_i) = q_i(x_i),   û'(x_i) = ϕ_i'(x_i) q_i(x_i) + q_i'(x_i).

From the first equation, we deduce that q_i(x_i) = u_i, and from the second equation we then
have q_i'(x_i) = u_i' − ϕ_i'(x_i) u_i. We conclude that the interpolating polynomial is given by

û(x) = ∑_{i=0}^{n} ϕ_i(x) (u_i + (u_i' − ϕ_i'(x_i) u_i)(x − x_i)).


The following theorem gives an expression of the error.

Theorem 5.6 (Hermite interpolation error). Assume that u : [a, b] → R is a function in
C^(2n+2)([a, b]) and let û denote the Hermite interpolant of u at the nodes x_0, . . . , x_n. Then
for all x ∈ [a, b], there exists ξ = ξ(x) in the interval [a, b] such that

u(x) − û(x) = (u^(2n+2)(ξ) / (2n + 2)!) (x − x_0)² . . . (x − x_n)².

Proof. See Exercise 5.8.
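
The construction above is easy to implement. The following minimal Julia sketch (not part of the original notes) assembles the Hermite interpolant for u(x) = sin(x), with derivative cos(x), at three hypothetical nodes and evaluates it between the nodes.

# A minimal sketch of the Hermite interpolant constructed above.
nodes = [0.0, 0.5, 1.0]
ui  = sin.(nodes)       # function values
dui = cos.(nodes)       # derivative values

# Square of the Lagrange polynomial associated with node i
ϕ(i, x) = prod(((x - nodes[j]) / (nodes[i] - nodes[j]))^2
               for j in eachindex(nodes) if j != i)

# Derivative of ϕ_i at its own node: Σ_{j≠i} 2 / (x_i - x_j)
dϕ(i) = sum(2 / (nodes[i] - nodes[j]) for j in eachindex(nodes) if j != i)

hermite(x) = sum(ϕ(i, x) * (ui[i] + (dui[i] - dϕ(i) * ui[i]) * (x - nodes[i]))
                 for i in eachindex(nodes))

# The interpolant should match sin closely between the nodes
println(hermite(0.25), " ≈ ", sin(0.25))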

5.1.7 Piecewise interpolation


The interpolation methods we discussed so far are in some sense global; they aim to construct
one polynomial that goes through all the data points. This approach is attractive because the
interpolant is infinitely smooth but, as we showed, it is not always fruitful, in particular when
equidistant interpolation nodes are employed. An alternative approach is to divide the domain
in a number of small intervals, and to perform polynomial interpolation within each interval.
Although the resulting interpolating function is usually not smooth over the full domain, this
“local” approach to interpolation is in general more robust.
Several methods belong in the category of piecewise interpolation. We mention, for instance,
piecewise Lagrange interpolation and cubic splines interpolation. In this section, we briefly
describe the former method, which is widely used in the context of the finite element method.
Information on the latter method is available in [9, Section 8.7.1.].
For simplicity, we illustrate the method in dimension 1, but piecewise Lagrange interpo-
lation can be extended to several dimensions. Assume that we wish to approximate a func-
tion u : [a, b] → R. We consider a subdivision a = x0 < x1 < . . . < xn = b of the interval [a, b]
and let h denote the maximum spacing:

h = max_{i ∈ {0, . . . , n−1}} |x_{i+1} − x_i|.

Within each subinterval I_i = [x_i, x_{i+1}], we consider a further subdivision

x_i = x_i^(0) < x_i^(1) < . . . < x_i^(m) = x_{i+1},

where the nodes x_i^(0), . . . , x_i^(m) are equally spaced with distance h/m. The idea of piecewise
Lagrange interpolation is to calculate, for each interval I_i in the partition, the interpolating
polynomial p_i at the nodes x_i^(0), . . . , x_i^(m). The interpolant is then defined as

û(x) = p_ι(x),    (5.22)

where ι = ι(x) is the index of the interval to which x belongs. Since x_i^(m) = x_{i+1}^(0) = x_{i+1},
the interpolant defined by (5.22) is continuous. If the function u belongs to C^(m+1)([a, b]), then


by Corollary 5.3 the interpolation error within each subinterval may be bounded from above as
follows:

sup_{x ∈ I_i} |u(x) − û(x)| ≤ C_{m+1} (h/m)^(m+1) / (4(m + 1)),   C_{m+1} := sup_{x ∈ [a,b]} |u^(m+1)(x)|,    (5.23)

and so we deduce

sup_{x ∈ [a,b]} |u(x) − û(x)| ≤ C h^(m+1),

for an appropriate constant C independent of h. This bound shows that the error is guaranteed
to decrease to 0 in the limit as h → 0. In practice, the number m of interpolation nodes
within each interval can be small.
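
As an illustration, the following minimal Julia sketch (not part of the original notes) implements piecewise linear interpolation (m = 1) for u(x) = sin(x) on [0, π] and checks that the maximum error decreases roughly like h², in line with the bound above; the helper interp is hypothetical.

# A minimal sketch of piecewise linear interpolation and its O(h^2) error.
u(x) = sin(x)
a, b = 0.0, π

for n in (10, 20, 40, 80)
    x = LinRange(a, b, n + 1)
    h = x[2] - x[1]
    # Piecewise linear interpolant evaluated at a point t
    function interp(t)
        i = min(searchsortedlast(x, t), n)   # index of the subinterval containing t
        λ = (t - x[i]) / h
        return (1 - λ) * u(x[i]) + λ * u(x[i+1])
    end
    err = maximum(abs(u(t) - interp(t)) for t in LinRange(a, b, 1000))
    println("h = $h, error = $err, error/h^2 = $(err / h^2)")
end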

5.2 Approximation
In this section, we focus on the subject of approximation, both of discrete data points and of
continuous functions. We begin, in Section 5.2.1 with a discussion of least squares approximation
for data points, and in Section 5.2.2 we focus on function approximation in the mean square
sense.

5.2.1 Least squares approximation of data points


Consider n + 1 distinct x values x0 < . . . < xn , and suppose that we know the values u0 , . . . , un
taken by an unknown function u when evaluated at these points. We wish to approximate the
function u by a function of the form

û(x) = ∑_{i=0}^{m} α_i ϕ_i(x) ∈ span{ϕ_0, . . . , ϕ_m},    (5.24)

for some m < n. In many cases of practical interest, the basis functions ϕ_0, . . . , ϕ_m are polynomials.
In contrast with interpolation, here we seek a function û in a finite-dimensional function
space of dimension m + 1, strictly lower than the number n + 1 of data points. In order for û to be a good
approximation, we wish to find coefficients α_0, . . . , α_m such that the following linear system is
approximately satisfied:

        [ ϕ_0(x_0)   ϕ_1(x_0)   . . .  ϕ_m(x_0)  ]  [ α_0 ]     [ u_0 ]
        [ ϕ_0(x_1)   ϕ_1(x_1)   . . .  ϕ_m(x_1)  ]  [ α_1 ]     [ u_1 ]
Aα :=   [ ϕ_0(x_2)   ϕ_1(x_2)   . . .  ϕ_m(x_2)  ]  [  ⋮  ]  ≈  [ u_2 ]  =: b.
        [    ⋮           ⋮                ⋮      ]  [ α_m ]     [  ⋮  ]
        [ ϕ_0(x_n)   ϕ_1(x_n)   . . .  ϕ_m(x_n)  ]              [ u_n ]

In general, since the matrix on the left-hand side has more rows than columns, there does not
exist an exact solution to this equation. In order to find an approximate solution, a natural
approach is to find coefficients α_0, . . . , α_m such that the residual vector r = Aα − b is small in


some vector norm. A particularly popular approach, known as least squares approximation, is
to minimize the Euclidean norm of r or, equivalently, the square of the Euclidean norm:

‖r‖² = ∑_{i=0}^{n} |u_i − û(x_i)|² = ∑_{i=0}^{n} (u_i − ∑_{j=0}^{m} α_j ϕ_j(x_i))².

Let us denote the right-hand side of this equation by J(α), which we view as a function of the
vector of coefficients α. A necessary condition for α∗ to be a minimizer is that ∇J(α∗) = 0.
The gradient of J, written as a column vector, is given by

∇J(α) = ∇((Aα − b)ᵀ(Aα − b))
       = ∇(αᵀ(AᵀA)α − bᵀAα − αᵀAᵀb + bᵀb)
       = 2(AᵀA)α − 2Aᵀb.

We deduce that α∗ solves the linear system

AᵀAα∗ = Aᵀb,    (5.25)

where the matrix on the left-hand side is given by:

        [ ∑_{i=0}^{n} ϕ_0(x_i)ϕ_0(x_i)   ∑_{i=0}^{n} ϕ_0(x_i)ϕ_1(x_i)   . . .   ∑_{i=0}^{n} ϕ_0(x_i)ϕ_m(x_i) ]
AᵀA :=  [ ∑_{i=0}^{n} ϕ_1(x_i)ϕ_0(x_i)   ∑_{i=0}^{n} ϕ_1(x_i)ϕ_1(x_i)   . . .   ∑_{i=0}^{n} ϕ_1(x_i)ϕ_m(x_i) ]
        [            ⋮                               ⋮                                  ⋮                    ]
        [ ∑_{i=0}^{n} ϕ_m(x_i)ϕ_0(x_i)   ∑_{i=0}^{n} ϕ_m(x_i)ϕ_1(x_i)   . . .   ∑_{i=0}^{n} ϕ_m(x_i)ϕ_m(x_i) ]

Equation (5.25) is a system of m + 1 equations with m + 1 unknowns, which admits a unique solution
provided that AᵀA is full rank or, equivalently, the columns of A are linearly independent. The
linear equations (5.25) are known as the normal equations. As a side note, we mention that
the solution α∗ = (AᵀA)⁻¹Aᵀb coincides with the maximum likelihood estimator for α under
the assumption that the data is generated according to u_i = u(x_i) + ε_i, for some function
u ∈ span{ϕ_0, . . . , ϕ_m} and independent random noise ε_i ∼ N(0, 1).

Remark 5.2. From equation (5.25) we deduce that

Aα∗ = A(AᵀA)⁻¹Aᵀb.

The matrix Π_A := A(AᵀA)⁻¹Aᵀ on the right-hand side is the orthogonal projection operator
onto col(A) ⊂ R^(n+1), the subspace spanned by the columns of A. Indeed, it holds that Π_A² = Π_A,
which is the defining property of a projection operator.

To conclude this section, we note that the matrix A⁺ = (AᵀA)⁻¹Aᵀ is a left inverse of the
matrix A, because A⁺A = I. It is also called the Moore–Penrose inverse or pseudoinverse of the
matrix A, which generalizes the usual inverse matrix. In Julia, the backslash operator silently
uses the Moore–Penrose inverse when employed with a rectangular matrix. Therefore, solving


the normal equations (5.25) can be achieved by just writing α = A\b.
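
As a minimal sketch (not part of the original notes), the following Julia code fits noisy data by a cubic polynomial in the monomial basis, assembling the matrix A as above and solving the least squares problem with the backslash operator.

# A minimal sketch: least squares fit of noisy data by a degree-m polynomial.
n, m = 20, 3
x = LinRange(0, 1, n + 1)
u = exp.(x) .+ 0.01 .* randn(n + 1)      # noisy evaluations of an unknown function

# A[i+1, j+1] = ϕ_j(x_i) with the monomial basis ϕ_j(x) = x^j
A = [xi^j for xi in x, j in 0:m]

α = A \ u                                # least squares solution of Aα ≈ u
û(t) = sum(α[j+1] * t^j for j in 0:m)
println("Coefficients: ", α)
println("Fit at x = 0.5: ", û(0.5), " (data-generating value ≈ ", exp(0.5), ")")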

5.2.2 Mean square approximation of functions


The approach described in Section 5.2.1 can be generalized to the setting where the actual
function u, rather than just discrete evaluations of it, is available. In this section, we seek an
approximation of the form (5.24) such that the error û(x) − u(x), measured in some function
norm, is minimized. Of course, the solution to this minimization problem depends in general on
the norm employed, and may in some cases not even be unique. Instead of specifying a particular
norm, as done in Section 5.2.1, in this section we retain some generality and assume only that
the norm is induced by an inner product on the space of real-valued continuous functions:

h•, •i : C([a, b]) × C([a, b]) → R. (5.26)

In other words, we seek to minimize

J(α) := ‖û − u‖² = ⟨û − u, û − u⟩.

This is again a function of the m + 1 variables α_0, . . . , α_m. Before calculating its gradient, we
rewrite the function J(α) in a simpler form:

J(α) = ⟨u − ∑_{j=0}^{m} α_j ϕ_j, u − ∑_{k=0}^{m} α_k ϕ_k⟩
     = ∑_{j=0}^{m} ∑_{k=0}^{m} α_j α_k ⟨ϕ_j, ϕ_k⟩ − 2 ∑_{j=0}^{m} α_j ⟨u, ϕ_j⟩ + ⟨u, u⟩ = αᵀGα − 2bᵀα + ⟨u, u⟩,

where we introduced

     [ ⟨ϕ_0, ϕ_0⟩  ⟨ϕ_0, ϕ_1⟩  . . .  ⟨ϕ_0, ϕ_m⟩ ]          [ ⟨u, ϕ_0⟩ ]
G := [ ⟨ϕ_1, ϕ_0⟩  ⟨ϕ_1, ϕ_1⟩  . . .  ⟨ϕ_1, ϕ_m⟩ ] ,   b := [ ⟨u, ϕ_1⟩ ] .    (5.27)
     [     ⋮           ⋮                  ⋮      ]          [     ⋮    ]
     [ ⟨ϕ_m, ϕ_0⟩  ⟨ϕ_m, ϕ_1⟩  . . .  ⟨ϕ_m, ϕ_m⟩ ]          [ ⟨u, ϕ_m⟩ ]

Employing the same approach as in the previous section, we then obtain ∇J(α) = 2(Gα − b), and
so the minimizer of J(α) is the solution to the equation

Gα∗ = b.    (5.28)

The matrix G, known as the Gram matrix, is positive semi-definite and nonsingular provided
that the basis functions are linearly independent, see Exercise 5.9. Therefore, the solution α∗
exists and is unique. In addition, since the Hessian of J is equal to 2G, which is positive semi-definite,
the vector α∗ is indeed a minimizer. Note that if ⟨•, •⟩ is defined as a finite sum of the form

⟨f, g⟩ = ∑_{i=0}^{n} f(x_i) g(x_i),    (5.29)


then (5.28) coincides with the normal equations (5.25) from the previous section. We remark
that (5.29) is in fact not an inner product on the space of continuous functions, but it is an
inner product on the space of polynomials of degree less than or equal to n.
In practice, the matrix and right-hand side of the linear system (5.28) can usually not be
calculated exactly, because the inner product h•, •i is defined through an integral; see (5.30) in
the next section.
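
As an illustration, the following minimal Julia sketch (not part of the original notes) computes the mean square approximation of u(x) = cos(x) by a polynomial of degree 4 on [0, 1] with w(x) = 1; for the monomial basis the Gram matrix is the Hilbert matrix, and the right-hand side is evaluated with numerical quadrature, assuming the QuadGK package is available.

# A minimal sketch of mean square approximation with the monomial basis on [0, 1].
import QuadGK

u(x) = cos(x)
m = 4

# Gram matrix for ϕ_j(x) = x^j: G[j+1, k+1] = ∫_0^1 x^(j+k) dx = 1/(j + k + 1)
G = [1 / (j + k + 1) for j in 0:m, k in 0:m]
# Right-hand side b[j+1] = ⟨u, ϕ_j⟩, computed numerically
b = [QuadGK.quadgk(x -> u(x) * x^j, 0, 1)[1] for j in 0:m]

α = G \ b
û(x) = sum(α[j+1] * x^j for j in 0:m)
println("Approximation at 0.3: ", û(0.3), " vs ", u(0.3))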

Remark 5.3. Rewriting the normal equations (5.28) in terms of û, we obtain

⟨û − u, ϕ_0⟩ = 0,   . . . ,   ⟨û − u, ϕ_m⟩ = 0.

Therefore, the optimal approximation û ∈ span{ϕ_0, . . . , ϕ_m} satisfies

∀v ∈ span{ϕ_0, . . . , ϕ_m},   ⟨û − u, v⟩ = 0.

This shows that the optimal approximation û, in the sense of the norm ‖•‖, is the orthogonal
projection of u onto span{ϕ_0, . . . , ϕ_m}.

5.2.3 Orthogonal polynomials


The Gram matrix G in (5.28) is equal to the identity matrix when the basis functions are
orthonormal for the inner product considered. In this case, the solution to the linear system is

α_i = ⟨u, ϕ_i⟩,   i = 0, . . . , m,

and so the best approximation û (for the norm induced by the inner product considered!) is
simply given by

û = ∑_{i=0}^{m} ⟨u, ϕ_i⟩ ϕ_i.

The coefficients hu, ϕi i of the basis functions in this expansion are called Fourier coefficients.
Given a finite dimensional subspace S of the space of continuous functions, an orthonormal basis
can be constructed via the Gram–Schmidt process. In this section, we focus on the particular
case where S = P(n) – the subspace of polynomials of degree less than or equal to n. Another
widely used approach, which we do not explore in this course, is to use trigonometric basis
functions. We also assume that the inner product (5.26) is of the form

⟨f, g⟩ = ∫_a^b f(x) g(x) w(x) dx,    (5.30)

where w(x) is a given nonnegative weight function such that

∫_a^b w(x) dx > 0.

Let ϕ0 (x), ϕ1 (x), ϕ2 (x) . . . denote the orthonormal polynomials obtained by applying the Gram–
Schmidt procedure to the monomials 1, x, x2 , . . . . These depend in general on the weight w(x)


and on the interval [a, b]. A few of the popular classes of orthogonal polynomials are presented
in the table below:

    Name        w(x)            [a, b]
    Legendre    1               [−1, 1]
    Chebyshev   1/√(1 − x²)     (−1, 1)
    Hermite     exp(−x²/2)      (−∞, ∞)
    Laguerre    e^(−x)          [0, ∞)

Orthogonal polynomials have a rich structure, and in the rest of this section we prove
some of their common properties, one of which will be very useful in the context of numerical
integration in Chapter 6. We begin by showing that orthogonal polynomials have distinct real
roots.

Proposition 5.7. Assume for simplicity that w(x) > 0 for all x ∈ [a, b], and let ϕ0 , ϕ1 , . . .
denote orthonormal polynomials of increasing degree for the inner product (5.30). Then for
all n ∈ N, the polynomial ϕn has n distinct roots in the open interval (a, b).

Proof. Reasoning by contradiction, we assume that ϕ_n changes sign at only k < n points of the
open interval (a, b), which we denote by x_1, . . . , x_k. Then

ϕ_n(x) × (x − x_1) . . . (x − x_k)

is either everywhere nonnegative or everywhere nonpositive over [a, b]. But then

∫_a^b ϕ_n(x) (x − x_1) . . . (x − x_k) w(x) dx

is nonzero, which is a contradiction because the product (x − x_1) . . . (x − x_k) is a polynomial of
degree k < n, which is orthogonal to ϕ_n. Indeed, being orthogonal to ϕ_0, . . . , ϕ_{n−1},
the polynomial ϕ_n is also orthogonal to any linear combination of these polynomials.

Next, we show that orthogonal polynomials satisfy a three-term recurrence relation.

Proposition 5.8. Assume that ϕ0 , ϕ1 , . . . are orthonormal polynomials for some inner prod-
uct of the form (5.30) such that ϕi is of degree i. Then

∀n ∈ {1, 2, . . . }, αn+1 ϕn+1 (x) = (x − βn )ϕn (x) − αn ϕn−1 (x), (5.31)

where
αn = hxϕn , ϕn−1 i, βn = hxϕn , ϕn i.

In addition, α1 ϕ1 (x) = (x − β0 )ϕ0 (x).


Proof. Since xϕ_n(x) is a polynomial of degree n + 1, it may be decomposed as

xϕ_n(x) = ∑_{i=0}^{n+1} γ_{n,i} ϕ_i(x).    (5.32)

Taking the inner product of both sides of this equation with ϕ_i and employing the orthonormality
assumption, we obtain an expression for the coefficients:

γ_{n,i} = ⟨xϕ_n, ϕ_i⟩.

From the expression (5.30) of the inner product, it is clear that ⟨xϕ_n, ϕ_i⟩ = ⟨ϕ_n, xϕ_i⟩. Since xϕ_i
is a polynomial of degree i + 1 and ϕ_n is orthogonal to all polynomials of degree strictly less
than n, we deduce that γ_{n,i} = 0 if i < n − 1. Consequently, we can rewrite the right-hand side
of (5.32) as a sum involving only three terms:

xϕ_n(x) = ⟨xϕ_n, ϕ_{n−1}⟩ ϕ_{n−1}(x) + ⟨xϕ_n, ϕ_n⟩ ϕ_n(x) + ⟨xϕ_n, ϕ_{n+1}⟩ ϕ_{n+1}(x).    (5.33)

Since ⟨xϕ_n, ϕ_{n+1}⟩ = ⟨xϕ_{n+1}, ϕ_n⟩, we obtain the statement after rearranging.

Remark 5.4. Notice that the polynomials in Proposition 5.8 are orthonormal by assumption, and so the
coefficient α_{n+1} is just a normalization constant. We deduce that

ϕ_{n+1}(x) = ((x − β_n)ϕ_n(x) − α_n ϕ_{n−1}(x)) / ‖(x − β_n)ϕ_n − α_n ϕ_{n−1}‖,

which makes it possible to calculate the orthogonal polynomials recursively.
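
As an illustration of the Gram–Schmidt construction, the following minimal Julia sketch (not part of the original notes) builds orthonormal polynomials on [−1, 1] with w(x) = 1 (normalized Legendre polynomials), representing each polynomial by its coefficient vector and using the exact moments of the monomials; the helpers moment, inner and orthonormal_basis are hypothetical.

# A minimal sketch: orthonormal polynomials on [-1, 1] for w(x) = 1.
# A polynomial is stored as its coefficient vector c, with p(x) = c[1] + c[2] x + ...

# Exact integral of x^k over [-1, 1]
moment(k) = iseven(k) ? 2 / (k + 1) : 0.0

# Inner product of two coefficient vectors for ∫ p(x) q(x) dx on [-1, 1]
function inner(p, q)
    s = 0.0
    for i in eachindex(p), j in eachindex(q)
        s += p[i] * q[j] * moment(i + j - 2)
    end
    return s
end

function orthonormal_basis(n)
    basis = Vector{Vector{Float64}}()
    for k in 0:n
        p = zeros(n + 1); p[k+1] = 1.0       # start from the monomial x^k
        for q in basis                        # subtract projections onto previous polynomials
            p -= inner(p, q) * q
        end
        push!(basis, p / sqrt(inner(p, p)))   # normalize
    end
    return basis
end

# The third polynomial should be proportional to (3x^2 - 1)/2, normalized so that ∫ ϕ² dx = 1
println(orthonormal_basis(2)[3])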

5.2.4 Orthogonal polynomials and numerical integration: an introduction �

Equation (5.33) may be rewritten in matrix form as follows:

[ xϕ_0(x)    ]   [ β_0  α_1                           ] [ ϕ_0(x)    ]   [ 0                    ]
[ xϕ_1(x)    ]   [ α_1  β_1  α_2                      ] [ ϕ_1(x)    ]   [ 0                    ]
[ xϕ_2(x)    ] = [      α_2  β_2  α_3                 ] [ ϕ_2(x)    ] + [ 0                    ]
[     ⋮      ]   [           ⋱    ⋱     ⋱            ] [     ⋮     ]   [ ⋮                    ]
[ xϕ_{m−1}(x)]   [           α_{m−1}  β_{m−1}  α_m    ] [ ϕ_{m−1}(x)]   [ 0                    ]
[ xϕ_m(x)    ]   [                    α_m      β_m    ] [ ϕ_m(x)    ]   [ α_{m+1} ϕ_{m+1}(x)   ]

Let T denote the tridiagonal matrix in this equation, and let r_0, . . . , r_m denote the
roots of ϕ_{m+1}. By Proposition 5.7, these are distinct and all belong to the interval (a, b). But


now notice that, for every r ∈ {r_0, . . . , r_m},

[ rϕ_0(r)    ]   [ β_0  α_1                           ] [ ϕ_0(r)    ]
[ rϕ_1(r)    ]   [ α_1  β_1  α_2                      ] [ ϕ_1(r)    ]
[     ⋮      ] = [      ⋱    ⋱     ⋱                 ] [     ⋮     ]
[ rϕ_{m−1}(r)]   [           α_{m−1}  β_{m−1}  α_m    ] [ ϕ_{m−1}(r)]
[ rϕ_m(r)    ]   [                    α_m      β_m    ] [ ϕ_m(r)    ]
In other words, for any root r of ϕ_{m+1}, the vector (ϕ_0(r), . . . , ϕ_m(r))ᵀ is an eigenvector of the
matrix T, with associated eigenvalue equal to r. Since T is a symmetric matrix, eigenvectors
associated with distinct eigenvalues are orthogonal for the Euclidean inner product of R^(m+1), so
given that the eigenvalues of T are distinct, we deduce that

∀i ≠ j,   ∑_{k=0}^{m} ϕ_k(r_i) ϕ_k(r_j) = 0.    (5.34)

Let us construct the matrix

    [ ϕ_0(r_0)  ϕ_1(r_0)  . . .  ϕ_m(r_0) ]
    [ ϕ_0(r_1)  ϕ_1(r_1)  . . .  ϕ_m(r_1) ]
P = [ ϕ_0(r_2)  ϕ_1(r_2)  . . .  ϕ_m(r_2) ] .
    [    ⋮          ⋮                ⋮    ]
    [ ϕ_0(r_m)  ϕ_1(r_m)  . . .  ϕ_m(r_m) ]

Equation (5.34) indicates that the rows of P are orthogonal, and so the matrix D = PPᵀ is
diagonal with entries given by

d_ii = ∑_{j=0}^{m} |ϕ_j(r_i)|²,   i = 0, . . . , m.

(Here we start counting the rows from 0 for convenience.) Since PPᵀD⁻¹ = I, we deduce that
the inverse of P is given by P⁻¹ = PᵀD⁻¹. Consequently,

PᵀD⁻¹P = P⁻¹P = I,

which means that the polynomials ϕ_0, . . . , ϕ_m are orthonormal for the inner product

⟨•, •⟩_{m+1} : P(m) × P(m) → R;   (p, q) ↦ ∑_{i=0}^{m} p(r_i) q(r_i) / d_ii.

We have thus shown that, if ϕ_0, ϕ_1, . . . , ϕ_m are a family of orthonormal polynomials for an inner
product ⟨•, •⟩, then they are also orthonormal for the inner product ⟨•, •⟩_{m+1}. Consequently,
it holds that

∀(p, q) ∈ P(m) × P(m),   ⟨p, q⟩ = ⟨p, q⟩_{m+1}.    (5.35)


Indeed, denoting by p = α_0 ϕ_0 + · · · + α_m ϕ_m and q = β_0 ϕ_0 + · · · + β_m ϕ_m the expansions of the
polynomials p and q in the orthonormal basis, we have

⟨p, q⟩ = ⟨α_0 ϕ_0 + · · · + α_m ϕ_m, β_0 ϕ_0 + · · · + β_m ϕ_m⟩
       = ∑_{i=0}^{m} ∑_{j=0}^{m} α_i β_j ⟨ϕ_i, ϕ_j⟩ = α_0 β_0 + · · · + α_m β_m = ∑_{i=0}^{m} ∑_{j=0}^{m} α_i β_j ⟨ϕ_i, ϕ_j⟩_{m+1}
       = ⟨α_0 ϕ_0 + · · · + α_m ϕ_m, β_0 ϕ_0 + · · · + β_m ϕ_m⟩_{m+1} = ⟨p, q⟩_{m+1}.

We have thus shown the following result. Since m was arbitrary in the previous reasoning, we
add a superscript to indicate when the quantities involved depend on m.

Theorem 5.9. Orthonormal polynomials ϕ_0, . . . , ϕ_m for the inner product

⟨f, g⟩ = ∫_a^b f(x) g(x) w(x) dx

are also orthonormal for the inner product

⟨f, g⟩_{m+1} = ∑_{i=0}^{m} f(r_i^(m+1)) g(r_i^(m+1)) w_i^(m+1),

where r_0^(m+1), . . . , r_m^(m+1) are the roots of ϕ_{m+1} and the weights w_i^(m+1) are given by

w_i^(m+1) = 1 / ∑_{j=0}^{m} (ϕ_j(r_i^(m+1)))²,   i = 0, . . . , m.

Taking q = 1 in (5.35) and employing the definitions of ⟨•, •⟩ and ⟨•, •⟩_{m+1}, we have

∀p ∈ P(m),   ∫_a^b p(x) w(x) dx = ∑_{i=0}^{m} p(r_i^(m+1)) w_i^(m+1).    (5.36)

Since the left-hand side is an integral and the right-hand side is a sum, we have just constructed
an integration formula, which enjoys a very nice property: it is exact for polynomials of degree up
to m! A formula of this type is called a quadrature formula, with m + 1 nodes r_0^(m+1), . . . , r_m^(m+1)
and associated weights w_0^(m+1), . . . , w_m^(m+1). Note that the nodes and weights of the quadrature
depend on the weight w(x) and on the degree m.
In fact, equation (5.36) is valid more generally than for p ∈ P(m). Indeed, for every
monomial x^p with m ≤ p ≤ 2m, we have that

∫_a^b x^p w(x) dx = ∫_a^b x^(p−m) x^m w(x) dx = ⟨x^(p−m), x^m⟩ = ⟨x^(p−m), x^m⟩_{m+1} = ⟨x^p, 1⟩_{m+1} = ∑_{i=0}^{m} (r_i^(m+1))^p w_i^(m+1).


Consequently, the formula (5.36) is exact for any monomial of degree up to 2m, and we conclude
that, by linearity, it is also valid for any polynomial of degree up to 2m.

Remark 5.5. In fact, the formula (5.36) is valid for polynomials of degree up to 2m + 1.
Indeed, using that x^(2m+1) = ϕ_{m+1}(x)p(x) + q(x), for some polynomial p of degree m (the
quotient of the polynomial division of x^(2m+1) by ϕ_{m+1}) and some polynomial q of degree at
most m (the remainder of the polynomial division), we have

∫_a^b x^(2m+1) w(x) dx = ∫_a^b ϕ_{m+1}(x)p(x)w(x) dx + ∫_a^b q(x)w(x) dx = 0 + ∫_a^b q(x)w(x) dx
    = ∑_{i=0}^{m} ϕ_{m+1}(r_i^(m+1)) p(r_i^(m+1)) w_i^(m+1) + ∑_{i=0}^{m} q(r_i^(m+1)) w_i^(m+1)
    = ∑_{i=0}^{m} (r_i^(m+1))^(2m+1) w_i^(m+1),

where we used, in the penultimate equality, the fact that r_0^(m+1), . . . , r_m^(m+1) are the roots
of the polynomial ϕ_{m+1}.

We will revisit this subject in Chapter 6.
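
A minimal Julia sketch of this construction for the Legendre weight w(x) = 1 on [−1, 1] is given below (it is not part of the original notes). The nodes are obtained as the eigenvalues of the symmetric tridiagonal matrix T, and the weights follow from the first components of the eigenvectors; the recurrence coefficients α_n = n/√(4n² − 1), β_n = 0 are a standard fact for the orthonormal Legendre polynomials and are assumed here without proof.

# A minimal sketch: Gauss-Legendre nodes and weights from the matrix T above.
using LinearAlgebra

m = 4                                     # m + 1 quadrature nodes
α = [n / sqrt(4n^2 - 1) for n in 1:m]
T = SymTridiagonal(zeros(m + 1), α)

λ, V = eigen(T)
nodes = λ
# w_i = 1 / Σ_j ϕ_j(r_i)^2 = 2 * (first component of the normalized eigenvector)^2
weights = 2 .* V[1, :].^2

# The rule should integrate monomials up to degree 2m + 1 = 9 exactly
for p in 0:9
    exact = iseven(p) ? 2 / (p + 1) : 0.0
    approx = sum(weights .* nodes.^p)
    println("degree $p: exact = $exact, quadrature = $approx")
end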

5.3 Exercises
� Exercise 5.1. Find the polynomial p(x) = ax + b (a straight line) that goes through the
points (x0 , u0 ) and (x1 , u1 ).

� Exercise 5.2. Find the polynomial p(x) = ax2 + bx + c (a parabola) that goes through the
points (0, 1), (1, 3) and (2, 7).

� Exercise 5.3. Prove the following recurrence relation for Chebyshev polynomials:

Ti+1 (x) = 2xTi (x) − Ti−1 (x), i = 1, 2, . . . .

� Exercise 5.4. Show by recursion that

[u_0, u_1, . . . , u_n] = ∑_{j=0}^{n} u_j / ∏_{k ∈ {0,...,n}\{j}} (x_j − x_k).

Deduce from this identity that

[u_0, u_1, . . . , u_n] = [u_{σ(0)}, u_{σ(1)}, . . . , u_{σ(n)}],

for any permutation σ of (0, 1, 2, . . . , n).

� Exercise 5.5. Using the Gregory–Newton formula, find an expression for

∑_{i=1}^{n} i⁴.

121
Chapter 5. Interpolation and approximation

� Exercise 5.6. Let (f0 , f1 , f2 , . . . ) = (1, 1, 2, . . . ) denote the Fibonacci sequence. Prove that
there does not exist a polynomial p such that

∀i ∈ N, fi = p(i).

� Exercise 5.7. Using the Gregory–Newton formula, show that

∀n ∈ N_{>0},   2^n = 1 + n + n(n − 1)/2! + n(n − 1)(n − 2)/3! + n(n − 1)(n − 2)(n − 3)/4! + · · ·    (5.37)

Remark 5.6. Remarkably, equation (5.37) holds in fact for any n ∈ R>0 . However, showing
this more general statement is beyond the scope of this course.

� Exercise 5.8. Prove Theorem 5.6.


� Exercise 5.9. Show that the matrix G in (5.27) is positive definite if the basis func-
tions ϕ0 , . . . , ϕm are linearly independent.
� Exercise 5.10. Write a Julia code for interpolating the following function using a polynomial
of degree 20 over the interval [−1, 1]:

f(x) = tanh((x + 1/2)/ε) + tanh(x/ε) + tanh((x − 1/2)/ε),   ε = 0.01.

Use equidistant and then Chebyshev nodes, and compare the two approaches in terms of accuracy.
Plot the function f together with the approximating polynomials.
� Exercise 5.11. We wish to use interpolation to approximate the following parametric function,
called an epitrochoid:

x(θ) = (R + r) cos θ + d cos(((R + r)/r) θ),    (5.38)
y(θ) = (R + r) sin θ − d sin(((R + r)/r) θ),    (5.39)

with R = 5, r = 2 and d = 3, and for θ ∈ [0, 4π]. Write a Julia program to interpolate x(θ)
and y(θ) using 40 equidistant points. Use the BigFloat format in order to reduce the impact
of round-off errors. After constructing the polynomial interpolations x̂(θ) and ŷ(θ), plot the
parametric curve θ ↦ (x̂(θ), ŷ(θ)). Your plot should look similar to Figure 5.4.


� Exercise 5.12 (Modeling the vapor pressure of mercury). The dataset loaded through the
following Julia commands contains data on the vapor pressure of mercury as a function of the
temperature.
import RDatasets
data = RDatasets.dataset("datasets", "pressure")
Find a low-dimensional mathematical model of the form

p(T) = exp(α_0 + α_1 T + α_2 T² + α_3 T³)    (5.40)





Figure 5.4: Solution for Exercise 5.11.

for the pressure as a function of the temperature. Plot the approximation together with the data.
An example solution is given in Figure 5.5.

Figure 5.5: Solution for Exercise 5.12.

5.4 Discussion and bibliography


A comprehensive study of approximation theory would require covering the L∞ setting as well
as other functional settings. A pillar of L∞ approximation theory is Chebyshev's equioscillation
theorem, which we alluded to in Remark 5.1. An excellent introductory reference on
approximation theory is [7] (in French). See also [9, Chapter 10] and the references therein.

Chapter 6

Numerical integration

6.1 The Newton–Cotes method
6.2 Composite methods with equidistant nodes
6.3 Richardson extrapolation and Romberg's method
6.4 Methods with non-equidistant nodes
6.5 Exercises
6.6 Discussion and bibliography

Introduction
Integrals are ubiquitous in science and mathematics. In this chapter, we are concerned with the
problem of calculating numerically integrals of the form

I = ∫_Ω u(x) dx,    (6.1)

where Ω ⊂ Rⁿ. Perhaps somewhat surprisingly, the numerical calculation of such integrals when n ≫ 1 is
still a very active area of research today. In this chapter, we will focus for simplicity on the
one-dimensional setting where Ω = [a, b] ⊂ R. We assume throughout this chapter that the
function u is Riemann-integrable. This means that

I = lim_{h→0} ∑_{i=0}^{n−1} u(t_i)(z_{i+1} − z_i),

where a = z_0 < · · · < z_n = b is a partition of the interval [a, b] such that the maximum spacing
between successive nodes is equal to h, and with t_i ∈ [z_i, z_{i+1}] for all i ∈ {0, . . . , n − 1}.
All the numerical integration formulas that we present in this chapter are based on a deterministic
approximation of the form

Î = ∑_{i=0}^{n} w_i u(x_i),    (6.2)


where x0 < . . . < xn are the integration nodes and w0 , . . . , wn are the integration weights. In
many cases, integration formulas contain a small parameter that can be changed to improve
the accuracy of the approximation. In methods based on equidistant interpolation nodes, for
example, this parameter encodes the distance between nodes and is typically denoted by h, and
we often use the notation Î_h to emphasize the dependence of the approximation on h. The
difference E_h = I − Î_h is called the integration error or discretization error, and the degree of
precision of an integration method is the largest integer d such that the integration
error is zero for all polynomials of degree less than or equal to d.
We observe that, without loss of generality, we can consider that the integration interval is
equal to [−1, 1]. Indeed, using the change of variable

ζ : [−1, 1] → [a, b];   y ↦ (b + a)/2 + ((b − a)/2) y,    (6.3)

we have

∫_a^b u(x) dx = ∫_{−1}^{1} u(ζ(y)) ζ'(y) dy = ((b − a)/2) ∫_{−1}^{1} u ∘ ζ(y) dy,    (6.4)

and the right-hand side is, up to the factor (b − a)/2, the integral of u ∘ ζ over the interval [−1, 1].

6.1 The Newton–Cotes method


Given a set of equidistant points −1 = x0 < · · · < xm = 1, a natural method for approximating
the integral (6.1) of a function u : [−1, 1] → R is to first construct the interpolating polynomial u
b
through the nodes, and then calculate the integral of this polynomial. By construction, this
method is exact for polynomials of degree up to m, and so the degree of precision is equal to
at least m. Let ϕ0 , . . . , ϕm denote the Lagrange polynomials associated with the integration
nodes. Then we have
Z 1 Z 1 n n Z 1
b(x) dx = u(xi )ϕi u(x) dx = ϕi u(x) dx .
X X
I≈ u u(xi )
−1 −1 i=0 i=0 −1
| {z }
wi

The weights are independent of the function u, and so they can be calculated a priori. The class
of integration methods obtained using this approach are known as Newton–Cotes methods. We
present a few particular cases:

• m = 1, d = 1 (trapezoidal rule):

  ∫_{−1}^{1} u(x) dx ≈ u(−1) + u(1).    (6.5)

• m = 2, d = 3 (Simpson's rule):

  ∫_{−1}^{1} u(x) dx ≈ (1/3) u(−1) + (4/3) u(0) + (1/3) u(1).    (6.6)

• m = 3, d = 3 (Simpson's 3/8 rule):

  ∫_{−1}^{1} u(x) dx ≈ (1/4) u(−1) + (3/4) u(−1/3) + (3/4) u(1/3) + (1/4) u(1).

• m = 4, d = 5 (Bode's rule):

  ∫_{−1}^{1} u(x) dx ≈ (7/45) u(−1) + (32/45) u(−1/2) + (12/45) u(0) + (32/45) u(1/2) + (7/45) u(1).

In principle, this approach could be employed to construct integration rules of arbitrarily
high degree of precision. In practice, however, the weights become more and more imbalanced
as the number of interpolation points increases, with some of them becoming negative. As a
result, roundoff errors become increasingly detrimental to accuracy as the degree of precision
increases. In addition, in cases where the interpolating polynomial does not converge to u, for
example if u is Runge’s function, the approximate integral may not even converge to the correct
value in the limit as m → ∞ in exact arithmetic!
Note that, although it is based on a quadratic polynomial interpolation, Simpson’s rule (6.6)
has a degree of precision equal to 3. This is because any integration rule with nodes and weights
symmetric around x = 0 is exact for odd functions, in particular x3 . Likewise, the degree of
precision of Bode’s rule is equal to 5.
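
As a minimal sketch (not part of the original notes), the Newton–Cotes weights can also be recovered by requiring exactness on the monomials of degree 0 to m, which leads to a small Vandermonde-type linear system; the helper newton_cotes_weights is hypothetical.

# A minimal sketch: Newton-Cotes weights from exactness on monomials on [-1, 1].
function newton_cotes_weights(m)
    nodes = [-1 + 2i/m for i in 0:m]
    V = [nodes[i+1]^d for d in 0:m, i in 0:m]          # V[d+1, i+1] = x_i^d
    moments = [iseven(d) ? 2/(d + 1) : 0.0 for d in 0:m]
    return V \ moments                                  # solves Σ_i w_i x_i^d = ∫ x^d dx
end

println(newton_cotes_weights(2))   # should be close to [1/3, 4/3, 1/3]
println(newton_cotes_weights(4))   # should match the weights of Bode's rule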

6.2 Composite methods with equidistant nodes


A natural alternative to the approach presented in Section 6.1 is to construct an integration rule
using piecewise polynomial interpolation, which we studied in Section 5.1.7. After partitioning
the integration interval into a number of subintervals, the integral can be approximated by using
one of the rules presented in Section 6.1 within each subinterval.

Composite trapezoidal rule. Let us illustrate the composite approach with an example. To
this end, we introduce a partition a = x_0 < · · · < x_n = b of the interval [a, b] and assume
that the nodes are equidistant with x_{i+1} − x_i = h. Using (6.4), we first generalize (6.5) to an
interval [x_i, x_{i+1}] as follows:

∫_{x_i}^{x_{i+1}} u(x) dx = (h/2) ∫_{−1}^{1} u ∘ ζ(y) dy ≈ (h/2) (u ∘ ζ(−1) + u ∘ ζ(1)) = (h/2) (u(x_i) + u(x_{i+1})),

where ζ : [−1, 1] → [x_i, x_{i+1}] and ≈ in this equation indicates approximation using the trapezoidal rule. Applying this
approximation to each subinterval of the partition, we obtain the composite trapezoidal rule:

∫_a^b u(x) dx = ∑_{i=0}^{n−1} ∫_{x_i}^{x_{i+1}} u(x) dx ≈ (h/2) ∑_{i=0}^{n−1} (u(x_i) + u(x_{i+1}))
             = (h/2) (u(x_0) + 2u(x_1) + 2u(x_2) + · · · + 2u(x_{n−2}) + 2u(x_{n−1}) + u(x_n)).    (6.7)
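
A minimal Julia implementation of (6.7) (not part of the original notes) is given below; applied to ∫_0^{π/2} cos(x) dx = 1, the error is observed to decrease roughly like h², consistently with the estimate derived next.

# A minimal sketch of the composite trapezoidal rule (6.7).
function composite_trapezoidal(u, a, b, n)
    x = LinRange(a, b, n + 1)
    h = x[2] - x[1]
    return (h/2) * (u(a) + 2 * sum(u, x[2:end-1]) + u(b))
end

for n in (8, 16, 32, 64)
    err = abs(1 - composite_trapezoidal(cos, 0, π/2, n))
    println("n = $n, error = $err")
end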


Like the trapezoidal rule (6.5), the composite trapezoidal rule (6.7) has a degree of precision
equal to 1. However, the accuracy of the method depends on the parameter h, which represents
the width of each subinterval: for very small h, equation (6.7) is expected to provide
a good approximation of the integral. An error estimate can be obtained directly from the
formula in Theorem 5.2 for the interpolation error, provided that we assume that u ∈ C²([a, b]).
Denoting by Î_h the approximate integral calculated using (6.7), and by û_h the piecewise linear
interpolation of u, we have

∫_{x_i}^{x_{i+1}} (u(x) − û_h(x)) dx = (1/2) ∫_{x_i}^{x_{i+1}} u''(ξ(x)) (x − x_i)(x − x_{i+1}) dx.

Since (x − x_i)(x − x_{i+1}) is nonpositive over the interval [x_i, x_{i+1}], we deduce that

|∫_{x_i}^{x_{i+1}} (u(x) − û_h(x)) dx| ≤ (1/2) (sup_{ξ∈[a,b]} |u''(ξ)|) ∫_{x_i}^{x_{i+1}} |(x − x_i)(x − x_{i+1})| dx = C_2 h³/12,

where we introduced

C_2 = sup_{ξ∈[a,b]} |u''(ξ)|.

Summing the contributions of all the intervals, we obtain

|I − Î_h| ≤ ∑_{i=0}^{n−1} |∫_{x_i}^{x_{i+1}} (u(x) − û_h(x)) dx| ≤ n × C_2 h³/12 = ((b − a)/12) C_2 h².    (6.8)

The integration error therefore scales as O(h²). (Strictly speaking, we have shown only that
the integration error admits an upper bound that scales as O(h²), but it turns out that the
dependence on h of this bound is optimal.)

Composite Simpson rule. The composite Simpson rule is derived in Exercise 6.2. Given an
odd number n + 1 of equidistant points a = x_0 < x_1 < · · · < x_n = b, it is given by

Î_h = (h/3) (u(x_0) + 4u(x_1) + 2u(x_2) + 4u(x_3) + 2u(x_4) + · · · + 2u(x_{n−2}) + 4u(x_{n−1}) + u(x_n)).    (6.9)

This approximation is obtained by integrating the piecewise quadratic interpolant over a partition
of the integration interval into n/2 subintervals of equal width. Obtaining an optimal
error estimate, in terms of the dependence on h, for this integration formula is slightly more
involved. For a given subinterval [x_{2i}, x_{2i+2}], let us denote by û_2(x) the quadratic interpolating
polynomial at x_{2i}, x_{2i+1}, x_{2i+2}, and by û_3(x) a cubic interpolating polynomial relative to the
nodes x_{2i}, x_{2i+1}, x_{2i+2}, x_α, for some x_α ∈ [x_{2i}, x_{2i+2}]. We have

∫_{x_{2i}}^{x_{2i+2}} (u(x) − û_2(x)) dx = ∫_{x_{2i}}^{x_{2i+2}} (u(x) − û_3(x)) dx + ∫_{x_{2i}}^{x_{2i+2}} (û_3(x) − û_2(x)) dx.    (6.10)


The second term is zero, because the integrand is a cubic polynomial with zeros at x_{2i}, x_{2i+1}
and x_{2i+2}, and because

∫_{x_{2i}}^{x_{2i+2}} (x − x_{2i})(x − x_{2i+1})(x − x_{2i+2}) dx = 0.

(Notice that the integrand is odd about x_{2i+1}.) The cancellation of the second term in (6.10)
also follows from the fact that the degree of precision of the Simpson rule (6.6) is equal to 3,
and so

∫_{x_{2i}}^{x_{2i+2}} (û_3(x) − û_2(x)) dx = (h/3) ((û_3 − û_2)(x_{2i}) + 4(û_3 − û_2)(x_{2i+1}) + (û_3 − û_2)(x_{2i+2})) = 0.

Using Theorem 5.2, we bound the first term in (6.10) from above as follows:

|∫_{x_{2i}}^{x_{2i+2}} (u(x) − û_3(x)) dx| ≤ ∫_{x_{2i}}^{x_{2i+2}} (|u^(4)(ξ(x))| / 24) |(x − x_{2i})(x − x_{2i+1})(x − x_{2i+2})(x − x_α)| dx.

Since this formula is valid for all x_α ∈ [x_{2i}, x_{2i+2}], we are allowed to take x_α = x_{2i+1}. Given that

∫_{x_{2i}}^{x_{2i+2}} (x − x_{2i})(x − x_{2i+1})²(x − x_{2i+2}) dx = −(4/15) h⁵,

with an integrand everywhere nonpositive in the interval [x_{2i}, x_{2i+2}], we conclude that

|∫_{x_{2i}}^{x_{2i+2}} (u(x) − û_3(x)) dx| ≤ (C_4/90) h⁵,   C_4 = sup_{ξ∈[a,b]} |u^(4)(ξ)|.

Summing the contributions of all the subintervals, we finally obtain

|I − Î_h| ≤ (n/2) × (C_4 h⁵/90) = (b − a) C_4 h⁴/180.    (6.11)

Estimating the error. In practice, it is useful to be able to estimate the integration error
so that, if the error is deemed too large, a better approximation of the integral can then be
calculated by using a smaller value for the step size h. Calculating the exact error I − Î_h is
impossible in general, because this would require knowing the exact value of the integral, but it is
possible to calculate a rough approximation of the error based on two numerical approximations
of the integral, as we illustrate formally hereafter for the composite Simpson rule.
Suppose that Î_2h and Î_h are two approximations of the integral, calculated using the composite
Simpson rule with step sizes 2h and h, respectively. If we assume that the error scales
as O(h⁴), as (6.11) suggests, then it holds approximately that

I − Î_h ≈ (1/2⁴) (I − Î_2h).    (6.12)

This implies that

I − Î_2h = (I − Î_h) + (Î_h − Î_2h) ≈ (1/16)(I − Î_2h) + (Î_h − Î_2h).

Rearranging this equation gives an approximation of the error for Î_2h:

I − Î_2h ≈ (16/15) (Î_h − Î_2h).

Using (6.12), we can then derive an error estimate for Î_h:

|I − Î_h| ≈ (1/15) |Î_h − Î_2h|.    (6.13)

The right-hand side can be calculated numerically, because it does not depend on the exact
value of the integral. In practice, the two sides of (6.13) are often very close for small h. In the
code example below, we approximate the integral

I = ∫_0^{π/2} cos(x) dx = 1    (6.14)

for different step sizes and compare the exact error with the approximate error obtained using (6.13).
The results obtained are summarized in Table 6.1, which shows a good match
between the two quantities.

Table 6.1: Comparison between the exact integration error and the approximate integration
error calculated using (6.13).

h       Exact error |I − Î_h|          Approximate error (1/15)|Î_h − Î_2h|
2⁻⁴     5.166847063531321 × 10⁻⁷       5.185892840930961 × 10⁻⁷
2⁻⁵     3.226500089326123 × 10⁻⁸       3.229464703065806 × 10⁻⁸
2⁻⁶     2.0161285974040766 × 10⁻⁹      2.016591486390477 × 10⁻⁹
2⁻⁷     1.2600120946615334 × 10⁻¹⁰     1.260084925291949 × 10⁻¹⁰

# Composite Simpson's rule
function composite_simpson(u, a, b, n)
    # Integration nodes
    x = LinRange(a, b, n + 1)
    # Evaluation of u at the nodes
    ux = u.(x)
    # Step size
    h = x[2] - x[1]
    # Approximation of the integral
    return (h/3) * sum([ux[1]; ux[end]; 4ux[2:2:end-1]; 2ux[3:2:end-2]])
end

# Function to integrate
u(x) = cos(x)
# Integration bounds
a, b = 0, π/2
# Exact integral
I = 1.0
# Number of subintervals
ns = [8; 16; 32; 64; 128]
# Approximate integrals
Î = composite_simpson.(u, a, b, ns)
# Calculate exact and approximate errors
for i in 2:length(ns)
    println("Exact error: $(I - Î[i]), ",
            "Approx error: $((Î[i] - Î[i-1])/15)")
end

6.3 Richardson extrapolation and Romberg’s method


In the previous section, we showed how the integration error could be approximated based on
two approximations of the integral with different step sizes. The aim of this section is to show
that, by cleverly combining two approximations Î_h and Î_2h of an integral, an approximation
even better than Î_h can be constructed.
This approach is based on Richardson's extrapolation, which is a general method for accelerating
the convergence of sequences, with applications beyond numerical integration. The
idea is the following: assume that J(h) is an approximation with step size h of some unknown
quantity J∗ = lim_{h→0} J(h), and that we have access to evaluations of J at h, h/2, h/4, h/8, . . . .
Assuming that J extends to a smooth function over [0, H], we have by Taylor expansion that

J(η) = J(0) + J'(0)η + J''(0)η²/2 + J^(3)(0)η³/3! + · · · + J^(k)(0)η^k/k! + O(η^(k+1)).

Elimination of the linear error term. Let us assume that J'(0) ≠ 0, so that the leading order
term after the constant J(0) scales as η. Then we have

J(h) = J(0) + J'(0)h + O(h²),
J(h/2) = J(0) + J'(0)h/2 + O(h²).

We now ask the following question: can we combine linearly J(h) and J(h/2) in order to
approximate J(0) with an error scaling as O(h²)? Using the ansatz J_1(h/2) = αJ(h) + βJ(h/2),
we calculate

J_1(h/2) = (α + β)J(0) + J'(0)h (α + (1 − α)/2) + O(h²).    (6.15)

Since we want this expression to approximate J(0) for small h, we need to impose that α + β = 1.
Then, in order for the term multiplying h to cancel out, we require that

α + (1 − α)/2 = 0   ⇔   α = −1.

This yields the formula

J_1(h/2) = 2J(h/2) − J(h).    (6.16)

Notice that, in the case where J is a linear function, J_1(h/2) is exactly equal to J(0). This
reveals a geometric interpretation of (6.16): the approximation J_1(h/2) is simply the y intercept
of the straight line passing through the points (h/2, J(h/2)) and (h, J(h)).


Elimination of the quadratic error term. If we had tracked the coefficient of h² in the previous
paragraph, we would have obtained instead of (6.15) the following equation:

J_1(h/2) = J(0) − J''(0) h²/4 + O(h³).

Provided that we have access also to J(h/4), we can also calculate

J_1(h/4) = 2J(h/4) − J(h/2) = J(0) − J''(0) h²/16 + O(h³).

At this point, it is natural to wonder whether we can combine J_1(h/2) and J_1(h/4) in order to
produce an even better approximation of J(0). Applying the same reasoning as in the previous
paragraph leads us to introduce

J_2(h/4) = (4 J_1(h/4) − J_1(h/2)) / (4 − 1) = J(0) + O(h³).

This is an exact approximation of J(0) if J is a quadratic polynomial, implying that J_2(h/4) is
simply the y intercept of the quadratic polynomial interpolation through the points (h/4, J(h/4)),
(h/2, J(h/2)) and (h, J(h)).


 

Elimination of higher order terms. The procedure above can be repeated in order to elimi-
nate terms of higher and higher orders. The following schematic illustrates, for example, the
calculation of an approximation J3 (h/8) = J(0) + O(h4 ).

J(h)

J(h/2) J1 (h/2)

J(h/4) J1 (h/4) J2 (h/4)

J(h/8) J1 (h/8) J2 (h/8) J3 (h/8)

O(h) O(h2 ) O(h3 ) O(h4 ).

The linear combination used to calculate J_i(h/2^i) is always of the form

J_i(h/2^i) = (2^i J_{i−1}(h/2^i) − J_{i−1}(h/2^(i−1))) / (2^i − 1),   J_0 = J.

In practice we calculate the values taken by J, J1 , J2 , . . . at specific values of h, but these are
in fact functions of h. In Figure 6.1, we plot these functions when J(h) = 1 + sin(h). It
appears clearly from the figure that, for sufficiently small h, J3 (h) provides the most precise
approximation of J(0) = 1. Constructing the functions in Julia can be achieved in just a few
lines of code.
J(h) = 1 + sin(h)
J_1(h) = 2J(h) - J(2h)
J_2(h) = (4J_1(h) - J_1(2h))/3
J_3(h) = (8J_2(h) - J_2(2h))/7


Figure 6.1: Illustration of the functions J_1, J_2 and J_3 constructed by Richardson extrapolation.

Generalization. Sometimes, it is known a priori that the Taylor development of the function J
around zero contains only even powers of h. In this case, the Richardson extrapolation procedure
can be slightly modified to produce approximations with errors scaling as O(h4 ), then O(h6 ),
then O(h8 ), etc. This procedure is illustrated below:

J(h)

J(h/2) J1 (h/2)

J(h/4) J1 (h/4) J2 (h/4)

J(h/8) J1 (h/8) J2 (h/8) J3 (h/8)

O(h2 ) O(h4 ) O(h6 ) O(h8 ).

This time, the linear combinations required for populating this table are given by:

J_i(h/2^i) = (2^(2i) J_{i−1}(h/2^i) − J_{i−1}(h/2^(i−1))) / (2^(2i) − 1).    (6.17)

Application to integration: Romberg's method. Romberg's integration method consists of
applying Richardson extrapolation to the function

J(h) = Î_h = (h/2) (u(x_0) + 2u(x_1) + 2u(x_2) + · · · + 2u(x_{n−1}) + u(x_n)),   h ∈ {(b − a)/n : n ∈ N},

where a = x_0 < x_1 < · · · < x_n = b are equidistant nodes. The right-hand side of this equation is
simply the composite trapezoidal rule with step size h. It is possible to show, see [9], that J(h)
may be expanded as follows:

∀k ∈ N,   J(h) = I + α_1 h² + α_2 h⁴ + · · · + α_k h^(2k) + O(h^(2k+2)).


Richardson extrapolation (6.17) can therefore be employed in order to compute approximations
of the integral of increasing accuracy. The convergence of Romberg's method for calculating the
integral (6.14) is illustrated in Figure 6.2.

Figure 6.2: Convergence of Romberg’s method. The straight lines correspond to the monomial
functions f (h) = Ci hi , with i = 2, 4, 6, 8 and for appropriate constants Ci . We observe a good
agreement between the observed and theoretical convergence rates.
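
A minimal Julia sketch of Romberg's method (not part of the original notes) is given below: the first column of the array contains composite trapezoidal approximations with step sizes h, h/2, h/4, . . . , and the remaining entries are filled using the combinations (6.17); the helper romberg is hypothetical.

# A minimal sketch of Romberg's method.
function romberg(u, a, b, levels)
    R = zeros(levels, levels)
    for i in 1:levels
        n = 2^(i - 1)                       # number of subintervals
        x = LinRange(a, b, n + 1)
        h = (b - a) / n
        # Composite trapezoidal approximation (first column)
        R[i, 1] = (h/2) * (u(a) + 2 * sum(u, x[2:end-1]; init=0.0) + u(b))
        # Richardson extrapolation (6.17)
        for j in 2:i
            R[i, j] = (4^(j-1) * R[i, j-1] - R[i-1, j-1]) / (4^(j-1) - 1)
        end
    end
    return R[levels, levels]
end

println(romberg(cos, 0, π/2, 5), " ≈ 1")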

6.4 Methods with non-equidistant nodes


The Newton–Cotes method relies on equidistant integration nodes, and the only degrees of
freedom are the integration weights. If the nodes are not fixed, then additional degrees of
freedom are available, and these can be leveraged in order to construct a better integration
formula. The total number of degrees of freedom for a general integration rule of the form (6.2)
is 2n + 2, which makes it possible to construct an integration rule with degree of precision equal to 2n + 1.
A necessary condition for an integration rule of the form (6.2) to have a degree of precision
equal to 2n + 1 is that it integrates exactly all the monomials of degree 0 to 2n + 1. This
condition is also sufficient because, assuming that it is satisfied, we have by linearity of the
functionals I and Î that

Î(α_0 + α_1 x + · · · + α_{2n+1} x^(2n+1)) = α_0 Î(1) + α_1 Î(x) + · · · + α_{2n+1} Î(x^(2n+1))
    = α_0 I(1) + α_1 I(x) + · · · + α_{2n+1} I(x^(2n+1))
    = I(α_0 + α_1 x + · · · + α_{2n+1} x^(2n+1)).

Here I(u) and Î(u) denote respectively the exact integral of u and its approximate integral
using (6.2). To find the nodes and weights of the integration rule, we can therefore solve the


nonlinear system of 2n + 2 equations with 2n + 2 unknowns:

∑_{i=0}^{n} w_i x_i^d = ∫_{−1}^{1} x^d dx,   d = 0, . . . , 2n + 1.    (6.18)

The quadrature rule obtained by solving this system of equations is called the Gauss–Legendre
quadrature.

Example 6.1. Let us derive the Gauss–Legendre quadrature with n + 1 = 2 nodes. The system
of equations that we need to solve in this case is the following:

w_0 + w_1 = 2,   w_0 x_0 + w_1 x_1 = 0,   w_0 x_0² + w_1 x_1² = 2/3,   w_0 x_0³ + w_1 x_1³ = 0.

The solution to these equations is given by

−x_0 = x_1 = √3/3,   w_0 = w_1 = 1.
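
As a quick check (not part of the original notes), the following Julia snippet verifies that the two-node rule of Example 6.1 integrates all monomials of degree up to 3 exactly, while failing for degree 4.

# A minimal sketch: degree of precision of the two-node Gauss-Legendre rule.
nodes = [-sqrt(3)/3, sqrt(3)/3]
weights = [1.0, 1.0]

for d in 0:4
    exact = iseven(d) ? 2 / (d + 1) : 0.0      # ∫_{-1}^{1} x^d dx
    approx = sum(weights .* nodes.^d)
    println("degree $d: error = $(abs(exact - approx))")
end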

Connection with orthogonal polynomials.� The Gauss–Legendre quadrature rules can be


obtained more constructively from orthogonal polynomials, an approach we presented in Sec-
tion 5.2.4. In particular, the integration nodes are related the roots of a Legendre polynomial,
which opens the door to another computational approach for computing them. It is possible to
prove, and this should be clear after reading Section 5.2.4, that the integration weights are all
positive, and so Gauss–Legendre quadrature rules are less susceptible to roundoff errors than
the Newton–Cotes methods.
In this section, we prove, starting from the system of nonlinear equations (6.18), that the
roots of the integration formula are necessarily the roots of a Legendre polynomial. This will
imply, as a corollary, that the set of integration nodes satisfying (6.18) is unique, because the
orthogonal polynomials are unique. To this end, let us denote by

p(x) = (x − x_0) . . . (x − x_n) =: α_0 + α_1 x + · · · + α_{n+1} x^(n+1)

the polynomial whose roots coincide with the unknown integration nodes. Multiplying the first
equation of (6.18) (d = 0) by α_0, the second (d = 1) by α_1, and so forth until the equation
corresponding to d = n + 1, we obtain after summing these equations

∑_{i=0}^{n} w_i ∑_{d=0}^{n+1} α_d x_i^d = ∑_{d=0}^{n+1} α_d ∫_{−1}^{1} x^d dx   ⇔   ∑_{i=0}^{n} w_i p(x_i) = ∫_{−1}^{1} p(x) dx.

Since the left-hand side of this equation is equal to 0 by definition of p, we deduce that p is
orthogonal to the constant polynomial for the inner product

⟨f, g⟩ = ∫_{−1}^{1} f(x) g(x) dx.    (6.19)


Now if we multiply the second equation of (6.18) (d = 1) by α_0, the third (d = 2) by α_1, and so
forth until the equation corresponding to d = n + 2, we obtain after summation of these
equations that

∑_{i=0}^{n} w_i x_i ∑_{d=0}^{n+1} α_d x_i^d = ∑_{d=0}^{n+1} α_d ∫_{−1}^{1} x^(d+1) dx   ⇔   ∑_{i=0}^{n} w_i x_i p(x_i) = ∫_{−1}^{1} x p(x) dx.

Since the left-hand side of this equation is again 0, because the nodes x_i are the roots of p, we
deduce that p is orthogonal to the linear polynomial x ↦ x for the inner product (6.19). This
reasoning can be repeated in order to deduce that p is in fact orthogonal to all the monomials
of degree 0 to n, implying that p is a multiple of the Legendre polynomial of degree n + 1.

Generalization to higher dimensions. Gauss–Legendre integration is ubiquitous in numerical
methods for partial differential equations, in particular the finite element method. Its generalization
to higher dimensions is immediate: for a function u : [−1, 1] × [−1, 1] → R, we have

∫_{−1}^{1} ∫_{−1}^{1} u(x, y) dy dx ≈ ∑_{i=0}^{n} ∑_{j=0}^{n} w_i w_j u(x_i, x_j).

The degree of precision of this integration rule is the same as that of the corresponding one-dimensional rule.
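
As a minimal sketch (not part of the original notes), the following Julia snippet applies the tensor-product rule built from the two-node rule of Example 6.1 to u(x, y) = x²y², whose exact integral over [−1, 1]² is 4/9.

# A minimal sketch: two-dimensional tensor-product Gauss-Legendre quadrature.
nodes = [-sqrt(3)/3, sqrt(3)/3]
weights = [1.0, 1.0]

u(x, y) = x^2 * y^2
approx = sum(weights[i] * weights[j] * u(nodes[i], nodes[j])
             for i in 1:2, j in 1:2)
println(approx, " ≈ ", 4/9)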

6.5 Exercises
� Exercise 6.1. Derive the Simpson’s integration rule (6.6).

� Exercise 6.2. Derive the composite Simpson integration rule (6.9).

� Exercise 6.3. Consider the integration rule

∫_0^1 u(x) dx ≈ w_1 u(0) + w_2 u(1) + w_3 u'(0).

Find w1 , w2 and w3 so that this integration rule has the highest possible degree of precision.

� Exercise 6.4. Consider the integration rule

∫_{−1}^{1} u(x) dx ≈ w_1 u(x_1) + w_2 u'(x_1).

Find w1 , w2 and x1 so that this integration rule has the highest possible degree of precision.

� Exercise 6.5. What is the degree of precision of the following quadrature rule?

∫_{−1}^{1} u(x) dx ≈ (2/3) (2u(−1/2) − u(0) + 2u(1/2)).

� Exercise 6.6. The Gauss–Hermite quadrature rule with n + 1 nodes is an approximation of the form

∫_{−∞}^{∞} u(x) e^(−x²/2) dx ≈ ∑_{i=0}^{n} w_i u(x_i),

such that the rule is exact for all polynomials of degree less than or equal to 2n + 1. Find the
Gauss–Hermite rule with two nodes.

� Exercise 6.7. Use Romberg’s method to construct an integration rule with an error term
scaling as O(h4 ). Is there a link between the method you obtained and another integration rule
seen in class?

� Exercise 6.8 (Improving the error bound for the composite trapezoidal rule). The notation
used in this exercise is the same as in Section 6.2. In particular, Î_h denotes the approximate
integral obtained by using the composite trapezoidal rule (6.7), and û_h is the corresponding
piecewise linear interpolant.
A version of the mean value theorem states that, if g : [a, b] → R is a non-negative integrable
function and f : [a, b] → R is continuous, then there exists ξ ∈ (a, b) such that

∫_a^b f(x) g(x) dx = f(ξ) ∫_a^b g(x) dx.    (6.20)

• Using (6.20), show that, for all i ∈ {0, . . . , n − 1}, there exists ξ_i ∈ (x_i, x_{i+1}) such that

  ∫_{x_i}^{x_{i+1}} (u(x) − û_h(x)) dx = −u''(ξ_i) h³/12.

• Prove, by using the intermediate value theorem, that if f : [a, b] → R is a continuous
  function, then for any set ξ_0, . . . , ξ_{n−1} of points within the interval (a, b), there exists
  c ∈ (a, b) such that

  (1/n) ∑_{i=0}^{n−1} f(ξ_i) = f(c).

• Combining the previous items, conclude that there exists ξ ∈ (a, b) such that

  I − Î_h = −u''(ξ)(b − a) h²/12,

  which is a more precise expression of the error than that obtained in (6.8).

Remark 6.1. One may convince oneself of (6.20) by rewriting this equation as

(∫_a^b f(x) g(x) dx) / (∫_a^b g(x) dx) = f(ξ).

The left-hand side is the average of f(x) with respect to the probability measure with density
given by

x ↦ g(x) / ∫_a^b g(x) dx.

6.6 Discussion and bibliography


In this chapter, we covered mainly deterministic integration formulas. The presentation of part
of the material follows that in [6], and some exercises come from [9, Chapter 9]. Much of the
research around the calculation of high-dimensional integrals today is concerned with proba-
bilistic integration methods using Monte Carlo approaches. These methods are based on the
connection between integrals and expectations. For example, the integral

I = ∫_0^1 x² dx

may be expressed as the expectation E[X 2 ], where E is the expectation operator and X ∼ U (0, 1)
is a uniformly distributed random variable over the interval [0, 1]. Therefore, in practice, I
may be approximated by generating a large number of samples X1 , X2 , . . . from the distribu-
tion U(0, 1) and averaging f (Xi ) over all these samples.

n = 1000
f(x) = x^2
X = rand(n)
Î = (1/n) * sum(f.(X))

The main advantage of this approach is that it generalizes very easily to high-dimensional and
infinite-dimensional settings.

Appendix A

Linear algebra

In this chapter, we collect basic results on vectors and matrices that are useful for this course.
Throughout these lecture notes, we use lower case and bold font to denote vectors, e.g. x ∈ Cn ,
and upper case to denote matrices, e.g. A ∈ Cm×n . The entries of a vector x ∈ Cn are denoted
by (xi ), and those of a matrix A ∈ Cm×n are denoted by (aij ) or (ai,j ).

A.1 Inner products and norms


We begin by recalling the definitions of the fundamental concepts of inner product and norm.
For generality, we consider the case of a complex vector space, i.e. a vector space for which the
scalar field is C.

Definition A.1. An inner product on a complex vector space X is a function ⟨•, •⟩ : X × X → C
satisfying the following axioms:

• Conjugate symmetry: for all (x, y) ∈ X × X,

  ⟨x, y⟩ = conj(⟨y, x⟩),

  where conj(•) denotes the complex conjugate.

• Linearity: for all (α, β) ∈ C² and all (x, y, z) ∈ X³, it holds that

  ⟨αx + βy, z⟩ = α⟨x, z⟩ + β⟨y, z⟩.

• Positive-definiteness:

  ∀x ∈ X \{0},   ⟨x, x⟩ > 0.

For example, the familiar Euclidean inner product on Cⁿ is given by

⟨x, y⟩ := ∑_{i=1}^{n} x_i conj(y_i).

A vector space with an inner product is called an inner product space.


Definition A.2. A norm on a complex vector space X is a function ‖•‖ : X → R satisfying the
following axioms:

• Positivity: ∀x ∈ X \{0}, ‖x‖ > 0.

• Homogeneity: ∀(c, x) ∈ C × X, ‖cx‖ = |c| ‖x‖.

• Triangle inequality: ∀(x, y) ∈ X × X, ‖x + y‖ ≤ ‖x‖ + ‖y‖.

Any inner product on X induces a norm via the formula

‖x‖ = √⟨x, x⟩.    (A.1)

The Cauchy–Schwarz inequality makes it possible to bound inner products using norms. It is also useful
for showing that the functional defined in (A.1) satisfies the triangle inequality.

Proposition A.1 (Cauchy–Schwarz inequality). Let X be an inner product space. It holds that

∀(x, y) ∈ X × X,   |⟨x, y⟩| ≤ ‖x‖ ‖y‖.    (A.2)

Proof. We present the proof in the case of a real inner product space; the complex case follows
after multiplying x by a suitable unit complex number. We may assume that y ≠ 0, since otherwise
the inequality is trivial. Let us define p(t) = ‖x + ty‖². Using the bilinearity of the inner product, we have

p(t) = ‖x‖² + 2t⟨x, y⟩ + t²‖y‖².

This shows that p is a convex second-order polynomial with a minimum at t∗ = −⟨x, y⟩/‖y‖².
Substituting this value in the expression of p, we obtain

p(t∗) = ‖x‖² − 2|⟨x, y⟩|²/‖y‖² + |⟨x, y⟩|²/‖y‖² = ‖x‖² − |⟨x, y⟩|²/‖y‖².

Since p(t∗) ≥ 0 by definition of p, we obtain (A.2).

Several norms can be defined on the same vector space X. Two norms ‖•‖_α and ‖•‖_β on X
are said to be equivalent if there exist positive real numbers c_ℓ and c_u such that

∀x ∈ X,   c_ℓ ‖x‖_α ≤ ‖x‖_β ≤ c_u ‖x‖_α.    (A.3)

When working with norms on finite-dimensional vector spaces, it is important to keep in mind
the following result. The proof is provided for information purposes.

Proposition A.2. Assume that X is a finite-dimensional vector space. Then all the norms
defined on X are pairwise equivalent.

Proof. Let ‖•‖_α and ‖•‖_β be two norms on X, and let (e_1, . . . , e_n) be a basis of X, where n is
the dimension of the vector space. Any x ∈ X can be represented as x = λ_1 e_1 + · · · + λ_n e_n. By
the triangle inequality, it holds that

‖x‖_α ≤ |λ_1| ‖e_1‖_α + · · · + |λ_n| ‖e_n‖_α ≤ (|λ_1| + · · · + |λ_n|) max{‖e_1‖_α, . . . , ‖e_n‖_α}.    (A.4)


On the other hand, as we prove below, there exists a positive constant ℓ such that

∀x ∈ X,   ‖x‖_β ≥ ℓ (|λ_1| + · · · + |λ_n|).    (A.5)

Combining (A.4) and (A.5), we conclude that

‖x‖_α ≤ (1/ℓ) max{‖e_1‖_α, . . . , ‖e_n‖_α} ‖x‖_β.

This proves the first inequality in (A.3), and reversing the roles of ‖•‖_α and ‖•‖_β yields the
second inequality.
We now prove (A.5) by contradiction. If this inequality were not true, then there would
exist a sequence (x^(i))_{i∈N} such that ‖x^(i)‖_β → 0 as i → ∞ but |λ_1^(i)| + · · · + |λ_n^(i)| = 1 for
all i ∈ N. Since λ_1^(i) ∈ [−1, 1] for all i ∈ N, we can extract a subsequence, still denoted by
(x^(i))_{i∈N} for simplicity, such that the corresponding coefficient λ_1^(i) satisfies λ_1^(i) → λ_1^∗ ∈ [−1, 1],
by compactness of the interval [−1, 1]. Repeating this procedure for λ_2^(i), λ_3^(i), . . . , taking a new
subsequence every time, we obtain a subsequence (x^(i))_{i∈N} such that λ_j^(i) → λ_j^∗ in the limit
as i → ∞, for all j ∈ {1, . . . , n}. Therefore, it holds that x^(i) → x^∗ := λ_1^∗ e_1 + · · · + λ_n^∗ e_n in the
‖•‖_β norm. Since x^(i) → 0 by assumption and the vectors e_1, . . . , e_n are linearly independent,
this implies that λ_1^∗ = · · · = λ_n^∗ = 0, which is a contradiction because we also have that

|λ_1^∗| + · · · + |λ_n^∗| = lim_{i→∞} (|λ_1^(i)| + · · · + |λ_n^(i)|) = 1.

This concludes the proof of (A.5).

Exercise A.1. Using Proposition A.1, show that the function ‖•‖ defined by (A.1) satisfies
the triangle inequality.

A.2 Vector norms


In the vector space Cn , the most commonly used norms are particular cases of the p-norm, also
called Hölder norm.

Definition A.3. Given p ∈ [1, ∞], the p-norm of a vector x ∈ Cn is defined by

      ‖x‖p := ( ∑_{i=1}^{n} |x_i|^p )^{1/p}      if p < ∞,
      ‖x‖p := max{ |x_1|, . . . , |x_n| }         if p = ∞.

The values of p most commonly encountered in applications are 1, 2 and ∞. The 1-norm is
sometimes called the taxicab or Manhattan norm, and the 2-norm is usually called the Euclidean
norm. The explicit expressions of these norms are

      ‖x‖1 = ∑_{i=1}^{n} |x_i|,        ‖x‖2 = √( ∑_{i=1}^{n} |x_i|² ).


Notice that the infinity norm ‖•‖∞ may be defined as the limit of the p-norm as p → ∞:

      ‖x‖∞ := lim_{p→∞} ‖x‖p.
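
In Julia, all of these vector norms can be evaluated with the norm function of the standard
LinearAlgebra library. The short script below, based on an arbitrary example vector, also
illustrates numerically that the p-norm approaches the infinity norm as p grows:

using LinearAlgebra

x = [3.0, -4.0, 1.0]

println(norm(x, 1))     # 1-norm: 8.0
println(norm(x, 2))     # Euclidean norm: √26
println(norm(x, Inf))   # infinity norm: 4.0

# The p-norm approaches the infinity norm as p → ∞
for p in (1, 2, 4, 8, 16, 32)
    println("p = $p:  ", norm(x, p))
end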

In the rest of this chapter, the notations ⟨•, •⟩ and ‖•‖ without subscript always refer to the
Euclidean inner product and the norm it induces via (A.1), unless specified otherwise.

A.3 Matrix norms


Given two norms ‖•‖α and ‖•‖β on Cm and Cn, respectively, we define the operator norm of a
matrix A ∈ Cm×n induced by ‖•‖α and ‖•‖β as

      ‖A‖α,β = sup{ ‖Ax‖α : x ∈ Cn, ‖x‖β ≤ 1 }.                                  (A.6)

The term operator norm is motivated by the fact that, to any matrix A ∈ Cm×n, there naturally
corresponds the linear operator from Cn to Cm with action x ↦ Ax. Equation (A.6) comes from
the general definition of the norm of a bounded linear operator between normed spaces. Matrix
norms of the type (A.6) are also called subordinate matrix norms. An immediate corollary of
the definition (A.6) is that, for all nonzero x ∈ Cn,

      ‖Ax‖α = ‖A(x/‖x‖β)‖α ‖x‖β ≤ sup{ ‖Ay‖α : ‖y‖β ≤ 1 } ‖x‖β = ‖A‖α,β ‖x‖β,    (A.7)

and the resulting bound ‖Ax‖α ≤ ‖A‖α,β ‖x‖β trivially also holds for x = 0.

Proposition A.3. Equation (A.6) defines a norm on Cm×n .

Proof. We need to verify that (A.6) satisfies the properties of positivity and homogeneity, to-
gether with the triangle inequality.

• Checking positivity is simple and left as an exercise.

• Homogeneity follows trivially from the definition (A.6) and the homogeneity of ‖•‖α.

• Triangle inequality. Let A and B be two elements of Cm×n. Employing the triangle
  inequality for the norm ‖•‖α, we have, for all x ∈ Cn with ‖x‖β ≤ 1,

      ‖(A + B)x‖α = ‖Ax + Bx‖α ≤ ‖Ax‖α + ‖Bx‖α ≤ ‖A‖α,β + ‖B‖α,β.

  Taking the supremum as in (A.6), we obtain ‖A + B‖α,β ≤ ‖A‖α,β + ‖B‖α,β.

Since the three properties are satisfied, ‖•‖α,β is indeed a norm.

The matrix p-norm is defined as the operator norm (A.6) in the particular case where ‖•‖α
and ‖•‖β are both Hölder norms with the same value of p.


Definition A.4. Given p ∈ [1, ∞], the p-norm of a matrix A ∈ Cm×n is given by

      ‖A‖p := sup{ ‖Ax‖p : x ∈ Cn, ‖x‖p ≤ 1 }.                                   (A.8)




Not all matrix norms are induced by vector norms. For example, the Frobenius norm, which
is widely used in applications, is not induced by a vector norm. It is, however, induced by an
inner product on Cm×n .

Definition A.5. The Frobenius norm of A ∈ Cm×n is given by

      ‖A‖F = √( ∑_{i=1}^{m} ∑_{j=1}^{n} |aij|² ).                                (A.9)

A matrix norm ‖•‖ is said to be submultiplicative if, for any two matrices A ∈ Cm×n and
B ∈ Cn×ℓ, it holds that

      ‖AB‖ ≤ ‖A‖ ‖B‖.

All subordinate matrix norms, for example the p-norms, are submultiplicative, and so is the
Frobenius norm.
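
In Julia, the induced matrix p-norms for p ∈ {1, 2, Inf} are computed with opnorm from the
LinearAlgebra library, while norm applied to a matrix returns the Frobenius norm. The sketch
below, which uses arbitrary random matrices, also checks the submultiplicativity inequality
numerically:

using LinearAlgebra

A = randn(4, 3)
B = randn(3, 5)

println(opnorm(A, 1))    # induced 1-norm (maximum absolute column sum)
println(opnorm(A, 2))    # induced 2-norm (largest singular value)
println(opnorm(A, Inf))  # induced ∞-norm (maximum absolute row sum)
println(norm(A))         # Frobenius norm

# Numerical check of submultiplicativity for the 2-norm and the Frobenius norm
println(opnorm(A * B, 2) <= opnorm(A, 2) * opnorm(B, 2))
println(norm(A * B) <= norm(A) * norm(B))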

Exercise A.2. Write down the inner product on Cm×n corresponding to (A.9).

Exercise A.3. Show that the matrix p-norm is submultiplicative.

A.4 Diagonalization

Definition A.6. A square matrix A ∈ Cn×n is said to be diagonalizable if there exists an
invertible matrix P ∈ Cn×n and a diagonal matrix D ∈ Cn×n such that

      AP = PD.                                                                   (A.10)

In this case, the diagonal elements of D are called the eigenvalues of A, and the columns of P
are called the eigenvectors of A.

Denoting by ei the i-th column of P and by λi the i-th diagonal element of D, we have by (A.10)
that Aei = λi ei or, equivalently, (A − λi In )ei = 0. Here In is the Cn×n identity matrix.
Therefore, λ is an eigenvalue of A if and only if det(A−λIn ) = 0. In other words, the eigenvalues
of A are the roots of det(A − λIn ), which is called the characteristic polynomial.
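
In Julia, eigenvalues and eigenvectors can be computed numerically with the eigen function (or
eigvals and eigvecs) from the LinearAlgebra library. A minimal sketch with an arbitrary
diagonalizable matrix:

using LinearAlgebra

A = [2.0 1.0; 1.0 3.0]

F = eigen(A)             # F.values: eigenvalues, F.vectors: corresponding eigenvectors
λ, P = F.values, F.vectors

# Check the relation AP = PD with D = Diagonal(λ)
println(A * P ≈ P * Diagonal(λ))

# The eigenvalues are roots of the characteristic polynomial det(A - λ I)
println(det(A - λ[1] * I))   # zero up to round-off errors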

Symmetric matrices and spectral theorem


The transpose of a matrix A ∈ Cm×n is denoted by AT ∈ Cn×m and defined as the matrix with
entries (AT)ij = aji. The conjugate transpose of A, denoted by A∗, is the matrix obtained by
taking the transpose and taking the complex conjugate of all the entries. A real matrix equal
to its transpose is necessarily square and called symmetric, and a complex matrix equal to its
conjugate transpose is called Hermitian. Hermitian matrices, of which real symmetric matrices
are a subset, enjoy many nice properties, the main one being that they are diagonalizable with
a matrix P that is unitary, i.e. such that P−1 = P∗. This is the content of the spectral theorem,
a pillar of linear algebra with important generalizations to infinite-dimensional operators.

Theorem A.4 (Spectral theorem for Hermitian matrices). If A ∈ Cn×n is Hermitian, then
there exists a unitary matrix Q ∈ Cn×n and a diagonal matrix D ∈ Rn×n such that

AQ = QD.

Sketch of the proof. The result is trivial for n = 1. Reasoning by induction, we assume that the
result is true for Hermitian matrices in Cn−1×n−1 and prove that it then also holds for A ∈ Cn×n .

Step 1. Existence of a real eigenvalue. By the fundamental theorem of algebra, there
exists at least one solution λ1 ∈ C to the equation det(A − λIn) = 0, to which there corresponds
a solution q1 ∈ Cn of norm 1 to the equation (A − λ1 In)q1 = 0. The eigenvalue λ1 is necessarily
real because

      λ1 ⟨q1, q1⟩ = ⟨λ1 q1, q1⟩ = ⟨Aq1, q1⟩ = ⟨q1, Aq1⟩ = ⟨q1, λ1 q1⟩ = \overline{λ1} ⟨q1, q1⟩,

and ⟨q1, q1⟩ = 1 ≠ 0.

Step 2. Using the induction hypothesis. Next, take an orthonormal basis (e2, . . . , en)
of the orthogonal complement span{q1}⊥ and construct the unitary matrix

      V = ( q1   e2   · · ·   en ),

i.e. the matrix with columns q1, e2, etc. A calculation gives

      V∗AV = [ ⟨q1, Aq1⟩   ⟨q1, Ae2⟩   · · ·   ⟨q1, Aen⟩ ]   [ λ1   0           · · ·   0          ]
             [ ⟨e2, Aq1⟩   ⟨e2, Ae2⟩   · · ·   ⟨e2, Aen⟩ ] = [ 0    ⟨e2, Ae2⟩   · · ·   ⟨e2, Aen⟩  ]
             [     ⋮           ⋮         ⋱         ⋮     ]   [ ⋮        ⋮         ⋱         ⋮      ]
             [ ⟨en, Aq1⟩   ⟨en, Ae2⟩   · · ·   ⟨en, Aen⟩ ]   [ 0    ⟨en, Ae2⟩   · · ·   ⟨en, Aen⟩  ].

Let us denote the (n − 1) × (n − 1) lower right block of this matrix by Vn−1. This is a Hermitian
matrix of size n − 1 so, using the induction hypothesis, we deduce that Vn−1 = Qn−1 Dn−1 Q∗n−1
for appropriate matrices Qn−1 ∈ Cn−1×n−1 and Dn−1 ∈ Rn−1×n−1 which are unitary and
diagonal, respectively.

Step 3. Constructing Q and D. Define now

      Q = V [ 1   0ᵀ   ]
            [ 0   Qn−1 ].

It is not difficult to verify that Q is a unitary matrix, and we have

      Q∗AQ = [ 1   0ᵀ    ] V∗AV [ 1   0ᵀ   ] = [ 1   0ᵀ    ] [ λ1   0ᵀ   ] [ 1   0ᵀ   ]
             [ 0   Q∗n−1 ]      [ 0   Qn−1 ]   [ 0   Q∗n−1 ] [ 0    Vn−1 ] [ 0   Qn−1 ].

Developing the last expression, we obtain

      Q∗AQ = [ λ1   0ᵀ   ]
             [ 0    Dn−1 ],

which concludes the proof.

We deduce, as a corollary of the spectral theorem, that if e1 and e2 are eigenvectors of a
Hermitian matrix associated with different eigenvalues, then they are necessarily orthogonal for
the Euclidean inner product. Indeed, since A = A∗ and the eigenvalues are real, it holds that

      (λ1 − λ2)⟨e1, e2⟩ = ⟨λ1 e1, e2⟩ − ⟨e1, λ2 e2⟩ = ⟨Ae1, e2⟩ − ⟨e1, Ae2⟩ = ⟨Ae1, e2⟩ − ⟨A∗e1, e2⟩ = 0.

The largest modulus of the eigenvalues of a matrix is called the spectral radius and denoted by ρ.
The following result relates the 2-norm of a matrix A to the spectral radius of A∗A.

Proposition A.5. It holds that ‖A‖2 = √ρ(A∗A).

Proof. Since A∗A is Hermitian, it holds by the spectral theorem that A∗A = QDQ∗ for some
unitary matrix Q and real diagonal matrix D. Therefore, denoting by (µi)1≤i≤n the (nonnegative)
diagonal elements of D and introducing y := Q∗x, we have

      ‖Ax‖ = √(x∗A∗Ax) = √(x∗QDQ∗x) = √( ∑_{i=1}^{n} µi |yi|² ) ≤ √ρ(A∗A) √( ∑_{i=1}^{n} |yi|² ) = √ρ(A∗A) ‖x‖,   (A.11)

where we used in the last equality the fact that y has the same norm as x, because Q is
unitary. It follows from (A.11) that ‖A‖2 ≤ √ρ(A∗A), and the converse inequality also holds
true since ‖Ax‖ = √ρ(A∗A) ‖x‖ if x is an eigenvector of A∗A corresponding to an eigenvalue
of modulus ρ(A∗A).
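
The identity of Proposition A.5 is easily checked numerically in Julia: in the sketch below, which
uses an arbitrary random complex matrix, opnorm(A, 2) is compared with the square root of the
spectral radius of A∗A.

using LinearAlgebra

A = randn(5, 3) + im * randn(5, 3)

lhs = opnorm(A, 2)                            # matrix 2-norm
rhs = sqrt(maximum(abs.(eigvals(A' * A))))    # √ρ(A*A); A' is the conjugate transpose
println(lhs ≈ rhs)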

To conclude this section, we recall and prove the Courant–Fischer theorem.

Theorem A.6 (Courant–Fischer min-max theorem). The eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λn of a
Hermitian matrix A ∈ Cn×n are characterized by the relation

      λk = max_{S ⊂ Cn, dim(S) = k}   min_{x ∈ S\{0}}   (x∗Ax)/(x∗x).

Proof. Let v1, . . . , vn be orthonormal eigenvectors associated with the eigenvalues λ1, . . . , λn.
Let Sk = span{v1, . . . , vk}. Any x ∈ Sk may be expressed as x = α1 v1 + · · · + αk vk, and so

      ∀x ∈ Sk \ {0},   (x∗Ax)/(x∗x) = ( ∑_{i=1}^{k} λi |αi|² ) / ( ∑_{i=1}^{k} |αi|² ) ≥ λk.

Therefore, it holds that

      min_{x ∈ Sk \{0}} (x∗Ax)/(x∗x) ≥ λk,

which proves the ≥ direction. For the ≤ direction, let Uk = span{vk, . . . , vn}. Using a well-known
result from linear algebra, we calculate that, for any subspace S ⊂ Cn of dimension k,

      dim(S ∩ Uk) = dim(S) + dim(Uk) − dim(S + Uk) ≥ k + (n − k + 1) − n = 1.

Therefore, any S ⊂ Cn of dimension k has a nonzero intersection with Uk. But since any vector
in Uk can be expanded as x = βk vk + · · · + βn vn, we have

      ∀x ∈ Uk \ {0},   (x∗Ax)/(x∗x) = ( ∑_{i=k}^{n} λi |βi|² ) / ( ∑_{i=k}^{n} |βi|² ) ≤ λk.

Since every subspace S of dimension k contains a nonzero vector of Uk, this shows that

      ∀S ⊂ Cn with dim(S) = k,   min_{x ∈ S\{0}} (x∗Ax)/(x∗x) ≤ λk,

which concludes the proof.
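
As a quick numerical illustration (not a proof), one can check in Julia that the Rayleigh quotient
x∗Ax/x∗x of a Hermitian matrix equals the largest eigenvalue when evaluated at a corresponding
eigenvector, and never exceeds it for other vectors. The matrix below is an arbitrary random example.

using LinearAlgebra

B = randn(6, 6)
A = Symmetric(B + B')          # an arbitrary real symmetric matrix

F = eigen(A)                   # for Symmetric matrices, eigenvalues are returned in increasing order
λmax = F.values[end]
vmax = F.vectors[:, end]

rayleigh(x) = (x' * A * x) / (x' * x)

println(rayleigh(vmax) ≈ λmax)
println(all(rayleigh(randn(6)) <= λmax + 1e-10 for _ in 1:100))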

Exercise A.4. Prove that if A ∈ Rn×n is diagonalizable as in (A.10), then Aⁿ = PDⁿP−1.

A.5 Similarity transformation and Jordan normal form


In this section, we work with matrices in Cn×n. A similarity transformation is a mapping of the
type Cn×n ∋ A ↦ P−1AP ∈ Cn×n, where P ∈ Cn×n is a nonsingular matrix. If two matrices
are related by a similarity transformation, they are called similar.

Definition A.7 (Jordan block). A Jordan block with dimension n is a matrix of the form

      Jn(λ) = [ λ   1                ]
              [     λ   1            ]
              [         ⋱   ⋱        ]
              [             λ   1    ]
              [                 λ    ]

The parameter λ ∈ C is called the eigenvalue of the Jordan block.

A Jordan block is diagonalizable if and only if it is of dimension 1. The only eigenvector of a
Jordan block, up to scalar multiplication, is (1, 0, . . . , 0)ᵀ. The power of a Jordan block admits
an explicit expression.


Lemma A.7. It holds that

      Jn(λ)ᵏ = [ λᵏ   \binom{k}{1} λ^{k−1}   \binom{k}{2} λ^{k−2}   · · ·   \binom{k}{n−1} λ^{k−n+1} ]
               [       λᵏ                    \binom{k}{1} λ^{k−1}   · · ·   \binom{k}{n−2} λ^{k−n+2} ]
               [                              ⋱                      ⋱               ⋮               ]
               [                                                     λᵏ      \binom{k}{1} λ^{k−1}    ]
               [                                                              λᵏ                     ]    (A.12)

with the convention that \binom{k}{i} = 0 whenever i > k.

Proof. The explicit expression of the power of the Jordan block can be obtained by decomposing
the block as Jn(λ) = λI + N and using the binomial formula:

      (λI + N)ᵏ = ∑_{i=0}^{k} \binom{k}{i} (λI)^{k−i} Nⁱ.

To conclude the proof, we use the fact that Nⁱ is a matrix with zeros everywhere except for the
i-th super-diagonal, which contains only ones. Moreover, Nⁱ = 0n×n if i ≥ n.
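
Formula (A.12) can be checked numerically in Julia on a small example; the script below builds a
Jordan block with arbitrary values of n, λ and k using diagm from the LinearAlgebra library, and
compares a few entries of its k-th power with the binomial expression.

using LinearAlgebra

n, λ, k = 4, 0.9, 7
J = diagm(0 => fill(λ, n), 1 => ones(n - 1))   # Jordan block Jn(λ)

Jk = J^k

# Entry (i, j) of Jn(λ)^k should be binomial(k, j - i) * λ^(k - (j - i)) for j ≥ i
println(Jk[1, 3] ≈ binomial(k, 2) * λ^(k - 2))
println(Jk[2, 4] ≈ binomial(k, 2) * λ^(k - 2))
println(Jk[1, 1] ≈ λ^k)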

A matrix is said to be of Jordan normal form if it is block-diagonal with Jordan blocks on
the diagonal. In other words, a matrix J ∈ Cn×n is of Jordan normal form if

      J = [ Jn1(λ1)                                           ]
          [           Jn2(λ2)                                 ]
          [                     ⋱                             ]
          [                         Jnk−1(λk−1)               ]
          [                                       Jnk(λk)     ]

with n1 + · · · + nk = n. Note that λ1, . . . , λk, which are not necessarily distinct, are the
eigenvalues of J. We state without proof the following important result.

Proposition A.8 (Jordan normal form). Any matrix A ∈ Cn×n is similar to a matrix in
Jordan normal form. In other words, there exists an invertible matrix P ∈ Cn×n and a matrix
J ∈ Cn×n in Jordan normal form such that

      A = PJP−1.

A.6 Oldenburger’s theorem and Gelfand’s formula


The following result establishes a necessary and sufficient condition, in terms of the spectral
radius of A, for the convergence of ‖Aᵏ‖ to 0, valid for any matrix norm ‖•‖.


Proposition A.9 (Oldenburger). Let ρ(A) denote the spectral radius of A ∈ Cn×n and ‖•‖
be a matrix norm. Then ‖Aᵏ‖ → 0 in the limit as k → ∞ if and only if ρ(A) < 1. In addition,
if ρ(A) > 1, then ‖Aᵏ‖ → ∞ in the limit as k → ∞.

Proof. Since all matrix norms are equivalent, we can assume without loss of generality that ‖•‖
is the 2-norm. We prove only the equivalence ‖Aᵏ‖ → 0 ⇔ ρ(A) < 1. The other statement can
be proved similarly.
   If ρ(A) ≥ 1, then denoting by v an eigenvector of A corresponding to an eigenvalue of
modulus ρ(A), we have ‖Aᵏv‖ = ρ(A)ᵏ ‖v‖ ≥ ‖v‖. Therefore, the sequence (Aᵏ)k∈N does not
converge to the zero matrix in the 2-norm. This shows the implication ‖Aᵏ‖ → 0 ⇒ ρ(A) < 1.
   To show the converse implication, we employ Proposition A.8, which states that there exists
a nonsingular matrix P such that A = PJP−1, for a matrix J ∈ Cn×n which is in Jordan normal
form. It holds that Aᵏ = PJᵏP−1, and so it is sufficient to show that ‖Jᵏ‖ → 0. The latter
convergence follows from the expression of the power of a Jordan block given in Lemma A.7,
since each nonzero entry of Jm(λ)ᵏ is of the form \binom{k}{i} λ^{k−i}, which tends to 0 as
k → ∞ when |λ| < 1.

With this result, we can prove Gelfand’s formula, which relates the spectral radius to the
asymptotic growth of ‖Aᵏ‖, and is used in Chapter 2.

Proposition A.10 (Gelfand’s formula). Let A ∈ Cn×n. It holds for any matrix norm that

      lim_{k→∞} ‖Aᵏ‖^{1/k} = ρ(A).

Proof. If ρ(A) = 0, then A is nilpotent by Proposition A.8, so Aᵏ = 0 for all k ≥ n and the
statement is immediate. Assume therefore that ρ(A) > 0, let 0 < ε < ρ(A), and define
A+ = A/(ρ(A) + ε) and A− = A/(ρ(A) − ε). It holds by construction that ρ(A+) < 1 and
ρ(A−) > 1. Using Proposition A.9, we deduce that

      lim_{k→∞} ‖(A+)ᵏ‖ = 0,        lim_{k→∞} ‖(A−)ᵏ‖ = ∞.

In particular, there exists K(ε) ∈ N such that

      ∀k ≥ K(ε),   ‖(A+)ᵏ‖ = ‖Aᵏ‖ / (ρ(A) + ε)ᵏ ≤ 1,        ‖(A−)ᵏ‖ = ‖Aᵏ‖ / (ρ(A) − ε)ᵏ ≥ 1.

Taking the k-th root and rearranging these equations, we obtain

      ∀k ≥ K(ε),   ρ(A) − ε ≤ ‖Aᵏ‖^{1/k} ≤ ρ(A) + ε.

Since ε was arbitrary, this implies the statement.
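
Gelfand’s formula can be illustrated numerically in Julia. In the sketch below, the matrix is an
arbitrary non-normal example for which ‖A‖2 is much larger than ρ(A), so the convergence of
‖Aᵏ‖^{1/k} to ρ(A) is clearly visible.

using LinearAlgebra

A = [0.5 10.0; 0.0 0.6]              # arbitrary non-normal matrix with ρ(A) = 0.6
ρ = maximum(abs.(eigvals(A)))

for k in (1, 10, 100, 1000)
    println("k = $k:  ‖A^k‖^(1/k) = ", opnorm(A^k)^(1/k), "   (ρ(A) = $ρ)")
end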

Appendix B

Brief introduction to Julia

In this chapter, we very briefly present some of the basic features and functions of Julia. Most
of the information contained in this chapter can be found in the online manual, to which we
provide pointers in each section.

Installing Julia
The suggested programming environment for this course is the open-source text editor Visual
Studio Code. You may also use Vim or Emacs, if you are familiar with any of these.

Task 1. Install Visual Studio Code. Install also the Julia and Jupyter Notebook extensions.

Obtaining documentation
To find documentation on a function from the Julia console, type “?” to access “help mode”,
and then the name of the function. Tab completion is helpful for listing available function
names.

Task 2. Read the help pages for if, while and for. More information on these keywords is
available in the online documentation.

Remark B.1 (Shorthand if notation). If there is no elseif clause, it is sometimes convenient


to use the following shorthand notations instead of an if block.

condition = true

# Assign 0 to x if `condition' is true, else assign 2


x = condition ? 0 : 2

# Print "true" if `condition' is true


condition && println("true")


# Print "false" if `condition' is false


condition || println("false")

Installing and using a package [link to relevant manual section]

To install a package from the Julia REPL (Read Evaluate Print Loop, also more simply called
the Julia console), first type “]” to enter the package REPL, and then type add followed by the
name of the package to install. After it has been added, a package can be used with the import
keyword. A function fun defined in a package pack can be accessed as pack.fun. For example,
to plot the cosine function from the Julia console or in a script, write

import Plots
Plots.plot(cos)

Alternatively, a package may be imported with the using keyword, and then functions can
be accessed without specifying the package name. While convenient, this approach is less
descriptive; it does not explicitly show what package a function comes from. For this reason, it
is often recommended to use import, especially in a large codebase.

Task 3. Install the Plots package, read the documentation of the Plots.plot function, and
plot the function f (x) = exp(x). The tutorial on plotting available at this link may be useful for
this exercise.

Remark B.2. We have seen that ? and ] give access to “help mode” and “package mode”,
respectively. Another mode which is occasionally useful is “shell mode”, which is accessed
with the character ; and allows one to type bash commands, such as cd to change directory. See
this part of the manual for additional documentation on Julia modes.

Printing output
The functions println and print can be used to display output. The former adds a new line at
the end and the latter does not. The symbol $, followed by a variable name or an expression
within parentheses, can be employed to perform string interpolation. For instance, the following
code prints a = 2, a^2 = 4.

a = 2
println("a = $a, a^2 = $(a*a)")

To print a matrix in an easily readable format, the display function is very useful.

Defining functions [link to relevant manual section]

Functions can be defined using a function block. For example, the following code block defines
a function that prints “Hello, NAME”, where NAME is the string passed as argument.

function hello(name)
    # Here * is the string concatenation operator
    println("Hello, " * name)
end

# Call the function
hello("Bob")

If the function definition is short, it is convenient to use the following more compact syntax:

hello(name) = println("Hello, " * name)

Sometimes, it is useful to define a function without giving it a name, called an anonymous


function. This can be achieved in Julia using the arrow notation ->. For example, the following
expressions calculate the squares and cubes of the first 5 natural numbers. Here, the function
map transforms a collection by applying a function to each element.

squares = map(x -> x^2, [1, 2, 3, 4, 5])


cubes = map(x -> x^3, [1, 2, 3, 4, 5])

The return keyword can be used for returning a value to the function caller. Several values,
separated by commas, can be returned at once. For instance, the following function takes a
number x and returns a tuple (x, x2 , x3 ).

function powers(x)
    return x, x^2, x^3
end

# This is an equivalent definition in short notation


short_powers(x) = x, x^2, x^3

# This assigns a = 2, b = 4, c = 8
a, b, c = powers(2)

Like many other languages, including Python and Scheme, Julia follows a convention for
argument-passing called “pass-by-sharing”: values passed as arguments to a function are not
copied, and the arguments act as new bindings within the function body. It is possible, therefore,
to modify a value passed as argument, provided this value is of mutable type. Functions
that modify some of their arguments usually end with an exclamation mark !. For example,
the following code prints first [4, 3, 2, 1], because the function sort does not modify its
argument, and then it prints [1, 2, 3, 4], because the function sort! does.

x = [4, 3, 2, 1]
y = sort(x) # y is sorted
println(x); sort!(x); println(x)

Similarly, when displaying several curves in a figure, we first start with the function plot, and
then we use plot! to modify the existing figure.


import Plots
Plots.plot(cos)
Plots.plot!(sin)
As a final example to illustrate argument-passing, consider the following code. Here two
arguments are passed to the function test: an array, which is a mutable value, and an integer,
which is immutable. The instruction arg1[1] = 0 modifies the array to which both a and arg1
are bindings. The instruction arg2 = 2, on the other hand, just causes the variable arg2 to
point to a new immutable value (2), but it does not change the destination of the binding b,
which remains the immutable value 3. Therefore, the code prints [0, 2, 3] and 3.

function test(arg1, arg2)
    arg1[1] = 0
    arg2 = 2
end

a = [1, 2, 3]
b = 3
test(a, b)
println(a, b)

Task 4 (Euler–Mascheroni constant for the harmonic series). Euler showed that

      lim_{N→∞} ( −ln(N) + ∑_{n=1}^{N} 1/n ) = γ := 0.577...

Write a function that returns an approximation of the Euler–Mascheroni constant γ by evaluating
the expression between brackets at a finite value of N.

function euler_constant(N)
    # Your code comes here
end

Task 5 (Tower of Hanoi). We consider a variation on the classic Tower of Hanoi problem, in
which the number r of pegs is allowed to be larger than 3. We denote the pegs by p1, . . . , pr, and
assume that the problem includes n disks with radii 1 to n. The tower is initially constructed
in p1, with the disks arranged in order of decreasing radius, the largest at the bottom. The goal
of the problem is to reconstruct the tower at pr by moving the disks one at a time, with the
constraint that a disk may be placed on top of another only if its radius is smaller.
   It has been conjectured that the optimal solution, which requires the minimum number of
moves, can always be decomposed into the following three steps, for some k ∈ {1, . . . , n − 1}:
• First move the top k disks of the tower to peg p2 ;

• Then move the bottom n − k disks of the tower to pr without using p2 ;

• Finally, move the top of the tower from p2 to pr .


This suggests a recursive procedure for solving the problem, known as the Frame-Stewart algo-
rithm. Write a Julia function T(n, r) returning the minimal number of moves necessary.


Local and global scopes [link to relevant manual section]

Some constructs in Julia introduce scope blocks, notably for and while loops, as well as function
blocks. The variables defined within these structures are not available outside them. For
example

if true
    a = 1
end
println(a)

prints 1, because if does not introduce a scope block, but

for i in [1, 2, 3]
    a = 1
end
println(a)

produces ERROR: LoadError: UndefVarError: a not defined. The variable a defined within
the for loop is said to be in the local scope of the loop, whereas a variable defined outside of it is
in the global scope. In order to modify a global variable from a local scope, the global keyword
must be used. For instance, the following code

a = 1
for i in [1, 2, 3]
    global a += 1
end
println(a)

modifies the global variable a and prints 4.

Multi-dimensional arrays [link to relevant manual section]

A working knowledge of multi-dimensional arrays is important for this course, because vectors
and matrices are ubiquitous in numerical algorithms. In Julia, a two-dimensional array can be
created by writing its lines one by one, separating them with a semicolon ;. Within a line,
elements are separated by a space. For example, the instruction

M = [1 2 3; 4 5 6]

creates the matrix

      M = [ 1  2  3 ]
          [ 4  5  6 ].

More generally, the semicolon enables vertical concatenation while space concatenates horizon-
tally. For example, [M M] defines the matrix

      [ 1  2  3  1  2  3 ]
      [ 4  5  6  4  5  6 ].


The expression M[r, c] gives the (r, c) matrix element of M . The special entry end can be used
to access the last row or column. For instance, M[end-1, end] gives the matrix entry in the
second to last row and the last column. From the matrix M above, the submatrix [2 3; 5 6]
can be obtained with M[:, 2:3]. Here the row index : means “select all lines” and the column
index 2:3 means “select columns 2 to 3”. Likewise, the submatrix [1 3; 4 6] may be extracted
with M[:, [1; 3]].
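
The following lines illustrate these indexing rules on the matrix M defined above; the expected
results are indicated in the comments.

M = [1 2 3; 4 5 6]

M[2, 3]           # 6
M[end - 1, end]   # 3  (second to last row, last column)
M[:, 2:3]         # [2 3; 5 6]
M[:, [1; 3]]      # [1 3; 4 6]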

Remark B.3 (One-dimensional arrays). The comma , can also be employed for creating one-
dimensional arrays, but its behavior differs slightly from that of the vertical concatenation
operator ;. For example, x = [1, [2; 3]] creates a Vector object with two elements, the
first one being 1 and the second one being [2; 3], which is itself a Vector. In contrast, the
instruction x = [1; [2; 3]] creates the same Vector as [1; 2; 3] would.
We also mention that the expression x = [1 2 3] produces not a one-dimensional Vector
but a two-dimensional Matrix, with one row and three columns. This can be checked using
the size function, which for x = [1 2 3] returns the tuple (1, 3).

There are many built-in functions for quickly creating commonly used arrays. For example,

• transpose(M) gives the transpose of M, and adjoint(M) or M' gives the conjugate trans-
pose. For a matrix with real-valued entries, both functions deliver the same result.

• zeros(Int, 4, 5) creates a 4 × 5 matrix of zeros of type Int;

• ones(2, 2) creates a 2 × 2 matrix of ones;

• range(0, 1, length=101) creates an array of size 101 with elements evenly spaced be-
tween 0 and 1 included. More precisely, range returns an array-like object, which can be
converted to a vector using the collect function.

• collect(reshape(1:9, 3, 3)) creates a 3 × 3 matrix with elements

      [ 1  4  7 ]
      [ 2  5  8 ]
      [ 3  6  9 ]

Let us also mention the following shorthand notation, called array comprehension, for creating
vectors and matrices:

• [i^2 for i in 1:5] creates the vector [1, 4, 9, 16, 25].

• [i + 10*j for i in 1:4, j in 1:4] creates the matrix

      [ 11  21  31  41 ]
      [ 12  22  32  42 ]
      [ 13  23  33  43 ]
      [ 14  24  34  44 ]


• [i for i in 1:10 if ispow2(i)] creates the vector [1, 2, 4, 8]. The same result
can be achieved with the filter function: filter(ispow2, 1:10).

In contrast with Matlab, array assignment in Julia does not perform a copy. For example
the following code prints [1, 2, 3, 4], because the instruction b = a defines a new binding
to the array a.

a = [2; 2; 3]
b = a
b[1] = 1
append!(b, 4)
println(a)

A similar behavior applies when passing an array as argument to a function. The copy function
can be used to perform a copy.
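
For example, the following short sketch shows the difference between a new binding and a copy:

a = [1, 2, 3]
b = copy(a)      # b is an independent copy of a
b[1] = 99
println(a)       # prints [1, 2, 3]: a is unchanged
println(b)       # prints [99, 2, 3]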

Task 6. Create a 10 by 10 diagonal matrix with the i-th entry on the diagonal equal to i.

Broadcasting
To conclude this chapter, we briefly discuss broadcasting, which makes it possible to apply functions
to array elements and to perform operations on arrays of different sizes. Julia really shines in this
area, with syntax that is both explicit and concise. Rather than providing a detailed definition
of broadcasting, which is available in this part of the official documentation, we illustrate the
concept using examples. Consider first the following code block:

function welcome(name)
    return "Hello, " * name * "!"
end
result = broadcast(welcome, ["Alice", "Bob"])

Here broadcast returns an array with elements "Hello, Alice!" and "Hello, Bob!", as
would the map function. Broadcasting, however, is much more flexible because it can handle
arrays with different sizes. For instance, broadcast(gcd, 24, [10, 20, 30]) returns an array
of size 3 containing the greatest common divisors of the pairs (24, 10), (24, 20) and (24, 30).
Similarly, the instruction broadcast(+, 1, [1, 2, 3]) returns [2, 3, 4]. To understand
the latter example, note that + (as well as *, - and /) can be called like any other Julia
functions; the notation a + b is just syntactic sugar for +(a, b).
Since broadcasting is so often useful in numerical mathematics, Julia provides a shorthand
notation for it: the instruction broadcast(welcome, ["Alice", "Bob"]) can be written com-
pactly as welcome.(["Alice", "Bob"]). Likewise, the line broadcast(+, 1, [1, 2, 3]) can
be shortened to (+).(1, [1, 2, 3]), or to the more readable expression 1 .+ [1, 2, 3].
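
Broadcasting is particularly convenient for evaluating a function on a grid of points, an operation
that occurs constantly in numerical analysis. A small sketch, where the function and the grid are
arbitrary examples:

f(x) = exp(-x^2)

xs = range(-1, 1, length=5)   # 5 evenly spaced grid points
ys = f.(xs)                   # apply f to every grid point
println(collect(xs))
println(ys)

# Broadcasting also combines arrays of compatible sizes:
# a vector (column) plus a 1 × 3 matrix (row) gives a 3 × 3 matrix of pairwise sums
println([1; 2; 3] .+ [10 20 30])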

Task 7. Explain in words what the following instructions do.

reshape(1:9, 3, 3) .* [1 2 3]
reshape(1:9, 3, 3) .* [1; 2; 3]
reshape(1:9, 3, 3) * [1; 2; 3]

