Urbain Vaes
[email protected]
Weekly schedule:
• Lectures on Tuesday and Thursday from 14:15 to 15:30 (Paris time) in room 406;
The copyright of these notes rests with the author and their contents are made available under
a Creative Commons “Attribution-ShareAlike 4.0 International” license. You are free to copy,
distribute, transform and build upon the course material under the following terms:
• Attribution. You must give appropriate credit, provide a link to the license, and indicate
if changes were made. You may do so in any reasonable manner, but not in any way that
suggests the licensor endorses you or your use.
• ShareAlike. If you remix, transform, or build upon the material, you must distribute
your contributions under the same license as the original.
Course syllabus
Course content. This course is aimed at giving a first introduction to classical topics in numer-
ical analysis, including floating point arithmetic and round-off errors, the numerical solution
of linear and nonlinear equations, iterative methods for eigenvalue problems, interpolation and
approximation of functions, and numerical quadrature. If time permits, we will also cover
numerical methods for solving ordinary differential equations.
Prerequisites. The course assumes a basic knowledge of linear algebra and calculus. Prior
programming experience in Julia, Python or a similar language is desirable but not required.
Study goals. After the course, the students will be familiar with the key concepts of stability,
convergence and computational complexity in the context of numerical algorithms. They will
have gained a broad understanding of the classical numerical methods available for performing
fundamental computational tasks, and be able to produce efficient computer implementations
of these methods.
Education method. The weekly schedule comprises two lectures (2× 1h15 per week) and
an exercise session (1h30 per week). The course material includes rigorous proofs as well as
illustrative numerical examples in the Julia programming language, and the weekly exercises
blend theoretical questions and practical computer implementation tasks.
Assessment. Computational homework will be handed out on a weekly or biweekly basis, each
of them focusing on one of the main topics covered in the course. The homework assignments
will count towards 70% of the final grade, and the final exam will count towards 30%.
Literature and study material. A comprehensive reference for this course is the following
textbook: A. Quarteroni, R. Sacco, and F. Saleri. Numerical mathematics, volume 37 of
Texts in Applied Mathematics. Springer-Verlag, Berlin, 2007. Other pointers to the literature
will be given within each chapter.
Acknowledgments
I am grateful to Vincent Legat, Tony Lelièvre, Gabriel Stoltz and Paul Van Dooren for letting
me draw freely from their lecture notes in numerical analysis. I would also like to thank the
following students who found errors and typos in the lecture notes: Anthony Francis, Claire
Chen, Jinqiao Cheng, Marco He, Wenye Jiang, Tingche Lyu, Nikki Tai, Alice Wang and Anita
Ye.
List of examinable proofs
This list collects the examinable results for this course. It will grow and may be modified as
the course progresses. You don’t have to remember the statements of the results, but should
be able to prove them given the statements.
Chapter 2
• Proposition 2.1 (upper bound on the condition number with respect to perturbations of
the right-hand side);
• Proposition 2.2 (upper bound on the condition number with respect to perturbations of
the matrix), if you are given the preceding Lemma 2.3;
• Lemma 2.5 (explicit expression of the matrix L given the parameters of the Gaussian
transformations).
• Proposition A.9, in the case where A is diagonalizable (equivalence between ρ(A) < 1 and
the convergence $\|A^k\| \to 0$ as $k \to \infty$).
• Proposition 2.10, taking Gelfand's formula for granted (convergence of the general
splitting method).
• Proposition 2.12 and Corollary 2.13 (convergence of the relaxation method for Hermitian
and positive definite A).
• Theorem 2.17 given the Kantorovich inequality (convergence of the steepest descent
method).
Chapter 3
• Theorem 3.2 (global exponential convergence of the fixed point iteration).
• Proposition 3.4 (local exponential convergence of the fixed point iteration under a local
Lipschitz condition).
• Proposition 3.5 (local exponential convergence given bound on the Jacobian matrix).
• Proposition 3.6 (superlinear convergence of fixed point iteration when the Jacobian is zero
at the fixed point).
Chapter 4
• Proposition 4.1 (Convergence of the power iteration).
Chapter 5
• Theorem 5.2 (Interpolation error).
Chapter 6
• Derivation of the Newton–Cotes integration rules (in text).
• Proof of the error estimates for the composite trapezoidal and Simpson rules.
Introduction
• Modeling error. There may be a discrepancy between the mathematical model and the
underlying physical phenomenon.
• Data error. The data of the problem, such as the initial conditions or the parameters
entering the equations, are usually known only approximately.
• Discretization error. The equations of the mathematical model can usually not be solved
exactly and must first be discretized, which introduces an additional error.
• Discrete solver error. The method employed to solve the discretized equations, espe-
cially if it is of iterative nature, may also introduce an error.
• Round-off errors. Finally, the limited accuracy of computer arithmetic causes addi-
tional errors.
Of these, only the last three are in the domain of numerical analysis, and in this course we
focus mainly on the solver and round-off errors. The order of magnitude of the overall error is
dictated by the largest among the above sources of error.
• Floating point arithmetic. In Chapter 1, we discuss how real numbers are represented,
manipulated and stored on a computer. There is an uncountable infinity of real numbers,
but only a finite subset of these can be represented exactly on a machine. This subset
is specified in the IEEE 754 standard, which is widely accepted today and employed in
most programming languages, including Julia.
when evaluated at a discrete set of points. The aim of approximation, on the other hand,
is usually to determine, within a class of simple functions, which one is closest to a given
function. Depending on the metric employed to measure closeness, this may or may not
be a well-defined problem.
Why Julia?
Throughout the course, the Julia programming language is employed to exemplify some of the
methods and key concepts. In the author’s opinion, the Julia language has several advantages
compared to other popular languages in the context of scientific computing, such as Matlab
or Python.
• Its main advantage over Matlab is that it is free and open source, with the byproduct that
it benefits from the contributions of a large community of developers around the world. Ad-
ditionally, Julia is a fully-fledged programming language that can be used for applications
unrelated to mathematics.
• Its main advantages over Python are significantly better performance and a more concise
syntax for mathematical operations, especially those involving vectors and matrices. It
should be recognized, however, that although use of Julia is rapidly increasing, Python
still enjoys a more mature ecosystem and is much more widely used.
Chapter 1
Floating point arithmetic
Introduction
When we study numerical algorithms in the next chapters, we assume implicitly that the op-
erations involved are performed exactly. On a computer, however, only a subset of the real
numbers can be stored and, consequently, many arithmetic operations are performed only ap-
proximately. This is the source of the so-called round-off errors. The rest of this chapter is
organized as follows.
• In Section 1.2, we describe the set of floating point numbers that can be represented in
the usual floating point formats;
• In Section 1.3, we explain how arithmetic operations between floating point numbers be-
have. We emphasize in particular that, in a calculation involving several successive
arithmetic operations, the result of each intermediate operation is stored as a floating
point number, with a possible rounding error.
• In Section 1.4, we briefly present how floating point numbers are encoded according to
the IEEE 754 standard, widely accepted today. We also discuss the encoding of special
values such as Inf, -Inf and NaN.
• Finally, in Section 1.5, we present the standard integer formats and their encoding.
In order to completely describe floating-point arithmetic, one would in principle need to also
discuss the conversion mechanisms between different number formats, as well as a number of
edge cases. Needless to say, a comprehensive discussion of the subject is beyond the scope of
this course; our aim in this chapter is only to introduce the key concepts.
The number x may then be denoted as ±(a−n a−n+1 . . . a−1 a0 .a1 a2 . . . )β , where the subscript β
indicates the base. This numeral system is called the positional notation and is universally used
today, both by humans (usually with β = 10) and machines (usually with β = 2). If the base β
is omitted, it is assumed in this course that β = 10 unless otherwise specified; this is
the decimal representation. The digits a−n, a−n+1, . . . are also called bits if β = 2. In computer
science, several bases other than 10 are regularly employed, for example the following:
• Base 2 (binary) is the usual choice for storing numbers on a machine. The binary format
is convenient because the digits have only two possible values, 0 or 1, and so they can
be stored using simple electrical circuits with two states. We employ the binary notation
extensively in the rest of this chapter. Notice that, just like multiplying and dividing
by 10 is easy in base 10, multiplying and dividing by 2 is very simple in base 2: these
operations amount to shifting all the bits by one position to the left or right, respectively.
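For instance, the integer bit-shift operators in Julia perform exactly these operations; the specific values below are our own illustration.
x = 0b0110      # the integer 6, written in binary
x << 1          # 0x0c = 12: shifting the bits to the left multiplies by 2
x >> 1          # 0x03 = 3: shifting the bits to the right divides by 2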
Example 1.1 (Converting a binary number to decimal notation). Let us calculate the decimal
representation of $x = (0.\overline{10})_2$, where the horizontal line indicates repetition: $x = (0.101010\ldots)_2$.
By definition, it holds that
$$x = \sum_{k=0}^{\infty} a_k 2^{-k}.$$
We recognize on the right-hand side a geometric series with common ratio $r = 2^{-2} = \frac{1}{4}$, and so
we obtain
$$x = \frac{1}{2} \cdot \frac{1}{1-r} = \frac{2}{3} = (0.\overline{6})_{10}.$$
Obtaining the binary representation of a decimal number is more difficult, because negative
powers of 10 have infinite binary representations, as Exercise 1.4 demonstrates. There is,
however, a simple procedure to perform the conversion, which we present for the specific case of
a real number x with decimal representation of the form x = (0.a1 . . . an)10. In this setting, the
bits (b1, b2, . . . ) in the binary representation of $x = (0.b_1 b_2 b_3 \ldots)_2$ may be obtained as follows:
Example 1.2 (Converting a decimal number to binary notation). Let us calculate the binary
representation of $x = \frac{1}{3} = (0.\overline{3})_{10}$. We apply Algorithm 1 and collate the values of i and x
obtained at the beginning of each iteration, i.e. just before Line 3, in the table below.
    i    x      Result
    1    1/3    0.0000…
    2    2/3    0.0100…
    3    1/3    0.0000…
1.1.2 Exercises
� Exercise 1.1. Show that if a number x ∈ R admits a finite representation (1.1) in base β,
then it also admits an infinite representation in the same base. Hint: You may have learned
before that $(0.\overline{9})_{10} = 1$.
� Exercise 1.2. How many digits does it take to represent all the integers from 0 to $10^{10} - 1$
in decimal and binary format? What about the hexadecimal format?
� Exercise 1.5. Implement Algorithm 1 on a computer and verify that it works. Your function
should take as argument an array of integers containing the digits after the decimal point; that
is, an array of the form [a_1, ..., a_n].
� Exercise 1.6. As mentioned above, Algorithm 1 works only for decimal numbers of the
specific form x = (0.a1 . . . an )10 . Find and implement a similar algorithm for integer numbers.
More precisely, write a function that takes an integer n as argument and returns an array
containing the bits of the binary expansion (bm . . . b0 )2 of n, from the least significant b0 to the
most significant bm . That is to say, your code should return [b_0, b_1, ...].
function to_binary(n)
# Your code comes here
end
languages and software comply with the IEEE 754 standard, which requires that the set of
representable numbers be of the form
$$F(p, E_{\min}, E_{\max}) = \bigl\{ (-1)^s\, 2^E\, (b_0.b_1 b_2 \ldots b_{p-1})_2 \;:\; s \in \{0,1\},\; b_i \in \{0,1\}\ \text{and}\ E_{\min} \le E \le E_{\max} \bigr\}. \qquad (1.2)$$
In addition to these, floating point formats provide the special entities Inf, -Inf and NaN,
the latter being an abbreviation for Not a Number. Three parameters appear in the set defini-
tion (1.2). The parameter p ∈ N>0 is the number of significant bits (also called the precision),
and (Emin, Emax) ∈ Z² are respectively the minimum and maximum exponents. From the preci-
sion, the machine epsilon is defined as $\varepsilon_M = 2^{-(p-1)}$; its significance is discussed in Section 1.2.2.
For a number x ∈ F(p, Emin , Emax ), s is called the sign, E is the exponent and b0 .b1 b2 . . . bp−1
is the significand. The latter can be divided into a leading bit b0 and the fraction b1 b2 . . . bp−1 ,
to the right of the binary point. The most widely used floating point formats are the single
and double precision formats, which are called respectively Float32 and Float64 in Julia.
Their parameters, together with those of the lesser-known half-precision format, are summarized
in Table 1.1. In the rest of this section we use the shorthand notation F16 , F32 and F64 . Note
that F16 ⊂ F32 ⊂ F64 .
Table 1.1: Floating point formats. The first column corresponds to the half-precision format.
This format, which is available through the Float16 type in Julia, is more recent than the single
and double precision formats. It was introduced in the 2008 revision to the IEEE 754 standard
of 1985, a revision known as IEEE 754-2008.
Remark 1.1. Some definitions, notably that in [9, Section 2.5.2], include a general base β
instead of the base 2 as an additional parameter in the definition of the number format (1.2).
Since the binary format (β = 2) is always employed in practice, we focus on this case for
simplicity in most of this chapter.
Remark 1.2. Given a real number x ∈ F(p, Emin , Emax ), the exponent E and significand are
generally not uniquely defined. For example, the number 2.0 ∈ F64 may be expressed as
$(-1)^0\, 2^1\, (1.00\ldots00)_2$ or, equivalently, as $(-1)^0\, 2^2\, (0.100\ldots00)_2$.
In Julia, non-integer numbers are interpreted as Float64 by default, which can be ver-
ified by using the typeof function. For example, the instruction “a = 0.1” is equivalent
to “a = Float64(0.1)”. In order to define a number of type Float32, the suffix f0 must be
appended to the decimal expansion. For instance, the instruction “a = 4.0f0” defines a float-
ing point number a of type Float32; it is equivalent to writing “a = Float32(4.0)”.
Definition 1.1 (Absolute and relative error). The absolute error is given by $|x - \widehat{x}|$, whereas
the relative error is
$$\frac{|x - \widehat{x}|}{|x|}.$$
The following result establishes a link between the machine epsilon εM and the relative error between
a real number and the closest member of a floating point format.
Proposition 1.1. Let xmin and xmax denote the smallest and largest non-denormalized pos-
itive numbers in a format F = F(p, Emin , Emax ). If x ∈ [−xmax , −xmin ] ∪ [xmin , xmax ], then
$$\min_{\widehat{x} \in F} \frac{|x - \widehat{x}|}{|x|} \le \frac{1}{2}\, 2^{-(p-1)} = \frac{1}{2}\, \varepsilon_M. \qquad (1.3)$$
Proof. For simplicity, we assume that x > 0; the case x < 0 is similar. Let us introduce $n = \lfloor \log_2(x) \rfloor$ and $y := 2^{-n} x$.
Since y ∈ [1, 2), it has a binary representation of the form $(1.b_1 b_2 \ldots)_2$, where the bits after
the binary point are not all equal to 1 ad infinitum. Thus $x = 2^n (1.b_1 b_2 \ldots)_2$, and from the
assumption that xmin ≤ x ≤ xmax we deduce that Emin ≤ n ≤ Emax. We now define the number
x− ∈ F by truncating the binary expansion of x as follows:
$$x^- = 2^n (1.b_1 \ldots b_{p-1})_2.$$
The distance between x− and its successor in F, which we denote by x+, is given by $2^{n-p+1}$.
Consequently, it holds that
$$(x^+ - x) + (x - x^-) = x^+ - x^- = 2^{n-p+1}.$$
Since both summands on the left-hand side are positive, this implies that either $x^+ - x$ or $x - x^-$
is bounded from above by $\frac{1}{2}\, 2^{n-p+1} \le \frac{1}{2}\, 2^{-p+1} x$, which concludes the proof.
The machine epsilon, which was defined as εM = 2−(p−1) , coincides with the maximum
Figure 1.1: Density of the double-precision floating point numbers, measured here as 1/∆(x)
where, for x ∈ F64 , ∆(x) denotes the distance between x and its successor in F64 .
relative spacing between a non-denormalized floating point number x and its successor in the
floating point format, defined as the smallest number in the format that is strictly larger than x.
Figure 1.1 depicts the density of double-precision floating point numbers, i.e. the number
of F64 members per unit on the real line. The figure shows that the density decreases as
the absolute value of x increases. We also notice that the density is piecewise constant with
discontinuities at powers of 2. Figure 1.2 illustrates the relative spacing between successive
floating point numbers. Although the absolute spacing increases with the absolute value of x,
the relative spacing oscillates between $\frac{1}{2}\varepsilon_M$ and $\varepsilon_M$.
Figure 1.2: Relative spacing between successive double-precision floating point numbers in the
“normal range”. The relative spacing oscillates between $\frac{1}{2}\varepsilon_M$ and $\varepsilon_M$.
The picture of the relative spacing between successive floating point numbers looks quite
different for denormalized numbers. This is illustrated in Figure 1.3, which shows that the
relative spacing increases beyond the machine epsilon in the denormalized range. Fortunately,
in the usual F32 and F64 formats, the transition between non-denormalized and denormalized
numbers occurs at such a small value that it is rarely a concern in practice.
Figure 1.3: Relative spacing between successive double-precision floating point numbers, over a
range which includes denormalized numbers. The vertical red line indicates the transition from
denormalized to non-denormalized numbers.
Remark 1.3. In Julia, the machine epsilon can be obtained using the eps function. For
example, the instruction eps(Float16) returns εM for the half-precision format.
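For instance, the following commands (our own quick checks) return the machine epsilon of each format and confirm that it equals the spacing between 1 and the next representable number:
eps(Float16)                            # 2^-10
eps(Float32)                            # 2^-23
eps(Float64)                            # 2^-52
eps(Float64) == nextfloat(1.0) - 1.0    # true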
1.2.3 Exercises
� Exercise 1.7. Write down the values of the smallest and largest, in absolute value, positive
real numbers representable in the F32 and F64 formats.
� Exercise 1.8 (Relative error and machine epsilon). Prove that the inequality (1.3) is sharp.
To this end, find x ∈ R such that the inequality is an equality.
� Exercise 1.9 (Cardinality of the set of floating point numbers). Show that, if Emax ≥ Emin,
then F(p, Emin, Emax) contains exactly
$$(E_{\max} - E_{\min})\, 2^{p} + 2^{p+1} - 1$$
distinct real numbers. (In particular, the special values Inf, -Inf and NaN are not counted.)
Hint: Count first the numbers with E > Emin and then those with E = Emin .
• Standard case: The number x is rounded to the nearest representable number, if this
number is unique.
• Edge case: When there are two equally near representable numbers in the floating point
format, the one with the least significant bit equal to zero is delivered.
• Infinities: If the real number x is larger than the largest representable number in the
format, that is to say larger than or equal to $x_{\max} = 2^{E_{\max}}(2 - 2^{-p+1})$, then there are two
cases: either $x_{\max}$ or Inf is delivered.
In other words, xmax is delivered if it would be delivered by following the rules of the first
two bullet points in a different floating point format with the same precision but a larger
exponent Emax . A similar rule applies for large negative numbers.
When a binary arithmetic operation (+, −, ×, /) is performed on floating point numbers
in format F, the result delivered by the computer is obtained by rounding the exact result of
the operation according to the rules given above. In other words, the arithmetic operation is
performed as if the computer first calculated an intermediate exact result, and then rounded
this intermediate result in order to provide a final result in F.
Mathematically, arithmetic operations between floating point numbers in a given format F
may be formalized by introducing the rounding operator fl : R → F and by defining, for any
binary operation ◦ ∈ {+, −, ×, /}, the corresponding machine operation
$$x \,\widehat{\circ}\, y := \mathrm{fl}(x \circ y).$$
We defined this operator for arguments in the same floating point format F. If the arguments
of a binary arithmetic operation are of different types, the format of the end result, known as
the destination format, depends on that of the arguments: as a rule of thumb, it is given by
the most precise among the formats of the arguments. In addition, recall that a floating point
literal whose format is not explicitly specified is rounded to double-precision format and so, for
example, the addition 0.1 + 0.1 produces the result $\mathrm{fl}_{64}\bigl(\mathrm{fl}_{64}(0.1) + \mathrm{fl}_{64}(0.1)\bigr)$, where $\mathrm{fl}_{64}$ is the
rounding operator to the F64 format.
Example 1.3. Using the typeof function, we check that the floating point literal 1.0 is indeed
interpreted as a double-precision number:
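A few commands of our own illustrating this behavior:
typeof(1.0)             # Float64: non-integer literals default to double precision
typeof(4.0f0)           # Float32: the f0 suffix selects single precision
typeof(Float32(4.0))    # Float32
typeof(1.0f0 + 1.0)     # Float64: the result takes the more precise format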
When two numbers in different floating point formats are passed to a binary operation, the
result is in the more precise format.
In general, machine addition is not associative:
$$(x \,\widehat{+}\, y) \,\widehat{+}\, z \neq x \,\widehat{+}\, (y \,\widehat{+}\, z).$$
Example 1.4. Let x = 1 and $y = 3 \times 2^{-13}$. Both of these numbers belong to F16 and, denoting
by $\widehat{+}$ machine addition in F16, we have
$$(x \,\widehat{+}\, y) \,\widehat{+}\, y = 1 \qquad (1.4)$$
but
$$x \,\widehat{+}\, (y \,\widehat{+}\, y) = 1 + 2^{-10}. \qquad (1.5)$$
To explain this somewhat surprising result, we begin by writing the normalized representations
of x and y in the F16 format:
$$x = (-1)^0 \times 2^{0} \times (1.0000000000)_2, \qquad y = (-1)^0 \times 2^{-12} \times (1.1000000000)_2.$$
The exact result of the addition x + y is given by $r = 1 + 3 \times 2^{-13}$, which in binary notation is
$$r = (1.\underbrace{00000000000}_{11\ \text{zeros}}11)_2.$$
Since the length of the significand in the half-precision (F16 ) format is only p = 11, this number
is not part of F16. The result of the machine addition $\widehat{+}$ is therefore obtained by rounding r
to the nearest member of F16 , which is 1. This reasoning can then be repeated in order to
conclude that, indeed,
$$(x \,\widehat{+}\, y) \,\widehat{+}\, y = x \,\widehat{+}\, y = 1.$$
In order to explain the result of (1.5), note that the exact result of the addition y + y is
$r = 3 \times 2^{-12}$, which belongs to the floating point format, so it also holds that $y \,\widehat{+}\, y = 3 \times 2^{-12}$.
Therefore,
$$x \,\widehat{+}\, (y \,\widehat{+}\, y) = 1 \,\widehat{+}\, 3 \times 2^{-12} = \mathrm{fl}_{16}(1 + 3 \times 2^{-12}).$$
The argument of the F16 rounding operator does not belong to F16, since its binary represen-
tation is given by
$$(1.\underbrace{0000000000}_{10\ \text{zeros}}11)_2.$$
Rounding this number to the nearest member of F16 gives $1 + 2^{-10}$, which explains (1.5).
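These calculations are easily reproduced in Julia; the snippet below is our own check, using the same numbers as in the example.
x = Float16(1)
y = Float16(3 * 2.0^-13)    # y = 3 × 2⁻¹³, exactly representable in F16
(x + y) + y                 # Float16(1.0), in agreement with (1.4)
x + (y + y)                 # 1 + 2⁻¹⁰, in agreement with (1.5)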
When a numerical computation unexpectedly returns Inf or -Inf, we say that an overflow
error occurred. Similarly, underflow occurs when the result of a computation is smaller in
absolute value than the smallest positive representable number in a floating point format.
1.3.1 Exercises
� Exercise 1.10. Calculate the machine epsilon ε16 for the F16 format. Write the results of
the arithmetic operations $1 \,\widehat{+}\, \varepsilon_{16}$ and $1 \,\widehat{-}\, \varepsilon_{16}$ in F16 normalized representation.
� Exercise 1.11 (Catastrophic cancellation). Let ε16 be the machine epsilon for the F16
format, and define $y = \frac{3}{4}\varepsilon_{16}$. What is the relative error between $\Delta = (1 + y) - 1$ and the
machine approximation $\widehat{\Delta} = (1 \,\widehat{+}\, y) \,\widehat{-}\, 1$?
� Exercise 1.12 (Numerical differentiation). Let f (x) = exp(x). By definition, the derivative
of f at 0 is
$$f'(0) = \lim_{\delta \to 0} \frac{f(\delta) - f(0)}{\delta}.$$
It is natural to use the difference quotient on the right-hand side with a small but
nonzero δ as an approximation for $f'(0)$. Implement this approach using double-precision
numbers and the same values for δ as in the table below. Explain the results you obtain.
    δ                 ε64/4    ε64/2    ε64
    computed f'(0)    0        2        1
� Exercise 1.13 (Avoiding overflow). Write a code to calculate the weighted average
$$S := \frac{\sum_{j=0}^{J} w_j\, j}{\sum_{j=0}^{J} w_j}, \qquad w_j = \exp(j), \qquad J = 1000.$$
� Exercise 1.14 (Calculating the sample variance). Assume that $(x_n)_{1 \le n \le N}$, with $N = 10^6$,
are independent random variables distributed according to the uniform distribution U(L, L + 1).
That is, each xn takes a random value uniformly distributed between L and L + 1, where $L = 10^9$.
In Julia, these samples can be generated with the following lines of code:
N, L = 10^6, 10^9
x = L .+ rand(N)
$$s^2 = \frac{1}{N-1}\left(\sum_{n=1}^{N} x_n^2 - N\, \bar{x}^2\right), \qquad \bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n. \qquad (1.6)$$
Write a computer code to calculate $s^2$ with the best possible accuracy. Can you find a formula
that enables better accuracy than (1.6)?
Remark 1.4. In order to estimate the true value of $s^2$ for your samples, you can use the
BigFloat format, to which the array x can be converted by using the instruction x = BigFloat.(x).
� Exercise 1.15. Recall that
$$\frac{\pi^2}{6} = \lim_{N \to \infty} \sum_{n=1}^{N} \frac{1}{n^2}.$$
Using the default Float64 format, estimate the error obtained when the series on the right-hand
side is truncated after $10^{10}$ terms. Can you rearrange the sum for best accuracy?
� Exercise 1.16. Let x and y be positive real numbers in the interval $[2^{-10}, 2^{10}]$ (so that we
do not need to worry about denormalized numbers, assuming we are working in single or double
precision), and let us define the machine addition operator $\widehat{+}$ for arguments in the real numbers as
$$x \,\widehat{+}\, y := \mathrm{fl}\bigl(\mathrm{fl}(x) + \mathrm{fl}(y)\bigr).$$
Prove the following bound on the relative error between the sum x + y and its machine approx-
imation $x \,\widehat{+}\, y$:
$$\frac{\bigl|(x + y) - (x \,\widehat{+}\, y)\bigr|}{|x + y|} \le \frac{\varepsilon_M}{2}\left(2 + \frac{\varepsilon_M}{2}\right).$$
Hint: decompose the numerator as
$$(x + y) - (x \,\widehat{+}\, y) = \Bigl((x + y) - \bigl(\mathrm{fl}(x) + \mathrm{fl}(y)\bigr)\Bigr) + \Bigl(\bigl(\mathrm{fl}(x) + \mathrm{fl}(y)\bigr) - (x \,\widehat{+}\, y)\Bigr).$$
� Exercise 1.17. Does the Julia expression Float32(0.1) * 10 evaluate exactly to 1?
Solution. By default, real numbers are rounded to the nearest floating point number. This can
be checked in Julia with the command rounding(Float32), which prints the default rounding
mode. The exact binary representation of the real number x = 0.1 is
$$x = (0.000\overline{1100})_2 = 2^{-4} \times (1.\underbrace{10011001100110011001100}_{24\ \text{bits}}1100\ldots)_2.$$
The first task is to determine the member of F32 that is nearest to x. We have
$$x^- = \max\{\tilde{x} \in F_{32} : \tilde{x} \le x\} = 2^{-4} \times (1.10011001100110011001100)_2,$$
$$x^+ = \min\{\tilde{x} \in F_{32} : \tilde{x} \ge x\} = 2^{-4} \times (1.10011001100110011001101)_2.$$
Since the number $(0.\overline{1100})_2$ is closer to 1 than to 0, the number x is closer to x+ than to x−.
Therefore, the number obtained when writing Float32(0.1) is x+. To conclude the exercise,
we need to calculate $\mathrm{fl}\bigl(10 \times x^+\bigr)$, and to this end we first write the exact binary representation
Remark 1.5. It should not be inferred from Exercise 1.17 that Float32(1/i) * i is always
exact in floating point arithmetic. For example Float32(1/41) * 41 does not evaluate to 1,
and neither do Float16(1/11) * 11 and Float64(1/49) * 49.
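For instance, one may check at the Julia REPL (our own quick test, consistent with the remark and with Exercise 1.17):
Float32(1/10) * 10    # exactly 1.0f0
Float32(1/41) * 41    # not exactly 1
Float16(1/11) * 11    # not exactly 1
Float64(1/49) * 49    # not exactly 1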
$$x = (1.\underbrace{01101010000010011110011}_{24\ \text{bits}}001100\ldots)_2.$$
The first task is to determine the member of F32 that is nearest to x. We have
$$x^- = \max\bigl\{x \in F_{32} : x \le \sqrt{2}\bigr\} = (1.\underbrace{01101010000010011110011}_{24\ \text{bits}})_2,$$
$$x^+ = \min\bigl\{x \in F_{32} : x \ge \sqrt{2}\bigr\} = (1.\underbrace{01101010000010011110100}_{24\ \text{bits}})_2,$$
and we calculate
$$x - x^- = 2^{-24}\,(0.01100\ldots)_2,$$
$$x^+ - x = 2^{-21}\bigl(1 - (0.11001100\ldots)_2\bigr) \ge 2^{-21}\bigl(1 - (0.11001101)_2\bigr) = 2^{-21}\,(0.00110011)_2.$$
$$(x^-)^2 = (1.\underbrace{11111111111111111111111}_{24\ \text{bits}}011011)_2.$$
$$(1.11111111111111111111111)_2 = 2 - 2^{-23},$$
Once a number format is specified through parameters (p, Emin , Emax ), the choice of encoding,
i.e. the machine representation of numbers in this format, has no bearing on the magnitude and
propagation of round-off errors. Studying encodings is, therefore, not essential for our purposes
in this course, but we opted to cover the topic anyway in the hope that it will help the students
build intuition on floating point numbers. We focus mainly on the single precision format, but
the following discussion applies mutatis mutandis to the double and half-precision formats. The
material in this section is for information purposes only.
We already mentioned in Remark 1.2 that a number in a floating point format may have
several representations. On a computer, however, a floating point number is always stored in
the same manner (except for the number 0, see Remark 1.7). The values of the exponent and
significand which are selected by the computer, in the case where there are several possible
choices, are determined from the following rules:
The following result proves that these rules define the exponent and significand uniquely.
Suppose that a number in the format $F(p, E_{\min}, E_{\max})$ admits the two representations
$$(-1)^s\, 2^E\, (b_0.b_1 \ldots b_{p-1})_2 = (-1)^{\tilde{s}}\, 2^{\tilde{E}}\, (\tilde{b}_0.\tilde{b}_1 \ldots \tilde{b}_{p-1})_2, \qquad (1.7)$$
where the parameter sets $(s, E, b_0, \ldots, b_{p-1})$ and $(\tilde{s}, \tilde{E}, \tilde{b}_0, \ldots, \tilde{b}_{p-1})$ both satisfy the above rules.
Then $E = \tilde{E}$ and $b_i = \tilde{b}_i$ for $i \in \{0, \ldots, p-1\}$.
Proof. We show that $E = \tilde{E}$, after which the equality of significands follows trivially. Let us
assume for contradiction that $E > \tilde{E}$ and denote the left and right-hand sides of (1.7) by x
and $\tilde{x}$, respectively. Then $E > E_{\min}$, implying that $b_0 = 1$ and so $2^E \le |x| < 2^{E+1}$. On the
other hand, it holds that $|\tilde{x}| < 2^{\tilde{E}+1}$ regardless of whether $\tilde{E} = E_{\min}$ or not. Since $E \ge \tilde{E} + 1$,
this gives $|x| \ge 2^E \ge 2^{\tilde{E}+1} > |\tilde{x}| = |x|$, a contradiction.
Now that we have explained how a unique set of parameters (sign, exponent, significand) can
be assigned to any floating point number, we describe how these parameters are stored on the
computer in practice. As their names suggest, the Float16, Float32 and Float64 formats use 16,
32 and 64 bits of memory, respectively. A naive approach for encoding these number formats
would be to store the full binary representations of the sign, exponent and significand.
For the Float32 format, this approach would require 1 bit for the sign, 8 bits to cover the
254 possible values of the exponent, and 24 bits for the significand, i.e. for storing b0, . . . , bp−1.
This leads to a total number of 33 bits, which is one more than is available, and this is without
the special values NaN, Inf and -Inf. So how are numbers in the F32 format actually stored?
To answer this question, we begin with two observations:
• In the F32 format, 8 bits at minimum need to be reserved for the exponent, which enables
the representation of $2^8 = 256$ different values, but there are only 254 possible values for
the exponent. This suggests that 256−254 = 2 combinations of the 8 bits can be exploited
in order to represent the special values Inf, -Inf and NaN.
Simplifying a little, we may view a single precision floating point number as an array of 32
bits as illustrated below:
According to the IEEE 754 standard, the first bit is the sign s, the next 8 bits e0 e1 . . . e6 e7 encode
the exponent, and the last 23 bits b1 b2 . . . bp−2 bp−1 encode the significand. Let us introduce the
integer number $e = (e_0 e_1 \ldots e_6 e_7)_2$; that is to say, $0 \le e \le 2^8 - 1$ is the integer number whose
binary representation is given by e0 e1 . . . e6 e7 . One may determine the exponent and significand
of a floating point number from the following rules.
• Denormalized numbers: If e = 0, then the implicit leading bit b0 is zero, the frac-
tion is b1 b2 . . . bp−2 bp−1 , and the exponent is E = Emin . In other words, using the
notation of Section 1.2, we have x = (−1)s 2Emin (0.b1 b2 . . . bp−2 bp−1 )2 . In particular, if
b1 b2 . . . bp−2 bp−1 = 00 . . . 00, then it holds that x = 0.
• Non-denormalized numbers: If 0 < e < 255, then the implicit leading bit b0 of the
significand is 1 and the fraction is given by b1 b2 . . . bp−2 bp−1 . The exponent is given by
$$E = e - \text{bias} = E_{\min} + e - 1,$$
where the exponent biases for the single and double precision formats are given in Table 1.2.
In this case $x = (-1)^s\, 2^{e - \text{bias}}\, (1.b_1 b_2 \ldots b_{p-2} b_{p-1})_2$. Notice that E = Emin if e = 1, as in the
case of subnormal numbers.
• Infinities: If e = 255 and b1 b2 . . . bp−2 bp−1 = 00 . . . 00, then x = Inf if s = 0 and -Inf
otherwise.
• Not a Number: If e = 255 and b1 b2 . . . bp−2 bp−1 6= 00 . . . 00, then x = NaN. Notice that
the special value NaN can be encoded in many different manners. These extra degrees of
freedom were reserved for passing information on the reason for the occurrence of NaN,
which is usually an indication that something has gone wrong in the calculation.
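These rules can be checked by hand with the bitstring function, which returns the raw bits of its argument; the particular number below is our own choice.
x = -0.15625f0                    # = -(1.01)₂ × 2⁻³
s = bitstring(x)                  # "10111110001000000000000000000000"
s[1]                              # sign bit: '1', so x is negative
e = parse(Int, s[2:9]; base=2)    # e = 124, hence E = e - 127 = -3
s[10:32]                          # fraction "01000000000000000000000", i.e. significand (1.01)₂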
Remark 1.6 (Encoding efficiency). With 32 bits, at most $2^{32}$ different numbers could in
principle be represented. In practice, as we saw in Exercise 1.9, the Float32 format makes it
possible to represent
$$(E_{\max} - E_{\min})\, 2^{p} + 2^{p+1} - 1 = 253 \times 2^{24} + 2^{25} - 1 = 2^{32} - 2^{24} - 1 \approx 99.6\% \times 2^{32}$$
distinct real numbers.
Remark 1.7 (Nonuniqueness of the floating point representation of 0.0). The sign s is clearly
unique for any number in a floating point format, except for 0.0, which could in principle be
represented as $(-1)^0\, 2^{E_{\min}}\, (0.00\ldots0)_2$ or as $(-1)^1\, 2^{E_{\min}}\, (0.00\ldots0)_2$.
In practice, both representations of 0.0 are available on most machines, and these behave
slightly differently. For example 1/(0.0) = Inf but 1/(-0.0) = -Inf.
• $x_1 = 2^{E_{\min}}$
• $x_3 = 2^{E_{\max}}(2 - 2^{-p+1})$
The machine representation of integer formats is much simpler than that of floating point
numbers. In this short section, we give a few orders of magnitude for common integer formats
and briefly discuss overflow issues. Programming languages typically provide integer formats
based on 16, 32 and 64 bits. In Julia, these correspond to the types Int16, Int32 and Int64, the
latter being the default for integer literals.
The most common encoding for integer numbers, which is used in Julia, is known as two’s
complement: a number encoded with p bits as $b_{p-1} b_{p-2} \ldots b_0$ corresponds to
$$x = -b_{p-1}\, 2^{p-1} + \sum_{i=0}^{p-2} b_i\, 2^i.$$
This encoding makes it possible to represent uniquely all the integers from $N_{\min} = -2^{p-1}$ to $N_{\max} =
2^{p-1} - 1$. In contrast with floating point formats, integer formats do not provide special values
like Inf and NaN. The number delivered by the machine when a calculation exceeds the maxi-
mum representable value in the format, called the overflow behavior, generally depends on the
programming language.
Since the overflow behavior of integer numbers is not universal across programming languages, a
detailed discussion is of little interest. We only mention that Julia uses a wraparound behavior,
where Nmax + 1 silently returns Nmin and, similarly, Nmin − 1 gives Nmax; the numbers loop
back. This can lead to unexpected results, such as 2^64 evaluating to 0.
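The following commands (our own checks) illustrate the two's complement encoding and the wraparound behavior:
bitstring(Int8(-1))                     # "11111111": two's complement representation of -1
typemax(Int64) + 1 == typemin(Int64)    # true: Nmax + 1 wraps around to Nmin
2^64                                    # 0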
Chapter 2
Solution of linear systems of equations
2.1 Conditioning
2.2 Direct solution method
    2.2.1 LU decomposition
    2.2.2 Backward and forward substitution
    2.2.3 Gaussian elimination with pivoting
    2.2.4 Direct method for symmetric positive definite matrices
    2.2.5 Direct methods for banded matrices
    2.2.6 Exercises
2.3 Iterative methods for linear systems
    2.3.1 Basic iterative methods
    2.3.2 The conjugate gradient method
    2.3.3 Exercises
2.4 Discussion and bibliography
Introduction
This chapter is devoted to the numerical solution of linear problems of the following form:
$$Ax = b, \qquad A \in \mathbb{R}^{n \times n}, \quad b \in \mathbb{R}^{n}. \qquad (2.1)$$
Systems of this type appear in a variety of applications. They naturally arise in the context
of linear partial differential equations, which we use as our main motivating example. Partial
differential equations govern a wide range of physical phenomena including heat propagation,
gravity, and electromagnetism, to mention just a few. Linear systems in this context often
have a particular structure: the matrix A is generally very sparse, which means that most of
the entries are equal to 0, and it is often symmetric and positive definite, provided that these
properties are satisfied by the underlying operator.
There are two main approaches for solving linear systems:
• Direct methods make it possible to compute the exact solution to systems of linear equations, up to
round-off errors. Although this is an attractive property, direct methods are usually too
computationally costly for large systems: the cost of inverting a general n × n matrix,
measured in number of floating point operations, scales as $n^3$!
• Iterative methods, on the other hand, make it possible to compute progressively more accu-
rate approximations of the solution. Iterations may be stopped once the residual is
sufficiently small. These methods are often preferable when the dimension n of the linear
system is very large.
• In Section 2.1, we introduce the concept of conditioning. The condition number of a matrix
provides information on the sensitivity of the solution to perturbations of the right-hand
side b or matrix A. It is useful, for example, in order to determine the potential impact
of round-off errors.
• In Section 2.2, we present the direct method for solving systems of linear equations. We
study in particular the LU decomposition for an invertible matrix, as well as its variant
for symmetric positive definite matrices, which is called the Cholesky decomposition.
• In Section 2.3, we present iterative methods for solving linear systems. We focus in
particular on basic iterative methods based on a splitting, and on the conjugate gradient
method.
2.1 Conditioning
The condition number for a given problem measures the sensitivity of the solution to the
input data. In order to define this concept precisely, we consider a general problem of the
form F (x, d) = 0, with unknown x and data d. The linear system (2.1) can be recast in this
form, with the input data equal to b or A or both. We denote the solution corresponding to
perturbed input data d + ∆d by x + ∆x. The absolute and relative condition numbers are
defined as follows.
Definition 2.1 (Condition number for the problem F (x, d) = 0). The absolute and relative
condition numbers with respect to perturbations of d are defined as
$$K_{\mathrm{abs}}(d) = \lim_{\varepsilon \to 0} \sup_{\|\Delta d\| \le \varepsilon} \left(\frac{\|\Delta x\|}{\|\Delta d\|}\right), \qquad K(d) = \lim_{\varepsilon \to 0} \sup_{\|\Delta d\| \le \varepsilon} \left(\frac{\|\Delta x\| / \|x\|}{\|\Delta d\| / \|d\|}\right).$$
The short notation K is reserved for the relative condition number, which is often more useful
in applications.
In the rest of this section, we obtain an upper bound on the relative condition number for
the linear system (2.1) with respect to perturbations first of b, and then of A. We use the
notation k•k to denote both a vector norm on Rn and the induced operator norm on matrices.
Proposition 2.1 (Perturbation of the right-hand side). Let x + ∆x denote the solution to
the perturbed equation A(x + ∆x) = b + ∆b. Then it holds that
$$\frac{\|\Delta x\|}{\|x\|} \le \|A\|\, \|A^{-1}\|\, \frac{\|\Delta b\|}{\|b\|}. \qquad (2.2)$$
Proof. It holds that
$$\|\Delta x\| = \|A^{-1}\Delta b\| \le \|A^{-1}\|\, \|\Delta b\| = \frac{\|Ax\|}{\|b\|}\, \|A^{-1}\|\, \|\Delta b\| \le \frac{\|A\|\, \|x\|}{\|b\|}\, \|A^{-1}\|\, \|\Delta b\|. \qquad (2.3)$$
Here we employed (A.7), proved in Appendix A, in the first and last inequalities. Rearranging
the inequality (2.3), we obtain (2.2).
Proposition 2.1 implies that the relative condition number of (2.1) with respect to perturbations
of the right-hand side is bounded from above by kAkkA−1 k. Exercise 2.1 shows that there are
values of x and ∆b for which the inequality (2.2) is sharp.
Studying the impact of perturbations of the matrix A is slightly more difficult, because
this time the variation ∆x of the solution does not depend linearly on the perturbation.
Proposition 2.2 (Perturbation of the matrix). Let x+∆x denote the solution to the perturbed
equation (A + ∆A)(x + ∆x) = b. If A is invertible and k∆Ak < kA−1 k−1 , then
$$\frac{\|\Delta x\|}{\|x\|} \le \|A\|\, \|A^{-1}\|\, \frac{\|\Delta A\|}{\|A\|} \left(\frac{1}{1 - \|A^{-1}\Delta A\|}\right). \qquad (2.4)$$
Lemma 2.3. Let $B \in \mathbb{R}^{n \times n}$ be such that $\|B\| < 1$. Then $I - B$ is invertible and
$$\|(I - B)^{-1}\| \le \frac{1}{1 - \|B\|}. \qquad (2.5)$$
Proof. Note that, for every $n \in \mathbb{N}$,
$$I - B^{n+1} = (I - B)(I + B + \cdots + B^n).$$
Since kBk < 1 in a submultiplicative matrix norm, both sides of the equation are convergent
in the limit as n → ∞, with the left-hand side converging to identity matrix I. Equating the
limits, we obtain
$$I = (I - B) \sum_{i=0}^{\infty} B^i.$$
This implies that (I − B) is invertible with inverse given by a so-called Neumann series
$$(I - B)^{-1} = \sum_{i=0}^{\infty} B^i.$$
Applying the triangle inequality repeatedly, and then using the submultiplicative property of
the norm, we obtain
$$\forall n \in \mathbb{N}, \qquad \left\|\sum_{i=0}^{n} B^i\right\| \le \sum_{i=0}^{n} \|B^i\| \le \sum_{i=0}^{n} \|B\|^i = \frac{1 - \|B\|^{n+1}}{1 - \|B\|} \le \frac{1}{1 - \|B\|},$$
where we used the summation formula for geometric series in the last equality. Letting n → ∞
and using the continuity of the norm concludes the proof.
Proof of Proposition 2.2. Subtracting Ax = b from the perturbed equation and rearranging, we obtain
$$(I + A^{-1}\Delta A)\, \Delta x = -A^{-1}\Delta A\, x.$$
Since $\|A^{-1}\Delta A\| \le \|A^{-1}\|\, \|\Delta A\| < 1$ by assumption, we deduce from Lemma 2.3 that the matrix
on the left-hand side is invertible with a norm bounded as in (2.5). Consequently, using in
addition the assumed submultiplicative property of the norm, we obtain that
$$\|\Delta x\| = \bigl\|(I + A^{-1}\Delta A)^{-1} A^{-1}\Delta A\, x\bigr\| \le \frac{\|A^{-1}\Delta A\|}{1 - \|A^{-1}\Delta A\|}\, \|x\|.$$
Using Proposition 2.2, we deduce that the relative condition number of (2.1) with respect
to perturbations of the matrix A is also bounded from above by kAkkA−1 k, because the term
between brackets on the right-hand side of (2.4) converges to 1 as k∆Ak → 0.
Propositions 2.1 and 2.2 show that the condition number, with respect to perturbations of
either b or A, depends only on A. This motivates the following definition.
Definition 2.2 (Condition number of a matrix). The condition number of a matrix A asso-
ciated to a vector norm k•k is defined as
$$\kappa(A) = \|A\|\, \|A^{-1}\|.$$
The condition number for the p-norm, defined in Definition A.3, is denoted by κp (A).
Note that the condition number κ(A) associated with an induced norm is at least one. Indeed,
since the identity matrix has induced norm 1, it holds that
$$1 = \|I\| = \|A A^{-1}\| \le \|A\|\, \|A^{-1}\| = \kappa(A).$$
Since the 2-norm of an invertible matrix $A \in \mathbb{R}^{n \times n}$ coincides with the square root of the spectral
radius $\rho(A^T A)$, it holds for a symmetric positive definite matrix A that
$$\kappa_2(A) = \frac{\lambda_{\max}}{\lambda_{\min}},$$
where λmax and λmin are the maximal and minimal (both real and positive) eigenvalues of A.
Example 2.1 (Perturbation of the matrix). Consider the following linear system with perturbed
matrix:
$$(A + \Delta A)\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ .01 \end{pmatrix}, \qquad A = \begin{pmatrix} 1 & 0 \\ 0 & .01 \end{pmatrix}, \qquad \Delta A = \begin{pmatrix} 0 & 0 \\ 0 & \varepsilon \end{pmatrix},$$
where $0 < \varepsilon \ll .01$. Here the eigenvalues of A are given by λ1 = 1 and λ2 = 0.01. The solution
when ε = 0 is given by $(0, 1)^T$, and the solution to the perturbed equation is
$$\begin{pmatrix} x_1 + \Delta x_1 \\ x_2 + \Delta x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ \frac{1}{1 + 100\varepsilon} \end{pmatrix}.$$
In this case, the relative impact of perturbations of the matrix is close to κ2(A) = 100.
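This behavior is easily verified numerically. The short script below is our own illustration, using ε = 10⁻⁶ and the cond and opnorm functions from Julia's LinearAlgebra standard library:
using LinearAlgebra
ε = 1e-6
A  = [1.0 0.0; 0.0 0.01]
ΔA = [0.0 0.0; 0.0 ε]
b  = [0.0, 0.01]
x  = A \ b                     # exact solution (0, 1)
xp = (A + ΔA) \ b              # perturbed solution (0, 1/(1 + 100ε))
# Relative change in the solution divided by the relative perturbation of A
(norm(xp - x) / norm(x)) / (opnorm(ΔA) / opnorm(A))   # ≈ 100 = cond(A, 2)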
� Exercise 2.1. In the simple case where A is symmetric, find values of x, b and ∆b for
which the inequality (2.2) is in fact an equality.
• First calculate the so-called LU decomposition of A, i.e. find an upper triangular matrix U
and a unit lower triangular matrix L such that A = LU. A unit lower triangular matrix is
a lower triangular matrix with only ones on the diagonal.
• Then solve successively the two triangular systems Ly = b and Ux = y, by forward and
backward substitution respectively.
By construction, the solution x thus obtained is a solution to (2.1). Indeed, we have that
Ax = LUx = Ly = b.
2.2.1 LU decomposition
In this section, we first discuss the existence and uniqueness of the LU factorization. We
then describe a numerical algorithm for calculating the factors L and U, based on Gaussian
elimination.
We present a necessary and sufficient condition for the existence of a unique LU decomposition
of a matrix. To this end, we define the principal submatrix of order i of a matrix A ∈ Rn×n as
the matrix Ai = A[1 : i, 1 : i], in Julia notation.
Proposition 2.4. The LU factorization of a matrix A ∈ Rn×n exists and is unique if and
only if the principal submatrices of A of all orders are nonsingular.
Proof. We prove only the “if” direction; see [9, Theorem 3.4] for the “only if” implication.
The statement is clear if n = 1. Reasoning by induction, we assume that the result is
proved up to n − 1. Since the matrix An−1 and all its principal submatrices are nonsingular
by assumption, it holds that An−1 = Ln−1 Un−1 for a unit lower triangular matrix Ln−1 and an
upper triangular matrix Un−1 . These two matrices are nonsingular, for if either of them were
singular then the product An−1 = Ln−1 Un−1 would be singular as well. Let us decompose A as
follows:
$$A = \begin{pmatrix} A_{n-1} & c \\ d^T & a_{nn} \end{pmatrix}.$$
Let $\ell$ and u denote the solutions to $L_{n-1} u = c$ and $U_{n-1}^T \ell = d$. These solutions exist and are
unique, because the matrices Ln−1 and Un−1 are nonsingular. Letting $u_{nn} = a_{nn} - \ell^T u$, we
check that A factorizes as
$$A = \begin{pmatrix} L_{n-1} & 0 \\ \ell^T & 1 \end{pmatrix} \begin{pmatrix} U_{n-1} & u \\ 0 & u_{nn} \end{pmatrix}.$$
This completes the proof of the existence of the decomposition. The uniqueness of the factors
follows from the uniqueness of `, u and unn .
Proposition 2.4 raises the following question: are there classes of matrices whose principal
matrices are all nonsingular? The answer is positive, and we mention, as an important example,
the class of positive definite matrices. Proving this is the aim of Exercise 2.4.
Definition 2.3. A Gaussian transformation is a matrix of the form Mk = I−c(k) eTk , where ek
is the column vector with entry at index k equal to 1 and all the other entries equal to zero,
$$M_k T = \bigl(I - c^{(k)} e_k^T\bigr) T =
\begin{pmatrix}
1 & & & & & \\
& \ddots & & & & \\
& & 1 & & & \\
& & -c^{(k)}_{k+1} & 1 & & \\
& & \vdots & & \ddots & \\
& & -c^{(k)}_{n} & & & 1
\end{pmatrix}
\begin{pmatrix} r^{(1)} \\ \vdots \\ r^{(k)} \\ r^{(k+1)} \\ \vdots \\ r^{(n)} \end{pmatrix}
=
\begin{pmatrix} r^{(1)} \\ \vdots \\ r^{(k)} \\ r^{(k+1)} - c^{(k)}_{k+1} r^{(k)} \\ \vdots \\ r^{(n)} - c^{(k)}_{n} r^{(k)} \end{pmatrix}.$$
We show in Exercise 2.2 that the inverse of a Gaussian transformation matrix is given by
$$M_k^{-1} = I + c^{(k)} e_k^T. \qquad (2.7)$$
The idea of the Gaussian elimination algorithm is to successively left-multiply A with Gaussian
transformation matrices M1 , then M2 , etc. appropriately chosen in such a way that the ma-
trix A(k) , obtained after k iterations, is upper triangular up to column k. That is to say, the
Gaussian transformations are constructed so that all the entries in columns 1 to k under the
diagonal of the matrix A(k) are equal to zero. The resulting matrix A(n−1) after n − 1 iterations
is then upper triangular and satisfies
$$A^{(n-1)} = M_{n-1} \cdots M_1 A,$$
or equivalently
$$A = \bigl(M_1^{-1} \cdots M_{n-1}^{-1}\bigr)\, A^{(n-1)}.$$
The first factor is lower triangular by (2.7) and Exercise 2.3. The product in the definition of
the matrix L admits a simple explicit expression.
$$c^{(i)} e_i^T\, c^{(j)} e_j^T = c^{(i)} \bigl(e_i^T c^{(j)}\bigr) e_j^T = c^{(i)}\, 0\, e_j^T = 0.$$
A corollary of Lemma 2.5 is that all the diagonal entries of the lower triangular matrix L
are equal to 1; the matrix L is unit lower triangular. The full expression of the matrix L given
the Gaussian transformations is
$$L = I + \begin{pmatrix} c^{(1)} & \cdots & c^{(n-1)} & 0_n \end{pmatrix} =
\begin{pmatrix}
1 & & & & & \\
c^{(1)}_2 & 1 & & & & \\
c^{(1)}_3 & c^{(2)}_3 & 1 & & & \\
c^{(1)}_4 & c^{(2)}_4 & c^{(3)}_4 & 1 & & \\
\vdots & \vdots & \vdots & & \ddots & \\
c^{(1)}_n & c^{(2)}_n & c^{(3)}_n & \cdots & c^{(n-1)}_n & 1
\end{pmatrix}. \qquad (2.8)$$
Therefore, the Gaussian elimination algorithm, if all the steps are well-defined, correctly gives
the LU factorization of the matrix A. Of course, the success of the strategy outlined above
for the calculation of the LU factorization hinges on the existence of an appropriate Gaussian
transformation at each iteration. It is not difficult to show that, if the (k + 1)-th diagonal
entry of the matrix A(k) is nonzero for all k ∈ {1, . . . , n − 2}, then the Gaussian transformation
matrices exist and are uniquely defined.
Lemma 2.6. Assume that $A^{(k)}$ is upper triangular up to column k included, with k ≤ n − 2.
If $a^{(k)}_{k+1,k+1} \neq 0$, then there is a unique Gaussian transformation matrix $M_{k+1}$ such that $M_{k+1} A^{(k)}$
is upper triangular up to column k + 1 included.
Proof. We perform the multiplication explicitly. Denoting by $(r^{(i)})_{1 \le i \le n}$ the rows
of $A^{(k)}$, we have
$$M_{k+1} A^{(k)} =
\begin{pmatrix}
1 & & & & & \\
& \ddots & & & & \\
& & 1 & & & \\
& & -c^{(k+1)}_{k+2} & 1 & & \\
& & \vdots & & \ddots & \\
& & -c^{(k+1)}_{n} & & & 1
\end{pmatrix}
\begin{pmatrix} r^{(1)} \\ \vdots \\ r^{(k+1)} \\ r^{(k+2)} \\ \vdots \\ r^{(n)} \end{pmatrix}
=
\begin{pmatrix} r^{(1)} \\ \vdots \\ r^{(k+1)} \\ r^{(k+2)} - c^{(k+1)}_{k+2} r^{(k+1)} \\ \vdots \\ r^{(n)} - c^{(k+1)}_{n} r^{(k+1)} \end{pmatrix}.$$
We need to show that the matrix on the right-hand side is upper triangular up to column k + 1
included. This is clear by definition of c(k+1) and from the fact that A(k) is upper triangular up
to column k by assumption.
The diagonal elements $a^{(k)}_{k+1,k+1}$, where $k \in \{0, \ldots, n-2\}$, are called the pivots. We now
prove that, if an invertible matrix A admits an LU factorization, then the pivots are necessarily
nonzero and the Gaussian elimination algorithm is successful.
Proof. We denote by $c^{(1)}, \ldots, c^{(n-1)}$ the columns of the matrix L − I. Then the matrices given
by $M_k = I - c^{(k)} e_k^T$, for $k \in \{1, \ldots, n-1\}$, are Gaussian transformations and it holds that
$L = M_1^{-1} \cdots M_{n-1}^{-1}$, so that
$$M_{n-1} \cdots M_1 A = U$$
is upper triangular. Let us use the notation $A^{(k)} = M_k \cdots M_1 A$. Of all the Gaussian transforma-
tions, only M1 acts on the second line of the matrix it multiplies. Therefore, the entry (2, 1) of U
coincides with the entry (2, 1) of $A^{(1)}$, which implies that $a^{(1)}_{2,1} = 0$. Then notice that $a^{(k)}_{3,1} = a^{(1)}_{3,1}$
for all k ≥ 1, because the entry (3, 1) of the matrix $M_2 A^{(1)}$ is given by $a^{(1)}_{3,1} - c^{(2)}_3 a^{(1)}_{2,1} = a^{(1)}_{3,1}$, and
the other transformation matrices $M_3, \ldots, M_{n-1}$ leave the third line invariant. Consequently, it
holds that $a^{(1)}_{3,1} = u_{3,1} = 0$. Continuing in this manner, we deduce that $A^{(1)}$ is upper triangular
in the first column and that, since A is invertible by assumption, the first pivot $a^{(1)}_{11}$ is nonzero.
Since this pivot is nonzero, the matrix M1 is uniquely defined by Lemma 2.6.
The reasoning can then be repeated with the other columns, in order to deduce that $A^{(k)}$ is
upper triangular up to column k and that all the pivots $a^{(k-1)}_{kk}$ are nonzero.
Computer implementation
The Gaussian elimination procedure is summarized as follows.
$A^{(0)} \leftarrow A$, $L \leftarrow I$
for $i \in \{1, \ldots, n-1\}$ do
    Construct $M_i$ as in Lemma 2.6.
    $A^{(i)} \leftarrow M_i A^{(i-1)}$,  $L \leftarrow L M_i^{-1}$
end for
$U \leftarrow A^{(n-1)}$.
Of course, in practice it is not necessary to explicitly create the Gaussian transformation
matrices, or to perform full matrix multiplications. A more realistic, but still very simplified,
version of the algorithm in Julia is given below. The code exploits the relation (2.8) between L
and the parameters of the Gaussian transformations.
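A minimal sketch of such a simplified implementation, with our own function name lu_nopivot and a final consistency check, is the following; it applies the Gaussian transformations to the rows of U and uses the relation (2.8) to build L column by column.
using LinearAlgebra

function lu_nopivot(A)
    n = size(A, 1)
    U = float(copy(A))
    L = Matrix{Float64}(I, n, n)
    for i in 1:n-1
        # Multipliers: the entries of the column c⁽ⁱ⁾ below the diagonal
        c = U[i+1:end, i] / U[i, i]
        # Apply the Gaussian transformation Mᵢ: subtract multiples of row i
        # from the rows below it, so that column i is zero under the diagonal
        U[i+1:end, :] -= c * U[i, :]'
        # By (2.8), column i of L below the diagonal contains the multipliers
        L[i+1:end, i] = c
    end
    return L, U
end

A = randn(4, 4)
L, U = lu_nopivot(A)
maximum(abs.(L * U - A))    # of the order of the machine epsilon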
Computational cost
The computational cost of the algorithm, measured as the number of floating point operations
(flops) required, is dominated by the Gaussian transformations, i.e. the elimination step in the
code above. All the other operations amount to a computational cost scaling as O(n²), which is
negligible compared to the cost of the LU factorization when n is large. This factorization requires
approximately 2n³/3 flops.
2.2.2 Backward and forward substitution
Once the factorization A = LU is available, it remains to solve the two triangular systems Ly = b
and Ux = y. Consider first the lower triangular system Ly = b. Notice that the unknown y1
may be obtained from the first line of the system. Then, since y1 is known, the value of y2 can
be obtained from the second line, and so on. A simple implementation of this algorithm is as follows:
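A minimal sketch of such an implementation (the function name forward_substitution is our own choice) reads:
function forward_substitution(L, b)
    n = length(b)
    y = zeros(n)
    for i in 1:n
        # Line i of Ly = b gives yᵢ once y₁, …, yᵢ₋₁ are known
        s = b[i]
        for j in 1:i-1
            s -= L[i, j] * y[j]
        end
        y[i] = s    # no division needed: L has unit diagonal
    end
    return y
end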
2.2.3 Gaussian elimination with pivoting
The Gaussian elimination algorithm that we presented in Section 2.2.1 relies on the existence
of an LU factorization. In practice, this assumption may not be satisfied, and in this case a
modified algorithm, called Gaussian elimination with pivoting, is required.
In fact, pivoting is useful even if the usual LU decomposition of A exists, as it makes it possible to
reduce the condition number of the matrices L and U. There are two types of pivoting:
partial pivoting, where only the rows are rearranged through a permutation at each iteration,
and complete pivoting, where both the rows and the columns are rearranged at each iteration.
Showing rigorously why pivoting is useful requires a detailed analysis and is beyond the
scope of this course. In this section, we only present the partial pivoting method. Its influence
on the condition number of the factors L and U is studied empirically in Exercise 2.6. It is
useful at this point to introduce the concept of a row permutation matrix.
Partial pivoting
Gaussian elimination with partial pivoting applies to any invertible matrix A, and it outputs three
matrices: a row permutation matrix P, a unit lower triangular matrix L, and an upper triangular matrix U.
These are related by the relation
PA = LU.
This is sometimes called a PLU decomposition of the matrix A. It is not unique in general but,
unlike the usual LU decomposition, it always exists provided that A is invertible. We take this
for granted in this course.
The idea of partial pivoting is to rearrange the rows at each iteration of the Gaussian
elimination procedure in such a manner that the pivotal entry is as large as possible in absolute
value. One step of the procedure reads
Here Pk+1 is a simple row permutation matrix which, when acting on A(k) , interchanges row k+1
and row `, for some index ` ≥ k + 1. The row index ` is selected in such a way that the absolute
value of the pivotal entry, in position (k + 1, k + 1) of the product Pk+1 A(k) , is maximum. The
matrix Mk+1 is then the unique Gaussian transformation matrix ensuring that A(k+1) is upper
triangular up to column k + 1, obtained as in Lemma 2.6. The resulting matrix A(n−1) after
n − 1 steps of the form (2.9) is upper triangular and satisfies
$$A^{(n-1)} = M_{n-1} P_{n-1} \cdots M_1 P_1 A.$$
The first factor in the corresponding decomposition of A is not necessarily lower triangular. However,
using the notation $M = M_{n-1} P_{n-1} \cdots M_1 P_1$ and $P = P_{n-1} \cdots P_1$, we have
$$P A = \bigl(P M^{-1}\bigr)\, A^{(n-1)}.$$
Lemma 2.8 below shows that, as the notation L suggests, the matrix L = (PM−1 ) on the right-
hand side is indeed lower triangular. Before stating and proving the lemma, we note that P is
a row permutation matrix, and so the solution to the linear system Ax = b can be obtained by
solving LUx = PT b by forward and backward substitution. Since P is a very sparse matrix, the
right-hand side PT b can be calculated very efficiently.
Lemma 2.8. The matrix L = PM−1 is unit lower triangular with all entries bounded in absolute
value from above by 1. It admits the expression
$$L = I + (P_{n-1} \cdots P_2 c^{(1)}) e_1^T + (P_{n-1} \cdots P_3 c^{(2)}) e_2^T + \cdots + (P_{n-1} c^{(n-2)}) e_{n-2}^T + c^{(n-1)} e_{n-1}^T.$$
Proof. We prove by induction that
$$P^{(k)} \bigl(M^{(k)}\bigr)^{-1} = I + (P_k \cdots P_2 c^{(1)}) e_1^T + (P_k \cdots P_3 c^{(2)}) e_2^T + \cdots + (P_k c^{(k-1)}) e_{k-1}^T + c^{(k)} e_k^T \qquad (2.11)$$
for all $k \in \{1, \ldots, n-1\}$, where $P^{(k)} = P_k \cdots P_1$ and $M^{(k)} = M_k P_k \cdots M_1 P_1$. The statement is clear for k = 1, and we assume by induction that it
is true up to k − 1. Then notice that
$$P^{(k)} \bigl(M^{(k)}\bigr)^{-1} = \Bigl(I + (P_k P_{k-1} \cdots P_2 c^{(1)}) e_1^T + \cdots + (P_k P_{k-1} c^{(k-2)}) e_{k-2}^T + (P_k c^{(k-1)}) e_{k-1}^T\Bigr) M_k^{-1},$$
where we used the induction hypothesis and the fact that the
row permutation $P_k$ does not affect rows 1 to k − 1. Using the expression $M_k^{-1} = I + c^{(k)} e_k^T$,
expanding the product and noting that $e_j^T c^{(k)} = 0$ if j ≤ k, we obtain (2.11). The statement
that the entries are bounded in absolute value from above by 1 follows from the choice of the
pivot at each iteration.
The expression of L in Lemma 2.8 suggests the iterative procedure given in Algorithm 2 for
performing the LU factorization with partial pivoting. A Julia implementation of this algorithm
is presented in Listing 2.1.
Interchange the rows i and k of matrices A(i−1) and P, and of vectors c(1) , . . . , c(i−1) .
Construct Mi with corresponding column vector c(i) as in Lemma 2.6.
Assign A(i) ← Mi A(i−1)
end for
Assign U ← A(n−1) .
Assign $L \leftarrow I + \bigl(c^{(1)} \;\cdots\; c^{(n-1)} \;\; 0_n\bigr)$.
# Auxiliary function: swap rows i and j of all the matrices passed as arguments
function swap_rows!(i, j, matrices...)
    for M in matrices
        M_row_i = M[i, :]
        M[i, :] = M[j, :]
        M[j, :] = M_row_i
    end
end

# LU factorization with partial pivoting, following Algorithm 2: returns P, L, U with PA = LU
function lu_pivot(A)
    n = size(A)[1]
    L, U = zeros(n, 0), copy(A)
    P = [i == j ? 1.0 : 0.0 for i in 1:n, j in 1:n]
    for i in 1:n-1
        # Pivoting: bring the entry of largest absolute value to position (i, i)
        index_row_pivot = i - 1 + argmax(abs.(U[i:end, i]))
        swap_rows!(i, index_row_pivot, U, L, P)
        # Gaussian transformation: eliminate the entries below the pivot and
        # append the corresponding column c⁽ⁱ⁾ of multipliers to L
        c = [zeros(i); U[i+1:end, i] / U[i, i]]
        U = U - c * U[i, :]'
        L = [L c]
    end
    # Assemble L ← I + (c⁽¹⁾ ⋯ c⁽ⁿ⁻¹⁾ 0ₙ)
    L = [L zeros(n)] + [i == j ? 1.0 : 0.0 for i in 1:n, j in 1:n]
    return P, L, U
end
Remark 2.1. It is possible to show that, if the matrix A is column diagonally dominant in the
sense that
$$\forall j \in \{1, \ldots, n\}, \qquad |a_{jj}| \ge \sum_{i=1,\, i \neq j}^{n} |a_{ij}|,$$
then pivoting does not have an effect: at each iteration, the best pivot is already on the
diagonal.
2.2.4 Direct method for symmetric positive definite matrices
Lemma 2.9 (Cholesky decomposition). If A is symmetric positive definite, then there exists a
lower-triangular matrix C ∈ Rn×n such that
A = CCT . (2.12)
Equation (2.12) is called the Cholesky factorization of A. The matrix C is unique if we require
that all its diagonal entries are positive.
Proof. Since A is positive definite, its LU decomposition exists and is unique by Propositions 2.4
and 2.7. Let D denote the diagonal matrix with the same diagonal as that of U. Then
A = LD(D−1 U).
Note that the matrix $D^{-1}U$ is unit upper triangular. Since A is symmetric, we have
$$A = A^T = (D^{-1}U)^T (LD)^T.$$
The first and second factors on the right-hand side are respectively unit lower triangular and
upper triangular, and so we deduce, by uniqueness of the LU decomposition, that L = (D−1 U)T
and U = (LD)T . But then
$$A = LU = LDL^T = \bigl(L\sqrt{D}\bigr)\bigl(L\sqrt{D}\bigr)^T.$$
Here $\sqrt{D}$ denotes the diagonal matrix whose diagonal entries are obtained by taking the square
root of those of D, which are necessarily positive because A is positive definite. This implies the
existence of a Cholesky factorization with $C = L\sqrt{D}$.
The matrix C can be calculated from (2.12). For example, developing the matrix product gives
that $a_{1,1} = c_{1,1}^2$ and so $c_{1,1} = \sqrt{a_{1,1}}$. It is then possible to calculate $c_{2,1}$ from the equation
$a_{2,1} = c_{2,1} c_{1,1}$, and so on. Implementing the Cholesky factorization is the goal of Exercise 2.7.
2.2.5 Direct methods for banded matrices
Definition 2.5. The bandwidth of a matrix A ∈ Rn×n is the smallest number k ∈ N such
that aij = 0 for all (i, j) ∈ {1, . . . , n}2 with |i − j| > k.
It is not difficult to show that, if A is a matrix with bandwidth k, then so are L and U in the
absence of pivoting. This can be proved by equaling the entries of the product LU with those
of the matrix A. We emphasize, however, that the sparsity structure within the band of A may
be destroyed in L and U; this phenomenon is called fill-in.
7 8 5 3
1 1 1 1 1 1
1 1 1 1 1 1
6 1 7 1
1 1 1
1
1 1
1 1 1 1 1 1
→
5 2 8 2
1 1 1
1 1 1
1 1 1 1 1 1
4 3 6 4
1 1 1 1 1 1
1 1 1 1 1 1
Here we also wrote the adjacency matrices associated to the graphs. We assume that the nodes
are all self-connected, although this is not depicted, and so the diagonal entries of the adjacency
35
Chapter 2. Solution of linear systems of equation
and we may verify that the adjacency matrix of the renumbered graph can be obtained from
the associated row permutation matrix:
1 1 1
1 1
1 1 1 1
1
1
1 1 1
1
1 1 1 1 1
PA∗ P =
T
1
1 1 1
1
1
1 1 1
1
1
1 1 1
1
1 1 1 1 1
In this example, renumbering the nodes of the graph enables a significant reduction of the
bandwidth, from 7 to 2. The Cuthill–McKee algorithm, which was employed to calculate the
permutation, is an iterative method that produces an ordered n-tuple R containing the nodes in
the new order; in other words, it returns σ −1 (1), . . . , σ −1 (n) . The first step of the algorithm
is to find the node i with the lowest degree, i.e. with the smallest number of connections to other
nodes, and to initialize R = (i). Then the following steps are repeated until R contains all the
nodes of the graph:
• Define Ai as the set containing all the nodes which are adjacent to a node in R but not
themselves in R;
• Sort the nodes in Ai according to the following rules: a node i ∈ Ai comes before j ∈ Ai if i
is connected to a node in R that comes before all the nodes in R to which j is connected.
As a tiebreak, precedence is given to the node with highest degree.
3 5 3 5 3 5 3
1 1 7 1 7 1
1
2 2 2 8 2
4 6 4 6 4
Figure 2.1: Illustration of the Cuthill–McKee algorithm. The new numbering of the nodes
is illustrated. The first node was chosen randomly since all the nodes have the same degree.
In this example, the ordered tuple R evolves as follows: (1) → (1, 2, 8) → (1, 2, 8, 3, 7) →
(1, 2, 8, 3, 7, 4, 6) → (1, 2, 8, 3, 7, 4, 6, 5).
The steps of the algorithm for the example above are depicted in Figure 2.1. Another
example, taken from the original paper by Cuthill and McKeen [1], is presented in Figure 2.2.
36
Chapter 2. Solution of linear systems of equation
2.2.6 Exercices
� Exercise 2.2 (Inverse of Gaussian transformation). Prove the formula (2.7).
� Exercise 2.3. Prove that the product of two lower triangular matrices is lower triangular.
∀x ∈ Rn \{0n }, xT Ax > 0.
� Exercise 2.5. Implement the backward substitution algorithm for solving Ux = y. What is
the computational cost of the algorithm?
� Exercise 2.6. Compare the condition number of the matrices L and U with and without
partial pivoting. For testing, use a matrix with pseudo-random entries generated as follows
import Random
# Set the seed so that the code is deterministic
Random.seed!(0)
n = 1000 # You can change this parameter
A = randn(n, n)
� Exercise 2.7. Write a code for calculating the Cholesky factorization of a symmetric positive
definite matrix A by comparing the entries of the product CCT with those of the matrix A. What
is the associated computational cost, and how does it compare with that of the LU factorization?
Extra credit: ... if your code is able to exploit the potential banded structure of the matrix
passed as argument for better efficiency. Specifically, your code will be tested with a matrix is
of the type BandedMatrix defined in the BandedMatrices.jl package, which you will need to
install. The following code can be useful for testing purposes.
import BandedMatrices
import LinearAlgebra
function cholesky(A)
m, n = size(A)
37
Chapter 2. Solution of linear systems of equation
For information, my code takes about 1 second to run with the parameters given here.
� Exercise 2.8 (Matrix square root). Let A be a symmetric positive definite matrix. Show
that A has a square root, i.e. that there exists a symmetric matrix B such that BB = A.
For any choice of splitting, the exact solution x∗ to the linear system is a fixed point of this
iteration, in the sense that if x(0) = x∗ , then x(k) = x∗ for all k ≥ 0. Equation (2.13) is
a linear system with matrix M, unknown x(k+1) , and right-hand side Nx(k) + b. There is a
compromise between the cost of a single step and the speed of convergence of the method.
In the extreme case where M = A and N = 0, the method converges to the exact solution
in one step, but performing this step amounts to solving the initial problem. In practice, in
order for the method to be useful, the linear system (2.13) should be relatively simple to solve.
Concretely, this means that the matrix M should be diagonal, triangular, block diagonal, or
block triangular. The error e(k) and residual r (k) at iteration k are defined as follows:
38
Chapter 2. Solution of linear systems of equation
Before presenting concrete examples of splitting methods, we obtain a necessary and sufficient
condition for the convergence of (2.13) for any initial guess x(0) .
Proposition 2.10 (Convergence). The splitting method (2.13) converges for any initial
guess x(0) if and only if ρ(M−1 N) < 1. In addition, for any ε > 0 there exists K > 0
such that
k
∀k ≥ K, ke(k) k ≤ ρ(A) + ε ke(0) k.
Proof. Let x∗ denote the solution to the linear system. Since Mx∗ − Nx∗ = b, we have
Using the assumption that M is nonsingular, we obtain that the error satisfies the equation
In order to conclude the proof, we will used Gelfand’s formula, proved in Proposition A.10 of
Appendix A. This states that
k→∞
At this point, it is natural to wonder whether there exist sufficient conditions on the matrix A
such that the inequality ρ(M−1 N) < 1 is satisfied, which is best achieved on a case by case basis.
In the next sections, we present four instances of splitting methods. For each of them, we obtain
a sufficient condition for convergence. We are particularly interested in the case where the
matrix A is symmetric (or Hermitian) and positive definite, which often arises in applications,
and in the case where A is strictly row or column diagonally dominant. We recall that a matrix
A is said to be row or column diagonally dominant if, respectively,
or
X X
|aii | ≥ |aij | ∀i |ajj | ≥ |aij | ∀j.
j6=i i6=j
39
Chapter 2. Solution of linear systems of equation
Richardson’s method
In this case the spectral radius which enters in the asymptotic rate of convergence is given by
1
−1
N) = ρ ω I−A = ρ I − ωA
ρ(M
ω
The eigenvalues of the matrix I − ωA are given by 1 − ωλi , where (λi )1≤i≤L are the eigenvalues
of A. Therefore, the spectral radius is given by
Case of symmetric positive definite A. If the matrix A is symmetric and positive definite,
it is possible to explicitly calculate the optimal value of ω for convergence. In order make
convergence as fast as possible, we want the spectral radius of M−1 N to be as small as possible,
in view of Proposition 2.10. Denoting by λmin and λmax the minimum and maximum eigenvalues
of A, it is not difficult to show that
The maximum is minimized when its two arguments are equal, i.e. when 1 − ωλmin = ωλmax − 1.
From this we deduce the optimal value of ω and the associated spectral radius:
We observe that the smaller the condition number of the matrix A, the better the asymptotic
rate of convergence.
Remark 2.2 (Link to optimization). In the case where A is symmetric and positive definite,
the Richardson update (2.14) may be viewed as a step of the steepest descent algorithm for
the function f (x) = 12 xT Ax − bT x:
The gradient of this function is ∇f (x) = Ax − b, and its Hessian matrix is A. Since the
Hessian matrix is positive definite, the function is convex and attains its global minimum
when ∇f is zero, i.e. when Ax = b.
Jacobi’s method
In Jacobi’s method, the matrix M in the splitting is the diagonal matrix D with the same entries
as those of A on the diagonal. We denote by L and U the lower and upper triangular parts of
40
Chapter 2. Solution of linear systems of equation
Since the matrix D on the left-hand side is diagonal, this linear system with unknown x(k+1) is
very simple to solve. The equation (2.16) can be rewritten equivalently as
(k+1) (k)
a11 x1 + a12 x2 + · · · + a1n x(k)
n = b1
a21 x(k) (k+1)
+ · · · + a2n x(k)
1 + a22 x2 n = b2
..
.
(k) (k)
an1 x1 + an2 x2 + · · · + ann x(k+1) = bn .
n
The updates for each of the entries of x(k+1) are independent, and so the Jacobi method lends
itself well to parallel implementation. The computational cost of one iteration, measured in
number of floating point operations required, scales as O(n2 ) if A is a full matrix, or O(nk)
if A is a sparse matrix with k nonzero elements per row on average. It is simple to prove the
convergence of Jacobi’s method is the case where A is diagonally dominant.
Proposition 2.11. Assume that A is strictly (row or column) diagonally dominant. Then it
holds that ρ(M−1 N) < 1 for the Jacobi splitting.
Proof. Assume that λ is an eigenvalue of M−1 N and v is the associated unit eigenvector. Then
−(L + λD + U)v = 0.
If |λ| > 1, then the matrix on the left-hand side of this equation is diagonally dominant and
thus invertible (see Exercise 2.9). Therefore v = 0, but this is a contradiction because v is
vector of of unit norm. Consequently, all the eigenvalues are bounded from above strictly by 1
in modulus.
Gauss–Seidel’s method
In Gauss Seidel’s method, the matrix M in the splitting is the lower triangular part of A,
including the diagonal. One step of the method then reads
41
Chapter 2. Solution of linear systems of equation
The system can be solved by forward substitution. The equation (2.17) can be rewritten equiv-
alently as
(k+1) (k) (k)
+ a12 x2 + a13 x3 + · · · + a1n x(k)
a11 x1 n = b1
(k+1) (k+1) (k)
+ a23 x3 + · · · + a2n x(k)
a21 x1 + a22 x2 n = b2
(k+1) (k+1) (k+1)
a32 x1 + a32 x2 + a33 x3 + · · · + a3n x(k)
n = b3
..
.
(k+1) (k+1) (k+1)
an1 x1 + an2 x2 + an3 x3 + · · · + ann x(k+1) = bn .
n
Given x(k) , the first entry of x(k+1) can be obtained from the first equation. Then the value of
the second entry can be obtained from the second equation, etc. Unlike Jacobi’s method, the
Gauss–Seidel method is sequential and the entries of x(k+1) cannot be updated in parallel.
It is possible to prove the convergence of the Gauss–Seidel method in particular cases. For
example, the method converges if A is strictly diagonally dominant. Proving this, using an
approach similar to that in the proof of Proposition 2.11, is the goal of Exercise 2.19. It is also
possible to prove convergence when A is Hermitian and positive definite. We show this in the
next section for the relaxation method, which generalizes the Gauss–Seidel method.
Relaxation method
The relaxation method generalizes the Gauss–Seidel method. It corresponds to the splitting
D
1−ω
A= +L − D−U . (2.18)
ω ω
When ω = 1, this is simply the Gauss–Seidel splitting. The idea is that, by letting ω be a
parameter that can differ from 1, faster convergence can be achieved. This intuition will be
verified later. The equation (2.13) for this splitting can be rewritten equivalently as
(k+1) (k) (k) (k) (k)
a x
11 1 − x 1 = −ω a x
11 1 + a x
12 2 + · · · + a x
1n n − b1
a22 x(k+1) (k) (k+1) (k)
(k)
− x = −ω a x + a x + · · · + a x − b
2 2 21 1 22 2 2n n 2
..
.
ann x(k+1)
(k)
(k+1) (k+1) (k)
− x = −ω a x + a x + · · · + a x − bn .
n n n1 1 n2 2 nn n
The coefficient on the right-hand side is larger than in the Gauss–Seidel method if ω > 1, and
smaller if ω < 1. These regimes are called over-relaxation and under-relaxation, respectively.
To conclude this section, we establish a sufficient condition for the convergence of the re-
laxation method, and also of the Gauss–Seidel method as a particular case when ω = 1, when
the matrix A is Hermitian and positive definite. To this end, we begin by showing the following
preparatory result, which concerns a general splitting A = M − N.
Proposition 2.12. Let A be Hermitian and positive definite. If the Hermitian matrix M∗ + N
is positive definite, then ρ(M−1 N) < 1.
42
Chapter 2. Solution of linear systems of equation
(M + N∗ )∗ = M∗ + N = A∗ + N∗ + N = A + N∗ + N.
We will show that kM−1 NkA < 1, where k•kA is the matrix norm induced by the following norm
on vectors:
√
kxkA := x∗ Ax.
Showing that this indeed defines a vector norm is the goal of Exercise 2.11. Since N = M − A,
it holds that kM−1 NkA = kI − M−1 AkA , and so
where we used in the last inequality the assumption that M∗ + N is positive definite. This shows
that kM−1 NkA < 1, and so ρ(M−1 N) < 1.
As a corollary, we obtain a sufficient condition for the convergence of the relaxation method.
Corollary 2.13. Assume that A is Hermitian and positive definite. Then the relaxation method
converges if ω ∈ (0, 2).
D
∗
1−ω
M+N = ∗
+L + D−U .
ω ω
2−ω
M + N∗ = D.
ω
The diagonal elements of D are all positive, because A is positive definite. (Indeed, if there
was an index i such that dii ≤ 0, then it would hold that eTi Aei = dii ≤ 0, contradicting the
assumption that A is positive definite.) We deduce that M + N∗ is positive definite if and only
if ω ∈ (0, 2). We can then conclude the proof by using Proposition 2.12.
Note that Corollary 2.13 implies as a particular case the convergence of the Gauss–Seidel
method when A is Hermitian and positive definite. The condition ω ∈ (0, 2) is fact necessary for
43
Chapter 2. Solution of linear systems of equation
the convergence of the relaxation method, not only in the case of a Hermitian positive definite
matrix A but in general.
Proposition 2.14 (Necessary condition for the convergence of the relaxation method). Let
A ∈ Cn×n be an invertible matrix, and let A = Mω − Nω denote the splitting of the relaxation
method with parameter ω. It holds that
∀ω 6= 0, ω Nω ) ≥ |ω − 1|.
ρ(M−1
• the determinant of a triangular matrix is equal to the product of its diagonal entries;
• the determinant of a matrix is equal to the product of its eigenvalues, to the power of
their algebraic multiplicity. This can be shown from the previous two items, by passing
to the Jordan normal form.
det ω D−U
1−ω
ω Nω )
det(M−1 = det(Mω ) det(Nω ) =
−1
= (1 − ω)n .
det Dω +L
Since the determinant on the left-hand side is the product of the eigenvalues of M−1
ω Nω , it is
For tridiagonal matrices, the convergence rate of the Jacobi and Gauss–Seidel methods satisfy
an explicit relation, which we prove in this section. We denote the Jacobi and Gauss–Seidel
splittings by MJ − NJ and MG − NG , respectively, and use the following notation for the entries
of the matrix A:
a1 b1
.. ..
. .
c
1
.. .. .
. .
bn−1
cn−1 an
Before presenting and proving the result, notice that for any µ 6= 0 it holds that
µ−1 b1
−1
µ µ a1
.. ..
. .
µ2 µ−2
A
µc1
(2.19)
.. ..
= .
.. ..
. . . .
−1
µ bn−1
µn µ−n µcn−1 an
44
Chapter 2. Solution of linear systems of equation
Proposition 2.15. Assume that A is tridiagonal with nonzero diagonal elements, so that
both MJ = D and MG = L + D are invertible. Then
G NG ) = ρ(MJ NJ )
ρ(M−1 −1 2
M−1
G NG v = λv ⇔ NG v = λMG v ⇔ (NG − λMG )v = 0.
For fixed λ, there exists a nontrivial solution v to the last equation if and only if
It is clear that this relation is true also if λ = 0. Consequently, it holds that if λ is an eigenvalue
of the matrix M−1
J NJ then λ is an eigenvalue of MG NG . Conversely, if λ is an eigenvalue
2 −1
of M−1
G NG , then the two square roots of λ are eigenvalues of MJ NJ .
−1
√
kπ
λk = a + 2 bc cos , k = 1, . . . , n. (2.20)
n+1
In practice, we have access to the residual r (k) = Ax(k) − b at each iteration, but not to the
error e(k) = x(k) − x∗ , as calculating the latter would require to know the exact solution of the
45
Chapter 2. Solution of linear systems of equation
Therefore, it holds that ke(k) k ≤ kA−1 kkr (k) k. Likewise, the relative error satisfies
ke(k) k kr (k) k
≤ κ(A) .
x kbk
The fraction on the right-hand side is the relative residual. If the system is well conditioned,
that is if κ(A) is close to one, then controlling the residual enables a good control of the error.
Stopping criterion
In practice, several criteria can be employed in order to decide when to stop iterating. Given a
small number ε (unrelated to the machine epsilon in Chapter 1), the following alternatives are
available:
• Stop when r (k) ≤ ε. The downside of this approach is that it is not scaling invariant:
when used for solving the following rescaled system
kAx = kb, k 6= 1,
a splitting method with rescaled initial guess kx(0) will require a number of iterations
that depends on k: fewer if k 1 and more if k 1. In practice, controlling the relative
residual and the relative error is often preferable.
• Stop when kr (k) k/kr (0) k ≤ ε. This criterion is scaling invariant, but the number of
iterations is dependent on the quality of the initial guess x(0) .
• Stop when kr (k) k/kbk. This criterion is generally the best, because it is both scaling
invariant and the quality of the final iterate is independent of that of the initial guess.
1 1
f (x) = (x − x∗ )T A(x − x∗ ) − xT∗ Ax∗ . (2.22)
2 2
46
Chapter 2. Solution of linear systems of equation
The second term is constant with x, and the first term is strictly positive if x−x∗ 6= 0, because A
is positive definite. We saw that Richardson’s method can be interpreted as a steepest descent
with fixed step size,
x(k+1) = x(k) − ω∇f (x(k) ).
In this section, we will present and study other methods for solving the linear system (2.1)
which can be viewed as optimization methods. Since A is symmetric, it is diagonalizable and
the function f can be rewritten as
1 1
f (x) = (x − x∗ )T QDQT (x − x∗ ) − xT∗ Ax∗
2 2
1 T T 1 T
= (Q e) D(Q e) − x∗ Ax∗ ,
T
e = x − x∗ .
2 2
where (λi )1≤i≤n are the diagonal entries of D. This shows that f is a paraboloid after a change
of coordinates.
The steepest descent method is more general than Richardson’s method in the sense that the
step size changes from iteration to iteration and the method is not restricted to quadratic
functions of the form (2.21). Each iteration is of the form
It is natural to wonder whether the step size ωk can be fixed in such a way that f (x(k+1) )
is as small as possible. For the case of the quadratic function (2.21), this value of ωk can be
calculated explicitly for a general search direction d, and in particular also when d = ∇f (x(k) ).
We calculate that
1 (k) T T
x − ωk d A x(k) − ωk d − x(k) − ωk d b
f (x(k+1) ) = f x(k) − ωk d =
2
ω 2
= f x(k) + k dT Ad − ωk dT r (k) . (2.23)
2
When viewed as a function of the real parameter ωk , the right-hand side is a convex quadratic
function. It is minimized when its derivative is equal to zero, i.e. when
dT r (k)
ωk dT Ad − dT (Axk − b) = 0 ⇒ ωk = . (2.24)
dT Ad
The steepest descent algorithm with step size obtained from this equation is summarized in Al-
gorithm 3 below. By construction, the function value f (x(k) ) is nonincreasing with k, which is
√
equivalent to saying that the error x − x∗ is nonincreasing in the norm x 7→ xT Ax. In order
47
Chapter 2. Solution of linear systems of equation
to quantify more precisely the decrease of the error in this norm, we introduce the notation
Lemma 2.16 (Kantorovich inequality). Let A ∈ Rn×n be a symmetric and positive definite
matrix, and let λ0 ≤ · · · ≤ λn denote its eigenvalues. Then for all nonzero z ∈ Rn it holds that
(z T z)2 4λ1 λn
≥ .
(z Az)(z A z)
T T −1 (λ1 + λn )2
Proof. By the AM-GM (arithmetic mean-geometric mean) inequality, it holds for all t > 0 that
1 1 T −1
q q
(z Az)(z A z) = (tz Az)(t z A z) ≤
T T −1 T −1 T −1 tz Az + z A z
T
2 t
1 1
= z T tA + A−1 z.
2 t
The matrix on the right-hand side is also symmetric and positive definite, with eigenvalues
equal to tλi + (tλi )−1 . Therefore, we deduce
1
q
∀t ≥ 0, (z Az)(z A z) ≤
T T −1 max tλi + (tλi ) −1
z T z. (2.25)
2 i∈{1,...,n}
The function x 7→ x + x−1 is convex, and so over any closed interval [xmin , xmax ] it attains its
maximum either at xmin or at xmax . Consequently, it holds that
1 1
max tλi + (tλi ) −1
= max tλ1 + , tλn + .
i∈{1,...,n} tλ1 tλn
In order to obtain the best possible bound from (2.25), we should let t be such that the maximum
is minimized, which occurs when the two arguments of the maximum are equal:
1 1 1
tλ1 + = tλn + ⇒ t= √ .
tλ1 tλn λ1 λn
We are now able to prove the convergence of the steepest descent method.
48
Chapter 2. Solution of linear systems of equation
We observe from Theorem 2.17 that the convergence of the steepest descent method is faster
when the condition number of the matrix A is low. This naturally leads to the following question:
can we reformulate the minimization of f (x) in (2.21) as another optimization problem which
is of the same form but involves a matrix with a lower condition number, thereby providing
scope for faster convergence? In order to answer this question, we consider a linear change of
coordinates y = T−1 x, where T is an invertible matrix, and we define
1
fe(y) = f (Ty) = y T (TT AT)y − (TT b)T y. (2.26)
2
This function is of the same form as f in (2.21), with the matrix Ae := TT AT instead of A
and the vector eb := TT b instead of b. Its minimizer is y ∗ = T−1 x∗ . The steepest descent
algorithm can be applied to (2.26) and, from an approximation y (k) of the minimizer y ∗ , an
49
Chapter 2. Solution of linear systems of equation
approximation x(k) of x∗ is obtained by the change of variable x(k) = Ty (k) . This approach
is called preconditioning. By Theorem 2.17, the steepest descent method satisfies the following
error estimate when applied to the function (2.26):
κ2 (TT AT) − 1
2
Ek+1 ≤ Ek , Ek = (y (k) − y ∗ )T A(y
e (k) − y ),
κ2 (TT AT) + 1 ∗
= (x(k) − x∗ )T A(x(k) − x∗ ).
Consequently, the convergence is faster than that of the usual steepest descent method if
κ2 (TT AT) < κ2 (A). The optimal change of coordinates is given by T = C−T , where C is
the factor of the Cholesky factorization of A as CCT . Indeed, in this case
and the method converges in a single iteration! However, this iteration amounts to solving the
linear system by direct Cholesky factorization of A. In practice, it is usual to define T from an
approximation of the Cholesky factorization, such as the incomplete Cholesky factorization.
To conclude this section, we demonstrate that the change of variable from x to y need not
be performed explicitly in practice. Indeed, one step of the steepest descent algorithm applied
to function fe reads
Letting x(k) = Ty (k) , this equation can be rewritten as the following iteration:
dTk r (k)
x(k+1) = x(k) − ω
e k dk , ω
ek = , dk = TT T(Ax(k) − b).
dTk Adk
50
Chapter 2. Solution of linear systems of equation
Definition 2.6 (Conjugate directions). Let A be a symmetric positive definite matrix. Two
vectors d1 and d2 are called A-orthogonal or conjugate with respect to A if dT1 Ad2 = 0, i.e.
if they are orthogonal for the inner product hx, yiA = xT Ay.
Assume that d0 , . . . , dn−1 are n pairwise A-orthogonal nonzero directions. By Exercise 2.20,
these vectors are linearly independent, and so they form a basis of Rn . Consequently, for any
initial guess x(0) , the vector x(0) − x∗ , where x∗ is the solution to the linear system Ax = b,
can be decomposed as
x(0) − x∗ = α0 d0 + · · · + αn−1 dn−1 .
Taking the h•, •iA inner product of both sides with dk , with k ∈ {0, . . . , n − 1}, we obtain an
expression for the scalar coefficient αk
Therefore, calculating the expression of the coefficient does not require to know the exact
solution x∗ . Given conjugate directions, the exact solution can be obtained as
n−1
dTk r (0)
(2.27)
X
x∗ = x(0) − αk dk , αk = .
k=0
dTk Adk
From this equation we deduce, denoting by D the matrix d0 . . . dn−1 , that the inverse
of A can be expressed as
n−1
dk dTk
A = D(D AD) D = νk = dTk Adk .
X
−1 T −1 T
,
νk
k=0
In this equation, νk is the k-th diagonal entry of the matrix DT AD. The conjugate directions
method is illustrated in Algorithm 5. Its implementation is very similar to the steepest descent
method, the only difference being that the descent direction at iteration k is given by dk instead
of r (k) . In particular, the step size at each iteration is such that f (x(k+1) ) is minimized.
Let us now establish the connection between the Algorithm 5 and (2.27), which is not
immediately apparent because (2.27) involves only the initial residual Ax(0) − b, while the
residual at the current iteration r (k) is used in the algorithm.
51
Chapter 2. Solution of linear systems of equation
Proposition 2.18 (Convergence of the conjugate directions method). The vector x(k) ob-
tained after k iterations of the conjugate directions method is given by
k−1
dTi r (0)
(2.28)
X
(k) (0)
x =x − αi di , αi = .
i=0
dTi Adi
Proof. Let us denote by y (k) the solution obtained after k steps of Algorithm 5. Our goal is to
show that y (k) coincides with x(k) defined in (2.28). The result is trivial for k = 0. Reasoning
by induction, we assume that it holds true up to k. Then performing step k + 1 of the algorithm
gives
dTk r (k)
y (k+1) = y (k) − ωk dk , ωk = .
dTk Adk
On the other hand, it holds from (2.28) that
dTk r (0)
x(k+1) = x(k) − αk dk , αk = .
dTk Adk
By the induction hypothesis it holds that y (k) = x(k) and so, in order to prove the equality
y (k+1) = x(k+1) , it is sufficient to show that ωk = αk , i.e. that
dTk r (k) = dTk r (0) ⇔ dTk (r (k) − r (0) ) = 0 ⇔ dTk A(x(k) − x(0) ) = 0.
Since ωk in Algorithm 5 coincides with the expression in (2.24), the conjugate directions
algorithm satisfies the following “local optimization” property: the iterate x(x+1) minimizes f
on the straight line ω 7→ x(k) − ωdk . In fact, it also satisfies the following stronger property.
Proposition 2.19 (Optimality of the conjugate directions method). The iterate x(k) is the
minimizer of f over the set x(0) + Bk , where Bk = span{d0 , . . . , dk−1 }.
n−1
X dTi r (0)
x∗ = x(0) − αi d i , αi =
i=0
dTi Adi
Employing these two expressions, the formula for f in (2.22), and the A-orthogonality of the
52
Chapter 2. Solution of linear systems of equation
directions, we obtain
1 1
f (y) = (y − x∗ )T A(y − x∗ ) − xT∗ Ax∗
2 2
k−1 n−1
1X 1X 2 T 1
= (βi − αi )2 dTi Adi + αi di Adi − xT∗ Ax∗
2 2 2
i=1 i=k
This is minimized when βi = αi for all i ∈ {0, . . . , k − 1}, in which case y coincides with the
k-th iterate x(k) of the conjugate directions method in view of Proposition 2.18.
Remark 2.3. Let k•kA denote the norm induced by the inner product (x, y) 7→ xT Ay. Since
q
kx (k)
− x∗ kA = 2f (x(k) ) + xT∗ Ax∗ ,
Proposition 2.19 shows that x(k) minimizes the norm kx(k) − x∗ kA over x(0) + Bk .
A corollary of (2.19) is that the gradient of f at x(k) , i.e. the residual r (k) = Ax(k) − b, is
orthogonal to any vector in {d0 , . . . , dk−1 } for the usual Euclidean inner product. This can also
be checked directly from the formula
n−1
(k)
X dTi r (0)
x − x∗ = αi di , αi = ,
i=k
dTi Adi
n−1
dTj r (k) = A(x(k) − x∗ ) = (2.29)
X
∀j ∈ {0, . . . , k − 1}, αi dTj di = 0.
i=k
In the previous section, we showed that, given n conjugate directions, the solution to the
linear system Ax = b can be obtained in a finite number of iterations using Algorithm 5. The
conjugate gradient method can be viewed as a particular case of the conjugate directions method.
Instead of assuming that the conjugate directions are given, they are constructed iteratively
as part of the algorithm. Given an initial guess x(0) , the first direction is the residual r (0) ,
which coincides with the gradient of f at x(0) . The directions employed for the next iterations
are obtained by applying the Gram-Schmidt process to the residuals. More precisely, given
conjugate directions d1 , . . . , dk−1 , and letting x(k) denote the k-th iterate of the conjugate
directions method, the direction dk is obtained by
d Ar (k)
k−1 T
r (k) = Ax(k) − b. (2.30)
X
i
dk = r (k) − di ,
i=0
dTi Adi
It is simple to check that dk is indeed A-orthogonal to di for i ∈ {0, . . . , k − 1}, and that dk is
nonzero if r (k) is nonzero. To prove the latter claim, we can take the Euclidean inner product
53
Chapter 2. Solution of linear systems of equation
of both sides with r (k) and use Proposition 2.19 to deduce that
It appears from (2.30) that the cost of calculating a new direction grows linearly with the
iteration index. In fact, it turns out that only the last term in the sum is nonzero, and so
the cost of computing a new direction does not grow with the iteration index k. In order to
explain why only the last term of the sum in (2.30) is nonzero, we begin by making a couple of
observations:
• Since the directions are obtained from the residuals, it holds that
Note that ωi > 0 if r (i) 6= 0 by (2.31). Rearranging this equation and noting that the
difference r (i+1) −r (i) is a linear combination of the first i+1 conjugate directions by (2.32),
we deduce that there exist scalar coefficients (αi,j )0≤j≤i+1 such that
i+1
1 (i+1)
Adi =
X
(i)
r −r = αi,j dj .
ωi
j=0
Using the orthonormality property (2.29) for the residual of the conjugate directions method,
we deduce that the right-hand side of this equation is zero for all i ∈ {0, . . . , k − 2}, implying
that only the last term in the sum of (2.30) is nonzero. This leads to Algorithm 6.
The subspace spanned by the descent directions of the conjugate gradient method can be
characterized precisely as follows.
Proposition 2.20. Assume that kr (k) k 6= 0 for k < m ≤ n, Then it holds that
∀k ∈ {0, . . . m}, span{r (0) , r (1) , . . . , r (k) } = span{r (0) , Ar (0) , . . . , Ak r (0) }
The subspace on the right-hand side is called a Krylov subspace and denoted Bk+1 .
Proof. The result is clear for k = 0. Reasoning by induction, we prove that if the result is true
54
Chapter 2. Solution of linear systems of equation
By the induction hypothesis and (2.32), there exist scalar coefficients (γi )0≤i≤k such that
r (k+1) = β0 r (0) + (β1 − ω0 γ0 )Ar (0) + · · · + (βk − ωk−1 γk−1 )Ak r (0) − ωk γk A(k+1) r (0) .
Although the conjugate gradient method converges in a finite number of iterations, perform-
ing n iterations for very large systems would require an excessive computational cost, and so it
is sometimes desirable to stop iterating when the residual is sufficiently small. To conclude this
section, we study the convergence of the method.
Theorem 2.21 (Convergence of the conjugate gradient method). The error for the conjugate
gradient method, measured as Ek+1 := (x(k+1) − x∗ )T A(x(k+1) − x∗ ), satisfies the following
inequality:
∀qk ∈ P(k), Ek+1 ≤ max 1 + λi qk (λi ) E0 . (2.35)
2
1≤i≤n
Here P(k) is the vector space of polynomials of degree less than or equal to k.
55
Chapter 2. Solution of linear systems of equation
k
αi Ai r (0) = x(0) + pk (A)r (0) ,
X
(k+1) (0)
x =x +
i=0
we deduce that
Q (x − x∗ ) D QT (x(0) − x∗ ) ,
2 T (0)
≤ max 1 + λi qk (λi )
T
1≤i≤n | {z }
E0
An corollary of Theorem 2.21 is that, if A has m ≤ n distinct eigenvalues, then the conjugate
gradient method converges in at most m iterations. Indeed, in this case we can take
1 (λ1 − λ) . . . (λm − λ)
qk (λ) = −1 .
λ λ1 . . . λ m
It is simple to check that the right-hand side is indeed a polynomial, and that qk (λi ) = 0 for all
eigenvalues of A.
In general, finding the polynomial that minimizes the right-hand side of (2.35) is not possible,
because the eigenvalues of the matrix A are unknown. However, it is possible to derive from
this equation an error estimate with a dependence on the condition number of κ2 (A).
56
Chapter 2. Solution of linear systems of equation
where λ1 and λn are the minimum and maximum eigenvalues of A. We show at the end of the
proof that the right-hand side is minimized when
λn +λ1 −2λ
Tk+1 λn −λ1
1 + λqk (λ) = , (2.36)
λn +λ1
Tk+1 λn −λ1
where Tk+1 is the Chebyshev polynomial of degree k + 1. These polynomials may be defined
defined from the formula
cos(i arccos x) for |x| ≤ 1
Ti (x) = 1 √ √ (2.37)
x − x2 − 1 + 1 x + x2 − 1
i i
for |x| ≥ 1.
2 2
It is clear from this definition that |Ti (x)| ≤ 1 for all x ∈ [−1, 1]. Consequently, the following
inequality holds true for all λ ∈ [λ1 , λn ]:
1
p k+1 p k+1 −1
1 + λqk (λ) ≤ =2 2
r− r −1 2
+ r+ r −1 ,
λn +λ1
Tk+1 λn −λ1
√ k+1 √ k+1 !−1
κ+1 κ−1
=2 √ + √ .
κ−1 κ+1
where r = λn −λ1 .
λn +λ1
Since the second term in the bracket converges to zero as k → ∞, it is
natural to bound this expression by keeping only the first term, which after simple algebraic
manipulations leads to
√ k+1
κ−1
∀λ ∈ [λ1 , λn ], 1 + λqk (λ) ≤ 2 √ .
κ+1
Note that the points (µj )0≤j≤k+1 are all in the interval [λ1 , λn ]. Reasoning by contradiction,
we assume that there is another polynomial qek ∈ P(k) such that
−1
λn + λ1
max |1 + λe
qk (λ)| < Tk+1 . (2.39)
λ∈[λ1 ,λn ] λn − λ1
57
Chapter 2. Solution of linear systems of equation
This implies, by the intermediate value theorem, that tk+1 has k +1 roots in the interval [λ1 , λn ],
so k + 2 roots in total, but this is impossible for a nonzero polynomial of degree k + 1.
2.3.3 Exercises
� Exercise 2.9. Show that if A is row or column diagonally dominant, then A is invertible.
� Exercise 2.11. Let A ∈ Rn×n be a symmetric positive definite matrix. Show that the
functional
√
k•kA : x 7→ xT Ax
defines a norm on Rn .
� Exercise 2.13. Show that, if A and B are two square matrices, then ρ(AB) = ρ(BA).
� Exercise 2.16. Show that, for any matrix norm k•k induced by a vector norm,
ρ(A) ≤ kAk.
� Exercise 2.17. Let k•k denote the Euclidean vector norm on Rn . We define in Appendix A
the induced matrix norm as
kAk = sup kAxk : kxk ≤ 1 .
Show from this definition that, if A is symmetric and positive definite, then
58
Chapter 2. Solution of linear systems of equation
Solution. By the Cauchy–Schwarz inequality and the definition of kAk, it holds that
This shows that kAk∗ ≤ kAk. Conversely, letting B denote a matrix square root of A (see
Exercise 2.8), we have
√ q q
∀x ∈ Rn with kxk ≤ 1, kAxk = xT AT Ax = (Bx)T BB(Bx) = (Bx)T A(Bx)
Bx
= kBxk y T Ay,
p
y= .
kBxk
√
It holds that kBxk = xT Ax ≤ kAk∗ . In addition kyk = 1, so the expression inside the
p
square root is bounded from above by kAk∗ , which enables to conclude the proof.
� Exercise 2.18. Implement an iterative method based on a splitting for finding a solution to
the following linear system on Rn .
2 −1 1 x1
−1 2 −1 x2 1
1 −1 2 −1 x3 1 1
.. .. .. . = . , h= .
h2 . . . .. .. n
−1 2 −1 xn−1 1
−1 2 xn 1
Plot the norm of the residual as a function of the iteration index. Use as stopping criterion the
condition
kr (k) k ≤ εkbk, ε = 10−8 .
As initial guess, use a vector of zeros. The code will be tested with n = 5000.
Extra credit: � Find a formula for the optimal value of ω in the relaxation method given n.
The proof of Proposition 2.15, as well as the formula (2.20) for the eigenvalues of a tridiagonal
matrix, are useful to this end.
Solution. Corollary 2.13 and Proposition 2.14 imply that a sufficient and necessary condition
for convergence, when A is Hermitian and positive definite, is that ω ∈ (0, 2). Let Mω = ωD+L
1
ω Nω − λI) = 0
det(M−1 ⇔ det(M−1
ω ) det(Nω − λMω ) = 0 ⇔ det(λMω − Nω ) = 0.
Substituting the expressions of Mω and Nω , we obtain that this condition can be equivalently
rewritten as
√ √
λ+ω−1 λ+ω−1
det λL + D+U =0 ⇔ det λL + D+ λU =0
ω ω
where we used (2.19) for the last equivalence. The equality of the determinants in these two
√
equations is valid for λ denoting either of the two complex square roots of λ. This condition
59
Chapter 2. Solution of linear systems of equation
is equivalent to
λ+ω−1
det L + √ D + U = 0.
λω
We recognize from the proof of Proposition 2.15 that this condition is equivalent to
λ+ω−1
√ J NJ ).
∈ spectrum(M−1
λω
(λ + ω − 1)2
= µ2 , (2.40)
λω 2
eigenvalues of M−1
J NJ are real and given by
jπ
µj = cos , 1 ≤ j ≤ n. (2.41)
n+1
λ2 + λ 2(ω − 1) − ω 2 µ2 + (ω − 1)2 = 0.
For given ω ∈ (0, 2) and µ ∈ R, this is a quadratic equation for λ with solutions
r
ω 2 µ2 ω 2 µ2
λ± = + 1 − ω ± ωµ + 1 − ω,
2 4
Since the first bracket is positive when the argument of the square root is positive, it is clear
that r
ω 2 µ2 ω 2 µ2
max |λ− |, |λ+ | =
+ 1 − ω + ω|µ| +1−ω .
2 4
Combining this with (2.41), we deduce that the spectral radius of M−1
ω Nω is given by
s
ω 2 µ2j ω 2 µ2j
ω Nω ) =
ρ(M−1 max + 1 − ω + ω|µj | +1−ω . (2.42)
j∈{1,...,n} 2 4
We wish to minimize this expression over the interval ω ∈ (0, 2). While this can be achieved
by algebraic manipulations, we content ourselves here with graphical exploration. Figure 2.3
depicts the amplitude of the modulus in (2.42) for different values of µ. It is apparent that, for
given ω, the modulus increases as µ increases, which suggests that
r
ω 2 µ2∗ ω 2 µ2∗
ω Nω )
ρ(M−1 = + 1 − ω + ω|µ∗ | +1−ω , J NJ ).
µ∗ = ρ(M−1 (2.43)
2 4
The figure also suggests that for a given value of µ, the modulus is minimized at the discontinuity
of the first derivative, which occurs when the argument of the square root is zero. We conclude
60
Chapter 2. Solution of linear systems of equation
� Exercise 2.19. Prove that, if the matrix A is strictly diagonally dominant (by rows or
columns), then the Gauss–Seidel method converges, i.e. ρ(M−1 N) < 1. You can use the same
approach as in the proof of Proposition 2.11.
� Exercise 2.20. Let A ∈ Rn×n denote a symmetric positive definite matrix, and assume that
the vectors d1 , . . . , dn are pairwise A-orthogonal directions. Show that d1 , . . . , dn are linearly
independent.
1
f (x) = xT Ax − bT x.
2
• Plot the contour lines of f in Julia using the function contourf from the package Plots.
• Using Theorem 2.17, estimate the number K of iterations of the steepest descent algorithm
required in order to guarantee that EK ≤ 10−8 , when starting from the vector x(0) = (2 3)T .
• Implement the steepest descent method for finding the solution to (2.44), and plot the
iterates as linked dots over the filled contour of f .
61
Chapter 2. Solution of linear systems of equation
• Plot the error Ek as a function of the iteration index, using a linear scale for the x axis
and a logarithmic scale for the y axis.
� Exercise 2.22. Compute the number of floating point operations required for performing
one iteration of the conjugate gradient method, assuming that the matrix A contains α n
nonzero elements per row.
� Exercise 2.23 (Solving the Poisson equation over a rectangle). We consider in this exercise
Poisson’s equation in the domain Ω = (0, 2) × (0, 1), equipped with homogeneous Dirichlet
boundary conditions:
−4f (x, y) = b(x, y), x ∈ Ω,
f (x) = 0, x ∈ ∂Ω.
The right-hand side is
b(x, y) = sin(4πx) + sin(2πy).
A number of methods can be employed in order to discretize this partial differential equation.
After discretization, a finite-dimensional linear system of the form Ax = b is obtained. A
Julia function for calculating the matrix A and the vector b using the finite difference method
is given to you on the course website, as well as a function to plot the solution. The goal of
this exercise is to solve the linear system using the conjugate gradient method. Use the same
stopping criterion as in Exercise 2.18.
� Exercise 2.24. Show that (2.37) indeed defines a polynomial, and find its expression in the
usual polynomial notation.
� Exercise 2.25. Show that if A ∈ Rn×n is nonsingular, then the solution to the equation
Ax = b belongs to the Krylov subspace
62
Chapter 3
Introduction
This chapter concerns the numerical solution of nonlinear equations of the general form
f (x) = 0, f : Rn → Rn . (3.1)
A solution to this equation is called a zero of the function f . Except in particular cases (for
example linear systems), there does not exist a numerical method for solving (3.1) in a finite
number of operations, so iterative methods are required.
In contrast with the previous chapter, it may not be the case that (3.1) admits one and
only one solution. For example, the equation 1 + x2 = 0 does not have a (real) solution, and
the equation cos(x) = 0 has infinitely many. Therefore, convergence results usually contain
assumptions on the function f that guarantee the existence and uniqueness of a solution in Rn
or a subset of Rn .
For an iterative method generating approximations (xk )k≥0 of a root x∗ , we define the error
as ek = xk − x∗ . If the sequence (xk )k≥0 converges to x∗ in the limit as k → ∞ and if
kek+1 k
lim = r, (3.2)
k→∞ kek kq
63
Chapter 3. Solution of nonlinear systems
then we say that (xk )k≥0 converges with order of convergence q and rate of convergence r. In
addition, we say that the convergence is linear q = 1, and quadratic if q = 2. The convergence
is said to be superlinear if
kek+1 k
lim = 0. (3.3)
k→∞ kek k
In particular, the convergence is superlinear if the order of convergence is q > 1.
Remark 3.1. The definition (3.3) for the order and rate of convergence is not entirely satis-
factory, as the limit may not exist. A more general definition for the order of convergence of
a sequence (xk )k≥0 is the following:
kek+1 k
q(x0 ) = inf p ∈ [1, ∞) : lim sup = ∞ ,
k→∞ kek kp
or q(x0 ) = ∞ if the numerator and denominator of the fraction are zero for sufficiently large
k. It is possible to define similarly the order of convergence of an iterative method for an
initial guess in a neighborhood V of x∗ :
kek+1 k
q = inf p ∈ [1, ∞) : sup lim sup =∞ ,
x0 ∈V k→∞ kek kp
where the fraction should be interpreted as 0 if the numerator and denominator are zero. A
more detailed discussion of this subject is beyond the scope of this course.
• In Section 3.1, by way of introduction to the subject of numerical methods for nonlinear
equations, we present and analyze the bisection method.
• In Section 3.2, we present a general method based on a fixed point iteration for solv-
ing (3.1). The convergence of this method is analyzed in Section 3.3.
• In Section 3.4, two concrete examples of fixed point methods are studied: the chord
method and the Newton–Raphson method.
64
Chapter 3. Solution of nonlinear systems
Proposition 3.1. Assume that f : R → R is a continuous function. Let [aj , bj ] denote the
interval obtained after j iterations of the bisection method, and let xj denote the midpoint
(aj + bj )/2. Then there exists a root x∗ of f such that
Proof. By construction, f (aj )f (bj ) ≤ 0 and f (b) 6= 0. Therefore, by the intermediate value
theorem, there exists a root of f in the interval [aj , bj ), implying that
bj − aj
|xj − x∗ | ≤ .
2
Although the limit in (3.2) may not be well-defined (for example, x1 may be a root of f ), the
error xj − x∗ is bounded in absolute value by the sequence (e
ej )j≥0 , where eej = (b0 − a0 )2−(j+1)
by Proposition 3.1. Since the latter sequence exhibits linear convergence to 0, the convergence
of the bisection method is said to be linear, by a slight abuse of terminology.
F (x∗ ) = x∗ .
Such a point x∗ is called a fixed point of the function F . Several definitions of the function F
can be employed in order to ensure that a fixed point of F coincides with a zero of f . One
65
Chapter 3. Solution of nonlinear systems
may, for example, define F (x) = x − α−1 f (x), for some nonzero scalar coefficient α. Then
F (x∗ ) = x∗ if and only if f (x∗ ) = 0. Later in this chapter, in Section 3.4, we study two
instances of numerical methods which can be recast in the form (3.5). Before this, we study the
convergence of the iteration (3.5) for a general function F .
Bδ (x) := y ∈ Rn : ky − xk < δ .
Definition 3.1 (Stability of fixed points). Let (xk )k≥0 denote iterates obtained from (3.5)
when starting from an initial vector x0 . Then we say that a fixed point x∗ is
The largest neighborhood for which this is true, i.e. the set of values of x0 such that (3.6)
holds true, is called the basin of attraction of x∗ .
• stable (in the sense of Lyapunov) if for all ε > 0, there exists δ > 0 such that
• exponentially stable if there exists C > 0, α ∈ (0, 1), and δ > 0 such that
Clearly, global exponential stability implies exponential stability, which itself implies asymp-
totic stability and stability. If x∗ is globally exponentially stable, then x∗ is the unique fixed
point of F ; showing this is the aim of Exercise 3.3. If x∗ is an attractor, then the dynamical
system (3.5) is said to be locally convergent to x∗ . The larger the basin of attraction of x∗ , the
66
Chapter 3. Solution of nonlinear systems
less careful we need to be when picking the initial guess x0 . Global exponential stability of a
fixed point can sometimes be shown provided that F satisfies a strong hypothesis.
Theorem 3.2. Assume that F is a contraction. Then there exists a unique fixed point of (3.5),
and it holds that
∀x0 ∈ Rn , ∀k ∈ N, kxk − x∗ k ≤ Lk kx0 − x∗ k.
It follows that the sequence (xk )k≥0 is Cauchy in Rn , implying by completeness that xk → x∗ in
the limit as k → ∞, for some limit x∗ ∈ Rn . Being a contraction, the function F is continuous,
and so taking the limit k → ∞ in (3.5) we obtain that x∗ is a fixed point of F . Then
proving the statement. To show the uniqueness of the fixed point, assume y ∗ is another fixed
point. Then,
ky ∗ − x∗ k = kF (y ∗ ) − F (x∗ )k ≤ Lky ∗ − x∗ k,
It is possible to prove a weaker, local result under a less restrictive assumptions on the
function F .
Theorem 3.3. Assume that x∗ is a fixed point of (3.5) and that F : Rn → Rn satisfies the
local Lipschitz condition
67
Chapter 3. Solution of nonlinear systems
with 0 ≤ L < 1 and δ > 0. Then x∗ is the unique fixed point of F in Bδ (x∗ ) and, for all
x0 ∈ Bδ (x∗ ), it holds that
It is possible to guarantee that condition (3.8) holds provided that we have sufficiently good
control of the derivatives of the function F . The function F is differentiable at x (in the sense
of Fréchet) if there exists a linear operator DF x : Rn → Rn such that
If F is differentiable, then all its first partial derivatives exist. The Jacobian matrix of F at x
is defined as
∂1 F1 (x) . . . ∂n F1 (x)
.. .. ..
JF (x) = . . .
,
∂1 Fn (x) . . . ∂n Fn (x)
Proposition 3.4. Let x∗ be a fixed point of (3.5), and assume that there exists δ and a
subordinate matrix norm such that F is differentiable everywhere in Bδ (x∗ ) and
Then condition (3.8) is satisfied in the associated vector norm, and so the fixed point x∗ is
locally exponentially stable.
Proof. Let x ∈ Bδ (x∗ ) By the fundamental theorem of calculus and the chain rule, we have
1
d 1
Z Z
JF x∗ + t(x − x∗ ) (x − x∗ ) dt.
F x∗ + t(x − x∗ ) dt =
F (x) − F (x∗ ) =
0 dt 0
In fact, it is possible to prove that a fixed point x∗ is exponentially locally stable under an
even weaker condition, involving only the derivative of F at x∗ .
68
Chapter 3. Solution of nonlinear systems
Proposition 3.5. Let x∗ be a fixed point of (3.5) and that f is differentiable at x∗ with
in a subordinate vector norm. Then there exists δ > 0 such that condition (3.8) is satisfied
in the associated vector norm, and so the fixed point x∗ is locally exponentially stable.
Proof. Let ε = 12 (1 − L) > 0. By the definition of differentiability (3.9), there exists δ > 0 such
that
kF (x∗ + h) − F (x∗ ) − JF (x∗ )hk
∀h ∈ Bδ (0)\{0}, ≤ ε.
khk
By the triangle inequality, this implies
∀h ∈ Bδ (0), kF (x∗ + h) − F (x∗ )k ≤ kF (x∗ + h) − F (x∗ ) − JF (x∗ )hk + kJF (x∗ )hk
≤ εkhk + kJF (x∗ )kkhk = (ε + L)khk = (1 − ε)khk.
This shows that F satisfies the condition (3.8) in the neighborhood Bδ (x∗ ).
The estimate in Theorem 3.2 suggests that when the fixed point iteration (3.5) converges, the
convergence is linear. While this is usually the case, the convergence is superlinear if JF (x∗ ) = 0.
Proposition 3.6. Assume that x∗ is a fixed point of (3.5) and that JF (x∗ ) = 0. Then the
convergence to x∗ is superlinear, in the sense that if xk → x∗ as k → ∞, then
kxk+1 − x∗ k
lim = 0.
k→∞ kxk − x∗ k
Proof. By Proposition 3.5, there exists δ > 0 such that (xk )k≥0 is a sequence converging to x∗
for all x0 ∈ Bδ (x∗ ). It holds that
then assuming that (xk )k≥0 converges to x∗ , it holds for sufficiently large k that
69
Chapter 3. Solution of nonlinear systems
The fixed point iteration (3.4) in this case admits a simple geometric interpretation: at each
step, the function f is approximated by the affine function x 7→ f (xk ) + α(x − xk ), and the new
iterate is defined as the zero of this affine function, i.e.
In order for this condition to hold true, the slope α must be of the same sign as f 0 (x∗ ) and the
inequality |α| ≥ |f 0 (x∗ )|/2 must be satisfied. If f 0 (x∗ ) = 0, then the sufficient condition (3.11) is
never satisfied; in this case, the convergence must be studied on a case-by-case basis. By Propo-
sition 3.6, the convergence of the chord method is superlinear if α = f 0 (x∗ ). In practice, the
solution x∗ is unknown, and so this choice is not realistic. Nevertheless, the above reasoning
suggests that, by letting the slope α vary from iteration to iteration in such a manner that αk
approaches f 0 (x∗ ) as k → ∞, fast convergence can be obtained. This is precisely what the
Newton–Raphson method aims to achieve; see Section 3.4.1
When f is a function from Rn to Rn , the above approach generalizes to
where A is an invertible matrix. The geometric interpretation of the method in this case is the
following: at each step, the function f is approximated by the affine function x 7→ xk +A(x−xk ),
and the next iterate is given by the unique zero of the latter function. Superlinear convergence
is achieved when A = Jf (x∗ ). Notice that each iteration requires to calculate y := A−1 f (xk ),
which is generally achieved by solving the linear system Ay = f (xk ).
70
Chapter 3. Solution of nonlinear systems
f (x)
F (x) = x − .
f 0 (x)
f (x∗ )f 00 (x∗ )
F 0 (x∗ ) = = 0.
f 0 (x∗ )2
In the rest of this section, we show that the iteration (3.13) is well-defined in a small neigh-
borhood of a root of f under appropriate assumptions, and we demonstrate the second order
convergence of the method. We begin by proving the following preparatory lemma, which we
will then employ in the particular case where the matrix-valued function A is equal to Jf .
Lemma 3.7. Let A : Rn → Rn×n denote a matrix-valued function on Rn that is both continuous
and nonsingular at x∗ , and let f be a function that is differentiable at x∗ where f (x∗ ) = 0.
Then the function
G(x) = x − A(x)−1 f (x)
Let β = kA(x∗ )−1 k and ε = (2β)−1 . By continuity of the matrix-valued function A, there
exists δ > 0 such that
∀x ∈ Bδ (x∗ ), kA(x) − A(x∗ )k ≤ ε.
For x ∈ Bδ (x∗ ) we have kA(x∗ )−1 A(x∗ ) − A(x) k ≤ kA(x∗ )−1 kkA(x∗ ) − A(x)k ≤ βε = 12 , and
so Lemma 2.3 implies that the second factor on the right-hand side of (3.15) is invertible with
71
Chapter 3. Solution of nonlinear systems
a norm bounded from above by 2. Therefore, we deduce that A(x) is invertible with
Noting that A−1 (x∗ ) − A(x∗ + h)−1 = A(x∗ )−1 A(x∗ + h) − A(x∗ ) A(x∗ + h)−1 , we bound the
Using this lemma, we can show the following result on the convergence of the multi-
dimensional Newton–Raphson method.
is satisfied, then the convergence is at least quadratic: there exists d ∈ (0, δ) such that
72
Chapter 3. Solution of nonlinear systems
In order to show that the convergence is quadratic, we begin by noticing that, since
t
d t
Z Z
f x∗ + t(xk − xx ) dt, = Jf x∗ + t(xk − xx ) (xk − x∗ ) dt,
f (xk ) =
0 dt 0
0
Z 1
Jf x∗ + t(xk − x∗ ) − Jf (x∗ ) kxk − x∗ k dt
≤
0
Z 1
α
≤ αtkxk − x∗ k2 dt ≤ kxk − x∗ k2 . (3.16)
0 2
3α
≤ Jf (xk )−1 kxk − x∗ k2 ≤ 3αkJf (x∗ )kkxk − x∗ k2 ,
2
The Newton–Raphson method exhibits very fast convergence, but it requires the knowledge of
the derivatives of the function f . To conclude this chapter, we describe a root-finding algorithm,
known as the secant method, that enjoys superlinear convergence but does not require the
derivatives of f . This method applies only when f is a function from R to R, and so we drop
the vector notation in the rest of this section.
Unlike the other methods presented so far in Section 3.2, the secant method can not be
recast as a fixed point iteration of the form xk+1 = F (xk ). Instead, it is of the more form
general form xk+2 = F (xk , xk+1 ). The geometric intuition behind the method in the following:
given xk and xk+1 , the function f is approximated by the unique linear function that passes
through xk , f (xk ) and xk+1 , f (xk+1 ) , and the iterate xk+2 is defined as the root of this
f (xk+1 ) − f (xk )
fe(x) = f (xk ) + (x − xk ).
xk+1 − xk
73
Chapter 3. Solution of nonlinear systems
Showing the convergence of the secant method rigorously under general assumptions is tedious,
so in this course we restrict our attention to the case where f is a quadratic function. Extending
the proof of convergence to a more general smooth function can be achieved by using a quadratic
Taylor approximation of f around the root x∗ , which is accurate in a close neighborhood of x∗ .
Theorem 3.9 (Convergence of the secant method). Assume that f is a convex quadratic poly-
nomial with a simple root at x∗ and that the secant method converges: limk→∞ xk = x∗ . Then
the order of convergence is given by the golden ratio
√
1+ 5
ϕ= .
2
|xk+1 − x∗ |
lim = y∞ . (3.18)
k→∞ |xk − x∗ |ϕ
ek+2 µek
= . (3.19)
ek+1 λ + µ(ek+1 + ek )
By assumption, the right-hand side converges to zero, and so the left-hand side must also
converge to zero; the convergence is superlinear. In order to find the order of convergence, we
take absolute values in the previous equation to obtain, after rearranging,
!1−ϕ 1−ϕ
|ek+2 | |ek+1 | µ |ek+1 | |µ|
= 1 = ,
|ek+1 |ϕ |ek | ϕ−1 |λ + µ(ek+1 + ek )| |ek |ϕ |λ + µ(ek+1 + ek )|
74
Chapter 3. Solution of nonlinear systems
|µ|
yk+1 = yk1−ϕ .
|λ + µ(ek+1 + ek )|
This is a recurrence equation for log(yk ), whose explicit solution can be obtained from the
variation-of-constants formula:
k−1
log(yk ) = (1 − ϕ)k−1 log(y1 ) +
X
(1 − ϕ)k−1−i ci .
i=1
Since (ck )k≥0 converges to the constant c∞ = |µ/λ| by the assumption that ek → 0, the se-
quence log(yk ) k≥0 converges to c∞ /ϕ (prove this!). Therefore, by continuity of the exponential
3.5 Exercises
� Exercise 3.1. Implement the bisection method for finding the solution(s) to the equation
x = cos(x).
xk+1 = F (xk )
ρ JF (x∗ ) < 1,
75
Chapter 3. Solution of nonlinear systems
Hint: One may employ a matrix norm of the form kAkT := kT−1 ATk2 , which is a subordinate
norm by Exercise 2.10. The Jordan normal form is useful for constructing the matrix T, and
equation (2.19) is also useful.
Solution. Let J = P−1 AP denote the Jordan normal form of A, and let
ε
ε2
Eε =
..
.
εn
The matrix ETε Eε is diagonal with entries equal to either 0 or ε2 , and so kJε − Dk2 < ε. By the
triangle inequality, we have
norm. By (3.20) and the assumption that ρ(A) < 1, it is clear that kAkε < 1 provided that ε is
sufficiently small.
r
√
� Exercise 3.6. Calculate x = 3 3 + 3 3 + 3 3 + . . . using the bisection method.
q p
� Exercise 3.7. Solve the equation f (x) = ex − 2 = 0 using a fixed point iteration of the form
Using your knowledge of the exact solution x∗ = log 2, write a sufficient condition on α to
guarantee that x∗ is locally exponentially stable. Verify your findings numerically and plot,
using a logarithmic scale for the y axis, the error in absolute value as a function of k.
� Exercise 3.8. Implement the Newton–Raphson method for solving f (x) = ex − 2 = 0, and
plot the error in absolute value as a function of the iteration index k.
� Exercise 3.9. Find the point (x, y) on the parabola y = x2 that is closest to the point (3, 1).
y = (x − 1)2
(
x2 + y 2 = 4
By drawing these two constraints in the xy plane, find an approximation of the solution(s).
Then calculate the solution(s) using a fixed-point method.
76
Chapter 3. Solution of nonlinear systems
� Exercise 3.11. Find solutions (ψ, λ), with λ > 0, to the following eigenvalue problem:
� Exercise 3.12. Suppose that we have n data points (xi , yi ) of an unknown function y = f (x).
We wish to approximate f by a function of the form
a
fe(x) =
b+x
Write a system of nonlinear equations that the minimizer (a, b) must satisfy, and solve this
system using the Newton–Raphson method starting from (1, 1). The data is given below:
x = [0.0; 0.1; 0.2; 0.3; 0.4; 0.5; 0.6; 0.7; 0.8; 0.9; 1.0]
y = [0.6761488864859304; 0.6345697680852508; 0.6396283580587062; 0.6132010027973919;
0.5906142598705267; 0.5718728461471725; 0.5524549902830562; 0.538938885654085;
0.5373495476994958; 0.514904589752926; 0.49243437874655027]
Plot the data points together with the function fe over the interval [0, 1]. Your plot should look
like Figure 3.1.
77
Chapter 4
Introduction
Calculating the eigenvalues and eigenvectors of a matrix is a task often encountered in scientific
and engineering applications. Eigenvalue problems naturally arise in quantum physics, solid
mechanics, structural engineering and molecular dynamics, to name just a few applications.
The aim of this chapter is to present an overview of the standard methods for calculating
eigenvalues and eigenvectors numerically. We focus predominantly on the case of a Hermitian
matrix A ∈ Cn×n , which is technically simpler and arises in many applications. The reader is
invited to go through the background material in Appendix A.4 before reading this chapter.
The rest of this chapter is organized as follows
78
Chapter 4. Numerical computation of eigenvalues
pA : C → C : λ 7→ det(A − λI).
One may, therefore, calculate the eigenvalues of A by calculating the roots of the polynomial pA
using, for example, one of the methods presented in Chapter 3. While feasible for small matrices,
this approach is not viable for large matrices, because the number of floating point operations
required for calculating calculating the coefficients of the characteristic polynomial scales as n!.
In view of the prohibitive computational cost required for calculating the characteristic
polynomial, other methods are required for solving large eigenvalue problems numerically. All
the methods that we study in this chapter are iterative in nature. While some of them are
aimed at calculating all the eigenpairs of the matrix A, other methods enable to calculate only
a small number of eigenpairs at a lower computational cost, which is often desirable. Indeed,
calculating all the eigenvalues of a large matrix is computationally expensive; on a personal
computer, the following Julia code takes well over a second to terminate:
import LinearAlgebra
LinearAlgebra.eigen(rand(2000, 2000))
In many applications, the matrix A is sparse, and in this case it is important that algorithms
for eigenvalue problems do not destroy the sparsity structure. Note that the eigenvectors of a
sparse matrix are generally not sparse.
To conclude this section, we introduce some notation used throughout this chapter. For a
diagonalizable matrix A, we denote the eigenvalues by λ1 , . . . , λn , with |λ1 | ≥ |λ2 | ≥ . . . ≥ |λn |.
The associated eigenvectors are denoted by v 1 , . . . , v n . Therefore, it holds that
79
Chapter 4. Numerical computation of eigenvalues
be decomposed as
x 0 = α1 v 1 + · · · + αn v n . (4.1)
The idea of the power iteration is to repeatedly left-multiply this vector by the matrix A, in
order to amplify the coefficient of v 0 relative to the other ones. Indeed, notice that
Ak x0 = λk1 α1 v 1 + · · · + λkn αn v n .
If λ1 is strictly larger in modulus than the other eigenvalues, and if α1 6= 0, then for large k the
vector Ak x0 is approximately aligned, in a sense made precise below, with the eigenvector v 1 .
In order to avoid overflow errors at the numerical level, the iterates are normalized at each
iteration. The power iteration is presented in Algorithm 8.
To precisely quantify the convergence of the power method, we introduce the notion of angle
between vectors of Cn .
|x∗ y|
∠(x, y) = arccos √ √
x∗ x y ∗ y
k(I − Py )xk yy ∗
= arcsin , Py := .
kxk y∗y
This definition generalizes the familiar notion of angle for vectors in R2 or R3 , and we note
that the angle function satisfies ∠(eiθ1 x, eiθ2 y) = ∠(x, y). We can then prove the following
convergence result.
Proposition 4.1 (Convergence of the power iteration). Suppose that A is diagonalizable and
that |λ1 | > |λ2 |. Then, for every initial guess with α1 6= 0, the sequence (xk )k≥0 generated by
the power iteration satisfies
lim ∠(xk , v 1 ) = 0.
k→∞
λk α k
where
λk1 α1
eiθk := .
|λk1 α1 |
Clearly, it holds hat xk → eiθk v 1 /kv 1 k. Using the definition of the angle between two vectors
80
Chapter 4. Numerical computation of eigenvalues
in Cn , and the continuity of the Cn Euclidean inner product and of the arccos function, we
calculate
! !
|x∗ v 1 | |x∗ v 1 |
∠(xk , v 1 ) = arccos p ∗ kp ∗ = arccos p k∗ −−−→ arccos(1) = 0,
xk xk v 1 v 1 v1v1 k→∞
!
k
λ2
∠(xk , v 1 ) = O .
λ1
The convergence is slow if |λ2 /λ1 | is close to one, and fast if |λ2 | |λ1 |. Once an approximation
of the eigenvector v 1 has been calculated, the corresponding eigenvalue λ1 can be estimated
from the Rayleigh quotient:
x∗ Ax
ρA : Cn∗ → C : x 7→ . (4.3)
x∗ x
For any eigenvector v of A, the corresponding eigenvalue is equal to ρA (v). In order to study the
error on the eigenvalue λ1 for the power iteration, we assume for simplicity that A is Hermitian
and that the eigenvectors v 1 , . . . , v n are orthonormal. The convergence of the eigenvalue in this
particular case is faster than for a general matrix in Cn×n . Substituting (4.2) in the Rayleigh
quotient (4.3), we obtain
λk2 α2 2 λkn αn
2
λ1 + λk1 α1
λ2 + ··· + λk1 α1
λn
ρA (xk ) = .
λk α 2
2 2 λkn αn
2
1+ λk1 α1
+ ··· + λk1 α1
Therefore, we deduce
2 2
!
2k
λk α2 λ k αn λ2
|ρA (xk ) − λ1 | ≤ 2k |λ2 − λ1 | + · · · + nk |λn − λ1 | = O .
λ1 α1 λ1 α1 λ1
For general matrices, it is possible to show using a similar argument that the error is of or-
der O |λ2 /λ1 |k in the limit as k → ∞.
81
Chapter 4. Numerical computation of eigenvalues
where µ ∈ C is a shift. The eigenvalues of (A−µI)−1 are given by (λ1 −µ)−1 , . . . , (λn −µ)−1 , with
associated eigenvectors v 1 , . . . , v n . If 0 < |λJ − µ| < |λK − µ| for all j 6= i, then the dominant
eigenvalue of the matrix (A − µI)−1 is (λJ − µ)−1 , and so the power iteration applied to this
matrix yields an approximation of the eigenvector v i . In other words, the inverse iteration
with shift µ enables to calculate an approximation of the eigenvector of A corresponding to the
eigenvalue nearest µ. The inverse iteration is presented in Algorithm 9. In practice, the inverse
matrix (A − µI)−1 need not be calculated, and it is often preferable to solve a linear system at
each iteration.
Algorithm 9 Inverse iteration
x ← x0
for i ∈ {1, 2, . . . } do
Solve (A − µI)y = x
x ← y/kyk
end for
λ ← x∗ Ax/x∗ x
return x, λ
An application of Proposition 4.1 immediately gives the following convergence result for the
inverse iteration.
lim ∠(xk , v J ) = 0.
k→∞
More precisely, !
k
λJ − µ
∠(xk , v J ) = O .
λK − µ
Notice that the closer µ is to λJ , the faster the inverse iteration converges. With µ = 0, the
inverse iteration enables to calculate the eigenvalue of A of smallest modulus.
It is possible to show that, when A is Hermitian, the Rayleigh quotient iteration converges to
an eigenvector for almost every initial guess x0 . Furthermore, if convergence to an eigenvector
82
Chapter 4. Numerical computation of eigenvalues
occurs, then µ converges cubically to the corresponding eigenvalue. See [11] and the references
therein for more details.
Proposition 4.3 (Reduced QR decomposition). Assume that X ∈ Cn×p has linearly inde-
pendent columns. Then there exist a matrix Q ∈ Cn×p with orthonormal columns and an
upper triangular matrix R such that the following factorization holds:
X = QR.
Sketch of the proof. Let us denote the columns of Q by q 1 , . . . , q p and the entries of R by
α22 . . . α2p
R=
.. . .
. ..
αpp
83
Chapter 4. Numerical computation of eigenvalues
Expanding the matrix product and identifying columns of both sides, we find that
x1 = α11 q 1
x2 = α12 q 1 + α22 q 2
..
.
xp = α1p q 1 + α2p q 2 + αpp q p .
Under the restriction that the diagonal elements of R are real and positive, the first equation
enables to uniquely determines x1 and α11 = kx1 k. Then, projecting both sides of the second
equation onto span{q 1 } and span{q 1 }⊥ , we obtain
The first equation gives α12 , and the second equation enables to calculate q 2 and α22 . This
process can be iterated p times in order to fully construct the matrices Q and R.
Note that the columns of the matrix Q of the decomposition coincide with the vectors that
would be obtained by applying the Gram–Schmidt method to the columns of the matrix X. In
fact, the Gram–Schmidt process is one of several methods by which the QR decomposition can
be calculated in practice.
The simultaneous iteration method is presented in Algorithm 11. Like the normalization
in the power iteration Algorithm 8, the QR decomposition performed at each step enables to
avoid overflow errors. Notice that when p = 1, the simultaneous iteration reduces to the power
iteration. We emphasize that the factorization step at each iteration does not influence the
subspace spanned by the columns of X. Therefore, this subspace after k iterations coincides
with that spanned by the columns of the matrix Ak X0 . In fact, in exact arithmetic, it would
be equivalent to perform the QR decomposition only once as a final step, after the for loop.
Indeed, denoting by Qk Rk the QR decomposition of AXk−1 , we have
Xk = AXk−1 R−1
k = A Xk−2 Rk−1 Rk = · · · = A X0 R1 . . . Rk
2 −1 −1 k −1 −1
⇔ Xk (Rk . . . R1 ) = Ak X0 .
Since Xk has orthonormal columns and Rk . . . R1 is an upper triangular matrix (see Exercise 2.3)
with real positive elements on the diagonal (check this!), it follows that Xk can be obtained by QR
factorization of Ak X0 . In order to show the convergence of the simultaneous iteration, we begin
by proving the following preparatory lemma.
84
Chapter 4. Numerical computation of eigenvalues
Proof. We reason by contradiction and assume there is ε > 0 and a subsequence (Qkn )n≥0 such
that kQkn −Qk ≥ ε for all n. Since the set of unitary matrices is a compact subset of Cn×n , there
exists a further subsequence (Qknm )m≥0 that converges to a limit Q∞ which is also a unitary
matrix and at least ε away in norm from Q. But then
Rknm = Q−1
kn (Qknm Rknm ) = Qknm (Qknm Rknm ) −
∗
−−−→ Q∗∞ (QR) =: R∞ .
m m→∞
Since Rk is upper triangular with positive diagonal elements for all k, clearly R∞ is also upper
triangular with positive diagonal elements. But then Q∞ R∞ = QR, and by uniqueness of the
decomposition we deduce that Q = Q∞ , which is a contradiction.
Before presenting the convergence theorem, we introduce the following terminology: we say
that Xk ∈ Cn×p converges essentially to a matrix X∞ if each column of Xk converges essentially to
the corresponding column of X∞ . We recall that a vector sequence (xk )k≥1 converges essentially
to x∞ if there exists (θk )k≥1 such that (eiθk xk )k≥1 converges to x∞ . We prove the convergence
in the Hermitian case for simplicity. In the general case of A ∈ Cn×n , it cannot be expected
that Xk converges essentially to V, because the columns of Xk are orthogonal but eigenvectors
may not be orthogonal. In this case, the columns of Xk converge not to the eigenvectors but to
the so-called Schur vectors of A; see [11] for more information.
Theorem 4.5 (Convergence of the simultaneous iteration � ). Assume that A ∈ Cn×n is Her-
mitian, and that X ∈ Cn×p has linearly independent columns. Assume also that the subspace
spanned by the column of X0 satisfies
If it holds that
λ1 > λ2 > · · · > λp > λp+1 ≥ λp+2 ≥ . . . ≥ λn , (4.5)
Proof. Let B = V−1 X0 ∈ Cn×p , so that X0 = VB, and note that Ak X0 = VDk B. We denote
by B1 ∈ Cp×p and B2 ∈ C(n−p)×p the upper p × p and lower (n − p) × p blocks of B, respectively.
The matrix B1 is nonsingular, otherwise the assumption (4.4) would not hold. Indeed, if there
was a nonzero vector z ∈ Cp such that B1 z = 0, then
B1
! !
0
X0 z = V = V2 B2 z.
z = V1 V2
B2 B2 z
85
Chapter 4. Numerical computation of eigenvalues
Dk B1
! !
A X0 = V1 V2 = V1 Dk1 B1 + V2 Dk2 B2 ,
k 1
B2 Dk2
= V1 + V2 Dk2 B2 B−1 D Dk1 B1 .
1
−k
1 (4.6)
The second term in the bracket on the right-hand side converges to zero in the limit as k → ∞
by (4.5). Let Q
fk R
fk denote the reduced QR decomposition of the bracketed term. By Lemma 4.4,
which we proved for the standard QR decomposition but also holds for the reduced one, we
deduce from Q
fk R
fk → V1 that Q
fk → V1 . Rearranging (4.6), we have
Ak X0 = Q
fk (R
fk Dk B1 ).
1
Since the matrix between brackets is a p × p square invertible matrix, this equation implies
that col(Ak X0 ) = col(Qfk ). Denoting by Qk Rk the QR decomposition of Ak X0 , we therefore
have col(Qk ) = col(Q
fk ), and so the projectors on these subspaces are equal. We recall that, for
a set of orthonormal vectors r 1 , . . . , r p gathered in a matrix R = r 1 . . . r p , the projector
RR∗ = r 1 r ∗1 + · · · + r p r ∗p .
Consequently, the equality of the projectors implies Qk Q∗k = Q fk ∗ . Now, we want to establish
fk Q
the essential convergence of Qk to V1 . To this end, we reason by induction, relying on the
fact that the first k columns of X0 undergo a simultaneous iteration independent of the other
columns. For example, the first column simply undergoes a power iteration, and so it converges
essentially to v 1 . Assume now that the columns 1 to p − 1 of Qk converge essentially to
v 1 , . . . , v p−1 . Then the p-th column of Qk at iteration k, which we denote by q p , satisfies
(k)
∗ (k) (k) ∗ ∗ ∗ ∗
= Qk Q∗k − q 1 q 1 − · · · − q p−1 q p−1 = Q
e kQ
(k) (k) e ∗ − q (k) q (k) − · · · − q (k) q (k)
q (k)
p qp
(k)
k 1 1 p−1 p−1
√
Therefore, noting that |a| = aa for every a ∈ C, we deduce
q
(k) (k) ∗
q
|v ∗p q (k)
p | = v ∗p q p q p v p −−−→ v ∗p v p v ∗p v p = 1,
k→∞
(k)
2 v ∗p q p
e−iθk q (k)
p − vp =2− 2|v ∗p q (k)
p | −−−→ 0, θk = (k)
,
k→∞ |v ∗p q p |
86
Chapter 4. Numerical computation of eigenvalues
In addition to this convergence result, it is possible to show that the error satisfies
!
k
λp+1
∠ col(Xk ), col(V1 ) = O .
λp
Algorithm 12 QR algorithm
X0 = A
for i ∈ {1, 2, . . . } do
Qk Rk = Xk−1 (QR decomposition)
Xk = Rk Qk
end for
Xk = Q−1
k Xk−1 Qk = Qk Xk−1 Qk = · · · = (Q1 . . . Qk ) X0 (Q1 . . . Qk )
∗ ∗
(4.7)
Therefore, all the iterates are related by a unitary similarity transformation, and so they all
have the same eigenvalues as X0 = A. Rearranging (4.7), we have
Q
e k+1 Rk+1 = AQ
e k.
87
Chapter 4. Numerical computation of eigenvalues
does not admit a solution. Let us denote by U the matrix with columns u1 , . . . , up . Since any
vector v ∈ U is equal to Uz for some vector z ∈ Cp , equation (4.8) is equivalent to the problem
AUz = λUz,
which is a system of n equations with p < n unknowns. The Rayleigh–Ritz method is based on
the idea that, in order to obtain a problem with as many unknowns as there are equations, we
can multiply this equation by U∗ , which leads to the problem
This is standard eigenvalue problem for the matrix U∗ AU ∈ Cp×p , which is much easier to solve
than the original problem if p n. Equivalently, equation (4.9) may be formulated as follows:
find v ∈ U such that
u∗ (Av − λv), ∀u ∈ U . (4.10)
The solutions to (4.9) and (4.10) are related by the equation v = Uz. Of course, the eigenvalues
of B in problem (4.9), which are called the Ritz values of A relative to U, are in general different
from those of A. Once an eigenvector y of B has been calculated, an approximate eigenvector
of A, called a Ritz vector of A relative to U, is obtained from the equation v
b = Uy. The
Rayleigh–Ritz algorithm is presented in full in Algorithm 13.
Algorithm 13 Rayleigh–Ritz
Choose U ⊂ Cn
Construct a matrix U whose columns are orthonormal and span U
bi and eigenvectors y ∈ Cp of B := U∗ AU
Find the eigenvalues λ i
bi = Uy i ∈ Cn .
Calculate the corresponding Ritz vectors v
88
Chapter 4. Numerical computation of eigenvalues
Proof. Let U ∈ Cn×p and W ∈ Cn×(n−p) be matrices whose columns form orthonormal bases
of U and U ⊥ , respectively. Here U ⊥ denotes the orthogonal complement of U with respect to
the Euclidean inner product. Then, since W∗ AU = 0 by assumption, it holds that
U∗ AU U∗ AW U∗ AU U∗ AW
! !
Q AQ = Q= U W .
∗
= ,
W∗ AU W∗ AW 0 W∗ AW
(U∗ AU)y
! ! !
y b y
Q∗ AQ = =λ =: λx,
b
0 0 0
If U is close to being an invariant subspace of A, then it is expected that the Ritz vectors
and Ritz values of A relative to U will provide good approximations of some of the eigenpairs
of A. Quantifying this approximation is difficult, so we only present without proof the following
error bound. See [10] for more information.
Proposition 4.7. Let A be a full rank Hermitian matrix and U a p-dimensional subspace
of Cn . Then there exists eigenvalues λi1 , . . . , λip of A which satisfy
In the case where A is Hermitian, it is possible to show that the Ritz values are bounded from
above by the eigenvalues of A. The proof of this result relies on the Courant–Fisher theorem
for characterizing the eigenvalues of a Hermitian matrix, which is recalled in Theorem A.6 in
the appendix.
∀i ∈ {1, . . . , p}, b i ≤ λi
λ
x∗ Bx
λ
bi = max min
S⊂Cp ,dim(S)=i x∈S\{0} x∗ x
89
Chapter 4. Numerical computation of eigenvalues
y ∗ Ay
λ
bi = max min
S⊂Cp ,dim(S)=i y∈US\{0} y ∗ y
y ∗ Ay y ∗ Ay
= max min ≤ max min = λi ,
R⊂U ,dim(R)=i y∈R\{0} y ∗ y R⊂Cn ,dim(R)=i y∈R\{0} y ∗ y
where we used the Courant–Fisher for the matrix A in the last equality.
The answer to this question is positive, and the resulting method is often much faster than
the power iteration. This is achieved by employing the Rayleigh–Ritz projection method Al-
gorithm 13 with the choice U = Kk+1 (A, x0 ). Applying this method requires to calculate an
orthonormal basis of the Krylov subspace and to calculate the reduced matrix U∗ AU. The
Arnoldi method enables to achieve these two goals simultaneously.
subspace span{u1 , . . . , uj } = Kj (A, u1 ), implying that Kj+1 (A, u1 ) = Kj (A, u1 ). In this case,
90
Chapter 4. Numerical computation of eigenvalues
Therefore, applying the Rayleigh–Ritz with U = span{u1 , . . . , uj } yields exact eigenpairs Propo-
sition 4.6. If the iteration does not break down then, by construction, the vectors {u1 , . . . , up }
at the end of the algorithm are orthonormal. It is also simple to show by induction that they
form a basis of Kp (A, u1 ). The scalar coefficients hi,j can be collected in a matrix square p × p
matrix
h1,1 h1,2 h1,3 ··· h1,p
h2,1 h2,2 h2,3
··· h2,p
H= · · ·
0 h h h 3,p .
3,2 3,3
. .. .. .. ..
.. . . . .
0 ··· 0 hp,p−1 hp,p
This matrix contains only zeros under the first subdiagonal; such a matrix is called a Hessenberg
matrix. Inspecting the algorithm, we notice that the j-th column contains the coefficients of
the projection of the vector Auj onto the basis {u1 , . . . , up }. In other words,
U∗ AU = H, (4.11)
We have thus shown that the Arnoldi algorithm enables to construct both an orthonormal basis
of a Krylov subspace and the associated reduced matrix. In fact, we have the following equation
0
.
AU = UH + hp+1,p (v p+1 e∗p ), .
. ∈ C . (4.12)
p
ep =
1
The Arnoldi algorithm, coupled with the Rayleigh–Ritz method, has very good convergence
properties in the limit as p → ∞, in particular for eigenvalues with a large modulus. The
following result shows that the residual r = Ab b associated with a Ritz vector can be
v − λv
estimated inexpensively. Specifically, the norm of the residual is equal to the last component of
the associated eigenvector of H multiplied by hp+1,p .
Proposition 4.9 (Formula for the residual � ). Let y i be an eigenvector of H associated with
the eigenvalues λ bi = Uy denote the corresponding eigenvector. Then
bi , and let v
i
Ab
v i − λv
b i = hp+1,p (y )p v p+1 .
i
91
Chapter 4. Numerical computation of eigenvalues
Avbi − λ
bi y = hp+1,p (v p+1 e∗ )y ,
i p i
In practice, the larger the dimension p of the subspace U employed in the Rayleigh–Ritz
method, the more memory is required for storing an orthonormal basis of U. In addition, for
large values of p, computing the reduced matrix (4.11) and its eigenpairs becomes computation-
ally expensive; the computational cost of computing the matrix H scales as O(p2 ). To remedy
these potential issues, the algorithm can be restarted periodically. For example, Algorithm 15
can be employed as an alternative to the power iteration in order to find the eigenvector asso-
ciated with the eigenvalue with largest modulus.
Therefore, the matrix H is Hermitian. This is not surprising, since we showed that H = U∗ AU
and the matrix A is Hermitian. Since H is also of Hessenberg form, we deduce that H is
tridiagonal. An inspection of Algorithm 14 shows that the subdiagonal entries of H are real.
Since A is Hermitian, the diagonal entries hi,i = u∗i (Auj ) are also real, and so we conclude that
all the entries of the matrix H are in fact real. This matrix if of the form
α1 β2
β2 α2 β3
.. ..
H= β3 . .
.. ..
. . βp
βp αp
92
Chapter 4. Numerical computation of eigenvalues
4.5 Exercises
� Exercise 4.1. PageRank is an algorithm for assigning a rank to the vertices of a directed
graph. It is used by many search engines, notably Google, for sorting search results. In this
context, the directed graph encodes the links between pages of the World Wide Web: the vertices
of the directed graph are webpages, and there is an edge going from page i to page j if page i
contains a hyperlink to page j.
Let us consider a directed graph G(V, E) with vertices V = {1, . . . , n} and edges E. The
graph can be represented by its adjacency matrix A ∈ {0, 1}n×n , whose entries are given by
1 if there is an edge from i to j,
aij =
0 otherwise.
Let ri denote the “value” assigned to vertex i. The idea of PageRank, in its simplest form, is
to assign values to the vertices by solving the following system of equations;
rj
(4.13)
X
∀i ∈ V, ri = .
oj
j∈neighbors
where oj is the outdegree of vertex j, i.e. the number of edges leaving from j. The sum is over
all the “incoming” neighbors of i, which have an edge towards node i.
• Read the Wikipedia page on PageRank to familiarize yourself with the algorithm.
T
• Let r = r1 . . . rn . Show using (4.13) that r satisfies
1
o1
r = AT .. r =: AT O−1 r.
.
1
on
• Prove that the eigenvalues of any matrix B ∈ Rn×n coincide with those of BT . You may
use the fact that det(B) = det(BT ).
93
Chapter 4. Numerical computation of eigenvalues
• Show that 1 is an eigenvalue and that ρ(M) = 1. For the second part, find a subordinate
matrix norm such that kMk = 1.
• Implement PageRank in order to rank pages from a 2013 snapshot of English Wikipedia.
You can use either the simplified version of the algorithm given in (4.13) or the improved
version with a damping factor described on Wikipedia. In the former case, the following
are both sensible stopping criteria:
kMb
r−r b k1 kMb
r − λb
b r k1 bT Mb
b= r r
< 10−15 or < 10−15 , λ T
,
kb
r k1 krk1 r
b r b
where v
b is an approximation of the eigenvector corresponding to the dominant eigenvalue.
A dataset is available on the course website to complete this part. This dataset contains
a subset of the data publicly available here, and was generated from the full dataset by
retaining only the 5% best rated articles. After decompressing the archive, you can load
the dataset into Julia by using the following commands:
import CSV
import DataFrames
After you have assigned a rank to all the pages, print the 10 pages with the highest ranks.
My code returns the following entries:
• Extra credit: Write a function search(keyword) that can be employed for searching the
database. Here is an example of what it could return:
94
Chapter 4. Numerical computation of eigenvalues
� Exercise 4.2. Show the following properties of the Krylov subspace Kp (A, x).
• The Krylov subspace Kp (A, x) is invariant under shift of the matrix A: for all α ∈ C,
� Exercise 4.3. The minimal polynomial of a matrix A ∈ Cn×n is the monic polynomial p of
lowest degree such that p(A) = 0. Prove that, if A is Hermitian with m ≤ n distinct eigenvalues,
then the minimal polynomial is given by
m
Y
p(t) = (t − λi ).
i=1
� Exercise 4.4. The minimal polynomial for a general matrix A ∈ Cn×n is given by
m
Y
p(t) = (t − λi )si .
i=1
where si is the size of the largest Jordan block associated with the eigenvalue λi in the normal
Jordan form of A. Verify that p(A) = 0.
� Exercise 4.5. Let d denote the degree of the minimal polynomial of A. Show that
� Exercise 4.6. Let A ∈ Cn×n . Show that Kn (A, x) is the smallest invariant subspace of A
that contains x.
� Exercise 4.7 (A posteriori error bound). Assume that A ∈ Cn×n is Hermitian, and that v
b
is a normalized approximation of an eigenvector which satisfies
b∗ Ab
b= v v
kb
z k := kAb
v − λb
bv k = δ, λ ∗ .
v
b vb
95
Chapter 4. Numerical computation of eigenvalues
|λ
b − λ| ≤ δ.
krk := kAb
v − λb
bv k = δ
|λ
b − λ| ≤ κ2 (V)δ.
Hint: Rewrite
kb
v k = k(A − λI) b −1 V−1 r)k.
b −1 rk = kV(D − λI)
� Exercise 4.9. In Exercise 4.7 and Exercise 4.8, we saw examples a posteriori error es-
timates which guarantee the existence of an eigenvalue of A within a certain distance of the
approximation λ.
b In this exercise, our aim is to provide an answer to the following different
question: given an approximate eigenpair (b b what is the smallest perturbation E that we
v , λ),
need to apply to A in order to guarantee that (b b is an exact eigenpair, i.e. that
v , λ)
(A + E)b
v = λb
bv ?
• Find a rank one matrix E∗ such that kE∗ k2 = krk2 , and then conclude. Recall that any
rank 1 matrix can be written in the form E∗ = uw∗ , with norm kuk2 kwk2 .
� Exercise 4.10. Assume that (xk )k≥0 is a sequence of normalized vectors in Cn . Show that
the following statements are equivalent:
96
Chapter 4. Numerical computation of eigenvalues
U∗ AU,
where U ∈ Cn×p contains the vectors of the basis as columns. Implement your algorithm
with p = 20 in order to approximate the dominant eigenvalue of the matrix A constructed by the
following piece of code:
n = 5000
A = rand(n, n) + im * randn(n, n)
A = A - A' # A is now skew-Hermitian
� Exercise 4.12. Assume that {u1 , . . . , up } and {w1 , . . . , wn } are orthonormal bases of the
same subspace U ⊂ Cn . Show that the projectors UU∗ and WW∗ are equal.
� Exercise 4.13. Assume that A ∈ Cn×n is a Hermitian matrix with distinct eigenvalues, and
suppose that we know the dominant eigenpair (v 1 , λ1 ), with v 1 normalized. Let
B = A − λ1 v 1 v ∗1 .
If we apply the power iteration to this matrix, what convergence can we expect?
� Exercise 4.14. Assume that vc1 and v c2 are two Ritz vectors of a Hermitian matrix A relative
to a subspace U ⊂ C . Show that, if the associated Ritz values are distinct, then v
n c1 ∗ v
c2 = 0.
97
Chapter 5
5.1 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1.1 Vandermonde matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1.2 Lagrange interpolation formula . . . . . . . . . . . . . . . . . . . . . . . 100
5.1.3 Gregory–Newton interpolation . . . . . . . . . . . . . . . . . . . . . . . 100
5.1.4 Interpolation error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.1.5 Interpolation at Chebyshev nodes . . . . . . . . . . . . . . . . . . . . . . 107
5.1.6 Hermite interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.1.7 Piecewise interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.2 Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.2.1 Least squares approximation of data points . . . . . . . . . . . . . . . . 113
5.2.2 Mean square approximation of functions . . . . . . . . . . . . . . . . . . 115
5.2.3 Orthogonal polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.2.4 Orthogonal polynomials and numerical integration: an introduction �
. 118
5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.4 Discussion and bibliography . . . . . . . . . . . . . . . . . . . . . . . . . 123
Introduction
In this chapter, we study numerical methods for interpolating and approximating functions.
The Cambridge dictionary defines interpolation as the addition of something different in the
middle of a text, piece of music, etc. or the thing that is added. The concept of interpolation
in mathematics is consistent with this definition; interpolation consists in finding, given a set
of points (xi , yi ), a function f in a finite-dimensional space that goes through these points.
Throughout this course, you have used the plot function in Julia, which performs piecewise
linear interpolation for drawing functions, but there are a number of other standard interpolation
methods. Our first goal in this chapter is to present an overview of these methods and the
associated error estimates.
98
Chapter 5. Interpolation and approximation
In the second part of this chapter, we focus of function approximation, which is closely
related to the subject of mathematical interpolation. Indeed, a simple manner for approximating
a general function by another one in a finite-dimensional space is to select a set of real numbers
on the x axis, called nodes, and find the associated interpolant. As we shall demonstrate, not
all sets of interpolation nodes are equal, and special care is required in order to avoid undesired
oscillations. The field of function approximation is vast, so our aim in this chapter is to present
only an introduction to the subject. In order to quantify the quality of an approximation, a
metric on the space of functions, or a subset thereof, must be specified in order to measure
errors. Without a metric, saying that two functions are close is almost meaningless!
5.1 Interpolation
Assume that we are given n+1 nodes x0 , . . . , xn on the x axis, together with the values u0 , . . . , un
taken by an unknown function u(x) when evaluated at these points, and suppose that we are
looking for an interpolation u
b(x) in a subspace span{ϕ0 , . . . , ϕn } of the space of continuous
functions, i.e. an interpolating function of the form
This leads to a linear system of n + 1 equations and n + 1 unknowns, the latter being the
coefficients α0 , . . . , αn . This system of equations in matrix form reads
ϕ0 (x0 ) ϕ1 (x0 ) . . . ϕn (x0 ) α0 u0
ϕ0 (x1 ) ϕ1 (x1 ) . . . ϕn (x1 ) α1 u1
.
. .. .. .. = .. .
(5.1)
. . . . .
ϕ0 (xn ) ϕ1 (xn ) . . . ϕn (xn ) αn un
99
Chapter 5. Interpolation and approximation
In this case, the linear system (5.1) for determining the coefficients of the interpolant reads
1 x0 . . . xn0 α0 u0
1 x1 . . . xn1 α1 u1
. .
. . ..
. = . . (5.2)
. .. ..
. .
1 xn . . . xnn αn un
The matrix on the left-hand side is called a Vandermonde matrix. If the abcissae x0 , . . . , xn
are distinct, then this is a full rank matrix, and so (5.2) admits a unique solution, implying
as a corollary that the interpolating polynomial exists and is unique. It is possible to show
that the condition number of the Vandermonde increases dramatically with n. Consequently,
solving (5.2) is not a viable method in practice for calculating the interpolating polynomial.
While simple, this approach to polynomial interpolation has a couple of disadvantages: first,
evaluating u
b(x) is computationally costly when n is large and, second, all the basis functions
change when adding new interpolation nodes. In addition, Lagrange interpolation is numerically
unstable because of cancellations between large terms. Indeed, it is often the case that Lagrange
polynomials take very large values over the interpolation intervals; this occurs, for example,
when many equidistant interpolation nodes are employed, as illustrated in Figure 5.1.
The constant coefficient can be obtained by evaluating the polynomial at 0, the linear coefficient
can be identified by evaluating the first derivative at 0, and so on. Assume now that we are given
the values taken by p when evaluated at the integer numbers {0, . . . , n}. We ask the following
100
Chapter 5. Interpolation and approximation
Figure 5.1: Lagrange polynomials associated with equidistant nodes over the (0, 1) interval.
question: can we find a formula similar in spirit to (5.3), but including only evaluations of p and
not of its derivatives? To answer this question, we introduce the difference operator ∆ which
acts on functions as follows:
∆f (x) = f (x + 1) − f (x).
The operator ∆ is a linear operator on the space of continuous functions. It maps constant
functions to 0, and the linear function x to the constant function 1, suggesting a resemblance
with the differentiation operator. In order to further understand this connection, let us define
the falling power of a real number x as
In other words, the action of the difference operator on falling powers mirrors that of the differ-
entiation operator on monomials. The falling powers form a basis of the space of polynomials,
101
Chapter 5. Interpolation and approximation
and so any polynomial in P(n), i.e. of degree less than or equal to n, can be expressed as
p(x) = α0 + α1 x1 + α2 x2 + · · · + αn xn . (5.4)
It is immediate to show that αi = ∆i p(0), where ∆i p denotes the function obtained after i
applications of the operator ∆. Therefore, any polynomial of degree less than or equal to n may
be expressed as
∆2 2 ∆n p(0) n
p(x) = p(0) + ∆p(0)x1 + x + ··· + x . (5.5)
2 n!
An expansion of the form (5.5) is called a Newton series, which is the discrete analog of the
continuous Taylor series. From the definition of ∆, it is clear that the coefficients in (5.5) depend
only on p(0), . . . , p(n). We conclude that, given points n + 1 points (i, ui ) for i ∈ {0, . . . , n}, the
unique interpolating polynomial given by (5.5), after replacing p(i) by ui .
Since ∆S(n) = (n + 1)2 , which is a second degree polynomial in n, we deduce that S(n) is a
polynomial of degree 3. Let us now determine its coefficients.
n 0 1 2 3
∆0 S(n) 0 1 5 14
∆1 S(n) 1 4 9
∆2 S(n) 3 5
∆3 S(n) 2
We conclude that
3 2 n(2n + 1)(n + 1)
S(n) = 1n + n(n − 1) + n(n − 1)(n − 2) =
2! 3! 6
Notice that when falling powers are employed as polynomial basis, the matrix in (5.1) is
lower triangular, and so the algorithm described in Example 5.1 could be replaced by the
forward substitution method. Whereas the coefficients of the Lagrange interpolant can be
obtained immediately from the values of u at the nodes, calculating the coefficients of the
expansion in (5.4) requires O(n2 ) operations. However, Gregory–Newton interpolation has
several advantages over Lagrange interpolation:
• If a point (n + 1, pn+1 ) is added to the set of interpolation points, only one additional
term, corresponding to the falling power xn+1 , needs to be calculated in (5.5). All the
other coefficients are unchanged. Therefore, the Gregory–Newton approach is well-suited
for incremental interpolation.
102
Chapter 5. Interpolation and approximation
interpolation, because the basis functions do not take very large values.
• A polynomial in the form of a Newton series can be evaluated efficiently using Horner’s
method, which is based on rewriting the polynomial as
p(x) = α0 + x α1 + (x − 1) α2 + (x − 2) α3 + (x − 3) . . . .
Evaluating this expression starting from the innermost bracket leads to an algorithm with
a cost scaling linearly with the degree of the polynomial.
Non-equidistant nodes
So far, we have described the Gregory–Newton method in the simple setting where interpolation
nodes are just a sequence of successive natural numbers. The method can be generalized to the
setting of nodes x0 6= . . . 6= xn which are not necessarily equidistant. In this case, we take as
basis the following functions instead of the falling powers:
with the convention that the empty product is 1. By (5.1), the coefficients of the interpolating
polynomial in this basis solve the following linear system:
1 ... 0 α0 u0
1 x1 − x0
α1 u1
..
. (5.7)
α u
1 x2 − x0 (x2 − x0 )(x2 − x1 ) 2 = 2 .
. .
.. .. .. .. ..
. . .
Qn−1
1 xn − x0 ... ... j=0 (xn − xj )
αn un
This system could be solved using, for example, forward substitution. Clearly α0 = u0 , and
then from the second equation we obtain
u1 − u0
α1 = =: [u0 , u1 ],
x1 − x0
It is possible to find an expression for the coefficients of the interpolating polynomials in terms
of these divided differences.
Proposition 5.1. Assume that (x0 , u0 ), . . . , (xn , un ) are n+1 points in the plane with distinct
103
Chapter 5. Interpolation and approximation
Proof. The statement is true for n = 0. Reasoning by induction, we assume that it holds true
for polynomials of degree up to n − 1. Let p1 (x) and p2 (x) be the interpolating polynomials at
the points x0 , x1 , . . . , xn−2 , xn−1 and x0 , x1 , . . . , xn−2 , xn , respectively. Then
x − xn−1
(5.8)
p(x) = p1 (x) + p2 (x) − p1 (x)
xn − xn−1
is a polynomial of degree n that interpolates all the data points. By the induction hypothesis,
it holds that
n−2
p1 (x) = u0 + [u0 , u1 ](x − x0 ) + . . . + [u0 , u1 , . . . , un−2 , un−1 ]
Y
(x − xi ),
i=0
n−2
p2 (x) = u0 + [u0 , u1 ](x − x0 ) + . . . + [u0 , u1 , . . . , un−2 , un ]
Y
(x − xi ).
i=0
Here we used bold font to emphasize the difference between the two expressions. Substituting
these expressions in (5.8), we obtain
n−2
Y
p(x) =u0 + [u0 , u1 ](x − x0 ) + . . . + [u0 , . . . , un−2 ] (x − xi )
i=0
n−1
[u0 , u1 , . . . , un−2 , un−1 ] − [u0 , u1 , . . . , un−2 , un ] Y
+ (x − xi ).
xn − xn−1
i=0
In Exercise 5.4, we show that divided differences are invariant under permutations of the data
points, and so we have that
Example 5.2. Assume that we are looking for the third degree polynomial going through the
points
(−1, 10), (0, 4), (2, −2), (4, −40).
We have to calculate αi = [u0 , . . . , ui ] for i ∈ {0, 1, 2, 3}. It is convenient to use a table in order
to calculate the divided differences:
104
Chapter 5. Interpolation and approximation
i 0 1 2 3
[ui ] 10 4 −2 −40
xi+1 − xi 1 2 2
[ui , ui+1 ] −6 −3 −19
xi+2 − xi 3 4
[ui , ui+1 , ui+2 ] 1 −4
xi+3 − xi 5
[ui , ui+1 , ui+2 , ui+3 ] −1
Theorem 5.2 (Interpolation error). Assume that u : [a, b] → R is a function in C n+1 ([a, b])
and let x0 , . . . , xn denote n + 1 distinct interpolation nodes. Then for all x ∈ [a, b], there
exists ξ = ξ(x) in the interval [a, b] such that
u(n+1) (ξ)
en (x) := u(x) − u
b(x) = (x − x0 ) . . . (x − xn ).
(n + 1)!
Proof. The statement is obvious if x ∈ {x0 , . . . , xn }, so we assume from now on that x does not
coincide with an interpolation node. Let us use the compact notation ωn = ni=0 (x − xi ) and
Q
The function g is smooth and takes the value 0 when evaluated at x0 , . . . , xn , x. Since g has
n + 2 roots in the interval [a, b], Rolle’s theorem implies that g 0 has n + 1 distinct roots in (a, b).
Another application of Rolle’s theorem yields that g 00 has n distinct roots in (a, b). Iterating
this reasoning, we deduce that g (n+1) has one root t∗ in (a, b). From (5.9), we calculate that
105
Chapter 5. Interpolation and approximation
u(n+1) (t∗ )
en (x) = ωn (x),
(n + 1)!
Corollary 5.3 (Upper bound on the interpolation error). Assume that u is smooth in [a, b] and
that
sup |u(n+1) (x)| ≤ Cn+1 .
x∈[a,b]
Then
Cn+1 n+1
∀x ∈ [a, b], |en (x)| ≤ h (5.11)
4(n + 1)
where h is the maximum spacing between two successive interpolation nodes.
Cn+1
∀x ∈ [a, b], |en (x)| ≤ |(x − x0 ) . . . (x − xn )|. (5.12)
(n + 1)!
h2 n!hn+1
× 2h × 3h × 4h × · · · × nh = . (5.13)
4 4
The first factor comes from the fact that, if x ∈ [xi , xi+1 ], then
(xi+1 − xi )2
|(x − xi )(x − xi+1 )| ≤ ,
4
because the left-hand side is maximized when x is the midpoint of the interval [xi , xi+1 ]. Sub-
stituting (5.13) into (5.12), we deduce the statement.
Let us now introduce the supremum norm of the error over the interval [a, b]:
We ask the following natural question: does En tend to zero as the maximum spacing between
successive nodes tends to 0? By Corollary 5.3, the answer to this question is positive when Cn
does not grow too quickly as n → ∞. For example, as illustrated in Figure 5.2, the interpolation
error for the function u(x) = sin(x), when using equidistant interpolation nodes, decreases very
quickly as n → ∞.
In some cases, however, the constant Cn grows quickly with n, to the extent that En may
increase with n; in this case, the maximum interpolation error grows as we add nodes! The
106
Chapter 5. Interpolation and approximation
Figure 5.2: Interpolation (in orange) of the function u(x) = sin(x) (in blue) using 3, 4, 6, and
8 equidistant nodes.
classic example, in order to illustrate this potential issue, is that of the Runge function:
1
u(x) = . (5.14)
1 + 25x2
It is possible to show that, for this function, the upper bound in Corollary 5.3 tends to ∞ in the
limit as the number of interpolation nodes increases. We emphasize that this does not prove
that En → ∞ in the limit as n → ∞, because (5.11) provides only an upper bound on the error.
In fact, the interpolation error for the Runge function can either grow or decrease, depending
on the choice of interpolation nodes. With equidistant nodes, it turns out that En → ∞, as
illustrated in Figure 5.3.
E := sup |u(x) − u
b(x)|
x∈[a,b]
is minimized? Here u
b is the polynomial interpolating u at the nodes. Achieving this goal in
general is a difficult task, essentially because ξ = ξ(x) is unknown in the expression of the
107
Chapter 5. Interpolation and approximation
Figure 5.3: Interpolation (in orange) of the Runge function (5.14) (in blue) using 6, 10, 14, and
20 equidistant nodes.
u(n+1) (ξ)
en (x) = (x − x0 ) . . . (x − xn ).
n + 1!
In view of this difficulty, we will focus on the simpler problem of finding interpolation nodes
such that the product (x − x0 ) . . . (x − xn ) is minimized in the supremum norm. This problem
amounts to finding the optimal interpolation nodes, in the sense that E is minimized, in the
particular case where u is a polynomial of degree n+1, because in this case un+1 (ξ) is a constant
factor. It turns out that this problem admits an explicit solution, which we will deduce from
the following theorem.
In addition, the lower bound is achieved for p∗ (x) = 2−(n−1) Tn (x), where Tn is the Chebyshev
polynomial of degree n:
Proof. By Exercise 2.24, the polynomial x 7→ 2−(n−1) Tn (x) is indeed monic, and it is clear that
it achieves the lower bound (5.15) since |Tn (x)| ≤ 1.
108
Chapter 5. Interpolation and approximation
In order to prove (5.15), we assume by contradiction that there is a different monic polyno-
mial pe of degree n such that
sup |e
p(x)| < E. (5.17)
x∈[−1,1]
The function q(x) := p(x) − pe(x) is a polynomial of degree at most n − 1 which, by the
assumption (5.17), is strictly positive at x0 , x2 , x4 , . . . and strictly negative at x1 , x3 , x5 , . . . .
Therefore, the polynomial q(x) changes sign at least n times and so, by the intermediate value
theorem, it has at least n roots. But this is impossible, because q(x) 6= 0 and q(x) is of degree
at most n − 1.
Remark 5.1 (Derivation of Chebyshev polynomials). The polynomial p∗ achieving the lower
bound in (5.15) oscillates between the values −E and E, which are respectively its minimum
and maximum values over the interval [−1, 1]. It attains the values E or −E at n + 1 points
x0 , . . . , xn , with x0 = −1 and xn = 1. It turns out that these properties, which can be shown
to hold a priori using Chebyshev’s equioscillation theorem, are sufficient to derive an explicit
expression for the polynomial p∗ , as we formally demonstrate hereafter.
The points x2 , . . . , xn−1 are local extrema of p∗ , and so p0∗ (x) = 0 at these nodes. We
therefore deduce that p∗ satisfies the differential equation
Indeed, both sides are polynomials of degree 2n with single roots at -1 and 1, with double roots
at x2 , . . . , xn+1 , and with the same coefficient of the leading power. In order to solve (5.18),
we rearrange the equation and take the square root:
p0∗ (x)
d d
n p∗ (x)
E
=± ⇔ arccos = ±n arccos(x).
1 − x2 dx dx
q
p∗ (x)2 E
1− E2
Corollary 5.5 (Chebyshev nodes). Assume that x0 < x1 < . . . < xn are in the interval [a, b].
109
Chapter 5. Interpolation and approximation
The supremum norm of the product ω(x) := (x − x0 ) · · · (x − xn ) over [a, b] is minimized when
1 + cos (2i+1)π
2(n+1)
xi = a + (b − a) (5.19)
2
The function
2n 2n+1
p(y) := n
ω ζ(y) = n+1
ζ(y) − x0 · · · ζ(y) − xn
(b − a) (b − a)
yi = ζ −1 (xi ),
= y − y0 · · · y − yn ,
2n+1
sup |p(y)| = sup |(x − x0 ) . . . (x − xn )|. (5.20)
y∈[−1,1] (b − a)n+1 x∈[a,b]
By Theorem 5.4, the left-hand side is minimized when p is equal to 2−n Tn+1 (y), i.e. when the
roots of p coincide with the roots of Tn+1 . This occurs when
(2i + 1)π
yi = ζ −1 (xi ) = cos .
2(n + 1)
1 + cos (2i + 1) 2n
π
xi = a + (b − a) , i = 0, . . . , n, (5.21)
2
are known as Chebyshev nodes and, more often than not, employing these nodes for interpola-
tion produces much better results than using equidistant rodes, both in the case where u is a
polynomial of degree n + 1, as we just proved, but also for general u. As an example we plot in
Section 5.1.5 the interpolation of the Runge function using Chebyshev nodes. In this case, the
interpolating polynomial converges uniformly to the Runge function as we increase the number
of interpolation nodes!
110
Chapter 5. Interpolation and approximation
section that only the first derivative is specified. In this case, the aim of Hermite interpolation
is to find, given data (xi , ui , u0i ) for i ∈ {0, . . . , n}, a polynomial u
b of degree at most 2n + 1 such
that
∀i ∈ {0, . . . , n}, u
b(xi ) = ui , b0 (xi ) = u0i .
u
The function ϕi is the square of the usual Lagrange polynomials associated with xi , and it
satisfies
n
X 2
ϕi (xj ) = 1, ϕ0i (xi ) = , ∀j 6= i ϕi (xj ) = ϕ0i (xj ) = 0.
xi − xj
j=0,j6=i
u
b(xi ) = qi (xi ), b0 (xi ) = ϕ0i (xi )q(xi ) + q 0 (xi ).
u
From the first equation, we deduce that qi (xi ) = ui , and from the second equation we then
have q 0 (xi ) = u
b0 (xi ) − ϕ0i (xi )ui . We conclude that the interpolating polynomial is given by
n
X
ϕi (x) ui + u0i − ϕ0i (xi )ui (x − xi ) .
u
b(x) =
i=0
111
Chapter 5. Interpolation and approximation
u(2n+2) (ξ)
u(x) − u
b(x) = (x − x0 )2 . . . (x − xn )2 .
(2n + 2)!
h= max |xi+1 − xi |.
i∈{0,...,n−1}
where the nodes xi , . . . , xi are equally spaced with distance h/m. The idea of piecewise
(0) (m)
Lagrange interpolation is to calculate, for each interval Ii in the partition, the interpolating
polynomial pi at the nodes xi , . . . , xi . The interpolant is then defined as
(0) (m)
u
b(x) = pι (x), (5.22)
where ι = ι(x) is the index of the interval to which x belongs. Since xi = xi+1 = xi+1 ,
(m) (0)
the interpolant defined by (5.22) is continuous. If the function u belongs to C m+1 ([a, b]), then
112
Chapter 5. Interpolation and approximation
by Corollary 5.3 the interpolation error within each subinterval may be bounded from above as
follows:
Cm+1 (h/m)m+1
sup |u(x) − u
b(x)| ≤ , Cm+1 := sup |u(m+1) (x)|, (5.23)
x∈Ii 4(m + 1) x∈[a,b]
and so we deduce
sup |u(x) − u
b(x)| ≤ Chm+1 ,
x∈[a,b]
for an appropriate constant C independent of h. This equation shows that the error is guaran-
teed to decrease to 0 in the limit as h → ∞. In practice, the number m of interpolation nodes
within each interval can be small.
5.2 Approximation
In this section, we focus on the subject of approximation, both of discrete data points and of
continuous functions. We begin, in Section 5.2.1 with a discussion of least squares approximation
for data points, and in Section 5.2.2 we focus on function approximation in the mean square
sense.
for some m < n. In many cases of practical interest, the basis functions ϕ0 , . . . , ϕm are polyno-
mials. In contrast with interpolation, here we seek a function u
b in a finite-dimensional function
space of dimension m strictly lower than the number of data points. In order for u
b to be a good
approximation, we wish to find coefficients α0 , . . . , αm such that the following linear system is
approximately satisfied.
ϕ0 (x0 ) ϕ1 (x0 ) ... ϕm (x0 ) u0
ϕ0 (x1 ) ϕ1 (x1 ) ... ϕm (x1 ) u1
α0
ϕ (x ) ϕ1 (x2 ) ... ϕm (x2 )
u2
0 2
.
.. .. .. 1 .
α
. ≈ .. =: b.
Aα := . .
.
.
ϕ0 (xn−2 ) ϕ1 (xn−2 ) ... ϕm (xn−2 )
un−2
ϕ (x αm
0 n−1 ) ϕ1 (xn−1 ) ... ϕm (xn−1 ) un−1
ϕ0 (xn ) ϕ1 (xn ) ... ϕm (xn ) un
In general, since the matrix on the left-hand side has more lines than columns, there does not
exist an exact solution to this equation. In order to find an approximate solution, a natural
approach is to find coefficients α0 , . . . , αm such that the residual vector r = Aα − b is small in
113
Chapter 5. Interpolation and approximation
some vector norm. A particularly popular approach, known as least squares approximation, is
to minimize the Euclidean norm of r or, equivalently, the square of the Euclidean norm:
2
n
X n
X m
X
krk2 = b(xi )|2 =
|ui − u ui − αj ϕj (xi ) .
i=0 i=0 j=0
Let us denote the right-hand side of this equation by J(α), which we view as a function of the
vector of coefficients a. A necessary condition for a∗ to be a minimizer is that ∇J(α∗ ) = 0.
The gradient of J, written as a column vector, is given by
∇J(α) = ∇ (Aα − b)T (Aα − b)
AT Aα∗ = AT b, (5.25)
Equation (5.25) is a system of m equations with m unknowns, which admits a unique solution
provided that AT A is full rank or, equivalently, the columns of A are linearly independent. The
linear equations (5.25) are known as the normal equations. As a side note, we mention that
the solution α∗ = (AT A)−1 AT b coincides with the maximum likelihood estimator for α under
the assumption that the data is generated according to ui = u(xi ) + εi , for some function
u ∈ span{ϕ0 , . . . , ϕm } and random noise εi ∼ N (0, 1).
The matrix ΠA := A(AT A)−1 AT on the right-hand side is the orthogonal projection operator
onto col(A) ⊂ Rn , the subspace spanned by the columns of A. Indeed, it holds that Π2A = ΠA ,
which is the defining property of a projection operator.
To conclude this section, we note that the matrix A+ = (AT A)−1 AT is a left inverse of the
matrix A, because A+ A = I. It is also called the Moore–Penrose inverse or pseudoinverse of the
matrix A, which generalizes the usual inverse matrix. In Julia, the backslash operator silently
uses the Moore–Penrose inverse when employed with a rectangular matrix. Therefore, solving
114
Chapter 5. Interpolation and approximation
u − uk2 = hb
J(α) := kb u − u, u
b − ui.
m m
* +
X X
J(α) = u− αj ϕj , u − α k ϕk
j=0 k=0
m m m
αj hu, ϕj i − hu, ui = αT Gα − 2bT α − hu, ui,
XX X
= αj αk hϕj , ϕk i − 2
j=0 k=0 j=0
where we introduced
hϕ0 , ϕ0 i hϕ0 , ϕ1 i ... hϕ0 , ϕm i hu, ϕ0 i
hϕ1 , ϕ0 i hϕ1 , ϕ1 i ... hϕ1 , ϕm i hu, ϕ1 i
G := .. .. .. , b := . . (5.27)
..
. . .
hϕm , ϕ0 i hϕm , ϕ1 i . . . hϕm , ϕm i hu, ϕm i
Employing the same approach as in the previous section, we then obtain ∇J(α) = Gα − b, and
so the minimizer of J(α) is the solution to the equation
Gα∗ = b. (5.28)
The matrix G, known as the Gram matrix, is positive semi-definite and nonsingular provided
that the basis functions are linearly independent, see Exercise 5.9. Therefore, the solution α∗
exists and is unique. In addition, since the Hessian of J is equal to G, the vector α∗ is indeed
a minimizer. Note that if h•, •i is defined as a finite sum of the form
n
(5.29)
X
hf, gi = f (xi )g(xi ),
i=0
115
Chapter 5. Interpolation and approximation
then (5.28) coincides with the normal equations (5.25) from the previous section. We remark
that (5.29) is in fact not an inner product on the space of continuous functions, but it is an
inner product on the space of polynomials of degree less than or equal to n.
In practice, the matrix and right-hand side of the linear system (5.28) can usually not be
calculated exactly, because the inner product h•, •i is defined through an integral; see (5.30) in
the next section.
hb
u − u, ϕ0 i = 0, ..., hb
u − u, ϕm i = 0.
∀v ∈ span{ϕ0 , . . . , ϕm }, hb
u − u, vi = 0.
αi = hu, ϕi i, i = 0, . . . , m,
The coefficients hu, ϕi i of the basis functions in this expansion are called Fourier coefficients.
Given a finite dimensional subspace S of the space of continuous functions, an orthonormal basis
can be constructed via the Gram–Schmidt process. In this section, we focus on the particular
case where S = P(n) – the subspace of polynomials of degree less than or equal to n. Another
widely used approach, which we do not explore in this course, is to use trigonometric basis
functions. We also assume that the inner product (5.26) is of the form
Z b
hf, gi = f (x)g(x)w(x) dx, (5.30)
a
Let ϕ0 (x), ϕ1 (x), ϕ2 (x) . . . denote the orthonormal polynomials obtained by applying the Gram–
Schmidt procedure to the monomials 1, x, x2 , . . . . These depend in general on the weight w(x)
116
Chapter 5. Interpolation and approximation
and on the interval [a, b]. A few of the popular classes of orthogonal polynomials are presented
in the table below:
Legendre 1 [−1, 1]
Chebyshev √ 1
1−x2
(−1, 1)
Hermite exp − x2
2
[−∞, ∞]
Orthogonal polynomials have a rich structure, and in the rest of this section we prove
some of their common properties, one of which will be very useful in the context of numerical
integration in Chapter 6. We begin by showing that orthogonal polynomials have distinct real
roots.
Proposition 5.7. Assume for simplicity that w(x) > 0 for all x ∈ [a, b], and let ϕ0 , ϕ1 , . . .
denote orthonormal polynomials of increasing degree for the inner product (5.30). Then for
all n ∈ N, the polynomial ϕn has n distinct roots in the open interval (a, b).
Proof. Reasoning by contradiction, we assume that ϕn changes sign at only k < n points of the
open interval (a, b), which we denote by x1 , . . . , xk . Then
ϕn (x) × (x − x0 )(x − x1 ) . . . (x − xk )
Proposition 5.8. Assume that ϕ0 , ϕ1 , . . . are orthonormal polynomials for some inner prod-
uct of the form (5.30) such that ϕi is of degree i. Then
where
αn = hxϕn , ϕn−1 i, βn = hxϕn , ϕn i.
117
Chapter 5. Interpolation and approximation
n+1
(5.32)
X
xϕn (x) = γn,i ϕi (x).
i=0
Taking the inner product of both sides of this equation with ϕi and employing the orthonormality
assumption, we obtain an expression for the coefficients:
γn,i = hxϕn , ϕi i.
From the expression (5.30) of the inner product, it is clear that hxϕn , ϕi i = hϕn , xϕi i. Since xϕi
is a polynomial of degree i + 1 and ϕn is orthogonal to all polynomials of degree strictly less
than n, we deduce that γn,i = 0 if i < n − 1. Consequently, we can rewrite the right-hand side
of (5.32) as a sum involving only three terms
xϕn (x) = hxϕn , ϕn−1 iϕn−1 (x) + hxϕn , ϕn iϕn (x) + hxϕn , ϕn+1 iϕn+1 (x). (5.33)
Remark 5.4. Notice that the polynomials in (5.8) are orthonormal by assumption, and so the
coefficient αn+1 is just a normalization constant. We deduce that
Let T denote the matrix on the left-hand side of this equation, and let r0 , . . . , rm denote the
roots of ϕm+1 . By Proposition 5.7, these are distinct and all belong to the interval (a, b). But
118
Chapter 5. Interpolation and approximation
T
In other words, for any root r of ϕm+1 , the vector ϕ0 (r) . . . ϕm (r) is an eigenvector of the
matrix T, with associated eigenvalue equal to r. Since T is a symmetric matrix, the eigenvectors
associated to distinct eigenvalues are orthogonal for the Euclidean inner product of Rm+1 , so
given that the eigenvalues of T are distinct, we deduce that
m
(5.34)
X
∀i 6= j, ϕi (ri )ϕi (rj ) = 0.
i=0
Equation (5.34) indicates that the lines of P are orthogonal, and so the matrix D = PPT is
diagonal with elements given by
m
X
dii = |ϕj (ri )|2 , i = 0, . . . , m.
j=0
(Here we start counting the lines from 0 for convenience.) Since PPT D−1 = I, we deduce that
the inverse of P is given by P−1 = PT D−1 . Consequently,
PT D−1 P = P−1 P = I,
which means that the polynomials ϕ1 , . . . , ϕm are orthogonal for the inner product
We have thus shown that, if ϕ0 , ϕ1 , ϕ2 are a family of orthonormal polynomials for an inner
product h•, •i, then they are also orthonormal for the inner product h•, •im+1 . Consequently,
it holds that
∀(p, q) ∈ P(m) × P(m), hp, qi = hp, qim+1 , (5.35)
119
Chapter 5. Interpolation and approximation
hp, qi = hα0 ϕ0 + · · · + αm ϕm , β0 ϕ0 + · · · + βm ϕm i
m X
X m m X
X m
= αi βj hϕi , ϕj i = α0 β0 + · · · + αm βm = αi βj hϕi , ϕj im+1
i=0 j=0 i=0 j=0
We have thus shown the following result. Since m was arbitrary in the previous reasoning, we
add a superscript to indicate when the quantities involved depend on m.
where r0 are the roots of ϕm+1 and the weights wi are given by
(m+1) (m+1) (m+1)
, . . . , rm
(m+1) 1
wi =P 2, i = 0, . . . , m.
m (m+1)
j=0 ϕj ri
Taking q = 1 in (5.35) and employing the definitions of h•, •i and h•, •im+1 , we have
Z b m
∀p ∈ P(m), p(x) w(x) dx = (5.36)
(m+1) (m+1)
X
p ri wi .
a i=0
Since the left-hand side is an integral and the right-hand side is a sum, we have just constructed
an integration formula, which enjoys a very nice property: it is exact for polynomials of degree up
to m! A formula of this type is called a quadrature formula, with m + 1 nodes r0
(m+1) (m+1)
, . . . , rm
and associated weights w0 . Note that the nodes and weights of the quadrature
(m+1) (m+1)
, . . . , wm
depend on the weight w(x) and on the degree m.
In fact, equation (5.36) is valid more generally than for p ∈ P(m). Indeed, for every
monomial xp with m ≤ p ≤ 2m, we have that
Z b Z b
x w(x) dx =
p
xp−m xm w(x) dx
a a
m
(m+1) p
(m+1)
X
= hxp−m , xm i = hxp−m , xm im+1 = hxp , 1im+1 = ri wi .
i=0
120
Chapter 5. Interpolation and approximation
Consequently, the formula (5.36) is exact for any monomial of degree up to 2m, and we conclude
that, by linearity, it is also valid for any polynomials of degree up to 2m.
Remark 5.5. In fact, the formula (5.36) is valid for polynomials of degree up to 2m + 1.
Indeed, using that x2m+1 = ϕm+1 (x)p(x) + q(x), for some polynomial p of degree m (the
quotient of the polynomial division of x2m+1 by ϕm+1 ) and some polynomial q of degree less
than m (the remainder of the polynomial division), we have
Z b Z b Z b Z b
x2m+1 w(x) dx = ϕm+1 (x)p(x)w(x) dx + q(x)w(x) dx = 0 + q(x)w(x) dx
a a a a
m m
(m+1) (m+1) (m+1) (m+1) (m+1)
X X
= ϕm+1 ri p ri wi + q ri wi
i=0 i=0
m 2m+1
(m+1) (m+1)
X
= ri wi ,
i=0
where we used, in the penultimate inequality, the fact that r0 are the roots
(m+1) (m+1)
, . . . , rm
of the polynomial ϕm+1 .
5.3 Exercises
� Exercise 5.1. Find the polynomial p(x) = ax + b (a straight line) that goes through the
points (x0 , u0 ) and (x1 , u1 ).
� Exercise 5.2. Find the polynomial p(x) = ax2 + bx + c (a parabola) that goes through the
points (0, 1), (1, 3) and (2, 7).
� Exercise 5.3. Prove the following recurrence relation for Chebyshev polynomials:
121
Chapter 5. Interpolation and approximation
� Exercise 5.6. Let (f0 , f1 , f2 , . . . ) = (1, 1, 2, . . . ) denote the Fibonacci sequence. Prove that
there does not exist a polynomial p such that
∀i ∈ N, fi = p(i).
n2 n3 n4
∀n ∈ N>0 , 2n = 1 + n + + + + ··· (5.37)
2! 3! 4!
Remark 5.6. Remarkably, equation (5.37) holds in fact for any n ∈ R>0 . However, showing
this more general statement is beyond the scope of this course.
Use equidistant and then Chebyshev nodes, and compare the two approaches in terms of accuracy.
Plot the function f together with the approximating polynomials.
� Exercise 5.11. We wish to use interpolation to approximate the following parametric func-
tion, called an epitrochoid:
R+r
x(θ) = (R + r) cos θ + d cos θ (5.38)
r
R+r
y(θ) = (R + r) sin θ − d sin θ , (5.39)
r
with R = 5, r = 2 and d = 3, and for θ ∈ [0, 4π]. Write a Julia program to interpolate x(θ)
and y(θ) using 40 equidistant points. Use the BigFloat format in order to reduce the impact
of round-off errors. After constructing the polynomial interpolations x b(θ) and yb(θ), plot the
parametric curve θ 7→ xb(θ), yb(θ) . Your plot should look similar to Figure 5.4.
� Exercise 5.12 (Modeling the vapor pressure of mercury). The dataset loaded through the
following Julia commands contains data on the vapor pressure of mercury as a function of the
temperature.
import RDatatets
data = RDatasets.dataset("datasets", "pressure")
Find a low-dimensional mathematical model of the form
122
Chapter 5. Interpolation and approximation
for the pressure as a function of the temperature. Plot the approximation together with the data.
An example solution is given in Figure 5.5.
123
Chapter 6
Numerical integration
Introduction
Integrals are ubiquitous in science and mathematics. In this chapter, we are concerned with the
problem of calculating numerically integrals of the form
Z
I= u(x) dx, (6.1)
Ω
n−1
I = lim
X
u(ti )(zi+1 − zi ),
h→0
i=0
where a = z0 < · · · < zn = b is a partition of the interval [a, b] such that the maximum spacing
between successive x values is equal to h, and with ti ∈ [xi , xi+1 ] for all i ∈ {0, . . . , n − 1}.
All the numerical integration formulas that we present in this chapter are based on a deter-
ministic approximation of the form
m
(6.2)
X
Ib = wi u(xi ),
i=0
124
Chapter 6. Numerical integration
where x0 < . . . < xn are the integration nodes and w0 , . . . , wn are the integration weights. In
many cases, integration formulas contain a small parameter that can be changed to improve
the accuracy of the approximation. In methods based on equidistant interpolation nodes, for
example, this parameter encodes the distance between nodes and is typically denoted by h, and
we often use the notation Ibh to emphasize the dependence of the approximation on h. The
difference Eh = I − Ih is called the integration error or discretization error, and the degree of
precision of an integration method is the smallest integer number d such that the integration
error is zero for all the polynomials of degree less than or equal to d.
We observe that, without loss of generality, we can consider that the integration interval is
equal to [−1, 1]. Indeed, using the change of variable
we have
b 1 1
b−a
Z Z Z
u(x) dx = u ζ(y) ζ 0 (y) dy = u ◦ ζ(y) dy, (6.4)
a −1 2 −1
and the right-hand side is the integral of u ◦ ζ over the interval [−1, 1].
The weights are independent of the function u, and so they can be calculated a priori. The class
of integration methods obtained using this approach are known as Newton–Cotes methods. We
present a few particular cases:
• m = 1, d = 1 (trapezoidal rule):
Z 1
u(x) dx ≈ u(−1) + u(1). (6.5)
−1
• m = 2, d = 3 (Simpson’s rule):
Z 1
1 4 1
u(x) dx ≈ u(−1) + u(0) + u(1). (6.6)
−1 3 3 3
125
Chapter 6. Numerical integration
• m = 3, d = 3 (Simpson’s 3
8 rule):
Z 1
1 3 3 1
u(x) dx ≈ u(−1) + u(−1/3) + u(1/3) + u(1).
−1 4 4 4 4
• m = 4, d = 5 (Bode’s rule):
Z 1
7 32 1 12 32 1 7
u(x) dx ≈ u(−1) + u − + u (0) + u + u(1).
−1 45 45 2 45 45 2 45
In principle, this approach could be employed in order to construct integration rules of arbitrary
high degree of precision. In practice, however, the weights become more and more imbalanced
as the number of interpolation points increases, with some of them becoming negative. As a
result, roundoff errors become increasingly detrimental to accuracy as the degree of precision
increases. In addition, in cases where the interpolating polynomial does not converge to u, for
example if u is Runge’s function, the approximate integral may not even converge to the correct
value in the limit as m → ∞ in exact arithmetic!
Note that, although it is based on a quadratic polynomial interpolation, Simpson’s rule (6.6)
has a degree of precision equal to 3. This is because any integration rule with nodes and weights
symmetric around x = 0 is exact for odd functions, in particular x3 . Likewise, the degree of
precision of Bode’s rule is equal to 5.
Composite trapezoidal rule. Let us illustrate the composite approach with an example. To
this end, we introduce a partition a = x0 < · · · < xm = b of the interval [a, b] and assume
that the nodes are equidistant with xi+1 − xi = h. Using (6.4), we first generalize (6.5) to an
interval [xi , xi+1 ] as follows:
Z xi+1 Z 1
h
u(x) dx = u ◦ ζ(y) dy ≈ u ◦ ζ(−1) + u ◦ ζ(1) =
u(xi ) + u(xi+1 ) ,
xi −1 2
where ≈ in this equation indicates approximation using the trapezoidal rule. Applying this
approximation to each subinterval of the partition, we obtain the composite trapezoidal rule:
Z b n−1
X Z xi+1 n−1
hX
u(x) dx = u(x) dx ≈
u(xi ) + u(xi+1 )
a xi 2
i=0 i=0
h
(6.7)
= u(x0 ) + 2u(x1 ) + 2u(x2 ) + · · · + 2u(xn−2 ) + 2u(xn−1 ) + u(xn ) .
2
126
Chapter 6. Numerical integration
Like the trapezoidal rule (6.5), the composite trapezoidal rule (6.7) has a degree of precision
equal to 1. However, the accuracy of the method depends on the parameter h, which repre-
sents the width of each subinterval: for very small h, equation (6.7) is expected to provide
a good approximation of the integral. An error estimate can be obtained directly from the
formula in Theorem 5.2 for interpolation error, provided that we assume that u ∈ C 2 ([a, b]).
Denoting by Ibh the approximate integral calculated using (6.7), and by u
bh the piecewise linear
interpolation of u, we have
Z xi+1 Z xi+1
1
b(x) dx = u00 ξ(x) (x − xi )(x − xi+1 ) dx.
u(x) − u
xi 2 xi
Since (x − xi )(x − xi+1 ) is nonpositive over the interval [xi , xi+1 ], we deduce that
!Z
xi+1 xi+1
h3
Z
b(x) dx ≤
u(x) − u sup u (ξ)00
(x − xi )(x − xi+1 ) dx = C2 ,
xi ξ∈[a,b] xi 12
where we introduced
C2 = sup u00 (ξ) .
ξ∈[a,b]
n−1 xi+1
h3 b−a
Z
b(x) dx ≤ n × C2 (6.8)
X
|I − I|
b ≤ u(x) − u = C2 h2 .
xi 12 12
i=0
The integration error therefore scales as O(h2 ). (Strictly speaking, we have shown only that
the integration error admits an upper bound that scales at O(h2 ), but it turns out that the
dependence on h of this bound is optimal).
Composite Simpson rule. The composite Simpson rule is derived in Exercise 6.2. Given an
odd number n + 1 of equidistant points a = x0 < x1 < · · · < xn = b, it is given by
h
Ibh = u(x0 ) + 4u(x1 ) + 2u(x2 ) + 4u(x3 ) + 2u( x4 ) + · · · + 2u(xn−2 ) + 4u(xn−1 ) + u(xn ) . (6.9)
3
This approximation is obtained by integrating the piecewise quadratic interpolant over a par-
tition of the integration interval into n/2 subintervals of equal width. Obtaining an optimal
error estimate, in terms of the dependence on h, for this integration formula is slightly more
involved. For a given subinterval [x2i , x2i+2 ], let us denote by u
b2 (x) the quadratic interpolating
polynomial at x2i , x2i+1 , x2i+2 , and by u
b3 (x) a cubic interpolating polynomial relative to the
nodes x2i , x2i+1 , x2i+2 , xα , for some α ∈ [x2i , x2i+1 ]. We have
Z x2i+2 Z x2i+2 Z x2i+2
b2 (x) dx =
u(x) − u b3 (x) dx +
u(x) − u b2 (x) dx.
b3 (x) − u
u (6.10)
x2i x2i x2i
127
Chapter 6. Numerical integration
The second term is zero, because the integrand is a cubic polynomial with zeros at x2i , x2i+1
and x2i+2 , and because
Z x2i+2
(x − x2i )(x − x2i+1 )(x − x2i+2 ) = 0.
x2i
(Notice that the integrand is even around x2i+1 .) The cancellation of the second term in (6.10)
also follows from the fact that the degree of precision of the Simpson rule (6.6) is equal to 3,
and so
Z x2i+2
1 4 1
u b2 (x) dx = (b
b3 (x) − u u3 − u u3 − u
b2 )(x2i ) + (b u3 − u
b2 )(x2i+1 ) + (b b2 )(x2i+2 ) = 0.
x2i 3 3 3
Using Theorem 5.2, we rewrite first term in (6.10) from above as follows:
u(4) ξ(x)
Z x2i+2 Z x2i+2
b3 (x) dx ≤
u(x) − u (x − x2i )(x − x2i+1 )(x − x2i+2 )(x − xα ) dx.
x2i x2i 24
Since this formula is valid for all α ∈ [2i, 2i + 2], we are allowed to take α = x2i+1 . Given that
Z x2i+2
4 5
(x − x2i )(x − x2i+1 )2 (x − x2i+2 ) dx = h ,
x2i 15
with an integrand everywhere nonpositive in the interval [x2i , x2i+2 ], we conclude that
Z x2i+2
C4 5
b3 (x) dx ≤
u(x) − u h , C4 = sup |u(4) (ξ)|.
x2i 90 ξ∈[a,b]
n C4 h5 C4 h4
|I − Ibh | ≤ × = (b − a) . (6.11)
2 90 180
Estimating the error. In practice, it is useful to be able to estimate the integration error
so that, if the error is deemed too large, a better approximation of the integral can then be
calculated by using a smaller value for the step size h. Calculating the exact error I − Ibh is
impossible in general, because this would require to know the exact value of the integral, but it is
possible to calculate a rough approximation of the error based on two numerical approximations
of the integral, as we illustrate formally hereafter for the composite Simpson rule.
Suppose that Ib2h and Ibh are two approximations of the integral, calculated using the com-
posite Simpson rule with step size 2h and h, respectively. If we assume that the error scales
as O(h4 ) as (6.11) suggests, then it holds approximately that
1
I − Ibh ≈ 4 (I − Ib2h ). (6.12)
2
1
I − Ib2h = (I − Ibh ) + (Ih − Ib2h ) ≈ (I − Ib2h ) + (Ih − Ib2h ).
16
128
Chapter 6. Numerical integration
16 b
I − Ib2h ≈ (Ih − Ib2h ).
15
1 b
|I − Ibh | ≈ |Ih − Ib2h |. (6.13)
15
The right-hand side can be calculated numerically, because it does not depend on the exact
value of the integral. In practice, the two sides of (6.13) are often very close for small h. In the
code example below, we approximate the integral
Z π
2
I= cos(x) dx = 1 (6.14)
0
for different step sizes and compare the exact error with the approximate error obtained us-
ing (6.13). The results obtained are summarized in Table 6.1, which shows a good match
between the two quantities.
Table 6.1: Comparison between the exact integration error and the approximate integration
error calculated using (6.13).
129
Chapter 6. Numerical integration
# Approximate integrals
Î = composite_simpson.(u, a, b, ns)
# Calculate exact and approximate errors
for i in 2:length(ns)
println("Exact error: $(I - Î[i]), ",
"Approx error: $((Î[i] - Î[i-1])/15)")
end
η2 η3 ηk
J(η) = J(0) + J 0 (0)η + J 00 (0) + J (3) (0) + · · · + J (k) (0) + O(η k+1 ).
2 3! k!
Elimination of the linear error term. Let us assume that J 0 (0) 6= 0, so that the leading order
term after the constant J(0) scales as η. Then we have
We now ask the following question: can we combine linearly J(h) and J(h/2) in order to
approximate J(0) with an error scaling as O(h2 )? Using the ansatz J1 (h/2) = αJ(h)+βJ(h/2),
we calculate
1−α
J1 (h/2) = (α + β)J(0) + J 0 (0)h α + + O(h2 ). (6.15)
2
Since we want this expression to approximate J(0) for small h, we need to impose that α+β = 1.
Then, in order for the term multiplying h to cancel out, we require that
1−α
α+ =0 ⇔ α = −1.
2
Notice that, in the case where J is a linear function, J1 (h/2) is exactly equal to J(0). This
reveals a geometric interpretation of (6.16): the approximation J1 (h/2) is simply the y intercept
of the straight line passing through the points h/2, J(h/2) and h, J(h) .
130
Chapter 6. Numerical integration
Elimination of the quadratic error term. If we had tracked the coefficient of h2 in the previous
paragraph, we would have obtained instead of (6.15) the following equation:
h2
J1 (h/2) = J(0) − J (3) (0) + O(h3 ).
4
2J(h/4) − J(h/2) h2
J1 (h/4) = = J(0) − J (3) (0) + O(h3 ).
2 16
At this point, it is natural to wonder whether we can combine J1 (h/2) and J1 (h/4) in order to
produce an even better approximation of J(0). Applying the same reasoning as in the previous
section leads us to introduce
Elimination of higher order terms. The procedure above can be repeated in order to elimi-
nate terms of higher and higher orders. The following schematic illustrates, for example, the
calculation of an approximation J3 (h/8) = J(0) + O(h4 ).
J(h)
J(h/2) J1 (h/2)
In practice we calculate the values taken by J, J1 , J2 , . . . at specific values of h, but these are
in fact functions of h. In Figure 6.1, we plot these functions when J(h) = 1 + sin(h). It
appears clearly from the figure that, for sufficiently small h, J3 (h) provides the most precise
approximation of J(0) = 1. Constructing the functions in Julia can be achieved in just a few
lines of code.
J(h) = 1 + sin(h)
J_1(h) = 2J(h) - J(2h)
J_2(h) = (4J_1(h) - J_1(2h))/3
J_3(h) = (8J_2(h) - J_2(2h))/7
131
Chapter 6. Numerical integration
Generalization. Sometimes, it is known a priori that the Taylor development of the function J
around zero contains only even powers of h. In this case, the Richardson extrapolation procedure
can be slightly modified to produce approximations with errors scaling as O(h4 ), then O(h6 ),
then O(h8 ), etc. This procedure is illustrated below:
J(h)
J(h/2) J1 (h/2)
This time, the linear combinations required for populating this table are given by:
where a = x0 < x1 < · · · < xn = b are equidistant nodes. The right-hand side of this equation is
simply the composite trapezoidal rule with step size h. It is possible to show, see [9], that J(h)
may be expanded as follows:
132
Chapter 6. Numerical integration
Figure 6.2: Convergence of Romberg’s method. The straight lines correspond to the monomial
functions f (h) = Ci hi , with i = 2, 4, 6, 8 and for appropriate constants Ci . We observe a good
agreement between the observed and theoretical convergence rates.
133
Chapter 6. Numerical integration
The quadrature rule obtained by solving this system of equations is called the Gauss–Legendre
quadrature.
Example 6.1. Let us derive the Gauss–Legendre quadrature with n + 1 = 2 nodes. The system
of equations that we need to solve in this case is the following:
2
w0 + w1 = 2, w0 x0 + w1 x1 = 0, w0 x20 + w1 x21 = , w0 x30 + w1 x31 = 0.
3
the polynomial whose roots coincide with the unknown integration nodes. Multiplying the first
equation of (6.18) (d = 0) by α0 , the second (d = 1) by α1 , and so forth until the equation
corresponding to d = n + 1, we obtain after summing these equations
n n+1 Z 1 n+1 n Z 1
αd x dx p(x) dx.
X X X X
wi αd xdi = d
⇔ wi p(xi ) =
i=0 d=0 −1 d=0 i=0 −1
Since the left-hand side of this equation is equal to 0 by definition of p, we deduce that p is
orthogonal to the constant polynomial for the inner product
Z 1
hf, gi = f (x) g(x) dx. (6.19)
−1
134
Chapter 6. Numerical integration
n n+1 Z 1 n+1 n Z 1
dx
X X X X
wi x i αd xdi = αd xd+1
⇔ wi xi p(xi ) = p(x) xdx.
i=0 d=0 −1 d=0 i=0 −1
Since the left-hand side of this equation is again 0 because the nodes xi are the roots of p, we
deduce that p is orthogonal to the linear polynomial x 7→ x for the inner product (6.19). This
reasoning can be repeated in order to deduce that p is in fact orthogonal to all the monomials
of degree 0 to n, implying that p is a multiple of the Legendre polynomial of degree n + 1.
The degree of precision of this integration rule is the same as that of the corresponding one-
dimensional rule.
6.5 Exercises
� Exercise 6.1. Derive the Simpson’s integration rule (6.6).
Find w1 , w2 and w3 so that this integration rule has the highest possible degree of precision.
Find w1 , w2 and x1 so that this integration rule has the highest possible degree of precision.
� Exercise 6.5. What is the degree of precision of the following quadrature rule?
Z 1
2 1 1
u(x) dx ≈ 2u − − u(0) + 2u .
−1 3 2 2
135
Chapter 6. Numerical integration
the form
Z ∞ n
x2
u(x) e− dx ≈
X
2 wi u(xi ),
−∞ i=0
such that the rule is exact for all polynomials of degree less than or equal to 2n + 1. Find the
Gauss–Hermite rule with two nodes.
� Exercise 6.7. Use Romberg’s method to construct an integration rule with an error term
scaling as O(h4 ). Is there a link between the method you obtained and another integration rule
seen in class?
� Exercise 6.8 (Improving the error bound for the composite trapezoidal rule). The notation
used in this exercise is the same as in Section 6.2. In particular, Ibh denotes the approximate
integral obtained by using the composite trapezoidal rule (6.7), and u
bh is the corresponding
piecewise linear interpolant.
A version of the mean value theorem states that, if g : [a, b] → R is a non-negative integrable
function and f : [a, b] → R is continuous, then there exists ξ ∈ (a, b) such that
Z b Z b
f (x)g(x) dx = f (ξ) g(x) dx. (6.20)
a a
• Using (6.20), show that, for all i ∈ {0, . . . , n − 1}, there exists ξi ∈ (xi , xi+1 ) such that
xi+1
h3
Z
bh (x) dx = −u00 (ξi )
u(x) − u .
xi 12
• Combining the previous items, conclude that there exists ξ ∈ (a, b) such that
h 2
I − Ibh = −u00 (ξ)(a − b) ,
12
which is a more precise expression of the error than that obtained in (6.8).
Remark 6.1. One may convince oneself of (6.20) by rewriting this equation as
f (x)g(x) dx
Rb
a
= f (c).
a g(x) dx
Rb
The left-hand side is the average of f (x) with respect to the probability measure with density
136
Chapter 6. Numerical integration
given by
g(x)
x 7→ R b .
a g(x) dx
may be expressed as the expectation E[X 2 ], where E is the expectation operator and X ∼ U (0, 1)
is a uniformly distributed random variable over the interval [0, 1]. Therefore, in practice, I
may be approximated by generating a large number of samples X1 , X2 , . . . from the distribu-
tion U(0, 1) and averaging f (Xi ) over all these samples.
n = 1000
f(x) = x^2
X = rand(n)
Î = (1/n) * sum(f.(X))
The main advantage of this approach is that it generalizes very easily to high-dimensional and
infinite-dimensional settings.
137
Appendix A
Linear algebra
In this chapter, we collect basic results on vectors and matrices that are useful for this course.
Throughout these lecture notes, we use lower case and bold font to denote vectors, e.g. x ∈ Cn ,
and upper case to denote matrices, e.g. A ∈ Cm×n . The entries of a vector x ∈ Cn are denoted
by (xi ), and those of a matrix A ∈ Cm×n are denoted by (aij ) or (ai,j ).
• Positive-definiteness:
∀x ∈ X \{0}, hx, xi > 0.
138
Appendix A. Linear algebra
Definition A.2. A norm on a real vector space X is a function k•k : X → R satisfying the
following axioms:
(A.1)
p
kxk = hx, xi.
The Cauchy–Schwarz inequality enables to bound inner products using norms. It is also useful
for showing that the functional defined in (A.1) satisfies the triangle inequality.
Proof. Let us define p(t) = kx + tyk2 . Using the bilinearity of the inner product, we have
This shows that p is a convex second-order polynomial with a minimum at t∗ = −hx, yi/kyk2 .
Substituting this value in the expression of p, we obtain
Several norms can be defined on the same vector space X . Two norms k•kα and k•kβ on X
are said to be equivalent if there exist positive real numbers c` and cu such that
When working with norms on finite-dimensional vector spaces, it is important to keep in mind
the following result. The proof is provided for information purposes.
Proposition A.2. Assume that X is a finite-dimensional vector space. Then all the norms
defined on X are pairwise equivalent.
Proof. Let k•kα and k•kβ be two norms on X , and let (e1 , . . . , en ) be a basis of X , where n is
the dimension of the vector space. Any x ∈ X can be represented as x = λ1 e1 + · · · + λn en . By
the triangle inequality, it holds that
n o
kxkα ≤ |λ1 |ke1 kα + · · · + |λn |ken kα ≤ |λ1 | + · · · + |λn | max ke1 kα , . . . , ken kα . (A.4)
139
Appendix A. Linear algebra
On the other hand, as we prove below, there exists a positive constant ` such that
∀x ∈ X , kxkβ ≥ ` |λ1 | + · · · + |λn | . (A.5)
1 n o
kxkα ≤ max ke1 kα , . . . , ken kα kxkβ .
`
This proves the first inequality in (A.3), and reversing the roles of k•kα and k•kβ yields the
second inequality.
We now prove (A.5) by contradiction. If this inequality were not true, then there would
exist a sequence (x(i) )i∈N such that kx(i) kβ → 0 as i → ∞ but |λ1 | + · · · + |λn | = 1 for
(i) (i)
all i ∈ N. Since λ1 ∈ [−1, 1] for all i ∈ N, we can extract a subsequence, still denoted by
(i)
(x(i) )i∈N for simplicity, such that the corresponding coefficient λ1 satisfies λ1 → λ∗1 ∈ [−1, 1],
(i) (i)
by compactness of the interval [−1, 1]. Repeating this procedure for λ2 , λ3 , . . . , taking a new
subsequence every time, we obtain a subsequence (x(i) )i∈N such that λj → λ∗j in the limit
(i)
as i → ∞, for all j ∈ {1, . . . , n}. Therefore, it holds that x(i) → x∗ := λ∗1 e1 + · · · λ∗n en in the
k•kβ norm. Since x(i) → 0 by assumption and the vectors e1 , . . . , en are linearly independent,
this implies that λ∗1 = · · · = λ∗n = 0, which is a contradiction because we also have that
� Exercise A.1. Using Proposition A.1, show that the function k•k defined by (A.1) satisfies
the triangle inequality.
The values of p most commonly encountered in applications are 1, 2 and ∞. The 1-norm is
sometimes called the taxicab or Manhattan norm, and the 2-norm is usually called the Euclidean
norm. The explicit expressions of these norms are
v
n
X
u n
uX
kxk1 = |xi |, kxk2 = t |xi |2 .
i=1 i=1
140
Appendix A. Linear algebra
Notice that the infinity norm k•k∞ may be defined as the limit of the p-norm as p → ∞:
In the rest of this chapter, the notations h•, •i and k•k without subscript always refer to the
Euclidean inner product (A.1) and induced norm, unless specified otherwise.
The term operator norm is motivated by the fact that, to any matrix A ∈ Cm×n , there naturally
corresponds the linear operator from Cn to Cm with action x 7→ Ax. Equation (A.6) comes from
the general definition of the norm of a bounded linear operator between normed spaces. Matrix
norms of the type (A.6) are also called subordinate matrix norms. An immediate corollary of
the definition (A.6) is that, for all x ∈ Cn ,
x
xkα kxkβ ≤ sup kAykα : kykβ ≤ 1 kxkβ = kAkα,β kxkβ , (A.7)
kAxkα = kAb x
b= .
kxkβ
Proof. We need to verify that (A.6) satisfies the properties of positivity and homogeneity, to-
gether with the triangle inequality.
• Homogeneity follows trivially from the definition (A.6) and the homogeneity of k•kα .
• Triangle inequality. Let A and B be two elements of Cm×n . Employing the triangle
inequality for the norm k•kα , we have
The matrix p-norm is defined as the operator norm (A.6) in the particular case where k•kα
and k•kβ are both Hölder norms with the same value of p.
141
Appendix A. Linear algebra
Definition A.4. Given p ∈ [1, ∞], the p-norm of a matrix A ∈ Cm×n is given by
Not all matrix norms are induced by vector norms. For example, the Frobenius norm, which
is widely used in applications, is not induced by a vector norm. It is, however, induced by an
inner product on Cm×n .
(A.9)
X
2
kAkF = |aij | .
i=1 j=1
A matrix norm k•k is said to be submultiplicative if, for any two matrices A ∈ Cm×n and
B ∈ Cn×` , it holds that
kABk ≤ kAkkBk.
All subordinate matrix norms, for example the p-norms, are submultiplicative, and so is the
Frobenius norm.
� Exercise A.2. Write down the inner product on Cm×n corresponding to (A.9).
A.4 Diagonalization
AP = PD. (A.10)
In this case, the diagonal elements of D are called the eigenvalues of A, and the columns of P
are called the eigenvectors of A.
Denoting by ei the i-th column of P and by λi the i-th diagonal element of D, we have by (A.10)
that Aei = λi ei or, equivalently, (A − λi In )ei = 0. Here In is the Cn×n identity matrix.
Therefore, λ is an eigenvalue of A if and only if det(A−λIn ) = 0. In other words, the eigenvalues
of A are the roots of det(A − λIn ), which is called the characteristic polynomial.
142
Appendix A. Linear algebra
is called Hermitian. Hermitian matrices, of which real symmetric matrices are a subset, enjoy
many nice properties, the main one being that they are diagonalizable with a matrix P that is
unitary, i.e. such that P−1 = P∗ . This is the content of the spectral theorem, a pillar of linear
algebra with important generalizations to infinite-dimensional operators.
Theorem A.4 (Spectral theorem for Hermitian matrices). If A ∈ Cn×n is Hermitian, then
there exists an unitary matrix Q ∈ Cn×n and a diagonal matrix D ∈ Rn×n such that
AQ = QD.
Sketch of the proof. The result is trivial for n = 1. Reasoning by induction, we assume that the
result is true for Hermitian matrices in Cn−1×n−1 and prove that it then also holds for A ∈ Cn×n .
Step 2. Using the induction hypothesis. Next, take an orthonormal basis (e2 , . . . , en )
of the orthogonal complement span{q 1 }⊥ and construct the unitary matrix
V = q 1 e2 . . . en ,
hq 1 , Aq 1 i hq 1 , Ae2 i . . . hq 1 , Aen i
λ1 0 ... 0
he2 , Aq 1 i he2 , Ae2 i . . . he2 , Aen i 0 he2 , Ae2 i . . . he2 , Aen i
V AV =
∗
.. .. .. .. = .
. .. .. .. .
. . . . . . . .
hen , Aq 1 i hen , Ae2 i . . . hen , Aen i 0 hen , Ae2 i . . . hen , Aen i
Let us denote the n − 1 × n − 1 lower right block of this matrix by Vn−1 . This is a Hermitian
matrix of size n − 1 so, using the induction hypothesis, we deduce that Vn−1 = Qn−1 Dn−1 Q∗n−1
for appropriate matrices Qn−1 ∈ Cn−1×n−1 and Dn−1 ∈ Rn−1×n−1 which are unitary and
diagonal, respectively.
143
Appendix A. Linear algebra
The largest eigenvalue of a matrix, in modulus, is called the spectral radius and denoted by ρ.
The following result relates the 2-norm of a matrix to the spectral radius of AA∗ .
Proof. Since A∗ A is Hermitian, it holds by the spectral theorem that A∗ A = QDQ∗ for some
unitary matrix Q and real diagonal matrix D. Therefore, denoting by (µi )1≤i≤n the (positive)
diagonal elements of D and introducing y := Q∗ x, we have
v v
u n u n
√
x∗ A∗ Ax = x∗ QDQ∗ x = t µi yi2 ≤ ρ(A∗ A)t yi2 = ρ(A∗ A)kxk, (A.11)
p uX p uX p
kAxk =
i=1 i=1
where we used in the last equality the fact that y has the same norm as x, because Q is
unitary. It follows from (A.11) that kAk ≤ ρ(A∗ A), and the converse inequality also holds
p
x∗ Ax
λk = max min .
S,dim(S)=k x∈S\{0} x∗ x
144
Appendix A. Linear algebra
x∗ Ax
Pk 2
i=1 λi |αi |
∀x ∈ Sk , ∗
= P k
≥ λk .
x x i=1 |αi | 2
which proves the ≥ direction. For the ≤ direction, let Uk = span{v k , . . . v n }. Using a well-known
result from linear algebra, we calculate that, for any subspace S ⊂ Cn of dimension k,
Therefore, any S ⊂ Cn of dimension k has a nonzero intersection with Uk . But since any vector
in Uk can be expanded as β1 v k + · · · + βn v n , we have
x∗ Ax
Pn 2
i=k λi |αi |
∀x ∈ Uk , = n ≤ λk .
x∗ x 2
P
i=k |αi |
� Exercise A.4. Prove that if A ∈ Rn×n is diagonalizable as in (A.10), then An = PDn P−1 .
Definition A.7 (Jordan block). A Jordan block with dimension n is a matrix of the form
λ 1
λ 1
.. ..
Jn (λ) = . .
λ 1
λ
145
Appendix A. Linear algebra
Proof. The explicit expression of the Jordan block can be obtained by decomposing the block
as Jn (λ) = λI + N and using the binomial formula:
k
k
(λI + N) = (λI)k−i Ni .
X
k
i
i=1
To conclude the proof, we use the fact that Ni is a matrix with zeros everywhere except for i-th
super-diagonal, which contains only ones. Moreover Ni = 0n×n if i ≥ n.
Proposition A.8 (Jordan normal form). Any matrix A ∈ Cn×n is similar to a matrix in
Jordan normal form. In other words, there exists an invertible matrix P ∈ Cn×n and a matrix
in normal Jordan form J ∈ Cn×n such that
A = PJP−1
146
Appendix A. Linear algebra
Proposition A.9 (Oldenburger). Let ρ(A) denote the spectral radius of A ∈ Cn×n and k•k
be a matrix norm. Then kAk k → 0 in the limit as k → ∞ if and only if ρ(A) < 1. In addition,
kAk k → ∞ in the limit as k → ∞ if and only if ρ(A) > 1.
Proof. Since all matrix norms are equivalent, we can assume without loss of generality that k•k
is the 2-norm. We prove only the equivalence kAk k → 0 ⇔ ρ(A) < 1. The other statement can
be proved similarly.
If ρ(A) ≥ 1, then denoting by v the eigenvector of A corresponding to the eigenvalue with
modulus ρ(A), we have kAk vk = ρ(A)k kvk ≥ kvk. Therefore, the sequence (Ak )k∈N does not
converge to the zero matrix in the 2-norm. This shows the implication kAk k → 0 ⇒ ρ(A) < 1.
To show the converse implication, we employ Proposition A.8, which states that there exists
a nonsingular matrix P such that A = PJP−1 , for a matrix J ∈ Cn×n which is in normal Jordan
form. It holds that Ak = PJk P−1 , and so it is sufficient to show that kJk k → 0. The latter
convergence follows from the expression of the power of a Jordan block given in Lemma A.7.
With this result, we can prove Gelfand’s formula, which relates the spectral radius to the
asymptotic growth of kAk k, and is used in Chapter 2.
Proposition A.10 (Gelfand’s formula). Let A ∈ Cn×n . It holds for any norm that
k→∞
A A
Proof. Let 0 < ε < ρ(A) and define A+ = ρ(A)+ε and A− = ρ(A)−ε . It holds by construction
that ρ(A+ ) < 1 and ρ(A− ) > 1. Using Proposition A.9, we deduce that
kAk k kAk k
∀k ≥ K(ε), k(A+ )k k = ≤ 1, k(A− )k k = ≥ 1.
(ρ(A) + ε)k (ρ(A) − ε)k
1
∀k ≥ K(ε), ρ(A) − ε ≤ kAk k k ≤ ρ(A) + ε.
147
Appendix B
In this chapter, we very briefly present some of the basic features and functions of Julia. Most
of the information contained in this chapter can be found in the online manual, to which we
provide pointers in each section.
Installing Julia
The suggested programming environment for this course is the open-source text editor Visual
Studio Code. You may also use Vim or Emacs, if you are familiar with any of these.
� Task 1. Install Visual Studio Code. Install also the Julia and Jupyter Notebook extensions.
Obtaining documentation
To find documentation on a function from the Julia console, type “?” to access “help mode”,
and then the name of the function. Tab completion is helpful for listing available function
names.
� Task 2. Read the help pages for if, while and for. More information on these keywords is
available in the online documentation.
condition = true
148
Appendix B. Brief introduction to Julia
To install a package from the Julia REPL (Read Evaluate Print Loop, also more simply called
the Julia console), first type “]” to enter the package REPL, and then type add followed the
name of the package to install. After it has been added, a package can be used with the import
keyword. A function fun defined in a package pack can be accessed as pack.fun. For example,
to plot the cosine function from the Julia console or in a script, write
import Plots
Plots.plot(cos)
Alternatively, a package may be imported with the using keyword, and then functions can
be accessed without specifying the package name. While convenient, this approach is less
descriptive; it does not explicitly show what package a function comes from. For this reason, it
is often recommended to use import, especially in a large codebase.
� Task 3. Install the Plots package, read the documentation of the Plots.plot function, and
plot the function f (x) = exp(x). The tutorial on plotting available at this link may be useful for
this exercise.
Remark B.2. We have seen that ? and ] enable to access “help mode” and “package mode”,
respectively. Another mode which is occasionally useful is “shell mode”, which is accessed
with the character ; and allows to type bash commands, such as cd to change directory. See
this part of the manual for additional documentation on Julia modes.
Printing output
The functions println and print enable to display output. The former adds a new line at
the end and the latter does not. The symbol $, followed by a variable name or an expression
within brackets, can be employed to perform string interpolation. For instance, the following
code prints a = 2, a^2 = 4.
a = 2
println("a = $a, a^2 = $(a*a)")
To print a matrix in an easily readable format, the display function is very useful.
Functions can be defined using a function block. For example, the following code block defines
a function that prints “Hello, NAME!”, where NAME is the string passed as argument.
function hello(name)
# Here * is the string concatenation operator
149
Appendix B. Brief introduction to Julia
If the function definition is short, it is convenient to use the following more compact syntax:
The return keyword can be used for returning a value to the function caller. Several values,
separated by commas, can be returned at once. For instance, the following function takes a
number x and returns a tuple (x, x2 , x3 ).
function powers(x)
return x, x^2, x^3
end
# This assigns a = 2, b = 4, c = 8
a, b, c = powers(2)
Like many other languages, including Python and Scheme, Julia follows a convention for
argument-passing called “pass-by-sharing”: values passed as arguments to a function are not
copied, and the arguments act as new bindings within the function body. It is possible, therefore,
to modify a value passed as argument, provided this value is of mutable type. Functions
that modify some of their arguments usually end with an exclamation mark !. For example,
the following code prints first [4, 3, 2, 1], because the function sort does not modify its
argument, and then it prints [1, 2, 3, 4], because the function sort! does.
x = [4, 3, 2, 1]
y = sort(x) # y is sorted
println(x); sort!(x); println(x)
Similarly, when displaying several curves in a figure, we first start with the function plot, and
then we use plot! to modify the existing figure.
150
Appendix B. Brief introduction to Julia
import Plots
Plots.plot(cos)
Plots.plot!(sin)
As a final example to illustrate argument-passing, consider the following code. Here two
arguments are passed to the function test: an array, which is a mutable value, and an integer,
which is immutable. The instruction arg1[1] = 0 modifies the array to which both a and arg1
are bindings. The instruction arg2 = 2, on the other hand, just causes the variable arg2 to
point to a new immutable value (3), but it does not change the destination of the binding b,
which remains the immutable value 2. Therefore, the code prints [0, 2, 3] and 3.
� Task 4 (Euler–Mascheroni constant for the harmonic series). Euler showed that
N
!
1
lim − ln(N ) +
X
= γ := 0.577...
N →∞ n
n=1
� Task 5 (Tower of Hanoi). We consider a variation on the classic Tower of Hanoi problem, in
which the number r of pegs is allowed to be larger than 3. We denote the pegs by p1 , . . . , pr , and
assume that the problem includes n disks with radii 1 to n. The tower is initially constructed
in p1 , with the disks arranged in order of decreasing radius, the largest at the bottom. The goal
of the problem is to reconstruct the tower at pr by moving the disks one at the time, with the
constraint that a disk may be placed on top of another only if its radius is smaller.
It has been conjectured that the optimal solution, which requires the minimum number of
moves, can always be decomposed into the following three steps, for some k ∈ {1, n − 1}:
• First move the top k disks of the tower to peg p2 ;
151
Appendix B. Brief introduction to Julia
Some constructs in Julia introduce scope blocks, notably for and while loops, as well as function
blocks. The variables defined within these structures are not available outside them. For
example
if true
a = 1
end
println(a)
for i in [1, 2, 3]
a = 1
end
println(a)
produces ERROR: LoadError: UndefVarError: a not defined. The variable a defined within
the for loop is said to be in the local scope of the loop, whereas a variable defined outside of it is
in the global scope. In order to modify a global variable from a local scope, the global keyword
must be used. For instance, the following code
a = 1
for i in [1, 2, 3]
global a += 1
end
println(a)
A working knowledge of multi-dimensional arrays is important for this course, because vectors
and matrices are ubiquitous in numerical algorithms. In Julia, a two-dimensional array can be
created by writing its lines one by one, separating them with a semicolon ;. Within a line,
elements are separated by a space. For example, the instruction
M = [1 2 3; 4 5 6]
152
Appendix B. Brief introduction to Julia
The expression M[r, c] gives the (r, c) matrix element of M . The special entry end can be used
to access the last row or column. For instance, M[end-1, end] gives the matrix entry in the
second to last row and the last column. From the matrix M above, the submatrix [2 3; 5 6]
can be obtained with M[:, 2:3]. Here the row index : means “select all lines” and the column
index 2:3 means “select columns 2 to 3”. Likewise, the submatrix [1 3; 4 6] may be extracted
with M[:, [1; 3]].
Remark B.3 (One-dimensional arrays). The comma , can also be employed for creating one-
dimensional arrays, but its behavior differs slightly from that of the vertical concatenation
operator ;. For example, x = [1, [2; 3]] creates a Vector object with two elements, the
first one being 1 and the second one being [1; 3], which is itself a Vector. In contrast, the
instruction x = [1; [1; 2]] creates the same Vector as [1; 2; 3] would.
We also mention that the expression x = [1 2 3] produces not a one-dimensional Vector
but a two-dimensional Matrix, with one row and three columns. This can be checked using
the size function, which for x = [1 2 3] returns the tuple (1, 3).
There are many built-in functions for quickly creating commonly used arrays. For example,
• transpose(M) gives the transpose of M , and adjoint(M) or M' gives the transpose con-
jugate. For a matrix with real-valued entries, both functions deliver the same result.
• range(0, 1, length=101) creates an array of size 101 with elements evenly spaced be-
tween 0 and 1 included. More precisely, range returns an array-like object, which can be
converted to a vector using the collect function.
3 6 9
Let us also mention the following shorthand notation, called array comprehension, for creating
vectors and matrices:
153
Appendix B. Brief introduction to Julia
• [i for i in 1:10 if ispow2(i)] creates the vector [1, 2, 4, 8]. The same result
can be achieved with the filter function: filter(ispow2, 1:10).
In contrast with Matlab, array assignment in Julia does not perform a copy. For example
the following code prints [1, 2, 3, 4], because the instruction b = a defines a new binding
to the array a.
a = [2; 2; 3]
b = a
b[1] = 1
append!(b, 4)
println(a)
A similar behavior applies when passing an array as argument to a function. The copy function
can be used to perform a copy.
� Task 6. Create a 10 by 10 diagonal matrix with the i-th entry on the diagonal equal to i.
Broadcasting
To conclude this chapter, we briefly discuss broadcasting, which enables to apply functions to
array elements and to perform operations on arrays of different sizes. Julia really shines in this
area, with syntax that is both explicit and concise. Rather than providing a detailed definition
of broadcasting, which is available in this part of the official documentation, we illustrate the
concept using examples. Consider first the following code block:
function welcome(name)
return "Hello, " * name * "!"
end
result = broadcast(welcome, ["Alice", "Bob"])
Here broadcast returns an array with elements "Hello, Alice!" and "Hello, Bob!", as
would the map function. Broadcasting, however, is much more flexible because it can handle
arrays with different sizes. For instance, broadcast(gcd, 24, [10, 20, 30]) returns an array
of size 3 containing the greatest common divisors of the pairs (24, 10), (24, 20) and (24, 30).
Similarly, the instruction broadcast(+, 1, [1, 2, 3]) returns [2, 3, 4]. To understand
the latter example, note that + (as well as *, - and /) can be called like any other Julia
functions; the notation a + b is just syntactic sugar for +(a, b).
Since broadcasting is so often useful in numerical mathematics, Julia provides a shorthand
notation for it: the instruction broadcast(welcome, ["Alice", "Bob"]) can be written com-
pactly as welcome.(["Alice", "Bob"]). Likewise, the line broadcast(+, 1, [1, 2, 3]) can
be shortened to (+).(1, [1, 2, 3]), or to the more readable expression 1 .+ [1, 2, 3].
reshape(1:9, 3, 3) .* [1 2 3]
reshape(1:9, 3, 3) .* [1; 2; 3]
reshape(1:9, 3, 3) * [1; 2; 3]
154
Bibliography
[1] E. Cuthill and J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings
of the 1969 24th national conference, pages 157–172, 1969.
[2] D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM
computing surveys (CSUR), 23(1):5–48, 1991.
[3] IEEE Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std 754-1985:1–20, 1985.
doi: 10.1109/IEEESTD.1985.82928.
[4] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008:1–70, 2008.
doi: 10.1109/IEEESTD.2008.4610935.
[5] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019:1–84, 2019.
doi: 10.1109/IEEESTD.2019.8766229.
[6] V. Legat. Mathématiques et méthodes numériques. Lecture notes for the course EPL1104 at École
polytechnique de Louvain, 2009.
url: https://perso.uclouvain.be/vincent.legat/documents/epl1104/epl1104-notes-v8-2.pdf.
[7] A. Magnus. Analyse numérique: approximation, interpolation, intégration. Lecture notes for the
course INMA2171 at École polytechnique de Louvain, 2010.
url: https://perso.uclouvain.be/alphonse.magnus/num1a/dir1a.htm.
[10] Y. Saad. Iterative methods for sparse linear systems. Society for Industrial and Applied Mathe-
matics, Philadelphia, PA, second edition, 2003,
doi: 10.1137/1.9780898718003.
url: https://doi-org.extranet.enpc.fr/10.1137/1.9780898718003.
[11] Y. Saad. Numerical methods for large eigenvalue problems, volume 66 of Classics in Applied
Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2011,
doi: 10.1137/1.9781611970739.ch1.
url: https://doi-org.extranet.enpc.fr/10.1137/1.9781611970739.ch1.
[12] J. R. Shewchuk et al. An introduction to the conjugate gradient method without the agonizing
pain, 1994.
url: https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf.
155
Bibliography
[13] P. Van Dooren. Analyse numérique. Lecture notes for the course INMA1170 at École polytech-
nique de Louvain, 2012.
[14] M. Vianello and R. Zanovello. On the superlinear convergence of the secant method. Amer.
Math. Monthly, 99(8):758–761, 1992.
doi: 10.2307/2324244.
url: https://doi-org.extranet.enpc.fr/10.2307/2324244.
[15] C. Vuik and D. J. P. Lahaye. Scientific Computing. Lecture notes for the course wi4201 at Delft
University of Technology, 2019.
url: http://ta.twi.tudelft.nl/users/vuik/wi4201/wi4201_notes.pdf.
156