Introduction to Numerical Mathematics
Evolving notes
Simon Foucart
Foreword
Contents
I Preliminaries
1 Introductory remarks
1.3 Exercises
2 Programming suggestions
2.3 Exercises
3 Non-decimal bases
3.2 Conversions
3.3 Conversion 2 ↔ 8 ↔ 16
3.4 Exercises
4 Computer arithmetic
4.2 Computer errors
4.4 Exercises
5.2 Exercises
6 Gaussian elimination
6.4 Exercises
7.1 LU-factorization
7.4 Exercises
8 Iterative methods
8.3 Exercises
9 Steepest descent and conjugate gradient methods
9.3 Exercises
10 The QR factorization
10.3 Exercises
11.3 Exercises
12 Bisection method
12.3 Exercises
13 Newton's method
13.4 Exercises
14 Secant method
14.3 Exercises
IV Approximation of functions
15 Polynomial interpolation
15.4 Exercises
16 Divided differences
16.1 A definition
16.4 Exercises
17 Orthogonal polynomials
17.5 Exercises
18.1 Approximation
18.2 Interpolation
18.4 Exercises
22.1 Description of the method
27 Multistep methods
Part I
Preliminaries
Chapter 1
Introductory remarks
where the error term, or remainder, E_{n+1} [which also depends on x and x_0] can be represented by
$$E_{n+1} = \int_{x_0}^{x} \frac{f^{(n+1)}(t)}{n!}\,(x-t)^n\,dt
\qquad\text{or}\qquad
E_{n+1} = \frac{f^{(n+1)}(c)}{(n+1)!}\,(x-x_0)^{n+1}, \ \text{ for some } c \text{ between } x \text{ and } x_0.$$
Note that taking n = 0 together with the second form of the remainder yields the mean value theorem, itself implying Rolle's theorem. Let us write x = x_0 + h. Suppose for example that $\frac{f^{(n+1)}(c)}{(n+1)!}$ can be bounded by some constant K; then one has |E_{n+1}| ≤ K|h|^{n+1}, which is abbreviated by the 'big-O' notation
$$E_{n+1} = O(h^{n+1}).$$
It means that, as h converges to zero, E_{n+1} converges to zero with (at least) essentially the same rapidity as |h|^{n+1}. In such circumstances, we have at hand a way to compute
(approximately) the value of f (x): take the Taylor polynomial of degree n for f centered
at x0 and evaluate it at x. But beware that, when n grows, a Taylor series converges
rapidly near the point of expansion but slowly (or not at all) at more remote points. For
example, the Taylor series for ln(1 + x) truncated to eight terms gives
$$\ln 2 \approx 1 - \frac12 + \frac13 - \frac14 + \frac15 - \frac16 + \frac17 - \frac18 = 0.63452...,$$
a rather poor approximation to ln 2 = 0.693147..., while the Taylor series for ln((1 + x)/(1 − x)) truncated to four terms gives
$$\ln 2 \approx 2\left(\frac13 + \frac{1}{3^3 \cdot 3} + \frac{1}{3^5 \cdot 5} + \frac{1}{3^7 \cdot 7}\right) = 0.69313...$$
As another example, consider the Taylor series of the exponential function centered at 0, that is
$$e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}.$$
Even though the remainder, for a given x, is of the type O(1/n!), the determination of e^x via the Taylor series truncated to seven terms is quite bad for x = 8. However, when x = 1/2 this produces very good results, as compared e.g. to the approximation $e^{1/2} \approx \left(1 + \frac{1}{2n}\right)^n$, at least where speed of convergence is concerned. But is this the only criterion that matters? It seems that fewer operations are required to compute the latter...
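The comparison is easy to carry out numerically. The following short Python sketch (an illustration of mine, not part of the notes) contrasts the truncated Taylor series with the compound approximation for x = 1/2.

```python
import math

def exp_taylor(x, n_terms):
    """Partial sum of the Taylor series of exp at 0, truncated to n_terms terms."""
    return sum(x**k / math.factorial(k) for k in range(n_terms))

def exp_compound(x, n):
    """The compound approximation exp(x) ~ (1 + x/n)**n."""
    return (1 + x / n) ** n

x = 0.5
print(abs(exp_taylor(x, 7) - math.exp(x)))    # about 2e-6: seven Taylor terms
print(abs(exp_compound(x, 7) - math.exp(x)))  # about 3e-2: (1 + 1/(2n))^n with n = 7
```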
Given coefficients a_0, . . . , a_n, we wonder how one should evaluate the polynomial p(x) := a_n + a_{n−1}x + · · · + a_0x^n efficiently at a given point x. The first instinct is to calculate the terms $a_{n-i} \times \underbrace{x \times x \times \cdots \times x}_{i \text{ times}}$ for each i ∈ J0, nK and to sum them all. This requires n additions and 1 + 2 + · · · + n = n(n + 1)/2 multiplications. Obviously, we have done too much work here, since we have not used the result of the computation of x^i to evaluate x^{i+1}. If we do, the calculation of the powers of the input number x only requires n − 1 multiplications. Hence the whole process now requires n additions and 2n − 1 multiplications. This is much better, but can still be improved via Horner's method, also called nested multiplication. It is based on a simple observation, namely
$$p(x) = a_n + a_{n-1}x + \cdots + a_0x^n = a_n + x(a_{n-1} + a_{n-2}x + \cdots + a_0x^{n-1}) = \cdots = a_n + x(a_{n-1} + x(a_{n-2} + \cdots + x(a_1 + xa_0)\cdots)).$$
Here we notice that one addition and one multiplication are required at each step, hence the cost of the whole algorithm is n additions and n multiplications. This implies that the execution time is roughly divided by a factor 2.
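As an illustration (mine, not taken from the notes), here is a minimal Python version of nested multiplication, with the coefficients ordered as above, i.e. p(x) = a_n + a_{n−1}x + · · · + a_0x^n.

```python
def horner(a, x):
    """Evaluate p(x) = a[n] + a[n-1]*x + ... + a[0]*x**n by nested multiplication.

    The list a = [a0, a1, ..., an] follows the indexing used in the notes.
    Costs n additions and n multiplications.
    """
    result = a[0]                # innermost coefficient a0
    for coeff in a[1:]:
        result = result * x + coeff
    return result

# example: a0 = 1, a1 = 2, a2 = 3, so p(x) = 3 + 2x + x^2
print(horner([1, 2, 3], 2.0))    # 3 + 2*2 + 1*4 = 11.0
```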
1.3 Exercises
1. What is the Taylor series for ln(1 + x)? for ln((1 + x)/(1 − x))?
$$\ln(x + e) = 1 + \frac{x}{e} - \frac{x^2}{2e^2} + \cdots + \frac{(-1)^{k-1}}{k}\left(\frac{x}{e}\right)^k + \cdots$$
• p(x) = x^{32}
• p(x) = 3(x − 1)^5 + 7(x − 1)^9
• p(x) = 6(x + 2)^3 + 9(x + 2)^7 + 3(x + 2)^{15} − (x + 2)^{31}
• p(x) = x^{127} − x^{37} + 10x^{17} − 3x^7
5. Using a computer algebra system (Maple, Mathematica, ...), print 200 decimal digits of √10.
6. Express in mathematical notation without parentheses the final value of z in the follow-
ing pseudocode:
input n, (bi ), z
z ← bn + 1
for k = 1 to n − 2
do z ← z ∗ bn−k + 1 end do
• x_n = 5n^2 + 9n^3 + 1, α_n = n^2
• x_n = √(n + 3), α_n = 1/n
Optional problems
Chapter 2
Programming suggestions
You may use any software you are comfortable with; however, I will use pseudocode as the
‘programming language’ throughout the notes. This is only loosely defined, nevertheless,
as a bridge between mathematics and computer programming, it serves our purpose well.
The course does not focus on programming, but on understanding and testing the various
numerical methods underlying scientific computing. Still, some words of advice can do no
harm. The following suggestions are not intended to be complete and should be considered
in context. I just want to highlight some consideration of efficiency, economy, readability,
and roundoff errors.
Use pseudocode before beginning the coding. It should contain sufficient detail so
that the implementation is straightforward. It should also be easily read and understood
by a person unfamiliar with the code.
Check and double check. It is common to write programs that may work on simple tests
but not on more complicated ones. Spend time checking the code before running it to
avoid executing the program, showing the output, discovering an error, and repeating the
process again and again.
Use test cases. Check and trace through the pseudocode using pencil-and-paper calcula-
tions. These sample tests can then be used as test cases on the computer.
Modularize code. Build a program in steps by writing and testing a series of self-
contained subtasks (subprograms, procedures, or functions) as separate routines. It makes
reading and debugging easier.
Generalize slightly. It is worth the extra effort to write the code to handle a slightly more
general situation. For example, only a few additional statements are required to write a
program with an arbitrary step size compared with a program in which the step size is
fixed numerically. However, be careful not to introduce too much generality because it can
make a simple task overly complicated.
Show intermediate results. Print out or display intermediate results and diagnostic
messages to assist in debugging and understanding the program’s operation. Unless im-
practical to do so, echo-print the input data.
Include warning messages. A robust program always warns the user of a situation that
it is not designed to handle.
Use meaningful variable names. This is often helpful because they have greater mnemonic
value than single-letter variables.
Declare all variables. All variables should be listed in type declarations in each program
or program segment.
Include comments. Comments within a routine are helpful for recalling at some later
time what the program does. It is recommended to include a preface to each program or
program segment explaining the purpose, the input and output variables, etc. Provide a few
comments between the major segments of the code; indent each block of code a consistent
number of spaces; insert blank comment lines and blank spaces – it greatly improves the
readability.
Use clean loops. Do not put unnecessary statements within loops: move expressions and
variables outside a loop from inside a loop if they do not depend on it. Indenting loops,
particularly for nested loops, can add to the readability of the code.
Use built-in functions. In scientific programming languages, functions such as sin, ln,
exp, arcsin are available. Numeric functions such as integer, real, complex are also usually
available for type conversion. One should use these as much as possible. Some of these
intrinsic functions accept arguments of more than one type and return an output of the
corresponding type. They are called generic functions, for they represent an entire family
of related functions.
Use program libraries. In preference to one you might write yourself, a routine from a
program library should be used: such routines are usually state-of-the-art software, well
tested and completely debugged.
Do not overoptimize. The primary concern is to write readable code that correctly com-
putes the desired results. Tricks of the trade to make your code run faster and more
efficiently are to be saved for later use in your programming career.
Computing sums. When a long list of floating-point numbers is added, there will gener-
ally be less roundoff error if the numbers are added in order of increasing magnitude.
Mathematical constants. In many programming languages, the computer does not auto-
matically know the values of common mathematical constants such as π or e. As it is easy
to mistype a long sequence of digits, it is better to use simple calculations involving math-
ematical functions, e.g. π ← 4.0 arctan(1.0). Problems will also occur if one uses a short approximation such as π ← 3.14159 on a computer with limited precision and then later moves the code to another computer.
Avoid mixed mode. Mixed expressions are formulas in which variables and constants of
different types appear together. Use the intrinsic type conversion functions. For example,
1/m should be coded as 1.0/real(m).
Precision. In the usual mode, i.e. single precision, one word of storage is used for each
number. With double precision (also called extended precision), each number is allotted
two or more words of storage. This is more time consuming, but is to be used when more
accuracy is needed. In particular, on computers with limited precision, roundoff errors can
quickly accumulate in long computations and reduce the accuracy to only three or four
decimal places.
Memory fetches. When using loops, write the code so that fetches are made from adjacent words in memory. Suppose you want to store values in a two-dimensional array (a_{i,j}) in which the elements of each column are stored in consecutive memory locations; you would then use i and j loops with the i loop as the innermost one to process elements down the columns.
Limit iterations. To avoid endless cycling, limit the number of permissible steps by the use of a loop with a control variable. Hence, instead of
input x
d ← f(x)/f′(x)
while |d| > ε
do x ← x − d, output x, d ← f(x)/f′(x) end do
it is better to write
for n = 1 to nmax
do d ← f(x)/f′(x), x ← x − d, output n, x
if |d| ≤ ε then exit loop end if
end do
Floating-point equality. The sequence of steps in a routine should not depend on whether
two floating-point numbers are equal. Instead, reasonable tolerance should be permitted.
Thus write
if |x − y| < ε then ...
or even better
if |x − y| < ε max(|x|, |y|) then ...
each with a possible roundoff error. In the second, this is avoided at an added cost of n
multiplications.
2.3 Exercises
1. The numbers $p_n = \int_0^1 x^n e^x\,dx$ satisfy the inequalities p_1 > p_2 > p_3 > · · · > 0. Establish this fact. Using integration by parts, show that p_{n+1} = e − (n + 1)p_n and that p_1 = 1. Use the recurrence relation to generate the first twenty values of p_n on a computer and explain why the inequalities above are violated.
2. The following pieces of code are supposedly equivalent, with α ↔ ε. Do they both pro-
duce endless cycles? Implement them on a computer [be sure to know how to abort a
computation before starting].
input α
α←2
while α > 1 do α ← (α + 1)/2 end do
input ε
ε←1
while ε > 0 do ε ← ε/2 end do
Chapter 3
Non-decimal bases
Computers usually use base-2, base-8 or base-16 arithmetic, instead of the familiar base-10
arithmetic – which is not more natural than any other system, by the way, we are just
more used to it. Our purpose here is to discuss the relations between the representations
in different bases. First of all, observe for example that 3781.725 represents, in base 10, the number
3 × 10^3 + 7 × 10^2 + 8 × 10 + 1 + 7 × 10^{−1} + 2 × 10^{−2} + 5 × 10^{−3}.
More generally, a number written (a_n a_{n−1} · · · a_1 a_0 . b_1 b_2 b_3 · · ·)_β in base β represents
$$\sum_{i=0}^{n} a_i\,\beta^{i} + \sum_{i=1}^{\infty} b_i\,\beta^{-i},$$
where the a_i's and b_i's are to be chosen among the digits 0, 1, . . . , β − 1. The first sum
in this expansion is called the integer part and the second sum is called the fractional
part. The separator is called the radix point – decimal point being reserved for base-10
numbers. The systems using base 2, base 8, and base 16 are called binary, octal, and
hexadecimal, respectively. In the latter, the digits are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f .
3.2 Conversions
For example, to convert N = (101101001)_2 to base 10, expand in powers of 2 and carry out the arithmetic in base 10:
N = 1 × 2^8 + 0 × 2^7 + 1 × 2^6 + 1 × 2^5 + 0 × 2^4 + 1 × 2^3 + 0 × 2^2 + 0 × 2 + 1 = 256 + 64 + 32 + 8 + 1 = 361.
To convert (N)_α to base β, the same procedure [expand in powers of α and carry out the arithmetic in base β] can be used for any α and β, except that base-β arithmetic is tedious for β ≠ 10. In fact, when α > β, another technique is preferred.
Suppose for example that we want to find the binary representation of a number x, e.g.
x = (3781.725)10 . Let us write
x = (an an−1 · · · a1 a0 . b1 b2 b3 · · · )2 .
The digit a0 is easy to find: it is the remainder in the division by 2 of the integer part of x.
Now, what about the digits a1 and b1 ? Well, we notice that
x/2 = an an−1 · · · a1 . a0 b1 b2 b3 · · ·
2x = an an−1 · · · a1 a0 b1 . b2 b3 · · · ,
so that a_1 and b_1 are obtained from the previous observation applied to x/2 and 2x. The process can be continued. It yields an algorithm for the conversion of the integer part and another one for the conversion of the fractional part.
Integer part: Divide successively by 2 and store the remainder of the division as a digit
until hitting a quotient equal to zero.
3781 → 1890; 1 → 945; 01 → 472; 101 → 236; 0101 → 118; 00101 → 59; 000101 → 29; 1000101
→ 14; 11000101 → 7; 011000101 → 3; 1011000101 → 1; 11011000101 → 0; 111011000101.
Fractional part: Multiply by 2, store the integer part as a digit and start again with the
new fractional part. Remark that the process might not stop.
0.725 → 1; 0.45 → 10; 0.9 → 101; 0.8
→ 1011; 0.6 → 10111; 0.2 → 101110; 0.4 → 1011100; 0.8
→ repeat the previous line.
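Both procedures are easily mechanized. The Python sketch below (mine, not part of the notes) applies them to the running example 3781.725; the number of fractional digits kept is an arbitrary choice, since the process may not stop.

```python
def to_binary(x, frac_digits=12):
    """Convert a non-negative number to a binary string: divide the integer part
    by 2 repeatedly, and multiply the fractional part by 2 repeatedly."""
    integer, fraction = int(x), x - int(x)
    int_digits = ""
    while integer > 0:                      # store remainders until the quotient is zero
        integer, r = divmod(integer, 2)
        int_digits = str(r) + int_digits
    frac_digits_str = ""
    for _ in range(frac_digits):            # store integer parts of successive doublings
        fraction *= 2
        d, fraction = int(fraction), fraction - int(fraction)
        frac_digits_str += str(d)
    return (int_digits or "0") + "." + frac_digits_str

print(to_binary(3781.725))                  # 111011000101.101110011001...
```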
3.3 Conversion 2 ↔ 8 ↔ 16
The octal system is useful as an intermediate step when converting between decimal and
binary systems by hand. The conversion 10 ↔ 8 proceeds according to the previous prin-
ciples, and the conversion 8 ↔ 2 handles groups of three binary digits according to the
table
Octal 0 1 2 3 4 5 6 7
Binary 000 001 010 011 100 101 110 111
Most computers use the binary system for their internal representation. On computers
whose word lengths are multiples of four, the hexadecimal system is used. Conversion
between these two systems is also very simple, in view of the table
Hexadecimal 0 1 2 3 4 5 6 7 8 9 a b c d e f
Binary 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
3.4 Exercises
2. Convert the following numbers and then convert the results back to the original bases:
(a) (47)10 = ( )8 = ( )2
(b) (0.782)10 = ( )8 = ( )2
(c) (110011.1110101101101)2 = ( )8 = ( )10
(d) (361.4)8 = ( )2 = ( )10
3. Prove that a number has a finite binary representation if it has the form ±m/2^n, where
m and n are non-negative integers. Prove that any number having a finite binary rep-
resentation also has a finite decimal representation.
4. Find the first twelve digits in the octal representation of eπ/4. Use built-in mathemati-
cal functions to obtain the values of e and π.
5. With the help of a computer, establish that the number e^{π√163} is very close to an 18-digit decimal integer and convert this integer to base 16. How many times is this integer divisible by 2?
Optional problems
2. Write and test a routine for converting base-3 integers to binary integers.
Chapter 4
Computer arithmetic
Consider a computer that stores numbers in 64-bit words. Then a single-precision number is represented by three different parts: first, 1 bit is allotted to the sign, then 11 bits are allotted to an exponent c, called the characteristic, and finally 52 bits are allotted to a fractional part f, called the mantissa. More precisely, these numbers have the form
(−1)^s × 2^{c−1023} × (1 + f).
Since c is represented by 11 binary digits, it belongs to the integer interval J0, 2^{11} − 1K = J0, 2047K, and the range for the exponent c − 1023 is J−1023, 1024K. As for f, it lies in the interval [0, 1 − (1/2)^{52}]. Therefore, the smallest and largest positive machine numbers are 2^{−1022} × (1 + 0) and 2^{1023} × (2 − (1/2)^{52}). Outside this range, we say that an underflow or an overflow has occurred. The latter causes the computation to stop, the former produces zero [which is represented by c = 0, f = 0 and s = 0 or s = 1].
If a number has the form x = (−1)^s × 2^{c−1023} × 1.d_1d_2 · · · d_{52}d_{53} · · ·, with exponent c − 1023 ∈ J−1023, 1024K, the process of replacing it by the number x_c := (−1)^s × 2^{c−1023} × 1.d_1 · · · d_{52} is called chopping, while the process of replacing it by the nearest machine number x_r is called rounding. The error involved is called the roundoff error. For the chopping process, the relative error – not to be confused with the absolute error |x − x_c| – satisfies
$$\frac{|x - x_c|}{|x|} \le 2^{-52}.$$
The number ε := 2^{−52} is called the unit roundoff error. For the rounding process, we would obtain the better bound on the relative error
$$\frac{|x - x_r|}{|x|} \le \frac{\varepsilon}{2}.$$
The value of the unit roundoff error, also known as machine epsilon, varies with the
computer [it depends on the word length, the arithmetic base used, the rounding method
employed]. As the smallest positive number ε such that 1 + ε ≠ 1 in the machine, it can be
computed according to the following pseudocode:
input s, t
s ← 1., t ← 1. + s
while t > 1 do s ← s/2., t ← 1 + s end do
output 2s
Even though it can be sensible to assume that the relative error in any single basic arith-
metic operation is bounded by the machine epsilon, the accumulation of these errors can
have some significant effect. Consider the sequence (p_n) defined by p_1 = 1 and p_{n+1} = e − (n + 1)p_n [cf. exercise 2.3.1], and let (p̃_n) be the computed sequence. The error δ_n := |p_n − p̃_n| obeys the rule δ_{n+1} ≈ (n + 1)δ_n [of course, this is not an equality], so that δ_n ≈ n! δ_1 blows
out of proportion. When small errors made at some stage are magnified in subsequent
stages, like here, we say that a numerical process is unstable.
A large relative error can occur as a result of a single operation, one of the principal causes
being the subtraction of nearly equal numbers. As a straightforward example, suppose
we subtract, on a hypothetical 5-decimal-digit computer, the numbers x = 1.234512345 and
y = 1.2344. The difference should be x − y = 0.000112345, but the computer stores the value
(x − y)s = 0.0001. Almost all significant digits have been lost, and the relative error is
large, precisely
$$\frac{|(x - y) - (x - y)_s|}{|x - y|} = \frac{1.2345 \times 10^{-5}}{1.12345 \times 10^{-4}} \approx 11\%.$$
Contrary to the roundoff errors, which are inevitable, this sort of error can be kept under control, and the programmer must be alert to such situations. The simplest remedy is to carry out part of a computation in double-precision arithmetic, but this is costly and might not even help. A slight change in the formulas is often the answer. The assignment y ← √(1 + x²) − 1 involves loss of significance for small x, but the difficulty can be avoided by writing
$$y \leftarrow \frac{x^2}{\sqrt{1+x^2}+1}$$
instead [apply this to the solutions of a quadratic equation]. Likewise, the assignment y ← cos²x − sin²x involves loss of significance near π/4. The assignment y ← cos(2x) should be used instead.
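The cancellation is easy to observe in practice; the following small Python experiment (mine, not from the notes) compares the two algebraically identical formulas for a tiny x.

```python
import math

def naive(x):
    return math.sqrt(1 + x * x) - 1            # subtracts nearly equal numbers for small x

def stable(x):
    return x * x / (math.sqrt(1 + x * x) + 1)  # same quantity, no cancellation

x = 1e-9
print(naive(x))    # 0.0   -- all significant digits lost
print(stable(x))   # 5e-19 -- correct to full precision
```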
For completeness, we give a quantitative answer to the question: ‘exactly how many sig-
nificant binary bits are lost in the subtraction x − y when x is close to y?’.
Theorem 4.1 (Loss of precision). If two binary machine numbers x and y, with x > y > 0,
satisfy 2^{−q} ≤ 1 − y/x ≤ 2^{−p} for some positive integers p and q, then at most q and at least p
significant binary digits are lost in the subtraction x − y.
4.4 Exercises
1. There will be subtractive cancellation in computing 1 + cos x for some values of x. What
are these values and how can the loss of precision be averted?
Optional problems
1. The inverse hyperbolic sine is given by f(x) = ln(x + √(x² + 1)). Show how to avoid loss of significance in computing f(x) when x is negative. Hint: exploit a relation between f(−x) and f(x).
Chapter 5
For a linear map f : V −→ W, the kernel and the image of f are defined by Ker f := {v ∈ V : f(v) = 0} and Im f := {f(v) : v ∈ V}. One has
[f invertible] ⇐⇒ [Ker f = {0} and dim V = dim W] ⇐⇒ [Im f = W and dim V = dim W].
An orthogonal matrix is a square matrix P satisfying PP⊤ = I (= P⊤P). In fact, a necessary and sufficient condition for a matrix to be orthogonal is that its columns (or rows) form an orthonormal system. An orthogonal matrix is also characterized by the fact that it preserves the norm [i.e. ‖Px‖ = ‖x‖ for all x] – or the inner product [i.e. ⟨Px, Py⟩ = ⟨x, y⟩ for all x and y].
5.2 Exercises
From the textbook: read sections 6.3, 6.4 and 7.2; 5 p 380, 6 p 386, 11 p 386, 1.d-e-f p 435,
13 p 436.
Optional problems
1. Prove the statement that a square matrix M has at most one left inverse. Deduce the
statement that a left inverse is automatically both-sided. [Hint: if AM = I, multiply M
on the left by A + I − M A.]
2. Establish that a symmetric matrix is positive definite iff all its eigenvalues are positive.
Part II
Chapter 6
Gaussian elimination
Many problems in economics, engineering, and science involve differential equations which cannot be solved explicitly. Their numerical solution is obtained by discretizing the differential equation – see Part VI – leading to a system of linear equations. One may already know that most of the time Gaussian elimination allows one to solve this system. But the higher the required accuracy, the finer the discretization, and consequently the larger the system of linear equations. Gaussian elimination is not very effective in this case. One of the goals of this part is to devise better methods for solving linear systems.
Throughout this part, the linear system under consideration is written in matrix form as Ax = b.
Simple cases
• diagonal structure: with a_{i,j} = 0 for i ≠ j, the system has the solution x = [b_1/a_{1,1}, b_2/a_{2,2}, . . . , b_n/a_{n,n}]⊤, so long as all the a_{i,i}'s are nonzero. If a_{i,i} = 0 for some i, then either x_i can be chosen to be any number if b_i = 0, or the system is unsolvable.
• lower triangular structure: with a_{i,j} = 0 for i < j, the system reduces to
$$\begin{bmatrix} a_{1,1} & 0 & \cdots & 0 \\ a_{2,1} & a_{2,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n,1} & a_{n,2} & \cdots & a_{n,n} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}.$$
Assume that a_{i,i} ≠ 0 for all i. The system can be solved by forward substitution; in other words, get the value of x_1 from the first equation a_{1,1}x_1 = b_1, next substitute it in the second equation a_{2,1}x_1 + a_{2,2}x_2 = b_2 to find the value of x_2, next substitute x_1 and x_2 in the third equation to find x_3, and so on. Based on the steps
• upper triangular structure: with a_{i,j} = 0 for i > j, the system reduces to
$$\begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ 0 & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{n,n} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}.$$
The system can be solved by backward substitution, i.e. get xn first, then xn−1 , and so
on until x1 . The pseudocode of the corresponding algorithm can be easily written.
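Both substitutions are a few lines of code. Here is a minimal Python sketch (mine, not the notes' pseudocode), using numpy only for the dot products.

```python
import numpy as np

def forward_substitution(A, b):
    """Solve Ax = b for a lower triangular A with nonzero diagonal entries."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - A[i, :i] @ x[:i]) / A[i, i]
    return x

def backward_substitution(A, b):
    """Solve Ax = b for an upper triangular A with nonzero diagonal entries."""
    n = len(b)
    x = np.zeros(n)
    for i in reversed(range(n)):
        x[i] = (b[i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x

L = np.array([[2.0, 0.0], [1.0, 3.0]])
print(forward_substitution(L, np.array([4.0, 5.0])))   # [2. 1.]
```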
We often make the sensible assumption that the execution time of an algorithm is roughly
proportional to the number of operations being performed. We also usually neglect the
number of additions/subtractions with respect to the number of multiplications/divisions,
because the latter require more execution time than the former and because these num-
bers are often comparable in practice. Of course, if our algorithm does not contain any
multiplication/division, this is no longer legitimate.
Looking back at the forward substitution algorithm, we observe that for a given i the operation count is i mult./div., hence the operation count for the whole algorithm is
$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2} = \frac{n^2}{2} + \frac{n}{2}.$$
It makes no sense to use this accurate expression, firstly because n/2 is small compared to n²/2 when n is large and secondly because our starting assumptions are themselves not accurate. Instead, it is more appropriate to consider only n²/2, and in fact only n², since the execution time is anyway obtained via a proportionality factor. We then use the 'big-O' notation to state that the cost of forward (or backward) substitution is O(n²).
6.3 Gaussian elimination
Let us refresh our memory of this familiar concept with a particular case. We wish to solve
the system
$$\begin{cases} 2x + y - z = 2 \\ 6x + 2y - 2z = 8 \\ 4x + 6y - 3z = 5. \end{cases}$$
The strategy is to combine the second equation and the first one to form a new second equation where the unknown x does not appear, and likewise to form a new third equation where the unknown x does not appear either. Then we combine the two new equations into a new third one, getting rid of the unknown y as well. We are then able to solve by backward substitution, first determining z, then y and finally x. Here is how it unfolds, in so-called tableau form,
$$\left[\begin{array}{ccc|c} 2 & 1 & -1 & 2 \\ 6 & 2 & -2 & 8 \\ 4 & 6 & -3 & 5 \end{array}\right]
\xrightarrow{\substack{R_2 \leftarrow R_2 - 3R_1 \\ R_3 \leftarrow R_3 - 2R_1}}
\left[\begin{array}{ccc|c} 2 & 1 & -1 & 2 \\ 0 & -1 & 1 & 2 \\ 0 & 4 & -1 & 1 \end{array}\right]
\xrightarrow{R_3 \leftarrow R_3 + 4R_2}
\left[\begin{array}{ccc|c} 2 & 1 & -1 & 2 \\ 0 & -1 & 1 & 2 \\ 0 & 0 & 3 & 9 \end{array}\right].$$
Note that, when performing Gaussian elimination, we merely use some elementary row
operations which leave the solutions of the system unchanged. Namely, these are
6.3.2 The algorithm
Suppose we have already performed j − 1 steps of the algorithm. The current situation is described by the tableau form
$$\left[\begin{array}{ccccccc|c}
a_{1,1}^{(j)} & a_{1,2}^{(j)} & \cdots & a_{1,j}^{(j)} & a_{1,j+1}^{(j)} & \cdots & a_{1,n}^{(j)} & b_1^{(j)} \\
0 & a_{2,2}^{(j)} & \cdots & a_{2,j}^{(j)} & a_{2,j+1}^{(j)} & \cdots & a_{2,n}^{(j)} & b_2^{(j)} \\
0 & 0 & \ddots & \vdots & \vdots & & \vdots & \vdots \\
0 & \cdots & 0 & a_{j,j}^{(j)} & a_{j,j+1}^{(j)} & \cdots & a_{j,n}^{(j)} & b_j^{(j)} \\
0 & \cdots & 0 & a_{j+1,j}^{(j)} & a_{j+1,j+1}^{(j)} & \cdots & a_{j+1,n}^{(j)} & b_{j+1}^{(j)} \\
\vdots & & \vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & \cdots & 0 & a_{n,j}^{(j)} & a_{n,j+1}^{(j)} & \cdots & a_{n,n}^{(j)} & b_n^{(j)}
\end{array}\right],$$
and we perform the operations
$$R_{j+1} \leftarrow R_{j+1} - \frac{a_{j+1,j}^{(j)}}{a_{j,j}^{(j)}}\,R_j, \quad \ldots, \quad R_n \leftarrow R_n - \frac{a_{n,j}^{(j)}}{a_{j,j}^{(j)}}\,R_j$$
to 'eliminate' a_{j+1,j}^{(j)}, . . . , a_{n,j}^{(j)}. Remark that the numbers a_{j,j}^{(j)}, called pivots, must be nonzero, otherwise the algorithm should halt. If a_{j,j}^{(j)} = 0, the remedy is to exchange the rows of the tableau to obtain, if possible, a nonzero pivot – see Op.3. This technique is called partial pivoting. In fact, one should make the exchange so that the element of largest magnitude in the lower part of the j-th column ends up in the pivotal position. This also reduces the chance of creating very large numbers, which might lead to ill conditioning and accumulation of roundoff error. We do not bother about partial pivoting in the following version of the Gaussian elimination algorithm. Here, the result of the algorithm is an upper triangular system of equations. The associated backward substitution subprogram is omitted.
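In the spirit of that algorithm, a minimal Python sketch of the elimination step without partial pivoting might look as follows (this is an illustration, not the notes' own code; the test matrix is the one of Section 6.3.1).

```python
import numpy as np

def gaussian_elimination(A, b):
    """Reduce Ax = b to an upper triangular system, assuming nonzero pivots."""
    A, b = A.astype(float).copy(), b.astype(float).copy()
    n = len(b)
    for j in range(n - 1):                  # eliminate column j
        for i in range(j + 1, n):
            factor = A[i, j] / A[j, j]
            A[i, j:] -= factor * A[j, j:]
            b[i] -= factor * b[j]
    return A, b

A = np.array([[2, 1, -1], [6, 2, -2], [4, 6, -3]])
b = np.array([2, 8, 5])
U, c = gaussian_elimination(A, b)
print(U)    # upper triangular matrix of the worked example, last row [0, 0, 3]
print(c)    # [2. 2. 9.]
```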
Remark. This portion of code has evolved from the primitive form
input n, (ai,j ), (bi )
for j = 1 to n − 1
do eliminate column j
output (ai,j )
It is often a good idea to break up a code into several smaller parts – it facilitates its reading
and the possible mistakes will be easier to identify.
Examining the previous algorithm, we count n − j + 2 mult./div. for fixed j and i. Then, for a fixed j, the number of mult./div. is (n − j)(n − j + 2) = (n − j + 1)² − 1. Finally the total number of mult./div. is
$$\sum_{j=1}^{n-1}\left[(n-j+1)^2 - 1\right] = \sum_{\ell=2}^{n}\ell^2 - (n-1) = \frac{n(n+1)(2n+1)}{6} - 1 - (n-1) = \frac{n^3}{3} + \cdots.$$
We observe that the elimination step is relatively expensive compared to the backward substitution step. We will remember that Gaussian elimination requires O(n³) operations, while backward substitution only requires O(n²).
Remark. When the n × n matrix A is invertible, the solution of the linear system Ax = b obeys Cramer's rule, that is
$$x_i = \det\begin{bmatrix} a_{1,1} & \cdots & a_{1,i-1} & b_1 & a_{1,i+1} & \cdots & a_{1,n} \\ a_{2,1} & \cdots & a_{2,i-1} & b_2 & a_{2,i+1} & \cdots & a_{2,n} \\ \vdots & & \vdots & \vdots & \vdots & & \vdots \\ a_{n,1} & \cdots & a_{n,i-1} & b_n & a_{n,i+1} & \cdots & a_{n,n} \end{bmatrix} \Big/ \det A, \qquad i \in J1, nK.$$
Unfortunately, the number of operations required to compute the two determinants grows like O(n!), therefore this method is totally impractical.
6.4 Exercises
From the textbook: 4 p 356, 10 p 358, 12 p 358 [write the algorithm with the software of
your choice and test it on the matrices of 5.c and 5.d p 357].
Optional problems
Chapter 7
7.1 LU -factorization
7.1.1 Definition
We say that an n × n matrix A admits an LU-factorization if it can be written as A = LU, where L and U are n × n lower and upper triangular matrices, respectively. Such a factorization is useful for several purposes, among which:
• calculation of a determinant [det A = det L · det U = (∏_{i=1}^{n} l_{i,i}) · (∏_{i=1}^{n} u_{i,i})] and test for nonsingularity [A invertible iff l_{i,i} ≠ 0 and u_{i,i} ≠ 0 for all i ∈ J1, nK];
• solution of the linear system Ax = b, by solving first Ly = b and then Ux = y. Both latter systems are triangular and can easily be solved. The advantage of this decomposition, compared to the Gaussian elimination, is that one does not consider the right-hand side b until the factorization is complete. Hence, when there are many right-hand sides, it is unnecessary to repeat an O(n³) procedure each time; only O(n²) operations will be required for each new b.
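This reuse of the factors is exactly what library routines offer. A short illustration with SciPy's LU routines (an example of mine, not part of the notes), applied to the matrix of Section 6.3.1:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[2.0, 1.0, -1.0],
              [6.0, 2.0, -2.0],
              [4.0, 6.0, -3.0]])

lu, piv = lu_factor(A)              # O(n^3) work, done once
for b in (np.array([2.0, 8.0, 5.0]), np.array([1.0, 0.0, 0.0])):
    x = lu_solve((lu, piv), b)      # O(n^2) work per right-hand side
    print(x)                        # first b gives [2. 1. 3.], the worked solution
```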
An LU-decomposition is never unique. This is not surprising, since A = LU induces n² scalar equations [one for each entry of A] linking n² + n unknowns [the entries of L and U]. As a matter of fact, if D = Diag[d_{1,1}, . . . , d_{n,n}] is a diagonal matrix such that each diagonal element d_{i,i} is nonzero and if A = LU, then there also holds
A = (LD)(D⁻¹U), where LD and D⁻¹U are lower and upper triangular, respectively.
Since the diagonal elements of LD and D⁻¹U are l_{i,i}d_{i,i} and u_{i,i}/d_{i,i}, i ∈ J1, nK, we can most of the time make an arbitrary choice for the diagonal elements of L or U. The most common one imposes L to be unit lower triangular (i.e. l_{i,i} = 1, i ∈ J1, nK); we refer to this as Doolittle's factorization. If U was unit upper triangular, we would talk about Crout's factorization.
The factorization can be read off column-by-column and row-by-row: writing ℓ_i for the i-th column of L and u_i^⊤ for the i-th row of U, one has A = LU = ∑_i ℓ_i u_i^⊤, where each term
$$\ell_i u_i^\top = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ \ell_{i+1,i} \\ \vdots \\ \ell_{n,i} \end{bmatrix}
\begin{bmatrix} 0 & \cdots & 0 & u_{i,i} & u_{i,i+1} & \cdots & u_{i,n} \end{bmatrix}
= \begin{bmatrix} 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\ \vdots & & \vdots & \vdots & \vdots & & \vdots \\ 0 & \cdots & 0 & u_{i,i} & u_{i,i+1} & \cdots & u_{i,n} \\ 0 & \cdots & 0 & u_{i,i}\,\ell_{i+1,i} & \times & \cdots & \times \\ \vdots & & \vdots & \vdots & \vdots & & \vdots \\ 0 & \cdots & 0 & u_{i,i}\,\ell_{n,i} & \times & \cdots & \times \end{bmatrix}$$
vanishes on its first i − 1 rows and columns (here for a Doolittle's factorization, so that ℓ_{i,i} = 1).
This allows one to simply read u_1^⊤ = first row of A and ℓ_1 = first column of A divided by u_{1,1}. Then we consider A_1 := A − ℓ_1u_1^⊤, and we read u_2^⊤ = second row of A_1, ℓ_2 = second column of A_1 divided by u_{2,2}, and so on. This leads to the following LU-algorithm: for k = 1, . . . , n,
u_k^⊤ = k-th row of A_{k−1},   ℓ_k = k-th column of A_{k−1} scaled so that ℓ_{k,k} = 1,   A_k := A_{k−1} − ℓ_k u_k^⊤   (with A_0 := A).
Note that a necessary condition for the realization of this algorithm is that u_{i,i} ≠ 0 for all i ∈ J1, nK.
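A direct Python transcription of this outer-product scheme might look as follows (a sketch assuming all the u_{k,k} are nonzero; the names and the test matrix are mine).

```python
import numpy as np

def doolittle_lu(A):
    """Doolittle LU-factorization via successive outer products A_k = A_{k-1} - l_k u_k^T."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L, U = np.zeros((n, n)), np.zeros((n, n))
    for k in range(n):
        U[k, k:] = A[k, k:]               # k-th row of A_{k-1}
        L[k:, k] = A[k:, k] / A[k, k]     # k-th column of A_{k-1}, scaled so L[k, k] = 1
        A -= np.outer(L[:, k], U[k, :])   # form A_k
    return L, U

A = np.array([[2.0, 1.0, -1.0], [6.0, 2.0, -2.0], [4.0, 6.0, -3.0]])
L, U = doolittle_lu(A)
print(L)                           # [[1,0,0],[3,1,0],[2,-4,1]], as in Section 7.2
print(U)                           # [[2,1,-1],[0,-1,1],[0,0,3]]
print(np.allclose(L @ U, A))       # True
```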
Let us now recall that the k-th leading principal minor of a matrix A is
$$A_k := \begin{bmatrix} a_{1,1} & \cdots & a_{1,k} \\ \vdots & \ddots & \vdots \\ a_{k,1} & \cdots & a_{k,k} \end{bmatrix}.$$
One can see that if a matrix A admits the LU-decomposition A = LU, then the matrix A_k admits the decomposition A_k = L_kU_k, by writing
$$A =: \begin{bmatrix} A_k & B \\ C & D \end{bmatrix}
= \underbrace{\begin{bmatrix} L_k & 0 \\ X & L' \end{bmatrix}}_{=:L}
\underbrace{\begin{bmatrix} U_k & Y \\ 0 & U' \end{bmatrix}}_{=:U}
= \begin{bmatrix} L_kU_k & L_kY \\ XU_k & XY + L'U' \end{bmatrix}.$$
The necessary condition for the realization of the LU-algorithm can therefore be translated in terms of the matrix A as det A_k ≠ 0, k ∈ J1, nK. It turns out that this condition is somewhat sufficient for a Doolittle's factorization to exist.
Theorem 7.1. A nonsingular n × n matrix A admits a Doolittle's factorization if and only if det A_k ≠ 0 for all k ∈ J1, n − 1K.
Proof. We proceed by induction on n for the direct implication. Clearly the implication is true for n = 1. Now let A be an n × n invertible matrix such that det A_k ≠ 0 for all k ∈ J1, n − 1K. We can apply the induction hypothesis to A_{n−1} and write A_{n−1} = L̃Ũ, where the unit lower triangular matrix L̃ and the upper triangular matrix Ũ are invertible. Then, with
$$A =: \begin{bmatrix} A_{n-1} & B \\ C & d \end{bmatrix}, \qquad X := C\widetilde{U}^{-1}, \qquad Y := \widetilde{L}^{-1}B, \qquad \delta := d - XY,$$
one obtains
$$\begin{bmatrix} \widetilde{L} & 0 \\ X & 1 \end{bmatrix}\begin{bmatrix} \widetilde{U} & Y \\ 0 & \delta \end{bmatrix}
= \begin{bmatrix} \widetilde{L}\widetilde{U} & \widetilde{L}Y \\ X\widetilde{U} & XY + \delta \end{bmatrix}
= \begin{bmatrix} A_{n-1} & B \\ C & d \end{bmatrix},$$
which is a Doolittle's factorization for A.
For the converse implication, suppose that a nonsingular matrix admits the Doolittle's factorization A = LU. Then det A = ∏_{i=1}^{n} u_{i,i} is nonzero, and it follows that det A_k = ∏_{i=1}^{k} u_{i,i} is nonzero, too.
The elementary row operations on a matrix A are obtained by multiplying A on the left by certain matrices, namely:
Op.1: the matrix equal to the identity except for the entry λ in position (i, i); multiplying A on the left by it replaces the row a_i^⊤ of A by λa_i^⊤;
Op.2: the matrix equal to the identity except for the entry ±1 in position (j, i); multiplying A on the left by it replaces the row a_j^⊤ of A by a_j^⊤ ± a_i^⊤;
Op.3: the matrix equal to the identity except that its (i, i) and (j, j) entries are 0 and its (i, j) and (j, i) entries are 1; multiplying A on the left by it swaps the rows a_i^⊤ and a_j^⊤ of A.
Now suppose that the Gaussian elimination algorithm can be carried out without encountering any zero pivot. The algorithm, performed by means of row operations of the type a_j^⊤ ← a_j^⊤ − λ_{j,i}a_i^⊤, i < j, produces an upper triangular matrix. But these row operations are obtained by left-multiplication with the unit lower triangular matrices L_{i,j}, equal to the identity matrix except for the entry −λ_{j,i} in position (j, i).
This is particularly transparent on an example, e.g. the one of Section 6.3.1. Instead of inserting zeros in the lower part of the matrix, we store the coefficients λ_{j,i}:
$$\begin{bmatrix} 2 & 1 & -1 \\ 6 & 2 & -2 \\ 4 & 6 & -3 \end{bmatrix} \longrightarrow \begin{bmatrix} 2 & 1 & -1 \\ 3 & -1 & 1 \\ 2 & 4 & -1 \end{bmatrix} \longrightarrow \begin{bmatrix} 2 & 1 & -1 \\ 3 & -1 & 1 \\ 2 & -4 & 3 \end{bmatrix}.$$
This translates into
$$\begin{bmatrix} 2 & 1 & -1 \\ 6 & 2 & -2 \\ 4 & 6 & -3 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 3 & 1 & 0 \\ 2 & -4 & 1 \end{bmatrix}\begin{bmatrix} 2 & 1 & -1 \\ 0 & -1 & 1 \\ 0 & 0 & 3 \end{bmatrix}.$$
7.3 Cholesky factorization
This is just a fancy word for an LU -factorization applied to a symmetric positive definite
matrix. Let us state the result straightaway.
Theorem 7.2. Let A be a real symmetric positive definite matrix. It admits a unique factorization A = LL⊤, where L is lower triangular with positive diagonal entries.
Proof. By considering x⊤Ax = x⊤A_kx > 0 for nonzero vectors x = [x_1, . . . , x_k, 0, . . . , 0]⊤, we see that the leading principal submatrices of A are all positive definite, hence nonsingular. Thus A admits a Doolittle's factorization A = LU. From LU = A = A⊤ = U⊤L⊤ and the nonsingularity of L and U, one derives UL⁻⊤ = L⁻¹U⊤ =: D. This latter matrix is simultaneously lower and upper triangular, i.e. it is diagonal. Its entries are positive, as seen from
d_{i,i} = e_i⊤De_i = e_i⊤UL⁻⊤e_i = (L⁻⊤e_i)⊤LU(L⁻⊤e_i) =: x⊤Ax > 0.
Hence the matrix D^{1/2} := Diag[√d_{1,1}, . . . , √d_{n,n}] satisfies D^{1/2}D^{1/2} = D, and it remains to write
A = LU = LDL⊤ = LD^{1/2}D^{1/2}L⊤ = (LD^{1/2})(LD^{1/2})⊤.
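As an illustration (mine, not the notes'), the factor L can also be computed directly from the identity A = LL⊤, reading it off column by column.

```python
import numpy as np

def cholesky(A):
    """Return the lower triangular L with positive diagonal such that A = L L^T.

    Assumes A is symmetric positive definite.
    """
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L

A = np.array([[4.0, 2.0], [2.0, 3.0]])
L = cholesky(A)
print(L)                            # [[2, 0], [1, sqrt(2)]]
print(np.allclose(L @ L.T, A))      # True
```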
7.4 Exercises
1. Check that the matrix $\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ has no LU-factorization.
1 c 1
admits a unique Doolittle’s decomposition. Given an n×n matrix, what does it say about
the converse of
3. Calculate all Doolittle's factorizations of the matrix
$$A = \begin{bmatrix} 10 & 6 & -2 & 1 \\ 10 & 10 & -5 & 0 \\ -2 & 2 & -2 & 1 \\ 1 & 3 & -2 & 3 \end{bmatrix}.$$
By using one of these factorizations, find all solutions of the equation Ax = b, where b = [−2, 0, 2, 1]⊤.
5. True or false: If A has a Doolittle’s factorization, then it has Crout’s factorization? Give
either a proof or a counterexample.
6. For a nonsingular matrix M , establish that the matrix M M > is symmetric positive
definite.
7. Implement first an algorithm for finding the inverse of an upper triangular matrix U
[hint: if C1 , . . . , Cn denote the columns of U −1 , what is U Ci ?]. Implement next the LU -
factorization algorithm. Implement at last an algorithm for finding the inverse of a
square matrix. Test your algorithms along the way.
Deduce from the factorization the value of λ that makes the matrix singular. Also find
this value of λ by seeking a vector – say, whose first component is one – in the null-space
of the matrix .
Optional problems
1. A matrix A = (a_{i,j}) in which a_{i,j} = 0 when j > i or j < i − 1 is called a Stieltjes matrix.
Devise an efficient algorithm for inverting such a matrix.
2. Find the inverse of the matrix
$$\begin{bmatrix} 1 & -1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & -1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & -1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \ddots & \vdots \\ \vdots & & & \ddots & 1 & -1 \\ 0 & 0 & \cdots & \cdots & 0 & 1 \end{bmatrix}.$$
Chapter 8
Iterative methods
and hope that x^{(k)} will approach the exact solution x*. This is known as the Jacobi iteration. In fact, if the sequence (x^{(k)}) is convergent, then its limit is necessarily x*. The computations to be performed, somewhat hidden in (8.1), are
$$x_i^{(k+1)} = \frac{1}{a_{i,i}}\Big(-\sum_{j<i} a_{i,j}x_j^{(k)} - \sum_{j>i} a_{i,j}x_j^{(k)} + b_i\Big), \qquad i \in J1, nK.$$
At step i, it seems intuitively better to replace the coordinates x_j^{(k)}, j < i, by their updated values x_j^{(k+1)}, j < i, i.e. to perform the computations
$$x_i^{(k+1)} = \frac{1}{a_{i,i}}\Big(-\sum_{j<i} a_{i,j}x_j^{(k+1)} - \sum_{j>i} a_{i,j}x_j^{(k)} + b_i\Big), \qquad i \in J1, nK.$$
This is the Gauss–Seidel iteration. It translates into the matrix form x^{(k+1)} = D⁻¹(Lx^{(k+1)} + Ux^{(k)} + b), or equivalently
(D − L)x^{(k+1)} = Ux^{(k)} + b.
The algorithms for these two schemes copy the following pieces of pseudocode.
Jacobi
input n, (a_{i,j}), (b_i), (x_i), M, ε
for k = 1 to M do
for i = 1 to n do u_i ← (b_i − Σ_{j≠i} a_{i,j}x_j)/a_{i,i} end do
if max_j |u_j − x_j| < ε then output k, (x_i) stop end if
for i = 1 to n do x_i ← u_i end do
end do
output 'procedure unsuccessful', (x_i)
Gauss–Seidel
input n, (a_{i,j}), (b_i), (x_i), M, ε
for k = 1 to M do
for i = 1 to n do u_i ← (b_i − Σ_{j<i} a_{i,j}u_j − Σ_{j>i} a_{i,j}x_j)/a_{i,i} end do
if max_j |u_j − x_j| < ε then output k, (x_i) stop end if
for i = 1 to n do x_i ← u_i end do
end do
output 'procedure unsuccessful', (x_i)
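A compact Python rendering of both sweeps (a sketch of mine; the test system is the one from the exercises of this chapter, whose solution is [1, 1, 1]):

```python
import numpy as np

def jacobi_step(A, b, x):
    """One Jacobi sweep: every component uses only the old iterate x."""
    n = len(b)
    u = np.empty(n)
    for i in range(n):
        u[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return u

def gauss_seidel_step(A, b, x):
    """One Gauss-Seidel sweep: components j < i use the freshly updated values."""
    u = x.copy()
    for i in range(len(b)):
        u[i] = (b[i] - A[i, :i] @ u[:i] - A[i, i+1:] @ u[i+1:]) / A[i, i]
    return u

A = np.array([[3.0, 1.0, 1.0], [1.0, 3.0, -1.0], [3.0, 1.0, -5.0]])
b = np.array([5.0, 3.0, -1.0])
x = np.zeros(3)
for _ in range(50):
    x = gauss_seidel_step(A, b, x)
print(x)    # approaches [1, 1, 1]
```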
We assume that the matrix A is nonsingular. The equation Ax = b can always be rewritten as Qx = (Q − A)x + b, where the splitting matrix Q should be nonsingular and the system Qx = c should be easy to solve. Then we consider the iteration process Qx^{(k+1)} = (Q − A)x^{(k)} + b, which is rather rewritten, for the sake of theoretical analysis, as
(8.2)   x^{(k+1)} = (I − Q⁻¹A)x^{(k)} + Q⁻¹b.
We say that this iterative method converges if the sequence (x^{(k)}) converges [to the unique solution x* of Ax = b] regardless of the initial vector x^{(0)}. Obviously, the method is convergent if and only if ε^{(k)} := x^{(k)} − x*, the error vector in the k-th iterate, tends to zero, or equivalently if and only if the residual vector r^{(k)} := Ax^{(k)} − b = Aε^{(k)} tends to zero. Subtracting x* = (I − Q⁻¹A)x* + Q⁻¹b from (8.2) gives ε^{(k+1)} = (I − Q⁻¹A)ε^{(k)}, and consequently ε^{(k)} = (I − Q⁻¹A)^kε^{(0)}.
Proposition 8.1. If δ := |||I − Q⁻¹A||| < 1 for some natural matrix norm, then the iterative method (8.2) converges.
Reminder: A norm |||·||| on the space of n × n matrices is called a natural (or induced, or subordinate) matrix norm if it has the form |||B||| = max_{x≠0} ‖Bx‖/‖x‖ for some vector norm ‖·‖ on ℝⁿ.
Proof. We get ‖ε^{(k)}‖ ≤ δ^k‖ε^{(0)}‖ directly from the expression of ε^{(k)} in terms of ε^{(0)}, and in particular one has ‖ε^{(k)}‖ → 0. Furthermore, taking the limit m → +∞ in
If the matrix A is strictly diagonally dominant, i.e. |a_{i,i}| > Σ_{j≠i} |a_{i,j}| for all i ∈ J1, nK, then the Jacobi iteration converges.
Proof. For the Jacobi iteration, one takes Q = D. We consider the norm on ℝⁿ defined by ‖x‖_∞ := max_{i∈J1,nK} |x_i|. We leave to the reader the task of checking every step in the inequality
$$|||I - D^{-1}A|||_\infty = \max_{i \in J1,nK}\sum_{j=1}^{n} |(I - D^{-1}A)_{i,j}| = \max_{i \in J1,nK}\sum_{j=1,\,j\ne i}^{n} \left|\frac{a_{i,j}}{a_{i,i}}\right| < 1.$$
As a matter of fact, Proposition 8.1 can be strengthened to give a necessary and sufficient condition for convergence. We recall first that the spectral radius of a matrix B is the maximum of the modulus of its eigenvalues, i.e. ρ(B) := max{|λ| : λ eigenvalue of B}. The iterative method (8.2) converges if and only if ρ(I − Q⁻¹A) < 1.
Proof. The previous fact combined with Proposition 8.1 yields the ⇐ part. As for the ⇒ part, suppose that the iteration (8.2) converges, i.e. B^kε converges to zero for any ε ∈ ℝⁿ, where we have set B := I − Q⁻¹A. Let us now assume that r := ρ(B) ≥ 1, which should lead to a contradiction. There exist x ∈ ℂⁿ \ {0} and θ ∈ [0, 2π) such that Bx = re^{iθ}x. It follows that B^kx = r^ke^{ikθ}x. Writing x = u + iv with u, v ∈ ℝⁿ, one has B^kx = B^ku + iB^kv → 0 + i·0 = 0. But clearly r^ke^{ikθ}x ↛ 0, hence the required contradiction.
8.3 Exercises
1. Program the Gauss–Seidel method and test it on these examples:
3x + y + z = 5,  x + 3y − z = 3,  3x + y − 5z = −1;
and
3x + y + z = 5,  3x + y − 5z = −1,  x + 3y − z = 3.
Analyse what happens when these systems are solved by simple Gaussian elimination without pivoting.
with γ large and |α| < 1, |β| < 1. Calculate the elements of H k and show that they tend
to zero as k → +∞. Further, establish the equation x(k) − x∗ = H k (x(0) − x∗ ), where x∗
is defined by x∗ = Hx∗ + b. Thus deduce that the sequence (x(k) ) converges to x∗ .
0 0 1 η ζ 0
η ζ 1
where ξ, η and ζ are constants. Find all values of the constants such that the sequence
(x(k) ) converges for every x(0) and b. Give an example of nonconvergence when ξ = η =
ζ = −1. Is the solution always found in at most two iterations when ξ = ζ = 0?
Chapter 9
Steepest descent and conjugate gradient methods
Proof. We calculate
q(x + tv) = ⟨x + tv, A(x + tv) − 2b⟩ = ⟨x, Ax − 2b⟩ + t(⟨v, Ax − 2b⟩ + ⟨x, Av⟩) + t²⟨v, Av⟩ = q(x) + 2t⟨v, Ax − b⟩ + t²⟨v, Av⟩.
This quadratic polynomial is minimized at the zero of its derivative, i.e. at t* := ⟨v, b − Ax⟩/⟨v, Av⟩, where it takes the value given in the Lemma. The second part is now obtained as follows:
Remark. Have we used the hypothesis that A is positive definite?
$$x^{(k)} = x^{(k-1)} + t_kv^{(k)}, \qquad t_k = \frac{\langle v^{(k)}, b - Ax^{(k-1)}\rangle}{\langle v^{(k)}, Av^{(k)}\rangle}.$$
The search direction is taken to be the direction of greatest decrease of q – hence the name steepest descent method. This direction is the opposite of the gradient
$$\nabla q(x^{(k-1)}) = \left[\frac{\partial q}{\partial x_1}(x^{(k-1)}), \ldots, \frac{\partial q}{\partial x_n}(x^{(k-1)})\right]^\top = 2(Ax^{(k-1)} - b),$$
i.e. it is the direction of the residual. However, this method converges slowly, and is therefore rarely used for linear systems.
We still follow the basic strategy of minimizing the quadratic form q. The family of methods for which the search directions (v^{(1)}, . . . , v^{(n)}) are chosen to form an A-orthogonal system constitutes the conjugate direction methods. The solution is given in a finite number of steps, precisely in n steps – assuming exact arithmetic, that is. The picture is not so perfect if roundoff errors are taken into account.
Theorem 9.2. Let (v^{(1)}, . . . , v^{(n)}) be an A-orthogonal system of nonzero vectors. Choose the vector x^{(0)} arbitrarily and define the sequence (x^{(k)}) by
(9.1)   x^{(k)} = x^{(k−1)} + t_kv^{(k)},   t_k = ⟨v^{(k)}, b − Ax^{(k−1)}⟩/⟨v^{(k)}, Av^{(k)}⟩.
Then Ax^{(n)} = b.
Proof. We prove by induction on k that the residual r^{(k)} = b − Ax^{(k)} is orthogonal to the system (v^{(1)}, . . . , v^{(k)}), so that r^{(n)} is orthogonal to the basis (v^{(1)}, . . . , v^{(n)}), hence must be zero. For k = 0, there is nothing to establish, and the induction hypothesis holds trivially. Now suppose that it holds for k − 1, that is r^{(k−1)} ⊥ (v^{(1)}, . . . , v^{(k−1)}). Then apply −A to (9.1) and add b to obtain
r^{(k)} = r^{(k−1)} − t_kAv^{(k)}.
We clearly get ⟨r^{(k)}, v^{(j)}⟩ = 0 for j ∈ J1, k − 1K, and we also observe that ⟨r^{(k)}, v^{(k)}⟩ = ⟨r^{(k−1)}, v^{(k)}⟩ − t_k⟨Av^{(k)}, v^{(k)}⟩ = ⟨r^{(k−1)}, v^{(k)}⟩ − ⟨v^{(k)}, r^{(k−1)}⟩ = 0. We have shown that r^{(k)} ⊥ (v^{(1)}, . . . , v^{(k)}), i.e. that the induction hypothesis holds for k. This concludes the proof.
To initiate this process, one may prescribe the A-orthogonal system at the beginning. This can be done using the Gram–Schmidt algorithm [see Chapter 10 – the positive definiteness of A insures that ⟨x, Ay⟩ defines an inner product]. One may also determine the search directions one at a time within the solution process. In the conjugate gradient method, the vectors v^{(1)}, . . . , v^{(n)} are chosen not only to be A-orthogonal, but also to induce an orthogonal system of residual vectors. Therefore, one should have r^{(k)} ⊥ span[r^{(0)}, . . . , r^{(k−1)}], and the A-orthogonality of the search directions implies that r^{(k)} ⊥ span[v^{(1)}, . . . , v^{(k)}], just like in the proof of Theorem 9.2. This suggests to impose span[r^{(0)}, . . . , r^{(k−1)}] = span[v^{(1)}, . . . , v^{(k)}] =: U_k for each k. Then, since r^{(k)} ∈ U_{k+1}, one could write [renormalizing the search direction if necessary],
Using (9.2) and the A-orthogonality of (v^{(1)}, . . . , v^{(n)}), one would derive that
$$\alpha_j = \frac{1}{\langle v^{(j)}, Av^{(j)}\rangle}\,\langle r^{(k)}, Av^{(j)}\rangle = \frac{1}{\langle v^{(j)}, r^{(j-1)}\rangle}\,\langle r^{(k)}, r^{(j-1)} - r^{(j)}\rangle.$$
Theorem 9.3. Choose the vector x^{(0)} arbitrarily, and set v^{(1)} = r^{(0)} = b − Ax^{(0)}. Assuming that x^{(0)}, . . . , x^{(k−1)} and v^{(1)}, . . . , v^{(k)} have been constructed, set
$$x^{(k)} = x^{(k-1)} + t_kv^{(k)}, \qquad t_k = \frac{\langle r^{(k-1)}, r^{(k-1)}\rangle}{\langle v^{(k)}, Av^{(k)}\rangle},$$
$$v^{(k+1)} = r^{(k)} + s_kv^{(k)}, \qquad s_k = \frac{\langle r^{(k)}, r^{(k)}\rangle}{\langle r^{(k-1)}, r^{(k-1)}\rangle}.$$
Then the system (v^{(1)}, . . . , v^{(n)}) is A-orthogonal and Ax^{(n)} = b.
Remark. The residual vector should be determined according to r(k) = r(k−1) − tk Av (k)
rather than r(k) = b − Ax(k) . Besides, there seems to be a flaw in the definitions of tk and sk ,
as we divide e.g. by hr(k−1) , r(k−1) i, which might equal zero. If this situation occurs, then
r(k−1) is already a solution of Ax = b. In any case, if r(k−1) = 0, then tk = 0, hence r(k) =
r(k−1) = 0, then tk+1 = 0, and so on. Updating the search direction becomes unnecessary.
Proof. The second equation implies that span[r^{(0)}, . . . , r^{(k−1)}] = span[v^{(1)}, . . . , v^{(k)}] =: U_k for all k. We now establish by induction on k that r^{(k)} and Av^{(k+1)} are orthogonal to the space U_k. For k = 0, there is nothing to show. Assume now that r^{(k−1)} ⊥ U_{k−1} and that Av^{(k)} ⊥ U_{k−1}. Then we calculate, for u ∈ U_{k−1},
⟨r^{(k)}, u⟩ = ⟨r^{(k−1)}, u⟩ − t_k⟨Av^{(k)}, u⟩ = 0,
and also
⟨r^{(k)}, v^{(k)}⟩ = ⟨r^{(k−1)}, v^{(k)}⟩ − t_k⟨Av^{(k)}, v^{(k)}⟩ = ⟨r^{(k−1)}, v^{(k)}⟩ − ⟨r^{(k−1)}, r^{(k−1)}⟩ = ⟨r^{(k−1)}, v^{(k)} − r^{(k−1)}⟩ = s_{k−1}⟨r^{(k−1)}, v^{(k−1)}⟩ = 0.
This proves that r^{(k)} ⊥ U_k, which is the first part of the induction hypothesis relative to k + 1. As for the second part, we calculate, for j ∈ J1, kK,
⟨Av^{(k+1)}, v^{(j)}⟩ = ⟨r^{(k)}, Av^{(j)}⟩ + s_k⟨Av^{(k)}, v^{(j)}⟩ = (1/t_j)⟨r^{(k)}, r^{(j−1)} − r^{(j)}⟩ + s_k⟨Av^{(k)}, v^{(j)}⟩.
Hence, for j ≤ k − 1, we readily check that ⟨Av^{(k+1)}, v^{(j)}⟩ = 0, and for j = k, we observe that ⟨Av^{(k+1)}, v^{(k)}⟩ = −⟨r^{(k)}, r^{(k)}⟩/t_k + s_k⟨v^{(k)}, Av^{(k)}⟩ = 0. This proves that Av^{(k+1)} ⊥ U_k, which is the second part of the induction hypothesis relative to k + 1. The inductive proof is now complete. In particular, we have shown that the system (v^{(1)}, . . . , v^{(n)}) is A-orthogonal. Theorem 9.2 applies, and we conclude that Ax^{(n)} = b.
A computer code for the conjugate gradient method is based on the following pseudocode.
input n, A, b, x, M, ε, δ
r ← b − Ax, v ← r, c ← ⟨r, r⟩
for k = 1 to M do
if ⟨v, v⟩ < δ then stop end if
z ← Av, t ← c/⟨v, z⟩, x ← x + tv, r ← r − tz, d ← ⟨r, r⟩
if d < ε then stop end if
v ← r + (d/c)v, c ← d
end do
output k, x, r
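For comparison, here is a direct Python transcription of this pseudocode (a sketch; the tolerances ε and δ play the same role as above, and the 2 × 2 test system is my own example).

```python
import numpy as np

def conjugate_gradient(A, b, x, max_iter=100, eps=1e-12, delta=1e-12):
    """Conjugate gradient iteration for Ax = b with A symmetric positive definite."""
    r = b - A @ x
    v = r.copy()
    c = r @ r
    for _ in range(max_iter):
        if v @ v < delta:
            break
        z = A @ v
        t = c / (v @ z)
        x = x + t * v
        r = r - t * z              # cheaper than recomputing b - A @ x
        d = r @ r
        if d < eps:
            break
        v = r + (d / c) * v
        c = d
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b, np.zeros(2)))    # approx [1/11, 7/11]
```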
9.3 Exercises
1. In the method of steepest descent, show that the vectors v^{(k)} and v^{(k+1)} are orthogonal and that
$$q(x^{(k+1)}) = q(x^{(k)}) - \frac{\langle r^{(k)}, r^{(k)}\rangle^2}{\langle r^{(k)}, Ar^{(k)}\rangle}.$$
2. Program and test the conjugate gradient method. A good test case is the Hilbert matrix with a simple b-vector:
$$a_{i,j} = \frac{1}{i+j-1}, \qquad b_i = \frac{a_{i,1} + a_{i,2} + \cdots + a_{i,n}}{3}, \qquad i, j \in J1, nK.$$
Optional problems
1. Using Jacobi, Gauss–Seidel and the conjugate gradient methods with the initial vector x^{(0)} = [0, 0, 0, 0, 0]⊤, compute the solution of the system
$$\begin{bmatrix} 10 & 1 & 2 & 3 & 4 \\ 1 & 9 & -1 & 2 & -3 \\ 2 & -1 & 7 & 3 & -5 \\ 3 & 2 & 3 & 12 & -1 \\ 4 & -3 & -5 & -1 & 15 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 12 \\ -27 \\ 14 \\ -17 \\ 12 \end{bmatrix}.$$
Chapter 10
The QR factorization
By QR-factorization of an m × n matrix A, we understand either a decomposition of the form
A = QR, where Q is an m × m orthogonal matrix and R is an m × n upper triangular matrix,
or a decomposition of the form
A = BT, where the columns of the m × n matrix B are orthonormal and T is an n × n upper triangular matrix.
Most of the time, we will assume that m ≥ n, in which case every matrix has a (non-unique) QR-factorization for either of these representations. In the case m = n, the factorization can be used to solve linear systems, according to
It is always possible to find such vectors, and in fact they are uniquely determined if the additional condition ⟨v_k, u_k⟩ > 0 is required. The step-by-step construction is based on the following scheme: form ṽ_k := u_k + c_{k,1}v_1 + · · · + c_{k,k−1}v_{k−1}, where the conditions 0 = ⟨ṽ_k, v_i⟩ = ⟨u_k, v_i⟩ + c_{k,i} impose the choice c_{k,i} = −⟨u_k, v_i⟩; now that ṽ_k is completely determined, form the normalized vector v_k = ṽ_k/‖ṽ_k‖.
The accompanying code parallels the pseudocode
input n, (u_k)
for k = 1 to n do
v_k ← u_k
for i = 1 to k − 1 do c_{k,i} ← ⟨u_k, v_i⟩, v_k ← v_k − c_{k,i}v_i end do
v_k ← v_k/‖v_k‖
end do
output (v_k)
For example, we write down explicitly all the steps in the orthonormalization process for the vectors
u_1 = [6, 3, 2]⊤, u_2 = [6, 6, 1]⊤, u_3 = [1, 1, 1]⊤.
• ṽ_1 = u_1, ‖ṽ_1‖ = √(36 + 9 + 4) = 7, v_1 = (1/7)[6, 3, 2]⊤;
• ṽ_2 = u_2 + αv_1, where 0 = ⟨ṽ_2, v_1⟩ imposes α = −⟨u_2, v_1⟩ = −(36 + 18 + 2)/7 = −8,
ṽ_2 = (1/7)[42 − 48, 42 − 24, 7 − 16]⊤ = (3/7)[−2, 6, −3]⊤, ‖ṽ_2‖ = 3, v_2 = (1/7)[−2, 6, −3]⊤;
• ṽ_3 = u_3 + βv_2 + γv_1, where 0 = ⟨ṽ_3, v_2⟩ imposes β = −⟨u_3, v_2⟩ = −(−2 + 6 − 3)/7 = −1/7, and 0 = ⟨ṽ_3, v_1⟩ imposes γ = −⟨u_3, v_1⟩ = −(6 + 3 + 2)/7 = −11/7,
ṽ_3 = (1/49)[49 + 2 − 66, 49 − 6 − 33, 49 + 3 − 22]⊤ = (1/49)[−15, 10, 30]⊤ = (5/49)[−3, 2, 6]⊤, ‖ṽ_3‖ = (5/49)√(9 + 4 + 36) = 5/7, v_3 = (1/7)[−3, 2, 6]⊤.
10.1.2 Matrix interpretation
The relations defining the Gram–Schmidt process can be rewritten as
(10.1)   u_j = Σ_{k=1}^{n} t_{k,j}v_k,   j ∈ J1, nK,
with t_{k,j} = 0 for k > j; in other words, T = [t_{i,j}]_{i,j=1}^{n} is an n × n upper triangular matrix. The n equations (10.1) reduce, in matrix form, to A = BT, where B is the m × n matrix whose columns are the orthonormal vectors v_1, . . . , v_n. To explain the other factorization, let us complete v_1, . . . , v_n with v_{n+1}, . . . , v_m to form an orthonormal basis (v_1, . . . , v_m) of ℝ^m. The analogs of the equations (10.1), i.e. u_j = Σ_{k=1}^{m} r_{k,j}v_k with r_{k,j} = 0 for k > j, read A = QR, where Q is the m × m orthogonal matrix with columns v_1, . . . , v_m and R is an m × n upper triangular matrix.
To illustrate this point, observe that the orthonormalization carried out in Section 10.1.1 translates into the factorization [identify all the entries]
$$\begin{bmatrix} 6 & 6 & 1 \\ 3 & 6 & 1 \\ 2 & 1 & 1 \end{bmatrix}
= \underbrace{\frac{1}{7}\begin{bmatrix} 6 & -2 & -3 \\ 3 & 6 & 2 \\ 2 & -3 & 6 \end{bmatrix}}_{\text{orthogonal}}
\ \underbrace{\begin{bmatrix} 7 & 8 & 11/7 \\ 0 & 3 & 1/7 \\ 0 & 0 & 5/7 \end{bmatrix}}_{\text{upper triangular}}.$$
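The factorization is easy to check numerically. The sketch below (mine, not part of the notes) reproduces B and T with a plain Gram–Schmidt loop; numpy's own np.linalg.qr would give the same factors up to signs.

```python
import numpy as np

def gram_schmidt(A):
    """Return B with orthonormal columns and upper triangular T such that A = B T."""
    m, n = A.shape
    B, T = np.zeros((m, n)), np.zeros((n, n))
    for k in range(n):
        v = A[:, k].copy()
        for i in range(k):
            T[i, k] = B[:, i] @ A[:, k]
            v -= T[i, k] * B[:, i]
        T[k, k] = np.linalg.norm(v)
        B[:, k] = v / T[k, k]
    return B, T

A = np.array([[6.0, 6.0, 1.0], [3.0, 6.0, 1.0], [2.0, 1.0, 1.0]])
B, T = gram_schmidt(A)
print(np.round(7 * B, 6))     # columns [6,3,2], [-2,6,-3], [-3,2,6] as in the worked example
print(np.round(T, 6))         # [[7, 8, 11/7], [0, 3, 1/7], [0, 0, 5/7]]
print(np.allclose(B @ T, A))  # True
```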
The Gram–Schmidt algorithm has the disadvantage that small imprecisions in the calculation of inner products accumulate quickly and lead to effective loss of orthogonality. Alternative ways to obtain a QR-factorization are presented below on some examples. They are based on the following idea and exploit the fact that the computed product of orthogonal matrices gives, with acceptable error, an orthogonal matrix.
Multiply the matrix A on the left by some orthogonal matrices Q_i which 'eliminate' some entries below the main 'diagonal', until the result is an upper triangular matrix R, thus Q_k · · · Q_1A = R, i.e. A = (Q_1^⊤ · · · Q_k^⊤)R.
We mainly know two types of orthogonal transformations, namely rotations and reflections. Therefore, the two methods we describe are associated with Givens rotations [preferred when A is sparse] and Householder reflections [preferred when A is dense].
The matrix Ω^{[i,j]}, equal to the identity matrix except for the four entries
$$\Omega^{[i,j]}_{i,i} = \cos\theta, \quad \Omega^{[i,j]}_{i,j} = \sin\theta, \quad \Omega^{[i,j]}_{j,i} = -\sin\theta, \quad \Omega^{[i,j]}_{j,j} = \cos\theta,$$
corresponds to a rotation along the two-dimensional space span[e_i, e_j]. The rows of the matrix Ω^{[i,j]}A are the same as the rows of A, except for the i-th and j-th rows, which are linear combinations of the i-th and j-th rows of A. By choosing θ appropriately, we may introduce a zero at a prescribed position on one of these rows. Consider for example the matrix
$$A = \begin{bmatrix} 6 & 6 & 1 \\ 3 & 6 & 1 \\ 2 & 1 & 1 \end{bmatrix}$$
of the end of Section 10.1.2. We pick Ω^{[1,2]} so that
$$\Omega^{[1,2]}A = \begin{bmatrix} \times & \times & \times \\ 0 & \times & \times \\ \times & \times & \times \end{bmatrix}.$$
Then we pick Ω^{[1,3]} so that
$$\Omega^{[1,3]}\Omega^{[1,2]}A = \begin{bmatrix} \times & \times & \times \\ 0 & \times & \times \\ 0 & \times & \times \end{bmatrix}.$$
Finally, thanks to the leading zeros in the second and third rows, we can pick Ω^{[2,3]} so that
$$\Omega^{[2,3]}\Omega^{[1,3]}\Omega^{[1,2]}A = \begin{bmatrix} \times & \times & \times \\ 0 & \times & \times \\ 0 & 0 & \times \end{bmatrix}.$$
The matrix (Ω^{[2,3]}Ω^{[1,3]}Ω^{[1,2]})^⊤ is the orthogonal matrix required in the factorization of A.
The reflection in the direction of a vector v transforms v into −v while leaving the hyperplane v^⊥ unchanged. It can therefore be expressed through the symmetric orthogonal matrix
$$H_v := I - \frac{2}{\|v\|^2}\,vv^\top.$$
Consider the matrix
$$A = \begin{bmatrix} 6 & 6 & 1 \\ 3 & 6 & 1 \\ 2 & 1 & 1 \end{bmatrix}$$
once again. We may transform u_1 = [6, 3, 2]⊤ into 7e_1 = [7, 0, 0]⊤ by way of the reflection in the direction v_1 = u_1 − 7e_1 = [−1, 3, 2]⊤. The latter reflection produces a matrix of the form
$$H_{v_1}A = \begin{bmatrix} 7 & \times & \times \\ 0 & \times & \times \\ 0 & \times & \times \end{bmatrix},$$
whose second column is
$$H_{v_1}u_2 = u_2 - \frac{\langle v_1, u_2\rangle}{7}\,v_1 = u_2 - 2v_1 = \begin{bmatrix} 8 \\ 0 \\ -3 \end{bmatrix}.$$
To cut the argument short, we may observe at this point that the multiplication of H_{v_1}A on the left by the permutation matrix
$$P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$
[which can be interpreted as H_{e_2−e_3}] exchanges the second and third rows, thus gives an upper triangular matrix. In conclusion, the orthogonal matrix Q has been obtained as
$$H_{v_1}^\top P^\top = H_{v_1}P = \frac{1}{7}\begin{bmatrix} 6 & 2 & 3 \\ 3 & -6 & -2 \\ 2 & 3 & -6 \end{bmatrix}.$$
10.3 Exercises
2. Implement and test the code for the Gram–Schmidt process. Based on this, implement
and test a code producing the QR-factorization of a square matrix.
Chapter 11
Proof. According to the next lemma, used with V = Im A, we see that x* is characterized by b − Ax* ⊥ Im A, i.e. 0 = ⟨b − Ax*, Ax⟩ = ⟨A⊤(b − Ax*), x⟩ for all x ∈ ℝⁿ, implying that A⊤(b − Ax*) = 0.
11.2 Solution of the problem
It is possible to find the optimal x* by solving the so-called normal equations, that is A⊤Ax* = A⊤b. We could use the Cholesky factorization for instance, or even the conjugate gradient method [check this last claim]. Another option is to use a QR-factorization of A. Remember that the m × m orthogonal matrix Q preserves the Euclidean norm, hence
‖b − Ax‖ = ‖Q⊤(b − Ax)‖ = ‖c − Rx‖, where c := Q⊤b.
We may write
$$c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} \quad\text{and}\quad Rx = \begin{bmatrix} U \\ 0 \end{bmatrix}x = \begin{bmatrix} Ux \\ 0 \end{bmatrix}, \quad\text{so that}\quad c - Rx = \begin{bmatrix} c_1 - Ux \\ c_2 \end{bmatrix}.$$
Observe that rk U = rk R = rk A = n, meaning that U is nonsingular, which clearly yields the value
$$\min_{x \in \mathbb{R}^n}\|c - Rx\| = \|c_2\|, \quad\text{achieved for } x^* = U^{-1}c_1.$$
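A sketch of this QR-based least-squares solve in Python (the overdetermined line-fitting system is my own example, not the notes'):

```python
import numpy as np

def least_squares_qr(A, b):
    """Minimize ||b - Ax|| via the reduced QR-factorization A = B T (B has orthonormal columns)."""
    B, T = np.linalg.qr(A)             # reduced factorization: B is m x n, T is n x n
    return np.linalg.solve(T, B.T @ b)

A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # fit a line through three points
b = np.array([1.0, 2.0, 2.0])
x = least_squares_qr(A, b)
print(x)                                             # [1.1666..., 0.5]
print(np.allclose(A.T @ (b - A @ x), 0))             # the normal equations are satisfied
```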
11.3 Exercises
2. Suppose that the solution to the previous problem is given to you as [29/21, −2/3]⊤. How can this be verified without solving for x and y?
Part III
Chapter 12
Bisection method
We now turn our attention to the determination of the roots of a [nonlinear] equation, or,
to say things differently, to the determination of the zeros of a function. Note that any
equation g(x) = h(x) can indeed be written in the form f (x) = 0, with f := g − h. In the
theory of diffraction of light, for instance, one has to deal with the equation x = tan x; in
the calculation of planetary orbits, we need the roots of Kepler’s equation x − a sin x = b for
various values of a and b. Usually, an explicit expression for the solution cannot be obtained
and one has to resort to numerical computations. We describe here how the bisection
method achieves this specific goal.
The bisection method may also be referred to as the method of interval halving, since it
follows the simple strategy:
For a function f changing sign on the interval [a, b], choose between the left and right half-
intervals one that supplies a change of sign for f [hence a zero], and repeat the process
with this new interval.
Note that each of the two half-intervals may contain a zero of f, and that in general the function f may have several zeros, but our process only produces one of them. Remark also that the method implicitly requires the function f to be continuous [forcing us to be cautious with the equation x = tan x]. Since sgn f(a) ≠ sgn f(b), we must have either sgn f(c) ≠ sgn f(a) or sgn f(c) ≠ sgn f(b), where c denotes e.g. the midpoint of [a, b]; thus it is the intermediate value theorem which insures the existence of a zero of f in [a, c] or in [c, b].
To avoid unnecessary evaluations of functions, any value needed in the code should be stored rather than recomputed. Besides, to save multiplications, the test sgn f(a) ≠ sgn f(c) is preferable to the test f(a)f(c) < 0. A primitive pseudocode can be written as
input f, a, b, Nmax, ε
fa ← f(a), fb ← f(b), e ← b − a
if sgn fa = sgn fb then output a, fa, b, fb, 'same sign for the function at a and b' stop end if
for n = 0 to Nmax
do e ← e/2, c ← a + e, fc ← f(c)
if |fc| < ε then output n, c, fc, e stop end if
if sgn fa ≠ sgn fc then b ← c, fb ← fc
else a ← c, fa ← fc end if
end for
We now investigate the accuracy of the bisection method by estimating how close the final
point c is to a zero of the function f [evaluating how close f (c) is to zero constitutes a
different matter]. Let us denote by [a0 , b0 ], [a1 , b1 ], and so on, the intervals arising in the
bisection process. Clearly, the sequence (an ) is increasing and bounded above by b, and
the sequence (bn ) is decreasing and bounded below by a, so both sequences must converge.
Furthermore, the lengths of the successive intervals satisfy bn − an = (bn−1 − an−1)/2, hence
bn − an = (b − a)/2^n, and the limits of an and bn must have the same value, say r. Then the
inequalities f(an)f(bn) ≤ 0 and the continuity of f imply that f(r)² ≤ 0, thus f(r) = 0, i.e. r
is a zero of f. For each n, the point r lies in the interval [an, bn], so that
|r − cn| ≤ (bn − an)/2 ≤ (b − a)/2^{n+1}, where cn := (an + bn)/2.
The following summary of the situation may be formulated.
If the error tolerance is prescribed at the start, we can determine in advance how many
steps suffice to achieve this tolerance.
Indeed, to insure that |r − cn| < ε, it is enough to
impose (b − a)/2^{n+1} < ε, i.e. n > log2((b − a)/(2ε)) = ln((b − a)/(2ε)) / ln 2.
64
12.3 Exercises
• Write a program to find a zero of a function f in the following way: at each step, an
interval [a, b] is given and f (a)f (b) < 0; next c is computed as the zero of the linear
function that agrees with f at a and b; then either [a, c] or [c, b] is retained, depending on
whether f (a)f (c) < 0 or f (c)f (b) < 0. Test the program on several functions.
65
Chapter 13
Newton’s method
Newton’s method is another very popular numerical root-finding technique. It can be
applied in many diverse situations. When specialized to a real function of a real variable, it
is often called the Newton–Raphson iteration.
We start with some preliminary guess work. Suppose that r is a zero of a function f and
that x = r + h is an approximation to r. When x is close to r, Taylor’s theorem allows us to
write
0 = f (r) = f (x − h) = f (x) − hf 0 (x) + O(h2 ) ≈ f (x) − hf 0 (x).
Solving in h gives h ≈ f (x)/f 0 (x). We therefore expect x − f (x)/f 0 (x) to be a better approx-
imation to r than x was. This is the core of Newton’s method:
Construct a sequence (xn ) recursively, starting with an initial estimate x0 for a zero of f ,
and apply the iteration formula
(13.1) xn+1 = xn − f(xn)/f'(xn).
66
method and it cannot be lifted. As a matter of fact, from the graphical interpretation of the
method, it is not too hard to devise cases where Newton iteration will fail. For instance,
the function f(x) = sgn(x)√|x| yields a cycle of period 2 for any choice of x0 ≠ 0 [i.e.
x0 = x2 = x4 = · · · and x1 = x3 = x5 = · · ·]. The above mentioned graphical interpretation
of the method is nothing more than the observation that xn+1 is obtained as the intersection
of the x-axis with the tangent line to the f -curve at xn . Indeed, (13.1) may be rewritten as
f (xn ) + f 0 (xn )(xn+1 − xn ) = 0.
As an example, here is a method often used to compute the square root of a number y > 0.
We want to solve f (x) := x2 − y = 0, so we perform the iterations
xn+1 = xn − f(xn)/f'(xn), that is xn+1 = (1/2)(xn + y/xn).
This formula is very old [dated between 100 B.C. and 100 A.D., credited to Heron, a Greek
engineer and architect], yet very efficient. The value given for √17 after 4 iterations, starting with x0 = 4, is correct to 28 figures.
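A short Python sketch of Heron's iteration [not part of the original notes; in double precision the iterates stop improving at about 16 digits rather than 28]:

y, x = 17.0, 4.0
for n in range(5):
    x = (x + y / x) / 2          # x_{n+1} = (x_n + y/x_n)/2
    print(n + 1, x)              # the number of correct digits roughly doubles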
We end this section with a pseudocode which includes stopping criteria. Note that we
would need subprograms for f (x) and f 0 (x).
input a, M , δ, ε
v ← f (a)
output 0, a, v
if |v| < ε then stop end if
for k = 1 to M do
b ← a − v/f 0 (a), v ← f (b)
output k, b, v
if |b − a| < δ or |v| < ε then stop end if
a←b
end do end for
67
13.2 Convergence analysis
The analysis will be carried out in the framework of discrete dynamical systems. This
merely involves iterations of the type
xn+1 = F (xn ), where F is a continuous function mapping its domain into itself.
Since x1 = F (x0 ), x2 = F (x1 ) = F (F (x0 )), and so on, we often use the convenient notation
xn = F^n(x0). One simply takes F(x) = x − f(x)/f'(x) for Newton’s method, in which case
a zero of f is recognized as a fixed point of F , i.e. a point r for which F (r) = r. Here are
some general convergence theorems in this setting.
Theorem 13.1. Suppose that F is a continuous function from [a, b] into [a, b]. Then F has
a fixed point in [a, b]. Suppose in addition that F is continuously differentiable on [a, b] – in
short, that F ∈ C 1 [a, b]. If there is a constant 0 ≤ c < 1 such that |F 0 (x)| ≤ c for all x ∈ [a, b],
then the fixed point is unique. Moreover, the iterates xn = F n (x0 ) converge towards this
unique fixed point r and the error en = xn − r satisfies |en+1 | ≤ c |en |.
Proof. Consider the function G defined on [a, b] by G(x) := F (x) − x. We observe that
G(a) = F (a) − a ≥ a − a = 0 and that G(b) = F (b) − b ≤ b − b = 0. Thus, by the intermediate
value theorem, there is a point r ∈ [a, b] such that G(r) = 0, that is F (r) = r. Suppose now
that F ∈ C 1 [a, b] and that |F 0 (x)| ≤ c < 1 for all x ∈ [a, b]. Assume that there are points r
and s such that F(r) = r and F(s) = s. We then use the mean value theorem to obtain
|r − s| = |F(r) − F(s)| ≤ c |r − s|, which forces r = s since c < 1. The same argument gives
|en+1| = |F(xn) − F(r)| ≤ c |en|, and the convergence of (xn) to r follows.
Theorem 13.2. Let r be a fixed point of the function F ∈ C 1 [a, b]. If |F 0 (r)| < 1, then r
is an attractive fixed point, in the sense that there exists δ > 0 such that the sequence
(F n (x0 )) converges to r for all x0 chosen in [r −δ, r +δ]. Moreover, the convergence is linear,
meaning that |xn+1 − r| ≤ c |xn − r| for some constant 0 ≤ c < 1.
Proof. Suppose that |F 0 (r)| < 1. Then, by continuity of F 0 there exists δ > 0 such that
|F 0 (x)| ≤ (|F 0 (r)| + 1)/2 =: c < 1 for all x ∈ [r − δ, r + δ]. Note that F maps [r − δ, r + δ] into
itself, so that the previous theorem applies.
68
It is instructive to illustrate this phenomenon on the graph of the function F .
At this stage, one could already show that Newton’s method converges at a linear rate,
provided that the starting point is ‘close enough’ to a zero r of f . There is actually more to
it, which comes as no surprise if F'(r) is explicitly computed. We get
F'(r) = [d/dx (x − f(x)/f'(x))]_{x=r} = [1 − (f'(x)² − f''(x)f(x))/f'(x)²]_{x=r} = 1 − (1 − 0) = 0.
Theorem 13.3. Suppose that f belongs to C 2 [a, b] and that r is a simple zero of f [i.e.
f (r) = 0, f 0 (r) 6= 0]. Then there exists δ > 0 such that, for any x0 ∈ [r − δ, r + δ], the
sequence defined by (13.1) converges to r. Moreover, the convergence is quadratic, in the
sense that the error en = xn − r satisfies |en+1 | ≤ C e2n for some constant C.
Proof. We already know that the sequence (xn ) lies in [r − δ, r + δ] for some δ > 0. We may,
if necessary, decrease the value of δ to insure the continuity of f 00 and 1/f 0 on [r − δ, r + δ],
and we set D := max_{x∈[r−δ,r+δ]} |f''(x)| and d := min_{x∈[r−δ,r+δ]} |f'(x)|. We may also reduce δ so that
C := D/(2d) < 1/δ. We then write
en+1 = xn+1 − r = xn − f(xn)/f'(xn) − r = −[f(xn) + (r − xn)f'(xn)]/f'(xn)
     = −(1/f'(xn)) [f(r) − (1/2) f''(yn)(r − xn)²] for some yn between r and xn,
and we obtain
|en+1| = (1/2) (|f''(yn)|/|f'(xn)|) en² ≤ (D/(2d)) en² = C en².
This establishes the quadratic convergence.
69
13.3 Generalized setting
The same strategy [linearize and solve] may be used to find numerical solutions of systems
of nonlinear equations. Let us illustrate this point with the system
f1(x1, x2) = 0,
f2(x1, x2) = 0.
If (x1, x2) is an approximation to a solution (r1, r2) = (x1 − h1, x2 − h2), we linearize:
0 = f1(x1 − h1, x2 − h2) ≈ f1(x1, x2) − h1 ∂f1/∂x1(x1, x2) − h2 ∂f1/∂x2(x1, x2),
0 = f2(x1 − h1, x2 − h2) ≈ f2(x1, x2) − h1 ∂f2/∂x1(x1, x2) − h2 ∂f2/∂x2(x1, x2).
We solve this 2 × 2 system of linear equations with unknowns (h1 , h2 ), in the expectation
that (x1 − h1 , x2 − h2 ) provides a better approximation to (r1 , r2 ). Introducing the Jacobian
matrix of f1 and f2 ,
J := [ ∂f1/∂x1  ∂f1/∂x2 ; ∂f2/∂x1  ∂f2/∂x2 ],
Newton’s method is merely the construction of a sequence (x1^{(n)}, x2^{(n)}) according to
[ x1^{(n+1)} ; x2^{(n+1)} ] = [ x1^{(n)} ; x2^{(n)} ] − J(x1^{(n)}, x2^{(n)})^{−1} [ f1(x1^{(n)}, x2^{(n)}) ; f2(x1^{(n)}, x2^{(n)}) ].
Clearly, one can deal with larger systems following this very model. It will be convenient
to use a matrix-vector formalism in this situation: express the system as F (X) = 0, with
F = [f1 , . . . , fn ]> and X = [x1 , . . . , xn ]> ; linearize in the form F (X + H) ≈ F (X) + F 0 (X)H,
where F 0 (X) represents an n × n Jacobian matrix. Of course, the n × n linear systems will
be solved by using the methods already studied, not by determining inverses.
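A minimal Python sketch of this matrix-vector form [not part of the original notes; the test system below, with f1 = x1² + x2² − 1 and f2 = x1 − x2, is a hypothetical example], assuming NumPy is available:

import numpy as np

def newton_system(F, J, X0, tol=1e-12, maxit=50):
    X = np.array(X0, dtype=float)
    for _ in range(maxit):
        H = np.linalg.solve(J(X), F(X))   # solve J(X) H = F(X), no inverse formed
        X = X - H
        if np.linalg.norm(H) < tol:
            break
    return X

F = lambda X: np.array([X[0]**2 + X[1]**2 - 1, X[0] - X[1]])
J = lambda X: np.array([[2*X[0], 2*X[1]], [1.0, -1.0]])
print(newton_system(F, J, [1.0, 0.5]))   # converges to (1/sqrt(2), 1/sqrt(2))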
13.4 Exercises
From the textbook: 7 p 61; 18, 19 p 62; 2, 6.b.d.f p 71; 25, 27 p 73.
70
1. Fill in the details of the proofs in Section 13.2.
2. Let Newton’s method be used on f (x) = x2 − y, y > 0. Show that if xn has k correct
digits after the decimal point, then xn+1 will have at least 2k − 1 correct digits after the
decimal point, provided that y > 0.0006.
3. What is the purpose of the iteration xn+1 = 2xn − xn²y? Identify it as the Newton
iteration of a certain function.
4. The polynomial x3 + 94x2 − 389x + 294 has zeros at 1, 3, and −98. The point x0 = 2 should
be a good starting point for computing one of the small zeros by Newton iteration. Is it
really the case?
Optional problems
71
Chapter 14
Secant method
14.1 Description of the method
One of the drawbacks of Newton’s method is that it necessitates the computation of the
derivative of a function. To overcome this difficulty, we can approximate f 0 (xn ) appearing
in the iteration formula by a quantity which is easier to compute. We can think of using
f'(xn) ≈ (f(xn) − f(xn−1))/(xn − xn−1).
The resulting algorithm is called the secant method. It is based on the iteration formula
xn+1 = xn − f(xn)(xn − xn−1)/(f(xn) − f(xn−1)),
which requires two initial points x0 and x1 to get started. However, only one new evaluation
of f is necessary at each step. The graphical interpretation for the secant method does not
differ much from the one for Newton’s method – simply replace ‘tangent line’ by ‘secant
line’.
72
The pseudocode for the secant method follows the one for Newton’s method with only minor
modifications.
input a, b, M , δ, ε
u ← f (a), output 0, a, u
if |u| < ε then stop end if
v ← f (b), output 1, b, v
if |v| < ε then stop end if
for k = 2 to M do
s ← (b − a)/(v − u), a ← b, u ← v,
b ← b − vs, v ← f (b)
output k, b, v
if |a − b| < δ or |v| < ε then stop end if
end do end for
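A short Python sketch mirroring this pseudocode [not part of the original notes; the cubic test equation is an illustrative assumption]:

def secant(f, a, b, maxit=50, delta=1e-12, eps=1e-12):
    u, v = f(a), f(b)
    for k in range(2, maxit + 1):
        s = (b - a) / (v - u)     # reciprocal of the secant slope
        a, u = b, v
        b = b - v * s             # new iterate
        v = f(b)
        if abs(a - b) < delta or abs(v) < eps:
            break
    return b

# root of x^3 - 2x - 5 near 2.0946
print(secant(lambda x: x**3 - 2*x - 5, 2.0, 3.0))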
The error analysis that we present, although somewhat lacking in rigor, provides the rate of
convergence for the secant method. The error at the (n + 1)-st step is estimated by
en+1 ≈ en en−1 · ( [f(r) + f'(r)en + f''(r)en²/2]/en − [f(r) + f'(r)en−1 + f''(r)e_{n−1}²/2]/en−1 ) / ( [f(r) + f'(r)en] − [f(r) + f'(r)en−1] ),
which, since f(r) = 0, simplifies to
en+1 ≈ C en en−1, where C := f''(r)/(2f'(r)).
We postulate [informed guess, see Chapter 24] the relation |en+1| ∼ A|en|^α. This can also
be written as |en| ∼ A^{−1/α}|en+1|^{1/α}, which we use with n replaced by n − 1 to derive
A|en|^α ∼ C |en| A^{−1/α}|en|^{1/α}, i.e. A^{1+1/α}C^{−1} ∼ |en|^{1+1/α−α}; since en → 0, the exponent must vanish.
73
We should therefore have α = 1 + 1/α, or α = (1 + √5)/2 ≈ 1.62. We should also have
A = C^{1/(1+1/α)} = C^{1/α} = C^{α−1} ≈ [f''(r)/(2f'(r))]^{0.62}. Finally, with A thus given, we have established
|en+1| ≈ A |en|^{(1+√5)/2}.
The rate of convergence of the secant method is seen to be superlinear, which is better
than the bisection method [linear rate], but not as good as that of Newton’s method [quadratic
rate]. However, each step of the secant method requires only one new function evaluation,
instead of two for Newton’s method. Hence, it is more appropriate to compare a pair of
steps of the former with one step of the latter. In this case, we get
|en+2| ∼ A|en+1|^α ∼ A^{1+α}|en|^{α²} = C^α |en|^{α+1} ≈ C^{1.62} |en|^{2.62},
which now appears better than the quadratic convergence of Newton’s method.
14.3 Exercises
1. If xn+1 = xn + (2 − e^{xn})(xn − xn−1)/(e^{xn} − e^{xn−1}) with x0 = 0 and x1 = 1, what is lim_{n→∞} xn?
74
Part IV
Approximation of functions
75
Chapter 15
Polynomial interpolation
Consider some points x0 < x1 < · · · < xn and some data values y0 , y1 , . . . , yn . We seek a
function f, called an interpolant of y0, y1, . . . , yn at x0, x1, . . . , xn, which satisfies
f(xi) = yi for all i ∈ J0, nK.
This means that the f-curve passes through the n + 1 points (x0, y0), (x1, y1), . . . , (xn, yn). Since
we prescribe n + 1 conditions, we should allow n + 1 degrees of freedom for the function f,
and it should also be a ‘simple’ function. Hence, we will seek f in the linear space Pn of
polynomials of degree at most n, and we consider the linear map T : p ∈ Pn 7→ [p(x0), . . . , p(xn)]^> ∈ R^{n+1}.
Our aim is to prove that T is bijective [check this claim]. Since dim Pn = dim R^{n+1} = n + 1,
it is enough to show either that T is surjective, or that T is injective. To prove the latter,
consider p ∈ ker T . The polynomial p, of degree at most n, possesses n + 1 zeros, hence it
must be the zero polynomial. This establishes the injectivity.
76
15.2 The Lagrange form of the polynomial interpolant
The previous argument is of no practical use, for it provides neither an explicit expression
for the interpolant nor an algorithm to compute it. Note, however, that for each i ∈ J0, nK,
there is a unique polynomial ℓi ∈ Pn, called i-th Lagrange cardinal polynomial relative
to x0, x1, . . . , xn, such that ℓi(xj) = 0 for j ≠ i and ℓi(xi) = 1. The interpolant can then be
written in the Lagrange form p = y0 ℓ0 + y1 ℓ1 + · · · + yn ℓn.
As an example [nodes −1, 0, 1], we may write 2x² − 1 = x(x − 1)/2 + (x − 1)(x + 1) + (x + 1)x/2.
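A small Python sketch [not part of the original notes] evaluating the Lagrange form directly; the data below are the values of 2x² − 1 at −1, 0, 1 as in the example above:

def lagrange_eval(xs, ys, x):
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        li = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                li *= (x - xj) / (xi - xj)   # cardinal polynomial l_i at x
        total += yi * li
    return total

xs, ys = [-1.0, 0.0, 1.0], [1.0, -1.0, 1.0]
print(lagrange_eval(xs, ys, 0.5), 2 * 0.5**2 - 1)   # both equal -0.5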
2 2
77
Proof. Formula (15.1) is obvious if x is one of the xi ’s. Suppose now that x ∈ [a, b] is distinct
from all the xi ’s, and consider it as a fixed parameter. Then the function Φ defined by
Φ(t) := [f(t) − p(t)] ∏_{i=0}^{n} (x − xi) − [f(x) − p(x)] ∏_{i=0}^{n} (t − xi)
vanishes [i.e. equals zero] at the n + 2 distinct points x, x0 , . . . , xn . Then, using Rolle’s
theorem, we derive that Φ0 vanishes at n+1 distinct points [at least]. Using Rolle’s theorem
once more, we see that Φ00 vanishes at n distinct points. Continuing in this fashion, we
deduce that Φ(n+1) vanishes at a point ξ ∈ [a, b]. Note that
0 = Φ^{(n+1)}(ξ) = [f^{(n+1)}(ξ) − p^{(n+1)}(ξ)] ∏_{i=0}^{n} (x − xi) − [f(x) − p(x)] (d^{n+1}/dt^{n+1}) ( ∏_{i=0}^{n} (t − xi) )|_{t=ξ}.
p(x) = a0 + a1x + a2x² + · · · + anx^n.
The interpolation conditions p(xi) = yi, i ∈ J0, nK, translate into the linear system
[ 1  x0  x0²  · · ·  x0^n ] [ a0 ]   [ y0 ]
[ 1  x1  x1²  · · ·  x1^n ] [ a1 ]   [ y1 ]
[ .   .    .   · · ·    .  ] [ ..  ] = [ ..  ]
[ 1  xn  xn²  · · ·  xn^n ] [ an ]   [ yn ],
whose unknowns are a0 , a1 , . . . , an . This system is always solvable, as we have seen, hence
the coefficient matrix – called a Vandermonde matrix – is nonsingular. Its determinant
is therefore nonzero [provided that the xi ’s are distinct] and can in fact be determined
explicitely. However, finding a polynomial interpolant using this approach is not recom-
mended.
78
15.4 Exercises
1. Prove that Σ_{i=0}^{n} ℓi(x) = 1 for all x.
2. Suppose that the function values f(0), f(1), f(2) and f(3) are given. We wish to estimate f(6), f'(0) and ∫_0^3 f(x)dx by employing the approximants p(6), p'(0) and ∫_0^3 p(x)dx,
where p is the cubic polynomial interpolating f at 0, 1, 2, and 3. Deduce from the La-
grange formula that each approximant is a linear combination of the four data with
coefficients independent of f . Calculate the numerical values of the coefficients. Verify
your work by showing that the approximants are exact when f is an arbitrary cubic
polynomial.
3. Let f be a function in C 4 [0, 1] and let p be a cubic polynomial satisfying p(0) = f (0),
p'(0) = f'(0), p(1) = f(1), and p'(1) = f'(1). Deduce from Rolle’s theorem that for
every x ∈ [0, 1], there exists ξ ∈ [0, 1] such that
f(x) − p(x) = (1/24) x²(x − 1)² f^{(4)}(ξ).
Optional problems
1. Let a, b and c be distinct real numbers [not necessarily in ascending order], and let
f (a), f (b), f 0 (a), f 0 (b) and f 0 (c) be given. Because there are five data, one might try
to approximate f by a polynomial of degree at most four that interpolates the data.
Prove by a general argument that this interpolation problem has a solution and the
solution is unique if and only if there is no nonzero polynomial p ∈ P4 that satisfies
p(a) = p(b) = p0 (a) = p0 (b) = p0 (c) = 0. Hence, given a and b, show that there exists a
unique value c 6= a, b such that there is no unique solution.
2. Try to find the explicit expression for the determinant of a Vandermonde matrix.
79
Chapter 16
Divided differences
16.1 A definition
However, this expression is rarely used in practice, since divided differences can be calcu-
lated in a more efficient way – see Section 16.2. We may already observe that
[x0]f = f(x0),
[x0, x1]f = (f(x1) − f(x0))/(x1 − x0).
Note that, when x1 is close to x0, the divided difference [x0, x1]f approximates f'(x0). In
fact, we show that higher derivatives can also be approximated by divided differences,
which have the advantage of being easily computed – see Section 16.2.
Theorem 16.1. Let f ∈ C n [a, b] and let x0 , x1 , . . . , xn be distinct points in [a, b]. There
exists a point ξ between the smallest and the largest of the xi ’s such that
[x0, . . . , xn]f = (1/n!) f^{(n)}(ξ).
80
f (n) − p(n) has a zero ξ in [mini xi , maxi xi ], thus f (n) (ξ) = p(n) (ξ). It remains to remark that
p(n) (ξ) = n! [x0 , . . . , xn ]f , since p(x) = [x0 , . . . , xn ]f xn + {degree < n}.
If we combine this theorem with the next one, we obtain the alternative proof of Theorem
15.2 previously mentioned.
Theorem 16.2. For a function f , let p ∈ Pn be the interpolant of f at the distinct points
x0 , x1 , . . . , xn . If x is not one of the xi ’s, then
f(x) − p(x) = [x0, . . . , xn, x]f ∏_{i=0}^{n} (x − xi).
Proof. Let q ∈ Pn+1 be the interpolant of f at the points x0 , x1 , . . . , xn , and x. Note that
the difference q − p is a polynomial of degree at most n + 1 which vanishes at x0 , x1 , . . . , xn .
It is therefore of the type
(16.1) q(t) − p(t) = c ∏_{i=0}^{n} (t − xi), for some constant c.
Identifying the coefficients of tn+1 , we obtain c = [x0 , . . . , xn , x]f . It now remains to write
(16.1) for t = x, keeping in mind that q(x) = f (x).
The recursive method we are about to describe allows for fast computation of divided differ-
ences [and also accounts for their names].
81
The evaluation of the divided difference table follows the diagram

f(x0) = [x0]f
                    [x0, x1]f
f(x1) = [x1]f                      [x0, x1, x2]f
                    [x1, x2]f                        .
   .                                     .              .
   .                                     .                 [x0, . . . , xn]f
                                                        .
f(xn−1) = [xn−1]f                  [xn−2, xn−1, xn]f
                    [xn−1, xn]f
f(xn) = [xn]f

where each entry is obtained from the two neighboring entries of the previous column.
82
Proof. For each k ∈ J0, nK, let pk ∈ Pk be the interpolant of f at x0, x1, . . . , xk. Remark, as
in (16.1), that pk+1(t) − pk(t) = ck (t − x0) · · · (t − xk) for some constant ck.
Looking at the coefficient of x^{k+1}, we see that ck = [x0, . . . , xk+1]f. We then get
p(x) = p0(x) + [p1(x) − p0(x)] + · · · + [pn−1(x) − pn−2(x)] + [pn(x) − pn−1(x)]
     = f(x0) + c0(x − x0) + · · · + cn−1(x − x0) · · · (x − xn−1)
     = [x0]f + [x0, x1]f · (x − x0) + · · · + [x0, . . . , xn]f · (x − x0) · · · (x − xn−1).
The evaluation of p(x) should be performed using nested multiplication, i.e. according to
the scheme implied by the writing
p(x) = [x0]f + (x − x0) ( [x0, x1]f + (x − x1) ( [x0, x1, x2]f + · · · + (x − xn−1) [x0, . . . , xn]f · · · ) ).
Observe that we only need the divided differences obtained in the top south–east diagonal
of the table. They can be computed via the following algorithm, designed to use little storage
space. Start with a vector d = (d0, d1, . . . , dn) containing the values f(x0), f(x1), . . . , f(xn).
Note that d0 already provides the first desired coefficient. Then compute the second column
of the table, putting the corresponding divided differences in positions of d1 , . . . , dn , so that
d1 provides the second desired coefficient. Continue in this pattern, being careful to store
the new elements in the bottom part of the vector d without disturbing its top part. Here
is the algorithm.
for k = 1 to n do
for i = n downto k do
di ← (di − di−1)/(xi − xi−k)
end do end for
end do end for
Once this is done, we can use the following version of Horner’s algorithm to compute the
interpolant at the point x.
u ← dn
for i = n − 1 downto 0 do
u ← (x − xi )u + di
end do end for
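A Python sketch of these two steps [not part of the original notes; the test data, the values of x³ at 0, 1, 2, 3, are an illustrative assumption]:

def newton_interpolant(xs, ys):
    n = len(xs) - 1
    d = list(ys)
    for k in range(1, n + 1):              # k-th column of the table
        for i in range(n, k - 1, -1):      # fill from the bottom up, in place
            d[i] = (d[i] - d[i - 1]) / (xs[i] - xs[i - k])
    def p(x):                              # nested multiplication (Horner-type)
        u = d[n]
        for i in range(n - 1, -1, -1):
            u = (x - xs[i]) * u + d[i]
        return u
    return d, p

coeffs, p = newton_interpolant([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 8.0, 27.0])
print(coeffs, p(1.5))    # p reproduces x^3, so p(1.5) = 3.375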
83
16.4 Exercises
2. Find the Newton form of the cubic polynomial interpolating the data 63, 11, 7, and 28 at
the points 4, 2, 0, and 3.
Optional problem
84
Chapter 17
Orthogonal polynomials
17.1 Inner product
w integrable on [−1, 1], continuous on (−1, 1), w(x) > 0 for all x ∈ [−1, 1] \ Z, Z a finite set.
The expression ⟨f, g⟩ := ∫_{−1}^{1} f(x) g(x) w(x) dx, f, g ∈ C[−1, 1],
defines an inner product on C[−1, 1], which means that it shares the key properties of
the usual inner product on R^n, namely
1. symmetry: for f, g ∈ C[−1, 1], there holds ⟨f, g⟩ = ⟨g, f⟩,
2. linearity: for f1 , f2 , g ∈ C[−1, 1], a, b ∈ R, there holds haf1 + bf2 , gi = ahf1 , gi + bhf2 , gi,
3. positivity: for f ∈ C[−1, 1], there holds hf, f i ≥ 0, with equality if and only if f = 0.
Let us justify the last statement, i.e. let us show that f = 0 as soon as f ∈ C[−1, 1] satisfies
hf, f i = 0. The continuous function g := f 2 w is nonnegative and has an integral equal to
zero, hence must be identically zero. Therefore f (x) = 0 for all x ∈ [−1, 1] \ Z. But then,
since Z is a finite set, the continuity of f forces f to vanish on Z as well.
Note that the expression ‖f‖ := √⟨f, f⟩ defines a norm on C[−1, 1].
85
17.2 Orthogonal polynomials for a general weight
By analogy with more familiar inner products, we will say that two functions f and g are
orthogonal if hf, gi = 0. We are interested in a sequence (pn ) of orthogonal polynomials of
degree n, in the sense that we require
deg pn = n, ⟨pn, pm⟩ = 0 for n ≠ m.
Note that pn should be orthogonal to all pm with m < n, hence to every p ∈ Pn−1 . We call pn
‘the’ n-th orthogonal polynomial. It is only defined up to a multiplicative constant. To
define it uniquely, we need a normalization condition, e.g. we may require pn to be monic,
i.e. to have a leading coefficient equal to one.
Theorem 17.1. For every weight function w on [−1, 1], there exists a unique n-th monic
orthogonal polynomial.
Proof. Let us start by the uniqueness part. Consider two monic polynomials p and q which
are orthogonal to the space Pn−1 . Then the difference p − q is also orthogonal to Pn−1 . It
also belongs to Pn−1 , which implies that p − q = 0, or p = q.
The existence part is proved by induction on n. For n = 0, we take p0 = 1. Then, for n ≥ 1,
suppose that p0 , p1 , . . . , pn−1 have been constructed, and let us construct pn . Inspired by the
Gram–Schmidt algorithm, we define
pn = q − Σ_{k=0}^{n−1} (⟨q, pk⟩/⟨pk, pk⟩) pk, where q(x) := x^n.
It is readily checked that pn ∈ Pn is monic and that hpn , pm i = 0 for all m < n. Hence pn is
indeed the n-th monic orthogonal polynomial. This concludes the induction.
A first example is given by the Chebyshev polynomials [of the first kind], defined by
Tn(cos θ) = cos(nθ), θ ∈ [0, π], or equivalently Tn(x) = cos(n arccos(x)), x ∈ [−1, 1].
They are orthogonal [but not monic] on [−1, 1] with respect to the weight w(x) = 1/√(1 − x²),
as seen from
∫_{−1}^{1} Tn(x)Tm(x) dx/√(1 − x²) =_{x=cos θ} ∫_{0}^{π} Tn(cos θ)Tm(cos θ) (sin θ dθ)/√(1 − cos²θ) = ∫_{0}^{π} cos(nθ)cos(mθ)dθ
= (1/2) ∫_{0}^{π} [cos((n + m)θ) + cos((n − m)θ)] dθ = 0 if n ≠ m.
86
If one is averse to trigonometric formulae, one can also integrate I := ∫_{0}^{π} cos(nθ)cos(mθ)dθ
by parts twice to obtain I = m²I/n², thus I = 0 when n ≠ m. Note that Tn has n zeros in
(−1, 1) and n + 1 equioscillation points in [−1, 1], precisely
Tn(cos((2n + 1 − 2k)π/(2n))) = 0, k ∈ J1, nK,    Tn(cos((n − k)π/n)) = (−1)^{n−k}, k ∈ J0, nK.
Another example is given by the Legendre polynomials, defined by the formula
Pn(x) = (1/(2^n n!)) (d^n/dx^n) (x² − 1)^n.
One can check that this formula yields a polynomial of degree n, which is not monic but
normalized so that Pn(1) = 1. A direct calculation would give e.g. P2(x) = (3/2)x² − 1/2. It can
also be seen [Rolle’s theorem being involved] that Pn has n zeros in (−1, 1). This property
is shared by all n-th orthogonal polynomials, regardless of the weight w.
Theorem 17.2. Any n-th orthogonal polynomial has n distinct zeros inside (−1, 1).
Proof. Let p be a polynomial of degree n which is orthogonal to the space Pn−1 . Denote by
x1 , . . . , xm the zeros of p where p changes sign – hence the endpoints −1 and 1 are not taken
into account – and define q(x) = (x − x1 ) · · · (x − xm ). Then the function p(x)q(x)w(x) does
not change sign on [−1, 1], so its integral cannot vanish. Thus q cannot be a polynomial of
degree ≤ n − 1, i.e. m = deg q ≥ n. We have shown that p has at least m ≥ n zeros in (−1, 1),
hence it has exactly n zeros.
For orthogonal polynomials to have any practical use, we need to be able to compute them.
The Gram–Schmidt process provides a way, of course, but we know that it is prone to loss
of accuracy. It turns out that a considerably better procedure is available.
87
Proof. The proper proof would be done by induction. Here we simply determine what
the recurrence relation must be like. Since the polynomial xpn(x) belongs to Pn+1, it has
[should have] an expansion of the form
(17.1) x pn(x) = an+1 pn+1(x) + an pn(x) + an−1 pn−1(x) + · · · + a0 p0(x).
Note that an+1 must equal one for the leading coefficients on both sides to be the same. If
we take the inner product with pk for k < n − 1, we obtain ak⟨pk, pk⟩ = ⟨xpn, pk⟩ = ⟨pn, xpk⟩ = 0,
so that ak = 0 for k < n − 1. Taking the inner product with pn, we obtain
an⟨pn, pn⟩ = ⟨xpn, pn⟩, or an = ⟨xpn, pn⟩/⟨pn, pn⟩,
and, similarly, an−1 = ⟨xpn, pn−1⟩/⟨pn−1, pn−1⟩ = ⟨pn, pn⟩/⟨pn−1, pn−1⟩.
This very last step follows from the expansion (17.1), when n is replaced by n − 1, by
observing that ⟨xpn, pn−1⟩ = ⟨pn, xpn−1⟩ = ⟨pn, pn⟩. The announced recurrence relation is
now simply a rewriting of the expansion of xpn.
Tn+1 (cos θ) + Tn−1 (cos θ) = cos((n + 1)θ) + cos((n − 1)θ) = 2 cos(θ) cos(nθ).
88
Proof. Assume on the contrary that there is a monic polynomial p ∈ Pn such that
max_{x∈[−1,1]} |p(x)| < max_{x∈[−1,1]} |T̃n(x)| = 1/2^{n−1}.
In particular, at the points xk := cos((n − k)π/n), k ∈ J0, nK, one has
(−1)^{n−k} p(xk) ≤ |p(xk)| < 1/2^{n−1} = (−1)^{n−k} Tn(xk)/2^{n−1} = (−1)^{n−k} T̃n(xk).
Thus, the polynomial p − T̃n changes sign between consecutive points among x0, x1, . . . , xn. By
the intermediate value theorem, we deduce that p − T̃n has n zeros. Since p − T̃n is of degree
at most n − 1 [both p and T̃n are monic of degree n], this is of course impossible.
Proof. Observe that, with pf thus defined, we have ⟨pf, pi⟩ = ⟨f, pi⟩ for all i ∈ J0, nK, hence
f − pf ⊥ Pn. Let now p be an arbitrary polynomial in Pn. We have
‖f − p‖² = ‖(f − pf) + (pf − p)‖² = ‖f − pf‖² + ‖pf − p‖² ≥ ‖f − pf‖², since f − pf ⊥ pf − p ∈ Pn.
Note that the coefficient ⟨f, pk⟩/⟨pk, pk⟩ is computed independently of n. In practice, we continue
to add terms in the expansion (17.2) until ‖f − p‖² is below a specified tolerance ε.
Remark. The arguments work so well here because the norm is euclidean, i.e. it is derived
from an inner product. The situation would be quite different with another norm.
89
17.5 Exercises
2. Determine the three-term recurrence relation for the monic polynomials orthogonal on
[−1, 1] with respect to the weight w(x) = 1/√(1 − x²).
Optional problems
Un(cos θ) = sin((n + 1)θ)/sin θ, θ ∈ [0, π].
90
Chapter 18
18.1 Approximation
91
Proof. It is clear that ⟨1, Ck⟩ = ⟨1, Sk⟩ = 0 for k > 0. Furthermore, for indices k, h ∈ J1, nK, we
observe that ⟨Ck, Sh⟩ = 0, since the integrand CkSh is an odd function. Besides, integrations
by parts yield
∫_{−π}^{π} cos(kx)cos(hx)dx = (h/k) ∫_{−π}^{π} sin(kx)sin(hx)dx = (h²/k²) ∫_{−π}^{π} cos(kx)cos(hx)dx,
hence, for k ≠ h, there holds ∫_{−π}^{π} cos(kx)cos(hx)dx = 0 and ∫_{−π}^{π} sin(kx)sin(hx)dx = 0, thus
⟨Ck, Ch⟩ = 0 and ⟨Sk, Sh⟩ = 0. Finally, with k = h, we have
∫_{−π}^{π} cos²(kx)dx = ∫_{−π}^{π} sin²(kx)dx,
and we get twice the value of the integral by summing its two expressions and using the
identity cos² + sin² = 1. It follows that ⟨Ck, Ck⟩ = ⟨Sk, Sk⟩ = 1/2.
As usual, we may now express the best approximation to a function f from the space Tn as
(18.1) Sn(f) = (⟨f, 1⟩/⟨1, 1⟩) 1 + Σ_{k=1}^{n} [ (⟨f, Ck⟩/⟨Ck, Ck⟩) Ck + (⟨f, Sk⟩/⟨Sk, Sk⟩) Sk ].
In terms of the coefficients
ak := (1/π) ∫_{−π}^{π} f(t)cos(kt)dt,    bk := (1/π) ∫_{−π}^{π} f(t)sin(kt)dt,
one has
Sn(f)(x) = a0/2 + Σ_{k=1}^{n} [ ak cos(kx) + bk sin(kx) ].
This is the partial sum of the Fourier series of f . One of the basic theorems from Fourier
analysis states that, for a 2π-periodic continuously differentiable function f , its Fourier
series
a0/2 + Σ_{k=1}^{∞} [ ak cos(kx) + bk sin(kx) ]
converges uniformly to f on [−π, π].
In the case of a 2π-periodic function made of continuously differentiable pieces, the convergence is weaker, in the sense that Sn(f)(x) → (f(x−) + f(x+))/2 for all x.
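A minimal Python sketch [not part of the original notes], assuming NumPy is available, approximating ak and bk by a quadrature rule on a uniform grid [very accurate for smooth 2π-periodic integrands] and forming the partial sum Sn(f):

import numpy as np

def fourier_partial_sum(f, n, m=2048):
    t = -np.pi + 2 * np.pi * np.arange(m) / m
    ft = f(t)
    a = [2 * np.mean(ft * np.cos(k * t)) for k in range(n + 1)]  # ~ (1/pi) integral
    b = [2 * np.mean(ft * np.sin(k * t)) for k in range(n + 1)]
    def S(x):
        return a[0] / 2 + sum(a[k] * np.cos(k * x) + b[k] * np.sin(k * x)
                              for k in range(1, n + 1))
    return S

S5 = fourier_partial_sum(np.abs, 5)            # hypothetical test: f(t) = |t|
print(S5(np.pi / 2), np.pi / 2)                # partial sum vs. exact value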
92
18.2 Interpolation
It would be possible [and somewhat more elegant] to work with the latter form and express
the theory only in terms of complex exponentials – we will not do it, however. To justify the
second representation, we may write
p(x) = a0/2 + Σ_{k=1}^{n} [ ak (e^{ikx} + e^{−ikx})/2 + bk (e^{ikx} − e^{−ikx})/(2i) ]
     = a0/2 + (1/2) Σ_{k=1}^{n} (ak − ibk) e^{ikx} + (1/2) Σ_{k=1}^{n} (ak + ibk) e^{−ikx} = Σ_{k=−n}^{n} ck e^{ikx}.
The following lemma can now be established using the previous remark.
Lemma 18.2. Any nonzero trigonometric polynomial in Tn has at most 2n zeros in [−π, π).
Proof. Suppose that 2n + 1 points x0 < x1 < · · · < x2n in [−π, π) are zeros of a trigonometric
polynomial p(x) = Σ_{k=−n}^{n} ck e^{ikx} = e^{−inx} Σ_{k=0}^{2n} c_{k−n} e^{ikx}. It follows that the 2n + 1 complex
numbers e^{ix0}, . . . , e^{ix2n} are distinct zeros of the [algebraic] polynomial q(z) = Σ_{k=0}^{2n} c_{k−n} z^k.
Since the latter is of degree at most 2n, this implies that q = 0, and in turn that p = 0.
Theorem 18.3. Fix x0 < x1 < · · · < x2n in [−π, π). For any values y0 , y1 , . . . , y2n , there ex-
ists a unique trigonometric polynomial of degree at most n that interpolates y0 , y1 , . . . , y2n
at the points x0 , x1 , . . . , x2n – in short,
93
In general, simple representations for the trigonometric interpolant, matching Lagrange
or Newton forms, are not available. However, if the interpolation points are assumed to be
equidistant, e.g.
τk := 2kπ/(2n + 1), k ∈ J−n, nK,
then the trigonometric interpolant admits a nice expression. It involves the pseudo-inner
product
⟨f, g⟩_{2n+1} := (1/(2n + 1)) Σ_{k=−n}^{n} f(τk) g(τk), f, g ∈ C2π,
or more generally, with N ≥ 2n + 1 and σk := 2kπ/N, k ∈ J1, NK,
⟨f, g⟩_N := (1/N) Σ_{k=1}^{N} f(σk) g(σk), f, g ∈ C2π.
These are termed pseudo-inner products because the positivity condition is not fulfilled.
But they would be genuine inner products if we restricted them to the space Tn [use Lemma
18.2]. As a matter of fact, they would then agree with the usual inner product on this space.
for any trigonometric polynomial P ∈ T2n , since pq ∈ T2n for any p, q ∈ Tn [check this]. In
fact, it is enough to verify this for P = 1, C1 , S1 , . . . , C2n , S2n . This is immediate for P = 1.
As for P = Ch, h ∈ J1, 2nK, one has
(1/(2π)) ∫_{−π}^{π} cos(hx)dx = 0,
while
(1/N) Σ_{k=1}^{N} cos(hσk) = ℜ[ (1/N) Σ_{k=1}^{N} e^{ihσk} ] = ℜ[ (1/N) Σ_{k=1}^{N} e^{ihk2π/N} ]
= ℜ[ (1/N) e^{ih2π/N} Σ_{k=0}^{N−1} (e^{ih2π/N})^k ]
= ℜ[ (1/N) e^{ih2π/N} (1 − (e^{ih2π/N})^N)/(1 − e^{ih2π/N}) ] = 0,
hence the expected result. The same holds for P = Sh, h ∈ J1, 2nK [replace ℜ by ℑ].
94
This aside tells us that the system [1, C1 , S1 , . . . , Cn , Sn ] is also orthogonal with respect to
h·, ·iN . Let us now give the expression for the trigonometric interpolant. One should note
the analogy with (18.1).
Theorem 18.5. The trigonometric polynomial p ∈ Tn that interpolates the function f ∈ C2π
at the equidistant points τ−n , . . . , τn is given by
(18.2) p = (⟨f, 1⟩_{2n+1}/⟨1, 1⟩_{2n+1}) 1 + Σ_{k=1}^{n} [ (⟨f, Ck⟩_{2n+1}/⟨Ck, Ck⟩_{2n+1}) Ck + (⟨f, Sk⟩_{2n+1}/⟨Sk, Sk⟩_{2n+1}) Sk ].
Proof. Let us write p in the form α0 + Σ_{k=1}^{n} (αk Ck + βk Sk). By the ⟨·, ·⟩_{2n+1}-orthogonality of
the system [1, C1, S1, . . . , Cn, Sn], each coefficient is a quotient of discrete inner products against p; since p agrees with f at the points τ−n, . . . , τn, these quotients are unchanged when p is replaced by f, which yields (18.2).
p = (⟨f, 1⟩N/⟨1, 1⟩N) 1 + Σ_{k=1}^{n} [ (⟨f, Ck⟩N/⟨Ck, Ck⟩N) Ck + (⟨f, Sk⟩N/⟨Sk, Sk⟩N) Sk ].
so we are looking for the best approximation to P from Tn ⊆ TbN/2c relatively to a genuine
inner product on TbN/2c . Thus we must have
p = (⟨P, 1⟩N/⟨1, 1⟩N) 1 + Σ_{k=1}^{n} [ (⟨P, Ck⟩N/⟨Ck, Ck⟩N) Ck + (⟨P, Sk⟩N/⟨Sk, Sk⟩N) Sk ]
  = (⟨f, 1⟩N/⟨1, 1⟩N) 1 + Σ_{k=1}^{n} [ (⟨f, Ck⟩N/⟨Ck, Ck⟩N) Ck + (⟨f, Sk⟩N/⟨Sk, Sk⟩N) Sk ].
95
18.3 Fast Fourier Transform
These formulae have [almost] the same form as (18.3), N being replaced by N/2, and
they allow for the determination of all the ch , h ∈ J1, N K, via the determination of all
the ch + ch+N/2 and all the ch − ch+N/2 , h ∈ J1, N/2K. For the first ones, we perform γN/2
multiplications, and we perform γN/2 + N/2 multiplications for the second ones [one extra
multiplication for each coefficient]. Thus, we have
96
Of course, the process will now be repeated. With the assumption that γ_{2^{p−1}} ≤ (p − 1)2^{p−1},
we get
γ_{2^p} ≤ (p − 1)2^p + 2^{p−1} ≤ (p − 1)2^p + 2^p = p·2^p.
We would deduce, with a proper induction, that γ_{2^p} ≤ p·2^p for any p. This is the announced
result.
18.4 Exercises
2. Determine the Fourier coefficients of the function f ∈ C2π defined on [−π, π] by f (t) = t2 .
Write down the Fourier expansion of f and specify it for the points 0 and π to evaluate
some infinite sums.
97
Part V
98
Chapter 19
Estimating derivatives:
Richardson extrapolation
To illustrate this point, consider the basic approximating formula for a derivative
f'(x) ≈ (f(x + h) − f(x))/h.
To assess the error, one uses Taylor expansion
f(x + h) = f(x) + h f'(x) + (h²/2) f''(ξ), ξ between x and x + h,
which one rearranges as
(19.1) f'(x) = (f(x + h) − f(x))/h − (h/2) f''(ξ), ξ between x and x + h.
99
This is already more useful, since an error term comes with the numerical formula. Note
the two parts in the error term: a factor involving some high-order derivative of f , forcing
the function to belong to a certain smoothness class for the estimate to be valid, and a
factor involving a power of h, indicating the speed of convergence as h approaches zero –
the higher the power, the faster the convergence. The previous formula fares poorly, since
the error behaves like O(h). One can obtain an improved formula using
f(x + h) = f(x) + h f'(x) + (h²/2) f''(x) + (h³/6) f'''(ξ1),
f(x − h) = f(x) − h f'(x) + (h²/2) f''(x) − (h³/6) f'''(ξ2),
with ξ1 , ξ2 between x and x + h, and between x and x − h, respectively. Subtracting and
rearranging, one obtains
f'(x) = (f(x + h) − f(x − h))/(2h) − (h²/12)[f'''(ξ1) + f'''(ξ2)].
The error term is now a better O(h2 ), provided that f is thrice, rather than merely twice,
differentiable. Note that, if f ∈ C 3 , the error term can be rewritten [check it] as
(19.2) f'(x) = (f(x + h) − f(x − h))/(2h) − (h²/6) f'''(ξ), ξ between x − h and x + h.
It should be pointed out that in both (19.1) and (19.2), there is a pronounced deterioration
in accuracy as h approaches zero. This is due to subtractive cancellation: the values of
f (x − h), f (x), and f (x + h) are very close to each other – identical in the machine – and
the computation of the difference yields severe loss of significant digits. Besides, one also
has to be aware that numerical differentiation from empirical data is highly unstable and
should be undertaken with great caution [or avoided]. Indeed, if the sampling points x ± h
are accurately determined while the ordinates f (x ± h) are inaccurate, then the errors in
the ordinates are magnified by the large factor 1/(2h).
f'(x) = (f(x + h) − f(x − h))/(2h) − (h²/3!) f^{(3)}(x) − (h⁴/5!) f^{(5)}(x) − (h⁶/7!) f^{(7)}(x) − · · · .
100
This equation takes the form
(19.3) L = φ^{[1]}(h) + a2^{[1]} h² + a4^{[1]} h⁴ + a6^{[1]} h⁶ + · · · .
Here L and φ[1] (h) stand for f 0 (x) and [f (x + h) − f (x − h)]/(2h), but could very well stand
for other quantities, so that the procedure is applicable to many numerical processes. As of
now, the error term a2^{[1]}h² + a4^{[1]}h⁴ + · · · is of order O(h²). Our immediate aim is to produce
a new formula whose error term is of order O(h4 ). The trick is to consider (19.3) for h/2
instead of h, which yields
4L = 4φ^{[1]}(h/2) + a2^{[1]}h² + a4^{[1]}h⁴/4 + a6^{[1]}h⁶/16 + · · · .
Subtracting (19.3) and dividing by 3, we obtain
L = φ^{[2]}(h) + a4^{[2]}h⁴ + a6^{[2]}h⁶ + · · · , with φ^{[2]}(h) := (4/3)φ^{[1]}(h/2) − (1/3)φ^{[1]}(h).
The numerical formula has already been improved by two orders of accuracy. But there is
no reason to stop here: the same step can be carried out once more to get rid of the term in
h4 , then once again to get rid of the term in h6 , and so on. We would get
L = φ^{[2]}(h) + a4^{[2]}h⁴ + a6^{[2]}h⁶ + · · ·
16L = 16φ^{[2]}(h/2) + a4^{[2]}h⁴ + a6^{[2]}h⁶/4 + · · · ,
and consequently
L = φ^{[3]}(h) + a6^{[3]}h⁶ + · · · , with φ^{[3]}(h) := (16/15)φ^{[2]}(h/2) − (1/15)φ^{[2]}(h).
On the same model, we would get
L = φ^{[k]}(h) + a_{2k}^{[k]}h^{2k} + · · ·
4^k L = 4^k φ^{[k]}(h/2) + a_{2k}^{[k]}h^{2k} + · · · ,
and consequently
L = φ^{[k+1]}(h) + a_{2k+2}^{[k+1]}h^{2k+2} + · · · , with φ^{[k+1]}(h) := (4^k/(4^k − 1))φ^{[k]}(h/2) − (1/(4^k − 1))φ^{[k]}(h).
Let us now give the algorithm allowing for M steps of Richardson extrapolation. Note
that we prefer to avoid the recursive computation of the functions φ[k] , since we only need
φ[M +1] (h) for some particular h, say h = 1. Evaluating φ[M +1] (h) involves φ[M ] (h/2) and
φ[M ] (h), which in turn involve φ[M −1] (h/4), φ[M −1] (h/2), and φ[M −1] (h). Hence, for any k,
101
the values φ[k] (h/2M +1−k ), . . . , φ[k] (h) are required, and the values φ[1] (h/2M ), . . . , φ[1] (h) are
eventually required. Thus, we start by setting
D(n, 1) := φ^{[1]}(h/2^{n−1}) for n ∈ J1, M + 1K, and then compute
D(n, k + 1) = (4^k/(4^k − 1)) D(n, k) − (1/(4^k − 1)) D(n − 1, k), for k ∈ J1, MK and n ∈ Jk + 1, M + 1K.
This results in the construction of the triangular array
D(1, 1)
D(2, 1) D(2, 2)
D(3, 1) D(3, 2) D(3, 3)
.. .. .. ..
. . . .
D(M + 1, 1) D(M + 1, 2) D(M + 1, 3) · · · D(M + 1, M + 1)
input x, h, M
for n = 1 to M + 1 do D(n, 1) ← Φ(x, h/2n−1 ) end do end for
for k = 1 to M do
for n = k + 1 to M + 1 do
D(n, k + 1) ← D(n, k) + [D(n, k) − D(n − 1, k)]/(4^k − 1)
end do end for
end do end for
output D
The function Φ should have been made available separately here. Note that this algorithm
too will end up producing meaningless results due to subtractive cancellation.
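A Python sketch of this algorithm [not part of the original notes], taking Φ to be the symmetric difference quotient used above:

import math

def richardson(f, x, h=1.0, M=5):
    phi = lambda s: (f(x + s) - f(x - s)) / (2 * s)
    # D is 0-indexed here; D[n][k] plays the role of D(n+1, k+1) in the notes
    D = [[0.0] * (M + 1) for _ in range(M + 1)]
    for n in range(M + 1):
        D[n][0] = phi(h / 2**n)
    for k in range(1, M + 1):
        for n in range(k, M + 1):
            D[n][k] = D[n][k - 1] + (D[n][k - 1] - D[n - 1][k - 1]) / (4**k - 1)
    return D[M][M]

print(richardson(math.log, 3.0))    # estimate of 1/3
print(richardson(math.tan, math.asin(0.8)))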
19.3 Exercises
102
2. Derive the following two formulae, together with their error terms, for approximating
the third derivative. Which one is more accurate?
f'''(x) ≈ (1/h³) [ f(x + 3h) − 3f(x + 2h) + 3f(x + h) − f(x) ],
f'''(x) ≈ (1/(2h³)) [ f(x + 2h) − 2f(x + h) + 2f(x − h) − f(x − 2h) ].
3. Program the Richardson extrapolation algorithm given in the notes to estimate f 0 (x).
Test your program on
• ln x at x = 3,
• tan x at x = sin−1 (0.8),
• sin(x2 + x/3) at x = 0.
Optional problems
103
Chapter 20
Suppose that we want to integrate a function f over an interval [a, b]. We select points
x0 , x1 , . . . , xn on [a, b] and interpolate f at these points by a polynomial p ∈ Pn . The La-
grange form of p is
p(x) = Σ_{i=0}^{n} f(xi) ℓi(x),    ℓi(x) := ∏_{j=0, j≠i}^{n} (x − xj)/(xi − xj).
Integrating p instead of f, we obtain the approximation
∫_a^b f(x)dx ≈ Σ_{i=0}^{n} Ai f(xi),    Ai := ∫_a^b ℓi(x)dx.
Note that the coefficients A0 , . . . , An are independent of f and that the formula is exact
for all polynomials of degree at most n. If the nodes x0 , . . . , xn are equidistant, the latter
formula is called a Newton–Cotes formula. More generally, the approximation of the integral ∫_a^b f(x)dx by a sum of the type Σ_{i=0}^{n} Ai f(xi) is called a numerical quadrature. If
we are given such a quadrature formula which is exact on Pn, we can always retrieve the
coefficients A0, . . . , An, using the Lagrange cardinal polynomials, by writing
∫_a^b ℓj(x)dx = Σ_{i=0}^{n} Ai ℓj(xi) = Aj.
104
We could also retrieve the coefficients by solving the linear system [check its nonsingular-
ity]
(b^{k+1} − a^{k+1})/(k + 1) = ∫_a^b x^k dx = Σ_{i=0}^{n} Ai xi^k,    k ∈ J0, nK.
As mentioned in the previous chapter, a numerical estimate for an integral should come
with a bound on the error. For this purpose, recall that the error in polynomial interpola-
tion is
f(x) − p(x) = [x0, . . . , xn, x]f ∏_{i=0}^{n} (x − xi).
Integrating, we get
∫_a^b f(x)dx − Σ_{i=0}^{n} Ai f(xi) = ∫_a^b [x0, . . . , xn, x]f ∏_{i=0}^{n} (x − xi) dx.
Since |[x0, . . . , xn, x]f| ≤ M/(n + 1)!, where M := max_{ξ∈[a,b]} |f^{(n+1)}(ξ)|, we obtain the bound
| ∫_a^b f(x)dx − Σ_{i=0}^{n} Ai f(xi) | ≤ (M/(n + 1)!) ∫_a^b ∏_{i=0}^{n} |x − xi| dx.
One could minimize the right-hand side over the nodes x0, . . . , xn. In the case [a, b] = [−1, 1],
this would lead to the zeros of Un+1 [Chebyshev polynomial of the second kind], that is to
the choice xk = cos((k + 1)π/(n + 2)), k ∈ J0, nK. We called upon the fact that 2^{−n}Un
minimizes the quantity ∫_{−1}^{1} |p(x)|dx over all monic polynomials p ∈ Pn.
Take n = 1, x0 = a, and x1 = b to obtain [with the help of a picture] A0 = A1 = (b − a)/2.
Hence the corresponding quadrature formula, exact for p ∈ P1, reads
∫_a^b f(x)dx ≈ ((b − a)/2) [f(a) + f(b)].
The error term of this trapezoidal rule takes the form [check it]
∫_a^b f(x)dx − ((b − a)/2)[f(a) + f(b)] = −(1/12)(b − a)³ f''(ξ), ξ ∈ [a, b].
Let us now partition the interval [a, b] into n subintervals whose endpoints are located at
a = x0 < x1 < · · · < xn = b. A composite rule is produced by applying an approximating
105
formula for the integration over each subinterval. In this case, we obtain the composite
trapezoidal rule
∫_a^b f(x)dx = Σ_{i=1}^{n} ∫_{x_{i−1}}^{x_i} f(x)dx ≈ Σ_{i=1}^{n} ((x_i − x_{i−1})/2) [f(x_{i−1}) + f(x_i)].
In other words, the integral of f is replaced by the integral of the broken line that inter-
polates f at x0, x1, . . . , xn. In particular, if [a, b] is partitioned into n equal subintervals, i.e.
if xi = a + ih, where h := (b − a)/n is the length of each subinterval, we have
∫_a^b f(x)dx ≈ (h/2) Σ_{i=1}^{n} [f(a + (i − 1)h) + f(a + ih)] = (h/2) [ f(a) + 2 Σ_{i=1}^{n−1} f(a + ih) + f(b) ].
The latter expression is preferred for computations, since it avoids the unnecessary repetition of function evaluations. The error term takes the form −(1/12)(b − a)h² f''(ξ) for some
ξ ∈ [a, b], and the verification of this statement is left as an exercise.
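A direct Python transcription of this composite rule [not part of the original notes; the integrand is an illustrative assumption]:

import math

def trapezoid(f, a, b, n):
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

# the error behaves like O(h^2): doubling n divides it by about 4
exact = 2.0                               # integral of sin on [0, pi]
for n in (8, 16, 32):
    print(n, abs(trapezoid(math.sin, 0.0, math.pi, n) - exact))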
We now take n = 2, x0 = a, x1 = (a + b)/2, and x2 = b. This leads to the formula
(20.1) ∫_a^b f(x)dx ≈ ((b − a)/6) [f(a) + 4f((a + b)/2) + f(b)],
To derive it, we denote by 2h and c the length and the midpoint of [a, b]. Thus, the left-hand
side of (20.1) becomes
106
while the right-hand side becomes
(h/3) [f(c − h) + 4f(c) + f(c + h)] = (h/3) [ 6f(c) + h² f''(c) + (h⁴/12) f^{(4)}(c) + · · · ]
= 2h f(c) + (h³/3) f''(c) + (h⁵/36) f^{(4)}(c) + · · · .
The difference reduces to
(h⁵/60) f^{(4)}(c) − (h⁵/36) f^{(4)}(c) + · · · = −(h⁵/90) f^{(4)}(c) + · · · .
The announced error formula would be obtained by expressing the Taylor remainders in a
more careful way.
20.4 Exercises
1. Establish the form given in the notes for the error in the composite trapezoidal rule.
2. Prove that Simpson’s rule is exact on P3 , without using the error formula.
3. Is there a formula of the form ∫_0^1 f(x)dx ≈ α[f(x0) + f(x1)] that correctly integrates all
quadratic polynomials?
107
is exact for all polynomials in Pn , and if the nodes are symmetrically placed about the
origin, then the formula is exact for all polynomials in Pn+1 .
5. Write a function SimpsonUniform[f,a,b,n] to calculate ∫_a^b f(x)dx using the composite
Simpson’s rule with 2n equal subintervals. Use it to approximate π from an integral of
the type ∫_a^b c/(1 + x²) dx.
Optional problems
1. Prove that 2^{−n}Un is the minimizer of ∫_{−1}^{1} |p(x)|dx over all monic polynomials p ∈ Pn
[Hint: prove first the orthogonality relations ∫_{−1}^{1} Um(x) sgn[Un(x)]dx = 0, m ∈ J0, n − 1K,
next integrate p·sgn[Un] for all monic polynomials p = 2^{−n}Un + an−1Un−1 + · · · + a0U0].
108
Chapter 21
Romberg integration
The Romberg algorithm combines the composite trapezoidal rule and the Richardson
Z b
extrapolation process in order to approximate the integral I := f (x)dx.
a
Recall that the composite trapezoidal rule on [a, b], using n subintervals of length h = (b − a)/n,
provides an approximate value for I given by
I(n) = h Σ′_{i=0}^{n} f(a + ih),
i=0
where the prime on the summation symbol means that the first and last terms are to be
halved. Note that, when computing I(2n), we will need values of the form f(a + 2i(b − a)/(2n)) =
f(a + i(b − a)/n). Since these are precisely the summands of I(n), we should use the work
done in the computation of I(n) to compute I(2n). Precisely, remark that
I(2n) = I(n)/2 + ((b − a)/(2n)) Σ_{i=1}^{n} f(a + (2i − 1)(b − a)/(2n)).
i=1
This allows to compute the sequence I(1), I(2), . . . , I(2k ), . . . in a recursive way, without
duplicate function evaluations. Let us now accept the validity of an error formula of the
type
I = I(2^n) + c2h² + c4h⁴ + · · · , where h = (b − a)/2^n.
2
109
21.2 Romberg algorithm
Since the previous formula remains valid when h is halved, we may apply Richardson
extrapolation to obtain more and more accurate error formulae [to use exactly the same
formalism as in Chapter 19, we could express everything only in terms of h]. This would
lead us to define, for n ∈ J1, M + 1K,
R(n, 1) = I(2^{n−1}), and, for k ∈ J1, n − 1K, R(n, k + 1) = R(n, k) + [R(n, k) − R(n − 1, k)]/(4^k − 1),
which produces the triangular array
R(1, 1)
R(2, 1) R(2, 2)
R(3, 1) R(3, 2) R(3, 3)
.. .. .. ..
. . . .
R(M + 1, 1) R(M + 1, 2) R(M + 1, 3) · · · R(M + 1, M + 1)
Note that the first column is computed according to the recursive process presented earlier.
Usually, we only need a moderate value of M , since 2M + 1 function evaluations will be
performed. The pseudocode is only marginally different from the one in Chapter 19, the
difference being that the triangular array is computed row by row, instead of column by
column.
input a, b, f , M
h ← b − a, R(1, 1) ← ((b − a)/2)[f(a) + f(b)]
for n = 1 to M do
h ← h/2, R(n + 1, 1) ← R(n, 1)/2 + h Σ_{i=1}^{2^{n−1}} f(a + (2i − 1)h)
for k = 1 to n do
R(n + 1, k + 1) ← R(n + 1, k) + [R(n + 1, k) − R(n, k)]/(4^k − 1)
end do end for
end do end for
output R
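A Python sketch of this pseudocode [not part of the original notes; the integrand is an illustrative assumption]:

import math

def romberg(f, a, b, M=5):
    # R is 0-indexed here; R[n][k] plays the role of R(n+1, k+1) in the notes
    R = [[0.0] * (M + 1) for _ in range(M + 1)]
    h = b - a
    R[0][0] = h * (f(a) + f(b)) / 2
    for n in range(1, M + 1):
        h /= 2
        R[n][0] = R[n - 1][0] / 2 + h * sum(f(a + (2 * i - 1) * h)
                                            for i in range(1, 2**(n - 1) + 1))
        for k in range(1, n + 1):
            R[n][k] = R[n][k - 1] + (R[n][k - 1] - R[n - 1][k - 1]) / (4**k - 1)
    return R[M][M]

print(romberg(math.exp, 0.0, 1.0))   # compare with e - 1 = 1.718281828...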
110
21.3 Exercises
From the textbook: 9 p 211, 13 p 212, 2 p 211 [after having written a program to carry out
the Romberg algorithm].
1. Calculate ∫_0^1 (sin x/√x) dx by the Romberg algorithm [Hint: change of variable].
2. Assume that the first column of the Romberg array converges to I := ∫_a^b f(x)dx – i.e.
that lim_{n→∞} R(n, 1) = I. Show that the second column also converges to I. Can one infer
that lim_{n→∞} R(n, n) = I?
111
Chapter 22
Adaptive quadrature
Adaptive quadrature methods are designed to take into account the behavior of the
function f to be integrated on an interval [a, b]. To do so, sample points will be clustered
in the regions of large variations of f . We present here a typical adaptive method based
on Simpson’s rule, where the user only supplies the function f , the interval [a, b], and a
desired accuracy ε.
We start by using Simpson’s rule on the interval [a, b]. If the approximation is not accurate
enough, then the interval [a, b] is divided into two equal subintervals, on each of which
Simpson’s rule is used again. If one [or both] of the resulting approximations is not accurate
enough, then a subdivision is applied once more. The repetition of this process constitutes
the main idea behind the adaptive Simpson’s method. Note that, at the end of the
process, we will have obtained approximations S1 , . . . , Sn to the integrals of f on some
intervals [x0 , x1 ], . . . , [xn−1 , xn ]. Denoting e1 , . . . , en the associated local errors, we get
∫_a^b f(x)dx = Σ_{i=1}^{n} ∫_{x_{i−1}}^{x_i} f(x)dx = Σ_{i=1}^{n} Si + Σ_{i=1}^{n} ei.
112
On each interval [u, w] that will be considered, we write the basic Simpson’s rule as
∫_u^w f(x)dx = ((w − u)/6) [f(u) + 4f((u + w)/2) + f(w)] − (1/90) ((w − u)/2)⁵ f^{(4)}(ζ)
=: S(u, w) − (1/90) ((w − u)/2)⁵ f^{(4)}(ζ), ζ ∈ [u, w].
Then, if [u, w] is to be divided into two equal subintervals [u, v] and [v, w], a more accurate
value of the integral can be computed according to
∫_u^w f(x)dx = ∫_u^v f(x)dx + ∫_v^w f(x)dx
= S(u, v) − (1/90)((v − u)/2)⁵ f^{(4)}(ξ′) + S(v, w) − (1/90)((w − v)/2)⁵ f^{(4)}(ξ′′)
= S(u, v) + S(v, w) − (1/90)((w − u)/2)⁵ (1/2⁵) [f^{(4)}(ξ′) + f^{(4)}(ξ′′)]
= S(u, v) + S(v, w) − (1/2⁴)(1/90)((w − u)/2)⁵ f^{(4)}(ξ),
where ξ 0 , ξ 00 , and ξ belong to [u, v], [v, w], and [u, w], respectively. This is justified only if
f (4) is continuous. As usual, the error term in such a formula cannot be bounded without
some knowledge of f (4) . For automatic computation, however, it is imperative to be able to
estimate the magnitude of f (4) (ξ). So we have to come up with an additional assumption.
Over small intervals, the function f (4) will be considered almost constant – in particular,
we suppose that f (4) (ζ) = f (4) (ξ). This term can then be eliminated to obtain
15 ∫_u^w f(x)dx = 16[S(u, v) + S(v, w)] − S(u, w),
∫_u^w f(x)dx = [S(u, v) + S(v, w)] + (1/15)[S(u, v) + S(v, w) − S(u, w)].
Observe that the error term has now been replaced by a quantity we are able to compute,
hence an error tolerance condition can be tested.
where 2h = w − u is the length of [u, w] and S is the Simpson estimate on [u, w], that is
S = S(u, w) = (h/3)[f(u) + 4f(u + h) + f(u + 2h)].
113
Then one computes the midpoint v = u + h and the estimates S1 := S(u, v) and S2 := S(v, w).
To see whether S1 + S2 is a good enough approximation, we test the inequality
|S1 + S2 − S| ≤ 30hε/(b − a).
If the inequality holds, then the refined value S1 + S2 + [S1 + S2 − S]/15 is the accepted value
of the integral of f on [u, w]. In this case, it is added to a variable Σ which should eventually
receive the approximate value of the integral on [a, b]. If the inequality does not hold, then
the interval [u, w] is divided in two. The previous vector is discarded and replaced by the
two vectors
[u, h/2, f (u), f (y), f (v), S1 ], y := u + h/2,
[v, h/2, f (v), f (z), f (w), S2 ], z := v + h/2.
The latter are added to an existing stack of such vectors. Note that only two new function
evaluations have been required to compute S1 and S2 . Note also that the user should
prevent the size of the stack from exceeding a prescribed value n, in order to avoid an
infinite algorithm. In view of the description we have made, we may now suggest the
following version of the pseudocode. A recursive version would also be conceivable.
input f , a, b, ε, n
∆ ← b − a, Σ ← 0, h ← ∆/2, c ← (a + b)/2, k ← 1,
f a ← f (a), f b ← f (b), f c ← f (c), S ← [f a + 4f c + f b]h/3,
v [1] ← [a, h, f a, f c, f b, S]
while 1 ≤ k ≤ n do
h ← v^[k]_2/2,
f y = f(v^[k]_1 + h), S1 ← [v^[k]_3 + 4f y + v^[k]_4]h/3,
f z = f(v^[k]_1 + 3h), S2 ← [v^[k]_4 + 4f z + v^[k]_5]h/3
if |S1 + S2 − v^[k]_6| < 30εh/∆
then Σ ← Σ + S1 + S2 + [S1 + S2 − v^[k]_6]/15, k ← k − 1,
if k = 0 then output Σ, stop
else if k = n then output failure, stop
f w ← v^[k]_5,
v^[k] ← [v^[k]_1, h, v^[k]_3, f y, v^[k]_4, S1],
k ← k + 1,
v^[k] ← [v^[k−1]_1 + 2h, h, v^[k−1]_5, f z, f w, S2]
end if
end do end while
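The recursive version mentioned above could look like the following Python sketch [not part of the original notes; here each subinterval is given the local tolerance 15ε, a simplified variant of the 30hε/(b − a) test, and function values are recomputed rather than stored]:

import math

def simpson(f, u, w):
    return (w - u) / 6 * (f(u) + 4 * f((u + w) / 2) + f(w))

def adaptive(f, u, w, eps):
    v = (u + w) / 2
    S, S1, S2 = simpson(f, u, w), simpson(f, u, v), simpson(f, v, w)
    if abs(S1 + S2 - S) <= 15 * eps:
        return S1 + S2 + (S1 + S2 - S) / 15      # refined accepted value
    return adaptive(f, u, v, eps / 2) + adaptive(f, v, w, eps / 2)

print(adaptive(math.sqrt, 0.0, 1.0, 1e-8))       # about 2/3, many points near 0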
114
22.3 Exercises
115
Chapter 23
Gaussian quadrature
If based on interpolation, we know that the formula is exact on Pn . We also have seen
that the degree of accuracy might be larger than n for some choices of nodes x0 , . . . , xn . We
may ask what is the largest degree of accuracy possible. Observe that the latter cannot
exceed 2n + 1, for if the formula was exact on P2n+2, we would obtain a contradiction by
considering f(x) = (x − x0)² · · · (x − xn)². In fact, the maximal degree of accuracy is exactly
2n + 1. Indeed, a judicious choice of nodes leads to the following Gaussian quadrature,
which is exact on P2n+1 [the dimension of the space where exactness occurs has doubled
from a priori n + 1 to 2n + 2].
Theorem 23.1. Let x0 , . . . , xn be the zeros of the Legendre polynomial of degree n+1. Then
the quadrature formula
∫_{−1}^{1} f(x)dx ≈ Σ_{i=0}^{n} Ai f(xi),    Ai = ∫_{−1}^{1} ∏_{j=0, j≠i}^{n} (x − xj)/(xi − xj) dx,
is exact on P2n+1 .
Remark that the formula makes sense as long as x0 , . . . , xn are distinct points in (−1, 1).
This property is not specific to the zeros of the Legendre polynomial of degree n + 1, but
116
is shared by the (n + 1)-st orthogonal polynomials relative to all possible weight functions
– see Theorem 17.2. As a matter of fact, the quadrature rule also remains valid in this
context. We only need to establish this general result.
Theorem 23.2. Let x0 , . . . , xn be the zeros of the (n + 1)-st orthogonal polynomial relative
to the weight function w. Then the quadrature formula
∫_{−1}^{1} f(x)w(x)dx ≈ Σ_{i=0}^{n} Ai f(xi),    Ai = ∫_{−1}^{1} w(x) ∏_{j=0, j≠i}^{n} (x − xj)/(xi − xj) dx,
is exact on P2n+1 .
Proof. Given f ∈ P2n+1, divide it by the orthogonal polynomial p to write
f = q p + r, with deg r ≤ n.
Note that we must have deg q ≤ n to ensure that deg f ≤ 2n + 1. Then observe that
∫_{−1}^{1} f(x)w(x)dx = ∫_{−1}^{1} q(x)p(x)w(x)dx + ∫_{−1}^{1} r(x)w(x)dx = ∫_{−1}^{1} r(x)w(x)dx,
since q ∈ Pn is orthogonal to p, while, p vanishing at each xi,
Σ_{i=0}^{n} Ai f(xi) = Σ_{i=0}^{n} Ai [q(xi)p(xi) + r(xi)] = Σ_{i=0}^{n} Ai r(xi).
The equality of these two quantities for all r ∈ Pn is derived from the expansion of r on the
Lagrange cardinal polynomials `0 , . . . , `n associated to x0 , . . . , xn , just as in Chapter 20.
23.2 Examples
In theory, for any weight function w, we could devise the corresponding quadrature formula
numerically: construct the (n + 1)-st orthogonal polynomial numerically, find its zeros e.g.
by using NSolve in Mathematica, and find the weights e.g. by solving a linear system.
Fortunately, this work can be avoided for usual choices of w, since the nodes xi and the
weights Ai can be found in appropriate handbooks.
117
Let us consider the simple case w = 1 and n = 1. The zeros of the second orthogonal
polynomial, i.e. of the Legendre polynomial P2(x) = 3x²/2 − 1/2, are x0 = −1/√3 and
x1 = 1/√3. The weights A0 and A1 must be identical, by symmetry, and must sum to
∫_{−1}^{1} dx = 2. Hence the Gaussian quadrature, which is exact on P3, takes the form
∫_{−1}^{1} f(x)dx ≈ f(−1/√3) + f(1/√3).
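A Python sketch [not part of the original notes] using NumPy's tabulated Gauss–Legendre nodes and weights; with two nodes this is exactly the two-point rule displayed above:

import numpy as np

def gauss_legendre(f, npts):
    x, A = np.polynomial.legendre.leggauss(npts)   # nodes in (-1, 1) and weights
    return float(np.sum(A * f(x)))

print(gauss_legendre(lambda x: x**3 + x**2, 2))    # exact on P3: 2/3
print(gauss_legendre(np.exp, 5))                    # close to e - 1/e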
If we now consider the weight function w(x) = 1/√(1 − x²), the (n + 1)-st orthogonal polynomial
reduces to the Chebyshev polynomial Tn+1, whose zeros are cos((2i + 1)π/(2n + 2)), i ∈ J0, nK.
It turns out that the weights Ai are independent of i, their common value being π/(n + 1).
Hence, with F(x) := √(1 − x²) f(x), we may write
∫_{−1}^{1} f(x)dx = ∫_{−1}^{1} F(x) dx/√(1 − x²) ≈ (π/(n + 1)) Σ_{i=0}^{n} F(cos((2i + 1)π/(2n + 2)))
= (π/(n + 1)) Σ_{i=0}^{n} sin((2i + 1)π/(2n + 2)) · f(cos((2i + 1)π/(2n + 2))).
This is known as Hermite’s quadrature formula. Note that it is not exact on P2n+1, but
rather on w·P2n+1 := {w p, p ∈ P2n+1}.
Theorem 23.3. Let p be the (n + 1)-st monic orthogonal polynomial relative to the weight
function w, and let x0 , . . . , xn be its zeros. Then, for any f ∈ C 2n+2 [−1, 1], the error term in
the Gaussian quadrature formula takes the form, for some ξ ∈ (−1, 1),
∫_{−1}^{1} f(x)w(x)dx − Σ_{i=0}^{n} Ai f(xi) = (f^{(2n+2)}(ξ)/(2n + 2)!) ∫_{−1}^{1} p²(x)w(x)dx.
118
23.4 Exercises
1. Find a quadrature formula of the form ∫_a^b f(x)w(x)dx ≈ Σ_{i=0}^{n} Ai f(xi)
with n = 1 that is exact for all polynomials of degree at most 3. Repeat with n = 2,
making the formula exact on P5.
119
Part VI
120
Chapter 24
Difference equations
Many numerical algorithms, in particular those derived from the discretization of differ-
ential equations, are designed to produce a sequence x = (xn ) of numbers. Often, the
sequence x obeys an (m + 1)-term recurrence relation of the type
(24.1) xn+m + an,m−1 xn+m−1 + · · · + an,0 xn = 0, n ≥ 1.
We will establish that the solutions of the difference equation (24.1) can be given in
explicit form when the coefficients an,m−1 , . . . , an,0 are constant, i.e. independent of n. But
observe first that the set S of all real sequences satisfying (24.1) is a linear subspace of the
space RN of all real sequences. In fact, this linear subspace has dimension m, since the
linear map
x ∈ S 7→ [x1 , x2 , . . . , xm ]> ∈ Rm
is a bijection [check this fact]. The latter simply says that a solution of (24.1) is uniquely
determined by its first m terms.
If the coefficients of the recurrence relation are constant, we may rewrite (24.1) as
(24.2) xn+m + cm−1 xn+m−1 + · · · + c0 xn = 0,
whose characteristic polynomial is defined by p(λ) := λ^m + cm−1 λ^{m−1} + · · · + c0.
121
Theorem 24.1. Suppose that the characteristic polynomial p has m simple zeros λ1 , . . . , λm .
Then the sequences x^{[i]} := (1, λi, λi², . . . , λi^{n−1}, . . .), i ∈ J1, mK, form a basis for the vector space S of solutions of (24.2).
Proof. First of all, we need to check that each sequence x^{[i]}, i ∈ J1, mK, is a solution of (24.2).
To see this, we simply write
x^{[i]}_{n+m} + cm−1 x^{[i]}_{n+m−1} + · · · + c0 x^{[i]}_n = λi^{n+m−1} + cm−1 λi^{n+m−2} + · · · + c0 λi^{n−1} = λi^{n−1} · p(λi) = 0.
Now, since the system (x[1] , . . . , x[m] ) has cardinality equal to the dimension of S, we only
need to verify that it is linearly independent. If it was not, then the system composed of
[1, λ1, . . . , λ1^{m−1}]^>, . . . , [1, λm, . . . , λm^{m−1}]^> would be linearly dependent, which is impossible
[remember Vandermonde matrices]. This completes the proof.
Let us illustrate how this result is used in practice. Consider the Fibonacci sequence F
defined by
F1 = 1, F2 = 1, Fn = Fn−1 + Fn−2 for n ≥ 3.
The first ten terms are 1, 1, 2, 3, 5, 8, 13, 21, 34, 55. We now exhibit a closed-form
expression for the Fibonacci number Fn. According to Theorem 24.1, the two sequences
x^{[1]} = (µ^{n−1}) and x^{[2]} = (ν^{n−1}) form a basis of the space {x ∈ R^N : xn = xn−1 + xn−2}, where
µ := (1 + √5)/2 and ν := (1 − √5)/2 denote the two zeros of the characteristic polynomial λ² − λ − 1.
For completeness, we now formulate a slight generalization of Theorem 24.1. Its proof is
based on the one of Theorem 24.1, and is therefore omitted.
122
Theorem 24.2. Suppose that the polynomial p has k zeros λ1, . . . , λk of respective multiplicities
m1, . . . , mk, with m1 + · · · + mk = m. Then the sequences
(1, λ, . . . , λ^n, . . .)|_{λ=λi}, (d/dλ)(1, λ, . . . , λ^n, . . .)|_{λ=λi}, . . . , (d^{mi−1}/dλ^{mi−1})(1, λ, . . . , λ^n, . . .)|_{λ=λi},
where i runs through J1, kK, form a basis for the vector space S of solutions of (24.2).
24.1 Exercises
1. Give bases consisting of real sequences for the solution spaces of xn+2 = 2xn+1 − 3xn and
xn+3 = 3xn+2 − 4xn .
2. Consider the difference equation xn+2 − 2xn+1 − 2xn = 0. One of its solutions is xn = (1 − √3)^{n−1}.
This solution oscillates in sign and converges to zero. Compute and print out
the first 100 terms of this sequence by use of the equation xn+2 = 2(xn+1 + xn), starting
with x1 = 1 and x2 = 1 − √3. Explain the curious phenomenon that occurs.
3. Setting x(λ) := (1, λ, . . . , λ^n, . . .), explain in detail why the system x(λ1), . . . , x(λm) is
linearly independent whenever the λi’s are distinct.
4. Consider the difference equation 4xn+2 − 8xn+1 + 3xn = 0. Find the general solution and
determine whether the difference equation is stable – meaning that every solution x is
bounded, that is to say supn |xn | < ∞. Assuming that x1 = 0 and x2 = −2, compute x101
using the most efficient method.
123
Chapter 25
124
25.1 Existence and uniqueness of solutions
Theorem 25.1. If the function f = f(t, y) is continuous on the strip [a, b] × (−∞, ∞) and
satisfies a Lipschitz condition in its second variable, that is
|f(t, y1) − f(t, y2)| ≤ L |y1 − y2| for some constant L and all t ∈ [a, b], y1, y2 ∈ (−∞, ∞),
then the initial-value problem dy/dt = f(t, y), t ∈ [a, b], y(a) = y0, has a unique solution y on [a, b].
Let us mention that, in order to check that a Lipschitz condition is satisfied, we may verify
that sup_{t,y} |∂f/∂y (t, y)| < ∞ and use the mean value theorem. Note also that, under the same
assumptions as in Theorem 25.1, we can even conclude that the initial-value problem is
well-posed, in the sense that it has a unique solution y and that there exist constants
ε0 > 0 and K > 0 such that for any ε ∈ (0, ε0 ), any number γ with |γ| < ε, and any
continuous function δ with sup |δ(t)| < ε, the initial-value problem
t∈[a,b]
dz
= f (t, z) + δ(t), t ∈ [a, b], z(a) = y0 + γ,
dt
has a unique solution z, which further satisfies sup |z(t) − y(t)| ≤ K ε. In other words, a
t∈[a,b]
small perturbation in the problem, e.g. due to roundoff errors in the representations of f
and y0 , does not significantly perturb the solution.
25.2 Euler's method
Most of the methods presented in the following chapters are more efficient than Euler's method, so that one rarely uses it for practical purposes.
For theoretical purposes, however, it has the advantage of its simplicity. The output of
the method does not consist of an approximation of the solution function y, but rather of
approximations yi of the values y(ti ) at some mesh points a = t0 < t1 < · · · < tn−1 < tn = b.
These points are often taken to be equidistant, i.e. ti = a + ih, where h = (b − a)/n is the step
size. The approximations yi are constructed one-by-one via the difference equation
yi+1 = yi + h f (ti , yi ).
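In code, Euler's method takes only a few lines; the sketch below [the function name euler is ours] is applied to the test problem y′ = y, y(0) = 1, whose exact solution is e^t.

def euler(f, a, b, y0, n):
    # Euler's method with n steps of size h = (b - a)/n on [a, b]
    h = (b - a) / n
    t, y = a, y0
    values = [(t, y)]
    for i in range(n):
        y = y + h * f(t, y)        # y_{i+1} = y_i + h f(t_i, y_i)
        t = a + (i + 1) * h
        values.append((t, y))
    return values

approx = euler(lambda t, y: y, 0.0, 1.0, 1.0, 100)
print(approx[-1])   # approximately (1.0, 2.7048), to be compared with e = 2.71828...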
Theorem 25.2. Let us assume that the function f = f (t, y) is continuous on the strip
[a, b] × (−∞, ∞) and satisfies a Lipschitz condition with constant L in its second variable.
Let y be the unique solution of the initial-value problem
dy/dt = f(t, y),    t ∈ [a, b],    y(a) = y0.
If y is twice differentiable and sup_{t∈[a,b]} |y″(t)| =: M < ∞, then, for each i ∈ J0, nK, one has
|y(ti) − yi| ≤ (hM/2L) [exp(L(ti − a)) − 1],
where y0, . . . , yn are the approximations generated by Euler's method on the uniform mesh
t0, . . . , tn with step size h = (b − a)/n.
Proof. Given i ∈ J0, n − 1K, we can write, for some ξi ∈ (ti, ti+1),
|y(ti+1) − yi+1| = |y(ti) + h y′(ti) + (h²/2) y″(ξi) − (yi + h f(ti, yi))|
= |y(ti) − yi + h (y′(ti) − f(ti, yi)) + (h²/2) y″(ξi)|
= |y(ti) − yi + h (f(ti, y(ti)) − f(ti, yi)) + (h²/2) y″(ξi)|
≤ |y(ti) − yi| + h |f(ti, y(ti)) − f(ti, yi)| + (h²/2) |y″(ξi)|
≤ |y(ti) − yi| + h L |y(ti) − yi| + (h²/2) M.
To simplify the notations, set ui := |y(ti) − yi|, a := hL, and b := h²M/2. Then the previous
inequality takes the form ui+1 ≤ (1 + a) ui + b, which is equivalent [add a constant c to be
chosen smartly] to the inequality ui+1 + b/a ≤ (1 + a)(ui + b/a). It follows by immediate
induction, using u0 = 0, that
ui + b/a ≤ (1 + a)^i (u0 + b/a),    i.e.    ui + hM/(2L) ≤ (1 + hL)^i hM/(2L).
To obtain the announced result, remark that (1 + hL)^i ≤ exp(ihL) = exp(L(ti − a)).
Note that, y″ being a priori unknown, we cannot give a precise evaluation for M. If f is
'well-behaved', however, we can differentiate the relation y′(t) = f(t, y(t)) and estimate M
from there. It should also be noted that, because of roundoff errors, decreasing the step
size h beyond a certain critical value will not improve the accuracy of the approximations.
25.3 Exercises
From the textbook: 1.a.c., 2.a.c. p 255; 8.a.c. p 256; 1.b.c. p 263; 9 p 264; 12 p 265.
Optional problems
To be completed.
Chapter 26
Runge–Kutta methods
The methods named after Carl Runge and Wilhelm Kutta are designed to imitate the Tay-
lor series methods without requiring analytic differentiation of the original differential
equation. Performing preliminary work before writing the computer program can indeed
be a serious obstacle. An ideal method should involve nothing more than writing a code
to evaluate f , and the Runge–Kutta methods accomplish just that. We will present the
Runge–Kutta method of order two, even though its low precision usually precludes its use
in actual scientific computations – it does find application in real-time calculations on small
computers, however. Then the Runge–Kutta method of order four will be given, without
any form of proof.
The bivariate Taylor series, analogous to the usual univariate one, takes the form
f(x + h, y + k) = ∑_{i=0}^{∞} (1/i!) (h ∂/∂x + k ∂/∂y)^i f(x, y),
where the operators appearing in this equation are
(h ∂/∂x + k ∂/∂y)^0 f = f,
(h ∂/∂x + k ∂/∂y)^1 f = h ∂f/∂x + k ∂f/∂y,
(h ∂/∂x + k ∂/∂y)^2 f = h² ∂²f/∂x² + 2hk ∂²f/∂x∂y + k² ∂²f/∂y²,
and so on.
If the series is truncated, an error term is needed to restore equality. We would have
f(x + h, y + k) = ∑_{i=0}^{n−1} (1/i!) (h ∂/∂x + k ∂/∂y)^i f(x, y) + (1/n!) (h ∂/∂x + k ∂/∂y)^n f(x̄, ȳ),
where the point (x̄, ȳ) lies on the line segment joining (x, y) and (x + h, y + k). It is customary
to use subscripts to denote partial derivatives, for instance
fx := ∂f/∂x,    fy := ∂f/∂y,    fxy := ∂²f/∂x∂y = ∂²f/∂y∂x.
We need a procedure for advancing the solution function one step at a time. For the Runge–
Kutta method of order two, the formula for y(t + h) in terms of known quantities is
(26.1)    y(t + h) = y(t) + w1 K1 + w2 K2,    with    K1 := h f(t, y),    K2 := h f(t + αh, y + βK1).
The Taylor expansion of the solution, on the other hand, is
(26.2)    y(t + h) = y(t) + h y′(t) + (h²/2) y″(t) + O(h³).
One obvious way to force (26.1) and (26.2) to agree up through the term in h is by taking
w1 = 1 and w2 = 0. This is Euler's method. One can improve this to obtain agreement up
through the term in h². Let us apply the bivariate Taylor theorem in (26.1) to get
y(t + h) = y(t) + w1 h f(t, y) + w2 h [f(t, y) + αh ft(t, y) + βh f(t, y) fy(t, y) + O(h²)]
= y(t) + (w1 + w2) f(t, y) h + [α w2 ft(t, y) + β w2 f(t, y) fy(t, y)] h² + O(h³).
Making use of y″(t) = ft(t, y) + f(t, y) fy(t, y), Equation (26.2) can also be written as
y(t + h) = y(t) + f(t, y) h + (1/2)[ft(t, y) + f(t, y) fy(t, y)] h² + O(h³).
Thus, we shall require
w1 + w2 = 1,    α w2 = 1/2,    β w2 = 1/2.
Among the convenient solutions, there is the choice α = 1, β = 1, w1 = 1/2, w2 = 1/2,
resulting in the second-order Runge–Kutta method [also known as Heun's method]
y(t + h) = y(t) + (h/2) f(t, y) + (h/2) f(t + h, y + h f(t, y)),
and the choice α = 1/2, β = 1/2, w1 = 0, w2 = 1, resulting in the modified Euler's method
y(t + h) = y(t) + h f(t + h/2, y + (h/2) f(t, y)).
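Both second-order choices are easily coded; the following minimal sketch [in Python, with function names of our own] performs one step of each on the test problem y′ = y, y(0) = 1.

def heun_step(f, t, y, h):
    # one step of the second-order Runge-Kutta (Heun) method
    return y + (h / 2) * (f(t, y) + f(t + h, y + h * f(t, y)))

def modified_euler_step(f, t, y, h):
    # one step of the modified Euler method
    return y + h * f(t + h / 2, y + (h / 2) * f(t, y))

print(heun_step(lambda t, y: y, 0.0, 1.0, 0.1))            # 1.105
print(modified_euler_step(lambda t, y: y, 0.0, 1.0, 0.1))  # 1.105, both close to exp(0.1) = 1.10517...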
The classical Runge–Kutta method of order four advances the solution via
y(t + h) = y(t) + (1/6)[K1 + 2K2 + 2K3 + K4],
with K1 := h f(t, y), K2 := h f(t + h/2, y + K1/2), K3 := h f(t + h/2, y + K2/2), K4 := h f(t + h, y + K3).
Despite its elegance, this formula is tedious to derive, and we shall not do so. The method
is termed fourth-order because it reproduces the terms in the Taylor series up to and including
the one involving h⁴, so that the [local truncation] error is in h⁵. Note that the
value y(t + h) is obtained at the expense of four function evaluations. Here is a possible
pseudocode to be implemented.
input f , t, y, h, n
t0 ← t, output 0, t, y
for i = 1 to n do
K1 ← hf (t, y)
K2 ← hf (t + h/2, y + K1 /2)
K3 ← hf (t + h/2, y + K2 /2)
K4 ← hf (t + h, y + K3 )
y ← y + (K1 + 2K2 + 2K3 + K4 )/6
t ← t0 + ih
output i, t, y
end for
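A direct Python transcription of this pseudocode might look as follows [a sketch; the function name rk4 is of our own choosing].

def rk4(f, t, y, h, n):
    # classical fourth-order Runge-Kutta method: n steps of size h starting from (t, y)
    t0 = t
    print(0, t, y)
    for i in range(1, n + 1):
        K1 = h * f(t, y)
        K2 = h * f(t + h / 2, y + K1 / 2)
        K3 = h * f(t + h / 2, y + K2 / 2)
        K4 = h * f(t + h, y + K3)
        y = y + (K1 + 2 * K2 + 2 * K3 + K4) / 6
        t = t0 + i * h
        print(i, t, y)
    return y

rk4(lambda t, y: y, 0.0, 1.0, 0.1, 10)   # y' = y, y(0) = 1: the final value is close to e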
26.4 Exercises
Optional problems
Chapter 27
Multistep methods
The Taylor series and the Runge–Kutta methods for solving an initial-value problem are
singlestep methods, in the sense that they do not use any knowledge of prior values of
y when the solution is being advanced from t to t + h. Indeed, if t0 , t1 , . . . , ti , ti+1 are steps
along the t-axis, then the approximation of y at ti+1 depends only on the approximation of y
at ti , while the approximations of y at ti−1 , ti−2 , . . . , t0 are not used. Taking these values into
account should, however, provide more efficient procedures. We present the basic principle
involved here in a particular case. Suppose we want to solve numerically the initial-value
problem y′(t) = f(t, y), with y(t0) = y0. We prescribe the [usually equidistant] mesh points
t0 , t1 , . . . , tn . By integrating on [ti , ti+1 ], we have
y(ti+1) = y(ti) + ∫_{ti}^{ti+1} f(t, y(t)) dt.
This integral can be approximated by a numerical quadrature scheme involving the values
f (tj , y(tj )) for j ≤ i. Thus we will define yi+1 by a formula of the type
which we require to be exact on P4 . We follow the method of undetermined coefficients,
working with the basis of P4 consisting of
p0 (x) = 1,
p1 (x) = x,
p2 (x) = x(x + 1),
p3 (x) = x(x + 1)(x + 2),
p4 (x) = x(x + 1)(x + 2)(x + 3).
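The undetermined coefficients can then be found with a few lines of code. The sketch below [in Python; it assumes that the quadrature rule approximates ∫₀¹ F(x) dx by a combination of the values F(0), F(−1), . . . , F(−4), which is what the chosen basis suggests] imposes exactness on p0, . . . , p4 and recovers, up to the factor 720, the Adams–Bashforth five-step coefficients 1901, −2774, 2616, −1274, 251 appearing in the predictor formula further below.

import numpy as np

# basis of P4 chosen above; p_k vanishes at the nodes 0, -1, ..., -(k-1)
basis = [
    lambda x: 1.0,
    lambda x: x,
    lambda x: x * (x + 1),
    lambda x: x * (x + 1) * (x + 2),
    lambda x: x * (x + 1) * (x + 2) * (x + 3),
]
# integrals of the basis polynomials over [0, 1], computed by hand
rhs = np.array([1.0, 1 / 2, 5 / 6, 9 / 4, 251 / 30])

# impose sum_j A_j p_k(-j) = integral of p_k for k = 0, ..., 4
nodes = [0, -1, -2, -3, -4]
M = np.array([[p(x) for x in nodes] for p in basis])
A = np.linalg.solve(M, rhs)
print(720 * A)   # approximately [1901, -2774, 2616, -1274, 251]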
If the coefficient bm+1 is zero, then the method is said to be explicit, since yi+1 does not
appear on the right-hand side, so yi+1 can be computed directly from (27.2). If, on the other hand,
the coefficient bm+1 is nonzero, then yi+1 appears on the right-hand side of (27.2) by way of
fi+1 = f (ti+1 , yi+1 ), and Equation (27.2) determines yi+1 implicitly. The method is therefore
said to be implicit.
We now list, for reference only, various Adams–Bashforth and Adams–Moulton formulae.
The latter are derived just as the former would be – see Section 27.1 – except that the
quadrature formula for ∫₀¹ F(x) dx also incorporates the value F(1). This implies that
the Adams–Moulton methods are implicit, as opposed to the explicit Adams–Bashforth
methods.
• Adams–Moulton four-step method
yi+1 = yi + (h/720) [251 fi+1 + 646 fi − 264 fi−1 + 106 fi−2 − 19 fi−3]
Local truncation error: −(3/160) y^(6)(ξi) h⁵, for some ξi ∈ (ti−3, ti+1).
Note that the order of the local truncation error of the Adams–Moulton m-step method
matches that of the Adams–Bashforth (m + 1)-step method, for the same number of
function evaluations. Of course, this has a price, namely solving the equation in yi+1.
In numerical practice, the Adams–Bashforth formulae are rarely used by themselves, but
in conjunction with other formulae to enhance the precision. For instance, the Adams–
Bashforth five-step formula can be employed to predict a tentative estimate for the value
yi+1 given by the Adams–Moulton four-step formula; denoting this estimate by y*_{i+1}, we use
it to replace fi+1 = f(ti+1, yi+1) by its approximation f*_{i+1} = f(ti+1, y*_{i+1}) in the Adams–
Moulton formula, thus yielding a corrected estimate for yi+1. This general principle provides
satisfactory algorithms, called predictor–corrector methods. In fact, remark that
yi+1 should be interpreted as a fixed point of a certain mapping φ, and as such may be
computed, under appropriate hypotheses, as the limit of the iterates φ^k(z) for a suitable
starting point z. Note that the predictor–corrector method simply estimates yi+1 by φ(y*_{i+1}). A better
estimate would obviously be obtained by applying φ over again, but in practice only one or two
further iterations will be performed. To accompany this method, we suggest the implementation sketch below.
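A minimal sketch of one predictor–corrector step, written here in Python [the function name abm5_step is ours, and we assume that the five most recent values fi−4, . . . , fi of f are already available, produced for instance by a few Runge–Kutta start-up steps], could read as follows.

def abm5_step(f, t, y, h, f_hist):
    # one Adams-Bashforth (predictor) / Adams-Moulton (corrector) step;
    # f_hist = [f_{i-4}, f_{i-3}, f_{i-2}, f_{i-1}, f_i] are previously computed values of f
    fm4, fm3, fm2, fm1, f0 = f_hist
    # predictor: Adams-Bashforth five-step formula
    y_star = y + (h / 720) * (1901 * f0 - 2774 * fm1 + 2616 * fm2 - 1274 * fm3 + 251 * fm4)
    f_star = f(t + h, y_star)
    # corrector: Adams-Moulton four-step formula with f_{i+1} replaced by its approximation f*_{i+1}
    y_next = y + (h / 720) * (251 * f_star + 646 * f0 - 264 * fm1 + 106 * fm2 - 19 * fm3)
    return y_next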
27.4 Exercises
Chapter 28
Observe now that this system can be rewritten using a condensed vector notation, setting
Y := [y1, y2, . . . , yn]^⊤ − a mapping from R into R^n,    F := [f1, f2, . . . , fn]^⊤ − a mapping from R × R^n into R^n,
so that the system reads
Y′ = F(t, Y).
The initial conditions then translate into the vector equality Y(t0) = Y0.
A higher-order differential equation can also be converted into this vector format, by transforming
it first into a system of first-order ordinary differential equations. Indeed, for the
differential equation
y^(n) = f(t, y, y′, . . . , y^(n−1)),
we may introduce the auxiliary functions x1 := y, x2 := y′, . . . , xn := y^(n−1), which satisfy the
first-order system x1′ = x2, . . . , x_{n−1}′ = xn, xn′ = f(t, x1, . . . , xn).
The same approach can of course be applied to systems of higher-order ordinary differential
equations. To illustrate this point, one may check that the system
(x″)² + t e^y + y′ = x′ − x,
y′ y″ − cos(xy) + sin(txy) = x,
can be rewritten [setting x1 := x, x2 := x′, x3 := y, x4 := y′] as
x1′ = x2,
x2′ = [x2 − x1 − x4 − t e^{x3}]^{1/2},
x3′ = x4,
x4′ = [x1 − sin(t x2 x3) + cos(x1 x3)]/x4.
Remark also that any system of the form (28.1) can be transformed into an autonomous
system, i.e. a system that does not contain t explicitly. It suffices for that to introduce
the function y0(t) = t, to define Y = [y0, y1, . . . , yn]^⊤ and F = [1, f1, . . . , fn]^⊤, so that (28.1) is
expressed as Y′ = F(Y).
28.2 Extensions of the methods
28.2.1 Taylor series method for systems
We write a truncated Taylor expansion for each of the unknown functions in (28.1) as
yi(t + h) ≈ yi(t) + h yi′(t) + (h²/2) yi″(t) + · · · + (h^n/n!) yi^(n)(t),
or, in the shorter vector notation,
Y(t + h) ≈ Y(t) + h Y′(t) + (h²/2) Y″(t) + · · · + (h^n/n!) Y^(n)(t).
The derivatives appearing here are to be computed from the differential equation before
being included [in the right order] in the code. For instance, to solve the system
x′ = x + y² − t³,    x(1) = 3,
y′ = y + x³ + cos(t),    y(1) = 1,
we differentiate the equations to express x″, y″, x‴, y‴ in terms of t, x, y, and advance one
Taylor step of order three at a time, as in the following pseudocode [which marches backward from t = b].
input a, b, α, β, n
t ← b, x ← α, y ← β, h ← (b − a)/n
output t, x, y
for k = 1 to n do
    x1 ← x + y² − t³
    y1 ← y + x³ + cos(t)
    x2 ← x1 + 2y·y1 − 3t²
    y2 ← y1 + 3x²·x1 − sin(t)
    x3 ← x2 + 2y·y2 + 2(y1)² − 6t
    y3 ← y2 + 6x·(x1)² + 3x²·x2 − cos(t)
    x ← x − h(x1 − h/2 (x2 − h/3 x3))
    y ← y − h(y1 − h/2 (y2 − h/3 y3))
    t ← b − kh
    output t, x, y
end for
28.2.2 Runge–Kutta method for systems
For an autonomous system, the classical fourth-order Runge–Kutta formula reads, in vector form,
Y(t + h) ≈ Y(t) + (1/6)[F1 + 2F2 + 2F3 + F4],    with
F1 = h F(Y(t)),
F2 = h F(Y(t) + F1/2),
F3 = h F(Y(t) + F2/2),
F4 = h F(Y(t) + F3).
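Since Y and F are now vectors, the code is virtually identical to the scalar case; here is a minimal sketch using NumPy arrays [the function name is ours], applied to the autonomous form of y′ = y.

import numpy as np

def rk4_system_step(F, Y, h):
    # one classical fourth-order Runge-Kutta step for an autonomous system Y' = F(Y)
    F1 = h * F(Y)
    F2 = h * F(Y + F1 / 2)
    F3 = h * F(Y + F2 / 2)
    F4 = h * F(Y + F3)
    return Y + (F1 + 2 * F2 + 2 * F3 + F4) / 6

# y' = y written autonomously with Y = [y0, y1] = [t, y] and F(Y) = [1, y1]
F = lambda Y: np.array([1.0, Y[1]])
Y = np.array([0.0, 1.0])
for _ in range(10):
    Y = rk4_system_step(F, Y, 0.1)
print(Y)   # approximately [1.0, 2.71828]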
Adapting the usual case turns out to be straightforward, too. The vector form of the
Adams–Bashforth–Moulton predictor–corrector method of order five, for example, reads
Y*(t + h) ≈ Y(t) + (h/720)[1901 F(Y(t)) − 2774 F(Y(t − h)) + 2616 F(Y(t − 2h)) − 1274 F(Y(t − 3h)) + 251 F(Y(t − 4h))],
Y(t + h) ≈ Y(t) + (h/720)[251 F(Y*(t + h)) + 646 F(Y(t)) − 264 F(Y(t − h)) + 106 F(Y(t − 2h)) − 19 F(Y(t − 3h))].
28.3 Exercises
Chapter 29
As a consequence, we can derive the following result.
Corollary 29.2. The boundary-value problem
y″(t) = p(t) y′(t) + q(t) y(t) + r(t),    t ∈ [a, b],    y(a) = α,    y(b) = β,
admits a unique solution as soon as the functions p, q, and r are continuous on [a, b] and
q(t) > 0 for all t ∈ [a, b].
This applies e.g. to the equation y″ = y, but not to the equation y″ = −y, as expected from
the previous discussion. It is instructive to attempt a direct proof of Corollary 29.2, and
one is advised to do so. We now focus on our primary interest, namely we ask how to solve
boundary-value problems numerically.
Since methods for initial-value problems are at our disposal, we could try and use them
in this situation. Precisely, we will make a guess for the initial value y′(a) and solve the
corresponding initial-value problem, hoping that the resulting y(b) agrees with β. If not,
we alter our guess for y′(a), compute the resulting y(b), and compare it with β. The process
is repeated as long as y(b) differs too much from β. This scheme is called the shooting
method. To formalize the procedure, consider the initial-value problem
y″ = f(t, y, y′),    t ∈ [a, b],    y(a) = α,    y′(a) = z,
where only z is variable here. Then, assuming existence and uniqueness of the solution y,
its value at the right endpoint depends on z. We may write ϕ(z) := y(b) and ψ(z) := ϕ(z) − β.
Our goal is to solve the equation ψ(z) = 0, for which we have already developed some
techniques. Note that, in this approach, any algorithm for the initial-value problem can be
combined with any algorithm for root finding. The choices are determined by the nature of
the problem at hand. We choose here to use the secant method. Thus, we need two initial
guesses z1 and z2 , and we construct a sequence (zn ) from the iteration formula
zn+1 = zn − ψ(zn) (zn − zn−1)/(ψ(zn) − ψ(zn−1)),    i.e.    zn+1 = zn + (β − ϕ(zn)) (zn − zn−1)/(ϕ(zn) − ϕ(zn−1)).
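Here is a possible sketch of the shooting method in Python [all names are ours; the initial-value problems are solved with the fourth-order Runge–Kutta scheme of Chapter 26], applied to the linear test problem y″ = y, y(0) = 0, y(1) = 1, for which the secant iteration converges in essentially one step.

import numpy as np

def solve_ivp_rk4(f, a, b, alpha, z, n=100):
    # integrate y'' = f(t, y, y') with y(a) = alpha, y'(a) = z, and return y(b);
    # the second-order equation is treated as a first-order system for (y, y')
    h = (b - a) / n
    Y = np.array([alpha, z])
    G = lambda t, Y: np.array([Y[1], f(t, Y[0], Y[1])])
    for i in range(n):
        t = a + i * h
        K1 = h * G(t, Y)
        K2 = h * G(t + h / 2, Y + K1 / 2)
        K3 = h * G(t + h / 2, Y + K2 / 2)
        K4 = h * G(t + h, Y + K3)
        Y = Y + (K1 + 2 * K2 + 2 * K3 + K4) / 6
    return Y[0]

def shoot(f, a, b, alpha, beta, z1, z2, tol=1e-10, max_iter=20):
    # shooting method: secant iteration on psi(z) = phi(z) - beta
    psi = lambda z: solve_ivp_rk4(f, a, b, alpha, z) - beta
    p1, p2 = psi(z1), psi(z2)
    for _ in range(max_iter):
        z_new = z2 - p2 * (z2 - z1) / (p2 - p1)
        z1, p1 = z2, p2
        z2, p2 = z_new, psi(z_new)
        if abs(p2) < tol:
            break
    return z2

# y'' = y, y(0) = 0, y(1) = 1: the exact slope is y'(0) = 1/sinh(1) = 0.8509...
print(shoot(lambda t, y, yp: y, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0))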
We will monitor the value of ψ(z) = ϕ(z) − β as we go along, and stop when it is sufficiently
small. Note that the values of the approximations yi are to be stored until better ones
are obtained. Remark also that the function ϕ is very expensive to compute, so that in the
first stages of the shooting method, since high precision is essentially wasted there, a large step
size should be used. Let us now observe that the secant method provides the exact
solution in just one step if the function ϕ is linear. This will occur for linear boundary-value
problems. Indeed, if y1 and y2 are solutions of
(29.1)    y″ = p(t) y′ + q(t) y + r(t),    t ∈ [a, b],    y(a) = α,
with y1′(a) = z1 and y2′(a) = z2, then for any λ the function ỹ := (1 − λ)y1 + λy2 is the
solution of (29.1) with ỹ′(a) = (1 − λ)z1 + λz2. Considering the value of ỹ at b, we get
ϕ((1 − λ)z1 + λz2) = (1 − λ)ϕ(z1) + λϕ(z2), which characterizes ϕ as a linear function.
Let us finally recall that the secant method outputs zn+1 as the zero of the linear func-
tion interpolating ψ at zn−1 and zn [think of the secant line]. An obvious way to refine
the shooting method would therefore be to use higher-order interpolation. For example,
we could form the cubic polynomial p that interpolates ψ at z1 , z2 , z3 , z4 , and obtain z5 as
the solution of p(z5 ) = 0. Still with cubic interpolants, an even better refinement involves
inverse interpolation. This means that we consider the cubic polynomial q that inter-
polates the data z1 , z2 , z3 , z4 at the points ψ(z1 ), ψ(z2 ), ψ(z3 ), ψ(z4 ), and we obtain z5 as
z5 = q(0).
Another approach consists of producing a discrete version of the boundary value problem
y″ = f(t, y, y′),    t ∈ [a, b],    y(a) = α,    y(b) = β.
For this purpose, we partition the interval [a, b] into n equal subintervals of length h =
(b − a)/n and we exploit the centered-difference formulae
y′(t) = (1/2h)[y(t + h) − y(t − h)] − (h²/6) y^(3)(ξ),    ξ between t − h and t + h,
y″(t) = (1/h²)[y(t + h) − 2y(t) + y(t − h)] − (h²/12) y^(4)(ζ),    ζ between t − h and t + h,
to construct approximations yi of the solution at the mesh points ti = a + ih, i ∈ J0, nK.
Neglecting the error terms, the original problem translates into the system
(29.2)    y0 = α,    yi+1 − 2yi + yi−1 = h² f(ti, yi, [yi+1 − yi−1]/2h),  i ∈ J1, n − 1K,    yn = β.
In the linear case [i.e. for f(t, y, y′) = p(t)y′ + q(t)y + r(t)], the equations (29.2) for the
unknowns y1, . . . , yn−1 can be written in matrix form as the tridiagonal system

[ d1   b1                  ] [ y1   ]   [ c1 − a1 α     ]
[ a2   d2   b2             ] [ y2   ]   [ c2            ]
[      .    .    .         ] [  .   ] = [  .            ]
[        an−2  dn−2  bn−2  ] [ yn−2 ]   [ cn−2          ]
[              an−1  dn−1  ] [ yn−1 ]   [ cn−1 − bn−1 β ]

with di = 2 + h² q(ti),  ai = −1 − h p(ti)/2,  bi = −1 + h p(ti)/2,  ci = −h² r(ti).
Note that the system is tridiagonal, and can therefore be solved with a special Gaussian algorithm.
Observe also that, under the hypotheses of Corollary 29.2 and for h small enough,
the matrix is diagonally dominant, hence nonsingular, since |ai| + |bi| = 2 < |di|. Under the
same hypotheses, we can also show that the discrete solution converges towards the actual
solution, in the sense that max_{i∈J0,nK} |yi − y(ti)| → 0 as h → 0.
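In the linear case, the whole method fits in a few lines. The sketch below [in Python; the names are ours, and for brevity the tridiagonal structure is not exploited, a dense solve being used instead] treats the test problem y″ = y, y(0) = 0, y(1) = 1, whose exact solution is sinh(t)/sinh(1).

import numpy as np

def linear_bvp(p, q, r, a, b, alpha, beta, n):
    # finite differences for y'' = p(t) y' + q(t) y + r(t), y(a) = alpha, y(b) = beta;
    # returns the mesh points t_0, ..., t_n and the approximations y_0, ..., y_n
    h = (b - a) / n
    t = a + h * np.arange(n + 1)
    ti = t[1:n]                      # interior mesh points t_1, ..., t_{n-1}
    d = 2 + h**2 * q(ti)             # diagonal entries d_i
    lo = -1 - h * p(ti) / 2          # sub-diagonal entries a_i
    up = -1 + h * p(ti) / 2          # super-diagonal entries b_i
    c = -h**2 * r(ti)                # right-hand side c_i
    c[0] -= lo[0] * alpha
    c[-1] -= up[-1] * beta
    A = np.diag(d) + np.diag(lo[1:], -1) + np.diag(up[:-1], 1)
    y = np.concatenate(([alpha], np.linalg.solve(A, c), [beta]))
    return t, y

t, y = linear_bvp(lambda t: 0 * t, lambda t: 1 + 0 * t, lambda t: 0 * t,
                  0.0, 1.0, 0.0, 1.0, 50)
print(np.max(np.abs(y - np.sinh(t) / np.sinh(1))))   # discretization error of order h^2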
Newton's method for nonlinear systems can be called upon in this setting [see Section 13.3].
Writing (29.2) as a nonlinear system F(Y) = 0 for the unknown vector Y = [y1, . . . , yn−1]^⊤,
we construct a sequence (Y^[k]) of vectors according to the iteration formula
(29.3)    Y^[k+1] = Y^[k] − [F′(Y^[k])]^{−1} F(Y^[k]).
As in the unidimensional case, convergence is ensured provided that one starts with Y [0]
close enough to the solution and that the Jacobian matrix F 0 (Y [k] ) is invertible. Recall that
the Jacobian matrix displays the entries ∂fi/∂yj, so that it is tridiagonal, its i-th row having the entries
−1 − h ui/2,    2 + h² vi,    −1 + h ui/2
in the columns of yi−1, yi, yi+1 respectively, where
ui := fy′(ti, yi, [yi+1 − yi−1]/2h),    vi := fy(ti, yi, [yi+1 − yi−1]/2h).
This matrix appears to be diagonally dominant, hence invertible, for h small enough pro-
vided that we make the assumptions that fy′ is bounded and that fy is positive – see the
hypotheses of Theorem 29.1. Then the vector Y^[k+1] has to be explicitly computed from
(29.3), not by inverting the Jacobian matrix but rather by solving a linear system. This
task is not too demanding, since the Jacobian matrix is tridiagonal.
29.3 Collocation method
For a boundary-value problem written in the form
(29.4)    Ly = r,
where L is a linear differential operator, we look for an approximate solution of the form
y = c1 v1 + · · · + cn vn,
where the functions v1, . . . , vn are chosen at the start. By linearity of L, the equation (29.4) becomes
c1 Lv1 + · · · + cn Lvn = r.
In general, there is no hope of solving this for the coefficients c1, . . . , cn, since the solution y
has no reason to be a linear combination of v1, . . . , vn. Instead of equality everywhere, we
will merely require that the function ∑i ci Lvi interpolates the function r at n prescribed
points, say τ1, . . . , τn. Thus, we need to solve the n × n system
(29.5)    ∑_{i=1}^{n} ci (Lvi)(τj) = r(τj),    j ∈ J1, nK.
Let us suppose, for simplicity, that we work on the interval [0, 1] and that α = 0 and β = 0
[see how we can turn to this case anyway]. We select the functions v1, . . . , vn to be vi := vi,n, where vi,n(x) := x^i (1 − x)^{n+1−i} [these functions vanish at 0 and 1, so that the boundary conditions are automatically satisfied].
The calculation of Lvi,n does not present any difficulties and is in fact most efficient if we
use the relations
v′_{i,n} = i vi−1,n−1 − (n + 1 − i) vi,n−1,
v″_{i,n} = i(i − 1) vi−2,n−2 − 2i(n + 1 − i) vi−1,n−2 + (n + 1 − i)(n − i) vi,n−2.
It is not hard to realize [try it] that this choice of functions v1 , . . . , vn leads to a nonsingular
collocation system (29.5).
It is more popular, however, to consider some B-splines for the functions v1 , . . . , vn . A brief
account of B-splines will be given separately from this set of notes.
29.4 Exercises