Principles of Scientific Computing
by an amount that we call the error. An accurate result has a small error.
The goal of a scientific computation is rarely the exact answer, but a result that
is as accurate as needed. Throughout this book, we use A to denote the exact
answer to some problem and Â to denote the computed approximation to A.
The error is Â - A.
There are four primary ways in which error is introduced into a computation:
(i) Roundoff error from inexact computer arithmetic.
(ii) Truncation error from approximate formulas.
(iii) Termination of iterations.
(iv) Statistical error in Monte Carlo.
This chapter discusses the first of these in detail and the others more briefly.
There are whole chapters dedicated to them later on. What is important here
is to understand the likely relative sizes of the various kinds of error. This will
help in the design of computational algorithms. In particular, it will help us
focus our efforts on reducing the largest sources of error.
We need to understand the various sources of error to debug scientific com-
puting software. If a result is supposed to be A and instead is Â, we have to
ask if the difference between A and Â is the result of a programming mistake.
Some bugs are the usual kind: a mangled formula or a mistake in logic. Others
are peculiar to scientific computing. It may turn out that a certain way of
calculating something is simply not accurate enough.
Error propagation also is important. A typical computation has several
stages, with the results of one stage being the inputs to the next. Errors in
the output of one stage most likely mean that the output of the next would be
inexact even if the second stage computations were done exactly. It is unlikely
that the second stage would produce the exact output from inexact inputs. On
the contrary, it is possible to have error amplification. If the second stage output
is very sensitive to its input, small errors in the input could result in large errors
in the output; that is, the error will be amplified. A method with large error
amplification is unstable.
The condition number of a problem measures the sensitivity of the answer
to small changes in its input data. The condition number is determined by the
problem, not the method used to solve it. The accuracy of a solution is limited
by the condition number of the problem. A problem is called ill-conditioned
if the condition number is so large that it is hard or impossible to solve it
accurately enough.
A computational strategy is likely to be unstable if it has an ill-conditioned
subproblem. For example, suppose we solve a system of linear differential equa-
tions using the eigenvector basis of the corresponding matrix. Finding eigen-
vectors of a matrix can be ill-conditioned, as we discuss in Chapter 4. This makes
the eigenvector approach to solving linear differential equations potentially un-
stable, even when the differential equations themselves are well-conditioned.
2.1 Relative error, absolute error, and cancellation
When we approximate A by Â, the absolute error is e = Â - A, and the relative
error is ε = e/A. That is, Â = A(1 + ε).
For example, the negative integer -2752 can be written in binary floating point
form as

  -2752 = -(2^11 + 2^9 + 2^7 + 2^6)
        = (-1)^1 · 2^11 · (1 + 2^-2 + 2^-4 + 2^-5)
        = (-1)^1 · 2^11 · (1 + (.01)_2 + (.0001)_2 + (.00001)_2)
        = (-1)^1 · 2^11 · (1.01011)_2 .
The bits in a floating point word are divided into three groups. One bit
represents the sign: s = 1 for negative and s = 0 for positive, according to
(2.2). There are p - 1 bits for the mantissa and the rest for the exponent.
For example (see Figure 2.1), a 32-bit single precision floating point word has
p = 24, so there are 23 mantissa bits, one sign bit, and 8 bits for the exponent.
Floating point formats allow a limited range of exponents (e_min ≤ e ≤ e_max).
Note that in single precision, the number of possible exponents -126, -125, . . . , 126, 127,
is 254, which is two less than the number of 8 bit combinations (2^8 = 256). The
remaining two exponent bit strings (all zeros and all ones) have different inter-
pretations described in Section 2.2.4. The other floating point formats, double
precision and extended precision, also reserve the all zero and all one exponent
bit patterns.
The mantissa takes the form

  m = (1.b_1 b_2 b_3 . . . b_{p-1})_2 ,
where p is the total number of bits (binary digits)^1 used for the mantissa. In Fig-
ure 2.1, we list the exponent range for IEEE single precision (float in C/C++),
IEEE double precision (double in C/C++), and the extended precision on the
Intel processors (long double in C/C++).
Not every number can be exactly represented in binary floating point. For
example, just as 1/3 = .333 . . . cannot be written exactly as a finite decimal frac-
tion, 1/3 = (.010101 . . .)_2 also cannot be written exactly as a finite binary fraction.

^1 Because the first digit of a normal floating point number is always one, it is not stored
explicitly.
10 CHAPTER 2. SOURCES OF ERROR
  Name      C/C++ type     Bits    p    ε_mach = 2^-p     e_min      e_max
  Single    float            32   24    ≈ 6 · 10^-8        -126        127
  Double    double           64   53    ≈ 10^-16          -1022       1023
  Extended  long double      80   63    ≈ 5 · 10^-19     -16382      16383

Figure 2.1: Parameters for floating point formats.
If x is a real number, we write x̂ = round(x) for the floating point number (of a
given format) that is closest^2 to x. Finding x̂ is called rounding. The difference
round(x) - x = x̂ - x is rounding error. If x is in the range of normal floating
point numbers (2^e_min ≤ |x| < 2^(e_max+1)), then the closest floating point number
to x has a relative error not more than ε_mach, where the machine epsilon
ε_mach = 2^-p is half the distance between 1 and the next floating point number.
The IEEE standard for arithmetic operations (addition, subtraction, mul-
tiplication, division, square root) is: the exact answer, correctly rounded. For
example, the statement z = x*y gives z the value round(val(x) · val(y)). That
is: interpret the bit strings x and y using the floating point standard (2.2), per-
form the operation (multiplication in this case) exactly, then round the result
to the nearest floating point number. For example, the result of computing
1/(float)3 in single precision is

  (1.01010101010101010101011)_2 × 2^-2 .
Some properties of floating point arithmetic follow from the above rule. For
example, addition and multiplication are commutative: x*y = y*x. Division by
powers of 2 is done exactly if the result is a normalized number. Division by 3 is
rarely exact. Integers that are not too large are represented exactly. Integer arithmetic
(excluding division and square roots) is done exactly. This is illustrated in
Exercise 8.
Double precision floating point has smaller rounding errors because it has
more mantissa bits. It has roughly 16 digit accuracy (2^-53 ~ 10^-16, as opposed
to roughly 7 digit accuracy for single precision). It also has a larger range of
values. The largest double precision floating point number is 2^1023 ~ 10^307, as
opposed to 2^126 ~ 10^38 for single. The hardware in many processor chips does
arithmetic and stores intermediate results in extended precision; see below.
Rounding error occurs in most floating point operations. When using an
unstable algorithm or solving a very sensitive problem, even calculations that
would give the exact answer in exact arithmetic may give very wrong answers
in floating point arithmetic. Being exactly right in exact arithmetic does not
imply being approximately right in floating point arithmetic.
2.2.3 Modeling floating point error

Rounding error analysis models the generation and propagation of rounding

^2 If x is equally close to two floating point numbers, the answer is the number whose last
bit is zero.
errors over the course of a calculation. For example, suppose x, y, and z are
floating point numbers, and that we compute fl(x + y + z), where fl(·) denotes
the result of a floating point computation. Under IEEE arithmetic,

  fl(x + y) = round(x + y) = (x + y)(1 + ε_1),

where |ε_1| < ε_mach. A sum of more than two numbers must be performed
pairwise, and usually from left to right. For example:

  fl(x + y + z) = round( round(x + y) + z )
                = ( (x + y)(1 + ε_1) + z )(1 + ε_2)
                = (x + y + z) + (x + y)ε_1 + (x + y + z)ε_2 + (x + y)ε_1 ε_2 .
Here and below we use ε_1, ε_2, etc. to represent individual rounding errors.
It is often convenient to replace exact formulas by simpler approximations. For
example, we neglect the product ε_1 ε_2 because it is smaller than either ε_1 or ε_2
(by a factor of ε_mach). This leads to the useful approximation

  fl(x + y + z) ≈ (x + y + z) + (x + y)ε_1 + (x + y + z)ε_2 .
We also neglect higher terms in Taylor expansions. In this spirit, we have:

  (1 + ε_1)(1 + ε_2) ≈ 1 + ε_1 + ε_2                      (2.3)
  √(1 + ε) ≈ 1 + ε/2 .                                    (2.4)
As an example, we look at computing the smaller root of x^2 - 2x + ε = 0
using the quadratic formula

  x = 1 - √(1 - ε) .                                      (2.5)

The two terms on the right are approximately equal when ε is small. This can
lead to catastrophic cancellation. We will assume that ε is so small that (2.4)
applies to (2.5), and therefore x ≈ ε/2.
We start with the rounding errors from the 1 - ε subtraction and the square
root. We simplify with (2.3) and (2.4):

  fl(√(1 - ε)) = ( √( (1 - ε)(1 + ε_1) ) )(1 + ε_2)
               ≈ (√(1 - ε))(1 + ε_1/2 + ε_2) = (√(1 - ε))(1 + ε_d),

where |ε_d| = |ε_1/2 + ε_2| ≤ 1.5 ε_mach. This means that the relative error at this point
is of the order of machine precision but may be as much as 50% larger.
Now, we account for the error ε_3 in the second subtraction^3, using √(1 - ε) ≈ 1
and x ≈ ε/2 to simplify the error terms:

  fl(1 - √(1 - ε)) ≈ ( 1 - (√(1 - ε))(1 + ε_d) )(1 + ε_3)
                   = x( 1 - (√(1 - ε)/x) ε_d + ε_3 )
                   ≈ x( 1 - (2/ε) ε_d + ε_3 ) .
^3 For ε ≤ 0.75, this subtraction actually contributes no rounding error, since subtraction
of floating point values within a factor of two of each other is exact. Nonetheless, we will
continue to use our model of small relative errors in this step for the current example.
Therefore, for small ε we have

  (x̂ - x)/x ≈ ε_d / x ,

which says that the relative error from using the formula (2.5) is amplified from
ε_mach by a factor on the order of 1/x. The catastrophic cancellation in the final
subtraction leads to a large relative error. In single precision with x = 10^-5,
for example, we would have relative error on the order of ε_mach/x ≈ 6 · 10^-3.
We would only expect one or two correct digits in this computation.
In this case and many others, we can avoid catastrophic cancellation by
rewriting the basic formula: here, we could replace (2.5) by the mathematically
equivalent x = ε/(1 + √(1 - ε)).
  f'(x) ≈ ( f(x + h) - f(x) ) / h .                       (2.6)
This is not an exact formula for f'(x). The Taylor series

  f(x + h) = f(x) + h f'(x) + (1/2) h^2 f''(x) + · · ·

is truncated by neglecting all the terms after the first two on the right. This
leaves the approximation

  f(x + h) ≈ f(x) + h f'(x) ,
which can be rearranged to give (2.6). Truncation usually is the main source of
error in numerical integration or solution of differential equations. The analysis
of truncation error using Taylor series will occupy the next two chapters.
As an example, we take f(x) = x e^x, x = 1, and several h values. The
truncation error is

  e_tr = ( f(x + h) - f(x) ) / h - f'(x) .
In Chapter 3 we will see that (in exact arithmetic) e_tr roughly is proportional
to h for small h. This is apparent in Figure 2.3. As h is reduced from 10^-2 to
10^-5 (a factor of 10^-3), the error decreases from 4.10 · 10^-2 to 4.08 · 10^-5,
approximately the same factor of 10^-3.
The numbers in Figure 2.3 were computed in double precision floating point
arithmetic. The total error, e_tot, is a combination of truncation and roundoff
error. Roundoff error is significant for the smallest h values: for h = 10^-8 the
error is no longer proportional to h; by h = 10^-10 the error has increased. Such
small h values are rare in a practical calculation.
[Figure 2.3: errors of the difference quotient approximation for h = .3, .01, 10^-5, 10^-8, 10^-10.]

  Â = (1/n) Σ_{k=1}^{n} X_k .
The difference between Â and A is statistical error. A theorem in probability,
the law of large numbers, implies that Â → A as n → ∞. Monte Carlo statistical
errors typically are larger than roundoff or truncation errors. This makes Monte
Carlo a method of last resort, to be used only when other methods are not
practical.
Figure 2.4 illustrates the behavior of this Monte Carlo method for the ran-
dom variable X = (3/2) U^2 with U uniformly distributed in the interval [0, 1]. The
exact answer is A = E[X] = (3/2) E[U^2] = .5. The value n = 10^6 is repeated to
illustrate the fact that statistical error is random (see Chapter 9 for a clarifica-
tion of this). The errors even with a million samples are much larger than those
in the right columns of Figures 2.2 and 2.3.
2.6 Error propagation and amplification

Errors can grow as they propagate through a computation. For example, con-
sider the divided difference (2.6):
f1 = . . . ; // approx of f(x)
f2 = . . . ; // approx of f(x+h)
fPrimeHat = ( f2 - f1 ) / h ; // approx of derivative
There are three contributions to the final error in f̂':

  f̂' = f'(x)(1 + ε_pr)(1 + ε_tr)(1 + ε_r) ≈ f'(x)(1 + ε_pr + ε_tr + ε_r) .   (2.7)
^6 E[X] is the expected value of X. Chapter 9 has some review of probability.
It is unlikely that f_1 = f̂(x) ≈ f(x) is exact. Many factors may contribute to
the errors e_1 = f_1 - f(x) and e_2 = f_2 - f(x + h), including inaccurate x values
and roundoff in the code to evaluate f. The propagated error comes from using
inexact values of f(x + h) and f(x):
  (f_2 - f_1)/h = ( (f(x + h) - f(x))/h ) · ( 1 + (e_2 - e_1)/(f_2 - f_1) )
                = ( (f(x + h) - f(x))/h ) · (1 + ε_pr) .                      (2.8)
The truncation error in the difference quotient approximation is

  ( f(x + h) - f(x) ) / h = f'(x)(1 + ε_tr) .                                 (2.9)
Finally, there is roundoff error in evaluating ( f2 - f1 ) / h:

  f̂' = ( (f_2 - f_1)/h ) (1 + ε_r) .                                         (2.10)
If we multiply the errors from (2.8)-(2.10) and simplify, we get (2.7).
Most of the error in this calculation comes from truncation error and prop-
agated error. The subtraction and the division in the evaluation of the divided
difference each have relative error of at most ε_mach; thus, the roundoff error ε_r is
at most about 2ε_mach, which is relatively small^7. We noted in Section 2.3 that
truncation error is roughly proportional to h. The propagated error ε_pr in the
outputs is roughly proportional to the absolute input errors e_1 and e_2, amplified
by a factor of h^-1:

  ε_pr = (e_2 - e_1)/(f_2 - f_1) ≈ (e_2 - e_1)/( f'(x) h ) .
Even if ε_1 = e_1/f(x) and ε_2 = e_2/f(x + h) are small, ε_pr may be quite large.
This increase in relative error by a large factor in one step is another example
of catastrophic cancellation, which we described in Section 2.1. If the numbers
f(x) and f(x + h) are nearly equal, the difference can have much less relative
accuracy than the numbers themselves. More subtle is gradual error growth over
many steps. Exercise 2.15 has an example in which the error roughly doubles
at each stage. Starting from double precision roundoff level, the error after 30
steps is negligible, but the error after 60 steps is larger than the answer.
An algorithm is unstable if its error mainly comes from amplification. This
numerical instability can be hard to discover by standard debugging techniques
that look for the first place something goes wrong, particularly if there is gradual
error growth.
In scientific computing, we use stability theory, or the study of propagation
of small changes by a process, to search for error growth in computations. In a
typical stability analysis, we focus on propagated error only, ignoring the original
sources of error. For example, Exercise 8 involves the backward recurrence
f_{k-1} = f_{k+1} - f_k. In our stability analysis, we assume that the subtraction
is performed exactly and that the error in f_{k-1} is entirely due to errors in f_k

^7 If h is a power of two and f_2 and f_1 are within a factor of two of each other, then ε_r = 0.
and f_{k+1}. That is, if f̂_k = f_k + e_k is the computer approximation, then the e_k
satisfy the error propagation equation e_{k-1} = e_{k+1} - e_k. We then would use the
theory of recurrence relations to see whether the e_k can grow relative to the f_k
as k decreases. If this error growth is possible, it will usually happen.
2.7 Condition number and ill-conditioned problems

The condition number κ of a problem measures the sensitivity of the answer to
small changes in the data. If κ is the condition number, then we expect relative
error at least κ ε_mach, regardless of the algorithm. A problem with large condition
number is ill-conditioned. For example, if κ > 10^7, then there probably is no
algorithm that gives anything like the right answer in single precision arithmetic.
Condition numbers as large as 10^7 or 10^16 can and do occur in practice.
The definition of κ is simplest when the answer is a single number that
depends on a single scalar variable, x: A = A(x). A change in x causes a
change in A: ΔA = A(x + Δx) - A(x). The condition number measures the
relative change in A caused by a small relative change of x:

  |ΔA / A| ≈ κ |Δx / x| .                                                    (2.11)
Any algorithm that computes A(x) must round x to the nearest floating point
number, x̂. This creates a relative error (assuming x is within the range of
normalized floating point numbers) of |Δx/x| = |(x̂ - x)/x| ≤ ε_mach. If the rest
of the computation were done exactly, the computed answer would be Â(x) =
A(x̂) and the relative error would be (using (2.11))

  | Â(x) - A(x) | / |A(x)| = | A(x̂) - A(x) | / |A(x)| ≈ κ |(x̂ - x)/x| ≈ κ ε_mach .   (2.12)
If A is a differentiable function of x with derivative A'(x), then ΔA ≈ A'(x) Δx
for small Δx, and (2.11) gives

  κ ≈ | A'(x) | · | x / A(x) | .                                             (2.13)

An algorithm is unstable if relative errors in the output are much larger than
relative errors in the input. This analysis argues that any computation for an
ill-conditioned problem must be unstable. Even if A(x) is evaluated exactly,
relative errors in the input of size ε are amplified by a factor of κ. The formulas
(2.12) and (2.13) represent a lower bound for the accuracy of any algorithm.
An ill-conditioned problem is not going to be solved accurately, unless it can be
reformulated to improve the conditioning.
A backward stable algorithm is as stable as the condition number allows.
This sometimes is stated as: Â(x) = A(x̃) for some x̃ that is within on the order
of ε_mach of x, |x̃ - x| / |x| ≤ C ε_mach, where C is a modest constant. That is, the
computer produces a result, Â, that is the exact correct answer to a problem
that differs from the original one only by roundoff error in the input. Many
algorithms in computational linear algebra in Chapter 5 are backward stable in
roughly this sense. This means that they are unstable only when the underlying
problem is ill-conditioned.
The condition number (2.13) is dimensionless because it measures relative
sensitivity. The extra factor x/A(x) removes the units of x and A. Absolute
sensitivity is just |A'(x)|. Note that both sides of our starting point (2.11) are
dimensionless with dimensionless κ.
As an example consider the problem of evaluating A(x) = R sin(x). The
condition number formula (2.13) gives

  κ(x) = | cos(x) | · | x | / | sin(x) | .
Note that the problem remains well-conditioned (κ is not large) as x → 0, even
though A(x) is small when x is small. For extremely small x, the calculation
could suffer from underflow. But the condition number blows up as x → π,
because small relative changes in x lead to much larger relative changes in A.
This illustrates a quirk of the condition number definition: typical values of A
have the order of magnitude R and we can evaluate A with error much smaller
than this, but certain individual values of A may not be computed to high
relative precision. In most applications that would not be a problem.
There is no perfect definition of condition number for problems with more
than one input or output. Suppose at first that the single output A(x) depends
on n inputs x = (x_1, . . . , x_n). Of course A may have different sensitivities to
different components of x. For example, Δx_1/x_1 = 1% may change A much
more than Δx_2/x_2 = 1%. If we view (2.11) as saying that |ΔA/A| ≈ κε for
|Δx/x| = ε, a worst case multicomponent generalization could be

  κ = (1/ε) · max |ΔA / A| ,   where   |Δx_k / x_k| ≤ ε for all k.
We seek the worst case^8 Δx. For small ε, we write

  ΔA ≈ Σ_{k=1}^{n} (∂A/∂x_k) Δx_k ,

then maximize subject to the constraint |Δx_k| ≤ ε |x_k| for all k. The maxi-
mum occurs at Δx_k = ±ε x_k, which (with some algebra) leads to one possible
generalization of (2.13):

  κ = Σ_{k=1}^{n} | (∂A/∂x_k) · x_k / A | .                                   (2.14)

^8 As with rounding, typical errors tend to be on the order of the worst case error.
This formula is useful if the inputs are known to similar relative accuracy, which
could happen even when the x_k have different orders of magnitude or different
units. Condition numbers for multivariate problems are discussed using matrix
norms in Section 4.3. The analogue of (2.13) is (4.28).
2.8 Software

Each chapter of this book has a Software section. Taken together they form
a mini-course in software for scientific computing. The material ranges from
simple tips to longer discussions of bigger issues. The programming exercises
illustrate the chapter's software principles as well as the mathematical material
from earlier sections.
Scientific computing projects fail because of bad software as often as they
fail because of bad algorithms. The principles of scientific software are less
precise than the mathematics of scientific computing, but are just as important.
Like other programmers, scientific programmers should follow general software
engineering principles of modular design, documentation, and testing. Unlike
other programmers, scientific programmers must also deal with the approximate
nature of their computations by testing, analyzing, and documenting the error
in their methods, and by composing modules in a way that does not needlessly
amplify errors. Projects are handsomely rewarded for extra efforts and care
taken to do the software right.
2.8.1 General software principles

This is not the place for a long discussion of general software principles, but
it may be worthwhile to review a few that will be in force even for the small
programs called for in the exercises below. In our classes, students are required
to submit the program source as well as the output, and points are deducted for
failing to follow the basic software protocols listed here.
Most software projects are collaborative. Any person working on the project
will write code that will be read and modified by others. He or she will also read
and modify code written by others. In single person projects, that
other person may be you, a few months later, so you will end up helping
yourself by coding in a way friendly to others.

- Make code easy for others to read. Choose helpful variable names, not
  too short or too long. Indent the bodies of loops and conditionals. Align
  similar code to make it easier to understand visually. Insert blank lines to
  separate paragraphs of code. Use descriptive comments liberally; these
  are part of the documentation.

- Code documentation includes comments and well chosen variable names.
  Larger codes need separate documents describing how to use them and
  how they work.
void do_something(double tStart, double tFinal, int n)
{
double dt = ( tFinal - tStart ) / n; // Size of each step
for (double t = tStart; t < tFinal; t += dt) {
// Body of loop
}
}
Figure 2.5: A code fragment illustrating a pitfall of using a floating point variable
to regulate a for loop.
- Divide large computing tasks into modules that perform sub-tasks.

- A code design includes a plan for building it, which includes a plan for
  testing. Plan to create separate routines that test the main components
  of your project.

- Know and use the tools in your programming environment. This includes
  code specific editors, window based debuggers, formal or informal version
  control, and make or other building and scripting tools.
2.8.2 Coding for floating point

Programmers who forget the inexactness of floating point may write code with
subtle bugs. One common mistake is illustrated in Figure 2.5. For tFinal
> tStart, this code would give the desired n iterations in exact arithmetic.
Because of rounding error, though, the variable t might be above or below
tFinal, and so we do not know whether this code will execute the loop body
n times or n + 1 times. The routine in Figure 2.5 has other issues, too. For
example, if tFinal <= tStart, then the loop body will never execute. The
loop will also never execute if either tStart or tFinal is a NaN, since any
comparison involving a NaN is false. If n = 0, then dt will be Inf. Figure 2.6
uses exact integer arithmetic to guarantee n executions of the for loop.
Floating point results may depend on how the code is compiled and executed.
For example, w = x + y + z is executed as if it were A = x + y; w = A + z.
The new anonymous variable, A, may be computed in extended precision or in
the precision of x, y, and z, depending on factors often beyond the control of the
programmer. The results will be as accurate as they should be, but they will
not be exactly identical. For example, on a Pentium running Linux^9, the statement

double eps = 1e-16;
cout << 1 + eps - 1 << endl;

^9 This is a 64-bit Pentium 4 running Linux 2.6.18 with GCC version 4.1.2 and Intel C
version 10.0.
void do_something(double tStart, double tFinal, int n)
{
double dt = ( tFinal - tStart ) / n;
double t = tStart;
for (int i = 0; i < n; ++i) {
// Body of loop
t += dt;
}
}
Figure 2.6: A code fragment using an integer variable to regulate the loop of
Figure 2.5. This version also is more robust in that it runs correctly if tFinal
<= tStart or if n = 0.

prints 1e-16 when compiled with the Intel C compiler, which uses 80-bit inter-
mediate precision to compute 1 + eps, and 0 when compiled with the GNU C
compiler, which uses double precision for 1 + eps.
Some compilers take what seem like small liberties with the IEEE floating
point standard. This is particularly true with high levels of optimization.
Therefore, the output may change when the code is recompiled with a higher
level of optimization. This is possible even if the code is absolutely correct.
2.8.3 Plotting

Visualization tools range from simple plots and graphs, to more sophisticated
surface and contour plots, to interactive graphics and movies. They make it
possible to explore and understand data more reliably and faster than simply
looking at lists of numbers. We discuss only simple graphs here, with future
software sections discussing other tools. Here are some things to keep in mind.
Learn your system and your options. Find out what visualization tools are
available or easy to get on your system. Choose a package designed for scientific
visualization, such as Matlab (or Gnuplot, R, Python, etc.), rather than one
designed for commercial presentations such as Excel. Learn the options such
as line style (dashes, thickness, color, symbols), labeling, etc. Also, learn how
to produce figures that are easy to read on the page as well as on the screen.
Figure 2.7 shows a Matlab script that sets the graphics parameters so that
printed figures are easier to read.
Use scripting and other forms of automation. You will become frustrated
typing several commands each time you adjust one detail of the plot. Instead,
assemble the sequence of plot commands into a script.
Choose informative scales and ranges. Figure 2.8 shows the convergence of
the fixed-point iteration from Section 2.4 by plotting the residual r_i = |x_i + log(x_i) - log(y)|
% Set MATLAB graphics parameters so that plots are more readable
% in print. These settings are taken from the front matter in
% L.N. Trefethen's book Spectral Methods in MATLAB (SIAM, 2000),
% which includes many beautifully produced MATLAB figures.
set(0, 'defaultaxesfontsize', 12, ...
       'defaultaxeslinewidth', .7, ...
       'defaultlinelinewidth', .8, ...
       'defaultpatchlinewidth', .7);

Figure 2.7: A Matlab script to set graphics parameters so that printed figures
are easier to read.
against the iteration number. On a linear scale (first plot), all the
residual values after the tenth step are indistinguishable to plotting accuracy
from zero. The semi-logarithmic plot shows how big each of the iterates is. It
also makes clear that log(r_i) is nearly proportional to i; this is the hallmark
of linear convergence, which we will discuss more in Chapter 6.
Combine curves you want to compare into a single figure. Stacks of graphs
are as frustrating as arrays of numbers. You may have to scale different curves
differently to bring out the relationship they have to each other. If the curves
are supposed to have a certain slope, include a line with that slope. If a certain
x or y value is important, draw a horizontal or vertical line to mark it in the
figure. Use a variety of line styles to distinguish the curves. Exercise 9 illustrates
some of these points.
Make plots self-documenting. Figure 2.8 illustrates mechanisms in Matlab
for doing this. The horizontal and vertical axes are labeled with values and text.
Parameters from the run, in this case x_1 and y, are embedded in the title.
2.9 Further reading

We were inspired to start our book with a discussion of sources of error by
the book Numerical Methods and Software by David Kahaner, Cleve Moler,
and Steve Nash [13]. Another interesting version is in Scientific Computing
by Michael Heath [9]. A much more thorough description of error in scientific
computing is Higham's Accuracy and Stability of Numerical Algorithms [10].
Acton's Real Computing Made Real [1] is a jeremiad against common errors in
computation, replete with examples of the ways in which computations go awry
and ways those computations may be fixed.
A classic paper on floating point arithmetic, readily available online, is Gold-
berg's "What Every Computer Scientist Should Know About Floating-Point
Arithmetic" [6]. Michael Overton has written a nice short book, IEEE Floating
Figure 2.8: Plots of the convergence of x_{i+1} = log(y) - log(x_i), with x_1 = 1,
y = 10, to a fixed point on linear and logarithmic scales. The vertical axis shows
the residual |x_i + log(x_i) - log(y)|.
% sec2_8b(x1, y, n, fname)
%
% Solve the equation x*exp(x) = y by the fixed-point iteration
%   x(i+1) = log(y) - log(x(i));
% and plot the convergence of the residual |x+log(x)-log(y)| to zero.
%
% Inputs:
%   x1    - Initial guess (default: 1)
%   y     - Right side (default: 10)
%   n     - Maximum number of iterations (default: 70)
%   fname - Name of an eps file for printing the convergence plot.
function sec2_8b(x, y, n, fname)

% Set default parameter values
if nargin < 1, x = 1; end
if nargin < 2, y = 10; end
if nargin < 3, n = 70; end

% Compute the iterates
for i = 1:n-1
  x(i+1) = log(y) - log(x(i));
end

% Plot the residual error vs iteration number on a log scale
f = x + log(x) - log(y);
semilogy(abs(f), 'x-');

% Label the x and y axes
xlabel('i');
ylabel('Residual |x_i + log(x_i) - log(y)|');

% Add a title. The sprintf command lets us format the string.
title(sprintf(['Convergence of x_{i+1} = log(y)-log(x_i), ' ...
               'with x_1 = %g, y = %g\n'], x(1), y));
grid;  % A grid makes the plot easier to read

% If a filename is provided, print to that file (sized for book)
if nargin == 4
  set(gcf, 'PaperPosition', [0 0 6 3]);
  print('-deps', fname);
end

Figure 2.9: Matlab code to plot convergence of the iteration from Section 2.4.
Point Arithmetic [18] that goes into more detail.
There are many good books on software design, among which we recom-
mend The Pragmatic Programmer by Hunt and Thomas [11] and The Practice
of Programming by Kernighan and Pike [14]. There are far fewer books that
specifically address numerical software design, but one nice exception is Writing
Scientific Software by Oliveira and Stewart [16].
2.10 Exercises

1. It is common to think of π^2 = 9.87 as approximately ten. What are the
absolute and relative errors in this approximation?
2. If x and y have type double, and ( fabs( x - y ) >= 10 ) evaluates
to TRUE, does that mean that y is not a good approximation to x in the
sense of relative error?
3. Show that f_{jk} = sin(x_0 + (j - k)π/3) satisfies the recurrence relation

     f_{j,k+1} = f_{j,k} - f_{j+1,k} .                                        (2.15)
We view this as a formula that computes the f values on level k +1 from
the f values on level k. Let
f
jk
for k 0 be the oating point numbers
that come from implementing f
j0
= sin(x
0
+j/3) and (2.15) (for k > 0)
in double precision floating point. Estimate the sizes of the differences f̂_{j,k+1} − f_{j,k+1} in terms of the differences f̂_{jk} − f_{jk}, which we expect to be about 10^−10. Is the problem too ill-conditioned for single precision? For double precision?
6. Use the fact that the floating point sum x ⊕ y has relative error < ε_mach to show that the absolute error in the sum S computed below is no worse than (n − 1) ε_mach Σ_{i=0}^{n−1} |x_i|:
double compute_sum(double x[], int n)
{
double S = 0;
for (int i = 0; i < n; ++i)
S += x[i];
return S;
}
7. Starting with the declarations
float x, y, z, w;
const float oneThird = 1/ (float) 3;
const float oneHalf = 1/ (float) 2;
// const means these never are reassigned
we do lots of arithmetic on the variables x, y, z, w. In each case below, determine whether the two arithmetic expressions result in the same floating point number (down to the last bit) as long as no NaN or inf values or denormalized numbers are produced.
(a)
( x * y ) + ( z - w )
( z - w ) + ( y * x )
(b)
( x + y ) + z
x + ( y + z )
(c)
x * oneHalf + y * oneHalf
( x + y ) * oneHalf
(d) x * oneThird + y * oneThird
( x + y ) * oneThird
8. The Fibonacci numbers, f_k, are defined by f_0 = 1, f_1 = 1, and

   f_{k+1} = f_k + f_{k−1}   (2.16)

for any integer k ≥ 1. A small perturbation of them, the pib numbers (p instead of f to indicate a perturbation), p_k, are defined by p_0 = 1, p_1 = 1, and

   p_{k+1} = c · p_k + p_{k−1}

for any integer k ≥ 1, where c = 1 + √3/100.
(a) Plot the f_n and p_n together on one log scale plot. On the plot, mark 1/ε_mach for single and double precision arithmetic. This can be useful in answering the questions below.

(b) Rewrite (2.16) to express f_{k−1} in terms of f_k and f_{k+1}. Use the computed f_n and f_{n−1} to recompute f_k for k = n − 2, n − 3, . . . , 0. Make a plot of the difference between the original f_0 = 1 and the recomputed f̂_0 as a function of n. What n values result in no accuracy for the recomputed f_0? How do the results in single and double precision differ?

(c) Repeat part (b) for the pib numbers. Comment on the striking difference in the way precision is lost in these two cases. Which is more typical? Extra credit: predict the order of magnitude of the error in recomputing p_0 using what you may know about recurrence relations and what you should know about computer arithmetic.
9. The binomial coefficients, a_{n,k}, are defined by

   a_{n,k} = (n choose k) = n! / (k! (n − k)!) .

To compute the a_{n,k} for a given n, start with a_{n,0} = 1 and then use the recurrence relation a_{n,k+1} = ((n − k)/(k + 1)) · a_{n,k}.
(a) For a range of n values, compute the a_{n,k} this way, noting the largest a_{n,k} and the accuracy with which a_{n,n} = 1 is computed. Do this in single and double precision. Why is roundoff not a problem here as it was in problem 8? Find n values for which a_{n,n} ≈ 1 in double precision but not in single precision. How is this possible, given that roundoff is not a problem?
(b) Use the algorithm of part (a) to compute

   E(k) = (1/2^n) Σ_{k=0}^{n} k · a_{n,k} = n/2 .   (2.17)

Write a program without any safeguards against overflow or zero divide (this time only!)^10. Show (both in single and double precision) that the computed answer has high accuracy as long as the intermediate results are within the range of floating point numbers. As with (a), explain how the computer gets an accurate, small, answer when the intermediate numbers have such a wide range of values. Why is cancellation not a problem? Note the advantage of a wider range of values: we can compute E(k) for much larger n in double precision. Print E(k) as computed by (2.17) and M_n = max_k a_{n,k}. For large n, one should be inf and the other NaN. Why?
^10 One of the purposes of the IEEE floating point standard was to allow a program with overflow or zero divide to run and print results.
(c) For fairly large n, plot a_{n,k}/M_n as a function of k for a range of k chosen to illuminate the interesting bell shaped behavior of the a_{n,k} near k = n/2. Combine the curves for n = 10, n = 20, and n = 50 in a single plot. Choose the three k ranges so that the curves are close to each other. Choose different line styles for the three curves.
Chapter 3
Local Analysis
Among the most common computational tasks are differentiation, interpolation, and integration. The simplest methods used for these operations are finite difference approximations for derivatives, polynomial interpolation, and panel method integration. Finite difference formulas, integration rules, and interpolation form the core of most scientific computing projects that involve solving differential or integral equations.
Finite difference formulas range from simple low order approximations (3.14a)-(3.14c) to more complicated high order methods such as (3.14e). High order methods can be far more accurate than low order ones. This can make the difference between getting useful answers and not in serious applications. We will learn to design highly accurate methods rather than relying on simple but often inefficient low order ones.
Many methods for these problems involve a step size, h. For each h there is an approximation^1 Â(h) ≈ A. We say Â is consistent^2 if Â(h) → A as h → 0. For example, we might estimate A = f′(x) using a finite difference formula; the accuracy of such estimates rests on Taylor approximations.^3 The first order approximation is

   f(x + h) ≈ f(x) + f′(x)h .   (3.1)

The second order approximation is more complicated and more accurate:

   f(x + h) ≈ f(x) + f′(x)h + ½ f′′(x)h² .   (3.2)
Figure 3.1 illustrates the first and second order approximations. Truncation error is the difference between f(x + h) and one of these approximations. For example, the truncation error for the first order approximation is

   f(x) + f′(x)h − f(x + h) .   (3.3)
^1 In the notation of Chapter 2, Â is an estimate of the desired answer, A.
^2 Here, and in much of this chapter, we implicitly ignore rounding errors. Truncation errors are larger than rounding errors in the majority of applications.
^3 The nominal order of accuracy may only be achieved if f is smooth enough, a point that is important in many applications.
31
-1
0
1
2
3
4
5
6
0.3 0.4 0.5 0.6 0.7 0.8 0.9
x
f
f(x) and approximations up to second order
f
Constant
Linear
Quadratic
Figure 3.1: Plot of f(x) = xe^{2x} together with Taylor series approximations of order zero, one, and two. The base point is x = .6 and h ranges from −.3 to .3. Notice the convergence for fixed h as the order increases. Also, the higher order curves make closer contact with f(x) as h → 0.
The more accurate Taylor approximation (3.2) allows us to estimate the error in (3.14a). Substituting (3.2) into (3.14a) gives

   Â(h) ≈ A + A_1 h ,   A_1 = ½ f′′(x) .   (3.4)

This asymptotic error expansion is an estimate of Â − A for a given f and h. It shows that the error is roughly proportional to h, for small h. This understanding of truncation error leads to more sophisticated computational strategies. Richardson extrapolation combines Â(h) and Â(2h) to create higher order estimates with much less error. Adaptive methods are automatic ways to find h such that Â(h) is as accurate as needed.

3.1 Taylor Series and Asymptotic Expansions

The Taylor series expansion of a smooth function f about the point x is

   f(x + h) = Σ_{n=0}^{∞} (1/n!) f^{(n)}(x) h^n .   (3.5)
The notation f^{(n)}(x) refers to the n-th derivative of f evaluated at x. The partial sum of order p is a degree p polynomial in h:

   F_p(x, h) = Σ_{n=0}^{p} (1/n!) f^{(n)}(x) h^n .   (3.6)

The partial sum F_p is the Taylor approximation to f(x + h) of order p. It is a polynomial of order p in the variable h. Increasing p makes the approximation more complicated and more accurate. The order p = 0 partial sum is simply F_0(x, h) = f(x). The first and second order approximations are (3.1) and (3.2) respectively.
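To make (3.6) concrete, here is a small Python sketch (the choice f(x) = e^x at x = 0 and the step sizes are arbitrary) that evaluates the partial sums F_p and checks that halving h shrinks the error by about 2^{p+1}, as order p + 1 accuracy predicts:

```python
import math

def taylor_partial_sum(derivs, h):
    """F_p(x, h): sum of f^(n)(x) h^n / n! over n = 0..p, given the
    derivative values derivs = [f(x), f'(x), ..., f^(p)(x)]."""
    return sum(d * h**n / math.factorial(n) for n, d in enumerate(derivs))

# For f(x) = e^x at x = 0 every derivative equals 1, so the exact
# value of f(x + h) is e^h and F_p is easy to form for any p.
for p in range(4):
    derivs = [1.0] * (p + 1)
    err_h = abs(taylor_partial_sum(derivs, 0.1) - math.exp(0.1))
    err_half = abs(taylor_partial_sum(derivs, 0.05) - math.exp(0.05))
    # |F_p - f(x+h)| = O(h^(p+1)): the ratio should be near 2^(p+1).
    print(p, err_h / err_half)
```

The printed ratios sit close to 2, 4, 8, and 16, one order of accuracy per added term.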
The Taylor series sum converges if the partial sums converge to f:

   lim_{p→∞} F_p(x, h) = f(x + h) .

If there is a positive h_0 so that the series converges whenever |h| < h_0, then f is analytic at x. A function probably is analytic at most points if there is a formula for it, or it is the solution of a differential equation. Figure 3.1 plots a function f(x) = xe^{2x} together with the Taylor approximations of order zero, one, and two. The symbols at the ends of the curves illustrate the convergence of this Taylor series when h = ±.3. When h = .3, the series converges monotonically: F_0 < F_1 < F_2 < ··· → f(x + h). When h = −.3, there are approximants on both sides of the answer: F_0 > F_2 > f(x + h) > F_1.
For our purposes, we will often use the partial sums F_p without letting p go to infinity. To analyze how well the approximations F_p approximate f near the base point x, it is useful to introduce big-O notation. If B_1(h) > 0 for h ≠ 0, we write B_2(h) = O(B_1(h)) (as h → 0)^4 to mean there is a C such that |B_2(h)| ≤ C B_1(h) in an open neighborhood of h = 0. A common misuse of the notation, which we will follow, is to write B_2(h) = O(B_1(h)) when B_1 can be negative, rather than B_2(h) = O(|B_1(h)|). We say B_2 is an order smaller than B_1 (or an order of approximation smaller) if B_2(h) = O(h B_1(h)). If B_2 is an order smaller than B_1, then we have |B_2(h)| < |B_1(h)| for small enough h (h < 1/C), and reducing h further makes B_2 much smaller than B_1.
An asymptotic expansion is a sequence of approximations with increasing order of accuracy, like the partial sums F_p(x, h). We can make an analogy
^4 Big-O notation is also used to describe the asymptotic behavior of functions as they go to infinity; we say f(t) = O(g(t)) for g ≥ 0 as t → ∞ if |f(t)|/g(t) is bounded for all t > T. It will usually be clear from context whether we are looking at limits going to zero or going to infinity.
between asymptotic expansions of functions and decimal expansions of numbers if we think of orders of magnitude rather than orders of accuracy. The expansion

   π = 3.141592··· = 3 + 1·(.1) + 4·(.1)² + 1·(.1)³ + 5·(.1)⁴ + ···   (3.7)

is a sequence of approximations

   Â_0 = 3
   Â_1 = 3 + 1·(.1) = 3.1
   Â_2 = 3 + 1·(.1) + 4·(.1)² = 3.14
   etc.

The approximation Â_p is an order of magnitude more accurate than Â_{p−1}. The error Â_p − π is of the order of magnitude (.1)^{p+1}, which also is the order of magnitude of the first neglected term. The error in Â_3 = 3.141 is |Â_3 − π| ≈ 6·10^{−4}. This error is approximately the same as the next term, 5·(.1)⁴. Adding the next term gives an approximation whose error is an order of magnitude smaller:

   Â_3 + 5·(.1)⁴ = Â_4 ,   |Â_4 − π| ≈ 9·10^{−5} .
We change notation when we view the Taylor series as an asymptotic expansion, writing

   f(x + h) ∼ f(x) + f′(x)·h + ½ f′′(x)h² + ··· .   (3.8)

This means that the right side is an asymptotic series that may or may not converge. It represents f(x + h) in the sense that the partial sums F_p(x, h) are a family of approximations of increasing order of accuracy:

   |F_p(x, h) − f(x + h)| = O(h^{p+1}) .   (3.9)

The asymptotic expansion (3.8) is much like the decimal expansion (3.7). The term in (3.8) of order O(h²) is ½ f′′(x)·h². The term in (3.7) of order of magnitude 10^{−2} is 4·(.1)². The error in a p term approximation is roughly the first neglected term, since all other neglected terms are at least one order smaller.
Figure 3.1 illustrates the asymptotic nature of the Taylor approximations. The lowest order approximation is F_0(x, h) = f(x). The graph of F_0 touches the graph of f when h = 0 but otherwise has little in common. The graph of F_1(x, h) = f(x) + f′(x)h is the tangent line, which makes closer contact. The Taylor series remainder theorem^5 says that, for some ξ ∈ [x, x + h],

   f(x + h) − F_p(x, h) = (1/(p+1)!) f^{(p+1)}(ξ) h^{p+1} .   (3.10)

If we take

   C = (1/(p+1)!) max_{y ∈ [x, x+h]} |f^{(p+1)}(y)| ,

then we find that

   |F_p(x, h) − f(x + h)| ≤ C h^{p+1} .

This is the proof of (3.9), which states that the Taylor series is an asymptotic expansion.
The approximation F_p(x, h) includes terms in the sum (3.8) through order p. The first neglected term is (1/(p+1)!) f^{(p+1)}(x) h^{p+1}, the term of order p + 1. This also is the difference F_{p+1} − F_p. It differs from the Taylor series remainder (3.10) only in ξ being replaced by x. Since ξ ∈ [x, x + h], this is a small change if h is small. Therefore, the error in the F_p is nearly equal to the first neglected term.
An asymptotic expansion can converge to the wrong answer or not converge at all. We give an example of each. These are based on the fact that exponentials beat polynomials in the sense that, for any n,

   t^n e^{−t} → 0 as t → ∞ .

If we take t = a/|x| (because x may be positive or negative), this implies that

   (1/x^n) e^{−a/|x|} → 0 as x → 0 .   (3.11)
Consider the function f(x) = e^{−1/|x|}. This function is continuous at x = 0 if we define f(0) = 0. The derivative at zero is (using (3.11))

   f′(0) = lim_{h→0} (f(h) − f(0))/h = lim_{h→0} e^{−1/|h|}/h = 0 .

When x ≠ 0, we calculate f′(x) = (1/|x|²) e^{−1/|x|} for x > 0, with the opposite sign for x < 0. The first derivative is continuous at x = 0 because (3.11) implies that f′(x) → 0 = f′(0) as x → 0. Continuing in this way, one can see that each of the higher derivatives vanishes at x = 0 and is continuous. Therefore F_p(0, h) = 0 for any p, as f^{(n)}(0) = 0 for all n. Thus clearly F_p(0, h) → 0 as p → ∞. But f(h) = e^{−1/|h|} > 0 if h ≠ 0, so the Taylor series, while asymptotic, converges to the wrong answer.
^5 See any good calculus book for a derivation and proof.
What goes wrong here is that the derivatives f^{(p)} are zero at x = 0, but they are large for x close to zero. The remainder theorem (3.10) implies that F_p(x, h) → f(x + h) as p → ∞ if

   M_p = (h^p / p!) max_{x ≤ ξ ≤ x+h} |f^{(p)}(ξ)| → 0 as p → ∞ .

Taking x = 0 and any h > 0, the function f(x) = e^{−1/|x|} has M_p → ∞ as p → ∞.
Here is an example of an asymptotic Taylor series that does not converge at all. Consider

   f(h) = ∫_0^{1/2} e^{−x/h} · 1/(1 − x) dx .   (3.12)
The integrand goes to zero exponentially as h → 0 for any fixed x. This suggests^6 that most of the integral comes from values of x near zero and that we can approximate the integral by approximating the integrand near x = 0. Therefore, we write 1/(1 − x) = 1 + x + x² + ···, which converges for all x in the range of integration. Integrating separately gives

   f(h) = ∫_0^{1/2} e^{−x/h} dx + ∫_0^{1/2} e^{−x/h} x dx + ∫_0^{1/2} e^{−x/h} x² dx + ··· .

We get a simple formula for the integral of the general term e^{−x/h} x^n if we change the upper limit from 1/2 to ∞. For any fixed n, changing the upper limit of integration makes an exponentially small change in the integral, see problem (6). Therefore the n-th term is (for any p > 0)

   ∫_0^{1/2} e^{−x/h} x^n dx = ∫_0^{∞} e^{−x/h} x^n dx + O(h^p)
                             = n! h^{n+1} + O(h^p) .
Assembling these gives

   f(h) ∼ h + h² + 2h³ + ··· + (n − 1)! h^n + ···   (3.13)

This is an asymptotic expansion because the partial sums are asymptotic approximations:

   |h + h² + 2h³ + ··· + (p − 1)! h^p − f(h)| = O(h^{p+1}) .

But the infinite sum does not converge; for any h > 0 we have n! h^{n+1} → ∞ as n → ∞.
In these examples, the higher order approximations have smaller ranges of validity. For (3.12), the three term approximation f(h) ≈ h + h² + 2h³ is reasonably accurate when h = .3 but the six term approximation is less accurate, and the ten term approximation is 4.06 for an answer less than .5. The ten term approximation is very accurate when h = .01 but the fifty term approximation is astronomical.

^6 A more precise version of this intuitive argument is in exercise 6.
3.2 Numerical Differentiation

One basic numerical task is estimating the derivative of a function from function values. Suppose we have a smooth function, f(x), of a single variable, x. The problem is to combine several values of f to estimate f′(x) with a finite difference^7 formula such as:

   f′(x) ≈ (f(x + h) − f(x)) / h   (3.14a)

   f′(x) ≈ (f(x) − f(x − h)) / h   (3.14b)

   f′(x) ≈ (f(x + h) − f(x − h)) / (2h)   (3.14c)

   f′(x) ≈ (−f(x + 2h) + 4f(x + h) − 3f(x)) / (2h)   (3.14d)

   f′(x) ≈ (−f(x + 2h) + 8f(x + h) − 8f(x − h) + f(x − 2h)) / (12h)   (3.14e)
Formulas (3.14a)-(3.14c) have simple geometric interpretations as the slopes of lines connecting nearby points on the graph of f(x). A carefully drawn figure shows that (3.14c) is more accurate than (3.14a). We give an analytical explanation of this below. Formulas (3.14d) and (3.14e) are more technical. The formulas (3.14a), (3.14b), and (3.14d) are one sided because they use values only on one side of x. The formulas (3.14c) and (3.14e) are centered because they use points symmetrical about x and with opposite weights.
The Taylor series expansion (3.8) allows us to calculate the accuracy of each of these approximations. Let us start with the simplest, the first-order formula (3.14a). Substituting (3.8) into the right side of (3.14a) gives

   (f(x + h) − f(x)) / h ∼ f′(x) + h f′′(x)/2 + h² f′′′(x)/6 + ··· .   (3.15)
This may be written:

   (f(x + h) − f(x)) / h = f′(x) + E_a(h) ,

where

   E_a(h) ∼ ½ f′′(x) h + (1/6) f′′′(x) h² + ··· .   (3.16)

In particular, this shows that E_a(h) = O(h), which means that the one sided two point finite difference approximation is first order accurate. Moreover,

   E_a(h) = ½ f′′(x) h + O(h²) ,   (3.17)
^7 Isaac Newton thought of the differential dx as something infinitely small and yet not zero. He called such quantities infinitesimals. The infinitesimal difference was df = f(x + dx) − f(x). The finite difference occurs when the change in x is finitely small but not infinitely small.
which is to say that, to leading order, the error is proportional to h and given by ½ f′′(x)h.
Taylor series analysis applied to the two point centered difference approximation (3.14c) leads to

   (f(x + h) − f(x − h)) / (2h) = f′(x) + E_c(h) ,

where

   E_c(h) ∼ (1/6) f′′′(x) h² + (1/120) f^{(5)}(x) h⁴ + ···   (3.18)
          = (1/6) f′′′(x) h² + O(h⁴) .

This centered approximation is second order accurate, E_c(h) = O(h²). This is one order more accurate than the one sided approximations (3.14a) and (3.14b). Any centered approximation such as (3.14c) or (3.14e) must be at least second order accurate because of the symmetry relation Â(−h) = Â(h). Since A =
f′(x) does not depend on h, the error of a centered formula satisfies E(h) = E(−h), so the odd powers of h in its expansion must vanish. For the first order formula (3.14a), the leading error term is Ê(h) = ½ f′′(x)h. For the second order centered formula, (3.18) gives leading order error Ê_c(h) = (1/6) f′′′(x)h². For the three point one sided formula, the coefficient of f′′′(x)h² is 1/3, twice the coefficient for the second order centered formula. For the fourth order formula, the coefficient of f^{(5)}(x)h⁴ is 1/30. The
table shows that Ê is a good predictor of E, if h is at all small, until roundoff gets in the way. The smallest error^8 in the table comes from the fourth order
^8 The error would have been 3·10^{−19} rather than 6·10^{−12}, seven orders of magnitude smaller, in exact arithmetic. The best answer comes despite some catastrophic cancellation, but not completely catastrophic.
h      (3.14a)     (3.14c)     (3.14d)     (3.14e)
.5     3.793849    0.339528    7.172794    0.543374
E      2.38e+00   -1.08e+00    5.75e+00   -8.75e-01

Figure 3.2: For each h, the top row gives the finite difference estimate of f′(x), the middle row gives the error E(h), and the third row is Ê(h), the leading Taylor series term in the error formula. All calculations were done in double precision floating point arithmetic.
formula and h = 10^{−5}. It is impossible to have an error this small with a first or second order formula no matter what the step size. Note that the error in the (3.14e) column increased when h was reduced from 10^{−5} to 10^{−7} because of roundoff.
A difference approximation may not achieve its expected order of accuracy if the requisite derivatives are infinite or do not exist. As an example of this, let f(x) be the function

   f(x) = { 0   if x ≤ 0 ,
          { x²  if x ≥ 0 .

If we want f′(0), the formulas (3.14c) and (3.14e) are only first order accurate despite their higher accuracy for smoother functions. This f has a mild singularity, a discontinuity in its second derivative. Such a singularity is hard to spot on a graph, but may have a drastic effect on the numerical analysis of the function.
We can use finite differences to approximate higher derivatives such as

   (f(x + h) − 2f(x) + f(x − h)) / h² = f′′(x) + (h²/12) f^{(4)}(x) + O(h⁴) ,

and to estimate partial derivatives of functions depending on several variables, such as

   (f(x + h, y) − f(x − h, y)) / (2h) ∼ ∂_x f(x, y) + (h²/6) ∂_x³ f(x, y) + ··· .
3.2.1 Mixed partial derivatives

Several new features arise only when evaluating mixed partial derivatives or sums of partial derivatives in different variables. For example, suppose we want to evaluate^9 f_xy = ∂_x ∂_y f(x, y). Rather than using the same h for both^10 x and y, we use step size Δx for x and Δy for y. The first order one sided approximation for f_y is

   f_y ≈ (f(x, y + Δy) − f) / Δy .

We might hope this, and

   f_y(x + Δx, y) ≈ (f(x + Δx, y + Δy) − f(x + Δx, y)) / Δy ,

are accurate enough so that the approximation

   f_xy ≈ (f(x + Δx, y + Δy) − f(x + Δx, y) − f(x, y + Δy) + f) / (Δx Δy)   (3.19)

(obtained by differencing the f_y approximations in x) is consistent^11.
To understand the error in (3.19), we need the Taylor series for functions of more than one variable. The rigorous remainder theorem is more complicated,

^9 We abbreviate formulas by denoting partial derivatives by subscripts, ∂_x f = f_x, etc., and by leaving out the arguments if they are (x, y), so f(x + Δx, y) − f(x, y) ≈ Δx f_x(x, y) = Δx f_x.
^10 The expression f(x + h, y + h) does not even make sense if x and y have different physical units.
^11 The same calculation shows that the right side of (3.19) is an approximation of ∂_y(∂_x f). This is one proof that ∂_x ∂_y f = ∂_y ∂_x f.
but it suffices here to use all of the first neglected terms. The expansion is

   f(x + Δx, y + Δy) ∼ f + Δx f_x + Δy f_y
       + ½Δx² f_xx + ΔxΔy f_xy + ½Δy² f_yy
       + (1/6)Δx³ f_xxx + ½Δx²Δy f_xxy + ½ΔxΔy² f_xyy + (1/6)Δy³ f_yyy
       + ···
       + (1/p!) Σ_{k=0}^{p} (p choose k) Δx^{p−k} Δy^k ∂_x^{p−k} ∂_y^k f + ··· .

If we keep just the terms on the top row on the right, the second order terms on the second row are the first neglected terms, and (using the inequality ΔxΔy ≤ Δx² + Δy²):

   f(x + Δx, y + Δy) = f + Δx f_x + Δy f_y + O(Δx² + Δy²) .

Similarly,

   f(x + Δx, y + Δy) = f + Δx f_x + Δy f_y + ½Δx² f_xx + ΔxΔy f_xy + ½Δy² f_yy + O(Δx³ + Δy³) .
Of course, the one variable Taylor series is

   f(x + Δx, y) = f + Δx f_x + ½Δx² f_xx + O(Δx³) , etc.
Using all these, and some algebra, gives

   (f(x + Δx, y + Δy) − f(x + Δx, y) − f(x, y + Δy) + f) / (Δx Δy)
       = f_xy + O( (Δx³ + Δy³) / (Δx Δy) ) .   (3.20)
This shows that the approximation (3.19) is first order, at least if Δx is roughly proportional to Δy. The second order Taylor expansion above gives a quantitative estimate of the error:

   (f(x + Δx, y + Δy) − f(x + Δx, y) − f(x, y + Δy) + f) / (Δx Δy) − f_xy
       ≈ ½ ( Δx f_xxy + Δy f_xyy ) .   (3.21)

This formula suggests (and it is true) that in exact arithmetic we could let Δx → 0, with Δy fixed but small, and still have a reasonable approximation to f_xy. The less detailed version (3.20) suggests that might not be so.
A partial differential equation may involve a differential operator that is a sum of partial derivatives. One way to approximate a differential operator is to approximate each of the terms separately. For example, the Laplace operator (or Laplacian), which is Δ = ∂_x² + ∂_y² in two dimensions, may be approximated by

   Δf(x, y) = ∂_x² f + ∂_y² f
            ≈ (f(x + Δx, y) − 2f + f(x − Δx, y)) / Δx² + (f(x, y + Δy) − 2f + f(x, y − Δy)) / Δy² .

If Δx = Δy = h (x and y have the same units in the Laplace operator), then this becomes

   Δf ≈ (1/h²) ( f(x + h, y) + f(x − h, y) + f(x, y + h) + f(x, y − h) − 4f ) .   (3.22)
This is the standard five point approximation (seven points in three dimensions). The leading error term is

   (h²/12) ( ∂_x⁴ f + ∂_y⁴ f ) .   (3.23)
The simplest heat equation (or diffusion equation) is ∂_t f = ½ ∂_x² f. The space variable, x, and the time variable, t, have different units. We approximate the differential operator using a first order forward difference approximation in time and a second order centered approximation in space. This gives

   ∂_t f − ½ ∂_x² f ≈ (f(x, t + Δt) − f) / Δt − (f(x + Δx, t) − 2f + f(x − Δx, t)) / (2Δx²) .   (3.24)
The leading order error is the sum of the leading errors from time differencing (½ Δt ∂_t² f) and space differencing (−(Δx²/24) ∂_x⁴ f), which is

   ½ Δt ∂_t² f − (Δx²/24) ∂_x⁴ f .   (3.25)

For many reasons, people often take Δt proportional to Δx². In the simplest case of Δt = Δx², the leading error becomes

   Δx² ( ½ ∂_t² f − (1/24) ∂_x⁴ f ) .

This shows that the overall approximation (3.24) is second order accurate if we take the time step to be the square of the space step.
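The claim can be verified by applying the discrete operator in (3.24) to a known solution. In this Python sketch (the solution e^{−t/2} sin x of ∂_t f = ½∂_x² f and the sample point are arbitrary choices), halving Δx with Δt = Δx² divides the truncation error by about four:

```python
import math

def residual(x, t, dx):
    """Truncation error of the scheme (3.24) with dt = dx^2, applied to
    the exact solution f(x, t) = exp(-t/2) sin(x) of f_t = f_xx / 2."""
    f = lambda x, t: math.exp(-t / 2.0) * math.sin(x)
    dt = dx * dx
    time_diff = (f(x, t + dt) - f(x, t)) / dt
    space_diff = (f(x + dx, t) - 2.0 * f(x, t) + f(x - dx, t)) / (2.0 * dx**2)
    return time_diff - space_diff

x, t = 0.9, 0.0
r1, r2 = residual(x, t, 0.1), residual(x, t, 0.05)
print(r1, r2, r1 / r2)   # ratio near 4, so the error is O(dx^2)
```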
3.3 Error Expansions and Richardson Extrapolation

The error expansions (3.16) and (3.18) above are instances of a common situation that we now describe more systematically and abstractly. We are trying
to compute A and there is an approximation with

   Â(h) → A as h → 0 .

The error is E(h) = Â(h) − A. A general asymptotic error expansion in powers of h has the form

   Â(h) ∼ A + h^{p_1} A_1 + h^{p_2} A_2 + ··· ,   (3.26)

or, equivalently,

   E(h) ∼ h^{p_1} A_1 + h^{p_2} A_2 + ··· .

As with Taylor series, the expression (3.26) does not imply that the series on the right converges to Â(h). Instead, the asymptotic relation (3.26) means that, as h → 0,

   Â(h) − (A + h^{p_1} A_1) = O(h^{p_2})   (a)
   Â(h) − (A + h^{p_1} A_1 + h^{p_2} A_2) = O(h^{p_3})   (b)

and so on.   (3.27)
It goes without saying that 0 < p_1 < p_2 < ···. The statement (3.27a) says not only that A + A_1 h^{p_1} is a good approximation to Â(h), but that the error has the same order as the first neglected term, A_2 h^{p_2}. The statement (3.27b) says that including the O(h^{p_2}) term improves the approximation to O(h^{p_3}), and so on.
Many asymptotic error expansions arise from Taylor series manipulations. For example, the two point one sided difference formula error expansion (3.15) gives p_1 = 1, A_1 = ½ f′′(x), p_2 = 2, A_2 = (1/6) f′′′(x), and p_3 = 3, A_3 = (1/24) f^{(4)}(x). The three point one sided formula has p_1 = 2 because it is second order accurate, but p_2 = 3 instead of p_2 = 4. The fourth order formula has p_1 = 4 and p_2 = 6.
It is possible that an approximation is p-th order accurate in the big O sense, |E(h)| ≤ C h^p, without having an asymptotic error expansion of the form (3.26). Figure 3.4 has an example showing that this can happen when the function f(x) is not sufficiently smooth. Most of the extrapolation and debugging tricks described here do not apply in those cases.
We often work with asymptotic error expansions for which we know the powers p_k but not the coefficients, A_k. For example, in finite difference approximations, the A_k depend on the function f but the p_k do not. Two techniques that use this information are Richardson extrapolation and convergence analysis. Richardson extrapolation combines Â(h) approximations for several values of h to produce a new approximation that has greater order of accuracy than Â(h). Convergence analysis is a debugging method that tests the order of accuracy of numbers produced by a computer code.
3.3.1 Richardson extrapolation

Richardson extrapolation increases the order of accuracy of an approximation provided that the approximation has an asymptotic error expansion of the form (3.26) with known p_k. In its simplest form, we compute Â(h) and Â(2h) and then form a linear combination that eliminates the leading error term. Note that

   Â(2h) = A + (2h)^{p_1} A_1 + (2h)^{p_2} A_2 + ···
         = A + 2^{p_1} h^{p_1} A_1 + 2^{p_2} h^{p_2} A_2 + ··· ,

so

   (2^{p_1} Â(h) − Â(2h)) / (2^{p_1} − 1)
       = A + ((2^{p_1} − 2^{p_2}) / (2^{p_1} − 1)) h^{p_2} A_2 + ((2^{p_1} − 2^{p_3}) / (2^{p_1} − 1)) h^{p_3} A_3 + ··· .

In other words, the extrapolated approximation

   Â^{(1)}(h) = (2^{p_1} Â(h) − Â(2h)) / (2^{p_1} − 1)   (3.28)

has order of accuracy p_2 > p_1. It also has an asymptotic error expansion,

   Â^{(1)}(h) ∼ A + h^{p_2} A_2^{(1)} + h^{p_3} A_3^{(1)} + ··· ,

where A_2^{(1)} = ((2^{p_1} − 2^{p_2}) / (2^{p_1} − 1)) A_2, and so on.
Richardson extrapolation can be repeated to remove more asymptotic error terms. For example,

   Â^{(2)}(h) = (2^{p_2} Â^{(1)}(h) − Â^{(1)}(2h)) / (2^{p_2} − 1)

has order p_3. Since Â^{(1)}(h) depends on Â(h) and Â(2h), Â^{(2)}(h) depends on Â(h), Â(2h), and Â(4h). It is not necessary to use powers of 2, but this is natural in many applications. Richardson extrapolation will not work if the underlying approximation, Â(h), has accuracy of order h^p in the O(h^p) sense without at least one term of an asymptotic expansion.
Richardson extrapolation allows us to derive higher order difference approximations from low order ones. Start, for example, with the first order one sided approximation (3.14a) to f′(x). Applying (3.28) with p_1 = 1 gives

   2 · (f(x + h) − f(x)) / h − (f(x + 2h) − f(x)) / (2h) = (−f(x + 2h) + 4f(x + h) − 3f(x)) / (2h) ,

which is the second order three point one sided difference approximation (3.14d).
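The same manipulation is easy to verify numerically. In this Python sketch (f = exp at x = 1 is an arbitrary test case), the extrapolated combination agrees with (3.14d) to roundoff and is much more accurate than (3.14a) itself:

```python
import math

def forward_diff(f, x, h):
    """The first order one sided approximation (3.14a)."""
    return (f(x + h) - f(x)) / h

def extrapolated(f, x, h):
    """(3.28) with p1 = 1: (2 A(h) - A(2h)) / (2 - 1)."""
    return 2.0 * forward_diff(f, x, h) - forward_diff(f, x, 2.0 * h)

def three_point(f, x, h):
    """The second order one sided formula (3.14d)."""
    return (-f(x + 2.0 * h) + 4.0 * f(x + h) - 3.0 * f(x)) / (2.0 * h)

x, h = 1.0, 0.1
print(abs(extrapolated(math.exp, x, h) - three_point(math.exp, x, h)))
print(abs(forward_diff(math.exp, x, h) - math.exp(x)),   # O(h) error
      abs(extrapolated(math.exp, x, h) - math.exp(x)))   # O(h^2) error
```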
Starting with the second order centered approximation (3.14c) (with p_1 = 2 and p_2 = 4) leads to the fourth order approximation (3.14e). The second order one sided formula has p_1 = 2 and p_2 = 3. Applying Richardson extrapolation to it gives a one sided formula that uses f(x + 4h), f(x + 2h), f(x + h), and f(x) to give a third order approximation. A better third order one sided approximation would use f(x + 3h) instead of f(x + 4h). Section 3.5 explains how to do this.
Richardson extrapolation may also be applied to the output of a complex code. Run it with step size h and 2h and apply (3.28) to the output. This is sometimes applied to differential equations as an alternative to making up high order schemes from scratch, which can be time consuming and intricate.
3.3.2 Convergence analysis

We can test a code, and the algorithm it is based on, using ideas related to Richardson extrapolation. A naive test would be to do runs with decreasing h values to check whether Â(h) → A as h → 0. A convergence analysis based on asymptotic error expansions can be better. For one thing, we might not know A. Even if we run a test case where A is known, it is common that a code with mistakes limps to convergence, but not as accurately or reliably as the correct code would. If we bother to write a code that is more than first order accurate, we should test that we are getting the order of accuracy we worked for.
There are two cases, the case where A is known and the case where A is not known. While we would not write a code solely to solve problems for which we know the answers, such problems are useful test cases. Ideally, a code should come with a suite of tests, and some of these tests should be nontrivial. For example, the fourth-order approximation (3.14e) gives the exact answer for any polynomial of degree less than five, so a test suite of only low-order polynomials would be inappropriate for this rule.
If A is known, we can run the code with step size h and 2h and, from the resulting approximations, Â(h) and Â(2h), compute

   E(h) ≈ A_1 h^{p_1} + A_2 h^{p_2} + ··· ,
   E(2h) ≈ 2^{p_1} A_1 h^{p_1} + 2^{p_2} A_2 h^{p_2} + ··· .

For small h the first term is a good enough approximation so that the ratio should be approximately the characteristic value

   R(h) = E(2h) / E(h) ≈ 2^{p_1} .   (3.29)
Figure 3.3 illustrates this phenomenon. As h → 0, the ratios converge to the expected result 2^{p_1} = 2³ = 8. Figure 3.4 shows what may happen when we apply this convergence analysis to an approximation that is second order accurate in the big O sense without having an asymptotic error expansion. The error gets very small but the error ratio does not have simple behavior as in Figure 3.3. The difference between the convergence in these two cases appears clearly in the log-log error plots shown in Figure 3.5.
h Error: E(h) Ratio: E(h)/E(h/2)
.1 4.8756e-04 3.7339e+00
.05 1.3058e-04 6.4103e+00
.025 2.0370e-05 7.3018e+00
.0125 2.7898e-06 7.6717e+00
6.2500e-03 3.6364e-07 7.8407e+00
3.1250e-03 4.6379e-08 7.9215e+00
1.5625e-03 5.8547e-09 7.9611e+00
7.8125e-04 7.3542e-10 -
Figure 3.3: Convergence study for a third order accurate approximation. As h → 0, the ratio converges to 2³ = 8. The h values in the left column decrease by a factor of two from row to row.
h Error: E(h) Ratio: E(h)/E(h/2)
.1 1.9041e-02 2.4014e+00
.05 7.9289e-03 1.4958e+01
.025 5.3008e-04 -1.5112e+00
.0125 -3.5075e-04 3.0145e+00
6.2500e-03 -1.1635e-04 1.9880e+01
3.1250e-03 -5.8529e-06 -8.9173e-01
1.5625e-03 6.5635e-06 2.8250e+00
7.8125e-04 2.3233e-06 -
Figure 3.4: Convergence study for an approximation that is second order accurate in the sense that |E(h)| = O(h²) but that has no asymptotic error expansion. The h values are the same as in Figure 3.3. The errors decrease in an irregular fashion.
Convergence analysis can be applied even when A is not known. In this case we need three approximations, Â(4h), Â(2h), and Â(h). Again assuming the existence of an asymptotic error expansion (3.26), we get, for small h,

   R′(h) = (Â(4h) − Â(2h)) / (Â(2h) − Â(h)) ≈ 2^{p_1} .   (3.30)
3.4 Integration

Numerical integration means finding approximations for quantities such as

   I = ∫_a^b f(x) dx .
[Figure 3.5 here: two log-log plots of E(h) against h. The top panel, titled "Convergence for a third-order method, E(h) ≈ Ah³", plots E(h) together with the line h³; the bottom panel, titled "Convergence for a second-order method, E(h) = O(h²)", plots E(h) together with the line h².]

Figure 3.5: Log-log plots of the convergence study data in Figures 3.3 (top) and 3.4 (bottom). The existence of an asymptotic error expansion in the first example shows up graphically as convergence to a straight line with slope 3.
Rectangle    Î_k = h_k f(x_k)                                                 1st order
Trapezoid    Î_k = (h_k/2) ( f(x_k) + f(x_{k+1}) )                            2nd order
Midpoint     Î_k = h_k f(x_{k+1/2})                                           2nd order
Simpson      Î_k = (h_k/6) ( f(x_k) + 4 f(x_{k+1/2}) + f(x_{k+1}) )           4th order
2 point GQ   Î_k = (h_k/2) ( f(x_{k+1/2} - ξh_k) + f(x_{k+1/2} + ξh_k) )      4th order
3 point GQ   Î_k = (h_k/18) ( 5 f(x_{k+1/2} - ηh_k) + 8 f(x_{k+1/2})
                              + 5 f(x_{k+1/2} + ηh_k) )                       6th order

Figure 3.6: Common panel integration rules. The last two are Gauss quadrature
(Gauss-Legendre, to be precise) formulas. The definitions are ξ = 1/(2√3) and
η = (1/2)√(3/5).
We discuss only panel methods here, though there are other elegant methods.
In a panel method, the integration interval, [a, b], is divided into n subintervals,
or panels, P_k = [x_k, x_{k+1}], where a = x_0 < x_1 < ... < x_n = b. If the panel P_k
is small, we can get an accurate approximation to

    I_k = ∫_{P_k} f(x) dx = ∫_{x_k}^{x_{k+1}} f(x) dx   (3.31)

using a few evaluations of f inside P_k. Adding these approximations gives an
approximation to I:

    Î = Σ_{k=0}^{n-1} Î_k .   (3.32)

Some common panel integral approximations are given in Figure 3.6, where we
write x_{k+1/2} = (x_{k+1} + x_k)/2 for the midpoint of the panel and h_k = x_{k+1} - x_k
for the width. Note that x_k is the left endpoint of P_k and the right endpoint
of P_{k-1}. In the trapezoid rule and Simpson's rule, we need not evaluate f(x_k)
twice.
For our error analysis, we assume that all the panels are the same size:

    h = Δx = |P_k| = x_{k+1} - x_k   for all k.

Given this restriction, not every value of h is allowed, because b - a = nh and n
is an integer. When we take h → 0, we will assume that h only takes allowed
values h = (b - a)/n. The local truncation error is the integration error over one
panel. The global error is the sum of the local truncation errors in all the panels.
The global error usually is one power of h larger than the local truncation error.
If the error per panel is O(h^q), then the total error will be of the order of n h^q,
where n is the number of panels. Since n = (b - a)/h, this suggests that the
global error will be of order h^q (b - a)/h = O(h^{q-1}).
For the local truncation error analysis, let P = [x*, x* + h] be a generic
panel. The panel integration rule approximates the panel integral

    I_P = ∫_P f(x) dx = ∫_{x*}^{x*+h} f(x) dx

with the approximation Î_P. For example, the rectangle rule (top row of Figure
3.6) has panel integration rule

    ∫_{x*}^{x*+h} f(x) dx ≈ Î_P(h) = h f(x*) .

To estimate the difference between I_P and Î_P(h), we expand f in a Taylor series
about x*:

    f(x) ≈ f(x*) + f'(x*)(x - x*) + (1/2) f''(x*)(x - x*)^2 + ... .

Integrating this term by term leads to

    I_P ≈ ∫_P f(x*) dx + ∫_P f'(x*)(x - x*) dx + ...
        = f(x*) h + (1/2) f'(x*) h^2 + (1/6) f''(x*) h^3 + ... .

The error in integration over this panel then is

    E(P, h) = Î_P(h) - I_P ≈ -(1/2) f'(x*) h^2 - (1/6) f''(x*) h^3 .   (3.33)

This shows that the local truncation error for the rectangle rule is O(h^2) and
identifies the leading error coefficient.
Summing over the panels gives the global error:

    E = Î - I = Σ_{k=0}^{n-1} ( Î_k - I_k ) ,

    E ≈ -Σ_{k=0}^{n-1} (1/2) f'(x_k) h^2 - Σ_{k=0}^{n-1} (1/6) f''(x_k) h^3 .   (3.34)

We sum over k and use simple inequalities to get the order of magnitude of the
global error:

    |E| ≲ (1/2) Σ_{k=0}^{n-1} |f'(x_k)| h^2
        ≤ n (1/2) max_{a≤x≤b} |f'(x)| h^2
        = ((b - a)/h) O(h^2)
        = O(h) .

This shows that the rectangle rule is first order accurate overall.
Looking at the global error in more detail leads to an asymptotic error
expansion. Applying the rectangle rule error bound to another function, g(x),
we have

    Σ_{k=0}^{n-1} g(x_k) h = ∫_a^b g(x) dx + O(h) .

Taking g(x) = f'(x) gives

    Σ_{k=0}^{n-1} f'(x_k) h = ∫_a^b f'(x) dx + O(h) = f(b) - f(a) + O(h) ,

so the leading sum in (3.34) is

    Σ_{k=0}^{n-1} (1/2) f'(x_k) h^2 = (h/2) ( Σ_{k=0}^{n-1} f'(x_k) h )
                                    = (h/2) ( f(b) - f(a) ) + O(h^2) ,

and therefore

    E ≈ -(1/2) ( f(b) - f(a) ) h .   (3.35)

This gives the first term in the asymptotic error expansion. It shows that the
leading error not only is bounded by h, but roughly is proportional to h. It also
demonstrates the curious fact that if f is differentiable then the leading error
term is determined by the values of f at the endpoints and is independent of
the values of f between. This is not true if f has a discontinuity in the interval
[a, b].
To get the next term in the error expansion, we apply the same idea to the
sums in (3.34). A more accurate form of the first sum is

    Σ_{k=0}^{n-1} f'(x_k) h = ∫_a^b f'(x) dx - (h/2) ( f'(b) - f'(a) ) + O(h^2)
                            = f(b) - f(a) - (h/2) ( f'(b) - f'(a) ) + O(h^2) .

In the same way, we find that

    Σ_{k=0}^{n-1} f''(x_k) (h^3/6) = ( f'(b) - f'(a) ) (h^2/6) + O(h^3) .

Combining all these gives the first two terms in the error expansion:

    E(h) ≈ -(1/2) ( f(b) - f(a) ) h + (1/12) ( f'(b) - f'(a) ) h^2 + ... .   (3.36)
It is clear that this procedure can be used to continue the expansion as far
as we want, but you would have to be very determined to compute, for example,
the coefficient of h^4. An elegant and more systematic discussion of this error
expansion is carried out in the book of Dahlquist and Björck. The resulting error
expansion is called the Euler-Maclaurin formula. The coefficients 1/2, 1/12, and
so on, are related to the Bernoulli numbers.

n     Computed Integral   Error     Error/h   (E - A_1 h)/h^2   (E - A_1 h - A_2 h^2)/h^3
10    3.2271             -0.2546   -1.6973    0.2900           -0.7250
20    3.3528             -0.1289   -1.7191    0.2901           -0.3626
40    3.4168             -0.0649   -1.7300    0.2901           -0.1813
80    3.4492             -0.0325   -1.7354    0.2901           -0.0907
160   3.4654             -0.0163   -1.7381    0.2901           -0.0453

Figure 3.7: Experiment illustrating the asymptotic error expansion for rectangle
rule integration.

n      Computed Integral   Error         Error/h       (E - A_1 h)/h^2
10     7.4398e-02         -3.1277e-02   -3.1277e-01   -4.2173e-01
20     9.1097e-02         -1.4578e-02   -2.9156e-01   -4.1926e-01
40     9.8844e-02         -6.8314e-03   -2.7326e-01   -1.0635e-01
80     1.0241e-01         -3.2605e-03   -2.6084e-01    7.8070e-01
160    1.0393e-01         -1.7446e-03   -2.7914e-01   -1.3670e+00
320    1.0482e-01         -8.5085e-04   -2.7227e-01   -5.3609e-01
640    1.0526e-01         -4.1805e-04   -2.6755e-01    1.9508e+00
1280   1.0546e-01         -2.1442e-04   -2.7446e-01   -4.9470e+00
2560   1.0557e-01         -1.0631e-04   -2.7214e-01   -3.9497e+00
5120   1.0562e-01         -5.2795e-05   -2.7031e-01    1.4700e+00

Figure 3.8: Experiment illustrating the breakdown of the asymptotic expansion
for a function with a continuous first derivative but discontinuous second
derivative.
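We can check the leading term of (3.36) numerically. The sketch below is ours, not from the text: it compares the rectangle rule error with -(1/2)(f(b) - f(a))h, which for a smooth integrand should agree to O(h^2).

```cpp
#include <cassert>
#include <cmath>

// Rectangle (left endpoint) rule on n equal panels.
double rectangle(double (*f)(double), double a, double b, unsigned n)
{
    double h = (b - a) / n, sum = 0;
    for (unsigned k = 0; k < n; ++k)
        sum += f(a + k * h);
    return sum * h;
}

// Leading term of the error expansion (3.36): E(h) ~ -(1/2)(f(b)-f(a)) h.
double leading_error(double (*f)(double), double a, double b, unsigned n)
{
    double h = (b - a) / n;
    return -0.5 * (f(b) - f(a)) * h;
}
```

With f(x) = e^x on [0, 1] and n = 1000, the difference between the actual error and the predicted leading term is of order h^2, consistent with the second term of (3.36).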
The error expansion (3.36) will not be valid if the integrand, f, has singularities
inside the domain of integration, where the derivatives appearing in the
expansion fail to exist.

For the midpoint rule, take the generic panel to be P = [x* - h/2, x* + h/2],
centered at x*. If we now expand f(x) in a Taylor series about x* and integrate
over P, the odd order terms drop out by symmetry:

    I_P = f(x*) h + ( f''(x*)/24 ) h^3 + ( f^{(4)}(x*)/1920 ) h^5 + ... .

For the midpoint rule, this leads to a global error expansion in even powers of
h, E ≈ A_1 h^2 + A_2 h^4 + ..., with A_1 = -( f'(b) - f'(a) )/24. More generally, for
a panel method that is symmetric about the center of the panel, the integration
of

    ∫_P (x - x*)^n dx

is always done exactly if n is odd. This is because both the exact integral and
its panel method approximation are zero by symmetry.

To understand why this rule works, think of the Taylor expansion of f(x)
about the midpoint, x*. The integral of (x - x*)^n over P is proportional to h^{n+1},
as is the panel method approximation to it, regardless of whether the panel
method is exact or not. The first monomial that is not integrated exactly
contributes something proportional to h^{n+1} to the error.
Using this rule it is easy to determine the accuracy of the approximations in
Figure 3.6. The trapezoid rule integrates constants and linear functions exactly,
but it gets quadratics wrong. This makes the local truncation error third order
and the global error second order. The Simpson's rule coefficients 1/6 and 2/3
are designed precisely so that constants and quadratics are integrated exactly,
which they are. Simpson's rule integrates cubics exactly (by symmetry) but gets
quartics wrong. This gives Simpson's rule fourth order global accuracy. The two
point Gauss quadrature also does constants and quadratics (and, by symmetry,
cubics) correctly but gets quartics wrong (check this!). The three point Gauss
quadrature rule does constants, quadratics, and quartics correctly but gets
(x - x*)^6 wrong. That makes it sixth order accurate.
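The "(check this!)" above takes only a few lines. The sketch below is ours: it applies the two point Gauss rule of Figure 3.6 to monomials on a single panel [-h/2, h/2] centered at zero, with ξ = 1/(2√3) as in the caption; the rule is exact through degree three and wrong at degree four.

```cpp
#include <cassert>
#include <cmath>

// Two point Gauss quadrature of x^n over one panel [-h/2, h/2],
// with nodes at +/- xi*h, xi = 1/(2*sqrt(3)), as in Figure 3.6.
double gauss2_monomial(int n, double h)
{
    double xi = 1.0 / (2.0 * std::sqrt(3.0));
    return (h / 2) * (std::pow(-xi * h, n) + std::pow(xi * h, n));
}

// Exact integral of x^n over [-h/2, h/2]; zero for odd n by symmetry.
double exact_monomial(int n, double h)
{
    if (n % 2 == 1) return 0;
    return 2 * std::pow(h / 2, n + 1) / (n + 1);
}
```

Running the comparison for n = 0 through 4 shows exact agreement through cubics and a discrepancy at n = 4, confirming the fourth order global accuracy claimed in Figure 3.6.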
3.5 The method of undetermined coefficients

The method of undetermined coefficients is a general way to find an approximation
formula of a desired type. Suppose we want to estimate some A in terms
of given data g_1(h), g_2(h), .... The method is to assume a linear estimation
formula of the form

    Â(h) = a_1(h) g_1(h) + a_2(h) g_2(h) + ... ,   (3.37)

then determine the unknown coefficients a_k(h) by matching Taylor series up to
the highest possible order. The coefficients often take the form of a constant
times some power of h: a_k(h) = a_k h^{p_k}. The algebra is simpler if we guess or
figure out the powers first. The estimator is consistent if Â(h) - A → 0 as
h → 0. Generally (but not always), being consistent is the same as being at
least first order accurate. At the end of our calculations, we may discover that
there is no consistent estimator of the desired type.
We illustrate the method in a simple example: estimate f'(x) from g_1 = f(x)
and g_2 = f(x + h). As above, we will leave out the argument x whenever
possible and write f for f(x), f' for f'(x). The estimator has the form

    f' ≈ Â = a_1(h) f + a_2(h) f(x + h) .

Now expand in Taylor series:

    f(x + h) = f + f' h + (1/2) f'' h^2 + ... .

The estimator is

    Â = a_1(h) f + a_2(h) f + a_2(h) f' h + (1/2) a_2(h) f'' h^2 + ... .   (3.38)

Looking at the right hand side, we see various coefficients, f, f', and so on.
Since the relation is supposed to work whatever the values of f, f', etc. may be,
we choose a_1 and a_2 so that the coefficient of f is zero and the coefficient of
f' is one:

    a_1(h) + a_2(h) = 0 ,    a_2(h) h = 1 ,

which gives

    a_2(h) = 1/h ,    a_1(h) = -1/h .   (3.39)

With these values,

    Â = -(1/h) f(x) + (1/h) f(x + h) = ( f(x + h) - f(x) ) / h .

This is the first order one sided difference approximation we saw earlier. Plugging
the values (3.39) into (3.38) shows that the estimator satisfies Â = f' +
O(h), which is the first order accuracy we found before.
A more complicated problem is to estimate f'(x) to third order accuracy;
none of the approximations so far have this property. Thus assume the form:

    f' ≈ Â = (1/h) ( a_{-1} f(x - h) + a_0 f + a_1 f(x + h) + a_2 f(x + 2h) ) .
The Taylor series expansions are

    f(x - h)  = f - f' h + (f''/2) h^2 - (f'''/6) h^3 + (f^{(4)}/24) h^4 - ...
    f(x + h)  = f + f' h + (f''/2) h^2 + (f'''/6) h^3 + (f^{(4)}/24) h^4 + ...
    f(x + 2h) = f + 2 f' h + 2 f'' h^2 + (4 f'''/3) h^3 + (2 f^{(4)}/3) h^4 + ...
Equating powers of h turns out to be the same as equating the coefficients of f,
f', etc., on both sides:

    f ,    O(h^{-1}):  0 = a_{-1} + a_0 + a_1 + a_2
    f' ,   O(h^0):     1 = -a_{-1} + a_1 + 2 a_2
    f'' ,  O(h^1):     0 = (1/2) a_{-1} + (1/2) a_1 + 2 a_2
    f''' , O(h^2):     0 = -(1/6) a_{-1} + (1/6) a_1 + (4/3) a_2   (3.40)

We could compute the O(h^3) equation, but already we have four equations for
the four unknown coefficients. If we were to use the O(h^3) equation in place of
the O(h^2) equation, we would lose an order of accuracy in the resulting
approximation.
These are a system of four linear equations in the four unknowns a_{-1} through
a_2, which we solve in an ad hoc way. Notice that the combination b = a_1 - a_{-1}
appears in the second and fourth equations. If we substitute b, these equations
are

    1 = b + 2 a_2 ,
    0 = (1/6) b + (4/3) a_2 ,

which implies that b = -8 a_2 and then that a_2 = -1/6 and b = 4/3. Then, since
4 a_2 = -2/3, the third equation gives a_1 + a_{-1} = 2/3. Since b = 4/3 is known, we
get two equations for a_1 and a_{-1}:

    a_1 - a_{-1} = 4/3 ,
    a_1 + a_{-1} = 2/3 .

The solution is a_1 = 1 and a_{-1} = -1/3. With these, the first equation leads to
a_0 = -1/2. Finally, our approximation is

    f'(x) = (1/h) ( -(1/3) f(x - h) - (1/2) f(x) + f(x + h) - (1/6) f(x + 2h) ) + O(h^3) .
Note that the first step in this derivation was to approximate f by its Taylor
approximation of order 3, which would be exact if f were a polynomial of order
3. The derivation has the effect of making Â exact on polynomials of degree 3
or less. The four equations (3.40) arise from asking Â to be exact on constants,
linear functions, quadratics, and cubics. We illustrate this approach with the
problem of estimating f''(x) from the data f(x), f(x + h), f'(x), and f'(x + h).
The estimator has the form

    f''(x) ≈ Â = a f + b f(x + h) + c f' + d f'(x + h) .

We can determine the four unknown coefficients a, b, c, and d by requiring the
approximation to be exact on constants, linears, quadratics, and cubics. It does
not matter what x value we use, so let us take x = 0. This gives, respectively,
the four equations:

    0 = a + b                 (constants, f = 1) ,
    0 = b h + c + d           (linears, f = x) ,
    1 = b h^2/2 + d h         (quadratics, f = x^2/2) ,
    0 = b h^3/6 + d h^2/2     (cubics, f = x^3/6) .
Solving these gives

    a = -6/h^2 ,   b = 6/h^2 ,   c = -4/h ,   d = -2/h ,

and the approximation

    f''(x) ≈ (6/h^2) ( f(x + h) - f(x) ) - (2/h) ( 2 f'(x) + f'(x + h) ) .

A Taylor series calculation shows that this is second order accurate.
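A sketch of this estimator (ours; the data values are passed in as numbers) makes the second order accuracy easy to verify on a smooth test function.

```cpp
#include <cassert>
#include <cmath>

// The estimator derived above, taking the data f(x), f(x+h), f'(x),
// f'(x+h) as inputs:
// f''(x) ~ (6/h^2)(f(x+h) - f(x)) - (2/h)(2 f'(x) + f'(x+h)).
double d2_estimate(double fx, double fxh, double dfx, double dfxh, double h)
{
    return 6 * (fxh - fx) / (h * h) - 2 * (2 * dfx + dfxh) / h;
}
```

With f = e^x at x = 0 (so f = f' and f''(0) = 1), the error at h = 0.01 is about h^2/12, consistent with second order accuracy.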
3.6 Adaptive parameter estimation

In most real computations, the computational strategy is not fixed in advance,
but is adjusted adaptively as the computation proceeds. If we are using one of
the approximations Â(h), we might not know an appropriate h when we write
the program, and the user might not have the time or expertise to choose h
for each application. For example, exercise 12 involves hundreds or thousands
of numerical integrations. It is out of the question for the user to experiment
manually to find a good h for each one. We need instead a systematic procedure
for finding an appropriate step size.

Suppose we are computing something about the function f, a derivative or
an integral. We want a program that takes f and a desired level of accuracy¹²,
e, and returns Â with |Â - A| ≤ e. Suppose we have an approximation
Â(h) that we can evaluate for any h, and we want an automatic way to choose
h so that |Â(h) - A| ≤ e. For a second order accurate approximation with an
asymptotic error expansion, the error estimate is

    Â(h) - A ≈ (1/3) ( Â(2h) - Â(h) ) .   (3.41)

¹²This is absolute error. We also could seek a bound on relative error: |Â - A| / |A| ≤ e.
The adaptive strategy would be to keep reducing h by a factor of two until the
estimated error (3.41) is within the tolerance¹³:
double adaptive1(double h, // Step size
double eps) // Desired error bound
{
double Ah = A(h);
double Ah2 = A(h/2);
while (fabs(Ah2 - Ah) > 3*eps) {
h = h/2;
Ah = Ah2;
Ah2 = A(h/2);
}
return Ah2;
}
A natural strategy might be to stop when |Â(2h) - Â(h)| ≤ e. Our quantitative
asymptotic error analysis shows that this strategy is off by a factor of 3. We
achieve accuracy roughly e when we stop at |Â(2h) - Â(h)| ≤ 3e. This is
because Â(h) is more accurate than Â(2h).
We can base reasonably reliable software on refinements of the basic strategy
of adaptive1. Some drawbacks of adaptive1 are that

1. It needs an initial guess, a starting value of h.

2. It may be an infinite loop.

3. It might terminate early if the initial h is outside the asymptotic range
where error expansions are accurate.

4. If Â(h) does not have an asymptotic error expansion, the program will
not detect this.

5. It does not return the best possible estimate of A.

¹³In the increment part we need not evaluate Â(2h) because this is what we called Â(h)
before we replaced h with h/2.
A plausible initial guess, h_0, will depend on the scales (length or time, etc.)
of the problem. For example, 10^{-10} meters is natural for a problem in atomic
physics but not in airplane design. The programmer or the user should supply
h_0 based on understanding of the problem. The programmer can take h_0 = 1
if he or she thinks the user will use natural units for the problem (Ångströms
for atomic physics, meters for airplanes). It might happen that you need h =
h_0/1000 to satisfy adaptive1, but you should give up once h reaches h_0 ε_mach. For
integration we need an initial n = (b - a)/h. It might be reasonable to take
n_0 = 10, so that h_0 = (b - a)/10.
Point 2 says that we need some criterion for giving up. As discussed more
in Section 3.7, we should anticipate the ways our software can fail and report
failure. When to give up should depend on the problem. For numerical
differentiation, we can stop when roundoff or propagated error from evaluating f
(see Chapter 2, Section ?) creates an error as big as the answer. For integration,
limiting the number of refinements to 20 would limit the number of panels to
n_0 · 2^{20} ≈ n_0 · 10^6. The revised program might look like
const int HMIN_REACHED = -1;
int adaptive2(double h, // Step size
double eps, // Desired error bound
double& result) // Final A estimate (output)
{
double Ah = A(h);
double Ah2 = A(h/2);
double hmin = 10*DBL_EPSILON*h; // DBL_EPSILON is in cfloat (float.h)
while (fabs(Ah2 - Ah) > 3*eps) {
h = h/2;
if (h <= hmin) {
result = (4*Ah2-Ah)/3; // Extrapolated best result
return HMIN_REACHED; // Return error code;
}
Ah = Ah2;
Ah2 = A(h/2);
}
result = (4*Ah2-Ah)/3; // Extrapolated best result
return 0;
}
We cannot have perfect protection from point 3, though premature termination
is unlikely if h_0 is sensible and e (the desired accuracy) is small enough.
A more cautious programmer might do more convergence analysis, for example
asking that the Â(4h) and Â(2h) error estimate be roughly 2^p times larger than
the Â(2h) and Â(h) estimate. There might be irregularities in f(x), possibly
jumps in some derivative, that prevent the asymptotic error analysis but do not
prevent convergence. It would be worthwhile returning an error flag in this case,
as some commercial numerical software packages do.

Part of the risk in point 4 comes from the possibility that Â(h) converges
more slowly than the hoped for order of accuracy suggests. For example, if
Â(h) ≈ A + A_1 h^{1/2}, then the true error is several times larger than the estimate
(3.41) suggests. The extra convergence analysis suggested above might catch this.
Point 5 is part of a paradox afflicting many error estimation strategies. We
estimate not just the size of the error E(h) = Â(h) - A, but its value. This
leaves us a choice. We could ignore the error estimate (3.41) and report Â(h)
as the approximate answer, or we could subtract out the estimated error and
report the more accurate Â(h) - E(h). This is the same as applying one level of
Richardson extrapolation to Â. The corrected approximation probably is more
accurate, but we have no estimate of its error. The only reason to be dissatisfied
with this is that we cannot report an answer with error less than e until the
error is far less than e.
3.7 Software

Scientific programmers have many opportunities to err. Even in short computations,
a sign error in a formula or an error in program logic may lead to an
utterly wrong answer. More intricate computations provide even more opportunities
for bugs. Programs that are designed without care become more and
more difficult to read as they grow in size, until

    Things fall apart; the centre cannot hold;
    Mere anarchy is loosed upon the world.¹⁴
Scientific programmers can do many things to make codes easier to debug
and more reliable. We write modular programs composed of short procedures so
that we can understand and test our calculations one piece at a time. We design
these procedures to be flexible so that we can re-use work from one problem in
solving another problem, or a problem with different parameters. We program
defensively, checking for invalid inputs and nonsensical results and reporting
errors rather than failing silently. In order to uncover hidden problems, we also
test our programs thoroughly, using techniques like convergence analysis that
we described in Section 3.3.2.
3.7.1 Flexibility and modularity

Suppose you want to compute I = ∫_0^2 f(x) dx using a panel method. You could
write the following code using the rectangle rule:

¹⁴From The Second Coming, with apologies to William Butler Yeats.
double I = 0;
for (int k = 0; k < 100; ++k)
I += .02*f(.02*k);
In this code, the function name, the domain of integration, and the number
of panels are all hard-wired. If you later decided you needed 300 panels for
sufficient accuracy, it would be easy to introduce a bug by changing the counter
in line 2 without changing anything in line 3. It would also be easy to introduce
a bug by replacing .02 with the integer expression 2/300 (instead of 2.0/300).
A more flexible version would be:
double rectangle_rule(double (*f)(double), // The integrand
double a, double b, // Left and right ends
unsigned n) // Number of panels
{
double sum = 0;
double h = (b-a)/n;
for (unsigned k = 0; k < n; ++k) {
double xk = ( a*(n-k) + b*k )/n;
sum += f(xk);
}
return sum*h;
}
In this version of the code, the function, domain of integration, and number
of panels are all parameters. We could use n = 100 or n = 300 and still get
a correct result. This extra flexibility costs only a few extra lines of code that
take just a few seconds to type.

This subroutine also documents what we were trying to compute by giving
names to what were previously just numeric constants. We can tell immediately
that the call rectangle_rule(f, 0, 2, 100) should integrate over [0, 2], while
we could only tell that in the original version by reading the code carefully and
multiplying .02 by 100.
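For a self-contained illustration, here is the routine again together with a hypothetical integrand my_f; the integrand name is ours, not from the text.

```cpp
#include <cassert>
#include <cmath>

double rectangle_rule(double (*f)(double), // The integrand
                      double a, double b,  // Left and right ends
                      unsigned n)          // Number of panels
{
    double sum = 0;
    double h = (b - a) / n;
    for (unsigned k = 0; k < n; ++k)
        sum += f((a * (n - k) + b * k) / n);
    return sum * h;
}

// Hypothetical integrand for the calls below.
double my_f(double x) { return x * x; }
```

The call rectangle_rule(my_f, 0, 2, 100) uses 100 panels; switching to 300 panels changes only the last argument, with no chance of the mismatched-constant bug described above.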
What if we are interested in integrating a function that has parameters
in addition to the variable of integration? We could pass the extra parameters
via a global variable, but this introduces many opportunities for error. If we
need to deal with such integrands, we might make the code even more flexible
by adding another argument:
// An integrand_t is a type of function that takes a double and a double*
// and returns a double. The second argument is used to pass parameters.
typedef double (*integrand_t)(double, double*);
double rectangle_rule(integrand_t f, // The integrand
double* fparams, // Integrand parameters
double a, double b, // Left and right ends
unsigned n) // Number of panels
{
double sum = 0;
double h = (b-a)/n;
for (unsigned k = 0; k < n; ++k) {
double xk = ( a*(n-k) + b*k )/n;
sum += f(xk, fparams);
}
return sum*h;
}
We could make this routine even more flexible by making fparams a void*, or
by writing in a style that uses more C++ language features¹⁵.
The rectangle_rule function illustrates how designing with subroutines
improves code flexibility and documentation. When we call a procedure, we
care about the interface: what data do we have to pass in, and what results do
we get out? But we can use the procedure without knowing all the details of the
implementation. Because the details of the implementation are hidden, when we
call a procedure we can keep our focus on the big picture of our computation.

To design robust software, we need to design a testing plan, and another
advantage of well-designed procedures is that they make testing easier. For
example, we might write and test a generic adaptive Richardson extrapolation
routine as in Section 3.6, and separately write and test an integration routine like
rectangle_rule. If the procedures work correctly separately, they have a good
chance of working correctly together.
3.7.2 Error checking and failure reports

Most scientific routines can fail, either because the user provides nonsensical
input or because it is too difficult to solve the problem to the requested accuracy.
Any module that can fail must have a way to report these failures. The simplest
way to handle an error is to report an error message and exit; while this is
fine for some research software, it is generally not appropriate for commercial
software. A more sophisticated programmer might write errors to a log file,
return a flag that indicates when an error has occurred, or throw an exception.¹⁶
For example, let's consider our rectangle_rule procedure. We have implicitly
assumed in this case that the input arguments are reasonable: the function
pointer cannot be NULL, there is at least one panel, and the number of panels is
not larger than some given value of MAX_PANELS. These are preconditions for the
routine. We can catch errors in these preconditions by using the C++ assert
macro defined in assert.h or cassert:

¹⁵This routine is written in C-style C++, which we will use throughout the book. A
native C++ solution would probably involve additional language features, such as function
templates or virtual methods.

¹⁶We will avoid discussing C++ exception handling with try/catch, except to say that it
exists. While the try/catch model of exception handling is conceptually simple, and while
it is relatively simple in languages like Java and MATLAB that have garbage collection, using
exceptions well in C++ involves some points that are beyond the scope of this course. For
a thorough discussion of C++ exception handling and its implications, we recommend the
discussion in Effective C++ by Scott Meyers.
const int MAX_PANELS = 10000000;
typedef double (*integrand_t)(double, double*);
double rectangle_rule(integrand_t f, // The integrand
double* fparams, // Integrand parameters
double a, double b, // Left and right ends
unsigned n) // Number of panels
{
assert(f != NULL);
assert(a <= b);
assert(n != 0 && n < MAX_PANELS);
...
}
When an assertion fails, the system prints a diagnostic message that tells
the programmer where the problem is (file and line number), then exits. Such
a hard failure is probably appropriate if the user passes a NULL pointer into our
rectangle_rule routine, but what if the user just uses too large a value of n?
In that case, there is a reasonable default behavior that would probably give an
adequate answer. For a kinder failure mode, we can return our best estimate
and an indication that there might be a problem by using an information flag:
const int MAX_PANELS = 10000000;
typedef double (*integrand_t)(double, double*);
/* Integrates f on [a,b] via an n-panel rectangle rule.
* We assume [a,b] and [b,a] are the same interval (no signed measure).
* If an error occurs, the output argument info takes a negative value:
* info == -1 -- n = 0; computed with n = 1
* info == -2 -- n was too large; computed with n = MAX_PANELS
* If no error occurs, info is set to zero.
*/
double rectangle_rule(integrand_t f, // The integrand
double* fparams, // Integrand parameters
double a, double b, // Left and right ends
unsigned n, // Number of panels
int& info) // Status code
{
assert(f != NULL);
info = 0; // 0 means "success"
if (n == 0) {
n = 1;
info = -1; // -1 means "n too small"
} else if (n > MAX_PANELS) {
n = MAX_PANELS;
info = -2; // -2 means "n too big"
}
if (b < a)
swap(a,b); // Allow bounds in either order
...
}
Note that we describe the possible values of the info output variable in a comment
before the function definition. Like the return value and the arguments,
the types of failures a routine can experience are an important part of the
interface, and they should be documented accordingly¹⁷.
In addition to checking the validity of the input variables, we also want to be
careful about our assumptions involving intermediate calculations. For example,
consider the loop
while (error > targetError) {
... refine the solution ...;
}
If targetError is too small, this could be an infinite loop. Instead, we should
at least put in an iteration counter:
const int max_iter = 1000; // Stop runaway loop
int iter = 0;
while (error > targetError) {
if (++iter > max_iter) {
... report error and quit ...;
}
... make the solution more accurate ...;
}
3.7.3 Unit testing

A unit test is a test to make sure that a particular routine does what it is
supposed to do. In general, a unit test suite should include problems that exercise
every line of code, including failures from which the routine is supposedly able
to recover. As a programmer finds errors during development, it also makes
sense to add relevant test cases to the unit tests so that those bugs do not
recur. In many cases, it makes sense to design the unit tests before actually
writing a routine, and to use the unit tests to make sure that various revisions
to a routine are correct.

In addition, high-quality numerical codes are designed with test problems
that probe the accuracy and stability of the computation. These test cases
should include some trivial problems that can be solved exactly (except possibly
for rounding error) as well as more difficult problems.

¹⁷There is an unchecked error in this function: the arguments to the integrand could be
invalid. In this C-style code, we might just allow f to indicate that it has been called with
invalid arguments by returning NaN. If we were using C++ exception handling, we could
allow f to throw an exception when it was given invalid arguments, which would allow more
error information to propagate back to the top-level program without any need to redesign
the rectangle_rule interface.
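A unit test for a panel integrator might look like the following sketch (ours; the trapezoid routine here is for illustration only): it checks a trivial problem that should be solved exactly and checks that the error shrinks on a harder problem with a known answer.

```cpp
#include <cassert>
#include <cmath>

// Trapezoid rule on n equal panels; the routine under test.
double trapezoid(double (*f)(double), double a, double b, unsigned n)
{
    double h = (b - a) / n, sum = (f(a) + f(b)) / 2;
    for (unsigned k = 1; k < n; ++k)
        sum += f(a + k * h);
    return sum * h;
}

void test_trapezoid()
{
    // Trivial problem solved exactly (up to rounding): a constant integrand.
    assert(std::fabs(trapezoid([](double){ return 2.0; }, 0, 3, 7) - 6.0) < 1e-12);

    // Harder problem with known answer: the error should shrink with n.
    double e1 = std::fabs(trapezoid([](double x){ return x * x * x; }, 0, 1, 10) - 0.25);
    double e2 = std::fabs(trapezoid([](double x){ return x * x * x; }, 0, 1, 20) - 0.25);
    assert(e2 < e1);
}
```

Running test_trapezoid after every change to the routine catches regressions early; new cases would be added as bugs are found.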
3.8 References and further reading

For a review of one variable calculus, I recommend the Schaum outline. The
chapter on Taylor series explains the remainder estimate clearly.

There are several good old books on classical numerical analysis. Two favorites
are Numerical Methods by Germund Dahlquist and Åke Björck [2], and Analysis
of Numerical Methods by Gene Isaacson and Herb Keller [12]. Particularly
interesting subjects are the symbolic calculus of finite difference operators and
Gaussian quadrature.

There are several applications of convergent Taylor series in scientific computing.
One example is the fast multipole method of Leslie Greengard and
Vladimir Rokhlin.
3.9 Exercises

1. Verify that (3.23) represents the leading error in the approximation (3.22).
Hint: this does not require multidimensional Taylor series. Why?
2. Use multidimensional Taylor series to show that the rotated five point
operator

    (1/(2h^2)) ( f(x+h, y+h) + f(x+h, y-h) + f(x-h, y+h) + f(x-h, y-h) - 4f )

is a consistent approximation to Δf. Use a symmetry argument to show
that the approximation is at least second order accurate. Show that the
leading error term is

    (h^2/12) ( ∂_x^4 f + 6 ∂_x^2 ∂_y^2 f + ∂_y^4 f ) .
3. For a ∈ [0.5, 2], consider the following recurrence defining x_k(a) = x_k:

    x_{k+1} = ( a + x_k x_{k-1} ) / ( x_k + x_{k-1} )

with x_0 = x_1 = 1 + (a - 1)/2. This recurrence converges to a function
f(a) which is smooth for a ∈ [0.5, 2].

(a) Define Δ_k to be the largest value of |x_{k+1}(a) - x_k(a)| over twenty
evenly spaced values starting at a = 0.5 and ending with a = 2.0.
Plot Δ_k versus k on a semi-logarithmic plot from
k = 1 up to k = 10 and comment on the apparent behavior. About
how accurate do you think x_4(a) is as an approximation to f(a)?
About how accurate do you think x_6(a) is?

(b) Write a program to compute

    x̂'_k(a; h) = ( x_k(a + h) - x_k(a - h) ) / (2h) .

For k = 4 and a = 0.5625, show a log-log plot of |x̂'_k(a; h) - x̂'_k(a; h/2)|
for h = 2^{-1}, 2^{-2}, ..., 2^{-20}. For h not too small, the plot should be
roughly linear on a log-log scale; what is the slope of the line?

(c) For a = 0.5625, the exact value of f'(a) is 2/3. Plot the difference
between x̂'_k(0.5625; h) and 2/3 for h = 2^{-1}, 2^{-2}, ..., 2^{-20}
for k = 4, 6. Comment on the accuracy for k = 4
in light of parts (a) and (b).
4. Find a formula that estimates f'(x_k) from nearby values of f.

5. Let F_k be the average of f over the panel P_k = [x_k, x_{k+1}]:

    F_k = (1/h_k) ∫_{P_k} f(x) dx .

(a) What is the order of accuracy of F_k as an estimate of f((x_k +
x_{k+1})/2) = f(x_{k+1/2})?

(b) Assuming the panels all have size h, find a higher order accurate
estimate of f(x_{k+1/2}) using F_k, F_{k-1}, and F_{k+1}.
6. In this exercise, we computationally explore the accuracy of the asymptotic
series for the function (3.12).

(a) Define

    L_n(h) = Σ_{k=0}^{n-1} ∫_{k/(2n)}^{(k+1)/(2n)} e^{-x/h} / ( 1 - k/(2n) ) dx ,

    R_n(h) = Σ_{k=0}^{n-1} ∫_{k/(2n)}^{(k+1)/(2n)} e^{-x/h} / ( 1 - (k+1)/(2n) ) dx ,

    f(h) = ∫_0^{0.5} e^{-x/h} / ( 1 - x ) dx ,

and show that for any n > 1, h > 0,

    L_n(h) ≤ f(h) ≤ R_n(h) .

(b) Compute R_200(0.3) and L_200(0.3), and give a bound on the relative
error in approximating f(0.3) by R_200(0.3). Let f̂_k(h) be the asymptotic
series for f(h) expanded through order k, and use the result
above to make a semilogarithmic plot of the (approximate) relative
error for h = 0.3 and k = 1 through 30. For what value of k does
f̂_k(0.3) best approximate f(0.3), and what is the relative error?
64 CHAPTER 3. LOCAL ANALYSIS
7. An application requires accurate values of f(x) = e^x - 1 for x very close to zero.^18

   (a) Show that the problem of evaluating f(x) is well conditioned for small x.

   (b) How many digits of accuracy would you expect from the code f = exp(x) - 1; for x ≈ 10^{-5} and for x ≈ 10^{-10} in single and in double precision?

   (c) Let f(x) = Σ_{n=1}^{∞} f_n x^n be the Taylor series about x = 0. Calculate the first three terms, the terms involving x, x^2, and x^3. Let p(x) be the degree three Taylor approximation of f(x) about x = 0.

   (d) Assuming that x_0 is so small that the error is nearly equal to the largest neglected term, estimate max |f(x) - p(x)| when |x| ≤ x_0.

   (e) We will evaluate f(x) using

       if ( abs(x) > x0 ) f = exp(x) - 1;
       else               f = p3(x); // given by part (c)

       What x_0 should we choose to maximize the accuracy of f(x) for |x| < 1, assuming double precision arithmetic and that the exponential function is evaluated to full double precision accuracy (exact answer correctly rounded)?
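The cancellation at issue in part (b) is easy to observe directly. The sketch below is illustrative only (it assumes a C++ environment; expm1 is the standard library function mentioned in footnote 18) and measures the relative error of the naive formula:

```cpp
#include <cmath>

// Naive evaluation of f(x) = e^x - 1: for small x this subtracts two
// nearly equal numbers, so the leading digits cancel.
double f_naive(double x) { return std::exp(x) - 1.0; }

// Relative error of the naive formula, using std::expm1 as the reference.
double rel_err_naive(double x) {
    return std::fabs(f_naive(x) - std::expm1(x)) / std::fabs(std::expm1(x));
}
```

For x near 10^{-10} the relative error of the naive formula is on the order of ε_mach/x ≈ 10^{-6}, so roughly six of the sixteen decimal digits of double precision are lost, which is the kind of estimate part (b) asks for.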
8. Suppose that f(x) is a function that is evaluated to full machine precision but that there is ε_mach rounding error in evaluating Â = (f(x + h) - f(x))/h. What value of h minimizes the total absolute error, including both rounding error^19 and truncation error? This will be h_*(ε_mach) ∼ ε_mach^q. Let ê(ε_mach) be the error in the resulting best estimate of f'(x). Show that ê ∼ ε_mach^r and find r.
9. Repeat Exercise 8 with the two point centered difference approximation to f'(x). Show that the best error possible with centered differencing is much better than the best possible with the first order approximation. This is one of the advantages of higher order finite difference approximations.
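The scalings in Exercises 8 and 9 can be checked numerically: the best achievable error of the one-sided difference behaves like ε_mach^{1/2}, and centered differencing does much better. The sketch below (an illustration, not a solution; f(x) = sin(x) at x = 1 is an arbitrary choice) scans a range of h:

```cpp
#include <algorithm>
#include <cmath>

// Best error over a range of h for the one-sided difference
// (f(x+h) - f(x))/h approximating f'(x) = cos(x) at x = 1.
double best_forward_error() {
    double x = 1.0, best = 1e100;
    for (double h = 1e-1; h > 1e-14; h /= 2) {
        double est = (std::sin(x + h) - std::sin(x)) / h;
        best = std::min(best, std::fabs(est - std::cos(x)));
    }
    return best;  // roughly sqrt(eps_mach) ~ 1e-8
}

// The same experiment for the centered difference (f(x+h) - f(x-h))/(2h).
double best_centered_error() {
    double x = 1.0, best = 1e100;
    for (double h = 1e-1; h > 1e-14; h /= 2) {
        double est = (std::sin(x + h) - std::sin(x - h)) / (2 * h);
        best = std::min(best, std::fabs(est - std::cos(x)));
    }
    return best;  // roughly eps_mach^(2/3), several digits better
}
```

The two minima differ by a few orders of magnitude, which is the point of Exercise 9.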
10. Verify that the two point Gauss quadrature formula of Figure 3.6 is exact for monomials of degree less than four. This involves checking the functions f(x) = 1 and f(x) = x^2, because the odd order monomials are exact by symmetry. Check that the three point Gauss quadrature formula is exact for monomials of degree less than six, which adds the check of f(x) = x^4.
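As a quick check of the exactness degrees, here is a sketch using the standard Gauss nodes and weights on [-1, 1] (these are the usual textbook values; whether they match the normalization used in Figure 3.6 is an assumption):

```cpp
#include <cmath>

// Two point Gauss rule on [-1,1]: nodes +-1/sqrt(3), weights 1.
double gauss2(double (*f)(double)) {
    double x = 1.0 / std::sqrt(3.0);
    return f(-x) + f(x);
}

// Three point Gauss rule on [-1,1]: nodes 0 and +-sqrt(3/5),
// weights 8/9 and 5/9.
double gauss3(double (*f)(double)) {
    double x = std::sqrt(3.0 / 5.0);
    return (8.0 / 9.0) * f(0.0) + (5.0 / 9.0) * (f(-x) + f(x));
}

double mono2(double x) { return x * x; }
double mono4(double x) { return x * x * x * x; }
```

The exact values are ∫_{-1}^{1} x^2 dx = 2/3 and ∫_{-1}^{1} x^4 dx = 2/5; the two point rule reproduces the first but not the second, while the three point rule reproduces both.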
11. Find the replacement for the adaptive halting criterion (3.41) for a method of order p.

^18 The C standard library function expm1 evaluates e^x - 1 to high relative accuracy even for x close to zero.
^19 Note that both x and f(x) will generally be rounded. Recall from the discussion in Section 2.6 that the subtraction and division in the divided difference formula usually commit negligible, or even zero, additional rounding error.
3.9. EXERCISES 65
12. In this exercise, we will write a simple adaptive integration routine in
order to illustrate some of the steps that go into creating robust numerical
software. We will then use the routine to explore the convergence of an
asymptotic series. You will be given much of the software; your job is to
fill in the necessary pieces, as indicated below (and by TODO comments in
the code you are given).
The method we will use is not very sophisticated, and in practice, you
would usually use a library routine from QUADPACK, or possibly a
quadrature routine in Matlab, to compute the integrals in this exam-
ple. However, you will use the same software design ideas we explore here
when you attack more complicated problems on your own.
(a) Write a procedure based on Simpson's rule to estimate the definite integral

    ∫_a^b f(x) dx

    using a panel integration method with uniformly-sized panels. For your implementation of Simpson's rule, the values f(x_k) should be computed only once. Your procedure should run cleanly against a set of unit tests that ensure that you check the arguments correctly and that you exactly integrate low-order polynomials. Your integration procedures should have the same signature as the last version of the rectangle rule from the software section:

    typedef double (*integrand_t)(double x, double* fargs);
    double simpson_rule(integrand_t f, double* fargs,
                        double a, double b, unsigned n, int& info);
    double gauss3_rule(integrand_t f, double* fargs,
                       double a, double b, unsigned n, int& info);

    If there is an error, the info flag should indicate which argument was a problem (-1 for the first argument, -2 for the second, etc.). On success, info should return the number of function evaluations performed.
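For concreteness, here is one possible shape of the Simpson procedure. It is a sketch, not the solution: the argument checking is abbreviated, and the particular error codes are illustrative assumptions.

```cpp
#include <cmath>

typedef double (*integrand_t)(double x, double* fargs);

// Composite Simpson's rule with n uniform panels on [a,b].
// Each shared panel endpoint is evaluated only once.
// On success, info is the number of function evaluations (2n + 1);
// on a bad argument, info is minus the argument position.
double simpson_rule(integrand_t f, double* fargs,
                    double a, double b, unsigned n, int& info) {
    if (f == 0) { info = -1; return 0.0; }
    if (n == 0) { info = -5; return 0.0; }
    double h = (b - a) / n;
    double sum = f(a, fargs) + f(b, fargs);
    for (unsigned k = 1; k < n; ++k)      // interior panel endpoints
        sum += 2.0 * f(a + k * h, fargs);
    for (unsigned k = 0; k < n; ++k)      // panel midpoints
        sum += 4.0 * f(a + (k + 0.5) * h, fargs);
    info = (int)(2 * n + 1);
    return sum * h / 6.0;
}

// Example integrand: f(x) = x^3 (fargs unused).
double cube(double x, double*) { return x * x * x; }
```

Since Simpson's rule is exact for cubics, integrating x^3 over [0, 1] must give exactly 1/4, which makes a convenient unit test.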
(b) For each of the quadrature rules you have programmed, use (3.30) to compute the apparent order of accuracy for

    I_1(h) ≈ ∫_0^1 e^x dx

    and

    I_2(h) ≈ ∫_0^1 √x dx .

    The order of accuracy for I_1 will be computed as part of your standard test suite. You will need to add the computation of the order of accuracy of I_2 yourself. Do your routines obtain the nominal order of accuracy for I_1? For I_2?
(c) Write code that does one step of Richardson extrapolation on Simpson's rule. Test your code using the test harness from the first part. This procedure should be sixth-order accurate; why is it sixth order rather than fifth? Repeat the procedure for the three-point Gauss rule: write code to do one step of extrapolation and test to see that it is eighth order.
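Simpson's rule has an error expansion in even powers of h, E(h) = c_4 h^4 + c_6 h^6 + ..., which is why one Richardson step jumps from fourth order to sixth rather than fifth. A minimal check, hardwired to ∫_0^1 e^x dx (an illustrative shortcut, not the general harness the exercise asks for):

```cpp
#include <cmath>

// Composite Simpson's rule for exp on [0,1] with n panels.
double simpson_exp(unsigned n) {
    double h = 1.0 / n, sum = std::exp(0.0) + std::exp(1.0);
    for (unsigned k = 1; k < n; ++k) sum += 2.0 * std::exp(k * h);
    for (unsigned k = 0; k < n; ++k) sum += 4.0 * std::exp((k + 0.5) * h);
    return sum * h / 6.0;
}

// One Richardson step: cancels the h^4 error term, leaving h^6.
double richardson(unsigned n) {
    return (16.0 * simpson_exp(2 * n) - simpson_exp(n)) / 15.0;
}

// Apparent order of accuracy from the errors at n and 2n panels.
double observed_order(unsigned n) {
    double exact = std::exp(1.0) - 1.0;
    double e1 = std::fabs(richardson(n) - exact);
    double e2 = std::fabs(richardson(2 * n) - exact);
    return std::log(e1 / e2) / std::log(2.0);
}
```

Doubling the panel count should shrink the extrapolated error by a factor near 2^6 = 64, so the observed order comes out close to six.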
(d) We want to know how the function^20

    f(t) = ∫_0^1 cos(t x^2) dx   (3.42)

    behaves for large t. Write a procedure based on the three-point Gauss quadrature rule with Richardson estimation to find n that gives f(t) to within a specified tolerance (at least 10^{-9}). The procedure should work by repeatedly doubling n until the estimated error, based on comparing approximations, is less than this tolerance. This routine should be robust enough to quit and report failure if it is unable to achieve the requested accuracy.
(e) The approximation

    f(t) ≈ √(π/(8t)) + (1/(2t)) sin(t) - (1/(4t^2)) cos(t) - (3/(8t^3)) sin(t) + ⋯   (3.43)

    holds for large t.^21 Use the procedure developed in the previous part to estimate the error in using one, two, and three terms on the right side of (3.43) for t in the range 1 ≤ t ≤ 1000. Show the errors versus t on a log-log plot. In all cases we want to evaluate f so accurately that the error in our f value is much less than the error of the approximation (3.43). Note that even for a fixed level of accuracy, more points are needed for large t. Plot the integrand to see why.

^20 The function f(t) is a rescaling of the Fresnel integral C(t).
^21 This asymptotic expansion is described, for example, in Exercise 3.1.4 of Olver's Asymptotics and Special Functions.
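Before doing the full plot, a direct comparison of (3.42) and (3.43) at one moderate value of t is a useful sanity check. The sketch below uses a fixed fine Simpson grid instead of the adaptive Gauss routine of part (d); that shortcut, and the choice t = 50, are illustrative assumptions.

```cpp
#include <cmath>

// f(t) = \int_0^1 cos(t x^2) dx by composite Simpson with n panels.
double f_quad(double t, unsigned n) {
    double h = 1.0 / n, sum = std::cos(0.0) + std::cos(t);
    for (unsigned k = 1; k < n; ++k) {
        double x = k * h;
        sum += 2.0 * std::cos(t * x * x);
    }
    for (unsigned k = 0; k < n; ++k) {
        double x = (k + 0.5) * h;
        sum += 4.0 * std::cos(t * x * x);
    }
    return sum * h / 6.0;
}

// The terms of the large-t expansion (3.43) shown in the text.
double f_asym(double t) {
    const double pi = std::acos(-1.0);
    return std::sqrt(pi / (8.0 * t)) + std::sin(t) / (2.0 * t)
         - std::cos(t) / (4.0 * t * t)
         - 3.0 * std::sin(t) / (8.0 * t * t * t);
}
```

At t = 50 the neglected terms of (3.43) are of order t^{-4} ≈ 10^{-7}, so quadrature and expansion should agree to several digits, and they do.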
Chapter 4
Linear Algebra I, Theory
and Conditioning
4.1 Introduction
Linear algebra and calculus are the basic tools of quantitative science. The oper-
ations of linear algebra include solving systems of equations, nding subspaces,
solving least squares problems, factoring matrices, and computing eigenvalues
and eigenvectors. There is good, publicly available software to perform these
computations, and in most cases this software is faster and more accurate than
code you write yourself. Chapter 5 outlines some of the basic algorithms of
computational linear algebra. This chapter discusses more basic material.
Conditioning is a major concern in many linear algebra computations. Easily available linear algebra software is backward stable, which essentially^1 means that the results are as accurate as the conditioning of the problem allows. Even a backward stable method produces large errors if the condition number is of the order of 1/ε_mach. For example, if the condition number is 10^18, even double precision calculations are likely to yield a completely wrong answer. Unfortunately, such condition numbers occur in problems that are not terribly large or rare.
If a computational method for a well-conditioned problem is unstable (much less accurate than its conditioning allows), it is likely because one of the subproblems is ill-conditioned. For example, the problem of computing the matrix exponential, e^A, may be well-conditioned while the problem of computing the eigenvectors of A is ill-conditioned. A stable algorithm for computing e^A (see Exercise 12) in that case must avoid using the eigenvectors of A.
The condition number of a problem (see Section 2.7) measures how small perturbations in the data affect the answer. This is called perturbation theory. Suppose A is a matrix^2 and f(A) is the solution of a linear algebra problem involving A, such as x that satisfies Ax = b, or λ and v that satisfy Av = λv. Perturbation theory seeks to estimate Δf = f(A + ΔA) - f(A) when ΔA is small. Usually, this amounts to calculating the derivative of f with respect to A.

We simplify the results of perturbation calculations using bounds that involve vector or matrix norms. For example, suppose we want to say that all the entries in ΔA or Δv are small. For a vector, v, or a matrix, A, the norm, ‖v‖ or ‖A‖, is a number that characterizes the size of v or A. Using norms, we can say that the relative size of a perturbation in A is ‖ΔA‖/‖A‖.
The condition number of a problem involving A depends on the problem as well as on A. The condition number of f(A) = A^{-1}b (i.e., solving the system of linear equations Ax = b) is very different from the problem of finding the eigenvalues of A. There are matrices that have well conditioned eigenvalues but poorly conditioned eigenvectors. What is commonly called the condition number of A is the worst case condition number of solving Ax = b, taking the worst possible b.

^1 The precise definition of backward stability is in Chapter 5.
^2 This notation replaces our earlier A(x). In linear algebra, A always is a matrix and x never is a matrix.
4.2 Review of linear algebra
This section reviews some linear algebra that we will use later. It is not a
substitute for a course on linear algebra. We assume that most of the topics are
familiar to the reader. People come to scientific computing with vastly differing
perspectives on linear algebra, and will know some of the concepts we describe
by different names and notations. This section should give everyone a common
language.
Much of the power of linear algebra comes from this interaction between
the abstract and the concrete. Our review connects the abstract language of
vector spaces and linear transformations to the concrete language of matrix
algebra. There may be more than one concrete representation corresponding to
any abstract linear algebra problem. We will find that different representations
often lead to numerical methods with very different properties.
4.2.1 Vector spaces
A vector space is a set of elements that may be added and multiplied by scalars^3
(either real or complex numbers, depending on the application). Vector addition
is commutative (u + v = v + u) and associative ((u + v) + w = u + (v + w)).
Multiplication by scalars is distributive over vector addition (a(u+v) = au+av
and (a + b)u = au + bu for scalars a and b and vectors u and v). There is a
unique zero vector, 0, with 0 +u = u for any vector u.
The standard vector spaces are R^n (or C^n), consisting of column vectors

    u = ( u_1 )
        ( u_2 )
        (  ⋮  )
        ( u_n ) ,

where the components, u_k, are arbitrary real (or complex) numbers. Vector addition and scalar multiplication are done componentwise.
A subset V' ⊆ V is a subspace if sums and scalar multiples of elements of V' remain in V'. That is, V' is closed under vector addition and scalar multiplication. An example is the set of vectors whose components sum to zero (Σ_{k=1}^n u_k = 0). If we add two such vectors or multiply by a scalar, the result also has the zero sum property. On the other hand, the set of vectors whose components sum to one (Σ_{k=1}^n u_k = 1) is not closed under vector addition or scalar multiplication.
^3 Physicists use the word scalar in a different way. For them, a scalar is a number that is the same in any coordinate system. The components of a vector in a particular basis are not scalars in this sense.
The span of a set of vectors, span(f_1, f_2, …, f_n) ⊆ V, is the subspace of V consisting of linear combinations of the vectors f_j:

    u = u_1 f_1 + ⋯ + u_n f_n ,   (4.1)

where the u_k are scalar coefficients. We say f_1, …, f_n are linearly independent if u = 0 implies that u_k = 0 for all k in (4.1). Recall that the f_j are linearly independent if and only if the representation (4.1) uniquely determines the expansion coefficients, u_k. A theorem of linear algebra states that if the f_j are not linearly independent, then it is possible to find a subset of them with the same span. If V = span(f_1, …, f_n) and the f_j are linearly independent, then the f_j are a basis for V.
The standard vector spaces R^n and C^n have standard bases e_1, …, e_n, where e_k is the vector with all zero components but for a single 1 in position k. This is a basis because

    u = ( u_1 )       ( 1 )       ( 0 )             ( 0 )
        ( u_2 ) = u_1 ( 0 ) + u_2 ( 1 ) + ⋯   + u_n ( 0 )  = Σ_{k=1}^n u_k e_k .
        (  ⋮  )       ( ⋮ )       ( ⋮ )             ( ⋮ )
        ( u_n )       ( 0 )       ( 0 )             ( 1 )
In view of this, there is little distinction between coordinates, components, and expansion coefficients, all of which are denoted u_k. If V has a basis with n elements, we say the dimension of V is n. It is possible to make this definition because of the theorem that states that every basis of V has the same number of elements. A vector space that does not have a finite basis is called infinite-dimensional^4.
An inner product space is a vector space that has an inner product ⟨·,·⟩, which is a scalar function of two vector arguments with the following properties:

1. ⟨u, av_1 + bv_2⟩ = a⟨u, v_1⟩ + b⟨u, v_2⟩;

2. ⟨u, v⟩ = conj(⟨v, u⟩), where conj(z) refers to the complex conjugate of z;

3. ⟨u, u⟩ ≥ 0;

4. ⟨u, u⟩ = 0 if and only if u = 0.

When u and v are vectors with ⟨u, v⟩ = 0, we say u and v are orthogonal. If u and v are n component column vectors (u ∈ C^n, v ∈ C^n), their standard inner product (sometimes called the dot product) is

    ⟨u, v⟩ = Σ_{k=1}^n conj(u_k) v_k .   (4.2)

The complex conjugates are not needed when the entries of u and v are real.

^4 An infinite dimensional vector space might have an infinite basis.
Spaces of polynomials are interesting examples of vector spaces. A polynomial in the variable x is a linear combination of powers of x, such as 2 + 3x^4, or 1, or (1/3)(x-1)^2 (x^3 - 3x)^6. We could multiply out the last example to write it as a linear combination of powers of x. The degree of a polynomial is the highest power that it contains. The product (1/3)(x - 1)^2 (x^3 - 3x)^6 has degree 20. The vector space P_d is the set of all polynomials of degree at most d. This space has a basis consisting of d + 1 elements:

    f_0 = 1 , f_1 = x , … , f_d = x^d .   (4.3)
The power basis (4.3) is one basis for P_3 (with d = 3, so P_3 has dimension 4). Another basis consists of the first four Hermite polynomials

    H_0 = 1 , H_1 = x , H_2 = x^2 - 1 , H_3 = x^3 - 3x .

The Hermite polynomials are orthogonal with respect to a certain inner product^5:

    ⟨p, q⟩ = (1/√(2π)) ∫_{-∞}^{∞} p(x) q(x) e^{-x^2/2} dx .   (4.4)
Hermite polynomials are useful in probability because if X is a standard normal random variable, then they are uncorrelated:

    E[H_j(X) H_k(X)] = ⟨H_j, H_k⟩ = (1/√(2π)) ∫_{-∞}^{∞} H_j(x) H_k(x) e^{-x^2/2} dx = 0   if j ≠ k.
Still another basis of P_3 consists of the Lagrange interpolating polynomials for the points 1, 2, 3, and 4:

    l_1 = (x - 2)(x - 3)(x - 4) / ((1 - 2)(1 - 3)(1 - 4)) ,   l_2 = (x - 1)(x - 3)(x - 4) / ((2 - 1)(2 - 3)(2 - 4)) ,

    l_3 = (x - 1)(x - 2)(x - 4) / ((3 - 1)(3 - 2)(3 - 4)) ,   l_4 = (x - 1)(x - 2)(x - 3) / ((4 - 1)(4 - 2)(4 - 3)) .

These are useful for interpolation because, for example, l_1(1) = 1 while l_2(1) = l_3(1) = l_4(1) = 0. If we want u(x) to be a polynomial of degree 3 taking specified values u(1) = u_1, u(2) = u_2, u(3) = u_3, and u(4) = u_4, the answer is

    u(x) = u_1 l_1(x) + u_2 l_2(x) + u_3 l_3(x) + u_4 l_4(x) .
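Because interpolation at four points determines an element of P_3 uniquely, the interpolant built from the l_k reproduces any cubic exactly. A small sketch (the test polynomial is an arbitrary choice):

```cpp
#include <cmath>

// Lagrange basis polynomial l_{k+1} for the points 1, 2, 3, 4
// (k = 0..3 here, zero-based).
double lagrange(int k, double x) {
    double pts[4] = {1.0, 2.0, 3.0, 4.0};
    double v = 1.0;
    for (int j = 0; j < 4; ++j)
        if (j != k) v *= (x - pts[j]) / (pts[k] - pts[j]);
    return v;
}

// Degree-3 interpolant through the values u[0..3] at x = 1, 2, 3, 4.
double interp(const double u[4], double x) {
    double s = 0.0;
    for (int k = 0; k < 4; ++k) s += u[k] * lagrange(k, x);
    return s;
}
```

Sampling f(x) = x^3 - 2x + 1 at the four points and evaluating the interpolant elsewhere recovers f to roundoff.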
The Lagrange interpolating polynomials are linearly independent because if 0 = u(x) = u_1 l_1(x) + u_2 l_2(x) + u_3 l_3(x) + u_4 l_4(x) for all x, then in particular u(x) = 0 at x = 1, 2, 3, and 4, so u_1 = u_2 = u_3 = u_4 = 0. Let V' ⊆ P_3 be the set of polynomials p ∈ P_3 that satisfy p(2) = 0 and p(3) = 0. This is a subspace of P_3. A basis for it consists of l_1 and l_4.
If V' ⊆ V is a subspace, a basis of V' can be extended to a basis of V. For example, if V = P_3 and V' is the set of polynomials that vanish at x = 2 and x = 3, we can take

    f_1 = l_1 , f_2 = l_4 , f_3 = l_2 , f_4 = l_3 .

In general, the dimension of a subspace V' ⊆ V is at most the dimension of V.

4.2.2 Matrices and linear transformations

Suppose L is a linear transformation from a vector space V to a vector space W. If f_1, …, f_n is a basis of V and g_1, …, g_m is a basis of W, the matrix entries a_jk representing L are defined by

    L f_k = Σ_{j=1}^m a_jk g_j .
Because the transformation is linear, we can calculate what happens to a vector u ∈ V in terms of its expansion u = Σ_k u_k f_k. Let w ∈ W be the image of u, w = Lu, written as w = Σ_j w_j g_j. We find

    w_j = Σ_{k=1}^n a_jk u_k ,

which is ordinary matrix-vector multiplication.
The matrix that represents L depends on the basis. For example, suppose V = P_3, W = P_2, and L represents differentiation:

    L(p_0 + p_1 x + p_2 x^2 + p_3 x^3) = d/dx (p_0 + p_1 x + p_2 x^2 + p_3 x^3) = p_1 + 2 p_2 x + 3 p_3 x^2 .

If we take the basis 1, x, x^2, x^3 for V, and 1, x, x^2 for W, then the matrix is

    ( 0  1  0  0 )
    ( 0  0  2  0 )
    ( 0  0  0  3 ) .
The matrix would be different if we used the Hermite polynomial basis for V (see Exercise 1).

Conversely, an m × n matrix, A, represents a linear transformation from R^n to R^m (or from C^n to C^m). We denote this transformation also by A. If v ∈ R^n is an n-component column vector, then the matrix-vector product w = Av is a column vector with m components. As before, the notation deliberately is ambiguous. The matrix A is the matrix that represents the linear transformation A using standard bases of R^n and R^m.
A matrix also may represent a change of basis within the same space V. If f_1, …, f_n and g_1, …, g_n are different bases of V, and u is a vector with expansions u = Σ_k v_k f_k and u = Σ_j w_j g_j, then we may write

    ( v_1 )   ( a_11  a_12  ⋯  a_1n ) ( w_1 )
    (  ⋮  ) = ( a_21  a_22  ⋯  a_2n ) (  ⋮  )
    ( v_n )   (  ⋮     ⋮    ⋱   ⋮   ) ( w_n )
              ( a_n1  a_n2  ⋯  a_nn )
As before, the matrix elements a_jk are the expansion coefficients of g_j with respect to the f_k basis^6. For example, suppose u ∈ P_3 is given in terms of Hermite polynomials or simple powers, u = Σ_{j=0}^3 v_j H_j(x) = Σ_{k=0}^3 w_k x^k; then

    ( v_0 )   ( 1  0  1  0 ) ( w_0 )
    ( v_1 ) = ( 0  1  0  3 ) ( w_1 )
    ( v_2 )   ( 0  0  1  0 ) ( w_2 )
    ( v_3 )   ( 0  0  0  1 ) ( w_3 ) .
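The change of basis can be checked by evaluating both expansions at a point. This sketch hard-codes the 4 × 4 matrix above, which encodes the relations x^2 = H_2 + 1 and x^3 = H_3 + 3x:

```cpp
#include <cmath>

// Hermite coefficients v from power coefficients w, using the
// change-of-basis matrix in the text.
void powers_to_hermite(const double w[4], double v[4]) {
    v[0] = w[0] + w[2];
    v[1] = w[1] + 3.0 * w[3];
    v[2] = w[2];
    v[3] = w[3];
}

// Evaluate an expansion in Hermite polynomials H_0..H_3.
double hermiteval(const double v[4], double x) {
    return v[0] + v[1] * x + v[2] * (x * x - 1.0)
         + v[3] * (x * x * x - 3.0 * x);
}

// Evaluate an expansion in the power basis 1, x, x^2, x^3.
double powerval(const double w[4], double x) {
    return w[0] + w[1] * x + w[2] * x * x + w[3] * x * x * x;
}
```

The two evaluations agree at every x, since both represent the same element of P_3.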
We may reverse the change of basis by using the inverse matrix:

    ( w_1 )   ( a_11  ⋯  a_1n )^{-1} ( v_1 )
    (  ⋮  ) = (  ⋮    ⋱   ⋮   )      (  ⋮  )
    ( w_n )   ( a_n1  ⋯  a_nn )      ( v_n ) .

Two bases must have the same number of elements because only square matrices can be invertible.
Composition of linear transformations corresponds to matrix multiplication. If L is a linear transformation from V to W, and M is a linear transformation from W to a third vector space, Z, then ML is the composite transformation that takes V to Z. The composite of L and M is defined if the target (or range) of L is the same as the source (or domain) of M, W in this case. If A is an m × n matrix and B is p × q, then the source of A is R^n and the target of B is R^p. Therefore, the composite AB is defined if n = p. This is the condition for the matrix product A · B (usually written without the dot) to be defined. The result is a transformation from V = R^q to Z = R^m, i.e., the m × q matrix AB.

^6 We write a_jk for the (j, k) entry of A.
For vector spaces V and W, the set of linear transformations from V to W forms a vector space. We can add two linear transformations and multiply a linear transformation by a scalar. This applies in particular to m × n matrices, which represent linear transformations from R^n to R^m. The entries of A + B are a_jk + b_jk. An n × 1 matrix has a single column and may be thought of as a column vector. The product Au is the same whether we consider u to be a column vector or an n × 1 matrix. A 1 × n matrix has a single row and may be thought of as a row vector. If r is such a row vector, the product rA makes sense, but Ar does not. It is useful to distinguish between row and column vectors although both are determined by n components. The product ru is a 1 × 1 matrix ((1 × n) · (n × 1) gives 1 × 1), i.e., a scalar.
If the source and targets are the same, V = W, or n = m = p = q, then both composites LM and ML are defined, but they probably are not equal. Similarly, if A and B are n × n matrices, then AB and BA are defined, but AB ≠ BA in general. That is, composition and matrix multiplication are noncommutative. If A, B, and C are matrices such that the products AB and BC both are defined, then the products A(BC) and (AB)C also are defined. The associative property of matrix multiplication is the fact that these are equal: A(BC) = (AB)C. In practice, there may be good reasons for computing BC then multiplying by A, rather than finding AB and multiplying by C.
If A is an m × n matrix with real entries a_jk, the transpose of A, written A^T, is the n × m matrix whose (j, k) entry is (A^T)_jk = a_kj. If A has complex entries, then the adjoint of A, written A^*, is the n × m matrix with entries (A^*)_jk = conj(a_kj) (conj(a) is the complex conjugate of a). If A is real, then A^T = A^*. The transpose or adjoint of a column vector is a row vector with the same number of entries, and vice versa. A square matrix (m = n) is symmetric if A = A^T, and self-adjoint (or Hermitian) if A = A^*.

4.2.3 Vector norms

A norm measures the size of a vector by a single number. One norm is the l^1 norm,

    ‖u‖_1 = ‖u‖_{l^1} = Σ_{k=1}^n |u_k| .

Another is the l^∞ norm^7, also called the max norm,

    ‖u‖_∞ = ‖u‖_{l^∞} = max_{k=1,…,n} |u_k| .

Another is the l^2 norm, also called the Euclidean norm,

    ‖u‖_2 = ‖u‖_{l^2} = ( Σ_{k=1}^n |u_k|^2 )^{1/2} = ⟨u, u⟩^{1/2} .
The l^2 norm is natural, for example, for vectors representing positions or velocities in three dimensional space. If the components of u ∈ R^n represent probabilities, the l^1 norm might be more appropriate. In some cases we may have a norm defined indirectly or with a definition that is hard to turn into a number. For example, in the vector space P_3 of polynomials of degree 3, we can define a norm

    ‖p‖ = max_{a ≤ x ≤ b} |p(x)| .   (4.5)

There is no simple formula for ‖p‖ in terms of the coefficients of p.
An appropriate choice of norms is not always obvious. For example, what norm should we use for the two-dimensional subspace of P_3 consisting of polynomials that vanish at x = 2 and x = 3? In other cases, we might be concerned with vectors whose components have very different magnitudes, perhaps because they are associated with measurements in different units. This might happen, for example, if the components of u represent different factors (or variables) in a linear regression. The first factor, u_1, might be the age of a person, the second, u_2, income, the third the number of children. In units of years and dollars, we might get

    u = (   45   )
        ( 50000  )
        (    2   ) .   (4.6)

However, most people would consider a five dollar difference in annual salary to be small, while a five-child difference in family size is significant. In situations like these we can define, for example, a dimensionless version of the l^1 norm:

    ‖u‖ = Σ_{k=1}^n (1/s_k) |u_k| ,

^7 The name l^∞ comes from the relation to the l^p norms, ‖u‖_{l^p} = ( Σ_k |u_k|^p )^{1/p}: one can show that ‖u‖_{l^p} → ‖u‖_{l^∞} as p → ∞.
where the scale factor s_k is a typical value of a quantity with the units of u_k in the problem at hand. In the example above, we might use s_1 = 40 (years), s_2 = 60000 (dollars per year), and s_3 = 2.3 (children). This is equivalent to using the l^1 norm for the problem expressed in a different basis, the rescaled basis vectors s_k e_k. In many computations, it makes sense to change to an appropriately scaled basis before turning to the computer.
4.2.4 Norms of matrices and linear transformations
Suppose L is a linear transformation from V to W. If we have norms for the spaces V and W, we can define a corresponding norm of L, written ‖L‖, as the largest amount by which it stretches a vector:

    ‖L‖ = max_{u ≠ 0} ‖Lu‖ / ‖u‖ .   (4.7)

The norm definition (4.7) implies that for all u,

    ‖Lu‖ ≤ ‖L‖ ‖u‖ .   (4.8)

Moreover, ‖L‖ is the sharp constant in the inequality (4.8) in the sense that if ‖Lu‖ ≤ C ‖u‖ for all u, then C ≥ ‖L‖. Thus, (4.7) is equivalent to saying that ‖L‖ is the sharp constant in (4.8).
The different vector norms give rise to different matrix norms. The matrix norms corresponding to certain standard vector norms are written with corresponding subscripts, such as

    ‖L‖_{l^2} = max_{u ≠ 0} ‖Lu‖_{l^2} / ‖u‖_{l^2} .   (4.9)

For V = W = R^n, it turns out that (for the linear transformation represented in the standard basis by A)

    ‖A‖_{l^1} = max_k Σ_j |a_jk| ,

and

    ‖A‖_{l^∞} = max_j Σ_k |a_jk| .

Thus, the l^1 matrix norm is the maximum column sum while the max norm is the maximum row sum. Other norms, such as the l^2 matrix norm, are hard to compute explicitly in terms of the entries of A.
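The l^1 and l^∞ matrix norms, in contrast to the l^2 norm, are simple to compute from the entries. A sketch for a dense n × n matrix stored row-major:

```cpp
#include <algorithm>
#include <cmath>

// l^1 matrix norm: maximum absolute column sum.
double norm_l1(const double* a, int n) {
    double best = 0.0;
    for (int k = 0; k < n; ++k) {
        double col = 0.0;
        for (int j = 0; j < n; ++j) col += std::fabs(a[j * n + k]);
        best = std::max(best, col);
    }
    return best;
}

// l^infinity matrix norm: maximum absolute row sum.
double norm_linf(const double* a, int n) {
    double best = 0.0;
    for (int j = 0; j < n; ++j) {
        double row = 0.0;
        for (int k = 0; k < n; ++k) row += std::fabs(a[j * n + k]);
        best = std::max(best, row);
    }
    return best;
}
```

For the 2 × 2 matrix with rows (1, -7) and (2, 3), the column sums are 3 and 10 and the row sums are 8 and 5, so the norms are 10 and 8 respectively.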
Any norm defined by (4.7) in terms of vector norms has several properties derived from corresponding properties of vector norms. One is homogeneity: ‖xL‖ = |x| ‖L‖. Another is that ‖L‖ ≥ 0 for all L, with ‖L‖ = 0 only for L = 0. The triangle inequality for vector norms implies that if L and M are two linear transformations from V to W, then ‖L + M‖ ≤ ‖L‖ + ‖M‖. Finally, we have ‖LM‖ ≤ ‖L‖ ‖M‖. This is because the composite transformation stretches no more than the product of the individual maximum stretches:

    ‖M(Lu)‖ ≤ ‖M‖ ‖Lu‖ ≤ ‖M‖ ‖L‖ ‖u‖ .

Of course, all these properties hold for matrices of the appropriate sizes.
All of these norms have uses in the theoretical parts of scientific computing, the l^1 and l^∞ norms being the easiest to compute explicitly.

4.2.5 Eigenvalues and eigenvectors

Suppose A is an n × n matrix with n distinct eigenvalues λ_1, …, λ_n and corresponding right eigenvectors r_k, so that Ar_k = λ_k r_k. Let R be the matrix whose columns are the r_k, and let Λ be the diagonal matrix of eigenvalues,

    Λ = diag(λ_1, …, λ_n) .
The eigenvalue-eigenvector relations may be expressed in terms of these matrices as

    AR = RΛ .   (4.11)

To see this, check that multiplying R by A is the same as multiplying each of the columns of R by A. Since these are eigenvectors, we get

    A ( r_1 | ⋯ | r_n ) = ( λ_1 r_1 | ⋯ | λ_n r_n ) = ( r_1 | ⋯ | r_n ) diag(λ_1, …, λ_n) = RΛ .
Since the columns of R are linearly independent, R is invertible, and we can multiply (4.11) from the right and from the left by R^{-1} to get

    R^{-1} A R R^{-1} = R^{-1} R Λ R^{-1} ,

then cancel the R^{-1}R and RR^{-1}, and define^9 L = R^{-1} to get

    LA = ΛL .

This shows that the k-th row of L is an eigenvector of A if we put the A on the right:

    l_k A = λ_k l_k .

Of course, the λ_k are the same: there is no difference between right and left eigenvalues.

^9 Here L refers to a matrix, not a general linear transformation.
The matrix equation we used to define L, LR = I, gives useful relations between left and right eigenvectors. The (j, k) entry of LR is the product of row j of L with column k of R. When j = k this product should be a diagonal entry of I, namely one. When j ≠ k, the product should be zero. That is,

    l_k r_k = 1 ,
    l_j r_k = 0   if j ≠ k.   (4.12)

These are called biorthogonality relations. For example, r_1 need not be orthogonal to r_2, but it is orthogonal to l_2. The set of vectors r_k is not orthogonal, but the two sets l_j and r_k are biorthogonal. The left eigenvectors are sometimes called adjoint eigenvectors because their transposes form right eigenvectors for the adjoint of A:

    A^* l_j^* = conj(λ_j) l_j^* .
Still supposing n distinct eigenvalues, we may take the right eigenvectors to be a basis for R^n (or C^n if the entries are not real). As discussed in Section 4.2.2, we may express the action of A in this basis. Since Ar_j = λ_j r_j, the matrix representing the linear transformation A in this basis will be the diagonal matrix Λ. In the framework of Section 4.2.2, this is because if we expand a vector v ∈ R^n in the r_k basis, v = v_1 r_1 + ⋯ + v_n r_n, then Av = λ_1 v_1 r_1 + ⋯ + λ_n v_n r_n. For this reason finding a complete set of eigenvectors and eigenvalues is called diagonalizing A. A matrix with n linearly independent right eigenvectors is diagonalizable.
If A does not have n distinct eigenvalues, then there may be no basis in which the action of A is diagonal. For example, consider the matrix

    J = ( 0  1 )
        ( 0  0 ) .

Clearly, J ≠ 0 but J^2 = 0. A diagonal or diagonalizable matrix cannot have this property: if Λ^2 = 0 then Λ = 0, and if the relations A ≠ 0, A^2 = 0 hold in one basis, they hold in any other basis. In general, a Jordan block with eigenvalue λ is a k × k matrix with λ on the diagonal, 1 on the superdiagonal, and zeros elsewhere:

    ( λ  1           )
    (    λ  1        )
    (       ⋱  ⋱     )
    (          λ  1  )
    (             λ  ) .

If a matrix has fewer than n distinct eigenvalues, it might or might not be diagonalizable. If A is not diagonalizable, a theorem of linear algebra states that there is a basis in which A has Jordan form. A matrix has Jordan form if it is block diagonal with Jordan blocks of various sizes and eigenvalues on the diagonal. It can be hard to compute the Jordan form of a matrix numerically, as we will see.
If A has a Jordan block, a basis of eigenvectors will not exist; and even if A is diagonalizable, transforming to an eigenvector basis may be very sensitive. For this reason, most software for the unsymmetric eigenvalue problem actually computes a Schur form:

    AQ = QT ,

where T is an upper triangular matrix with the eigenvalues on the diagonal,

    T = ( λ_1  t_12  t_13  ⋯  t_1n )
        (  0   λ_2   t_23  ⋯  t_2n )
        (  0    0    λ_3   ⋯  t_3n )
        (  ⋮    ⋮          ⋱   ⋮   )
        (  0    0     ⋯    0   λ_n ) ,

and the columns q_k of Q are orthonormal, i.e.

    q_j^* q_k = 1 if j = k, and 0 if j ≠ k.
Equivalently, we can say that Q is an orthogonal matrix, which means that Q^*Q = I. The situation is simpler when A is self-adjoint: there are no Jordan blocks. Every self-adjoint matrix is diagonalizable even if the number of distinct eigenvalues is less than n. A complete set of eigenvectors of a symmetric matrix forms an orthonormal basis; that is, R is orthogonal. The matrix form of the eigenvalue relation (4.11) may be written R^*AR = Λ, or A = RΛR^*, or R^*A = ΛR^*. The rows of R^*, which are the r_k^*, are left eigenvectors of A.
4.2.6 Differentiation and perturbation theory

The main technique in the perturbation theory of Section 4.3 is implicit differentiation. We use the formalism of infinitesimal virtual perturbations, which is related to tangent vectors in differential geometry. It may seem roundabout at first, but it makes actual calculations quick.

Suppose f(x) represents m functions, f_1(x), …, f_m(x), of n variables, x_1, …, x_n. The Jacobian matrix^10, f'(x), is the m × n matrix of partial derivatives, f'_jk(x) = ∂_{x_k} f_j(x). If f is differentiable (and f' is Lipschitz continuous), then

    f(x_0 + Δx) = f(x_0) + f'(x_0)Δx + O(‖Δx‖^2) .   (4.13)

^10 See any good book on multivariate calculus.
Here Δf and Δx are column vectors.

Suppose s is a scalar parameter and x(s) is a differentiable curve in R^n with x(0) = x_0. The function f(x) then defines a curve in R^m with f(x(0)) = f(x_0). We define two vectors, called virtual perturbations,

    ẋ = d/ds |_{s=0} x(s) ,    ḟ = d/ds |_{s=0} f(x(s)) .

The multivariate calculus chain rule implies the virtual perturbation formula

    ḟ = f'(x_0) ẋ .   (4.14)

This formula is nearly the same as (4.13). The virtual perturbation strategy is to calculate the linear relationship (4.14) between virtual perturbations and use it to find the approximate relation (4.13) between actual perturbations. For this, it is important that any ẋ ∈ R^n can be the virtual perturbation corresponding to some curve: just take the straight curve x(s) = x_0 + s ẋ.
The Leibniz rule (product rule) for matrix multiplication makes virtual perturbations handy in linear algebra. Suppose A(s) and B(s) are differentiable curves of matrices, compatible for matrix multiplication. Then the virtual perturbation of AB is given by the product rule

    d/ds |_{s=0} AB = ȦB + AḂ .   (4.15)

To see this, note that the (jk) entry of A(s)B(s) is Σ_l a_jl(s) b_lk(s). Differentiating this using the ordinary product rule and then setting s = 0 yields

    Σ_l ȧ_jl b_lk + Σ_l a_jl ḃ_lk .

These terms correspond to the terms on the right side of (4.15). We can differentiate products of more than two matrices in this way.
As an example, consider perturbations of the inverse of a matrix, B = A^{-1}. The variable x in (4.13) now is the matrix A, and f(A) = A^{-1}. Apply implicit differentiation to the formula AB = I, use the fact that I is constant, and we get

    ȦB + AḂ = 0 .

Then solve for Ḃ, use A^{-1} = B, and get

    Ḃ = -A^{-1} Ȧ A^{-1} .

The corresponding actual perturbation formula is

    Δ(A^{-1}) = -A^{-1} (ΔA) A^{-1} + O(‖ΔA‖^2) .   (4.16)

This is a generalization of the fact that the derivative of 1/x is -1/x^2, so Δ(1/x) ≈ -(1/x^2)Δx. When x is replaced by A and ΔA does not commute with A, we have to worry about the order of the factors. The correct order is (4.16), and not, for example, ȦA^{-1}A^{-1} = ȦA^{-2}.
82 CHAPTER 4. LINEAR ALGEBRA I, THEORY AND CONDITIONING
For future reference we comment on the case m = 1, which is the case of one function of n variables. The 1 × n Jacobian matrix may be thought of as a row vector. We often write this as ∇f, and calculate it from the fact that ḟ = ∇f(x_0) ẋ for any ẋ. Consider, in particular, the function f(x) = x^*Ax for some n × n matrix A. This is a product of the 1 × n matrix x^* with the n × 1 matrix Ax, so the product rule gives

    ḟ = ẋ^* A x + x^* A ẋ .

Since the 1 × 1 real matrix ẋ^*Ax is equal to its transpose, this is

    ḟ = x^* (A + A^*) ẋ .

This implies that (both sides are row vectors)

    ∇( x^* A x ) = x^* (A + A^*) .   (4.17)

If A^* = A, this simplifies to ∇(x^*Ax) = 2x^*A.

4.2.7 Variational principles for the symmetric eigenvalue problem

The Rayleigh quotient of a matrix A is

    Q(x) = (x^* A x) / (x^* x) = ⟨x, Ax⟩ / ⟨x, x⟩ .   (4.18)

If x is complex, x^*x = Σ_{k=1}^n |x_k|^2 = ‖x‖_{l^2}^2. The Rayleigh quotient is defined for x ≠ 0. A vector r is a stationary point if ∇Q(r) = 0. If r is a stationary point, the corresponding value λ = Q(r) is a stationary value.
Theorem 1 Suppose A is self-adjoint. A vector x ,= 0 is an eigenvector if and
only if it is a stationary point of the Rayleigh quotient; and if x is an eigenvector,
the Rayleigh quotient is the corresponding eigenvalue.
Proof: The underlying principle is the calculation (4.17). If A^* = A (this is
where symmetry matters) then ∇( x^* A x ) = x^* (A + A^*) = 2 x^* A. Also,
∇( x^* x ) = 2 x^*. The quotient rule then gives

    ∇Q(x) = (2 x^* A)/(x^* x) − (x^* A x)(2 x^*)/(x^* x)²
          = (2/(x^* x)) ( x^* A − x^* Q(x) ) .

If x is a stationary point (∇Q = 0), then

    x^* A = ( (x^* A x)/(x^* x) ) x^* ,

and, taking adjoints and using A^* = A,

    A x = ( (x^* A x)/(x^* x) ) x .
This shows that x is an eigenvector with

    λ = (x^* A x)/(x^* x) = Q(x)

as the corresponding eigenvalue. Conversely if Ar = λr, then Q(r) = λ and the
calculations above show that ∇Q(r) = 0. This proves the theorem.
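Theorem 1 can be illustrated with a tiny computation. The sketch below is an illustration, not from the text: it evaluates the Rayleigh quotient of an invented symmetric 2 × 2 matrix at an eigenvector and at a slightly perturbed vector. The value at the eigenvector is the eigenvalue, and perturbing the eigenvector changes Q only to second order, as stationarity predicts.

```python
# Rayleigh quotient Q(x) = x^T A x / x^T x for a symmetric 2x2 matrix.

def Q(A, x):
    Ax = [A[0][0]*x[0] + A[0][1]*x[1], A[1][0]*x[0] + A[1][1]*x[1]]
    return (x[0]*Ax[0] + x[1]*Ax[1]) / (x[0]*x[0] + x[1]*x[1])

A = [[2.0, 1.0], [1.0, 2.0]]   # eigenvalues 3 (eigenvector (1,1)) and 1
r = [1.0, 1.0]                 # eigenvector for lambda = 3
print(Q(A, r))                 # Q at an eigenvector is the eigenvalue, 3

# Stationarity: perturbing r by t changes Q only to second order in t.
t = 1e-4
print(Q(A, [1.0 + t, 1.0 - t]))  # 3 - O(t^2), not 3 - O(t)
```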
A simple observation shows that there is at least one stationary point of Q
for Theorem 1 to apply to. If α is a real number, then Q(αx) = Q(x). We may
choose α so that¹¹ ‖αx‖ = 1. This shows that

    max_{x≠0} Q(x) = max_{‖x‖=1} Q(x) = max_{‖x‖=1} x^* A x .
A theorem of analysis states that if Q(x) is a continuous function on a compact
set, then there is an r so that Q(r) = max_x Q(x) (the max is attained). The
set of x with ‖x‖ = 1 (the unit sphere) is compact and Q is continuous. Clearly
if Q(r) = max_x Q(x), then ∇Q(r) = 0, so r is a stationary point of Q and an
eigenvector of A.
Now suppose we have found m orthogonal eigenvectors r_1, . . . , r_m. If x is
orthogonal to the r_j, i.e. r_j^* x = 0, then so is Ax:

    r_j^* (A x) = (r_j^* A) x = λ_j r_j^* x = 0 .
Therefore, the subspace V_m ⊆ C^n of all x that are orthogonal to r_1, . . . , r_m
is an invariant subspace: if x ∈ V_m, then Ax ∈ V_m. Thus A defines a linear
transformation from V_m to V_m, which we call A_m. Chapter 5 gives a proof that
A_m is symmetric in a suitable basis. Therefore, Theorem 1 implies that A_m has
at least one eigenvector, r_{m+1}, with eigenvalue λ_{m+1}. Since r_{m+1} ∈ V_m, the
action of A and A_m on r_{m+1} is the same, which means that A r_{m+1} = λ_{m+1} r_{m+1}.
After finding r_{m+1}, we can repeat the procedure to find r_{m+2}, and continue until
we eventually find a full set of n orthogonal eigenvectors.
4.2.8 Least squares
Suppose A is an m × n matrix representing a linear transformation from R^n to
R^m, and we have a vector b ∈ R^m. If n < m the linear equation system Ax = b
is overdetermined in the sense that there are more equations than variables to
satisfy them. If there is no x with Ax = b, we may seek an x that minimizes
the residual

    r = Ax − b .        (4.19)

This linear least squares problem

    min_x ‖Ax − b‖_{l2} ,        (4.20)
11
In this section and the next, ‖x‖ = ‖x‖_{l2}.
is the same as finding x to minimize the sum of squares

    ‖r‖²_{l2} = SS = Σ_j r_j² .
Linear least squares problems arise through linear regression in statistics.
A linear regression model models the response, b, as a linear combination of
m explanatory vectors, a_k, each weighted by a regression coefficient, x_k. The
residual, R = (Σ_{k=1}^m a_k x_k) − b, is modeled as a Gaussian random variable¹²
with mean zero and variance σ². We do n experiments. The explanatory
variables and response for experiment j are a_{jk}, for k = 1, . . . , m, and b_j, and
the residual (for given regression coefficients) is r_j = Σ_{k=1}^m a_{jk} x_k − b_j. The
log likelihood function is (up to a constant) f(x) = −(1/(2σ²)) Σ_{j=1}^n r_j².
Finding regression coefficients to maximize the log likelihood function leads
to (4.20).
The normal equations give one approach to least squares problems. We
calculate:

    ‖r‖²_{l2} = r^* r
              = (Ax − b)^* (Ax − b)
              = x^* A^* A x − 2 x^* A^* b + b^* b .

Setting the gradient to zero as in the proof of Theorem 1 leads to the normal
equations

    A^* A x = A^* b ,        (4.21)

which can be solved by

    x = (A^* A)^{-1} A^* b .        (4.22)
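As a concrete illustration (the data below are invented), the normal equations for fitting a line b ≈ x0 + x1·t to three points reduce to a 2 × 2 system, which the following sketch forms and solves directly:

```python
# Solve the 2x2 normal equations A^T A x = A^T b for a small line fit.

A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # model b ~ x0 + x1 * t at t = 0, 1, 2
b = [0.0, 1.0, 3.0]

# Form A^T A (2x2) and A^T b (2-vector) explicitly.
AtA = [[sum(A[i][j] * A[i][k] for i in range(3)) for k in range(2)]
       for j in range(2)]
Atb = [sum(A[i][j] * b[i] for i in range(3)) for j in range(2)]

# Solve the 2x2 system by Cramer's rule.
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
x = [(Atb[0] * AtA[1][1] - AtA[0][1] * Atb[1]) / det,
     (AtA[0][0] * Atb[1] - Atb[0] * AtA[1][0]) / det]
print(x)   # the least squares line: b ~ -1/6 + 1.5 t

# The residual r = Ax - b is orthogonal to the columns of A: A^T r = 0.
r = [A[i][0] * x[0] + A[i][1] * x[1] - b[i] for i in range(3)]
print([sum(A[i][j] * r[i] for i in range(3)) for j in range(2)])  # ~ [0, 0]
```

The orthogonality check at the end is the normal equations (4.21) restated: the minimizing residual is perpendicular to the range of A.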
The matrix M = (A^* A)^{-1} A^* appearing in (4.22) acts as an inverse for the
overdetermined system; it is sometimes called the pseudoinverse of A.

The largest singular value of A is

    σ_1 = max_{x≠0} ‖Ax‖ / ‖x‖ .        (4.24)
As in Section 4.2.7, the maximum is achieved, and σ_1 > 0. Let v_1 ∈ R^n be a
maximizer, normalized to have ‖v_1‖ = 1. Because ‖Av_1‖ = σ_1, we may write
Av_1 = σ_1 u_1 with ‖u_1‖ = 1. This is the relation (4.23) with k = 1.
The optimality condition calculated as in the proof of Theorem 1 implies
that

    u_1^* A = σ_1 v_1^* .        (4.25)

Indeed, since σ_1 > 0, (4.24) is equivalent to¹³

    σ_1² = max_{x≠0} ‖Ax‖²/‖x‖² = max_{x≠0} (Ax)^*(Ax)/(x^* x)
         = max_{x≠0} x^* (A^* A) x / (x^* x) .        (4.26)

Theorem 1 implies that the solution to the maximization problem (4.26), which
is v_1, satisfies σ_1² v_1 = A^* A v_1. Since Av_1 = σ_1 u_1, this implies σ_1 v_1 = A^* u_1,
which is the same as (4.25).
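The characterization (4.26) of σ_1² as the largest eigenvalue of A^* A suggests a simple way to compute σ_1: power iteration on A^* A. The sketch below (the matrix is invented for illustration) does exactly that:

```python
import math

# sigma_1^2 is the largest eigenvalue of A^T A, as in (4.26);
# power iteration on A^T A recovers sigma_1 for a small example.

A = [[3.0, 0.0], [4.0, 5.0]]

# B = A^T A
B = [[sum(A[k][i] * A[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]

x = [1.0, 1.0]
for _ in range(100):
    y = [B[0][0]*x[0] + B[0][1]*x[1], B[1][0]*x[0] + B[1][1]*x[1]]
    n = math.hypot(y[0], y[1])       # n -> largest eigenvalue of B
    x = [y[0] / n, y[1] / n]

sigma1 = math.sqrt(n)   # largest singular value; sqrt(45) for this A
print(sigma1)
```

For this A, A^T A = [[25, 20], [20, 25]] with eigenvalues 45 and 5, so the iteration converges to σ_1 = √45.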
The analogue of the eigenvalue orthogonality principle is that if x^* v_1 = 0,
then (Ax)^* u_1 = 0. This is true because

    (Ax)^* u_1 = x^* (A^* u_1) = x^* σ_1 v_1 = 0 .
Therefore, if we define V_1 ⊆ R^n by x ∈ V_1 if x^* v_1 = 0, and U_1 ⊆ R^m by
y ∈ U_1 if y^* u_1 = 0, then A also defines a linear transformation (called A_1)
from V_1 to U_1. If A_1 is not identically zero, we can define

    σ_2² = max_{x∈V_1, x≠0} ‖Ax‖²/‖x‖² = max_{x^* v_1 = 0, x≠0} ‖Ax‖²/‖x‖² ,
13
These calculations make constant use of the associativity of matrix multiplication, even
when one of the matrices is a row or column vector.
and get Av_2 = σ_2 u_2 with v_2^* v_1 = 0 and u_2^* u_1 = 0. This is the second step in
constructing orthonormal bases satisfying (4.23). Continuing in this way, we
can continue finding orthonormal vectors v_k and u_k that satisfy (4.23) until we
reach A_k = 0 or k = m or k = n. After that point, we may complete the v and
u bases arbitrarily as in Chapter 5 with remaining singular values being zero.
The singular value decomposition (SVD) is a matrix expression of the rela-
tions (4.23). Let U be the m × m matrix whose columns are the left singular
vectors u_k (as in (4.10)). The orthonormality relations¹⁴ u_j^* u_k = δ_{jk} are equiv-
alent to U being an orthogonal matrix: U^* U = I. Similarly, let V be the n × n
orthogonal matrix whose columns are the right singular vectors v_k, and let Σ be
the m × n diagonal matrix with the singular values σ_k on its diagonal. The
relations (4.23) then take the form

    A = U Σ V^* .        (4.27)

This is the singular value decomposition. Any matrix may be factored, or decom-
posed, into the product of the form (4.27) where U is an m × m orthogonal
matrix, Σ is an m × n diagonal matrix with nonnegative entries, and V is an
n × n orthogonal matrix.
A calculation shows that A^* A = V Σ^* Σ V^*. This shows that the eigenvalues
of A^* A are given by μ_j = σ_j² and the right singular vectors of A are the
eigenvectors of A^* A. It also implies that κ_{l2}(A^* A) = κ_{l2}(A)². This means
that the condition number of solving the normal equations (4.21) is the square
of the condition number of the original least squares problem (4.20). If the
condition number of a least squares problem is κ_{l2}(A) = 10^5, even the best
solution algorithm can amplify errors by a factor of 10^5. Solving using the
normal equations can amplify rounding errors by a factor of 10^10.
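The squaring of the condition number is easiest to see for a diagonal matrix, where the l² condition number is just the ratio of the largest to smallest diagonal entry. The numbers below are made up for illustration:

```python
# For diagonal A, kappa_l2(A) = max|a_ii| / min|a_ii|; forming A^T A squares it.

diag_A = [100.0, 1.0]                 # hypothetical diagonal matrix entries
diag_AtA = [d * d for d in diag_A]    # A^T A is diagonal with squared entries

kappa_A = max(diag_A) / min(diag_A)
kappa_AtA = max(diag_AtA) / min(diag_AtA)
print(kappa_A, kappa_AtA)   # kappa(A^T A) = kappa(A)**2
```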
Many people call singular vectors u_k and v_k principal components. They
refer to the singular value decomposition as principal component analysis, or
PCA. One application is clustering, in which you have n objects, each with m
measurements, and you want to separate them into two clusters, say "girls"
and "boys". You assemble the data into a matrix, A, and compute, say, the
largest two singular values and corresponding left singular vectors, u_1 ∈ R^m
and u_2 ∈ R^m. The data for object k is a_k ∈ R^m, and you compute z_k ∈ R² by
z_{k1} = u_1^* a_k and z_{k2} = u_2^* a_k, the components of a_k in the principal
component directions. You then plot the n points z_k in the plane and look for
clusters, or maybe just a line that separates one group of points from another.
Surprising as it may seem, this simple procedure does identify clusters in
practical problems.
14
Here δ_{jk} is the Kronecker delta, which is equal to one when j = k and zero otherwise.
4.3 Condition number
Ill-conditioning can be a serious problem in numerical solution of problems in
linear algebra. We take into account possible ill-conditioning when we choose
computational strategies. For example, the matrix exponential exp(A) (see
Exercise 12) can be computed using the eigenvectors and eigenvalues of A. We
will see in Section 4.3.3 that the eigenvalue problem may be ill conditioned even
when the problem of computing exp(A) is fine. In such cases we need a way to
compute exp(A) that does not use the eigenvectors and eigenvalues of A.
As we said in Section 2.7 (in slightly different notation), the condition num-
ber is the ratio of the change in the answer to the change in the problem data,
with (i) both changes measured in relative terms, and (ii) the change in the
problem data being small. Norms provide a way to extend this definition to
functions and data with more than one component. Let f(x) represent m func-
tions of n variables, with Δx being a change in x and Δf = f(x + Δx) − f(x)
the corresponding change in f. The size of Δx, relative to x, is ‖Δx‖ / ‖x‖, and
similarly for Δf. In the multivariate case, the size of Δf depends not only on
the size of Δx, but also on the direction. The norm-based condition number
seeks the worst case Δx, which leads to

    κ(x) = lim_{ε→0} max_{‖Δx‖=ε} ( ‖f(x + Δx) − f(x)‖ / ‖f(x)‖ ) / ( ‖Δx‖ / ‖x‖ ) .        (4.28)

Except for the maximization over directions Δx with ‖Δx‖ = ε, this is the same
as the earlier definition 2.11.
Still following Section 2.7, we express (4.28) in terms of derivatives of f. We
let f′(x) denote the m × n Jacobian matrix of f at x, so that Δf = f′(x) Δx +
O(‖Δx‖²). Since O(‖Δx‖²)/‖Δx‖ = O(‖Δx‖), the ratio in (4.28) may be written

    (‖Δf‖/‖Δx‖) · (‖x‖/‖f‖) = (‖f′(x) Δx‖/‖Δx‖) · (‖x‖/‖f‖) + O(‖Δx‖) .

The second term on the right disappears as Δx → 0. Maximizing the first term
on the right over Δx yields the norm of the matrix f′(x). Altogether, this gives

    κ(x) = ‖f′(x)‖ · ‖x‖/‖f(x)‖ .        (4.29)
This differs from the earlier condition number definition (2.13) in that it uses
norms and maximizes over Δx with ‖Δx‖ = ε.
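For a scalar function the formula (4.29) reduces to κ(x) = |f′(x)| |x| / |f(x)|. A quick numerical check (the example values are invented): for f(x) = x², κ = 2 at every x, so relative errors in x roughly double in f.

```python
# kappa(x) = |f'(x)| |x| / |f(x)| for scalar f, as in (4.29).
# For f(x) = x**2, kappa = 2: relative errors in x are doubled in f.

f = lambda t: t * t
x, dx = 3.0, 1e-7

rel_in = abs(dx) / abs(x)
rel_out = abs(f(x + dx) - f(x)) / abs(f(x))
print(rel_out / rel_in)   # close to kappa = 2
```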
In specific calculations we often use a slightly more complicated way of stat-
ing the definition (4.28). Suppose that P and Q are two positive quantities and
there is a C so that P ≤ C · Q. We say that C is the sharp constant if there
is no C′ with C′ < C so that P ≤ C′ · Q. For example, sin(2ε) ≤ C · ε holds
with C = 2, which is sharp.
This sharp constant idea is not exactly what we want because it is required
to hold for all ε (large or small), and because the inequality you might want for
small ε is not exactly true. For example, there is no inequality tan(ε) ≤ C ε that
applies for all ε > 0. As ε → 0, the constant seems to be C = 1, but this is not
strictly true, since tan(ε) > ε for 0 < ε < π/2. Therefore we write

    P(ε) ≲ C · Q(ε) as ε → 0 ,        (4.30)

if P(ε) > 0, Q(ε) > 0, and

    lim_{ε→0} P(ε)/Q(ε) ≤ C .

This C is the sharp constant, the smallest C so that

    P(ε) ≤ C · Q(ε) + O(ε) as ε → 0.

The definition (4.28) is precisely that κ(x) is the sharp constant in the inequality

    ‖Δf‖/‖f‖ ≲ κ(x) · ‖Δx‖/‖x‖ as ‖Δx‖ → 0.        (4.31)
4.3.1 Linear systems, direct estimates
We start with the condition number of calculating b = Au in terms of u with A
fixed. This fits into the general framework above, with u playing the role of x,
and Au of f(x). Of course, A is the Jacobian of the function u → Au, so (4.29)
gives

    κ(A, u) = ‖A‖ · ‖u‖/‖Au‖ .        (4.32)

This shows that computing Au is ill-conditioned if ‖A‖ is much larger than
the ratio ‖Au‖/‖u‖. The norm of A is the maximum amount A can stretch any
vector (max ‖Av‖/‖v‖). Computing Au is ill-conditioned if it stretches some
vector v much more than it stretches u.
The condition number of solving a linear system Au = b (finding u as a
function of b) is the same as the condition number of computing u = A^{-1} b.
The formula (4.32) gives this as

    κ(A^{-1}, b) = ‖A^{-1}‖ · ‖b‖/‖A^{-1} b‖ = ‖A^{-1}‖ · ‖Au‖/‖u‖ .

This is large if there is some vector that is stretched much less than u. Of
course, stretching factors can be less than one. For future reference, note
that κ(A^{-1}, b) is not the same as (4.32).
The traditional definition of the condition number of the Au computation
takes the worst case relative error not only over perturbations Δu but also over
vectors u. Taking the maximum over Δu led to (4.32), so we need only maximize
it over u:

    κ(A) = ‖A‖ max_{u≠0} ‖u‖/‖Au‖ .        (4.33)
Since A(u + Δu) − Au = A Δu, and Δu and u are independent variables, this is
the same as

    κ(A) = max_{u≠0} ( ‖u‖/‖Au‖ ) · max_{Δu≠0} ( ‖A Δu‖/‖Δu‖ ) .        (4.34)

To evaluate the maximum, we suppose A^{-1} exists.¹⁵ Substituting Au = v,
u = A^{-1} v, gives

    max_{u≠0} ‖u‖/‖Au‖ = max_{v≠0} ‖A^{-1} v‖/‖v‖ = ‖A^{-1}‖ .

Thus, (4.33) leads to

    κ(A) = ‖A‖ ‖A^{-1}‖        (4.35)
as the worst case condition number of the forward problem.
The computation b = Au with

    A = [ 1000    0
             0   10 ]

illustrates this discussion. The error amplification relates ‖Δb‖/‖b‖ to ‖Δu‖/‖u‖.
The worst case would make ‖b‖ small relative to ‖u‖ and ‖Δb‖ large relative
to ‖Δu‖: amplify u the least and Δu the most. This is achieved by taking
u = (0, 1)^* so that Au = (0, 10)^* with amplification factor 10, and Δu = (ε, 0)^*
with A Δu = (1000ε, 0)^* and amplification factor 1000. This makes ‖Δb‖/‖b‖
100 times larger than ‖Δu‖/‖u‖. For the condition number of calculating
u = A^{-1} b, the worst case is b = (1, 0)^* and Δb = (0, ε)^*, which amplifies the
error by the same factor of κ(A) = 100.
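The worst-case amplification above can be replayed numerically. This sketch (pure Python, with an invented small perturbation size) reproduces the factor κ(A) = 100:

```python
# Worst-case amplification for A = diag(1000, 10): kappa(A) = 100.

def apply(A, v):
    return [A[0][0]*v[0] + A[0][1]*v[1], A[1][0]*v[0] + A[1][1]*v[1]]

def norm(v):
    return (v[0]**2 + v[1]**2) ** 0.5

A = [[1000.0, 0.0], [0.0, 10.0]]
u = [0.0, 1.0]            # stretched the least: |Au|/|u| = 10
du = [1e-6, 0.0]          # stretched the most: |A du|/|du| = 1000

b, db = apply(A, u), apply(A, du)
rel_in = norm(du) / norm(u)
rel_out = norm(db) / norm(b)
print(rel_out / rel_in)   # kappa(A) = |A| |A^-1| = 1000 * (1/10) = 100
```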
The informal condition number (4.33) has advantages and disadvantages over
the more formal one (4.32). At the time we design a computational strategy,
it may be easier to estimate the informal condition number than the formal
one, as we may know more about A than u. If we have no idea what u will
come up, we have a reasonable chance of getting something like the worst one.
Also, by coincidence, κ(A) defined by (4.33) determines the convergence rate
of some iterative methods for solving linear systems involving A. On the other
hand, in solving differential equations κ(A) often is much larger than κ(A, u). In
such cases, the error in solving Au = b is much less than the condition number
κ(A) would suggest. For example, in Exercise 11, κ(A) is on the order of n²,
where n is the number of unknowns. The truncation error for the second order
discretization is on the order of 1/n². A naive estimate using (4.33) might
suggest that solving the system amplifies the O(n^{-2}) truncation error by a
factor of n² to be on the same order as the solution itself. This does not happen
because the u we seek is smooth, and not like the worst case.
15
See Exercise 8 for the l² condition number of the u → Au problem with singular or
non-square A.
The informal condition number (4.35) also has the strange feature that
κ(A) = κ(A^{-1}), since (A^{-1})^{-1} = A. This gives the mistaken impression that
solving a forward problem (computing Au from u) has the same stability prop-
erties as solving the inverse problem (computing u from b = Au). For example,
solving the heat equation forward in time is far different from the inverse prob-
lem of solving it backward in time.¹⁶ Again, the more precise definition (4.32)
does not have this drawback.
4.3.2 Linear systems, perturbation theory
If Au = b, we can study the dependence of u on A through perturbation theory.
The starting point is the perturbation formula (4.16). Taking norms gives

    ‖Δu‖ ≲ ‖A^{-1}‖ ‖ΔA‖ ‖u‖ , (for small ΔA),        (4.36)

so

    ‖Δu‖/‖u‖ ≲ ‖A^{-1}‖ ‖A‖ · ‖ΔA‖/‖A‖ .        (4.37)

This shows that the condition number satisfies κ ≤ ‖A^{-1}‖ ‖A‖. The condition
number is equal to ‖A^{-1}‖ ‖A‖ if the inequality (4.36) is (approximately for small
ΔA) sharp, which it is because we can take ΔA = εI and u to be a maximum
stretch vector for A^{-1}. Note that the condition number formula (4.35) applies
to the problem of solving the linear system Au = b both when we consider
perturbations in b and in A, though the derivations here are different.
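The bound (4.37) can be checked on the diagonal example from Section 4.3.1. The sketch below (perturbation sizes invented for illustration) perturbs A, re-solves the system, and compares the relative change in u with κ(A) · ‖ΔA‖/‖A‖:

```python
# Check the bound (4.37): |du|/|u| <~ |A^-1||A| * |dA|/|A| for a 2x2 system.

def solve2(A, b):
    det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
    return [(b[0]*A[1][1] - A[0][1]*b[1]) / det,
            (A[0][0]*b[1] - b[0]*A[1][0]) / det]

A = [[1000.0, 0.0], [0.0, 10.0]]
b = [1.0, 1.0]
eps = 1e-8
Ap = [[1000.0 * (1 + eps), 0.0], [0.0, 10.0 * (1 - eps)]]  # perturbed A

u = solve2(A, b)
up = solve2(Ap, b)
du = [up[0] - u[0], up[1] - u[1]]

rel_u = max(abs(t) for t in du) / max(abs(t) for t in u)
# kappa = |A^-1||A| = (1/10) * 1000 = 100, and |dA|/|A| ~ eps here
print(rel_u, 100 * eps)   # relative change in u stays within the bound
```

For this smooth right-hand side the actual relative change is far below the worst-case bound, which is the point made about κ(A) versus κ(A, u) above.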
4.3.3 Eigenvalues and eigenvectors
The eigenvalue relationship is Ar
j
=
j
r
j
. Perturbation theory allows to esti-
mate the changes in
j
and r
j
that result from a small A. These perturbation
results are used throughout science and engineering. The symmetric or self-
adjoint case is often is called Rayleigh-Schodinger perturbation theory
17
Using
the virtual perturbation method of Section 4.2.6, dierentiating the eigenvalue
relation using the product rule yields
Ar
j
+A r
j
=
j
r
j
+
j
r
j
. (4.38)
Multiplying this from the left by r_j^* and using the fact that r_j^* is a left
eigenvector gives

    r_j^* δA r_j = δλ_j r_j^* r_j .

If r_j is normalized so that r_j^* r_j = ‖r_j‖²_{l2} = 1, then the right side is just δλ_j.
Trading virtual perturbations for actual small perturbations turns this into the
famous formula

    Δλ_j = r_j^* ΔA r_j + O(‖ΔA‖²) .        (4.39)
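Formula (4.39) is easy to test for a symmetric 2 × 2 matrix, where the largest eigenvalue has a closed form. The matrix and perturbation below are invented for illustration:

```python
import math

# Check (4.39): Delta(lambda_j) ~= r_j^T (Delta A) r_j for a symmetric 2x2 matrix.

def eig_max_sym2(a, b, c):
    """Largest eigenvalue of [[a, b], [b, c]]."""
    return 0.5 * ((a + c) + math.sqrt((a - c)**2 + 4*b*b))

A = (2.0, 1.0, 2.0)          # [[2,1],[1,2]]: lam = 3, eigenvector (1,1)/sqrt(2)
r = (1/math.sqrt(2), 1/math.sqrt(2))
eps = 1e-6
dA = (eps, 2*eps, -eps)      # symmetric perturbation [[eps, 2eps], [2eps, -eps]]

lam = eig_max_sym2(*A)
lam_pert = eig_max_sym2(A[0]+dA[0], A[1]+dA[1], A[2]+dA[2])

predicted = (r[0]*(dA[0]*r[0] + dA[1]*r[1]) +
             r[1]*(dA[1]*r[0] + dA[2]*r[1]))   # r^T dA r
print(lam_pert - lam, predicted)   # agree to O(eps**2)
```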
16
See, for example, the book by Fritz John on partial differential equations.
17
Lord Rayleigh used it to study vibrational frequencies of plates and shells. Later Erwin
Schrödinger used it to compute energies (which are eigenvalues) in quantum mechanics.
We get a condition number estimate by recasting this in terms of rela-
tive errors on both sides. The important observation is that ‖r_j‖_{l2} = 1, so
‖ΔA r_j‖_{l2} ≤ ‖ΔA‖_{l2} and finally

    |r_j^* ΔA r_j| ≲ ‖ΔA‖_{l2} .

This inequality is sharp because we can take ΔA = ε r_j r_j^*, which is an n × n
matrix with (see Exercise 7) ‖ε r_j r_j^*‖_{l2} = |ε|. Putting this into (4.39) gives
the also sharp inequality,

    |Δλ_j|/|λ_j| ≲ ( ‖A‖_{l2}/|λ_j| ) · ( ‖ΔA‖_{l2}/‖A‖_{l2} ) .
We can put this directly into the abstract condition number definition (4.28) to
get the conditioning of λ_j:

    κ_{λ_j}(A) = ‖A‖_{l2}/|λ_j| = |λ|_max/|λ_j| .        (4.40)

Here, κ_{λ_j}(A) denotes the condition number of the problem of computing λ_j,
which is a function of the matrix A, and ‖A‖_{l2} = |λ|_max refers to the eigenvalue
of largest absolute value.
The condition number formula (4.40) predicts the sizes of errors we get in
practice. Presumably λ_j depends in some way on all the entries of A and the
perturbations due to roundoff will be on the order of the entries themselves,
multiplied by the machine precision, ε_mach, which are on the order of ‖A‖. Only
if λ_j is very close to zero, by comparison with |λ|_max, will it be hard to compute
with high relative accuracy. All of the other eigenvalue and eigenvector problems
have much worse condition number difficulties.
The eigenvalue problem for non-symmetric matrices can be much more sen-
sitive. To derive the analogue of (4.39) for non-symmetric matrices we start
with (4.38), multiply from the left with the corresponding left eigenvector, l_j,
and use the normalization condition l_j r_j = 1. After simplifying, the result is

    δλ_j = l_j δA r_j ,    Δλ_j = l_j ΔA r_j + O(‖ΔA‖²) .        (4.41)
In the non-symmetric case, the eigenvectors need not be orthogonal and the
eigenvector matrix R need not be well conditioned. For this reason, it is possible
that l_k, which is a row of R^{-1}, is very large. Working from (4.41) as we did for
the symmetric case leads to

    |Δλ_j|/|λ_j| ≲ ‖l_j‖ ‖r_j‖ · ( ‖A‖/|λ_j| ) · ( ‖ΔA‖/‖A‖ ) .

Therefore, the condition number of the non-symmetric eigenvalue problem is
(again because the inequalities are sharp)

    κ_{λ_j}(A) = ‖l_j‖ ‖r_j‖ · ‖A‖/|λ_j| .        (4.42)
A useful upper bound is ‖l_j‖ ‖r_j‖ ≤ κ_{LS}(R), where κ_{LS}(R) = ‖R^{-1}‖ ‖R‖ is
the linear systems condition number of the right eigenvector matrix. Since A
is not symmetric, we cannot replace ‖A‖ by |λ|_max as we did for (4.40). In the
symmetric case, the only reason for ill-conditioning is that we are looking for a
(relatively) tiny number. For non-symmetric matrices, it is also possible that
the eigenvector matrix is ill-conditioned. It is possible to show that if a family
of matrices approaches a matrix with a Jordan block, the condition number of
R approaches infinity. For a symmetric or self-adjoint matrix, R is orthogonal
or unitary, so that ‖R‖_{l2} = ‖R^*‖_{l2} = 1 and κ_{LS}(R) = 1.
The eigenvector perturbation theory uses the same ideas, with the extra
trick of expanding the derivatives of the eigenvectors in terms of the eigenvectors
themselves. We expand the virtual perturbation δr_j in terms of the eigenvectors
r_k. Call the expansion coefficients m_{jk}, and we have

    δr_j = Σ_{l=1}^n m_{jl} r_l .
For the symmetric eigenvalue problem, if all the eigenvalues are distinct, the
formula follows from multiplying (4.38) from the left by r_k^*:

    m_{jk} = ( r_k^* δA r_j ) / ( λ_j − λ_k ) ,

so that

    Δr_j = Σ_{k≠j} ( ( r_k^* ΔA r_j ) / ( λ_j − λ_k ) ) r_k + O(‖ΔA‖²) .
(The term j = k is omitted because m_{jj} = 0: differentiating r_j^* r_j = 1 gives
r_j^* δr_j = 0.) This shows that the eigenvectors have condition number issues
even when the eigenvalues are well-conditioned, if the eigenvalues are close to
each other. Since the eigenvectors are not uniquely determined when eigenvalues
are equal, this seems plausible. The unsymmetric matrix perturbation formula
is

    m_{kj} = ( l_j δA r_k ) / ( λ_j − λ_k ) .
Again, we have the potential problem of an ill-conditioned eigenvector basis,
combined with the possibility of close eigenvalues. The conclusion is that
eigenvector conditioning can be problematic when eigenvalues are closely
spaced, even though the eigenvalues themselves are fine.
4.4 Software
4.4.1 Software for numerical linear algebra
There have been major government-funded efforts to produce high quality soft-
ware for numerical linear algebra. This culminated in the public domain soft-
ware package LAPACK. LAPACK is a combination and extension of earlier
packages EISPACK, for solving eigenvalue problems, and LINPACK, for solv-
ing systems of equations. LAPACK is used in many mathematical packages,
including Matlab.
Our advice is to use LAPACK for dense linear algebra computations when-
ever possible¹⁸, either directly or through an environment like Matlab. These
routines are extensively tested, and are much less likely to have subtle bugs than
codes you developed yourself or copied from a textbook. The LAPACK soft-
ware also includes condition estimators that are often more sophisticated than
the basic algorithms themselves, and includes methods such as equilibration to
improve the conditioning of problems.
LAPACK is built upon a library of Basic Linear Algebra Subroutines (BLAS).
A BLAS library includes routines that perform such tasks as computing dot
products and multiplying matrices. Different systems often have their own
specially-tuned BLAS libraries, and these tuned libraries have much better per-
formance than the reference implementation. Computations using LAPACK
with an optimized BLAS can be faster than computations with the reference
BLAS by an order of magnitude or more.
On many platforms, LAPACK and a tuned BLAS are packaged as a pre-
compiled library. On machines running OS X, for example, LAPACK and a
tuned BLAS are provided as part of the vecLib framework. On Windows and
Linux machines, optimized LAPACK and BLAS implementations are part of
Intel's Math Kernel Library (MKL), which is sold commercially but with
a free version under Windows, and as part of the AMD Core Math Library
(ACML), which is freely available from AMD. The LAPACK Frequently Asked
Questions list from http://www.netlib.org/lapack includes as its first item
a list of places where you can get LAPACK and the BLAS pre-packaged as part
of a vendor library.
LAPACK is written in Fortran 90¹⁹, not C++. It is not too difficult to call
Fortran routines from C++, but the details vary from platform to platform.
One alternative to mixed-language programming is to use CLAPACK, which is
an automatic translation of LAPACK into the C language. CLAPACK provides
a Fortran-style interface to LAPACK, but because it is written entirely in C, it
is often easier to link. As of this writing, there is a standardized C interface to
the BLAS (the CBLAS), but there is no standard C interface to LAPACK. One
of the advantages of using LAPACK via a vendor-provided library is that many
of these libraries provide a native C-style interface to the LAPACK routines.
4.4.2 Linear algebra in Matlab
The Matlab system uses LAPACK and BLAS for most of its dense linear
algebra operations. This includes the * command to multiply matrices, the
18
LAPACK has routines for dense linear algebra. Sparse matrices, such as those with only
a few nonzero entries, should generally be solved using other packages. We will discuss sparse
matrices briefly in Chapter 5.
19
Most of LAPACK was developed in Fortran 77. However, the most recent version (LA-
PACK 3.2) uses some Fortran 90 features to provide extended precision routines.
N = 500;
A = rand(N);
B = rand(N);
tic; C = A*B; t1 = toc;
tic;
C2 = zeros(N);
for i = 1:N
for j = 1:N
for k = 1:N
C2(i,j) = C2(i,j) + A(i,k)*B(k,j);
end
end
end
t2 = toc;
fprintf('For N = %d: built-in = %f sec; ours = %f sec\n', ...
        N, t1, t2);
Figure 4.1: Example Matlab program that multiplies two matrices with a
built-in primitive based on the BLAS and with a hand-written loop.
backslash command to solve linear systems and least squares problems (\), the
eig command to compute eigenvalues and eigenvectors, and the svd command
to compute singular values and vectors.
Where Matlab provides a built-in routine based on LAPACK and the
BLAS, it will usually be much more efficient than a corresponding routine that
you would write yourself. For example, consider the code shown in Figure 4.1
to multiply two matrices using the built-in * operator and a hand-written loop.
On a desktop Pentium 4 system running version 7.6 of Matlab, we can see the
difference between the two calls clearly:
For N = 500: built-in = 0.061936 sec; ours = 5.681781 sec
Using the Matlab built-in functions that call the BLAS is almost a hundred
times faster than writing our own matrix multiply routine! In general, the most
efficient Matlab programs make heavy use of linear algebra primitives that call
BLAS and LAPACK routines.
Where LAPACK provides very different routines for different matrix struc-
tures (e.g. different algorithms for symmetric and unsymmetric eigenvalue prob-
lems), Matlab tests the matrix structure and automatically chooses an appro-
priate routine. It's possible to see this indirectly by looking at the timing of
A = rand(600); % Make a medium-sized random matrix
B = (A+A')/2;           % Take the symmetric part of A
Bhat = B+eps*A; % Make B just slightly nonsymmetric
tic; eig(B); t1 = toc; % Time eig on a symmetric problem
tic; eig(Bhat); t2 = toc; % Time eig on a nonsymmetric problem
fprintf('Symmetric: %f sec; Nonsymmetric: %f sec\n', t1, t2);
Figure 4.2: Example Matlab program that times the computation of eigenval-
ues for a symmetric matrix and a slightly nonsymmetric matrix.
the following Matlab script in Figure 4.2. On a desktop Pentium 4 system
running version 7.6 of Matlab, we can see the difference between the two calls
clearly:
Symmetric: 0.157029 sec; Nonsymmetric: 1.926543 sec
4.4.3 Mixing C++ and Fortran
Like an enormous amount of scientific software, LAPACK is written in For-
tran. Because of this, it is useful to call Fortran code from C. Unfortunately,
many systems do not include a Fortran compiler; and even when there is a
Fortran compiler, it may be difficult to figure out which libraries are needed
to link together the Fortran and C codes. The f2c translator, available from
http://www.netlib.org/f2c, translates Fortran codes into C. In this section,
we will describe how to use f2c-translated codes in a C++ program. As an ex-
ample, we will translate and use the reference version of the BLAS dot product
routine ddot, which is available at http://www.netlib.org/blas/ddot.f. In
our description, we assume a command line shell of the sort provided by OS X,
Linux, or the Cygwin environment under Windows. We also assume that f2c
has already been installed somewhere on the system.
The first step is to actually translate the code from Fortran to C:
f2c -C++ ddot.f
This command generates a function ddot.c with the translated routine. The
-C++ option says that the translation should be done to C++-compatible C. In
Fortran, ddot has the signature
DOUBLE PRECISION FUNCTION DDOT(N,DX,INCX,DY,INCY)
* .. Scalar Arguments ..
INTEGER INCX,INCY,N
* ..
#include <iostream>
#include "f2c.h"
extern "C"
double ddot_(const integer& N, const double* dx,
const integer& incX, const double* dy,
const integer& incY);
int main()
{
using namespace std;
double xy[] = {1, 2};
double result = ddot_(2, xy, 1, xy, 1);
cout << "Result = " << result << endl; // Result = 5
return 0;
}
Figure 4.3: Example C++ program that calls an f2c-translated routine.
* .. Array Arguments ..
DOUBLE PRECISION DX(*),DY(*)
The corresponding C function ddot_ has the signature
double ddot_(integer* n, double* dx, integer* incx,
double* dy, integer* incy);
We note three things about the signature for this translated routine:

- The C name of the routine is all lower case, with an underscore appended
  at the end: ddot_.

- Fortran uses call-by-reference semantics; that is, all arguments are passed
  by reference (through a pointer in C) rather than by value.

- An integer is a typedef defined in the f2c.h header file. Depending on
  the system, a Fortran integer may correspond to a C int or long.
Once we have translated the function to C, we need to be able to use it
in another program. We will use as an example the code in Figure 4.3, which
computes the dot product of a two-element vector with itself. We note a few
things about this program:
- The extern "C" directive tells the C++ compiler to use C-style linkage²⁰,
20
C++ linkage involves name mangling, which means that the compiler adds type infor-
mation into its internal representation for a function name. In C-style linkage, we keep only
the function name.
which is what f2c requires. This is important when using C++ with any
other language.
- Rather than using C-style pointers to pass the vector length N and the
  parameters incX and incY, we use C++-style references²¹. We tell the
  compiler that the ddot_ routine uses these values purely as input param-
  eters by declaring that these are references to constant values. If we do
  this, the C++ compiler will allow us to pass in literal constants like 2 and
  1 when we call ddot_, rather than requiring that we write
// If we had ddot_(integer* N, ...)
integer two = 2;
integer one = 1;
ddot_(&two, xy, &one, xy, &one);
Finally, we compile the program:
c++ -o example.exe -I f2cdir example.cc ddot.c -L f2cdir -lf2c
Here -I f2cdir tells the compiler to search the directory f2cdir for the header
file f2c.h; and the flag -L f2cdir tells the compiler where to find the support
library libf2c.
The steps we used to incorporate the Fortran ddot routine in our C++ code
using f2c are similar to the steps we would use to incorporate a Fortran library
into any C++ code:
1. Follow the directions to compile the library (or to translate it using f2c,
in our case). If you can find a pre-built version of the library, you may be
able to skip this step.
2. Write a C++ program that includes a prototype for the library routine.
Remember that the C name will probably be slightly different from the
Fortran name (a trailing underscore is common), and that all arguments
in Fortran are passed by reference.
3. Compile your C++ program and link it against the Fortran library and
possibly some Fortran support libraries.
4.5 Resources and further reading
For a practical introduction to linear algebra written by a numerical analyst, try
the books by Gilbert Strang [23, 24]. More theoretical treatments may be found
in the book by Peter Lax [15] or the one by Paul Halmos [8]. The linear algebra
book by Peter Lax also has a beautiful discussion of eigenvalue perturbation
21
The C++ compiler would complain at us if it were keeping type information on the
arguments to ddot_, but those are discarded because the compiler is using C linkage. References
and pointers are implemented nearly identically, so we can get away with this in practice.
theory and some of its applications. More applications may be found in the
book Theory of Sound by Lord Rayleigh (reprinted by Dover Press) and in any
book on quantum mechanics [20].
There are several good books that focus exclusively on topics in numerical
linear algebra. The textbook by Lloyd N. Trefethen [25] provides a good survey
of numerical linear algebra, while the textbook by James Demmel [4] goes into
more detail. Both books provide excellent discussion of condition numbers.
Nicholas Higham's book Accuracy and Stability of Numerical Algorithms [10]
includes a much more detailed description of conditioning and related issues.
The standard reference book for topics in numerical linear algebra is by Gene
Golub and Charles Van Loan [7].
Beresford Parlett has a book on the theory and computational methods
for the symmetric eigenvalue problem [19]; though it does not include some of
the most recent methods, it remains an excellent and highly readable book.
G.W. Stewart has devoted the second volume of his Matrix Algorithms book
entirely to eigenvalue problems, including both the symmetric and the nonsym-
metric case [22].
The software repository Netlib, http://netlib.org, is a source for LA-
PACK and for many other numerical codes.
4.6 Exercises
1. Let L be the differentiation operator that takes P_3 to P_2, described in Section 4.2.2. Let f_k = H_k(x) for k = 0, 1, 2, 3 be the Hermite polynomial basis of P_3 and g_k = H_k(x) for k = 0, 1, 2 be the Hermite basis of P_2. What is the matrix, A, that represents this L in these bases?
2. Suppose L is a linear transformation from V to V and that f_1, ..., f_n, and g_1, ..., g_n are two bases of V. Any u ∈ V may be written in a unique way as u = Σ_{k=1}^{n} v_k f_k, or as u = Σ_{k=1}^{n} w_k g_k. There is an n × n matrix, R, that relates the f_k expansion coefficients v_k to the g_k coefficients w_k by v_j = Σ_{k=1}^{n} r_{jk} w_k. If v and w are the column vectors with components v_k and w_k respectively, then v = Rw. Let A represent L in the f_k basis and B represent L in the g_k basis.

(a) Show that B = R^{−1}AR.

(b) For V = P_3, and f_k = x^k, and g_k = H_k, find R.

(c) Let L be the linear transformation Lp = q with q(x) = ∂_x(x p(x)). Find the matrix, A, that represents L in the monomial basis f_k.

(d) Find the matrix, B, that represents L in the Hermite polynomial basis H_k.

(e) Multiply the matrices to check explicitly that B = R^{−1}AR in this case.
3. If A is an n × m matrix and B is an m × l matrix, then AB is an n × l matrix. Show that (AB)^* = B^* A^*.

4. Show that ‖u‖ = (u^* M u)^{1/2} is a vector norm whenever M is positive definite (defined below).

(a) Show that u^* M u = u^* M^* u = u^* ( (1/2)(M + M^*) ) u for all u ∈ V. This means that as long as we consider functions of the form f(u) = u^* M u, we may assume M is symmetric. For the rest of this question, assume M is symmetric.

(b) Show that ‖u‖ = (u^* M u)^{1/2} is homogeneous: ‖au‖ = |a| ‖u‖.

(c) We say M is positive definite if u^* M u > 0 whenever u ≠ 0. Show that if M is positive definite, then ‖u‖ > 0 whenever u ≠ 0.

(d) Show that |u^* M v| ≤ ‖u‖ ‖v‖. This is the Cauchy–Schwarz inequality. Hint (a famous old trick): consider φ(t) = (u + tv)^* M (u + tv), a quadratic in t that is non-negative for every t.

(e) Use the Cauchy–Schwarz inequality to verify the triangle inequality in its squared form ‖u + v‖^2 ≤ ‖u‖^2 + 2 ‖u‖ ‖v‖ + ‖v‖^2.

(f) Show that if M = I then ‖u‖ is the l^2 norm of u.
5. Verify that ‖p‖ defined by (4.5) on V = P_3 is a norm as long as a < b.
6. Suppose A is the n × n matrix that represents a linear transformation from R^n to R^n in the standard basis e_k. Let B be the matrix of the same linear transformation in the scaled basis (??).

(a) Find a formula for the entries b_{jk} in terms of the a_{jk} and u_k.

(b) Find a matrix formula for B in terms of A and the diagonal scaling matrix W = diag(u_k) (defined by w_{kk} = u_k, w_{jk} = 0 if j ≠ k) and W^{−1}.
7. Show that if u ∈ R^m and v ∈ R^n and A = uv^*, then ‖A‖_{l^2} = ‖u‖_{l^2} ‖v‖_{l^2}. Hint: Note that Aw = bu where b is a scalar, so ‖Aw‖_{l^2} = |b| ‖u‖_{l^2}. Also, be aware of the Cauchy–Schwarz inequality: |v^* w| ≤ ‖v‖_{l^2} ‖w‖_{l^2}.
8. Suppose that A is an n × n invertible matrix. Show that

    ‖A^{−1}‖ = max_{u ≠ 0} ‖u‖/‖Au‖ = ( min_{u ≠ 0} ‖Au‖/‖u‖ )^{−1}.
9. The symmetric part of the real n × n matrix A is M = (1/2)(A + A^*). Show that

    ∇( (1/2) x^* A x ) = Mx.
10. The informal condition number of the problem of computing the action of A is

    κ(A) = max_{x ≠ 0, Δx ≠ 0} ( ‖A(x + Δx) − Ax‖ / ‖Ax‖ ) / ( ‖Δx‖ / ‖x‖ ).

Alternatively, it is the sharp constant in the estimate

    ‖A(x + Δx) − Ax‖ / ‖Ax‖ ≤ C ‖Δx‖ / ‖x‖,

which bounds the worst case relative change in the answer in terms of the relative change in the data. Show that for the l^2 norm,

    κ = σ_max / σ_min,

the ratio of the largest to smallest singular value of A. Show that this formula holds even when A is not square.
11. We wish to solve the boundary value problem for the differential equation

    −(1/2) ∂_x^2 u = f(x) for 0 < x < 1,   (4.43)

with boundary conditions

    u(0) = u(1) = 0.   (4.44)

We discretize the interval [0, 1] using a uniform grid of points x_j = jΔx with nΔx = 1. The n−1 unknowns, U_j, are approximations to u(x_j), for j = 1, ..., n−1. If we use a second order approximation to −(1/2) ∂_x^2 u, we get the discrete equations

    −(1/2)(1/Δx^2)(U_{j+1} − 2U_j + U_{j−1}) = f(x_j) = F_j.   (4.45)

Together with boundary conditions U_0 = U_n = 0, this is a system of n−1 linear equations for the vector U = (U_1, ..., U_{n−1})^* that we write as AU = F.

(a) Check that there are n−1 distinct eigenvectors of A having the form r_{kj} = sin(kπx_j). Here r_{kj} is the j component of eigenvector r_k. Note that r_{k,j+1} = sin(kπx_{j+1}) = sin(kπ(x_j + Δx)), which can be evaluated in terms of r_{kj} using trigonometric identities.

(b) Use the eigenvalue information from part (a) to show that ‖A^{−1}‖ → 2/π^2 as n → ∞ and κ(A) = O(n^2) (in the informal sense) as n → ∞. All norms are l^2.
(c) Suppose Ũ_j = u(x_j), where u(x) is the exact but unknown solution of (4.43), (4.44). Show that if u(x) is smooth then the residual[22], R = AŨ − F, is small. More precisely, ‖R‖ = O(Δx^2) if we use the Δx-scaled norm ‖U‖^2 = Δx Σ_{j=1}^{n} U_j^2.

(d) Show that A(Ũ − U) = R. Use part (b) to show that ‖Ũ − U‖ = O(Δx^2) (with the Δx modified ‖·‖).

(e) (harder) Create a fourth order five point central difference approximation to ∂_x^2 u. You can do this using Richardson extrapolation from the second order three point formula. Use this to find an A so that solving AU = F leads to a fourth order accurate U. The hard part is what to do at j = 1 and j = n−1. At j = 1 the five point approximation to ∂_x^2 u involves U_0 and U_{−1}. It is fine to take U_0 = u(0) = 0. Is it OK to take U_{−1} = −U_1?

(f) Write a program in Matlab to solve AU = F for the second order method. The matrix A is symmetric and tridiagonal (has nonzeros only on three diagonals: the main diagonal, and the immediate sub and super diagonals). Use the Matlab matrix operation appropriate for symmetric positive definite tridiagonal matrices. Do a convergence study to show that the results are second order accurate.

(g) (extra credit) Program the fourth order method and check that the results are fourth order when f(x) = sin(πx) but not when f(x) = max(0, .15 − (x − .5)^2). Why are the results different?
12. This exercise explores conditioning of the non-symmetric eigenvalue problem. It shows that although the problem of computing the fundamental solution is well-conditioned, computing it using eigenvalues and eigenvectors can be an unstable algorithm, because the problem of computing eigenvalues and eigenvectors is ill-conditioned. For parameters 0 < λ < μ, there is a Markov chain transition rate matrix, A, whose entries are a_{jk} = 0 if |j − k| > 1. If 1 ≤ j ≤ n−2, a_{j,j−1} = μ, a_{jj} = −(λ + μ), and a_{j,j+1} = λ (taking j and k to run from 0 to n−1). The other cases are a_{00} = −λ, a_{01} = λ, a_{n−1,n−1} = −μ, and a_{n−1,n−2} = μ. This matrix describes a continuous time Markov process with a random walker whose position at time t is the integer X(t). Transitions X → X + 1 happen with rate λ and transitions X → X − 1 have rate μ. The transitions 0 → −1 and n−1 → n are not allowed. This is the M/M/1 queue used in operations research to model queues (X(t) is the number of customers in the queue at time t, λ is the rate of arrival of new customers, μ is the service rate; a customer arrival is an X → X + 1 transition). For each t, we can consider the row vector p(t) = (p_1(t), ..., p_n(t)) where p_j(t) = Prob(X(t) = j).
[Footnote 22: Residual refers to the extent to which equations are not satisfied. Here, the equation is AU = F, which Ũ does not satisfy, so R = AŨ − F is the residual.]
These probabilities satisfy the differential equation ṗ = dp/dt = pA. The solution can be written in terms of the fundamental solution, S(t), which is an n × n matrix that satisfies

    Ṡ = SA, S(0) = I.

(a) Show that if Ṡ = SA, S(0) = I, then p(t) = p(0)S(t).
(b) The matrix exponential may be defined through the Taylor series exp(B) = Σ_{k=0}^{∞} (1/k!) B^k. Use matrix norms and the fact that ‖B^k‖ ≤ ‖B‖^k to show that the infinite sum of matrices converges.
(c) Show that the fundamental solution is given by S(t) = exp(tA). To do this, it is enough to show that exp(tA) satisfies the differential equation (d/dt) exp(tA) = exp(tA)A using the infinite series, and show exp(0A) = I.
(d) Suppose A = RΛL is the eigenvalue and eigenvector decomposition of A; show that exp(tA) = R exp(tΛ) L, and that exp(tΛ) is the obvious diagonal matrix.
(e) Use the Matlab function [R,Lam] = eig(A); to calculate the eigenvalues and right eigenvector matrix of A. Let r_k be the kth column of R. For k = 1, ..., n, print r_k, Ar_k, λ_k r_k, and ‖λ_k r_k − Ar_k‖ (you choose the norm). Mathematically, one of the eigenvectors is a multiple of the vector 1 defined in part h. The corresponding eigenvalue is λ = 0. The computed eigenvalue is not exactly zero. Take n = 4 for this, but do not hard wire n = 4 into the Matlab code.
(f) Let L = R^{−1}, which can be computed in Matlab using L=R^(-1);. Let l_k be the kth row of L; check that the l_k are left eigenvectors of A as in part (e). Corresponding to λ = 0 is a left eigenvector that is a multiple of p …, where 1 is the column vector with all ones, and λ̇_k = dλ_k/ds at s = 0. What size s do you need for λ_k to be accurately approximated by s λ̇_k? Try n = 5 and n = 20. Note that first order perturbation theory always predicts that eigenvalues stay real[23], so s = .1 is much too large for n = 20.
[Footnote 23: It is easy to see that if A is tridiagonal then there is a diagonal matrix, W, so that WAW^{−1} is symmetric. Therefore, A has the same eigenvalues as the symmetric matrix WAW^{−1}.]
Chapter 5
Linear Algebra II,
Algorithms
5.1 Introduction
This chapter discusses some of the algorithms of computational linear algebra.
For routine applications it probably is better to use these algorithms as software
packages rather than to recreate them. Still, it is important to know what
they do and how they work. Some applications call for variations on the basic
algorithms here. For example, Chapter 6 refers to a modification of the Cholesky
decomposition.
Many algorithms of numerical linear algebra may be formulated as ways to calculate matrix factorizations. This point of view gives conceptual insight and often suggests alternative algorithms. This is particularly useful in finding variants of the algorithms that run faster on modern processor hardware. Moreover, computing matrix factors explicitly rather than implicitly allows the factors, and the work in computing them, to be re-used. For example, the work to compute the LU factorization of an n × n matrix, A, is O(n^3), but the work to solve Au = b is only O(n^2) once the factorization is known. This makes it much faster to re-solve if b changes but A does not.
This chapter does not cover the many factorization algorithms in great detail. This material is available, for example, in the book of Golub and Van Loan [7] and many other places. Our aim is to make the reader aware of what the computer does (roughly), and how long it should take. First we explain how the classical Gaussian elimination algorithm may be viewed as a matrix factorization, the LU factorization. The algorithm presented is not the practical one because it does not include pivoting. Next, we discuss the Cholesky (LL^*) factorization. …

    Σ_{k=1}^{n} k = n(n+1)/2 = (1/2)n^2 + (1/2)n ≈ (1/2)n^2.   (5.2)
We would like to pass directly to the useful approximation at the end without the complicated exact formula in the middle. As an example, the simple fact (draw a picture)

    k^2 = ∫_{k−1}^{k} x^2 dx + O(k)

implies that (using (5.2) to add up O(k))

    S_2(n) = Σ_{k=1}^{n} k^2 = ∫_0^n x^2 dx + Σ_{k=1}^{n} O(k) = (1/3)n^3 + O(n^2).
The integral approximation to (5.2) is easy to picture in a graph. The sum
represents the area under a staircase in a large square. The integral represents
the area under the diagonal. The error is the smaller area between the staircase
and the diagonal.
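The integral estimate can be checked numerically; a quick sketch (plain Python, function name ours):

```python
def S2(n):
    """Sum of squares 1^2 + 2^2 + ... + n^2, computed directly."""
    return sum(k * k for k in range(1, n + 1))

n = 1000
approx = n**3 / 3        # the integral approximation (1/3) n^3
err = S2(n) - approx     # the leftover staircase area
assert abs(err) < n**2   # the error grows like n^2/2, not n^3
```

The error here is about n^2/2, exactly the O(n^2) term predicted by the staircase picture.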
Consider the problem of computing C = AB, where A and B are general n × n matrices. The standard method uses the formulas

    c_{jk} = Σ_{l=1}^{n} a_{jl} b_{lk}.   (5.3)

The right side requires n multiplies and about n adds for each j and k. That makes n · n^2 = n^3 multiplies and adds in total. The formulas (5.3) have (almost) exactly 2n^3 flops in this sense. More generally, suppose A is n × m and B is m × p. Then AB is n × p, and it takes (approximately) 2m flops to calculate each entry. Thus, the entire AB calculation takes about 2nmp flops.
Now consider a matrix triple product ABC, where the matrices are not square but are compatible for multiplication. Say A is n × m, B is m × p, and C is p × q. If we do the calculation as (AB)C, then we first compute AB and then multiply by C. The work to do it this way is 2(nmp + npq) (because AB is n × p). On the other hand, doing it as A(BC) has total work 2(mpq + nmq). These numbers are not the same. Depending on the shapes of the matrices, one could be much smaller than the other. The associativity of matrix multiplication allows us to choose the order of the operations to minimize the total work.
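Comparing the two orderings is a one-line calculation; a sketch (Python, names ours, with shapes chosen to exaggerate the difference):

```python
def matmul_flops(n, m, p):
    """Approximate flop count for an n-by-m times m-by-p product: 2nmp."""
    return 2 * n * m * p

# A is n-by-m, B is m-by-p, C is p-by-q
n, m, p, q = 1000, 2, 1000, 2
ab_first = matmul_flops(n, m, p) + matmul_flops(n, p, q)  # (AB)C: 2(nmp + npq)
bc_first = matmul_flops(m, p, q) + matmul_flops(n, m, q)  # A(BC): 2(mpq + nmq)
# For these shapes (AB)C costs 8,000,000 flops while A(BC) costs 16,000.
assert bc_first < ab_first
```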
[Footnote 1: The actual relation can be made precise by writing W = Θ(n^2), for example, which means that W = O(n^2) and n^2 = O(W).]
5.3 Gauss elimination and LU decomposition
5.3.1 A 3 × 3 example
Gauss elimination is a simple systematic way to solve systems of linear equations.
For example, suppose we have the system of equations
    4x_1 + 4x_2 + 2x_3 = 2
    4x_1 + 5x_2 + 3x_3 = 3
    2x_1 + 3x_2 + 3x_3 = 5.
Now we eliminate x_1 from the second two equations, first subtracting the first equation from the second equation, then subtracting half the first equation from the third equation:

    4x_1 + 4x_2 + 2x_3 = 2
           x_2 +  x_3 = 1
           x_2 + 2x_3 = 4.
Now we eliminate x_2 from the last equation by subtracting the second equation:

    4x_1 + 4x_2 + 2x_3 = 2
           x_2 +  x_3 = 1
                  x_3 = 3.
Finally, we solve the last equation for x_3 = 3, then back-substitute into the second equation to find x_2 = −2, and then back-substitute into the first equation to find x_1 = 1.
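The elimination steps above can be scripted directly; a Python sketch of the same arithmetic on the augmented matrix:

```python
# Augmented matrix [A | b] for the 3-by-3 example above.
M = [[4.0, 4.0, 2.0, 2.0],
     [4.0, 5.0, 3.0, 3.0],
     [2.0, 3.0, 3.0, 5.0]]

# Eliminate x1 from equations 2 and 3, then x2 from equation 3.
for i in range(3):
    for j in range(i + 1, 3):
        mult = M[j][i] / M[i][i]
        for k in range(i, 4):
            M[j][k] -= mult * M[i][k]

# Back substitution on the resulting upper triangular system.
x = [0.0, 0.0, 0.0]
for i in reversed(range(3)):
    x[i] = (M[i][3] - sum(M[i][k] * x[k] for k in range(i + 1, 3))) / M[i][i]

assert x == [1.0, -2.0, 3.0]   # matches the hand computation
```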
We gain insight into the elimination procedure by writing it in matrix terms. We begin with the matrix equation Ax = b:

    [ 4  4  2 ] [ x_1 ]   [ 2 ]
    [ 4  5  3 ] [ x_2 ] = [ 3 ]
    [ 2  3  3 ] [ x_3 ]   [ 5 ].
The operation of eliminating x_1 can be written as an invertible linear transformation. Let M_1 be the transformation that subtracts the first component from the second component and half the first component from the third component:

    M_1 = [  1    0  0 ]
          [ −1    1  0 ]
          [ −0.5  0  1 ].   (5.4)

The equation Ax = b is equivalent to M_1 Ax = M_1 b, which is

    [ 4  4  2 ] [ x_1 ]   [ 2 ]
    [ 0  1  1 ] [ x_2 ] = [ 1 ]
    [ 0  1  2 ] [ x_3 ]   [ 4 ].
We apply a second linear transformation to eliminate x_2:

    M_2 = [ 1   0  0 ]
          [ 0   1  0 ]
          [ 0  −1  1 ].

The equation M_2 M_1 Ax = M_2 M_1 b is

    [ 4  4  2 ] [ x_1 ]   [ 2 ]
    [ 0  1  1 ] [ x_2 ] = [ 1 ]
    [ 0  0  1 ] [ x_3 ]   [ 3 ].

The matrix U = M_2 M_1 A is upper triangular, and so we can apply back substitution.
We can rewrite the equation U = M_2 M_1 A as A = M_1^{−1} M_2^{−1} U = LU. The matrices M_1 and M_2 are unit lower triangular: that is, all of the diagonal elements are one, and all the elements above the diagonal are zero. All unit lower triangular matrices have inverses that are unit lower triangular, and products of unit lower triangular matrices are also unit lower triangular, so L = M_1^{−1} M_2^{−1} is unit lower triangular[2]. The inverse of M_2 corresponds to undoing the last elimination step, which subtracted the second component from the third, by adding the second component back to the third. The inverse of M_1 can be constructed similarly. The reader should verify that in matrix form we have

    L = M_1^{−1} M_2^{−1} = [ 1    0  0 ] [ 1  0  0 ]   [ 1    0  0 ]
                            [ 1    1  0 ] [ 0  1  0 ] = [ 1    1  0 ]
                            [ 0.5  0  1 ] [ 0  1  1 ]   [ 0.5  1  1 ].

Note that the subdiagonal elements of L form a record of the elimination procedure: when we eliminated the ith variable, we subtracted l_{ji} times the ith equation from the jth equation. This turns out to be a general pattern.
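A Python transcription of elimination-as-factorization (a sketch mirroring the Matlab routine of Figure 5.1; names ours) recovers the L and U of the 3 × 3 example:

```python
def lu_factor(A):
    """LU factorization without pivoting; returns (L, U), leaving A untouched."""
    n = len(A)
    U = [row[:] for row in A]  # work on a copy of A
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n - 1):           # variable to eliminate
        for j in range(i + 1, n):    # equation to eliminate from
            L[j][i] = U[j][i] / U[i][i]
            for k in range(i, n):
                U[j][k] -= L[j][i] * U[i][k]
    return L, U

A = [[4.0, 4.0, 2.0], [4.0, 5.0, 3.0], [2.0, 3.0, 3.0]]
L, U = lu_factor(A)
assert L == [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, 1.0, 1.0]]
assert U == [[4.0, 4.0, 2.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]
```

The subdiagonal of L is exactly the record of elimination multipliers, as claimed.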
One advantage of the LU factorization interpretation of Gauss elimination
is that it provides a framework for organizing the computation. We seek lower
and upper triangular matrices L and U so that LU = A:
    [ 1     0     0 ] [ u_11  u_12  u_13 ]   [ 4  4  2 ]
    [ l_21  1     0 ] [ 0     u_22  u_23 ] = [ 4  5  3 ]
    [ l_31  l_32  1 ] [ 0     0     u_33 ]   [ 2  3  3 ].   (5.5)
If we multiply out the first row of the product, we find that

    1 · u_{1j} = a_{1j}

for each j; that is, the first row of U is the same as the first row of A. If we multiply out the first column, we have

    l_{i1} u_{11} = a_{i1}

[Footnote 2: The unit lower triangular matrices actually form an algebraic group, a subgroup of the group of invertible n × n matrices.]
function [L,U] = mylu(A)
n = length(A); % Get the dimension of A
L = eye(n); % Initialize the diagonal elements of L to 1
for i = 1:n-1 % Loop over variables to eliminate
for j = i+1:n % Loop over equations to eliminate from
% Subtract L(j,i) times the ith row from the jth row
L(j,i) = A(j,i) / A(i,i);
for k = i:n
A(j,k) = A(j,k)-L(j,i)*A(i,k);
end
end
end
U = A; % A is now transformed to upper triangular
Figure 5.1: Example Matlab program to compute A = LU.
or l_{i1} = a_{i1}/u_{11} = a_{i1}/a_{11}; that is, the first column of L is the first column of A divided by a scaling factor.

Finding the first column of L and U corresponds to finding the multiples of the first row we need to subtract from the other rows to do the elimination. Actually doing the elimination involves replacing a_{ij} with a_{ij} − l_{i1} u_{1j} = a_{ij} − a_{i1} a_{11}^{−1} a_{1j} for each i > 1. In terms of the LU factorization, this gives us a reduced factorization problem with a smaller matrix:
    [ 1     0 ] [ u_22  u_23 ]   [ 5  3 ]   [ l_21 ]                 [ 1  1 ]
    [ l_32  1 ] [ 0     u_33 ] = [ 3  3 ] − [ l_31 ] [ u_12  u_13 ] = [ 1  2 ].
We can continue onward in this way to compute the rest of the elements of L
and U. These calculations show that the LU factorization, if it exists, is unique
(remembering to put ones on the diagonal of L).
5.3.2 Algorithms and their cost
The basic Gauss elimination algorithm transforms the matrix A into its upper
triangular factor U. As we saw in the previous section, the scale factors that
appear when we eliminate the ith variable from the jth equation can be in-
terpreted as subdiagonal entries of a unit lower triangular matrix L such that
A = LU. We show a straightforward Matlab implementation of this idea in
Figure 5.1. These algorithms lack pivoting, which is needed for stability.
function x = mylusolve(L,U,b)
n = length(b);
y = zeros(n,1);
x = zeros(n,1);
% Forward substitution: L*y = b
for i = 1:n
% Subtract off contributions from other variables
rhs = b(i);
for j = 1:i-1
rhs = rhs - L(i,j)*y(j);
end
% Solve the equation L(i,i)*y(i) = 1*y(i) = rhs
y(i) = rhs;
end
% Back substitution: U*x = y
for i = n:-1:1
% Subtract off contributions from other variables
rhs = y(i);
for j = i+1:n
rhs = rhs - U(i,j)*x(j);
end
% Solve the equation U(i,i)*x(i) = rhs
x(i) = rhs / U(i,i);
end
Figure 5.2: Example Matlab program that performs forward and backward substitution with LU factors (without pivoting).
However, pivoting does not significantly affect the estimates of operation counts for the algorithm, which we estimate now.
The total number of arithmetic operations for the LU factorization is given by

    W(n) = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} ( 1 + Σ_{k=i}^{n} 2 ).

We get this work expression by converting loops in Figure 5.1 into sums and counting the number of operations in each loop. There is one division inside the loop over j, and there are two operations (a multiply and an add) inside the loop over k. There are far more multiplies and adds than divisions, so let us count only these operations:

    W(n) ≈ Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} Σ_{k=i}^{n} 2.

If we approximate the sums by integrals, we have

    W(n) ≈ ∫_0^n ∫_x^n ∫_x^n 2 dz dy dx = (2/3) n^3.

If we are a little more careful, we find that W(n) = (2/3)n^3 + O(n^2).
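The count can be confirmed by instrumenting the loops; a sketch (Python, names ours; the loop bounds follow Figure 5.1):

```python
def lu_op_count(n):
    """Count multiply/add and divide operations in the LU loops of Figure 5.1."""
    mul_add, div = 0, 0
    for i in range(1, n):                # i = 1, ..., n-1
        for j in range(i + 1, n + 1):    # j = i+1, ..., n
            div += 1                     # one division per (i, j)
            for k in range(i, n + 1):    # k = i, ..., n
                mul_add += 2             # one multiply and one add
    return mul_add, div

n = 200
mul_add, div = lu_op_count(n)
# The leading term is (2/3) n^3; the remainder is O(n^2).
assert abs(mul_add - 2 * n**3 / 3) < 4 * n**2
assert div < n**2                        # far fewer divisions than flops
```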
We can compute the LU factors of A without knowing the right hand side b, and writing the LU factorization alone does not solve the system Ax = b. Once we know A = LU, though, solving Ax = b becomes a simple two-stage process. First, forward substitution gives y such that Ly = b; then back substitution gives x such that Ux = y. Solving by forward and backward substitution is equivalent to writing x = U^{−1}y = U^{−1}L^{−1}b. Figure 5.2 gives a Matlab implementation of the forward and backward substitution loops.

For a general n × n system, forward and backward substitution each take about (1/2)n^2 multiplies and the same number of adds. Therefore, it takes a total of about 2n^2 multiply and add operations to solve a linear system once we have the LU factorization. This costs much less time than the LU factorization, which takes about (2/3)n^3 multiplies and adds.
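The two-stage solve mirrors the Matlab code of Figure 5.2; a compact Python sketch (names ours), applied to the LU factors of the 3 × 3 example:

```python
def lu_solve(L, U, b):
    """Solve LUx = b by forward then backward substitution (no pivoting)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                       # forward: Ly = b, L has unit diagonal
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):             # backward: Ux = y
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

L = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, 1.0, 1.0]]
U = [[4.0, 4.0, 2.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]
x = lu_solve(L, U, [2.0, 3.0, 5.0])
assert x == [1.0, -2.0, 3.0]
```

Re-solving with a new right hand side reuses L and U, which is where the O(n^2) versus O(n^3) saving comes from.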
The factorization algorithms just described may fail or be numerically un-
stable even when A is well conditioned. To get a stable algorithm, we need to
introduce pivoting. In the present context this means adaptively reordering the
equations or the unknowns so that the elements of L do not grow. Details are
in the references.
5.4 Cholesky factorization
Many applications call for solving linear systems of equations with a symmetric and positive definite A. An n × n matrix is positive definite if x^* A x > 0 whenever x ≠ 0. Symmetric positive definite (SPD) matrices arise in many applications. If B is an m × n matrix with m ≥ n and rank(B) = n, then the product A = B^* B is SPD. This is what happens when we solve a linear least squares problem using the normal equations (see Section 4.2.8). If f(x) is a scalar function of x ∈ R^n, the Hessian matrix of second partials has entries h_{jk}(x) = ∂^2 f(x)/∂x_j ∂x_k. This is symmetric because ∂^2 f/∂x_j ∂x_k = ∂^2 f/∂x_k ∂x_j. The minimum of f probably
. The minimum of f probably
is taken at an x
with H(x
_
_
_
_
_
_
_
_
l
11
l
21
l
31
l
n1
0 l
22
l
32
l
n2
0 0 l
33
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
0 0 l
nn
_
_
_
_
_
_
_
_
=
_
_
_
_
_
_
_
_
a
11
a
21
a
31
a
n1
a
21
a
22
a
32
a
n2
a
31
a
32
a
33
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
a
n1
a
n2
a
nn
_
_
_
_
_
_
_
_
.
Notice that we have written, for example, a_32 for the (2,3) entry because A is symmetric. We start with the top left corner. Doing the matrix multiplication gives

    l_11^2 = a_11  ⟹  l_11 = √a_11.
The square root is real because a_11 > 0, because A is positive definite and[3] a_11 = e_1^* A e_1. Next we match the (2,1) entry of A. The matrix multiplication gives:

    l_21 l_11 = a_21  ⟹  l_21 = (1/l_11) a_21.
The denominator is not zero because l_11 > 0, because a_11 > 0. We could continue in this way, to get the whole first column of L. Alternatively, we could match (2,2) entries to get l_22:

    l_21^2 + l_22^2 = a_22  ⟹  l_22 = √(a_22 − l_21^2).
[Footnote 3: Here e_1 is the vector with one as its first component and all the rest zero. Similarly a_kk = e_k^* A e_k.]
It is possible to show (see below) that if the square root on the right is not real, then A was not positive definite. Given l_22, we can now compute the rest of the second column of L. For example, matching (3,2) entries gives:

    l_31 l_21 + l_32 l_22 = a_32  ⟹  l_32 = (1/l_22)(a_32 − l_31 l_21).

Continuing in this way, we can find all the entries of L. It is clear that if L exists and if we always use the positive square root, then all the entries of L are uniquely determined.
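The entry-by-entry matching above turns directly into a short routine; a Python sketch (names and the small SPD test matrix are ours):

```python
def cholesky(A):
    """Cholesky factor L (lower triangular) with A = L L^T, built column by column."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = A[j][j] - sum(L[j][p] ** 2 for p in range(j))
        L[j][j] = s ** 0.5   # real because A is positive definite
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][p] * L[j][p] for p in range(j))) / L[j][j]
    return L

A = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 3.0]]
L = cholesky(A)
assert L == [[2.0, 0.0, 0.0], [1.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
```

Each column costs one square root plus one pass of multiplies and adds, so the whole factorization does about half the work of a general LU.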
A slightly different discussion of the Cholesky decomposition process makes it clear that the Cholesky factorization exists whenever A is positive definite. The algorithm above assumed the existence of a factorization and showed that the entries of L are uniquely determined by LL^* = A. … Let M_1 be the unit lower triangular matrix

    M_1 = [  1           0  0  ...  0 ]
          [ −a_21/a_11   1  0  ...  0 ]
          [ −a_31/a_11   0  1  ...  0 ]
          [  ...                      ]
          [ −a_n1/a_11   0  0  ...  1 ].
Because we only care about the nonzero pattern, let us agree to write a star as
a placeholder for possibly nonzero matrix entries that we might not care about
in detail. Multiplying out M
1
A gives:
    M_1 A = [ a_11  a_12  a_13  ... ]
            [ 0     ∗     ∗     ... ]
            [ 0     ∗     ∗     ... ]
            [ ...                   ].
Only the entries in rows two through n have changed, with the new values indicated by stars. Note that M_1 A has lost the symmetry of A. We can restore this symmetry by multiplying from the right by M_1^*. This has the effect of subtracting a_{1i}/a_11 times the first column of M_1 A from column i for i = 2, ..., n. Since the top row of A has not changed, this has the effect of setting the (1,2) through (1,n) entries to zero:
    M_1 A M_1^* = [ a_11  0  0  ... ]
                  [ 0     ∗  ∗  ... ]
                  [ 0     ∗  ∗  ... ]
                  [ ...             ].
Continuing in this way, elementary matrices E_31, etc. will set to zero all the elements in the first row and top column except a_11. Finally, let D_1 be the diagonal matrix which is equal to the identity except that d_11 = 1/√a_11. All in all, this gives (D_1^* = D_1):

    D_1 M_1 A M_1^* D_1^* = [ 1  0      0      ... ]
                            [ 0  a'_22  a'_23  ... ]
                            [ 0  a'_23  a'_33  ... ]
                            [ ...                  ].   (5.6)
We define L_1 to be the lower triangular matrix

    L_1 = D_1 M_1,

so the right side of (5.6) is A_1 = L_1 A L_1^* (check this). Note that the trailing submatrix elements satisfy

    a'_{ik} = a_{ik} − (L_1)_{i1} (L_1)_{k1}.
The matrix L_1 is nonsingular since D_1 and M_1 are nonsingular. To see that A_1 is positive definite, define y = L_1^* x; then x^* A_1 x = x^* L_1 A L_1^* x = y^* A y > 0 whenever x ≠ 0. In particular a'_22 > 0, and we may find an L_2 that sets a'_22 to one and all the a'_2k to zero. Eventually, this gives L_{n−1} ··· L_1 A L_1^* ··· L_{n−1}^* = I. Solving for A by reversing the order of the operations leads to the desired factorization:

    A = L_1^{−1} ··· L_{n−1}^{−1} L_{n−1}^{−∗} ··· L_1^{−∗},
where we use the common convention of writing B^{−∗} for the inverse of B^*. …

An orthogonal matrix Q satisfies Q^* Q = I and therefore preserves the l^2 norm:

    ‖Qx‖^2_{l^2} = x^* Q^* Q x = x^* x = ‖x‖^2_{l^2}.[4]

… For a nonzero vector v, the Householder reflector is H(v) = I − 2 vv^*/(v^* v). The Householder reflector corresponds to reflection through the plane normal to v. We can reflect any given vector w into a vector ‖w‖e_1 by reflecting through a plane normal to the line connecting w and ‖w‖e_1:

    H(w − ‖w‖e_1) w = ‖w‖e_1.
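The reflection property is easy to check numerically; a Python sketch (names ours), using the standard reflector formula H(v) = I − 2vv^T/(v^T v):

```python
def reflect(v, w):
    """Apply the Householder reflector H(v) = I - 2 v v^T / (v^T v) to w."""
    vv = sum(a * a for a in v)
    c = 2 * sum(a * b for a, b in zip(v, w)) / vv
    return [b - c * a for a, b in zip(v, w)]

w = [3.0, 4.0, 0.0]
norm_w = sum(a * a for a in w) ** 0.5            # ||w|| = 5
e1 = [norm_w, 0.0, 0.0]
v = [a - b for a, b in zip(w, e1)]               # v = w - ||w|| e1
Hw = reflect(v, w)
# H(w - ||w|| e1) maps w onto ||w|| e1.
assert all(abs(a - b) < 1e-12 for a, b in zip(Hw, [5.0, 0.0, 0.0]))
```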
In particular, this means that if A is an m × n matrix, we can use a Householder transformation to make all the subdiagonal entries of the first column into zeros:

    H_1 A = H(Ae_1 − ‖Ae_1‖e_1) A = [ ‖Ae_1‖  a'_12  a'_13  ... ]
                                    [ 0       a'_22  a'_23  ... ]
                                    [ 0       a'_32  a'_33  ... ]
                                    [ ...                       ]

Continuing in this way, we can apply a sequence of Householder transformations in order to reduce A to an upper triangular matrix, which we usually call R:

    H_n ··· H_2 H_1 A = R.

Using the fact that H_j = H_j^{−1}, we can rewrite this as

    A = H_1 H_2 ··· H_n R = QR,

where Q is a product of Householder transformations.
The QR factorization is a useful tool for solving least squares problems. If A is a rectangular matrix with m > n and A = QR, then the invariance of the l^2 norm under orthogonal transformations implies

    ‖Ax − b‖_{l^2} = ‖Q^* Ax − Q^* b‖_{l^2} = ‖Rx − b'‖_{l^2},

where b' = Q^* b.
[Footnote 4: This shows an orthogonal matrix satisfies (5.7). Exercise 8 shows that a matrix satisfying (5.7) must be orthogonal.]
Therefore, minimizing ‖Ax − b‖ is equivalent to minimizing ‖Rx − b'‖, where R is upper triangular[5]:

    [ r_11  r_12  ...  r_1n ] [ x_1 ]   [ b'_1     ]   [ r_1     ]
    [ 0     r_22  ...       ] [ x_2 ]   [ b'_2     ]   [ r_2     ]
    [ ...              ...  ] [ ... ] − [ ...      ] = [ ...     ]
    [ 0     0     ...  r_nn ] [ x_n ]   [ b'_n     ]   [ r_n     ]
    [ 0     0     ...  0    ]           [ b'_{n+1} ]   [ r_{n+1} ]
    [ ...                   ]           [ ...      ]   [ ...     ]
    [ 0     0     ...  0    ]           [ b'_m     ]   [ r_m     ]

Assuming none of the diagonals r_jj is zero, the best we can do is to choose x so that the first n components of r are zero.
5.6 Software
5.6.1 Representing matrices
Computer memories are typically organized into a linear address space: that is,
there is a number (an address) associated with each location in memory. A one-
dimensional array maps naturally to this memory organization: the address of
the ith element in the array is distance i (measured in appropriate units) from
the base address for the array. However, we usually refer to matrix elements
not by a single array offset, but by row and column indices. There are different ways that a pair of indices can be converted into an array offset. Languages like FORTRAN and MATLAB use column major order, so that a 2 × 2 matrix would
be stored in memory a column at a time: a11, a21, a12, a22. In contrast,
C/C++ use row major order, so that if we wrote
double A[2][2] = { {1, 2},
{3, 4} };
the order of the elements in memory would be 1, 2, 3, 4.
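The two conventions amount to two index maps; a sketch (Python here, with flat lists standing in for linear memory; names ours):

```python
# Index maps for an n-by-n matrix stored in a flat array.
def col_major(i, j, n):   # Fortran/Matlab convention
    return j * n + i

def row_major(i, j, n):   # C/C++ convention
    return i * n + j

# Store the 2-by-2 matrix [[1, 2], [3, 4]] both ways.
A = [[1, 2], [3, 4]]
n = 2
by_col = [0] * 4
by_row = [0] * 4
for i in range(n):
    for j in range(n):
        by_col[col_major(i, j, n)] = A[i][j]
        by_row[row_major(i, j, n)] = A[i][j]
assert by_col == [1, 3, 2, 4]   # a11, a21, a12, a22: a column at a time
assert by_row == [1, 2, 3, 4]   # a row at a time
```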
C++ has only very primitive built-in support for multi-dimensional arrays.
For example, the rst dimension of a built-in two-dimensional array in C++
must be known at compile time. However, C++ allows programmers to write
classes that describe new data types, like matrices, that the basic language does
not fully support. The uBLAS library and the Matrix Template Library are
examples of widely-used libraries that provide matrix classes.
An older, C-style way of writing matrix algorithms is to manually manage
the mappings between row and column indices and array offsets. This can
[Footnote 5: The upper triangular part of the LU decomposition is called U while the upper triangular part of the QR decomposition is called R. The use of r_jk for the entries of R and r_j for the entries of the residual is not unique to this book.]
/*
* Solve the n-by-n lower triangular system Lx = b.
* L is assumed to be in column-major form.
*/
void forward_substitute(const double* Ldata, double* x,
const double* b, int n)
{
#define L(i,j) Ldata[(j)*n+(i)]
for (int i = 0; i < n; ++i) {
x[i] = b[i];
for (int j = 0; j < i; ++j)
x[i] -= L(i,j)*x[j];
x[i] /= L(i,i);
}
#undef L
}
Figure 5.3: C++ code to solve a lower triangular system using forward sub-
stitution. Note the macro denition (and corresponding undenition) used to
simplify array accesses.
be made a little more convenient using C macros. For example, consider the
forward substitution code in Figure 5.3. The line
#define L(i,j) Ldata[(j)*n+(i)]
defines L to be a macro with two arguments. Any time the compiler encounters L(i,j) while the macro is defined, it will substitute Ldata[(j)*n+(i)]. For example, if we wanted to look at an entry of the first superdiagonal, we might write L(i,i+1), which the compiler would translate to Ldata[(i+1)*n+(i)]. Note that the parentheses in the macro definition are important: this is not the same as looking at Ldata[i+1*n+i]! Also note that we have followed the C convention of zero-based indexing, so the first matrix entry is at L(0,0) = Ldata[0].
In addition to the column-major and row-major layouts, there are many other ways to represent matrices in memory. Often, it is possible to represent an n × n matrix in a way that uses far less than the n^2 numbers of a standard row-major or column-major layout. A matrix that can be represented efficiently with much fewer than n^2 memory entries is called sparse; other matrices are called dense. Much of the discussion here and in Chapter 5 applies mainly to dense matrices. A modern (2009) desktop computer with 2 GB (2^31 bytes) of memory can store at most 2^28 double precision numbers, so a square matrix with n > 2^14 ≈ 16000 variables would not even fit in memory. Sparse matrix
methods can handle larger problems and often give faster methods even for
problems that can be handled using dense matrix methods. For example, nite
element computations often lead to sparse matrices with hundreds of thousands
of variables that can be solved in minutes.
One way a matrix can be sparse is for most of its entries to be zero. For
example, discretizations of the Laplace equation in three dimensions have as
few as seven non-zero entries per row, so that 7/n is the fraction of entries of A
that are not zero. Sparse matrices in this sense also arise in circuit problems,
where a non-zero entry in A corresponds to a direct connection between two
elements in the circuit. Such matrices may be stored in sparse matrix format,
in which we keep lists noting which entries are not zero and the values of the
non-zero elements. Computations with such sparse matrices try to avoid fill-in. For example, they would avoid explicit computation of A^{−1}, because most of its entries are not zero. Sparse matrix software has heuristics that often do very well in avoiding fill-in. The interested reader should consult the references.
In some cases it is possible to compute the matrix-vector product y = Ax
for a given x efficiently without calculating the entries of A explicitly. One
example is the discrete Fourier transform (DFT) described in Chapter 1. This
is a full matrix with n^2 non-zero entries, but the FFT (fast Fourier transform)
algorithm computes y = Ax in O(n log(n)) operations. Another example is the
fast multipole method that computes forces from mutual electrostatic interaction
of n charged particles with b bits of accuracy in O(nb) work. Many finite element
packages never assemble the stiffness matrix, A.
Computational methods can be direct or iterative. A direct method has
only rounding errors; it would get the exact answer in exact arithmetic
using a predetermined number of arithmetic operations. For example, Gauss
elimination computes the LU factorization of A using O(n^3) operations. Iterative
methods produce a sequence of approximate solutions that converge to the
exact answer as the number of iterations goes to infinity. They often are faster
than direct methods for very large problems, particularly when A is sparse.
5.6.2 Performance and caches
In scientific computing, performance refers to the time it takes to run the pro-
gram that does the computation. Faster computers give more performance, but
so do better programs. To write high performance software, we should know
what happens inside the computer: something about the compiler and about the
hardware. This and several later Software sections explore performance-related
issues.

When we program, we have a mental model of the operations that the com-
puter is doing. We can use this model to estimate how long a computation will
take. For example, we know that Gaussian elimination on an n × n matrix takes
about (2/3)n^3 flops, so on a machine that can execute at R flop/s, the elimination
procedure will take at least 2n^3/(3R) seconds. Desktop machines that run at a rate
of 2-3 gigahertz (billions of cycles per second) can often execute two floating
point operations per cycle, for a peak performance of 4-5 gigaflops. On a desktop
machine capable of 4 gigaflop/s (billions of flop/s), we would expect to take at
least about a sixth of a second to factor a matrix with n = 1000. In practice,
though, only very carefully written packages will run anywhere near this fast.
A naive implementation of LU factorization is likely to run 10 to 100 times slower
than we would expect from a simple flop count.
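The estimate in the text can be reproduced directly; this little calculation (a sketch, not from the book) evaluates 2n^3/(3R) for n = 1000 and R = 4 gigaflop/s.

```python
# Predicted minimum time for LU factorization from the flop count (2/3) n^3,
# assuming a peak rate of R flop/s (numbers follow the text's example).

def lu_time_estimate(n, R):
    """Lower bound on factorization time in seconds: (2/3) n^3 / R."""
    return (2.0 / 3.0) * n**3 / R

t = lu_time_estimate(1000, 4e9)
print(t)  # about 1/6 of a second
```

Real timings are usually well above this bound because memory traffic, not arithmetic, limits the speed.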
The failure of our work estimates is not due to a flaw in our formula, but
in our model of the computer. A computer can only do arithmetic operations
as fast as it can fetch data from memory, and often the time required to fetch
data is much greater than the time required to perform arithmetic. In order to
try to keep the processor supplied with data, modern machines have a memory
hierarchy. The memory hierarchy consists of different size memories, each with
a different latency, or time taken between when memory is requested and when
the first byte can be used by the processor[6]. At the top of the hierarchy are a
few registers, which can be accessed immediately. The register file holds on the
order of a hundred bytes. The level 1 cache holds tens of kilobytes (32 and 64
kilobyte cache sizes are common at the time of this writing) within one or two
clock cycles. The level 2 cache holds a couple megabytes of information, and
usually has a latency of around ten clock cycles. Main memories are usually
one or two gigabytes, and might take around fifty cycles to access. When the
processor needs a piece of data, it will first try the level 1 cache, then resort to
the level 2 cache and then to the main memory only if there is a cache miss. The
programmer has little or no control over the details of how a processor manages
its caches. However, cache policies are designed so that codes with good locality
suffer relatively few cache misses, and therefore have good performance. There
are two versions of cache locality: temporal and spatial.
Temporal locality occurs when we use the same variables repeatedly in a
short period of time. If a computation involves a working set that is not so
big that it cannot fit in the level 1 cache, then the processor will only have to
fetch the data from memory once, after which it is kept in cache. If the working
set is too large, we say that the program thrashes the cache. Because dense
linear algebra routines often operate on matrices that do not fit into cache, they
are often subject to cache thrashing. High performance linear algebra libraries
organize their computations by partitioning the matrix into blocks that fit into
cache, and then doing as much work as possible on each block before moving
onto the next.

In addition to temporal locality, most programs also have some spatial local-
ity, which means that when the processor reads one memory location, it is likely
to read nearby memory locations in the immediate future. In order to get the
best possible spatial locality, high performance scientific codes often access their
data with unit stride, which means that matrix or vector entries are accessed
one after the other in the order in which they appear in memory.
[6] Memories are roughly characterized by their latency, or the time taken to respond to a
request, and the bandwidth, or the rate at which the memory supplies data after a request has
been initiated.
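A small illustration (not from the book) of why loop order matters for spatial locality: a row-major n × n array stores entry (i, j) at offset i*n + j, so looping with j innermost visits consecutive memory addresses.

```python
# Row-major layout: entry (i, j) of an n x n matrix lives at offset i*n + j.
# Traversing with j in the inner loop visits offsets 0, 1, 2, ... (unit stride);
# traversing with i in the inner loop jumps by n each time (stride n).

def offsets_row_inner(n):
    return [i * n + j for i in range(n) for j in range(n)]

def offsets_col_inner(n):
    return [i * n + j for j in range(n) for i in range(n)]

print(offsets_row_inner(3))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]  (unit stride)
print(offsets_col_inner(3))  # [0, 3, 6, 1, 4, 7, 2, 5, 8]  (stride 3)
```

The first order streams through memory and makes good use of cache lines; the second skips around, which can thrash the cache for large n.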
5.6.3 Programming for performance
Because of memory hierarchies and other features of modern computer architec-
ture, simple models fail to accurately predict performance. The picture is even
more subtle in most high-level programming languages because of the natural,
though mistaken, inclination to assume that program statements take about
the same amount of time just because they can be expressed with the same num-
ber of keystrokes. For example, printing two numbers to a file is several hundred
times slower than adding two numbers. A poorly written program may even
spend more time allocating memory than it spends computing results[7].
Performance tuning of scientific codes often involves hard work and difficult
trade-offs. Other than using a good compiler and turning on the optimizer[8],
there are few easy fixes for speeding up the performance of code that is too
slow. The source code for a highly tuned program is often much more subtle
than the textbook implementation, and that subtlety can make the program
harder to write, test, and debug. Fast implementations may also have different
stability properties from their standard counterparts. The first goal of scientific
computing is to get an answer which is accurate enough for the purpose at
hand, so we do not consider it progress to have a tuned code which produces
the wrong answer in a tenth the time that a correct code would take. We can
try to prevent such mis-optimizations from affecting us by designing a thorough
set of unit tests before we start to tune.

Because of the potential tradeoffs between speed, clarity, and numerical sta-
bility, performance tuning should be approached carefully. When a computation
is slow, there is often one or a few critical bottlenecks that take most of the time,
and we can best use our time by addressing these first. It is not always obvious
where bottlenecks occur, but we can usually locate them by timing our codes. A
profiler is an automated tool to determine where a program is spending its time.
Profilers are useful for finding bottlenecks both in compiled languages like
C and in high-level languages like Matlab[9].
Some bottlenecks can be removed by calling fast libraries. Rewriting a code
to make more use of libraries also often clarifies the code (a case in which
improvements to speed and clarity go hand in hand). This is particularly the
case in Matlab, where many problems can be solved most concisely with a
few calls to fast built-in factorization routines that are much faster than hand-
written loops. In other cases, the right way to remedy a bottleneck is to change
algorithms or data structures. A classic example is a program that solves a
sparse linear system (a tridiagonal matrix, for example) using a general-
purpose dense routine. One good reason for learning about matrix factorization
algorithms is that you will then know which of several possible factorization-
based solution methods is fastest or most stable, even if you have not written
[7] We have seen this happen in actual codes that we were asked to look at.

[8] Turning on compiler optimizations is one of the simplest things you can do to improve
the performance of your code.

[9] The Matlab profiler is a beautifully informative tool; type help profile at the Matlab
prompt to learn more about it.
the factorization codes yourself. One often hears the statement that increasing
computer power makes it unnecessary to find faster algorithms. We believe
the opposite: the greater the computer power, the larger the problems one can
attempt, and the greater the difference a good algorithm can make.

Some novice scientific programmers (and even some experienced program-
mers) write codes with bottlenecks that have nothing to do with the main part
of their computation. For example, consider the following fragment of MATLAB
code:
n = 1000;
A = [];
for i = 1:n
A(i,i) = 1;
end
On one of our desktop machines, the time to execute this loop went from about
six seconds to under two milliseconds just by changing the second statement to
A = zeros(n). The problem with the original code is that at each step Matlab
is forced to enlarge the matrix, which it does by allocating a larger block of
memory, copying the old matrix contents to the new location, and only then
writing the new element. Therefore, the original code takes O(i^2) time to run
step i, and the overall cost of the loop scales like O(n^3). Fortunately, this sort of
blunder is relatively easy to fix. It is always worth timing a code before trying
to tune it, just to make sure that the bottlenecks are where you think they are,
and that your code is not wasting time because of a programming blunder.
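To see why the cost of the growing version scales like O(n^3), one can count the elements copied at each step; this Python sketch (illustrative, not from the book) tallies the copies under the assumption that enlarging an (i-1) x (i-1) matrix copies all of its old contents.

```python
# Count elements copied when a matrix is grown one diagonal entry at a time
# (each step reallocates and copies the previous (i-1) x (i-1) contents),
# versus preallocating the full n x n block up front.

def copies_growing(n):
    return sum((i - 1) ** 2 for i in range(1, n + 1))

def copies_preallocated(n):
    return 0  # the block is allocated once; no copying on each assignment

n = 1000
grow = copies_growing(n)
print(grow)                  # roughly n^3 / 3 elements copied
print(grow / (n**3 / 3.0))   # ratio close to 1 for large n
```

The sum 0^2 + 1^2 + ... + (n-1)^2 is about n^3/3, which is the cubic cost observed in the text.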
5.7 References and resources
The algorithms of numerical linear algebra for dense matrices are described
in great detail in the book by Charles Van Loan and Gene Golub [7] and in
the book by James Demmel [4]. The book Direct Methods for Sparse Linear
Systems by Tim Davis describes computational methods for matrices stored in
sparse matrix format [3]. Still larger problems are solved by iterative meth-
ods. Generally speaking, iterative methods are not very effective unless the user
can concoct a good preconditioner, which is an approximation to the inverse.
Effective preconditioners usually depend on physical understanding of the prob-
lem and are problem specific. The book Templates for the Solution of Linear
Systems provides a concise description of many different iterative methods and
preconditioners [].

While the topic is generally covered in courses in computer architecture,
there are relatively few textbooks on performance optimization that seem suit-
able for scientific programmers without a broad CS background. The book
Performance Optimization of Numerically Intensive Codes by Stefan Goedecker
and Adolfy Hoisie [] is one noteworthy exception. We also recommend High
Performance Computing by Kevin Dowd and Charles Severance [21]. Truly
high performance computing is done on computers with more than one proces-
sor, which is called parallel computing. There are many specialized algorithms
and programming techniques for parallel computing.

The LAPACK software package is designed to make the most of memory hi-
erarchies. The performance of LAPACK depends on fast Basic Linear Algebra
Subroutines (BLAS). There is a package called ATLAS that produces automati-
cally tuned BLAS libraries for different architectures, as well as choosing good
parameters (block sizes, etc.) for LAPACK depending on the cache performance
of your particular processor. The LAPACK manual is published by SIAM, the
Society for Industrial and Applied Mathematics.
5.8 Exercises
1. The solution to Au = b may be written u = A^-1 b. This can be a good
way to analyze algorithms involving linear systems (see Sections 4.3.1 and
6.3). But we try to avoid forming A^-1 explicitly in computations because
it is more than twice as expensive as solving the linear equations. A good
way to form B = A^-1 is to solve the matrix equation AB = I. Gauss
elimination applied to A gives A = LU, where the entries of L are the
pivots used in elimination.

(a) Show that about (1/3)n^3 work reduces AB = I to UB = L^-1, where the
entries of U and L^-1 are known.

(b) Show that computing the entries of B from UB = L^-1 takes about
(1/2)n^3 work. Hint: It takes one flop per element for each of the n
elements of the bottom row of B, then two flops per element of the
n-1 row of B, and so on to the top. The total is n(1 + 2 + ... + n).

(c) Use this to verify the claim that computing A^-1 is more than twice
as expensive as solving Au = b.
2. Show that a symmetric n × n real matrix is positive definite if and only
if all its eigenvalues are positive. Hint: If R is a right eigenvector matrix,
then, for a symmetric matrix, we may normalize so that R^-1 = R^* and
A = R Λ R^*, where Λ is the diagonal matrix of eigenvalues. Then

x^* A x = x^* R Λ R^* x = (x^* R) Λ (R^* x) = y^* Λ y ,

where y = R^* x. Show that x^* A x ≥ C ||x||_{l^2}^2 for all x. (Hint:
||y||_{l^2} = ||x||_{l^2}, C = λ_min.)
3. Write a program to compute the Cholesky factorization A = LL^*. Among
the required outputs are (iii) a printout showing that the Cholesky factoring
procedure reports failure when A is not positive definite, and (iv) a printout
showing that the Cholesky factoring procedure works correctly when applied
to an SPD matrix, proven by checking that LL^* = A.
4. A square matrix A has bandwidth 2k + 1 if a_jk = 0 whenever |j - k| > k.
A subdiagonal or superdiagonal is a set of matrix elements on one side of
the main diagonal (below for sub, above for super) with j - k, the distance
to the diagonal, fixed. The bandwidth is the number of nonzero bands. A
bandwidth 3 matrix is tridiagonal, bandwidth 5 makes pentadiagonal, etc.

(a) Show that an SPD matrix with bandwidth 2k+1 has a Cholesky factor
with nonzeros only on the diagonal and up to k bands below.

(b) Show that the Cholesky decomposition algorithm computes this L
in work proportional to k^2 n (if we skip operations on entries of A
outside its nonzero bands).

(c) Write a procedure for Cholesky factorization of tridiagonal SPD ma-
trices, and apply it to the matrix of Exercise 11; compare the running
time with this dense matrix factorizer and the one from Exercise 5.4.
Of course, check that the answer is the same, up to roundoff.
5. Suppose v_1, ..., v_m is an orthonormal basis for a vector space V ⊆ R^n.
Let L be a linear transformation from V to V. Let A be the matrix that
represents L in this basis. Show that the entries of A are given by

a_jk = v_j^* L v_k . (5.8)

Hint: Show that if y ∈ V, the representation of y in this basis is y =
Σ_j y_j v_j, where y_j = v_j^* y. In physics and theoretical chemistry, inner
products of the form (5.8) are called matrix elements. For example, the
eigenvalue perturbation formula (4.39) (in physicist terminology) simply
says that the perturbation in an eigenvalue is (nearly) equal to the
appropriate matrix element of the perturbation in the matrix.
6. Suppose A is an n × n symmetric matrix and V ⊆ R^n is an invariant
subspace for A (i.e. Ax ∈ V if x ∈ V). Show that A defines a linear
transformation from V to V. Show that there is a basis for V in which
this linear transformation (called A restricted to V) is represented by a
symmetric matrix. Hint: construct an orthonormal basis for V.
7. If Q is an n × n matrix with (Qx)^*(Qy) = x^* y for all x and y in R^n,
show that Q^* Q = I, so that Q is orthogonal. Hint: (Qx)^*(Qy) =
x^*(Q^* Q)y, so we can explore the entries of Q^* Q by choosing suitable
vectors x and y.
Chapter 6

Nonlinear Equations and Optimization
6.1 Introduction

This chapter discusses two related computational problems. One is root finding,
or solving systems of nonlinear equations. This means that we seek values of
n variables, (x_1, ..., x_n) = x ∈ R^n, to satisfy n nonlinear equations f(x) =
(f_1(x), ..., f_n(x)) = 0. We assume that f(x) is a smooth function of x. The
other problem is smooth optimization, or finding the minimum (or maximum[1])
value of a smooth objective function, V(x). These problems are closely related.
Optimization algorithms use the gradient of the objective function, solving the
system of equations g(x) = ∇V(x) = 0. Conversely, root finding can be seen as
minimizing ||f(x)||^2.
The theory here is for black box methods. These are algorithms that do not
depend on details of the definitions of the functions f(x) or V(x). The code
doing the root finding will learn about f only by user-supplied procedures
that supply values of f or V and their derivatives. The person writing the
root finding or optimization code need not open the box to see how the user
procedure works. This makes it possible for specialists to create general pur-
pose optimization and root finding software that is efficient and robust, without
knowing all the problems it may be applied to.
There is a strong incentive to use derivative information as well as function
values. For root finding, we use the n × n Jacobian matrix, f'(x), with entries
f'(x)_jk = ∂_{x_k} f_j(x). For optimization, we use the gradient and the n × n
Hessian matrix of second partials, H(x)_jk = ∂_{x_j} ∂_{x_k} V(x). It may seem
like too much extra work to go from the n components of f to the n^2 entries
of f', but algorithms that use f' often are much faster and more reliable than
those that do not.
There are drawbacks to using general-purpose software that treats each spe-
cific problem as a black box. Large-scale computing problems usually have
specific features that have a big impact on how they should be solved. Re-
formulating a problem to fit into a generic f(x) = 0 or min_x V(x) form may
increase the condition number. Problem-specific solution strategies may be
more effective than the generic Newton's method. In particular, the Jacobian
or the Hessian may be sparse in a way that general purpose software cannot take
advantage of. Some more specialized algorithms are in Exercise 6c (Marquardt-
Levenberg for nonlinear least squares), and Section ?? (Gauss-Seidel iteration
for large systems).
The algorithms discussed here are iterative (see Section 2.4). They produce
a sequence of approximations, or iterates, that should converge to the desired
solution: x_k → x^* as k → ∞[2].

[1] Optimization refers either to minimization or maximization. But finding the maximum
of V(x) is the same as finding the minimum of -V(x).

[2] Here, the subscript denotes the iteration number, not the component. In n dimensions,
iterate x_k has components x_k = (x_{k1}, ..., x_{kn}).
An iterative method fails if the iterates fail to converge or converge to the wrong
answer. For an algorithm that succeeds, the convergence rate is the rate at which
||x_k - x^*|| → 0 as k → ∞.

An iterative method is locally convergent if it succeeds whenever the initial
guess is close enough to the solution. That is, if there is an R > 0 so that if
||x_0 - x^*|| ≤ R, then x_k → x^* as k → ∞.
The final section of this chapter is on methods that do not use higher deriva-
tives. The discussion applies to linear or nonlinear problems. For optimization,
it turns out that the rate of convergence of these methods is determined by
the condition number of H for solving linear systems involving H; see Section
4.3.1 and the condition number formula (4.35). More precisely, the number
of iterations needed to reduce the error by a factor of 2 is proportional to
κ(H) = λ_max(H)/λ_min(H). This, more than linear algebra roundoff, explains
our fixation on condition number. The condition number κ(H) = 10^4 could
arise in a routine partial differential equation problem. This bothers us not so
much because it makes us lose 4 out of 16 double precision digits of accuracy,
but because it takes tens of thousands of iterations to solve the problem with a
naive method.
6.2 Solving a single nonlinear equation

The simplest problem is that of solving a single equation in a single variable:
f(x) = 0. Single variable problems are easier than multi-variable problems.
There are simple criteria that guarantee a solution exists. Some algorithms for
one dimensional problems, Newton's method in particular, have analogues for
higher dimensional problems. Others, such as bisection, are strictly one dimen-
sional. Algorithms for one dimensional problems are components for algorithms
for higher dimensional problems.
6.2.1 Bisection

Bisection search is a simple, robust way to find a zero of a function of one
variable. It does not require f(x) to be differentiable, but merely continuous.
It is based on a simple topological fact called the intermediate value theorem:
if f(x) is a continuous real-valued function of x on the interval a ≤ x ≤ b and
f(a) < 0 < f(b), then there is at least one x^* ∈ [a, b] with f(x^*) = 0. A similar
theorem applies in the case b < a or f(a) > 0 > f(b).

The bisection search algorithm consists of repeatedly bisecting an interval in
which a root is known to lie. Suppose we have an interval[3] [a, b] with f(a) < 0
and f(b) > 0. The intermediate value theorem tells us that there is a root of f
in [a, b]. The uncertainty in the location of this root is the length of the interval
|b - a|. A bisection step evaluates f at the midpoint c = (a + b)/2 and keeps the
half interval over which f still changes sign: if f(c) > 0, take the new interval
[a', b'] with a' = a and b' = c; otherwise take [a', b'] = [c, b].
To start the bisection algorithm, we need an initial interval [a_0, b_0] over which
f changes sign. Running the bisection procedure then produces intervals [a_k, b_k]
whose size decreases at an exponential rate:

|b_k - a_k| = 2^-k |b_0 - a_0| .

To get a feeling for the convergence rate, use the approximate formula 2^-10 ≈ 10^-3.
This tells us that we get three decimal digits of accuracy for each ten iter-
ations. This may seem good, but Newton's method is much faster, when it
works. Moreover, Newton's method generalizes to more than one dimension
while there is no useful multidimensional analogue of bisection search. Ex-
ponential convergence often is called linear convergence because of the linear
relationship |b_{k+1} - a_{k+1}| = (1/2)|b_k - a_k|. Newton's method is faster than this.

Although the bisection algorithm is robust, it can fail if the computed ap-
proximation to f(x) has the wrong sign. The user of bisection should take into
account the accuracy of the function approximation as well as the interval length
when evaluating the accuracy of a computed root.
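The bisection loop described above can be written in a few lines; this Python sketch (illustrative, not the book's code) assumes f(a) and f(b) have opposite signs.

```python
# Bisection: repeatedly halve an interval over which f changes sign.
# Each iteration shrinks the uncertainty |b - a| by a factor of 2.

def bisect(f, a, b, tol=1e-12):
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "f must change sign on [a, b]"
    while abs(b - a) > tol:
        c = 0.5 * (a + b)
        fc = f(c)
        if fc == 0.0:
            return c
        if fa * fc < 0:   # the root lies in [a, c]
            b, fb = c, fc
        else:             # the root lies in [c, b]
            a, fa = c, fc
    return 0.5 * (a + b)

# Example: the root of f(x) = x^2 - 2 in [1, 2] is sqrt(2).
root = bisect(lambda x: x * x - 2.0, 1.0, 2.0)
print(root)  # close to 1.41421356...
```

About ten iterations are needed per three decimal digits, matching the 2^-10 ≈ 10^-3 rule of thumb.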
6.2.2 Newton's method for a nonlinear equation

As in the previous section, we want to find a value, x^*, that solves a single
nonlinear equation f(x^*) = 0. We suppose we can evaluate f(x) and f'(x)
for any given x. At each iteration, we have a current iterate, x, and we want to
find an x' that is closer to x^*. The values f(x) and f'(x) determine the tangent
line to the graph of f(x) at the point x. The new iterate, x', is the point where
this tangent line crosses the x axis. If f(x) is close to zero, then x' should be
close to x and f(x') should be small.

More analytically, the tangent line approximation (See Section 3.1) is

f(y) ≈ F^(1)(y) = f(x) + f'(x)(y - x) . (6.1)

Finding where the line crosses the x axis is the same as setting F^(1)(x') = 0 and
solving for x':

x' = x - f'(x)^-1 f(x) . (6.2)

This is the basic Newton method.

[3] The interval notation [a, b] used here is not intended to imply that a < b. For example,
the interval [5, 2] consists of all numbers between 5 and 2, endpoints included.
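As a sketch (not the book's code), formula (6.2) becomes a short loop; near a non-degenerate root the error roughly squares at each step.

```python
# Newton's method for a single equation: x' = x - f(x)/f'(x).

def newton(f, fprime, x, steps=8):
    for _ in range(steps):
        x = x - f(x) / fprime(x)
    return x

# Example: solve x^2 - 2 = 0 starting from x = 1.
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
print(root)  # converges to sqrt(2) in a handful of iterations
```

For this example the iterates 1, 1.5, 1.4167, 1.41422, ... reach machine precision in about five steps.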
The local convergence rate of Newton's method is governed by the error
in the approximation (6.1). The analysis assumes that the root x^* is non-
degenerate, which means that f'(x^*) ≠ 0. Then f'(x) ≠ 0
for x close enough to x^*, and |x^* - x| = O(|f(x)|).
This, together with the Taylor series error bound

f(x^*) - F^(1)(x^*) = O(|x^* - x|^2) ,

and the Newton equation F^(1)(x') = 0, implies that

|f(x')| = O(|f(x)|^2) .

This means that there is a C > 0 so that

|f(x')| ≤ C |f(x)|^2 . (6.3)

This manifestation of local quadratic convergence says that the residual at the
next iteration is roughly proportional[4] to the square of the residual at the current
iterate.
Quadratic convergence is very fast. In a typical problem, once x_k - x^* is
moderately small, the residual will be at roundoff levels in a few more iterations.
For example, suppose that[5] C = 1 in (6.3) and that |x_k - x^*| = .1. Then
|x_{k+1} - x^*| ≤ .01, |x_{k+2} - x^*| ≤ 10^-4, and |x_{k+4} - x^*| ≤ 10^-16. The number
of correct digits doubles at each iteration. By contrast, a linearly convergent
iteration with |x_{k+1} - x^*| ≈ .1 |x_k - x^*| would need 15 more iterations to
do this, and bisection, which gains only a factor of 2 per step, would need about
50 (since (2^10)^5 = 2^50 ≈ 10^15).

Unfortunately, the quadratic convergence of Newton's method is local. There
is no guarantee that x_k → x^* if the initial guess is not close to x^*.
6.3 Newton's method in more than one dimension

Newton's method also applies to solving systems of nonlinear equations, f(x) = 0,
with x ∈ R^n. A program for finding x^* computes the Newton step as the
correction z = x' - x, so that x' = x + z, where z solves the linear system

f'(x) z = -f(x) . (6.4)

To carry out one step of Newton's method, we must evaluate the function f(x)
and the Jacobian, f'(x), then solve (6.4) and take the step. Formally, the
iteration is

x' = x - [f'(x)]^-1 f(x) , (6.5)

which is a natural generalization of the one dimensional formula (6.2). In compu-
tational practice (see Exercise 1) it usually is more expensive to form [f'(x)]^-1
than to solve (6.4).
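A sketch (not from the book) of Newton iteration in two variables, solving the 2 × 2 linear system of the form (6.4) directly by Cramer's rule rather than forming the inverse Jacobian; larger systems would use an LU factorization instead.

```python
# Newton's method for a 2x2 system f(x) = 0: solve f'(x) z = -f(x), set x' = x + z.

def newton2(f, jac, x, steps=10):
    for _ in range(steps):
        f1, f2 = f(x)
        (a, b), (c, d) = jac(x)            # rows of the Jacobian f'(x)
        det = a * d - b * c
        z1 = (-f1 * d - b * (-f2)) / det   # Cramer's rule for f'(x) z = -f(x)
        z2 = (a * (-f2) - (-f1) * c) / det
        x = (x[0] + z1, x[1] + z2)
    return x

# Example system: x^2 + y^2 = 4 and x - y = 0, with solution (sqrt(2), sqrt(2)).
f = lambda p: (p[0] ** 2 + p[1] ** 2 - 4.0, p[0] - p[1])
jac = lambda p: ((2.0 * p[0], 2.0 * p[1]), (1.0, -1.0))
x, y = newton2(f, jac, (1.0, 2.0))
print(x, y)  # both close to sqrt(2)
```

The hypothetical example system and the fixed step count are choices for illustration; a practical code would monitor the residual and stop when it is small.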
Newton's method for systems of equations also has quadratic (very fast)
local convergence to a non-degenerate solution x^*. As in one dimension, this
comes from the error bound for the linear approximation,

||f(x + z) - (f(x) + f'(x)z)|| = O(||z||^2) .

We see from (6.5) that[6]

||z|| ≤ C ||f(x)|| .

Together, these inequalities imply that if x - x^* is small enough, then
||f(x')|| = O(||f(x)||^2), which is quadratic convergence.

As in one dimension, a good initial guess is needed to get
convergence. In fact, it often is hard to know whether a system of nonlinear
equations has a solution at all. There is nothing as useful as the intermediate
value theorem from the one dimensional case, and there is no multi-dimensional
analogue of the robust but slow bisection method in one dimension.
While Newton's method can suffer from extreme ill conditioning, it has a
certain robustness against ill conditioning that comes from its affine invariance.
Affine invariance states that Newton's method is invariant under affine transforma-
tions. An affine transformation is a mapping x → Ax + b (it would be linear
without the b).

[6] The definition of a non-degenerate solution is that f'(x^*) is nonsingular, so that, near
x^*, ||[f'(x)]^-1 f(x)|| ≤ C ||f(x)||.

An affine transformation[7] of f(x) is g(y) = Af(By), where A
and B are invertible n × n matrices. The affine invariance is that if we start
from corresponding initial guesses, x_0 = By_0, and create iterates y_k by applying
Newton's method to g(y) and x_k by applying Newton's method to f(x), then
the iterates also correspond: x_k = By_k. This means that Newton's method
works exactly as well on the original equations f(x) = 0 as on the transformed
equations g(y) = 0. For example, we can imagine changing from x to variables
y in order to give each of the unknowns the same units. If g(y) = 0 is the best
possible rescaling of the original equations f(x) = 0, then applying Newton's
method to f(x) = 0 gives equivalent iterates.
This argument can be restated informally as saying that Newton's method
makes sense on dimensional grounds and therefore is natural. The variables
x_1, ..., x_n may have different units, as may the functions f_1, ..., f_n. The n^2
entries of f'(x) may all have different units, as may the entries of (f')^-1. The
matrix-vector product that determines the components of the Newton step (see
(6.4)), z = (f')^-1 f(x), involves adding a number of contributions (entries in
the matrix (f')^-1 multiplying components of f) that might seem likely to have
a variety of units. Nevertheless, each of the n terms in the sum implicit in the
matrix-vector product (6.5) defining a component z_j has the same units as the
corresponding component, x_j. See Section 6.6 for a more detailed discussion of
this point.
6.3.1 Quasi-Newton methods

Local quadratic convergence is the incentive for evaluating the Jacobian matrix.
Evaluating the Jacobian matrix may not be so expensive if much of the work in
evaluating f can be re-used in calculating f'. When the Jacobian is not available
analytically, its entries can be estimated by finite differences: column k of f'(x) is
∂_{x_k} f(x). The cheapest and least accurate approximation to this is the
first-order one-sided difference formula[8]: (f(x + Δx_k e_k) - f(x))/Δx_k.
Evaluating all of f' in this way would take n extra evaluations of f per iteration,
which may be so expensive that it outweighs the fast local convergence.
Quasi-Newton methods replace the true f'(x) with estimates built up from
function values as the iteration proceeds.

6.4 One variable optimization

Suppose V(x) is a smooth function of a single variable. We say that x^* is a
local minimum of V if there is an R > 0 so that V(x^*) ≤ V(x) whenever
|x - x^*| ≤ R. We say that x^* is a strict local minimum if V(x^*) <
V(x) whenever x ≠ x^* and |x - x^*| ≤ R. We say that x^* is a global
minimum if V(x^*) ≤ V(x) for all x for which V(x) is defined, and a strict global
minimum if V(x^*) < V(x) for all such x ≠ x^*. If V'(x^*) = 0 and
V''(x^*) > 0, then x^* is at least a strict local minimum. The function V(x)
is convex if[9]

λV(x) + μV(y) ≥ V(λx + μy)

whenever λ ≥ 0, μ ≥ 0, and λ + μ = 1. The function is strictly convex if
V''(x) > 0 for all x. A strictly convex function is convex, but the function
V(x) = x^4 is not strictly convex, because V''(0) = 0.
For the curious, there is an analogue of bisection search in one variable
optimization called golden section search. It applies to any continuous function
that is unimodal, meaning that V has a single global minimum and no local
minima. The golden mean[10] is r = (1 + √5)/2 ≈ 1.62. At each stage of
bisection search we have an interval [a, b] in which there must be at least one
root. At each stage of golden section search we have an interval [a, c] and a
third point b ∈ [a, c] with

|a - c| = r |a - b| . (6.8)

[9] The reader should check that this is the same as the geometric condition that the line
segment connecting the points (x, V(x)) and (y, V(y)) lies above the graph of V.

[10] This number comes up in many ways. From Fibonacci numbers it is r = lim_{k→∞} f_{k+1}/f_k.
If (a + b)/a = a/b and a > b, then a/b = r. This has the geometric interpretation that
if we remove an a × a square from one end of an a × (a + b) rectangle, then the remaining
smaller a × b rectangle has the same aspect ratio as the original a × (a + b) rectangle. Either
of these leads to the equation r^2 = r + 1.
As with our discussion of bisection search, the notation [a, c] does not imply
that a < c. In bisection, we assume that f(a)f(b) < 0. Here, we assume
that f(b) < f(a) and f(b) < f(c), so that there must be a local minimum
within [a, c]. Now (this is the clever part), consider a fourth point in the larger
subinterval, d = (1 - 1/r)a + (1/r)b. Evaluate f(d). If f(d) < f(b), take a' = a,
b' = d, and c' = b. Otherwise, take a' = c, b' = b, and c' = d, reversing the
sense of the interval. In either case, |a' - c'| = r |a' - b'|, and f(b') < f(a') and
f(b') < f(c'), so the new configuration is like the old one, with |a' - c'| =
(1/r) |a - c|.
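The update just described can be sketched in a few lines of Python (illustrative, not the book's code); the loop maintains the triple (a, b, c) with |a - c| = r |a - b| and f(b) below both f(a) and f(c).

```python
import math

R = (1.0 + math.sqrt(5.0)) / 2.0  # the golden mean, about 1.618

def golden_section(f, a, b, c, iters=60):
    """Minimize a unimodal f given a triple with f(b) < f(a), f(b) < f(c)
    and |a - c| = R |a - b|; the triple may be 'reversed' (a > c)."""
    for _ in range(iters):
        d = (1.0 - 1.0 / R) * a + (1.0 / R) * b
        if f(d) < f(b):
            a, b, c = a, d, b   # keep the subinterval containing d
        else:
            a, b, c = c, b, d   # reverse the sense of the interval
    return b

# Example: minimize f(x) = (x - 1)^2 on a valid starting triple.
f = lambda x: (x - 1.0) ** 2
a, c = 0.0, 3.0
b = a + (c - a) / R   # places b so that |a - c| = R |a - b|
x = golden_section(f, a, b, c)
print(x)  # close to 1.0
```

Each iteration shrinks the bracket by a factor of 1/r ≈ 0.618, so convergence is linear, like bisection.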
6.5 Newton's method for local optimization

Most of the properties listed in Section 6.4 are the same for multi-variable
optimization. We denote the gradient as g(x) = ∇V(x), and the Hessian matrix
of second partials as H(x). An x^* with g(x^*) = 0 and H(x^*) positive definite
(see Section 5.4 and Exercise 2) is called non-degenerate, a natural generalization
of the condition V'(x^*) = 0 and V''(x^*) > 0 in one dimension. For such a
point, Taylor expansion gives

V(x^* + z) - V(x^*) = (1/2) z^* H(x^*) z + O(||z||^3) .

Exercise 2 shows that the first term on the right is positive and larger than the
second if ||z|| is small enough, so x^* is a strict local minimum. If H(x^*) is negative
definite (obvious definition), the same argument shows that x^* is at least a local
maximum. If H(x^*) has eigenvalues of both signs (and g(x^*) = 0), then x^* is a
saddle point. In any case, a local minimum or maximum has g(x^*) = 0 if V
is differentiable.
We can use Newton's method from Section 6.3 to seek a local minimum by
solving the equations g(x) = 0, but we must pay attention to the difference
between row and column vectors. We have been considering x, the Newton
step, z, etc. to be column vectors while ∇V(x) = g(x) is a row vector. For
this reason, we consider applying Newton's method to the column vector of
equations g^*(x) = 0. The Jacobian of g^*(x)
is the Hessian H (check this). Therefore, the locally convergent Newton method
is

x' = x + z ,

where the step z is given by the Newton equations

H(x)z = -g^*(x) . (6.9)

Because it is a special case of Newton's method, it has local quadratic conver-
gence to x^* if x^* is non-degenerate. Another way to derive (6.9) is to minimize
the quadratic model of V about the current iterate,

V(x + z) ≈ V^(2)(x, z) = V(x) + g(x)z + (1/2) z^* H(x)z . (6.10)

If we minimize V^(2)(x, z) over z, the result is z = -H(x)^-1 ∇V(x)^*, which is the
same as (6.9). As for Newton's method for nonlinear equations, the intuition
is that V^(2)(x, z) will be close to V(x + z) for small z. This should make the
minimum of V^(2)(x, z) close to the minimizer of V, which is x^*.
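In one variable the Newton equations reduce to x' = x - V'(x)/V''(x); this Python sketch (illustrative, not from the book) applies the iteration to a strictly convex example.

```python
# Newton's method for minimization in one variable:
# solve V''(x) z = -V'(x), i.e. x' = x - V'(x)/V''(x).

def newton_min(Vp, Vpp, x, steps=30):
    for _ in range(steps):
        x = x - Vp(x) / Vpp(x)
    return x

# Example: V(x) = (x - 2)^4 + (x - 2)^2, minimized at x = 2.
Vp = lambda x: 4.0 * (x - 2.0) ** 3 + 2.0 * (x - 2.0)
Vpp = lambda x: 12.0 * (x - 2.0) ** 2 + 2.0
xmin = newton_min(Vp, Vpp, 0.0)
print(xmin)  # close to 2.0
```

Note that the same iteration would converge just as happily to a maximum or saddle point of a different V, which is the motivation for the safeguards discussed next.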
Unfortunately, this simple local method cannot distinguish between between
a local minimum, a local maximum, or even a saddle point. If x
has V (x
) = 0
(so x
(x
k
) will happily converge to to x
if |x
0
x
| is small enough.
This could be a local maximum or a saddle point. Moreover, if |x
0
x
| is not
small, we have no idea whether the iterates will converge to anything at all.
The main difference between the unsafeguarded Newton method for optimization and for general systems of nonlinear equations is that in optimization the Hessian is symmetric and (close enough to a non-degenerate local minimum) positive definite, while the Jacobian f'(x) of a general system need have neither property. The safeguarded methods below guarantee that V decreases at each iteration, V(x_{k+1}) < V(x_k). In principle, this would allow the x_k to converge to a saddle point, but this is extremely unlikely in practice because saddle points are unstable for this process.
The safeguarded methods use the formulation of a search direction, p, and a step size, t > 0. One iteration takes the form x' = x + tp. The direction p is a descent direction if

  (d/dt) V(x + tp) |_{t=0} = g(x) p < 0 .   (6.11)

This guarantees that if t > 0 is small enough, then V(x + tp) < V(x). Then we find a step size, t, that actually achieves this property. If we prevent t from becoming too small, it will be impossible for the iterates to converge except to a stationary point.
6.6. SAFEGUARDS AND GLOBAL OPTIMIZATION 135
We find the search direction by solving a modified Newton equation

  H~ p = −g^T(x) .   (6.12)

Putting this into (6.11) gives

  (d/dt) V(x + tp) |_{t=0} = −g(x) H~^{−1} g(x)^T .

This is negative if H~ is positive definite (the right hand side is a 1 × 1 matrix (a number) because g is a row vector). One algorithm for finding a descent direction would be to apply the Cholesky decomposition algorithm (see Section 5.4). If the algorithm finds L with LL^T = H(x), we may take H~ = H(x). Otherwise H(x) is not positive definite, and the decomposition is modified so that it never takes the square root of a non-positive number: the diagonal entries, normally computed as

  l_kk = ( H_kk − ( l_{k1}^2 + ··· + l_{k,k−1}^2 ) )^{1/2} ,   (6.13)

are replaced whenever the quantity under the square root fails to be sufficiently positive. Here, H_kk is the (k, k) entry of H(x). This modified Cholesky algorithm produces L with LL^T = H~ positive definite, and we find the search direction by solving

  LL^T p = −g^T(x) .   (6.14)
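A minimal sketch of this idea follows; the specific replacement rule for a failed pivot (here a small constant `delta`) is an illustrative assumption, since different modified Cholesky variants choose it differently.

```python
import numpy as np

def modified_cholesky(H, delta=1e-8):
    """Lower triangular L with L @ L.T = H when H is positive definite.

    If a pivot H_kk - sum(l_kj^2) is not positive, it is replaced by the
    small positive number delta, so the factorization always succeeds and
    L @ L.T is a positive definite surrogate for H.
    """
    n = H.shape[0]
    L = np.zeros_like(H, dtype=float)
    for k in range(n):
        pivot = H[k, k] - np.dot(L[k, :k], L[k, :k])
        L[k, k] = np.sqrt(max(pivot, delta))       # guard the square root
        for i in range(k + 1, n):
            L[i, k] = (H[i, k] - np.dot(L[i, :k], L[k, :k])) / L[k, k]
    return L

def descent_direction(H, g):
    """Solve L L^T p = -g in two triangular stages, as in (6.14)."""
    L = modified_cholesky(H)
    q = np.linalg.solve(L, -g)
    return np.linalg.solve(L.T, q)

H = np.array([[4.0, 2.0], [2.0, -3.0]])   # an indefinite Hessian
g = np.array([1.0, -2.0])
p = descent_direction(H, g)
assert g @ p < 0                          # p is a descent direction
```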
It is not entirely clear why the more complicated modified Cholesky algorithm is more effective than simply taking H~ = I when H(x) is not positive definite. One possible explanation has to do with units. Let us suppose that U_k represents the units of x_k, such as seconds, dollars, kilograms, etc. Let us also suppose that V(x) is dimensionless. In this case the units of H_jk = ∂^2 V/∂x_j ∂x_k are [H_jk] = 1/(U_j U_k). We can verify by studying the Cholesky decomposition equations from Section 5.4 that the entries of L have units [l_jk] = 1/U_j, whether we use the actual equations or the modification (6.13). We solve (6.14) in two stages, first Lq = −∇V^T, then L^T p = q. Checking units shows that the entries of p satisfy [p_k] = U_k, which allows x' = x + tp to make sense with a dimensionless t. The reader should check that the choice H~ = I does not have this property in general, even if we allow t to have units, if the U_k are different.
136 CHAPTER 6. NONLINEAR EQUATIONS AND OPTIMIZATION
The second safeguard is a limited line search. In general, line search means minimizing the function φ(t) = V(x + tp) over the single variable t. This could be done using golden section search, but a much more rudimentary binary search process suffices as a safeguard. In this binary search, we evaluate φ(0) = V(x) and φ(1) = V(x + p). If φ(1) > φ(0), the step size is too large. In that case, we keep reducing t by a factor of 2 (t = t/2;) until φ(t) < φ(0), or we give up. If p is a descent direction, we will have φ(t) < φ(0) for small enough t, so this bisection process will halt after finitely many reductions of t. If φ(1) < φ(0), we enter a greedy process of increasing t by factors of 2 until φ(2t) > φ(t). This process will halt after finitely many doublings if the set of x~ with V(x~) < V(x) is bounded.
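The binary search safeguard just described is short enough to write out; this is a sketch of the strategy, not code from the text, and the iteration caps are illustrative.

```python
def limited_line_search(phi, max_reductions=30, max_doublings=30):
    """Limited binary line search on phi(t) = V(x + t p).

    Returns an accepted step t with phi(t) < phi(0), or None on failure.
    Halves t while phi(1) > phi(0); otherwise doubles t greedily while
    phi(2t) < phi(t).
    """
    phi0 = phi(0.0)
    t = 1.0
    if phi(t) > phi0:
        for _ in range(max_reductions):
            t /= 2.0
            if phi(t) < phi0:
                return t
        return None                      # too many reductions: give up
    for _ in range(max_doublings):
        if phi(2.0 * t) > phi(t):
            return t                     # doubling stops helping: accept t
        t *= 2.0
    return None                          # too many doublings: give up

# Behaves as the test cases of Exercise 8(d) require:
assert limited_line_search(lambda t: (t - 0.9) ** 2) == 1.0
assert limited_line_search(lambda t: (t - 100.0) ** 2) is not None
assert limited_line_search(lambda t: t) is None
```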
A desirable feature is that the safeguarded algorithm gives the ordinary Newton step, and rapid (quadratic) local convergence, if x is close enough to a nondegenerate local minimum. The modified Hessian coincides with the actual Hessian when H(x) is positive definite, which holds when |x − x^*| is small, where x^* is the true solution. The step size will be the default t = 1 if x is close enough to x^*.
The residual error is easy to determine, but what about the error |x_k − x^*|? For Newton's method, each successive iterate is much more accurate than the previous one, so that as long as x_k is close to x^*,

  |x_k − x^*| = |x_k − x_{k+1}| + O( |x_k − x_{k+1}|^2 ) .

Therefore, it is natural to use the length of the Newton step from x_k to x_{k+1} as an estimate for the error in x_k, assuming that we are close to the solution and that we are taking full Newton steps. Of course, we would then usually return the approximate root x_{k+1}, even though the error estimate is for x_k. That is, we test convergence based on the difference between a less accurate and a more accurate formula, and return the more accurate result even though the error estimate is only realistic for the less accurate formula. This is exactly the same strategy we used in our discussion of integration, and we will see it again when we discuss step size selection in the integration of ordinary differential equations.
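A small scalar experiment makes the error estimate concrete; the choice f(x) = x^2 − 2 with root sqrt(2) is a hypothetical example, not one from the text.

```python
import math

# Newton for f(x) = x^2 - 2; after a couple of iterations the computable
# step length matches the true (normally unknown) error to leading order.
x = 1.5
root = math.sqrt(2.0)
for _ in range(2):
    x = x - (x * x - 2.0) / (2.0 * x)
step = abs((x * x - 2.0) / (2.0 * x))   # length of the next Newton step
true_err = abs(x - root)                # unknown in a real computation
assert abs(step - true_err) < 0.01 * true_err
```

The discrepancy between `step` and `true_err` is of the order of the error squared, which is exactly the O(|x_k − x_{k+1}|^2) term above.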
6.8 Gradient descent and iterative methods
The gradient descent optimization algorithm uses the identity matrix as the approximate Hessian, H~ = I, so the (negative of the) gradient becomes the search direction: p = −∇V(x)^T. The iteration

  x' = x − t ∇V(x)^T ,   (6.15)

does not make dimensional sense in general, see Section 6.6. Written in components, (6.15) is x'_k = x_k − t ∂_{x_k} V(x). Applied for k = 1, this makes dimensional sense if the units of t satisfy [t] = [x_1^2]/[V]. If the units of x_2 are different from those of x_1, the x_2 equation forces units of t inconsistent with those from the x_1 equation.
We can understand the slow convergence of gradient descent by studying how it works on the model problem¹¹ V(x) = (1/2) x^T H x, with H symmetric and positive definite. The gradient is ∇V(x)^T = Hx, so the gradient descent iteration with a fixed step size¹² t is

  x_{k+1} = ( I − tH ) x_k .   (6.16)

In the basis of eigenvectors of H, with eigenvalues λ_j > 0, each component satisfies

  x_{k+1, j} = μ_j x_{k, j} ,   (6.17)

where

  μ_j = 1 − t λ_j .   (6.18)

After k iterations of (6.16), we have x_{kj} = μ_j^k x_{0j}, where x_{kj} is component j of the iterate x_k. Clearly, the rate at which x_k → 0 depends on the spectral gap,

  ρ = 1 − max_j | μ_j | ,
¹¹ We simplified the problem but have not lost generality by taking x^* = 0 here.
¹² See Exercise 7 for an example showing that line search does not improve the situation very much.
in the sense that the estimate

  |x_k| ≤ ( 1 − ρ )^k |x_0|

is sharp (take x_0 = e_1 or x_0 = e_n). The optimal step size, t, is the one that maximizes ρ, which leads to (see (6.18))

  1 − ρ = μ_max = 1 − t λ_min ,
  −(1 − ρ) = μ_min = 1 − t λ_max .

Solving these gives the optimizing value t = 2/(λ_min + λ_max) and

  ρ = 2 λ_min / ( λ_min + λ_max ) ≈ 2/κ(H) .   (6.19)

If we take k = 2/ρ ≈ κ(H) iterations, and H is ill conditioned so that k is large, the error is reduced roughly by a factor of

  ( 1 − ρ )^k = ( 1 − 2/k )^k ≈ e^{−2} .

This justifies what we said in the Introduction, that it takes k = κ(H) iterations to reduce the error by a fixed factor.
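The iteration count predicted by (6.19) can be checked directly on the model problem; the eigenvalues below are an arbitrary illustrative choice giving κ(H) = 100.

```python
import numpy as np

# Model problem V(x) = 0.5 x^T H x with diagonal, ill conditioned H.
lam = np.array([1.0, 100.0])            # eigenvalues of H; kappa(H) = 100
t = 2.0 / (lam.min() + lam.max())       # optimal fixed step size
x = np.array([1.0, 1.0])
k = 0
while np.linalg.norm(x) > np.exp(-2) * np.sqrt(2.0):
    x = x - t * lam * x                 # gradient step: grad V(x) = H x
    k += 1
# k comes out on the order of kappa(H) = 100 for an e^{-2} error reduction
```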
6.8.1 Gauss-Seidel iteration

The Gauss-Seidel iteration strategy makes use of the fact that optimizing over one variable is easier than optimizing over several. It goes from x to x' in n steps. Step j optimizes V(x) over the single component x_j with all other components fixed. We write the n intermediate stages as x^(j) with components x^(j) = ( x^(j)_1, . . . , x^(j)_n ). Starting with x^(0) = x, we go from x^(j−1) to x^(j) by optimizing over component j. That is, x^(j)_m = x^(j−1)_m if m ≠ j, and we get x^(j)_j by solving

  min_ξ V( x^(j−1)_1, . . . , x^(j−1)_{j−1}, ξ, x^(j−1)_{j+1}, . . . , x^(j−1)_n ) .
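A sketch of one such sweep, for the case where the one-variable minimizations can be done in closed form (a quadratic V, where the sweep reduces to the classical Gauss-Seidel iteration for a linear system); the specific A and b are illustrative.

```python
import numpy as np

def gauss_seidel_sweep(argmin1d, x):
    """One sweep x -> x': minimize over each coordinate in turn.

    argmin1d(x, j) returns the scalar value of component j minimizing V
    with the other components of x held fixed.
    """
    x = x.copy()
    for j in range(len(x)):
        x[j] = argmin1d(x, j)
    return x

# Quadratic example: V(x) = 0.5 x^T A x - b^T x, so the 1d minimizer solves
# A[j, j] * xi = b[j] - sum_{m != j} A[j, m] x[m].
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
argmin1d = lambda x, j: (b[j] - A[j] @ x + A[j, j] * x[j]) / A[j, j]

x = np.zeros(2)
for _ in range(50):
    x = gauss_seidel_sweep(argmin1d, x)
assert np.allclose(A @ x, b)            # converged to the minimizer of V
```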
6.9 Resources and further reading
The book by Ortega and Rheinboldt has a more detailed discussion of Newtons
method for solving systems of nonlinear equations [17]. The book Practical
Optimization by Phillip Gill, Walter Murray, and my colleague Margaret Wright,
has much more on nonlinear optimization including methods for constrained
optimization problems [5]. There is much public domain software for smooth
optimization problems, but I dont think much of it is useful.
6.10 Exercises

1. Suppose the root x^* = 0 of f(x) = 0 is degenerate in that f'(x^*) = 0. The Newton iterates are x_k satisfying x_{k+1} = x_k − f(x_k)/f'(x_k). Show that the local convergence in this case is linear, which means that there is an α < 1 with |x_{k+1} − x^*| ≤ α |x_k − x^*|, so x_k → 0 exponentially. Nevertheless, contrast this linear local convergence with the quadratic local convergence for a nondegenerate problem.
2. Use the Taylor expansion to second order to derive the approximation

  f(x') ≈ C(x) f(x)^2 = (1/2) ( f''(x) / f'(x)^2 ) f(x)^2 .   (6.20)

Derive a similar expression that shows that (x' − x^*) is approximately proportional to (x − x^*)^2. Use (6.20) to predict that applying Newton's method to solving the equation sin(x) = 0 will have superquadratic convergence. What makes C(x) large, and the convergence slow, is (i) a small f'(x) (a nearly degenerate problem), or (ii) a large f''(x) (a highly nonlinear problem).
3. The function f(x) = x/√(1 + x^2) has a unique root: f(x) = 0 only for x = 0. Show that the unsafeguarded Newton method gives x_{k+1} = −x_k^3. Conclude that the method succeeds if and only if |x_0| < 1. Draw graphs to illustrate the first few iterates when x_0 = .5 and x_0 = 1.5. Note that Newton's method for this problem has local cubic convergence, which is even faster than the more typical local quadratic convergence. The formula (6.20) explains why.
4. Suppose n = 2 and x_1 has units of (electric) charge, x_2 has units of mass, f_1(x_1, x_2) has units of length, and f_2(x_1, x_2) has units of time. Find the units of each of the four entries of (f')^{−1}. Verify the claims about the units of the step, z, at the end of Section 6.3.
5. Suppose x^* satisfies f(x^*) = 0. The basin of attraction of x^* is the set of x so that if x_0 = x then x_k → x^* as k → ∞. If f'(x^*) is non-singular, the basin of attraction contains at least a neighborhood of x^*, because the iteration is locally convergent. In Exercise 3 the basin of attraction of x^* = 0 is the open interval (endpoints not included) (−1, 1). Now consider the two dimensional problem of finding roots of f(z) = z^2 − 1, where z = x + iy. Written out in its real components, f(x, y) = ( x^2 − y^2 − 1, 2xy ). The basin of attraction of the solution z^* = 1 ( (x^*, y^*) = (1, 0) ) contains at least a neighborhood of that point.

(a) Show that the Newton iteration for f(z) = z^2 − 1 is

  z_{k+1} = z_k − ( z_k^2 − 1 ) / ( 2 z_k ) .

(b) Set z_k = 1 + w_k and show that

  w_{k+1} = w_k^2 / ( 2( 1 + w_k ) ) .

(c) Use this to show that if |w_k| < 1/4, then |w_{k+1}| < (1/2) |w_k|. Hint: Show |1 + w_k| > 3/4. Argue that this implies that the basin of attraction of z^* = 1 contains the disk |z − 1| < 1/4, which is a quantitative form of local convergence.

(d) Show that if |z_k − 1| < 1/4 for some k, then z_0 is in the basin of attraction of z^* = 1.
6. (a) Show that if x_0 is on the x_1 axis, then every subsequent iterate also is on the x_1 axis, so the iterates converge to the saddle point,¹³ x = 0. Hint: H~ has a simple form in this case.
(b) Show that if x_2 ≠ 0, and t > 0 is the step size, and we use the bisection search that increases the step size until φ(t) = V(x + tp) satisfies φ(2t) > φ(t), then one of the following occurs:

  i. The bisection search does not terminate, t → ∞, and φ(t) → −∞. This would be considered good, since the minimum of V is −∞.

  ii. The line search terminates with t satisfying φ(t) = V(x') < 0. In this case, subsequent iterates cannot converge to x = 0, because that would force V to converge to zero, while our modified Newton strategy guarantees that V decreases at each iteration.
(c) Nonlinear least squares means finding x ∈ R^m to minimize V(x) = ‖f(x) − b‖²_{l²}, where f(x) = ( f_1(x), . . . , f_n(x) )^T is a column vector of n nonlinear functions of the m unknowns, and b ∈ R^n is a vector we are trying to approximate. If f(x) is linear (there is an n × m matrix A with f(x) = Ax), then minimizing V(x) is a linear least squares problem. The Marquardt-Levenberg iterative algorithm solves a linear

¹³ The usual definition of saddle point is that H should have at least one positive and one negative eigenvalue and no zero eigenvalues. The simpler criterion here suffices for this application.
6.10. EXERCISES 141
least squares problem at each iteration. If the current iterate is x, let the linearization of f be the n × m Jacobian matrix A with entries a_ij = ∂_{x_j} f_i(x). Calculate the step, p, by solving

  min_p ‖ Ap − ( b − f(x) ) ‖_{l²} .   (6.21)

Then take the next iterate to be x' = x + p.

  i. Show that this algorithm has local quadratic convergence if the residual at the solution is zero: r(x^*) = f(x^*) − b = 0, but not otherwise (in general).

  ii. Show that p is a descent direction.

  iii. Describe a safeguarded algorithm that probably will converge to at least a local minimum or diverge.
7. This exercise shows a connection between the slowness of gradient descent and the condition number of H in one very special case. Consider minimizing the model quadratic function in two dimensions V(x) = (1/2)( λ_1 x_1^2 + λ_2 x_2^2 ) using gradient descent. Suppose the line search is done exactly, choosing t to minimize φ(t) = V(x + tp), where p = −∇V(x)^T. In general it is hard to describe the effect of many minimization steps because the iteration x → x' is nonlinear.

(b) Show that if p_k is in the direction of (1, −1)^T, then p_{k+1} is in the direction of (1, 1)^T.

(c) Show that p_k is in the direction of (1, 1)^T if and only if (x_1, x_2) = r( λ_2, λ_1 ), for some r.

(d) Since the optimum is x^* = (0, 0)^T, the error is |x_k|. Show that in this situation |x_{k+1}| = ρ |x_k|, with ρ = ( λ_1 − λ_2 )/( λ_1 + λ_2 ) if λ_1 ≥ λ_2 (including the case λ_1 = λ_2).

(e) Show that if λ_1 ≫ λ_2, then ρ ≈ 1 − 2λ_2/λ_1 = 1 − 2/κ(H), where κ(H) is the linear systems condition number of H.

(f) Still supposing λ_1 ≫ λ_2, show that it takes roughly n = κ(H) iterations to reduce the error by a factor of e^{−2}.
8. This exercise walks you through construction of a robust optimizer. It is as much an exercise in constructing scientific software as in optimization techniques. You will apply it to finding the minimum of the two variable function

  V(x, y) = ψ(x, y) / √( 1 + ψ(x, y)^2 ) ,   ψ(x, y) = ψ_0 + w x^2 + ( y − a sin(x) )^2 .

Hand in output documenting what you did for each of the parts below.

(a) Write procedures that evaluate V(x, y), g(x, y), and H(x, y) analytically. Write a tester that uses finite differences to verify that g and H are correct.

(b) Implement a local Newton's method without safeguards as in Section 6.5. Use the Cholesky decomposition code from Exercise 3. Report failure if H is not positive definite. Include a stopping criterion and a maximum iteration count, neither hard wired. Verify local quadratic convergence starting from initial guess (x_0, y_0) = (.3, .3) with parameters ψ_0 = .5, w = .5, and a = .5. Find an initial condition from which the unsafeguarded method fails.

(c) Modify the Cholesky decomposition code from Exercise 5.4 to do the modified Cholesky decomposition described in Section 6.6. This should require you to change a single line of code.

(d) Write a procedure that implements the limited line search strategy described in Section 6.6. This also should have a maximum iteration count that is not hard wired. Write the procedure so that it sees only a scalar function φ(t). Test it on:

  i. φ(t) = (t − .9)^2 (should succeed with t = 1).
  ii. φ(t) = (t − .01)^2 (should succeed after several step size reductions).
  iii. φ(t) = (t − 100)^2 (should succeed after several step size doublings).
  iv. φ(t) = t (should fail after too many step size reductions).
  v. φ(t) = −t (should fail after too many doublings).

(e) Combine the procedures from parts (c) and (d) to create a robust global optimization code. Try the code on our test problem with (x_0, y_0) = (10, 10) and parameters ψ_0 = .5, w = .02, and a = 1. Make a plot that shows contour lines of V and all the iterates.
Chapter 7
Approximating Functions
Scientific computing often calls for representing or approximating a general function, f(x). That is, we seek an f~ in a certain class of functions so that f~ ≈ f in some sense.

Theorem 2 states that the determinant of the Vandermonde matrix (7.4) is

  D( x_0, x_1, . . . , x_d ) = Π_{j<k} ( x_k − x_j ) .   (7.5)
The reader should verify directly that D(x_0, x_1, x_2) = (x_2 − x_0)(x_1 − x_0)(x_2 − x_1). It is clear that D = 0 whenever x_j = x_k for some j ≠ k, because x_j = x_k makes row j and row k equal to each other. The formula (7.5) says that D is a product of factors coming from these facts.
Proof: The proof uses three basic properties of determinants. The first is that the determinant does not change if we perform an elimination operation on rows or columns: if we subtract a multiple of row j from row k, or of column j from column k, the determinant does not change. The second is that if row k or column k has a common factor, we can pull that factor out of the determinant. The third is that if the first column is (1, 0, . . . , 0)^T, then the determinant equals the determinant of the d × d matrix obtained by deleting the first row and first column. The formula (7.5) follows by induction from the identity

  D( x_0, x_1, . . . , x_d ) = ( Π_{k=1}^{d} ( x_k − x_0 ) ) D( x_1, . . . , x_d ) .   (7.6)

We use the easily checked formula

  x^k − y^k = ( x − y )( x^{k−1} + x^{k−2} y + ··· + y^{k−1} ) .   (7.7)

To compute the determinant of V in (7.4), we use Gauss elimination to set all but the top entry of the first column of V to zero. This means that we replace row j by row j minus row 1. Next we find common factors in the columns. Finally we perform column operations to put the d × d matrix back into the form of a Vandermonde matrix for x_1, . . . , x_d, which will prove (7.6).
Rather than giving the argument in general, we give it for d = 2 and d = 3. The general case will be clear from this. For d = 2 we have

  det [ 1  x_0  x_0^2
        1  x_1  x_1^2
        1  x_2  x_2^2 ]
  = det [ 1  x_0        x_0^2
          0  x_1 − x_0  x_1^2 − x_0^2
          0  x_2 − x_0  x_2^2 − x_0^2 ]
  = det [ x_1 − x_0  x_1^2 − x_0^2
          x_2 − x_0  x_2^2 − x_0^2 ] .
The formula (7.7) with k = 2 gives x_1^2 − x_0^2 = (x_1 − x_0)(x_1 + x_0), so (x_1 − x_0) is a common factor in the top row. Similarly, (x_2 − x_0) is a common factor of the bottom row. Thus:
  det [ x_1 − x_0  x_1^2 − x_0^2
        x_2 − x_0  x_2^2 − x_0^2 ]
  = det [ x_1 − x_0  (x_1 − x_0)(x_1 + x_0)
          x_2 − x_0  x_2^2 − x_0^2 ]
  = (x_1 − x_0) det [ 1          x_1 + x_0
                      x_2 − x_0  x_2^2 − x_0^2 ]
  = (x_1 − x_0)(x_2 − x_0) det [ 1  x_1 + x_0
                                 1  x_2 + x_0 ] .
7.1. POLYNOMIAL INTERPOLATION 147
The final step is to subtract x_0 times the first column from the second column, which does not change the determinant:

  det [ 1  x_1 + x_0
        1  x_2 + x_0 ]
  = det [ 1  x_1 + x_0 − x_0
          1  x_2 + x_0 − x_0 ]
  = det [ 1  x_1
          1  x_2 ]
  = D( x_1, x_2 ) .

This proves (7.6) for d = 2.
For d = 3 there is one more step. If we subtract row 1 from row k for k > 1 and do the factoring using (7.7) for k = 2 and k = 3, we get

  det [ 1  x_0  x_0^2  x_0^3
        1  x_1  x_1^2  x_1^3
        1  x_2  x_2^2  x_2^3
        1  x_3  x_3^2  x_3^3 ]
  = (x_1 − x_0)(x_2 − x_0)(x_3 − x_0) det [ 1  x_1 + x_0  x_1^2 + x_1 x_0 + x_0^2
                                            1  x_2 + x_0  x_2^2 + x_2 x_0 + x_0^2
                                            1  x_3 + x_0  x_3^2 + x_3 x_0 + x_0^2 ] .
We complete the proof of (7.6) in this case by showing that

  det [ 1  x_1 + x_0  x_1^2 + x_1 x_0 + x_0^2
        1  x_2 + x_0  x_2^2 + x_2 x_0 + x_0^2
        1  x_3 + x_0  x_3^2 + x_3 x_0 + x_0^2 ]
  = det [ 1  x_1  x_1^2
          1  x_2  x_2^2
          1  x_3  x_3^2 ] .
For this we first subtract x_0 times the first column from the second column, then subtract x_0^2 times the first column from the third column, then subtract x_0 times the second column from the third column. This completes the proof of Theorem 2 and shows that one can find the coefficients of the interpolating polynomial by solving a linear system of equations involving the Vandermonde matrix.
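The product formula (7.5) is easy to check numerically; a minimal sketch with numpy (the particular nodes are an arbitrary illustrative choice):

```python
import numpy as np
from itertools import combinations

# Check (7.5): det of the Vandermonde matrix equals the product of
# (x_k - x_j) over all pairs j < k.
x = np.array([0.5, -1.0, 2.0, 3.5])
Vmat = np.vander(x, increasing=True)      # row i is (1, x_i, x_i^2, x_i^3)
prod = 1.0
for j, k in combinations(range(len(x)), 2):
    prod *= x[k] - x[j]
assert np.isclose(np.linalg.det(Vmat), prod)
```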
7.1.2 Newton interpolation formula
The Newton interpolation formula is a simple and insightful way to express the interpolating polynomial. It is based on repeated divided differences, done in a way to expose the leading terms of polynomials. These are combined with a specific basis for the vector space of polynomials of degree k so that in the end the interpolation property is obvious. In some sense, the Newton interpolation formula provides a formula for the inverse of the Vandermonde matrix.

We begin with the problem of estimating derivatives of f(x) using a number of function values. Given nearby points, x_1 and x_0, we have

  f' ≈ ( f(x_1) − f(x_0) ) / ( x_1 − x_0 ) .
We know that the divided difference is particularly close to the derivative at the center of the interval, so we write

  f'( ( x_1 + x_0 )/2 ) ≈ ( f(x_1) − f(x_0) ) / ( x_1 − x_0 ) ,   (7.8)

with an error that is O( |x_1 − x_0|^2 ). If we have three points that might not be uniformly spaced, the second derivative estimate using (7.8) could be

  f'' ≈ [ f'( (x_2 + x_1)/2 ) − f'( (x_1 + x_0)/2 ) ] / [ (x_2 + x_1)/2 − (x_1 + x_0)/2 ]
     ≈ [ ( f(x_2) − f(x_1) )/( x_2 − x_1 ) − ( f(x_1) − f(x_0) )/( x_1 − x_0 ) ] / [ (1/2)( x_2 − x_0 ) ] .   (7.9)

As we saw in Chapter 3, the approximation (7.9) is consistent (converges to the exact answer as x_2 → x, x_1 → x, and x_0 → x) if it is exact for quadratics, f(x) = ax^2 + bx + c. Some algebra shows that both sides of (7.9) are equal to 2a.
The formula (7.9) suggests the right way to do repeated divided differences. Suppose we have d + 1 points² x_0, . . . , x_d. We define f[x_k] = f(x_k) (exchange round parentheses for square brackets), and the first order divided difference is:

  f[ x_k, x_{k+1} ] = ( f[x_{k+1}] − f[x_k] ) / ( x_{k+1} − x_k ) .

More generally, the Newton divided difference of order k + 1 is a divided difference of divided differences of order k:

  f[ x_j, ··· , x_{k+1} ] = ( f[ x_{j+1}, ··· , x_{k+1} ] − f[ x_j, ··· , x_k ] ) / ( x_{k+1} − x_j ) .   (7.10)

The denominator in (7.10) is the difference between the extremal x values, as (7.9) suggests it should be. If instead of a function f(x) we just have values f_0, . . . , f_d, we define

  [ f_j, ··· , f_{k+1} ] = ( [ f_{j+1}, ··· , f_{k+1} ] − [ f_j, ··· , f_k ] ) / ( x_{k+1} − x_j ) .   (7.11)

It may be convenient to use the alternative notation

  D_k(f) = f[ x_0, ··· , x_k ] .

If r(x) = r_k x^k + ··· + r_0 is a polynomial of degree k, we will see that D_k r = r_k. We verified this already for k = 1 and k = 2.
² The x_k must be distinct but they need not be in order. Nevertheless, it helps the intuition to think that x_0 < x_1 < ··· < x_d.
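The recursion (7.10)/(7.11) translates directly into a short table-building routine; this is a sketch, not code from the text.

```python
def divided_differences(xs, fs):
    """Newton divided differences built with the recursion (7.10).

    Returns coef with coef[k] = f[x_0, ..., x_k], the coefficients used
    in the Newton interpolation formula.
    """
    n = len(xs)
    col = list(fs)                      # order-0 differences f[x_k]
    coef = [col[0]]
    for order in range(1, n):
        col = [(col[k + 1] - col[k]) / (xs[k + order] - xs[k])
               for k in range(n - order)]
        coef.append(col[0])
    return coef

# For r(x) = 3x^2 + x - 1, the order-2 difference D_2 r equals r_2 = 3
# for any distinct nodes:
xs = [0.0, 1.0, 4.0]
fs = [3 * x * x + x - 1 for x in xs]
assert divided_differences(xs, fs)[2] == 3.0
```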
The interpolation problem is to find a polynomial of degree d that satisfies the interpolation conditions (7.2). The formula (7.1) expresses the interpolating polynomial as a linear combination of pure monomials x^k. Using the monomials as a basis for the vector space of polynomials of degree d leads to the Vandermonde matrix (7.4). Here we use a different basis, which might be called the Newton monomials of degree k (although they strictly speaking are not monomials): q_0(x) = 1, q_1(x) = x − x_0, q_2(x) = (x − x_1)(x − x_0), and generally,

  q_k(x) = ( x − x_{k−1} ) ··· ( x − x_0 ) .   (7.12)

It is easy to see that q_k(x) is a polynomial of degree k in x with leading coefficient equal to one:

  q_k(x) = x^k + a_{k−1} x^{k−1} + ··· .

Since this also holds for q_{k−1}, we may subtract to get:

  q_k(x) − a_{k−1} q_{k−1}(x) = x^k + b_{k−2} x^{k−2} + ··· .

Continuing in this way, we express x^k in terms of Newton monomials:

  x^k = q_k(x) − a_{k,k−1} q_{k−1}(x) − b_{k,k−2} q_{k−2}(x) − ··· .   (7.13)

This shows that the q_k(x) are linearly independent and span the same space as the monomial basis.

The connection between repeated divided differences (7.10) and Newton monomials (7.12) is

  D_k q_j = δ_{kj} .   (7.14)

The intuition is that D_k f plays the role of (1/k!) ∂_x^k f(0) and q_j(x) plays the role of x^j. For k > j, ∂_x^k x^j = 0 because differentiation lowers the order of a monomial. For k < j, ∂_x^k x^j = 0 when evaluated at x = 0 because monomials vanish when x = 0. The remaining case is the interesting one, (1/k!) ∂_x^k x^k = 1.
We verify (7.14) by induction on k. We suppose that (7.14) holds for all k < d and all j, and use that to prove it for k = d and all j, treating the cases j = d, j < d, and j > d separately. The base case k = 1 explains the ideas. For j = 1 we have

  D_1 q_1(x) = ( q_1(x_1) − q_1(x_0) ) / ( x_1 − x_0 ) = ( ( x_1 − x_0 ) − ( x_0 − x_0 ) ) / ( x_1 − x_0 ) = 1 ,   (7.15)

as claimed. More generally, any first order divided difference of q_1 is equal to one,

  q_1[ x_{k+1}, x_k ] = ( q_1(x_{k+1}) − q_1(x_k) ) / ( x_{k+1} − x_k ) = 1 ,

which implies that higher order divided differences of q_1 are zero. For example,

  q_1[ x_2, x_3, x_4 ] = ( q_1[ x_3, x_4 ] − q_1[ x_2, x_3 ] ) / ( x_4 − x_2 ) = ( 1 − 1 ) / ( x_4 − x_2 ) = 0 .
This proves the base case, k = 1 and all j.
The induction step has the same three cases. For j > d it is clear that D_d q_j = q_j[ x_0, . . . , x_d ] = 0, because q_j(x_k) = 0 for all the x_k that are used in q_j[ x_0, . . . , x_d ]. The interesting case is showing that q_k[ x_0, . . . , x_k ] = 1. From (7.10) we have that

  q_k[ x_0, . . . , x_k ] = ( q_k[ x_1, . . . , x_k ] − q_k[ x_0, . . . , x_{k−1} ] ) / ( x_k − x_0 ) = q_k[ x_1, . . . , x_k ] / ( x_k − x_0 ) ,

because q_k[ x_0, . . . , x_{k−1} ] = 0 (it involves all zeros). The same reasoning gives q_k[ x_1, . . . , x_{k−1} ] = 0 and

  q_k[ x_1, . . . , x_k ] = ( q_k[ x_2, . . . , x_k ] − q_k[ x_1, . . . , x_{k−1} ] ) / ( x_k − x_1 ) = q_k[ x_2, . . . , x_k ] / ( x_k − x_1 ) .

Combining these gives

  q_k[ x_0, . . . , x_k ] = q_k[ x_1, . . . , x_k ] / ( x_k − x_0 ) = q_k[ x_2, . . . , x_k ] / ( ( x_k − x_0 )( x_k − x_1 ) ) ,

and eventually, using the definition (7.12),

  q_k[ x_0, . . . , x_k ] = q_k(x_k) / ( ( x_k − x_0 ) ··· ( x_k − x_{k−1} ) ) = ( ( x_k − x_0 ) ··· ( x_k − x_{k−1} ) ) / ( ( x_k − x_0 ) ··· ( x_k − x_{k−1} ) ) = 1 ,

as claimed. Now, using (7.13) we see that

  D_k x^k = D_k( q_k − a_{k,k−1} q_{k−1} − ··· ) = D_k q_k = 1 ,

for any collection of distinct points x_j. This, in turn, implies that

  q_k[ x_{m+k}, . . . , x_m ] = 1 ,

which, as in (7.15), implies that D_{k+1} q_k = 0. This completes the induction step.
The formula (7.14) allows us to verify the Newton interpolation formula, which states that

  p(x) = Σ_{k=0}^{d} [ f_0, . . . , f_k ] q_k(x) ,   (7.16)

satisfies the interpolation conditions (7.2). We see that p(x_0) = f_0 because each term on the right except k = 0 vanishes when x = x_0. The formula (7.14) also implies that D_1 p = D_1 f. This involves the values p(x_1) and p(x_0). Since we already know p(x_0) is correct, this implies that p(x_1) also is correct. Continuing in this way verifies all the interpolation conditions.
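The Newton form (7.16) is also convenient computationally, because the coefficients and the Newton monomials can both be built incrementally; a minimal self-contained sketch (not from the text):

```python
def newton_interpolate(xs, fs):
    """Return the interpolating polynomial in Newton form (7.16)."""
    n = len(xs)
    col = list(fs)                      # divided difference table, by column
    coef = [col[0]]                     # coef[k] = [f_0, ..., f_k]
    for order in range(1, n):
        col = [(col[k + 1] - col[k]) / (xs[k + order] - xs[k])
               for k in range(n - order)]
        coef.append(col[0])

    def p(x):
        val, q = 0.0, 1.0
        for k in range(n):
            val += coef[k] * q          # coef[k] * q_k(x)
            q *= x - xs[k]              # build q_{k+1}(x) from q_k(x)
        return val
    return p

xs = [0.0, 1.0, 2.0, 3.0]
fs = [1.0, 2.0, 5.0, 10.0]              # samples of f(x) = x^2 + 1
p = newton_interpolate(xs, fs)
assert all(abs(p(xk) - fk) < 1e-12 for xk, fk in zip(xs, fs))
```

Because the data come from a quadratic, the top divided difference vanishes and p reproduces x^2 + 1 everywhere, not just at the nodes.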
7.1.3 Lagrange interpolation formula
The Lagrange approach to polynomial interpolation is simpler mathematically but less useful than the others. For each k, define the polynomial³ of degree d

  l_k(x) = Π_{j≠k} ( x − x_j ) / Π_{j≠k} ( x_k − x_j ) .   (7.17)

For example, for d = 2, x_0 = 0, x_1 = 2, x_2 = 3 we have

  l_0(x) = ( (x − x_1)(x − x_2) ) / ( (x_0 − x_1)(x_0 − x_2) ) = ( (x − 2)(x − 3) ) / ( (−2)(−3) ) = (1/6)( x^2 − 5x + 6 ) ,

  l_1(x) = ( (x − x_0)(x − x_2) ) / ( (x_1 − x_0)(x_1 − x_2) ) = ( (x − 0)(x − 3) ) / ( (2)(−1) ) = −(1/2)( x^2 − 3x ) ,

  l_2(x) = ( (x − x_0)(x − x_1) ) / ( (x_2 − x_0)(x_2 − x_1) ) = ( (x − 0)(x − 2) ) / ( (3)(1) ) = (1/3)( x^2 − 2x ) .

At x = x_k the numerator and denominator in (7.17) are equal, so l_k(x_k) = 1. If j ≠ k, then l_k(x_j) = 0, because (x − x_j) is one of the factors in the numerator and it vanishes at x = x_j. Therefore

  l_k(x_j) = δ_{jk} = { 1 if j = k ; 0 if j ≠ k } .   (7.18)

The Lagrange interpolation formula is

  p(x) = Σ_{k=0}^{d} f_k l_k(x) .   (7.19)

The right side is a polynomial of degree at most d. It satisfies the interpolation conditions (7.2) because of (7.18).
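A direct evaluation of the Lagrange form is only a few lines; the function below is a sketch, and it is checked on the d = 2 nodes 0, 2, 3 used in the example above.

```python
def lagrange_interpolate(xs, fs, x):
    """Evaluate the Lagrange form (7.19) at x: p(x) = sum_k f_k l_k(x)."""
    total = 0.0
    for k, (xk, fk) in enumerate(zip(xs, fs)):
        lk = 1.0
        for j, xj in enumerate(xs):
            if j != k:
                lk *= (x - xj) / (xk - xj)   # one factor of l_k(x) in (7.17)
        total += fk * lk
    return total

# Nodes from the text's example; data sampled from f(x) = x^2, which the
# degree-2 interpolant must reproduce exactly.
xs = [0.0, 2.0, 3.0]
fs = [xk ** 2 for xk in xs]
assert abs(lagrange_interpolate(xs, fs, 1.0) - 1.0) < 1e-12
```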
7.2 Discrete Fourier transform
The Fourier transform is one of the most powerful methods of applied mathematics. Its finite dimensional analogue, the discrete Fourier transform, or DFT, is just as useful in scientific computing. The DFT allows direct algebraic solution of certain differential and integral equations. It is the basis of computations with time series data and for digital signal processing and control. It leads to computational methods that have an infinite order of accuracy (which is not the same as being exact).

The drawback of DFT based methods is their geometric inflexibility. They can be hard to apply to data that are not sampled at uniformly spaced points. The multidimensional DFT is essentially a product of one dimensional DFTs. Therefore it is hard to apply DFT methods to problems in more than one dimension unless the computational domain has a simple shape. Even in one dimension, applying the DFT depends on boundary conditions.

³ We do not call these Lagrange polynomials because that term means something else.
7.2.1 Fourier modes
The simplest Fourier analysis is for periodic functions of one variable. We say f(x) is periodic with period p if f(x + p) = f(x) for all x. If ω is an integer, then the Fourier mode

  w_ω(x) = e^{2πiωx/p}   (7.20)

is such a periodic function. Fourier analysis starts with the fact that Fourier modes form a basis for the vector space of periodic functions. That means that if f has period p, then there are Fourier coefficients, f̂_ω, so that

  f(x) = Σ_{ω=−∞}^{∞} f̂_ω e^{2πiωx/p} .   (7.21)

More precisely, let V_p be the vector space of complex valued periodic functions so that

  ‖f‖²_{L²} = ∫_0^p |f(x)|² dx < ∞ .
This vector space has an inner product that has the properties discussed in Section 4.2.2. The definition is

  ⟨f, g⟩ = ∫_0^p f(x)* g(x) dx ,   (7.22)

where the star denotes complex conjugation. Clearly, ‖f‖²_{L²} = ⟨f, f⟩ > 0 unless f ≡ 0. The inner product is linear in the g variable:

  ⟨f, a g_1 + b g_2⟩ = a ⟨f, g_1⟩ + b ⟨f, g_2⟩ ,

and antilinear in the f variable:

  ⟨a f_1 + b f_2, g⟩ = a* ⟨f_1, g⟩ + b* ⟨f_2, g⟩ .

Functions f and g are orthogonal if ⟨f, g⟩ = 0. A set of functions is orthogonal if any two of them are orthogonal and none of them is zero. The Fourier modes (7.20) have this property: if ω ≠ ν are two integers, then ⟨w_ω, w_ν⟩ = 0 (check this).

Any orthogonal system of functions is linearly independent. Linear independence means that if

  f(x) = Σ_ω f̂_ω w_ω(x) = Σ_ω f̂_ω e^{2πiωx/p} ,   (7.23)

then the f̂_ω are uniquely determined. For orthogonal w_ω, taking the inner product of (7.23) with w_ν gives

  ⟨w_ν, f⟩ = Σ_ω f̂_ω ⟨w_ν, w_ω⟩ = f̂_ν ⟨w_ν, w_ν⟩ ,

so

  f̂_ν = ⟨w_ν, f⟩ / ‖w_ν‖² .   (7.24)
This formula shows that the coefficients are uniquely determined. Written more explicitly for the Fourier modes, (7.24) is

  f̂_ω = (1/p) ∫_0^p e^{−2πiωx/p} f(x) dx .   (7.25)

The Fourier modes are also convenient for differentiation. Differentiating (7.23) term by term gives

  f'(x) = Σ_ω f̂_ω (2πiω/p) e^{2πiωx/p} .

This shows that the Fourier coefficient of f' is

  f̂'_ω = (2πiω/p) f̂_ω .   (7.26)

Formulas like these allow us to express the solutions to certain ordinary and partial differential equations in terms of Fourier series. See Exercise 4 for one example.

We will see that the differentiation formula (7.26) also contains important information about the Fourier coefficients of smooth functions: they are very small for large ω. This implies that approximations

  f(x) ≈ Σ_{|ω| ≤ R} f̂_ω w_ω(x)

are very accurate if f is smooth. It also implies that the DFT coefficients (see Section 7.2.2) are very accurate approximations of the f̂_ω. Solving (7.26) for f̂_ω gives

  f̂_ω = ( p/(2πiω) ) f̂'_ω .   (7.27)

The integral formula

  f̂'_ω = (1/p) ∫_0^p w_ω(x)* f'(x) dx
shows that the Fourier coefficients f̂'_ω are bounded if f' is integrable, so (7.27) gives

  | f̂_ω | ≤ C ( 1/|ω| ) .

We can go further by applying (7.27) to f' and f'' to get

  f̂_ω = −( p^2/(4π^2 ω^2) ) f̂''_ω ,

so that if f'' is bounded, then

  | f̂_ω | ≤ C ( 1/|ω|^2 ) ,

which is faster decay (1/ω^2 ≪ 1/ω for large ω). Continuing in this way, we can see that if f has N bounded derivatives then

  | f̂_ω | ≤ C_N ( 1/|ω|^N ) .   (7.28)

This shows, as we said, that the Fourier coefficients of smooth functions decay rapidly.
It is helpful in real applications to use real Fourier modes, particularly when f(x) is real. The real modes are sines and cosines:

  u_ω(x) = cos( 2πωx/p ) ,   v_ω(x) = sin( 2πωx/p ) ,   (7.29)

for ω = 0, 1, . . . for u_ω, and ω = 1, 2, . . . for v_ω. The corresponding real Fourier series is

  f(x) = Σ_{ω=0}^{∞} a_ω cos( 2πωx/p ) + Σ_{ω=1}^{∞} b_ω sin( 2πωx/p ) .   (7.30)

The reasoning that led to (7.25), together with the normalization formulas (7.41) in the exercises below, gives

  a_ω = (2/p) ∫_0^p cos( 2πωx/p ) f(x) dx   (ω ≥ 1),
  b_ω = (2/p) ∫_0^p sin( 2πωx/p ) f(x) dx   (ω ≥ 1),
  a_0 = (1/p) ∫_0^p f(x) dx .   (7.31)
7.2. DISCRETE FOURIER TRANSFORM 155
This real Fourier series (7.30) is basically the same as the complex Fourier
series (7.23) when f(x) is real. If f is real, then (7.25) shows that
4
f
=
f
.
This determines the < 0 Fourier coecients from the 0 coecients. If
> 0 and
f
= g
+ ih
(g
and h
e
2ix/p
+
f
e
2ix/p
= (g
+ih
)
_
cos() +i sin()) + (g
ih
)
_
cos() i sin()
_
= 2g
cos() 2h
sin() .
Moreover, when f is real,
g
=
1
p
Re
__
p
0
e
2ix/p
f(x)dx
_
=
1
p
_
p
0
cos(2ix/p)f(x)dx
=
1
2
a
.
Similarly, h
=
1
2
b
(x
j
) = exp(2ix
j
/p) .
Since x
j
= jp/n, this gives
w
,j
= exp(2ij/n) = w
j
, (7.32)
where w is a primitive root of unity
6
w = e
2i/n
. (7.33)
⁴ This also shows that the full Fourier series sum over positive and negative ω is somewhat redundant for real f. This is another motivation for using the version with cosines and sines.

Modes that differ by a multiple of n are identical on the grid: if ν = ω + n, then w_ν(x_j) = w_ω(x_j) for all j, even though w_ν and w_ω are different functions of x. This phenomenon is called aliasing. In particular, the vectors w_ω ∈ C^n with components given by (7.32) are not all different; the distinct ones are those with 0 ≤ ω < n. For vectors f = ( f_0, . . . , f_{n−1} )^T and g in C^n we use the inner product

  ⟨f, g⟩ = Σ_{j=0}^{n−1} f_j* g_j .

This is the usual inner product on C^n and leads to the usual l² norm ⟨f, f⟩ = ‖f‖²_{l²}. We show that the discrete modes form an orthogonal family, but only as far as is allowed by aliasing. That is, if 0 ≤ ω < n and 0 ≤ ν < n, and ω ≠ ν, then ⟨w_ω, w_ν⟩ = 0.
Recall that for any complex number, z, S(z) = Σ_{j=0}^{n−1} z^j has S = n if z = 1 and S = ( z^n − 1 )/( z − 1 ) if z ≠ 1. Also,

  ( w_{ω,j} )* = ( e^{2πiωj/n} )* = e^{−2πiωj/n} = w^{−ωj} ,

so we can calculate

  ⟨w_ω, w_ν⟩ = Σ_{j=0}^{n−1} ( w^{ωj} )* w^{νj} = Σ_{j=0}^{n−1} w^{(ν−ω)j} = Σ_{j=0}^{n−1} ( w^{ν−ω} )^j .

Under our assumptions (ν ≠ ω but 0 ≤ ω < n, 0 ≤ ν < n) we have 0 < |ν − ω| < n, so z = w^{ν−ω} ≠ 1, and

  ⟨w_ω, w_ν⟩ = ( w^{n(ν−ω)} − 1 ) / ( w^{ν−ω} − 1 ) .

Also,

  w^{n(ν−ω)} = ( w^n )^{ν−ω} = 1^{ν−ω} = 1 ,

because w is an nth root of unity. This shows ⟨w_ω, w_ν⟩ = 0. We also can calculate that ‖w_ω‖² = Σ_{j=0}^{n−1} | w_{ω,j} |² = n.
Since the n vectors w_ω, 0 ≤ ω < n, are orthogonal, they form a basis of C^n, and any f ∈ C^n may be expanded in this basis:

  f = Σ_{ω=0}^{n−1} f̂_ω w_ω .

By the arguments we gave for Fourier series above, the DFT coefficients are

  f̂_ω = (1/n) ⟨w_ω, f⟩ .

Expressed explicitly in terms of sums, these relations are

  f̂_ω = (1/n) Σ_{j=0}^{n−1} w^{−ωj} f_j ,   (7.34)

and

  f_j = Σ_{ω=0}^{n−1} f̂_ω w^{ωj} .   (7.35)
These are the (forward) DFT and inverse DFT respectively. Either formula implies the other. If we start with the f_j and calculate the f̂_ω from (7.34), then (7.35) gives back the f_j, and conversely. There are several conventions for where to put the factor 1/n. Another common definition of the DFT is

  f̂_ω = Σ_{j=0}^{n−1} w^{−ωj} f_j .

This particular version changes (7.35) to

  f_j = (1/n) Σ_{ω=0}^{n−1} f̂_ω w^{ωj} .

Still another way to express these relations is to use the DFT matrix, W, which is a unitary matrix whose (ω, j) entry is

  w_{ω,j} = (1/√n) w^{−ωj} .

The adjoint of W has entries

  w*_{j,ω} = ( w_{ω,j} )* = (1/√n) w^{ωj} .

The DFT relations are equivalent to f̂ = Wf and f = W*f̂, with

  f̂_ω = (1/√n) Σ_{j=0}^{n−1} w^{−ωj} f_j .

It has the advantage of making the direct and inverse DFT as similar as possible, with a factor 1/√n in both.
As for continuous Fourier series, there is a real discrete cosine and sine transform for real $f$. The complex DFT coefficients $\hat{f}_\alpha$ determine the real cosine and sine coefficients as before. Aliasing is more complicated for the discrete sine and cosine transform, and depends on whether $n$ is even or odd. The cosine and sine sums corresponding to (7.30) run from $\alpha = 0$ or $\alpha = 1$ roughly to $\alpha = n/2$.
We can estimate the Fourier coefficients of a continuous function by taking the DFT of its sampled values. We call the vector of samples $f^{(n)}$. It has components $f^{(n)}_j = f(x_j)$, where $x_j = j\Delta x$ and $\Delta x = p/n$. The DFT coefficients of $f^{(n)}$ are
\[ \hat{f}^{(n)}_\alpha = \frac{1}{n} \sum_{j=0}^{n-1} w^{-\alpha j} f^{(n)}_j \qquad (7.36) \]
\[ \phantom{\hat{f}^{(n)}_\alpha} = \frac{1}{p}\, \Delta x \sum_{j=0}^{n-1} e^{-2\pi i \alpha x_j / p} f(x_j) . \]
There is a simple aliasing formula for the Fourier coefficients of $f^{(n)}$ in terms of those of $f$. It depends on aliasing when we put the continuous Fourier representation (7.25) into the discrete Fourier coefficient formula (7.36). If we rewrite (7.25) with $\beta$ instead of $\alpha$ as the summation index, substitute into (7.36), and change the order of summation, we find
\[ \hat{f}^{(n)}_\alpha = \sum_\beta \hat{f}_\beta\, \frac{\Delta x}{p} \sum_{j=0}^{n-1} \exp\left[ 2\pi i (\beta - \alpha) x_j / p \right] . \]
We have shown that the inner sum is equal to zero unless mode $\beta$ aliases to mode $\alpha$ on the grid, which means $\beta = \alpha + kn$ for some integer $k$. We also showed that if mode $\beta$ does alias to mode $\alpha$ on the grid, then each of the summands is equal to one. Since $\Delta x / p = 1/n$, this implies that
\[ \hat{f}^{(n)}_\alpha = \sum_{k=-\infty}^{\infty} \hat{f}_{\alpha + kn} . \qquad (7.37) \]
This shows that the discrete Fourier coefficient is equal to the continuous Fourier coefficient (the $k = 0$ term on the right) plus all the coefficients that alias to it (the terms $k \ne 0$).
You might worry that the continuous function has Fourier coefficients $\hat{f}_\alpha$ with both positive and negative $\alpha$ while the DFT computes coefficients for $\alpha = 0, 1, \ldots, n-1$. The answer to this is aliasing. We find approximations to the negative $\alpha$ Fourier coefficients using $\hat{f}^{(n)}_{-\alpha} = \hat{f}^{(n)}_{n-\alpha}$. It may be more helpful to think of the DFT coefficients as being defined for $\alpha$ roughly in the range $-\frac{n}{2}$ to $\frac{n}{2}$ (the exact range depending on whether $n$ is even or odd) rather than from $\alpha = 0$ to $\alpha = n-1$. The aliasing formula shows that if $f$ is smooth, then $\hat{f}^{(n)}_\alpha$ is a very good approximation to $\hat{f}_\alpha$.
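The aliasing identity is easy to see numerically: a pure mode $\beta = \alpha + n$ has exactly the same samples, and hence the same DFT coefficients, as mode $\alpha$. A small sketch (plain Python; the names are ours):

```python
import cmath

n = 8

def samples(beta):
    # f(x) = exp(2 pi i beta x / p) sampled at x_j = j p / n, so x_j/p = j/n
    return [cmath.exp(2j * cmath.pi * beta * j / n) for j in range(n)]

def dft(f):
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(w ** (-a * j) * f[j] for j in range(n)) / n for a in range(n)]

# mode beta = 3 and its alias beta + n = 11 have identical samples,
# hence identical DFT coefficients
c3, c11 = dft(samples(3)), dft(samples(11))
assert all(abs(c3[a] - c11[a]) < 1e-12 for a in range(n))
assert abs(c3[3] - 1.0) < 1e-12   # all energy lands in coefficient alpha = 3
```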
7.2.3 FFT algorithm
It takes $n^2$ multiplies to carry out all the sums in (7.34) directly ($n$ terms in each of $n$ sums). The Fast Fourier Transform, or FFT, is an algorithm that calculates the $n$ components of $\hat{f}$ from the $n$ components of $f$ using $O(n \log(n))$ operations, which is much less for large $n$.
The idea behind FFT algorithms is clearest when $n = 2m$. A single DFT of size $n = 2m$ is reduced to two DFT calculations of size $m = n/2$ followed by $O(n)$ work. If $m = 2r$, this process can be continued to reduce the two size $m$ DFT operations to four size $r$ DFT operations followed by $2 \cdot O(m) = O(n)$ operations. If $n = 2^p$, this process can be continued $p$ times, where we arrive at $2^p = n$ trivial DFT calculations of size one each. The total work for the $p$ levels is $p \cdot O(n) = O(n \log_2(n))$.
There is a variety of related methods to handle cases where $n$ is not a power of 2, and the $O(n \log(n))$ work count holds for all $n$. The algorithm is simpler and faster for $n = 2^p$. For example, an FFT with $n = 2^{20} = 1{,}048{,}576$ should be significantly faster than with $n = 1{,}048{,}573$, which is a prime number.
Let $W_{n\times n}$ be the complex $n \times n$ matrix$^7$ whose $(\alpha, j)$ entry is $w_{\alpha,j} = w^{-\alpha j}$, where $w$ is a primitive $n^{th}$ root of unity. The DFT (7.34), but for the factor of $\frac{1}{n}$, is the matrix product $\hat{f} = W_{n\times n} f$. If $n = 2m$, then $w^2$ is a primitive $m^{th}$ root of unity and an $m \times m$ DFT involves the matrix product $\hat{g} = W_{m\times m} g$, where the $(\alpha, k)$ entry of $W_{m\times m}$ is $(w^2)^{-\alpha k} = w^{-2\alpha k}$. The reduction splits $f \in C^n$ into $g \in C^m$ and $h \in C^m$, then computes $\hat{g} = W_{m\times m} g$ and $\hat{h} = W_{m\times m} h$, then combines $\hat{g}$ and $\hat{h}$ to form $\hat{f}$.
The elements of $\hat{f}$ are given by the sums
\[ \hat{f}_\alpha = \sum_{j=0}^{n-1} w^{-\alpha j} f_j . \]
We split these into even and odd parts, with $j = 2k$ and $j = 2k+1$ respectively. Both of these have $k$ ranging from 0 to $\frac{n}{2} - 1 = m - 1$. For these sums, $-\alpha j = -\alpha(2k) = -2\alpha k$ (even), and $-\alpha j = -\alpha(2k+1) = -2\alpha k - \alpha$ (odd) respectively. Thus
\[ \hat{f}_\alpha = \sum_{k=0}^{m-1} w^{-2\alpha k} f_{2k} + w^{-\alpha} \sum_{k=0}^{m-1} w^{-2\alpha k} f_{2k+1} . \qquad (7.38) \]
Now define $g \in C^m$ and $h \in C^m$ to have the even and odd components of $f$ respectively:
\[ g_k = f_{2k} , \qquad h_k = f_{2k+1} . \]
The $m \times m$ operations $\hat{g} = W_{m\times m} g$ and $\hat{h} = W_{m\times m} h$, written out, are
\[ \hat{g}_\alpha = \sum_{k=0}^{m-1} \left( w^2 \right)^{-\alpha k} g_k , \]
and
\[ \hat{h}_\alpha = \sum_{k=0}^{m-1} \left( w^2 \right)^{-\alpha k} h_k . \]
Then (7.38) may be written
\[ \hat{f}_\alpha = \hat{g}_\alpha + w^{-\alpha} \hat{h}_\alpha . \qquad (7.39) \]
$^7$This definition of $W$ differs from that of Section 7.2 by a factor of $\sqrt{n}$.
This is the last step, which reassembles $\hat{f}$ from $\hat{g}$ and $\hat{h}$. We must apply (7.39) for $n$ values of $\alpha$ ranging from $\alpha = 0$ to $\alpha = n-1$. The computed $\hat{g}$ and $\hat{h}$ have period $m$ ($\hat{g}_{\alpha+m} = \hat{g}_\alpha$), but the factor $w^{-\alpha}$ in front of $\hat{h}$ makes $\hat{f}$ have period $n = 2m$ instead.
To summarize, an order $n$ FFT requires first order $n$ copying to form $g$ and $h$, then two order $n/2$ FFT operations, then order $n$ copying, adding, and multiplying. Of course, the order $n/2$ FFT operations themselves boil down to copying and simple arithmetic. As explained in Section 5.6, the copying and memory accessing can take more computer time than the arithmetic. High performance FFT software needs to be chip specific to take full advantage of cache.
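The reduction (7.38)-(7.39) is easy to code recursively when $n$ is a power of 2. The sketch below (plain Python; the function names are ours) computes the unnormalized sums $\hat{f}_\alpha = \sum_j w^{-\alpha j} f_j$ and checks the result against the direct $O(n^2)$ sum. A production FFT would, as noted above, be iterative and cache aware rather than recursive:

```python
import cmath

def fft(f):
    """Radix-2 FFT computing F_a = sum_j w^{-aj} f_j (no 1/n factor).
    Assumes len(f) is a power of 2."""
    n = len(f)
    if n == 1:
        return list(f)
    G = fft(f[0::2])          # DFT of even components g_k = f_{2k}
    H = fft(f[1::2])          # DFT of odd components h_k = f_{2k+1}
    m = n // 2
    F = [0j] * n
    for a in range(m):
        t = cmath.exp(-2j * cmath.pi * a / n) * H[a]   # w^{-a} * H_a
        F[a] = G[a] + t          # (7.39), using the period-m G and H
        F[a + m] = G[a] - t      # because w^{-(a+m)} = -w^{-a}
    return F

def dft_no_norm(f):
    n = len(f)
    return [sum(cmath.exp(-2j * cmath.pi * a * j / n) * f[j]
                for j in range(n)) for a in range(n)]

f = [0.5, 1.0, -2.0, 3.5, 0.0, 1.5, -1.0, 2.0]
A, B = fft(f), dft_no_norm(f)
assert all(abs(A[a] - B[a]) < 1e-9 for a in range(8))
```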
7.2.4 Trigonometric interpolation
Interpolating $f(x)$ at points $x_j$, $j = 1, \ldots, n$, means finding another function $F(x)$ so that $F(x_j) = f(x_j)$ for $j = 1, \ldots, n$. In polynomial interpolation, $F(x)$ is a polynomial of degree $n-1$. In trigonometric interpolation, $F(x)$ is a trigonometric polynomial with $n$ terms. Because we are interested in values of $F(x)$ off the grid, we should not take advantage of aliasing to simplify notation. Instead, we let $Z_n$ be a set of integers as symmetric about $\alpha = 0$ as possible. This depends on whether $n$ is even or odd:
\[ Z_n = \begin{cases} \{ -m, -m+1, \ldots, m \} & \text{if } n = 2m+1 \text{ (i.e. } n \text{ is odd)} \\ \{ -m+1, \ldots, m \} & \text{if } n = 2m \text{ (i.e. } n \text{ is even)} \end{cases} \qquad (7.40) \]
With this notation, an $n$ term trigonometric polynomial may be written
\[ F(x) = \sum_{\alpha \in Z_n} c_\alpha e^{2\pi i \alpha x / p} . \]
The DFT provides coefficients $c_\alpha$ so that $F$ interpolates the samples of $f$:
\[ F^{(n)}(x) = \sum_{\alpha \in Z_n} \hat{f}^{(n)}_\alpha e^{2\pi i \alpha x / p} . \]
Note that for large $m$,
\[ \sum_{\alpha = m}^{\infty} \frac{1}{\alpha^N} \approx \int_m^{\infty} \frac{1}{\alpha^N}\, d\alpha = \frac{1}{N-1} \cdot \frac{1}{m^{N-1}} . \]
The rapid decay inequality (7.28) gives a simple error bound:
\[ \left| f(x) - F^{(n)}(x) \right| \le \sum_{\alpha \notin Z_n} \left| \hat{f}_\alpha \right| + C \sum_{|\alpha| \ge n/2} \frac{1}{|\alpha|^N} \le C \frac{1}{n^{N-1}} . \]
Thus, the smoother f is, the more accurate is the partial Fourier sum approxi-
mation.
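This can be seen in a small experiment. The sketch below (plain Python; the test function and names are ours) forms the trigonometric interpolant $F^{(n)}$ of a smooth periodic function from its DFT coefficients, indexed over $Z_n$ as in (7.40), and checks that it matches $f$ exactly on the grid and closely off the grid:

```python
import cmath, math

n, p = 8, 1.0
m = n // 2
Zn = list(range(-m + 1, m + 1))      # (7.40) for even n = 2m

def f(x):
    return math.exp(math.cos(2 * math.pi * x))   # smooth periodic test function

xj = [j * p / n for j in range(n)]
fhat = [sum(cmath.exp(-2j * cmath.pi * a * j / n) * f(xj[j])
            for j in range(n)) / n for a in range(n)]

def F(x):
    # fhat[a % n] supplies the negative-alpha coefficients via aliasing
    return sum(fhat[a % n] * cmath.exp(2j * cmath.pi * a * x / p)
               for a in Zn).real

# F interpolates f exactly on the grid ...
assert all(abs(F(x) - f(x)) < 1e-10 for x in xj)
# ... and is already accurate off the grid because f is smooth
assert abs(F(0.123) - f(0.123)) < 1e-2
```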
7.3 Software
Performance tools.
7.4 References and Resources
The classical books on numerical analysis (Dahlquist and Bjorck, Isaacson and
Keller, etc.) discuss the various facts and forms of polynomial interpolation.
There are many good books on Fourier analysis. One of my favorites is by Harry Dym and my colleague Henry McKean. I learned the aliasing formula (7.37) from a paper by Heinz Kreiss and Joseph Oliger.
7.5 Exercises
1. Verify that both sides of (7.9) are equal to $2a$ when $f(x) = ax^2 + bx + c$.
2. One way to estimate the derivative or integral from function values is to differentiate or integrate the interpolating polynomial.
3. Show that the real Fourier modes (7.29) form an orthogonal family. This means that
(a) $\langle u_\alpha , u_\beta \rangle = 0$ if $\alpha \ne \beta$.
(b) $\langle v_\alpha , v_\beta \rangle = 0$ if $\alpha \ne \beta$.
(c) $\langle u_\alpha , v_\beta \rangle = 0$ for all $\alpha$ and $\beta$.
One way to do this is to express the real modes in terms of the complex exponential modes, as in
\[ u_\alpha = \frac{1}{2} \left( w_\alpha + w_{-\alpha} \right) . \]
Therefore
\[ \langle u_\alpha , u_\beta \rangle = \frac{1}{4} \left( \langle w_\alpha , w_\beta \rangle + \langle w_\alpha , w_{-\beta} \rangle + \langle w_{-\alpha} , w_\beta \rangle + \langle w_{-\alpha} , w_{-\beta} \rangle \right) . \]
You can check that if $\alpha \ge 0$ and $\beta \ge 0$ and $\alpha \ne \beta$, then all four terms on the right side are zero because the $w_\alpha$ are orthogonal. A similar calculation gives the norms
\[ \| u_\alpha \|_{L^2}^2 = \langle u_\alpha , u_\alpha \rangle = \| v_\alpha \|_{L^2}^2 = \frac{p}{2} , \qquad (7.41) \]
if $\alpha \ge 1$, and $\| u_0 \|_{L^2}^2 = p$.
4. We wish to use Fourier series to solve the boundary value problem from Section 4.11.
(a) Show that the solution to (4.43) and (4.44) can be written as a Fourier sine series
\[ u(x) = \sum_{\alpha=1}^{n-1} c_\alpha \sin( \pi \alpha x ) . \qquad (7.42) \]
One way to do this is to define new functions also called $u$ and $f$, that extend the given ones in a special way to satisfy the boundary conditions. For $1 \le x \le 2$, we define $f(x) = -f(2 - x)$. This defines $f$ in the interval $[0, 2]$ in a way that makes it antisymmetric about $x = 1$. Next if $x \notin [0, 2]$ there is a $k$ so that $x - 2k \in [0, 2]$. Define $f(x) = f(x - 2k)$ in that case. This makes $f$ a periodic function with period $p = 2$ that is antisymmetric about any of the integer points $x = 0$, $x = 1$, etc. Draw a picture to make this two step extension clear. If we express this extended $f$ as a real cosine and sine series, the coefficients of all the cosine terms are zero (why?), so
\[ f(x) = \sum_{\alpha > 0} b_\alpha \sin( \pi \alpha x ) . \]
(b) Express the $c_\alpha$ of (7.42) in terms of the $b_\alpha$.
If $f$ does not depend on $x$, so that $\dot{x} = f(t)$ with $x(0) = 0$, the forward Euler method computes
\[ x_k = \Delta t \sum_{j=0}^{k-1} f(t_j) , \]
which is the rectangle rule approximation to the integral $\int_0^{t_k} f(t)\, dt$.
We see from this that $x_{k+1} = x_k + \Delta t\, f(t_k)$, which is the forward Euler method in this case. We know that the rectangle rule for integration is first order accurate. This is a hint that the forward Euler method is first order accurate more generally.
We can estimate the accuracy of the forward Euler method using an informal error propagation equation. The error, as well as the solution, evolves (or propagates) from one time step to the next. We write the value of the exact solution at time $t_k$ as $\tilde{x}_k = x(t_k)$. The error at time $t_k$ is $e_k = x_k - \tilde{x}_k$. The residual is the amount by which $\tilde{x}_k$ fails to satisfy the forward Euler equations$^3$ (8.5):
\[ \tilde{x}_{k+1} = \tilde{x}_k + \Delta t\, f(\tilde{x}_k) + \Delta t\, r_k , \qquad (8.6) \]
which can be rewritten as
\[ r_k = \frac{x(t_k + \Delta t) - x(t_k)}{\Delta t} - f(x(t_k)) . \qquad (8.7) \]
In Section 3.2 we showed that
\[ \frac{x(t_k + \Delta t) - x(t_k)}{\Delta t} = \dot{x}(t_k) + \frac{\Delta t}{2} \ddot{x}(t_k) + O(\Delta t^2) . \]
Together with $\dot{x} = f(x)$, this shows that
\[ r_k = \frac{\Delta t}{2} \ddot{x}(t_k) + O(\Delta t^2) , \qquad (8.8) \]
which shows that $r_k = O(\Delta t)$.
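The estimate (8.8) is easy to confirm numerically. For $\dot{x} = -x$ with exact solution $x(t) = e^{-t}$, the sketch below (plain Python; the names are ours) evaluates the residual (8.7) for two step sizes and checks that it scales like $\Delta t$:

```python
import math

def residual(t, dt):
    # r_k of (8.7) for f(x) = -x, whose exact solution is x(t) = exp(-t)
    x = math.exp(-t)
    x_next = math.exp(-(t + dt))
    return (x_next - x) / dt - (-x)

r1 = residual(1.0, 1e-2)
r2 = residual(1.0, 5e-3)
assert abs(r1 / r2 - 2.0) < 0.01   # r_k = O(dt): halving dt halves the residual
# and the leading term is (dt/2) * x''(t) = (dt/2) * exp(-t), as in (8.8)
assert abs(r1 - 0.5e-2 * math.exp(-1.0)) < 1e-5
```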
The error propagation equation, (8.10) below, estimates $e$ in terms of the residual. We can estimate $e_k = x_k - \tilde{x}_k = x_k - x(t_k)$ by comparing (8.5) to (8.6):
\[ e_{k+1} = e_k + \Delta t \left( f(x_k) - f(\tilde{x}_k) \right) - \Delta t\, r_k . \]
This resembles the forward Euler method applied to approximating some function $e(t)$. Being optimistic, we suppose that $x_k$ and $x(t_k)$ are close enough to use the approximation $f(x_k) - f(\tilde{x}_k) \approx f'(\tilde{x}_k)\, e_k$, and then
\[ e_{k+1} \approx e_k + \Delta t \left( f'(\tilde{x}_k)\, e_k - r_k \right) . \qquad (8.9) \]
If this were an equality, it would imply that the $e_k$ satisfy the forward Euler approximation to the differential equation
\[ \dot{e} = f'(x)\, e - r , \qquad (8.10) \]
where, by (8.8), $r \approx \frac{\Delta t}{2} \ddot{x}$. If we let $u$ satisfy $\dot{u} = f'(x)\, u - w$, where $w = \frac{1}{2} \ddot{x}$, then we solve (8.10) by setting $e = \Delta t\, u$. This shows that $e(t) \approx \Delta t\, u(t)$, which is what we want, with $C(t) = u(t)$.
Equating the leading order terms gives the unsurprising result
\[ g_0(y) = f(y) , \]
and leaves us with
\[ g_1(y(t)) + \frac{1}{2} \ddot{y}(t) = O(\Delta t) . \qquad (8.16) \]
We differentiate $\dot{y} = f(y) + O(\Delta t)$ and use the chain rule, giving
\[ \ddot{y} = \frac{d}{dt} \dot{y} = \frac{d}{dt} \left( f(y(t)) + O(\Delta t) \right) = f'(y) f(y) + O(\Delta t) . \]
Substituting this into (8.16) gives
\[ g_1(y) = -\frac{1}{2} f'(y) f(y) , \]
so the modified equation, with the first correction term, is
\[ \dot{y} = f(y) - \frac{\Delta t}{2} f'(y) f(y) . \qquad (8.17) \]
A simple example illustrates these points. The nondimensional harmonic oscillator equation is $\ddot{r} = -r$. The solution is $r(t) = a \sin(t) + b \cos(t)$, which oscillates but does not grow or decay. We write this in first order form as $\dot{x}_1 = x_2$, $\dot{x}_2 = -x_1$, or
\[ \frac{d}{dt} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_2 \\ -x_1 \end{pmatrix} . \qquad (8.18) \]
Therefore, $f(x) = \begin{pmatrix} x_2 \\ -x_1 \end{pmatrix}$, $f' = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$, and
\[ f' f = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} \begin{pmatrix} x_2 \\ -x_1 \end{pmatrix} = \begin{pmatrix} -x_1 \\ -x_2 \end{pmatrix} , \]
so (8.17) becomes
\[ \frac{d}{dt} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} y_2 \\ -y_1 \end{pmatrix} + \frac{\Delta t}{2} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \frac{\Delta t}{2} & 1 \\ -1 & \frac{\Delta t}{2} \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} . \]
We can solve this by finding eigenvalues and eigenvectors, but a simpler trick is to use a partial integrating factor and set $y(t) = e^{\frac{1}{2}\Delta t\, t} z(t)$, where
\[ \dot{z} = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} z . \]
Since $z_1(t) = a \sin(t) + b \cos(t)$, we have our approximate numerical solution $y_1(t) = e^{\frac{1}{2}\Delta t\, t} \left( a \sin(t) + b \cos(t) \right)$. Therefore
\[ \| e(t) \| \approx \left( e^{\frac{1}{2}\Delta t\, t} - 1 \right) . \qquad (8.19) \]
This modified equation analysis confirms that forward Euler is first order accurate. For small $\Delta t$, we write $e^{\frac{1}{2}\Delta t\, t} - 1 \approx \frac{1}{2} \Delta t\, t$ so the error is about $\frac{1}{2} \Delta t\, t \left( a \sin(t) + b \cos(t) \right)$. Moreover, it shows that the error grows with $t$. For each fixed $t$, the error satisfies $\| e(t) \| = O(\Delta t)$ but the implied constant $C(t)$ (in $\| e(t) \| \le C(t) \Delta t$) is a growing function of $t$, at least as large as $C(t) \approx \frac{t}{2}$.
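The prediction (8.19) can be checked directly. The sketch below (plain Python; the names are ours) runs forward Euler on the oscillator (8.18) starting on the unit circle; the computed amplitude growth matches the modified equation factor $e^{\Delta t\, t / 2}$ closely:

```python
import math

def euler_oscillator(dt, T):
    # forward Euler for x1' = x2, x2' = -x1, starting on the unit circle
    x1, x2 = 0.0, 1.0
    for _ in range(int(round(T / dt))):
        x1, x2 = x1 + dt * x2, x2 - dt * x1
    return math.hypot(x1, x2)     # amplitude; the exact amplitude stays 1

dt, T = 1e-3, 10.0
growth = euler_oscillator(dt, T)
predicted = math.exp(0.5 * dt * T)    # modified equation prediction e^{dt*t/2}
assert abs(growth - predicted) < 1e-3
```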
8.2 Runge Kutta methods
Runge Kutta$^5$ methods are a general way to achieve higher order approximate solutions of the initial value problem (8.1), (8.2). Each time step consists of $m$ stages, each stage involving a single evaluation of $f$. The relatively simple four stage fourth order method is in wide use today. Like the forward Euler method, but unlike multistep methods, Runge Kutta time stepping computes $x_{k+1}$ from $x_k$ without using values $x_j$ for $j < k$. This simplifies error estimation and adaptive time step selection.
The simplest Runge Kutta method is forward Euler (8.5). Among the second order methods is Heun's$^6$:
\[ \xi_1 = \Delta t\, f(x_k, t_k) \qquad (8.20) \]
\[ \xi_2 = \Delta t\, f(x_k + \xi_1,\, t_k + \Delta t) \qquad (8.21) \]
\[ x_{k+1} = x_k + \frac{1}{2} \left( \xi_1 + \xi_2 \right) . \qquad (8.22) \]
The calculations of $\xi_1$ and $\xi_2$ are the two stages of Heun's method. Clearly they depend on $k$, though that is left out of the notation.
To calculate $x_k$ from $x_0$ using a Runge Kutta method, we take $k$ time steps. Each time step is a transformation that may be written
\[ x_{k+1} = \hat{S}(x_k, t_k, \Delta t) . \]
As in Chapter 6, we express the general time step as$^7$ $x' = \hat{S}(x, t, \Delta t)$. This $\hat{S}$ approximates the exact solution operator $S$: we have $x' = S(x, t, \Delta t)$ if there is a trajectory satisfying the differential equation (8.1) so that $x(t) = x$ and $x' = x(t + \Delta t)$. In this notation, Heun's method is $\hat{S}(x, t, \Delta t) = x + \frac{1}{2}(\xi_1 + \xi_2)$, where $\xi_1 = \Delta t\, f(x, t)$, and $\xi_2 = \Delta t\, f(x + \xi_1, t + \Delta t)$.
The best known and most used Runge Kutta method, which often is called the Runge Kutta method, has four stages and is fourth order accurate:
\[ \xi_1 = \Delta t\, f(x, t) \qquad (8.23) \]
\[ \xi_2 = \Delta t\, f(x + \tfrac{1}{2}\xi_1,\, t + \tfrac{1}{2}\Delta t) \qquad (8.24) \]
\[ \xi_3 = \Delta t\, f(x + \tfrac{1}{2}\xi_2,\, t + \tfrac{1}{2}\Delta t) \qquad (8.25) \]
\[ \xi_4 = \Delta t\, f(x + \xi_3,\, t + \Delta t) \qquad (8.26) \]
\[ x' = x + \frac{1}{6} \left( \xi_1 + 2\xi_2 + 2\xi_3 + \xi_4 \right) . \qquad (8.27) \]
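Both methods are short to implement, and their orders of accuracy show up as error ratios when the step size is halved. A sketch (plain Python; the names are ours), using $\dot{x} = -x$ as the test problem:

```python
import math

def heun_step(f, x, t, dt):
    k1 = dt * f(x, t)
    k2 = dt * f(x + k1, t + dt)
    return x + 0.5 * (k1 + k2)          # (8.20)-(8.22)

def rk4_step(f, x, t, dt):
    k1 = dt * f(x, t)
    k2 = dt * f(x + 0.5 * k1, t + 0.5 * dt)
    k3 = dt * f(x + 0.5 * k2, t + 0.5 * dt)
    k4 = dt * f(x + k3, t + dt)
    return x + (k1 + 2 * k2 + 2 * k3 + k4) / 6.0   # (8.23)-(8.27)

def solve(step, dt):
    # integrate x' = -x, x(0) = 1 up to t = 1; exact answer is exp(-1)
    x, t = 1.0, 0.0
    for _ in range(int(round(1.0 / dt))):
        x = step(lambda x, t: -x, x, t, dt)
        t += dt
    return abs(x - math.exp(-1.0))

# halving dt should cut the error by about 2^p
for step, p in [(heun_step, 2), (rk4_step, 4)]:
    ratio = solve(step, 0.02) / solve(step, 0.01)
    assert abs(ratio - 2 ** p) / 2 ** p < 0.1
```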
$^5$Carl Runge was Professor of applied mathematics at the turn of the $20^{th}$ century in Gottingen, Germany. Among his colleagues were David Hilbert (of Hilbert space) and Richard Courant. But Courant was forced to leave Germany and came to New York to found the Courant Institute. Kutta was a student of Runge.
$^6$Heun, whose name rhymes with "coin", was another student of Runge.
$^7$The notation $x'$ here does not mean the derivative of $x$ with respect to $t$ (or any other variable) as it does in some books on differential equations.
Understanding the accuracy of Runge Kutta methods comes down to Taylor series. The reasoning of Section 8.1 suggests that the method has error $O(\Delta t^p)$ if
\[ \hat{S}(x, t, \Delta t) = S(x, t, \Delta t) + O(\Delta t^{p+1}) . \]
The exact solution satisfies $\dot{x}(t) = f(x(t), t)$, so
\[ \ddot{x}(t) = f'(x, t) f(x, t) + \partial_t f(x, t) . \]
This gives
\[ S(x, t, \Delta t) = x + \Delta t\, f(x, t) + \frac{\Delta t^2}{2} \left( f'(x, t) f(x, t) + \partial_t f(x, t) \right) + O(\Delta t^3) . \qquad (8.29) \]
To make the expansion of $\hat{S}$ for Heun's method, we first have $\xi_1 = \Delta t\, f(x, t)$, which needs no expansion. Then
\[ \xi_2 = \Delta t\, f(x + \xi_1, t + \Delta t) = \Delta t \left( f(x, t) + f'(x, t)\, \xi_1 + \partial_t f(x, t)\, \Delta t + O(\Delta t^2) \right) = \Delta t\, f(x, t) + \Delta t^2 \left( f'(x, t) f(x, t) + \partial_t f(x, t) \right) + O(\Delta t^3) . \]
Finally, (8.22) gives
\[ x' = x + \frac{1}{2} \left( \xi_1 + \xi_2 \right) = x + \frac{1}{2} \left[ \Delta t\, f(x, t) + \left( \Delta t\, f(x, t) + \Delta t^2 \left( f'(x, t) f(x, t) + \partial_t f(x, t) \right) \right) \right] + O(\Delta t^3) . \]
Comparing this to (8.29) shows that $\hat{S}(x, t, \Delta t) = S(x, t, \Delta t) + O(\Delta t^3)$, so Heun's method is second order accurate.
If we expand $x$ in the eigenvectors $r_\alpha$ of $A$ and write $x(t) = \sum_{\alpha=1}^{n} u_\alpha(t)\, r_\alpha$, then the components $u_\alpha$ satisfy the decoupled scalar equations
\[ \dot{u}_\alpha = \lambda_\alpha u_\alpha . \qquad (8.31) \]
Suppose $x_k \approx x(t_k)$ is the approximate solution at time $t_k$. Write $x_k = \sum_{\alpha=1}^{n} u_{\alpha,k}\, r_\alpha$. The quantities $1/|\lambda_\alpha|$ have units of time and may be called time scales. Since $\Delta t$ has units of time, it does not make sense to say that $\Delta t$ is small in an absolute sense, but only relative to other time scales in the problem. This leads to the following:
Possibility: A time stepping approximation to (8.30) will be accurate only if
\[ \max_\alpha \Delta t\, |\lambda_\alpha| \ll 1 . \qquad (8.32) \]
Although this possibility is not true in every case, it is a dominant technical consideration in most practical computations involving differential equations. The possibility suggests that the time step should be considerably smaller than the smallest time scale in the problem, which is to say that $\Delta t$ should resolve even the fastest time scales in the problem.
A problem is called stiff if it has two characteristics: (i) there is a wide range of time scales, and (ii) the fastest time scale modes have almost no energy. The second condition states that if $|\lambda_\alpha|$ is large, then $|u_\alpha|$ is small. Most time stepping problems for partial differential equations are stiff in this sense. For a stiff problem, we would like to take larger time steps than (8.32):
\[ \Delta t\, |\lambda_\alpha| \ll 1 \quad \text{for all } \alpha \text{ with } u_\alpha \text{ significantly different from zero.} \qquad (8.33) \]
What can go wrong if we ignore (8.32) and choose a time step using (8.33) is numerical instability. If mode $u_\alpha$ is one of the large $|\lambda_\alpha|$, small $|u_\alpha|$ modes,$^8$
$^8$We call the eigenvalue index $\alpha$ to avoid conflict with $k$, which we use to denote the time step.
it is natural to assume that the real part satisfies $\mathrm{Re}(\lambda_\alpha) \le 0$. In this case we say the mode is stable because $|u_\alpha(t)| = |u_\alpha(0)|\, e^{\mathrm{Re}(\lambda_\alpha) t}$ does not increase as $t$ increases. However, if $\Delta t\, |\lambda_\alpha|$ is large, the time stepping method can amplify the mode instead, and the tiny $u_\alpha$ grows explosively.
Suppose, for example, that we want $I = I_1 + I_2$ computed with $|I - \tilde{I}| < .06$. It might be that we can calculate $I_1 = \int_0^1 f(x)\, dx$ to within .03 using $\Delta x = .1$ (10 points), but that calculating $I_2 = \int_1^2 f(x)\, dx$ to within .03 takes $\Delta x = .02$ (50 points). It would be better to use $\Delta x = .1$ for $I_1$ and $\Delta x = .02$ for $I_2$ (60 points total) rather than using $\Delta x = .02$ for all of $I$ (100 points).
Adaptive methods can use local error estimates to concentrate computational resources where they are most needed. If we are solving a differential equation to compute $x(t)$, we can use a smaller time step in regions where $x$ has large acceleration. There is an active community of researchers developing systematic ways to choose the time steps in a way that is close to optimal without having the overhead in choosing the time step become larger than the work in solving the problem. In many cases they conclude, and simple model problems show, that a good strategy is to equidistribute the local truncation error. That is, to choose time steps $\Delta t_k$ so that the local truncation error $\ell_k = \Delta t_k\, r_k$ is nearly constant.
If we have a variable time step $\Delta t_k$, then the times $t_{k+1} = t_k + \Delta t_k$ form an irregular adapted mesh (or adapted grid). Informally, we want to choose a mesh that resolves the solution, $x(t)$, being calculated. This means that knowing the $x_k \approx x(t_k)$ allows you to make an accurate reconstruction of the function $x(t)$, say, by interpolation. If the points $t_k$ are too far apart then the solution is underresolved. If the $t_k$ are so close that $x(t_k)$ is predicted accurately by a few neighboring values ($x(t_j)$ for $j = k-1$, $k-2$, etc.) then $x(t)$ is overresolved: we have computed it accurately but paid too high a price. An efficient adaptive mesh avoids both underresolution and overresolution.
Figure 8.1 illustrates an adapted mesh with equidistributed interpolation error. The top graph shows a curve we want to resolve and a set of points that concentrates where the curvature is high. Also shown is the piecewise linear curve that connects the interpolation points. On the graph it looks as though the piecewise linear graph is closer to the curve near the center than in the smoother regions at the ends, but the error graph in the lower frame shows this is not so. The reason probably is that what is plotted in the bottom frame is the vertical distance between the two curves, while what we see in the picture is the two dimensional distance, which is less if the curves are steep. The bottom frame illustrates equidistribution of error. The interpolation error is zero at the grid points and gets to be as large as about $6.3 \times 10^{-3}$ in each interpolation interval. If the points were uniformly spaced, the interpolation error would be smaller near the ends and larger in the center. If the points were bunched even more than they are here, the interpolation error would be smaller in the center than near the ends. We would not expect such perfect equidistribution in real problems, but we might have errors of the same order of magnitude everywhere.

[Figure 8.1: top panel, "A function interpolated on 16 unevenly spaced points", showing the function $y(x)$, the mesh values, the linear interpolant of the mesh values, and the locations of the mesh points; bottom panel, "Error in interpolation from 16 nonuniformly spaced points". Caption: A nonuniform mesh for a function that needs different resolution in different places. The top graph shows the function and the mesh used to interpolate it. The bottom graph is the difference between the function and the piecewise linear approximation. Note that the interpolation error is equidistributed even though the mesh is much finer near $x = 0$.]
For a Runge Kutta method, the local truncation error is $\ell(x, t, \Delta t) = \hat{S}(x, t, \Delta t) - S(x, t, \Delta t)$. We want a way to estimate $\ell$ and to choose $\Delta t$ so that $|\ell| = e$, where $e$ is the value of the equidistributed local truncation error. We suggest a method related to Richardson extrapolation (see Section 3.3), comparing the result of one time step of size $\Delta t$ to two time steps of size $\Delta t/2$. The best adaptive Runge Kutta differential equation solvers do not use this generic method, but instead use ingenious schemes such as the Runge Kutta Fehlberg five stage scheme that simultaneously gives a fifth order $\hat{S}_5$, but also gives an estimate of the difference between a fourth order and a fifth order method, $\hat{S}_5 - \hat{S}_4$. The method described here does the same thing with eight function evaluations instead of five.
The Taylor series analysis of Runge Kutta methods indicates that $\ell(x, t, \Delta t) = \Delta t^{p+1} \sigma(x, t) + O(\Delta t^{p+2})$. We will treat $\sigma$ as a constant because all the $x$ and $t$ values we use are within $O(\Delta t)$ of each other, so variations in $\sigma$ do not affect the principal error term we are estimating. With one time step, we get $x' = \hat{S}(x, t, \Delta t)$. With two half size time steps we get first $\tilde{x}_1 = \hat{S}(x, t, \Delta t/2)$, then $\tilde{x}_2 = \hat{S}(\tilde{x}_1, t + \Delta t/2, \Delta t/2)$.
We will show, using the Richardson method of Section 3.3, that
\[ x' - \tilde{x}_2 = \left( 1 - 2^{-p} \right) \ell(x, t, \Delta t) + O(\Delta t^{p+2}) . \qquad (8.34) \]
We need to use the semigroup property of the solution operator: if we run the exact solution from $x$ for time $\Delta t/2$, then run it from there for another time $\Delta t/2$, the result is the same as running it from $x$ for time $\Delta t$. Letting $x(\cdot)$ be the solution of (8.1) with $x(t) = x$, the formula for this is
\[ S(x, t, \Delta t) = S(x(t + \Delta t/2),\, t + \Delta t/2,\, \Delta t/2) = S( S(x, t, \Delta t/2),\, t + \Delta t/2,\, \Delta t/2) . \]
We also need to know that $S(x, t, \Delta t) = x + O(\Delta t)$ is reflected in the Jacobian matrix$^9$ $S'(x, t, \Delta t) = I + O(\Delta t)$.
The actual calculation starts with
\[ \tilde{x}_1 = \hat{S}(x, t, \Delta t/2) = S(x, t, \Delta t/2) + 2^{-(p+1)} \Delta t^{p+1} \sigma + O(\Delta t^{p+2}) , \]
and
\[ \tilde{x}_2 = \hat{S}(\tilde{x}_1,\, t + \Delta t/2,\, \Delta t/2) = S(\tilde{x}_1,\, t + \Delta t/2,\, \Delta t/2) + 2^{-(p+1)} \Delta t^{p+1} \sigma + O(\Delta t^{p+2}) . \]
We simplify the notation by writing $\tilde{x}_1 = x(t + \Delta t/2) + u$ with $u = 2^{-(p+1)} \Delta t^{p+1} \sigma + O(\Delta t^{p+2})$. Then $\|u\| = O(\Delta t^{p+1})$ and also (used below) $\Delta t\, \|u\| = O(\Delta t^{p+2})$ and (since $p \ge 1$) $\|u\|^2 = O(\Delta t^{2p+2}) = O(\Delta t^{p+2})$. Then
\[ S(\tilde{x}_1,\, t + \Delta t/2,\, \Delta t/2) = S(x(t + \Delta t/2) + u,\, t + \Delta t/2,\, \Delta t/2) \]
\[ = S(x(t + \Delta t/2),\, t + \Delta t/2,\, \Delta t/2) + S' u + O\left( \|u\|^2 \right) \]
\[ = S(x(t + \Delta t/2),\, t + \Delta t/2,\, \Delta t/2) + u + O\left( \Delta t^{p+2} \right) \]
\[ = S(x, t, \Delta t) + 2^{-(p+1)} \Delta t^{p+1} \sigma + O\left( \Delta t^{p+2} \right) . \]
$^9$This fact is a consequence of the fact that $S$ is twice differentiable as a function of all its arguments, which can be found in more theoretical books on differential equations. The Jacobian of $f(x) = x$ is $f'(x) = I$.
Altogether, since $2 \cdot 2^{-(p+1)} = 2^{-p}$, this gives
\[ \tilde{x}_2 = S(x, t, \Delta t) + 2^{-p} \Delta t^{p+1} \sigma + O\left( \Delta t^{p+2} \right) . \]
Finally, a single size $\Delta t$ time step has
\[ x' = S(x, t, \Delta t) + \Delta t^{p+1} \sigma + O\left( \Delta t^{p+2} \right) . \]
Combining these gives (8.34). It may seem like a mess but it has a simple underpinning. The whole step produces an error of order $\Delta t^{p+1}$. Each half step produces an error smaller by a factor of $2^{p+1}$, which is the main idea of Richardson extrapolation. Two half steps produce almost exactly twice the error of one half step.
There is a simple adaptive strategy based on the local truncation error estimate (8.34). We arrive at the start of time step $k$ with an estimated time step size $\Delta t_k$. Using that time step, we compute $x' = \hat{S}(x_k, t_k, \Delta t_k)$ and $\tilde{x}_2$ by taking two time steps from $x_k$ with $\Delta t_k/2$. We then estimate $\ell_k$ using (8.34):
\[ \hat{\ell}_k = \frac{1}{1 - 2^{-p}} \left( x' - \tilde{x}_2 \right) . \qquad (8.35) \]
This suggests that if we adjust $\Delta t_k$ by a factor of $\mu$ (taking a time step of size $\mu \Delta t_k$ instead of $\Delta t_k$), the error would have been $\mu^{p+1} \hat{\ell}_k$. If we choose $\mu$ to exactly equidistribute the error (according to our estimated $\hat{\ell}_k$), we would get
\[ e = \mu^{p+1} | \hat{\ell}_k | \implies \mu_k = \left( e / | \hat{\ell}_k | \right)^{1/(p+1)} . \qquad (8.36) \]
We could use this estimate to adjust $\Delta t_k$ and calculate again, but this may lead to an infinite loop. Instead, we use $\Delta t_{k+1} = \mu_k \Delta t_k$.
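One cycle of this estimate looks as follows. The sketch (plain Python; the names are ours) takes one RK4 step ($p = 4$) and two half steps for $\dot{x} = -x$, forms $\hat{\ell}_k$ from (8.35), and compares it with the actual one step error:

```python
import math

def rk4(f, x, dt):
    # classical fourth order Runge Kutta step for an autonomous problem
    k1 = dt * f(x)
    k2 = dt * f(x + 0.5 * k1)
    k3 = dt * f(x + 0.5 * k2)
    k4 = dt * f(x + k3)
    return x + (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

f = lambda x: -x                  # test problem x' = -x, exact solution e^{-t}
x, dt, p = 1.0, 0.5, 4

xp = rk4(f, x, dt)                          # one step of size dt
x2 = rk4(f, rk4(f, x, dt / 2), dt / 2)      # two steps of size dt/2
est = (xp - x2) / (1.0 - 2.0 ** (-p))       # (8.35), the estimated LTE
true = xp - x * math.exp(-dt)               # actual one step error
assert abs(est - true) < 0.05 * abs(true)   # estimate is good to a few percent
```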
Chapter 3 already mentioned the paradox of error estimation. Once we have a quantitative error estimate, we should use it to make the solution more accurate. This means taking
\[ x_{k+1} = \hat{S}(x_k, t_k, \Delta t_k) - \hat{\ell}_k , \]
which has order of accuracy $p + 1$, instead of the order $p$ time step $\hat{S}$. This increases the accuracy but leaves you without an error estimate. This gives an order $p + 1$ calculation with a mesh chosen to be nearly optimal for an order $p$ calculation. Maybe the reader can find a way around this paradox. Some adaptive strategies reduce the overhead of error estimation by using the Richardson based time step adjustment, say, every fifth step.
One practical problem with this strategy is that we do not know the quantitative relationship between local truncation error and global error$^{10}$. Therefore it is hard to know what $e$ to give to achieve a given global error. One way to estimate global error would be to use a given $e$ and get some time steps $\Delta t_k$, then redo the computation with each interval $[t_k, t_{k+1}]$ cut in half, taking exactly twice the number of time steps. If the method has order of accuracy $p$, then the global error should decrease very nearly by a factor of $2^p$, which allows us to estimate that error. This is rarely done in practice. Another issue is that there can be isolated zeros of the leading order truncation error. This might happen, for example, if the local truncation error were proportional to a scalar function of $x$. In (8.36), this could lead to an unrealistically large time step. One might avoid that, say, by replacing $\mu_k$ with $\min(\mu_k, 2)$, which would allow the time step to grow quickly, but not too much in one step. This is less of a problem for systems of equations.
$^{10}$Adjoint based error control methods that address this problem are still in the research stage (2006).
8.5 Multistep methods
Linear multistep methods are the other class of methods in wide use. Rather than giving a general discussion, we focus on the two most popular kinds: methods based on difference approximations, and methods based on integrating $f(x(t))$, Adams methods. Hybrids of these are possible but often are unstable. For some reason, almost nothing is known about methods that are both multistage and multistep.
Multistep methods are characterized by using information from previous time steps to go from $x_k$ to $x_{k+1}$. We describe them for a fixed $\Delta t$. A simple example would be to use the second order centered difference approximation $\dot{x}(t) \approx \left( x(t + \Delta t) - x(t - \Delta t) \right) / (2\Delta t)$ to get
\[ \left( x_{k+1} - x_{k-1} \right) / (2\Delta t) = f(x_k) , \]
or
\[ x_{k+1} = x_{k-1} + 2\Delta t\, f(x_k) . \qquad (8.37) \]
This is the leapfrog$^{11}$ method. We find that
\[ \tilde{x}_{k+1} = \tilde{x}_{k-1} + 2\Delta t\, f(\tilde{x}_k) + \Delta t \cdot O(\Delta t^2) , \]
so it is second order accurate. It achieves second order accuracy with a single evaluation of $f$ per time step. Runge Kutta methods need at least two evaluations per time step to be second order. Leapfrog uses $x_{k-1}$ and $x_k$ to compute $x_{k+1}$, while Runge Kutta methods forget $x_{k-1}$ when computing $x_{k+1}$ from $x_k$.
The next method of this class illustrates the subtleties of multistep methods. It is based on the four point one sided difference approximation
\[ \dot{x}(t) = \frac{1}{\Delta t} \left( \frac{1}{3} x(t + \Delta t) + \frac{1}{2} x(t) - x(t - \Delta t) + \frac{1}{6} x(t - 2\Delta t) \right) + O\left( \Delta t^3 \right) . \]
This suggests the time stepping method
\[ f(x_k) = \frac{1}{\Delta t} \left( \frac{1}{3} x_{k+1} + \frac{1}{2} x_k - x_{k-1} + \frac{1}{6} x_{k-2} \right) , \qquad (8.38) \]
$^{11}$Leapfrog is a game in which two or more children move forward in a line by taking turns jumping over each other, as (8.37) jumps from $x_{k-1}$ to $x_{k+1}$ using only $f(x_k)$.
which leads to
\[ x_{k+1} = 3\Delta t\, f(x_k) - \frac{3}{2} x_k + 3 x_{k-1} - \frac{1}{2} x_{k-2} . \qquad (8.39) \]
This method never is used in practice because it is unstable in a way that Runge Kutta methods cannot be. If we set $f \equiv 0$ (to solve the model problem $\dot{x} = 0$), (8.38) becomes the recurrence relation
\[ x_{k+1} + \frac{3}{2} x_k - 3 x_{k-1} + \frac{1}{2} x_{k-2} = 0 , \qquad (8.40) \]
which has characteristic polynomial$^{12}$ $p(z) = z^3 + \frac{3}{2} z^2 - 3 z + \frac{1}{2}$. Since one of the roots of this polynomial has $|z| > 1$, general solutions of (8.40) grow exponentially on a $\Delta t$ time scale, which generally prevents approximate solutions from converging as $\Delta t \to 0$. This cannot happen for Runge Kutta methods because setting $f \equiv 0$ always gives $x_{k+1} = x_k$, which is the exact answer in this case.
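The instability is easy to provoke. With $f \equiv 0$ and exact data $x_k = 1$ the recurrence (8.40) is satisfied exactly, but a perturbation at the level of roundoff excites the root $z \approx -2.686$ of $p(z)$ and grows geometrically. A sketch (plain Python):

```python
import math

# The recurrence (8.40): x_{k+1} = -(3/2) x_k + 3 x_{k-1} - (1/2) x_{k-2}.
# z = 1 is a root of p(z) (consistency); another root is
# z = (-5/2 - sqrt(33)/2)/2 = -2.686..., which amplifies tiny errors.
x = [1.0, 1.0, 1.0 + 1e-15]    # tiny perturbation of the exact solution x = 1
for _ in range(60):
    x.append(-1.5 * x[-1] + 3.0 * x[-2] - 0.5 * x[-3])

z = (-2.5 - math.sqrt(8.25)) / 2.0      # the unstable root, about -2.686
assert abs(x[-1]) > 1e9                 # the 1e-15 perturbation has exploded
assert abs(abs(x[-1] / x[-2]) - abs(z)) < 0.01   # growth rate matches |z|
```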
Adams methods use old values of $f$ but not old values of $x$. We can integrate (8.1) to get
\[ x(t_{k+1}) = x(t_k) + \int_{t_k}^{t_{k+1}} f(x(t))\, dt . \qquad (8.41) \]
An accurate estimate of the integral on the right leads to an accurate time step. Adams Bashforth methods estimate the integral using polynomial extrapolation from earlier $f$ values. At its very simplest we could use $f(x(t)) \approx f(x(t_k))$, which gives
\[ \int_{t_k}^{t_{k+1}} f(x(t))\, dt \approx \left( t_{k+1} - t_k \right) f(x(t_k)) . \]
Using this approximation on the right side of (8.41) gives forward Euler.
The next order comes from linear rather than constant extrapolation:
\[ f(x(t)) \approx f(x(t_k)) + \left( t - t_k \right) \frac{ f(x(t_k)) - f(x(t_{k-1})) }{ t_k - t_{k-1} } . \]
With this, the integral is estimated as (the generalization to non constant $\Delta t$ is Exercise ??):
\[ \int_{t_k}^{t_{k+1}} f(x(t))\, dt \approx \Delta t\, f(x(t_k)) + \frac{\Delta t^2}{2} \cdot \frac{ f(x(t_k)) - f(x(t_{k-1})) }{ \Delta t } = \Delta t \left( \frac{3}{2} f(x(t_k)) - \frac{1}{2} f(x(t_{k-1})) \right) . \]
The second order Adams Bashforth method for constant $\Delta t$ is
\[ x_{k+1} = x_k + \Delta t \left( \frac{3}{2} f(x_k) - \frac{1}{2} f(x_{k-1}) \right) . \qquad (8.42) \]
$^{12}$If $p(z) = 0$ then $x_k = z^k$ is a solution of (8.40).
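A sketch of (8.42) (plain Python; the names are ours). The method needs two starting values, so the first step here is taken with Heun's method, which is also second order; halving $\Delta t$ should then cut the error by about $2^2 = 4$. In a real code one would save the previous $f$ value instead of recomputing it each step:

```python
import math

def heun(f, x, dt):
    k1 = dt * f(x)
    k2 = dt * f(x + k1)
    return x + 0.5 * (k1 + k2)

def ab2(f, x0, dt, T):
    # second order Adams-Bashforth (8.42), started with one Heun step
    x_prev, x = x0, heun(f, x0, dt)
    for _ in range(int(round(T / dt)) - 1):
        x_prev, x = x, x + dt * (1.5 * f(x) - 0.5 * f(x_prev))
    return x

f = lambda x: -x                       # test problem, exact answer exp(-T)
err1 = abs(ab2(f, 1.0, 0.02, 1.0) - math.exp(-1.0))
err2 = abs(ab2(f, 1.0, 0.01, 1.0) - math.exp(-1.0))
assert 3.5 < err1 / err2 < 4.5         # second order convergence
```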
To program higher order Adams Bashforth methods we need to evaluate the integral of the interpolating polynomial. The techniques of polynomial interpolation from Chapter 7 make this simpler.
Adams Bashforth methods are attractive for high accuracy computations where stiffness is not an issue. They cannot be unstable in the way (8.39) is because setting $f \equiv 0$ results (in (8.42), for example) in $x_{k+1} = x_k$, as for Runge Kutta methods. An Adams Bashforth method of any order of accuracy requires just one evaluation of $f$ per time step, as opposed to four per time step for the fourth order Runge Kutta method.
8.6 Implicit methods
Implicit methods use $f(x_{k+1})$ in the formula for $x_{k+1}$. They are used for stiff problems because they can be stable with large $\Delta t$ (see Section 8.3) in ways explicit methods, all the ones discussed up to now, cannot. An implicit method must solve a system of equations to compute $x_{k+1}$.
The simplest implicit method is backward Euler:
\[ x_{k+1} = x_k + \Delta t\, f(x_{k+1}) . \qquad (8.43) \]
This is only first order accurate, but it is stable for any $\lambda$ and $\Delta t$ if $\mathrm{Re}(\lambda) \le 0$. This makes it suitable for solving stiff problems. It is called implicit because $x_{k+1}$ is determined implicitly by (8.43), which we rewrite as
\[ F(x_{k+1}, \Delta t) = 0 , \quad \text{where} \quad F(y, \Delta t) = y - \Delta t\, f(y) - x_k . \qquad (8.44) \]
To find $x_{k+1}$, we have to solve this system of equations for $y$.
Applied to the linear scalar problem (8.31) (dropping the $\alpha$ index), the method (8.43) becomes $u_{k+1} = u_k + \Delta t\, \lambda u_{k+1}$, or
\[ u_{k+1} = \frac{1}{1 - \Delta t\, \lambda}\, u_k . \]
This shows that $|u_{k+1}| < |u_k|$ if $\Delta t > 0$ and $\lambda$ is any complex number with $\mathrm{Re}(\lambda) \le 0$. This is in partial agreement with the qualitative behavior of the exact solution of (8.31), $u(t) = e^{\lambda t} u(0)$. The exact solution decreases in time if $\mathrm{Re}(\lambda) < 0$ but not if $\mathrm{Re}(\lambda) = 0$. The backward Euler approximation decreases in time even when $\mathrm{Re}(\lambda) = 0$. The backward Euler method artificially stabilizes a neutrally stable system, just as the forward Euler method artificially destabilizes it (see the modified equation discussion leading to (8.19)).
Most likely the equations (8.44) would be solved using an iterative method as discussed in Chapter 6. This leads to inner iterations, with the outer iteration being the time step. If we use the unsafeguarded local Newton method, and let $j$ index the inner iteration, we get $F' = I - \Delta t\, f'$ and
\[ y_{j+1} = y_j - \left( I - \Delta t\, f'(y_j) \right)^{-1} \left( y_j - \Delta t\, f(y_j) - x_k \right) , \qquad (8.45) \]
hoping that $y_j \to x_{k+1}$ as $j \to \infty$. We can take initial guess $y_0 = x_k$, or, even better, an extrapolation such as $y_0 = x_k + \Delta t\, (x_k - x_{k-1})/\Delta t = 2 x_k - x_{k-1}$. With a good initial guess, just one Newton iteration should give $x_{k+1}$ accurately enough.
Can we use the approximation $J \approx I$ for small $\Delta t$? If we could, the Newton iteration would become the simpler functional iteration (check this)
\[ y_{j+1} = x_k + \Delta t\, f(y_j) . \qquad (8.46) \]
The problem with this is that it does not work precisely for the stiff systems we use implicit methods for. For example, applied to $\dot{u} = \lambda u$, the functional iteration diverges ($|y_j| \to \infty$ as $j \to \infty$) whenever $\Delta t\, |\lambda| > 1$.
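For a scalar problem the Newton iteration (8.45) is one line, and for linear $f$ it converges in a single iteration. The sketch below (plain Python; the names and the choice $\lambda = -1000$ are ours) takes backward Euler steps with $\Delta t\, |\lambda| = 100$, far outside the explicit stability region, and contrasts the result with forward Euler on the same problem:

```python
# scalar stiff problem u' = lam * u with lam = -1000; dt = 0.1, so dt*|lam| = 100
lam, dt, u = -1000.0, 0.1, 1.0

def backward_euler_step(u):
    # solve F(y) = y - dt*f(y) - u = 0 by Newton (8.45);
    # for linear f one iteration already lands on the exact root
    y = u                       # initial guess y0 = u_k
    for _ in range(2):
        F = y - dt * lam * y - u
        Fp = 1.0 - dt * lam     # F'(y) = 1 - dt * f'(y)
        y -= F / Fp
    return y

fe = 1.0
for _ in range(50):
    u = backward_euler_step(u)  # backward Euler: u_{k+1} = u_k / 101, decays
    fe = fe + dt * lam * fe     # forward Euler: factor 1 + dt*lam = -99, explodes
assert abs(u) < 1e-3
assert abs(fe) > 1e10
```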
Most of the explicit methods above have implicit analogues. Among implicit
Runge-Kutta methods we mention the trapezoid rule

    ( x_{k+1} - x_k ) / Δt = (1/2) ( f(x_{k+1}) + f(x_k) ) .   (8.47)
There are backward differentiation formula, or BDF, methods based on higher
order one sided approximations of ẋ(t_{k+1}). The second order BDF method uses
(??):

    ẋ(t) = (1/Δt) [ (3/2) x(t) - 2 x(t - Δt) + (1/2) x(t - 2Δt) ] + O(Δt^2) ,
to get

    f(x(t_{k+1})) = ẋ(t_{k+1}) = (1/Δt) [ (3/2) x(t_{k+1}) - 2 x(t_k) + (1/2) x(t_{k-1}) ] + O(Δt^2) ,
and, neglecting the O(Δt^2) error term,

    x_{k+1} - (2Δt/3) f(x_{k+1}) = (4/3) x_k - (1/3) x_{k-1} .   (8.48)
The Adams-Moulton methods estimate ∫ from t_k to t_{k+1} of f(x(t)) dt using polynomial
interpolation through the values f(x_{k+1}), f(x_k), and possibly f(x_{k-1}), etc. The
second order Adams-Moulton method uses f(x_{k+1}) and f(x_k). It is the same
as the trapezoid rule (8.47). The third order Adams-Moulton method also uses
f(x_{k-1}). Both the trapezoid rule (8.47) and the second order BDF method
(8.48) have the property of being A-stable, which means being stable for
(8.31) with any λ and Δt as long as Re(λ) ≤ 0. The higher order implicit
methods are more stable than their explicit counterparts but are not A-stable,
which is a constant frustration to people looking for high order solutions to stiff
systems.
8.7 Computing chaos, can it be done?
In many applications, the solutions to the differential equation (8.1) are chaotic.¹³
The informal definition is that for large t (not so large in real applications) x(t)
is an unpredictable function of x(0). In the terminology of Section 8.5, this
means that the solution operator, S(x_0, 0, t), is an ill conditioned function of
x_0.
The dogma of Section 2.7 is that a floating point computation cannot give
an accurate approximation to S if the condition number of S is larger than
1/ε_mach ≈ 10^16. But practical calculations ranging from weather forecasting
to molecular dynamics violate this rule routinely. In the computations below,
the condition number of S(x, t) increases with t and crosses 10^16 by t = 3 (see
Figure 8.3). Still, a calculation up to time t = 60 (Figure 8.4, bottom) shows
the beautiful butterfly shaped Lorenz attractor, which looks just as it should.
On the other hand, in this and other computations, it truly is impossible
to calculate details correctly. This is illustrated in Figure 8.2. The top picture
plots two trajectories, one computed with Δt = 4 × 10^-4 (dashed line), and the
other with the time step reduced by a factor of 2 (solid line). The difference
between the trajectories is an estimate of the accuracy of the computations.
The computation seems somewhat accurate (curves close) up to time t ≈ 5,
at which time the dashed line goes up to x ≈ 15 and the solid line goes down
to x ≈ -15. At least one of these is completely wrong. Beyond t ≈ 5, the
two approximate solutions have similar qualitative behavior but seem to be
independent of each other. The bottom picture shows the same experiment with
Δt a hundred times smaller than in the top picture. With a hundred times more
accuracy, the approximate solution loses accuracy at t ≈ 10 instead of t ≈ 5. If
a factor of 100 increase in accuracy only extends the validity of the solution by
5 time units, it should be hopeless to compute the solution out to t = 60.
The present numerical experiments are on the Lorenz equations, which are
a system of three nonlinear ordinary differential equations

    ẋ = σ(y - x)
    ẏ = x(ρ - z) - y
    ż = xy - βz

with¹⁴ σ = 10, ρ = 28, and β = 8/3. The C/C++ program outputs (x, y, z)
for plotting every Δt = .02 units of time, though there are many time steps in each
plotting interval. The solution first finds its way to the butterfly shaped Lorenz
attractor then stays on it, traveling around the right and left wings in a seemingly
random (technically, chaotic) way. The initial data x = y = z = 0 are not close
to the attractor, so we ran the differential equations for some time before time
t = 0 in order that (x(0), y(0), z(0)) should be a typical point on the attractor.
¹³ James Gleick has written a nice popular book on chaos. Steve Strogatz has a more
technical introduction that does not avoid the beautiful parts.
¹⁴ These can be found, for example, in http://wikipedia.org by searching on Lorenz attractor.
Figure 8.2 shows the chaotic sequence of wing choice. A trip around the left
wing corresponds to a dip of x(t) down to x ≈ -15 and a trip around the right
wing corresponds to x going up to x ≈ 15.
Sections 2.7 and 4.3 explain that the condition number of the problem of
calculating S(x, t) depends on the Jacobian matrix A(x, t) = ∂_x S(x, t). This
represents the sensitivity of the solution at time t to small perturbations of the
initial data. Adapting notation from (4.28), we find that the condition number
of calculating x(t) from initial conditions x(0) = x is

    κ(S(x, t)) = ‖A(x(0), t)‖ ‖x(0)‖ / ‖x(t)‖ .   (8.49)
We can calculate A(x, t) using ideas of perturbation theory similar to those we
used for linear algebra problems in Section 4.2.6. Since S(x, t) is the value of a
solution at time t, it satisfies the differential equation

    f(S(x, t)) = d/dt S(x, t) .

We differentiate both sides with respect to x and interchange the order of
differentiation,

    ∂_x f(S(x, t)) = ∂_x ( d/dt S(x, t) ) = d/dt ( ∂_x S(x, t) ) = d/dt A(x, t) ,

to get (with the chain rule)

    d/dt A(x, t) = ∂_x f(S(x, t)) = f'(S(x, t)) ∂_x S(x, t) = f'(S(x, t)) A(x, t) .   (8.50)

The singular values σ_k(t) of A(x, t) grow or decay roughly exponentially,
σ_k(t) ≈ e^{μ_k t}, where the μ_k are the Lyapunov exponents. More precisely,

    lim_{t→∞} ln(σ_k(t)) / t = μ_k .
In our problem, ‖A(x, t)‖ = σ_1(t) seems to grow exponentially because μ_1 > 0.
Since ‖x(0)‖ and ‖x(t)‖ are both of the order of about ten, this, with
(8.49), implies that κ(S(x, t)) grows exponentially in time. That explains why
it is impossible to compute the values of x(t) with any precision at all beyond
t = 20.
It is interesting to study the condition number of A itself. If μ_1 > μ_n, the
l^2 condition number also grows exponentially,

    κ_{l^2}(A(x, t)) = σ_1(t) / σ_n(t) ∼ e^{(μ_1 - μ_n) t} .
Figure 8.3 gives some indication that our Lorenz system has differing Lyapunov
exponents. The top figure shows computations of the three singular
values for A(x, t). For 0 ≤ t < 2, it seems that σ_3 is decaying exponentially,
making a downward sloping straight line on this log plot. When σ_3 gets to
about 10^-15, the decay halts. This is because it is nearly impossible for a full
matrix in double precision floating point to have a condition number larger than
1/ε_mach ≈ 10^16. When σ_3 hits 10^-15, we have σ_1 ≈ 10^2, so this limit has been
reached. These trends are clearer in the top frame of Figure 8.4, which is the
same calculation carried to a larger time. Here σ_1(t) seems to be growing exponentially
with a gap between σ_1 and σ_3 of about 1/ε_mach. Theory says that σ_2
should be close to one, and the computations are consistent with this until the
condition number bound makes σ_2 ≈ 1 impossible.
To summarize, some results are quantitatively right, including the butterfly
shape of the attractor and the exponential growth rate of σ_1(t). Some results
are qualitatively right but quantitatively wrong, including the values of x(t)
for t > 10. Convergence analyses (comparing Δt results to Δt/2 results)
distinguish right from wrong in these cases. Other features of the computed
solution are consistent over a range of Δt and consistently wrong. There is no
reason to think the condition number of A(x, t) grows exponentially until t ≈ 2
then levels off at about 10^16. Much more sophisticated computational methods
using the semigroup property show this is not so.
8.8 Software: Scientific visualization

Visualization of data is indispensable in scientific computing and computational
science. Anomalies in data that seem to jump off the page in a plot are easy
to overlook in numerical data. It can be easier to interpret data by looking at
pictures than by examining columns of numbers. For example, here are entries
500 to 535 from the time series that made the top curve in the top frame of
Figure 8.4 (multiplied by 10^-5).
0.1028 0.1020 0.1000 0.0963 0.0914 0.0864 0.0820
0.0790 0.0775 0.0776 0.0790 0.0818 0.0857 0.0910
0.0978 0.1062 0.1165 0.1291 0.1443 0.1625 0.1844
0.2104 0.2414 0.2780 0.3213 0.3720 0.4313 0.4998
0.5778 0.6649 0.7595 0.8580 0.9542 1.0395 1.1034
[Figure 8.2 here: two frames plotting the x component of the computed solution
against t for 0 ≤ t ≤ 20, with x ranging from -20 to 20. The frames show a
larger time step and a smaller time step curve each; the top frame uses
dt = 4.00e-04 and dt = 2.00e-04, the bottom frame dt = 4.00e-06 and dt = 2.00e-06.]

Figure 8.2: Two convergence studies for the Lorenz system. The time steps in
the bottom figure are 100 times smaller than the time steps in the top figure. The
more accurate calculation loses accuracy at t ≈ 10, as opposed to t ≈ 5 with the
larger time step. The qualitative behavior is similar in all computations.
[Figure 8.3 here: two frames plotting sigma 1, sigma 2, and sigma 3 on a log
scale (10^-20 to 10^10) against t for 0 ≤ t ≤ 20, with dt = 4.00e-04 (top) and
dt = 4.00e-05 (bottom).]

Figure 8.3: Computed singular values of the sensitivity matrix A(x, t) =
∂_x S(x, t) with large time step (top) and ten times smaller time step (bottom).
Top and bottom curves are similar qualitatively though the fine details differ.
Theory predicts that the middle singular value should not grow or decay with t.
The times from Figure 8.2 at which the numerical solution loses accuracy are
not apparent here. In higher precision arithmetic, σ_3(t) would have continued
to decay exponentially. It is unlikely that computed singular values of any full
matrix would differ by more than a factor of 1/ε_mach ≈ 10^16.
[Figure 8.4 here: the top frame plots the three singular values on a log scale
(10^-25 to 10^10) against t for 0 ≤ t ≤ 60, with dt = 4.00e-05; the bottom frame
is a three dimensional plot of the solution curve, "Lorenz attractor up to time
60", with axes x, y, z.]

Figure 8.4: Top: Singular values from Figure 8.3 computed for longer time. The
σ_1(t) grows exponentially, making a straight line on this log plot. The computed
σ_2(t) starts growing with the same exponential rate as σ_1 when roundoff takes
over. A correct computation would show σ_3(t) decreasing exponentially and
σ_2(t) neither growing nor decaying. Bottom: A beautiful picture of the butterfly
shaped Lorenz attractor. It is just a three dimensional plot of the solution curve.
Looking at the numbers, we get the overall impression that they are growing in
an irregular way. The graph shows that the numbers have simple exponential
growth with fine scale irregularities superposed. It could take hours to get that
information directly from the numbers.

It can be a challenge to make visual sense of higher dimensional data. For
example, we could make graphs of x(t) (Figure 8.2), y(t), and z(t) as functions
of t, but the single three dimensional plot in the lower frame of Figure 8.4 makes
it clearer that the solution goes sometimes around the left wing and sometimes
around the right. The three dimensional plot (plot3 in Matlab) illuminates the
structure of the solution better than three one dimensional plots.
There are several ways to visualize functions of two variables. A contour plot
draws several contour lines, or level lines, of a function u(x, y). A contour line for
level u_k is the set of points (x, y) with u(x, y) = u_k. It is common to take about
ten uniformly spaced values u_k, but most contour plot programs, including the
Matlab program contour, allow the user to specify the u_k. Most contour lines
are smooth curves or collections of curves. For example, if u(x, y) = x^2 - y^2,
the contour line u = u_k with u_k ≠ 0 is a hyperbola with two components. An
exception is u_k = 0: that contour line is an ×, the two crossed lines y = x and
y = -x.
A grid plot represents a two dimensional rectangular array of numbers by
colors. A color map assigns to each numerical value u a color that we call c(u). In
practice, usually we specify c(u) by giving RGB values, c(u) = (r(u), g(u), b(u)),
where r, g, and b are the intensities for red, green, and blue respectively. These
may be integers in a range (e.g. 0 to 255) or, as in Matlab, floating point
numbers in the range from 0 to 1. Matlab uses the commands colormap and
image to establish the color map and display the array of colors. The image is
an array of boxes. Box (i, j) is filled with the color c(u(i, j)).
Surface plots visualize two dimensional surfaces in three dimensional space.
The surface may be the graph of u(x, y). The Matlab commands surf and surfc
create surface plots of graphs. These look nice but often are harder to interpret
than contour plots or grid plots. It also is possible to plot contour surfaces of
a function of three variables. This is the set of (x, y, z) so that u(x, y, z) = u_k.
Unfortunately, it is hard to plot more than one contour surface at a time.
Movies help visualize time dependent data. A movie is a sequence of frames,
with each frame being one of the plots above. For example, we could visualize
the Lorenz attractor with a movie that has the three dimensional butterfly
together with a dot representing the position at time t.

The default in Matlab, and most other quality visualization packages, is
to render the user's data as explicitly as possible. For example, the Matlab
command plot(u) will create a piecewise linear curve that simply connects
successive data points with straight lines. The plot will show the granularity of
the data as well as the values. Similarly, the grid lines will be clearly visible in a
color grid plot. This is good most of the time. For example, the bottom frame
of Figure 8.4 clearly shows the granularity of the data in the wing tips. Since
the curve is sampled at uniform time increments, this shows that the trajectory
is moving faster at the wing tips than near the body where the wings meet.
Some plot packages offer the user the option of smoothing the data using
spline interpolation before plotting. This might make the picture less angular,
but it can obscure features in the data and introduce artifacts, such as
overshoots, not present in the actual data.
8.9 Resources and further reading
There is a beautiful discussion of computational methods for ordinary differential
equations in Numerical Methods by Åke Björck and Germund Dahlquist. It
was Dahlquist who created much of our modern understanding of the subject. A
more recent book is A First Course in the Numerical Analysis of Differential
Equations by Arieh Iserles. The book Numerical Solution of Ordinary Differential
Equations by Lawrence Shampine has a more practical orientation.
There is good public domain software for solving ordinary differential equations.
A particularly good package is LSODE (google it).

The book by Sanz-Serna explains symplectic integrators and their application
to large scale problems such as the dynamics of large biological
molecules. It is an active research area to understand the quantitative relationship
between long time simulations of such systems and the long time behavior
of the systems themselves. Andrew Stuart has written some thoughtful papers
on the subject.
The numerical solution of partial differential equations is a vast subject with
many deep specialized methods for different kinds of problems. For computing
stresses in structures, the current method of choice seems to be finite element
methods. For fluid flow and wave propagation (particularly nonlinear), the
majority relies on finite difference and finite volume methods. For finite differences,
the old book by Richtmyer and Morton still merits study, though there are more up
to date books by Randy LeVeque and by Bertil Gustafsson, Heinz-Otto Kreiss, and
Joseph Oliger.
8.10 Exercises
1. We compute the second error correction u_2(t) in (8.13). For simplicity,
consider only the scalar equation (n = 1). Assuming the error expansion,
we have

    f(x_k) = f( x̃_k + Δt u_1(t_k) + Δt^2 u_2(t_k) + O(Δt^3) )
           ≈ f(x̃_k) + f'(x̃_k) ( Δt u_1(t_k) + Δt^2 u_2(t_k) ) + (1/2) f''(x̃_k) Δt^2 u_1(t_k)^2 + O(Δt^3) .
Also

    ( x(t_k + Δt) - x(t_k) ) / Δt = ẋ(t_k) + (Δt/2) ẍ(t_k) + (Δt^2/6) x^{(3)}(t_k) + O(Δt^3) ,
and

    Δt ( u_1(t_k + Δt) - u_1(t_k) ) / Δt = Δt u̇_1(t_k) + (Δt^2/2) ü_1(t_k) + O(Δt^3) .

Now plug in (8.13) on both sides of (8.5) and collect terms proportional
to Δt^2 to get

    u̇_2 = f'(x(t)) u_2 + (1/6) x^{(3)}(t) + (1/2) f''(x(t)) u_1(t)^2 + ??? .
2. This exercise confirms what was hinted at in Section 8.1, that (8.19) correctly
predicts error growth even for Δt so large that the solution has lost
all accuracy. Suppose k = R/Δt^2, so that t_k = R/Δt. The error equation
(8.19) predicts that the forward Euler approximation x_k has grown by a
factor of e^{R/2} although the exact solution has not grown at all. We can
confirm this by direct calculation. Write the forward Euler approximation
to (8.18) in the form x_{k+1} = A x_k, where A is a 2 × 2 matrix that
depends on Δt. Calculate the eigenvalues of A up to second order in Δt:
λ_1 = 1 + iΔt + aΔt^2 + O(Δt^3), and λ_2 = 1 - iΔt + bΔt^2 + O(Δt^3). Find
the constants a and b. Calculate μ_1 = ln(λ_1) = iΔt + cΔt^2 + O(Δt^3)
so that λ_1 = exp(iΔt + cΔt^2 + O(Δt^3)). Conclude that for k = R/Δt^2,
λ_1^k = exp(k μ_1) = e^{iR/Δt} e^{R/2 + O(Δt)}, which shows that the solution has
grown by a factor of nearly e^{R/2} as predicted by (8.19). This s**t is good
for something!
3. Another two stage second order Runge-Kutta method sometimes is called
the modified Euler method. The first stage uses forward Euler to predict
the x value at the middle of the time step: ξ_1 = (Δt/2) f(x_k, t_k) (so that
x(t_k + Δt/2) ≈ x_k + ξ_1). The second stage uses the midpoint rule with
that estimate of x(t_k + Δt/2) to step to time t_{k+1}: x_{k+1} = x_k + Δt f(t_k +
Δt/2, x_k + ξ_1). Show that this method is second order accurate.
4. Show that applying the four stage Runge-Kutta method to the linear
system (8.30) is equivalent to approximating the fundamental solution
S(Δt) = exp(Δt A) by its Taylor series in Δt up to terms of order Δt^4
(see Exercise ??). Use this to verify that it is fourth order for linear
problems.
5. Write a C/C++ program that solves the initial value problem for (8.1),
with f independent of t, using a constant time step, Δt. The arguments
to the initial value problem solver should be T (the final time), Δt (the
time step), f(x) (specifying the differential equations), n (the number
of components of x), and x_0 (the initial condition). The output should
be the approximation to x(T). The code must do something to preserve
the overall order of accuracy of the method in case T is not an integer
multiple of Δt. The code should be able to switch between the three
methods, forward Euler, second order Adams-Bashforth, and fourth order
four stage Runge-Kutta, with a minimum of code editing. Hand in output for
each of the parts below showing that the code meets the specifications.

(a) The procedure should return an error flag or notify the calling routine
in some way if the number of time steps called for is negative or
impossibly large.

(b) For each of the three methods, verify that the coding is correct by
testing that it gets the right answer, x(.5) = 2, for the scalar equation
ẋ = x^2, x(0) = 1.

(c) For each of the three methods and the test problem of part b, do a
convergence study to verify that the method achieves the theoretical
order of accuracy. For this to work, it is necessary that T should be
an integer multiple of Δt.

(d) Apply your code to problem (8.18) with initial data x_0 = (1, 0)^T.
Repeat the convergence study with T = 10.
6. Verify that the recurrence relation (8.39) is unstable.

(a) Let z be a complex number. Show that the sequence x_k = z^k satisfies
(8.39) if and only if z satisfies 0 = p(z) = z^3 + (3/2) z^2 - 3z + (1/2).

(b) Show that x_k = 1 for all k is a solution of the recurrence relation.
Conclude that z = 1 satisfies p(1) = 0. Verify that this is true.

(c) Use polynomial division (or another method) to factor out the
known root z = 1 from p(z). That is, find a quadratic polynomial,
q(z), so that p(z) = (z - 1) q(z).

(d) Use the quadratic formula and a calculator to find the roots of q as
z = -5/4 ± sqrt(33/16) ≈ -2.686, .186.

(e) Show that the general formula x_k = a z_1^k + b z_2^k + c z_3^k is a solution to
(8.39) if z_1, z_2, and z_3 are the three roots z_1 = 1, z_2 ≈ -2.686, z_3 ≈ .186,
and, conversely, the general solution has this form. Hint: we can find
a, b, c by solving a Vandermonde system (Section 7.4) using x_0, x_1,
and x_2.

(f) Assume that |x_0| ≤ 1, |x_1| ≤ 1, and |x_2| ≤ 1, and that b is on the
order of double precision floating point roundoff (ε_mach) relative to
a and c. Show that for k > 80, x_k is within ε_mach of b z_2^k. Conclude
that for k > 80, the numerical solution has nothing in common with
the actual solution x(t).
7. Applying the implicit trapezoid rule (8.47) to the scalar model problem
(8.31) results in u_{k+1} = m(λΔt) u_k. Find the formula for m and show that
|m| ≤ 1 if Re(λ) ≤ 0, so that |u_{k+1}| ≤ |u_k|. What does this say about the
applicability of the trapezoid rule to stiff problems?
8. Exercise violating time step constraint.
9. Write an adaptive code in C/C++ for the initial value problem (8.1) (8.2)
using the method described in Section 8.4 and the four stage fourth order
Runge-Kutta method. The procedure that does the solving should have
arguments describing the problem, as in Exercise 5, and also the local
truncation error level, e. Apply the method to compute the trajectory
of a comet. In nondimensionalized variables, the equations of motion are
given by the inverse square law:

    d^2/dt^2 (r_1, r_2)^T = - (r_1^2 + r_2^2)^{-3/2} (r_1, r_2)^T .
We always will suppose that the comet starts at t = 0 with r_1 = 10,
r_2 = 0, ṙ_1 = 0, and ṙ_2 = v_0. If v_0 is not too large, the point r(t) traces
out an ellipse in the plane¹⁵. The shape of the ellipse depends on v_0. The
period, P(v_0), is the first time for which r(P) = r(0) and ṙ(P) = ṙ(0).
The solution r(t) is periodic because it has a period.
(a) Verify the correctness of this code by comparing the results to those
from the fixed time step code from Exercise 5 with T = 30 and
v_0 = .2.

(b) Use this program, with a small modification, to compute P(v_0) in
the range .01 ≤ v_0 ≤ .5. You will need a criterion for telling when
the comet has completed one orbit since it will not happen that
r(P) = r(0) exactly. Make sure your code tests for and reports
failure¹⁶.

(c) Choose a value of v_0 for which the orbit is rather but not extremely
elliptical (width about ten times height). Choose a value of e for
which the solution is rather but not extremely accurate: the error is
small but shows up easily on a plot. If you set up your environment
correctly, it should be quick and easy to find suitable parameters
by trial and error using Matlab graphics. Make a plot of a single
period with two curves on one graph. One should be a solid line
representing a highly accurate solution (so accurate that the error is
smaller than the line width, plotting accuracy), and the other the
modestly accurate solution, plotted with a little o for each time
step. Comment on the distribution of the time step points.

(d) For the same parameters as part b, make a single plot that contains
three curves: an accurate computation of r_1(t) as a function of t (solid
line), a modestly accurate computation of r_1 as a function of t (o
for each time step), and Δt as a function of t. You will need to use a
different scale for Δt if for no other reason than that it has different
units. Matlab can put one scale on the left and a different scale on
¹⁵ Isaac Newton formulated these equations and found the explicit solution. Many aspects
of planetary motion (elliptical orbits, sun at one focus, |r × ṙ| = const) had been discovered
observationally by Johannes Kepler. Newton's inverse square law theory fit Kepler's data.
¹⁶ This is not a drill.
the right. It may be necessary to plot Δt on a log scale if it varies
over too wide a range.
(e) Determine the number of adaptive time stages it takes to compute
P(.01) to .1% accuracy (error one part in a thousand) and how many
fixed Δt time step stages it takes to do the same. The counting will
be easier if you do it within the function f.
10. The vibrations of a two dimensional crystal lattice may be modelled in a
crude way using the differential equations¹⁷

    r̈_{jk} = r_{j-1,k} + r_{j+1,k} + r_{j,k-1} + r_{j,k+1} - 4 r_{jk} .   (8.51)

Here r_{jk}(t) represents the displacement (in the vertical direction) of an
atom at the (j, k) location of a square crystal lattice of atoms. Each atom
is bonded to its four neighbors and is pulled toward them with a linear
force. A lattice of size L has 1 ≤ j ≤ L and 1 ≤ k ≤ L. Apply reflecting
boundary conditions along the four boundaries. For example, the equation
for r_{1,k} should use r_{0,k} = r_{1,k}. This is easy to implement using a ghost
cell strategy. Create ghost cells along the boundary and copy appropriate
values from the actual cells to the ghost cells before each evaluation of
f. This allows you to use the formula (8.51) at every point in the lattice.
Start with initial data r_{jk}(0) = 0 and ṙ_{jk}(0) = 0 for all j, k except r_{1,1}(0) = 1.
Compute for L = 100 and T = 300. Use the fourth order Runge-Kutta
method with a fixed time step Δt = .01. Write the solution to a file every
.5 time units then use Matlab to make a movie of the results, with a 2D
color plot of the solution in each frame. The movie should be a circular
wave moving out from the bottom left corner and bouncing off the top and
right boundaries. There should be some beautiful wave patterns inside the
circle that will be hard to see far beyond time t = 100. Hand in a few
of your favorite stills from the movie. If you have a web site, post your
movie for others to enjoy.
¹⁷ This is one of Einstein's contributions to science.
Chapter 9
Monte Carlo methods
Monte Carlo means using random numbers in scientific computing. More
precisely¹, it means using random numbers as a tool to compute something that
is not random. We emphasize this point by distinguishing between Monte Carlo
and simulation. Simulation means producing random variables with a certain
distribution just to look at them. For example, we might have a model of a
random process that produces clouds. We could simulate the model to generate
cloud pictures, either out of scientific interest or for computer graphics. As soon
as we start asking quantitative questions about, say, the average size of a cloud
or the probability that it will rain, we move from pure simulation to Monte
Carlo.
Many Monte Carlo computations have the following form. We express A,
the quantity of interest, in the form

    A = E_f[V(X)] .   (9.1)

This notation² means that X is a random variable or collection of random
variables whose joint distribution is given by f, and V(x) is some scalar quantity
determined by x. For example, x could be the collection of variables describing
a random cloud and V(x) could be the total moisture. The notation X ∼ f
means that X has the distribution f, or is a sample of f.
A Monte Carlo code generates a large number of samples of f, written X_k, for
k = 1, . . . , N. Sections 9.2 and 9.3 describe how this is done. The approximate
answer is

    A ≈ Â_N = (1/N) Σ_{k=1}^{N} V(X_k) .   (9.2)
Section 9.4 discusses the error in (9.2). The error generally is of the order of
N^{-1/2}, which means that it takes roughly four times the computation time
to double the accuracy of a computation. By comparison, the least accurate
integration method in Chapter 3 is the first order rectangle rule, with error
roughly proportional to 1/N.
The total error is the sum of the bias (error in the expected value of the
estimator),

    e_b = E[Â_N] - A ,

and the statistical error,

    e_s = Â_N - E[Â_N] .

The equation (9.1) states that the estimator (9.2) is unbiased, which means
that the error is purely statistical. If the samples X_k are independent of each
other, the error is roughly proportional to the standard deviation, which is the
square root of the variance of the random variable V(X) when X ∼ f:

    σ = sqrt( var_f(V(X)) ) .   (9.3)
¹ This helpful definition is given in the book by Mal Kalos and Paula Whitlock.
² The notation used here is defined in Section 9.1 below.
More sophisticated Monte Carlo estimators have bias as well as statistical error,
but the statistical error generally is much larger. For example, Exercise 1
discusses estimators that have statistical error of order 1/N^{1/2} and bias of order
1/N.
There may be other ways to express A in terms of a random variable. That
is, there may be a random variable, Y ∼ g, with distribution g and a function W
so that A = E_g[W(Y)]. This is variance reduction if var_g[W(Y)] < var_f[V(X)].
We distinguish between Monte Carlo and simulation to emphasize that variance
reduction in Monte Carlo is possible. Even simple variance reduction strategies
can make a big difference in computation time.
Given a choice between Monte Carlo and deterministic methods, the deterministic
method usually is better. The large O(1/sqrt(N)) statistical errors
in Monte Carlo usually make it slower and less accurate than a deterministic
method. For example, if X is a one dimensional random variable with probability
density f(x), then

    A = E[V(X)] = ∫ V(x) f(x) dx .

Estimating the integral by deterministic panel methods as in Section ?? (error
roughly proportional to 1/N for a first order method, 1/N^2 for second order,
etc.) usually estimates A to a given accuracy at a fraction of the cost of Monte
Carlo. Deterministic methods are better than Monte Carlo in most situations
where the deterministic methods are practical.
We use Monte Carlo because of the curse of dimensionality. The curse is
that the work to solve a problem in many dimensions may grow exponentially
with the dimension. For example, consider integrating over ten variables. If we
use twenty points in each coordinate direction, the number of integration points
is 20^10 ≈ 10^13, which is on the edge of what a computer can do in a day. A
Monte Carlo computation might reach the same accuracy with only, say, 10^6
points. There are many high dimensional problems that can be solved, as far
as we know, only by Monte Carlo.

Another curse of high dimensions is that some common intuitions fail. This
can lead to Monte Carlo algorithms that may work in principle but also are
exponentially slow in high dimensions. An example is in Section 9.3.7. The
probability that a point in a cube falls inside a ball is less than 10^-20 in 40
dimensions, which is to say that it will never happen even on a teraflop computer.
One favorable feature of Monte Carlo is that it is possible to estimate the order
of magnitude of statistical error, which is the dominant error in most Monte
Carlo computations. These estimates are often called error bars because they
are indicated as bars on plots. Error bars are statistical confidence intervals,
which rely on elementary or sophisticated statistics depending on the situation.
It is dangerous and unprofessional to do a Monte Carlo computation without
calculating error bars. The practitioner should know how large the error is
likely to be, even if he or she does not include that information in non-technical
presentations.
In Monte Carlo, simple clever ideas can lead to enormous practical improvements
in efficiency and accuracy. This is the main reason I emphasize so strongly
that, while A is given, the algorithm for estimating it is not. The search for more
accurate alternative algorithms is often called variance reduction. Common
variance reduction techniques are importance sampling, antithetic variates, and
control variates.

Many of the examples below are somewhat artificial because I have chosen
not to explain specific applications. The techniques and issues raised here in
the context of toy problems are the main technical points in many real applications.
In many cases, we will start with the probabilistic definition of A, while
in practice, finding this is part of the problem. There are some examples in
later sections of choosing alternate definitions of A to improve the Monte Carlo
calculation.
9.1 Quick review of probability
This is a quick review of the parts of probability needed for the Monte Carlo
material discussed here. Please skim it to see what notation we are using and
check Section 9.7 for references if something is unfamiliar.
9.1.1 Probabilities and events
Probability theory begins with the assumption that there are probabilities asso-
ciated to events. Event B has probability Pr(B), which is a number between 0
and 1 describing what fraction of the time event B would happen if we could
repeat the experiment many times. Pr(B) = 0 and Pr(B) = 1 are allowed. The
exact meaning of probabilities is debated at length elsewhere.
An event is a set of possible outcomes. The set of all possible outcomes is
called $\Omega$ and particular outcomes are called $\omega$. Thus, an event is a subset of the
set of all possible outcomes, $B \subseteq \Omega$. For example, suppose the experiment is to
toss four coins (a penny, a nickel, a dime, and a quarter) on a table and record
whether they were face up (heads) or face down (tails). There are 16 possible
outcomes. The notation THTT means that the penny was face down (tails), the
nickel was up, and the dime and quarter were down. The event "all heads" consists
of a single outcome, HHHH. The event B = "more heads than tails" consists
of the five outcomes: B = {HHHH, THHH, HTHH, HHTH, HHHT}.
The basic set operations apply to events. For example, the intersection of
events B and C is the set of outcomes in both B and in C: $\omega \in B \cap C$ means
$\omega \in B$ and $\omega \in C$. For that reason, $B \cap C$ represents the event "B and
C". For example, if B is the event "more heads than tails" above and C is
the event "the dime was heads", then C has 8 outcomes in it, and $B \cap C$ =
{HHHH, THHH, HTHH, HHHT}. The set union, $B \cup C$, has $\omega \in B \cup C$ if
$\omega \in B$ or $\omega \in C$, so we call it "B or C". One of the axioms of probability is
$$\Pr(B \cup C) = \Pr(B) + \Pr(C)\,, \quad \text{if } B \cap C \text{ is empty.}$$
This implies the intuitive fact that
$$B \subseteq C \implies \Pr(B) \le \Pr(C)\,,$$
for if D is the event "C but not B", then $B \cup D = C$ and $B \cap D$ is empty, so
$\Pr(C) = \Pr(B) + \Pr(D) \ge \Pr(B)$. We also suppose that $\Pr(\Omega) = 1$, which means
that $\Omega$ includes every possible outcome.
The conditional probability of B given C is given by the Bayes formula:
$$\Pr(B \mid C) = \frac{\Pr(B \cap C)}{\Pr(C)} = \frac{\Pr(B \text{ and } C)}{\Pr(C)} \qquad (9.4)$$
Intuitively, this is the probability that a C outcome also is a B outcome. If we
do an experiment and $\omega \in C$, $\Pr(B \mid C)$ is the probability that $\omega \in B$. The
right side of (9.4) represents the fraction of C outcomes that also are in B. We
often know $\Pr(C)$ and $\Pr(B \mid C)$, and use (9.4) to calculate $\Pr(B \cap C)$. Of
course, $\Pr(C \mid C) = 1$. Events B and C are independent if $\Pr(B) = \Pr(B \mid C)$,
which is the same as $\Pr(B \cap C) = \Pr(B)\Pr(C)$, so B being independent of C is
the same as C being independent of B.
The probability space $\Omega$ is finite if it is possible to make a finite list³ of all the
outcomes in $\Omega$: $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$. The space is countable if it is possible to
make a possibly infinite list of all the elements of $\Omega$: $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n, \ldots\}$.
We call $\Omega$ discrete in both cases. When $\Omega$ is discrete, we can specify the prob-
abilities of each outcome
$$f_k = \Pr(\omega = \omega_k)\,.$$
Then an event B has probability
$$\Pr(B) = \sum_{\omega_k \in B} \Pr(\omega_k) = \sum_{\omega_k \in B} f_k\,.$$
9.1.2 Random variables and distributions
A discrete random variable⁴ is a number, $X(\omega)$, that depends on the random
outcome, $\omega$. In the coin tossing example, $X(\omega)$ could be the number of heads.
The expected value is (defining $x_k = X(\omega_k)$)
$$E[X] = \sum_{\omega} X(\omega)\Pr(\omega) = \sum_{k} x_k f_k\,.$$
The distribution of a discrete random variable is determined by the values $x_k$ and
the probabilities $f_k$. Two discrete random variables have the same distribution
if they have the same $x_k$ and $f_k$. It often happens that the possible values $x_k$
are understood and only the probabilities $f_k$ are unknown, for example, if X is
known to be an integer. We say that f (the numbers $f_k$) is the distribution of
X if $\Pr(X = x_k) = f_k$. We write this as $X \sim f$, and say that X is a sample
³Warning: this list is impractically large even in common simple applications.
⁴Warning: sometimes $\omega$ is the random variable and X is a function of a random variable.
of f. If X′ has the same distribution as X and X′ is independent of any event
determined by X, then X′ is an independent sample of X. Independent samples
are uncorrelated: $E[X_j X_k] = E[X_j]E[X_k]$ for all $j \ne k$.
A continuous random variable may be described by a probability density,
f(x), written $X \sim f$. We say $X \sim f$ if $\Pr(X \in B) = \int_B f(x)\,dx$ for any
measurable⁵ set B. If $X = (X_1, \ldots, X_n)$ is a multivariate random variable
with n components, then $B \subseteq \mathbb{R}^n$. If V(x) is a function of n variables, then
$$E[V(X)] = \int_{\mathbb{R}^n} V(x)f(x)\,dx\,.$$
If X is an n component random variable and Y is an m component random
variable, then the pair (X, Y) is a random variable with n + m components.
The probability density h(x, y) is the joint density of X and Y. The marginal
densities f and g, with $X \sim f$ and $Y \sim g$, are given by $f(x) = \int h(x,y)\,dy$ and
$g(y) = \int h(x,y)\,dx$. The random variables X and Y are independent if and only
if h(x, y) = f(x)g(y). The scalar random variables V = V(X) and W = W(Y)
are uncorrelated if $E[VW] = E[V]E[W]$. The random variables X and Y are
independent if and only if V(X) is independent of W(Y) for any (measurable)
functions V and W.
If X is a univariate random variable with probability density f(x), then
it has CDF, or cumulative distribution function (or just distribution function),
$F(x) = \Pr(X \le x) = \int_{-\infty}^{x} f(x')\,dx'$; in the discrete case, $F(x) = \sum_{x_k \le x} f_k$.
Two random variables have the same distribution if they have the same distri-
bution function. It may be convenient to calculate the probability density f(x)
by first calculating the distribution
5
It is an annoying technicality in probability theory that there may be sets or functions
that are so complicated that they are not measurable. If the set B is not measurable, then we
do not dene Pr(B). None of the functions or sets encountered in applied probability fail to
be measurable.
function, F(x), then using
$$f(x)\,dx = \int_x^{x+dx} f(x')\,dx' = F(x+dx) - F(x)\,.$$
The mean is $\mu = E[X]$, and the variance is
$$\sigma^2 = \mathrm{var}[X] = E\big[(X-\mu)^2\big] = \int (x-\mu)^2 f(x)\,dx = E\big[X^2\big] - \mu^2\,.$$
The variance is non-negative, and $\sigma^2 = 0$ only if X is not random. For a
multicomponent random variable, the covariance matrix is the symmetric $n \times n$
matrix,
$$C = E\big[(X-\mu)(X-\mu)^{*}\big] = \int_{\mathbb{R}^n} (x-\mu)(x-\mu)^{*} f(x)\,dx\,. \qquad (9.5)$$
The diagonal entries of C are the variances
$$C_{jj} = \sigma_j^2 = \mathrm{var}[X_j] = E\big[(X_j - \mu_j)^2\big]\,.$$
The off diagonal entries are the covariances
$$C_{jk} = E\big[(X_j - \mu_j)(X_k - \mu_k)\big] = \mathrm{cov}[X_j, X_k]\,.$$
The covariance matrix is positive semidefinite in the sense that for any $y \in \mathbb{R}^n$,
$y^{*}Cy \ge 0$. To see this, for any y, we can define the scalar random variable
$V = y^{*}X$. Then⁶
$$y^{*}Cy = \mathrm{var}[V] \ge 0\,.$$
If $y \ne 0$ and $y^{*}Cy = \mathrm{var}[V] = 0$, then $V = y^{*}X$ is equal to $y^{*}\mu$ and is not
random. This means that either C is positive definite or X always lies on the
hyperplane $y^{*}X = y^{*}\mu$. The calculation is
$$y^{*}Cy = y^{*}\big(E[(X-\mu)(X-\mu)^{*}]\big)y = E\big[\big(y^{*}(X-\mu)\big)\big((X-\mu)^{*}y\big)\big]
= E[y^{*}XX^{*}y] - (y^{*}\mu)(\mu^{*}y) = E\big[V^2\big] - (E[V])^2 = \mathrm{var}[V]\,.$$
⁶We also use the fact that $y^{*}XX^{*}y = (y^{*}X)(X^{*}y) = (y^{*}X)^2$, by the associativity of
matrix and vector multiplication and the fact that $y^{*}X$ is a scalar.
9.1.3 Common random variables
There are several probability distributions that are particularly important in
Monte Carlo. They arise in common applications, and they are useful technical
tools.
Uniform
A uniform random variable is equally likely to be anywhere inside its allowed range.
In one dimension, a univariate random variable U is uniformly distributed in
the interval [a, b] if it has probability density f(u) with f(u) = 0 for $u \notin [a, b]$
and $f(u) = 1/(b-a)$ if $a \le u \le b$. The standard uniform has b = 1 and a = 0.
The standard uniform is important in Monte Carlo because that is what random
number generators try to produce. In higher dimensions, let B be a set with
area b (for n = 2) or volume b (for $n \ge 3$). The n component random variable
X is uniformly distributed if its probability density has the value $f(x) = 1/b$ for
$x \in B$ and f(x) = 0 otherwise.
Exponential
The exponential random variable, T, with rate constant $\lambda > 0$, has probability
density,
$$f(t) = \begin{cases} \lambda e^{-\lambda t} & \text{if } 0 \le t, \\ 0 & \text{if } t < 0. \end{cases} \qquad (9.6)$$
The exponential is a model for the amount of time until something happens
(e.g. how long a light bulb will function before it breaks). It is characterized by
the Markov property: if it has not happened by time t, it is as good as new. To
see this, let $\lambda = f(0)$, which means $\lambda\,dt = \Pr(0 \le T \le dt)$. The random time T
is exponential if all the conditional probabilities are equal to this:
$$\Pr(t \le T \le t+dt \mid T \ge t) = \Pr(T \le dt) = \lambda\,dt\,.$$
Using the Bayes rule (9.4), and the observation that $\Pr(T \le t+dt \text{ and } T \ge t) = f(t)\,dt$,
this becomes
$$\lambda\,dt = \frac{f(t)\,dt}{\Pr(T > t)} = \frac{f(t)\,dt}{1 - F(t)}\,,$$
which implies $\lambda(1 - F(t)) = f(t)$. Differentiating this gives $-\lambda f(t) = f'(t)$,
which implies that $f(t) = Ce^{-\lambda t}$ for $t > 0$. We find $C = \lambda$ using $\int_0^{\infty} f(t)\,dt = 1$.
Independent exponential inter-arrival times generate the Poisson arrival
process. Let $T_k$, for k = 1, 2, ..., be independent exponentials with rate $\lambda$.
The k-th arrival time is
$$S_k = \sum_{j \le k} T_j\,.$$
The expected number of arrivals in the interval $[t_1, t_2]$ is $\lambda(t_2 - t_1)$ and all arrivals
are independent. This is a fairly good model for the arrivals of telephone calls
at a large phone bank.
Gaussian, or normal
We denote the standard normal by Z. The standard normal has PDF
$$f(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,. \qquad (9.7)$$
The general normal with mean $\mu$ and variance $\sigma^2$ is given by $X = \sigma Z + \mu$ and
has PDF
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x-\mu)^2/2\sigma^2}\,. \qquad (9.8)$$
We write $X \sim \mathcal{N}(\mu, \sigma^2)$ in this case. A standard normal has distribution
$\mathcal{N}(0, 1)$. An n component random variable, X, is a multivariate normal with
mean $\mu$ and covariance C if it has probability density⁷
$$f(x) = \frac{1}{Z}\exp\!\left(-\tfrac{1}{2}(x-\mu)^{*}C^{-1}(x-\mu)\right)\,. \qquad (9.10)$$
A linear transformation Y = LX of a multivariate normal X is again multivariate
normal, with covariance matrix $C_Y = LC_XL^{*}$. We derive the covariance formula,
taking $\mu = 0$ without loss of generality (why?), as follows:
$$C_Y = E[YY^{*}] = E\big[(LX)(LX)^{*}\big] = E[LXX^{*}L^{*}] = LE[XX^{*}]L^{*} = LC_XL^{*}\,.$$
9.1.4 Limit theorems
The two important theorems for simple Monte Carlo are the law of large numbers
and the central limit theorem. Suppose A = E[X], that the $X_k$ for k = 1, 2, ...
are iid samples of X, and $\widehat{A}_N$ is the basic estimator (9.2). The law of large
numbers states⁸ that $\widehat{A}_N \to A$ as $N \to \infty$.
More generally, let $\widehat{A}_N$ be a family of Monte Carlo estimators of A. (See
Exercise 1 for other estimators.) As usual, the $\widehat{A}_N$ are random but A is not.
The estimators are consistent if $\widehat{A}_N \to A$ as $N \to \infty$. The law of large numbers
states that the estimators (9.2) are consistent.
⁷The Z here is a normalization constant, not a standard normal random variable. The
double use of Z is unfortunately standard.
⁸More precisely, the Kolmogorov strong law of large numbers states that $\lim_{N\to\infty} \widehat{A}_N = A$
almost surely, i.e. that the probability of the limit not existing or being the wrong answer is
zero. More useful for us is the weak law, which states that, for any $\epsilon > 0$, $\Pr(|\widehat{A}_N - A| > \epsilon) \to 0$
as $N \to \infty$. These are related but they are not the same.
Let V be a random variable with mean A and variance $\sigma^2$. Let $V_k$ be
independent samples of V, and
$$\widehat{A}_N = \frac{1}{N}\sum_{k=1}^{N} V_k\,, \qquad R_N = \widehat{A}_N - A\,.$$
A simple calculation shows that $E[R_N] = 0$ and $\mathrm{var}[R_N] = \sigma^2/N$. The central
limit theorem states that for large N, the distribution of $R_N$ is approximately
normal with that mean and variance. That is to say that the distribution of
$R_N$ is approximately the same as that of $(\sigma/\sqrt{N})\,Z$, where Z is a standard
normal. In this form, it says that the error in a Monte Carlo
computation is of the order of $1/\sqrt{N}$.
The diagonal entries of the transition rate matrix follow the convention
$r_{jj} = -\sum_{k\ne j} r_{jk}$. With this convention, the occupation probabilities, $f_j(t) = \Pr(X(t) = j)$,
satisfy
$$\frac{df_j}{dt} = \sum_{k=1}^{m} f_k(t)\,r_{kj}\,.$$
A common way to simulate a continuous time Markov chain is to use the
embedded discrete time chain. Suppose the state at time t is X(t) = j. Then
the time until the next transition is an exponential random variable with rate
$\lambda_j = \sum_{k\ne j} r_{jk}$. The probability that this transition is to state k is
$$p_{jk} = \frac{r_{jk}}{\sum_{l\ne j} r_{jl}}\,. \qquad (9.13)$$
To simulate the continuous chain, we start with the given $j_0 = X(0)$. We choose
$T_1$ to be an exponential with rate parameter $\lambda_{j_0}$. We set $S_1 = T_1$ and choose⁹
$X(S_1) = k = j_1$ according to the probabilities (9.13). Suppose the time of
⁹If $S_i$ is the time of the i-th transition, we set $X(S_i)$ to be the new state rather than the old
one. This convention is called CADLAG, from the French "continue à droite, limite à gauche"
(continuous from the right, limit from the left).
the n-th transition is $S_n$ and the transition takes us to state $j_n$. The time to
the next transition is $T_{n+1}$, which is an exponential with rate parameter $\lambda_{j_n}$.
The next state is $j_{n+1} = k$ with probability $p_{j_n,k}$. The next transition time is
$S_{n+1} = S_n + T_{n+1}$.
9.2 Random number generators
The random variables used in Monte Carlo are generated by a (pseudo) random
number generator. The procedure double rng() is a perfect random number
generator if
for( k=0; k<n; k++ ) U[k] = rng();
produces an array of iid standard uniform random variables. The best available
random number generators are nearly perfect in this sense for most Monte Carlo
applications. The native C/C++ procedure random() is good enough for most
Monte Carlo (I use it).
Bad ones, such as the native rand() in C/C++ and the procedure in Numerical
Recipes, give incorrect results in common simple cases. If there is a random
number generator of unknown origin being passed from person to person in the
office, do not use it (without a condom).
The computer itself is not random. A pseudo random number generator
simulates randomness without actually being random. The seed is a collection of
m integer variables: int seed[m];. Assuming standard C/C++ 32 bit integers,
the number of bits in the seed is 32m. There is a seed update function $\Phi(s)$ and
an output function $\Psi(s)$. The update function produces a new seed: $s' = \Phi(s)$.
The output function produces a floating point number (check the precision)
$u = \Psi(s) \in [0, 1]$. One call u = rng(); has the effect
$$s \leftarrow \Phi(s)\,; \quad \text{return } u = \Psi(s)\,;$$
The random number generator should come with procedures s = getSeed();
and setSeed(s);, with obvious functions. Most random number generators set
the initial seed to the value of the system clock as a default if the program
has no setSeed(s); command. We use setSeed(s) and getSeed() for two
things. If the program starts with setSeed(s);, then the sequence of seeds and
random numbers will be the same each run. This is helpful in debugging and
reproducing results across platforms. Most random number generators have a
built-in way to get the initial seed if the program does not set one. One common
way is to use the least significant bits from the system clock. In this way, if
you run the program twice, the results will be different. The results of Figure
?? were obtained in that way. It sometimes is recommended to warm up the
random number generator by throwing away the first few numbers:

setSeed(17);  // Start with the seed equal to 17.
int RandWarmUp = 100;
for ( int i = 0; i < RandWarmUp; i++)
    rng();    // Throw away the first RandWarmUp random numbers.
This is harmless but with modern random number generators unnecessary. The
other use is checkpointing. Some Monte Carlo runs take so long that there is
a real chance the computer will crash during the run. We avoid losing too
much work by storing the state of the computation to disk every so often. If
the machine crashes, we restart from the most recent checkpoint. The random
number generator seed is part of the checkpoint data.
The simplest random number generators use linear congruences. The seed
represents an integer in the range $0 \le s < c$ and $\Phi$ is the linear congruence (a
and b positive integers) $s' = \Phi(s) = (as + b) \bmod c$.

If U is a standard uniform and $\lambda > 0$, then
$$T = -\frac{1}{\lambda}\,\ln(U) \qquad (9.14)$$
is an exponential with rate parameter $\lambda$. Before checking this, note first that
U > 0 so ln(U) is defined, and U < 1 so ln(U) is negative and T > 0. Next,
since $\lambda$ is a rate, it has units of 1/Time, so (9.14) produces a positive number
with the correct units. The code T = -(1/lambda)*log( rng() ); generates
the exponential.
We verify (9.14) using the informal probability method. Let f(t) be the PDF
of the random variable of (9.14). We want to show $f(t) = \lambda e^{-\lambda t}$ for t > 0. Let
B be the event $t \le T \le t + dt$. This is the same as
$$t \le -\frac{1}{\lambda}\ln(U) \le t + dt\,,$$
which is the same as
$$-\lambda t - \lambda\,dt \le \ln(U) \le -\lambda t \quad \text{(all negative)},$$
and, because $e^x$ is an increasing function of x,
$$e^{-\lambda t - \lambda\,dt} \le U \le e^{-\lambda t}\,.$$
Now, $e^{-\lambda t - \lambda\,dt} = e^{-\lambda t}e^{-\lambda\,dt}$, and $e^{-\lambda\,dt} \approx 1 - \lambda\,dt$, so this is
$$e^{-\lambda t} - \lambda\,dt\,e^{-\lambda t} \le U \le e^{-\lambda t}\,.$$
But this is an interval within [0, 1] of length $\lambda\,dt\,e^{-\lambda t}$, so
$$f(t)\,dt = \Pr(t \le T \le t+dt)
= \Pr\!\left(e^{-\lambda t} - \lambda\,dt\,e^{-\lambda t} \le U \le e^{-\lambda t}\right)
= \lambda\,dt\,e^{-\lambda t}\,,$$
which shows that (9.14) gives an exponential.
9.3.3 Markov chains
To simulate a discrete time Markov chain, we need to make a sequence of random
choices from m possibilities. One way to do this uses the idea behind Bernoulli
sampling. Suppose X(t) = j and we want to choose X(t + 1) according to the
probabilities (9.11). Define the partial sums
$$s_{jk} = \sum_{l<k} p_{jl}\,,$$
so that $s_{j1} = 0$, $s_{j2} = p_{j1} = \Pr(X(t+1) = 1)$, $s_{j3} = p_{j1} + p_{j2} = \Pr(X(t+1) = 1 \text{ or } X(t+1) = 2)$,
etc. The algorithm is to set X(t+1) = k if $s_{jk} \le U < s_{j,k+1}$.
This insures that $\Pr(X(t+1) = k) = p_{jk}$, since $\Pr(s_{jk} \le U < s_{j,k+1}) = s_{j,k+1} - s_{jk} = p_{jk}$.
In specific applications, the description of the Markov chain
often suggests different ways to sample X(t + 1).
We can simulate a continuous time Markov chain by simulating the embedded
discrete time chain and the transition times. Let $S_n$ be the times at which
the state changes, the transition times. Suppose the transition is from state
$j = X(S_n - 0)$, which is known, to state $k = X(S_n + 0)$, which must be chosen.
The probability of $j \to k$ is given by (9.13). The probability of $j \to j$ is
zero, since $S_n$ is a transition time. Once we know k, the inter-transition time
is an exponential with rate $\lambda_k$, so we can take (see (9.14); X is the state and
TRate[j] is $\lambda_j$)
S = S - log( rng() )/ TRate[X]; // Update the time variable.
9.3.4 Using the distribution function
In principle, the CDF provides a simple sampler for any one dimensional prob-
ability distribution. If X is a one component random variable with probability
density f(x), the cumulative distribution function is $F(x) = \Pr(X \le x) = \int_{-\infty}^{x} f(x')\,dx'$.
Of course $0 \le F(x) \le 1$ for all x, and for any $u \in [0, 1]$, there
is an x with F(x) = u. The reader can check that if F(X) = U then $X \sim f$
if and only if U is a standard uniform. Therefore, if double fSolve( double
u) is a procedure that returns x with F(x) = u, then X = fSolve( rng() );
produces an independent X sample each time. The procedure fSolve might in-
volve a simple formula in simple cases. Harder problems might require Newton's
method with a careful choice of initial guess. This procedure could be the main
bottleneck of a big Monte Carlo code, so it would be worth the programmer's
time to make it fast.
For the exponential random variable,
$$F(t) = \Pr(T \le t) = \int_{t'=0}^{t} \lambda e^{-\lambda t'}\,dt' = 1 - e^{-\lambda t}\,.$$
Solving F(t) = u gives $t = -\frac{1}{\lambda}\ln(1-u)$, which is equivalent to (9.14) because
1 − U is a standard uniform when U is. For the standard normal, the distribution
function is
$$F(z) = \Pr(Z \le z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\,e^{-z'^2/2}\,dz' = N(z)\,. \qquad (9.15)$$
There is no elementary¹⁰ formula for the cumulative normal, N(z), but there is
good software to evaluate it to nearly double precision accuracy, both for N(z)
and for the inverse cumulative normal $z = N^{-1}(u)$. In many applications,¹¹
this is the best way to make standard normals. The general $X \sim \mathcal{N}(\mu, \sigma^2)$ may
be found using $X = \sigma Z + \mu$.
¹⁰An elementary function is one that can be computed using sums and differences, products
and quotients, exponentials and logs, trigonometric functions, and powers.
¹¹There are applications where the relationship between Z and U is important, not only
the value of Z. These include sampling using normal copulas, and quasi (low discrepancy
sequence) Monte Carlo.
9.3.5 The Box Muller method
The Box Muller algorithm generates two independent standard normals from
two independent standard uniforms. The formulas are
$$R = \sqrt{-2\ln(U_1)}\,,$$
$$\Theta = 2\pi U_2\,,$$
$$Z_1 = R\cos(\Theta)\,,$$
$$Z_2 = R\sin(\Theta)\,.$$
We can make a thousand independent standard normals by making a thousand
standard uniforms then using them in pairs to generate five hundred pairs of
independent standard normals.
The idea behind the Box Muller method is related to the brilliant elementary
derivation of the formula $\int_{-\infty}^{\infty} e^{-z^2/2}\,dz = \sqrt{2\pi}$. Let $Z = (Z_1, Z_2)$ be a bivariate
normal whose components are independent univariate standard normals. Since
$Z_1$ and $Z_2$ are independent, the joint PDF of Z is
$$f(z) = \frac{1}{\sqrt{2\pi}}e^{-z_1^2/2}\cdot\frac{1}{\sqrt{2\pi}}e^{-z_2^2/2} = \frac{1}{2\pi}e^{-(z_1^2+z_2^2)/2}\,. \qquad (9.16)$$
Let R and $\Theta$ be polar coordinates for Z, which means that $Z_1 = R\cos(\Theta)$
and $Z_2 = R\sin(\Theta)$. From (9.16) it is clear that the probability density is
radially symmetric, so $\Theta$ is uniformly distributed in the interval $[0, 2\pi]$, and $\Theta$
is independent of R. The distribution function of R is
$$F(r) = \Pr(R \le r) = \int_{\rho=0}^{r}\int_{\theta=0}^{2\pi}\frac{1}{2\pi}e^{-\rho^2/2}\,\rho\,d\theta\,d\rho
= \int_{\rho=0}^{r} e^{-\rho^2/2}\,\rho\,d\rho\,.$$
With the change of variables (the trick behind the integral) $t = \rho^2/2$, $dt = \rho\,d\rho$,
this becomes
$$\int_{t=0}^{r^2/2} e^{-t}\,dt = 1 - e^{-r^2/2}\,.$$
To sample R, we solve $1 - U = F(R)$ (recall that 1 − U is a standard uniform
if U is a standard uniform), which gives $R = \sqrt{-2\ln(U)}$, as claimed.
9.3.6 Multivariate normals
Let $X \in \mathbb{R}^n$ be a multivariate normal random variable with mean zero and
covariance matrix C. We can sample X using the Choleski factorization $C = LL^{T}$,
where L is lower triangular. Note that L exists because C is symmetric
and positive definite. Let $Z \in \mathbb{R}^n$ be a vector of n independent standard normals
generated using Box Muller or any other way. The covariance matrix of Z is I
(check this). Therefore, if
$$X = LZ\,, \qquad (9.17)$$
then X is multivariate normal (because it is a linear transformation of a multi-
variate normal) and has covariance matrix (see (9.10))
$$C_X = LIL^{T} = C\,.$$
If we want a multivariate normal with mean $\mu \in \mathbb{R}^n$, we simply take $X = LZ + \mu$.
In some applications we prefer to work with $H = C^{-1}$ rather than with C. Exercise 6
has an example.
9.3.7 Rejection
The rejection algorithm turns samples from one density into samples of another.
Not only is it useful for sampling, but the idea is the basis of the Metropolis
algorithm in Markov Chain Monte Carlo (see Exercise 7). One ingredient is
a simple sampler for the trial distribution, or proposal distribution. Suppose
gSamp() produces iid samples from the PDF g(x). The other ingredient is
an acceptance probability, p(x), with $0 \le p(x) \le 1$ for all x. The algorithm
generates a trial, $X \sim g$, and accepts this trial value with probability p(X).
The process is repeated until the first acceptance. All this happens in

while ( rng() > p( X = gSamp() ) );                        (9.18)

We accept X if $U \le p(X)$, so U > p(X) means reject and try again. Each time
we generate a new X, which must be independent of all previous ones.
The X returned by (9.18) has PDF
$$f(x) = \frac{1}{Z}\,p(x)g(x)\,, \qquad (9.19)$$
where Z is a normalization constant that insures that $\int f(x)\,dx = 1$:
$$Z = \int_{x\in\mathbb{R}^n} p(x)g(x)\,dx\,. \qquad (9.20)$$
This shows that Z is the probability that any given trial will be an acceptance.
The formula (9.19) shows that rejection goes from g to f by thinning out samples
in regions where f < g by rejecting some of them. We verify it informally in
the one dimensional setting:
$$f(x)\,dx = \Pr(\text{accepted } X \in (x, x+dx))
= \Pr(\text{generated } X \in (x, x+dx) \mid \text{acceptance})
= \frac{\Pr(\text{generated } X \in (x, x+dx) \text{ and accepted})}{\Pr(\text{accepted})}
= \frac{g(x)\,dx\;p(x)}{Z}\,,$$
where Z is the probability that a given trial generates an acceptance. An argument
like this also shows the correctness of (9.19) for multivariate random
variables.
We can use rejection to generate normals from exponentials. Suppose $g(x) = e^{-x}$
for x > 0, corresponding to a standard exponential, and $f(x) = \sqrt{\frac{2}{\pi}}\,e^{-x^2/2}$
for x > 0, corresponding to the positive half of the standard normal distribution.
Then (9.19) becomes
$$p(x) = Z\,\frac{f(x)}{g(x)} = Z\sqrt{\frac{2}{\pi}}\,\frac{e^{-x^2/2}}{e^{-x}} = Z\sqrt{\frac{2}{\pi}}\,e^{x - x^2/2}\,. \qquad (9.21)$$
This would be a formula for p(x) if we knew the constant, Z.
We maximize the efficiency of the algorithm by making Z, the overall prob-
ability of acceptance, as large as possible, subject to the constraint $p(x) \le 1$ for
all x. Therefore, we find the x that maximizes the right side:
$$e^{x - x^2/2} = \max \iff x - \frac{x^2}{2} = \max \iff x_{\max} = 1\,.$$
Choosing Z so that the maximum of p(x) is one gives
$$1 = p_{\max} = Z\sqrt{\frac{2}{\pi}}\,e^{x_{\max} - x_{\max}^2/2} = Z\sqrt{\frac{2}{\pi}}\,e^{1/2}\,,$$
so
$$p(x) = \frac{1}{\sqrt{e}}\,e^{x - x^2/2}\,. \qquad (9.22)$$
It is impossible to go the other way. If we try to generate a standard expo-
nential from a positive standard normal we get an acceptance probability related
to the reciprocal of (9.21):
$$p(x) = Z\sqrt{\frac{\pi}{2}}\,e^{x^2/2 - x}\,.$$
This gives $p(x) \to \infty$ as $x \to \infty$ for any Z > 0. The normal has thinner¹² tails
than the exponential. It is possible to start with an exponential and thin the
tails using rejection to get a Gaussian (note: (9.21) has $p(x) \to 0$ as $x \to \infty$).
However, rejection cannot fatten the tails by more than a factor of $\frac{1}{Z}$. In
particular, rejection cannot fatten a Gaussian tail to an exponential.
The efficiency of a rejection sampler is the expected number of trials needed
to generate a sample. Let N be the number of trials to get a success. The
efficiency is
$$E[N] = 1\cdot\Pr(N = 1) + 2\cdot\Pr(N = 2) + \cdots$$
¹²The tails of a probability density are the parts for large x, where the graph of f(x) gets
thinner, like the tail of a mouse.
We already saw that Pr(N = 1) = Z. To have N = 2, we need first a rejection
then an acceptance, so Pr(N = 2) = (1 − Z)Z. Similarly, $\Pr(N = k) = (1-Z)^{k-1}Z$.
Finally, we have the geometric series formulas for 0 < r < 1:
$$\sum_{k=0}^{\infty} r^k = \frac{1}{1-r}\,, \qquad
\sum_{k=1}^{\infty} k\,r^{k-1} = \frac{d}{dr}\sum_{k=0}^{\infty} r^k = \frac{1}{(1-r)^2}\,.$$
Applying these to r = 1 − Z gives $E[N] = \frac{1}{Z}$. In generating a standard normal
from a standard exponential, we get
$$Z = \sqrt{\frac{\pi}{2e}} \approx .76\,.$$
The sampler is efficient in that more than 75% of the trials are successes.
The sampler is ecient in that more than 75% of the trials are successes.
Rejection samplers for other distributions, particularly in high dimensions,
can be much worse. We give a rejection algorithm for nding a random point
uniformly distributed inside the unit ball n dimensions. The algorithm is correct
for any n in the sense that it produces at the end of the day a random point with
the desired probability density. For small n, it even works in practice and is not
such a bad idea. However, for large n the algorithm is very inecient. In fact, Z
is an exponentially decreasing function of n. It would take more than a century
on any present computer to generate a point uniformly distributed inside the
unit ball in n = 100 dimensions this way. Fortunately, there are better ways.
A point in n dimensions is x = (x
1
, . . . , x
n
). The unit ball is the set of points
with
n
k=1
x
2
k
1. We will use a trial density that is uniform inside the smallest
(hyper)cube that contains the unit ball. This is the cube with 1 x
k
1 for
each k. The uniform density in this cube is
g(x
1
, . . . , x
n
) =
_
2
n
if [x
k
[ 1 for all k = 1, . . . , n
0 otherwise.
This density is a product of the one dimensional uniform densities, so we can
sample it by choosing n independent standard uniforms:
for( k = 0; k < n; k++) x[k] = 2*rng() - 1; // unif in [-1,1].
We get a random point inside the unit ball if we simply reject samples outside
the ball:
while(1) { // The rejection loop (possibly infinite!)
for( k = 0; k < n; k++) x[k] = 2*rng() - 1; // Generate a trial vector
// of independent uniforms
// in [-1,1].
ssq = 0; // ssq means "sum of squares"
for( k = 0; k < n; k++ ) ssq+= x[k]*x[k];
if ( ssq <= 1.) break ; // You accepted and are done with the loop.
// Otherwise go back and do it again.
}
Table 9.1: Acceptance fractions for producing a random point in the unit ball
in n dimensions by rejection.

dimension   vol(ball)                          vol(cube)   ratio                 -ln(ratio)/dim
    2       $\pi$                                  4        .79                   .12
    3       $4\pi/3$                               8        .52                   .22
    4       $\pi^2/2$                             16        .31                   .29
   10       $\pi^{n/2}/\Gamma(\frac{n}{2}+1)$    $2^n$      .0025                 .60
   20       $\pi^{n/2}/\Gamma(\frac{n}{2}+1)$    $2^n$      $2.5\times10^{-8}$    .88
   40       $\pi^{n/2}/\Gamma(\frac{n}{2}+1)$    $2^n$      $3.3\times10^{-21}$   1.2
The probability of accepting in a given trial is equal to the ratio of the
volume (or area in 2D) of the ball to the cube that contains it. In 2D this is
$$\frac{\mathrm{area(disk)}}{\mathrm{area(square)}} = \frac{\pi}{4} \approx .79\,,$$
which is pretty healthy. In 3D it is
$$\frac{\mathrm{vol(ball)}}{\mathrm{vol(cube)}} = \frac{4\pi/3}{8} \approx .52\,.$$
Table 9.1 shows what happens as the dimension increases. By the time the
dimension reaches n = 10, the expected number of trials to get a success is
about 1/.0025 = 400, which is slow but not entirely impractical. For dimension
n = 40, the expected number of trials has grown to about $3\times 10^{20}$, which is
entirely impractical. Monte Carlo simulations in more than 40 dimensions are
common. The last column of the table shows that the acceptance probability
goes to zero faster than any exponential of the form $e^{-cn}$, because the numbers
that would be c, listed in the table, increase with n.
9.3.8 Histograms and testing
Any piece of scientific software is presumed wrong until it proves itself correct
in tests. We can test a one dimensional sampler using a histogram. Divide the
x axis into neighboring bins of length $\Delta x$ centered about bin centers $x_j = j\,\Delta x$.
The corresponding bins are $B_j = [x_j - \frac{\Delta x}{2},\, x_j + \frac{\Delta x}{2}]$. With n samples, the
bin counts are¹³ $N_j = \#\{X_k \in B_j,\ 1 \le k \le n\}$. The probability that a given
sample lands in $B_j$ is $\Pr(B_j) = \int_{x\in B_j} f(x)\,dx \approx \Delta x\,f(x_j)$. The expected bin
count is $E[N_j] \approx n\,\Delta x\,f(x_j)$, and the standard deviation (see Section 9.4) is
$\sigma_{N_j} \approx \sqrt{n\,\Delta x}\,\sqrt{f(x_j)}$. We generate the samples and plot the $N_j$ and $E[N_j]$ on
the same plot. If $E[N_j] \gg \sigma_{N_j}$, then the two curves should be relatively close.
This condition is
$$1 \ll \sqrt{n\,\Delta x}\,\sqrt{f(x_j)}\,.$$
¹³Here #{ } means the number of elements in the set { }.
In particular, if f is of order one, $\Delta x = .01$, and $n = 10^6$, we should have reason-
able agreement if the sampler is correct. If $\Delta x$ is too large, the approximation
$\int_{x\in B_j} f(x)\,dx \approx \Delta x\,f(x_j)$ will not hold. If $\Delta x$ is too small, the histogram will
not be accurate.
It is harder to test higher dimensional random variables. We can test two and
possibly three dimensional random variables using multidimensional histograms.
We can test that various one dimensional functions of the random X have the
right distributions. For example, the distributions of $R^2 = \sum X_k^2 = \|X\|_{l^2}^2$ and
$Y = \sum a_k X_k = a \cdot X$ are easy to figure out if X is uniformly distributed in the
ball.
9.4 Error bars
For large N, the error in (9.2) is governed by the central limit theorem. This is
because the numbers $V_k = V(X_k)$ are independent random variables with mean
A and variance $\sigma^2$, with $\sigma$ given by (9.3). The theorem states that $\widehat{A}_N - A$
is approximately normal with mean zero and variance¹⁴ $\sigma^2/N$. This may be
expressed by writing ($\sim$ indicates that the two sides have approximately the
same distribution)
$$\widehat{A}_N - A \sim \frac{\sigma}{\sqrt{N}}\,Z\,, \qquad (9.23)$$
where $Z \sim \mathcal{N}(0, 1)$ is a standard normal.
The normal distribution is tabulated in any statistics book or scientific soft-
ware package. In particular, $\Pr(|Z| \le 1) \approx .67$ and $\Pr(|Z| \le 2) \approx .95$. This
means that
$$\Pr\!\left(\left|\widehat{A}_N - A\right| \le \frac{\sigma}{\sqrt{N}}\right) \approx .67\,. \qquad (9.24)$$
In the computation, A is unknown but $\widehat{A}_N$ is known, along with N and (ap-
proximately, see below) $\sigma$. The equation (9.24) then says that the confidence
interval,
$$\left[\,\widehat{A}_N - \frac{\sigma}{\sqrt{N}}\,,\ \widehat{A}_N + \frac{\sigma}{\sqrt{N}}\,\right], \qquad (9.25)$$
has a 67% chance of containing the actual answer, A. The interval (9.25) is
called the one standard deviation error bar because it is indicated as a bar in
plots. The center of the bar is $\widehat{A}_N$. The interval itself is a line segment (a bar)
that extends $\sigma/\sqrt{N}$ in either direction about its center.
There is a standard way to estimate $\sigma$ from the Monte Carlo data, starting
with
$$\sigma^2 = \mathrm{var}[V(X)] = E\Big[\big(V(X) - A\big)^2\Big] \approx \frac{1}{N}\sum_{k=1}^{N}\big(V(X_k) - A\big)^2\,.$$
¹⁴The mean and variance are exact. Only the normality is approximate.
Although A is unknown, using $\widehat{A}_N$ is good enough. This gives the variance
estimator:¹⁵
$$\widehat{\sigma}_N^2 = \frac{1}{N}\sum_{k=1}^{N}\Big(V(X_k) - \widehat{A}_N\Big)^2\,. \qquad (9.26)$$
The full Monte Carlo estimation procedure is as follows. Generate N in-
dependent samples $X_k \sim f$ and evaluate the numbers $V(X_k)$. Calculate the
sample mean $\widehat{A}_N$ from (9.2), and the sample variance $\widehat{\sigma}_N^2$ from (9.26). The one
standard deviation error bar half width is
$$\frac{\widehat{\sigma}_N}{\sqrt{N}} = \frac{1}{\sqrt{N}}\sqrt{\widehat{\sigma}_N^2}\,. \qquad (9.27)$$
The person doing the Monte Carlo computation should be aware of the error
bar size. Error bars should be included in any presentation of Monte Carlo
results that are intended for technically trained people. This includes reports
in technical journals and homework assignments for Scientific Computing.
The error bar estimate (9.27) is not exact. It has statistical error of order
1/\sqrt{N} and bias on the order of 1/N. But there is a sensible saying: do not
put error bars on error bars. Even if we knew the exact value of \sigma, we
would have only a rough idea of the actual error \widehat{A}_N - A. Note that
the estimate (9.26) has a bias on the order of 1/N. Even if we used the unbiased
estimate of \sigma^2, which has N-1 instead of N, the estimate of \sigma still
would be biased because the square root function is nonlinear. The bias in
(9.27) would be on the order of 1/N in either case.
It is common to report a Monte Carlo result in the form A = \widehat{A}_N \pm
error bar or A = \widehat{A}_N(error bar). For example, writing A = .342 \pm .005
or A = .342(.005) indicates that \widehat{A}_N = .342 and
\widehat{\sigma}_N/\sqrt{N} = .005.

    \widehat{A}_N = \frac{1}{N} \sum_{k=1}^{N}
        \Bigl( V(X_k) - \alpha \bigl( W(X_k) - B \bigr) \Bigr) .
The optimal value of \alpha, the value that minimizes the variance, is (see
Exercise 9) \alpha = \rho_{VW} \sqrt{\mathrm{var}[V]/\mathrm{var}[W]}. In
practical applications, we would estimate \alpha from the same Monte Carlo
data. It is not likely that we would know \mathrm{cov}[V, W] without knowing
A = E[V].
Obviously, this method depends on having a good control variate.
9.5.2 Antithetic variates
We say that R is a symmetry of a probability distribution f if the random
variable Y = R(X) is distributed as f whenever X is. For example, if f(x)
is a probability density so that f(-x) = f(x) (e.g. a standard normal), then
R(x) = -x is a symmetry of f. If U is a standard uniform then 1 - U = R(U)
also is standard uniform. If X is a sample of f (X \sim f), then Y = R(X)
also is a sample of f. This is called the antithetic sample. The method of
antithetic variates is to use antithetic pairs of samples. If A = E_f[V(X)], then
also A = E_f[V(R(X))] and

    A = E\left[ \tfrac{1}{2} \bigl( V(X) + V(R(X)) \bigr) \right] .
9.5.3 Importance sampling

Suppose X is a multivariate random variable with probability density f(x) and
that we want to estimate

    A = E_f[V(X)] = \int_{\mathbf{R}^n} V(x) f(x)\, dx .    (9.29)

Let g(x) be any other probability density subject only to the condition that
g(x) > 0 whenever f(x) > 0. The likelihood ratio is L(x) = f(x)/g(x). Impor-
tance sampling is the strategy of rewriting (9.29) as

    A = \int_{\mathbf{R}^n} V(x) L(x) g(x)\, dx = E_g[V(X) L(X)] .    (9.30)
The variance is reduced if

    \mathrm{var}_g[V(X) L(X)] < \mathrm{var}_f[V(X)] .
The term importance sampling comes from the idea that the values of x
that are most important in the expected value of V (X) are unlikely in the f
distribution. The g distribution puts more weight in these important areas.
9.6 Software: performance issues

Monte Carlo methods raise many performance issues. Naive coding following
the text can lead to poor performance. Two significant factors are frequent
branching and frequent procedure calls.
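One common tactic, sketched here with an arbitrary block size and integrand rather than anything prescribed by the text, is to separate random number generation from evaluation: fill an array of uniforms in one pass, then process it in a tight inner loop with no procedure calls or branches.

```c
#include <stdlib.h>

#define BLOCK 4096   /* illustrative batch size */

/* Estimate E[U^2] = 1/3 in batches: one loop of RNG calls fills a
   block, then a call-free, branch-free loop consumes it.         */
double mcBlocked(long N) {
    double u[BLOCK], sum = 0.0;
    long done = 0;
    while (done < N) {
        long m = (N - done < BLOCK) ? N - done : BLOCK;
        for (long k = 0; k < m; k++)      /* batch of generator calls */
            u[k] = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        for (long k = 0; k < m; k++)      /* tight evaluation loop    */
            sum += u[k] * u[k];
        done += m;
    }
    return sum / N;
}
```

The payoff depends on the compiler and the generator, but keeping the evaluation loop free of calls and branches gives the optimizer its best chance.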
9.7 Resources and further reading
There are many good books on the probability background for Monte Carlo:
the book by Sheldon Ross at the basic level, and the book by Sam Karlin and
Gregory Taylor for the more ambitious. Good books on Monte Carlo include
the still surprisingly useful book by Hammersley and Handscomb, the physicists'
book (which gets the math right) by Mal Kalos and Paula Whitlock, and the
broader book by George Fishman. I also am preparing a book on Monte Carlo,
with many parts already posted.

The sampling methods of Section 9.3 are all simple samplers. They produce
independent samples of f. There are many applications for which there is no
known practical simple sampler. Most of these can be treated by dynamic
samplers that produce X_k that are not independent of each other. The best
known such Markov chain Monte Carlo method is the Metropolis algorithm.
The disadvantage of MCMC samplers is that the standard deviation of the
estimator (9.2) may be much larger than \sigma/\sqrt{N}.

    \mu(t) = \frac{1}{\sqrt{2\pi t}} \int_{-1}^{1} x^2 e^{-x^2/2t}\, dx .    (9.31)

If we estimate \mu(t) by

    \widehat{\mu}(t) = \frac{1}{N} \sum_{k=1}^{N} V(X_k) ,

the statistical error will be different for each value of t (see Exercise 8).
In computational chemistry, using the embedded discrete time Markov chain
as in Section 9.3.3 is called SSA (for stochastic simulation algorithm) and
is attributed to Gillespie. Event driven simulation is a more elaborate but
often more efficient way to simulate complex Markov chains, either in discrete
or continuous time.

Choosing a good random number generator is important yet subtle. The
native C/C++ function rand() is suitable only for the simplest applications
because it cycles after only a billion samples. The function random() is much
better. The random number generators in Matlab are good, which cannot be
said for the generators in other scientific computing and visualization packages.
George Marsaglia has a web site with the latest and best random number gen-
erators.
9.8 Exercises
1. Suppose we wish to estimate A = \Phi(B), where B = E_f[V(X)]. For
   example, suppose E[X] = 0 and we want the standard deviation
   \sigma = E[X^2]^{1/2}. Here B = E[X^2], V(x) = x^2, and
   \Phi(B) = \sqrt{B}. Let \widehat{B}_N be an unbiased estimator of B whose
   variance is \sigma^2_B/N. If we estimate A using
   \widehat{A}_N = \Phi(\widehat{B}_N), and N is large, find approximate
   expressions for the bias and variance of \widehat{A}_N. Show that
   \widehat{A}_N is biased in general, but that the bias is much smaller than
   the statistical error for large N. Hint: assume N is large enough so that
   \widehat{B}_N \approx B and

       \Phi(\widehat{B}_N) - \Phi(B) \approx
           \Phi'(B)(\widehat{B}_N - B)
           + \tfrac{1}{2}\Phi''(B)(\widehat{B}_N - B)^2 .

   Use this to estimate the bias, E[\widehat{A}_N - A], and the variance
   \mathrm{var}[\widehat{A}_N]. What are the powers of N in the bias and
   statistical error?
2. Let (X, Y) be a bivariate random variable that is uniformly distributed
   in the unit disk. This means that h(x, y), the joint density of X and Y, is
   equal to 1/\pi if x^2 + y^2 \le 1, and h(x, y) = 0 otherwise. Show that X
   and Y are uncorrelated but not independent (hint: calculate the covariance
   of X^2 and Y^2).
3. What is wrong with the following piece of code?
for ( k = 0; k < n; k++ ) {
setSeed(s);
U[k] = rng();
}
4. Calculate the distribution function for an exponential random variable
   with rate constant \lambda. Show that the sampler using the distribution
   function given in Section 9.3.4 is equivalent to the one given in Section
   9.3.2. Note that if U is a standard uniform, then 1 - U also is standard
   uniform.
5. If S_1 and S_2 are independent standard exponentials, then T = S_1 + S_2
   has PDF f(t) = t e^{-t}.

   (a) Write a simple sampler of T that generates S_1 and S_2 then takes
       T = S_1 + S_2.

   (b) Write a simpler sampler of T that uses rejection from an exponential
       trial. The trial density must have \lambda < 1. Why? Look for a value
       of \lambda that gives reasonable efficiency. Can you find the optimal
       \lambda?

   (c) For each sampler, use the histogram method to verify its correctness.

   (d) Program the Box Muller algorithm and verify the results using the
       histogram method.
6. A common problem that arises in statistics and stochastic control is to
   sample the n component normal

       f(x) = \frac{1}{Z} \exp\left( -\frac{D}{2} \sum_{j=0}^{n}
              (x_{j+1} - x_j)^2 \right) ,    (9.32)

   with the convention that x_0 = x_{n+1} = 0. The exponent in (9.32) is a
   function of the n variables x_1, \ldots, x_n.

   (a) Determine the entries of the symmetric n \times n matrix H so that the
       exponent in (9.32) has the form -\tfrac{1}{2} x^* H x.

   (b) \ldots where M has nonzero entries only on the main diagonal and the
       first subdiagonal. How many operations and how much memory are
       required to compute and store M, what power of n?

   (c)^{17} Find expressions for the entries of M.

   (d) Show that if Z is an n component standard normal and we use back
       substitution to find X satisfying M X = Z, then X is a multivariate
       normal with covariance C = H^{-1}. What is the work to do this, as a
       function of n?

   (e) Let C = H^{-1} have Choleski factorization C = L L^*. Show that
       L = M^{-1}. Show that L, and C itself, are full matrices with all
       entries non-zero. Use this to find the work and memory needed to
       generate a sample of (9.32) using the method of Section 9.3.6.

   (f) Find an expression for the entries of C. Hint: Let e_j be the standard
       column vector whose only non-zero entry is a one in component j.
       Let v satisfy H v = e_j. Show that v is linear except at component j.
       That is, v_k = a k + b for k \le j, and v_k = c k + d for k \ge j.
7. Suppose X is a discrete random variable whose possible values are x =
   1, \ldots, n, and that f(x) = \Pr(X = x) is a desired probability
   distribution. Suppose p_{jk} are the transition probabilities for an n state
   Markov chain. Then f is an invariant or steady state distribution for p if
   (supposing X(t) is the state after t transitions)

       \Pr(X(t) = x) = f(x) for all x
       \implies \Pr(X(t+1) = x) = f(x) for all x.

   This is the same as f P = f, where f is the row vector with components
   f(x) and P is the transition matrix. The ergodic theorem for Markov
   chains states that if f is an invariant distribution and if the chain is non-
   degenerate, then the Markov chain Monte Carlo (MCMC) estimator,

       \widehat{A}_T = \frac{1}{T} \sum_{t=1}^{T} V(X(t)) ,

   converges to A = E_f[V(X)] as T \to \infty. This holds for any transition
   probabilities P that leave f invariant, though some lead to more accu-
   rate estimators than others. The Metropolis method is a general way to
   construct a suitable P.
^{17} Parts c and f are interesting facts but may be time consuming and are
not directly related to Monte Carlo.
   (a) The transition matrix P satisfies detailed balance for f if the steady
       state probability of an x \to y transition is the same as the probability
       of the reverse, y \to x. That is, f(x) p_{xy} = f(y) p_{yx} for all
       x, y. Show that f P = f if P satisfies detailed balance with respect
       to f.

   (b) Suppose Q is any n \times n transition matrix, which we call the trial
       probabilities, and R is another n \times n matrix whose entries satisfy
       0 \le r_{xy} \le 1 and are called acceptance probabilities. The
       Metropolis Hastings (more commonly, simply Metropolis) method
       generates X(t+1) from X(t) as follows. First generate
       Y \sim Q_{X(t),\cdot}, which means \Pr(Y = y) = Q_{X(t),y}. Next,
       accept Y (and set X(t+1) = Y) with probability R_{X(t),Y}. If Y is
       rejected, then X(t+1) = X(t). Show that the transition probabilities
       satisfy p_{xy} = q_{xy} r_{xy} as long as x \ne y.

   (c) Show that P satisfies detailed balance with respect to f if

           r_{xy} = \min\left( \frac{f(y) q_{yx}}{f(x) q_{xy}} \,,\; 1 \right) .    (9.33)
8. Suppose \mu(t) is given by (9.31) and we want to estimate \partial_t \mu(t).
   We consider three ways to do this.

   (a) The most straightforward way would be to take N samples
       X_k \sim f_t and another N independent samples
       \widetilde{X}_k \sim f_{t+\Delta t}, then take

           \widehat{\partial_t \mu} = \frac{1}{\Delta t}
               \left( \frac{1}{N} \sum_k V(\widetilde{X}_k)
                    - \frac{1}{N} \sum_k V(X_k) \right) .
9. Write an expression for the variance of the random variable
   Y = V(X) - \alpha (W(X) - B) in terms of \mathrm{var}[V(X)],
   \mathrm{var}[W(X)], the covariance \mathrm{cov}[V(X), W(X)], and
   \alpha. Find the value of \alpha that minimizes the variance of Y. Show
   that the variance of Y, with the optimal \alpha, is less than the variance
   of V(X) by the factor 1 - \rho^2_{VW}, where \rho_{VW} is the correlation
   between V and W.
10. A Poisson random walk has a position, X(t), that starts with X(0) = 0. At
    each time T_k of a Poisson process with rate \lambda, the position moves
    (jumps) by a \mathcal{N}(0, \sigma^2) amount, which means that
    X(T_k + 0) = X(T_k - 0) + \sigma Z_k with iid standard normal Z_k.
    Write a program to simulate the Poisson random walk and determine
    A = \Pr(X(T) > B). Use (but do not hard wire) two parameter sets:

    (a) T = 1, \lambda = 4, \sigma = .5, and B = 1.

    (b) T = 1, \lambda = 20, \sigma = .2, and B = 2.

    Use standard error bars. In each case, choose a sample size, n, so that
    you calculate the answer to 5% relative accuracy and report the sample
    size needed for this.