Math 248: Computers and Numerical Algorithms
Part II: Algorithms and Numerics
AUTHORS:
C. David Pruett ([email protected])
Anthony Tongen ([email protected])
January 9, 2014
Math 248, as the course name suggests, has a dual focus on computers and algorithms. In the course,
sophomore-level students will learn 1) structured programming in a high-level programming language,
and 2) useful algorithms for performing numerical tasks such as rootfinding, solving systems of linear
equations, integrating and differentiating, and interpolating. To our knowledge, Math 248 is unique to
JMU. We know of no other university that offers programming and numerical methods seamlessly in one
course.
Packaging programming and numerical methods together affords many opportunities. It also carries
many pitfalls. The purpose of this coursepak is to structure the course so that students and instructors can
take advantage of the opportunities while avoiding the pitfalls.
Contents

1.3 Summary
1.4 Exercises
2.1 Machine Integers
2.1.1 Word-Length Options
2.2 Floating-Point Numbers
2.3 Precision
2.5 Exercises
3.5 Exercises
4.1 Divergent Orbits
4.2 Convergent Orbits
4.4 Exercises
5.2 Bisection
5.3 Newton's Method
5.5 Exercises
6.2.1 Matrix Operations
6.2.2 Properties of Matrices
6.3 Matrix Multiplication
6.3.2 Matrix Products
6.4 Gaussian Elimination
6.4.2 Back Substitution
6.4.3 Forward Elimination
6.4.4 Checking by Residuals
6.4.6 Pivoting
6.5 Gauss-Jordan Elimination
6.7 Special Matrices
6.8 Exercises
7.1 Polynomial Interpolation
7.5 Exercises
8.7 Exercises
9.4 Exercises
A positional number system is one in which the value of a digit is determined by its position relative to a
decimal point. For example, in decimal notation, the number 1047.321 means
$$1 \cdot 10^3 + 0 \cdot 10^2 + 4 \cdot 10^1 + 7 \cdot 10^0 + 3 \cdot 10^{-1} + 2 \cdot 10^{-2} + 1 \cdot 10^{-3} \qquad (1)$$
In the example above, the base is 10 and the same for all terms, the exponent is determined by place
relative to the decimal point, and the coefficients of each term are given by the respective digits of the
number.
REMARKS:
1. The origin of our common base 10 number system derives from humans having ten fingers, the
medical term of which is digits.
2. The idea of assigning the value of a number to its position originates from the Arabic system.
3. Our numerals (symbols) originate from the Sanskrit.
4. Not all number systems are positional in the same way. For example, in Roman numerals IV means
4 and VI means 6, I=1 being added or subtracted from V=5 depending upon whether it follows
or precedes the V, respectively. Such a system is awkward at best and ill-suited for electronic
computing.
In general, a positional number system can be devised for any integer base except 0. The common
ingredients of positional number systems are
1. An integer base b. Some common bases are
(b = 2) binary
(b = 8) octal
(b = 10) decimal
(b = 16) hexadecimal
2. A set of b symbols (numerals) with values 0 to b − 1. For the common number systems above, the respective numerals are
{0, 1}
{0, 1, 2, 3, 4, 5, 6, 7}
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B,C, D, E, F}
1.1
If we know the value of an integer in one base, say $b_1$, how do we determine its value in another base, say $b_2$? Quite often, we want to know the decimal value of an integer in base b, or vice versa, so let's start with these conversions. From now on, assume that b is a positive integer (that is, $b \in \mathbb{Z}$ and $b > 0$).
1.1.1
Let's start with a couple of specific examples and see if we can then generalize to derive an algorithm.
EX1: Convert (257)8 to a decimal integer.
$(257)_8 = 2 \cdot 8^2 + 5 \cdot 8^1 + 7 \cdot 8^0 = 128 + 40 + 7 = (175)_{10}$
What is the mathematical name for the expression on the right-hand side of the equation above? Yes, a
polynomial in b.
REMARKS: Converting integers from any base b other than 10 into base 10 is equivalent to evaluating a
polynomial in b at the value of the base.
ASIDE: Algorithms for evaluating polynomials
Consider the nth-order polynomial in x,
$$P_n(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$$
How many operations (multiplies or adds) does it take to evaluate $P_n(x)$ given the coefficients $a_i$ and a value of x? It depends on how you organize the work. If we approach the problem naively, the operation count can be quite high. For example, here's how not to undertake the problem, by the brute-force approach.
term                                             multiplies
a_n x^n = a_n · x · x · ... · x                  n
a_{n-1} x^{n-1} = a_{n-1} · x · x · ... · x      n − 1
...                                              ...
a_1 x = a_1 · x                                  1
a_0 = a_0                                        0
total                                            n(n + 1)/2
Now consider the same task organized by nesting. For example, the fourth-order polynomial $P_4(x) = 3x^4 + 5x^3 + 7x^2 + 8x + 1$ can be written in nested form as $P_4(x) = (((3x + 5)x + 7)x + 8)x + 1$. Evaluating the nested form at x = 2, working from the innermost nest outward, gives
3(2) = 6;    6 + 5 = 11
2(11) = 22;  22 + 7 = 29
2(29) = 58;  58 + 8 = 66
2(66) = 132; 132 + 1 = 133
To evaluate the value of each nest, we must perform one multiplication by 2 and one add, or two operations
in total. For a fourth-order polynomial, there are four nests, or eight operations. It appears that the
operation count is exactly 2n for an nth order polynomial. We will verify this in a moment.
Generalizing the procedure for arbitrary n, we store the highest order coefficient in a register, multiply
that coefficient by the value of x and then add to the product the next-highest order coefficient. This result
replaces the value in the register, and the process repeats in descending order until the last coefficient, a0 ,
has been added. There the process terminates. Formalizing this procedure, we obtain the algorithm for Horner's method.
ALGORITHM 1.1: Polynomial evaluation by Horner's method
input order n
input n + 1 coefficients a_i, i = 0, 1, ..., n
input value of x
t ← a_n (initialize temporary storage register)
repeat for i = n − 1, n − 2, ..., 0 (descending order)
| t ← t · x + a_i
|
output t (which contains value of P_n(x))
The operation count is easy to obtain from ALG 1.1. There is a single decrementing loop (one that runs
backwards). The loop is executed n times. Why? The only operations within the loop are one multiply
and one add, or two operations per iteration. Thus the total operation count is exactly 2n. Compare this to
the brute force method when n is a large number. Which algorithm is preferable?
REMARKS: Comparing the brute force algorithm with Horner's method provides yet another example of algorithms that are mathematically equivalent (they give identical mathematical results) but are not computationally equivalent, in this case because the latter is far more efficient than the former in terms of operations.
Horner's algorithm is ideal for evaluating polynomials. It follows then that Horner's algorithm is also ideal for converting integers in base b to decimal integers. Let's revisit a previous example.
EX: Convert the octal integer (257)8 to a decimal integer using Horners method.
n ← 2
a_0 ← 7, a_1 ← 5, a_2 ← 2
x ← b = 8
t ← a_2 = 2 (initialize temporary storage register)
i = 1: t ← t · x + a_1 = 2 · 8 + 5 = 21
i = 0: t ← t · x + a_0 = 21 · 8 + 7 = 175
Thus $(257)_8 = (175)_{10}$, as in EX1.
1.1.2
We now wish to go in the opposite direction. That is, given an integer in base 10, can we find an algorithm for conversion to base b? Of particular interest is conversion to base 2 (binary), for this is the favored base of machine computation. Before we proceed, however, there is an important principle that should be stated explicitly. Suppose $b_1$ and $b_2$ are two different integer bases; that is, $b_1$ and $b_2$ are both integers, but $b_1 \neq b_2$. If
(integer part.fractional part)b1 = (integer part.fractional part)b2
then
(integer part)b1 = (integer part)b2
and
(fractional part)b1 = (fractional part)b2
The practical implication of the principle above is that we can divide and conquer when converting decimal
numbers to any other base. We will first convert the integer part of the number and then follow up by
converting the fractional part.
Let's start with a specific example:
EX: Convert $(375)_{10}$ to octal (base 8). There is no reason to presuppose that the number of digits will be the same in two different bases (although in this case, they are the same). Thus, to start
$$(375)_{10} = (\ldots a_4 a_3 a_2 a_1 a_0.)_8 = a_0 + a_1 \cdot 8 + a_2 \cdot 8^2 + a_3 \cdot 8^3 + \cdots \qquad (3)$$
What happens if we now divide both sides of Eq. 3 by the value of the base? In this case we obtain
$$\frac{375}{8} = 46\frac{7}{8} = \frac{a_0}{8} + a_1 + a_2 \cdot 8 + a_3 \cdot 8^2 + a_4 \cdot 8^3 + \cdots \qquad (4)$$
where it is understood that the expression on the left-hand side of Eq. 4 is a base 10 (decimal) number. Note that the exponent denoting the power of the base on the right-hand side dropped by one in each term. Moreover, the term $a_0/8$ must be a proper fraction, and the remaining terms give a whole number. Why is that? We can now invoke our guiding principle. The left- and right-hand sides of Eq. 4 denote the same number expressed in two different bases. We may therefore independently equate the fractional parts and the whole parts. By equating the fractional parts, we obtain
$$\frac{7}{8} = \frac{a_0}{8} \implies a_0 = 7 \qquad (5)$$
which gives us the lowest order digit of the base 8 number we desire. Similarly, by equating the integer
parts we obtain
$$46 = a_1 + a_2 \cdot 8 + a_3 \cdot 8^2 + a_4 \cdot 8^3 + \cdots \qquad (6)$$
At this point, the process is repeated. Dividing both sides of Eq. 6 by the base, we obtain
$$\frac{46}{8} = 5\frac{6}{8} = \frac{a_1}{8} + a_2 + a_3 \cdot 8 + a_4 \cdot 8^2 + \cdots \qquad (7)$$
Equating the fractional parts of Eq. 7 gives $a_1 = 6$, and equating the whole parts gives
$$5 = a_2 + a_3 \cdot 8 + a_4 \cdot 8^2 + \cdots \qquad (8)$$
Finally, dividing both sides by the base one more time gives a2 = 5 and a whole part of zero, the latter of
which implies a3 = a4 = a5 = ... = 0. Thus, when the whole part vanishes, our job is done.
To summarize, the base 8 digits, obtained from lowest order to highest, are a0 = 7, a1 = 6, and a2 = 5,
from which we conclude that (375)10 = (567)8 .
The process above, with all the details, is useful but cumbersome. Can we strip the process to its bare
essence to derive an algorithm? At each step, we divide a base 10 integer by the new base and compute the
remainder. The remainder yields the desired digit in the new base. The process is repeated anew with the
new whole part. When the whole part vanishes, we are done. The steps are shown below, working from
the bottom upwards.
8 | 0    R 5   (a_2)
8 | 5    R 6   (a_1)
8 | 46   R 7   (a_0)
8 | 375
From the bare bones above, we derive the following algorithm:
ALGORITHM 1.2: Base 10 to base b integer conversions
input base-10 integer n > 0
input desired new integer base b > 0
i ← 0 (keeps track of index of coefficient)
repeat until n = 0
| a_i ← remainder of n/b
| n ← whole part of n/b
| i ← i + 1
|
output digits a_{i-1}, ..., a_1, a_0 (the base-b representation, highest order first)
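A minimal MATLAB sketch of ALG 1.2 follows; the function name and the choice to return the digits lowest-order first are illustrative assumptions, not part of the coursepak.

function a = int2baseb(n, b)
% INT2BASEB  Convert a positive base-10 integer n to base b (ALG 1.2).
% Returns the digits a_0, a_1, ... (lowest order first) in a numeric vector.
a = [];
while n > 0
    a(end+1) = mod(n, b);   % a_i <- remainder of n/b
    n = floor(n/b);         % n   <- whole part of n/b
end
end

Thus int2baseb(375, 8) returns [7 6 5], that is, (375)_10 = (567)_8; MATLAB's built-in dec2base(375, 8) performs the same conversion, returning the character string '567'.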
1.2
Thus far, we have been concerned about converting integer values from one base to another. We now turn
attention to fractional values expressed in different bases.
1.2.1
EX: Convert $(0.35)_{10}$ to octal (base 8). To start,
$$(0.35)_{10} = (.a_{-1} a_{-2} a_{-3} \ldots)_8 = a_{-1} \cdot 8^{-1} + a_{-2} \cdot 8^{-2} + a_{-3} \cdot 8^{-3} + \cdots \qquad (9)$$
What happens if we now multiply both sides of Eq. 9 by the value of the base? In this case
$$8(0.35) = 2.80 = a_{-1} + a_{-2} \cdot 8^{-1} + a_{-3} \cdot 8^{-2} + \cdots \qquad (10)$$
where the first term on the right is a whole number, and the remaining terms sum to a proper fraction. Why is
that true? Once again, by independently equating the integer and fractional parts on the left and right sides
of Eq. 10, we see that $a_{-1} = 2$ and
$$0.80 = a_{-2} \cdot 8^{-1} + a_{-3} \cdot 8^{-2} + \cdots \qquad (11)$$
Repeating the process, each successive multiplication by the base yields the next digit:
$$8(0.80) = 6.40 \implies a_{-2} = 6 \qquad (12)$$
$$8(0.40) = 3.20 \implies a_{-3} = 3 \qquad (13)$$
$$8(0.20) = 1.60 \implies a_{-4} = 1 \qquad (14)$$
$$8(0.60) = 4.80 \implies a_{-5} = 4 \qquad (15)$$
$$8(0.80) = 6.40 \implies a_{-6} = 6 \qquad (16)$$
At this point, the pattern begun in Eq. 12 repeats itself. We conclude (0.35)10 = (0.263146314)8 . This
leads to an interesting observation: a decimal fraction that terminates in one base may repeat when
expressed in another base.
Once again, the details look somewhat complicated, but the essence of the procedure is simple. At
each step multiply by the base. The integer part of the result gives the next successive coefficient. Repeat
the process with the remaining fractional part.
index i    product     integer part = a_i       fractional part
 -1        8(0.35)     2                        0.80
 -2        8(0.80)     6                        0.40
 -3        8(0.40)     3                        0.20
 -4        8(0.20)     1                        0.60
 -5        8(0.60)     4                        0.80
 -6        8(0.80)     6 (pattern replicates)   0.40
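The successive-multiplication procedure tabulated above translates directly into a short MATLAB sketch; the function name and the fixed number of digits returned are assumptions made for illustration, not part of the coursepak.

function a = frac2baseb(f, b, ndigits)
% FRAC2BASEB  First ndigits base-b digits a_{-1}, a_{-2}, ... of a fraction 0 <= f < 1.
a = zeros(1, ndigits);
for i = 1:ndigits
    f = f*b;                % multiply by the base
    a(i) = floor(f);        % the integer part is the next digit
    f = f - a(i);           % repeat with the remaining fractional part
end
end

For example, frac2baseb(0.35, 8, 9) returns 2 6 3 1 4 6 3 1 4, reproducing (0.35)_10 = (0.263146314...)_8.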
1.2.2
Fortunately, with a bit of cleverness, we can use what we have already learned. A specific example will
illustrate how.
EX: Convert (0.257)8 to decimal notation.
Consider that
$$(0.257)_8 = 2 \cdot 8^{-1} + 5 \cdot 8^{-2} + 7 \cdot 8^{-3} = \frac{1}{8^3}\left[2 \cdot 8^2 + 5 \cdot 8^1 + 7 \cdot 8^0\right] = \frac{175}{8^3} = (.341796875)_{10} \qquad (17)$$
The bracketed term is the whole number $(257)_8$, whose decimal equivalent, 175, is obtained by ALG 1.1 (see EX1 of Section 1.1.1).
Here then are the essential steps of the algorithm:
1. Shift the decimal point n places to the right until the fraction is whole.
2. Convert the base-b whole number using ALG 1.1 (Horner's method).
3. Divide the result by $b^n$ to adjust for the shift in Step 1.
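In MATLAB, the three steps amount to one nested evaluation followed by a division. The sketch below reuses the illustrative horner function from Section 1.1.1 and is not part of the coursepak.

% Convert (0.257)_8 to decimal: shift 3 places, convert the whole number, divide by 8^3.
digits = [7 5 2];               % a_0, a_1, a_2 of the shifted whole number (257)_8
whole  = horner(digits, 8);     % = 175, by ALG 1.1
value  = whole / 8^3            % = 0.341796875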
1.3
Summary
The following guidelines summarize the conversions of numbers from base b to base 10 and vice versa:
1. To convert $(a_n a_{n-1} \ldots a_1 a_0 . a_{-1} a_{-2} \ldots a_{-m})_b$ to base 10:
(a) Move the decimal point m places to the right until the number is whole.
(b) Use ALG 1.1 (Nested Polynomial Evaluation, or Horner's method) to convert the whole number.
(c) Adjust for the shift of the decimal point by dividing the result by $b^m$.
2. To convert $(a_n a_{n-1} \ldots a_1 a_0 . a_{-1} a_{-2} \ldots a_{-m})_{10}$ to base b:
(a) Treat the integer and fractional parts separately.
(b) Convert the integer part by ALG 1.2 using successive divisions by b.
(c) Convert the fractional part by ALG 1.3 using successive multiplications by b.
1.4
Exercises
Computers can represent neither all integers nor all real numbers for the simple reason that there is an
infinity of integers and an infinity of reals, but computer memory is finite. In this section we will find
out exactly what numbers can be represented and how those representations are stored. Computers treat
integers and reals differently; accordingly the two subsections of this Chapter deal with machine integers
and floating-point numbers, respectively.
Before turning to specifics, a wee bit of history. Modern digital computers store numbers internally
in binary form. The one most often credited with the suggestion that digital computers exploit binary
arithmetic was John von Neumann. Von Neumann (1903-1957), a child prodigy, was born in Budapest,
Hungary. He immigrated to the United States in 1930, as one of the six original mathematicians and scientists, all brilliant, of Princeton Universitys Institute for Advanced Studies (IAS). Following natualization
as a US citizen, von Neumann became deeply involved in the Manhattan Project to build the atomic bomb.
Von Neumann, along with Alan Turing, is considered a pioneer in computational technology. He favored
binary arithmetic for machine computation because it simplified electronic circuitry. In particular, the
simplest switches have only two positions: on and off. Thus, any number could be represented in binary
by sufficiently long strings of switches, where the bits 0 and 1 are associated with the off and on positions
of a switch, respectively.
Binary arithmetic is easy, but it takes a little practice. It might help to start with a table of the first few integers in binary.
Decimal   Binary
0         0
1         1
2         10
3         11
4         100
5         101
6         110
7         111
8         1000
9         1001
Addition:

  b = 10         b = 2
     11           1011
  +   7          + 111
  -----         ------
     18          10010

Multiplication:

  b = 10         b = 2
     11            1011
  x   7           x 111
  -----         -------
     77            1011
                  1011
                 1011
                -------
                1001101
To complete the picture of how a four-function digital calculator works, we should also give examples of the operations of subtraction and division. These, however, vary from machine to machine. For example, top-of-the-line supercomputers such as those manufactured by Cray perform division by multiplying by the reciprocal. The full process requires three steps: 1) approximation of the reciprocal, 2) correction by Newton iteration (see Chapter 5), and 3) multiplication by the reciprocal. Thus, division requires about three times the computational effort of multiplication!
HW1: Compute 12 × 9 in base 2 and verify.
HW2: Which of the following MATLAB assignment statements is preferred and why?
X = VALUE / 2.0; or X = 0.5*VALUE;
2.1
Machine Integers
Any nonnegative integer N can be written in binary form as
$$N = (b_{n-1} b_{n-2} \ldots b_1 b_0)_2 = b_{n-1} \cdot 2^{n-1} + \cdots + b_1 \cdot 2 + b_0 \qquad (18)$$
Recall that the numbers bi are called bits, a contraction of the term binary digits. The integer word
length is the predetermined number of bits set aside in memory to store a single integer value. For example,
the integer word length typical of PCs is 4 bytes = 32 bits. Exactly how those bits are used depends upon
the particular machine. In the most simple-minded approach the leading (left-most) bit is used to indicate
the sign (+ or -) of the integer and the remaining bits represent the absolute value of the number in binary
notation. Thus on a PC, the 32-bit string might be arranged, for example, as follows:
bit string:   1   1   1   0   ...    0    1
bit index:    1   2   3   4   ...   31   32
In the 32-bit representation above, what is the largest integer that can be represented? It is fairly obvious that the largest integer would have a sign bit indicating +, followed by 31 "on" bits, each with value 1. Therefore, the largest integer would be
$$N = +(111\ldots11)_2 = 1 \cdot 2^{30} + 1 \cdot 2^{29} + 1 \cdot 2^{28} + \cdots + 1 \cdot 2^1 + 1 \cdot 2^0 \qquad (19)$$
We could evaluate N in decimal notation by Horner's method. However, there is a much simpler way. If we add 1 to N, the result will be a binary integer of 32 bits, the leading bit being one and the remaining bits being zero. That is, $N + 1 = +(1000\ldots00)_2$. In decimal notation, $N + 1 = 2^{31} = 2147483648$. Thus, the largest integer is N = 2,147,483,647, a tad more than two billion. Similarly, the most negative integer is N = −2,147,483,647.
ASIDE: If truth be told, the schema described here has two drawbacks: 1) the procedure for adding two such integers is unnecessarily complicated, and 2) the representation of zero is non-unique; that is, $(000\ldots00)_2$ and $(100\ldots00)_2$ both represent zero. In practice, most machines make use of two's complements, a notion slightly beyond the scope of this book. Interested readers may look, for example, at Discrete Mathematics with Applications (2nd Ed.) by Susanna Epp, Brooks/Cole (1995), in which the following definition is given:
DEF: Given a positive integer a, the two's complement of a relative to a fixed bit length n is the n-bit binary representation of $2^n - a$.
Curiously, given a and n, $2^n - a$ can be found in three easy steps:
1. Write the n-bit representation of a.
2. Switch all the 1s to 0s and vice versa.
3. Add 1 to the result in Step 2 in binary notation.
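The three steps are easy to mimic in MATLAB. The following sketch uses a = 5 and n = 8 purely as illustrative values; it is not taken from the coursepak.

n = 8;  a = 5;
s1 = dec2bin(a, n)                        % step 1: n-bit representation of a
s2 = char('0' + ('1' - s1))               % step 2: switch all 1's to 0's and vice versa
s3 = dec2bin(bin2dec(s2) + 1, n)          % step 3: add 1 in binary
check = dec2bin(2^n - a, n)               % agrees with the n-bit representation of 2^n - a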
In two's-complement form, nonnegative integers are treated conventionally, but negative integers are represented as the two's complement of their absolute values. As a result, for a 32-bit integer word length, the integers $-2{,}147{,}483{,}648 \le N \le 2{,}147{,}483{,}647$ are each uniquely represented, including zero. But
more importantly, a single algorithm suffices for adding any two integers regardless of their signs. Here is
a summary of the algorithm for adding two 32-bit integers:
1. Convert both integers from decimal to binary representations (representing negative integers as the two's complement of their absolute values).
2. Add the resulting binary integers using ordinary binary addition.
3. Truncate any leading 1 (overflow) that occurs in the $2^n$ position.
4. Convert the result back to decimal notation by interpreting any 32-bit integers with leading 0s as
positive and those with leading 1s as negative.
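Here is a hedged MATLAB sketch of the four-step addition recipe for an n-bit word; the sample values x = 7, y = -3, and n = 8 are arbitrary illustrations, not from the coursepak.

n = 8;  x = 7;  y = -3;
bx = dec2bin(mod(x, 2^n), n);    % step 1: negatives stored as the two's complement 2^n - |y|
by = dec2bin(mod(y, 2^n), n);
s  = bin2dec(bx) + bin2dec(by);  % step 2: ordinary binary addition (done here in decimal)
s  = mod(s, 2^n);                % step 3: truncate any carry out of the 2^n position
if s >= 2^(n-1)                  % step 4: a leading 1 means the result is negative
    result = s - 2^n;
else
    result = s;
end
result                           % returns 4, as expected for 7 + (-3)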
2.1.1
Word-Length Options
Although the largest 32-bit integer, N = 2,147,483,647, exceeds two billion, this number may not be sufficiently large for some applications. For example, to print out the value of each base pair in the human genome would require a loop from 1 to approximately 3 billion. Most compilers allow the user to access short or long integers, depending upon the needs of the algorithm. In C, for example, one uses the type-declaration keywords short or long to change the default length of an integer variable. In MATLAB, on the other hand, the integer word length is selected by specifying the integer class when a variable is created, as follows:
I=ones(1,1,'int8')  % 8-bit (1-byte) integer
J=ones(1,1,'int16') % 16-bit (2-byte) integer
K=ones(1,1,'int32') % 32-bit (4-byte) integer
L=ones(1,1,'int64') % 64-bit (8-byte) integer
HW3: What is the largest integer of any KIND that can be represented in MATLAB?
Note: For some reason, int64 is not fully supported in MATLAB.
2.2
Floating-Point Numbers
Computers store real numbers in floating-point notation. To begin to get a handle on this concept, consider the transcendental number $\pi/4 = 0.7853981634\ldots$ and a 1-foot ruler. Is there a length on the ruler that corresponds to $\pi/4$ feet? Is there a tic mark on the ruler that corresponds to $\pi/4$ feet? If you answered both questions correctly, you are well on your way to understanding how computers store real numbers in floating-point notation.
There is certainly a length on the ruler corresponding to $\pi/4$ feet, but it has no tic mark. The closest tic is at 9 7/16 inches = 0.786458333... feet.
REMARKS: Floating-point numbers are like the tics on a ruler; there is not a tic for every real number,
and so a given real number is represented by the closest number that has a corresponding tic.
Another way of looking at floating-point notation is that it is the binary equivalent of scientific notation.
Here are some familiar examples of scientific notation:
velocity of light $\approx 3 \times 10^5$ km/s
250 billion dollar deficit $\approx 2.5 \times 10^{11}$ dollars
Avogadro's number $\approx 6.022 \times 10^{23}$
Planck's constant $\approx 6.626068 \times 10^{-34}$ m$^2$ kg/s
In general, scientific notation, which is particularly useful for expressing very large or very small numbers, consists of a decimal part, usually between one and ten and called the mantissa, and an exponential part. A floating-point number likewise has the form
$$x = \pm(.b_1 b_2 b_3 \ldots b_n)_2 \times 2^m \qquad (20)$$
The three components of a floating-point number are the sign bit, an n-bit string representing the mantissa,
and an exponent m also expressed in binary.
EX1: Express $(0.0)_{10}$, $(0.5)_{10}$, and $(1.0)_{10}$ in floating-point notation.
$(0.0)_{10} = +.000\ldots0 \times 2^0$
$(0.5)_{10} = +.100\ldots0 \times 2^0$
$(1.0)_{10} = +.100\ldots0 \times 2^1$
Without an additional constraint on format, floating-point numbers are non-unique. For example, we could have expressed $(0.5)_{10} = +.010\ldots0 \times 2^1$. To remove the problem of non-uniqueness, it is customary to represent non-zero floating-point numbers in normalized form; that is, with a 1 in the leading bit of the mantissa. Thus, the only floating-point number whose leading bit is zero is zero itself.
In general, the word length of a floating-point number, and the partition of those bits into mantissa and exponent, depends upon the particular computer. For the sake of illustration, let's consider a lame computer with a 4-bit mantissa and whose exponents can run from m = −3 to m = 4. Let's now write down all the real numbers this lame-frame can represent exactly. Let's first set m = 0 and consider all possible normalized 4-bit mantissas, of which there are only 9 (including zero). The decimal equivalents of these machine values are given in the center column of Table 4.
Mantissa     m=-3       m=-2      m=-1     m=0     m=1    m=2   m=3  m=4
(0.0000)_2   0          0         0        0       0      0     0    0
(0.1000)_2   0.0625     0.125     0.25     0.5     1      2     4    8
(0.1001)_2   0.0703125  0.140625  0.28125  0.5625  1.125  2.25  4.5  9
(0.1010)_2   0.078125   0.15625   0.3125   0.625   1.25   2.5   5    10
(0.1011)_2   0.0859375  0.171875  0.34375  0.6875  1.375  2.75  5.5  11
(0.1100)_2   0.09375    0.1875    0.375    0.75    1.5    3.0   6    12
(0.1101)_2   0.1015625  0.203125  0.40625  0.8125  1.625  3.25  6.5  13
(0.1110)_2   0.109375   0.21875   0.4375   0.875   1.75   3.5   7    14
(0.1111)_2   0.1171875  0.234375  0.46875  0.9375  1.875  3.75  7.5  15

Table 4: Complete list of machine reals for 4-bit floating-point mantissa with exponent range −3 ≤ m ≤ 4.
We can now fill in the table by considering non-zero exponents. For example, the values in the column just to the right of the center column were obtained by multiplying those in the center column by $2^1 = 2$, and so on.
Now consider $x = \pi$. What is fl(x) on our lame-frame computer? It is the closest machine number to the actual value of x. In this case $\mathrm{fl}(x) = (3.25)_{10} = +(.1101)_2 \times 2^2$. Clearly this is a terrible representation of $\pi$, but it is the best we can do with a 4-bit mantissa. Fortunately, the situation improves markedly as
Floating-point storage for an n-bit mantissa and m-bit exponent on a base-2 machine:

exponent range:              $-(2^{m-1})$ to $2^{m-1} - 1$
largest positive number:     $(2^{-1} + \cdots + 2^{-n}) \times 2^{2^{m-1}-1}$
smallest positive number:    $(.1)_2 \times 2^{-2^{m-1}}$
total numbers represented:   $2^{n+m} + 1$
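As a quick sanity check (not from the text), the formulas above can be evaluated in MATLAB for the lame machine of Table 4, which has n = 4 mantissa bits and the eight exponents -3, ..., 4:

n = 4;  mMin = -3;  mMax = 4;
largest  = sum(2.^-(1:n)) * 2^mMax               % (.1111)_2 x 2^4  = 15, the last entry of Table 4
smallest = 2^-1 * 2^mMin                         % (.1000)_2 x 2^-3 = 0.0625
total    = 2 * 2^(n-1) * (mMax - mMin + 1) + 1   % 129 signed values, counting zero once

Table 4 lists only the nonnegative values; the count of 129 includes both signs, in agreement with $2^{n+m} + 1$ for an m = 3-bit exponent.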
2.3
Precision
When we say significant bits we are referring to the binary representation of a real number, and when we
say significant digits we are referring to its decimal representation. How are significant bits and significant
digits related?
DEF: Precision is the approximate number of decimal digits represented by the mantissa of a floating-point number. Although most modern computers are 64-bit with 53-bit mantissas, for the purpose of the following discussion we are going to examine single-precision PCs, which have 23-bit mantissas. In single precision, $2^{23} - 1 = 8388607 \approx 8 \times 10^6$ is the largest number that can be represented by the mantissa. (Here, the position of the decimal point is irrelevant, because the number of significant digits does not change if the decimal point is shifted.) This is a 7-digit number, so it appears that a 23-bit mantissa gives 7-digit precision. Not quite. As not all 7-digit numbers can be represented by the mantissa (for example, 9,000,000), we say the precision is 6-7 digits. To summarize, if n is the number of bits in the mantissa, we say a computer has p-digit precision iff
$$2^n \approx 10^p \qquad (21)$$
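Solving Eq. 21 for p gives p ≈ n log10(2), which is easy to check in MATLAB (the mantissa lengths 23 and 53 below are the standard single- and double-precision values):

n = [23 53];                 % single- and double-precision mantissa bits
p = n * log10(2)             % about 6.9 and 16.0 decimal digits, respectively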
2.4
A run-time overflow (underflow) error occurs whenever the exponent of a floating-point number is too large (small) to be represented. To assess whether or not a calculation is in danger of an overflow error, we need to know how large the exponent p can be in floating-point notation. On a PC, there are 8 bits assigned to the exponent. Suppose all 8 bits are on, in which case $p_{max} = (11111111)_2 = (255)_{10}$. Zero is also a possibility, so the exponents could range from 0 to 255. But wait. This won't work, because we need negative exponents also. Therefore, normally, the exponents are split more-or-less evenly between negative and positive. With 256 possibilities, and equal opportunity for negative and positive, $-128 \le p \le 127$.
Figure 2: (b) Selected machine mantissas as tic marks on a ruler.
Finally, $2^{128} \approx 10^{38}$, so the largest real number that can be represented in floating-point notation is approximately $1.0 \times 10^{38}$ in scientific notation. Similarly, the smallest positive real number that can be represented is roughly $1.0 \times 10^{-38}$.
EX: Consider the following lines of MATLAB and explain what happens when the variable Z is computed.
X = ones(1,1,'single')*3.7E20, Y = ones(1,1,'single')*4.2E28, Z = ones(1,1,'single')
Z = X*Y;
In case you are wondering, 3.7E20 is MATLAB's syntax for $3.7 \times 10^{20}$ in scientific notation. The E means exponent. The product of X and Y has an exponent of at least 20 + 28 = 48. This number exceeds the maximum single-precision exponent size of 38, so the computation overflows and Z will not be correct. How could you change the above commands so that the computation would work correctly?
2.5
Exercises
×   0   1   2   3   4   5   6   7
0   0   0   0   0   0   0   0   0
1   0   1   2   3   4   5   6   7
2   0   2
3   0   3
4   0   4
5   0   5
6   0   6
7   0   7
If the result of a digital computer is almost never exactly correct because of the limitations of storing real
numbers on a finite-bit machine, then the question "How much in error is my result?" is an important one.
Throughout the remainder of this course, error and its estimation will be a recurring theme.
DEF: Absolute error is the difference between the exact value and its approximation. Thus, letting $x$ and $\bar{x}$ represent the exact and approximate values of $x$, respectively, and letting $\epsilon$ represent the absolute error, then
$$\epsilon = x - \bar{x} \qquad (22)$$
Absolute error by itself is more-or-less meaningless. Only relative error really matters.
DEF: Relative error is absolute error divided by the true value. If $\delta$ represents relative error, then
$$\delta = \frac{\epsilon}{x} = \frac{x - \bar{x}}{x} \qquad (23)$$
It should be pointed out that, whereas absolute error is a dimensional quantity with units such as miles, grams, miles/hour, hours, etc., relative error, being a ratio of values, is always dimensionless (unit-less).
Error sneaks into a calculation from many sources. Among these are initial errors in the input data, which lead to the old adage of computing: Garbage in, garbage out. A more subtle source of error is
model error. For machine solution of physical problems, the equations to be solved are often models of
the physical phenomenon. Certain assumptions may have been necessary to derive the models, in which
case the model may not be a complete or accurate description of the physical problem. Model errors are
beyond the scope of this course, but you should at least be aware of this source of error. The two sources
of error that most concern us are round-off error and truncation error. Each represents the limitation of a
finite process in describing an infinite one. In this Chapter, we address the first of these sources of error:
round-off error.
3.1
EX: Find the exact absolute and relative round-off errors incurred when the decimal fraction 1/10 is approximated by a floating-point number with a 10-bit mantissa.
First, we need to represent $x = (0.1)_{10}$ in binary. From a previous HW exercise, recall that $(0.1)_{10} = (0.000110011001100\ldots)_2$. Recall that, although $(0.1)_{10}$ is a terminating decimal fraction, it is a repeating binary fraction, which could be represented exactly only by an infinity of bits. Truncating that infinity of bits to a finite (in this example, 10-bit) length will result in significant round-off error. In normalized form,
$$x = (.110011001100\ldots)_2 \times 2^{-3} \qquad (26)$$
Before we find the round-off error, let's work backwards to see if we can verify that the repeating binary fraction above is indeed the same as the decimal fraction 1/10.
ASIDE: Geometric Series (Review from Math 236)
DEF: The series $\sum_{n=0}^{\infty} ar^n = a + ar + ar^2 + ar^3 + \cdots$ is called a geometric series.
FACT: If $|r| < 1$, the geometric series converges to the sum $S = \dfrac{a}{1-r}$.
PROOF: Consider the sequence of partial sums, whose first n + 1 elements are given by
$$S_0 = a$$
$$S_1 = a + ar$$
$$S_2 = a + ar + ar^2$$
$$\vdots$$
$$S_n = a + ar + ar^2 + \cdots + ar^n$$
Multiplying the nth partial sum $S_n$ by r gives
$$rS_n = ar + ar^2 + ar^3 + \cdots + ar^{n+1} \qquad (27)$$
Note that the sums $S_n$ and $rS_n$ agree in all terms except the first and last. From this, it follows that
$$S_n - rS_n = a - ar^{n+1} \qquad (28)$$
Solving for $S_n$ gives $S_n = \dfrac{a(1 - r^{n+1})}{1 - r}$. Because $|r| < 1$, $r^{n+1} \to 0$ as $n \to \infty$, and therefore
$$S = \lim_{n \to \infty} S_n = \frac{a}{1 - r} \qquad (29)$$
Why detour from the topic of round-off error to re-visit geometric series? Because repeating decimal
fractions, whether in base 10 or binary notation, are simply geometric series in disguise. For example the
first (left-most) block of 4 bits in Eq. 26, namely (.1100)2 , is the equivalent of the decimal fraction 3/4.
The second block of 4 bits, $(.00001100)_2$, has the decimal equivalent $3/4 \times 2^{-4}$, because the decimal point is shifted four places to the left. Similarly, $(.000000001100)_2 = 3/4 \times 2^{-8}$, etc. Thus the entire repeating binary fraction represented by the mantissa is the geometric series
$$3/4 + 3/4 \cdot 2^{-4} + 3/4 \cdot (2^{-4})^2 + 3/4 \cdot (2^{-4})^3 + \cdots$$
Here $a = 3/4$ and $r = 2^{-4} = 1/16$; hence
$$S = \frac{3/4}{1 - 1/16} = \frac{3/4}{15/16} = \frac{4}{5} = 0.8 \qquad (30)$$
The factor $2^{-3}$ in Eq. 26 is 1/8. Thus, $0.8 \times 1/8 = 1/10$, and hence we have successfully recovered the decimal fraction 1/10 from its repeating binary equivalent.
We are now poised to determine the absolute and relative round-off errors associated with storing 1/10 on our machine with a 10-bit mantissa, for which
$$\mathrm{fl}(x) = (.1100110011)_2 \times 2^{-3} \qquad (31)$$
The chopped-off tail of the mantissa, $(.0011001100\ldots)_2 \times 2^{-10} = 0.2 \times 2^{-10} = 0.8 \times 2^{-12}$, when multiplied by the exponent factor $2^{-3}$, gives the absolute round-off error $\epsilon = x - \mathrm{fl}(x) = 0.8 \times 2^{-15}$. The relative round-off error is found by dividing this by the exact value x = 0.1. Thus,
$$\delta = \frac{\epsilon}{x} = \frac{0.8 \times 2^{-15}}{0.1} = 2^{-12} \qquad (32)$$
Finally, in decimal notation, $\delta \approx 0.00024$, or about 0.024 percent error. While this error is not terrible,
imagine what happens when floating-point numbers that are in error are used again and again in lengthy
calculations. In such cases round-off error tends to grow like a snowball.
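A quick MATLAB experiment (a sketch, not part of the text) shows the same effect for MATLAB's own single-precision storage of 1/10, whose 24-bit significand is rounded rather than chopped:

x  = 0.1;                          % double-precision value, treated as "exact" here
fx = single(x);                    % single-precision approximation of 1/10
relerr = abs(double(fx) - x)/x     % about 1.5e-8, i.e. near 2^-26, within the 2^-24 bound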
A fundamental question to ask is the following: Given a machine whose real numbers are represented
by an n-bit mantissa, what is the absolute value of the worst possible relative round-off error that can be
incurred? This little number is so important that it is given a special name and symbol: unit round-off
error, u.
3.2
To answer this question, recall Fig. 2(b) in Chapter 2, which compares machine numbers to the tic marks on a ruler. Let's ignore the exponent of a floating-point number and just consider its mantissa. Further, consider the finite sequence of mantissas given by the binary numbers $\{(.1), (.11), (.111), \ldots, (.111\ldots1)\}$, where the last number has exactly n non-zero bits. Further, let's assign a tic mark on the ruler to each of these binary numbers. Once again, note that each additional bit yields a new tic halfway between the previous one and the end of the ruler. How far apart are the fine-grained tic marks? We have shown previously that the distance between two adjacent tic marks is $2^{-n}$. The relevant question now is: What is
at which point the computer has revealed the number of bits n it allots to the mantissas of floating-point
numbers, and by inference, it has revealed u. ALG 3.1 returns the correct u regardless of whether the
machine chops or rounds. It does not reveal, however, which method is used.
Finally, most compilers provide an intrinsic function that returns the value of u, which is also sometimes called machine epsilon. Accordingly, MATLAB's intrinsic function for this purpose is eps.
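In the spirit of the ALG 3.1 referenced above (which is not reproduced here), the following MATLAB sketch halves a trial value until adding it to 1 no longer changes the sum, thereby exposing u on a machine that rounds:

u = 1;
while 1 + u > 1
    u = u/2;        % halve until adding u to 1 no longer changes the sum
end
u                   % about 1.1102e-16, the double-precision unit round-off
eps                 % MATLAB's built-in machine epsilon, 2.2204e-16 = 2*u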
3.3
The previous section dealt with the error incurred in approximating a real number as a floating-point
number. What happens, however, when these machine numbers, each of which contains error, are used
time and again in a calculation? Although the initial errors were small, it is not inconceivable that the
error accumulates in a snowball effect. If so, can we quantify and ultimately bound the process of the
propagation of round-off error?
Let's start with a simple example: the addition of two machine numbers on a hypothetical machine with an 8-bit mantissa. For simplicity, let's assume chopping rather than rounding. Let x = 1/3, y = 1/5, and z = x + y, the exact result of which is $z = 8/15 = 0.5333\ldots$. What happens on our hypothetical computer is the following:
 fl(1/3) = (.10101010)_2 × 2^{-1}
+fl(1/5) = (.11001100)_2 × 2^{-2}   (normalized)
         = (.01100110)_2 × 2^{-1}   (aligned)
---------------------------------------------
          (1.00010000)_2 × 2^{-1}
           (.10001000)_2 × 2^{0}    (chopped & normalized)          (35)
The final result, fl(z), is $(0.53125)_{10}$. Thus the relative round-off error following the addition operation is $(0.5333\ldots - 0.53125)/0.53125$, or about 0.39 percent. A re-examination of the addition operation reveals that there are three possible entry points for error: 1) the approximation of x by fl(x), 2) the approximation of y by fl(y), and 3) the approximation of the sum by fl(z). We now examine a generic addition operation to see if we can bound the error of the result.
Exact:     $x$,   $y$,   $z \equiv x + y$
Machine:   $\bar{x} \equiv \mathrm{fl}(x)$,   $\bar{y} \equiv \mathrm{fl}(y)$,   $s \equiv \bar{x} + \bar{y}$,   $\bar{z} \equiv \mathrm{fl}(s)$
Our goal is to bound the relative error of the final result, namely $(z - \bar{z})/z$. To simplify the logic, let's assume x > 0 and y > 0. By the definition of absolute error,
$$\epsilon_x = x - \bar{x} \implies \bar{x} = x - \epsilon_x, \qquad \epsilon_y = y - \bar{y} \implies \bar{y} = y - \epsilon_y \qquad (36)$$
From Eq. 36 and the definition of relative error it follows that
$$\delta_x = \frac{\epsilon_x}{x} \implies \bar{x} = x(1 - \delta_x), \qquad \delta_y = \frac{\epsilon_y}{y} \implies \bar{y} = y(1 - \delta_y) \qquad (37)$$
Thus,
$$s = \bar{x} + \bar{y} = x(1 - \delta_x) + y(1 - \delta_y) = x + y - x\delta_x - y\delta_y = z - x\delta_x - y\delta_y \qquad (38)$$
It follows that
$$\bar{z} = s(1 - \delta_z) = (z - x\delta_x - y\delta_y)(1 - \delta_z) = z - x\delta_x - y\delta_y - z\delta_z + x\delta_x\delta_z + y\delta_y\delta_z \qquad (39)$$
where $\delta_z$ is the relative error from storing s as a float. The last two terms of Eq. 39 contain products of presumably small relative errors and can be safely neglected. Thus, an accurate approximation is
$$\bar{z} \approx z - x\delta_x - y\delta_y - z\delta_z \qquad (40)$$
The absolute error in $\bar{z}$ is now available:
$$\epsilon \equiv z - \bar{z} = x\delta_x + y\delta_y + z\delta_z \qquad (41)$$
By the triangle inequality and the fact that x and y are presumed positive, it follows that
$$|\epsilon| \le x|\delta_x| + y|\delta_y| + z|\delta_z| \le xu + yu + zu = 2zu \qquad (42)$$
Dividing through by the exact value z gives the sought-after result:
$$|\delta| \equiv \frac{|\epsilon|}{z} \le 2u \quad \text{(machine addition)} \qquad (43)$$
This took some doing, but the result is simple to interpret: the absolute value of the relative error
incurred by adding two machine numbers is at most twice the unit round-off error.
Were we to perform a similar analysis for the multiplication of two machine numbers, namely $z \equiv x \cdot y$ (which we will not do), the result would be
$$|\delta| \le 3u \quad \text{(machine multiplication)} \qquad (44)$$
Not too surprisingly, the result for the division of two machine numbers ($z \equiv x/y$), which is multiplication by the reciprocal, is identical to that for multiplication, namely
$$|\delta| \le 3u \quad \text{(machine division)} \qquad (45)$$
So, the relative error of either a multiplication or a division operation is bounded by three times the unit
round-off error. All of this goes to show that adding, multiplying, or dividing machine numbers may
(because these are upper bounds only) increase the relative round-off error of the result, but the rate of
growth is relatively slow, doubling or tripling with each operation. However, over thousands or millions
of operations, round-off error can become a significant problem, eroding the accuracy of a machine result.
It may be necessary to use double-precision arithmetic in such cases.
It might be tempting to assume that subtraction behaves as addition. Unfortunately, the situation is far, far worse than you might imagine. Suppose we want to bound the relative error for $z \equiv x - y$. The analysis is similar to that for addition until near the end, from which it is obtained that
$$\epsilon \equiv z - \bar{z} = x\delta_x - y\delta_y + z\delta_z \qquad (46)$$
However, trouble arises when the triangle inequality is applied, because of the negative sign in front of $y\delta_y$, in which case $|-y\delta_y| = y|\delta_y|$ and
$$|\epsilon| \le x|\delta_x| + y|\delta_y| + |z||\delta_z| \qquad (47)$$
$$\le xu + yu + |z|u \qquad (48)$$
Dividing through by |z| gives
$$|\delta| \equiv \frac{|\epsilon|}{|z|} \le u\left(1 + \frac{x + y}{|x - y|}\right) \quad \text{(machine subtraction)} \qquad (49)$$
When x and y are nearly the same, the denominator $|x - y|$ approaches zero, and the round-off error is magnified dramatically. For example, if two machine numbers, each with 7-digit significance, agree in 6 of their first seven significant digits, the result of subtracting them will retain only one significant digit. In a single operation, the relative error has been multiplied one million fold!!!
3.4
It is sobering to learn that a single operation in a long program of hundreds or thousands of lines can
completely destroy the accuracy of the result. That, in fact, can happen, and when it does, it is called
catastrophic loss of significance. Here is a very real example.
Consider the quadratic formula, which solves $ax^2 + bx + c = 0$, namely
$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \qquad (50)$$
Suppose b > 0 and 4ac is very, very small relative to $b^2$. In this case the numerator of Eq. 50 looks almost like $-b \pm b$. This is not a problem for the root involving $-b - b$, but it is a problem for the root involving $-b + b$, because this root is computed by the subtraction of two numbers that are almost identical. Let's put some numbers in to see what happens. Let a = 1.1, b = 1.11, and c = 0.000000001. Equation 50 gives the solutions $x_1 = -1.00909$ and $x_2 = 0.0$. The first root is correct to 6 places and the second root is correct to no significant digits!
It is a little-known fact that there is an alternative quadratic formula, given below, which is derived from the first by rationalizing its numerator:
$$x = \frac{2c}{-b \mp \sqrt{b^2 - 4ac}} \qquad (51)$$
In the second form, the root that required subtraction in the first formula is found by addition to be $x_2 = -9.00901 \times 10^{-10}$. This result is correct to six significant digits. The original result was in error by 100 percent!
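The failure is easy to reproduce in MATLAB by working in single precision, which carries roughly the 7 significant digits assumed above; the script below is an illustrative sketch, not part of the coursepak.

a = single(1.1);  b = single(1.11);  c = single(1e-9);
d = sqrt(b^2 - 4*a*c);        % 4ac is below single-precision resolution here, so d is (nearly) b
x2_bad  = (-b + d)/(2*a)      % catastrophic cancellation: zero or round-off noise, no correct digits
x2_good = 2*c/(-b - d)        % alternative formula, Eq. 51: about -9.0090e-10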
The moral of this story is not to avoid subtraction altogether. It is to avoid the subtraction of numbers that are nearly the same. A general rule of thumb is that the round-off error grows by a factor of $10^n$, where n is the number of leading digits of agreement. And whenever choices are available, choose the algorithm that is most benign in terms of error propagation.
3.5
Exercises
1. The expression $\sum_{i=1}^{\infty} \frac{1}{i}$ is the well-known divergent harmonic series. What do you think would happen if you tried to sum it on a computer? Don't actually try it; it would take far too long!
2. Find the absolute and relative round-off errors incurred when 32 10 is approximated by a floating
point number with a 3-bit mantissa in base-2.
3. Consider a machine in base-2 with a 4-bit mantissa and 4-bit exponent. What is the machine epsilon
for this machine?
4. Derive the alternative quadratic formula of Eq. 51.
5. You know from algebra that the roots of a quadratic polynomial can be found using the quadratic
formula. There is an alternate quadratic formula which can be derived from the first. The standard
and alternate formulas are below:
$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \qquad (52)$$
$$x = \frac{2c}{-b \mp \sqrt{b^2 - 4ac}} \qquad (53)$$
Complete the following parts.
6. Consider the expression $\ln\left(x - \sqrt{x^2 - 1}\right)$. Explain why this is a bad idea for large values of x, then rewrite it in a form that is worth using for large x.
7.
(a) Evaluate the polynomial $y = x^3 - 7x^2 + 8x - 0.35$ at x = 1.37. Use 3-digit arithmetic with chopping. Evaluate the percent relative error.
(b) Repeat (a) but express y as $y = ((x - 7)x + 8)x - 0.35$. Evaluate the error and compare with (a).
(c) Explain what is going on.
8. Consider a binary machine with a 4-bit mantissa and 8-bit exponent. Suppose a = (3)10 and b =
(7.5)10 .
(a) Suppose x = a + b. Find the value of fl(x) after the machine computes the addition (assume
rounding). Represent your answer in base-10.
(b) Compute the relative error associated with the addition in part (a).
9. Recall from calculus that the mathematical constant e can be defined as
$$e = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n$$
This problem involves approximating the value of e using the formula above. To approximate this limit, I wrote an m-file that computes
$$\left(1 + \frac{1}{n}\right)^n$$
for successively larger values of n. I calculated the absolute and relative error associated with this approximation and fl(e). Here are my results:
n          e_approx   fl(e)    abs.error   rel.error
----------------------------------------------------
1.00e+01   2.5937     2.7183   0.1245      4.5815e-02
1.00e+02   2.7048     2.7183   0.0135      4.9546e-03
1.00e+03   2.7169     2.7183   0.0014      4.9954e-04
1.00e+04   2.7181     2.7183   0.0001      4.9995e-05
1.00e+05   2.7183     2.7183   0.0000      4.9999e-06
1.00e+06   2.7183     2.7183   0.0000      5.0008e-07
1.00e+07   2.7183     2.7183   0.0000      4.9416e-08
1.00e+08   2.7183     2.7183   0.0000      1.1077e-08
1.00e+09   2.7183     2.7183   0.0000      8.2240e-08
1.00e+10   2.7183     2.7183   0.0000      8.2690e-08
1.00e+11   2.7183     2.7183   0.0000      8.2735e-08
1.00e+12   2.7185     2.7183   0.0002      8.8905e-05
1.00e+13   2.7161     2.7183   0.0022      7.9896e-04
1.00e+14   2.7161     2.7183   0.0022      7.9896e-04
1.00e+15   3.0350     2.7183   0.3168      1.1653e-01
1.00e+16   1.0000     2.7183   1.7183      6.3212e-01
1.00e+17   1.0000     2.7183   1.7183      6.3212e-01
1.00e+18   1.0000     2.7183   1.7183      6.3212e-01
1.00e+19   1.0000     2.7183   1.7183      6.3212e-01
1.00e+20   1.0000     2.7183   1.7183      6.3212e-01
Explain what happens to both types of error as n increases. Explain why the error behaves as it does (you do not need to convert things to base-2 or do any arithmetic here; just explain in words). Why does e_approx approach 1?
Fixed-point iteration (FPI) is a very simple idea with widespread applications and deep implications. It is
also fun, something that can be accomplished with a computer program that consists of little more than a
single loop.
DEF: Let g(x), the iteration function, be continuous on [a, b], and let x0 be the initial condition. An
iteration of the form
$$x_{i+1} \leftarrow g(x_i), \qquad i = 0, 1, 2, \ldots \qquad (54)$$
is called a fixed-point iteration. The iterates $x_i$ define the infinite sequence
$$x_0$$
$$x_1 = g(x_0)$$
$$x_2 = g(x_1) = g[g(x_0)] = g^2(x_0)$$
$$x_3 = g(x_2) = g\{g[g(x_0)]\} = g^3(x_0)$$
$$\vdots \qquad (55)$$
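Generating an orbit in MATLAB takes only a few lines. The sketch below uses g(x) = cos(x) and x0 = 1 purely as an arbitrary illustration (this particular orbit happens to converge); it is not part of the coursepak.

g = @(x) cos(x);
x = 1.0;                    % initial condition x_0
for i = 1:50
    x = g(x);               % x_{i+1} <- g(x_i)
end
x                           % approaches the fixed point p = 0.7391, where p = cos(p)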
4.1
Divergent Orbits
In Math 236, the focus is on convergence and theorems proving the convergence of certain types of sequences. Here our interest concerns sequences that are generated by FPI; henceforth we use the term
orbits. Divergent orbits may be in some sense more interesting than convergent ones, because of the variety of modes of divergence. Orbits may fail to converge in three entirely different ways: if their elements
are periodic, if the elements grow unbounded, or if the elements are chaotic. Once again, we should define
these terms precisely.
DEF: Let B be a positive real number. If for any B > 0, no matter how large, there exists N(B) such that
the iterates xi of the FPI satisfy |xi | > B for all i > N(B), then the orbit is said to be unbounded.
DEF: If the FPI $x_{i+1} \leftarrow g(x_i)$ does not converge, but $x_{i+1} \leftarrow g^n(x_i)$ converges, and n is the smallest positive integer for which this is true, then $O[g(x), x_0]$ is said to be periodic with period n.
EX: Consider $g(x) = -x$. Note that $O[g(x), 1] = \{1, -1, 1, -1, 1, -1, \ldots\}$, which is divergent. However, $g^2(x) = g[g(x)] = -(-x) = x$, so that $O[g^2(x), 1]$ converges to 1. We say $g(x) = -x$ has a period-two orbit from the initial condition $x_0 = 1$.
Another way to identify periodic orbits of FPIs is by examining regular subsets of their associated sequences. For example, the sequence $\{1, -1, 1, -1, 1, -1, \ldots\}$ is associated with FPI using $g(x) = -x$ and $x_0 = 1$. From this divergent sequence, let's look at the subsequences formed by selecting even- and odd-indexed elements, respectively, namely the subsequences $\{1, 1, 1, 1, \ldots\}$ and $\{-1, -1, -1, -1, \ldots\}$. Clearly,
each of these subsequences is convergent. In general, if the sequence Sk associated with a fixed-point
iteration diverges, but the subsequence Snk converges, and n is the smallest positive integer for which this
is true, then the orbit of the FPI is periodic with period n.
We define a chaotic orbit by a process of elimination as follows:
DEF: A divergent orbit that is neither unbounded nor periodic is chaotic.
It is an interesting fact that even simple nonlinear iteration functions may exhibit the full range of convergent and divergent behaviors. For example, one of the most oft-studied FPIs is called the logistic map. The iteration function for the logistic map is $g(x) = \lambda x(1 - x)$, where $0 \le \lambda \le 4$ is a parameter that can be varied to produce different behaviors. Depending upon the value of $\lambda$, the logistic map produces convergent, periodic, or chaotic orbits. However, the most remarkable attribute is that orbits of any period can be obtained if you know where to look in the parameter range!
HW1: Consider FPI with the logistic map, with initial condition $0 < x_0 < 1$. Write a simple computer program to input the values of $\lambda$ and $x_0$ and then compute and output the first 100 or so iterates of the logistic map, namely
$$x_{i+1} \leftarrow \lambda x_i(1 - x_i) \qquad (56)$$
Note that the guts of the program consist of a simple loop. a) Do a numerical experiment to find a value of $\lambda$ for which the iterates converge to 0.5 from any initial guess within the specified interval. b) Find a value of $\lambda$ for which the orbit is of period 2. c) Find a value of $\lambda$ for which the orbit is of period 8. d) Find a value of $\lambda$ for which the orbit appears chaotic.
4.2
Convergent Orbits
The $64,000 question is: Why do some FPIs converge and others diverge? More precisely, can we find
criteria that guarantee convergence?
DEF: If there exists some value p such that p = g(p), then p is called a fixed-point of g(x).
EX: If $g(x) = x^2$, then solving $p = p^2$ defines the fixed points of the iteration $x_{i+1} \leftarrow g(x_i)$. Thus the fixed points are the roots $p \in \{0, 1\}$ of $p^2 - p = p(p - 1) = 0$.
Fixed points are appropriately named. If the initial condition x0 corresponds to a fixed point, the iterates
will never change. In the example above, the fixed points were found analytically, by solving p = g(p) for
p. Fixed points can also be found graphically by graphing the line y = x and the iteration function y = g(x) on the same axes. These graphs intersect where x = g(x); that is, at the fixed point(s). Figure 5 shows the fixed points for the iteration function of the logistic map, $g(x) = \lambda x(1 - x)$, for several values of the parameter $\lambda$. Notice that, although p = 0 is always a fixed point, the occurrence and location of a second fixed point depend upon the value of the parameter. There is only one fixed point (p = 0) if $0 < \lambda \le 1$. For $\lambda > 1$, there are two, and the value of the second increases as the parameter increases.
Of primary interest is what happens to the iterates when the iteration begins not at a fixed point, but
$$\lim_{i \to \infty} x_{i+1} = L$$
For both FPIs, we will use the logistic map with the same initial condition $x_0 = 0.5$. Thus, for both,
$$x_{i+1} \leftarrow \lambda x_i(1 - x_i)$$
The FPIs differ only in the value of the parameter $\lambda$. Here are the two cases and their first few iterates to nine significant digits:

CASE I: $\lambda = 5/2 = 2.5$, fixed points $p \in \{0, 3/5\}$

x_0  = 0.5
x_1  = 0.625
x_2  = 0.5859375
x_3  = 0.596624741
x_4  = 0.601659149
x_5  = 0.599163544
...
x_13 = 0.600006516

CASE II: $\lambda = 7/2 = 3.5$, fixed points $p \in \{0, 5/7\}$

x_0 = 0.5
x_1 = 0.875
x_2 = 0.3828125
x_3 = 0.826934814
x_4 = 0.500897695
x_5 = 0.874997180
x_6 = 0.382819904
...
x_n   = 0.382819...
x_n+1 = 0.826940...
x_n+2 = 0.500884...
x_n+3 = 0.874997...
The behaviors of these two FPIs are dramatically different. For Case I, the iteration appears to be
converging in oscillatory fashion toward the larger fixed point p = 3/5 = 0.6. For Case II, the iteration
appears to be generating a period-4 orbit. It is tempting to suggest that if the initial condition were closer
to the upper fixed point of p = 5/7 in Case II all would be well, and the iteration would converge. But
this is not the case. No matter how close the initial condition is to the fixed point, unless it is exactly (to
machine precision) the value x0 = p = 5/7, the iterates in Case II wander away from the fixed point. It is
as if the fixed point and the iterates are magnets. In Case I, the magnets have opposite polarities and the
iterates are attracted by the fixed point. In Case II, it is as if the fixed point and the iterates have the same
polarities, in which case the fixed point repels the iterates. This is all very strange. Why the difference?
4.3
The fundamental difference between Cases I and II above is the slope of the iteration function g(x) at the fixed point p. That is, Cases I and II differ primarily in $g'(p)$. Specifically, for Case I, $|g'(p)| < 1$, and for Case II, $|g'(p)| > 1$. This observation, in fact, lies at the heart of the Fixed-Point Iteration Theorem (FPI Theorem).
Fixed-Point Iteration THEOREM: Suppose that g(x) and $g'(x)$ are continuous on $[a, b] = [p - \delta, p + \delta]$ ($\delta > 0$) containing the unique fixed point p of g(x), and let $x_0 \neq p$ be any starting approximation to p within this interval.
1. If $|g'(x)| \le K < 1$ for all $a \le x \le b$, then the FPI $x_{i+1} \leftarrow g(x_i)$ converges to p, and p is called an attracting fixed point or simply an attractor.
2. If $|g'(x)| > 1$ for all $a \le x \le b$, then the FPI diverges, and p is called a repelling fixed point or simply a repulsor.
However, before we can prove this important theorem, we need to revisit an idea from Math 235,
namely the Mean Value Theorem for Derivatives.
ASIDE: Mean Value Theorem for Derivatives (MVTD): If f(x) is continuous on [a, b] and differentiable on (a, b), then there exists $a < c < b$ such that
$$f'(c) = \frac{f(b) - f(a)}{b - a} \qquad (57)$$
or equivalently
$$f(b) - f(a) = (b - a)\, f'(c) \qquad (58)$$
Figure 6 illustrates the MVTD, which in plain English says that, for a differentiable function, there must be a point c in the interval of differentiability for which the instantaneous rate of change of the function is equal to its average rate of change over the entire interval. A related theorem, Rolle's Theorem, is the special case of the MVTD for which f(b) = f(a), in which case the average rate of change is zero.
We are now in the position to sketch the proof of the FPI Theorem. Let p be a fixed point of the iteration function g(x). Also let $x_0$ be an initial guess at p and $x_i$ denote the ith iterate of the FPI. Finally, let $e_i \equiv p - x_i$ denote the error of the ith iterate relative to the fixed point. Thus, if the sequence of errors $e_i$ tends to zero, the FPI converges to p. On the other hand, if the sequence of errors diverges, the FPI fails to converge to p. We now examine the errors of the first few iterates with a little help from the second variant
of the MVTD (Eq. 58). The initial error is given by
$$e_0 = p - x_0 \qquad (59)$$
After the first iteration, the error is
e_1 = p - x_1
    = p - g(x_0)            (definition of x_1)
    = g(p) - g(x_0)         (definition of fixed point)
    = g'(c_0)(p - x_0)      (MVTD)
    = g'(c_0) e_0           (definition of e_0)          (60)
where $c_0$ lies between p and $x_0$. Notice that the error has diminished in magnitude if $|g'(c_0)| < 1$. Similarly,
e_2 = p - x_2
    = g'(c_1) e_1             (as above)
    = g'(c_1) g'(c_0) e_0     (Eq. 60)          (61)
where $c_1$ lies between p and $x_1$. To generalize, after n iterations the error is given by the following:
$$e_n = p - x_n = g'(c_{n-1})\, g'(c_{n-2}) \cdots g'(c_1)\, g'(c_0)\, e_0 = \left[\prod_{k=0}^{n-1} g'(c_k)\right] e_0 \qquad (62)$$
where $c_k$ lies between p and $x_k$, and $\prod$ is a mathematical symbol, similar to $\sum$ (for summation), that means product. From Eq. 62, it is clear that if the derivative of the iteration function is less than one in absolute value, the magnitude of the error of each iterate is less than that of the previous iterate. Thus, as $n \to \infty$, the error diminishes toward zero, and the FPI converges. We now formally present the FPI theorem, which is based upon the error analysis above.
Let's apply the FPI Theorem to our previous examples with the logistic map, for which $g'(x) = \lambda(1 - 2x)$.
CASE I: $\lambda = 5/2 = 2.5$, $p \in \{0, 3/5\}$: $|g'(3/5)| = |2.5(1 - 6/5)| = 0.5 < 1$, so $p = 3/5$ attracts.
CASE II: $\lambda = 7/2 = 3.5$, $p \in \{0, 5/7\}$: $|g'(5/7)| = |3.5(1 - 10/7)| = 1.5 > 1$, so $p = 5/7$ repels.
Thus the results of our numerical experiment are consistent with the theory, because the theory predicts
that the second fixed point p2 switches from attractor to repulsor when increases from 2.5 to 3.5.
Strictly speaking, the hypothesis of the FPI Theorem requires that we test the derivative of the iteration
function over an interval containing p, not just at p itself. However, because the iteration function and
its derivative are continuous, what is true at the fixed point must also be true for at least a small interval
containing the fixed point.
Before we leave this section, it is instructive to discuss the meaning of K in Part (1) of the FPI Theorem. First, K can be viewed as the least upper bound of $|g'(x)|$ on the interval $[p - \delta, p + \delta]$. If K = 1/2, for example, the error diminishes roughly by a factor of two with each iteration. This implies that each additional iteration yields an additional bit of accuracy. On the other hand, if K = 1/10, the error diminishes
by a factor of approximately 10 with each iteration. Thus, each iteration yields another significant digit.
However, if K = 9/10, the error is reduced only by 10 percent with each iteration; hence convergence
to the fixed point is slow. We conclude with some additional remarks about convergence based upon the
sketch of the proof of the FPI Theorem, followed by an algorithm for FPI in pseudocode.
REMARKS:
1. If $0 < g'(x) < 1$, then convergence is monotone, because the error does not change sign from iteration to iteration.
2. If $-1 < g'(x) < 0$, then convergence is oscillatory, because the error alternates sign with successive iterations. Note that convergence to $p_2 = 0.6$ for the logistic map (Case I) is oscillatory, as expected.
3. The rate of convergence is determined by the least upper bound K in the sense that the smaller the
value of K, the faster the rate of convergence.
ALGORITHM 4.1: Fixed-Point Iteration
given iteration function g(x) (user-defined function)
input initial guess x1
initialize x0 to outrageous value
input relative error tolerance ε
input iteration limit imax
i ← 0 (initialize the iteration counter)
repeat until |x1 − x0| ≤ ε·|x1| or i > imax
| x0 ← x1
| x1 ← g(x0)
| i ← i + 1
|
output x1
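A MATLAB rendering of ALGORITHM 4.1 might look like the sketch below; the variable names, tolerance, and the test function g(x) = (x + 2/x)/2 (whose fixed point is sqrt(2)) are illustrative choices, not part of the coursepak.

g    = @(x) (x + 2./x)/2;   % user-defined iteration function
x1   = 1.0;                 % initial guess
x0   = 1e30;                % outrageous value so the loop is sure to execute at least once
tol  = 1e-10;               % relative error tolerance
imax = 100;                 % iteration limit
i = 0;
while abs(x1 - x0) > tol*abs(x1) && i <= imax
    x0 = x1;
    x1 = g(x0);
    i  = i + 1;
end
x1                          % about 1.414213562373095 after a handful of iterations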
HW1: Let $g(x) = \frac{1}{2}\cos(x)$, $0 \le x \le \pi/2$. Using a calculator, find the fixed point p of the iteration $x_{i+1} \leftarrow g(x_i)$ to five significant digits. Also perform the following:
a) Carefully sketch by hand the functions y = x and y = g(x) on the same axes. Their intersection
occurs at x = g(x); that is, at the fixed point p. Estimate p from your sketch and use this value as
your initial guess x0 .
b) Evaluate g0 (x0 ). Based on this value, do you expect the FPI to converge? If so, monotonically or in
oscillatory fashion?
43
4.4
Exercises
xn
2
+ 2x3n .
3. Create a careful sketch of the graph of y = x 3+2 and y = x on the interval [0,3]. What FPI does this
graph correspond to? Illustrate the idea of a cobweb diagram for two different initial conditions.
How many fixed points are there? What is the state of the fixed point(s)?
44
In the previous section we discussed fixed-point iteration, a type of iterative process. Many numerical
procedures are iterative in nature. Among the most important iterative techniques are rootfinding methods
that solve nonlinear equations.
DEF: Suppose f (x) is a continuous function on the closed interval [a, b]. If there exists r [a, b] such that
f (r) = 0, then r is said to be a root or zero of f (x). In plain English, roots are the x-intercepts of functions
that cross the x axis.
NOTE: Whereas a linear function whose slope is non-zero has a unique root, nonlinear functions may
have no roots, a unique (one and only one) root, or multiple roots, as illustrated in Fig. 7. Rootfinding
methods are unnecessary for linear problems. Thus, the methods of this section will almost always apply
to nonlinear functions. Before we proceed to the methods themselves, it would be well to give a couple of
examples that underscore the necessity of numerical rootfinding methods.
EX1: The Fundamental THEOREM of Algebra (FTA)
DEF: If an 6= 0, then a function of the form
Pn (x) = a0 + a1 x + a2 x2 + ...an1 xn1 + an xn
is called a polynomial in x of degree (order) n.
REMARKS: In general, the graphs of polynomials can be quite wiggly. In fact, it can be proven that
the graph of a polynomial of degree n can change direction as many as n 1 times. From this simple fact
follows the Fundamental Theorem of Algebra (FTA): An nth degree polynomial with real coefficients ai
has at most n real roots. By way of analogy, think of the polynomial as the Shenandoah River, which
45
is undulating, and the x-axis as I-81, which is relatively straight through the Valley. Each intersection of
these two, at a bridge, corresponds to a root of the polynomial. (The FTA can be restated in the following
way if one allows the coefficients ai and the independent variable x to be complex numbers: An nth degree
polynomial has exactly n (complex) roots counting multiplicities.)
Lets consider the general form of three low-order polynomials: the quadratic (n = 2), the cubic
(n = 3), and the quartic (n = 4) , given explicitly below.
P2 (x) = a0 + a1 x + a2 x2
P3 (x) = a0 + a1 x + a2 x2 + a3 x3
P4 (x) = a0 + a1 x + a2 x2 + a3 x3 + a4 x4
The roots of a quadratic are readily found by solving P2 (r) = 0 = a0 + a1 r + a2 r2 for r by the quadratic
formula. No need for numerical solutions here. It turns out that there are general but obscure closed-form
techniques for determining the roots of cubic and quartic polynomials. Thus, no need for numerics for
n = 3 or n = 4 either. However, for n > 4 there are no analytical techniques that work in the most general
case. We dont mean to imply that the analytical techniques just arent known. It can be proven, in fact,
that such techniques do not exist! Thus, the only completely general way to find the roots of polynomials
of degree 5 n is by numerical rootfinding techniques! Analysis has its limits.
EX2: The Rayleigh Pitot Formula
Presumably most of you have flown on a private or commercial aircraft during you lifetime. All
aircraft, from Cessnas to the Space Shuttle have an on board air data system. Roughly speaking, the airdata system is the speedometer of an aircraft, which provides an indicator of speed as Mach number,
the ratio of airspeed to the speed of sound. At sea level, the speed of sound is about 700 mph, but it varies
gradually with temperature, diminishing with altitude. Commericial airliners, which are currently limited
to about Mach 0.85, fly at approximately 600 mph. In contrast, before it retired, the supersonic Concorde
cruised at Mach 1.6 or about 1100 mph, and during the first moments of atmospheric re-entry, the Shuttle
is traveling at a whopping Mach 25.
An air data system extracts Mach number based upon the ratio of two pressures. The relationship
between pressure and Mach number M is given by the Rayleigh Pitot formula. Let P be the ambient
atmospheric pressure and Pt be the total (or impact) pressure measured by the pitot-static system on the
noseboom of an aircraft (or elsewhere). Let their ratio be denoted R. That is, R = P /Pt . The Rayleigh
Pitot Formula, which is derived from the relationships of gas dynamics, states
h
i
1 + 1 M 2 1
if M 1
2
(63)
R(M ) =
h
i h 2
i 1
1 2M (1) 1
2
if M > 1
+1
(+1)M 2
the speed is supersonic (M > 1), however, a closed-form inverse cannot be found. The way out of this
dilemma is to turn the Rayleigh Pitot Formula into a rootfinding problem and to use a numerical technique
that converges quickly, because someones life is riding on the answer.
5.1
With a little bit of manipulation, many rootfinding problems can be re-cast as Fixed-Point Iteration (FPI)
problems, so that the techniques of the previous section can be applied to the solution process. Heres how
it works (when it works).
Our initial problem, given f (x), is to find r such that f (r) = 0. Suppose that f (r) = 0 can be re-written
as r = g(r). In this case, a root r of f is also a fixed point of g. The root r may then be found by iterating
on xi+1 g(xi ) by ALG 4.1, provided the FPI converges. Here is an example to illustrate how it works.
EX: Suppose we want to find the roots of f (x) = x5 2x + 1, a quintic polynomial for which closed-form
methods do not exist. An obvious root is r = 1, but there is at least one more real root between 1/2 and
3/4. How do we know this? Consider that f (1/2) > 0 but f (3/4) < 0. Because f is continuous, it must
therefore pass through zero somewhere on the interval [0.5, 0.75].
Lets begin with our quintic expressed at the unknown root r, where f vanishes by definition.
f (r) = r5 2r + 1 = 0
(64)
(65)
with x0 = 0.5 and see what happens. The iterates, to nine significant digits and obtained by ALG 4.1, are
as follows:
x0
x1
x2
x3
x4
x5
x6
=
=
=
=
=
=
=
0.5
0.515625
0.518223837
0.518687746
0.518771542
0.518786710
0.518789546
47
x7 = 0.518789954
...
x11 = 0.518790064
The 11th iterate is correct in all nine significant digits! This almost seems too good to be true: an answer
of extraordinary accuracy by a simple method in relatively few iterations. Suppose we naively tried to find
the other root (r = 1) by the same method, starting, say, at x0 = 1.1. Here are the iterates in the latter case:
x0
x1
x2
x3
x4
x5
x6
=
=
=
=
=
=
=
1.1
1.305255
2.394291594
39.84188892
50196057.74
1.59 1038
Error!
Yikes! This is as bad as it gets. Not only did the iteration diverge, the computation blew up in only
six iterations. It appears we forgot the lesson learned in the previous section: FPIs converge only if
|g0 (r)| < 1. For g(x) = 21 (x5 + 1), g0 (x) = 25 x4 . For r1 = 0.518790064, g0 (r1 ) 0.18, which implies that
convergence is monotonic and that the error is reduced by more than 80 percent with each successive
iteration when near the root. On the other hand, for r2 = 1, g0 (r2 ) = 5/2 >> 1, so the iteration diverges,
and quickly.
The moral of this story should be clear: FPI can be adapted for rootfinding but the method is far from
foolproof. Here are some things to consider before resorting to FPI to find roots.
REMARKS:
1. FPI does not always converge.
2. If there are multiple roots, FPI may converge to the wrong root.
3. The rate of convergence, which depends on |g0 (r)|, can vary wildly depending upon the particular
problem.
4. In general, the form r = g(r) obtained from f (r) = 0 is non-unique; that is, there are usually multiple
ways to rearrange the problem. It may take trial and error to find which rearrangement works for a
desired root.
HW1: Consider the quadratic f (x) = x2 3x + 2 whose roots, by factoring are r {1, 2}. Pretend you
dont know the roots and try to find both by FPI. Explain what happens and why.
48
Figure 8: Intermediate Value Theorem: a) for continuous f ; and b) why continuity is essential.
5.2
Bisection
If you need a foolproof method for finding a root, bisection is your choice. In truth, no rootfinding method
is foolproof (because fools are so ingenious), but bisection is guaranteed to converge provided you have
set up the problem correctly. To do that, you must select the search interval [a, b] such that 1) you know
at least one root exists in the interval (existence), and 2) you know no more than one root exists in the
interval (uniqueness).
Well address the existence issue here, and the uniqueness issue in the context of Newtons method.
Regarding existence, rootfinding is a bit like game hunting. It doesnt make sense to hunt a lion (with
a camera please, lions are endangered) if there are no lions around. Same with roots. Fortunately, the
Intermediate Value Theorem (IVT) from Math 235 is all that we need to establish the existence of a root.
Intermediate Value THEOREM (IVT): Suppose that f (x) is continuous on the closed interval [a, b] and
L is any number between f (a) and f (b); that is either f (a) < L < f (b) or f (b) < L < f (a). Then there
exists at least one a < c < b such that f (c) = L.
Figure 8a demonstrates the IVT for a continuous function, and Fig. 8b shows why continuity is crucial to the theorem. In plain English, the IVT says that a continuous function must pass through all its
intermediate values.
COROLLARY: (Existence of a Root) If f (x) is continuous on [a, b] and f (a) f (b) < 0, then there exists
a < r < b such that f (r) = 0; that is, r is a root of f (x).
Before the simple proof, lets make sure we understand the hypothesis. If f (a) f (b) < 0 then the
49
EX: Consider f (x) = x2 2, the roots of which are r { 2, + 2}. We are going to use bisection to
actually compute the positive square root of 2. The positive root is known to lie between a = 1 and b = 2,
because f (x) changes sign on this interval. Thus, we will use [a0 , b0 ] = [1, 2] as our starting interval. Note
that f (a0 ) f (b0 ) = f (1) f (2) = (1)(2) = 2 < 0. This establishes that a root lies in the interval. Well
assume for now it is unique and deal with that issue later.
Step 0:
m0 =
f (a0 ) f (m0 )
r
a1
b1
a0 + b0
= 1.5
2
f (1) f (1.5) = (1)(0.25) = 0.25 < 0
[a0 , m0 ]
a0 = 1
m0 = 1.5
50
Step 1:
m1 =
f (a1 ) f (m1 )
r
a2
b2
=
6
a1 + b1
= 1.25
2
f (1) f (1.25) = (1)(0.4375) = 0.4375 > 0
[a1 , m1 ] r [m1 , b1 ]
m1 = 1.25
b1 = 1.5
Step 2:
m2 =
f (a2 ) f (m2 )
r
a3
b3
=
6
a2 + b2
= 1.375
2
f (1.25) f (1.375) = (0.4375)(0.109375) > 0
[a2 , m2 ] r [m2 , b2 ]
m2 = 1.375
b2 = 1.5
Step 3:
m3 =
f (a3 ) f (m3 )
r
a4
b4
a3 + b3
= 1.4375
2
f (1.375) f (1.4375) = (0.109375)(0.06640625) < 0
[a3 , m3 ]
a3 = 1.375
m3 = 1.4375
Step 4:
m4 =
...
a4 + b4
= 1.4065
2
The process continues until the desired accuracy is attained. Several things are clear about bisection.
First, it always converges, provided the root is trapped at each step. Second, the convergence is not
monotonic. At some steps, the value of the midpoint is actually farther from the exact root than was the
midpoint of the previous step. In the example above, m1 = 1.25 is farther from r = 1.414... than was the
initial midpoint m0 = 1.5. So, bisection is a plodding method: dependable but slow. Lets quantify the
error property of the bisection method.
Consider the absolute value of the absolute error after the nth step. The root at each step is approximated by the new midpoint; thus, |en | = |r mn |. However, as Fig. 10 shows, the error can never be worse
than half the interval width. That is,
|en | = |r mn |
51
bn an
2
(66)
|en |
To say it succinctly, if r [a, b], the absolute error after the n bisection step is bounded as follows:
|en |
ba
2n+1
(67)
52
Here is the bisection algorithm in pseudocode, the essence of the example above, but all cleaned up.
ALGORITHM 5.1: Rootfinding by Bisection
given f (x) (user defined function)
input interval limits a, b
input absolute error tolerance
input number bits in mantissa n
f a f (a) (evaluate f at endpoint a)
if f a f (b) > 0
output error message
stop
end if
set f m to something outrageous
i 0 (initialize loop counter)
do until | f m| < or i > n
| m (a + b)/2 (midpoint of interval)
| f m f (m) (evaluate f at midpoint and save)
| if f a f m 0
|
b m (keep a, replace b)
| else
|
a m (keep b, replace a)
|
f a f m (rename f m)
| end if
| i i+1
|
output m (approximates root r) and f m (indication of error)
Can you see an advantage to storing f (a) and f (m) as f a and f m, respectively, in the algorithm above?
5.3
Newtons Method
Bisection is trustworthy, but it is plodding. What if we need fast convergence to a root, for, say, a real-time
application like a flight-control system? Can we do a better job of estimating the root at each step of the
iteration? If so, the iteration may converge faster. In this regard, Newtons method is spectacular. But
before we use it, wed better make sure that the root exists and is unique. Weve already addressed the
existence issue in the previous subsection. Lets now give some attention to uniqueness.
ASIDERolles THEOREM: (Recall that Rolles Theorem is a special case of the Mean Value Theorem
for Derivatives.) If f (x) is continuous on [a, b] and differentiable on (a, b), and if f (a) = f (b) (this is the
53
(68)
Figure 12: Multiple roots, as shown, are ruled out if function f is monotonic.
(Convince yourself this is true by evaluating y when x = x0 and by computing dy/dx.) It makes sense to
use the x intercept of the tangent line as the next guess at the root r. Lets call this value x1 , for which
y = 0 by the definition of the x intercept. Thus, the intercept is found by solving the following equation
for x1 :
0 = f (x0 ) + f 0 (x0 )(x1 x0 )
(69)
for which
f (x0 )
(70)
f 0 (x0 )
Thus x1 r. Now lets repeat the process using x1 as the new initial guess. There is no need to re-derive
everything. The final result is just Eq. 70 with a shift of indices, namely
x1 = x0
x2 = x1
f (x1 )
f 0 (x1 )
(71)
f (x2 )
f 0 (x2 )
(72)
Clearly, we have the beginnings of a new algorithm. In a nutshell, Newtons method is an iteration of the
form
f (xi )
xi+1 xi 0
(73)
f (xi )
Notice that, unlike bisection, Newtons method requires the evaluation of both the function f and its
derivative f 0 at each iteration. Note also that Newtons method would run into trouble if f 0 (xi ) were zero
for some iteration i. However, if the uniqueness criterion (monotonicity) has been established first, then
the derivative cannot vanish.
Of primary interest is the rate of convergence of Newtons method. Lets
apply it to the same problem
to which we applied the bisection method, namely the determination of the 2 by finding the positive zero
55
of f (x) = x2 2. For this function f 0 (x) = 2x. We know 1 r = 2 2 and also that f 0 (x) > 0 on [1, 2].
So both the existence and uniqueness criteria are satisfied if [a, b] = [1, 2].
For this problem, the nutshell version of Newtons method looks like
xi+1 xi
xi2 2 xi 1
= +
2xi
2 xi
(74)
For a head-to-head competition with bisection, lets assume the same initial guess as before, namely
x0 = 1.5, the midpoint of the interval. Table 5 shows the first three iterates of Newtons method.
i xi
f (xi )
0 1.5
0.25
1 1.416666
0.0012817778
2 1.414215686 6.0065285 106
3 1.414213562 1.00553 109
Table 5: Convergence typical of Newtons method.
There are several noteworthy comments to make about Newtons method for the example problem.
First, 9-digit accuracy is attained in only three iterations! Second, that the iteration is converging to the
root is clear from the third column, which should tend toward zero. Third, the number of correct significant
digits is indicated by underlining. The initial guess is correct in only one digit, the next iteration in three
digits, the next in six, and the last iteration in all digits! It appears that the number of correct significant
digits is roughly doubling with each iteration. Is this rapid convergence a fluke for this particular problem
or is it an attribute of Newtons method in general? Stay tuned!
But first, you
may be wondering why anyone would apply Newtons method to such a trivial problem
as computing 2. Cant I just get this value from my calculator? you ask. Yes, but did it ever occur to
56
you to wonder how your calculator extracts square roots? Most likely, by Newtons method! Now back to
the issue of convergence.
It is also interesting to note that Newtons method is a special type of fixed-point iteration (FPI) for
which the iteration function is
f (x)
g(x) = x 0
(75)
f (x)
Recall that the convergence rate of FPI is related to the absolute vale of g0 (x) at the fixed point (or in this
case, the root r). Essentially, the error of each new iteration is approximately g0 (r) times the error of the
previous iteration. In the best of all possible worlds then, an iterative method would strive for g0 (r) = 0.
This wish becomes a reality for Newtons method!
HW1: From Eq. 75 derive g0 (x) for Newtons method, and show that g0 (r) = 0.
In Math 448, should you continue with numerical analysis, you will learn that this wonderful trait of
Newtons method results in what is known as quadratic convergence. From a practical point of view, the
doubling of the number of significant digits of accuracy with each subsequent iteration is a characteristic
of quadratic convergence. Typically Newtons method converges to single precision (6-7 digits) in only
three iterations.
HW2: Suppose you know a root r to machine single precision but want to continue computing to obtain
the value correct to double precision. Compare bisection and Newtons methods in regard to this task.
ALGORITHM 5.2: Rootfinding by Newtons Method
given f (x) and f 0 (x) (user-defined functions)
inputor compute machine unit round-off error u
u (relative error tolerance)
input x1 (initial guess at root r)
set x0 to some outrageous value
input imax (iteration safety limit)
i 0 (initialize iteration counter)
repeat until |x1 x0 | < |x1 | or i > imax
| x0 x1
| f p = f 0 (x0 )
| if f p = 0 then
|
output error message (derivative vanishes)
|
stop
| end if
| x1 x0 f (x0 )/ f p
|
if i > imax then
output warning message (failed to converge completely)
end if
57
Finally, why the mysterious = u in ALG 5.2? Heres one for you to figure out. (HINT: It has to do
with quadratic convergence.)
We close with a summary of Newtons method for rootfinding.
REMARKS:
1. Newtons method requires but a single initial guess at the root.
2. Newtons method converges dramatically fast (quadratically), typically giving double the number of
significant digits of accuracy with each subsequent iteration, provided the conditions for existence
and uniqueness of a root have first been established.
3. If the condition for uniqueness (monotonicity) has not been established, Newtons method can converge slowly or fail to converge altogether, typically due to a vanishing derivative.
4. Newtons method is computationally expensive in the sense that both the function and its derivative
must be evaluated at each iteration. If the function is complicated, its derivative may be horrendously
complicated.
The following algorithm is similar to Newtons method; however, it avoids the necessity of computing
the derivative by requiring two initial guesses, from which the x intercept of the secant line is used to
refine the approximation of the root at each iteration.
5.4
Newtons method is the Spitze of rootfinding methods, but for very complicated functions it can be expensive because of the need to evaluate two functions: f and its derivative. If f is complicated, then f 0 may
be outrageously messy. Secant method is similar to Newtons method, but it avoids the need to evaluate
the derivative. As the name suggests, a secant line is used to approximate the root r of f (x) rather than
58
f (x1 ) f (x0 )
x1 x0
(78)
(79)
Lets define x2 as the x intercept of the secant line, namely the solution to
0 = f (x1 ) + m1 (x2 x1 )
(80)
whereby x2 = x1 f (x1 )/m1 . If the function f is monotonic on [a, b] containing initial guesses x0 and x1 ,
then x2 will approximate the root r better than either initial guess. The process can then be repeated. In a
nutshell, here is the recursion relation for secant method:
f (xi ) f (xi1 )
xi xi1
f (xi )
xi
mi
mi
xi+1
(81)
Lets apply secant method to compute 2 by finding the positive root of f (x) = x2 2. Note that
f (x) is monotonically increasing on the interval [1, 2]. For our initial guesses lets take x0 = 1 and x2 = 2,
neither of which is particularly good. The results of the iterations of secant method for this function are
given in Table 6.
Several attributes of secant method are noteworthy. The first is that it converges quickly, not as fast as
Newtons method, but typically to machine single precision in 5-6 iterations from decent initial guesses.
59
i
0
1
2
3
4
5
6
7
xi
1.0
2.0
1.3333
1.4
1.414634146
1.414211438
1.414213562
1.414213562
f (xi )
-1.0
2.0
0.2222
-0.04
0.001189768
6.007 106
8.94 1010
8.94 1010
mi
NA
3.0
3.3333
2.7333
2.814634146
2.828845584
2.828424803
ERROR!
1. Because secant method uses more than one point at each iteration, it is called a multistep method.
2. Each iteration, however, requires only one new function evaluation, which is why the method is
relatively efficient.
3. The convergence criterion for secant method is crucial to prevent a run-time crash!
4. Newtons method converges quadratically, which means that the number of digits of precision tends
to double with each successive iteration. For secant method, the growth factor of the number of digits
of precision is approximately 1.6 per iteration. The choice = u2/3 , therefore, is carefully chosen to
halt the iteration when x2 is correct in all significant digits, but to prevent a catastrophic additional
iteration. Can you see why this choice?
HW1: Write a function f (x), one of whose roots is 41/3 . Use secant method to find this root to 6 significant
digits.
5.5
Exercises
1. Consider f (x) = ex 3x2 , which has three real roots at approximately -.46, 0.91, and 3.73.
(a) By solving for x in the second term on the right of the equal sign, devise a fixed point iteration
for finding the roots of f (x). Will this algorithm converge to the unique fixed point on [0,2]?
(b) By solving for x in the first term on the right of the equal sign, devise a fixed point iteration for
finding the roots of f (x). Will your algorithm converge to the unique fixed point on [0,2]?
2. Consider the following function which has a discontinuity at x = 1:
if x > 1
x
0
if x = 1
f (x) =
x 2 if x < 1
(a) Does f (x) have a root?
(b) If you try the bisection method with bracketing interval [1, 2], what happens? Will the bisection method fail? Will it converge to something?
3. Consider the function f (x) = x3 4.
(a) Show that f (x) has a root f [0, 2].
(b) Compute the first three iterates x0 , x1 , and x2 when using the bisection method to search for
the root r.
(c) DO NOT continue with the bisection method, but if you were to continue, how precise would
your approximation xk be when k = 10 and k = 20?
(d) Derive the iteration function g(x) for Newtons method for this problem.
61
4.
(a) Carry out 3 iterations of the fixed point method for solving x =
1
x+1
with x0 = 0.
6. The saturation concentration of dissolved oxygen in freshwater (Os f ) can be calculated with the
equation
ln Os f = 139.34411 +
Ta
Ta2
Ta3
Ta4
where Ta is the absolute temperature (K). According to this equation, saturation decreases with
increasing temperature. For typical natural waters in temperate climates, the equation can be used to
determine that oxygen concentration ranges from 14.621 ml/L at 0 degree Celsius to 6.413 mg/L at
40 degrees Celsius. Given a value of oxygen concentration, this formula and the bisection method
can be used to solve for temperature in Celsius.
If the initial guesses are set at 0 degrees and 40 degrees, how many bisection iterations would be
required to determine temperature to an absolute error of 0.05 degrees Celsius?
7. Suppose f (x) has a root at x = 1. Suppose that we use the starting interval [1, 2] for the bisection
method and the starting point x0 = 2 for Newtons method.
(a) Draw a continuous function that meets the above criteria where Newtons method succeeds in
finding the root, but the bisection method fails.
(b) Draw a continuous function that meets the above criteria where the bisection method succeeds
in finding the root, but Newtons method fails.
62
We begin this chapter with a couple of problems that can be expressed as systems of linear equations or
linear systems for short. The first is just a math puzzle; the second has practical significance to the topic
of interpolation, which we will study next.
EX1: I have 12 coins consisting of nickels, dimes, and quarters, whose total value is $1.45, and whose
total weight is 22 grams. If nickels, dimes, and quarters weigh 2,1, and 3 grams, respectively (which they
dont really), how many of each coin do I have? (Courtesy of Jim Sochacki).
For starters, let n, d, and q symbolize the unknown numbers of nickels, dimes, and quarters, respectively. To find the three unknown quantities, we need three independent pieces of information that can be
expressed as equations. We are given information about 1) the number of coins, 2) the value of the coins,
and 3) the weight of the coins, so, if we can turn this data into equation form, there is a good chance of
solving the problem.
That there are 12 coins in all can be expressed as
n + d + q = 12
(82)
(83)
It is not necessary, but it is certainly OK to clear the value equation of decimal fractions by multiplying
each side of the equation by 20, in which case
n + 2d + 5q = 29
(84)
It is an important cornerstone of linear algebra that the information content of an equation is not altered
whenever both sides of the equation are multiplied by the same non-zero constant. Finally, that the total
weight of the coins is 22 grams yields the equation
2n + 1d + 3q = 22
(85)
The three equations above are each linear in that the unknowns appear only to the first power. For example,
we dont see n2 , 1/q, or cos(d) in any of the equations. Collectively the three linear equations in three
unknowns comprise a linear system, in particular the following system:
1n + 1d + 1q = 12
1n + 2d + 5q = 29
2n + 1d + 3q = 22
(86)
Perhaps you have had experience solving 2 2 and 3 3 linear systems in high school. Regarding
the latter, you may have also had the experience of going around in circles, while chasing one of the
unknowns. A systematic approach to the solution can prevent such circular waste of effort. In particular,
63
lets use the first equation to eliminate the first variable in both the second and the third equations. We do
this by subtracting the first equation from the second and twice the first equation from the third, as follows:
1n + 1d + 1q = 12 1n + 1d + 1q = 12
[(Eq. 2)-(Eq. 1)] 1n + 2d + 5q = 29 0n + 1d + 4q = 17
[(Eq. 3)-2(Eq. 1)] 2n + 1d + 3q = 22 0n 1d + 1q = 2
Note that the second and third equations above now involve only the unknowns d and q and thus comprise
a 2 2 subsystem of the original equation system. Lets now focus on solving the subsystem.
1d + 4q = 17
1d + 1q = 2
By now adding the subsystems first equation to its second equation, we eliminate the variable d as follows:
1d + 4q = 17 1d + 4q = 17
[(Eq. 2)+(Eq. 1)] 1d + 1q = 2 0d + 5q = 15
The last equation of the resulting subsystem is now trivial. The solution is q = 3. Knowing q, we can now
work backwards with the first equation immediately above to find d. Specifically, by substituting the value
of q into 1d + 4q = 17, we obtain 1d + 4(3) = 17, or equivalently, 1d = 5. At this stage, we know both
q and d. By substituting these known values into the very first equation of the original system, namely
n + d + q = 12, we get n + 5 + 3 = 12, in which case n = 4. The process of working backwards once the
last unknown (q) was obtained is known as back substitution.
Our next example will be germaine to the next chapter, which concerns polynomial interpolation.
EX2: Find the parabola that passes through the points (1,0), (2,1), and (3,4). Note that a parabola is the
graph of a quadratic function. A general form (of many possible forms) of a quadratic is
q(x) = a0 + a1 x + a2 x2
(87)
Our job is to determine the specific coefficients a0 , a1 , and a2 , so that the resulting quadratic fits the
data. Because there are three unknown coefficients (three degrees of freedom), we need three equations so
that the problem is well-posed. The three equations follow from the requirements that the graph of q(x)
passes through all three points. Specifically, q(1) = 0, q(2) = 1, and q(3) = 4. That is,
a0 + a1 (1) + a2 (1)2 = 0
a0 + a1 (2) + a2 (2)2 = 1
a0 + a1 (3) + a2 (3)2 = 4
(88)
(89)
This however is nothing but the following 3 3 linear system for the unknowns a0 , a1 , and a2 .
1a0 + 1a1 + 1a2 = 0
1a0 + 2a1 + 4a2 = 1
1a0 + 3a1 + 9a2 = 4
(90)
64
Lets again find the solution by our structured approach, but this time we will keep the system intact at
each step. In the first step, we use the first equation to eliminate the first unknown in the second and third
equations. Here is Step 1:
1a0 + 1a1 + 1a2 = 0 1a0 + 1a1 + 1a2 = 0
[(Eq. 2)-(Eq. 1)] 1a0 + 2a1 + 4a2 = 1 0a0 + 1a1 + 3a2 = 1
[(Eq. 3)-(Eq. 1)] 1a0 + 3a1 + 9a2 = 4 0a0 + 2a1 + 8a2 = 4
In Step 2, we use new Eq. 2 above to eliminate the second unknown in Eq. 3 above. Here is Step 2.
1a0 + 1a1 + 1a2 = 0 1a0 + 1a1 + 1a2 = 0
0a0 + 1a1 + 3a2 = 1 0a0 + 1a1 + 3a2 = 1
[(Eq. 3)-2(Eq. 2)] 0a0 + 2a1 + 8a2 = 4 0a0 + 0a1 + 2a2 = 2
Note that the system is now in upper-triangular form. That is, if we remove all terms with zero coefficients,
the following remains:
1a0 + 1a1 + 1a2 = 0
1a1 + 3a2 = 1
2a2 = 2
We now proceed with back susbstitution. The last equation has the trivial solution a2 = 1. Substituting
this result into the middle equation yields 1a1 + 3(1) = 1 a1 = 2. Substituting the known values of
a2 and a1 into the first equation yields 1a0 + 1(2) + 1(1) = 0 a0 = 1. Thus the desired quadratic is
q(x) = 1 2x + x2
(91)
The reader can verify that q(x) fits the given data.
The sample problems above each represented 3 3 linear systems; that is, three linear equations for
three unknowns. However, there is nothing to preclude systems of 4 equations in 4 unknowns (4 4), 11
equations in 11 unknowns (11 11), or for that matter, n equations in n unknowns (n n). In general, an
n n system is written
a11 x1 + a12 x2 + a13 x3 + ... + a1n xn
a21 x1 + a22 x2 + a23 x3 + ... + a2n xn
a31 x1 + a32 x2 + a33 x3 + ... + a3n xn
.
.
.
.
.
.
.
.
.
.
.
.
an1 x1 + an2 x2 + an3 x3 + ... + ann xn
= b1
= b2
= b3
.
.
.
= bn
(92)
The coefficients ai j (a eye jay) of the linear system are identified by two indices. The first index, i in this
case, is the equation index, also called the row index. The second index, j in this case, is called the variable
65
index or the column index. For example, in Eq. 86 above, a23 = 5 (a two three), is the coefficient of the
third variable in the second equation. What value does a33 have in the same equation?
Our primary objective in this chapter is to develop an efficient and accurate method for solving systems of n linear equations in n unknowns. The algorithm that will emerge from this endeavor is called
Gaussian elimination, once again in honor of the master himself, Karl Friedrich Gauss. The widespread
importance of Gaussian elimination cannot be overstated. Ultimately, most problems in science, engineering, or biomathematics boil down to solving large linear systems, sometimes with tens of thousands
of equations. If you dont believe this, check out www.netlib.org, a repository for efficient variations of
Gaussian elimination for every conceivable situation. As of the end of 2006, the website had received 400
million hits, probably more even than Paris Hiltons!
6.1
The graph of a linear equation in two independent variables (say, x and y, or x1 and x2 ) is a straight line in
the plane; hence the terminology linear. Two co-planar straight lines must either intersect or be parallel.
If they intersect, they do so at a single point, which represents graphically the solution of a 2 2 linear
system, whose first equation defines one line and whose second equation defines the other. Provided the
lines are not parallel, the solution is unique, as shown in Fig. 15. If the lines are parallel, the plot thickens
somewhat. They may be coincident, in which case every point on the line satisfies the equation (as there
is now only one equation). On the other hand, if they are truly parallel, with no shared points, then
there is no solution whatsoever. Under what conditions then are two lines either coincident or parallel?
Whenever they have the same slope, of course. To have the same slope, the equations need not have the
same coefficients. For example, the lines l1 : x 2y = 7 and l2 : 3x + 6y = 10 have the same slope but
not the same coefficients nor y intercepts. However, because the coefficients of l2 are a scalar (-3) multiple
of those of l1 , the lines have a common slope. In such cases the coefficient sets of the two equations are
linearly dependent. That is, one set is a disguised version of the other set. Such systems are said to be
singular. Singular linear systems either have no solution or an infinity of solutions. Our primary interest
lies in non-singular linear systems, which have unique (one and only one) solutions.
What graphical interpretation do we give 3 3 systems of linear equations? A linear function of three
independent variables, say ax + by + cz = d where a, b, c, and d are constants, graphs a plane in R3 . Two
non-parallel planes intersect in a line. A third mutually non-parallel plane that does not contain the line
of intersection of the other two must intersect that line in a single point. This point represents graphically
the unique solution of the system. On the other hand, a 3 3 system can fail in one of several ways to
have a unique solution. If two or more of the three planes are parallel and share no common points, then
there is no solution. If all three planes are coincident, on the other hand, then any point on that plane is a
solution, so that the solution space is two-dimensional. Finally, suppose the third plane contains the line
of intersection of the first two planes. Then any point on that line is a solution and the solution space is
one dimensional. In either of the last two scenarios, the system is singular because the coefficient data are
linearly dependent. We will define this concept more precisely later.
For systems larger than 3 3, we lose the ability to interpret the solution in graphical terms. However,
66
6.2
When solving an n n system of linear equations, as we did previously in two examples with n = 3, one
eventually realizes that all the algebraic manipulations are being performed only on the coefficients ai j or
on the right-hand side data bi . Thus, for brevity, we could just arrange the coefficients in a table of values
and ignore the unknowns xi for the moment. The coefficient table (or array) for a general n n system is
shown below.
.
.
.
.
.
.
.
.
.
.
.
.
an1 an2 an3 ... ann
DEF: A rectangular array of real (or complex) numbers enclosed in square brackets is termed a matrix
(as in the movie). The numbers of a matrix are termed its elements or entries. Matrices are denoted by
uppercase letters, as, for example
3 2 1
A=
(93)
4 5 6
Here A is a 2 3 matrix, having two rows and three columns. The number 4 is which element of matrix
A? In general, a matrix of m rows and n columns is said to be m n, which denotes the size of the matrix.
67
Moreover, we will use the nomenclature Mmn to denote the set of all m n matrices. Thus, for example,
of matrix A above it can be said A M23 .
DEF: If m = n, then an m n matrix is said to be square. For example, matrix T below is a square matrix
with m = n = 3.
2
1
0
1
T = 1 2
0
1 2
We are primarily concerned here with the square matrices that result from linear systems of n equations in
n unknowns.
DEF: A degenerate matrix for which either m = 1 or n = 1, but not both, is termed a vector. If m = 1, we
say that the matrix is a row vector. In contrast, if n = 1, we say that the matrix is a column vector. The
elements of a vector are more commonly referred to as components. Moreover, to distinguish vectors from
matrices different symbolism is exploited. Vectors, whether row or column vectors, are distinguished in
some texts by boldface lowercase letters or by lowercase letters with a vector sign. We will use the latter.
If m = n = 1, there is but a single element in the matrix, in which case one refers to the number as a scalar.
b1
~b = b2 ; d~ = [3, 2, 1]
b3
b4
In the example above, ~b is a column 4-vector, and d~ is a row 3-vector.
DEF: Let A be a square matrix with elements ai j . The elements a11 , a22 , ..., ann are called the diagonal
elements of A. For example, the diagonal elements of the 3 3 matrix T above (that is, t11 , t22 , and t33 )
each have the value -2.
REMARKS: The diagonal elements of the coefficient matrix will play a pivotal role later in the Gaussian elimination algorithm to be developed (pun intended).
DEF: If A Mmn , then AT Mnm (A transpose) is the matrix whose rows are the columns of A and
vice versa. That is, aTji = ai j . For example, for the 2 3 matrix A of Eq. 93,
3 4
AT = 2 5
(94)
1 6
A matrix that is equal to its own transpose is said to be symmetric. Symmetric matrices must, of course,
be square. Matrix T above is a symmetric matrix. Note that transposition of square matrices leaves the
diagonal elements, for which i = j, unchanged. Regardless of the size of the matrix A, (AT )T = A. Finally,
the transpose of a row vector is a column vector, and vice versa. That is, if ~b = [3, 5, 2], then
3
~bT = 5
2
68
To close this subsection, we define a zero matrix to be a matrix each of whose entries is the number 0.
Thus, for example, the 2 3 zero matrix is
0 0 0
A=
0 0 0
6.2.1
Matrix Operations
Having defined matrices, we are now in a position to define some operations with matrices, among them
equality, addition, negation, subtraction, and scalar multiplication.
DEF:
1. Equality: If A Mmn , and B Mmn , then A = B iff ai j = bi j for all i and j. (Note that matrices can
be equated only if they have the same size, in which case equality holds if and only if corresponding
elements are equal.)
2. Scalar multiplication: If A = [ai j ]mn , and c is a scalar, then cA [cai j ]mn . (Note that, if c = 0, the
result is the m n zero matrix.)
3. Addition: If A = [ai j ]mn , and B = [bi j ]mn , then A + B [ai j + bi j ]mn . (Note that matrix addition
is defined only if the matrices are the same size, in which case the resultant matrix is formed by
adding corresponding elements of the constituent matrices.)
4. Opposition (negation): If A = [ai j ]mn , then A = cA with c = 1. (Note that this has the effect of
reversing the sign of each element of A.)
5. Subtraction: If A and B are of commensurate size, then A B A + (B).
6.2.2
Properties of Matrices
Suppose the p and q are scalars, and that A, B, and C are matrices of commensurate sizes. Then the
following properties are readily proved from the definitions of the matrix operations presented immediately
above:
PROPERTIES:
1. B + A = A + B (Commutativity under addition)
2. [0] + A = A + [0] = A ([0], the additive identity)
3. A A = A + (A) = [0] (A, the additive inverse)
4. (A + B) +C = A + (B +C) (Associativity of addition)
69
6.3
Matrix Multiplication
If you have no previous experience with linear algebra, the definition of matrix multiplication to follow
will seem entirely bizarre. Only after the fact will it begin to make some sense. So bare with us for the
moment, trusting that this is leading ultimately to something useful.
6.3.1
To understand matrix multiplication, it is first helpful to define the dot product (or inner product), which
applies only to vectors.
DEF: Suppose ~x = [x1 , x2 , ..., xn ]T and ~y = [y1 , y2 , ..., yn ]T are two column n-vectors. Their dot product,
denoted ~x ~y (x dot y) is defined
n
~x ~y =~xT~y = x1 y1 + x2 y2 + ... + xn yn = xi yi
(95)
i=1
In words, the dot product of two vectors that have the same length is formed by multiplying corresponding
components and summing the products. Note that the dot product returns a scalar.
EX: If ~x = [3, 2, 1]T and ~y = [4, 2, 1]T , then ~x ~y = (3)(4) + (2)(2) + (1)(1) = 9.
On a machine, summations are accomplished by a loop operation; hence, a dot product of two vectors
requires little more than a single loop, as shown in Algorithm 6.1.
ALGORITHM 6.1: Dot Product
input size n
input n components xi , i = 1, 2, ..., n (fill a 1D array)
input n components yi , i = 1, 2, ..., n (fill a 1D array)
70
(96)
EX: Let ~x = [3, 2, 1]T . Then ||~x|| = [(3)(3) + (2)(2) + (1)(1)]1/2 = 14. In R2 or R3 , ||~x|| is the
physical length of the vector whose tail resides at the origin and whose head resides at the point whose
coordinates are the components of the vector. Thus, in the example, ||~x|| is the physical length of the vector
whose tail is at (0,0,0) and whose head is at (3,-2,1) in Cartesian 3-space. For n > 3, the physical meaning
of the norm is lost, but the mathematical definition is retained.
6.3.2
Matrix Products
ci j =
(97)
aik bk j
k=1
A=
1 2
3 4
and B =
22
Then
AB M23 with AB =
1
0 1
1 2 1
3 4 3
7 8 7
23
Note however that BA is not even defined because the number of columns (3) of matrix B does not match
with the number of rows (2) of matrix A! However, even though BA is undefined, BT A is well defined as
shown below.
1
1
4
6
1 2
BT A = 0 2
= 6 8
3 4
1
1
4
6 32
Thus, if AB is defined, BA may or may not be defined.
It is worth mentioning that the definition of matrix multiplication above also applies to vectors. For
example, the product of a row vector, which can be viewed as a matrix of size 1 n, and a column vector
of the same length, which can be viewed as a matrix of size n 1, is a 1 1 matrix, which is just a
scalar. Thus if ~x and ~y are two column n vectors, ~xT~y is their dot product. In this case, ~y~xT is also defined.
However, the product of an n 1 matrix with a 1 n matrix is an n n matrix, not a scalar! This is termed
the outer or tensor product of two vectors. For example, if ~x = [1, 2, 3]T and ~y = [1, 0, 1]T , then
1 0 1
~x~yT = 2 0 2
3 0 3 33
All this goes to show that it is important to keep the following ever in mind when dealing with matrix
multiplication: matrix multiplication is not, in general, commutative. Matrices DO NOT behave like real
numbers.
72
Matrix multiplication is a computationally intensive process. If matrix A has m rows and matrix B
has n columns, the resultant matrix C has mn coefficients, each of which is computed by a dot product
of length p that requires 2p operations. Thus the operation count for matrix multiplication is 2mpn. If
matrices A and B are square and each n n, then 2n3 operations are required to form their product. It is not
unusual for n to be 1000 or larger for problems of engineering or scientific interest. Thus, to multiply two
matrices, each of size 1000 1000 requires some two billion operations. Indeed a primary impetus for the
development of fast computers was the necessity of manipulating large matrices in reasonable clock time.
Before closing this subsection, lets define a very special square matrix, the identity matrix.
DEF: The n n identity matrix, denoted In , has elements defined as follows:
In = [i j ]nn where
1 if i = j
i j
0 if i 6= j
(98)
The symbol i j is known as the kronecker delta, which can take on only one of two values: 0 or 1. If all
this mathematical formalism seems a bit over the top to you, then just remember that an identity matrix
is a square matrix all of whose entries are 0, with the exceptions of those lying along the diagonal, whose
values are each 1. Here, for example, is the 3 3 identify matrix:
1 0 0
I3 = 0 1 0
0 0 1 33
Note that the kronecker delta is one if and only if the row and column indices are the same, which is true
only for diagonal elements. All other entries are zero.
6.3.3
Suppose that A, B, C, and D are matrices, I = In denotes an identity matrix, [0] denotes a zero matrix,
and c is a scalar. From Eq. 97, which defines matrix multiplication, Eq. 98, which defines the identity
matrix, and the definition of the zero matrix, follow all the following properties, assuming that the size of
the matrices is such that their products are well defined:
1. (*) AB 6= BA (Non-commutativity of matrix multiplication)
2. (*) AB = [0] 6 A = [0] or B = [0]
3. AI = IA = A (In is the multiplicative identity for matrices)
4. cAB = (cA)B = A(cB) (Associativity of scalar multiplication of matrices)
5. ABC = (AB)C = A(BC) (Associativity of matrix multiplication)
6. A(B +C) = AB + AC (1st distributive property of matrix multiplication)
73
6.4
.
.
.
.
.
.
.
.
.
.
.
.
.
. .
an1 an2 an3 ... ann
xn
b1
b2
.
.
.
bn
Gaussian Elimination
The standard algorithm for solving systems of linear equations of the form A~x = ~b for which the coefficient matrix A is dense is called Gaussian elimination. A dense matrix, by the way, is one whose elements
are (nearly) all non-zero. (The identity matrix, in contrast, is sparse.) The great edifice of Gaussian
74
elimination is predicated upon three basic operations known as the elementary row operations. Moreover, Gaussian elimination is comprised of two separate algorithms implemented in succession: forward
elimination followed by back substitution. The latter is the simpler of the two, so we will describe these
algorithms in reverse order. But first, lets define the elementary row operations.
6.4.1
In short, Gaussian elimination is a process of converting a linear system of equations into an equivalent
system, whose solution is much easier to obtain than that of the original system.
DEF: Equivalent linear systems are two linear systems of the same size that have the same solution. For
example, the two 2 2 systems below are equivalent, because both have the solution (x1 , x2 ) = (1, 2),
which can readily be verified:
x1 + x2 = 3
x1 x2 = 1
x1 + x2 = 3
x2 = 2
Given the choice of solving the first or the second of the two equivalent linear systems, which would you
choose? The second, of course, because it is easier to solve.
Equivalent linear systems are related by the following elementary row operations (ERO):
1. Interchange of any two equations (rows).
2. Multiplication of an equation (row) by a non-zero scalar (real number).
3. Addition of any two equations (rows).
The first ERO recognizes that swapping the order of the equations in a system of equations does not change
the result. The second follows from the fact that equals multiplied by equals remain equal. Similarly, the
third recognizes that when equals are added to equals, the two sides of an equation remain equal.
The EROs apply to equation systems. However, as observed previously, when solving systems of linear
equations, the algebraic manipulations involve only the coefficients of the unknowns and the right-handside values. As a consequence, we can solve a linear system by dealing only with its augmented matrix
form, which is comprised of the coefficient matrix to which an additional column containing the righthand-side values is appended. The augmented matrix of a general n n system looks like the following:
75
a11 a12 a13 ... a1n b1
a21 a22 a23 ... a2n b2
.
.
.
. .
.
.
.
. .
.
.
.
. .
an1 an2 an3 ... ann bn
The EROs get their name because they apply expressly to the rows of the augmented matrix above. As an
example of the application of the EROs to an augmented matrix, lets return to the second example (EX2)
at the very beginning of the chapter, which asked us to find the quadratic polynomial that passes through
three distinct points. Expressed in augmented matrix form, Eq. 88 becomes
1 1 1 0
1 2 4 1
1 3 9 4
Now consider the following sequence of EROs:
1 1 1 0
1 1 1
(R2 R1)
1 2 4 1
0 1 3
(R3 R1)
1 3 9 4
(R3 2R1)
0 2 8
0
1 1 1
1 0 1 3
4
0 0 2
0
1
2
Note that the linear combination of two rows, for example (R3 2R1), is actually the combination
of EROs (2) and (3). Note also that the resulting augmented matrix on the far right above is in uppertriangular or row echelon form, in that all coefficients below the diagonal are zero. Because the final and
the initial augmented matrix forms are related only by EROs, their respective linear systems are equivalent.
Hence, the solution to the system represented at the far right is mathematically identical to the solution
of the original system. But because of the row echelon form, the system at the far right is much easier to
solve than the original system. If we restore the unknowns to the system at the far right, we obtain
1a0 + 1a1 + 1a2 = 0
1a1 + 3a2 = 1
2a2 = 2
Working our way backwards from the last equation to the first, we obtain a2 = 1, a1 = 2, and finally,
a0 = 1. This process of working backwards from row echelon form is known as back substitution.
6.4.2
Back Substitution
Suppose A~x =~b and U~x = d~ are equivalent n n linear systems, where matrix U is upper triangular. That
is, the latter system has the form
76
.
.
.
.
. = .
.
.
.
. . .
.
.
.
. . .
0
0
0 ... unn
xn
dn
Given the choice to solve the original system or the upper-triangular system, we would be crazy to
solve the former because the latter is much easier system to solve. Note that all coefficients of U below the
diagonal are zero. So, lets pretend that the Linear Algebra Fairy has waved her magic wand and converted
~ Lets now find an algorithm for back substitution. Well develop
our original system to the form U~x = d.
this for the 4 4 case and then generalize to n n. Heres an upper-triangular system of size 4 4 with
the variables restored:
u11 x1 + u12 x2 + u13 x3 + u14 x4
u22 x2 + u23 x3 + u24 x4
u33 x3 + u34 x4
u44 x4
=
=
=
=
d1
d2
d3
d4
Lets work our way backwards. The last equation is trivial. From EQ4, we obtain
u44 x4 = d4 x4 =
d4
u44
(100)
The third equation (EQ3) now becomes u33 x3 + u34 x4 = d3 , where we have underlined x4 to indicated that
its value is now known, in which case the second term on the left side can migrate to the right to give
u33 x3 = d3 u34 x4 x3 =
d3 u34 x4
u33
(101)
The last two variables are now known, in which case EQ2 becomes u22 x2 +u23 x3 +u24 x4 = d2 , from which
we conclude
d2 u23 x3 u24 x4
u22 x2 = d2 u23 x3 u24 x4 x2 =
(102)
u22
Finally, we arrive at EQ1, with known values for the last three variables. That is, u11 x1 + u12 x2 + u13 x3 +
u14 x4 = d1 , from which we infer
u11 x1 = d1 u12 x2 u13 x3 u14 x4 x1 =
(103)
Lets now generalize to nn systems and construct an algorithm. First, it is clear that to run backwards,
backward substitution must incorporate a decrementing loop. Moreover, we must first prime the pump;
that is, the last of the unknowns must be solved for first in order that the backward substitution process
can commence. Thus, xn must be obtained outside the main loop. Finally, note that the number of terms
77
subtracted from di to solve for xi grows as the algorithm proceeds, and that the first term subtracted
contains the variable previously just solved for, namely xi+1 . Here then is a bare-bones back-substitution
algorithm:
xn dn /unn (outside the loop)
repeat for i = n 1, n 2, ..., 1 (decrementing loop)
| xi (di nk=i+1 uik xk )/uii
|
We now recognize that the summation inside the loop above is itself a loop in disguise. If we unroll
the summation, the back substitution process is revealed to harbor a loop within a loop, or as we say in
programming parlance, a pair of nested loops. Note that the length of the inner loop depends upon the
index of the outer loop. Here then is the back substition algorithm in its full glory.
ALGORITHM 6.2: Back Substitution for Gaussian Elimination
input (or pass) order n, coefficients [ui j ]nn , n-vector d~
xn dn /unn (outside the loop)
for i = n 1, n 2, ..., 1 (descending order)
|
|
s di (initialize accumulator to right-hand-side value
|
|
for k = i + 1, i + 2, ..., n (ascending order)
|
|
|
|
s s uik xk
|
|
|
|
xi s/uii
|
return ~x
How many operations are required to complete back substitution for an n n triangular system? Here,
a table of operations might help. Table 7 reflects the fact that each time the inner loop is executed a
multiplication () and a subtraction () are performed. However, the length of the inner loop depends
upon the index of the outer loop. In addition, each time the outer loop is executed, a division is performed.
From the table, we surmise that the total operation count for back substitution is S = 1 + 3 + 5 + ... + (2n
1). Using the trick of Gauss once again, we write the sum backwards to obtain S = (2n 1) + (2n 3) +
(2n 5) + ... + 1. Adding columnwise the two sums we get 2S = 2n + 2n + 2n + ... + 2n = n(2n) = 2n2 .
Thus, S = n2 ; that is, back substitution for an n n triangular system requires exactly n2 operations.
What can go wrong with back substitution? Clearly, if uii = 0 for one or more values of i, then there
will be a run-time divide by zero error. How is this possibility to be interpreted?
DEF: Let U = [ui j ]nn be a square upper-triangular matrix. Then the determinant of U, denoted det(U),
78
Equation
n
n1
n2
n3
.
.
.
1
Operations
(1)
(3)
(5)
(7)
.
.
.
(2n-1)
(104)
Here is a mathematical symbol (the Greek uppercase P) denoting product analogous to the symbol
(the Greek uppercase S) for sum. A matrix for which det(U) = 0 is said to be singular.
Note that if uii = 0 for any index i, then det(U) = 0, and U is a singular matrix. If U is singular, then
the system U~x = d~ does not have a unique solution.
HW1: Consider U~x = d~ for
1 2 3
U = 0 4 5
0 0 6
6
d~ = 8
12
Compute det(U) and determine whether or not U is singular. If U is non-singular, find the solution of the
~
system U~x = d.
6.4.3
Forward Elimination
In the previous subsection, we presumed that some good fairy had magically transformed our linear system
~ which we could easily solve in n2 operations
A~x = ~b into an equivalent upper-triangular system U~x = d,
by back substitution. Life is rarely so kind. Alas, there is no Linear Algebra Fairy, and we ourselves must
perform this transformation, which is accomplished by a process known as forward elimination. Gaussian
elimination therefore is forward elimination followed by back substitution. Symbolically we might say
GE = FE + BS, where, here, BS conveys back substitution rather than its usual scatological meaning.
Here then is the goal of forward elimination:
.
.
.
. .
.
.
.
. .
. (FE)
.
.
.
.
.
.
.
.
.
.
.
.
. .
.
.
.
. .
an1 an2 an3 ... ann bn
0
0
0 ... unn dn
79
But the devil is in the details. Lets develop FE by examining its major steps in sequence along with the
algorithm for that step.
STEP 1: Zero the coefficients below the diagonal in column 1
In the first step of FE, EROs involving linear combinations of the first row with each other row are
used to eliminate (zero) all coefficients below the diagonal in the first column. If you have printed a color
version of this document, or if you are accessing it on line, you will find that the color blue has been
used to highlight all the coefficients that are to be zeroed in the first step. The color red has been used to
identify the pivot element. In general, the pivot element of STEP k of FE is akk . Pivot elements live along
the diagonal of the matrix. Notice that the pivot element is the divisor for each linear combination of rows
in STEP 1 below. Note also that, for each i, ai1 (ai1 /a11 )a11 = 0. That is, the multiplier (t ai1 /a11 )
of row 1 (R1) is chosen expressly for the purpose of eliminating (zeroing) the element ai1 .
21
(R2 aa11
R1)
a31
(R3 a11 R1)
.
.
.
an1
(Rn a11 R1)
a11
a21
a31
.
.
.
an1
a12
a22
a32
.
.
.
an2
a13
a23
a33
.
.
.
an3
... a1n
... a2n
... a3n
.
.
.
... ann
b1
b2
b3
.
.
.
bn
a11
0
0
.
.
.
0
a12
a022
a032
.
.
.
0
an2
a13 ... a1n b1
a023 ... a02n b02
a033 ... a03n b03
.
. .
.
. .
.
. .
a0n3
a0nn b0n
The linear combination of rows necessary to eliminate (zero) the coefficient ai1 will likely modify all
the other elements of row i as well as the right-hand-side value bi . Here primes denote matrix elements or
vector components that have been modified from their original values. The algorithm for the first step of
FE is given below. Because the modified coefficients simply replace the original coefficients in the same
storage registers, there is no need to retain the primes in the algorithm itself. Note also that, although
the purpose of the first step is to zero elements that fall below the diagonal in column one, it is actually
unnecessary to set ai1 0 in the algorithm. This is because ai1 will never again be used in solving the
linear system.
ALG: FE Step 1
for i = 2, 3, ..., n (row index)
|  t ← ai1/a11
|  (ai1 ← 0) (extraneous)
|  for j = 2, 3, ..., n (column index)
|  |  aij ← aij - t*a1j
|  |
|  bi ← bi - t*b1
|
STEP 2: Zero the coefficients below the diagonal in column 2
STEP 2 of FE is similar to STEP 1, except that EROs involving linear combinations of the second row (R2) with each subsequent row are used to eliminate (zero) all coefficients below the diagonal in the second column. In STEP 2, a'22 is the pivot element and the divisor in each multiplier of R2.
                            [ a11 a12  a13  ... a1n  | b1  ]        [ a11 a12  a13   ... a1n   | b1   ]
                            [  0  a'22 a'23 ... a'2n | b'2 ]        [  0  a'22 a'23  ... a'2n  | b'2  ]
    (R3 - (a'32/a'22) R2)   [  0  a'32 a'33 ... a'3n | b'3 ]  -->   [  0   0   a''33 ... a''3n | b''3 ]
            .               [  .   .    .         .  |  .  ]        [  .   .    .           .  |  .   ]
    (Rn - (a'n2/a'22) R2)   [  0  a'n2 a'n3 ... a'nn | b'n ]        [  0   0   a''n3 ... a''nn | b''n ]
Here double primes denote the matrix elements or vector components that have been twice modified
from their original values. For clarity in the remaining steps, we will use double primes to denote values
that have been modified at least twice.
ALG: FE Step 2
for i = 3, 4, ..., n (row index)
|  t ← ai2/a22
|  (ai2 ← 0) (extraneous)
|  for j = 3, 4, ..., n (column index)
|  |  aij ← aij - t*a2j
|  |
|  bi ← bi - t*b2
|
Forward elimination proceeds in similar fashion, focusing on the elements below the diagonal in column k in STEP k. We now fast-forward to the last step of FE, namely STEP n-1.
STEP n-1: Zero the coefficients below the diagonal in column n-1
In STEP n-1 of FE, a single ERO involving a linear combination of R(n-1) with Rn is used to eliminate the remaining non-zero coefficient below the diagonal in column n-1. In the last step, a_{n-1,n-1} is the pivot element.
    (Rn - (a_{n,n-1}/a_{n-1,n-1}) R(n-1))

This single linear combination zeroes the last remaining sub-diagonal element, a_{n,n-1}, leaving the augmented matrix in the desired upper-triangular form [U | ~d], with the modified right-hand side (b1, b'2, ..., b''_{n-1}, b''_n) now playing the role of ~d.
ALG: FE Step n-1
t ← a_{n,n-1}/a_{n-1,n-1}
(a_{n,n-1} ← 0) (extraneous)
a_{nn} ← a_{nn} - t*a_{n-1,n}
bn ← bn - t*b_{n-1}
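Before working an example, here is a sketch in Python of how the complete forward-elimination pass might be coded by wrapping the STEPs above in an outer loop over k. The function name is ours, and no pivoting is included (pivoting is added in ALG 6.4 later in this section). A and b are overwritten.

def forward_elimination(A, b):
    n = len(b)
    for k in range(n - 1):                  # STEP k (0-based indices here)
        for i in range(k + 1, n):           # rows below the pivot
            t = A[i][k] / A[k][k]           # multiplier; assumes A[k][k] != 0
            for j in range(k + 1, n):       # columns to the right of the pivot
                A[i][j] -= t * A[k][j]
            b[i] -= t * b[k]
            A[i][k] = 0.0                   # extraneous, but keeps the array tidy
    return A, b

Feeding the result into the back_substitution sketch given earlier completes GE = FE + BS.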
EX: Solve the following linear system by Gaussian elimination:

    2x1 + 4x2 - 4x3       = -12
     x1 + 5x2 - 5x3 - 3x4 = -18
    2x1 + 3x2 +  x3 + 3x4 =  -8
     x1 + 4x2 - 2x3 + 2x4 =  -8

In augmented-matrix form, the system is

    [ 2  4 -4  0 | -12 ]
    [ 1  5 -5 -3 | -18 ]
    [ 2  3  1  3 |  -8 ]
    [ 1  4 -2  2 |  -8 ]
We now perform elementary row operations to drive the original system toward upper-triangular form.
This is the forward elimination (FE) component of Gaussian elimination.
    [ 2  4 -4  0 | -12 ]                         [ 2  4 -4  0 | -12 ]
    [ 1  5 -5 -3 | -18 ]   (R2 - (1/2)R1)        [ 0  3 -3 -3 | -12 ]
    [ 2  3  1  3 |  -8 ]   (R3 - (2/2)R1)  -->   [ 0 -1  5  3 |   4 ]
    [ 1  4 -2  2 |  -8 ]   (R4 - (1/2)R1)        [ 0  2  0  2 |  -2 ]

    [ 2  4 -4  0 | -12 ]                          [ 2  4 -4  0 | -12 ]
    [ 0  3 -3 -3 | -12 ]                          [ 0  3 -3 -3 | -12 ]
    [ 0 -1  5  3 |   4 ]   (R3 - (-1/3)R2)  -->   [ 0  0  4  2 |   0 ]
    [ 0  2  0  2 |  -2 ]   (R4 - (2/3)R2)         [ 0  0  2  4 |   6 ]

    [ 2  4 -4  0 | -12 ]                         [ 2  4 -4  0 | -12 ]
    [ 0  3 -3 -3 | -12 ]                         [ 0  3 -3 -3 | -12 ]
    [ 0  0  4  2 |   0 ]                   -->   [ 0  0  4  2 |   0 ]        (105)
    [ 0  0  2  4 |   6 ]   (R4 - (2/4)R3)        [ 0  0  0  3 |   6 ]
This completes forward elimination. Note that the resulting upper-triangular system is the augmented-matrix form of U~x = ~d in the example. Because the original and final systems are related through EROs, they must be equivalent. Because det(U) = (2)(3)(4)(3) = 72 ≠ 0, the system is non-singular and has a unique solution. It remains to solve the upper-triangular system by back substitution (BS).
    3x4 = 6                                  x4 = 2
    4x3 + 2(2) = 0                           4x3 = -4,  x3 = -1
    3x2 - 3(-1) - 3(2) = -12                 3x2 = -9,  x2 = -3
    2x1 + 4(-3) - 4(-1) + 0(2) = -12         2x1 = -4,  x1 = -2

The final result, ~x = [-2, -3, -1, 2]^T, is the unique solution to the original system, which can be verified by substitution.
6.4.4
Checking by Residuals
There is another way to check the correctness of the candidate solution ~x of linear system A~x = ~b. If ~x is exactly correct, then the product A~x should exactly equal ~b, in which case ~b - A~x = ~0. This is indeed the case for the previous example, for which

    ~b - A~x = [ -12 ]   [ 2  4 -4  0 ] [ -2 ]   [ 0 ]
               [ -18 ] - [ 1  5 -5 -3 ] [ -3 ] = [ 0 ]
               [  -8 ]   [ 2  3  1  3 ] [ -1 ]   [ 0 ]
               [  -8 ]   [ 1  4 -2  2 ] [  2 ]   [ 0 ]
In general, for large linear systems in which round-off error plays a role, the computed numerical solution
will not be exact. A measure of the fidelity of the computed solution is given by the residual vector ~r,
defined as
    ~r = ~b - A~x                                                             (106)
For an exact solution, ~r = ~0. Thus, the departure of the residual from zero is an indirect measure of the
error inherent in the numerical solution. How large a residual is tolerable? Recall that absolute error
is not very meaningful; relative error is the important concept. Hence, a rule of thumb might be that
||~r||/||~x|| < u, where the double vertical bars denote the Euclidean norm defined previously, and u is the
unit round-off error of the machine, also known as machine epsilon. If the residual exceeds this rule of
thumb, or if the matrix A is severely ill-conditioned, a term meaning almost singular, certain precautions
are necessary, among them iterative refinement. These topics, beyond the scope of this course, are grist,
however, for Math 448.
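Here is a sketch in Python/NumPy of the residual check of Eq. 106, applied to the example system above; NumPy's solver is used purely for brevity, and the rule-of-thumb test follows the text.

import numpy as np

A = np.array([[2., 4., -4.,  0.],
              [1., 5., -5., -3.],
              [2., 3.,  1.,  3.],
              [1., 4., -2.,  2.]])
b = np.array([-12., -18., -8., -8.])

x = np.linalg.solve(A, b)                     # computed solution (about [-2, -3, -1, 2])
r = b - A @ x                                 # residual vector, Eq. 106
u = np.finfo(float).eps                       # unit round-off (machine epsilon)

print("relative residual:", np.linalg.norm(r) / np.linalg.norm(x))
print("machine epsilon  :", u)

For a small, well-conditioned system such as this one, the relative residual should be on the order of the unit round-off u.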
6.4.5
How expensive is Gaussian elimination (GE) from a computational point of view? Recall that GE = FE + BS, and that for an n × n system, BS requires exactly n^2 operations. It may not have escaped your notice that the dot product, which requires a single loop of length n, has an O(n) operation count, and that BS, which involves doubly nested loops, has an operation count of O(n^2). This suggests that FE, which requires triply nested loops, has an operation count of O(n^3). This is indeed the case, but let's see if we can put a finer point on it than that.
For STEP 1 of GE, the i and j loops both range from 2 to n. Thus the assignment statement in the innermost loop is executed (n - 1) × (n - 1) times. In STEP 2 of GE, the i and j loops each diminish in length by one; therefore, the assignment statement in the innermost loop is executed (n - 2) × (n - 2) times. Proceeding in similar fashion, we can construct a table of operations for FE, as shown in Table 8.
We wish to sum all the operations implied in Table 8. There are neat mathematical formulas to do this for us, but we can get sufficiently close to the answer by simple reasoning. Imagine that we are constructing a pyramid with a square base out of perfectly cubic blocks, all of equal dimensions. In the first layer we lay (n - 1) × (n - 1) blocks. In the second layer we lay (n - 2) × (n - 2) blocks, and so on, (n - 1) layers in all. At the top, we will need to lay only a single capstone block. How many blocks are there in total? The volume of a pyramid is (1/3)Bh, where B is the area of the base and h is the height. With base area B approximately n^2 and height h approximately n, the pyramid contains roughly n^3/3 blocks; hence FE requires approximately n^3/3 operations. Adding the n^2 operations of BS, Gaussian elimination as a whole costs about n^3/3 + n^2, which is essentially n^3/3 for large n.
6.4.6
Pivoting
What can go wrong during the forward elimination (FE) process of Gaussian elimination (GE)? The worst
thing that can happen is for the pivot element akk to vanish at outer STEP k, in which case there will be
a divide by zero run-time error; i.e., a crash. What then do we do if a pivot element vanishes? Does that
mean the matrix system is singular? Consider the following simple example, expressed in augmented-matrix form:

    [ 0  1  2 | 3 ]
    [ 1  0  4 | 5 ]
    [ 0  2  1 | 3 ]
The very first step of FE fails because pivot element a11 = 0. There is nothing wrong with the system,
however. All it takes to get us out of this corner is to swap rows one and two (R1 and R2) to obtain
    [ 1  0  4 | 5 ]
    [ 0  1  2 | 3 ]
    [ 0  2  1 | 3 ]
From here, GE proceeds normally to the solution x1 = x2 = x3 = 1. So, clearly, the vanishing of a pivot
element does not always spell disaster. Then again, sometimes it does.
DEF: The interchange of two rows during the FE process of GE to avoid a zero pivot element is known as
pivoting.
Algorithm 6.4 below, which is to be inserted at the top of the outer (k) loop of FE (ALG 6.3), represents
one of the simplest pivoting strategies.
ALGORITHM 6.4: Partial Pivoting for Forward Elimination
if (akk = 0) then
|  SINGULAR ← TRUE (SINGULAR is a logical variable)
|  for i = k+1, k+2, ..., n
|  |  if (aik ≠ 0) then
|  |  |  swap rows k and i
|  |  |  SINGULAR ← FALSE
|  |  |  EXIT loop
|  |  end if
|  |
end if
if (SINGULAR) then
|  output error message
|  stop
end if
A few words about ALG 6.4 are in order. First, the algorithm inserts a test of the pivot element to
ensure that FE does not encounter a zero divisor. If the pivot element at STEP k is indeed zero, then one of
two situations prevails: 1) either the matrix is singular, or 2) the matrix is non-singular, and the problem
can be corrected by interchanging rows. The presumption of singularity is made, but this presumption is
reversed if the search below the diagonal of column k turns up a non-zero value. If no non-zero value can
be found, the matrix truly is singular, an error message is output, and the computation is brought gracefully
to a halt (rather than a crash).
REMARKS:
1. A matrix that is almost singular is called ill-conditioned. During FE of ill-conditioned matrices, the pivot element almost vanishes in some relative sense.
2. Thus, pivoting is also used to minimize the accumulation of round-off error during GE, particularly for ill-conditioned systems.
3. Most linear-algebra software packages involve more sophisticated pivoting algorithms than ALG 6.4, such as scaled partial pivoting or even full pivoting. These are somewhat beyond the scope of this course. Another great reason to take Math 448!
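The following is a sketch in Python of forward elimination with a simple pivoting strategy in the spirit of ALG 6.4: when a pivot (nearly) vanishes, search below it for a usable entry and swap rows; if none is found, report singularity. The function name and the optional tolerance are our own additions, not part of the coursepak's algorithm.

def forward_elimination_pivot(A, b, tol=0.0):
    n = len(b)
    swaps = 0
    for k in range(n - 1):
        if abs(A[k][k]) <= tol:                  # pivot vanishes (or nearly so)
            for i in range(k + 1, n):
                if abs(A[i][k]) > tol:
                    A[k], A[i] = A[i], A[k]      # swap rows k and i
                    b[k], b[i] = b[i], b[k]
                    swaps += 1
                    break
            else:                                # no non-zero entry found below the pivot
                raise ValueError("matrix is singular")
        for i in range(k + 1, n):                # ordinary FE step
            t = A[i][k] / A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= t * A[k][j]
            b[i] -= t * b[k]
    return A, b, swaps

Returning the number of row swaps costs nothing and is exactly what is needed for the determinant rubric of the next subsection.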
6.4.7
In linear algebra courses, one is often taught to find the determinant of a square matrix by a method called
expansion by minors. The method is OK in theory, but, carrying an O(n!) operation count, it could hardly
be worse from an algorithmic point of view. Here, we consider a much more practical way of computing
the determinant. It begins with a question: given square matrix A and corresponding upper-triangular
matrix U related to A through EROs, what is the relationship between det(A) and det(U), if any?
First, consider that, when solving linear systems by hand, it is often convenient to clear fractions in a given equation (row) by multiplying the row by a non-zero scalar. For machine implementations of GE, however, the machine really doesn't care whether a number is whole or fractional, and so we can rule out the ERO of simply multiplying a row by a scalar. Note that a linear combination of rows (say, Ri + cRj) is still allowed, but the operation cRj alone is now ruled out, the latter of which multiplies det(A) by c.
If we also temporarily prohibit pivoting, then a wonderful thing happens: det(A) = det(U). (We will
omit the proof of this.) But pivoting may be absolutely necessary, then what? Each row interchange
simply changes the sign of the determinant. So then, if the number of row interchanges during FE is odd,
the sign of the determinant is reversed. On the other hand, if the number of interchanges is even, then the
sign remains unchanged. Putting all of this together, we can get det(A) nearly for free during FE by the
following rubric:
1. During FE of square matrix A, keep track of p, the number of row interchanges during pivoting.
2. If the pivoting strategy fails, then A is singular and det(A) = 0.
3. If the pivoting strategy succeeds, at the end of FE, compute det(U) = ∏_{k=1}^{n} ukk.
4. Finally, det(A) = (-1)^p det(U).
EX: Compute det(A) for the matrix below:

    A = [ 2  4 -4  0 ]
        [ 1  5 -5 -3 ]
        [ 2  3  1  3 ]
        [ 1  4 -2  2 ]

We have already shown in an example above that A is related to the following upper-triangular matrix U through EROs without pivoting (or multiplication of rows by scalars):

    U = [ 2  4 -4  0 ]
        [ 0  3 -3 -3 ]
        [ 0  0  4  2 ]
        [ 0  0  0  3 ]

Hence, p = 0, and det(A) = (-1)^0 det(U) = (2)(3)(4)(3) = 72 ≠ 0. Matrix A is non-singular, and any system A~x = ~b will have a unique solution.
6.5
To set up an analogy, let's consider the simplest scalar algebraic equation imaginable:

    ax = b                                                                    (107)

where a, b, and x are scalars, x is the unknown, and a and b are the data. Even this simple equation presents three different scenarios. First, suppose a ≠ 0. Then a^-1 = 1/a exists, in which case we can multiply both sides of the equation by a^-1 to obtain x = a^-1 b, because a^-1 a = 1. However, if a = 0, there remain two possible scenarios to consider. If b = 0, then any value of x satisfies ax = 0 (provided, of course, a = 0). On the other hand, no value of x satisfies ax = b if a = 0 but b ≠ 0. So, in this trivial example, there exists a unique solution, an infinity of solutions, or no solution at all, depending upon the values of the data a and b. In particular, if a ≠ 0, a unique solution exists.
Now consider the matrix-vector analog of the scalar algebraic equation previously considered:
A~x = ~b
(108)
Our experience with the scalar analogy begs the question: Given matrix A, under what circumstances, if any, does A^-1 exist, in which case the unique solution to Eq. 108 should be ~x = A^-1 ~b?
DEF: Given n × n matrix A, if there exists an n × n matrix B such that AB = BA = In, then B is called the inverse of A and is denoted A^-1. A matrix A for which A^-1 exists is said to be invertible.
The definition above is a little unusual. We have defined the inverse of a matrix, but it is not yet clear that such things exist. It is a little bit like defining an elephant as a large four-legged animal with a tail at both ends. If we see such an animal, we'll know what to call it. However, just defining the creature doesn't mean it exists. (Dragons are pretty well defined too, and not many of those have been seen.)
THEOREM (Uniqueness of the matrix inverse): Suppose A is invertible. Then A^-1 is unique.
PROOF (by contradiction): Suppose matrix A, which is invertible, has two distinct inverses, B and C. Then, by definition of the inverse,

    AB = I

Multiplying both sides on the left by C,

    C(AB) = CI
    (CA)B = CI      (associativity of matrix multiplication)
       IB = CI      (because C is presumed an inverse of A)
        B = C       (because I is the multiplicative identity)
However, the last line above is a contradiction of the original assumption that A had two distinct inverses.
Hence, there must be one and only one inverse.
Despite the fact that the matrix inverse, if it exists, must be unique, we still haven't concluded that such things exist. Here then is proof positive that at least one such inverse exists. Consider the 2 × 2 matrix A given below.

    A = [ 2  1 ]                                                              (109)
        [ 1  2 ]

The reader can verify that A A^-1 = A^-1 A = I2 for

    A^-1 = [  2/3  -1/3 ]                                                     (110)
           [ -1/3   2/3 ]

So, in at least this instance, A^-1 is for real. But how, in general, do we know when the inverse exists and, moreover, how do we compute the inverse? A 3 × 3 example should suffice to shed light on the situation.
Consider invertible A ∈ M3×3, given below with generic elements:

    A = [ a11 a12 a13 ]                                                       (111)
        [ a21 a22 a23 ]
        [ a31 a32 a33 ]

Because A is invertible, we know that A^-1 exists, but we don't know its elements. So, let's also define A^-1 generically, column by column, as

    A^-1 = [ ~c1  ~c2  ~c3 ]                                                  (112)

where ~cj denotes the jth column of A^-1. Now consider the three linear systems

    A~c1 = ~e1,    A~c2 = ~e2,    A~c3 = ~e3                                  (113)

where ~ej denotes the jth column of the identity matrix I3. Because A is invertible, we know that the solution of the first linear system is ~c1 = A^-1 ~e1. But A^-1 ~e1 is precisely the first column of A^-1. That is, ~c1 is the first column of A^-1. Similarly, the solution ~c2 of system A~c2 = ~e2 is the second column of A^-1, and the solution ~c3 of system A~c3 = ~e3 is the third column of A^-1. Thus, we can construct the inverse of A one column at a time by solving the three linear systems of Eq. 113 in succession.
This game plan will indeed work provided A is invertible. But it is wasteful because FE, which is
costly, is being replicated three times. It is far more efficient to do the FE only once, which can be done
by considering the following augmented matrix, in which A is augmented not by a single right-hand-side vector but by all three columns of the identity matrix at once: [A | I3] = [A | ~e1 ~e2 ~e3].
EX: Find the inverse of

    A = [  5 -1  2 ]
        [  5 -1  7 ]
        [ -5  2  4 ]

We begin with

    [  5 -1  2 | 1 0 0 ]
    [  5 -1  7 | 0 1 0 ]
    [ -5  2  4 | 0 0 1 ]
                 [  5 -1  2 | 1 0 0 ]         [ 5 -1  2 |  1 0 0 ]
    (R2 - R1)    [  5 -1  7 | 0 1 0 ]   -->   [ 0  0  5 | -1 1 0 ]
    (R3 + R1)    [ -5  2  4 | 0 0 1 ]         [ 0  1  6 |  1 0 1 ]

                      [ 5 -1  2 |  1 0 0 ]
    (Swap R2 & R3)    [ 0  1  6 |  1 0 1 ]                                    (116)
                      [ 0  0  5 | -1 1 0 ]
Note that the original system is now in the form U~cj = ~dj, where ~dj is the vector in the jth column of the rightmost three columns of the augmented matrix following FE. Now that FE is complete, we may proceed with three back substitutions, the first of which solves U~c1 = ~d1, where ~c1 = [b11, b21, b31]^T. Here is the system with unknowns restored:

    5b11 - 1b21 + 2b31 =  1
           1b21 + 6b31 =  1
                  5b31 = -1

Back substitution leads to b31 = -1/5, b21 = 11/5, and b11 = 18/25, or ~c1 = [18/25, 11/5, -1/5]^T. Similarly, back substitutions with right-hand-side data ~d2 and ~d3 lead to ~c2 = [-8/25, -6/5, 1/5]^T and ~c3 = [1/5, 1, 0]^T, respectively. Each of these three solutions gives a column of A^-1; hence

    A^-1 = [ 18/25  -8/25  1/5 ]
           [  11/5   -6/5   1  ]
           [  -1/5    1/5   0  ]
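Here is a sketch in Python/NumPy of the column-by-column construction just described, applied to the 3 × 3 example; NumPy's solver stands in for our own GE purely for brevity.

import numpy as np

A = np.array([[ 5., -1., 2.],
              [ 5., -1., 7.],
              [-5.,  2., 4.]])

n = A.shape[0]
Ainv = np.zeros((n, n))
for j in range(n):
    e_j = np.zeros(n)
    e_j[j] = 1.0                              # jth column of the identity
    Ainv[:, j] = np.linalg.solve(A, e_j)      # jth column of the inverse

print(np.allclose(A @ Ainv, np.eye(n)))       # expect True
print(Ainv)   # compare with [18/25 -8/25 1/5; 11/5 -6/5 1; -1/5 1/5 0]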
6.5.1
Gauss-Jordan Elimination
Although less efficient computationally than Gaussian elimination, Gauss-Jordan elimination is often preferred for finding the matrix inverse by hand computation when the size of the matrix n is small, say 3 or 4.
Gauss-Jordan elimination consists of forward elimination followed by back elimination, or symbolically
GJ = FE + BE. The FE part is the same as for GE. To illustrate backward elimination (BE), let's return to our previous example at the end of the FE step (Eq. 116).
    [ 5 -1  2 |  1 0 0 ]
    [ 0  1  6 |  1 0 1 ]
    [ 0  0  5 | -1 1 0 ]

At this stage we now perform EROs on the left-hand matrix to eliminate the entries above the diagonal, in a way similar to that during FE, in which we eliminated entries below the diagonal. The primary difference is that the pivot elements a''kk = ukk are used in reverse order, from last to next-to-last, and so on; hence, backward elimination.

    (R1 - (2/5)R3)   [ 5 -1  2 |  1    0   0 ]         [ 5 -1  0 |  7/5 -2/5  0 ]
    (R2 - (6/5)R3)   [ 0  1  6 |  1    0   1 ]   -->   [ 0  1  0 | 11/5 -6/5  1 ]
                     [ 0  0  5 | -1    1   0 ]         [ 0  0  5 |  -1    1   0 ]

    (R1 + R2)   [ 5 -1  0 |  7/5 -2/5  0 ]         [ 5  0  0 | 18/5 -8/5  1 ]
                [ 0  1  0 | 11/5 -6/5  1 ]   -->   [ 0  1  0 | 11/5 -6/5  1 ]
                [ 0  0  5 |  -1    1   0 ]         [ 0  0  5 |  -1    1   0 ]

    ((1/5)R1)   [ 5  0  0 | 18/5 -8/5  1 ]         [ 1  0  0 | 18/25 -8/25 1/5 ]
                [ 0  1  0 | 11/5 -6/5  1 ]   -->   [ 0  1  0 |  11/5  -6/5  1  ]
    ((1/5)R3)   [ 0  0  5 |  -1    1   0 ]         [ 0  0  1 |  -1/5   1/5  0  ]
The last step of GJ elimination is simply a scaling operation that produces the identity matrix on the
left-hand side of the augmented system.
To summarize Gauss-Jordan elimination, we use EROs to drive the augmented matrix [A | In] --> [In | B]. When the n × n identity matrix appears on the left side of the augmented matrix, the matrix B on the right-hand side is A^-1.
Solutions to linear systems, or matrix inverses, can be found either by GE or by GJ. Of the two, GE is the more efficient. Because BE is nearly as computationally intensive as FE, GJ has an asymptotic operation count of n^3/2, one and one-half times that of GE for large systems.
6.6
Before proceeding to our last subsection, let's summarize what we have learned about solving linear systems of the form A~x = ~b. The following statements are equivalent; that is, the truth of any one implies the truth of all:
1. A ∈ Mn×n is invertible.
2. A^-1 exists, and it is unique.
3. det(A) ≠ 0.
4. A~x = ~b has the unique solution ~x = A^-1 ~b.
5. A~x = ~0 has only the trivial solution ~x = ~0.
6. The rows of matrix A are linearly independent.
Item 6 above needs some explanation, because we have not yet defined linear independence.
DEF: A set of n-vectors {~v1, ~v2, ..., ~vp} is linearly independent iff the linear combination c1~v1 + c2~v2 + ... + cp~vp = ~0 implies c1 = c2 = ... = cp = 0, where the ck are scalars. Conversely, if the linear combination c1~v1 + c2~v2 + ... + cp~vp = ~0 with not all the ck = 0, then the vectors are linearly dependent.
From an intuitive point of view, a linearly dependent set of vectors is deficient in information. For
example, suppose c1~v1 + c2~v2 + ... + cp~vp = ~0 with not all the ck = 0, and in particular, suppose c1 ≠ 0. Then ~v1 = -(1/c1) Σ_{k=2}^{p} ck~vk; that is, ~v1 can be expressed as a linear combination of the remaining vectors.
This implies that some information contained in the set of vectors is redundant. In regard to linear systems
of equations, if the rows of matrix A are not linearly independent, then the set of equations is redundant
in some sense. This sort of redundancy will manifest itself during forward elimination by the failure of
pivoting to find a non-zero pivot element, in which case A is singular.
Finally, we wish to summarize the operation counts of some of the algorithms that we have discussed. This is done in Table 9, where A and B are n × n matrices and ~x and ~y are n-vectors.
We wish to close this section on a very important note. Suppose we are faced with the task of solving the linear system A~x = ~b, where A is an invertible square matrix. According to the theory of linear algebra, in particular Item 4 above, we could solve the system by first computing the inverse A^-1 and then multiplying A^-1 ~b to get ~x. This is a terrible idea from a computational point of view. Why?
6.7
Special Matrices
The matrices that arise from physical systems or engineering problems often exhibit symmetries. Symmetry can be exploited to simplify Gaussian elimination, sometimes dramatically. In this final subsection, we
discuss a few types of special matrices; that is, matrices that manifest some type of symmetry or unusual
structure.
First, we remind the reader of a few definitions already discussed, to which we add a few more:
DEF:
1. A dense matrix is one all (or nearly all) of whose entries are non-zero.
2. Conversely, a sparse matrix is one with mostly zero entries that are more or less randomly distributed.
3. A symmetric matrix is one that is equal to its own transpose.
4. A diagonal matrix is one whose non-zero elements lie only along the diagonal. That is, ai j = 0
except when i = j.
5. A banded matrix is one whose non-zero elements lie only near the diagonal. That is, aij = 0 except when |i - j| ≤ b, where b ≥ 0 is the semi-bandwidth, and 2b + 1 is the bandwidth. By this definition, a diagonal matrix is banded with b = 0.
6. A tridiagonal matrix is a banded matrix for which b = 1. That is, a tridiagonal matrix has non-zero entries only for elements a_{i,i-1}, a_{i,i}, and a_{i,i+1}.
For our last algorithm of the chapter, we adapt Gaussian elimination to tridiagonal systems of equations, which arise when solving ordinary differential equations (ODEs) of boundary-value type (BVP) by
the finite-difference method (FDM). The FDM will be discussed later in the context of numerical differentiation. For now, we will simply focus on developing an appropriate algorithm for tridiagonal systems.
A tridiagonal matrix has the following structure:

    [ d1 u1  0   0  ...                0     ]
    [ l2 d2 u2   0  ...                0     ]
    [  0 l3 d3  u3  ...                0     ]
    [  .  .  .   .                     .     ]
    [  0  0 ... l_{n-1} d_{n-1}     u_{n-1}  ]
    [  0  0 ...    0      l_n         d_n    ]

Because most of the elements are zero, only the non-zero matrix elements are stored, typically in three n-vectors: ~l, ~d, and ~u. The diagonal elements are stored in ~d. Subdiagonal and superdiagonal elements are stored in vectors ~l and ~u, respectively. Note that elements l1 of the subdiagonal and un of the superdiagonal are unused.
Let's now consider a 4 × 4 tridiagonal system and run through the steps of Gaussian elimination, after which we can readily generalize to n × n tridiagonal systems.

    [ d1 u1  0  0 | b1 ]                           [ d1 u1  0  0 | b1  ]
    [ l2 d2 u2  0 | b2 ]   (R2 - (l2/d1)R1)        [ 0 d'2 u2  0 | b'2 ]
    [ 0  l3 d3 u3 | b3 ]                     -->   [ 0  l3 d3 u3 | b3  ]
    [ 0  0  l4 d4 | b4 ]                           [ 0  0  l4 d4 | b4  ]

    [ d1 u1  0  0 | b1  ]                           [ d1 u1  0   0 | b1  ]
    [ 0 d'2 u2  0 | b'2 ]   (R3 - (l3/d'2)R2)       [ 0 d'2 u2   0 | b'2 ]
    [ 0  l3 d3 u3 | b3  ]                     -->   [ 0  0  d'3 u3 | b'3 ]
    [ 0  0  l4 d4 | b4  ]                           [ 0  0  l4  d4 | b4  ]

    [ d1 u1  0   0 | b1  ]                           [ d1 u1  0   0  | b1  ]
    [ 0 d'2 u2   0 | b'2 ]                           [ 0 d'2 u2   0  | b'2 ]
    [ 0  0  d'3 u3 | b'3 ]   (R4 - (l4/d'3)R3)       [ 0  0  d'3  u3 | b'3 ]
    [ 0  0  l4  d4 | b4  ]                     -->   [ 0  0   0  d'4 | b'4 ]
Forward elimination is now complete. As per our previous conventions, red is used to highlight the pivot
element at each step, and blue is used to show the elements that need to be zeroed. Primes denote elements
that have been modified from their original values. Note that the vectors d~ and ~b are clobbered (that is,
overwritten with new values), but the vectors ~l and ~u remain unchanged. (We know, it looks like the
elements li of the subdiagonal have been clobbered, but in the actual algorithm below, it is unnecessary to
set the values to zero because the values li are never again used.) Note that for each step of FE, only a single
element below the diagonal needs to be zeroed. For this reason, tridiagonal GE is vastly more efficient
than standard GE. When FE has been completed, the resulting upper-triangular matrix is bidiagonal, with
non-zero elements confined to the diagonal and the superdiagonal only.
It is now time for back substitution. That too is extremely simple, considering that U is bidiagonal.
Here is the full BS process.
    x4 = b'4/d'4
    x3 = (b'3 - u3 x4)/d'3
    x2 = (b'2 - u2 x3)/d'2
    x1 = (b'1 - u1 x2)/d'1
Here is an algorithm for GE of an n × n tridiagonal system, based on our experience with the 4 × 4 system.

ALG: GE for a Tridiagonal System
for i = 2, 3, ..., n (forward elimination)
|  t ← li/d_{i-1}
|  di ← di - t*u_{i-1}
|  bi ← bi - t*b_{i-1}
|
xn ← bn/dn (back substitution)
for i = n-1, n-2, ..., 1
|  xi ← (bi - ui*x_{i+1})/di
|
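For concreteness, here is a sketch of the same tridiagonal solver in Python, using the three-vector storage described above. The vectors d and b are overwritten ("clobbered"); l and u are left alone. Indices run 0 to n-1 here, so l[0] and u[n-1] are the unused entries. The function name is ours.

def tridiagonal_solve(l, d, u, b):
    n = len(d)
    for i in range(1, n):                    # forward elimination
        t = l[i] / d[i - 1]
        d[i] -= t * u[i - 1]
        b[i] -= t * b[i - 1]
    x = [0.0] * n                            # back substitution (U is bidiagonal)
    x[n - 1] = b[n - 1] / d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - u[i] * x[i + 1]) / d[i]
    return x

Note that the operation count is O(n) rather than O(n^3), which is precisely why the tridiagonal structure is worth exploiting.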
Available in the netlib collection are several prominent linear-algebra packages deserving of brief mention: LAPACK, EISPACK, SPARSEPAK, and FISHPACK. LAPACK, short for Linear Algebra PACKage, is a collection of generic linear-algebra routines for most common purposes. It contains routines that return single or double precision results and routines for both real and complex matrices. SPARSEPAK contains routines specialized to sparse matrices. The names EISPACK and FISHPACK are both puns. The former contains routines for efficiently finding the eigenvalues of large matrices. We haven't talked at all about eigenvalues, but they figure prominently in all sorts of scientific problems, especially those that involve determining the dynamic stability of structures; e.g., an aircraft wing. Eigenvalues and eigenvectors will be defined and addressed in Math 238 and Math 448. A type of partial differential equation (PDE) that arises frequently in mathematics is Poisson's equation. Its numerical approximation often involves solving a linear system of equations with a block-tridiagonal structure. The word "fish" in English is "poisson" in French. Hence, some computational scientist with a wry sense of humor dubbed the package of routines for solving Poisson's equation with the odiferous moniker FISHPACK.
The reader is encouraged to browse www.netlib.org to gain some appreciation for the extreme importance of numerical linear algebra to the world of science and engineering.
6.8
Exercises
1. Consider the linear system of equations given in Eq. 86. Express this system in the matrix-vector
form of Eq. 99. Be sure to identify A, ~x, and ~b explicitly.
2. Find two square matrices A and B for which AB 6= BA.
3. Find two square matrices A and B for which AB = [0], but for which neither A = [0] nor B = [0].
4. By considering operation counts explain why it is usually a terrible idea to compute the inverse of a
matrix to solve a linear system of equations.
5. Assuming that A is an invertible n n matrix, prove Item 5 above.
6. Find the inverse of the matrix
1 0 1
1 1 0
0 1 1
At the very end of the section on Rootfinding, we learned how Newton's method for finding roots can be used behind the scenes to create a square-root function; that is, to give the output √x for any input x ≥ 0. The square-root function is but one of a panoply of intrinsic functions on a standard scientific calculator. You probably have not lost sleep wondering how your calculator or your computer evaluates intrinsic functions such as cos(x), sin(x), ln(x), e^x, and so on. Nevertheless, it is an interesting question. And the truth is that your calculator does not evaluate such functions because it cannot! What it does do is to closely approximate these functions by other functions that are easy to evaluate.
In this section we will consider only polynomial approximants. We remind you of the definition of a
polynomial, which was presented previously in Section 5.
DEF: If an ≠ 0, then a function of the form

    Pn(x) = a0 + a1 x + a2 x^2 + ... + a_{n-1} x^{n-1} + an x^n = Σ_{k=0}^{n} ak x^k          (117)

is called a polynomial of degree n. By differentiating Eq. 117 term-by-term, we obtain the derivative of a polynomial, namely

    P'n(x) = Σ_{k=1}^{n} k ak x^{k-1}                                                         (118)
Similarly, by integrating Eq. 117 term-by-term, we obtain the indefinite integral of a polynomial, namely

    ∫ Pn(x) dx = C + a0 x + a1 x^2/2 + a2 x^3/3 + ... + a_{n-1} x^n/n + an x^{n+1}/(n+1)
               = C + Σ_{k=0}^{n} [ak/(k+1)] x^{k+1}                                           (119)
But by far the most useful attribute of polynomials is their ability at mimicry. Polynomials, it seems, can look like virtually any continuous function. This property is codified in the famous Weierstrass Approximation Theorem, which we state here without proof.
Weierstrass Approximation THEOREM (WAT): Suppose f is defined and continuous on the closed interval [a, b]. For each ε > 0, there exists a polynomial P(x), also defined on [a, b], for which

    |f(x) - P(x)| < ε   for all x ∈ [a, b]
7.1
Polynomial Interpolation
Consider continuous y = f(x) on the closed interval [a, b], and partition the interval such that a ≤ x0 < x1 < x2 < ... < x_{n-1} < xn ≤ b. From now on, we will refer to the values xi of the partition as nodes. Finally, let's sample the function f(x) at the nodes, such that yi ≡ f(xi). Thus we have, as our starting point, a finite subset (xi, yi), i = 0, 1, 2, ..., n of n + 1 data points that represent function f(x), as shown in Fig. 21.
DEF: Let P(x) be a polynomial of degree at most n. If P(xi) = yi = f(xi) for i = 0, 1, 2, ..., n, then P(x) is called an interpolating polynomial. If x0 ≤ x ≤ xn, then P(x) interpolates f(x). If, on the other hand, x < x0 or x > xn, then P(x) extrapolates f(x).
REMARKS: Extrapolation is to be avoided like the plague! We'll explain why later.
7.1.1
THEOREM (Existence and Uniqueness of the Interpolating Polynomial): Given n + 1 distinct points
(xi , yi ) i = 0, 1, 2, ..., n, there exists one and only one polynomial P(x) of degree less than or equal to n such
that P(xi ) = yi for all n + 1 values i.
PROOF: Let P(x) = a0 + a1 x + a2 x2 + ... + an1 xn1 + an xn . (Note that P(x) is a polynomial, but because
we do not require that an be non-zero, or any other ak be non-zero for that matter, we know only that P(x)
is of degree less than or equal to n.) Let's require P(x0) = y0. Thus,

    P(x0) = a0 + a1 x0 + a2 x0^2 + ... + an x0^n = y0                         (120)

Similarly, requiring P(x1) = y1 gives

    P(x1) = a0 + a1 x1 + a2 x1^2 + ... + an x1^n = y1                         (121)

and so on, until finally requiring P(xn) = yn gives

    P(xn) = a0 + a1 xn + a2 xn^2 + ... + an xn^n = yn                         (122)

These n + 1 requirements on P(x) can be written as the following linear system of equations in the n + 1 unknowns ai:

    [ 1  x0  x0^2 ... x0^n ] [ a0 ]   [ y0 ]
    [ 1  x1  x1^2 ... x1^n ] [ a1 ] = [ y1 ]
    [ .   .    .        .  ] [ .. ]   [ .. ]
    [ 1  xn  xn^2 ... xn^n ] [ an ]   [ yn ]
The system above can be written succinctly as V~a = ~y, where ~a is the n + 1 vector of unknowns, ~y is the
n + 1 vector of sampled function values, and V is an (n + 1) (n + 1) matrix known as the Vandermonde
matrix. From the theory of linear algebra, we know that V~a =~y has a unique solution provided the rows of
V are linearly independent. Thus, all the constraints imposed upon P(x) can be satisfied uniquely provided
the rows of the resulting linear system are independent (|V| ≠ 0). Notice that the rows of V are each made
up of powers of the nodal values. Provided the n + 1 nodes are distinct, which is assumed by hypothesis,
then the rows are independent, the system has a solution ~a, and that solution is unique.
You may be wondering why the unique polynomial P(x) is of degree less than or equal to n, rather than
simply of degree n. It is because some or all of the coefficients ai may turn out to be zero. For example, if
f (x) were a linear function, then the only (possibly) non-zero coefficients would be a0 and a1 .
Not only does the Vandermonde matrix approach address the issues of existence and uniqueness, it also
provides a way of actually computing the coefficients ai of the polynomial P(x). Were the Vandermonde
matrix method a good way of doing this, then the story would end here, and we would move on to the
Taylor polynomial. However, it turns out that the Vandermonde matrix is great for theory and terrible as
a practical method. When n is even moderately large, say 5, the Vandermonde matrix method magnifies
round-off error dramatically in a phenomenon known as ill-conditioning. This trait stems from the fact
that the Vandermonde matrix is almost singular. To find out more about ill-behaved matrices and how to
tame them, take Math 448. Suffice it to say here that the Vandermonde matrix method should not be used
for n > 3 without precautions.
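A quick sketch in Python/NumPy illustrates the ill-conditioning claim: build the Vandermonde matrix for equally spaced nodes and watch its condition number (a measure of how badly errors are magnified, discussed further in Math 448) grow with n.

import numpy as np

for n in (3, 6, 9, 12):
    x = np.linspace(0.0, 1.0, n + 1)          # n+1 equally spaced nodes on [0, 1]
    V = np.vander(x, increasing=True)         # rows [1, x_i, x_i^2, ..., x_i^n]
    print(n, np.linalg.cond(V))               # condition number grows rapidly with n

Solving V~a = ~y with np.linalg.solve(V, y) does produce the interpolating coefficients, but the rapidly growing condition numbers explain why the method is discouraged for all but very small n.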
HW: Find the quadratic polynomial that passes through the points (0,3), (1,4), and (2,3) by the Vandermonde matrix method.
Before proceeding, we want to caution that the Lagrange and Newton forms of the interpolating polynomial, to be addressed in the next two subsections, are not different polynomials. They are merely different forms of the same interpolating polynomial. For a given set of n + 1 points (xi, yi), as discussed above, the interpolating polynomial of degree ≤ n is unique. How we best find that unique polynomial, however, is the issue. Because the Vandermonde method fails in practice, let's examine two other methods of finding the interpolating polynomial.
7.1.2
We begin by re-visiting the notion of linear interpolation, which should be familiar to just about everyone
who has taken some high-school mathematics.
Linear Lagrange Interpolation
Two distinct points determine a line. Thus, for linear interpolation we consider only two nodes (x0 and x1) and their respective function values (y0 and y1). Let's refer to the interpolating polynomial as P1(x), as a recognition that it is at most first-order (linear).
    P1(x) = y0 + [(y1 - y0)/(x1 - x0)] (x - x0)                               (123)
Let's now define two functions, the (linear) Lagrange functions L0(x) and L1(x), respectively:

    L0(x) ≡ (x - x1)/(x0 - x1)                                                (124)
    L1(x) ≡ (x - x0)/(x1 - x0)                                                (125)

in which case

    P1(x) = y0 L0(x) + y1 L1(x)                                               (126)

That Eq. 126 must be valid is almost, but not quite, obvious. First, consider that, by definition,

    L0(x) = { 1 if x = x0                                                     (127)
            { 0 if x = x1

    L1(x) = { 0 if x = x0                                                     (128)
            { 1 if x = x1
Now, because P1 (x) is a linear combination of the two linear functions L0 and L1 , it is (at most) linear. Second, P1 (x0 ) = y0 L0 (x0 ) + y1 L1 (x0 ) = y0 , and P1 (x1 ) = y0 L0 (x1 ) + y1 L1 (x1 ) = y1 . Thus, P1 (x) interpolates
f (x) and it is at most linear, so it is indeed the polynomial we have been looking for.
EX: Use linear Lagrange interpolation to interpolate f(x) = ln(x) on [1/2, 1]. Here x0 = 1/2 and x1 = 1. Thus,

    L0(x) = (x - 1)/((1/2) - 1) = -2(x - 1)
    L1(x) = (x - 1/2)/(1 - 1/2) = 2x - 1

Moreover,

    y0 = f(x0) = f(1/2) = ln(1/2) = -ln(2)
    y1 = f(x1) = f(1) = ln(1) = 0

Thus

    P1(x) = -ln(2)(-2)(x - 1) + 0(2x - 1)
    P1(x) = 2 ln 2 (x - 1)
    x      f(x) = ln(x)     P1(x)           e1(x)
    0.5    -0.693147...     -0.693147...    0.0
    0.6    -0.510825...     -0.554517...    0.043692...
    0.7    -0.356674...     -0.415888...    0.059213...
    0.8    -0.223143...     -0.277258...    0.054115...
    0.9    -0.105360...     -0.138629...    0.033268...
    1.0     0.0              0.0            0.0
Three distinct points determine a parabola. Thus, for quadratic interpolation we consider three nodes and their function values: (x0, y0), (x1, y1), and (x2, y2). Rather than reinvent the wheel, can we mimic what we did for linear interpolation and simply adapt appropriately? That is, let's try to construct P2(x) such that

    P2(x) = y0 L0(x) + y1 L1(x) + y2 L2(x)                                    (129)

In this case, the Lagrange functions L0, L1, and L2 are defined by the following constraint equations:

    L0(x) = { 1 if x = x0                                                     (130)
            { 0 if x = x1 or x = x2

    L1(x) = { 1 if x = x1
            { 0 if x = x0 or x = x2

    L2(x) = { 1 if x = x2                                                     (131)
            { 0 if x = x0 or x = x1
in which case Eq. 129 clearly holds, because P2 (x0 ) = y0 , P2 (x1 ) = y1 , and P2 (x2 ) = y2 by design.
EX: Approximate cos(x) on [-π/4, +π/4] by quadratic Lagrange interpolation using 3 equally spaced nodes.
In general, nodes need not be equally spaced. However, in this case the problem specification requires x0 = -π/4, x1 = 0, and x2 = +π/4. Thus,

    L0(x) = [x - 0][x - π/4] / ([-π/4 - 0][-π/4 - π/4])      = (8/π^2) x (x - π/4)
    L1(x) = [x + π/4][x - π/4] / ([0 + π/4][0 - π/4])        = -(4/π)^2 [x^2 - (π/4)^2]       (132)
    L2(x) = [x + π/4][x - 0] / ([π/4 + π/4][π/4 - 0])        = (8/π^2) x (x + π/4)
The following list contains the output of a computer program that makes use of Eq. 129 and Eq. 132 to create a table of interpolated values for the cosine function. The 21 values are equally spaced in x on [-π/4, π/4]. The fourth column gives the interpolation error e2(x) = cos(x) - P2(x). Once again, by design, interpolation is exact at the nodes. The absolute error is at its worst between nodes, but a bit closer to the ends of the interval than the middle. We'll find out why this is so later.
        x         Cos(x)      P2(x)        e2(x)
    -0.785398    0.707107    0.707107     0.0000E+00
    -0.706858    0.760406    0.762756    -0.2350E-02
    -0.628319    0.809017    0.812548    -0.3531E-02
    -0.549779    0.852640    0.856482    -0.3842E-02
    -0.471239    0.891007    0.894558    -0.3552E-02
    -0.392699    0.923880    0.926777    -0.2897E-02
    -0.314159    0.951057    0.953137    -0.2081E-02
    -0.235619    0.972370    0.973640    -0.1270E-02
    -0.157080    0.987688    0.988284    -0.5959E-03
    -0.078540    0.996917    0.997071    -0.1537E-03
     0.000000    1.000000    1.000000     0.0000E+00
     0.078540    0.996917    0.997071    -0.1537E-03
     0.157080    0.987688    0.988284    -0.5959E-03
     0.235619    0.972370    0.973640    -0.1270E-02
     0.314159    0.951057    0.953137    -0.2081E-02
     0.392699    0.923880    0.926777    -0.2897E-02
     0.471239    0.891007    0.894558    -0.3552E-02
     0.549779    0.852640    0.856482    -0.3842E-02
     0.628319    0.809017    0.812548    -0.3531E-02
     0.706858    0.760406    0.762756    -0.2350E-02
     0.785398    0.707107    0.707107     0.0000E+00
In general, for n + 1 nodes, the Lagrange form of the interpolating polynomial is

    Pn(x) = Σ_{i=0}^{n} yi Li(x)                                              (133)

where each Li(x) is a Lagrange polynomial of degree n that satisfies the constraints

    Li(xj) = δij = { 0 if i ≠ j                                               (134)
                   { 1 if i = j

where δij is the Kronecker delta function. It is readily shown that the following form of Li(x) satisfies the constraints:

    Li(x) = [ ∏_{j=0, j≠i}^{n} (x - xj) ] / [ ∏_{j=0, j≠i}^{n} (xi - xj) ]    (135)

Recall that ∏ means product. Thus, the numerator of each Li(x) is the product of n linear factors, each of the form (x - xj), so each Li(x) is a polynomial of degree n. The unusual notation (j ≠ i) signals that one linear factor is omitted for each i, namely the factor x - xi.
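Here is a sketch in Python of Eqs. 133 and 135: evaluate the Lagrange form of the interpolating polynomial at a point x, given nodes xs and sampled values ys. The function name is ours.

def lagrange_eval(xs, ys, x):
    n = len(xs) - 1                           # degree of the interpolant
    total = 0.0
    for i in range(n + 1):
        Li = 1.0
        for j in range(n + 1):
            if j != i:                        # the factor (x - x_i) is omitted
                Li *= (x - xs[j]) / (xs[i] - xs[j])
        total += ys[i] * Li                   # Eq. 133
    return total

For the linear ln(x) example above, lagrange_eval([0.5, 1.0], [-0.693147, 0.0], 0.8) returns approximately -0.277259, in agreement with the table.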
Lagrange interpolation is a practical method that does not suffer from the problems with round-off
error that plague the Vandermonde matrix method. Nevertheless, Lagrange interpolation suffers from two
shortcomings, the second of which is quite serious.
1. First, the Lagrange method is not very computationally efficient. For degree n, there are n + 1 Lagrange functions Li(x), each of which is formed by the product of n linear factors in the numerator and n similar constant factors in the denominator. Each factor requires two operations: a subtraction and a multiplication. Thus, there are approximately 4n operations per Li(x), or roughly 4n^2 operations in total to evaluate the Li's once, and another 2n operations to sum them up for Pn(x).
2. Second, Lagrange interpolation is particularly bad if you need to add an additional node to form P_{n+1}(x), for example, for better accuracy of the interpolant. With the Lagrange method, you must start over if another node is added.
In the best of all worlds, the method of interpolation would allow us to build up the interpolating
polynomial one order at a time until the desired accuracy was attained. That is, given Pn (x), Pn+1 (x) could
be constructed by simply adding the next higher order term.
Once again, we have Isaac Newton to thank for developing an ideal method: Newton's divided-difference method of interpolation. (This is the same Newton, by the way, who developed differential and integral calculus, the theory of gravitation, the theory of optics, the binomial theorem, Newton's method for rootfinding, and much more. He did all of this during a two-year period, 1665-1666, when he was 22-24 years of age and working in isolation at home. The bubonic plague had closed many European universities, including Cambridge, where Newton had been a student. Pretty humbling to think that one human being could accomplish so much of importance to the world in such a short time.)
7.1.3
Newton's method of interpolation begins with the question: Can we construct the interpolating polynomial Pn(x) in such a way that P_{n+1}(x) is readily obtained if higher order is necessary?
The Newton Polynomial
Consider the n+1 points [xi , f (xi )], i = 0, 1, ..., n, whose x values are called nodes or centers. Let
    P0(x) = a0
    P1(x) = a0 + a1(x - x0)
          = P0(x) + a1(x - x0)
    P2(x) = a0 + a1(x - x0) + a2(x - x0)(x - x1)
          = P1(x) + a2(x - x0)(x - x1)
    P3(x) = a0 + a1(x - x0) + a2(x - x0)(x - x1) + a3(x - x0)(x - x1)(x - x2)
          = P2(x) + a3(x - x0)(x - x1)(x - x2)                                               (136)

Notice, for example, that P2(x) is P1(x) plus a new quadratic term, and that P3(x) is P2(x) plus a new cubic term. In general,

    Pn(x) = a0 + a1(x - x0) + a2(x - x0)(x - x1) + ... + an(x - x0)(x - x1)(x - x2)...(x - x_{n-1})      (137)

and

    P_{n+1}(x) = Pn(x) + a_{n+1}(x - x0)(x - x1)(x - x2)...(x - x_{n-1})(x - xn)             (138)

That is, each polynomial of the next higher order is constructed by adding a higher-order term to the polynomial that is its immediate predecessor. (WARNING: The last term of Pn(x) never involves node xn. One node, the last, is always left out. If you try to include it, your polynomial will be off by one degree.)
Given n + 1 points, our main job now is to find coefficients ai, i = 0, 1, 2, ..., n so that Pn(x) in Newton form interpolates f(x). Before we turn to that somewhat daunting task, let's consider how best to evaluate a polynomial in Newton form. For specificity, consider the Newton polynomial of degree 4, namely,

    P4(x) = a0
          + a1(x - x0)
          + a2(x - x0)(x - x1)
          + a3(x - x0)(x - x1)(x - x2)
          + a4(x - x0)(x - x1)(x - x2)(x - x3)                                               (139)

which can be rewritten in nested form as

    P4(x) = a0 + (x - x0){ a1 + (x - x1)[ a2 + (x - x2)( a3 + (x - x3) a4 ) ] }              (140)

To evaluate P4(x) in nested form, from the inner nest out, we would perform the following steps, where s is a scalar temporary variable, and x is considered specified:

    s ← a4
    s ← s(x - x3) + a3
    s ← s(x - x2) + a2
    s ← s(x - x1) + a1
    s ← s(x - x0) + a0                                                                       (141)
For arbitrary n, Pn(x) could be evaluated in similar fashion, which leads to ALGORITHM 7.1, a generalization of Horner's method (ALG 1.1).
ALGORITHM 7.1: Newton Polynomial Evaluation by Generalized Horner's Method
input order n
input n + 1 coefficients ai, i = 0, 1, ..., n
input n + 1 nodes (centers) xi, i = 0, 1, ..., n
input value of x
s ← an (initialize temporary storage register)
repeat for i = n-1, n-2, ..., 0 (descending order)
|  s ← s(x - xi) + ai
|
output s (which contains the value of Pn(x))
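For reference, here is ALG 7.1 sketched in Python; the function name is ours.

def newton_eval(a, centers, x):
    n = len(a) - 1
    s = a[n]                                  # initialize with the highest coefficient
    for i in range(n - 1, -1, -1):            # descending order, as in ALG 7.1
        s = s * (x - centers[i]) + a[i]
    return s

Note that the last center, centers[n], is never touched, consistent with the WARNING above that the last term of Pn(x) never involves node xn.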
Newton's Divided-Difference Table
Now to find the coefficients ai . If we have but one node x0 , we must require P0 (x0 ) = f (x0 ). But the
zeroth-order Newton polynomial is simply a constant, namely P0 (x) = a0 (Eq. 136). Thus, a0 = f (x0 ).
Substituting this value into the form of the linear (first-order) Newton polynomial, we have
    P1(x) = f(x0) + a1(x - x0)                                                               (142)

Requiring that P1(x1) = f(x1) and solving for a1 yields

    a1 = [f(x1) - f(x0)]/(x1 - x0)                                                           (143)

We now substitute the values of a0 and a1 just found into P2(x) of Eq. 136, to obtain

    P2(x) = f(x0) + {[f(x1) - f(x0)]/(x1 - x0)}(x - x0) + a2(x - x0)(x - x1)                 (144)

and to find a2, we impose the additional constraint that P2(x2) = f(x2), in which case

    f(x0) + {[f(x1) - f(x0)]/(x1 - x0)}(x2 - x0) + a2(x2 - x0)(x2 - x1) = f(x2)              (145)
The algebra now starts to get tedious. Solving the equation immediately above for a2 yields

    a2 = { [f(x2) - f(x0)]/(x2 - x0) - [f(x1) - f(x0)]/(x1 - x0) } / (x2 - x1)
       = { [f(x2) - f(x0)](x1 - x0) - [f(x1) - f(x0)](x2 - x0) } / [ (x2 - x0)(x1 - x0)(x2 - x1) ]      (146)

We now wish to manipulate the form above by writing (x2 - x0) = [(x2 - x1) + (x1 - x0)] to obtain

    a2 = { [f(x2) - f(x0)](x1 - x0) - [f(x1) - f(x0)][(x2 - x1) + (x1 - x0)] } / [ (x2 - x0)(x1 - x0)(x2 - x1) ]
       = { [f(x2) - f(x1)](x1 - x0) - [f(x1) - f(x0)](x2 - x1) } / [ (x2 - x0)(x1 - x0)(x2 - x1) ]
       = { [f(x2) - f(x1)]/(x2 - x1) - [f(x1) - f(x0)]/(x1 - x0) } / (x2 - x0)                          (147)
We admit, the process is beginning to look hopeless, especially when we consider the daunting process
of trying to solve for a3 and higher order coefficients. However, it turns out that all we need is a bit of
organization, bookkeeping if you will.
Define

    f[x0] ≡ f(x0);   f[x1] ≡ f(x1);   f[x2] ≡ f(x2);   etc.                                  (148)

Define the first divided differences as

    f[x0, x1] ≡ (f[x1] - f[x0])/(x1 - x0);   f[x1, x2] ≡ (f[x2] - f[x1])/(x2 - x1);   etc.   (149)

and the second divided difference as the difference of first divided differences such that

    f[x0, x1, x2] ≡ (f[x1, x2] - f[x0, x1])/(x2 - x0);   f[x1, x2, x3] ≡ (f[x2, x3] - f[x1, x2])/(x3 - x1);   etc.   (150)

Similarly, each successive divided difference is a quotient involving previous lower-order divided differences. For example,

    f[x0, x1, x2, x3] ≡ (f[x1, x2, x3] - f[x0, x1, x2])/(x3 - x0);   etc.                    (151)

The process is easily organized in tabular form, as shown in Table 11 for n = 4.
The process is easily organized in tabular form, as shown in Table 11 for n = 4.
The table is completed by beginning with the first two columns on the left and then working across
the columns from left to right. For example, the values in the third column are second divided differences,
which are computed from the first divided differences in the second column, and so forth. The only
tricky part of the computation is figuring out what the divisors should be. Consider, for example, the fifth column, which contains the third divided differences, each of which involves four nodes. In particular, the divided difference f[x0, x1, x2, x3] involves the four nodes beginning with x0 and ending with x3. Thus, the divisor is x3 - x0. The divided-difference table is a bit like a genealogical table of ancestry. Each new generation is related to many generations of ancestors. The divisor must span all generations.
    xk     f[xk]      f[ , ]        f[ , , ]         f[ , , , ]           f[ , , , , ]
    x0     f(x0)
                      f[x0,x1]
    x1     f(x1)                    f[x0,x1,x2]
                      f[x1,x2]                       f[x0,x1,x2,x3]
    x2     f(x2)                    f[x1,x2,x3]                           f[x0,x1,x2,x3,x4]
                      f[x2,x3]                       f[x1,x2,x3,x4]
    x3     f(x3)                    f[x2,x3,x4]
                      f[x3,x4]
    x4     f(x4)

Table 11: Newton's divided-difference table for n = 4.
EX: Consider the following data, sampled at the five nodes xi = -1, 0, 1, 2, 3:

    xi       -1.0    0.0    1.0    2.0    3.0
    f(xi)     1.5    3.0    6.0   12.0   24.0

Use Newton's divided-difference method of interpolation to approximate f(1.5).
Note that x = 1.5 is not a node, so interpolation is required to obtain a value for x = 1.5. We now
transfer the starting data to our triangular array (Table 12) and begin walking across the columns in
succession.
    xk       f[xk]         f[ , ]        f[ , , ]       f[ , , , ]      f[ , , , , ]
    -1.0     1.5 = a0
                           1.5 = a1
     0.0     3.0                         0.75 = a2
                           3.0                          0.25 = a3
     1.0     6.0                         1.5                            0.0625 = a4
                           6.0                          0.50
     2.0    12.0                         3.0
                          12.0
     3.0    24.0

Table 12: Numerical example of Newton's divided-difference table.
Using these coefficients, we construct the Newton polynomials of degrees 0 to 4, each of which is built upon the previous.

    P0(x) = 1.5
    P1(x) = 1.5 + 1.5(x + 1)
    P2(x) = 1.5 + 1.5(x + 1) + 0.75(x + 1)(x)
    P3(x) = 1.5 + 1.5(x + 1) + 0.75(x + 1)(x) + 0.25(x + 1)(x)(x - 1)
    P4(x) = 1.5 + 1.5(x + 1) + 0.75(x + 1)(x) + 0.25(x + 1)(x)(x - 1) + 0.0625(x + 1)(x)(x - 1)(x - 2)      (152)

[Figure 24: P0(x) (blue), P1(x) (green), P2(x) (red), P3(x) (magenta), P4(x) (cyan) compared with f(x) = 3·2^x (yellow).]
See Figure 24 for a comparison of the above Newton Polynomials with the function.
Finally, let's evaluate each of these polynomials at x = 1.5 (using the modified Horner's method, of course). Note that the exact value to 10 significant digits is f(1.5) = 8.485281375. Note also that the

    n     Pn(1.5)         f(1.5) - Pn(1.5)
    0     1.5              6.985281375
    1     5.25             3.235281375
    2     8.0625           0.422781375
    3     8.53125         -0.045968625
    4     8.47265625       0.012625125

constant, linear, and quadratic interpolating polynomials return terrible results, as shown by the error in the third column of the table, but that the relative error of the cubic and quartic polynomials is small, the latter being on the order of one-tenth of a percent.
The beauty of Newton's method of interpolation is that, if the error of the interpolant is unsatisfactory, a new node can be added. This requires only the appendage of another row at the bottom of the divided-difference table. Previous rows remain unchanged. Moreover, the new node need not be in numerical order, nor in fact, must any of the nodes be in numerical order. Newton's method therefore is practical, accurate, and efficient. What more can you ask of an algorithm?
Speaking of algorithms, here is pseudocode for constructing Newton's divided-difference table.
ALGORITHM 7.2: Newton's Divided-Difference Table
input order n
input n + 1 nodes xi, i = 0, 1, ..., n
input n + 1 function values yi, i = 0, 1, ..., n
for i = 0, 1, ..., n
|  di,0 ← yi (initialize table)
|
for j = 1, 2, ..., n (column index)
|  for i = j, j+1, ..., n (row index)
|  |  di,j ← (di,j-1 - d_{i-1},j-1)/(xi - x_{i-j})
|  |
|
Note that Newton's divided-difference table requires the use of an (n + 1) × (n + 1) matrix D dimensioned from 0 to n. Upon completion of the algorithm, the ai's are stored along the diagonal of matrix D. That is, ai = dii. In actuality, by overwriting, the algorithm can be re-written so as to require only an (n + 1)-vector ~d, which is used both for computing and storing the values ai.
The inner assignment statement of the nested loop is executed n(n + 1)/2 times, and there are 3 operations (two subtracts and one divide) within the statement. Thus, the total count is approximately (3/2)n^2 operations. The operation count for the generalized Horner method (ALG 7.1) to evaluate Pn(x) is only 3n operations. In comparing these numbers with the counts for the Lagrange method, we find that Newton's method has a considerable computational advantage over Lagrange interpolation.
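The following is a sketch in Python of the divided-difference computation using the single overwritten vector suggested above; on return, coef[i] holds ai = f[x0, ..., xi]. The function name is ours.

def divided_differences(xs, ys):
    n = len(xs) - 1
    coef = list(ys)                           # column 0 of the table
    for j in range(1, n + 1):                 # columns, left to right
        for i in range(n, j - 1, -1):         # update from the bottom up, so lower-order
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])   # entries are still intact
    return coef

Applied to the data of Table 12, divided_differences([-1, 0, 1, 2, 3], [1.5, 3, 6, 12, 24]) returns [1.5, 1.5, 0.75, 0.25, 0.0625], and feeding those coefficients into newton_eval above reproduces P4(1.5) = 8.47265625.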
In closing, we remind the reader that, given the same input data (and ignoring differences in round-off errors), the Vandermonde matrix method, the Lagrange method, and Newton's method all yield the same interpolating polynomial in different forms. Why so? Because the polynomial that interpolates is unique. Our next topic, the Taylor polynomial, is a polynomial of a very different stripe.
7.2
Taylor Polynomials
As shown in Fig. 19, we desire to approximate f (x) on an interval I, but we consider only a single node
x = a in the interior of I. The general form of a Taylor polynomial of degree n is

    Tn(x) = a0 + a1(x - a) + a2(x - a)^2 + ... + an(x - a)^n                                 (153)

Note that the Taylor polynomial looks like the Newton polynomial with all nodes collapsed into the single node x = a. Our job is to find the coefficients ai so that Tn(x) looks like f(x).
Taylor polynomial approximants, like Newton polynomials, are built up one order at a time. We begin with the Taylor polynomial of order zero and constrain it to make T0(x) = f(x) at x = a. Thus,

    T0(x) = f(a), so that a0 = f(a)                                                          (154)

The next-order approximant has the form

    T1(x) = f(a) + a1(x - a)                                                                 (155)

To find a1, we impose the new constraint T1'(a) = f'(a), in which case

    a1 = f'(a), so that T1(x) = f(a) + f'(a)(x - a)                                          (156)

We could continue in this manner, but let it suffice to say that the quadratic Taylor polynomial is

    T2(x) = f(a) + f'(a)(x - a) + [f''(a)/2](x - a)^2                                        (157)

where a2 = f''(a)/2. It is easily verified that T2(x) satisfies T2(a) = f(a), T2'(a) = f'(a), and T2''(a) = f''(a).
Thus, each new constraint requires the next Taylor polynomial in succession to match another derivative
of f (x) at x = a. If it looks like a duck, walks like a duck, and quacks like a duck, it must be a duck.
Similarly, if Tn (x) has the same value as f (x) at x = a, the same derivative as f (x) at x = a, the same
second derivative as f (x) at x = a, and so on, then it must look a lot like f (x). By requiring the third
derivative of the Taylor polynomial to match that of f(x) at x = a, we get T3(x), namely

    T3(x) = f(a) + f'(a)(x - a) + [f''(a)/2](x - a)^2 + [f'''(a)/6](x - a)^3                 (158)

In general, the Taylor polynomial of degree n about x = a is

    Tn(x) = Σ_{k=0}^{n} ak(x - a)^k,   where ak = f^(k)(a)/k!                                (159)
(159)
f (k) (x)
f (k) (a)
f (x) = sin(x)
sin(0) = 0
cos(x)
cos(0) = 1
sin(x)
sin(0) = 0
cos(x) cos(0) = 1
sin(x)
sin(0) = 0
cos(x)
cos(0) = 1
sin(x)
sin(0) = 0
cos(x) cos(0) = 1
sin(x)
sin(0) = 0
cos(x)
cos(0) = 1
Thus,
T9 (x) = 0 + 1(x 0)
1
0
(x 0)2 (x 0)3
+
2!
3!
0
1
(x 0)4 + (x 0)5
4!
5!
0
1
+
(x 0)6 (x 0)7
6!
7!
0
1
(x 0)8 + (x 0)9
8!
9!
(160)
You may be wondering what possessed us to leave all the zeroes in the expansion of T9 (x) above. It is
to remind you that, in general, these quantities are non-zero. They just happen to vanish for this specific
example. That said, cleaning up the result above and truncating at degrees 5, 7, and 9, respectively, we
obtain
    T5(x) = x - x^3/3! + x^5/5!
    T7(x) = x - x^3/3! + x^5/5! - x^7/7!
    T9(x) = x - x^3/3! + x^5/5! - x^7/7! + x^9/9!                                            (161)
To see how well these polynomials approximate the sine function, let's evaluate each at x = π/4, for which the exact value is f(π/4) = sin(π/4) = √2/2.

    n     f(π/4)          Tn(π/4)
    5     0.7071067812    0.707143046
    7     0.7071067812    0.707106781
    9     0.7071067812    0.707106783
7.3
In the study of infinite series in Math 236, we learned that the process of approximation above can be taken
to the infinite limit. But first, a review of terminology. Whereas a polynomial has a finite number of terms
of the form ak (x a)k , a power series has an infinite number of terms of the same form. If the process
above is taken to the infinite limit, the Taylor polynomial morphs into the Taylor series. For example, the
Taylor series for sin(x) (about a = 0) is

    sin(x) = x - x^3/3! + x^5/5! - x^7/7! + x^9/9! - x^11/11! + ...                          (162)

Unlike the Taylor polynomial, which is an approximation, the Taylor series above is exact for all x ∈ (-∞, ∞). Exact Taylor series can be found for any function that is infinitely differentiable (but the interval of convergence may not be infinite). However, the Taylor series is impractical as a numerical tool. No
of convergence may not be infinite). However, the Taylor series is impractical as a numerical tool. No
matter how fast the computer, it cannot possibly sum an infinity of terms in finite time. At some point we
must chop off the infinite tail of the Taylor series and be content with the finite polynomial that remains.
The $64,000 question is: How many terms do we need for a desired level of accuracy? Fortunately there
is a theorem that (nearly) answers our question.
Taylor's Remainder THEOREM: Let f(x) be n + 1 times differentiable on some interval I containing x = a, and let Tn(x) be the nth-order Taylor polynomial approximant of f(x) about x = a. For each x ∈ I there exists some ξ between a and x such that

    eTn(x) ≡ f(x) - Tn(x) = [f^(n+1)(ξ)/(n + 1)!] (x - a)^(n+1)                              (163)

where the superscript T is inserted to remind the reader that the error formula applies only to the Taylor polynomial. How many of you really appreciate the power of this beautiful theorem (whose proof must wait until Math 448)? Taylor's Remainder Theorem tells one exactly how much error is incurred if the Taylor series is truncated after n terms. We can thus use the remainder theorem in reverse to determine the number of terms n necessary for a desired level of accuracy in the Taylor-polynomial approximation of f(x). How cool is that? There is a fly in the ointment, however. The quantity ξ above is somewhat mysterious. All we know is that ξ lives between a and x. What value should we use?
It turns out that a less exact form of Taylor's remainder term is actually more useful to us. Let's derive this form by taking the absolute value of both sides of Eq. 163:

    |eTn(x)| = | [f^(n+1)(ξ)/(n + 1)!] (x - a)^(n+1) |
             = [1/(n + 1)!] |f^(n+1)(ξ)| |x - a|^(n+1)
             ≤ [Mn+1/(n + 1)!] |x - a|^(n+1)                                                 (164)

where Mn+1 is an upper bound on the absolute value of the (n+1)st derivative of f(x) on the interval of interest. That is,

    |f^(n+1)(x)| ≤ Mn+1   for all x ∈ I                                                      (165)

By replacing the (n+1)st derivative in Eq. 164 with a bound, we get around the thorny problem of not knowing what value to use for ξ. However, there is a price to be paid. Whereas Eq. 163 is an equality, Eq. 164 is now an inequality that provides a bound on the error rather than an estimate of the error. In general, finding a suitable value for Mn+1 is itself a thorny problem. However, for some functions, it is trivial. Consider that both the sine and cosine functions are bounded by unity. That is, |cos(x)| ≤ 1 and |sin(x)| ≤ 1 for all x ∈ R. But all derivatives of sines and cosines are themselves sines or cosines. Consequently, for f(x) = sin(x) or f(x) = cos(x), an appropriate bound to use is Mn+1 = 1.
EX: What degree n of Taylor polynomial approximation of f(x) = sin(x) about a = 0 is necessary so that Tn(x) approximates f(x) to 10 significant digits on a) x ∈ [-π/4, π/4]? and b) x ∈ [-π/2, π/2]?
As discussed previously, an appropriate value to use for Mn+1 is 1. Thus,

    |eTn(x)| ≤ [1/(n + 1)!] |x - 0|^(n+1) = [1/(n + 1)!] |x|^(n+1)                           (166)

From the form of the error term, it is clear that Taylor-polynomial approximations tend to get worse as the polynomial is evaluated further and further from the point of expansion x = a. In fact, the approximation, by design, is exact at x = a, and degrades as x moves away from a. Thus, when bounding the error on an entire interval, we must consider the worst-case scenario, when x is as far as possible from a. For the example problem, part a), a = 0, and I = [-π/4, π/4]. Thus the furthest reaches of the interval are at either x = -π/4 or x = π/4. Either way, |x| = π/4. Combining this fact with Eq. 166, we get

    |eTn(x)| ≤ [1/(n + 1)!] (π/4)^(n+1)                                                      (167)

Thus, if n is chosen to satisfy

    [1/(n + 1)!] (π/4)^(n+1) ≤ 10^(-10)                                                      (168)

then by the transitive property of inequality, the absolute error of Eq. 167 will be less than or equal to 10^(-10). Eq. 168 is an inequality for n, whose solution can be found by trial and error, as shown in Table 14, which is derived from a Fortran program in which the notation E-09, for example, means 10^(-9). The reader can verify that the lowest value of n for which Eq. 168 holds is n = 12.
    n      [1/(n+1)!] (π/4)^(n+1)
    0       0.785398
    1       0.308425
    2       0.807455E-01
    3       0.158543E-01
    4       0.249039E-02
    5       0.325992E-03
    6       0.365762E-04
    7       0.359086E-05
    8       0.313362E-06
    9       0.246114E-07
    10      0.175725E-08
    11      0.115012E-09
    12      0.694845E-11
    13      0.389807E-12
    14      0.204102E-13
    15      0.100189E-14

Table 14: Error bound of the Taylor polynomial for sin(x) on [-π/4, π/4].
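Here is a sketch in Python of the trial-and-error search behind Table 14: evaluate the bound of Eq. 168 for increasing n and stop at the first n that satisfies the inequality.

from math import pi, factorial

tol = 1.0e-10
n = 0
while (pi / 4.0) ** (n + 1) / factorial(n + 1) > tol:
    n += 1
print(n)     # the smallest n satisfying Eq. 168; 12 for tol = 1e-10

Changing pi/4.0 to pi/2.0 answers part (b) of the example.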
The (b) part of the example problem is left to the reader. How should Eqs. 167 and 168 be modified
for part (b)? Construct a table similar to Table 14 to determine n.
Taylor's Remainder Theorem brings us to a vitally important topic: truncation error. Chapter 3 addressed sources of error, but focused upon round-off error. In numerical computations, the two error sources of greatest concern are round-off error and truncation error. Round-off error was defined as that error incurred when real numbers are stored as floating-point numbers with finite-bit mantissas. It is now time to define truncation error.
DEF: Truncation error is that error incurred by the computer's inability to perform infinitesimal limit processes or infinite sums.
For example, we have said that the Taylor-series expansion of sin(x) is mathematically exact. However,
it is not useful for a computer because series have an infinite number of terms. By truncating the (infinite)
tail of the series, we get a finite Taylor polynomial Tn (x) that can be evaluated by computer, but at the price
of an approximation rather than an exact result. Thus, the difference between a Taylor series for sin(x) and
a Taylor polynomial for the same is truncation error. Taylor's Remainder Theorem (Eq. 163) is extremely
useful because it gives an explicit formula for the truncation error of the Taylor polynomial. In the next
chapter, we will meet truncation error of a different type, that due to the inability of the computer to take
a limit. But first, back to interpolation.
7.4
Interpolation Error
Thus far, in this chapter we have talked about two different types of approximations of functions: polynomial interpolation and Taylor polynomials. For the latter, we have discussed an error formula that allows
us to determine in advance how many terms are necessary for a pre-specified accuracy. It remains for us
to derive an error bound for the interpolating polynomial.
For this purpose we will need Rolle's Theorem, so please refer to Fig. 11 and the discussion in Chapter 5. For all subsequent analysis, we assume that f(x) is differentiable as many times as necessary.
Let's first consider the error of linear interpolation by Newton's method, for which P1(x) = f[x0] + f[x0, x1](x - x0). Thus,

    e1(x) = f(x) - P1(x)                                                                     (169)

Note that e1(x0) = e1(x1) = 0, because x0 and x1 are nodes. Thus Rolle's theorem applies and there is a ξ between x0 and x1 such that

    e1'(ξ) = 0
    f'(ξ) - P1'(ξ) = 0
    f'(ξ) = P1'(ξ)
    f'(ξ) = f[x0, x1]                                                                        (170)

The previous equation relates Newton's first divided difference to the first derivative of the original function; that is, f[x0, x1] = f'(ξ), for some ξ in the interval containing x0 and x1.
Let's now consider quadratic interpolation by Newton's method, which will ultimately relate the second divided difference to the second derivative. In this case,
\[ P_2(x) = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) \tag{171} \]
and
\[ e_2(x) = f(x) - P_2(x) \tag{172} \]
Note that e_2(x) has at least three zeroes, at the nodes x_0, x_1, and x_2. By Rolle's theorem, there must be
a ξ_1 between x_0 and x_1 such that e_2'(ξ_1) = 0. But also, there must be a ξ_2 between x_1 and x_2 such
that e_2'(ξ_2) = 0. Thus, e_2'(x) has at least two zeroes; hence, we can apply Rolle's theorem once again to
conclude that there exists ξ between ξ_1 and ξ_2 such that e_2''(ξ) = 0. From this fact it follows that
\[ e_2''(\xi) = 0 \;\Rightarrow\; f''(\xi) - P_2''(\xi) = 0 \;\Rightarrow\; f''(\xi) = P_2''(\xi) \;\Rightarrow\; f''(\xi) = 2 f[x_0, x_1, x_2] \tag{173} \]
The end result relates the second divided difference and the second derivative as follows:
\[ f[x_0, x_1, x_2] = \frac{f''(\xi)}{2} \tag{174} \]
Generalizing, one obtains the following theorem: if the n + 1 nodes {x_0, x_1, ..., x_n} lie in [a, b], then there exists ξ in [a, b] such that
\[ f[x_0, x_1, \ldots, x_n] = \frac{f^{(n)}(\xi)}{n!} \tag{176} \]
The theorem is important because it makes a connection between Newton's nth divided difference and the
nth derivative of the original function f(x). We'll now use this connection to derive an error formula for
polynomial interpolation.
Consider the interpolation error
\[ e_n(x) = f(x) - P_n(x) \tag{177} \]
where P_n(x) is the Newton form of the polynomial that interpolates f(x) at the n + 1 nodes {x_0, x_1, x_2, ..., x_n}
with x_i ∈ [a, b]. Suppose we add another node x̄ ∈ [a, b] distinct from all the previous nodes. Then
\[ P_{n+1}(x) = P_n(x) + f[x_0, x_1, \ldots, x_n, \bar{x}] \prod_{k=0}^{n}(x - x_k) \tag{178} \]
Because x̄ is a node of P_{n+1}, the interpolant is exact there: P_{n+1}(x̄) = f(x̄). Evaluating Eq. 178 at x = x̄ therefore gives
\[ e_n(\bar{x}) = f(\bar{x}) - P_n(\bar{x}) = f[x_0, x_1, \ldots, x_n, \bar{x}] \prod_{k=0}^{n}(\bar{x} - x_k) \tag{179} \]
Moreover, by the theorem above (applied to the n + 2 points x_0, x_1, ..., x_n, x̄), there exists ξ ∈ [a, b] such that
\[ f[x_0, x_1, \ldots, x_n, \bar{x}] = \frac{f^{(n+1)}(\xi)}{(n+1)!} \tag{180} \]
Our final result comes by replacing the (n+1)st divided difference in Eq. 179 with the (n+1)st derivative via
Eq. 180. We obtain the following:
\[ e_n^P(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} \prod_{k=0}^{n}(x - x_k), \qquad \xi \in [a, b] \tag{181} \]
where the superscript P denotes the error of polynomial interpolation. You may be wondering where the
x̄ went. It turns out that there was nothing special about the new node x̄ except that it was distinct from the
previous nodes; otherwise, x̄ could have been any point in the interval [a, b]. Because there was nothing
special about x̄, we have dropped the distinguishing mark (overline), and now just refer to the point as x.
To make sure you understand Eq. 181, let's specialize it to constant, linear, and quadratic interpolation,
in which cases
\[ e_0^P(x) = f'(\xi)(x - x_0), \quad e_1^P(x) = \frac{f''(\xi)}{2}(x - x_0)(x - x_1), \quad e_2^P(x) = \frac{f'''(\xi)}{6}(x - x_0)(x - x_1)(x - x_2) \tag{182} \]
Because ξ is unknown, in practice we bound the derivative and work with absolute values, which yields
\[ |e_n^P(x)| \le \frac{M_{n+1}}{(n+1)!}\left|\prod_{k=0}^{n}(x - x_k)\right| \tag{183} \]
where, once again, M_{n+1} is an upper bound on |f^{(n+1)}(x)| for x ∈ [a, b]. The function ∏_{k=0}^{n}(x − x_k) is sometimes referred to as ψ_n(x); note that it depends only upon the placement of the nodes and not upon the
function f. It is illustrative and a bit sobering to graph ψ_n(x) for a relatively large value of n under the
supposition that the nodes are equally spaced on [a, b]. So, for pedagogical purposes, let's consider n = 6
and [a, b] = [−3, 3], in which case x_0 = −3, x_1 = −2, x_2 = −1, x_3 = 0, x_4 = +1, x_5 = +2, and x_6 = +3.

Figure 25 shows ψ_6(x) for this distribution of nodes. Note that ψ_6(x) oscillates wildly, especially near
the ends of the interval. It may not be obvious, but what Fig. 25 implies is that high-order polynomial
interpolation is not necessarily a good idea if the nodes are equally spaced. In fact, going to higher order
can actually make the interpolant worse instead of better. The problem is not high-order interpolation per
se, but rather the uniform distribution of nodes. When the nodes are uniformly distributed, ψ_n(x) oscillates wildly, as shown in the figure. Interpolation error can be tamed only if one is willing and able to
redistribute the nodes, with a greater density of nodes toward the ends of the interval than near the middle.
In Math 448, we will examine the Chebyshev distribution of nodes, which is in a sense optimal. With this
distribution, phenomenal accuracy is possible with high-order interpolation. In contrast, with a uniform
distribution of nodes, one should probably not consider interpolation of order higher than, say, five.

Figure 25 also reveals why extrapolation is usually a terrible idea. Recall that extrapolation is the
evaluation of the interpolating polynomial outside its design interval [a, b]. Notice how quickly ψ_n(x)
grows outside the interval, which implies that the error e_n^P(x) also grows at a phenomenal rate outside the
interval. At best, extrapolation is dangerous and to be avoided.
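To see just how dramatic this growth is, here is a small sketch (our own, not the coursepak's) that scans ψ_6(x) for the seven equally spaced nodes above and compares its size inside and just outside the node interval:

program psi6_growth
   implicit none
   real(kind=8) :: x, psi, inside, outside
   integer :: i, k

   inside = 0.0d0;  outside = 0.0d0
   do k = 0, 1000
      x = -4.0d0 + 8.0d0 * real(k, kind=8) / 1000.0d0   ! scan [-4, 4]
      psi = 1.0d0
      do i = -3, 3                                      ! nodes -3, -2, ..., +3
         psi = psi * (x - real(i, kind=8))
      end do
      if (abs(x) <= 3.0d0) then
         inside = max(inside, abs(psi))
      else
         outside = max(outside, abs(psi))
      end if
   end do
   print '(a, es12.3)', ' max |psi_6| on [-3,3]     : ', inside
   print '(a, es12.3)', ' max |psi_6| on 3<|x|<=4   : ', outside
end program psi6_growth

Already at x = 4 the product is 7·6·5·4·3·2·1 = 5040, whereas inside the interval it stays on the order of 10².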
Figure 25: The node polynomial ψ_6(x) of the interpolation error formula, for equally spaced nodes.
7.4.1  Interpolation Error for Equally Spaced Nodes
From the previous discussion, it can be concluded that interpolation should be restricted to relatively low
order if the nodes are equally spaced. With this in mind, let's consider the error of linear, quadratic, and
cubic interpolation when the nodes are each separated by a uniform interval width of h. That is, the nodal
placements are given by
\[ x_i = x_0 + ih, \qquad i = 0, 1, 2, \ldots, n \tag{184} \]
where x_0 = a and h = (b − a)/n. From Eq. 183 for n = 1, n = 2, and n = 3, respectively, we have
\[ |e_1^P(x)| \le \frac{M_2}{2}\,|\psi_1(x)| \tag{185} \]
\[ |e_2^P(x)| \le \frac{M_3}{6}\,|\psi_2(x)| \tag{186} \]
\[ |e_3^P(x)| \le \frac{M_4}{24}\,|\psi_3(x)| \tag{187} \]
where
\[ \psi_1(x) = (x - x_0)(x - x_1) \tag{188} \]
\[ \psi_2(x) = (x - x_0)(x - x_1)(x - x_2) \tag{189} \]
\[ \psi_3(x) = (x - x_0)(x - x_1)(x - x_2)(x - x_3) \tag{190} \]
and where the reader is reminded that M_2, M_3, and M_4 are upper bounds on |f''|, |f'''|, and |f^{(4)}|, respectively,
on the interval I containing all relevant nodes. The error will tend to be worst near where the ψ_n(x) have
their extrema.
Figure 26: Interpolation error function ψ_1(x), shown for x_0 = −1 and x_1 = +1.
Let's look first at ψ_1(x), which is quadratic in x, a parabola opening upward with zeroes at nodes
x_0 and x_1 (Fig. 26). By inspection, the absolute minimum of ψ_1(x) occurs midway between x_0 and x_1,
that is, at x_0 + h/2. The minimum is therefore ψ_1(x_0 + h/2) = (h/2)(−h/2) = −h²/4. Consequently |ψ_1(x)| ≤ h²/4 on
I = [x_0, x_1]. Combining this fact with Eq. 185 yields
\[ |e_1^P(x)| \le \frac{M_2 h^2}{8} \tag{191} \]
Similarly, let's find the extrema of ψ_2(x), a cubic with three zeroes: at x_0, x_1, and x_2. For convenience,
write
\[ \psi_2(x) = [x - (x_1 - h)](x - x_1)[x - (x_1 + h)] = [(x - x_1) + h](x - x_1)[(x - x_1) - h] \tag{192} \]
The problem becomes a bit easier if we now simplify by the change of variables u = x − x_1, in which case
ψ_2(u) = (u + h)(u)(u − h) = u(u² − h²). Relative extrema of ψ_2(u) occur where its first derivative vanishes.
Thus, it suffices to solve ψ_2'(u) = 3u² − h² = 0. There are two critical points: u ∈ {−h/√3, +h/√3}. Returning
to the original variable, critical values occur at x = x_1 − h/√3 and x = x_1 + h/√3. The first gives the absolute maximum of
2h³/(3√3), and the second the absolute minimum of −2h³/(3√3). Either way, |ψ_2(x)| ≤ 2h³/(3√3). Combining this
result with Eq. 186 yields
\[ |e_2^P(x)| \le \frac{M_3 h^3}{9\sqrt{3}} \tag{193} \]
It is interesting to note that the extrema of ψ_2(x) occur slightly closer to the ends of the interval than to the
middle of the interval.
Figure 27: Interpolation error function ψ_2(x), shown for x_0 = −1, x_1 = 0, and x_2 = +1.
Clearly we can continue in this vein to bound |e_3^P(x)|. Mercifully, we will skip the details here and
simply summarize the results for low-order interpolation with nodes equally spaced with interval width h:
\[ |e_1^P(x)| \le \frac{M_2 h^2}{8}, \qquad |e_2^P(x)| \le \frac{M_3 h^3}{9\sqrt{3}}, \qquad |e_3^P(x)| \le \frac{M_4 h^4}{24} \tag{194} \]
So, what is the point of this exercise? Interpolation accuracy can be improved in either or both of
two ways: 1) going to higher-order interpolation, or 2) placing nodes closer together. Previous discussion
has shown the limits of the former: high order is not always better if the nodes are equally spaced. In
that event, Eqs. 194 are useful in determining an appropriate spacing of nodes to maintain a desired
interpolation accuracy when the order of the interpolating polynomial is relatively low. The following
example illustrates how the error bounds work.

EX: Find spacing h such that P_3(x) interpolates cos(x) correctly to four places (absolute error). Thus, we
want
\[ |e_3^P(x)| \le \frac{M_4 h^4}{24} \le 0.0001 \tag{195} \]
The second inequality implicitly specifies h. For this problem M_4 = 1. Thus
\[ \frac{h^4}{24} \le 0.0001 \;\Rightarrow\; h^4 \le 0.0024 \;\Rightarrow\; h \le 0.2213 \]
Thus any value of h less than 0.2213, say h = 0.2, would guarantee at least the desired accuracy.
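A minimal sketch (our own; the node placement starting at 0 and the scan resolution are arbitrary choices) that checks this bound numerically: it builds the Newton cubic through four nodes of cos(x) spaced h = 0.2 apart and scans the interval for the largest interpolation error, which should not exceed h⁴/24 ≈ 6.7 × 10⁻⁵.

program cubic_interp_error
   implicit none
   integer, parameter :: n = 3          ! cubic: 4 nodes
   real(kind=8) :: x(0:n), c(0:n), h, xx, p, err, maxerr
   integer :: i, j, k

   h = 0.2d0
   do i = 0, n
      x(i) = real(i, kind=8) * h        ! nodes 0, h, 2h, 3h
      c(i) = cos(x(i))                  ! divided-difference table starts with f-values
   end do
   do j = 1, n                          ! Newton divided differences, built in place
      do i = n, j, -1
         c(i) = (c(i) - c(i-1)) / (x(i) - x(i-j))
      end do
   end do

   maxerr = 0.0d0                       ! scan [x(0), x(n)] for the largest error
   do k = 0, 1000
      xx = x(0) + (x(n) - x(0)) * real(k, kind=8) / 1000.0d0
      p = c(n)
      do i = n - 1, 0, -1               ! Horner-like evaluation of the Newton form
         p = p * (xx - x(i)) + c(i)
      end do
      err = abs(cos(xx) - p)
      if (err > maxerr) maxerr = err
   end do
   print '(a, es12.4)', ' max |cos(x) - P3(x)| on [0, 0.6] = ', maxerr
end program cubic_interp_error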
To close this chapter, we would like to summarize the two major results: Taylor's Remainder Theorem
and the formula for interpolation error. These are repeated below.
\[ e_n^T(x) \equiv f(x) - T_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\,(x - a)^{n+1} \tag{196} \]
\[ e_n^P(x) \equiv f(x) - P_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\,(x - x_0)(x - x_1)(x - x_2)\cdots(x - x_n) \tag{197} \]
In the former, ξ lies between a and x. In the latter, ξ lies somewhere in the interval [a, b] that contains all the
nodes. We hope that the similarity of these two error formulas has not escaped the notice of the reader.
It turns out, in fact, that the Taylor remainder term is exactly what the interpolation error would be if all
n + 1 distinct nodes x_i coalesced into the single node x = a! This is in fact a good way to remember the
two formulas. Remember only one, and then think about how they are related. Both error formulas are
used again in the following chapters, so it is a good idea to commit them to memory.
7.5
Exercises
1. Let (x1 , y1 ), (x2 , y2 ), (x3 , y3 ) be three data points such that x1 < x2 < x3 and y1 , y2 , y3 > 0. Does this
imply that the quadratic function interpolating these points is positive on [x1 , x3 ]? Explain.
2. Find the Taylor polynomial, p_n(x), of f(x) = e^{x²} − 1/(1 − x) near x = 0. Hint: Find the Taylor polynomial
of the two terms to the right of the = sign separately, add the results together, and simplify.

3. Similarly, find the Taylor polynomial near x = 0 of f(x) = x(e^{x²} − 1)/(1 + x²) − cos(x).
4. (a) Given the four points (1,2), (3,4), (5,3), and (9,8), write the Lagrange interpolating polynomial L(x) that passes through each point.
   (b) Give the third-degree Newton polynomial P(x) at the above four points.
   (c) Without expanding, can you determine a relationship between L(x) and P(x)? If so, what is it?
5. (a) Find the third-degree Taylor polynomial for f(x) = ln(x) expanded about c = 2.
   (b) Approximate ∫_2^3 ln(x) dx by integrating your polynomial from part (a).
   (c) Derive an error bound for your result in part (b) by integrating and manipulating the remainder
       term that would be associated with your polynomial in part (a).
8. (a) Use the Vandermonde matrix method to find the quadratic polynomial P_2(x) that passes through
       the points (0,1), (1,2), and (2,5).
   (b) Find the determinant of the matrix in part (a).
   (c) Does the matrix have an inverse? Why or why not?
   (d) Also find P_2(x) by Newton's divided-difference method. Do you get the same polynomial?
9. (a) For a = 0, what order n of Taylor polynomial T_n(x) is required to approximate f(x) = sin(x)
       to machine single precision on [0, π/3]?
   (b) Find T_n(x).
8  Numerical Differentiation
In this chapter, we wish to consider the approximation of the derivative of a function, when the function
is known only as a discrete set of points. More specifically, consider the set of n + 1 points given by
{[x_i, f(x_i)], i = 0, 1, ..., n}. For now assume that the points are equally spaced in x such that x_i = x_0 + ih,
and h is the length of the interval between adjacent points. Further suppose that these points represent
a discrete sample of a continuous underlying function f(x), as shown in Fig. 28. Herein lies the rub.
We don't know f(x); we know only its value at a few points. More to the point, what we are really after
is f'(x), the derivative of f(x). "Just differentiate," you say. The problem is, we can't. Differentiability
implies continuity; however, the function f is known only partially, as a discontinuous set of points that
comprise some representative sample of f(x). Derivative rules don't apply to discontinuous sets. What is
to be done?

Fortunately, we laid all the groundwork in the previous chapter. The game plan is relatively straightforward. It consists of two steps: 1) Fit a polynomial P_k(x) through some subset of the n + 1 points, which
implies k ≤ n. 2) Approximate the derivative of the function at the nodes by the derivative of the interpolating polynomial; that is, f'(x_i) ≈ P_k'(x_i).

The devil, of course, is in the details. The first practical question is: How big a subset should we use?
That is, what value should k have? Another question is: At what node x_i do we evaluate the derivative of
P_k(x)? One way around having to answer the first question is to simply start small and work up to larger
values of k. The simplest type of function that has a non-zero derivative is the linear function. Thus, let's
start with k = 1.
8.1
Two-Point Formulas
Here, we consider only two points: [x_0, f(x_0)] and [x_1, f(x_1)]. The Newton form of the linear interpolating
polynomial is
\[ P_1(x) = f(x_0) + f[x_0, x_1](x - x_0) = f(x_0) + \frac{f(x_1) - f(x_0)}{x_1 - x_0}(x - x_0), \qquad x_0 \le x \le x_1 \tag{198} \]
Differentiating, we approximate f'(x) by
\[ P_1'(x) = \frac{f(x_1) - f(x_0)}{x_1 - x_0}, \qquad x_0 \le x \le x_1 \tag{199} \]
The approximation above is valid over the interval [x_0, x_1], provided we take a one-sided derivative if at
an endpoint of the interval spanned by the two nodes. Thus, there are two ways to interpret the fact that
P_1'(x) ≈ f'(x). If we evaluate the derivative at x = x_0, then x_1 = x_0 + h, x_1 − x_0 = h, and
\[ f'(x_0) \approx \frac{f(x_0 + h) - f(x_0)}{h} \tag{200} \]
Alternately, if we evaluate the derivative at x = x_1, then x_0 = x_1 − h, and
\[ f'(x_1) \approx \frac{f(x_1) - f(x_1 - h)}{h} \tag{201} \]
Once these formulas are derived, there is no need to retain the subscripts. Generalizing, we obtain
\[ f'(x) \approx \frac{f(x + h) - f(x)}{h} \equiv D_h^{+} f \tag{202} \]
\[ f'(x) \approx \frac{f(x) - f(x - h)}{h} \equiv D_h^{-} f \tag{203} \]
The former is known as the two-point forward-difference approximation to the first derivative, given the
symbol D_h^+ f, and the latter is known as the two-point backward-difference approximation, whose symbol
is D_h^- f. A word on the symbolism: here D denotes an approximate differentiation operator, which
operates on the function f, known only discretely, with adjacent nodes separated by distance h. The
plus or minus designators refer to forward or backward, respectively, and the h refers to the interval
width between the nodes, also known as the step size.

Note that the forward-difference approximation D_h^+ f gives the exact derivative in the infinitesimal
limit as h → 0, because this is, in fact, the definition of the derivative of f(x)! On the other hand, if h
remains finite, D_h^+ f only approximates the derivative of f(x). Similarly, it can easily be shown that D_h^- f
is also exact in the infinitesimal limit.

So, how well do two-point approximations work in reality? Let's try an example.
EX: Suppose f(x) = sin(x). Approximate f'(π/3) by both D_h^+ f and D_h^- f with h = 0.01. In this pedagogical example, we of course know the exact answer: f'(π/3) = cos(π/3) = 1/2.
\[ D_h^{+} f = \frac{\sin(\pi/3 + 0.01) - \sin(\pi/3)}{0.01} = 0.495661573 \]
\[ D_h^{-} f = \frac{\sin(\pi/3) - \sin(\pi/3 - 0.01)}{0.01} = 0.504321758 \]
Neither result is particularly impressive. The forward-difference approximation undershoots the exact
result by a bit less than one percent, and the backward-difference approximation overshoots by about the
same amount. You might be wondering what happens if we average these two values:
\[ \frac{D_h^{+} f + D_h^{-} f}{2} = 0.499991667 \tag{204} \]
Now that is an impressive result, differing from the true value only in the 6th significant digit. Is it a fluke
that the average was so close to the true value, or have we stumbled onto something important? Stay tuned.
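A minimal sketch (our own, not the coursepak's) of how the numbers above might be reproduced:

program two_point_demo
   implicit none
   real(kind=8) :: x, h, dplus, dminus

   x = 4.0d0 * atan(1.0d0) / 3.0d0        ! pi/3
   h = 0.01d0
   dplus  = (sin(x + h) - sin(x)) / h     ! forward difference
   dminus = (sin(x) - sin(x - h)) / h     ! backward difference

   print '(a, f12.9)', ' forward  : ', dplus
   print '(a, f12.9)', ' backward : ', dminus
   print '(a, f12.9)', ' average  : ', 0.5d0 * (dplus + dminus)
   print '(a, f12.9)', ' exact    : ', cos(x)
end program two_point_demo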
8.2
Three-Point Formulas
Two-point formulas suffer from a serious flaw. The second derivative of a linear function is zero. Thus,
two-point formulas are inappropriate if both first- and second-derivative approximations are needed. For
example, if the original function is s(t) (position s as a function of time t), then its first and second
derivatives with respect to time are velocity and acceleration, respectively. A linear approximation of
position would always yield zero for the approximation of acceleration. Not too swift. To ensure that
the interpolating polynomial allows for non-zero first and second derivatives, we need to consider at least
quadratic polynomials; that is, k ≥ 2. So, once again adopting a minimalist approach, let's consider k = 2,
in which case we need three points: [x_0, f(x_0)], [x_1, f(x_1)], and [x_2, f(x_2)].

The Newton form of the quadratic interpolating polynomial is
\[ P_2(x) = f(x_0) + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) \tag{205} \]
in which case
\[ P_2'(x) = f[x_0, x_1] + [2x - (x_0 + x_1)]\, f[x_0, x_1, x_2] \tag{206} \]
\[ P_2''(x) = 2 f[x_0, x_1, x_2] \tag{207} \]
The expressions above are valid for the interval [x_0, x_2]. Thus we have the option of evaluating P_2'(x) at x_0,
x_1, or x_2. Let's first pick the center node and evaluate P_2'(x) at x = x_1. That is,
\[ \begin{aligned}
P_2'(x_1) &= f[x_0, x_1] + [2x_1 - (x_0 + x_1)]\, f[x_0, x_1, x_2] \\
          &= f[x_0, x_1] + (x_1 - x_0)\, f[x_0, x_1, x_2] \\
          &= f[x_0, x_1] + h\,\frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0} \\
          &= f[x_0, x_1] + h\,\frac{f[x_1, x_2] - f[x_0, x_1]}{2h} \\
          &= \tfrac{1}{2}\{ f[x_0, x_1] + f[x_1, x_2] \} \\
          &= \tfrac{1}{2}\left[ \frac{f(x_1) - f(x_0)}{h} + \frac{f(x_2) - f(x_1)}{h} \right] \\
          &= \frac{f(x_2) - f(x_0)}{2h} \\
          &= \frac{f(x_1 + h) - f(x_1 - h)}{2h}
\end{aligned} \]
Once the derivative formula has been established, it is no longer necessary to retain the nodal subscript.
Thus, in general,
\[ f'(x) \approx \frac{f(x + h) - f(x - h)}{2h} \equiv D_h f \tag{208} \]
This formula is known as the centered- (or central-) difference approximation of f' at x. Note that the
centered-difference approximation is the average of the forward- and backward-difference approximations;
that is, D_h f = ½[D_h^+ f + D_h^- f]. It was this approximation that gave such a good result in our previous
numerical example.
What would have happened had we evaluated P_2'(x) at x_0 rather than x_1? In this case,
\[ \begin{aligned}
P_2'(x_0) &= f[x_0, x_1] + [2x_0 - (x_1 + x_0)]\, f[x_0, x_1, x_2] \\
          &= f[x_0, x_1] - (x_1 - x_0)\, f[x_0, x_1, x_2] \\
          &= f[x_0, x_1] - h\,\frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0} \\
          &= f[x_0, x_1] - h\,\frac{f[x_1, x_2] - f[x_0, x_1]}{2h} \\
          &= \tfrac{3}{2} f[x_0, x_1] - \tfrac{1}{2} f[x_1, x_2] \\
          &= \tfrac{3}{2}\,\frac{f(x_1) - f(x_0)}{h} - \tfrac{1}{2}\,\frac{f(x_2) - f(x_1)}{h} \\
          &= \frac{-3 f(x_0) + 4 f(x_1) - f(x_2)}{2h}
\end{aligned} \]
Once again, dropping the subscript and generalizing, we obtain the three-point forward-difference approximation of the first derivative, namely
\[ f'(x) \approx \frac{-3 f(x) + 4 f(x + h) - f(x + 2h)}{2h} \equiv D_h^{++} f \tag{209} \]
We leave it to the reader to evaluate P_2'(x) at x = x_2 to obtain the three-point backward-difference approximation of the first derivative, namely
\[ f'(x) \approx \frac{3 f(x) - 4 f(x - h) + f(x - 2h)}{2h} \equiv D_h^{--} f \tag{210} \]
Finally, recall that we went to a quadratic interpolant in order to have a meaningful second-derivative
approximation. Recall also that
\[ \begin{aligned}
P_2''(x) &= 2 f[x_0, x_1, x_2]
          = 2\,\frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0}
          = 2\,\frac{f[x_1, x_2] - f[x_0, x_1]}{2h} \\
         &= \frac{1}{h}\left[ \frac{f(x_2) - f(x_1)}{h} - \frac{f(x_1) - f(x_0)}{h} \right]
          = \frac{f(x_2) - 2 f(x_1) + f(x_0)}{h^2}
\end{aligned} \]
If we interpret the expression above as having been evaluated at the center node x = x_1, then x_0 = x_1 − h
and x_2 = x_1 + h. Generalizing, this leads to the centered-difference approximation to the second derivative,
namely
\[ f''(x) \approx \frac{f(x + h) - 2 f(x) + f(x - h)}{h^2} \equiv D_h^{(2)} f \tag{211} \]
EX: Approximate f''(π/3) for f(x) = sin(x) using the centered-difference approximation with h = 0.01.
\[ D_h^{(2)} f = \frac{\sin(\pi/3 + 0.01) - 2\sin(\pi/3) + \sin(\pi/3 - 0.01)}{(0.01)^2} \approx -0.866018 \tag{212} \]
(For comparison, the exact value is f''(π/3) = −sin(π/3) = −√3/2 = −0.866025404.)
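A minimal sketch (our own) that reproduces this check:

program second_derivative_demo
   implicit none
   real(kind=8) :: x, h, d2

   x = 4.0d0 * atan(1.0d0) / 3.0d0                        ! pi/3
   h = 0.01d0
   d2 = (sin(x + h) - 2.0d0*sin(x) + sin(x - h)) / (h*h)  ! centered second difference
   print '(a, f12.8)', ' approximation    : ', d2
   print '(a, f12.8)', ' exact -sin(pi/3) : ', -sin(x)
end program second_derivative_demo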
8.3  Errors of Difference Approximations
We'd like to begin this section with a table of errors derived in answer to the following problem: Compare
the approximations D_h^+ f, D_h^- f, and D_h f for f(x) = ln(x) at x = 2 for the successive values h = 1/4,
h = 1/8, h = 1/16, and h = 1/32. Describe how the error diminishes as h diminishes in each case. (Note
that, in this pedagogical example, f'(x) = 1/x; thus f'(2) = 1/2 = 0.5.)
     h      D_h^+ f     e_h^+ f      D_h^- f     e_h^- f      D_h f       e_h f
    1/4     0.471132    0.028868     0.534126   -0.034126    0.502629   -0.002629
    1/8     0.484997    0.015003     0.516308   -0.016308    0.500653   -0.000652
    1/16    0.492346    0.007653     0.507979   -0.007979    0.500163   -0.000163
    1/32    0.496134    0.003866     0.503947   -0.003947    0.500040   -0.000041
                          O(h)                    O(h)                    O(h²)
It is most illuminating to glance down the columns that give the errors of the three approximations.
In the first two error columns, each time h is diminished by a factor of two, the error also diminishes by
very nearly a factor of two. Notice also that the error columns e_h^+ f and e_h^- f have approximately the
same magnitudes but are of opposite signs. This suggests that an average of D_h^+ f and D_h^- f would give
a nearly correct value because the errors of the one-sided approximations would tend to cancel. This is
in fact verified by the third error column, which gives the error of the centered-difference approximation,
which is equivalent to the average of the forward- and backward-difference approximations. Finally, notice
how the error in the last column behaves. Each time h is diminished by a factor of two, the error diminishes
by a factor of approximately four. That is, e_{h/2} f ≈ (1/4) e_h f. We can summarize these observations by
saying that the errors of the biased (one-sided) difference approximations tend to diminish like h as h
decreases. In contrast, the error of the centered approximation tends to diminish like h². In shorthand, we
say that the error behaves as O(h) ("order h") and O(h²), respectively.
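A minimal sketch (our own code, not the authors') of how the table above could be generated:

program difference_error_table
   implicit none
   integer :: k
   real(kind=8) :: x, h, dp, dm, dc, exact

   x = 2.0d0
   exact = 1.0d0 / x                       ! f'(2) for f(x) = ln(x)
   h = 0.25d0
   print '(a)', '      h        e+           e-           e0'
   do k = 1, 4
      dp = (log(x + h) - log(x)) / h                       ! forward difference
      dm = (log(x) - log(x - h)) / h                       ! backward difference
      dc = (log(x + h) - log(x - h)) / (2.0d0 * h)         ! centered difference
      print '(f10.5, 3es13.3)', h, exact - dp, exact - dm, exact - dc
      h = h / 2.0d0
   end do
end program difference_error_table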
We are now going to modify Taylor's Remainder Theorem from the previous chapter to actually derive
these error properties.
ASIDE: Alternative Form of Taylor's Theorem

Let's change the variable names in Eqs. 159 and 163, replacing x by z and a by x, in which case Taylor's
Remainder Theorem can be written as follows:
\[ f(z) = f(x) + f'(x)(z - x) + \frac{f''(x)}{2}(z - x)^2 + \cdots + \frac{f^{(n)}(x)}{n!}(z - x)^n + \frac{f^{(n+1)}(\xi)}{(n+1)!}(z - x)^{n+1} \]
The last term, the error term, is formally identical to all previous terms except that the derivative is evaluated at ξ, which lies between x and z. By now letting z = x + h (⇒ z − x = h), we get the desired form
of Taylor's Theorem:
\[ f(x + h) = f(x) + f'(x)h + \frac{f''(x)}{2}h^2 + \cdots + \frac{f^{(n)}(x)}{n!}h^n + \frac{f^{(n+1)}(\xi)}{(n+1)!}h^{n+1} \qquad (x < \xi < x + h) \tag{213} \]
Similarly, replacing h by −h in Eq. 213 yields the corresponding expansion for f(x − h):
\[ f(x - h) = f(x) - f'(x)h + \frac{f''(x)}{2}h^2 - \cdots + (-1)^n\frac{f^{(n)}(x)}{n!}h^n + (-1)^{n+1}\frac{f^{(n+1)}(\xi)}{(n+1)!}h^{n+1} \]
Notice that the signs alternate: even-order terms are positive; odd-order terms are negative. Notice also
that, here, ξ lies between x − h and x.
Similarly, we readily derive the following expansion for f(x + 2h) by replacing h with 2h in Eq. 213:
\[ f(x + 2h) = f(x) + f'(x)(2h) + \frac{f''(x)}{2}(2h)^2 + \cdots + \frac{f^{(n)}(x)}{n!}(2h)^n + \frac{f^{(n+1)}(\xi)}{(n+1)!}(2h)^{n+1} \qquad (x < \xi < x + 2h) \]
So what is the point of all the fun and games above? We are now ready to use these expansions to
derive error formulas for difference approximations of derivatives. Let's start with the forward-difference
approximation of the first derivative. First, we remind the reader of the following:
\[ f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}, \qquad D_h^{+} f = \frac{f(x + h) - f(x)}{h} \tag{214} \]
Both the exact derivative and the forward-difference approximation to the derivative are defined in terms
of difference quotients. In the former, however, the limit is taken as h becomes infinitesimally small. In
the latter, h remains finite. Recall that truncation error is error incurred by the computer's inability to
compute infinite sums or to take infinitesimal limits. In the last chapter, we encountered truncation error
of the first type. Here, we encounter truncation error of the second type. (Attempts to obtain the exact
limit on a computer will ultimately fail from a divide-by-zero error.) For example, e_h^+ f = f'(x) − D_h^+ f
represents the second type of truncation error. We now quantify this error.
Using Taylor's theorem (Eq. 213) with n = 1 in the numerator of D_h^+ f gives
\[ D_h^{+} f = \frac{[\,f(x) + h f'(x) + \tfrac{h^2}{2} f''(\xi)\,] - f(x)}{h} = f'(x) + \frac{h}{2} f''(\xi) \]
so that
\[ e_h^{+} f \equiv f'(x) - D_h^{+} f = -\frac{h}{2} f''(\xi) \qquad (x < \xi < x + h) \tag{215} \]
Thus the truncation error of the forward-difference approximation is O(h). Before treating the centered
difference, we need a small corollary of the Intermediate Value Theorem: if f is continuous on [a, b], then
there exists c in [a, b] such that f(c) = [f(a) + f(b)]/2.

PROOF: Let v = [f(a) + f(b)]/2 and note that v is a value between f(a) and f(b). By the continuity of f(x) and
the IVT, f(x) assumes all values between f(a) and f(b), and so, in particular, it must assume the value v.
(See Fig. 29.)
We are now ready to tackle the error of D_h f.
\[ \begin{aligned}
D_h f &= \frac{f(x + h) - f(x - h)}{2h} \qquad \text{(def. of } D_h f\text{)} \\
      &= \frac{[\,f(x) + h f'(x) + \tfrac{h^2}{2} f''(x) + \tfrac{h^3}{6} f'''(\xi_1)\,] - [\,f(x) - h f'(x) + \tfrac{h^2}{2} f''(x) - \tfrac{h^3}{6} f'''(\xi_2)\,]}{2h} \\
      &= \frac{2h f'(x) + \tfrac{h^3}{6}[\,f'''(\xi_1) + f'''(\xi_2)\,]}{2h} \\
      &= f'(x) + \frac{h^2}{12}[\,f'''(\xi_1) + f'''(\xi_2)\,] \\
      &= f'(x) + 2\cdot\frac{h^2}{12}\cdot\tfrac{1}{2}[\,f'''(\xi_1) + f'''(\xi_2)\,] \\
      &= f'(x) + \frac{h^2}{6} f'''(\xi) \qquad (x - h < \xi_2 < \xi < \xi_1 < x + h)
\end{aligned} \tag{216} \]
where, in the last step, we have used the corollary above to replace the two values ξ_1 and ξ_2 with a
single value ξ that lies somewhere between them. Finally, using Eq. 216 in the expression for the error
e_h f, we obtain
\[ e_h f \equiv f'(x) - D_h f = f'(x) - \left[ f'(x) + \frac{h^2}{6} f'''(\xi) \right] = -\frac{h^2}{6} f'''(\xi) \qquad (x - h < \xi < x + h) \tag{217} \]
We conclude this subsection with a table that summarizes the error formulas for two-point and three-point approximations of the first derivative, including those for D_h^{++} f and D_h^{--} f, which we did not
derive. Note that, whereas the two-point difference approximations are each O(h), the three-point difference approximations are each O(h²). This explains why halving h diminishes the error by a factor of four
for the higher-order formulas. Although the centered and biased three-point formulas are each O(h²), the
centered formula has the smallest coefficient of error and is to be favored whenever possible.

It is also instructive to ask the question: For what functions is the derivative approximation exact?
Certainly, the two-point formulas will be exact if f(x) is linear, in which case f''(x) = 0. Thus, two-point
formulas, which are based on linear interpolants, are exact for linear functions. Similarly, the three-point
approximations, which are based on a quadratic interpolant, are exact for quadratic functions because
f'''(x) = 0.
  Approximation     Error Rule             Interval for ξ
  D_h^+ f           −(h/2) f''(ξ)          x < ξ < x + h
  D_h^- f           +(h/2) f''(ξ)          x − h < ξ < x
  D_h  f            −(h²/6) f'''(ξ)        x − h < ξ < x + h
  D_h^{++} f        +(h²/3) f'''(ξ)        x < ξ < x + 2h
  D_h^{--} f        +(h²/3) f'''(ξ)        x − 2h < ξ < x

HW: Use the methods of this section to derive the error rule for the three-point centered-difference approximation for the second derivative, namely
\[ e_h^{(2)} f \equiv f''(x) - D_h^{(2)} f = -\frac{f^{(4)}(\xi)}{12}\,h^2 \qquad (x - h < \xi < x + h) \tag{218} \]
8.4  The Method of Undetermined Coefficients
Undetermined coefficients is another method for deriving the aforementioned two-point and three-point
formulas. If we want to approximate a derivative using the function values at x_{j−1}, x_j, and x_{j+1}, we consider the mathematical
expression A f(x_{j−1}) + B f(x_j) + C f(x_{j+1}). It is more helpful to write the previous expression as A f(x_j −
h) + B f(x_j) + C f(x_j + h), where we have assumed equally spaced points. Taking a Taylor series of this
mathematical expression at x_j = x implies
\[ f'(x) = A\left[f(x) + f'(x)(-h) + \frac{f''(x)}{2}(-h)^2 + \frac{f'''(x)}{6}(-h)^3 + \cdots\right] + B f(x) + C\left[f(x) + f'(x)h + \frac{f''(x)}{2}h^2 + \frac{f'''(x)}{6}h^3 + \cdots\right] + \tau \tag{219} \]
where τ is the truncation error. Since we are only finding the two-point forward-difference approximation,
we do not need to consider x_{j−1}, so A = 0. Therefore, we need to solve the following system of equations
\[ B + C = 0, \qquad hC = 1 \]
where the first equation amounts to the statement that we do not want to reproduce the function itself, and the
second equation amounts to the statement that we do want to approximate the first derivative. The second
equation above implies that C = 1/h, and the first equation then implies B = −1/h. Thus,
\[ f'(x) \approx \frac{f(x + h) - f(x)}{h}. \tag{220} \]
One of the benefits of the method of undetermined coefficients is that finding the truncation error requires
very little additional work. Insert the values obtained for B and C into (219), retaining the remainder form of the Taylor expansion, and notice that
\[ f'(x) = f(x)\cdot(0) + f'(x)\cdot(1) + \frac{f''(\xi)}{2}\,h^2\cdot\frac{1}{h} + \tau \]
where the f''-term is the Taylor remainder and is thus evaluated at some ξ with x < ξ < x + h. Therefore, we can
solve for τ and find that τ = −(h/2) f''(ξ). Notice this error matches the earlier derivation.
It is analogous to derive the two-point backward-difference approximation to the derivative. Instead,
let's calculate the three-point centered-difference approximation to the first derivative. We now have to solve
the following system of equations
\[ A + B + C = 0, \qquad h(C - A) = 1, \qquad \frac{h^2}{2}(C + A) = 0 \]
We include the third equation because we have three unknowns. The result of including this third equation is a better truncation error for the centered-difference approximation compared with the two-point
approximations. Solving this system of equations yields the approximation found in (208).
Another advantage of the method of undetermined coefficients is the ability to approximate the second
derivative with very little extra work. For instance, calculating a centered-difference approximation to the
second derivative entails solving the following system of equations
\[ A + B + C = 0, \qquad h(C - A) = 0, \qquad \frac{h^2}{2}(C + A) = 1 \]
Solving this system of equations, (211) is easily reproduced.
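A minimal sketch (our own) of the method carried out numerically: the three moment conditions for the first-derivative weights are assembled into a 3×3 linear system and solved by plain Gaussian elimination (which suffices here; no pivoting is needed for this particular matrix). The recovered weights are A = −1/(2h), B = 0, C = 1/(2h), i.e., the centered-difference formula (208).

program undetermined_coefficients
   implicit none
   real(kind=8) :: a(3,3), rhs(3), w(3), h, factor
   integer :: i, j, k

   h = 0.1d0                                      ! any nonzero spacing will do
   ! weights (A, B, C) multiply f(x-h), f(x), f(x+h)
   a(1,:) = (/ 1.0d0,      1.0d0, 1.0d0      /)   ! reproduce no f(x) term
   a(2,:) = (/ -h,         0.0d0, h          /)   ! reproduce f'(x)
   a(3,:) = (/ h*h/2.0d0,  0.0d0, h*h/2.0d0  /)   ! kill the f''(x) term
   rhs    = (/ 0.0d0, 1.0d0, 0.0d0 /)

   do k = 1, 2                                    ! forward elimination
      do i = k + 1, 3
         factor = a(i,k) / a(k,k)
         a(i,:) = a(i,:) - factor * a(k,:)
         rhs(i) = rhs(i) - factor * rhs(k)
      end do
   end do
   do i = 3, 1, -1                                ! back substitution
      w(i) = rhs(i)
      do j = i + 1, 3
         w(i) = w(i) - a(i,j) * w(j)
      end do
      w(i) = w(i) / a(i,i)
   end do

   print '(a,3f12.4)', ' A, B, C  = ', w
   print '(a,3f12.4)', ' expected = ', -1.0d0/(2.0d0*h), 0.0d0, 1.0d0/(2.0d0*h)
end program undetermined_coefficients

Replacing the right-hand side by (0, 0, 1) reproduces the second-derivative weights 1/h², −2/h², 1/h² of Eq. 211.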
8.5  Higher-Order Difference Formulas
The methods of the previous subsection can be used to derive difference approximations for higher derivatives and/or with higher-order error properties. The game plan is always the same: choose a subset of n + 1
points that represent a function, fit a polynomial of degree k through these points, and then approximate
f', f'', f''', etc., by derivatives of the polynomial evaluated at selected nodes.

Recall that the nth derivative of a polynomial of degree less than n is identically zero. Thus, for a difference scheme
to allow a meaningful approximation of the nth derivative, the scheme must be based upon a polynomial
of degree at least n.

For brevity, we will adopt the convention that f_i = f(x_i). That is, f_{i+1} = f(x_i + h), f_{i+2} = f(x_i + 2h),
f_{i−1} = f(x_i − h), etc. The following group provides central-difference approximations of O(h²) accuracy
for derivatives up to the fourth. The first two formulas are already familiar.
\[ f'(x_i) \approx \frac{f_{i+1} - f_{i-1}}{2h} \tag{221} \]
\[ f''(x_i) \approx \frac{f_{i+1} - 2 f_i + f_{i-1}}{h^2} \tag{222} \]
\[ f'''(x_i) \approx \frac{f_{i+2} - 2 f_{i+1} + 2 f_{i-1} - f_{i-2}}{2h^3} \tag{223} \]
\[ f^{(4)}(x_i) \approx \frac{f_{i+2} - 4 f_{i+1} + 6 f_i - 4 f_{i-1} + f_{i-2}}{h^4} \tag{224} \]
The next group provides central-difference approximations of O(h⁴) accuracy for the same derivatives:
\[ f'(x_i) \approx \frac{-f_{i+2} + 8 f_{i+1} - 8 f_{i-1} + f_{i-2}}{12h} \tag{225} \]
\[ f''(x_i) \approx \frac{-f_{i+2} + 16 f_{i+1} - 30 f_i + 16 f_{i-1} - f_{i-2}}{12h^2} \tag{226} \]
\[ f^{(3)}(x_i) \approx \frac{-f_{i+3} + 8 f_{i+2} - 13 f_{i+1} + 13 f_{i-1} - 8 f_{i-2} + f_{i-3}}{8h^3} \tag{227} \]
\[ f^{(4)}(x_i) \approx \frac{-f_{i+3} + 12 f_{i+2} - 39 f_{i+1} + 56 f_i - 39 f_{i-1} + 12 f_{i-2} - f_{i-3}}{6h^4} \tag{228} \]
Our final group is comprised of four biased (backward or forward) difference formulas of O(h²), the
first two of which are familiar.
\[ f'(x_i) \approx \frac{-3 f_i + 4 f_{i+1} - f_{i+2}}{2h} \tag{229} \]
\[ f'(x_i) \approx \frac{+3 f_i - 4 f_{i-1} + f_{i-2}}{2h} \tag{230} \]
\[ f''(x_i) \approx \frac{2 f_i - 5 f_{i+1} + 4 f_{i+2} - f_{i+3}}{h^2} \tag{231} \]
\[ f''(x_i) \approx \frac{2 f_i - 5 f_{i-1} + 4 f_{i-2} - f_{i-3}}{h^2} \tag{232} \]
HW: How much does the error decrease with a method of O(h4 ) if h is a) halved, or b) reduced by a factor
of 10?
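As a check on part (a) of the HW, here is a small sketch (our own; the test function sin(x) and test point x = 1 are arbitrary) that applies the five-point centered second-derivative formula (coefficients −1, 16, −30, 16, −1 over 12h²) and halves h repeatedly. Since the formula is O(h⁴), the error should fall by roughly a factor of 16 each time, until round-off intervenes.

program fourth_order_check
   implicit none
   real(kind=8) :: x, h, d2, err
   integer :: k

   x = 1.0d0                      ! exact second derivative is -sin(1)
   h = 0.1d0
   do k = 1, 4
      d2 = (-sin(x+2*h) + 16*sin(x+h) - 30*sin(x) + 16*sin(x-h) - sin(x-2*h)) / (12*h*h)
      err = abs(-sin(x) - d2)
      print '(a,f8.4,a,es12.3)', ' h = ', h, '   |error| = ', err
      h = h / 2.0d0
   end do
end program fourth_order_check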
8.6  The Finite-Difference Method
Although numerical differentiation is useful in its own right, perhaps one of the most powerful uses of
difference approximations is as the basis of the finite-difference method (FDM), which is used to solve
ordinary and partial differential equations of boundary-value type.

An ordinary differential equation (ODE) is one that involves a function and one or more of its derivatives. Let's consider a very simple, second-order ODE as given below:
\[ y''(x) = f(x) \quad (0 < x < 1); \qquad y(0) = a; \qquad y(1) = b \]
The solution of an ODE involves integrating as many times as the order of the derivative. Thus, a second-order ODE involves integrating twice. Each integration introduces a constant of integration. Thus, for the
example problem there are two constants (a and b) that require specification. If those values are specified
at the two ends of the domain, then the ODE is classified as a boundary-value problem (BVP). If f(x) is
easily integrated twice, then the BVP is easily solved.
So, let's now provide some data for an example problem. Suppose f(x) = 12x², a = 0, and b = 2. This
is easily solved in closed form. That is,
\[ y''(x) = 12x^2, \qquad y'(x) = 4x^3 + c_1, \qquad y(x) = x^4 + c_1 x + c_2 \]
The left boundary condition a = 0 implies c_2 = 0. Imposing the right boundary condition implies that c_1 = 1.
So the exact solution is y(x) = x⁴ + x.

For this problem we don't really need the FDM, but we would if there were no closed-form antiderivative of f(x), or if the original problem were slightly more complicated or perhaps nonlinear. So let's apply
the FDM to our sample problem to see how it works and to see how closely our approximate solution
resembles the true one.
STEP 1: The first step is to perform a regular partition of the domain [0, 1] into n equal subintervals,
each of width h = 1/n. The nodes therefore are given by x_i = ih, i = 0, 1, 2, ..., n. For specificity, suppose
n = 4. Then x_0 = 0, x_1 = 1/4, x_2 = 1/2, x_3 = 3/4, and x_4 = 1.
STEP 2: The next step is to write the exact ODE at the interior nodes. That is,
\[ y''(x_i) = f(x_i), \qquad i = 1, 2, 3 \tag{233} \]
STEP 3: We now replace y''(x_i) in Eq. 233 by the centered-difference approximation for the second
derivative, corrected by the appropriate error formula (Eq. 218), to obtain
\[ \frac{y(x_i + h) - 2y(x_i) + y(x_i - h)}{h^2} - \frac{y^{(4)}(\xi_i)}{12}\,h^2 = f(x_i), \qquad i = 1, 2, 3 \tag{234} \]
The equation above remains exact because of the inclusion of the error term, apart from the fact that
x_i − h < ξ_i < x_i + h, the precise location of which we don't know. If h is small, the error term should be
relatively insignificant. So let's just ignore it and see what happens.

STEP 4: Ignoring the error term will likely produce an inexact solution. For brevity let y_i ≈ y(x_i); that is, y_i is the approximate solution for
y(x_i). Thus, y_i is the solution to
\[ \frac{y_{i+1} - 2y_i + y_{i-1}}{h^2} = f(x_i), \qquad i = 1, 2, 3 \tag{235} \]
where, keep in mind, y_{i+1} ≈ y(x_{i+1}) = y(x_i + h), and similarly for y_{i−1}. Let's now unroll (write out
explicitly) the three equations represented by Eq. 235 for i = 1, i = 2, and i = 3. To these three equations
we will append two trivial equations for the boundary values. We get
\[ \begin{aligned}
y_0 &= a \\
y_0 - 2y_1 + y_2 &= h^2 f_1 \\
y_1 - 2y_2 + y_3 &= h^2 f_2 \\
y_2 - 2y_3 + y_4 &= h^2 f_3 \\
y_4 &= b
\end{aligned} \tag{236} \]
We have taken a few liberties to obtain the equation system Eq. 236. The first is to abbreviate f_i = f(x_i).
The second is to multiply the middle three equations by h² to clear denominators. Equations 236 should
look familiar. They comprise a linear system for the five unknowns y_i, {i = 0, 1, 2, 3, 4}, that approximate
the solution at the nodes.

STEP 5: This system can be written in matrix-vector form as
\[
\begin{bmatrix}
 1 &  0 &  0 &  0 & 0 \\
 1 & -2 &  1 &  0 & 0 \\
 0 &  1 & -2 &  1 & 0 \\
 0 &  0 &  1 & -2 & 1 \\
 0 &  0 &  0 &  0 & 1
\end{bmatrix}
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}
=
\begin{bmatrix} a \\ h^2 f_1 \\ h^2 f_2 \\ h^2 f_3 \\ b \end{bmatrix}
\]
Yet more succinctly, we can write the linear system as
\[ T\vec{y} = \vec{b} \tag{237} \]
where T is a 5 × 5 tridiagonal matrix, \vec{y} is a 5-vector containing the unknowns, and \vec{b} is a 5-vector containing the right-hand-side values and the boundary conditions.
You may be thinking that the boundary values y_0 and y_4 are given by the boundary conditions, and so
it is silly to consider them as unknowns. If so, you are correct. The 5 × 5 system can be collapsed to the
following 3 × 3 system:
\[ \begin{aligned}
a - 2y_1 + y_2 &= h^2 f_1 \\
y_1 - 2y_2 + y_3 &= h^2 f_2 \\
y_2 - 2y_3 + b &= h^2 f_3
\end{aligned} \tag{238} \]
If we migrate the known values a and b from the left to the right side of the equations, we obtain the following
matrix-vector form:
\[
\begin{bmatrix}
-2 &  1 &  0 \\
 1 & -2 &  1 \\
 0 &  1 & -2
\end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}
=
\begin{bmatrix} h^2 f_1 - a \\ h^2 f_2 \\ h^2 f_3 - b \end{bmatrix}
\tag{239}
\]
STEP 6: Solving the resulting matrix system by the methods of linear algebra is the last step.

Finally, let's give the FDM a trial run for the data of the example problem: f(x) = 12x², a = 0, b = 2,
and n = 4, in which case the resulting system is
\[
\begin{bmatrix}
-2 &  1 &  0 \\
 1 & -2 &  1 \\
 0 &  1 & -2
\end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}
=
\begin{bmatrix} (1/4)^2 (12)(1/4)^2 - 0 \\ (1/4)^2 (12)(2/4)^2 \\ (1/4)^2 (12)(3/4)^2 - 2 \end{bmatrix}
\]
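A minimal sketch (our own, not the coursepak's) of Steps 1–6 in code: the tridiagonal system of Eq. 239 is assembled for general n and solved by the Thomas algorithm (a specialization of Gaussian elimination to tridiagonal matrices, our choice of solver), and the result is compared with the exact solution y = x⁴ + x.

program fdm_bvp
   implicit none
   integer, parameter :: n = 4
   real(kind=8) :: h, x(0:n), y(0:n), sub(n-1), diag(n-1), sup(n-1), rhs(n-1), m
   integer :: i

   h = 1.0d0 / real(n, kind=8)
   do i = 0, n
      x(i) = real(i, kind=8) * h
   end do
   y(0) = 0.0d0                              ! boundary conditions a = 0, b = 2
   y(n) = 2.0d0

   do i = 1, n - 1                           ! tridiagonal system for interior unknowns
      sub(i)  = 1.0d0
      diag(i) = -2.0d0
      sup(i)  = 1.0d0
      rhs(i)  = h*h * 12.0d0 * x(i)**2
   end do
   rhs(1)   = rhs(1)   - y(0)                ! move boundary values to the right side
   rhs(n-1) = rhs(n-1) - y(n)

   do i = 2, n - 1                           ! Thomas algorithm: forward elimination
      m = sub(i) / diag(i-1)
      diag(i) = diag(i) - m * sup(i-1)
      rhs(i)  = rhs(i)  - m * rhs(i-1)
   end do
   y(n-1) = rhs(n-1) / diag(n-1)             ! back substitution
   do i = n - 2, 1, -1
      y(i) = (rhs(i) - sup(i) * y(i+1)) / diag(i)
   end do

   print '(a)', '     x        y (FDM)     y (exact)'
   do i = 0, n
      print '(3f12.6)', x(i), y(i), x(i)**4 + x(i)
   end do
end program fdm_bvp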
…; 0 < x < 1; y(1) = 0 (240)
Finally, a comment on perspective. It may seem that the FDM presented above is a toy mathematical
approach of little practical significance. Nothing could be further from the truth. The FDM is a practical
method that has been used to solve the ODEs and PDEs that originate from a wide variety of scientific and
engineering problems. In particular, for example, the FDM is used to predict heat flow in engines and
airflow over aircraft wings and fuselages in order that processes can be understood and designs improved.
8.7
Exercises
1. Use the most appropriate three-point formula to determine approximations that will complete the
following table.

     x       f(x)        f'(x)
    0.0    0.0000000
    0.2    0.74140
    0.4    1.3718

The data above were taken from the function f(x) = e^x − 2x² + 3x − 1. Find the error bounds using
the appropriate error formulas for the above approximations for (a) x = 0.0 and (b) x = 0.2.
9  Numerical Integration
Recall the (First) Fundamental Theorem of Calculus from Math 235: If f(x) is an integrable function, and
F(x) is any antiderivative of f(x), then
\[ \int_a^b f(x)\,dx = F(b) - F(a) \tag{241} \]
This amazing theorem gives us a powerful tool for evaluating definite integrals, or from a geometric
point of view, finding the exact area of regions bounded by the graphs of functions. Perhaps a third of
the semester of Math 236 was devoted to antidifferentiation techniques: u-substitution, trig. substitution,
integration by parts, and so on.
Math 236 may have left you with a false sense of security, the feeling that if you are clever enough
and/or you know enough tricks, given f (x), you can always find an antiderivative F(x). Not so.
Consider the following definite integral:
\[ \int_0^1 \sqrt{1 - k^2 \sin^2(x)}\;dx \tag{242} \]
If k = 1, then the integral is trivial to evaluate. However, if 0 < k < 1, there is no closed-form antiderivative.
What does this mean? Well, first of all, the integrand is well behaved; it is continuous and thus integrable.
So the definite integral is well defined. However, all the usual analytical methods of determining the area
under the curve fail.

The example is a type of integral known as an elliptic integral of the second kind (sorta like Close
Encounters of the Third Kind). In general, elliptic integrals cannot be evaluated by standard analytical
methods. What is to be done then? Numerics to the rescue; hence, the purpose of this chapter.

Before we proceed, a quick word on nomenclature. A somewhat archaic term for numerical integration
is quadrature. Thus, we will use the terms numerical integration and quadrature interchangeably.
9.1  The Riemann Sum
Most calculus texts develop the notion of the definite integral via the Riemann sum, in which the sum of
rectangular areas is used to approximate the area under the curve. We review this powerful idea.
Consider function f(x) defined on the interval [a, b]. The Riemann sum begins with a partition P, not
necessarily regular, of the interval such that
\[ P = \{x_0, x_1, x_2, \ldots, x_{n-1}, x_n\}, \qquad a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b \]
Now let Δx_i = x_i − x_{i−1} denote the width of the ith subinterval, and let x_i* be an arbitrarily chosen point with x_{i−1} ≤ x_i* ≤ x_i.
Now let A_i denote the area of a rectangle of height f(x_i*) and of width Δx_i, as shown in Fig. 31; that is,
\[ A_i = f(x_i^*)\,\Delta x_i \tag{243} \]
in which case the area under the curve is approximated by the sum of the areas of the rectangles, as follows:
\[ \int_a^b f(x)\,dx \approx \sum_{i=1}^{n} f(x_i^*)\,\Delta x_i \tag{244} \]
The sum on the right-hand side of Eq. 244 is called a Riemann sum, after Bernhard Riemann, who was a
student of Gauss at the University of Goettingen, and a formidable mathematician in his own right. Thus
far, the Riemann sum only approximates the definite integral. However, if n → ∞ in such a way that
Δx_i → 0 for all i, then the infinite sum gives the exact value of the integral. We can put it all together
succinctly in the following definition of the definite integral:
\[ \int_a^b f(x)\,dx \equiv \lim_{n \to \infty\;(\max \Delta x_i \to 0)} \sum_{i=1}^{n} f(x_i^*)\,\Delta x_i \tag{245} \]
The theory of the Riemann integral is quite general. It allows us to partition the interval [a, b] any way
we please, and furthermore, it allows us to pick the points x_i* within the subintervals any way we please.
Thus the theory still holds if we consider a regular partition, by which Δx_i = h = (b − a)/n and x_i = a + ih, i =
0, 1, 2, ..., n. Note that as n → ∞, h → 0. Thus, the exact integral can be expressed as
\[ \int_a^b f(x)\,dx = \lim_{n \to \infty} \sum_{i=1}^{n} f(x_i^*)\,h = \lim_{n \to \infty} h \sum_{i=1}^{n} f(x_i^*) \tag{246} \]
9.2  Basic Quadrature Rules
Let's keep the idea of a regular partition, but drop the infinite sum. Where does that leave us?
\[ \int_a^b f(x)\,dx \approx h \sum_{i=1}^{n} f(x_i^*) \tag{247} \]
Note that we are still free to choose x_{i−1} ≤ x_i* ≤ x_i as we please. Are some choices better than others?
9.2.1  Rectangle Rules
Let's be naive and pick x_i* to lie at one of the endpoints of the subinterval. For example, if we pick x_i* to
coincide with the left endpoint of each subinterval (that is, x_i* = x_{i−1}), then Eq. 247 becomes
\[ \int_a^b f(x)\,dx \approx h \sum_{i=0}^{n-1} f(x_i) \equiv L(h) \tag{248} \]
where L(h) indicates that the method is commonly called the left rectangle rule, whose value depends upon
the step size h = (b − a)/n. The left rectangle rule is illustrated in Fig. 32. Note that L(h) underpredicts
the true area if the function is increasing. It would, of course, overpredict the true area were the function
decreasing.

Alternately, we could have chosen x_i* to coincide with the right endpoint of the subinterval, in which
case x_i* = x_i, and Eq. 247 gives
\[ \int_a^b f(x)\,dx \approx h \sum_{i=1}^{n} f(x_i) \equiv R(h) \tag{249} \]
where R(h) denotes the right rectangle rule, whose value also depends upon h. The right rectangle rule is
illustrated in Fig. 33. Note that R(h) overpredicts the true area if the function is increasing. Conversely, it
would underpredict the true area were the function decreasing.
9.2.2  Midpoint Rule
By now it is clear that some choices for x_i* are better than others if one is restricted to finite h. One way to
balance the tendencies to over- or undershoot the true area is to choose x_i* to fall at the midpoint of each
subinterval. That is,
\[ x_i^* = \frac{x_{i-1} + x_i}{2} \equiv m_i \tag{250} \]
in which case
\[ \int_a^b f(x)\,dx \approx h \sum_{i=1}^{n} f(m_i) \equiv M(h) \tag{251} \]
Here, M(h) denotes the midpoint rule, whose value depends on h. Midpoint-rule quadrature is illustrated
in Fig. 34. Let's compare L(h), R(h), and M(h) on a trial problem.

EX: Recall from Math 235 that ln(x) = ∫_1^x (1/t) dt. Use the numerical integration schemes L(h), R(h), and
M(h) with n = 4 to approximate ln(3). That is, use the three quadrature methods to approximate ∫_1^3 (1/x) dx,
which is shown in Fig. 35. With definite integrals, it is always a good practice to estimate the area first by
eye, as a ballpark check on your computed answer. It appears that the area is a bit more than 1, say 1.2,
square units.

For each method, n = 4 ⇒ h = (3 − 1)/4 = 1/2. NOTE: The number of nodes is always one greater than the
number of subintervals n. In this instance, there are five nodes: x_0, x_1, x_2, x_3, and x_4. The data are best
organized in tabular form. We should explain the purpose of the fourth and fifth columns of Table 17.
   i     x_i     f(x_i) = 1/x_i    (wL)_i   (wR)_i
   0     1.0     1.0                 1        0
   1     1.5     0.66666             1        1
   2     2.0     0.5                 1        1
   3     2.5     0.4                 1        1
   4     3.0     0.33333             0        1

Table 17: Nodal data and weights for the left and right rectangle rules applied to ∫_1^3 (1/x) dx with n = 4.
The columns denoted (wL)_i and (wR)_i, respectively, indicate which nodes are "turned on" and which are "turned
off" for the particular method. For example, in the column headed (wR)_i, by weights of 0 or 1 for "off"
and "on," respectively, we see that the method R(h) uses the functional values at nodes 1–4, but does not
make use of f(x_0). By this convention, we can use a single algorithm for both methods L(h) and R(h),
namely
\[ I \approx h \sum_{i=0}^{n} f_i\, w_i \tag{252} \]
where I = ∫_a^b f(x) dx, f_i is shorthand for f(x_i), and w_i is the value of the weight for node i. For example,
expanding Eq. 252 for R(h), we get
\[ R(h) = h \sum_{i=0}^{n} f_i\,(wR)_i \tag{253} \]
A disadvantage of the midpoint rule is that it is difficult to use the same algorithm we developed for L(h)
and R(h). This is because M(h) uses the midpoints of the subintervals, which must be calculated, rather
than the original nodal values. Thus, the midpoint rule requires a new table of data:

   i     m_i            f(m_i) = 1/m_i    (wM)_i
   1     1.25 = 5/4        4/5              1
   2     1.75 = 7/4        4/7              1
   3     2.25 = 9/4        4/9              1
   4     2.75 = 11/4       4/11             1

Expanding the rule gives
\[ M(h) = h \sum_{i=1}^{n} f(m_i)\,(wM)_i = \frac{1}{2}\left[\frac{4}{5} + \frac{4}{7} + \frac{4}{9} + \frac{4}{11}\right] = 1.08975 \tag{255} \]
To six significant digits, the true value of ln(3) is 1.09861. As expected, L(0.5) considerably overpredicts this value, and R(0.5) considerably underpredicts. In contrast, M(0.5) is correct to three significant
digits. We summarize the errors in Table 19.

The midpoint rule was the most accurate, but it was also awkward. If you are thinking that perhaps we
could have gotten a better estimate more readily by averaging L(0.5) and R(0.5), then pat yourself on the
back. Indeed,
\[ T(0.5) \equiv \frac{L(0.5) + R(0.5)}{2} = 1.116666, \qquad E_T(0.5) = -0.0180543 \tag{256} \]

   L(0.5)     E_L(0.5)     R(0.5)     E_R(0.5)    M(0.5)     E_M(0.5)
   1.28333    -0.184721    0.950000   0.148612    1.08975    0.008862

Table 19: Quadrature approximations of ln(3) and their errors, h = 0.5.
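A minimal sketch (our own, not the coursepak's) that reproduces the approximations above by accumulating the weighted function values directly:

program quadrature_demo
   implicit none
   integer, parameter :: n = 4
   real(kind=8) :: a, b, h, left, right, mid, trap, exact
   integer :: i

   a = 1.0d0;  b = 3.0d0
   h = (b - a) / real(n, kind=8)
   left = 0.0d0;  right = 0.0d0;  mid = 0.0d0
   do i = 1, n
      left  = left  + h / (a + real(i-1, kind=8)*h)           ! left endpoint of subinterval i
      right = right + h / (a + real(i,   kind=8)*h)           ! right endpoint
      mid   = mid   + h / (a + (real(i, kind=8) - 0.5d0)*h)   ! midpoint
   end do
   trap  = 0.5d0 * (left + right)
   exact = log(3.0d0)

   print '(a, f10.6, a, f10.6)', ' L(h) = ', left,  '   error = ', exact - left
   print '(a, f10.6, a, f10.6)', ' R(h) = ', right, '   error = ', exact - right
   print '(a, f10.6, a, f10.6)', ' M(h) = ', mid,   '   error = ', exact - mid
   print '(a, f10.6, a, f10.6)', ' T(h) = ', trap,  '   error = ', exact - trap
end program quadrature_demo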
9.2.3  Trapezoid Rule
Let's take a closer look at what happened when we averaged the left and right rectangle rules above. Recall
that we defined this average as T(h). That is,
\[ \begin{aligned}
T(h) &= \tfrac{1}{2}[L(h) + R(h)] \\
     &= \tfrac{1}{2}\left[ h\sum_{i=0}^{n-1} f(x_i) + h\sum_{i=1}^{n} f(x_i) \right] \\
     &= \frac{h}{2}\,[\,f_0 + 2f_1 + 2f_2 + \cdots + 2f_{n-1} + f_n\,] \\
     &= \frac{h}{2}\,[\,(wT)_0 f_0 + (wT)_1 f_1 + (wT)_2 f_2 + \cdots + (wT)_{n-1} f_{n-1} + (wT)_n f_n\,]
\end{aligned} \tag{257} \]
where
\[ (wT)_i = \begin{cases} 2 & \text{if } i \ne 0 \text{ and } i \ne n \\ 1 & \text{if } i = 0 \text{ or } i = n \end{cases} \tag{258} \]
The rule T(h) is known as the trapezoid rule, for reasons to become apparent shortly. We summarize by
developing an algorithm that can be used for the trapezoid rule, the left rectangle rule, or the right rectangle rule,
depending upon the weights chosen.

Key to the common algorithm is the recognition that each of these quadrature rules can be expressed
as a scaled dot product of two (n + 1)-vectors: a vector of functional values and a vector of weights. In
particular, let
\[ \vec{f} = [\,f_0, f_1, f_2, \ldots, f_{n-1}, f_n\,] \tag{259} \]
and
\[ \vec{w} = [\,w_0, w_1, w_2, \ldots, w_{n-1}, w_n\,] \tag{260} \]
Then each rule can be written compactly as
\[ N(h) = \frac{h}{c}\,\vec{f}\cdot\vec{w} \approx \int_a^b f(x)\,dx \tag{261} \]
where N ∈ {L, R, T}, ~w ∈ {~w_L, ~w_R, ~w_T}, and c is a scaling constant, namely
\[ c = \begin{cases} 1 & \text{if } N = L \text{ or } N = R \\ 2 & \text{if } N = T \end{cases} \tag{262} \]
and
\[ \vec{w}_L = [1, 1, 1, \ldots, 1, 1, 0] \;\text{(for } L(h)\text{)}, \quad \vec{w}_R = [0, 1, 1, \ldots, 1, 1, 1] \;\text{(for } R(h)\text{)}, \quad \vec{w}_T = [1, 2, 2, \ldots, 2, 2, 1] \;\text{(for } T(h)\text{)} \tag{263} \]
We will refer to this weighted-dot-product procedure as ALG 9.1.
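A minimal sketch (our own code) of this weighted-dot-product formulation, again applied to ∫_1^3 (1/x) dx with n = 4:

program alg91_sketch
   implicit none
   integer, parameter :: n = 4
   real(kind=8) :: a, b, h, f(0:n), wL(0:n), wR(0:n), wT(0:n)
   integer :: i

   a = 1.0d0;  b = 3.0d0
   h = (b - a) / real(n, kind=8)
   do i = 0, n
      f(i) = 1.0d0 / (a + real(i, kind=8) * h)   ! integrand 1/x sampled at the nodes
   end do

   wL = 1.0d0;  wL(n) = 0.0d0                    ! left rectangle weights,  c = 1
   wR = 1.0d0;  wR(0) = 0.0d0                    ! right rectangle weights, c = 1
   wT = 2.0d0;  wT(0) = 1.0d0;  wT(n) = 1.0d0    ! trapezoid weights,       c = 2

   print '(a, f10.6)', ' L(h) = ',  h          * dot_product(f, wL)
   print '(a, f10.6)', ' R(h) = ',  h          * dot_product(f, wR)
   print '(a, f10.6)', ' T(h) = ', (h / 2.0d0) * dot_product(f, wT)
end program alg91_sketch

Only the weight vector and the scaling constant change from rule to rule; the rest of the algorithm is identical.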
Indeed, if we let (A_T)_i = (h/2)[f_{i−1} + f_i] denote the area of the trapezoid erected on the ith subinterval, then
\[ T(h) = \sum_{i=1}^{n} (A_T)_i \tag{265} \]
Geometrically, the sum above results from the approximation of the area under the curve by the use of
trapezoids rather than rectangles. Alternately, the trapezoid rule can be derived from first principles by
returning to the notion of the interpolating polynomial from Chapter 7.
Here is the game plan for deriving the trapezoid rule from scratch: 1) approximate the function f(x)
on subinterval [x_{i−1}, x_i] by the linear polynomial P_1(x), 2) approximate ∫_{x_{i−1}}^{x_i} f(x) dx by ∫_{x_{i−1}}^{x_i} P_1(x) dx, and 3)
sum the approximated areas on each subinterval.

For starters, consider the linear interpolating polynomial P_1(x) in Lagrange form for the first subinterval [x_0, x_1], namely
\[ P_1(x) = f_0\,\frac{x - x_1}{x_0 - x_1} + f_1\,\frac{x - x_0}{x_1 - x_0} = -\frac{f_0}{h}(x - x_1) + \frac{f_1}{h}(x - x_0) \tag{266} \]
Integrating over the subinterval,
\[ \begin{aligned}
\int_{x_0}^{x_1} P_1(x)\,dx
&= -\frac{f_0}{h}\int_{x_0}^{x_1}(x - x_1)\,dx + \frac{f_1}{h}\int_{x_0}^{x_1}(x - x_0)\,dx \\
&= -\frac{f_0}{h}\,\frac{(x - x_1)^2}{2}\Big|_{x_0}^{x_1} + \frac{f_1}{h}\,\frac{(x - x_0)^2}{2}\Big|_{x_0}^{x_1} \\
&= \frac{f_0}{h}\,\frac{(x_0 - x_1)^2}{2} + \frac{f_1}{h}\,\frac{(x_1 - x_0)^2}{2} \\
&= \frac{f_0}{h}\,\frac{h^2}{2} + \frac{f_1}{h}\,\frac{h^2}{2} \\
&= \frac{h}{2}\,[\,f_0 + f_1\,]
\end{aligned} \tag{267} \]
Linear interpolation in each subinterval is equivalent to estimating the area under the curve within each
subinterval by the area of a trapezoid, as revealed by the final expression in Eq. 267. Strictly speaking, what
we have derived is the simple trapezoid rule, which applies to a single subinterval. Composite quadrature
rules are derived by extending the corresponding simple quadrature rule to all subintervals. Thus, the
composite trapezoid rule is obtained by the following logic, presuming there are n subintervals, each of
width h:
\[ \begin{aligned}
\int_a^b f(x)\,dx
&= \int_{x_0}^{x_1} f(x)\,dx + \int_{x_1}^{x_2} f(x)\,dx + \cdots + \int_{x_{n-1}}^{x_n} f(x)\,dx \\
&\approx \int_{x_0}^{x_1} P_1(x)\,dx + \int_{x_1}^{x_2} P_1(x)\,dx + \cdots + \int_{x_{n-1}}^{x_n} P_1(x)\,dx \\
&= \frac{h}{2}[\,f_0 + f_1\,] + \frac{h}{2}[\,f_1 + f_2\,] + \cdots + \frac{h}{2}[\,f_{n-1} + f_n\,] \\
&= \frac{h}{2}[\,f_0 + 2f_1 + 2f_2 + \cdots + 2f_{n-1} + f_n\,] = T(h)
\end{aligned} \]
We have taken some liberties with notation in the derivation above. Note that the linear polynomial interpolant is formally the same but numerically different in each subinterval. For example, in subinterval 1,
P_1(x) = f_0 (x − x_1)/(x_0 − x_1) + f_1 (x − x_0)/(x_1 − x_0), but in subinterval 2, P_1(x) = f_1 (x − x_2)/(x_1 − x_2) + f_2 (x − x_1)/(x_2 − x_1). It would be notationally cumbersome to distinguish the linear interpolants in each subinterval, and so we have (mis)used the notation
P_1(x) for each subinterval.

Now that you have the idea of how quadrature rules can be derived from scratch, we are ready to tackle
Simpson's rule, an extraordinarily accurate quadrature rule.

9.2.4  Simpson's Rule
We begin by considering the quadratic interpolant in Lagrange form for three equally spaced nodes x_0,
x_1, and x_2, with adjacent nodes separated by subinterval width h:
\[ P_2(x) = f_0\,\frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)} + f_1\,\frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)} + f_2\,\frac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)} = f_0 L_0(x) + f_1 L_1(x) + f_2 L_2(x) \tag{268} \]
Thus,
\[ \int_{x_0}^{x_2} f(x)\,dx \approx \int_{x_0}^{x_2} P_2(x)\,dx = f_0 \int_{x_0}^{x_2} L_0(x)\,dx + f_1 \int_{x_0}^{x_2} L_1(x)\,dx + f_2 \int_{x_0}^{x_2} L_2(x)\,dx \tag{269} \]
You might be wondering why we have used the Lagrange form of the interpolant rather than the
Newton form. There is a very good reason. Note that, in the Lagrange form, the weight w_0 for function
value f_0, for example, is given by the integral of L_0(x), and similarly for weights w_1 and w_2. For example,
\[ \begin{aligned}
w_0 &= \int_{x_0}^{x_2} L_0(x)\,dx
     = \int_{x_0}^{x_2} \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)}\,dx
     = \int_{x_0}^{x_2} \frac{(x - x_1)(x - x_2)}{(-h)(-2h)}\,dx \\
    &= \frac{1}{2h^2}\int_{x_0}^{x_2} (x - x_1)(x - x_2)\,dx
     = \frac{1}{2h^2}\int_{x_0}^{x_2} (x - x_1)[\,x - (x_1 + h)\,]\,dx \\
    &= \frac{1}{2h^2}\int_{x_0}^{x_2} (x - x_1)[\,(x - x_1) - h\,]\,dx
     = \frac{1}{2h^2}\int_{x_0}^{x_2} [\,(x - x_1)^2 - h(x - x_1)\,]\,dx
\end{aligned} \]
At this point, a change of variables is helpful (but not absolutely essential). Let t = x − x_1 ⇒ x = t + x_1
and dt = dx. Moreover, x_0 ≤ x ≤ x_2 ⇒ −h ≤ t ≤ +h. Thus,
\[ \begin{aligned}
w_0 &= \frac{1}{2h^2}\int_{-h}^{+h} [\,t^2 - ht\,]\,dt
     = \frac{1}{2h^2}\left[\frac{t^3}{3} - \frac{ht^2}{2}\right]_{-h}^{+h} \\
    &= \frac{1}{2h^2}\left[\left(\frac{h^3}{3} - \frac{h^3}{2}\right) - \left(-\frac{h^3}{3} - \frac{h^3}{2}\right)\right]
     = \frac{1}{2h^2}\cdot\frac{2h^3}{3}
     = \frac{h}{3}
\end{aligned} \tag{270} \]
Similarly, w_1 = ∫_{x_0}^{x_2} L_1(x) dx = 4h/3, and w_2 = ∫_{x_0}^{x_2} L_2(x) dx = h/3. Thus
\[ \int_{x_0}^{x_2} P_2(x)\,dx = \frac{h}{3} f_0 + \frac{4h}{3} f_1 + \frac{h}{3} f_2 = \frac{h}{3}\,[\,f_0 + 4 f_1 + f_2\,] \tag{271} \]
Equation 271, in fact, is the simple Simpson's quadrature rule. To derive the composite Simpson's rule,
we apply the simple rule to pairs of subintervals. That is,
\[ \begin{aligned}
\int_a^b f(x)\,dx
&= \int_{x_0}^{x_2} f(x)\,dx + \int_{x_2}^{x_4} f(x)\,dx + \cdots + \int_{x_{n-2}}^{x_n} f(x)\,dx \\
&\approx \int_{x_0}^{x_2} P_2(x)\,dx + \int_{x_2}^{x_4} P_2(x)\,dx + \cdots + \int_{x_{n-2}}^{x_n} P_2(x)\,dx \\
&= \frac{h}{3}[\,f_0 + 4f_1 + f_2\,] + \frac{h}{3}[\,f_2 + 4f_3 + f_4\,] + \cdots + \frac{h}{3}[\,f_{n-2} + 4f_{n-1} + f_n\,] \\
&= \frac{h}{3}[\,f_0 + 4f_1 + 2f_2 + 4f_3 + \cdots + 2f_{n-2} + 4f_{n-1} + f_n\,] = S(h)
\end{aligned} \tag{272} \]
Composite Simpson's rule is readily added to the pantheon of methods embraced by ALG 9.1 by
defining the scaling constant as c = 3 and the weight vector as ~w_S = [1, 4, 2, 4, 2, 4, ..., 2, 4, 1].

Now that you are familiar with ALG 9.1 for numerical integration, we can even make it work for the
composite midpoint rule, but it requires a slight trick. Consider that the odd-indexed nodes (e.g., x_1, x_3, x_5,
etc.) are the midpoints of subintervals of width 2h that span from one even-indexed node to the next. For
example, x_1 is the midpoint of the subinterval [x_0, x_2]. Provided we have an even number of subintervals
(an odd number of nodes), we can adapt ALG 9.1 by defining the weights for the midpoint rule to be
~w_M = [0, 1, 0, 1, 0, ..., 1, 0], in which case
\[ M(2h) = 2h\,\vec{f}\cdot\vec{w}_M \tag{273} \]
It can also be shown that the three rules are related by
\[ S(h) = \frac{2T(h) + M(2h)}{3} \tag{274} \]
Simpson's rule, like the midpoint rule above, and unlike the trapezoid rule, for example, requires an even
number of subintervals.
9.3
Quadrature Error
To derive the integration error, we will make use of the formula for the error of polynomial interpolation
from Chapter 7. The game plan is roughly as follows:
\[ f(x) \approx P_n(x) \quad\Longrightarrow\quad \int_a^b f(x)\,dx \approx \int_a^b P_n(x)\,dx \tag{275} \]
Consequently,
\[ E(h) = \int_a^b f(x)\,dx - \int_a^b P_n(x)\,dx = \int_a^b [\,f(x) - P_n(x)\,]\,dx = \int_a^b e_n^P(x)\,dx \tag{276} \]
That is, the integration error is the integral of the interpolation error given by Eq. 181. We begin by
deriving the error of the simple quadrature rules and later modify the results for composite rules. But to
do justice to the derivation, we will first need an integral theorem from Math 235.

Second Integral Mean Value THEOREM (IMVT2): If f and g are continuous on [a, b] and g is non-negative (or non-positive) on the interval, then there exists c ∈ (a, b) for which
\[ \int_a^b f(x)\,g(x)\,dx = f(c) \int_a^b g(x)\,dx \tag{277} \]
9.3.1  Error of the Simple Rectangle Rules
Consider either the left or right rectangle rule on a single subinterval, for which [x_0, x_1] = [a, b]. Rectangle rules
are crude in that they exploit polynomial interpolation of degree n = 0, by which f(x) ≈ P_0(x) = constant,
and there is but a single node. For the left rectangle rule the constant is f(x_0), and the node is x_0. For the right
rectangle rule the constant is f(x_1), and the node is x_1. For the former, the absolute value of the integration
error is given by
\[ |E_{SL}(h)| = \left| \int_{x_0}^{x_1} e_0(x)\,dx \right| = \left| \int_{x_0}^{x_1} f'(\xi)(x - x_0)\,dx \right| \tag{278} \]
In the equation above, x_0 < ξ < x ⇒ ξ = ξ(x), and g(x) = x − x_0, which is non-negative on the entire
interval. Thus, by the IMVT2 above,
\[ \begin{aligned}
|E_{SL}(h)| &= \left| f'(c) \int_{x_0}^{x_1} (x - x_0)\,dx \right|
            \le M_1 \int_{x_0}^{x_1} (x - x_0)\,dx \\
            &= M_1\,\frac{(x - x_0)^2}{2}\Big|_{x_0}^{x_1}
             = M_1\,\frac{(x_1 - x_0)^2}{2}
             = \frac{M_1 h^2}{2}
\end{aligned} \tag{279} \]
where the subscript SL refers to the simple left rectangle rule, and M_1 is an upper bound on the absolute
value of the first derivative of f on the interval.

The derivation for E_{SR}(h) is virtually identical to that above. Because the node is x_1 rather than x_0, the
factor −h²/2 appears within the absolute value signs. Still, the final result is the same as for the left rectangle
rule, namely,
\[ |E_{SR}(h)| \le \frac{M_1 h^2}{2} \tag{280} \]
9.3.2  Error of the Simple Trapezoid Rule
To derive the error formula for the simple trapezoid rule (ST), we proceed exactly as before, however with
the error formula appropriate for linear (n = 1) interpolation, in which case
\[ |E_{ST}(h)| = \left| \int_{x_0}^{x_1} e_1(x)\,dx \right| = \left| \int_{x_0}^{x_1} \frac{f''(\xi)}{2}\,(x - x_0)(x - x_1)\,dx \right| \tag{281} \]
In the equation above, g(x) = (x − x_0)(x − x_1). Here, g is non-positive on the entire interval,
which once again satisfies the hypothesis of the IMVT2, in which case
\[ \begin{aligned}
|E_{ST}(h)| &= \left| \frac{f''(c)}{2} \int_{x_0}^{x_1} (x - x_0)(x - x_1)\,dx \right|
            \le \frac{M_2}{2} \left| \int_{x_0}^{x_1} (x - x_0)(x - x_1)\,dx \right| \\
            &= \frac{M_2}{2} \left| \int_{x_0}^{x_1} (x - x_0)[\,x - (x_0 + h)\,]\,dx \right|
             = \frac{M_2}{2} \left| \int_{x_0}^{x_1} (x - x_0)[\,(x - x_0) - h\,]\,dx \right| \\
            &= \frac{M_2}{2} \left| \left[ \frac{(x - x_0)^3}{3} - h\,\frac{(x - x_0)^2}{2} \right]_{x_0}^{x_1} \right|
             = \frac{M_2}{2} \left| \frac{(x_1 - x_0)^3}{3} - h\,\frac{(x_1 - x_0)^2}{2} \right| \\
            &= \frac{M_2}{2} \left| \frac{h^3}{3} - \frac{h^3}{2} \right|
             = \frac{M_2}{2}\cdot\frac{h^3}{6}
             = \frac{M_2 h^3}{12}
\end{aligned} \tag{282} \]
Note that, whereas the simple rectangle rules have O(h²) error properties, the simple trapezoid rule has
O(h³) error properties. What do you expect to happen for the simple midpoint rule? Like the rectangle rules,
constant interpolation is used. But errors tend to cancel because the height of each rectangle is determined
by the value of the function at the midpoint of the subinterval.
9.3.3  Error of the Simple Midpoint Rule
Just when you have been lulled into a false sense of security that the same approach can be used to derive
the error of any quadrature rule, we regret to inform you that it fails for something as simple-minded as the
midpoint rule. Why is that?

For the midpoint rule, which, like the rectangle rules, uses constant interpolation, there is once again a
single node, m. The interpolation error formula is therefore
\[ e_0(x) = f'(\xi)(x - m) \tag{283} \]
The problem lies in integrating the formula above between x_0 and x_1. Note that g(x) = x − m changes
sign on the interval, and thus the IMVT2 is not applicable. We must finesse the error analysis.

To that end, consider a Taylor expansion of f(x) about the midpoint m:
\[ f(x) = f(m) + f'(m)(x - m) + \frac{f''(\xi)}{2}(x - m)^2 \tag{284} \]
The interpolation error is then
\[ f(x) - P_0(x) = f'(m)(x - m) + \frac{f''(\xi)}{2}(x - m)^2 \tag{285} \]
and the integration error on the single subinterval is the integral of the interpolation error; that is,
\[ |E_{SM}(h)| = \left| \int_{x_0}^{x_1} \left[ f'(m)(x - m) + \frac{f''(\xi)}{2}(x - m)^2 \right] dx \right| \tag{286} \]
The first term in the integral above vanishes. Why is that? We are left with
\[ \begin{aligned}
|E_{SM}(h)| &= \left| \int_{x_0}^{x_1} \frac{f''(\xi)}{2}(x - m)^2\,dx \right|
             = \left| \frac{f''(c)}{2} \int_{x_0}^{x_1} (x - m)^2\,dx \right|
            \le \frac{M_2}{2} \int_{x_0}^{x_1} (x - m)^2\,dx \\
            &= \frac{M_2}{2}\left[\frac{(x - m)^3}{3}\right]_{x_0}^{x_1}
             = \frac{M_2}{6}\left[ (x_1 - m)^3 - (x_0 - m)^3 \right]
             = \frac{M_2}{6}\left[ \left(\frac{h}{2}\right)^3 - \left(-\frac{h}{2}\right)^3 \right]
             = \frac{M_2 h^3}{24}
\end{aligned} \tag{287} \]
Finally, by a similar, but more algebraically involved, approach, one can derive the following error bound
for simple Simpson's quadrature:
\[ |E_{SS}(h)| \le \frac{M_4 h^5}{90} \tag{288} \]
For completeness, we summarize the error rules for simple quadrature below:
\[ |E_{SL}(h)| \le \frac{M_1 h^2}{2}, \quad |E_{SR}(h)| \le \frac{M_1 h^2}{2}, \quad |E_{ST}(h)| \le \frac{M_2 h^3}{12}, \quad |E_{SM}(h)| \le \frac{M_2 h^3}{24}, \quad |E_{SS}(h)| \le \frac{M_4 h^5}{90} \tag{289} \]

9.3.4  Error of Composite Quadrature Rules
No one performs quadrature using a single subinterval (with the notable historical exception that the
astronomer/mathematician Johannes Kepler devised the equivalent of simple Simpson's rule to improve
greatly on the estimate of the volume of a wine cask). To keep the error within bounds, the interval width h
must be small, which implies that many subintervals are generally required. What then is the relationship
between the error on a single subinterval and the error when many subintervals are used? We illustrate
using composite Simpson's quadrature.

First, recall the triangle inequality: For any two real numbers a and b, |a + b| ≤ |a| + |b|.

For composite Simpson's rule, there are n/2 subintervals, each of width 2h and spanning two nodes.
Let's adopt the nomenclature, for example, that E_{SS}^{24}(h) represents the error of simple Simpson quadrature
on the subinterval between nodes x_2 and x_4. Thus,
\[ E_S(h) = E_{SS}^{02}(h) + E_{SS}^{24}(h) + \cdots + E_{SS}^{(n-2)n}(h) \tag{290} \]
By the triangle inequality,
\[ |E_S(h)| \le \frac{M_4^{02} h^5}{90} + \frac{M_4^{24} h^5}{90} + \cdots + \frac{M_4^{(n-2)n} h^5}{90} \tag{291} \]
where M_4^{24}, for example, bounds the absolute value of the fourth derivative of f on the subinterval
[x_2, x_4]. Noting that the bound on each subinterval is itself bounded by M_4, the bound on |f^{(4)}(x)| on the
entire interval [a, b], we have
\[ \begin{aligned}
|E_S(h)| &\le \frac{M_4^{02} h^5}{90} + \frac{M_4^{24} h^5}{90} + \cdots + \frac{M_4^{(n-2)n} h^5}{90} \\
         &\le \frac{M_4 h^5}{90} + \frac{M_4 h^5}{90} + \cdots + \frac{M_4 h^5}{90}
          = \frac{M_4 h^5}{90}\,[\,1 + 1 + \cdots + 1\,] \\
         &= \frac{M_4 h^5}{90}\cdot\frac{n}{2}
          = \frac{(b - a)\,M_4 h^4}{180}
\end{aligned} \tag{292} \]
What happened to the power of h in the last step above? Note that nh = b − a. Hence, the power of h
diminished by one. In general, and for the same reason, the rules for composite quadrature are of one
degree less in h than the corresponding rules for simple quadrature. We could derive the error rules for
the composite rectangle, midpoint, and trapezoid rules, but it turns out that composite Simpson's rule is the
only anomalous rule. Care is needed because there are n/2 subintervals each of width 2h rather than n
subintervals each of width h. For every other composite rule, we simply multiply the simple rule by n and
replace nh by b − a. Without further ado, we summarize the error rules for composite quadrature below.
\[ |E_L(h)| \le \frac{(b - a)\,M_1 h}{2}, \quad |E_R(h)| \le \frac{(b - a)\,M_1 h}{2}, \quad |E_T(h)| \le \frac{(b - a)\,M_2 h^2}{12}, \quad |E_M(h)| \le \frac{(b - a)\,M_2 h^2}{24}, \quad |E_S(h)| \le \frac{(b - a)\,M_4 h^4}{180} \tag{293} \]
We end this chapter with some remarks about the significance of the error rules above.

REMARKS:
1. L(h) and R(h) integrate constant functions f exactly, because f'(x) (and hence M_1) for a constant
function is zero.
2. M(h) and T(h) integrate linear functions f exactly (for different reasons), because in each case
f''(x) (and hence M_2) for a linear function is zero.
3. S(h), on the other hand, integrates quadratic and cubic functions f exactly.

There is a mathematical surprise in these remarks. Can you spot it? We expect the trapezoid rule to
integrate linear functions exactly, because it is based on linear interpolation. Simpson's rule is based
on quadratic interpolation, and so we would expect it to integrate quadratic functions exactly. But wait!
Simpson's rule integrates quadratic and cubic functions exactly, which is unexpected. It happens because
of a fortuitous cancellation of error, in the same way that the midpoint rule gives better results than one would
expect from zero-order interpolation. In sum, Simpson's rule is much better than one would anticipate.
The global error diminishes with h as h⁴; hence, very good results can be obtained even with moderately
small n (moderately large h).
We remind the reader of several things before closing. First, ALG 9.1 covers a multitude of sins. It can
be adapted to each and every one of the quadrature rules with the appropriate choice of the weight vector ~w and
scaling constant c. Second, don't forget that S(h) and M(2h) each require an even number of subintervals.
If you try to apply Simpson's rule, for example, to a problem with an odd number of subintervals, all the
nice error properties are destroyed.

HW: Make a table of errors for M(2h), T(h), and S(h) for the function f(x) = 1/x on [a, b] = [1, 3], with
n = 2, 4, 8, and 16. Verify that the errors diminish as O(h²), O(h²), and O(h⁴), respectively.
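A minimal sketch (our own code) that produces the requested error table, forming the weighted sums directly and obtaining Simpson's rule from the relation S(h) = [2T(h) + M(2h)]/3 of Eq. 274:

program quadrature_error_table
   implicit none
   integer :: n, i, k
   real(kind=8) :: a, b, h, t, s, m, exact
   real(kind=8), allocatable :: f(:)

   a = 1.0d0;  b = 3.0d0
   exact = log(3.0d0)
   print '(a)', '    n      e_M(2h)       e_T(h)        e_S(h)'
   do k = 1, 4
      n = 2**k                                  ! n = 2, 4, 8, 16 (even, as S and M(2h) require)
      h = (b - a) / real(n, kind=8)
      allocate(f(0:n))
      do i = 0, n
         f(i) = 1.0d0 / (a + real(i, kind=8) * h)
      end do
      t = 0.5d0 * h * (f(0) + 2.0d0*sum(f(1:n-1)) + f(n))   ! trapezoid T(h)
      m = 2.0d0 * h * sum(f(1:n-1:2))                       ! midpoint M(2h): odd-indexed nodes
      s = (2.0d0 * t + m) / 3.0d0                           ! Simpson S(h)
      print '(i5, 3es14.4)', n, exact - m, exact - t, exact - s
      deallocate(f)
   end do
end program quadrature_error_table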
9.4
Exercises
1. Follow the procedure that culminated in w_0 = h/3 in Eq. 270 to show that w_1 = 4h/3.
2. Approximate ∫_1^3 (1/x) dx.
3. Derive Eq. 274 from the composite rules T (h) and M(2h).
4. The following identity holds true for any continuous function f(x) defined on [0, 1000]:
\[ \int_0^1 f(x)\,dx = \int_0^{1000} f(x)\,dx - \int_1^{1000} f(x)\,dx. \]
One way to approximate the integral on the left-hand side would be to compute approximations via
Simpson's Rule to the two integrals on the right-hand side and then subtract. Give two reasons why
this is probably a bad algorithm for approximating the integral on the left-hand side.
5. (a) Approximate ∫_2^3 sin(x) dx using the Trapezoid Rule with n = 4 subdivisions.
   (b) What is the absolute error in your Trapezoid approximation in part (a)?
   (c) What is the best error bound that can be achieved using the error bound associated with the
       Trapezoid Rule for ∫_2^3 sin(x) dx with n = 4 subdivisions?
   (d) Using the midpoint rule's error bound, how many subdivisions would you need to guarantee that
       the midpoint rule would approximate this integral with an absolute error of less than 10⁻⁶?
6. (a) Approximate ∫_2^3 ln(x) dx using Simpson's Rule with n = 2 subdivisions.
   (b) What is the absolute error in your Simpson's rule approximation in part (a)? (Hint: use integration by parts.)
   (c) What is the best error bound that can be achieved using the error bound associated with
       Simpson's Rule for ∫_2^3 ln(x) dx with n = 2 subdivisions?
   (d) Using the midpoint rule's error bound, how many subdivisions would you need to guarantee that
       the midpoint rule would approximate this integral with an absolute error of less than 10⁻⁶?